randomkang opened a new issue, #3202:
URL: https://github.com/apache/brpc/issues/3202

   **Describe the bug**
   
   After applying https://github.com/apache/brpc/pull/3145, I still get the error. 
The error details are as follows:
   
   -------------------------------------------------log start----------------------------------------------------------------
   [ps-57] [        UNKNOWN][UNKNOWN][W][2026-01-26 09:05:15.372264][169925][rdma_endpoint.cpp:895] Fail to ibv_post_send: Cannot allocate memory, window=1, sq_current=16
   [ps-57] [        UNKNOWN][UNKNOWN][W][2026-01-26 09:05:15.372387][169925][socket.cpp:1841] Fail to keep-write into Socket{id=861 fd=787 addr=10.39.61.118:55530:19336} (0x7f152db0ec40): Cannot allocate memory [12]
   [ps-57] [        UNKNOWN][UNKNOWN][W][2026-01-26 09:05:15.494266][170049][rdma_endpoint.cpp:578] Fail to read Hello Message from client:brpc::Socket{id=1055 fd=787 addr=10.39.61.118:38306:19336} (0x7f15035dd7c0) 10.39.61.118:38306: Unknown error 1014 [1014]
   [ps-57] [        UNKNOWN][UNKNOWN][F][2026-01-26 09:05:16.457825][169792][ps_server.cc:265] Fail to pull with request_id=198536955979 WK=12 cache_id=0 global_cache_id=18860592829102089 retry_count=1
   [ps-57] *** Check failure stack trace: ***
   [ps-57] [        UNKNOWN][UNKNOWN][I][2026-01-26 09:05:18.885514][170075][block_pool.cpp:199] Start extend rdma memory 1024MB
   [ps-57] MiniDump path: /tmp/fbb13d5d-6993-4de5-80903f9b-281ee912.dmp
   ------------------------------------------------log end-------------------------------------------------------------------
   
   
   
   **To Reproduce**
   The task I am running is model training. It uses 11 CPU machines (500C2000G) and 7 GPU machines (380C2200G8GPU).
   1) Every CPU machine runs 2 emb servers, one per NUMA node.
   2) Every GPU machine has 8 GPUs, with one dense server, one sparse server, and one worker per GPU.
   3) Only GPU machines can use RDMA. RDMA is used for communication between workers and emb PS on GPU machines; GDR is used for communication between workers and dense PS on GPU machines.
   4) TCP is used for communication between workers and emb PS on CPU machines.
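
The transport rules in items 3 and 4 can be summarized in a small lookup. This is only an illustrative sketch: the enum and function names are hypothetical and are not part of brpc or of the training job's code. It keeps the report's own "emb PS" and "dense PS" wording, and returns an empty value for any combination the report does not describe.

```cpp
#include <optional>

// Hypothetical names for illustration only; none of these exist in brpc.
enum class Host { Cpu, Gpu };             // kind of machine the server runs on
enum class Server { Emb, Dense };         // "emb PS" / "dense PS" in the report
enum class Transport { Tcp, Rdma, Gdr };

// Transport a worker uses to reach a parameter server, following
// items 3 and 4 of the reproduction notes. Combinations that the
// report does not mention yield std::nullopt.
inline std::optional<Transport> select_transport(Server server, Host server_host) {
    if (server_host == Host::Gpu) {
        // Item 3: only GPU machines can use RDMA.
        if (server == Server::Emb) return Transport::Rdma;
        return Transport::Gdr;  // dense PS on a GPU machine
    }
    // Item 4: emb PS on CPU machines is reached over TCP.
    if (server == Server::Emb) return Transport::Tcp;
    return std::nullopt;        // dense PS on a CPU machine is not described
}
```

For example, `select_transport(Server::Emb, Host::Gpu)` yields `Transport::Rdma`, which is the path where the `ibv_post_send` failure in the log above is reported.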
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

