randomkang opened a new issue, #3202: URL: https://github.com/apache/brpc/issues/3202
**Describe the bug**

After applying https://github.com/apache/brpc/pull/3145, I still get the error. The error details are as follows:

-------------------------------------------------log start----------------------------------------------------------------
[ps-57] [ UNKNOWN][UNKNOWN][W][2026-01-26 09:05:15.372264][169925][rdma_endpoint.cpp:895] Fail to ibv_post_send: Cannot allocate memory, window=1, sq_current=16
[ps-57] [ UNKNOWN][UNKNOWN][W][2026-01-26 09:05:15.372387][169925][socket.cpp:1841] Fail to keep-write into Socket{id=861 fd=787 addr=10.39.61.118:55530:19336} (0x7f152db0ec40): Cannot allocate memory [12]
[ps-57] [ UNKNOWN][UNKNOWN][W][2026-01-26 09:05:15.494266][170049][rdma_endpoint.cpp:578] Fail to read Hello Message from client:brpc::Socket{id=1055 fd=787 addr=10.39.61.118:38306:19336} (0x7f15035dd7c0) 10.39.61.118:38306: Unknown error 1014 [1014]
[ps-57] [ UNKNOWN][UNKNOWN][F][2026-01-26 09:05:16.457825][169792][ps_server.cc:265] Fail to pull with request_id=198536955979 WK=12 cache_id=0 global_cache_id=18860592829102089 retry_count=1
[ps-57] *** Check failure stack trace: ***
[ps-57] [ UNKNOWN][UNKNOWN][I][2026-01-26 09:05:18.885514][170075][block_pool.cpp:199] Start extend rdma memory 1024MB
[ps-57] MiniDump path: /tmp/fbb13d5d-6993-4de5-80903f9b-281ee912.dmp
------------------------------------------------log end-------------------------------------------------------------------

**To Reproduce**

The task I run is model training. It uses 11 CPU machines (500C2000G) and 7 GPU machines (380C2200G8GPU).

1) Every CPU machine runs 2 emb servers, one per NUMA node.
2) Every GPU machine has 8 GPUs, and each GPU has one dense server, one sparse server, and one worker.
3) Only the GPU machines can use RDMA. RDMA is used for communication between workers and the emb PS on GPU machines; GDR is used for communication between workers and the dense PS on GPU machines.
4) TCP is used for communication between workers and the emb PS on CPU machines.
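
For anyone trying to reproduce this, the deployment above can be tallied with a small sketch. The machine and GPU counts come straight from the description; the dictionary keys and the `process_counts` helper are illustrative names only, and the sketch assumes that the "sparse server" on a GPU machine is what item 3 calls the emb PS on GPU machines.

```python
# Rough tally of the processes in the reported deployment.
# Counts are taken from the "To Reproduce" description; all names are
# hypothetical and exist only for illustration.

CPU_MACHINES = 11
EMB_SERVERS_PER_CPU_MACHINE = 2   # one emb server per NUMA node
GPU_MACHINES = 7
GPUS_PER_GPU_MACHINE = 8          # one dense server, sparse server, worker per GPU


def process_counts():
    """Return the number of processes of each kind across the cluster."""
    total_gpus = GPU_MACHINES * GPUS_PER_GPU_MACHINE
    return {
        "cpu_emb_servers": CPU_MACHINES * EMB_SERVERS_PER_CPU_MACHINE,
        "gpu_workers": total_gpus,
        "gpu_dense_servers": total_gpus,
        "gpu_sparse_servers": total_gpus,
    }


# Transport used on each worker-to-server link, as described in items 3 and 4.
# Assumption: "gpu_sparse_server" is the emb PS that runs on GPU machines.
TRANSPORT = {
    ("worker", "gpu_sparse_server"): "RDMA",
    ("worker", "gpu_dense_server"): "GDR",
    ("worker", "cpu_emb_server"): "TCP",
}


if __name__ == "__main__":
    for name, count in process_counts().items():
        print(f"{name}: {count}")
    for (src, dst), transport in TRANSPORT.items():
        print(f"{src} -> {dst}: {transport}")
```

This only counts processes per role; it does not say how many RDMA connections each worker opens, which the report does not state.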
