sunce4t opened a new issue, #3102:
URL: https://github.com/apache/brpc/issues/3102

   **Is your feature request related to a problem?**
   
   
   **Describe the solution you'd like**
   
   When training on GPUs, I would like to use brpc's RDMA transport together with GDR (GPUDirect RDMA), so that the **RPC request** is written directly into a GPU buffer; the protocol portion is then copied off the GPU via cudaMemcpy or gdrcopy for parsing, reducing data transfer between the CPU and the GPU.
   
   **Describe alternatives you've considered**
   What I have tried:
   I replaced the BlockPool in block_pool.cpp with device memory allocated via CUDA, but when an IOBuf uses it, a placement new is performed on the allocated memory,
   
   
![Image](https://github.com/user-attachments/assets/956be465-714f-45e1-af0b-2f5a06337568)
   
   which makes the CPU access device memory directly and produces a core dump.
   
   Therefore, I have three ideas:
   1. Modify block_pool.cpp: add an ObjectPool that allocates the Block object itself, and point Block's data at the BlockPool, i.e. separate the Block object from its data region.
   On top of that, modify IOBuf's create block interface so that objmem_allocate allocates a Block while blockmem_allocate allocates a region of memory used to carry the RPC request.
   
   2. Add RDMA Write/Read interfaces: keep brpc's existing Send/Recv for control data (e.g. the peer's GPU address), then use RDMA Write to place the training data directly into the peer's GPU buffer.
   
   
   3. Use append_with_user_data to allocate the attachment from device memory, and rely on RDMA's scatter-gather capability at send time, so that the request itself lands in host memory while the attachment lands in device memory.
   However, this interface also performs placement new on the allocated memory, so the same core dump occurs, and the IOBuf interface would need changes as well.
   
   **Additional context/screenshots**
   I would like to ask whether any of these three approaches is feasible.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

