sunce4t opened a new issue, #3119:
URL: https://github.com/apache/brpc/issues/3119
**Describe the bug**
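Currently, when BRPC selects the GID of an RDMA device, it first calls ibv_query_port to obtain the size of the GID table (gid_tbl_len). It then walks the table in reverse with ibv_query_gid, from the highest index to the lowest, and picks the first usable GID.
The problem is that once a container enables VPC networking, the system dynamically adds new entries to the GID table of the host's RDMA device. When multiple containers are deployed on a single host, every container adds its entries to the same GID table.
In that case, if the user has not explicitly specified gid_index, BRPC's generic traversal scans the entire GID table. Because this logic has no awareness of container-level isolation, one container may wrongly select a GID that belongs to another container, which produces the following errors during initialization: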
```
W1017 15:55:46.241584 10770 4294969345 src/brpc/rdma/rdma_endpoint.cpp:1237 BringUpQp] Fail to modify QP from INIT to RTR: No such device
W1017 15:55:46.241602 10770 4294969345 src/brpc/rdma/rdma_endpoint.cpp:520 ProcessHandshakeAtClient] Fail to bringup QP, fallback to tcp:brpc::Socket
```
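The error is not limited to use inside a container: as long as at least one such container exists on the host, using BRPC on the host itself hits the same problem.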
**To Reproduce**
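On any host, start several containers that use VPC networking and run the test program under rdma_performance.
As shown below, after the first container is started, the GID index in use is 5: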
```
./client --servers=xxxxx:12000 --attachment_size=5000 --rdma_zerocopy_min_size=16 --rpc_timeout_ms=2000
I1017 15:52:13.284984 10684 0 src/brpc/rdma/rdma_helper.cpp:387 ReadRdmaDynamicLib] Successfully loaded libibverbs.so
I1017 15:52:13.291336 10684 0 src/brpc/rdma/rdma_helper.cpp:529 GlobalRdmaInitializeOrDieImpl] RDMA device: mlx5_0
I1017 15:52:13.291347 10684 0 src/brpc/rdma/rdma_helper.cpp:531 GlobalRdmaInitializeOrDieImpl] RDMA LID: 0
I1017 15:52:13.291438 10684 0 src/brpc/rdma/rdma_helper.cpp:536 GlobalRdmaInitializeOrDieImpl] RDMA GID Index: 5
I1017 15:52:13.292021 10684 0 src/brpc/rdma/block_pool.cpp:214 ExtendBlockPool] Start extend rdma memory 1024MB
I1017 15:52:14.061898 10684 0 src/brpc/server.cpp:1260 StartInternal] Server[DummyServerOf(./client)] is serving on port=8001.
[Threads: 1, Depth: 1, Attachment: 5000B, RDMA: yes, Echo: no]
Avg-Latency: 49, 90th-Latency: 52, 99th-Latency: 58, 99.9th-Latency: 63, Throughput: 94.1512MB/s, QPS: 19k, Server CPU-utilization: 53%, Client CPU-utilization: 21%
```
Start a second container on the same machine and do nothing inside it. Running the test program again in the first container now uses GID index 7:
```
./client --servers=xxxxx:12000 --attachment_size=5000 --rdma_zerocopy_min_size=16 --rpc_timeout_ms=2000
I1017 15:54:50.501079 10728 0 src/brpc/rdma/rdma_helper.cpp:387 ReadRdmaDynamicLib] Successfully loaded libibverbs.so
I1017 15:54:50.507048 10728 0 src/brpc/rdma/rdma_helper.cpp:529 GlobalRdmaInitializeOrDieImpl] RDMA device: mlx5_0
I1017 15:54:50.507060 10728 0 src/brpc/rdma/rdma_helper.cpp:531 GlobalRdmaInitializeOrDieImpl] RDMA LID: 0
I1017 15:54:50.507147 10728 0 src/brpc/rdma/rdma_helper.cpp:536 GlobalRdmaInitializeOrDieImpl] RDMA GID Index: 7
I1017 15:54:50.507764 10728 0 src/brpc/rdma/block_pool.cpp:214 ExtendBlockPool] Start extend rdma memory 1024MB
I1017 15:54:51.261788 10728 0 src/brpc/server.cpp:1260 StartInternal] Server[DummyServerOf(./client)] is serving on port=8001.
[Threads: 1, Depth: 1, Attachment: 5000B, RDMA: yes, Echo: no]
W1017 15:54:51.266231 10731 4294967297 src/brpc/rdma/rdma_endpoint.cpp:1237 BringUpQp] Fail to modify QP from INIT to RTR: No such device
W1017 15:54:51.266251 10731 4294967297 src/brpc/rdma/rdma_endpoint.cpp:520 ProcessHandshakeAtClient] Fail to bringup QP, fallback to tcp:brpc::Socket
```
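Our host has 4 GIDs by default, and each container start adds two more, so GID index 7 after the second container starts is as expected.
Start a third container on the same machine and do nothing inside it. Running the test program again in the first container now uses GID index 9: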
```
./client --servers=xxxxx:12000 --attachment_size=5000 --rdma_zerocopy_min_size=16 --rpc_timeout_ms=2000
I1017 15:55:45.475493 10759 0 src/brpc/rdma/rdma_helper.cpp:387 ReadRdmaDynamicLib] Successfully loaded libibverbs.so
I1017 15:55:45.481561 10759 0 src/brpc/rdma/rdma_helper.cpp:529 GlobalRdmaInitializeOrDieImpl] RDMA device: mlx5_0
I1017 15:55:45.481573 10759 0 src/brpc/rdma/rdma_helper.cpp:531 GlobalRdmaInitializeOrDieImpl] RDMA LID: 0
I1017 15:55:45.481661 10759 0 src/brpc/rdma/rdma_helper.cpp:536 GlobalRdmaInitializeOrDieImpl] RDMA GID Index: 9
I1017 15:55:45.482310 10759 0 src/brpc/rdma/block_pool.cpp:214 ExtendBlockPool] Start extend rdma memory 1024MB
I1017 15:55:46.237825 10759 0 src/brpc/server.cpp:1260 StartInternal] Server[DummyServerOf(./client)] is serving on port=8001.
[Threads: 1, Depth: 1, Attachment: 5000B, RDMA: yes, Echo: no]
W1017 15:55:46.241584 10770 4294969345 src/brpc/rdma/rdma_endpoint.cpp:1237 BringUpQp] Fail to modify QP from INIT to RTR: No such device
W1017 15:55:46.241602 10770 4294969345 src/brpc/rdma/rdma_endpoint.cpp:520 ProcessHandshakeAtClient] Fail to bringup QP, fallback to tcp:brpc::Socket
```
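The index progression across the three runs (5, 7, 9) can be reproduced with a small model of the reverse scan. This is only a sketch, not BRPC's code: `GidEntry`, its `owner` field, `BuildTable`, and `PickGidIndex` are hypothetical names introduced here, and the table layout (4 host entries plus 2 per container, appended in start order) is assumed from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of one GID table entry. `owner` is -1 for the host's
// own entries and the container number otherwise. The verbs API exposes no
// such field, which is why the real scan cannot tell entries apart.
struct GidEntry {
    uint64_t interface_id;
    int owner;
};

// Reverse scan as described in the issue: walk from the highest index down
// and return the first entry whose interface_id is non-zero, or -1 if none.
int PickGidIndex(const std::vector<GidEntry>& table) {
    for (int i = static_cast<int>(table.size()) - 1; i >= 0; --i) {
        if (table[i].interface_id != 0) {
            return i;
        }
    }
    return -1;
}

// Assumed layout: 4 host entries, then 2 entries appended per container.
// The interface_id values are arbitrary non-zero placeholders.
std::vector<GidEntry> BuildTable(int containers) {
    assert(containers >= 0);
    std::vector<GidEntry> table(4, GidEntry{0x1, -1});
    for (int c = 1; c <= containers; ++c) {
        table.push_back(GidEntry{0x100u + c, c});
        table.push_back(GidEntry{0x200u + c, c});
    }
    return table;
}
```

Under these assumptions, `PickGidIndex(BuildTable(n))` returns 5, 7, and 9 for n = 1, 2, 3, and from the second container onward the selected entry is owned by the most recently started container rather than the first one, which matches the runs that fall back to TCP.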
**Expected behavior**
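When selecting a GID, BRPC should exclude GIDs that do not belong to the current container. Whether an entry belongs to the current container can be checked by reading its sysfs file:
```
cat /sys/class/infiniband/{device_name}/ports/{port_num}/gids/{gid_index}
```
We added a patch that skips entries whose sysfs value reads as all zeros, which filters out GIDs that do not belong to the current container: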
```
diff --git a/src/brpc/rdma/rdma_helper.cpp b/src/brpc/rdma/rdma_helper.cpp
index 9bad3375..2106eaf2 100644
--- a/src/brpc/rdma/rdma_helper.cpp
+++ b/src/brpc/rdma/rdma_helper.cpp
@@ -21,6 +21,7 @@
#include <pthread.h>
#include <stdlib.h>
#include <vector>
+#include <fstream>
#include <gflags/gflags.h>
#include "butil/containers/flat_map.h" // butil::FlatMap
#include "butil/fd_guard.h"
@@ -216,6 +217,30 @@ static void FindRdmaLid() {
return;
}
+
+static int IsSelfGid(const std::string& device_name, int port_num, int gid_index) {
+ std::string path = "/sys/class/infiniband/" + device_name +
+ "/ports/" + std::to_string(port_num) +
+ "/gids/" + std::to_string(gid_index);
+
+ std::ifstream file(path);
+ if (!file.is_open()) {
+ return -1;
+ }
+
+ std::string line;
+ if (!std::getline(file, line)) {
+ return -2;
+ }
+
+ if (line == "0000:0000:0000:0000:0000:0000:0000:0000" ||
+ line == "::" ||
+ line == "0000:0000:0000:0000:0000:ffff:0000:0000" ) {
+ return 1;
+ }
+ return 0;
+}
+
static bool FindRdmaGid(ibv_context* context) {
bool found = false;
for (int i = g_gid_tbl_len - 1; i >= 0; --i) {
@@ -223,14 +248,23 @@ static bool FindRdmaGid(ibv_context* context) {
if (IbvQueryGid(context, g_port_num, i, &gid) != 0) {
continue;
}
+
if (gid.global.interface_id == 0) {
continue;
}
+
if (FLAGS_rdma_gid_index == i) {
g_gid = gid;
g_gid_index = i;
return true;
}
+
+ const char* device_name_cstr = IbvGetDeviceName(context->device);
+ std::string device_name(device_name_cstr);
+ if(IsSelfGid(device_name, g_port_num, i) != 0) {
+ continue;
+ }
+
```
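For readers who want to try the sysfs check outside BRPC, here is a minimal standalone sketch of the same idea. It is not the patch itself: `IsEmptyGidString`, `GidVisibleInSysfs`, and the `sysfs_root` parameter are hypothetical names added for illustration, and the helper returns a bool instead of the patch's integer codes. The assumption that a GID owned by another container reads as an all-zero value is taken from the patch and has not been verified independently.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// True when the sysfs text denotes an empty GID. The three spellings are
// the ones the patch compares against; other spellings are not handled.
bool IsEmptyGidString(const std::string& line) {
    return line == "0000:0000:0000:0000:0000:0000:0000:0000" ||
           line == "::" ||
           line == "0000:0000:0000:0000:0000:ffff:0000:0000";
}

// True when the GID file can be read and holds a non-empty GID.
// `sysfs_root` is a hypothetical parameter so the function can be pointed
// at a fake directory tree; on a real system it would be
// "/sys/class/infiniband".
bool GidVisibleInSysfs(const std::string& sysfs_root,
                       const std::string& device_name,
                       int port_num, int gid_index) {
    const std::string path = sysfs_root + "/" + device_name +
                             "/ports/" + std::to_string(port_num) +
                             "/gids/" + std::to_string(gid_index);
    std::ifstream file(path);
    std::string line;
    if (!file.is_open() || !std::getline(file, line)) {
        return false;  // Unreadable entries are treated as unusable.
    }
    return !IsEmptyGidString(line);
}
```

A scan loop could call `GidVisibleInSysfs("/sys/class/infiniband", device_name, port_num, i)` and skip index `i` when it returns false. Treating read failures as "not usable" mirrors the patch, where any non-zero return from `IsSelfGid` causes the entry to be skipped.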
**Versions**
OS: not OS-specific
Compiler: not compiler-specific
brpc: commit id 7229c3608f8cb98b24a0a2e7f99bc01d357d9312
protobuf: not protobuf-specific
**Additional context/screenshots**
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]