Re: [PR] fix(pd): add timeout and null-safety to getLeaderGrpcAddress() [hugegraph]

via GitHub Mon, 09 Mar 2026 03:59:09 -0700


imbajin commented on code in PR #2961:
URL: https://github.com/apache/hugegraph/pull/2961#discussion_r2904690778



##########
hugegraph-pd/hg-pd-core/src/main/java/org/apache/hugegraph/pd/raft/RaftEngine.java:
##########
@@ -239,8 +241,20 @@ public String getLeaderGrpcAddress() throws 
ExecutionException, InterruptedExcep
             waitingForLeader(10000);
         }
 
-        return 
raftRpcClient.getGrpcAddress(raftNode.getLeaderId().getEndpoint().toString()).get()
-                            .getGrpcAddress();
+        try {
+            RaftRpcProcessor.GetMemberResponse response = raftRpcClient
+                    
.getGrpcAddress(raftNode.getLeaderId().getEndpoint().toString())
+                    .get(config.getRpcTimeout(), TimeUnit.MILLISECONDS);
+            if (response != null && response.getGrpcAddress() != null) {
+                return response.getGrpcAddress();
+            }
+        } catch (TimeoutException | ExecutionException e) {
+            log.warn("Failed to get leader gRPC address via RPC, falling back 
to endpoint derivation", e);
+        }
+
+        // Fallback: derive from raft endpoint IP + local gRPC port (best 
effort)
+        String leaderIp = raftNode.getLeaderId().getEndpoint().getIp();
+        return leaderIp + ":" + config.getGrpcPort();

Review Comment:
   ‼️ This fallback is still incorrect for clusters where PD nodes use 
different `grpc.port` values. In this repo's own multi-node test configs, 
`application-server1.yml`, `application-server2.yml`, and 
`application-server3.yml` advertise `8686`, `8687`, and `8688` respectively, so 
a follower on `8687` will redirect to `leader-ip:8687` even when the elected 
leader is actually listening on `8686` or `8688`. That turns the original NPE 
into a silent misroute.
   
   If we can't recover the leader's advertised gRPC endpoint here, I think it's 
safer to fail fast than to synthesize an address from the local port, for 
example:
   
   ```suggestion
           } catch (TimeoutException | ExecutionException e) {
               throw new ExecutionException(
                       String.format("Failed to resolve leader gRPC address for 
%s", raftNode.getLeaderId()),
                       e);
           }
   ```
   
   A more complete fix would need a source of truth for the leader's actual 
`grpcAddress`, not the local node's port.



##########
hugegraph-pd/hg-pd-core/src/main/java/org/apache/hugegraph/pd/raft/RaftEngine.java:
##########
@@ -239,8 +241,20 @@ public String getLeaderGrpcAddress() throws 
ExecutionException, InterruptedExcep
             waitingForLeader(10000);
         }
 
-        return 
raftRpcClient.getGrpcAddress(raftNode.getLeaderId().getEndpoint().toString()).get()
-                            .getGrpcAddress();
+        try {
+            RaftRpcProcessor.GetMemberResponse response = raftRpcClient
+                    
.getGrpcAddress(raftNode.getLeaderId().getEndpoint().toString())

Review Comment:
   ⚠️ `waitingForLeader(10000)` does not guarantee that `leaderId` is non-null; 
it just waits up to the timeout and returns whatever it has. That means both 
the RPC call and this fallback path can still dereference 
`raftNode.getLeaderId()` and reintroduce an NPE during leader-election gaps.
   
   Could we cache the leader once after waiting and turn the "leader still 
unknown" case into a controlled exception instead?
   
   ```suggestion
           PeerId leader = raftNode.getLeaderId();
           if (leader == null) {
               throw new ExecutionException(new IllegalStateException("Leader 
is not ready"));
           }
   ```
   
   Then reuse `leader` for both the RPC request and any follow-up handling.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] fix(pd): add timeout and null-safety to getLeaderGrpcAddress() [hugegraph]

Reply via email to