imbajin commented on code in PR #2961:
URL: https://github.com/apache/hugegraph/pull/2961#discussion_r2904690778
##########
hugegraph-pd/hg-pd-core/src/main/java/org/apache/hugegraph/pd/raft/RaftEngine.java:
##########
@@ -239,8 +241,20 @@ public String getLeaderGrpcAddress() throws
ExecutionException, InterruptedExcep
waitingForLeader(10000);
}
- return
raftRpcClient.getGrpcAddress(raftNode.getLeaderId().getEndpoint().toString()).get()
- .getGrpcAddress();
+ try {
+ RaftRpcProcessor.GetMemberResponse response = raftRpcClient
+
.getGrpcAddress(raftNode.getLeaderId().getEndpoint().toString())
+ .get(config.getRpcTimeout(), TimeUnit.MILLISECONDS);
+ if (response != null && response.getGrpcAddress() != null) {
+ return response.getGrpcAddress();
+ }
+ } catch (TimeoutException | ExecutionException e) {
+ log.warn("Failed to get leader gRPC address via RPC, falling back
to endpoint derivation", e);
+ }
+
+ // Fallback: derive from raft endpoint IP + local gRPC port (best
effort)
+ String leaderIp = raftNode.getLeaderId().getEndpoint().getIp();
+ return leaderIp + ":" + config.getGrpcPort();
Review Comment:
‼️ This fallback is still incorrect for clusters where PD nodes use
different `grpc.port` values. In this repo's own multi-node test configs,
`application-server1.yml`, `application-server2.yml`, and
`application-server3.yml` advertise `8686`, `8687`, and `8688` respectively, so
a follower on `8687` will redirect to `leader-ip:8687` even when the elected
leader is actually listening on `8686` or `8688`. That turns the original NPE
into a silent misroute.
If we can't recover the leader's advertised gRPC endpoint here, I think it's
safer to fail fast than to synthesize an address from the local port, for
example:
```suggestion
} catch (TimeoutException | ExecutionException e) {
throw new ExecutionException(
String.format("Failed to resolve leader gRPC address for
%s", raftNode.getLeaderId()),
e);
}
```
A more complete fix would need a source of truth for the leader's actual
`grpcAddress`, not the local node's port.
##########
hugegraph-pd/hg-pd-core/src/main/java/org/apache/hugegraph/pd/raft/RaftEngine.java:
##########
@@ -239,8 +241,20 @@ public String getLeaderGrpcAddress() throws
ExecutionException, InterruptedExcep
waitingForLeader(10000);
}
- return
raftRpcClient.getGrpcAddress(raftNode.getLeaderId().getEndpoint().toString()).get()
- .getGrpcAddress();
+ try {
+ RaftRpcProcessor.GetMemberResponse response = raftRpcClient
+
.getGrpcAddress(raftNode.getLeaderId().getEndpoint().toString())
Review Comment:
⚠️ `waitingForLeader(10000)` does not guarantee that `leaderId` is non-null;
it just waits up to the timeout and returns whatever it has. That means both
the RPC call and this fallback path can still dereference
`raftNode.getLeaderId()` and reintroduce an NPE during leader-election gaps.
Could we cache the leader once after waiting and turn the "leader still
unknown" case into a controlled exception instead?
```suggestion
PeerId leader = raftNode.getLeaderId();
if (leader == null) {
throw new ExecutionException(new IllegalStateException("Leader
is not ready"));
}
```
Then reuse `leader` for both the RPC request and any follow-up handling.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]