yashmayya commented on PR #18513:
URL: https://github.com/apache/pinot/pull/18513#issuecomment-4492234210
This is the pre-existing flake tracked in #18490, not something this PR
caused.

**Failure point.** `org.apache.pinot.compat.StreamOp.fetchExistingTotalDocs`
at line 281:
```java
if (response.has(EXCEPTIONS) && !response.get(EXCEPTIONS).isEmpty()) {
  ...
  JsonNode exceptions = response.get(EXCEPTIONS);   // ArrayNode
  JsonNode errorCode = exceptions.get(ERROR_CODE);  // ArrayNode.get(String) → null
  if (QueryErrorCode.BROKER_INSTANCE_MISSING.getId() == errorCode.asInt()) {  // NPE
```
`exceptions` is a JSON array (`[{"errorCode": ..., "message": ...}, ...]`),
but the code indexes it by a string key, so `errorCode` is always `null`
whenever the broker returns any non-empty `exceptions` array. That happens
during the post-upgrade/post-downgrade window where segments are still
bootstrapping and the broker emits errors like `305=N segments unavailable`
(visible right above the stack trace in the log). That code path is unreachable
in the steady state, so the rest of the suite passes — it only blows up when
the test hits the cluster mid-bootstrap.

The query that crashes is `SELECT count(*) FROM <table>` (StreamOp.java:268).
It contains no UNION, so `AggregateUnionTransposeRule` can't even match the plan.

**Why I'm confident it's unrelated to this PR:**
1. The same NPE at the same stack frame hits master itself in run
[26066589112](https://github.com/apache/pinot/actions/runs/26066589112) (a pure
`com.google.cloud:libraries-bom` dependency bump), with no MSE-planner changes
whatsoever.
2. Across the 5 attempts of this PR's compat check, the failures don't track
   this PR's code; they track the cluster's startup timing:
   - Attempt 1: against-master FAIL, against-release-1.5.0 SUCCESS
   - Attempt 2: both FAIL
   - Attempts 3–5: against-master SUCCESS, against-release-1.5.0 FAIL

   If the change in 332f5cf (hint propagation) were the cause, both
   baselines would fail deterministically, and they don't.
3. Issue #18490 (filed 2026-05-13) describes exactly this NPE, with the same
root cause and the same fix (`exceptions.get(0).get("errorCode")` plus retry on
`BROKER_SEGMENT_UNAVAILABLE`).
Happy to send a separate small PR fixing #18490 to stop this from blocking
unrelated PRs, but it's out of scope for this change.
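
For reference, here is a rough, dependency-free sketch of the shape of the fix described in #18490: read the error code from the first element of the `exceptions` array instead of indexing the array by a string key, and treat `BROKER_SEGMENT_UNAVAILABLE` as transient too. This is not the actual `StreamOp` code. The class and method names are hypothetical, a `List<Map<String, Object>>` stands in for the Jackson `ArrayNode`, the value 305 comes from the log line above, and 410 for `BROKER_INSTANCE_MISSING` is an assumed value. What the caller does with a transient result (retry, back off, and so on) is left out.

```java
import java.util.List;
import java.util.Map;

public class ExceptionsArrayCheck {
  // 305 is the code seen in the log ("305=N segments unavailable").
  static final int BROKER_SEGMENT_UNAVAILABLE = 305;
  // Assumed value for illustration only; real code would use QueryErrorCode.
  static final int BROKER_INSTANCE_MISSING = 410;
  static final String ERROR_CODE = "errorCode";

  /**
   * Returns true when the first broker exception looks transient.
   * The list models the JSON array [{"errorCode": ..., "message": ...}, ...].
   */
  static boolean isTransient(List<Map<String, Object>> exceptions) {
    if (exceptions == null || exceptions.isEmpty()) {
      return false;
    }
    // Index into the array first, then look up the field by name.
    Object raw = exceptions.get(0).get(ERROR_CODE);
    if (!(raw instanceof Number)) {
      return false; // missing or malformed error code: do not dereference it
    }
    int code = ((Number) raw).intValue();
    return code == BROKER_INSTANCE_MISSING || code == BROKER_SEGMENT_UNAVAILABLE;
  }

  public static void main(String[] args) {
    List<Map<String, Object>> midBootstrap = List.of(
        Map.<String, Object>of(ERROR_CODE, 305, "message", "N segments unavailable"));
    System.out.println(isTransient(midBootstrap));
    System.out.println(isTransient(List.of()));
  }
}
```

The null guard matters as much as the index fix: even with `get(0)`, an exception entry without an `errorCode` field would otherwise reproduce the same NPE.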
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]