gortiz commented on PR #18519:
URL: https://github.com/apache/pinot/pull/18519#issuecomment-4519341207
After the latest round of reviewer-requested fixes, I ran an internal strict
pre-push review across the cumulative diff. It surfaced three additional issues
that weren't on any review thread — fixed all three to avoid shipping them.
### Fixed
1. **`send(Eos)` vs `cancel` race throwing `IllegalStateException`** —
symmetric to the data-path race that `c39e12f1a4` closed, but on the *closing*
edge of the stream. Both paths pass their top-of-method `isTerminated()` check
before either sets `_senderSideClosed`. Cancel wins the `_readyLock`, calls
`onCompleted()`, releases. The EOS path then takes the lock and calls `onNext`
/ `onCompleted` on an already-half-closed stream → `IllegalStateException`
propagates out of `send(Eos)`. Cancel-path swallowed it defensively; EOS-path
didn't. Fixed in `840c5640ea` with a defensive `try/catch` around the EOS path
(matching the cancel-side pattern) plus a regression test
`concurrentSendEosAndCancelDoesNotThrow` that drives `send(EOS) + cancel()`
through a `CyclicBarrier(2)` for 200 iterations and asserts neither call throws.
2. **`close()` leaving the gRPC sender stream half-open** — `close()` pushes
an internal-error block via `processAndSend(...bypassReady=true)` but never
calls `onCompleted()`. The stream sits half-open from the sender side until the
per-call deadline fires — for multi-minute MSE queries that's exactly the
allocator pin this PR exists to fix. Also had a `_contentObserver != null`
check-then-act outside the lock (same shape `ensureContentObserverInitialized`
exists to prevent). Fixed in `aeaacc893d`: route through
`ensureContentObserverInitialized`, take `_readyLock`, set `_senderSideClosed =
true`, call `onCompleted()` with the same defensive `try/catch`.
3. **`wakeWaiters` from Netty event-loop locks `_readyLock`** —
`setOnReadyHandler` fires on a Netty channel/event-loop thread; `wakeWaiters`
then acquires `_readyLock`. If the sender thread is mid-`onNext` under that
lock, the event-loop thread briefly blocks. Intentional and the lock-held
sections are bounded by a single `onNext`/`onCompleted`, but the pattern is
non-obvious enough to surprise a maintainer debugging latency spikes.
Documented in `e367aa697f` so the next reader doesn't reach for a
`tryLock`+executor restructure without context.
### Considered and skipped
- `AtomicBoolean` CAS on `_senderSideClosed` to make the close transition
atomic across EOS / cancel / `close()`. Behavioral effect is now bounded by the
defensive try/catches; left as a separate concern.
- Documenting the precondition on `ensureContentObserverInitialized`
("callers must `isTerminated()`-check first") — pure docs.
- Bench drainer's `_stop.set(true)` before `_readSignal.release()` ordering
— benign at JVM shutdown.
Strict review run: `code-reviewer` agent with 8 domain-specialized
sub-reviewers in parallel (concurrency-state, performance, correctness-nulls,
testing, architecture, naming-api, process-scope, config-backcompat). The three
above were everything in the CRITICAL/MAJOR/MINOR-worth-fixing buckets.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]