FrankKanBear commented on PR #10:
URL:
https://github.com/apache/ignite-nodejs-thin-client/pull/10#issuecomment-4726619144
Thanks for the fixes — the per-message buffer isolation, the `hasMore()`
peek, the `_getValues()` exhaustion guard, and the `Errors.ts` `.default` fix
looks fine to me, and the earlier review feedback (the `_id: Long` typing and
the zero-row `hasMore()` case) is fully addressed in the current revision.
I do think there is **one blocker** that should be resolved before merge,
plus a couple of smaller items.
## Blocker: the serialization queue can deadlock the connection
Serializing `data`-event processing onto `_processingQueue`
(`ClientSocket._connectSocket`) makes `_processResponse` hold the queue across
its `await _finalizeResponse(...)`. The problem is that a `payloadReader` can
issue a **nested request on the same socket and await it** — which can now
never complete.
Concretely, for a `cache.get` / `getAll` / `localPeek` of a value that is a
`COMPLEX_OBJECT` whose binary type is **not already cached** in this client's
`BinaryTypeStorage._types`:
1. The value is read **inline** in the payloadReader (`CacheClient`
get/getAll/`_localPeek`), which runs inside `await
request.payloadReader(buffer)` in `_finalizeResponse` — i.e. inside the queued
`_processResponse`.
2. `readObject` → `_readComplexObject` → `BinaryObject._readHeader` →
`BinaryTypeBuilder.fromTypeId` → `typeStorage.getType` (miss) →
`_getBinaryType` → `communicator.send(GET_BINARY_TYPE)`.
3. `Router.send` is called with no `affinityHint`, so it routes to
`_getAllConnections()[0].sendRequest(...)` — **the same socket** in the default
single-connection case — and awaits the reply.
4. That `GET_BINARY_TYPE` reply arrives as a **future `data` event**, which
is chained behind the still-pending `_processResponse` from step 1. That queue
entry only completes once the reply is processed; the reply can only be
processed by a later queue entry → circular wait → the socket stops parsing and
`get()` never resolves. The `.catch` never fires, so there is no
timeout/disconnect escape.
The pre-PR `async (data) => { await this._processResponse(data); }` handler
avoided this because Node does not serialize async listeners — the nested reply
was processed by a concurrent re-entrant `_processResponse`. (That concurrency
is exactly the `_buffer`/`_offset` race this PR set out to fix; the fix just
needs to keep the race closed without serializing the *nested-request* path.)
**Why the 103 tests don't catch it:** the suites `put`/`fromObject` the
complex value before reading it, which registers the type in the same client's
`_types` (`addType`), so `getType` is a cache hit and no nested request is
issued. The bug surfaces for the normal production pattern of a read-only /
freshly-started client reading complex objects it did not itself write.
### Minimal, deterministic repro (no cluster needed)
This models the exact async topology (single socket, one response whose
payloadReader awaits a nested same-socket request whose reply arrives on a
later event):
```
BASE -> cache.get() of uncached complex object: RESOLVED
PR #10 -> cache.get() of uncached complex object: DEADLOCK (hung)
PROPOSED -> cache.get() of uncached complex object: RESOLVED
```
### Suggested fix
The PR already carves an **independent** `MessageBuffer` per message, so
`_finalizeResponse` no longer needs to be inside the serialized section to be
race-free — the queue only needs to protect the synchronous frame-splitting
that touches `this._buffer`/`this._offset`. Dispatch `_finalizeResponse`
**off** the queue (don't await it in the chain):
```ts
if (isHandshake) {
// single-in-flight, transitions _state, issues no nested request →
await inline
await this._finalizeHandshake(freshBuffer, request);
}
else {
// Do NOT await: the payloadReader may issue a nested same-socket request
// (e.g. GET_BINARY_TYPE for an uncached complex type) and await it;
awaiting
// here would deadlock the queue. freshBuffer is an independent copy, so
// finalizing off the parse chain cannot corrupt _buffer/_offset.
this._finalizeResponse(freshBuffer, request).catch(err => {
this._error = err.message;
this._disconnect();
});
}
```
With finalize detached, the CONNECTED path of `_processResponse` has no
`await`, so it runs to completion synchronously and two invocations still
cannot interleave on `_buffer`/`_offset` — the race stays fixed, and nested
replies can be parsed by subsequent queue entries. (In the repro above this is
the `PROPOSED` case.)
## Description no longer matches the code
The summary still says *"silently discard unsolicited server notification
frames … Fixed by logging and discarding such frames,"* but the final commit
(`Fail fast on response stream desync…`) **throws** on an unmatched request id
instead. Throwing is actually the right call here — this client caps the
protocol at ≤ 1.4.0 and exposes no notification/continuous-query operation, so
an unmatched id genuinely means stream desync where fail-fast beats hanging.
But two parts of the description are now inaccurate and worth fixing:
- it describes discard/log, while the code throws (→ disconnect + reject all
pending);
- it states *"The original code had no handler for this case; the missing
branch caused a crash"* — the base already threw `Invalid response id` on this
exact path, so no branch was missing.
## Minor
- **`hasMore()` is now a throwing method.** The peek calls `readInteger()` /
`readBoolean()`, which throw via `_ensureSize` on a short/truncated page; and
on that throw the `this._buffer.position = savedPosition` restore is skipped,
leaving the buffer position advanced. Low impact (only on a malformed page),
but a `try { … } finally { this._buffer.position = saved }` makes it total
again and keeps the previous non-throwing contract.
- **Bundled, undocumented refactors.** The `import Long = require('long')`
change, the `PROTOCOL_VERSION_1_0_0` dead-const removal, and the `statusCode`
local cleanup are all fine, but they aren't among the four advertised fixes —
worth a line in the description so reviewers know the diff intentionally
exceeds them. (Note: the `import Long` change is purely an emit cleanup — it
removes a now-dead `const Long = require('long')` from the compiled `Cursor.js`
since `Long` is used only as a type; the generated `Cursor.d.ts` is unchanged,
`protected _id: Long` in both base and PR. It does not change type safety.)
- **Per-page copy cost (informational).**
`MessageBuffer.from(getSlice(...))` now copies each matched response's full
payload (`Buffer.from`). Worth it for the correctness gain (it also de-aliases
lazily-decoded `BinaryObject`s, not just cursors), but worth noting as a
deliberate new allocation on the query hot path.
- Pre-existing, not introduced here: `_logMessage(...,
buffer.getSlice(this._offset - length, length))` passes `length` where
`getSlice(start, end)` wants an end offset, so it logs empty bytes for any 2nd+
frame in a segment. Debug-only; trivially fixable now that `msgStart`/`msgEnd`
are in scope (`getSlice(msgStart, msgEnd)`).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]