FrankKanBear commented on PR #10:
URL: 
https://github.com/apache/ignite-nodejs-thin-client/pull/10#issuecomment-4726619144

   Thanks for the fixes — the per-message buffer isolation, the `hasMore()` 
peek, the `_getValues()` exhaustion guard, and the `Errors.ts` `.default` fix 
looks fine to me, and the earlier review feedback (the `_id: Long` typing and 
the zero-row `hasMore()` case) is fully addressed in the current revision.
   
   I do think there is **one blocker** that should be resolved before merge, 
plus a couple of smaller items.
   
   ## Blocker: the serialization queue can deadlock the connection
   
   Serializing `data`-event processing onto `_processingQueue` 
(`ClientSocket._connectSocket`) makes `_processResponse` hold the queue across 
its `await _finalizeResponse(...)`. The problem is that a `payloadReader` can 
issue a **nested request on the same socket and await it** — which can now 
never complete.
   
   Concretely, for a `cache.get` / `getAll` / `localPeek` of a value that is a 
`COMPLEX_OBJECT` whose binary type is **not already cached** in this client's 
`BinaryTypeStorage._types`:
   
   1. The value is read **inline** in the payloadReader (`CacheClient` 
get/getAll/`_localPeek`), which runs inside `await 
request.payloadReader(buffer)` in `_finalizeResponse` — i.e. inside the queued 
`_processResponse`.
   2. `readObject` → `_readComplexObject` → `BinaryObject._readHeader` → 
`BinaryTypeBuilder.fromTypeId` → `typeStorage.getType` (miss) → 
`_getBinaryType` → `communicator.send(GET_BINARY_TYPE)`.
   3. `Router.send` is called with no `affinityHint`, so it routes to 
`_getAllConnections()[0].sendRequest(...)` — **the same socket** in the default 
single-connection case — and awaits the reply.
   4. That `GET_BINARY_TYPE` reply arrives as a **future `data` event**, which 
is chained behind the still-pending `_processResponse` from step 1. That queue 
entry only completes once the reply is processed; the reply can only be 
processed by a later queue entry → circular wait → the socket stops parsing and 
`get()` never resolves. The `.catch` never fires, so there is no 
timeout/disconnect escape.
   
   The pre-PR `async (data) => { await this._processResponse(data); }` handler 
avoided this because Node does not serialize async listeners — the nested reply 
was processed by a concurrent re-entrant `_processResponse`. (That concurrency 
is exactly the `_buffer`/`_offset` race this PR set out to fix; the fix just 
needs to keep the race closed without serializing the *nested-request* path.)
   
   **Why the 103 tests don't catch it:** the suites `put`/`fromObject` the 
complex value before reading it, which registers the type in the same client's 
`_types` (`addType`), so `getType` is a cache hit and no nested request is 
issued. The bug surfaces for the normal production pattern of a read-only / 
freshly-started client reading complex objects it did not itself write.
   
   ### Minimal, deterministic repro (no cluster needed)
   
   This models the exact async topology (single socket, one response whose 
payloadReader awaits a nested same-socket request whose reply arrives on a 
later event):
   
   ```
   BASE       -> cache.get() of uncached complex object: RESOLVED
   PR #10     -> cache.get() of uncached complex object: DEADLOCK (hung)
   PROPOSED   -> cache.get() of uncached complex object: RESOLVED
   ```
   
   ### Suggested fix
   
   The PR already carves an **independent** `MessageBuffer` per message, so 
`_finalizeResponse` no longer needs to be inside the serialized section to be 
race-free — the queue only needs to protect the synchronous frame-splitting 
that touches `this._buffer`/`this._offset`. Dispatch `_finalizeResponse` 
**off** the queue (don't await it in the chain):
   
   ```ts
   if (isHandshake) {
       // single-in-flight, transitions _state, issues no nested request → 
await inline
       await this._finalizeHandshake(freshBuffer, request);
   }
   else {
       // Do NOT await: the payloadReader may issue a nested same-socket request
       // (e.g. GET_BINARY_TYPE for an uncached complex type) and await it; 
awaiting
       // here would deadlock the queue. freshBuffer is an independent copy, so
       // finalizing off the parse chain cannot corrupt _buffer/_offset.
       this._finalizeResponse(freshBuffer, request).catch(err => {
           this._error = err.message;
           this._disconnect();
       });
   }
   ```
   
   With finalize detached, the CONNECTED path of `_processResponse` has no 
`await`, so it runs to completion synchronously and two invocations still 
cannot interleave on `_buffer`/`_offset` — the race stays fixed, and nested 
replies can be parsed by subsequent queue entries. (In the repro above this is 
the `PROPOSED` case.)
   
   ## Description no longer matches the code
   
   The summary still says *"silently discard unsolicited server notification 
frames … Fixed by logging and discarding such frames,"* but the final commit 
(`Fail fast on response stream desync…`) **throws** on an unmatched request id 
instead. Throwing is actually the right call here — this client caps the 
protocol at ≤ 1.4.0 and exposes no notification/continuous-query operation, so 
an unmatched id genuinely means stream desync where fail-fast beats hanging. 
But two parts of the description are now inaccurate and worth fixing:
   - it describes discard/log, while the code throws (→ disconnect + reject all 
pending);
   - it states *"The original code had no handler for this case; the missing 
branch caused a crash"* — the base already threw `Invalid response id` on this 
exact path, so no branch was missing.
   
   ## Minor
   
   - **`hasMore()` is now a throwing method.** The peek calls `readInteger()` / 
`readBoolean()`, which throw via `_ensureSize` on a short/truncated page; and 
on that throw the `this._buffer.position = savedPosition` restore is skipped, 
leaving the buffer position advanced. Low impact (only on a malformed page), 
but a `try { … } finally { this._buffer.position = saved }` makes it total 
again and keeps the previous non-throwing contract.
   - **Bundled, undocumented refactors.** The `import Long = require('long')` 
change, the `PROTOCOL_VERSION_1_0_0` dead-const removal, and the `statusCode` 
local cleanup are all fine, but they aren't among the four advertised fixes — 
worth a line in the description so reviewers know the diff intentionally 
exceeds them. (Note: the `import Long` change is purely an emit cleanup — it 
removes a now-dead `const Long = require('long')` from the compiled `Cursor.js` 
since `Long` is used only as a type; the generated `Cursor.d.ts` is unchanged, 
`protected _id: Long` in both base and PR. It does not change type safety.)
   - **Per-page copy cost (informational).** 
`MessageBuffer.from(getSlice(...))` now copies each matched response's full 
payload (`Buffer.from`). Worth it for the correctness gain (it also de-aliases 
lazily-decoded `BinaryObject`s, not just cursors), but worth noting as a 
deliberate new allocation on the query hot path.
   - Pre-existing, not introduced here: `_logMessage(..., 
buffer.getSlice(this._offset - length, length))` passes `length` where 
`getSlice(start, end)` wants an end offset, so it logs empty bytes for any 2nd+ 
frame in a segment. Debug-only; trivially fixable now that `msgStart`/`msgEnd` 
are in scope (`getSlice(msgStart, msgEnd)`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to