SiyaoIsHiding commented on PR #462:
URL: 
https://github.com/apache/cassandra-nodejs-driver/pull/462#issuecomment-4549538403

   
[This](https://github.com/apache/cassandra-nodejs-driver/actions/runs/26197967397/job/77124427064?pr=462)
 test failure is new and it concerns me.
   
   ```
     1) ControlConnection
          #init()
            should subscribe to SCHEMA_CHANGE events and refresh keyspace 
information:
        Error: Condition still false after 100 attempts: () => 
cc.metadata.keyspaces['sample_change_1'].strategyOptions.replication_factor === 
'2'
         at whilstItem (test/test-helper.js:769:23)
         at Timeout.next [as _onTimeout] (lib/utils.js:1042:5)
         at listOnTimeout (node:internal/timers:585:17)
         at process.processTimers (node:internal/timers:521:7)
   ```
   After some investigation, I think the problem is that a schema refresh 
triggered by an event can be permanently lost if the CC's connection is down at 
that moment and cannot establish new connection within 2 seconds. And this 
singleton refresh implementation makes it more vulnerable than before.
   
   **What might have happened:**
   
   In CI, a brief TCP hiccup (or Cassandra-side connection reset) closed the 
CC's socket during the 100ms debounce window after the ALTER KEYSPACE event:
   
   1. ALTER KEYSPACE fires → SCHEMA_CHANGE event → EventDebouncer starts 100ms 
timer
   2. TCP socket closes → `socketClose` →` _refresh()` acquires 
`_refreshInProgress`, sets `this.connection = null`
   3. Debouncer fires → `metadata.refreshKeyspace` → `cc.query()` → `connection 
=== null` →` _waitForReconnection()` (2s timeout)
   4. `_refresh()` fails or does not establish new connection within 2 seconds 
-> `_waitForReconnection` rejects
   5. Error is swallowed by toBackground() — the ALTER KEYSPACE schema update 
is silently dropped
   6. CC eventually reconnects but no new SCHEMA_CHANGE event arrives → 
keyspace stays at replication_factor=3 → poll times out
   
   In the past, when we accidentally allowed concurrent `_refresh()` calls, If 
one concurrent attempt succeeded while another failed, the 
`newConnection(null)` event would resolve the pending `_waitForReconnection` 
before the error could reject it. The new singleton approach eliminates that 
accidental rescue path.
   
   **Fix**
   
   Instead of allowing concurrent refresh, I think we should fix the problem of 
schema refresh errors being swallowed. For example, in `control-connection.js` 
`_nodeSchemaChangeHandler`
   ```js
     // Instead of: toBackground(this.handleSchemaChange(event, false))
     this.handleSchemaChange(event, false).catch(() => {
       // CC will reconnect; re-queue this event once it does
       this.once('newConnection', (err) => {
         if (!err) toBackground(this.handleSchemaChange(event, false));
       });
     });
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to