SiyaoIsHiding commented on PR #462: URL: https://github.com/apache/cassandra-nodejs-driver/pull/462#issuecomment-4549538403
[This](https://github.com/apache/cassandra-nodejs-driver/actions/runs/26197967397/job/77124427064?pr=462) test failure is new and it concerns me. ``` 1) ControlConnection #init() should subscribe to SCHEMA_CHANGE events and refresh keyspace information: Error: Condition still false after 100 attempts: () => cc.metadata.keyspaces['sample_change_1'].strategyOptions.replication_factor === '2' at whilstItem (test/test-helper.js:769:23) at Timeout.next [as _onTimeout] (lib/utils.js:1042:5) at listOnTimeout (node:internal/timers:585:17) at process.processTimers (node:internal/timers:521:7) ``` After some investigation, I think the problem is that a schema refresh triggered by an event can be permanently lost if the CC's connection is down at that moment and cannot establish new connection within 2 seconds. And this singleton refresh implementation makes it more vulnerable than before. **What might have happened:** In CI, a brief TCP hiccup (or Cassandra-side connection reset) closed the CC's socket during the 100ms debounce window after the ALTER KEYSPACE event: 1. ALTER KEYSPACE fires → SCHEMA_CHANGE event → EventDebouncer starts 100ms timer 2. TCP socket closes → `socketClose` →` _refresh()` acquires `_refreshInProgress`, sets `this.connection = null` 3. Debouncer fires → `metadata.refreshKeyspace` → `cc.query()` → `connection === null` →` _waitForReconnection()` (2s timeout) 4. `_refresh()` fails or does not establish new connection within 2 seconds -> `_waitForReconnection` rejects 5. Error is swallowed by toBackground() — the ALTER KEYSPACE schema update is silently dropped 6. CC eventually reconnects but no new SCHEMA_CHANGE event arrives → keyspace stays at replication_factor=3 → poll times out In the past, when we accidentally allowed concurrent `_refresh()` calls, If one concurrent attempt succeeded while another failed, the `newConnection(null)` event would resolve the pending `_waitForReconnection` before the error could reject it. The new singleton approach eliminates that accidental rescue path. **Fix** Instead of allowing concurrent refresh, I think we should fix the problem of schema refresh errors being swallowed. For example, in `control-connection.js` `_nodeSchemaChangeHandler` ```js // Instead of: toBackground(this.handleSchemaChange(event, false)) this.handleSchemaChange(event, false).catch(() => { // CC will reconnect; re-queue this event once it does this.once('newConnection', (err) => { if (!err) toBackground(this.handleSchemaChange(event, false)); }); }); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
