xiaobijuan2026 opened a new issue, #61916: URL: https://github.com/apache/doris/issues/61916
### Problem

When `INSERT INTO ... SELECT ...` writes to multiple replicas and one node channel is slow enough to time out during `close_wait`, that channel is cancelled but NOT marked as failed. This causes:

1. `close_wait` returns OK even though a node was cancelled.
2. FE is unaware of the failure and commits the transaction.
3. The PUBLISH_VERSION task is sent to ALL nodes, including the cancelled one.
4. The cancelled node can't find the rowset, so publish fails.
5. Data stays COMMITTED but not VISIBLE for a long time (30+ minutes until retry).

### Root Cause

In `IndexChannel::close_wait()` (vtablet_writer.cpp), when unfinished node channels are cancelled due to timeout, `mark_as_failed()` is not called, so FE receives no error tablet info for the cancelled replicas.

### Fix

After cancelling unfinished node channels on `close_wait` timeout:

1. Call `mark_as_failed()` to record the failed tablets.
2. Call `check_intolerable_failure()`; if failures exceed tolerance, fail the entire load.
3. Call `set_error_tablet_in_state()` to propagate error info to FE.

This allows:

- FE to skip failed replicas during PUBLISH_VERSION.
- Data to become visible immediately on healthy replicas.
- The background TabletScheduler to auto-repair the failed replica.

### Behavior after fix

| Scenario | Replicas | Result |
|----------|----------|--------|
| 3 replicas, 1 timeout | 2/3 success | ✅ Publish succeeds, failed replica auto-repairs |
| 3 replicas, 2 timeout | 1/3 success | ❌ Load fails, user gets error |
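
The three fix steps can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the actual Doris code: `NodeChannelModel`, `IndexChannelModel`, `CloseResult`, and `close_wait_on_timeout()` are hypothetical names, the helper signatures are invented for illustration, and the tolerance rule (a strict majority of replicas must succeed) is an assumption chosen to match the table above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified model of the flow described in this issue.
// These are NOT the real Doris classes or signatures.
struct NodeChannelModel {
    int64_t node_id;
    std::vector<int64_t> tablet_ids;  // tablets with a replica on this node
    bool finished;                    // node acknowledged close before the timeout
    bool cancelled;                   // set when close_wait gives up on the node
};

struct CloseResult {
    bool ok;            // false means the whole load fails
    std::string error;  // empty when ok
    // tablet id -> ids of nodes whose replica failed; stands in for the
    // error tablet info that would be reported to FE
    std::map<int64_t, std::set<int64_t>> error_tablets;
};

class IndexChannelModel {
public:
    IndexChannelModel(int num_replicas, std::vector<NodeChannelModel> channels)
        : _num_replicas(num_replicas), _channels(std::move(channels)) {
        assert(_num_replicas > 0);
    }

    // Models what happens once the close_wait deadline has passed.
    CloseResult close_wait_on_timeout() {
        for (auto& ch : _channels) {
            if (ch.finished) continue;
            ch.cancelled = true;
            mark_as_failed(ch);  // step 1: the call the issue reports as missing
        }
        // Step 3: always hand the failed replicas back to the caller.
        CloseResult result{true, "", _failed_nodes_by_tablet};
        // Step 2: fail the load if any tablet lost too many replicas.
        if (has_intolerable_failure()) {
            result.ok = false;
            result.error = "too many replicas failed during close_wait";
        }
        return result;
    }

private:
    void mark_as_failed(const NodeChannelModel& ch) {
        for (int64_t tablet_id : ch.tablet_ids) {
            _failed_nodes_by_tablet[tablet_id].insert(ch.node_id);
        }
    }

    // Assumption: a tablet is acceptable only if a strict majority of its
    // replicas succeeded (2 of 3 passes, 1 of 3 does not).
    bool has_intolerable_failure() const {
        for (const auto& entry : _failed_nodes_by_tablet) {
            const int failed = static_cast<int>(entry.second.size());
            const int succeeded = _num_replicas - failed;
            if (succeeded * 2 <= _num_replicas) return true;
        }
        return false;
    }

    int _num_replicas;
    std::vector<NodeChannelModel> _channels;
    std::map<int64_t, std::set<int64_t>> _failed_nodes_by_tablet;
};
```

In this model, three channels for one tablet with a single unfinished node yield `ok == true` plus an `error_tablets` entry for that node, while two unfinished nodes yield `ok == false`, mirroring the two table rows.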
