xiaobijuan2026 opened a new pull request, #61917: URL: https://github.com/apache/doris/pull/61917
Fixes #61916

## Summary

When `close_wait` times out waiting for unfinished node channels during `INSERT INTO SELECT`, the cancelled channels were not marked as failed. This caused:

1. FE is unaware of the failure and commits the transaction.
2. `PUBLISH_VERSION` is sent to all nodes, including the cancelled one.
3. The cancelled node cannot find the rowset, so publish fails.
4. Data stays COMMITTED but not VISIBLE for a long time (30+ minutes until retry).

## Root Cause

In `IndexChannel::close_wait()`, when unfinished node channels are cancelled due to timeout, `mark_as_failed()` was not called. As a result, FE received no error tablet info for the cancelled replicas.

## Fix

After cancelling unfinished node channels on `close_wait` timeout:

- Call `mark_as_failed()` to record the failed tablets.
- Call `check_intolerable_failure()`: if failures exceed tolerance, fail the entire load.
- Call `set_error_tablet_in_state()` to propagate error info to FE.

## Behavior after fix

| Scenario | Successful replicas | Result |
|----------|---------------------|--------|
| 3 replicas, 1 timeout | 2/3 | ✅ Publish succeeds, failed replica auto-repairs |
| 3 replicas, 2 timeouts | 1/3 | ❌ Load fails, user gets an error and can retry |

## Test plan

- [x] Verified on production cluster (3 BE nodes, HDD, high-concurrency `INSERT INTO SELECT`)
- [ ] Add unit test for the `close_wait` timeout + `mark_as_failed` scenario
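
For reviewers, the intended control flow after a `close_wait` timeout can be sketched roughly as below. This is a simplified, standalone model and not the actual Doris code: `NodeChannelSketch`, `CloseWaitResult`, and `handle_close_wait_timeout` are hypothetical names, and the tolerance rule (a tablet tolerates failures as long as a majority of its replicas succeed) is an assumption chosen to match the behavior table in this PR. The real `mark_as_failed()`, `check_intolerable_failure()`, and `set_error_tablet_in_state()` have different signatures and more state.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in for a node channel: one BE node and the tablets
// for which it holds a replica.
struct NodeChannelSketch {
    int64_t node_id;
    bool finished;                    // false => still unfinished at timeout
    std::vector<int64_t> tablet_ids;
};

// Hypothetical summary of what the sink would report after the timeout.
struct CloseWaitResult {
    bool load_ok;                     // false => the whole load must fail
    std::set<int64_t> failed_nodes;   // nodes cancelled and marked as failed
    std::set<int64_t> error_tablets;  // tablets to report to FE as errored
};

CloseWaitResult handle_close_wait_timeout(
        const std::vector<NodeChannelSketch>& channels, int num_replicas) {
    CloseWaitResult result{true, {}, {}};
    std::map<int64_t, int> failed_replicas_per_tablet;

    for (const auto& ch : channels) {
        if (ch.finished) {
            continue;
        }
        // The channel is cancelled on timeout. The step that was missing
        // before the fix: also record it as failed (cf. mark_as_failed()).
        result.failed_nodes.insert(ch.node_id);
        for (int64_t tablet_id : ch.tablet_ids) {
            ++failed_replicas_per_tablet[tablet_id];
            // cf. set_error_tablet_in_state(): make the failure visible to FE.
            result.error_tablets.insert(tablet_id);
        }
    }

    // cf. check_intolerable_failure(). Assumption: majority quorum, so with
    // 3 replicas at most 1 failed replica per tablet is tolerated.
    const int quorum = num_replicas / 2 + 1;
    const int max_tolerated = num_replicas - quorum;
    for (const auto& entry : failed_replicas_per_tablet) {
        if (entry.second > max_tolerated) {
            result.load_ok = false;
        }
    }
    return result;
}
```

Under these assumptions, three channels for one tablet with a single unfinished channel yield `load_ok == true` with that tablet listed in `error_tablets`, while two unfinished channels yield `load_ok == false`, matching the two rows of the table.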
