xiaobijuan2026 opened a new pull request, #61917:
URL: https://github.com/apache/doris/pull/61917

   Fixes #61916
   
   ## Summary
   
   When `close_wait` times out waiting for unfinished node channels during `INSERT INTO SELECT`, it cancels those channels but does not mark them as failed. As a result:
   
   1. FE is unaware of the failure and commits the transaction.
   2. `PUBLISH_VERSION` is sent to all nodes, including the cancelled one.
   3. The cancelled node cannot find the rowset, so publish fails.
   4. Data stays COMMITTED but not VISIBLE for a long time (30+ minutes, until retry).
   
   ## Root Cause
   
   In `IndexChannel::close_wait()`, when unfinished node channels are cancelled due to timeout, `mark_as_failed()` is not called, so FE receives no error tablet info for the cancelled replicas.
   
   ## Fix
   
   After cancelling unfinished node channels in `close_wait` timeout:
   - Call `mark_as_failed()` to record failed tablets
   - Call `check_intolerable_failure()`; if failures exceed the tolerance, fail the entire load
   - Call `set_error_tablet_in_state()` to propagate error info to FE
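
The three calls above can be sketched as follows. This is a minimal, self-contained model and not the actual Doris code: `NodeChannel`, `ErrorTablet`, `IndexChannelSketch`, `on_close_wait_timeout`, and the `max_failed_replicas` parameter are hypothetical stand-ins, and the signatures of `mark_as_failed`, `check_intolerable_failure`, and `set_error_tablet_in_state` are simplified assumptions. Calling them in the order listed above is also an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for a node channel; not the real Doris class.
struct NodeChannel {
    int64_t node_id;
    std::vector<int64_t> tablet_ids;  // tablets with a replica on this node
    bool finished = false;            // true once the node acknowledged close
    bool cancelled = false;
};

// One (tablet, node) pair that would be reported to FE as failed.
struct ErrorTablet {
    int64_t tablet_id;
    int64_t node_id;
};

class IndexChannelSketch {
public:
    IndexChannelSketch(std::size_t max_failed_replicas, std::vector<NodeChannel> channels)
            : _max_failed_replicas(max_failed_replicas), _channels(std::move(channels)) {}

    // Step 1: record every tablet replica hosted on the cancelled node as failed.
    void mark_as_failed(const NodeChannel& ch) {
        for (int64_t tablet_id : ch.tablet_ids) {
            _failed_nodes[tablet_id].insert(ch.node_id);
        }
    }

    // Step 2: false if any tablet has more failed replicas than tolerated.
    bool check_intolerable_failure() const {
        for (const auto& entry : _failed_nodes) {
            if (entry.second.size() > _max_failed_replicas) return false;
        }
        return true;
    }

    // Step 3: copy the failed (tablet, node) pairs into the state sent to FE.
    void set_error_tablet_in_state(std::vector<ErrorTablet>* state) const {
        for (const auto& entry : _failed_nodes) {
            for (int64_t node_id : entry.second) {
                state->push_back(ErrorTablet{entry.first, node_id});
            }
        }
    }

    // Timeout branch of close_wait: cancel stragglers, then run steps 1 to 3.
    // Returns true if the load may still be committed.
    bool on_close_wait_timeout(std::vector<ErrorTablet>* state) {
        for (NodeChannel& ch : _channels) {
            if (ch.finished) continue;
            ch.cancelled = true;
            mark_as_failed(ch);
        }
        const bool tolerable = check_intolerable_failure();
        set_error_tablet_in_state(state);
        return tolerable;
    }

private:
    std::size_t _max_failed_replicas;
    std::vector<NodeChannel> _channels;
    std::map<int64_t, std::set<int64_t>> _failed_nodes;  // tablet_id -> failed node ids
};
```

In this model, a cancelled channel always leaves a trace in the reported error tablets, so the coordinator never sees a timed-out replica as healthy.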
   
   ## Behavior after fix
   
   | Scenario | Successful replicas | Result |
   |----------|---------------------|--------|
   | 3 replicas, 1 timeout | 2/3 | ✅ Publish succeeds, failed replica auto-repairs |
   | 3 replicas, 2 timeouts | 1/3 | ❌ Load fails, user gets error, can retry |
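
Both rows of the table are consistent with a majority rule, which the short sketch below works through. The rule itself is an assumption (the PR does not state the tolerance formula), and `majority`, `LoadOutcome`, and `outcome_after_timeouts` are hypothetical names used only for illustration.

```cpp
// Assumption: a load is tolerable while a strict majority of replicas succeed.
inline int majority(int num_replicas) {
    return num_replicas / 2 + 1;  // 3 replicas -> 2
}

enum class LoadOutcome { kPublishSucceeds, kLoadFails };

// Maps "N replicas, K timed out" to the result column of the table.
inline LoadOutcome outcome_after_timeouts(int num_replicas, int timed_out) {
    const int succeeded = num_replicas - timed_out;
    return succeeded >= majority(num_replicas) ? LoadOutcome::kPublishSucceeds
                                               : LoadOutcome::kLoadFails;
}
```

With 3 replicas the majority is 2, so one timeout leaves 2/3 and publish can proceed, while two timeouts leave 1/3 and the load fails.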
   
   ## Test plan
   - [x] Verified on production cluster (3 BE nodes, HDD, high-concurrency `INSERT INTO SELECT`)
   - [ ] Add unit test for close_wait timeout + mark_as_failed scenario
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

