mzhou-oai opened a new pull request, #3632:
URL: https://github.com/apache/celeborn/pull/3632

   ## Problem Statement
   A worker restart can leave an in-flight reducer holding a stale `streamId` 
even though the worker comes back on the same stable hostname and still has the 
shuffle data on disk. In that case the worker correctly reports:
   
   - `Stream <id> is not registered with worker. This can happen if the worker 
was restart recently.`
   
   That is a restart-specific condition, not necessarily a hard worker failure. 
Live stream registrations are process-local and are not reconstructed by 
`recoverPath`, so the old `streamId` cannot be used after the worker process 
comes back.
   
   Before this change, `CelebornInputStream` treated that response like a 
generic fetch failure. It excluded the worker, consumed the normal retry budget 
and backoff, and took the usual peer-failover or retry path instead of 
reopening the stream on the same worker.
   
   ## Proposal
   This change makes stale-stream handling explicit in the client retry path:
   
   - detect the stale-stream signature in the exception chain
   - do not classify that specific failure as a critical fetch cause
   - do not exclude the restarted worker before retrying
   - recreate the reader on the same `PartitionLocation` with `pbStreamHandler 
= null` so the client issues a fresh `OPEN_STREAM`
   - reuse checkpoint metadata so already returned chunks are skipped on the 
reopened stream
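   
   The detection and classification steps above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the class name `StaleStreamSketch`, the `RetryAction` enum, and the choice of marker substring are hypothetical, and the real client may match the worker message differently.

```java
final class StaleStreamSketch {
    // Assumption: a stable substring of the worker message quoted in the problem statement.
    static final String STALE_STREAM_MARKER = "is not registered with worker";

    // Hypothetical decision type; the real client does not necessarily model it this way.
    enum RetryAction { REOPEN_SAME_WORKER, EXCLUDE_AND_FAILOVER }

    /** Returns true if any throwable in the cause chain carries the stale-stream marker. */
    static boolean isStaleStream(Throwable error) {
        Throwable current = error;
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; current != null && depth < 32; depth++) {
            String message = current.getMessage();
            if (message != null && message.contains(STALE_STREAM_MARKER)) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    /** Stale streams are retried on the same worker; everything else takes the usual path. */
    static RetryAction classify(Throwable error) {
        return isStaleStream(error)
            ? RetryAction.REOPEN_SAME_WORKER
            : RetryAction.EXCLUDE_AND_FAILOVER;
    }
}
```

   In this sketch, `REOPEN_SAME_WORKER` stands for the branch that skips worker exclusion and recreates the reader with `pbStreamHandler = null`; any other failure keeps the existing behavior.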
   
   This keeps the fix narrow and aligned with the current worker-client 
boundary. The worker already reports the right signal, and the client already 
has the machinery to reopen a reader and resume from checkpointed progress.
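   
   To make the resume-from-checkpoint idea concrete, here is a minimal sketch of how a reopened reader could skip chunks that were already returned. The PR does not describe the checkpoint format, so the `Checkpoint` class and `remainingChunks` helper below are hypothetical stand-ins rather than the client's actual data structures.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class CheckpointResumeSketch {
    // Hypothetical checkpoint: one bit per chunk already handed to the caller.
    static final class Checkpoint {
        private final BitSet returned = new BitSet();

        void markReturned(int chunkIndex) {
            returned.set(chunkIndex);
        }

        boolean isReturned(int chunkIndex) {
            return returned.get(chunkIndex);
        }
    }

    /** Chunk indices a reopened reader still has to fetch, in ascending order. */
    static List<Integer> remainingChunks(int numChunks, Checkpoint checkpoint) {
        List<Integer> remaining = new ArrayList<>();
        for (int i = 0; i < numChunks; i++) {
            if (!checkpoint.isReturned(i)) {
                remaining.add(i);
            }
        }
        return remaining;
    }
}
```

   Because the checkpoint lives on the client and outlives the `streamId`, a fresh `OPEN_STREAM` after the restart can reuse it unchanged.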
   
   ## Validation
   - `mvn -pl client -am -DskipTests compile`
   - `mvn -pl common -am -Dtest=ExceptionUtilsSuiteJ test`
   
   The targeted JUnit regression test, `ExceptionUtilsSuiteJ`, passes. On this 
macOS arm64 machine, the broader ScalaTest portion of 
`mvn -pl common -am -Dtest=ExceptionUtilsSuiteJ test` still reports two 
pre-existing `CelebornConfSuite` failures because the suite expects `[EPOLL]` 
while the local runtime reports `[KQUEUE]`. Those failures are unrelated to 
this change.

