mzhou-oai opened a new pull request, #3632: URL: https://github.com/apache/celeborn/pull/3632
## Problem Statement

A worker restart can leave an in-flight reducer holding a stale `streamId` even though the worker comes back on the same stable hostname and still has the shuffle data on disk. In that case the worker correctly reports:

- `Stream <id> is not registered with worker. This can happen if the worker was restart recently.`

That is a restart-specific condition, not necessarily a hard worker failure. Live stream registrations are process-local and are not reconstructed by `recoverPath`, so the old `streamId` cannot be used after the worker process comes back.

Before this change, `CelebornInputStream` treated that response like a generic fetch failure. It excluded the worker, consumed normal retry budget and backoff, and entered the usual peer-failover or retry path instead of reopening the stream on the same worker.

## Proposal

This change makes stale-stream handling explicit in the client retry path:

- detect the stale-stream signature in the exception chain
- do not classify that specific failure as a critical fetch cause
- do not exclude the restarted worker before retrying
- recreate the reader on the same `PartitionLocation` with `pbStreamHandler = null` so the client issues a fresh `OPEN_STREAM`
- reuse checkpoint metadata so already returned chunks are skipped on the reopened stream

This keeps the fix narrow and aligned with the current worker-client boundary. The worker already reports the right signal, and the client already has the machinery to reopen a reader and resume from checkpointed progress.

## Validation

- `mvn -pl client -am -DskipTests compile`
- `mvn -pl common -am -Dtest=ExceptionUtilsSuiteJ test`

The targeted JUnit regression test passes. On this macOS arm64 machine, the broader ScalaTest portion of `mvn -pl common -am -Dtest=ExceptionUtilsSuiteJ test` still reports two pre-existing `CelebornConfSuite` failures because the suite expects `[EPOLL]` while the local runtime reports `[KQUEUE]`.
Those failures are unrelated to this change.
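
For readers who want a concrete picture of the first three Proposal bullets, the following is a minimal, self-contained sketch of how a client might detect the stale-stream signature in an exception chain and pick a retry action. It is not the code from this pull request: the class `StaleStreamSketch`, the enum `RetryAction`, the methods `isStaleStream` and `classify`, and the depth limit are all hypothetical names chosen for illustration. The only detail taken from the description is the text of the worker message.

```java
import java.io.IOException;

// Hypothetical illustration only; none of these names come from the Celeborn code base.
final class StaleStreamSketch {

    // Assumption: match on a substring of the worker message quoted in the PR description.
    static final String STALE_STREAM_MARKER = "is not registered with worker";

    // Assumption: a small bound so a cyclic cause chain cannot loop forever.
    static final int MAX_CAUSE_DEPTH = 32;

    enum RetryAction {
        REOPEN_ON_SAME_WORKER,   // drop the old stream handle, send a fresh open request
        EXCLUDE_AND_FAILOVER     // generic fetch failure path
    }

    // Walk the cause chain and look for the stale-stream message.
    static boolean isStaleStream(Throwable error) {
        Throwable current = error;
        for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
            String message = current.getMessage();
            if (message != null && message.contains(STALE_STREAM_MARKER)) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    // A stale stream is retried on the same worker; anything else takes the generic path.
    static RetryAction classify(Throwable error) {
        return isStaleStream(error)
            ? RetryAction.REOPEN_ON_SAME_WORKER
            : RetryAction.EXCLUDE_AND_FAILOVER;
    }

    public static void main(String[] args) {
        Throwable stale = new RuntimeException("fetch chunk failed",
            new IOException("Stream 42 is not registered with worker. "
                + "This can happen if the worker was restart recently."));
        Throwable generic = new IOException("Connection reset by peer");

        System.out.println(classify(stale));
        System.out.println(classify(generic));
    }
}
```

Matching on message text is fragile if the worker wording ever changes, so a sketch like this would normally keep the marker string in one place and cover it with a regression test.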
