[
https://issues.apache.org/jira/browse/SPARK-56773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-56773:
-----------------------------------
Labels: pull-request-available (was: )
> Add more knobs to INJECT_SHUFFLE_FETCH_FAILURES
> -----------------------------------------------
>
> Key: SPARK-56773
> URL: https://issues.apache.org/jira/browse/SPARK-56773
> Project: Spark
> Issue Type: Test
> Components: Spark Core
> Affects Versions: 4.2.0
> Reporter: Juliusz Sompolski
> Priority: Major
> Labels: pull-request-available
>
> Spark's existing test-only `INJECT_SHUFFLE_FETCH_FAILURES` flag corrupts
> every map task of stage attempt 0, but only **leaf** shuffle map stages
> (scans) succeed on attempt 0. Non-leaf shuffle map stages fail-fetch from
> their corrupted upstream on attempt 0, succeed on a later attempt, and are
> then never themselves corrupted under the existing flag. That makes it
> impossible to write unit tests for the full range of stage-retry shapes that
> production hits (e.g. metric stability under "lost shuffle on recompute",
> rollback behaviour for indeterminate shuffle producers, SLAM semantics across
> retries).
> This proposes extending the DAGScheduler injection infrastructure with three
> orthogonal knobs:
> 1. **`INJECT_SHUFFLE_FETCH_FAILURES`** (existing master switch, semantics
> changed). Always corrupts the partition-0 task of the first SUCCESSFUL
> attempt of every shuffle map stage. Non-leaf stages get hit too, so their
> first-success outputs are invalidated and a downstream-induced re-run
> exercises the same simulated non-determinism a real lost-shuffle on recompute
> would produce.
> 2. **`INJECT_SHUFFLE_FETCH_FAILURES_DOWNSTREAM_DELAY`** (int, default 1,
> new). Defers the producer's mapper-0 corruption until N task-success events
> have been observed in stages that directly consume the shuffle. Because the
> DAGScheduler event loop processes completion events serially, this guarantees
> N consumer tasks fully completed BEFORE the FetchFailed cascade kicks in -
> the realistic "consumer made progress before lost shuffle on recompute"
> shape, rather than the unrealistic "everything fails before anything
> succeeds" of corrupting at producer registration time. Set to 0 to corrupt
> inline at registration.
> 3. **`INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE`** (boolean,
> default false, new). After a downstream FetchFailed forces the producer's
> partition 0 to be recomputed, register that recomputed MapStatus as if it had
> a checksum mismatch with the original. This drives
> `rollbackSucceedingStages`, which clears the downstream ShuffleMapStage's
> outputs and forces a full retry. Mirrors the production path where a
> recomputed shuffle output has a different checksum from a non-deterministic
> producer. ResultStage downstreams are aborted (OSS Spark does not support
> rolling them back, SPARK-25342).
> All three knobs are gated on `Utils.isTesting` and documented in
> `Tests.scala`.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]