buska88 opened a new pull request, #3351:
URL: https://github.com/apache/celeborn/pull/3351
…sProcessed
### What changes were proposed in this pull request?
In SkewHandlingWithoutMapRangeValidator, the call to
"updateCommitMetadata" should come before
"subPartitionsProcessed.put(endMapIndex, actualCommitMetadata)".
### Why are the changes needed?
In testing, we found failures in the following case: when a job has integrity
checking enabled, has celeborn.client.adaptive.optimizeSkewedPartitionRead.enabled
set, and is skewed, the last two reduce tasks may execute concurrently and
cause validation to fail.
task1-attempt1
```
org.apache.celeborn.common.exception.CelebornIOException: AQE Partition 121
failed validation checkwhile processing range startMapIndex: 5 endMapIndex:
1ExpectedCommitMetadata CommitMetadata{bytes=27755354976,
crc=CelebornCRC32{current=1320432810}}, ActualCommitMetadata
CommitMetadata{bytes=41366743972, crc=CelebornCRC32{current=477185228}},
```
task1-attempt2
```
org.apache.celeborn.common.exception.CelebornIOException: AQE Partition 121
failed validation checkwhile processing range startMapIndex: 5 endMapIndex:
1ExpectedCommitMetadata CommitMetadata{bytes=27755354976,
crc=CelebornCRC32{current=1320432810}}, ActualCommitMetadata
CommitMetadata{bytes=48072750200, crc=CelebornCRC32{current=755010}},
```
task2-attempt1
```
org.apache.celeborn.common.exception.CelebornIOException: AQE Partition 121
failed validation checkwhile processing range startMapIndex: 5 endMapIndex:
0ExpectedCommitMetadata CommitMetadata{bytes=27755354976,
crc=CelebornCRC32{current=1320432810}}, ActualCommitMetadata
CommitMetadata{bytes=34660737744, crc=CelebornCRC32{current=953615190}},
```
task2-attempt2
```
org.apache.celeborn.common.exception.CelebornIOException: AQE Partition 121
failed validation checkwhile processing range startMapIndex: 5 endMapIndex:
0ExpectedCommitMetadata CommitMetadata{bytes=27755354976,
crc=CelebornCRC32{current=1320432810}}, ActualCommitMetadata
CommitMetadata{bytes=54978132968, crc=CelebornCRC32{current=-366062354}},
```
Both tasks read skewed partition 121, and they are the last two reduce tasks
of the stage. If task1 executes
'CommitMetadata.checkCommitMetadata(expectedCommitMetadata,
currentCommitMetadata)' after task2 has put its endMapIndex into
subPartitionsProcessed but before task2 has updated the commit metadata, task1
sees the partition as fully processed while the accumulated metadata is still
missing task2's contribution, so task1 fails validation.
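The ordering issue can be sketched with a simplified model. This is not the actual SkewHandlingWithoutMapRangeValidator code: the class `SubPartitionTracker` and its methods are hypothetical, only byte counts are tracked (no CRC), and repeated attempts of the same sub-range are ignored for brevity. The point it illustrates is that the aggregate should be updated before the entry that marks the sub-range as processed is published.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified stand-in for the validator's bookkeeping.
class SubPartitionTracker {
    // Plays the role of subPartitionsProcessed: endMapIndex -> bytes read.
    private final ConcurrentHashMap<Integer, Long> processed = new ConcurrentHashMap<>();
    // Plays the role of the accumulated commit metadata.
    private final AtomicLong totalBytes = new AtomicLong();
    private final int expectedSubPartitions;

    SubPartitionTracker(int expectedSubPartitions) {
        this.expectedSubPartitions = expectedSubPartitions;
    }

    // Fixed order: update the aggregate first, then publish the entry.
    // A reader that observes the entry is then guaranteed to observe the
    // aggregate that already includes it.
    void record(int endMapIndex, long bytes) {
        totalBytes.addAndGet(bytes);      // step 1: update aggregate
        processed.put(endMapIndex, bytes); // step 2: mark as processed
    }

    // With the reverse order (put, then add), another thread could see
    // isComplete() == true while totalBytes() still lacks the last entry.
    boolean isComplete() {
        return processed.size() == expectedSubPartitions;
    }

    long totalBytes() {
        return totalBytes.get();
    }

    public static void main(String[] args) {
        SubPartitionTracker tracker = new SubPartitionTracker(2);
        tracker.record(0, 10L);
        tracker.record(1, 32L);
        System.out.println(tracker.isComplete() + " " + tracker.totalBytes());
    }
}
```

In this model, any thread that finds `isComplete()` true can compare `totalBytes()` against the expected value without racing against a writer that has published its entry but not yet contributed to the total.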
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested with skewed jobs with optimizeSkewedPartitionRead.enabled set.