Hi Gaurav,

I hope this message finds you well.

As the linked documentation describes, Celeborn implements exactly-once
processing for data batches. Data integrity within a batch is covered by
compression: shuffle data is stored compressed, so if a batch is
corrupted, decompression fails and the client is notified, ensuring that
only intact data batches are processed.
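To make the detection-via-decompression idea concrete, here is a minimal
self-contained Java sketch. It uses java.util.zip (zlib) rather than
Celeborn's actual shuffle codec, which is an assumption purely for
illustration: flipping a byte in the compressed stream (here, in the
trailing Adler-32 checksum) makes decompression fail, and that failure is
the signal the reader acts on.

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CorruptionDemo {
    // Compress a byte[] with zlib (stand-in for the real shuffle codec).
    static byte[] compress(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[data.length * 2 + 64];
        int n = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }

    // Returns true only if the stream decompresses fully and cleanly;
    // a corrupted stream surfaces as DataFormatException.
    static boolean decompressOk(byte[] compressed, int originalLen) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[originalLen];
        try {
            inflater.inflate(out);
            return inflater.finished();
        } catch (DataFormatException e) {
            return false; // corruption detected, caller can retry/fail the batch
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] batch = "shuffle batch payload ...".getBytes();
        byte[] compressed = compress(batch);
        System.out.println(decompressOk(compressed, batch.length));
        // Flip bits in the trailing zlib checksum to simulate corruption.
        compressed[compressed.length - 1] ^= 0xFF;
        System.out.println(decompressOk(compressed, batch.length));
    }
}
```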

Note, however, that there is currently no integrity check for the header
of a data batch. To close this gap, we plan to add a checksum feature [0]
for comprehensive data validation.
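The planned header checksum could look roughly like the sketch below.
CELEBORN-894 will define the actual format; the field layout here
(mapId/attemptId/batchId/length) and the choice of CRC32 are assumptions
for illustration only, not the committed design.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class HeaderChecksum {
    // Hypothetical batch-header layout: four ints (16 bytes).
    static ByteBuffer buildHeader(int mapId, int attemptId, int batchId, int length) {
        ByteBuffer b = ByteBuffer.allocate(16);
        b.putInt(mapId).putInt(attemptId).putInt(batchId).putInt(length);
        b.flip();
        return b;
    }

    // CRC32 over the header bytes; duplicate() leaves the caller's
    // buffer position untouched.
    static long checksumOf(ByteBuffer header) {
        CRC32 crc = new CRC32();
        crc.update(header.duplicate());
        return crc.getValue();
    }

    public static void main(String[] args) {
        ByteBuffer header = buildHeader(1, 0, 42, 1024);
        long expected = checksumOf(header);
        // The reader recomputes the CRC and compares it against the stored
        // value before trusting any header field.
        System.out.println(checksumOf(header) == expected);
    }
}
```

The writer would append the CRC next to the header on disk or on the wire,
and the reader would reject the batch on mismatch, mirroring what
decompression failure already does for the payload.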

If you have any further questions or need additional clarification on
any specific checks, please don't hesitate to reach out.


Ethan Feng

[0] https://issues.apache.org/jira/browse/CELEBORN-894

Gaurav Mittal <[email protected]> wrote on Thu, Oct 17, 2024, at 04:44:

>
> Hi Celeborn devs,
>
> I am trying to better understand the end-to-end data integrity checks that
> exist in Celeborn today
> * I saw some details about invariants that allow for Exactly Once Behavior
> here
> <https://celeborn.apache.org/docs/latest/developers/faulttolerant/#exactly-once>
> .
> * Are there other checks that are performed that help guarantee data
> correctness such as row count validation - total number of rows read by
> reducers for a partition are equal to the number of rows written by the
> mappers for that partition?
>
> Thanks
> Gaurav
