Hi Ethan/Keyong,

Thanks a lot for the information.

I am proposing
<https://docs.google.com/document/d/1YqK0kua-5rMufJw57kEIrHHGbLnAF9iXM5GdDweMzzg/edit?tab=t.0>
that Celeborn should optionally provide the capability to perform
end-to-end integrity checks that:
1. Provide higher confidence that the data being transported is both
   correct and complete
2. Make it easier to detect data corruption due to bugs/race conditions
   in Celeborn

Could you please review when you get a chance?

On Fri, Oct 18, 2024 at 5:34 AM Keyong Zhou <[email protected]> wrote:

> Hi Gaurav,
>
> Currently Celeborn doesn't check row count read and written, however
> Celeborn integrates with Spark's metrics and you can check through the
> Spark UI. Also I think it's possible to add integrity check based on
> the metrics.
>
> Regards,
> Keyong Zhou
>
> Ethan Feng <[email protected]> wrote on Fri, Oct 18, 2024 at 18:35:
>
> > Hi Gaurav,
> >
> > I hope this message finds you well.
> >
> > As you may have read in the provided link, Celeborn has successfully
> > implemented exactly-once processing for data batches. To ensure data
> > integrity within these batches, the shuffle data is compressed. If any
> > issues arise with the shuffle data, decompression will fail, and the
> > client will be notified, ensuring that only correct data batches are
> > processed.
> >
> > However, it's important to note that there is currently no data
> > integrity check for the header of a data batch. To address this, we
> > plan to implement a checksum feature [0] to provide comprehensive data
> > validation.
> >
> > If you have any further questions or need additional clarification on
> > any specific checks, please don't hesitate to reach out.
> >
> > Ethan Feng
> >
> > [0] https://issues.apache.org/jira/browse/CELEBORN-894
> >
> > Gaurav Mittal <[email protected]> wrote on Thu, Oct 17, 2024 at 04:44:
> >
> > > Hi Celeborn devs,
> > >
> > > I am trying to better understand the end-2-end data integrity checks
> > > that exist in Celeborn today
> > > * I saw some details about invariants that allow for Exactly Once
> > >   Behavior here
> > >   <https://celeborn.apache.org/docs/latest/developers/faulttolerant/#exactly-once>.
> > > * Are there other checks that are performed that help guarantee data
> > >   correctness such as row count validation - total number of rows read
> > >   by reducers for a partition are equal to the number of rows written
> > >   by the mappers for that partition?
> > >
> > > Thanks
> > > Gaurav
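
For illustration only, the row count validation raised in the original question, combined with a per-record checksum, could look roughly like the sketch below. This is not Celeborn code and not the design in the linked proposal: `PartitionDigest` and `IntegrityCheckSketch` are hypothetical names, and the sketch assumes that mappers and reducers can each accumulate a digest per partition and that the two digests can be compared somewhere (for example, at the driver). The per-record CRC32 values are summed so that the result does not depend on the order in which records arrive.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical per-partition digest; not part of Celeborn's API. */
final class PartitionDigest {
    private long rowCount;     // number of records seen for this partition
    private long checksumSum;  // sum of per-record CRC32 values (order-independent)

    /** Accumulate one serialized record into the digest. */
    void add(byte[] record) {
        CRC32 crc = new CRC32();
        crc.update(record, 0, record.length);
        rowCount++;
        // Addition is commutative, so the read order on the reducer side
        // does not need to match the write order on the mapper side.
        checksumSum += crc.getValue();
    }

    /** True when both sides saw the same number of rows and the same checksum sum. */
    boolean matches(PartitionDigest other) {
        return rowCount == other.rowCount && checksumSum == other.checksumSum;
    }

    long rowCount() {
        return rowCount;
    }
}

class IntegrityCheckSketch {
    public static void main(String[] args) {
        PartitionDigest written = new PartitionDigest(); // mapper side
        PartitionDigest read = new PartitionDigest();    // reducer side

        written.add("row-1".getBytes(StandardCharsets.UTF_8));
        written.add("row-2".getBytes(StandardCharsets.UTF_8));

        // The reducer reads the same records in a different order.
        read.add("row-2".getBytes(StandardCharsets.UTF_8));
        read.add("row-1".getBytes(StandardCharsets.UTF_8));

        System.out.println("digests match: " + written.matches(read));
    }
}
```

A summed checksum catches missing, duplicated, or corrupted records with high probability, but it is weaker than a cryptographic hash and cannot tell which record is wrong; it is meant only to show the shape of the check.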
