Hi Gaurav,

Currently Celeborn doesn't check that the row counts read and written
match. However, Celeborn integrates with Spark's metrics, so you can
check them through the Spark UI. I also think it's possible to add an
integrity check based on those metrics.
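For example, here is a rough sketch (not Celeborn code, just an
illustration of the idea) of how an application could sum the shuffle
record counters that Spark already reports and compare them after a job.
The listener name and the sample job are my own assumptions:

    import java.util.concurrent.atomic.LongAdder
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    // Sums shuffle write/read record counts from Spark's task metrics.
    class ShuffleRecordCountListener extends SparkListener {
      val recordsWritten = new LongAdder
      val recordsRead    = new LongAdder

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null) {
          recordsWritten.add(m.shuffleWriteMetrics.recordsWritten)
          recordsRead.add(m.shuffleReadMetrics.recordsRead)
        }
      }
    }

    object ShuffleCountCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("shuffle-count-check").getOrCreate()
        val listener = new ShuffleRecordCountListener
        spark.sparkContext.addSparkListener(listener)

        // Any job with a shuffle; a simple aggregation as an illustration.
        spark.range(0, 1000000).groupBy(col("id") % 10).count().collect()

        // If nothing was lost or duplicated between the map and reduce
        // sides, the two totals should be equal.
        println(s"shuffle records written = ${listener.recordsWritten.sum()}")
        println(s"shuffle records read    = ${listener.recordsRead.sum()}")
      }
    }

This only checks counts end to end from Spark's point of view; a
Celeborn-side check would need to be built on Celeborn's own metrics.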
Regards,
Keyong Zhou

On Fri, Oct 18, 2024 at 18:35, Ethan Feng <[email protected]> wrote:
> Hi Gaurav,
>
> I hope this message finds you well.
>
> As you may have read in the provided link, Celeborn has successfully
> implemented exactly-once processing for data batches. To ensure data
> integrity within these batches, the shuffle data is compressed. If any
> issues arise with the shuffle data, decompression will fail, and the
> client will be notified, ensuring that only correct data batches are
> processed.
>
> However, it's important to note that there is currently no data
> integrity check for the header of a data batch. To address this, we
> plan to implement a checksum feature [0] to provide comprehensive data
> validation.
>
> If you have any further questions or need additional clarification on
> any specific checks, please don't hesitate to reach out.
>
>
> Ethan Feng
>
> [0] https://issues.apache.org/jira/browse/CELEBORN-894
>
> On Thu, Oct 17, 2024 at 04:44, Gaurav Mittal <[email protected]> wrote:
> >
> > Hi Celeborn devs,
> >
> > I am trying to better understand the end-to-end data integrity checks
> > that exist in Celeborn today:
> > * I saw some details about invariants that allow for Exactly Once
> >   Behavior here:
> >   https://celeborn.apache.org/docs/latest/developers/faulttolerant/#exactly-once
> > * Are there other checks that are performed that help guarantee data
> >   correctness, such as row count validation - the total number of rows
> >   read by reducers for a partition equals the number of rows written by
> >   the mappers for that partition?
> >
> > Thanks
> > Gaurav
