Hi Ethan/Keyong,

Thanks a lot for the information.

I am proposing
<https://docs.google.com/document/d/1YqK0kua-5rMufJw57kEIrHHGbLnAF9iXM5GdDweMzzg/edit?tab=t.0>
that Celeborn should optionally provide the capability to perform end-to-end
integrity checks that:
1. Give higher confidence that the data being transported is both correct and
complete
2. Make it easier to detect data corruption due to bugs/race conditions in
Celeborn
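
To make the idea concrete, here is a minimal sketch in Python. It is not
Celeborn code: `PartitionDigest`, `digest_of`, and `verify` are hypothetical
names used only for illustration. It assumes each mapper tracks a row count
and an order-independent checksum per partition, and the reducer recomputes
both over the records it actually read.

```python
import zlib
from dataclasses import dataclass

MASK = (1 << 64) - 1  # keep the running checksum within 64 bits


@dataclass
class PartitionDigest:
    """Hypothetical per-partition summary: row count plus additive checksum."""
    rows: int = 0
    checksum: int = 0

    def add(self, record: bytes) -> None:
        # Summing per-record CRC32 values makes the result independent of
        # the order in which records are seen.
        self.rows += 1
        self.checksum = (self.checksum + zlib.crc32(record)) & MASK

    def merge(self, other: "PartitionDigest") -> "PartitionDigest":
        # Combine digests reported by different mappers for one partition.
        return PartitionDigest(
            self.rows + other.rows,
            (self.checksum + other.checksum) & MASK,
        )


def digest_of(records) -> PartitionDigest:
    """Build a digest over an iterable of serialized records."""
    digest = PartitionDigest()
    for record in records:
        digest.add(record)
    return digest


def verify(expected: PartitionDigest, actual: PartitionDigest) -> None:
    """Raise if what the reducer read differs from what mappers wrote."""
    if expected != actual:
        raise ValueError(f"integrity check failed: {expected} != {actual}")


# Two mappers write to the same partition; the reducer reads in another order.
written = digest_of([b"a", b"b"]).merge(digest_of([b"c"]))
verify(written, digest_of([b"c", b"a", b"b"]))
```

The additive checksum is a deliberate simplification: it tolerates reordering
of records across batches, but it is weaker than a cryptographic hash, so it
should be read as an illustration of the check rather than a proposed design.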

Could you please review when you get a chance?

On Fri, Oct 18, 2024 at 5:34 AM Keyong Zhou <[email protected]> wrote:

> Hi Gaurav,
>
> Currently Celeborn doesn't check the row counts read and written; however,
> Celeborn integrates
> with Spark's metrics, and you can check them through the Spark UI. Also, I think
> it's possible to add
> an integrity check based on the metrics.
>
> Regards,
> Keyong Zhou
>
> On Fri, Oct 18, 2024 at 18:35, Ethan Feng <[email protected]> wrote:
>
> > Hi Gaurav,
> >
> > I hope this message finds you well.
> >
> > As you may have read in the provided link, Celeborn has successfully
> > implemented exactly-once processing for data batches. To ensure data
> > integrity within these batches, the shuffle data is compressed. If any
> > issues arise with the shuffle data, decompression will fail, and the
> > client will be notified, ensuring that only correct data batches are
> > processed.
> >
> > However, it's important to note that there is currently no data
> > integrity check for the header of a data batch. To address this, we
> > plan to implement a checksum feature [0] to provide comprehensive data
> > validation.
> >
> > If you have any further questions or need additional clarification on
> > any specific checks, please don't hesitate to reach out.
> >
> >
> > Ethan Feng
> >
> > [0]https://issues.apache.org/jira/browse/CELEBORN-894
> >
> > On Thu, Oct 17, 2024 at 04:44, Gaurav Mittal <[email protected]> wrote:
> >
> > >
> > > Hi Celeborn devs,
> > >
> > > I am trying to better understand the end-to-end data integrity checks
> > > that exist in Celeborn today:
> > > * I saw some details about invariants that allow for exactly-once
> > > behavior here
> > > <https://celeborn.apache.org/docs/latest/developers/faulttolerant/#exactly-once>.
> > > * Are there other checks that help guarantee data correctness, such as
> > > row count validation (the total number of rows read by reducers for a
> > > partition equals the number of rows written by the mappers for that
> > > partition)?
> > >
> > > Thanks
> > > Gaurav
> >
>
