Thanks Xinyu for the proposal! It's a great enhancement to the current
split mechanism. The design LGTM overall. A few additional
considerations:
1) We need to ensure client compatibility, so it's best to reuse
RpcResponse, or add a new protocol to ensure the old behavior remains
compatible.
2) We could consider first implementing single-replica support for Split
and then extending it to dual replicas to reduce implementation complexity.
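
To make the partial-split idea concrete, here is a rough Java sketch of how a
worker might pick only the partitions on full disks out of one merged push,
grouping by disk first as Keyong suggests in points 2 and 3 of the message
quoted below. All names (SplitSelectionSketch, partitionsToSplit, the disk
labels, reserveBytes) are made up for illustration and are not Celeborn APIs.
I'm also assuming a disk counts as full once its usable space drops below a
reserve threshold.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names are Celeborn APIs. */
class SplitSelectionSketch {

    /**
     * Returns the indexes (positions in the merged request) of the partitions
     * whose disk is considered full. diskOfPartition[i] is the disk holding
     * partition i; usableBytes maps each disk to its remaining usable space.
     */
    public static List<Integer> partitionsToSplit(
            String[] diskOfPartition,
            Map<String, Long> usableBytes,
            long reserveBytes) {
        // Group partition indexes by disk so each disk is checked only once.
        Map<String, List<Integer>> byDisk = new LinkedHashMap<>();
        for (int i = 0; i < diskOfPartition.length; i++) {
            byDisk.computeIfAbsent(diskOfPartition[i], d -> new ArrayList<>()).add(i);
        }

        List<Integer> toSplit = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : byDisk.entrySet()) {
            // Assumption: a disk with no usage info is treated as full.
            long usable = usableBytes.getOrDefault(e.getKey(), 0L);
            if (usable < reserveBytes) {
                toSplit.addAll(e.getValue());
            }
        }
        return toSplit;
    }

    public static void main(String[] args) {
        String[] disks = {"disk1", "disk2", "disk1", "disk3"};
        Map<String, Long> usable = new HashMap<>();
        usable.put("disk1", 1L << 30);   // 1 GiB left
        usable.put("disk2", 10L << 30);  // 10 GiB left
        usable.put("disk3", 2L << 30);   // 2 GiB left

        // With a 5 GiB reserve, only partitions on disk1 and disk3 are split.
        System.out.println(partitionsToSplit(disks, usable, 5L << 30)); // [0, 2, 3]
    }
}
```

The returned indexes are what a response would need to carry back to the
client, whichever response type we end up using, so that only those partitions
are revived and the rest of the merged push is treated as written.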

Thanks,
Jiashu Xiong

Keyong Zhou <[email protected]> wrote on Sun, Sep 22, 2024 at 11:15:

> Thanks Xinyu for the proposal! Glad to see PushMergedData will finally
> support splitting :)
>
> The design LGTM overall, but I still have some comments:
>
> 1. We can support soft split as well.
>
> 2. Typically there is more than one disk, so it's better to split only the
>     partitions belonging to full disks instead of splitting all partitions.
>
> 3. We can group partitions by disk to make it more efficient to check
>     whether the disk is full for each partition.
>
> 4. For FILE_ALREADY_CLOSED, I think we shouldn't decrease remainReviveTimes
>     either, since it's not an exception or error.
>
> Regards,
> Keyong Zhou
>
>
>
> Fu Chen <[email protected]> wrote on Sat, Sep 21, 2024 at 18:03:
>
> > Thanks Xinyu for the proposal. Adding HARD_SPLIT support for
> > PushMergedData is valuable for production. We've encountered issues
> > with small-disk nodes getting overloaded in heterogeneous clusters.
> >
> > I had a discussion with @rexxiong. The current implementation requires
> > introducing PUSH_MERGED_DATA_RESPONSE, which increases the complexity
> > of the modifications. We could consider reusing RpcResponse instead.
> >
> > Thanks,
> > Fu Chen
> >
> > 王馨雨 <[email protected]> wrote on Fri, Sep 20, 2024 at 10:35:
> > >
> > > Hi all,
> > >
> > > I've written up a proposal for supporting HARD_SPLIT in Celeborn. You
> > > can find the proposal here
> > > <https://cwiki.apache.org/confluence/display/CELEBORN/CIP-12+Support+HARD_SPLIT+in+PushMergedData>.
> > > Please let me know if you have any comments or questions.
> > >
> > > Unlike PushData, Celeborn won’t actively trigger HARD_SPLIT in
> > > PushMergedData unless one or more partitions in the PushMergedData
> > > partition group have already been split.
> > > This leads to several problems:
> > >
> > > 1. Cascading HARD_SPLIT in PushMergedData is wasteful, because most
> > >     partitions may not have reached the HARD_SPLIT threshold.
> > >
> > > 2. Worker pressure cannot be transferred if the partitions won’t be
> > >     split.
> > >
> > > 3. ReverveSize won’t take effect.
> > >
> > > Supporting HARD_SPLIT in PushMergedData will solve the above problems by
> > > splitting only the partitions that need to be split.
> > >
> > > Thanks,
> > > Xinyu Wang
> > >
> >
>
