Thanks Xinyu for the proposal! It's a great enhancement to the current Split mechanism. The design LGTM overall. A few additional considerations:

1) We need to ensure client compatibility, so it's best to either reuse RpcResponse or add a new protocol, so that the old behavior keeps working.
2) We could first implement Split support for a single replica and then extend it to dual replicas, to reduce implementation complexity.

Thanks,
Jiashu Xiong

Keyong Zhou <[email protected]> wrote on Sun, Sep 22, 2024 at 11:15:

> Thanks Xinyu for the proposal! Glad to see PushMergeData will finally
> support splitting :)
>
> The design LGTM overall, but I still have some comments:
>
> 1. We can support soft split as well
>
> 2. Typically there is more than one disk, so it's better to only split
> partitions belonging to full disks instead of splitting all partitions
>
> 3. We can group partitions by disk to make it more efficient to check
> whether the disk is full for each partition
>
> 4. For FILE_ALREADY_CLOSED I think we shouldn't decrease
> remainReviveTimes either, since it's not an exception or error
>
> Regards,
> Keyong Zhou
>
> Fu Chen <[email protected]> wrote on Sat, Sep 21, 2024 at 18:03:
>
> > Thanks Xinyu for the proposal. Adding HARD_SPLIT support for
> > PushMergeData is valuable for production. We've encountered issues
> > with small-disk nodes getting overloaded in heterogeneous clusters.
> >
> > I had a discussion with @rexxiong. The current implementation requires
> > introducing PUSH_MERGED_DATA_RESPONSE, which increases the complexity
> > of the modifications. We could consider reusing RpcResponse.
> >
> > Thanks,
> > Fu Chen
> >
> > 王馨雨 <[email protected]> wrote on Fri, Sep 20, 2024 at 10:35:
> > >
> > > Hi all,
> > >
> > > I've written up a proposal for supporting HARD_SPLIT in Celeborn.
> > > You can find the proposal here
> > > <https://cwiki.apache.org/confluence/display/CELEBORN/CIP-12+Support+HARD_SPLIT+in+PushMergedData>.
> > > Please let me know if you have any comments or questions.
> > >
> > > Unlike PushData, Celeborn won’t actively trigger HARD_SPLIT in
> > > PushMergedData unless one or more partitions in the partition group
> > > of the PushMergedData request have already been split.
> > >
> > > This leads to several problems:
> > >
> > > - Cascading HARD_SPLIT in PushMergedData is too wasteful, because
> > >   most partitions may not have reached the HARD_SPLIT threshold.
> > > - Worker pressure cannot be transferred if the partitions won’t be
> > >   split.
> > > - ReverveSize won’t take effect.
> > >
> > > Supporting HARD_SPLIT in PushMergedData will solve the above
> > > problems, since it will only split the partitions that need to be
> > > split.
> > >
> > > Thanks,
> > > Xinyu Wang
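
For reference, Keyong's comments 2 and 3 (group partitions by disk, then split only the partitions on full disks) could look roughly like the sketch below. This is only an illustration under assumptions, not Celeborn's actual worker code: the class name DiskSplitSketch, the method partitionsToHardSplit, the map-based inputs, and the reserveBytes threshold are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the Celeborn code base.
public class DiskSplitSketch {

    /**
     * Returns the ids of partitions that sit on a "full" disk, sorted ascending.
     * A disk counts as full when its usable bytes drop below reserveBytes
     * (an assumed threshold for this sketch).
     */
    static List<Integer> partitionsToHardSplit(
            Map<Integer, String> partitionToDisk,
            Map<String, Long> usableBytesByDisk,
            long reserveBytes) {
        // Group partitions by disk so each disk is checked once,
        // instead of once per partition.
        Map<String, List<Integer>> byDisk = new HashMap<>();
        for (Map.Entry<Integer, String> e : partitionToDisk.entrySet()) {
            byDisk.computeIfAbsent(e.getValue(), d -> new ArrayList<>()).add(e.getKey());
        }

        List<Integer> toSplit = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : byDisk.entrySet()) {
            // Assumption: a disk with no usage record is treated as full.
            long usable = usableBytesByDisk.getOrDefault(e.getKey(), 0L);
            if (usable < reserveBytes) {
                // Only partitions on this full disk are marked for splitting.
                toSplit.addAll(e.getValue());
            }
        }
        Collections.sort(toSplit);
        return toSplit;
    }

    public static void main(String[] args) {
        Map<Integer, String> partitionToDisk = new HashMap<>();
        partitionToDisk.put(0, "/mnt/disk1");
        partitionToDisk.put(1, "/mnt/disk2");
        partitionToDisk.put(2, "/mnt/disk1");

        Map<String, Long> usable = new HashMap<>();
        usable.put("/mnt/disk1", 10L);    // below the reserve, so "full"
        usable.put("/mnt/disk2", 5000L);  // plenty of space left

        System.out.println(partitionsToHardSplit(partitionToDisk, usable, 100L));
    }
}
```

With this shape, a merged push that touches partitions on several disks would only report a split for the partitions whose disk is actually full, and the rest would continue to be written as before.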
