Thanks Xinyu for the proposal. Adding HARD_SPLIT support for PushMergedData is valuable for production. We've encountered issues with small-disk nodes getting overloaded in heterogeneous clusters.
I discussed this with @rexxiong. The current implementation requires introducing PUSH_MERGED_DATA_RESPONSE, which increases the complexity of the modification. We could consider reusing RpcResponse instead.

Thanks,
Fu Chen

王馨雨 <[email protected]> wrote on Fri, Sep 20, 2024 at 10:35:
>
> Hi all,
>
> I've written up a proposal for supporting HARD_SPLIT in Celeborn. You can find
> the proposal here
> <https://cwiki.apache.org/confluence/display/CELEBORN/CIP-12+Support+HARD_SPLIT+in+PushMergedData>.
> Please let me know if you have any comments or questions.
>
> Unlike PushData, Celeborn won’t actively trigger HARD_SPLIT in PushMergedData
> unless there are one or more partitions which have been split in the
> partition group of PushMergedData.
> This leads to several problems:
> - Cascading HARD_SPLIT in PushMergedData will be too wasted because most
>   partitions may not reach the HARD_SPLIT threshold.
> - Worker pressure cannot be transferred if the partitions won’t be split.
> - ReverveSize won’t take effect.
>
> Supporting HARD_SPLIT in PushMergedData will solve the above problems
> which will only split the partitions that need to be split.
>
> Thanks,
> Xinyu Wang
>
