Thanks Yuxin for your explanation. That sounds reasonable. Looking forward to the new shuffle.
Best,
Weihua

On Fri, Mar 10, 2023 at 11:48 AM Yuxin Tan <tanyuxinw...@gmail.com> wrote:

> Hi, Weihua,
> Thanks for the questions and the ideas.
>
> > 1. How many performance regressions would there be if we only
> > used remote storage?
>
> The new architecture can support using remote storage only, but this
> FLIP targets improving job stability. The change in the FLIP is already
> significantly complex, and the goal of the first version is to move
> Hybrid Shuffle to the new architecture and support remote storage as
> a supplement. The performance of this version is not the first priority,
> so we haven't tested the performance of using only remote storage.
> If there are indeed regressions, we will keep optimizing the performance
> of the remote storage tier until using only remote storage is viable
> in production environments.
>
> > 2. Shall we move the local data to remote storage if the producer has
> > been finished for a long time?
>
> I agree that it is a good idea, which could release TaskManager
> resources more promptly. But moving data from the TM local disk to
> remote storage needs more detailed discussion and design, and it will
> be easier to implement on top of the new architecture. Considering the
> complexity, the target focus, and the iteration cycle of the FLIP, we
> decided not to include these details in the first version. We will
> extend and implement them in subsequent versions.
>
> Best,
> Yuxin
>
>
> On Thu, Mar 9, 2023 at 11:22, Weihua Hu <huweihua....@gmail.com> wrote:
>
> > Hi, Yuxin
> >
> > Thanks for driving this FLIP.
> >
> > Remote storage shuffle could improve the stability of Batch jobs.
> >
> > In our internal scenario, we use a hybrid cluster to run both
> > Streaming (high priority) and Batch (low priority) jobs. When there
> > are not enough resources (e.g., CPU usage reaches a threshold), the
> > Batch containers are evicted, which causes some Batch tasks to re-run.
> >
> > It would be a great help if remote storage could address this, so I
> > have a few questions.
> >
> > 1. How many performance regressions would there be if we only used
> > remote storage?
> >
> > 2. In the current design, a shuffle data segment is written to one
> > kind of storage tier. Shall we move the local data to remote storage
> > if the producer has been finished for a long time? Then we could
> > release idle TaskManagers that hold no shuffle data. This may help to
> > reduce resource usage when the producer parallelism is larger than
> > the consumer parallelism.
> >
> > Best,
> > Weihua
> >
> >
> > On Thu, Mar 9, 2023 at 10:38 AM Yuxin Tan <tanyuxinw...@gmail.com> wrote:
> >
> > > Hi, Junrui,
> > > Thanks for the suggestions and ideas.
> > >
> > > > If they are fixed, I suggest that the FLIP could provide clearer
> > > > explanations.
> > > I have updated the FLIP and described the segment size more clearly.
> > >
> > > > Can we provide configuration options for users to manually adjust
> > > > the sizes?
> > > The segment size can be configured if necessary. But if we exposed
> > > these parameters prematurely, it might become difficult to change
> > > the implementation later because users would already rely on the
> > > configs. We can make these internal configs or fixed values when
> > > implementing the first version. I think either way works, because
> > > they are internal and do not affect the public APIs.
> > >
> > > Best,
> > > Yuxin
> > >
> > >
> > > On Wed, Mar 8, 2023 at 00:24, Junrui Lee <jrlee....@gmail.com> wrote:
> > >
> > > > Hi Yuxin,
> > > >
> > > > This FLIP looks quite reasonable. Flink can solve the problem of
> > > > Batch shuffle by combining local and remote storage: it can use
> > > > fixed local disks for better performance in most scenarios, while
> > > > using remote storage as a supplement when the local disks are not
> > > > sufficient, avoiding wasted costs and poor job stability.
> > > > Moreover, the solution also considers dynamic switching: it can
> > > > automatically switch to remote storage when the local disk is
> > > > full, saving costs, and automatically switch back when there is
> > > > available space on the local disk.
> > > >
> > > > As Wencong Liu stated, an appropriate segment size is essential,
> > > > as it can significantly affect shuffle performance. I also agree
> > > > that the first version should focus mainly on the design and
> > > > implementation. However, I have a small question about the FLIP:
> > > > I did not see any information regarding the segment size of
> > > > memory, local disk, and remote storage. Are these three values
> > > > fixed at present? If they are fixed, I suggest that the FLIP
> > > > could provide clearer explanations. Moreover, although a dynamic
> > > > segment size mechanism is not necessary at the moment, can we
> > > > provide configuration options for users to manually adjust these
> > > > sizes? I think it might be useful.
> > > >
> > > > Best,
> > > > Junrui.
> > > >
> > > > On Tue, Mar 7, 2023 at 20:14, Yuxin Tan <tanyuxinw...@gmail.com> wrote:
> > > >
> > > > > Thanks for joining the discussion.
> > > > >
> > > > > @weijie guo
> > > > > > 1. How to optimize the broadcast result partition?
> > > > > For partitions with multiple consumers, e.g., broadcast result
> > > > > partitions, partition reuse, speculative execution, etc., the
> > > > > processing logic is the same as in the original Hybrid Shuffle,
> > > > > that is, using the full spilling strategy. This may indeed
> > > > > reduce the opportunity to consume from memory, but the PoC
> > > > > shows that it basically has no effect on performance.
> > > > >
> > > > > > 2. Can the new proposal completely avoid this problem of
> > > > > > inaccurate backlog calculation?
> > > > > Yes, this can avoid the problem completely. Regarding the read
> > > > > buffers, the N is to reserve one exclusive buffer per channel,
> > > > > which avoids the deadlock that occurs when some channels hold
> > > > > all the buffers and other channels cannot request any. The
> > > > > buffers beyond the N can float among all channels, which
> > > > > compete to request them.
> > > > >
> > > > > @Wencong Liu
> > > > > > Deciding the segment size dynamically will be helpful.
> > > > > I agree that it may be better if the segment size is decided
> > > > > dynamically, but to simplify the implementation of the first
> > > > > version, we want to make this a fixed value for each tier. In
> > > > > the future, this can be a good improvement if necessary. In the
> > > > > first version, we will mainly focus on the more important
> > > > > features, such as the tiered storage architecture, dynamic
> > > > > tier switching, remote storage support, memory management, etc.
> > > > >
> > > > > Best,
> > > > > Yuxin
> > > > >
> > > > >
> > > > > On Tue, Mar 7, 2023 at 16:48, Wencong Liu <liuwencle...@163.com> wrote:
> > > > >
> > > > > > Hello Yuxin,
> > > > > >
> > > > > > Thanks for your proposal! Adding a remote storage capability
> > > > > > to Flink's Hybrid Shuffle is a significant improvement that
> > > > > > addresses the issue of local disk storage limitations. This
> > > > > > enhancement not only ensures uninterrupted shuffle, but also
> > > > > > enables Flink to handle larger workloads and more complex
> > > > > > data processing tasks.
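The exclusive-plus-floating read-buffer scheme Yuxin describes above (reserve one exclusive buffer per channel so no channel can starve, let the rest float) can be sketched roughly as follows. This is an illustration of the idea only; `ReadBufferPool` and its methods are invented names, not Flink's actual classes.

```python
# Sketch of deadlock avoidance via per-channel exclusive buffers: each channel
# always has one buffer it alone may use, and the remaining buffers float,
# i.e. all channels compete for them. Names are illustrative, not Flink's.

class ReadBufferPool:
    def __init__(self, total_buffers, num_channels):
        assert total_buffers >= num_channels, "need >= 1 exclusive buffer per channel"
        self.exclusive = {ch: 1 for ch in range(num_channels)}  # the reserved N
        self.floating = total_buffers - num_channels            # shared remainder

    def request(self, channel):
        """Return True if the channel obtains a buffer, False if it must wait."""
        if self.exclusive[channel] > 0:
            self.exclusive[channel] -= 1   # exclusive buffer: guaranteed once
            return True
        if self.floating > 0:
            self.floating -= 1             # floating buffer: first come, first served
            return True
        # Must wait, but no deadlock: every other channel can still make
        # progress using its own exclusive buffer.
        return False

    def recycle(self, channel, was_exclusive):
        if was_exclusive:
            self.exclusive[channel] += 1
        else:
            self.floating += 1

pool = ReadBufferPool(total_buffers=4, num_channels=2)
assert pool.request(0)       # channel 0 takes its exclusive buffer
assert pool.request(0)       # channel 0 takes a floating buffer
assert pool.request(0)       # channel 0 takes the last floating buffer
assert pool.request(1)       # channel 1 still gets its exclusive buffer
assert not pool.request(1)   # further requests wait for a recycle
```

The point of the reservation is visible in the last two lines: even after one channel grabs every floating buffer, the other channel is never blocked from making progress.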
> > > > > > With the ability to seamlessly shift between local and
> > > > > > remote storage, Flink's Hybrid Shuffle will be more
> > > > > > versatile and scalable, making it an ideal choice for
> > > > > > organizations looking to build distributed data processing
> > > > > > applications with ease.
> > > > > > Besides, I have a small question about the segment size in
> > > > > > different storages. According to the FLIP, the segment size
> > > > > > may be fixed for each storage tier, but I think a fixed size
> > > > > > may affect shuffle performance. For example, a smaller
> > > > > > segment size will improve the utilization of the Memory
> > > > > > Storage Tier, but it may bring extra cost to the Disk
> > > > > > Storage Tier or Remote Storage Tier. Deciding the segment
> > > > > > size dynamically will be helpful.
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Wencong Liu
> > > > > >
> > > > > > At 2023-03-06 13:51:21, "Yuxin Tan" <tanyuxinw...@gmail.com> wrote:
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > I would like to start a discussion on FLIP-301: Hybrid
> > > > > > > Shuffle supports Remote Storage [1].
> > > > > > >
> > > > > > > In a cloud-native environment, it is difficult to
> > > > > > > determine the appropriate disk space for Batch shuffle,
> > > > > > > which affects job stability.
> > > > > > >
> > > > > > > This FLIP is to support remote storage for Hybrid Shuffle
> > > > > > > to improve Batch job stability in cloud-native
> > > > > > > environments.
> > > > > > >
> > > > > > > The goals of this FLIP are as follows.
> > > > > > > 1. By default, use local memory and disk to ensure high
> > > > > > > shuffle performance when the local storage space is
> > > > > > > sufficient.
> > > > > > > 2. When the local storage space is insufficient, use
> > > > > > > remote storage as a supplement to avoid large-scale Batch
> > > > > > > job failures.
> > > > > > >
> > > > > > > Looking forward to hearing from you.
> > > > > > >
> > > > > > > [1]
> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-301%3A+Hybrid+Shuffle+supports+Remote+Storage
> > > > > > >
> > > > > > > Best,
> > > > > > > Yuxin
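For readers skimming the thread: the tier behaviour discussed above (local memory and disk first for performance, remote storage only as a supplement, with an automatic switch back once local space frees up) boils down to a per-segment placement decision along these lines. The tier names and capacity checks below are made up for the sketch and are not the FLIP's actual interfaces.

```python
# Illustrative per-segment tier selection: try local tiers first, fall back to
# remote storage only when local space is exhausted. Because the decision is
# re-evaluated for every segment, writes switch back to local tiers as soon
# as space is freed. Names are hypothetical, not taken from the FLIP.

from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    capacity_segments: int
    used_segments: int = 0

    def has_space(self):
        return self.used_segments < self.capacity_segments

def write_segment(tiers):
    """Place one segment in the first tier with free space."""
    for tier in tiers:
        if tier.has_space():
            tier.used_segments += 1
            return tier.name
    raise RuntimeError("unreachable if the last tier is effectively unbounded")

memory = Tier("memory", capacity_segments=1)
disk = Tier("local-disk", capacity_segments=2)
remote = Tier("remote", capacity_segments=10**9)  # effectively unbounded

placements = [write_segment([memory, disk, remote]) for _ in range(4)]
assert placements == ["memory", "local-disk", "local-disk", "remote"]

# Freeing local disk space makes the next segment land locally again,
# matching the "automatically switch back" behaviour Junrui mentions.
disk.used_segments -= 1
assert write_segment([memory, disk, remote]) == "local-disk"
```

The remote tier acting as an effectively unbounded last resort is what delivers goal 2: the write path never fails for lack of local space, it only degrades to slower storage.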