Thanks Yuxin for your explanation. That sounds reasonable. Looking forward to the new shuffle.
Best,
Weihua

On Fri, Mar 10, 2023 at 11:48 AM Yuxin Tan <tanyuxinw...@gmail.com> wrote:

> Hi, Weihua,
> Thanks for the questions and the ideas.
>
> > 1. How many performance regressions would there be if we only
> > used remote storage?
>
> The new architecture can support using remote storage only, but this
> FLIP targets improving job stability. The change in the FLIP is already
> significantly complex, and the goal of the first version is to move
> Hybrid Shuffle to the new architecture and support remote storage as
> a supplement. The performance of this version is not the first priority,
> so we haven't tested the performance of using only remote storage.
> If there are indeed regressions, we will keep optimizing the performance
> of the remote storage tier until using only remote storage is viable
> in production environments.
>
> > 2. Shall we move the local data to remote storage if the producer has
> > been finished for a long time?
>
> I agree that it is a good idea, which could release TaskManager
> resources more promptly. But moving data from the TM local disk to
> remote storage needs more detailed discussion and design, and it will
> be easier to implement on top of the new architecture. Considering the
> complexity, the target focus, and the iteration cycle of the FLIP, we
> decided not to include these details in the first version. We will
> extend and implement them in subsequent versions.
>
> Best,
> Yuxin
>
>
> On Thu, Mar 9, 2023 at 11:22, Weihua Hu <huweihua....@gmail.com> wrote:
>
> > Hi, Yuxin
> >
> > Thanks for driving this FLIP.
> >
> > Remote storage shuffle could improve the stability of Batch jobs.
> >
> > In our internal scenario, we use a hybrid cluster to run both
> > Streaming (high priority) and Batch (low priority) jobs. When there
> > are not enough resources (e.g., CPU usage reaches a threshold), the
> > Batch containers are evicted, which causes some Batch tasks to re-run.
> >
> > It would be a great help if remote storage could address this, so I
> > have a few questions.
> >
> > 1. How many performance regressions would there be if we only used
> > remote storage?
> >
> > 2. In the current design, a shuffle data segment is written to one
> > kind of storage tier. Shall we move the local data to remote storage
> > if the producer has been finished for a long time? Then we could
> > release idle TaskManagers that hold no shuffle data. This may help to
> > reduce resource usage when the producer parallelism is larger than
> > the consumer parallelism.
> >
> > Best,
> > Weihua
> >
> >
> > On Thu, Mar 9, 2023 at 10:38 AM Yuxin Tan <tanyuxinw...@gmail.com> wrote:
> >
> > > Hi, Junrui,
> > > Thanks for the suggestions and ideas.
> > >
> > > > If they are fixed, I suggest that the FLIP could provide clearer
> > > > explanations.
> > > I have updated the FLIP and described the segment size more clearly.
> > >
> > > > Can we provide configuration options for users to manually adjust
> > > > the sizes?
> > > The segment size can be configured if necessary. But if we exposed
> > > these parameters prematurely, it might become difficult to change
> > > the implementation later because users would already rely on the
> > > configs. We can make these internal configs or fixed values when
> > > implementing the first version. I think either way works, because
> > > they are internal and do not affect the public APIs.
> > >
> > > Best,
> > > Yuxin
> > >
> > >
> > > On Wed, Mar 8, 2023 at 00:24, Junrui Lee <jrlee....@gmail.com> wrote:
> > >
> > > > Hi Yuxin,
> > > >
> > > > This FLIP looks quite reasonable. Flink can solve the problem of
> > > > Batch shuffle by combining local and remote storage: it can use
> > > > fixed local disks for better performance in most scenarios, while
> > > > using remote storage as a supplement when the local disks are not
> > > > sufficient, avoiding wasted costs and poor job stability.
> > > > Moreover, the solution also considers dynamic switching: it can
> > > > automatically switch to remote storage when the local disk is
> > > > full, saving costs, and automatically switch back when there is
> > > > available space on the local disk.
> > > >
> > > > As Wencong Liu stated, an appropriate segment size is essential,
> > > > as it can significantly affect shuffle performance. I also agree
> > > > that the first version should focus mainly on the design and
> > > > implementation. However, I have a small question about the FLIP:
> > > > I did not see any information regarding the segment size of
> > > > memory, local disk, and remote storage. Are these three values
> > > > fixed at present? If they are fixed, I suggest that the FLIP
> > > > could provide clearer explanations. Moreover, although a dynamic
> > > > segment size mechanism is not necessary at the moment, can we
> > > > provide configuration options for users to manually adjust these
> > > > sizes? I think it might be useful.
> > > >
> > > > Best,
> > > > Junrui.
> > > >
> > > > On Tue, Mar 7, 2023 at 20:14, Yuxin Tan <tanyuxinw...@gmail.com> wrote:
> > > >
> > > > > Thanks for joining the discussion.
> > > > >
> > > > > @weijie guo
> > > > > > 1. How to optimize the broadcast result partition?
> > > > > For partitions with multiple consumers, e.g., broadcast result
> > > > > partitions, partition reuse, speculative execution, etc., the
> > > > > processing logic is the same as in the original Hybrid Shuffle,
> > > > > that is, using the full spilling strategy. This may indeed
> > > > > reduce the opportunity to consume from memory, but the PoC
> > > > > shows that it basically has no effect on performance.
> > > > >
> > > > > > 2. Can the new proposal completely avoid this problem of
> > > > > > inaccurate backlog calculation?
> > > > > Yes, this can avoid the problem completely. Regarding the read
> > > > > buffers, the N is to reserve one exclusive buffer per channel,
> > > > > which avoids the deadlock that occurs when some channels hold
> > > > > all the buffers and other channels cannot request any. The
> > > > > buffers beyond the N can float among all channels, which
> > > > > compete to request them.
> > > > >
> > > > > @Wencong Liu
> > > > > > Deciding the segment size dynamically will be helpful.
> > > > > I agree that it may be better if the segment size is decided
> > > > > dynamically, but to simplify the implementation of the first
> > > > > version, we want to make this a fixed value for each tier. In
> > > > > the future, this can be a good improvement if necessary. In the
> > > > > first version, we will mainly focus on the more important
> > > > > features, such as the tiered storage architecture, dynamic
> > > > > tier switching, remote storage support, memory management, etc.
> > > > >
> > > > > Best,
> > > > > Yuxin
> > > > >
> > > > >
> > > > > On Tue, Mar 7, 2023 at 16:48, Wencong Liu <liuwencle...@163.com> wrote:
> > > > >
> > > > > > Hello Yuxin,
> > > > > >
> > > > > > Thanks for your proposal! Adding a remote storage capability
> > > > > > to Flink's Hybrid Shuffle is a significant improvement that
> > > > > > addresses the issue of local disk storage limitations. This
> > > > > > enhancement not only ensures uninterrupted shuffle, but also
> > > > > > enables Flink to handle larger workloads and more complex
> > > > > > data processing tasks.
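The exclusive-plus-floating read-buffer scheme Yuxin describes above (reserve one exclusive buffer per channel so no channel can starve, let the rest float) can be sketched roughly as follows. This is an illustration of the idea only; `ReadBufferPool` and its methods are invented names, not Flink's actual classes.

```python
# Sketch of deadlock avoidance via per-channel exclusive buffers: each channel
# always has one buffer it alone may use, and the remaining buffers float,
# i.e. all channels compete for them. Names are illustrative, not Flink's.

class ReadBufferPool:
    def __init__(self, total_buffers, num_channels):
        assert total_buffers >= num_channels, "need >= 1 exclusive buffer per channel"
        self.exclusive = {ch: 1 for ch in range(num_channels)}  # the reserved N
        self.floating = total_buffers - num_channels            # shared remainder

    def request(self, channel):
        """Return True if the channel obtains a buffer, False if it must wait."""
        if self.exclusive[channel] > 0:
            self.exclusive[channel] -= 1   # exclusive buffer: guaranteed once
            return True
        if self.floating > 0:
            self.floating -= 1             # floating buffer: first come, first served
            return True
        # Must wait, but no deadlock: every other channel can still make
        # progress using its own exclusive buffer.
        return False

    def recycle(self, channel, was_exclusive):
        if was_exclusive:
            self.exclusive[channel] += 1
        else:
            self.floating += 1

pool = ReadBufferPool(total_buffers=4, num_channels=2)
assert pool.request(0)       # channel 0 takes its exclusive buffer
assert pool.request(0)       # channel 0 takes a floating buffer
assert pool.request(0)       # channel 0 takes the last floating buffer
assert pool.request(1)       # channel 1 still gets its exclusive buffer
assert not pool.request(1)   # further requests wait for a recycle
```

The point of the reservation is visible in the last two lines: even after one channel grabs every floating buffer, the other channel is never blocked from making progress.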
> > > > > > With the ability to seamlessly shift between local and
> > > > > > remote storage, Flink's Hybrid Shuffle will be more
> > > > > > versatile and scalable, making it an ideal choice for
> > > > > > organizations looking to build distributed data processing
> > > > > > applications with ease.
> > > > > > Besides, I have a small question about the segment size in
> > > > > > different storages. According to the FLIP, the segment size
> > > > > > may be fixed for each storage tier, but I think a fixed size
> > > > > > may affect shuffle performance. For example, a smaller
> > > > > > segment size will improve the utilization of the Memory
> > > > > > Storage Tier, but it may bring extra cost to the Disk
> > > > > > Storage Tier or Remote Storage Tier. Deciding the segment
> > > > > > size dynamically will be helpful.
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Wencong Liu
> > > > > >
> > > > > > At 2023-03-06 13:51:21, "Yuxin Tan" <tanyuxinw...@gmail.com> wrote:
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > I would like to start a discussion on FLIP-301: Hybrid
> > > > > > > Shuffle supports Remote Storage [1].
> > > > > > >
> > > > > > > In a cloud-native environment, it is difficult to
> > > > > > > determine the appropriate disk space for Batch shuffle,
> > > > > > > which affects job stability.
> > > > > > >
> > > > > > > This FLIP is to support remote storage for Hybrid Shuffle
> > > > > > > to improve Batch job stability in cloud-native
> > > > > > > environments.
> > > > > > >
> > > > > > > The goals of this FLIP are as follows.
> > > > > > > 1. By default, use local memory and disk to ensure high
> > > > > > > shuffle performance when the local storage space is
> > > > > > > sufficient.
> > > > > > > 2. When the local storage space is insufficient, use
> > > > > > > remote storage as a supplement to avoid large-scale Batch
> > > > > > > job failures.
> > > > > > >
> > > > > > > Looking forward to hearing from you.
> > > > > > >
> > > > > > > [1]
> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-301%3A+Hybrid+Shuffle+supports+Remote+Storage
> > > > > > >
> > > > > > > Best,
> > > > > > > Yuxin
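For readers skimming the thread: the tier behaviour discussed above (local memory and disk first for performance, remote storage only as a supplement, with an automatic switch back once local space frees up) boils down to a per-segment placement decision along these lines. The tier names and capacity checks below are made up for the sketch and are not the FLIP's actual interfaces.

```python
# Illustrative per-segment tier selection: try local tiers first, fall back to
# remote storage only when local space is exhausted. Because the decision is
# re-evaluated for every segment, writes switch back to local tiers as soon
# as space is freed. Names are hypothetical, not taken from the FLIP.

from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    capacity_segments: int
    used_segments: int = 0

    def has_space(self):
        return self.used_segments < self.capacity_segments

def write_segment(tiers):
    """Place one segment in the first tier with free space."""
    for tier in tiers:
        if tier.has_space():
            tier.used_segments += 1
            return tier.name
    raise RuntimeError("unreachable if the last tier is effectively unbounded")

memory = Tier("memory", capacity_segments=1)
disk = Tier("local-disk", capacity_segments=2)
remote = Tier("remote", capacity_segments=10**9)  # effectively unbounded

placements = [write_segment([memory, disk, remote]) for _ in range(4)]
assert placements == ["memory", "local-disk", "local-disk", "remote"]

# Freeing local disk space makes the next segment land locally again,
# matching the "automatically switch back" behaviour Junrui mentions.
disk.used_segments -= 1
assert write_segment([memory, disk, remote]) == "local-disk"
```

The remote tier acting as an effectively unbounded last resort is what delivers goal 2: the write path never fails for lack of local space, it only degrades to slower storage.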