Hi devs,
I’d like to start a discussion about adding support for caching
intermediate results in the DataStream API for batch processing.
As the DataStream API now supports batch execution mode, we see users
using the DataStream API to run batch jobs. Interactive programming is
an important use case
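The interactive-programming motivation can be made concrete with a small, Flink-free Python sketch of the intended semantics: an intermediate result is materialized the first time it is consumed and reused by later jobs instead of being recomputed. All names here (`CachedStream`, `collect`, `invalidate`) are illustrative only, not the FLIP's actual API.

```python
# Toy, Flink-free sketch of the caching idea under discussion: an
# "intermediate result" is computed once, materialized, and reused by
# later jobs instead of being recomputed. Names are illustrative only.

class CachedStream:
    """Wraps a computation; materializes it on first consumption."""

    def __init__(self, compute):
        self._compute = compute          # the upstream transformation
        self._materialized = None        # cached intermediate result
        self.compute_count = 0           # how often upstream actually ran

    def collect(self):
        if self._materialized is None:   # first job: run and cache
            self.compute_count += 1
            self._materialized = list(self._compute())
        return self._materialized        # later jobs: reuse the cache

    def invalidate(self):
        """Drop the cached result; the next collect() recomputes."""
        self._materialized = None


expensive = CachedStream(lambda: (x * x for x in range(5)))
first = expensive.collect()    # triggers the actual computation
second = expensive.collect()   # served from cache, no recomputation
```

In an interactive session the same cached handle would back many downstream jobs, which is where the savings come from.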
Hi Xuannan,
thanks for drafting this FLIP.
One immediate thought, from what I've seen for interactive data exploration
with Spark, most people tend to use the higher level APIs, that allow for
faster prototyping (Table API in Flink's case). Should the Table API also
be covered by this FLIP?
Best
Hi David,
Thanks for sharing your thoughts.
You are right that most people tend to use high-level APIs for
interactive data exploration. Actually, FLIP-36 [1] covers
the cache API at the Table/SQL level. As far as I
know, it has been accepted but hasn’t been implemented. At the time
when
Hi Xuannan,
Thanks for starting the discussion. This would certainly help a lot on both
efficiency and reproducibility in machine learning cases :)
I have a few questions as follows:
1. Can we support caching both the output and side outputs of a
SingleOutputStreamOperator (which I believe is a r
Hi Xuannan,
I found FLIP-188[1] that is aiming to introduce a built-in dynamic table
storage, which provides a unified changelog & table representation. Tables
stored there can be used in further ad-hoc queries. To my understanding,
it's quite like an implementation of caching in Table API, and th
One more question from my side, should we make sure this plays well with
the remote shuffle service [1] in case of TM failure?
[1] https://github.com/flink-extended/flink-remote-shuffle
D.
On Thu, Dec 30, 2021 at 11:59 AM Gen Luo wrote:
> Hi Xuannan,
>
> I found FLIP-188[1] that is aiming to i
Hi Zhipeng and Gen,
Thanks for joining the discussion.
For Zhipeng:
- Can we support side output
Caching the side output is indeed a valid use case. However, with the
current API, it is not straightforward to cache the side output. You
can apply an identity map function to the DataStream returne
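If I read the suggestion right (the sentence above is truncated), the workaround is to apply an identity map to the stream returned by `getSideOutput`, so that the result becomes a single-output operator that can be cached. A Flink-free toy model of that shape, with every class name invented for illustration:

```python
# Toy model: only a single-output operator is cacheable; a side output
# is made cacheable by routing it through an identity map first.

class DataStream:
    """No cache() here, mirroring that a side-output stream itself is
    not a cacheable SingleOutputStreamOperator."""

    def __init__(self, records):
        self.records = list(records)

    def map(self, fn):
        # mapping yields a single-output operator, which is cacheable
        return SingleOutputOperator(fn(r) for r in self.records)


class SingleOutputOperator(DataStream):
    def cache(self):
        return CachedResult(self.records)


class CachedResult:
    def __init__(self, records):
        self.records = list(records)


class TwoOutputOperator(SingleOutputOperator):
    """Stand-in for an operator with a main output and side outputs."""

    def __init__(self, main, sides):
        super().__init__(main)
        self._sides = {tag: DataStream(recs) for tag, recs in sides.items()}

    def get_side_output(self, tag):
        return self._sides[tag]


op = TwoOutputOperator(main=[1, 2, 3], sides={"late": [9, 8]})
side = op.get_side_output("late")             # plain stream: no cache()
cached_late = side.map(lambda x: x).cache()   # identity map makes it cacheable
```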
Hi David,
We have a description in the FLIP about the case of TM failure without
the remote shuffle service. Basically, since the partitions are stored
at the TM, a TM failure requires recomputing the intermediate result.
If a Flink job uses the remote shuffle service, the partitions are
stored a
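The failure semantics described here can be pictured with a toy model (no Flink involved; all class names are made up): a cached partition kept on the TaskManager dies with it and forces recomputation, while one kept on a remote shuffle service survives a TM failure.

```python
# Toy sketch of the failure semantics under discussion: cached
# partitions stored on a TaskManager are lost with it and must be
# recomputed, while partitions on a remote shuffle service survive
# TM failures. Illustrative names only.

class LocalTmStore:
    survives_tm_failure = False

class RemoteShuffleStore:
    survives_tm_failure = True

class CachedPartition:
    def __init__(self, store, compute):
        self.store = store
        self.compute = compute
        self.data = list(compute())     # materialized on first run
        self.recompute_count = 0

    def on_tm_failure(self):
        if not self.store.survives_tm_failure:
            self.data = None            # partition gone with the TM

    def read(self):
        if self.data is None:           # cache miss: recompute upstream
            self.recompute_count += 1
            self.data = list(self.compute())
        return self.data

local = CachedPartition(LocalTmStore(), lambda: range(3))
local.on_tm_failure()
local.read()                            # forced to recompute

remote = CachedPartition(RemoteShuffleStore(), lambda: range(3))
remote.on_tm_failure()
remote.read()                           # still served from the store
```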
Hi Xuannan,
Thanks for the reply.
Regarding whether and how to support caching side outputs, I agree that the
second option might be better if there does exist a use case where users
need to cache only certain side outputs.
Xuannan Su wrote on Tue, Jan 4, 2022 at 15:50:
> Hi Zhipeng and Gen,
>
> Thanks for
ort the cache ResultPartition, since
> in JobMasterPartitionTrackerImpl we do not yet support promoting a result
> partition via the pluggable ShuffleMaster. But we should be able to further
> complete this part.
>
> Best,
> Yun
>
>
> ---------------------
Hi Xuannan,
Thanks for the reply.
I do agree that dynamic tables and cached partitions are similar features
aiming at different cases. In my opinion, the main difference between the
implementations is whether to cache only the data or the whole result partition.
To cache only the data, we can translate the Cac
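As I understand the "cache only the data" option (the sentence above is truncated), the cache node in the plan would be rewritten to a sink on the first execution and to a source once the data exists. A purely hypothetical sketch of that plan rewriting, with invented node names:

```python
# Toy sketch of the plan-rewriting idea: on the first execution a cache
# node is replaced by a sink that writes the data out; on later
# executions it is replaced by a source that reads the data back.

def rewrite_plan(plan, cache_store):
    rewritten = []
    for node in plan:
        if node[0] != "cache":
            rewritten.append(node)              # ordinary node: keep
        elif node[1] in cache_store:            # cache hit: read it back
            rewritten.append(("source", node[1]))
        else:                                   # first run: write it out
            rewritten.append(("sink", node[1]))
    return rewritten

store = {}
plan = [("map", "f"), ("cache", "c1"), ("keyBy", "k")]
first_run = rewrite_plan(plan, store)    # cache node becomes a sink
store["c1"] = [1, 2, 3]                  # pretend the job ran and wrote
second_run = rewrite_plan(plan, store)   # cache node becomes a source
```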
Hi Gen,
Thanks for your feedback.
I think you are talking about how we are going to store the caching
data. The first option is to write the data with a sink to an external
file system, much like the file store of the Dynamic Table. If I
understand correctly, it requires a distributed file system
Hi Xuannan,
I think this already looks really good. The whole discussion is pretty
long, so I'll just summarize my current understanding of the outcome:
- This only aims at the DataStream API for now, but can be used as a
building block for the higher level abstractions (Table API).
- We're p
Hi David,
Thanks for pointing out the FLIP-187. After reading the FLIP, I think it
can solve the problem of choosing the proper parallelism, and thus it
should be fine to not provide the method to set the parallelism of the
cache.
And your understanding of the outcome of this FLIP is correct.
If the
ver strategy (like attempting for N
> times)
> when we find the cached result partition is missing?
>
> Best,
> Yun
>
>
>
cached result partition is missing,
> > would
> > this happen in the client side or in the scheduler? If this happens in the
> > client
> > side, we need to bypass the given failover strategy (like attempting for N
> > times)
> > when we found the cac
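One way to picture the client-side handling Yun is asking about (purely hypothetical, not something the FLIP specifies): if consuming a cached result partition fails because the partition is gone, retrying the same read N times cannot help, so the client bypasses the regular failover strategy and resubmits once with the cache falling back to full recomputation.

```python
# Toy sketch of a client-side fallback: on a missing cached result
# partition, skip the N-attempt failover and resubmit once without the
# cache. All names are invented for illustration.

class PartitionMissingError(Exception):
    pass

def submit(job, use_cache, attempts_log):
    attempts_log.append("cached" if use_cache else "recompute")
    if use_cache and job["partition_lost"]:
        raise PartitionMissingError()   # cached partition is gone
    return job["result"]

def run_with_cache_fallback(job):
    attempts = []
    try:
        return submit(job, use_cache=True, attempts_log=attempts), attempts
    except PartitionMissingError:
        # bypass the regular N-attempt failover: one direct resubmission
        # that recomputes the intermediate result from scratch
        return submit(job, use_cache=False, attempts_log=attempts), attempts

result, attempts = run_with_cache_fallback(
    {"partition_lost": True, "result": 42})
```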
Hi,
Thanks Xuannan for the clarification, I also have no other issues~
Best,
Yun
--Original Mail --
Sender:Xuannan Su
Send Date:Wed Jan 19 11:35:13 2022
Recipients:Flink Dev
Subject:Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing
Hi