[DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2021-12-29 Thread Xuannan Su
Hi devs, I’d like to start a discussion about adding support to cache the intermediate result at DataStream API for batch processing. As the DataStream API now supports batch execution mode, we see users using the DataStream API to run batch jobs. Interactive programming is an important use case

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2021-12-29 Thread David Morávek
Hi Xuannan, thanks for drafting this FLIP. One immediate thought, from what I've seen for interactive data exploration with Spark, most people tend to use the higher level APIs, that allow for faster prototyping (Table API in Flink's case). Should the Table API also be covered by this FLIP? Best

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2021-12-29 Thread Xuannan Su
Hi David, Thanks for sharing your thoughts. You are right that most people tend to use high-level API for interactive data exploration. Actually, there is the FLIP-36 [1] covering the cache API at Table/SQL API. As far as I know, it has been accepted but hasn’t been implemented. At the time when

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2021-12-30 Thread Zhipeng Zhang
Hi Xuannan, Thanks for starting the discussion. This would certainly help a lot on both efficiency and reproducibility in machine learning cases :) I have a few questions as follows: 1. Can we support caching both the output and sideoutputs of a SingleOutputStreamOperator (which I believe is a r

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2021-12-30 Thread Gen Luo
Hi Xuannan, I found FLIP-188[1] that is aiming to introduce a built-in dynamic table storage, which provides a unified changelog & table representation. Tables stored there can be used in further ad-hoc queries. To my understanding, it's quite like an implementation of caching in Table API, and th

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-03 Thread David Morávek
One more question from my side, should we make sure this plays well with the remote shuffle service [1] in case of TM failure? [1] https://github.com/flink-extended/flink-remote-shuffle D. On Thu, Dec 30, 2021 at 11:59 AM Gen Luo wrote: > Hi Xuannan, > > I found FLIP-188[1] that is aiming to i

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-03 Thread Xuannan Su
Hi Zhipeng and Gen, Thanks for joining the discussion. For Zhipeng: - Can we support side output Caching the side output is indeed a valid use case. However, with the current API, it is not straightforward to cache the side output. You can apply an identity map function to the DataStream returne

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-04 Thread Xuannan Su
Hi David, We have a description in the FLIP about the case of TM failure without the remote shuffle service. Basically, since the partitions are stored at the TM, a TM failure requires recomputing the intermediate result. If a Flink job uses the remote shuffle service, the partitions are stored a

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-05 Thread Yun Gao
-- From:Xuannan Su Send Time:2022 Jan. 5 (Wed.) 14:04 To:dev Subject:Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing Hi David, We have a description in the FLIP about the case of TM failure without the remote shuffle service. Basically, since the partitions are

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-05 Thread Zhipeng Zhang
Hi Xuannnan, Thanks for the reply. Regarding whether and how to support cache sideoutput, I agree that the second option might be better if there do exist a use case that users need to cache only some certain side outputs. Xuannan Su 于2022年1月4日周二 15:50写道: > Hi Zhipeng and Gen, > > Thanks for

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-05 Thread Xuannan Su
ort the cache ResultPartition, since > in JobMasterPartitionTrackerImpl we have not support prompt a result > partition via pluggable ShuffleMaster yet. But we should be able to further > complete this part. > > Best, > Yun > > > ---------------------

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-06 Thread Gen Luo
Hi Xuannan, Thanks for the reply. I do agree that dynamic tables and cached partitions are similar features aiming different cases. In my opinion, the main difference of the implementations is to cache only the data or the whole result partition. To cache only the data, we can translate the Cac

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-10 Thread Xuannan Su
Hi Gen, Thanks for your feedback. I think you are talking about how we are going to store the caching data. The first option is to write the data with a sink to an external file system, much like the file store of the Dynamic Table. If I understand correctly, it requires a distributed file system

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-14 Thread David Morávek
Hi Xuannan, I think this already looks really good. The whole discussions is pretty long, so I'll just to summarize my current understanding of the outcome: - This only aims on the DataStream API for now, but can be used as a building block for the higher level abstractions (Table API). - We're p

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-16 Thread Xuannan Su
Hi David, Thanks for pointing out the FLIP-187. After reading the FLIP, I think it can solve the problem of choosing the proper parallelism, and thus it should be fine to not provide the method to set the parallelism of the cache. And you understand of the outcome of this FLIP is correct. If the

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-16 Thread Yun Gao
-- From:Xuannan Su Send Time:2022 Jan. 17 (Mon.) 13:00 To:dev Subject:Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing Hi David, Thanks for pointing out the FLIP-187. After reading the FLIP, I think it can solve the

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-17 Thread David Morávek
-------- > From:Xuannan Su > Send Time:2022 Jan. 17 (Mon.) 13:00 > To:dev > Subject:Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch > Processing > > Hi David, > > Thanks for pointing out the FLIP-187. After reading the FLIP, I think it > can solve

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-18 Thread Xuannan Su
ver strategy (like attempting for N > times) > when we found the cache result partition is missed? > > Best, > Yun > > > > -------------- > From:Xuannan Su > Send Time:2022 Jan. 17 (Mon.) 13:00 > To:dev > Subject:

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-18 Thread Xuannan Su
cached result partition is missing, > > would > > this happen in the client side or in the scheduler? If this happens in the > > client > > side, we need to bypass the give failover strategy (like attempting for N > > times) > > when we found the cac

Re: Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-19 Thread Yun Gao
Hi, Thanks Xuannan for the clarification, I also have no other issues~ Best, Yun --Original Mail -- Sender:Xuannan Su Send Date:Wed Jan 19 11:35:13 2022 Recipients:Flink Dev Subject:Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing Hi