Hi Piotrek, For the cache() method itself, yes, it is equivalent to creating a BUILT-IN materialized view with a lifecycle. That functionality is missing today, though. Not sure if I understand your question. Do you mean we already have the functionality and just need a syntax sugar?
What's more interesting in the proposal is do we want to stop at creating the materialized view? Or do we want to extend that in the future to a more useful unified data store distributed with Flink? And do we want to have a mechanism allow more flexible user job pattern with their own user defined services. These considerations are much more architectural. Thanks, Jiangjie (Becket) Qin On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <pi...@data-artisans.com> wrote: > Hi, > > Interesting idea. I’m trying to understand the problem. Isn’t the > `cache()` call an equivalent of writing data to a sink and later reading > from it? Where this sink has a limited live scope/live time? And the sink > could be implemented as in memory or a file sink? > > If so, what’s the problem with creating a materialised view from a table > “b” (from your document’s example) and reusing this materialised view > later? Maybe we are lacking mechanisms to clean up materialised views (for > example when current session finishes)? Maybe we need some syntactic sugar > on top of it? > > Piotrek > > > On 23 Nov 2018, at 07:21, Becket Qin <becket....@gmail.com> wrote: > > > > Thanks for the suggestion, Jincheng. > > > > Yes, I think it makes sense to have a persist() with lifecycle/defined > > scope. I just added a section in the future work for this. > > > > Thanks, > > > > Jiangjie (Becket) Qin > > > > On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <sunjincheng...@gmail.com> > > wrote: > > > >> Hi Jiangjie, > >> > >> Thank you for the explanation about the name of `cache()`, I understand > why > >> you designed this way! > >> > >> Another idea is whether we can specify a lifecycle for data persistence? > >> For example, persist (LifeCycle.SESSION), so that the user is not > worried > >> about data loss, and will clearly specify the time range for keeping > time. > >> At the same time, if we want to expand, we can also share in a certain > >> group of session, for example: LifeCycle.SESSION_GROUP(...), I am not > sure, > >> just an immature suggestion, for reference only! > >> > >> Bests, > >> Jincheng > >> > >> Becket Qin <becket....@gmail.com> 于2018年11月23日周五 下午1:33写道: > >> > >>> Re: Jincheng, > >>> > >>> Thanks for the feedback. Regarding cache() v.s. persist(), personally I > >>> find cache() to be more accurately describing the behavior, i.e. the > >> Table > >>> is cached for the session, but will be deleted after the session is > >> closed. > >>> persist() seems a little misleading as people might think the table > will > >>> still be there even after the session is gone. > >>> > >>> Great point about mixing the batch and stream processing in the same > job. > >>> We should absolutely move towards that goal. I imagine that would be a > >> huge > >>> change across the board, including sources, operators and > optimizations, > >> to > >>> name some. Likely we will need several separate in-depth discussions. > >>> > >>> Thanks, > >>> > >>> Jiangjie (Becket) Qin > >>> > >>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <xingc...@gmail.com> > wrote: > >>> > >>>> Hi all, > >>>> > >>>> @Shaoxuan, I think the lifecycle or access domain are both orthogonal > >> to > >>>> the cache problem. Essentially, this may be the first time we plan to > >>>> introduce another storage mechanism other than the state. Maybe it’s > >>> better > >>>> to first draw a big picture and then concentrate on a specific part? > >>>> > >>>> @Becket, yes, actually I am more concerned with the underlying > service. > >>>> This seems to be quite a major change to the existing codebase. As you > >>>> claimed, the service should be extendible to support other components > >> and > >>>> we’d better discussed it in another thread. > >>>> > >>>> All in all, I also eager to enjoy the more interactive Table API, in > >> case > >>>> of a general and flexible enough service mechanism. > >>>> > >>>> Best, > >>>> Xingcan > >>>> > >>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <xiaow...@gmail.com> > >>> wrote: > >>>>> > >>>>> Relying on a callback for the temp table for clean up is not very > >>>> reliable. > >>>>> There is no guarantee that it will be executed successfully. We may > >>> risk > >>>>> leaks when that happens. I think that it's safer to have an > >> association > >>>>> between temp table and session id. So we can always clean up temp > >>> tables > >>>>> which are no longer associated with any active sessions. > >>>>> > >>>>> Regards, > >>>>> Xiaowei > >>>>> > >>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun < > >>> sunjincheng...@gmail.com> > >>>>> wrote: > >>>>> > >>>>>> Hi Jiangjie&Shaoxuan, > >>>>>> > >>>>>> Thanks for initiating this great proposal! > >>>>>> > >>>>>> Interactive Programming is very useful and user friendly in case of > >>> your > >>>>>> examples. > >>>>>> Moreover, especially when a business has to be executed in several > >>>> stages > >>>>>> with dependencies,such as the pipeline of Flink ML, in order to > >>> utilize > >>>> the > >>>>>> intermediate calculation results we have to submit a job by > >>>> env.execute(). > >>>>>> > >>>>>> About the `cache()` , I think is better to named `persist()`, And > >> The > >>>>>> Flink framework determines whether we internally cache in memory or > >>>> persist > >>>>>> to the storage system,Maybe save the data into state backend > >>>>>> (MemoryStateBackend or RocksDBStateBackend etc.) > >>>>>> > >>>>>> BTW, from the points of my view in the future, support for streaming > >>> and > >>>>>> batch mode switching in the same job will also benefit in > >> "Interactive > >>>>>> Programming", I am looking forward to your JIRAs and FLIP! > >>>>>> > >>>>>> Best, > >>>>>> Jincheng > >>>>>> > >>>>>> > >>>>>> Becket Qin <becket....@gmail.com> 于2018年11月20日周二 下午9:56写道: > >>>>>> > >>>>>>> Hi all, > >>>>>>> > >>>>>>> As a few recent email threads have pointed out, it is a promising > >>>>>>> opportunity to enhance Flink Table API in various aspects, > >> including > >>>>>>> functionality and ease of use among others. One of the scenarios > >>> where > >>>> we > >>>>>>> feel Flink could improve is interactive programming. To explain the > >>>>>> issues > >>>>>>> and facilitate the discussion on the solution, we put together the > >>>>>>> following document with our proposal. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>> > >>>> > >>> > >> > https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing > >>>>>>> > >>>>>>> Feedback and comments are very welcome! > >>>>>>> > >>>>>>> Thanks, > >>>>>>> > >>>>>>> Jiangjie (Becket) Qin > >>>>>>> > >>>>>> > >>>> > >>>> > >>> > >> > >