Re: [DISCUSS] Support Interactive Programming in Flink Table API

Becket Qin Wed, 28 Nov 2018 21:21:59 -0800

Just to add a little bit, the materialized view is probably more similar to
the persistent() brought up earlier in the thread. So it is usually cross
session and could be used in a larger scope. For example, a materialized
view created by user A may be visible to user B. It is probably something
we want to have in the future. I'll put it in the future work section.


Thanks,

Jiangjie (Becket) Qin

On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <[email protected]> wrote:

> Hi Piotrek,
>
> Thanks for the explanation.
>
> Right now we are mostly thinking of the cached table as immutable. I can
> see the Materialized view would be useful in the future. That said, I think
> a simple cache mechanism is probably still needed. So to me, cache() and
> materialize() should be two separate method as they address different
> needs. Materialize() is a higher level concept usually implying periodical
> update, while cache() has much simpler semantic. For example, one may
> create a materialized view and use cache() method in the materialized view
> creation logic. So that during the materialized view update, they do not
> need to worry about the case that the cached table is also changed. Maybe
> under the hood, materialized() and cache() could share some mechanism, but
> I think a simple cache() method would be handy in a lot of cases.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <[email protected]>
> wrote:
>
>> Hi Becket,
>>
>> > Is there any extra thing user can do on a MaterializedTable that they
>> cannot do on a Table?
>>
>> Maybe not in the initial implementation, but various DBs offer different
>> ways to “refresh” the materialised view. Hooks, triggers, timers, manually
>> etc. Having `MaterializedTable` would help us to handle that in the future.
>>
>> > After users call *table.cache(), *users can just use that table and do
>> anything that is supported on a Table, including SQL.
>>
>> This is some implicit behaviour with side effects. Imagine if user has a
>> long and complicated program, that touches table `b` multiple times, maybe
>> scattered around different methods. If he modifies his program by inserting
>> in one place
>>
>> b.cache()
>>
>> This implicitly alters the semantic and behaviour of his code all over
>> the place, maybe in a ways that might cause problems. For example what if
>> underlying data is changing?
>>
>> Having invisible side effects is also not very clean, for example think
>> about something like this (but more complicated):
>>
>> Table b = ...;
>>
>> If (some_condition) {
>>   processTable1(b)
>> }
>> else {
>>   processTable2(b)
>> }
>>
>> // do more stuff with b
>>
>> And user adds `b.cache()` call to only one of the `processTable1` or
>> `processTable2` methods.
>>
>> On the other hand
>>
>> Table materialisedB = b.materialize()
>>
>> Avoids (at least some of) the side effect issues and forces user to
>> explicitly use `materialisedB` where it’s appropriate and forces user to
>> think what does it actually mean. And if something doesn’t work in the end
>> for the user, he will know what has he changed instead of blaming Flink for
>> some “magic” underneath. In the above example, after materialising b in
>> only one of the methods, he should/would realise about the issue when
>> handling the return value `MaterializedTable` of that method.
>>
>> I guess it comes down to personal preferences if you like things to be
>> implicit or not. The more power is the user, probably the more likely he is
>> to like/understand implicit behaviour. And we as Table API designers are
>> the most power users out there, so I would proceed with caution (so that we
>> do not end up in the crazy perl realm with it’s lovely implicit method
>> arguments ;)  <https://stackoverflow.com/a/14922656/8149051>)
>>
>> > Table API to also support non-relational processing cases, cache()
>> might be slightly better.
>>
>> I think even such extended Table API could benefit from sticking to/being
>> consistent with SQL where both SQL and Table API are basically the same.
>>
>> One more thing. `MaterializedTable materialize()` could be more
>> powerful/flexible allowing the user to operate both on materialised and not
>> materialised view at the same time for whatever reasons (underlying data
>> changing/better optimisation opportunities after pushing down more filters
>> etc). For example:
>>
>> Table b = …;
>>
>> MaterlizedTable mb = b.materialize();
>>
>> Val min = mb.min();
>> Val max = mb.max();
>>
>> Val user42 = b.filter(‘userId = 42);
>>
>> Could be more efficient compared to `b.cache()` if `filter(‘userId =
>> 42);` allows for much more aggressive optimisations.
>>
>> Piotrek
>>
>> > On 26 Nov 2018, at 12:14, Fabian Hueske <[email protected]> wrote:
>> >
>> > I'm not suggesting to add support for Ignite. This was just an example.
>> > Plasma and Arrow sound interesting, too.
>> > For the sake of this proposal, it would be up to the user to implement a
>> > TableFactory and corresponding TableSource / TableSink classes to
>> persist
>> > and read the data.
>> >
>> > Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb Flavio Pompermaier <
>> > [email protected]>:
>> >
>> >> What about to add also Apache Plasma + Arrow as an alternative to
>> Apache
>> >> Ignite?
>> >> [1]
>> >>
>> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
>> >>
>> >> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <[email protected]>
>> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> Thanks for the proposal!
>> >>>
>> >>> To summarize, you propose a new method Table.cache(): Table that will
>> >>> trigger a job and write the result into some temporary storage as
>> defined
>> >>> by a TableFactory.
>> >>> The cache() call blocks while the job is running and eventually
>> returns a
>> >>> Table object that represents a scan of the temporary table.
>> >>> When the "session" is closed (closing to be defined?), the temporary
>> >> tables
>> >>> are all dropped.
>> >>>
>> >>> I think this behavior makes sense and is a good first step towards
>> more
>> >>> interactive workloads.
>> >>> However, its performance suffers from writing to and reading from
>> >> external
>> >>> systems.
>> >>> I think this is OK for now. Changes that would significantly improve
>> the
>> >>> situation (i.e., pinning data in-memory across jobs) would have large
>> >>> impacts on many components of Flink.
>> >>> Users could use in-memory filesystems or storage grids (Apache
>> Ignite) to
>> >>> mitigate some of the performance effects.
>> >>>
>> >>> Best, Fabian
>> >>>
>> >>>
>> >>>
>> >>> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb Becket Qin <
>> >>> [email protected]
>> >>>> :
>> >>>
>> >>>> Thanks for the explanation, Piotrek.
>> >>>>
>> >>>> Is there any extra thing user can do on a MaterializedTable that they
>> >>>> cannot do on a Table? After users call *table.cache(), *users can
>> just
>> >>> use
>> >>>> that table and do anything that is supported on a Table, including
>> SQL.
>> >>>>
>> >>>> Naming wise, either cache() or materialize() sounds fine to me.
>> cache()
>> >>> is
>> >>>> a bit more general than materialize(). Given that we are enhancing
>> the
>> >>>> Table API to also support non-relational processing cases, cache()
>> >> might
>> >>> be
>> >>>> slightly better.
>> >>>>
>> >>>> Thanks,
>> >>>>
>> >>>> Jiangjie (Becket) Qin
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <
>> >> [email protected]
>> >>>>
>> >>>> wrote:
>> >>>>
>> >>>>> Hi Becket,
>> >>>>>
>> >>>>> Ops, sorry I didn’t notice that you intend to reuse existing
>> >>>>> `TableFactory`. I don’t know why, but I assumed that you want to
>> >>> provide
>> >>>> an
>> >>>>> alternate way of writing the data.
>> >>>>>
>> >>>>> Now that I hopefully understand the proposal, maybe we could rename
>> >>>>> `cache()` to
>> >>>>>
>> >>>>> void materialize()
>> >>>>>
>> >>>>> or going step further
>> >>>>>
>> >>>>> MaterializedTable materialize()
>> >>>>> MaterializedTable createMaterializedView()
>> >>>>>
>> >>>>> ?
>> >>>>>
>> >>>>> The second option with returning a handle I think is more flexible
>> >> and
>> >>>>> could provide features such as “refresh”/“delete” or generally
>> >> speaking
>> >>>>> manage the the view. In the future we could also think about adding
>> >>> hooks
>> >>>>> to automatically refresh view etc. It is also more explicit -
>> >>>>> materialization returning a new table handle will not have the same
>> >>>>> implicit side effects as adding a simple line of code like
>> >> `b.cache()`
>> >>>>> would have.
>> >>>>>
>> >>>>> It would also be more SQL like, making it more intuitive for users
>> >>>> already
>> >>>>> familiar with the SQL.
>> >>>>>
>> >>>>> Piotrek
>> >>>>>
>> >>>>>> On 23 Nov 2018, at 14:53, Becket Qin <[email protected]> wrote:
>> >>>>>>
>> >>>>>> Hi Piotrek,
>> >>>>>>
>> >>>>>> For the cache() method itself, yes, it is equivalent to creating a
>> >>>>> BUILT-IN
>> >>>>>> materialized view with a lifecycle. That functionality is missing
>> >>>> today,
>> >>>>>> though. Not sure if I understand your question. Do you mean we
>> >>> already
>> >>>>> have
>> >>>>>> the functionality and just need a syntax sugar?
>> >>>>>>
>> >>>>>> What's more interesting in the proposal is do we want to stop at
>> >>>> creating
>> >>>>>> the materialized view? Or do we want to extend that in the future
>> >> to
>> >>> a
>> >>>>> more
>> >>>>>> useful unified data store distributed with Flink? And do we want to
>> >>>> have
>> >>>>> a
>> >>>>>> mechanism allow more flexible user job pattern with their own user
>> >>>>> defined
>> >>>>>> services. These considerations are much more architectural.
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>>
>> >>>>>> Jiangjie (Becket) Qin
>> >>>>>>
>> >>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <
>> >>>> [email protected]>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> Hi,
>> >>>>>>>
>> >>>>>>> Interesting idea. I’m trying to understand the problem. Isn’t the
>> >>>>>>> `cache()` call an equivalent of writing data to a sink and later
>> >>>> reading
>> >>>>>>> from it? Where this sink has a limited live scope/live time? And
>> >> the
>> >>>>> sink
>> >>>>>>> could be implemented as in memory or a file sink?
>> >>>>>>>
>> >>>>>>> If so, what’s the problem with creating a materialised view from a
>> >>>> table
>> >>>>>>> “b” (from your document’s example) and reusing this materialised
>> >>> view
>> >>>>>>> later? Maybe we are lacking mechanisms to clean up materialised
>> >>> views
>> >>>>> (for
>> >>>>>>> example when current session finishes)? Maybe we need some
>> >> syntactic
>> >>>>> sugar
>> >>>>>>> on top of it?
>> >>>>>>>
>> >>>>>>> Piotrek
>> >>>>>>>
>> >>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <[email protected]>
>> >> wrote:
>> >>>>>>>>
>> >>>>>>>> Thanks for the suggestion, Jincheng.
>> >>>>>>>>
>> >>>>>>>> Yes, I think it makes sense to have a persist() with
>> >>>> lifecycle/defined
>> >>>>>>>> scope. I just added a section in the future work for this.
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>>
>> >>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>
>> >>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <
>> >>>> [email protected]
>> >>>>>>
>> >>>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>> Hi Jiangjie,
>> >>>>>>>>>
>> >>>>>>>>> Thank you for the explanation about the name of `cache()`, I
>> >>>>> understand
>> >>>>>>> why
>> >>>>>>>>> you designed this way!
>> >>>>>>>>>
>> >>>>>>>>> Another idea is whether we can specify a lifecycle for data
>> >>>>> persistence?
>> >>>>>>>>> For example, persist (LifeCycle.SESSION), so that the user is
>> >> not
>> >>>>>>> worried
>> >>>>>>>>> about data loss, and will clearly specify the time range for
>> >>> keeping
>> >>>>>>> time.
>> >>>>>>>>> At the same time, if we want to expand, we can also share in a
>> >>>> certain
>> >>>>>>>>> group of session, for example: LifeCycle.SESSION_GROUP(...), I
>> >> am
>> >>>> not
>> >>>>>>> sure,
>> >>>>>>>>> just an immature suggestion, for reference only!
>> >>>>>>>>>
>> >>>>>>>>> Bests,
>> >>>>>>>>> Jincheng
>> >>>>>>>>>
>> >>>>>>>>> Becket Qin <[email protected]> 于2018年11月23日周五 下午1:33写道：
>> >>>>>>>>>
>> >>>>>>>>>> Re: Jincheng,
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks for the feedback. Regarding cache() v.s. persist(),
>> >>>>> personally I
>> >>>>>>>>>> find cache() to be more accurately describing the behavior,
>> >> i.e.
>> >>>> the
>> >>>>>>>>> Table
>> >>>>>>>>>> is cached for the session, but will be deleted after the
>> >> session
>> >>> is
>> >>>>>>>>> closed.
>> >>>>>>>>>> persist() seems a little misleading as people might think the
>> >>> table
>> >>>>>>> will
>> >>>>>>>>>> still be there even after the session is gone.
>> >>>>>>>>>>
>> >>>>>>>>>> Great point about mixing the batch and stream processing in the
>> >>>> same
>> >>>>>>> job.
>> >>>>>>>>>> We should absolutely move towards that goal. I imagine that
>> >> would
>> >>>> be
>> >>>>> a
>> >>>>>>>>> huge
>> >>>>>>>>>> change across the board, including sources, operators and
>> >>>>>>> optimizations,
>> >>>>>>>>> to
>> >>>>>>>>>> name some. Likely we will need several separate in-depth
>> >>>> discussions.
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks,
>> >>>>>>>>>>
>> >>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>
>> >>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <
>> >> [email protected]>
>> >>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> Hi all,
>> >>>>>>>>>>>
>> >>>>>>>>>>> @Shaoxuan, I think the lifecycle or access domain are both
>> >>>>> orthogonal
>> >>>>>>>>> to
>> >>>>>>>>>>> the cache problem. Essentially, this may be the first time we
>> >>> plan
>> >>>>> to
>> >>>>>>>>>>> introduce another storage mechanism other than the state.
>> >> Maybe
>> >>>> it’s
>> >>>>>>>>>> better
>> >>>>>>>>>>> to first draw a big picture and then concentrate on a specific
>> >>>> part?
>> >>>>>>>>>>>
>> >>>>>>>>>>> @Becket, yes, actually I am more concerned with the underlying
>> >>>>>>> service.
>> >>>>>>>>>>> This seems to be quite a major change to the existing
>> >> codebase.
>> >>> As
>> >>>>> you
>> >>>>>>>>>>> claimed, the service should be extendible to support other
>> >>>>> components
>> >>>>>>>>> and
>> >>>>>>>>>>> we’d better discussed it in another thread.
>> >>>>>>>>>>>
>> >>>>>>>>>>> All in all, I also eager to enjoy the more interactive Table
>> >>> API,
>> >>>> in
>> >>>>>>>>> case
>> >>>>>>>>>>> of a general and flexible enough service mechanism.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best,
>> >>>>>>>>>>> Xingcan
>> >>>>>>>>>>>
>> >>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <
>> >>> [email protected]>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Relying on a callback for the temp table for clean up is not
>> >>> very
>> >>>>>>>>>>> reliable.
>> >>>>>>>>>>>> There is no guarantee that it will be executed successfully.
>> >> We
>> >>>> may
>> >>>>>>>>>> risk
>> >>>>>>>>>>>> leaks when that happens. I think that it's safer to have an
>> >>>>>>>>> association
>> >>>>>>>>>>>> between temp table and session id. So we can always clean up
>> >>> temp
>> >>>>>>>>>> tables
>> >>>>>>>>>>>> which are no longer associated with any active sessions.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Regards,
>> >>>>>>>>>>>> Xiaowei
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun <
>> >>>>>>>>>> [email protected]>
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Thanks for initiating this great proposal！
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Interactive Programming is very useful and user friendly in
>> >>> case
>> >>>>> of
>> >>>>>>>>>> your
>> >>>>>>>>>>>>> examples.
>> >>>>>>>>>>>>> Moreover， especially when a business has to be executed in
>> >>>> several
>> >>>>>>>>>>> stages
>> >>>>>>>>>>>>> with dependencies，such as the pipeline of Flink ML, in order
>> >>> to
>> >>>>>>>>>> utilize
>> >>>>>>>>>>> the
>> >>>>>>>>>>>>> intermediate calculation results we have to submit a job by
>> >>>>>>>>>>> env.execute().
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> About the `cache()`  , I think is better to named
>> >> `persist()`,
>> >>>> And
>> >>>>>>>>> The
>> >>>>>>>>>>>>> Flink framework determines whether we internally cache in
>> >>> memory
>> >>>>> or
>> >>>>>>>>>>> persist
>> >>>>>>>>>>>>> to the storage system，Maybe save the data into state backend
>> >>>>>>>>>>>>> (MemoryStateBackend or RocksDBStateBackend etc.)
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> BTW, from the points of my view in the future, support for
>> >>>>> streaming
>> >>>>>>>>>> and
>> >>>>>>>>>>>>> batch mode switching in the same job will also benefit in
>> >>>>>>>>> "Interactive
>> >>>>>>>>>>>>> Programming",  I am looking forward to your JIRAs and FLIP!
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>> Jincheng
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Becket Qin <[email protected]> 于2018年11月20日周二 下午9:56写道：
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Hi all,
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> As a few recent email threads have pointed out, it is a
>> >>>> promising
>> >>>>>>>>>>>>>> opportunity to enhance Flink Table API in various aspects,
>> >>>>>>>>> including
>> >>>>>>>>>>>>>> functionality and ease of use among others. One of the
>> >>>> scenarios
>> >>>>>>>>>> where
>> >>>>>>>>>>> we
>> >>>>>>>>>>>>>> feel Flink could improve is interactive programming. To
>> >>> explain
>> >>>>> the
>> >>>>>>>>>>>>> issues
>> >>>>>>>>>>>>>> and facilitate the discussion on the solution, we put
>> >>> together
>> >>>>> the
>> >>>>>>>>>>>>>> following document with our proposal.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Feedback and comments are very welcome!
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Jiangjie (Becket) Qin
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>>
>>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Reply via email to