Hi Jincheng,

Thanks for this interesting proposal.
I like that we can push this effort forward in a very fine-grained manner,
i.e., incrementally adding more APIs to the Table API.

However, I also have a few questions / concerns.
Today, the Table API is tightly integrated with the DataSet and DataStream
APIs. It is very easy to convert a Table into a DataSet or DataStream and
vice versa. This means it is already easy to combine custom logic and
relational operations. What I like is that several aspects, such as
retraction and timestamp handling (see below), are clearly separated, and
all libraries on DataStream/DataSet can easily be combined with relational
operations.
I can see that adding more functionality to the Table API would remove the
distinction between DataSet and DataStream. However, wouldn't we get a
similar benefit by extending the DataStream API with proper support for
bounded streams (as is the long-term goal of Flink)?
I'm also a bit skeptical about the optimization opportunities we would
gain. Map/FlatMap UDFs are black boxes that cannot be easily removed
without additional information (I did some research on this a few years ago
[1]).

Moreover, I think there are a few tricky details that need to be resolved
to enable a good integration.

1) How to deal with retraction messages? The DataStream API does not have a
notion of retractions. How would a MapFunction or FlatMapFunction handle
retraction? Do they need to be aware of the change flag? Custom windowing
and aggregation logic would certainly need to have that information.
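To make this concrete, here is a toy sketch in plain Java. It is not the Flink API; the (isAccumulate, value) pairs are a hypothetical stand-in for the change flag that a retract stream would carry. A user function that ignores the flag silently over-counts:

```java
import java.util.List;
import java.util.Map;

// Toy model, not Flink API: a retract stream delivers (isAccumulate, value)
// pairs. A user-defined aggregation must honor the flag; if black-box user
// code treats every message as an insert, its result silently drifts.
public class RetractDemo {

    // Flag-aware count: +1 for accumulate messages, -1 for retractions.
    static long netCount(List<Map.Entry<Boolean, String>> changes, String key) {
        long count = 0;
        for (Map.Entry<Boolean, String> c : changes) {
            if (c.getValue().equals(key)) {
                count += c.getKey() ? 1 : -1;
            }
        }
        return count;
    }

    // Flag-blind count: what naive user code would compute.
    static long naiveCount(List<Map.Entry<Boolean, String>> changes, String key) {
        long count = 0;
        for (Map.Entry<Boolean, String> c : changes) {
            if (c.getValue().equals(key)) {
                count += 1; // a retraction mistaken for another insert
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // An updated result arrives as retract-old + insert-new.
        List<Map.Entry<Boolean, String>> changes = List.of(
                Map.entry(true, "a"),   // insert "a"
                Map.entry(false, "a"),  // retract "a"
                Map.entry(true, "a"));  // insert updated "a"
        System.out.println(netCount(changes, "a"));   // 1
        System.out.println(naiveCount(changes, "a")); // 3
    }
}
```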
2) How to deal with timestamps? The DataStream API does not give access to
timestamps. In the Table API / SQL these are exposed as regular attributes.
How can we ensure that timestamp attributes remain valid (i.e. aligned with
watermarks) if the output is produced by arbitrary code?
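Again a toy sketch in plain Java, not the Flink API: a watermark W is a promise that all records with timestamp <= W have arrived, so arbitrary user code that rewrites a rowtime attribute can break that promise for downstream windows.

```java
// Toy model, not Flink API: a watermark W promises that all records with
// timestamp <= W have been seen. If a black-box UDF rewrites the rowtime
// attribute, the output may violate that promise for downstream operators.
public class RowtimeDemo {

    // True if the (possibly rewritten) rowtime is still covered by the
    // watermark that downstream operators have already observed.
    static boolean coveredByWatermark(long rowtime, long udfShiftMillis, long watermark) {
        long newRowtime = rowtime + udfShiftMillis; // arbitrary user code
        return newRowtime <= watermark;
    }

    public static void main(String[] args) {
        long watermark = 10_000L;
        // Untouched timestamp: still consistent with the watermark.
        System.out.println(coveredByWatermark(9_000L, 0L, watermark));     // true
        // The UDF shifts the rowtime past the watermark: windows up to 10s
        // may already have fired without this record.
        System.out.println(coveredByWatermark(9_000L, 5_000L, watermark)); // false
    }
}
```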
There might be more issues of this kind.

My main question would be how much would we gain with this proposal over a
tight integration of Table API and DataStream API, assuming that batch
functionality is moved to DataStream?

Best, Fabian

[1] http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf


On Mon, Nov 5, 2018 at 02:49, Rong Rong <walter...@gmail.com> wrote:

> Hi Jincheng,
>
> Thank you for the proposal! I think being able to define a process /
> co-process function in the Table API definitely opens up a whole new level
> of applications using a unified API.
>
> In addition, as Tzu-Li and Hequn have mentioned, the optimization layer of
> the Table API will already bring an additional benefit over programming
> directly on top of the DataStream/DataSet API. I am very interested in and
> looking forward to seeing support for more complex use cases, especially
> iterations. It will enable the Table API to define much broader,
> event-driven use cases such as real-time ML prediction/training.
>
> As Timo mentioned, this will make the Table API diverge from the SQL API.
> But in my experience, the Table API has always given me the impression of
> being a more sophisticated, syntax-aware way to express relational
> operations.
> Looking forward to further discussion and collaborations on the FLIP doc.
>
> --
> Rong
>
> On Sun, Nov 4, 2018 at 5:22 PM jincheng sun <sunjincheng...@gmail.com>
> wrote:
>
> > Hi tison,
> >
> > Thanks a lot for your feedback!
> > I am very happy to see that community contributors agree to enhance the
> > Table API. This is a long-term, continuous effort that we will push
> > forward in stages. We will soon complete the enhancement list for the
> > first phase, and then we can go into deep discussion in the Google doc.
> > Thanks again for joining this very important discussion of the Flink
> > Table API.
> >
> > Thanks,
> > Jincheng
> >
> > Tzu-Li Chen <wander4...@gmail.com> wrote on Fri, Nov 2, 2018 at 1:49 PM:
> >
> > > Hi Jincheng,
> > >
> > > Thanks a lot for your proposal! I find it a good starting point for
> > > internal optimization work and for making Flink more user-friendly.
> > >
> > > AFAIK, DataStream is currently the most popular API, and Flink users
> > > have to describe their logic with it in detail.
> > > From a more internal view, the conversion from DataStream to
> > > JobGraph is quite mechanical and hard to optimize. So when users
> > > program with DataStream, they have to learn more internals and spend
> > > a lot of time tuning for performance.
> > > With your proposal, we provide enhanced functionality in the Table API,
> > > so that users can describe their jobs easily at the Table level. This
> > > gives Flink developers an opportunity to introduce an optimization
> > > phase while transforming a user program (described by the Table API)
> > > into its internal representation.
> > >
> > > A user who wants to start using Flink for simple ETL, pipelining,
> > > or analytics would find this most naturally described by the SQL/Table
> > > API. Further, as mentioned by @hequn,
> > >
> > > SQL is a widely used language. It follows standards, is a
> > > > descriptive language, and is easy to use
> > >
> > >
> > > thus we can expect that, with the enhancement of the SQL/Table API,
> > > Flink will become friendlier to users.
> > >
> > > Looking forward to the design doc/FLIP!
> > >
> > > Best,
> > > tison.
> > >
> > >
> > > jincheng sun <sunjincheng...@gmail.com> wrote on Fri, Nov 2, 2018 at 11:46 AM:
> > >
> > > > Hi Hequn,
> > > > Thanks for your feedback! And also thanks for our offline discussion!
> > > > You are right, the unification of batch and streaming is very
> > > > important for the Flink API.
> > > > We will provide a more detailed design later. Please let me know if
> > > > you have further thoughts or feedback.
> > > >
> > > > Thanks,
> > > > Jincheng
> > > >
> > > > Hequn Cheng <chenghe...@gmail.com> wrote on Fri, Nov 2, 2018 at 10:02 AM:
> > > >
> > > > > Hi Jincheng,
> > > > >
> > > > > Thanks a lot for your proposal. It is very encouraging!
> > > > >
> > > > > As we all know, SQL is a widely used language. It follows
> > > > > standards, is a descriptive language, and is easy to use. A
> > > > > powerful feature of SQL is that it supports optimization. Users
> > > > > only need to care about the logic of the program; the underlying
> > > > > optimizer will help users optimize its performance. However, in
> > > > > terms of functionality and ease of use, SQL is limited in some
> > > > > scenarios, as described in Jincheng's proposal.
> > > > >
> > > > > Correspondingly, the DataStream/DataSet API provides powerful
> > > > > functionality. Users can write a ProcessFunction/CoProcessFunction
> > > > > and get access to timers. Compared with SQL, it provides more
> > > > > functionality and flexibility. However, it does not support
> > > > > optimization like SQL does. Meanwhile, the DataStream and DataSet
> > > > > APIs have not been unified, which means that, for the same logic,
> > > > > users need to write one job for streaming and another for batch.
> > > > >
> > > > > With the Table API, I think we can combine the advantages of both.
> > > > > Users can easily write relational operations and enjoy
> > > > > optimization. At the same time, it supports more functionality and
> > > > > is easier to use. Looking forward to the detailed design/FLIP.
> > > > >
> > > > > Best,
> > > > > Hequn
> > > > >
> > > > > On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang <wshaox...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi Aljoscha,
> > > > > > Glad that you like the proposal. We have completed prototypes of
> > > > > > most of the newly proposed functionality. Once we have collected
> > > > > > feedback from the community, we will come up with a concrete
> > > > > > FLIP/design doc.
> > > > > >
> > > > > > Regards,
> > > > > > Shaoxuan
> > > > > >
> > > > > >
> > > > > > On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek <
> > aljos...@apache.org
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Jincheng,
> > > > > > >
> > > > > > > these points sound very good! Are there any concrete proposals
> > > > > > > for changes? For example, a FLIP/design document?
> > > > > > >
> > > > > > > See here for FLIPs:
> > > > > > >
> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> > > > > > >
> > > > > > > Best,
> > > > > > > Aljoscha
> > > > > > >
> > > > > > > > On 1. Nov 2018, at 12:51, jincheng sun <
> > sunjincheng...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > *-------- I am sorry for the formatting of the email content. I
> > > > > > > > reformatted the content as follows --------*
> > > > > > > >
> > > > > > > > *Hi ALL,*
> > > > > > > >
> > > > > > > > With the continuous efforts of the community, the Flink
> > > > > > > > system has been continuously improved, which has attracted
> > > > > > > > more and more users. Flink SQL is a canonical, widely used
> > > > > > > > relational query language. However, there are still some
> > > > > > > > scenarios where Flink SQL fails to meet user needs in terms
> > > > > > > > of functionality and ease of use, such as:
> > > > > > > >
> > > > > > > > *1. In terms of functionality*
> > > > > > > >    Iteration, user-defined windows, user-defined joins,
> > > > > > > >    user-defined GroupReduce, etc. Users cannot express these
> > > > > > > >    with SQL;
> > > > > > > >
> > > > > > > > *2. In terms of ease of use*
> > > > > > > >
> > > > > > > >   - Map - e.g. “dataStream.map(mapFun)”. Although
> > > > > > > >   “table.select(udf1(), udf2(), udf3()....)” can be used to
> > > > > > > >   accomplish the same function, with a map() function
> > > > > > > >   returning 100 columns, one has to define or call 100 UDFs
> > > > > > > >   when using SQL, which is quite involved.
> > > > > > > >   - FlatMap - e.g. “dataStream.flatMap(flatMapFun)”.
> > > > > > > >   Similarly, it can be implemented with
> > > > > > > >   “table.join(udtf).select()”. However, it is obvious that
> > > > > > > >   DataStream is easier to use than SQL.
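The ergonomics gap in the Map bullet above can be sketched in plain Java (a hypothetical illustration, not the Flink API): a map-style operator emits the whole output row from one function, while a select()-style API needs one scalar UDF per output column, so N output columns mean N UDF definitions or calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy illustration, not Flink API: one map function vs. one scalar UDF per
// output column for the same row transformation.
public class MapVsUdfDemo {

    // map style: a single function produces the whole output row.
    static List<String> mapRow(List<String> in, Function<List<String>, List<String>> mapFun) {
        return mapFun.apply(in);
    }

    // select style: one scalar function per output column.
    static List<String> selectRow(List<String> in, List<Function<List<String>, String>> udfs) {
        List<String> out = new ArrayList<>();
        for (Function<List<String>, String> udf : udfs) {
            out.add(udf.apply(in));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> row = List.of("a", "b");
        // One map function producing 3 columns at once.
        List<String> viaMap = mapRow(row, r -> List.of(r.get(0), r.get(1), r.get(0) + r.get(1)));
        // The same output needs 3 separate UDFs in select style.
        List<String> viaSelect = selectRow(row, List.of(
                r -> r.get(0), r -> r.get(1), r -> r.get(0) + r.get(1)));
        System.out.println(viaMap.equals(viaSelect)); // true
    }
}
```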
> > > > > > > >
> > > > > > > > Due to the above two reasons, some users have to use the
> > > > > > > > DataStream API or the DataSet API. But when they do that,
> > > > > > > > they lose the unification of batch and streaming. They also
> > > > > > > > lose sophisticated optimizations such as codegen,
> > > > > > > > aggregate-join transpose, and multi-stage aggregation from
> > > > > > > > Flink SQL.
> > > > > > > >
> > > > > > > > We believe that enhancing the functionality and productivity
> > > > > > > > is vital for the successful adoption of the Table API. To
> > > > > > > > this end, the Table API still requires more effort from every
> > > > > > > > contributor in the community. We see a great opportunity to
> > > > > > > > improve our users’ experience through this work. Any feedback
> > > > > > > > is welcome.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Jincheng
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
