Hi Xiaogang,

Thanks for the comments. Please see the responses inline below.

On Tue, Nov 6, 2018 at 11:28 AM SHI Xiaogang <shixiaoga...@gmail.com> wrote:

> Hi all,
>
> I think it's good to enhance the functionality and productivity of Table
> API, but still I think SQL + DataStream is a better choice from user
> experience
> 1. The unification of batch and stream processing is very attractive, and
> many of our users are moving their batch-processing applications to Flink.
> These users are familiar with SQL and most of their applications are
> written in SQL-like languages. In Ad-hoc or interactive scenarios where
> users write new queries according to the results of previous queries, SQL
> is much more efficient than Table API.
>
As of now, both SQL and the Table API support batch and stream processing at
the same time. However, their scopes differ slightly. SQL is a query
language that perfectly fits scenarios such as OLAP/BI. However, it falls
short when it comes to cases like ML/graph processing, where a programming
interface is needed. As you may have noticed, fundamental features such as
iteration are missing from SQL. Besides that, people rarely write algorithms
in SQL. By the same argument you make in (2), users want to see a
programming interface instead of a query language. As Xiaowei mentioned, we
are extending the Table API to a wider range of use cases. Hence, enhancing
the Table API is necessary to reach that goal.
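To make the iteration point concrete, here is a small, framework-free sketch (plain Python, not actual Flink or Table API code; all names are illustrative) of the kind of fixpoint loop that a programming interface expresses naturally but a single SQL query cannot:

```python
# Re-apply a transformation to a dataset until a convergence condition holds.
# A plain SQL query cannot express this loop; a programming interface can.

def iterate_until_converged(rows, step, converged, max_iterations=100):
    """Repeatedly apply `step` to the whole dataset until `converged`."""
    for _ in range(max_iterations):
        new_rows = step(rows)
        if converged(rows, new_rows):
            return new_rows
        rows = new_rows
    return rows

# Example: propagate the minimum label along edges, a connected-components-style
# iteration typical of graph algorithms.
edges = [(1, 2), (2, 3), (4, 5)]
labels = {v: v for v in {v for e in edges for v in e}}

def step(labels):
    new = dict(labels)
    for a, b in edges:
        m = min(new[a], new[b])
        new[a] = new[b] = m
    return new

result = iterate_until_converged(labels, step, lambda old, new: old == new)
# vertices 1, 2, 3 converge to label 1; vertices 4, 5 to label 4
```

Graph algorithms are exactly this shape: re-apply a step until nothing changes, which is why iteration keeps coming up in the ML/graph discussion.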

> 2. Though TableAPI is also declarative, most users typically refuse to
> learn another language, not to mention a fast-evolving one.
> 3. There are many efforts to be done in the optimization of both Table API
> and SQL. Given that most data processing systems, including Databases,
> MapReduce-like systems and Continuous Query systems, have done a lot of
> work in the optimization of SQL queries, it seems easier to adopt
> existing work in Flink with SQL than TableAPI.
>
WRT optimization, at the end of the day, SQL and the Table API share the
exact same query-optimization path. Therefore, any optimization made for SQL
will be available to the Table API; there is no additional overhead. The
additional optimization opportunity Xiaowei mentioned was specifically for
UDFs, which are black boxes today and may be optimized once expressed as
symbolic expressions.
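To illustrate what "UDFs as symbolic expressions" could buy an optimizer, here is a toy sketch (plain Python, not Flink code; the classes are invented for illustration): an expression tree the optimizer can inspect and rewrite, unlike an opaque lambda:

```python
# A symbolic UDF is a data structure, not a black box: the engine can walk the
# tree and rewrite it, e.g. folding constant subtrees before execution.

class Expr:
    pass

class Col(Expr):
    def __init__(self, name):
        self.name = name
    def eval(self, row):
        return row[self.name]

class Const(Expr):
    def __init__(self, value):
        self.value = value
    def eval(self, row):
        return self.value

class Add(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self, row):
        return self.left.eval(row) + self.right.eval(row)

def fold_constants(e):
    """Optimizer pass: collapse Const+Const subtrees (impossible for a lambda)."""
    if isinstance(e, Add):
        left, right = fold_constants(e.left), fold_constants(e.right)
        if isinstance(left, Const) and isinstance(right, Const):
            return Const(left.value + right.value)
        return Add(left, right)
    return e

expr = Add(Col("a"), Add(Const(1), Const(2)))  # a + (1 + 2), an inspectable tree
optimized = fold_constants(expr)               # rewritten to a + 3
result = optimized.eval({"a": 10})             # 13
```

An opaque `lambda row: row["a"] + 1 + 2` computes the same value, but the optimizer cannot look inside it, which is exactly the black-box problem described above.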

> 4. Data stream is still welcome as users can deal with complex logic with
> flexible interfaces. It is also widely adopted in critical missions where
> users can use a lot of "tricky" optimization to achieve optimal
> performance. Most of these tricks typically are not allowed in TableAPI or
> SQL.

Completely agree that DataStream is important to many advanced cases. We
should also invest in making it stronger. Enhancing the Table API does not
mean getting rid of DataStream. The Table API is a high-level API with which
users only need to focus on their business logic, while the DataStream API
is a low-level API for users who want strong control over all the execution
details. Both are valuable to different users.
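As a toy illustration of that boundary (plain Python, not Flink API; names are illustrative): the DataStream style hands the user explicit keyed state, while the Table style only declares the aggregation and leaves state to the engine:

```python
# Two ways to compute a per-key running sum: one where the user owns the state
# (DataStream-style), one where the user only declares the result (Table-style).

from collections import defaultdict

events = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]

# DataStream-style: the user owns and mutates per-key state directly.
state = defaultdict(int)          # explicit keyed state
for key, value in events:
    state[key] += value           # user code decides how state evolves

# Table-style: the user only declares "sum per key"; state is an engine detail.
def group_by_sum(rows):
    acc = defaultdict(int)
    for key, value in rows:
        acc[key] += value
    return dict(acc)

declarative = group_by_sum(events)
assert declarative == dict(state)  # same result, very different user contract
```

Same answer either way; the difference is who is responsible for the state, which is the boundary being proposed here.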

>


> Regards,
> Xiaogang
>
> Xiaowei Jiang <xiaow...@gmail.com> wrote on Mon, Nov 5, 2018 at 8:34 PM:
>
> > Hi Fabian, these are great questions! I have some quick thoughts on some
> of
> > these.
> >
> > Optimization opportunities: I think you are right that UDFs are more like
> > black boxes today. However, this can change if we let users develop UDFs
> > symbolically in the future (i.e., Flink will look inside the UDF code,
> > understand it and potentially do codegen to execute it). This will open
> the
> > door for potential optimizations.
> >
> > Moving batch functionality to DataStream: I actually think that moving
> > batch functionality to Table API is probably a better idea. Currently,
> > DataStream API is very deterministic and imperative. On the other hand,
> > DataSet API has an optimizer and can choose very different execution
> plans.
> > Another distinguishing feature of DataStream API is that users get direct
> > access to state/statebackend, which we intentionally avoided in Table API
> so
> > far. The introduction of states is probably the biggest innovation by
> > Flink. At the same time, abusing states may also be the largest source
> for
> > reliability/performance issues. Shielding users away from dealing with
> > state directly is a key advantage in Table API. I think this is probably
> a
> > good boundary between DataStream and Table API. If users HAVE to
> manipulate
> > state explicitly, go with DataStream, otherwise, go with Table API.
> >
> > Because Flink is extending into more and more scenarios (e.g., batch,
> > streaming & micro-service), we may inevitably end up with multiple APIs.
> It
> > appears that data analytics (batch & streaming) related applications can
> be
> > well served with Table API which not only unifies batch and streaming,
> but
> > also relieves the users from dealing with states explicitly. On the other
> > hand, DataStream is very convenient and powerful for micro-service kind
> of
> > applications because explicit state access may be necessary in such
> > cases.
> >
> > We will start a thread outlining what we are proposing in the Table API.
> >
> > Regards,
> > Xiaowei
> >
> > On Mon, Nov 5, 2018 at 7:03 PM Fabian Hueske <fhue...@gmail.com> wrote:
> >
> > > Hi Jincheng,
> > >
> > > Thanks for this interesting proposal.
> > > I like that we can push this effort forward in a very fine-grained
> > manner,
> > > i.e., incrementally adding more APIs to the Table API.
> > >
> > > However, I also have a few questions / concerns.
> > > Today, the Table API is tightly integrated with the DataSet and
> > DataStream
> > > APIs. It is very easy to convert a Table into a DataSet or DataStream
> and
> > > vice versa. This means it is already easy to combine custom logic and
> > > relational operations. What I like is that several aspects are clearly
> > > separated like retraction and timestamp handling (see below) + all
> > > libraries on DataStream/DataSet can be easily combined with relational
> > > operations.
> > > I can see that adding more functionality to the Table API would remove
> > the
> > > distinction between DataSet and DataStream. However, wouldn't we get a
> > > similar benefit by extending the DataStream API for proper support for
> > > bounded streams (as is the long-term goal of Flink)?
> > > I'm also a bit skeptical about the optimization opportunities we would
> > > gain. Map/FlatMap UDFs are black boxes that cannot be easily removed
> > > without additional information (I did some research on this a few years
> > ago
> > > [1]).
> > >
> > > Moreover, I think there are a few tricky details that need to be
> resolved
> > > to enable a good integration.
> > >
> > > 1) How to deal with retraction messages? The DataStream API does not
> > have a
> > > notion of retractions. How would a MapFunction or FlatMapFunction
> handle
> > > retraction? Do they need to be aware of the change flag? Custom
> windowing
> > > and aggregation logic would certainly need to have that information.
> > > 2) How to deal with timestamps? The DataStream API does not give access
> > to
> > > timestamps. In the Table API / SQL these are exposed as regular
> > attributes.
> > > How can we ensure that timestamp attributes remain valid (i.e. aligned
> > with
> > > watermarks) if the output is produced by arbitrary code?
> > > There might be more issues of this kind.
> > >
> > > My main question would be how much would we gain with this proposal
> over
> > a
> > > tight integration of Table API and DataStream API, assuming that batch
> > > functionality is moved to DataStream?
> > >
> > > Best, Fabian
> > >
> > > [1] http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf
> > >
> > >
> > > Am Mo., 5. Nov. 2018 um 02:49 Uhr schrieb Rong Rong <
> walter...@gmail.com
> > >:
> > >
> > > > Hi Jincheng,
> > > >
> > > > Thank you for the proposal! I think being able to define a process /
> > > > co-process function in table API definitely opens up a whole new
> level
> > of
> > > > applications using a unified API.
> > > >
> > > > In addition, as Tzu-Li and Hequn have mentioned, the optimization
> > > > layer of the Table API will already bring additional benefit over
> > > > directly programming on top of the DataStream/DataSet API. I am very
> > > > interested in and looking forward to seeing the support for more
> > > > complex use cases, especially iterations. It will enable the Table
> > > > API to define much broader, event-driven use cases such as real-time
> > > > ML prediction/training.
> > > >
> > > > As Timo mentioned, this will make the Table API diverge from the SQL API.
> > But
> > > > as from my experience the Table API has always given me the
> > > > impression of being a more sophisticated, syntax-aware way to
> > > > express relational operations.
> > > > Looking forward to further discussion and collaborations on the FLIP
> > doc.
> > > >
> > > > --
> > > > Rong
> > > >
> > > > On Sun, Nov 4, 2018 at 5:22 PM jincheng sun <
> sunjincheng...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi tison,
> > > > >
> > > > > Thanks a lot for your feedback!
> > > > > I am very happy to see that community contributors agree to
> > > > > enhance the Table API. This work is a long-term, continuous
> > > > > effort; we will push it in stages and will soon complete the
> > > > > enhancement list of the first phase, so we can go into deep
> > > > > discussion in the Google doc. Thanks again for joining this very
> > > > > important discussion of the Flink Table API.
> > > > >
> > > > > Thanks,
> > > > > Jincheng
> > > > >
> > > > > Tzu-Li Chen <wander4...@gmail.com> wrote on Fri, Nov 2, 2018 at 1:49 PM:
> > > > >
> > > > > > Hi jincheng,
> > > > > >
> > > > > > Thanks a lot for your proposal! I find it a good starting point
> > > > > > for internal optimization work and it will help Flink become
> > > > > > more user-friendly.
> > > > > >
> > > > > > AFAIK, DataStream is currently the most popular API, with which
> > > > > > Flink users describe their logic in detail.
> > > > > > From a more internal view, the conversion from DataStream to
> > > > > > JobGraph is quite mechanical and hard to optimize. So when
> > > > > > users program with DataStream, they have to learn more internals
> > > > > > and spend a lot of time tuning for performance.
> > > > > > With your proposal, we provide enhanced functionality in the
> > > > > > Table API, so that users can easily describe their jobs at the
> > > > > > Table level. This gives Flink developers an opportunity to
> > > > > > introduce an optimization phase while transforming a user
> > > > > > program (described with the Table API) into an internal
> > > > > > representation.
> > > > > >
> > > > > > Given a user who wants to start using Flink for simple ETL,
> > > > > > pipelining, or analytics, they would find it most naturally
> > > > > > described by the SQL/Table API. Further, as mentioned by @hequn,
> > > > > >
> > > > > > SQL is a widely used language. It follows standards, is a
> > > > > > > descriptive language, and is easy to use
> > > > > >
> > > > > >
> > > > > > thus we can expect that, with the enhancement of the SQL/Table
> > > > > > API, Flink will become friendlier to users.
> > > > > >
> > > > > > Looking forward to the design doc/FLIP!
> > > > > >
> > > > > > Best,
> > > > > > tison.
> > > > > >
> > > > > >
> > > > > > jincheng sun <sunjincheng...@gmail.com> wrote on Fri, Nov 2, 2018 at 11:46 AM:
> > > > > >
> > > > > > > Hi Hequn,
> > > > > > > Thanks for your feedback! And also thanks for our offline
> > > discussion!
> > > > > > > You are right, the unification of batch and streaming is very
> > > > > > > important for the Flink API.
> > > > > > > We will provide a more detailed design later. Please let me
> > > > > > > know if you have further thoughts or feedback.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jincheng
> > > > > > >
> > > > > > > Hequn Cheng <chenghe...@gmail.com> wrote on Fri, Nov 2, 2018 at 10:02 AM:
> > > > > > >
> > > > > > > > Hi Jincheng,
> > > > > > > >
> > > > > > > > Thanks a lot for your proposal. It is very encouraging!
> > > > > > > >
> > > > > > > > As we all know, SQL is a widely used language. It follows
> > > > standards,
> > > > > > is a
> > > > > > > > descriptive language, and is easy to use. A powerful feature
> of
> > > SQL
> > > > > is
> > > > > > > that
> > > > > > > > it supports optimization. Users only need to care about the
> > logic
> > > > of
> > > > > > the
> > > > > > > > program. The underlying optimizer will help users optimize
> the
> > > > > > > performance
> > > > > > > > of the program. However, in terms of functionality and ease
> of
> > > use,
> > > > > in
> > > > > > > some
> > > > > > > > scenarios SQL will be limited, as described in Jincheng's
> > > proposal.
> > > > > > > >
> > > > > > > > Correspondingly, the DataStream/DataSet api can provide
> > powerful
> > > > > > > > functionalities. Users can write
> > > ProcessFunction/CoProcessFunction
> > > > > and
> > > > > > > get
> > > > > > > > the timer. Compared with SQL, it provides more
> functionalities
> > > and
> > > > > > > > flexibilities. However, it does not support optimization like
> > > SQL.
> > > > > > > > Meanwhile, the DataStream/DataSet APIs have not been
> > > > > > > > unified, which means that, for the same logic, users need to
> > > > > > > > write a separate job for each of stream and batch.
> > > > > > > >
> > > > > > > > With the Table API, I think we can combine the advantages of both.
> > > Users
> > > > > can
> > > > > > > > easily write relational operations and enjoy optimization. At
> > the
> > > > > same
> > > > > > > > time, it supports more functionality and ease of use. Looking
> > > > forward
> > > > > > to
> > > > > > > > the detailed design/FLIP.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Hequn
> > > > > > > >
> > > > > > > > On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang <
> > > wshaox...@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Aljoscha,
> > > > > > > > > Glad that you like the proposal. We have completed the
> > > > > > > > > prototype of most of the newly proposed functionalities.
> > > > > > > > > Once we collect feedback from the community, we will come
> > > > > > > > > up with a concrete FLIP/design doc.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Shaoxuan
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek <
> > > > > aljos...@apache.org
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Jincheng,
> > > > > > > > > >
> > > > > > > > > > these points sound very good! Are there any concrete
> > > proposals
> > > > > for
> > > > > > > > > > changes? For example a FLIP/design document?
> > > > > > > > > >
> > > > > > > > > > See here for FLIPs:
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Aljoscha
> > > > > > > > > >
> > > > > > > > > > > On 1. Nov 2018, at 12:51, jincheng sun <
> > > > > sunjincheng...@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > *--------I am sorry for the formatting of the email
> > > content.
> > > > I
> > > > > > > > reformat
> > > > > > > > > > > the **content** as follows-----------*
> > > > > > > > > > >
> > > > > > > > > > > *Hi ALL,*
> > > > > > > > > > >
> > > > > > > > > > > With the continuous efforts from the community, the
> Flink
> > > > > system
> > > > > > > has
> > > > > > > > > been
> > > > > > > > > > > continuously improved, which has attracted more and
> more
> > > > users.
> > > > > > > Flink
> > > > > > > > > SQL
> > > > > > > > > > > is a canonical, widely used relational query language.
> > > > However,
> > > > > > > there
> > > > > > > > > are
> > > > > > > > > > > still some scenarios where Flink SQL failed to meet
> user
> > > > needs
> > > > > in
> > > > > > > > terms
> > > > > > > > > > of
> > > > > > > > > > > functionality and ease of use, such as:
> > > > > > > > > > >
> > > > > > > > > > > *1. In terms of functionality*
> > > > > > > > > > >    Iteration, user-defined window, user-defined join,
> > > > > > user-defined
> > > > > > > > > > > GroupReduce, etc. Users cannot express them with SQL;
> > > > > > > > > > >
> > > > > > > > > > > *2. In terms of ease of use*
> > > > > > > > > > >
> > > > > > > > > > >   - Map - e.g. “dataStream.map(mapFun)”. Although
> > > > > > > > > > >   “table.select(udf1(), udf2(), udf3()....)” can be used
> > > > > > > > > > >   to accomplish the same function, with a map() function
> > > > > > > > > > >   returning 100 columns, one has to define or call 100
> > > > > > > > > > >   UDFs when using SQL, which is quite involved.
> > > > > > > > > > >   - FlatMap - e.g. “dataStream.flatmap(flatMapFun)”.
> > > > > > > > > > >   Similarly, it can be implemented with
> > > > > > > > > > >   “table.join(udtf).select()”. However, it is obvious
> > > > > > > > > > >   that DataStream is easier to use than SQL.
> > > > > > > > > > >
> > > > > > > > > > > Due to the above two reasons, some users have to use
> the
> > > > > > DataStream
> > > > > > > > API
> > > > > > > > > > or
> > > > > > > > > > > the DataSet API. But when they do that, they lose the
> > > > > unification
> > > > > > > of
> > > > > > > > > > batch
> > > > > > > > > > > and streaming. They will also lose the sophisticated
> > > > > > optimizations
> > > > > > > > such
> > > > > > > > > > as
> > > > > > > > > > > codegen, aggregate join transpose and multi-stage agg
> > from
> > > > > Flink
> > > > > > > SQL.
> > > > > > > > > > >
> > > > > > > > > > > We believe that enhancing the functionality and
> > > productivity
> > > > is
> > > > > > > vital
> > > > > > > > > for
> > > > > > > > > > > the successful adoption of Table API. To this end,
> Table
> > > API
> > > > > > still
> > > > > > > > > > > requires more efforts from every contributor in the
> > > > community.
> > > > > We
> > > > > > > see
> > > > > > > > > > great
> > > > > > > > > > > opportunity in improving our user’s experience from
> this
> > > > work.
> > > > > > Any
> > > > > > > > > > feedback
> > > > > > > > > > > is welcome.
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > >
> > > > > > > > > > > Jincheng
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
