Hi Jark, Glad to see your feedback! That's correct, the proposal aims to extend the functionality of the Table API! I like adding "drop" to fit the use case you mentioned. Beyond that, consider a Table with 100 columns where our UDF needs all 100 of them: we don't want to define the eval method as eval(column0, ..., column99); we would rather define it as eval(Row) and use it like this: table.select(udf(*)). So we also need to consider whether to package the columns as a Row. We have classified this as a column operation, and we will list the changes to column operations after the map/flatMap/agg/flatAgg phase is completed. Currently, Xiaowei has started a thread outlining what we are proposing. Please see the details in the mail thread: https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB .
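To make the eval(Row) idea concrete, here is a minimal sketch in plain Java. Note this is only an illustration: Row below is a simplified, hypothetical stand-in for Flink's org.apache.flink.types.Row, and ConcatAllFields is an invented UDF name; how such a row-typed eval would actually be registered and called from the Table API is exactly what the proposal still needs to define.

```java
// Simplified stand-in for Flink's org.apache.flink.types.Row (hypothetical, for illustration only).
class Row {
    private final Object[] fields;

    Row(Object... fields) {
        this.fields = fields;
    }

    Object getField(int i) {
        return fields[i];
    }

    int getArity() {
        return fields.length;
    }
}

// A UDF that accepts the whole row at once instead of 100 separate parameters.
class ConcatAllFields {
    // eval(Row): one parameter regardless of how many columns the table has,
    // instead of eval(column0, ..., column99).
    public String eval(Row row) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.getArity(); i++) {
            if (i > 0) {
                sb.append(",");
            }
            sb.append(row.getField(i));
        }
        return sb.toString();
    }
}

public class RowUdfSketch {
    public static void main(String[] args) {
        // A 3-column row stands in for the 100-column case.
        Row row = new Row(1, "a", true);
        System.out.println(new ConcatAllFields().eval(row));
    }
}
```

The point of the sketch is the signature: the UDF's arity no longer grows with the table's column count, which is what table.select(udf(*)) would rely on.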
At this stage, the Table API Enhancement Outline is as follows: https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing Please let us know if you have further thoughts or feedback! Thanks, Jincheng Jark Wu <imj...@gmail.com> wrote on Tue, Nov 6, 2018, 3:35 PM: > Hi Jincheng, > > Thanks for your proposal. I think it is a helpful enhancement for the TableAPI > and a solid step forward for it. > It doesn't weaken SQL or DataStream, because the conversion between > DataStream and Table still works. > People with advanced cases (e.g. complex and fine-grained state control) > can go with DataStream, > but most general cases can stay in the TableAPI. This work aims to extend > the functionality of the TableAPI, > to extend its usage scenarios, and to help the TableAPI become a more widely used > API. > > For example, someone may want to drop one column from a 100-column Table. > Currently, we have to convert the > Table to a DataStream and use a MapFunction to do that, or select the remaining > 99 columns using the Table.select API. > But if we support a Table.drop() method in the TableAPI, it will be a very > convenient method that lets users stay in Table. > > Looking forward to the more detailed design and further discussion. > > Regards, > Jark > > jincheng sun <sunjincheng...@gmail.com> wrote on Tue, Nov 6, 2018, 1:05 PM: > > > Hi Rong Rong, > > > > Sorry for the late reply, and thanks for your feedback! We will continue > > to add more convenience features to the TableAPI, such as map, flatmap, > > agg, flatagg, iteration, etc. And I am very happy that you are interested in > > this proposal. Since this is long-term, continuous work, we will push it > > in stages. Currently, Xiaowei has started a thread outlining what we are proposing. 
Please see the detail in the mail thread: > > > > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > > < > > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > > > > > . > > > > The Table API Enhancement Outline as follows: > > > > > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing > > > > Please let we know if you have further thoughts or feedback! > > > > Thanks, > > Jincheng > > > > Fabian Hueske <fhue...@gmail.com> 于2018年11月5日周一 下午7:03写道: > > > > > Hi Jincheng, > > > > > > Thanks for this interesting proposal. > > > I like that we can push this effort forward in a very fine-grained > > manner, > > > i.e., incrementally adding more APIs to the Table API. > > > > > > However, I also have a few questions / concerns. > > > Today, the Table API is tightly integrated with the DataSet and > > DataStream > > > APIs. It is very easy to convert a Table into a DataSet or DataStream > and > > > vice versa. This mean it is already easy to combine custom logic an > > > relational operations. What I like is that several aspects are clearly > > > separated like retraction and timestamp handling (see below) + all > > > libraries on DataStream/DataSet can be easily combined with relational > > > operations. > > > I can see that adding more functionality to the Table API would remove > > the > > > distinction between DataSet and DataStream. However, wouldn't we get a > > > similar benefit by extending the DataStream API for proper support for > > > bounded streams (as is the long-term goal of Flink)? > > > I'm also a bit skeptical about the optimization opportunities we would > > > gain. Map/FlatMap UDFs are black boxes that cannot be easily removed > > > without additional information (I did some research on this a few years > > ago > > > [1]). 
> > > > > > Moreover, I think there are a few tricky details that need to be > resolved > > > to enable a good integration. > > > > > > 1) How to deal with retraction messages? The DataStream API does not > > have a > > > notion of retractions. How would a MapFunction or FlatMapFunction > handle > > > retraction? Do they need to be aware of the change flag? Custom > windowing > > > and aggregation logic would certainly need to have that information. > > > 2) How to deal with timestamps? The DataStream API does not give access > > to > > > timestamps. In the Table API / SQL these are exposed as regular > > attributes. > > > How can we ensure that timestamp attributes remain valid (i.e. aligned > > with > > > watermarks) if the output is produced by arbitrary code? > > > There might be more issues of this kind. > > > > > > My main question would be how much would we gain with this proposal > over > > a > > > tight integration of Table API and DataStream API, assuming that batch > > > functionality is moved to DataStream? > > > > > > Best, Fabian > > > > > > [1] http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf > > > > > > > > > Am Mo., 5. Nov. 2018 um 02:49 Uhr schrieb Rong Rong < > walter...@gmail.com > > >: > > > > > > > Hi Jincheng, > > > > > > > > Thank you for the proposal! I think being able to define a process / > > > > co-process function in table API definitely opens up a whole new > level > > of > > > > applications using a unified API. > > > > > > > > In addition, as Tzu-Li and Hequn have mentioned, the benefit of > > > > optimization layer of Table API will already bring in additional > > benefit > > > > over directly programming on top of DataStream/DataSet API. I am very > > > > interested an looking forward to seeing the support for more complex > > use > > > > cases, especially iterations. It will enable table API to define much > > > > broader, event-driven use cases such as real-time ML > > prediction/training. 
> > > > > > > > As Timo mentioned, This will make Table API diverge from the SQL API. > > But > > > > as from my experience Table API was always giving me the impression > to > > > be a > > > > more sophisticated, syntactic-aware way to express relational > > operations. > > > > Looking forward to further discussion and collaborations on the FLIP > > doc. > > > > > > > > -- > > > > Rong > > > > > > > > On Sun, Nov 4, 2018 at 5:22 PM jincheng sun < > sunjincheng...@gmail.com> > > > > wrote: > > > > > > > > > Hi tison, > > > > > > > > > > Thanks a lot for your feedback! > > > > > I am very happy to see that community contributors agree to > enhanced > > > the > > > > > TableAPI. This work is a long-term continuous work, we will push it > > in > > > > > stages, we will soon complete the enhanced list of the first > phase, > > we > > > > can > > > > > go deep discussion in google doc. thanks again for joining on the > > very > > > > > important discussion of the Flink Table API. > > > > > > > > > > Thanks, > > > > > Jincheng > > > > > > > > > > Tzu-Li Chen <wander4...@gmail.com> 于2018年11月2日周五 下午1:49写道: > > > > > > > > > > > Hi jingchengm > > > > > > > > > > > > Thanks a lot for your proposal! I find it is a good start point > for > > > > > > internal optimization works and help Flink to be more > > > > > > user-friendly. > > > > > > > > > > > > AFAIK, DataStream is the most popular API currently that Flink > > > > > > users should describe their logic with detailed logic. > > > > > > From a more internal view the conversion from DataStream to > > > > > > JobGraph is quite mechanically and hard to be optimized. So when > > > > > > users program with DataStream, they have to learn more internals > > > > > > and spend a lot of time to tune for performance. > > > > > > With your proposal, we provide enhanced functionality of Table > API, > > > > > > so that users can describe their job easily on Table aspect. 
This > > > gives > > > > > > an opportunity to Flink developers to introduce an optimize phase > > > > > > while transforming user program(described by Table API) to > internal > > > > > > representation. > > > > > > > > > > > > Given a user who want to start using Flink with simple ETL, > > > pipelining > > > > > > or analytics, he would find it is most naturally described by > > > SQL/Table > > > > > > API. Further, as mentioned by @hequn, > > > > > > > > > > > > SQL is a widely used language. It follows standards, is a > > > > > > > descriptive language, and is easy to use > > > > > > > > > > > > > > > > > > thus we could expect with the enhancement of SQL/Table API, Flink > > > > > > becomes more friendly to users. > > > > > > > > > > > > Looking forward to the design doc/FLIP! > > > > > > > > > > > > Best, > > > > > > tison. > > > > > > > > > > > > > > > > > > jincheng sun <sunjincheng...@gmail.com> 于2018年11月2日周五 上午11:46写道: > > > > > > > > > > > > > Hi Hequn, > > > > > > > Thanks for your feedback! And also thanks for our offline > > > discussion! > > > > > > > You are right, unification of batch and streaming is very > > important > > > > for > > > > > > > flink API. > > > > > > > We will provide more detailed design later, Please let me know > if > > > you > > > > > > have > > > > > > > further thoughts or feedback. > > > > > > > > > > > > > > Thanks, > > > > > > > Jincheng > > > > > > > > > > > > > > Hequn Cheng <chenghe...@gmail.com> 于2018年11月2日周五 上午10:02写道: > > > > > > > > > > > > > > > Hi Jincheng, > > > > > > > > > > > > > > > > Thanks a lot for your proposal. It is very encouraging! > > > > > > > > > > > > > > > > As we all know, SQL is a widely used language. It follows > > > > standards, > > > > > > is a > > > > > > > > descriptive language, and is easy to use. A powerful feature > of > > > SQL > > > > > is > > > > > > > that > > > > > > > > it supports optimization. 
Users only need to care about the > > logic > > > > of > > > > > > the > > > > > > > > program. The underlying optimizer will help users optimize > the > > > > > > > performance > > > > > > > > of the program. However, in terms of functionality and ease > of > > > use, > > > > > in > > > > > > > some > > > > > > > > scenarios sql will be limited, as described in Jincheng's > > > proposal. > > > > > > > > > > > > > > > > Correspondingly, the DataStream/DataSet api can provide > > powerful > > > > > > > > functionalities. Users can write > > > ProcessFunction/CoProcessFunction > > > > > and > > > > > > > get > > > > > > > > the timer. Compared with SQL, it provides more > functionalities > > > and > > > > > > > > flexibilities. However, it does not support optimization like > > > SQL. > > > > > > > > Meanwhile, DataStream/DataSet api has not been unified which > > > means, > > > > > for > > > > > > > the > > > > > > > > same logic, users need to write a job for each stream and > > batch. > > > > > > > > > > > > > > > > With TableApi, I think we can combine the advantages of both. > > > Users > > > > > can > > > > > > > > easily write relational operations and enjoy optimization. At > > the > > > > > same > > > > > > > > time, it supports more functionality and ease of use. Looking > > > > forward > > > > > > to > > > > > > > > the detailed design/FLIP. > > > > > > > > > > > > > > > > Best, > > > > > > > > Hequn > > > > > > > > > > > > > > > > On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang < > > > wshaox...@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi Aljoscha, > > > > > > > > > Glad that you like the proposal. We have completed the > > > prototype > > > > of > > > > > > > most > > > > > > > > > new proposed functionalities. Once collect the feedback > from > > > > > > community, > > > > > > > > we > > > > > > > > > will come up with a concrete FLIP/design doc. 
> > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > Shaoxuan > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek < > > > > > aljos...@apache.org > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Hi Jincheng, > > > > > > > > > > > > > > > > > > > > these points sound very good! Are there any concrete > > > proposals > > > > > for > > > > > > > > > > changes? For example a FLIP/design document? > > > > > > > > > > > > > > > > > > > > See here for FLIPs: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > Aljoscha > > > > > > > > > > > > > > > > > > > > > On 1. Nov 2018, at 12:51, jincheng sun < > > > > > sunjincheng...@gmail.com > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > *--------I am sorry for the formatting of the email > > > content. > > > > I > > > > > > > > reformat > > > > > > > > > > > the **content** as follows-----------* > > > > > > > > > > > > > > > > > > > > > > *Hi ALL,* > > > > > > > > > > > > > > > > > > > > > > With the continuous efforts from the community, the > Flink > > > > > system > > > > > > > has > > > > > > > > > been > > > > > > > > > > > continuously improved, which has attracted more and > more > > > > users. > > > > > > > Flink > > > > > > > > > SQL > > > > > > > > > > > is a canonical, widely used relational query language. > > > > However, > > > > > > > there > > > > > > > > > are > > > > > > > > > > > still some scenarios where Flink SQL failed to meet > user > > > > needs > > > > > in > > > > > > > > terms > > > > > > > > > > of > > > > > > > > > > > functionality and ease of use, such as: > > > > > > > > > > > > > > > > > > > > > > *1. 
In terms of functionality* > > > > > > > > > > > Iteration, user-defined window, user-defined join, > > > > > > user-defined > > > > > > > > > > > GroupReduce, etc. Users cannot express them with SQL; > > > > > > > > > > > > > > > > > > > > > > *2. In terms of ease of use* > > > > > > > > > > > > > > > > > > > > > > - Map - e.g. “dataStream.map(mapFun)”. Although > > > > > > > > “table.select(udf1(), > > > > > > > > > > > udf2(), udf3()....)” can be used to accomplish the > same > > > > > > > function., > > > > > > > > > > with a > > > > > > > > > > > map() function returning 100 columns, one has to > define > > > or > > > > > call > > > > > > > 100 > > > > > > > > > > UDFs > > > > > > > > > > > when using SQL, which is quite involved. > > > > > > > > > > > - FlatMap - e.g. “dataStrem.flatmap(flatMapFun)”. > > > > Similarly, > > > > > > it > > > > > > > > can > > > > > > > > > be > > > > > > > > > > > implemented with “table.join(udtf).select()”. > However, > > it > > > > is > > > > > > > > obvious > > > > > > > > > > that > > > > > > > > > > > dataStream is easier to use than SQL. > > > > > > > > > > > > > > > > > > > > > > Due to the above two reasons, some users have to use > the > > > > > > DataStream > > > > > > > > API > > > > > > > > > > or > > > > > > > > > > > the DataSet API. But when they do that, they lose the > > > > > unification > > > > > > > of > > > > > > > > > > batch > > > > > > > > > > > and streaming. They will also lose the sophisticated > > > > > > optimizations > > > > > > > > such > > > > > > > > > > as > > > > > > > > > > > codegen, aggregate join transpose and multi-stage agg > > from > > > > > Flink > > > > > > > SQL. > > > > > > > > > > > > > > > > > > > > > > We believe that enhancing the functionality and > > > productivity > > > > is > > > > > > > vital > > > > > > > > > for > > > > > > > > > > > the successful adoption of Table API. 
To this end, > Table > > > API > > > > > > still > > > > > > > > > > > requires more efforts from every contributor in the > > > > community. > > > > > We > > > > > > > see > > > > > > > > > > great > > > > > > > > > > > opportunity in improving our user’s experience from > this > > > > work. > > > > > > Any > > > > > > > > > > feedback > > > > > > > > > > > is welcome. > > > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > > > > > > > > > > > Jincheng > > > > > > > > > > > > > > > > > > > > > > jincheng sun <sunjincheng...@gmail.com> 于2018年11月1日周四 > > > > > 下午5:07写道: > > > > > > > > > > > > > > > > > > > > > >> Hi all, > > > > > > > > > > >> > > > > > > > > > > >> With the continuous efforts from the community, the > > Flink > > > > > system > > > > > > > has > > > > > > > > > > been > > > > > > > > > > >> continuously improved, which has attracted more and > more > > > > > users. > > > > > > > > Flink > > > > > > > > > > SQL > > > > > > > > > > >> is a canonical, widely used relational query language. > > > > > However, > > > > > > > > there > > > > > > > > > > are > > > > > > > > > > >> still some scenarios where Flink SQL failed to meet > user > > > > needs > > > > > > in > > > > > > > > > terms > > > > > > > > > > of > > > > > > > > > > >> functionality and ease of use, such as: > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> - > > > > > > > > > > >> > > > > > > > > > > >> In terms of functionality > > > > > > > > > > >> > > > > > > > > > > >> Iteration, user-defined window, user-defined join, > > > > > user-defined > > > > > > > > > > >> GroupReduce, etc. Users cannot express them with SQL; > > > > > > > > > > >> > > > > > > > > > > >> - > > > > > > > > > > >> > > > > > > > > > > >> In terms of ease of use > > > > > > > > > > >> - > > > > > > > > > > >> > > > > > > > > > > >> Map - e.g. “dataStream.map(mapFun)”. 
Although > > > > > > > > > “table.select(udf1(), > > > > > > > > > > >> udf2(), udf3()....)” can be used to accomplish > the > > > same > > > > > > > > > function., > > > > > > > > > > with a > > > > > > > > > > >> map() function returning 100 columns, one has to > > > define > > > > > or > > > > > > > call > > > > > > > > > > 100 UDFs > > > > > > > > > > >> when using SQL, which is quite involved. > > > > > > > > > > >> - > > > > > > > > > > >> > > > > > > > > > > >> FlatMap - e.g. “dataStrem.flatmap(flatMapFun)”. > > > > > Similarly, > > > > > > > it > > > > > > > > > can > > > > > > > > > > >> be implemented with “table.join(udtf).select()”. > > > > However, > > > > > > it > > > > > > > is > > > > > > > > > > obvious > > > > > > > > > > >> that datastream is easier to use than SQL. > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> Due to the above two reasons, some users have to use > the > > > > > > > DataStream > > > > > > > > > API > > > > > > > > > > or > > > > > > > > > > >> the DataSet API. But when they do that, they lose the > > > > > > unification > > > > > > > of > > > > > > > > > > batch > > > > > > > > > > >> and streaming. They will also lose the sophisticated > > > > > > optimizations > > > > > > > > > such > > > > > > > > > > as > > > > > > > > > > >> codegen, aggregate join transpose and multi-stage agg > > > from > > > > > > Flink > > > > > > > > SQL. > > > > > > > > > > >> > > > > > > > > > > >> We believe that enhancing the functionality and > > > productivity > > > > > is > > > > > > > > vital > > > > > > > > > > for > > > > > > > > > > >> the successful adoption of Table API. To this end, > > Table > > > > API > > > > > > > still > > > > > > > > > > >> requires more efforts from every contributor in the > > > > community. > > > > > > We > > > > > > > > see > > > > > > > > > > great > > > > > > > > > > >> opportunity in improving our user’s experience from > this > > > > work. 
> > > > > > Any > > > > > > > > > > feedback > > > > > > > > > > >> is welcome. > > > > > > > > > > >> > > > > > > > > > > >> Regards, > > > > > > > > > > >> > > > > > > > > > > >> Jincheng > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >