Hi all,

Thanks for the feedback. I enjoyed the discussions, especially the ones
between Fabian and Xiaowei; I think they revealed well the motivations
and design pros/cons behind this proposal. Enhancing the Table API will
not affect or limit the improvements to Flink SQL (or to DataStream).
Actually, Alibaba is the biggest fan of Flink SQL, and our contributions
to Flink SQL will not cease or diminish because of this.

As a supplement to the motivations, I would like to share some of our
experience and lessons learned at Alibaba. In the past 1-2 years we
moved most of our production jobs (including data analysis as well as
AI pipelines) onto the SQL API. Besides SQL, we also provide the Table
API to our users and have received a lot of interest and new requests.
We extended the Table API with some new functionalities, and it indeed
helped a lot (in terms of ease of use as well as performance) in many
cases. This motivated us to contribute it back to the community. Xiaowei
opened another email thread that lists a few things we are proposing to
add to the Table API as a first step. Please take a look. Let us move
the discussion of the design proposal to that thread.

Thanks,
Shaoxuan


On Tue, Nov 6, 2018 at 9:24 PM Piotr Nowojski <pi...@data-artisans.com>
wrote:

> Hi,
>
> What is our intended division/border between the Table API and DataSet
> or DataStream? If we want the Table API to drift away from SQL, that
> would be a valid question.
>
> > Another distinguishing feature of DataStream API is that users get direct
> > access to state/statebackend which we intentionally avoided in Table API
>
> Do we really want to make Table API an equivalent of DataSet/DataStream
> API but without state? Drawing the boundary in such a way would make it
> more difficult for users to pick the right tool if for many use cases
> they could use both. What if at some point they come to the conclusion
> that they need a little state access for something? If that is our
> intended end-goal separation between Table API and DataStream API, it
> would be very weird to have two very similar APIs that have tons of
> small differences but are basically equivalent modulo state access.
>
> Maybe instead of duplicating efforts and work between our different
> APIs we should focus more on either interoperability or unifying them?
>
> For example, if we would like to differentiate those APIs because of
> the presence/lack of an optimiser, maybe the APIs should be the same,
> but there should be a way to tell whether a UDF/operator is
> deterministic, has side effects, etc. And if such an operator is found
> in the plan, the nodes below and above could still be subject to
> regular optimisation rules.
>
> Piotrek
>
> > On 6 Nov 2018, at 14:04, Fabian Hueske <fhue...@gmail.com> wrote:
> >
> > Hi,
> >
> > An analysis of orthogonal functions would be great!
> > There is certainly some overlap in the functions provided by the DataSet
> > API.
> >
> > In the past, I found that having low-level functions helped a lot to
> > efficiently implement complex logic.
> > Without partitionByHash, sortPartition, sort, mapPartition, combine,
> > etc. it would not be possible to (efficiently) implement certain
> > operators for the Table API, SQL, or the Cascading-On-Flink port that
> > I did a while back.
> > I could imagine that these APIs would be useful to implement DSLs on
> > top of the Table API, such as Gelly.
> >
> > Anyway, I completely agree that these physical operators should not be
> the
> > first step.
> > If we find that these methods are not needed, even better!
> >
> > Let's try to keep this thread focused on the general proposal of
> extending
> > the scope of the Table API, and keep the discussion of the concrete
> > proposal that Xiaowei shared in the other thread (and the design doc).
> > That will help to keep all related comments in one place ;-)
> >
> > Best, Fabian
> >
> >
> > On Tue, Nov 6, 2018 at 1:01 PM jincheng sun
> > <sunjincheng...@gmail.com> wrote:
> >
> >> Hi Fabian,
> >> Thank you for your deep thoughts on this. I think most of the
> >> questions you mentioned are well worth in-depth discussion! I want
> >> to share my thoughts on the following questions:
> >>
> >> 1. Do we need to move all DataSet API functionality into the Table API?
> >> I think most of the DataSet functionality should be added to the
> >> Table API, such as map, flatMap, groupReduce, etc., because these
> >> are very easy for users to use.
> >>
> >> 2. Do we support explicit physical operations like partitioning,
> >> sorting, or optimizer hints?
> >> I think we do not want to add the physical operations, e.g.
> >> sortPartition, partitionCustom, etc. From my point of view, those
> >> physical operations are used for optimization, which can be solved
> >> by hints (I think we should add a hints feature to both the Table
> >> API and SQL).
> >>
> >> 3. Do we want to support retractions in iteration?
> >> I think supporting iteration is a very complicated feature. I am not
> >> sure, but I think the iteration may be implemented following the
> >> current batch mode, with retraction temporarily not supported,
> >> assuming that the training data will not be updated within the
> >> current iteration. The updated data will be used in the next
> >> iteration. So I think we should have an in-depth discussion in a new
> >> thread.
> >>
> >> BTW, I see that you have left very useful comments in the Google
> >> doc:
> >>
> >>
> https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit#
> >>
> >> Thanks again for both your mail feedback and doc comments!
> >>
> >> Best,
> >> Jincheng
> >>
> >>
> >>
> >> On Tue, Nov 6, 2018 at 6:21 PM Fabian Hueske <fhue...@gmail.com> wrote:
> >>
> >>> Thanks for the replies Xiaowei and others!
> >>>
> >>> You are right, I did not consider the batch optimization that would
> >>> be missing if the DataSet API were ported to extend the DataStream
> >>> API. By extending the scope of the Table API, we can gain holistic
> >>> logical & physical optimization, which would be great!
> >>> Is your plan to move all DataSet API functionality into the Table API?
> >>> If so, do you envision any batch-related API in DataStream at all or
> >> should
> >>> this be done by converting a batch table to DataStream? I'm asking
> >> because
> >>> if there would be batch features in DataStream, we would need some
> >>> optimization there as well.
> >>>
> >>> I think the proposed separation of Table API (stateless APIs) and
> >>> DataStream (APIs that expose state handling) is a good idea.
> >>> On a side note, the DataSet API discouraged state handling in user
> >>> functions, so porting this to the Table API would be quite
> >>> "natural".
> >>>
> >>> As I said before, I like that we can incrementally extend the Table
> API.
> >>> Map and FlatMap functions do not seem too difficult.
> >>> Reduce, GroupReduce, Combine, GroupCombine, MapPartition might be more
> >>> tricky, esp. if we want to support retractions.
> >>> Iterations should be a challenge. I assume that Calcite does not
> support
> >>> iterations, so we probably need to split query / program and optimize
> >> parts
> >>> separately (IIRC, this is also how Flink's own optimizer handles this).
> >>> To what extent are you planning to support explicit physical
> >>> operations like partitioning, sorting or optimizer hints?
> >>>
> >>> I haven't had a look at the design document that you shared yet.
> >>> Probably I'll find answers to some of my questions there ;-)
> >>>
> >>> Regarding the question of SQL or Table API, I agree that extending the
> >>> scope of the Table API does not limit the scope for SQL.
> >>> By adding more operations to the Table API we can expand it to use
> >>> cases that are not well-served by SQL.
> >>> As others have said, we'll of course continue to extend and improve
> >> Flink's
> >>> SQL support (within the bounds of the standard).
> >>>
> >>> Best, Fabian
> >>>
> >>> On Tue, Nov 6, 2018 at 10:09 AM jincheng sun
> >>> <sunjincheng...@gmail.com> wrote:
> >>>
> >>>> Hi Jark,
> >>>> Glad to see your feedback!
> >>>> That's correct. The proposal aims to extend the functionality of
> >>>> the Table API! I like adding "drop" to fit the use case you
> >>>> mentioned. Not only that: given a 100-column Table whose UDF needs
> >>>> all 100 columns, we don't want to define the eval method as
> >>>> eval(column0...column99); we prefer to define it as eval(Row) and
> >>>> use it like this: table.select(udf(*)). We also need to consider
> >>>> whether to package the columns as a Row. In a scenario like this,
> >>>> we classify it as a column operation, and will list the changes to
> >>>> the column operations after the map/flatMap/agg/flatAgg phase is
> >>>> completed. Currently, Xiaowei has started another thread outlining
> >>>> what we are proposing. Please see the details in the mail thread:
> >>>>
> >>>>
> >>>
> >>
> https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
> >>>>
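To make the eval(Row) idea above concrete, here is a rough sketch. The Row-typed eval signature and the `udf(*)` shorthand are part of the proposal, not existing Flink API; the class name and field indices are made up for illustration:

```java
// Sketch of the proposed style: a scalar UDF that takes the whole row
// instead of 100 separate column parameters. The Row-typed eval and the
// "udf(*)" call are proposed API, not something Flink supports today.
public class MyRowUdf extends ScalarFunction {
    public String eval(Row row) {
        // any of the 100 columns is reachable without a
        // 100-parameter eval(column0, ..., column99) signature
        return row.getField(0) + "-" + row.getField(99);
    }
}

// Proposed usage: pass all columns at once
// table.select(udf(*))
```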
> >>>> At this stage, the Table API Enhancement Outline is as follows:
> >>>>
> >>>>
> >>>
> >>
> https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing
> >>>>
> >>>> Please let us know if you have further thoughts or feedback!
> >>>>
> >>>> Thanks,
> >>>> Jincheng
> >>>>
> >>>>
> >>>> On Tue, Nov 6, 2018 at 3:35 PM Jark Wu <imj...@gmail.com> wrote:
> >>>>
> >>>>> Hi Jincheng,
> >>>>>
> >>>>> Thanks for your proposal. I think it is a helpful enhancement and
> >>>>> a solid step forward for the Table API.
> >>>>> It doesn't weaken SQL or DataStream, because the conversion
> >>>>> between DataStream and Table still works.
> >>>>> People with advanced cases (e.g. complex and fine-grained state
> >>>>> control) can go with DataStream, but most general cases can stay
> >>>>> in the Table API. This work aims to extend the functionality of
> >>>>> the Table API, to broaden its usage scenarios, and to help the
> >>>>> Table API become a more widely used API.
> >>>>>
> >>>>> For example, someone may want to drop one column from a
> >>>>> 100-column Table. Currently, we have to convert the Table to a
> >>>>> DataStream and use a MapFunction to do that, or select the
> >>>>> remaining 99 columns using the Table.select API.
> >>>>> But if we support a Table.drop() method in the Table API, it will
> >>>>> be a very convenient method and let users stay in the Table API.
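A quick sketch of the contrast (Table.drop() is the proposed method, not existing API; the column names are made up):

```java
// Today: keep 99 of 100 columns by enumerating them all in select()
table.select("col0, col1, col2 /* ..., */, col98");

// Proposed: name only the column to remove
table.drop("col99");
```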
> >>>>>
> >>>>> Looking forward to the more detailed design and further discussion.
> >>>>>
> >>>>> Regards,
> >>>>> Jark
> >>>>>
> >>>>> On Tue, Nov 6, 2018 at 1:05 PM jincheng sun
> >>>>> <sunjincheng...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi Rong Rong,
> >>>>>>
> >>>>>> Sorry for the late reply, and thanks for your feedback! We will
> >>>>>> continue to add more convenience features to the Table API, such
> >>>>>> as map, flatMap, agg, flatAgg, iteration, etc. And I am very
> >>>>>> happy that you are interested in this proposal. As this is
> >>>>>> long-term, continuous work, we will push it forward in stages.
> >>>>>> Currently, Xiaowei has started a thread outlining what we are
> >>>>>> proposing. Please see the details in the mail thread:
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
> >>>>>>
> >>>>>> The Table API Enhancement Outline is as follows:
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing
> >>>>>>
> >>>>>> Please let us know if you have further thoughts or feedback!
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Jincheng
> >>>>>>
> >>>>>> On Mon, Nov 5, 2018 at 7:03 PM Fabian Hueske <fhue...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Jincheng,
> >>>>>>>
> >>>>>>> Thanks for this interesting proposal.
> >>>>>>> I like that we can push this effort forward in a very
> >> fine-grained
> >>>>>> manner,
> >>>>>>> i.e., incrementally adding more APIs to the Table API.
> >>>>>>>
> >>>>>>> However, I also have a few questions / concerns.
> >>>>>>> Today, the Table API is tightly integrated with the DataSet and
> >>>>>>> DataStream APIs. It is very easy to convert a Table into a
> >>>>>>> DataSet or DataStream and vice versa. This means it is already
> >>>>>>> easy to combine custom logic and relational operations. What I
> >>>>>>> like is that several aspects are clearly separated, like
> >>>>>>> retraction and timestamp handling (see below), and all
> >>>>>>> libraries on DataStream/DataSet can be easily combined with
> >>>>>>> relational operations.
> >>>>>>> I can see that adding more functionality to the Table API would
> >>>>>>> remove the distinction between DataSet and DataStream. However,
> >>>>>>> wouldn't we get a similar benefit by extending the DataStream
> >>>>>>> API with proper support for bounded streams (as is the
> >>>>>>> long-term goal of Flink)?
> >>>>>>> I'm also a bit skeptical about the optimization opportunities
> >>>>>>> we would gain. Map/FlatMap UDFs are black boxes that cannot be
> >>>>>>> easily removed without additional information (I did some
> >>>>>>> research on this a few years ago [1]).
> >>>>>>>
> >>>>>>> Moreover, I think there are a few tricky details that need to be
> >>>>> resolved
> >>>>>>> to enable a good integration.
> >>>>>>>
> >>>>>>> 1) How to deal with retraction messages? The DataStream API
> >>>>>>> does not have a notion of retractions. How would a MapFunction
> >>>>>>> or FlatMapFunction handle retraction? Do they need to be aware
> >>>>>>> of the change flag? Custom windowing and aggregation logic
> >>>>>>> would certainly need to have that information.
> >>>>>>> 2) How to deal with timestamps? The DataStream API does not
> >>>>>>> give access to timestamps. In the Table API / SQL these are
> >>>>>>> exposed as regular attributes. How can we ensure that timestamp
> >>>>>>> attributes remain valid (i.e. aligned with watermarks) if the
> >>>>>>> output is produced by arbitrary code?
> >>>>>>> There might be more issues of this kind.
> >>>>>>>
> >>>>>>> My main question would be how much we would gain with this
> >>>>>>> proposal over a tight integration of the Table API and
> >>>>>>> DataStream API, assuming that batch functionality is moved to
> >>>>>>> DataStream?
> >>>>>>>
> >>>>>>> Best, Fabian
> >>>>>>>
> >>>>>>> [1]
> >>> http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Nov 5, 2018 at 2:49 AM Rong Rong <walter...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Jincheng,
> >>>>>>>>
> >>>>>>>> Thank you for the proposal! I think being able to define a
> >>>>>>>> process / co-process function in the Table API definitely
> >>>>>>>> opens up a whole new level of applications using a unified
> >>>>>>>> API.
> >>>>>>>>
> >>>>>>>> In addition, as Tzu-Li and Hequn have mentioned, the
> >>>>>>>> optimization layer of the Table API will already bring
> >>>>>>>> additional benefit over directly programming on top of the
> >>>>>>>> DataStream/DataSet APIs. I am very interested in and looking
> >>>>>>>> forward to seeing the support for more complex use cases,
> >>>>>>>> especially iterations. It will enable the Table API to address
> >>>>>>>> much broader, event-driven use cases such as real-time ML
> >>>>>>>> prediction/training.
> >>>>>>>>
> >>>>>>>> As Timo mentioned, this will make the Table API diverge from
> >>>>>>>> the SQL API. But in my experience the Table API always gave me
> >>>>>>>> the impression of being a more sophisticated, syntax-aware way
> >>>>>>>> to express relational operations.
> >>>>>>>> Looking forward to further discussion and collaborations on the
> >>>> FLIP
> >>>>>> doc.
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Rong
> >>>>>>>>
> >>>>>>>> On Sun, Nov 4, 2018 at 5:22 PM jincheng sun <
> >>>>> sunjincheng...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi tison,
> >>>>>>>>>
> >>>>>>>>> Thanks a lot for your feedback!
> >>>>>>>>> I am very happy to see that community contributors agree to
> >>>>>>>>> enhancing the Table API. This is long-term, continuous work;
> >>>>>>>>> we will push it forward in stages. We will soon complete the
> >>>>>>>>> enhancement list of the first phase, and then we can go into
> >>>>>>>>> deep discussion in the Google doc. Thanks again for joining
> >>>>>>>>> this very important discussion of the Flink Table API.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Jincheng
> >>>>>>>>>
> >>>>>>>>> On Fri, Nov 2, 2018 at 1:49 PM Tzu-Li Chen
> >>>>>>>>> <wander4...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Jincheng,
> >>>>>>>>>>
> >>>>>>>>>> Thanks a lot for your proposal! I find it a good starting
> >>>>>>>>>> point for internal optimization work that helps Flink to be
> >>>>>>>>>> more user-friendly.
> >>>>>>>>>>
> >>>>>>>>>> AFAIK, DataStream is currently the most popular API, with
> >>>>>>>>>> which Flink users describe their logic in detail. From a
> >>>>>>>>>> more internal view, the conversion from DataStream to
> >>>>>>>>>> JobGraph is quite mechanical and hard to optimize. So when
> >>>>>>>>>> users program with DataStream, they have to learn more
> >>>>>>>>>> internals and spend a lot of time tuning for performance.
> >>>>>>>>>> With your proposal, we provide enhanced functionality in the
> >>>>>>>>>> Table API, so that users can describe their job easily at
> >>>>>>>>>> the Table level. This gives Flink developers an opportunity
> >>>>>>>>>> to introduce an optimization phase while transforming the
> >>>>>>>>>> user program (described with the Table API) into the
> >>>>>>>>>> internal representation.
> >>>>>>>>>>
> >>>>>>>>>> A user who wants to start using Flink for simple ETL,
> >>>>>>>>>> pipelining, or analytics would find it most naturally
> >>>>>>>>>> described by the SQL/Table API. Further, as mentioned by
> >>>>>>>>>> @hequn,
> >>>>>>>>>>
> >>>>>>>>>> SQL is a widely used language. It follows standards, is a
> >>>>>>>>>>> descriptive language, and is easy to use
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> thus we could expect that, with the enhancement of the
> >>>>>>>>>> SQL/Table API, Flink becomes more friendly to users.
> >>>>>>>>>>
> >>>>>>>>>> Looking forward to the design doc/FLIP!
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> tison.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Nov 2, 2018 at 11:46 AM jincheng sun
> >>>>>>>>>> <sunjincheng...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Hequn,
> >>>>>>>>>>> Thanks for your feedback! And also thanks for our offline
> >>>>>>>>>>> discussion! You are right, unification of batch and
> >>>>>>>>>>> streaming is very important for the Flink API.
> >>>>>>>>>>> We will provide a more detailed design later. Please let me
> >>>>>>>>>>> know if you have further thoughts or feedback.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Jincheng
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Nov 2, 2018 at 10:02 AM Hequn Cheng
> >>>>>>>>>>> <chenghe...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Jincheng,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks a lot for your proposal. It is very encouraging!
> >>>>>>>>>>>>
> >>>>>>>>>>>> As we all know, SQL is a widely used language. It follows
> >>>>>>>>>>>> standards, is a descriptive language, and is easy to use.
> >>>>>>>>>>>> A powerful feature of SQL is that it supports
> >>>>>>>>>>>> optimization. Users only need to care about the logic of
> >>>>>>>>>>>> the program; the underlying optimizer will help users
> >>>>>>>>>>>> optimize the program's performance. However, in terms of
> >>>>>>>>>>>> functionality and ease of use, SQL will be limited in some
> >>>>>>>>>>>> scenarios, as described in Jincheng's proposal.
> >>>>>>>>>>>> Correspondingly, the DataStream/DataSet APIs can provide
> >>>>>>>>>>>> powerful functionalities. Users can write a
> >>>>>>>>>>>> ProcessFunction/CoProcessFunction and get access to
> >>>>>>>>>>>> timers. Compared with SQL, they provide more functionality
> >>>>>>>>>>>> and flexibility. However, they do not support optimization
> >>>>>>>>>>>> like SQL. Meanwhile, the DataStream/DataSet APIs have not
> >>>>>>>>>>>> been unified, which means that, for the same logic, users
> >>>>>>>>>>>> need to write one job for streaming and one for batch.
> >>>>>>>>>>>>
> >>>>>>>>>>>> With the Table API, I think we can combine the advantages
> >>>>>>>>>>>> of both: users can easily write relational operations and
> >>>>>>>>>>>> enjoy optimization, while at the same time it supports
> >>>>>>>>>>>> more functionality and ease of use. Looking forward to the
> >>>>>>>>>>>> detailed design/FLIP.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Hequn
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang <
> >>>>>>> wshaox...@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Aljoscha,
> >>>>>>>>>>>>> Glad that you like the proposal. We have completed the
> >>>>>>>>>>>>> prototype of most of the newly proposed functionalities.
> >>>>>>>>>>>>> Once we collect the feedback from the community, we will
> >>>>>>>>>>>>> come up with a concrete FLIP/design doc.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Shaoxuan
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek <
> >>>>>>>>> aljos...@apache.org
> >>>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Jincheng,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> these points sound very good! Are there any
> >> concrete
> >>>>>>> proposals
> >>>>>>>>> for
> >>>>>>>>>>>>>> changes? For example a FLIP/design document?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> See here for FLIPs:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Aljoscha
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 1. Nov 2018, at 12:51, jincheng sun
> >>>>>>>>>>>>>>>> <sunjincheng...@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *--------I am sorry for the formatting of the
> >> email
> >>>>>>> content.
> >>>>>>>> I
> >>>>>>>>>>>> reformat
> >>>>>>>>>>>>>>> the **content** as follows-----------*
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *Hi ALL,*
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> With the continuous efforts from the community, the
> >>>>>>>>>>>>>>> Flink system has been continuously improved, which has
> >>>>>>>>>>>>>>> attracted more and more users. Flink SQL is a
> >>>>>>>>>>>>>>> canonical, widely used relational query language.
> >>>>>>>>>>>>>>> However, there are still some scenarios where Flink SQL
> >>>>>>>>>>>>>>> fails to meet user needs in terms of functionality and
> >>>>>>>>>>>>>>> ease of use, such as:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *1. In terms of functionality*
> >>>>>>>>>>>>>>>   Iteration, user-defined window, user-defined join,
> >>>>>>>>>>>>>>>   user-defined GroupReduce, etc. Users cannot express
> >>>>>>>>>>>>>>>   them with SQL;
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *2. In terms of ease of use*
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>  - Map - e.g. “dataStream.map(mapFun)”. Although
> >>>>>>>>>>>>>>>  “table.select(udf1(), udf2(), udf3()....)” can be used
> >>>>>>>>>>>>>>>  to accomplish the same functionality, with a map()
> >>>>>>>>>>>>>>>  function returning 100 columns, one has to define or
> >>>>>>>>>>>>>>>  call 100 UDFs when using SQL, which is quite involved.
> >>>>>>>>>>>>>>>  - FlatMap - e.g. “dataStream.flatmap(flatMapFun)”.
> >>>>>>>>>>>>>>>  Similarly, it can be implemented with
> >>>>>>>>>>>>>>>  “table.join(udtf).select()”. However, it is obvious
> >>>>>>>>>>>>>>>  that dataStream is easier to use than SQL.
> >>>>>>>>>>>>>>>
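A sketch of this contrast (the Table.map() method and mapFun are part of the proposal, not existing API; udf1()...udf100() stand in for 100 user-defined functions):

```java
// Proposed: one function produces all 100 output columns
table.map(mapFun);

// Today: one UDF/expression per output column
table.select(udf1(), udf2(), udf3() /* , ..., udf100() */);
```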
> >>>>>>>>>>>>>>> Due to the above two reasons, some users have to use
> >>>>>>>>>>>>>>> the DataStream API or the DataSet API. But when they do
> >>>>>>>>>>>>>>> that, they lose the unification of batch and streaming.
> >>>>>>>>>>>>>>> They will also lose the sophisticated optimizations
> >>>>>>>>>>>>>>> such as codegen, aggregate join transpose and
> >>>>>>>>>>>>>>> multi-stage agg from Flink SQL.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We believe that enhancing the functionality and
> >>>>>>>>>>>>>>> productivity is vital for the successful adoption of
> >>>>>>>>>>>>>>> the Table API. To this end, the Table API still
> >>>>>>>>>>>>>>> requires more efforts from every contributor in the
> >>>>>>>>>>>>>>> community. We see great opportunity in improving our
> >>>>>>>>>>>>>>> users’ experience from this work. Any feedback is
> >>>>>>>>>>>>>>> welcome.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Jincheng
> >>>>>>>>>>>>>>>
