Re: [DISCUSS] Table API / SQL features for Flink 1.4.0
Hi everybody,

Shaoxuan, Timo, and I compiled a list of features from the replies to this thread, from features that didn't make it into 1.3, and some additional ones. We also graded them by importance, tried to assess the effort, and added links to JIRAs (some already existed, others were created) and to existing PRs.

Please have a look at the list and give feedback if you want to add a feature or if you disagree with the importance or effort assessment, either by replying to this thread or by commenting on the document:
-> https://docs.google.com/document/d/1I7YuF6lfxyuyPIve_VtLDNMfSKcepZJbnZfVueqlQN4

Thanks,
Fabian
Re: [DISCUSS] Table API / SQL features for Flink 1.4.0
Hi Haohui,

thanks for your input!

Can you describe the semantics of the join you'd like to see in Flink 1.4? I can think of three types of joins that match your description:

1) `table` is an external table stored in an external database (Redis, Cassandra, MySQL, etc.) and we join with the current value in that table. This could be implemented with an async TableFunction (based on Flink's AsyncFunction).
2) `table` is static: in this case we need support for side inputs to read the whole table before starting to process the other (streaming) side. There is a FLIP [1] for side inputs. I don't know the status of this feature, though.
3) `table` is changing and each record of the stream should be joined with the most recent update (but no future updates). In this case, the query is more complex to express and requires some time-bound logic that is quite cumbersome to express in SQL. I think this is a very important type of join, but IMO it is more challenging to implement than the others. We also had a discussion about this type of join on the dev ML a few months back [2].

Which type of join are you looking for (external table, static table, dynamic table)?

Regarding the bottleneck of committers, the situation should become a bit better in the near future as we have two more committers working on the relational APIs (Jark is spending more time here and Shaoxuan recently became a committer). However, we will of course continue to encourage and help contributors to earn the merits to become committers and grow the number of committers.

Thank you very much,
Fabian

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/STREAM-SQL-inner-queries-tp15585.html
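To make the semantics of join type 3 concrete, here is a minimal, Flink-independent Python sketch (the function name and data layout are hypothetical, not an actual Flink API): each stream record is joined with the latest table version at or before its timestamp, and never with a future update.

```python
import bisect

def temporal_join(stream, table_updates):
    """Join each stream record with the most recent table row for its
    key as of the record's timestamp -- never with a future update.

    stream:        list of (ts, key, value) tuples, sorted by ts
    table_updates: list of (ts, key, row) tuples (the table's changelog),
                   sorted by ts
    """
    # Build a per-key version history: key -> list of (ts, row), in ts order.
    history = {}
    for ts, key, row in table_updates:
        history.setdefault(key, []).append((ts, row))

    results = []
    for ts, key, value in stream:
        versions = history.get(key, [])
        times = [t for t, _ in versions]
        # Index of the latest table version with update time <= ts.
        idx = bisect.bisect_right(times, ts) - 1
        row = versions[idx][1] if idx >= 0 else None
        results.append((ts, key, value, row))
    return results
```

A record arriving before the first update for its key joins with nothing (`None`); this is what makes the join time-bound rather than a plain lookup against the table's latest state.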
Re: [DISCUSS] Table API / SQL features for Flink 1.4.0
Hi,

We are interested in building the simplest case of stream-table joins -- essentially calling stream.map(x => (x, table.get(x))). It solves the use case of enriching a stream with information from the database. The operation itself can be batched for better performance.

We are happy to contribute to the scalar functions as well, as we internally also share similar requirements.

Fabian mentioned that the development of the Table / SQL API was bottlenecked by committers, which shows that there is thriving development happening in this space. I think it is a good problem to have. :-) I wonder, is it a good time to nominate a new batch of committers to keep the momentum of development?

Regards,
Haohui
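The semantics Haohui describes, including the batching, can be sketched in a few lines of Flink-independent Python (all names are hypothetical; `lookup_batch` stands in for one round trip to the external table, which in Flink would more likely be an async TableFunction):

```python
def enrich(stream, lookup_batch, batch_size=100):
    """Enrich a stream of keys with table values, batching lookups.

    stream:       iterable of keys
    lookup_batch: function taking a list of keys and returning a
                  {key: value} dict for the keys it finds (a stand-in
                  for a single round trip to the external table)
    Yields (key, value) pairs; value is None for keys not in the table.
    """
    batch = []
    for key in stream:
        batch.append(key)
        if len(batch) >= batch_size:
            found = lookup_batch(batch)
            for k in batch:
                yield (k, found.get(k))
            batch = []
    # Flush the final, possibly partial batch.
    if batch:
        found = lookup_batch(batch)
        for k in batch:
            yield (k, found.get(k))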
Re: [DISCUSS] Table API / SQL features for Flink 1.4.0
Hi Fabian,

Thanks for bringing up this discussion. To enrich Flink's built-in scalar functions and make the user experience friendlier, I recommend adding as many scalar functions as possible to the 1.4 release. I have filed the JIRAs (https://issues.apache.org/jira/browse/FLINK-6810) and will try my best to work on them.

Of course, anybody is welcome to add sub-tasks or take the JIRAs.

Cheers,
SunJincheng
Re: [DISCUSS] Table API / SQL features for Flink 1.4.0
Thanks for your response Shaoxuan,

My "Table-table join with retraction" is probably the same as your "unbounded stream-stream join with retraction": basically, a join between two dynamic tables with unique keys (either because of an upsert stream->table conversion or an unbounded aggregation).

Best,
Fabian
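One possible reading of the semantics of this table-table join -- two keyed tables where each update retracts the previous join result before emitting the new one -- as a Flink-independent Python sketch (all names are hypothetical):

```python
def keyed_join_with_retraction(changelog):
    """Join two dynamic tables that share a unique key.

    changelog: iterable of (side, key, row) upserts, side in {'L', 'R'}.
    Returns the emitted result changelog as ('+'|'-', key, l_row, r_row)
    tuples: when a side is updated and both sides have a row for the key,
    the previous join result (if any) is retracted before the new one
    is added.
    """
    left, right = {}, {}
    out = []
    for side, key, row in changelog:
        this, other = (left, right) if side == 'L' else (right, left)
        if key in this and key in other:
            # Retract the join result produced by the previous row.
            pair = (this[key], other[key]) if side == 'L' else (other[key], this[key])
            out.append(('-', key) + pair)
        this[key] = row
        if key in other:
            pair = (row, other[key]) if side == 'L' else (other[key], row)
            out.append(('+', key) + pair)
    return out
```

The retraction is what lets downstream operators stay correct under updates: without it, an update to either input would leave a stale join result in the output.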
Re: [DISCUSS] Table API / SQL features for Flink 1.4.0
Nice timing, Fabian!

Your checklist aligns with our plans very well. Here are the things we are working on and planning to contribute to release 1.4:
1. DDL (with a watermark config property for the source table, and an emit config on the result table)
2. unbounded stream-stream joins (with retraction supported)
3. a backend state user interface for UDAGG
4. UDOP (as opposed to UDF (scalar to scalar) / UDTF (scalar to table) / UDAGG (table to scalar), this allows users to define table-to-table conversion business logic)

Some of them already have a PR/JIRA, while some do not. We will send out the design docs for the missing ones very soon. Looking forward to the 1.4 release.

Btw, what is the "Table-Table (with retraction)" you have mentioned in your plan?

Regards,
Shaoxuan

On Thu, Jun 15, 2017 at 10:29 PM, Fabian Hueske wrote:

> Hi everybody,
>
> I would like to start a discussion about the targeted feature set of the
> Table API / SQL for Flink 1.4.0. Flink 1.3.0 was released about 2 weeks
> ago and we have 2.5 months (~11 weeks, until the beginning of September)
> left until the feature freeze for Flink 1.4.0.
>
> I think it makes sense to start with a collection of desired features.
> Once we have a list of requested features, we might want to prioritize
> and maybe also assign responsibilities.
>
> When we prioritize, we should keep in mind that:
> - we want to have a consistent API; larger features should be developed
>   in a feature branch first
> - the next months are a typical time for vacations
> - we have been bottlenecked by committer resources in the last release
>
> I think the following features would be a nice addition to the current
> state:
>
> - Conversion of a stream into an upsert table (with retraction, updating
>   to the last row per key)
> - Joins for streaming tables:
>   - Stream-Stream (time-range predicate): there is already a PR for
>     processing-time joins
>   - Table-Table (with retraction)
> - Support for late-arriving records in group window aggregations
> - Exposing a keyed result table as queryable state
>
> Which features are others looking for?
>
> Cheers,
> Fabian
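The first item in the quoted list -- converting a stream into an upsert table with retraction, updating to the last row per key -- can be illustrated with a small Flink-independent Python sketch (names are hypothetical):

```python
def upsert_to_changelog(stream):
    """Convert an upsert stream into a retraction changelog: each key
    keeps only its latest row, and updating a key first retracts the
    previously emitted row for that key.

    stream: iterable of (key, row) upserts
    Returns the emitted changelog as ('+'|'-', key, row) tuples.
    """
    current = {}
    changelog = []
    for key, row in stream:
        if key in current:
            # The key was seen before: retract the old row first.
            changelog.append(('-', key, current[key]))
        current[key] = row
        changelog.append(('+', key, row))
    return changelog
```

At any point, summing the `+` and `-` entries per key leaves exactly the latest row for that key, which is what makes the resulting dynamic table safe to consume for downstream aggregations and joins.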