Re: [DISCUSS] Incrementally deprecating the DataSet API
Happy to help clarify, and certainly, I don't think anyone would reject the contribution if you wanted to add that functionality. The important point we want to communicate is that the Table API has matured to a level where we no longer see it as supplementary to DataStream but as an equal, first-class interface. Each is optimized toward making specific use cases easier - declarative definition of pipelines vs. fine-grained control - and neither needs to support every operation natively because it should be seamless to interoperate between them.

Seth

On Fri, Jul 9, 2021 at 10:35 AM Etienne Chauchot wrote:
> Hi Seth,
>
> Thanks for your comment.
>
> That is perfect then!
>
> If the community agrees to give precedence to the Table API connectors
> over the DataStream connectors as discussed here, all the features are
> already available.
>
> I will not need to port my latest additions to table connectors.
>
> Best
>
> Etienne
Re: [DISCUSS] Incrementally deprecating the DataSet API
Hi Seth,

Thanks for your comment.

That is perfect then!

If the community agrees to give precedence to the Table API connectors over the DataStream connectors as discussed here, all the features are already available.

I will not need to port my latest additions to table connectors.

Best

Etienne

On 08/07/2021 17:28, Seth Wiesman wrote:
> Hi Etienne,
>
> The `toDataStream` method supports converting to concrete Java types, not
> just Row, which can include your Avro specific records. See example 2:
>
> https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/data_stream_api/#examples-for-todatastream
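The Table API connector path Etienne refers to needs no DataStream-level format class at all. As a sketch (the table schema and path here are hypothetical; the 'filesystem' connector and 'avro' format options are the ones documented for the Flink 1.13 Table API):

```sql
-- Hypothetical Avro source defined purely in DDL; no equivalent of
-- ParquetColumnarRowInputFormat has to be written for Avro.
CREATE TABLE events (
  user_id BIGINT,
  payload STRING
) WITH (
  'connector' = 'filesystem',
  'path'      = 'file:///tmp/events',   -- hypothetical path
  'format'    = 'avro'
);
```

A pipeline can then read it with plain SQL (`SELECT * FROM events`) and, where fine-grained control is still needed, continue in the DataStream API via `tableEnv.toDataStream(...)`.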
Re: [DISCUSS] Incrementally deprecating the DataSet API
Hi Etienne,

The `toDataStream` method supports converting to concrete Java types, not just Row, which can include your Avro specific records. See example 2:

https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/data_stream_api/#examples-for-todatastream

On Thu, Jul 8, 2021 at 5:11 AM Etienne Chauchot wrote:
> Hi Timo,
>
> Thanks for your answers, no problem with the delay, I was on vacation
> too last week :)
>
> My comments are inline
>
> Best,
>
> Etienne
Re: [DISCUSS] Incrementally deprecating the DataSet API
Hi Timo,

Thanks for your answers, no problem with the delay, I was on vacation too last week :)

My comments are inline

Best,

Etienne

On 07/07/2021 16:48, Timo Walther wrote:
> Hi Etienne,
>
> sorry for the late reply due to my vacation last week.
>
> Regarding: "support of aggregations in batch mode for DataStream API
> [...] is there a plan to solve it before the actual drop of DataSet API"
>
> Just to clarify it again: we will not drop the DataSet API any time
> soon. So users will have enough time to update their pipelines. There
> are a couple of features missing to fully switch to DataStream API in
> batch mode. Thanks for opening an issue, this is helpful for us to
> gradually remove those barriers. They don't need to have a "Blocker"
> priority in JIRA for now.

Ok, I thought the drop was sooner, no problem then.

> But aggregations are a good example where we should discuss whether it
> would be easier to simply switch to Table API for that. Table API has a
> lot of aggregation optimizations and can work on binary data. Also,
> joins should be easier in Table API. DataStream API can be a very
> low-level API in the near future and most use cases (esp. the batch
> ones) should be possible in Table API.

Yes, sure. As a matter of fact, my point was to use the low-level DataStream API in a benchmark to compare with Table API, but I guess that is not common user behavior.

> Regarding: "Is it needed to port these Avro enhancements to new
> DataStream connectors (add a new equivalent of
> ParquetColumnarRowInputFormat but for Avro)"
>
> We should definitely not lose functionality. The same functionality
> should be present in the new connectors. The question is rather
> whether we need to offer a DataStream API connector or if a Table API
> connector would be nicer to use (also nicely integrated with catalogs).
>
> So a user can use a simple CREATE TABLE statement to configure the
> connector; an easier abstraction is almost not possible. With
> `tableEnv.toDataStream(table)` you can then continue in DataStream API
> if there is still a need for it.

Yes, I agree, there is no easier connector setup than CREATE TABLE, and with tableEnv.toDataStream(table), if one wanted to stay with DataStream (in my bench for example), it is still possible. And by the way, the doc for Parquet (1), for example, only mentions the Table API connector.

So I guess the table connector would be nicer for the user indeed. But if we use tableEnv.toDataStream(table), we would be able to produce only usual types like Row, Tuples or POJOs; we still need to add Avro support, right?

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/formats/parquet/

> Regarding: "there are parquet bugs still open on deprecated parquet
> connector"
>
> Yes, bugs should still be fixed in 1.13.

OK

> Regarding: "I've been doing TPCDS benchmarks with Flink lately"
>
> Great to hear that :-)

And congrats again on Blink's performance!

> Did you also see the recent discussion? A TPC-DS benchmark can further
> be improved by providing statistics. Maybe this is helpful to you:
>
> https://lists.apache.org/thread.html/ra383c23f230ab8e7fa16ec64b4f277c267d6358d55cc8a0edc77bb63%40%3Cuser.flink.apache.org%3E

No, I missed that thread, thanks for the pointer. I'll read it and comment if I have something to add.

> I will prepare a blog post shortly.

Good to hear :)

> Regards,
> Timo
Re: [DISCUSS] Incrementally deprecating the DataSet API
Hi Etienne,

sorry for the late reply due to my vacation last week.

Regarding: "support of aggregations in batch mode for DataStream API [...] is there a plan to solve it before the actual drop of DataSet API"

Just to clarify it again: we will not drop the DataSet API any time soon. So users will have enough time to update their pipelines. There are a couple of features missing to fully switch to DataStream API in batch mode. Thanks for opening an issue, this is helpful for us to gradually remove those barriers. They don't need to have a "Blocker" priority in JIRA for now.

But aggregations are a good example where we should discuss whether it would be easier to simply switch to Table API for that. Table API has a lot of aggregation optimizations and can work on binary data. Also, joins should be easier in Table API. DataStream API can be a very low-level API in the near future and most use cases (esp. the batch ones) should be possible in Table API.

Regarding: "Is it needed to port these Avro enhancements to new DataStream connectors (add a new equivalent of ParquetColumnarRowInputFormat but for Avro)"

We should definitely not lose functionality. The same functionality should be present in the new connectors. The question is rather whether we need to offer a DataStream API connector or if a Table API connector would be nicer to use (also nicely integrated with catalogs).

So a user can use a simple CREATE TABLE statement to configure the connector; an easier abstraction is almost not possible. With `tableEnv.toDataStream(table)` you can then continue in DataStream API if there is still a need for it.

Regarding: "there are parquet bugs still open on deprecated parquet connector"

Yes, bugs should still be fixed in 1.13.

Regarding: "I've been doing TPCDS benchmarks with Flink lately"

Great to hear that :-)

Did you also see the recent discussion? A TPC-DS benchmark can further be improved by providing statistics. Maybe this is helpful to you:

https://lists.apache.org/thread.html/ra383c23f230ab8e7fa16ec64b4f277c267d6358d55cc8a0edc77bb63%40%3Cuser.flink.apache.org%3E

I will prepare a blog post shortly.

Regards,
Timo
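The CREATE TABLE plus `toDataStream` combination Timo describes can be sketched as follows (table name, columns, and path are hypothetical; the 'filesystem' connector and 'parquet' format options follow the Flink 1.13 Table API docs):

```sql
-- Hypothetical Parquet source configured entirely through DDL.
CREATE TABLE fares (
  ride_id BIGINT,
  amount  DOUBLE
) WITH (
  'connector' = 'filesystem',
  'path'      = 'file:///tmp/fares',   -- hypothetical path
  'format'    = 'parquet'
);
```

From there, something like `Table t = tableEnv.sqlQuery("SELECT * FROM fares"); DataStream<Row> ds = tableEnv.toDataStream(t);` hands the rows over to the DataStream API when fine-grained control is still needed.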
Re: [DISCUSS] Incrementally deprecating the DataSet API
Hi all,

Any comments?

cheers,

Etienne

On 25/06/2021 15:09, Etienne Chauchot wrote:
> Hi everyone,
>
> @Timo, my comments are inline for steps 2, 4 and 5, please tell me
> what you think.
>
> Best
>
> Etienne
Re: [DISCUSS] Incrementally deprecating the DataSet API
Hi everyone, @Timo, my comments are inline for steps 2, 4 and 5, please tell me what you think. Best Etienne On 23/06/2021 15:27, Chesnay Schepler wrote: If we want to publicize this plan more shouldn't we have a rough timeline for when 2.0 is on the table? On 6/23/2021 2:44 PM, Stephan Ewen wrote: Thanks for writing this up, this also reflects my understanding. I think a blog post would be nice, ideally with an explicit call for feedback so we learn about user concerns. A blog post has a lot more reach than an ML thread. Best, Stephan On Wed, Jun 23, 2021 at 12:23 PM Timo Walther wrote: Hi everyone, I'm sending this email to make sure everyone is on the same page about slowly deprecating the DataSet API. There have been a few thoughts mentioned in presentations, offline discussions, and JIRA issues. However, I have observed that there are still some concerns or different opinions on what steps are necessary to implement this change. Let me summarize some of the steps and assumpations and let's have a discussion about it: Step 1: Introduce a batch mode for Table API (FLIP-32) [DONE in 1.9] Step 2: Introduce a batch mode for DataStream API (FLIP-134) [DONE in 1.12] I've been using DataSet API and I tested migrating to DataStream + batch mode. I opened this (1) ticket regarding the support of aggregations in batch mode for DataStream API. It seems that join operation (at least) does not work in batch mode even though I managed to implement a join using low level KeyedCoProcessFunction (thanks Seth, for the pointer !). => Should it be considered a blocker ? Is there a plan to solve it before the actual drop of DataSet API ? Maybe in step 6 ? [1] https://issues.apache.org/jira/browse/FLINK-22587 Step 3: Soft deprecate DataSet API (FLIP-131) [DONE in 1.12] We updated the documentation recently to make this deprecation even more visible. There is a dedicated `(Legacy)` label right next to the menu item now. 
We won't deprecate concrete classes of the API with a @Deprecated annotation to avoid extensive warnings in logs until then. Step 4: Drop the legacy SQL connectors and formats (FLINK-14437) [DONE in 1.14] We dropped code for ORC, Parque, and HBase formats that were only used by DataSet API users. The removed classes had no documentation and were not annotated with one of our API stability annotations. The old functionality should be available through the new sources and sinks for Table API and DataStream API. If not, we should bring them into a shape that they can be a full replacement. DataSet users are encouraged to either upgrade the API or use Flink 1.13. Users can either just stay at Flink 1.13 or copy only the format's code to a newer Flink version. We aim to keep the core interfaces (i.e. InputFormat and OutputFormat) stable until the next major version. We will maintain/allow important contributions to dropped connectors in 1.13. So 1.13 could be considered as kind of a DataSet API LTS release. I added several bug fixes and enhancements (avro support, automatic schema etc...) to parquet DataSet connector. After discussing with Jingsong and Arvid, we agreed to merge them to 1.13 in accordance to the fact that 1.13 is a LTS release receiving maintenance changes as you mentioned here. => Is it needed to port these Avro enhancements to new DataStream connectors (add a new equivalent of ParquetColumnarRowInputFormat but for Avro) ? IMHO opinion it is an important feature that the users will need. So, if I understand the plan correctly, we have until the release of 2.0 to implement it, right ? => Also there are parquet bugs still open on deprecated parquet connector: https://issues.apache.org/jira/browse/FLINK-21520, https://issues.apache.org/jira/browse/FLINK-21468, I think that the same applies, we should fix them on 1.13 right ? Step 5: Drop the legacy SQL planner (FLINK-14437) [DONE in 1.14] This included dropping support of DataSet API with SQL. 
That is a major point! I've been doing TPC-DS benchmarks with Flink lately by coding query 3 as a DataSet pipeline, a DataStream pipeline and a SQL pipeline. What I can tell is that when I migrated from the legacy SQL planner to the Blink SQL planner, I got two major improvements:

1. around 25% gain in run times on a 1 TB input dataset (even if the memory conf was slightly different between runs of the two planners)

2. global order support: with the legacy planner based on DataSet, only local partition ordering was supported. As a consequence, a SQL query with an ORDER BY clause actually produced wrong results. With the Blink planner based on DataStream, global order is supported and now the query results are correct!

=> congrats to everyone involved in these big SQL improvements!

Step 6: Connect both Table and DataStream API in batch mode (FLINK-20897)
[PLANNED in 1.14]

Step 7: Reach feature parity of Table API/DataStream API with DataSet API
[PLANNED for 1.14++]

We need to identify blockers when migrating from DataSet API to Table API/DataStream API. Here we need to establish a good feedback pipeline to include DataSet users in the roadmap planning.
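The global-order point can be illustrated without Flink: sorting each parallel partition locally and concatenating the results is not the same as one global sort. A small plain-Java sketch of the difference (data and names purely illustrative):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class OrderBySketch {

    // Per-partition ordering: each partition is sorted independently
    // and the partitions are simply concatenated afterwards.
    static List<Integer> locallySorted(List<List<Integer>> partitions) {
        return partitions.stream()
                .flatMap(p -> p.stream().sorted())
                .collect(Collectors.toList());
    }

    // Global ordering, as an ORDER BY clause requires: all records
    // are sorted together.
    static List<Integer> globallySorted(List<List<Integer>> partitions) {
        return partitions.stream()
                .flatMap(List::stream)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = List.of(List.of(5, 1), List.of(4, 2));
        System.out.println(locallySorted(partitions));  // [1, 5, 2, 4] -- not globally ordered
        System.out.println(globallySorted(partitions)); // [1, 2, 4, 5]
    }
}
```

This is why an ORDER BY that only produces per-partition ordering returns wrong results whenever parallelism is greater than one.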
Re: [DISCUSS] Incrementally deprecating the DataSet API
If we want to publicize this plan more, shouldn't we have a rough timeline for when 2.0 is on the table?
Re: [DISCUSS] Incrementally deprecating the DataSet API
Thanks for writing this up, this also reflects my understanding. I think a blog post would be nice, ideally with an explicit call for feedback so we learn about user concerns. A blog post has a lot more reach than an ML thread.

Best,
Stephan
[DISCUSS] Incrementally deprecating the DataSet API
Hi everyone,

I'm sending this email to make sure everyone is on the same page about slowly deprecating the DataSet API.

There have been a few thoughts mentioned in presentations, offline discussions, and JIRA issues. However, I have observed that there are still some concerns or different opinions on what steps are necessary to implement this change.

Let me summarize some of the steps and assumptions and let's have a discussion about it:

Step 1: Introduce a batch mode for Table API (FLIP-32)
[DONE in 1.9]

Step 2: Introduce a batch mode for DataStream API (FLIP-134)
[DONE in 1.12]

Step 3: Soft deprecate DataSet API (FLIP-131)
[DONE in 1.12]

We updated the documentation recently to make this deprecation even more visible. There is a dedicated `(Legacy)` label right next to the menu item now.

We won't deprecate concrete classes of the API with a @Deprecated annotation to avoid extensive warnings in logs until then.

Step 4: Drop the legacy SQL connectors and formats (FLINK-14437)
[DONE in 1.14]

We dropped code for ORC, Parquet, and HBase formats that were only used by DataSet API users. The removed classes had no documentation and were not annotated with one of our API stability annotations.

The old functionality should be available through the new sources and sinks for Table API and DataStream API. If not, we should bring them into a shape that they can be a full replacement.

DataSet users are encouraged to either upgrade the API or use Flink 1.13. Users can either just stay on Flink 1.13 or copy only the format's code to a newer Flink version. We aim to keep the core interfaces (i.e. InputFormat and OutputFormat) stable until the next major version.

We will maintain/allow important contributions to dropped connectors in 1.13. So 1.13 could be considered a kind of DataSet API LTS release.

Step 5: Drop the legacy SQL planner (FLINK-14437)
[DONE in 1.14]

This included dropping support of DataSet API with SQL.

Step 6: Connect both Table and DataStream API in batch mode (FLINK-20897)
[PLANNED in 1.14]

Step 7: Reach feature parity of Table API/DataStream API with DataSet API
[PLANNED for 1.14++]

We need to identify blockers when migrating from DataSet API to Table API/DataStream API. Here we need to establish a good feedback pipeline to include DataSet users in the roadmap planning.

Step 8: Drop the Gelly library

No concrete plan yet. Latest would be the next major Flink version, aka Flink 2.0.

Step 9: Drop DataSet API

Planned for the next major Flink version, aka Flink 2.0.

Please let me know if this matches your thoughts. We can also convert this into a blog post or mention it in the next release notes.

Regards,
Timo
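As a side note on Step 2: the batch execution mode introduced by FLIP-134 is selected through the `execution.runtime-mode` setting. A minimal sketch of a cluster-wide default (the same mode can also be set per job via `StreamExecutionEnvironment#setRuntimeMode` or on the command line with `flink run -Dexecution.runtime-mode=BATCH`):

```yaml
# flink-conf.yaml -- run DataStream programs in batch execution mode
# (valid values: STREAMING, BATCH, AUTOMATIC)
execution.runtime-mode: BATCH
```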