Hi all,

Any comments?

cheers,

Etienne

On 25/06/2021 15:09, Etienne Chauchot wrote:
Hi everyone,

@Timo, my comments are inline for steps 2, 4 and 5, please tell me what you think.

Best

Etienne


On 23/06/2021 15:27, Chesnay Schepler wrote:
If we want to publicize this plan more, shouldn't we have a rough timeline for when 2.0 is on the table?

On 6/23/2021 2:44 PM, Stephan Ewen wrote:
Thanks for writing this up, this also reflects my understanding.

I think a blog post would be nice, ideally with an explicit call for
feedback so we learn about user concerns.
A blog post has a lot more reach than an ML thread.

Best,
Stephan


On Wed, Jun 23, 2021 at 12:23 PM Timo Walther <twal...@apache.org> wrote:

Hi everyone,

I'm sending this email to make sure everyone is on the same page about
slowly deprecating the DataSet API.

There have been a few thoughts mentioned in presentations, offline
discussions, and JIRA issues. However, I have observed that there are
still some concerns or different opinions on what steps are necessary to
implement this change.

Let me summarize some of the steps and assumptions, and let's have a
discussion about them:

Step 1: Introduce a batch mode for Table API (FLIP-32)
[DONE in 1.9]
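
As a reminder, enabling batch mode in the Table API is just a matter of the environment settings; a minimal sketch:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    // Create a TableEnvironment that plans and executes queries in batch mode.
    EnvironmentSettings settings = EnvironmentSettings.newInstance()
            .inBatchMode()
            .build();
    TableEnvironment tableEnv = TableEnvironment.create(settings);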

Step 2: Introduce a batch mode for DataStream API (FLIP-134)
[DONE in 1.12]
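
For readers who haven't tried it yet, switching an existing DataStream program to batch execution is a one-liner (the mode can also be set via the execution.runtime-mode configuration option instead of hardcoding it):

    import org.apache.flink.api.common.RuntimeExecutionMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Run the same DataStream program with batch semantics; all sources must be bounded.
    env.setRuntimeMode(RuntimeExecutionMode.BATCH);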


I've been using the DataSet API and I tested migrating to DataStream + batch mode.

I opened ticket [1] regarding the support of aggregations in batch mode for the DataStream API. It seems that the join operation (at least) does not work in batch mode, even though I managed to implement a join using the low-level KeyedCoProcessFunction (thanks Seth for the pointer!); see the rough sketch below.
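
For reference, the hand-rolled join looks roughly like this (a simplified sketch: the Order and Payment types are made up, and it assumes the batch runtime lets us buffer one side in state before draining the other; a robust version would buffer both sides):

    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
    import org.apache.flink.util.Collector;

    // Inner join of two keyed streams: buffer the left side per key,
    // then emit a pair for every matching right-side element.
    DataStream<Tuple2<Order, Payment>> joined = orders
            .keyBy(o -> o.orderId)
            .connect(payments.keyBy(p -> p.orderId))
            .process(new KeyedCoProcessFunction<Long, Order, Payment, Tuple2<Order, Payment>>() {

                private transient ListState<Order> bufferedOrders;

                @Override
                public void open(Configuration parameters) {
                    bufferedOrders = getRuntimeContext().getListState(
                            new ListStateDescriptor<>("buffered-orders", Order.class));
                }

                @Override
                public void processElement1(Order order, Context ctx,
                        Collector<Tuple2<Order, Payment>> out) throws Exception {
                    // Buffer the first input per key.
                    bufferedOrders.add(order);
                }

                @Override
                public void processElement2(Payment payment, Context ctx,
                        Collector<Tuple2<Order, Payment>> out) throws Exception {
                    // Emit a result for every buffered order with the same key.
                    for (Order order : bufferedOrders.get()) {
                        out.collect(Tuple2.of(order, payment));
                    }
                }
            });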

=> Should it be considered a blocker? Is there a plan to solve it before the actual drop of the DataSet API? Maybe in step 6?

[1] https://issues.apache.org/jira/browse/FLINK-22587



Step 3: Soft deprecate DataSet API (FLIP-131)
[DONE in 1.12]

We updated the documentation recently to make this deprecation even more
visible. There is a dedicated `(Legacy)` label right next to the menu
item now.

We won't deprecate the concrete classes of the API with a @Deprecated
annotation until then, to avoid excessive warnings in the logs.

Step 4: Drop the legacy SQL connectors and formats (FLINK-14437)
[DONE in 1.14]

We dropped the code for the ORC, Parquet, and HBase formats that was only
used by DataSet API users. The removed classes had no documentation and
were not annotated with any of our API stability annotations.

The old functionality should be available through the new sources and
sinks for the Table API and DataStream API. If anything is missing, we
should bring the new implementations into a shape where they can serve as
a full replacement.
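
For example, reading Parquet into the DataStream API now goes through the unified FileSource; a rough sketch (signatures from memory; the path is a placeholder, and parquetFormat would be a ParquetColumnarRowInputFormat configured with the Hadoop conf and the projected row type):

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.connector.file.src.FileSource;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.table.data.RowData;

    // The unified FileSource replaces the legacy DataSet input formats.
    FileSource<RowData> source = FileSource
            .forBulkFileFormat(parquetFormat, new Path("/tmp/input.parquet"))
            .build();
    DataStream<RowData> rows = env.fromSource(
            source, WatermarkStrategy.noWatermarks(), "parquet-source");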

DataSet users are encouraged to either migrate to the newer APIs or stay
on Flink 1.13; alternatively, they can copy just the format's code into a
newer Flink version. We aim to keep the core interfaces (i.e.
InputFormat and OutputFormat) stable until the next major version.
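
For formats that are copied over, the DataStream API can still consume a legacy InputFormat directly; a minimal sketch (MyRecord and myLegacyInputFormat are hypothetical placeholders):

    // A legacy InputFormat copied into a newer Flink version can still be
    // wired into the DataStream API via StreamExecutionEnvironment#createInput.
    DataStream<MyRecord> records = env.createInput(myLegacyInputFormat);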

We will maintain/allow important contributions to the dropped connectors in 1.13, so 1.13 can be considered a kind of DataSet API LTS release.


I added several bug fixes and enhancements (Avro support, automatic schema, etc.) to the Parquet DataSet connector. After discussing with Jingsong and Arvid, we agreed to merge them into 1.13, in line with the fact that 1.13 is an LTS release receiving maintenance changes, as you mentioned here.

=> Is it necessary to port these Avro enhancements to the new DataStream connectors (i.e. add a new equivalent of ParquetColumnarRowInputFormat, but for Avro)? IMHO it is an important feature that users will need. So, if I understand the plan correctly, we have until the release of 2.0 to implement it, right?

=> Also, there are Parquet bugs still open on the deprecated Parquet connector: https://issues.apache.org/jira/browse/FLINK-21520 and https://issues.apache.org/jira/browse/FLINK-21468. I think the same applies: we should fix them in 1.13, right?


Step 5: Drop the legacy SQL planner (FLINK-14437)
[DONE in 1.14]

This included dropping support of DataSet API with SQL.


That is a major point! I've been running TPC-DS benchmarks with Flink lately, coding query3 as a DataSet pipeline, a DataStream pipeline, and an SQL pipeline. What I can tell is that when I migrated from the legacy SQL planner to the Blink SQL planner, I got two major improvements:

1. around a 25% gain in run time on a 1 TB input dataset (even though the memory configuration differed slightly between the runs of the two planners)

2. global order support: with the legacy planner based on DataSet, only local (per-partition) ordering was supported. As a consequence, an SQL query with an ORDER BY clause actually produced wrong results. With the Blink planner based on DataStream, global order is supported and the query results are now correct!
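
Concretely, a query like the one below (a sketch against a hypothetical sales table registered in a batch TableEnvironment) is now globally sorted instead of only sorted within each partition:

    import org.apache.flink.table.api.Table;

    // Under the legacy planner, the ORDER BY below only sorted within each
    // partition; the Blink planner performs a true global sort.
    Table result = tableEnv.sqlQuery(
            "SELECT item_id, SUM(price) AS total "
            + "FROM sales "
            + "GROUP BY item_id "
            + "ORDER BY total DESC");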

=> Congrats to everyone involved in these big SQL improvements!


Step 6: Connect both Table and DataStream API in batch mode (FLINK-20897)
[PLANNED in 1.14]
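
Once this lands, mixing the two APIs in batch mode should look roughly like the sketch below (assuming the planned 1.14 behavior; before this change, StreamTableEnvironment rejected the BATCH runtime mode):

    import static org.apache.flink.table.api.Expressions.$;

    import org.apache.flink.api.common.RuntimeExecutionMode;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setRuntimeMode(RuntimeExecutionMode.BATCH);
    StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

    // Hand a bounded DataStream to the Table API and back again.
    DataStream<String> words = env.fromElements("flink", "data", "set");
    Table upper = tableEnv.fromDataStream(words).select($("f0").upperCase());
    tableEnv.toDataStream(upper).print();
    env.execute();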

Step 7: Reach feature parity of Table API/DataStream API with DataSet API
[PLANNED for 1.14++]

We need to identify blockers when migrating from the DataSet API to the
Table API/DataStream API. Here we need to establish a good feedback
pipeline to include DataSet users in the roadmap planning.

Step 8: Drop the Gelly library

No concrete plan yet. Latest would be the next major Flink version aka
Flink 2.0.

Step 9: Drop DataSet API

Planned for the next major Flink version aka Flink 2.0.


Please let me know if this matches your thoughts. We can also convert
this into a blog post or mention it in the next release notes.

Regards,
Timo


