Hi all,

Any comments?

cheers,

Etienne

On 25/06/2021 15:09, Etienne Chauchot wrote:
Hi everyone,

@Timo, my comments are inline for steps 2, 4 and 5, please tell me what you think.

Best

Etienne


On 23/06/2021 15:27, Chesnay Schepler wrote:
If we want to publicize this plan more, shouldn't we have a rough timeline for when 2.0 is on the table?

On 6/23/2021 2:44 PM, Stephan Ewen wrote:
Thanks for writing this up, this also reflects my understanding.

I think a blog post would be nice, ideally with an explicit call for
feedback so we learn about user concerns.
A blog post has a lot more reach than an ML thread.

Best,
Stephan


On Wed, Jun 23, 2021 at 12:23 PM Timo Walther <twal...@apache.org> wrote:

Hi everyone,

I'm sending this email to make sure everyone is on the same page about
slowly deprecating the DataSet API.

There have been a few thoughts mentioned in presentations, offline
discussions, and JIRA issues. However, I have observed that there are
still some concerns or different opinions on what steps are necessary to
implement this change.

Let me summarize some of the steps and assumptions, and let's have a
discussion about them:

Step 1: Introduce a batch mode for Table API (FLIP-32)
[DONE in 1.9]
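
As a reminder, enabling batch mode in the Table API is just a matter of the environment settings; a minimal sketch:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    // Create a TableEnvironment that plans and executes queries in batch mode.
    EnvironmentSettings settings = EnvironmentSettings.newInstance()
            .inBatchMode()
            .build();
    TableEnvironment tableEnv = TableEnvironment.create(settings);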

Step 2: Introduce a batch mode for DataStream API (FLIP-134)
[DONE in 1.12]
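
For readers who haven't tried it yet, switching an existing DataStream program to batch execution is a one-liner (the mode can also be set via the execution.runtime-mode configuration option instead of hardcoding it):

    import org.apache.flink.api.common.RuntimeExecutionMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Run the same DataStream program with batch semantics; all sources must be bounded.
    env.setRuntimeMode(RuntimeExecutionMode.BATCH);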


I've been using the DataSet API and I tested migrating to DataStream + batch mode.

I opened ticket [1] regarding the support of aggregations in batch mode for the DataStream API. It seems that the join operation (at least) does not work in batch mode, even though I managed to implement a join using the low-level KeyedCoProcessFunction (thanks Seth for the pointer!); see the rough sketch below.
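
For reference, the hand-rolled join looks roughly like this (a simplified sketch: the Order and Payment types are made up, and it assumes the batch runtime lets us buffer one side in state before draining the other; a robust version would buffer both sides):

    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
    import org.apache.flink.util.Collector;

    // Inner join of two keyed streams: buffer the left side per key,
    // then emit a pair for every matching right-side element.
    DataStream<Tuple2<Order, Payment>> joined = orders
            .keyBy(o -> o.orderId)
            .connect(payments.keyBy(p -> p.orderId))
            .process(new KeyedCoProcessFunction<Long, Order, Payment, Tuple2<Order, Payment>>() {

                private transient ListState<Order> bufferedOrders;

                @Override
                public void open(Configuration parameters) {
                    bufferedOrders = getRuntimeContext().getListState(
                            new ListStateDescriptor<>("buffered-orders", Order.class));
                }

                @Override
                public void processElement1(Order order, Context ctx,
                        Collector<Tuple2<Order, Payment>> out) throws Exception {
                    // Buffer the first input per key.
                    bufferedOrders.add(order);
                }

                @Override
                public void processElement2(Payment payment, Context ctx,
                        Collector<Tuple2<Order, Payment>> out) throws Exception {
                    // Emit a result for every buffered order with the same key.
                    for (Order order : bufferedOrders.get()) {
                        out.collect(Tuple2.of(order, payment));
                    }
                }
            });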

=> Should it be considered a blocker? Is there a plan to solve it before the actual drop of the DataSet API? Maybe in step 6?

[1] https://issues.apache.org/jira/browse/FLINK-22587



Step 3: Soft deprecate DataSet API (FLIP-131)
[DONE in 1.12]

We updated the documentation recently to make this deprecation even more
visible. There is a dedicated `(Legacy)` label right next to the menu
item now.

We won't deprecate the concrete classes of the API with a @Deprecated
annotation until then, to avoid excessive warnings in the logs.

Step 4: Drop the legacy SQL connectors and formats (FLINK-14437)
[DONE in 1.14]

We dropped the code for the ORC, Parquet, and HBase formats that was only
used by DataSet API users. The removed classes had no documentation and
were not annotated with any of our API stability annotations.

The old functionality should be available through the new sources and
sinks for the Table API and DataStream API. If anything is missing, we
should bring the new implementations into a shape where they can serve as
a full replacement.
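
For example, reading Parquet into the DataStream API now goes through the unified FileSource; a rough sketch (signatures from memory; the path is a placeholder, and parquetFormat would be a ParquetColumnarRowInputFormat configured with the Hadoop conf and the projected row type):

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.connector.file.src.FileSource;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.table.data.RowData;

    // The unified FileSource replaces the legacy DataSet input formats.
    FileSource<RowData> source = FileSource
            .forBulkFileFormat(parquetFormat, new Path("/tmp/input.parquet"))
            .build();
    DataStream<RowData> rows = env.fromSource(
            source, WatermarkStrategy.noWatermarks(), "parquet-source");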

DataSet users are encouraged to either migrate to the newer APIs or stay
on Flink 1.13; alternatively, they can copy just the format's code into a
newer Flink version. We aim to keep the core interfaces (i.e.
InputFormat and OutputFormat) stable until the next major version.
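
For formats that are copied over, the DataStream API can still consume a legacy InputFormat directly; a minimal sketch (MyRecord and myLegacyInputFormat are hypothetical placeholders):

    // A legacy InputFormat copied into a newer Flink version can still be
    // wired into the DataStream API via StreamExecutionEnvironment#createInput.
    DataStream<MyRecord> records = env.createInput(myLegacyInputFormat);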

We will maintain/allow important contributions to the dropped connectors in 1.13, so 1.13 can be considered a kind of DataSet API LTS release.


I added several bug fixes and enhancements (Avro support, automatic schema, etc.) to the Parquet DataSet connector. After discussing with Jingsong and Arvid, we agreed to merge them into 1.13, in line with the fact that 1.13 is an LTS release receiving maintenance changes, as you mentioned here.

=> Is it necessary to port these Avro enhancements to the new DataStream connectors (i.e. add a new equivalent of ParquetColumnarRowInputFormat, but for Avro)? IMHO it is an important feature that users will need. So, if I understand the plan correctly, we have until the release of 2.0 to implement it, right?

=> Also, there are Parquet bugs still open on the deprecated Parquet connector: https://issues.apache.org/jira/browse/FLINK-21520 and https://issues.apache.org/jira/browse/FLINK-21468. I think the same applies: we should fix them in 1.13, right?


Step 5: Drop the legacy SQL planner (FLINK-14437)
[DONE in 1.14]

This included dropping support of DataSet API with SQL.


That is a major point! I've been running TPC-DS benchmarks with Flink lately, coding query3 as a DataSet pipeline, a DataStream pipeline, and an SQL pipeline. What I can tell is that when I migrated from the legacy SQL planner to the Blink SQL planner, I got two major improvements:

1. around a 25% gain in run time on a 1 TB input dataset (even though the memory configuration differed slightly between the runs of the two planners)

2. global order support: with the legacy planner based on DataSet, only local (per-partition) ordering was supported. As a consequence, an SQL query with an ORDER BY clause actually produced wrong results. With the Blink planner based on DataStream, global order is supported and the query results are now correct!
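
Concretely, a query like the one below (a sketch against a hypothetical sales table registered in a batch TableEnvironment) is now globally sorted instead of only sorted within each partition:

    import org.apache.flink.table.api.Table;

    // Under the legacy planner, the ORDER BY below only sorted within each
    // partition; the Blink planner performs a true global sort.
    Table result = tableEnv.sqlQuery(
            "SELECT item_id, SUM(price) AS total "
            + "FROM sales "
            + "GROUP BY item_id "
            + "ORDER BY total DESC");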

=> Congrats to everyone involved in these big SQL improvements!


Step 6: Connect both Table and DataStream API in batch mode (FLINK-20897)
[PLANNED in 1.14]
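
Once this lands, mixing the two APIs in batch mode should look roughly like the sketch below (assuming the planned 1.14 behavior; before this change, StreamTableEnvironment rejected the BATCH runtime mode):

    import static org.apache.flink.table.api.Expressions.$;

    import org.apache.flink.api.common.RuntimeExecutionMode;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setRuntimeMode(RuntimeExecutionMode.BATCH);
    StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

    // Hand a bounded DataStream to the Table API and back again.
    DataStream<String> words = env.fromElements("flink", "data", "set");
    Table upper = tableEnv.fromDataStream(words).select($("f0").upperCase());
    tableEnv.toDataStream(upper).print();
    env.execute();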

Step 7: Reach feature parity of Table API/DataStream API with DataSet API
[PLANNED for 1.14++]

We need to identify blockers when migrating from the DataSet API to the
Table API/DataStream API. Here we need to establish a good feedback
pipeline to include DataSet users in the roadmap planning.

Step 8: Drop the Gelly library

No concrete plan yet. Latest would be the next major Flink version aka
Flink 2.0.

Step 9: Drop DataSet API

Planned for the next major Flink version aka Flink 2.0.


Please let me know if this matches your thoughts. We can also convert
this into a blog post or mention it in the next release notes.

Regards,
Timo


