Hi Seth,

Thanks for your comment.

That is perfect then!

If the community agrees to give precedence to the Table API connectors over the DataStream connectors, as discussed here, then all the features are already available.

In that case I will not need to port my latest additions to the Table connectors.

Best

Etienne

On 08/07/2021 17:28, Seth Wiesman wrote:
Hi Etienne,

The `toDataStream` method supports converting to concrete Java types, not
just Row, which can include your Avro specific-records. See example 2:

https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/data_stream_api/#examples-for-todatastream
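
For instance, something along these lines should work (just a sketch; `User` stands in for one of your Avro-generated specific record classes, and `tableEnv` / `table` for your existing StreamTableEnvironment and Table):

  // Convert the Table directly into a DataStream of the target class;
  // the table columns are mapped onto the fields of that class.
  DataStream<User> users = tableEnv.toDataStream(table, User.class);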

On Thu, Jul 8, 2021 at 5:11 AM Etienne Chauchot <echauc...@apache.org>
wrote:

Hi Timo,

Thanks for your answers, and no problem with the delay; I was on vacation
too last week :)

My comments are inline

Best,

Etienne

On 07/07/2021 16:48, Timo Walther wrote:
Hi Etienne,

sorry for the late reply due to my vacation last week.

Regarding: "support of aggregations in batch mode for DataStream API
[...] is there a plan to solve it before the actual drop of DataSet API"

Just to clarify it again: we will not drop the DataSet API any time
soon. So users will have enough time to update their pipelines. There
are a couple of features missing to fully switch to DataStream API in
batch mode. Thanks for opening an issue, this is helpful for us to
gradually remove those barriers. They don't need to have a "Blocker"
priority in JIRA for now.

OK, I thought the drop would be sooner; no problem then.


But aggregations are a good example of where we should discuss whether it
would be easier to simply switch to Table API. Table API has a lot of
aggregation optimizations and can work on binary data. Joins should also
be easier in Table API. DataStream API may become a very low-level API in
the near future, and most use cases (esp. the batch ones) should be
possible in Table API.
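
For example, a grouped aggregation in batch mode boils down to a few lines
in Table API (a rough sketch, assuming a table `Orders` with `customerId`
and `amount` columns has already been registered):

  import static org.apache.flink.table.api.Expressions.$;

  // Batch mode Table environment; the planner applies its aggregation
  // optimizations and works on binary data under the hood.
  EnvironmentSettings settings = EnvironmentSettings.newInstance().inBatchMode().build();
  TableEnvironment tableEnv = TableEnvironment.create(settings);

  // Sum of amounts per customer.
  Table totals = tableEnv.from("Orders")
          .groupBy($("customerId"))
          .select($("customerId"), $("amount").sum().as("total"));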

Yes, sure. As a matter of fact, my point was to use the low-level DataStream
API in a benchmark to compare it with Table API, but I guess that is not
common user behavior.


Regarding: "Is it needed to port these Avro enhancements to new
DataStream connectors (add a new equivalent of
ParquetColumnarRowInputFormat but for Avro)"

We should definitely not lose functionality. The same functionality
should be present in the new connectors. The question is rather
whether we need to offer a DataStream API connector or whether a Table API
connector would be nicer to use (also nicely integrated with catalogs).

So a user can use a simple CREATE TABLE statement to configure the
connector; a simpler abstraction is hardly possible. With
`tableEnv.toDataStream(table)` you can then continue in DataStream API
if there is still a need for it.
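
As a rough sketch (connector options, path, and column names are only
placeholders):

  StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
  StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

  // The connector is configured entirely by a CREATE TABLE statement.
  tableEnv.executeSql(
      "CREATE TABLE events (" +
      "  user_id BIGINT," +
      "  amount DOUBLE" +
      ") WITH (" +
      "  'connector' = 'filesystem'," +
      "  'path' = 'file:///tmp/events'," +
      "  'format' = 'parquet'" +
      ")");

  // Only if DataStream API is still needed afterwards:
  DataStream<Row> stream = tableEnv.toDataStream(tableEnv.from("events"));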

Yes, I agree, there is no easier connector setup than CREATE TABLE, and
with tableEnv.toDataStream(table) it is still possible to stay with
DataStream if one wants to (in my benchmark for example). And by the way,
the parquet doc [1] for example only mentions the Table API connector.

So I guess the Table connector would indeed be nicer for the user. But
if we use tableEnv.toDataStream(table), we would only be able to produce
the usual types like Row, tuples or POJOs; we would still need to add Avro
support, right?

[1]

https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/formats/parquet/


Regarding: "there are parquet bugs still open on deprecated parquet
connector"

Yes, bugs should still be fixed in 1.13.

OK


Regarding: "I've been doing TPC-DS benchmarks with Flink lately"

Great to hear that :-)

And congrats again on the Blink planner's performance!


Did you also see the recent discussion? A TPC-DS benchmark can further
be improved by providing statistics. Maybe this is helpful to you:


https://lists.apache.org/thread.html/ra383c23f230ab8e7fa16ec64b4f277c267d6358d55cc8a0edc77bb63%40%3Cuser.flink.apache.org%3E

No, I missed that thread. Thanks for the pointer; I'll read it and
comment if I have something to add.


I will prepare a blog post shortly.

Good to hear :)


Regards,
Timo



On 06.07.21 15:05, Etienne Chauchot wrote:
Hi all,

Any comments ?

cheers,

Etienne

On 25/06/2021 15:09, Etienne Chauchot wrote:
Hi everyone,

@Timo, my comments are inline for steps 2, 4 and 5; please tell me
what you think.

Best

Etienne


On 23/06/2021 15:27, Chesnay Schepler wrote:
If we want to publicize this plan more, shouldn't we have a rough
timeline for when 2.0 is on the table?

On 6/23/2021 2:44 PM, Stephan Ewen wrote:
Thanks for writing this up, this also reflects my understanding.

I think a blog post would be nice, ideally with an explicit call for
feedback so we learn about user concerns.
A blog post has a lot more reach than an ML thread.

Best,
Stephan


On Wed, Jun 23, 2021 at 12:23 PM Timo Walther <twal...@apache.org>
wrote:

Hi everyone,

I'm sending this email to make sure everyone is on the same page
about
slowly deprecating the DataSet API.

There have been a few thoughts mentioned in presentations, offline
discussions, and JIRA issues. However, I have observed that there
are
still some concerns or different opinions on what steps are
necessary to
implement this change.

Let me summarize some of the steps and assumptions, and let's have a
discussion about them:

Step 1: Introduce a batch mode for Table API (FLIP-32)
[DONE in 1.9]

Step 2: Introduce a batch mode for DataStream API (FLIP-134)
[DONE in 1.12]

I've been using DataSet API and I tested migrating to DataStream +
batch mode.

I opened ticket [1] regarding the support of aggregations in
batch mode for the DataStream API. It seems that the join operation (at
least) does not work in batch mode, even though I managed to
implement a join using a low-level KeyedCoProcessFunction (thanks
Seth for the pointer!).

=> Should it be considered a blocker? Is there a plan to solve it
before the actual drop of the DataSet API? Maybe in step 6?

[1] https://issues.apache.org/jira/browse/FLINK-22587
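
For reference, the workaround join I ended up with looks roughly like this
(a simplified sketch with Tuple2 stand-ins for my real records, imports and
the surrounding main method omitted; it assumes at most one record per key
on the second input):

  StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
  env.setRuntimeMode(RuntimeExecutionMode.BATCH);

  DataStream<Tuple2<Long, String>> orders = env.fromElements(Tuple2.of(1L, "order-A"));
  DataStream<Tuple2<Long, String>> customers = env.fromElements(Tuple2.of(1L, "Alice"));

  orders.connect(customers)
      .keyBy(o -> o.f0, c -> c.f0, Types.LONG)
      .process(new KeyedCoProcessFunction<Long, Tuple2<Long, String>, Tuple2<Long, String>, String>() {

          private ValueState<String> customer;      // second input, one value per key
          private ListState<String> bufferedOrders; // first input, buffered until the match arrives

          @Override
          public void open(Configuration conf) {
              customer = getRuntimeContext().getState(
                  new ValueStateDescriptor<>("customer", String.class));
              bufferedOrders = getRuntimeContext().getListState(
                  new ListStateDescriptor<>("orders", String.class));
          }

          @Override
          public void processElement1(Tuple2<Long, String> order, Context ctx,
                  Collector<String> out) throws Exception {
              if (customer.value() != null) {
                  out.collect(order.f1 + " joined with " + customer.value());
              } else {
                  bufferedOrders.add(order.f1); // buffer until the other side arrives
              }
          }

          @Override
          public void processElement2(Tuple2<Long, String> cust, Context ctx,
                  Collector<String> out) throws Exception {
              customer.update(cust.f1);
              for (String o : bufferedOrders.get()) {
                  out.collect(o + " joined with " + cust.f1);
              }
              bufferedOrders.clear();
          }
      })
      .print();

  env.execute();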


Step 3: Soft deprecate DataSet API (FLIP-131)
[DONE in 1.12]

We updated the documentation recently to make this deprecation
even more
visible. There is a dedicated `(Legacy)` label right next to the
menu
item now.

We won't deprecate concrete classes of the API with a @Deprecated
annotation until then, to avoid excessive warnings in logs.

Step 4: Drop the legacy SQL connectors and formats (FLINK-14437)
[DONE in 1.14]

We dropped code for ORC, Parquet, and HBase formats that were only
used
by DataSet API users. The removed classes had no documentation
and were
not annotated with one of our API stability annotations.

The old functionality should be available through the new sources and
sinks for Table API and DataStream API. If not, we should bring them
into shape so that they can be a full replacement.

DataSet users are encouraged to either upgrade the API or use Flink
1.13. Users can either just stay at Flink 1.13 or copy only the
format's
code to a newer Flink version. We aim to keep the core interfaces
(i.e.
InputFormat and OutputFormat) stable until the next major version.

We will maintain/allow important contributions to the dropped connectors in
1.13. So 1.13 could be considered a kind of DataSet API LTS release.

I added several bug fixes and enhancements (Avro support, automatic
schema, etc.) to the Parquet DataSet connector. After discussing with
Jingsong and Arvid, we agreed to merge them into 1.13, given that 1.13
is an LTS release receiving maintenance changes, as you mentioned here.

=> Is it needed to port these Avro enhancements to the new DataStream
connectors (i.e. add an equivalent of ParquetColumnarRowInputFormat
but for Avro)? IMHO it is an important feature that users will need.
So, if I understand the plan correctly, we have until the release of
2.0 to implement it, right?

=> Also, there are Parquet bugs still open on the deprecated Parquet
connector: https://issues.apache.org/jira/browse/FLINK-21520 and
https://issues.apache.org/jira/browse/FLINK-21468. I think the same
applies: we should fix them on 1.13, right?

Step 5: Drop the legacy SQL planner (FLINK-14437)
[DONE in 1.14]

This included dropping support of DataSet API with SQL.

That is a major point! I've been doing TPC-DS benchmarks with Flink
lately by coding query3 as a DataSet pipeline, a DataStream pipeline
and a SQL pipeline. What I can tell is that when I migrated from the
legacy SQL planner to the Blink SQL planner, I got two major
improvements:

1. around a 25% gain in run time on a 1 TB input dataset (even though the
memory configuration was slightly different between the runs of the two planners)

2. global order support: with the legacy planner based on DataSet, only
per-partition ordering was supported. As a consequence, a SQL query
with an ORDER BY clause actually produced wrong results. With the
Blink planner based on DataStream, global order is supported and the
query results are now correct!
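
To illustrate with a simplified query (not the actual query3 text;
store_sales / ss_ext_sales_price are the usual TPC-DS names and tableEnv
is the environment where the TPC-DS tables are registered):

  // Legacy planner: rows came back sorted only within each partition.
  // Blink planner: the result of this ORDER BY is globally sorted.
  Table sorted = tableEnv.sqlQuery(
      "SELECT ss_item_sk, ss_ext_sales_price FROM store_sales ORDER BY ss_ext_sales_price DESC");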

=> Congrats to everyone involved in these big SQL improvements!

Step 6: Connect both Table and DataStream API in batch mode
(FLINK-20897)
[PLANNED in 1.14]

Step 7: Reach feature parity of Table API/DataStream API with
DataSet API
[PLANNED for 1.14++]

We need to identify blockers when migrating from DataSet API to
Table
API/DataStream API. Here we need to establish a good feedback
pipeline
to include DataSet users in the roadmap planning.

Step 8: Drop the Gelly library

No concrete plan yet. At the latest, this would happen in the next major
Flink version, aka Flink 2.0.

Step 9: Drop DataSet API

Planned for the next major Flink version, aka Flink 2.0.


Please let me know if this matches your thoughts. We can also
convert
this into a blog post or mention it in the next release notes.

Regards,
Timo

