Re: [DISCUSS] Incrementally deprecating the DataSet API

2021-07-09 Thread Seth Wiesman
Happy to help clarify, and certainly, I don't think anyone would reject the
contribution if you wanted to add that functionality. The important point
we want to communicate is that the Table API has matured to a level where
we no longer see it as supplementary to DataStream but as an equal,
first-class interface. Each optimizes towards making different use cases
easier - declarative definition of pipelines vs. fine-grained control - and
neither needs to support every operation natively, because interoperating
between them should be seamless.
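
To make the interop concrete, here is a minimal sketch (assuming Flink
1.13+; the table name and columns are hypothetical):

    // Uses a static import of org.apache.flink.table.api.Expressions.$.
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

    // Declarative side: define the pipeline with the Table API.
    Table orders = tableEnv.from("Orders").select($("user"), $("amount"));

    // Fine-grained side: drop down to DataStream where low-level control
    // is needed ...
    DataStream<Row> stream = tableEnv.toDataStream(orders);
    DataStream<Row> filtered =
        stream.filter(row -> row.getField("amount") != null);

    // ... and switch back once declarative operations fit again.
    Table result = tableEnv.fromDataStream(filtered);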

Seth

On Fri, Jul 9, 2021 at 10:35 AM Etienne Chauchot 
wrote:

> Hi Seth,
>
> Thanks for your comment.
>
> That is perfect then!
>
> If the community agrees to give precedence to the Table API connectors
> over the DataStream connectors, as discussed here, then all the features
> are already available.
>
> I will not need to port my latest additions to the Table API connectors.
>
> Best
>
> Etienne
>
> On 08/07/2021 17:28, Seth Wiesman wrote:
> > Hi Etienne,
> >
> > The `toDataStream` method supports converting to concrete Java types, not
> > just Row, which can include your Avro specific records. See example 2:
> >
> >
> https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/data_stream_api/#examples-for-todatastream
> >
> > On Thu, Jul 8, 2021 at 5:11 AM Etienne Chauchot 
> > wrote:
> >
> >> Hi Timo,
> >>
> >> Thanks for your answers, no problem with the delay, I was on vacation
> >> too last week :)
> >>
> >> My comments are inline
> >>
> >> Best,
> >>
> >> Etienne
> >>
> >> On 07/07/2021 16:48, Timo Walther wrote:
> >>> Hi Etienne,
> >>>
> >>> sorry for the late reply due to my vacation last week.
> >>>
> >>> Regarding: "support of aggregations in batch mode for DataStream API
> >>> [...] is there a plan to solve it before the actual drop of DataSet
> API"
> >>>
> >>> Just to clarify it again: we will not drop the DataSet API any time
> >>> soon. So users will have enough time to update their pipelines. There
> >>> are a couple of features missing to fully switch to DataStream API in
> >>> batch mode. Thanks for opening an issue, this is helpful for us to
> >>> gradually remove those barriers. They don't need to have a "Blocker"
> >>> priority in JIRA for now.
> >>
> >> OK, I thought the drop was sooner; no problem then.
> >>
> >>
> >>> But aggregations are a good example where we should discuss whether it
> >>> would be easier to simply switch to the Table API for that. The Table
> >>> API has a lot of aggregation optimizations and can work on binary data.
> >>> Joins should also be easier in the Table API. The DataStream API can
> >>> become a very low-level API in the near future, and most use cases
> >>> (esp. the batch ones) should be possible in the Table API.
> >>
> >> Yes, sure. As a matter of fact, my point was to use the low-level
> >> DataStream API in a benchmark to compare with the Table API, but I
> >> guess that is not common user behavior.
> >>
> >>
> >>> Regarding: "Is it needed to port these Avro enhancements to new
> >>> DataStream connectors (add a new equivalent of
> >>> ParquetColumnarRowInputFormat but for Avro)"
> >>>
> >>> We should definitely not lose functionality. The same functionality
> >>> should be present in the new connectors. The question is rather
> >>> whether we need to offer a DataStream API connector or if a Table API
> >>> connector would be nicer to use (also nicely integrated with catalogs).
> >>>
> >>> So a user can use a simple CREATE TABLE statement to configure the
> >>> connector; an easier abstraction is hardly possible. With
> >>> `tableEnv.toDataStream(table)` you can then continue in DataStream API
> >>> if there is still a need for it.
> >>
> >> Yes, I agree, there is no easier connector setup than CREATE TABLE, and
> >> with tableEnv.toDataStream(table), staying with DataStream (in my bench,
> >> for example) is still possible. And by the way, the Parquet docs (1),
> >> for example, only mention using the Table API connector.
> >>
> >> So I guess the table connector would indeed be nicer for the user. But
> >> if we use tableEnv.toDataStream(table), we would be able to produce only
> >> the usual types like Row, Tuples, or POJOs; we would still need to add
> >> Avro support, right?
> >>
> >> [1]
> >>
> >>
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/formats/parquet/
> >>
> >>
> >>> Regarding: "there are parquet bugs still open on deprecated parquet
> >>> connector"
> >>>
> >>> Yes, bugs should still be fixed in 1.13.
> >>
> >> OK
> >>
> >>
> >>> Regarding: "I've been doing TPC-DS benchmarks with Flink lately"
> >>>
> >>> Great to hear that :-)
> >>
> >> And congrats again on Blink's performance!
> >>
> >>
> >>> Did you also see the recent discussion? A TPC-DS benchmark can further
> >>> be improved by providing statistics. Maybe this is helpful to you:
> >>>
> >>> https://lists.apache.org/thread.html/ra383c23f230ab8e7fa16ec64b4f277c267d6358d55cc8a0edc77bb63%40%3Cuser.flink.apache.org%3E

Re: [DISCUSS] Incrementally deprecating the DataSet API

2021-07-09 Thread Etienne Chauchot

Hi Seth,

Thanks for your comment.

That is perfect then!

If the community agrees to give precedence to the Table API connectors
over the DataStream connectors, as discussed here, then all the features
are already available.

I will not need to port my latest additions to the Table API connectors.

Best

Etienne

On 08/07/2021 17:28, Seth Wiesman wrote:

Hi Etienne,

The `toDataStream` method supports converting to concrete Java types, not
just Row, which can include your Avro specific records. See example 2:

https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/data_stream_api/#examples-for-todatastream

On Thu, Jul 8, 2021 at 5:11 AM Etienne Chauchot 
wrote:


Hi Timo,

Thanks for your answers, no problem with the delay, I was on vacation
too last week :)

My comments are inline

Best,

Etienne

On 07/07/2021 16:48, Timo Walther wrote:

Hi Etienne,

sorry for the late reply due to my vacation last week.

Regarding: "support of aggregations in batch mode for DataStream API
[...] is there a plan to solve it before the actual drop of DataSet API"

Just to clarify it again: we will not drop the DataSet API any time
soon. So users will have enough time to update their pipelines. There
are a couple of features missing to fully switch to DataStream API in
batch mode. Thanks for opening an issue, this is helpful for us to
gradually remove those barriers. They don't need to have a "Blocker"
priority in JIRA for now.


OK, I thought the drop was sooner; no problem then.



But aggregations are a good example where we should discuss whether it
would be easier to simply switch to the Table API for that. The Table API
has a lot of aggregation optimizations and can work on binary data. Joins
should also be easier in the Table API. The DataStream API can become a
very low-level API in the near future, and most use cases (esp. the batch
ones) should be possible in the Table API.


Yes, sure. As a matter of fact, my point was to use the low-level
DataStream API in a benchmark to compare with the Table API, but I guess
that is not common user behavior.



Regarding: "Is it needed to port these Avro enhancements to new
DataStream connectors (add a new equivalent of
ParquetColumnarRowInputFormat but for Avro)"

We should definitely not lose functionality. The same functionality
should be present in the new connectors. The question is rather
whether we need to offer a DataStream API connector or if a Table API
connector would be nicer to use (also nicely integrated with catalogs).

So a user can use a simple CREATE TABLE statement to configure the
connector; an easier abstraction is hardly possible. With
`tableEnv.toDataStream(table)` you can then continue in DataStream API
if there is still a need for it.


Yes, I agree, there is no easier connector setup than CREATE TABLE, and
with tableEnv.toDataStream(table), staying with DataStream (in my bench,
for example) is still possible. And by the way, the Parquet docs (1),
for example, only mention using the Table API connector.

So I guess the table connector would indeed be nicer for the user. But
if we use tableEnv.toDataStream(table), we would be able to produce only
the usual types like Row, Tuples, or POJOs; we would still need to add
Avro support, right?

[1]

https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/formats/parquet/



Regarding: "there are parquet bugs still open on deprecated parquet
connector"

Yes, bugs should still be fixed in 1.13.


OK



Regarding: "I've been doing TPC-DS benchmarks with Flink lately"

Great to hear that :-)


And congrats again on Blink's performance!



Did you also see the recent discussion? A TPC-DS benchmark can further
be improved by providing statistics. Maybe this is helpful to you:



https://lists.apache.org/thread.html/ra383c23f230ab8e7fa16ec64b4f277c267d6358d55cc8a0edc77bb63%40%3Cuser.flink.apache.org%3E

No, I missed that thread, thanks for the pointer, I'll read it and
comment if I have something to add.



I will prepare a blog post shortly.


Good to hear :)



Regards,
Timo



On 06.07.21 15:05, Etienne Chauchot wrote:

Hi all,

Any comments ?

cheers,

Etienne

On 25/06/2021 15:09, Etienne Chauchot wrote:

Hi everyone,

@Timo, my comments are inline for steps 2, 4 and 5, please tell me
what you think.

Best

Etienne


On 23/06/2021 15:27, Chesnay Schepler wrote:

If we want to publicize this plan more shouldn't we have a rough
timeline for when 2.0 is on the table?

On 6/23/2021 2:44 PM, Stephan Ewen wrote:

Thanks for writing this up, this also reflects my understanding.

I think a blog post would be nice, ideally with an explicit call for
feedback so we learn about user concerns.
A blog post has a lot more reach than an ML thread.

Best,
Stephan


On Wed, Jun 23, 2021 at 12:23 PM Timo Walther 
wrote:


Hi everyone,

I'm sending this email to make sure everyone is on the same page about
slowly deprecating the DataSet API.


Re: [DISCUSS] Incrementally deprecating the DataSet API

2021-07-08 Thread Seth Wiesman
Hi Etienne,

The `toDataStream` method supports converting to concrete Java types, not
just Row, which can include your Avro specific records. See example 2:

https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/data_stream_api/#examples-for-todatastream
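
In short, something like this should work (a rough sketch, assuming
Flink 1.13+; `User` stands for a hypothetical Avro-generated specific
record class):

    Table table = tableEnv.from("users");
    // toDataStream(Table, Class) converts to the given structured type,
    // not just Row, so an Avro specific record can be the target type.
    DataStream<User> users = tableEnv.toDataStream(table, User.class);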

On Thu, Jul 8, 2021 at 5:11 AM Etienne Chauchot 
wrote:

> Hi Timo,
>
> Thanks for your answers, no problem with the delay, I was on vacation
> too last week :)
>
> My comments are inline
>
> Best,
>
> Etienne
>
> On 07/07/2021 16:48, Timo Walther wrote:
> > Hi Etienne,
> >
> > sorry for the late reply due to my vacation last week.
> >
> > Regarding: "support of aggregations in batch mode for DataStream API
> > [...] is there a plan to solve it before the actual drop of DataSet API"
> >
> > Just to clarify it again: we will not drop the DataSet API any time
> > soon. So users will have enough time to update their pipelines. There
> > are a couple of features missing to fully switch to DataStream API in
> > batch mode. Thanks for opening an issue, this is helpful for us to
> > gradually remove those barriers. They don't need to have a "Blocker"
> > priority in JIRA for now.
>
>
> OK, I thought the drop was sooner; no problem then.
>
>
> >
> > But aggregations are a good example where we should discuss whether it
> > would be easier to simply switch to the Table API for that. The Table
> > API has a lot of aggregation optimizations and can work on binary data.
> > Joins should also be easier in the Table API. The DataStream API can
> > become a very low-level API in the near future, and most use cases
> > (esp. the batch ones) should be possible in the Table API.
>
>
> Yes, sure. As a matter of fact, my point was to use the low-level
> DataStream API in a benchmark to compare with the Table API, but I guess
> that is not common user behavior.
>
>
> >
> > Regarding: "Is it needed to port these Avro enhancements to new
> > DataStream connectors (add a new equivalent of
> > ParquetColumnarRowInputFormat but for Avro)"
> >
> > We should definitely not lose functionality. The same functionality
> > should be present in the new connectors. The question is rather
> > whether we need to offer a DataStream API connector or if a Table API
> > connector would be nicer to use (also nicely integrated with catalogs).
> >
> > So a user can use a simple CREATE TABLE statement to configure the
> > connector; an easier abstraction is hardly possible. With
> > `tableEnv.toDataStream(table)` you can then continue in DataStream API
> > if there is still a need for it.
>
>
> Yes, I agree, there is no easier connector setup than CREATE TABLE, and
> with tableEnv.toDataStream(table), staying with DataStream (in my bench,
> for example) is still possible. And by the way, the Parquet docs (1),
> for example, only mention using the Table API connector.
>
> So I guess the table connector would indeed be nicer for the user. But
> if we use tableEnv.toDataStream(table), we would be able to produce only
> the usual types like Row, Tuples, or POJOs; we would still need to add
> Avro support, right?
>
> [1]
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/formats/parquet/
>
>
> >
> > Regarding: "there are parquet bugs still open on deprecated parquet
> > connector"
> >
> > Yes, bugs should still be fixed in 1.13.
>
>
> OK
>
>
> >
> > Regarding: "I've been doing TPC-DS benchmarks with Flink lately"
> >
> > Great to hear that :-)
>
>
> And congrats again on Blink's performance!
>
>
> >
> > Did you also see the recent discussion? A TPC-DS benchmark can further
> > be improved by providing statistics. Maybe this is helpful to you:
> >
> >
> https://lists.apache.org/thread.html/ra383c23f230ab8e7fa16ec64b4f277c267d6358d55cc8a0edc77bb63%40%3Cuser.flink.apache.org%3E
> >
>
>
> No, I missed that thread, thanks for the pointer, I'll read it and
> comment if I have something to add.
>
>
> >
> > I will prepare a blog post shortly.
>
>
> Good to hear :)
>
>
> >
> > Regards,
> > Timo
> >
> >
> >
> > On 06.07.21 15:05, Etienne Chauchot wrote:
> >> Hi all,
> >>
> >> Any comments ?
> >>
> >> cheers,
> >>
> >> Etienne
> >>
> >> On 25/06/2021 15:09, Etienne Chauchot wrote:
> >>> Hi everyone,
> >>>
> >>> @Timo, my comments are inline for steps 2, 4 and 5, please tell me
> >>> what you think.
> >>>
> >>> Best
> >>>
> >>> Etienne
> >>>
> >>>
> >>> On 23/06/2021 15:27, Chesnay Schepler wrote:
>  If we want to publicize this plan more shouldn't we have a rough
>  timeline for when 2.0 is on the table?
> 
>  On 6/23/2021 2:44 PM, Stephan Ewen wrote:
> > Thanks for writing this up, this also reflects my understanding.
> >
> > I think a blog post would be nice, ideally with an explicit call for
> > feedback so we learn about user concerns.
> > A blog post has a lot more reach than an ML thread.
> >
> > Best,
> > Stephan
> >
> >
> > On Wed, Jun 23, 2021 at 12:23 PM Timo Walther 
> > wrote:
> >
> >> Hi everyone,
> >>

Re: [DISCUSS] Incrementally deprecating the DataSet API

2021-07-08 Thread Etienne Chauchot

Hi Timo,

Thanks for your answers, no problem with the delay, I was on vacation
too last week :)


My comments are inline

Best,

Etienne

On 07/07/2021 16:48, Timo Walther wrote:

Hi Etienne,

sorry for the late reply due to my vacation last week.

Regarding: "support of aggregations in batch mode for DataStream API 
[...] is there a plan to solve it before the actual drop of DataSet API"


Just to clarify it again: we will not drop the DataSet API any time 
soon. So users will have enough time to update their pipelines. There 
are a couple of features missing to fully switch to DataStream API in 
batch mode. Thanks for opening an issue, this is helpful for us to 
gradually remove those barriers. They don't need to have a "Blocker" 
priority in JIRA for now.



OK, I thought the drop was sooner; no problem then.




But aggregations are a good example where we should discuss whether it
would be easier to simply switch to the Table API for that. The Table API
has a lot of aggregation optimizations and can work on binary data. Joins
should also be easier in the Table API. The DataStream API can become a
very low-level API in the near future, and most use cases (esp. the batch
ones) should be possible in the Table API.



Yes, sure. As a matter of fact, my point was to use the low-level
DataStream API in a benchmark to compare with the Table API, but I guess
that is not common user behavior.





Regarding: "Is it needed to port these Avro enhancements to new 
DataStream connectors (add a new equivalent of 
ParquetColumnarRowInputFormat but for Avro)"


We should definitely not lose functionality. The same functionality
should be present in the new connectors. The question is rather
whether we need to offer a DataStream API connector or if a Table API
connector would be nicer to use (also nicely integrated with catalogs).


So a user can use a simple CREATE TABLE statement to configure the 
connector; an easier abstraction is hardly possible. With
`tableEnv.toDataStream(table)` you can then continue in DataStream API 
if there is still a need for it.



Yes, I agree, there is no easier connector setup than CREATE TABLE, and
with tableEnv.toDataStream(table), staying with DataStream (in my bench,
for example) is still possible. And by the way, the Parquet docs (1),
for example, only mention using the Table API connector.


So I guess the table connector would indeed be nicer for the user. But
if we use tableEnv.toDataStream(table), we would be able to produce only
the usual types like Row, Tuples, or POJOs; we would still need to add
Avro support, right?


[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/formats/parquet/





Regarding: "there are parquet bugs still open on deprecated parquet 
connector"


Yes, bugs should still be fixed in 1.13.



OK




Regarding: "I've been doing TPC-DS benchmarks with Flink lately"

Great to hear that :-)



And congrats again on Blink's performance!




Did you also see the recent discussion? A TPC-DS benchmark can further 
be improved by providing statistics. Maybe this is helpful to you:


https://lists.apache.org/thread.html/ra383c23f230ab8e7fa16ec64b4f277c267d6358d55cc8a0edc77bb63%40%3Cuser.flink.apache.org%3E 




No, I missed that thread, thanks for the pointer, I'll read it and 
comment if I have something to add.





I will prepare a blog post shortly.



Good to hear :)




Regards,
Timo



On 06.07.21 15:05, Etienne Chauchot wrote:

Hi all,

Any comments ?

cheers,

Etienne

On 25/06/2021 15:09, Etienne Chauchot wrote:

Hi everyone,

@Timo, my comments are inline for steps 2, 4 and 5, please tell me 
what you think.


Best

Etienne


On 23/06/2021 15:27, Chesnay Schepler wrote:
If we want to publicize this plan more shouldn't we have a rough 
timeline for when 2.0 is on the table?


On 6/23/2021 2:44 PM, Stephan Ewen wrote:

Thanks for writing this up, this also reflects my understanding.

I think a blog post would be nice, ideally with an explicit call for
feedback so we learn about user concerns.
A blog post has a lot more reach than an ML thread.

Best,
Stephan


On Wed, Jun 23, 2021 at 12:23 PM Timo Walther  
wrote:



Hi everyone,

I'm sending this email to make sure everyone is on the same page about
slowly deprecating the DataSet API.

There have been a few thoughts mentioned in presentations, offline
discussions, and JIRA issues. However, I have observed that there are
still some concerns or different opinions on what steps are necessary to
implement this change.

Let me summarize some of the steps and assumptions, and let's have a
discussion about it:

Step 1: Introduce a batch mode for Table API (FLIP-32)
[DONE in 1.9]

Step 2: Introduce a batch mode for DataStream API (FLIP-134)
[DONE in 1.12]



I've been using DataSet API and I tested migrating to DataStream + 
batch mode.


I opened this (1) ticket regarding the support of aggregations in
batch mode for DataStream API.

Re: [DISCUSS] Incrementally deprecating the DataSet API

2021-07-07 Thread Timo Walther

Hi Etienne,

sorry for the late reply due to my vacation last week.

Regarding: "support of aggregations in batch mode for DataStream API 
[...] is there a plan to solve it before the actual drop of DataSet API"


Just to clarify it again: we will not drop the DataSet API any time 
soon. So users will have enough time to update their pipelines. There 
are a couple of features missing to fully switch to DataStream API in 
batch mode. Thanks for opening an issue, this is helpful for us to 
gradually remove those barriers. They don't need to have a "Blocker" 
priority in JIRA for now.


But aggregations are a good example where we should discuss whether it
would be easier to simply switch to the Table API for that. The Table API
has a lot of aggregation optimizations and can work on binary data. Joins
should also be easier in the Table API. The DataStream API can become a
very low-level API in the near future, and most use cases (esp. the batch
ones) should be possible in the Table API.


Regarding: "Is it needed to port these Avro enhancements to new 
DataStream connectors (add a new equivalent of 
ParquetColumnarRowInputFormat but for Avro)"


We should definitely not lose functionality. The same functionality
should be present in the new connectors. The question is rather whether
we need to offer a DataStream API connector or if a Table API connector
would be nicer to use (also nicely integrated with catalogs).


So a user can use a simple CREATE TABLE statement to configure the 
connector; an easier abstraction is hardly possible. With
`tableEnv.toDataStream(table)` you can then continue in DataStream API 
if there is still a need for it.
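
For illustration, a rough sketch of that flow (the table name, schema,
and path below are made up):

    // Configure the connector declaratively ...
    tableEnv.executeSql(
        "CREATE TABLE fs_table ("
            + "  user_id STRING,"
            + "  order_amount DOUBLE"
            + ") WITH ("
            + "  'connector' = 'filesystem',"
            + "  'path' = '/path/to/data',"
            + "  'format' = 'parquet'"
            + ")");

    // ... and only continue in DataStream API if there is still a need.
    DataStream<Row> rows = tableEnv.toDataStream(tableEnv.from("fs_table"));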


Regarding: "there are parquet bugs still open on deprecated parquet 
connector"


Yes, bugs should still be fixed in 1.13.

Regarding: "I've been doing TPC-DS benchmarks with Flink lately"

Great to hear that :-)

Did you also see the recent discussion? A TPC-DS benchmark can further 
be improved by providing statistics. Maybe this is helpful to you:


https://lists.apache.org/thread.html/ra383c23f230ab8e7fa16ec64b4f277c267d6358d55cc8a0edc77bb63%40%3Cuser.flink.apache.org%3E

I will prepare a blog post shortly.

Regards,
Timo



On 06.07.21 15:05, Etienne Chauchot wrote:

Hi all,

Any comments ?

cheers,

Etienne

On 25/06/2021 15:09, Etienne Chauchot wrote:

Hi everyone,

@Timo, my comments are inline for steps 2, 4 and 5, please tell me 
what you think.


Best

Etienne


On 23/06/2021 15:27, Chesnay Schepler wrote:
If we want to publicize this plan more shouldn't we have a rough 
timeline for when 2.0 is on the table?


On 6/23/2021 2:44 PM, Stephan Ewen wrote:

Thanks for writing this up, this also reflects my understanding.

I think a blog post would be nice, ideally with an explicit call for
feedback so we learn about user concerns.
A blog post has a lot more reach than an ML thread.

Best,
Stephan


On Wed, Jun 23, 2021 at 12:23 PM Timo Walther  
wrote:



Hi everyone,

I'm sending this email to make sure everyone is on the same page about
slowly deprecating the DataSet API.

There have been a few thoughts mentioned in presentations, offline
discussions, and JIRA issues. However, I have observed that there are
still some concerns or different opinions on what steps are necessary to
implement this change.

Let me summarize some of the steps and assumptions, and let's have a
discussion about it:

Step 1: Introduce a batch mode for Table API (FLIP-32)
[DONE in 1.9]

Step 2: Introduce a batch mode for DataStream API (FLIP-134)
[DONE in 1.12]



I've been using DataSet API and I tested migrating to DataStream + 
batch mode.


I opened this (1) ticket regarding the support of aggregations in
batch mode for DataStream API. It seems that the join operation (at least)
does not work in batch mode, even though I managed to implement a join
using a low-level KeyedCoProcessFunction (thanks, Seth, for the pointer!).


=> Should it be considered a blocker? Is there a plan to solve it
before the actual drop of the DataSet API? Maybe in step 6?


[1] https://issues.apache.org/jira/browse/FLINK-22587




Step 3: Soft deprecate DataSet API (FLIP-131)
[DONE in 1.12]

We updated the documentation recently to make this deprecation even more
visible. There is a dedicated `(Legacy)` label right next to the menu
item now.

We won't deprecate concrete classes of the API with a @Deprecated
annotation until then, to avoid extensive warnings in logs.

Step 4: Drop the legacy SQL connectors and formats (FLINK-14437)
[DONE in 1.14]

We dropped code for ORC, Parquet, and HBase formats that were only used
by DataSet API users. The removed classes had no documentation and were
not annotated with one of our API stability annotations.

The old functionality should be available through the new sources and
sinks for the Table API and DataStream API. If not, we should bring them
into a shape where they can be a full replacement.

DataSet users are encouraged to either upgrade the API or use Flink
1.13.

Re: [DISCUSS] Incrementally deprecating the DataSet API

2021-07-06 Thread Etienne Chauchot

Hi all,

Any comments ?

cheers,

Etienne

On 25/06/2021 15:09, Etienne Chauchot wrote:

Hi everyone,

@Timo, my comments are inline for steps 2, 4 and 5, please tell me 
what you think.


Best

Etienne


On 23/06/2021 15:27, Chesnay Schepler wrote:
If we want to publicize this plan more shouldn't we have a rough 
timeline for when 2.0 is on the table?


On 6/23/2021 2:44 PM, Stephan Ewen wrote:

Thanks for writing this up, this also reflects my understanding.

I think a blog post would be nice, ideally with an explicit call for
feedback so we learn about user concerns.
A blog post has a lot more reach than an ML thread.

Best,
Stephan


On Wed, Jun 23, 2021 at 12:23 PM Timo Walther  
wrote:



Hi everyone,

I'm sending this email to make sure everyone is on the same page about
slowly deprecating the DataSet API.

There have been a few thoughts mentioned in presentations, offline
discussions, and JIRA issues. However, I have observed that there are
still some concerns or different opinions on what steps are necessary to
implement this change.

Let me summarize some of the steps and assumptions, and let's have a
discussion about it:

Step 1: Introduce a batch mode for Table API (FLIP-32)
[DONE in 1.9]

Step 2: Introduce a batch mode for DataStream API (FLIP-134)
[DONE in 1.12]



I've been using DataSet API and I tested migrating to DataStream + 
batch mode.


I opened this (1) ticket regarding the support of aggregations in
batch mode for DataStream API. It seems that the join operation (at least)
does not work in batch mode, even though I managed to implement a join
using a low-level KeyedCoProcessFunction (thanks, Seth, for the pointer!).


=> Should it be considered a blocker? Is there a plan to solve it
before the actual drop of the DataSet API? Maybe in step 6?


[1] https://issues.apache.org/jira/browse/FLINK-22587




Step 3: Soft deprecate DataSet API (FLIP-131)
[DONE in 1.12]

We updated the documentation recently to make this deprecation even more
visible. There is a dedicated `(Legacy)` label right next to the menu
item now.

We won't deprecate concrete classes of the API with a @Deprecated
annotation until then, to avoid extensive warnings in logs.

Step 4: Drop the legacy SQL connectors and formats (FLINK-14437)
[DONE in 1.14]

We dropped code for ORC, Parquet, and HBase formats that were only used
by DataSet API users. The removed classes had no documentation and were
not annotated with one of our API stability annotations.

The old functionality should be available through the new sources and
sinks for the Table API and DataStream API. If not, we should bring them
into a shape where they can be a full replacement.

DataSet users are encouraged to either upgrade the API or use Flink
1.13. Users can either just stay at Flink 1.13 or copy only the format's
code to a newer Flink version. We aim to keep the core interfaces (i.e.
InputFormat and OutputFormat) stable until the next major version.

We will maintain/allow important contributions to dropped connectors in
1.13. So 1.13 could be considered a kind of DataSet API LTS release.



I added several bug fixes and enhancements (Avro support, automatic
schema, etc.) to the Parquet DataSet connector. After discussing with
Jingsong and Arvid, we agreed to merge them to 1.13, given that 1.13 is
an LTS release receiving maintenance changes, as you mentioned here.


=> Is it needed to port these Avro enhancements to the new DataStream
connectors (add a new equivalent of ParquetColumnarRowInputFormat but
for Avro)? IMHO it is an important feature that users will need. So, if
I understand the plan correctly, we have until the release of 2.0 to
implement it, right?


=> Also, there are Parquet bugs still open on the deprecated Parquet
connector: https://issues.apache.org/jira/browse/FLINK-21520 and
https://issues.apache.org/jira/browse/FLINK-21468. I think the same
applies: we should fix them in 1.13, right?




Step 5: Drop the legacy SQL planner (FLINK-14437)
[DONE in 1.14]

This included dropping support of DataSet API with SQL.



That is a major point! I've been doing TPC-DS benchmarks with Flink
lately by coding query3 with a DataSet pipeline, a DataStream pipeline,
and a SQL pipeline. What I can tell is that when I migrated from the
legacy SQL planner to the Blink SQL planner, I got 2 major improvements:


1. around 25% gain in run times on a 1TB input dataset (even though the
memory conf was slightly different between the runs of the 2 planners)


2. global order support: with the legacy planner based on DataSet, only
local partition ordering was supported. As a consequence, a SQL query
with an ORDER BY clause actually produced wrong results. With the Blink
planner based on DataStream, global order is supported and the query
results are now correct!


=> congrats to everyone involved in these big SQL improvements!



Step 6: Connect both Table and DataStream API in batch mode (FLINK-20897)
[PLANNED in 1.14]

Re: [DISCUSS] Incrementally deprecating the DataSet API

2021-06-25 Thread Etienne Chauchot

Hi everyone,

@Timo, my comments are inline for steps 2, 4 and 5, please tell me what 
you think.


Best

Etienne


On 23/06/2021 15:27, Chesnay Schepler wrote:
If we want to publicize this plan more shouldn't we have a rough 
timeline for when 2.0 is on the table?


On 6/23/2021 2:44 PM, Stephan Ewen wrote:

Thanks for writing this up, this also reflects my understanding.

I think a blog post would be nice, ideally with an explicit call for
feedback so we learn about user concerns.
A blog post has a lot more reach than an ML thread.

Best,
Stephan


On Wed, Jun 23, 2021 at 12:23 PM Timo Walther  
wrote:



Hi everyone,

I'm sending this email to make sure everyone is on the same page about
slowly deprecating the DataSet API.

There have been a few thoughts mentioned in presentations, offline
discussions, and JIRA issues. However, I have observed that there are
still some concerns or different opinions on what steps are necessary to
implement this change.

Let me summarize some of the steps and assumptions, and let's have a
discussion about it:

Step 1: Introduce a batch mode for Table API (FLIP-32)
[DONE in 1.9]

Step 2: Introduce a batch mode for DataStream API (FLIP-134)
[DONE in 1.12]



I've been using DataSet API and I tested migrating to DataStream + batch 
mode.
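
For reference, the migration mainly amounts to switching the runtime
execution mode, roughly like this with Flink 1.12+ (RuntimeExecutionMode
comes from org.apache.flink.api.common):

    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    // FLIP-134: run the DataStream pipeline with batch semantics on
    // bounded inputs; can also be set via 'execution.runtime-mode'.
    env.setRuntimeMode(RuntimeExecutionMode.BATCH);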


I opened this (1) ticket regarding the support of aggregations in batch
mode for DataStream API. It seems that the join operation (at least) does
not work in batch mode, even though I managed to implement a join using a
low-level KeyedCoProcessFunction (thanks, Seth, for the pointer!).


=> Should it be considered a blocker? Is there a plan to solve it
before the actual drop of the DataSet API? Maybe in step 6?


[1] https://issues.apache.org/jira/browse/FLINK-22587
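
For reference, the low-level join I mention is roughly of the following
shape (a simplified sketch, not my exact code; Order, Payment, and the
String key are illustrative):

    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
    import org.apache.flink.util.Collector;

    // Joins two keyed streams by buffering both sides in keyed state and
    // emitting a match whenever the other side has already been seen.
    public class JoinFn
        extends KeyedCoProcessFunction<String, Order, Payment, String> {

      private ListState<Order> orders;
      private ListState<Payment> payments;

      @Override
      public void open(Configuration parameters) {
        orders = getRuntimeContext().getListState(
            new ListStateDescriptor<>("orders", Order.class));
        payments = getRuntimeContext().getListState(
            new ListStateDescriptor<>("payments", Payment.class));
      }

      @Override
      public void processElement1(Order order, Context ctx,
          Collector<String> out) throws Exception {
        Iterable<Payment> seen = payments.get(); // may be null when empty
        if (seen != null) {
          for (Payment p : seen) {
            out.collect(order + " matched " + p);
          }
        }
        orders.add(order);
      }

      @Override
      public void processElement2(Payment payment, Context ctx,
          Collector<String> out) throws Exception {
        Iterable<Order> seen = orders.get(); // may be null when empty
        if (seen != null) {
          for (Order o : seen) {
            out.collect(o + " matched " + payment);
          }
        }
        payments.add(payment);
      }
    }

It is wired up with something like:

    orderStream.keyBy(o -> o.userId)
        .connect(paymentStream.keyBy(p -> p.userId))
        .process(new JoinFn());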




Step 3: Soft deprecate DataSet API (FLIP-131)
[DONE in 1.12]

We updated the documentation recently to make this deprecation even more
visible. There is a dedicated `(Legacy)` label right next to the menu
item now.

We won't deprecate concrete classes of the API with a @Deprecated
annotation until then, to avoid extensive warnings in logs.

Step 4: Drop the legacy SQL connectors and formats (FLINK-14437)
[DONE in 1.14]

We dropped code for ORC, Parquet, and HBase formats that were only used
by DataSet API users. The removed classes had no documentation and were
not annotated with one of our API stability annotations.

The old functionality should be available through the new sources and
sinks for the Table API and DataStream API. If not, we should bring them
into a shape where they can be a full replacement.

DataSet users are encouraged to either upgrade the API or use Flink
1.13. Users can either just stay at Flink 1.13 or copy only the format's
code to a newer Flink version. We aim to keep the core interfaces (i.e.
InputFormat and OutputFormat) stable until the next major version.

We will maintain/allow important contributions to dropped connectors in
1.13. So 1.13 could be considered a kind of DataSet API LTS release.



I added several bug fixes and enhancements (Avro support, automatic
schema, etc.) to the Parquet DataSet connector. After discussing with
Jingsong and Arvid, we agreed to merge them to 1.13, given that 1.13 is
an LTS release receiving maintenance changes, as you mentioned here.


=> Is it needed to port these Avro enhancements to the new DataStream
connectors (add a new equivalent of ParquetColumnarRowInputFormat but
for Avro)? IMHO it is an important feature that users will need. So, if
I understand the plan correctly, we have until the release of 2.0 to
implement it, right?


=> Also, there are Parquet bugs still open on the deprecated Parquet
connector: https://issues.apache.org/jira/browse/FLINK-21520 and
https://issues.apache.org/jira/browse/FLINK-21468. I think the same
applies: we should fix them in 1.13, right?




Step 5: Drop the legacy SQL planner (FLINK-14437)
[DONE in 1.14]

This included dropping support of DataSet API with SQL.



That is a major point! I've been doing TPC-DS benchmarks with Flink
lately by coding query3 with a DataSet pipeline, a DataStream pipeline,
and a SQL pipeline. What I can tell is that when I migrated from the
legacy SQL planner to the Blink SQL planner, I got 2 major improvements:


1. around 25% gain in run times on a 1TB input dataset (even though the
memory conf was slightly different between the runs of the 2 planners)


2. global order support: with the legacy planner based on DataSet, only
local partition ordering was supported. As a consequence, a SQL query
with an ORDER BY clause actually produced wrong results. With the Blink
planner based on DataStream, global order is supported and the query
results are now correct!


=> congrats to everyone involved in these big SQL improvements!
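
To make point 2 concrete: under the Blink planner, a batch query of this
shape (the table and columns are made up) now returns a globally ordered
result instead of only per-partition order:

    Table result = tableEnv.sqlQuery(
        "SELECT item_id, SUM(price) AS total "
            + "FROM sales "
            + "GROUP BY item_id "
            + "ORDER BY total DESC");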



Step 6: Connect both Table and DataStream API in batch mode (FLINK-20897)
[PLANNED in 1.14]

Step 7: Reach feature parity of Table API/DataStream API with DataSet API
[PLANNED for 1.14++]


Re: [DISCUSS] Incrementally deprecating the DataSet API

2021-06-23 Thread Chesnay Schepler
If we want to publicize this plan more shouldn't we have a rough 
timeline for when 2.0 is on the table?


On 6/23/2021 2:44 PM, Stephan Ewen wrote:

Thanks for writing this up, this also reflects my understanding.

I think a blog post would be nice, ideally with an explicit call for
feedback so we learn about user concerns.
A blog post has a lot more reach than an ML thread.

Best,
Stephan


On Wed, Jun 23, 2021 at 12:23 PM Timo Walther  wrote:


Hi everyone,

I'm sending this email to make sure everyone is on the same page about
slowly deprecating the DataSet API.

There have been a few thoughts mentioned in presentations, offline
discussions, and JIRA issues. However, I have observed that there are
still some concerns or different opinions on what steps are necessary to
implement this change.

Let me summarize some of the steps and assumptions, and let's have a
discussion about it:

Step 1: Introduce a batch mode for Table API (FLIP-32)
[DONE in 1.9]

Step 2: Introduce a batch mode for DataStream API (FLIP-134)
[DONE in 1.12]

Step 3: Soft deprecate DataSet API (FLIP-131)
[DONE in 1.12]

We updated the documentation recently to make this deprecation even more
visible. There is a dedicated `(Legacy)` label right next to the menu
item now.

We won't deprecate concrete classes of the API with a @Deprecated
annotation until then, to avoid extensive warnings in logs.

Step 4: Drop the legacy SQL connectors and formats (FLINK-14437)
[DONE in 1.14]

We dropped code for ORC, Parquet, and HBase formats that were only used
by DataSet API users. The removed classes had no documentation and were
not annotated with one of our API stability annotations.

The old functionality should be available through the new sources and
sinks for the Table API and DataStream API. If not, we should bring them
into a shape where they can be a full replacement.

DataSet users are encouraged to either upgrade the API or use Flink
1.13. Users can either just stay at Flink 1.13 or copy only the format's
code to a newer Flink version. We aim to keep the core interfaces (i.e.
InputFormat and OutputFormat) stable until the next major version.

We will maintain/allow important contributions to dropped connectors in
1.13. So 1.13 could be considered a kind of DataSet API LTS release.

Step 5: Drop the legacy SQL planner (FLINK-14437)
[DONE in 1.14]

This included dropping support of DataSet API with SQL.

Step 6: Connect both Table and DataStream API in batch mode (FLINK-20897)
[PLANNED in 1.14]

Step 7: Reach feature parity of Table API/DataStream API with DataSet API
[PLANNED for 1.14++]

We need to identify blockers when migrating from DataSet API to Table
API/DataStream API. Here we need to establish a good feedback pipeline
to include DataSet users in the roadmap planning.

Step 8: Drop the Gelly library

No concrete plan yet. At the latest, this would happen in the next major
Flink version, aka Flink 2.0.

Step 9: Drop DataSet API

Planned for the next major Flink version, aka Flink 2.0.


Please let me know if this matches your thoughts. We can also convert
this into a blog post or mention it in the next release notes.

Regards,
Timo






Re: [DISCUSS] Incrementally deprecating the DataSet API

2021-06-23 Thread Stephan Ewen
Thanks for writing this up, this also reflects my understanding.

I think a blog post would be nice, ideally with an explicit call for
feedback so we learn about user concerns.
A blog post has a lot more reach than an ML thread.

Best,
Stephan


On Wed, Jun 23, 2021 at 12:23 PM Timo Walther  wrote:

> Hi everyone,
>
> I'm sending this email to make sure everyone is on the same page about
> slowly deprecating the DataSet API.
>
> There have been a few thoughts mentioned in presentations, offline
> discussions, and JIRA issues. However, I have observed that there are
> still some concerns or different opinions on what steps are necessary to
> implement this change.
>
> Let me summarize some of the steps and assumptions, and let's have a
> discussion about it:
>
> Step 1: Introduce a batch mode for Table API (FLIP-32)
> [DONE in 1.9]
>
> Step 2: Introduce a batch mode for DataStream API (FLIP-134)
> [DONE in 1.12]
>
> Step 3: Soft deprecate DataSet API (FLIP-131)
> [DONE in 1.12]
>
> We updated the documentation recently to make this deprecation even more
> visible. There is a dedicated `(Legacy)` label right next to the menu
> item now.
>
> We won't deprecate concrete classes of the API with a @Deprecated
> annotation until then, to avoid extensive warnings in logs.
>
> Step 4: Drop the legacy SQL connectors and formats (FLINK-14437)
> [DONE in 1.14]
>
> We dropped code for ORC, Parquet, and HBase formats that were only used
> by DataSet API users. The removed classes had no documentation and were
> not annotated with one of our API stability annotations.
>
> The old functionality should be available through the new sources and
> sinks for the Table API and DataStream API. If not, we should bring them
> into a shape where they can be a full replacement.
>
> DataSet users are encouraged to either upgrade the API or use Flink
> 1.13. Users can either just stay at Flink 1.13 or copy only the format's
> code to a newer Flink version. We aim to keep the core interfaces (i.e.
> InputFormat and OutputFormat) stable until the next major version.
>
> We will maintain/allow important contributions to dropped connectors in
> 1.13. So 1.13 could be considered a kind of DataSet API LTS release.
>
> Step 5: Drop the legacy SQL planner (FLINK-14437)
> [DONE in 1.14]
>
> This included dropping support of DataSet API with SQL.
>
> Step 6: Connect both Table and DataStream API in batch mode (FLINK-20897)
> [PLANNED in 1.14]
>
> Step 7: Reach feature parity of Table API/DataStream API with DataSet API
> [PLANNED for 1.14++]
>
> We need to identify blockers when migrating from DataSet API to Table
> API/DataStream API. Here we need to establish a good feedback pipeline
> to include DataSet users in the roadmap planning.
>
> Step 8: Drop the Gelly library
>
> No concrete plan yet. At the latest, this would happen in the next major
> Flink version, aka Flink 2.0.
>
> Step 9: Drop DataSet API
>
> Planned for the next major Flink version, aka Flink 2.0.
>
>
> Please let me know if this matches your thoughts. We can also convert
> this into a blog post or mention it in the next release notes.
>
> Regards,
> Timo
>
>


[DISCUSS] Incrementally deprecating the DataSet API

2021-06-23 Thread Timo Walther

Hi everyone,

I'm sending this email to make sure everyone is on the same page about 
slowly deprecating the DataSet API.


There have been a few thoughts mentioned in presentations, offline 
discussions, and JIRA issues. However, I have observed that there are 
still some concerns or different opinions on what steps are necessary to 
implement this change.


Let me summarize some of the steps and assumptions, and let's have a
discussion about it:


Step 1: Introduce a batch mode for Table API (FLIP-32)
[DONE in 1.9]

Step 2: Introduce a batch mode for DataStream API (FLIP-134)
[DONE in 1.12]

Step 3: Soft deprecate DataSet API (FLIP-131)
[DONE in 1.12]

We updated the documentation recently to make this deprecation even more 
visible. There is a dedicated `(Legacy)` label right next to the menu 
item now.


We won't deprecate concrete classes of the API with a @Deprecated
annotation until then, to avoid extensive warnings in logs.


Step 4: Drop the legacy SQL connectors and formats (FLINK-14437)
[DONE in 1.14]

We dropped code for ORC, Parquet, and HBase formats that were only used
by DataSet API users. The removed classes had no documentation and were 
not annotated with one of our API stability annotations.


The old functionality should be available through the new sources and
sinks for the Table API and DataStream API. If not, we should bring them
into a shape where they can be a full replacement.


DataSet users are encouraged to either upgrade the API or use Flink 
1.13. Users can either just stay at Flink 1.13 or copy only the format's 
code to a newer Flink version. We aim to keep the core interfaces (i.e. 
InputFormat and OutputFormat) stable until the next major version.


We will maintain/allow important contributions to dropped connectors in
1.13. So 1.13 could be considered a kind of DataSet API LTS release.


Step 5: Drop the legacy SQL planner (FLINK-14437)
[DONE in 1.14]

This included dropping support of DataSet API with SQL.

Step 6: Connect both Table and DataStream API in batch mode (FLINK-20897)
[PLANNED in 1.14]

Step 7: Reach feature parity of Table API/DataStream API with DataSet API
[PLANNED for 1.14++]

We need to identify blockers when migrating from DataSet API to Table
API/DataStream API. Here we need to establish a good feedback pipeline
to include DataSet users in the roadmap planning.


Step 8: Drop the Gelly library

No concrete plan yet. At the latest, this would happen in the next major
Flink version, aka Flink 2.0.


Step 9: Drop DataSet API

Planned for the next major Flink version, aka Flink 2.0.


Please let me know if this matches your thoughts. We can also convert 
this into a blog post or mention it in the next release notes.


Regards,
Timo