This sounds reasonable to me. One question I have is why a user would
prefer to stick with the DataSet API if the DataStream API is
available. Would there be any user-visible difference?

On Wed, Feb 1, 2023 at 1:11 AM Becket Qin <becket....@gmail.com> wrote:
>
> Hi Beam devs,
>
> I'd like to start a discussion about migrating the Flink runner to execute
> batch jobs with the DataStream API instead of the DataSet API.
>
> Today the Flink runner executes batch jobs with the DataSet API, which is
> semi-deprecated and will be removed in a future Flink release. The Flink
> DataStream API has been extended to replace the DataSet API for batch job
> execution, so we propose migrating the Flink Beam runner from DataSet to
> DataStream for batch job execution.
>
> I have compiled this one-pager [1] to explain the motivation, interface
> change, migration plan and proposed changes. We also have a PoC
> implementation of this migration [2], which has passed the existing unit
> tests and runner validation tests.
>
> Would love to get your thoughts on this.
>
> BTW, I am starting this discussion thread because I am not sure whether
> this change is considered a large change [3]. If there are no concerns
> about the change, I'll just create the GitHub issues and start working on it.
>
> Also, I have worked with Xinyu Liu on the PoC implementation, and Xinyu has
> agreed to help review the patches (thank you Xinyu). It would be great if
> someone who has worked on the Flink runner before could also help with the
> PR reviews.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> [1] https://docs.google.com/document/d/1cjUJHOS1eEkH76hMNeBuc-kPhbIIc9w2gvjm8miIFS8/edit?usp=sharing
> [2] https://github.com/becketqin/beam/tree/flink-batch-runner-migration
> [3] https://github.com/apache/beam/blob/14e8de6e99a031ba7376bdb6837d471648878932/CONTRIBUTING.md
