Thanks Robert. I am going to create the GitHub issues and PRs. If there are further questions/concerns raised later, we can address them here.
Cheers,

Jiangjie (Becket) Qin

On Thu, Feb 2, 2023 at 8:39 AM Robert Bradshaw <rober...@google.com> wrote:
> Thanks. In that case keeping both in parallel, and tying the switch in
> the default to a (possibly overridable) choice of Flink version, makes
> a lot of sense.
>
> On Wed, Feb 1, 2023 at 3:33 PM Becket Qin <becket....@gmail.com> wrote:
> >
> > Hi Robert,
> >
> > Thanks for the feedback. This change will be transparent to the user
> > applications in most cases. However, there are still a few differences
> > visible to the users.
> >
> > 1. Configurations. DataStream and DataSet take different configurations.
> > 2. Metrics. DataStream operators and DataSet operators may emit
> >    different metrics.
> > 3. Some other potential behavior change. The DataSet API currently goes
> >    through a simple optimizer, while the DataStream API does not. And the
> >    underlying operator implementations are also different. So users may
> >    find their job execution topology changes after switching to
> >    DataStream.
> > 4. Resource consumption. Because the underlying operator implementations
> >    are different, the resource consumption may be different.
> >
> > So, in general I feel it is probably safer to keep the DataSet execution
> > path for some time before we remove it completely.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Thu, Feb 2, 2023 at 1:23 AM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
> >>
> >> This sounds reasonable to me. One question I have is why a user would
> >> prefer to stick with the DataSet API if the DataStream API is
> >> available. Would there be any user-visible difference?
> >>
> >> On Wed, Feb 1, 2023 at 1:11 AM Becket Qin <becket....@gmail.com> wrote:
> >> >
> >> > Hi Beam devs,
> >> >
> >> > I'd like to start a discussion about migrating the Flink runner to
> >> > execute the batch jobs in DataStream API instead of DataSet API.
> >> >
> >> > Today Flink runner executes batch jobs with DataSet API which is
> >> > semi-deprecated and will be removed sometime in future Flink releases.
> >> > Flink DataStream API has been extended to replace DataSet API for
> >> > batch job execution. So here we propose to migrate the Flink Beam
> >> > runner from DataSet to DataStream for batch job execution.
> >> >
> >> > I have compiled this one pager[1] to explain the motivation,
> >> > interface change, migration plan and proposed changes. We also have a
> >> > PoC implementation of this migration[2] which has passed the existing
> >> > unit tests and runner validation tests.
> >> >
> >> > Would love to get your thoughts on this.
> >> >
> >> > BTW, I am starting this discussion thread as I am not sure whether
> >> > this change is considered as a large change[3] or not. If there is no
> >> > concern for the change, I'll just create the GitHub issues and start
> >> > to work on it.
> >> >
> >> > Also, I have worked with Xinyu Liu on the PoC implementation, and
> >> > Xinyu has agreed to help review the patches (thank you Xinyu). It
> >> > would be great if someone who has worked on Flink runner before can
> >> > also help with the PR reviews.
> >> >
> >> > Thanks,
> >> >
> >> > Jiangjie (Becket) Qin
> >> >
> >> > [1] https://docs.google.com/document/d/1cjUJHOS1eEkH76hMNeBuc-kPhbIIc9w2gvjm8miIFS8/edit?usp=sharing
> >> > [2] https://github.com/becketqin/beam/tree/flink-batch-runner-migration
> >> > [3] https://github.com/apache/beam/blob/14e8de6e99a031ba7376bdb6837d471648878932/CONTRIBUTING.md