Re: [DISCUSS] Migrate Flink runner to run batch jobs in DataStream API

Becket Qin Wed, 01 Feb 2023 17:41:58 -0800

Thanks Robert.

I am going to create the GitHub issues and PRs. If there are further
questions/concerns raised later, we can address them here.


Cheers,

Jiangjie (Becket) Qin

On Thu, Feb 2, 2023 at 8:39 AM Robert Bradshaw <[email protected]> wrote:

> Thanks. In that case keeping both in parallel, and tying the switch in
> the default to a (possibly overridable) choice of Flink version, makes
> a lot of sense.
>
> On Wed, Feb 1, 2023 at 3:33 PM Becket Qin <[email protected]> wrote:
> >
> > Hi Robert,
> >
> > Thanks for the feedback. This change will be transparent to user
> > applications in most cases. However, a few differences will still be
> > visible to users.
> >
> > 1. Configurations. DataStream and DataSet take different configurations.
> > 2. Metrics. DataStream operators and DataSet operators may emit
> > different metrics.
> > 3. Other potential behavior changes. The DataSet API currently goes
> > through a simple optimizer, while the DataStream API does not, and the
> > underlying operator implementations also differ. So users may find
> > that their job execution topology changes after switching to DataStream.
> > 4. Resource consumption. Because the underlying operator implementations
> > differ, resource consumption may differ as well.
> >
> > So, in general, I feel it is probably safer to keep the DataSet execution
> > path for some time before we remove it completely.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Thu, Feb 2, 2023 at 1:23 AM Robert Bradshaw via dev <[email protected]> wrote:
> >>
> >> This sounds reasonable to me. One question I have is why a user would
> >> prefer to stick with the DataSet API if the DataStream API is
> >> available. Would there be any user-visible difference?
> >>
> >> On Wed, Feb 1, 2023 at 1:11 AM Becket Qin <[email protected]> wrote:
> >> >
> >> > Hi Beam devs,
> >> >
> >> > I'd like to start a discussion about migrating the Flink runner to
> >> > execute batch jobs with the DataStream API instead of the DataSet API.
> >> >
> >> > Today the Flink runner executes batch jobs with the DataSet API, which
> >> > is semi-deprecated and will be removed in a future Flink release. The
> >> > Flink DataStream API has been extended to replace the DataSet API for
> >> > batch job execution, so we propose to migrate the Beam Flink runner
> >> > from DataSet to DataStream for batch job execution.
> >> >
> >> > I have compiled this one-pager[1] to explain the motivation,
> >> > interface changes, migration plan and proposed changes. We also have a
> >> > PoC implementation of this migration[2], which has passed the existing
> >> > unit tests and runner validation tests.
> >> >
> >> > Would love to get your thoughts on this.
> >> >
> >> > BTW, I am starting this discussion thread because I am not sure
> >> > whether this change is considered a large change[3]. If there are no
> >> > concerns about the change, I'll just create the GitHub issues and
> >> > start working on it.
> >> >
> >> > Also, I have worked with Xinyu Liu on the PoC implementation, and
> >> > Xinyu has agreed to help review the patches (thank you, Xinyu). It
> >> > would be great if someone who has worked on the Flink runner before
> >> > could also help with the PR reviews.
> >> >
> >> > Thanks,
> >> >
> >> > Jiangjie (Becket) Qin
> >> >
> >> > [1] https://docs.google.com/document/d/1cjUJHOS1eEkH76hMNeBuc-kPhbIIc9w2gvjm8miIFS8/edit?usp=sharing
> >> > [2] https://github.com/becketqin/beam/tree/flink-batch-runner-migration
> >> > [3] https://github.com/apache/beam/blob/14e8de6e99a031ba7376bdb6837d471648878932/CONTRIBUTING.md
>
