Hey everyone;

I'm starting to work on BEAM-38 (
https://issues.apache.org/jira/browse/BEAM-38), which enables an
optimization for runners with many small bundles. BEAM-38 allows runners to
reuse DoFn instances so long as that DoFn has not terminated abnormally.
This replaces the previous requirement that a DoFn be used for only a
single bundle if either of startBundle or finishBundle have been
overwritten.

DoFn deserialization-per-bundle can be a significance performance
bottleneck when there are many small bundles, as is common in streaming
executions. It has also surfaced as the cause of much of the current
slowness in the new InProcessRunner.

Existing Runners do not require any changes; they may choose to take
advantage of of the new optimization opportunity. However, user DoFns may
need to be revised to properly set up and tear down state in startBundle
and finishBundle, respectively, if the depended on only being used for a
single bundle.

The first two updates are already in pull requests:

PR #419 (https://github.com/apache/incubator-beam/pull/419) updates the
Javadoc to the new spec
PR #418 (https://github.com/apache/incubator-beam/pull/418) updates the
DirectRunner to reuse DoFns according to the new policy.

Yours,

Thomas

Reply via email to