Amazing improvement, Charles. Thanks for the effort!
On Thu, Feb 8, 2018 at 10:14 AM Eugene Kirpichov <kirpic...@google.com> wrote: > Sounds awesome, congratulations and thanks for making this happen! > > On Thu, Feb 8, 2018 at 10:07 AM Raghu Angadi <rang...@google.com> wrote: > >> This is terrific news! Thanks Charles. >> >> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com> wrote: >> >>> Local execution of Beam pipelines on the Python DirectRunner currently >>> suffers from performance issues, which makes it hard for pipeline authors >>> to iterate, especially on medium to large size datasets. We would like to >>> optimize and make this a better experience for Beam users. >>> >>> The FnApiRunner was written as a way of leveraging the portability >>> framework execution code path for local portability development. We've >>> found it also provides great speedups in batch execution with no user >>> changes required, so we propose to switch to use this runner by default in >>> batch pipelines. For example, WordCount on the Shakespeare dataset with a >>> single CPU core now takes 50 seconds to run, compared to 12 minutes before; >>> this is a 15x performance improvement that users can get for free, with >>> no user pipeline changes. >>> >>> The JIRA for this change is here ( >>> https://issues.apache.org/jira/browse/BEAM-3644), and a candidate patch >>> is available here (https://github.com/apache/beam/pull/4634). I have >>> been working over the last month on making this an automatic drop-in >>> replacement for the current DirectRunner when applicable. Before it >>> becomes the default, you can try this runner now by manually specifying >>> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the >>> runner. >>> >>> Even with this change, local Python pipeline execution can only >>> effectively use one core because of the Python GIL. A natural next step to >>> further improve performance will be to refactor the FnApiRunner to allow >>> for multi-process execution. This is being tracked here ( >>> https://issues.apache.org/jira/browse/BEAM-3645). >>> >>> Best, >>> >>> Charles >>> >> -- Impact is the effect that wouldn’t have happened if you hadn’t done what you did.