I hope those interested have had time to test this out. I have sent out https://github.com/apache/beam/pull/4696 to switch to using this fast runner as the default DirectRunner for local execution. Let me know if there are any concerns.
On Tue, Feb 13, 2018 at 12:17 PM Charles Chen <[email protected]> wrote:

> This is now checked into master. You can use it by setting
> --runner=SwitchingDirectRunner. Please let us know if you run into any
> issues.
>
> On Thu, Feb 8, 2018 at 10:30 AM Romain Manni-Bucau <[email protected]>
> wrote:
>
>> Very interesting! Sounds like a sane way forward for Beam's future, and
>> I'm very happy it is consistent with the current Java experience: no need
>> to interlace runners at the end. It makes design, code and user
>> experience way better than trying to put everything in the direct
>> runner :).
>>
>> On Feb 8, 2018 at 19:20, "María García Herrero" <[email protected]>
>> wrote:
>>
>>> Amazing improvement, Charles.
>>> Thanks for the effort!
>>>
>>> On Thu, Feb 8, 2018 at 10:14 AM Eugene Kirpichov <[email protected]>
>>> wrote:
>>>
>>>> Sounds awesome, congratulations and thanks for making this happen!
>>>>
>>>> On Thu, Feb 8, 2018 at 10:07 AM Raghu Angadi <[email protected]>
>>>> wrote:
>>>>
>>>>> This is terrific news! Thanks Charles.
>>>>>
>>>>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <[email protected]> wrote:
>>>>>
>>>>>> Local execution of Beam pipelines on the Python DirectRunner
>>>>>> currently suffers from performance issues, which makes it hard for
>>>>>> pipeline authors to iterate, especially on medium to large
>>>>>> datasets. We would like to optimize this and make it a better
>>>>>> experience for Beam users.
>>>>>>
>>>>>> The FnApiRunner was written as a way of leveraging the portability
>>>>>> framework execution code path for local portability development.
>>>>>> We've found it also provides great speedups in batch execution with
>>>>>> no user changes required, so we propose to switch to using this
>>>>>> runner by default in batch pipelines. For example, WordCount on the
>>>>>> Shakespeare dataset with a single CPU core now takes 50 seconds to
>>>>>> run, compared to 12 minutes before; this is roughly a 15x
>>>>>> performance improvement that users can get for free, with no user
>>>>>> pipeline changes.
>>>>>>
>>>>>> The JIRA for this change is here
>>>>>> (https://issues.apache.org/jira/browse/BEAM-3644), and a candidate
>>>>>> patch is available here (https://github.com/apache/beam/pull/4634).
>>>>>> I have been working over the last month on making this an automatic
>>>>>> drop-in replacement for the current DirectRunner when applicable.
>>>>>> Before it becomes the default, you can try this runner now by
>>>>>> manually specifying
>>>>>> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the
>>>>>> runner.
>>>>>>
>>>>>> Even with this change, local Python pipeline execution can only
>>>>>> effectively use one core because of the Python GIL. A natural next
>>>>>> step to further improve performance will be to refactor the
>>>>>> FnApiRunner to allow for multi-process execution. This is being
>>>>>> tracked here (https://issues.apache.org/jira/browse/BEAM-3645).
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Charles
>>>
>>> --
>>>
>>> Impact is the effect that wouldn’t have happened if you hadn’t done
>>> what you did.
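
For anyone who wants to try the two runner options mentioned in this thread, here is a minimal sketch, not a definitive recipe. It assumes the apache_beam Python SDK is installed. The runner name SwitchingDirectRunner and the path apache_beam.runners.portability.fn_api_runner.FnApiRunner are taken from the messages above and may have moved or changed in later Beam releases. The helper names runner_flags and run_wordcount_sketch, and the sample input, are hypothetical and only for illustration.

```python
# Runner names exactly as given in this thread; later Beam releases may differ.
SWITCHING_RUNNER = "SwitchingDirectRunner"
FN_API_RUNNER = "apache_beam.runners.portability.fn_api_runner.FnApiRunner"


def runner_flags(runner_name):
    """Build the command-line style flag that selects a runner (hypothetical helper)."""
    return ["--runner=" + runner_name]


def run_wordcount_sketch(lines, runner_name=SWITCHING_RUNNER):
    """Count words in `lines` on the chosen runner (illustrative only)."""
    # Imported lazily so runner_flags() can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(runner_flags(runner_name))
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Create" >> beam.Create(lines)
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "Count" >> beam.combiners.Count.PerElement()
            | "Print" >> beam.Map(print)
        )
```

Calling run_wordcount_sketch(["to be or not to be"], FN_API_RUNNER) would run the same small pipeline on the FnApiRunner directly, which is one way to compare the two options locally.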
