If there are no concerns, I say let's merge this.
On Fri, Feb 16, 2018 at 9:39 AM, Charles Chen <c...@google.com> wrote:
> I hope those interested have had time to test this out. I have sent out
> https://github.com/apache/beam/pull/4696 to switch to using this fast
> runner as the default DirectRunner for local execution. Let me know if
> there are any concerns.
>
> On Tue, Feb 13, 2018 at 12:17 PM Charles Chen <c...@google.com> wrote:
>>
>> This is now checked into master. You can use it by setting
>> --runner=SwitchingDirectRunner. Please let us know if you run into any
>> issues.
>>
>> On Thu, Feb 8, 2018 at 10:30 AM Romain Manni-Bucau
>> <rmannibu...@gmail.com> wrote:
>>>
>>> Very interesting! Sounds like a sane direction for Beam's future, and
>>> I'm very happy it is consistent with the current Java experience:
>>> there is no need to interlace runners in the end, which makes the
>>> design, code and user experience way better than trying to put
>>> everything in the direct runner :).
>>>
>>> On Feb 8, 2018 at 19:20, "María García Herrero" <mari...@google.com>
>>> wrote:
>>>>
>>>> Amazing improvement, Charles.
>>>> Thanks for the effort!
>>>>
>>>> On Thu, Feb 8, 2018 at 10:14 AM Eugene Kirpichov
>>>> <kirpic...@google.com> wrote:
>>>>>
>>>>> Sounds awesome, congratulations and thanks for making this happen!
>>>>>
>>>>> On Thu, Feb 8, 2018 at 10:07 AM Raghu Angadi <rang...@google.com>
>>>>> wrote:
>>>>>>
>>>>>> This is terrific news! Thanks Charles.
>>>>>>
>>>>>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> Local execution of Beam pipelines on the Python DirectRunner
>>>>>>> currently suffers from performance issues, which makes it hard
>>>>>>> for pipeline authors to iterate, especially on medium to large
>>>>>>> datasets. We would like to optimize this and make it a better
>>>>>>> experience for Beam users.
>>>>>>>
>>>>>>> The FnApiRunner was written as a way of leveraging the
>>>>>>> portability framework execution code path for local portability
>>>>>>> development. We've found it also provides great speedups in
>>>>>>> batch execution with no user changes required, so we propose to
>>>>>>> switch to using this runner by default in batch pipelines. For
>>>>>>> example, WordCount on the Shakespeare dataset with a single CPU
>>>>>>> core now takes 50 seconds to run, compared to 12 minutes before;
>>>>>>> this is roughly a 15x performance improvement that users can get
>>>>>>> for free, with no user pipeline changes.
>>>>>>>
>>>>>>> The JIRA for this change is here
>>>>>>> (https://issues.apache.org/jira/browse/BEAM-3644), and a
>>>>>>> candidate patch is available here
>>>>>>> (https://github.com/apache/beam/pull/4634). I have been working
>>>>>>> over the last month on making this an automatic drop-in
>>>>>>> replacement for the current DirectRunner when applicable. Before
>>>>>>> it becomes the default, you can try this runner now by manually
>>>>>>> specifying
>>>>>>> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the
>>>>>>> runner.
>>>>>>>
>>>>>>> Even with this change, local Python pipeline execution can only
>>>>>>> effectively use one core because of the Python GIL. A natural
>>>>>>> next step to further improve performance will be to refactor the
>>>>>>> FnApiRunner to allow for multi-process execution. This is being
>>>>>>> tracked here (https://issues.apache.org/jira/browse/BEAM-3645).
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Charles
>>>>
>>>> --
>>>>
>>>> Impact is the effect that wouldn’t have happened if you hadn’t done
>>>> what you did.
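
For anyone who wants to try the runner selection described in the thread above, here is a minimal sketch. It assumes the Beam Python SDK is installed (`pip install apache-beam`) and a release in which the `SwitchingDirectRunner` name is recognized. The helpers `runner_args` and `run_wordcount` are hypothetical names made up for this illustration; they are not part of Beam, and the tiny in-memory input stands in for a real dataset.

```python
# Sketch only: runner_args and run_wordcount are hypothetical helpers,
# not Beam APIs. Running run_wordcount() requires apache-beam.


def runner_args(runner_name):
    """Build the command-line flag that selects a Beam runner by name."""
    return ["--runner=" + runner_name]


def run_wordcount(lines, runner_name="SwitchingDirectRunner"):
    """Count words in `lines` locally and print (word, count) pairs."""
    # Imported lazily so runner_args() can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(runner_args(runner_name))
    # The context manager runs the pipeline and waits for it to finish.
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(lines)
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "Count" >> beam.combiners.Count.PerElement()
            | "Print" >> beam.Map(print)
        )
```

Calling `run_wordcount(["to be or not to be"])` would run the pipeline with the switching runner. Passing `apache_beam.runners.portability.fn_api_runner.FnApiRunner` as `runner_name` is intended to select the FnApiRunner directly, as the original proposal describes, though module paths may differ between Beam releases.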