Does the same runner work for Java pipelines? (I assume so, given that it uses the portability framework.) If so, does it provide a similar speedup?
On Fri, Feb 16, 2018 at 7:37 PM Robert Bradshaw <rober...@google.com> wrote:
> If there are no concerns, I say let's merge this.
>
> On Fri, Feb 16, 2018 at 9:39 AM, Charles Chen <c...@google.com> wrote:
> > I hope those interested have had time to test this out. I have sent out
> > https://github.com/apache/beam/pull/4696 to switch to using this fast
> > runner as the default DirectRunner for local execution. Let me know if
> > there are any concerns.
> >
> > On Tue, Feb 13, 2018 at 12:17 PM Charles Chen <c...@google.com> wrote:
> >>
> >> This is now checked into master. You can use it by setting
> >> --runner=SwitchingDirectRunner. Please let us know if you run into any
> >> issues.
> >>
> >> On Thu, Feb 8, 2018 at 10:30 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> >>>
> >>> Very interesting! Sounds like a sane way for beam future and I'm very
> >>> happy it is consistent with the current Java experience: no need to
> >>> interlace runners at the end, it makes design, code and user experience
> >>> way better than trying to put everything in the direct runner :).
> >>>
> >>> On Feb 8, 2018 at 19:20, "María García Herrero" <mari...@google.com> wrote:
> >>>>
> >>>> Amazing improvement, Charles.
> >>>> Thanks for the effort!
> >>>>
> >>>> On Thu, Feb 8, 2018 at 10:14 AM Eugene Kirpichov <kirpic...@google.com> wrote:
> >>>>>
> >>>>> Sounds awesome, congratulations and thanks for making this happen!
> >>>>>
> >>>>> On Thu, Feb 8, 2018 at 10:07 AM Raghu Angadi <rang...@google.com> wrote:
> >>>>>>
> >>>>>> This is terrific news! Thanks Charles.
> >>>>>>
> >>>>>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com> wrote:
> >>>>>>>
> >>>>>>> Local execution of Beam pipelines on the Python DirectRunner
> >>>>>>> currently suffers from performance issues, which makes it hard for
> >>>>>>> pipeline authors to iterate, especially on medium to large size
> >>>>>>> datasets. We would like to optimize and make this a better
> >>>>>>> experience for Beam users.
> >>>>>>>
> >>>>>>> The FnApiRunner was written as a way of leveraging the portability
> >>>>>>> framework execution code path for local portability development.
> >>>>>>> We've found it also provides great speedups in batch execution with
> >>>>>>> no user changes required, so we propose to switch to use this runner
> >>>>>>> by default in batch pipelines. For example, WordCount on the
> >>>>>>> Shakespeare dataset with a single CPU core now takes 50 seconds to
> >>>>>>> run, compared to 12 minutes before; this is a 15x performance
> >>>>>>> improvement that users can get for free, with no user pipeline
> >>>>>>> changes.
> >>>>>>>
> >>>>>>> The JIRA for this change is here
> >>>>>>> (https://issues.apache.org/jira/browse/BEAM-3644), and a candidate
> >>>>>>> patch is available here (https://github.com/apache/beam/pull/4634).
> >>>>>>> I have been working over the last month on making this an automatic
> >>>>>>> drop-in replacement for the current DirectRunner when applicable.
> >>>>>>> Before it becomes the default, you can try this runner now by
> >>>>>>> manually specifying
> >>>>>>> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the
> >>>>>>> runner.
> >>>>>>>
> >>>>>>> Even with this change, local Python pipeline execution can only
> >>>>>>> effectively use one core because of the Python GIL. A natural next
> >>>>>>> step to further improve performance will be to refactor the
> >>>>>>> FnApiRunner to allow for multi-process execution. This is being
> >>>>>>> tracked here (https://issues.apache.org/jira/browse/BEAM-3645).
> >>>>>>>
> >>>>>>> Best,
> >>>>>>>
> >>>>>>> Charles
> >>>>
> >>>> --
> >>>>
> >>>> Impact is the effect that wouldn’t have happened if you hadn’t done what
> >>>> you did.
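
For anyone comparing against their own runs, here is a quick arithmetic check of the WordCount figures quoted in the thread (50 seconds versus 12 minutes). It is only a sketch: the function name `speedup` is made up for illustration and is not part of Beam, and the two timings are simply the ones reported above, not a new measurement.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Return how many times faster the new run is than the old one.

    Hypothetical helper for illustration; not a Beam API.
    """
    if after_seconds <= 0:
        raise ValueError("after_seconds must be positive")
    return before_seconds / after_seconds


# Timings as reported in the thread (single CPU core, Shakespeare WordCount).
before = 12 * 60  # "12 minutes" on the previous DirectRunner, in seconds
after = 50        # "50 seconds" on the FnApiRunner

ratio = speedup(before, after)
print(f"speedup: {ratio:.1f}x")  # prints "speedup: 14.4x"
```

So the quoted "15x" is a rounding of roughly 14.4x. The same helper can be pointed at timings from a Java pipeline if someone has them.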