Very interesting! Sounds like a sane way for beam future and I'm very happy it is consistent with the current Java experience: no need to interlace runners at the end, it makes design, code and user experience way better than trying to put everything in the direct runner :).
Le 8 févr. 2018 19:20, "María García Herrero" <[email protected]> a écrit : > Amazing improvement, Charles. > Thanks for the effort! > > > On Thu, Feb 8, 2018 at 10:14 AM Eugene Kirpichov <[email protected]> > wrote: > >> Sounds awesome, congratulations and thanks for making this happen! >> >> On Thu, Feb 8, 2018 at 10:07 AM Raghu Angadi <[email protected]> wrote: >> >>> This is terrific news! Thanks Charles. >>> >>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <[email protected]> wrote: >>> >>>> Local execution of Beam pipelines on the Python DirectRunner currently >>>> suffers from performance issues, which makes it hard for pipeline authors >>>> to iterate, especially on medium to large size datasets. We would like to >>>> optimize and make this a better experience for Beam users. >>>> >>>> The FnApiRunner was written as a way of leveraging the portability >>>> framework execution code path for local portability development. We've >>>> found it also provides great speedups in batch execution with no user >>>> changes required, so we propose to switch to use this runner by default in >>>> batch pipelines. For example, WordCount on the Shakespeare dataset with a >>>> single CPU core now takes 50 seconds to run, compared to 12 minutes before; >>>> this is a 15x performance improvement that users can get for free, >>>> with no user pipeline changes. >>>> >>>> The JIRA for this change is here (https://issues.apache.org/ >>>> jira/browse/BEAM-3644), and a candidate patch is available here ( >>>> https://github.com/apache/beam/pull/4634). I have been working over >>>> the last month on making this an automatic drop-in replacement for the >>>> current DirectRunner when applicable. Before it becomes the default, you >>>> can try this runner now by manually specifying apache_beam.runners. >>>> portability.fn_api_runner.FnApiRunner as the runner. >>>> >>>> Even with this change, local Python pipeline execution can only >>>> effectively use one core because of the Python GIL. A natural next step to >>>> further improve performance will be to refactor the FnApiRunner to allow >>>> for multi-process execution. This is being tracked here ( >>>> https://issues.apache.org/jira/browse/BEAM-3645). >>>> >>>> Best, >>>> >>>> Charles >>>> >>> > > -- > > Impact is the effect that wouldn’t have happened if you hadn’t done what you > did. > >
