Re: A 15x speed-up in local Python DirectRunner execution

Charles Chen Fri, 16 Feb 2018 09:39:40 -0800

I hope those interested have had time to test this out.  I have sent out
https://github.com/apache/beam/pull/4696 to switch to using this fast
runner as the default DirectRunner for local execution.  Let me know if
there are any concerns.


On Tue, Feb 13, 2018 at 12:17 PM Charles Chen <[email protected]> wrote:

> This is now checked into master.  You can use it by setting
> --runner=SwitchingDirectRunner.  Please let us know if you run into any
> issues.
>
>
> On Thu, Feb 8, 2018 at 10:30 AM Romain Manni-Bucau <[email protected]>
> wrote:
>
>> Very interesting! This sounds like a sane direction for Beam's future,
>> and I'm very happy it is consistent with the current Java experience: no
>> need to interlace runners in the end, which makes the design, code and
>> user experience way better than trying to put everything in the direct
>> runner :).
>>
>> On Feb 8, 2018 at 19:20, "María García Herrero" <[email protected]>
>> wrote:
>>
>>> Amazing improvement, Charles.
>>> Thanks for the effort!
>>>
>>>
>>> On Thu, Feb 8, 2018 at 10:14 AM Eugene Kirpichov <[email protected]>
>>> wrote:
>>>
>>>> Sounds awesome, congratulations and thanks for making this happen!
>>>>
>>>> On Thu, Feb 8, 2018 at 10:07 AM Raghu Angadi <[email protected]>
>>>> wrote:
>>>>
>>>>> This is terrific news! Thanks Charles.
>>>>>
>>>>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <[email protected]> wrote:
>>>>>
>>>>>> Local execution of Beam pipelines on the Python DirectRunner
>>>>>> currently suffers from performance issues, which makes it hard for
>>>>>> pipeline authors to iterate, especially on medium to large size
>>>>>> datasets.  We would like to optimize and make this a better
>>>>>> experience for Beam users.
>>>>>>
>>>>>> The FnApiRunner was written as a way of leveraging the portability
>>>>>> framework execution code path for local portability development. We've
>>>>>> found it also provides great speedups in batch execution with no user
>>>>>> changes required, so we propose switching to this runner by default
>>>>>> in batch pipelines.  For example, WordCount on the Shakespeare dataset
>>>>>> with a single CPU core now takes 50 seconds to run, compared to 12
>>>>>> minutes before; this is roughly a 15x performance improvement that
>>>>>> users can get for free, with no pipeline changes.
>>>>>>
>>>>>> The JIRA for this change is here (
>>>>>> https://issues.apache.org/jira/browse/BEAM-3644), and a candidate
>>>>>> patch is available here (https://github.com/apache/beam/pull/4634).
>>>>>> I have been working over the last month on making this an automatic
>>>>>> drop-in replacement for the current DirectRunner when applicable.
>>>>>> Before it becomes the default, you can try this runner now by manually
>>>>>> specifying apache_beam.runners.portability.fn_api_runner.FnApiRunner
>>>>>> as the runner.
>>>>>>
>>>>>> Even with this change, local Python pipeline execution can only
>>>>>> effectively use one core because of the Python GIL.  A natural next
>>>>>> step to further improve performance will be to refactor the
>>>>>> FnApiRunner to allow for multi-process execution.  This is being
>>>>>> tracked here (https://issues.apache.org/jira/browse/BEAM-3645).
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Charles
>>>>>>
>>>>>
>>>
>>> --
>>>
>>> Impact is the effect that wouldn’t have happened if you hadn’t done what you
>>> did.
>>>
>>>
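
The original message notes that the GIL limits local Python execution to
one core, and names multi-process execution as the next step (BEAM-3645).
The general idea can be illustrated outside of Beam. What follows is only a
minimal sketch using the standard library, not the FnApiRunner's design or
any Beam API: it assumes a word count whose input can be split into
independent shards, and the names count_words, shard and word_count are
hypothetical helpers invented for this illustration.

```python
from collections import Counter
from multiprocessing import Pool


def count_words(lines):
    """Count whitespace-separated, lower-cased words in one shard of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def shard(lines, num_shards):
    """Split a list of lines into num_shards round-robin shards."""
    return [lines[i::num_shards] for i in range(num_shards)]


def word_count(lines, num_shards=4, map_fn=map):
    """Count words per shard with map_fn, then merge the partial counts.

    map_fn defaults to the built-in map (single process). Passing the
    map method of a multiprocessing.Pool spreads the shards across
    worker processes, each with its own interpreter and its own GIL.
    """
    total = Counter()
    for partial in map_fn(count_words, shard(lines, num_shards)):
        total.update(partial)
    return total


if __name__ == "__main__":
    # Hypothetical sample input; a real run would read a file instead.
    text = ["to be or not to be", "that is the question"]
    with Pool(processes=2) as pool:
        print(word_count(text, num_shards=2, map_fn=pool.map))
```

Because each shard is counted independently and the partial results are
merged afterwards, the single-process and multi-process paths return the
same counts. Only the map step changes, which is why this shape of
computation is a natural fit for running across several processes.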
