Amazing improvement, Charles.
Thanks for the effort!

On Thu, Feb 8, 2018 at 10:14 AM Eugene Kirpichov <kirpic...@google.com>
wrote:

> Sounds awesome, congratulations and thanks for making this happen!
>
> On Thu, Feb 8, 2018 at 10:07 AM Raghu Angadi <rang...@google.com> wrote:
>
>> This is terrific news! Thanks Charles.
>>
>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com> wrote:
>>
>>> Local execution of Beam pipelines on the Python DirectRunner currently
>>> suffers from performance issues, which makes it hard for pipeline authors
>>> to iterate, especially on medium to large datasets. We would like to
>>> optimize this and make it a better experience for Beam users.
>>>
>>> The FnApiRunner was written as a way of leveraging the portability
>>> framework execution code path for local portability development. We've
>>> found that it also provides large speedups in batch execution, so we
>>> propose making it the default runner for batch pipelines. For example,
>>> WordCount on the Shakespeare dataset with a single CPU core now takes
>>> 50 seconds to run, compared to 12 minutes before; this is roughly a 15x
>>> performance improvement that users get for free, with no pipeline
>>> changes required.
>>>
>>> The JIRA for this change is here (
>>> https://issues.apache.org/jira/browse/BEAM-3644), and a candidate patch
>>> is available here (https://github.com/apache/beam/pull/4634). I have
>>> been working over the last month on making this an automatic drop-in
>>> replacement for the current DirectRunner when applicable.  Before it
>>> becomes the default, you can try this runner now by manually specifying
>>> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the
>>> runner.
>>>
>>> Even with this change, local Python pipeline execution can effectively
>>> use only one core because of the Python GIL. A natural next step to
>>> further improve performance will be to refactor the FnApiRunner to allow
>>> for multi-process execution. This is being tracked here (
>>> https://issues.apache.org/jira/browse/BEAM-3645).
>>>
>>> Best,
>>>
>>> Charles
>>>
>>
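
For anyone who wants to try the runner before it becomes the default, here is a minimal sketch of specifying it manually, as Charles describes. It assumes the fully qualified class name quoted above can be passed through the standard --runner pipeline option; the helper names runner_args and run_word_count and the tiny word count itself are hypothetical, for illustration only, and the module path may differ in other Beam releases.

```python
# Fully qualified runner name, copied from the message above.
FN_API_RUNNER = "apache_beam.runners.portability.fn_api_runner.FnApiRunner"


def runner_args(extra_args=None):
    """Return pipeline arguments that select the FnApiRunner explicitly.

    Hypothetical helper: it only builds the argument list and does not
    import Beam.
    """
    return ["--runner=" + FN_API_RUNNER] + list(extra_args or [])


def run_word_count(lines, output_prefix):
    """Run a small in-memory word count on the FnApiRunner.

    Hypothetical example pipeline; requires apache_beam to be installed.
    """
    # Imported lazily so that runner_args() works without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(runner_args())
    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create(lines)
         | beam.FlatMap(lambda line: line.split())
         | beam.Map(lambda word: (word, 1))
         | beam.CombinePerKey(sum)
         | beam.Map(lambda kv: "%s: %d" % kv)
         | beam.io.WriteToText(output_prefix))
```

Calling run_word_count(["to be or not to be"], "/tmp/counts") should write the per-word counts to files under the given prefix. An existing pipeline script should be able to use the same --runner value on its command line without code changes.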

-- 

Impact is the effect that wouldn’t have happened if you hadn’t done what you
did.
