Re: Python profiling

Ahmet Altay Fri, 16 Nov 2018 10:15:32 -0800

On Fri, Nov 16, 2018 at 10:12 AM, Thomas Weise <[email protected]> wrote:


> Since it is for users, it should eventually go to the web site.
>
> How about a new section under:
> https://beam.apache.org/documentation/sdks/python/
>
> "Troubleshooting and Tuning" ?
>

That is a good idea.


>
>
> On Fri, Nov 16, 2018 at 10:08 AM Ahmet Altay <[email protected]> wrote:
>
>>
>>
>> On Fri, Nov 16, 2018 at 2:12 AM, Robert Bradshaw <[email protected]>
>> wrote:
>>
>>> One needs to ensure that gprof2dot is importable (i.e. installed via pip
>>> into your Python environment).
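For anyone who still sees the "Please install gprof2dot and dot" prompt, a quick check along these lines may help narrow it down. This is only a sketch: it assumes the rendering step needs gprof2dot importable from the same interpreter that runs the pipeline (as described above) and a Graphviz dot executable on PATH. The function name is made up for illustration and is not part of Beam.

```python
import importlib.util
import shutil


def profile_rendering_tools():
    """Report which rendering dependencies this interpreter can see.

    Hypothetical helper, not a Beam API.
    """
    return {
        # gprof2dot has to be importable from the environment running the pipeline.
        "gprof2dot": importlib.util.find_spec("gprof2dot") is not None,
        # Assumption: dot is the Graphviz executable and is looked up on PATH.
        "dot": shutil.which("dot") is not None,
    }


if __name__ == "__main__":
    for name, found in profile_rendering_tools().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

Running it with the same python binary used to launch the pipeline matters, since a gprof2dot installed into a different virtualenv would not be found.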
>>>
>>> As for specifying the FnApiRunner via the runner argument, --runner can
>>> take fully qualified names (if it's not in the short list of known
>>> runners). However, the FnApiRunner is the DirectRunner for non-streaming
>>> mode, so there's no need to specify it explicitly.
>>>
>>> Good point about adding this to the documentation. It's unclear where
>>> best to put it...
>>>
>>
>> How about in the wiki under Python tips?
>> (https://cwiki.apache.org/confluence/display/BEAM/Python+Tips) From there
>> it can later be converted to full user docs.
>>
>>
>>>
>>> On Thu, Nov 15, 2018 at 5:28 PM Thomas Weise <[email protected]> wrote:
>>>
>>>> Hi Robert,
>>>>
>>>> This is great. It should be added to our Python documentation because
>>>> users will likely need this!
>>>>
>>>> After I installed gprof2dot I'm still prompted to install:
>>>>
>>>> "Please install gprof2dot and dot for profile renderings."
>>>>
>>>> Also is there a way to run a pipeline unmodified with fn_api_runner?
>>>> (For those interested in profiling the SDK worker.)
>>>>
>>>> It works with direct runner, but "FnApiRunner" isn't currently
>>>> supported as --runner argument:
>>>>
>>>> python -m apache_beam.examples.wordcount \
>>>>   --input=/etc/profile \
>>>>   --output=/tmp/py-wordcount-direct \
>>>>   --runner=FnApiRunner \
>>>>   --streaming \
>>>>   --profile_cpu --profile_location=./build/pyprofile
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>>
>>>> On Mon, Nov 5, 2018 at 7:15 PM Ankur Goenka <[email protected]> wrote:
>>>>
>>>>> All containers are destroyed by default on termination, so to analyze
>>>>> profiling data for portable runners, either disable container cleanup
>>>>> (using --retainDockerContainers=true) or use a remote distributed file
>>>>> system path.
>>>>>
>>>>> On Mon, Nov 5, 2018 at 1:05 AM Robert Bradshaw <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Any portable runner should pick it up automatically.
>>>>>> On Tue, Oct 30, 2018 at 3:32 AM Manu Zhang <[email protected]>
>>>>>> wrote:
>>>>>> >
>>>>>> > Cool! Can we document it somewhere such that other Runners could
>>>>>> > pick it up later?
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Manu Zhang
>>>>>> > On Oct 29, 2018, 5:54 PM +0800, Maximilian Michels <[email protected]>,
>>>>>> > wrote:
>>>>>> >
>>>>>> > This looks very helpful for debugging performance of portable
>>>>>> > pipelines. Great work!
>>>>>> >
>>>>>> > Enabling local directories for Flink or other portable Runners
>>>>>> > would be useful for debugging, e.g. per
>>>>>> > https://issues.apache.org/jira/browse/BEAM-5440
>>>>>> >
>>>>>> > On 26.10.18 18:08, Robert Bradshaw wrote:
>>>>>> >
>>>>>> > Now that we've (mostly) moved from features to performance for
>>>>>> > BeamPython-on-Flink, I've been doing some profiling of Python code,
>>>>>> > and thought it may be useful for others as well (both those working
>>>>>> > on the SDK, and users who want to understand their own code), so
>>>>>> > I've tried to wrap this up into something useful.
>>>>>> >
>>>>>> > Python already had some existing profile options that we used with
>>>>>> > Dataflow, specifically --profile_cpu and --profile_location. I've
>>>>>> > hooked these up to both the DirectRunner and the SDK Harness Worker.
>>>>>> > One can now run commands like
>>>>>> >
>>>>>> > python -m apache_beam.examples.wordcount \
>>>>>> >   --output=counts.txt --profile_cpu \
>>>>>> >   --profile_location=path/to/directory
>>>>>> >
>>>>>> > and get nice graphs like the one attached. (Here the bulk of the
>>>>>> > time is spent reading from the default input in GCS. Another hint
>>>>>> > for reading the graph is that due to fusion the call graph is
>>>>>> > cyclic, passing through operations:86:receive for every output.)
>>>>>> >
>>>>>> > The raw Python profile stats [1] are produced in that directory,
>>>>>> > along with a dot graph and an SVG if both dot and gprof2dot [2] are
>>>>>> > installed. There is also an important option,
>>>>>> > --direct_runner_bundle_repeat, which can be set to gain more
>>>>>> > accurate profiles on smaller data sets by replaying the bundle
>>>>>> > without the (non-trivial) one-time setup costs.
>>>>>> >
>>>>>> > These flags also work on portability runners such as Flink, where
>>>>>> > the directory must be set to a distributed filesystem. Each bundle
>>>>>> > produces its own profile in that directory, and they can be
>>>>>> > concatenated and manually fed into tools like those listed below.
>>>>>> > In that case there is a --profile_sample_rate option which can be
>>>>>> > set to avoid profiling every single bundle (e.g. for a production
>>>>>> > job).
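As a rough illustration of combining per-bundle profiles, the sketch below merges several profile files with the standard library's pstats module. It assumes each file is a standard cProfile/pstats dump; the file names and helper functions are hypothetical and only there to make the example self-contained. It generates its own small profiles rather than reading real Beam output.

```python
import cProfile
import os
import pstats
import tempfile


def merge_profiles(paths):
    """Merge several pstats dump files into one pstats.Stats object."""
    paths = list(paths)
    if not paths:
        raise ValueError("no profile files given")
    merged = pstats.Stats(paths[0])
    for path in paths[1:]:
        # Stats.add accepts file names and accumulates their call counts and timings.
        merged.add(path)
    return merged


def _demo_bundle(n):
    # Stand-in for the work done while processing one bundle.
    return sum(i * i for i in range(n))


def write_demo_profiles(directory, count=3):
    """Write `count` small profile files, mimicking one profile per bundle."""
    paths = []
    for index in range(count):
        profiler = cProfile.Profile()
        profiler.runcall(_demo_bundle, 1000)
        path = os.path.join(directory, f"bundle-{index}.prof")  # hypothetical name
        profiler.dump_stats(path)
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        merged = merge_profiles(write_demo_profiles(tmp))
        merged.sort_stats("cumulative").print_stats(5)
```

The merged object can also be written back out with merged.dump_stats(...) so that a single file can be handed to a rendering tool. If the profile directory also holds rendered graphs, only the raw stats files should be passed in.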
>>>>>> >
>>>>>> > The PR is up at https://github.com/apache/beam/pull/6847
>>>>>> > Hope it's useful.
>>>>>> >
>>>>>> > - Robert
>>>>>> >
>>>>>> >
>>>>>> > [1] https://docs.python.org/2/library/profile.html
>>>>>> > [2] https://github.com/jrfonseca/gprof2dot
>>>>>> >
>>>>>>
>>>>>
>>
