Re: Python profiling

Ankur Goenka Mon, 05 Nov 2018 19:16:42 -0800

By default, all containers are destroyed on termination. To analyze
profiling data for portable runners, either disable container cleanup
(using --retainDockerContainers=true) or point --profile_location at a
remote distributed file system path.


On Mon, Nov 5, 2018 at 1:05 AM Robert Bradshaw <[email protected]> wrote:

> Any portable runner should pick it up automatically.
> On Tue, Oct 30, 2018 at 3:32 AM Manu Zhang <[email protected]>
> wrote:
> >
> > Cool! Can we document it somewhere so that other Runners can pick
> > it up later?
> >
> > Thanks,
> > Manu Zhang
> > On Oct 29, 2018, 5:54 PM +0800, Maximilian Michels <[email protected]>,
> wrote:
> >
> > This looks very helpful for debugging performance of portable pipelines.
> > Great work!
> >
> > Enabling local directories for Flink or other portable Runners would be
> > useful for debugging, e.g. per
> > https://issues.apache.org/jira/browse/BEAM-5440
> >
> > On 26.10.18 18:08, Robert Bradshaw wrote:
> >
> > Now that we've (mostly) moved from features to performance for
> > BeamPython-on-Flink, I've been doing some profiling of Python code,
> > and thought it may be useful for others as well (both those working on
> > the SDK, and users who want to understand their own code), so I've
> > tried to wrap this up into something useful.
> >
> > Python already had some existing profile options that we used with
> > Dataflow, specifically --profile_cpu and --profile_location. I've
> > hooked these up to both the DirectRunner and the SDK Harness Worker.
> > One can now run commands like
> >
> > python -m apache_beam.examples.wordcount
> > --output=counts.txt --profile_cpu --profile_location=path/to/directory
> >
> > and get nice graphs like the one attached. (Here the bulk of the time
> > is spent reading from the default input in GCS. Another hint for
> > reading the graph: due to fusion, the call graph is cyclic,
> > passing through operations:86:receive for every output.)
> >
> > The raw python profile stats [1] are produced in that directory, along
> > with a dot graph and an svg if both dot and gprof2dot are installed.
> > There is also an important option --direct_runner_bundle_repeat which
> > can be set to gain more accurate profiles on smaller data sets by
> > re-playing the bundle without the (non-trivial) one-time setup costs.
> >
> > These flags also work on portability runners such as Flink, where the
> > directory must be on a distributed filesystem. Each bundle
> > produces its own profile in that directory, and these can be
> > concatenated and manually fed into tools like gprof2dot [2]. In that
> > case there is a --profile_sample_rate which can be set to avoid
> > profiling every single bundle (e.g. for a production job).
> >
> > The PR is up at https://github.com/apache/beam/pull/6847
> > Hope it's useful.
> >
> > - Robert
> >
> >
> > [1] https://docs.python.org/2/library/profile.html
> > [2] https://github.com/jrfonseca/gprof2dot
> >
>
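
For anyone who wants to combine the per-bundle profiles mentioned above,
here is a rough sketch using only the standard library's cProfile and
pstats modules. It is an illustration under stated assumptions, not
something taken from the PR: the "*.prof" file pattern, the helper names,
and the "merged.pstats" output name are all made up for the example, and
the stand-in profiles are generated locally rather than by Beam. Check
what your runner actually writes to --profile_location before reusing it.

```python
import cProfile
import glob
import os
import pstats
import tempfile


def busy_work(n):
    """Stand-in for the user code that would run inside a bundle."""
    return sum(i * i for i in range(n))


def write_fake_bundle_profiles(directory, count=3):
    """Write one raw profile file per simulated bundle.

    Hypothetical helper: a real job writes these files itself, and the
    file names used here are an assumption for the example only.
    """
    for i in range(count):
        profiler = cProfile.Profile()
        profiler.enable()
        busy_work(10000)
        profiler.disable()
        profiler.dump_stats(os.path.join(directory, "bundle-%d.prof" % i))


def merge_profiles(directory, pattern="*.prof"):
    """Accumulate every matching profile file into one pstats.Stats."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    if not paths:
        raise ValueError("no profile files found in %s" % directory)
    stats = pstats.Stats(paths[0])
    for path in paths[1:]:
        stats.add(path)  # sums call counts and timings per function
    return stats


profile_dir = tempfile.mkdtemp()
write_fake_bundle_profiles(profile_dir)

merged = merge_profiles(profile_dir)
# Show the five entries with the highest cumulative time.
merged.sort_stats("cumulative").print_stats(5)

# Save the combined stats so they can be handed to a graphing tool.
merged_path = os.path.join(profile_dir, "merged.pstats")
merged.dump_stats(merged_path)
```

If gprof2dot and dot are installed, the merged file can then be rendered
with something like "gprof2dot -f pstats merged.pstats | dot -Tsvg -o
merged.svg".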
