If you're working with Dataflow, it supports this flag:
https://github.com/apache/beam/blob/75e9f645c7bec940b87b93f416823b020e4c5f69/sdks/python/apache_beam/options/pipeline_options.py#L602
which uses guppy for heap profiling.
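For reference, a rough sketch of enabling it via pipeline options, assuming
the ProfilingOptions flags (--profile_memory, --profile_location) defined in
that file; the GCS path below is just a placeholder:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # --profile_memory turns on the guppy-based heap profiler;
    # --profile_location is where the profile dumps get written (placeholder).
    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--profile_memory',
        '--profile_location=gs://my-bucket/profiles/',
    ])

    with beam.Pipeline(options=options) as p:
        _ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)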

On Fri, Nov 16, 2018 at 3:08 PM Ruoyun Huang <ruo...@google.com> wrote:

> Even though the algorithm works on your batch system, have you verified
> anything that would rule out the possibility that the underlying ML
> package is causing the memory leak?
>
> If not, maybe replace your prediction with a dummy function that does not
> load any model at all and always returns the same prediction, then do the
> same plotting and let us see what it looks like. As a plus, add a version
> two: still a dummy prediction, but with the model loaded.  Given we don't
> have much of a clue at this stage, this should at least give us more
> confidence about whether the issue comes from the underlying ML package or
> from the Beam SDK. Just my 2 cents.
>
>
> On Thu, Nov 15, 2018 at 4:54 PM Rakesh Kumar <rakeshku...@lyft.com> wrote:
>
>> Thanks for responding, Ruoyun.
>>
>> We are not sure yet what is causing the leak, but once we run out of
>> memory the SDK worker crashes and the pipeline is forced to restart. Check
>> the memory usage patterns in the attached image; each line in that graph
>> represents one task manager host.
>> You are right, we are running the models for predictions.
>>
>> Here are a few observations:
>>
>> 1. All the task managers' memory usage climbs over time, but some task
>> managers' memory climbs really fast because they are running the ML
>> models. These models definitely use memory-intensive data structures
>> (pandas data frames, etc.), hence their memory usage climbs really fast.
>> 2. We had almost the same code running on different (non-streaming)
>> infrastructure, and it doesn't cause any memory issue.
>> 3. Even after the pipeline restarts, the memory is not released; it is
>> still held by something. You can see in the attached image that the
>> pipeline restarted around 13:30. At that point some portion of the memory
>> was released, but not all of it. Notice that when the pipeline originally
>> started it used about 30% of the memory, but when it was restarted by the
>> job manager it started at about 60%.
>>
>>
>>
>> On Thu, Nov 15, 2018 at 3:31 PM Ruoyun Huang <ruo...@google.com> wrote:
>>
>>> Trying to understand the situation you are having.
>>>
>>> By saying it 'kills the application', do you mean a leak in the
>>> application itself, or are the workers the root cause?  Also, are you
>>> running ML models inside Python SDK DoFns?  Then I suppose it is running
>>> predictions rather than model training?
>>>
>>> On Thu, Nov 15, 2018 at 1:08 PM Rakesh Kumar <rakeshku...@lyft.com>
>>> wrote:
>>>
>>>> I am using the *Beam Python SDK* to run my app in production. The app
>>>> runs machine learning models. I am noticing a memory leak which
>>>> eventually kills the application, and I am not sure of its source.
>>>> Currently, I am using objgraph
>>>> <https://mg.pov.lt/objgraph/#memory-leak-example> to dump the memory
>>>> stats; I hope I will get some useful information out of it. I have also
>>>> looked into the Guppy library <https://pypi.org/project/guppy/>, and the
>>>> two are almost the same.
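>>>> The kind of check I mean is roughly this (PredictFn and the bundle
>>>> counter are just illustrative, not our actual code):
>>>>
>>>>     import apache_beam as beam
>>>>     import objgraph
>>>>
>>>>     class PredictFn(beam.DoFn):  # stand-in for the prediction DoFn
>>>>         def start_bundle(self):
>>>>             self._bundles = getattr(self, '_bundles', 0) + 1
>>>>             if self._bundles % 100 == 0:
>>>>                 # print object types whose counts grew since last call
>>>>                 objgraph.show_growth(limit=20)
>>>>
>>>>         def process(self, element):
>>>>             yield element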
>>>>
>>>> Do you have any recommendations for debugging this issue? Do we have
>>>> any tooling in the SDK that can help debug it?
>>>> Please feel free to share your experience if you have debugged similar
>>>> issues in the past.
>>>>
>>>> Thank you,
>>>> Rakesh
>>>>
>>>
>>>
>>> --
>>> ================
>>> Ruoyun  Huang
>>>
>>>
>
> --
> ================
> Ruoyun  Huang
>
>
