On Fri, Nov 16, 2018 at 3:36 PM Udi Meiri <[email protected]> wrote:

> If you're working with Dataflow, it supports this flag:
> https://github.com/apache/beam/blob/75e9f645c7bec940b87b93f416823b020e4c5f69/sdks/python/apache_beam/options/pipeline_options.py#L602
> which uses guppy for heap profiling.
>
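For context, a minimal sketch of how that option could be turned on through the regular pipeline options, assuming the linked line is the --profile_memory flag in ProfilingOptions (the project, bucket, and sample transforms below are placeholders; profiles should end up under --profile_location):

# Sketch only: placeholder project/bucket, other required Dataflow flags
# (e.g. --region) omitted.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',                     # placeholder
    '--temp_location=gs://my-bucket/tmp/',          # placeholder
    '--profile_memory',                             # enable guppy heap profiling
    '--profile_location=gs://my-bucket/profiles/',  # where profiles get written
])

with beam.Pipeline(options=options) as p:
    _ = (p
         | beam.Create(['a', 'b', 'c'])
         | beam.Map(lambda x: x.upper()))
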
This is a really useful flag. Unfortunately, we are using Beam + Flink. It would be really useful to have a similar flag for other streaming engines.

> On Fri, Nov 16, 2018 at 3:08 PM Ruoyun Huang <[email protected]> wrote:
>
>> Even though the algorithm works on your batch system, did you verify
>> anything that can rule out the possibility that it is the underlying ML
>> package causing the memory leak?
>>
>> If not, maybe replace your prediction with a dummy function which does
>> not load any model at all and always gives the same prediction. Then do
>> the same plotting and let us see what it looks like. As a follow-up,
>> version two: still a dummy prediction, but with the model loaded. Given
>> we don't have much of a clue at this stage, this can at least give us
>> more confidence about whether the issue comes from the underlying ML
>> package or from the Beam SDK. Just my 2 cents.
>>
>> On Thu, Nov 15, 2018 at 4:54 PM Rakesh Kumar <[email protected]> wrote:
>>
>>> Thanks for responding, Ruoyun.
>>>
>>> We are not sure yet what is causing the leak, but once we run out of
>>> memory the SDK worker crashes and the pipeline is forced to restart.
>>> Check the memory usage patterns in the attached image; each line in
>>> that graph represents one task manager host. You are right, we are
>>> running the models for predictions.
>>>
>>> Here are a few observations:
>>>
>>> 1. All the task managers' memory usage climbs over time, but some of
>>> the task managers' memory climbs really fast because they are running
>>> the ML models. These models definitely use memory-intensive data
>>> structures (pandas data frames, etc.), hence their memory usage climbs
>>> really fast.
>>> 2. We had almost the same code running on different (non-streaming)
>>> infrastructure and it doesn't cause any memory issue.
>>> 3. Even after the pipeline has restarted, the memory is not released;
>>> it is still hogged by something. You can notice in the attached image
>>> that the pipeline restarted around 13:30. At that time it definitely
>>> released some portion of the memory, but it didn't release all of it.
>>> Notice that when the pipeline was originally started, it started with
>>> 30% of the memory, but when it got restarted by the job manager it
>>> started with 60% of the memory.
>>>
>>> On Thu, Nov 15, 2018 at 3:31 PM Ruoyun Huang <[email protected]> wrote:
>>>
>>>> Trying to understand the situation you are having.
>>>>
>>>> By saying 'kills the application', do you mean a leak in the
>>>> application itself, or are the workers the root cause? Also, are you
>>>> running ML models inside Python SDK DoFns? Then I suppose it is
>>>> running predictions rather than model training?
>>>>
>>>> On Thu, Nov 15, 2018 at 1:08 PM Rakesh Kumar <[email protected]> wrote:
>>>>
>>>>> I am using the *Beam Python SDK* to run my app in production. The
>>>>> app is running machine learning models. I am noticing a memory leak
>>>>> which eventually kills the application, and I am not sure of the
>>>>> source of the leak. Currently, I am using objgraph
>>>>> <https://mg.pov.lt/objgraph/#memory-leak-example> to dump the memory
>>>>> stats, and I hope I will get some useful information out of this. I
>>>>> have also looked into the Guppy library
>>>>> <https://pypi.org/project/guppy/> and they are almost the same.
>>>>>
>>>>> Do you guys have any recommendations for debugging this issue? Do we
>>>>> have any tooling in the SDK that can help debug it?
>>>>>
>>>>> Please feel free to share your experience if you have debugged
>>>>> similar issues in the past.
>>>>>
>>>>> Thank you,
>>>>> Rakesh
>>>>>
>>>>
>>>> --
>>>> ================
>>>> Ruoyun Huang
>>>>
>>
>> --
>> ================
>> Ruoyun Huang
>>
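Following up on Ruoyun's dummy-prediction suggestion above, a minimal sketch of the two variants (the DoFn names and the commented-out model loader are hypothetical, not the actual pipeline code). If the memory curve stays flat with version one but climbs with version two, that points at the ML package rather than the SDK worker.

import apache_beam as beam

class DummyPredictDoFn(beam.DoFn):
    """Version one: no model at all; always emits the same prediction."""
    def process(self, element):
        yield (element, 0.5)  # constant fake score

class ModelLoadedDummyPredictDoFn(beam.DoFn):
    """Version two: the model is loaded but never used for inference."""
    def start_bundle(self):
        # self._model = joblib.load('/path/to/model.pkl')  # hypothetical loader
        self._model = object()  # stand-in that just holds a reference
    def process(self, element):
        yield (element, 0.5)  # same constant fake score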

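For the objgraph route, one way to get those stats out of a running Flink task manager is to log them periodically from inside the DoFn; a rough sketch, with an illustrative sampling interval (guppy's hpy().heap() could be dumped in the same spot):

import logging

import apache_beam as beam
import objgraph

class LeakDebugDoFn(beam.DoFn):
    """Wraps the real processing and periodically dumps objgraph stats."""

    def start_bundle(self):
        self._processed = 0

    def process(self, element):
        self._processed += 1
        if self._processed % 10000 == 0:
            # Prints the object types whose instance counts grew since the
            # previous call; output goes to stdout, so check how your runner
            # surfaces worker stdout.
            objgraph.show_growth(limit=10)
            logging.info('dumped objgraph growth after %d elements',
                         self._processed)
        yield element  # the real prediction would happen here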