Even though the algorithm works on your batch system, have you verified anything that would rule out the underlying ML package as the cause of the memory leak?
If not, maybe replace your prediction with a dummy function that does not load any model at all and always returns the same prediction, then do the same plotting and let us see what it looks like. And as a second version: still a dummy prediction, but with the model loaded. Given we don't have much of a clue at this stage, this should at least give us more confidence about whether the issue comes from the underlying ML package or from the Beam SDK. Just my 2 cents.
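Roughly what I have in mind, as a rough untested sketch (DummyPredictDoFn, DummyPredictWithModelDoFn and the pickle-based loading are placeholders, not anything from your actual pipeline):

import pickle

import apache_beam as beam


class DummyPredictDoFn(beam.DoFn):
    """Version 1: no model at all, always emits the same prediction."""

    def process(self, element):
        yield (element, 0.0)  # constant dummy prediction


class DummyPredictWithModelDoFn(beam.DoFn):
    """Version 2: loads the model but still emits a constant prediction."""

    def __init__(self, model_path):
        self._model_path = model_path
        self._model = None

    def start_bundle(self):
        if self._model is None:
            # Placeholder: load the model however you do it today.
            with open(self._model_path, 'rb') as f:
                self._model = pickle.load(f)

    def process(self, element):
        yield (element, 0.0)  # model is loaded but never used

Swap either of these in for your real prediction DoFn and do the same plotting; comparing the two curves with the original should tell us whether the growth follows the model or the SDK worker itself.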
On Thu, Nov 15, 2018 at 4:54 PM Rakesh Kumar <[email protected]> wrote:

> Thanks for responding, Ruoyun.
>
> We are not sure yet what is causing the leak, but once we run out of
> memory the SDK worker crashes and the pipeline is forced to restart. Check
> the memory usage patterns in the attached image; each line in that graph
> represents one task manager host.
> You are right, we are running the models for prediction.
>
> Here are a few observations:
>
> 1. All the task managers' memory usage climbs over time, but some of them
> climb really fast because they are running the ML models. These models use
> memory-intensive data structures (pandas data frames etc.), hence their
> memory usage climbs really fast.
> 2. We have almost the same code running on different (non-streaming)
> infrastructure and it doesn't show any memory issue.
> 3. Even after the pipeline restarts, the memory is not released; it is
> still hogged by something. You can see in the attached image that the
> pipeline restarted around 13:30. At that point some portion of the memory
> was released, but not all of it. Notice that when the pipeline was
> originally started it used about 30% of the memory, but when it was
> restarted by the job manager it started at about 60%.
>
>
> On Thu, Nov 15, 2018 at 3:31 PM Ruoyun Huang <[email protected]> wrote:
>
>> Trying to understand the situation you are having.
>>
>> By saying it 'kills the application', is the leak in the application
>> itself, or are the workers the root cause? Also, are you running ML models
>> inside Python SDK DoFns? Then I suppose they are doing predictions rather
>> than model training?
>>
>> On Thu, Nov 15, 2018 at 1:08 PM Rakesh Kumar <[email protected]>
>> wrote:
>>
>>> I am using the *Beam Python SDK* to run my app in production. The app
>>> runs machine learning models, and I am noticing a memory leak which
>>> eventually kills the application. I am not sure of the source of the leak.
>>> Currently I am using objgraph
>>> <https://mg.pov.lt/objgraph/#memory-leak-example> to dump memory stats,
>>> and I hope to get some useful information out of it. I have also looked
>>> into the Guppy library <https://pypi.org/project/guppy/>, and they are
>>> much the same.
>>>
>>> Do you have any recommendation for debugging this issue? Is there any
>>> tooling in the SDK that can help debug it? Please feel free to share your
>>> experience if you have debugged similar issues in the past.
>>>
>>> Thank you,
>>> Rakesh
>>>
>>
>>
>> --
>> ================
>> Ruoyun Huang
>>

--
================
Ruoyun Huang
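A minimal sketch of the kind of periodic objgraph dump mentioned earlier in the thread, assuming the stats are taken from inside a DoFn on the running workers (PredictWithMemStatsDoFn and dump_every are made-up names for illustration):

import apache_beam as beam
import objgraph


class PredictWithMemStatsDoFn(beam.DoFn):
    """Wraps the real prediction logic and periodically dumps objgraph stats."""

    def __init__(self, dump_every=100):
        self._dump_every = dump_every
        self._bundles = 0

    def process(self, element):
        # Real prediction logic would go here; pass-through in this sketch.
        yield element

    def finish_bundle(self):
        self._bundles += 1
        if self._bundles % self._dump_every == 0:
            # Prints the object types whose counts grew since the previous
            # call; the delta between bundles is usually more telling than
            # a single one-off snapshot.
            objgraph.show_growth(limit=20)

objgraph.show_growth() reports which object types increased since the last call, which makes it easier to see what keeps accumulating between bundles.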
