Hi Yury,

1. I am using Storm 0.9.5.

2. It is a BaseRichSpout. Yes, it has acking enabled, and I ack each tuple
at the end of the "execute" method of the bolt. I see tuples being acked in
the Storm UI.

3. Yes, I observe memory usage increasing (which eventually leads to the
topology hanging) even in my dummy setup, which does not save anything in
memory; it merely reproduces the message-passing of my algorithm. I do not
get OOM errors when I execute the topology on the cluster, but I do get the
most common exception in Storm:

    java.lang.RuntimeException: java.lang.NullPointerException
        at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128)

and then some tasks die and the Storm UI statistics get lost/restarted. I
have never profiled a topology that is being executed on a cluster, so I am
not certain this is what you mean. If I understand correctly, you are
suggesting I take a heap dump with VisualVM on some node while the topology
is running and analyze that heap dump.

4. I haven't seen any GC logs (I am not sure how to collect GC logs from
the cluster).
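One thing I plan to try for point 4: passing HotSpot GC-logging and
heap-dump flags to the worker JVMs through topology.worker.childopts. A
minimal sketch (assuming Storm 0.9.x; the log and dump paths are
placeholders and must be writable on every supervisor node), to be merged
into the code that builds the topology Config:

    import backtype.storm.Config;

    // Sketch: enable GC logging and an automatic heap dump on OOM for the
    // worker JVMs of this topology. Paths are placeholders.
    Config conf = new Config();
    conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
            "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps "
                    + "-Xloggc:/var/log/storm/worker-gc.log "
                    + "-XX:+HeapDumpOnOutOfMemoryError "
                    + "-XX:HeapDumpPath=/var/log/storm");

For an on-demand dump, I would run
jmap -dump:live,format=b,file=/tmp/worker.hprof <worker-pid>
against the worker process on one of the nodes and open the file in MAT or
VisualVM, which I understand is what you are suggesting.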
Thanks again for your help.

On Fri, Jan 15, 2016 at 9:57 PM, Yury Ruchin <yuri.ruc...@gmail.com> wrote:

> Hi Nick,
>
> Some questions:
>
> 1. Well, what version of Storm are you using? :)
>
> 2. What spout are you using? Is this spout reliable, i.e. does it use
> message ids so that messages get acked/failed by downstream bolts? Do you
> have ackers enabled for your topology? If the spout is unreliable or there
> are no ackers, then topology.max.spout.pending has no effect, and if your
> bolts don't keep up with your spout, you will likely end up with the
> overflow buffer growing larger and larger.
>
> 3. Not sure if I get it right: after you stopped saving anything in
> memory, do you still experience memory usage increasing? Have you observed
> OutOfMemoryErrors? If yes, you might want to launch your worker processes
> with -XX:+HeapDumpOnOutOfMemoryError. If not, you can take an on-demand
> heap dump using e.g. VisualVM and feed it to a memory analyzer such as
> MAT, then take a look at what eats up the heap.
>
> 4. What makes you think it's a memory issue? Have you looked at the GC
> graphs shown by e.g. VisualVM? Did you collect any GC logs to see how long
> collections took?
>
> Regards
> Yury
>
> 2016-01-15 20:15 GMT+03:00 Nikolaos Pavlakis <nikolaspavla...@gmail.com>:
>
>> Thanks for all the replies so far. I am profiling the topology in local
>> mode with VisualVM and I do not see this problem there. I still run into
>> this problem when the topology is deployed on the cluster, even with
>> max.spout.pending = 1.
>>
>> On Wed, Jan 13, 2016 at 10:38 PM, John Yost <hokiege...@gmail.com> wrote:
>>
>>> +1 for Andrew, I definitely agree that profiling with jvisualvm (or a
>>> similar tool) is something to do if you have not done so already.
>>>
>>> On Wed, Jan 13, 2016 at 3:30 PM, Andrew Xor
>>> <andreas.gramme...@gmail.com> wrote:
>>>
>>>> Hey,
>>>>
>>>> Care to give the versions of Storm and the JVM? Does this happen only
>>>> on cluster execution, or also when running the topology in local mode?
>>>> Unfortunately, probably the best way to find out what is really going
>>>> on is to profile your topology; if you can run the topology locally,
>>>> this will make things quite a bit easier, as profiling Storm topologies
>>>> on a live cluster can be quite time consuming.
>>>>
>>>> Regards.
>>>>
>>>> On Wed, Jan 13, 2016 at 10:06 PM, Nikolaos Pavlakis
>>>> <nikolaspavla...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I am implementing a distributed algorithm for PageRank estimation
>>>>> using Storm. I have been having memory problems, so I decided to
>>>>> create a dummy implementation that does not explicitly save anything
>>>>> in memory, to determine whether the problem lies in my algorithm or in
>>>>> my Storm structure.
>>>>>
>>>>> Indeed, while the only thing the dummy implementation does is
>>>>> message-passing (a lot of it), the memory of each worker process keeps
>>>>> rising until the pipeline is clogged. I do not understand why this
>>>>> might be happening.
>>>>>
>>>>> My cluster has 18 machines (some with 8g, some with 16g and some with
>>>>> 32g of memory). I have set the worker heap size to 6g (-Xmx6g).
>>>>>
>>>>> My topology is very simple:
>>>>> One spout.
>>>>> One bolt (with parallelism).
>>>>>
>>>>> The bolt receives data from the spout (fieldsGrouping) and also from
>>>>> other tasks of itself.
>>>>>
>>>>> My message-passing pattern is based on random walks with a certain
>>>>> stopping probability. More specifically:
>>>>> The spout generates a tuple.
>>>>> One specific task of the bolt receives this tuple.
>>>>> Based on a certain probability, this task generates another tuple and
>>>>> emits it to another task of the same bolt.
>>>>>
>>>>> I have been stuck on this problem for quite a while, so it would be
>>>>> very helpful if someone could help.
>>>>>
>>>>> Best Regards,
>>>>> Nick
>>>>
>>>>
>>>
>>
>
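P.S. Since the thread is getting long, here is roughly the shape of my
dummy spout and bolt, reduced to a minimal sketch. The class and field
names, graph size, and continue probability are placeholders; the real code
picks actual graph neighbors instead of random nodes:

    import java.util.Map;
    import java.util.Random;
    import java.util.UUID;

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // Reliable spout: every emit carries a message id, so the ackers track
    // the tuple tree and topology.max.spout.pending can throttle the spout.
    public class WalkSpout extends BaseRichSpout {
        private static final long NUM_NODES = 1000000L; // placeholder graph size
        private SpoutOutputCollector collector;
        private Random random;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            this.random = new Random();
        }

        @Override
        public void nextTuple() {
            long start = (random.nextLong() & Long.MAX_VALUE) % NUM_NODES; // random start node
            collector.emit(new Values(start), UUID.randomUUID().toString()); // message id => reliable
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("node"));
        }
    }

    // Bolt: continues the walk with a fixed probability by emitting to
    // another task of the same bolt; otherwise the walk stops. Every emit is
    // anchored to the input tuple, and the input is acked at the end of
    // execute().
    public class WalkBolt extends BaseRichBolt {
        private static final double CONTINUE_PROBABILITY = 0.85; // placeholder stopping parameter
        private static final long NUM_NODES = 1000000L;
        private OutputCollector collector;
        private Random random;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            this.random = new Random();
        }

        @Override
        public void execute(Tuple input) {
            if (random.nextDouble() < CONTINUE_PROBABILITY) {
                long next = (random.nextLong() & Long.MAX_VALUE) % NUM_NODES; // stand-in for a neighbor lookup
                collector.emit(input, new Values(next)); // anchored: stays in the same tuple tree
            }
            collector.ack(input); // ack each tuple at the end of execute, as described above
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("node"));
        }
    }

The message id on the spout emit plus the anchoring of the bolt emit tie
every hop of a walk into a single tuple tree, which, as Yury points out, is
what makes topology.max.spout.pending effective at bounding the number of
in-flight walks.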
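The wiring matches the description above: one spout, one bolt with
parallelism, and the bolt consuming both the spout's stream and its own
output via fieldsGrouping (component names and parallelism numbers are
again placeholders):

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class WalkTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new WalkSpout(), 1);
            builder.setBolt("walker", new WalkBolt(), 36)
                   .fieldsGrouping("spout", new Fields("node"))   // tuples from the spout
                   .fieldsGrouping("walker", new Fields("node")); // tuples from other tasks of this bolt

            Config conf = new Config();
            conf.setNumWorkers(18);
            conf.setMaxSpoutPending(1); // only effective because the spout emits with message ids

            StormSubmitter.submitTopology("dummy-pagerank", conf, builder.createTopology());
        }
    }

The self-referencing fieldsGrouping("walker", ...) is the part that routes
tuples from one task of the bolt to another task of the same bolt.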