Hi again Yury. Thanks for the help and the references. I will have a look.
My spout is quite simple. Here is the code:

    public void nextTuple() {
        nodeIds[0] = random.nextInt(urlsNum);
        nodeIds[1] = random.nextInt(urlsNum);
        sent++;
        // Third argument of emit is a messageId for the tuple.
        collector.emit("FirstNodeStream", new Values(nodeIds[0], nodeIds[1]), String.valueOf(sent));
        Utils.sleep(1);
    }

*nodeIds* is an int[2], *urlsNum* is just an int denoting the maximum possible id, and *sent* is an int counter.
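For completeness, here is a rough sketch of the bolt side of this pattern. It is not my actual bolt: the class, stream and field names (DummyWalkBolt, "WalkStream", "nodeId", forwardProbability) and the constants are made up for illustration. The idea is the one discussed below in the thread: ack every input tuple at the end of execute(), and with a certain probability emit a new tuple back to another task of the same bolt. In this sketch the new tuple is anchored to its input, which is one possible design choice; with anchoring, a whole walk forms a single tuple tree that only completes once the walk stops.

    import java.util.Map;
    import java.util.Random;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class DummyWalkBolt extends BaseRichBolt {
        private OutputCollector collector;
        private Random random;
        private final double forwardProbability = 0.85; // illustrative continue-the-walk probability
        private final int urlsNum = 1000;                // illustrative, same meaning as in the spout

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            this.random = new Random();
        }

        @Override
        public void execute(Tuple input) {
            // With some probability, continue the walk by emitting a new tuple
            // back to another task of this same bolt (wired via a self-loop grouping).
            if (random.nextDouble() < forwardProbability) {
                // Anchoring to 'input' makes the new tuple part of the same tuple tree.
                collector.emit("WalkStream", input, new Values(random.nextInt(urlsNum)));
            }
            // Ack every input tuple at the end of execute, as described in the thread.
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declareStream("WalkStream", new Fields("nodeId"));
        }
    }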
On Mon, Jan 18, 2016 at 11:54 PM, Yury Ruchin <yuri.ruc...@gmail.com> wrote:

> Yes, I suggest you try to spot the problem by looking at the heap dump of a
> worker that throws the exception. That way you could at least be certain
> about what consumes the workers' memory.
>
> Oracle HotSpot has a number of options controlling GC logging; setting
> them for the worker JVMs may help in troubleshooting. Plumbr's Handbook
> seems to be a decent read on the matter: https://plumbr.eu/handbook.
>
> Since you are using a custom spout, could you provide its code, at least
> the part that emits tuples?
>
> 2016-01-15 23:57 GMT+03:00 Nikolaos Pavlakis <nikolaspavla...@gmail.com>:
>
>> Hi Yury.
>>
>> 1. I am using Storm 0.9.5.
>> 2. It is a BaseRichSpout. Yes, it has acking enabled and I ack each tuple
>> at the end of the "execute" method of the bolt. I see tuples being acked
>> in Storm UI.
>> 3. Yes, I observe memory usage increasing (which eventually leads to the
>> topology hanging) even in my dummy setup, which does not save anything in
>> memory and merely reproduces the message-passing of my algorithm. I do
>> not get OOM errors when I execute the topology on the cluster, but I get
>> the most common exception in Storm: *java.lang.RuntimeException:
>> java.lang.NullPointerException at
>> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128)*
>> and some tasks die and the Storm UI statistics get lost/restarted. I have
>> never profiled a topology that is being executed on the cluster, so I am
>> not very certain if this is what you mean. If I understand correctly, you
>> are suggesting to take a heap dump with VisualVM on some node while the
>> topology is running and analyze this heap dump.
>> 4. I haven't seen any GC logs (not sure how to collect GC logs from the
>> cluster).
>>
>> Thanks again for your help.
>>
>> On Fri, Jan 15, 2016 at 9:57 PM, Yury Ruchin <yuri.ruc...@gmail.com>
>> wrote:
>>
>>> Hi Nick,
>>>
>>> Some questions:
>>>
>>> 1. Well, what version of Storm are you using? :)
>>>
>>> 2. What spout are you using? Is this spout reliable, i.e. does it use
>>> message ids so that messages get acked/failed by downstream bolts? Do
>>> you have ackers enabled for your topology? If the spout is unreliable or
>>> there are no ackers, then topology.max.spout.pending has no effect, and
>>> if your bolts don't keep up with your spout, you will likely end up with
>>> the overflow buffer growing larger and larger.
>>>
>>> 3. Not sure if I get it right: after you stopped saving anything in
>>> memory, do you still see memory usage increasing? Have you observed
>>> OutOfMemoryErrors? If yes, you might want to launch your worker
>>> processes with -XX:+HeapDumpOnOutOfMemoryError. If not, you can take an
>>> on-demand heap dump using e.g. VisualVM and feed it to a memory analyzer
>>> such as MAT, then take a look at what eats up the heap.
>>>
>>> 4. Why do you think it's a memory issue? Have you looked at the GC
>>> graphs shown by e.g. VisualVM? Did you collect any GC logs to see how
>>> long collections took?
>>>
>>> Regards,
>>> Yury
>>>
>>> 2016-01-15 20:15 GMT+03:00 Nikolaos Pavlakis <nikolaspavla...@gmail.com>:
>>>
>>>> Thanks for all the replies so far. I am profiling the topology in local
>>>> mode with VisualVM and I do not see this problem. I still run into this
>>>> problem when the topology is deployed on the cluster, even with
>>>> max.spout.pending = 1.
>>>>
>>>> On Wed, Jan 13, 2016 at 10:38 PM, John Yost <hokiege...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1 for Andrew, I definitely agree: profiling with jvisualvm or similar
>>>>> is something to do if you have not done it already.
>>>>>
>>>>> On Wed, Jan 13, 2016 at 3:30 PM, Andrew Xor
>>>>> <andreas.gramme...@gmail.com> wrote:
>>>>>
>>>>>> Hey,
>>>>>>
>>>>>> Care to give the Storm/JVM versions? Does this happen only on cluster
>>>>>> execution, or also when running the topology in local mode?
>>>>>> Unfortunately, probably the best way to find out what's really going
>>>>>> on is to profile your topology... if you can run the topology locally,
>>>>>> this will make things quite a bit easier, as profiling Storm topologies
>>>>>> on a live cluster can be quite time consuming.
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>> On Wed, Jan 13, 2016 at 10:06 PM, Nikolaos Pavlakis
>>>>>> <nikolaspavla...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I am implementing a distributed algorithm for PageRank estimation
>>>>>>> using Storm. I have been having memory problems, so I decided to
>>>>>>> create a dummy implementation that does not explicitly save anything
>>>>>>> in memory, to determine whether the problem lies in my algorithm or
>>>>>>> in my Storm structure.
>>>>>>>
>>>>>>> Indeed, while the only thing the dummy implementation does is
>>>>>>> message-passing (a lot of it), the memory of each worker process
>>>>>>> keeps rising until the pipeline is clogged. I do not understand why
>>>>>>> this might be happening.
>>>>>>>
>>>>>>> My cluster has 18 machines (some with 8g, some with 16g and some
>>>>>>> with 32g of memory). I have set the worker heap size to 6g (-Xmx6g).
>>>>>>>
>>>>>>> My topology is very simple:
>>>>>>> One spout.
>>>>>>> One bolt (with parallelism).
>>>>>>>
>>>>>>> The bolt receives data from the spout (fieldsGrouping) and also from
>>>>>>> other tasks of itself.
>>>>>>>
>>>>>>> My message-passing pattern is based on random walks with a certain
>>>>>>> stopping probability. More specifically:
>>>>>>> The spout generates a tuple.
>>>>>>> One specific task of the bolt receives this tuple.
>>>>>>> Based on a certain probability, this task generates another tuple
>>>>>>> and emits it again to another task of the same bolt.
>>>>>>>
>>>>>>> I have been stuck on this problem for quite a while, so it would be
>>>>>>> very helpful if someone could help.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Nick
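For reference, here is a rough sketch of how the topology described at the bottom of the thread might be wired up, together with the troubleshooting settings discussed above: a cap on topology.max.spout.pending (which only has an effect with a reliable spout and ackers, as Yury notes), plus GC logging and -XX:+HeapDumpOnOutOfMemoryError for the worker JVMs via topology.worker.childopts (worker.childopts in storm.yaml works as well). All of the names here (RandomPairSpout standing in for my spout, DummyWalkBolt from the sketch above, the "sourceId"/"nodeId" fields, the topology name), the parallelism numbers and the exact JVM flags are illustrative, not my real setup.

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class DummyWalkTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // One spout and one bolt; the bolt reads from the spout via fieldsGrouping
            // and also from other tasks of itself (the random-walk self-loop).
            builder.setSpout("pairSpout", new RandomPairSpout(), 1);
            builder.setBolt("walkBolt", new DummyWalkBolt(), 18)
                   .fieldsGrouping("pairSpout", "FirstNodeStream", new Fields("sourceId"))
                   .fieldsGrouping("walkBolt", "WalkStream", new Fields("nodeId"));

            Config conf = new Config();
            conf.setNumWorkers(18);
            // Only effective because the spout passes a message id and ackers are enabled.
            conf.setMaxSpoutPending(1000);
            // Worker JVM options: 6g heap, GC logging to the worker log, heap dump on OOM.
            conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
                    "-Xmx6g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps "
                    + "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp");

            StormSubmitter.submitTopology("pagerank-dummy", conf, builder.createTopology());
        }
    }

With the spout passing a message id in emit (as in nextTuple above) and ackers enabled, the max-spout-pending setting should bound how many tuple trees are in flight at once, which in turn should keep the disruptor/overflow buffers from growing without limit.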