Thanks everyone for the help. I found the issue, and it turned out to be
something completely unrelated. Someone had set up a JVM option that ran a
monitoring application of their own on top of Storm, so every time anyone
submitted a topology this application was also started across the whole
cluster, leading to OOM. Remote profiling helped me detect the problem and
disable it.
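
For anyone who hits something similar: what to look for is an extra agent in
the workers' JVM options. Purely as an illustration (the agent name and path
here are made up, not the actual setting we found), a line like

    worker.childopts: "-Xmx2g -javaagent:/opt/monitoring/agent.jar"

in storm.yaml on the supervisors means every worker JVM spawned for a
submitted topology also starts that agent, so its memory footprint is paid on
every node.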

On Wed, Jan 20, 2016 at 6:57 PM, Andrew Xor <andreas.gramme...@gmail.com>
wrote:

> Hello again,
>
>  I'd like to chip in, @Nikolas, although you probably won't like what you
> will have to do... I really don't think this is Storm's fault; that would
> be really weird. Also, is the JVM you use for local execution the same as
> the ones on your cluster? The best way to find out what is wrong is to get
> profiling data from the workers that die on the cluster... to do that you
> will need to debug/profile the topology on the cluster -- not a trivial
> task, and it can be time consuming. I'll describe a relatively "simple" way
> of getting the data, which is what I did when I needed to debug a cluster
> topology.
>
> Now, since you don't know which nodes of the cluster will be assigned to
> your topology, you have to hook up to all of them... but thankfully you
> will get a response only from the ones that actually do the work, because
> the JVM listening on your specified port is only spawned on the nodes
> executing your topology. You will have to use the worker childopts setting
> to pass the port that the debug JVM will listen on (details on how to
> attach to JVMs remotely are here
> <http://docs.oracle.com/javase/8/docs/technotes/guides/jpda/conninv.html>
> and for attaching the YourKit agent here
> <https://www.yourkit.com/docs/java/help/attach_agent.jsp>), so make sure
> it's a port that's not already in use -- you can use nmap to scan the nodes
> for available ports. Next, depending on your version of Java, you start the
> JVM and attach it to your profiler -- preferably YourKit, even in trial
> mode, as it makes the process quite a bit easier than other profilers. Then
> review the dump/profiled data to find your issue... and if you find that
> there is an issue with Storm itself, please include as much detail as you
> can! That will make it easier to find and patch the issue, if any.
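>
> For concreteness, a minimal sketch of what passing the options could look
> like from the topology side (the port number, agent path and flags below
> are only examples to adapt, not something from this thread):
>
>     // Sketch: pass remote-debug/profiler agent options to the worker JVMs
>     // through the topology configuration (uses backtype.storm.Config).
>     Config conf = new Config();
>     // plain JDWP remote debugging on port 8999 (pick an unused port):
>     conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
>              "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8999");
>     // or, to load the YourKit agent instead (path is hypothetical):
>     // conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
>     //          "-agentpath:/opt/yourkit/bin/linux-x86-64/libyjpagent.so=port=10001");
>
> One caveat: if more than one worker of the topology lands on the same node,
> a fixed port like this will clash, so keep that in mind when picking ports.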
>
> Let us know if this helped or if you need anything else.
>
>
> On Mon, Jan 18, 2016 at 11:54 PM, Yury Ruchin <yuri.ruc...@gmail.com>
> wrote:
>
>> Yes, I suggest you try to spot the problem by looking at the heap dump of
>> a worker that throws the exception. That way you could at least be certain
>> about what is consuming the workers' memory.
>>
>> Oracle HotSpot has a number of options controlling GC logging; setting
>> them for the worker JVMs may help in troubleshooting. Plumbr's handbook
>> seems to be a decent read on the matter: https://plumbr.eu/handbook.
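>>
>> As an example only (HotSpot 7/8 style flags; the log path is a
>> placeholder), GC logging could be switched on for the workers like this:
>>
>>     // Sketch: enable GC logging in the worker JVMs (uses backtype.storm.Config).
>>     Config conf = new Config();
>>     conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
>>              "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
>>              + " -Xloggc:/tmp/worker-gc.log");
>>     // note: with several workers on one node this single log path will clash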
>>
>> Since you are using a custom spout, could you provide its code, at least
>> the part that emits tuples?
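>>
>> For reference, the part that matters most is whether the emit carries a
>> message id, roughly like this (the field names are invented, just to show
>> the shape):
>>
>>     // reliable emit: the second argument is the message id that lets acks
>>     // and fails be tracked and makes topology.max.spout.pending effective
>>     collector.emit(new Values(walkerId, startVertex), msgId);
>>     // unreliable emit (no message id), which is not tracked:
>>     // collector.emit(new Values(walkerId, startVertex));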
>>
>> 2016-01-15 23:57 GMT+03:00 Nikolaos Pavlakis <nikolaspavla...@gmail.com>:
>>
>>> Hi Yury.
>>>
>>> 1. I am using Storm 0.9.5
>>> 2. It is a BaseRichSpout. Yes, it has acking enabled and I ack each
>>> tuple at the end of the "execute" method of the bolt. I see tuples being
>>> acked in Storm UI.
>>> 3. Yes, I observe memory usage increasing (which eventually leads to the
>>> topology hanging) even in my dummy setup, which does not save anything in
>>> memory and merely reproduces the message-passing of my algorithm. I do not
>>> get OOM errors when I execute the topology on the cluster, but I do get the
>>> most common exception in Storm: java.lang.RuntimeException:
>>> java.lang.NullPointerException at
>>> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128),
>>> and some tasks die and the Storm UI statistics get lost/restarted. I have
>>> never profiled a topology that is being executed on the cluster, so I am
>>> not sure if this is what you mean. If I understand correctly, you are
>>> suggesting that I take a heap dump with VisualVM on some node while the
>>> topology is running and analyze that heap dump.
>>> 4. I haven't seen any GC logs (not sure how to collect GC logs from the
>>> cluster).
>>>
>>> Thanks again for your help.
>>>
>>> On Fri, Jan 15, 2016 at 9:57 PM, Yury Ruchin <yuri.ruc...@gmail.com>
>>> wrote:
>>>
>>>> Hi Nick,
>>>>
>>>> Some questions:
>>>>
>>>> 1. Well, what version of Storm are you using? :)
>>>>
>>>> 2. What spout are you using? Is it reliable, i.e. does it use message
>>>> ids so that messages get acked/failed by downstream bolts? Do you have
>>>> ackers enabled for your topology? If the spout is unreliable or there are
>>>> no ackers, then topology.max.spout.pending has no effect, and if your
>>>> bolts don't keep up with your spout you will likely end up with the
>>>> overflow buffer growing larger and larger.
>>>>
>>>> 3. Not sure if I got it right: after you stopped saving anything in
>>>> memory, do you still see memory usage increasing? Have you observed
>>>> OutOfMemoryErrors? If yes, you might want to launch your worker processes
>>>> with -XX:+HeapDumpOnOutOfMemoryError (a short sketch follows after this
>>>> list). If not, you can take an on-demand heap dump using e.g. VisualVM
>>>> and feed it to a memory analyzer such as MAT, then look at what eats up
>>>> the heap.
>>>>
>>>> 4. What makes you think it's a memory issue? Have you looked at the GC
>>>> graphs shown by e.g. VisualVM? Did you collect any GC logs to see how
>>>> long collections took?
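>>>>
>>>> A minimal sketch of the childopts mentioned in point 3 (heap size and
>>>> dump path are placeholders):
>>>>
>>>>     // Sketch: make worker JVMs dump the heap when they hit an OOM
>>>>     // (uses backtype.storm.Config).
>>>>     Config conf = new Config();
>>>>     conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
>>>>              "-Xmx6g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp");
>>>>     // the resulting .hprof file can then be opened in MAT or VisualVM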
>>>>
>>>> Regards
>>>> Yury
>>>>
>>>> 2016-01-15 20:15 GMT+03:00 Nikolaos Pavlakis <nikolaspavla...@gmail.com
>>>> >:
>>>>
>>>>> Thanks for all the replies so far. I am profiling the topology in
>>>>> local mode with VisualVM and I do not see this problem there. I still
>>>>> run into this problem when the topology is deployed on the cluster, even
>>>>> with max.spout.pending = 1.
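>>>>>
>>>>> (For reference, that is the topology.max.spout.pending setting; in code
>>>>> it can be set with conf.setMaxSpoutPending(1) on the topology Config.)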
>>>>>
>>>>> On Wed, Jan 13, 2016 at 10:38 PM, John Yost <hokiege...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> +1 for Andrew -- I definitely agree that profiling with jvisualvm (or a
>>>>>> similar tool) is something to do if you have not done so already.
>>>>>>
>>>>>> On Wed, Jan 13, 2016 at 3:30 PM, Andrew Xor <
>>>>>> andreas.gramme...@gmail.com> wrote:
>>>>>>
>>>>>>> Hey,
>>>>>>>
>>>>>>>  Care to give the versions of Storm and the JVM? Does this happen only
>>>>>>> on cluster execution, or also when running the topology in local mode?
>>>>>>> Unfortunately, probably the best way to find out what's really going on
>>>>>>> is to profile your topology... if you can run the topology locally,
>>>>>>> this will make things quite a bit easier, as profiling Storm topologies
>>>>>>> on a live cluster can be quite time consuming.
>>>>>>>
>>>>>>> Regards.
>>>>>>>
>>>>>>> On Wed, Jan 13, 2016 at 10:06 PM, Nikolaos Pavlakis <
>>>>>>> nikolaspavla...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I am implementing a distributed algorithm for pagerank estimation
>>>>>>>> using Storm. I have been having memory problems, so I decided to
>>>>>>>> create a dummy implementation that does not explicitly save anything
>>>>>>>> in memory, to determine whether the problem lies in my algorithm or my
>>>>>>>> Storm structure.
>>>>>>>>
>>>>>>>> Indeed, while the only thing the dummy implementation does is
>>>>>>>> message-passing (a lot of it), the memory of each worker process keeps
>>>>>>>> rising until the pipeline is clogged. I do not understand why this
>>>>>>>> might be happening.
>>>>>>>>
>>>>>>>> My cluster has 18 machines (some with 8g, some 16g and some 32g of
>>>>>>>> memory). I have set the worker heap size to 6g (-Xmx6g).
>>>>>>>>
>>>>>>>> My topology is very very simple:
>>>>>>>> One spout
>>>>>>>> One bolt (with parallelism).
>>>>>>>>
>>>>>>>> The bolt receives data from the spout (fieldsGrouping) and also
>>>>>>>> from other tasks of itself.
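>>>>>>>>
>>>>>>>> (Roughly, the wiring looks like this -- the component and field names
>>>>>>>> here are invented for illustration, not my actual code:)
>>>>>>>>
>>>>>>>>     TopologyBuilder builder = new TopologyBuilder();
>>>>>>>>     builder.setSpout("walk-spout", new WalkSpout());          // invented names
>>>>>>>>     builder.setBolt("walk-bolt", new RandomWalkBolt(), 64)    // parallelism hint
>>>>>>>>            .fieldsGrouping("walk-spout", new Fields("vertex"))
>>>>>>>>            .fieldsGrouping("walk-bolt", new Fields("vertex")); // feeds back into itself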
>>>>>>>>
>>>>>>>> My message-passing pattern is based on random walks with a certain
>>>>>>>> stopping probability. More specifically:
>>>>>>>> The spout generates a tuple.
>>>>>>>> One specific task from the bolt receives this tuple.
>>>>>>>> Based on a certain probability, this task generates another tuple
>>>>>>>> and emits it again to another task of the same bolt.
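>>>>>>>>
>>>>>>>> (To make the pattern concrete, here is a rough sketch of what such a
>>>>>>>> bolt could look like -- illustrative only, not my actual code; the
>>>>>>>> names and the stopping probability are made up:)
>>>>>>>>
>>>>>>>>     import java.util.Map;
>>>>>>>>     import java.util.Random;
>>>>>>>>     import backtype.storm.task.OutputCollector;
>>>>>>>>     import backtype.storm.task.TopologyContext;
>>>>>>>>     import backtype.storm.topology.OutputFieldsDeclarer;
>>>>>>>>     import backtype.storm.topology.base.BaseRichBolt;
>>>>>>>>     import backtype.storm.tuple.Fields;
>>>>>>>>     import backtype.storm.tuple.Tuple;
>>>>>>>>     import backtype.storm.tuple.Values;
>>>>>>>>
>>>>>>>>     // With probability 1 - STOP_PROB the walk continues, i.e. a new
>>>>>>>>     // tuple is emitted back to the same bolt; the incoming tuple is
>>>>>>>>     // anchored and always acked.
>>>>>>>>     public class RandomWalkBolt extends BaseRichBolt {
>>>>>>>>         private static final double STOP_PROB = 0.15; // invented value
>>>>>>>>         private OutputCollector collector;
>>>>>>>>         private Random rand;
>>>>>>>>
>>>>>>>>         public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) {
>>>>>>>>             this.collector = collector;
>>>>>>>>             this.rand = new Random();
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         public void execute(Tuple input) {
>>>>>>>>             long walker = input.getLongByField("walker");
>>>>>>>>             long vertex = input.getLongByField("vertex");
>>>>>>>>             if (rand.nextDouble() > STOP_PROB) {
>>>>>>>>                 // walk continues: the fields grouping routes this tuple
>>>>>>>>                 // to another task of this same bolt; anchored to the input
>>>>>>>>                 collector.emit(input, new Values(walker, nextVertex(vertex)));
>>>>>>>>             }
>>>>>>>>             collector.ack(input); // ack unconditionally so tuples don't time out
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         private long nextVertex(long v) {
>>>>>>>>             return v + 1; // placeholder for the real neighbour selection
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         public void declareOutputFields(OutputFieldsDeclarer declarer) {
>>>>>>>>             declarer.declare(new Fields("walker", "vertex"));
>>>>>>>>         }
>>>>>>>>     }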
>>>>>>>>
>>>>>>>>
>>>>>>>> I have been stuck on this problem for quite a while, so it would be
>>>>>>>> very helpful if someone could help.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Nick
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
