Re: Re:Flooded topology after a full GC

Ramin Farajollah (BLOOMBERG/ 731 LEX) Fri, 20 Dec 2019 13:56:18 -0800

Roshan,

Thanks for the helpful comments. A few follow-up questions:

1) Enabling ACKing
What does this specifically entail? Anchoring the tuple (collector.emit(tuple, 
...)) and acking it (collector.ack(tuple)) at each hop?
Without doing both, ACKing is not enabled and topology.max.spout.pending would 
be ineffective, so on timeout the messages are not resent. Correct?
We do NOT anchor the tuples but do ack/fail every one of them. Is there any 
harm or advantage in doing that?

2) Excessive traffic
What happens to the tuples when they are produced at a faster rate than they 
can be consumed?
Are they silently discarded (when a) tuples are acked, b) back pressure is 
enabled, c) nothing is enabled)?

From: user@storm.apache.org At: 12/19/19 18:53:30
To:  user@storm.apache.org
Subject: Re: Re:Flooded topology after a full GC

Some thoughts:
If you have ACKing enabled... you can control the number of inflight msgs 
using topology.max.spout.pending. It will constrain the spouts from producing 
more msgs. 
A long GC could potentially cause a timeout, thereby requiring the spout to 
reproduce the msgs. However, the timed-out msgs that are already in flight will 
continue to drain out (assuming they were not lost due to a worker crash), so 
there will be duplicate delivery. Individual msgs in the "tuple tree" are not 
tracked to see if there is a timeout at every hop (i.e., bolt).
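
A toy model may help picture the throttling described above. It is plain Java, 
not Storm's API or implementation: the class name, method name, and all the 
numbers are made up for illustration. It only assumes that the pending limit 
is enforced when tuples are tracked (ACKing enabled) and ignored otherwise.

```java
public class PendingThrottleSketch {

    /**
     * Simulates a spout for a number of ticks and returns the peak count of
     * in-flight tuples. Hypothetical model, not Storm code.
     *
     * @param maxPending  stand-in for topology.max.spout.pending
     * @param tracked     true if tuples are tracked (ACKing enabled)
     * @param ticks       number of simulated time steps
     * @param emitPerTick tuples the spout would like to emit per tick
     * @param ackPerTick  tuples the downstream bolts complete per tick
     */
    static int peakInFlight(int maxPending, boolean tracked,
                            int ticks, int emitPerTick, int ackPerTick) {
        int inFlight = 0; // emitted but not yet completed
        int peak = 0;
        for (int t = 0; t < ticks; t++) {
            for (int i = 0; i < emitPerTick; i++) {
                // The limit can only hold the spout back when tuples are tracked.
                if (tracked && inFlight >= maxPending) {
                    break;
                }
                inFlight++;
            }
            peak = Math.max(peak, inFlight);
            // Downstream drains a fixed number of tuples per tick.
            inFlight = Math.max(0, inFlight - ackPerTick);
        }
        return peak;
    }

    public static void main(String[] args) {
        // Producer is 4x faster than the consumer in both runs.
        System.out.println("tracked:   " + peakInFlight(100, true, 50, 40, 10));
        System.out.println("untracked: " + peakInFlight(100, false, 50, 40, 10));
    }
}
```

In this model the tracked run levels off at the limit, while the untracked run 
keeps growing for as long as the producer outpaces the consumer.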

The sudden burst of msgs after a stop-the-world (STW) GC might be due to the 
timeout causing spouts to re-emit. Check the GC logs to see how long the STW 
cycle takes. Increasing the msg timeout accordingly and also reducing inflight 
msgs could help in this situation. Keep in mind that each worker has its own 
STW GC cycle, which means a single tuple tree can hit multiple STW cycles, 
depending on how many hops are involved.
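
As a rough back-of-the-envelope sketch of the sizing advice above: if every 
hop's worker could hit one full STW pause while a tuple tree is in flight, the 
message timeout (topology.message.timeout.secs) would need to cover the normal 
latency plus one pause per hop. The formula, the helper name, the 3 hops, and 
the 5 s normal latency are assumptions for illustration; only the ~30 s full 
GC comes from this thread.

```java
public class TimeoutSizingSketch {

    /**
     * Pessimistic timeout estimate: normal end-to-end latency plus one
     * worst-case STW pause per hop. Hypothetical helper, not part of Storm.
     */
    static int suggestedTimeoutSecs(int normalLatencySecs,
                                    int worstStwPauseSecs,
                                    int hops) {
        return normalLatencySecs + hops * worstStwPauseSecs;
    }

    public static void main(String[] args) {
        int normalLatencySecs = 5; // assumed
        int worstStwPauseSecs = 30; // full GC duration reported in the thread
        int hops = 3;               // assumed number of bolts in the tuple tree

        // 5 + 3 * 30
        System.out.println(suggestedTimeoutSecs(normalLatencySecs, worstStwPauseSecs, hops)); // prints 95
    }
}
```

For comparison, Storm's default message timeout is 30 seconds, so a single 
30 s pause could already use up the whole budget if the default is in effect.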

-roshan

On Thursday, December 19, 2019, 06:59:55 AM PST, Ramin Farajollah (BLOOMBERG/ 
731 LEX) <rfarajol...@bloomberg.net> wrote: 

Correction: HotSpot 8 (not OpenJDK 8)

From: user@storm.apache.org At: 12/19/19 09:56:34
To:  user@storm.apache.org
Subject: Flooded topology after a full GC

> Hi,
> 
> We use an object pool for messages in tuples. It has been effective at
> reducing the GCs caused by creating the heavy objects.
> 
> After a full GC (~30 sec), the ZooKeeper connection is suspended and is
> restored by Curator. This is followed by a huge rise in the number of
> objects (presumably in flight). This leads to more frequent full GCs and the
> eventual crash of the topology.
> 
> I'm trying to understand what triggers the huge rise immediately after the
> STW of the full GC/Curator reconnect. My guess is that all tuples had failed
> due to the zk timeout and were resent. In addition, there may be ack/fail
> signals exacerbating the situation.
> 
> My questions are:
> 1) How to determine if tuples are resent?
> 2) How to determine if acks/fails contribute to the traffic?
> 3) Without back pressure, are excess tuples silently discarded from
> the outbound or the inbound queues?
> 4) What happens to the failed tuples? (I need a hook to release the objects).
> 
> Details:
> - OpenJDK 8
> - Storm 1.2.3
> - Curator 2.12.0
> - zk session timeout 40000 ms, connection timeout 1500 ms
> - Initially the cache is adequate (8gb)
