Roshan Naik created STORM-1949:
----------------------------------
Summary: Storm backpressure can cause spout to stop emitting and
stall topology
Key: STORM-1949
URL: https://issues.apache.org/jira/browse/STORM-1949
Project: Apache Storm
Issue Type: Bug
Reporter: Roshan Naik
The problem can be reproduced by running this [Word count
topology|https://github.com/hortonworks/storm/blob/perftopos1.x/examples/storm-starter/src/jvm/org/apache/storm/starter/perf/FileReadWordCountTopo.java]
within an IDE.
I ran it with 1 spout instance, 2 splitter bolt instances, and 2 counter bolt
instances.
The problem is more easily reproduced with the WC topology because splitting
each sentence tuple into word tuples causes an explosion of tuples. As the
bolts have to process more tuples than the spout produces, the spout needs to
operate more slowly.
The time it takes for the topology to stall varies, but it is typically
under 10 minutes.
*My theory:* I suspect there is a race condition in the way ZK is being
used to enable/disable back pressure. When congested (i.e., pressure exceeds
the high water mark), the bolt's worker records the congestion in ZK by
creating a node. Once the congestion drops below the low water mark, it
deletes this node.
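A minimal sketch of the high/low water mark behavior described above, in plain Java. The class name CongestionFlag, the method onQueueDepth, and the thresholds are hypothetical, and a plain Set stands in for the child nodes under the ZK parent node; this is not Storm's actual code.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical tracker for one bolt worker's congestion flag. */
class CongestionFlag {
    private final String workerId;
    private final int highWaterMark;
    private final int lowWaterMark;
    // Stand-in for the child nodes under the ZK parent node (assumption).
    private final Set<String> zkChildren;
    private boolean congested = false;

    CongestionFlag(String workerId, int highWaterMark, int lowWaterMark,
                   Set<String> zkChildren) {
        this.workerId = workerId;
        this.highWaterMark = highWaterMark;
        this.lowWaterMark = lowWaterMark;
        this.zkChildren = zkChildren;
    }

    /** Creates the node above the high mark, deletes it below the low mark. */
    void onQueueDepth(int depth) {
        if (!congested && depth > highWaterMark) {
            congested = true;
            zkChildren.add(workerId);      // "create node"
        } else if (congested && depth < lowWaterMark) {
            congested = false;
            zkChildren.remove(workerId);   // "delete node"
        }
        // Between the two marks the flag keeps its previous state.
    }
}

class CongestionFlagDemo {
    public static void main(String[] args) {
        Set<String> zkChildren = new TreeSet<>();
        CongestionFlag bolt = new CongestionFlag("bolt-worker-1", 80, 20, zkChildren);
        for (int depth : new int[] {50, 90, 50, 10}) {
            bolt.onQueueDepth(depth);
            System.out.println("depth=" + depth + " children=" + zkChildren);
        }
    }
}
```

Each crossing of a water mark produces one create or delete, so with several bolt workers the parent node's children can change many times in quick succession.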
The spout's worker has set up a watch on the parent node, expecting a callback
whenever there is a change in the child nodes. On receiving the callback, the
spout's worker lists the parent node to check whether it has any child
nodes; it is essentially trying to figure out the nature of the state change in
ZK to determine whether to throttle or not. Subsequently it sets up another
watch in ZK to keep an eye on future changes.
When there are multiple bolts, these ZK nodes can be created and deleted
rapidly. Between the time the spout's worker receives a callback and sets up
the next watch, many changes may have occurred in ZK that go unnoticed by the
spout.
As a result, the condition that the bolts are no longer congested may never
get noticed.
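The suspected sequence can be sketched as a deterministic, single-threaded simulation. Everything here is an assumption made for illustration: FakeParentNode is an in-memory stand-in with one-shot watches, not the ZooKeeper client API, and SpoutWorker models only the "list, then re-arm the watch" order described in the theory, not Storm's actual implementation.

```java
import java.util.HashSet;
import java.util.Set;

/** In-memory stand-in for the ZK parent node with a one-shot child watch. */
class FakeParentNode {
    private final Set<String> children = new HashSet<>();
    private Runnable watch; // cleared when it fires, like a one-shot watch

    void setWatch(Runnable w) { watch = w; }
    Set<String> list() { return new HashSet<>(children); }
    void create(String child) { if (children.add(child)) fire(); }
    void delete(String child) { if (children.remove(child)) fire(); }

    private void fire() {
        Runnable w = watch;
        watch = null;              // a fired watch must be re-armed explicitly
        if (w != null) w.run();
    }
}

/** Hypothetical spout-side logic: list children first, re-arm the watch later. */
class SpoutWorker {
    final FakeParentNode node;
    boolean throttled = false;
    boolean pendingCallback = false;

    SpoutWorker(FakeParentNode node) {
        this.node = node;
        rearmWatch();
    }

    void listAndDecide() {
        throttled = !node.list().isEmpty();
        pendingCallback = false;
    }

    void rearmWatch() { node.setWatch(() -> pendingCallback = true); }
}

class MissedUpdateDemo {
    /** Returns true if the spout ends up throttled although no bolt is congested. */
    static boolean spoutStuckThrottled() {
        FakeParentNode node = new FakeParentNode();
        SpoutWorker spout = new SpoutWorker(node);
        node.create("bolt-worker-1");  // congestion recorded, watch fires
        spout.listAndDecide();         // spout sees one child and throttles
        node.delete("bolt-worker-1");  // congestion cleared while no watch is armed
        spout.rearmWatch();            // too late: the delete was never reported
        return spout.throttled && node.list().isEmpty() && !spout.pendingCallback;
    }

    public static void main(String[] args) {
        System.out.println("spout stuck throttled: " + spoutStuckThrottled());
    }
}
```

In this model the spout stays throttled until some later change fires the re-armed watch; if no further change arrives, it never resumes. For comparison, ZooKeeper's getChildren call can register the watch as part of the same read, which would avoid this particular window if the list and the re-arm were done in one call.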
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)