[
https://issues.apache.org/jira/browse/STORM-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368473#comment-15368473
]
Sriharsha Chintalapani commented on STORM-1949:
-----------------------------------------------
+1. Given that there are a few open issues, we should turn it off in defaults.yaml.
My understanding of backpressure, from talking to and debugging with
[~roshan_naik], is that as soon as we hit the high or low watermark we take an action.
I would like to see if we can add a time duration so that we act only when the
bolt stays at the watermark continuously, i.e. we are not constantly signaling
the spout to turn on or off. Let's say that, as a user, I set this time duration
to 1 min: we would check that the high watermark is hit consistently for 1 min,
then send a signal to turn on backpressure, and similarly for the low watermark.
Otherwise we can end up with false positives when turning on backpressure.
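
The time-duration idea above could be sketched roughly as follows. This is only
an illustration and not Storm code: the class name, the `hold_secs` parameter,
and the 0.9/0.4 watermark values are assumptions chosen for the example. The
throttle state flips only after the relevant watermark has been breached
continuously for the configured duration.

```python
class SustainedWatermarkGate:
    """Hypothetical helper: flip the throttle state only after a watermark
    has been breached continuously for `hold_secs` seconds."""

    def __init__(self, high, low, hold_secs):
        self.high = high
        self.low = low
        self.hold_secs = hold_secs
        self.throttled = False
        self._breach_since = None  # when the pending condition first held

    def observe(self, queue_fill, now):
        """Feed one queue-fill sample (0.0 to 1.0) taken at time `now` (seconds)."""
        if self.throttled:
            pending = queue_fill <= self.low   # candidate for turning throttle off
        else:
            pending = queue_fill >= self.high  # candidate for turning throttle on
        if not pending:
            self._breach_since = None          # breach interrupted: start over
            return self.throttled
        if self._breach_since is None:
            self._breach_since = now
        if now - self._breach_since >= self.hold_secs:
            self.throttled = not self.throttled
            self._breach_since = None
        return self.throttled


gate = SustainedWatermarkGate(high=0.9, low=0.4, hold_secs=60)
# (time in seconds, queue fill): a short spike at t=0 must not trigger throttling.
samples = [(0, 0.95), (30, 0.50), (40, 0.95), (70, 0.97),
           (100, 0.99), (130, 0.30), (190, 0.35)]
states = [gate.observe(fill, t) for t, fill in samples]
```

In this run the spike at t=0 is ignored because the fill drops again at t=30;
throttling turns on only at t=100, after the high watermark has held since t=40,
and turns off at t=190 after the low watermark has held for 60 seconds.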
I also agree with [~roshan_naik] about the increased load on ZooKeeper. If
possible, we should explore signaling via a message sent back to the spout or
via the acker. I am not clear whether this is possible; I need to look into it.
> Backpressure can cause spout to stop emitting and stall topology
> ----------------------------------------------------------------
>
> Key: STORM-1949
> URL: https://issues.apache.org/jira/browse/STORM-1949
> Project: Apache Storm
> Issue Type: Bug
> Reporter: Roshan Naik
>
> Problem can be reproduced by this [Word count
> topology|https://github.com/hortonworks/storm/blob/perftopos1.x/examples/storm-starter/src/jvm/org/apache/storm/starter/perf/FileReadWordCountTopo.java]
> within an IDE.
> I ran it with 1 spout instance, 2 splitter bolt instances, 2 counter bolt
> instances.
> The problem is more easily reproduced with WC topology as it causes an
> explosion of tuples due to splitting a sentence tuple into word tuples. As
> the bolts have to process more tuples than the spout is producing, spout
> needs to operate slower.
> The amount of time it takes for the topology to stall can vary, but it is
> typically under 10 mins.
> *My theory:* I suspect there is a race condition in the way ZK is being
> utilized to enable/disable back pressure. When congested (i.e. pressure
> exceeds high water mark), the bolt's worker records this congested situation
> in ZK by creating a node. Once the congestion is reduced below the low water
> mark, it deletes this node.
> The spout's worker has set up a watch on the parent node, expecting a callback
> whenever there is a change in the child nodes. On receiving the callback the
> spout's worker lists the parent node to check if there are 0 or more child
> nodes; it is essentially trying to figure out the nature of the state change
> in ZK to determine whether to throttle or not. Subsequently it sets up
> another watch in ZK to keep an eye on future changes.
> When there are multiple bolts, there can be rapid creation/deletion of these
> ZK nodes. Between the time the worker receives a callback and sets up the
> next watch, many changes may have occurred in ZK that will go unnoticed by
> the spout.
> The condition that the bolts are no longer congested may not get noticed as a
> result of this.
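
The interleaving suspected in the description above can be illustrated with a
toy model. This is a sketch under stated assumptions, not Storm's or
ZooKeeper's actual code: `FakeZk`, `SpoutWorker`, and the `interleaved` hook
are hypothetical names, and the model only mimics one-shot child watches. It
compares a worker that lists the children and re-registers its watch in two
separate steps with one that does both in a single call.

```python
class FakeZk:
    """Hypothetical in-memory stand-in for a parent node with one-shot child watches."""

    def __init__(self):
        self.children = set()
        self._watchers = []

    def get_children(self, watch=None):
        # Reading and (optionally) registering a watch happen in one step.
        if watch is not None:
            self._watchers.append(watch)
        return set(self.children)

    def _fire(self):
        # One-shot semantics: registered watches are cleared before being called.
        watchers, self._watchers = self._watchers, []
        for callback in watchers:
            callback()

    def create(self, name):
        self.children.add(name)
        self._fire()

    def delete(self, name):
        self.children.discard(name)
        self._fire()


class SpoutWorker:
    def __init__(self, zk, atomic_rewatch, interleaved=None):
        self.zk = zk
        self.atomic_rewatch = atomic_rewatch
        self.interleaved = interleaved  # simulates a bolt acting mid-callback
        self.throttled = False
        self.zk.get_children(watch=self.on_change)

    def on_change(self):
        if self.atomic_rewatch:
            kids = self.zk.get_children(watch=self.on_change)  # read + re-watch
        else:
            kids = self.zk.get_children()                      # read only
        self.throttled = len(kids) > 0
        if self.interleaved:
            hook, self.interleaved = self.interleaved, None
            hook()  # a change lands after the read
        if not self.atomic_rewatch:
            # Watch re-registered too late, and the fresh listing is ignored.
            self.zk.get_children(watch=self.on_change)


def run(atomic_rewatch):
    zk = FakeZk()
    spout = SpoutWorker(zk, atomic_rewatch,
                        interleaved=lambda: zk.delete("bolt-1"))
    zk.create("bolt-1")  # bolt reports congestion; it clears mid-callback
    return spout.throttled, len(zk.children)


racy = run(atomic_rewatch=False)
safe = run(atomic_rewatch=True)
```

In the two-step variant the spout ends up throttled even though no congestion
node remains, which matches the stall described above. In the single-call
variant the deletion triggers a second callback and the throttle is released.
Whether Storm's worker actually separates the two steps is not established
here and would need to be checked against the code.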