[
https://issues.apache.org/jira/browse/STORM-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423899#comment-15423899
]
Zhuo Liu commented on STORM-1949:
---------------------------------
Hi [~roshan_naik] [~revans2] [~knusbaum] [~kabhwan],
I tested with the word count example sent by Roshan, the results is:
Backpressure works fine for 1.x-branch while in master (2.x), we have a
NullPointerExecution in SpoutExecutor.
(1) in 1.x-branch, back pressure works well (see the attached screenshot). The
back pressure flag gets set and unset intermittently depends on the load in
disruptor queues. [~roshan_naik], I think you guess of reason (racing
condition) is incorrect since we have cyclic check of flags every 100 ms, as
mentioned by Bobby.
(2) in test for master, I see the topology gets stalled quickly, less than a
minute. But the reason is not because of the bug of back pressure, it is due to
the NullPointerException in Executor.
https://github.com/apache/storm/blob/master/storm-core/src/jvm/org/apache/storm/executor/spout/SpoutExecutor.java#L205
[~Cody] Could you help with this NPE since you worked on Storm-1277?
Thanks a lot!
80 2016-08-16 21:19:15.102 o.a.s.m.EventLoggerBolt [INFO] EventLoggerBolt
prepare called
81 2016-08-16 21:19:15.107 o.a.s.m.FileBasedEventLogger [INFO] logFilePath
/Users/zliu6/git/joe/storm/storm-dist/binary/target/storm/logs/workers-artifacts/FileReadWordCountTopo-1-1471407540/6700
/events.log
82 2016-08-16 21:19:15.110 o.a.s.e.b.BoltExecutor [INFO] Prepared bolt
__eventlogger:[2]
83 2016-08-16 21:20:17.242 o.a.s.u.Utils [ERROR] Async loop died!
84 java.lang.RuntimeException: java.lang.NullPointerException
85 at
org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:468)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
86 at
org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:434)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
87 at
org.apache.storm.utils.DisruptorQueue.consumeBatch(DisruptorQueue.java:424)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
88 at
org.apache.storm.executor.spout.SpoutExecutor$2.call(SpoutExecutor.java:137)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
89 at org.apache.storm.utils.Utils$7.run(Utils.java:2227)
[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
90 at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]
91 Caused by: java.lang.NullPointerException
92 at
org.apache.storm.executor.spout.SpoutExecutor.tupleActionFn(SpoutExecutor.java:205)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
93 at org.apache.storm.executor.Executor.onEvent(Executor.java:257)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
94 at
org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:455)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
95 ... 5 more
96 2016-08-16 21:20:17.860 o.a.s.e.e.ReportError [ERROR] Error
97 java.lang.RuntimeException: java.lang.RuntimeException:
java.lang.NullPointerException
98 at org.apache.storm.utils.Utils$7.run(Utils.java:2237)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
99 at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]
100 Caused by: java.lang.RuntimeException: java.lang.NullPointerException
101 at
org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:468)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
102 at
org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:434)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
103 at
org.apache.storm.utils.DisruptorQueue.consumeBatch(DisruptorQueue.java:424)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
104 at
org.apache.storm.executor.spout.SpoutExecutor$2.call(SpoutExecutor.java:137)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
105 at org.apache.storm.utils.Utils$7.run(Utils.java:2227)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
106 ... 1 more
107 Caused by: java.lang.NullPointerException
108 at
org.apache.storm.executor.spout.SpoutExecutor.tupleActionFn(SpoutExecutor.java:205)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
109 at org.apache.storm.executor.Executor.onEvent(Executor.java:257)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
110 at
org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:455)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
111 at
org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:434)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
112 at
org.apache.storm.utils.DisruptorQueue.consumeBatch(DisruptorQueue.java:424)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
113 at
org.apache.storm.executor.spout.SpoutExecutor$2.call(SpoutExecutor.java:137)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
114 at org.apache.storm.utils.Utils$7.run(Utils.java:2227)
~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
115 ... 1 more
> Backpressure can cause spout to stop emitting and stall topology
> ----------------------------------------------------------------
>
> Key: STORM-1949
> URL: https://issues.apache.org/jira/browse/STORM-1949
> Project: Apache Storm
> Issue Type: Bug
> Reporter: Roshan Naik
> Attachments: 1.x-branch-works-perfect.png
>
>
> Problem can be reproduced by this [Word count
> topology|https://github.com/hortonworks/storm/blob/perftopos1.x/examples/storm-starter/src/jvm/org/apache/storm/starter/perf/FileReadWordCountTopo.java]
> within a IDE.
> I ran it with 1 spout instance, 2 splitter bolt instances, 2 counter bolt
> instances.
> The problem is more easily reproduced with WC topology as it causes an
> explosion of tuples due to splitting a sentence tuple into word tuples. As
> the bolts have to process more tuples than the spout is producing, spout
> needs to operate slower.
> The amount of time it takes for the topology to stall can vary.. but
> typically under 10 mins.
> *My theory:* I suspect there is a race condition in the way ZK is being
> utilized to enable/disable back pressure. When congested (i.e pressure
> exceeds high water mark), the bolt's worker records this congested situation
> in ZK by creating a node. Once the congestion is reduced below the low water
> mark, it deletes this node.
> The spout's worker has setup a watch on the parent node, expecting a callback
> whenever there is change in the child nodes. On receiving the callback the
> spout's worker lists the parent node to check if there are 0 or more child
> nodes.... it is essentially trying to figure out the nature of state change
> in ZK to determine whether to throttle or not. Subsequently it setsup
> another watch in ZK to keep an eye on future changes.
> When there are multiple bolts, there can be rapid creation/deletion of these
> ZK nodes. Between the time the worker receives a callback and sets up the
> next watch.. many changes may have undergone in ZK which will go unnoticed by
> the spout.
> The condition that the bolts are no longer congested may not get noticed as a
> result of this.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)