Ma Zhechao created STORM-2150:
---------------------------------
Summary: ShellBolt raise subprocess heartbeat timeout Exception
Key: STORM-2150
URL: https://issues.apache.org/jira/browse/STORM-2150
Project: Apache Storm
Issue Type: Bug
Components: storm-core, storm-multilang
Affects Versions: 1.0.1, 1.0.2
Reporter: Ma Zhechao
Priority: Critical
I have a simple topology running on Storm 1.0.1. The topology consists of a
KafkaSpout and several Python multilang ShellBolts. I frequently get the
following exception.
{code}
java.lang.RuntimeException: subprocess heartbeat timeout
    at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
{code}
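For readers unfamiliar with the mechanism, the trace shows the exception being thrown from a timer task that runs periodically on a scheduled executor. The sketch below is a toy model of such a check, written in Python for brevity. It is not Storm's code: the class name `HeartbeatWatchdog`, its methods, and the `timeout_secs` parameter are all hypothetical, and it assumes the task simply compares the time since the last heartbeat against a configured timeout.

```python
import time


class HeartbeatWatchdog:
    """Toy model of a periodic heartbeat check (hypothetical, not Storm's code)."""

    def __init__(self, timeout_secs, clock=time.monotonic):
        self.timeout_secs = timeout_secs
        self.clock = clock
        # Start the silence window at construction time.
        self.last_heartbeat = clock()

    def record_heartbeat(self):
        # Called whenever the subprocess answers a heartbeat.
        self.last_heartbeat = self.clock()

    def check(self):
        # Called periodically by a timer. Raises once the subprocess
        # has been silent for longer than the allowed timeout.
        silent_for = self.clock() - self.last_heartbeat
        if silent_for > self.timeout_secs:
            raise RuntimeError("subprocess heartbeat timeout")
```

In this model, a subprocess that is alive but too busy to answer within `timeout_secs` looks the same to the watchdog as one that has died.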
More information:
1. The topology runs in ACK mode.
2. The topology has 40 workers.
3. The topology emits about 10 million tuples every 10 minutes.
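To put the load figures above in per-second terms, here is a quick back-of-the-envelope calculation. It assumes the tuples are spread evenly across all workers, which may not hold in practice; the variable names are only for illustration.

```python
# Figures taken from the list above.
TUPLES_PER_WINDOW = 10_000_000
WINDOW_SECS = 10 * 60
WORKERS = 40

# Overall emit rate for the whole topology, in tuples per second.
total_rate = TUPLES_PER_WINDOW / WINDOW_SECS

# Average rate per worker, assuming an even spread (an assumption).
per_worker_rate = total_rate / WORKERS

print(f"total: {total_rate:.0f} tuples/s, per worker: {per_worker_rate:.0f} tuples/s")
```

That works out to roughly 16,700 tuples per second overall, or a little over 400 per second per worker under the even-spread assumption.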
Every time a subprocess heartbeat timeout occurs, the workers restart and the
Python processes exit with exitCode:-1, which hurts the processing capacity and
stability of the topology.
I've checked some related issues in the Storm JIRA. I first found STORM-1946,
which reports a bug related to this problem and says the bug was fixed in Storm
1.0.2. However, I got the same exception even after upgrading to Storm 1.0.2.
I then checked other related issues to trace the history of this problem.
DashengJu first reported it in non-ACK mode in STORM-738. STORM-742 discussed
an approach to this problem in ACK mode, and it seemed that the bug had been
fixed in 0.10.0. I don't know whether that patch is included in the storm-1.x
branch. In short, this problem still exists in the latest stable version.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)