Could you change your log level to DEBUG and check the worker's log? If you use Storm 1.x, you can change the log level from the UI on the fly. ShellBolt logs subprocess heartbeat activity, but only at DEBUG level, since it can produce a lot of log output.
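If it helps, here is a minimal sketch for scanning a worker log for stale heartbeats. It matches the heartbeat DEBUG message "BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}" and flags entries where the gap exceeds the timeout. It assumes the three placeholders are integers in milliseconds; the function name and the sample lines are made up for illustration, so please adjust them to what your log actually contains.

```python
import re

# Matches the heartbeat DEBUG message. Assumption: all three fields are
# integers expressed in milliseconds; verify against your own worker log.
HEARTBEAT_RE = re.compile(
    r"BOLT - current time : (\d+), last heartbeat : (\d+), "
    r"worker timeout \(ms\) : (\d+)"
)


def scan_heartbeats(lines):
    """Yield (current_ms, last_heartbeat_ms, timeout_ms, stale) per match.

    `stale` is True when current - last heartbeat exceeds the timeout.
    Lines that are not heartbeat status lines are skipped.
    """
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match is None:
            continue
        current, last, timeout = (int(group) for group in match.groups())
        yield current, last, timeout, (current - last) > timeout


# Hypothetical sample lines, not taken from a real cluster.
SAMPLE_LINES = [
    "DEBUG BOLT - current time : 1000000, last heartbeat : 999500, "
    "worker timeout (ms) : 30000",
    "DEBUG BOLT - sending heartbeat request to subprocess",
    "DEBUG BOLT - current time : 1040000, last heartbeat : 999500, "
    "worker timeout (ms) : 30000",
]

if __name__ == "__main__":
    for current, last, timeout, stale in scan_heartbeats(SAMPLE_LINES):
        status = "STALE" if stale else "ok"
        print(f"gap={current - last} ms, timeout={timeout} ms -> {status}")
```

To run it against a real log, pass an open file object instead of SAMPLE_LINES, for example scan_heartbeats(open(path)) where path points at your worker log.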
ShellBolt logs these two lines every second:

- BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
- BOLT - sending heartbeat request to subprocess

Please check that these lines appear in the log, that 'last heartbeat' is updated properly, and that the worker timeout is set as expected.

On Fri, Oct 21, 2016 at 1:59 PM, Zhechao Ma <mazhechaomaill...@gmail.com> wrote:

> I have not set "topology.subprocess.timeout.secs", so
> "supervisor.worker.timeout.secs" is used according to STORM-1314, and
> that is set to 30 on my cluster.
> 30 seconds is a very, very large value; processing one of my tuples
> never takes more than 30 seconds.
> I think the problem must be somewhere else.
>
> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <kabh...@gmail.com>:
>
> There are many situations in which ShellBolt can trigger the heartbeat
> issue, and at least STORM-1946 is not the case here.
>
> How long does your tuple take to be processed? You need to set the
> subprocess timeout ("topology.subprocess.timeout.secs") higher than the
> maximum processing time. You can even set it to a fairly large value so
> that the subprocess heartbeat issue does not happen.
>
> ShellBolt requires that each tuple is handled and acked within the
> heartbeat timeout. I struggled to change this behavior so that the
> subprocess sends heartbeats periodically, but had no luck because of the
> GIL (global interpreter lock; the same applies to Ruby). We need to
> choose one: stick with this restriction, or disable the subprocess
> heartbeat.
>
> I hope we can resolve this issue cleanly, but I guess a multi-threaded
> approach doesn't work in Python, Ruby, or any language that uses a GIL,
> and I have no idea about alternatives.
>
> - Jungtaek Lim (HeartSaVioR)
>
> On Fri, Oct 21, 2016 at 11:44 AM, Zhechao Ma <mazhechaomaill...@gmail.com> wrote:
>
> I filed an issue (STORM-2150
> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago. Can
> anyone help?
>
> I've got a simple topology running with Storm 1.0.1. The topology
> consists of a KafkaSpout and several Python multilang ShellBolts. I
> frequently get the following exception:
>
> java.lang.RuntimeException: subprocess heartbeat timeout
>     at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
>
> More information:
> 1. The topology runs in ACK mode.
> 2. The topology has 40 workers.
> 3. The topology emits about 10 million tuples every 10 minutes.
>
> Every time a subprocess heartbeat timeout occurs, the workers restart
> and the Python processes exit with exitCode:-1, which hurts the
> processing capacity and stability of the topology.
>
> I've checked some related issues in the Storm Jira. I first found
> STORM-1946 <https://issues.apache.org/jira/browse/STORM-1946>, which
> reports a bug related to this problem and says it was fixed in Storm
> 1.0.2. However, I got the same exception even after upgrading Storm to
> 1.0.2.
>
> I checked other related issues. Let's look at the history of this
> problem. DashengJu first reported this problem in non-ACK mode in
> STORM-738 <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
> <https://issues.apache.org/jira/browse/STORM-742> discussed how to
> handle this problem in ACK mode, and it seemed the bug had been fixed in
> 0.10.0. I don't know whether this patch is included in the storm-1.x
> branch. In short, this problem still exists in the latest stable
> version.
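For reference, the timeout selection discussed in this thread can be sketched as follows: use "topology.subprocess.timeout.secs" when it is set, otherwise fall back to "supervisor.worker.timeout.secs" (the behavior attributed to STORM-1314 above). This is only an illustration of that rule, not Storm's own code; the function names and the plain-dict configuration are assumptions made for the example.

```python
def effective_subprocess_timeout_secs(conf):
    """Return the subprocess heartbeat timeout in seconds.

    Prefers topology.subprocess.timeout.secs and falls back to
    supervisor.worker.timeout.secs when the former is not set.
    `conf` is a plain dict standing in for the topology configuration
    (an assumption for this sketch).
    """
    timeout = conf.get("topology.subprocess.timeout.secs")
    if timeout is None:
        timeout = conf["supervisor.worker.timeout.secs"]
    return int(timeout)


def fits_within_timeout(conf, max_tuple_secs):
    """True if the slowest tuple finishes before the heartbeat timeout."""
    return max_tuple_secs < effective_subprocess_timeout_secs(conf)


if __name__ == "__main__":
    cluster_conf = {"supervisor.worker.timeout.secs": 30}
    print(effective_subprocess_timeout_secs(cluster_conf))
    print(fits_within_timeout(cluster_conf, max_tuple_secs=45))
```

With only the supervisor value set to 30, a tuple that takes 45 seconds would not fit; setting "topology.subprocess.timeout.secs" above the maximum processing time is the adjustment suggested earlier in the thread.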