Please check if you have orphan workers. Orphan workers happen when a topology is redeployed within a short period of time and the old workers haven't yet been cleaned up. Check for them by running ps aux | grep java, or grep for a specific jar keyword if you have one.
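For example, on a supervisor node the check could look like the sketch below (the jar name is only a placeholder):

    # List the worker JVMs currently running on this node; workers left over
    # from an earlier deployment will still show up here.
    ps aux | grep java | grep -v grep

    # If the topology jar has a distinctive name, narrow the listing down
    # with that keyword (my-topology.jar is a placeholder):
    ps aux | grep my-topology.jar | grep -v grep

Workers launched by the supervisor typically carry the topology id on their command line, so a process whose id does not match the current deployment is a likely orphan.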
On Jul 14, 2017 11:17 PM, "Sreeram" <[email protected]> wrote:

> Thank you Taylor for replying.
>
> I checked the worker logs and there were no messages printed once the
> worker went into the freeze (I had the worker logs set at ERROR level).
>
> Regarding the components, I have a Kafka spout in the topology and bolts
> that write to HBase.
>
> I had been rebalancing the topology with a wait time of 30 seconds. Is it
> recommended to match it with topology.message.timeout.secs?
>
> Please let me know if you need any specific info.
>
> Thanks,
> Sreeram
>
> On Sat, Jul 15, 2017 at 12:39 AM, P. Taylor Goetz <[email protected]> wrote:
>
> >> Supervisor log at the time of freeze looks like below
> >>
> >> 2017-07-12 14:38:46.712 o.a.s.d.supervisor [INFO] d8958816-5bc8-449e-94e3-87ddbb2c3d02 still hasn't started
> >
> > There are two situations where you would see those messages: when a
> > topology is first deployed, and when a worker has died and is being
> > restarted.
> >
> > I suspect the latter. Have you looked at the worker logs for any
> > indication that the workers might be crashing and what might be causing it?
> >
> > What components are involved in your topology?
> >
> > -Taylor
> >
> >> On Jul 12, 2017, at 5:26 AM, Sreeram <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> I am observing that my Storm topology intermittently freezes and does
> >> not continue to process tuples from Kafka. This happens frequently, and
> >> when it happens the freeze lasts for 5 to 15 minutes. No content is
> >> written to any of the worker log files during this time.
> >>
> >> The version of Storm I use is 1.0.2 and the Kafka version is 0.9.0.
> >>
> >> Any suggestions to solve the issue?
> >>
> >> Thanks,
> >> Sreeram
> >>
> >> Supervisor log at the time of freeze looks like below
> >>
> >> 2017-07-12 14:38:46.712 o.a.s.d.supervisor [INFO] d8958816-5bc8-449e-94e3-87ddbb2c3d02 still hasn't started
> >> 2017-07-12 14:38:47.212 o.a.s.d.supervisor [INFO] d8958816-5bc8-449e-94e3-87ddbb2c3d02 still hasn't started
> >> 2017-07-12 14:38:47.712 o.a.s.d.supervisor [INFO] d8958816-5bc8-449e-94e3-87ddbb2c3d02 still hasn't started
> >> 2017-07-12 14:38:48.213 o.a.s.d.supervisor [INFO] d8958816-5bc8-449e-94e3-87ddbb2c3d02 still hasn't started
> >> 2017-07-12 14:38:48.713 o.a.s.d.supervisor [INFO] d8958816-5bc8-449e-94e3-87ddbb2c3d02 still hasn't started
> >> 2017-07-12 14:38:49.213 o.a.s.d.supervisor [INFO] d8958816-5bc8-449e-94e3-87ddbb2c3d02 still hasn't started
> >> 2017-07-12 14:38:49.713 o.a.s.d.supervisor [INFO] d8958816-5bc8-449e-94e3-87ddbb2c3d02 still hasn't started
> >> 2017-07-12 14:38:50.214 o.a.s.d.supervisor [INFO] d8958816-5bc8-449e-94e3-87ddbb2c3d02 still hasn't started
> >> 2017-07-12 14:38:50.714 o.a.s.d.supervisor [INFO] d8958816-5bc8-449e-94e3-87ddbb2c3d02 still hasn't started
> >>
> >> Thread stacks (sample)
> >> Most of the worker threads during this freeze period look like one of
> >> the two stack traces below.
> >>
> >> Thread 104773: (state = BLOCKED)
> >>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
> >>  - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=215 (Compiled frame)
> >>  - java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.util.concurrent.SynchronousQueue$TransferStack$SNode, boolean, long) @bci=160, line=460 (Compiled frame)
> >>  - java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.lang.Object, boolean, long) @bci=102, line=362 (Compiled frame)
> >>  - java.util.concurrent.SynchronousQueue.poll(long, java.util.concurrent.TimeUnit) @bci=11, line=941 (Compiled frame)
> >>  - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=134, line=1066 (Compiled frame)
> >>  - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=26, line=1127 (Compiled frame)
> >>  - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=617 (Compiled frame)
> >>  - java.lang.Thread.run() @bci=11, line=745 (Compiled frame)
> >>
> >> Thread 147495: (state = IN_NATIVE)
> >>  - sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0 (Compiled frame; information may be imprecise)
> >>  - sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=269 (Compiled frame)
> >>  - sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=93 (Compiled frame)
> >>  - sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=86 (Compiled frame)
> >>  - sun.nio.ch.SelectorImpl.select(long) @bci=30, line=97 (Compiled frame)
> >>  - org.apache.kafka.common.network.Selector.select(long) @bci=35, line=425 (Compiled frame)
> >>  - org.apache.kafka.common.network.Selector.poll(long) @bci=81, line=254 (Compiled frame)
> >>  - org.apache.kafka.clients.NetworkClient.poll(long, long) @bci=84, line=270 (Compiled frame)
> >>  - org.apache.kafka.clients.producer.internals.Sender.run(long) @bci=343, line=216 (Compiled frame)
> >>  - org.apache.kafka.clients.producer.internals.Sender.run() @bci=27, line=128 (Interpreted frame)
> >>  - java.lang.Thread.run() @bci=11, line=745 (Compiled frame)
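On the rebalance wait time question quoted above: the wait time is the -w argument to the rebalance command, while topology.message.timeout.secs is a per-topology config (its default is 30 seconds). A minimal sketch of where each value lives, with the topology name as a placeholder:

    # Rebalance and wait 30 seconds before the workers are redistributed;
    # "mytopology" is a placeholder for the actual topology name.
    storm rebalance mytopology -w 30

    # The message timeout being compared against is set in the topology
    # config (or storm.yaml), e.g.:
    # topology.message.timeout.secs: 30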
