[ 
https://issues.apache.org/jira/browse/STORM-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525451#comment-16525451
 ] 

Aaron Gresch commented on STORM-3124:
-------------------------------------

I found by overriding lots of methods that after KerberosSaslClientHandler was 
registered for one of the pacemaker servers, it was immediately deregistered:
{code:java}
at 
org.apache.storm.messaging.netty.KerberosSaslClientHandler.channelUnregistered(KerberosSaslClientHandler.java:170)
 [storm-client-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:181)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:167)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:160)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:181)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:167)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:160)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.DefaultChannelPipeline$HeadContext.channelUnregistered(DefaultChannelPipeline.java:1412)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:181)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:167)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:865)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:844)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:465)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]
{code}
Adding some logging in Netty, it was a connection refused:

 
{code:java}
java.net.ConnectException: Connection refused

        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) 
~[?:1.8.0_131]

        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) 
~[?:1.8.0_131]

        at 
org.apache.storm.shade.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:325)
 ~[shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:341)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:635)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:582)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:461)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at 
org.apache.storm.shade.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
 [shaded-deps-2.0.0.y.jar:2.0.0.y]

        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
{code}
I could not find any built in logging or easily understandable way to determine 
this is happening.  It looks like we have code to reconnect to the pacemaker 
server when catching an exception, but none is being passed back to us visibly 
in this case.  Adding a reconnect call after giving up on the number of 
attempts appears to fix the issue.  Will do further testing.

> Failures talking to Pacemaker
> -----------------------------
>
>                 Key: STORM-3124
>                 URL: https://issues.apache.org/jira/browse/STORM-3124
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Aaron Gresch
>            Assignee: Aaron Gresch
>            Priority: Critical
>
> {code:java}
> 2018-06-25 20:21:05.220 o.a.s.p.PacemakerClient timer [ERROR] Not getting 
> response or getting null response. Making 7 more attempts.
> 2018-06-25 20:21:06.220 o.a.s.p.PacemakerClient timer [ERROR] error 
> attempting to write to a channel Timed out waiting for channel ready..
> 2018-06-25 20:21:06.220 o.a.s.p.PacemakerClient timer [ERROR] Not getting 
> response or getting null response. Making 6 more attempts.
> 2018-06-25 20:21:07.220 o.a.s.p.PacemakerClient timer [ERROR] error 
> attempting to write to a channel Timed out waiting for channel ready..
> 2018-06-25 20:21:07.221 o.a.s.p.PacemakerClient timer [ERROR] Not getting 
> response or getting null response. Making 5 more attempts.
> 2018-06-25 20:21:08.221 o.a.s.p.PacemakerClient timer [ERROR] error 
> attempting to write to a channel Timed out waiting for channel ready..
> 2018-06-25 20:21:08.221 o.a.s.p.PacemakerClient timer [ERROR] Not getting 
> response or getting null response. Making 4 more attempts.
> 2018-06-25 20:21:09.222 o.a.s.p.PacemakerClient timer [ERROR] error 
> attempting to write to a channel Timed out waiting for channel ready..
> 2018-06-25 20:21:09.222 o.a.s.p.PacemakerClient timer [ERROR] Not getting 
> response or getting null response. Making 3 more attempts.
> 2018-06-25 20:21:10.222 o.a.s.p.PacemakerClient timer [ERROR] error 
> attempting to write to a channel Timed out waiting for channel ready..
> 2018-06-25 20:21:10.222 o.a.s.p.PacemakerClient timer [ERROR] Not getting 
> response or getting null response. Making 2 more attempts.
> 2018-06-25 20:21:11.223 o.a.s.p.PacemakerClient timer [ERROR] error 
> attempting to write to a channel Timed out waiting for channel ready..
> 2018-06-25 20:21:11.223 o.a.s.p.PacemakerClient timer [ERROR] Not getting 
> response or getting null response. Making 1 more attempts.
> 2018-06-25 20:21:12.223 o.a.s.p.PacemakerClient timer [ERROR] error 
> attempting to write to a channel Timed out waiting for channel ready..
> 2018-06-25 20:21:12.223 o.a.s.p.PacemakerClient timer [ERROR] Not getting 
> response or getting null response. Making 0 more attempts.
> 2018-06-25 20:21:13.224 o.a.s.p.PacemakerClientPool timer [WARN] Failed to 
> connect to the pacemaker server openqe74blue-n2.blue.ygrid.yahoo.com
> 2018-06-25 20:21:13.229 o.a.s.d.n.Nimbus pool-37-thread-481 [INFO] 
> uploadedJar 
> /home/y/var/storm/nimbus/inbox/stormjar-c5893ba3-21c6-4397-84e2-54aab8e091a9.jar
> 2018-06-25 20:21:13.225 o.a.s.d.n.Nimbus timer [ERROR] Error while processing 
> event
> java.lang.RuntimeException: java.lang.RuntimeException: 
> org.apache.storm.pacemaker.PacemakerConnectionException: Failed to connect to 
> any Pacemaker.
>         at 
> org.apache.storm.daemon.nimbus.Nimbus.lambda$launchServer$37(Nimbus.java:2773)
>  ~[storm-server-2.0.0.y.jar:2.0.0.y]
>         at org.apache.storm.StormTimer$1.run(StormTimer.java:110) 
> ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:226) 
> [storm-client-2.0.0.y.jar:2.0.0.y]
> Caused by: java.lang.RuntimeException: 
> org.apache.storm.pacemaker.PacemakerConnectionException: Failed to connect to 
> any Pacemaker.
>         at 
> org.apache.storm.cluster.PaceMakerStateStorage.get_worker_hb_children(PaceMakerStateStorage.java:214)
>  ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.cluster.StormClusterStateImpl.heartbeatStorms(StormClusterStateImpl.java:482)
>  ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.daemon.nimbus.Nimbus.topoIdsToClean(Nimbus.java:897) 
> ~[storm-server-2.0.0.y.jar:2.0.0.y]
>         at org.apache.storm.daemon.nimbus.Nimbus.doCleanup(Nimbus.java:2469) 
> ~[storm-server-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.daemon.nimbus.Nimbus.lambda$launchServer$37(Nimbus.java:2771)
>  ~[storm-server-2.0.0.y.jar:2.0.0.y]
>         ... 2 more
> Caused by: org.apache.storm.pacemaker.PacemakerConnectionException: Failed to 
> connect to any Pacemaker.
>         at 
> org.apache.storm.pacemaker.PacemakerClientPool.sendAll(PacemakerClientPool.java:71)
>  ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.cluster.PaceMakerStateStorage.get_worker_hb_children(PaceMakerStateStorage.java:199)
>  ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.cluster.StormClusterStateImpl.heartbeatStorms(StormClusterStateImpl.java:482)
>  ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.daemon.nimbus.Nimbus.topoIdsToClean(Nimbus.java:897) 
> ~[storm-server-2.0.0.y.jar:2.0.0.y]
>         at org.apache.storm.daemon.nimbus.Nimbus.doCleanup(Nimbus.java:2469) 
> ~[storm-server-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.daemon.nimbus.Nimbus.lambda$launchServer$37(Nimbus.java:2771)
>  ~[storm-server-2.0.0.y.jar:2.0.0.y]
>         ... 2 more
> 2018-06-25 20:21:13.231 o.a.s.u.Utils timer [ERROR] Halting process: Error 
> while processing event
> java.lang.RuntimeException: Halting process: Error while processing event
>         at org.apache.storm.utils.Utils.exitProcess(Utils.java:470) 
> ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.daemon.nimbus.Nimbus.lambda$new$17(Nimbus.java:490) 
> ~[storm-server-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:253) 
> [storm-client-2.0.0.y.jar:2.0.0.y]
> 2018-06-25 20:21:13.232 o.a.s.d.n.Nimbus Thread-12 [INFO] Shutting down master
> 2018-06-25 20:21:13.232 o.a.s.u.Utils Thread-13 [INFO] Halting after 10 
> seconds
> 2018-06-25 20:21:13.677 o.a.s.d.n.Nimbus pool-37-thread-481 [INFO] desired 
> replication count 1 achieved, current-replication-count for conf key = 1, 
> current-replication-count for code key = 1, current-replication-count for jar 
> key = 1
> 2018-06-25 20:21:13.678 o.a.s.d.n.Nimbus pool-37-thread-481 [WARN] Topology 
> submission exception. (topology name='run')
> java.lang.IllegalStateException: instance must be started before calling this 
> method
>         at 
> org.apache.storm.shade.org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:444)
>  ~[shaded-deps-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkExists(CuratorFrameworkImpl.java:432)
>  ~[shaded-deps-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.zookeeper.ClientZookeeper.existsNode(ClientZookeeper.java:144)
>  ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.zookeeper.ClientZookeeper.mkdirsImpl(ClientZookeeper.java:288)
>  ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.zookeeper.ClientZookeeper.mkdirs(ClientZookeeper.java:70) 
> ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.cluster.ZKStateStorage.mkdirs(ZKStateStorage.java:114) 
> ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.cluster.PaceMakerStateStorage.mkdirs(PaceMakerStateStorage.java:69)
>  ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.cluster.StormClusterStateImpl.setupHeatbeats(StormClusterStateImpl.java:435)
>  ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:3024)
>  [storm-server-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3511)
>  [storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3490)
>  [storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:38) 
> [shaded-deps-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) 
> [shaded-deps-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.security.auth.sasl.SaslTransportPlugin$TUGIWrapProcessor.process(SaslTransportPlugin.java:147)
>  [storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:291)
>  [shaded-deps-2.0.0.y.jar:2.0.0.y]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [?:1.8.0_131]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [?:1.8.0_131]
>         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
> 2018-06-25 20:21:13.680 o.a.s.t.ProcessFunction pool-37-thread-481 [ERROR] 
> Internal error processing submitTopologyWithOpts
> java.lang.RuntimeException: java.lang.IllegalStateException: instance must be 
> started before calling this method
>         at 
> org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:3049)
>  ~[storm-server-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3511)
>  ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.generated.Nimbus$Processor$submitTopologyWithOpts.getResult(Nimbus.java:3490)
>  ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:38) 
> [shaded-deps-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) 
> [shaded-deps-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.security.auth.sasl.SaslTransportPlugin$TUGIWrapProcessor.process(SaslTransportPlugin.java:147)
>  [storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:291)
>  [shaded-deps-2.0.0.y.jar:2.0.0.y]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [?:1.8.0_131]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [?:1.8.0_131]
>         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
> Caused by: java.lang.IllegalStateException: instance must be started before 
> calling this method
>         at 
> org.apache.storm.shade.org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:444)
>  ~[shaded-deps-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkExists(CuratorFrameworkImpl.java:432)
>  ~[shaded-deps-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.zookeeper.ClientZookeeper.existsNode(ClientZookeeper.java:144)
>  ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.zookeeper.ClientZookeeper.mkdirsImpl(ClientZookeeper.java:288)
>  ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.zookeeper.ClientZookeeper.mkdirs(ClientZookeeper.java:70) 
> ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.cluster.ZKStateStorage.mkdirs(ZKStateStorage.java:114) 
> ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.cluster.PaceMakerStateStorage.mkdirs(PaceMakerStateStorage.java:69)
>  ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.cluster.StormClusterStateImpl.setupHeatbeats(StormClusterStateImpl.java:435)
>  ~[storm-client-2.0.0.y.jar:2.0.0.y]
>         at 
> org.apache.storm.daemon.nimbus.Nimbus.submitTopologyWithOpts(Nimbus.java:3024)
>  ~[storm-server-2.0.0.y.jar:2.0.0.y]
>         ... 9 more
> {code}
> We're having sporadic failures talking to Pacemaker.  This callstack shows us 
> unable to launch topologies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to