[ https://issues.apache.org/jira/browse/DRILL-5840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Hou resolved DRILL-5840. ------------------------------- Resolution: Not A Problem > A query that includes sort completes, and then loses Drill connection. Drill > becomes unresponsive, and cannot restart because it cannot communicate with > Zookeeper > ------------------------------------------------------------------------------------------------------------------------------------------------------------------ > > Key: DRILL-5840 > URL: https://issues.apache.org/jira/browse/DRILL-5840 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Relational Operators > Affects Versions: 1.11.0 > Reporter: Robert Hou > Assignee: Paul Rogers > Fix For: 1.12.0 > > > Query is: > {noformat} > ALTER SESSION SET `exec.sort.disable_managed` = false; > select count(*) from (select * from > dfs.`/drill/testdata/resource-manager/250wide.tbl` order by columns[0])d > where d.columns[0] = 'ljdfhwuehnoiueyf'; > {noformat} > Query tries to complete, but cannot. It takes 20 hours from the time the > query tries to complete, to the time Drill finally loses its connection. > From the drillbit.log: > {noformat} > 2017-10-03 16:28:14,892 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG > o.a.drill.exec.work.foreman.Foreman - 262bec7f-3539-0dd7-6fea-f2959f9df3b6: > State change requested RUNNING --> COMPLETED > 2017-10-04 01:47:27,698 [UserServer-1] DEBUG > o.a.d.e.r.u.UserServerRequestHandler - Received query to run. Returning > query handle. > 2017-10-04 03:30:02,916 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] WARN > o.a.d.exec.work.foreman.QueryManager - Failure while trying to delete the > estore profile for this query. > org.apache.drill.common.exceptions.DrillRuntimeException: unable to delete > node at /running/262bec7f-3539-0dd7-6fea-f2959f9df3b6 > at > org.apache.drill.exec.coord.zk.ZookeeperClient.delete(ZookeeperClient.java:343) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.coord.zk.ZkEphemeralStore.remove(ZkEphemeralStore.java:108) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.foreman.QueryManager.updateEphemeralState(QueryManager.java:293) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.foreman.Foreman.recordNewState(Foreman.java:1043) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:964) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.foreman.Foreman.access$2600(Foreman.java:113) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:1025) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:1018) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.common.EventProcessor.processEvents(EventProcessor.java:107) > [drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.common.EventProcessor.sendEvent(EventProcessor.java:65) > [drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.foreman.Foreman$StateSwitch.addEvent(Foreman.java:1020) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.foreman.Foreman.addToEventQueue(Foreman.java:1038) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.foreman.QueryManager.nodeComplete(QueryManager.java:498) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.foreman.QueryManager.access$100(QueryManager.java:66) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.foreman.QueryManager$NodeTracker.fragmentComplete(QueryManager.java:462) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.foreman.QueryManager.fragmentDone(QueryManager.java:147) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.foreman.QueryManager.access$400(QueryManager.java:66) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.foreman.QueryManager$1.statusUpdate(QueryManager.java:525) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.rpc.control.WorkEventBus.statusUpdate(WorkEventBus.java:71) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.fragment.FragmentStatusReporter.sendStatus(FragmentStatusReporter.java:124) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.fragment.FragmentStatusReporter.stateChanged(FragmentStatusReporter.java:94) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:304) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:267) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) > [drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > [na:1.7.0_111] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > [na:1.7.0_111] > at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111] > Caused by: org.apache.curator.CuratorConnectionLossException: KeeperErrorCode > = ConnectionLoss > at > org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) > ~[curator-client-2.7.1.jar:na] > at > org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) > ~[curator-client-2.7.1.jar:na] > at > org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) > ~[curator-client-2.7.1.jar:na] > at > org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:477) > ~[curator-framework-2.7.1.jar:na] > at > org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:238) > ~[curator-framework-2.7.1.jar:na] > at > org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:233) > ~[curator-framework-2.7.1.jar:na] > at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) > ~[curator-client-2.7.1.jar:na] > at > org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:230) > ~[curator-framework-2.7.1.jar:na] > at > org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:214) > ~[curator-framework-2.7.1.jar:na] > at > org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:41) > ~[curator-framework-2.7.1.jar:na] > at > org.apache.drill.exec.coord.zk.ZookeeperClient.delete(ZookeeperClient.java:340) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > ... 27 common frames omitted > 2017-10-04 03:30:02,916 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG > o.a.drill.exec.work.foreman.Foreman - 262bec7f-3539-0dd7-6fea-f2959f9df3b6: > cleaning up. > 2017-10-04 03:30:02,916 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG > o.a.d.exec.rpc.control.WorkEventBus - Removing fragment status listener for > queryId 262bec7f-3539-0dd7-6fea-f2959f9df3b6. > 2017-10-04 03:30:02,970 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG > o.a.drill.exec.work.foreman.Foreman - 262bec7f-3539-0dd7-6fea-f2959f9df3b6: > State change requested COMPLETED --> CANCELLATION_REQUESTED > 2017-10-04 03:30:02,970 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] WARN > o.a.drill.exec.work.foreman.Foreman - Dropping request to move to > CANCELLATION_REQUESTED state as query is already at COMPLETED state (which is > terminal). > 2017-10-04 03:30:02,970 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG > o.a.d.e.w.f.FragmentStatusReporter - Closing > org.apache.drill.exec.work.fragment.FragmentStatusReporter@75e9a886 > 2017-10-04 03:30:02,970 [drill-executor-42] DEBUG > o.a.d.exec.rpc.control.WorkEventBus - Removing fragment manager: > 262bec7f-3539-0dd7-6fea-f2959f9df3b6:0:0 > 2017-10-04 12:49:47,660 [UserServer-1] WARN > o.a.drill.exec.rpc.user.UserServer - Message of mode REQUEST of rpc type 3 > took longer than 500ms. Actual duration was 39739962ms. > 2017-10-04 12:49:47,662 [UserServer-1] ERROR > o.a.d.exec.rpc.RpcExceptionHandler - Exception in RPC communication. > Connection: /10.10.100.190:31010 <--> /10.10.100.190:37568 (user server). > Closing connection. > io.netty.handler.codec.DecoderException: > org.apache.drill.common.exceptions.DrillRuntimeException: unable to put > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:99) > [netty-codec-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.handler.timeout.ReadTimeoutHandler.channelRead(ReadTimeoutHandler.java:150) > [netty-handler-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > [netty-codec-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242) > [netty-codec-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > [netty-common-4.0.27.Final.jar:4.0.27.Final] > at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111] > Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: unable > to put > at > org.apache.drill.exec.coord.zk.ZookeeperClient.put(ZookeeperClient.java:300) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.coord.zk.ZookeeperClient.put(ZookeeperClient.java:246) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.coord.zk.ZkEphemeralStore.put(ZkEphemeralStore.java:79) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.foreman.QueryManager.updateEphemeralState(QueryManager.java:285) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.foreman.Foreman.recordNewState(Foreman.java:1043) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at org.apache.drill.exec.work.foreman.Foreman.<init>(Foreman.java:180) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.work.user.UserWorker.submitWork(UserWorker.java:78) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.rpc.user.UserServerRequestHandler.handle(UserServerRequestHandler.java:66) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.rpc.user.UserServerRequestHandler.handle(UserServerRequestHandler.java:47) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at org.apache.drill.exec.rpc.BasicServer.handle(BasicServer.java:157) > ~[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at org.apache.drill.exec.rpc.BasicServer.handle(BasicServer.java:53) > ~[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:274) > ~[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:244) > ~[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89) > [netty-codec-4.0.27.Final.jar:4.0.27.Final] > ... 22 common frames omitted > Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: error > while checking path on zookeeper > at > org.apache.drill.exec.coord.zk.ZookeeperClient.hasPath(ZookeeperClient.java:139) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.coord.zk.ZookeeperClient.hasPath(ZookeeperClient.java:107) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.coord.zk.ZookeeperClient.put(ZookeeperClient.java:273) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > ... 35 common frames omitted > Caused by: org.apache.curator.CuratorConnectionLossException: KeeperErrorCode > = ConnectionLoss > at > org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) > ~[curator-client-2.7.1.jar:na] > at > org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) > ~[curator-client-2.7.1.jar:na] > at > org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) > ~[curator-client-2.7.1.jar:na] > at > org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:477) > ~[curator-framework-2.7.1.jar:na] > at > org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:172) > ~[curator-framework-2.7.1.jar:na] > at > org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:161) > ~[curator-framework-2.7.1.jar:na] > at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) > ~[curator-client-2.7.1.jar:na] > at > org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:158) > ~[curator-framework-2.7.1.jar:na] > at > org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:148) > ~[curator-framework-2.7.1.jar:na] > at > org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:36) > ~[curator-framework-2.7.1.jar:na] > at > org.apache.drill.exec.coord.zk.ZookeeperClient.hasPath(ZookeeperClient.java:130) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > ... 37 common frames omitted > 2017-10-04 12:49:47,667 [UserServer-1] INFO > o.a.drill.exec.rpc.user.UserServer - RPC connection /10.10.100.190:31010 <--> > /10.10.100.190:37579 (user server) timed out. Timeout was set to 30 seconds. > Closing connection. > 2017-10-04 12:49:47,668 [UserServer-1] INFO > o.a.drill.exec.work.foreman.Foreman - Failure while trying communicate query > result to initiating client. This would happen if a client is disconnected > before response notice can be sent. > org.apache.drill.exec.rpc.ChannelClosedException: Channel closed > /10.10.100.190:31010 <--> /10.10.100.190:37579. > at > org.apache.drill.exec.rpc.RpcBus$ChannelClosedHandler.operationComplete(RpcBus.java:164) > [drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.exec.rpc.RpcBus$ChannelClosedHandler.operationComplete(RpcBus.java:144) > [drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) > [netty-common-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603) > [netty-common-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563) > [netty-common-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:406) > [netty-common-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannel$CloseFuture.setClosed(AbstractChannel.java:943) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannel$AbstractUnsafe.doClose0(AbstractChannel.java:592) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:584) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.DefaultChannelPipeline$HeadContext.close(DefaultChannelPipeline.java:1099) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:615) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:600) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.ChannelOutboundHandlerAdapter.close(ChannelOutboundHandlerAdapter.java:71) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:615) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:600) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:466) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.handler.timeout.ReadTimeoutHandler.readTimedOut(ReadTimeoutHandler.java:187) > [netty-handler-4.0.27.Final.jar:4.0.27.Final] > at > org.apache.drill.exec.rpc.BasicServer$LoggingReadTimeoutHandler.readTimedOut(BasicServer.java:122) > [drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > io.netty.handler.timeout.ReadTimeoutHandler$ReadTimeoutTask.run(ReadTimeoutHandler.java:212) > [netty-handler-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) > [netty-common-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120) > [netty-common-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) > [netty-common-4.0.27.Final.jar:4.0.27.Final] > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > [netty-transport-4.0.27.Final.jar:4.0.27.Final] > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > [netty-common-4.0.27.Final.jar:4.0.27.Final] > at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111] > {noformat} > When I re-connect to Drill and try to show schemas, Drill hangs, and I see > this in drillbit.log > {noformat} > 2017-10-04 13:01:01,414 [UserServer-1] DEBUG > o.a.d.e.r.u.UserServerRequestHandler - Received query to run. Returning > query handle. > {noformat} > I try to stop Drill, but it doesn't stop, so it gets killed. > {noformat} > [root@qa-node190 logs.67]# /opt/drill/bin/drillbit.sh stop > Stopping drillbit > ......................................................................................................................... > drillbit did not complete after 120 seconds, killing with kill -9 14550 > {noformat} > I try to start Drill > {noformat} > /opt/drill/bin/drillbit.sh start > {noformat} > But it fails because it cannot communicate with Zookeeper: > {noformat} > 2017-10-04 14:40:32,812 [main] DEBUG o.a.d.e.c.zk.ZKClusterCoordinator - > Starting ZKClusterCoordination. > 2017-10-04 14:40:42,820 [main] ERROR o.apache.drill.exec.server.Drillbit - > Failure during initial startup of Drillbit. > java.io.IOException: Failure to connect to the zookeeper cluster service > within the allotted time of 10000 milliseconds. > at > org.apache.drill.exec.coord.zk.ZKClusterCoordinator.start(ZKClusterCoordinator.java:123) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at org.apache.drill.exec.server.Drillbit.run(Drillbit.java:153) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:343) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:316) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at org.apache.drill.exec.server.Drillbit.main(Drillbit.java:312) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > 2017-10-04 14:40:42,820 [main] DEBUG o.apache.drill.exec.server.Drillbit - > Shutdown begun. > 2017-10-04 14:40:44,714 [main] WARN o.apache.drill.exec.server.Drillbit - > Failure on close() > java.lang.NullPointerException: null > at org.apache.drill.exec.work.WorkManager.close(WorkManager.java:152) > ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:76) > ~[drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at > org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:64) > ~[drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at org.apache.drill.exec.server.Drillbit.close(Drillbit.java:199) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:346) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:316) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > at org.apache.drill.exec.server.Drillbit.main(Drillbit.java:312) > [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT] > 2017-10-04 14:40:44,715 [main] INFO o.apache.drill.exec.server.Drillbit - > Shutdown completed (1894 ms). > {noformat} > I normally restart warden and zookeeper to resolve this. When I stopped > warden, it took 20 minutes (normally it should take a few minutes or less): > {noformat} > [root@qa-node190 logs.67]# time service mapr-warden stop > stopping WARDEN > looking to stop mapr-core processes not started by warden > real 20m46.637s > user 0m1.622s > sys 0m2.432s > {noformat} > Stopping zookeeper was quick, and restarting all processes including Drill > completed as expected. -- This message was sent by Atlassian JIRA (v6.4.14#64029)