will need more logs to debug this. you can grep only for helix/zk related
logs.
This should not happen in general. Couple of scenarios when this can happen
- The Participant was already in the middle of a transition when you
shut it down.
- Participant GC'ed has soon as it started and lost connection to ZK.
Was this a clean shutdown? Did you wait long enough the for liveinstance to
disappear?
what do you have in this line
at com.hcd.hcdadmin.CustomMessageHandlerFactory$CustomMessageHandler.
handleMessage(CustomMessageHandlerFactory.java:149
This does not appear to be executing state transition, if so who is sending
these custom messages to the participant.
On Wed, Apr 5, 2017 at 7:34 PM, Neutron Sharc <[email protected]>
wrote:
> We are using helix-0.7.1
>
> All these logs are from participants, not from controller.
>
> -Shawn
>
>
> On Wed, Apr 5, 2017 at 7:09 PM, kishore g <[email protected]> wrote:
> > Hi Shawn,
> >
> > Are the logs on the participant or controller? what is the helix version?
> >
> >
> >
> > On Wed, Apr 5, 2017 at 6:36 PM, Neutron Sharc <[email protected]>
> > wrote:
> >
> >> Hi all,
> >>
> >> We are testing a failure recovery scenario where I have many resources
> >> spanning many participants. I shutdown all participants and helix
> >> admins, wait a while, then add each participant back into cluster.
> >> (zookeeper is on a separate cluster, not affected by shtudown) During
> >> the recovery, it seems controller generates too many messages, and
> >> there so many exceptions. Below are some examples.
> >>
> >> Are these exceptions expected? Any comments are highly appreciated.
> >> Thanks.
> >>
> >>
> >> [ERROR 2017-04-05 14:26:17,734
> >> org.apache.helix.manager.zk.ZkBaseDataAccessor:303] Exception while
> >> updating path: /yy_cluster_name/INSTANCES/P60505029461/ERRORS/
> >> 1002d87a25a0589/USER_DEFINE_MSG/15ae0bd8-10
> >> 1f-4af3-acc3-36a486af4f4c
> >> org.I0Itec.zkclient.exception.ZkInterruptedException:
> >> java.lang.InterruptedException
> >> at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.
> >> java:687)
> >> at org.apache.helix.manager.zk.ZkClient.readData(ZkClient.
> java:240)
> >> at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761)
> >> at org.apache.helix.manager.zk.ZkBaseDataAccessor.doUpdate(
> >> ZkBaseDataAccessor.java:273)
> >> at org.apache.helix.manager.zk.ZkBaseDataAccessor.update(
> >> ZkBaseDataAccessor.java:245)
> >> at org.apache.helix.manager.zk.ZKHelixDataAccessor.
> updateProperty(
> >> ZKHelixDataAccessor.java:150)
> >> at org.apache.helix.util.StatusUpdateUtil.publishErrorRecord(
> >> StatusUpdateUtil.java:501)
> >> at org.apache.helix.util.StatusUpdateUtil.
> >> publishStatusUpdateRecord(StatusUpdateUtil.java:435)
> >> at org.apache.helix.util.StatusUpdateUtil.
> >> logMessageStatusUpdateRecord(StatusUpdateUtil.java:334)
> >> at org.apache.helix.util.StatusUpdateUtil.logError(
> >> StatusUpdateUtil.java:342)
> >> at org.apache.helix.messaging.handling.HelixTask.call(
> >> HelixTask.java:163)
> >> at org.apache.helix.messaging.handling.HelixTask.call(
> >> HelixTask.java:42)
> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> >> ThreadPoolExecutor.java:1142)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> >> ThreadPoolExecutor.java:617)
> >> at java.lang.Thread.run(Thread.java:745)
> >> Caused by: java.lang.InterruptedException
> >> at java.lang.Object.wait(Native Method)
> >> at java.lang.Object.wait(Object.java:502)
> >> at org.apache.zookeeper.ClientCnxn.submitRequest(
> >> ClientCnxn.java:1344)
> >> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:925)
> >> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:956)
> >> at org.I0Itec.zkclient.ZkConnection.readData(
> ZkConnection.java:103)
> >> at org.apache.helix.manager.zk.ZkClient$4.call(ZkClient.java:
> 244)
> >> at org.apache.helix.manager.zk.ZkClient$4.call(ZkClient.java:
> 240)
> >> at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.
> >> java:675)
> >> ... 15 more
> >>
> >>
> >> [ERROR 2017-04-05 14:26:17,676
> >> org.apache.helix.messaging.handling.HelixTask:162] Exception after
> >> executing a message, msgId:
> >> 35e73c64-8fd3-4fb8-b0b8-419eacfa91a0org.I0Itec.zkclient.exception.
> >> ZkInterrupte
> >> dException: java.lang.InterruptedException
> >> org.I0Itec.zkclient.exception.ZkInterruptedException:
> >> java.lang.InterruptedException
> >> at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.
> >> java:687)
> >> at org.apache.helix.manager.zk.ZkClient.getChildren(ZkClient.
> >> java:212)
> >> at org.I0Itec.zkclient.ZkClient.deleteRecursive(ZkClient.java:
> 505)
> >> at org.apache.helix.manager.zk.ZkBaseDataAccessor.remove(
> >> ZkBaseDataAccessor.java:537)
> >> at org.apache.helix.manager.zk.ZKHelixDataAccessor.
> removeProperty(
> >> ZKHelixDataAccessor.java:271)
> >> at org.apache.helix.messaging.handling.HelixTask.
> >> removeMessageFromZk(HelixTask.java:187)
> >> at org.apache.helix.messaging.handling.HelixTask.call(
> >> HelixTask.java:150)
> >> at org.apache.helix.messaging.handling.HelixTask.call(
> >> HelixTask.java:42)
> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> >> ThreadPoolExecutor.java:1142)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> >> ThreadPoolExecutor.java:617)
> >> at java.lang.Thread.run(Thread.java:745)
> >> Caused by: java.lang.InterruptedException
> >> at java.lang.Object.wait(Native Method)
> >> at java.lang.Object.wait(Object.java:502)
> >> at org.apache.zookeeper.ClientCnxn.submitRequest(
> >> ClientCnxn.java:1344)
> >> at org.apache.zookeeper.ZooKeeper.getChildren(
> ZooKeeper.java:1247)
> >> at org.apache.zookeeper.ZooKeeper.getChildren(
> ZooKeeper.java:1277)
> >> at org.I0Itec.zkclient.ZkConnection.getChildren(
> >> ZkConnection.java:99)
> >> at org.apache.helix.manager.zk.ZkClient$3.call(ZkClient.java:
> 215)
> >> at org.apache.helix.manager.zk.ZkClient$3.call(ZkClient.java:
> 212)
> >> at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.
> >> java:675)
> >> ... 11 more
> >>
> >> [ERROR 2017-04-05 14:26:18,321
> >> org.apache.helix.messaging.handling.HelixTask:102] Exception while
> >> executing a message. java.lang.NullPointerException msgId:
> >> eb79d5b0-4210-479e-b010-076764199059 type: USER
> >> _DEFINE_MSG
> >> java.lang.NullPointerException
> >> at com.hcd.hcdadmin.CustomMessageHandlerFactory$
> >> CustomMessageHandler.handleMessage(CustomMessageHandlerFactory.
> java:149)
> >> at org.apache.helix.messaging.handling.HelixTask.call(
> >> HelixTask.java:85)
> >> at org.apache.helix.messaging.handling.HelixTask.call(
> >> HelixTask.java:42)
> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> >> ThreadPoolExecutor.java:1142)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> >> ThreadPoolExecutor.java:617)
> >> at java.lang.Thread.run(Thread.java:745)
> >>
> >> [3:06]
> >> [ERROR 2017-04-05 14:26:15,958
> >> org.apache.helix.messaging.handling.HelixTask:102] Exception while
> >> executing a message. java.lang.NullPointerException msgId:
> >> 5b05cd82-c596-4329-8484-28ac3ef40e80 type: USER
> >> _DEFINE_MSG
> >> java.lang.NullPointerException
> >> at com.hcd.hcdadmin.CustomMessageHandlerFactory$
> >> CustomMessageHandler.handleMessage(CustomMessageHandlerFactory.
> java:149)
> >> at org.apache.helix.messaging.handling.HelixTask.call(
> >> HelixTask.java:85)
> >> at org.apache.helix.messaging.handling.HelixTask.call(
> >> HelixTask.java:42)
> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> >> ThreadPoolExecutor.java:1142)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> >> ThreadPoolExecutor.java:617)
> >> at java.lang.Thread.run(Thread.java:745)
> >>
>