Hi Lee,

Understood, and thanks for the heads up. We are currently in the middle of a
production deployment with 0.8.2, and most of our users have already been
notified of the schedule. Basically, we are happy with the stability and
functional correctness of 0.8.2, except for the above-mentioned case where we
pushed the cluster beyond its limits during stress testing. So we will go with
this version for this deployment. Once you have released the new version, we
will run functional and stress tests on it in our staging environment, and if
it looks good, we will patch the production environment.

Thanks
Dimuthu

On Fri, May 31, 2019 at 5:07 PM Hunter Lee <[email protected]> wrote:

> Hey Dimuthu -
>
> We are actually in the process of preparing a new release, and it will
> come with the previously mentioned bug fixes in Task Framework. It also
> contains various ZK-related fixes. I don't know what your deployment
> schedule is, but it might be worth waiting another week or so.
>
> Hunter
>
> On Fri, May 31, 2019 at 10:27 AM DImuthu Upeksha <
> [email protected]>
> wrote:
>
> > Now I'm seeing the following error in the controller log. Restarting the
> > controller fixed the issue. We see this in the controller from time to
> > time, along with zk connection issues. Is this also something to do with
> > the zk client version?
> >
> > 2019-05-31 13:21:46,669 [Thread-0-SendThread(localhost:2181)] WARN  o.apache.zookeeper.ClientCnxn  - Session 0x16b0ebbee1d000e for server localhost/127.0.0.1:2181, unexpected error, closing socket connection and attempting reconnect
> > java.io.IOException: Broken pipe
> >     at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> >     at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> >     at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> >     at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> >     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> >     at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:102)
> >     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:291)
> >     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1041)
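> >
> > For reference, a minimal sketch (class name and placement are my own, and
> > Version.getFullVersion() is assumed to be available in the ZooKeeper 3.4.x
> > client) of how to confirm which ZooKeeper client jar and version the
> > controller JVM is actually loading, given that Helix bundles its own
> > ZkClient dependency:
> >
> > import org.apache.zookeeper.Version;
> > import org.apache.zookeeper.ZooKeeper;
> >
> > public class ZkClientVersionCheck {
> >     public static void main(String[] args) {
> >         // Which jar the ZooKeeper client classes were actually loaded from
> >         System.out.println(
> >             ZooKeeper.class.getProtectionDomain().getCodeSource().getLocation());
> >         // Version string reported by the client library itself
> >         System.out.println(Version.getFullVersion());
> >     }
> > }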
> >
> > Thanks
> > Dimuthu
> >
> > On Fri, May 31, 2019 at 1:14 PM DImuthu Upeksha <
> > [email protected]>
> > wrote:
> >
> > > Hi Lei,
> > >
> > > We use 0.8.2. We initially had 0.8.4, but it contains an issue with the
> > > task retry logic, so we downgraded to 0.8.2. We are planning to go into
> > > production with 0.8.2 by next week, so could you please advise a better
> > > way to solve this without upgrading to 0.8.4?
> > >
> > > Thanks
> > > Dimuthu
> > >
> > > On Fri, May 31, 2019 at 1:04 PM Lei Xia <[email protected]> wrote:
> > >
> > >> Which Helix version do you use? This may be caused by this ZooKeeper bug
> > >> (https://issues.apache.org/jira/browse/ZOOKEEPER-706). We have upgraded
> > >> ZkClient in later Helix versions.
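> > >>
> > >> If upgrading is not an option right away, a possible stop-gap is raising
> > >> jute.maxbuffer: ZOOKEEPER-706 is about the SetWatches packet sent on
> > >> session re-establishment exceeding the default ~1 MB limit, after which
> > >> the connection is dropped and the client loops on reconnect. This is only
> > >> a sketch, not the Helix-recommended fix: the 4 MB value is an arbitrary
> > >> example, and the same -Djute.maxbuffer setting would also have to be
> > >> applied to every ZooKeeper server JVM.
> > >>
> > >> import org.apache.helix.HelixManager;
> > >> import org.apache.helix.HelixManagerFactory;
> > >> import org.apache.helix.InstanceType;
> > >>
> > >> public class ControllerBootstrap {
> > >>     public static void main(String[] args) throws Exception {
> > >>         // Raise the packet size limit before any ZK connection is created;
> > >>         // the ZooKeeper servers need a matching -Djute.maxbuffer=4194304.
> > >>         System.setProperty("jute.maxbuffer", "4194304"); // 4 MB, example only
> > >>
> > >>         // Cluster name, instance name, and ZK address taken from the logs above.
> > >>         HelixManager manager = HelixManagerFactory.getZKHelixManager(
> > >>             "AiravataDemoCluster", "helixcontroller2",
> > >>             InstanceType.CONTROLLER, "localhost:2181");
> > >>         manager.connect();
> > >>     }
> > >> }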
> > >>
> > >>
> > >> Lei
> > >>
> > >> On Fri, May 31, 2019 at 7:52 AM DImuthu Upeksha <
> > >> [email protected]> wrote:
> > >>
> > >>> Hi Folks,
> > >>>
> > >>> I'm getting the following error in the controller log, and it seems
> > >>> like the controller is not moving forward after that point:
> > >>>
> > >>> 2019-05-31 10:47:37,084 [main] INFO  o.a.a.h.i.c.HelixController  - Starting helix controller
> > >>> 2019-05-31 10:47:37,089 [main] INFO  o.a.a.c.u.ApplicationSettings  - Settings loaded from file:/home/airavata/staging-deployment/airavata-helix/apache-airavata-controller-0.18-SNAPSHOT/conf/airavata-server.properties
> > >>> 2019-05-31 10:47:37,091 [Thread-0] INFO  o.a.a.h.i.c.HelixController  - Connection to helix cluster : AiravataDemoCluster with name : helixcontroller2
> > >>> 2019-05-31 10:47:37,092 [Thread-0] INFO  o.a.a.h.i.c.HelixController  - Zookeeper connection string localhost:2181
> > >>> 2019-05-31 10:47:42,907 [GenericHelixController-event_process] ERROR o.a.h.c.GenericHelixController  - Exception while executing DEFAULTpipeline: org.apache.helix.controller.pipeline.Pipeline@408d6d26for cluster .AiravataDemoCluster. Will not continue to next pipeline
> > >>> org.apache.helix.api.exceptions.HelixMetaDataAccessException: Failed to get full list of /AiravataDemoCluster/CONFIGS/PARTICIPANT
> > >>>     at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildren(ZkBaseDataAccessor.java:446)
> > >>>     at org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValues(ZKHelixDataAccessor.java:406)
> > >>>     at org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValuesMap(ZKHelixDataAccessor.java:467)
> > >>>     at org.apache.helix.controller.stages.ClusterDataCache.refresh(ClusterDataCache.java:176)
> > >>>     at org.apache.helix.controller.stages.ReadClusterDataStage.process(ReadClusterDataStage.java:62)
> > >>>     at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:63)
> > >>>     at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:432)
> > >>>     at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:928)
> > >>> Caused by: org.apache.helix.api.exceptions.HelixMetaDataAccessException: Fail to read nodes for [/AiravataDemoCluster/CONFIGS/PARTICIPANT/helixparticipant]
> > >>>     at org.apache.helix.manager.zk.ZkBaseDataAccessor.get(ZkBaseDataAccessor.java:414)
> > >>>     at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildren(ZkBaseDataAccessor.java:479)
> > >>>     at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildren(ZkBaseDataAccessor.java:442)
> > >>>     ... 7 common frames omitted
> > >>>
> > >>> In the ZooKeeper log I can see the following warning being printed
> > >>> continuously. What could be the reason for that? I'm using Helix 0.8.2
> > >>> and ZooKeeper 3.4.8.
> > >>>
> > >>> 2019-05-31 10:49:37,621 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1008] - Closed socket connection for client /0:0:0:0:0:0:0:1:59056 which had sessionid 0x16b0e59877f0000
> > >>> 2019-05-31 10:49:37,773 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:57984
> > >>> 2019-05-31 10:49:37,774 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@893] - Client attempting to renew session 0x16b0e59877f0000 at /127.0.0.1:57984
> > >>> 2019-05-31 10:49:37,774 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@645] - Established session 0x16b0e59877f0000 with negotiated timeout 30000 for client /127.0.0.1:57984
> > >>> 2019-05-31 10:49:37,790 [myid:] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
> > >>> EndOfStreamException: Unable to read additional data from client sessionid 0x16b0e59877f0000, likely client has closed socket
> > >>>     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:230)
> > >>>     at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
> > >>>     at java.lang.Thread.run(Thread.java:748)
> > >>>
> > >>> Thanks
> > >>> Dimuthu
> > >>>
> > >>
> > >>
> > >> --
> > >> Lei Xia
> > >>
> > >
> >
>
