Thanks, Gour! Any idea when 1.0.0 will be available?

On Fri, Sep 30, 2016 at 7:23 AM, Gour Saha <gs...@hortonworks.com> wrote:

> I think you are hitting this -
> https://issues.apache.org/jira/browse/SLIDER-1169
>
>
> On 9/29/16, 10:21 PM, "Manoj Samel" <manojsamelt...@gmail.com> wrote:
>
> >Hi
> >
> >Slider version .80 on secure cluster.
> >
> >In my xxx-site.xml files, the registry quorum is configured as:
> >  <property>
> >    <name>hadoop.registry.zk.quorum</name>
> >    <value>zk1_host:2181,zk2_host:2181,zk3_host:2181</value>
> >  </property>
> >
> >However, it appears the Slider AM uses only the first ZK to connect for
> >the registry, and fails when the first ZK happens to be down.
> >
> >In the Slider AM log:
> >
> >2016-09-30 02:27:27,279 [main] INFO appmaster.SliderAppMaster - Loading slider-server.xml at file:/foo/yarn/local/usercache/xx/appcache/application_1474675565244_3660/container_e80_1474675565244_3660_01_000001/confdir/slider-server.xml
> >2016-09-30 02:27:27,285 [main] INFO appmaster.SliderAppMaster - AM configuration:
> >dfs.namenode.kerberos.principal=hdfs/_HOST@ABC
> >hadoop.registry.zk.quorum=zk1_host:2181
> >hadoop.registry.zk.root=/registry
> >
> >Note: the log shows only the first host, not the quorum string of 3
> >host:ports.
> >
> >Later in the log, it tries to connect to ZK1, but since ZK1 is down, the
> >connection fails. The AM fails to start any components as a result.
> >
> >
> >2016-09-29 23:32:49,768 [main] INFO appmaster.SliderAppMaster - Service YarnRegistry in state YarnRegistry: STARTED Connection="fixed ZK quorum "zk1_host:2181" " root="/registry" security disabled
> >2016-09-29 23:32:49,774 [main-SendThread(bds0211.svc.eng.pdx.wd:2181)] WARN zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
> >java.net.ConnectException: Connection refused
> >    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> >    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> >    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
> >    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> >
> >I would expect that if the connection to ZK1 failed, then ZK2, ZK3, etc.
> >would be tried; that's what the ZK quorum is for.
> >
> >Looking into the code, I see this last "Connection" string is coming
> >from org.apache.hadoop.registry.client.impl.zk.CuratorService.java
> >
> >In it, supplyBindingInformation() gets the string and prints it in the
> >log message.
> >
> >public BindingInformation supplyBindingInformation() {
> >    BindingInformation binding = new BindingInformation();
> >    String connectString = buildConnectionString();
> >    binding.ensembleProvider = new FixedEnsembleProvider(connectString);
> >    binding.description =
> >        "fixed ZK quorum \"" + connectString + "\"";
> >    return binding;
> >  }
> >
> >protected String buildConnectionString() {
> >    return getConfig().getTrimmed(KEY_REGISTRY_ZK_QUORUM,
> >        DEFAULT_REGISTRY_ZK_QUORUM);
> >  }
> >
> >The getConfig() is from org.apache.hadoop.conf.Configuration.java
> >
> >It's not clear why the value of hadoop.registry.zk.quorum supplied in
> >the config gets trimmed to the first host only. Is this the expected
> >behavior, or a bug?
> >
> >There is no way to guarantee that the first ZooKeeper in the quorum will
> >always be reachable. I would expect multiple nodes in the quorum to be
> >tried for connection.
> >
> >
> >Any thoughts would be appreciated ...
> >
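
For reference, a ZooKeeper connect string is a comma-separated list of host:port pairs, and Configuration.getTrimmed() only strips leading and trailing whitespace, so the call quoted above should not by itself shorten the value. The sketch below is a minimal, dependency-free way to compare the configured value with what the AM log reports. It is only an illustration: the class and method names (QuorumCheck, parseQuorum) are hypothetical and are not part of Slider or the Hadoop registry code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Slider or Hadoop code: counts the servers in a
// ZooKeeper-style connect string ("host:port,host:port,...").
public class QuorumCheck {

    /** Splits a comma-separated connect string into trimmed, non-empty entries. */
    static List<String> parseQuorum(String quorum) {
        List<String> servers = new ArrayList<>();
        if (quorum == null) {
            return servers;
        }
        for (String part : quorum.split(",")) {
            String server = part.trim();
            if (!server.isEmpty()) {
                servers.add(server);
            }
        }
        return servers;
    }

    public static void main(String[] args) {
        // Value from xxx-site.xml versus the value printed in the AM log.
        String configured = "zk1_host:2181,zk2_host:2181,zk3_host:2181";
        String logged = "zk1_host:2181";

        System.out.println("configured: " + parseQuorum(configured).size() + " server(s)");
        System.out.println("AM log:     " + parseQuorum(logged).size() + " server(s)");
    }
}
```

If the AM-side value really holds a single server, the client has nothing to fail over to, which matches the connection-refused loop in the log.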