Hi Gour,

Can you please contact me from your own email address? I will then send you the logs along with my analysis. I don't want to post logs to the public list.

Thanks,

On Mon, Jul 25, 2016 at 5:39 PM, Gour Saha <gs...@hortonworks.com> wrote:

> Ok, so this node is not a gateway. It is part of the cluster, which means
> you don't need slider-client.xml at all. Just have HADOOP_CONF_DIR
> pointing to /etc/hadoop/conf in slider-env.sh and that should be it.
>
> So the above simplifies your config setup. It will not solve either of the
> 2 problems you are facing.
>
> Now coming to the 2 issues you are facing, you have to provide additional
> logs for us to understand better. Let's start with -
> 1. RM logs (specifically between the time when rm1->rm2 failover is
>    simulated)
> 2. Slider App logs
>
> -Gour
>
> On 7/25/16, 5:16 PM, "Manoj Samel" <manojsamelt...@gmail.com> wrote:
>
> >   1. Not clear about your question on "gateway" node. The node running
> >   slider is part of the hadoop cluster and there are other services like
> >   Oozie that run on this node that utilize hdfs and yarn. So if your
> >   question is whether the node is otherwise working for HDFS and Yarn
> >   configuration, it is working
> >   2. I copied all files from HADOOP_CONF_DIR (say /etc/hadoop/conf) to the
> >   directory containing slider-client.xml (say /data/latest/conf)
> >   3. In my earlier email, I had made a mistake: in slider-env.sh,
> >   HADOOP_CONF_DIR was pointing to the original directory /etc/hadoop/conf.
> >   I edited it to point to the same directory that contains
> >   slider-client.xml & slider-env.sh, i.e. /data/latest/conf
> >   4. I emptied slider-client.xml. It just had
> >   <configuration></configuration>. The creation of apps worked but the
> >   Slider AM still shows the same issue, i.e. when RM1 goes from active to
> >   standby, slider AM goes from RUNNING to ACCEPTED state with same error
> >   about TOKEN. Also NOTE that when slider-client.xml is empty, the
> >   "slider destroy xxx" command still fails with Zookeeper connection
> >   errors.
> >   5. I then added same parameters (as my last email - except
> >   HADOOP_CONF_DIR) to slider-client.xml and ran. This time slider-env.sh
> >   has HADOOP_CONF_DIR pointing to /data/latest/conf and slider-client.xml
> >   does not have HADOOP_CONF_DIR. The same issue exists (but "slider
> >   destroy" does not fail)
> >   6. Could you explain what you expect to pick up from the Hadoop
> >   configuration that will help with the RM token? If slider has a token
> >   from RM1, and it switches to RM2, it is not clear what slider does to
> >   get a delegation token for RM2 communication.
> >   7. It is worth repeating that the issue happens only when RM1 was
> >   active when the slider app was created and then RM1 becomes standby. If
> >   RM2 was active when the slider app was created, then the slider AM keeps
> >   running for any number of switches between RM2 and RM1 back and forth ...
> >
> > On Mon, Jul 25, 2016 at 4:21 PM, Gour Saha <gs...@hortonworks.com> wrote:
> >
> >> The node you are running slider from, is that a gateway node? Sorry for
> >> not being explicit. I meant copy everything under /etc/hadoop/conf from
> >> your cluster into some temp directory (say /tmp/hadoop_conf) in your
> >> gateway node or local or whichever node you are running slider from. Then
> >> set HADOOP_CONF_DIR to /tmp/hadoop_conf and clear everything out from
> >> slider-client.xml.
> >>
> >> On 7/25/16, 4:12 PM, "Manoj Samel" <manojsamelt...@gmail.com> wrote:
> >>
> >> >Hi Gour,
> >> >
> >> >Thanks for your prompt reply.
> >> >
> >> >FYI, the issue happens when I create the slider app while rm1 is active
> >> >and rm1 then fails over to rm2. As soon as rm2 becomes active, the slider
> >> >AM goes from RUNNING to ACCEPTED state with the above error.
> >> >
> >> >For your suggestion, I did the following
> >> >
> >> >1) Copied core-site, hdfs-site, yarn-site, and mapred-site from
> >> >HADOOP_CONF_DIR to slider conf directory.
> >> >2) Our slider-env.sh already had HADOOP_CONF_DIR set
> >> >3) I removed all properties from slider-client.xml EXCEPT following
> >> >
> >> >   - HADOOP_CONF_DIR
> >> >   - slider.yarn.queue
> >> >   - slider.zookeeper.quorum
> >> >   - hadoop.registry.zk.quorum
> >> >   - hadoop.registry.zk.root
> >> >   - hadoop.security.authorization
> >> >   - hadoop.security.authentication
> >> >
> >> >Then I made rm1 active, installed and created slider app and restarted
> >> >rm1 (to make rm2 active). The slider-am again went from RUNNING to
> >> >ACCEPTED state.
> >> >
> >> >Let me know if you want me to try further changes.
> >> >
> >> >If I make the slider-client.xml completely empty per your suggestion,
> >> >only slider AM comes up but it fails to start components. The AM log
> >> >shows errors trying to connect to zookeeper like below.
> >> >2016-07-25 23:07:41,532
> >> >[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
> >> >zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error,
> >> >closing socket connection and attempting reconnect
> >> >java.net.ConnectException: Connection refused
> >> >
> >> >Hence I kept minimal info in slider-client.xml
> >> >
> >> >FYI This is slider version 0.80
> >> >
> >> >Thanks,
> >> >
> >> >Manoj
> >> >
> >> >On Mon, Jul 25, 2016 at 2:54 PM, Gour Saha <gs...@hortonworks.com> wrote:
> >> >
> >> >> If possible, can you copy the entire content of the directory
> >> >> /etc/hadoop/conf and then set HADOOP_CONF_DIR in slider-env.sh to it.
> >> >> Keep slider-client.xml empty.
> >> >>
> >> >> Now when you do the same rm1->rm2 and then the reverse failovers, do
> >> >> you see the same behaviors?
> >> >>
> >> >> -Gour
> >> >>
> >> >> On 7/25/16, 2:28 PM, "Manoj Samel" <manojsamelt...@gmail.com> wrote:
> >> >>
> >> >> >Another observation (whatever it is worth)
> >> >> >
> >> >> >If slider app is created and started when rm2 was active, then it
> >> >> >seems to survive switches between rm2 and rm1 (and back). I.e
> >> >> >
> >> >> >* rm2 is active
> >> >> >* create and start slider application
> >> >> >* fail over to rm1. Now the Slider AM keeps running
> >> >> >* fail over to rm2 again. Slider AM still keeps running
> >> >> >
> >> >> >So, it seems if it starts with rm1 active, then the AM goes to
> >> >> >"ACCEPTED" state when RM fails to rm2. If it starts with rm2 active,
> >> >> >then it runs fine with any switches between rm1 and rm2.
> >> >> >
> >> >> >Any feedback ?
> >> >> >
> >> >> >Thanks,
> >> >> >
> >> >> >Manoj
> >> >> >
> >> >> >On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel
> >> >> ><manojsamelt...@gmail.com> wrote:
> >> >> >
> >> >> >> Setup
> >> >> >>
> >> >> >> - Hadoop 2.6 with RM HA, Kerberos enabled
> >> >> >> - Slider 0.80
> >> >> >> - In my slider-client.xml, I have added all RM HA properties,
> >> >> >> including the ones mentioned in
> >> >> >> http://markmail.org/message/wnhpp2zn6ixo65e3.
> >> >> >>
> >> >> >> Following is the issue
> >> >> >>
> >> >> >> * rm1 is active, rm2 is standby
> >> >> >> * deploy and start slider application, it runs fine
> >> >> >> * restart rm1, rm2 is now active.
> >> >> >> * The slider-am now goes from running into "ACCEPTED" mode. It stays
> >> >> >> there till rm1 is made active again.
> >> >> >>
> >> >> >> In the slider-am log, it tries to connect to RM2 and connection
> >> >> >> fails due to org.apache.hadoop.security.AccessControlException:
> >> >> >> Client cannot authenticate via:[TOKEN]. See detailed log below
> >> >> >>
> >> >> >> It seems it has some token (delegation token?) for RM1 but tries to
> >> >> >> use same(?) for RM2 and fails. Am I missing some configuration ???
> >> >> >>
> >> >> >> Thanks,
> >> >> >>
> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] INFO
> >> >> >> client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
> >> >> >> security.UserGroupInformation - PriviledgedActionException as:abc@XYZ
> >> >> >> (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException:
> >> >> >> Client cannot authenticate via:[TOKEN]
> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN ipc.Client -
> >> >> >> Exception encountered while connecting to the server :
> >> >> >> org.apache.hadoop.security.AccessControlException: Client cannot
> >> >> >> authenticate via:[TOKEN]
> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
> >> >> >> security.UserGroupInformation - PriviledgedActionException as:abc@XYZ
> >> >> >> (auth:KERBEROS) cause:java.io.IOException:
> >> >> >> org.apache.hadoop.security.AccessControlException: Client cannot
> >> >> >> authenticate via:[TOKEN]
> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO
> >> >> >> retry.RetryInvocationHandler - Exception while invoking allocate of
> >> >> >> class ApplicationMasterProtocolPBClientImpl over rm2 after 287 fail
> >> >> >> over attempts. Trying to fail over immediately.
> >> >> >> java.io.IOException: Failed on local exception: java.io.IOException:
> >> >> >> org.apache.hadoop.security.AccessControlException: Client cannot
> >> >> >> authenticate via:[TOKEN]; Host Details : local host is: "<SliderAM
> >> >> >> HOST>/<slider AM Host IP>"; destination host is: "<RM2 HOST>":23130;
> >> >> >>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1476)
> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> >> >> >>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> >> >> >>         at com.sun.proxy.$Proxy23.allocate(Unknown Source)
> >> >> >>         at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> >> >> >>         at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
> >> >> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> >> >>         at java.lang.reflect.Method.invoke(Method.java:497)
> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
> >> >> >>         at com.sun.proxy.$Proxy24.allocate(Unknown Source)
> >> >> >>         at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
> >> >> >>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> >> >> >> Caused by: java.io.IOException:
> >> >> >> org.apache.hadoop.security.AccessControlException: Client cannot
> >> >> >> authenticate via:[TOKEN]
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:645)
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733)
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
> >> >> >>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1442)
> >> >> >>         ... 12 more
> >> >> >> Caused by: org.apache.hadoop.security.AccessControlException: Client
> >> >> >> cannot authenticate via:[TOKEN]
> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172)
> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555)
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
> >> >> >>         ... 15 more
> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO
> >> >> >> client.ConfiguredRMFailoverProxyProvider - Failing over to rm1
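
For anyone following the thread: the suggestions above depend on HADOOP_CONF_DIR (rather than slider-client.xml) supplying the RM HA settings. Below is a minimal sketch, not taken from the thread and not a fix for the TOKEN error, of a check that a yarn-site.xml enables HA and gives an address for every RM id. The property names are the standard YARN HA keys; the function names, the sample XML, and the host name rm1.example.com are hypothetical and only for illustration.

```python
import xml.etree.ElementTree as ET


def load_hadoop_conf(xml_text):
    """Parse a Hadoop-style <configuration> document into a name -> value dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def rm_ha_summary(props):
    """Summarize the RM HA settings visible in the parsed configuration."""
    enabled = props.get("yarn.resourcemanager.ha.enabled", "false").lower() == "true"
    raw_ids = props.get("yarn.resourcemanager.ha.rm-ids", "")
    rm_ids = [rm_id.strip() for rm_id in raw_ids.split(",") if rm_id.strip()]
    # An RM id counts as addressable if it has either a hostname or an
    # explicit scheduler address (the endpoint the AM heartbeats to).
    missing = [
        rm_id
        for rm_id in rm_ids
        if f"yarn.resourcemanager.hostname.{rm_id}" not in props
        and f"yarn.resourcemanager.scheduler.address.{rm_id}" not in props
    ]
    return {"ha_enabled": enabled, "rm_ids": rm_ids, "missing_address": missing}


# Hypothetical sample: rm2 is declared but deliberately has no address.
SAMPLE = """<configuration>
  <property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
  <property><name>yarn.resourcemanager.hostname.rm1</name><value>rm1.example.com</value></property>
</configuration>"""

if __name__ == "__main__":
    print(rm_ha_summary(load_hadoop_conf(SAMPLE)))
```

In practice one would read the text of yarn-site.xml from the directory HADOOP_CONF_DIR points to (for example /data/latest/conf in the thread) and pass it to load_hadoop_conf. An empty <configuration></configuration>, like the emptied slider-client.xml, yields no properties at all, so HA is reported as disabled with no RM ids.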