Hi Shekar,

I do not have much experience setting up HA, so if I were you I would
check: 1) when you take the active RM down, does the backup RM take over
successfully? 2) if the backup RM is running, can you see the Samza
application in the YARN UI (e.g., localhost:8088)? 3) if you cannot see it,
what does Samza's log say?
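For step 1, one way to confirm whether failover actually happened is to ask each RM for its HA state with `yarn rmadmin`. A minimal sketch, assuming the rm-ids `rm1` and `rm2` from your yarn-site.xml (these commands need to run on a host with the Hadoop client configured against your cluster):

```shell
# Query each ResourceManager's HA state; expect one "active" and one
# "standby". After stopping the active RM, the other should report
# "active" once failover completes.
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2

# If automatic failover is not enabled, the standby can be promoted
# manually (this is from the RM HA doc you linked below):
# yarn rmadmin -transitionToActive rm2
```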

Thanks,

Fang, Yan
yanfang...@gmail.com

On Thu, May 14, 2015 at 3:31 PM, Shekar Tippur <ctip...@gmail.com> wrote:

> Yan,
> I have followed the doc. Here is what was done ...
> 1. Setup the yarn-site.xml
>
> <configuration>
>   <property>
>     <name>yarn.resourcemanager.ha.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.cluster-id</name>
>     <value>cluster1</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.ha.rm-ids</name>
>     <value>rm1,rm2</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.hostname.rm1</name>
>     <value>sprdargas402.</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.hostname.rm2</name>
>     <value>sprdargas403.</value>
>   </property>
>   <property>
>     <description>Enable RM to recover state after starting. If true, then
>     yarn.resourcemanager.store.class must be specified</description>
>     <name>yarn.resourcemanager.recovery.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <description>The class to use as the persistent store.</description>
>     <name>yarn.resourcemanager.store.class</name>
>     <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.zk-state-store.address</name>
>     <value>sprdargas402.:2181</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.zk-address</name>
>     <value>sprdargas402.:2181,sprdargas403.:2181,sprdargas404.:2181</value>
>   </property>
>   <property>
>     <description>CLASSPATH for YARN applications. A comma-separated list of
>     CLASSPATH entries</description>
>     <name>yarn.application.classpath</name>
>     <value>/app/hadoop/hadoop-2.5.2/conf,/app/hadoop/hadoop-2.5.2/share/hadoop/common/*,/app/hadoop/hadoop-2.5.2/share/hadoop/common/lib/*,/app/hadoop/hadoop-2.5.2/share/hadoop/hdfs/*,/app/hadoop/hadoop-2.5.2/share/hadoop/hdfs/lib/*,/app/hadoop/hadoop-2.5.2/share/hadoop/mapreduce/*,/app/hadoop/hadoop-2.5.2/share/hadoop/mapreduce/lib/*,/app/hadoop/hadoop-2.5.2/share/hadoop/yarn/*,/app/hadoop/hadoop-2.5.2/share/hadoop/yarn/lib/*</value>
>   </property>
> </configuration>
>
>
> 2. scp'd the config to the slave resource manager node
>
> 3. restart yarn on node 1.
>
> I am not sure if I missed anything.
>
> - Shekar
>
> On Thu, May 14, 2015 at 3:06 PM, Yan Fang <yanfang...@gmail.com> wrote:
>
> > Is HA set up correctly? The log looks like the problem is on the YARN
> > configuration side.
> >
> > Fang, Yan
> > yanfang...@gmail.com
> >
> > On Thu, May 14, 2015 at 12:29 PM, Shekar Tippur <ctip...@gmail.com>
> wrote:
> >
> > > One other observation I forgot to mention: if I kill the RM and NM
> > > processes, the Samza job seems to run properly. Only when the 01 server
> > > is rebooted do I encounter this error, and as a result no jobs get
> > > processed.
> > >
> > > - Shekar
> > >
> > > On Thu, May 14, 2015 at 12:14 PM, Shekar Tippur <ctip...@gmail.com>
> > wrote:
> > >
> > > > Hello,
> > > >
> > > > I have set up redundancy on the resource manager based on this doc:
> > > >
> > > > https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html
> > > >
> > > > I then shut down server 1 and was expecting that the 02 server would
> > > > take over.
> > > >
> > > > Instead I see this error. I am not sure if I am missing something.
> > > >
> > > > 2015-05-14 11:55:01,820 INFO  [Node Status Updater] retry.RetryInvocationHandler (RetryInvocationHandler.java:invoke(140)) - Exception while invoking nodeHeartbeat of class ResourceTrackerPBClientImpl over rm2 after 19 fail over attempts. Trying to fail over after sleeping for 24180ms.
> > > >
> > > > java.net.ConnectException: Call From sprdargas403t/10.180.195.33 to sprdargas403:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> > > >     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > >     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > > >     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > > >     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > > >     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
> > > >     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
> > > >     at org.apache.hadoop.ipc.Client.call(Client.java:1415)
> > > >     at org.apache.hadoop.ipc.Client.call(Client.java:1364)
> > > >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> > > >     at com.sun.proxy.$Proxy27.nodeHeartbeat(Unknown Source)
> > > >     at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> > > >     at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
> > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > >     at java.lang.reflect.Method.invoke(Method.java:606)
> > > >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > > >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > >     at com.sun.proxy.$Proxy28.nodeHeartbeat(Unknown Source)
> > > >     at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:512)
> > > >     at java.lang.Thread.run(Thread.java:745)
> > > > Caused by: java.net.ConnectException: Connection refused
> > > >     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> > > >     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> > > >     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> > > >     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
> > > >     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
> > > >     at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
> > > >     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
> > > >     at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
> > > >     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
> > > >     at org.apache.hadoop.ipc.Client.call(Client.java:1382)
> > > >     ... 12 more
> > > >
> > > > 2015-05-14 11:55:01,965 INFO  [Container Monitor] monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 21428 for container-id container_1431628855028_0001_01_000001: 369.7 MB of 1 GB physical memory used; 1.4 GB of 2.1 GB virtual memory used
> > >
> >
>
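A side note on the "Connection refused" to sprdargas403:8031 in the log above: 8031 is the RM's resource-tracker port, the one NodeManagers heartbeat to, so a quick first check from the NM host is whether anything is listening there on either RM host. A minimal sketch using bash's `/dev/tcp` (hostnames taken from the thread; adjust to your cluster):

```shell
# Probe the resource-tracker port (8031 by default) on each RM host.
# "refused or timed out" from both hosts matches the exception in the
# log above and means no RM is reachable from this NodeManager.
for host in sprdargas402 sprdargas403; do
  if timeout 2 bash -c "exec 3<>/dev/tcp/$host/8031" 2>/dev/null; then
    echo "$host:8031 reachable"
  else
    echo "$host:8031 refused or timed out"
  fi
done
```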
