Re: ConnectionException in container, happens only sometimes

2013-07-11 Thread Andrei
Here are the logs of the RM and the two NMs:

RM (master-host): http://pastebin.com/q4qJP8Ld
NM where AM ran (slave-1-host): http://pastebin.com/vSsz7mjG
NM where slave container ran (slave-2-host): http://pastebin.com/NMFi6gRp

The only related error I've found in them is the following (from RM logs):

...
2013-07-11 07:46:06,225 ERROR
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
AppAttemptId doesnt exist in cache appattempt_1373465780870_0005_01
2013-07-11 07:46:06,227 WARN org.apache.hadoop.ipc.Server: IPC Server
Responder, call org.apache.hadoop.yarn.api.AMRMProtocolPB.allocate from
10.128.40.184:47101: output error
2013-07-11 07:46:06,228 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 8030 caught an exception
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:456)
at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2140)
at org.apache.hadoop.ipc.Server.access$2000(Server.java:108)
at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:939)
at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1005)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1747)
2013-07-11 07:46:11,238 INFO org.apache.hadoop.yarn.util.RackResolver:
Resolved my_user to /default-rack
2013-07-11 07:46:11,283 INFO
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService:
NodeManager from node my_user(cmPort: 59267 httpPort: 8042) registered with
capability: 8192, assigned nodeId my_user:59267
...

Though it's hard to tell from the stack trace where this error came from.
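For what it's worth, the ClosedChannelException above is thrown while the RM's IPC
responder tries to write the allocate() reply to a channel that has already been
closed (note the "AppAttemptId doesnt exist in cache" line right before it), so it
looks more like a symptom of the AM attempt going away than like a root cause. A
minimal illustration of that failure mode, not Hadoop code, just plain NIO (the
class name below is made up):

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class ClosedChannelDemo {
    public static void main(String[] args) throws Exception {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));   // any free port
        SocketChannel client = SocketChannel.open(server.getLocalAddress());
        SocketChannel accepted = server.accept();

        accepted.close();                                      // channel closed before the reply is written
        try {
            accepted.write(ByteBuffer.wrap(new byte[]{1}));    // attempting to respond on the closed channel
        } catch (ClosedChannelException e) {
            System.out.println("write after close: " + e);     // same exception the RM's Responder logs
        }
        client.close();
        server.close();
    }
}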

Let me know if you need any more information.

On Thu, Jul 11, 2013 at 1:00 AM, Andrei faithlessfri...@gmail.com wrote:

 Hi Omkar,

 I'm out of the office now, so I'll post them as soon as I get back there.

 Thanks


 On Thu, Jul 11, 2013 at 12:39 AM, Omkar Joshi ojo...@hortonworks.com wrote:

 Can you post the RM/NM logs too?

 Thanks,
 Omkar Joshi
 *Hortonworks Inc.* http://www.hortonworks.com




RE: ConnectionException in container, happens only sometimes

2013-07-10 Thread Devaraj k
1. I assume this is the task (container) that tries to establish the connection, 
but what does it want to connect to?
It is trying to connect to the MRAppMaster in order to execute the actual task.

1. I assume this is the task (container) that tries to establish the connection, 
but what does it want to connect to?
It seems the Container is not getting the correct MRAppMaster address for some 
reason, or the AM is crashing before handing the task to the Container. It is 
probably caused by an invalid host mapping. Can you check that the host mapping is 
correct on both machines, and also check the AM log from that time for any clue?
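For instance, an /etc/hosts along these lines on one of the slave machines (the
names and addresses below are purely illustrative, not taken from your cluster)
is the kind of invalid mapping meant here: the node's own hostname resolves to a
loopback address instead of its real interface, which would be consistent with
the slave-2-host/127.0.0.1 addresses in your log:

127.0.0.1      localhost
127.0.0.1      slave-2-host slave-2-host.company.com    # node's own name mapped to loopback
10.128.40.x    slave-1-host slave-1-host.company.com
10.128.40.y    master-host master-host.company.com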

Thanks
Devaraj k

From: Andrei [mailto:faithlessfri...@gmail.com]
Sent: 10 July 2013 17:32
To: user@hadoop.apache.org
Subject: ConnectionException in container, happens only sometimes

Hi,

I'm running a CDH4.3 installation of Hadoop with the following simple setup:

master-host: runs the NameNode, ResourceManager and JobHistoryServer
slave-1-host and slave-2-host: run the DataNodes and NodeManagers.

When I run a simple MapReduce job (either via the streaming API or the Pi example 
from the distribution) from the client, I see that some tasks fail:

13/07/10 14:40:10 INFO mapreduce.Job:  map 60% reduce 0%
13/07/10 14:40:14 INFO mapreduce.Job: Task Id : 
attempt_1373454026937_0005_m_03_0, Status : FAILED
13/07/10 14:40:14 INFO mapreduce.Job: Task Id : 
attempt_1373454026937_0005_m_05_0, Status : FAILED
...
13/07/10 14:40:23 INFO mapreduce.Job:  map 60% reduce 20%
...
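For reference, the Pi example above is the stock one from the MapReduce examples
jar, launched with something along these lines (the jar path is typical for a
packaged CDH install and may differ on your system):

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100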

Every time a different set of tasks/attempts fails. In some cases the number of 
failed attempts becomes critical and the whole job fails; in other cases the job 
finishes successfully. I can't see any pattern, but I noticed the following.

Let's say the ApplicationMaster runs on _slave-1-host_. In this case, on 
_slave-2-host_ there will be a corresponding syslog with the following contents:

...
2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client: Retrying 
connect to server: slave-2-host/127.0.0.1:11812. 
Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client: Retrying 
connect to server: slave-2-host/127.0.0.1:11812. 
Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
...
2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client: Retrying 
connect to server: slave-2-host/127.0.0.1:11812. 
Already tried 9 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild: 
Exception running child : java.net.ConnectException: Call From 
slave-2-host/127.0.0.1 to slave-2-host:11812 failed on 
connection exception: java.net.ConnectException: Connection refused; For more 
details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)
at org.apache.hadoop.ipc.Client.call(Client.java:1229)
at 
org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:225)
at com.sun.proxy.$Proxy6.getTask(Unknown Source)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:131)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:528)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:492)
at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:499)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:593)
at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:241)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1278)
at org.apache.hadoop.ipc.Client.call(Client.java:1196)
... 3 more
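
As a side note, the retry policy named in those Client log lines just means the
connect is re-attempted a fixed number of times with a fixed pause before the
final ConnectException is thrown. Roughly, and only as a sketch (this is not
Hadoop's actual RetryPolicy code, and the class and method names below are made up):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class FixedSleepRetrySketch {
    // Try to connect up to maxRetries times, sleeping sleepMillis between
    // attempts; if every attempt fails, rethrow the last exception, which is
    // what YarnChild ends up logging above.
    static void connectWithRetries(String host, int port,
                                   int maxRetries, long sleepMillis) throws Exception {
        IOException last = null;
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(host, port), 20000);
                return;                        // connected, nothing to retry
            } catch (IOException e) {
                last = e;                      // "Already tried N time(s)"
                Thread.sleep(sleepMillis);     // sleepTime=1 SECONDS
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // 127.0.0.1:11812 mirrors the failing address from the log; with nothing
        // listening there this fails after roughly maxRetries seconds.
        connectWithRetries("127.0.0.1", 11812, 10, 1000);
    }
}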


Notice several things:

1. This exception always happens on a different host than the one the 
ApplicationMaster runs on.
2. It always tries to connect to localhost, not to another host in the cluster.
3. The port number (11812 in this case) is different every time.

My questions are:

1. I assume this is the task (container) that tries to establish the connection, 
but what does it want to connect to?

Re: ConnectionException in container, happens only sometimes

2013-07-10 Thread Andrei
Hi Devaraj,

thanks for your answer. Yes, I suspected it could be a host mapping issue, so I
have already checked (and have just re-checked) the settings in /etc/hosts on each
machine, and they all look OK. I use both fully qualified names (e.g.
`master-host.company.com`) and their short forms (e.g. `master-host`), so it
shouldn't depend on the notation either.
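For what it's worth, a quick resolution check like the following on each node
(plain Java, nothing Hadoop-specific; the host names are just the ones from the
setup described earlier) shows what the names actually resolve to. If a node's
own hostname comes back as 127.0.0.1 or 127.0.1.1, that would explain containers
being handed an address that points back at localhost:

import java.net.InetAddress;

public class ResolveCheck {
    public static void main(String[] args) throws Exception {
        // What does this machine think its own name and address are?
        InetAddress self = InetAddress.getLocalHost();
        System.out.println("hostname       : " + self.getHostName());
        System.out.println("canonical name : " + self.getCanonicalHostName());
        System.out.println("resolves to    : " + self.getHostAddress());

        // And how do the cluster names resolve from here?
        for (String name : new String[]{"master-host", "slave-1-host", "slave-2-host"}) {
            System.out.println(name + " -> " + InetAddress.getByName(name).getHostAddress());
        }
    }
}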

I have also checked the AM syslog. There's nothing network-related in it, but there
are several messages like the following:

ERROR [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container
complete event for unknown container id
container_1373460572360_0001_01_88


I understand that the container just doesn't get registered in the AM (probably
because of the same issue), is that correct? So I wonder: who sends the container
complete event to the ApplicationMaster?





On Wed, Jul 10, 2013 at 3:19 PM, Devaraj k devara...@huawei.com wrote:

  1. I assume this is the task (container) that tries to establish
 connection, but what it wants to connect to? 

 It is trying to connect to MRAppMaster for executing the actual task.


 1. I assume this is the task (container) that tries to establish
 connection, but what it wants to connect to? 

 It seems Container is not getting the correct MRAppMaster address due to
 some reason or AM is crashing before giving the task to Container. Probably
 it is coming due to invalid host mapping.  Can you check the host mapping
 is proper in both the machines and also check the AM log that time for any
 clue. 


 Thanks

 Devaraj k


 From: Andrei [mailto:faithlessfri...@gmail.com]
 Sent: 10 July 2013 17:32
 To: user@hadoop.apache.org
 Subject: ConnectionException in container, happens only sometimes


 Hi, 


 I'm running CDH4.3 installation of Hadoop with the following simple setup:
 


 master-host: runs NameNode, ResourceManager and JobHistoryServer

 slave-1-host and slave-2-hosts: DataNodes and NodeManagers. 


 When I run simple MapReduce job (both - using streaming API or Pi example
 from distribution) on client I see that some tasks fail: 


 13/07/10 14:40:10 INFO mapreduce.Job:  map 60% reduce 0%

 13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
 attempt_1373454026937_0005_m_03_0, Status : FAILED

 13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
 attempt_1373454026937_0005_m_05_0, Status : FAILED

 ...

 13/07/10 14:40:23 INFO mapreduce.Job:  map 60% reduce 20%

 ...


 Every time different set of tasks/attempts fails. In some cases number of
 failed attempts becomes critical, and the whole job fails, in other cases
 job is finished successfully. I can't see any dependency, but I noticed the
 following. 


 Let's say, ApplicationMaster runs on _slave-1-host_. In this case on
 _slave-2-host_ there will be corresponding syslog with the following
 contents: 


 ... 

 2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client: Retrying
 connect to server: slave-2-host/127.0.0.1:11812. Already tried 0 time(s);
 retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
 sleepTime=1 SECONDS)

 2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client: Retrying
 connect to server: slave-2-host/127.0.0.1:11812. Already tried 1 time(s);
 retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
 sleepTime=1 SECONDS)

 ...

 2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client: Retrying
 connect to server: slave-2-host/127.0.0.1:11812. Already tried 9 time(s);
 retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
 sleepTime=1 SECONDS)

 2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild:
 Exception running child : java.net.ConnectException: Call From slave-2-host/
 127.0.0.1 to slave-2-host:11812 failed on connection exception:
 java.net.ConnectException: Connection refused; For more details see:
 http://wiki.apache.org/hadoop/ConnectionRefused

 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)

 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 

 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 

 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 

 at
 org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)

 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)
 

 at org.apache.hadoop.ipc.Client.call(Client.java:1229)

 at
 org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:225)
 

 at com.sun.proxy.$Proxy6.getTask(Unknown Source)

 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:131)

 Caused by: java.net.ConnectException: 

Re: ConnectionException in container, happens only sometimes

2013-07-10 Thread Andrei
If it helps, the full log of the AM can be found here: http://pastebin.com/zXTabyvv.


On Wed, Jul 10, 2013 at 4:21 PM, Andrei faithlessfri...@gmail.com wrote:

 Hi Devaraj,

 thanks for your answer. Yes, I suspected it could be because of host
 mapping, so I have already checked (and have just re-checked) settings in
 /etc/hosts of each machine, and they all are ok. I use both fully-qualified
 names (e.g. `master-host.company.com`) and their shortcuts (e.g.
 `master-host`), so it shouldn't depend on notation too.

 I have also checked AM syslog. There's nothing about network, but there
 are several messages like the following:

 ERROR [RMCommunicator Allocator] 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container 
 complete event for unknown container id container_1373460572360_0001_01_88


 I understand container just doesn't get registered in AM (probably because
 of the same issue), is it correct? So I wonder who sends container
 complete event to ApplicationMaster?





 On Wed, Jul 10, 2013 at 3:19 PM, Devaraj k devara...@huawei.com wrote:

  1. I assume this is the task (container) that tries to establish
 connection, but what it wants to connect to? 

 It is trying to connect to MRAppMaster for executing the actual task.


 1. I assume this is the task (container) that tries to establish
 connection, but what it wants to connect to? 

 It seems Container is not getting the correct MRAppMaster address due to
 some reason or AM is crashing before giving the task to Container. Probably
 it is coming due to invalid host mapping.  Can you check the host mapping
 is proper in both the machines and also check the AM log that time for any
 clue. 


 Thanks

 Devaraj k


 From: Andrei [mailto:faithlessfri...@gmail.com]
 Sent: 10 July 2013 17:32
 To: user@hadoop.apache.org
 Subject: ConnectionException in container, happens only sometimes


 Hi, 


 I'm running CDH4.3 installation of Hadoop with the following simple
 setup: 


 master-host: runs NameNode, ResourceManager and JobHistoryServer

 slave-1-host and slave-2-hosts: DataNodes and NodeManagers. 


 When I run simple MapReduce job (both - using streaming API or Pi example
 from distribution) on client I see that some tasks fail: 


 13/07/10 14:40:10 INFO mapreduce.Job:  map 60% reduce 0%

 13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
 attempt_1373454026937_0005_m_03_0, Status : FAILED

 13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
 attempt_1373454026937_0005_m_05_0, Status : FAILED

 ...

 13/07/10 14:40:23 INFO mapreduce.Job:  map 60% reduce 20%

 ...


 Every time different set of tasks/attempts fails. In some cases number of
 failed attempts becomes critical, and the whole job fails, in other cases
 job is finished successfully. I can't see any dependency, but I noticed the
 following. 


 Let's say, ApplicationMaster runs on _slave-1-host_. In this case on
 _slave-2-host_ there will be corresponding syslog with the following
 contents: 


 ... 

 2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client:
 Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried
 0 time(s); retry policy is
 RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

 2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client:
 Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried
 1 time(s); retry policy is
 RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

 ...

 2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client:
 Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried
 9 time(s); retry policy is
 RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

 2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild:
 Exception running child : java.net.ConnectException: Call From slave-2-host/
 127.0.0.1 to slave-2-host:11812 failed on connection exception:
 java.net.ConnectException: Connection refused; For more details see:
 http://wiki.apache.org/hadoop/ConnectionRefused

 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)

 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 

 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 

 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 

 at
 org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)

 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)
 

 at org.apache.hadoop.ipc.Client.call(Client.java:1229)

 at
 org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:225)
 


Re: ConnectionException in container, happens only sometimes

2013-07-10 Thread Omkar Joshi
Can you post the RM/NM logs too?

Thanks,
Omkar Joshi
*Hortonworks Inc.* http://www.hortonworks.com


On Wed, Jul 10, 2013 at 6:42 AM, Andrei faithlessfri...@gmail.com wrote:

 If it helps, full log of AM can be found here: http://pastebin.com/zXTabyvv.


 On Wed, Jul 10, 2013 at 4:21 PM, Andrei faithlessfri...@gmail.com wrote:

 Hi Devaraj,

 thanks for your answer. Yes, I suspected it could be because of host
 mapping, so I have already checked (and have just re-checked) settings in
 /etc/hosts of each machine, and they all are ok. I use both fully-qualified
 names (e.g. `master-host.company.com`) and their shortcuts (e.g.
 `master-host`), so it shouldn't depend on notation too.

 I have also checked AM syslog. There's nothing about network, but there
 are several messages like the following:

 ERROR [RMCommunicator Allocator] 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container 
 complete event for unknown container id 
 container_1373460572360_0001_01_88


 I understand container just doesn't get registered in AM (probably
 because of the same issue), is it correct? So I wonder who sends container
 complete event to ApplicationMaster?





 On Wed, Jul 10, 2013 at 3:19 PM, Devaraj k devara...@huawei.com wrote:

  1. I assume this is the task (container) that tries to establish
 connection, but what it wants to connect to? 

 It is trying to connect to MRAppMaster for executing the actual task.


 1. I assume this is the task (container) that tries to establish
 connection, but what it wants to connect to? 

 It seems Container is not getting the correct MRAppMaster address due to
 some reason or AM is crashing before giving the task to Container. Probably
 it is coming due to invalid host mapping.  Can you check the host mapping
 is proper in both the machines and also check the AM log that time for any
 clue. 


 Thanks

 Devaraj k


 From: Andrei [mailto:faithlessfri...@gmail.com]
 Sent: 10 July 2013 17:32
 To: user@hadoop.apache.org
 Subject: ConnectionException in container, happens only sometimes


 Hi, 


 I'm running CDH4.3 installation of Hadoop with the following simple
 setup: 


 master-host: runs NameNode, ResourceManager and JobHistoryServer

 slave-1-host and slave-2-hosts: DataNodes and NodeManagers. 


 When I run simple MapReduce job (both - using streaming API or Pi
 example from distribution) on client I see that some tasks fail: 


 13/07/10 14:40:10 INFO mapreduce.Job:  map 60% reduce 0%

 13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
 attempt_1373454026937_0005_m_03_0, Status : FAILED

 13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
 attempt_1373454026937_0005_m_05_0, Status : FAILED

 ...

 13/07/10 14:40:23 INFO mapreduce.Job:  map 60% reduce 20%

 ...


 Every time different set of tasks/attempts fails. In some cases number
 of failed attempts becomes critical, and the whole job fails, in other
 cases job is finished successfully. I can't see any dependency, but I
 noticed the following. 


 Let's say, ApplicationMaster runs on _slave-1-host_. In this case on
 _slave-2-host_ there will be corresponding syslog with the following
 contents: 


 ... 

 2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client:
 Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried
 0 time(s); retry policy is
 RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

 2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client:
 Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried
 1 time(s); retry policy is
 RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

 ...

 2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client:
 Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried
 9 time(s); retry policy is
 RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

 2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild:
 Exception running child : java.net.ConnectException: Call From slave-2-host/
 127.0.0.1 to slave-2-host:11812 failed on connection exception:
 java.net.ConnectException: Connection refused; For more details see:
 http://wiki.apache.org/hadoop/ConnectionRefused

 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)

 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 

 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 

 at
 java.lang.reflect.Constructor.newInstance(Constructor.java:526)

 at
 org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)

 at
 org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)