Can you post the RM/NM logs too?

Thanks,
Omkar Joshi
*Hortonworks Inc.* <http://www.hortonworks.com>
On Wed, Jul 10, 2013 at 6:42 AM, Andrei <faithlessfri...@gmail.com> wrote:

> If it helps, the full log of the AM can be found here:
> http://pastebin.com/zXTabyvv
>
> On Wed, Jul 10, 2013 at 4:21 PM, Andrei <faithlessfri...@gmail.com> wrote:
>
>> Hi Devaraj,
>>
>> thanks for your answer. Yes, I suspected it could be because of host
>> mapping, so I have already checked (and have just re-checked) the settings
>> in /etc/hosts on each machine, and they all look ok. I use both
>> fully-qualified names (e.g. `master-host.company.com`) and their short
>> forms (e.g. `master-host`), so it shouldn't depend on the notation either.
>>
>> I have also checked the AM syslog. There's nothing about the network, but
>> there are several messages like the following:
>>
>>   ERROR [RMCommunicator Allocator]
>>   org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container
>>   complete event for unknown container id
>>   container_1373460572360_0001_01_000088
>>
>> I understand the container just doesn't get registered in the AM (probably
>> because of the same issue), is that correct? So I wonder: who sends the
>> "container complete event" to the ApplicationMaster?
>>
>> On Wed, Jul 10, 2013 at 3:19 PM, Devaraj k <devara...@huawei.com> wrote:
>>
>>> > 1. I assume this is the task (container) that tries to establish
>>> > connection, but what it wants to connect to?
>>>
>>> It is trying to connect to the MRAppMaster for executing the actual task.
>>>
>>> > 2. Why this error happens and how can I fix it?
>>>
>>> It seems the Container is not getting the correct MRAppMaster address for
>>> some reason, or the AM is crashing before giving the task to the
>>> Container. Probably it is caused by an invalid host mapping. Can you
>>> check that the host mapping is correct on both machines, and also check
>>> the AM log from that time for any clue?
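[Editor's note: since the retry lines below show `slave-2-host` resolving to `127.0.0.1`, the usual culprit for the host-mapping problem Devaraj describes is an /etc/hosts entry that binds the machine's own hostname to a loopback address (Debian/Ubuntu installers often add a `127.0.1.1` line). A small sketch of a check for that pattern, under the assumption that the hostnames and file contents below are just examples:]

```python
# Sketch: flag /etc/hosts entries that bind a real hostname to a loopback
# address. If "slave-2-host" resolves to 127.0.0.1, the AM advertises a
# loopback address and containers on other nodes can never reach it.

def loopback_hostnames(hosts_text):
    """Return hostnames mapped to 127.x addresses in /etc/hosts content."""
    bad = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        addr, names = fields[0], fields[1:]
        if addr.startswith("127."):
            # 'localhost' aliases are expected; real hostnames are not
            bad.extend(n for n in names if not n.startswith("localhost"))
    return bad

example = """
127.0.0.1    localhost
127.0.1.1    slave-2-host.company.com slave-2-host   # the usual culprit
192.168.0.12 slave-1-host.company.com slave-1-host
"""
print(loopback_hostnames(example))
# -> ['slave-2-host.company.com', 'slave-2-host']
```

If the list is non-empty on any node, moving those names to the machine's real network address (or removing the loopback alias) is the standard fix.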
>>>
>>> Thanks
>>> Devaraj k
>>>
>>> *From:* Andrei [mailto:faithlessfri...@gmail.com]
>>> *Sent:* 10 July 2013 17:32
>>> *To:* user@hadoop.apache.org
>>> *Subject:* ConnectionException in container, happens only sometimes
>>>
>>> Hi,
>>>
>>> I'm running a CDH4.3 installation of Hadoop with the following simple
>>> setup:
>>>
>>> master-host: runs the NameNode, ResourceManager and JobHistoryServer
>>> slave-1-host and slave-2-host: DataNodes and NodeManagers
>>>
>>> When I run a simple MapReduce job (either via the streaming API or the
>>> Pi example from the distribution) on the client, I see that some tasks
>>> fail:
>>>
>>>   13/07/10 14:40:10 INFO mapreduce.Job:  map 60% reduce 0%
>>>   13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_000003_0, Status : FAILED
>>>   13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_000005_0, Status : FAILED
>>>   ...
>>>   13/07/10 14:40:23 INFO mapreduce.Job:  map 60% reduce 20%
>>>   ...
>>>
>>> Every time a different set of tasks/attempts fails. In some cases the
>>> number of failed attempts becomes critical and the whole job fails; in
>>> other cases the job finishes successfully. I can't see any pattern, but
>>> I noticed the following.
>>>
>>> Let's say the ApplicationMaster runs on _slave-1-host_. In this case
>>> there will be a corresponding syslog on _slave-2-host_ with the
>>> following contents:
>>>
>>>   ...
>>>   2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>>   2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>>   ...
>>>   2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>>   2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From slave-2-host/127.0.0.1 to slave-2-host:11812 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
>>>       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>       at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>       at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>       at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>       at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)
>>>       at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)
>>>       at org.apache.hadoop.ipc.Client.call(Client.java:1229)
>>>       at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:225)
>>>       at com.sun.proxy.$Proxy6.getTask(Unknown Source)
>>>       at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:131)
>>>   Caused by: java.net.ConnectException: Connection refused
>>>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>       at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708)
>>>       at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
>>>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:528)
>>>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:492)
>>>       at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:499)
>>>       at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:593)
>>>       at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:241)
>>>       at org.apache.hadoop.ipc.Client.getConnection(Client.java:1278)
>>>       at org.apache.hadoop.ipc.Client.call(Client.java:1196)
>>>       ... 3 more
>>>
>>> Notice several things:
>>>
>>> 1. This exception always happens on a different host than the one the
>>>    ApplicationMaster runs on.
>>> 2. It always tries to connect to localhost, not another host in the
>>>    cluster.
>>> 3. The port number (11812 in this case) is different every time.
>>>
>>> My questions are:
>>>
>>> 1. I assume this is the task (container) that tries to establish the
>>>    connection, but what does it want to connect to?
>>> 2. Why does this error happen, and how can I fix it?
>>>
>>> Any suggestions are welcome.
>>>
>>> Thanks,
>>> Andrei
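[Editor's note: for readers puzzling over the repeated "Retrying connect" lines in the log above, the named policy, RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS), simply reattempts a failed call a fixed number of times with a constant pause and then rethrows, which is why the WARN with the ConnectException appears right after attempt 9. A minimal behavioral sketch, not the Hadoop implementation:]

```python
import time

def retry_with_fixed_sleep(call, max_retries=10, sleep_s=1.0, log=print):
    """Retry `call` up to max_retries times with a fixed sleep between
    attempts; rethrow the last error once the count is exhausted."""
    last_exc = None
    for attempt in range(max_retries):
        try:
            return call()
        except ConnectionError as e:
            last_exc = e
            log(f"Retrying connect to server. Already tried {attempt} time(s); "
                f"retry policy is RetryUpToMaximumCountWithFixedSleep("
                f"maxRetries={max_retries}, sleepTime={sleep_s} SECONDS)")
            time.sleep(sleep_s)
    raise last_exc

# Example: a server that always refuses produces exactly ten retry lines
# (tried 0 through 9) and then the exception escapes, matching the log.
def refused():
    raise ConnectionError("Connection refused")
```

Since the policy retries blindly, a wrong address (here, the AM's hostname resolving to loopback) burns all ten attempts before the task even reports a failure, which accounts for the ~10 second gap between the first INFO line and the WARN.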