Can you post the RM/NM logs too?

Thanks,
Omkar Joshi
*Hortonworks Inc.* <http://www.hortonworks.com>


On Wed, Jul 10, 2013 at 6:42 AM, Andrei <faithlessfri...@gmail.com> wrote:

> If it helps, the full AM log can be found here:
> http://pastebin.com/zXTabyvv
>
>
> On Wed, Jul 10, 2013 at 4:21 PM, Andrei <faithlessfri...@gmail.com> wrote:
>
>> Hi Devaraj,
>>
>> thanks for your answer. Yes, I suspected it could be because of the host
>> mapping, so I have already checked (and have just re-checked) the settings
>> in /etc/hosts on each machine, and they are all OK. I use both
>> fully-qualified names (e.g. `master-host.company.com`) and their short
>> forms (e.g. `master-host`), so it shouldn't depend on the notation either.
>>
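[Editorial note: one quick way to double-check the host-mapping suspicion is to look for the node's hostname on a loopback line. This is a minimal sketch; the hosts-file contents and the hostname `slave-2-host` below are hypothetical examples, not taken from the cluster in this thread.]

```shell
# Hypothetical check for the host-mapping issue: a hostname listed on a
# 127.x line in /etc/hosts makes daemons on that node advertise the
# loopback address to the rest of the cluster.
hosts='127.0.0.1 localhost slave-2-host
10.0.0.12 slave-2-host.company.com slave-2-host'
host=slave-2-host
# Find the hostname on any line whose first field is a 127.x address.
bad=$(printf '%s\n' "$hosts" | awk -v h="$host" \
  '$1 ~ /^127\./ { for (i = 2; i <= NF; i++) if ($i == h) print $i }')
if [ -n "$bad" ]; then
  echo "WARNING: $host is mapped to a loopback address in /etc/hosts"
else
  echo "OK: $host is not on a loopback line"
fi
```

On a real node one would run the same awk against `/etc/hosts` itself with `host=$(hostname)`.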
>> I have also checked the AM syslog. There's nothing about the network, but
>> there are several messages like the following:
>>
>> ERROR [RMCommunicator Allocator] 
>> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container 
>> complete event for unknown container id 
>> container_1373460572360_0001_01_000088
>>
>>
>> I understand that the container just doesn't get registered in the AM
>> (probably because of the same issue), is that correct? So I wonder: who
>> sends the "container complete event" to the ApplicationMaster?
>>
>> On Wed, Jul 10, 2013 at 3:19 PM, Devaraj k <devara...@huawei.com> wrote:
>>
>>>  >1. I assume this is the task (container) that tries to establish
>>> connection, but what it wants to connect to?
>>>
>>> It is trying to connect to the MRAppMaster for executing the actual task.
>>>
>>>  >2. Why this error happens and how can I fix it?
>>>
>>> It seems the Container is not getting the correct MRAppMaster address for
>>> some reason, or the AM is crashing before giving the task to the
>>> Container. Probably it is happening due to an invalid host mapping. Can
>>> you check that the host mapping is correct on both machines, and also
>>> check the AM log at that time for any clue?
>>>
>>> Thanks
>>>
>>> Devaraj k
>>>
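[Editorial note: one way to act on this advice is to pull the server address out of the failing container's syslog and see what it resolved to. A minimal sketch; the sample line is copied from the log excerpt quoted below, and the sed pattern is an assumption about the `org.apache.hadoop.ipc.Client` message format, not an official tool.]

```shell
# Extract the address a failing container actually tried to reach.
line='2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 0 time(s)'
# Capture "hostname/ip:port" between "server: " and ". Already".
addr=$(printf '%s\n' "$line" | \
  sed -n 's/.*Retrying connect to server: \([^ ]*\)\. Already.*/\1/p')
ip=${addr#*/}   # drop "hostname/" prefix
ip=${ip%:*}     # drop ":port" suffix
echo "container tried: $addr"
case "$ip" in
  127.*) echo "resolved to loopback ($ip): the AM hostname maps to localhost on this node" ;;
esac
```

A loopback IP here, as in this thread, points back at the host-mapping problem rather than at the AM itself.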
>>>
>>> *From:* Andrei [mailto:faithlessfri...@gmail.com]
>>> *Sent:* 10 July 2013 17:32
>>> *To:* user@hadoop.apache.org
>>> *Subject:* ConnectionException in container, happens only sometimes
>>>
>>> Hi,
>>>
>>> I'm running a CDH4.3 installation of Hadoop with the following simple
>>> setup:
>>>
>>> master-host: runs NameNode, ResourceManager and JobHistoryServer
>>>
>>> slave-1-host and slave-2-host: DataNodes and NodeManagers.
>>>
>>> When I run a simple MapReduce job (either using the streaming API or the
>>> Pi example from the distribution) on the client, I see that some tasks
>>> fail:
>>>
>>> 13/07/10 14:40:10 INFO mapreduce.Job:  map 60% reduce 0%
>>> 13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
>>> attempt_1373454026937_0005_m_000003_0, Status : FAILED
>>> 13/07/10 14:40:14 INFO mapreduce.Job: Task Id :
>>> attempt_1373454026937_0005_m_000005_0, Status : FAILED
>>> ...
>>> 13/07/10 14:40:23 INFO mapreduce.Job:  map 60% reduce 20%
>>> ...
>>>
>>> Every time a different set of tasks/attempts fails. In some cases the
>>> number of failed attempts becomes critical and the whole job fails; in
>>> other cases the job finishes successfully. I can't see any pattern, but I
>>> have noticed the following.
>>>
>>> Let's say the ApplicationMaster runs on _slave-1-host_. In this case, on
>>> _slave-2-host_ there will be a corresponding syslog with the following
>>> contents:
>>>
>>> ...
>>>
>>> 2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client:
>>> Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried
>>> 0 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>>
>>> 2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client:
>>> Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried
>>> 1 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>>
>>> ...
>>>
>>> 2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client:
>>> Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried
>>> 9 time(s); retry policy is
>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>>>
>>> 2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild:
>>> Exception running child : java.net.ConnectException: Call From
>>> slave-2-host/127.0.0.1 to slave-2-host:11812 failed on connection
>>> exception: java.net.ConnectException: Connection refused; For more
>>> details see: http://wiki.apache.org/hadoop/ConnectionRefused
>>>
>>>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)
>>>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)
>>>         at org.apache.hadoop.ipc.Client.call(Client.java:1229)
>>>         at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:225)
>>>         at com.sun.proxy.$Proxy6.getTask(Unknown Source)
>>>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:131)
>>> Caused by: java.net.ConnectException: Connection refused
>>>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708)
>>>         at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
>>>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:528)
>>>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:492)
>>>         at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:499)
>>>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:593)
>>>         at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:241)
>>>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1278)
>>>         at org.apache.hadoop.ipc.Client.call(Client.java:1196)
>>>         ... 3 more
>>>
>>> Notice several things:
>>>
>>> 1. This exception always happens on a different host than the one the
>>> ApplicationMaster runs on.
>>> 2. It always tries to connect to localhost, not to another host in the
>>> cluster.
>>> 3. The port number (11812 in this case) is different every time.
>>>
>>> My questions are:
>>>
>>> 1. I assume it is the task (container) that tries to establish the
>>> connection, but what does it want to connect to?
>>> 2. Why does this error happen and how can I fix it?
>>>
>>> Any suggestions are welcome.
>>>
>>> Thanks,
>>> Andrei
>>>
>>
>>
>
