AppMaster failed to launch container on an alternate NodeManager after its connection to one NM timed out.

2018-02-11 Thread Khireswar Kalita
Dear friends,

I need some help understanding the root cause of this issue.

In a Sqoop job failure, we noticed that the ApplicationMaster was unable to
connect to a NodeManager due to connection timeouts, and it kept retrying the
connection for close to 2 hours until it was killed manually.
The timeout was caused by a temporary network issue.

Here is an overview of what happened:

RM <-> NM01(hdpn01)  Network ok
RM <-> NM08(hdpn08)  Network ok
NM01 <---X---> NM08 Network failed

AppMaster container launched at NM01 node.

Here is a brief log:

INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before
Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0
AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0
HostLocal:0 RackLocal:0
2018-02-03 21:12:51,734 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources()
for application_1517675224254_1052: ask=1 release= 0 newContainers=0
finishedContainers=0 resourcelimit=<memory:2776576, vCores:1> knownNMs=24
2018-02-03 21:12:52,751 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated
containers 1
2018-02-03 21:12:52,793 INFO [RMCommunicator Allocator]
org.apache.hadoop.yarn.util.RackResolver: Resolved hdpn08.ztpl.net to
/default-rack
2018-02-03 21:12:52,797 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned
container container_1517675224254_1052_02_02 to
attempt_1517675224254_1052_m_00_1000
2018-02-03 21:12:52,799 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After
Scheduling: PendingReds:0 ScheduledMaps:0 Sc

..

2018-02-03 22:43:58,911 WARN [ContainerLauncher #0]
org.apache.hadoop.ipc.Client: Failed to connect to server:
hdpn08.ztpl.net/172.20.1.108:45454: retries get failed due to exceeded
maximum allowed retries number: 0
java.net.ConnectException: Connection timed out
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)


Why did the AM keep retrying the connection to the NM on hdpn08 for 2 hours,
until it was manually killed? If it had not been killed, it would have
continued for much longer.

Why did the AM not stop trying after a certain number of tries? Is there any
max-attempts property for the ApplicationMaster?

Why did the AM not spin up another map task to compensate for this problematic
one?
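
For reference, these are the properties that I believe govern how long the AM
retries a NodeManager it cannot reach. I have not verified our cluster's
values, and the defaults shown below are only what I understand the stock
yarn-default.xml / core-default.xml values to be, so please correct me if I
have this wrong:

  <!-- yarn-site.xml: overall retry window for the AM -> NM connection -->
  <property>
    <name>yarn.client.nodemanager-connect.max-wait-ms</name>
    <value>180000</value>
  </property>
  <property>
    <name>yarn.client.nodemanager-connect.retry-interval-ms</name>
    <value>10000</value>
  </property>

  <!-- core-site.xml: per-connection IPC timeout and retry behaviour -->
  <property>
    <name>ipc.client.connect.timeout</name>
    <value>20000</value>
  </property>
  <property>
    <name>ipc.client.connect.max.retries.on.timeouts</name>
    <value>45</value>
  </property>

Even so, in our case the AM kept retrying from around 21:12 until close to two
hours later, which is what I am trying to understand.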



Thanks
Khireswar Kalita


Hive query taking long time in reduce phase

2017-07-19 Thread Khireswar Kalita
I am seeing an issue on a production cluster where a MapReduce job takes a
long time in the reduce phase while running a Hive query, approximately 2
hours for one reduce task.

Hardware specification of the cluster:
Number of nodes: 22
Number of master nodes: 4
Number of data nodes: 18
Cores per node: 24
Memory per node: 512 GB
Minimum YARN container size: 2048 MB

This issue does not occur all the time, so I am not sure what the root cause
might be.

Please share your suggestions if anyone has experienced a similar issue.
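
For what it is worth, the MergerManager line in the application log below
looks consistent with the stock shuffle settings. This is only my own rough
check, assuming the default values of
mapreduce.reduce.shuffle.input.buffer.percent (0.70),
mapreduce.reduce.shuffle.memory.limit.percent (0.25) and
mapreduce.reduce.shuffle.merge.percent (0.66):

  memoryLimit           = 6370046464  (reduce JVM heap x 0.70, as I understand it)
  maxSingleShuffleLimit = 6370046464 x 0.25 = 1592511616  (matches the log)
  mergeThreshold        = 6370046464 x 0.66 ~= 4204230912  (matches the log)

So the shuffle memory settings appear to be at their defaults.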

Application Log:
2017-06-04 18:58:02,616 INFO [main]
org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: MergerManager:
memoryLimit=6370046464, maxSingleShuffleLimit=1592511616,
mergeThreshold=4204230912, ioSortFactor=100,
memToMemMergeOutputsThreshold=100
2017-06-04 18:58:02,619 INFO [EventFetcher for fetching Map Completion
Events] org.apache.hadoop.mapreduce.task.reduce.EventFetcher:
attempt_1496358601894_1431_r_00_0 Thread started: EventFetcher for
fetching Map Completion Events
2017-06-04 18:58:02,628 INFO [fetcher#5]
org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning
am1dlccmrhdd08.r1-core.r1.aig.net:13562 with 1 to fetcher#5
2017-06-04 18:58:02,628 INFO [fetcher#5]
org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 1 of
1 to am1dlccmrhdd08.r1-core.r1.aig.net:13562 to fetcher#5
2017-06-04 18:58:02,628 INFO [EventFetcher for fetching Map Completion
Events] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl:
Ignoring obsolete output of OBSOLETE map-task:
'attempt_1496358601894_1431_m_03_0'
2017-06-04 18:58:02,629 INFO [EventFetcher for fetching Map Completion
Events] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl:
Ignoring obsolete output of FAILED map-task:
'attempt_1496358601894_1431_m_03_0'
2017-06-04 18:58:02,629 INFO [fetcher#8]
org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning
am1dlccmrhdd01.r1-core.r1.aig.net:13562 with 1 to fetcher#8
2017-06-04 18:58:02,631 INFO [fetcher#8]
org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 1 of
1 to am1dlccmrhdd01.r1-core.r1.aig.net:13562 to fetcher#8
2017-06-04 18:58:02,631 INFO [EventFetcher for fetching Map Completion
Events] org.apache.hadoop.mapreduce.task.reduce.EventFetcher:
attempt_1496358601894_1431_r_00_0: Got 3 new map-outputs
2017-06-04 18:58:02,747 INFO [fetcher#8]
org.apache.hadoop.mapreduce.task.reduce.Fetcher: for
url=13562/mapOutput?job=job_1496358601894_1431=0=attempt_1496358601894_1431_m_06_0
sent hash and received reply
2017-06-04 18:58:02,747 INFO [fetcher#5]
org.apache.hadoop.mapreduce.task.reduce.Fetcher: for
url=13562/mapOutput?job=job_1496358601894_1431=0=attempt_1496358601894_1431_m_09_0
sent hash and received reply
2017-06-04 18:58:02,752 INFO [fetcher#5]
org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#5 about to shuffle
output of map attempt_1496358601894_1431_m_09_0 decomp: 40205 len:
40209 to MEMORY
2017-06-04 18:58:02,752 INFO [fetcher#8]
org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#8 about to shuffle
output of map attempt_1496358601894_1431_m_06_0 decomp: 50946 len:
50950 to MEMORY
2017-06-04 18:58:02,755 INFO [fetcher#5]
org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput: Read 40205 bytes
from map-output for attempt_1496358601894_1431_m_09_0
2017-06-04 18:58:02,755 INFO [fetcher#8]
org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput: Read 50946 bytes
from map-output for attempt_1496358601894_1431_m_06_0
2017-06-04 18:58:02,758 INFO [fetcher#5]
org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: closeInMemoryFile
-> map-output of size: 40205, inMemoryMapOutputs.size() -> 1, commitMemory
-> 0, usedMemory ->91151
2017-06-04 18:58:02,759 INFO [fetcher#8]
org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: closeInMemoryFile
-> map-output of size: 50946, inMemoryMapOutputs.size() -> 2, commitMemory
-> 40205, usedMemory ->91151
2017-06-04 18:58:02,760 INFO [fetcher#5]
org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl:
am1dlccmrhdd08.r1-core.r1.aig.net:13562 freed by fetcher#5 in 131ms
2017-06-04 18:58:02,760 INFO [fetcher#8]
org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl:
am1dlccmrhdd01.r1-core.r1.aig.net:13562 freed by fetcher#8 in 131ms
2017-06-04 18:58:02,760 INFO [fetcher#5]
org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning
am1dlccmrhdd01.r1-core.r1.aig.net:13562 with 1 to fetcher#5
2017-06-04 18:58:02,760 INFO [fetcher#5]
org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 1 of
1 to am1dlccmrhdd01.r1-core.r1.aig.net:13562 to fetcher#5
2017-06-04 18:58:02,762 INFO [fetcher#5]
org.apache.hadoop.mapreduce.task.reduce.Fetcher: for
url=13562/mapOutput?job=job_1496358601894_1431=0=attempt_1496358601894_1431_m_08_0
sent hash and received reply
2017-06-04 18:58:02,762 INFO [fetcher#5]
org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#5 about to shuffle
output of map