anikad ayman created HADOOP-14858:
-------------------------------------
Summary: Why Yarn crashes ?
Key: HADOOP-14858
URL: https://issues.apache.org/jira/browse/HADOOP-14858
Project: Hadoop Common
Issue Type: Bug
Environment: Production
Reporter: anikad ayman
Fix For: 2.7.0
During MapReduce processing, YARN crashed and job processing stopped. I managed
to resume processing by killing the first running job, but a few minutes later
there was another crash, which I resolved by killing the second running job.
We are looking for the cause of this crash, which we have seen several times
before (once or twice a month).
In the ResourceManager logs, I find these messages repeated from the beginning
of the crash until the jobs were killed:
2017-08-25 03:51:58,815 WARN org.apache.hadoop.ipc.Server: Large response size
4739374 for call
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
10.135.8.101:38352 Call#33361 Retry#0
2017-08-25 03:53:39,255 WARN org.apache.hadoop.ipc.Server: Large response size
4739374 for call
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
10.135.8.101:38456 Call#33364 Retry#0
2017-08-25 03:55:19,700 WARN org.apache.hadoop.ipc.Server: Large response size
4739374 for call
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
10.135.8.101:38556 Call#33367 Retry#0
2017-08-25 03:57:00,262 WARN org.apache.hadoop.ipc.Server: Large response size
4739374 for call
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
10.135.8.101:38674 Call#33370 Retry#0
2017-08-25 03:58:40,687 WARN org.apache.hadoop.ipc.Server: Large response size
4739374 for call
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
10.135.8.101:38804 Call#33373 Retry#0
.
.
.
2017-08-25 11:02:44,086 WARN org.apache.hadoop.ipc.Server: Large response size
4751251 for call
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
10.135.8.101:39778 Call#34159 Retry#0
2017-08-25 11:02:47,933 WARN org.apache.hadoop.ipc.Server: Large response size
4751251 for call
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
10.135.8.101:39778 Call#34162 Retry#0
2017-08-25 11:03:06,800 WARN org.apache.hadoop.ipc.Server: Large response size
4751251 for call
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from
10.135.8.101:39814 Call#34165 Retry#0
NB: We still get this warning from time to time. We are still wondering whether
it concerns a connection between the NodeManager (10.135.8.101) and the
ResourceManager, or something else.
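To quantify these warnings, a small script along the following lines could be run over the ResourceManager log. It is only a sketch: the function name `summarize` and the regular expression are assumptions derived from the excerpt above, not part of Hadoop, and the pattern may need adjusting for other log layouts.

```python
import re
from collections import Counter
from datetime import datetime

# Matches one ipc.Server "Large response size" warning after wrapped
# lines have been joined back into a single line of text.
PATTERN = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} WARN "
    r"org\.apache\.hadoop\.ipc\.Server: Large response size (\d+) "
    r"for call (\S+) from ([\d.]+):\d+"
)


def summarize(log_text):
    """Return (warnings per (client, method), largest size, gaps in seconds)."""
    flat = " ".join(log_text.split())  # undo line wrapping from the mail
    counts = Counter()
    sizes = []
    stamps = []
    for ts, size, call, client in PATTERN.findall(flat):
        method = call.rsplit(".", 1)[-1]  # e.g. "getApplications"
        counts[(client, method)] += 1
        sizes.append(int(size))
        stamps.append(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))
    # Seconds between consecutive warnings (milliseconds are ignored).
    gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return counts, max(sizes, default=0), gaps
```

Passing the text of the log file to `summarize` gives the number of warnings per calling address and RPC method, the largest response size seen, and the spacing between warnings, which makes it easier to tell whether the calls come from a single periodic caller.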
In the NodeManager logs, I find these messages:
2017-08-25 03:51:54,396 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 98201 for container-id
container_e41_1500982512144_36679_01_000382: 1.4 GB of 10 GB physical memory
used; 10.1 GB of 21 GB virtual memory used
2017-08-25 03:51:54,791 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 112912 for container-id
container_e41_1500982512144_36679_01_000387: 2.3 GB of 10 GB physical memory
used; 10.1 GB of 21 GB virtual memory used
2017-08-25 03:51:55,177 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 105848 for container-id
container_e41_1500982512144_36627_01_001644: 619.4 MB of 10 GB physical memory
used; 10.1 GB of 21 GB virtual memory used
2017-08-25 03:51:58,938 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 98201 for container-id
container_e41_1500982512144_36679_01_000382: 1.4 GB of 10 GB physical memory
used; 10.1 GB of 21 GB virtual memory used
.
.
.
2017-08-25 11:05:40,104 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 112912 for container-id
container_e41_1500982512144_36679_01_000387: 1.1 GB of 10 GB physical memory
used; 10.1 GB of 21 GB virtual memory used
2017-08-25 11:05:40,493 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 105848 for container-id
container_e41_1500982512144_36627_01_001644: 648.4 MB of 10 GB physical memory
used; 10.1 GB of 21 GB virtual memory used
2017-08-25 11:05:43,867 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 98201 for container-id
container_e41_1500982512144_36679_01_000382: 1.1 GB of 10 GB physical memory
used; 10.1 GB of 21 GB virtual memory used
2017-08-25 11:05:45,040 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 105848 for container-id
container_e41_1500982512144_36627_01_001644: 648.4 MB of 10 GB physical memory
used; 10.1 GB of 21 GB virtual memory used
2017-08-25 11:05:48,397 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
Container container_e41_1500982512144_36627_01_001644 transitioned from
RUNNING to KILLING
2017-08-25 11:05:48,397 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
Application application_1500982512144_36627 transitioned from RUNNING to
FINISHING_CONTAINERS_WAIT
2017-08-25 11:05:48,397 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
Cleaning up container container_e41_1500982512144_36627_01_001644
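To check whether any container came close to its physical memory limit, the ContainersMonitorImpl lines can be reduced to a peak usage ratio per container. The following is a minimal sketch under the assumption that the lines have the form shown above; the helper name `peak_physical` and the unit table are hypothetical, not a Hadoop API.

```python
import re

# Binary unit multipliers for the sizes printed in the log lines.
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

# Captures container id, used physical memory and the physical memory limit.
USAGE = re.compile(
    r"container-id (container_\w+): ([\d.]+) (\w+) of ([\d.]+) (\w+) "
    r"physical memory used"
)


def peak_physical(log_text):
    """Map each container id to its highest used/limit ratio (0.0 to 1.0+)."""
    flat = " ".join(log_text.split())  # undo line wrapping from the mail
    peaks = {}
    for cid, used, used_unit, limit, limit_unit in USAGE.findall(flat):
        used_bytes = float(used) * UNITS[used_unit]
        limit_bytes = float(limit) * UNITS[limit_unit]
        ratio = used_bytes / limit_bytes
        peaks[cid] = max(peaks.get(cid, 0.0), ratio)
    return peaks
```

In the excerpt above, every container stays well below its 10 GB physical limit, so a function like this is mainly useful for scanning the full log rather than the few lines quoted here.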
And in the JobHistory server logs:
2017-08-25 03:53:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
Starting scan to move intermediate done files
2017-08-25 03:56:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
Starting scan to move intermediate done files
2017-08-25 03:59:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
Starting scan to move intermediate done files
2017-08-25 04:02:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
Starting scan to move intermediate done files
2017-08-25 04:05:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
Starting scan to move intermediate done files
2017-08-25 04:08:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
Starting scan to move intermediate done files
2017-08-25 04:11:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
Starting scan to move intermediate done files
. . .
2017-08-25 11:05:36,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
History Cleaner started
2017-08-25 11:05:41,271 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
History Cleaner complete
2017-08-25 11:06:04,214 INFO
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
Updating the current master key for generating delegation tokens
2017-08-25 11:08:06,504 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
Starting scan to move intermediate done files
2017-08-25 11:08:06,518 INFO org.apache.hadoop.mapreduce.jobhistory.JobSummary:
jobId=job_1500982512144_36793,submitTime=1503647426340,launchTime=1503651960434,firstMapTaskLaunchTime=1503651982671,firstReduceTaskLaunchTime=0,finishTime=1503651985794,resourcesPerMap=5120,resourcesPerReduce=0,numMaps=1,numReduces=0,user=mapr,queue=default,status=SUCCEEDED,mapSlotSeconds=9,reduceSlotSeconds=0,jobName=SELECT
`C_7361705f62736973`.`buk...20170825)(Stage-1)
2017-08-25 11:08:06,518 INFO
org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Deleting JobSummary file:
[maprfs:/var/mapr/cluster/yarn/rm/staging/history/done_intermediate/mapr/job_1500982512144_36793.summary]
2017-08-25 11:08:06,518 INFO org.apache.hadoop.mapreduce.jobhistory.JobSummary:
jobId=job_1500982512144_36778,submitTime=1503642110785,launchTime=1503651960266,firstMapTaskLaunchTime=1503651969483,firstReduceTaskLaunchTime=0,finishTime=1503651976016,resourcesPerMap=5120,resourcesPerReduce=0,numMaps=1,numReduces=0,user=mapr,queue=default,status=SUCCEEDED,mapSlotSeconds=19,reduceSlotSeconds=0,jobName=SELECT
`C_7361705f7662726b`.`vbe...20170825)(Stage-1)
Do you have any explanation of, or solution for, this issue?