Wow, this is new to me. Chris, I just created MAPREDUCE-3068 <https://issues.apache.org/jira/browse/MAPREDUCE-3068> and assigned it to you. Please reassign if you don't have cycles.
Thanks,
+Vinod

On Thu, Sep 22, 2011 at 5:35 AM, Chris Riccomini (JIRA) <j...@apache.org> wrote:

[ https://issues.apache.org/jira/browse/MAPREDUCE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112247#comment-13112247 ]

Chris Riccomini commented on MAPREDUCE-3065:
--------------------------------------------

Sure thing. Will get on it tomorrow AM. Headed out the door at the moment.

ApplicationMaster killed by NodeManager due to excessive virtual memory consumption
------------------------------------------------------------------------------------

                 Key: MAPREDUCE-3065
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3065
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 0.24.0
            Reporter: Chris Riccomini

Hey Vinod,

OK, so I have a little more clarity into this.

When I bump my resource request for my AM to 4096, it runs. The important line in the NM logs is:

2011-09-21 13:43:44,366 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(402)) - Memory usage of ProcessTree 25656 for container-id container_1316637655278_0001_01_000001 : Virtual 2260938752 bytes, limit : 4294967296 bytes; Physical 120860672 bytes, limit -1 bytes

The thing to note is the virtual memory, which is off the charts, even though my physical memory is comparatively tiny (~115 megs). I'm still poking around the code, but I notice that there are two checks in the NM, one for virtual mem and one for physical mem. The virtual memory check appears to be toggle-able, but is presumably defaulted to on.

At this point I'm trying to figure out exactly what the VMEM check is for, why YARN thinks my app is taking 2 gigs, and how to fix this.

Cheers,
Chris
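For reference, the byte limits in these log lines line up exactly with the requested container memory in MB. The 1:1 MB-to-byte-limit mapping below is inferred from the quoted logs, not from the NodeManager source, so treat it as a rough sketch:

    // Rough arithmetic check against the log line above (all values copied from the log).
    val requestedMb    = 4096                                  // AM resource request
    val vmemLimitBytes = requestedMb.toLong * 1024 * 1024      // 4294967296, matches "limit : 4294967296 bytes"
    val observedVmem   = 2260938752L                           // "Virtual 2260938752 bytes"
    val fitsAt4096     = observedVmem <= vmemLimitBytes        // true: the AM survives with a 4096 request
    val fitsAt2048     = observedVmem <= 2048L * 1024 * 1024   // false: 2260938752 > 2147483648, hence the kill

In later YARN releases the virtual-memory check is governed by yarn.nodemanager.vmem-check-enabled and yarn.nodemanager.vmem-pmem-ratio; whether those exact property names exist in this 0.24.0 snapshot is an assumption, not something this thread confirms.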
________________________________________
From: Chris Riccomini [criccom...@linkedin.com]
Sent: Wednesday, September 21, 2011 1:42 PM
To: mapreduce-dev@hadoop.apache.org
Subject: Re: ApplicationMaster Memory Usage

For the record, I bumped to 4096 for the memory resource request, and it works.
:(

On 9/21/11 1:32 PM, "Chris Riccomini" <criccom...@linkedin.com> wrote:

Hey Vinod,

So, I ran my application master directly from the CLI. I commented out the YARN-specific code. It runs fine without leaking memory.

I then ran it from YARN, with all YARN-specific code commented out. It again ran fine.

I then uncommented JUST my registerWithResourceManager call. It then fails with OOM after a few seconds. I call registerWithResourceManager, and then go into a while (true) { println("yeh"); sleep(1000) }. Doing this prints:

yeh
yeh
yeh
yeh
yeh

At which point, it dies, and, in the NodeManager, I see:

2011-09-21 13:24:51,036 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:isProcessTreeOverLimit(289)) - Process tree for container: container_1316626117280_0005_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2147483648, current usage = 2192773120
2011-09-21 13:24:51,037 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(453)) - Container [pid=23852,containerID=container_1316626117280_0005_01_000001] is running beyond memory-limits. Current usage : 2192773120bytes. Limit : 2147483648bytes. Killing container.
Dump of the process-tree for container_1316626117280_0005_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 23852 20570 23852 23852 (bash) 0 0 108638208 303 /bin/bash -c java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 5 1 1316626117280 com.linkedin.TODO 1 1>/tmp/logs/application_1316626117280_0005/container_1316626117280_0005_01_000001/stdout 2>/tmp/logs/application_1316626117280_0005/container_1316626117280_0005_01_000001/stderr
|- 23855 23852 23852 23852 (java) 81 4 2084134912 14772 java -Xmx512M -cp ./package/* kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 5 1 1316626117280 com.linkedin.TODO 1
2011-09-21 13:24:51,037 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) - Removed ProcessTree with root 23852
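A quick arithmetic check against the dump above suggests the monitor's "current usage" is simply the sum of the VMEM_USAGE(BYTES) column over the whole process tree, i.e. the address space of the bash wrapper plus the JVM, not resident memory. A sketch based only on the quoted numbers:

    // Values copied from the process-tree dump above.
    val bashVmem  = 108638208L            // (bash) wrapper, pid 23852
    val javaVmem  = 2084134912L           // (java) -Xmx512M child, pid 23855
    val treeTotal = bashVmem + javaVmem   // 2192773120, exactly the reported "current usage"
    val limit     = 2048L * 1024 * 1024   // 2147483648, the 2048 MB container limit
    val killed    = treeTotal > limit     // true

So the kill is driven by the JVM's virtual size, even though the physical memory reported elsewhere in the thread is only around 70-120 MB.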
Either something is leaking in YARN, or my registerWithResourceManager code (see below) is doing something funky.

I'm trying to avoid going through all the pain of attaching a remote debugger. Presumably things aren't leaking in YARN, which means it's likely that I'm doing something wrong in my registration code.

Incidentally, my NodeManager is running with 1000 megs. My application master memory is set to 2048, and my -Xmx setting is 512M.

Cheers,
Chris
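A minimal sketch of the failing experiment described in this message, reusing the ApplicationMasterHelper from the code paste further down the thread (appId, attemptId, and timestamp are assumed to be the values parsed from the AM's command-line arguments, as in that paste):

    // Register with the RM, then do nothing: no asks, no containers launched.
    // Per the report above, the AM container is still killed by the vmem check.
    val helper = new ApplicationMasterHelper(appId, attemptId, timestamp, new Configuration)
      .registerWithResourceManager
    while (true) {
      println("yeh")
      Thread.sleep(1000)
    }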
________________________________________
From: Vinod Kumar Vavilapalli [vino...@hortonworks.com]
Sent: Wednesday, September 21, 2011 11:52 AM
To: mapreduce-dev@hadoop.apache.org
Subject: Re: ApplicationMaster Memory Usage

Actually, MAPREDUCE-2998 only applies to MRV2, so that isn't related.

Somehow, your JVM itself is taking that much virtual memory. Are you loading some native libs?

And how many containers have already been allocated by the time the AM crashes? Maybe you are accumulating some per-container data. You can try dumping the heap via hprof.

+Vinod
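One low-effort way to act on the hprof suggestion, without attaching a remote debugger, is to add the stock JDK HPROF agent to the AM's java command in the container launch context. A sketch; the flag is a standard JDK option, not anything YARN-specific:

    java -Xmx512M -agentlib:hprof=heap=dump,format=b -cp './package/*' kafka.yarn.ApplicationMaster ...

Two caveats: the binary heap dump is written when the JVM exits, so it can be lost if the NodeManager kills the container before the JVM shuts down cleanly, and it only covers the Java heap, not the native address space that the vmem check is actually measuring.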
On Wed, Sep 21, 2011 at 11:21 PM, Chris Riccomini <criccom...@linkedin.com> wrote:

Hey Vinod,

I svn up'd and rebuilt. My application's task (container) now runs!

Unfortunately, my application master eventually gets killed by the NodeManager anyway, and I'm still not clear as to why. The AM is just running a loop, asking for a container, and executing a command in the container. It keeps doing this over and over again. After a few iterations, it gets killed with something like:

2011-09-21 10:42:40,869 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(402)) - Memory usage of ProcessTree 21666 for container-id container_1316626117280_0002_01_000001 : Virtual 2260938752 bytes, limit : 2147483648 bytes; Physical 77398016 bytes, limit -1 bytes
2011-09-21 10:42:40,869 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:isProcessTreeOverLimit(289)) - Process tree for container: container_1316626117280_0002_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2147483648, current usage = 2260938752
2011-09-21 10:42:40,870 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(453)) - Container [pid=21666,containerID=container_1316626117280_0002_01_000001] is running beyond memory-limits. Current usage : 2260938752bytes. Limit : 2147483648bytes. Killing container.
Dump of the process-tree for container_1316626117280_0002_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 21669 21666 21666 21666 (java) 105 4 2152300544 18593 java -Xmx512M -cp ./package/* kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 2 1 1316626117280 com.linkedin.TODO 1
|- 21666 20570 21666 21666 (bash) 0 0 108638208 303 /bin/bash -c java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 2 1 1316626117280 com.linkedin.TODO 1 1>/tmp/logs/application_1316626117280_0002/container_1316626117280_0002_01_000001/stdout 2>/tmp/logs/application_1316626117280_0002/container_1316626117280_0002_01_000001/stderr
2011-09-21 10:42:40,870 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) - Removed ProcessTree with root 21666

I don't think that my AM is leaking memory. Full code paste after the break.

1. Do I need to release a container in my AM even if the AM receives it as a finished container in the resource request response? (See the sketch after the code paste below.)
2. Do I need to free any other resources after a resource request (e.g. ResourceRequest, AllocateRequest, etc.)?

Cheers,
Chris


def main(args: Array[String]) {
  // YARN will always give our ApplicationMaster
  // the first four parameters as: <package> <app id> <attempt id> <timestamp>
  val packagePath = args(0)
  val appId = args(1).toInt
  val attemptId = args(2).toInt
  val timestamp = args(3).toLong

  // these are our application master's parameters
  val streamerClass = args(4)
  val tasks = args(5).toInt

  // TODO log params here

  // start the application master helper
  val conf = new Configuration
  val applicationMasterHelper = new ApplicationMasterHelper(appId, attemptId, timestamp, conf)
    .registerWithResourceManager

  // start and manage the slaves
  val noReleases = List[ContainerId]()
  var runningContainers = 0

  // keep going forever
  while (true) {
    val nonRunningTasks = tasks - runningContainers
    val response = applicationMasterHelper.sendResourceRequest(nonRunningTasks, noReleases)

    response.getAllocatedContainers.foreach(container => {
      new ContainerExecutor(packagePath, container)
        .addCommand("java -Xmx256M -cp './package/*' kafka.yarn.StreamingTask " + streamerClass + " "
          + "1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout "
          + "2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr").execute(conf)
    })

    runningContainers += response.getAllocatedContainers.length
    runningContainers -= response.getCompletedContainersStatuses.length

    Thread.sleep(1000)
  }

  applicationMasterHelper.unregisterWithResourceManager("SUCCESS")
}


class ApplicationMasterHelper(iAppId: Int, iAppAttemptId: Int, lTimestamp: Long, conf: Configuration) {
  val rpc = YarnRPC.create(conf)
  val appId = Records.newRecord(classOf[ApplicationId])
  val appAttemptId = Records.newRecord(classOf[ApplicationAttemptId])
  val rmAddress = NetUtils.createSocketAddr(conf.get(YarnConfiguration.RM_SCHEDULER_ADDRESS,
    YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS))
  val resourceManager = rpc.getProxy(classOf[AMRMProtocol], rmAddress, conf).asInstanceOf[AMRMProtocol]
  var requestId = 0

  appId.setClusterTimestamp(lTimestamp)
  appId.setId(iAppId)
  appAttemptId.setApplicationId(appId)
  appAttemptId.setAttemptId(iAppAttemptId)

  def registerWithResourceManager(): ApplicationMasterHelper = {
    val req = Records.newRecord(classOf[RegisterApplicationMasterRequest])
    req.setApplicationAttemptId(appAttemptId)
    // TODO not sure why these are blank - this is how Spark does it
    req.setHost("")
    req.setRpcPort(1)
    req.setTrackingUrl("")
    resourceManager.registerApplicationMaster(req)
    this
  }

  def unregisterWithResourceManager(state: String): ApplicationMasterHelper = {
    val finReq = Records.newRecord(classOf[FinishApplicationMasterRequest])
    finReq.setAppAttemptId(appAttemptId)
    finReq.setFinalState(state)
    resourceManager.finishApplicationMaster(finReq)
    this
  }

  def sendResourceRequest(containers: Int, release: List[ContainerId]): AMResponse = {
    // TODO will need to make this more flexible for hostname requests, etc
    val request = Records.newRecord(classOf[ResourceRequest])
    val pri = Records.newRecord(classOf[Priority])
    val capability = Records.newRecord(classOf[Resource])
    val req = Records.newRecord(classOf[AllocateRequest])
    request.setHostName("*")
    request.setNumContainers(containers)
    pri.setPriority(1)
    request.setPriority(pri)
    capability.setMemory(128)
    request.setCapability(capability)
    req.setResponseId(requestId)
    req.setApplicationAttemptId(appAttemptId)
    req.addAllAsks(Lists.newArrayList(request))
    req.addAllReleases(release)
    requestId += 1
    // TODO we might want to return a list of container executors here instead of AMResponses
    resourceManager.allocate(req).getAMResponse
  }
}
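On question 1 above, a sketch of one way to hand containers back through the helper, assuming the containers to release are ones that were allocated but never launched (this is a guess at the intended use of the release list that sendResourceRequest already forwards via AllocateRequest.addAllReleases, not a confirmed answer; Container here is the same YARN record type returned by getAllocatedContainers):

    // Hypothetical helper, not part of the original paste: give back any
    // allocated containers the AM decides not to use by passing their ids
    // as the release list on the next allocate call.
    def releaseUnused(helper: ApplicationMasterHelper, unused: List[Container]): AMResponse =
      helper.sendResourceRequest(0, unused.map(_.getId))

Whether completed containers reported in getCompletedContainersStatuses also need an explicit release, and whether the request records themselves need any cleanup (question 2), is left open here.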
________________________________________
From: Vinod Kumar Vavilapalli [vino...@hortonworks.com]
Sent: Wednesday, September 21, 2011 10:08 AM
To: mapreduce-dev@hadoop.apache.org
Subject: Re: ApplicationMaster Memory Usage

Yes, the process dump clearly shows that this is MAPREDUCE-2998.

+Vinod
(With a smirk to see his container-memory-monitoring code in action)

On Wed, Sep 21, 2011 at 10:26 PM, Arun C Murthy <a...@hortonworks.com> wrote:

I'll bet you are hitting MR-2998.

From the changelog:

MAPREDUCE-2998. Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V.

Arun

On Sep 21, 2011, at 9:52 AM, Chris Riccomini wrote:

Hey Guys,

My ApplicationMaster is being killed by the NodeManager because of memory consumption, and I don't understand why. I'm using -Xmx512M, and setting my resource request to 2048.

  .addCommand("java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster " ...

  ...

  private var memory = 2048

  resource.setMemory(memory)
  containerCtx.setResource(resource)
  containerCtx.setCommands(cmds.toList)
  containerCtx.setLocalResources(Collections.singletonMap("package", packageResource))
  appCtx.setApplicationId(appId)
  appCtx.setUser(user.getShortUserName)
  appCtx.setAMContainerSpec(containerCtx)
  request.setApplicationSubmissionContext(appCtx)
  applicationsManager.submitApplication(request)

When this runs, I see (in my NodeManager's logs):

2011-09-21 09:35:19,112 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(402)) - Memory usage of ProcessTree 28134 for container-id container_1316559026783_0003_01_000001 : Virtual 2260938752 bytes, limit : 2147483648 bytes; Physical 71540736 bytes, limit -1 bytes
2011-09-21 09:35:19,112 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:isProcessTreeOverLimit(289)) - Process tree for container: container_1316559026783_0003_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2147483648, current usage = 2260938752
2011-09-21 09:35:19,113 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(453)) - Container [pid=28134,containerID=container_1316559026783_0003_01_000001] is running beyond memory-limits. Current usage : 2260938752bytes. Limit : 2147483648bytes. Killing container.
Dump of the process-tree for container_1316559026783_0003_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 28134 25886 28134 28134 (bash) 0 0 108638208 303 /bin/bash -c java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster 3 1 1316559026783 com.linkedin.TODO 1 1>/tmp/logs/application_1316559026783_0003/container_1316559026783_0003_01_000001/stdout 2>/tmp/logs/application_1316559026783_0003/container_1316559026783_0003_01_000001/stderr
|- 28137 28134 28134 28134 (java) 92 3 2152300544 17163 java -Xmx512M -cp ./package/* kafka.yarn.ApplicationMaster 3 1 1316559026783 com.linkedin.TODO 1

2011-09-21 09:35:19,113 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) - Removed ProcessTree with root 28134

It appears that YARN is honoring my 2048 setting, yet my process is somehow taking 2260938752 bytes. I don't think that I'm using nearly that much in permgen, and my heap is limited to 512. I don't have any JNI stuff running (that I know of), so it's unclear to me what's going on here. The only thing that I can think of is that Java's Runtime exec is forking, and copying its entire JVM memory footprint for the fork.

Has anyone seen this? Am I doing something dumb?

Thanks!
Chris

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira