Wow, this is new to me. Chris, I just created MAPREDUCE-3068 <https://issues.apache.org/jira/browse/MAPREDUCE-3068> and assigned it to you. Please reassign if you don't have cycles.
Thanks,
+Vinod

On Thu, Sep 22, 2011 at 5:35 AM, Chris Riccomini (JIRA) <j...@apache.org> wrote:

[ https://issues.apache.org/jira/browse/MAPREDUCE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112247#comment-13112247 ]

Chris Riccomini commented on MAPREDUCE-3065:
--------------------------------------------

Sure thing. Will get on it tomorrow AM. Headed out the door at the moment.

ApplicationMaster killed by NodeManager due to excessive virtual memory consumption
------------------------------------------------------------------------------------

                 Key: MAPREDUCE-3065
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3065
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 0.24.0
            Reporter: Chris Riccomini

Hey Vinod,

OK, so I have a little more clarity into this.

When I bump my resource request for my AM to 4096, it runs. The important line in the NM logs is:

2011-09-21 13:43:44,366 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(402)) - Memory usage of ProcessTree 25656 for container-id container_1316637655278_0001_01_000001 : Virtual 2260938752 bytes, limit : 4294967296 bytes; Physical 120860672 bytes, limit -1 bytes

The thing to note is the virtual memory, which is off the charts, even though my physical memory is comparatively tiny (~115 megs). I'm still poking around the code, but I notice that there are two checks in the NM, one for virtual mem and one for physical mem. The virtual memory check appears to be toggle-able, but is presumably defaulted to on.

At this point I'm trying to figure out exactly what the VMEM check is for, why YARN thinks my app is taking 2 gigs, and how to fix this.

Cheers,
Chris
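For reference, the byte limits in these log lines line up exactly with the requested container memory in MB. The 1:1 MB-to-byte-limit mapping below is inferred from the quoted logs, not from the NodeManager source, so treat it as a rough sketch:

    // Rough arithmetic check against the log line above (all values copied from the log).
    val requestedMb    = 4096                                  // AM resource request
    val vmemLimitBytes = requestedMb.toLong * 1024 * 1024      // 4294967296, matches "limit : 4294967296 bytes"
    val observedVmem   = 2260938752L                           // "Virtual 2260938752 bytes"
    val fitsAt4096     = observedVmem <= vmemLimitBytes        // true: the AM survives with a 4096 request
    val fitsAt2048     = observedVmem <= 2048L * 1024 * 1024   // false: 2260938752 > 2147483648, hence the kill

In later YARN releases the virtual-memory check is governed by yarn.nodemanager.vmem-check-enabled and yarn.nodemanager.vmem-pmem-ratio; whether those exact property names exist in this 0.24.0 snapshot is an assumption, not something this thread confirms.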
________________________________________
From: Chris Riccomini [criccom...@linkedin.com]
Sent: Wednesday, September 21, 2011 1:42 PM
To: mapreduce-dev@hadoop.apache.org
Subject: Re: ApplicationMaster Memory Usage

For the record, I bumped to 4096 for the memory resource request, and it works.
:(

On 9/21/11 1:32 PM, "Chris Riccomini" <criccom...@linkedin.com> wrote:

Hey Vinod,

So, I ran my application master directly from the CLI. I commented out the YARN-specific code. It runs fine without leaking memory.

I then ran it from YARN, with all YARN-specific code commented out. It again ran fine.

I then uncommented JUST my registerWithResourceManager call. It then fails with OOM after a few seconds. I call registerWithResourceManager, and then go into a while (true) { println("yeh"); sleep(1000) }. Doing this prints:

yeh
yeh
yeh
yeh
yeh

At which point, it dies, and, in the NodeManager, I see:

2011-09-21 13:24:51,036 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:isProcessTreeOverLimit(289)) - Process tree for container: container_1316626117280_0005_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2147483648, current usage = 2192773120
2011-09-21 13:24:51,037 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(453)) - Container [pid=23852,containerID=container_1316626117280_0005_01_000001] is running beyond memory-limits. Current usage : 2192773120bytes. Limit : 2147483648bytes. Killing container.
Dump of the process-tree for container_1316626117280_0005_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 23852 20570 23852 23852 (bash) 0 0 108638208 303 /bin/bash -c java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 5 1 1316626117280 com.linkedin.TODO 1 1>/tmp/logs/application_1316626117280_0005/container_1316626117280_0005_01_000001/stdout 2>/tmp/logs/application_1316626117280_0005/container_1316626117280_0005_01_000001/stderr
|- 23855 23852 23852 23852 (java) 81 4 2084134912 14772 java -Xmx512M -cp ./package/* kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 5 1 1316626117280 com.linkedin.TODO 1
2011-09-21 13:24:51,037 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) - Removed ProcessTree with root 23852
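A quick arithmetic check against the dump above suggests the monitor's "current usage" is simply the sum of the VMEM_USAGE(BYTES) column over the whole process tree, i.e. the address space of the bash wrapper plus the JVM, not resident memory. A sketch based only on the quoted numbers:

    // Values copied from the process-tree dump above.
    val bashVmem  = 108638208L            // (bash) wrapper, pid 23852
    val javaVmem  = 2084134912L           // (java) -Xmx512M child, pid 23855
    val treeTotal = bashVmem + javaVmem   // 2192773120, exactly the reported "current usage"
    val limit     = 2048L * 1024 * 1024   // 2147483648, the 2048 MB container limit
    val killed    = treeTotal > limit     // true

So the kill is driven by the JVM's virtual size, even though the physical memory reported elsewhere in the thread is only around 70-120 MB.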
Either something is leaking in YARN, or my registerWithResourceManager code (see below) is doing something funky.

I'm trying to avoid going through all the pain of attaching a remote debugger. Presumably things aren't leaking in YARN, which means it's likely that I'm doing something wrong in my registration code.

Incidentally, my NodeManager is running with 1000 megs. My application master memory is set to 2048, and my -Xmx setting is 512M.

Cheers,
Chris
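A minimal sketch of the failing experiment described in this message, reusing the ApplicationMasterHelper from the code paste further down the thread (appId, attemptId, and timestamp are assumed to be the values parsed from the AM's command-line arguments, as in that paste):

    // Register with the RM, then do nothing: no asks, no containers launched.
    // Per the report above, the AM container is still killed by the vmem check.
    val helper = new ApplicationMasterHelper(appId, attemptId, timestamp, new Configuration)
      .registerWithResourceManager
    while (true) {
      println("yeh")
      Thread.sleep(1000)
    }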
________________________________________
From: Vinod Kumar Vavilapalli [vino...@hortonworks.com]
Sent: Wednesday, September 21, 2011 11:52 AM
To: mapreduce-dev@hadoop.apache.org
Subject: Re: ApplicationMaster Memory Usage

Actually, MAPREDUCE-2998 only applies to MRV2, so that isn't related.

Somehow, your JVM itself is taking that much virtual memory. Are you loading some native libs?

And how many containers have already been allocated by the time the AM crashes? Maybe you are accumulating some per-container data. You can try dumping the heap via hprof.

+Vinod
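One low-effort way to act on the hprof suggestion, without attaching a remote debugger, is to add the stock JDK HPROF agent to the AM's java command in the container launch context. A sketch; the flag is a standard JDK option, not anything YARN-specific:

    java -Xmx512M -agentlib:hprof=heap=dump,format=b -cp './package/*' kafka.yarn.ApplicationMaster ...

Two caveats: the binary heap dump is written when the JVM exits, so it can be lost if the NodeManager kills the container before the JVM shuts down cleanly, and it only covers the Java heap, not the native address space that the vmem check is actually measuring.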
On Wed, Sep 21, 2011 at 11:21 PM, Chris Riccomini <criccom...@linkedin.com> wrote:

Hey Vinod,

I svn up'd and rebuilt. My application's task (container) now runs!

Unfortunately, my application master eventually gets killed by the NodeManager anyway, and I'm still not clear as to why. The AM is just running a loop, asking for a container, and executing a command in the container. It keeps doing this over and over again. After a few iterations, it gets killed with something like:

2011-09-21 10:42:40,869 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(402)) - Memory usage of ProcessTree 21666 for container-id container_1316626117280_0002_01_000001 : Virtual 2260938752 bytes, limit : 2147483648 bytes; Physical 77398016 bytes, limit -1 bytes
2011-09-21 10:42:40,869 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:isProcessTreeOverLimit(289)) - Process tree for container: container_1316626117280_0002_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2147483648, current usage = 2260938752
2011-09-21 10:42:40,870 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(453)) - Container [pid=21666,containerID=container_1316626117280_0002_01_000001] is running beyond memory-limits. Current usage : 2260938752bytes. Limit : 2147483648bytes. Killing container.
Dump of the process-tree for container_1316626117280_0002_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 21669 21666 21666 21666 (java) 105 4 2152300544 18593 java -Xmx512M -cp ./package/* kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 2 1 1316626117280 com.linkedin.TODO 1
|- 21666 20570 21666 21666 (bash) 0 0 108638208 303 /bin/bash -c java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 2 1 1316626117280 com.linkedin.TODO 1 1>/tmp/logs/application_1316626117280_0002/container_1316626117280_0002_01_000001/stdout 2>/tmp/logs/application_1316626117280_0002/container_1316626117280_0002_01_000001/stderr
2011-09-21 10:42:40,870 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) - Removed ProcessTree with root 21666

I don't think that my AM is leaking memory. Full code paste after the break.

1. Do I need to release a container in my AM even if the AM receives it as a finished container in the resource request response? (See the sketch after the code paste below.)
2. Do I need to free any other resources after a resource request (e.g. ResourceRequest, AllocateRequest, etc.)?

Cheers,
Chris


def main(args: Array[String]) {
  // YARN will always give our ApplicationMaster
  // the first four parameters as: <package> <app id> <attempt id> <timestamp>
  val packagePath = args(0)
  val appId = args(1).toInt
  val attemptId = args(2).toInt
  val timestamp = args(3).toLong

  // these are our application master's parameters
  val streamerClass = args(4)
  val tasks = args(5).toInt

  // TODO log params here

  // start the application master helper
  val conf = new Configuration
  val applicationMasterHelper = new ApplicationMasterHelper(appId, attemptId, timestamp, conf)
    .registerWithResourceManager

  // start and manage the slaves
  val noReleases = List[ContainerId]()
  var runningContainers = 0

  // keep going forever
  while (true) {
    val nonRunningTasks = tasks - runningContainers
    val response = applicationMasterHelper.sendResourceRequest(nonRunningTasks, noReleases)

    response.getAllocatedContainers.foreach(container => {
      new ContainerExecutor(packagePath, container)
        .addCommand("java -Xmx256M -cp './package/*' kafka.yarn.StreamingTask " + streamerClass + " "
          + "1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout "
          + "2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr").execute(conf)
    })

    runningContainers += response.getAllocatedContainers.length
    runningContainers -= response.getCompletedContainersStatuses.length

    Thread.sleep(1000)
  }

  applicationMasterHelper.unregisterWithResourceManager("SUCCESS")
}


class ApplicationMasterHelper(iAppId: Int, iAppAttemptId: Int, lTimestamp: Long, conf: Configuration) {
  val rpc = YarnRPC.create(conf)
  val appId = Records.newRecord(classOf[ApplicationId])
  val appAttemptId = Records.newRecord(classOf[ApplicationAttemptId])
  val rmAddress = NetUtils.createSocketAddr(conf.get(YarnConfiguration.RM_SCHEDULER_ADDRESS,
    YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS))
  val resourceManager = rpc.getProxy(classOf[AMRMProtocol], rmAddress, conf).asInstanceOf[AMRMProtocol]
  var requestId = 0

  appId.setClusterTimestamp(lTimestamp)
  appId.setId(iAppId)
  appAttemptId.setApplicationId(appId)
  appAttemptId.setAttemptId(iAppAttemptId)

  def registerWithResourceManager(): ApplicationMasterHelper = {
    val req = Records.newRecord(classOf[RegisterApplicationMasterRequest])
    req.setApplicationAttemptId(appAttemptId)
    // TODO not sure why these are blank - this is how Spark does it
    req.setHost("")
    req.setRpcPort(1)
    req.setTrackingUrl("")
    resourceManager.registerApplicationMaster(req)
    this
  }

  def unregisterWithResourceManager(state: String): ApplicationMasterHelper = {
    val finReq = Records.newRecord(classOf[FinishApplicationMasterRequest])
    finReq.setAppAttemptId(appAttemptId)
    finReq.setFinalState(state)
    resourceManager.finishApplicationMaster(finReq)
    this
  }

  def sendResourceRequest(containers: Int, release: List[ContainerId]): AMResponse = {
    // TODO will need to make this more flexible for hostname requests, etc
    val request = Records.newRecord(classOf[ResourceRequest])
    val pri = Records.newRecord(classOf[Priority])
    val capability = Records.newRecord(classOf[Resource])
    val req = Records.newRecord(classOf[AllocateRequest])
    request.setHostName("*")
    request.setNumContainers(containers)
    pri.setPriority(1)
    request.setPriority(pri)
    capability.setMemory(128)
    request.setCapability(capability)
    req.setResponseId(requestId)
    req.setApplicationAttemptId(appAttemptId)
    req.addAllAsks(Lists.newArrayList(request))
    req.addAllReleases(release)
    requestId += 1
    // TODO we might want to return a list of container executors here instead of AMResponses
    resourceManager.allocate(req).getAMResponse
  }
}
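On question 1 above, a sketch of one way to hand containers back through the helper, assuming the containers to release are ones that were allocated but never launched (this is a guess at the intended use of the release list that sendResourceRequest already forwards via AllocateRequest.addAllReleases, not a confirmed answer; Container here is the same YARN record type returned by getAllocatedContainers):

    // Hypothetical helper, not part of the original paste: give back any
    // allocated containers the AM decides not to use by passing their ids
    // as the release list on the next allocate call.
    def releaseUnused(helper: ApplicationMasterHelper, unused: List[Container]): AMResponse =
      helper.sendResourceRequest(0, unused.map(_.getId))

Whether completed containers reported in getCompletedContainersStatuses also need an explicit release, and whether the request records themselves need any cleanup (question 2), is left open here.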
________________________________________
From: Vinod Kumar Vavilapalli [vino...@hortonworks.com]
Sent: Wednesday, September 21, 2011 10:08 AM
To: mapreduce-dev@hadoop.apache.org
Subject: Re: ApplicationMaster Memory Usage

Yes, the process dump clearly shows that this is MAPREDUCE-2998.

+Vinod
(With a smirk to see his container-memory-monitoring code in action)

On Wed, Sep 21, 2011 at 10:26 PM, Arun C Murthy <a...@hortonworks.com> wrote:

I'll bet you are hitting MR-2998.

From the changelog:

MAPREDUCE-2998. Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V.

Arun

On Sep 21, 2011, at 9:52 AM, Chris Riccomini wrote:

Hey Guys,

My ApplicationMaster is being killed by the NodeManager because of memory consumption, and I don't understand why. I'm using -Xmx512M, and setting my resource request to 2048.

  .addCommand("java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster " ...

  ...

  private var memory = 2048

  resource.setMemory(memory)
  containerCtx.setResource(resource)
  containerCtx.setCommands(cmds.toList)
  containerCtx.setLocalResources(Collections.singletonMap("package", packageResource))
  appCtx.setApplicationId(appId)
  appCtx.setUser(user.getShortUserName)
  appCtx.setAMContainerSpec(containerCtx)
  request.setApplicationSubmissionContext(appCtx)
  applicationsManager.submitApplication(request)

When this runs, I see (in my NodeManager's logs):

2011-09-21 09:35:19,112 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(402)) - Memory usage of ProcessTree 28134 for container-id container_1316559026783_0003_01_000001 : Virtual 2260938752 bytes, limit : 2147483648 bytes; Physical 71540736 bytes, limit -1 bytes
2011-09-21 09:35:19,112 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:isProcessTreeOverLimit(289)) - Process tree for container: container_1316559026783_0003_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2147483648, current usage = 2260938752
2011-09-21 09:35:19,113 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(453)) - Container [pid=28134,containerID=container_1316559026783_0003_01_000001] is running beyond memory-limits. Current usage : 2260938752bytes. Limit : 2147483648bytes. Killing container.
Dump of the process-tree for container_1316559026783_0003_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 28134 25886 28134 28134 (bash) 0 0 108638208 303 /bin/bash -c java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster 3 1 1316559026783 com.linkedin.TODO 1 1>/tmp/logs/application_1316559026783_0003/container_1316559026783_0003_01_000001/stdout 2>/tmp/logs/application_1316559026783_0003/container_1316559026783_0003_01_000001/stderr
|- 28137 28134 28134 28134 (java) 92 3 2152300544 17163 java -Xmx512M -cp ./package/* kafka.yarn.ApplicationMaster 3 1 1316559026783 com.linkedin.TODO 1

2011-09-21 09:35:19,113 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) - Removed ProcessTree with root 28134

It appears that YARN is honoring my 2048 setting, yet my process is somehow taking 2260938752 bytes. I don't think that I'm using nearly that much in permgen, and my heap is limited to 512. I don't have any JNI stuff running (that I know of), so it's unclear to me what's going on here. The only thing that I can think of is that Java's Runtime exec is forking, and copying its entire JVM memory footprint for the fork.

Has anyone seen this? Am I doing something dumb?

Thanks!
Chris

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira