[ https://issues.apache.org/jira/browse/MAPREDUCE-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247740#comment-16247740 ]
Peter Bacsko commented on MAPREDUCE-5124:
-----------------------------------------

Just a quick update on the GC usage improvement. I know the POC is not the final version, but I still decided to check how much it improves. I added a 2-second sleep to {{StatusUpdater.transition()}} to cause an event backlog and used mapper code that constantly called {{reporter.progress()}} in a loop. I also decreased the update interval to 100 ms.

GC events with the old code:

{noformat}
[GC (Allocation Failure) 52224K->8221K(200192K), 0.0130368 secs]
[GC (Allocation Failure) 60445K->10200K(252416K), 0.0119459 secs]
[GC (Metadata GC Threshold) 59477K->10902K(252416K), 0.0151800 secs]
[Full GC (Metadata GC Threshold) 10902K->9053K(201216K), 0.0446707 secs]
[GC (Allocation Failure) 113501K->19028K(251904K), 0.0136092 secs]
[GC (Metadata GC Threshold) 78026K->17595K(305664K), 0.0226579 secs]
[Full GC (Metadata GC Threshold) 17595K->12774K(347648K), 0.0501647 secs]
[GC (Allocation Failure) 221670K->24081K(377344K), 0.0199000 secs]
[GC (Allocation Failure) 260113K->29187K(378368K), 0.0277259 secs]
[GC (Allocation Failure) 265219K->39660K(373248K), 0.0384575 secs]
[GC (Allocation Failure) 267500K->48473K(378368K), 0.0370554 secs]
[GC (Allocation Failure) 276313K->55049K(371200K), 0.0417077 secs]
[GC (Allocation Failure) 275721K->61521K(365568K), 0.0270593 secs]
[GC (Allocation Failure) 275025K->67873K(359936K), 0.0417392 secs]
[GC (Allocation Failure) 274721K->74129K(345088K), 0.0531881 secs]
[GC (Allocation Failure) 274833K->80089K(347648K), 0.0270885 secs]
[GC (Allocation Failure) 274649K->85921K(345088K), 0.0313155 secs] <-- I killed the job at this point
{noformat}

With the POC:

{noformat}
[GC (Allocation Failure) 52224K->8183K(200192K), 0.0228069 secs]
[GC (Allocation Failure) 60407K->10370K(252416K), 0.0135163 secs]
[GC (Metadata GC Threshold) 60383K->10958K(252416K), 0.0174618 secs]
[Full GC (Metadata GC Threshold) 10958K->8924K(198144K), 0.0452158 secs]
[GC (Allocation Failure) 113372K->18810K(254976K), 0.0132976 secs]
[GC (Metadata GC Threshold) 80801K->17577K(302592K), 0.0137089 secs]
[Full GC (Metadata GC Threshold) 17577K->12903K(345088K), 0.0579774 secs]
[GC (Allocation Failure) 221799K->24221K(382976K), 0.0188251 secs]
[GC (Allocation Failure) 268445K->24870K(384000K), 0.0164503 secs]
[GC (Allocation Failure) 269094K->19999K(381952K), 0.0155673 secs] <-- final event
{noformat}

I think the difference speaks for itself.

> AM lacks flow control for task events
> -------------------------------------
>
>                 Key: MAPREDUCE-5124
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5124
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.0.3-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Peter Bacsko
>         Attachments: MAPREDUCE-5124-CoalescingPOC-1.patch, MAPREDUCE-5124-CoalescingPOC2.patch, MAPREDUCE-5124-proto.2.txt, MAPREDUCE-5124-prototype.txt
>
> The AM does not have any flow control to limit the incoming rate of events from tasks. If the AM is unable to keep pace with the rate of incoming events for a sufficient period of time then it will eventually exhaust the heap and crash. MAPREDUCE-5043 addressed a major bottleneck for event processing, but the AM could still get behind if it's starved for CPU and/or handling a very large job with tens of thousands of active tasks.
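To illustrate why coalescing reduces GC pressure: instead of allocating and queuing one event object per status report, only the latest pending status per task attempt is kept, so a flood of {{reporter.progress()}} calls collapses into a single dispatch per update interval. The sketch below is a minimal, self-contained illustration of that idea only; the class and method names ({{StatusCoalescer}}, {{report}}, {{drain}}) are hypothetical and do not correspond to the actual POC patch code.

{noformat}
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of status-update coalescing: many reports from the
// same task attempt overwrite one pending entry instead of queuing N events.
public class StatusCoalescer {
    // Latest progress per attempt id; a new report replaces the pending one,
    // so no per-call event object survives to be drained later.
    private final ConcurrentHashMap<String, Float> pending = new ConcurrentHashMap<>();

    // Called on every status report from a task (hot path, O(1) per call).
    public void report(String attemptId, float progress) {
        pending.put(attemptId, progress);
    }

    // Called by the dispatcher once per update interval; returns how many
    // coalesced updates were dispatched (one per attempt, not per report).
    public int drain() {
        int dispatched = 0;
        for (String id : pending.keySet()) {
            if (pending.remove(id) != null) {
                dispatched++; // the real AM would fire one status-update event here
            }
        }
        return dispatched;
    }

    public static void main(String[] args) {
        StatusCoalescer c = new StatusCoalescer();
        // Simulate a mapper spamming progress: 10,000 reports from one attempt.
        for (int i = 0; i < 10000; i++) {
            c.report("attempt_0", i / 10000f);
        }
        c.report("attempt_1", 0.5f);
        // 10,001 reports collapse into 2 dispatched updates.
        System.out.println(c.drain());
    }
}
{noformat}

Without coalescing, the 10,001 reports above would each allocate an event sitting in the backlog queue, which matches the steadily growing heap in the old-code GC log.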