[ https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612150#comment-16612150 ]
Hadoop QA commented on YARN-8234:
---------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red} 9m 54s{color} | {color:red} Docker failed to build yetus/hadoop:c2d96dd. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-8234 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12939409/YARN-8234-branch-2.8.3.004.patch |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21823/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.

> Improve RM system metrics publisher's performance by pushing events to
> timeline server in batch
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-8234
>                 URL: https://issues.apache.org/jira/browse/YARN-8234
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager, timelineserver
>    Affects Versions: 2.8.3
>            Reporter: Hu Ziqian
>            Assignee: Hu Ziqian
>            Priority: Critical
>         Attachments: YARN-8234-branch-2.8.3.001.patch,
> YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch,
> YARN-8234-branch-2.8.3.004.patch, YARN-8234.001.patch, YARN-8234.002.patch,
> YARN-8234.003.patch, YARN-8234.004.patch
>
>
> When the system metrics publisher is enabled, the RM pushes events to the
> timeline server via its REST API. If the cluster load is heavy, many events
> are sent to the timeline server and the timeline server's event handler
> thread becomes locked. YARN-7266 discusses this problem in detail. Because
> of the lock, the timeline server can't receive events as fast as the RM
> generates them, and many timeline events stay in the RM's memory.
> Finally, those events consume all of the RM's memory, and the RM starts a
> full GC (which causes a JVM stop-the-world pause and a timeout from the RM
> to ZooKeeper) or even hits an OOM.
> The main problem here is that the timeline server can't receive events as
> fast as the RM generates them. Currently, the RM system metrics publisher
> puts only one event in each request, so most of the time is spent handling
> HTTP headers or network connection overhead on the timeline server side.
> Only a little time is spent processing the timeline event itself, which is
> the truly valuable work.
> In this issue, we add a buffer to the system metrics publisher and let the
> publisher send events to the timeline server in batches, one batch per
> request. With the batch size set to 1000, our experiment showed a 100x
> improvement in the rate at which the timeline server receives events. We
> have deployed this feature in our production environment, which accepts
> 20000 apps per hour, and it works fine.
> We add the following configuration properties:
> * yarn.resourcemanager.system-metrics-publisher.batch-size: the number of
> events the system metrics publisher sends in one request. The default value
> is 1000.
> * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the
> event buffer in the system metrics publisher.
> * yarn.resourcemanager.system-metrics-publisher.interval-seconds: When
> batch publishing is enabled, we must prevent the publisher from holding
> events in the buffer for a long time while it waits for a batch to fill up.
> So we add another thread that periodically sends the events in the buffer.
> This property sets the interval of that periodic sending thread. The default
> value is 60s.
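
The batching scheme described in the issue (a bounded event buffer, a batch size, and a periodic flush thread) can be illustrated roughly as follows. This is only a minimal sketch under stated assumptions, not the code from the attached patches: the class name BatchingPublisher, its constructor parameters, and the Consumer that stands in for "one REST request to the timeline server" are all hypothetical. For brevity, the calling thread sends a batch as soon as one is full, and an event is dropped when the buffer is full; the actual patch may handle both cases differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical illustration only; not the class added by YARN-8234. */
class BatchingPublisher<E> implements AutoCloseable {
  private final BlockingQueue<E> buffer;      // plays the role of buffer-size
  private final int batchSize;                // plays the role of batch-size
  private final Consumer<List<E>> sender;     // stands in for one request
  private final ScheduledExecutorService flusher =
      Executors.newSingleThreadScheduledExecutor();

  BatchingPublisher(int bufferSize, int batchSize, long intervalSeconds,
                    Consumer<List<E>> sender) {
    this.buffer = new ArrayBlockingQueue<>(bufferSize);
    this.batchSize = batchSize;
    this.sender = sender;
    // Periodic flush so a partially filled batch is not held indefinitely
    // (plays the role of interval-seconds).
    flusher.scheduleWithFixedDelay(
        this::flush, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
  }

  /** Buffers one event; returns false if the buffer was full (assumed policy: drop). */
  boolean publish(E event) {
    boolean accepted = buffer.offer(event);
    if (buffer.size() >= batchSize) {
      flush();
    }
    return accepted;
  }

  /** Drains the buffer and hands it to the sender in chunks of at most batchSize. */
  synchronized void flush() {
    List<E> batch = new ArrayList<>(batchSize);
    while (buffer.drainTo(batch, batchSize) > 0) {
      sender.accept(batch);
      batch = new ArrayList<>(batchSize);
    }
  }

  @Override
  public void close() {
    flusher.shutdownNow();
    flush();  // send whatever is still buffered
  }
}
```

With a batch size of 3, publishing seven events results in two sender calls of three events each, and the remaining event goes out on the next periodic flush or on close(). The point of the design is that the per-request overhead is paid once per batch rather than once per event.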