[ https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15402597#comment-15402597 ]
Wangda Tan commented on YARN-4091:
----------------------------------

[~eepayne], thank you so much for reviewing this JIRA.

bq. I would be interested to know how you gathered this information

We used SLS to simulate a 2k-node cluster and added some debug logging to print the time (in nanoseconds) taken by each scheduler allocation.

bq. Also, how are you limiting the number of nodes whose state is being logged?

We deliberately designed the REST API for this JIRA to limit the number of nodes being recorded concurrently. The goal of this JIRA is to return a human-readable result and avoid a noticeable slowdown of the scheduler, so each request records only one node heartbeat. Too much data (such as 2,000 node heartbeats per second) returned by the scheduler would definitely not be readable by users. With this design, only a limited number of requests can be sent per second, which bounds the number of recorded node allocations per second. So from my perspective, recording 2,000 nodes all at once may not be a valid use case. Does that make sense to you? And if you're concerned about this API being abused by users, we can add ACLs or traffic control on the client (web UI) side or on the server side.

bq. I am concerned about the performance load this feature will add to the resource manager. I have analyzed the code and experimented with the feature on a 3-node cluster. It appears that the state is being recorded for every node on every heartbeat...

If you take a look at the implementation, startNodeUpdateRecording/finishNodeUpdateRecording only check whether a key exists in a ConcurrentHashMap when node recording is not enabled. In our performance test, we didn't see any added overhead compared to the original scheduler code without the patch.
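To illustrate the check described above, here is a minimal sketch of the idea. It is not the code in the patch: the class name, the method signatures, requestRecording, recordActivity and lastRecording are all hypothetical, and it assumes that only the scheduler thread appends to a node's activity list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only; names and signatures are assumptions, not the patch's API. */
public class NodeRecordingSketch {
    // Node ID -> activities for the one heartbeat that was requested for that node.
    private final ConcurrentHashMap<String, List<String>> pending = new ConcurrentHashMap<>();
    // Node ID -> activities of the most recently finished recording.
    private final ConcurrentHashMap<String, List<String>> finished = new ConcurrentHashMap<>();

    /** Called from the (hypothetical) REST handler: record the next heartbeat of this node. */
    public void requestRecording(String nodeId) {
        pending.putIfAbsent(nodeId, new ArrayList<>());
    }

    /** Called on every heartbeat; when nothing was requested this is a single map lookup. */
    public boolean startNodeUpdateRecording(String nodeId) {
        return pending.containsKey(nodeId);
    }

    /** Appends an activity only if a recording is pending for this node. */
    public void recordActivity(String nodeId, String activity) {
        List<String> log = pending.get(nodeId);
        if (log != null) {
            log.add(activity);
        }
    }

    /** Called at the end of every heartbeat; removes the request so only one heartbeat is recorded. */
    public void finishNodeUpdateRecording(String nodeId) {
        List<String> log = pending.remove(nodeId);
        if (log != null) {
            finished.put(nodeId, log);
        }
    }

    /** Returns the last finished recording for the node, or null if there is none. */
    public List<String> lastRecording(String nodeId) {
        return finished.get(nodeId);
    }
}
```

In this sketch a heartbeat from a node that nobody asked about pays for one containsKey, one get per activity and one remove, all of which miss, and a request is consumed by the first heartbeat that finishes after it.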
Also, I just wrote a quick test:
{code}
// Assumes java.util.ArrayList, java.util.List and
// java.util.concurrent.ConcurrentHashMap are imported.
ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
java.util.Random random = new java.util.Random();
List<String> arr = new ArrayList<>();

// Populate the map with 100k random keys and remember them for the lookups.
for (int i = 0; i < 100000; i++) {
  String s = String.valueOf(random.nextFloat());
  map.put(s, s);
  arr.add(s);
}

// Time 100k get operations.
long time = System.currentTimeMillis();
for (String s : arr) {
  map.get(s);
}
System.out.println(System.currentTimeMillis() - time);
{code}
The total time spent on 100k get operations is around 16 ms on my laptop, so each get takes only about 160 ns.

> Add REST API to retrieve scheduler activity
> -------------------------------------------
>
>                 Key: YARN-4091
>                 URL: https://issues.apache.org/jira/browse/YARN-4091
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacity scheduler, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Sunil G
>            Assignee: Chen Ge
>         Attachments: Improvement on debugdiagnostic information - YARN.pdf,
> SchedulerActivityManager-TestReport v2.pdf,
> SchedulerActivityManager-TestReport.pdf, YARN-4091-design-doc-v1.pdf,
> YARN-4091.1.patch, YARN-4091.2.patch, YARN-4091.3.patch, YARN-4091.4.patch,
> YARN-4091.5.patch, YARN-4091.5.patch, YARN-4091.preliminary.1.patch,
> app_activities.json, node_activities.json
>
>
> As schedulers are improved with various new capabilities, more configurations
> which tunes the schedulers starts to take actions such as limit assigning
> containers to an application, or introduce delay to allocate container etc.
> There are no clear information passed down from scheduler to outerworld under
> these various scenarios. This makes debugging very tougher.
> This ticket is an effort to introduce more defined states on various parts in
> scheduler where it skips/rejects container assignment, activate application
> etc. Such information will help user to know whats happening in scheduler.
> Attaching a short proposal for initial discussion. We would like to improve
> on this as we discuss.