[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15402597#comment-15402597
 ] 

Wangda Tan commented on YARN-4091:
----------------------------------

[~eepayne],

Thank you so much for reviewing this JIRA.

bq. I would be interested to know how you gathered this information
We used SLS to simulate a 2k-node cluster and added some debug logging to 
print the time (in nanoseconds) taken by each scheduler allocation.

bq. Also, how are you limiting the number of nodes whose state is being logged?
We deliberately designed the REST API for this JIRA to limit the number of 
nodes being recorded concurrently. The goal of this JIRA is to return a 
human-readable result and avoid a noticeable slowdown of the scheduler. So each 
request records only one node heartbeat. Too much data (like 2,000 node 
heartbeats per sec) returned by the scheduler would definitely not be readable 
by users.
Because each request records a single heartbeat, the number of node allocations 
recorded per sec is bounded by the number of requests sent per sec.
So from my perspective, recording 2,000 nodes all at once may not be a valid 
use case. Does that make sense to you? And if you're concerned that this API 
could be abused by users, we can add ACLs or traffic control on the client 
(web UI) side or the server side.
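
To illustrate the "one heartbeat per request" idea, here is a minimal sketch. It is not the code from the patch: the class OneShotNodeRecorder and its methods are hypothetical names chosen for illustration, and the real implementation may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch only; not the actual YARN-4091 implementation. */
class OneShotNodeRecorder {
  // Node IDs whose next heartbeat a user has asked to record.
  private final ConcurrentHashMap<String, Boolean> pending = new ConcurrentHashMap<>();
  // Recorded heartbeats (just the node ID here, to keep the sketch small).
  private final List<String> recorded = new ArrayList<>();

  /** Would be called by a REST handler: record the next heartbeat of nodeId. */
  void requestRecording(String nodeId) {
    pending.put(nodeId, Boolean.TRUE);
  }

  /** Called on every node heartbeat; returns true if this heartbeat was recorded. */
  boolean onHeartbeat(String nodeId) {
    // Fast path: a single map lookup when no recording was requested.
    if (!pending.containsKey(nodeId)) {
      return false;
    }
    // remove() returns non-null for exactly one caller, so each request
    // records exactly one heartbeat.
    if (pending.remove(nodeId) == null) {
      return false;
    }
    synchronized (recorded) {
      recorded.add(nodeId);
    }
    return true;
  }

  int recordedCount() {
    synchronized (recorded) {
      return recorded.size();
    }
  }

  public static void main(String[] args) {
    OneShotNodeRecorder recorder = new OneShotNodeRecorder();
    recorder.requestRecording("node-1");
    for (int i = 0; i < 3; i++) {
      System.out.println("heartbeat " + i + " recorded: " + recorder.onHeartbeat("node-1"));
    }
  }
}
```

With this shape, the amount of recorded data is bounded by the request rate rather than by the cluster's heartbeat rate.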

bq. I am concerned about the performance load this feature will add to the 
resource manager. I have analyzed the code and experimented with the feature on 
a 3-node cluster. It appears that the state is being recorded for every node on 
every heartbeat...
If you take a look at the implementation, 
startNodeUpdateRecording/finishNodeUpdateRecording only check whether a key 
exists in a ConcurrentHashMap when node recording is not enabled. In our 
performance test, we didn't see any added overhead compared to the original 
scheduler code without the patch applied. Also, I just wrote a quick test:
{code}
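    // Needs java.util.ArrayList, java.util.List and
    // java.util.concurrent.ConcurrentHashMap to be imported.
    ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();

    java.util.Random random = new java.util.Random();
    List<String> arr = new ArrayList<>();

    // Populate the map with 100k random keys and remember each key.
    for (int i = 0; i < 100000; i++) {
      String s = String.valueOf(random.nextFloat());
      map.put(s, s);
      arr.add(s);
    }

    // Time 100k lookups; every key is present in the map.
    long time = System.currentTimeMillis();
    for (String s : arr) {
      map.get(s);
    }
    System.out.println(System.currentTimeMillis() - time);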
{code}

The total time spent on 100k get operations is around 16 ms on my laptop, so 
each get takes only about 160 ns.
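
For anyone who wants to repeat the measurement, here is a slightly more careful variant as a sketch. The class name MapGetTiming and its helper methods are made up for this example. It uses System.nanoTime, runs a few warm-up passes first, and counts hits so the loop does real work. Numbers will vary by machine and JVM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical timing sketch; results depend on hardware and JVM. */
class MapGetTiming {
  /** Average cost per operation in nanoseconds. */
  static double nanosPerOp(long totalNanos, int ops) {
    return (double) totalNanos / ops;
  }

  /** Times one pass of get() over all keys and returns the elapsed nanoseconds. */
  static long timeGets(ConcurrentHashMap<String, String> map, List<String> keys) {
    int hits = 0;
    long start = System.nanoTime();
    for (String key : keys) {
      if (map.get(key) != null) {
        hits++;
      }
    }
    long elapsed = System.nanoTime() - start;
    if (hits != keys.size()) {
      throw new IllegalStateException("unexpected miss");
    }
    return elapsed;
  }

  public static void main(String[] args) {
    ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
    List<String> keys = new ArrayList<>();
    Random random = new Random(42);
    for (int i = 0; i < 100_000; i++) {
      // Prefix with the index so that all keys are distinct.
      String s = i + ":" + random.nextFloat();
      map.put(s, s);
      keys.add(s);
    }
    // Warm-up passes so the timed pass is less affected by JIT compilation.
    for (int i = 0; i < 5; i++) {
      timeGets(map, keys);
    }
    long elapsed = timeGets(map, keys);
    System.out.printf("%d gets in %d ns, about %.1f ns per get%n",
        keys.size(), elapsed, nanosPerOp(elapsed, keys.size()));
  }
}
```

As a sanity check on the arithmetic, nanosPerOp(16_000_000L, 100_000) gives 160.0, i.e. 16 ms spread over 100k gets is 160 ns per get.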


> Add REST API to retrieve scheduler activity
> -------------------------------------------
>
>                 Key: YARN-4091
>                 URL: https://issues.apache.org/jira/browse/YARN-4091
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacity scheduler, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Sunil G
>            Assignee: Chen Ge
>         Attachments: Improvement on debugdiagnostic information - YARN.pdf, 
> SchedulerActivityManager-TestReport v2.pdf, 
> SchedulerActivityManager-TestReport.pdf, YARN-4091-design-doc-v1.pdf, 
> YARN-4091.1.patch, YARN-4091.2.patch, YARN-4091.3.patch, YARN-4091.4.patch, 
> YARN-4091.5.patch, YARN-4091.5.patch, YARN-4091.preliminary.1.patch, 
> app_activities.json, node_activities.json
>
>
> As schedulers are improved with various new capabilities, more configurations 
> that tune the schedulers start to take actions such as limiting the assignment 
> of containers to an application, or introducing a delay in allocating a 
> container. No clear information is passed down from the scheduler to the 
> outside world in these scenarios, which makes debugging much harder.
> This ticket is an effort to introduce more clearly defined states at the 
> various points in the scheduler where it skips or rejects a container 
> assignment, activates an application, etc. Such information will help users 
> understand what is happening in the scheduler.
> Attaching a short proposal for initial discussion. We would like to improve 
> on this as we discuss.



