[ https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169779#comment-14169779 ]
Maysam Yabandeh commented on HDFS-6982:
---------------------------------------

Thanks [~andrew.wang] for the detailed review. I will submit a new patch soon. In the meantime, let me double-check a couple of points with you.

bq. Since I don't see any modifications to any existing files, I'm also wondering how this is exposed to JMX or on the webUI.
You are right. I was not sure where the best place is to integrate nntop with the NN. I will pick a place and we can update it later.

bq. There's only a {{getDefaultRollingWindow}} class, no other ways of constructing a RollingWindow.
The design doc envisions two interfaces to access the top users. One is JMX, which requires a rolling window over only one reporting period, say 1 minute. JMX data, however, are most useful when they are integrated with an external graphing tool. To also allow users with small clusters to benefit from the data computed by nntop, we also provide an html interface, which has no graphing capability. This basic interface unfortunately does not give the viewer a sense of *trend*. To compensate for that, the html page will show the top users over multiple time periods, say 1, 5, and 25 minutes; that is why we have multiple rolling window periods in nntop. One of them is used for the JMX interface, and it is the one specified by {{getDefaultRollingWindow}}.

About the html interface, I excluded it from this patch for two reasons. First, I figured it is better to keep this patch as small as possible and work on the html interface patch in a separate jira. Second, I had previously used the yarn html utils, and I will have to rewrite that part using the html utils that are standard to the hdfs project.

bq. How do we configure multiple reporting periods?
Via some conf params. I will make sure that the docs reflect that properly.

bq. WEB_PORT and DEFAULT_WEB_PORT seem to be unused
You are right. They are supposed to be used by the html interface, but I should remove them from this patch.

bq. getCmdTotal and getTopMetricsRecordPrefix static getters are only used in TopMetrics, that might be a better home.
They will later be used by the html interface as well. The html interface will show the total operations on top and then the details of each command afterwards.

bq. Rather than MIN_2_MS, could we have a long array with the default periods, i.e. DEFAULT_REPORTING_PERIODS?
In addition to the previous explanation about multiple reporting periods for the html view, I should add that the reporting periods are expected to be specified in the conf file. I dropped the method that reads them from the conf file from the patch since it was invoked only via the html interface. But I guess I should put it back to avoid confusion.

bq. report, we construct the permStr, but don't actually use it.
You are right. I can actually drop src, dst, and also status. At the beginning, the vision for nntop was to also report hot directories, etc., and that is why we kept the full details in the report method. But I guess we can always put such details back if at some point those visions were to be pursued.

bq. report, I don't think we need the catch for Throwable t, no checked exceptions are being thrown?
The idea was that any unexpected problem caused by a programming bug in nntop should not crash the name node.

bq. TopUtil: This stuff isn't shared much, seems like we could just move things to where they're used
TopUtil was much fatter when it also included the html view util functions. The html view will also be a user of TopUtil.

bq. TopMetricsCollector: Is this used?
Yeah, by the html view. I should drop it from this patch.
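
To make the multiple-reporting-period idea above more concrete, here is a minimal sketch of a bucketed rolling-window counter. It is only an illustration under assumptions: {{RollingWindowSketch}}, its constructor parameters, and its methods are hypothetical names and are not the RollingWindow class from the patch. Timestamps are assumed to be non-negative milliseconds.

```java
/**
 * Hypothetical sketch of a rolling-window counter; NOT the RollingWindow
 * class from the HDFS-6982 patch. One instance covers one reporting period.
 */
final class RollingWindowSketch {
    private final long bucketMs;     // width of one bucket in milliseconds
    private final long[] counts;     // one counter per bucket slot
    private final long[] bucketIds;  // absolute bucket id currently held by each slot

    RollingWindowSketch(long periodMs, int numBuckets) {
        if (periodMs <= 0 || numBuckets <= 0 || periodMs % numBuckets != 0) {
            throw new IllegalArgumentException("period must be a positive multiple of numBuckets");
        }
        this.bucketMs = periodMs / numBuckets;
        this.counts = new long[numBuckets];
        this.bucketIds = new long[numBuckets];
        java.util.Arrays.fill(bucketIds, -1L); // -1 marks a slot that was never written
    }

    /** Records delta operations at time nowMs, recycling a slot if it holds an older bucket. */
    synchronized void increment(long nowMs, long delta) {
        long id = nowMs / bucketMs;
        int slot = (int) (id % counts.length);
        if (bucketIds[slot] != id) {
            bucketIds[slot] = id;
            counts[slot] = 0L;
        }
        counts[slot] += delta;
    }

    /** Sums the buckets that still fall inside the window ending at nowMs. */
    synchronized long sum(long nowMs) {
        long newest = nowMs / bucketMs;
        long oldest = newest - counts.length + 1;
        long total = 0L;
        for (int i = 0; i < counts.length; i++) {
            if (bucketIds[i] >= oldest && bucketIds[i] <= newest) {
                total += counts[i];
            }
        }
        return total;
    }
}
```

Feeding the same events into several instances, for example one with a 1-minute period and one with a 5-minute period, gives the kind of side-by-side view described for the html page, while a single default instance would be enough for JMX. The bucket granularity means the window boundary is approximate, which seems acceptable for a top-like report.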
> nntop: top-like tool for name node users
> -----------------------------------------
>
>                 Key: HDFS-6982
>                 URL: https://issues.apache.org/jira/browse/HDFS-6982
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Maysam Yabandeh
>            Assignee: Maysam Yabandeh
>         Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, nntop-design-v1.pdf
>
>
> In this jira we motivate the need for nntop, a tool that, similarly to what
> top does in Linux, gives the list of top users of the HDFS name node and
> gives insight about which users are sending majority of each traffic type to
> the name node. This information turns out to be the most critical when the
> name node is under pressure and the HDFS admin needs to know which user is
> hammering the name node and with what kind of requests. Here we present the
> design of nntop which has been in production at Twitter in the past 10
> months. nntop proved to have low cpu overhead (< 2% in a cluster of 4K
> nodes), low memory footprint (less than a few MB), and quite efficient for
> the write path (only two hash lookup for updating a metric).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)