[ 
https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207528#comment-14207528
 ] 

Andrew Wang commented on HDFS-6982:
-----------------------------------

Hi Maysam, I did another round of review. Mostly usability stuff and small 
cleanups, core functionality as before still looks great. As always, thanks for 
your patience.

TopConf / DFSConfigKeys / hdfs-default.xml
* Can use Configuration.getInts rather than parsing yourself, also would be 
good to do basic validation that the intervals are >0 mins via a Precondition 
check.
* Do you mind writing out the key strings in DFSConfigKeys? It's how we do it 
in the rest of the file, and more readable.
* Consider renaming "windowmanager.topsize" to "num.users"
* Consider renaming "window.buckets" to "window.num.buckets" and "periods.min" 
to "windows.minutes"
* We could set the default value to {{int[] {1,5,25}}} to make the logic a 
little easier. I noticed that the default in hdfs-default and DFSConfigKeys is 
different too.
* Seems like 30min would be a more human-friendly number than 25min also
* hdfs-default.xml typo: "seperated" -> "separated"
* I noticed that NameNodeMetrics fills in sessionId and processId, any reason 
we can't do the same?
* Would be cool to INFO log the configuration settings at startup, this can be 
pretty handy
* It also seems like an unnecessary step to also have to specify the 
TopAuditLogger in the conf, if a user already specified 
{{dfs.namenode.top.periods.min}}. If there are periods set, let's just also 
create the TopAuditLogger.

TopMetrics:
* Should fill out the DummyDescription to not be dummy :)
* Some unused methods: initSingleton(Configuration, String, String), 
getReportingPeriods, isTopMetricRecord, extractPeriodFormMetricRecordName. 
Guessing these are used in the HTML view, could we remove these for now?
* In getMetrics, would prefer the info log to be debug, and the error to be 
warn. Error is normally for something nigh fatal, warn is something we can ride 
out.
* The username string parsing is a little brittle, would be good to put a 
little static helper method in UserGroupInformation next to toString in case 
someone changes the format.

RollingWindowManager:
* We can make the asserts in the constructor into Precondition checks, we want 
these even in production

RollingWindow:
* Is it possible to use monotonic time, i.e. Time.monotonicNow, rather than 
system time? We've been bitten by clocks moving around before, and I think 
relative time should work here. I see we make it into a Date in 
RollingWindow#getSum, but just for debugging.
* Unused import

Experiences running it:
* The jmx page right now only exposes the shortest window, I see the 
{{smallestWindow}} hardcode. Since the goal is to use jmx to populate the HTML 
view, we need to somehow expose all of the configured windows.
* I tried doing some operations, then waited for the window to pass. The counts 
never returned to zero. I think this is because in RWManager#snapshot, we have 
a continue if {{n == 0}}, so never push the 0 count.
* It's also unfortunate that there's no way to remove a metric once we add it, 
so over time we'll end up with a lot of garbage {{user.operation}} strings that 
don't apply to the current window. Thoughts welcome here.

> nntop: top­-like tool for name node users
> -----------------------------------------
>
>                 Key: HDFS-6982
>                 URL: https://issues.apache.org/jira/browse/HDFS-6982
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Maysam Yabandeh
>            Assignee: Maysam Yabandeh
>         Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, HDFS-6982.v3.patch, 
> HDFS-6982.v4.patch, HDFS-6982.v5.patch, HDFS-6982.v6.patch, 
> nntop-design-v1.pdf
>
>
> In this jira we motivate the need for nntop, a tool that, similarly to what 
> top does in Linux, gives the list of top users of the HDFS name node and 
> gives insight about which users are sending majority of each traffic type to 
> the name node. This information turns out to be the most critical when the 
> name node is under pressure and the HDFS admin needs to know which user is 
> hammering the name node and with what kind of requests. Here we present the 
> design of nntop which has been in production at Twitter in the past 10 
> months. nntop proved to have low cpu overhead (< 2% in a cluster of 4K 
> nodes), low memory footprint (less than a few MB), and quite efficient for 
> the write path (only two hash lookup for updating a metric).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to