[ https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126018#comment-14126018 ]
Maysam Yabandeh commented on HDFS-6982:
---------------------------------------

bq. I anticipate that one use case is to push the metrics ganglia / nagios directly. It should not require any aggregation. Am I correct?

I see another proposal here, different both from what the design doc suggests and from the one that has proved to work in practice. I guess we first need to analyze the pros and cons of this alternative and reach a full-fledged design before we plan according to it. So let me try to understand your vision better by asking some questions about its details.

Let's say you have a cluster with 1M ops/min. If the aggregator is not part of the NN, how is such a large volume of events transferred to the aggregator? Do you envision that JMX creates a metric per command run on the name node, and Ganglia reads these over the network and aggregates them? Do you have numbers for the volume of data that would need to be transferred to Ganglia in real time with this approach, and for how much traffic Ganglia can handle with reasonable overhead?

In the second architecture in the design doc, where the aggregator is placed in a separate process, it benefits from the existing local log files and parses them from memory before the file system pushes them to disk, which allows very efficient processing of such a large volume of data. The aggregator then generates a very small set of top users, which is efficient to transfer to the monitoring tool over the network.
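
To make the size argument concrete, here is a minimal sketch of the kind of aggregation I have in mind. It is not the code from HDFS-6982.patch: the class name {{TopUsersSketch}} and the methods {{record}} and {{top}} are hypothetical, and time windows and thread safety are left out. The only point is that per-command events stay inside the process and only a short top-N list would ever need to leave it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical aggregator sketch; not the class from the attached patch. */
class TopUsersSketch {
    // op name -> (user name -> number of commands seen so far)
    private final Map<String, Map<String, Long>> counts = new HashMap<>();

    /** Records one command: one hash lookup by op, one by user. */
    void record(String op, String user) {
        counts.computeIfAbsent(op, k -> new HashMap<>())
              .merge(user, 1L, Long::sum);
    }

    /** Returns up to n users for the op, highest count first, ties broken by name. */
    List<Map.Entry<String, Long>> top(String op, int n) {
        Map<String, Long> perUser =
            counts.getOrDefault(op, Collections.<String, Long>emptyMap());
        List<Map.Entry<String, Long>> entries = new ArrayList<>(perUser.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // Copy the prefix so the caller gets an independent, small list.
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }
}
```

With something like this, the monitoring tool would receive only the result of {{top(op, n)}} per reporting period, rather than one event per command.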
> nntop: top-like tool for name node users
> -----------------------------------------
>
>                 Key: HDFS-6982
>                 URL: https://issues.apache.org/jira/browse/HDFS-6982
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Maysam Yabandeh
>            Assignee: Maysam Yabandeh
>         Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, nntop-design-v1.pdf
>
>
> In this jira we motivate the need for nntop, a tool that, similarly to what
> top does in Linux, gives the list of top users of the HDFS name node and
> gives insight about which users are sending majority of each traffic type to
> the name node. This information turns out to be the most critical when the
> name node is under pressure and the HDFS admin needs to know which user is
> hammering the name node and with what kind of requests. Here we present the
> design of nntop which has been in production at Twitter in the past 10
> months. nntop proved to have low cpu overhead (< 2% in a cluster of 4K
> nodes), low memory footprint (less than a few MB), and quite efficient for
> the write path (only two hash lookup for updating a metric).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)