[ 
https://issues.apache.org/jira/browse/HDFS-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204308#comment-13204308
 ] 

Todd Lipcon commented on HDFS-2510:
-----------------------------------

Sorry, missed the comment above:
{quote}
Similarly, I couldn't think of anything useful an operator could get from this. 
It also doesn't help the situation that currently all DN metrics are 
per-DN-daemon, not per BP offer service. Thus, it's not obvious how to get 
meaningful DN-side metrics for just a single namespace.
{quote}

I think a useful metric which could be exposed is {{max(time since last 
successful communication)}}. This would help diagnose cases where one of the 
racks gets partitioned off from one of the NNs, for example -- the metric would 
start to rise on all of the DNs in that rack.
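
To make the idea concrete, here is a rough sketch of how a DN could compute 
such a value. It assumes the max is taken over the NNs a single DN reports to, 
which is only one possible reading. {{LastContactTracker}} and its methods are 
hypothetical names for illustration; they are not part of the patch or of any 
existing Hadoop API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper, not part of the HDFS-2510 patch or any Hadoop API. */
final class LastContactTracker {
    // Time (ms) of the last successful communication, keyed by an
    // illustrative NN identifier.
    private final Map<String, Long> lastSuccessMillis = new ConcurrentHashMap<>();

    /** Record a successful call (e.g. a heartbeat) to the given NN. */
    void recordSuccess(String nnId, long nowMillis) {
        lastSuccessMillis.put(nnId, nowMillis);
    }

    /**
     * Returns max over known NNs of (now - last successful contact),
     * or 0 if no contact has been recorded yet.
     */
    long maxMillisSinceLastContact(long nowMillis) {
        long max = 0;
        for (long last : lastSuccessMillis.values()) {
            max = Math.max(max, nowMillis - last);
        }
        return max;
    }
}
```

The current time is passed in explicitly so the calculation is easy to test. 
A real implementation would more likely read a monotonic clock and publish the 
value through the existing DN metrics machinery.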

That said, the ones you've implemented here are fine and the most crucial, so 
+1 to the current patch and we can discuss adding some more DN-side metrics 
separately.
                
> Add HA-related metrics
> ----------------------
>
>                 Key: HDFS-2510
>                 URL: https://issues.apache.org/jira/browse/HDFS-2510
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: data-node, ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Aaron T. Myers
>            Assignee: Aaron T. Myers
>         Attachments: HDFS-2510-HDFS-1623.patch, HDFS-2510.HDFS-1623.patch
>
>
> Off the top of my head, I can think of:
> NN metrics:
> * A binary metric for active or standby
> * The size of the pending DN message queues
> * A timestamp for when the standby NN last read from shared edit log
> * The difference between highest generation stamp seen from the shared edit 
> log and the highest generation stamp seen from any DN
> It would probably also be useful to have a DN metric which somehow describes 
> which active/standby NNs it's talking to, e.g. "times since last communicated 
> with standby/active NNs."
> I'm sure there are others as well. Comments strongly encouraged.


        
