[
https://issues.apache.org/jira/browse/HDFS-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405170#comment-13405170
]
Todd Lipcon commented on HDFS-3590:
-----------------------------------
Nice idea. Per-directory metrics would also be a nice improvement, but the log
makes good sense. I'd warn at a much earlier threshold, like 5 seconds.
> Print a WARN if the edit log sync period takes more than X time units
> ---------------------------------------------------------------------
>
> Key: HDFS-3590
> URL: https://issues.apache.org/jira/browse/HDFS-3590
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: name-node
> Reporter: Harsh J
> Priority: Minor
>
> If a logSync operation, which happens for calls such as FS#create() after
> the edit has been made to the NN metadata, takes longer than X seconds (I'd
> say that if it takes more than a minute, something is really wrong with the
> volume it got stuck on), we should log a WARN naming the volume that most
> likely caused the delay. When an NN runs with multiple NFS volumes, this
> helps track down which particular volume was responsible, as there are no
> per-NN-dir metrics of any kind.
> I ran into a situation today where a hard-mounted NFS point hung for over X
> minutes, but after it recovered there was no indication in the NN's logs
> that such an event had happened (recovering so late caused its own slew of
> issues, for which I'll file other improvement JIRAs), aside from the Sync
> (Journal Sync) metric spiking as the elapsed sync time rose. A log message
> would have saved time investigating this, and possibly would also have
> pinpointed the bad location more accurately.
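
The proposal above could look roughly like the following. This is only a
minimal sketch, not the actual NameNode code: the class SlowSyncWarner, the
method timedSync, and the injected clock are all hypothetical names made up
for illustration, and the 5000 ms threshold in the usage note simply follows
the 5-second suggestion in the comment. The idea is to time the sync of each
storage directory separately, so that the WARN can name the slow one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of Hadoop): times the sync of one edit log
 * directory and records a WARN line when it exceeds a threshold.
 */
final class SlowSyncWarner {
    private final long warnThresholdMs;
    private final LongSupplier clockMs; // injected so the timing is testable
    private final List<String> warnings = new ArrayList<>();

    SlowSyncWarner(long warnThresholdMs, LongSupplier clockMs) {
        this.warnThresholdMs = warnThresholdMs;
        this.clockMs = clockMs;
    }

    /** Runs the sync for one directory and returns the elapsed milliseconds. */
    long timedSync(String directory, Runnable sync) {
        long start = clockMs.getAsLong();
        try {
            sync.run();
        } finally {
            // Measured in finally so a sync that fails late is still reported.
            long elapsed = clockMs.getAsLong() - start;
            if (elapsed > warnThresholdMs) {
                String msg = "WARN: sync of edit log directory " + directory
                        + " took " + elapsed + " ms (threshold "
                        + warnThresholdMs + " ms)";
                warnings.add(msg);
                System.err.println(msg);
            }
        }
        return clockMs.getAsLong() - start;
    }

    /** Warnings recorded so far, in the order they were raised. */
    List<String> warnings() {
        return warnings;
    }
}
```

With a monotonic clock, for example
new SlowSyncWarner(5000L, () -> System.nanoTime() / 1_000_000L), each
directory's flush would be wrapped in its own timedSync call. Keeping one
timer per directory rather than one around the whole sync is what lets the
message point at a specific volume.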