[
https://issues.apache.org/jira/browse/HDFS-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405333#comment-13405333
]
Andy Isaacson commented on HDFS-3590:
-------------------------------------
I'm +1 on the concept of logging a message when IO is slow; I've used such log
messages successfully in the past to diagnose system problems.
At 5 seconds we'll see lots of log messages from systems whose IO is just
generally slow. It only takes 500 requests queued in front of you (at roughly
10 ms each) to delay you for 5 seconds, or just one media error with firmware
retry. This is fine as a log message (it helps diagnose slowness), but a
5 second delay does not justify a warning or error.
At 60 seconds we would probably not see any false positives and a warning or
error would be reasonable.
The message should be rate-limited (you don't want your log messages to
generate additional IO load causing the problem to get worse) and should
include the actual elapsed time to 1ms accuracy if possible.
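A minimal sketch of what I mean, assuming a hypothetical helper class (the
SlowSyncReporter name, the 10 second rate limit, and the message wording are
illustrative and not existing HDFS code). It stays silent for fast syncs,
picks the level from the 5 and 60 second thresholds discussed above, drops
messages that arrive too soon after the previous one, and reports the elapsed
time in milliseconds:

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of HDFS: classifies and rate-limits slow-sync messages. */
final class SlowSyncReporter {
    // Thresholds come from the discussion above; the rate limit is an assumed value.
    static final long INFO_THRESHOLD_MS = 5_000;
    static final long WARN_THRESHOLD_MS = 60_000;
    static final long MIN_INTERVAL_MS = 10_000;

    private boolean reportedBefore = false;
    private long lastReportMs = 0;

    /**
     * Returns the line to log, or null if the sync was fast enough or a
     * message was already emitted less than MIN_INTERVAL_MS ago.
     *
     * startNanos and endNanos are System.nanoTime() readings taken around the
     * sync; nowMs is the current time in milliseconds from a monotonic clock.
     */
    synchronized String report(String volume, long startNanos, long endNanos, long nowMs) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        if (elapsedMs < INFO_THRESHOLD_MS) {
            return null; // not slow enough to mention
        }
        if (reportedBefore && nowMs - lastReportMs < MIN_INTERVAL_MS) {
            return null; // rate limit: do not add log IO while IO is already slow
        }
        reportedBefore = true;
        lastReportMs = nowMs;
        String level = elapsedMs >= WARN_THRESHOLD_MS ? "WARN" : "INFO";
        return level + " edit log sync on " + volume + " took " + elapsedMs + " ms";
    }
}
```

In this sketch the rate limit applies to both levels, so a WARN that closely
follows an INFO is dropped as well; a real implementation might count the
suppressed messages or let warnings bypass the limit.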
> Print a WARN if the edit log sync period takes more than X time units
> ---------------------------------------------------------------------
>
> Key: HDFS-3590
> URL: https://issues.apache.org/jira/browse/HDFS-3590
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: name-node
> Reporter: Harsh J
> Priority: Minor
>
> If a logSync operation, which happens for calls such as FS#create() after
> the edit has been made at the NN metadata, takes longer than X seconds (I'd
> say if it took more than a minute, there's something really wrong with the
> volume it probably got stuck on), we should log a WARN naming the volume that
> likely caused it. If an NN runs with multiple NFS volumes, this helps track
> down which particular volume was responsible, as there are no per-NN-dir
> metrics of any kind.
> I ran into a situation today where a hard-mounted NFS point hung for over X
> minutes but there was no indication in NN's logs after it recovered
> (recovering so late caused its own slew of issues for which I'll file other
> improvement JIRAs) that such an event happened, aside from the Sync (Journal
> Sync) metric spiking with the elapsed sync time value rising. A log would
> have helped save time investigating this, and possibly would have also
> pin-pointed the bad location more accurately.