[ 
https://issues.apache.org/jira/browse/HDFS-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405333#comment-13405333
 ] 

Andy Isaacson commented on HDFS-3590:
-------------------------------------

I'm +1 on the concept of logging a message when IO is slow; I've used such log 
messages successfully in the past to diagnose system problems.

At 5 seconds we'll see lots of log messages from systems whose IO is just 
generally slow.  It only takes 500 requests queued in front of you, at about 
10ms each, to delay you for 5 seconds (or just one media error with firmware 
retry).  This is fine as a log message (it helps diagnose slowness) but a 
5 second delay does not justify a warning or error.

At 60 seconds we would probably not see any false positives and a warning or 
error would be reasonable.

The message should be rate-limited (you don't want your log messages to 
generate additional IO load causing the problem to get worse) and should 
include the actual elapsed time to 1ms accuracy if possible.
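
To make the suggestion concrete, here is a rough sketch of what such a check 
could look like. It is not taken from Hadoop: the class name SlowSyncReporter, 
its methods, and the 10 second minimum interval between messages are all 
assumptions made for illustration. The 5 second and 60 second thresholds are 
the ones discussed above. The caller passes in the clock value so the 
behaviour is easy to test.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of Hadoop: classifies and rate-limits slow-sync messages. */
final class SlowSyncReporter {
    enum Level { NONE, INFO, WARN }

    private final long infoThresholdNanos;
    private final long warnThresholdNanos;
    private final long minIntervalNanos;   // assumed rate limit between emitted messages
    private boolean reportedBefore = false;
    private long lastReportNanos = 0L;
    private long suppressed = 0L;

    SlowSyncReporter(long infoThresholdMs, long warnThresholdMs, long minIntervalMs) {
        this.infoThresholdNanos = TimeUnit.MILLISECONDS.toNanos(infoThresholdMs);
        this.warnThresholdNanos = TimeUnit.MILLISECONDS.toNanos(warnThresholdMs);
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMs);
    }

    /** Maps an elapsed sync time to a log level using the two thresholds. */
    Level classify(long elapsedNanos) {
        if (elapsedNanos >= warnThresholdNanos) return Level.WARN;
        if (elapsedNanos >= infoThresholdNanos) return Level.INFO;
        return Level.NONE;
    }

    /**
     * Returns the message to log, or null if the sync was fast enough or the
     * message was suppressed by the rate limit. nowNanos is a monotonic clock
     * reading such as System.nanoTime().
     */
    synchronized String report(String volume, long elapsedNanos, long nowNanos) {
        Level level = classify(elapsedNanos);
        if (level == Level.NONE) {
            return null;
        }
        if (reportedBefore && nowNanos - lastReportNanos < minIntervalNanos) {
            suppressed++;   // count it, but add no IO load of our own
            return null;
        }
        String message = String.format(Locale.ROOT,
            "%s: edit log sync on %s took %d ms (%d earlier slow-sync messages suppressed)",
            level, volume, TimeUnit.NANOSECONDS.toMillis(elapsedNanos), suppressed);
        reportedBefore = true;
        lastReportNanos = nowNanos;
        suppressed = 0L;
        return message;
    }
}
```

A caller would read System.nanoTime() before and after the sync of each 
volume, pass the difference to report(), and hand any non-null result to its 
own logger at the indicated level. This sketch uses a single rate limit for 
both levels, so a WARN can be suppressed by a recent INFO; a real 
implementation might want to limit each level separately.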
                
> Print a WARN if the edit log sync period takes more than X time units
> ---------------------------------------------------------------------
>
>                 Key: HDFS-3590
>                 URL: https://issues.apache.org/jira/browse/HDFS-3590
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: Harsh J
>            Priority: Minor
>
> If a logSync operation, which happens for calls such as FS#create() after 
> the edit has been made at the NN metadata, takes longer than X seconds (I'd 
> say if it took more than a minute, there's something really wrong with the 
> volume it probably got stuck on), we should log a WARN with the volume that 
> is most likely responsible. If an NN runs with multiple NFS volumes, this 
> helps track down which particular volume caused the delay, as there are no 
> per-NN-dir metrics of any kind.
> I ran into a situation today where a hard-mounted NFS point hung for over X 
> minutes but there was no indication in NN's logs after it recovered 
> (recovering so late caused its own slew of issues for which I'll file other 
> improvement JIRAs) that such an event happened, aside from the Sync (Journal 
> Sync) metric spiking as the elapsed sync time rose. A log message would 
> have helped save time investigating this, and possibly would have also 
> pin-pointed the bad location more accurately.


        
