[ https://issues.apache.org/jira/browse/HDFS-8617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596369#comment-14596369 ]

Colin Patrick McCabe commented on HDFS-8617:
--------------------------------------------

Andrew and I actually benchmarked setting {{ioprio}} in order to implement 
quality of service on the DataNode.  It didn't have very much effect.

In general, more and more I/O scheduling is moving out of the operating system 
and into the storage device.  Back in the old days, operating systems would 
feed requests to disks one at a time.  Disks took a long time to process 
requests in those days, so it was easy for the CPU to stay well ahead of the 
disk and basically lead it around by the nose.  Nowadays, hard disks have huge 
on-disk write buffers (several megabytes in size) and internal software that 
handles draining them.  The hard drive doesn't necessarily process requests in 
the order it gets them.  The situation with SSDs is even worse... SSDs have a 
huge internal layer of firmware that handles servicing any request.  In general 
with SSDs the role of the OS is just to forward requests as quickly as possible 
to try to keep up with the very fast speed of the SSD.  This is why Linux 
tuning guides tell you to set your I/O scheduler to either {{noop}} or 
{{deadline}} for best performance on SSDs.
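
For reference, the scheduler can be inspected and switched through sysfs.  A 
minimal sketch of doing that (the device name {{sda}} is an assumption about 
your layout, and writing requires root):

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class IoSchedulerTool {
    // Hypothetical device name; pick the disk backing the DataNode volume.
    private static final Path SCHEDULER =
            Paths.get("/sys/block/sda/queue/scheduler");

    public static void main(String[] args) throws IOException {
        // The kernel marks the active scheduler with brackets,
        // e.g. "noop deadline [cfq]".
        String current = new String(Files.readAllBytes(SCHEDULER)).trim();
        System.out.println("Current scheduler: " + current);

        // Writing a scheduler name selects it (requires root).
        if (args.length == 1) {
            Files.write(SCHEDULER, args[0].getBytes());
        }
    }
}
{code}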

Of course, when disks fail, they usually don't fail all at once.  Instead, more 
and more operations start to time out and produce I/O errors.  This is 
problematic for systems like HBase which strive for low latency.  That's why we 
developed workarounds like hedged reads.  However, HDFS's checkDirs behavior 
here is making the situation much worse.  For a disk that returns I/O errors 
every so often, each error may trigger a new full scan of every block file on 
the DataNode.  While it's true that these scans only look at the metadata, not 
the data, they can still put a heavy load on the system.
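
For context, the scan in question is a recursive walk that probes permissions 
on every subdirectory, and a volume can have up to 64K finalized subdirectories 
(HDFS-6482), so even metadata-only probes add up.  A rough sketch of the shape 
of such a walk (illustrative only, not the actual DiskChecker code):

{code:java}
import java.io.File;
import java.io.IOException;

public class RecursiveDirCheck {
    // Illustrative sketch -- shows why one pass costs O(number of directories)
    // in metadata operations: every directory is listed and probed for
    // readability, writability, and executability.
    public static void checkDirs(File dir) throws IOException {
        if (!dir.isDirectory() || !dir.canRead()
                || !dir.canWrite() || !dir.canExecute()) {
            throw new IOException("Bad directory: " + dir);
        }
        File[] children = dir.listFiles();   // one readdir per directory
        if (children == null) {
            throw new IOException("Cannot list: " + dir);
        }
        for (File child : children) {
            if (child.isDirectory()) {       // one stat per entry
                checkDirs(child);
            }
        }
    }
}
{code}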

It's pointless to keep rescanning the filesystem continuously when a disk 
starts returning errors.  At the very most, we should rescan only the drive 
that's failing.  And we should not do it continuously, but maybe once every 
hour or half hour.  An HBase sysadmin asked me how to configure this behavior 
and I had to tell him that we have absolutely no way to do it.
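
To make that concrete, here is a minimal sketch of the kind of behavior I'm 
describing; the class name and the interval are made up, and nothing like this 
exists in the code today:

{code:java}
import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class ThrottledVolumeCheck {
    // Hypothetical knob -- HDFS currently exposes no such setting.
    private static final long MIN_INTERVAL_MS = 60L * 60 * 1000;  // one hour

    private final Map<File, Long> lastCheck = new HashMap<>();

    /** Rescan only the volume that reported the error, at most once per interval. */
    public synchronized boolean maybeCheck(File failedVolume, Runnable scanner) {
        long now = System.currentTimeMillis();
        Long last = lastCheck.get(failedVolume);
        if (last != null && now - last < MIN_INTERVAL_MS) {
            return false;                 // scanned this volume recently; skip
        }
        lastCheck.put(failedVolume, now);
        scanner.run();                    // scan just the failing volume
        return true;
    }
}
{code}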

bq. I'm unsure whether \[Andrew's IOPS calculation\] is the right math. I just 
checked the code. It looks like checkDir() mostly performs read-only operations 
on the metadata of the underlying filesystem. The metadata can be fully cached, 
so the parameter can be way off (and for SSD the parameter needs to be 
recalculated). That comes back to the point that it is difficult to determine 
the right parameter for various configurations. The difficulty of finding the 
right parameter leads me to believe that using throttling here is flawed.

When your application is latency-sensitive (such as HBase), it makes sense to 
do a worst-case calculation of how many IOPS the workload may generate.  While 
it's true that this may sometimes be overly pessimistic if things are cached in 
memory, it is the right math to do when latency is critical.
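
As a back-of-the-envelope example of that worst-case math (every number below 
is an assumption for illustration, not a measurement):

{code:java}
public class WorstCaseScanTime {
    public static void main(String[] args) {
        long subDirs = 64_000;   // up to 64K finalized subdirs per volume (HDFS-6482)
        long opsPerDir = 2;      // rough guess: one readdir plus one stat each
        long diskIops = 150;     // typical spinning disk, metadata not cached

        long seconds = (subDirs * opsPerDir) / diskIops;
        System.out.println("Worst-case scan time: ~" + seconds + " s");
        // Roughly 850 seconds of competing random I/O per scan if nothing is
        // cached -- exactly the kind of background load HBase cannot absorb.
    }
}
{code}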

> Throttle DiskChecker#checkDirs() speed.
> ---------------------------------------
>
>                 Key: HDFS-8617
>                 URL: https://issues.apache.org/jira/browse/HDFS-8617
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: HDFS
>    Affects Versions: 2.7.0
>            Reporter: Lei (Eddy) Xu
>            Assignee: Lei (Eddy) Xu
>         Attachments: HDFS-8617.000.patch
>
>
> As described in HDFS-8564,  {{DiskChecker.checkDirs(finalizedDir)}} is 
> causing excessive I/Os because {{finalizedDirs}} might have up to 64K 
> sub-directories (HDFS-6482).
> This patch proposes to limit the rate of IO operations in 
> {{DiskChecker.checkDirs()}}. 


