[ https://issues.apache.org/jira/browse/HDFS-8617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596369#comment-14596369 ]
Colin Patrick McCabe commented on HDFS-8617:
--------------------------------------------

Andrew and I actually benchmarked setting {{ioprio}} in order to implement quality of service on the DataNode. It didn't have very much effect. In general, more and more I/O scheduling is moving out of the operating system and into the storage device.

Back in the old days, operating systems would feed requests to disks one at a time. Disks took a long time to process requests in those days, so it was easy for the CPU to stay well ahead of the disk and basically lead it around by the nose. Nowadays, hard disks have huge on-disk write buffers (several megabytes in size) and internal software that handles draining them. The hard drive doesn't necessarily process requests in the order it receives them. The situation with SSDs is even more extreme: SSDs have a large internal firmware layer that services every request. With SSDs, the role of the OS is mostly just to forward requests as quickly as possible to keep up with the very fast device. This is why Linux tuning guides tell you to set your I/O scheduler to either {{noop}} or {{deadline}} for best performance on SSDs.

Of course, when disks fail, they usually don't fail all at once. Instead, more and more operations start to time out and produce I/O errors. This is problematic for systems like HBase which strive for low latency; that's why we developed workarounds like hedged reads. However, HDFS's {{checkDirs}} behavior here is making the situation much worse. For a disk that returns I/O errors every so often, each error may trigger a new full scan of every block file on the DataNode. While it's true that these scans only look at the metadata, not the data, they can still put a heavy load on the system. It's pointless to keep rescanning the filesystem continuously when a disk starts returning errors. At the very most, we should rescan only the drive that's failing.
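To get a feel for why these full rescans are so heavy, here is a back-of-envelope worst-case calculation. All the concrete numbers (block count, IOPS budget) are assumptions for illustration, not measurements from this issue:

```java
// Back-of-envelope worst-case cost of one full checkDirs()-style scan.
// All numbers below are illustrative assumptions, not measured values.
public class CheckDirsCost {
    public static void main(String[] args) {
        long subdirs = 64 * 1024;      // up to 64K subdirectories per volume (HDFS-6482)
        long blocks = 500_000;         // assumed number of blocks on the volume
        long filesPerBlock = 2;        // block file plus its .meta file
        // One readdir per directory, one stat per file, with nothing cached:
        long metadataOps = subdirs + blocks * filesPerBlock;
        long diskIops = 200;           // assumed random-IOPS budget of one 7200rpm disk
        long seconds = metadataOps / diskIops;
        System.out.println(metadataOps + " metadata ops ~= "
            + (seconds / 60) + " minutes at " + diskIops + " IOPS");
        // prints: 1065536 metadata ops ~= 88 minutes at 200 IOPS
    }
}
```

Even if the page cache absorbs most of these operations in practice, the worst (cold-cache) case is well over an hour of sustained metadata I/O per scan, which is why triggering a fresh scan on every I/O error is so damaging.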
And we should not do it continuously, but maybe once every hour or half hour. An HBase sysadmin asked me how to configure this behavior and I had to tell him that we have absolutely no way to do it.

bq. I'm unsure whether \[andrew's IOPs calculation\] is the right math. I just checked the code. It looks like checkDir() mostly performs read-only operations on the metadata of the underlying filesystem. The metadata can be fully cached, thus the parameter can be way off (and for SSD the parameter needs to be recalculated). That comes back to the point that it is difficult to determine the right parameter for various configurations. The difficulty of finding the parameter leads me to believe that using throttling here is flawed.

When your application is latency-sensitive (such as HBase), it makes sense to do a worst-case calculation of how many IOPS the workload may generate. While it's true that this may sometimes be overly pessimistic if things are cached in memory, it is the right math to do when latency is critical.

> Throttle DiskChecker#checkDirs() speed.
> ---------------------------------------
>
>                 Key: HDFS-8617
>                 URL: https://issues.apache.org/jira/browse/HDFS-8617
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: HDFS
>    Affects Versions: 2.7.0
>            Reporter: Lei (Eddy) Xu
>            Assignee: Lei (Eddy) Xu
>        Attachments: HDFS-8617.000.patch
>
>
> As described in HDFS-8564, {{DiskChecker.checkDirs(finalizedDir)}} is
> causing excessive I/Os because {{finalizedDirs}} might have up to 64K
> sub-directories (HDFS-6482).
>
> This patch proposes to limit the rate of IO operations in
> {{DiskChecker.checkDirs()}}.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
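As a rough illustration of the kind of rate-limited metadata scan this issue discusses, here is a hypothetical sketch. The class name, the ops-per-second budget, and the windowed-sleep scheme are all invented for illustration; this is not the code in HDFS-8617.000.patch:

```java
import java.io.File;

// Hypothetical sketch of a rate-limited directory check: charge one
// "metadata operation" per directory listing, and sleep out the rest of
// each one-second window once the per-second budget is exhausted.
public class ThrottledDirChecker {
    private final long opsPerSec;          // assumed metadata-ops budget
    private long opsInWindow = 0;
    private long windowStart = System.nanoTime();

    public ThrottledDirChecker(long opsPerSec) {
        this.opsPerSec = opsPerSec;
    }

    // Charge one metadata operation; block until the next one-second
    // window opens if the current window's budget is used up.
    private void charge() throws InterruptedException {
        if (++opsInWindow >= opsPerSec) {
            long remainingNanos = 1_000_000_000L - (System.nanoTime() - windowStart);
            if (remainingNanos > 0) {
                Thread.sleep(remainingNanos / 1_000_000L,
                             (int) (remainingNanos % 1_000_000L));
            }
            opsInWindow = 0;
            windowStart = System.nanoTime();
        }
    }

    // Recursively verify that every directory under root is listable,
    // spending at most opsPerSec charged operations per second.
    public void checkDirs(File root) throws InterruptedException {
        charge();
        File[] children = root.listFiles();   // one readdir
        if (children == null) {
            throw new IllegalStateException("cannot list " + root);
        }
        for (File child : children) {
            // Note: isDirectory() issues another stat; a real
            // implementation would charge that against the budget too.
            if (child.isDirectory()) {
                checkDirs(child);
            }
        }
    }
}
```

The sleep-per-window approach is the simplest possible throttle; it still leaves open the question raised above of how to pick the budget, since a fully cached scan and a cold-cache scan have wildly different real costs per charged operation.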