[ https://issues.apache.org/jira/browse/HBASE-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271890#comment-13271890 ]
Todd Lipcon commented on HBASE-5979:
------------------------------------

My thinking is that the solution is something like this:

When any scanner starts, it begins by using the "pread" API for the first N hfile blocks it reads. This allows short scans, which can often fall entirely within one or two HFile blocks, to avoid the read amplification of doing a DFSInputStream seek.

After a scanner has read several blocks from an HFile, it switches over to the seek+read mode. However, it does this with its *own* input stream. This way, all of the pre-buffering that happens through the HDFS layer will benefit it, and it doesn't have to contend with other scans. This should improve performance of long scans in the presence of contention (eg scans + compactions or multiple longer scans within the same region).

The actual input streams would thus become owned by the individual HFileScanners.

Not sure if I'll have time to prototype a patch for this any time soon, but happy to help review ideas.

> Non-pread DFSInputStreams should be associated with scanners, not
> HFile.Readers
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-5979
>                 URL: https://issues.apache.org/jira/browse/HBASE-5979
>             Project: HBase
>          Issue Type: Improvement
>          Components: performance, regionserver
>            Reporter: Todd Lipcon
>
> Currently, every HFile.Reader has a single DFSInputStream, which it uses to
> service all gets and scans. For gets, we use the positional read API (aka
> "pread") and for scans we use a synchronized block to seek, then read. The
> advantage of pread is that it doesn't hold any locks, so multiple gets can
> proceed at the same time. The advantage of seek+read for scans is that the
> datanode starts to send the entire rest of the HDFS block, rather than just
> the single hfile block necessary. So, in a single thread, pread is faster for
> gets, and seek+read is faster for scans since you get a strong pipelining
> effect.
> However, in a multi-threaded case where there are multiple scans (including
> scans which are actually part of compactions), the seek+read strategy falls
> apart, since only one scanner may be reading at a time. Additionally, a large
> amount of wasted IO is generated on the datanode side, and we get none of the
> earlier-mentioned advantages.
> In one test, I switched scans to always use pread, and saw a 5x improvement
> in throughput of the YCSB scan-only workload, since it previously was
> completely blocked by contention on the DFSIS lock.
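
For illustration only, the scheme proposed in the comment (pread for the first N blocks, then a switch to seek+read on a stream the scanner owns) might look roughly like the Java sketch below. This is not a patch and uses no HBase or HDFS classes: SharedFile, PrivateStream and HybridScanner are hypothetical names, an in-memory byte array stands in for the HDFS file, and the threshold N is left as a constructor parameter because the comment does not give a value. The sketch shows only where the mode switch happens, not the pre-buffering or locking behavior of a real DFSInputStream.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a file shared by many scanners (not an HDFS class). */
final class SharedFile {
    private final byte[] data;

    SharedFile(byte[] data) { this.data = data; }

    /** Positional read ("pread"): takes an explicit offset, keeps no cursor, needs no lock. */
    int pread(long pos, byte[] buf, int off, int len) {
        if (pos >= data.length) return -1;
        int n = (int) Math.min(len, data.length - pos);
        System.arraycopy(data, (int) pos, buf, off, n);
        return n;
    }

    /** Opens a cursor-based stream intended to be owned by a single scanner. */
    PrivateStream open() { return new PrivateStream(this); }
}

/** Hypothetical seek+read stream; its cursor is private to one scanner. */
final class PrivateStream {
    private final SharedFile file;
    private long pos;

    PrivateStream(SharedFile file) { this.file = file; }

    void seek(long newPos) { pos = newPos; }

    int read(byte[] buf, int off, int len) {
        int n = file.pread(pos, buf, off, len);
        if (n > 0) pos += n;
        return n;
    }
}

/** Scanner that starts in pread mode and later switches to its own stream. */
final class HybridScanner {
    private final SharedFile file;
    private final int blockSize;
    private final int preadBlocks;   // the "N" from the comment; its value is an assumption
    private long nextOffset;
    private int blocksRead;
    private PrivateStream own;       // created lazily at the moment of the switch
    int preadCount;                  // blocks served via pread
    int streamCount;                 // blocks served via the private stream

    HybridScanner(SharedFile file, int blockSize, int preadBlocks, long startOffset) {
        this.file = file;
        this.blockSize = blockSize;
        this.preadBlocks = preadBlocks;
        this.nextOffset = startOffset;
    }

    /** Returns the next block, or null at end of file. */
    byte[] nextBlock() {
        byte[] buf = new byte[blockSize];
        int n;
        if (blocksRead < preadBlocks) {
            // Short scans stay here and never pay for a seek.
            n = file.pread(nextOffset, buf, 0, blockSize);
            if (n > 0) preadCount++;
        } else {
            if (own == null) {
                // Long scan detected: open a stream no other scanner shares.
                own = file.open();
                own.seek(nextOffset);
            }
            n = own.read(buf, 0, blockSize);
            if (n > 0) streamCount++;
        }
        if (n <= 0) return null;
        blocksRead++;
        nextOffset += n;
        return n == blockSize ? buf : Arrays.copyOf(buf, n);
    }
}
```

With a 10-byte file, a block size of 2 and N = 2, a scanner that reads to the end serves its first two blocks through pread and the remaining three through its private stream. Because each long scan has its own cursor, no synchronized block around seek+read is needed in this model.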