[ https://issues.apache.org/jira/browse/HBASE-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413293#comment-17413293 ]
Josh Elser commented on HBASE-26273:
------------------------------------

{quote}Can you explain why there are more HDFS connections with PREAD than STREAM? Thanks.
{quote}
Probably not a good way to phrase it on my part :). What I meant to point out is that, if you're doing MapReduce over snapshots, you are most likely reading most or all of the HFile. The seek+read we do for every pread seems excessive to me (where we can instead just keep reading forward like normal).

This is also related to the other issue Stephen filed: HBASE-26274 (where we _did_ make a lot more connections to HDFS because we kept having to go back and re-read the index blocks).

> TableSnapshotInputFormat/TableSnapshotInputFormatImpl should use ReadType.STREAM for scanning HFiles
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-26273
>                 URL: https://issues.apache.org/jira/browse/HBASE-26273
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>    Affects Versions: 3.0.0-alpha-1, 2.4.6
>            Reporter: Tak-Lon (Stephen) Wu
>            Assignee: Josh Elser
>            Priority: Major
>
> After the change in HBASE-17917 that uses PREAD ({{ReadType.DEFAULT}}) for all
> user scans, the behavior of TableSnapshotInputFormat changed from STREAM to
> PREAD.
> TableSnapshotInputFormat is supposed to be used with YARN/MR or another batch
> engine that reads the entire HFile in the container/executor. With the
> default always being PREAD, the number of connections to HDFS surges, which
> has a side effect on overall performance.
> The goal of this change is to make any downstream user of
> TableSnapshotInputFormat scan with STREAM.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
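
To make the pread versus stream distinction in the comment above concrete, here is a minimal sketch on a local file using only the JDK. It is an analogy, not HBase or HDFS code: the class name, the method names and the 4096-byte block size are all made up for illustration. The first method names an explicit offset on every read (roughly the seek+read pattern described above), while the second opens the file once and keeps reading forward.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; none of these names come from HBase.
public class ReadPatternSketch {
    // Assumed block size for the sketch, unrelated to HFile block sizes.
    static final int BLOCK_SIZE = 4096;

    /** Positional reads: each block is fetched by naming its offset explicitly. */
    static long readWithPread(Path file) throws IOException {
        long total = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(BLOCK_SIZE);
            long size = ch.size();
            long offset = 0;
            while (offset < size) {
                buf.clear();
                // Positional read: does not use or move the channel's own position.
                int n = ch.read(buf, offset);
                if (n <= 0) {
                    break;
                }
                total += n;
                offset += n;
            }
        }
        return total;
    }

    /** Streaming read: open once, then keep reading forward until end of file. */
    static long readWithStream(Path file) throws IOException {
        long total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[BLOCK_SIZE];
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("read-pattern", ".bin");
        try {
            Files.write(tmp, new byte[3 * BLOCK_SIZE + 123]);
            System.out.println("pread bytes:  " + readWithPread(tmp));
            System.out.println("stream bytes: " + readWithStream(tmp));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Both methods return the same byte count; the difference is only in the access pattern. On a local file that difference is cheap, so the sketch says nothing about the actual cost on HDFS, which is what the issue is concerned with.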