[ 
https://issues.apache.org/jira/browse/HBASE-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413293#comment-17413293
 ] 

Josh Elser commented on HBASE-26273:
------------------------------------

{quote}Can you explain why there are more HDFS connection with PREAD than 
STREAM? Thanks.
{quote}
Probably not a good way to phrase it on my part :). What I meant to point out 
is that, if you're running MapReduce over snapshots, you are most likely reading 
most or all of the HFile. The seek+read we do for every pread seems excessive to 
me when we could instead just keep reading forward, as a stream read does.
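
To make the distinction concrete, here is a minimal sketch of the two access patterns using plain java.nio on a local file. It is only an analogy: it is not HBase or HDFS code, and the class and method names (ReadPatterns, sumWithPread, sumWithStream) are made up for illustration. The positional variant supplies an explicit offset on every call, while the streaming variant opens the file once and keeps reading forward.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration class; not part of HBase.
class ReadPatterns {

    // Positional read ("pread" style): each block is requested at an explicit offset.
    static long sumWithPread(Path file, int blockSize) throws IOException {
        long sum = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(blockSize);
            long size = ch.size();
            long pos = 0;
            while (pos < size) {
                buf.clear();
                // Reads at 'pos' without moving the channel's own position.
                int n = ch.read(buf, pos);
                if (n <= 0) {
                    break;
                }
                for (int i = 0; i < n; i++) {
                    sum += buf.get(i) & 0xFF;
                }
                pos += n;
            }
        }
        return sum;
    }

    // Streaming read: open once, then keep reading forward with no explicit offsets.
    static long sumWithStream(Path file, int blockSize) throws IOException {
        long sum = 0;
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[blockSize];
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    sum += buf[i] & 0xFF;
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("readpatterns", ".bin");
        try {
            byte[] data = new byte[10_000];
            for (int i = 0; i < data.length; i++) {
                data[i] = (byte) i;
            }
            Files.write(tmp, data);
            // Both patterns see the same bytes; only the access style differs.
            System.out.println(sumWithPread(tmp, 4096) == sumWithStream(tmp, 4096));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Both methods return the same result for a full sequential pass; the difference is only in how each block is requested, which is the cost being discussed when the whole file is read anyway.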
 
This is also related to the other issue Stephen filed, HBASE-26274, where we 
_did_ make a lot more connections to HDFS because we kept having to go back and 
re-read the index blocks.

> TableSnapshotInputFormat/TableSnapshotInputFormatImpl should use 
> ReadType.STREAM for scanning HFiles 
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-26273
>                 URL: https://issues.apache.org/jira/browse/HBASE-26273
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>    Affects Versions: 3.0.0-alpha-1, 2.4.6
>            Reporter: Tak-Lon (Stephen) Wu
>            Assignee: Josh Elser
>            Priority: Major
>
> After the change in HBASE-17917 that uses PREAD ({{ReadType.DEFAULT}}) for all 
> user scans, the behavior of TableSnapshotInputFormat changed from STREAM to 
> PREAD. 
> TableSnapshotInputFormat is meant to be used with YARN/MR or another batch 
> engine that reads the entire HFile in the container/executor. With the 
> default always being PREAD, the number of connections to HDFS surges, with a 
> side effect on overall performance. 
> The goal of this change is to make any downstream user of 
> TableSnapshotInputFormat scan with STREAM. 



