[ 
https://issues.apache.org/jira/browse/HBASE-24541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152778#comment-17152778
 ] 

Constantin-Catalin Luca commented on HBASE-24541:
-------------------------------------------------

Ok, I will submit a PR for master as well.

> Add support to run LoadIncrementalHFiles in a distributed manner
> ----------------------------------------------------------------
>
>                 Key: HBASE-24541
>                 URL: https://issues.apache.org/jira/browse/HBASE-24541
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce, Performance
>    Affects Versions: 1.4.0
>            Reporter: Constantin-Catalin Luca
>            Assignee: Constantin-Catalin Luca
>            Priority: Minor
>         Attachments: HBASE_24541-1.4.0.patch
>
>
> LoadIncrementalHFiles takes a very long time to complete when running HBase 
> on top of S3 and attempting to bulk load 500K-700K files.
> The root cause is a combination of the higher latency of S3 (compared to 
> HDFS) and the calls that LoadIncrementalHFiles makes to the underlying 
> filesystem (each file is opened, the stream seeks to the trailer offset at 
> the end, and then the trailer is read).
> Increasing the parallelism does not yield any significant improvement. This 
> seems to stem from the fact that once the trailer is read, the stream is not 
> consumed to the end. This causes the underlying HTTP connection to be 
> aborted, so it cannot be reused.
>  
> The proposed solution is to add support for running 
> LoadIncrementalHFiles on multiple machines as a MapReduce job. 
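
The distribution idea in the description can be illustrated with a minimal sketch. It is not taken from the attached patch; it only assumes that the list of HFile paths is split into roughly equal batches so that each map task opens and reads the trailers of its own batch. The function name `partition_paths`, the `num_tasks` parameter, and the example `s3://` paths are all hypothetical.

```python
from typing import List, Sequence


def partition_paths(paths: Sequence[str], num_tasks: int) -> List[List[str]]:
    """Split HFile paths into at most num_tasks batches, round-robin.

    Hypothetical helper: each batch would be handed to one map task,
    which performs the per-file open / seek / trailer read on its own
    machine instead of on a single client.
    """
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    # Never create more batches than there are files; keep one (empty)
    # batch when there are no files so callers always get a list back.
    batch_count = max(1, min(num_tasks, len(paths)))
    batches: List[List[str]] = [[] for _ in range(batch_count)]
    for index, path in enumerate(paths):
        batches[index % batch_count].append(path)
    return batches


if __name__ == "__main__":
    # Made-up paths, for illustration only.
    example = [f"s3://bucket/hfiles/f{i}" for i in range(10)]
    for task_id, batch in enumerate(partition_paths(example, 3)):
        print(task_id, len(batch))
```

Round-robin assignment keeps batch sizes within one file of each other, which matters when the per-file cost is dominated by request latency rather than file size.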



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
