[ https://issues.apache.org/jira/browse/HBASE-24541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577232#comment-17577232 ]

Duo Zhang commented on HBASE-24541:
-----------------------------------

The PR against branch-1.4 has been closed, as all the 1.x release lines have already reached EOL.

Feel free to open a new PR against master.

> Add support to run LoadIncrementalHFiles in a distributed manner
> ----------------------------------------------------------------
>
>                 Key: HBASE-24541
>                 URL: https://issues.apache.org/jira/browse/HBASE-24541
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce, Performance
>    Affects Versions: 1.4.0
>            Reporter: Constantin-Catalin Luca
>            Assignee: Constantin-Catalin Luca
>            Priority: Minor
>         Attachments: HBASE_24541-1.4.0.patch
>
>
> LoadIncrementalHFiles takes a very long time to complete when running HBase
> on top of S3 and attempting to bulkload 500K-700K files.
> The root cause of this is a combination of the higher latency of S3 (as
> compared to HDFS) and the calls made by LoadIncrementalHFiles to the
> underlying filesystem (each file is opened, seeked to the trailer offset at
> the end, and then the trailer is read).
> Increasing the parallelism does not yield any significant improvement. This
> seems to stem from the fact that once the trailer is read, the stream is not
> consumed to the end. This causes the underlying HTTP connection to be aborted,
> so it cannot be reused.
>
> The proposed solution would be to also add support to run
> LoadIncrementalHFiles on multiple machines as a MapReduce job.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)