catalin-luca commented on pull request #2002:
URL: https://github.com/apache/hbase/pull/2002#issuecomment-655025128


   > Can you point out where you see this in LoadIncrementalHFiles and how this 
current proposal avoid that, since it basically call 
LoadIncrementalHFiles.doBulkloadFromQueue at the end of each map task?
   
   This is the part that opens each HFile to obtain its start and end keys:
   ```
   HFile.Reader hfr = HFile.createReader(fs, hfilePath,
       new CacheConfig(getConf()), getConf());
   final byte[] first, last;
   try {
     hfr.loadFileInfo();
     first = hfr.getFirstRowKey();
     last = hfr.getLastRowKey();
   } finally {
     hfr.close();
   }
   ```
   
   In particular, `HFile.createReader` opens the HFile and loads the file 
trailer.
   When running HBase on top of S3, the latency of these calls is an order of 
magnitude higher.
   
   The calls themselves are not the problem (they are needed to determine 
the region server that will receive the file).
   The problem is their high latency, which makes the overall bulkload process 
take a very long time.
   
   My first instinct was to hide this latency by increasing the parallelism of 
LoadIncrementalHFiles. However, going beyond ~500-600 threads did not yield any 
improvement. After inspecting thread dumps, I saw a lot of time being spent 
re-creating HTTP connections. It seemed that the connections were not being 
reused because the `HFile.Reader` does not read all the bytes after seeking 
to read the trailer. This in turn causes the connection to be aborted, so it 
can't be returned to the pool.
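
To illustrate the mechanism only: an HTTP connection can generally go back to a pool only once its response body has been fully consumed, so a stream that is closed with unread bytes has to be aborted. The sketch below is a hypothetical helper, not code from HBase or from this PR (which leaves the aborts in place). `DrainSketch`, `drainIfCheap` and `maxDrainBytes` are made-up names, and a plain `InputStream` stands in for the S3 response body.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class DrainSketch {
    /**
     * Reads and discards the rest of the stream if at most maxDrainBytes remain.
     * Returns true when end-of-stream was reached (a pooled connection could be
     * reused), false when more than maxDrainBytes remained (caller should abort).
     */
    public static boolean drainIfCheap(InputStream in, long maxDrainBytes) {
        byte[] buf = new byte[8192];
        long drained = 0;
        try {
            while (drained <= maxDrainBytes) {
                // Ask for at most one byte past the budget, to detect "too much left".
                int want = (int) Math.min(buf.length, maxDrainBytes - drained + 1);
                int n = in.read(buf, 0, want);
                if (n < 0) {
                    return true; // fully consumed
                }
                drained += n;
            }
            return false; // budget exceeded, draining is not worth it
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 5 unread bytes with a 10-byte budget: cheap to drain.
        System.out.println(drainIfCheap(new ByteArrayInputStream(new byte[5]), 10));
        // 20 unread bytes with a 10-byte budget: abort instead.
        System.out.println(drainIfCheap(new ByteArrayInputStream(new byte[20]), 10));
    }
}
```

Whether draining is cheaper than re-creating the connection depends on how many bytes remain after the trailer read, which is why the sketch takes a budget rather than always draining.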
   
   After running the LoadIncrementalHFiles code in multiple processes using 
map/reduce, I was able to achieve an overall parallelism larger than the 
500-600 threads mentioned above. The connections are still getting aborted, 
but the overall process scales horizontally better. 
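
A rough sketch of the horizontal-scaling idea, under stated assumptions: the HFile paths are dealt out round-robin to a fixed number of tasks, and each task would then run its own bounded loader over its share. The round-robin split and the names `HFileSplitSketch`, `partition` and `effectiveParallelism` are hypothetical and are not taken from the PR, whose actual split logic is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HFileSplitSketch {
    /** Deals the paths out round-robin into numTasks lists, one per task. */
    public static List<List<String>> partition(List<String> hfilePaths, int numTasks) {
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            splits.add(new ArrayList<>());
        }
        for (int i = 0; i < hfilePaths.size(); i++) {
            splits.get(i % numTasks).add(hfilePaths.get(i));
        }
        return splits;
    }

    /** Upper bound on concurrent HFile opens: tasks times threads per task. */
    public static long effectiveParallelism(int numTasks, int threadsPerTask) {
        return (long) numTasks * threadsPerTask;
    }

    public static void main(String[] args) {
        // Placeholder paths, for illustration only.
        List<String> paths = Arrays.asList("f1", "f2", "f3", "f4", "f5");
        System.out.println(partition(paths, 2));
        System.out.println(effectiveParallelism(10, 100));
    }
}
```

Each process keeps its own HTTP connection pool and thread pool, so the per-process ceiling still applies to every task, but the total is no longer capped by a single JVM.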

