[ 
https://issues.apache.org/jira/browse/HBASE-27013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533995#comment-17533995
 ] 

Josh Elser commented on HBASE-27013:
------------------------------------

{quote}So the problem here is, the implementation of S3A is not HDFS, we can 
not reuse the stream to send multiple pread requests with random offset. Seems 
not like a good enough pread implementation...
{quote}
Yeah, s3a != hdfs is definitely a major pain point. IIUC, neither HBase nor 
HDFS is doing anything wrong, per se. HDFS just happens to handle this super 
fast and s3a... doesn't.
{quote}In general, in pread mode, a FSDataInputStream may be used by different 
read requests so even if you fixed this problem, it could still introduce a lot 
of aborts as different read request may read from different offsets...
{quote}
Right again – the focus is on prefetching because we know that once hfiles are 
cached, things are super fast. Thus, this is the first problem to chase. 
However, any operation over a table which isn't fully cached would end up 
over-reading from s3. I had thought about whether we just write a custom Reader 
for the prefetch case, but then we wouldn't address the rest of the access 
paths (e.g. scans).

Stephen's worst-case numbers are still ~130MB/s to pull HFiles down from S3 
into the cache, which looks good on the surface, but not so good when you 
compare it to the closer-to-1GB/s you can get through awscli (and whatever 
their parallelized downloader was called). One optimization at a time :)

> Introduce read all bytes when using pread for prefetch
> ------------------------------------------------------
>
>                 Key: HBASE-27013
>                 URL: https://issues.apache.org/jira/browse/HBASE-27013
>             Project: HBase
>          Issue Type: Improvement
>          Components: HFile, Performance
>    Affects Versions: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.13
>            Reporter: Tak-Lon (Stephen) Wu
>            Assignee: Tak-Lon (Stephen) Wu
>            Priority: Major
>
> h2. Problem statement
> When prefetching HFiles from blob storage such as S3 through a filesystem 
> implementation like S3A, we found a logical issue in HBase pread that causes 
> reads of the remote HFile to abort the input stream multiple times. These 
> aborts and reopens slow down the reads, discard many already-fetched bytes, 
> and waste time re-establishing the connection, especially when SSL is 
> enabled.
> h2. ROOT CAUSE
> The root cause of the issue above is that 
> [BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257]
>  reads from an input stream without guaranteeing that both the data block 
> and the next block header (the optional extra data to be cached) are 
> returned.
> When the input stream reads short, i.e. it returns the necessary data block 
> plus only a few extra bytes that are fewer than the size of the next block 
> header, 
> [BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257]
>  returns to the caller without having cached the next block header. As a 
> result, before HBase reads the next block, 
> [HFileBlock#readBlockDataInternal|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java#L1648-L1664]
>  has to re-read the next block header from the input stream. By then the 
> reusable input stream has already moved its current position pointer ahead 
> of the offset of the last read data block, and with the [S3A 
> implementation|https://github.com/apache/hadoop/blob/29401c820377d02a992eecde51083cf87f8e57af/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L339-L361],
>  the input stream is then closed, all remaining bytes are aborted, and a new 
> input stream is reopened at the requested offset.
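> For illustration, here is a minimal sketch (not the actual HBase code; the 
> class name, method name, offsets and lengths are hypothetical) of the 
> positioned-read pattern described above:
> {code}
> import java.io.IOException;
> import org.apache.hadoop.fs.FSDataInputStream;
> 
> // Sketch of the access pattern described above. With S3A, a positioned read
> // is implemented via seek + read, so the underlying S3 object stream ends up
> // past the bytes it has already served.
> public class PreadAbortSketch {
>   static final int HEADER_LEN = 33; // HFile block header size
> 
>   static void readBlockThenHeader(FSDataInputStream in, long blockOffset,
>       int blockLen) throws IOException {
>     byte[] buf = new byte[blockLen + HEADER_LEN];
>     // Short read: we may get the data block plus only a few extra bytes,
>     // fewer than the 33 needed for the next header, so it is not cached.
>     int read = in.read(blockOffset, buf, 0, buf.length);
> 
>     // The next block header then has to be re-read at blockOffset + blockLen,
>     // which is behind the position the S3A stream has already reached, so
>     // S3A aborts the HTTP stream and reopens a new one at that offset.
>     byte[] header = new byte[HEADER_LEN];
>     in.readFully(blockOffset + blockLen, header, 0, HEADER_LEN);
>   }
> }
> {code}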
> h2. How do we fix it?
> S3A is doing the right thing here: HBase tells it to move the offset from 
> position A back to A - N, so there is not much we can do about how S3A 
> handles the input stream; in the case of HDFS this operation just happens to 
> be fast.
> Therefore, we should fix this at the HBase level and always try to read the 
> data block plus the next block header when using blob storage, to avoid the 
> expensive draining of the remaining bytes in the stream and the reopening of 
> the socket to the remote storage.
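> A minimal sketch of the intended behavior (not the actual patch; the helper 
> name and signature are made up, the real change would live in 
> BlockIOUtils#preadWithExtra): keep issuing positioned reads until the data 
> block plus the next block header have been read, instead of returning on the 
> first short read:
> {code}
> import java.io.IOException;
> import org.apache.hadoop.fs.FSDataInputStream;
> 
> // Sketch only: a pread loop that reads necessaryLen + extraLen bytes
> // (data block + next block header) before returning, so the caller never
> // has to come back and re-read the header at an earlier offset.
> public class ReadAllBytesSketch {
>   static void preadAllBytes(FSDataInputStream in, byte[] buf, long position,
>       int necessaryLen, int extraLen) throws IOException {
>     int total = necessaryLen + extraLen;
>     int done = 0;
>     while (done < total) {
>       int n = in.read(position + done, buf, done, total - done);
>       if (n < 0) {
>         throw new IOException("Premature EOF: read " + done + " of " + total
>             + " bytes starting at position " + position);
>       }
>       done += n;
>     }
>   }
> }
> {code}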
> h2. Drawbacks and discussion
>  * A known drawback is that, when we're at the last block, we will read an 
> extra length that is not actually a header, and we still read it into the 
> byte buffer array. The size is always 33 bytes, and it is not an issue for 
> data correctness because the trailer tells us where the last data block 
> ends; we just waste a 33-byte read whose data is never used (see the 
> breakdown after this list).
>  * I don't know if we can use HFileStreamReader instead, but that would 
> change the prefetch logic a lot, so this minimal change should be the best 
> option.
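> For reference, the 33 bytes mentioned above are the HFile block header. A 
> rough breakdown (from memory, assuming the HFile v2/v3 on-disk layout with 
> checksum support; see HConstants.HFILEBLOCK_HEADER_SIZE) looks like this:
> {code}
> // Approximate composition of the 33-byte HFile block header:
> //   block magic (block type)        8 bytes
> //   onDiskSizeWithoutHeader         4 bytes
> //   uncompressedSizeWithoutHeader   4 bytes
> //   prevBlockOffset                 8 bytes
> //   checksumType                    1 byte
> //   bytesPerChecksum                4 bytes
> //   onDiskDataSizeWithHeader        4 bytes
> int headerSize = 8 + 4 + 4 + 8 + 1 + 4 + 4; // = 33 bytes
> {code}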
> h2. Initial result
> We used a YCSB dataset with 1 billion records and enabled prefetch for the 
> usertable. We collected the S3A metric 
> {{stream_read_bytes_discarded_in_abort}} to compare the solutions; each 
> region server has to prefetch about 290 GB of data into the bucket cache.
> * Before the change, a total of 4235973338472 bytes (~4235 GB) was aborted 
> on a sample region server for about 290 GB of data.
> ** The overall time was about 45~60 mins.
>  
> {code}
> % grep "stream_read_bytes_discarded_in_abort" ~/prefetch-result/prefetch-s3a-jmx-metrics.json | grep -wv "stream_read_bytes_discarded_in_abort\":0,"
>          "stream_read_bytes_discarded_in_abort":3136854553,
>          "stream_read_bytes_discarded_in_abort":19119241,
>          "stream_read_bytes_discarded_in_abort":2131591701471,
>          "stream_read_bytes_discarded_in_abort":150484654298,
>          "stream_read_bytes_discarded_in_abort":106536641550,
>          "stream_read_bytes_discarded_in_abort":1785264521717,
>          "stream_read_bytes_discarded_in_abort":58939845642,
> {code}
> * After the change, only 87100225454 bytes (~87 GB) of data were aborted.
> ** The remaining aborts happen when the stream position is way behind the 
> requested target position, so S3A reopens the stream and moves the position 
> to the requested offset. This is a different problem we will need to look 
> into later.
> ** The overall time is then cut to 30~38 mins, about 30% faster.
> {code}
> % grep "stream_read_bytes_discarded_in_abort" ~/fixed-formatted-jmx2.json
>       "stream_read_bytes_discarded_in_abort": 0,
>       "stream_read_bytes_discarded_in_abort": 87100225454,
>       "stream_read_bytes_discarded_in_abort": 67043088,
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
