[jira] [Comment Edited] (HBASE-27013) Introduce read all bytes when using pread for prefetch

2022-05-10 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534638#comment-17534638
 ] 

Josh Elser edited comment on HBASE-27013 at 5/11/22 1:46 AM:
-

{quote}In the case where the input stream reads short, i.e. when the read gets past the length of the necessary data block by only a few more bytes, within the size of the next block header, 
[BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257]
 returns to the caller without caching the next block header. As a result, 
before HBase tries to read the next block, 
[HFileBlock#readBlockDataInternal|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java#L1648-L1664]
 in HBase tries to re-read the next block header from the input stream.
{quote}
If we read the comment in the code that Stephen called out in 
readBlockDataInternal, we find:
{code:java}
If header was not cached (see getCachedHeader above), need to seek to pull it 
in. This is costly and should happen very rarely {code}
And then you had also said:
{quote}The root cause of the above issue is that 
[BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257]
 is reading an input stream that does not guarantee to return the data block 
and the next block header as optional data to be cached.
{quote}
I think what you're saying is the following.
 # Read header for block1
 # Read block1 and try to read block2's header
 # Read block2 and try to read block3's header
 # Repeat

This would align with the comment, too: the last time we read, we tried to get 
the header cached, so that the _next_ time we come back to read, we already 
have that header and can avoid another {{seek()}} (through the pread). A rough 
sketch of that pattern is below.
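
To make that concrete, here is a toy sketch of the pattern (this is *not* HBase's code; the block and header sizes, the short-read simulation, and all names are invented for illustration): each pread asks for block N's body plus block N+1's header, and whenever the read comes back short, the next iteration has to pay a separate header-only read.
{code:java}
import java.util.Arrays;

public class NextHeaderCachingDemo {
  static final int HEADER_SIZE = 4;   // hypothetical fixed header size
  static final int BLOCK_SIZE = 16;   // hypothetical fixed block body size
  static final byte[] FILE = new byte[5 * (HEADER_SIZE + BLOCK_SIZE)];

  /** Positioned read that may return fewer bytes than asked (a "short read"). */
  static int pread(long pos, byte[] buf, int off, int len, boolean simulateShort) {
    int available = (int) Math.min(len, FILE.length - pos);
    // Pretend the underlying stream sometimes stops at the block boundary.
    if (simulateShort && available > BLOCK_SIZE) {
      available = BLOCK_SIZE;
    }
    System.arraycopy(FILE, (int) pos, buf, off, available);
    return available;
  }

  public static void main(String[] args) {
    byte[] cachedHeader = null; // header of the *next* block, if the last read got it
    long offset = 0;
    int extraHeaderReads = 0;

    while (offset < FILE.length) {
      if (cachedHeader == null) {
        // Costly path: a separate read (seek) just for this block's header.
        cachedHeader = new byte[HEADER_SIZE];
        pread(offset, cachedHeader, 0, HEADER_SIZE, false);
        extraHeaderReads++;
      }
      offset += HEADER_SIZE;

      // Ask for this block's body *plus* the next block's header.
      byte[] buf = new byte[BLOCK_SIZE + HEADER_SIZE];
      boolean shortRead = (offset / (HEADER_SIZE + BLOCK_SIZE)) % 2 == 1; // alternate for demo
      int n = pread(offset, buf, 0, buf.length, shortRead);

      if (n >= BLOCK_SIZE + HEADER_SIZE) {
        cachedHeader = Arrays.copyOfRange(buf, BLOCK_SIZE, BLOCK_SIZE + HEADER_SIZE);
      } else {
        cachedHeader = null; // short read: next iteration pays the extra header read
      }
      offset += BLOCK_SIZE;
    }
    System.out.println("extra header-only reads: " + extraHeaderReads);
  }
}
{code}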

The very high-level reading of the HBase code would indicate to me that we 
_expect_ to read the (n+1)th block header when reading the nth block. I would 
assume that we also want this for HDFS-based clusters, but HDFS just does a good 
enough job that we haven't noticed this being a problem (short-circuit reads 
keeping us happy?).

I think attempting to read off the end of a file is not a big concern, since 
we're just pulling those extra bytes in as part of the current read. I am 
thinking about a different drawback: if the InputStream isn't giving us back 
the bytes we asked for, why was that? Did it take over some threshold of time? 
If we go back and ask HDFS (or S3) again, "give me those extra bytes", would we 
increase the overall latency? Genuinely not sure. (See the sketch of such a 
retry loop below.)
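
For reference, a rough sketch of what forcing a "read all bytes" pread could look like (this is my own sketch, not the actual patch; it only assumes the standard positioned-read API on hadoop-common's {{FSDataInputStream}}): keep re-issuing the positioned read until the whole requested length, data block plus the optional next header, is in the buffer.
{code:java}
import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

public final class ReadAllBytesSketch {
  private ReadAllBytesSketch() {}

  /**
   * Keep issuing positioned reads until {@code length} bytes are buffered,
   * instead of returning after the first short read.
   */
  public static void preadFully(FSDataInputStream in, long position,
      byte[] buf, int offset, int length) throws IOException {
    int remaining = length;
    while (remaining > 0) {
      int n = in.read(position, buf, offset, remaining);
      if (n < 0) {
        throw new EOFException("EOF after reading " + (length - remaining)
            + " of " + length + " bytes at position " + position);
      }
      // Each retry is another round trip to the underlying store (HDFS/S3),
      // which is exactly the latency question raised above.
      position += n;
      offset += n;
      remaining -= n;
    }
  }
}
{code}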

I think, long-term, it makes sense for this configuration to be on by default, 
but I am motivated to expose this configuration property for additional 
testing on HDFS while committing this change to help the S3-based prefetching 
workload. I'm leaning towards putting this in since the risk is low (given my 
understanding).

WDYT, Duo? Stephen, did I get this all correct? (please correct me if I'm wrong)


[jira] [Comment Edited] (HBASE-27013) Introduce read all bytes when using pread for prefetch

2022-05-09 Thread Tak-Lon (Stephen) Wu (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534042#comment-17534042
 ] 

Tak-Lon (Stephen) Wu edited comment on HBASE-27013 at 5/10/22 3:40 AM:
---

bq. we can not reuse the stream to send multiple pread requests with random 
offset

The concept of reusing the same stream is about how far it reads ahead (the 
readahead range) from a single/current HTTP call to the object store, e.g. S3. 
If a seek/pread asks for a range that has already been read ahead in the HTTP 
response, we don't need to reopen a new HTTP connection to keep the data 
streaming. In other words, it's a different type of streaming implementation, 
one based on an HTTP connection to the blob storage. The problem for this 
prefetch is that, if we're using {{fs.s3a.experimental.input.fadvise=sequential}} 
and we have read a lot of data from the remote store into a local buffer, we 
don't want to completely drain and abort the connection. (Meanwhile, we know 
that {{fs.s3a.experimental.input.fadvise=random}} can read small amounts of 
data into the buffer one at a time, but it's a lot slower.)
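
For context on the two fadvise policies being compared, here is a minimal sketch of setting them through a Hadoop {{Configuration}} (the property name and values are the standard S3A ones; the surrounding class is illustrative only):
{code:java}
import org.apache.hadoop.conf.Configuration;

public class S3AFadviseExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Sequential: large read-ahead, good for bulk scans such as prefetch, but
    // draining/aborting the HTTP connection on a backwards seek is costly.
    conf.set("fs.s3a.experimental.input.fadvise", "sequential");

    // Random: small ranged GETs per pread; cheaper seeks, slower bulk reads.
    // conf.set("fs.s3a.experimental.input.fadvise", "random");

    System.out.println(conf.get("fs.s3a.experimental.input.fadvise"));
  }
}
{code}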

bq. Seems not like a good enough pread implementation

I would say we're applying the HDFS semantics to blob storage like S3A, such 
that we're doing something interesting for any supported blob storage. HDFS, as 
Josh also pointed out, is just a lot faster than any filesystem implementation 
written for blob storage.

bq.  FSDataInputStream may be used by different read requests so even if you 
fixed this problem, it could still introduce a lot of aborts as different read 
request may read from different offsets...

So, it won't introduce other aborts when reading other offsets, because the 
problem we're facing in this JIRA is only for prefetch, and I should have 
proved that in my prototype. To look at it technically, it's the other way 
around: we're the ones closing and aborting as of today, without my change.

The improvement in this JIRA is only about a store open, or prefetch during 
store open; the actual usage is to customize the prefetch (via a Store File 
Manager) with the proposed configuration ({{hfile.pread.all.bytes.enabled}}) 
while the store is opening, and use this optional read-all-bytes feature. (But 
we don't provide this store file manager here, because this option is disabled 
by default.)
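
As a sketch of how the proposed flag would be turned on (the property name comes from this JIRA; the surrounding code is illustrative only, not the patch itself, and the default stays off):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class PreadAllBytesConfigExample {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Proposed in HBASE-27013; disabled by default. Enabling it makes the
    // prefetch pread keep reading until the data block plus the optional
    // next-block header are fully buffered.
    conf.setBoolean("hfile.pread.all.bytes.enabled", true);
    System.out.println(conf.getBoolean("hfile.pread.all.bytes.enabled", false));
  }
}
{code}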

To sum up, if we're introducing a lot of aborts, then I think our 
implementation isn't right, but I still haven't found a case where we introduce 
new aborts if we're reading the extra header that is part of the same read as 
the data block of the HFile.

