[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking

2020-11-04 Thread Tak-Lon (Stephen) Wu (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226329#comment-17226329
 ] 

Tak-Lon (Stephen) Wu edited comment on HBASE-24749 at 11/4/20, 6:23 PM:


Thanks Nick for the information! I will create a feature branch and come back to 
discuss when we should merge.


was (Author: taklwu):
thanks Nick !

> Direct insert HFiles and Persist in-memory HFile tracking
> -
>
> Key: HBASE-24749
> URL: https://issues.apache.org/jira/browse/HBASE-24749
> Project: HBase
>  Issue Type: Umbrella
>  Components: Compaction, HFile
>Affects Versions: 3.0.0-alpha-1
>Reporter: Tak-Lon (Stephen) Wu
>Assignee: Tak-Lon (Stephen) Wu
>Priority: Major
>  Labels: design, discussion, objectstore, storeFile, storeengine
> Attachments: 1B100m-25m25m-performance.pdf, Apache HBase - Direct 
> insert HFiles and Persist in-memory HFile tracking.pdf
>
>
> We propose a new feature (a new store engine) that removes the {{.tmp}} 
> directory used in the commit stage of common HFile operations such as flush 
> and compaction, to improve write throughput and latency on object stores. 
> Specifically for S3 filesystems, this also mitigates the read-after-write 
> inconsistencies caused by validating HFiles immediately after moving them to 
> the data directory.
> Please see the attached proposal and the initial results captured with 25m 
> (25m operations) and 1B (100m operations) YCSB workload A LOAD and RUN, and 
> workload C RUN.
> The goal of this JIRA is to discuss with the community whether the proposed 
> improvement for the object store use case makes sense and whether anything we 
> missed should be included.
> Improvement Highlights
>  1. Lower write latency, especially at p99+
>  2. Higher write throughput on flush and compaction
>  3. Lower MTTR on region (re)open or assignment
>  4. Remove the consistency-check dependencies (e.g. DynamoDB) required by the 
> filesystem implementation
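
For illustration only, a minimal Java sketch of the direct-insert idea described above. {{DirectInsertSketch}} and {{StoreFileTracker}} are hypothetical names, not the actual HBase store engine API: the flush result is written straight into the data directory, and the "commit" becomes a tracking-metadata update instead of a rename out of {{.tmp}}.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.UUID;

/**
 * Hypothetical sketch only (not the HBase store engine): write the flushed HFile
 * straight into the store's data directory and record it in a tracker, instead of
 * writing to a .tmp directory and renaming on commit.
 */
public class DirectInsertSketch {

  /** Hypothetical tracker of committed store files, e.g. backed by hbase:meta. */
  interface StoreFileTracker {
    void add(Path committedFile) throws IOException;
    Set<Path> committedFiles() throws IOException;
  }

  static Path flushDirect(byte[] serializedCells, Path dataDir, StoreFileTracker tracker)
      throws IOException {
    // Write directly into the data directory -- no temporary directory, no rename.
    Path hfile = dataDir.resolve(UUID.randomUUID().toString());
    Files.write(hfile, serializedCells, StandardOpenOption.CREATE_NEW);
    // "Commit" is now just persisting the file name in the tracker; readers open only
    // files the tracker lists, so a partially written or untracked file is ignored.
    tracker.add(hfile);
    return hfile;
  }
}
{code}

Because readers consult only the tracker's file list, there is no need to validate a just-renamed file, which is what removes the read-after-write dependency on S3.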



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking

2020-08-24 Thread Tak-Lon (Stephen) Wu (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183739#comment-17183739
 ] 

Tak-Lon (Stephen) Wu edited comment on HBASE-24749 at 8/25/20, 5:18 AM:


I updated the [design 
doc|https://docs.google.com/document/d/15Nx-xZ7FoPoud9vqkmIwphkNwBv0mdKMkvU7Ley5i4A/edit?usp=sharing]
 (Google Doc version) with development plan milestones. Please feel free to 
leave comments there.

In addition, I wonder whether we can simplify the design for tracking the 
HFiles of the ROOT region by relying directly on the file storage (assuming the 
WAL works correctly and HFiles are always immutable), i.e. without adding a 
tracking layer, while still writing HFiles directly to the data directory. The 
dependency flow would be:
 # On (re)open, the ROOT region only cares about the HFiles in its own data 
directory (relying on the MVCC protection to decide which files should be 
included).
 # The HFile tracking for hbase:meta is written to the ROOT region (similar to 
how the meta location is handled), and this tracking metadata is protected by 
the WAL of the ROOT region and the HFiles in the ROOT region's data directory.
 # The HFile tracking for all other tables is written to a column family 
cf:storefile in hbase:meta. The only read-heavy period is during region open 
and region assignment.

We have provided an [investigation (Appendix #1) in the new design doc 
|https://docs.google.com/document/d/15Nx-xZ7FoPoud9vqkmIwphkNwBv0mdKMkvU7Ley5i4A/edit?usp=sharing]
 showing that the MVCC (max sequence id) in the Store is the guard that lets us 
reload cells from the HFiles in the data directory without the tracking 
metadata or the .tmp directory. But we should only use this for the ROOT 
region, because the number of HFiles in the ROOT directory is limited and 
normally does not change frequently. A sketch of this idea follows.
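
A minimal sketch of point 1, assuming a hypothetical {{SeqIdReader}} accessor (not the HBase API): the ROOT region trusts whatever readable HFiles are in its data directory and uses the maximum sequence id across them to decide which WAL entries still need to be replayed.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/**
 * Hypothetical sketch (not the HBase implementation): on ROOT region (re)open, list
 * the HFiles in the data directory and use each file's max sequence id (MVCC) to
 * decide what still needs to be replayed from the WAL -- no extra tracking layer.
 */
public class RootRegionOpenSketch {

  /** Hypothetical accessor for the max sequence id recorded in an HFile's metadata. */
  interface SeqIdReader {
    long maxSeqId(Path hfile) throws IOException;
  }

  static long openStore(Path dataDir, SeqIdReader reader) throws IOException {
    List<Path> hfiles;
    try (Stream<Path> files = Files.list(dataDir)) {
      hfiles = files.collect(Collectors.toList());
    }
    long maxSeqIdAcrossFiles = -1;
    for (Path hfile : hfiles) {
      // Every readable HFile in the data directory is included; its max seq id tells
      // us up to which point the data is already durable in HFiles.
      maxSeqIdAcrossFiles = Math.max(maxSeqIdAcrossFiles, reader.maxSeqId(hfile));
    }
    // WAL entries with seq id <= maxSeqIdAcrossFiles are already covered by HFiles and
    // can be skipped during replay; newer entries are replayed into the memstore.
    return maxSeqIdAcrossFiles;
  }
}
{code}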

 




[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking

2020-08-04 Thread Tak-Lon (Stephen) Wu (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171116#comment-17171116
 ] 

Tak-Lon (Stephen) Wu edited comment on HBASE-24749 at 8/4/20, 8:59 PM:
---

Sorry for the delay; I was out for a few days last week.
{quote}every flush and compaction will result in an update inline w/ the 
flush/compaction completion – if it fails, the flush/compaction fail?
{quote}
If updating the HFile set in {{hbase:meta}} fails, the operation should be 
considered a failure when this feature is enabled. Do you have concerns about 
blocking the actual flush from completing? (It should be similar to other 
features such as {{hbase:quota}}.)
{quote}Master would update, or RS writes meta, a violation of a simplification 
we made trying to ensure one-writer
{quote}
Ensuring a single writer to {{hbase:meta}} is a good point, and we haven't 
considered the one-writer scenario yet. I'm not sure of the right approach, but 
since the flush happens on the RS side, either the RS opens a direct connection 
to {{hbase:meta}} that is limited to writing only this column family, outside 
of the master (suggested by [~zyork], pending investigation), or, as you 
suggested, the RS packages the HFile set information into an RPC call to the 
Master and the Master updates the HFile set. The amount of traffic (direct 
table connection or RPC call) should be about the same; I still need to compare 
whether the overhead (throughput) differs.

In addition, I will try to come up with a set of sub-tasks and update the 
proposal doc in the coming week. Please bear with me; the plan may include some 
transition tasks (the goal is to deliver in stages), e.g. 1) build the separate 
system table first, then follow up with tasks to 2) compare migrating into 
{{hbase:meta}} and 3) actually merge into {{hbase:meta}} (as a throughput 
sanity check).
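
To make the failure semantics above concrete, a minimal sketch with hypothetical names ({{MetaStoreFileTracker}}, {{commitFlush}}; not the actual flush path): the tracking update is part of the flush commit, so if persisting the HFile set fails, the flush fails and the new file stays invisible to readers.

{code:java}
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

/**
 * Hypothetical sketch (not the actual HBase flush path): the hbase:meta tracking
 * update is part of the flush commit, so a failed update fails the flush and the new
 * HFile is never exposed to readers.
 */
public class FlushCommitSketch {

  /** Hypothetical persistence of the HFile set, e.g. cf:storefile in hbase:meta. */
  interface MetaStoreFileTracker {
    void addStoreFiles(String regionAndFamily, List<Path> newFiles) throws IOException;
  }

  static void commitFlush(String regionAndFamily, List<Path> flushedFiles,
      MetaStoreFileTracker tracker) throws IOException {
    try {
      // The flush only succeeds if the tracking metadata is durably updated.
      tracker.addStoreFiles(regionAndFamily, flushedFiles);
    } catch (IOException e) {
      // Tracking update failed: treat the whole flush as failed. The flushed files stay
      // untracked (ignored by readers) and the memstore flush is retried later.
      throw new IOException("Flush aborted: could not persist store file tracking", e);
    }
    // Only after the tracker update do the new files become visible to readers.
  }
}
{code}

Whether the tracker write goes through a direct {{hbase:meta}} connection from the RS or an RPC to the Master only changes who performs this step, not the fail-the-flush semantics.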




[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking

2020-07-23 Thread Zach York (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164094#comment-17164094
 ] 

Zach York edited comment on HBASE-24749 at 7/24/20, 3:37 AM:
-

Yes, I think that is potentially an alternative implementation that could work. 
One downside I can see is that you would still want to be able to handle bulk 
loading and other procedures. If all updates to the state are controlled by the 
RS, this approach would work. I wonder what the perf difference might be, since 
in this case you would always have to replay edits.

Edit: After thinking it through a bit, the WAL approach has one problem in our 
environment (where we expect the HDFS WALs will not be migrated to a new 
cluster). Storing the data in a table is more durable for our use case, but the 
WAL implementation could be suitable for the ROOT table, where it matters less 
if the file list needs to fall back to FS listing/validation.








[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking

2020-07-23 Thread Anoop Sam John (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164090#comment-17164090
 ] 

Anoop Sam John edited comment on HBASE-24749 at 7/24/20, 2:33 AM:
--

bq. Can you expand on how we can get in a situation where a partial file is 
written? I'm trying to see if there are any failure modes we haven't thought 
of. If the case is a complete file written to the data directory, is there harm 
in picking up the new file (even if it hasn't successfully committed to the 
SFM)?

That point was based on another direction that Stack was suggesting; I am not 
sure whether Stack suggested it for the META table alone or for all tables. The 
idea is to use the WAL event markers to know whether an HFile is committed or 
not. If, during WAL replay, we see a flush begin marker and later a flush 
complete marker, the file is committed. If there are no markers at all for a 
file, it is an old existing file. If there is only a begin marker and no end 
marker, the file is not committed, so on region reopen we can ignore it. The 
same applies to compaction. One issue is: what if the WAL file that holds the 
begin marker gets rolled and deleted? Then we lose track. But if that can be 
controlled, this is also a viable direction, no? (A dedicated WAL for these 
event markers.) It would avoid the need to store the full file list in META and 
avoid the question of how to handle META's own file list. Storing it in ZK is 
not a direction we want to take. A classification sketch follows.
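
A minimal sketch of the marker-based classification, using hypothetical types ({{FlushMarker}}, {{FileState}}) rather than the real WAL entries: files with a begin and a complete marker are committed, files with no markers are pre-existing, and files with only a begin marker are ignored on reopen.

{code:java}
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch (not the HBase WAL API): classify HFiles during WAL replay
 * using flush begin/complete event markers. Begin + complete means committed, no
 * markers means an old existing file, and begin without complete means an
 * uncommitted file that should be ignored on region reopen.
 */
public class WalMarkerSketch {

  enum FileState { OLD_EXISTING, COMMITTED, UNCOMMITTED }

  /** Hypothetical marker record: which HFile it refers to, and whether it is "begin". */
  record FlushMarker(String hfileName, boolean begin) {}

  static Map<String, FileState> classify(Collection<String> hfilesOnDisk,
      List<FlushMarker> markersInWalOrder) {
    Map<String, FileState> state = new HashMap<>();
    for (String f : hfilesOnDisk) {
      state.put(f, FileState.OLD_EXISTING);      // default: no markers seen at all
    }
    for (FlushMarker m : markersInWalOrder) {
      if (!state.containsKey(m.hfileName())) {
        continue;                                // marker for a file no longer on disk
      }
      // A begin marker makes the file tentatively uncommitted; a later complete marker
      // for the same file upgrades it to committed.
      state.put(m.hfileName(), m.begin() ? FileState.UNCOMMITTED : FileState.COMMITTED);
    }
    return state;
  }
}
{code}

The weak point, as noted above, is that the WAL holding a begin marker can be rolled and deleted, which is why a dedicated WAL for these markers is suggested.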








[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking

2020-07-23 Thread Guanghao Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164048#comment-17164048
 ] 

Guanghao Zhang edited comment on HBASE-24749 at 7/24/20, 12:19 AM:
---

bq. it should be HBASE-20724, so for compaction we can reuse that to confirm 
whether the flushed StoreFile came from a compaction.

Yes. The compaction event marker in the WAL is not used anymore.










[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking

2020-07-23 Thread Tak-Lon (Stephen) Wu (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164037#comment-17164037
 ] 

Tak-Lon (Stephen) Wu edited comment on HBASE-24749 at 7/24/20, 12:03 AM:
-

bq. If an HFile is written successfully but no marker in the WAL, then it 
doesn't exist, right? As part of the WAL replay you will reconstitute it from 
edits in the WAL?

You're right: if that happens, any uncommitted HFile should not be picked up, 
and the data is replayed from the WALs. (Replay may generate the same content 
in a different HFile; I was thinking about a store-open optimization for that, 
but it is too far out for now.)

bq. you have surveyed the calls to the NN made by HBase on a regular basis?

We have not captured how many rename calls go to the NN or to the object store, 
and we will add a measurement/survey task to the related milestone. But at one 
point we measured that, when running compaction with rename, the rename step 
alone dominated about 60% of the overall compaction wall-clock time on object 
stores.

bq. IIRC there is an issue for storing the compacted files in HFile's metadata, 
to solve the problem that the WAL file containing the compaction marker may be 
deleted before WAL splitting.

It should be HBASE-20724, so for compaction we can reuse that to confirm 
whether a flushed StoreFile came from a compaction.

bq. Now while replay of WAL, we don't have the start compaction marker for this 
WAL file. So we think this is an old valid file, but that is wrong. This is a 
partial file. This is possible.

Don't we only write the `end` compaction event marker once the compacted 
HFile(s) have been moved to the cf directory, right before updating the store 
file manager? But yes, if the WAL is rolled, we lose this event marker.

This is a good discussion; I will record the points above as considerations in 
the related milestone. A sketch of the HBASE-20724-style check follows.
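
A minimal sketch of the HBASE-20724-style check, with a hypothetical {{StoreFileInfo}} view instead of the real HFile metadata API: each compaction result records the input files it replaced, so on open we can drop files that were already compacted away without consulting WAL compaction markers.

{code:java}
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

/**
 * Hypothetical sketch (not the HFile metadata API): following the HBASE-20724 idea,
 * each compaction result records the names of the input files it replaced in its own
 * metadata, so on region open we can drop files that were already compacted away
 * without needing WAL compaction markers.
 */
public class CompactionMetadataSketch {

  /** Hypothetical per-file view: the file name and the inputs it compacted, if any. */
  record StoreFileInfo(String name, Set<String> compactedInputs) {}

  /** Files to load on open: everything not recorded as replaced by a compaction. */
  static Set<String> liveFiles(Collection<StoreFileInfo> filesOnDisk) {
    Set<String> replaced = new HashSet<>();
    for (StoreFileInfo f : filesOnDisk) {
      replaced.addAll(f.compactedInputs());     // inputs listed in the result's metadata
    }
    Set<String> live = new LinkedHashSet<>();
    for (StoreFileInfo f : filesOnDisk) {
      if (!replaced.contains(f.name())) {
        live.add(f.name());                      // not superseded by any compaction
      }
    }
    return live;
  }
}
{code}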







[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking

2020-07-21 Thread Tak-Lon (Stephen) Wu (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162196#comment-17162196
 ] 

Tak-Lon (Stephen) Wu edited comment on HBASE-24749 at 7/21/20, 5:26 PM:


Thanks [~busbey] and [~zhangduo], I will send an email to the dev@ list.
{quote}Could this be a part of meta instead?
{quote}
Definitely, we can move this into the meta table, and it shouldn't add too much 
load or data (about 400 MB for 10k regions). But [~zhangduo] is right that the 
meta table itself still needs somewhere to store the tracking for the meta 
region(s). Other than storing the store file tracking of the meta table region 
in ZooKeeper, do you have other thoughts?


one additional thoughts on the meta table itself, do we need meta table to be 
splittable mentioned in HBASE-23055 ? basically we were thinking how much load 
meta table could handle, and we have yet gathered that information. 

 
{quote}But it is still a pain that we could see a HFile on the filesystem but 
we do not know whether it is valid... Check for trailer?
{quote}
For repairing and recovery, we are currently testing and reusing the 
refresh-HFiles coprocessor ({{RefreshHFilesClient}}), which relies on 
{{HStore#openStoreFiles}} to validate whether an HFile should be included 
(whether it can be opened and/or whether it is the result of a compaction). It 
seems that {{HStore#openStoreFiles}} ({{initReader}}) already does that 
metadata check?
{quote}And we should provide new HBCK tools to sync the file system with the 
storefiles family, or rebuild the storefiles family from the filesystem?
{quote}
We can make that part of repairing the meta table in HBCK, so users do not need 
to run a separate command. A validation sketch follows.
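
A minimal sketch of the validation step, using a hypothetical {{HFileValidator}} in place of {{HStore#openStoreFiles}}/{{initReader}}: only HFiles whose trailer/metadata can actually be opened are added back to the store file list during repair; anything else is left for the HBCK-style repair to handle.

{code:java}
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/**
 * Hypothetical sketch (not HStore#openStoreFiles itself): during repair/recovery,
 * only HFiles that can actually be opened (readable trailer/metadata) are added back
 * to the store file list; anything else is skipped and reported for repair tooling.
 */
public class RecoveryValidationSketch {

  /** Hypothetical check standing in for "can this HFile be opened as a reader?". */
  interface HFileValidator {
    boolean canOpen(Path hfile) throws IOException;
  }

  static List<Path> rebuildStoreFileList(Collection<Path> filesInDataDir,
      HFileValidator validator) throws IOException {
    List<Path> valid = new ArrayList<>();
    for (Path hfile : filesInDataDir) {
      if (validator.canOpen(hfile)) {
        valid.add(hfile);                        // trailer/metadata readable: include it
      } else {
        // Partial write or corruption: leave it out and let the HBCK-style repair decide
        // whether to archive or rebuild it.
        System.err.println("Skipping unreadable HFile: " + hfile);
      }
    }
    return valid;
  }
}
{code}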





