[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking
[ https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226329#comment-17226329 ]

Tak-Lon (Stephen) Wu edited comment on HBASE-24749 at 11/4/20, 6:23 PM:
------------------------------------------------------------------------

Thanks Nick for the information! I will create a feature branch and come back to discuss when we should merge.

was (Author: taklwu):
Thanks Nick!

> Direct insert HFiles and Persist in-memory HFile tracking
> ----------------------------------------------------------
>
> Key: HBASE-24749
> URL: https://issues.apache.org/jira/browse/HBASE-24749
> Project: HBase
> Issue Type: Umbrella
> Components: Compaction, HFile
> Affects Versions: 3.0.0-alpha-1
> Reporter: Tak-Lon (Stephen) Wu
> Assignee: Tak-Lon (Stephen) Wu
> Priority: Major
> Labels: design, discussion, objectstore, storeFile, storeengine
> Attachments: 1B100m-25m25m-performance.pdf, Apache HBase - Direct insert HFiles and Persist in-memory HFile tracking.pdf
>
>
> We propose a new feature (a new store engine) to remove the {{.tmp}} directory used in the commit stage of common HFile operations such as flush and compaction, to improve write throughput and latency on object stores. Specifically for S3 filesystems, this will also mitigate read-after-write inconsistencies caused by validating HFiles immediately after moving them to the data directory.
> Please see the attachments for this proposal and the initial results captured with 25m (25m operations) and 1B (100m operations) YCSB workload A LOAD and RUN, and workload C RUN.
> The goal of this JIRA is to discuss with the community whether the proposed improvement for the object store use case makes sense and whether we missed anything that should be included.
> Improvement highlights:
> 1. Lower write latency, especially at p99 and above
> 2. Higher write throughput on flush and compaction
> 3. Lower MTTR on region (re)open or assignment
> 4. Remove the consistency-check dependencies (e.g. DynamoDB) otherwise required of the file system implementation


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
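To make the proposal in the issue description concrete: today an HFile is written under {{.tmp}} and then renamed into the data directory, while the proposed store engine writes it directly into the data directory and commits it by persisting an entry in a tracking store. The sketch below is illustrative only; {{StoreFileTracker}} and both commit methods are hypothetical names for this discussion, and only the Hadoop {{FileSystem}} calls are real APIs.

{code:java}
// Illustrative sketch only: contrasts the legacy ".tmp then rename" commit with the
// proposed direct insert. StoreFileTracker and both commit methods are hypothetical
// names for this discussion, not HBase APIs; only the Hadoop FileSystem calls are real.
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectInsertSketch {

  /**
   * Legacy path: the HFile is written under .tmp and then renamed into the data
   * directory. On S3 the rename is a copy + delete, and the immediate validation
   * read after the move can hit read-after-write inconsistency.
   */
  static Path commitViaRename(FileSystem fs, Path tmpFile, Path dataDir) throws IOException {
    Path committed = new Path(dataDir, tmpFile.getName());
    if (!fs.rename(tmpFile, committed)) {
      throw new IOException("Failed to commit " + tmpFile + " to " + committed);
    }
    return committed;
  }

  /**
   * Proposed path: the HFile is written directly into the data directory, and the
   * commit is just a metadata update in a persistent tracking store (for example the
   * cf:storefile family in hbase:meta). No .tmp directory, no rename.
   */
  static void commitViaTracking(Path directlyWrittenFile, StoreFileTracker tracker)
      throws IOException {
    tracker.recordCommitted(directlyWrittenFile);
  }

  /** Hypothetical tracking abstraction standing in for the proposed store engine. */
  interface StoreFileTracker {
    void recordCommitted(Path hfile) throws IOException;

    List<Path> committedFiles() throws IOException;
  }
}
{code}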
[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking
[ https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183739#comment-17183739 ]

Tak-Lon (Stephen) Wu edited comment on HBASE-24749 at 8/25/20, 5:18 AM:
------------------------------------------------------------------------

I updated the [design doc|https://docs.google.com/document/d/15Nx-xZ7FoPoud9vqkmIwphkNwBv0mdKMkvU7Ley5i4A/edit?usp=sharing] (Google Doc version) with development plan milestones; please feel free to drop comments there.

In addition, I'm wondering if we can simplify the circuit for tracking the HFiles of the ROOT region by relying directly on the file storage (assuming the WAL works fine and HFiles are always immutable), without adding a tracking layer, while still writing HFiles directly to the data directory. The dependency flow would be:
# When (re)opening, the ROOT region only cares about the HFiles in the ROOT region's data directory (relying on MVCC protection to decide which files should be included).
# The HFile tracking for hbase:meta is written to the ROOT region (similar to how the meta location is handled), and this tracking metadata is protected by the ROOT region's WAL and the HFiles in the ROOT region's data directory.
# The HFile tracking for all other tables is updated in a cf:storefile column family in hbase:meta. The only read-intensive period is during region open and region assignment.

We have provided an [investigation (Appendix #1) in the new design doc|https://docs.google.com/document/d/15Nx-xZ7FoPoud9vqkmIwphkNwBv0mdKMkvU7Ley5i4A/edit?usp=sharing] showing that the MVCC (max sequence id) in the Store is the guard for reloading cells from HFiles in the data directory without the tracking metadata and the .tmp directory. But we should only use this for the ROOT region, because the number of HFiles in the ROOT directory is limited and normally won't change frequently.

was (Author: taklwu):
I updated the [design doc|https://docs.google.com/document/d/15Nx-xZ7FoPoud9vqkmIwphkNwBv0mdKMkvU7Ley5i4A/edit?usp=sharing] (Google Doc version); please leave any design-related comments there directly, to avoid a long page of comments in this JIRA.

In addition, I'm wondering if we can simplify the circuit for tracking the HFiles of the ROOT region by relying directly on the file storage (assuming the WAL works fine and HFiles are always immutable), without adding a tracking layer, while still writing HFiles directly to the data directory. The dependency flow would be:
# When (re)opening, the ROOT region only cares about the HFiles in the ROOT region's data directory (relying on MVCC protection to decide which files should be included).
# The HFile tracking for hbase:meta is written to the ROOT region (similar to how the meta location is handled), and this tracking metadata is protected by the ROOT region's WAL and the HFiles in the ROOT region's data directory.
# The HFile tracking for all other tables is updated in a cf:storefile column family in hbase:meta. The only read-intensive period is during region open and region assignment.

We have provided an [investigation (Appendix #1) in the new design doc|https://docs.google.com/document/d/15Nx-xZ7FoPoud9vqkmIwphkNwBv0mdKMkvU7Ley5i4A/edit?usp=sharing] showing that the MVCC (max sequence id) in the Store is the guard for reloading cells from HFiles in the data directory without the tracking metadata and the .tmp directory.
But we should only use this for the ROOT region, because the number of HFiles in the ROOT directory is limited and normally won't change frequently.
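For the cf:storefile idea in the comment above, here is a minimal sketch, assuming a "storefile" column family in {{hbase:meta}} as proposed in this thread, of what recording a committed HFile could look like with the standard HBase client API. The family name, qualifier layout, and the example meta row key are assumptions for illustration, not an existing schema.

{code:java}
// A minimal sketch, assuming a "storefile" column family in hbase:meta as proposed in
// this thread; the family name, qualifier layout, and the example meta row key are
// assumptions, not an existing HBase schema. The client API calls themselves are standard.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaStoreFileTrackingSketch {
  private static final byte[] STOREFILE_FAMILY = Bytes.toBytes("storefile"); // assumed family

  /** Record one committed HFile for a region's column family under its hbase:meta row. */
  public static void recordHFile(Configuration conf, byte[] metaRowKey, String columnFamily,
      String hfileName) throws IOException {
    try (Connection conn = ConnectionFactory.createConnection(conf);
        Table meta = conn.getTable(TableName.META_TABLE_NAME)) {
      Put put = new Put(metaRowKey);
      // Qualifier layout (family + HFile name) is illustrative only.
      put.addColumn(STOREFILE_FAMILY, Bytes.toBytes(columnFamily + ":" + hfileName),
          Bytes.toBytes(System.currentTimeMillis()));
      meta.put(put);
    }
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    // Placeholder meta row key and HFile name, purely for illustration.
    recordHFile(conf, Bytes.toBytes("usertable,,1595550000000.abcdef1234567890."), "cf",
        "d41d8cd98f00b204e9800998ecf8427e");
  }
}
{code}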
[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking
[ https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171116#comment-17171116 ]

Tak-Lon (Stephen) Wu edited comment on HBASE-24749 at 8/4/20, 8:59 PM:
------------------------------------------------------------------------

Sorry for the delay, I was out for a few days last week.

{quote}every flush and compaction will result in an update inline w/ the flush/compaction completion – if it fails, the flush/compaction fail?{quote}
If updating the HFile set in {{hbase:meta}} fails, it should be considered a failure when this feature is enabled. Do you have concerns about blocking the actual flush from completing? (It should be similar to other features such as {{hbase:quota}}.)

{quote}Master would update, or RS writes meta, a violation of a simplification we made trying to ensure one-writer{quote}
Ensuring a single writer to {{hbase:meta}} is a good point; we haven't considered the one-writer scenario yet. I'm not sure of the right approach, but since the flush happens on the RS side, either the RS creates a direct connection to {{hbase:meta}} that is limited to writing only this column family outside of the Master (suggested by [~zyork], pending investigation), or, as you suggested, we package the HFile set information into an RPC call to the Master and the Master updates the HFile set. The amount of traffic (direct table connection or RPC call) should be the same; I still need to compare whether the overhead (throughput) differs.

In addition, I will try to come up with a set of sub-tasks and update the proposal doc in the coming week. Please bear with me; the plan may have some transition tasks (the goal is to deliver in stages), e.g. 1. have the separate system table first, then follow-up tasks to 2. compare the migration into {{hbase:meta}}, and 3. actually merge into {{hbase:meta}} (as a throughput sanity check).

was (Author: taklwu):
Sorry for the delay, I was out for a few days last week.

{quote}every flush and compaction will result in an update inline w/ the flush/compaction completion – if it fails, the flush/compaction fail?{quote}
If updating the HFile set in {{hbase:meta}} fails, it should be considered a failure when this feature is enabled. Do you have concerns about blocking the actual flush from completing? (It should be similar to other features such as {{hbase:quota}}.)

{quote}Master would update, or RS writes meta, a violation of a simplification we made trying to ensure one-writer{quote}
Ensuring a single writer to {{hbase:meta}} is a good point; we haven't considered the one-writer scenario yet. I'm not sure of the right approach, but since the flush happens on the RS side, either the RS creates a direct connection to {{hbase:meta}} that is limited to writing only this column family outside of the Master (suggested by [~zyork], pending investigation), or, as you suggested, we package the HFile set information into an RPC call to the Master and the Master updates the HFile set. The amount of traffic (direct table connection or RPC call) should be the same; I still need to compare whether the overhead (throughput) differs.

In addition, I will try to come up with a set of sub-tasks and update the proposal doc in the coming week. Please bear with me; the plan may have some transition tasks, e.g. 1. have the separate system table first, then follow-up tasks to 2. compare the migration into {{hbase:meta}}, and 3. actually merge into {{hbase:meta}} (as a throughput sanity check).
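The two update paths weighed above (RS writes {{hbase:meta}} directly vs. RS reports to the Master over RPC) can be made concrete with a small, purely hypothetical interface; none of the types below exist in HBase, they only name the alternatives for comparison.

{code:java}
// Hypothetical sketch of the two update paths discussed above; none of these types exist
// in HBase, they only make the alternatives concrete for comparison.
import java.io.IOException;
import java.util.List;

public interface StoreFileSetUpdater {
  /** Persist the committed-HFile set for one store after a flush or compaction. */
  void updateStoreFileSet(String regionEncodedName, String family, List<String> hfiles)
      throws IOException;
}

/**
 * Option A: the RegionServer writes the cf:storefile family of hbase:meta directly,
 * restricted to that single family, so the "one writer to meta" rule is only relaxed
 * for this bookkeeping column.
 */
class DirectMetaUpdater implements StoreFileSetUpdater {
  @Override
  public void updateStoreFileSet(String regionEncodedName, String family, List<String> hfiles)
      throws IOException {
    // e.g. build a Put against hbase:meta, as in the earlier tracking sketch.
  }
}

/**
 * Option B: the RegionServer ships the set to the Master over RPC, and the Master, as the
 * sole meta writer, applies the update (likely via a procedure).
 */
class MasterRpcUpdater implements StoreFileSetUpdater {
  @Override
  public void updateStoreFileSet(String regionEncodedName, String family, List<String> hfiles)
      throws IOException {
    // e.g. masterStub.reportStoreFileSet(...) -- a hypothetical RPC, not an existing one.
  }
}
{code}

Either way the write volume is the same; the comparison the comment calls for is the throughput overhead of an extra Master hop versus relaxing the one-writer rule.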
[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking
[ https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164094#comment-17164094 ]

Zach York edited comment on HBASE-24749 at 7/24/20, 3:37 AM:
------------------------------------------------------------------------

Yes, I think that is potentially an alternative implementation that could work. One downside I could see is that you would still want to be able to handle bulk loading and other procedures. If all updates to the state are controlled by the RS, this approach would work. I wonder what the performance difference might be, since in this case you would always have to replay edits.

Edit: After thinking it through a bit, the WAL approach has one problem in our environment (where we expect the HDFS WALs will not be migrated to a new cluster). Storing the data in a table is more durable for our use case, but the WAL implementation could be suitable for the ROOT table, where it matters less if the file list needs to fall back to FS listing/validation.

was (Author: zyork):
Yes, I think that is potentially an alternative implementation that could work. One downside I could see is that you would still want to be able to handle bulk loading and other procedures. If all updates to the state are controlled by the RS, this approach would work. I wonder what the performance difference might be, since in this case you would always have to replay edits.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking
[ https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164090#comment-17164090 ]

Anoop Sam John edited comment on HBASE-24749 at 7/24/20, 2:33 AM:
------------------------------------------------------------------------

bq. Can you expand on how we can get in a situation where a partial file is written? I'm trying to see if there are any failure modes we haven't thought of. If the case is a complete file written to the data directory, is there harm in picking up the new file (even if it hasn't successfully committed to the SFM)?

That point was based on another direction that Stack was suggesting; I'm not sure whether Stack suggested it for the META table alone or for all tables. The idea is to use the WAL event markers to know whether an HFile is committed or not. If during WAL replay we see a flush begin marker and later a flush complete marker, the file is committed. If there are no markers at all for a file, it is an old existing file. If there is only a begin marker and no end marker, the file is not committed, so on region reopen we can ignore it. The same applies to compaction. One issue there was: what if the WAL file holding the begin marker got rolled and deleted? We lose track of it. But if that can be controlled, this is also a viable direction, no? (A dedicated WAL for these event markers.) We could avoid the need to store the whole file list in META and avoid the question of how to handle META's own file list. Storing it in ZK is not a direction.

was (Author: anoop.hbase):
bq. Can you expand on how we can get in a situation where a partial file is written? I'm trying to see if there are any failure modes we haven't thought of. If the case is a complete file written to the data directory, is there harm in picking up the new file (even if it hasn't successfully committed to the SFM)?

That point was based on another direction that Stack was suggesting; I'm not sure whether Stack suggested it for the META table alone or for all tables. The idea is to use the WAL event markers to know whether an HFile is committed or not. If during WAL replay we see a flush begin marker and later a flush complete marker, the file is committed. If there are no markers at all for a file, it is an old existing file. If there is only a begin marker and no end marker, the file is not committed, so on region reopen we can ignore it. The same applies to compaction. One issue there was: what if the WAL file holding the begin marker got rolled and deleted? We lose track of it. But if that can be controlled, this is also a viable direction, no? We could avoid the need to store the whole file list in META and avoid the question of how to handle META's own file list. Storing it in ZK is not a direction.
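A small sketch of the marker-based classification described above: during WAL replay, an HFile found in the data directory is treated as committed, uncommitted, or pre-existing depending on which flush/compaction markers reference it. The enum and method are illustrative only, and, as the comment notes, a rolled-and-deleted WAL can make a committed file look merely "begun".

{code:java}
// Sketch of the marker-based classification described above. The enum and method are
// illustrative, not HBase code; the inputs are the file names referenced by flush/compaction
// "begin" and "commit" event markers found while replaying the WALs.
import java.util.Set;

public class WalMarkerClassifier {
  public enum FileState { PRE_EXISTING, COMMITTED, UNCOMMITTED }

  /**
   * @param hfileName      a file found in the store's data directory at region reopen
   * @param beganInWal     files named by a flush/compaction begin marker in the WALs
   * @param committedInWal files named by a matching commit/end marker
   */
  public static FileState classify(String hfileName, Set<String> beganInWal,
      Set<String> committedInWal) {
    if (committedInWal.contains(hfileName)) {
      return FileState.COMMITTED;   // begin + end markers present: safe to include
    }
    if (beganInWal.contains(hfileName)) {
      return FileState.UNCOMMITTED; // begin only: ignore the file on region reopen
    }
    return FileState.PRE_EXISTING;  // no markers at all: an old, already-valid file
  }
}
{code}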
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking
[ https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164048#comment-17164048 ]

Guanghao Zhang edited comment on HBASE-24749 at 7/24/20, 12:19 AM:
------------------------------------------------------------------------

bq. it should be HBASE-20724, so for compaction we can reused that to confirm if the flushed StoreFile were from a compaction.

Yes. The compaction event marker in WAL is not used anymore.

was (Author: zghaobac):
{quote}it should be HBASE-20724, so for compaction we can reused that to confirm if the flushed StoreFile were from a compaction.{quote}
Yes. The compaction event marker in WAL is not used anymore.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
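For reference, the HBASE-20724 direction mentioned above stores the compaction event in the HFile's own file-info metadata, so replay no longer depends on a WAL compaction marker surviving a log roll. The sketch below only models that check; the key name and the map shape are assumptions for illustration, not the exact HBase internals.

{code:java}
// Models the HBASE-20724 direction only: a file produced by compaction carries a
// compaction-event record in its own file-info metadata. The key name and the map shape
// here are assumptions for illustration, not the exact HBase internals.
import java.util.Map;

public class CompactionEventCheckSketch {
  /** Assumed file-info key; in HBase this metadata is written by the store file writer. */
  static final String COMPACTION_EVENT_KEY = "COMPACTION_EVENT_KEY";

  /** True if the HFile's own metadata says it was produced by a compaction. */
  public static boolean producedByCompaction(Map<String, byte[]> hfileFileInfo) {
    byte[] event = hfileFileInfo.get(COMPACTION_EVENT_KEY);
    return event != null && event.length > 0;
  }
}
{code}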
[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking
[ https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164048#comment-17164048 ]

Guanghao Zhang edited comment on HBASE-24749 at 7/24/20, 12:18 AM:
------------------------------------------------------------------------

{quote}it should be HBASE-20724, so for compaction we can reused that to confirm if the flushed StoreFile were from a compaction.{quote}
Yes. The compaction event marker in WAL is not used anymore.

was (Author: zghaobac):
{quote}it should be HBASE-20724, so for compaction we can reused that to confirm if the flushed StoreFile were from a compaction.{quote}
Yes. The compaction event marker is not used anymore.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking
[ https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164037#comment-17164037 ]

Tak-Lon (Stephen) Wu edited comment on HBASE-24749 at 7/24/20, 12:03 AM:
------------------------------------------------------------------------

bq. If an HFile is written successfully but no marker in the WAL, then it doesn't exist, right? As part of the WAL replay you will reconstitute it from edits in the WAL?

You're right; if that happens, any uncommitted HFile should not be picked up and should be replayed from the WALs. (Assuming the replay may generate the same content but a different HFile, I was thinking about a store-open optimization for that, but that is too far off for now.)

bq. you have surveyed the calls to the NN made by HBase on a regular basis?

We haven't captured how many rename calls go to the NN, or even to object stores, and we will add a measurement survey task to the related milestone. But at one point, while running compaction with rename, we captured that the rename part alone was dominating ~60% of the overall compaction wall-clock time on object stores.

bq. IIRC there is a issue for storing the compacted files in HFile's metadata, to solve the problem that the wal file contains the compaction marker may be deleted before wal splitting.

It should be HBASE-20724, so for compaction we can reuse that to confirm whether a flushed StoreFile came from a compaction.

bq. Now while replay of wal, we dont have start compaction marker for this wal file. So we think this is an old valid file but that is wrong. This is a partial file.

This is possible. Don't we only get the `end` compaction event marker once the compacted HFile(s) have been moved to the cf directory, right before updating the store file manager? But yes, if the WAL is rolled, we lose this event marker. This is a good discussion; I will put the points above down as considerations in the related milestone.

was (Author: taklwu):
bq. If an HFile is written successfully but no marker in the WAL, then it doesn't exist, right? As part of the WAL replay you will reconstitute it from edits in the WAL?

You're right; if that happens, any uncommitted HFile should not be picked up and should be replayed from the WALs. (Assuming the replay may generate the same content but a different HFile, I was thinking about a store-open optimization for that, but that is too far off for now.)

bq. you have surveyed the calls to the NN made by HBase on a regular basis?

We haven't captured how many rename calls go to the NN, or even to object stores, and we will add a measurement survey task to the related milestone. But at one point, while running compaction with rename, we captured that the rename part alone was dominating ~60% of the overall compaction wall-clock time on object stores.

bq. IIRC there is a issue for storing the compacted files in HFile's metadata, to solve the problem that the wal file contains the compaction marker may be deleted before wal splitting.

It should be HBASE-20724, so for compaction we can reuse that to confirm whether a flushed StoreFile came from a compaction.

bq. Now while replay of wal, we dont have start compaction marker for this wal file. So we think this is an old valid file but that is wrong. This is a partial file.

This is possible. Don't we only get the `end` compaction event marker once the compacted HFile(s) have been moved to the cf directory, right before updating the store file manager? But yes, if the WAL is rolled, we lose this event marker.
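A rough way to reproduce the "rename dominates ~60% of compaction time" observation above is to time just the commit rename on the target FileSystem; on S3-backed FileSystems the rename is a copy-plus-delete that scales with file size, while on HDFS it is a near-constant NameNode metadata operation. The paths below are placeholders passed on the command line; only standard Hadoop APIs are used.

{code:java}
// A rough probe for the cost being discussed: time only the commit rename of a finished
// flush/compaction output on the target FileSystem. The two paths are placeholders passed
// on the command line; the FileSystem calls are standard Hadoop APIs.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameCostProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path src = new Path(args[0]); // e.g. .../.tmp/<hfile>
    Path dst = new Path(args[1]); // e.g. .../<cf>/<hfile>
    FileSystem fs = src.getFileSystem(conf);

    long start = System.nanoTime();
    boolean ok = fs.rename(src, dst);
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;

    // On S3-backed FileSystems the "rename" is a server-side copy plus delete, so this number
    // grows with file size; on HDFS it is a near-constant NameNode metadata operation.
    System.out.println("rename ok=" + ok + " took " + elapsedMs + " ms");
  }
}
{code}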
[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking
[ https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164037#comment-17164037 ]

Tak-Lon (Stephen) Wu edited comment on HBASE-24749 at 7/24/20, 12:03 AM:
------------------------------------------------------------------------

bq. If an HFile is written successfully but no marker in the WAL, then it doesn't exist, right? As part of the WAL replay you will reconstitute it from edits in the WAL?

You're right; if that happens, any uncommitted HFile should not be picked up and should be replayed from the WALs. (Assuming the replay may generate the same content but a different HFile, I was thinking about a store-open optimization for that, but that is too far off for now.)

bq. you have surveyed the calls to the NN made by HBase on a regular basis?

We haven't captured how many rename calls go to the NN, or even to object stores, and we will add a measurement survey task to the related milestone. But at one point, while running compaction with rename, we captured that the rename part alone was dominating ~60% of the overall compaction wall-clock time on object stores.

bq. IIRC there is a issue for storing the compacted files in HFile's metadata, to solve the problem that the wal file contains the compaction marker may be deleted before wal splitting.

It should be HBASE-20724, so for compaction we can reuse that to confirm whether a flushed StoreFile came from a compaction.

bq. Now while replay of wal, we dont have start compaction marker for this wal file. So we think this is an old valid file but that is wrong. This is a partial file.

This is possible. Don't we only get the `end` compaction event marker once the compacted HFile(s) have been moved to the cf directory, right before updating the store file manager? But yes, if the WAL is rolled, we lose this event marker. This is a good discussion; I will put the points above down as considerations in the related milestone.

was (Author: taklwu):
bq. If an HFile is written successfully but no marker in the WAL, then it doesn't exist, right? As part of the WAL replay you will reconstitute it from edits in the WAL?

You're right; if that happens, any uncommitted HFile should not be picked up and should be replayed from the WALs. (Assuming the replay may generate the same content but a different HFile, I was thinking about a store-open optimization for that, but that is too far off for now.)

bq. you have surveyed the calls to the NN made by HBase on a regular basis?

We haven't captured how many rename calls go to the NN, or even to object stores, and we will add a measurement survey task to the related milestone. But at one point, while running compaction with rename, we captured that the rename part alone was dominating ~60% of the overall compaction wall-clock time on object stores.

bq. IIRC there is a issue for storing the compacted files in HFile's metadata, to solve the problem that the wal file contains the compaction marker may be deleted before wal splitting.

It should be HBASE-20724, so for compaction we can reuse that to confirm whether a flushed StoreFile came from a compaction.

bq. Now while replay of wal, we dont have start compaction marker for this wal file. So we think this is an old valid file but that is wrong. This is a partial file.

This is possible. Don't we only get the `end` compaction event marker once the compacted HFile(s) have been moved to the cf directory, right before updating the store file manager? But yes, if the WAL is rolled, we lose this event marker.
But this is a good discussion; I will put the points above down as considerations in the related milestone.
[jira] [Comment Edited] (HBASE-24749) Direct insert HFiles and Persist in-memory HFile tracking
[ https://issues.apache.org/jira/browse/HBASE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162196#comment-17162196 ]

Tak-Lon (Stephen) Wu edited comment on HBASE-24749 at 7/21/20, 5:26 PM:
------------------------------------------------------------------------

Thanks [~busbey] and [~zhangduo], I will send an email to the dev@ list.

{quote}Could this be a part of meta instead?{quote}
Definitely we can move them into the meta table, and it shouldn't add too much load or data (about 400 MB for 10k regions). But [~zhangduo] is right that the meta table itself still needs somewhere to store the tracking for the meta region(s). Other than storing the storefile tracking of the meta table region in ZooKeeper, do you have other thoughts?

One additional thought on the meta table itself: do we need the meta table to be splittable, as mentioned in HBASE-23055? Basically we were wondering how much load the meta table could handle, and we have not yet gathered that information.

{quote}But it is still a pain that we could see a HFile on the filesystem but we do not know whether it is valid... Check for trailer?{quote}
For repair and recovery, we're currently testing and reusing the refresh HFiles coprocessor ({{RefreshHFilesClient}}), which relies on {{HStore#openStoreFiles}} to validate whether the HFiles should be included (i.e. whether the HFiles can be opened and/or are the result of a compaction). It seems {{HStore#openStoreFiles}} ({{initReader}}) already does that metadata check?

{quote}And we should provide new HBCK tools to sync the file system with the storefiles family, or rebuild the storefiles family from the filesystem?{quote}
We can make that part of repairing the meta table in HBCK so users do not need to run a separate command.

was (Author: taklwu):
Thanks [~busbey] and [~zhangduo], I will send an email to the dev@ list.

bq. Could this be a part of meta instead?

Definitely we can move them into the meta table, and it shouldn't add too much load or data (about 400 MB for 10k regions). But [~zhangduo] is right that the meta table itself still needs somewhere to store the tracking for the meta region(s). Other than storing the storefile tracking of the meta table region in ZooKeeper, do you have other thoughts?

{quote}But it is still a pain that we could see a HFile on the filesystem but we do not know whether it is valid... Check for trailer?{quote}
For repair and recovery, we're currently testing and reusing the refresh HFiles coprocessor ({{RefreshHFilesClient}}), which relies on {{HStore#openStoreFiles}} to validate whether the HFiles should be included (i.e. whether the HFiles can be opened and/or are the result of a compaction). It seems {{HStore#openStoreFiles}} ({{initReader}}) already does that metadata check?

bq. And we should provide new HBCK tools to sync the file system with the storefiles family, or rebuild the storefiles family from the filesystem?

We can make that part of repairing the meta table in HBCK so users do not need to run a separate command.
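A sketch of the HBCK-style "rebuild the storefiles family from the filesystem" path mentioned above: list the store directory and treat every non-empty file as a candidate tracking entry. The class is hypothetical; a fuller version would open each file (as {{HStore#openStoreFiles}} does) and verify the trailer before trusting it.

{code:java}
// Sketch of rebuilding tracking entries from a filesystem listing: list a store
// (region/column family) directory and treat every non-empty file as a candidate HFile.
// The class is hypothetical; a fuller version would open each file and verify its trailer
// (as HStore#openStoreFiles does) before trusting it, which is skipped here.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RebuildTrackingSketch {
  /** Return the candidate HFile names found under a store directory. */
  public static List<String> listCandidateHFiles(Configuration conf, Path storeDir)
      throws IOException {
    FileSystem fs = storeDir.getFileSystem(conf);
    List<String> names = new ArrayList<>();
    for (FileStatus status : fs.listStatus(storeDir)) {
      if (status.isFile() && status.getLen() > 0) {
        names.add(status.getPath().getName());
      }
    }
    return names;
  }
}
{code}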
--
This message was sent by Atlassian Jira
(v8.3.4#803005)