[jira] [Comment Edited] (HBASE-27826) Region split and merge time while offline is O(n) with respect to number of store files

Andrew Kyle Purtell (Jira) Wed, 13 Mar 2024 09:15:05 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-27826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17826771#comment-17826771
 ]


Andrew Kyle Purtell edited comment on HBASE-27826 at 3/13/24 4:14 PM:
----------------------------------------------------------------------

{quote}We will define a splitFiles method in StoreFileTracker interface
{quote}
No. Split logic should remain in SplitTransaction. Breaking this encapsulation 
and diluting the split implementation does not seem like a good idea to me, but 
we could discuss it, if someone actually wants this.

StoreFileTracker is a directory of store files. This concept is neatly extended 
to include management of reference and link files, because references and links 
are aspects of maintaining a directory of store file contents. SFT is the 
appropriate place to make design changes (in my opinion). Once SFT is managing 
references and links, they do not need to be real files. They can be virtual 
concepts maintained in the manifest. So SFT gets new methods for adding and 
removing references and links, such as createLink(), deleteLink(), 
createReference(), deleteReference(), and so on.
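
To make the shape of those methods concrete, here is a minimal sketch. It is 
not the HBase StoreFileTracker API. The interface name LinkAwareTracker, the 
class ManifestBackedTracker, the parameter lists, and the resolve() helper are 
all hypothetical and exist only for illustration. The manifest is modeled as 
an in-memory map.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical interface: NOT the real StoreFileTracker. Method names follow
// the comment above; every signature here is an assumption.
interface LinkAwareTracker {
  void createLink(String linkName, String targetFile);
  void deleteLink(String linkName);
  void createReference(String refName, String parentFile, String splitKey, boolean topHalf);
  void deleteReference(String refName);
  // Illustration-only helper: look up what a manifest entry points at.
  Optional<String> resolve(String name);
}

// Links and references live only as manifest entries; no file is created.
final class ManifestBackedTracker implements LinkAwareTracker {
  // entry name -> encoded description of its target (stand-in for the manifest)
  private final Map<String, String> manifest = new TreeMap<>();

  @Override
  public void createLink(String linkName, String targetFile) {
    manifest.put(linkName, "link:" + targetFile);
  }

  @Override
  public void deleteLink(String linkName) {
    manifest.remove(linkName);
  }

  @Override
  public void createReference(String refName, String parentFile, String splitKey, boolean topHalf) {
    String half = topHalf ? "top" : "bottom";
    manifest.put(refName, "ref:" + parentFile + "@" + splitKey + ":" + half);
  }

  @Override
  public void deleteReference(String refName) {
    manifest.remove(refName);
  }

  @Override
  public Optional<String> resolve(String name) {
    return Optional.ofNullable(manifest.get(name));
  }
}
```

In this sketch a caller creates and deletes links without touching a file 
system at all. How the manifest is persisted is left out on purpose.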

Once references and links are concepts managed by SFT, the different SFT 
implementations can optimize for their design cases. With the 
FileBasedStoreFileTracker we would no longer wait up to a second or two to 
create each link or reference in the S3 bucket, which is what causes offline 
times during splits proportional to the number of store files in the region. 
Instead, imagine links and references are entries in the manifest, not real 
files. We don't pay the cost of creating files in the S3 bucket. We only update 
the manifest, and that can be optimized further. We can gather all of the links 
and references we want to create into a list and submit them to SFT all at 
once, using an interface method that accepts an array or list of SFT mutations 
to perform in batch. Then only one manifest update is required, and this aspect 
of splitting becomes O(1) in time.
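
The batching idea can be sketched as follows. This is an illustration under 
stated assumptions, not a proposed implementation. TrackerMutation, 
BatchingManifestTracker, applyBatch(), and the write counter are hypothetical 
names. The "manifest write" is a counter standing in for a single update of 
the manifest object in S3.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mutation record: none of these types exist in HBase.
final class TrackerMutation {
  enum Kind { CREATE_LINK, CREATE_REFERENCE }

  final Kind kind;
  final String name;
  final String target;

  TrackerMutation(Kind kind, String name, String target) {
    this.kind = kind;
    this.name = name;
    this.target = target;
  }
}

final class BatchingManifestTracker {
  private final List<String> entries = new ArrayList<>();
  private int manifestWrites = 0;

  // Applies every mutation in memory, then persists the manifest exactly once,
  // regardless of how many mutations the batch contains.
  void applyBatch(List<TrackerMutation> mutations) {
    for (TrackerMutation m : mutations) {
      entries.add(m.kind + " " + m.name + " -> " + m.target);
    }
    writeManifest();
  }

  // Stand-in for one manifest update in the object store (assumption).
  private void writeManifest() {
    manifestWrites++;
  }

  int manifestWrites() {
    return manifestWrites;
  }

  int entryCount() {
    return entries.size();
  }
}
```

With 20 store files the split code would build a list of 20 mutations and call 
applyBatch() once, so the number of slow object store writes stays at one 
rather than growing with the number of store files. The manifest payload still 
grows with the number of entries, but that is a small in-memory cost compared 
with creating one object per link.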

Regarding the DefaultStoreFileTracker, it maintains existing functionality. 
DefaultStoreFileTracker needs new methods for creating and managing links too, 
but they will be real link and reference files that maintain their current 
naming and structure, so this will be fully compatible with existing stores. 
This amounts to refactoring some of the code in HFileLink and ReferenceFile 
into DefaultStoreFileTracker. This is our current thinking.

A design doc will help clarify the proposals and discussion.


was (Author: apurtell):
{quote}We will define a splitFiles method in StoreFileTracker interface
{quote}
No. Split logic should remain in SplitTransaction. Breaking this encapsulation 
and diluting the split implementation does not seem like a good idea to me, but 
we could discuss it, if someone actually wants this.

StoreFileTracker is a directory of store files. This concept is neatly extended 
to include management of reference and link files. References and links are 
aspects of maintaining a directory of store file contents. SFT is the 
appropriate place to make design changes (in my opinion). And once SFT is 
managing references and links, they do not need to be real files, they can be 
virtual concepts maintained in the manifest. So SFT gets new additional methods 
for adding and removing references and links. Like createLink(), deleteLink(), 
createReference(), deleteReference(), and so on.

Once references and links are concepts managed by SFT, we can have the 
different SFT implementations optimize for their design cases. When using the 
FileBasedStoreFileTracker we would not wait for up to a second or two when 
creating each link or reference in the S3 bucket, causing long offline times 
during splits proportional to the number of store files in the region. Instead 
imagine we gather all of the links and references we want to create into a 
list, and we submit them to SFT all at once, using an interface method that 
accepts an array or list of SFT mutations to perform in batch, so there is only 
one manifest update required, and then this aspect of splitting becomes O(1) in 
time.

Regarding the DefaultStoreFileTracker, it maintains existing functionality. 
DefaultStoreFileTracker needs new methods for creating and managing links too, 
but they will be real link and reference files, they will maintain their 
current naming and structure, this will be fully compatible with existing 
stores. This amounts to refactoring some of the code in HFileLink and 
ReferenceFile into DefaultStoreFileTracker. This is our current thinking.

A design doc will help clarify the proposals and discussion.

> Region split and merge time while offline is O(n) with respect to number of 
> store files
> ---------------------------------------------------------------------------------------
>
>                 Key: HBASE-27826
>                 URL: https://issues.apache.org/jira/browse/HBASE-27826
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.5.4
>            Reporter: Andrew Kyle Purtell
>            Priority: Major
>
> This is a significant availability issue when HFiles are on S3.
> HBASE-26079 ({_}Use StoreFileTracker when splitting and merging{_}) changed 
> the split and merge table procedure implementations to indirect through the 
> StoreFileTracker implementation when selecting HFiles to be merged or split, 
> rather than directly listing those using file system APIs. It also changed 
> the commit logic in HRegionFileSystem to add the link/ref files on resulting 
> split or merged regions to the StoreFileTracker. However, the creation of a 
> link file is still a filesystem operation and creating a “file” on S3 can 
> take well over a second. If, for example, there are 20 store files in a 
> region, which is not uncommon, after the region is taken offline for a split 
> (or merge) it may require more than 20 seconds to create the link files 
> before the results can be brought back online, creating a severe availability 
> problem. Splits and merges are supposed to be fast, completing in less than a 
> second, certainly less than a few seconds. This has been true when HFiles are 
> stored on HDFS only because file creation operations there are nearly 
> instantaneous. 
> There are two issues but both can be handled with modifications to the store 
> file tracker interface and the file based store file tracker implementation. 
> When the file based store file tracker is enabled the HFile links should 
> be virtual entities that only exist in the file manifest. We do not require 
> physical files in the filesystem to serve as links now. That is the magic of 
> this file tracker: the manifest file replaces requirements to list the 
> filesystem.
> Then, when splitting or merging, the HFile links should be collected into a 
> list and committed in one batch using a new FILE file tracker interface, 
> requiring only one update of the manifest file in S3, bringing the time 
> requirement for this operation to O(1) down from O(n).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
