sharmaar12 commented on PR #7149:
URL: https://github.com/apache/hbase/pull/7149#issuecomment-3270669250
> Let's try to investigate the cases with different store file tracker implementations.
>
> **1. DefaultStoreFileTracker**
>
> ```java
> /**
>  * The default implementation for store file tracker, where we do not persist the store file list,
>  * and use listing when loading store files.
>  */
> @InterfaceAudience.Private
> class DefaultStoreFileTracker extends StoreFileTrackerBase {
> ```
>
> So, in this case the refresh command should always get a list of all HFiles in the CF directory and should be able to detect new HFiles automatically. Is that correct?

Yes. In this case we will be able to detect and load the newly added files.
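To make the listing behavior concrete, here is a minimal sketch of listing-based discovery in the spirit of `DefaultStoreFileTracker`. The class and method names (`ListingTrackerSketch`, `listStoreFiles`, the `.tmp` filter) are illustrative, not the real HBase API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class ListingTrackerSketch {
  // Every non-temporary regular file in the column family directory is
  // treated as a live store file. A refresh simply re-lists the directory,
  // so files copied in by any means are picked up automatically.
  static List<String> listStoreFiles(Path cfDir) throws IOException {
    List<String> files = new ArrayList<>();
    try (Stream<Path> s = Files.list(cfDir)) {
      s.filter(Files::isRegularFile)
       .map(p -> p.getFileName().toString())
       .filter(name -> !name.endsWith(".tmp")) // skip in-flight writes
       .sorted()
       .forEach(files::add);
    }
    return files;
  }

  public static void main(String[] args) throws IOException {
    Path cf = Files.createTempDirectory("cf");
    Files.createFile(cf.resolve("hfile-1"));
    System.out.println(listStoreFiles(cf)); // [hfile-1]
    // Simulate an out-of-band copy: the next refresh still sees it.
    Files.createFile(cf.resolve("hfile-2"));
    System.out.println(listStoreFiles(cf)); // [hfile-1, hfile-2]
  }
}
```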
>
> **2. File based tracker**
>
> ```java
> /**
>  * A file based store file tracker.
>  * <p/>
>  * For this tracking way, the store file list will be persistent into a file, so we can write the
>  * new store files directly to the final data directory, as we will not load the broken files. This
>  * will greatly reduce the time for flush and compaction on some object storages as a rename is
>  * actual a copy on them. And it also avoid listing when loading store file list, which could also
>  * speed up the loading of store files as listing is also not a fast operation on most object
>  * storages.
>  */
> @InterfaceAudience.Private
> class FileBasedStoreFileTracker extends StoreFileTrackerBase {
> ```
>
> I think this is the case that you're talking about. In this case the SFT may or may not be able to detect new HFiles, depending on whether the SFT file has been updated or not.

In this case the copy has to be done via HBase so that the tracking file (IIRC `.filelist` is the file we use for tracking) gets updated properly.
> So, basically if I just copy a new file to the CF directory, it won't be detected, because of the reasons you mentioned.

Correct. With the FileBasedTracker we will not be able to detect this, because simply copying the file (say, in S3) will not update our tracking file (`.filelist`). The testing issue we faced was related to opening a file for read, not to detecting/loading the file.
(https://github.com/apache/hbase/pull/7149#issuecomment-3269427414)
> But if the new HFile was properly added by another cluster which is using the same SFT implementation, the file must have been updated properly, so our cluster will pick it up.

Yes, that's correct. With DefaultStoreFileTracker (e.g. on HDFS), we rely on listing the directory directly, so there is no issue whether or not a manual copy happens.
With the FileBasedTracker (e.g. on S3), we will be able to detect/load newly added HFiles if they are added by the active cluster, as that updates `.filelist`; and since the tracking file is shared between the active cluster and the read replica, the read replica will be able to load them as well. The caveat is that if someone manually copies a file to S3 without the active cluster being aware of it, then neither the active nor the read-only cluster will be able to load it.
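The manifest behavior described above can be sketched as a toy model, where loading consults only a manifest file (standing in for `.filelist`) and never a directory listing. All names here are made up for illustration and are not the real HBase implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class FileListTrackerSketch {
  private final Path manifest; // shared between active and read-replica

  FileListTrackerSketch(Path manifest) {
    this.manifest = manifest;
  }

  // Active cluster path: a flush/compaction records the new file here.
  void commitStoreFile(String name) throws IOException {
    Files.writeString(manifest, name + "\n",
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
  }

  // Both clusters load from the manifest only; a file copied straight into
  // the CF directory (e.g. via an S3 cp) never appears in this list.
  List<String> loadStoreFiles() throws IOException {
    if (!Files.exists(manifest)) {
      return List.of();
    }
    return Files.readAllLines(manifest);
  }

  public static void main(String[] args) throws IOException {
    Path manifest = Files.createTempDirectory("store").resolve("filelist");
    FileListTrackerSketch active = new FileListTrackerSketch(manifest);
    active.commitStoreFile("hfile-1"); // flushed by the active cluster
    // The read replica shares the same manifest and sees the commit:
    System.out.println(new FileListTrackerSketch(manifest).loadStoreFiles()); // [hfile-1]
  }
}
```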
> If all the above are true, do we need to add any additional logic to the command?

I don't think we need additional logic: if the active cluster creates or modifies HFiles, they will be picked up by both the active and the read-replica clusters.
Only in the case where a user deliberately changes the internal structure (copying files directly to S3 without using bulkload) is inconsistent behavior expected.
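If that drift ever needed to be surfaced, one could diff the directory listing against the tracked list and report files that exist on storage but are unknown to the tracker. This is purely a hypothetical sketch, not an existing HBase command:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

public class UntrackedFileCheck {
  // Returns files present in the CF directory that the tracker's list does
  // not know about, i.e. files copied in without the tracker's knowledge.
  static Set<String> findUntracked(Path cfDir, List<String> tracked) throws IOException {
    Set<String> onDisk = new TreeSet<>();
    try (Stream<Path> s = Files.list(cfDir)) {
      s.filter(Files::isRegularFile)
       .map(p -> p.getFileName().toString())
       .forEach(onDisk::add);
    }
    onDisk.removeAll(new HashSet<>(tracked));
    return onDisk;
  }

  public static void main(String[] args) throws IOException {
    Path cf = Files.createTempDirectory("cf");
    Files.createFile(cf.resolve("hfile-1"));
    Files.createFile(cf.resolve("hfile-2")); // copied in out-of-band
    System.out.println(findUntracked(cf, List.of("hfile-1"))); // [hfile-2]
  }
}
```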
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]