[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16815461#comment-16815461
 ] 

Sean Mackrory commented on HBASE-22149:
---------------------------------------

{quote}fs.qualify(path) {quote}

Yeah I probably do need that, although it hasn't come up yet in tests.

{quote}What about multi-bucket support?{quote}

Added that yesterday, actually - in my next patch the ZK client will be 
'jailed' inside a znode named after the hostname in the URI. That's not quite 
right for WASB and ABFS, but they don't need this anyway. It is right for S3 and 
GCS. Others may come along that require it to be rethought, but that's good 
enough for now, and I'd like to avoid putting any FS-specific logic inside this 
for as long as I can.
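
For illustration, a rough sketch of what that 'jailing' could look like using 
Curator's namespace (chroot) support. This is not the actual patch; the /hboss 
prefix, method name, and quorum string are made up:

{code:java}
import java.net.URI;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Hypothetical sketch: confine every lock znode under a parent named after the
// bucket/host in the store URI, e.g. s3a://my-bucket/hbase -> /hboss/my-bucket/...
static CuratorFramework jailedClient(URI storeUri, String zkQuorum) {
  CuratorFramework zk = CuratorFrameworkFactory.builder()
      .connectString(zkQuorum)
      .retryPolicy(new ExponentialBackoffRetry(1000, 3))
      // namespace() behaves like a chroot: every path this client touches is
      // silently prefixed, so locks for different buckets can never collide.
      .namespace("hboss/" + storeUri.getHost())
      .build();
  zk.start();
  return zk;
}
{code}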

{quote}S3Mock sounds interesting{quote}

Yes, I wondered if it was a faithful enough recreation for the full battery of 
s3a tests. One side note: even though I got S3Mock working, I did have to rely 
on APIs designated as Private (specifically the S3ClientFactory stuff). So we 
need to have a discussion about whether those APIs might be stable enough to 
promote to LimitedPrivate({"HBase"}), or perhaps add another API wherein I 
simply hand the FS a ready-to-go S3 client, instead of pointing it at a 
"Factory" class that will return the client I already made (which is what I 
have to do now).
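
For reference, the wiring that the factory approach implies looks roughly like 
the following. Treat it as a sketch: the client-factory hook is a Private s3a 
API whose shape varies across Hadoop versions, and the factory class name plus 
the endpoint/port are placeholders:

{code:java}
import org.apache.hadoop.conf.Configuration;

// Hypothetical sketch: point s3a at a custom factory that hands back a client
// already configured for a local S3Mock endpoint. The factory class name and
// endpoint below are placeholders, not part of any existing patch.
static Configuration s3MockConf() {
  Configuration conf = new Configuration();
  conf.set("fs.s3a.s3.client.factory.impl", "org.example.S3MockClientFactory");
  conf.set("fs.s3a.endpoint", "http://127.0.0.1:9090");  // local S3Mock server
  conf.setBoolean("fs.s3a.path.style.access", true);     // no DNS-style buckets locally
  return conf;
}
{code}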

{quote}For lockListing(), why is a shared lock on the path being listed not 
sufficient?{quote}

Because you want it to have exclusive access to all the children (and in some 
cases all children recursively) when there may be renames going on inside that 
path. Other than this particular case, write locks don't have to block when 
there are read locks above them in the path. For a non-recursive listing, a 
read lock on all children of the path you're referencing would be sufficient, 
but how do you correctly enumerate the children without first having the lock? 
You end up back where you started. An exclusive lock on the parent for listing 
is a little more aggressive than needed, but it's simple and safe. I've tried 
to err on that side of things since we can't seem to enumerate all the FS 
assumptions of HBase. If integration / performance testing finds a particular 
point of contention, that's a targeted area we can investigate to determine 
whether relaxing the constraints is safe.
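
To make that trade-off concrete, a rough sketch of the listing case (not the 
actual patch; it assumes lock znodes mirror the filesystem path under the 
chrooted client described above):

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessReadWriteLock;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: take the parent's *write* lock before listing. Read
// locks on the children would be enough in principle, but we cannot enumerate
// the children safely without already holding a lock on something.
static FileStatus[] lockedListing(CuratorFramework zk, FileSystem fs, Path dir)
    throws Exception {
  InterProcessReadWriteLock lock =
      new InterProcessReadWriteLock(zk, dir.toUri().getPath());
  lock.writeLock().acquire();     // exclusive: blocks renames into/out of dir
  try {
    return fs.listStatus(dir);    // children cannot change underneath us
  } finally {
    lock.writeLock().release();
  }
}
{code}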

{quote}Deadlock detection and debuggability{quote}
{quote}I don't have much experience with Curator but have heard of it.{quote}

Yeah, this will definitely warrant some work, both on deadlock detection / 
debuggability and on operational concerns when problems arise. I've been 
finding Curator is not as fool-proof as I had hoped, and my next patch actually 
eliminates the use of curator-framework (in favor of the lower-level 
curator-client) for everything but the actual lock / unlock calls. The 
curator-framework APIs for creating and deleting znodes have been very hard to 
debug.
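
For illustration, this is roughly the kind of split I mean: raw ZooKeeper calls 
(via the lower-level curator-client) for plain znode create / delete, with 
curator-framework kept only for the lock recipes. Sketch only; the quorum, 
timeouts, and paths are made up:

{code:java}
import org.apache.curator.CuratorZookeeperClient;
import org.apache.curator.retry.RetryNTimes;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch: create and delete a znode with the raw ZooKeeper handle,
// which keeps the calls explicit and easy to trace when debugging.
static void createAndDelete(String zkQuorum, String znode) throws Exception {
  CuratorZookeeperClient client = new CuratorZookeeperClient(
      zkQuorum, 30000, 15000, null, new RetryNTimes(3, 1000));
  client.start();
  client.blockUntilConnectedOrTimedOut();
  ZooKeeper raw = client.getZooKeeper();
  raw.create(znode, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
  raw.delete(znode, -1);   // -1 = skip the version check
  client.close();
}
{code}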

> HBOSS: A FileSystem implementation to provide HBase's required semantics
> ------------------------------------------------------------------------
>
>                 Key: HBASE-22149
>                 URL: https://issues.apache.org/jira/browse/HBASE-22149
>             Project: HBase
>          Issue Type: New Feature
>          Components: Filesystem Integration
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>            Priority: Critical
>         Attachments: HBASE-22149-hadoop.patch, HBASE-22149-hbase-2.patch, 
> HBASE-22149-hbase-3.patch, HBASE-22149-hbase.patch
>
>
> (Have been using the name HBOSS for HBase / Object Store Semantics)
> I've had some thoughts about how to solve the problem of running HBase on 
> object stores. There has been some thought in the past about adding the 
> required semantics to S3Guard, but I have some concerns about that. First, 
> it's mixing complicated solutions to different problems (bridging the gap 
> between a flat namespace and a hierarchical namespace vs. solving 
> inconsistency). Second, it's S3-specific, whereas other object stores could 
> use virtually identical solutions. And third, we can't do things like atomic 
> renames in a true sense. There would have to be some trade-offs specific to 
> HBase's needs and it's better if we can solve that in an HBase-specific 
> module without mixing all that logic in with the rest of S3A.
> Ideas to solve this above the FileSystem layer have been proposed and 
> considered (HBASE-20431, for one), and maybe that's the right way forward 
> long-term, but it certainly seems to be a hard problem and hasn't been done 
> yet. But I don't know enough of all the internal considerations to make much 
> of a judgment on that myself.
> I propose a FileSystem implementation that wraps another FileSystem instance 
> and provides locking of FileSystem operations to ensure correct semantics. 
> Locking could quite possibly be done on the same ZooKeeper ensemble as an 
> HBase cluster already uses (I'm sure there are some performance 
> considerations here that deserve more attention). I've put together a 
> proof-of-concept on which I've tested some aspects of atomic renames and 
> atomic file creates. Both of these tests fail reliably on a naked s3a 
> instance. I've also done a small YCSB run against a small cluster to sanity 
> check other functionality and was successful. I will post the patch, and my 
> laundry list of things that still need work. The WAL is still placed on HDFS, 
> but the HBase root directory is otherwise on S3.
> Note that my prototype is built on Hadoop's source tree right now. That's 
> purely for my convenience in putting it together quickly, as that's where I 
> mostly work. I actually think long-term, if this is accepted as a good 
> solution, it makes sense to live in HBase (or its own repository). It only 
> depends on stable, public APIs in Hadoop and is targeted entirely at HBase's 
> needs, so it should be able to iterate on the HBase community's terms alone.
> Another idea [~ste...@apache.org] proposed to me is that of an inode-based 
> FileSystem that keeps hierarchical metadata in a more appropriate store that 
> would allow the required transactions (maybe a special table in HBase could 
> provide that store itself for other tables), and stores the underlying files 
> with unique identifiers on S3. This allows renames to actually become fast 
> instead of just large atomic operations. It does however place a strong 
> dependency on the metadata store. I have not explored this idea much. My 
> current proof-of-concept has been pleasantly simple, so I think it's the 
> right solution unless it proves unable to provide the required performance 
> characteristics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
