[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823273#comment-16823273 ]

Andrew Purtell commented on HBASE-22149:
----------------------------------------

{quote}This will harm r/w latency distribution tail for sure. Copying (moving) 
GBs of data in S3 with 3x replication - that is 10s of seconds.
{quote}
This file system will not be triple-replicating data like HDFS would, right? 
Whatever the S3 service might do internally to provide durability and 
availability is a black box to us, but we won't make the problem worse by 
adding replication of our own at a layer we do control, right?

The basic point is sound.

Atomic rename is expected to be fast. We rename under lock in the flush, 
compaction, split, and merge code paths; I assume so, at least, I haven't 
looked at the code. With HBOSS those locks are going to be held for a time 
proportional to the amount of data being copied behind the scenes, instead of 
the O(1) latency of a single HDFS namenode RPC.
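
To make the difference concrete, here is a minimal probe, not HBase code, that 
just times FileSystem.rename() on a path; the URIs are placeholders. On HDFS 
the call is a single namenode metadata operation; on s3a, which is what HBOSS 
would wrap, the same call is implemented as server-side copies plus deletes of 
every object under the prefix, so the elapsed time, and any lock held around 
it, grows with the amount of data.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameLatencyProbe {
  public static void main(String[] args) throws Exception {
    // Placeholder URIs, e.g. hdfs://nn/data/.tmp/f vs s3a://bucket/data/.tmp/f
    Path src = new Path(args[0]);
    Path dst = new Path(args[1]);
    FileSystem fs = src.getFileSystem(new Configuration());

    long start = System.nanoTime();
    boolean renamed = fs.rename(src, dst); // one RPC on HDFS; copy + delete per object on s3a
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("rename returned " + renamed + " after " + elapsedMs + " ms");
  }
}
{code}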

This is why I proposed a new 'transaction' framework at the HBase level 
instead of a lower-level solution like HBOSS, so we could engineer for this 
and provide substrate-specific strategies: taking advantage of atomic rename 
where it exists, using PUT-COPY if we're on S3, or... but we are not 
considering that type of solution here.
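
Roughly what I have in mind, purely as an illustration (none of these 
interfaces exist in HBase today, the names are invented): make the store-file 
"commit" a pluggable, substrate-specific strategy rather than a hard-coded 
rename.

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

/** Hypothetical substrate-specific commit strategy. */
interface StoreFileCommitter {
  /** Make the temporary file visible at its final path, atomically from HBase's point of view. */
  void commit(Path tmp, Path dst) throws IOException;
}

/** HDFS, or any filesystem with a real atomic rename: a single metadata operation. */
class RenameCommitter implements StoreFileCommitter {
  private final FileSystem fs;
  RenameCommitter(FileSystem fs) { this.fs = fs; }
  @Override public void commit(Path tmp, Path dst) throws IOException {
    if (!fs.rename(tmp, dst)) {
      throw new IOException("rename failed: " + tmp + " -> " + dst);
    }
  }
}

/** S3-oriented sketch: copy to the final key, then delete the temporary key. */
class CopyThenDeleteCommitter implements StoreFileCommitter {
  private final FileSystem fs;
  CopyThenDeleteCommitter(FileSystem fs) { this.fs = fs; }
  @Override public void commit(Path tmp, Path dst) throws IOException {
    // FileUtil.copy streams the data through the client; a real implementation
    // would use the store's server-side PUT-COPY and record the outcome durably.
    FileUtil.copy(fs, tmp, fs, dst, /* deleteSource= */ true, fs.getConf());
  }
}
{code}

The point being that recovery and visibility rules would live at the HBase 
level, where we know what a flush or compaction means, instead of being forced 
through generic FileSystem semantics.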

As Sean points out, these renames are not normally on the read and write 
critical path. However, if flushes take a very long time to complete, the 
memstore's buffering capacity will be exceeded and writes will stall. Flush on 
region close is going to be a danger zone: it may take a very long time for a 
region to close, and that will impact both reads and writes for the duration, 
so you probably want to rethink all the region management concerns like 
balancer policy, split policy, etc. I think the HBOSS design rationale says 
these things are the acceptable outcome of the tradeoff being made. Another 
approach could solve it without these drawbacks.
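
For a rough sense of the margin before the stall described above, under 
assumed numbers (the config keys are the stock HBase ones; the write rate is 
invented for illustration): a region blocks updates once its memstore reaches 
flush size times the block multiplier while a flush is still in flight, so a 
flush stuck behind tens of seconds of object copying doesn't leave much 
headroom.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class MemstoreStallBudget {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Stock HBase knobs; the defaults shown are the usual ones, check your version.
    long flushSize  = conf.getLong("hbase.hregion.memstore.flush.size", 128L << 20);
    long multiplier = conf.getLong("hbase.hregion.memstore.block.multiplier", 4);
    long blockingLimit = flushSize * multiplier; // bytes a region buffers before blocking updates

    long writeRate = 20L << 20; // assumed 20 MB/s of incoming writes, purely for illustration
    // Seconds of further writes the region can absorb after the flush is triggered.
    long headroomSeconds = (blockingLimit - flushSize) / writeRate;
    System.out.println("A flush slower than ~" + headroomSeconds
        + "s at this write rate pushes the region into blocking updates.");
  }
}
{code}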

> HBOSS: A FileSystem implementation to provide HBase's required semantics
> ------------------------------------------------------------------------
>
>                 Key: HBASE-22149
>                 URL: https://issues.apache.org/jira/browse/HBASE-22149
>             Project: HBase
>          Issue Type: New Feature
>          Components: Filesystem Integration
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>            Priority: Critical
>         Attachments: HBASE-22149-hadoop.patch, HBASE-22149-hbase-2.patch, 
> HBASE-22149-hbase-3.patch, HBASE-22149-hbase-4.patch, HBASE-22149-hbase.patch
>
>
> (Have been using the name HBOSS for HBase / Object Store Semantics)
> I've had some thoughts about how to solve the problem of running HBase on 
> object stores. There has been some thought in the past about adding the 
> required semantics to S3Guard, but I have some concerns about that. First, 
> it's mixing complicated solutions to different problems (bridging the gap 
> between a flat namespace and a hierarchical namespace vs. solving 
> inconsistency). Second, it's S3-specific, whereas other object stores could 
> use virtually identical solutions. And third, we can't do things like atomic 
> renames in a true sense. There would have to be some trade-offs specific to 
> HBase's needs and it's better if we can solve that in an HBase-specific 
> module without mixing all that logic in with the rest of S3A.
> Ideas to solve this above the FileSystem layer have been proposed and 
> considered (HBASE-20431, for one), and maybe that's the right way forward 
> long-term, but it certainly seems to be a hard problem and hasn't been done 
> yet. But I don't know enough of all the internal considerations to make much 
> of a judgment on that myself.
> I propose a FileSystem implementation that wraps another FileSystem instance 
> and provides locking of FileSystem operations to ensure correct semantics. 
> Locking could quite possibly be done on the same ZooKeeper ensemble as an 
> HBase cluster already uses (I'm sure there are some performance 
> considerations here that deserve more attention). I've put together a 
> proof-of-concept on which I've tested some aspects of atomic renames and 
> atomic file creates. Both of these tests fail reliably on a naked s3a 
> instance. I've also done a small YCSB run against a small cluster to sanity 
> check other functionality and was successful. I will post the patch, and my 
> laundry list of things that still need work. The WAL is still placed on HDFS, 
> but the HBase root directory is otherwise on S3.
> Note that my prototype is built on Hadoop's source tree right now. That's 
> purely for my convenience in putting it together quickly, as that's where I 
> mostly work. I actually think long-term, if this is accepted as a good 
> solution, it makes sense to live in HBase (or its own repository). It only 
> depends on stable, public APIs in Hadoop and is targeted entirely at HBase's 
> needs, so it should be able to iterate on the HBase community's terms alone.
> Another idea [~ste...@apache.org] proposed to me is that of an inode-based 
> FileSystem that keeps hierarchical metadata in a more appropriate store that 
> would allow the required transactions (maybe a special table in HBase could 
> provide that store itself for other tables), and stores the underlying files 
> with unique identifiers on S3. This allows renames to actually become fast 
> instead of just large atomic operations. It does however place a strong 
> dependency on the metadata store. I have not explored this idea much. My 
> current proof-of-concept has been pleasantly simple, so I think it's the 
> right solution unless it proves unable to provide the required performance 
> characteristics.
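
For anyone skimming, a minimal sketch of the wrapping-FileSystem idea from the 
description above, not the attached patch: each rename is serialized through a 
ZooKeeper lock (Curator's InterProcessMutex here) so that concurrent clients 
observe it as a single operation. Lock-path layout, the other guarded calls 
(create, delete, listing), and error handling are all omitted.

{code:java}
import java.io.IOException;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FilterFileSystem;
import org.apache.hadoop.fs.Path;

public class LockingFileSystem extends FilterFileSystem {
  private final CuratorFramework zk;

  public LockingFileSystem(FileSystem wrapped, CuratorFramework zk) {
    super(wrapped);
    this.zk = zk;
  }

  @Override
  public boolean rename(Path src, Path dst) throws IOException {
    // One coarse lock keyed on the source path; a real implementation would
    // lock a subtree covering both src and dst.
    InterProcessMutex lock = new InterProcessMutex(zk, "/hboss-locks" + src.toUri().getPath());
    try {
      lock.acquire();
      try {
        return fs.rename(src, dst); // the wrapped (e.g. s3a) filesystem still does the copy + delete
      } finally {
        lock.release();
      }
    } catch (IOException e) {
      throw e;
    } catch (Exception e) {
      throw new IOException("ZooKeeper locking failed for " + src, e);
    }
  }
}
{code}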


