[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813439#comment-16813439 ]
Sean Mackrory commented on HBASE-22149:
---------------------------------------

Posting another patch in which I've added FS contract tests to get much greater test coverage, and fixed a bunch of issues. I've also added TODOs in most of the places where I know a tweak is still required. I'm sure this is not the conventional patch name format for HBase, but that's because I'm not actually proposing this for inclusion just yet - it's very much still a work-in-progress.

Current test failures:
{code}
ZK (RootContract test times out):
[ERROR] Failures:
[ERROR] TestHBOSSContractGetFileStatus>AbstractContractGetFileStatusTest.testListLocatedStatusEmptyDirectory:129->Assert.assertEquals:645->Assert.failNotEquals:834->Assert.fail:88 listLocatedStatus(test dir): file count in 1 directory and 1 file expected:<0> but was:<1>
[ERROR] TestHBOSSContractGetFileStatus>AbstractContractGetFileStatusTest.testListStatusFiltering:463->AbstractContractGetFileStatusTest.verifyListStatus:534->Assert.assertEquals:645->Assert.failNotEquals:834->Assert.fail:88 length of listStatus(s3a://mackrory/user/sean/hboss-junit-test/contract-tests, org.apache.hadoop.fs.contract.AbstractContractGetFileStatusTest$AllPathsFilter@452f7a60 ) expected:<2> but was:<3>
[ERROR] Errors:
[ERROR] TestHBOSSContractMkdir>AbstractContractMkdirTest.testMkdirOverParentFile:108->AbstractFSContractTestBase.assertDeleted:349 » IO
[ERROR] TestHBOSSContractRename>AbstractContractRenameTest.testRenameDirIntoExistingDir:155->AbstractFSContractTestBase.assertIsDirectory:327 » FileNotFound
[INFO]
[ERROR] Tests run: 91, Failures: 2, Errors: 2, Skipped: 13

Local:
[ERROR] Failures:
[ERROR] TestAtomicRename.testAtomicRename:77 Rename source is still visible after rename finished or target showed up.
[ERROR] Errors:
[ERROR] TestHBOSSContractRename>AbstractContractRenameTest.testRenameDirIntoExistingDir:155->AbstractFSContractTestBase.assertIsDirectory:327 » FileNotFound
[INFO]
[ERROR] Tests run: 91, Failures: 1, Errors: 2, Skipped: 13
{code}

{quote}Seems TreeLockManager.lockListings(Path[] paths) can get deadlocked when passing list of hierarchical paths{quote}

Good catch, and in fact at least one of the new contract tests fails for that reason. One solution is to reconcile the list before locking to eliminate any path for which an ancestor is included elsewhere in the list. Another solution is to make the tree as a whole reentrant instead of just individual locks: if writeLockAbove() finds a lock already held by the current thread, we shouldn't block. The local implementation contains the latter solution right now. I still need to either do the same in the ZK implementation or filter the list of locked directories.

{quote}would it be enough to have mock classes using local FS emulating S3 behaviour here{quote}

Falling back to testing against the local FS when no S3 credentials are configured is better than simply skipping the tests entirely. It would test that we don't deadlock and that there are no *major* functional issues. But it wouldn't reproduce any of the problems we're trying to fix. Using local FS to emulate S3 would be its own undertaking, and honestly if none of the S3 mocks currently out there are faithful enough, I don't even want to try doing it myself.

I did have a breakthrough getting Adobe's S3Mock to work earlier this week. I put it on hold when I saw the atomic rename test was failing, but that test is misbehaving even against Amazon S3, so I need to dig S3Mock back out (and of course get to the bottom of why that test started failing when I moved to the HBase code base).

Also note that the embedded ZooKeeper seems to be having problems with HBase's newer version of Curator, so for now I'm configuring my tests to point to a local ZooKeeper instance in auth-keys.xml.
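For illustration, the first option mentioned above (reconciling the list before locking so that no path is locked alongside one of its own ancestors) might look roughly like the sketch below. This is not code from the patch: the class and method names are hypothetical, and java.nio.file.Path stands in for org.apache.hadoop.fs.Path so the example compiles on its own.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of the HBOSS patch.
public class LockListReconciler {

    /**
     * Returns the given paths minus duplicates and minus any path that has
     * an ancestor elsewhere in the list. Assumption: locking an ancestor is
     * enough to cover everything beneath it.
     */
    static List<Path> pruneDescendants(List<Path> paths) {
        List<Path> sorted = new ArrayList<>();
        for (Path p : paths) {
            sorted.add(p.normalize());
        }
        // Fewer name elements first, so ancestors are seen before descendants.
        sorted.sort(Comparator.comparingInt(Path::getNameCount));

        List<Path> kept = new ArrayList<>();
        for (Path candidate : sorted) {
            boolean covered = false;
            for (Path ancestor : kept) {
                // startsWith compares whole name elements, so /a/bc is not
                // treated as a descendant of /a/b.
                if (candidate.startsWith(ancestor)) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Path> toLock = Arrays.asList(
                Paths.get("/a/b/c"), Paths.get("/a/b"),
                Paths.get("/x"), Paths.get("/a/bc"));
        // /a/b/c is dropped because /a/b is already in the list.
        System.out.println(pruneDescendants(toLock));
    }
}
```

The quadratic scan is only meant to keep the sketch short; it says nothing about how the real lock manager should order or acquire its locks.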
> HBOSS: A FileSystem implementation to provide HBase's required semantics
> ------------------------------------------------------------------------
>
> Key: HBASE-22149
> URL: https://issues.apache.org/jira/browse/HBASE-22149
> Project: HBase
> Issue Type: New Feature
> Components: Filesystem Integration
> Reporter: Sean Mackrory
> Assignee: Sean Mackrory
> Priority: Critical
> Attachments: HBASE-22149-hadoop.patch, HBASE-22149-hbase-2.patch,
> HBASE-22149-hbase.patch
>
>
> (Have been using the name HBOSS for HBase / Object Store Semantics)
> I've had some thoughts about how to solve the problem of running HBase on
> object stores. There has been some thought in the past about adding the
> required semantics to S3Guard, but I have some concerns about that. First,
> it's mixing complicated solutions to different problems (bridging the gap
> between a flat namespace and a hierarchical namespace vs. solving
> inconsistency). Second, it's S3-specific, whereas other object stores could
> use virtually identical solutions. And third, we can't do things like atomic
> renames in a true sense. There would have to be some trade-offs specific to
> HBase's needs and it's better if we can solve that in an HBase-specific
> module without mixing all that logic in with the rest of S3A.
> Ideas to solve this above the FileSystem layer have been proposed and
> considered (HBASE-20431, for one), and maybe that's the right way forward
> long-term, but it certainly seems to be a hard problem and hasn't been done
> yet. But I don't know enough of all the internal considerations to make much
> of a judgment on that myself.
> I propose a FileSystem implementation that wraps another FileSystem instance
> and provides locking of FileSystem operations to ensure correct semantics.
> Locking could quite possibly be done on the same ZooKeeper ensemble as an
> HBase cluster already uses (I'm sure there are some performance
> considerations here that deserve more attention). I've put together a
> proof-of-concept on which I've tested some aspects of atomic renames and
> atomic file creates. Both of these tests fail reliably on a naked s3a
> instance. I've also done a small YCSB run against a small cluster to sanity
> check other functionality and was successful. I will post the patch, and my
> laundry list of things that still need work. The WAL is still placed on HDFS,
> but the HBase root directory is otherwise on S3.
> Note that my prototype is built on Hadoop's source tree right now. That's
> purely for my convenience in putting it together quickly, as that's where I
> mostly work. I actually think long-term, if this is accepted as a good
> solution, it makes sense to live in HBase (or its own repository). It only
> depends on stable, public APIs in Hadoop and is targeted entirely at HBase's
> needs, so it should be able to iterate on the HBase community's terms alone.
> Another idea [~ste...@apache.org] proposed to me is that of an inode-based
> FileSystem that keeps hierarchical metadata in a more appropriate store that
> would allow the required transactions (maybe a special table in HBase could
> provide that store itself for other tables), and stores the underlying files
> with unique identifiers on S3. This allows renames to actually become fast
> instead of just large atomic operations. It does however place a strong
> dependency on the metadata store. I have not explored this idea much. My
> current proof-of-concept has been pleasantly simple, so I think it's the
> right solution unless it proves unable to provide the required performance
> characteristics.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)