[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830316#comment-16830316 ]

Sean Busbey commented on HBASE-22149:
--------------------------------------

also given the size of the patch, please make a PR. :)

> HBOSS: A FileSystem implementation to provide HBase's required semantics
>
>                 Key: HBASE-22149
>                 URL: https://issues.apache.org/jira/browse/HBASE-22149
>             Project: HBase
>          Issue Type: New Feature
>          Components: Filesystem Integration
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>            Priority: Critical
>         Attachments: HBASE-22149-hadoop.patch, HBASE-22149-hbase-2.patch,
>                      HBASE-22149-hbase-3.patch, HBASE-22149-hbase-4.patch,
>                      HBASE-22149-hbase-5.patch, HBASE-22149-hbase-filesystem-1.patch,
>                      HBASE-22149-hbase.patch
>
> (Have been using the name HBOSS for HBase / Object Store Semantics)
>
> I've had some thoughts about how to solve the problem of running HBase on object stores. There has been some thought in the past about adding the required semantics to S3Guard, but I have some concerns about that. First, it's mixing complicated solutions to different problems (bridging the gap between a flat namespace and a hierarchical namespace vs. solving inconsistency). Second, it's S3-specific, whereas other object stores could use virtually identical solutions. And third, we can't do things like atomic renames in a true sense. There would have to be some trade-offs specific to HBase's needs, and it's better if we can solve that in an HBase-specific module without mixing all that logic in with the rest of S3A.
>
> Ideas to solve this above the FileSystem layer have been proposed and considered (HBASE-20431, for one), and maybe that's the right way forward long-term, but it certainly seems to be a hard problem and hasn't been done yet. But I don't know enough of all the internal considerations to make much of a judgment on that myself.
>
> I propose a FileSystem implementation that wraps another FileSystem instance and provides locking of FileSystem operations to ensure correct semantics. Locking could quite possibly be done on the same ZooKeeper ensemble an HBase cluster already uses (I'm sure there are some performance considerations here that deserve more attention). I've put together a proof-of-concept on which I've tested some aspects of atomic renames and atomic file creates. Both of these tests fail reliably on a naked s3a instance. I've also done a small YCSB run against a small cluster to sanity-check other functionality and was successful. I will post the patch, and my laundry list of things that still need work. The WAL is still placed on HDFS, but the HBase root directory is otherwise on S3.
>
> Note that my prototype is built on Hadoop's source tree right now. That's purely for my convenience in putting it together quickly, as that's where I mostly work. I actually think long-term, if this is accepted as a good solution, it makes sense to live in HBase (or its own repository). It only depends on stable, public APIs in Hadoop and is targeted entirely at HBase's needs, so it should be able to iterate on the HBase community's terms alone.
>
> Another idea [~ste...@apache.org] proposed to me is that of an inode-based FileSystem that keeps hierarchical metadata in a more appropriate store that would allow the required transactions (maybe a special table in HBase could provide that store itself for other tables), and stores the underlying files with unique identifiers on S3. This allows renames to actually become fast instead of just large atomic operations. It does, however, place a strong dependency on the metadata store. I have not explored this idea much. My current proof-of-concept has been pleasantly simple, so I think it's the right solution unless it proves unable to provide the required performance characteristics.
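To make the wrap-and-lock proposal concrete, here is a minimal sketch of the idea, assuming a ZooKeeper-backed lock service. The class and the {{LockManager}} interface below are illustrative stand-ins, not the actual HBOSS code; the real wrapper would extend org.apache.hadoop.fs.FileSystem and delegate every operation.

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LockingRenameSketch {

  /** Hypothetical lock service; in the proposal above it would be backed by ZooKeeper. */
  public interface LockManager {
    AutoCloseable lockRename(Path src, Path dst) throws IOException;
  }

  private final FileSystem inner;   // the wrapped store, e.g. an s3a instance
  private final LockManager locks;  // shared by all HBOSS clients of the cluster

  public LockingRenameSketch(FileSystem inner, LockManager locks) {
    this.inner = inner;
    this.locks = locks;
  }

  /**
   * Rename appears atomic to every client that goes through the same lock
   * namespace, even though the underlying store implements it as copy+delete.
   */
  public boolean rename(Path src, Path dst) throws IOException {
    try (AutoCloseable ignored = locks.lockRename(src, dst)) {
      return inner.rename(src, dst);
    } catch (IOException e) {
      throw e;
    } catch (Exception e) {
      throw new IOException("failed to release rename lock", e);
    }
  }
}
{code}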
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830314#comment-16830314 ]

Sean Busbey commented on HBASE-22149:
--------------------------------------

FYI, I pushed an empty commit to the {{hbase-filesystem}} repo so you should be able to make a PR off of that.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830313#comment-16830313 ]

Sean Busbey commented on HBASE-22149:
--------------------------------------

bq. Not sure how we want to version this. It's 1.0.0-SNAPSHOT right now, but could arguably be lower, or synced up with the HBase version it's built against. Any thoughts?

1.0.0-alpha1-SNAPSHOT please. this repo is probably going to be in alpha status for some time, and I don't want to make assumptions about what version(s) of hbase it can be used with until we have to.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830291#comment-16830291 ]

Sean Mackrory commented on HBASE-22149:
----------------------------------------

(Overwrote the last attachment - cleaned up some local files that shouldn't have been included)
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830241#comment-16830241 ]

Sean Busbey commented on HBASE-22149:
--------------------------------------

bq. I'm only extending that as a trivial way to add more coverage, so we could just skip it when a "hadoop2" profile is activated or something. If I could just figure out how you do that

we can take care of this in a follow on ticket.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829760#comment-16829760 ]

Sean Mackrory commented on HBASE-22149:
----------------------------------------

{quote}I'll look into what's required to get this compiling on newer 2.x.{quote}

So it actually compiles fine against 2.9.2, and the only issues I see are in tests themselves. I had to tweak EmbeddedS3 a little bit as 2.9.2 uses an older listing API. The remaining issue is that AbstractContractDistCpTest in 2.9.2 appears to assume it's running on HDFS. I'm only extending that as a trivial way to add more coverage, so we could just skip it when a "hadoop2" profile is activated or something. If I could just figure out how you do that :)

Attached a patch from a separate repository with all the tests passing and all of the above fixes.
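One way to get the skip-on-Hadoop-2 behaviour, as a stop-gap for the "hadoop2" profile question above, is a JUnit assumption keyed on the runtime Hadoop version rather than a Maven profile. The guard below is a sketch only; the class name and how it would be wired into the contract test are assumptions, not HBOSS code.

{code}
import org.apache.hadoop.util.VersionInfo;
import org.junit.Assume;
import org.junit.BeforeClass;

public class HadoopVersionGuard {

  /**
   * A failed JUnit assumption in @BeforeClass marks the whole test class as
   * skipped, so DistCp contract coverage would only run when the test
   * classpath is a Hadoop 3.x build.
   */
  @BeforeClass
  public static void requireHadoop3() {
    Assume.assumeTrue("DistCp contract tests assume HDFS-like behaviour on Hadoop 2",
        VersionInfo.getVersion().startsWith("3."));
  }
}
{code}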
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829659#comment-16829659 ]

Andrew Purtell commented on HBASE-22149:
-----------------------------------------

bq. I'll look into what's required to get this compiling on newer 2.x.

Yes, please, and thank you.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829438#comment-16829438 ]

Josh Elser commented on HBASE-22149:
-------------------------------------

We now have a new repository: hbase-filesystem.git – [https://github.com/apache/hbase-filesystem]

Precommit is not going to be hooked up yet, but I think I can copy what Duo and team recently did for another repository.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829322#comment-16829322 ]

Sean Mackrory commented on HBASE-22149:
----------------------------------------

{quote}Hadoop advertises S3Guard in 2.9.2. Is this not production capable there?{quote}

I had forgotten that, actually. I'll look into what's required to get this compiling on newer 2.x.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829318#comment-16829318 ]

Sean Mackrory commented on HBASE-22149:
----------------------------------------

Integration testing of my latest patch yielded an issue where the wrapped s3a instance was getting returned by the FS cache. I switched to using a new instance for the config changes in FS.initialize(), but it still happened in some cases (after FS.get was used like 6-7 times, which is surprising since the cache instance should be part of the FS cache key). Really the internal s3a shouldn't be used externally at all, so I just disabled FS caching in that Configuration instance as well, which seems to have fixed the issue entirely. The new function is below. I'll incorporate it in my next patch, which will probably include all the changes necessary to be part of a separate repo...

Re-mapping fs.s3a.impl to HBOSS and setting fs.hboss.fs.s3a.impl is how I've been testing so far because it requires no changes to HBase at all. If we want to disallow just calling {{fs = new HBaseObjectStoreSemantics(); fs.initialize("s3a://...", conf);}} (which we probably do if this will remain a distinct repository), then I can simplify this function a bit further.

{code}
  public void initialize(URI name, Configuration conf) throws IOException {
    setConf(conf);
    String scheme = name.getScheme();
    String schemeImpl = "fs." + scheme + ".impl";
    String hbossSchemeImpl = "fs.hboss." + schemeImpl;
    String wrappedImpl = conf.get(hbossSchemeImpl);
    Configuration internalConf = new Configuration(conf);

    if (wrappedImpl != null) {
      LOG.info("HBOSS wrapping file-system {} using implementation {}", name, wrappedImpl);
      String disableCache = "fs." + scheme + ".impl.disable.cache";
      internalConf.set(schemeImpl, wrappedImpl);
      internalConf.set(disableCache, "true");
    }

    fs = FileSystem.get(name, internalConf);
    sync = TreeLockManager.get(fs);
  }
{code}
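A hedged sketch of the re-mapping described above: the property names follow the initialize() method in the comment, while the HBOSS package name and the bucket URI are assumptions for illustration.

{code}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HbossWiringExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Clients resolve s3a:// URIs to the HBOSS wrapper (package name assumed here)...
    conf.set("fs.s3a.impl", "org.apache.hadoop.hbase.oss.HBaseObjectStoreSemantics");
    // ...and HBOSS is told which implementation to actually wrap underneath.
    conf.set("fs.hboss.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
    FileSystem fs = FileSystem.get(new URI("s3a://example-bucket/hbase"), conf);
    System.out.println(fs.getClass().getName()); // expect the HBOSS wrapper class
  }
}
{code}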
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827334#comment-16827334 ]

Andrew Purtell commented on HBASE-22149:
-----------------------------------------

I don't object to a separate repo, of course.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827333#comment-16827333 ]

Andrew Purtell commented on HBASE-22149:
-----------------------------------------

{quote}The whole thing really depends on Hadoop 3+ (in production, S3Guard is required and isn't in the Hadoop 2 releases,{quote}

Hadoop advertises S3Guard in 2.9.2. Is this not production-capable there? Otherwise, I would really prefer testing this on 2.9.2 rather than any 3.x. That major upgrade is off-putting (never mind that Hadoop's motto in previous years should have been "every minor is a major" (smile)).
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827008#comment-16827008 ]

Sean Busbey commented on HBASE-22149:
--------------------------------------

sent a heads up to dev@hbase w/subject "[DISCUSS] lazy consensus on "hbase-filesystem" repository"

[https://s.apache.org/bNWq]
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827003#comment-16827003 ] Sean Busbey commented on HBASE-22149: - Given the combination of this a) needing hadoop 3 only and b) being an experimental approach that we're not sure on sustainability in production I'd much prefer a different repository. Is anyone opposed to landing this in a new repository, i.e. `hbase-filesystem`? Provided it includes instructions for installation / set up we wouldn't even need to add the artifacts from that repository as a dependency for the main repo's binary artifacts. > HBOSS: A FileSystem implementation to provide HBase's required semantics > > > Key: HBASE-22149 > URL: https://issues.apache.org/jira/browse/HBASE-22149 > Project: HBase > Issue Type: New Feature > Components: Filesystem Integration >Reporter: Sean Mackrory >Assignee: Sean Mackrory >Priority: Critical > Attachments: HBASE-22149-hadoop.patch, HBASE-22149-hbase-2.patch, > HBASE-22149-hbase-3.patch, HBASE-22149-hbase-4.patch, > HBASE-22149-hbase-5.patch, HBASE-22149-hbase.patch > > > (Have been using the name HBOSS for HBase / Object Store Semantics) > I've had some thoughts about how to solve the problem of running HBase on > object stores. There has been some thought in the past about adding the > required semantics to S3Guard, but I have some concerns about that. First, > it's mixing complicated solutions to different problems (bridging the gap > between a flat namespace and a hierarchical namespace vs. solving > inconsistency). Second, it's S3-specific, whereas other objects stores could > use virtually identical solutions. And third, we can't do things like atomic > renames in a true sense. There would have to be some trade-offs specific to > HBase's needs and it's better if we can solve that in an HBase-specific > module without mixing all that logic in with the rest of S3A. > Ideas to solve this above the FileSystem layer have been proposed and > considered (HBASE-20431, for one), and maybe that's the right way forward > long-term, but it certainly seems to be a hard problem and hasn't been done > yet. But I don't know enough of all the internal considerations to make much > of a judgment on that myself. > I propose a FileSystem implementation that wraps another FileSystem instance > and provides locking of FileSystem operations to ensure correct semantics. > Locking could quite possibly be done on the same ZooKeeper ensemble as an > HBase cluster already uses (I'm sure there are some performance > considerations here that deserve more attention). I've put together a > proof-of-concept on which I've tested some aspects of atomic renames and > atomic file creates. Both of these tests fail reliably on a naked s3a > instance. I've also done a small YCSB run against a small cluster to sanity > check other functionality and was successful. I will post the patch, and my > laundry list of things that still need work. The WAL is still placed on HDFS, > but the HBase root directory is otherwise on S3. > Note that my prototype is built on Hadoop's source tree right now. That's > purely for my convenience in putting it together quickly, as that's where I > mostly work. I actually think long-term, if this is accepted as a good > solution, it makes sense to live in HBase (or it's own repository). It only > depends on stable, public APIs in Hadoop and is targeted entirely at HBase's > needs, so it should be able to iterate on the HBase community's terms alone. 
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827000#comment-16827000 ] Josh Elser commented on HBASE-22149: {quote}I've had the idea of a separate code repository suggested to me by a couple of people, too. That solves this problem, too. {quote} This seems like the cleanest approach to me!
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826998#comment-16826998 ] Sean Mackrory commented on HBASE-22149: --- So for the Hadoop 3 issue, we can have this module built only when the hadoop-3.0 profile is activated, but that requires that the list of modules be duplicated, with the profile-specific copy of that list adding this one. Not entirely clean, but the alternatives are worse. I've had the idea of a separate code repository suggested to me by a couple of people, too. That solves this problem, too. Any thoughts? Is there any precedent for a Hadoop-3-only module that has a better solution?
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826523#comment-16826523 ] Sean Mackrory commented on HBASE-22149: --- (swapped out patch #5 - there was a variable name change that I hadn't done everywhere, and didn't notice it until I did a clean build).
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826393#comment-16826393 ] Sean Mackrory commented on HBASE-22149: --- Added some contract tests I had missed that get a lot more coverage, fixed all the issues I was having in the tests (it was just tests stepping on each other's paths because they weren't all in separate directories like they're supposed to be), and am now normalizing all paths and sorting arrays in one central place. When I normalize paths for locking, I'm doing /scheme/hostname/path to ensure using this for multiple filesystems is safe. Some minor to-dos are left, but they definitely don't impact my test cases or the HBase workloads that have run on this so far:
- mkdirs has implications for any parent directories that don't exist yet, although it will only lock the path. I can't think of a scenario where this would cause a problem, though.
- The local lock implementation isn't re-entrant if you read-lock a path and then try to read-lock a parent in the same thread. I don't think anyone would use it in production, and the ZK implementation is the default even for the unit tests. This implementation is really only still there in case it helps with debugging other logic.
- The whole thing really depends on Hadoop 3+ (in production, S3Guard is required and isn't in the Hadoop 2 releases, and even just for testing there are a lot of changes required to get it to compile). I'm wondering if there's an easy way to include this module only with the Hadoop 3 profile. I haven't seen one, so... hints welcome :)
Other than that: what else would the community like to see before this is committed (albeit perhaps with a big "experimental" label until it has gone through more scale and integration testing)?
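To make the /scheme/hostname/path normalization described above concrete, here is a minimal sketch in plain Java. It assumes lock keys are derived from fully qualified URIs; the class and method names are illustrative only and are not taken from the patch.
{code}
import java.net.URI;

// Illustrative only: derive a lock key of the form /scheme/hostname/path from a
// fully qualified URI, so that locks taken through different FileSystem instances
// still collide on the same key.
public final class LockPaths {
  private LockPaths() {
  }

  public static String lockKey(URI uri) {
    String scheme = uri.getScheme() == null ? "" : uri.getScheme();
    String host = uri.getHost() == null ? "" : uri.getHost();
    String path = uri.getPath() == null || uri.getPath().isEmpty() ? "/" : uri.getPath();
    // Strip a trailing slash so "/hbase/data" and "/hbase/data/" share one lock.
    if (path.length() > 1 && path.endsWith("/")) {
      path = path.substring(0, path.length() - 1);
    }
    return "/" + scheme + "/" + host + path;
  }

  public static void main(String[] args) {
    // Prints: /s3a/my-bucket/hbase/data/default/table1
    System.out.println(lockKey(URI.create("s3a://my-bucket/hbase/data/default/table1/")));
  }
}
{code}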
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823277#comment-16823277 ] Andrew Purtell commented on HBASE-22149: {quote}there were a few problems with large numbers of connections to S3, etc. {quote} I filed a Jira for this problem, unless I misunderstand what you are seeing. (The suggestion to have more, smaller regions is confusing.) The essential idea is that instead of keeping a file handle open for every file - which has sublinear resource usage on HDFS but a linear cost (connections, file handles) with S3A - we maintain an open-file cache with an LRU eviction policy, and some way to exclude bulk scans from caching, similar to what we do in the blockcache today. As you suggest, an operator could mitigate this by keeping the total number of store files as low as possible, via tunables like split policy and file size thresholds. I think that also means keeping the number of regions as small as possible, since the goal is minimizing the total store file count. Here's where we may not be on the same page: an S3-backed filesystem is going to behave very differently from HDFS, and a small number of regions with really huge store files might be handled just fine. Then I suspect we would see stress on hfile index and blockcache efficiency and have to explore more tunables there, or maybe small code changes.
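A minimal sketch of the open-file cache idea, assuming an access-ordered LinkedHashMap is enough to express the eviction policy. The handle type, capacity, and close-on-evict behavior are assumptions for illustration, not anything from HBase or this patch.
{code}
import java.io.Closeable;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: an access-ordered map that evicts (and closes) the least
// recently used open handle once capacity is exceeded. A real cache would also
// need a way to pin handles and to bypass caching for bulk scans.
public class OpenFileCache<K, V extends Closeable> extends LinkedHashMap<K, V> {
  private final int capacity;

  public OpenFileCache(int capacity) {
    super(16, 0.75f, true);  // accessOrder = true gives LRU iteration order
    this.capacity = capacity;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    if (size() > capacity) {
      try {
        eldest.getValue().close();  // release the connection / file handle
      } catch (IOException ignored) {
        // best effort on eviction
      }
      return true;
    }
    return false;
  }
}
{code}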
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823273#comment-16823273 ] Andrew Purtell commented on HBASE-22149: {quote}This will harm r/w latency distribution tail for sure. Copying (moving) GBs of data in S3 with 3x replication - that is 10s of seconds. {quote} This file system will not be triple-replicating data like HDFS would, right? Whatever the S3 service might do internally to provide durability and availability will be a black box to us, but we won't make the problem worse by replicating data somewhere we control things, right? The basic point is sound. Atomic rename is expected to be fast. We are renaming under lock in the flush, compaction, split, or merge code paths. I assume so; I haven't looked at the code. With HBOSS those locks are going to be held for a time proportional to the amount of data being copied behind the scenes instead of the O(1) latency of one HDFS namenode RPC. This is why I proposed a new 'transaction' framework at the HBase level instead of a lower-level solution like HBOSS, so we could engineer for this and provide for substrate-specific strategies, like taking advantage of atomic rename if it exists, or using PUT-COPY if we're on S3, or... but we are not considering that type of solution here. As Sean points out, these operations are not normally on the critical path for reads and writes. However, if flushes take a very long time to complete, the memstore's buffering capacity will be exceeded and writes will stall. Flush upon closing is going to be a danger zone: it may take a very long time for a region to close, and this will impact both reads and writes for the duration, so you probably want to rethink all region management concerns like balancer policy, split policy, etc. I think the HBOSS design rationale says these things are the acceptable outcome of the tradeoff being made. Another approach could solve it without these drawbacks.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823264#comment-16823264 ] Sean Mackrory commented on HBASE-22149: --- {quote}I am concerned about the write scalability of the ZooKeeper service when servicing a cluster with a very large number of regions{quote} There's a balance we need to keep in mind there - Wellington's been doing some integration testing with me, and there were a few problems with large numbers of connections to S3, etc. that could be avoided by having a larger number of smaller regions. Probably best to explore other solutions to those problems, then.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823251#comment-16823251 ] Sean Mackrory commented on HBASE-22149: --- Just attached another patch. The really big thing in this patch is that I've added my own implementation of the AmazonS3 interface for testing. It turns out it's actually fairly straightforward to get a local HashMap of Strings to support all of the FS contract tests. The only remotely tricky part is the substring logic to make all the backwards seeks, etc. work. Major upsides here:
* All of the contract tests that will work on S3 will work against my mock. I'm not having to skip a bunch in the default case anymore.
* The 2 main tests I originally added to reproduce the semantic problems also fail when you pass -Pnull (i.e. to disable all locking).
* The S3 mocks implicitly pulled in a bunch of J2EE stuff and conflicting versions of the AWS SDK, which caused a number of issues.
* Much more debuggable - no more going over the network - it's all in a small, in-memory HashMap.
I still see a couple of contract test failures when running against the real S3 service that I need to get to the bottom of (although my gut feeling right now is that it's all issues with my test code & setup and not the core locking logic). Beyond that and Fabbri's suggestion to qualify all paths, I'm starting to get to the bottom of my to-do list.
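The substring logic mentioned above can be pictured with a small sketch: an in-memory map of keys to bytes, where a ranged read is just an array slice. This is not the actual AmazonS3 mock from the patch and does not implement that interface; the class and method names are invented for illustration.
{code}
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a toy in-memory object store. Ranged reads are served by
// slicing the stored byte[], which is what lets backwards seeks and contract
// tests run with no network involved.
public class InMemoryObjectStore {
  private final Map<String, byte[]> objects = new HashMap<>();

  public void put(String key, byte[] data) {
    objects.put(key, data.clone());
  }

  public boolean exists(String key) {
    return objects.containsKey(key);
  }

  /** Returns bytes [start, endExclusive) of the object, like an HTTP range GET. */
  public byte[] getRange(String key, int start, int endExclusive) {
    byte[] data = objects.get(key);
    if (data == null) {
      throw new IllegalArgumentException("No such key: " + key);
    }
    return Arrays.copyOfRange(data, start, Math.min(endExclusive, data.length));
  }
}
{code}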
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823236#comment-16823236 ] Sean Mackrory commented on HBASE-22149: --- {quote}This will harm r/w latency distribution tail for sure. Copying (moving) GBs of data in S3 with 3x replication - that is 10s of seconds. {quote} Will it, actually? My understanding was that reads and writes are served in-memory, with writes being persisted to the WAL (and to be clear, if I didn't say this already, this FS isn't intended to support the WAL yet). Is a compaction on the critical path for reads and writes?
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16817543#comment-16817543 ] Steve Loughran commented on HBASE-22149: bq. LimitedPrivate({"HBase"}), Prefer @VisibleForTesting. I really hate the LimitedPrivate stuff.
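For readers less familiar with the two annotations being contrasted, a tiny illustrative snippet follows. The class and method below are invented for the example and are not asserted to exist in the patch.
{code}
import com.google.common.annotations.VisibleForTesting;
import org.apache.hadoop.classification.InterfaceAudience;

// Illustrative only: the two audience-marking styles under discussion.
@InterfaceAudience.LimitedPrivate({"HBase"})  // Hadoop style: intended only for the named project(s)
public class SomeLockingHelper {

  @VisibleForTesting  // Guava style: wider visibility exists only so tests can call it
  static String stripTrailingSlash(String path) {
    return path.length() > 1 && path.endsWith("/")
        ? path.substring(0, path.length() - 1)
        : path;
  }
}
{code}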
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815804#comment-16815804 ] Vladimir Rodionov commented on HBASE-22149: --- {quote} I'd love to explore it but I think it's worth seeing if the ~minute renames are really a problem first. {quote} This will harm r/w latency distribution tail for sure. Copying (moving) GBs of data in S3 with 3x replication - that is 10s of seconds.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815643#comment-16815643 ] Sean Mackrory commented on HBASE-22149: --- {quote}I think that the patch should support a metadata-only rename operation. {quote} I mention an alternate design above in which all metadata is stored in a separate repository and maps to GUID-named files on S3. It could give atomic / metadata-only renames on S3, but it then makes the underlying storage incompatible and makes the repository a critical point of failure: lose your ZK ensemble, and your HFiles are effectively unreadable until someone can go through and piece them back into tables manually. It has a lot of benefits, but it's a much more severe approach. I'd love to explore it but I think it's worth seeing if the ~minute renames are really a problem first.
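To clarify what a metadata-only rename would mean in that alternate design, here is a deliberately over-simplified sketch where the metadata map is just in memory. In the real proposal that mapping would live in a durable store (ZooKeeper, an HBase table, etc.); all names here are illustrative.
{code}
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: logical paths map to immutable GUID-named objects, so a
// rename only rewrites the mapping and never copies object data on S3.
public class InodeStyleNamespace {
  private final ConcurrentHashMap<String, String> pathToObjectKey = new ConcurrentHashMap<>();

  /** Creates a new logical path and returns the object key where its bytes would live. */
  public String create(String logicalPath) {
    String objectKey = UUID.randomUUID().toString();
    pathToObjectKey.put(logicalPath, objectKey);
    return objectKey;
  }

  /** Rename is a metadata operation: the underlying object key is unchanged. */
  public synchronized boolean rename(String src, String dst) {
    String objectKey = pathToObjectKey.remove(src);
    if (objectKey == null) {
      return false;
    }
    pathToObjectKey.put(dst, objectKey);
    return true;
  }

  public String resolve(String logicalPath) {
    return pathToObjectKey.get(logicalPath);
  }
}
{code}
The drawback called out above follows directly from this shape: if the mapping is lost, the GUID-named objects are still on S3 but can no longer be tied back to tables.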
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815624#comment-16815624 ] Andrew Purtell commented on HBASE-22149: {quote}Locking the destination directory against read/write operations for the duration of an object store (S3) rename does not look like a good idea. S3 physically moves data during this operation and it can take time (sometimes minutes, though very rarely). I think that the patch should support a metadata-only rename operation. {quote} Sure, there may be other alternatives, but the motivation of HBOSS seems pretty clear: to make HBase work on S3, via locking, even though there are no atomic metadata-only renames available there. On the grounds that HBOSS is an attempt to make this work with this approach (locking in the Hadoop FS layer), it seems perfectly reasonable to wait as long as necessary.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815616#comment-16815616 ] Vladimir Rodionov commented on HBASE-22149: ---
{code}
public boolean rename(Path src, Path dst) throws IOException {
  try (AutoLock l = sync.lockRename(src, dst)) {
    return fs.rename(src, dst);
  }
}
{code}
Locking the destination directory against read/write operations for the duration of an object store (S3) rename operation does not look like a good idea. S3 physically moves data during this operation, and it can take time (sometimes minutes, though very rarely). I think that the patch should support a metadata-only rename operation.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815461#comment-16815461 ] Sean Mackrory commented on HBASE-22149: ---
{quote}fs.qualify(path){quote}
Yeah, I probably do need that, although it hasn't come up yet in tests.
{quote}What about multi-bucket support?{quote}
Added that yesterday, actually - in my next patch the ZK client will now be 'jailed' inside a z-node named after the hostname in the URI. That's not quite right for WASB and ABFS, but they don't need this anyway. It's right for S3 and GCS. Others may come along that require it to be rethought, but that's good enough for now, and I'd like to avoid putting any FS-specific logic inside this as long as I can.
{quote}S3Mock sounds interesting{quote}
Yes, I wondered if it was a faithful enough recreation for the full battery of s3a tests. One side note: even though I got S3Mock working, I did have to rely on APIs designated as Private (specifically the S3ClientFactory stuff). So we need to have a discussion about whether we think those APIs might be stable enough to promote to LimitedPrivate({"HBase"}), or perhaps another API wherein I simply hand the FS a ready-to-go S3 client, instead of pointing it at a "Factory" class that will return the client I already made (which is what I have to do now).
{quote}For lockListing(), why is a shared lock on the path being listed not sufficient?{quote}
Because you want it to have exclusive access to all the children (and in some cases all children recursively) when there may be renames going on inside that path. Other than this particular case, write locks don't have to block when there are read locks above them in the path. For a non-recursive listing, a read lock on all children of the path you're referencing would be sufficient, but how do you correctly enumerate the children without first having the lock? You end up back where you started. An exclusive lock on the parent for listing is a little more aggressive than needed, but it's simple and safe. I've tried to err on that side of things since we can't seem to enumerate all the FS assumptions of HBase. If integration / performance testing finds that there is a particular point of contention, that's a targeted area we can investigate to determine if relaxing the constraints is safe.
{quote}Deadlock detection and debuggability{quote}
{quote}I don't have much experience with Curator but have heard of it.{quote}
Yeah, this will definitely warrant some work, along with attention to the operational concerns when problems arise. I've been finding Curator is not as fool-proof as I had hoped, and my next patch actually eliminates the use of curator-framework (in favor of the lower-level curator-client) for everything but the actual locking / unlocking. The APIs for creating and deleting znodes have actually been very hard to debug.
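To make the 'jailed' z-node idea above concrete, here is a minimal sketch of mapping a qualified path to a lock znode under a per-authority (per-bucket) root. The prefix and helper names are assumptions for illustration, not the patch's actual layout:
{code}
// Hedged sketch only: keeps each FileSystem's locks under a znode named after
// the URI authority (bucket/host), as described above. The znode prefix and
// helper name are assumptions, not the patch's API.
import java.net.URI;
import org.apache.hadoop.fs.Path;

public class ZNodePaths {
  private static final String LOCK_ROOT = "/hboss"; // assumed prefix

  /** Map a fully-qualified Path to a znode under a per-authority root. */
  public static String lockZNode(URI fsUri, Path qualified) {
    String authority = fsUri.getAuthority();       // e.g. the S3 bucket name
    String relative = qualified.toUri().getPath(); // path within the bucket
    return LOCK_ROOT + "/" + authority + relative;
  }
}
{code}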
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815031#comment-16815031 ] Aaron Fabbri commented on HBASE-22149: -- Thanks for your work on this interesting patch [~mackrorysd]. I'm in the process of reading it. A couple of initial thoughts; not really looking at overall correctness yet.
* On path normalization. Do you need an fs.qualify(path) before calling into your lock manager to ensure you are always looking at an absolute path? What about multi-bucket support? You may need to preserve the authority/host once you look at supporting that. S3Guard had these issues as it also used the path as a lookup key for stuff.
* S3Mock sounds interesting. It would be nice to be able to work on S3A some without paying for AWS usage (cost has been limiting my involvement).
* For lockListing(), why is a shared lock on the path being listed not sufficient?
* Deadlock detection and debuggability. You might want the concepts of waiters / owners and wait-for graphs at some point to be able to avoid deadlock, assuming you keep going down this route and we cannot convince ourselves that applications (HBase) will not hold and wait. Probably a bit early to go this deep, though. I don't have much experience with Curator but have heard of it.
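As an illustration of the path-normalization point above (fs.qualify is paraphrased here with Hadoop's public FileSystem.makeQualified), a minimal sketch, assuming the lock manager keys off fully qualified paths:
{code}
// Sketch only: qualify a user-supplied Path before using it as a lock key, so
// relative paths and missing scheme/authority don't produce distinct keys for
// the same object. Names here are illustrative, not the patch's API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LockKeys {
  public static Path lockKey(FileSystem fs, Path path) {
    // makeQualified resolves the path against the FS working directory
    // and fills in the scheme and authority (e.g. the bucket).
    return fs.makeQualified(path);
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    System.out.println(lockKey(fs, new Path("relative/dir")));
  }
}
{code}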
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813967#comment-16813967 ] Sean Mackrory commented on HBASE-22149: --- HBASE-22149-hbase-3.patch has big improvements for tests:
* Now runs against S3Mock by default. 1 new test failure and 2 new test errors when running in local mode.
* Now runs against HBase's MiniZKCluster by default instead of Curator's TestServer. Couldn't figure out why the latter wasn't working in the newer versions of Curator.
* Now resets fs.<scheme>.impl back to its former value, as suggested by [~wchevreuil].
I still have my work cut out for me to get all of the tests to pass and to address a few remaining TODOs in the code, but now I can do a full test run in a couple of minutes.
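A minimal sketch of what running the suite against an in-process ZooKeeper might look like, assuming HBase's HBaseTestingUtility; the HBOSS connection-string property named below is an assumption, not confirmed by the patch:
{code}
// Hedged sketch: stand up HBase's mini ZooKeeper cluster for tests and point
// the lock manager at it. The HBOSS property name is assumed.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.apache.hadoop.hbase.zookeeper.MiniZooKeeperCluster;

public class MiniZkExample {
  public static void main(String[] args) throws Exception {
    HBaseTestingUtility util = new HBaseTestingUtility();
    MiniZooKeeperCluster zk = util.startMiniZKCluster();
    try {
      Configuration conf = util.getConfiguration();
      // Assumed property name for the HBOSS ZK quorum; not confirmed here.
      conf.set("fs.hboss.sync.zk.connectionString",
          "localhost:" + zk.getClientPort());
      // ... run FileSystem contract tests against this configuration ...
    } finally {
      util.shutdownMiniZKCluster();
    }
  }
}
{code}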
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813439#comment-16813439 ] Sean Mackrory commented on HBASE-22149: --- Posting another patch in which I've added FS contract tests to get much greater test coverage, and fixed a bunch of issues. I've also added TODOs in most of the places where I know a tweak is still required. I'm sure this is not the conventional patch name format for HBase, but that's because I'm not actually proposing this for inclusion just yet - very much still a work in progress. Current test failures:
{code}
ZK (RootContract test times out):
[ERROR] Failures:
[ERROR] TestHBOSSContractGetFileStatus>AbstractContractGetFileStatusTest.testListLocatedStatusEmptyDirectory:129->Assert.assertEquals:645->Assert.failNotEquals:834->Assert.fail:88 listLocatedStatus(test dir): file count in 1 directory and 1 file expected:<0> but was:<1>
[ERROR] TestHBOSSContractGetFileStatus>AbstractContractGetFileStatusTest.testListStatusFiltering:463->AbstractContractGetFileStatusTest.verifyListStatus:534->Assert.assertEquals:645->Assert.failNotEquals:834->Assert.fail:88 length of listStatus(s3a://mackrory/user/sean/hboss-junit-test/contract-tests, org.apache.hadoop.fs.contract.AbstractContractGetFileStatusTest$AllPathsFilter@452f7a60 ) expected:<2> but was:<3>
[ERROR] Errors:
[ERROR] TestHBOSSContractMkdir>AbstractContractMkdirTest.testMkdirOverParentFile:108->AbstractFSContractTestBase.assertDeleted:349 » IO
[ERROR] TestHBOSSContractRename>AbstractContractRenameTest.testRenameDirIntoExistingDir:155->AbstractFSContractTestBase.assertIsDirectory:327 » FileNotFound
[INFO]
[ERROR] Tests run: 91, Failures: 2, Errors: 2, Skipped: 13

Local:
[ERROR] Failures:
[ERROR] TestAtomicRename.testAtomicRename:77 Rename source is still visible after rename finished or target showed up.
[ERROR] Errors:
[ERROR] TestHBOSSContractRename>AbstractContractRenameTest.testRenameDirIntoExistingDir:155->AbstractFSContractTestBase.assertIsDirectory:327 » FileNotFound
[INFO]
[ERROR] Tests run: 91, Failures: 1, Errors: 2, Skipped: 13
{code}
{quote}Seems TreeLockManager.lockListings(Path[] paths) can get deadlocked when passing list of hierarchical paths{quote}
Good catch, and in fact at least one of the new contract tests fails for that reason. One solution is to reconcile the list before locking, eliminating any path for which an ancestor is included elsewhere in the list. Another solution is to make the tree as a whole reentrant instead of just the individual locks: if writeLockAbove() finds a lock already held by the current thread, we shouldn't block. The local implementation contains the latter solution right now. I still need to supplement the ZK implementation or filter the list of locked directories.
{quote}would it be enough to have mock classes using local FS emulating S3 behaviour here{quote}
Testing against the local FS is a better thing to fall back on if no S3 credentials are configured than simply skipping the tests entirely. It would test that we don't deadlock and that there are no *major* functional issues, but it wouldn't reproduce any of the problems we're trying to fix. Using the local FS to emulate S3 would be its own undertaking, and honestly, if none of the S3 mocks currently out there are faithful enough, I don't even want to try doing it myself. I did have a breakthrough getting Adobe's S3Mock to work earlier this week.
I put it on hold when I saw the atomic rename test was failing, but something's up with it even against Amazon S3, so I need to dig it back out (and of course get to the bottom of why that test started failing when I moved to the HBase code base). Also note that the embedded ZooKeeper seems to be having problems with HBase's newer version of Curator, so for now I'm configuring my tests to point to a local ZooKeeper instance in auth-keys.xml.
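A hedged sketch of the 'reconcile the list before locking' option mentioned in the previous comment: drop any path whose ancestor is already in the list, so nested paths cannot deadlock against each other. Class and method names are illustrative, not the patch's API:
{code}
// Sketch only: filter a list of paths so that no retained path has an
// ancestor elsewhere in the list; locking the retained ancestors then covers
// the descendants.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;

public class PathReconciler {
  public static List<Path> dropCoveredPaths(List<Path> paths) {
    List<Path> result = new ArrayList<>();
    for (Path candidate : paths) {
      boolean covered = false;
      for (Path other : paths) {
        if (!other.equals(candidate) && isAncestor(other, candidate)) {
          covered = true;
          break;
        }
      }
      if (!covered) {
        result.add(candidate);
      }
    }
    return result;
  }

  private static boolean isAncestor(Path ancestor, Path descendant) {
    for (Path p = descendant.getParent(); p != null; p = p.getParent()) {
      if (p.equals(ancestor)) {
        return true;
      }
    }
    return false;
  }
}
{code}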
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813240#comment-16813240 ] Wellington Chevreuil commented on HBASE-22149: -- I did some tests on this: I was able to add it as a module of the hbase main project, build it, and run the unit tests using the local FS. Some adjustments were required in the UTs; I changed them to define dirs as part of the test working directory, as referring to root "/" with the local FS would give permission errors. I understand, however, that this should not be shipped as part of the hbase main project. Should there be a separate project/repo for this, such as hbase-operator-tools (if, of course, the proposal is accepted)? Any thoughts [~apurtell] [~vrodionov] [~zyork]? [~mackrorysd] For tests, would it be enough to have mock classes using the local FS to emulate S3 behaviour here, so that we don't require any dependency on S3 itself in the hbase project?
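To illustrate the local-FS testing question, a hedged sketch of wiring HBOSS around the local filesystem, based on the fs.hboss.fs.<scheme>.impl lookup visible in the initialize() snippet quoted later in this thread; the HBOSS package name is an assumption:
{code}
// Hedged sketch: resolve file:// through HBOSS and have HBOSS wrap the raw
// local filesystem underneath, for unit tests that don't touch S3.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalFsHbossExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("fs.file.impl.disable.cache", true);
    // Assumed package for the HBOSS class named in this thread.
    conf.set("fs.file.impl", "org.apache.hadoop.hbase.oss.HBaseObjectStoreSemantics");
    conf.set("fs.hboss.fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem");
    // NOTE: the HBOSS lock manager also needs its own configuration
    // (e.g. a ZooKeeper quorum); that part is omitted here.
    FileSystem fs = FileSystem.get(new URI("file:///tmp/hboss-test"), conf);
    fs.mkdirs(new Path("/tmp/hboss-test/dir"));
    System.out.println(fs.getFileStatus(new Path("/tmp/hboss-test/dir")));
  }
}
{code}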
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811301#comment-16811301 ] Zach York commented on HBASE-22149: --- [~apurtell] Makes sense, I agree on the ideal long term solutions. I'll review this with those caveats in mind then.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811291#comment-16811291 ] Andrew Purtell commented on HBASE-22149: [~zyork] My philosophy for changes like this, and I think it applies here, is that this type of solution aims to avoid making code changes to HBase. It is not intended to be the ideal long-term solution. The ideal long-term solution includes managing metadata natively within HBase somehow, sufficient to provide atomic semantics regardless of the substrate. Past proposals include the "filesystem v2" work.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811283#comment-16811283 ] Zach York commented on HBASE-22149: --- [~mackrorysd] Cool ideas. I'll try to take a look soon!
{quote}Indeed, S3Guard is required as well. S3Guard is entirely pluggable (like this, we have Null and Local implementations in addition to the Dynamo one), so a ZooKeeper implementation is quite feasible as well. I actually suggested as much in the early days of S3Guard since a ZooKeeper ensemble is a de-facto requirement for Hadoop already, but nothing happened for performance reasons. If you think the slower metadata lookups wouldn't be a problem for HBase, that's worth looking into.{quote}
Why not a system table in HBase itself? If we're trying not to depend too much on external solutions, that might be best.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810403#comment-16810403 ] Vladimir Rodionov commented on HBASE-22149: --- [~mackrorysd], you can take a look at [www.min.io|https://min.io] for local S3 testing.
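For reference, a minimal sketch of pointing s3a at a locally running MinIO endpoint for testing; the endpoint, credentials, and bucket below are placeholders for a local setup:
{code}
// Hedged sketch: point s3a at a locally running MinIO server for tests.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MinioS3AExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.s3a.endpoint", "http://127.0.0.1:9000"); // local MinIO endpoint
    conf.set("fs.s3a.path.style.access", "true");         // no DNS-style buckets
    conf.set("fs.s3a.access.key", "minioadmin");          // placeholder credentials
    conf.set("fs.s3a.secret.key", "minioadmin");
    FileSystem fs = FileSystem.get(new URI("s3a://test-bucket/"), conf);
    System.out.println(fs.exists(new Path("s3a://test-bucket/")));
  }
}
{code}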
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809366#comment-16809366 ] Sean Mackrory commented on HBASE-22149: --- HBASE-22149-hbase.patch is my proof-of-concept ported into the HBase code base. I've addressed all of Vladimir's feedback so far, but not Wellington's. I did get the tests running, although they still require you to add an S3 URI and S3 credentials to src/test/resources/auth-keys.xml. I tried several candidates for mocking S3 today. adobe/S3Mock requires overriding the actual S3 client used by s3a, which is not a publicly exposed interface right now. It could be exposed. I also hit what appear to be some conflicting HTTP library versions. findify/s3mock requires a Scala dependency (which is banned - not sure if we can work around that since it's only required in the test scope), but more seriously it doesn't support FS-style S3 keys. It documents that it won't work with the local filesystem backend, but I had problems with the in-memory backend as well any time directories were involved. S3Proxy is my current favorite and I'm going to work on it some more tomorrow. It fails when you use headers it doesn't support, but I want to see if we can work around that by disabling unnecessary features in S3A or by modifying S3Proxy to proceed in the presence of unknown headers and just ignore them. I need to look into working around this more.
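For context on the S3 client override point, a hedged sketch of how a test might swap in a mock client factory; the property shown is an S3A-private setting that may vary across Hadoop versions, and the factory class named here is hypothetical:
{code}
// Hedged sketch: swap in a custom S3 client factory so s3a talks to a mock
// endpoint. The property is a private S3A detail (not a stable, public API)
// and MockS3ClientFactory is a hypothetical test class.
import org.apache.hadoop.conf.Configuration;

public class ClientFactoryOverride {
  public static Configuration mockedS3aConf() {
    Configuration conf = new Configuration();
    conf.set("fs.s3a.s3.client.factory.impl",
        "org.example.hboss.test.MockS3ClientFactory"); // hypothetical class
    return conf;
  }
}
{code}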
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809133#comment-16809133 ] Sean Mackrory commented on HBASE-22149: ---
{quote}Also shouldn't be the case for hbase, given every path under the rename would already be owned by hbase user.{quote}
Yeah - I think for v1 at least we can just document the need to have consistent access permissions. Of course, since the ACLs are external it's a little less trivial for us to validate and fix, so I'll keep more robust renames on my wish-list...
{quote}maybe just mock S3 FS implementation{quote}
Yeah, I'm actually playing around with switching the tests over to that. When porting this module into the HBase codebase I had some other issues with tests. I'll post another patch with all the feedback so far when I've gotten further with them. Your other 2 points are good ones and I'll incorporate them.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809131#comment-16809131 ] Sean Mackrory commented on HBASE-22149: ---
{quote}S3Guard must still be enabled{quote}
{quote}Why not mirror instead to a hierarchy of znodes?{quote}
Indeed, S3Guard is required as well. S3Guard is entirely pluggable (like this, we have Null and Local implementations in addition to the Dynamo one), so a ZooKeeper implementation is quite feasible as well. I actually suggested as much in the early days of S3Guard since a ZooKeeper ensemble is a de-facto requirement for Hadoop already, but nothing happened for performance reasons. If you think the slower metadata lookups wouldn't be a problem for HBase, that's worth looking into.
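To show what 'entirely pluggable' means in practice, a minimal sketch of selecting an S3Guard metadata store via configuration; a ZooKeeper-backed store would be a new, hypothetical implementation plugged in the same way:
{code}
// Hedged sketch: S3Guard's metadata store is chosen by configuration, which is
// what makes a ZooKeeper-backed implementation conceivable. The Null/Local/
// Dynamo store classes ship with S3A; the ZK store below is hypothetical.
import org.apache.hadoop.conf.Configuration;

public class S3GuardStoreSelection {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Shipped options include the DynamoDB-backed store and a local store:
    conf.set("fs.s3a.metadatastore.impl",
        "org.apache.hadoop.fs.s3a.s3guard.LocalMetadataStore");
    // A ZooKeeper-backed store would plug in the same way (hypothetical class):
    // conf.set("fs.s3a.metadatastore.impl", "org.example.ZKMetadataStore");
    System.out.println(conf.get("fs.s3a.metadatastore.impl"));
  }
}
{code}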
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809016#comment-16809016 ] Wellington Chevreuil commented on HBASE-22149: -- Some thoughts:
1) {quote}but there's also a contrived case I'm aware of where you can set up ACLs such that the permissions are restricted on a subset of the rename{quote} This also shouldn't be an issue for HBase, given every path under the rename would already be owned by the hbase user.
2) It seems *TreeLockManager.lockListings(Path[] paths)* can deadlock when passed a list of hierarchical paths, since it tries to *treeWriteLock* a child node of a previous node in the paths array. That doesn't seem to be a call HBase would make anyway, but perhaps only the parent paths need to be locked when listing.
3) Noticed there's no actual dependency on S3/S3Guard, so mocking the S3 FileSystem implementation's behaviour should suffice for tests.
4) In HBaseObjectStoreSemantics.initialize(), it may be worth resetting "fs.SCHEME.impl" back to HBaseObjectStoreSemantics before returning; otherwise, subsequent clients acquiring a FileSystem with the same Configuration instance may not get HBaseObjectStoreSemantics.
{noformat}
  public void initialize(URI name, Configuration conf) throws IOException {
    String wrappedImpl = conf.get("fs.hboss.fs." + name.getScheme() + ".impl");
    if (wrappedImpl != null) {
      conf.set("fs." + name.getScheme() + ".impl", wrappedImpl);
    }
    fs = FileSystem.get(name, conf);
    sync = TreeLockManager.get(name, conf);
  }
{noformat}
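One way the suggestion in point 4 could look. This is a sketch only, reusing the fs/sync field names and the TreeLockManager call from the snippet above; whether the FileSystem cache interferes with restoring the property is something the patch would still need to verify:
{code}
  public void initialize(URI name, Configuration conf) throws IOException {
    String scheme = name.getScheme();
    String wrappedImpl = conf.get("fs.hboss.fs." + scheme + ".impl");
    if (wrappedImpl != null) {
      // Temporarily point the scheme at the wrapped implementation so the
      // delegate FileSystem can be constructed...
      conf.set("fs." + scheme + ".impl", wrappedImpl);
      try {
        fs = FileSystem.get(name, conf);
      } finally {
        // ...then restore the mapping so later FileSystem.get() calls against
        // this same Configuration still resolve to HBaseObjectStoreSemantics.
        conf.set("fs." + scheme + ".impl",
            HBaseObjectStoreSemantics.class.getName());
      }
    } else {
      fs = FileSystem.get(name, conf);
    }
    sync = TreeLockManager.get(name, conf);
  }
{code}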
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808974#comment-16808974 ] Andrew Purtell commented on HBASE-22149: Some thoughts.
S3Guard must still be enabled. While namespace consistency is orthogonal to locking, ideally we can also eliminate the dependency on S3Guard, because this impacts the costs-to-serve of the AWS hosted HBase service. Maybe this is something that could be addressed here too. S3Guard mirrors the S3 namespace in a Dynamo table. Why not mirror instead to a hierarchy of znodes?
The scalability of the path-based locking approach given HBase's access patterns is probably ok. We write relatively rarely, and when we do, the writes for each region go into their own separate directories. We can expect locks on the "directory path" for each region directory. Locks for writes in one region are independent of all other locks for all other regions. However, resources required for locks and activity related to locking will grow linearly with respect to the number of regions. I am concerned about the write scalability of the ZooKeeper service when servicing a cluster with a very large number of regions. Would be curious to see the results of an experiment to assess this.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808928#comment-16808928 ] Andrew Purtell commented on HBASE-22149:
{quote}So the rename will not be atomic in the sense that it's an O(1) metadata-only operation, but it will be atomic in the sense that it will have either happened or not happened, and no other calls to the FileSystem will see partial results because of the locking.{quote}
Makes sense. I think this could be an excellent short-to-medium term option. As you mention in the top post, other proposals, like HBASE-20431, could also optimize data movement so renames become O(1) in time cost instead of O(N), where N is the number of bytes to be moved, but that would also be a significantly more complex undertaking requiring a substantial commitment of development resources... which is why it has not yet been attempted.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808753#comment-16808753 ] Sean Mackrory commented on HBASE-22149: --- So the rename will not be atomic in the sense that it's an O(1) metadata-only operation, but it will be atomic in the sense that it will have either happened or not happened, and no other calls to the FileSystem will see partial results, because of the locking. This guarantee only holds if everything accessing the storage is using this FileSystem implementation, but as far as I know that shouldn't be an issue for HBase use cases.
{quote}Lock release must be bullet proof{quote}
Good catch, thanks - that is indeed the point of the AutoLock class in the first place :) I've also moved the lock releases in RemoteIterator.hasNext() under finally.
A couple of other things on my immediate to-do list: I'd like to add FS contract tests to get better coverage of the rest of the FS API beyond the two main use cases I'm testing; certainly more tests are needed in general. I'd also like to add a way to mock S3 so AWS credentials aren't needed to run the tests. That shouldn't be too tricky - the problems my tests currently reproduce depend on client behavior, not on subtle server-side implementation details.
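Releasing the listing lock under finally might look roughly like the following. This is a sketch only, using the LockedRemoteIterator and AutoLock names quoted further down this thread, and it assumes AutoLock.close() tolerates being called more than once:
{code}
  public boolean hasNext() throws IOException {
    boolean more = false;
    try {
      more = iterator.hasNext();
      return more;
    } finally {
      if (!more) {
        // Either the listing is exhausted or iterator.hasNext() threw;
        // in both cases the lock must not leak.
        lock.close();
      }
    }
  }
{code}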
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808315#comment-16808315 ] Allan Yang commented on HBASE-22149: Sorry, I still don't get how the atomic rename is handled, since the FileSystem provided by hadoop-aws (based on S3) does not provide an atomic rename API.
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808252#comment-16808252 ] Vladimir Rodionov commented on HBASE-22149: ---
{quote}One area in which I would particularly appreciate review for safety is in TreeLockManager.treeWriteLock and treeReadLock.{quote}
Sure, this is the most interesting part :)
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808251#comment-16808251 ] Vladimir Rodionov commented on HBASE-22149: ---
{code}
public class LockedFSDataOutputStream extends FSDataOutputStream {

  public LockedFSDataOutputStream(FSDataOutputStream stream, AutoLock lock) {
    super(stream, null);
    this.stream = stream;
    this.lock = lock;
  }

  private final FSDataOutputStream stream;
  private AutoLock lock;

  @Override
  public long getPos() {
    return stream.getPos();
  }

  @Override
  public void close() throws IOException {
    stream.close();
    lock.close();
  }
{code}
Lock release must be bullet proof. If stream.close() throws an IOException, the lock will never be released. Please check all your code and make sure lock.close() is called in a finally clause.
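The fix being asked for could be as small as the following sketch. It reuses the field names from the quoted snippet and assumes AutoLock.close() releases the underlying lock; it is not the code from the patch:
{code}
  @Override
  public void close() throws IOException {
    try {
      stream.close();
    } finally {
      // Release the lock no matter how close() went, so a failed close
      // cannot leave the path write-locked forever.
      lock.close();
    }
  }
{code}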
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808247#comment-16808247 ] Vladimir Rodionov commented on HBASE-22149: ---
{code}
  public static class LockedRemoteIterator<E> implements RemoteIterator<E> {

    public LockedRemoteIterator(RemoteIterator<E> iterator, AutoLock lock) {
      this.iterator = iterator;
      this.lock = lock;
    }

    private RemoteIterator<E> iterator;
    private AutoLock lock;

    public boolean hasNext() throws IOException {
      if (iterator.hasNext()) {
        return true;
      }
      lock.close();
      return false;
    }

    /**
     * Delegates to the wrapped iterator, but will close the lock in the event
     * of a NoSuchElementException. Some applications do not call hasNext() and
     * simply depend on the NoSuchElementException.
     */
    public E next() throws IOException {
      try {
        return iterator.next();
      } catch (NoSuchElementException e) {
        lock.close();
        throw e;
      }
    }
  }
{code}
I would add an explicit close() method to this iterator, since it holds a lock that is only closed when the iterator reaches its end. There is no other way to release the lock here.
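The explicit release being suggested could be as small as this sketch; implementing java.io.Closeable as well would let callers use try-with-resources. The class and field names are the ones from the snippet above, not from the patch:
{code}
  public static class LockedRemoteIterator<E>
      implements RemoteIterator<E>, java.io.Closeable {
    // ... constructor, hasNext() and next() as quoted above ...

    /** Lets a caller that abandons iteration early hand the lock back. */
    @Override
    public void close() throws IOException {
      lock.close();
    }
  }
{code}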
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808248#comment-16808248 ] Sean Mackrory commented on HBASE-22149: --- Thanks [~vrodionov]. You're quite right - all the paths should be copied into a single array and then sorted before locking. I thought I had done that, but perhaps I'm thinking of another method, or I lost it while cleaning up the patch.
One area in which I would particularly appreciate review for safety is TreeLockManager.treeWriteLock and treeReadLock. As I said, I'll be replacing the while(true)s and Thread.sleeps with proper retry logic that can eventually time out, but I don't think that will change the overall structure of those procedures. The idea is to safely obtain locks on the specific nodes in question, but then also ensure they're not INSIDE another lock that someone else already obtained, etc.
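Concretely, concat() (and any other operation that locks several paths) would then acquire its locks in a single globally consistent order. A sketch of that ordering discipline, not the patch itself, reusing the sync/fs/AutoLock names from the concat() snippet quoted further down the thread:
{code}
  public void concat(final Path trg, final Path[] psrcs) throws IOException {
    // Deadlock avoidance: lock every involved path in one sorted order, so two
    // concurrent operations over overlapping path sets can never wait on each
    // other in opposite directions. (Assumes no duplicate paths; a reentrant
    // lock or de-duplication would be needed otherwise.)
    Path[] all = new Path[psrcs.length + 1];
    System.arraycopy(psrcs, 0, all, 0, psrcs.length);
    all[psrcs.length] = trg;
    Arrays.sort(all);  // org.apache.hadoop.fs.Path is Comparable

    AutoLock[] locks = new AutoLock[all.length];
    try {
      for (int i = 0; i < all.length; i++) {
        locks[i] = sync.lock(all[i]);
      }
      fs.concat(trg, psrcs);
    } finally {
      for (AutoLock lock : locks) {
        if (lock != null) {
          lock.close();
        }
      }
    }
  }
{code}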
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808228#comment-16808228 ] Vladimir Rodionov commented on HBASE-22149: --- Nice, thanks [~mackrorysd]. I will go through your patch later on today/tomorrow, but for now one question:
{code}
  public void concat(final Path trg, final Path[] psrcs) throws IOException {
    AutoLock[] locks = new AutoLock[psrcs.length + 1];
    try {
      for (int i = 0; i < psrcs.length; i++) {
        locks[i] = sync.lock(psrcs[i]);
      }
      locks[psrcs.length] = sync.lock(trg);
      fs.concat(trg, psrcs);
    } finally {
      for (int i = 0; i < locks.length; i++) {
        if (locks[i] != null) {
          locks[i].close();
        }
      }
    }
  }
{code}
This code does not seem deadlock-safe?
[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics
[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808127#comment-16808127 ] Sean Mackrory commented on HBASE-22149: --- I've attached HBASE-22149-hadoop.patch, which is my proof-of-concept built on the Hadoop source tree. The while(true)s and Thread.sleeps in TreeLockManager must be replaced by more robust retry logic. I also need to add support for globStatus() operations and deleteOnExit(), as HBase does appear to use them and they aren't quite as trivial as everything else. Symlinks and PathHandle are also unsupported, but they are unused and unsupported in s3a, so I have no plans to address that.
A write lock is acquired on a path when an OutputStream is created, and it is not released until the OutputStream is closed (whereas most operations lock and unlock within the same method call). I haven't done this with InputStreams, and I'm not sure it's required. OutputStreams require it to ensure create() is atomic, as s3a won't actually create a file on the underlying S3 bucket until later. With the exception of InputStreams, I've generally erred on the side of locking everything, in the hope of starting out with correctness. As I do performance testing and identify bottlenecks, it may be worth carefully considering whether some locking can be removed where HBase's usage makes it safe to do so.
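The create() behaviour described above, holding the write lock for the whole lifetime of the stream, might look roughly like this. It is a sketch only, built from the LockedFSDataOutputStream and sync.lock() names quoted earlier in the thread; the real patch may use a different lock-manager call for the write lock:
{code}
  @Override
  public FSDataOutputStream create(Path f, boolean overwrite) throws IOException {
    AutoLock lock = sync.lock(f);
    try {
      FSDataOutputStream out = fs.create(f, overwrite);
      // Hand the lock to the stream wrapper; it is only released when the
      // stream is closed, so the create stays atomic for other HBOSS clients
      // even though s3a does not materialize the object until later.
      return new LockedFSDataOutputStream(out, lock);
    } catch (IOException e) {
      lock.close();  // don't leak the lock if create() itself fails
      throw e;
    }
  }
{code}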