[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-30 Thread Sean Busbey (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830316#comment-16830316 ]

Sean Busbey commented on HBASE-22149:

Also, given the size of the patch, please make a PR. :)

> HBOSS: A FileSystem implementation to provide HBase's required semantics
>
> Key: HBASE-22149
> URL: https://issues.apache.org/jira/browse/HBASE-22149
> Project: HBase
>  Issue Type: New Feature
>  Components: Filesystem Integration
>Reporter: Sean Mackrory
>Assignee: Sean Mackrory
>Priority: Critical
> Attachments: HBASE-22149-hadoop.patch, HBASE-22149-hbase-2.patch, 
> HBASE-22149-hbase-3.patch, HBASE-22149-hbase-4.patch, 
> HBASE-22149-hbase-5.patch, HBASE-22149-hbase-filesystem-1.patch, 
> HBASE-22149-hbase-filesystem-1.patch, HBASE-22149-hbase.patch
>
>
> (I've been using the name HBOSS, short for HBase / Object Store Semantics.)
> I've had some thoughts about how to solve the problem of running HBase on 
> object stores. There has been some thought in the past about adding the 
> required semantics to S3Guard, but I have some concerns about that. First, 
> it's mixing complicated solutions to different problems (bridging the gap 
> between a flat namespace and a hierarchical namespace vs. solving 
> inconsistency). Second, it's S3-specific, whereas other object stores could 
> use virtually identical solutions. And third, we can't do things like atomic 
> renames in a true sense. There would have to be some trade-offs specific to 
> HBase's needs and it's better if we can solve that in an HBase-specific 
> module without mixing all that logic in with the rest of S3A.
> Ideas to solve this above the FileSystem layer have been proposed and 
> considered (HBASE-20431, for one), and maybe that's the right way forward 
> long-term, but it certainly seems to be a hard problem and hasn't been done 
> yet. But I don't know enough of all the internal considerations to make much 
> of a judgment on that myself.
> I propose a FileSystem implementation that wraps another FileSystem instance 
> and provides locking of FileSystem operations to ensure correct semantics. 
> Locking could quite possibly be done on the same ZooKeeper ensemble as an 
> HBase cluster already uses (I'm sure there are some performance 
> considerations here that deserve more attention). I've put together a 
> proof-of-concept on which I've tested some aspects of atomic renames and 
> atomic file creates. Both of these tests fail reliably on a naked s3a 
> instance. I've also done a small YCSB run against a small cluster to sanity 
> check other functionality and was successful. I will post the patch, and my 
> laundry list of things that still need work. The WAL is still placed on HDFS, 
> but the HBase root directory is otherwise on S3.
> Note that my prototype is built on Hadoop's source tree right now. That's 
> purely for my convenience in putting it together quickly, as that's where I 
> mostly work. I actually think long-term, if this is accepted as a good 
> solution, it makes sense to live in HBase (or its own repository). It only 
> depends on stable, public APIs in Hadoop and is targeted entirely at HBase's 
> needs, so it should be able to iterate on the HBase community's terms alone.
> Another idea [~ste...@apache.org] proposed to me is that of an inode-based 
> FileSystem that keeps hierarchical metadata in a more appropriate store that 
> would allow the required transactions (maybe a special table in HBase could 
> provide that store itself for other tables), and stores the underlying files 
> with unique identifiers on S3. This allows renames to actually become fast 
> instead of just large atomic operations. It does however place a strong 
> dependency on the metadata store. I have not explored this idea much. My 
> current proof-of-concept has been pleasantly simple, so I think it's the 
> right solution unless it proves unable to provide the required performance 
> characteristics.
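
One way to picture the wrap-and-lock idea in the description above, as a purely 
illustrative sketch (using Apache Curator for the ZooKeeper side; the class and 
lock layout here are hypothetical, not HBOSS's actual API):

{code}
// Illustrative only: wrap another FileSystem and hold a ZooKeeper lock
// around rename() so concurrent clients observe the object store's
// multi-object copy+delete as a single atomic step.
import java.io.IOException;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class LockingFileSystemSketch {
  private final FileSystem wrapped;   // e.g. an S3AFileSystem instance
  private final CuratorFramework zk;  // the ensemble the cluster already runs

  LockingFileSystemSketch(FileSystem wrapped, CuratorFramework zk) {
    this.wrapped = wrapped;
    this.zk = zk;
  }

  public boolean rename(Path src, Path dst) throws IOException {
    // A real implementation would need to lock both the source and
    // destination subtrees; locking only the source keeps this sketch short.
    InterProcessMutex lock =
        new InterProcessMutex(zk, "/hboss/locks" + src.toUri().getPath());
    try {
      lock.acquire();
      return wrapped.rename(src, dst);
    } catch (IOException e) {
      throw e;
    } catch (Exception e) {
      throw new IOException("ZooKeeper lock failure for " + src, e);
    } finally {
      releaseQuietly(lock);
    }
  }

  private static void releaseQuietly(InterProcessMutex lock) {
    try {
      lock.release();
    } catch (Exception ignored) {
      // best effort; ZooKeeper session expiry releases the lock anyway
    }
  }
}
{code}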





[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-30 Thread Sean Busbey (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830314#comment-16830314 ]

Sean Busbey commented on HBASE-22149:

FYI, I pushed an empty commit to the {{hbase-filesystem}} repo so you should be 
able to make a PR off of that.



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-30 Thread Sean Busbey (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830313#comment-16830313 ]

Sean Busbey commented on HBASE-22149:

bq. Not sure how we want to version this. It's 1.0.0-SNAPSHOT right now, but 
could arguably be lower, or synced up with the HBase version it's built 
against. Any thoughts?

1.0.0-alpha1-SNAPSHOT, please. This repo is probably going to be in alpha 
status for some time, and I don't want to make assumptions about what 
version(s) of HBase it can be used with until we have to.



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-30 Thread Sean Mackrory (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830291#comment-16830291 ]

Sean Mackrory commented on HBASE-22149:

(Overwrote the last attachment - cleaned up some local files that shouldn't 
have been included)



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-30 Thread Sean Busbey (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830241#comment-16830241 ]

Sean Busbey commented on HBASE-22149:

bq. I'm only extending that as a trivial way to add more coverage, so we could 
just skip it when a "hadoop2" profile is activated or something. If I could 
just figure out how you do that 

We can take care of this in a follow-on ticket.



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-29 Thread Sean Mackrory (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829760#comment-16829760 ]

Sean Mackrory commented on HBASE-22149:

{quote}I'll look into what's required to get this compiling on newer 2.x.{quote}

So it actually compiles fine against 2.9.2, and the only issues I see are in 
the tests themselves. I had to tweak EmbeddedS3 a little bit, as 2.9.2 uses an 
older listing API. The remaining issue is that AbstractContractDistCpTest in 
2.9.2 appears to assume it's running on HDFS. I'm only extending that as a 
trivial way to add more coverage, so we could just skip it when a "hadoop2" 
profile is activated or something... if I could just figure out how you do 
that :)
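
One rough way to do that, as a sketch (assuming the hypothetical "hadoop2" 
profile passes a {{hadoop.profile}} system property to surefire; none of these 
names are settled):

{code}
// Sketch only: skip a test class when running under a hypothetical "hadoop2"
// profile that sets -Dhadoop.profile=2 through surefire's
// systemPropertyVariables. JUnit treats a failed assumption as a skip,
// not a failure.
import org.junit.Assume;
import org.junit.Before;
import org.junit.Test;

public class Hadoop2SkipSketch {

  @Before
  public void skipOnHadoop2() {
    Assume.assumeFalse("DistCp contract tests assume HDFS semantics on 2.x",
        "2".equals(System.getProperty("hadoop.profile")));
  }

  @Test
  public void coveredOnlyOnHadoop3() {
    // test body runs only when the assumption above holds
  }
}
{code}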

Attached a patch from a separate repository with all the tests passing and all 
of the above fixes.



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-29 Thread Andrew Purtell (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829659#comment-16829659 ]

Andrew Purtell commented on HBASE-22149:


bq. I'll look into what's required to get this compiling on newer 2.x.

Yes, please, and thank you.



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-29 Thread Josh Elser (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829438#comment-16829438 ]

Josh Elser commented on HBASE-22149:


We now have a new repository: hbase-filesystem.git – 
[https://github.com/apache/hbase-filesystem]

Precommit is not going to be hooked up yet, but I think I can copy what Duo and 
team recently did for another repository.



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-29 Thread Sean Mackrory (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829322#comment-16829322 ]

Sean Mackrory commented on HBASE-22149:

{quote}Hadoop advertises S3Guard in 2.9.2. Is this not production capable 
there?{quote}

I had forgotten that, actually. I'll look into what's required to get this 
compiling on newer 2.x.



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-29 Thread Sean Mackrory (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829318#comment-16829318 ]

Sean Mackrory commented on HBASE-22149:
---

Integration testing of my latest patch yielded an issue where the wrapped s3a 
instance was getting returned by the FS cache. I switched to using a new 
Configuration instance for the config changes in FS.initialize(), but it still 
happened in some cases (after FS.get was used 6-7 times, which is surprising 
since that instance should be part of the FS cache key). Really, the internal 
s3a instance shouldn't be used externally at all, so I just disabled FS caching 
in that Configuration instance as well, which seems to have fixed the issue 
entirely. The new function is below. I'll incorporate it in my next patch, 
which will probably include all the changes necessary to be part of a separate 
repo...

Re-mapping fs.s3a.impl to HBOSS and setting fs.hboss.fs.s3a.impl is how I've 
been testing so far, because it requires no changes to HBase at all. If we want 
to disallow calling {{fs = new HBaseObjectStoreSemantics(); 
fs.initialize("s3a://...", conf);}} directly (which we probably do if this will 
remain a distinct repository), then I can simplify this function a bit further.

{code}
  public void initialize(URI name, Configuration conf) throws IOException {
    setConf(conf);

    // "fs.hboss.fs.<scheme>.impl" names the FileSystem implementation that
    // HBOSS should wrap for this scheme.
    String scheme = name.getScheme();
    String schemeImpl = "fs." + scheme + ".impl";
    String hbossSchemeImpl = "fs.hboss." + schemeImpl;
    String wrappedImpl = conf.get(hbossSchemeImpl);
    // Copy the caller's Configuration so our overrides don't leak back out.
    Configuration internalConf = new Configuration(conf);

    if (wrappedImpl != null) {
      LOG.info("HBOSS wrapping file-system {} using implementation {}", name,
          wrappedImpl);
      String disableCache = "fs." + scheme + ".impl.disable.cache";
      // Point the scheme at the wrapped implementation and disable the FS
      // cache so the naked wrapped instance is never handed to callers.
      internalConf.set(schemeImpl, wrappedImpl);
      internalConf.set(disableCache, "true");
    }

    fs = FileSystem.get(name, internalConf);
    sync = TreeLockManager.get(fs);
  }
{code}
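
For reference, the re-mapping described above amounts to roughly the following 
on the client side (a sketch; the HBOSS class name is an assumption, not 
final):

{code}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HbossWiringSketch {
  public static void main(String[] args) throws Exception {
    // Hand the s3a scheme to HBOSS, and tell HBOSS which implementation
    // to wrap underneath it.
    Configuration conf = new Configuration();
    conf.set("fs.s3a.impl",
        "org.apache.hadoop.hbase.oss.HBaseObjectStoreSemantics");
    conf.set("fs.hboss.fs.s3a.impl",
        "org.apache.hadoop.fs.s3a.S3AFileSystem");
    // Callers (HBase included) now get the wrapping FileSystem back.
    FileSystem fs = FileSystem.get(URI.create("s3a://bucket/hbase"), conf);
    System.out.println(fs.getClass().getName());
  }
}
{code}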


[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-26 Thread Andrew Purtell (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827334#comment-16827334 ]

Andrew Purtell commented on HBASE-22149:


I don't object to a separate repo, of course.



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-26 Thread Andrew Purtell (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827333#comment-16827333 ]

Andrew Purtell commented on HBASE-22149:


{quote}The whole thing really depends on Hadoop 3+ (in production, S3Guard is 
required and isn't in the Hadoop 2 releases, 
{quote}
Hadoop advertises S3Guard in 2.9.2. Is this not production-capable there?

Otherwise, I would really prefer testing this on 2.9.2 to any 3.x. That major 
upgrade is off-putting (never mind that Hadoop's motto in previous years should 
have been "every minor is a major" (smile)).



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-26 Thread Sean Busbey (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827008#comment-16827008 ]

Sean Busbey commented on HBASE-22149:

Sent a heads-up to dev@hbase with subject "[DISCUSS] lazy consensus on 
"hbase-filesystem" repository":

[https://s.apache.org/bNWq]



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-26 Thread Sean Busbey (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827003#comment-16827003 ]

Sean Busbey commented on HBASE-22149:
-

Given the combination of this a) needing Hadoop 3 only and b) being an 
experimental approach whose sustainability in production we're not yet sure of, 
I'd much prefer a different repository.

 

Is anyone opposed to landing this in a new repository, i.e. `hbase-filesystem`? 
Provided it includes instructions for installation / setup, we wouldn't even 
need to add the artifacts from that repository as a dependency for the main 
repo's binary artifacts.



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-26 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827000#comment-16827000
 ] 

Josh Elser commented on HBASE-22149:


{quote}I've had the idea of a separate code repository suggested to me by a 
couple of people, too. That solves this problem, too.
{quote}
This seems like the cleanest approach to me!



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-26 Thread Sean Mackrory (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826998#comment-16826998
 ] 

Sean Mackrory commented on HBASE-22149:
---

So for the Hadoop 3 issue: we can have this module built only when the 
hadoop-3.0 profile is activated, but that requires that the list of modules be 
duplicated, with the profile-specific copy of that list adding this module 
(rough sketch below). Not entirely clean, but the alternatives are worse. I've 
had the idea of a separate code repository suggested to me by a couple of 
people, too. That solves this problem, too. Any thoughts? Is there any 
precedent for a Hadoop-3-only module that has a better solution?



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-25 Thread Sean Mackrory (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826523#comment-16826523
 ] 

Sean Mackrory commented on HBASE-22149:
---

(Swapped out patch #5 - there was a variable name change that I hadn't applied 
everywhere, and I didn't notice it until I did a clean build.)



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-25 Thread Sean Mackrory (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826393#comment-16826393
 ] 

Sean Mackrory commented on HBASE-22149:
---

Added some contract tests I had previously missed that get a lot more coverage, 
and fixed all the issues I was having in the tests (it was just tests stepping 
on each other's paths because they weren't all in separate directories as 
they're supposed to be). I'm now normalizing all paths and sorting arrays in 
one central place. When I normalize paths for locking, I'm using 
/scheme/hostname/path to ensure that using this for multiple filesystems is 
safe (a sketch of that normalization follows below).
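
As an illustration, a minimal sketch of that normalization, assuming made-up 
class and method names (this is not the patch's actual code):

{code:java}
// Hedged sketch: reduce any Path to a stable lock key of the form
// /scheme/hostname/path, so locks taken through different FileSystem
// instances (e.g. two different s3a buckets) can never collide.
import java.net.URI;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LockKeys {
  public static String toLockKey(FileSystem fs, Path path) {
    // Qualify relative paths against the filesystem's URI and working dir.
    URI uri = path.makeQualified(fs.getUri(), fs.getWorkingDirectory()).toUri();
    String host = uri.getHost() == null ? "" : uri.getHost();
    return "/" + uri.getScheme() + "/" + host + uri.getPath();
  }
}
{code}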

Some minor to do's left, but they definitely don't impact my test cases or the 
HBase workloads that have run on this so far:

- mkdirs has implications for any parent directories that don't exist yet, 
although it will only lock the target path. I can't think of a scenario where 
this would cause a problem, though.
- The local lock implementation isn't re-entrant if you read-lock a path and 
then try to read-lock a parent in the same thread. I don't think anyone would 
use it in production, and the ZK implementation is the default even for the 
unit tests. This implementation is really only still there in case it helps 
with debugging other logic.
- The whole thing really depends on Hadoop 3+ (in production, S3Guard is 
required and isn't in the Hadoop 2 releases, and even just for testing there 
are a lot of changes required to get it to compile). I'm wondering if there's 
an easy way to include this module only with the Hadoop 3 profile. I haven't 
seen one, so... hints welcome :)

Other than that: what else would the community like to see before this is 
committed (albeit perhaps with a big "experimental" label until it has gone 
through more scale and integration testing)?


[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-22 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823277#comment-16823277
 ] 

Andrew Purtell commented on HBASE-22149:


{quote}there were a few problems with large numbers of connections to S3, etc.
{quote}
I filed a Jira for this problem, unless I misunderstand what you are seeing. 
(The suggestion to have more, smaller regions is confusing.) The essential 
idea: instead of keeping a file handle open for every file - which has 
sublinear resource usage on HDFS but a linear cost (connections, file handles) 
with S3A - we maintain an open file cache with an LRU eviction policy, and 
some way to exclude bulk scans from caching, similar to what we do in the 
blockcache today (a sketch follows at the end of this comment).

As you suggest, an operator could mitigate this by keeping the total number of 
store files as low as possible, via tunables like split policy and file size 
thresholds. I think that also means keeping the number of regions as small as 
possible, since the goal is minimizing the total store file count. Here's 
where we may not be on the same page.

An S3-backed filesystem is going to behave very differently from HDFS, and a 
small number of regions with really huge store files might be handled just 
fine. Then I suspect we would see stress on hfile index and blockcache 
efficiency and have to explore more tunables there, or maybe small code 
changes.
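
A minimal sketch of that open file cache idea, assuming a made-up class name 
and no HBase integration (illustrative only, not a proposed patch):

{code:java}
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileCache {
  private final FileSystem fs;
  private final Map<Path, FSDataInputStream> cache;

  public OpenFileCache(FileSystem fs, int maxOpen) {
    this.fs = fs;
    // Access-ordered map: iteration order is least-recently-used first.
    this.cache = new LinkedHashMap<Path, FSDataInputStream>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Path, FSDataInputStream> e) {
        if (size() > maxOpen) {
          try {
            e.getValue().close();   // release the S3 connection / handle
          } catch (IOException ignored) {
          }
          return true;              // evict the LRU entry
        }
        return false;
      }
    };
  }

  // Bulk scans would bypass this entirely, mirroring blockcache exclusion.
  public synchronized FSDataInputStream open(Path path) throws IOException {
    FSDataInputStream in = cache.get(path);
    if (in == null) {
      in = fs.open(path);
      cache.put(path, in);
    }
    return in;
  }
}
{code}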


[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-22 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823273#comment-16823273
 ] 

Andrew Purtell commented on HBASE-22149:


{quote}This will harm r/w latency distribution tail for sure. Copying (moving) 
GBs of data in S3 with 3x replication - that is 10s of seconds.
{quote}
This file system will not be triple replicating data like HDFS would, right? 
Whatever the S3 service might do internally to provide durability and 
availability will be a black box to us, but we won't make the problem worse by 
also replicating data at a layer we control, right?

The basic point is sound.

Atomic rename is expected to be fast. We are renaming under lock in the flush, 
compaction, split, or merge code paths (I assume so; I haven't looked at the 
code). With HBOSS those locks are going to be held for a time proportional to 
the amount of data being copied behind the scenes, instead of the O(1) latency 
of one HDFS namenode RPC.

This is why I proposed a new 'transaction' framework at the HBase level 
instead of a lower-level solution like HBOSS, so we could engineer for this 
and provide substrate-specific strategies, like taking advantage of atomic 
rename where it exists, or using PUT-COPY if we're on S3, or... but we are not 
considering that type of solution here.

As Sean points out, these operations are not normally on the critical read and 
write paths. However, if flushes take a very long time to complete, the 
memstore's buffering capacity will be exceeded and writes will stall. Flush 
upon closing is going to be a danger zone: it may take a very long time for a 
region to close, and this will impact both reads and writes for the duration, 
so you probably want to rethink all region management concerns like balancer 
policy, split policy, etc. I think the HBOSS design rationale says these 
things are the acceptable outcome of the tradeoff being made. Another approach 
could solve it without these drawbacks (a sketch of the lock-hold concern is 
below).
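
To make the latency concern concrete, a hedged sketch of rename-under-lock in 
the HBOSS style; TreeLockManager here is a stand-in interface I made up for 
illustration, not necessarily the real one:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LockingRename {

  /** Stand-in for whatever coordinates locks (e.g. over ZooKeeper). */
  public interface TreeLockManager {
    void writeLock(Path src, Path dst) throws IOException;
    void unlock(Path src, Path dst);
  }

  public static boolean rename(FileSystem fs, TreeLockManager locks,
      Path src, Path dst) throws IOException {
    locks.writeLock(src, dst);      // block readers/writers of both subtrees
    try {
      return fs.rename(src, dst);   // one namenode RPC on HDFS; a server-side
                                    // copy of every object under src on S3A
    } finally {
      locks.unlock(src, dst);       // on S3A this is held while GBs move
    }
  }
}
{code}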


[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-22 Thread Sean Mackrory (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823264#comment-16823264
 ] 

Sean Mackrory commented on HBASE-22149:
---

{quote}I am concerned about the write scalability of the ZooKeeper service when 
servicing a cluster with a very large number of regions{quote}

There's a balance we need to keep in mind there - Wellington's been doing some 
integration testing with me, and there were a few problems with large numbers 
of connections to S3, etc. that could be avoided by having a larger number of 
smaller regions. Probably best to explore other solutions to those problems, 
then.



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-22 Thread Sean Mackrory (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823251#comment-16823251
 ] 

Sean Mackrory commented on HBASE-22149:
---

Just attached another patch. The really big thing in this patch is that I've 
added my own implementation of the AmazonS3 interface for testing. It turns 
out it's actually fairly straightforward to get a local HashMap of Strings to 
support all of the FS contract tests. The only remotely tricky part is the 
substring logic to make all the backwards seeks, etc. work (sketched below). 
Major upsides here:
* All of the contract tests that will work on S3 also work against my mock. 
I'm not having to skip a bunch in the default case anymore.
* The 2 main tests I originally added to reproduce the semantic problems also 
fail when you pass -Pnull (i.e. to disable all locking).
* The third-party S3 mocks implicitly pulled in a bunch of J2EE stuff and 
conflicting versions of the AWS SDK, which caused a number of issues.
* Much more debuggable - no more going over the network - it's all in a small, 
in-memory HashMap.

I still see a couple of contract test failures when running against the real 
S3 service that I need to get to the bottom of (although my gut feeling right 
now is that it's all issues with my test code & setup and not the core locking 
logic). Beyond that and Fabbri's suggestion to qualify all paths, I'm starting 
to get to the bottom of my to-do list.
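
A hedged sketch of the idea, with made-up names and only a representative 
slice of the surface area (the real mock implements the actual AWS SDK 
AmazonS3 interface):

{code:java}
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class InMemoryS3 {
  // "bucket/key" -> object contents; the entire "object store".
  private final Map<String, String> objects = new HashMap<>();

  public void putObject(String bucket, String key, String content) {
    objects.put(bucket + "/" + key, content);
  }

  // Ranged GET: the substring logic that makes seeks (including
  // backwards seeks) work against the mock.
  public InputStream getObject(String bucket, String key, long start, long end) {
    String data = objects.get(bucket + "/" + key);
    int from = (int) start;
    int to = (int) Math.min(end + 1, data.length());
    return new ByteArrayInputStream(
        data.substring(from, to).getBytes(StandardCharsets.UTF_8));
  }
}
{code}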


[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-22 Thread Sean Mackrory (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823236#comment-16823236
 ] 

Sean Mackrory commented on HBASE-22149:
---

{quote}This will harm r/w latency distribution tail for sure. Copying (moving) 
GBs of data in S3 with 3x replication - that is 10s of seconds. {quote}

Will it, actually? My understanding was that reads and writes are served 
in-memory, with writes being persisted to the WAL (and to be clear, if I 
didn't say this already, this FS isn't intended to support the WAL yet). Is a 
compaction on the critical path for reads and writes?



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-14 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16817543#comment-16817543
 ] 

Steve Loughran commented on HBASE-22149:


bq. LimitedPrivate({"HBase"}),

Prefer @VisibleForTesting. I really hate the LimitedPrivate stuff (an 
illustration of the two follows).
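
For anyone unfamiliar with the two, a hedged illustration (the class and 
method are made up) of the annotations being contrasted:

{code:java}
import com.google.common.annotations.VisibleForTesting;
import org.apache.hadoop.classification.InterfaceAudience;

// Audience-scoped: "public to HBase, private to everyone else".
@InterfaceAudience.LimitedPrivate({"HBase"})
public class LockManagerInternals {

  // Guava marker: visibility is widened only so tests can call this.
  @VisibleForTesting
  static String stripTrailingSlash(String path) {
    return path.endsWith("/") ? path.substring(0, path.length() - 1) : path;
  }
}
{code}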



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-11 Thread Vladimir Rodionov (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815804#comment-16815804
 ] 

Vladimir Rodionov commented on HBASE-22149:
---

{quote}
I'd love to explore it but I think it's worth seeing if the ~minute renames are 
really a problem first.
{quote}

This will harm r/w latency distribution tail for sure. Copying (moving) GBs of 
data in S3 with 3x replication - that is 10s of seconds. 




[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-11 Thread Sean Mackrory (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815643#comment-16815643
 ] 

Sean Mackrory commented on HBASE-22149:
---

{quote}I think that the patch should support meta-data-only rename operation. 
{quote}

I mention an alternate design above in which all metadata is stored in a 
separate repository and maps to GUID-named files on S3. It could give atomic / 
metadata-only renames on S3, but it makes the underlying storage incompatible 
with direct access and makes the metadata repository a critical point of 
failure: lose your ZK ensemble, and your HFiles are effectively unreadable 
until someone can piece them back into tables manually. It has a lot of 
benefits, but it's a much more severe approach. I'd love to explore it but I 
think it's worth seeing if the ~minute renames are really a problem first (a 
sketch of the idea follows).
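
A hedged sketch of that alternate design, using made-up names and an in-memory 
map as a stand-in for the real metadata repository (e.g. a ZK ensemble or an 
HBase table):

{code:java}
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class InodeCatalog {
  // Stand-in for the real store: logical path -> GUID-named S3 object key.
  private final ConcurrentMap<String, String> pathToGuid =
      new ConcurrentHashMap<>();

  public String create(String logicalPath) {
    String guid = "data/" + UUID.randomUUID();  // the S3 key actually written
    pathToGuid.put(logicalPath, guid);
    return guid;
  }

  // Atomic, metadata-only rename: the GBs of object data never move.
  public synchronized boolean rename(String src, String dst) {
    String guid = pathToGuid.remove(src);
    if (guid == null) {
      return false;
    }
    pathToGuid.put(dst, guid);
    return true;
  }
}
{code}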



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-11 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815624#comment-16815624
 ] 

Andrew Purtell commented on HBASE-22149:


{quote}Locking the destination directory against read/write operations for the 
duration of an object store (S3) rename operation does not look like a good 
idea. S3 physically moves data during this operation and it can take time 
(sometimes minutes, though very rarely). I think that the patch should support 
a meta-data-only rename operation.
{quote}
Sure, there may be other alternatives, but the motivation of HBOSS seems 
pretty clear: to make HBase work on S3, via locking, even though no atomic 
metadata-only renames are available there. On the grounds that HBOSS is an 
attempt to make this work with this approach (locking in the Hadoop FS layer), 
it seems perfectly reasonable to wait as long as necessary.



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-11 Thread Vladimir Rodionov (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815616#comment-16815616
 ] 

Vladimir Rodionov commented on HBASE-22149:
---

{code}
  public boolean rename(Path src, Path dst) throws IOException {
    try (AutoLock l = sync.lockRename(src, dst)) {
      return fs.rename(src, dst);
    }
  }
{code}

Locking the destination directory against read/write operations for the 
duration of an object store (S3) rename operation does not look like a good 
idea. S3 physically moves data during this operation and it can take time 
(sometimes minutes, though very rarely).

I think that the patch should support a meta-data-only rename operation. 



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-11 Thread Sean Mackrory (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815461#comment-16815461
 ] 

Sean Mackrory commented on HBASE-22149:
---

{quote}fs.qualify(path) {quote}

Yeah I probably do need that, although it hasn't come up yet in tests.

{quote}What about multi-bucket support?{quote}

Added that yesterday, actually - in my next patch the ZK client will be 
'jailed' inside a znode named after the hostname in the URI. That's not quite 
right for WASB and ABFS, but they don't need this anyway. It's right for S3 
and GCS. Others may come along that require it to be rethought, but that's 
good enough for now, and I'd like to avoid putting any FS-specific logic 
inside this for as long as I can.
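
A minimal sketch of what that 'jailing' could look like (class, method, and 
path names are hypothetical, not from the patch): derive the lock namespace 
from the URI's authority so two buckets never share lock znodes.

{code}
import java.net.URI;

// Illustrative only: map a FileSystem URI to a per-bucket ZK lock root,
// e.g. s3a://my-bucket/hbase -> /hboss/my-bucket
public final class LockRoots {
  private LockRoots() {}

  public static String lockRootFor(URI fsUri) {
    String authority = fsUri.getAuthority(); // the bucket name for s3a / GCS URIs
    if (authority == null || authority.isEmpty()) {
      throw new IllegalArgumentException("URI has no authority: " + fsUri);
    }
    return "/hboss/" + authority;
  }

  public static void main(String[] args) {
    System.out.println(lockRootFor(URI.create("s3a://my-bucket/hbase")));
  }
}
{code}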

{quote}S3Mock sounds interesting{quote}

Yes, I wondered if it was a faithful enough recreation for the full battery of 
s3a tests. One side note: even though I got S3Mock working, I did have to rely 
on APIs designated as Private (specifically the S3ClientFactory stuff). So we 
need to have a discussion about whether those APIs might be stable enough to 
promote to LimitedPrivate({"HBase"}), or perhaps about another API wherein I 
simply hand the FS a ready-to-go S3 client, instead of pointing it at a 
"Factory" class that will return the client I already made (which is what I 
have to do now).

{quote}For lockListing(), why is a shared lock on the path being listed not 
sufficient?{quote}

Because you want it to have exclusive access to all the children (and in some 
cases all children recursively) when there may be renames going on inside that 
path. Other than this particular case, write locks don't have to block when 
there are read locks above them in the path. For a non-recursive listing, a 
read lock on all children of the path you're referencing would be sufficient, 
but how do you correctly enumerate the children without first holding the 
lock? You end up back where you started. An exclusive lock on the parent for 
listing is a little more aggressive than needed, but it's simple and safe. 
I've tried to err on that side of things, since we can't seem to enumerate all 
the FS assumptions of HBase. If integration / performance testing finds a 
particular point of contention, that's a targeted area we can investigate to 
determine if relaxing the constraints is safe.
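
As a sketch of that rule, following the pattern of the rename snippet quoted 
elsewhere in this thread (and assuming a lockListing call that returns an 
exclusive AutoLock the same way lockRename does):

{code}
  // Sketch only: exclusive lock on the directory being listed, since its
  // children cannot be safely enumerated before any lock is held.
  public FileStatus[] listStatus(Path path) throws IOException {
    try (AutoLock l = sync.lockListing(path)) {
      return fs.listStatus(path);
    }
  }
{code}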

{quote}Deadlock detection and debuggability{quote}
{quote}I don't have much experience with Curator but have heard of it.{quote}

Yeah, this will definitely warrant some work, along with work on the 
operational concerns when problems arise. I've been finding Curator is not as 
fool-proof as I had hoped, and my next patch actually eliminates the use of 
curator-framework (in favor of the lower-level curator-client) for everything 
but the actual locking / unlocking. The APIs for creating and deleting znodes 
have actually been very hard to debug.
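
For reference, the curator-recipes read/write lock that the locking / 
unlocking path keeps using looks roughly like this (the connection string and 
lock path are placeholders):

{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessReadWriteLock;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Illustrative only: take and release an exclusive lock on one znode path.
public class CuratorLockExample {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();
    InterProcessReadWriteLock lock =
        new InterProcessReadWriteLock(client, "/hboss/my-bucket/some/path");
    lock.writeLock().acquire();
    try {
      // ... perform the guarded FileSystem operation here ...
    } finally {
      lock.writeLock().release();
    }
    client.close();
  }
}
{code}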


[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-10 Thread Aaron Fabbri (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815031#comment-16815031
 ] 

Aaron Fabbri commented on HBASE-22149:
--

Thanks for your work on this interesting patch [~mackrorysd]. I'm in the 
process of reading it. A couple of initial thoughts, not really looking at 
overall correctness yet.
 * On path normalization. Do you need a fs.qualify(path) before calling into 
your lock manager, to ensure you are always looking at an absolute path (see 
the sketch after this list)? What about multi-bucket support? You may need to 
preserve the authority/host once you look at supporting that. S3Guard had 
these issues as it also used the path as a lookup key.
 * S3Mock sounds interesting. It would be nice to be able to work on S3A some 
without paying for AWS usage (cost has been limiting my involvement). 
 * For lockListing(), why is a shared lock on the path being listed not 
sufficient?
 * Deadlock detection and debuggability. You might want the concepts of 
waiters / owners and wait-for graphs at some point to be able to avoid 
deadlock, assuming you keep going down this route and we cannot convince 
ourselves that applications (HBase) will not hold and wait. Probably a bit 
early to go this deep, though.
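
On the first point, qualification is a one-liner against the standard Hadoop 
FileSystem API, so the lock manager always sees an absolute path with scheme 
and authority:

{code}
// Resolve a possibly-relative Path against the wrapped FileSystem's URI and
// working directory before using it as a lock key.
Path qualified = fs.makeQualified(path);
{code}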

I don't have much experience with Curator but have heard of it.


[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-09 Thread Sean Mackrory (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813967#comment-16813967
 ] 

Sean Mackrory commented on HBASE-22149:
---

HBASE-22149-hbase-3.patch has big improvements for tests:
* Now runs against S3Mock by default. 1 new test failure and 2 new test errors 
when running in local mode.
* Now runs against HBase's MiniZKCluster by default instead of Curator's 
TestServer. I couldn't figure out why the latter wasn't working with the newer 
versions of Curator.
* Now resets the fs..impl setting back to its former value, as suggested by 
[~wchevreuil] (a sketch of that save/restore pattern follows below).
I still have my work cut out for me to get all of the tests to pass and to 
address a few remaining TODOs in the code, but now I can do a full test run in 
a couple of minutes.
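
A minimal sketch of that save/restore pattern (the class is hypothetical, and 
the exact property key is a placeholder since it was mangled above):

{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative only: remember a Configuration value before a test overrides
// it, and restore it afterwards so later tests see the original setting.
public class ConfValueGuard {
  private final Configuration conf;
  private final String key;      // placeholder for the fs impl key under test
  private final String oldValue;

  public ConfValueGuard(Configuration conf, String key, String newValue) {
    this.conf = conf;
    this.key = key;
    this.oldValue = conf.get(key);
    conf.set(key, newValue);
  }

  public void restore() {
    if (oldValue == null) {
      conf.unset(key);
    } else {
      conf.set(key, oldValue);
    }
  }
}
{code}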



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-09 Thread Sean Mackrory (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813439#comment-16813439
 ] 

Sean Mackrory commented on HBASE-22149:
---

Posting another patch in which I've added FS contract tests to get much 
greater test coverage, and fixed a bunch of issues. I've also added TODOs in 
most of the places where I know a tweak is still required. I'm sure this is 
not the conventional patch name format for HBase, but that's because I'm not 
actually proposing this for inclusion just yet - it's very much still a 
work-in-progress.

Current test failures:
{code}
ZK (RootContract test times out):

  [ERROR] Failures: 
  [ERROR]   
TestHBOSSContractGetFileStatus>AbstractContractGetFileStatusTest.testListLocatedStatusEmptyDirectory:129->Assert.assertEquals:645->Assert.failNotEquals:834->Assert.fail:88
 listLocatedStatus(test dir): file count in 1 directory and 1 file expected:<0> 
but was:<1>
  [ERROR]   
TestHBOSSContractGetFileStatus>AbstractContractGetFileStatusTest.testListStatusFiltering:463->AbstractContractGetFileStatusTest.verifyListStatus:534->Assert.assertEquals:645->Assert.failNotEquals:834->Assert.fail:88
 length of listStatus(s3a://mackrory/user/sean/hboss-junit-test/contract-tests, 
org.apache.hadoop.fs.contract.AbstractContractGetFileStatusTest$AllPathsFilter@452f7a60
 ) expected:<2> but was:<3>
  [ERROR] Errors: 
  [ERROR]   
TestHBOSSContractMkdir>AbstractContractMkdirTest.testMkdirOverParentFile:108->AbstractFSContractTestBase.assertDeleted:349
 » IO
  [ERROR]   
TestHBOSSContractRename>AbstractContractRenameTest.testRenameDirIntoExistingDir:155->AbstractFSContractTestBase.assertIsDirectory:327
 » FileNotFound
  [INFO] 
  [ERROR] Tests run: 91, Failures: 2, Errors: 2, Skipped: 13

Local:

  [ERROR] Failures: 
  [ERROR]   TestAtomicRename.testAtomicRename:77 Rename source is still visible 
after rename finished or target showed up.
  [ERROR] Errors: 
  [ERROR]   
TestHBOSSContractRename>AbstractContractRenameTest.testRenameDirIntoExistingDir:155->AbstractFSContractTestBase.assertIsDirectory:327
 » FileNotFound
  [INFO] 
  [ERROR] Tests run: 91, Failures: 1, Errors: 2, Skipped: 13
{code}

{quote}Seems TreeLockManager.lockListings(Path[] paths) can get deadlocked when 
passing list of hierarchical paths{quote}

Good catch, and in fact at least one of the new contract tests fails for that 
reason. One solution is to reconcile the list before locking, eliminating any 
path whose ancestor is included elsewhere in the list (sketched below). 
Another solution is to make the tree as a whole reentrant instead of just the 
individual locks: if writeLockAbove() finds a lock already held by the current 
thread, we shouldn't block. The local implementation contains the latter 
solution right now; I still need to supplement the ZK implementation or 
filter the list of locked directories.
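
A sketch of the reconciliation approach (class and method names hypothetical): 
sort by depth, then keep only paths with no ancestor already kept, so the 
surviving locks can never contend with locks on their own children.

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.fs.Path;

public final class PathReconciler {
  private PathReconciler() {}

  // Illustrative only: drop any path whose ancestor is also in the list.
  public static List<Path> pruneDescendants(Path[] paths) {
    Path[] sorted = paths.clone();
    // Sorting by depth guarantees ancestors are considered before descendants.
    Arrays.sort(sorted, Comparator.comparingInt(Path::depth));
    List<Path> result = new ArrayList<>();
    for (Path candidate : sorted) {
      boolean covered = false;
      for (Path kept : result) {
        if (isAncestor(kept, candidate)) {
          covered = true;
          break;
        }
      }
      if (!covered) {
        result.add(candidate);
      }
    }
    return result;
  }

  private static boolean isAncestor(Path ancestor, Path child) {
    for (Path p = child; p != null; p = p.getParent()) {
      if (p.equals(ancestor)) {
        return true;
      }
    }
    return false;
  }
}
{code}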

{quote}would it be enough to have mock classes using local FS emulating S3 
behaviour here{quote}

Testing against the local FS is a better thing to fall back on, if no S3 
credentials are configured, than simply skipping the tests entirely. It would 
test that we don't deadlock and that there are no *major* functional issues, 
but it wouldn't reproduce any of the problems we're trying to fix. Using the 
local FS to emulate S3 would be its own undertaking, and honestly, if none of 
the S3 mocks currently out there are faithful enough, I don't even want to try 
doing it myself. I did have a breakthrough getting Adobe's S3Mock to work 
earlier this week. I put it on hold when I saw the atomic rename test was 
failing, but something's up with that test even against Amazon S3, so I need 
to dig it back out (and of course get to the bottom of why the test started 
failing when I moved to the HBase code base). Also note that the embedded 
ZooKeeper seems to be having problems with HBase's newer version of Curator, 
so for now I'm configuring my tests to point to a local ZooKeeper instance in 
auth-keys.xml.


[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-09 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813240#comment-16813240
 ] 

Wellington Chevreuil commented on HBASE-22149:
--

I had done some tests on this: I could add it as a module of the hbase main 
project, and I built and ran unit tests using the local FS. Some adjustments 
were required on the UTs; I changed them to define dirs as part of the test 
working directory, as referring to root "/" with the local FS would give 
permission errors. I understand, however, that this should not be shipped as 
part of the hbase main project. Should there be a separate project/repo for 
this, such as hbase-operator-tools (of course, only if the proposal is 
accepted)? Any thoughts [~apurtell] [~vrodionov] [~zyork]?

[~mackrorysd] For tests, would it be enough to have mock classes using local 
FS emulating S3 behaviour here, so that we don't require any dependency on S3 
itself in the hbase project? 



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-05 Thread Zach York (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811301#comment-16811301
 ] 

Zach York commented on HBASE-22149:
---

[~apurtell] Makes sense, I agree on the ideal long term solutions. I'll review 
this with those caveats in mind then.



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-05 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811291#comment-16811291
 ] 

Andrew Purtell commented on HBASE-22149:


[~zyork] My philosophy for changes like this, and I think it applies here, is 
that this type of solution aims to make no code changes to HBase. It is not 
intended to be the ideal long-term solution. The ideal long-term solution 
includes managing metadata natively within HBase somehow, sufficient to 
provide atomic semantics regardless of substrate. Past proposals include the 
"filesystem v2" work.



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-05 Thread Zach York (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811283#comment-16811283
 ] 

Zach York commented on HBASE-22149:
---

[~mackrorysd] Cool ideas. I'll try to take a look soon!

bq. Indeed, S3Guard is required as well. S3Guard is entirely pluggable (like 
this, we have Null and Local implementations in addition to the Dynamo one), so 
a ZooKeeper implementation is quite feasible as well. I actually suggested as 
much in the early days of S3Guard since a ZooKeeper ensemble is a de-facto 
requirement for Hadoop already, but nothing happened for performance reasons. 
If you think the slower metadata lookups wouldn't be a problem for HBase, 
that's worth looking into.

Why not a system table in HBase itself? If we're trying to not depend too much 
on external solutions, that might be best.



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-04 Thread Vladimir Rodionov (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810403#comment-16810403
 ] 

Vladimir Rodionov commented on HBASE-22149:
---

[~mackrorysd], you can take a look at [www.min.io|https://min.io] for local S3 
testing.



[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-03 Thread Sean Mackrory (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809366#comment-16809366
 ] 

Sean Mackrory commented on HBASE-22149:
---

HBASE-22149-hbase.patch is my proof-of-concept ported into the HBase code 
base. I've addressed all of Vladimir's feedback so far, but not Wellington's. 
I did get the tests running, although they still require you to add an S3 URI 
and S3 credentials to src/test/resources/auth-keys.xml.

I tried several candidates for mocking S3 today. adobe/S3Mock requires 
overriding the actual S3 client used by s3a, which is not a publicly exposed 
interface right now, though it could be exposed. I also hit what appears to be 
some conflicting HTTP library versions. findify/s3mock requires a Scala 
dependency (which is banned - not sure if we can work around that, since it's 
only required in the test scope), but more seriously it doesn't support 
FS-style S3 keys. It documents that it won't work with the local filesystem 
backend, but I had problems with the in-memory backend as well, any time 
directories were involved. S3Proxy is my current favorite, and I'm going to 
work on it some more tomorrow. It fails when you use headers it doesn't 
support, but I want to see if we can work around that by disabling unnecessary 
features in S3A or by modifying S3Proxy to proceed in the presence of unknown 
headers and just ignore them.


[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-03 Thread Sean Mackrory (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809133#comment-16809133 ]

Sean Mackrory commented on HBASE-22149:
---

{quote}Also shouldn't be the case for hbase, given every path under the rename 
would already be owned by hbase user.{quote}
Yeah - I think for v1 at least we can just document a need to have consistent 
access permissions. Of course since the ACLs are external it's a little less 
trivial for us to validate and fix, so I'll keep it on my wish-list to have 
more robust renames...

{quote}maybe just mock S3 FS implementation{quote}
Yeah, I'm actually playing around with switching the tests over to that. When 
porting this module into the HBase codebase I had some other issues with 
tests. I'll post another patch with all the feedback so far once I've gotten 
further with them.

Your other 2 points are good ones and I'll incorporate them.

[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-03 Thread Sean Mackrory (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809131#comment-16809131 ]

Sean Mackrory commented on HBASE-22149:
---

{quote}S3Guard must still be enabled{quote}
{quote}Why not mirror instead to a hierarchy of znodes?{quote}

Indeed, S3Guard is required as well. S3Guard is entirely pluggable (much like 
this module: we have Null and Local implementations in addition to the Dynamo 
one), so a ZooKeeper implementation is quite feasible too. I actually 
suggested as much in the early days of S3Guard, since a ZooKeeper ensemble is 
a de-facto requirement for Hadoop already, but it went nowhere because of 
performance concerns. If you think the slower metadata lookups wouldn't be a 
problem for HBase, that's worth looking into.

[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-03 Thread Wellington Chevreuil (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809016#comment-16809016 ]

Wellington Chevreuil commented on HBASE-22149:
--

Some thoughts:
 1)
{quote}but there's also a contrived case I'm aware of where you can set up ACLs 
such that the permissions are restricted on a subset of the rename
{quote}
Also shouldn't be the case for hbase, given every path under the rename would 
already be owned by hbase user.

2)
Seems *TreeLockManager.lockListings(Path[] paths)* can get deadlocked when 
passed a list of hierarchical paths, while trying to *treeWriteLock* a child 
node of a previous node in the paths array. It doesn't seem to be a call hbase 
would do anyway, but maybe we only need to lock parent paths when listing.

3)
 Noticed there's no actual dependency on s3/s3guard, so maybe just mocking the 
S3 FS implementation behaviour should suffice.

4) On HBaseObjectStoreSemantics.initialize(), maybe worth resetting 
"fs.SCHEMA.impl" back to HBaseObjectStoreSemantics before returning; otherwise 
subsequent clients trying to acquire the FS implementation with the same 
Configuration instance may not get HBaseObjectStoreSemantics (a sketch of that 
reset follows the snippet below).
{noformat}
  public void initialize(URI name, Configuration conf) throws IOException {
String wrappedImpl = conf.get("fs.hboss.fs." + name.getScheme() + ".impl");
if (wrappedImpl != null) {
  conf.set("fs." + name.getScheme() + ".impl", wrappedImpl);
}
fs = FileSystem.get(name, conf);
sync = TreeLockManager.get(name, conf);
  }
{noformat}
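For what it's worth, a minimal sketch of that reset (my illustration, not the 
patch): remember the caller's mapping and restore it once the wrapped FS has 
been resolved:
{code}
public void initialize(URI name, Configuration conf) throws IOException {
  String implKey = "fs." + name.getScheme() + ".impl";
  String wrappedImpl = conf.get("fs.hboss.fs." + name.getScheme() + ".impl");
  String previousImpl = conf.get(implKey);  // the caller's original mapping
  if (wrappedImpl != null) {
    conf.set(implKey, wrappedImpl);
  }
  try {
    fs = FileSystem.get(name, conf);
    sync = TreeLockManager.get(name, conf);
  } finally {
    if (wrappedImpl != null && previousImpl != null) {
      // Hand the Configuration back unchanged so later FileSystem.get()
      // calls with it still resolve to HBaseObjectStoreSemantics.
      conf.set(implKey, previousImpl);
    }
  }
}
{code}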

[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-03 Thread Andrew Purtell (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808974#comment-16808974 ]

Andrew Purtell commented on HBASE-22149:


Some thoughts.

S3Guard must still be enabled. While namespace consistency is orthogonal to 
locking, ideally we can also eliminate the dependency on S3Guard, because it 
impacts the cost-to-serve of the AWS-hosted HBase service. Maybe this is 
something that could be addressed here too. S3Guard mirrors the S3 namespace 
in a Dynamo table. Why not mirror instead to a hierarchy of znodes?

The scalability of the path-based locking approach given HBase's access 
patterns is probably ok. We write relatively rarely, and when we do the writes 
for each region go into their own separate directories. We can expect locks on 
the "directory path" for each region directory. Locks for writes in one region 
are independent of all other locks for all other regions. However, resources 
required for locks and activity related to locking will grow linearly with 
respect to the number of regions. I am concerned about the write scalability of 
the ZooKeeper service when servicing a cluster with a very large number of 
regions. Would be curious to see the results of an experiment to assess this.

[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-03 Thread Andrew Purtell (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808928#comment-16808928 ]

Andrew Purtell commented on HBASE-22149:


{quote}So the rename will not be atomic in the sense that it's an O(1) 
metadata-only operation, but it will be atomic in the sense that it will have 
either happened or not happened, and no other calls to the FileSystem will see 
partial results because of the locking. 
{quote}
Makes sense.

I think this could be an excellent short-to-medium term option.

As you mention in the top post, other proposals, like HBASE-20431, could also 
optimize data movement so renames become O(1) in time cost instead of O(N), 
where N is the number of bytes to be moved. But that would also be a 
significantly more complex undertaking, requiring the commitment of 
substantial development resources... which is why it has not yet been 
attempted.

[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-03 Thread Sean Mackrory (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808753#comment-16808753 ]

Sean Mackrory commented on HBASE-22149:
---

So the rename will not be atomic in the sense that it's an O(1) metadata-only 
operation, but it will be atomic in the sense that it will have either happened 
or not happened, and no other calls to the FileSystem will see partial results 
because of the locking. This guarantee only holds true if everything accessing 
the storage is using this FileSystem implementation, but as far as I know that 
shouldn't be an issue for HBase use cases.
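Concretely, the idea is that the whole copy-and-delete happens behind a lock 
covering both subtrees - a sketch assuming AutoLock is AutoCloseable and some 
lockRename-style helper that locks source and destination (both of those 
names are my illustration):
{code}
// Illustration only: serialize s3a's non-atomic copy+delete behind one lock
// spanning both subtrees, so no caller going through this FileSystem ever
// observes a partially renamed tree. lockRename is a hypothetical helper.
public boolean rename(Path src, Path dst) throws IOException {
  try (AutoLock lock = sync.lockRename(src, dst)) {
    return fs.rename(src, dst);  // still O(N) data movement underneath
  }
}
{code}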

{quote}Lock release must be bulletproof{quote}

Good catch, thanks - that is indeed the point of the AutoLock class in the 
first place :) I've also moved the lock releases in RemoteIterator.hasNext() 
under finally.

A couple of other things on my immediate to-do list: I'd like to add FS 
contract tests to get better coverage of the rest of the FS API, in addition 
to the 2 main use cases I'm testing. Certainly more tests are needed in 
general. I'd also like to add a way to mock S3 so you don't need AWS 
credentials to run the tests. Shouldn't be too tricky - the problems my tests 
currently reproduce depend on client behavior, not on subtle server-side 
implementation details.
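On the contract-test item, Hadoop's per-operation contract suites should bind 
to the wrapper with very little glue - a rough sketch of what that wiring 
might look like (the test class and the reuse of S3AContract are my 
assumptions, not part of the patch):
{code}
// Hypothetical contract-test binding: route the s3a scheme through HBOSS so
// Hadoop's rename contract is exercised against the wrapper, not raw S3A.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.contract.AbstractContractRenameTest;
import org.apache.hadoop.fs.contract.AbstractFSContract;
import org.apache.hadoop.fs.contract.s3a.S3AContract;

public class TestHBOSSContractRename extends AbstractContractRenameTest {
  @Override
  protected AbstractFSContract createContract(Configuration conf) {
    // HBaseObjectStoreSemantics' package depends on where the module lands.
    conf.set("fs.s3a.impl", HBaseObjectStoreSemantics.class.getName());
    conf.set("fs.hboss.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
    return new S3AContract(conf);
  }
}
{code}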

[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-02 Thread Allan Yang (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808315#comment-16808315 ]

Allan Yang commented on HBASE-22149:


Sorry, I still don't get how the atomic rename is handled, since the 
FileSystem provided by hadoop-aws (based on s3) does not provide an atomic 
rename API.

[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-02 Thread Vladimir Rodionov (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808252#comment-16808252 ]

Vladimir Rodionov commented on HBASE-22149:
---

{quote}
One area in which I would particularly appreciate review for safety is in 
TreeLockManager.treeWriteLock and treeReadLock.
{quote}

Sure, this is the most interesting part :)

[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-02 Thread Vladimir Rodionov (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808251#comment-16808251 ]

Vladimir Rodionov commented on HBASE-22149:
---

{code}
public class LockedFSDataOutputStream extends FSDataOutputStream {

  private final FSDataOutputStream stream;
  private final AutoLock lock;

  public LockedFSDataOutputStream(FSDataOutputStream stream, AutoLock lock) {
    super(stream, null);
    this.stream = stream;
    this.lock = lock;
  }

  @Override
  public long getPos() {
    return stream.getPos();
  }

  @Override
  public void close() throws IOException {
    stream.close();
    lock.close();
  }

  // ... other delegating methods elided ...
{code}

Lock release must be bulletproof. If stream.close() throws an IOException, the 
lock will never be released. Please check all your code and make sure 
lock.close() is called in a finally clause.
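A minimal sketch of the fix (my illustration): wrap the release in finally so 
a failing stream close can't leak the lock:
{code}
@Override
public void close() throws IOException {
  try {
    stream.close();
  } finally {
    lock.close();  // always released, even if stream.close() throws
  }
}
{code}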

[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-02 Thread Vladimir Rodionov (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808247#comment-16808247 ]

Vladimir Rodionov commented on HBASE-22149:
---

{code}
public static class LockedRemoteIterator<E> implements RemoteIterator<E> {

  private final RemoteIterator<E> iterator;
  private final AutoLock lock;

  public LockedRemoteIterator(RemoteIterator<E> iterator, AutoLock lock) {
    this.iterator = iterator;
    this.lock = lock;
  }

  public boolean hasNext() throws IOException {
    if (iterator.hasNext()) {
      return true;
    }
    lock.close();
    return false;
  }

  /**
   * Delegates to the wrapped iterator, but will close the lock in the event
   * of a NoSuchElementException. Some applications do not call hasNext() and
   * simply depend on the NoSuchElementException.
   */
  public E next() throws IOException {
    try {
      return iterator.next();
    } catch (NoSuchElementException e) {
      lock.close();
      throw e;
    }
  }
}
{code}

I would add an explicit close() method to this iterator: it holds a lock that 
is released only when the iterator reaches its end, so there is no explicit 
way to release the lock otherwise (a sketch follows below).
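A rough sketch of that addition (my illustration, not the patch; it assumes 
AutoLock.close() is safe to call more than once, as the hasNext()/next() paths 
already imply):
{code}
// Hypothetical: implement Closeable so callers can release the lock without
// draining the iterator, e.g. from a try-with-resources block. Everything
// else stays as in the snippet above.
public static class LockedRemoteIterator<E>
    implements RemoteIterator<E>, Closeable {

  // ... fields, constructor, hasNext() and next() as above ...

  @Override
  public void close() throws IOException {
    lock.close();
  }
}
{code}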

[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-02 Thread Sean Mackrory (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808248#comment-16808248 ]

Sean Mackrory commented on HBASE-22149:
---

Thanks [~vrodionov]. You're quite right - there should be a copy of all the 
paths put into a single array and then sorted. I thought I had done that, but 
perhaps I'm thinking of another method or I lost it while cleaning up the patch.

One area in which I would particularly appreciate review for safety is in 
TreeLockManager.treeWriteLock and treeReadLock. As I said, I'll be replacing 
the while(true)s and Thread.sleeps with proper retry logic that can eventually 
time out, but I don't think that will change the overall structure of those 
procedures. The idea is to safely obtain locks on the specific nodes in 
question, but then also ensure they're not INSIDE another lock that someone 
else already obtained, etc.

[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-02 Thread Vladimir Rodionov (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808228#comment-16808228 ]

Vladimir Rodionov commented on HBASE-22149:
---

Nice, thanks [~mackrorysd]. I will go through your patch later on 
today/tomorrow; for now, one question:
{code}
public void concat(final Path trg, final Path[] psrcs) throws IOException {
  AutoLock[] locks = new AutoLock[psrcs.length + 1];
  try {
    for (int i = 0; i < psrcs.length; i++) {
      locks[i] = sync.lock(psrcs[i]);
    }
    locks[psrcs.length] = sync.lock(trg);
    fs.concat(trg, psrcs);
  } finally {
    for (int i = 0; i < locks.length; i++) {
      if (locks[i] != null) {
        locks[i].close();
      }
    }
  }
}
{code}

This code does not seem deadlock-safe?
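Sean's reply above describes the intended remedy: copy all the paths into a 
single array and sort it before locking, so any two threads acquiring 
overlapping lock sets always contend in the same order. A minimal sketch, with 
an illustrative helper name:
{code}
// Sketch only: a total ordering on lock acquisition prevents the circular
// wait that deadlock requires.
private AutoLock[] lockAllSorted(Path target, Path[] sources) throws IOException {
  Path[] all = Arrays.copyOf(sources, sources.length + 1);
  all[sources.length] = target;
  Arrays.sort(all);  // org.apache.hadoop.fs.Path is Comparable
  AutoLock[] locks = new AutoLock[all.length];
  for (int i = 0; i < all.length; i++) {
    locks[i] = sync.lock(all[i]);
  }
  return locks;  // caller still releases every non-null lock in finally
}
{code}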

[jira] [Commented] (HBASE-22149) HBOSS: A FileSystem implementation to provide HBase's required semantics

2019-04-02 Thread Sean Mackrory (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-22149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808127#comment-16808127 ]

Sean Mackrory commented on HBASE-22149:
---

I've attached HBASE-22149-hadoop.patch, which is my proof-of-concept built on 
the Hadoop source tree. The while(true)s and Thread.sleeps in TreeLockManager 
must be replaced by more robust retry logic. And I need to add support for 
globStatus() operations and deleteOnExit(), as HBase does appear to use them 
and they aren't quite as trivial as everything else. Symlinks and PathHandle 
are also unsupported, but they are unused and unsupported in s3a, so I have no 
plans to address that.

A write lock is acquired on a path when an OutputStream is created, and it is 
not released until the stream is closed (whereas most operations lock and 
unlock within the same method call). I haven't done this with InputStreams, 
and I'm not sure it's required there. OutputStreams need it to ensure create() 
is atomic, as s3a won't actually create the file in the underlying S3 bucket 
until later, when the stream is closed.
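
In spirit, the create() path does something like the following (a simplified
sketch; lockManager, underlyingFs, and the write-lock method names are
stand-ins, not the patch's actual API):

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.Path;

    public FSDataOutputStream create(final Path path) throws IOException {
      lockManager.writeLock(path);             // taken at create() time...
      try {
        FSDataOutputStream out = underlyingFs.create(path);
        return new FSDataOutputStream(out, null) {
          @Override
          public void close() throws IOException {
            try {
              super.close();                   // s3a materializes the object here
            } finally {
              lockManager.writeUnlock(path);   // ...released only at close()
            }
          }
        };
      } catch (IOException e) {
        lockManager.writeUnlock(path);         // don't leak the lock on failure
        throw e;
      }
    }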

With the exception of InputStreams, I've generally erred on the side of 
locking everything so that we start out with correctness. As I do performance 
testing and identify bottlenecks, it may be worth carefully considering 
whether some locking can be removed where HBase's usage patterns make it safe, 
if doing so would relieve a particular hot spot.
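
For contrast, the default pattern for a one-shot operation is simply lock,
delegate, unlock (again with placeholder names, and imports as in the
sketches above):

    public boolean delete(Path path, boolean recursive) throws IOException {
      lockManager.writeLock(path);             // pessimistic by default
      try {
        return underlyingFs.delete(path, recursive);
      } finally {
        lockManager.writeUnlock(path);
      }
    }

Relaxing this would mean narrowing or dropping the lock for specific
operations once profiling shows it's both safe for HBase's access patterns
and worth the extra complexity.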
