Re: [EXTERNAL] Accumulo with Native S3 Support

Bill Slacum Tue, 03 Aug 2021 20:34:38 -0700

The short answers to these questions are:

1. Yes. Both of them are meant to be simpler implementations than S3A. For 
RFiles, we don’t need any complicated write semantics. For WAL files, the 
implementation is meant to offer somewhat higher throughput than S3A and to act 
as an adapter that splits a single log across multiple, smaller objects. Both 
are pretty tightly scoped to their package, without much leakage into other 
aspects of the code.
2. Do we still have contrib? That may be the best place if we don’t want them in 
mainline Accumulo. My personal opinion is that their maintenance cost is low 
enough to include them in Accumulo, so that Accumulo is S3-ready out of the 
box.
3. The main benefit of the ZooLease and its integration is that we wanted 
tighter timings in the write path, to simulate something similar to the HDFS 
lease functionality the WAL write path has now. It’s mostly a way to keep a 
rogue TServer from continuing to accept writes for a tablet when, as far as the 
rest of the system is concerned, its lock is gone.
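
To make the lease idea in point 3 a bit more concrete, here is a minimal sketch of a time-bounded write lease. It is not the ZooLease code from the fork: the class name WriteLease, its methods, and the injected clock are all hypothetical, and the heartbeat to ZooKeeper is assumed to happen elsewhere. The only point it illustrates is that the writer checks a local deadline before every write, so a server that has lost contact stops accepting writes once the deadline passes.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical time-bounded lease (an illustration only, not the ZooLease
 * class from the fork). The holder may write only while the lease is unexpired.
 */
final class WriteLease {
    private final long durationNanos;
    private final LongSupplier clock;   // monotonic clock, e.g. System::nanoTime
    private final AtomicLong expiresAt; // deadline in the clock's time base

    WriteLease(long durationNanos, LongSupplier clock) {
        this.durationNanos = durationNanos;
        this.clock = clock;
        this.expiresAt = new AtomicLong(clock.getAsLong() + durationNanos);
    }

    /**
     * Extend the lease after a successful heartbeat. The deadline is measured
     * from when the heartbeat was sent, not when the reply arrived, so the
     * local view never outlives what the coordinator could have granted.
     */
    void renew(long heartbeatSentAtNanos) {
        expiresAt.set(heartbeatSentAtNanos + durationNanos);
    }

    /** True while the current time is strictly before the deadline. */
    boolean isHeld() {
        return clock.getAsLong() - expiresAt.get() < 0;
    }

    /** Call at the top of the write path; refuses the write once expired. */
    void checkBeforeWrite() {
        if (!isHeld()) {
            throw new IllegalStateException("lease expired; refusing write");
        }
    }
}
```

In this sketch a writer would call checkBeforeWrite() before appending to the log, and a background thread would call renew(...) after each successful heartbeat. Passing the clock in as a LongSupplier is only there to make the expiry behaviour easy to exercise in a test.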


On 2021/07/28 17:41:10, Christopher <[email protected]> wrote: 
> From what I saw from looking at the changes in Chris Milbert's fork,
> the fork contains a couple S3 implementations of Hadoop's FileSystem
> interface in a separate module (similar to s3a:// and abfss://
> implementations). It seems to add accS3mo:// and accS3nf://
> implementations, which, in spite of their names, do not appear to be
> Accumulo-specific (that's a good thing... as these could be reused by
> other projects as well!).
> 
> In addition, these FileSystem implementations seem to be accompanied
> by a few changes to Accumulo code itself, but I couldn't tell if these
> were necessary to improve compatibility with these new FileSystems or
> if they were unrelated additional enhancements to Accumulo. They also
> appeared to be based on an older 2.0 branch, rather than the latest
> 2.1 / main branch, and conflict with some of the changes in 2.1
> branch. So those changes will need to be rebased.
> 
> So, I suggest isolating the FileSystem implementations from the
> changes to Accumulo. The FileSystem implementations don't need to be
> merged into Accumulo's code base, or built as part of Accumulo at all.
> They are completely independent from Accumulo and can exist in their
> own repo, for use by any other user, just like s3a:// or abfss:// .
> The Accumulo PMC could decide to accept responsibility for these
> FileSystem implementations, but I don't think the Accumulo project at
> the ASF is the best home for them, as they are not Accumulo-specific.
> It might make more sense as a subproject of Hadoop instead of
> Accumulo, since they are Hadoop FileSystem implementations, or remain
> as a 3rd party repository on GitHub as part of the larger Hadoop
> ecosystem. Finding the best home for these may take some additional
> research on the part of its developers.
> 
> The changes to Accumulo itself, separate from the S3 FileSystem
> implementations, will be easiest to incorporate into the 2.1 / main
> branch if they are rebased first, and submitted from a fork on GitHub
> (Chris Milbert's repo does not appear to be a "fork", but a
> disconnected clone, so creating a PR using GitHub's UI won't be
> possible without first recreating the repo using the "fork" feature on
> GitHub). If there are multiple, discrete changes, serving independent
> purposes, the changes should be teased apart and submitted as separate
> PRs against the main branch, so they can be evaluated on their own
> merits through the code review process. It is hard to consider their
> merits without a pull request for those changes.
> 
> I think the discussion of abstracting the storage layer in Accumulo is
> a worthy one, but I think it can be set aside for now. Abstracting the
> storage layer from Hadoop would involve creating Accumulo-specific
> storage APIs, and corralling Hadoop FileSystem API calls behind an
> implementation of that Accumulo storage API. However, that's not
> necessary for this. We currently use Hadoop's FileSystem APIs
> throughout our own code, and Hadoop's FileSystem already provides
> sufficient abstraction for the purposes of adding S3 support to
> Accumulo, and that's what appears to have been done by Chris Milbert.
> So, there's no need to complicate the discussion with additional
> potential future work to further abstract Hadoop FileSystem API calls.
> That abstraction doesn't appear to be a necessary prerequisite to
> considering the work done by Chris in his repo.
> 
> To me, the main questions are:
> 
> 1. Can the new FileSystem implementations be used as easily as other
> drop-in implementations, like s3a:// and abfss:// ?
> 2. Where is the best home for these FileSystem implementations?
> 3. What benefits do the other changes to Accumulo serve, and can they
> be rebased and submitted as separate PRs against Accumulo's main
> branch?
> 
> 
> On Tue, Jul 27, 2021 at 2:00 PM Arvind Shyamsundar
> <[email protected]> wrote:
> >
> > Hi Jeff, what would be the difference between this path, and what can be 
> > accomplished by using a Hadoop FileSystem interface based connector to talk 
> > to S3? Is it because of the consistency limitations with s3a:// 
> > (https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html)?
> >
> > As you probably know for Azure, we went with the abfss:// connector 
> > provided as part of hadoop-azure 
> > (https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html) with 
> > minimal effort. Just wondering what the key difference here is for S3.
> >
> > Thanks!
> >
> > Arvind.
> >
> > -----Original Message-----
> > From: Jeff Kubina <[email protected]>
> > Sent: Tuesday, July 27, 2021 10:16 AM
> > To: [email protected]
> > Subject: [EXTERNAL] Accumulo with Native S3 Support
> >
> > All,
> >
> > Some of AWS's back end services use a version of Accumulo modified to use 
> > Amazon's S3 as its storage system. Amazon engineers forked Accumulo 2.0 and 
> > merged that S3 support into it 
> > <https://github.com/cmilbert/accumulo/>.
> > Chris Milbert is the lead Amazon engineer who did the integration. Chris 
> > and I would like to jump start the conversation about how best to initiate 
> > the pull request for these changes into Accumulo 2.1.
> >
> > Mike Wall suggested using this as an opportunity to abstract out the 
> > storage system of Accumulo and make it pluggable. He suggested the 
> > following broad steps:
> >
> >    1. Identify all the things HDFS provides such as read, write,
> >    replication and failover.
> >    2. Abstract out a file system interface with hooks for all those things
> >    (and does not require loading hadoop jars).
> >    3. Plug in HDFS as the default implementation of that interface, hiding
> >    all hadoop jars there.
> >    4. Make another implementation that plugs in S3 and make it optionally
> >    configured.
> >    5. Run tests to make sure we didn't break things with HDFS.
> >    6. Run tests to see if S3 meets all the requirements.
> >
> > Ed Coleman also suggested first forking Accumulo 2.1 and merging the S3 
> > changes into it.
> >
> > Chris and I look forward to the discussion on how best to add S3 support to 
> > Accumulo.
> >
> > Thanks,
> > Jeff
> > --
> > Jeff Kubina
>
