Hi Bill!

> 2. Do we still have contrib? That may be the best place if we don't want
> them in mainline Accumulo. My personal opinion is the maintenance cost of
> them is low enough to include them in Accumulo, that way Accumulo is S3
> ready out of the box.
A contrib directory in the main source tarball seems very much like an outdated concept to me, a leftover from the early open source days when projects couldn't easily establish their own community presence to distribute related code. I don't think that is the case today, in the era of git. It is no longer necessary for related/supplemental projects to be distributed with the main project in order to be discoverable. Accumulo itself moved all of its "contrib" projects to separate git repos, and there are many more Accumulo-related projects that add features to Accumulo scattered across the internet (Apache Fluo, pyaccumulo, the Hive Accumulo integration, the GeoWave Accumulo integration, and many others). So, a contrib directory doesn't necessarily make sense to me for these. But contrib repos do make sense.

The question then becomes: who "owns" the repo? Does the Accumulo PMC want to take them over as a "subproject", like accumulo-maven-plugin or accumulo-proxy? One thing is clear to me: it isn't *necessary* for the Accumulo PMC to take on a project for it to survive or to be useful to Accumulo users (see Apache Fluo). We can support it by blogging about it, providing examples, linking to it on our website with other integrations available to users, etc. So, since it's not *necessary*, it comes down to the pros and cons of the PMC choosing to take over governance/maintenance of these FileSystem implementations.

For me, the S3 FileSystem implementations don't really make sense as Accumulo PMC subprojects... they aren't "Accumulo" things. They are "FileSystem" implementations, which is a Hadoop DFS API thing, one that we Accumulo developers have, on numerous occasions, discussed distancing ourselves from by creating a storage abstraction layer to abstract away the Hadoop dependencies. However, I'm willing to hear arguments for why the PMC should be responsible for them, instead of them existing alongside Accumulo under the governance of somebody other than the Accumulo PMC.

As for the minimal effort to maintain them: the reality is, most Accumulo "contribs" or "subprojects" have effectively died off for lack of maintainers keeping up with them (see accumulo-bsp, accumulo-pig, accumulo-instamo-archetype, accumulo-wikisearch, accumulo-docker). These S3 implementations seem at least as complex as those, and require more specialized expertise to maintain. Given that these S3 implementations aren't coupled to Accumulo, as they are Hadoop FileSystem implementations, I see no reason why Accumulo developers should be expected to have the specialized knowledge to maintain them, regardless of how little work it might be. And I see no reason they should be scoped to Accumulo for governance/maintenance, when they aren't scoped to Accumulo by nature of their code.
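To make that point concrete: a Hadoop FileSystem implementation is just a subclass of org.apache.hadoop.fs.FileSystem, and nothing in its contract mentions Accumulo at all. A minimal sketch (only the accS3mo:// scheme name comes from the fork; the package, class name, and stubs below are invented for illustration):

// Hypothetical sketch, not code from the fork: only the accS3mo:// scheme
// name comes from the earlier mails; the package, class, and stubs are mine.
package org.example.s3fs;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.util.Progressable;

public class AccS3MoFileSystem extends FileSystem {

  private URI uri;

  @Override
  public String getScheme() {
    return "accS3mo"; // the URI scheme this implementation claims
  }

  @Override
  public void initialize(URI name, Configuration conf) throws IOException {
    super.initialize(name, conf);
    this.uri = name;
    // a real implementation would construct its S3 client from conf here
  }

  @Override
  public URI getUri() {
    return uri;
  }

  // Everything below would be implemented in terms of S3 operations. Stubbed
  // here only to show that the entire contract is Hadoop's, not Accumulo's.
  private static UnsupportedOperationException stub() {
    return new UnsupportedOperationException("sketch only");
  }

  @Override public FSDataInputStream open(Path f, int bufSize) { throw stub(); }
  @Override public FSDataOutputStream create(Path f, FsPermission perm,
      boolean overwrite, int bufSize, short repl, long blockSize,
      Progressable prog) { throw stub(); }
  @Override public FSDataOutputStream append(Path f, int bufSize,
      Progressable prog) { throw stub(); }
  @Override public boolean rename(Path src, Path dst) { throw stub(); }
  @Override public boolean delete(Path f, boolean recursive) { throw stub(); }
  @Override public FileStatus[] listStatus(Path f) { throw stub(); }
  @Override public void setWorkingDirectory(Path newDir) { throw stub(); }
  @Override public Path getWorkingDirectory() { throw stub(); }
  @Override public boolean mkdirs(Path f, FsPermission perm) { throw stub(); }
  @Override public FileStatus getFileStatus(Path f) { throw stub(); }
}

Hadoop discovers such a class from the classpath via the fs.<scheme>.impl configuration property (or a ServiceLoader entry), so any Hadoop client can use it; Accumulo is just one of them.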
Regarding "Accumulo is S3 ready out of the box": I'd also be concerned about getting into the politics around whether Accumulo should be made "ready out of the box" for S3, Azure, some SuperDuperGreatNextGenStorageProviderThatOnlyTwoPeopleUseDotCom, or any other 3rd party integration. As long as there are sufficient integration points in Accumulo, we don't really need to take ownership of specific integration implementations. In Accumulo 2.0 we dropped the HDP/CDH-specific configuration templates from our tarball, and worked to make our scripts and example configs more vendor-independent.

Trying to make Accumulo "S3-ready" by adding S3-specific integrations, or "Azure-ready" by adding Azure-specific integrations, or bundling any other specialized integrations, seems not only unnecessary, but prone to putting us in the middle of commercial politics when we should work to stay independent. We can avoid all of this and still support any 3rd party integrator, without bundling their specific implementations of our integration interfaces.

Given all this, I think it would be much more reasonable for these FileSystem implementations to be hosted elsewhere and packaged as drop-in class path extensions for users who choose to use them. For the Accumulo PMC, I think it would be reasonable for us to support / empower them by linking to them on our website, accepting a contributed blog post and/or examples for using them, etc., to help users discover them.
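To spell out what "drop-in" would mean in practice: with the jar on the class path, the implementation is wired to its scheme through ordinary Hadoop configuration, and Accumulo resolves it the same way any Hadoop client does. A rough sketch (the accS3mo:// scheme comes from the fork; the class name is hypothetical, and in a deployment the property would live in core-site.xml rather than be set in code):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class DropInDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Maps the accS3mo:// scheme to its FileSystem class; normally this is
    // the fs.accS3mo.impl property in core-site.xml.
    conf.set("fs.accS3mo.impl", "org.example.s3fs.AccS3MoFileSystem");

    // Any Hadoop client -- Accumulo included -- resolves the scheme this way.
    FileSystem fs = FileSystem.get(URI.create("accS3mo://bucket/accumulo/"), conf);
    System.out.println(fs.getScheme()); // prints "accS3mo"
  }
}

On the Accumulo side, using it should then amount to pointing instance.volumes at an accS3mo:// URI, the same way users configure s3a:// or abfss:// volumes today.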
> 3. The main benefit of the ZooLease and its integration is that we wanted a
> way to have tighter timings in the write path to simulate something similar
> to the HDFS lease functionality the WAL write path has now. It's mostly a
> way to avoid a rogue TServer continuing to accept writes for a tablet when,
> to the rest of the system, its lock is gone.

I would be interested in seeing those changes rebased onto the current main branch and submitted as a separate PR to be considered on their own, since they do modify existing Accumulo code. If we can incorporate these changes so that they not only help support the S3 FileSystem implementations but also enhance Accumulo more generally, without being tightly coupled to those implementations, I think that's probably the best way forward for the ZooLease stuff.
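I haven't studied the ZooLease code itself, so take this as my mental model of the goal rather than a description of the implementation: a fencing check in the write path that re-verifies, before writes are acknowledged, that the server's ephemeral lock node still exists and still belongs to this ZooKeeper session. Something roughly like (all names here are hypothetical, not the fork's):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Hypothetical illustration of the fencing idea; not the fork's ZooLease API.
public class LeaseCheck {

  // True only if the ephemeral lock node still exists and is owned by this
  // client's session. A rogue TServer whose lock is gone (session expired,
  // node deleted) fails this check and must stop accepting writes.
  static boolean lockStillHeld(ZooKeeper zk, String lockPath)
      throws KeeperException, InterruptedException {
    Stat stat = zk.exists(lockPath, false); // point-in-time check, no watch
    return stat != null && stat.getEphemeralOwner() == zk.getSessionId();
  }
}

The part that deserves review on its own merits is the timing: how tightly a check like that gates the write path. That question is independent of which FileSystem sits underneath, which is why I'd like to see it as a standalone PR.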
> On 2021/07/28 17:41:10, Christopher <[email protected]> wrote:
> > From what I saw from looking at the changes in Chris Milbert's fork,
> > the fork contains a couple S3 implementations of Hadoop's FileSystem
> > interface in a separate module (similar to the s3a:// and abfss://
> > implementations). It seems to add accS3mo:// and accS3nf://
> > implementations, which, in spite of their names, do not appear to be
> > Accumulo-specific (that's a good thing... as these could be reused by
> > other projects as well!).
> >
> > In addition, these FileSystem implementations seem to be accompanied
> > by a few changes to Accumulo code itself, but I couldn't tell if these
> > were necessary to improve compatibility with these new FileSystems or
> > if they were unrelated additional enhancements to Accumulo. They also
> > appeared to be based on an older 2.0 branch, rather than the latest
> > 2.1 / main branch, and conflict with some of the changes in the 2.1
> > branch. So those changes will need to be rebased.
> >
> > So, I suggest isolating the FileSystem implementations from the
> > changes to Accumulo. The FileSystem implementations don't need to be
> > merged into Accumulo's code base, or built as part of Accumulo at all.
> > They are completely independent from Accumulo and can exist in their
> > own repo, for use by any other user, just like s3a:// or abfss://.
> > The Accumulo PMC could decide to accept responsibility for these
> > FileSystem implementations, but I don't think the Accumulo project at
> > the ASF is the best home for them, as they are not Accumulo-specific.
> > It might make more sense as a subproject of Hadoop instead of
> > Accumulo, since they are Hadoop FileSystem implementations, or to
> > remain as a 3rd party repository on GitHub as part of the larger
> > Hadoop ecosystem. Finding the best home for these may take some
> > additional research on the part of their developers.
> >
> > The changes to Accumulo itself, separate from the S3 FileSystem
> > implementations, will be easiest to incorporate into the 2.1 / main
> > branch if they are rebased first and submitted from a fork on GitHub
> > (Chris Milbert's repo does not appear to be a "fork", but a
> > disconnected clone, so creating a PR using GitHub's UI won't be
> > possible without first recreating the repo using the "fork" feature
> > on GitHub). If there are multiple, discrete changes serving
> > independent purposes, the changes should be teased apart and
> > submitted as separate PRs against the main branch, so they can be
> > evaluated on their own merits through the code review process. It is
> > hard to consider their merits without a pull request for those
> > changes.
> >
> > I think the discussion of abstracting the storage layer in Accumulo
> > is a worthy one, but I think it can be set aside for now. Abstracting
> > the storage layer from Hadoop would involve creating Accumulo-specific
> > storage APIs, and corralling Hadoop FileSystem API calls behind an
> > implementation of that Accumulo storage API. However, that's not
> > necessary for this. We currently use Hadoop's FileSystem APIs
> > throughout our own code, and Hadoop's FileSystem already provides
> > sufficient abstraction for the purposes of adding S3 support to
> > Accumulo, and that's what appears to have been done by Chris Milbert.
> > So, there's no need to complicate the discussion with additional
> > potential future work to further abstract Hadoop FileSystem API
> > calls. That abstraction doesn't appear to be a necessary prerequisite
> > to considering the work done by Chris in his repo.
> >
> > To me, the main questions are:
> >
> > 1. Can the new FileSystem implementations be used as easily as other
> >    drop-in implementations, like s3a:// and abfss:// ?
> > 2. Where is the best home for these FileSystem implementations?
> > 3. What benefits do the other changes to Accumulo serve, and can they
> >    be rebased and submitted as separate PRs against Accumulo's main
> >    branch?
> >
> > On Tue, Jul 27, 2021 at 2:00 PM Arvind Shyamsundar
> > <[email protected]> wrote:
> > >
> > > Hi Jeff, what would be the difference between this path and what
> > > can be accomplished by using a Hadoop FileSystem interface-based
> > > connector to talk to S3? Is it because of the consistency
> > > limitations with s3a://
> > > (https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html)?
> > >
> > > As you probably know, for Azure we went with the abfss:// connector
> > > provided as part of hadoop-azure
> > > (https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html) with
> > > minimal effort. Just wondering what the key difference here is for
> > > S3.
> > >
> > > Thanks!
> > >
> > > Arvind.
> > >
> > > -----Original Message-----
> > > From: Jeff Kubina <[email protected]>
> > > Sent: Tuesday, July 27, 2021 10:16 AM
> > > To: [email protected]
> > > Subject: [EXTERNAL] Accumulo with Native S3 Support
> > >
> > > All,
> > >
> > > Some of AWS's back-end services use a version of Accumulo modified
> > > to use Amazon's S3 as its storage system. Amazon engineers forked
> > > Accumulo 2.0 and merged that S3 support into it
> > > <https://github.com/cmilbert/accumulo/>. Chris Milbert is the lead
> > > Amazon engineer who did the integration. Chris and I would like to
> > > jump start the conversation about how best to initiate the pull
> > > request for these changes into Accumulo 2.1.
> > >
> > > Mike Wall suggested using this as an opportunity to abstract out
> > > the storage system of Accumulo and make it pluggable. He suggested
> > > the following broad steps:
> > >
> > > 1. Identify all the things HDFS provides, such as read, write,
> > >    replication, and failover.
> > > 2. Abstract out a file system interface with hooks for all those
> > >    things (one that does not require loading hadoop jars).
> > > 3. Plug in HDFS as the default implementation of that interface,
> > >    hiding all hadoop jars there.
> > > 4. Make another implementation that plugs in S3, and make it
> > >    optionally configurable.
> > > 5. Run tests to make sure we didn't break things with HDFS.
> > > 6. Run tests to see if S3 meets all the requirements.
> > >
> > > Ed Coleman also suggested first forking Accumulo 2.1 and merging
> > > the S3 changes into it.
> > >
> > > Chris and I look forward to the discussion on how best to add S3
> > > support to Accumulo.
> > >
> > > Thanks,
> > > Jeff
> > > --
> > > Jeff Kubina
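(For anyone following the quoted thread to this point: to make Mike's step 2 concrete, the interface he describes would be something roughly like the following, an Accumulo-owned API with no Hadoop types in its signatures. This is purely an illustrative sketch of the idea, not a proposal, and, as I said above, I think that abstraction can be set aside for now.)

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

// Illustrative only: an Accumulo-owned storage interface (step 2 above),
// with HDFS and S3 as competing implementations behind it (steps 3 and 4).
public interface VolumeStorage {
  InputStream read(String path) throws IOException;
  OutputStream create(String path, boolean overwrite) throws IOException;
  boolean rename(String source, String target) throws IOException;
  boolean delete(String path, boolean recursive) throws IOException;
  List<String> list(String path) throws IOException;
  void sync(OutputStream out) throws IOException; // durability hook for the WAL write path
}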
