Re: FileSystem API (was: Slack call notes)

Michael Wall Tue, 28 Apr 2020 13:19:23 -0700

That HDFS gateway appears to be an S3 layer on top of HDFS, not an HDFS
layer on top of S3/Minio.  It allows you to write code to use Minio and
pull existing data from HDFS as you migrate it into Minio.  As far as I can
tell, it would not work without changes to Accumulo.


In the next week or so I'll look at actually putting interfaces around the
HDFS interactions for RFiles and WALs as a first step.  I will report back
with my findings and hopefully some code.
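
As a rough illustration of what "interfaces around the HDFS interactions"
could look like, here is a minimal sketch. It is not Accumulo code: the names
TabletFileStore, InMemoryFileStore and FileStoreSketch are hypothetical, and
the in-memory implementation exists only to exercise the interface. A real
WAL abstraction would also need a durable sync operation, which this sketch
does not model.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical storage abstraction; none of these names exist in Accumulo. */
interface TabletFileStore {
  OutputStream create(String path) throws IOException;

  InputStream open(String path) throws IOException;

  boolean exists(String path) throws IOException;

  boolean delete(String path) throws IOException;
}

/** In-memory stand-in (an assumption for illustration, not a real backend). */
final class InMemoryFileStore implements TabletFileStore {
  private final Map<String, byte[]> files = new ConcurrentHashMap<>();

  @Override
  public OutputStream create(String path) {
    // The file becomes visible only once the stream is closed.
    return new ByteArrayOutputStream() {
      @Override
      public void close() throws IOException {
        super.close();
        files.put(path, toByteArray());
      }
    };
  }

  @Override
  public InputStream open(String path) throws IOException {
    byte[] data = files.get(path);
    if (data == null) {
      throw new FileNotFoundException(path);
    }
    return new ByteArrayInputStream(data);
  }

  @Override
  public boolean exists(String path) {
    return files.containsKey(path);
  }

  @Override
  public boolean delete(String path) {
    return files.remove(path) != null;
  }
}

public class FileStoreSketch {
  public static void main(String[] args) throws IOException {
    // Callers depend only on the interface, so an HDFS- or MinIO-backed
    // implementation could be substituted without touching this code.
    TabletFileStore store = new InMemoryFileStore();
    String path = "/tables/1/t-0001/F0000.rf"; // made-up example path
    try (OutputStream out = store.create(path)) {
      out.write("rfile-bytes".getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("exists after write: " + store.exists(path));
    store.delete(path);
    System.out.println("exists after delete: " + store.exists(path));
  }
}
```

The point of the sketch is only that RFile and WAL code would talk to a small
interface like this rather than to org.apache.hadoop classes directly; the
actual method set would have to come out of the investigation described above.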

Thanks

Mike

On Fri, Apr 24, 2020 at 10:32 PM Christopher <[email protected]> wrote:

> I'm not familiar with it, but the website says it can replace HDFS.
> There appears to be an "HDFS Gateway"
> (https://github.com/minio/minio/blob/master/docs/gateway/hdfs.md) that
> might be useful. At a glance, it looks like no abstraction is needed
> in Accumulo code for it... you just run the gateway and
> Accumulo believes it is using HDFS, but it is really using MinIO
> instead.
>
> There also might be a Hadoop FileSystem implementation for it to use
> it directly without a Gateway, but I didn't have any luck with a quick
> search for one.
>
> In either case, there shouldn't need to be any changes to Accumulo itself.
>
> If changes to Accumulo do become necessary (or desired), I'd be
> interested in collaborating on that part. If it's just a matter of
> trying it with the Gateway or existing Hadoop FileSystem
> implementation, I'd also be interested in testing any step-by-step
> HOWTO guides somebody might want to write as a blog post.
>
> On Fri, Apr 24, 2020 at 11:20 AM Mike Miller <[email protected]> wrote:
> >
> > I have no experience with MinIO but would be interested in learning more
> > and collaborating.
> >
> > On Fri, Apr 24, 2020 at 10:57 AM Michael Wall <[email protected]> wrote:
> >
> > > Resurrecting this thread on the File System API.  I have been thinking
> > > about giving Minio [1] a try for both WALs and RFiles.  Seems to me like
> > > step one is to abstract internal interfaces for both targeted against 2.1?
> > > Couple of questions
> > >
> > > 1 - Anyone have experience with minio?
> > > 2 - Anyone interested in collaborating?  Thinking anything from providing
> > > input to helping to test once we get a prototype to actually doing some
> > > development.
> > >
> > > Thanks, hope everyone is staying safe and healthy.
> > >
> > > [1] - https://min.io/
> > >
> > > On Wed, Mar 25, 2020 at 6:08 PM Christopher <[email protected]>
> wrote:
> > >
> > > > Only 705 across 280 files, if you exclude Text, though :)
> > > >
> > > > grep -rP 'org[.]apache[.]hadoop(?![.]io[.]Text)' --include='*.java' *
> > > > | grep -v test/ | wc -l
> > > >
> > > > On Wed, Mar 25, 2020 at 3:34 PM Mike Miller <[email protected]>
> wrote:
> > > > >
> > > > > I think we have come a long way removing any external types from
> the
> > > API,
> > > > > for reasons other than de-coupling from Hadoop.  While we don't
> have
> > > many
> > > > > dependencies on the other components of Hadoop, we are still very
> > > tightly
> > > > > coupled to HDFS.
> > > > > > For example, some quick grep'ing of the code shows:
> > > > > > grep -r "import org.apache.hadoop" --include=*.java * | wc -l
> > > > > > 1734
> > > > > > Without tests it is slightly more feasible...
> > > > > > grep -r "import org.apache.hadoop" --include=*.java * | grep -v "test" | wc -l
> > > > > > 858
> > > > >
> > > > >
> > > > > On Wed, Mar 25, 2020 at 3:19 PM David Mollitor <[email protected]>
> > > > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I too have been thinking about this for a pet project.  There is
> > > > already
> > > > > > Apache Commons VFS that, with some investment, could probably
> serve
> > > all
> > > > > > these requirements.
> > > > > >
> > > > > > On Wed, Mar 25, 2020, 3:16 PM Christopher <[email protected]>
> > > wrote:
> > > > > >
> > > > > > > (Forking this thread, as it's a distinct topic)
> > > > > > >
> > > > > > > I've thought about it. The idea has driven me to try to reduce
> our
> > > > use
> > > > > > > of Hadoop-specific code, and to isolate Hadoop-specific stuff
> > > behind
> > > > > > > some abstraction, wherever possible. Though, I'll admit, we're
> > > > nowhere
> > > > > > > close to where we'd want to be to be fully decoupled from
> Hadoop.
> > > > > > >
> > > > > > > I've also been looking a lot at our VolumeManager code lately,
> to
> > > try
> > > > > > > to improve it a bit, and to create better abstractions for
> Volumes,
> > > > > > > that could aid future work in this area.
> > > > > > >
> > > > > > > But, I haven't directly been working on new FileSystem API
> > > > > > > abstraction... just trying to lay some groundwork for that
> > > > possibility
> > > > > > > in future.
> > > > > > >
> > > > > > > It'd be nice to get to a point where we have a Hadoop-specific
> > > > > > > implementation isolated to a jar that can be swapped out at
> runtime
> > > > > > > for other file system implementations, as needed. I see that
> as a
> > > > > > > somewhat long-way off.
> > > > > > >
> > > > > > > On Wed, Mar 25, 2020 at 2:08 PM <[email protected]> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >   I couldn't make the call today, but am curious if anyone
> has
> > > > > > > previously brought up creating a FileSystem API for Accumulo so
> > > that
> > > > we
> > > > > > > could use implementations other than Hadoop. I realize that
> Hadoop
> > > > > > provides
> > > > > > > implementations for things other than HDFS but that doesn't
> > > > necessarily
> > > > > > > mean that all filesystem implementations are covered.
> > > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Christopher <[email protected]>
> > > > > > > > Sent: Wednesday, March 25, 2020 1:45 PM
> > > > > > > > To: accumulo-dev <[email protected]>
> > > > > > > > Subject: Slack call notes
> > > > > > > >
> > > > > > > > Several committers/contributors in the community joined a
> call in
> > > > Slack
> > > > > > > on Wednesday, at 1130-1230, New York (Eastern) time. Here are
> my
> > > > notes of
> > > > > > > the call. Please feel free to add to them.
> > > > > > > >
> > > > > > > > I shared the overall philosophy and backstory to some of the
> > > script
> > > > > > > improvements in 2.x to help guide current/future work on the
> > > scripts.
> > > > > > > >
> > > > > > > > * bin/accumulo is inspired by old jpackage.org standards
> which
> > > are
> > > > > > > still in use in RPM macros for Java packaging in
> Fedora/RHEL/etc.
> > > > The key
> > > > > > > idea is that scripts are simple... set up environment (class
> path,
> > > > etc.),
> > > > > > > locate java, and exec a single process with the provided args.
> > > > > > > > * bin/accumulo-service is inspired by old SysVInit scripts
> for
> > > > > > > start/stop/restart/status of a single service
> > > > > > > > * behavior of bin/accumulo and bin/accumulo-service can be
> > > > manipulated
> > > > > > > through launch environment
> > > > > > > > * bin/accumulo-cluster uses bin/accumulo-service, and is
> provided
> > > > as a
> > > > > > > simple, out-of-the-box cluster management tool
> > > > > > > > * bin/accumulo-cluster and bin/accumulo-service are
> replaceable;
> > > > they
> > > > > > > are useful for out-of-the-box, but one would expect them to be
> > > > > > unnecessary
> > > > > > > if using systemd, or a vendor-provided cluster management
> system
> > > > > > > > * we discussed possibly moving bin/accumulo-cluster and
> > > > > > > bin/accumulo-service to contrib/ in the tarball, or some
> subdir of
> > > > bin/,
> > > > > > > but it was suggested to not make too many disruptive changes
> there
> > > > > > > > * we discussed the possibility of adding a config file for
> > > > > > > bin/accumulo-cluster (also mentioned on
> > > > > > > > https://github.com/apache/accumulo/pull/1568)
> > > > > > > > * we discussed the need to document the intent/purpose/scope
> of
> > > the
> > > > > > > scripts in comments inside the scripts themselves
> > > > > > > > * Ed Coleman asked if it'd be good to document a systemd
> > > example; I
> > > > > > > suggested it might make for a good blog post (perhaps by the
> person
> > > > who
> > > > > > > wrote the systemd unit files for Fluo Muchos)
> > > > > > > >
> > > > > > > > Keith Turner discussed his development efforts with regard to
> > > > enabling
> > > > > > > more controls over compactions.
> > > > > > > >
> > > > > > > > * one main idea was to keep configuration/API for data
> separate
> > > > from
> > > > > > > that for execution
> > > > > > > > > * data concerns application owners, whereas execution involves
> > > > > > > > > system admins (resource contention, etc.)
> > > > > > > > * he will submit a PR for review when ready
> > > > > > > > * he also suggested another call to go over the PR
> > > > > > > >
> > > > > > > > Billie Rinaldi discussed better support for Azure Data Lake
> > > Storage
> > > > > > > > Gen2 (ADLSv2).
> > > > > > > >
> > > > > > > > * maintaining a fork for experimenting, and working on
> reliably
> > > > testing
> > > > > > > issues involving WALs
> > > > > > > > * did not recommend using ADLSv2 with WALs, but that we
> should
> > > > still
> > > > > > > support it
> > > > > > > > * might need to implement a custom log closer to better
> support
> > > it
> > > > > > > >
> > > > > > > > Mike Miller brought up the idea of eliminating more static
> > > internal
> > > > > > > state.
> > > > > > > >
> > > > > > > > * ServerConfigurationFactory might be improved in this
> regard,
> > > with
> > > > > > some
> > > > > > > additional ZK cleanup
> > > > > > > > * Other ZK cleanup might help elsewhere (such as ZooCache)
> > > > > > > > * I suggested tablet location cache might also benefit from
> being
> > > > bound
> > > > > > > to an AccumuloClient lifecycle (or a dedicated opaque object
> that
> > > > could
> > > > > > be
> > > > > > > shared across AccumuloClient instances with its own
> user-managed
> > > > > > lifecycle)
> > > > > > > >
> > > > > > > > Please add anything I might have missed (or got wrong) in
> > > response
> > > > to
> > > > > > > this post.
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
>
