Forking this off because I don't think it's related to Tushar's original
question.
HBase and Accumulo both implement a WAL that relies on a distributed
FileSystem which:
1. Is API-compatible with HDFS
2. Guarantees that data written prior to an hflush()/hsync() call is
durable (a minimal sketch of that contract follows)
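To make that contract concrete, here's a minimal sketch against the
Hadoop FileSystem API (the path is a placeholder and error handling is
omitted; this is illustrative, not how HBase/Accumulo actually write
their WALs):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WalDurabilitySketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // whatever fs.defaultFS points at
        Path wal = new Path("/accumulo/wal/example"); // placeholder path
        try (FSDataOutputStream out = fs.create(wal)) {
          out.write("mutation bytes".getBytes("UTF-8"));
          // The WAL's correctness hinges on this call: once hsync() returns,
          // the filesystem must guarantee the bytes above survive a crash.
          out.hsync();
        }
      }
    }

If the underlying filesystem treats hflush()/hsync() as a no-op, the
WAL silently loses its durability guarantee.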
There are actually a few filesystems capable of this: HDFS (duh),
Azure's Windows Azure Storage Blob (WASB), Azure's Data Lake Store
(ADLS), and Azure's Blob Filesystem (ABFS).
Azure has had a pretty long interaction with the upstream Hadoop project
(and some tie-ins with the HBase project) to make sure that we know how
to configure the Hadoop drivers for those Azure blob stores so that they
provide that durability guarantee.
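For example, here's roughly what that configuration looks like. This is
a sketch only -- I'm going from memory on the property names and the
example directory values, so verify them against the hadoop-azure docs
for your Hadoop version:

    import org.apache.hadoop.conf.Configuration;

    public class AzureWalConfigSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // ABFS: hflush()/hsync() must actually flush to storage. This is
        // the default, but turning it off for throughput silently breaks
        // WAL durability.
        conf.setBoolean("fs.azure.enable.flush", true);
        // WASB: directories needing hflush semantics (e.g. WALs) must be
        // backed by page blobs rather than block blobs. Example values:
        conf.set("fs.azure.page.blob.dir", "/hbase/WALs,/accumulo/wal");
      }
    }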
That said, it's wrong to say that HBase/Accumulo in a cloud solution
require HDFS. It is accurate to say that S3 (via the S3A adapter) does
not provide the durability guarantees that HBase/Accumulo need for WALs
(EMRFS does, from what I've heard through the grapevine, but it
requires you to be using EMR).
On 10/25/19 1:49 PM, David Mollitor wrote:
Hello Team,
One shortcoming of Apache Accumulo and Apache HBase, as I understand it,
is that they both rely on HDFS for replicated WAL management.
Therefore, HDFS is a requirement even when deploying to a cloud solution.
I believe Google has developed consensus-enabled WAL management so that
three instances can be stood up without any external dependencies (other
than storage for the collection of rfiles/hfiles).
Be interested to hear your thoughts on this.
On Fri, Oct 25, 2019 at 1:46 PM Mike Miller <[email protected]> wrote:
Hi Tushar,
The closest thing we have is the performance tests in accumulo-testing,
which is probably the best place for this.
https://github.com/apache/accumulo-testing#performance-test
The instructions for setting up the scripts are in the README. Only a
limited number of tests have been written so far, though; they used to
be integration tests that were moved out of the main test package.
org.apache.accumulo.testing.performance.tests.DurabilityWriteSpeedPT
org.apache.accumulo.testing.performance.tests.YieldingScanExecutorPT
org.apache.accumulo.testing.performance.tests.ScanExecutorPT
org.apache.accumulo.testing.performance.tests.ScanFewFamiliesPT
org.apache.accumulo.testing.performance.tests.ConditionalMutationsPT
org.apache.accumulo.testing.performance.tests.RandomCachedLookupsPT
On Thu, Oct 24, 2019 at 8:09 PM Tushar Dhadiwal <[email protected]>
wrote:
Hello Everyone,
I am a Software Engineer at Microsoft, and our team is currently working
on making the deployment and operation of Accumulo on Azure as seamless
as possible. As part of this effort, we are attempting to observe and
measure some standard Accumulo operations (e.g. scans, canary queries,
ingest, etc.) and how their performance varies over time on
long-standing Accumulo clusters running in Azure. As part of this, we're
looking to come up with a metric that we can use to evaluate how
healthy/available an Accumulo cluster is. Over time we intend to use
this to understand how underlying platform changes in Azure can affect
the overall health of Accumulo workloads.
As a starting metric, for example, we are thinking of continually doing
scans of random values across various tablet servers and capturing
timing information on how long such scans take. I took a quick look at
the accumulo-testing repo and didn't find any tests or probes attempting
something along these lines.
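For concreteness, here's a rough sketch of the kind of probe we have in
mind, written against the 2.x client API (the table name, row-key
format, probe interval, and client properties path are all placeholders):

    import java.security.SecureRandom;
    import java.util.Map;
    import org.apache.accumulo.core.client.Accumulo;
    import org.apache.accumulo.core.client.AccumuloClient;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    public class ScanLatencyProbe {
      public static void main(String[] args) throws Exception {
        try (AccumuloClient client =
            Accumulo.newClient().from("accumulo-client.properties").build()) {
          SecureRandom rand = new SecureRandom();
          while (true) {
            // Pick a random row; assumes the probe table was pre-loaded
            // with rows in a known key space so lookups land on varied
            // tablet servers.
            String row = String.format("row_%08d", rand.nextInt(100_000_000));
            long start = System.nanoTime();
            long entries = 0;
            try (Scanner scanner =
                client.createScanner("probe_table", Authorizations.EMPTY)) {
              scanner.setRange(Range.exact(row));
              for (Map.Entry<Key,Value> entry : scanner) {
                entries++; // drain the scan so timing covers the full read
              }
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("row=%s entries=%d latencyMs=%d%n",
                row, entries, elapsedMs);
            Thread.sleep(1000); // probe interval
          }
        }
      }
    }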
Does something like this seem reasonable? Has anyone previously
attempted something similar? Does accumulo-testing seem like a
reasonable place for code that attempts to do something like this?
Appreciate your thoughts and feedback.
Cheers,
Tushar Dhadiwal