Forking this off because I don't think it's related to Tushar's original
question.
HBase and Accumulo both implement a WAL that relies on a distributed
FileSystem which:
1. Is API-compatible with HDFS
2. Guarantees that data written prior to an hflush()/hsync() call is
durable (a minimal sketch of that contract follows)
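To make that contract concrete, here's a minimal sketch against the
Hadoop FileSystem API (the path is a placeholder and error handling is
omitted; this is illustrative, not how HBase/Accumulo actually write
their WALs):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WalDurabilitySketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // whatever fs.defaultFS points at
        Path wal = new Path("/accumulo/wal/example"); // placeholder path
        try (FSDataOutputStream out = fs.create(wal)) {
          out.write("mutation bytes".getBytes("UTF-8"));
          // The WAL's correctness hinges on this call: once hsync() returns,
          // the filesystem must guarantee the bytes above survive a crash.
          out.hsync();
        }
      }
    }

If the underlying filesystem treats hflush()/hsync() as a no-op, the
WAL silently loses its durability guarantee.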
There are actually a few filesystems capable of this: HDFS (duh),
Azure's Windows Azure Storage Blob (WASB), Azure's Data Lake Store
(ADLS), and Azure's Blob Filesystem (ABFS).
Azure has had a pretty long interaction with the upstream Hadoop project
(and some tie-ins with the HBase project) to make sure that we know how
to configure the Hadoop drivers for those Azure blob stores so that they
provide that durability guarantee.
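For example, here's roughly what that configuration looks like. This is
a sketch only -- I'm going from memory on the property names and the
example directory values, so verify them against the hadoop-azure docs
for your Hadoop version:

    import org.apache.hadoop.conf.Configuration;

    public class AzureWalConfigSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // ABFS: hflush()/hsync() must actually flush to storage. This is
        // the default, but turning it off for throughput silently breaks
        // WAL durability.
        conf.setBoolean("fs.azure.enable.flush", true);
        // WASB: directories needing hflush semantics (e.g. WALs) must be
        // backed by page blobs rather than block blobs. Example values:
        conf.set("fs.azure.page.blob.dir", "/hbase/WALs,/accumulo/wal");
      }
    }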
That said, it's wrong to say that HBase/Accumulo in a cloud solution
require HDFS. It is accurate to say that S3 (via the S3A adapter) does
not provide the durability guarantees that HBase/Accumulo need for WALs
(EMRFS does, from what I've heard through the grapevine, but it
requires you to be using EMR).
On 10/25/19 1:49 PM, David Mollitor wrote:
Hello Team,
One shortcoming of Apache Accumulo and Apache HBase, as I understand it,
is that they both rely on HDFS for replicated WAL management.
Therefore, HDFS is a requirement even when deploying to a cloud solution.
I believe Google has developed consensus-enabled WAL management so that
three instances can be stood up without any external dependencies (other
than storage for the collection of rfiles/hfiles).
Be interested to hear your thoughts on this.
On Fri, Oct 25, 2019 at 1:46 PM Mike Miller <[email protected]> wrote:
Hi Tushar,
The closest thing we have is the performance tests in accumulo-testing,
which is probably the best place for this.
https://github.com/apache/accumulo-testing#performance-test
The instructions for setting up the scripts are in the README. Only a
limited number of tests have been written so far, though; they used to
be integration tests that were moved out of the main test package.
org.apache.accumulo.testing.performance.tests.DurabilityWriteSpeedPT
org.apache.accumulo.testing.performance.tests.YieldingScanExecutorPT
org.apache.accumulo.testing.performance.tests.ScanExecutorPT
org.apache.accumulo.testing.performance.tests.ScanFewFamiliesPT
org.apache.accumulo.testing.performance.tests.ConditionalMutationsPT
org.apache.accumulo.testing.performance.tests.RandomCachedLookupsPT
On Thu, Oct 24, 2019 at 8:09 PM Tushar Dhadiwal <[email protected]>
wrote:
Hello Everyone,
I am a Software Engineer at Microsoft, and our team is currently working
on making the deployment and operation of Accumulo on Azure as seamless
as possible. As part of this effort, we are attempting to observe and
measure some standard Accumulo operations (e.g. scans, canary queries,
ingest, etc.) and how their performance varies over time on
long-standing Accumulo clusters running in Azure. As part of this, we're
looking to come up with a metric that we can use to evaluate how
healthy/available an Accumulo cluster is. Over time we intend to use
this to understand how underlying platform changes in Azure can affect
the overall health of Accumulo workloads.
As a starting metric, for example, we are thinking of continually doing
scans of random values across various tablet servers and capturing
timing information on how long such scans take. I took a quick look at
the accumulo-testing repo and didn't find any tests or probes attempting
something along these lines.
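For concreteness, here's a rough sketch of the kind of probe we have in
mind, written against the 2.x client API (the table name, row-key
format, probe interval, and client properties path are all placeholders):

    import java.security.SecureRandom;
    import java.util.Map;
    import org.apache.accumulo.core.client.Accumulo;
    import org.apache.accumulo.core.client.AccumuloClient;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    public class ScanLatencyProbe {
      public static void main(String[] args) throws Exception {
        try (AccumuloClient client =
            Accumulo.newClient().from("accumulo-client.properties").build()) {
          SecureRandom rand = new SecureRandom();
          while (true) {
            // Pick a random row; assumes the probe table was pre-loaded
            // with rows in a known key space so lookups land on varied
            // tablet servers.
            String row = String.format("row_%08d", rand.nextInt(100_000_000));
            long start = System.nanoTime();
            long entries = 0;
            try (Scanner scanner =
                client.createScanner("probe_table", Authorizations.EMPTY)) {
              scanner.setRange(Range.exact(row));
              for (Map.Entry<Key,Value> entry : scanner) {
                entries++; // drain the scan so timing covers the full read
              }
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("row=%s entries=%d latencyMs=%d%n",
                row, entries, elapsedMs);
            Thread.sleep(1000); // probe interval
          }
        }
      }
    }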
Does something like this seem reasonable? Has anyone previously
attempted something similar? Does accumulo-testing seem like a
reasonable place for code that attempts to do something like this?
Appreciate your thoughts and feedback.
Cheers,
Tushar Dhadiwal