On Wed, Apr 15, 2020 at 10:05 AM Sean Busbey <bus...@apache.org> wrote:
> I think the first assumption no longer holds. Especially with the move to flexible compute environments, I regularly get asked by folks what the smallest HBase they can start with for production. I can keep saying 3/5/7 nodes or whatever, but I guarantee there are folks who want to and will run HBase with a single node. Probably those deployments won't want to have the distributed flag set. None of them really have a good option for where the WALs go, and failing loud when they try to go to LocalFileSystem is the best option I've seen so far to make sure folks realize they are getting into muddy waters.

I think this is where we disagree. My answer to this same question is 12 nodes: 3 "coordinator" hosts for HA ZK, HDFS, and the HBase master, plus 9 "worker" hosts for replicated data serving and storage. Tweak the number of workers and the replication factor if you like, but that's how you get a durable, available deployment suitable for an online production solution. Anything smaller than this and you're in the "muddy waters" of under-replicated distributed system failure domains.

> I agree with the second assumption. Our quickstart in general is too complicated. Maybe if we include big warnings in the guide itself, we could make a quickstart-specific artifact to download that has the unsafe disabling config in place?

I'm not a fan of a dedicated artifact as a binary tarball. I think that approach fractures the brand of our product and reinforces the idea that it's even more complicated. If we want a dedicated quick-start experience, I would advocate investing the resources in something more like a learning laboratory, accompanied by a runtime image in a VM or container.

> Last fall I toyed with the idea of adding an "hbase-local" module to the hbase-filesystem repo that could start us out with some optimizations for single-node setups.
> We could start with a fork of RawLocalFileSystem (which will call OutputStream flush operations in response to hflush/hsync) that properly advertises its StreamCapabilities to say that it supports the operations we need. Alternatively, we could make our own implementation of FileSystem that uses NIO. Either of these approaches would solve both problems.

I find this approach more palatable than a custom quick-start binary tarball.

> On Wed, Apr 15, 2020 at 11:40 AM Nick Dimiduk <ndimi...@apache.org> wrote:
> >
> > Hi folks,
> >
> > I'd like to bring up the topic of the experience of new users as it pertains to use of the `LocalFileSystem` and its associated (lack of) data durability guarantees. By default, an unconfigured HBase runs with its root directory on a `file:///` path. This path is picked up as an instance of `LocalFileSystem`. Hadoop has long offered this class, but it has never supported the `hsync` or `hflush` stream characteristics. Thus, when HBase runs in this configuration, it is unable to ensure that WAL writes are durable, and so will ACK a write without this assurance. This is the case even when running in a fully durable WAL mode.
> >
> > This impacts a new user, someone kicking the tires on HBase following our Getting Started docs. On Hadoop 2.8 and before, an unconfigured HBase will WARN and carry on. On Hadoop 2.10+, HBase will refuse to start. The book describes disabling stream capability enforcement as a first step. This is a mandatory configuration for running HBase directly out of our binary distribution.
> >
> > HBASE-24086 restores the behavior on Hadoop 2.10+ to that of running on 2.8: log a warning and carry on. The critique of this approach is that it's far too subtle, too quiet for a system operating in a state known not to provide data durability.
> >
> > I have two assumptions/concerns around the state of things, which prompted my solution on HBASE-24086 and the associated doc update on HBASE-24106.
> >
> > 1. No one should be running a production system on `LocalFileSystem`.
> >
> > The initial implementation checked both for `LocalFileSystem` and `hbase.cluster.distributed`. When running on the former and the latter is false, we assume the user is running a non-production deployment and carry on with the warning. When the latter is true, we assume the user intended a production deployment, and the process terminates due to stream capability enforcement. Subsequent code review resulted in skipping the `hbase.cluster.distributed` check and simply warning, as was done on 2.8 and earlier.
> >
> > (As I understand it, we've long used the `hbase.cluster.distributed` configuration to decide whether the user intends this runtime to be a production deployment or not.)
> >
> > Is this a faulty assumption? Is there a use case we support where we condone running a production deployment on the non-durable `LocalFileSystem`?
> >
> > 2. The Quick Start experience should require no configuration at all.
> >
> > Our stack is difficult enough to run in a fully durable production environment. We should make it a priority to ensure it's as easy as possible to try out HBase. Forcing a user to make decisions about data durability before they even launch the web UI is a terrible experience, in my opinion, and should be a non-starter for us as a project.
> >
> > (In my opinion, the need to configure either `hbase.rootdir` or `hbase.tmp.dir` away from `/tmp` is equally bad for a Getting Started experience. It is a second, more subtle question of data durability that we should avoid out of the box. But I'm happy to leave that for another thread.)
> >
> > Thank you for your time,
> > Nick
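[Editor's note] For readers following along, the configuration being debated above looks roughly like this in `hbase-site.xml`. The property name `hbase.unsafe.stream.capability.enforce` is the one recent HBase 2.x releases use; the paths shown are placeholder examples for a throwaway sandbox only. As the thread stresses, WAL writes are NOT durable with enforcement disabled, so this must never be used for production.

```xml
<!-- Sandbox-only example configuration; never use on a production cluster. -->
<configuration>
  <!-- Keep data out of /tmp, which the OS may clear on reboot.
       The path here is an arbitrary example. -->
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/testuser/hbase</value>
  </property>
  <!-- false = standalone mode; the flag the thread says signals
       "not a production deployment". -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
  <!-- Allow startup on LocalFileSystem despite the missing hflush/hsync
       capabilities. WAL writes can be silently lost with this set. -->
  <property>
    <name>hbase.unsafe.stream.capability.enforce</name>
    <value>false</value>
  </property>
</configuration>
```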
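[Editor's note] The enforcement decision discussed in the thread -- ask the WAL's output stream whether it supports hflush/hsync, then fail loud or warn-and-carry-on -- can be sketched without a Hadoop classpath. The tiny `StreamCaps` interface below is a simplified stand-in for Hadoop's real `org.apache.hadoop.fs.StreamCapabilities`; every class name here is invented for illustration and none of this is HBase's actual code.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;

public class CapabilityCheckDemo {

  /** Simplified stand-in for Hadoop's StreamCapabilities interface. */
  interface StreamCaps {
    boolean hasCapability(String capability);
  }

  /** Like LocalFileSystem's streams: cannot promise durable syncs. */
  static class NonDurableStream extends ByteArrayOutputStream implements StreamCaps {
    @Override public boolean hasCapability(String capability) {
      return false; // no hflush, no hsync
    }
  }

  /** Like an HDFS stream: advertises hflush and hsync support. */
  static class DurableStream extends ByteArrayOutputStream implements StreamCaps {
    @Override public boolean hasCapability(String capability) {
      return "hflush".equals(capability) || "hsync".equals(capability);
    }
  }

  /**
   * The decision point the thread debates: when the stream cannot
   * guarantee durability, either refuse to start (enforce) or log a
   * warning and carry on.
   */
  static String checkWalStream(OutputStream out, boolean enforce) {
    boolean durable = out instanceof StreamCaps
        && ((StreamCaps) out).hasCapability("hflush")
        && ((StreamCaps) out).hasCapability("hsync");
    if (durable) {
      return "ok";
    }
    return enforce ? "refuse-to-start" : "warn-and-carry-on";
  }

  public static void main(String[] args) {
    System.out.println(checkWalStream(new DurableStream(), true));     // ok
    System.out.println(checkWalStream(new NonDurableStream(), true));  // refuse-to-start
    System.out.println(checkWalStream(new NonDurableStream(), false)); // warn-and-carry-on
  }
}
```

Under this framing, HBASE-24086 amounts to flipping the `enforce` branch for `LocalFileSystem` from "refuse-to-start" back to "warn-and-carry-on", which is exactly the subtlety being critiqued.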