On Wed, Apr 15, 2020 at 10:05 AM Sean Busbey <bus...@apache.org> wrote:
> I think the first assumption no longer holds. Especially with the move to flexible compute environments, I regularly get asked by folks what the smallest HBase they can start with for production. I can keep saying 3/5/7 nodes or whatever, but I guarantee there are folks who want to and will run HBase with a single node. Probably those deployments won't want to have the distributed flag set. None of them really have a good option for where the WALs go, and failing loud when they try to go to LocalFileSystem is the best option I've seen so far to make sure folks realize they are getting into muddy waters.

I think this is where we disagree. My answer to this same question is 12 nodes: 3 "coordinator" hosts for HA ZK, HDFS, and the HBase master, plus 9 "worker" hosts for replicated data serving and storage. Tweak the number of workers and the replication factor if you like, but that's how you get a durable, available deployment suitable for an online production solution. Anything smaller than this and you're in the "muddy waters" of under-replicated distributed system failure domains.

> I agree with the second assumption. Our quickstart in general is too complicated. Maybe if we include big warnings in the guide itself, we could make a quickstart-specific artifact to download that has the unsafe disabling config in place?

I'm not a fan of a dedicated artifact as a binary tarball. I think that approach fractures the brand of our product and reinforces the idea that it's even more complicated. If we want a dedicated quick-start experience, I would advocate investing the resources in something more like a learning laboratory, accompanied by a runtime image in a VM or container.

> Last fall I toyed with the idea of adding an "hbase-local" module to the hbase-filesystem repo that could start us out with some optimizations for single-node setups.
> We could start with a fork of RawLocalFileSystem (which will call OutputStream flush operations in response to hflush/hsync) that properly advertises its StreamCapabilities to say that it supports the operations we need. Alternatively, we could make our own implementation of FileSystem that uses NIO. Either of these approaches would solve both problems.

I find this approach more palatable than a custom quick-start binary tarball.

> On Wed, Apr 15, 2020 at 11:40 AM Nick Dimiduk <ndimi...@apache.org> wrote:
> >
> > Hi folks,
> >
> > I'd like to bring up the topic of the experience of new users as it pertains to use of the `LocalFileSystem` and its associated (lack of) data durability guarantees. By default, an unconfigured HBase runs with its root directory on a `file:///` path. This path is picked up as an instance of `LocalFileSystem`. Hadoop has long offered this class, but it has never supported the `hsync` or `hflush` stream characteristics. Thus, when HBase runs in this configuration, it is unable to ensure that WAL writes are durable, and so will ACK a write without this assurance. This is the case even when running in a fully durable WAL mode.
> >
> > This impacts a new user, someone kicking the tires on HBase following our Getting Started docs. On Hadoop 2.8 and before, an unconfigured HBase will WARN and carry on. On Hadoop 2.10+, HBase will refuse to start. The book describes disabling stream capability enforcement as a first step. This is a mandatory configuration for running HBase directly out of our binary distribution.
> >
> > HBASE-24086 restores the behavior on Hadoop 2.10+ to that of running on 2.8: log a warning and carry on. The critique of this approach is that it's far too subtle, too quiet for a system operating in a state known not to provide data durability.
> >
> > I have two assumptions/concerns around the state of things, which prompted my solution on HBASE-24086 and the associated doc update on HBASE-24106.
> >
> > 1. No one should be running a production system on `LocalFileSystem`.
> >
> > The initial implementation checked both for `LocalFileSystem` and `hbase.cluster.distributed`. When running on the former and the latter is false, we assume the user is running a non-production deployment and carry on with the warning. When the latter is true, we assume the user intended a production deployment, and the process terminates due to stream capability enforcement. Subsequent code review resulted in skipping the `hbase.cluster.distributed` check and simply warning, as was done on 2.8 and earlier.
> >
> > (As I understand it, we've long used the `hbase.cluster.distributed` configuration to decide whether the user intends this runtime to be a production deployment or not.)
> >
> > Is this a faulty assumption? Is there a use case we support where we condone running a production deployment on the non-durable `LocalFileSystem`?
> >
> > 2. The Quick Start experience should require no configuration at all.
> >
> > Our stack is difficult enough to run in a fully durable production environment. We should make it a priority to ensure it's as easy as possible to try out HBase. Forcing a user to make decisions about data durability before they even launch the web UI is a terrible experience, in my opinion, and should be a non-starter for us as a project.
> >
> > (In my opinion, the need to configure either `hbase.rootdir` or `hbase.tmp.dir` away from `/tmp` is equally bad for a Getting Started experience. It is a second, more subtle question of data durability that we should avoid out of the box. But I'm happy to leave that for another thread.)
> >
> > Thank you for your time,
> > Nick
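[Editor's note] For readers following along, the configuration being debated above looks roughly like this in `hbase-site.xml`. The property name `hbase.unsafe.stream.capability.enforce` is the one recent HBase 2.x releases use; the paths shown are placeholder examples for a throwaway sandbox only. As the thread stresses, WAL writes are NOT durable with enforcement disabled, so this must never be used for production.

```xml
<!-- Sandbox-only example configuration; never use on a production cluster. -->
<configuration>
  <!-- Keep data out of /tmp, which the OS may clear on reboot.
       The path here is an arbitrary example. -->
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/testuser/hbase</value>
  </property>
  <!-- false = standalone mode; the flag the thread says signals
       "not a production deployment". -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
  <!-- Allow startup on LocalFileSystem despite the missing hflush/hsync
       capabilities. WAL writes can be silently lost with this set. -->
  <property>
    <name>hbase.unsafe.stream.capability.enforce</name>
    <value>false</value>
  </property>
</configuration>
```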
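[Editor's note] The enforcement decision discussed in the thread -- ask the WAL's output stream whether it supports hflush/hsync, then fail loud or warn-and-carry-on -- can be sketched without a Hadoop classpath. The tiny `StreamCaps` interface below is a simplified stand-in for Hadoop's real `org.apache.hadoop.fs.StreamCapabilities`; every class name here is invented for illustration and none of this is HBase's actual code.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;

public class CapabilityCheckDemo {

  /** Simplified stand-in for Hadoop's StreamCapabilities interface. */
  interface StreamCaps {
    boolean hasCapability(String capability);
  }

  /** Like LocalFileSystem's streams: cannot promise durable syncs. */
  static class NonDurableStream extends ByteArrayOutputStream implements StreamCaps {
    @Override public boolean hasCapability(String capability) {
      return false; // no hflush, no hsync
    }
  }

  /** Like an HDFS stream: advertises hflush and hsync support. */
  static class DurableStream extends ByteArrayOutputStream implements StreamCaps {
    @Override public boolean hasCapability(String capability) {
      return "hflush".equals(capability) || "hsync".equals(capability);
    }
  }

  /**
   * The decision point the thread debates: when the stream cannot
   * guarantee durability, either refuse to start (enforce) or log a
   * warning and carry on.
   */
  static String checkWalStream(OutputStream out, boolean enforce) {
    boolean durable = out instanceof StreamCaps
        && ((StreamCaps) out).hasCapability("hflush")
        && ((StreamCaps) out).hasCapability("hsync");
    if (durable) {
      return "ok";
    }
    return enforce ? "refuse-to-start" : "warn-and-carry-on";
  }

  public static void main(String[] args) {
    System.out.println(checkWalStream(new DurableStream(), true));     // ok
    System.out.println(checkWalStream(new NonDurableStream(), true));  // refuse-to-start
    System.out.println(checkWalStream(new NonDurableStream(), false)); // warn-and-carry-on
  }
}
```

Under this framing, HBASE-24086 amounts to flipping the `enforce` branch for `LocalFileSystem` from "refuse-to-start" back to "warn-and-carry-on", which is exactly the subtlety being critiqued.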