[ https://issues.apache.org/jira/browse/HBASE-21831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053823#comment-17053823 ]
Andrew Kyle Purtell commented on HBASE-21831:
---------------------------------------------

[~sandeep.pal] are you working on this?

Optional store-and-forward of simple mutations for regions in transition
-------------------------------------------------------------------------

                Key: HBASE-21831
                URL: https://issues.apache.org/jira/browse/HBASE-21831
            Project: HBase
         Issue Type: New Feature
         Components: regionserver, rpc
           Reporter: Andrew Kyle Purtell
           Assignee: Sandeep Pal
           Priority: Major
            Fix For: 3.0.0, 1.7.0, 2.4.0


We have an internal service built on Redis that is considering writing through to HBase directly for their persistence needs. Their current experience with Redis is:
* Average write latency is ~milliseconds
* p999 write latencies are "a few seconds"

They want a similar experience when writing simple values directly to HBase. Infrequent exceptions to this would be acceptable:
* Availability of 99.9% for writes
* Most writes serviced within a few milliseconds, e.g. a few millis at p95. Still evaluating what the requirement should be (~millis at p90 vs p95 vs p99).
* Timeout of 2 seconds, which should be rare

There is a fallback plan under consideration for when HBase cannot respond within 2 seconds. However, this fallback cannot guarantee durability: Redis or the service's daemons may go down. They want HBase to provide the required durability.

Because this is a caching service, where all writes are expected to be served again from cache, at least for a while, if HBase were to accept writes such that they are not immediately visible, it could be fine that they are not visible for 10-20 minutes in the worst case. This is a relatively easy engineering target to achieve should we consider offering a write option that does not guarantee immediate visibility. (A proposal follows below.) We are considering store-and-forward of simple mutations and perhaps also simple deletes, although the latter is not a hard requirement. Out-of-order processing of this subset of mutation requests is acceptable because their data model ensures all values are immutable. Presumably on the HBase side the timestamps of the requests would be set to the current server wall clock time when received, so when all are eventually applied they are available with correct temporal ordering (within the effective resolution of the server clocks). Deletes which are not immediately applied (or failed) could cause application-level confusion, and although this would remain a concern for the general case, for this specific use case stale reads could be explained to and tolerated by their users.

The BigTable architecture assigns at most one server to serve a region at a time. Region replicas are an enhancement we made in HBase to the base BigTable architecture which stands up two more read-only replicas for a given region, meaning a client attempting a read has the option to fail over very quickly from the primary to a replica for a (potentially stale) read, or distribute read load over all replicas, or employ a hedged reading strategy. Enabling region replicas and timeline consistency can lower the availability gap for reads in the high percentiles from ~minutes to ~milliseconds. However, this option will not help write use cases wanting roughly the same thing, because there can be no fail-over for writes. Writes must still go to the active primary. When that region is in transition, writes must be held on the client until it is redeployed. Or, if region replicas are not enabled, when the sole region is in transition, again, writes must be held on the client until the region is available again.
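For reference, the read-side mitigation mentioned above is enabled roughly as follows. This is a minimal sketch against the HBase 2.x client API with a made-up table name, shown only to illustrate the region replica / timeline consistency feature that the write-side proposal below is analogous to.

{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class TimelineReadSketch {
  public static void main(String[] args) throws IOException {
    try (Connection conn = ConnectionFactory.createConnection();
         Admin admin = conn.getAdmin()) {
      TableName name = TableName.valueOf("cache_backing"); // hypothetical table name
      // Stand up one primary plus two read-only replicas for every region.
      admin.modifyTable(TableDescriptorBuilder
          .newBuilder(admin.getDescriptor(name))
          .setRegionReplication(3)
          .build());
      try (Table table = conn.getTable(name)) {
        Get get = new Get(Bytes.toBytes("row-key"));
        // Allow the read to fail over to a secondary replica (possibly stale).
        get.setConsistency(Consistency.TIMELINE);
        Result result = table.get(get);
        System.out.println("served stale? " + result.isStale());
      }
    }
  }
}
{code}

A result served by a secondary replica reports isStale() as true; the write-side flag proposed below would offer a similarly relaxed, explicitly opted-into guarantee.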
Regions enter the in-transition state for two reasons: failures, and housekeeping (splits, merges, and balancing). Time to region redeployment after a failure depends on a number of factors, like how long it took for us to become aware of the failure, and how long it takes to split the write-ahead log of the failed server and distribute the recovered edits to the reopening region(s). We could in theory improve this behavior by being more predictive about declaring failure, like employing a phi accrual failure detector to signal to the master from clients that a regionserver is sick. Other time-to-recovery issues and mitigations are discussed in a number of JIRAs and blog posts and not discussed further here.

Regarding housekeeping activities, splits and merges typically complete in under a second. However, split times up to ~30 seconds have been observed at my place of employ in rare conditions. In the instances I have investigated the cause was I/O stalls on the datanodes and metadata request stalls in the namenode, so not unexpected outlier cases. Mitigating these risks involves looking at split and merge policies. Split and merge policies are pluggable, and policy choices can be applied per table. In extreme cases, auto-splitting (and auto-merging) can be disabled on performance sensitive tables and accomplished through manual means during scheduled maintenance windows. Regions may also be moved by the Balancer to avoid unbalanced loading over the available cluster resources. During balancing, one or more regions are closed on some servers, then opened on others. While closing, a region must flush all of its memstores, yet will not accept any new requests during flushing, because it is closing. This can lead to short availability gaps. The Balancer's strategy can be tuned, or, on clusters where any disruption is undesirable, the balancer can be disabled and enabled/invoked manually only during scheduled maintenance, either by admin API or by plugging in a custom implementation that does nothing. _While these options are available, they are needlessly complex to consider for use cases that can be satisfied with simple dogged store-and-forward of mutations accepted on behalf of an unavailable region._ _It would be far simpler from the user perspective to offer a new flag for mutation requests._ It may also not be tenable to apply the global configuration changes discussed in this paragraph to a multitenant cluster.
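To illustrate the operational complexity being argued against, here is a hedged sketch of the per-table and cluster-wide knobs an operator would have to juggle today, using the HBase 2.x admin API and a made-up table name.

{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class ManualHousekeepingSketch {
  public static void main(String[] args) throws IOException {
    try (Connection conn = ConnectionFactory.createConnection();
         Admin admin = conn.getAdmin()) {
      // Pin a performance sensitive table to a policy that never auto-splits;
      // splits would then be performed manually during maintenance windows.
      TableName name = TableName.valueOf("latency_sensitive"); // hypothetical table name
      admin.modifyTable(TableDescriptorBuilder
          .newBuilder(admin.getDescriptor(name))
          .setRegionSplitPolicyClassName(
              "org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy")
          .build());
      // Turn the balancer off cluster-wide; re-enable or invoke it only
      // during scheduled maintenance.
      admin.balancerSwitch(false, true);
    }
  }
}
{code}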
The requirement to always take writes even under partial failure conditions is a prime motivator for the development of eventually consistent systems. However, while those systems can accept writes under a wider range of failure conditions than systems which, like HBase, strive for consistency, they cannot guarantee those writes are immediately available for reads. Far from it: the guarantees about data availability and freshness are reduced or eliminated in eventually consistent designs. Consistent semantics remain highly desirable even though we have to make availability tradeoffs. Eventually consistent designs expose data inconsistency issues to their applications, and this is a constant pain point for even the best developers. We want to retain HBase's consistent semantics and operational model for the vast majority of use cases. That said, we can look at some changes that improve the apparent availability of an HBase cluster for a subset of simple mutation requests, for use cases that want to relax some guarantees for writes in a similar manner as we have done earlier for reads via the read replica feature.

If we accept the requirement to always accept writes, if any server is available, and there is no need to make them immediately visible, we can introduce a new write request attribute that says "it is fine to accept this on behalf of the current or future region holder, in a store-and-forward manner", for a subset of possible write operations. Append and Increment require the server to return the up-to-date result, so are not possible. CheckAndXXX operations likewise must be executed at the primary. Deletes could be dangerous to apply out of order and so should not be accepted as a rule. Perhaps simple deletes could be supported, if an additional safety valve is switched off in configuration, but not DeleteColumn or DeleteFamily. Simple Puts and multi-puts (batch Put[] or RowMutations) can be safely serviced. This can still satisfy requirements for dogged persistence of simple writes and benefit a range of use cases. Should the primary region be unavailable for accepting this subset of writes, the client would contact another regionserver, any regionserver, with the new operation flag set, and that regionserver would then accept the write on behalf of the future holder of the in-transition region. (Technically, a client could set this flag at any time for any reason.) Regionservers to which writes are handed off must then efficiently and doggedly drain their store-and-forward queue. This queue must be durable across process failures and restarts. We can use existing regionserver WAL facilities to support this. This would be similar in some ways to how cross-cluster replication is implemented: edits are persisted to the WAL at the source, and the WAL entries are later enumerated and applied at the sink. The differences here are:
* Regionservers would accept edits for regions they are not currently servicing.
* WAL replay must also handle these "martian" edits, adding them to the store-and-forward queue of the regionserver recovering the WAL.
* Regionservers will queue such edits and apply them to the local cluster instead of shipping them out; in other words, the local regionserver acts as a replication sink, not a source.

There could be no guarantee on the eventual visibility of requests accepted in this manner, and writes accepted into store-and-forward queues may be applied out of order, although the timestamp component of HBase keys will ensure correct temporal ordering (within the effective resolution of the server clocks) after all are eventually applied. This is consistent with the semantics one gets with eventually consistent systems. This would not be default behavior, nor default semantics. HBase would continue to trade off in favor of consistency unless this new feature/flag is enabled by an informed party. This is consistent with the strategy we adopted for region replicas.
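As an illustration of how a client might opt in, here is a hypothetical sketch that reuses the existing operation attribute mechanism on Mutation; the attribute name is made up and not a committed API.

{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreAndForwardPutSketch {
  public static void main(String[] args) throws IOException {
    try (Connection conn = ConnectionFactory.createConnection();
         Table table = conn.getTable(TableName.valueOf("cache_backing"))) { // hypothetical table
      Put put = new Put(Bytes.toBytes("row-key"));
      put.addColumn(Bytes.toBytes("v"), Bytes.toBytes("payload"),
          Bytes.toBytes("immutable-value"));
      // Hypothetical opt-in flag: "any regionserver may accept this write on
      // behalf of an in-transition region, with no immediate-visibility
      // guarantee". The attribute name is illustrative, not a committed API.
      put.setAttribute("hbase.client.store.and.forward", Bytes.toBytes(true));
      table.put(put);
    }
  }
}
{code}

How the flag is actually carried on the wire (an operation attribute, a new field in the RPC, or something else) would be settled during implementation.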
Like with cross-cluster replication, we would want to provide metrics on the depth of the queues and maximum age of entries in these queues, so operators can get a sense of how far behind they might be, to determine compliance with application service level objectives.

Implementation of this feature should satisfy compatibility policy constraints such that minor releases can accept it. At the very least we would require it in a new branch-1 minor. This is a hard requirement.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)