[ 
https://issues.apache.org/jira/browse/HBASE-21831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053823#comment-17053823
 ] 

Andrew Kyle Purtell commented on HBASE-21831:
---------------------------------------------

[~sandeep.pal] are you working on this?

> Optional store-and-forward of simple mutations for regions in transition
> ------------------------------------------------------------------------
>
>                 Key: HBASE-21831
>                 URL: https://issues.apache.org/jira/browse/HBASE-21831
>             Project: HBase
>          Issue Type: New Feature
>          Components: regionserver, rpc
>            Reporter: Andrew Kyle Purtell
>            Assignee: Sandeep Pal
>            Priority: Major
>             Fix For: 3.0.0, 1.7.0, 2.4.0
>
>
> We have an internal service built on Redis that is considering writing 
> through to HBase directly for their persistence needs. Their current 
> experience with Redis is
>  * Average write latency is ~milliseconds
>  * p999 write latencies are "a few seconds"
> They want a similar experience when writing simple values directly to HBase. 
> Infrequent exceptions to this would be acceptable. 
>  * Availability of 99.9% for writes
>  * Expect most writes to be serviced within a few milliseconds, e.g. a few 
> millis at p95. Still evaluating what the requirement should be (~millis at 
> p90 vs p95 vs p99).
>  * Timeout of 2 seconds, should be rare
> There is a fallback plan under consideration for when HBase cannot respond 
> within 2 seconds. However, this fallback cannot guarantee durability: Redis 
> or the service's daemons may go down. They want HBase to provide the 
> required durability.
> Because this is a caching service, where all writes are expected to be served 
> again from cache, at least for a while, if HBase were to accept writes such 
> that they are not immediately visible, it could be fine that they are not 
> visible for 10-20 minutes in the worst case. This is relatively easy to 
> achieve as an engineering target should we consider offering a write option 
> that does not guarantee immediate visibility. (A proposal follows below.) We 
> are considering store-and-forward of simple mutations and perhaps also simple 
> deletes, although the latter is not a hard requirement. Out of order 
> processing of this subset of mutation requests is acceptable because their 
> data model ensures all values are immutable. Presumably on the HBase side the 
> timestamps of the requests would be set to the current server wall clock time 
> when received, so eventually when applied all are available with correct 
> temporal ordering (within the effective resolution of the server clocks). 
> Deletes which are not immediately applied (or failed) could cause application 
> level confusion, and although this would remain a concern for the general 
> case, for this specific use case, stale reads could be explained to and 
> tolerated by their users.
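> For illustration of the timestamping point, the existing client API already 
> leaves the timestamp to the server when none is supplied; under this proposal 
> the accepting server, rather than the eventual region host, would assign it. 
> A minimal sketch using only existing API (the table, family, and qualifier 
> names are arbitrary placeholders):
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.client.Table;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class ServerTimestampedPut {
>   static void write(Table table, byte[] row, byte[] value) throws IOException {
>     // No explicit timestamp is supplied, so the cell carries LATEST_TIMESTAMP
>     // until the server handling the mutation replaces it with its wall clock
>     // time; eventually-applied immutable values then sort correctly by version.
>     Put put = new Put(row);
>     put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("q"), value);
>     table.put(put);
>   }
> }
> {code}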
> The BigTable architecture assigns at most one server to serve a region at a 
> time. Region Replicas are an enhancement to the base BigTable architecture we 
> made in HBase which stands up two more read-only replicas for a given region, 
> meaning a client attempting a read has the option to fail very quickly over 
> from the primary to a replica for a (potentially stale) read, or distribute 
> read load over all replicas, or employ a hedged reading strategy. Enabling 
> region replicas and timeline consistency can lower the availability gap for 
> reads in the high percentiles from ~minutes to ~milliseconds. However, this 
> option will not help for write use cases wanting roughly the same thing, 
> because there can be no fail-over for writes. Writes must still go to the 
> active primary. When that region is in transition, writes must be held on the 
> client until it is redeployed. Or, if region replicas are not enabled, when 
> the sole region is in transition, again, writes must be held on the client 
> until the region is available again.
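> For contrast, this is what the existing read-side relaxation looks like at 
> the client, using only existing API: a Get marked with timeline consistency 
> may be served by a secondary replica, and the Result reports whether it may 
> be stale.
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.hbase.client.Consistency;
> import org.apache.hadoop.hbase.client.Get;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.Table;
>
> public class TimelineReadExample {
>   static Result read(Table table, byte[] row) throws IOException {
>     Get get = new Get(row);
>     // Allow the read to be served by the primary or by a secondary replica.
>     get.setConsistency(Consistency.TIMELINE);
>     Result result = table.get(get);
>     // result.isStale() reports whether a secondary served the read, in which
>     // case the returned value may lag the primary.
>     return result;
>   }
> }
> {code}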
> Regions enter the in-transition state for two reasons: failures, and 
> housekeeping (splits and merges, or balancing). Time to region redeployment 
> after failures depends on a number of factors, like how long it took for us 
> to become aware of the failure, and how long it takes to split the 
> write-ahead log of the failed server and distribute the recovered edits to 
> the reopening region(s). We could in theory improve this behavior by being 
> more predictive about declaring failure, like employing a phi accrual failure 
> detector to signal to the master from clients that a regionserver is sick. 
> Other time-to-recovery issues and mitigations are discussed in a number of 
> JIRAs and blog posts and not discussed further here. Regarding housekeeping 
> activities, splits and merges typically complete in under a second. However, 
> split times up to ~30 seconds have been observed at my place of employ in 
> rare conditions. In the instances I have investigated the cause is I/O stalls 
> on the datanodes and metadata request stalls in the namenode, so not 
> unexpected outlier cases. Mitigating these risks involves looking at split 
> and merge policies. These policies are pluggable, and policy choices can be 
> applied per table. In extreme cases, auto-splitting (and auto-merging) can be 
> disabled on performance sensitive tables and accomplished through manual 
> means during scheduled maintenance windows. Regions may also be moved by the 
> Balancer to avoid unbalanced loading over the available cluster resources. 
> During balancing, one or more regions are closed on some servers, then opened 
> on others. While closing, a region must flush all of its memstores, yet will 
> not accept any new requests during flushing, because it is closing. This can 
> lead to short availability gaps. The Balancer's strategy can be tuned, or on 
> clusters where any disruption is undesirable, the balancer can be disabled, 
> and enabled/invoked manually only during scheduled maintenance either by 
> admin API or by plugging in a custom implementation that does nothing. _While 
> these options are available, they are needlessly complex to consider for use 
> cases that can be satisfied with simple dogged store-and-forward of mutations 
> accepted on behalf of an unavailable region_. _It would be far simpler from 
> the user perspective to offer a new flag for mutation requests._ It may also 
> not be tenable to apply the global configuration changes discussed above in 
> this paragraph to a multitenant cluster. 
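> As a concrete sketch of the operational knobs mentioned above, using the 
> HBase 2.x admin API (branch-1 uses Admin#setBalancerRunning rather than 
> balancerSwitch; the table name is an arbitrary placeholder):
> {code:java}
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.TableName;
> import org.apache.hadoop.hbase.client.Admin;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
> import org.apache.hadoop.hbase.client.TableDescriptor;
> import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
>
> public class MaintenanceWindowTuning {
>   public static void main(String[] args) throws Exception {
>     try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
>          Admin admin = conn.getAdmin()) {
>       // Turn the balancer off until it is re-enabled in a maintenance window.
>       admin.balancerSwitch(false, true);
>
>       // Disable automatic splitting for one latency-sensitive table by swapping
>       // in DisabledRegionSplitPolicy; other tables keep the cluster default.
>       TableName tn = TableName.valueOf("latency_sensitive_table");
>       TableDescriptor td = TableDescriptorBuilder.newBuilder(admin.getDescriptor(tn))
>           .setRegionSplitPolicyClassName(
>               "org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy")
>           .build();
>       admin.modifyTable(td);
>     }
>   }
> }
> {code}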
> The requirement to always take writes even under partial failure conditions 
> is a prime motivator for the development of eventually consistent systems. 
> However, while those systems can accept writes under a wider range of failure 
> conditions than systems like HBase that strive for consistency, they cannot 
> guarantee those writes are immediately available for reads. Far from it. The 
> guarantees about data availability and freshness are reduced or eliminated in 
> eventually consistent designs. Consistent semantics remain highly desirable 
> even though we have to make availability tradeoffs. Eventually consistent 
> designs expose data inconsistency issues to their applications, and this is a 
> constant pain point for even the best developers. We want to retain HBase's 
> consistent semantics and operational model for the vast majority of use 
> cases. That said, we can look at some changes that improve the apparent 
> availability of an HBase cluster for a subset of simple mutation requests, 
> for use cases that want to relax some guarantees for writes in a similar 
> manner as we have done earlier for reads via the read replica feature.
> If we accept the requirement to always take writes whenever any server is 
> available, and there is no need to make them immediately visible, we can 
> introduce a new write request attribute that says "it is fine to accept this 
> on behalf of the now or future region holder, in a store-and-forward manner", 
> for a subset of possible write operations: Append and Increment require the 
> server to return the up-to-date result, so they are not possible. CheckAndXXX 
> operations likewise must be executed at the primary. Deletes could be 
> dangerous to apply out of order and so should not be accepted as a rule. 
> Perhaps simple deletes could be supported, if an additional safety valve is 
> switched off in configuration, but not DeleteColumn or DeleteFamily. Simple 
> Puts and multi-puts (batch Put[] or RowMutations) can be safely serviced. 
> This can still satisfy requirements for dogged persistence of simple writes 
> and benefit a range of use cases. Should the primary region be unavailable 
> for accepting this subset of writes, the client would contact another 
> regionserver, any regionserver, with the new operation flag set, and that 
> regionserver would then accept the write on behalf of the future holder of 
> the in-transition region. (Technically, a client could set this flag at any 
> time for any reason.) Regionservers to which writes are handed off must then 
> efficiently and doggedly drain their store-and-forward queue. This queue must 
> be durable across process failures and restarts. We can use existing 
> regionserver WAL facilities to support this. This would be similar in some 
> ways to how cross-cluster replication is implemented: edits are persisted to 
> the WAL at the source, and the WAL entries are later enumerated and applied at 
> the sink. The differences here are:
>  * Regionservers would accept edits for regions they are not currently 
> servicing.
>  * WAL replay must also handle these "martian" edits, adding them to the 
> store-and-forward queue of the regionserver recovering the WAL.
>  * Regionservers will queue such edits and apply them to the local cluster 
> instead of shipping them out; in other words, the local regionserver acts as 
> a replication sink, not a source.
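> A minimal sketch of what the client side of this might look like, assuming 
> the flag is carried as an operation attribute (the attribute name is 
> hypothetical, nothing here exists in HBase today, and a real implementation 
> might instead carry the flag as a first-class field in the mutation RPC):
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.client.Table;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class StoreAndForwardPut {
>   // Hypothetical attribute key for the proposed flag.
>   private static final String ALLOW_STORE_AND_FORWARD = "ALLOW_STORE_AND_FORWARD";
>
>   static void write(Table table, byte[] row, byte[] value) throws IOException {
>     Put put = new Put(row);
>     put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("q"), value);
>     // Mark the write as acceptable for store-and-forward: any regionserver may
>     // accept it on behalf of the in-transition region, with eventual visibility.
>     put.setAttribute(ALLOW_STORE_AND_FORWARD, new byte[] { 1 });
>     table.put(put);
>   }
> }
> {code}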
> There could be no guarantee on how soon requests accepted in this manner 
> become visible, and writes accepted into store-and-forward queues may be 
> applied out of order, although the timestamp component of HBase keys will 
> ensure correct temporal ordering (within the effective resolution of the 
> server clocks) after all are eventually applied. This is consistent with the 
> semantics one gets with eventually consistent systems. This would not be 
> default behavior, nor default semantics. HBase would continue to trade off 
> for consistency, unless this new feature/flag is enabled by an informed 
> party. This is consistent with the strategy we adopted for region replicas.
>  
>  Like with cross-cluster replication we would want to provide metrics on the 
> depth of the queues and maximum age of entries in these queues, so operators 
> can get a sense of how far behind they might be, to determine compliance with 
> application service level objectives.
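> A sketch of what that metrics surface might look like, named by loose analogy 
> with the existing replication source metrics (all names here are 
> hypothetical):
> {code:java}
> // All names hypothetical; nothing here exists in HBase today.
> public interface MetricsStoreAndForwardSource {
>   /** Number of accepted edits still waiting to be applied to their regions. */
>   long getQueueDepth();
>
>   /** Age in milliseconds of the oldest edit still waiting in the queue. */
>   long getMaxEntryAgeMs();
> }
> {code}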
> Implementation of this feature should satisfy compatibility policy 
> constraints such that minor releases can accept it. At the very least we 
> would require it in a new branch-1 minor. This is a hard requirement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
