On Tue, Dec 3, 2013 at 11:37 AM, Enis Söztutar <[email protected]> wrote:
> Responses inlined.
>
> On Mon, Dec 2, 2013 at 10:00 PM, Jonathan Hsieh <[email protected]> wrote:
> >
> > For the most efficient consistent read-recovery (shadow regions/memstores),
> > it would make sense to have them assigned to the rs's where the HLogs are
> > local. Thus this approach would want to assign shadow regions for regions
> > X, Y, and Z on RS-L and RS-M.
> >
>
> I don't think this is the case.

Clarification: for achieving the goal of low-latency recovery using the fewest
resources (efficiency toward that goal), this is the design point that best
achieves it.

> Recovery is a multi-step process, and reading and applying the log is only
> one step.

Yes, and the replay needed to recover read consistency seems to be one of the
more expensive steps.

> After the region is opened, you definitely want the data files to be local
> as much as possible.

Ideal, but not necessary for correctness or for faster consistent-read
recovery. Today we don't have data files local as much as possible, and normal
HBase users can't use the feature yet.

> Considering the relative sizes of the files and the WALs, I think we will
> always want to use hdfs affinity groups for hfiles rather than hlogs to
> assign secondary replicas. This will help both stale reads and local reads
> in case of a promotion to primary.

This is why I'm suggesting separating the two functions (read replica and
shadow memstore) into separate logical types with different placement options.
I agree with you on selecting the hfile-related region servers for read
replicas (which I hadn't fully considered when I was working on the shadow
memstore writeup). However, for minimizing recovery time, replay is the more
costly step. I don't think hlog tailers should be coupled to the hfile
affinity groups, for the same reasons you give -- n^2 tailers (n being the
number of region servers).

> > A simple optimal solution for both read replicas and shadow regions would
> > be to assign the regions and the HLog to the same set of machines, so that
> > the RS's hosting the logs and regions X, Y, and Z are the same machines --
> > let's say RS-A, RS-H, and RS-I. This has some non-optimal balancing
> > ramifications upon machine failure -- the work of RS-A would be split
> > between RS-H and RS-I.
>
> I don't think we want this. This implies that we are creating region
> assignment groups (group-based assignment as described in the doc). The
> problem is that in case of a crash, you cannot evenly distribute out the
> regions from the primary, otherwise you will still end up tailing all the
> logs for all the region servers. Plus if you want to load balance, it will
> be even harder to satisfy the constraints while keeping the balance.
>
> In your example, if you have replication=2 for example, we cannot simply
> move all the primary regions of RS-A to RS-H, which will then suddenly have
> twice the number of regions.

I think a realistic use would be to set replication=3, since we have three
replicas of the logs. The client would then just choose to hit the first two
replicas (rep0=primary and rep1=secondary) to reduce the memory pressure on
the third node (rep2=secondary, not read from). Here's an extension to this
hybrid approach which potentially buys us both good recency and high
availability (at the cost of poor balance if we enforce optimal locality): we
essentially assign a group of regions to the same three RS's, and place the
regions and hlogs in that group in a little cycle.
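To make the "little cycle" concrete, here is a rough sketch of how the
rotation could be computed. This is illustrative only -- CyclicGroupPlacement
and everything in it is made up for this mail, not existing balancer code. The
example right after it follows the same rotation.

// Rough sketch (illustrative only, not existing HBase balancer code): place
// the replicas of each region in a group on the same three region servers,
// rotated so that each server holds one primary (rep=0) and two secondaries.
// The WAL of each server in the group would likewise be pinned (via favored
// nodes) to the other members of the group.
import java.util.ArrayList;
import java.util.List;

public class CyclicGroupPlacement {

  /**
   * Replica r of region i is hosted on servers[floorMod(r - i, n)], i.e.
   * server j holds replica (i + j) mod n of region i.  Returned list: for
   * each region, the host of rep=0, rep=1, ..., rep=n-1.
   */
  static List<String[]> assign(List<String> servers, List<String> regions) {
    int n = servers.size();
    List<String[]> placements = new ArrayList<>();
    for (int i = 0; i < regions.size(); i++) {
      String[] hostsByReplica = new String[n];
      for (int r = 0; r < n; r++) {
        hostsByReplica[r] = servers.get(Math.floorMod(r - i, n));
      }
      placements.add(hostsByReplica);
    }
    return placements;
  }

  public static void main(String[] args) {
    List<String> group = List.of("RS-A", "RS-B", "RS-C");
    List<String> regions = List.of("X", "Y", "Z");
    List<String[]> placements = assign(group, regions);
    for (int i = 0; i < regions.size(); i++) {
      System.out.println("Region " + regions.get(i)
          + " replicas (rep=0,1,2) on: " + String.join(", ", placements.get(i)));
    }
    // Region X replicas (rep=0,1,2) on: RS-A, RS-B, RS-C
    // Region Y replicas (rep=0,1,2) on: RS-C, RS-A, RS-B
    // Region Z replicas (rep=0,1,2) on: RS-B, RS-C, RS-A
  }
}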
Ex: Region X on RS-A (rep=0), RS-B (rep=1), RS-C (rep=2). Region Y on RS-A
(rep=1), RS-B (rep=2), RS-C (rep=0). Region Z on RS-A (rep=2), RS-B (rep=0),
RS-C (rep=1). RS-A's log on RS-A, RS-B, RS-C. RS-B's log on RS-B, RS-C, RS-A.
RS-C's log on RS-C, RS-A, RS-B. If any RS in the group goes down, its load is
spread between the other two in the group.

> > A more complex solution for both would be to choose machines for the
> > purpose they are best suited for. Read replicas are hosted on their
> > respective machines, and shadow region memstores on the hlog's rs's.
> > Promotion becomes a more complicated dance where, upon RS-A's failure, we
> > have the log-tailing shadow region catch up and perform a flush of the
> > affected memstores to the appropriate hdfs affinity group/favored nodes.
> > So the shadow memstore for region X would flush the hfile to A,B,C and
> > region Y to A,D,E. Then the read replicas would be promoted (close
> > secondary, open as primary) based on where the regions'/hfiles' affinity
> > group is. This feels like an optimization done in the 2nd or 3rd rev.
>
> I think we do not want to differentiate between RS's by splitting them
> between primaries and shadows. This will complicate provisioning,
> administration, monitoring and load balancing a lot, and will not achieve
> very cheap secondary region promotions (because you have to move the region
> still as you described).

I think there is a misunderstanding here -- clarifying. In this combined
approach, we have a pool of RS's, each of which can host a combination of
primary regions, secondary read-replica regions, and shadow memstore regions.
If we don't separate out the shadow memstore regions from the secondary read
replicas, we end up with the inefficient design implied in the read replica
writeup -- where every region server essentially has to read all the hlogs of
the other region servers, causing n^2 tailers across the cluster instead of n
or 2n tailers. We will want different monitoring metrics and logging for the
read replicas and the shadow memstores; ideally we'd know how far behind we
are. Also, in the combined approach there isn't a region move here -- the
flush is completed by the shadow to the nodes that are assigned as the
secondaries, then we promote the secondary to primary by closing and then
opening the region as primary.

Ex: Region X on RS-A (rep=0), RS-B (rep=1), RS-C (rep=2). Region Y on RS-A
(rep=0), RS-D (rep=1), RS-E (rep=2). RS-A's log on RS-A, RS-F, RS-G. RS-F and
RS-G shadow RS-A's HLog. RS-A goes down. RS-F and RS-G catch up to the end of
RS-A's HLog. RS-F flushes the region X shadow memstore to nodes RS-B, RS-C
(and some other node). RS-G flushes the region Y shadow memstore to nodes
RS-D, RS-E (and some other node). The master promotes region X's secondary on
RS-B (closeReplica, open X) with all its stores local to RS-B and RS-C. The
master promotes region Y's secondary on RS-D (closeReplica, open Y) with all
its stores local to RS-D and RS-E. (I've appended a rough sketch of this
failover sequence below my sig.)

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [email protected]
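P.S. As promised, a rough sketch of the combined-approach failover sequence
above. This is purely illustrative -- CombinedFailoverSketch, ShadowRegion,
Master, and all the method names are made up for this example, not existing
HBase APIs.

// Rough sketch of the combined-approach failover for RS-A (illustrative only;
// these classes/methods are hypothetical, not existing HBase APIs).
import java.util.List;
import java.util.Map;

public class CombinedFailoverSketch {

  /** Hypothetical handle to a shadow memstore tailing the dead RS's HLog. */
  interface ShadowRegion {
    void catchUpToEndOfLog();                 // replay the remaining WAL edits
    void flushTo(List<String> favoredNodes);  // write the hfile to the secondaries' nodes
  }

  /** Hypothetical master-side hook for promoting a secondary replica. */
  interface Master {
    void closeReplica(String region, String rs);
    void openAsPrimary(String region, String rs);
  }

  /**
   * On RS-A's failure: each shadow catches up on RS-A's HLog, flushes its
   * memstore to the favored nodes of that region's secondaries, and then the
   * master promotes the chosen secondary -- no region data has to move.
   */
  static void handlePrimaryFailure(Master master,
                                   Map<String, ShadowRegion> shadowsByRegion,
                                   Map<String, List<String>> favoredNodesByRegion,
                                   Map<String, String> promotionTarget) {
    for (Map.Entry<String, ShadowRegion> e : shadowsByRegion.entrySet()) {
      String region = e.getKey();
      ShadowRegion shadow = e.getValue();
      shadow.catchUpToEndOfLog();                        // e.g. RS-F for region X, RS-G for Y
      shadow.flushTo(favoredNodesByRegion.get(region));  // e.g. RS-B, RS-C for region X
      String rs = promotionTarget.get(region);           // e.g. RS-B for X, RS-D for Y
      master.closeReplica(region, rs);                   // close the secondary replica
      master.openAsPrimary(region, rs);                  // reopen as primary, stores local
    }
  }
}

The point the sketch tries to make is the one in the mail: the shadow's flush
lands on the secondaries' favored nodes, so promotion is just a close/reopen
with local hfiles rather than a region move.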
