On Tue, Dec 3, 2013 at 11:37 AM, Enis Söztutar <enis....@gmail.com> wrote:

> Responses inlined.
>
> On Mon, Dec 2, 2013 at 10:00 PM, Jonathan Hsieh <j...@cloudera.com> wrote:
>
> > For the most efficient consistent read-recovery (shadow
> regions/memstores),
> > it would make sense to have them assigned to the rs's where the Hlogs are
> > local. Thus this approach would want to assign shadow regions for regions
> > X, Y, and Z on RS-L and RS-M.
> >
>
> I don't think this is the case.


Clarification: for achieving low-latency recovery with the fewest resources,
this is the design point that best achieves that goal.


> Recovery is a multi step process, and
> reading and
> applying the log is only one step.


Yes, and the replay needed to recover read consistency seems to be one of the
more expensive steps.


> After the region is opened, you
> definitely want
> the data files to be local as much as possible.


Ideal, but not necessary for correctness or for faster consistent-read
recovery.  Today we don't have data files local as much as possible, and
normal HBase users can't use the feature yet.


> Considering the relative
> sizes of
> the files and the WALs, I think we will always want to use hdfs affinity
> groups for
> hfiles rather than hlogs to assign secondary replicas. This will help both
> stale reads
> and local reads in case of a promotion to primary.
>
>
>
This is why I'm suggesting separating the two functions (read replica and
shadow memstore) into separate logical types with different placement
options.

I agree with you on selecting hfile-related region servers for read
replicas (which I hadn't fully considered when I was working on the shadow
memstore writeup).  However, for minimizing recovery time, replay is the
more costly step (otherwise we end up with the n^2 tailers).

I don't think hlog tailers should be coupled to the hfile affinity groups,
for the same reasons you give -- n^2 tailers, where n is the number of regions.

 >
> > A simple optimal solution for both read replicas and shadow regions would
> > be to assign the regions and the HLog to the same set of machines so that
> > the RS's for the logs and region x, y, and z hosted are on the same
> > machines -- let's say RS-A, RS-H, and RS-I.  This has some non-optimal
> > balancing ramifications upon machine failure -- the work of RS-A would be
> > split between RS-H and RS-I.
> >
>
> I don't think we want this. This implies that we are creating region
> assignment groups ( group-based
> assignment as described in the doc). The problem is that in case of a
> crash, you cannot evenly
> distribute the regions from the primary; otherwise you will still end up
> tailing all the logs for
> all the region servers. Plus if you want to load balance, it will be even
> harder to satisfy the constraints while
> keeping the balance.
>
> In your example, if you have replication=2 for example, we cannot simply
> move all the primary regions
> of RS-A to RS-H, which will then suddenly have twice the number of regions.
>
>
I think a realistic use would be to set replication to 3, since we have three
replicas of the logs.  The client would then just choose to hit the
first two replicas (rep0=primary and rep1=secondary) to reduce the memory
pressure on the 3rd node (rep2=secondary, not read from).
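
As a sketch of what that client-side choice might look like (a hypothetical
helper, not an existing HBase client API) -- region replication is 3, but
reads only ever go to replica ids 0 and 1:

import java.util.Arrays;
import java.util.List;

/**
 * Sketch: region replication is 3 (matching the three hlog replicas), but the
 * client only reads replica ids 0 and 1, so the third replica never serves
 * reads.  Hypothetical helper, not an existing HBase client API.
 */
public class ReadableReplicaPolicy {
  // Only the primary (0) and the first secondary (1) are read from.
  private static final List<Integer> READABLE_REPLICAS = Arrays.asList(0, 1);

  /** Returns true if the client may send a read to this replica id. */
  static boolean isReadable(int replicaId) {
    return READABLE_REPLICAS.contains(replicaId);
  }

  public static void main(String[] args) {
    for (int replicaId = 0; replicaId < 3; replicaId++) {
      System.out.println("replica " + replicaId + " readable: " + isReadable(replicaId));
    }
  }
}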

Here's an extension to this hybrid approach which potentially buys us both
good recency and high availability (at the cost of poor balance if we
enforce optimal locality).  We essentially assign a group of regions, and the
hlogs in that group, to the same three RS's in a little cycle.

Ex:
Region X on RS-A (rep=0), RS-B(rep=1), RS-C(rep=2).
Region Y on RS-A (rep=1), RS-B(rep=2), RS-C(rep=0).
Region Z on RS-A (rep=2), RS-B(rep=0), RS-C(rep=1).
RS-A's log on RS-A, RS-B, RS-C.
RS-B's log on RS-B, RS-C, RS-A.
RS-C's log on RS-C, RS-A, RS-B.

If any RS goes down, the load is spread between the other two in the group.
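
To make the cycle concrete, here is a minimal sketch (plain Java, not HBase
code; the assignment scheme and names just mirror the X/Y/Z example above) of
rotating region replicas and hlog replicas within one three-server group:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of the group-cycle placement from the example above.  Replica i of
 * region j goes to server (i - j) mod groupSize, and each server's hlog is
 * replicated to itself plus the next servers in the cycle.  Illustrative
 * only -- not an existing HBase balancer.
 */
public class GroupCyclePlacement {

  static Map<String, List<String>> assignRegions(List<String> regions, List<String> servers) {
    int size = servers.size();
    Map<String, List<String>> placement = new LinkedHashMap<>();
    for (int j = 0; j < regions.size(); j++) {
      List<String> replicas = new ArrayList<>();
      for (int i = 0; i < size; i++) {
        // List index == replica id; matches "Region Y on RS-A (rep=1), ..." above.
        replicas.add(servers.get(((i - j) % size + size) % size));
      }
      placement.put(regions.get(j), replicas);
    }
    return placement;
  }

  static Map<String, List<String>> assignLogs(List<String> servers) {
    int size = servers.size();
    Map<String, List<String>> placement = new LinkedHashMap<>();
    for (int j = 0; j < size; j++) {
      List<String> replicas = new ArrayList<>();
      for (int i = 0; i < size; i++) {
        replicas.add(servers.get((j + i) % size));
      }
      placement.put(servers.get(j) + "'s log", replicas);
    }
    return placement;
  }

  public static void main(String[] args) {
    List<String> servers = Arrays.asList("RS-A", "RS-B", "RS-C");
    List<String> regions = Arrays.asList("X", "Y", "Z");
    assignRegions(regions, servers).forEach(
        (r, s) -> System.out.println("Region " + r + " reps 0..2 on " + s));
    assignLogs(servers).forEach((l, s) -> System.out.println(l + " on " + s));
  }
}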


> >
> > A more complex solution for both would be to choose machines for the
> > purpose they are best suited for.  Read replicas are hosted on their
> > respective machines, and shadow region memstores on the hlog's rs's.
> >  Promotion becomes a more complicated dance where upon RS-A's failure, we
> > have the log tailing shadow region catchup and perform a flush of the
> > affected memstores to the appropriate hdfs affinity group/favored nodes.
> >  So the shadow memstore for region X would flush the hfile to A,B,C and
> > region Y to A,D,E.  Then the read replicas would be promoted (close
> > secondary, open as primary) based on the regions'/hfiles' affinity
> > group.  This feels like an optimization done on the 2nd or 3rd rev.
> >
>
> I think we do not want to differentiate between RS's by splitting them
> between primaries and shadows.
> This will complicate provisioning, administration, monitoring and load
> balancing a lot, and will not achieve
> very cheap secondary region promotions (because you still have to move the
> region as you described).
>
>
I think there is a misunderstanding here -- clarifying.  In this combined
approach, we have a pool of RS's, each of which can host a combination
of primary regions, secondary read replica regions, and shadow memstore
regions.  If we don't separate out the shadow memstore regions from
the secondary read replicas, we end up with the inefficient design implied
in the read replica write-up -- where all region servers essentially need to
read all the hlogs from the other region servers, causing n^2 tailers
across the cluster instead of n or 2n tailers.
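
To illustrate the arithmetic (a back-of-the-envelope sketch with made-up
cluster numbers, not measurements): if every RS hosting secondaries has to
tail the primaries' hlogs itself, each RS tails roughly every other RS's log;
if only the servers holding a log's HDFS replicas shadow it, each log has a
fixed small number of tailers.

/**
 * Back-of-the-envelope tailer counts for the two designs discussed above.
 * Purely illustrative; the cluster size and replica counts are assumptions.
 */
public class TailerCount {
  public static void main(String[] args) {
    int regionServers = 100;   // n region servers (assumed)
    int logReplicas = 3;       // hdfs replication of each hlog (assumed)

    // Coupled design: every RS hosts secondaries whose primaries are spread
    // across (almost) all other RS's, so each RS tails ~(n - 1) logs
    // => roughly n^2 tailers cluster-wide.
    long coupled = (long) regionServers * (regionServers - 1);

    // Shadow-on-log-replica design: each hlog is shadowed only by the
    // (logReplicas - 1) other servers holding its blocks => ~2n tailers.
    long shadowed = (long) regionServers * (logReplicas - 1);

    System.out.println("coupled to hfile affinity: ~" + coupled + " tailers");
    System.out.println("shadow on log replicas:    ~" + shadowed + " tailers");
  }
}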

We will want different metrics and logging for monitoring the read replicas
and the shadow memstores.  Ideally we'd know how far behind we are.


In the combined approach, this isn't a region move -- the flush is completed
by the shadow to the nodes that are assigned as the secondaries, and then we
promote the secondary to primary by closing it and reopening the region as
primary.

Ex:
Region X on RS-A (rep=0), RS-B(rep=1), RS-C(rep=2).
Region Y on RS-A (rep=0), RS-D(rep=1), RS-E(rep=2).
RS-A's log on RS-A, RS-F, RS-G.
RS-F and RS-G shadow RS-A's HLog.

RS-A goes down.
RS-F and RS-G catch up to the end of RS-A's HLog.
RS-F flushes the region X shadow memstore to nodes RS-B, RS-C (and some
other node).
RS-G flushes the region Y shadow memstore to nodes RS-D, RS-E (and some
other node).
Master promotes region X on secondary RS-B (closeReplica, open X) with all
its stores local to RS-B and RS-C.
Master promotes region Y on secondary RS-D (closeReplica, open Y) with all
its stores local to RS-D and RS-E.
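
A rough sketch of that sequence in code form (ShadowRegion, Master, and the
method names here are hypothetical types used only to make the ordering of
the steps explicit -- none of this is existing HBase API):

import java.util.List;

/**
 * Sketch of the combined-approach failover described above: catch the
 * shadows up, flush their memstores to the secondaries' favored nodes,
 * then promote a secondary replica to primary.
 */
public class CombinedFailoverSketch {

  interface ShadowRegion {
    void catchUpToEndOfLog();                 // finish tailing the dead RS's hlog
    void flushTo(List<String> favoredNodes);  // flush the shadow memstore as hfiles
    List<String> secondaryNodes();            // where the read replicas live
  }

  interface Master {
    void closeReplica(String region, String server);
    void openAsPrimary(String region, String server);
  }

  /** Steps run when the primary region server (e.g. RS-A) dies. */
  static void onPrimaryFailure(Master master, String region,
                               List<ShadowRegion> shadows, String newPrimary) {
    for (ShadowRegion shadow : shadows) {
      shadow.catchUpToEndOfLog();              // RS-F / RS-G replay to the log tail
      shadow.flushTo(shadow.secondaryNodes()); // flush to RS-B/RS-C (or RS-D/RS-E)
    }
    master.closeReplica(region, newPrimary);   // close the secondary replica...
    master.openAsPrimary(region, newPrimary);  // ...and reopen it as primary, with
                                               // all stores local to the new primary
  }
}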



> >
> > Jon
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // j...@cloudera.com
> >
>



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// j...@cloudera.com
