We are also proposing to implement HBASE-7509 as a part of this major undertaking. HBASE-7509 will help HBase in general (even if you are not using HBASE-10070), and possibly some other HDFS clients as well. HBASE-10070 will give you similar benefits to HBASE-7509 if your use case needs them, but at the HBase layer, which will sit on top of HBASE-7509.
Enis

On Sat, Dec 7, 2013 at 5:39 AM, 谢良 <[email protected]> wrote:

> Regarding one advantage of this design (the ability to do low-latency reads, with
> <20ms 99.9th-percentile latencies for stale reads): I still prefer the HBASE-7509
> solution, since if you want to guarantee similarly high read performance in the
> shadow regions, you must let the shadow RS warm the related hot blocks up into its
> block cache. (Indeed, I have a similar worry to Vladimir's.) I have tried to think of
> how this design could beat HBASE-7509 at cutting the latency tail, but have found
> no answer yet.
>
> Enis, could you share your thoughts on it? Thanks.
>
> Thanks,
>
> ________________________________________
> From: Enis Söztutar [[email protected]]
> Sent: December 4, 2013, 6:18
> To: [email protected]
> Subject: Re: [Shadow Regions / Read Replicas ]
>
> On Tue, Dec 3, 2013 at 12:31 PM, Vladimir Rodionov <[email protected]> wrote:
>
> > The downside:
> >
> > - Double/triple memstore usage
> > - Increased block cache usage (effectively, the block cache will have 50%
> >   capacity, maybe less)
>
> These are covered in the tradeoff section of the design doc.
>
> > These downsides are pretty serious ones. This will result:
> >
> > 1. in decreased overall performance due to decreased effective block cache
> > size
>
> You can elect to not fill up the block cache for secondary reads. It will be a
> configuration option, and a tradeoff you may or may not want to pay. Details are
> in the doc.
>
> > 2. In more frequent memstore flushes - this will affect compaction and
> > write throughput.
>
> More frequent flushes are not needed unless you are using the region snapshots
> approach and want to bound the lag better. It is a tradeoff between expected lag
> and more write amplification.
>
> > I do not believe that HBase's 'large' MTTR prevents meeting a 99% SLA of
> > 10-20ms, unless your RSs go down 2-3 times a day for several minutes each
> > time. You have to analyze first why you are having such frequent failures,
> > then fix the root source of the problem. It is possible to reduce the
> > 'detection' phase of the MTTR process to a couple of seconds, either by using
> > an external beacon process (as I suggested already) or by rewriting some code
> > inside HBase and the NameNode to move all data out of the Java heap to
> > off-heap, reducing GC-induced timeouts from 30 sec to 1-2 sec max. It is
> > tough, but doable. The result: you will decrease MTTR by 50% at least w/o
> > sacrificing overall cluster performance.
> >
> > I think it is the large RS and NN heaps and frequent stop-the-world GC
> > activity that prevent meeting strict SLAs - not occasional server failures.
>
> MTTR and this work are orthogonal. In a distributed system, you cannot
> differentiate between a process not responding because it is down, because it is
> busy, because the network is down, or whatnot. Having a couple of seconds of
> detection time is unrealistic. You will end up in a very unstable state where you
> will be failing servers all over the place. An external beacon also cannot
> differentiate between the main process not responding because it is busy or
> because it is down. What happens when there is a temporary network partition?
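The stale-read and block-cache points above might look roughly like the following on the client side. This is a minimal, illustrative sketch only, assuming the Consistency enum on Get and the isStale() marker on Result described in the design doc; the exact names that eventually ship may differ.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StaleReadSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("usertable"))) {
      Get get = new Get(Bytes.toBytes("row-42"));
      // Ask for a possibly-stale read that any replica may serve (assumed API).
      get.setConsistency(Consistency.TIMELINE);
      // Optionally skip block-cache population for this read, per the
      // cache-pollution tradeoff discussed above.
      get.setCacheBlocks(false);
      Result result = table.get(get);
      if (result.isStale()) {
        // Served by a secondary region; data may lag the primary.
      }
    }
  }
}
```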
> > On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh <[email protected]> wrote:
> >
> > > To keep the discussion focused on the design goals, I'm going to start
> > > referring to Enis and Devaraj's eventually consistent read replicas as the
> > > *read replica* design, and the consistent fast-read-recovery mechanism based
> > > on shadowing/tailing the WALs as *shadow regions* or *shadow memstores*.
> > > Can we agree on nomenclature?
> > >
> > > On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <[email protected]> wrote:
> > >
> > > > Thanks Jon for bringing this to dev@.
> > > >
> > > > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <[email protected]> wrote:
> > > >
> > > > > Fundamentally, I'd prefer focusing on making HBase "HBasier" instead of
> > > > > tackling a feature that other systems architecturally can do better
> > > > > (inconsistent reads). I consider consistent reads/writes to be one of
> > > > > HBase's defining features. That said, I think read replicas make sense
> > > > > and are a nice feature to have.
> > > >
> > > > Our design proposal has a specific use case goal, and hopefully we can
> > > > demonstrate the benefits of having this in HBase so that even more pieces
> > > > can be built on top of it. Plus I imagine this will be a widely used
> > > > feature for read-only tables or bulk-loaded tables. We are not proposing
> > > > to rework strong consistency semantics or make major architectural
> > > > changes. I think having tables defined with a replication count, and the
> > > > proposed client API changes (the Consistency definition), plug into the
> > > > HBase model rather well.
> > >
> > > I do think that without any recent-update mechanism, we are limiting the
> > > usefulness of this feature to essentially *only* read-only or bulk-load-only
> > > tables. Recency, if there were any edits/updates, would be severely lagging
> > > (by default, potentially an hour), especially in cases where there are only
> > > a few edits to a primarily bulk-loaded table. This limitation is not
> > > mentioned in the tradeoffs or requirements (or in a non-requirements
> > > section); it definitely should be listed there.
> > >
> > > With the current design it might be best to have a flag on the table which
> > > marks it read-only or bulk-load only, so that it only gets used by users
> > > when the table is in that mode? (And maybe an "escape hatch" for power
> > > users.)
> > >
> > > [snip]
> > >
> > > > > - I think the two goals are both worthy on their own, each with their
> > > > > own optimal points. We should make sure in the design that we can
> > > > > support both goals.
> > > >
> > > > I think our proposal is consistent with your doc, and we have considered
> > > > secondary region promotion in the future section. It would be good if you
> > > > can review and comment on whether you see any points missing.
> > >
> > > I definitely will. At the moment, I think the hybrid for the WALs/HLogs I
> > > suggested in the other thread seems to be an optimal solution considering
> > > locality, though it is obviously more complex than just one approach alone.
> > >
> > > > > - I want to make sure the proposed design has a path for optimal
> > > > > fast-consistent read-recovery.
> > > > We think that it does, but it is a secondary goal for the initial work. I
> > > > don't see any reason why secondary promotion cannot be built on top of
> > > > this, once the branch is in a better state.
> > >
> > > Based on the detail in the design doc and this statement, it sounds like you
> > > have a prototype branch already? Is this the case?
> > >
> > > --
> > > // Jonathan Hsieh (shay)
> > > // Software Engineer, Cloudera
> > > // [email protected]
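The per-table replication count and the read-only/bulk-load-only flag discussed in the thread could be expressed along these lines. This is a sketch under the proposal's assumptions, using HTableDescriptor-style method names for illustration; it is not the definitive admin API.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ReplicatedTableSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("bulkloaded"));
      desc.addFamily(new HColumnDescriptor("d"));
      // One primary plus two secondary (read-only) region replicas (assumed setting).
      desc.setRegionReplication(3);
      // Optional: mark the table read-only, matching Jon's suggestion of a flag
      // for the read-only / bulk-load-only use case, so secondaries never lag writes.
      desc.setReadOnly(true);
      admin.createTable(desc);
    }
  }
}
```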
