On Thu, Oct 27, 2016 at 3:01 PM, Stack <st...@duboce.net> wrote:

> On Fri, Oct 21, 2016 at 3:24 PM, Enis Söztutar <enis....@gmail.com> wrote:
>
>> A bit late, but let me give my perspective. This can also be moved to
>> jira or dev@, I think.
>>
>> DLR was nice and had pretty good gains for MTTR. However, dealing with
>> the sequence ids, onlining regions, and the replay paths proved too
>> difficult in practice. I think the way forward is not to bring DLR
>> back, but to fix our long-standing log split problems instead.
>>
>> The main gain of DLR is that we do not create lots and lots of tiny
>> files, but instead rely on the regular region flushes to flush bigger
>> files. This also helps with handling requests coming from different
>> log files, etc. The only gain I can think of that you get with DLR but
>> not with log split is enabling writes online while the recovery is
>> going on. However, I think it is not worth having DLR just for this
>> feature.
>>
>
> And not having to write intermediary files, as you note at the start of
> your paragraph.
>
> I meant to say thanks for reviving this important topic.
> St.Ack
>
>> Now, what are the problems with Log Split, you ask? The problems are:
>> - we create a lot of tiny files
>> - these tiny files are replayed sequentially when the region is assigned
>> - the region has to replay and flush all the data coming from all these
>>   tiny files sequentially
>>
>
> The longest pole in MTTR used to be noticing that the RS had gone away
> in the first place. Let's not forget to add this to our list.
>
>
>> In terms of IO, we pay the cost of reading the original WAL files and
>> writing the same amount of data into many small files, where the NN
>> overhead is huge. Then, for every region, we serially sort the data by
>> re-reading the tiny WAL files (recovered edits), sorting them in
>> memory, and flushing the data. This means we do 2x the reads and
>> writes that we otherwise would.
>>
>> The way to solve our log split bottlenecks is to re-read the Bigtable
>> paper and implement WAL recovery as described there:
>> - Implement an HFile format that can contain data from multiple
>>   regions. Something like a concatenated HFile format where each
>>   region has its own section, with its own sequence id, etc.
>> - Implement links to these files, where a link can refer to this data.
>>   This is very similar to our ReferenceFile concept.
>> - In each log splitter task, instead of generating tiny WAL files
>>   (recovered edits), buffer up in memory and do a sort per region
>>   (the same sort as inserting into the memstore). A WAL is ~100 MB on
>>   average, so buffering this up should not be a big problem.
>
> Need to be able to spill. There will be anomalies.
>
>> At the end of the WAL split task, write an hfile containing data from
>> all the regions as described above. Also do a multi NN request to
>> create links in the regions that refer to these files (not sure
>> whether NN has a batch RPC call or not).
>
> It does not.
>
> So, doing an accounting, I see little difference from what we have now.
> In the new scheme:
>
> + We read all WALs as before.
> + We write about the same (in the current scheme, we'd aggregate across
>   WALs so we didn't write a recovered-edits file per WAL), though the
>   new scheme may write less, since we currently flush after replay of
>   recovered edits and so nail an hfile into the file system that holds
>   the recovered edits (but in the new scheme, we'll bring on a
>   compaction because we have references, which will cause a rewrite of
>   the big hfile into a smaller one...).
> + Metadata ops are about the same (instead of lots of small
>   recovered-edits files, we write lots of small reference files).
>
> ... only the current scheme does a distributed, parallelized sort and
> can spill if it doesn't fit in memory.
>
> Am I doing the math right here?
>
> Is there a big improvement in MTTR? We are offline while we sort and
> write the big hfile and its references. We might save some because we
> just open the region after the above is done, where now we open and
> then replay recovered edits (though we could take writes in the
> current scheme with a bit of work).
>
> Can we do better?
>
> St.Ack
>
>
>
>> The reason this will be on par with or better than DLR is that we are
>> only doing 1 read and 1 write, and the sort is parallelized. The
>> region opening does not have to block on replaying anything or
>> waiting for a flush, because the data is already sorted and in HFile
>> format. These hfiles will be used the normal way, by adding them to
>> the KVHeaps, etc. When compactions run, we will remove the links to
>> these files using the regular mechanisms.
>>
>> Enis
>>
>> On Tue, Oct 18, 2016 at 6:58 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> > Allan:
>> > One factor to consider is that the assignment manager in hbase 2.0
>> > would be quite different from those in the 0.98 and 1.x branches.
>> >
>> > Meaning, you may need to come up with two solutions for a single
>> > problem.
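[Editor's sketch: the buffer-sort-spill split task debated above, in toy form. This is not HBase code: the Edit record, the in-memory "spilled runs", and the per-region section map are all invented for illustration. A real split task would write sorted runs to HDFS and k-way merge them into the proposed concatenated hfile rather than re-sorting at the end.]

```java
import java.util.*;

public class WalSplitSketch {

    // One WAL edit: (region, row key, sequence id, value). Real WAL entries
    // carry more (family, qualifier, timestamp); trimmed to keep this short.
    record Edit(String region, String row, long seqId, String value) {}

    // Memstore-style ordering: by row key, then newest sequence id first.
    static final Comparator<Edit> MEMSTORE_ORDER =
        Comparator.<Edit, String>comparing(Edit::row)
            .thenComparing(Comparator.<Edit>comparingLong(Edit::seqId).reversed());

    private final Map<String, TreeSet<Edit>> buffers = new TreeMap<>();
    private final List<Map<String, List<Edit>>> spilledRuns = new ArrayList<>();
    private long bufferedBytes = 0;
    private final long spillThresholdBytes;

    WalSplitSketch(long spillThresholdBytes) {
        this.spillThresholdBytes = spillThresholdBytes;
    }

    // Called once per WAL entry, in WAL (arrival) order.
    void append(Edit e) {
        buffers.computeIfAbsent(e.region(), r -> new TreeSet<>(MEMSTORE_ORDER))
               .add(e);
        bufferedBytes += e.row().length() + e.value().length();
        if (bufferedBytes > spillThresholdBytes) {
            spill();  // Stack's point: the buffer must be able to spill
        }
    }

    // "Spill" the sorted buffers as one run. A real implementation would
    // write a sorted on-disk run; a plain list stands in for the file here.
    private void spill() {
        if (buffers.isEmpty()) return;
        Map<String, List<Edit>> run = new TreeMap<>();
        buffers.forEach((region, edits) -> run.put(region, new ArrayList<>(edits)));
        spilledRuns.add(run);
        buffers.clear();
        bufferedBytes = 0;
    }

    // End of the split task: combine runs into one output with a sorted
    // section per region -- the shape the concatenated hfile would have.
    Map<String, List<Edit>> finish() {
        spill();
        Map<String, List<Edit>> sections = new TreeMap<>();
        for (Map<String, List<Edit>> run : spilledRuns) {
            run.forEach((region, edits) ->
                sections.computeIfAbsent(region, r -> new ArrayList<>())
                        .addAll(edits));
        }
        sections.values().forEach(s -> s.sort(MEMSTORE_ORDER));
        return sections;
    }

    public static void main(String[] args) {
        WalSplitSketch task = new WalSplitSketch(6);  // tiny, to force a spill
        task.append(new Edit("regionB", "r2", 5, "v1"));
        task.append(new Edit("regionA", "r9", 6, "v2"));
        task.append(new Edit("regionA", "r1", 7, "v3"));
        task.append(new Edit("regionB", "r2", 8, "v4"));
        task.finish().forEach((region, edits) ->
            System.out.println(region + ": " + edits));
    }
}
```

The point the sketch tries to make concrete: the sort happens once, in parallel, inside the split tasks, so opening a region afterwards has nothing to replay or flush.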
>> >
>> > FYI
>> >
>> > On Tue, Oct 18, 2016 at 6:11 PM, Allan Yang <allan...@163.com> wrote:
>> >
>> > > Hi, Ted
>> > > The issues I mentioned above (HBASE-13567, HBASE-12743, HBASE-13535,
>> > > HBASE-14729) are ALL reproduced in our HBase 1.x test environment.
>> > > Fixing them is exactly what I'm going to do. I haven't found the
>> > > root causes yet, but I will update if I find solutions.
>> > > What I'm afraid of is that there are other issues I don't know about
>> > > yet. So if you or other folks know of other issues related to DLR,
>> > > please let me know.
>> > >
>> > > Regards
>> > > Allan Yang
>> > >
>> > > At 2016-10-19 00:19:06, "Ted Yu" <yuzhih...@gmail.com> wrote:
>> > > >Allan:
>> > > >I wonder how you deal with open issues such as HBASE-13535.
>> > > >From your description, it seems your team fixed more DLR issues.
>> > > >
>> > > >Cheers
>> > > >
>> > > >On Mon, Oct 17, 2016 at 11:37 PM, allanwin <allan...@163.com> wrote:
>> > > >
>> > > >> Here is the thing. We have backported DLR (HBASE-7006) to our
>> > > >> 0.94 clusters in the production environment (of course, a lot of
>> > > >> bugs were fixed, and it is working well). It was proven to be a
>> > > >> huge gain. When a large cluster crashes, the MTTR improved from
>> > > >> several hours to less than an hour. Now we want to move on to
>> > > >> HBase 1.x, and we still want DLR. This time, we don't want to
>> > > >> backport the 'backported' DLR to HBase 1.x, but it seems the
>> > > >> community has determined to remove DLR...
>> > > >>
>> > > >> The DLR feature is proven useful in our production environment,
>> > > >> so I think I will try to fix its issues in branch-1.x
>> > > >>
>> > > >> At 2016-10-18 13:47:17, "Anoop John" <anoop.hb...@gmail.com>
>> > > >> wrote:
>> > > >> >Agree with your observation.
>> > > >> >But we wanted to get the DLR feature removed, because it is
>> > > >> >known to have issues. Otherwise, we need major work to correct
>> > > >> >all these issues.
>> > > >> >
>> > > >> >-Anoop-
>> > > >> >
>> > > >> >On Tue, Oct 18, 2016 at 7:41 AM, Ted Yu <yuzhih...@gmail.com>
>> > > >> >wrote:
>> > > >> >> If you have a cluster, I suggest you turn on DLR and observe
>> > > >> >> the effect when fewer than half the region servers are up
>> > > >> >> after the crash. You would have first-hand experience that
>> > > >> >> way.
>> > > >> >>
>> > > >> >> On Mon, Oct 17, 2016 at 6:33 PM, allanwin <allan...@163.com>
>> > > >> >> wrote:
>> > > >> >>
>> > > >> >>> Yes, region replica is a good way to improve MTTR, especially
>> > > >> >>> if one or two servers are down; region replicas can improve
>> > > >> >>> data availability. But for a big disaster, like 1/3 or 1/2 of
>> > > >> >>> the region servers shutting down, I think DLR is still useful
>> > > >> >>> to bring regions online more quickly and with less IO usage.
>> > > >> >>>
>> > > >> >>> Regards
>> > > >> >>> Allan Yang
>> > > >> >>>
>> > > >> >>> At 2016-10-17 21:01:16, "Ted Yu" <yuzhih...@gmail.com> wrote:
>> > > >> >>> >Here was the thread discussing DLR:
>> > > >> >>> >
>> > > >> >>> >http://search-hadoop.com/m/YGbbOxBK2n4ES12&subj=Re+DISCUSS+retiring+current+DLR+code
>> > > >> >>> >
>> > > >> >>> >> On Oct 17, 2016, at 4:15 AM, allanwin <allan...@163.com>
>> > > >> >>> >> wrote:
>> > > >> >>> >>
>> > > >> >>> >> Hi, All
>> > > >> >>> >> DLR can improve MTTR dramatically, but since it has many
>> > > >> >>> >> bugs, like HBASE-13567, HBASE-12743, HBASE-13535, and
>> > > >> >>> >> HBASE-14729 (any more I don't know of?), it was proved
>> > > >> >>> >> unreliable and has been deprecated in almost all branches
>> > > >> >>> >> now.
>> > > >> >>> >> >> > > >> >>> >> >> > > >> >>> >> My question is, is there any other way other than DLR to >> > improve >> > > >> MTTR? >> > > >> >>> 'Cause If a big cluster crashes, It takes a long time to bring >> > > regions >> > > >> >>> online, not to mention it will create huge pressure on the IOs. >> > > >> >>> >> >> > > >> >>> >> >> > > >> >>> >> To tell the truth, I still want DLR back, if the community >> > don't >> > > >> have >> > > >> >>> any plan to bring back DLR, I may want to figure out the >> problems >> > in >> > > >> DLR >> > > >> >>> and make it working and reliable, Any suggests for that? >> > > >> >>> >> >> > > >> >>> >> >> > > >> >>> >> sincerely >> > > >> >>> >> Allan Yang >> > > >> >>> >> > > >> >> > > >> > >> > >