On Thu, Oct 27, 2016 at 3:01 PM, Stack <st...@duboce.net> wrote:

> On Fri, Oct 21, 2016 at 3:24 PM, Enis Söztutar <enis....@gmail.com> wrote:
>
>> A bit late, but let me give my perspective. This can also be moved to
>> jira or dev@, I think.
>>
>> DLR was nice and had pretty good gains for MTTR. However, dealing with
>> the sequence ids, onlining regions, and the replay paths proved too
>> difficult in practice. I think the way forward is not to bring DLR
>> back, but to fix our long-standing log split problems instead.
>>
>> The main gain of DLR is that we do not create lots and lots of tiny
>> files, but instead rely on the regular region flushes to flush bigger
>> files. This also helps with handling requests coming from different
>> log files, etc. The only gain I can think of that you get with DLR but
>> not with log split is enabling writes online while the recovery is
>> going on. However, I think it is not worth having DLR just for this
>> feature.
>>
>
> And not having to write intermediary files, as you note at the start of
> your paragraph.
>
> I meant to say thanks for reviving this important topic.
> St.Ack
>
>> Now, what are the problems with Log Split, you ask? The problems are:
>> - we create a lot of tiny files
>> - these tiny files are replayed sequentially when the region is assigned
>> - the region has to replay and flush all the data coming from all these
>>   tiny files sequentially
>>
>
> The longest pole in MTTR used to be noticing that the RS had gone away
> in the first place. Let's not forget to add this to our list.
>
>
>> In terms of IO, we pay the cost of reading the original WAL files and
>> writing the same amount of data into many small files, where the NN
>> overhead is huge. Then, for every region, we serially sort the data by
>> re-reading the tiny WAL files (recovered edits), sorting them in
>> memory, and flushing the data. This means we do 2x the reads and
>> writes that we otherwise would.
>>
>> The way to solve our log split bottlenecks is to re-read the Bigtable
>> paper and implement WAL recovery as described there:
>> - Implement an HFile format that can contain data from multiple
>>   regions. Something like a concatenated HFile format where each
>>   region has its own section, with its own sequence id, etc.
>> - Implement links to these files, where a link can refer to this data.
>>   This is very similar to our ReferenceFile concept.
>> - In each log splitter task, instead of generating tiny WAL files
>>   (recovered edits), buffer up in memory and do a sort per region
>>   (the same sort as inserting into the memstore). A WAL is ~100 MB on
>>   average, so buffering this up should not be a big problem.
>
> Need to be able to spill. There will be anomalies.
>
>> At the end of the WAL split task, write an hfile containing data from
>> all the regions as described above. Also do a multi NN request to
>> create links in the regions that refer to these files (not sure
>> whether NN has a batch RPC call or not).
>
> It does not.
>
> So, doing an accounting, I see little difference from what we have now.
> In the new scheme:
>
> + We read all WALs as before.
> + We write about the same (in the current scheme, we'd aggregate across
>   WALs so we didn't write a recovered-edits file per WAL), though the
>   new scheme may write less, since we currently flush after replay of
>   recovered edits and so nail an hfile into the file system that holds
>   the recovered edits (but in the new scheme, we'll bring on a
>   compaction because we have references, which will cause a rewrite of
>   the big hfile into a smaller one...).
> + Metadata ops are about the same (instead of lots of small
>   recovered-edits files, we write lots of small reference files).
>
> ... only the current scheme does a distributed, parallelized sort and
> can spill if it doesn't fit in memory.
>
> Am I doing the math right here?
>
> Is there a big improvement in MTTR? We are offline while we sort and
> write the big hfile and its references. We might save some because we
> just open the region after the above is done, where now we open and
> then replay recovered edits (though we could take writes in the
> current scheme with a bit of work).
>
> Can we do better?
>
> St.Ack
>
>
>
>> The reason this will be on par with or better than DLR is that we are
>> only doing 1 read and 1 write, and the sort is parallelized. The
>> region opening does not have to block on replaying anything or
>> waiting for a flush, because the data is already sorted and in HFile
>> format. These hfiles will be used the normal way, by adding them to
>> the KVHeaps, etc. When compactions run, we will remove the links to
>> these files using the regular mechanisms.
>>
>> Enis
>>
>> On Tue, Oct 18, 2016 at 6:58 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> > Allan:
>> > One factor to consider is that the assignment manager in hbase 2.0
>> > would be quite different from those in the 0.98 and 1.x branches.
>> >
>> > Meaning, you may need to come up with two solutions for a single
>> > problem.
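[Editor's sketch: the buffer-sort-spill split task debated above, in toy form. This is not HBase code: the Edit record, the in-memory "spilled runs", and the per-region section map are all invented for illustration. A real split task would write sorted runs to HDFS and k-way merge them into the proposed concatenated hfile rather than re-sorting at the end.]

```java
import java.util.*;

public class WalSplitSketch {

    // One WAL edit: (region, row key, sequence id, value). Real WAL entries
    // carry more (family, qualifier, timestamp); trimmed to keep this short.
    record Edit(String region, String row, long seqId, String value) {}

    // Memstore-style ordering: by row key, then newest sequence id first.
    static final Comparator<Edit> MEMSTORE_ORDER =
        Comparator.<Edit, String>comparing(Edit::row)
            .thenComparing(Comparator.<Edit>comparingLong(Edit::seqId).reversed());

    private final Map<String, TreeSet<Edit>> buffers = new TreeMap<>();
    private final List<Map<String, List<Edit>>> spilledRuns = new ArrayList<>();
    private long bufferedBytes = 0;
    private final long spillThresholdBytes;

    WalSplitSketch(long spillThresholdBytes) {
        this.spillThresholdBytes = spillThresholdBytes;
    }

    // Called once per WAL entry, in WAL (arrival) order.
    void append(Edit e) {
        buffers.computeIfAbsent(e.region(), r -> new TreeSet<>(MEMSTORE_ORDER))
               .add(e);
        bufferedBytes += e.row().length() + e.value().length();
        if (bufferedBytes > spillThresholdBytes) {
            spill();  // Stack's point: the buffer must be able to spill
        }
    }

    // "Spill" the sorted buffers as one run. A real implementation would
    // write a sorted on-disk run; a plain list stands in for the file here.
    private void spill() {
        if (buffers.isEmpty()) return;
        Map<String, List<Edit>> run = new TreeMap<>();
        buffers.forEach((region, edits) -> run.put(region, new ArrayList<>(edits)));
        spilledRuns.add(run);
        buffers.clear();
        bufferedBytes = 0;
    }

    // End of the split task: combine runs into one output with a sorted
    // section per region -- the shape the concatenated hfile would have.
    Map<String, List<Edit>> finish() {
        spill();
        Map<String, List<Edit>> sections = new TreeMap<>();
        for (Map<String, List<Edit>> run : spilledRuns) {
            run.forEach((region, edits) ->
                sections.computeIfAbsent(region, r -> new ArrayList<>())
                        .addAll(edits));
        }
        sections.values().forEach(s -> s.sort(MEMSTORE_ORDER));
        return sections;
    }

    public static void main(String[] args) {
        WalSplitSketch task = new WalSplitSketch(6);  // tiny, to force a spill
        task.append(new Edit("regionB", "r2", 5, "v1"));
        task.append(new Edit("regionA", "r9", 6, "v2"));
        task.append(new Edit("regionA", "r1", 7, "v3"));
        task.append(new Edit("regionB", "r2", 8, "v4"));
        task.finish().forEach((region, edits) ->
            System.out.println(region + ": " + edits));
    }
}
```

The point the sketch tries to make concrete: the sort happens once, in parallel, inside the split tasks, so opening a region afterwards has nothing to replay or flush.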
>> >
>> > FYI
>> >
>> > On Tue, Oct 18, 2016 at 6:11 PM, Allan Yang <allan...@163.com> wrote:
>> >
>> > > Hi, Ted
>> > > The issues I mentioned above (HBASE-13567, HBASE-12743, HBASE-13535,
>> > > HBASE-14729) are ALL reproduced in our HBase 1.x test environment.
>> > > Fixing them is exactly what I'm going to do. I haven't found the
>> > > root causes yet, but I will update if I find solutions.
>> > > What I'm afraid of is that there are other issues I don't know about
>> > > yet. So if you or other folks know of other issues related to DLR,
>> > > please let me know.
>> > >
>> > > Regards
>> > > Allan Yang
>> > >
>> > > At 2016-10-19 00:19:06, "Ted Yu" <yuzhih...@gmail.com> wrote:
>> > > >Allan:
>> > > >I wonder how you deal with open issues such as HBASE-13535.
>> > > >From your description, it seems your team fixed more DLR issues.
>> > > >
>> > > >Cheers
>> > > >
>> > > >On Mon, Oct 17, 2016 at 11:37 PM, allanwin <allan...@163.com> wrote:
>> > > >
>> > > >> Here is the thing. We have backported DLR (HBASE-7006) to our
>> > > >> 0.94 clusters in the production environment (of course, a lot of
>> > > >> bugs were fixed, and it is working well). It was proven to be a
>> > > >> huge gain. When a large cluster crashes, the MTTR improved from
>> > > >> several hours to less than an hour. Now we want to move on to
>> > > >> HBase 1.x, and we still want DLR. This time, we don't want to
>> > > >> backport the 'backported' DLR to HBase 1.x, but it seems the
>> > > >> community has determined to remove DLR...
>> > > >>
>> > > >> The DLR feature is proven useful in our production environment,
>> > > >> so I think I will try to fix its issues in branch-1.x
>> > > >>
>> > > >> At 2016-10-18 13:47:17, "Anoop John" <anoop.hb...@gmail.com>
>> > > >> wrote:
>> > > >> >Agree with your observation.
>> > > >> >But we wanted to get the DLR feature removed, because it is
>> > > >> >known to have issues. Otherwise, we need major work to correct
>> > > >> >all these issues.
>> > > >> >
>> > > >> >-Anoop-
>> > > >> >
>> > > >> >On Tue, Oct 18, 2016 at 7:41 AM, Ted Yu <yuzhih...@gmail.com>
>> > > >> >wrote:
>> > > >> >> If you have a cluster, I suggest you turn on DLR and observe
>> > > >> >> the effect when fewer than half the region servers are up
>> > > >> >> after the crash. You would have first-hand experience that
>> > > >> >> way.
>> > > >> >>
>> > > >> >> On Mon, Oct 17, 2016 at 6:33 PM, allanwin <allan...@163.com>
>> > > >> >> wrote:
>> > > >> >>
>> > > >> >>> Yes, region replica is a good way to improve MTTR, especially
>> > > >> >>> if one or two servers are down; region replicas can improve
>> > > >> >>> data availability. But for a big disaster, like 1/3 or 1/2 of
>> > > >> >>> the region servers shutting down, I think DLR is still useful
>> > > >> >>> to bring regions online more quickly and with less IO usage.
>> > > >> >>>
>> > > >> >>> Regards
>> > > >> >>> Allan Yang
>> > > >> >>>
>> > > >> >>> At 2016-10-17 21:01:16, "Ted Yu" <yuzhih...@gmail.com> wrote:
>> > > >> >>> >Here was the thread discussing DLR:
>> > > >> >>> >
>> > > >> >>> >http://search-hadoop.com/m/YGbbOxBK2n4ES12&subj=Re+DISCUSS+retiring+current+DLR+code
>> > > >> >>> >
>> > > >> >>> >> On Oct 17, 2016, at 4:15 AM, allanwin <allan...@163.com>
>> > > >> >>> >> wrote:
>> > > >> >>> >>
>> > > >> >>> >> Hi, All
>> > > >> >>> >> DLR can improve MTTR dramatically, but since it has many
>> > > >> >>> >> bugs, like HBASE-13567, HBASE-12743, HBASE-13535, and
>> > > >> >>> >> HBASE-14729 (any more I don't know of?), it was proved
>> > > >> >>> >> unreliable and has been deprecated in almost all branches
>> > > >> >>> >> now.
>> > > >> >>> >> >> > > >> >>> >> >> > > >> >>> >> My question is, is there any other way other than DLR to >> > improve >> > > >> MTTR? >> > > >> >>> 'Cause If a big cluster crashes, It takes a long time to bring >> > > regions >> > > >> >>> online, not to mention it will create huge pressure on the IOs. >> > > >> >>> >> >> > > >> >>> >> >> > > >> >>> >> To tell the truth, I still want DLR back, if the community >> > don't >> > > >> have >> > > >> >>> any plan to bring back DLR, I may want to figure out the >> problems >> > in >> > > >> DLR >> > > >> >>> and make it working and reliable, Any suggests for that? >> > > >> >>> >> >> > > >> >>> >> >> > > >> >>> >> sincerely >> > > >> >>> >> Allan Yang >> > > >> >>> >> > > >> >> > > >> > >> > >