On review of the 'patch', do I just compare the branch to master, or is there a megapatch posted somewhere? (I think I saw one but it seemed stale and then I 'lost' the tab.) Sorry for the dumb question.

St.Ack
On Mon, Sep 12, 2016 at 12:01 PM, Stack <st...@duboce.net> wrote:

> Late to the game. A few comments after rereading this thread as a 'user'.
>
> + Before merge, a user-facing feature like this should work (If this is "higher-bar for new features", bring it on -- smile).
> + As a user, I tried the branch with tools after reviewing the just-posted doc. I had an 'interesting' experience (left comments up on issue). I think the tooling/doc. important to get right. If it breaks easily or is inconsistent (or lacks 'polish'), operators will judge the whole backup/restore tooling chain as not trustworthy and abandon it. Lets not have this happen to this feature.
> + Matteo's suggestion (with a helpful starter list) that there needs to be explicit qualification on what is actually being delivered -- including a listing of limitations (some look serious such as data bleed from other regions in WALs, but maybe I don't care for my use case...) -- needs to accompany the merge. Lets fold them into the user doc. in the technical overview area as suggested so user expectations are properly managed (otherwise, they expect the world and will just give up when we fall short). Vladimir did a list of what is in each of the phases above which would serve as a good start.
> + Is this feature 'experimental' (Matteo asks above). I'd prefer it is not. If it is, it should be labelled all over that it is so. I see current state called out as a '... technical preview feature'. Does this mean not-for-users?
>
> St.Ack
>
> On Mon, Sep 12, 2016 at 8:03 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Sean:
>> Do you have more comments ?
>>
>> Cheers
>>
>> On Fri, Sep 9, 2016 at 1:42 PM, Vladimir Rodionov <vladrodio...@gmail.com> wrote:
>>
>> > Sean,
>> >
>> > Backup/Restore can fail due to various reasons: network outage (cluster wide), various time-outs in HBase and HDFS layer, M/R failure due to "HDFS exceeded quota", user error (manual deletion of data) and so on so on. That is impossible to enumerate all possible types of failures in a distributed system - that is not our goal/task.
>> >
>> > We focus completely on backup system table consistency in a presence of any type of failure. That is what I call "tolerance to failures".
>> >
>> > On a failure:
>> >
>> > BACKUP. All backup system information (prior to backup) will be restored and all temporary data, related to a failed session, in HDFS will be deleted
>> > RESTORE. We do not care about system data, because restore does not change it. Temporary data in HDFS will be cleaned up and table will be in a state back to where it was before operation started.
>> >
>> > This is what user should expect in case of a failure.
>> >
>> > -Vlad
>> >
>> > -Vlad
>> >
>> > On Fri, Sep 9, 2016 at 12:56 PM, Sean Busbey <bus...@apache.org> wrote:
>> >
>> > > Failing in a consistent way, with docs that explain the various expected failures would be sufficient.
>> > >
>> > > On Fri, Sep 9, 2016 at 12:16 PM, Vladimir Rodionov <vladrodio...@gmail.com> wrote:
>> > > > Do not worry Sean, doc is coming today as a preview and our writer Frank will be working on a putting it into Apache repo. Timeline depends on Franks schedule but I hope we will get it rather sooner than later.
>> > > >
>> > > > As for failure testing, we are focusing only on a consistent state of backup system data in a presence of any type of failures, We are not going to implement anything more "fancy", than that.
>> > > > We allow both: backup and restore to fail. What we do not allow is to have system data corrupted. Will it suffice for you? Do you have any other concerns, you want us to address?
>> > > >
>> > > > -Vlad
>> > > >
>> > > > On Fri, Sep 9, 2016 at 10:56 AM, Sean Busbey <bus...@apache.org> wrote:
>> > > >
>> > > >> "docs will come to Apache soon" does not address my concern around docs at all, unless said docs have already made it into the project repo. I don't want third party resources for using a major and important feature of the project, I want us to provide end users with what they need to get the job done.
>> > > >>
>> > > >> I see some calls for patience on the failure testing, but the appeal to us having done a bad job of requiring proper tests of previous features just makes me more concerned about not getting them here. I don't want to set yet another bad example that will then be pointed to in the future.
>> > > >>
>> > > >> On Sep 8, 2016 10:50, "Ted Yu" <yuzhih...@gmail.com> wrote:
>> > > >>
>> > > >> > Is there any concern which is not addressed ?
>> > > >> >
>> > > >> > Do we need another Vote thread ?
>> > > >> >
>> > > >> > Thanks
>> > > >> >
>> > > >> > On Thu, Sep 8, 2016 at 9:21 AM, Andrew Purtell <apurt...@apache.org> wrote:
>> > > >> >
>> > > >> > > Vlad,
>> > > >> > >
>> > > >> > > I apologize for using the term 'half-baked' in a way that could seem a description of HBASE-7912. I meant that as a general hypothetical.
>> > > >> > >
>> > > >> > > On Wed, Sep 7, 2016 at 9:36 AM, Vladimir Rodionov <vladrodio...@gmail.com> wrote:
>> > > >> > >
>> > > >> > > > >> I'm not sure that "There is already lots of half-baked code in the branch, so what's the harm in adding more?"
>> > > >> > > >
>> > > >> > > > I meant - not production - ready yet. This is 2.0 development branch and, hence many features are in works, not being tested well etc. I do not consider backup as half baked feature - it has passed our internal QA and has very good doc, which we will provide to Apache shortly.
>> > > >> > > >
>> > > >> > > > -Vlad
>> > > >> > > >
>> > > >> > > > On Wed, Sep 7, 2016 at 9:13 AM, Andrew Purtell <apurt...@apache.org> wrote:
>> > > >> > > >
>> > > >> > > > > We shouldn't admit half baked changes that won't be finished. However in this case the crew working on this feature are long timers and less likely than just about anyone to leave something in a half baked state. Of course there is no guarantee how anything will turn out, but I am willing to take a little on faith if they feel their best path forward now is to merge to trunk. I only wish I had bandwidth to have done some real kicking of the tires by now. Maybe this week.
>> > > >> > > > >
>> > > >> > > > > (Yes, I'm using some of that time for this email :-) but I type fast.)
>> > > >> > > > >
>> > > >> > > > > That said, I would like to agitate for making 2.0 more real and spend some time on it now that I'm winding down with 0.98. I think that means branching for 2.0 real soon now and even evicting things from 2.0 branch that aren't finished or stable, leaving them only once again in the master branch. Or, maybe just evicting them. Let's take it case by case.
>> > > >> > > > >
>> > > >> > > > > I think this feature can come in relatively safely. As added insurance, let's admit the possibility it could be reverted on the 2.0 branch if folks working on stabilizing 2.0 decide to evict it because it is unfinished or unstable, because that certainly can happen. I would expect if talk like that starts, we'd get help finishing or stabilizing what's under discussion for revert. Or, we'd have a revert. Either way the outcome is acceptable.
>> > > >> > > > >
>> > > >> > > > > On Wed, Sep 7, 2016 at 8:56 AM, Dima Spivak <dimaspi...@apache.org> wrote:
>> > > >> > > > >
>> > > >> > > > > > I'm not sure that "There is already lots of half-baked code in the branch, so what's the harm in adding more?" is a good code commit philosophy for a fault-tolerant distributed data store.
>> > > >> > > > > > ;)
>> > > >> > > > > >
>> > > >> > > > > > More seriously, a lack of test coverage for existing features shouldn't be used as justification for introducing new features with the same shortcomings. Ultimately, it's the end user who will feel the pain, so shouldn't we do everything we can to mitigate that?
>> > > >> > > > > >
>> > > >> > > > > > -Dima
>> > > >> > > > > >
>> > > >> > > > > > On Wed, Sep 7, 2016 at 8:46 AM, Vladimir Rodionov <vladrodio...@gmail.com> wrote:
>> > > >> > > > > >
>> > > >> > > > > > > Sean,
>> > > >> > > > > > >
>> > > >> > > > > > > * have docs
>> > > >> > > > > > >
>> > > >> > > > > > > Agree. We have a doc and backup is the most documented feature :), we will release it shortly to Apache.
>> > > >> > > > > > >
>> > > >> > > > > > > * have sunny-day correctness tests
>> > > >> > > > > > >
>> > > >> > > > > > > Feature has close to 60 test cases, which run for approx 30 min. We can add more, if community do not mind :)
>> > > >> > > > > > >
>> > > >> > > > > > > * have correctness-in-face-of-failure tests
>> > > >> > > > > > >
>> > > >> > > > > > > Any examples of these tests in existing features? In works, we have a clear understanding of what should be done by the time of 2.0 release. That is very close goal for us, to verify IT monkey for existing code.
>> > > >> > > > > > >
>> > > >> > > > > > > * don't rely on things outside of HBase for normal operation (okay for advanced operation)
>> > > >> > > > > > >
>> > > >> > > > > > > We do not.
>> > > >> > > > > > >
>> > > >> > > > > > > Enormous time has been spent already on the development and testing the feature, it has passed our internal tests and many rounds of code reviews by HBase committers. We do not mind if someone from HBase community (outside of HW) will review the code, but it will probably takes forever to wait for volunteer?, the feature is quite large (1MB+ cumulative patch)
>> > > >> > > > > > >
>> > > >> > > > > > > 2.0 branch is full of half baked features, most of them are in active development, therefore I am not following you here, Sean? Why HBASE-7912 is not good enough yet to be integrated into 2.0 branch?
>> > > >> > > > > > >
>> > > >> > > > > > > -Vlad
>> > > >> > > > > > >
>> > > >> > > > > > > On Wed, Sep 7, 2016 at 8:23 AM, Sean Busbey <bus...@apache.org> wrote:
>> > > >> > > > > > >
>> > > >> > > > > > > > On Tue, Sep 6, 2016 at 10:36 PM, Josh Elser <josh.el...@gmail.com> wrote:
>> > > >> > > > > > > > > So, the answer to Sean's original question is "as robust as snapshots presently are"?
>> > > >> > > > > > > > > (independence of backup/restore failure tolerance from snapshot failure tolerance)
>> > > >> > > > > > > > >
>> > > >> > > > > > > > > Is this just a question WRT context of the change, or is it means for a veto from you, Sean? Just trying to make sure I'm following along adequately.
>> > > >> > > > > > > > >
>> > > >> > > > > > > >
>> > > >> > > > > > > > I'd say ATM I'm -0, bordering on -1 but not for reasons I can articulate well.
>> > > >> > > > > > > >
>> > > >> > > > > > > > Here's an attempt.
>> > > >> > > > > > > >
>> > > >> > > > > > > > We've been trying to move, as a community, towards minimizing risk to downstream folks by getting "complete enough for use" gates in place before we introduce new features. This was spurred by a some features getting in half-baked and never making it to "can really use" status (I'm thinking of distributed log replay and the zk-less assignment stuff, I don't recall if there was more).
>> > > >> > > > > > > >
>> > > >> > > > > > > > The gates, generally, included things like:
>> > > >> > > > > > > >
>> > > >> > > > > > > > * have docs
>> > > >> > > > > > > > * have sunny-day correctness tests
>> > > >> > > > > > > > * have correctness-in-face-of-failure tests
>> > > >> > > > > > > > * don't rely on things outside of HBase for normal operation (okay for advanced operation)
>> > > >> > > > > > > >
>> > > >> > > > > > > > As an example, we kept the MOB work off in a branch and out of master until it could pass these criteria. The big exemption we've had to this was the hbase-spark integration, where we all agreed it could land in master because it was very well isolated (the slide away from including docs as a first-class part of building up that integration has led me to doubt the wisdom of this decision).
>> > > >> > > > > > > >
>> > > >> > > > > > > > We've also been treating inclusion in a "probably will be released to downstream" branches as a higher bar, requiring
>> > > >> > > > > > > >
>> > > >> > > > > > > > * don't moderately impact performance when the feature isn't in use
>> > > >> > > > > > > > * don't severely impact performance when the feature is in use
>> > > >> > > > > > > > * either default-to-on or show enough demand to believe a non-trivial number of folks will turn the feature on
>> > > >> > > > > > > >
>> > > >> > > > > > > > The above has kept MOB and hbase-spark integration out of branch-1, presumably while they've "gotten more stable" in master from the odd vendor inclusion.
>> > > >> > > > > > > >
>> > > >> > > > > > > > Are we going to have a 2.0 release before the end of the year? We're coming up on 1.5 years since the release of version 1.0; seems like it's about time, though I haven't seen any concrete plans this year. Presuming we are going to have one by the end of the year, it seems a bit close to still be adding in "features that need maturing" on the branch.
>> > > >> > > > > > > >
>> > > >> > > > > > > > The lack of a concrete plan for 2.0 keeps me from considering these things blocker at the moment.
>> > > >> > > > > > > > But I know first hand how much trouble folks have had with other features that have gone into downstream facing releases without robustness checks (i.e. replication), and I'm concerned about what we're setting up if 2.0 goes out with this feature in its current state.
>> > > >> > > > > > > >
>> > > >> > > > >
>> > > >> > > > > --
>> > > >> > > > > Best regards,
>> > > >> > > > >
>> > > >> > > > > - Andy
>> > > >> > > > >
>> > > >> > > > > Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
>> > > >> > > > >
>> > > >> > >
>> > > >> > > --
>> > > >> > > Best regards,
>> > > >> > >
>> > > >> > > - Andy
>> > > >> > >
>> > > >> > > Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
>> > > >> > >