Re: [DISCUSS] Direction of HBCK2

Wellington Chevreuil Thu, 30 May 2019 02:01:14 -0700

>
> It seemed like the table data in HDFS was intact but they lost some
> metadata (in hbase:meta) of the table. So I needed to rebuild the meta
> from HDFS data. In this case, can we still fix it with some combination
> of commands today? If so, I would appreciate it if you could suggest the
> steps to me.
>


Yeah, there's no single command here. An alternative would be to combine
merges and bulkloading. For example, say you had regions A, B and C, and
meta now has only A and C, with a hole where B should be. How about merging
A and C, then bulkloading B's files into the table? Sure, that's much more
laborious than the magic hbck1 fix, but it matches my understanding of the
hbck2 goals described by Josh earlier. I understand the concerns, and
Andrew's argument about time to recover operation is a solid one. Maybe
it's worth revisiting and voting on which former hbck1 options are seen as
essential by the majority? From this discussion so far, it seems the most
missed are fixMeta, fixHdfsHoles and fixHdfsOverlaps?
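
To make the A/B/C example concrete, here is a minimal sketch of how a hole
shows up when you walk a region chain by start/end key. It is purely
illustrative Python, not hbck2 code: find_holes is a made-up name, and it
assumes regions are given as (start_key, end_key) byte pairs where an empty
key means "open-ended", as for the first and last regions of a table.

```python
from typing import List, Tuple

# (start_key, end_key); b"" marks the open start/end of the table's key space
Region = Tuple[bytes, bytes]


def find_holes(regions: List[Region]) -> List[Region]:
    """Return the key ranges that no region in the chain covers."""
    holes: List[Region] = []
    covered_to = b""    # the chain must begin at the empty start key
    open_ended = False  # becomes True once a region with an empty end key is seen
    for start, end in sorted(regions):
        if open_ended:
            break  # everything up to the end of the key space is already covered
        if start > covered_to:
            holes.append((covered_to, start))  # gap between previous end and this start
        if end == b"":
            open_ended = True
        elif end > covered_to:
            covered_to = end
    if not open_ended:
        holes.append((covered_to, b""))  # nothing covers the tail of the key space
    return holes


# Meta knows only A and C; B (g..p) is missing.
print(find_holes([(b"", b"g"), (b"p", b"")]))  # [(b'g', b'p')]
```

The reported range is where B should be, i.e. the key range whose files
would have to be bulkloaded back into the table.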


On Wed, May 29, 2019 at 23:10, Stack <st...@duboce.net> wrote:

> Would be good to do a bit of evangelizing that hbck2 is intentionally not
> meant to be like hbck1. hbck1 gave off the impression that it could fix
> "all" problems, rebuilding master functionality on the exterior in a
> contending script. Re-reading the hbck2 home page [1], hoping to find a
> quote to back Josh's perception, it is plain the text needs to state more
> forcefully the difference in philosophy.
>
> On missing hbck2 functionality, there is an outstanding task (HBASE-21745)
> sorting what is needed from hbck1 hangovers so the likes of our Andrew have
> confidence that should he hit an operational issue, he'll have tooling for
> repair. Let's be judicious in what we add to hbck2. We've left behind many
> of the problems hbck1 used to 'fix'. A rebuild of meta should disaster hit
> makes sense (and is a long-time ask). Fixup for the mess JMS is able to
> make upgrading from hbase1 to hbase2 makes sense too since this is what our
> users will be doing (File JIRAs w/ detail on the mess JMS?). Andrew made a
> list a while back here that needs consideration (HBASE-21745).
>
> S
>
> 1. https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2
>
>
>
> On Wed, May 29, 2019 at 8:55 AM Andrew Purtell <apurt...@apache.org>
> wrote:
>
> > To me this is a succinct specification of minimum functionality for a
> > recovery tool: using on disk bits, rebuild meta table, with end result a
> > working cluster that did not miss any data during the reconstruction.
> >
> > Of course focusing on root causes of metadata mismanagement is appropriate
> > when investigating a specific incident, but this is orthogonal to the
> > question of whether or not recovery is possible after a bug corrupts
> > metadata. It is customary for filesystems and databases to ship with a
> > tool that attempts recovery after corruption, on the (correct, IMHO)
> > assumption that corruption is inevitable, either due to logic bug,
> > hardware problems, or operator error.
> >
> > The features of hbck in HBase 1 that have resolved availability problems
> > where I work are: fixMeta, fixAssignments, fixHdfsHoles, fixHdfsOverlaps.
> > In HBaseFsck.java in branch-2 these are all in the unsupported options set.
> > Because these are all lacking in HBase 2 I will not certify it ready for
> > production to my employer. If there is some other tool which offers these
> > recovery options I'm not aware of it nor documentation for it and would
> > appreciate a pointer if you have one.
> >
> >
> > On Wed, May 29, 2019 at 7:11 AM Toshihiro Suzuki <brfrn...@apache.org>
> > wrote:
> >
> > > Thanks Wellington.
> > >
> > > > I guess those can still be fixed with some combinations of commands
> > > > today, such as merge/assign.
> > >
> > > Let me explain the situation I faced in the customer's cluster a little
> > > bit more.
> > > It seemed like the table data in HDFS was intact but they lost some
> > > metadata (in hbase:meta) of the table. So I needed to rebuild the meta
> > > from HDFS data. In this case, can we still fix it with some combination
> > > of commands today? If so, I would appreciate it if you could suggest the
> > > steps to me.
> > >
> > > > And focus on fixing the main root cause of such problems, as a means
> > > > to soften the need to use such commands.
> > >
> > > Yes, correct. Actually I usually do that. But I didn't do that in that
> > > case.
> > >
> > >
> > > On Wed, May 29, 2019 at 5:47 AM Wellington Chevreuil <
> > > wellington.chevre...@gmail.com> wrote:
> > >
> > > > Thanks Toshihiro! I guess those can still be fixed with some
> > > > combinations of commands today, such as merge/assign. Of course, it
> > > > requires some extra scripting and log reading in cases where many
> > > > regions are in an inconsistent state; maybe we should work on
> > > > providing a one-liner command that relies on the existing ones. And
> > > > focus on fixing the main root cause of such problems, as a means to
> > > > soften the need to use such commands.
> > > >
> > > > I'm not really a fan of OfflineMetaRepair, nor hbck1 fix
> > > > holes/overlaps, and would rather not have those back. Sure, those are
> > > > easy and convenient to trigger, but hbck1 reports are sometimes
> > > > misleading (for instance, it reports holes when region(s) in the chain
> > > > is/are simply not online), and that, combined with the availability of
> > > > such heavy hammers, has led inexperienced operators to run them and
> > > > end up in a worse state.
> > > >
> > > > On Wed, May 29, 2019 at 13:22, Toshihiro Suzuki <brfrn...@apache.org>
> > > > wrote:
> > > >
> > > > > Hi Wellington,
> > > > >
> > > > > I actually saw table holes in a customer's cluster, and I just fixed
> > > > > the issues with the workaround I mentioned in HBASE-21665
> > > > > <https://issues.apache.org/jira/browse/HBASE-21665>. I didn't dig
> > > > > into why the table holes happened at that time because the customer
> > > > > didn't want me to.
> > > > >
> > > > > However, IMO, whatever the reason, I think we should have a direct
> > > > > way to fix holes and overlaps.
> > > > >
> > > > > On Wed, May 29, 2019 at 4:57 AM Wellington Chevreuil <
> > > > > wellington.chevre...@gmail.com> wrote:
> > > > >
> > > > > > So JMS, Toshihiro, it seems like upgrading from some 1.x to 2.x
> > > > > > consistently triggers this problem? Do you guys know if there are
> > > > > > any bug jiras open that would cover these scenarios? If not, and if
> > > > > > you guys have enough resources for investigating it, maybe it's
> > > > > > worth opening a specific jira?
> > > > > >
> > > > > > On Wed, May 29, 2019 at 11:40, Jean-Marc Spaggiari <
> > > > > > jean-m...@spaggiari.org> wrote:
> > > > > >
> > > > > > > Personally, when I tried to upgrade from 1.4.x to 2.2.x I ended up
> > > > > > > in a situation where my meta was empty and had to be repaired, but
> > > > > > > I lacked OfflineMetaRepair for 2.2.x, so I just had to delete all
> > > > > > > my tables, get a brand new installation, recreate the tables and
> > > > > > > bulkload the data back into them. Would have been happy to have an
> > > > > > > OfflineMetaRepair.
> > > > > > >
> > > > > > > But it's more like an experimental cluster than a production one...
> > > > > > >
> > > > > > > JMS
> > > > > > >
> > > > > > > On Wed, May 29, 2019 at 06:36, Wellington Chevreuil <
> > > > > > > wellington.chevre...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Interesting, I haven't seen any cases where OfflineMetaRepair was
> > > > > > > > really required among our customer base (running
> > > > > > > > cdh6.1.x/hbase2.1.1, cdh6.2/hbase2.1.2). The majority of RIT issues
> > > > > > > > I came across on hbase 2.x were related to AP/SCP failures, most of
> > > > > > > > which could be sorted with the hbck2 commands available by then (in
> > > > > > > > some cases, this required some CLI scripting to build up a "bulk"
> > > > > > > > assign command).
> > > > > > > >
> > > > > > > > On Wed, May 29, 2019 at 00:55, Toshihiro Suzuki <brfrn...@apache.org>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Josh,
> > > > > > > > >
> > > > > > > > > Thank you for the explanation. I agree with the direction for
> > > > > > > > > HBCK2.
> > > > > > > > >
> > > > > > > > > The problem I wanted to tell you about in the Jira is that until
> > > > > > > > > we implement the features you mentioned, we don't have any direct
> > > > > > > > > way to fix holes and overlaps. The holes and overlaps can be
> > > > > > > > > created by bugs or operation errors, so I think we should be able
> > > > > > > > > to fix these issues.
> > > > > > > > >
> > > > > > > > > I thought OfflineMetaRepair could be a workaround for the issues
> > > > > > > > > until we implement the features of HBCK2.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Toshi
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, May 28, 2019 at 9:12 AM Josh Elser <els...@apache.org>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Context: https://issues.apache.org/jira/browse/HBASE-21665
> > > > > > > > > >
> > > > > > > > > > I left a comment on the above issue about what I thought good
> > > > > > > > > > things to build into HBCK2 would be -- a focus on specific
> > > > > > > > > > "primitive" operations that an admin/operator could use to help
> > > > > > > > > > repair an otherwise broken HBase installation. Some examples I
> > > > > > > > > > had in my head were:
> > > > > > > > > >
> > > > > > > > > > * Create an empty region (to plug a hole)
> > > > > > > > > > * Report holes in a region chain
> > > > > > > > > >
> > > > > > > > > > In my head, the difference for HBCK2 was that we want to give
> > > > > > > > > > folks the tools to fix their cluster, but we did not want to own
> > > > > > > > > > the "just fix everything" kind of tool that HBCK1 had become. The
> > > > > > > > > > problem with HBCK1 was that it was often difficult/problematic
> > > > > > > > > > for us to know how to correctly fix a problem (the same problem
> > > > > > > > > > could be corrected in different ways).
> > > > > > > > > >
> > > > > > > > > > Andrew had some confusion about this, so I'm not sure if I'm
> > > > > > > > > > off-base or if we're all in agreement on direction and we just
> > > > > > > > > > need to do a better job documenting things. Thanks for keeping me
> > > > > > > > > > honest either way :)
> > > > > > > > > >
> > > > > > > > > > And just in case it doesn't go without saying, HBCK2 would be
> > > > > > > > > > something that helps fix a system, while we want to always
> > > > > > > > > > understand the root cause of how/why we got into a situation
> > > > > > > > > > where we needed HBCK2 and also address that.
> > > > > > > > > >
> > > > > > > > > > - Josh
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Best regards,
> > Andrew
> >
> > Words like orphans lost among the crosstalk, meaning torn from truth's
> > decrepit hands
> >    - A23, Crosstalk
> >
>
