Re: HBASE WALs

Wellington Chevreuil Tue, 16 Mar 2021 08:50:58 -0700

>
> To be clear, if the other tables are stopped, I assume all pending and
> current operations will finish. How long will it take to write all data -
> if indeed the data does get permanently written - so that we can safely
> remove WALs?
>
If by "tables stopped" you mean your tables are disabled, then yeah, all
related data would already have been flushed into hfiles and wouldn't be on
your wals. But please be aware that what you really need here to get rid of
the rogue proc is to remove master proc wals, not normal wals.
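
To make the distinction above concrete, here is a minimal Python sketch (not
part of the original reply) that classifies an HDFS path by the WAL-related
directory it sits under. It assumes the default HBase 2.x layout, with a root
directory of /hbase containing WALs, oldWALs and MasterProcWALs. The root
path and the example file names are illustrative assumptions, so check
hbase.rootdir on your own cluster before relying on it.

```python
from pathlib import PurePosixPath

# Assumed default directory names under the HBase root directory.
WAL_DIRS = {
    "WALs": "region server WALs (table edits not yet flushed to hfiles)",
    "oldWALs": "archived region server WALs awaiting cleanup",
    "MasterProcWALs": "master procedure store (persisted procedure state)",
}


def classify(path: str, root: str = "/hbase") -> str:
    """Return the WAL_DIRS key that `path` falls under, or 'other'."""
    try:
        rel = PurePosixPath(path).relative_to(root)
    except ValueError:
        # The path is not under the assumed HBase root at all.
        return "other"
    top = rel.parts[0] if rel.parts else ""
    return top if top in WAL_DIRS else "other"


def holds_procedure_state(path: str, root: str = "/hbase") -> bool:
    """True only for paths under the master procedure WAL directory."""
    return classify(path, root) == "MasterProcWALs"
```

Under these assumptions, only paths for which holds_procedure_state() returns
True are relevant to clearing a rogue procedure; paths classified as WALs or
oldWALs hold table edits and are a separate matter.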


On Tue, Mar 16, 2021 at 07:12, Marc Hoppins <[email protected]>
wrote:

> Overall, I am mystified as to how this could happen.  If Hadoop has a
> replication factor (I believe we use the default) of 3 and we have two
> datacenters with masters and workers in both, how can a network outage
> affect Hadoop operation? Surely it should have used available resources to
> continue operations...or have I misinterpreted entirely?
>
> -----Original Message-----
> From: Stack <[email protected]>
> Sent: Tuesday, March 16, 2021 7:16 AM
> To: Hbase-User <[email protected]>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins <[email protected]> wrote:
>
> > Hi, all,
> >
> > For our stuck region, this exists in meta.  Could we alter the state
> > to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
> >
> You could but IIRC, in that version of HBase, you may need to restart the
> Master after the change (changing hbase:meta does not update the Master's
> in-memory state). On restart, Master will read hbase:meta to discover
> Region state.
>
> S
>
>
> > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > column=info:regioninfo, timestamp=1613580024017, value={ENCODED =>
> > f25fe93e24b34cb2f7fffddee1d89eec, NAME =>
> > 'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.',
> > STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
> > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > column=info:seqnumDuringOpen, timestamp=1611787189839,
> > value=\x00\x00\x00\x00\x00\x00\x04\x8F
> >  hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > column=info:server, timestamp=1611787189839, value=
> > dr1-hbase18.jumbo.hq.eset.com:16020
> >  hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > column=info:serverstartcode, timestamp=1611787189839,
> > value=1611785264032
> >  hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > column=info:sn, timestamp=1613580024017, value=
> > ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
> >  hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > column=info:state, timestamp=1613580024017, value=OPENING
> >
> > -----Original Message-----
> > From: Wellington Chevreuil <[email protected]>
> > Sent: Wednesday, March 10, 2021 10:56 AM
> > To: Hbase-User <[email protected]>
> > Subject: Re: HBASE WALs
> >
> > EXTERNAL
> >
> > >
> > > Sorry if I seem stupid but this is still all new to me.
> > >
> > Forgot to mention, there's no stupid questions here. Don't be shy and
> > keep'em coming.
> >
> > On Wed, Mar 10, 2021 at 09:48, Wellington Chevreuil <
> > [email protected]> wrote:
> >
> > >> However, how would that help anyway?  If we cannot fix this at this
> > >> time then any upgrade would have inconsistencies also, yes?
> > >>
> > > The upgrade on its own wouldn't fix existing inconsistencies, but
> > > you would now have support for additional tooling
> > > (hbase-operators-tool) to help you with this.
> > >
> > >> As all the 'SUCCESS' procedures have a parent ID 73587, does this
> > >> mean that they were successfully and fully moved from hbase25 to each
> > >> server mentioned in that procedure?  Or does it just mean that the
> > >> region was successfully unassigned from hbase25 but the data still
> > >> resides on hbase25?  I see locality 0.
> > >>
> > > IIRC, those were all UnassignProcedures, so it means the
> > > unassignment of the related region has completed and the region for
> > > that particular procedure went offline.
> > >
> > >> If we change the table state in meta to 'ENABLED', could this
> > >> kickstart all these things or will it just lead to further problems?
> > >
> > > Each master works with its own in-memory cache of meta, so manually
> > > updating meta will just make the master's cache inconsistent with it.
> > > You would need to restart the masters to get the cache reloaded from
> > > meta. The main problem is that you still have the rogue procedures,
> > > which you can't get rid of without stopping the cluster. One
> > > alternative to a full cluster outage would be to identify all RSes
> > > running the rogue procs (you can find that from the active master
> > > logs), then stop only those and the master, clean masterprocwals, then
> > > start them again.
> > >
> > >
> > >> I suppose it means I am asking, the 73587 DisableTableProcedure,
> > >> does it mean that the table is waiting to be disabled?  HBASE
> > >> master declares that table is NOT enabled.
> > >>
> > > The table state may have been already updated to disabled, most of
> > > its regions may already be offline, but the 73587
> > > DisableTableProcedure cannot be considered "done" until all its sub
> > > procedures are indeed completed.
> > >
> > >
> > > On Tue, Mar 9, 2021 at 13:40, Marc Hoppins
> > > <[email protected]>
> > > wrote:
> > >
> > >> Thanks for that.
> > >>
> > >> Alas, we are (currently) constrained by using Cloudera (CDH) 6.3.1
> > >> and do not have a viable business use to pay the extortionate
> > >> amount of money required to upgrade, which would give these
> > >> clusters access to newer versions.
> > >>
> > >> However, how would that help anyway?  If we cannot fix this at this
> > >> time then any upgrade would have inconsistencies also, yes?
> > >>
> > >> As all the 'SUCCESS' procedures have a parent ID 73587, does this
> > >> mean that they were successfully and fully moved from hbase25 to
> > >> each server mentioned in that procedure?  Or does it just mean that
> > >> the region was successfully unassigned from hbase25 but the data
> > >> still resides on hbase25?  I see locality 0.
> > >>
> > >> If we change the table state in meta to 'ENABLED', could this
> > >> kickstart all these things or will it just lead to further problems?
> > >> I suppose it means I am asking, the 73587 DisableTableProcedure,
> > >> does it mean that the table is waiting to be disabled?  HBASE
> > >> master declares that table is NOT enabled.
> > >>
> > >> Sorry if I seem stupid but this is still all new to me.
> > >>
> > >> I appreciate the help.
> > >>
> > >> -----Original Message-----
> > >> From: Wellington Chevreuil <[email protected]>
> > >> Sent: Tuesday, March 9, 2021 1:20 PM
> > >> To: Hbase-User <[email protected]>
> > >> Subject: Re: HBASE WALs
> > >>
> > >> EXTERNAL
> > >>
> > >> >
> > >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> > >> > procedure.
> > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to
> > >> > be the problem.
> > >> >
> > >> Per your list procedures output attached, it seems the procs states
> > >> are all inconsistent. There's a WAIT_TIMEOUT subproc of 73587 with
> > >> PID 73827, which is the UnassignProcedure for this region. Problem
> > >> is that there are already 5 APs for the same region, which may be
> > >> causing some deadlocks. If this cluster was on a hbck2 supported
> > >> version, you could get rid of this state using bypass command on
> > >> all these proc ids, then manually get the table/regions states
> > >> consistent again using setRegionState/setTableState/assigns/unassigns
> > >> methods.
> > >>
> > >> Without tooling, the only option I can think of is to stop cluster,
> > >> clean out masterprocwals, restart cluster, then use hbase shell to
> > >> enable/disable/assign regions. You may also need to manually update
> > >> table/region states in meta table. Of course, you can automate
> > >> these manual steps into your own tooling, but may be a better
> > >> strategy in the long term to upgrade to a more stable version that
> > >> also benefits from more tooling supported by the community.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Mon, Mar 8, 2021 at 07:50, Marc Hoppins
> > >> <[email protected]>
> > >> wrote:
> > >>
> > >> > Hi, Wellington,
> > >> >
> > >> > I was on 'vacation' (no road trip or overseas anything) for a week.
> > >> >
> > >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> > >> > procedure.
> > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to
> > >> > be the problem.
> > >> >
> > >> > I am still mystified about the HBCK2-tools. I have attached a
> > >> > previous thread that you commented on at the time.
> > >> >
> > >> > I did build the tools for our HBASE 2.1.0...or rather, I built them
> > >> > on Ubuntu 20.04 with OpenJDK 8 (1.8.0_212), then successfully ran
> > >> > them on Ubuntu 16.04 with a slightly different Java (Oracle Java 8,
> > >> > 1.8.0_181). I used them to help fix a similar problem with an
> > >> > offline table and RITs. Both HBASE versions are the same.
> > >> >
> > >> > I attach a 'sheet' with the current procs/locks.
> > >> >
> > >> > -----Original Message-----
> > >> > From: Marc Hoppins <[email protected]>
> > >> > Sent: Wednesday, March 3, 2021 9:51 AM
> > >> > To: [email protected]
> > >> > Cc: Martin Oravec <[email protected]>
> > >> > Subject: RE: HBASE WALs
> > >> >
> > >> > EXTERNAL
> > >> >
> > >> > Thanks, Wellington,
> > >> >
> > >> > I have already built a hbck1-tools for 2.1.0 using the method
> > >> > described in other topics. All the HBASE and JDK installs here are
> > >> > the same version, so if it worked fixing one cluster's HBASE then it
> > >> > should work for other installs.
> > >> >
> > >> > Fiddling with masterprocWALs will require complete shutdown of
> > >> > hbase operations to prevent incoming reads/writes on other tables
> > >> > and I am not sure how disruptive that will be other than
> > >> > "probably a lot".
> > >> >
> > >> > -----Original Message-----
> > >> > From: Wellington Chevreuil <[email protected]>
> > >> > Sent: Tuesday, March 2, 2021 10:57 AM
> > >> > To: Hbase-User <[email protected]>
> > >> > Subject: Re: HBASE WALs
> > >> >
> > >> > EXTERNAL
> > >> >
> > >> > Sorry, missed your previous email. I was hoping you were not on a
> > >> > non-stable version, so that you would benefit from hbck2 tool
> > >> > support. Unfortunately, 2.1.0 is among the early releases that don't
> > >> > work with this tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).
> > >> >
> > >> > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system
> > >> > > seems mostly unhappy with one region in particular, and is
> > >> > > reporting on that.
> > >> > >
> > >> > Are the other regions for the table properly closed, and this is
> > >> > the only one stuck? If you do a list_procedures, are you able to
> > >> > identify an 'unassign' procedure still running for this table? Or
> > >> > if you grep master logs for this region, do you see any messages
> > >> > suggesting there's still ongoing attempts to bring the region
> > >> > offline? If there's apparently no procedure/no ongoing attempts
> > >> > to offline the region, you might try to manually update its state
> > >> > in meta table, then flip masters (assuming you have master HA),
> > >> > so that the new active loads an up to date state from meta table.
> > >> >
> > >> > Otherwise, if there's still a rogue procedure trying to offline
> > >> > the region, unfortunately, due to the lack of hbck support, you
> > >> > would most likely need a more disruptive intervention similar to
> > >> > what you had described in your first email, but instead of normal
> > >> > wal folder, master proc wals is what you really would need to
> > >> > clean out here, as that is where procedures state is persisted,
> > >> > and you wouldn't want the rogue procedure to be resumed.
> > >> >
> > >> > On Mon, Mar 1, 2021 at 10:22, Marc Hoppins
> > >> > <[email protected]>
> > >> > wrote:
> > >> >
> > >> > > If you know of anything that will help I would appreciate it.
> > >> > >
> > >> > > If you need any log output let me know.
> > >> > >
> > >> > > Thanks
> > >> > >
> > >> > >
> > >> > > -----Original Message-----
> > >> > > From: Wellington Chevreuil <[email protected]>
> > >> > > Sent: Thursday, February 25, 2021 4:08 PM
> > >> > > To: Hbase-User <[email protected]>
> > >> > > Subject: Re: HBASE WALs
> > >> > >
> > >> > > EXTERNAL
> > >> > >
> > >> > > >
> > >> > > > Do WAL files contain information for multiple regions per WAL
> > >> > > > or is one WAL associated with one region?
> > >> > > >
> > >> > > Edits for multiple regions would be present in a single wal file.
> > >> > > That's why upon a RS crash and wal processing, there's a wal
> > >> > > split phase.
> > >> > >
> > >> > > > I am trying to find a way to clear a RIT for a disabled table.
> > >> > > > A similar problem (but on a test cluster) involved me clearing
> > >> > > > znode info, deleting HDFS data for the table and deleting
> > >> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
> > >> > > >
> > >> > > Which hbase version are you on?
> > >> > >
> > >> > > On Thu, Feb 25, 2021 at 11:51, Marc Hoppins
> > >> > > <[email protected]>
> > >> > > wrote:
> > >> > >
> > >> > > > Hi all,
> > >> > > >
> > >> > > > Do WAL files contain information for multiple regions per WAL
> > >> > > > or is one WAL associated with one region?
> > >> > > >
> > >> > > > I am trying to find a way to clear a RIT for a disabled table.
> > >> > > > A similar problem (but on a test cluster) involved me
> > >> > > > clearing znode info, deleting HDFS data for the table and
> > >> > > > deleting WALs/MasterProcWAL files, finally restarting HBASE
> > >> > > > service.
> > >> > > >
> > >> > > > Table cannot be enabled.
> > >> > > >
> > >> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the
> > >> > > > system seems mostly unhappy with one region in particular,
> > >> > > > and is reporting on that.
> > >> > > >
> > >> > > > There are many tables that are very active so I don't think
> > >> > > > it is possible to stop the entire service without a lot of
> > >> > > > forewarning to users.
> > >> > > >
> > >> > > > Thanks in advance.
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>
