RE: HBASE WALs

Marc Hoppins Wed, 24 Mar 2021 02:01:33 -0700

😊 You are a helpful elf. It takes a while for everything to slow down and stop when 
waiting to shut down hbase after read/write operations have been stopped. The DB 
folk were chomping at the bit to get started importing again.


So, to be clear (I must sound like an idiot)...

Disabling a table: does that perform any region operations before shutting down 
(compacting/merging) or do these get written to master WALs to be continued 
when the table is enabled?

If no operations are being carried out and all tables are disabled, all the 
remaining masterProcWALs will be for the procedures which lurk in the 
proc/lock list for this table (hds2_md5). In the sheet I sent, these include:

ENABLE table
DISABLE table
RUNNABLE assigns
SUCCESS unassigns
WAITING_TIMEOUT (our stuck region)


Since restarting hbase (without actually clearing anything), there now exists a 
long list with dates going back to Feb 17, when the error first occurred. I 
guess I am going to have to fix this once and for all before we end up with a 
system full of old WALs.

Procedure WAL state

LogID   Size       Timestamp                      Path
11288   29.4 KB    Wed Mar 24 09:23:28 CET 2021   hdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011288.log
11287   166.2 KB   Wed Mar 24 08:23:28 CET 2021   hdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011287.log
11286   183.6 KB   Wed Mar 24 07:23:28 CET 2021   hdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011286.log
11285   101.9 KB   Wed Mar 24 06:23:28 CET 2021   hdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011285.log
11284   88.1 KB    Wed Mar 24 05:23:28 CET 2021   hdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011284.log
11283   101.7 KB   Wed Mar 24 04:23:28 CET 2021   hdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011283.log
11282   87.9 KB    Wed Mar 24 03:23:28 CET 2021   hdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs/pv2-00000000000000011282.log
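
To keep an eye on how far back the procedure WALs go, the listing can be checked with a few lines of Python. This is only a sketch: the rows are copied from the listing above, while the function names (parse_ts, older_than) and the three-hour cutoff are illustrative assumptions and not anything HBase provides.

```python
from datetime import datetime, timedelta

# Rows copied from the "Procedure WAL state" listing: (log id, size, timestamp).
ROWS = [
    (11288, "29.4 KB", "Wed Mar 24 09:23:28 CET 2021"),
    (11287, "166.2 KB", "Wed Mar 24 08:23:28 CET 2021"),
    (11286, "183.6 KB", "Wed Mar 24 07:23:28 CET 2021"),
    (11285, "101.9 KB", "Wed Mar 24 06:23:28 CET 2021"),
    (11284, "88.1 KB", "Wed Mar 24 05:23:28 CET 2021"),
    (11283, "101.7 KB", "Wed Mar 24 04:23:28 CET 2021"),
    (11282, "87.9 KB", "Wed Mar 24 03:23:28 CET 2021"),
]


def parse_ts(text):
    """Parse a listing timestamp, ignoring the time-zone token (e.g. CET)."""
    parts = text.split()
    without_zone = " ".join(parts[:4] + parts[5:])
    return datetime.strptime(without_zone, "%a %b %d %H:%M:%S %Y")


def older_than(rows, now, max_age):
    """Return the log ids whose timestamp is more than max_age before now."""
    return [log_id for log_id, _size, ts in rows if now - parse_ts(ts) > max_age]


# Hypothetical "now" and cutoff, chosen only for illustration.
now = datetime(2021, 3, 24, 10, 0, 0)
stale = older_than(ROWS, now, timedelta(hours=3))
print(stale)
```

A growing result from one check to the next would suggest that old procedure WALs are not being cleaned up.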

It is a sticky situation that we are not in a position to upgrade Cloudera (and 
thus Hadoop services/software) to a newer version.

-----Original Message-----
From: Wellington Chevreuil <[email protected]> 
Sent: Tuesday, March 23, 2021 6:16 PM
To: Hbase-User <[email protected]>
Subject: Re: HBASE WALs

EXTERNAL

>
> I am still not certain what will happen.  masterProcWALs contain info 
> for all (running) tables, yes?
>
masterProcWALs only contain info for running procedures, not user table data. 
User table data go on "normal" WALs, not "masterProcWALs".

> If all tables are disabled and I remove the master wals, how will that
> affect the other tables? When I disabled all tables, hundreds of 
> master WALs are now created. This means there is a bunch of pending 
> operations, yes?  Is it going to make some other things inconsistent?

Table disabling involves the unassignment of all of these tables' regions. Each of 
these "unassign" operations comprises a set of sequential phases. These internal 
operations are called "procedures". Information about the progress of each 
operation, as it moves through its different phases, is stored in these 
masterProcWALs files. That's why triggering the "disable" 
command will create some data under masterProcWALs. If all the disable commands 
finished successfully, and all your procedures are finished (apart from that 
rogue one that has existed for a while already), you would be good to clean out 
masterProcWALs.
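
The "everything is finished apart from the rogue one" condition can be pictured as a small check over the procedure list. The sketch below is a toy model, not HBase code: the Proc class, the helper names and most of the sample rows are hypothetical. Only PID 73587 (the DisableTableProcedure) and its WAITING_TIMEOUT subprocedure 73827 come from this thread, and only SUCCESS is treated as a finished state.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Assumption for this sketch: SUCCESS is the only "finished" state modelled.
FINISHED = {"SUCCESS"}


@dataclass
class Proc:
    pid: int
    parent: Optional[int]
    state: str


def family(procs: List[Proc], root_pid: int) -> Set[int]:
    """Return root_pid plus the pids of all of its descendants."""
    members = {root_pid}
    changed = True
    while changed:
        changed = False
        for p in procs:
            if p.parent in members and p.pid not in members:
                members.add(p.pid)
                changed = True
    return members


def unfinished_outside(procs: List[Proc], rogue_root: int) -> List[int]:
    """Pids of unfinished procedures that do not belong to the rogue tree."""
    rogue = family(procs, rogue_root)
    return [p.pid for p in procs if p.state not in FINISHED and p.pid not in rogue]


PROCS = [
    Proc(73587, None, "WAITING"),           # rogue DisableTableProcedure (state assumed)
    Proc(73590, 73587, "SUCCESS"),          # hypothetical finished unassign
    Proc(73827, 73587, "WAITING_TIMEOUT"),  # stuck subprocedure from the thread
    Proc(80001, None, "RUNNABLE"),          # hypothetical unrelated procedure
]

print(unfinished_outside(PROCS, 73587))
```

In this model, an empty result would correspond to the situation described above, where only the rogue procedure and its children remain unfinished.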

> I did try to set the table state manually to see if the faulty table would
> fire up and I restarted hbase...state was the same: a locked table 
> state due to pending disable and stuck region.
>
That's because of the rogue procedure. When you restarted master, it went 
through masterProcWALs and resumed the rogue procedure from the unfinished 
state it was in when you restarted hbase. If you had removed masterProcWALs prior 
to restart, the rogue procedure would now be gone.

> We may have the go-ahead to remove this table - I assume we cannot clone it
> while it is in a state of (DISABLED) flux but, once again, messing 
> with master WALs has me on edge.

From what I understand, you already have the tables disabled, and no unfinished 
procs apart from the rogue one, so just clean out masterProcWALs and restart 
master.
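
As a rough illustration of the "clean out" step, the sketch below only builds and prints an HDFS move command. It does not run anything. The source path is taken from the Procedure WAL listing in this thread. The backup destination is hypothetical, and moving the directory aside (rather than deleting it) is an assumption made here for caution. As discussed earlier in the thread, the master would need to be stopped before touching these files.

```python
import shlex

# Path taken from the "Procedure WAL state" listing in this thread.
PROC_WAL_DIR = "hdfs://nameservice-hbase-jumbo/hbase/MasterProcWALs"


def move_aside_command(src, dst):
    """Return the argv for `hdfs dfs -mv src dst`."""
    return ["hdfs", "dfs", "-mv", src, dst]


# Hypothetical backup location outside the HBase root directory.
backup = "hdfs://nameservice-hbase-jumbo/tmp/MasterProcWALs.bak"

cmd = move_aside_command(PROC_WAL_DIR, backup)

# Print the command for review instead of executing it.
print(" ".join(shlex.quote(part) for part in cmd))
```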

On Tue, Mar 23, 2021 at 11:13, Marc Hoppins <[email protected]> wrote:

> I am still not certain what will happen.  masterProcWALs contain info 
> for all (running) tables, yes?
>
> If all tables are disabled and I remove the master wals, how will that 
> affect the other tables? When I disabled all tables, hundreds of 
> master WALs are now created. This means there is a bunch of pending 
> operations, yes?  Is it going to make some other things inconsistent?
>
> I did try to set the table state manually to see if the faulty table 
> would fire up and I restarted hbase...state was the same: a locked 
> table state due to pending disable and stuck region.
>
> We may have the go-ahead to remove this table - I assume we cannot 
> clone it while it is in a state of (DISABLED) flux but, once again, 
> messing with master WALs has me on edge.
>
>
> -----Original Message-----
> From: Wellington Chevreuil <[email protected]>
> Sent: Tuesday, March 16, 2021 4:50 PM
> To: Hbase-User <[email protected]>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> >
> > To be clear, if the other tables are stopped, I assume all pending 
> > and current operations will finish. How long will it take to write 
> > all data - if indeed the data does get permanently written - so that 
> > we can safely remove WALs?
> >
> If by "tables stopped" you mean your tables are disabled, then yeah, 
> all related data would already have been flushed into hfiles and 
> wouldn't be on your wals. But please be aware that what you really 
> need here to get rid of the rogue proc is to remove master proc wals, not 
> normal wals.
>
> On Tue, Mar 16, 2021 at 07:12, Marc Hoppins <[email protected]> wrote:
>
> > Overall, I am mystified as to how this could happen.  If Hadoop has 
> > a replication factor (I believe we use the default) of 3 and we have 
> > two datacenters with masters and workers in both, how can a network 
> > outage affect Hadoop operation? Surely it should have used available 
> > resources to continue operations...or have I misinterpreted entirely?
> >
> > -----Original Message-----
> > From: Stack <[email protected]>
> > Sent: Tuesday, March 16, 2021 7:16 AM
> > To: Hbase-User <[email protected]>
> > Subject: Re: HBASE WALs
> >
> > EXTERNAL
> >
> > On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins <[email protected]> wrote:
> >
> > > Hi, all,
> > >
> > > For our stuck region, this exists in meta.  Could we alter the 
> > > state to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
> > >
> > You could but IIRC, in that version of HBase, you may need to restart the 
> > Master after the change (changing hbase:meta does not update the 
> > Master's in-memory state). On restart, Master will read hbase:meta 
> > to discover Region state.
> >
> > S
> >
> >
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:regioninfo, timestamp=1613580024017,
> > >   value={ENCODED => f25fe93e24b34cb2f7fffddee1d89eec, NAME =>
> > >   'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.',
> > >   STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:seqnumDuringOpen, timestamp=1611787189839,
> > >   value=\x00\x00\x00\x00\x00\x00\x04\x8F
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:server, timestamp=1611787189839,
> > >   value=dr1-hbase18.jumbo.hq.eset.com:16020
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:serverstartcode, timestamp=1611787189839,
> > >   value=1611785264032
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:sn, timestamp=1613580024017,
> > >   value=ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
> > > hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
> > >   column=info:state, timestamp=1613580024017, value=OPENING
> > >
> > > -----Original Message-----
> > > From: Wellington Chevreuil <[email protected]>
> > > Sent: Wednesday, March 10, 2021 10:56 AM
> > > To: Hbase-User <[email protected]>
> > > Subject: Re: HBASE WALs
> > >
> > > EXTERNAL
> > >
> > > >
> > > > Sorry if I seem stupid but this is still all new to me.
> > > >
> > > Forgot to mention, there are no stupid questions here. Don't be shy 
> > > and keep 'em coming.
> > >
> > > On Wed, Mar 10, 2021 at 09:48, Wellington Chevreuil <[email protected]> wrote:
> > >
> > > >> However, how would that help anyway?  If we cannot fix this at 
> > > >> this time then any upgrade would have inconsistencies also, yes?
> > > >>
> > > > The upgrade on its own wouldn't fix existing inconsistencies, 
> > > > but you would now have support for additional tooling 
> > > > (hbase-operator-tools) to help you with this.
> > > >
> > > >> As all the 'SUCCESS' procedures have a parent ID 73587, does this 
> > > >> mean that they were successfully and fully moved from hbase25 to 
> > > >> each server mentioned in that procedure?  Or does it just mean 
> > > >> that the region was successfully unassigned from hbase25 but 
> > > >> the data still resides on hbase25?  I see locality 0.
> > > >>
> > > > IIRC, those were all UnassignProcedures, so it means the 
> > > > unassignment of the related region has completed and the region 
> > > > for that particular procedure went offline.
> > > >
> > > >> If we change the table state in meta to 'ENABLED', could this kickstart 
> > > >> all these things or will it just lead to further problems?
> > > >
> > > > Each master works with its own in-memory cache of meta, so manually 
> > > > updating meta will just make the master's cache inconsistent with it. 
> > > > You would need to restart the masters to get the cache reloaded from 
> > > > meta. The main problem is that you still have the rogue 
> > > > procedures, which you can't get rid of without stopping the 
> > > > cluster. One alternative to a full cluster outage would be to 
> > > > identify all RSes running the rogue procs (you can find that 
> > > > from active master logs), then stop only those and master, clean 
> > > > masterprocwals, then start them again.
> > > >
> > > >
> > > >> I suppose it means I am asking, the 73587 
> > > >> DisableTableProcedure, does it mean that the table is waiting 
> > > >> to be disabled?  HBASE master declares that table is NOT enabled.
> > > >>
> > > > The table state may have been already updated to disabled, most 
> > > > of its regions may already be offline, but the 73587 
> > > > DisableTableProcedure cannot be considered "done" until all its 
> > > > sub procedures are indeed completed.
> > > >
> > > >
> > > > On Tue, Mar 9, 2021 at 13:40, Marc Hoppins <[email protected]> wrote:
> > > >
> > > >> Thanks for that.
> > > >>
> > > >> Alas, we are (currently) constrained by using Cloudera (CDH) 
> > > >> 6.3.1 and do not have a viable business case to pay the 
> > > >> extortionate amount of money required to upgrade, which would 
> > > >> give these clusters access to newer versions.
> > > >>
> > > >> However, how would that help anyway?  If we cannot fix this at 
> > > >> this time then any upgrade would have inconsistencies also, yes?
> > > >>
> > > >> As all the 'SUCCESS' procedures have a parent ID 73587, does 
> > > >> this mean that they were successfully and fully moved from 
> > > >> hbase25 to each server mentioned in that procedure?  Or does it 
> > > >> just mean that the region was successfully unassigned from 
> > > >> hbase25 but the data still resides on hbase25?  I see locality 0.
> > > >>
> > > >> If we change the table state in meta to 'ENABLED', could this 
> > > >> kickstart all these things or will it just lead to further problems?
> > > >> I suppose it means I am asking, the 73587 
> > > >> DisableTableProcedure, does it mean that the table is waiting 
> > > >> to be disabled?  HBASE master declares that table is NOT enabled.
> > > >>
> > > >> Sorry if I seem stupid but this is still all new to me.
> > > >>
> > > >> I appreciate the help.
> > > >>
> > > >> -----Original Message-----
> > > >> From: Wellington Chevreuil <[email protected]>
> > > >> Sent: Tuesday, March 9, 2021 1:20 PM
> > > >> To: Hbase-User <[email protected]>
> > > >> Subject: Re: HBASE WALs
> > > >>
> > > >> EXTERNAL
> > > >>
> > > >> >
> > > >> > All fails are waiting on the same PID (73587), a DISABLE TABLE 
> > > >> > procedure.
> > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems 
> > > >> > to be the problem.
> > > >> >
> > > >> Per your list procedures output attached, it seems the procs' 
> > > >> states are all inconsistent. There's a WAITING_TIMEOUT subproc of 
> > > >> 73587 with PID 73827, which is the UnassignProcedure for this 
> > > >> region. Problem is that there are already 5 APs for the same 
> > > >> region, which may be causing some deadlocks. If this cluster 
> > > >> were on an hbck2-supported version, you could get rid of this 
> > > >> state using the bypass command on all these proc ids, then manually 
> > > >> get the table/region states consistent again using the 
> > > >> setRegionState/setTableState/assigns/unassigns methods.
> > > >>
> > > >> Without tooling, the only option I can think of is to stop the 
> > > >> cluster, clean out masterprocwals, restart the cluster, then use 
> > > >> hbase shell to enable/disable/assign regions. You may also need 
> > > >> to manually update table/region states in meta table. Of 
> > > >> course, you can automate these manual steps into your own 
> > > >> tooling, but it may be a better strategy in the long term to 
> > > >> upgrade to a more stable version that also benefits from more 
> > > >> tooling supported by the community.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Mar 8, 2021 at 07:50, Marc Hoppins <[email protected]> wrote:
> > > >>
> > > >> > Hi, Wellington,
> > > >> >
> > > >> > I was on 'vacation' (no road trip or overseas anything) for a week.
> > > >> >
> > > >> > All fails are waiting on the same PID (73587), a DISABLE TABLE 
> > > >> > procedure.
> > > >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems 
> > > >> > to be the problem.
> > > >> >
> > > >> > I am still mystified about the HBCK2-tools. I have attached a 
> > > >> > previous thread that you commented on at the time.
> > > >> >
> > > >> > I did build the tools for our HBASE 2.1.0...or rather, I built 
> > > >> > them on Ubuntu 20.04 with openJDK8 (1.8.0_212), then successfully 
> > > >> > ran them on Ubuntu 16.04 with a slightly different java (Oracle 
> > > >> > Java 8, 1.8.0_181). I used them to help fix a similar problem with 
> > > >> > an offline table and RITs. Both HBASE versions are the same.
> > > >> >
> > > >> > I attach a 'sheet' with the current procs/locks.
> > > >> >
> > > >> > -----Original Message-----
> > > >> > From: Marc Hoppins <[email protected]>
> > > >> > Sent: Wednesday, March 3, 2021 9:51 AM
> > > >> > To: [email protected]
> > > >> > Cc: Martin Oravec <[email protected]>
> > > >> > Subject: RE: HBASE WALs
> > > >> >
> > > >> > EXTERNAL
> > > >> >
> > > >> > Thanks, Wellington,
> > > >> >
> > > >> > I have already built hbck1-tools for 2.1.0 using the method 
> > > >> > described in other topics. All the HBASE and JDK here is the 
> > > >> > same version so if it worked fixing one cluster's HBASE then it 
> > > >> > should work for other installs.
> > > >> >
> > > >> > Fiddling with masterprocWALs will require complete shutdown 
> > > >> > of hbase operations to prevent incoming reads/writes on other 
> > > >> > tables and I am not sure how disruptive that will be other 
> > > >> > than "probably a lot".
> > > >> >
> > > >> > -----Original Message-----
> > > >> > From: Wellington Chevreuil <[email protected]>
> > > >> > Sent: Tuesday, March 2, 2021 10:57 AM
> > > >> > To: Hbase-User <[email protected]>
> > > >> > Subject: Re: HBASE WALs
> > > >> >
> > > >> > EXTERNAL
> > > >> >
> > > >> > Sorry, missed your previous email. I was hoping you were not 
> > > >> > on a non-stable version, so that you would benefit from hbck2 
> > > >> > tool support. Unfortunately, 2.1.0 is among the early releases 
> > > >> > that don't work with this tool (it requires at least 2.0.3, 
> > > >> > 2.1.1 or 2.2.0).
> > > >> >
> > > >> > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system 
> > > >> > > seems mostly unhappy with one region in particular, and is 
> > > >> > > reporting on that.
> > > >> > >
> > > >> > Are the other regions for the table properly closed, and this 
> > > >> > is the only one stuck? If you do a list_procedures, are you 
> > > >> > able to identify an 'unassign' procedure still running for 
> > > >> > this table? Or if you grep master logs for this region, do 
> > > >> > you see any messages suggesting there are still ongoing 
> > > >> > attempts to bring the region offline? If there's apparently 
> > > >> > no procedure/no ongoing attempts to offline the region, you 
> > > >> > might try to manually update its state in meta table, then 
> > > >> > flip masters (assuming you have master HA), so that the new 
> > > >> > active loads an up-to-date state from meta table.
> > > >> >
> > > >> > Otherwise, if there's still a rogue procedure trying to 
> > > >> > offline the region, unfortunately, due to the lack of hbck 
> > > >> > support, you would most likely need a more disruptive 
> > > >> > intervention similar to what you had described in your first 
> > > >> > email, but instead of the normal wal folder, master proc wals is 
> > > >> > what you really would need to clean out here, as that is 
> > > >> > where procedure state is persisted, and you wouldn't want 
> > > >> > the rogue procedure to be resumed.
> > > >> >
> > > >> > On Mon, Mar 1, 2021 at 10:22, Marc Hoppins <[email protected]> wrote:
> > > >> >
> > > >> > > If you know of anything that will help I would appreciate it.
> > > >> > >
> > > >> > > If you need any log output let me know.
> > > >> > >
> > > >> > > Thanks
> > > >> > >
> > > >> > >
> > > >> > > -----Original Message-----
> > > >> > > From: Wellington Chevreuil <[email protected]>
> > > >> > > Sent: Thursday, February 25, 2021 4:08 PM
> > > >> > > To: Hbase-User <[email protected]>
> > > >> > > Subject: Re: HBASE WALs
> > > >> > >
> > > >> > > EXTERNAL
> > > >> > >
> > > >> > > >
> > > >> > > > Do WAL files contain information for multiple regions per 
> > > >> > > > WAL or is one WAL associated with one region?
> > > >> > > >
> > > >> > > Edits for multiple regions would be present in a single wal file. 
> > > >> > > That's why upon a RS crash and wal processing, there's a 
> > > >> > > wal split phase.
> > > >> > >
> > > >> > > > I am trying to find a way to clear a RIT for a disabled table.
> > > >> > > > A similar problem (but on a test cluster) involved me clearing 
> > > >> > > > znode info, deleting HDFS data for the table and deleting 
> > > >> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
> > > >> > > >
> > > >> > > Which hbase version are you on?
> > > >> > >
> > > >> > > On Thu, Feb 25, 2021 at 11:51, Marc Hoppins <[email protected]> wrote:
> > > >> > >
> > > >> > > > Hi all,
> > > >> > > >
> > > >> > > > Do WAL files contain information for multiple regions per 
> > > >> > > > WAL or is one WAL associated with one region?
> > > >> > > >
> > > >> > > > I am trying to find a way to clear a RIT for a disabled table.
> > > >> > > > A similar problem (but on a test cluster) involved me 
> > > >> > > > clearing znode info, deleting HDFS data for the table and 
> > > >> > > > deleting WALs/MasterProcWAL files, finally restarting 
> > > >> > > > HBASE service.
> > > >> > > >
> > > >> > > > Table cannot be enabled.
> > > >> > > >
> > > >> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the 
> > > >> > > > system seems mostly unhappy with one region in 
> > > >> > > > particular, and is reporting on that.
> > > >> > > >
> > > >> > > > There are many tables that are very active so I don't 
> > > >> > > > think it is possible to stop the entire service without a 
> > > >> > > > lot of forewarning to users.
> > > >> > > >
> > > >> > > > Thanks in advance.
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>
