Would recommend you reach out to Cloudera Support if you're already using CDH. They will be able to help you more hands-on, with steps to find the busted procWAL(s) and recover.

On 4/7/21 2:11 AM, Marc Hoppins wrote:
Unfortunately, we are currently stuck using CDH 6.3.2 with HBase 2.1.0.  The
company cannot really justify the cost of upgrading this particular offering at
the incredibly expensive price per node, as the data being stored is not making
us any money that would justify such spending for the size of the cluster.

-----Original Message-----
From: Stack <st...@duboce.net>
Sent: Wednesday, April 7, 2021 12:55 AM
To: Hbase-User <user@hbase.apache.org>
Subject: Re: HBASE WALs


On Tue, Mar 30, 2021 at 2:52 AM Marc Hoppins <marc.hopp...@eset.sk> wrote:

Dear HBASE gang,

...and, as I previously mentioned, we now have a grand bunch of OLD
WALs milling about.


WALs in the masterProcWALs dir?

My thinking is that if nothing is going on with writing, then anything in
any masterProcWALs must be related to the bad table, and we can just
wipe them and restart HBase.

Questions I have:

Am I correct in my theory? (I am far from being a Java guy, so I am not
sure how to follow the process there)


If the old masterProcWALs are not clearing out, there must be corruption in the
older WALs that is preventing them from 'completing' so they can be released
(meanwhile, new procs are added ahead of the old ones... so more WALs show up).


If another (quicker) choice was made and we stop DB operations,
disable all tables, then delete masterProcWALs WITHOUT waiting for
compactions to finish, would we have a real problem with where HBase
thinks data is, or where it should be going, due to anything that was
pending in masterWALs for (possibly) all tables?


Compactions are interruptible. Compactions have nothing to do w/ the 
masterProcStore (or with where data is located).



Is there any sane way to deal with the information in masterWALs?  Or
is that only a Java API thing?


The old WALs are corrupt. You could try to get HBase to a quiescent state, stop
it, and try removing an old WAL... then restart and see if all is ok. The hard
part is that procedures sometimes span WALs, so removal may just move the
corruption forward.
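
A rough sketch of the careful version of that (assumes the default /hbase root
dir; the pv2-* file name below is illustrative - list the dir to see what you
actually have):

  # with HBase quiesced and stopped
  hdfs dfs -ls /hbase/MasterProcWALs              # oldest = lowest sequence number
  hdfs dfs -mkdir -p /hbase/procwal-backup
  hdfs dfs -mv /hbase/MasterProcWALs/pv2-00000000000000000001.log /hbase/procwal-backup/
  # restart and watch the master log; if things get worse, stop and move it back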

Upgrade is your best course.... to 2.3. The procedure store will be migrated. 
There'll likely be some mess to be cleaned up but at least there is tooling to 
do so in later hbases.

S



Thanks for all the help/info thus far.

-----Original Message-----
From: Marc Hoppins <marc.hopp...@eset.sk>
Sent: Friday, March 26, 2021 10:49 AM
To: user@hbase.apache.org
Subject: RE: HBASE WALs


I wonder if anyone can explain the following:

Before my attempt to fix things, the HBase master was retrying to deal
with that stuck region; the attempt counter was increasing - I think
at last count we were up to 3000 or something.  After my attempt, and
after I restarted HBase, it has not tried to fix the stuck region, and
attempts are currently at zero.  All procs and locks still exist.

-----Original Message-----
From: Wellington Chevreuil <wellington.chevre...@gmail.com>
Sent: Tuesday, March 23, 2021 6:16 PM
To: Hbase-User <user@hbase.apache.org>
Subject: Re: HBASE WALs



I am still not certain what will happen.  masterProcWALs contain
info for all (running) tables, yes?

masterProcWALs only contain info for running procedures, not user
table data. User table data goes into "normal" WALs, not "masterProcWALs".
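
To illustrate the distinction (default layout; the server name is just the one
from your meta output):

  /hbase/WALs/ba-hbase25.jumbo.hq.eset.com,16020,1604475904456/...  <- region edits (user data)
  /hbase/MasterProcWALs/pv2-*.log                                   <- procedure state only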

  If all tables are disabled and I remove the master WALs, how will
that affect the other tables? When I disabled all tables, hundreds of
master WALs were created. This means there is a bunch of pending
operations, yes?  Is it going to make some other things inconsistent?

Table disabling involves the unassignment of all of these tables' regions.
Each of these "unassign" operations comprises a set of sequential phases.
These internal operations are called "procedures". Information about
the progress of these operations, as they move through their different
phases, is stored in these masterProcWALs files. That's why triggering
the "disable" command will create some data under masterProcWALs. If
all the disable commands finished successfully, and all your procedures
are finished (apart from that rogue one that has existed for a while
already), you would be good to clean out masterProcWALs.
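
You can verify that from the hbase shell before touching anything, e.g.:

  list_procedures    # everything should be finished except the rogue 73587 tree
  list_locks         # shows which procedures still hold table/region locks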

I did try to set the table state manually to see if the faulty table
would fire up, and I restarted hbase... the state was the same: a locked
table state due to the pending disable and stuck region.

That's because of the rogue procedure. When you restarted the master, it
went through masterProcWALs and resumed the rogue procedure from the
unfinished state it was in when you restarted HBase. If you had removed
masterProcWALs prior to the restart, the rogue procedure would now be gone.

We may have the go-ahead to remove this table - I assume we cannot
clone it while it is in a state of (DISABLED) flux but, once again,
messing with master WALs has me on edge.

 From what I understand, you already have the tables disabled, and no
unfinished procs apart from the rogue one, so just clean out
masterProcWALs and restart the master.
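
A hedged sketch of that cleanup (move rather than delete, so you can roll
back; assumes the default /hbase root):

  # stop the master(s) first
  hdfs dfs -mv /hbase/MasterProcWALs /hbase/MasterProcWALs.bak
  hdfs dfs -mkdir /hbase/MasterProcWALs
  # start the master(s); list_procedures should then come back clean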

On Tue, Mar 23, 2021 at 11:13 AM, Marc Hoppins
<marc.hopp...@eset.sk>
wrote:

I am still not certain what will happen.  masterProcWALs contain
info for all (running) tables, yes?

If all tables are disabled and I remove the master WALs, how will
that affect the other tables? When I disabled all tables, hundreds
of master WALs were created. This means there is a bunch of
pending operations, yes?  Is it going to make some other things inconsistent?

I did try to set the table state manually to see if the faulty table
would fire up, and I restarted hbase... the state was the same: a
locked table state due to the pending disable and stuck region.

We may have the go-ahead to remove this table - I assume we cannot
clone it while it is in a state of (DISABLED) flux but, once again,
messing with master WALs has me on edge.


-----Original Message-----
From: Wellington Chevreuil <wellington.chevre...@gmail.com>
Sent: Tuesday, March 16, 2021 4:50 PM
To: Hbase-User <user@hbase.apache.org>
Subject: Re: HBASE WALs



To be clear, if the other tables are stopped, I assume all pending
and current operations will finish. How long will it take to write
all data - if indeed the data does get permanently written - so
that we can safely remove WALs?

If by "tables stopped" you mean your tables are disabled, then yeah,
all related data would already have been flushed into hfiles and
wouldn't be on your wals. But please be aware that what you really
need here to get rid of the rogue proc is to remove master proc
wals,
not normal wals.

On Tue, Mar 16, 2021 at 7:12 AM, Marc Hoppins
<marc.hopp...@eset.sk>
wrote:

Overall, I am mystified as to how this could happen.  If Hadoop
has a replication factor of 3 (I believe we use the default) and
we have two datacenters with masters and workers in both, how can
a network outage affect Hadoop operation? Surely it should have
used available resources to continue operations... or have I
misinterpreted entirely?

-----Original Message-----
From: Stack <st...@duboce.net>
Sent: Tuesday, March 16, 2021 7:16 AM
To: Hbase-User <user@hbase.apache.org>
Subject: Re: HBASE WALs


On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins
<marc.hopp...@eset.sk>
wrote:

Hi, all,

For our stuck region, this exists in meta.  Could we alter the
state to CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?

You could, but IIRC in that version of HBase you may need to
restart the Master after the change (changing hbase:meta does not
update the Master's in-memory state). On restart, the Master will
read hbase:meta to discover Region state.
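
For example, from the hbase shell (the row key is taken from your meta scan
below; this is a sketch, not a recipe - double-check before writing):

  put 'hbase:meta', 'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.', 'info:state', 'CLOSED'
  # then restart the active Master so it re-reads hbase:meta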

S


hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
  column=info:regioninfo, timestamp=1613580024017, value={ENCODED => f25fe93e24b34cb2f7fffddee1d89eec, NAME => 'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.', STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
  column=info:seqnumDuringOpen, timestamp=1611787189839, value=\x00\x00\x00\x00\x00\x00\x04\x8F
  column=info:server, timestamp=1611787189839, value=dr1-hbase18.jumbo.hq.eset.com:16020
  column=info:serverstartcode, timestamp=1611787189839, value=1611785264032
  column=info:sn, timestamp=1613580024017, value=ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
  column=info:state, timestamp=1613580024017, value=OPENING

-----Original Message-----
From: Wellington Chevreuil <wellington.chevre...@gmail.com>
Sent: Wednesday, March 10, 2021 10:56 AM
To: Hbase-User <user@hbase.apache.org>
Subject: Re: HBASE WALs



Sorry if I seem stupid but this is still all new to me.

Forgot to mention: there are no stupid questions here. Don't be
shy and keep 'em coming.

On Wed, Mar 10, 2021 at 9:48 AM, Wellington Chevreuil <
wellington.chevre...@gmail.com> wrote:

However, how would that help anyway?  If we cannot fix this at
this time, then any upgrade would have inconsistencies also, yes?

The upgrade on its own wouldn't fix existing inconsistencies,
but you would now have support for additional tooling
(hbase-operator-tools) to help you with this.

As all the 'SUCCESS' procedures have a parent ID 73587, does
this mean that they were successfully and fully moved from
hbase25 to each server mentioned in that procedure?  Or does it
just mean that the region was successfully unassigned from
hbase25 but the data still resides on hbase25?  I see locality 0.

IIRC, those were all UnassignProcedures, so it means the
unassignment of the related region has completed and the
region for that particular procedure went offline.

If we change the table state in meta to 'ENABLED', could this
kickstart all these things, or will it just lead to further problems?

Masters work with their own in-memory cache of meta, so manually
updating meta will just make the master's cache inconsistent with
meta. You would need to restart the masters to get the cache
reloaded from meta. The main problem is that you still have the
rogue procedures, which you can't get rid of without stopping the
cluster. One alternative to a full cluster outage would be to
identify all RSes running the rogue procs (you can find that
from the active master logs), then stop only those and the master,
clean masterProcWALs, then start it all again.
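
Roughly (a sketch only; the log path is whatever your install uses):

  # 1. from the active master log, find which RS each rogue proc is on
  grep 73827 /var/log/hbase/hbase-*-master-*.log
  # 2. stop those regionservers and the master
  # 3. move /hbase/MasterProcWALs aside (as described earlier in the thread)
  # 4. start the master and those regionservers again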


I suppose what I am asking is: the 73587
DisableTableProcedure - does it mean that the table is waiting
to be disabled?  The HBase master declares that the table is NOT enabled.

The table state may have already been updated to disabled, and
most of its regions may already be offline, but the 73587
DisableTableProcedure cannot be considered "done" until all
its sub-procedures are indeed completed.


On Tue, Mar 9, 2021 at 1:40 PM, Marc Hoppins
<marc.hopp...@eset.sk>
wrote:

Thanks for that.

Alas, we are (currently) constrained to Cloudera (CDH)
6.3.1 and do not have a viable business case to pay the
extortionate amount of money required to upgrade, which
would give these clusters access to newer versions.

However, how would that help anyway?  If we cannot fix this
at this time, then any upgrade would have inconsistencies also, yes?

As all the 'SUCCESS' procedures have a parent ID 73587, does
this mean that they were successfully and fully moved from
hbase25 to each server mentioned in that procedure?  Or does
it just mean that the region was successfully unassigned from
hbase25 but the data still resides on hbase25?  I see locality 0.

If we change the table state in meta to 'ENABLED', could this
kickstart all these things, or will it just lead to further
problems?
I suppose what I am asking is: the 73587
DisableTableProcedure - does it mean that the table is waiting
to be disabled?  The HBase master declares that the table is NOT enabled.

Sorry if I seem stupid but this is still all new to me.

I appreciate the help.

-----Original Message-----
From: Wellington Chevreuil <wellington.chevre...@gmail.com>
Sent: Tuesday, March 9, 2021 1:20 PM
To: Hbase-User <user@hbase.apache.org>
Subject: Re: HBASE WALs



All fails are waiting on the same PID (73587), a DISABLE
TABLE procedure.
The offending region (f25fe93e24b34cb2f7fffddee1d89eec)
seems to be the problem.

Per your list_procedures output attached, it seems the proc
states are all inconsistent. There's a WAIT_TIMEOUT subproc of
73587 with PID 73827, which is the UnassignProcedure for this
region. The problem is that there are already 5 APs for the same
region, which may be causing some deadlocks. If this cluster
were on an hbck2-supported version, you could get rid of this
state using the bypass command on all these proc ids, then
manually get the table/region states consistent again using
the setRegionState/setTableState/assigns/unassigns methods.
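
On a supported version that would look roughly like this (a
sketch; pids, table and region taken from this thread):

  hbase hbck -j hbase-hbck2.jar bypass -o 73587 73827
  hbase hbck -j hbase-hbck2.jar setTableState hds2_md5 DISABLED
  hbase hbck -j hbase-hbck2.jar setRegionState f25fe93e24b34cb2f7fffddee1d89eec CLOSED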

Without tooling, the only option I can think of is to stop the
cluster, clean out masterProcWALs, restart the cluster, then use
the hbase shell to enable/disable/assign regions. You may also
need to manually update table/region states in the meta table.
Of course, you can automate these manual steps into your own
tooling, but it may be a better strategy in the long term to
upgrade to a more stable version that also benefits from more
tooling supported by the community.
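
After the restart, the shell side of that might look like this
(sketch; table and region names from earlier in this thread):

  assign 'f25fe93e24b34cb2f7fffddee1d89eec'    # takes the encoded region name
  enable 'hds2_md5'                            # or leave disabled, per the state you want
  scan 'hbase:meta', {FILTER => "PrefixFilter('hds2_md5')"}   # sanity-check states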





On Mon, Mar 8, 2021 at 7:50 AM, Marc Hoppins
<marc.hopp...@eset.sk>
wrote:

Hi, Wellington,

I was on 'vacation' (no road trip or overseas anything) for a week.

All fails are waiting on the same PID (73587), a DISABLE
TABLE procedure.
The offending region (f25fe93e24b34cb2f7fffddee1d89eec)
seems to be the problem.

I am still mystified about the HBCK2-tools. I have attached
a previous thread that you commented on at the time.

I did build the tools for our HBase 2.1.0... or rather, I
built them on Ubuntu 20.04 with OpenJDK 8 (1.8.0_212), then
successfully ran them on Ubuntu 16.04 with a slightly
different Java (Oracle Java 8, 1.8.0_181).
I used them to help fix a similar problem with an offline
table and RITs.
Both HBase versions are the same.

I attach a 'sheet' with the current procs/locks.

-----Original Message-----
From: Marc Hoppins <marc.hopp...@eset.sk>
Sent: Wednesday, March 3, 2021 9:51 AM
To: user@hbase.apache.org
Cc: Martin Oravec <martin.ora...@eset.sk>
Subject: RE: HBASE WALs


Thanks, Wellington,

I have already built hbck1 tools for 2.1.0 using the method
described in other topics. All the HBase and JDK versions
here are the same, so if it worked for fixing one cluster's
HBase then it should work for the other installs.

Fiddling with masterProcWALs will require a complete shutdown
of HBase operations to prevent incoming reads/writes on
other tables, and I am not sure how disruptive that will be,
other than "probably a lot".

-----Original Message-----
From: Wellington Chevreuil <wellington.chevre...@gmail.com>
Sent: Tuesday, March 2, 2021 10:57 AM
To: Hbase-User <user@hbase.apache.org>
Subject: Re: HBASE WALs


Sorry, I missed your previous email. I was hoping you were
not on one of the early versions, so that you would benefit
from hbck2 tool support.
Unfortunately, 2.1.0 is among the early releases that don't
work with this tool (it requires at least 2.0.3, 2.1.1 or
2.2.0).

Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the
system seems mostly unhappy with one region in particular,
and is reporting on that.

Are the other regions for the table properly closed, and
this is the only one stuck? If you do a list_procedures,
are you able to identify an 'unassign' procedure still
running for this table? Or if you grep the master logs for
this region, do you see any messages suggesting there are
still ongoing attempts to bring the region offline? If
there's apparently no procedure and no ongoing attempt to
offline the region, you might try to manually update its
state in the meta table, then flip masters (assuming you
have master HA), so that the new active master loads an
up-to-date state from the meta table.
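
E.g. (log location depends on your install):

  grep f25fe93e24b34cb2f7fffddee1d89eec /var/log/hbase/*master*.log* | tail
  # look for repeated UnassignProcedure retries against that region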

Otherwise, if there's still a rogue procedure trying to
offline the region then, unfortunately, due to the lack of
hbck support, you would most likely need a more disruptive
intervention similar to what you had described in your
first email - but instead of the normal WAL folder, the
master proc WALs are what you really need to clean out
here, as that is where procedure state is persisted, and
you wouldn't want the rogue procedure to be resumed.

On Mon, Mar 1, 2021 at 10:22 AM, Marc Hoppins
<marc.hopp...@eset.sk>
wrote:

If you know of anything that will help I would appreciate it.

If you need any log output let me know.

Thanks


-----Original Message-----
From: Wellington Chevreuil
<wellington.chevre...@gmail.com>
Sent: Thursday, February 25, 2021 4:08 PM
To: Hbase-User <user@hbase.apache.org>
Subject: Re: HBASE WALs



Do WAL files contain information for multiple regions
per WAL or is one WAL associated with one region?

Multiple regions' edits would be present in a single WAL file.
That's why, upon an RS crash, there's a wal split phase during
wal processing.
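
In other words (default layout; host name illustrative):

  /hbase/WALs/<rs-host>,16020,<startcode>/              <- one WAL shared by all regions on that RS
  /hbase/data/<ns>/<table>/<region>/recovered.edits/    <- per-region output of the wal split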

I am trying to find a way to clear a RIT for a disabled table.
A similar problem (but on a test cluster) involved me clearing
znode info, deleting the HDFS data for the table and deleting
WALs/MasterProcWAL files, finally restarting the HBase service.

Which hbase version are you on?

On Thu, Feb 25, 2021 at 11:51 AM, Marc Hoppins
<marc.hopp...@eset.sk>
wrote:

Hi all,

Do WAL files contain information for multiple regions
per WAL or is one WAL associated with one region?

I am trying to find a way to clear a RIT for a disabled
table.
A similar problem (but on a test cluster) involved me
clearing znode info, deleting the HDFS data for the table
and deleting WALs/MasterProcWAL files, finally restarting
the HBase service.

Table cannot be enabled.

Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the
system seems mostly unhappy with one region in particular,
and is reporting on that.

There are many tables that are very active, so I don't
think it is possible to stop the entire service without
a lot of forewarning to users.

Thanks in advance.








