Thanks for that.

Alas, we are (currently) constrained to Cloudera (CDH) 6.3.1 and do not
have a viable business case for paying the extortionate amount of money
required to upgrade, which would give these clusters access to newer versions.

However, how would that help anyway?  If we cannot fix this at this time then 
any upgrade would have inconsistencies also, yes?

As all the 'SUCCESS' procedures have a parent ID 73587, does this mean that 
they were successfully and fully moved from hbase25 to each server mentioned in 
that procedure?  Or does it just mean that the region was successfully 
unassigned from hbase25 but the data still resides on hbase25?  I see locality 
0.

If we change the table state in meta to 'ENABLED', could this kickstart all
these things or will it just lead to further problems?  I suppose what I am
asking is: does the DisableTableProcedure 73587 mean that the table is still
waiting to be disabled?  The HBase master reports that the table is NOT enabled.

Sorry if I seem stupid but this is still all new to me.

I appreciate the help.

-----Original Message-----
From: Wellington Chevreuil <wellington.chevre...@gmail.com> 
Sent: Tuesday, March 9, 2021 1:20 PM
To: Hbase-User <user@hbase.apache.org>
Subject: Re: HBASE WALs

>
> All fails are waiting on the same PID (73587), a DISABLE TABLE procedure.
> The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be 
> the problem.
>
Per the list_procedures output you attached, the procedure states all seem
inconsistent. There's a WAIT_TIMEOUT subprocedure of 73587 with PID 73827, which
is the UnassignProcedure for this region. The problem is that there are already
5 APs for the same region, which may be causing some deadlocks. If this cluster
were on an hbck2-supported version, you could get rid of this state using the
bypass command on all these proc ids, then manually get the table/region states
consistent again using the setRegionState/setTableState/assigns/unassigns methods.
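
As a rough sketch of that sequence, assuming a release that HBCK2 does support
(which 2.1.0 is not), the small Python helper below only builds the command
lines and prints them; it runs nothing against a cluster. The jar path, the
table name "my_table" and the DISABLED/CLOSED target states are placeholders
picked for illustration, not values taken from this cluster. The PIDs and the
region name are the ones mentioned in this thread.

```python
import shlex

# Hypothetical jar location; point this at wherever hbase-hbck2 was built.
HBCK2_JAR = "/path/to/hbase-hbck2.jar"


def hbck2(*args):
    """Return one HBCK2 command line as a string (nothing is executed)."""
    parts = ("hbase", "hbck", "-j", HBCK2_JAR) + tuple(args)
    return " ".join(shlex.quote(p) for p in parts)


def recovery_plan(pids, table, region, table_state, region_state):
    """Order the steps as described: bypass first, then fix the states."""
    return [
        # -o (override) and -r (recursive) are options of HBCK2's bypass.
        hbck2("bypass", "-o", "-r", *[str(p) for p in pids]),
        hbck2("setTableState", table, table_state),
        hbck2("setRegionState", region, region_state),
    ]


if __name__ == "__main__":
    # "my_table", DISABLED and CLOSED are assumptions for illustration only.
    plan = recovery_plan(
        [73587, 73827],
        "my_table",
        "f25fe93e24b34cb2f7fffddee1d89eec",
        "DISABLED",
        "CLOSED",
    )
    for line in plan:
        print(line)
```

Which target states are right depends on what the table and region should end
up as, so review the printed commands before running any of them by hand.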

Without tooling, the only option I can think of is to stop the cluster, clean
out the MasterProcWALs, restart the cluster, then use hbase shell to
enable/disable/assign regions. You may also need to manually update
table/region states in the meta table. Of course, you can automate these manual
steps into your own tooling, but it may be a better long-term strategy to
upgrade to a more stable version that also benefits from more tooling supported
by the community.

On Mon, Mar 8, 2021 at 07:50, Marc Hoppins <marc.hopp...@eset.sk>
wrote:

> Hi, Wellington,
>
> I was on 'vacation' (no road trip or overseas anything) for a week.
>
> All fails are waiting on the same PID (73587), a DISABLE TABLE procedure.
> The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be 
> the problem.
>
> I am still mystified about the HBCK2-tools. I have attached a previous 
> thread that you commented on at the time.
>
> I did build a tool for our HBASE 2.1.0...or rather, I built it on Ubuntu
> 20.04 with OpenJDK 8 (1.8.0_212), then successfully ran it on Ubuntu
> 16.04 with a slightly different Java (Oracle Java 8, 1.8.0_181).  I
> used it to help fix a similar problem with an offline table and RITs.
> Both HBASE versions are the same.
>
> I attach a 'sheet' with the current procs/locks.
>
> -----Original Message-----
> From: Marc Hoppins <marc.hopp...@eset.sk>
> Sent: Wednesday, March 3, 2021 9:51 AM
> To: user@hbase.apache.org
> Cc: Martin Oravec <martin.ora...@eset.sk>
> Subject: RE: HBASE WALs
>
> Thanks, Wellington,
>
> I have already built a hbck1-tools for 2.1.0 using the method described in
> other topics. All the HBASE and JDK installs here are the same version, so if
> it worked fixing one cluster's HBASE then it should work for other installs.
>
> Fiddling with masterprocWALs will require a complete shutdown of hbase
> operations to prevent incoming reads/writes on other tables and I am
> not sure how disruptive that will be other than "probably a lot".
>
> -----Original Message-----
> From: Wellington Chevreuil <wellington.chevre...@gmail.com>
> Sent: Tuesday, March 2, 2021 10:57 AM
> To: Hbase-User <user@hbase.apache.org>
> Subject: Re: HBASE WALs
>
> Sorry, missed your previous email. I was hoping you were not on a 
> non-stable version, so that you would benefit from hbck2 tool support.
> Unfortunately, 2.1.0 is among the early releases that don't work with 
> this tool (it requires at least 2.0.3, 2.1.1 or 2.2.0).
>
> > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system seems
> > mostly unhappy with one region in particular, and is reporting on that.
> >
> Are the other regions for the table properly closed, and this is the 
> only one stuck? If you do a list_procedures, are you able to identify 
> an 'unassign' procedure still running for this table? Or if you grep 
> master logs for this region, do you see any messages suggesting
> there are still ongoing attempts to bring the region offline? If there's
> apparently no procedure/no ongoing attempts to offline the region, you 
> might try to manually update its state in meta table, then flip 
> masters (assuming you have master HA), so that the new active loads an 
> up to date state from meta table.
>
> Otherwise, if there's still a rogue procedure trying to offline the 
> region, unfortunately, due to the lack of hbck support, you would most 
> likely need a more disruptive intervention similar to what you had 
> described in your first email, but instead of normal wal folder, 
> master proc wals is what you really would need to clean out here, as 
> that is where procedures state is persisted, and you wouldn't want the 
> rogue procedure to be resumed.
>
> On Mon, Mar 1, 2021 at 10:22, Marc Hoppins
> <marc.hopp...@eset.sk>
> wrote:
>
> > If you know of anything that will help I would appreciate it.
> >
> > If you need any log output let me know.
> >
> > Thanks
> >
> >
> > -----Original Message-----
> > From: Wellington Chevreuil <wellington.chevre...@gmail.com>
> > Sent: Thursday, February 25, 2021 4:08 PM
> > To: Hbase-User <user@hbase.apache.org>
> > Subject: Re: HBASE WALs
> >
> > >
> > > Do WAL files contain information for multiple regions per WAL or 
> > > is one WAL associated with one region?
> > >
> > Edits for multiple regions would be present in a single WAL file. That's
> > why, upon an RS crash and WAL processing, there's a WAL split phase.
> >
> > > I am trying to find a way to clear a RIT for a disabled table. A
> > > similar problem (but on a test cluster) involved me clearing znode
> > > info, deleting HDFS data for the table and deleting
> > > WALs/MasterProcWAL files, finally restarting HBASE service.
> > >
> > Which hbase version are you on?
> >
> > On Thu, Feb 25, 2021 at 11:51, Marc Hoppins
> > <marc.hopp...@eset.sk>
> > wrote:
> >
> > > Hi all,
> > >
> > > Do WAL files contain information for multiple regions per WAL or 
> > > is one WAL associated with one region?
> > >
> > > I am trying to find a way to clear a RIT for a disabled table. A 
> > > similar problem (but on a test cluster) involved me clearing znode 
> > > info, deleting HDFS data for the table and deleting 
> > > WALs/MasterProcWAL files, finally restarting HBASE service.
> > >
> > > Table cannot be enabled.
> > >
> > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system
> > > seems mostly unhappy with one region in particular, and is
> > > reporting on that.
> > >
> > > There are many tables that are very active so I don't think it is
> > > possible to stop the entire service without a lot of forewarning
> > > to users.
> > >
> > > Thanks in advance.
> > >
> >
>
