On Fri, Mar 12, 2021 at 2:17 AM Marc Hoppins <marc.hopp...@eset.sk> wrote:
> Hi, all,
>
> For our stuck region, this exists in meta. Could we alter the state to
> CLOSED (maybe via intermediate OPEN, CLOSING, CLOSED)?
>

You could but IIRC, in that version of HBase, you may need to restart the
Master after the change (changing hbase:meta does not update the Master's
in-memory state). On restart, Master will read hbase:meta to discover
Region state.

S

> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>   column=info:regioninfo, timestamp=1613580024017, value={ENCODED =>
>   f25fe93e24b34cb2f7fffddee1d89eec, NAME =>
>   'hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.',
>   STARTKEY => 'BDFFEEF', ENDKEY => 'BEAA821D2'}
> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>   column=info:seqnumDuringOpen, timestamp=1611787189839,
>   value=\x00\x00\x00\x00\x00\x00\x04\x8F
> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>   column=info:server, timestamp=1611787189839,
>   value=dr1-hbase18.jumbo.hq.eset.com:16020
> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>   column=info:serverstartcode, timestamp=1611787189839, value=1611785264032
> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>   column=info:sn, timestamp=1613580024017,
>   value=ba-hbase25.jumbo.hq.eset.com,16020,1604475904456
> hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec.
>   column=info:state, timestamp=1613580024017, value=OPENING
>
> -----Original Message-----
> From: Wellington Chevreuil <wellington.chevre...@gmail.com>
> Sent: Wednesday, March 10, 2021 10:56 AM
> To: Hbase-User <user@hbase.apache.org>
> Subject: Re: HBASE WALs
>
> EXTERNAL
>
> >
> > Sorry if I seem stupid but this is still all new to me.
> >
> Forgot to mention, there's no stupid questions here. Don't be shy and
> keep'em coming.
>
> On Wed, Mar 10, 2021 at 09:48, Wellington Chevreuil
> <wellington.chevre...@gmail.com> wrote:
>
> >> However, how would that help anyway?
> >> If we cannot fix this at this time then any upgrade would have
> >> inconsistencies also, yes?
> >>
> > The upgrade on its own wouldn't fix existing inconsistencies, but you
> > would now have support for additional tooling (hbase-operators-tool)
> > to help you with this.
> >
> >> As all the 'SUCCESS' procedures have a parent ID 73587, does this mean
> >> that they were successfully and fully moved from hbase25 to each
> >> server mentioned in that procedure? Or does it just mean that the
> >> region was successfully unassigned from hbase25 but the data still
> >> resides on hbase25? I see locality 0.
> >>
> > IIRC, those were all UnassignProcedures, so it means the unassignment
> > of the related region has completed and the region for that particular
> > procedure went offline.
> >
> >> If we change the table state in meta to 'ENABLED', could this
> >> kickstart all these things or will it just lead to further problems?
> >>
> > Masters work with their own in-memory cache of meta, so manually
> > updating meta will just make the masters' cache inconsistent with
> > meta. You would need to restart masters to get the cache reloaded
> > from meta. The main problem is that you still have the rogue
> > procedures, which you can't get rid of without stopping the cluster.
> > One alternative to a full cluster outage would be to identify all
> > RSes running the rogue procs (you can find that from active master
> > logs), then stop only those and master, clean masterprocwals, then
> > start them again.
> >
> >> I suppose it means I am asking, the 73587 DisableTableProcedure, does
> >> it mean that the table is waiting to be disabled? HBASE master
> >> declares that table is NOT enabled.
> >>
> > The table state may have been already updated to disabled, most of its
> > regions may already be offline, but the 73587 DisableTableProcedure
> > cannot be considered "done" until all its sub procedures are indeed
> > completed.
> >
> > On Tue, Mar 9, 2021 at 13:40, Marc Hoppins <marc.hopp...@eset.sk>
> > wrote:
> >
> >> Thanks for that.
> >>
> >> Alas, we are (currently) constrained by using Cloudera (CDH) 6.3.1
> >> and do not have a viable business use to pay the extortionate amount
> >> of money required to upgrade, which would give these clusters access
> >> to newer versions.
> >>
> >> However, how would that help anyway? If we cannot fix this at this
> >> time then any upgrade would have inconsistencies also, yes?
> >>
> >> As all the 'SUCCESS' procedures have a parent ID 73587, does this
> >> mean that they were successfully and fully moved from hbase25 to each
> >> server mentioned in that procedure? Or does it just mean that the
> >> region was successfully unassigned from hbase25 but the data still
> >> resides on hbase25? I see locality 0.
> >>
> >> If we change the table state in meta to 'ENABLED', could this
> >> kickstart all these things or will it just lead to further problems?
> >> I suppose it means I am asking, the 73587 DisableTableProcedure, does
> >> it mean that the table is waiting to be disabled? HBASE master
> >> declares that table is NOT enabled.
> >>
> >> Sorry if I seem stupid but this is still all new to me.
> >>
> >> I appreciate the help.
> >>
> >> -----Original Message-----
> >> From: Wellington Chevreuil <wellington.chevre...@gmail.com>
> >> Sent: Tuesday, March 9, 2021 1:20 PM
> >> To: Hbase-User <user@hbase.apache.org>
> >> Subject: Re: HBASE WALs
> >>
> >> EXTERNAL
> >>
> >> >
> >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> >> > procedure.
> >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be
> >> > the problem.
> >> >
> >> Per your list procedures output attached, it seems the procs states
> >> are all inconsistent. There's a WAIT_TIMEOUT subproc of 73587 with
> >> PID 73827, which is the UnassignProcedure for this region.
> >> Problem is that there are already 5 APs for the same region, which
> >> may be causing some deadlocks. If this cluster was on an
> >> hbck2-supported version, you could get rid of this state using the
> >> bypass command on all these proc ids, then manually get the
> >> table/regions states consistent again using
> >> setRegionState/setTableState/assigns/unassigns methods.
> >>
> >> Without tooling, the only option I can think of is to stop cluster,
> >> clean out masterprocwals, restart cluster, then use hbase shell to
> >> enable/disable/assign regions. You may also need to manually update
> >> table/region states in meta table. Of course, you can automate these
> >> manual steps into your own tooling, but it may be a better strategy
> >> in the long term to upgrade to a more stable version that also
> >> benefits from more tooling supported by the community.
> >>
> >> On Mon, Mar 8, 2021 at 07:50, Marc Hoppins <marc.hopp...@eset.sk>
> >> wrote:
> >>
> >> > Hi, Wellington,
> >> >
> >> > I was on 'vacation' (no road trip or overseas anything) for a week.
> >> >
> >> > All fails are waiting on the same PID (73587), a DISABLE TABLE
> >> > procedure.
> >> > The offending region (f25fe93e24b34cb2f7fffddee1d89eec) seems to be
> >> > the problem.
> >> >
> >> > I am still mystified about the HBCK2-tools. I have attached a
> >> > previous thread that you commented on at the time.
> >> >
> >> > I did build the tools for our HBASE 2.1.0...or rather, I built it
> >> > on Ubuntu 20.04 with openJDK8 (1.8.0_212), then successfully ran it
> >> > on Ubuntu 16.04 with a slightly different java (Oracle Java 8,
> >> > 1.8.0_181). I used it to help fix a similar problem with an offline
> >> > table and RITs. Both HBASE versions are the same.
> >> >
> >> > I attach a 'sheet' with the current procs/locks.
> >> >
> >> > -----Original Message-----
> >> > From: Marc Hoppins <marc.hopp...@eset.sk>
> >> > Sent: Wednesday, March 3, 2021 9:51 AM
> >> > To: user@hbase.apache.org
> >> > Cc: Martin Oravec <martin.ora...@eset.sk>
> >> > Subject: RE: HBASE WALs
> >> >
> >> > EXTERNAL
> >> >
> >> > Thanks, Wellington,
> >> >
> >> > I have already built a hbck1-tools for 2.1.0 using the method
> >> > described in other topics. All the HBASE and JDK here is the same
> >> > version so if it worked fixing one cluster HBASE then it should
> >> > work for other installs.
> >> >
> >> > Fiddling with masterprocWALs will require complete shutdown of
> >> > hbase operations to prevent incoming reads/writes on other tables
> >> > and I am not sure how disruptive that will be other than "probably
> >> > a lot".
> >> >
> >> > -----Original Message-----
> >> > From: Wellington Chevreuil <wellington.chevre...@gmail.com>
> >> > Sent: Tuesday, March 2, 2021 10:57 AM
> >> > To: Hbase-User <user@hbase.apache.org>
> >> > Subject: Re: HBASE WALs
> >> >
> >> > EXTERNAL
> >> >
> >> > Sorry, missed your previous email. I was hoping you were not on a
> >> > non-stable version, so that you would benefit from hbck2 tool
> >> > support. Unfortunately, 2.1.0 is among the early releases that
> >> > don't work with this tool (it requires at least 2.0.3, 2.1.1 or
> >> > 2.2.0).
> >> >
> >> > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system
> >> > > seems mostly unhappy with one region in particular, and is
> >> > > reporting on that.
> >> > >
> >> > Are the other regions for the table properly closed, and this is
> >> > the only one stuck? If you do a list_procedures, are you able to
> >> > identify an 'unassign' procedure still running for this table? Or
> >> > if you grep master logs for this region, do you see any messages
> >> > suggesting there's still ongoing attempts to bring the region
> >> > offline?
> >> > If there's apparently no procedure/no ongoing attempts to
> >> > offline the region, you might try to manually update its state in
> >> > meta table, then flip masters (assuming you have master HA), so
> >> > that the new active loads an up to date state from meta table.
> >> >
> >> > Otherwise, if there's still a rogue procedure trying to offline the
> >> > region, unfortunately, due to the lack of hbck support, you would
> >> > most likely need a more disruptive intervention similar to what you
> >> > had described in your first email, but instead of the normal wal
> >> > folder, master proc wals is what you really would need to clean out
> >> > here, as that is where procedures state is persisted, and you
> >> > wouldn't want the rogue procedure to be resumed.
> >> >
> >> > On Mon, Mar 1, 2021 at 10:22, Marc Hoppins <marc.hopp...@eset.sk>
> >> > wrote:
> >> >
> >> > > If you know of anything that will help I would appreciate it.
> >> > >
> >> > > If you need any log output let me know.
> >> > >
> >> > > Thanks
> >> > >
> >> > > -----Original Message-----
> >> > > From: Wellington Chevreuil <wellington.chevre...@gmail.com>
> >> > > Sent: Thursday, February 25, 2021 4:08 PM
> >> > > To: Hbase-User <user@hbase.apache.org>
> >> > > Subject: Re: HBASE WALs
> >> > >
> >> > > EXTERNAL
> >> > >
> >> > > >
> >> > > > Do WAL files contain information for multiple regions per WAL
> >> > > > or is one WAL associated with one region?
> >> > > >
> >> > > Multiple regions' edits would be present in a single wal file.
> >> > > That's why upon a RS crash and wal processing, there's a wal
> >> > > split phase.
> >> > >
> >> > > > I am trying to find a way to clear a RIT for a disabled table.
> >> > > > A similar problem (but on a test cluster) involved me clearing
> >> > > > znode info, deleting HDFS data for the table and deleting
> >> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
> >> > > >
> >> > > Which hbase version are you on?
> >> > >
> >> > > On Thu, Feb 25, 2021 at 11:51, Marc Hoppins
> >> > > <marc.hopp...@eset.sk> wrote:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > Do WAL files contain information for multiple regions per WAL
> >> > > > or is one WAL associated with one region?
> >> > > >
> >> > > > I am trying to find a way to clear a RIT for a disabled table.
> >> > > > A similar problem (but on a test cluster) involved me clearing
> >> > > > znode info, deleting HDFS data for the table and deleting
> >> > > > WALs/MasterProcWAL files, finally restarting HBASE service.
> >> > > >
> >> > > > Table cannot be enabled.
> >> > > >
> >> > > > Multiple locks exist for DISABLE/ENABLE/UNASSIGN but the system
> >> > > > seems mostly unhappy with one region in particular, and is
> >> > > > reporting on that.
> >> > > >
> >> > > > There are many tables that are very active so I don't think it
> >> > > > is possible to stop the entire service without a lot of
> >> > > > forewarning to users.
> >> > > >
> >> > > > Thanks in advance.
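
For anyone landing on this thread later: the manual meta edit discussed at the top (changing info:state for the stuck region) can be sketched as below. This is only an illustration, not a tested recovery procedure. The helper name meta_state_put and the THREAD_STATES tuple are made up for this example; the state list is limited to the values named in the thread and is not the full set HBase knows. The assumption is that the generated line would be pasted into hbase shell, and, as noted above, on this HBase version the Master would likely still need a restart (or failover) afterwards to pick up the change.

```python
# Only the states mentioned in this thread; HBase defines more than these.
THREAD_STATES = ("OPEN", "OPENING", "CLOSING", "CLOSED")


def meta_state_put(region_row: str, state: str) -> str:
    """Build an hbase shell `put` line that sets info:state for one hbase:meta row.

    Hypothetical helper for illustration; it only formats text and never
    talks to a cluster.
    """
    if state not in THREAD_STATES:
        raise ValueError(f"unexpected state: {state!r}")
    if "'" in region_row:
        # Avoid producing a shell line with broken quoting.
        raise ValueError("region row contains a single quote; quote it by hand")
    return f"put 'hbase:meta', '{region_row}', 'info:state', '{state}'"


# Row key taken from the scan output quoted at the top of the thread.
row = "hds2_md5,BDFFEEF,1535957697205.f25fe93e24b34cb2f7fffddee1d89eec."
print(meta_state_put(row, "CLOSED"))
```

Generating the line rather than typing it keeps the long row key (table, start key, timestamp, encoded name, trailing dot) from being mistyped, which is the easy mistake when editing hbase:meta by hand.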