Ah, thanks Yulin Niu for the pointer. HBASE-25053 should be the problem.

Yulin Niu <[email protected]> wrote on Sun, Dec 19, 2021 at 10:41:

> https://issues.apache.org/jira/browse/HBASE-25053
> It seems to be the bug described in this issue. You can try
> cherry-picking this patch, Claude M.
>
> Viraj Jasani <[email protected]> wrote on Sun, Dec 19, 2021 at 02:17:
>
> > > Your fix is a bit dangerous since you may lose some ongoing
> > > procedures, but if you did not experience any inconsistency on your
> > > cluster, for example, some regions are not online, then it is OK.
> >
> > Duo, out of curiosity, even if some regions are offline and/or some
> > servers go offline, wouldn't master failover re-trigger SCPs and TRSPs
> > to bring all regions ONLINE?
> > I have played around with removal of MasterProcWAL on hbase1 only (WAL
> > proc store) and have seen new SCPs getting triggered, i.e. AM does
> > bring all regions ONLINE eventually.
> >
> > On Thu, Dec 16, 2021 at 9:57 PM 张铎(Duo Zhang) <[email protected]>
> > wrote:
> >
> > > I guess this should be a bug. For the master local region we do not
> > > handle broken WAL files which do not even have a valid header.
> > >
> > > Will take a look at the code tomorrow to confirm whether this is the
> > > case.
> > >
> > > Your fix is a bit dangerous since you may lose some ongoing
> > > procedures, but if you did not experience any inconsistency on your
> > > cluster, for example, some regions are not online, then it is OK.
> > >
> > > Thanks for reporting.
> > >
> > > Claude M <[email protected]> wrote on Thu, Dec 16, 2021 at 03:37:
> > >
> > > > Hello,
> > > >
> > > > I have the following installed:
> > > >
> > > > - Hadoop 3.2.2
> > > > - HBase 2.3.5
> > > >
> > > > When all the datanodes in Hadoop are stopped but the HBase cluster
> > > > is still running, the HBase master crashes w/ the attached
> > > > exception and is not recoverable.
> > > >
> > > > If I delete the contents under the following directories in hdfs,
> > > > the master will then recover:
> > > >
> > > > - /hbase/MasterData/WALs/
> > > > - /hbase/MasterData/data/master/store/*/recovered.wals/
> > > >
> > > > Is this an appropriate way to resolve the issue? If not, what
> > > > should be done?
> > > >
> > > > Thanks
