[ https://issues.apache.org/jira/browse/HBASE-27502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Jasani resolved HBASE-27502. ---------------------------------- Assignee: Viraj Jasani Resolution: Workaround It turns out, this issue happened because snapshot was created by a MapReduce job that was still using hbase 1.x binaries in the classpath, and since we don't have HBASE-21977 backported to branch-1, we had this issue. HBASE-21977 prevents creating new seqid for region open/close that happens as part of creating snapshot scanners, but this fix is only available for hbase 2.x clusters. [~shahrs87] [~syuanjiang] > Regionservers aborted as mvcc read point is less than max seq id derived from > .seqid files > ------------------------------------------------------------------------------------------ > > Key: HBASE-27502 > URL: https://issues.apache.org/jira/browse/HBASE-27502 > Project: HBase > Issue Type: Bug > Components: wal > Affects Versions: 2.4.15 > Reporter: Viraj Jasani > Assignee: Viraj Jasani > Priority: Major > > HBase cluster is upgraded from 1.6 to 2.4.14/15 version recently. The cluster > doesn't have much traffic. After 4-5 days of this upgrade, suddenly 144 out > of ~150 regionservers were aborted with {*}java.io.IOException: The new max > sequence id {} is less than the old max sequence id {}{*}. > After starting regionservers, things were normal. > > Sequence of events for the first regionserver that was aborted: > Table snapshot was created, hence region went through snapshot subprocedure. > {code:java} > 2022-11-16 01:03:11,504 DEBUG [532)-snapshot-pool-0] > snapshot.SnapshotManifest - Storing > 'TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\x000000000000000008\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.' > region-info for snapshot=SNAPSHOT_TABLE01_1668560404026_1668560447300_0 > 2022-11-16 01:03:11,504 DEBUG [532)-snapshot-pool-0] > snapshot.FlushSnapshotSubprocedure - Starting snapshot operation on > TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\x000000000000000008\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b. > 2022-11-16 01:03:11,504 DEBUG [532)-snapshot-pool-0] > snapshot.SnapshotManifest - Adding snapshot references for > [hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/0/ea13e8a1f56843efb1243d5ba108e63a] > hfiles > 2022-11-16 01:03:11,504 DEBUG [532)-snapshot-pool-0] > snapshot.SnapshotManifest - Adding reference for file (1/1): > hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/0/ea13e8a1f56843efb1243d5ba108e63a > for snapshot=SNAPSHOT_TABLE01_1668560404026_1668560447300_0 > 2022-11-16 01:03:11,562 DEBUG [532)-snapshot-pool-0] > snapshot.FlushSnapshotSubprocedure - Closing snapshot operation on > TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\x000000000000000008\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b. > 2022-11-16 01:03:11,562 DEBUG [532)-snapshot-pool-0] > snapshot.FlushSnapshotSubprocedure - ... SkipFlush Snapshotting region > TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\x000000000000000008\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b. > completed. > {code} > > After 6+ hr, major compaction of the table was triggered. > Logs from RS c01-dabc11-12-xyz.abcxyz: > {code:java} > 2022-11-16 07:36:34,978 INFO [0-shortCompactions-0] regionserver.HStore - > Starting compaction of > [hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/0/ea13e8a1f56843efb1243d5ba108e63a] > into > tmpdir=hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/.tmp, > totalSize=939.0 M > 2022-11-16 07:36:34,978 INFO [0-shortCompactions-0] regionserver.HRegion - > Starting compaction of d773829ad7e76202cccac6fbc314091b/0 in > TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\x000000000000000008\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b. > {code} > > > Region split is triggered by CompactSplit. > Logs from RS c01-dabc11-12-xyz.abcxyz: > {code:java} > 2022-11-16 07:38:03,570 DEBUG [0-shortCompactions-0] > regionserver.CompactSplit - Splitting > TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\x000000000000000008\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b., > compactionQueue=(longCompactions=0:shortCompactions=0), splitQueue=0 > 2022-11-16 07:38:03,848 INFO [abc11-12-xyz:61020-0] regionserver.HRegion - > Closing region > TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\x000000000000000008\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b. > 2022-11-16 07:38:03,860 DEBUG [2cccac6fbc314091b.-1] backup.HFileArchiver - > Archived from FileableStoreFile, > hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/0/ea13e8a1f56843efb1243d5ba108e63a > to > hdfs://c01/hbase/archive/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/0/ea13e8a1f56843efb1243d5ba108e63a > 2022-11-16 07:38:03,881 DEBUG [abc11-12-xyz:61020-0] regionserver.HRegion - > Region close journal for d773829ad7e76202cccac6fbc314091b: > Waiting for close lock at 1668584283848Running coprocessor pre-close hooks at > 1668584283848Disabling compacts and flushes for region at > 1668584283848Disabling writes for close at 1668584283848Writing region close > event to WAL at 1668584283876 (+28 ms) > 2022-11-16 07:38:03,881 WARN [abc11-12-xyz:61020-0] > handler.UnassignRegionHandler - Fatal error occurred while closing region > d773829ad7e76202cccac6fbc314091b, aborting... > java.io.IOException: The new max sequence id 1963762 is less than the old max > sequence id 1963764 > at > org.apache.hadoop.hbase.wal.WALSplitUtil.writeRegionSequenceIdFile(WALSplitUtil.java:397) > at > org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1217) > at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1816) > at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1552) > at > org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:118) > at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:98) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > {code} > Leading to RS abort: > {code:java} > 2022-11-16 07:38:03,889 ERROR [abc11-12-xyz:61020-0] > regionserver.HRegionServer - ***** ABORTING region server > c01-dabc11-12-xyz.abcxyz,61020,1668064189532: Failed to close region > d773829ad7e76202cccac6fbc314091b and can not recover ***** > java.io.IOException: The new max sequence id 1963762 is less than the old max > sequence id 1963764 > at > org.apache.hadoop.hbase.wal.WALSplitUtil.writeRegionSequenceIdFile(WALSplitUtil.java:397) > at > org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1217) > at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1816) > at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1552) > at > org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:118) > at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:98) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > {code} > > > Master tries to open the region on other RS. > master logs: > {code:java} > 2022-11-16 07:38:08,840 INFO [PEWorker-60] assignment.RegionStateStore - > pid=58195 updating hbase:meta row=d773829ad7e76202cccac6fbc314091b, > regionState=ABNORMALLY_CLOSED > 2022-11-16 07:38:08,883 INFO [PEWorker-32] assignment.RegionStateStore - > pid=58195 updating hbase:meta row=d773829ad7e76202cccac6fbc314091b, > regionState=OPENING, > regionLocation=c01-dabc6-10-xyz.abcxyz,61020,1668058690832 > 2022-11-16 07:38:08,883 DEBUG [PEWorker-32] hbase.META - Put > {"totalColumns":3,"row":"TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\\x000000000000000008\\xFD\\xCC\\xDC\\x9C\\xB7\\xA4\\xFF\\xCF\\xBB\\xCA\\xB5\\xCF\\xCF\\xCF\\xCF\\xCF\\xCF\\x8A\\xAE\\xB3\\xA7\\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.","families":{"info":[{"qualifier":"regioninfo","vlen":245,"tag":[],"timestamp":"1668584288883"},{"qualifier":"sn","vlen":53,"tag":[],"timestamp":"1668584288883"},{"qualifier":"state","vlen":7,"tag":[],"timestamp":"1668584288883"}]},"ts":"1668584288883"} > 2022-11-16 07:38:08,888 INFO [PEWorker-32] procedure2.ProcedureExecutor - > Initialized subprocedures=[{pid=58794, ppid=58195, state=RUNNABLE; > OpenRegionProcedure d773829ad7e76202cccac6fbc314091b, > server=c01-dabc6-10-xyz.abcxyz,61020,1668058690832}] > 2022-11-16 07:38:09,398 INFO [PEWorker-55] assignment.RegionStateStore - > pid=58195 updating hbase:meta row=d773829ad7e76202cccac6fbc314091b, > regionState=OPEN, repBarrier=1963765, openSeqNum=1963765, > regionLocation=c01-dabc6-10-xyz.abcxyz,61020,1668058690832 > {code} > > > > On the new RS, region is opened, recovered, closed (as part of compact > split), compacted and successfully archived. > Logs from new RS c01-dabc6-10-xyz.abcxyz: > {code:java} > 2022-11-16 07:38:09,168 INFO [dabc6-10-xyz:61020-0] > handler.AssignRegionHandler - Open > TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\x000000000000000008\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b. > 2022-11-16 07:38:09,235 DEBUG [dabc6-10-xyz:61020-0] regionserver.HRegion - > Instantiated > TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\x000000000000000008\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.; > StoreHotnessProtector, parallelPutToStoreThreadLimit=10 ; minColumnNum=100 ; > preparePutThreadLimit=20 ; hotProtect now enable > 2022-11-16 07:38:09,318 DEBUG [dabc6-10-xyz:61020-0] regionserver.HRegion - > Found 0 recovered edits file(s) under > hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b > 2022-11-16 07:38:09,330 DEBUG [dabc6-10-xyz:61020-0] regionserver.HRegion - > Found 1 recovered edits file(s) under > hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b > 2022-11-16 07:38:09,334 INFO [dabc6-10-xyz:61020-0] regionserver.HRegion - > Replaying edits from > hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/recovered.edits/0000000000001963761 > 2022-11-16 07:38:09,365 DEBUG [dabc6-10-xyz:61020-0] regionserver.HRegion - > Applied 0, skipped 1, firstSequenceIdInLog=1963761, > maxSequenceIdInLog=1963761, > path=hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/recovered.edits/0000000000001963761 > 2022-11-16 07:38:09,365 DEBUG [dabc6-10-xyz:61020-0] regionserver.HRegion - > d773829ad7e76202cccac6fbc314091b : Replaying compaction marker table_name: > "TABLE01" encoded_region_name: "d773829ad7e76202cccac6fbc314091b" > family_name: "0" compaction_input: "ea13e8a1f56843efb1243d5ba108e63a" > compaction_output: "e629d75027ed4da79fd962e645c7a3b4" store_home_dir: "0" > region_name: > "TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\0000000000000000008\375\314\334\234\267\244\377\317\273\312\265\317\317\317\317\317\317\212\256\263\247\235,1658067025381.d773829ad7e76202cccac6fbc314091b." > with seqId=9223372036854775807 and lastReplayedOpenRegionSeqId=-1 > 2022-11-16 07:38:09,367 DEBUG [dabc6-10-xyz:61020-0] regionserver.HRegion - > Deleted recovered.edits > file=hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/recovered.edits/0000000000001963761 > 2022-11-16 07:38:09,372 DEBUG [dabc6-10-xyz:61020-0] regionserver.HRegion - > writing seq id for d773829ad7e76202cccac6fbc314091b > 2022-11-16 07:38:09,373 INFO [dabc6-10-xyz:61020-0] regionserver.HRegion - > Opened d773829ad7e76202cccac6fbc314091b; next sequenceid=1963765; > SteppingSplitPolicysuper{IncreasingToUpperBoundRegionSplitPolicy{initialSize=536870912, > ConstantSizeRegionSplitPolicy{desiredMaxFileSize=9917293600, > jitterRate=-0.07638005912303925}}}, > FlushLargeStoresPolicy{flushSizeLowerBound=-1} > 2022-11-16 07:38:09,581 INFO [dabc6-10-xyz:61020-2] regionserver.HRegion - > Closing region > TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\x000000000000000008\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b. > 2022-11-16 07:38:09,597 DEBUG [dabc6-10-xyz:61020-2] wal.WALSplitUtil - Wrote > file=hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/recovered.edits/1963767.seqid, > newMaxSeqId=1963767, maxSeqId=1963764 > 2022-11-16 07:38:09,601 INFO [dabc6-10-xyz:61020-2] > handler.UnassignRegionHandler - Closed d773829ad7e76202cccac6fbc314091b > 2022-11-16 07:38:10,340 INFO [0-shortCompactions-0] regionserver.HStore - > Starting compaction of > [hdfs://c01/hbase/data/default/TABLE01/2eaab01e9bd001727475abf7a4f73053/0/e629d75027ed4da79fd962e645c7a3b4.d773829ad7e76202cccac6fbc314091b->hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/0/e629d75027ed4da79fd962e645c7a3b4-top] > into > tmpdir=hdfs://c01/hbase/data/default/TABLE01/2eaab01e9bd001727475abf7a4f73053/.tmp, > totalSize=939.0 M > 2022-11-16 07:38:10,340 DEBUG [0-shortCompactions-0] compactions.Compactor - > Compacting e629d75027ed4da79fd962e645c7a3b4.d773829ad7e76202cccac6fbc314091b, > keycount=18319191, bloomtype=NONE, size=939.0 M, encoding=FAST_DIFF, > compression=NONE, seqNum=1963665, earliestPutTs=1637575074953 > 2022-11-16 07:38:10,349 DEBUG [0-shortCompactions-1] compactions.Compactor - > Compacting e629d75027ed4da79fd962e645c7a3b4.d773829ad7e76202cccac6fbc314091b, > keycount=18319191, bloomtype=NONE, size=939.0 M, encoding=FAST_DIFF, > compression=NONE, seqNum=1963664, earliestPutTs=1637575074953 > 2022-11-16 07:40:15,455 DEBUG [dabc6-10-xyz:61020-8] backup.HFileArchiver - > Archived from FileableStoreFile, > hdfs://c01/hbase/data/default/TABLE01/d188c0c23b5826973511da243e8c892a/0/e629d75027ed4da79fd962e645c7a3b4.d773829ad7e76202cccac6fbc314091b > to > hdfs://c01/hbase/archive/data/default/TABLE01/d188c0c23b5826973511da243e8c892a/0/e629d75027ed4da79fd962e645c7a3b4.d773829ad7e76202cccac6fbc314091b > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)