Viraj Jasani created HBASE-27502:
------------------------------------

             Summary: Regionservers aborted as new seq id is less than max seq 
id derived from .seqid files
                 Key: HBASE-27502
                 URL: https://issues.apache.org/jira/browse/HBASE-27502
             Project: HBase
          Issue Type: Bug
    Affects Versions: 2.4.15
            Reporter: Viraj Jasani


HBase cluster is upgraded from 1.6 to 2.4.14/15 version recently. The cluster 
doesn't have much traffic. After 4-5 days of this upgrade, suddenly 144 out of 
~150 regionservers were aborted with {*}java.io.IOException: The new max 
sequence id {} is less than the old max sequence id {}{*}.

After starting regionservers, things were normal.

 

Sequence of events for the first regionserver that was aborted (all servers 
aborted due to same reason):

Major compaction of one of the tables (TABLE01) was triggered.

Logs from RS c01-dabc11-12-xyz.abcxyz:

 
{code:java}
2022-11-16 07:36:34,978 INFO  [0-shortCompactions-0] regionserver.HStore - 
Starting compaction of 
[hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/0/ea13e8a1f56843efb1243d5ba108e63a]
 into 
tmpdir=hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/.tmp,
 totalSize=939.0 M

2022-11-16 07:36:34,978 INFO  [0-shortCompactions-0] regionserver.HRegion - 
Starting compaction of d773829ad7e76202cccac6fbc314091b/0 in 
TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\x000000000000000008\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.
{code}
 

 

Region split is triggered by CompactSplit.

Logs from RS c01-dabc11-12-xyz.abcxyz:

 
{code:java}
2022-11-16 07:38:03,570 DEBUG [0-shortCompactions-0] regionserver.CompactSplit 
- Splitting 
TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\x000000000000000008\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.,
 compactionQueue=(longCompactions=0:shortCompactions=0), splitQueue=0

2022-11-16 07:38:03,848 INFO  [abc11-12-xyz:61020-0] regionserver.HRegion - 
Closing region 
TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\x000000000000000008\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.

2022-11-16 07:38:03,860 DEBUG [2cccac6fbc314091b.-1] backup.HFileArchiver - 
Archived from FileableStoreFile, 
hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/0/ea13e8a1f56843efb1243d5ba108e63a
 to 
hdfs://c01/hbase/archive/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/0/ea13e8a1f56843efb1243d5ba108e63a

2022-11-16 07:38:03,881 DEBUG [abc11-12-xyz:61020-0] regionserver.HRegion - 
Region close journal for d773829ad7e76202cccac6fbc314091b:
Waiting for close lock at 1668584283848Running coprocessor pre-close hooks at 
1668584283848Disabling compacts and flushes for region at 
1668584283848Disabling writes for close at 1668584283848Writing region close 
event to WAL at 1668584283876 (+28 ms)

2022-11-16 07:38:03,881 WARN  [abc11-12-xyz:61020-0] 
handler.UnassignRegionHandler - Fatal error occurred while closing region 
d773829ad7e76202cccac6fbc314091b, aborting...
java.io.IOException: The new max sequence id 1963762 is less than the old max 
sequence id 1963764
    at 
org.apache.hadoop.hbase.wal.WALSplitUtil.writeRegionSequenceIdFile(WALSplitUtil.java:397)
    at 
org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1217)
    at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1816)
    at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1552)
    at 
org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:118)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:98)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{code}
Leading to RS abort:

 
{code:java}
2022-11-16 07:38:03,889 ERROR [abc11-12-xyz:61020-0] regionserver.HRegionServer 
- ***** ABORTING region server c01-dabc11-12-xyz.abcxyz,61020,1668064189532: 
Failed to close region d773829ad7e76202cccac6fbc314091b and can not recover 
*****
java.io.IOException: The new max sequence id 1963762 is less than the old max 
sequence id 1963764
    at 
org.apache.hadoop.hbase.wal.WALSplitUtil.writeRegionSequenceIdFile(WALSplitUtil.java:397)
    at 
org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1217)
    at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1816)
    at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1552)
    at 
org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:118)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:98)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
 {code}
 

 

Master tries to open the region on other RS.

master logs:

 
{code:java}
2022-11-16 07:38:08,840 INFO  [PEWorker-60] assignment.RegionStateStore - 
pid=58195 updating hbase:meta row=d773829ad7e76202cccac6fbc314091b, 
regionState=ABNORMALLY_CLOSED

2022-11-16 07:38:08,883 INFO  [PEWorker-32] assignment.RegionStateStore - 
pid=58195 updating hbase:meta row=d773829ad7e76202cccac6fbc314091b, 
regionState=OPENING, regionLocation=c01-dabc6-10-xyz.abcxyz,61020,1668058690832

2022-11-16 07:38:08,883 DEBUG [PEWorker-32] hbase.META - Put 
{"totalColumns":3,"row":"TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\\x000000000000000008\\xFD\\xCC\\xDC\\x9C\\xB7\\xA4\\xFF\\xCF\\xBB\\xCA\\xB5\\xCF\\xCF\\xCF\\xCF\\xCF\\xCF\\x8A\\xAE\\xB3\\xA7\\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.","families":{"info":[{"qualifier":"regioninfo","vlen":245,"tag":[],"timestamp":"1668584288883"},{"qualifier":"sn","vlen":53,"tag":[],"timestamp":"1668584288883"},{"qualifier":"state","vlen":7,"tag":[],"timestamp":"1668584288883"}]},"ts":"1668584288883"}

2022-11-16 07:38:08,888 INFO  [PEWorker-32] procedure2.ProcedureExecutor - 
Initialized subprocedures=[{pid=58794, ppid=58195, state=RUNNABLE; 
OpenRegionProcedure d773829ad7e76202cccac6fbc314091b, 
server=c01-dabc6-10-xyz.abcxyz,61020,1668058690832}]

2022-11-16 07:38:09,398 INFO  [PEWorker-55] assignment.RegionStateStore - 
pid=58195 updating hbase:meta row=d773829ad7e76202cccac6fbc314091b, 
regionState=OPEN, repBarrier=1963765, openSeqNum=1963765, 
regionLocation=c01-dabc6-10-xyz.abcxyz,61020,1668058690832
 {code}
 

 

 

On the new RS, region is opened, recovered, closed (as part of compact split), 
compacted and successfully archived.

Logs from new RS c01-dabc6-10-xyz.abcxyz:
{code:java}
2022-11-16 07:38:09,168 INFO  [dabc6-10-xyz:61020-0] 
handler.AssignRegionHandler - Open 
TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\x000000000000000008\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.

2022-11-16 07:38:09,235 DEBUG [dabc6-10-xyz:61020-0] regionserver.HRegion - 
Instantiated 
TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\x000000000000000008\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.;
 StoreHotnessProtector, parallelPutToStoreThreadLimit=10 ; minColumnNum=100 ; 
preparePutThreadLimit=20 ; hotProtect now enable

2022-11-16 07:38:09,318 DEBUG [dabc6-10-xyz:61020-0] regionserver.HRegion - 
Found 0 recovered edits file(s) under 
hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b

2022-11-16 07:38:09,330 DEBUG [dabc6-10-xyz:61020-0] regionserver.HRegion - 
Found 1 recovered edits file(s) under 
hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b

2022-11-16 07:38:09,334 INFO  [dabc6-10-xyz:61020-0] regionserver.HRegion - 
Replaying edits from 
hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/recovered.edits/0000000000001963761

2022-11-16 07:38:09,365 DEBUG [dabc6-10-xyz:61020-0] regionserver.HRegion - 
Applied 0, skipped 1, firstSequenceIdInLog=1963761, maxSequenceIdInLog=1963761, 
path=hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/recovered.edits/0000000000001963761



2022-11-16 07:38:09,365 DEBUG [dabc6-10-xyz:61020-0] regionserver.HRegion - 
d773829ad7e76202cccac6fbc314091b : Replaying compaction marker table_name: 
"TABLE01" encoded_region_name: "d773829ad7e76202cccac6fbc314091b" family_name: 
"0" compaction_input: "ea13e8a1f56843efb1243d5ba108e63a" compaction_output: 
"e629d75027ed4da79fd962e645c7a3b4" store_home_dir: "0" region_name: 
"TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\0000000000000000008\375\314\334\234\267\244\377\317\273\312\265\317\317\317\317\317\317\212\256\263\247\235,1658067025381.d773829ad7e76202cccac6fbc314091b."
 with seqId=9223372036854775807 and lastReplayedOpenRegionSeqId=-1

2022-11-16 07:38:09,367 DEBUG [dabc6-10-xyz:61020-0] regionserver.HRegion - 
Deleted recovered.edits 
file=hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/recovered.edits/0000000000001963761



2022-11-16 07:38:09,372 DEBUG [dabc6-10-xyz:61020-0] regionserver.HRegion - 
writing seq id for d773829ad7e76202cccac6fbc314091b

2022-11-16 07:38:09,373 INFO  [dabc6-10-xyz:61020-0] regionserver.HRegion - 
Opened d773829ad7e76202cccac6fbc314091b; next sequenceid=1963765; 
SteppingSplitPolicysuper{IncreasingToUpperBoundRegionSplitPolicy{initialSize=536870912,
 ConstantSizeRegionSplitPolicy{desiredMaxFileSize=9917293600, 
jitterRate=-0.07638005912303925}}}, 
FlushLargeStoresPolicy{flushSizeLowerBound=-1}



2022-11-16 07:38:09,581 INFO  [dabc6-10-xyz:61020-2] regionserver.HRegion - 
Closing region 
TABLE01,00DJ0000003PwkV000000000000000005J000000681gsU\x000000000000000008\xFD\xCC\xDC\x9C\xB7\xA4\xFF\xCF\xBB\xCA\xB5\xCF\xCF\xCF\xCF\xCF\xCF\x8A\xAE\xB3\xA7\x9D,1658067025381.d773829ad7e76202cccac6fbc314091b.

2022-11-16 07:38:09,597 DEBUG [dabc6-10-xyz:61020-2] wal.WALSplitUtil - Wrote 
file=hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/recovered.edits/1963767.seqid,
 newMaxSeqId=1963767, maxSeqId=1963764



2022-11-16 07:38:09,601 INFO  [dabc6-10-xyz:61020-2] 
handler.UnassignRegionHandler - Closed d773829ad7e76202cccac6fbc314091b

2022-11-16 07:38:10,340 INFO  [0-shortCompactions-0] regionserver.HStore - 
Starting compaction of 
[hdfs://c01/hbase/data/default/TABLE01/2eaab01e9bd001727475abf7a4f73053/0/e629d75027ed4da79fd962e645c7a3b4.d773829ad7e76202cccac6fbc314091b->hdfs://c01/hbase/data/default/TABLE01/d773829ad7e76202cccac6fbc314091b/0/e629d75027ed4da79fd962e645c7a3b4-top]
 into 
tmpdir=hdfs://c01/hbase/data/default/TABLE01/2eaab01e9bd001727475abf7a4f73053/.tmp,
 totalSize=939.0 M



2022-11-16 07:38:10,340 DEBUG [0-shortCompactions-0] compactions.Compactor - 
Compacting e629d75027ed4da79fd962e645c7a3b4.d773829ad7e76202cccac6fbc314091b, 
keycount=18319191, bloomtype=NONE, size=939.0 M, encoding=FAST_DIFF, 
compression=NONE, seqNum=1963665, earliestPutTs=1637575074953

2022-11-16 07:38:10,349 DEBUG [0-shortCompactions-1] compactions.Compactor - 
Compacting e629d75027ed4da79fd962e645c7a3b4.d773829ad7e76202cccac6fbc314091b, 
keycount=18319191, bloomtype=NONE, size=939.0 M, encoding=FAST_DIFF, 
compression=NONE, seqNum=1963664, earliestPutTs=1637575074953



2022-11-16 07:40:15,455 DEBUG [dabc6-10-xyz:61020-8] backup.HFileArchiver - 
Archived from FileableStoreFile, 
hdfs://c01/hbase/data/default/TABLE01/d188c0c23b5826973511da243e8c892a/0/e629d75027ed4da79fd962e645c7a3b4.d773829ad7e76202cccac6fbc314091b
 to 
hdfs://c01/hbase/archive/data/default/TABLE01/d188c0c23b5826973511da243e8c892a/0/e629d75027ed4da79fd962e645c7a3b4.d773829ad7e76202cccac6fbc314091b
 {code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to