[ https://issues.apache.org/jira/browse/HBASE-22862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909770#comment-16909770 ]
Alex Batyrshin edited comment on HBASE-22862 at 8/17/19 5:15 PM: ----------------------------------------------------------------- Error occurs on MemStore flush. Looks like replay from WAL has no error. Latest log dump with crash and WAL replays: {code} 2019-08-16 09:36:40,926 FATAL [MemStoreFlusher.1] regionserver.HRegionServer: ABORTING region server prod004,60020,1565928847135: Replay of WAL required. Forcing server shutdown org.apache.hadoop.hbase.DroppedSnapshotException: region: TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2675) at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2352) at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2314) at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2200) at org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:2125) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:512) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:482) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:76) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:264) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Added a key not lexically larger than previous. Current cell = \x1100000090013098!>ct94a/d:apd/1565936313172/DeleteColumn/vlen=0/seqid=38334251, lastCell = \x1100000090013098!>ct94a/d:apd/1565936313172/Put/vlen=12/seqid=38338239 at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:204) at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:279) at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87) at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1127) at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:139) at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75) at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1003) at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523) at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2622) ... 9 more 2019-08-16 09:36:41,516 INFO [RS_CLOSE_REGION-prod004:60020-1] regionserver.HRegion: Closed TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. <RESTART> 2019-08-16 09:36:54,944 INFO [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=60020] regionserver.RSRpcServices: Open TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. 2019-08-16 09:37:39,164 INFO [RS_OPEN_REGION-prod004:60020-2] regionserver.HRegion: Replaying edits from hdfs://motp02/hbase/default/TBL_TABLE_CODE/ec69674133ba425f63cedfc686b70823/recovered.edits/0000000000038324857 ... 2019-08-16 09:38:24,017 INFO [RS_OPEN_REGION-prod004:60020-2] regionserver.HRegion: Replaying edits from hdfs://motp02/hbase/default/TBL_TABLE_CODE/ec69674133ba425f63cedfc686b70823/recovered.edits/0000000000038340778 2019-08-16 09:38:27,185 INFO [RS_OPEN_REGION-prod004:60020-2] regionserver.HRegion: Flushing 1/1 column families, memstore=53.96 MB; WAL is null, using passed sequenceid=38340778 2019-08-16 09:38:29,203 INFO [RS_OPEN_REGION-prod004:60020-2] regionserver.DefaultStoreFlusher: Flushed, sequenceid=38340778, memsize=54.0 M, hasBloomFilter=false, into tmp file hdfs://motp02/hbase/data/default/TBL_TABLE_CODE/ec69674133ba425f63cedfc686b70823/.tmp/b8d1e9859cab41e9900b326ce5f65ab2 2019-08-16 09:38:29,258 INFO [RS_OPEN_REGION-prod004:60020-2] regionserver.HStore: Added hdfs://motp02/hbase/data/default/TBL_TABLE_CODE/ec69674133ba425f63cedfc686b70823/d/b8d1e9859cab41e9900b326ce5f65ab2, entries=457848, sequenceid=38340778, filesize=6.6 M 2019-08-16 09:38:29,259 INFO [RS_OPEN_REGION-prod004:60020-2] regionserver.HRegion: Finished memstore flush of ~53.96 MB/56580048, currentsize=0 B/0 for region TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. in 2074ms, sequenceid=38340778, compaction requested=true; wal=null 2019-08-16 09:38:29,328 INFO [RS_OPEN_REGION-prod004:60020-2] regionserver.HRegion: Onlined ec69674133ba425f63cedfc686b70823; next sequenceid=38340779 2019-08-16 09:38:29,353 INFO [PostOpenDeployTasks:ec69674133ba425f63cedfc686b70823] regionserver.HRegionServer: Post open deploy tasks for TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. 2019-08-16 09:38:29,356 INFO [regionserver/prod004/10.73.67.84:60020-shortCompactions-1565937415376] regionserver.HRegion: Starting compaction on d in region TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. 2019-08-16 09:38:29,356 INFO [regionserver/prod004/10.73.67.84:60020-shortCompactions-1565937415376] regionserver.HStore: Starting compaction of 3 file(s) in d of TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. into tmpdir=hdfs://motp02/hbase/data/default/TBL_TABLE_CODE/ec69674133ba425f63cedfc686b70823/.tmp, totalSize=44.7 M 2019-08-16 09:38:29,381 INFO [PostOpenDeployTasks:ec69674133ba425f63cedfc686b70823] hbase.MetaTableAccessor: Updated row TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. with server=prod004,60020,1565937407427 2019-08-16 09:38:36,514 INFO [regionserver/prod004/10.73.67.84:60020-shortCompactions-1565937415376] regionserver.HStore: Completed compaction of 3 file(s) in d of TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. into 31c12a445ccf4e94b31acafabf28515d(size=44.1 M), total size for store is 2.3 G. This selection was in queue for 0sec, and took 7sec to execute. {code} was (Author: 0x62ash): Error occurs on MemStore flush. Looks like replay from WAL has no error. Latest log dump with crush and replays: {code} 2019-08-16 09:36:40,926 FATAL [MemStoreFlusher.1] regionserver.HRegionServer: ABORTING region server prod004,60020,1565928847135: Replay of WAL required. Forcing server shutdown org.apache.hadoop.hbase.DroppedSnapshotException: region: TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2675) at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2352) at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2314) at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2200) at org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:2125) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:512) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:482) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:76) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:264) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Added a key not lexically larger than previous. Current cell = \x1100000090013098!>ct94a/d:apd/1565936313172/DeleteColumn/vlen=0/seqid=38334251, lastCell = \x1100000090013098!>ct94a/d:apd/1565936313172/Put/vlen=12/seqid=38338239 at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:204) at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:279) at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87) at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1127) at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:139) at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75) at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1003) at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523) at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2622) ... 9 more 2019-08-16 09:36:41,516 INFO [RS_CLOSE_REGION-prod004:60020-1] regionserver.HRegion: Closed TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. <RESTART> 2019-08-16 09:36:54,944 INFO [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=60020] regionserver.RSRpcServices: Open TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. 2019-08-16 09:37:39,164 INFO [RS_OPEN_REGION-prod004:60020-2] regionserver.HRegion: Replaying edits from hdfs://motp02/hbase/default/TBL_TABLE_CODE/ec69674133ba425f63cedfc686b70823/recovered.edits/0000000000038324857 ... 2019-08-16 09:38:24,017 INFO [RS_OPEN_REGION-prod004:60020-2] regionserver.HRegion: Replaying edits from hdfs://motp02/hbase/default/TBL_TABLE_CODE/ec69674133ba425f63cedfc686b70823/recovered.edits/0000000000038340778 2019-08-16 09:38:27,185 INFO [RS_OPEN_REGION-prod004:60020-2] regionserver.HRegion: Flushing 1/1 column families, memstore=53.96 MB; WAL is null, using passed sequenceid=38340778 2019-08-16 09:38:29,203 INFO [RS_OPEN_REGION-prod004:60020-2] regionserver.DefaultStoreFlusher: Flushed, sequenceid=38340778, memsize=54.0 M, hasBloomFilter=false, into tmp file hdfs://motp02/hbase/data/default/TBL_TABLE_CODE/ec69674133ba425f63cedfc686b70823/.tmp/b8d1e9859cab41e9900b326ce5f65ab2 2019-08-16 09:38:29,258 INFO [RS_OPEN_REGION-prod004:60020-2] regionserver.HStore: Added hdfs://motp02/hbase/data/default/TBL_TABLE_CODE/ec69674133ba425f63cedfc686b70823/d/b8d1e9859cab41e9900b326ce5f65ab2, entries=457848, sequenceid=38340778, filesize=6.6 M 2019-08-16 09:38:29,259 INFO [RS_OPEN_REGION-prod004:60020-2] regionserver.HRegion: Finished memstore flush of ~53.96 MB/56580048, currentsize=0 B/0 for region TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. in 2074ms, sequenceid=38340778, compaction requested=true; wal=null 2019-08-16 09:38:29,328 INFO [RS_OPEN_REGION-prod004:60020-2] regionserver.HRegion: Onlined ec69674133ba425f63cedfc686b70823; next sequenceid=38340779 2019-08-16 09:38:29,353 INFO [PostOpenDeployTasks:ec69674133ba425f63cedfc686b70823] regionserver.HRegionServer: Post open deploy tasks for TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. 2019-08-16 09:38:29,356 INFO [regionserver/prod004/10.73.67.84:60020-shortCompactions-1565937415376] regionserver.HRegion: Starting compaction on d in region TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. 2019-08-16 09:38:29,356 INFO [regionserver/prod004/10.73.67.84:60020-shortCompactions-1565937415376] regionserver.HStore: Starting compaction of 3 file(s) in d of TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. into tmpdir=hdfs://motp02/hbase/data/default/TBL_TABLE_CODE/ec69674133ba425f63cedfc686b70823/.tmp, totalSize=44.7 M 2019-08-16 09:38:29,381 INFO [PostOpenDeployTasks:ec69674133ba425f63cedfc686b70823] hbase.MetaTableAccessor: Updated row TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. with server=prod004,60020,1565937407427 2019-08-16 09:38:36,514 INFO [regionserver/prod004/10.73.67.84:60020-shortCompactions-1565937415376] regionserver.HStore: Completed compaction of 3 file(s) in d of TBL_TABLE_CODE,\x1100000090011599kySHLF;,1561791629538.ec69674133ba425f63cedfc686b70823. into 31c12a445ccf4e94b31acafabf28515d(size=44.1 M), total size for store is 2.3 G. This selection was in queue for 0sec, and took 7sec to execute. {code} > Region Server crash with: Added a key not lexically larger than previous > ------------------------------------------------------------------------ > > Key: HBASE-22862 > URL: https://issues.apache.org/jira/browse/HBASE-22862 > Project: HBase > Issue Type: Bug > Components: regionserver > Affects Versions: 1.4.10 > Environment: {code} > openjdk version "1.8.0_181" > OpenJDK Runtime Environment (Zulu 8.31.0.1-linux64) (build 1.8.0_181-b02) > OpenJDK 64-Bit Server VM (Zulu 8.31.0.1-linux64) (build 25.181-b02, mixed > mode) > {code} > Reporter: Alex Batyrshin > Priority: Critical > > We observe error "Added a key not lexically larger than previous” that cause > most of our region-servers to crash in our cluster. > {code} > 2019-08-15 18:02:10,554 INFO [MemStoreFlusher.0] regionserver.HRegion: > Flushing 1/1 column families, memstore=56.08 MB > 2019-08-15 18:02:10,727 WARN [MemStoreFlusher.0] regionserver.HStore: Failed > flushing store file, retrying num=0 > java.io.IOException: Added a key not lexically larger than previous. Current > cell = > \x0901820448218>wGavb'/d:elr/1565881054828/DeleteColumn/vlen=0/seqid=44456567, > lastCell = > \x0901820448218>wGavb'/d:elr/1565881054828/Put/vlen=1/seqid=44457770 > at > org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:204) > at > org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:279) > at > org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87) > at > org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1127) > at > org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:139) > at > org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75) > at > org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1003) > at > org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523) > at > org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2622) > at > org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2352) > at > org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2314) > at > org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2200) > at > org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:2125) > at > org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:512) > at > org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:482) > at > org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:76) > at > org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:264) > at java.lang.Thread.run(Thread.java:748) > 2019-08-15 18:02:21,776 WARN [MemStoreFlusher.0] regionserver.HStore: Failed > flushing store file, retrying num=9 > java.io.IOException: Added a key not lexically larger than previous. Current > cell = > \x0901820448218>wGavb'/d:elr/1565881054828/DeleteColumn/vlen=0/seqid=44456567, > lastCell = > \x0901820448218>wGavb'/d:elr/1565881054828/Put/vlen=1/seqid=44457770 > at > org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:204) > at > org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:279) > at > org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87) > at > org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1127) > at > org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:139) > at > org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75) > at > org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1003) > at > org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523) > at > org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2622) > at > org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2352) > at > org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2314) > at > org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2200) > at > org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:2125) > at > org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:512) > at > org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:482) > at > org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:76) > at > org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:264) > at java.lang.Thread.run(Thread.java:748) > 2019-08-15 18:02:21,777 FATAL [MemStoreFlusher.0] regionserver.HRegionServer: > ABORTING region server prod006,60020,1565873610692: Replay of WAL required. > Forcing server shutdown > org.apache.hadoop.hbase.DroppedSnapshotException: region: > TBL_TABLE_CODE,\x0904606203097821slG=sPD,1563070299676.5110b3395ca64a51cea99c6572a4c3d9. > at > org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2675) > at > org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2352) > at > org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2314) > at > org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2200) > at > org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:2125) > at > org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:512) > at > org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:482) > at > org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:76) > at > org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:264) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.IOException: Added a key not lexically larger than > previous. Current cell = > \x0901820448218>wGavb'/d:elr/1565881054828/DeleteColumn/vlen=0/seqid=44456567, > lastCell = > \x0901820448218>wGavb'/d:elr/1565881054828/Put/vlen=1/seqid=44457770 > at > org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:204) > at > org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:279) > at > org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87) > at > org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1127) > at > org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:139) > at > org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75) > at > org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1003) > at > org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523) > at > org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2622) > ... 9 more > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)