[ https://issues.apache.org/jira/browse/HBASE-23589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Szabolcs Bukros updated HBASE-23589:
------------------------------------
Description:

Flushing the active region creates the following files:
{code:java}
2019-12-13 08:00:20,866 INFO org.apache.hadoop.hbase.regionserver.HStore: Added hdfs://replica-1:8020/hbase/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f2/dab4d1cc01e44773bad7bdb5d2e33b6c, entries=49128, sequenceid=70688, filesize=41.4 M
2019-12-13 08:00:20,897 INFO org.apache.hadoop.hbase.regionserver.HStore: Added hdfs://replica-1:8020/hbase/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f3/ecc50f33085042f7bd2397253b896a3a, entries=50000, sequenceid=70688, filesize=42.3 M
{code}
On the read replica region, when we try to replay the flush, we see the following:
{code:java}
2019-12-13 08:00:21,279 WARN org.apache.hadoop.hbase.regionserver.HRegion: bfa9cdb0ab13d60b389df6621ab316d1 : At least one of the store files in flush: action: COMMIT_FLUSH table_name: "IntegrationTestRegionReplicaReplication" encoded_region_name: "20af2eb8929408f26d0b3b81e6b86d47" flush_sequence_number: 70688 store_flushes { family_name: "f2" store_home_dir: "f2" flush_output: "ecc50f33085042f7bd2397253b896a3a" } store_flushes { family_name: "f3" store_home_dir: "f3" flush_output: "dab4d1cc01e44773bad7bdb5d2e33b6c" } region_name: "IntegrationTestRegionReplicaReplication,cccccccc,1576252065847.20af2eb8929408f26d0b3b81e6b86d47." doesn't exist any more. Skip loading the file(s)
java.io.FileNotFoundException: HFileLink locations=[hdfs://replica-1:8020/hbase/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f2/ecc50f33085042f7bd2397253b896a3a, hdfs://replica-1:8020/hbase/.tmp/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f2/ecc50f33085042f7bd2397253b896a3a, hdfs://replica-1:8020/hbase/mobdir/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f2/ecc50f33085042f7bd2397253b896a3a, hdfs://replica-1:8020/hbase/archive/data/default/IntegrationTestRegionReplicaReplication/20af2eb8929408f26d0b3b81e6b86d47/f2/ecc50f33085042f7bd2397253b896a3a]
	at org.apache.hadoop.hbase.io.FileLink.getFileStatus(FileLink.java:415)
	at org.apache.hadoop.hbase.util.ServerRegionReplicaUtil.getStoreFileInfo(ServerRegionReplicaUtil.java:135)
	at org.apache.hadoop.hbase.regionserver.HRegionFileSystem.getStoreFileInfo(HRegionFileSystem.java:311)
	at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.replayFlush(HStore.java:2414)
	at org.apache.hadoop.hbase.regionserver.HRegion.replayFlushInStores(HRegion.java:5310)
	at org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushCommitMarker(HRegion.java:5184)
	at org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushMarker(HRegion.java:5018)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.doReplayBatchOp(RSRpcServices.java:1143)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.replay(RSRpcServices.java:2229)
	at org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:29754)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
{code}
As the logs show, the flush_output values got mixed up: the COMMIT_FLUSH marker pairs family "f2" with f3's flushed file ("ecc50f33085042f7bd2397253b896a3a") and family "f3" with f2's file ("dab4d1cc01e44773bad7bdb5d2e33b6c"), so the replay looks for each file under the wrong family directory and fails with FileNotFoundException.
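The mismatch seen in the flush descriptor above can be reproduced with a minimal, hypothetical Java sketch (the class and method names below are invented for illustration, not actual HBase code): pairing families with flush outputs purely by iteration order silently produces wrong pairs whenever the two sequences are ordered differently.

{code:java}
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of the pairing bug: the flush commit walks the stores
// and the per-store flush contexts in parallel and pairs them by position.
// If the two sequences are ordered differently, each family is committed
// with another family's flush output.
public class FlushPairingSketch {

    // Pairs each column family with a flush output purely by iteration
    // order -- the "stores.values() and storeFlushCtxs have same order"
    // assumption quoted in the description.
    public static Map<String, String> pairByIterationOrder(
            List<String> families, List<String> flushOutputs) {
        Map<String, String> committed = new LinkedHashMap<>();
        Iterator<String> outputs = flushOutputs.iterator();
        for (String family : families) {
            committed.put(family, outputs.next());
        }
        return committed;
    }

    public static void main(String[] args) {
        // Stores iterated as f2, f3 ...
        List<String> families = Arrays.asList("f2", "f3");
        // ... but the flush outputs arrive in the opposite order
        // (file names taken from the logs above).
        List<String> outputs = Arrays.asList(
            "ecc50f33085042f7bd2397253b896a3a",   // actually f3's file
            "dab4d1cc01e44773bad7bdb5d2e33b6c");  // actually f2's file

        // Reproduces the mismatch from the COMMIT_FLUSH marker:
        // f2 is paired with f3's file and vice versa.
        System.out.println(pairByIterationOrder(families, outputs));
        // prints {f2=ecc50f33085042f7bd2397253b896a3a, f3=dab4d1cc01e44773bad7bdb5d2e33b6c}
    }
}
{code}

A pairing keyed by family name instead of by position would not be vulnerable to this reordering.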
The issue is caused by HRegion.internalFlushCacheAndCommit. The code assumes that "{color:#808080}stores.values() and storeFlushCtxs have same order{color}", which no longer seems to be true.

> FlushDescriptor contains non-matching family/output combinations
> ----------------------------------------------------------------
>
>                 Key: HBASE-23589
>                 URL: https://issues.apache.org/jira/browse/HBASE-23589
>             Project: HBase
>          Issue Type: Bug
>          Components: read replicas
>    Affects Versions: 2.2.2
>            Reporter: Szabolcs Bukros
>            Assignee: Szabolcs Bukros
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)