[ https://issues.apache.org/jira/browse/HBASE-12190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir Rodionov reassigned HBASE-12190: ----------------------------------------- Assignee: Vladimir Rodionov > Split thread pool larger than one results in fatal failure during region > splits > ------------------------------------------------------------------------------- > > Key: HBASE-12190 > URL: https://issues.apache.org/jira/browse/HBASE-12190 > Project: HBase > Issue Type: Bug > Components: regionserver > Affects Versions: 2.0.0 > Reporter: Vladimir Rodionov > Assignee: Vladimir Rodionov > > I have been playing with different sizes for long/short compaction and split > thread poll sizes. It is definitely clear, that any size of split thread pool > large than 1 (one) results in a fatal errors during region splits. > Here is the log file snippet of a local test case : > {code} > 2014-10-07 11:04:48,441 > [INFO|org.apache.hadoop.hbase.regionserver.HRegionFileSystem|HRegionFileSystem] > The > hdfs://localhost:50215/user/vrodionov/hbase/data/default/TABLE_A/0ace9f564c05043a45cb44b2867432f8/.splits > directory exists. Hence deleting it to recreate it > put: 230000 > 2014-10-07 11:04:48,697 [INFO|BlockStateChange|BlockManager] BLOCK* > addStoredBlock: blockMap updated: 127.0.0.1:50216 is added to > blk_1073741862_1038{blockUCState=COMMITTED, primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[[DISK]DS-0b95bff8-1354-42cc-bd16-a5eafcb5bb27:NORMAL|RBW]]} > size 22106983 > 2014-10-07 11:04:49,100 > [INFO|org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher|DefaultStoreFlusher] > Flushed, sequenceid=495, memsize=63.2 M, hasBloomFilter=true, into tmp file > hdfs://localhost:50215/user/vrodionov/hbase/data/default/TABLE_A/0ace9f564c05043a45cb44b2867432f8/.tmp/8d746dd585df4669a505fe2ba0ddafe1 > 2014-10-07 11:04:49,110 > [INFO|org.apache.hadoop.hbase.regionserver.HStore|HStore] Added > hdfs://localhost:50215/user/vrodionov/hbase/data/default/TABLE_A/0ace9f564c05043a45cb44b2867432f8/fam_a/8d746dd585df4669a505fe2ba0ddafe1, > entries=360000, sequenceid=495, filesize=21.1 M > 2014-10-07 11:04:49,111 > [INFO|org.apache.hadoop.hbase.regionserver.HRegion|HRegion] Finished memstore > flush of ~63.17 MB/66240000, currentsize=42.11 MB/44160000 for region > TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8. in > 674ms, sequenceid=495, compaction requested=true > 2014-10-07 11:04:49,111 > [INFO|org.apache.hadoop.hbase.regionserver.HRegion|HRegion] Started memstore > flush for > TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8., > current region memstore size 42.11 MB > 2014-10-07 11:04:49,267 [INFO|BlockStateChange|BlockManager] BLOCK* > addStoredBlock: blockMap updated: 127.0.0.1:50216 is added to > blk_1073741863_1039{blockUCState=COMMITTED, primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[[DISK]DS-414e0e79-1c18-4469-9458-6842a0b96131:NORMAL|RBW]]} > size 14745108 > 2014-10-07 11:04:49,669 > [INFO|org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher|DefaultStoreFlusher] > Flushed, sequenceid=514, memsize=42.1 M, hasBloomFilter=true, into tmp file > hdfs://localhost:50215/user/vrodionov/hbase/data/default/TABLE_A/0ace9f564c05043a45cb44b2867432f8/.tmp/785a7048b1b849509e3e622cdaceb033 > 2014-10-07 11:04:49,682 > [INFO|org.apache.hadoop.hbase.regionserver.HStore|HStore] Added > hdfs://localhost:50215/user/vrodionov/hbase/data/default/TABLE_A/0ace9f564c05043a45cb44b2867432f8/fam_a/785a7048b1b849509e3e622cdaceb033, > entries=240000, sequenceid=514, filesize=14.1 M > 2014-10-07 11:04:49,683 > [INFO|org.apache.hadoop.hbase.regionserver.HRegion|HRegion] Finished memstore > flush of ~42.11 MB/44160000, currentsize=0 B/0 for region > TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8. in > 572ms, sequenceid=514, compaction requested=true > 2014-10-07 11:04:49,686 > [INFO|org.apache.hadoop.hbase.regionserver.HStore|HStore] Closed fam_a > 2014-10-07 11:04:49,687 > [INFO|org.apache.hadoop.hbase.regionserver.HRegion|HRegion] Closed > TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8. > 2014-10-07 11:04:49,687 > [WARN|org.apache.hadoop.hbase.regionserver.HRegion|HRegion] Region > TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8. > already closed > 2014-10-07 11:04:49,688 > [INFO|org.apache.hadoop.hbase.regionserver.SplitRequest|SplitRequest] Running > rollback/cleanup of failed split of > TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8.; > Failed to close region: already closed by another thread > java.io.IOException: Failed to close region: already closed by another thread > at > org.apache.hadoop.hbase.regionserver.SplitTransaction.<clinit>(SplitTransaction.java:190) > at > org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:65) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 2014-10-07 11:04:49,690 > [ERROR|org.apache.hadoop.hbase.master.AssignmentManager|AssignmentManager] > Failed to transition region from {0ace9f564c05043a45cb44b2867432f8 > state=SPLITTING, ts=1412705087544, server=10.1.2.105,50226,1412705045018} on > SPLIT_REVERTED by 10.1.2.105,50226,1412705045018: Some daughter is not known > to be splitting on 10.1.2.105,50226,1412705045018, a=null, b=null > 2014-10-07 11:04:49,690 > [INFO|org.apache.hadoop.hbase.regionserver.HRegionServer|HRegionServer] > Failed to transition {ENCODED => 0ace9f564c05043a45cb44b2867432f8, NAME => > 'TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8.', > STARTKEY => 'row000000037607', ENDKEY => ''} to SPLIT_REVERTED: Some daughter > is not known to be splitting on 10.1.2.105,50226,1412705045018, a=null, b=null > 2014-10-07 11:04:49,692 > [FATAL|org.apache.hadoop.hbase.regionserver.HRegionServer|HRegionServer] > ABORTING region server 10.1.2.105,50226,1412705045018: Abort; we got an error > after point-of-no-return > 2014-10-07 11:04:49,692 > [FATAL|org.apache.hadoop.hbase.regionserver.HRegionServer|HRegionServer] > RegionServer abort: loaded coprocessors are: > [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint] > 2014-10-07 11:04:49,699 [INFO|BlockStateChange|BlockManager] BLOCK* > addStoredBlock: blockMap updated: 127.0.0.1:50216 is added to > blk_1073741864_1040{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[[DISK]DS-0b95bff8-1354-42cc-bd16-a5eafcb5bb27:NORMAL|RBW]]} > size 0 > 2014-10-07 11:04:49,700 > [ERROR|org.apache.hadoop.hbase.master.MasterRpcServices|MasterRpcServices] > Region server 10.1.2.105,50226,1412705045018 reported a fatal error: > ABORTING region server 10.1.2.105,50226,1412705045018: Abort; we got an error > after point-of-no-return > 2014-10-07 11:04:49,702 > [INFO|org.apache.hadoop.hbase.regionserver.HRegionServer|HRegionServer] > STOPPED: Abort; we got an error after point-of-no-return > 2014-10-07 11:04:49,702 > [INFO|org.apache.hadoop.hbase.regionserver.SplitLogWorker|SplitLogWorker] > Sending interrupt to stop the worker thread > 2014-10-07 11:04:49,702 > [INFO|org.apache.hadoop.hbase.regionserver.HRegionServer|HRegionServer] > Stopping infoServer > 2014-10-07 11:04:49,702 > [INFO|org.apache.hadoop.hbase.regionserver.SplitLogWorker|SplitLogWorker] > SplitLogWorker interrupted. Exiting. > 2014-10-07 11:04:49,702 > [INFO|org.apache.hadoop.hbase.regionserver.SplitLogWorker|SplitLogWorker] > SplitLogWorker 10.1.2.105,50226,1412705045018 exiting > 2014-10-07 11:04:49,704 [INFO|org.mortbay.log|Slf4jLog] Stopped > SelectChannelConnector@0.0.0.0:0 > 2014-10-07 11:04:49,706 [INFO|BlockStateChange|BlockManager] BLOCK* > addStoredBlock: blockMap updated: 127.0.0.1:50216 is added to > blk_1073741865_1041{blockUCState=COMMITTED, primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[[DISK]DS-414e0e79-1c18-4469-9458-6842a0b96131:NORMAL|RBW]]} > size 35 > 2014-10-07 11:04:49,807 > [INFO|org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager|RegionServerSnapshotManager] > Stopping RegionServerSnapshotManager abruptly. > 2014-10-07 11:04:49,808 > [INFO|org.apache.hadoop.hbase.regionserver.ServerNonceManager$1|Chore] > regionserver//10.1.2.105:0.nonceCleaner exiting > 2014-10-07 11:04:49,808 > [INFO|org.apache.hadoop.hbase.procedure.flush.RegionServerFlushTableProcedureManager|RegionServerFlushTableProcedureManager] > Stopping region server flush procedure manager abruptly. > 2014-10-07 11:04:49,808 > [INFO|org.apache.hadoop.hbase.regionserver.MemStoreFlusher|MemStoreFlusher$FlushHandler] > MemStoreFlusher.1 exiting > 2014-10-07 11:04:49,808 > [INFO|org.apache.hadoop.hbase.regionserver.MemStoreFlusher|MemStoreFlusher$FlushHandler] > MemStoreFlusher.0 exiting > 2014-10-07 11:04:49,807 > [INFO|org.apache.hadoop.hbase.regionserver.HRegionServer$CompactionChecker|Chore] > regionserver//10.1.2.105:0.compactionChecker exiting > 2014-10-07 11:04:49,809 > [INFO|org.apache.hadoop.hbase.regionserver.HRegionServer|HRegionServer] > aborting server 10.1.2.105,50226,1412705045018 > 2014-10-07 11:04:49,809 > [INFO|org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation|ConnectionManager$HConnectionImplementation] > Closing zookeeper sessionid=0x148ebc846910005 > 2014-10-07 11:04:49,812 > [INFO|org.apache.hadoop.hbase.regionserver.HStore|HStore] Closed fam_a > 2014-10-07 11:04:49,813 > [INFO|org.apache.hadoop.hbase.regionserver.HRegionServer|HRegionServer] > Waiting on 1 regions to close > 2014-10-07 11:04:49,813 > [INFO|org.apache.hadoop.hbase.regionserver.HRegion|HRegion] Closed > TABLE_A,,1412705075738.3b2edac70735ec721cd1ccba05211e33. > 2014-10-07 11:04:50,014 > [INFO|org.apache.hadoop.hbase.regionserver.HRegionServer|HRegionServer] > stopping server 10.1.2.105,50226,1412705045018; all regions closed. > 2014-10-07 11:04:50,019 [INFO|BlockStateChange|BlockManager] BLOCK* > addStoredBlock: blockMap updated: 127.0.0.1:50216 is added to > blk_1073741860_1036{blockUCState=COMMITTED, primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[[DISK]DS-0b95bff8-1354-42cc-bd16-a5eafcb5bb27:NORMAL|RBW]]} > size 37803365 > 2014-10-07 11:04:50,118 [INFO|BlockStateChange|BlockManager] BLOCK* > addStoredBlock: blockMap updated: 127.0.0.1:50216 is added to > blk_1073741866_1042{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[[DISK]DS-0b95bff8-1354-42cc-bd16-a5eafcb5bb27:NORMAL|RBW]]} > size 0 > 2014-10-07 11:04:50,130 [INFO|BlockStateChange|BlockManager] BLOCK* > addStoredBlock: blockMap updated: 127.0.0.1:50216 is added to > blk_1073741867_1043{blockUCState=COMMITTED, primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[[DISK]DS-414e0e79-1c18-4469-9458-6842a0b96131:NORMAL|RBW]]} > size 35 > 2014-10-07 11:04:50,522 > [INFO|org.apache.hadoop.hbase.regionserver.Leases|Leases] > RS:0;10.1.2.105:50226 closing leases > 2014-10-07 11:04:50,522 > [INFO|org.apache.hadoop.hbase.regionserver.Leases|Leases] > RS:0;10.1.2.105:50226 closed leases > 2014-10-07 11:04:50,546 [INFO|BlockStateChange|BlockManager] BLOCK* > addStoredBlock: blockMap updated: 127.0.0.1:50216 is added to > blk_1073741868_1044{blockUCState=COMMITTED, primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[[DISK]DS-0b95bff8-1354-42cc-bd16-a5eafcb5bb27:NORMAL|RBW]]} > size 35 > 2014-10-07 11:04:50,960 [INFO|BlockStateChange|BlockManager] BLOCK* > addStoredBlock: blockMap updated: 127.0.0.1:50216 is added to > blk_1073741869_1045{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[[DISK]DS-414e0e79-1c18-4469-9458-6842a0b96131:NORMAL|RBW]]} > size 0 > 2014-10-07 11:04:50,966 [INFO|BlockStateChange|BlockManager] BLOCK* > addStoredBlock: blockMap updated: 127.0.0.1:50216 is added to > blk_1073741870_1046{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[[DISK]DS-0b95bff8-1354-42cc-bd16-a5eafcb5bb27:NORMAL|RBW]]} > size 0 > 2014-10-07 11:04:50,974 [INFO|BlockStateChange|BlockManager] BLOCK* > addStoredBlock: blockMap updated: 127.0.0.1:50216 is added to > blk_1073741871_1047{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[[DISK]DS-414e0e79-1c18-4469-9458-6842a0b96131:NORMAL|RBW]]} > size 0 > 2014-10-07 11:04:50,975 > [INFO|org.apache.hadoop.hbase.regionserver.SplitRequest|SplitRequest] Skip > rollback/cleanup of failed split of > TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8. > because server is stopped > java.io.IOException: Failed to notify master that split passed PONR: > TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8. > at > org.apache.hadoop.hbase.regionserver.SplitTransaction.createDaughters(SplitTransaction.java:274) > at > org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:407) > at > org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:82) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 2014-10-07 11:04:55,340 > [INFO|org.apache.hadoop.hbase.regionserver.HRegionServer$PeriodicMemstoreFlusher|Chore] > regionserver//10.1.2.105:0.periodicFlusher exiting > 2014-10-07 11:04:55,340 > [INFO|org.apache.hadoop.hbase.regionserver.Leases|Leases] > regionserver//10.1.2.105:0.leaseChecker closing leases > 2014-10-07 11:04:55,341 > [INFO|org.apache.hadoop.hbase.regionserver.Leases|Leases] > regionserver//10.1.2.105:0.leaseChecker closed leases > 2014-10-07 11:04:57,509 > [INFO|org.apache.hadoop.hbase.regionserver.LogRoller|LogRoller] LogRoller > exiting. > 2014-10-07 11:04:57,509 > [INFO|org.apache.hadoop.hbase.regionserver.CompactSplitThread|CompactSplitThread] > Waiting for Split Thread to finish... > 2014-10-07 11:04:57,509 > [INFO|org.apache.hadoop.hbase.regionserver.CompactSplitThread|CompactSplitThread] > Waiting for Merge Thread to finish... > 2014-10-07 11:04:57,509 > [INFO|org.apache.hadoop.hbase.regionserver.CompactSplitThread|CompactSplitThread] > Waiting for Large Compaction Thread to finish... > 2014-10-07 11:04:57,509 > [INFO|org.apache.hadoop.hbase.regionserver.CompactSplitThread|CompactSplitThread] > Waiting for Small Compaction Thread to finish... > 2014-10-07 11:04:57,512 > [INFO|org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation|ConnectionManager$HConnectionImplementation] > Closing zookeeper sessionid=0x148ebc846910006 > 2014-10-07 11:04:57,512 > [INFO|org.apache.hadoop.hbase.ipc.RpcServer|RpcServer] Stopping server on > 50226 > 2014-10-07 11:04:57,513 > [INFO|org.apache.hadoop.hbase.ipc.RpcServer|RpcServer$Listener] > RpcServer.listener,port=50226: stopping > 2014-10-07 11:04:57,513 > [INFO|org.apache.hadoop.hbase.ipc.RpcServer|RpcServer$Responder] > RpcServer.responder: stopped > 2014-10-07 11:04:57,513 > [INFO|org.apache.hadoop.hbase.ipc.RpcServer|RpcServer$Responder] > RpcServer.responder: stopping > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)