[ 
https://issues.apache.org/jira/browse/HBASE-9724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-9724:
---------------------------------

    Status: Patch Available  (was: Open)

> Failed region split is not handled correctly by AM
> --------------------------------------------------
>
>                 Key: HBASE-9724
>                 URL: https://issues.apache.org/jira/browse/HBASE-9724
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.98.0, 0.96.1
>
>         Attachments: hbase-9724_v1.patch
>
>
> We ran into a case with ITLoadAndVerify with CM which left a region in RIT 
> state, because the region failed a region SPLIT, but AM did not process this 
> correctly. 
> Region server attempting SPLIT and rollover: 
> {code}
> 2013-10-06 16:42:56,959 INFO  [Priority.RpcServer.handler=0,port=60020] 
> regionserver.HRegionServer: Splitting 
> IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e.
> ...
> 2013-10-06 16:43:01,480 INFO  [regionserver60020-splits-1381102351302] 
> regionserver.SplitTransaction: Starting split of region 
> IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e.
> 2013-10-06 16:43:01,480 DEBUG [regionserver60020-splits-1381102351302] 
> regionserver.SplitTransaction: regionserver:60020-0x1418eb9f38301bb Creating 
> ephemeral node for 9608be62536a61444766da669e67d39e in SPLITTING state
> 2013-10-06 16:43:01,570 DEBUG [regionserver60020-splits-1381102351302] 
> zookeeper.ZKAssign: regionserver:60020-0x1418eb9f38301bb Transitioning 
> 9608be62536a61444766da669e67d39e from RS_ZK_REGION_SPLITTING to 
> RS_ZK_REGION_SPLITTING
> 2013-10-06 16:43:01,607 DEBUG [regionserver60020-splits-1381102351302] 
> zookeeper.ZKAssign: regionserver:60020-0x1418eb9f38301bb Transitioned node 
> 9608be62536a61444766da669e67d39e from RS_ZK_REGION_SPLITTING to 
> RS_ZK_REGION_SPLITTING
> 2013-10-06 16:43:01,607 DEBUG [regionserver60020-splits-1381102351302] 
> zookeeper.ZKAssign: regionserver:60020-0x1418eb9f38301bb Transitioning 
> 9608be62536a61444766da669e67d39e from RS_ZK_REGION_SPLITTING to 
> RS_ZK_REGION_SPLITTING
> 2013-10-06 16:43:01,629 DEBUG [regionserver60020-splits-1381102351302] 
> zookeeper.ZKAssign: regionserver:60020-0x1418eb9f38301bb Transitioned node 
> 9608be62536a61444766da669e67d39e from RS_ZK_REGION_SPLITTING to 
> RS_ZK_REGION_SPLITTING
> 2013-10-06 16:43:02,047 DEBUG [regionserver60020-splits-1381102351302] 
> regionserver.HRegion: Closing 
> IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e.:
>  disabling compactions & flushes
> 2013-10-06 16:43:02,047 DEBUG [regionserver60020-splits-1381102351302] 
> regionserver.HRegion: Updates disabled for region 
> IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e.
> 2013-10-06 16:43:02,047 DEBUG [regionserver60020-splits-1381102351302] 
> regionserver.HRegion: Started memstore flush for 
> IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e.,
>  current region memstore size 4.4 M
> 2013-10-06 16:43:03,945 INFO  [regionserver60020-splits-1381102351302] 
> regionserver.HRegion: Finished memstore flush of ~4.4 M/4607072, 
> currentsize=0/0 for region 
> IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e.
>  in 1898ms, sequenceid=6703616, compaction requested=true
> 2013-10-06 16:43:04,026 INFO  
> [StoreCloserThread-IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e.-1]
>  regionserver.HStore: Closed f1
> 2013-10-06 16:43:04,027 INFO  [regionserver60020-splits-1381102351302] 
> regionserver.HRegion: Closed 
> IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e.
> 2013-10-06 16:43:36,938 INFO  [regionserver60020-splits-1381102351302] 
> regionserver.SplitRequest: Running rollback/cleanup of failed split of 
> IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e.;
>  Took too long to split the files and create the references, aborting split
> java.io.IOException: Took too long to split the files and create the 
> references, aborting split
>   at 
> org.apache.hadoop.hbase.regionserver.SplitTransaction.splitStoreFiles(SplitTransaction.java:584)
>   at 
> org.apache.hadoop.hbase.regionserver.SplitTransaction.createDaughters(SplitTransaction.java:296)
>   at 
> org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:465)
>   at 
> org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:82)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> 2013-10-06 16:43:37,482 INFO  
> [StoreOpener-9608be62536a61444766da669e67d39e-1] 
> compactions.CompactionConfiguration: size [134217728, 9223372036854775807); 
> files [3, 10); ratio 1.200000; off-peak ratio 5.000000; throttle point 
> 2684354560; delete expired; major period 604800000, major jitter 0.500000
> 2013-10-06 16:43:37,509 DEBUG 
> [StoreOpener-9608be62536a61444766da669e67d39e-1] regionserver.HStore: loaded 
> hdfs://ambari-unsec-1381055459-hbase-1.cs1cloud.internal:8020/apps/hbase/data/data/default/IntegrationTestLoadAndVerify/9608be62536a61444766da669e67d39e/f1/12cecfd69b7948eb8bbf9bacc3d7fbb0,
>  isReference=false, isBulkLoadResult=false, seqid=6703616, majorComp
> 2013-10-06 16:43:37,549 DEBUG 
> [StoreOpener-9608be62536a61444766da669e67d39e-1] regionserver.HStore: loaded 
> hdfs://ambari-unsec-1381055459-hbase-1.cs1cloud.internal:8020/apps/hbase/data/data/default/IntegrationTestLoadAndVerify/9608be62536a61444766da669e67d39e/f1/6b715040df9b424cbd3d01f36b34677b,
>  isReference=false, isBulkLoadResult=false, seqid=6703217, majorComp
> 2013-10-06 16:43:37,588 DEBUG 
> [StoreOpener-9608be62536a61444766da669e67d39e-1] regionserver.HStore: loaded 
> hdfs://ambari-unsec-1381055459-hbase-1.cs1cloud.internal:8020/apps/hbase/data/data/default/IntegrationTestLoadAndVerify/9608be62536a61444766da669e67d39e/f1/97cbd5f18ee64e0aaf43f48b1d3fae73,
>  isReference=false, isBulkLoadResult=false, seqid=6681405, majorComp
> 2013-10-06 16:43:37,630 DEBUG 
> [StoreOpener-9608be62536a61444766da669e67d39e-1] regionserver.HStore: loaded 
> hdfs://ambari-unsec-1381055459-hbase-1.cs1cloud.internal:8020/apps/hbase/data/data/default/IntegrationTestLoadAndVerify/9608be62536a61444766da669e67d39e/f1/a0d33ad9786d4a6fb6fa83b530cb5acd,
>  isReference=false, isBulkLoadResult=false, seqid=6699746, majorComp
> 2013-10-06 16:43:37,676 DEBUG 
> [StoreOpener-9608be62536a61444766da669e67d39e-1] regionserver.HStore: loaded 
> hdfs://ambari-unsec-1381055459-hbase-1.cs1cloud.internal:8020/apps/hbase/data/data/default/IntegrationTestLoadAndVerify/9608be62536a61444766da669e67d39e/f1/c4ad2148624d42e6b8408b9b257a4c70,
>  isReference=false, isBulkLoadResult=false, seqid=6689091, majorComp
> 2013-10-06 16:43:37,893 INFO  [regionserver60020-splits-1381102351302] 
> regionserver.HRegionFileSystem: Cleaned up old failed split transaction 
> detritus: 
> hdfs://ambari-unsec-1381055459-hbase-1.cs1cloud.internal:8020/apps/hbase/data/data/default/IntegrationTestLoadAndVerify/9608be62536a61444766da669e67d39e/.splits
> 2013-10-06 16:43:37,937 INFO  [regionserver60020-splits-1381102351302] 
> regionserver.HRegion: Onlined 9608be62536a61444766da669e67d39e; next 
> sequenceid=6703617
> 2013-10-06 16:43:38,021 DEBUG [regionserver60020-splits-1381102351302] 
> zookeeper.ZKAssign: regionserver:60020-0x1418eb9f38301bb Deleted unassigned 
> node 9608be62536a61444766da669e67d39e in expected state RS_ZK_REGION_SPLITTING
> 2013-10-06 16:43:38,021 INFO  [regionserver60020-splits-1381102351302] 
> regionserver.SplitRequest: Successful rollback of failed split of 
> IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e.
> {code}
> The RS, 4 mins later will try to do another split, but it will fail to let 
> master know so, because master still thinks the region is in RIT due to the 
> first split. 
> {code}
> 2013-10-06 16:43:01,606 DEBUG [AM.ZK.Worker-pool2-t202] 
> master.AssignmentManager: Handling RS_ZK_REGION_SPLITTING, 
> server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,60020,1381101889371, 
> region=9608be62536a61444766da669e67d39e, current_state=null
> 2013-10-06 16:43:01,609 INFO  [AM.ZK.Worker-pool2-t202] master.RegionStates: 
> Transitioned {9608be62536a61444766da669e67d39e state=OPEN, ts=1381102229682, 
> server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,60020,1381101889371} 
> to {9608be62536a61444766da669e67d39e state=SPLITTING, ts=1381102981609, 
> server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,600
> 2013-10-06 16:43:01,634 DEBUG [AM.ZK.Worker-pool2-t203] 
> master.AssignmentManager: Handling RS_ZK_REGION_SPLITTING, 
> server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,60020,1381101889371, 
> region=9608be62536a61444766da669e67d39e, 
> current_state={9608be62536a61444766da669e67d39e state=SPLITTING, 
> ts=1381102981609, server=ambari-unsec-1381055459-hbase-5.cs1cloud
> 2013-10-06 16:43:38,019 DEBUG [AM.ZK.Worker-pool2-t204] 
> master.AssignmentManager: Znode 
> IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e.
>  deleted, state: {9608be62536a61444766da669e67d39e state=SPLITTING, 
> ts=1381102981634, 
> server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,60020,1381101889371}
> 2013-10-06 16:44:35,673 DEBUG [AM.ZK.Worker-pool2-t206] 
> master.AssignmentManager: Handling RS_ZK_REGION_SPLITTING, 
> server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,60020,1381101889371, 
> region=ae22b90a3ddeb30c45c5735f7c855014, current_state=null
> 2013-10-06 16:44:35,673 INFO  [AM.ZK.Worker-pool2-t206] master.RegionStates: 
> Transitioned {ae22b90a3ddeb30c45c5735f7c855014 state=OPEN, ts=1381102362273, 
> server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,60020,1381101889371} 
> to {ae22b90a3ddeb30c45c5735f7c855014 state=SPLITTING, ts=1381103075673, 
> server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,600
> 2013-10-06 16:44:35,725 DEBUG [AM.ZK.Worker-pool2-t207] 
> master.AssignmentManager: Handling RS_ZK_REGION_SPLITTING, 
> server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,60020,1381101889371, 
> region=ae22b90a3ddeb30c45c5735f7c855014, 
> current_state={ae22b90a3ddeb30c45c5735f7c855014 state=SPLITTING, 
> ts=1381103075673, server=ambari-unsec-1381055459-hbase-5.cs1cloud
> 2013-10-06 16:44:38,707 DEBUG 
> [ambari-unsec-1381055459-hbase-1.cs1cloud.internal,60000,1381101554357-BalancerChore]
>  master.HMaster: Not running balancer because 2 region(s) in transition: 
> {9608be62536a61444766da669e67d39e={9608be62536a61444766da669e67d39e 
> state=SPLITTING, ts=1381102981634, 
> server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,60020,1381101889
> {code}
> It seems that AM.nodeDeleted() does not revert the in-memory state for the 
> region to be opened. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to