[ https://issues.apache.org/jira/browse/HBASE-9724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enis Soztutar updated HBASE-9724: --------------------------------- Status: Patch Available (was: Open) > Failed region split is not handled correctly by AM > -------------------------------------------------- > > Key: HBASE-9724 > URL: https://issues.apache.org/jira/browse/HBASE-9724 > Project: HBase > Issue Type: Bug > Reporter: Enis Soztutar > Assignee: Enis Soztutar > Fix For: 0.98.0, 0.96.1 > > Attachments: hbase-9724_v1.patch > > > We ran into a case with ITLoadAndVerify with CM which left a region in RIT > state, because the region failed a region SPLIT, but AM did not process this > correctly. > Region server attempting SPLIT and rollover: > {code} > 2013-10-06 16:42:56,959 INFO [Priority.RpcServer.handler=0,port=60020] > regionserver.HRegionServer: Splitting > IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e. > ... > 2013-10-06 16:43:01,480 INFO [regionserver60020-splits-1381102351302] > regionserver.SplitTransaction: Starting split of region > IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e. > 2013-10-06 16:43:01,480 DEBUG [regionserver60020-splits-1381102351302] > regionserver.SplitTransaction: regionserver:60020-0x1418eb9f38301bb Creating > ephemeral node for 9608be62536a61444766da669e67d39e in SPLITTING state > 2013-10-06 16:43:01,570 DEBUG [regionserver60020-splits-1381102351302] > zookeeper.ZKAssign: regionserver:60020-0x1418eb9f38301bb Transitioning > 9608be62536a61444766da669e67d39e from RS_ZK_REGION_SPLITTING to > RS_ZK_REGION_SPLITTING > 2013-10-06 16:43:01,607 DEBUG [regionserver60020-splits-1381102351302] > zookeeper.ZKAssign: regionserver:60020-0x1418eb9f38301bb Transitioned node > 9608be62536a61444766da669e67d39e from RS_ZK_REGION_SPLITTING to > RS_ZK_REGION_SPLITTING > 2013-10-06 16:43:01,607 DEBUG [regionserver60020-splits-1381102351302] > zookeeper.ZKAssign: regionserver:60020-0x1418eb9f38301bb Transitioning > 9608be62536a61444766da669e67d39e from RS_ZK_REGION_SPLITTING to > RS_ZK_REGION_SPLITTING > 2013-10-06 16:43:01,629 DEBUG [regionserver60020-splits-1381102351302] > zookeeper.ZKAssign: regionserver:60020-0x1418eb9f38301bb Transitioned node > 9608be62536a61444766da669e67d39e from RS_ZK_REGION_SPLITTING to > RS_ZK_REGION_SPLITTING > 2013-10-06 16:43:02,047 DEBUG [regionserver60020-splits-1381102351302] > regionserver.HRegion: Closing > IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e.: > disabling compactions & flushes > 2013-10-06 16:43:02,047 DEBUG [regionserver60020-splits-1381102351302] > regionserver.HRegion: Updates disabled for region > IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e. > 2013-10-06 16:43:02,047 DEBUG [regionserver60020-splits-1381102351302] > regionserver.HRegion: Started memstore flush for > IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e., > current region memstore size 4.4 M > 2013-10-06 16:43:03,945 INFO [regionserver60020-splits-1381102351302] > regionserver.HRegion: Finished memstore flush of ~4.4 M/4607072, > currentsize=0/0 for region > IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e. > in 1898ms, sequenceid=6703616, compaction requested=true > 2013-10-06 16:43:04,026 INFO > [StoreCloserThread-IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e.-1] > regionserver.HStore: Closed f1 > 2013-10-06 16:43:04,027 INFO [regionserver60020-splits-1381102351302] > regionserver.HRegion: Closed > IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e. > 2013-10-06 16:43:36,938 INFO [regionserver60020-splits-1381102351302] > regionserver.SplitRequest: Running rollback/cleanup of failed split of > IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e.; > Took too long to split the files and create the references, aborting split > java.io.IOException: Took too long to split the files and create the > references, aborting split > at > org.apache.hadoop.hbase.regionserver.SplitTransaction.splitStoreFiles(SplitTransaction.java:584) > at > org.apache.hadoop.hbase.regionserver.SplitTransaction.createDaughters(SplitTransaction.java:296) > at > org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:465) > at > org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:82) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:722) > 2013-10-06 16:43:37,482 INFO > [StoreOpener-9608be62536a61444766da669e67d39e-1] > compactions.CompactionConfiguration: size [134217728, 9223372036854775807); > files [3, 10); ratio 1.200000; off-peak ratio 5.000000; throttle point > 2684354560; delete expired; major period 604800000, major jitter 0.500000 > 2013-10-06 16:43:37,509 DEBUG > [StoreOpener-9608be62536a61444766da669e67d39e-1] regionserver.HStore: loaded > hdfs://ambari-unsec-1381055459-hbase-1.cs1cloud.internal:8020/apps/hbase/data/data/default/IntegrationTestLoadAndVerify/9608be62536a61444766da669e67d39e/f1/12cecfd69b7948eb8bbf9bacc3d7fbb0, > isReference=false, isBulkLoadResult=false, seqid=6703616, majorComp > 2013-10-06 16:43:37,549 DEBUG > [StoreOpener-9608be62536a61444766da669e67d39e-1] regionserver.HStore: loaded > hdfs://ambari-unsec-1381055459-hbase-1.cs1cloud.internal:8020/apps/hbase/data/data/default/IntegrationTestLoadAndVerify/9608be62536a61444766da669e67d39e/f1/6b715040df9b424cbd3d01f36b34677b, > isReference=false, isBulkLoadResult=false, seqid=6703217, majorComp > 2013-10-06 16:43:37,588 DEBUG > [StoreOpener-9608be62536a61444766da669e67d39e-1] regionserver.HStore: loaded > hdfs://ambari-unsec-1381055459-hbase-1.cs1cloud.internal:8020/apps/hbase/data/data/default/IntegrationTestLoadAndVerify/9608be62536a61444766da669e67d39e/f1/97cbd5f18ee64e0aaf43f48b1d3fae73, > isReference=false, isBulkLoadResult=false, seqid=6681405, majorComp > 2013-10-06 16:43:37,630 DEBUG > [StoreOpener-9608be62536a61444766da669e67d39e-1] regionserver.HStore: loaded > hdfs://ambari-unsec-1381055459-hbase-1.cs1cloud.internal:8020/apps/hbase/data/data/default/IntegrationTestLoadAndVerify/9608be62536a61444766da669e67d39e/f1/a0d33ad9786d4a6fb6fa83b530cb5acd, > isReference=false, isBulkLoadResult=false, seqid=6699746, majorComp > 2013-10-06 16:43:37,676 DEBUG > [StoreOpener-9608be62536a61444766da669e67d39e-1] regionserver.HStore: loaded > hdfs://ambari-unsec-1381055459-hbase-1.cs1cloud.internal:8020/apps/hbase/data/data/default/IntegrationTestLoadAndVerify/9608be62536a61444766da669e67d39e/f1/c4ad2148624d42e6b8408b9b257a4c70, > isReference=false, isBulkLoadResult=false, seqid=6689091, majorComp > 2013-10-06 16:43:37,893 INFO [regionserver60020-splits-1381102351302] > regionserver.HRegionFileSystem: Cleaned up old failed split transaction > detritus: > hdfs://ambari-unsec-1381055459-hbase-1.cs1cloud.internal:8020/apps/hbase/data/data/default/IntegrationTestLoadAndVerify/9608be62536a61444766da669e67d39e/.splits > 2013-10-06 16:43:37,937 INFO [regionserver60020-splits-1381102351302] > regionserver.HRegion: Onlined 9608be62536a61444766da669e67d39e; next > sequenceid=6703617 > 2013-10-06 16:43:38,021 DEBUG [regionserver60020-splits-1381102351302] > zookeeper.ZKAssign: regionserver:60020-0x1418eb9f38301bb Deleted unassigned > node 9608be62536a61444766da669e67d39e in expected state RS_ZK_REGION_SPLITTING > 2013-10-06 16:43:38,021 INFO [regionserver60020-splits-1381102351302] > regionserver.SplitRequest: Successful rollback of failed split of > IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e. > {code} > The RS, 4 mins later will try to do another split, but it will fail to let > master know so, because master still thinks the region is in RIT due to the > first split. > {code} > 2013-10-06 16:43:01,606 DEBUG [AM.ZK.Worker-pool2-t202] > master.AssignmentManager: Handling RS_ZK_REGION_SPLITTING, > server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,60020,1381101889371, > region=9608be62536a61444766da669e67d39e, current_state=null > 2013-10-06 16:43:01,609 INFO [AM.ZK.Worker-pool2-t202] master.RegionStates: > Transitioned {9608be62536a61444766da669e67d39e state=OPEN, ts=1381102229682, > server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,60020,1381101889371} > to {9608be62536a61444766da669e67d39e state=SPLITTING, ts=1381102981609, > server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,600 > 2013-10-06 16:43:01,634 DEBUG [AM.ZK.Worker-pool2-t203] > master.AssignmentManager: Handling RS_ZK_REGION_SPLITTING, > server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,60020,1381101889371, > region=9608be62536a61444766da669e67d39e, > current_state={9608be62536a61444766da669e67d39e state=SPLITTING, > ts=1381102981609, server=ambari-unsec-1381055459-hbase-5.cs1cloud > 2013-10-06 16:43:38,019 DEBUG [AM.ZK.Worker-pool2-t204] > master.AssignmentManager: Znode > IntegrationTestLoadAndVerify,/(k\xCA\x1A\xF2\x86\xB6,1381100194019.9608be62536a61444766da669e67d39e. > deleted, state: {9608be62536a61444766da669e67d39e state=SPLITTING, > ts=1381102981634, > server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,60020,1381101889371} > 2013-10-06 16:44:35,673 DEBUG [AM.ZK.Worker-pool2-t206] > master.AssignmentManager: Handling RS_ZK_REGION_SPLITTING, > server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,60020,1381101889371, > region=ae22b90a3ddeb30c45c5735f7c855014, current_state=null > 2013-10-06 16:44:35,673 INFO [AM.ZK.Worker-pool2-t206] master.RegionStates: > Transitioned {ae22b90a3ddeb30c45c5735f7c855014 state=OPEN, ts=1381102362273, > server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,60020,1381101889371} > to {ae22b90a3ddeb30c45c5735f7c855014 state=SPLITTING, ts=1381103075673, > server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,600 > 2013-10-06 16:44:35,725 DEBUG [AM.ZK.Worker-pool2-t207] > master.AssignmentManager: Handling RS_ZK_REGION_SPLITTING, > server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,60020,1381101889371, > region=ae22b90a3ddeb30c45c5735f7c855014, > current_state={ae22b90a3ddeb30c45c5735f7c855014 state=SPLITTING, > ts=1381103075673, server=ambari-unsec-1381055459-hbase-5.cs1cloud > 2013-10-06 16:44:38,707 DEBUG > [ambari-unsec-1381055459-hbase-1.cs1cloud.internal,60000,1381101554357-BalancerChore] > master.HMaster: Not running balancer because 2 region(s) in transition: > {9608be62536a61444766da669e67d39e={9608be62536a61444766da669e67d39e > state=SPLITTING, ts=1381102981634, > server=ambari-unsec-1381055459-hbase-5.cs1cloud.internal,60020,1381101889 > {code} > It seems that AM.nodeDeleted() does not revert the in-memory state for the > region to be opened. -- This message was sent by Atlassian JIRA (v6.1#6144)