[ https://issues.apache.org/jira/browse/HBASE-24794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Busbey updated HBASE-24794:
--------------------------------
    Description: 
Had a cluster fail after upgrade from HBase 1 because all writes to meta failed. A master started in maintenance mode looks like this (an RS hosting meta in non-maintenance mode would look similar, starting from {{HRegion.doBatchMutate}}):

{code}
2020-07-28 17:52:56,553 WARN org.apache.hadoop.hbase.regionserver.HRegion: Failed getting lock, row=some_user_table
java.io.IOException: Timed out waiting for lock for row: some_user_table in region 1588230740
        at org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:5863)
        at org.apache.hadoop.hbase.regionserver.HRegion$BatchOperation.lockRowsAndBuildMiniBatch(HRegion.java:3322)
        at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(HRegion.java:4018)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3992)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3923)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3914)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3928)
        at org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4255)
        at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:3047)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.mutate(RSRpcServices.java:2827)
        at org.apache.hadoop.hbase.client.ClientServiceCallable.doMutate(ClientServiceCallable.java:55)
        at org.apache.hadoop.hbase.client.HTable$3.rpcCall(HTable.java:538)
        at org.apache.hadoop.hbase.client.HTable$3.rpcCall(HTable.java:533)
        at org.apache.hadoop.hbase.client.RegionServerCallable.call(RegionServerCallable.java:127)
        at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:542)
        at org.apache.hadoop.hbase.MetaTableAccessor.put(MetaTableAccessor.java:1339)
        at org.apache.hadoop.hbase.MetaTableAccessor.putToMetaTable(MetaTableAccessor.java:1329)
        at org.apache.hadoop.hbase.MetaTableAccessor.updateTableState(MetaTableAccessor.java:1672)
        at org.apache.hadoop.hbase.MetaTableAccessor.updateTableState(MetaTableAccessor.java:1112)
        at org.apache.hadoop.hbase.master.TableStateManager.fixTableStates(TableStateManager.java:296)
        at org.apache.hadoop.hbase.master.TableStateManager.start(TableStateManager.java:269)
        at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1004)
        at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2274)
        at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:583)
        at java.lang.Thread.run(Thread.java:745)
{code}

logging roughly 6k times/second.

The failure was caused by a change in behavior for {{hbase.rowlock.wait.duration}} in HBASE-17210 (so 1.4.0+, 2.0.0+). Prior to that change, setting the config <= 0 meant that row locks would succeed only if they were immediately available. After the change, we fail the lock attempt without checking the lock at all.

Workaround: set {{hbase.rowlock.wait.duration}} to a small positive number, e.g. 1, if you want row locks to fail quickly.

  was:
had a cluster fail after upgrade from hbase 1 because all writes to meta failed.
RS with meta looks like:

{code}
2020-07-28 17:52:56,553 WARN org.apache.hadoop.hbase.regionserver.HRegion: Failed getting lock, row=some_user_table
java.io.IOException: Timed out waiting for lock for row: some_user_table in region 1588230740
        at org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:5863)
        at org.apache.hadoop.hbase.regionserver.HRegion$BatchOperation.lockRowsAndBuildMiniBatch(HRegion.java:3322)
        at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(HRegion.java:4018)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3992)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3923)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3914)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3928)
        at org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4255)
        at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:3047)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.mutate(RSRpcServices.java:2827)
        at org.apache.hadoop.hbase.client.ClientServiceCallable.doMutate(ClientServiceCallable.java:55)
        at org.apache.hadoop.hbase.client.HTable$3.rpcCall(HTable.java:538)
        at org.apache.hadoop.hbase.client.HTable$3.rpcCall(HTable.java:533)
        at org.apache.hadoop.hbase.client.RegionServerCallable.call(RegionServerCallable.java:127)
        at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:542)
        at org.apache.hadoop.hbase.MetaTableAccessor.put(MetaTableAccessor.java:1339)
        at org.apache.hadoop.hbase.MetaTableAccessor.putToMetaTable(MetaTableAccessor.java:1329)
        at org.apache.hadoop.hbase.MetaTableAccessor.updateTableState(MetaTableAccessor.java:1672)
        at org.apache.hadoop.hbase.MetaTableAccessor.updateTableState(MetaTableAccessor.java:1112)
        at org.apache.hadoop.hbase.master.TableStateManager.fixTableStates(TableStateManager.java:296)
        at org.apache.hadoop.hbase.master.TableStateManager.start(TableStateManager.java:269)
        at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1004)
        at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2274)
        at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:583)
        at java.lang.Thread.run(Thread.java:745)
{code}

logging roughly 6k times/second.

failure was caused by a change in behavior for {{hbase.rowlock.wait.duration}} in HBASE-17210 (so 1.4.0+, 2.0.0+). Prior to that change setting the config <= 0 meant that row locks would succeed only if they were immediately available. After the change we fail the lock attempt without checking the lock at all.

workaround: set {{hbase.rowlock.wait.duration}} to a small positive number, e.g. 1, if you want row locks to fail quickly.


> hbase.rowlock.wait.duration should not be <= 0
> ----------------------------------------------
>
>                 Key: HBASE-24794
>                 URL: https://issues.apache.org/jira/browse/HBASE-24794
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 1.4.0, 1.5.0, 2.2.0, 2.3.0, 1.6.0
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>            Priority: Minor
>             Fix For: 3.0.0-alpha-1, 2.3.1, 1.7.0, 2.4.0, 1.4.14, 2.2.6
>
>
> had a cluster fail after upgrade from hbase 1 because all writes to meta failed.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
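The semantics change described in the issue can be illustrated with a minimal standalone sketch. This is not HBase's actual row-lock code: it uses a plain {{java.util.concurrent.locks.ReentrantLock}}, and the class and method names ({{RowLockSketch}}, {{acquireOld}}, {{acquireNew}}) are hypothetical, chosen only to contrast the pre- and post-HBASE-17210 handling of a wait duration <= 0.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class RowLockSketch {

  // Pre-HBASE-17210 behavior: a wait duration <= 0 still attempted the
  // lock once, succeeding if it was immediately available.
  // tryLock(0, ...) polls the lock without blocking.
  static boolean acquireOld(ReentrantLock lock, long waitMs)
      throws InterruptedException {
    return lock.tryLock(Math.max(0, waitMs), TimeUnit.MILLISECONDS);
  }

  // Post-HBASE-17210 behavior: a wait duration <= 0 fails the attempt
  // without ever checking the lock, so every mutation "times out" even
  // on a completely uncontended row.
  static boolean acquireNew(ReentrantLock lock, long waitMs)
      throws InterruptedException {
    if (waitMs <= 0) {
      return false; // never even polls the lock
    }
    return lock.tryLock(waitMs, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) throws InterruptedException {
    ReentrantLock rowLock = new ReentrantLock(); // free, uncontended

    System.out.println(acquireOld(rowLock, 0)); // true: lock was free
    rowLock.unlock();

    System.out.println(acquireNew(rowLock, 0)); // false: fails unconditionally
    System.out.println(acquireNew(rowLock, 1)); // true: the workaround value
  }
}
```

The last call shows why the workaround in the issue (setting {{hbase.rowlock.wait.duration}} to a small positive value such as 1 in hbase-site.xml) restores fast-failing row locks: any positive duration reaches the actual lock check again.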