Andreas Neumann created TEPHRA-249:
--------------------------------------

             Summary: HBase coprocessors sometimes cannot access tables due to ZK auth failure
                 Key: TEPHRA-249
                 URL: https://issues.apache.org/jira/browse/TEPHRA-249
             Project: Tephra
          Issue Type: Bug
            Reporter: Andreas Neumann
            Assignee: Poorna Chandra


Sometimes, region servers have many messages in the logs of the form:
{noformat}
2017-08-15 15:52:51,478 ERROR [tx-state-refresh] zookeeper.ZooKeeperWatcher: hconnection-0x234b6ae9-0x15b49966f34f9bb, quorum=<host>:2181,<host>:2181,<host>:2181, baseZNode=/hbase-secure Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hbase-secure/meta-region-server
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:622)
        at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.getMetaRegionState(MetaTableLocator.java:491)
        at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.getMetaRegionLocation(MetaTableLocator.java:172)
        at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:608)
        at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:589)
        at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:568)
        at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1192)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1159)
        at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:300)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:156)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
        at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:211)
        at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:185)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1256)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1162)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1146)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1103)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getRegionLocation(ConnectionManager.java:938)
        at org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:83)
        at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:79)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:862)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:828)
        at co.cask.cdap.data2.util.hbase.ConfigurationTable.read(ConfigurationTable.java:133)
        at co.cask.cdap.data2.transaction.coprocessor.DefaultTransactionStateCache.getSnapshotConfiguration(DefaultTransactionStateCache.java:56)
        at org.apache.tephra.coprocessor.TransactionStateCache.tryInit(TransactionStateCache.java:94)
        at org.apache.tephra.coprocessor.TransactionStateCache.refreshState(TransactionStateCache.java:153)
        at org.apache.tephra.coprocessor.TransactionStateCache.access$300(TransactionStateCache.java:42)
        at org.apache.tephra.coprocessor.TransactionStateCache$1.run(TransactionStateCache.java:131)
{noformat}
When this happens, it affects the transaction state cache and the prune state equally.

The behavior is pretty bad: the coprocessor attempts to access a Table, and for that it needs to locate the meta region, which fails due to the ZK authorization error. Unfortunately, the HBase client retries this in a blocking busy loop for 5 minutes, flooding the logs the whole time. Then the next coprocessor gets its turn and produces another 5 minutes of unthrottled retries and error messages.

The consequence is that coprocessors cannot read the transaction state or the configuration. For example, they cannot find out whether tx pruning is enabled, and therefore never record prune information.

There is a way to impersonate the login user when accessing a table from a coprocessor. That appears to fix the problem for all coprocessors; see the sketch below.
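
A minimal sketch of that workaround, assuming the HBase 1.x client API; the class and method names below are illustrative only and not the actual Tephra/CDAP code:
{code:java}
import java.io.IOException;
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.security.UserGroupInformation;

// Illustrative helper, not actual Tephra/CDAP code.
final class LoginUserTableReader {

  // Reads one row as the login (service) user rather than whatever user the
  // coprocessor call is currently executing as.
  static Result readAsLoginUser(final Configuration conf, final TableName tableName,
                                final Get get) throws IOException, InterruptedException {
    return UserGroupInformation.getLoginUser().doAs(
        new PrivilegedExceptionAction<Result>() {
          @Override
          public Result run() throws IOException {
            // Standalone client connection, as the coprocessor effectively uses today.
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(tableName)) {
              return table.get(get);
            }
          }
        });
  }
}
{code}
UserGroupInformation.getLoginUser() returns the principal the region server process itself logged in with, so the embedded client should be able to authenticate to ZooKeeper instead of failing with AuthFailed.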

Or is there an even better way to access a table from a coprocessor than using an HBase client? Is it possible via the coprocessor environment?
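
For reference, the HBase 1.x API does expose CoprocessorEnvironment#getTable(TableName), which hands out a table without the coprocessor opening its own standalone client connection; whether that path avoids the ZK auth failure here is exactly the open question. A minimal sketch (class name illustrative):
{code:java}
import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;

// Illustrative helper, not actual Tephra/CDAP code.
final class EnvTableReader {

  // Reads one row through the table handle provided by the coprocessor
  // environment instead of a separately created HBase client connection.
  static Result readViaEnvironment(RegionCoprocessorEnvironment env, TableName tableName,
                                   Get get) throws IOException {
    try (Table table = env.getTable(tableName)) {
      return table.get(get);
    }
  }
}
{code}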




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
