[ https://issues.apache.org/jira/browse/HBASE-7670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ted Yu updated HBASE-7670:
--------------------------

    Status: Patch Available  (was: Open)

> Synchronized operation in CatalogTracker would block handling ZK Event for long time
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-7670
>                 URL: https://issues.apache.org/jira/browse/HBASE-7670
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.4
>            Reporter: chunhui shen
>            Assignee: chunhui shen
>            Priority: Critical
>             Fix For: 0.96.0
>
>         Attachments: HBASE-7670.patch
>
>
> In our testing, we found that the master did not handle ZK events for a long time.
> It seems that the ZK event-handling thread was blocked.
> Some logs from the master:
> {code}
> 2013-01-16 22:18:55,667 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED,
> 2013-01-16 22:18:56,270 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED,
> ...
> 2013-01-16 23:55:33,259 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Retrying
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=100, exceptions:
>         at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:183)
>         at org.apache.hadoop.hbase.client.HTable.get(HTable.java:676)
>         at org.apache.hadoop.hbase.catalog.MetaReader.get(MetaReader.java:247)
>         at org.apache.hadoop.hbase.catalog.MetaReader.getRegion(MetaReader.java:349)
>         at org.apache.hadoop.hbase.catalog.MetaReader.readRegionLocation(MetaReader.java:289)
>         at org.apache.hadoop.hbase.catalog.MetaReader.getMetaRegionLocation(MetaReader.java:276)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:424)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:489)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:451)
>         at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:289)
>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> 2013-01-16 23:55:33,261 WARN org.apache.hadoop.hbase.master.AssignmentManager: Attempted to handle region transition for server but server is not online
> {code}
> Between 2013-01-16 22:18:56 and 2013-01-16 23:55:33, there are no logs at all about handling ZK events.
> {code}
> this.metaNodeTracker = new MetaNodeTracker(zookeeper, throwableAborter) {
>   public void nodeDeleted(String path) {
>     if (!path.equals(node)) return;
>     ct.resetMetaLocation();
>   }
> };
>
> public void resetMetaLocation() {
>   LOG.debug("Current cached META location, " + metaLocation +
>     ", is not valid, resetting");
>   synchronized(this.metaAvailable) {
>     this.metaAvailable.set(false);
>     this.metaAvailable.notifyAll();
>   }
> }
>
> private AdminProtocol getMetaServerConnection() {
>   synchronized (metaAvailable) {
>     ...
>     ServerName newLocation = MetaReader.getMetaRegionLocation(this);
>     ...
>   }
> }
> {code}
> From the code above, we can see that nodeDeleted() waits to enter synchronized (metaAvailable)
> until MetaReader.getMetaRegionLocation(this) completes. However, getMetaRegionLocation() can keep
> retrying for a long time (in the log above, until RetriesExhaustedException after attempts=100),
> and the ZK event thread stays blocked for that whole period.
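The locking problem described above can be reduced to a small, self-contained model. This is a sketch under stated assumptions: `BlockingTracker`, `NonBlockingTracker`, `getMetaLocation`, `isMetaAvailable` and the `Supplier`-based lookup are hypothetical names, not HBase APIs, and the non-blocking variant is only one possible mitigation, not necessarily what HBASE-7670.patch does. In `BlockingTracker` the slow lookup (standing in for `MetaReader.getMetaRegionLocation()` with its retries) runs while holding the `metaAvailable` monitor, so `resetMetaLocation()` (called from `nodeDeleted()` on the ZK event thread) cannot proceed until the lookup returns. `NonBlockingTracker` performs the lookup outside the monitor and uses a generation counter to discard a result that raced with a reset.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical, simplified model of the CatalogTracker locking pattern quoted above.
// The class and method names are illustrative; they are not HBase APIs.
class BlockingTracker {
  private final AtomicBoolean metaAvailable = new AtomicBoolean(false);
  private String metaLocation; // guarded by metaAvailable

  // Mirrors getMetaServerConnection(): the slow lookup runs INSIDE the monitor,
  // so the monitor stays held for as long as the lookup keeps retrying.
  String getMetaLocation(Supplier<String> slowLookup) {
    synchronized (metaAvailable) {
      metaLocation = slowLookup.get();
      metaAvailable.set(true);
      metaAvailable.notifyAll();
      return metaLocation;
    }
  }

  // Mirrors resetMetaLocation(), called from the ZK nodeDeleted() callback.
  // It cannot enter the monitor until getMetaLocation() returns.
  void resetMetaLocation() {
    synchronized (metaAvailable) {
      metaAvailable.set(false);
      metaAvailable.notifyAll();
    }
  }

  boolean isMetaAvailable() {
    return metaAvailable.get();
  }
}

// One possible mitigation (an assumption, not necessarily what HBASE-7670.patch does):
// run the slow lookup OUTSIDE the monitor, and use a generation counter so that
// a result obtained before a concurrent reset is discarded instead of published.
class NonBlockingTracker {
  private final AtomicBoolean metaAvailable = new AtomicBoolean(false);
  private String metaLocation; // guarded by metaAvailable
  private long generation;     // guarded by metaAvailable; bumped on every reset

  // Returns the location, or null if a reset happened during the lookup
  // (the caller is expected to retry in that case).
  String getMetaLocation(Supplier<String> slowLookup) {
    long startGeneration;
    synchronized (metaAvailable) {
      startGeneration = generation;
    }
    String location = slowLookup.get(); // no monitor held while retrying
    synchronized (metaAvailable) {
      if (startGeneration != generation) {
        return null; // stale: a reset happened while the lookup was running
      }
      metaLocation = location;
      metaAvailable.set(true);
      metaAvailable.notifyAll();
      return location;
    }
  }

  // The monitor is only ever held for short critical sections, so the
  // ZK event thread is not stalled by an in-flight lookup.
  void resetMetaLocation() {
    synchronized (metaAvailable) {
      generation++;
      metaAvailable.set(false);
      metaAvailable.notifyAll();
    }
  }

  boolean isMetaAvailable() {
    return metaAvailable.get();
  }
}
```

Holding the monitor only for the brief read and publish steps keeps the ZK event thread responsive. The generation check matters because, once the lookup no longer holds the lock, a lookup that started before the META node was deleted could otherwise re-publish a location that the reset had just invalidated.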