[ 
https://issues.apache.org/jira/browse/HBASE-14370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740081#comment-14740081
 ] 

Enis Soztutar commented on HBASE-14370:
---------------------------------------

bq. Mainly wanted to say we don't need more threads but you fellas seem to be 
trying hard to avoid long-running thread that does nothing 99.999999999999999% 
of the time so that is good.
The original motivation for this patch was that HBASE-12635 had left a 
dynamic cluster with lots of regions with 60K ACL definitions. The zk watcher 
thread would spend 3+ minutes just refreshing the ACLs. Even with HBASE-12635 
fixed, I think we should follow the practice of forking a thread to process the 
zk notifications. I did not do the perf analysis, but we have a cluster with 
2000 tables, which may put refreshNodes() in the multi-second range. 
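
A minimal sketch of the practice of forking a thread to process the zk 
notifications, for illustration only. It is not the attached patch: the class 
AsyncRefreshSketch, its methods, and the counter that stands in for the real 
ACL refresh are all hypothetical. The callback that runs on the ZooKeeper 
event thread only queues the work, and a single worker thread does the slow 
part.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a watcher whose callbacks run on the zk event thread. */
class AsyncRefreshSketch {
    // One worker thread, so refreshes run in the order the notifications arrived.
    private final ExecutorService refresher = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "acl-refresher");
        t.setDaemon(true);
        return t;
    });

    // Stand-in for the auth cache: counts how many ACL entries were processed.
    private final AtomicInteger refreshed = new AtomicInteger();

    /** Called on the event thread: only queues the work and returns at once. */
    public void nodeChildrenChanged(int aclCount) {
        refresher.execute(() -> refreshNodes(aclCount));
    }

    /** The expensive part; runs on the worker thread, never on the event thread. */
    private void refreshNodes(int aclCount) {
        for (int i = 0; i < aclCount; i++) {
            refreshed.incrementAndGet();
        }
    }

    public int refreshedCount() {
        return refreshed.get();
    }

    /** Stops accepting work and waits for queued refreshes; true if all finished. */
    public boolean shutdownAndWait(long timeoutMillis) {
        refresher.shutdown();
        try {
            return refresher.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        AsyncRefreshSketch watcher = new AsyncRefreshSketch();
        watcher.nodeChildrenChanged(60_000); // returns immediately
        watcher.shutdownAndWait(5_000);
        System.out.println("refreshed entries: " + watcher.refreshedCount());
    }
}
```

With this shape, other notifications delivered on the event thread are not 
stuck behind a long refresh; the cost is that the ACL view is updated a little 
later than the notification arrives.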

The ref counting is unfortunate, but there is no easy way to tie an executor 
to a TableAuthManager, because TableAuthManager itself is a static cache. We 
could have added the executor as one of the core RS threads, but that also 
seems a bit hacky. If there are suggestions there, I can try them out. 
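
To make the ref counting concrete, here is a rough sketch of one executor 
shared by every user of a static cache. It is an assumption about the general 
shape, not code from any version of the patch: SharedExecutorSketch and its 
acquire(), release() and holders() methods are hypothetical names.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical holder for one executor shared by all users of a static cache. */
final class SharedExecutorSketch {
    // Both fields are guarded by the class lock (static synchronized methods).
    private static ExecutorService executor;
    private static int refCount;

    private SharedExecutorSketch() {
    }

    /** Creates the executor on the first acquire; later callers share it. */
    static synchronized ExecutorService acquire() {
        if (refCount++ == 0) {
            executor = Executors.newSingleThreadExecutor();
        }
        return executor;
    }

    /** Shuts the executor down once the last holder has released it. */
    static synchronized void release() {
        if (refCount == 0) {
            throw new IllegalStateException("release() without matching acquire()");
        }
        if (--refCount == 0) {
            executor.shutdown();
            executor = null;
        }
    }

    /** Number of holders currently sharing the executor. */
    static synchronized int holders() {
        return refCount;
    }
}
```

Each user would call acquire() when it starts and release() when it stops, so 
the thread lives only while something still needs it.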

Coming back to the patch: Ted, I think I got the motivation for the 
preemption. The v10 patch looks fine to me. 


> Use separate thread for calling ZKPermissionWatcher#refreshNodes()
> ------------------------------------------------------------------
>
>                 Key: HBASE-14370
>                 URL: https://issues.apache.org/jira/browse/HBASE-14370
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.0
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>         Attachments: 14370-v1.txt, 14370-v10.txt, 14370-v3.txt, 14370-v5.txt, 
> 14370-v7.txt, 14370-v8.txt, 14370-wait-nofity-v2.txt, 14370-wait-nofity.txt, 
> hbase-14370_v4.patch
>
>
> I came off a support case (0.98.0) where the main zk thread was seen doing 
> the following:
> {code}
>   at org.apache.hadoop.hbase.security.access.ZKPermissionWatcher.refreshAuthManager(ZKPermissionWatcher.java:152)
>   at org.apache.hadoop.hbase.security.access.ZKPermissionWatcher.refreshNodes(ZKPermissionWatcher.java:135)
>   at org.apache.hadoop.hbase.security.access.ZKPermissionWatcher.nodeChildrenChanged(ZKPermissionWatcher.java:121)
>   at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
>   at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
> {code}
> There were 62000 nodes under /acl because the fix from HBASE-12635 was 
> missing, which slowed down table creation: the zk notification for region 
> offline was blocked behind the refresh shown above.
> The attached patch moves the refreshNodes() call into its own thread.
> Thanks to Enis and Devaraj for offline discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
