[ 
https://issues.apache.org/jira/browse/HBASE-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707816#comment-14707816
 ] 

stack commented on HBASE-13480:
-------------------------------

This is a particular problem in trunk where master serves hbase:meta. 
TestDistributedLogReplay has priority handlers pumped up to 40 which is kinda 
crazy. If I set the number way down, to 5 say, then the cluster locks up 
because the priority handlers are all occupied doing 
reportRegionStateTransition which wants to RPC back into the meta table... only 
the priority handlers are all occupied (with long timeouts as per [~elserj] 
above) so we can't progress.
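
To make the lock-up concrete, here is a minimal sketch of the same shape of problem using a plain java.util.concurrent pool. Nothing in it is HBase code: `HandlerStarvationSketch`, `readMeta` and `allCallsFinish` are made-up names, the fixed pool stands in for the priority handlers, and the 2 second wait stands in for the much longer RPC timeouts. Each outer call plays the role of reportRegionStateTransition and either reads "meta" inline (short-circuit) or submits the read back into the same pool (the nested RPC).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class HandlerStarvationSketch {

    // Hypothetical stand-in for a read of hbase:meta; not HBase code.
    static String readMeta() {
        return "meta-row";
    }

    // Returns true if every outer call finishes before the (short) timeout.
    static boolean allCallsFinish(int handlers, int calls, boolean shortCircuit)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(handlers);
        // Hold each outer call until as many as possible are running at once.
        CountDownLatch busy = new CountDownLatch(Math.min(handlers, calls));
        Callable<String> nested = HandlerStarvationSketch::readMeta;
        Callable<String> outerCall = () -> {
            busy.countDown();
            busy.await();
            if (shortCircuit) {
                return readMeta(); // runs on the handler we already hold
            }
            // Nested "RPC": needs a second free handler from the same pool.
            return pool.submit(nested).get();
        };
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (int i = 0; i < calls; i++) {
                pending.add(pool.submit(outerCall));
            }
            for (Future<String> f : pending) {
                try {
                    f.get(2, TimeUnit.SECONDS);
                } catch (TimeoutException e) {
                    return false; // handlers are all waiting on each other
                }
            }
            return true;
        } finally {
            pool.shutdownNow(); // interrupt any handlers that are still stuck
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("5 handlers, nested call:   " + allCallsFinish(5, 5, false));
        System.out.println("5 handlers, short-circuit: " + allCallsFinish(5, 5, true));
        System.out.println("40 handlers, nested call:  " + allCallsFinish(40, 5, false));
    }
}
```

With 5 handlers and 5 concurrent outer calls the nested variant never completes, while the short-circuit variant does. Pumping the pool up to 40 only hides the problem until enough outer calls arrive at once.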

We need to address the larger issue of cluster deadlock but the short-circuit 
fix here should help w/ current state of trunk at least.

Testing this patch, there is a big improvement in TestDistributedLogReplay. With 
the handler count at 5, there are just a few timeouts/retries in the logs, as 
opposed to logs filled with them, AND the test passes rather than hanging.

Reviewing the patch, the only problem I have is that both short-circuit and RPC 
connections are hosted inside a class named for short circuiting, which seems 
incorrect. Internally it can do the switch, but the hosting class that figures 
out whether to RPC or go short-circuit shouldn't be called short-circuit; it 
could even be an anonymous inner class if we have trouble coming up w/ a good 
name.

Good one [~elserj] and [~jingcheng...@intel.com]

> ShortCircuitConnection doesn't short-circuit all calls as expected
> ------------------------------------------------------------------
>
>                 Key: HBASE-13480
>                 URL: https://issues.apache.org/jira/browse/HBASE-13480
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 1.0.0, 2.0.0, 1.1.0
>            Reporter: Josh Elser
>            Assignee: Jingcheng Du
>             Fix For: 2.0.0, 1.3.0, 1.2.1, 1.0.3, 1.1.3
>
>         Attachments: HBASE-13480-1.patch, HBASE-13480.patch
>
>
> Noticed the following situation while debugging unexpected unit test failures 
> in HBASE-13351.
> {{ConnectionUtils#createShortCircuitHConnection(Connection, ServerName, 
> AdminService.BlockingInterface, ClientService.BlockingInterface)}} is 
> intended to avoid the extra RPC by calling the server's instantiation of the 
> protobuf rpc stub directly for the AdminService and ClientService.
> The problem is that this is insufficient to actually avoid extra "remote" 
> RPCs as all other calls to the Connection are routed to a "real" Connection 
> instance. As such, any object created by the "real" Connection (such as an 
> HTable) will use the real Connection, not the SSC.
> The end result is that 
> {{MasterRpcService#reportRegionStateTransition(RpcController, 
> ReportRegionStateTransitionRequest)}} will make additional "remote" RPCs over 
> what it thinks is an SSC through a {{Get}} on {{HTable}} which was 
> constructed using the SSC, but the {{Get}} itself will use the underlying 
> real Connection instead of the SSC. With insufficiently sized thread pools, 
> this has been observed to result in RPC deadlock in the HMaster where an RPC 
> attempts to make another RPC but there are no more threads available to 
> service the second RPC so the first RPC blocks indefinitely.
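
The leak described in the issue can be sketched with a few toy types. These are hypothetical stand-ins, not the HBase classes: `Conn`, `ClientStub`, `Table`, `LeakyWrapper` and `FixedWrapper` are invented for illustration, and `FixedWrapper` shows only one way a wrapper could keep derived objects on the local path, which is not necessarily what the attached patch does.

```java
final class ShortCircuitLeakSketch {

    // Hypothetical stand-in for the ClientService stub.
    interface ClientStub {
        String get(String row);
    }

    // Hypothetical stand-in for a Connection.
    interface Conn {
        ClientStub getClient();
        Table getTable(String name);
    }

    // A table issues its reads through whichever connection created it.
    static final class Table {
        private final Conn owner;
        Table(Conn owner) { this.owner = owner; }
        String get(String row) { return owner.getClient().get(row); }
    }

    // The "real" connection: every read is a remote RPC.
    static final class RealConn implements Conn {
        public ClientStub getClient() { return row -> "remote:" + row; }
        public Table getTable(String name) { return new Table(this); }
    }

    // Overrides the stub but delegates everything else to the real
    // connection, so tables it hands out are bound to the real connection.
    static final class LeakyWrapper implements Conn {
        private final Conn real;
        private final ClientStub local;
        LeakyWrapper(Conn real, ClientStub local) { this.real = real; this.local = local; }
        public ClientStub getClient() { return local; }
        public Table getTable(String name) { return real.getTable(name); }
    }

    // Binds tables to the wrapper itself, so their reads use the local stub.
    static final class FixedWrapper implements Conn {
        private final ClientStub local;
        FixedWrapper(ClientStub local) { this.local = local; }
        public ClientStub getClient() { return local; }
        public Table getTable(String name) { return new Table(this); }
    }

    public static void main(String[] args) {
        ClientStub local = row -> "local:" + row;
        Conn leaky = new LeakyWrapper(new RealConn(), local);
        Conn fixed = new FixedWrapper(local);
        System.out.println(leaky.getClient().get("r1"));            // direct stub call is local
        System.out.println(leaky.getTable("hbase:meta").get("r1")); // table read goes remote
        System.out.println(fixed.getTable("hbase:meta").get("r1")); // table read stays local
    }
}
```

Calling the stub on the leaky wrapper directly stays local, which is why the wrapper looks correct at first glance. Only reads made through an object the wrapper hands out fall back to the remote path.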



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
