stack created HBASE-20137:
-----------------------------

             Summary: TestRSGroups is flakey
                 Key: HBASE-20137
                 URL: https://issues.apache.org/jira/browse/HBASE-20137
             Project: HBase
          Issue Type: Bug
          Components: flakey
    Affects Versions: 2.0.0-beta-2
            Reporter: stack
            Assignee: stack


It was the single test that failed the hbase-2 nightlies in #440 at the hadoop2 
stage.

The failure manifests as a timeout. It has an interesting cause that calls 
into question some of the clauses in UnassignProcedure#remoteCallFailed.

We are running a disable table concurrently with a shutdown. pid=309 is the 
disable. pid=311 is the interesting one. The log below is a little hard to read 
(the exception 'message' is the current procedure as a String; hard to parse, 
fixing) but we are trying to unassign as part of the disable table. Our RPC 
fails because the server we are trying to RPC to is currently being processed 
as crashed (pid=308 is a ServerCrashProcedure for this server). As part of the 
processing of the failed RPC we expire the server: if we can't RPC to it, it 
must be gone. The current procedure is then suspended until it gets woken up by 
the ServerCrashProcedure triggered by the expire... only in this case we are 
shutting down, so the expire is ignored. The current procedure is left in its 
suspended state. This prevents the Master from going down, so we time out.

2018-03-05 11:29:22,507 INFO  [PEWorker-13] 
assignment.RegionTransitionProcedure(213): Dispatch pid=311, ppid=309, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
location=1cfd208ff882,40584,1520249102524
2018-03-05 11:29:22,508 WARN  [PEWorker-13] 
assignment.RegionTransitionProcedure(187): Remote call failed pid=311, 
ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
location=1cfd208ff882,40584,1520249102524; exception=pid=311, ppid=309, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
2018-03-05 11:29:22,508 WARN  [PEWorker-13] assignment.UnassignProcedure(276): 
Expiring server pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
UnassignProcedure table=Group_ns:testKillRS, 
region=de7534c208a06502537cd95c248b3043, 
server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
location=1cfd208ff882,40584,1520249102524, 
exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException:
 pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
UnassignProcedure table=Group_ns:testKillRS, 
region=de7534c208a06502537cd95c248b3043, 
server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
2018-03-05 11:29:22,508 WARN  [PEWorker-13] master.ServerManager(580): 
Expiration of 1cfd208ff882,40584,1520249102524 but server shutdown already in 
progress

I need to cater for the case where the server expire is rejected.
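
A rough sketch of the shape such a fix could take, not the actual patch. 
Everything here is hypothetical: the class name, the Outcome enum, and the 
assumption that the expire call reports back whether it was accepted. The only 
point is that the procedure should suspend only when a crash procedure will 
actually be around to wake it; what to do instead when the expire is rejected 
(retry, fail, proceed) is left open.

```java
import java.util.function.Predicate;

public class ExpireRejectedSketch {

    /** What the failed-RPC handler decides to do with the current procedure. */
    public enum Outcome {
        SUSPEND_UNTIL_WOKEN,   // expire accepted; a crash procedure will wake us
        DO_NOT_SUSPEND         // expire rejected; nobody would ever wake us
    }

    /**
     * Hypothetical handler for a failed remote call. 'expire' stands in for
     * the server-expiration call and is assumed to return true only when the
     * expire was accepted.
     */
    public static Outcome onRemoteCallFailed(Predicate<String> expire, String serverName) {
        boolean accepted = expire.test(serverName);
        // Suspend only if something has been scheduled that will resume us.
        return accepted ? Outcome.SUSPEND_UNTIL_WOKEN : Outcome.DO_NOT_SUSPEND;
    }

    public static void main(String[] args) {
        String server = "1cfd208ff882,40584,1520249102524";
        // Normal case: the expire is accepted, so suspending is safe.
        System.out.println(onRemoteCallFailed(s -> true, server));
        // Case from this issue: the expire is rejected, so we must not suspend.
        System.out.println(onRemoteCallFailed(s -> false, server));
    }
}
```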





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
