[ 
https://issues.apache.org/jira/browse/HBASE-9743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792070#comment-13792070
 ] 

Enis Soztutar commented on HBASE-9743:
--------------------------------------

This looks good. I could not understand the comment: 
// The start may have succeeded but don't add to startRs until we are sure it did... let
// the call go around again.
We do in fact remove the server from the deadServers list, so it won't be 
retried. However, that should be fine: if there is a legitimate server 
failure, CM should continue without blocking on this action. 

I'll add a general retry on server start and stop actions as well; see my last 
comment on HBASE-9563. 
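
For illustration, a general retry of this kind could look roughly like the 
sketch below. The class name RetryingAction, the retry method and its 
parameters are hypothetical; none of this is taken from the HBase code base or 
from a patch on HBASE-9563. It only shows the idea of re-running a start or 
stop command that failed (for example on an ssh timeout) a bounded number of 
times before giving up.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of HBase: retries an action a bounded number of times.
public class RetryingAction {

    /**
     * Calls the action until it succeeds or maxAttempts is reached.
     * Sleeps sleepMillis between failed attempts and rethrows the last
     * failure if every attempt fails.
     */
    public static <T> T retry(Callable<T> action, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry on interruption; restore the flag and propagate.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw last;
    }
}
```

A caller would wrap the kill or start call in a Callable and pass it to 
retry(...) with a small attempt count, so that a single transient timeout does 
not abort the whole action.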

> RollingBatchRestartRsAction aborts if timeout
> ---------------------------------------------
>
>                 Key: HBASE-9743
>                 URL: https://issues.apache.org/jira/browse/HBASE-9743
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>            Reporter: stack
>            Assignee: Bradford Stephens
>             Fix For: 0.96.0
>
>         Attachments: 9743.txt
>
>
> In our test rigs, we see the following quite frequently:
> {code}
> 2013-10-10 05:04:09,367 INFO  [Thread-6] actions.Action: Killing region 
> server:a1809.halxg.cloudera.com,60020,1381404629253
> 2013-10-10 05:04:09,367 INFO  [Thread-6] hbase.HBaseCluster: Aborting RS: 
> a1809.halxg.cloudera.com,60020,1381404629253
> 2013-10-10 05:04:09,367 INFO  [Thread-6] hbase.ClusterManager: Executing 
> remote command: ps aux | grep proc_regionserver | grep -v grep | tr -s ' ' | 
> cut -d ' ' -f2 | xargs kill -s SIGKILL , hostname:a1809.halxg.cloudera.com
> 2013-10-10 05:04:09,367 INFO  [Thread-6] util.Shell: Executing full command 
> [/usr/bin/ssh -o ConnectTimeout=1 -o StrictHostKeyChecking=no 
> a1809.halxg.cloudera.com "ps aux | grep proc_regionserver | grep -v grep | tr 
> -s ' ' | cut -d ' ' -f2 | xargs kill -s SIGKILL"]
> 2013-10-10 05:04:09,621 DEBUG [Thread-5] client.HBaseAdmin: Getting current 
> status of snapshot from master...
> 2013-10-10 05:04:09,623 DEBUG [Thread-5] client.HBaseAdmin: (#6) Sleeping: 
> 1714ms while waiting for snapshot completion.
> 2013-10-10 05:04:10,381 WARN  [Thread-6] policies.Policy: Exception occured 
> during performing action: org.apache.hadoop.util.Shell$ExitCodeException: 
> Connection timed out during banner exchange
>       at org.apache.hadoop.util.Shell.runCommand(Shell.java:458)
>       at org.apache.hadoop.util.Shell.run(Shell.java:373)
>       at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:578)
>       at 
> org.apache.hadoop.hbase.HBaseClusterManager$RemoteShell.execute(HBaseClusterManager.java:111)
>       at 
> org.apache.hadoop.hbase.HBaseClusterManager.exec(HBaseClusterManager.java:187)
>       at 
> org.apache.hadoop.hbase.HBaseClusterManager.signal(HBaseClusterManager.java:216)
>       at org.apache.hadoop.hbase.ClusterManager.kill(ClusterManager.java:97)
>       at 
> org.apache.hadoop.hbase.DistributedHBaseCluster.killRegionServer(DistributedHBaseCluster.java:110)
>       at org.apache.hadoop.hbase.chaos.actions.Action.killRs(Action.java:84)
>       at 
> org.apache.hadoop.hbase.chaos.actions.RollingBatchRestartRsAction.perform(RollingBatchRestartRsAction.java:60)
>       at 
> org.apache.hadoop.hbase.chaos.policies.PeriodicRandomActionPolicy.runOneIteration(PeriodicRandomActionPolicy.java:59)
>       at 
> org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
>       at 
> org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
>       at java.lang.Thread.run(Thread.java:724)
> ...
> {code}
> So, we went to kill a RS and we timed out.  The server was busy at the time.  
> We see the kill usually going through.
> When the above happens in a RollingBatchRestartRsAction, we'll usually 'lose' 
> a server for the rest of the test.  That is at a minimum.  We've also seen 
> cases where we kill nearly all servers in the cluster and then the above 
> timeout happens, and we are left with a test limping along, running very 
> slowly and eventually failing.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
