[ 
https://issues.apache.org/jira/browse/HBASE-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13793492#comment-13793492
 ] 

stack commented on HBASE-9750:
------------------------------

Here is the error we saw that the edit to hbase-daemon.sh is supposed to 
address:

{code}
2013-10-11 13:46:28,240 INFO  [Thread-6] hbase.HBaseCluster: Starting RS on: a1806.halxg.cloudera.com
2013-10-11 13:46:28,240 INFO  [Thread-6] hbase.ClusterManager: Executing remote command: /opt/hbase/current/bin/../bin/hbase-daemon.sh  start regionserver , hostname:a1806.halxg.cloudera.com
2013-10-11 13:46:28,240 INFO  [Thread-6] util.Shell: Executing full command [/usr/bin/ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no a1806.halxg.cloudera.com "/opt/hbase/current/bin/../bin/hbase-daemon.sh  start regionserver"]
2013-10-11 13:46:30,154 WARN  [Thread-6] policies.Policy: Exception occured during performing action: org.apache.hadoop.util.Shell$ExitCodeException: head: cannot open `/opt/hbase/current/bin/../logs/hbase-hbase-regionserver-a1806.halxg.cloudera.com.out' for reading: No such file or directory

        at org.apache.hadoop.util.Shell.runCommand(Shell.java:458)
        at org.apache.hadoop.util.Shell.run(Shell.java:373)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:578)
        at org.apache.hadoop.hbase.HBaseClusterManager$RemoteShell.execute(HBaseClusterManager.java:111)
        at org.apache.hadoop.hbase.HBaseClusterManager.exec(HBaseClusterManager.java:187)
        at org.apache.hadoop.hbase.HBaseClusterManager.exec(HBaseClusterManager.java:196)
        at org.apache.hadoop.hbase.HBaseClusterManager.start(HBaseClusterManager.java:201)
        at org.apache.hadoop.hbase.DistributedHBaseCluster.startRegionServer(DistributedHBaseCluster.java:104)
        at org.apache.hadoop.hbase.chaos.actions.BatchRestartRsAction.perform(BatchRestartRsAction.java:60)
        at org.apache.hadoop.hbase.chaos.policies.PeriodicRandomActionPolicy.runOneIteration(PeriodicRandomActionPolicy.java:59)
        at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
        at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
        at java.lang.Thread.run(Thread.java:724)
{code}

> Add retries around Action server stop/start
> -------------------------------------------
>
>                 Key: HBASE-9750
>                 URL: https://issues.apache.org/jira/browse/HBASE-9750
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: Enis Soztutar
>
> These can fail on occasion (my upping ConnectionTimeout is not enough).  Let's 
> retry a few times rather than fail outright, at least for server start.  
> Losing a server makes tests run longer, and there is also the danger that we 
> could lose all servers, in which case the long-running test would fail outright.
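
To make the retry idea in the description concrete, here is a minimal sketch of a bounded retry wrapper. It is not the HBASE-9750 patch: the class name RetryingAction, the method callWithRetries, and the parameters maxAttempts and sleepMs are all hypothetical, chosen only for illustration. The sketch assumes the remote start/stop call signals failure by throwing an exception, as the Shell$ExitCodeException in the trace above does.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper, not part of HBase: retries an action a bounded number of times. */
public final class RetryingAction {
    private RetryingAction() {}

    /**
     * Calls the action up to maxAttempts times, sleeping sleepMs between failed
     * attempts. Returns the first successful result; if every attempt fails,
     * rethrows the exception from the last attempt.
     */
    public static <T> T callWithRetries(Callable<T> action, int maxAttempts, long sleepMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry on interruption; restore the flag and propagate.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMs);
                }
            }
        }
        // Reached only when all attempts failed, so last is non-null here.
        throw last;
    }
}
```

A caller would wrap the remote command in the Callable, for example the region server start that failed in the log above, and pick a small maxAttempts so that a persistent failure still surfaces instead of being retried forever.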



--
This message was sent by Atlassian JIRA
(v6.1#6144)
