[ https://issues.apache.org/jira/browse/HBASE-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13793492#comment-13793492 ]
stack commented on HBASE-9750:
------------------------------

Here is the error we saw that the edit to hbase-daemon.sh is supposed to address:

{code}
2013-10-11 13:46:28,240 INFO [Thread-6] hbase.HBaseCluster: Starting RS on: a1806.halxg.cloudera.com
2013-10-11 13:46:28,240 INFO [Thread-6] hbase.ClusterManager: Executing remote command: /opt/hbase/current/bin/../bin/hbase-daemon.sh start regionserver , hostname:a1806.halxg.cloudera.com
2013-10-11 13:46:28,240 INFO [Thread-6] util.Shell: Executing full command [/usr/bin/ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no a1806.halxg.cloudera.com "/opt/hbase/current/bin/../bin/hbase-daemon.sh start regionserver"]
2013-10-11 13:46:30,154 WARN [Thread-6] policies.Policy: Exception occured during performing action: org.apache.hadoop.util.Shell$ExitCodeException: head: cannot open `/opt/hbase/current/bin/../logs/hbase-hbase-regionserver-a1806.halxg.cloudera.com.out' for reading: No such file or directory
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:458)
    at org.apache.hadoop.util.Shell.run(Shell.java:373)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:578)
    at org.apache.hadoop.hbase.HBaseClusterManager$RemoteShell.execute(HBaseClusterManager.java:111)
    at org.apache.hadoop.hbase.HBaseClusterManager.exec(HBaseClusterManager.java:187)
    at org.apache.hadoop.hbase.HBaseClusterManager.exec(HBaseClusterManager.java:196)
    at org.apache.hadoop.hbase.HBaseClusterManager.start(HBaseClusterManager.java:201)
    at org.apache.hadoop.hbase.DistributedHBaseCluster.startRegionServer(DistributedHBaseCluster.java:104)
    at org.apache.hadoop.hbase.chaos.actions.BatchRestartRsAction.perform(BatchRestartRsAction.java:60)
    at org.apache.hadoop.hbase.chaos.policies.PeriodicRandomActionPolicy.runOneIteration(PeriodicRandomActionPolicy.java:59)
    at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
    at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
    at java.lang.Thread.run(Thread.java:724)
{code}

> Add retries around Action server stop/start
> -------------------------------------------
>
>                 Key: HBASE-9750
>                 URL: https://issues.apache.org/jira/browse/HBASE-9750
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: Enis Soztutar
>
> These can fail on occasion (my upping ConnectionTimeout is not enough). Let's
> just retry a few times rather than fail, at least for server start.
> Losing a server makes tests run for longer, and there is also the danger we
> could lose all servers, in which case the long-running test would fail outright.
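
For illustration only, the "retry a few times" idea from the issue description could look roughly like the sketch below. This is not the patch attached to HBASE-9750; the class name `RetrySketch`, the method `retry`, and the attempt count and sleep interval are all hypothetical, and the remote start command is simulated by a lambda that fails twice before succeeding.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetrySketch {

    /**
     * Hypothetical helper: runs the action up to maxAttempts times,
     * sleeping sleepMillis between failed attempts. If every attempt
     * fails, the exception from the last attempt is rethrown.
     */
    public static <T> T retry(int maxAttempts, long sleepMillis, Callable<T> action)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                // Only pause if another attempt is still to come.
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Stand-in for a remote "start regionserver" command that
        // fails on the first two tries (assumption for the demo).
        String result = retry(3, 10, () -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated ssh failure");
            }
            return "started";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "started after 3 attempts"
    }
}
```

A bounded attempt count keeps a chaos action from hanging forever on a host that is genuinely down, while still riding out the kind of transient failure shown in the log above.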