Improve bootstrapping and configuration to be able to isolate and repair or 
evict failing nodes on EC2
------------------------------------------------------------------------------------------------------

                 Key: WHIRR-167
                 URL: https://issues.apache.org/jira/browse/WHIRR-167
             Project: Whirr
          Issue Type: Improvement
         Environment: Amazon EC2
            Reporter: Tibor Kiss
            Assignee: Tibor Kiss


The cluster startup process on Amazon EC2 instances is currently very 
unstable. As the number of nodes to be started increases, the startup process 
fails more often, but sometimes even a 2-3 node startup fails. We don't know 
how many instance startups are in progress on the Amazon side at the same time 
when a startup fails or when it succeeds. The only thing I see is that when I 
start around 10 nodes, the rate of failing nodes is higher than with a smaller 
number of nodes, and it is not directly proportional to the number of nodes; 
the probability that some nodes fail looks exponentially higher.

Looking into BootstrapClusterAction.java, there is a note "// TODO: Check for 
RunNodesException and don't bail out if only a few " which indicates that the 
current startup process is unreliable. So we should improve it.

We could add a "max percent failure" property (per instance template), so that 
if the number of failures exceeds this value the whole cluster fails to launch 
and is shut down. For the master node the value would be 100%, but for 
datanodes it would be more like 75%. (Tom White also mentioned this in an 
email.)
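
To make the proposal concrete, here is a minimal sketch of how such a check 
could look. It is not Whirr code: the class MaxPercentFailureSketch, the 
TemplateResult type and both method names are hypothetical. It also assumes 
that the 100% / 75% figures above describe the share of nodes that must come 
up, which corresponds to failure limits of 0% for the master and 25% for the 
datanodes; that reading is my assumption and is open for discussion.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a per-template failure limit; none of these names exist in Whirr. */
final class MaxPercentFailureSketch {

    /** Outcome of starting one instance template, for example "master" or "datanode". */
    static final class TemplateResult {
        final int requested;
        final int failed;

        TemplateResult(int requested, int failed) {
            if (requested <= 0 || failed < 0 || failed > requested) {
                throw new IllegalArgumentException("invalid node counts");
            }
            this.requested = requested;
            this.failed = failed;
        }
    }

    /** True when the share of failed nodes is strictly above the allowed percentage. */
    static boolean exceedsLimit(TemplateResult result, int maxPercentFailure) {
        // Compare failed/requested with limit/100 using integers to avoid rounding.
        return result.failed * 100 > maxPercentFailure * result.requested;
    }

    /** True when any template exceeds its limit, meaning the whole cluster should be shut down. */
    static boolean shouldAbortCluster(Map<String, TemplateResult> results,
                                      Map<String, Integer> limits) {
        for (Map.Entry<String, TemplateResult> entry : results.entrySet()) {
            // Assumption: a template without a configured limit tolerates no failures.
            int limit = limits.getOrDefault(entry.getKey(), 0);
            if (exceedsLimit(entry.getValue(), limit)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Integer> limits = new LinkedHashMap<>();
        limits.put("master", 0);     // assumed: the master must always start
        limits.put("datanode", 25);  // assumed: up to a quarter of datanodes may fail

        Map<String, TemplateResult> results = new LinkedHashMap<>();
        results.put("master", new TemplateResult(1, 0));
        results.put("datanode", new TemplateResult(10, 2));

        System.out.println("abort cluster: " + shouldAbortCluster(results, limits));
    }
}
```

With these example numbers, 2 failed datanodes out of 10 is 20%, which stays 
under the 25% limit, so the cluster would be kept and only the failed nodes 
would need to be repaired or evicted. A third failed datanode, or a failed 
master, would push a template over its limit and shut the cluster down.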

Let's discuss whether there are any other requirements for this improvement.

