[jira] Commented: (WHIRR-167) Improve bootstrapping and configuration to be able to isolate and repair or evict failing nodes on EC2

Adrian Cole (JIRA) Sun, 09 Jan 2011 15:15:08 -0800

    [ 
https://issues.apache.org/jira/browse/WHIRR-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979430#action_12979430
 ]


Adrian Cole commented on WHIRR-167:
-----------------------------------

I think the problem in your stacktrace probably applies to jclouds master.  We 
have a place where we can test this.  Do you mind filing a bug? 
http://code.google.com/p/jclouds/issues/entry
You may also want to watch this issue, which is closely related: 
http://code.google.com/p/jclouds/issues/detail?id=365

wrt the keys: I wouldn't worry about them.  The keys are only used to install 
your keypair, so deleting them doesn't matter.

wrt stub testing: it isn't very easy to stub something that requires a concert 
of scripts.  That said, here's what we use: 
https://github.com/jclouds/jclouds/blob/master/compute/src/test/java/org/jclouds/compute/StubComputeServiceIntegrationTest.java

I hope this helps!

> Improve bootstrapping and configuration to be able to isolate and repair or 
> evict failing nodes on EC2
> ------------------------------------------------------------------------------------------------------
>
>                 Key: WHIRR-167
>                 URL: https://issues.apache.org/jira/browse/WHIRR-167
>             Project: Whirr
>          Issue Type: Improvement
>         Environment: Amazon EC2
>            Reporter: Tibor Kiss
>            Assignee: Tibor Kiss
>         Attachments: whirr-167-1.patch, whirr.log
>
>
> The cluster startup process on Amazon EC2 instances is very unstable. As the 
> number of nodes to start increases, startup fails more often, but sometimes 
> even a 2-3 node startup fails. We don't know how many instance startups are 
> in progress on the Amazon side when a startup fails or succeeds. 
> The only thing I see is that when I start around 10 nodes, the failure rate 
> is higher than with a smaller number of nodes, and it is not directly 
> proportional to the number of nodes; the probability that some nodes fail 
> appears to grow exponentially.
> Looking into BootstrapClusterAction.java, there is a note "// TODO: Check for 
> RunNodesException and don't bail out if only a few " which indicates that the 
> current startup process is unreliable. So we should improve it.
> We could add a "max percent failure" property (per instance template), so 
> that if the number of failures exceeds this value the whole cluster fails to 
> launch and is shut down. For the master node the value would be 100%, but for 
> datanodes it would be more like 75%. (Tom White also mentioned this in an 
> email.)
> Let's discuss whether there are any other requirements for this improvement.
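
To make the quoted per-template threshold proposal concrete, here is a minimal 
sketch in Java. Everything in it is hypothetical: the class name, the method 
templateSucceeded, and the parameter minPercentSuccess are illustrations and 
not part of Whirr or jclouds. It also assumes that the quoted percentages (100% 
for the master, 75% for datanodes) mean the share of requested nodes that must 
start successfully, which is only one possible reading of the proposal.

```java
// Hypothetical sketch; none of these names exist in Whirr or jclouds.
final class StartupThresholdSketch {

    /**
     * Returns true if enough nodes of one instance template started.
     *
     * @param requested         number of nodes requested for the template
     * @param started           number of nodes that started successfully
     * @param minPercentSuccess assumed meaning of the threshold: the minimum
     *                          percentage of requested nodes that must start
     */
    static boolean templateSucceeded(int requested, int started, int minPercentSuccess) {
        if (requested <= 0) {
            return true; // nothing was requested, so nothing can fail
        }
        // Compare started/requested >= minPercentSuccess/100 using integers only,
        // which avoids floating-point rounding at the boundary.
        return started * 100 >= requested * minPercentSuccess;
    }

    public static void main(String[] args) {
        // Master template: 1 node requested, threshold 100%.
        boolean masterOk = templateSucceeded(1, 1, 100);
        // Datanode template: 10 nodes requested, 8 started, threshold 75%.
        boolean workersOk = templateSucceeded(10, 8, 75);

        // The cluster launch is accepted only if every template meets its threshold;
        // otherwise the whole cluster would be shut down.
        System.out.println("cluster launch accepted: " + (masterOk && workersOk));
    }
}
```

With this reading, a launch would be abandoned as soon as one template falls 
below its threshold, while a few failed datanodes would be tolerated.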

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
