[
https://issues.apache.org/jira/browse/WHIRR-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146545#comment-13146545
]
David Alves edited comment on WHIRR-414 at 11/8/11 8:45 PM:
------------------------------------------------------------
I'm not saying that killing all machines shouldn't be a possible behavior, or
even the default.
I'm just saying that it should be configurable. I can easily see cases where
not killing all machines would be advantageous (transient provider errors,
testing, development). For instance, in testing/development/debugging you might
want to log into the machines to see what went wrong. If you have idempotent
bootstrap/configure, you might be able to add machines without wasting those
that started successfully. And if the machines failed in the config phase, you
might decide to use them for some other purpose (since you are paying for
them).
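
To make the idea of a configurable failure behavior concrete, here is a minimal
sketch in Python. It is not Whirr code: the policy names, the Node type, and
the nodes_to_terminate helper are all hypothetical, and it only illustrates how
a single setting could select between the cases described above.

```python
from dataclasses import dataclass
from enum import Enum


class FailurePolicy(Enum):
    """Hypothetical options for handling nodes after a failed cluster launch."""
    TERMINATE_ALL = "terminate-all"        # destroy every node, healthy or not
    TERMINATE_FAILED = "terminate-failed"  # destroy only the nodes that failed
    KEEP_ALL = "keep-all"                  # leave everything up, e.g. for debugging


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a launched machine."""
    node_id: str
    failed: bool


def nodes_to_terminate(nodes, policy=FailurePolicy.TERMINATE_ALL):
    """Return the nodes that should be destroyed under the given policy."""
    if policy is FailurePolicy.TERMINATE_ALL:
        return list(nodes)
    if policy is FailurePolicy.TERMINATE_FAILED:
        return [n for n in nodes if n.failed]
    return []  # KEEP_ALL: nothing is destroyed


# Example cluster: one healthy node and one that failed during launch.
cluster = [Node("i-1", failed=False), Node("i-2", failed=True)]
```

With TERMINATE_ALL as the default, a failed launch leaves no orphaned nodes
unless the user explicitly opts into one of the other policies.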
> whirr can have a non-zero return code and unterminated (orphaned) host
> instances
> --------------------------------------------------------------------------------
>
> Key: WHIRR-414
> URL: https://issues.apache.org/jira/browse/WHIRR-414
> Project: Whirr
> Issue Type: Bug
> Components: core
> Affects Versions: 0.6.0
> Environment: EC2, commandline whirr
> Reporter: Paul Baclace
> Assignee: Andrei Savu
> Priority: Critical
> Fix For: 0.7.0
>
> Attachments: WHIRR-414.patch
>
>
> Whirr can fail to completely start a cluster and indicates this with a
> non-zero return code. In many (currently intermittent) partial failure
> scenarios, there are resources still active (EC2 machine instances, in my
> experience) that are not cleaned up.
> The log contains "IOException: Too many instance failed while bootstrapping!"
> when I have seen orphaned nodes.
> A non-zero return code should guarantee that all resources are cleaned up.
> Without this post-condition, these failures require manual inspection and
> cleanup to stop useless expenses (which is why I marked this bug critical; it
> needs to be addressed for any kind of cron-triggered Whirr job).