[
https://issues.apache.org/jira/browse/WHIRR-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979413#action_12979413
]
Tibor Kiss edited comment on WHIRR-167 at 1/9/11 5:04 PM:
----------------------------------------------------------
I attached whirr-167-1.patch. I'm sure it is not the final one, but I would
like to hear your opinions too.
I changed ClusterSpec and InstanceTemplate so that a minimum percentage of
successfully started nodes can be specified.
If nothing is specified, it means 100%, so a value of
whirr.instance-templates=1 jt+nn,4 dn+tt%60
means that the "jt+nn" roles pass only when 100% of the nodes start
successfully, and the "dn+tt" roles pass when at least 60% of the nodes start
successfully.
If any of the roles doesn't meet its minimum requirement, a retry phase is
initiated in which the failed nodes for each role are replaced with new ones.
That means that even a namenode startup problem wouldn't result in a
completely lost cluster.
Without any retries, a failure in the namenode would break an entire cluster
even with many dn+tt nodes successfully started. I think it is worth
minimizing the chance of failing in this way, therefore I introduced a retry
cycle.
If there are failures only in dn+tt and the minimum limit is still met, the
cluster starts up with just that number of nodes, without any retry.
A retry cycle gives both roles a chance to increase their number of nodes up
to the maximum value.
At this moment I don't think more than one retry is worth it! The target is
just to compensate for a few sporadic service problems.
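Roughly, the single retry cycle described above has the following shape. This
is only a sketch working on plain counts, with made-up names (NodeStarter,
startWithOneRetry); it does not mirror the patch's actual code:

    import java.io.IOException;

    // Sketch of the one-retry policy described above.
    public class RetryPolicySketch {
      interface NodeStarter {
        int start(int count) throws IOException; // returns how many nodes actually came up
      }

      static int startWithOneRetry(int requested, int minRequired, NodeStarter starter)
          throws IOException {
        int running = starter.start(requested);           // first round
        if (running < minRequired) {
          running += starter.start(requested - running);  // single retry: replace the failures
        }
        if (running < minRequired) {
          throw new IOException("Minimum of " + minRequired + " nodes not reached");
        }
        return running;                                   // anywhere between minimum and requested
      }
    }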
My question would be: should we keep a retry by default in case of
insufficient nodes, or should the default be no retry, with an extra parameter
to enable it? Initially I don't like the idea of adding more parameters.
About failing nodes... There are 2 different cases (see the sketch after this
list):
1. If the minimum required number of nodes cannot be satisfied even by a retry
cycle, all of the lost nodes are left as they are. A full cluster destroy will
be able to remove them.
2. If the required number of nodes is satisfied from the first round or from a
retry, all the failed nodes (from the first round and from the retry cycle)
are destroyed automatically at the end of BootstrapClusterAction.doAction.
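The sketch mentioned above, with hypothetical names (finishBootstrap,
NodeDestroyer); only the two-case behaviour is taken from the description, the
code itself is not from the patch:

    import java.io.IOException;
    import java.util.Set;

    public class FailedNodeCleanupSketch {
      interface NodeDestroyer {
        void destroyNode(String id);
      }

      static void finishBootstrap(boolean minimumSatisfied, Set<String> failedNodeIds,
          NodeDestroyer destroyer) throws IOException {
        if (!minimumSatisfied) {
          // case 1: the cluster is not viable; lost nodes are left behind and only a
          // full cluster destroy removes them
          throw new IOException("Minimum number of nodes could not be started");
        }
        // case 2: the cluster is viable; remove every node that failed in either round
        for (String id : failedNodeIds) {
          destroyer.destroyNode(id);
        }
      }
    }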
I experienced some difficulties in destroying the nodes. Initially I used the
destroyNodesMatching(Predicate<NodeMetadata> filter) method, which terminates
all the enumerated nodes in parallel. But this method also tries to delete the
security group and placement group. So I had to fall back to the simple
destroyNode(String id), which deletes the nodes sequentially, and I cannot
control the KeyPair deletion. In my opinion the jclouds library is missing
some convenient methods to revoke a subset of nodes without optionally
propagating the KeyPair, SecurityGroup and PlacementGroup cleanup. Effectively
I got stuck here, and I feel I couldn't find an elegant solution that avoids
that revocation process.
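For reference, the two jclouds calls I'm contrasting look like this; the
ComputeService method signatures are jclouds', while the surrounding helper
code is just an illustration of the two approaches:

    import java.util.Set;
    import org.jclouds.compute.ComputeService;
    import org.jclouds.compute.domain.NodeMetadata;
    import com.google.common.base.Predicate;

    public class DestroySketch {
      // Parallel destroy, but as described above it also tries to clean up the
      // shared security group / placement group.
      static void destroyInParallel(ComputeService compute, final Set<String> badIds) {
        compute.destroyNodesMatching(new Predicate<NodeMetadata>() {
          public boolean apply(NodeMetadata node) {
            return badIds.contains(node.getId());
          }
        });
      }

      // The fallback: destroy one node at a time, which leaves the shared
      // resources alone but runs sequentially.
      static void destroyOneByOne(ComputeService compute, Set<String> badIds) {
        for (String id : badIds) {
          compute.destroyNode(id);
        }
      }
    }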
About the Mockito simulation of the retry...
Unfortunately Mockito cannot mock the static
ComputeServiceContextBuilder.build(clusterSpec) method, therefore I couldn't
write a junit test for the retry. I could only test the retry and bad-node
cleanup by temporarily hardcoding an exception and then running it as a live
integration test. If somebody has an idea how to mock all those static methods
in BootstrapClusterAction, feel free to point me to a solution.
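Not part of the patch, and only one possible direction I have not tried here:
if PowerMock were acceptable as a test dependency, the static call could in
principle be stubbed along these lines (the Whirr package names are from
memory and may need adjusting):

    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.mockito.Mockito;
    import org.powermock.api.mockito.PowerMockito;
    import org.powermock.core.classloader.annotations.PrepareForTest;
    import org.powermock.modules.junit4.PowerMockRunner;
    import org.jclouds.compute.ComputeServiceContext;
    import org.apache.whirr.service.ClusterSpec;                   // assumed package
    import org.apache.whirr.service.ComputeServiceContextBuilder;  // assumed package

    @RunWith(PowerMockRunner.class)
    @PrepareForTest(ComputeServiceContextBuilder.class)
    public class BootstrapRetryTest {
      @Test
      public void retryReplacesFailedNodes() throws Exception {
        PowerMockito.mockStatic(ComputeServiceContextBuilder.class);
        ComputeServiceContext context = Mockito.mock(ComputeServiceContext.class);
        PowerMockito.when(ComputeServiceContextBuilder.build(Mockito.any(ClusterSpec.class)))
            .thenReturn(context);
        // ... drive BootstrapClusterAction against the mocked context here
      }
    }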
> Improve bootstrapping and configuration to be able to isolate and repair or
> evict failing nodes on EC2
> ------------------------------------------------------------------------------------------------------
>
> Key: WHIRR-167
> URL: https://issues.apache.org/jira/browse/WHIRR-167
> Project: Whirr
> Issue Type: Improvement
> Environment: Amazon EC2
> Reporter: Tibor Kiss
> Assignee: Tibor Kiss
> Attachments: whirr-167-1.patch, whirr.log
>
>
> Currently the cluster startup process on Amazon EC2 instances is very
> unstable. As the number of nodes to be started increases, the startup
> process fails more often, but sometimes even a 2-3 node startup fails. We
> don't know how many instance startups are going on at the same time on
> Amazon's side when it fails or when it starts up successfully. The only
> thing I see is that when I start around 10 nodes, the proportion of failing
> nodes is higher than with a smaller number of nodes, and it is not directly
> proportional to the number of nodes; it looks like the probability that some
> nodes fail is exponentially higher.
> Looking into BootstrapClusterAction.java, there is a note "// TODO: Check for
> RunNodesException and don't bail out if only a few " which indicates the
> current unreliable startup process. So we should improve it.
> We could add a "max percent failure" property (per instance template), so
> that if the number failures exceeded this value the whole cluster fails to
> launch and is shutdown. For the master node the value would be 100%, but for
> datanodes it would be more like 75%. (Tom White also mentioned in an email).
> Let's discuss if there are any other requirements to this improvement.