[
https://issues.apache.org/jira/browse/WHIRR-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979413#action_12979413
]
Tibor Kiss edited comment on WHIRR-167 at 1/9/11 5:04 PM:
----------------------------------------------------------
I attached whirr-167-1.patch. I'm sure it is not the final one, but I would
like to hear your opinions too.
I changed ClusterSpec and InstanceTemplate so that a minimum percentage of
successfully started nodes can be specified.
If nothing is specified, it means 100%, so a value of
whirr.instance-templates=1 jt+nn,4 dn+tt%60
means that the "jt+nn" roles pass only when 100% of the nodes start
successfully, and the "dn+tt" roles pass when at least 60% of the nodes start
successfully.
If any of the roles doesn't meet its minimum requirement, a retry phase is
initiated in which the failed nodes for each role are replaced with new ones.
That means that even a namenode startup problem wouldn't result in a
completely lost cluster.
Without any retries, a failure in the namenode would break an entire cluster
even with many dn+tt nodes successfully started. I think it is worth
minimizing the chance of failing in this way, therefore I introduced a retry
cycle.
If there are failures only in dn+tt and the minimum limit is still met, the
cluster starts up with just that number of nodes, without any retry.
A retry cycle gives both roles a chance to increase their number of nodes up
to the maximum value.
At this moment I don't think more than one retry is worth it! The target is
just to compensate for a few sporadic service problems.
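Roughly, the single retry cycle described above has the following shape. This
is only a sketch working on plain counts, with made-up names (NodeStarter,
startWithOneRetry); it does not mirror the patch's actual code:

    import java.io.IOException;

    // Sketch of the one-retry policy described above.
    public class RetryPolicySketch {
      interface NodeStarter {
        int start(int count) throws IOException; // returns how many nodes actually came up
      }

      static int startWithOneRetry(int requested, int minRequired, NodeStarter starter)
          throws IOException {
        int running = starter.start(requested);           // first round
        if (running < minRequired) {
          running += starter.start(requested - running);  // single retry: replace the failures
        }
        if (running < minRequired) {
          throw new IOException("Minimum of " + minRequired + " nodes not reached");
        }
        return running;                                   // anywhere between minimum and requested
      }
    }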
My question would be: should we keep a retry by default in case of
insufficient nodes, or should the default be no retry, with an extra parameter
to enable it? Initially I don't like the idea of adding more parameters.
About failing nodes... There are 2 different cases (see the sketch after this
list):
1. If the minimum required number of nodes cannot be satisfied even by a retry
cycle, all of the lost nodes are left as they are. A full cluster destroy will
be able to remove them.
2. If the required number of nodes is satisfied from the first round or from a
retry, all the failed nodes (from the first round and from the retry cycle)
are destroyed automatically at the end of BootstrapClusterAction.doAction.
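The sketch mentioned above, with hypothetical names (finishBootstrap,
NodeDestroyer); only the two-case behaviour is taken from the description, the
code itself is not from the patch:

    import java.io.IOException;
    import java.util.Set;

    public class FailedNodeCleanupSketch {
      interface NodeDestroyer {
        void destroyNode(String id);
      }

      static void finishBootstrap(boolean minimumSatisfied, Set<String> failedNodeIds,
          NodeDestroyer destroyer) throws IOException {
        if (!minimumSatisfied) {
          // case 1: the cluster is not viable; lost nodes are left behind and only a
          // full cluster destroy removes them
          throw new IOException("Minimum number of nodes could not be started");
        }
        // case 2: the cluster is viable; remove every node that failed in either round
        for (String id : failedNodeIds) {
          destroyer.destroyNode(id);
        }
      }
    }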
I experienced some difficulties in destroying the nodes. Initially I used the
destroyNodesMatching(Predicate<NodeMetadata> filter) method, which terminates
all the enumerated nodes in parallel. But this method also tries to delete the
security group and placement group. So I had to fall back to the simple
destroyNode(String id), which deletes the nodes sequentially, and I cannot
control the KeyPair deletion. In my opinion the jclouds library is missing
some convenient methods to revoke a subset of nodes without optionally
propagating the KeyPair, SecurityGroup and PlacementGroup cleanup. Effectively
I got stuck here, and I feel I couldn't find an elegant solution that avoids
that revocation process.
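For reference, the two jclouds calls I'm contrasting look like this; the
ComputeService method signatures are jclouds', while the surrounding helper
code is just an illustration of the two approaches:

    import java.util.Set;
    import org.jclouds.compute.ComputeService;
    import org.jclouds.compute.domain.NodeMetadata;
    import com.google.common.base.Predicate;

    public class DestroySketch {
      // Parallel destroy, but as described above it also tries to clean up the
      // shared security group / placement group.
      static void destroyInParallel(ComputeService compute, final Set<String> badIds) {
        compute.destroyNodesMatching(new Predicate<NodeMetadata>() {
          public boolean apply(NodeMetadata node) {
            return badIds.contains(node.getId());
          }
        });
      }

      // The fallback: destroy one node at a time, which leaves the shared
      // resources alone but runs sequentially.
      static void destroyOneByOne(ComputeService compute, Set<String> badIds) {
        for (String id : badIds) {
          compute.destroyNode(id);
        }
      }
    }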
About the Mockito simulation of the retry...
Unfortunately Mockito cannot mock the static
ComputeServiceContextBuilder.build(clusterSpec) method, therefore I couldn't
write a junit test for the retry. I could only test the retry and bad-node
cleanup by temporarily hardcoding an exception and then running it as a live
integration test. If somebody has an idea how to mock all those static methods
in BootstrapClusterAction, feel free to point me to a solution.
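Not part of the patch, and only one possible direction I have not tried here:
if PowerMock were acceptable as a test dependency, the static call could in
principle be stubbed along these lines (the Whirr package names are from
memory and may need adjusting):

    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.mockito.Mockito;
    import org.powermock.api.mockito.PowerMockito;
    import org.powermock.core.classloader.annotations.PrepareForTest;
    import org.powermock.modules.junit4.PowerMockRunner;
    import org.jclouds.compute.ComputeServiceContext;
    import org.apache.whirr.service.ClusterSpec;                   // assumed package
    import org.apache.whirr.service.ComputeServiceContextBuilder;  // assumed package

    @RunWith(PowerMockRunner.class)
    @PrepareForTest(ComputeServiceContextBuilder.class)
    public class BootstrapRetryTest {
      @Test
      public void retryReplacesFailedNodes() throws Exception {
        PowerMockito.mockStatic(ComputeServiceContextBuilder.class);
        ComputeServiceContext context = Mockito.mock(ComputeServiceContext.class);
        PowerMockito.when(ComputeServiceContextBuilder.build(Mockito.any(ClusterSpec.class)))
            .thenReturn(context);
        // ... drive BootstrapClusterAction against the mocked context here
      }
    }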
> Improve bootstrapping and configuration to be able to isolate and repair or
> evict failing nodes on EC2
> ------------------------------------------------------------------------------------------------------
>
> Key: WHIRR-167
> URL: https://issues.apache.org/jira/browse/WHIRR-167
> Project: Whirr
> Issue Type: Improvement
> Environment: Amazon EC2
> Reporter: Tibor Kiss
> Assignee: Tibor Kiss
> Attachments: whirr-167-1.patch, whirr.log
>
>
> Currently the cluster startup process on Amazon EC2 instances is very
> unstable. As the number of nodes to be started increases, the startup
> process fails more often, but sometimes even a 2-3 node startup fails. We
> don't know how many instance startups are going on at the same time on
> Amazon's side when it fails or when it starts up successfully. The only
> thing I see is that when I start around 10 nodes, the proportion of failing
> nodes is higher than with a smaller number of nodes, and it is not directly
> proportional to the number of nodes; it looks like the probability that some
> nodes fail is exponentially higher.
> Looking into BootstrapClusterAction.java, there is a note "// TODO: Check for
> RunNodesException and don't bail out if only a few " which indicates the
> current unreliable startup process. So we should improve it.
> We could add a "max percent failure" property (per instance template), so
> that if the number failures exceeded this value the whole cluster fails to
> launch and is shutdown. For the master node the value would be 100%, but for
> datanodes it would be more like 75%. (Tom White also mentioned in an email).
> Let's discuss if there are any other requirements to this improvement.