Hi Florin,

Disclaimer: This is my opinion, and it might not suit your needs.


1. (Dynamic) scaling is the ability to change the number of instances in the 
scaling group on demand. If you deploy a single topology on multiple 
supervisors (instances) and then add another supervisor (instance), the new 
instance is still empty. AFAIK you would need to perform some sort of 
rebalancing from Nimbus. It also means you must have prepared the relevant 
bolts in advance (enough tasks) to allow the ability to rebalance.
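Assuming the topology was pre-provisioned with spare tasks, the rebalance itself is a single Nimbus-side command. A minimal sketch of building that command line (the topology name, worker count, and bolt id below are placeholders, not anything from this thread):

```python
# Build the `storm rebalance` CLI invocation for a topology whose bolts
# were declared with spare tasks (setNumTasks > initial parallelism).
# "my-topology" and "split-bolt" are hypothetical names.
def rebalance_command(topology, num_workers, executors):
    cmd = ["storm", "rebalance", topology, "-n", str(num_workers)]
    for component, parallelism in executors.items():
        # -e bumps the executor count of one component, up to its task count
        cmd += ["-e", f"{component}={parallelism}"]
    return cmd

print(" ".join(rebalance_command("my-topology", 6, {"split-bolt": 8})))
# → storm rebalance my-topology -n 6 -e split-bolt=8
```

You can only rebalance a component up to the number of tasks it was submitted with, which is why the parallelism has to be planned before deployment.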


2. Our specific problem does not require a fully distributed architecture, so 
instead we deploy one topology per EC2 instance. In this scenario you deploy, 
say, 10 identical topologies under 10 different names (with the isolation 
scheduler), and start as many instances as you want. Nimbus will do its best 
to allocate one topology per instance; the more instances are available, the 
more topologies get deployed. No rebalancing is needed.
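The one-topology-per-instance scheme above boils down to submitting the same jar N times under N distinct names. A hedged sketch (the jar path, main class, and naming convention are assumptions for illustration; here we only build and print the commands rather than invoking `storm`):

```python
# Generate N `storm jar` submissions of the same topology under distinct
# names, so the isolation scheduler can pin one topology per supervisor.
# Jar/class names are placeholders; the topology name is passed as the
# main-class argument, which the topology's main() is assumed to accept.
def submit_commands(jar, main_class, base_name, count):
    return [
        ["storm", "jar", jar, main_class, f"{base_name}-{i}"]
        for i in range(1, count + 1)
    ]

for cmd in submit_commands("mytopo.jar", "com.example.MyTopology", "mytopo", 10):
    print(" ".join(cmd))
```

Each submission would also need an isolation-scheduler entry (topology name to machine count) in the Nimbus configuration.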


3. Automatic scaling means automatic triggering of either scaling out or 
scaling in. If you are in the eCommerce business, for example, there is known 
seasonality, so manual scaling is enough. Make sure you have a business 
justification for automatic scaling - any closed-loop control system can 
get out of hand quickly. There are startups that provide auto-scaling as a 
service on top of EC2 for that exact reason. IMO, if you do use auto-scaling, 
make sure to set a maximum number of instances... and trigger only scale-out. 
Do the scale-in manually, only when the time is right for you and your 
customers.


... later down the road, you should consider going fully automatic.


4. The trigger is really case-by-case dependent. It could be a business trigger 
(a customer SLA is breached and you know from experience that it means you need 
more Storm nodes)... or it could be a low-level trigger (such as CPU above 
80% for more than 10 minutes). Either way the system would react in ~10 
minutes... and would have a damping period during which nothing changes 
even if the trigger fires again. This is in order to stabilize the system. So 
make sure your requirements are aligned with the realities of autoscaling. If 
you need to react within 30 seconds... it's not going to happen, even if EC2 
can start an instance within that time.
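The trigger-plus-damping behaviour above can be sketched as a small control loop: scale out only, cap the group size, and ignore further triggers during a cooldown window. The thresholds and timings below are the illustrative ones from this thread, not a recommendation:

```python
class ScaleOutController:
    """Scale-out-only trigger with a max-instance cap and a damping
    (cooldown) period during which further triggers are ignored.
    Scale-in is intentionally left manual, as argued in point 3."""

    def __init__(self, max_instances, breach_minutes=10, cooldown_minutes=15):
        self.max_instances = max_instances
        self.breach_minutes = breach_minutes      # CPU must breach this long
        self.cooldown_minutes = cooldown_minutes  # no changes after an action
        self.breach_since = None
        self.last_action = None

    def on_sample(self, now_min, cpu_percent, current_instances):
        """Return the desired instance count for one CPU sample
        (`now_min` is wall-clock time in minutes)."""
        if self.last_action is not None and \
           now_min - self.last_action < self.cooldown_minutes:
            return current_instances          # damping: ignore triggers
        if cpu_percent <= 80:
            self.breach_since = None          # breach streak broken
            return current_instances
        if self.breach_since is None:
            self.breach_since = now_min       # breach streak starts
        if now_min - self.breach_since >= self.breach_minutes \
           and current_instances < self.max_instances:
            self.last_action = now_min
            self.breach_since = None
            return current_instances + 1      # scale out by one
        return current_instances
```

Feeding it one sample per minute, a sustained breach adds one instance after 10 minutes, and the cooldown then suppresses any further change - which is exactly why a 30-second reaction requirement cannot be met by such a loop.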



thanks,

Itai

________________________________
From: Spico Florin <[email protected]>
Sent: Tuesday, November 25, 2014 1:05 PM
To: [email protected]
Subject: Re: Rebalancing after a crashed node

Hello!
  Can you please explain how you manage the autoscaling of worker nodes on EC2? 
I'm particularly interested in what steps should be performed in EC2 in order 
to achieve such elasticity.
More clearly:
1. Do you have to create snapshots of a worker node (with its configuration for 
Nimbus and ZooKeeper)?
2. Do you have to create an autoscaling group?
3. How do you trigger the autoscaling? Based on CPU load?
4. After adding/removing new nodes how do you manage to automatically send the 
rebalance topology command?

It would be great if you could provide such a list of steps. I'm a novice in 
cloud computing and I would like to learn these concepts (elasticity, 
autoscaling).
I look forward to your answers and suggestions.
Thank you.
 Regards,
  Florin


On Tue, Nov 25, 2014 at 12:47 PM, Guillermo López Leal 
<[email protected]> wrote:
Hi there,

we are using storm for our real-time processing system, and so far, so good!

We have some questions about when we add nodes (autoscaling on EC2), and when 
we terminate others (based on CPU, for example).

Right now, we are seeing that if we add new nodes to the system, Storm just 
stops for around 30 seconds (tuple timeout), rebalances itself, and the new 
nodes work as expected. Downtime of ~50 seconds.

But if a node goes down, we see a drop in processing (to 0 tuples) for 
around 3 minutes; after that the processing speed slowly starts to rise, 
and after one more minute everything is OK (4 minutes total or so).

Is there any way to wait only the tuple timeout instead of 3 minutes? (Or 
something similar to what happens with added nodes.)

Thanks for your ideas
