Hi Florin,
Disclaimer: this is my opinion, and it might not suit your needs.

1. (Dynamic) scaling is the ability to change the number of instances in the scaling group on demand. If you are deploying a single topology on multiple supervisors (instances) and then add another supervisor (instance), the new instance is still empty. AFAIK you would need to perform some sort of re-balancing from Nimbus. It also means you prepared the number of relevant bolts in advance so that rebalancing is possible.

2. Our specific problem does not require a fully distributed architecture, so instead we deploy one topology per EC2 instance. In this scenario you deploy 10 identical topologies under 10 different names (with the isolation scheduler) and start as many instances as you want. Nimbus will do its best to allocate one topology per instance: the more instances available, the more topologies get deployed. No rebalancing is needed.

3. Automatic scaling means automatically triggering either a scale-out or a scale-in. If you are in the eCommerce business, for example, there is known seasonality, so manual scaling is enough. Make sure you have a business justification for automatic scaling, since any closed-loop control system can get out of hand quickly. There are startups that provide auto-scaling as a service on top of EC2 for that exact reason. IMO, if you do use auto-scaling, make sure to set a maximum number of instances... and trigger only scale-out. Do the scale-in manually, only when the time is right for you and your customers. Later down the road, you can consider going fully automatic.

4. The trigger is really case-by-case. It could be a business trigger (a customer SLA is breached, and you know from experience that this means you need more Storm nodes)... or it could be a low-level trigger (such as CPU usage above 80% for more than 10 minutes). Either way, the system would react in about 10 minutes...
and would have a damping (cooldown) period during which no further change happens even if the trigger fires. This is in order to stabilize the system. So make sure your requirements are aligned with the realities of autoscaling: if you need to react within 30 seconds... it's not going to happen, even if EC2 can start an instance within that time.

thanks,
Itai

________________________________
From: Spico Florin <[email protected]>
Sent: Tuesday, November 25, 2014 1:05 PM
To: [email protected]
Subject: Re: Rebalancing after a crashed node

Hello!

Can you please explain how you manage the autoscaling of worker nodes on EC2? I'm particularly interested in what steps have to be performed in EC2 in order to achieve such elasticity. More clearly:

1. Do you have to create snapshots of a worker node (with its configuration for Nimbus and ZooKeeper)?
2. Do you have to create an autoscaling group?
3. How do you trigger the autoscaling? Based on the CPU workload?
4. After adding/removing nodes, how do you manage to automatically send the rebalance topology command?

It would be great if you could provide such steps. I'm a novice in cloud computing and would like to learn these concepts (elasticity, autoscaling). I look forward to your answers and suggestions.

Thank you.
Regards,
Florin

On Tue, Nov 25, 2014 at 12:47 PM, Guillermo López Leal <[email protected]> wrote:

Hi there,

We are using Storm for our real-time processing system, and so far, so good! We have some questions about adding nodes (autoscaling on EC2) and terminating others (based on CPU, for example). Right now we are seeing that if we add new nodes to the system, Storm just stops for around 30 seconds (tuple timeout), rebalances itself, and the new nodes work as expected.
Downtime of ~50 seconds. But if a node goes down, we see a drop in processing (to 0 tuples) for around 3 minutes; after that, it slowly starts to regain processing speed, and after another minute everything is OK (a total of 4 minutes or so). Is there any way to wait not 3 minutes but only the tuple timeout (or something similar to what happens when nodes are added)? Thanks for your ideas
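As an aside on the rebalancing that keeps coming up in this thread: the Nimbus-side rebalance can be driven with the Storm CLI (`storm rebalance <topology> [-w wait-secs] [-n workers] [-e component=executors]`). Below is a minimal Python sketch that only builds that command line; the topology and component names are placeholder assumptions, and you would pass the resulting list to `subprocess.run` on a machine with the `storm` client installed.

```python
def rebalance_cmd(topology, num_workers, wait_secs=30, executors=None):
    """Build a `storm rebalance` invocation as an argv list.

    -w : seconds Storm deactivates the topology before redistributing work
    -n : new total number of worker processes for the topology
    -e : new executor count for a specific component (repeatable)
    """
    cmd = ["storm", "rebalance", topology,
           "-w", str(wait_secs), "-n", str(num_workers)]
    for component, count in (executors or {}).items():
        cmd += ["-e", f"{component}={count}"]
    return cmd

# Hypothetical example: 4 workers, 8 executors for a bolt named "split"
print(rebalance_cmd("my-topology", 4, executors={"split": 8}))
# → ['storm', 'rebalance', 'my-topology', '-w', '30', '-n', '4', '-e', 'split=8']
```

Note that bumping an executor count with `-e` only helps if the component was given enough tasks when the topology was submitted, which is the "prepared in advance" caveat from point 1.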

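To make the "trigger only scale-out, with a damping period" advice from points 3 and 4 above concrete, here is a rough Python sketch of such trigger logic. The threshold, sustain window, cooldown, and class name are illustrative assumptions, not a real CloudWatch/Auto Scaling configuration:

```python
import time

class ScaleOutTrigger:
    """Scale-out-only trigger: fire when CPU stays above a threshold for a
    sustained window, then go quiet during a damping (cooldown) period.
    Scale-in is deliberately left manual, as suggested above."""

    def __init__(self, threshold=80.0, sustain_secs=600, cooldown_secs=600,
                 max_instances=10, clock=time.monotonic):
        self.threshold = threshold            # e.g. CPU > 80%...
        self.sustain_secs = sustain_secs      # ...for more than 10 minutes
        self.cooldown_secs = cooldown_secs    # damping: no change after firing
        self.max_instances = max_instances    # hard cap, as recommended
        self.clock = clock                    # injectable for testing
        self._breach_start = None
        self._last_fired = None

    def observe(self, cpu_percent, current_instances):
        """Return how many instances to add right now (0 or 1)."""
        now = self.clock()
        if cpu_percent <= self.threshold:
            self._breach_start = None         # breach over: reset the window
            return 0
        if self._breach_start is None:
            self._breach_start = now
        sustained = (now - self._breach_start) >= self.sustain_secs
        damped = (self._last_fired is not None
                  and (now - self._last_fired) < self.cooldown_secs)
        if sustained and not damped and current_instances < self.max_instances:
            self._last_fired = now
            self._breach_start = None
            return 1                          # request one more instance
        return 0
```

The sustain window plus cooldown is what keeps this closed loop from flapping; a real deployment would express the same idea with a CloudWatch alarm and an Auto Scaling policy cooldown rather than hand-rolled code.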