Scaling a Hadoop cluster with Hive has the following issues 1. Adding a computing node(Scaling up) when load on the cluster is high decreases the execution time of the queries but its there is still a huge time lag as the new node works on data from other nodes.
2. The process of removing a node from the cluster(Scaling down) when load on the cluster is low, is also time consuming. To reduce the time to scale the Hadoop cluster, we came up with the following solution. Prior to adding a new node, move the data from the existing nodes to the new node. This balances the cluster and if a new task comes, the newly added node can take it up as it already has the data (data locality). When decommissioning a node, move the data available on that node to the other nodes in the cluster., then decommission it. We tested this with hive on hadoop on 5 node cluster, *Time taken for Hive query,* *4node cluster* *Existing procedure(added new node) 5node cluster* *New procedure(added new node) 5node cluster* 16mins,25sec 13mins,38sec 9mins,41sec check the results and the approach here <https://github.com/rajuch/Auto-scaling-on-ec2> Any drawbacks/suggestions on this approach, we would like to hear from you.. -- Thanks & Regards, Raju Chinthala