My task is: 1) First, import the data from MS SQL Server into HDFS using Sqoop.
2) Process the data with Hive and generate the result in one table. 3) Export that result table from Hive back to MS SQL Server.

I want to perform all of this using Amazon Elastic MapReduce. The data I am importing from MS SQL Server is very large (about 500,000 rows per table, and I have 30 such tables). For this I have written a Hive task that contains only queries, and each query uses a lot of joins. As a result, performance on my single local machine is very poor: it takes about 3 hours to finish. I want to reduce that time as much as possible, which is why we decided to use Amazon Elastic MapReduce.

Currently I am using 3 m1.large instances, and I still get the same performance as on my local machine. How many instances should I use to improve performance? As I add instances, are they configured automatically, or do I need to specify something when submitting the JAR for execution? I ask because with two machines the run time is the same. Also, is there any way to improve performance other than just increasing the number of instances, or am I doing something wrong while executing the JAR?

Please guide me through this, as I don't know much about the Amazon servers.

Thanks.
--
Regards,
Bhavesh Shah
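For reference, here is a rough sketch of the Sqoop import and export commands I mean for steps 1 and 3. The host, database, table names, credentials, and paths are all placeholders, not my real values; the `-m` flag sets the number of parallel map tasks, which affects transfer speed.

```shell
# Hypothetical sketch of the Sqoop steps; all connection details,
# table names, and paths below are placeholders.

# Step 1: import one source table from MS SQL Server into HDFS.
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
  --username myuser -P \
  --table source_table \
  --target-dir /data/source_table \
  -m 8    # number of parallel map tasks for the import

# Step 3: export the Hive result table back to MS SQL Server.
# Hive's default field delimiter is \001 (Ctrl-A).
sqoop export \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
  --username myuser -P \
  --table result_table \
  --export-dir /user/hive/warehouse/result_table \
  --input-fields-terminated-by '\001' \
  -m 4
```

One such import command would be run per source table (30 in my case), so the per-table `-m` value and the cluster size together bound the total import parallelism.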
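On the Hive side (step 2), join-heavy queries often run their stages serially by default, so adding nodes alone may not help. The session settings below are a sketch of knobs I understand can increase parallelism; the property names are from stock Apache Hive and I am assuming they apply to the Hive version bundled with Elastic MapReduce.

```shell
# Hypothetical Hive session settings for join-heavy queries;
# property names assumed from stock Apache Hive.
hive -e "
  SET hive.exec.parallel=true;              -- run independent stages in parallel
  SET hive.exec.parallel.thread.number=8;   -- how many stages may run at once
  SET hive.auto.convert.join=true;          -- use map-side joins for small tables
  SET mapred.reduce.tasks=20;               -- raise reducer count for big joins
  SOURCE /home/hadoop/my_queries.hql;       -- placeholder for the actual job
"
```

These would need to be set in the script submitted to the cluster, since each Hive session starts from the defaults.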