Sorry for the previous post. I hadn't finished it, so please skip it.

Hi all,

I've run some experiments with Hadoop on Amazon EC2. I would like to share the results, and any feedback would be appreciated.
Environment:
- Xen VM (Amazon EC2 instance ami-ee53b687)
- 1.7 GHz Xeon CPU, 1.75GB of RAM, 160GB of local disk, and 250Mb/s of network bandwidth (small instance)
- Hadoop 0.17.0
- Storage: HDFS
- Test example: wordcount

Experiment 1: (fixed # of instances (8), varying data size (2MB~512MB), # of maps: 8, # of reduces: 8)

Data Size(MB) | Time(s)
512           | 124
256           | 70
128           | 41
...
8             | 22
4             | 17
2             | 21

The purpose is to observe the lowest framework overhead for wordcount. As a result, when the data size is between 2MB and 16MB, the time is around 20 seconds. May I conclude that the lowest framework overhead for wordcount is 20s?

Experiment 2: (varying # of instances (2~32), varying data size (128MB~2GB), # of maps: (2-32), # of reduces: (2-32))

Data Size(MB) | Map | Reduce | Time(s)
2048          | 32  | 32     | 140
1024          | 16  | 16     | 120
512           | 8   | 8      | 124
256           | 4   | 4      | 127
128           | 2   | 2      | 119

The purpose is to observe whether the time stays similar when each instance is allocated the same amount of data. As a result, when the data size is between 128MB and 1024MB, the time is around 120 seconds. The time is 140s when the data size is 2048MB. I think the reason is that more data to process causes more overhead.

Experiment 3: (varying # of instances (2~16), fixed data size (128MB), # of maps: (2-16), # of reduces: (2-16))

Data Size(MB) | Map | Reduce | Time(s)
128           | 16  | 16     | 31
128           | 8   | 8      | 41
128           | 4   | 4      | 69
128           | 2   | 2      | 119

The purpose is to observe how the result changes when more and more instances are added for a fixed amount of data. As a result, each time the number of instances doubles, the time gets smaller, but it does not drop to half. There is always the framework overhead, even given infinite instances.

In fact, I did more experiments, but I am only posting some of the results. Interestingly, I found a formula for wordcount from my experiment results. That is:

  Time(s) ~= 20 + ((DataSize - 8MB) * 1.6 / (# of instances))

I've checked the formula against all my experiment results and almost all of them match. Maybe it's a coincidence, or maybe I have something wrong.
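For anyone who wants to check the formula against the numbers above, here is a minimal sketch in Python. The function and variable names are my own, not anything from Hadoop, and it assumes that in Experiments 2 and 3 the number of instances equals the number of maps shown in the table. Rows that appear in more than one experiment are listed once.

```python
# Empirical fit quoted in the post: DataSize in MB, result in seconds.
def predicted_time(data_mb, instances):
    return 20 + (data_mb - 8) * 1.6 / instances

# (data size in MB, instances, measured seconds), copied from the tables.
# Assumption: in Experiments 2 and 3, instance count == map count.
MEASUREMENTS = [
    (512, 8, 124), (256, 8, 70), (128, 8, 41),    # Experiment 1
    (8, 8, 22), (4, 8, 17), (2, 8, 21),           # Experiment 1
    (2048, 32, 140), (1024, 16, 120),             # Experiment 2
    (256, 4, 127), (128, 2, 119),                 # Experiment 2
    (128, 16, 31), (128, 4, 69),                  # Experiment 3
]

def compare(measurements):
    """Return (data_mb, instances, measured, predicted, measured - predicted)."""
    rows = []
    for data_mb, instances, measured in measurements:
        predicted = predicted_time(data_mb, instances)
        rows.append((data_mb, instances, measured, predicted, measured - predicted))
    return rows

if __name__ == "__main__":
    for data_mb, instances, measured, predicted, diff in compare(MEASUREMENTS):
        print(f"{data_mb:>5} MB / {instances:>2} instances: "
              f"measured {measured:>3}s, predicted {predicted:6.1f}s, diff {diff:+6.1f}s")
```

With these rows, the largest gap is the 2048MB run on 32 instances (140s measured against 122s predicted). Every other row is within about 8 seconds of the formula.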
Anyway, I just want to share my experience, and any feedback would be appreciated.

--
Best Regards,
Shawn