Has anyone done extensive testing of which Amazon EC2 instance types give the most bang for the buck?
Given the usual Hadoop recommendation of beefy machines, I would expect the best performance from the extra-large instances, but our testing showed otherwise. We did some rough testing when we were just getting started, on a cluster of about 10 nodes, and found that the extra-large instance doesn't come close to twice the actual performance of the large instance, even though it costs twice as much ($0.80 versus $0.40 per hour).

My rationalization is that some of the resources are shared: the extra-large instance corresponds to the whole physical machine, while the large instance sometimes gets to use more than 50% of the IO and network bandwidth when the other tenant isn't doing much.

I'm revisiting our config because we're deploying HBase soon. I'm not sure whether I would be better off moving to extra-large instances so that I can co-locate the tasktrackers and the region servers on the same nodes, or whether I should stick with large instances and put HBase on separate servers. Mostly I'm wondering if my results were a fluke.
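For anyone who wants to compare their own numbers, here is a minimal sketch of the break-even arithmetic I have in mind. The two prices are the ones quoted above; the throughput figures and the function names are hypothetical placeholders, not our measurements, so substitute whatever your own benchmark reports.

```python
def cost_per_unit_work(price_per_hour, work_per_hour):
    """Dollars spent per unit of work completed (lower is better)."""
    return price_per_hour / work_per_hour


def break_even_speedup(price_small, price_big):
    """Speedup the pricier instance needs to match the cheaper one per dollar."""
    return price_big / price_small


LARGE_PRICE = 0.40   # $/hour, as quoted in the post
XLARGE_PRICE = 0.80  # $/hour, as quoted in the post

# Hypothetical throughput (e.g. benchmark jobs per hour per node).
# These are made-up placeholders for illustration only.
large_work_per_hour = 10.0
xlarge_work_per_hour = 15.0

needed = break_even_speedup(LARGE_PRICE, XLARGE_PRICE)
observed = xlarge_work_per_hour / large_work_per_hour

print("speedup needed to break even:", needed)
print("speedup observed (placeholder data):", observed)
print("large  $/unit:", cost_per_unit_work(LARGE_PRICE, large_work_per_hour))
print("xlarge $/unit:", cost_per_unit_work(XLARGE_PRICE, xlarge_work_per_hour))
```

With these prices the extra-large instance has to deliver a full 2x the work of a large one just to tie on cost per unit of work. Anything less favors the large instances, before counting any benefit from co-locating tasktrackers and region servers.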