Hi Hadoop mavens -- I'm hoping someone out there will have a quick solution for me. I'm trying to run some very basic scaling experiments for a rapidly approaching paper deadline on a 0.16.0 Hadoop cluster that has ~20 nodes with 2 procs/node. Ideally, I would like to run my code on clusters of different sizes (1, 2, 4, 8, 16 nodes) or some such thing.

The problem is that I am not able to reconfigure the cluster (in the long run, i.e., before a final version of the paper, I assume this will be possible, but for now it's not). Setting the number of mappers/reducers does not seem to be a viable option, at least not in the trivial way, since the physical layout of the input files makes Hadoop run a different number of tasks than I request. Most of my jobs consist of multiple MR steps; the initial one always runs on a relatively small data set that fits into a single block, so the framework does honor my task number request on the first job -- but during the later ones it does not.
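For reference, the hint I'm setting looks roughly like the sketch below (old-style JobConf API; the driver class name, job name, and task counts are just placeholders, and I've left out the input/output paths and mapper/reducer setup):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ScalingJob {                        // hypothetical driver class
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ScalingJob.class);
        conf.setJobName("scaling-step");         // placeholder name

        // These are only hints: the actual number of map tasks comes from the
        // InputFormat's split computation, so once the data spans many blocks
        // the framework no longer honors the requested value.
        conf.setNumMapTasks(4);
        conf.setNumReduceTasks(4);

        // (input/output paths, mapper/reducer classes, key/value types omitted)
        JobClient.runJob(conf);
    }
}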
My questions:

1) Can I get around this limitation programmatically? I.e., is there a way to tell the framework to use only a subset of the nodes for DFS / mapping / reducing?

2) If not, what statistics would be good to report if I can only have two data points -- a legacy "single-core" implementation of the algorithms and a MapReduce version running on the full cluster?

Thanks for any suggestions!
Chris