Hi Hadoop mavens-
I'm hoping someone out there will have a quick solution for me.  I'm
trying to run some very basic scaling experiments, for a rapidly
approaching paper deadline, on a Hadoop 0.16.0 cluster that has ~20
nodes with 2 procs/node.  Ideally, I would like to run my code on
clusters of different sizes (1, 2, 4, 8, 16 nodes), or some such thing.
The problem is that I am not able to reconfigure the cluster (in the
long run, i.e., before a final version of the paper, I assume this
will be possible, but for now it's not).  Setting the number of
mappers/reducers does not seem to be a viable option, at least not in
the trivial way, since the physical layout of the input files makes
Hadoop run a different number of tasks than I request.  Most of my
jobs consist of multiple MR steps; the initial one always runs on a
relatively small data set that fits into a single block, so the
framework does honor my task-count request on that first job, but on
the later ones it does not.
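
To be concrete, by "the trivial way" I mean something like the sketch
below (old org.apache.hadoop.mapred API; the class name, input/output
arguments, and task counts are made-up placeholders, and depending on
the exact 0.16 release the input/output path setters may still live on
JobConf rather than on FileInputFormat/FileOutputFormat):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.IdentityMapper;
  import org.apache.hadoop.mapred.lib.IdentityReducer;

  public class ScalingProbe {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(ScalingProbe.class);
      conf.setJobName("scaling-probe");
      // Identity map/reduce over the default TextInputFormat records,
      // just to exercise the task scheduler.
      conf.setMapperClass(IdentityMapper.class);
      conf.setReducerClass(IdentityReducer.class);
      conf.setOutputKeyClass(LongWritable.class);
      conf.setOutputValueClass(Text.class);
      // setNumMapTasks() is only a hint: it gets honored for the small
      // single-block first job, but later jobs spawn one map task per
      // input split no matter what I put here.
      conf.setNumMapTasks(4);
      conf.setNumReduceTasks(4);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }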

My questions:
1) Can I get around this limitation programmatically?  I.e., is there
a way to tell the framework to use only a subset of the nodes for DFS
/ mapping / reducing (see the sketch after these questions for the
kind of knob I have in mind)?
2) If not, what statistics would be good to report if I can only have
two data points -- a legacy "single-core" implementation of the
algorithms and a MapReduce version running on the full cluster?
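
For (1), what I have in mind is something like restricting the daemons
via include files, e.g. the hadoop-site.xml fragment below (the file
paths are placeholders, and I'm not certain which of these properties
the 0.16 line already supports).  But that is cluster-side
configuration plus a daemon restart, which is exactly what I can't do
right now -- hence the question about a per-job alternative:

  <property>
    <name>dfs.hosts</name>
    <value>/path/to/conf/dfs.include</value>
    <description>Only datanodes listed in this file may join
    HDFS.</description>
  </property>
  <property>
    <name>mapred.hosts</name>
    <value>/path/to/conf/mapred.include</value>
    <description>Only tasktrackers listed in this file may register
    with the jobtracker.</description>
  </property>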

Thanks for any suggestions!
Chris
