Hadoop chooses the number of mappers itself, based on the input size: it
generally assigns at least one mapper per input split, and a split is
normally one HDFS block (64 MB by default, last I checked). You can
override this directly (well, at least register a suggestion with
Hadoop). In older Hadoop versions the property was -Dmapred.map.tasks=N;
in newer versions it was renamed, I believe to -Dmapreduce.job.maps=N.
That's what you're looking for to start. There are other ways to
influence this, such as lowering the maximum split size so a small input
gets broken into more splits, but try this first.
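
Off the top of my head, the invocation would look something like the
sketch below. The jar name, driver class, and trailing options are just
placeholders for whatever you are already running, and the 1 MB value is
only an example; the -D properties are the standard Hadoop ones and go
right after the class name (they are picked up via ToolRunner, which I
believe the Mahout drivers use):

  hadoop jar mahout-job.jar your.forest.Driver \
      -Dmapred.map.tasks=100 \
      -Dmapred.max.split.size=1048576 \
      [the -d/-ds/-t/-o options you already pass]

The first property is the hint for the number of map tasks; the second
caps each split at 1 MB, so a ~31 MB input should end up spread across
roughly 30 splits instead of one or two.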

On Thu, Mar 29, 2012 at 9:59 PM, Jason L Shaw <jls...@uw.edu> wrote:

> I have a dataset that is not terribly large (~31 MB on disk in plaintext,
> ~145,000 records with 26 fields).  I am trying to build random forests over
> the data, but the process is quite slow.  It takes about half an hour to
> build 100 trees using the partial implementation. (I didn't realize I
> didn't need it.)
>
> I tried switching to the in-memory implementation so that the trees would
> be built in parallel.  I have access to a cluster with about 15 nodes,
> which can support up to 130 mappers.  It seems to me that I ought to be
> able to build 100 trees all at once and be done in less than a minute (for
> the building phase, anyway).  However, the job only generated 2 mappers,
> each building 50 trees, and it took 15 minutes.  I tried again with 500
> trees, but again only 2 mappers were started.
>
> Is there any way I can convince Hadoop to start up more mappers and load
> the data more times?  I'm not that familiar with Hadoop, but from what I've
> read, the number of mappers doesn't seem very configurable.  Memory is not
> a concern. (Typically 72 GB or more available.)
>
> Thanks,
> Jason
>
