A suggestion indeed.  I passed that option, but still only 2 mappers were
created.

On Thu, Mar 29, 2012 at 5:23 PM, Sean Owen <sro...@gmail.com> wrote:

> Hadoop is what chooses the number of mappers, and it bases the choice on
> input size. Generally it assigns about one worker per chunk, and a chunk
> is usually 64MB (still, I believe). You can override this directly (well,
> at least, register a suggestion to Hadoop). I would tell you the exact
> flag, but I'm not next to my computer. In older Hadoop versions it was
> -Dmapred.map.tasks=N, I believe; in newer versions it's different, perhaps
> -Dmapreduce.map.tasks=N. That's what you're looking for to start. There
> are other ways to influence this, like the minimum split size, but try
> this first.
>
> On Thu, Mar 29, 2012 at 9:59 PM, Jason L Shaw <jls...@uw.edu> wrote:
>
> > I have a dataset that is not terribly large (~31 MB on disk in plaintext,
> > ~145,000 records with 26 fields).  I am trying to build random forests
> > over the data, but the process is quite slow.  It takes about half an
> > hour to build 100 trees using the partial implementation. (I didn't
> > realize I didn't need it.)
> >
> > I tried switching to the in-memory implementation so that the trees
> > would be built in parallel.  I have access to a cluster with about 15
> > nodes which can support up to 130 mappers.  It seems to me that I ought
> > to be able to build 100 trees all at once and be done in less than a
> > minute (for the building phase, anyway).  However, the job only
> > generated 2 mappers, each building 50 trees, and it took 15 minutes.  I
> > tried again with 500 trees, but again only 2 mappers were started.
> >
> > Is there any way I can convince Hadoop to start up more mappers and
> > load the data into each of them?  I'm not that familiar with Hadoop,
> > but from what I've read, the number of mappers doesn't seem very
> > configurable.  Memory is not a concern. (Typically 72 GB or more
> > available.)
> >
> > Thanks,
> > Jason
> >
>
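
To make that concrete, here is a rough command-line sketch combining the two
knobs Sean mentions: the map-task hint and a split-size cap. The jar name,
driver class, option letters, and paths are assumptions based on Mahout 0.6's
BuildForest and on the driver accepting Hadoop's generic -D options via
ToolRunner; check your version. In Hadoop 2.x the corresponding property names
are mapreduce.job.maps and mapreduce.input.fileinputformat.split.maxsize.

    # Hint Hadoop toward 100 map tasks and cap splits at 2 MB so the ~31 MB
    # input is cut into many small splits instead of a single block-sized one.
    # (FileInputFormat picks max(minSplitSize, min(maxSplitSize, blockSize)).)
    hadoop jar mahout-examples-0.6-job.jar \
        org.apache.mahout.classifier.df.mapreduce.BuildForest \
        -Dmapred.map.tasks=100 \
        -Dmapred.max.split.size=2097152 \
        -d /path/to/data.csv -ds /path/to/data.info -t 100 -o /path/to/forest

The split-size cap only helps jobs that read their input through
FileInputFormat, and mapred.map.tasks on its own is just a hint, which may be
why passing it alone left the job at 2 mappers.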
