What version of Hadoop are you using? There are at least two distinct
branches and versions of the API floating around, and in the 0.22/0.23
branches the keys start with "mapreduce.". That may be the issue.

In 0.20.x and 1.0 (which, confusingly, are the more closely related pair),
the keys are mapred.{min,max}.split.size, and they exist in both versions
of the API; I'm looking at the code and they're there. Note that this only
works with FileInputFormat and its subclasses, and it doesn't work on
compressed input.

I would set the min to 1 and the max to, oh, 10000000 (about 10 MB).
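
For what it's worth, here's a minimal sketch of setting those two keys from
your own driver code; the class name and values are made up for illustration,
and passing the same keys as -D options to a ToolRunner-based driver (which
Mahout's job drivers generally are) amounts to the same thing:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ManyMappersSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 0.20.x / 1.0 key names; the 0.22/0.23 branches use different,
        // "mapreduce."-prefixed names.
        conf.setLong("mapred.min.split.size", 1L);         // allow very small splits
        conf.setLong("mapred.max.split.size", 10000000L);  // cap each split at ~10 MB
        Job job = new Job(conf, "many-mappers-sketch");
        // ... set input/output paths, mapper class, etc., then submit as usual.
        System.out.println("max split size = "
            + job.getConfiguration().get("mapred.max.split.size"));
      }
    }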

mapred.map.tasks is also correct for 1.0. I know it does have an effect in
general, even though it's only a hint and doesn't force Hadoop to use more
mappers.

Your block size is 64MB? I suspect Hadoop won't assign more than one mapper
to a single HDFS block. You could rebuild your HDFS cluster with a smaller
block size if you must, but Ted's suggestion to use smaller files is far
easier. (Though it burns a little extra space.)
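
One middle ground the thread doesn't cover, so take it as a hedged aside
rather than something anyone above suggested: HDFS lets you pick a block size
per file at write time, so you could re-upload just the training data with
small blocks instead of reconfiguring the whole cluster. A rough sketch, with
made-up paths and an arbitrary 8 MB block size:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class SmallBlockCopy {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long blockSize = 8L * 1024 * 1024;  // 8 MB blocks instead of the 64 MB default
        InputStream in = new FileInputStream("training.data");  // hypothetical local file
        // FileSystem.create(path, overwrite, bufferSize, replication, blockSize)
        OutputStream out = fs.create(new Path("/user/jason/training.data"), true,
            4096, fs.getDefaultReplication(), blockSize);
        IOUtils.copyBytes(in, out, 4096, true);  // copy the stream, then close both ends
      }
    }

With the file stored in ~8 MB blocks, FileInputFormat should hand out
correspondingly more splits, though the split-size settings above are usually
the less invasive knob.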


You may be pushing up against the limits of what Hadoop is useful for. 31 MB
and 15 minutes aren't large or long by Hadoop standards, so it may be hard to
wrestle it into doing otherwise. You probably don't need Hadoop for this
problem at all.


On Fri, Mar 30, 2012 at 4:51 PM, Jason L Shaw <jls...@uw.edu> wrote:

> Well, it looks like there's no solution for me right now.
>
> mapred.map.tasks is indeed just a suggestion -- no effect
> mapred.max.split.size does not exist as an option, at least according to
> http://hadoop.apache.org/common/docs/current/mapred-default.html, and when
> I tried it -- no effect
> Splitting up my input may or may not work, except that the Decision Forest
> code in Mahout cannot currently train on multiple input files:
> https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation
>
> So I think I will need to look for another solution to my problem.  I could
> just train many small decision forests and then combine them -- does Mahout
> provide a slick way to combine the predictions of multiple models?
>
> On Fri, Mar 30, 2012 at 12:54 AM, deneche abdelhakim <adene...@gmail.com>
> wrote:
>
> > -Dmapred.map.tasks=N only gives a suggestion to Hadoop, and in most
> > cases (especially when the data is small) Hadoop doesn't take it into
> > consideration. To generate more mappers use -Dmapred.max.split.size=S,
> > S being the size of each data partition in bytes. So with your data
> > at ~31000000 B, if you want to generate 100 partitions (mappers), S
> > should be 310000 (31000000/100).
> >
> >
> >
> > On Thu, Mar 29, 2012 at 11:08 PM, Ted Dunning <ted.dunn...@gmail.com>
> > wrote:
> >
> > > Split your training data into lots of little files.  Depending on the
> > > wind, that may cause more mappers to be invoked.
> > >
> > > On Thu, Mar 29, 2012 at 3:05 PM, Jason L Shaw <jls...@uw.edu> wrote:
> > >
> > > > Suggestion, indeed.  I passed that option, but still only 2 mappers
> > > > were created.
> > > >
> > > > On Thu, Mar 29, 2012 at 5:23 PM, Sean Owen <sro...@gmail.com> wrote:
> > > >
> > > > > Hadoop is what chooses the number of mappers, and it bases it on
> > > > > input size. Generally it will not assign less than one worker per
> > > > > chunk and a chunk is usually 64MB (still, I believe). You can
> > > > > override this directly (well, at least, register a suggestion to
> > > > > Hadoop). I would tell you the exact flag but I'm not next to my
> > > > > computer. In older Hadoop versions it was -Dmapred.map.tasks=N I
> > > > > believe; in newer versions it's different, perhaps
> > > > > -Dmapreduce.map.tasks=N. That's what you're looking for to start.
> > > > > There are other ways to influence this like the minimum split
> > > > > size but try this first.
> > > > >
> > > > > On Thu, Mar 29, 2012 at 9:59 PM, Jason L Shaw <jls...@uw.edu> wrote:
> > > > >
> > > > > > I have a dataset that is not terribly large (~31 MB on disk
> > > > > > in plaintext, ~145,000 records with 26 fields).  I am trying
> > > > > > to build random forests over the data, but the process is
> > > > > > quite slow.  It takes about half an hour to build 100 trees
> > > > > > using the partial implementation. (I didn't realize I didn't
> > > > > > need it.)
> > > > > >
> > > > > > I tried switching to the in-memory implementation so that the
> > > > > > trees would be built in parallel.  I have access to a cluster
> > > > > > with about 15 nodes and which can support up to 130 mappers.
> > > > > > It seems to me that I ought to be able to build 100 trees all
> > > > > > at once and be done in less than a minute (for the building
> > > > > > phase, anyway).  However, the job only generated 2 mappers,
> > > > > > each building 50 trees, and it took 15 minutes.  I tried again
> > > > > > with 500 trees, but again only 2 mappers were started.
> > > > > >
> > > > > > Is there any way I can convince Hadoop to start up more mappers
> > > > > > and load the data more times?  I'm not that familiar with
> > > > > > Hadoop, but from what I've read, the number of mappers doesn't
> > > > > > seem very configurable.  Memory is not a concern. (Typically
> > > > > > 72 GB or more available.)
> > > > > >
> > > > > > Thanks,
> > > > > > Jason
> > > > > >
> > > > >
> > > >
> > >
> >
>
