I don't believe I can use the multiple-files solution because Mahout can't
handle multiple input files for Random Forest training.

15-30 minutes isn't a big deal for training a model I'll use a lot, but in
developing a feature set and testing a model many times, it gets pretty
tedious.  It's a shame that it's not easier to take advantage of the abundant
parallelism in an algorithm such as RF, but that's the way it goes.  I'm
using Hadoop not because I think it's the ideal parallel computing
solution, but because it's what I have available to me.

Thanks for your help anyway.  I'll post back if I find a silver bullet.

On Fri, Mar 30, 2012 at 12:11 PM, Sean Owen <sro...@gmail.com> wrote:

> What version of Hadoop are you using? There are at least 2 distinct
> branches and versions of the API floating around, and in the 0.22/0.23
> branches, the keys start with "mapreduce.". This may be an issue.
>
> In 0.20.x and 1.0 (which, confusingly, are the more closely related pair),
> it's mapred.{min,max}.split.size, and it works in both versions of the API. I
> am looking at the code and it's there. Note that this only works with
> FileInputFormat and its subclasses, and it doesn't work on compressed input.
>
> I would set min to 1 and max to, oh, 10000000.
>
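A minimal sketch of setting those two keys programmatically, assuming the
0.20.x/1.0 property names quoted above (the wrapper class is only
illustrative, not Mahout or Hadoop code):

    import org.apache.hadoop.conf.Configuration;

    /** Illustrative only: apply the split-size settings suggested above. */
    public final class SplitSizeSettings {
      public static void apply(Configuration conf) {
        conf.setLong("mapred.min.split.size", 1L);         // allow very small splits
        conf.setLong("mapred.max.split.size", 10000000L);  // cap splits at ~10 MB so more mappers run
        // Only honored by FileInputFormat and its subclasses, and not on compressed input.
        // On the 0.22/0.23 branches the keys start with "mapreduce." instead.
      }
    }

The same values can be passed on the command line as
-Dmapred.min.split.size=1 -Dmapred.max.split.size=10000000, the form used
elsewhere in this thread.
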
> mapred.map.tasks is also correct for 1.0. I know this does in general have
> an effect, even though it's not something that forces Hadoop to use more
> mappers.
>
> Your block size is 64MB? I think perhaps Hadoop won't assign more than one
> mapper to a single HDFS chunk. You could rebuild your HDFS cluster with a
> smaller chunk size if you must, but Ted's suggestion to use smaller files
> is far easier. (Though it burns a little extra space.)
>
>
> You may be pushing up against the limits of what Hadoop is useful for. 31MB
> / 15 minutes isn't large or long, and so it may be difficult to wrestle it
> to do otherwise. You probably don't need Hadoop for this problem at all.
>
>
> On Fri, Mar 30, 2012 at 4:51 PM, Jason L Shaw <jls...@uw.edu> wrote:
>
> > Well, it looks like there's no solution for me right now.
> >
> > mapred.map.tasks is indeed just a suggestion -- no effect
> > mapred.max.split.size does not exist as an option, at least according to
> > http://hadoop.apache.org/common/docs/current/mapred-default.html, and when
> > I tried it -- no effect
> > Splitting up my input may or may not work, except that the Decision Forest
> > code in Mahout cannot currently train on multiple input files:
> > https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation
> >
> > So I think I will need to look for another solution to my problem.  I could
> > just train many small decision forests and then combine them -- does Mahout
> > provide a slick way to combine the predictions of multiple models?
> >
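No built-in Mahout way to merge separately trained forests turns up in this
thread, but combining their predictions by majority vote is easy to hand-roll.
A minimal sketch, assuming you can obtain each model's predicted label for
each test instance (every name below is hypothetical, not Mahout API):

    import java.util.HashMap;
    import java.util.Map;

    /** Hypothetical helper: combine several models' predictions by majority vote. */
    public final class ForestVote {

      /**
       * @param votes votes[m][i] = label predicted by model m for instance i
       * @return for each instance, the label predicted by the most models
       */
      public static int[] majorityVote(int[][] votes) {
        int numInstances = votes[0].length;
        int[] combined = new int[numInstances];
        for (int i = 0; i < numInstances; i++) {
          // Count how many models voted for each label on this instance.
          Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
          for (int[] modelVotes : votes) {
            Integer c = counts.get(modelVotes[i]);
            counts.put(modelVotes[i], c == null ? 1 : c + 1);
          }
          // Keep the label with the highest count.
          int bestLabel = -1;
          int bestCount = -1;
          for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
              bestCount = e.getValue();
              bestLabel = e.getKey();
            }
          }
          combined[i] = bestLabel;
        }
        return combined;
      }
    }

This votes at the forest level; if per-tree predictions are accessible,
pooling all the trees' votes together would be closer to what a single larger
forest does.
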
> > On Fri, Mar 30, 2012 at 12:54 AM, deneche abdelhakim <adene...@gmail.com>
> > wrote:
> >
> > > -Dmapred.map.tasks=N only gives a suggestion to Hadoop, and in most
> > > cases (especially when the data is small) Hadoop doesn't take it into
> > > consideration. To generate more mappers use -Dmapred.max.split.size=S,
> > > S being the size of each data partition in bytes. So with your data at
> > > ~31000000 bytes, if you want to generate 100 partitions (mappers), S
> > > should be 310000 (31000000/100).
> > >
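A small worked version of that arithmetic, again assuming the 0.20.x/1.0 key
name (the helper class is hypothetical):

    import org.apache.hadoop.conf.Configuration;

    /** Hypothetical helper: pick mapred.max.split.size so roughly N mappers run. */
    public final class SplitSizing {

      /** e.g. splitSizeFor(31000000L, 100) == 310000 bytes per split. */
      public static long splitSizeFor(long dataBytes, int desiredMappers) {
        return dataBytes / desiredMappers;
      }

      public static void apply(Configuration conf, long dataBytes, int desiredMappers) {
        conf.setLong("mapred.max.split.size", splitSizeFor(dataBytes, desiredMappers));
      }
    }
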
> > >
> > >
> > > On Thu, Mar 29, 2012 at 11:08 PM, Ted Dunning <ted.dunn...@gmail.com>
> > > wrote:
> > >
> > > > Split your training data into lots of little files.  Depending on the
> > > > wind, that may cause more mappers to be invoked.
> > > >
> > > > On Thu, Mar 29, 2012 at 3:05 PM, Jason L Shaw <jls...@uw.edu> wrote:
> > > >
> > > > > Suggestion, indeed.  I passed that option, but still only 2 mappers
> > > > > were created.
> > > > >
> > > > > On Thu, Mar 29, 2012 at 5:23 PM, Sean Owen <sro...@gmail.com> wrote:
> > > > >
> > > > > > Hadoop is what chooses the number of mappers, and it bases it on
> > > > > > input size. Generally it will not assign less than one worker per
> > > > > > chunk, and a chunk is usually 64MB (still, I believe). You can
> > > > > > override this directly (well, at least, register a suggestion to
> > > > > > Hadoop). I would tell you the exact flag but I'm not next to my
> > > > > > computer. In older Hadoop versions it was -Dmapred.map.tasks=N I
> > > > > > believe; in newer versions it's different, perhaps
> > > > > > -Dmapreduce.map.tasks=N. That's what you're looking for to start.
> > > > > > There are other ways to influence this, like the minimum split
> > > > > > size, but try this first.
> > > > > >
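For reference, a sketch of how such -D flags reach a job: a driver run through
ToolRunner has the generic -D options parsed into its Configuration before
run() is called. The driver class below is only an illustrative stand-in, not
Mahout code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    /** Illustrative driver: run as "hadoop jar ... ExampleDriver -Dmapred.map.tasks=100". */
    public class ExampleDriver extends Configured implements Tool {

      @Override
      public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Any -D options from the command line are already present here,
        // so a Job built from this conf will see them.
        System.out.println("mapred.map.tasks = " + conf.get("mapred.map.tasks"));
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new ExampleDriver(), args));
      }
    }
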
> > > > > > On Thu, Mar 29, 2012 at 9:59 PM, Jason L Shaw <jls...@uw.edu> wrote:
> > > > > >
> > > > > > > I have a dataset that is not terribly large (~31 MB on disk in
> > > > > > > plaintext, ~145,000 records with 26 fields).  I am trying to
> > > > > > > build random forests over the data, but the process is quite
> > > > > > > slow.  It takes about half an hour to build 100 trees using the
> > > > > > > partial implementation. (I didn't realize I didn't need it.)
> > > > > > >
> > > > > > > I tried switching to the in-memory implementation so that the
> > > > > > > trees would be built in parallel.  I have access to a cluster
> > > > > > > with about 15 nodes, which can support up to 130 mappers.  It
> > > > > > > seems to me that I ought to be able to build 100 trees all at
> > > > > > > once and be done in less than a minute (for the building phase,
> > > > > > > anyway).  However, the job only generated 2 mappers, each
> > > > > > > building 50 trees, and it took 15 minutes.  I tried again with
> > > > > > > 500 trees, but again only 2 mappers were started.
> > > > > > >
> > > > > > > Is there any way I can convince Hadoop to start up more mappers
> > > > > > > and load the data more times?  I'm not that familiar with Hadoop,
> > > > > > > but from what I've read, the number of mappers doesn't seem very
> > > > > > > configurable.  Memory is not a concern. (Typically 72 GB or more
> > > > > > > available.)
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jason
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
