I don't believe I can use the multiple-files solution because Mahout can't handle multiple input files for Random Forest training.
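For reference, this is roughly the shape of the partial-implementation run I've been testing with the split-size override folded in. The jar name, class name, and option letters are whatever my Mahout build happens to use, and the paths are placeholders, so treat the details as approximate rather than a recipe:

  # max split of ~310,000 bytes aims for ~100 mappers on a ~31,000,000-byte input;
  # the -D flags go before the tool's own options
  hadoop jar mahout-core-0.6-job.jar \
    org.apache.mahout.classifier.df.mapreduce.BuildForest \
    -Dmapred.min.split.size=1 -Dmapred.max.split.size=310000 \
    -d rf-input/train.csv -ds rf-input/train.info \
    -sl 5 -p -t 100 -o rf-forest

Even with those flags, the job still comes up with only 2 mappers for me.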
15-30 minutes isn't a big deal for training a model I'll use a lot, but when I'm developing a feature set and testing a model many times, it gets pretty tedious. It's a shame that it's not easier to take advantage of the abundant parallelism in an algorithm such as RF, but that's the way it goes. I'm using Hadoop not because I think it's the ideal parallel computing solution, but because it's what I have available to me.

Thanks for your help anyway. I'll post back if I find a silver bullet.

On Fri, Mar 30, 2012 at 12:11 PM, Sean Owen <sro...@gmail.com> wrote:

> What version of Hadoop are you using? There are at least 2 distinct branches and versions of the API floating around, and in the 0.22/0.23 branches, the keys start with "mapreduce.". This may be an issue.
>
> In 0.20.x and 1.0 (which are, confusingly, the more closely related), it's mapred.{min,max}.split.size, and in both versions of the API. I am looking at the code and it's there. Note that this only works with FileInputFormat and subclasses. It doesn't work on compressed input.
>
> I would set min to 1 and max to, oh, 10000000.
>
> mapred.map.tasks is also correct for 1.0. I know this does in general have an effect, even though it's not something that forces Hadoop to use more mappers.
>
> Your block size is 64MB? I think perhaps Hadoop won't send more than one mapper at one HDFS chunk. You could rebuild your HDFS cluster with a smaller chunk size if you must, but Ted's suggestion to use smaller files is far easier. (Though it burns a little extra space.)
>
> You may be pushing up against the limits of what Hadoop is useful for. 31MB / 15 minutes isn't large or long, and so it may be difficult to wrestle it to do otherwise. You probably don't need Hadoop for this problem at all.
>
> On Fri, Mar 30, 2012 at 4:51 PM, Jason L Shaw <jls...@uw.edu> wrote:
>
> > Well, it looks like there's no solution for me right now.
> >
> > mapred.map.tasks is indeed just a suggestion -- no effect.
> > mapred.max.split.size does not exist as an option, at least according to http://hadoop.apache.org/common/docs/current/mapred-default.html, and when I tried it -- no effect.
> > Splitting up my input may or may not work, except that the Decision Forest code in Mahout cannot currently train on multiple input files: https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation
> >
> > So I think I will need to look for another solution to my problem. I could just train many small decision forests and then combine them -- does Mahout provide a slick way to combine the predictions of multiple models?
> >
> > On Fri, Mar 30, 2012 at 12:54 AM, deneche abdelhakim <adene...@gmail.com> wrote:
> >
> > > -Dmapred.map.tasks=N only gives a suggestion to Hadoop, and in most cases (especially when the data is small) Hadoop doesn't take it into consideration. To generate more mappers use -Dmapred.max.split.size=S, S being the size of each data partition in bytes. So with your data at ~31000000 bytes, if you want to generate 100 partitions (mappers), S should be 310000 (31000000/100).
> > >
> > > On Thu, Mar 29, 2012 at 11:08 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > >
> > > > Split your training data into lots of little files. Depending on the wind, that may cause more mappers to be invoked.
> > > >
> > > > On Thu, Mar 29, 2012 at 3:05 PM, Jason L Shaw <jls...@uw.edu> wrote:
> > > >
> > > > > Suggestion, indeed. I passed that option, but still only 2 mappers were created.
> > > > >
> > > > > On Thu, Mar 29, 2012 at 5:23 PM, Sean Owen <sro...@gmail.com> wrote:
> > > > >
> > > > > > Hadoop is what chooses the number of mappers, and it bases it on input size. Generally it will not assign less than one worker per chunk, and a chunk is usually 64MB (still, I believe). You can override this directly (well, at least, register a suggestion to Hadoop). I would tell you the exact flag but I'm not next to my computer. In older Hadoop versions it was -Dmapred.map.tasks=N I believe; in newer versions it's different, perhaps -Dmapreduce.map.tasks=N. That's what you're looking for to start. There are other ways to influence this, like the minimum split size, but try this first.
> > > > > >
> > > > > > On Thu, Mar 29, 2012 at 9:59 PM, Jason L Shaw <jls...@uw.edu> wrote:
> > > > > >
> > > > > > > I have a dataset that is not terribly large (~31 MB on disk in plaintext, ~145,000 records with 26 fields). I am trying to build random forests over the data, but the process is quite slow. It takes about half an hour to build 100 trees using the partial implementation. (I didn't realize I didn't need it.)
> > > > > > >
> > > > > > > I tried switching to the in-memory implementation so that the trees would be built in parallel. I have access to a cluster with about 15 nodes which can support up to 130 mappers. It seems to me that I ought to be able to build 100 trees all at once and be done in less than a minute (for the building phase, anyway). However, the job only generated 2 mappers, each building 50 trees, and it took 15 minutes. I tried again with 500 trees, but again only 2 mappers were started.
> > > > > > >
> > > > > > > Is there any way I can convince Hadoop to start up more mappers and load the data more times? I'm not that familiar with Hadoop, but from what I've read, the number of mappers doesn't seem very configurable. Memory is not a concern. (Typically 72 GB or more available.)
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jason
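P.S. For anyone who finds this thread later: splitting the input itself is the easy part. Something like the sketch below (with made-up file names) would do it; the blocker for me is that the partial implementation currently wants a single input file.

  # Break ~145,000 records into ~100 files of ~1,450 lines each,
  # then push them into one HDFS directory
  split -l 1450 train.csv train_part_
  hadoop fs -mkdir rf-input-split
  hadoop fs -put train_part_* rf-input-split/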