One trick to get more mappers on a job when running from the command line is to pass a '-Dmapred.max.split.size=xxxx' argument, where xxxx is a size in bytes. So if you have some hypothetical 10MB input set but want to force ~100 mappers, use '-Dmapred.max.split.size=100000' (10MB / 100 splits = ~100KB per split).
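In other words, the split size is just (total input bytes) / (desired mappers). A quick shell sketch of that arithmetic; the mahout invocation in the comment is only illustrative, so adjust paths and options to your own job:

```shell
# Back-of-the-envelope: force ~100 mappers on ~10 MB of input.
INPUT_BYTES=10485760          # total input size in bytes (10 MB here)
TARGET_MAPPERS=100            # rough number of mappers you want
SPLIT_SIZE=$((INPUT_BYTES / TARGET_MAPPERS))
echo "Use -Dmapred.max.split.size=$SPLIT_SIZE"
# prints: Use -Dmapred.max.split.size=104857

# Then pass it to the job, e.g. (hypothetical kmeans invocation):
#   mahout kmeans -Dmapred.max.split.size=$SPLIT_SIZE \
#       -i /path/to/vectors -c /path/to/centroids -o /path/to/output -k 20
```

Note this only raises the mapper count; if your input is a single unsplittable file (e.g. compressed with a non-splittable codec), you'll still get one mapper.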
On Wed, Jul 31, 2013 at 4:57 AM, Fuhrmann Alpert, Galit <galp...@ebay.com> wrote:

> Hi,
>
> It sounds to me like this could be related to one of the questions I posted
> several days ago (is it?): my Mahout clustering processes seem to be running
> very slowly (several hours on just ~1M items), and I'm wondering if anything
> needs to be changed in the settings/configuration (and how?). I'm running on
> a large cluster and could potentially use thousands of nodes
> (mappers/reducers). However, my Mahout processes (kmeans/canopy, etc.) are
> only using at most 5 mappers (I tried it on several data sets). I've tried
> to set the number of mappers with something like -Dmapred.map.tasks=100,
> but this didn't seem to have an effect; it still uses <=5 mappers.
>
> Is there a different way to set the number of mappers/reducers for a Mahout
> process? Or is there another configuration issue I need to consider?
>
> I'd definitely be happy to use such a parameter; does it not exist?
> (I'm running Mahout as installed on the cluster.)
>
> Is there currently a workaround, besides running a Mahout jar as a Hadoop
> job? When I originally tried to run a Mahout jar that uses KMeansDriver
> (and that runs great on my local machine), it did not even initiate a job
> on the Hadoop cluster. It seemed to be running in parallel, but in fact it
> was running only on the local node. Is this a known issue? Is there a fix
> for this? (I ended up dropping it and calling Mahout step by step from the
> command line, but I'd be happy to know if there is a fix.)
>
> Thanks,
> Galit.
>
> -----Original Message-----
> From: Ryan Josal [mailto:rjo...@gmail.com]
> Sent: Monday, July 29, 2013 9:33 PM
> To: Adam Baron
> Cc: Ryan Josal; user@mahout.apache.org
> Subject: Re: Run more than one mapper for TestForest?
>
> If you're running Mahout from the CLI, you'll have to modify the Hadoop
> config file or your env manually for each job.
> This is code I put into my custom job executions so I didn't have to
> calculate and set that up every time. Maybe that's your best route in that
> position. You could also just provide your own Mahout jar, run it as you
> would any other Hadoop job, and ignore the installed Mahout. I do think
> this could be a useful parameter for a number of standard Mahout jobs,
> though; I know I would use it. Does anyone in the Mahout community see
> this as a generally useful feature for a Mahout job?
>
> Ryan
>
> On Jul 29, 2013, at 10:25, Adam Baron <adam.j.ba...@gmail.com> wrote:
>
> > Ryan,
> >
> > Thanks for the fix; the code looks reasonable to me. Which version of
> > Mahout will this be in? 0.9?
> >
> > Unfortunately, I'm using a large shared Hadoop cluster which is not
> > administered by my team, so I'm not in a position to push the latest
> > from the Mahout dev trunk into our environment; the admins will only
> > install official releases.
> >
> > Regards,
> > Adam
> >
> > On Sun, Jul 28, 2013 at 5:37 PM, Ryan Josal <r...@josal.com> wrote:
> >> Late reply, but for what it's still worth: since I've seen a couple of
> >> other threads here on the topic of too few mappers, I added a parameter
> >> to set a minimum number of map tasks. Some of my Mahout jobs needed
> >> more mappers, but were not given many because of the small input file
> >> size.
> >>
> >>   addOption("minMapTasks", "m", "Minimum number of map tasks to run",
> >>             String.valueOf(1));
> >>
> >>   int minMapTasks = Integer.parseInt(getOption("minMapTasks"));
> >>   int mapTasksThatWouldRun = (int) (vectorFileSizeBytes / getSplitSize()) + 1;
> >>   log.info("map tasks min: " + minMapTasks + " current: " + mapTasksThatWouldRun);
> >>   if (minMapTasks > mapTasksThatWouldRun) {
> >>     String splitSizeBytes = String.valueOf(vectorFileSizeBytes / minMapTasks);
> >>     log.info("Forcing mapred.max.split.size to " + splitSizeBytes
> >>              + " to ensure minimum map tasks = " + minMapTasks);
> >>     hadoopConf.set("mapred.max.split.size", splitSizeBytes);
> >>   }
> >>
> >>   // there is actually a private method in Hadoop to calculate this
> >>   private long getSplitSize() {
> >>     long blockSize = hadoopConf.getLong("dfs.block.size", 64 * 1024 * 1024);
> >>     long maxSize = hadoopConf.getLong("mapred.max.split.size", Long.MAX_VALUE);
> >>     int minSize = hadoopConf.getInt("mapred.min.split.size", 1);
> >>     long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
> >>     log.info(String.format("min: %,d block: %,d max: %,d split: %,d",
> >>                            minSize, blockSize, maxSize, splitSize));
> >>     return splitSize;
> >>   }
> >>
> >> It seems like there should be a more straightforward way to do this,
> >> but it works for me, and I've used it on a lot of jobs to set a minimum
> >> number of mappers.
> >>
> >> Ryan
> >>
> >> On Jul 5, 2013, at 2:00 PM, Adam Baron wrote:
> >>
> >> > I'm attempting to run
> >> > org.apache.mahout.classifier.df.mapreduce.TestForest on a CSV with
> >> > 200,000 rows that have 500,000 features per row. However, TestForest
> >> > is running extremely slowly, likely because only 1 mapper was
> >> > assigned to the job. This seems strange, because the
> >> > org.apache.mahout.classifier.df.mapreduce.BuildForest step on the
> >> > same data used 1772 mappers and took about 6 minutes.
> >> > (BTW: I know I *shouldn't* use the same data set for the training
> >> > and the testing steps; this is purely a technical experiment to see
> >> > if Mahout's Random Forest can handle the data sizes we typically
> >> > deal with.)
> >> >
> >> > Any idea how to get
> >> > org.apache.mahout.classifier.df.mapreduce.TestForest to use more
> >> > mappers? Glancing at the code (and thinking about what is happening
> >> > intuitively), it should be ripe for parallelization.
> >> >
> >> > Thanks,
> >> > Adam