One trick to getting more mappers on a job when running from the command
line is to pass a '-Dmapred.max.split.size=xxxx' argument, where xxxx is a
size in bytes. Pick a split size of roughly (input size / desired number of
mappers): for a hypothetical 10MB input set where you want to force ~100
mappers, that works out to '-Dmapred.max.split.size=100000' (about 100KB
per split).
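
For example, a k-means run forced onto smaller splits might look roughly
like this (a sketch only: the paths and the non -D options are illustrative,
not taken from this thread, and assume a Mahout 0.x style kmeans driver):

    mahout kmeans \
      -Dmapred.max.split.size=100000 \
      -i /user/me/tfidf-vectors \
      -c /user/me/initial-clusters \
      -o /user/me/kmeans-output \
      -x 10 -k 50 -ow

Since these drivers generally run through Hadoop's ToolRunner, the -D
override needs to appear before the job-specific options.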


On Wed, Jul 31, 2013 at 4:57 AM, Fuhrmann Alpert, Galit <galp...@ebay.com> wrote:

>
> Hi,
>
> It sounds to me like this could be related to one of the questions I posted
> several days ago (is it?):
> My mahout clustering processes seem to be running very slowly (a good
> several hours on just ~1M items), and I'm wondering if there's anything
> that needs to be changed in the settings/configuration (and if so, how?).
>         I'm running on a large cluster and could potentially use thousands
> of nodes (mappers/reducers). However, my mahout processes (kmeans/canopy)
> are only using at most 5 mappers (I tried this on several data sets).
>         I've tried to set the number of mappers with something like
> -Dmapred.map.tasks=100, but this didn't seem to have any effect; it still
> only uses <=5 mappers.
>         Is there a different way to set the number of mappers/reducers for
> a mahout process?
>         Or is there another configuration issue I need to consider?
>
> I'd definitely be happy to use such a parameter; does it not exist?
> (I'm running mahout as installed on the cluster.)
>
> Is there currently a workaround, besides running a mahout jar as a Hadoop
> job?
> When I originally tried to run a mahout jar that uses KMeansDriver (and
> that runs great on my local machine), it did not even initiate a job on
> the Hadoop cluster. It seemed to be running in parallel, but in fact it
> was running only on the local node. Is this a known issue? Is there a fix
> for this? (I ended up dropping it and calling mahout step by step from the
> command line, but I'd be happy to know if there is a fix for this.)
>
> Thanks,
>
> Galit.
>
> -----Original Message-----
> From: Ryan Josal [mailto:rjo...@gmail.com]
> Sent: Monday, July 29, 2013 9:33 PM
> To: Adam Baron
> Cc: Ryan Josal; user@mahout.apache.org
> Subject: Re: Run more than one mapper for TestForest?
>
> If you're running mahout from the CLI, you'll have to modify the Hadoop
> config file or your env manually for each job.  This is code I put into my
> custom job executions so I didn't have to calculate and set that up every
> time.  Maybe that's your best route in that position.  You could just
> provide your own mahout jar and run it as you would any other Hadoop job
> and ignore the installed Mahout.  I do think this could be a useful
> parameter for a number of standard mahout jobs though; I know I would use
> it.  Does anyone in the mahout community see this as a generally useful
> feature for a Mahout job?
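
A minimal sketch of the "provide your own jar" route described above,
assuming a fat jar that bundles the Mahout classes the job needs and a
driver that runs through Hadoop's ToolRunner so -D overrides are applied
(the jar and class names here are hypothetical placeholders):

    hadoop jar my-mahout-jobs-with-deps.jar com.example.MyClusteringJob \
      -Dmapred.max.split.size=100000 \
      <job-specific options>

Because the driver is your own, you can also set the property directly on
the job's Configuration, as in the snippet further down this thread.
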
>
> Ryan
>
> On Jul 29, 2013, at 10:25, Adam Baron <adam.j.ba...@gmail.com> wrote:
>
> > Ryan,
> >
> > Thanks for the fix; the code looks reasonable to me.  Which version of
> Mahout will this be in?  0.9?
> >
> > Unfortunately, I'm using a large shared Hadoop cluster which is not
> administered by my team.  So I'm not in a position to push the latest from
> the Mahout dev trunk into our environment; the admins will only install
> official releases.
> >
> > Regards,
> >           Adam
> >
> > On Sun, Jul 28, 2013 at 5:37 PM, Ryan Josal <r...@josal.com> wrote:
> >> Late reply, but for what it's still worth, since I've seen a couple
> other threads here on the topic of too few mappers, I added a parameter to
> set a minimum number of mappers.  Some of my mahout jobs needed more
> mappers, but were not given many because of the small input file size.
> >>
> >>         // CLI option: lets the caller request a minimum number of map tasks
> >>         addOption("minMapTasks", "m", "Minimum number of map tasks to run", String.valueOf(1));
> >>
> >>         // Shrink mapred.max.split.size if the default split size would
> >>         // produce fewer than minMapTasks map tasks for this input.
> >>         int minMapTasks = Integer.parseInt(getOption("minMapTasks"));
> >>         int mapTasksThatWouldRun = (int) (vectorFileSizeBytes / getSplitSize()) + 1;
> >>         log.info("map tasks min: " + minMapTasks + " current: " + mapTasksThatWouldRun);
> >>         if (minMapTasks > mapTasksThatWouldRun) {
> >>             String splitSizeBytes = String.valueOf(vectorFileSizeBytes / minMapTasks);
> >>             log.info("Forcing mapred.max.split.size to " + splitSizeBytes + " to ensure minimum map tasks = " + minMapTasks);
> >>             hadoopConf.set("mapred.max.split.size", splitSizeBytes);
> >>         }
> >>
> >>     // there is actually a private method in hadoop to calculate this
> >>     private long getSplitSize() {
> >>         long blockSize = hadoopConf.getLong("dfs.block.size", 64 * 1024 * 1024);
> >>         long maxSize = hadoopConf.getLong("mapred.max.split.size", Long.MAX_VALUE);
> >>         int minSize = hadoopConf.getInt("mapred.min.split.size", 1);
> >>         long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
> >>         log.info(String.format("min: %,d block: %,d max: %,d split: %,d", minSize, blockSize, maxSize, splitSize));
> >>         return splitSize;
> >>     }
> >>
> >> It seems like there should be a more straightforward way to do this,
> but it works for me and I've used it on a lot of jobs to set a minimum
> number of mappers.
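
For reference, a job patched with the snippet above would expose the new
option roughly like this (a sketch only; the option is not part of any
released Mahout job, and <jobname> plus the other arguments are
placeholders):

    mahout <jobname> --minMapTasks 100 <other job options>
    # or the short form declared in addOption:
    mahout <jobname> -m 100 <other job options>
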
> >>
> >> Ryan
> >>
> >> On Jul 5, 2013, at 2:00 PM, Adam Baron wrote:
> >>
> >> > I'm attempting to run
> >> > org.apache.mahout.classifier.df.mapreduce.TestForest
> >> > on a CSV with 200,000 rows that have 500,000 features per row.
> >> > However, TestForest is running extremely slowly, likely because only
> >> > 1 mapper was assigned to the job.  This seems strange because the
> >> > org.apache.mahout.classifier.df.mapreduce.BuildForest step on the
> >> > same data used 1772 mappers and took about 6 minutes.  (BTW: I know I
> >> > *shouldn't* use the same data set for the training and the testing
> >> > steps; this is purely a technical experiment to see if Mahout's
> >> > Random Forest can handle the data sizes we typically deal with.)
> >> >
> >> > Any idea on how to get
> >> > org.apache.mahout.classifier.df.mapreduce.TestForest
> >> > to use more mappers?  Glancing at the code (and thinking about what
> >> > is happening intuitively), it should be ripe for parallelization.
> >> >
> >> > Thanks,
> >> >        Adam
> >
>
