Hi Sean,

It seems to me that there has to be a balance between exposing enough
knobs for users to get the recommender job running well and not
frightening them off from even trying.  For my part, I would certainly
fall on the side of exposing more knobs, as the job was simply unusable in
its current state on our cluster and with our data.  I say that, as you
know, as a fan of Mahout, but as one who spent much time wrestling with it
in order to persuade it to cooperate.

I'll run a few tests with different data sets and get some ratios of
mappers to reducers across the different phases.  If there are general
patterns, then the quick fix that you suggest would be useful.  In my
experience, though, the number of mappers and reducers required tends to
depend on the sizes of the input files, which often depend upon the output
of previous phases.  I'll then create a patch and see what discussion
develops.
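To make the split-size idea concrete, here is a minimal sketch of how a
minimum mapper count could be translated into a mapred.max.split.size
value (the class name, byte count, and mapper count below are illustrative
assumptions, not figures from the actual patch):

```java
// Sketch: derive a mapred.max.split.size value that yields at least the
// desired number of map tasks for a given total input size.  Hadoop will
// create roughly ceil(totalInputBytes / maxSplitSize) mappers, so dividing
// the input evenly gives the largest split size that still forces at
// least minMappers splits.
public class SplitSizeCalculator {

    static long maxSplitSizeFor(long totalInputBytes, int minMappers) {
        return Math.max(1L, totalInputBytes / minMappers);
    }

    public static void main(String[] args) {
        long inputBytes = 1_500_000_000L; // e.g. a hypothetical 1.5 GB phase input
        int minMappers = 24;
        long splitSize = maxSplitSizeFor(inputBytes, minMappers);
        // The value would then be set on that phase's configuration, e.g.:
        //   conf.setLong("mapred.max.split.size", splitSize);
        System.out.println(splitSize);
    }
}
```

The input byte count would come from summing the file sizes of the phase's
input path before the job is configured.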

Thanks,
Kris



2011/5/4 Sean Owen <[email protected]>

> Hey Kris,
>
> Yeah -D args won't help as they are at best global, and the
> modifications need to happen on a per-phase basis.
>
> If you've found that, for example, phase X always benefits from using
> 8x more reducers than Hadoop would assign, then we can easily add that
> to the code. It's just a matter of calling
> Job.setNumReduceTasks() on that phase's Job. Same for io.sort.mb
> changes.
>
> It would be much better to try to adjust the number programmatically
> rather than expose another knob for the user to twiddle, when almost
> any user won't know what the best value is.
>
> You'd be welcome to kick off the discussion with a patch in JIRA, sure.
>
> Sean
>
> On Wed, May 4, 2011 at 4:26 PM, Kris Jack <[email protected]> wrote:
> > Hi,
> >
> > I have been running org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
> > with large datasets (e.g. >1GB) to generate recommendations.  When running
> > it, I noticed that the generation time was long and that my cluster's
> > resources were being underused.  I hacked the code to reduce generation
> > time by:
> > 1) specifying how many mappers and reducers should be allocated for
> > individual phases based upon my data set;
> > 2) applying the
> > org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.setIOSort() function
> > to more phases than just the last one in the RecommenderJob pipeline.
> >
> > These changes reduce the amount of time taken to generate recommendations
> > by making better use of cluster resources.  I'd like to propose a patch
> > with these updates.  Here's how I've implemented it locally; let me know
> > what you think:
> > 1) add command line arguments that allow the number of mappers and
> > reducers to be set for each of the individual phases in the job (all 12
> > phases, expanding org.apache.mahout.math.hadoop.similarity.RowSimilarityJob
> > out as 3 phases).
> > 2) before configuring each phase, set the minimum number of mappers by
> > adding a function to org.apache.mahout.common.AbstractJob that alters the
> > job's configuration.  To do this, I count the number of bytes in the input
> > file(s) and define the mapred.max.split.size parameter.  Note that, due to
> > the way that Hadoop allocates mappers, you cannot set the exact number of
> > mappers to allocate, but you can set the minimum number.
> > 3) before configuring each phase, set the user-defined number of reducers.
> > 4) add calls to
> > org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.setIOSort() for
> > phases that require an increased io.sort.mb to reduce the amount of data
> > spillage.
> >
> > Please let me know if you are interested in such a patch and/or if you
> > have any better ideas to get around this problem.
> >
> > Regards,
> > Kris
> >
>
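Just to illustrate the programmatic route you describe, a per-phase reducer
count could be derived from cluster capacity rather than a user knob.  A
rough sketch (the 8x multiplier and the slot count are hypothetical, not
measured values):

```java
// Sketch: choose a reducer count for a phase from the cluster's reduce
// capacity and a per-phase multiplier, instead of exposing a new option.
public class ReducerHeuristic {

    static int reducersFor(int clusterReduceSlots, int phaseMultiplier) {
        // Shuffle-heavy phases might benefit from several waves of reducers
        // per available slot; clamp to at least one reducer.
        return Math.max(1, clusterReduceSlots * phaseMultiplier);
    }

    public static void main(String[] args) {
        int reducers = reducersFor(16, 8); // 16 slots, hypothetical 8x phase
        // In the pipeline this would become, for that phase's Job:
        //   job.setNumReduceTasks(reducers);
        System.out.println(reducers);
    }
}
```

Whether a fixed multiplier like this holds across data sets is exactly what
the tests above should show.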



-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
