Hey Kris,

Yeah, -D args won't help, as they are at best global, and these
modifications need to happen on a per-phase basis.

If you've found that, for example, phase X always benefits from using
8x more reducers than Hadoop would assign, then we can easily add that
to the code. It's just a matter of calling Job.setNumReduceTasks() for
that phase's Job. The same goes for io.sort.mb changes.

It would be much better to adjust the number programmatically rather
than expose another knob for the user to twiddle, since most users
won't know what the best value is.
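For illustration, such a programmatic choice might just scale the reducer count with total input size. This is a minimal sketch with made-up constants, not anything Mahout actually does; the result would then be passed to Job.setNumReduceTasks() for the phase in question:

```java
// Hypothetical heuristic for choosing a reducer count from input size.
// The 256 MB target per reducer is illustrative only.
public class ReducerHeuristic {
    static final long BYTES_PER_REDUCER = 256L * 1024 * 1024;

    // Roughly one reducer per 256 MB of input, clamped to [1, maxReducers].
    public static int numReducers(long inputBytes, int maxReducers) {
        long n = (inputBytes + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER;
        return (int) Math.min(Math.max(n, 1), maxReducers);
    }

    public static void main(String[] args) {
        // 1 GB of input, capped at 32 reducers
        System.out.println(numReducers(1L << 30, 32)); // prints 4
    }
}
```

The cap would come from the cluster's configured reduce capacity, so a misbehaving heuristic can't oversubscribe the cluster.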

You'd be welcome to kick off the discussion with a patch in JIRA, sure.

Sean

On Wed, May 4, 2011 at 4:26 PM, Kris Jack <[email protected]> wrote:
> Hi,
>
> I have been running org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
> with large datasets (e.g. >1GB) to generate recommendations.  When running
> it, I noticed that the generation time was long and that my cluster's
> resources were being underused.  I hacked the code to reduce generation time
> by:
> 1) specifying how many mappers and reducers should be allocated for
> individual phases based upon my data set;
> 2) applying the
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.setIOSort() function
> to more phases than just the last one in the RecommenderJob pipeline.
>
> These changes reduce the amount of time taken to generate recommendations by
> making better use of cluster resources.  I'd like to propose a patch with
> these updates.  Here's how I've implemented it locally, let me know what you
> think:
> 1) add command line arguments that allow the number of mappers and reducers
> to be set for each of the individual phases in the job (all 12 phases,
> expanding org.apache.mahout.math.hadoop.similarity.RowSimilarityJob out as 3
> phases).
> 2) before configuring each phase, set the minimum number of mappers by
> adding a function to org.apache.mahout.common.AbstractJob that alters the
> job's configuration.  To do this, I count the number of bytes in the input
> file(s) and set the mapred.max.split.size parameter accordingly.  Note
> that, due to the way Hadoop allocates mappers, you cannot set the exact
> number of mappers, but you can set a minimum.
> 3) before configuring each phase, set the user-defined number of reducers.
> 4) add calls to
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.setIOSort() for
> phases that require increased io.sort.mb, to reduce the amount of data
> spilled.
>
> Please let me know if you are interested in such a patch and/or if you have
> any better ideas for getting around this problem.
>
> Regards,
> Kris
>
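
For reference, Kris's step 2 — forcing a minimum number of mappers by capping the split size — comes down to simple arithmetic. A minimal sketch, illustrative only; in the real job the result would be set as mapred.max.split.size on that phase's Configuration:

```java
// Illustrative only: derive a mapred.max.split.size value that should yield
// at least `minMappers` map tasks for `inputBytes` of input, since Hadoop
// creates at least one mapper per split.
public class SplitSizeHelper {
    public static long maxSplitSize(long inputBytes, int minMappers) {
        // Capping each split at inputBytes / minMappers bytes forces
        // Hadoop to create at least minMappers splits.
        return Math.max(1, inputBytes / minMappers);
    }

    public static void main(String[] args) {
        // 1 GB of input, at least 8 mappers
        System.out.println(maxSplitSize(1L << 30, 8)); // prints 134217728
    }
}
```

Note this only sets a floor; block boundaries and file counts can still push the actual mapper count higher.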
