Hi,

I have been running org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
with large datasets (e.g. >1GB) to generate recommendations.  When running
it, I noticed that the generation time was long and that my cluster's
resources were being underused.  I hacked the code to reduce generation time
by:
1) specifying how many mappers and reducers should be allocated for
individual phases based upon my data set;
2) applying the
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.setIOSort() function
to more phases than just the last one in the RecommenderJob pipeline.
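To make the second point concrete, here is a minimal, self-contained sketch of what a setIOSort-style helper does. Note that java.util.Properties stands in for Hadoop's Configuration here, and the buffer values are illustrative, not the ones RecommenderJob actually uses:

```java
import java.util.Properties;

public class IoSortSketch {

    // Hypothetical stand-in for applying setIOSort() to one phase.
    // The real helper would set these keys on the phase's Hadoop
    // Configuration before the job is submitted.
    static void setIOSort(Properties phaseConf) {
        // A larger map-side sort buffer means fewer spills to disk
        // when a phase shuffles a lot of intermediate data.
        phaseConf.setProperty("io.sort.mb", "256");
        // Merge more spill files per pass on the reduce side.
        phaseConf.setProperty("io.sort.factor", "100");
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        setIOSort(conf);
        System.out.println("io.sort.mb=" + conf.getProperty("io.sort.mb"));
        System.out.println("io.sort.factor=" + conf.getProperty("io.sort.factor"));
    }
}
```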

These changes reduce the amount of time taken to generate recommendations by
making better use of cluster resources.  I'd like to propose a patch with
these updates.  Here's how I've implemented it locally, let me know what you
think:
1) add command line arguments that allow the number of mappers and reducers
to be set for each of the individual phases in the job (12 phases in total,
counting org.apache.mahout.math.hadoop.similarity.RowSimilarityJob as 3
phases).
2) before configuring each phase, set the minimum number of mappers via a
new function on org.apache.mahout.common.AbstractJob that alters the job's
configuration.  To do this, I count the number of bytes in the input
file(s) and set the mapred.max.split.size parameter accordingly.  Note that
because of the way Hadoop allocates mappers, you cannot set the exact
number of mappers, only a minimum.
3) before configuring each phase, set the user-defined number of reducers.
4) add calls to
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.setIOSort() for
phases that benefit from an increased io.sort.mb, reducing the amount of
data spilled to disk.
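The arithmetic behind point 2 can be sketched as follows. Since Hadoop creates at least ceil(totalInputBytes / maxSplitSize) map tasks, capping the split size from above forces a minimum mapper count. The class and method names here are illustrative, not the actual AbstractJob additions:

```java
// Hypothetical sketch of deriving mapred.max.split.size from the total
// input size and a user-requested minimum number of mappers.
public class SplitSizeSketch {

    static long maxSplitSizeFor(long totalInputBytes, int minMappers) {
        // Floor division: a split size of totalInputBytes / minMappers
        // guarantees ceil(totalInputBytes / splitSize) >= minMappers.
        return Math.max(1L, totalInputBytes / minMappers);
    }

    public static void main(String[] args) {
        long totalBytes = 1_500_000_000L; // e.g. a 1.5 GB input
        int minMappers = 30;
        long splitSize = maxSplitSizeFor(totalBytes, minMappers);
        // The real code would set this on the job's Configuration:
        // conf.setLong("mapred.max.split.size", splitSize);
        System.out.println("mapred.max.split.size=" + splitSize);
    }
}
```

With a 1.5 GB input and a requested minimum of 30 mappers, this yields a 50 MB split cap, so Hadoop allocates at least 30 map tasks for that phase.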

Please let me know if you would be interested in such a patch, or if you
have better ideas for getting around this problem.

Regards,
Kris
