Hi, I have been running org.apache.mahout.cf.taste.hadoop.item.RecommenderJob on large datasets (>1 GB) to generate recommendations. I noticed that generation took a long time and that my cluster's resources were underused. I modified the code to cut the generation time by: 1) specifying how many mappers and reducers to allocate for each individual phase, based on my dataset; 2) applying org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.setIOSort() to more phases than just the last one in the RecommenderJob pipeline.
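To illustrate the mapper tweak: Hadoop derives the number of map tasks from the input size and the split size, so you can only guarantee a minimum number of mappers by capping mapred.max.split.size. A minimal sketch of that arithmetic (class and method names are illustrative, not from the patch):

```java
public class SplitSizeSketch {

    /**
     * Cap for mapred.max.split.size so that at least minMappers map
     * tasks are created for totalInputBytes of input. Hadoop may still
     * create more mappers than this (block boundaries, multiple files),
     * but never fewer.
     */
    static long maxSplitSize(long totalInputBytes, int minMappers) {
        // Rounding down is safe: a smaller split size can only
        // increase the resulting mapper count, never decrease it.
        return Math.max(1, totalInputBytes / minMappers);
    }

    /** Lower bound on mappers Hadoop allocates for that split size. */
    static long minMappersFor(long totalInputBytes, long maxSplitSize) {
        return (totalInputBytes + maxSplitSize - 1) / maxSplitSize; // ceil
    }

    public static void main(String[] args) {
        long input = 1L << 30;  // ~1 GB of input
        long split = maxSplitSize(input, 40);
        System.out.println(split);
        System.out.println(minMappersFor(input, split)); // >= 40
    }
}
```

In the patch itself this value would be written into the job's configuration before each phase is submitted, which is what the proposed AbstractJob helper below does.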
These changes reduce the time taken to generate recommendations by making better use of cluster resources, and I'd like to propose a patch with them. Here's how I've implemented it locally; let me know what you think:
1) Add command-line arguments that let the number of mappers and reducers be set for each individual phase of the job (all 12 phases, counting org.apache.mahout.math.hadoop.similarity.RowSimilarityJob as 3 phases).
2) Before configuring each phase, set the minimum number of mappers via a new function on org.apache.mahout.common.AbstractJob that alters the job's configuration: I count the number of bytes in the input file(s) and set the mapred.max.split.size parameter accordingly. Note that because of the way Hadoop allocates mappers, you cannot fix the exact number of mappers, only the minimum number to be allocated.
3) Before configuring each phase, set the user-defined number of reducers.
4) Add calls to org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.setIOSort() for phases that need a larger io.sort.mb, to reduce the amount of data spilled to disk.
Please let me know if you are interested in such a patch and/or if you have better ideas to get around this problem.
Regards,
Kris
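P.S. To show why raising io.sort.mb helps the heavier phases: a map task spills to disk each time its in-memory sort buffer fills, so the spill count drops roughly linearly as the buffer grows. A back-of-the-envelope sketch (the 0.8 factor mirrors Hadoop's io.sort.spill.percent default; the class and method names are illustrative only):

```java
public class SpillEstimate {

    /**
     * Rough number of spill files a map task writes: one spill each
     * time map output fills ioSortMb * spillPercent of the sort buffer.
     * Real Hadoop spills on byte/record thresholds concurrently with
     * the map, so treat this as an order-of-magnitude estimate.
     */
    static long estimatedSpills(long mapOutputBytes, int ioSortMb,
                                double spillPercent) {
        double thresholdBytes = ioSortMb * 1024.0 * 1024.0 * spillPercent;
        return Math.max(1, (long) Math.ceil(mapOutputBytes / thresholdBytes));
    }

    public static void main(String[] args) {
        long output = 2L * 1024 * 1024 * 1024;  // ~2 GB of map output
        // Default 100 MB buffer vs. a raised 400 MB buffer:
        System.out.println(estimatedSpills(output, 100, 0.8)); // 26 spills
        System.out.println(estimatedSpills(output, 400, 0.8)); // 7 spills
    }
}
```

Fewer spills means fewer intermediate files to merge at the end of the map, which is where the saving shows up for the phases with large map output.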
