Hi Jeff,

The RecommenderJob has 10 phases.  I find that the job runs most efficiently
when the number of mappers and reducers is tuned for each phase.
 Unfortunately, the phases don't all need the same numbers.  Can the -D
arguments be used to allocate different numbers of mappers and reducers to
the individual phases?
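For context, here is a rough sketch of what Jeff's -D suggestion looks like in
practice (the paths and values are made up, and I'm assuming the
recommenditembased entry point; the point is that each -D flag sets one value
for the whole pipeline, which is why I'm asking about per-phase control):

```shell
# Illustrative only: -D flags placed before the job's own options set
# Hadoop configuration values for the entire pipeline, so every phase
# gets the same reducer count.  Paths and values are hypothetical.
REDUCERS=20
CMD="bin/mahout recommenditembased -Dmapred.reduce.tasks=${REDUCERS} --input /data/prefs.csv --output /data/recs --similarityClassname SIMILARITY_COOCCURRENCE"
echo "$CMD"
```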

Thanks,
Kris



2011/5/4 Jeff Eastman <[email protected]>

> Hi Kris,
> Actually, the tide has been running in the other direction: removing
> explicit command line arguments in favor of -D arguments which directly set
> Hadoop configuration values. If you add -D arguments at the beginning of
> your Mahout CLI invocations, you can set any Hadoop job parameter you wish.
>
> -----Original Message-----
> From: Kris Jack [mailto:[email protected]]
> Sent: Wednesday, May 04, 2011 8:27 AM
> To: [email protected]
> Subject: Specifying Number of Mappers and Reducers for Individual Phases in
> RecommenderJob
>
> Hi,
>
> I have been running org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
> with large datasets (e.g. >1GB) to generate recommendations.  When running
> it, I noticed that generation times were long and that my cluster's
> resources were being underused.  I hacked the code to reduce generation
> time by:
> 1) specifying how many mappers and reducers should be allocated for
> individual phases based upon my data set;
> 2) applying the
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.setIOSort() function
> to more phases than just the last one in the RecommenderJob pipeline.
>
> These changes reduce the amount of time taken to generate recommendations
> by making better use of cluster resources.  I'd like to propose a patch
> with these updates.  Here's how I've implemented it locally; let me know
> what you think:
> 1) add command line arguments that allow the number of mappers and reducers
> to be set for each of the individual phases in the job (all 12 phases,
> expanding org.apache.mahout.math.hadoop.similarity.RowSimilarityJob out as
> 3 phases).
> 2) before configuring each phase, set the minimum number of mappers by
> adding a function to org.apache.mahout.common.AbstractJob that alters the
> job's configuration.  To do this, I count the number of bytes in the input
> file(s) and define the mapred.max.split.size parameter.  Note that due to
> the way that Hadoop allocates mappers, you cannot define the exact number
> of mappers to allocate, but you can set the minimum number to be allocated.
> 3) before configuring each phase, set the user-defined number of reducers.
> 4) add calls to
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.setIOSort() for
> phases that require increased io.sort.mb, reducing the amount of data
> spilled to disk.
>
> Please let me know if you are interested in such a patch and/or if you have
> any better ideas to get round this problem.
>
> Regards,
> Kris
>
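To make point 2 above concrete, here is a rough sketch of the split-size
arithmetic I'm using (the byte count and mapper count here are just example
numbers; in practice the total would come from the actual input files, e.g.
via hadoop fs -du):

```shell
# Sketch: given the total input size in bytes and a desired mapper
# count, derive a mapred.max.split.size so that at least that many
# splits (and hence mappers) are created.  Hadoop may still allocate
# more mappers than this, never fewer, so it is a minimum, not exact.
split_size() {
  # $1 = total input bytes, $2 = desired minimum mapper count
  echo $(( $1 / $2 ))
}

# e.g. a 1 GiB input with at least 40 mappers desired:
split_size 1073741824 40   # prints 26843545
```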



-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
