[ https://issues.apache.org/jira/browse/MAHOUT-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269889#comment-13269889 ]
Sean Owen commented on MAHOUT-1007:
-----------------------------------

I see. Sebastian, what do you think? I wonder if there isn't some other issue at work here, but I don't know.

Say you have n mappers in UnsymmetrifyMapper. It sounds like you're saying that one record from SimilarityReducer can be so large that it becomes larger than 1/n of all the input the mappers process. Yes, then you would have a gain from chopping up the input. But that sounds very, very large indeed!

Do we really have a problem that some vectors from SimilarityReducer need to be pruned, capped in size? If so, that's the better way to attack this, I think. (Or if we don't want to cap, yep, I agree with this sort of patch. In fact, keep it simple -- hard code a max vector size.)

I don't think we have a partitioning problem; otherwise this technique wouldn't yield any speedup.

> Performance improvement in recommenditembased by splitting long records
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-1007
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1007
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.6
>            Reporter: Bhaskar Devireddy
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.7
>
>         Attachments: Patch_1007.patch
>
>
> While running the recommendations with the ASFEMail dataset using the example
> script provided with Mahout, we are noticing that one of the map tasks in the
> unsymmetrify mapper job has a much longer execution time than the others. While
> profiling, the problem seems to be with the number of elements in each
> record. The attached patch addresses this issue by splitting longer records
> into smaller ones, so the data is distributed evenly among the unsymmetrify map
> tasks.
> A new command line option, maxSimilarityReducerVectorSize, is
> introduced for RecommenderJob.
> Tested with maxSimilarityReducerVectorSize=5000: with the same
> functionality, this speeds up the unsymmetrify mapper job several times over
> on x86 architectures and increases CPU utilization. By default the records
> are not split; setting the command line option
> maxSimilarityReducerVectorSize to a value greater than 0 enables splitting
> and will increase performance.
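
As a rough illustration of the splitting idea described in the issue, a long record could be cut into consecutive chunks of at most maxSimilarityReducerVectorSize elements, with 0 or a negative value meaning no splitting. This is only a sketch and is not the code in Patch_1007.patch: the class and method names (RecordSplitter, split) are made up for the example, and a plain Java list stands in for the vectors the actual job handles.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not taken from Patch_1007.patch. */
final class RecordSplitter {

  private RecordSplitter() {}

  /**
   * Splits one long record into consecutive chunks of at most maxSize elements.
   * A maxSize of 0 or less leaves the record unsplit, mirroring the default
   * behaviour described in the issue.
   */
  static <T> List<List<T>> split(List<T> record, int maxSize) {
    List<List<T>> chunks = new ArrayList<>();
    // No cap configured, or the record already fits: emit it as a single chunk.
    if (maxSize <= 0 || record.size() <= maxSize) {
      chunks.add(new ArrayList<>(record));
      return chunks;
    }
    // Walk the record in steps of maxSize; the last chunk may be shorter.
    for (int start = 0; start < record.size(); start += maxSize) {
      int end = Math.min(start + maxSize, record.size());
      chunks.add(new ArrayList<>(record.subList(start, end)));
    }
    return chunks;
  }
}
```

With a cap of 5000, a record of 12000 elements becomes three records of 5000, 5000, and 2000 elements, which can then be processed separately instead of by a single map task.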