[ https://issues.apache.org/jira/browse/MAHOUT-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268815#comment-13268815 ]
Sean Owen commented on MAHOUT-1007:
-----------------------------------

Why would this improve performance? Sure, you split up records, but you process the same amount of data. If anything, I'd imagine this slows things down. What's the intuition behind why it would be faster?

> Performance improvement in recommenditembased by splitting long records
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-1007
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1007
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.6
>            Reporter: Bhaskar Devireddy
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.7
>
>         Attachments: Patch_1007.patch
>
>
> While running the recommendations with the ASFEMail dataset using the example
> script provided with Mahout, we noticed that one of the map tasks in the
> unsymmetrify mapper job has a much longer execution time than the others.
> Profiling suggests that the problem is the number of elements in each
> record. The attached patch addresses this issue by splitting longer records
> into smaller ones, so that the data is distributed evenly among the
> unsymmetrify map tasks.
> A new command line option, maxSimilarityReducerVectorSize, is introduced for
> RecommenderJob. In testing with maxSimilarityReducerVectorSize=5000,
> functionality is unchanged, the unsymmetrify mapper job runs several times
> faster on x86 architectures, and CPU utilization increases. By default,
> records are not split; setting the command line option
> maxSimilarityReducerVectorSize to a value greater than 0 enables splitting
> and improves performance.
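
For readers following the thread, here is a minimal sketch of the record-splitting idea from the issue description. It is not the code in Patch_1007.patch: the class name `RecordSplitter`, the method `split`, and the use of a plain `List` in place of a Mahout vector are all assumptions made for illustration. The `maxSize` parameter plays the role of maxSimilarityReducerVectorSize, with values of 0 or less meaning "do not split", as the description says is the default.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout or of the attached patch.
final class RecordSplitter {

    private RecordSplitter() {
    }

    // Splits one long record into consecutive chunks of at most maxSize elements.
    // A maxSize of 0 or less leaves the record whole (the unsplit default).
    static <T> List<List<T>> split(List<T> record, int maxSize) {
        List<List<T>> chunks = new ArrayList<>();
        if (maxSize <= 0 || record.size() <= maxSize) {
            chunks.add(new ArrayList<>(record));
            return chunks;
        }
        for (int start = 0; start < record.size(); start += maxSize) {
            int end = Math.min(start + maxSize, record.size());
            // Copy the view so each chunk is independent of the original list.
            chunks.add(new ArrayList<>(record.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> longRecord = new ArrayList<>();
        for (int i = 0; i < 12000; i++) {
            longRecord.add(i);
        }
        // With maxSize = 5000 this yields chunks of 5000, 5000 and 2000 elements.
        for (List<Integer> chunk : split(longRecord, 5000)) {
            System.out.println(chunk.size());
        }
    }
}
```

The total number of elements is unchanged by the split, which is the point of the question above. Any gain would have to come from the chunks being spread more evenly across map tasks rather than from less work overall.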