[jira] [Commented] (MAHOUT-1007) Performance improvement in recommenditembased by splitting long records

Sebastian Schelter (JIRA) Tue, 08 May 2012 06:06:20 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270441#comment-13270441
 ]


Sebastian Schelter commented on MAHOUT-1007:
--------------------------------------------

Unsymmetrify taking very long is a sign that you have many cooccurrences in 
your data. Try experimenting with --threshold to prune away near-zero 
similarities.
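
As a rough illustration of what pruning by a threshold does (a sketch only, 
not Mahout's code; the prune_similarities helper and the sample data are made 
up for this example), dropping near-zero similarities shrinks the rows that 
later steps have to process:

```python
def prune_similarities(rows, threshold):
    """Drop item pairs whose similarity is below `threshold`.

    `rows` maps an item ID to a dict of {other item ID: similarity}.
    Hypothetical helper for illustration; not part of Mahout.
    """
    return {
        item: {other: sim for other, sim in sims.items() if sim >= threshold}
        for item, sims in rows.items()
    }


# Made-up similarity rows: item 1 has two near-zero entries.
rows = {
    1: {2: 0.90, 3: 0.01, 4: 0.002},
    2: {1: 0.90, 3: 0.40},
}
pruned = prune_similarities(rows, threshold=0.05)
```

Which threshold is reasonable depends on the similarity measure and the data, 
so it is worth trying a few values.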

It could be that you are encountering data skew, meaning there are "top items" 
which cooccur with everything. It's crucial to either filter these out or 
sample down their interactions.
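
One possible way to sample down the interactions of such "top items" before 
running the job is sketched below. This is an assumption about how the 
preprocessing could look, not something Mahout provides in this form; the 
function name, the max_per_item cap and the input layout of (user, item) 
pairs are all hypothetical:

```python
import random
from collections import defaultdict


def sample_down_interactions(interactions, max_per_item, seed=0):
    """Keep at most `max_per_item` randomly chosen interactions per item.

    `interactions` is an iterable of (user_id, item_id) pairs.
    Hypothetical preprocessing helper; not part of Mahout.
    """
    rng = random.Random(seed)
    users_by_item = defaultdict(list)
    for user, item in interactions:
        users_by_item[item].append(user)

    sampled = []
    for item, users in users_by_item.items():
        if len(users) > max_per_item:
            # Only items above the cap lose interactions.
            users = rng.sample(users, max_per_item)
        sampled.extend((user, item) for user in users)
    return sampled


# Made-up data: item 42 is a "top item" seen by 1000 users, item 7 is ordinary.
interactions = [(u, 42) for u in range(1000)] + [(1, 7), (2, 7)]
sampled = sample_down_interactions(interactions, max_per_item=100)
```

Items below the cap are left untouched, so only the skewed items are affected.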

                
> Performance improvement in recommenditembased by splitting long records
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-1007
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1007
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.6
>            Reporter: Bhaskar Devireddy
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.7
>
>         Attachments: Patch_1007.patch
>
>
> While running the recommendations with the ASFEMail dataset using the example 
> script provided with Mahout, we noticed that one of the map tasks in the 
> unsymmetrify mapper job has a much longer execution time than the others.  
> Profiling suggests that the problem is the number of elements in each 
> record.  The attached patch addresses this issue by splitting longer records 
> into smaller ones, so that the data is distributed evenly among the 
> unsymmetrify map tasks.
> The patch introduces a new command line option, 
> maxSimilarityReducerVectorSize, for RecommenderJob.  Tested with 
> maxSimilarityReducerVectorSize=5000, it keeps the same functionality, speeds 
> up the unsymmetrify mapper job several-fold on x86 architectures, and 
> increases CPU utilization.  By default the records are not split; setting 
> maxSimilarityReducerVectorSize to a value greater than 0 enables splitting 
> and will increase performance.
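
The record splitting described in the issue can be pictured roughly as 
follows. This is only a sketch of the general idea, not the code in 
Patch_1007.patch; split_record and its arguments are hypothetical names, and 
the convention that a size of 0 or less disables splitting is taken from the 
description above:

```python
def split_record(item_id, entries, max_size):
    """Split one long record into chunks of at most `max_size` entries.

    A `max_size` of 0 or less leaves the record whole, matching the
    default behaviour described in the issue. Hypothetical helper;
    this is not the code from the attached patch.
    """
    if max_size <= 0 or len(entries) <= max_size:
        return [(item_id, entries)]
    return [
        (item_id, entries[start:start + max_size])
        for start in range(0, len(entries), max_size)
    ]


# Made-up record: item 99 has 12000 (other item, similarity) entries.
entries = [(other, 1.0) for other in range(12000)]
chunks = split_record(99, entries, max_size=5000)
```

Each chunk keeps the original item ID, so the pieces can be handed to 
different map tasks and still be attributed to the same item.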

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
