[ 
https://issues.apache.org/jira/browse/MAHOUT-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269889#comment-13269889
 ] 

Sean Owen commented on MAHOUT-1007:
-----------------------------------

I see. Sebastian, what do you think? I wonder if there isn't some other issue at 
work here, but I don't know.

Say you have n mappers in UnsymmetrifyMapper. It sounds like you're saying that 
one record from SimilarityReducer can be so large that it becomes larger than 
1/n of all the input the mappers process. Yes, in that case you would gain from 
chopping up the input.
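
To make that arithmetic concrete, here is a rough sketch under a simplified model: each record is processed by exactly one mapper, and cost is proportional to the number of elements in the record. The class name, method names, and sample sizes are made up for illustration and are not part of Mahout or the attached patch.

```java
import java.util.Arrays;

public class SkewEstimate {

    /** Lower bound on the slowest mapper's load, in elements, when records cannot be split. */
    static long slowestMapperLowerBound(long[] recordSizes, int numMappers) {
        long total = Arrays.stream(recordSizes).sum();
        long largest = Arrays.stream(recordSizes).max().orElse(0L);
        long evenShare = (total + numMappers - 1) / numMappers; // ceil(total / n)
        // Some mapper must take the largest record whole, and some mapper
        // must take at least an even share of the total.
        return Math.max(largest, evenShare);
    }

    /** True when a single record alone exceeds an even 1/n share of the input. */
    static boolean splittingCanHelp(long[] recordSizes, int numMappers) {
        long total = Arrays.stream(recordSizes).sum();
        long largest = Arrays.stream(recordSizes).max().orElse(0L);
        return largest * numMappers > total;
    }

    public static void main(String[] args) {
        long[] sizes = {90_000, 2_000, 3_000, 5_000}; // hypothetical element counts
        System.out.println(slowestMapperLowerBound(sizes, 4));
        System.out.println(splittingCanHelp(sizes, 4));
    }
}
```

In this model, chopping up records only pays off when the largest record dominates the even share; otherwise the bound is already set by total/n.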

But that sounds very, very large indeed! Do we really have a problem where some 
vectors from SimilarityReducer need to be pruned or capped in size? If so, that's 
the better way to attack this, I think. (Or if we don't want to cap, yep, I agree 
with this sort of patch. In fact, keep it simple -- hard-code a max vector 
size.)
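
The two options can be contrasted with a minimal sketch on plain Java lists. This is not the patch's code and does not use Mahout's Vector classes; the names are hypothetical. Treating a non-positive maximum as "do not split" is an assumption that mirrors the default described in the issue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LongRecordHandling {

    /** Split one long record into consecutive pieces of at most maxSize elements. */
    static <T> List<List<T>> split(List<T> record, int maxSize) {
        if (maxSize <= 0) {
            // Assumed default: leave the record whole.
            return Collections.singletonList(record);
        }
        List<List<T>> pieces = new ArrayList<>();
        for (int start = 0; start < record.size(); start += maxSize) {
            int end = Math.min(start + maxSize, record.size());
            pieces.add(new ArrayList<>(record.subList(start, end)));
        }
        return pieces;
    }

    /** Cap: keep only the first maxSize elements and discard the rest. */
    static <T> List<T> cap(List<T> record, int maxSize) {
        int end = Math.max(0, Math.min(maxSize, record.size()));
        return new ArrayList<>(record.subList(0, end));
    }
}
```

Splitting keeps every element and only changes how the work is divided, while capping changes the data itself; a real cap would presumably keep the most similar entries rather than simply the first ones.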

I don't think we have a partitioning problem; otherwise this technique wouldn't 
yield any speedup.
                
> Performance improvement in recommenditembased by splitting long records
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-1007
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1007
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.6
>            Reporter: Bhaskar Devireddy
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.7
>
>         Attachments: Patch_1007.patch
>
>
> While running the recommendations with the ASFEMail dataset using the example 
> script provided with Mahout, we noticed that one of the map tasks in the 
> unsymmetrify mapper job has a much longer execution time than the others.  
> Profiling suggests that the problem is the number of elements in each 
> record.  The attached patch addresses this issue by splitting longer records 
> into smaller ones, so that the data is distributed evenly among the 
> unsymmetrify map tasks.
> A new command line option, maxSimilarityReducerVectorSize, is 
> introduced for RecommenderJob.  Tested with 
> maxSimilarityReducerVectorSize=5000, the patch keeps the same functionality, 
> speeds up the unsymmetrify mapper job by several times on x86 architectures, 
> and increases CPU utilization.  By default the records are not split; setting 
> the command line option maxSimilarityReducerVectorSize to a value greater 
> than 0 enables splitting and can increase performance.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
