[
https://issues.apache.org/jira/browse/MAHOUT-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269860#comment-13269860
]
Bhaskar Devireddy commented on MAHOUT-1007:
-------------------------------------------
The first map task in the unsymmetrify job takes much longer to run than the
other map tasks in the job on the ASF Mail dataset. This map task runs on a
single core for a longer period of time, performing more work than the others
in the same job. This patch addresses the issue by splitting the data evenly
between the map tasks so that all of them can finish in about the same amount
of time. There is some overhead in splitting the data, but the map tasks
processing the evenly split data can run in parallel on several cores, which
makes this job more scalable. We measured the performance gains with the
patch: the unsymmetrify job speeds up by more than 6X on x86 architectures.
Our test cluster has 4 data nodes with 8 cores each (32 cores in total).
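
To illustrate the idea of splitting long records, here is a minimal sketch in
plain Java. It is not the code from Patch_1007.patch: the class name
RecordSplitter and the method split are hypothetical, and the treatment of a
limit of 0 or less as "do not split" is an assumption based on the default
behavior described for maxSimilarityReducerVectorSize.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Mahout or the patch. */
class RecordSplitter {

    /**
     * Splits one long record into consecutive chunks of at most maxSize
     * elements. Assumption: a maxSize of 0 or less leaves the record unsplit.
     */
    static <T> List<List<T>> split(List<T> record, int maxSize) {
        List<List<T>> chunks = new ArrayList<>();
        if (maxSize <= 0 || record.size() <= maxSize) {
            // Nothing to split: return the record as a single chunk.
            chunks.add(new ArrayList<>(record));
            return chunks;
        }
        for (int start = 0; start < record.size(); start += maxSize) {
            int end = Math.min(start + maxSize, record.size());
            // Copy the slice so each chunk is independent of the input list.
            chunks.add(new ArrayList<>(record.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // A record with 12000 elements and a limit of 5000, the value used in testing.
        List<Integer> record = new ArrayList<>();
        for (int i = 0; i < 12000; i++) {
            record.add(i);
        }
        for (List<Integer> chunk : split(record, 5000)) {
            System.out.println(chunk.size());
        }
    }
}
```

With a limit of 5000, a record of 12000 elements becomes three smaller records
(5000, 5000 and 2000 elements) that can be handed to different map tasks
instead of being processed by a single one.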
> Performance improvement in recommenditembased by splitting long records
> -----------------------------------------------------------------------
>
> Key: MAHOUT-1007
> URL: https://issues.apache.org/jira/browse/MAHOUT-1007
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Affects Versions: 0.6
> Reporter: Bhaskar Devireddy
> Assignee: Sean Owen
> Priority: Minor
> Fix For: 0.7
>
> Attachments: Patch_1007.patch
>
>
> While running the recommendations on the ASFEMail dataset using the example
> script provided with Mahout, we noticed that one of the map tasks in the
> unsymmetrify mapper job has a much longer execution time than the others.
> Profiling indicates that the problem is the number of elements in each
> record. The attached patch addresses this issue by splitting longer records
> into smaller ones, so that the data is distributed evenly among the
> unsymmetrify map tasks.
> A new command line option, maxSimilarityReducerVectorSize, is introduced
> for RecommenderJob. Tested with maxSimilarityReducerVectorSize=5000, the
> patch keeps the same functionality, speeds up the unsymmetrify mapper job by
> several X on x86 architectures, and increases CPU utilization. By default
> the records are not split; setting the command line option
> maxSimilarityReducerVectorSize to a value greater than 0 enables splitting
> and increases performance.