[ 
https://issues.apache.org/jira/browse/MAHOUT-317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840468#action_12840468
 ] 

Ankur commented on MAHOUT-317:
------------------------------

To minimize the spill issue you'll have to play with:

1. Increasing the JVM max heap size of the mapper.
2. Increasing the sort buffer size (io.sort.mb; the stock Hadoop default is 
100 MB, though clusters are often configured higher, e.g. 256 MB). This can be 
increased further if the user's mapper does not consume too much memory. 
3. Increasing the spill threshold (io.sort.spill.percent, typically defaults to 
0.80, i.e. 80% of the above buffer). Try setting it to 0.90 (90%).
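
As a rough sketch of the three knobs above, the snippet below collects them as 
key/value pairs and estimates how many bytes of map output the buffer holds 
before a spill starts. The class name, method names and the numbers (1024 MB 
heap, 200 MB buffer, 0.90 threshold) are made up for illustration; 
mapred.child.java.opts is assumed to be where the mapper heap is set on your 
Hadoop version, and the byte estimate ignores the part of the buffer reserved 
for record metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SpillSettings {

    /** Builds the three settings; heapMb, sortMb and spillPercent are illustrative inputs. */
    static Map<String, String> tunedSettings(int heapMb, int sortMb, double spillPercent) {
        if (sortMb >= heapMb) {
            // The sort buffer lives inside the mapper heap, so it must be smaller.
            throw new IllegalArgumentException("io.sort.mb must be smaller than the mapper heap");
        }
        if (spillPercent <= 0.0 || spillPercent > 1.0) {
            throw new IllegalArgumentException("io.sort.spill.percent must be in (0, 1]");
        }
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("mapred.child.java.opts", "-Xmx" + heapMb + "m"); // 1. mapper heap (assumed property)
        settings.put("io.sort.mb", String.valueOf(sortMb));             // 2. sort buffer size
        settings.put("io.sort.spill.percent", String.valueOf(spillPercent)); // 3. spill threshold
        return settings;
    }

    /** Approximate buffer fill level, in bytes, at which a spill is triggered. */
    static long spillThresholdBytes(int sortMb, double spillPercent) {
        return Math.round(sortMb * 1024L * 1024L * spillPercent);
    }

    public static void main(String[] args) {
        // In a real job each pair would be copied onto the job configuration.
        tunedSettings(1024, 200, 0.90)
            .forEach((key, value) -> System.out.println(key + "=" + value));
        System.out.println("spill starts near " + spillThresholdBytes(200, 0.90) + " bytes");
    }
}
```

Raising io.sort.mb only helps while the mapper itself leaves enough heap free, 
which is why the sketch rejects a buffer that is not smaller than the heap.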

> Collocations: Eliminate in-memory frequency calculation
> -------------------------------------------------------
>
>                 Key: MAHOUT-317
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-317
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>             Fix For: 0.3
>
>         Attachments: MAHOUT-317.patch, MAHOUT-317.patch, MAHOUT-317.patch
>
>
> see: 
> http://www.lucidimagination.com/search/document/ae484d53e969250e/who_owns_mahout_bucket_on_s3
> The collocation code currently uses maps in the CollocCombiner and 
> CollocReducer to perform frequency calculations which can cause the process 
> to exceed the heap space if a large number of ngrams exist for any given 
> subgram.
> Convert the code to use a composite key / secondary sort to avoid the need 
> for an in-memory map for frequency calculations. 
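
The composite key / secondary sort idea from the issue description can be 
illustrated outside Hadoop with plain Java. This is only a sketch of the 
general technique, not the code in the attached patch: GramKey, ORDER and 
countSorted are hypothetical names, and the explicit sort stands in for the 
sorting the MapReduce shuffle would do. Once keys arrive ordered by 
(subgram, ngram), frequencies can be counted with a single running counter 
instead of a map.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {

    /** Hypothetical composite key: subgram is the grouping part, ngram the secondary part. */
    static final class GramKey {
        final String subgram;
        final String ngram;

        GramKey(String subgram, String ngram) {
            this.subgram = subgram;
            this.ngram = ngram;
        }
    }

    /** Orders by subgram first, then by ngram within the same subgram. */
    static final Comparator<GramKey> ORDER =
        Comparator.comparing((GramKey k) -> k.subgram).thenComparing(k -> k.ngram);

    /** Counts runs of equal keys in sorted order, keeping one counter in memory at a time. */
    static List<String> countSorted(List<GramKey> keys) {
        List<GramKey> sorted = new ArrayList<>(keys);
        sorted.sort(ORDER); // stands in for the framework's sort phase

        List<String> out = new ArrayList<>();
        GramKey current = null;
        int count = 0;
        for (GramKey k : sorted) {
            if (current != null && ORDER.compare(current, k) != 0) {
                // Key changed: the previous run is complete, so emit it and reset.
                out.add(current.subgram + "\t" + current.ngram + "\t" + count);
                count = 0;
            }
            current = k;
            count++;
        }
        if (current != null) {
            out.add(current.subgram + "\t" + current.ngram + "\t" + count);
        }
        return out;
    }

    public static void main(String[] args) {
        List<GramKey> keys = new ArrayList<>();
        keys.add(new GramKey("the", "the cat"));
        keys.add(new GramKey("the", "the dog"));
        keys.add(new GramKey("the", "the cat"));
        keys.add(new GramKey("a", "a cat"));
        countSorted(keys).forEach(System.out::println);
    }
}
```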

