[
https://issues.apache.org/jira/browse/MAHOUT-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295584#comment-13295584
]
CodyInnowhere commented on MAHOUT-1032:
---------------------------------------
Well, we have a billion+ distinct items (www.taobao.com); the test data set is
a subset of our online items. I see the reason for the index mapping, but it
makes enterprise-scale data sets difficult to fit into Mahout CF.
BTW, the index mapping is also a problem with this many items, since our
itemIDs may exceed Integer.MAX_VALUE.
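The Integer.MAX_VALUE concern can be made concrete. The sketch below re-implements the idea behind an idToIndex-style mapping (hashing a long ID down to a non-negative int); the exact hash shown is an assumption for illustration, not a copy of the Mahout source. Because 64 bits are folded into 31, distinct long itemIDs can collide, and the original ID cannot be recovered from the index alone:

```java
// Sketch (an assumption, not the Mahout source): fold a 64-bit item ID
// into a non-negative 31-bit index, as any idToIndex-style mapping must.
public class IdToIndexSketch {

    static int idToIndex(long itemID) {
        // XOR the high and low 32 bits, then clear the sign bit.
        return 0x7FFFFFFF & ((int) itemID ^ (int) (itemID >>> 32));
    }

    public static void main(String[] args) {
        // Two distinct long IDs that land on the same int index:
        System.out.println(idToIndex(1L));            // prints 1
        System.out.println(idToIndex(4294967296L));   // 2^32, also prints 1
        // An ID beyond Integer.MAX_VALUE still maps to some index,
        // but the mapping is lossy: the ID is unrecoverable from it.
        System.out.println(idToIndex(10_000_000_000L));
    }
}
```

This lossiness is exactly why the reducer must hold the full index-to-itemID map in memory to translate indices back to real IDs.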
> AggregateAndRecommendReducer gets OOM in setup() method
> -------------------------------------------------------
>
> Key: MAHOUT-1032
> URL: https://issues.apache.org/jira/browse/MAHOUT-1032
> Project: Mahout
> Issue Type: Bug
> Components: Collaborative Filtering
> Affects Versions: 0.5, 0.6, 0.7, 0.8
> Environment: hadoop cluster with -Xmx set to 2G
> Reporter: CodyInnowhere
> Assignee: Sean Owen
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> This bug is actually caused by the very first job, itemIDIndex. That job maps
> each itemID to an integer index, and the later AggregateAndRecommendReducer
> tries to read all items into the OpenIntLongHashMap indexItemIDMap. However,
> for large data sets (e.g., my test data set covers 100 million+ items, not an
> unusual count for a large e-commerce website), tasks run out of memory in the
> setup() method. I don't think the itemIDIndex job is necessary; without it,
> the final AggregateAndRecommend step would not have to read all items into
> memory to perform the reverse index mapping.
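A back-of-envelope estimate shows why loading the map in setup() fails at this scale. An OpenIntLongHashMap-style table stores parallel arrays (int keys, long values, and per-slot state bytes); the 13-bytes-per-slot figure and the 0.5 effective load factor below are assumptions for illustration, not measured Mahout internals:

```java
// Hedged estimate: heap needed just to hold an index->itemID map for ~100M
// items, assuming open addressing with int key (4) + long value (8) +
// state byte (1) per slot and tables kept roughly half full.
public class ReducerSetupMemorySketch {

    static long estimatedBytes(long items, double loadFactor, long bytesPerSlot) {
        // slots = items / loadFactor; total = slots * bytesPerSlot
        return (long) (items / loadFactor * bytesPerSlot);
    }

    public static void main(String[] args) {
        long bytes = estimatedBytes(100_000_000L, 0.5, 4 + 8 + 1);
        System.out.println(bytes / (1024 * 1024) + " MB for the map alone");
    }
}
```

Under these assumptions the map alone needs roughly 2.5 GB, which already exceeds the 2G -Xmx in the reported environment before the reducer does any actual work.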