[ https://issues.apache.org/jira/browse/MAHOUT-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295593#comment-13295593 ]
Sean Owen commented on MAHOUT-1032:
-----------------------------------

Yeah I can imagine having a billion distinct IDs -- but they may not be the logical entities you recommend on. I'm suggesting that it's unlikely there are 'really' a billion items and you'll benefit a lot by collapsing them, perhaps. Whatever approach you take, it will almost surely save you $$ in processing.

Put another way, unless you have trillions of data points, this data set is going to be very sparse, so much so that the result may not be so useful. What's the average number of interactions per user or item?

You can rewrite these bits to do lookups via an M/R join. It will take more time, is all, since the whole data set is mapped out again, mapped with copies of the lookup, joined, and then finally output. Most any situation where something is loaded in memory is 'cheating', but it's a very useful speedup since it works fine up to 'merely huge' numbers of items, like 10M.

> AggregateAndRecommendReducer gets OOM in setup() method
> -------------------------------------------------------
>
>                 Key: MAHOUT-1032
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1032
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>   Affects Versions: 0.5, 0.6, 0.7, 0.8
>         Environment: hadoop cluster with -Xmx set to 2G
>            Reporter: CodyInnowhere
>            Assignee: Sean Owen
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> This bug is actually caused by the very first job: itemIDIndex. This job
> converts each itemID to an integer index, and the later
> AggregateAndRecommendReducer tries to read all items into the
> OpenIntLongHashMap indexItemIDMap. However, for large data sets, e.g., my
> test data set covers 100 million+ items (not too many items for a large
> e-commerce website), tasks run out of memory in the setup() method. I don't
> think the itemIDIndex is necessary; without this job, the final
> AggregateAndRecommend step doesn't have to read all items into memory to do
> the reverse index mapping.
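
A rough sketch of the M/R join suggested in the comment above, written as a plain-Python simulation rather than actual Hadoop or Mahout code. It assumes the recommendations have been flattened to (user, item index, score) records and that the itemIDIndex output is available as (item index, item ID) pairs; the function names and record layouts are hypothetical and chosen only for illustration. Both inputs are mapped to the same key (the item index), grouped, and joined in the reducer, so no task has to hold the whole index in memory.

```python
from collections import defaultdict
from itertools import chain


def map_index(index_records):
    """Map side for the lookup data: key by item index, tag the value as an ID."""
    for item_index, item_id in index_records:
        yield item_index, ("ID", item_id)


def map_recs(rec_records):
    """Map side for the recommendations: key by item index, tag the value as a REC."""
    for user_id, item_index, score in rec_records:
        yield item_index, ("REC", (user_id, score))


def shuffle(pairs):
    """Stand-in for the shuffle phase: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(values):
    """Reduce side: for one item index, attach the real item ID to each recommendation."""
    item_id = None
    recs = []
    for tag, payload in values:
        if tag == "ID":
            item_id = payload
        else:
            recs.append(payload)
    if item_id is None:
        return  # no lookup entry for this index, so nothing can be emitted
    for user_id, score in recs:
        yield user_id, item_id, score


def join(index_records, rec_records):
    """Run the simulated job: map both inputs, shuffle, then reduce each key."""
    groups = shuffle(chain(map_index(index_records), map_recs(rec_records)))
    output = []
    for item_index in sorted(groups):
        output.extend(reduce_join(groups[item_index]))
    return output
```

The reducer here buffers only the records for a single item index, not the full index. A real job would still need a further step to regroup the joined records by user, which is part of the extra time the comment mentions.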