[ https://issues.apache.org/jira/browse/MAHOUT-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pat Ferrel resolved MAHOUT-1883.
--------------------------------
    Resolution: Fixed

Hmm, I thought these were auto-resolved with a commit that contains the issue name? Maybe I had a senior moment there :-)

> Create a type of IndexedDataset that filters unneeded data for CCO
> ------------------------------------------------------------------
>
>                 Key: MAHOUT-1883
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1883
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.13.0
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>             Fix For: 0.13.0
>
>
> The collaborative filtering CCO algo uses DRMs for each "indicator" type. The
> inputs must all have the same set of user-ids, so the row rank for all input
> matrices must be the same.
> In the past we have padded the row-id dictionary to include new rows that
> appear only in secondary matrices. This can lead to very large amounts of data
> being processed in the CCO pipeline that do not affect the results. Put
> another way, if the row doesn't exist in the primary matrix, there will be no
> cross-occurrence in the other calculated cooccurrence matrix.
> If we are calculating P'P and P'S, S will not need rows that don't exist in P,
> so this Jira is to create an IndexedDataset companion object that takes an
> RDD[(String, String)] of interactions but that uses the dictionary from P for
> row-ids and filters out all data that doesn't correspond to P. The companion
> object will create the row-ids dictionary if it is not passed in, and use it
> to filter if it is passed in.
> We have seen data that can be reduced by many orders of magnitude using this
> technique. This could be handled outside of Mahout but always produces better
> performance, so this version of data-prep seems worth including.
> It does not affect the CLI version yet but could be included there in a
> future Jira.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
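
For readers skimming the archive, the filtering described in the issue can be sketched roughly as follows. This is only an illustration in plain Python, with lists of (user-id, item-id) pairs standing in for the RDD[(String, String)] input; the function names (build_row_dictionary, filter_by_row_dictionary, prepare) and the sample data are hypothetical and are not Mahout's API.

```python
def build_row_dictionary(interactions):
    """Map each distinct user-id to a row index, in order of first appearance."""
    row_ids = {}
    for user_id, _item_id in interactions:
        if user_id not in row_ids:
            row_ids[user_id] = len(row_ids)
    return row_ids


def filter_by_row_dictionary(interactions, row_ids):
    """Keep only interactions whose user-id already has a row in the dictionary."""
    return [(u, i) for (u, i) in interactions if u in row_ids]


def prepare(interactions, row_ids=None):
    """Create the row-id dictionary if none is passed in; otherwise filter by it."""
    interactions = list(interactions)
    if row_ids is None:
        # Primary case: the data defines the dictionary, nothing is dropped.
        return interactions, build_row_dictionary(interactions)
    # Secondary case: reuse the primary dictionary and drop unknown users.
    return filter_by_row_dictionary(interactions, row_ids), row_ids


# Hypothetical sample data: P is the primary action, S a secondary one.
primary = [("u1", "iphone"), ("u2", "ipad"), ("u1", "ipad")]
secondary = [("u1", "view-a"), ("u3", "view-b"), ("u2", "view-c"), ("u4", "view-a")]

p_rows, p_dict = prepare(primary)
s_rows, s_dict = prepare(secondary, p_dict)
```

Here u3 and u4 never appear in P, so their rows are dropped from S before any cross-occurrence is computed, and both datasets end up sharing the same row dictionary (and therefore the same row rank).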