Sounds like a very plausible root cause.
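To illustrate why that matters: with strength-of-interaction values (ratings, play counts) instead of 0/1 flags, summing the row inflates the apparent number of interactions, which shrinks the per-user sample rate and downsamples users prematurely. A quick standalone sketch of the rate computation — plain Scala arrays standing in for Mahout's Vector here, and the 500 cap is just an example value:

```scala
// A user with 200 interactions of strength 5 (e.g. ratings)
// and 300 items they never touched.
val maxNumInteractions = 500
val interactionsOfUser = Array.fill(200)(5.0) ++ Array.fill(300)(0.0)

// Summing the row counts strength, not interactions: 1000.0.
val summed = interactionsOfUser.sum
// Counting non-zero entries gives the actual interaction count: 200.
val numNonZero = interactionsOfUser.count(_ != 0.0)

// Sample rate as in downsampleAndBinarize: min(cap, n) / n.
val buggyRate = math.min(maxNumInteractions, summed) / summed                  // 0.5
val correctRate = math.min(maxNumInteractions, numNonZero.toDouble) / numNonZero // 1.0

println(f"sum-based rate: $buggyRate%.2f, count-based rate: $correctRate%.2f")
```

The user is well under the 500-interaction cap, yet the sum-based rate would randomly drop half of their interactions.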
On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA) <j...@apache.org> wrote:
>
> [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893 ]
>
> Pat Ferrel commented on MAHOUT-1464:
> ------------------------------------
>
> It seems like the downsampleAndBinarize method is returning the wrong
> values. It is actually summing the values where it should be counting the
> non-zero elements.
>
>     // Downsample the interaction vector of each user
>     for (userIndex <- 0 until keys.size) {
>
>       val interactionsOfUser = block(userIndex, ::) // this is a Vector
>       // if the values are non-boolean, the sum will not be the number of
>       // interactions, it will be a sum of strength-of-interaction, right?
>       // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
>       val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think
>
>       val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser
>
>       interactionsOfUser.nonZeroes().foreach { elem =>
>         val numInteractionsWithThing = numInteractions(elem.index)
>         val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing
>
>         if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
>           // We ignore the original interaction value and create a binary 0-1 matrix
>           // as we only consider whether interactions happened or did not happen
>           downsampledBlock(userIndex, elem.index) = 1
>         }
>       }
>     }
>
> > Cooccurrence Analysis on Spark
> > ------------------------------
> >
> >                 Key: MAHOUT-1464
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> >             Project: Mahout
> >          Issue Type: Improvement
> >          Components: Collaborative Filtering
> >         Environment: hadoop, spark
> >            Reporter: Pat Ferrel
> >            Assignee: Pat Ferrel
> >             Fix For: 1.0
> >
> >         Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
> > MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
> >
> > Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
> > that runs on Spark. This should be compatible with the Mahout Spark DRM DSL
> > so a DRM can be used as input.
> > Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
> > has several applications including cross-action recommendations.
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)