Oh, good catch! I had a separate binarize method before, so the data was already binary. I merged that into the downsample code and must have overlooked this. You are right, numNonZeros is the way to go!

On 06/10/2014 01:11 AM, Ted Dunning wrote:
Sounds like a very plausible root cause.





On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA) <[email protected]> wrote:


     [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893 ]

Pat Ferrel commented on MAHOUT-1464:
------------------------------------

It seems like the downsampleAndBinarize method is returning the wrong values.
It is summing the interaction values where it should be counting the non-zero
elements.

         // Downsample the interaction vector of each user
         for (userIndex <- 0 until keys.size) {

           val interactionsOfUser = block(userIndex, ::) // this is a Vector
           // If the values are non-boolean, the sum will not be the number of
           // interactions, it will be a sum of strength-of-interaction, right?
           // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
           val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think

           val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser

           interactionsOfUser.nonZeroes().foreach { elem =>
             val numInteractionsWithThing = numInteractions(elem.index)
             val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing

             if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
               // We ignore the original interaction value and create a binary 0-1 matrix
               // as we only consider whether interactions happened or did not happen
               downsampledBlock(userIndex, elem.index) = 1
             }
           }
         }
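To make the distinction concrete, here is a minimal plain-Scala sketch (hypothetical values, standard library only, not Mahout's Vector API) showing how summing strength-of-interaction values and counting non-zero elements diverge on non-binary data:

```scala
// Plain-Scala sketch with made-up values: on non-binary data,
// summing strengths and counting non-zeros give different "counts".
object SumVsCount {
  def main(args: Array[String]): Unit = {
    // Strength-of-interaction values for one user (not 0/1)
    val interactionsOfUser = Seq(0.0, 3.0, 0.0, 1.5, 2.0)

    val sumOfStrengths  = interactionsOfUser.sum             // 6.5 -- not a count
    val numInteractions = interactionsOfUser.count(_ != 0.0) // 3   -- the actual count

    // With a hypothetical maxNumInteractions = 2, the per-user
    // sample rates computed from the two quantities disagree:
    val maxNumInteractions = 2
    val rateFromSum   = math.min(maxNumInteractions, sumOfStrengths) / sumOfStrengths
    val rateFromCount = math.min(maxNumInteractions, numInteractions.toDouble) / numInteractions

    println(f"sum-based rate: $rateFromSum%.2f, count-based rate: $rateFromCount%.2f")
  }
}
```

The sum-based rate undersamples this user (2 / 6.5 vs 2 / 3), which is why counting non-zeros matters once the input is no longer binary.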


Cooccurrence Analysis on Spark
------------------------------

                 Key: MAHOUT-1464
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
             Project: Mahout
          Issue Type: Improvement
          Components: Collaborative Filtering
         Environment: hadoop, spark
            Reporter: Pat Ferrel
            Assignee: Pat Ferrel
             Fix For: 1.0

         Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
run-spark-xrsj.sh


Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
a DRM can be used as input.
Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
has several applications including cross-action recommendations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


