Still getting the wrong values with non-boolean input, so I'll keep looking 
into it.
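
To make the non-boolean problem concrete, here is a minimal standalone sketch (plain Scala arrays with made-up strength values, not Mahout's Vector type) of why a row sum misbehaves when input isn't 0/1:

```scala
object SumVsCountSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical strengths-of-interaction for one user (not 0/1 values).
    val interactionsOfUser = Array(0.0, 3.0, 0.5, 0.0, 2.0)

    // Summing gives total strength-of-interaction, not the number of
    // interactions...
    val sumOfStrengths = interactionsOfUser.sum // 5.5

    // ...while counting non-zeros gives the actual interaction count, which
    // is what the per-user sample rate should be based on.
    val numInteractionsOfUser = interactionsOfUser.count(_ != 0.0) // 3

    println(s"sum = $sumOfStrengths, non-zero count = $numInteractionsOfUser")
  }
}
```

With boolean input the two happen to agree, which is why the bug only shows up once the data carries interaction strengths.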

Another question: computeIndicators seems to exclude self-comparison during 
A'A but, of course, not for B'A. Since this returns the indicator matrix for 
the general case, shouldn't it include those values? It seems they should be 
filtered out in the output phase, if anywhere, and then only as an option. If 
we were actually returning a matrix multiply we would include them.

            // exclude co-occurrences of the item with itself
            if (crossCooccurrence || thingB != thingA) {
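
To make the question concrete, here is a minimal standalone sketch (plain Scala arrays, not Mahout's DRM types; the names are mine) of how that guard changes A'A: the diagonal of A'A holds each item's self-cooccurrence count, and the guard zeroes exactly those entries.

```scala
object SelfCooccurrenceSketch {
  // Count item-item cooccurrences for a binary user-by-item matrix.
  // excludeSelf = true mirrors the `thingB != thingA` guard above.
  def cooccurrences(a: Array[Array[Int]], excludeSelf: Boolean): Array[Array[Int]] = {
    val numItems = a.head.length
    val c = Array.ofDim[Int](numItems, numItems)
    for {
      row <- a
      i <- 0 until numItems if row(i) == 1
      j <- 0 until numItems if row(j) == 1
      if !excludeSelf || i != j
    } c(i)(j) += 1
    c
  }

  def main(args: Array[String]): Unit = {
    // 3 users x 2 items, binary interactions
    val a = Array(Array(1, 0), Array(1, 1), Array(0, 1))

    val full = cooccurrences(a, excludeSelf = false)
    // A'A proper: diagonal carries each item's own interaction count
    println(full.map(_.mkString(" ")).mkString("\n"))   // 2 1 / 1 2

    val noDiag = cooccurrences(a, excludeSelf = true)
    // with the guard, the diagonal is zeroed
    println(noDiag.map(_.mkString(" ")).mkString("\n")) // 0 1 / 1 0
  }
}
```

So the question above amounts to: should the general-case indicator matrix be the full A'A (diagonal included), with the diagonal optionally dropped only at output time?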

On Jun 10, 2014, at 1:49 AM, Sebastian Schelter <s...@apache.org> wrote:

Oh good catch! I had an extra binarize method before, so that the data was 
already binary. I merged that into the downsample code and must have overlooked 
that thing. You are right, numNonZeros is the way to go!


On 06/10/2014 01:11 AM, Ted Dunning wrote:
> Sounds like a very plausible root cause.
> 
> 
> 
> 
> 
> On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA) <j...@apache.org> wrote:
> 
>> 
>>     [
>> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
>> ]
>> 
>> Pat Ferrel commented on MAHOUT-1464:
>> ------------------------------------
>> 
>> It seems like the downsampleAndBinarize method is returning the wrong
>> values. It is actually summing the values where it should be counting the
>> non-zero elements.
>> 
>>         // Downsample the interaction vector of each user
>>         for (userIndex <- 0 until keys.size) {
>> 
>>           val interactionsOfUser = block(userIndex, ::) // this is a Vector
>>           // if the values are non-boolean, the sum will not be the number of
>>           // interactions, it will be a sum of strength-of-interaction, right?
>>           // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
>>           val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think
>> 
>>           val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser
>> 
>>           interactionsOfUser.nonZeroes().foreach { elem =>
>>             val numInteractionsWithThing = numInteractions(elem.index)
>>             val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing
>> 
>>             if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
>>               // We ignore the original interaction value and create a binary 0-1 matrix
>>               // as we only consider whether interactions happened or did not happen
>>               downsampledBlock(userIndex, elem.index) = 1
>>             }
>>           }
>> 
>>> Cooccurrence Analysis on Spark
>>> ------------------------------
>>> 
>>>                 Key: MAHOUT-1464
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>>>             Project: Mahout
>>>          Issue Type: Improvement
>>>          Components: Collaborative Filtering
>>>         Environment: hadoop, spark
>>>            Reporter: Pat Ferrel
>>>            Assignee: Pat Ferrel
>>>             Fix For: 1.0
>>> 
>>>         Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
>> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
>> run-spark-xrsj.sh
>>> 
>>> 
>>> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
>> that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
>> a DRM can be used as input.
>>> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
>> has several applications including cross-action recommendations.
>> 
>> 
>> 
>> --
>> This message was sent by Atlassian JIRA
>> (v6.2#6252)
>> 
> 

