Oh, good catch! I used to have a separate binarize method, so the data
was already binary by the time it was summed. I merged that into the
downsample code and must have overlooked that the sum no longer counts
interactions. You are right, numNonZeros is the way to go!
On 06/10/2014 01:11 AM, Ted Dunning wrote:
Sounds like a very plausible root cause.
On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA) <[email protected]> wrote:
[
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
]
Pat Ferrel commented on MAHOUT-1464:
------------------------------------
It seems like the downsampleAndBinarize method is returning the wrong values.
It is actually summing the values where it should be counting the non-zero
elements?
    // Downsample the interaction vector of each user
    for (userIndex <- 0 until keys.size) {
      val interactionsOfUser = block(userIndex, ::) // this is a Vector

      // If the values are non-boolean, the sum will not be the number of
      // interactions; it will be a sum of strength-of-interaction, right?
      // val numInteractionsOfUser = interactionsOfUser.sum
      val numInteractionsOfUser =
        interactionsOfUser.getNumNonZeroElements() // should do this I think

      // getNumNonZeroElements() returns an Int, so convert before dividing
      // to avoid integer truncation of the sample rate
      val perUserSampleRate =
        math.min(maxNumInteractions, numInteractionsOfUser).toDouble /
          numInteractionsOfUser

      interactionsOfUser.nonZeroes().foreach { elem =>
        val numInteractionsWithThing = numInteractions(elem.index)
        val perThingSampleRate =
          math.min(maxNumInteractions, numInteractionsWithThing) /
            numInteractionsWithThing

        if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
          // We ignore the original interaction value and create a binary
          // 0-1 matrix, as we only consider whether interactions happened
          // or did not happen
          downsampledBlock(userIndex, elem.index) = 1
        }
      }
    }
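
The point raised in this comment can be checked without Mahout. The following is a minimal sketch in plain Scala; `SampleRateSketch`, `strengthSum`, `numNonZero`, and `sampleRate` are hypothetical names, and an `Array[Double]` stands in for a Mahout Vector row. It assumes the cap (`maxNumInteractions`) is an Int, in which case the count-based rate needs an explicit conversion to Double before dividing.

```scala
object SampleRateSketch {
  // Sum of interaction strengths. This equals the number of
  // interactions only when every non-zero value is exactly 1.
  def strengthSum(row: Array[Double]): Double = row.sum

  // Number of interactions, independent of their strength.
  def numNonZero(row: Array[Double]): Int = row.count(_ != 0.0)

  // Fraction of interactions to keep so that, in expectation, at most
  // maxNumInteractions remain. The toDouble matters: Int / Int truncates.
  def sampleRate(maxNumInteractions: Int, numInteractions: Int): Double =
    if (numInteractions == 0) 1.0
    else math.min(maxNumInteractions, numInteractions).toDouble / numInteractions
}
```

For a row such as (0, 5, 0, 3, 2), `strengthSum` gives 10 while `numNonZero` gives 3. With a cap of 2, the count-based rate is 2/3, whereas a rate derived from the sum would be 2/10 and would discard far more interactions than intended.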
Cooccurrence Analysis on Spark
------------------------------
Key: MAHOUT-1464
URL: https://issues.apache.org/jira/browse/MAHOUT-1464
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Environment: hadoop, spark
Reporter: Pat Ferrel
Assignee: Pat Ferrel
Fix For: 1.0
Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
run-spark-xrsj.sh
Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
a DRM can be used as input.
Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
has several applications including cross-action recommendations.
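
For readers unfamiliar with the LLR score mentioned in this description, here is a minimal sketch of the log-likelihood ratio for a 2x2 contingency table, in plain Scala. It uses the common entropy-based formulation of the test and is not taken from the attached patches; `LlrSketch` and its method names are hypothetical. Here k11 counts the cases where both items occur together, k12 and k21 count the cases where only one of them occurs, and k22 counts the cases where neither occurs.

```scala
object LlrSketch {
  // x * ln(x), with the usual convention that 0 * ln(0) = 0.
  private def xLogX(x: Long): Double =
    if (x == 0L) 0.0 else x * math.log(x.toDouble)

  // Unnormalized Shannon entropy of a set of counts.
  private def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy = entropy(k11 + k12, k21 + k22)
    val columnEntropy = entropy(k11 + k21, k12 + k22)
    val matrixEntropy = entropy(k11, k12, k21, k22)
    // Clamp at zero to absorb tiny negative values from round-off.
    math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy))
  }
}
```

A table with independent counts, such as (10, 10, 10, 10), scores approximately zero, while a strongly associated table, such as (10, 0, 0, 10), scores well above zero. Only the binary cooccurrence counts enter the score, which is why the interaction strengths are discarded during downsampling.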
--
This message was sent by Atlassian JIRA
(v6.2#6252)