[
https://issues.apache.org/jira/browse/MAHOUT-803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271494#comment-13271494
]
Sebastian Schelter commented on MAHOUT-803:
-------------------------------------------
I'd like to clarify this issue a little, as Suneel wants to work on it. It's
basically just a little math I was too lazy to do myself :)
Our vector similarity measures offer a method called consider() which can be
used to prune unnecessary vector pairs early.
This method is invoked with five parameters: the number of non-zero entries of
the first vector, the maximum value in the first vector, the number of non-zero
entries of the second vector, the maximum value in the second vector and a
similarity threshold.
The method should return a boolean which signals whether it is possible that
the similarity of the two vectors can be above the threshold (otherwise the
pair can be ignored).
An easy example for this is CooccurrenceCountSimilarity, which only measures the
number of common non-zero dimensions of the vectors. Say the threshold is
three, which means we only want pairs with at least three matching non-zero
dimensions. If one of the vectors has fewer than three non-zero dimensions, then
we can ignore this pair.
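To make the cooccurrence example concrete, here is a minimal sketch of such a check. The class name is hypothetical, and the parameter names and order simply follow the description above; the actual signature of consider() in Mahout may differ, so treat this as an illustration rather than the real implementation.

```java
// Illustrative sketch only: the class name, parameter names and parameter
// order are assumptions based on the description in this comment, not
// Mahout's actual source.
final class CooccurrenceCountPruningSketch {

  /**
   * Returns true if the pair might still reach the threshold, and false if
   * it can safely be pruned.
   */
  static boolean consider(int numNonZeroEntriesA, double maxValueA,
                          int numNonZeroEntriesB, double maxValueB,
                          double threshold) {
    // The number of common non-zero dimensions can never exceed the
    // non-zero count of the sparser vector. The maximum values are not
    // needed for this particular measure.
    return Math.min(numNonZeroEntriesA, numNonZeroEntriesB) >= threshold;
  }

  public static void main(String[] args) {
    // Threshold of three: a vector with only two non-zero entries cannot match.
    System.out.println(consider(2, 1.0, 7, 1.0, 3.0)); // prints "false"
    System.out.println(consider(3, 1.0, 7, 1.0, 3.0)); // prints "true"
  }
}
```

The check only ever rules pairs out. A return value of true does not mean the pair reaches the threshold, just that the row statistics alone cannot exclude it.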
Similar things can be done for other similarity measures. We still lack an
implementation for CityBlockSimilarity, LoglikelihoodSimilarity and
EuclideanDistanceSimilarity, and I'm not sure whether this is even possible for
all of them.
> Complete minsize constraints for similarity measures used in RowSimilarityJob
> -----------------------------------------------------------------------------
>
> Key: MAHOUT-803
> URL: https://issues.apache.org/jira/browse/MAHOUT-803
> Project: Mahout
> Issue Type: Task
> Components: Math
> Affects Versions: 0.6
> Reporter: Sebastian Schelter
> Assignee: Sebastian Schelter
>
> The latest implementation of RowSimilarityJob allows specifying a threshold
> for the minimum similarity value of the resulting row pairs.
> A measure can specify a minsize constraint via
> VectorSimilarityMeasure.consider(...) to prune some candidate pairs very
> early by looking at some statistics computed for the single rows.
> For example, if cooccurrence count is used as the similarity measure and a
> threshold of 5 is set, then all row pairs where one of the vectors has fewer
> than 5 non-zero components can be discarded.
> These min-size constraints are still missing for CityBlockSimilarity,
> LoglikelihoodSimilarity and EuclideanDistanceSimilarity.