Re: [jira] [Commented] (MAHOUT-767) Improve RowSimilarityJob performance for count-based distance measures

Ted Dunning Wed, 20 Jul 2011 08:41:11 -0700

Not if you are doing Euclidean distance.
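
To spell that out: squared Euclidean distance between rows a and b expands to |a|^2 + |b|^2 - 2(a . b). The co-occurrence pass only gives you the dot product, so unless every row is already unit length the norms have to come from somewhere. A minimal sketch of the arithmetic follows; the function names and the dict-based sparse rows are illustrative assumptions, not Mahout API.

```python
import math


def dot(a, b):
    """Dot product of two sparse rows stored as {column: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter row
    return sum(v * b[k] for k, v in a.items() if k in b)


def euclidean_from_dot(dot_ab, norm_sq_a, norm_sq_b):
    """Euclidean distance recovered from a dot product and two squared norms."""
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, norm_sq_a + norm_sq_b - 2.0 * dot_ab))


a = {0: 3.0, 2: 4.0}
b = {0: 1.0, 1: 2.0}

# The squared norms are just each row's dot product with itself.
d = euclidean_from_dot(dot(a, b), dot(a, a), dot(b, b))
```

With pre-normalized rows both squared norms are 1 and the distance reduces to sqrt(2 - 2(a . b)), which is why normalization is enough for cosine-style measures but not for plain Euclidean distance on raw counts.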

On Wed, Jul 20, 2011 at 5:41 AM, Grant Ingersoll (JIRA) <[email protected]> wrote:


>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068331#comment-13068331]
>
> Grant Ingersoll commented on MAHOUT-767:
> ----------------------------------------
>
> Why do we need the norms?  Why can't we assume the user has already
> normalized?
>
> > Improve RowSimilarityJob performance for count-based distance measures
> > ----------------------------------------------------------------------
> >
> >                 Key: MAHOUT-767
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-767
> >             Project: Mahout
> >          Issue Type: Improvement
> >            Reporter: Grant Ingersoll
> >             Fix For: 0.6
> >
> >
> > (See
> http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7 for
>  background)
> > Currently, the RowSimilarityJob defers the calculation of the similarity
> metric until the reduce phase, while emitting many Cooccurrence objects.
>  For similarity metrics that are algebraic (
> http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions) we should
> be able to do much of the computation during the Mapper part of this phase
> and also take advantage of a Combiner.
> > We should use a marker interface to know whether a similarity metric is
> algebraic and then make use of an appropriate Mapper implementation,
> otherwise we can fall back on our existing implementation.
>
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>
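
On the algebraic point in the quoted description: a measure is algebraic in this sense when partial results computed on subsets of the data can be merged into the same final answer, which is what makes a Combiner safe. A rough sketch of the idea for dot products, in plain Python rather than Hadoop; `map_column` and `combine` are hypothetical names, and this is not how RowSimilarityJob is actually written.

```python
from collections import defaultdict
from itertools import combinations


def map_column(column):
    """Map step: for one column given as {row_id: value}, emit a partial
    dot-product contribution for every pair of rows that share the column."""
    for (i, x), (j, y) in combinations(sorted(column.items()), 2):
        yield (i, j), x * y


def combine(pairs):
    """Combine/reduce step: sum the partial contributions per row pair.
    Addition is associative, so this can run on any subset of the output."""
    sums = defaultdict(float)
    for key, partial in pairs:
        sums[key] += partial
    return dict(sums)


columns = [
    {"r1": 1.0, "r2": 2.0},
    {"r1": 3.0, "r2": 1.0, "r3": 2.0},
]

partials = [kv for col in columns for kv in map_column(col)]
dots = combine(partials)
```

Combining each column's output separately and then combining those results gives the same totals, so only one number per row pair needs to leave the mapper instead of one object per co-occurrence.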
