How large is the win for canopies in practice?  < 10x?  < 2x?

I could imagine that, since the distance computations are (or should be)
heavily vector-oriented, comparing each point against all centroids rather
than just a few increases the time only sub-linearly, because much of the
cost lies in setting up the computation in the first place.  Once the
cluster centroids are in L1 cache, using them should be really, really fast.
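
To make the question concrete, here is a rough sketch in plain Python (not
Mahout code; every function and variable name is hypothetical) that counts
distance evaluations for the two assignment strategies discussed in the
quoted comment below: comparing each point against all centers, versus only
against the centers of the canopies the point already belongs to.  It
assumes Euclidean distance and assumes the canopy membership of each point
is already known.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_all(points, centers):
    """Assign each point to its nearest center, checking every center."""
    assignments, n_distances = [], 0
    for p in points:
        dists = [distance(p, c) for c in centers]
        n_distances += len(centers)
        assignments.append(dists.index(min(dists)))
    return assignments, n_distances


def assign_within_canopies(points, centers, point_canopies):
    """Assign each point to the nearest center among its own canopies.

    point_canopies[i] lists the indices of the centers whose canopy
    contains points[i] (assumed to be non-empty for every point).
    """
    assignments, n_distances = [], 0
    for p, candidates in zip(points, point_canopies):
        best = min(candidates, key=lambda j: distance(p, centers[j]))
        n_distances += len(candidates)
        assignments.append(best)
    return assignments, n_distances


# Toy data: two tight groups of points plus one far-away center.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
centers = [(0.5, 0.0), (10.5, 10.0), (50.0, 50.0)]
point_canopies = [[0], [0], [1], [1]]

print(assign_all(points, centers))                              # ([0, 0, 1, 1], 12)
print(assign_within_canopies(points, centers, point_canopies))  # ([0, 0, 1, 1], 4)
```

The count of distance evaluations is only an upper bound on the possible
win.  It says nothing about wall-clock time, which is exactly what is in
doubt if the all-centers comparison vectorizes well and the centroids stay
in cache.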

On Mon, May 12, 2008 at 9:22 PM, Jeff Eastman (JIRA) <[EMAIL PROTECTED]>
wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596266#action_12596266]
>
> Jeff Eastman commented on MAHOUT-54:
> ------------------------------------
>
> What I get is you are concerned by kmeans comparing all points against all
> cluster centers in order to find the closest. Since canopy has already
> assigned each point to one or more canopies, and since the kmeans cluster
> centers are initially the canopy centers, it should only be necessary to
> measure the distance between each point's canopy cluster centers and not all
> of the cluster centers. Then, the point would only be emitted to the closest
> cluster and many distance calculations could be avoided.
>
> I'd still like to understand the changes you are proposing to the existing
> algorithms. The code in your patch does little to motivate or explain its
> differences and indeed it breaks the existing canopy unit tests. If your
> patch were instead organized to make as few changes to the code as possible
> and if these changes were well documented it would be easier to evaluate.
> Currently, one must compare your new implementation with the existing,
> somewhat modified implementation without the benefit of diff or any other
> documentation to see what has actually changed.
>
> It appears you wish to augment the canopy code to produce an additional
> output folder, and that kmeans would be able to utilize this folder to
> optimize its measurements. Could you say more about the structure of this
> new folder and how you intend to use it in kmeans?
>
>
>
> > parallelize k-means sharing the predominance of canopies
> > --------------------------------------------------------
> >
> >                 Key: MAHOUT-54
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-54
> >             Project: Mahout
> >          Issue Type: Improvement
> >          Components: Clustering
> >    Affects Versions: 0.1
> >         Environment: OS Independent
> >            Reporter: Jeremy Chow
> >             Fix For: 0.1
> >
> >         Attachments: canopykeams.patch
> >
> >
> > The current Mahout implementation uses the canopy algorithm only to
> > create the initial cluster centroids for k-means.  While iterating, it
> > calculates the distance from each center to every point.  But the most
> > important improvement canopies offer is that the distance only needs to
> > be calculated from each center to the much smaller number of points that
> > lie in the same canopy.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>


-- 
ted
