Ah, you are using MeanShift. That makes sense now. In MeanShift, all of the 
input vectors are converted into MeanShiftCanopy instances by 
MeanShiftCanopyCreatorMapper during a preprocessing step. This is done so that 
they will be assigned clusterIds which are retained during subsequent cluster 
mergers. In this case, it would be appropriate to preserve the NamedVector in 
the canopy center, since the clustering (classification) step runs over this 
preprocessed input (nominally clusters-0) rather than over the original vectors.

I think the names are lost when the constructor 
MeanShiftCanopy(Vector,int,DistanceMeasure) calls super(point,id,measure). You 
could try fixing this by assigning the point to the center directly after 
the super call. If your NamedVectors are already RandomAccessSparseVectors then 
this will have no effect. If they are sequential, then this change should only 
affect the first iteration, as the centers will be recomputed to become 
RASVectors during computeParameters.
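
To make the idea concrete, here is a rough, self-contained sketch of what I 
mean. None of these classes are the real Mahout ones: Vec, PlainVec, NamedVec, 
BaseCluster and NamePreservingCanopy are hypothetical stand-ins, and I am 
assuming the base constructor copies the point into a fresh unnamed vector, 
which is what the issue describes. It only shows why the name disappears and 
how reassigning the center after the super call would keep it.

```java
// Hypothetical stand-ins for illustration only; NOT the actual Mahout classes.
class NameLossSketch {

    interface Vec {
        double[] values();
    }

    // Plays the role of an unnamed vector such as RandomAccessSparseVector.
    static final class PlainVec implements Vec {
        private final double[] values;

        PlainVec(double[] values) {
            this.values = values.clone();
        }

        public double[] values() {
            return values.clone();
        }
    }

    // Plays the role of NamedVector: a name wrapped around another vector.
    static final class NamedVec implements Vec {
        private final String name;
        private final Vec delegate;

        NamedVec(String name, Vec delegate) {
            this.name = name;
            this.delegate = delegate;
        }

        String name() {
            return name;
        }

        public double[] values() {
            return delegate.values();
        }
    }

    // Assumed behaviour of the base class: it copies the values into a new
    // unnamed vector, so any name wrapper on the incoming point is dropped.
    static class BaseCluster {
        protected Vec center;
        protected final int id;

        BaseCluster(Vec point, int id) {
            this.center = new PlainVec(point.values());
            this.id = id;
        }

        Vec getCenter() {
            return center;
        }
    }

    // The suggested workaround: put the original point back as the center
    // right after the super call, so the name survives.
    static class NamePreservingCanopy extends BaseCluster {
        NamePreservingCanopy(Vec point, int id) {
            super(point, id);
            this.center = point;
        }
    }

    static String nameOf(Vec v) {
        return (v instanceof NamedVec) ? ((NamedVec) v).name() : null;
    }

    public static void main(String[] args) {
        Vec doc = new NamedVec("doc-1", new PlainVec(new double[] {1.0, 2.0}));
        System.out.println(nameOf(new BaseCluster(doc, 0).getCenter()));          // prints "null"
        System.out.println(nameOf(new NamePreservingCanopy(doc, 0).getCenter())); // prints "doc-1"
    }
}
```

In the real code the equivalent change would be a single assignment in the 
MeanShiftCanopy constructor, but I have not tested that myself.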

Try this out and let me know if it works for you. 

-----Original Message-----
From: Pere Ferrera Bertran (JIRA) [mailto:[email protected]] 
Sent: Wednesday, November 24, 2010 9:21 AM
To: [email protected]
Subject: [jira] Commented: (MAHOUT-552) AbstractCluster eliminates NamedVectors 
by replacing them with RandomAccessSparseVector always


    [ https://issues.apache.org/jira/browse/MAHOUT-552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935407#action_12935407 ]

Pere Ferrera Bertran commented on MAHOUT-552:
---------------------------------------------

Thanks for your observations, Jeff. Then I guess the problem I am reporting is 
specific to some clustering algorithm. Concretely, I am using Mean Shift 
Clustering. There is no way I can preserve vector names in -cl mode. I am 
using the latest code (0.5 snapshot).

In MeanShiftCanopyClusterMapper there is some sort of equivalence between input 
vectors and canopies. I can see the vector that is output to clusteredPoints is 
canopy.getCenter(). Is this right?

> AbstractCluster eliminates NamedVectors by replacing them with 
> RandomAccessSparseVector always
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-552
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-552
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Pere Ferrera Bertran
>             Fix For: 0.5
>
>         Attachments: MAHOUT-552.patch
>
>
> When clustering using NamedVectors as input - after running seq2sparse with 
> patch https://issues.apache.org/jira/browse/MAHOUT-401 - names are lost 
> because AbstractCluster replaces vectors coming in the constructor with 
> RandomAccessSparseVector.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
