Ah, you are using MeanShift. That makes sense now. In MeanShift, all of the input vectors are converted into MeanShiftCanopy instances by MeanShiftCanopyCreatorMapper during a preprocessing step. This is done so that they will be assigned clusterIds which are retained during subsequent cluster mergers. In this case, it would be appropriate to preserve the NamedVector in the canopy center, since the clustering (classification) step runs over the canopy input data (nominally clusters-0) and not over the original vectors.
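To make the idea concrete, here is a minimal sketch of what keeping the name in the canopy center could look like. It uses stand-in classes only: Vec, PlainVec, NamedVec, BaseCluster and NamePreservingCanopy are hypothetical names made up for illustration, not the real Mahout types, and the DistanceMeasure argument is left out for brevity. The assumption is simply that the base constructor copies the incoming point into a plain vector, and that the subclass puts the original (named) point back as the center right after the super call.

```java
// Stand-in types for illustration only; these are NOT the real Mahout classes.
class NamePreservationSketch {

    /** Minimal vector abstraction (hypothetical). */
    interface Vec {
        double[] values();
    }

    /** Plain vector, playing the role of RandomAccessSparseVector. */
    static class PlainVec implements Vec {
        private final double[] values;

        PlainVec(double[] values) {
            this.values = values.clone();
        }

        @Override
        public double[] values() {
            return values;
        }
    }

    /** Wrapper that carries a name, playing the role of NamedVector. */
    static class NamedVec implements Vec {
        private final Vec delegate;
        private final String name;

        NamedVec(Vec delegate, String name) {
            this.delegate = delegate;
            this.name = name;
        }

        @Override
        public double[] values() {
            return delegate.values();
        }

        String name() {
            return name;
        }
    }

    /** Base cluster whose constructor copies the point into a plain vector, dropping any name. */
    static class BaseCluster {
        protected Vec center;
        protected final int id;

        BaseCluster(Vec point, int id) {
            this.center = new PlainVec(point.values()); // the name is lost here
            this.id = id;
        }

        Vec getCenter() {
            return center;
        }
    }

    /** Canopy that assigns the original point to the center directly after the super call. */
    static class NamePreservingCanopy extends BaseCluster {
        NamePreservingCanopy(Vec point, int id) {
            super(point, id);
            this.center = point; // keep the named wrapper as the center
        }
    }

    public static void main(String[] args) {
        Vec doc = new NamedVec(new PlainVec(new double[] {1.0, 2.0}), "doc-42");
        // Base behaviour: the center is a plain copy, so the name is gone.
        System.out.println(new BaseCluster(doc, 0).getCenter() instanceof NamedVec);          // false
        // With the reassignment: the center is still the named vector.
        System.out.println(new NamePreservingCanopy(doc, 0).getCenter() instanceof NamedVec); // true
    }
}
```

In this sketch the name survives only as long as the center is not replaced by a freshly computed vector, so whether it survives later iterations in the real code is something to check there.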
I think the names are lost when constructor MeanShiftCanopy(Vector,int,DistanceMeasure) calls super(point,id,measure). You could try fixing this by just assigning the point to the center directly after the super call. If your NamedVectors are already RandomAccessSparseVectors then this will have no effect. If they are sequential, then this change should only affect the first iteration, as the centers will be recomputed to become RASVectors during computeParameters.

Try this out and let me know if it works for you.

-----Original Message-----
From: Pere Ferrera Bertran (JIRA) [mailto:[email protected]]
Sent: Wednesday, November 24, 2010 9:21 AM
To: [email protected]
Subject: [jira] Commented: (MAHOUT-552) AbstractCluster eliminates NamedVectors by replacing them with RandomAccessSparseVector always

    [ https://issues.apache.org/jira/browse/MAHOUT-552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935407#action_12935407 ]

Pere Ferrera Bertran commented on MAHOUT-552:
---------------------------------------------

Thanks for your observations, Jeff. Then I guess the problem I am reporting is specific to some clustering algorithm. Concretely, I am using Mean Shift Clustering. There is no way I can preserve vectors names in -cl mode. I am using the latest code (0.5 snapshot). In MeanShiftCanopyClusterMapper there is some sort of equivalence between input vectors and canopies. I can see the vector that is output to clusteredPoints is canopy.getCenter(). Is this right?
> AbstractCluster eliminates NamedVectors by replacing them with RandomAccessSparseVector always
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-552
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-552
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Pere Ferrera Bertran
>             Fix For: 0.5
>
>         Attachments: MAHOUT-552.patch
>
>
> When clustering using NamedVectors as input - after running seq2sparse with
> patch https://issues.apache.org/jira/browse/MAHOUT-401 - names are lost
> because AbstractCluster replaces vectors coming in the constructor with
> RandomAccessSparseVector.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
