To be clear, this change only affects classification of the input vectors. Everything else in clustering works fine without it. I need to know which vectors are in which clusters; that is why I run clustering, for its classification function. There will be many who don't care about classification.

On Sep 12, 2012, at 8:27 PM, Pat Ferrel wrote:

Yes, you have output, but it is only partly useful. Two things are created during clustering:

* Clusters, which are basically centroids and their vectors.
* clusteredPoints, which you get if you ask the driver to classify your input into clusters.

Both of these are created, even without NamedVectors. The cluster centroids are quite all right with non-NamedVectors as input. However, although clusteredPoints is created, there is no way to tell which vectors were classified into which cluster, since all you get is anonymous weights in the vectors. How can you tell which doc was in which cluster? Creating a new classifier that would attach vector IDs when there is no NamedVector is my #1 solution below.

So yes, it still runs and produces clusters, but in my application, and I suspect quite a few others, the clusters are only of interest if the input is classified into them.

On Sep 12, 2012, at 7:07 PM, Dmitriy Lyubimov wrote:

I am curious, though. Do you really have no cluster output unless NamedVectors are used?

It is strange, because even if I did not use NamedVectors, I would still expect the clusters to form correctly, with the cluster ids, points and top terms. So the cluster dumper should still produce document vectors (even if without the original name) and top terms, i.e. clustered points should not be empty.

After all, I am not obliged to follow the text analysis pipeline as in the MIA. I might as well come up with my own DRM that I would like to find clusters for, and I might not have used text labels in that matrix.

On Wed, Sep 12, 2012 at 9:24 AM, Pat Ferrel wrote:
> There appears to be a gap in the pipeline SSVD-->Clustering. It can be
> patched in a couple of ways, so can the devs please advise before we make
> a patch?
>
> The Issues:
> * There is currently no output from clustering that maps input vectors to
>   clusters, unless you input NamedVectors to clustering.
> * SSVD does not output NamedVectors even if they are input.
>
> Solutions:
> 1. We could modify clustering to output ID-Vector pairs in the file
>    clusteredPoints/part-xxxx, where the IDs are the keys of the original
>    input vectors and the Vector is the original input VectorWritable. This
>    might be done by replacing the WeightedVectorWritable with a
>    WeightedPropertyVectorWritable and putting the ID in properties. This
>    would require a change in the clustering classifier but no change to
>    SSVD or the rest of clustering. This would impact anyone using
>    clusteredPoints, since they would have to deal with a new output vector
>    type (actually, wasn't this file using WeightedPropertyVectorWritable
>    before the mahout 0.7 refactoring?)
> 2. We could alter SSVD to output NamedVectors, and Clustering would simply
>    pass them through without modification as it does today. This would
>    require a change to SSVD but not to Clustering. Since NamedVectors seem
>    to be the only way to perform this mapping now, there would be very
>    little impact on current users.
>
> Afaict one of these has to be done, and they are not mutually exclusive.
> Any advice?
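
For anyone following the thread, here is a minimal sketch of the idea behind solution 1, written in plain Python rather than against Mahout's Java API. Everything in it is an assumption made for illustration: the `WeightedPropertyVector` dataclass only loosely mirrors the role the thread proposes for WeightedPropertyVectorWritable, and `classify_with_ids`, `ids_by_cluster`, the `"id"` property key and the nearest-centroid rule are all hypothetical. It only shows how carrying the input key through classification lets you recover which document landed in which cluster.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]


@dataclass
class WeightedPropertyVector:
    """Hypothetical clustered point: a weight, the vector, and free-form properties."""
    weight: float
    vector: Vector
    properties: Dict[str, str] = field(default_factory=dict)


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify_with_ids(
    keyed_vectors: Iterable[Tuple[Hashable, Vector]],
    centroids: Dict[int, Vector],
) -> Dict[int, List[WeightedPropertyVector]]:
    """Assign each (key, vector) pair to its nearest centroid, keeping the key.

    The key of the input record is stored under the assumed property name "id",
    so the output is no longer an anonymous weighted vector.
    """
    clustered: Dict[int, List[WeightedPropertyVector]] = {cid: [] for cid in centroids}
    for key, vec in keyed_vectors:
        nearest = min(centroids, key=lambda cid: squared_distance(vec, centroids[cid]))
        clustered[nearest].append(
            WeightedPropertyVector(weight=1.0, vector=vec, properties={"id": str(key)})
        )
    return clustered


def ids_by_cluster(
    clustered: Dict[int, List[WeightedPropertyVector]],
) -> Dict[int, List[str]]:
    """Read the stored IDs back out, giving a cluster -> document IDs mapping."""
    return {cid: [p.properties["id"] for p in points] for cid, points in clustered.items()}


if __name__ == "__main__":
    # Toy data: two centroids and three keyed input vectors.
    centroids = {0: (0.0, 0.0), 1: (10.0, 10.0)}
    docs = [("doc-a", (0.5, 0.1)), ("doc-b", (9.0, 11.0)), ("doc-c", (1.0, -1.0))]
    print(ids_by_cluster(classify_with_ids(docs, centroids)))
    # prints {0: ['doc-a', 'doc-c'], 1: ['doc-b']}
```

Solution 2 would reach the same mapping from the other side: if the vectors arriving at clustering already carried their names, the classifier could pass them through unchanged and no new output type would be needed.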
