I am curious, though: do you really get no cluster output unless NamedVectors are used?
It is strange, because even if I did not use NamedVectors, I would still expect clusters to form correctly, with cluster IDs, points, and top terms. So the cluster dumper should still produce document vectors (even without the original names) and top terms, i.e. clustered points should not be empty. After all, I am not obliged to follow the text analysis pipeline as in the MIA; I might as well come up with my own DRM that I would like to find clusters for, and I might not have used text labels in that matrix.

On Wed, Sep 12, 2012 at 9:24 AM, Pat Ferrel <[email protected]> wrote:
> There appears to be a gap in the pipeline SSVD-->Clustering. It can be
> patched in a couple ways so can the devs please advise before we make a patch:
>
> The Issues:
> * There is currently no output from clustering that maps input vectors to
> clusters, unless you input NamedVectors to clustering.
> * SSVD does not output NamedVectors even if they are input.
>
> Solutions:
> 1. We could modify clustering to output in the file
> clusteredPoints/part-xxxx ID-Vector pairs, Where IDs are Keys of the original
> input vectors and the Vector would be the original input VectorWritable. This
> might be done by replacing the WeightedVectorWritable with a
> WeightedPropertyVectorWritable and putting the ID in properties. This would
> require a change in the clustering classifier but no change to SSVD or the
> rest of clustering. This would impact anyone using clusteredPoints since they
> would have to deal with a new output vector type (actually wasn't this file
> using WeightedPropertyVectorWritable before the mahout 0.7 refactoring?)
> 2. We could alter SSVD to output NamedVectors and Clustering would simply
> pass them through without modification as it does today. This would require a
> change to SSVD but not to Clustering. Since NamedVectors seems to be the only
> way to perform this mapping now, there would be very little impact on current
> users.
>
> Afaict one of these has to be done and they are not mutually exclusive. Any
> advice?
>
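To make the behaviour I would expect concrete, here is a minimal sketch in plain Python. It is not Mahout code and uses no Mahout API. The function name assign_to_clusters and the (cluster_id, key, vector) output layout are made up for illustration, and nearest-centroid assignment by Euclidean distance is only an assumption standing in for whatever the clustering classifier actually does. The point is just that the input row key is enough to map input vectors to clusters, with no text label involved.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def assign_to_clusters(keyed_vectors, centroids):
    """Assign each keyed input vector to its nearest centroid.

    keyed_vectors: iterable of (key, vector) pairs, e.g. rows of a DRM
                   keyed by an integer row id (hypothetical layout).
    centroids:     list of vectors, one per cluster.

    Returns a list of (cluster_id, key, vector) tuples, so every input
    key can be traced to a cluster even when no names were attached.
    """
    clustered_points = []
    for key, vector in keyed_vectors:
        # Index of the centroid closest to this vector.
        cluster_id = min(
            range(len(centroids)),
            key=lambda i: dist(vector, centroids[i]),
        )
        clustered_points.append((cluster_id, key, vector))
    return clustered_points
```

Called with rows keyed by plain integers, for example assign_to_clusters([(0, [0.0, 0.1]), (1, [5.0, 5.1])], [[0.0, 0.0], [5.0, 5.0]]), every output record still carries its input key next to the cluster id. That is roughly the ID-Vector pairing described in solution 1 of the quoted message.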
