Re: Need dev advice: SSVD - Clustering Pipeline

Pat Ferrel Wed, 12 Sep 2012 20:55:17 -0700

To be clear, this change only affects classification of the input vectors. 
Everything else in clustering works fine without it. I need to know which 
vectors are in which clusters; that is why I run clustering, for its 
classification function. There will be many who don't care about classification.

On Sep 12, 2012, at 8:27 PM, Pat Ferrel <[email protected]> wrote:

Yes, you have output but it is only partly useful.

There are two things created during clustering:
 * Clusters, which are basically centroids and their vectors
 * clusteredPoints, which you get if you ask the driver to classify your 
input into clusters
Both of these are created, even without NamedVectors. The cluster centroids 
are fine with non-NamedVectors as input. However, although clusteredPoints is 
created, there is no way to tell which input vectors were classified into 
which cluster, since all you get is anonymous weights in the vectors. How can 
you tell which doc was in which cluster?
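
The problem can be illustrated with a small, self-contained sketch. This is not Mahout code: `Vec`, `nearest`, and `classify` are hypothetical stand-ins, with `Vec.name` playing the role of the label a NamedVector carries, and `classify` assumed to drop the input key as described above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Vec:
    """Toy vector; `name` plays the role of the label a NamedVector carries."""
    values: Tuple[float, ...]
    name: Optional[str] = None


def nearest(centroids: List[Tuple[float, ...]], v: Vec) -> int:
    """Index of the centroid with the smallest squared Euclidean distance to v."""
    def sq_dist(c: Tuple[float, ...]) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, v.values))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def classify(rows: Dict[str, Vec],
             centroids: List[Tuple[float, ...]]) -> List[Tuple[int, Vec]]:
    """Emit (cluster id, vector) pairs. The input key is deliberately dropped,
    mirroring the behaviour described in this thread (an assumption of the sketch)."""
    return [(nearest(centroids, v), v) for v in rows.values()]


centroids = [(0.0, 0.0), (10.0, 10.0)]

# Same data twice: once as anonymous vectors, once with the key copied into the name.
anonymous = {"doc-a": Vec((0.5, 0.2)), "doc-b": Vec((9.0, 11.0))}
named = {k: Vec(v.values, name=k) for k, v in anonymous.items()}

anon_out = classify(anonymous, centroids)
named_out = classify(named, centroids)

# Cluster ids are identical either way; only the named vectors identify the doc.
print([(cid, v.name) for cid, v in anon_out])
print([(cid, v.name) for cid, v in named_out])
```

In both runs the cluster assignments are the same, but only the named run lets you say which doc landed in which cluster.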

Creating a new classifier that would attach vector IDs when there is no 
NamedVector is my #1 solution below.

So yes, it still runs and produces clusters, but in my application, and I 
suspect quite a few others, the clusters are only of interest if the input is 
classified into them.

On Sep 12, 2012, at 7:07 PM, Dmitriy Lyubimov <[email protected]> wrote:

I am curious though.

do you really have no cluster output unless Named vectors are used?

It is strange because even if I did not use Named vectors, i would
still expect clusters to form correctly, with the cluster ids
and points and top terms. So cluster dumper should still produce
document vectors (even if without original name) and top terms, i.e.
clustered points should not be empty. After all, I am not obliged to
follow text analysis pipeline as in the MIA, i might as well come up
with my own DRM i would like to find clusters for; and i might not
have used text labels in that matrix..

On Wed, Sep 12, 2012 at 9:24 AM, Pat Ferrel <[email protected]> wrote:
> There appears to be a gap in the pipeline SSVD-->Clustering. It can be 
> patched in a couple of ways, so can the devs please advise before we make a patch:
> 
> The Issues:
>  * There is currently no output from clustering that maps input vectors to 
> clusters, unless you input NamedVectors to clustering.
>  * SSVD does not output NamedVectors even if they are input.
> 
> Solutions:
>  1. We could modify clustering to output ID-Vector pairs in the file 
> clusteredPoints/part-xxxx, where the IDs are the keys of the original 
> input vectors and the Vector would be the original input VectorWritable. This 
> might be done by replacing the WeightedVectorWritable with a 
> WeightedPropertyVectorWritable and putting the ID in properties. This would 
> require a change in the clustering classifier but no change to SSVD or the 
> rest of clustering. This would impact anyone using clusteredPoints, since they 
> would have to deal with a new output vector type (actually, wasn't this file 
> using WeightedPropertyVectorWritable before the Mahout 0.7 refactoring?).
>  2. We could alter SSVD to output NamedVectors, and Clustering would simply 
> pass them through without modification as it does today. This would require a 
> change to SSVD but not to Clustering. Since NamedVectors seem to be the only 
> way to perform this mapping now, there would be very little impact on current 
> users.
> 
> Afaict one of these has to be done and they are not mutually exclusive. Any 
> advice?
>
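
The two options quoted above can be contrasted in a rough sketch. None of this is Mahout API: the function names, the `"id"` key in the properties map, and the record layout are assumptions made for illustration, and `reduce_keep_names` uses plain truncation as a stand-in for a dimensionality reduction (it is not SSVD).

```python
from typing import Dict, List, Tuple

Row = Tuple[float, ...]


def nearest(centroids: List[Row], row: Row) -> int:
    """Index of the centroid with the smallest squared Euclidean distance to row."""
    def sq_dist(c: Row) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, row))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


# Option 1 (sketch): the classifier itself copies the input key into a
# properties map on each output record. The "id" key is hypothetical.
def classify_with_properties(rows: Dict[str, Row],
                             centroids: List[Row]) -> List[dict]:
    return [
        {"cluster": nearest(centroids, r), "vector": r, "properties": {"id": k}}
        for k, r in rows.items()
    ]


# Option 2 (sketch): the upstream step keeps each row's name attached, and the
# classifier passes named rows through untouched.
def reduce_keep_names(rows: Dict[str, Row], keep: int) -> List[Tuple[str, Row]]:
    """Toy reduction by truncation (NOT SSVD) that preserves each row's name."""
    return [(k, r[:keep]) for k, r in rows.items()]


def classify_pass_through(named_rows: List[Tuple[str, Row]],
                          centroids: List[Row]) -> List[Tuple[int, str, Row]]:
    return [(nearest(centroids, r), name, r) for name, r in named_rows]


rows = {"doc-a": (0.5, 0.2, 7.0), "doc-b": (9.0, 11.0, 7.0)}
centroids = [(0.0, 0.0), (10.0, 10.0)]

reduced = reduce_keep_names(rows, keep=2)

opt1 = classify_with_properties(dict(reduced), centroids)
opt2 = classify_pass_through(reduced, centroids)

# Either way, a doc-to-cluster mapping can be recovered from the output.
map1 = {rec["properties"]["id"]: rec["cluster"] for rec in opt1}
map2 = {name: cid for cid, name, _ in opt2}
print(map1)
print(map2)
```

Both routes recover the same doc-to-cluster mapping; the difference is only where the ID gets attached, which is why the two fixes are not mutually exclusive.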
