[jira] [Commented] (MAHOUT-845) Make cluster top terms code more reusable

Jake Mannix (Commented) (JIRA) Wed, 30 Nov 2011 19:28:12 -0800

    [ https://issues.apache.org/jira/browse/MAHOUT-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160621#comment-13160621 ]


Jake Mannix commented on MAHOUT-845:
------------------------------------

Ok, so I've thought about this a little, and the implementation that Frank put 
on here (which is essentially what I had on my github branch too) is probably a 
bad idea, for exactly the reasons Lance mentioned here.

So instead, we modify VectorDumper and VectorHelper to add a couple of static 
methods and options:

in VectorHelper:
[code]
public static String vectorToJson(Vector vector, String[] dictionary, int maxEntries, boolean sort)
[code]

where the "sort" option sorts by the values of the Vector entries, and 
maxEntries describes the maximum number of vector entries to use.  If 
dictionary is supplied and not null, then the vector indexes are replaced with 
their respective term entries in the dictionary.
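
As a rough illustration of the behaviour described above (not the actual 
patch), a minimal sketch might look like the following. It assumes a plain 
double[] in place of Mahout's Vector so that it runs without Mahout on the 
classpath; the class name VectorJsonSketch is made up for illustration, and 
sorting by absolute value is an assumption taken from the wording of the 
sortVectors option description.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the proposed VectorHelper.vectorToJson.
// A dense double[] replaces Mahout's Vector (an assumption made so the sketch is self-contained).
final class VectorJsonSketch {

    static String vectorToJson(double[] vector, String[] dictionary, int maxEntries, boolean sort) {
        // Collect the indexes of the non-zero entries.
        List<Integer> indexes = new ArrayList<>();
        for (int i = 0; i < vector.length; i++) {
            if (vector[i] != 0.0) {
                indexes.add(i);
            }
        }
        if (sort) {
            // Descending by absolute value (assumed from the option description).
            indexes.sort(Comparator.comparingDouble((Integer i) -> Math.abs(vector[i])).reversed());
        }
        // Emit at most maxEntries entries.
        int limit = Math.min(maxEntries, indexes.size());
        StringBuilder json = new StringBuilder("{");
        for (int n = 0; n < limit; n++) {
            int index = indexes.get(n);
            // Use the dictionary term when one is available, otherwise the raw index.
            boolean hasTerm = dictionary != null && index < dictionary.length && dictionary[index] != null;
            String key = hasTerm ? dictionary[index] : String.valueOf(index);
            if (n > 0) {
                json.append(',');
            }
            json.append('"')
                .append(key.replace("\\", "\\\\").replace("\"", "\\\""))
                .append("\":")
                .append(vector[index]);
        }
        return json.append('}').toString();
    }

    public static void main(String[] args) {
        double[] weights = {0.5, 0.0, -2.0, 1.0};
        String[] terms = {"apple", "banana", "cherry", "date"};
        // Top two entries by absolute weight, labelled with dictionary terms.
        System.out.println(vectorToJson(weights, terms, 2, true));
        // Same vector without a dictionary: keys fall back to the indexes.
        System.out.println(vectorToJson(weights, null, 10, false));
    }
}
```

Passing a null dictionary keeps the numeric indexes as keys, which mirrors 
the "if supplied and not null" wording above.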

This way, VectorDumper is modified with the following options:
[code]
Option sortVectorsOpt = obuilder.withLongName("sortVectors").withRequired(false)
    .withDescription("Sort output key/value pairs of the vector entries in abs magnitude descending order")
    .withShortName("sort").create();
Option numIndexesPerVectorOpt = obuilder.withLongName("vectorSize").withShortName("vs").withRequired(false)
    .withArgument(abuilder.withName("vs").withMinimum(1).withMaximum(1).create())
    .withDescription("Truncate vectors to <vs> length when dumping (most useful when in"
        + " conjunction with -sort)").create();
[code]

Then if you have clusters represented as vector centroids (or distributions 
over terms/features, or anything else which is a collection of Vectors linked 
to a dictionary of String labels for the vector indexes), then you don't really 
need a "ClusterDumper", as

[code]
$MAHOUT_HOME/bin/mahout vectordump -s "path/to/vectors/part-*" \
    --dictionary "path/to/dictionary.file-0" -dt sequencefile \
    -sort --vectorSize 100 -o local_vectors.json
[code]

writes each vector in "path/to/vectors/part-*" to local_vectors.json, one per 
line, in JSON format. The keys are the terms with the highest weight for the 
vector, the values are the corresponding vector values, and only the top 100 
entries (by value) per vector are emitted.

I've found this modification to VectorDumper invaluable in inspecting LDA topic 
models, but doing it without modifying the Vector interface is even better.
                
> Make cluster top terms code more reusable
> -----------------------------------------
>
>                 Key: MAHOUT-845
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-845
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.5
>            Reporter: Frank Scholten
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: MAHOUT-845.patch, MAHOUT-845.patch, MAHOUT-845.patch
>
>
> When working with Mahout text clustering I find that I keep writing code 
> similar to the contents of
> public static String getTopFeatures(Cluster cluster, String[] dictionary, int 
> numTerms)
> in ClusterDumper in order to determine cluster labels.
> I think it would be useful if (parts of) this code are added to the cluster 
> or vector API so that you could do something like
> Cluster cluster = ... // get the cluster from seq file iterable
> String clusterLabel = cluster.getTopTerms(1, dictionary); // Do something 
> with the label  
> I think this would make it easier to export and post-process clustering 
> results, like indexing or storing them elsewhere.
> Thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
