[
https://issues.apache.org/jira/browse/MAHOUT-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160621#comment-13160621
]
Jake Mannix commented on MAHOUT-845:
------------------------------------
Ok, so I've thought about this a little, and the implementation that Frank put
on here (essentially what I had on my github branch too) is probably a bad
idea, for exactly the reasons Lance mentioned here.
So instead, we modify VectorDumper and VectorHelper to add a couple of static
methods and options:
in VectorHelper:
[code]
public static String vectorToJson(Vector vector, String[] dictionary,
    int maxEntries, boolean sort)
[code]
where the "sort" option sorts by the values of the Vector entries, and
maxEntries describes the maximum number of vector entries to use. If
dictionary is supplied and not null, then the vector indexes are replaced with
their respective term entries in the dictionary.
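As a rough illustration of the behaviour described above (not the actual patch), here is a minimal, self-contained sketch. It assumes a plain Map<Integer, Double> as a stand-in for Mahout's Vector so that it runs without Mahout on the classpath; the class name VectorJsonSketch, the escape helper, and the exact JSON layout are hypothetical. The sort follows the "abs magnitude descending" wording used in the option description.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class VectorJsonSketch {

  /**
   * Renders up to maxEntries entries of a sparse vector as a one-line JSON object.
   * Map<Integer, Double> stands in for Mahout's Vector (an assumption of this sketch).
   */
  static String vectorToJson(Map<Integer, Double> vector, String[] dictionary,
                             int maxEntries, boolean sort) {
    List<Map.Entry<Integer, Double>> entries = new ArrayList<>(vector.entrySet());
    if (sort) {
      // Largest absolute value first.
      entries.sort(Comparator.comparingDouble(
          (Map.Entry<Integer, Double> e) -> Math.abs(e.getValue())).reversed());
    }
    StringBuilder json = new StringBuilder("{");
    int limit = Math.min(maxEntries, entries.size());
    for (int i = 0; i < limit; i++) {
      Map.Entry<Integer, Double> e = entries.get(i);
      int index = e.getKey();
      // Use the dictionary term when one is available, otherwise the raw index.
      boolean hasTerm = dictionary != null && index >= 0
          && index < dictionary.length && dictionary[index] != null;
      String key = hasTerm ? dictionary[index] : String.valueOf(index);
      if (i > 0) {
        json.append(',');
      }
      json.append('"').append(escape(key)).append("\":").append(e.getValue());
    }
    return json.append('}').toString();
  }

  /** Escapes backslashes and double quotes so the key stays a valid JSON string. */
  private static String escape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"");
  }

  public static void main(String[] args) {
    Map<Integer, Double> vector = new LinkedHashMap<>();
    vector.put(0, 0.5);
    vector.put(1, -2.0);
    vector.put(2, 1.0);
    String[] dictionary = {"apple", "banana", "cherry"};
    // Keep the two entries with the largest absolute value.
    System.out.println(vectorToJson(vector, dictionary, 2, true));
  }
}
```

With a null dictionary the keys fall back to the numeric indexes, and with sort set to false the entries keep the iteration order of the input.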
This way, VectorDumper is modified with the following options:
[code]
Option sortVectorsOpt = obuilder.withLongName("sortVectors").withRequired(false)
    .withDescription("Sort output key/value pairs of the vector entries in abs magnitude"
        + " descending order")
    .withShortName("sort").create();
Option numIndexesPerVectorOpt = obuilder.withLongName("vectorSize").withShortName("vs")
    .withRequired(false)
    .withArgument(abuilder.withName("vs").withMinimum(1).withMaximum(1).create())
    .withDescription("Truncate vectors to <vs> length when dumping (most useful when used in"
        + " conjunction with -sort)").create();
[code]
Then if you have clusters represented as vector centroids (or distributions
over terms/features, or anything else which is a collection of Vectors linked
to a dictionary of String labels for the vector indexes), you don't really
need a "ClusterDumper", as
[code]
$MAHOUT_HOME/bin/mahout vectordump -s "path/to/vectors/part-*" \
  --dictionary "path/to/dictionary.file-0" -dt sequencefile \
  -sort --vectorSize 100 -o local_vectors.json
[code]
puts each vector in "path/to/vectors/part-*" on its own line in
local_vectors.json, in JSON format: the keys are the terms with the highest
weight for the vector, the values are the corresponding vector values, and only
the top 100 entries (by value) per vector are emitted.
I've found this modification to VectorDumper invaluable for inspecting LDA topic
models, but doing it without modifying the Vector interface is even better.
> Make cluster top terms code more reusable
> -----------------------------------------
>
> Key: MAHOUT-845
> URL: https://issues.apache.org/jira/browse/MAHOUT-845
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.5
> Reporter: Frank Scholten
> Priority: Minor
> Fix For: 0.6
>
> Attachments: MAHOUT-845.patch, MAHOUT-845.patch, MAHOUT-845.patch
>
>
> When working with Mahout text clustering I find that I keep writing code
> similar to the contents of
> public static String getTopFeatures(Cluster cluster, String[] dictionary, int
> numTerms)
> in ClusterDumper in order to determine cluster labels.
> I think it would be useful if (parts of) this code are added to the cluster
> or vector API so that you could do something like
> Cluster cluster = ... // get the cluster from seq file iterable
> String clusterLabel = cluster.getTopTerms(1, dictionary); // Do something
> with the label
> I think this would make it easier to export and post-process clustering
> results, like indexing or storing them elsewhere.
> Thoughts?