In a thread beginning at [1], Bikash Gupta asks about what seems to be the
same issue I have, namely how to get cluster lists from the output of
clustering NamedVector objects. (In my case the clusters come from a
CanopyDriver call.)
In the thread at [1] I understand Suneel's answer of Feb 24 2014
I'm doing Canopy clustering with CanopyDriver on a sequence file of
NamedVectors and seem to get the expected set of map and reduce
directories. But when I try to read the part-r- file with a
SequenceFile.Reader and iterate over the reader, I immediately
get a NullPointerException
My first entry on such a page would be a plea for more rigor in the
annotation of the Java code for the utilities. For example,
ClusterDumper.java has essentially no annotation and I found that I
had to spend a lot of time to figure out whether (a) it had a call
that would do what I wanted and if
I was taken aback that the immensely touted and convenient Canopy
KMeans package was today deprecated [1] in the incubating Mahout 1.0
with no warning of this that I could find, at least going back through
March. And even then I can see only in retrospect that a suggestion
lurked in [2] that
After running CanopyDriver.run on some 4-dimensional DenseVectors, I'm
passing ClusterDumper a handcrafted dictionary, declared as dictionary
type text. The dictionary looks like this, with each entry line giving
the dimension index and feature name separated by a tab:
4
0 recordedBy
1
I'm a Mahout novice trying to do some semantic data clustering with
Canopy clustering on some low-dimensional SequenceFiles that I
vectorized with ad-hoc Java code. (Some features are strings
vectorized by the Levenshtein distance from a constant, some are
DateTime objects vectorized as