Printing clusters of NamedVectors in mahout 0.9

2014-07-29 Thread Bob Morris
In a thread beginning at [1], Bikash Gupta asks about what seems to be the same issue I have, namely getting cluster lists from the output of a clustering of NamedVector objects. (In my case clusters come from a CanopyDriver call.) In the thread at [1] I understand Suneel's answer of Feb 24 2014

structure of part-r-00000 and SequenceFile.Reader NullPointerException

2014-07-08 Thread Bob Morris
I'm doing Canopy clustering with CanopyDriver on a sequence file of NamedVectors and seem to get the expected set of map and reduce directories. But when I try to read the part-r- file with a SequenceFile.Reader and iterate over the reader, I immediately get a NullPointerException

Re: simple idea for improving mahout docs over the next month?

2014-04-18 Thread Bob Morris
My first entry on such a page would be a plea for more rigor in the annotation of the java code for the utilities. For example, ClusterDumper.java has essentially no annotation and I found that I had to spend a lot of time to figure out whether (a) it had a call that would do what I wanted and if

Grumble about (lack of) warning of deprecation of Canopy KMeans

2014-04-18 Thread Bob Morris
I was taken aback that the immensely touted and convenient Canopy KMeans package was today deprecated [1] in the incubating mahout 1.0 with no warning of this that I could find, at least back through March. And even then I can see only in retrospect that a suggestion lurked in [2] that

text dictionary errors from ClusterDumper

2014-03-30 Thread Bob Morris
After running CanopyDriver.run on some 4-dimensional DenseVectors, I'm using a handcrafted text dictionary passed to ClusterDumper declared as dictionary type text. The dictionary looks like this, with the entry lines having dimension and feature name separated by tab: 4 0 recordedBy 1

newbie asks how to make dictionary files

2014-03-23 Thread Bob Morris
I'm a mahout novice trying to do some semantic data clustering with Canopy clustering on some low-dimensional SequenceFiles that I vectorized with ad-hoc Java code. (Some features are strings vectorized by the Levenshtein distance from a constant, some are DateTime objects vectorized as
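
For readers unfamiliar with the string vectorization mentioned in this post, here is a minimal sketch of turning string features into numbers via the Levenshtein distance from a constant. It is only an illustration under stated assumptions, not the poster's actual code: the class name LevenshteinFeature, the method names, and the reference string and sample values in main are all hypothetical, and no Mahout classes are used.

```java
public class LevenshteinFeature {

    // Classic dynamic-programming edit distance (insert, delete, substitute,
    // each with cost 1), keeping only two rows of the table in memory.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical helper: map each string to one numeric feature, namely its
    // distance from a fixed reference string (the "constant").
    static double[] vectorize(String[] values, String reference) {
        double[] features = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            features[i] = distance(values[i], reference);
        }
        return features;
    }

    public static void main(String[] args) {
        // Assumed sample data, for illustration only.
        String reference = "Morris";
        String[] recordedBy = {"Morris", "Moris", "Gupta"};
        double[] features = vectorize(recordedBy, reference);
        for (int i = 0; i < features.length; i++) {
            System.out.println(recordedBy[i] + " -> " + features[i]);
        }
    }
}
```

Each resulting double would occupy one dimension of the vector to be clustered. Note that a distance from a single constant maps quite different strings to the same value, so it is a coarse feature.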