The Shashikant code ends up with a SparseVector. There must be some easy way to pull in a SparseVector instead of a DenseVector. The SparseVector reader wants a DataInput, and the InputMapper has a Text, but perhaps a quick StringReader is all I need.
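A java.io.StringReader is a Reader rather than a DataInput, so it would not plug into a reader that expects a DataInput. If the bytes held by the Text really are a binary-serialized vector, a DataInputStream over a ByteArrayInputStream is the usual way to get a DataInput from them. Below is a minimal sketch of that wrapping, under stated assumptions: the class SparseVectorSketch, its methods, and the byte layout (an int count followed by index/value pairs) are hypothetical and are not Mahout's actual SparseVector format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a sparse vector reader; NOT Mahout's SparseVector.
class SparseVectorSketch {

    // Assumed layout: int entry count, then (int index, double value) pairs.
    static void write(Map<Integer, Double> entries, DataOutput out) throws IOException {
        out.writeInt(entries.size());
        for (Map.Entry<Integer, Double> e : entries.entrySet()) {
            out.writeInt(e.getKey());
            out.writeDouble(e.getValue());
        }
    }

    // Reads the same assumed layout back from any DataInput.
    static Map<Integer, Double> read(DataInput in) throws IOException {
        int count = in.readInt();
        Map<Integer, Double> entries = new TreeMap<>();
        for (int i = 0; i < count; i++) {
            int index = in.readInt();
            double value = in.readDouble();
            entries.put(index, value);
        }
        return entries;
    }

    // Wraps the first `length` bytes of a buffer as a DataInput. With a Hadoop
    // Text, the arguments would be text.getBytes() and text.getLength(), since
    // the backing array can be longer than the valid data.
    static DataInput asDataInput(byte[] bytes, int length) {
        return new DataInputStream(new ByteArrayInputStream(bytes, 0, length));
    }

    public static void main(String[] args) throws IOException {
        Map<Integer, Double> original = new TreeMap<>();
        original.put(3, 0.5);
        original.put(42, 1.25);

        // Serialize into a byte buffer, then read it back through a DataInput.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        write(original, new DataOutputStream(buffer));
        byte[] bytes = buffer.toByteArray();

        Map<Integer, Double> roundTripped = read(asDataInput(bytes, bytes.length));
        System.out.println(roundTripped.equals(original));
    }
}
```

If the Text instead holds a human-readable line, this wrapping does not apply and the line would have to be parsed as text in whatever format the writer produced.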
The code in the example

On Fri, May 29, 2009 at 12:00 PM, Grant Ingersoll <[email protected]> wrote:

> I think Shashikant was using a modified form of Mahout that encoded the
> labels in the output.
>
> I think we're still a little bit away from having a utility that truly
> makes this straightforward to go from text to clusterable vectors.
>
> No doubt what is happening is the recognition of a need for some type of
> pipeline process that can work with multiple data sources and output various
> consumable formats and help select features. Unfortunately, we aren't there
> just yet.
>
> -Grant
>
> On May 29, 2009, at 11:27 AM, Benson Margulies wrote:
>
>> I'll fish for a one more hint. I'm using the MAHOUT-126 code to turn text
>> into data via TF-IDF. What comes out of there is not in the same format as
>> your example data. This means that I need a different InputDriver? Is one
>> lying about for the format written by that DocumentVector class?
>>
>> On Fri, May 29, 2009 at 10:29 AM, Jeff Eastman <[email protected]> wrote:
>>
>>> Benson Margulies wrote:
>>>
>>>> OK, I've got some inputs, I want to run k-means, how do I feed the
>>>> beast?
>>>
>>> Make sure you can run the Synthetic Control example to get everything
>>> wired together correctly: JDK, Hadoop, Mahout. See
>>> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html. Then write an
>>> input job to convert your data similar to
>>> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/InputDriver.java
>>> and make a new job like
>>> /Mahout/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java.
>>> You will have a small adventure and then be operational.
>>>
>>> Have fun,
>>> Jeff
