Hi Jeff,

Thanks for your reply. I just got trunk and started the install. It ended with this error:

Error loading supplemental data models: Cannot create file-based resource.
org.codehaus.plexus.resource.loader.FileResourceCreationException: Cannot create file-based resource.

A lot built, so I went ahead and tried your command-line example, but got:

ERROR: Could not find mahout-examples-*.job in /mnt/install/tools/mahout or /mnt/install/tools/mahout/examples/target, please run 'mvn install' to create the .job file

I retrieved trunk as follows:

svn co http://svn.apache.org/repos/asf/mahout/trunk

Then ran 'mvn install' in the trunk folder.

Any issues with trunk today?

Thanks,
Matt

On Wed, Sep 29, 2010 at 12:29 PM, Jeff Eastman <j...@windwardsolutions.com> wrote:
> Hi Matt,
>
> From your command arguments, it looks like you are running 0.3. Due to the
> rate of change in Mahout we recommend you check out trunk and use that
> instead. With a little tweaking (added a --charset ASCII on seqdirectory) I
> was able to get as far as you did on trunk but seq2sparse is not what you
> want to use.
>
> The utilities you are using are intended for text preprocessing, to get
> documents word-counted, into term vector sequenceFiles and then running TF
> and/or TF-IDF processing on the results to produce VectorWritable sequence
> files suitable for clustering. For your problem, I suggest you instead look
> at the Synthetic Control clustering examples, starting with Canopy. These
> use an InputDriver to process text files containing space-delimited numbers
> like your data.dat file and produce the VectorWritable sequence files
> directly.
>
> I was able to run this on your data using trunk and it produced 3 clusters.
> You should be able to run the other synthetic control jobs on it too:
>
> CommandLine:
> ./bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job \
> -i data \
> -o output \
> -t1 3 \
> -t2 2 \
> -ow \
> -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure
>
> Clusters output:
> C-0{n=1 c=[22.000, 21.000] r=[0.000, 0.000]}
> Weight: Point:
> 1.0: [22.000, 21.000]
> C-1{n=2 c=[18.250, 21.500] r=[0.250, 0.500]}
> Weight: Point:
> 1.0: [19.000, 20.000]
> 1.0: [18.000, 22.000]
> C-2{n=2 c=[2.500, 2.250] r=[0.500, 0.250]}
> Weight: Point:
> 1.0: [1.000, 3.000]
> 1.0: [3.000, 2.000]
>
>
> Good hunting,
> Jeff
>
> On 9/29/10 2:26 PM, Matt Tanquary wrote:
>>
>> I was able to run the tutorials, etc. Now I would like to generate my
>> own small test.
>>
>> I have created a data.dat file and put these contents:
>> 22 21
>> 19 20
>> 18 22
>> 1 3
>> 3 2
>>
>> Then I ran: mahout seqdirectory -i ~/data/kmeans/data.dat -o kmeans/seqdir
>>
>> This created kmeans/seqdir/chunk-o in my dfs with the following content:
>> ź/%
>> /data.dat22 21
>> 19 20
>> 18 22
>> 1 3
>> 3 2
>>
>> Next I ran: mahout seq2sparse -i kmeans/seqdir -o kmeans/input
>>
>> This generated several things in kmeans/input including the
>> 'tfidf/vectors' folder. Inside the vectors folder I get: part-00000
>> which contains:
>> řĎân
>> /data.dat7org.apache.mahout.math.RandomAccessSparseVectorWritable
>> /data.dat@@
>>
>> It does not seem to have the numeric data at this point.
>>
>> I am hoping someone can shed some light on how I can get my datapoint
>> file into the proper vector format for running mahout kmeans.
>>
>> Just fyi, when I run kmeans against that file (mahout kmeans -i
>> kmeans/input/tfidf/vectors -c kmeans/clusters -o kmeans/output -k 2
>> -w) I get:
>>
>> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index:
>> 1, Size: 1
>> at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>
>> which tells me it was unable to find even 1 vector in the given input
>> folder.
>>
>> Thanks for any comments you provide.
>> -M@
>
>

--
Have you thanked a teacher today? ---> http://www.liftateacher.org
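
A side note on the input format discussed in the quoted thread: the point of the InputDriver suggestion is that each line of data.dat should be read as one numeric vector, not as text to be word-counted. Below is a minimal Python sketch of that idea only. It is not Mahout's InputDriver and it does not write sequence files; the names parse_points and euclidean are hypothetical, chosen for illustration.

```python
import math

# Contents of data.dat as given in the original message.
DATA = """22 21
19 20
18 22
1 3
3 2
"""

def parse_points(text):
    """Turn each non-empty line of space-delimited numbers into a list of floats."""
    points = []
    for line in text.splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            points.append([float(f) for f in fields])
    return points

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

points = parse_points(DATA)
for p in points:
    print(p)
```

Each line becomes one two-dimensional point, which is the shape of data a distance-based clusterer needs. By contrast, the seq2sparse output shown above no longer seemed to contain the numeric values.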