Hello vs,

I am also a beginner Mahout user, and I think the problem may be with your initial step of converting the text matrix to a sequence file. seqdirectory treats each input file as one text document, so your whole 500x500 matrix most likely ends up as a single vector after seq2sparse; RandomSeedGenerator then cannot sample k=5 seeds from one vector, which would explain the "Index: 1, Size: 1" exception. I had a similar task of converting a tab-delimited matrix into a sequence file of <IntWritable,VectorWritable> for SVD computations. What I did was to write some custom Java code using the Hadoop and Mahout APIs to convert my text file to a SequenceFile. I used a Map/Reduce implementation, but there must be an easier way.
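Here is a minimal, single-process sketch of the conversion idea (much simpler than the Map/Reduce version I actually ran). It reads a whitespace-delimited matrix one row per line and writes one <IntWritable,VectorWritable> pair per row. The class name, the input file name (I reused your 'raw-data.txt'), and the output path are only illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class MatrixToSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Illustrative output location; kmeans -i can point at the parent directory.
    Path output = new Path("matrix-seq/part-00000");

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, output, IntWritable.class, VectorWritable.class);
    BufferedReader reader = new BufferedReader(new FileReader("raw-data.txt"));
    try {
      String line;
      int row = 0;
      while ((line = reader.readLine()) != null) {
        // Split on any whitespace, so space- and tab-delimited input both work.
        String[] tokens = line.trim().split("\\s+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
          values[i] = Double.parseDouble(tokens[i]);
        }
        // Key: the row index; value: the row as a dense Mahout vector.
        writer.append(new IntWritable(row++),
                      new VectorWritable(new DenseVector(values)));
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}

Writing the vectors directly like this skips the seqdirectory/seq2sparse text pipeline (tokenization, dictionary, TF-IDF) entirely, which seems to be what you want for data that is already numeric.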
In your case, it seems that kmeans takes in a sequence file of <Writable,Canopy> or <Writable,Cluster>. I can include more details on the Map/Reduce version if you would like, but I am also interested to see if there is an easier way.
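For what it's worth, once the matrix is converted to <IntWritable,VectorWritable> pairs as above, I would expect your step (3) to work by pointing -i at the converted file instead of the tfidf-vectors directory. The matrix-seq directory name comes from my sketch; the remaining paths and options are yours:

./mahout kmeans -i ~/temp/kmeans-input-dir/matrix-seq/ -c ~/temp/kmeans-input-dir/clusters/ -o ~/temp/kmeans-input-dir/kmeans-output -x 10 -k 5 -ow

Since -k is given, RandomSeedGenerator should then have 500 row vectors to sample the 5 initial seeds from, rather than the single document it apparently sees now.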
Vincent

On Fri, Apr 22, 2011 at 7:24 PM, vs <[email protected]> wrote:

> Mahout Users,
>
> I have seen posts attempting to answer the problem I have at hand, but I
> would like to seek some comments from those who have been successful in
> resolving this issue.
>
> (1) Input data: a space-delimited symmetric matrix of 500x500 double values.
> The entire matrix is in one single file, say 'raw-data.txt'.
> Example:
> 1 0.8 0.9 ....
> 0.8 1 0.7 ....
> 0.3 0.5 1 ....
>
> (2) Data format conversion:
>
> (a) Convert 'raw-data.txt' into a sequence format representation using the
> command
>
> ./mahout seqdirectory -i ~/temp/kmeans-input-dir/raw-dir/ -o
> ~/temp/kmeans-input-dir/seq-dir -c ascii
>
> ~/temp/kmeans-input-dir/seq-dir> ls -a
> . .. chunk-0 .chunk-0.crc
>
> (b) Convert sequence data into vector format:
>
> ./mahout seq2sparse -i ~/temp/kmeans-input-dir/seq-dir/ -o
> ~/temp/kmeans-input-dir/vec-dir
>
> ~/temp/kmeans-input-dir/vec-dir> ls -aR
> . .. df-count dictionary.file-0 .dictionary.file-0.crc frequency.file-0
> .frequency.file-0.crc tfidf-vectors tf-vectors tokenized-documents wordcount
>
> ./df-count:
> . .. part-r-00000 .part-r-00000.crc
>
> ./tfidf-vectors:
> . .. part-r-00000 .part-r-00000.crc
>
> ./tf-vectors:
> . .. part-r-00000 .part-r-00000.crc
>
> ./tokenized-documents:
> . .. part-m-00000 .part-m-00000.crc
>
> ./wordcount:
> . .. part-r-00000 .part-r-00000.crc
>
> (3) Run kmeans on the vector data:
>
> ./mahout kmeans -c ~/temp/kmeans-input-dir/clusters/ -i
> ~/temp/kmeans-input-dir/vec-dir/tfidf-vectors/ -o
> ~/temp/kmeans-input-dir/kmeans-output -x 10 -k 5 -ow
>
> 11/04/22 13:11:35 INFO common.AbstractJob: Command line arguments:
> {--clusters=~/temp/kmeans-input-dir/test-data-1-again/clusters,
> --convergenceDelta=0.5,
> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
> --endPhase=2147483647,
> --input=~/temp/kmeans-input-dir/test-data-1-again/vec-dir/tfidf-vectors/,
> --maxIter=10, --method=mapreduce, --numClusters=5,
> --output=~/temp/kmeans-input-dir/test-data-1-again/kmeans-output,
> --overwrite=null, --startPhase=0, --tempDir=temp}
> 11/04/22 13:11:35 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 11/04/22 13:11:35 INFO zlib.ZlibFactory: Successfully loaded & initialized
> native-zlib library
> 11/04/22 13:11:35 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>         at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>         at java.util.ArrayList.get(ArrayList.java:322)
>         at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
>         at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcce
>
> Thoughts and comments on the above procedure are highly appreciated.
>
> Thanks,
>
> -----
> vs
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/kmeans-on-space-delimited-input-data-tp2852337p2852337.html
> Sent from the Mahout User List mailing list archive at Nabble.com.
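P.S. To sanity-check either conversion, you can dump the sequence file with the seqdumper utility (assuming your Mahout version ships it; the flag name may differ between releases):

./mahout seqdumper -s ~/temp/kmeans-input-dir/matrix-seq/part-00000

You should see one <IntWritable,VectorWritable> entry per matrix row (500 in your case); with the seqdirectory route you will instead see a single Text entry holding the whole file, which is where the trouble starts.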
