Hi guys, I'm totally new to Mahout, so I'm running into what I expect are newbie issues.

To get started with clustering, I tried importing some indexes from Lucene. Following the Lucene tutorial, I created a really simple index of the Lucene source code:

http://lucene.apache.org/java/3_0_0/demo.html

I then tried to convert this to a Mahout Vector, as per:

http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html

This gives me a CorruptIndexException:

r...@rob:~/svn/mahout$ java org.apache.mahout.utils.vectors.lucene.Driver --dir /home/rob/Reference/Installers/lucene-3.0.0/index --output /home/rob/test/output --dictOut /home/rob/test/dict --max 50 --field contents
Exception in thread "main" org.apache.lucene.index.CorruptIndexException: Incompatible format version: 2 expected 1 or lower
    at org.apache.lucene.index.FieldsReader.<init>(FieldsReader.java:117)
    at org.apache.lucene.index.SegmentReader$CoreReaders.openDocStores(SegmentReader.java:277)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:640)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:599)
    at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:104)
    at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
    at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:74)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:704)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:314)
    at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:140)

I also tried running the driver on the actual Lucene index that I want to apply it to, and this time I got a NullPointerException:

r...@rob:~/svn/mahout$ java org.apache.mahout.utils.vectors.lucene.Driver --dir /home/rob/git/thinklink/scala/bin/index/ --output /home/rob/test/output --dictOut /home/rob/test/dict --max 50 --field contents
Jan 14, 2010 9:40:40 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Output File: /home/rob/test/output
Exception in thread "main" java.lang.NullPointerException
    at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
    at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:910)
    at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1074)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:284)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:265)
    at org.apache.mahout.utils.vectors.lucene.Driver.getSeqFileWriter(Driver.java:226)
    at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:197)

In both cases, the indexes should have the "contents" field. I assume I'm doing something stupid here. If someone can tell me what that is, that would be great.

Thanks
-Rob
