Hi Guys,

I'm totally new to Mahout so I'm running into what I expect are newbie issues.

To get started with clustering, I tried importing some indexes from Lucene.

Following the Lucene tutorial, I created a really simple index of the
Lucene source code:
http://lucene.apache.org/java/3_0_0/demo.html

I then tried to convert this index to Mahout vectors, following
http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html

This gives me a CorruptIndexException:

r...@rob:~/svn/mahout$ java org.apache.mahout.utils.vectors.lucene.Driver \
    --dir /home/rob/Reference/Installers/lucene-3.0.0/index \
    --output /home/rob/test/output --dictOut /home/rob/test/dict \
    --max 50 --field contents
Exception in thread "main" org.apache.lucene.index.CorruptIndexException: Incompatible format version: 2 expected 1 or lower
        at org.apache.lucene.index.FieldsReader.<init>(FieldsReader.java:117)
        at org.apache.lucene.index.SegmentReader$CoreReaders.openDocStores(SegmentReader.java:277)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:640)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:599)
        at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:104)
        at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
        at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:74)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:704)
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:314)
        at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:140)


I also tried running the driver on the actual Lucene index that I want
to apply it to, and this time I got a NullPointerException:

r...@rob:~/svn/mahout$ java org.apache.mahout.utils.vectors.lucene.Driver \
    --dir /home/rob/git/thinklink/scala/bin/index/ \
    --output /home/rob/test/output --dictOut /home/rob/test/dict \
    --max 50 --field contents
Jan 14, 2010 9:40:40 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Output File: /home/rob/test/output
Exception in thread "main" java.lang.NullPointerException
        at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
        at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:910)
        at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1074)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:284)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:265)
        at org.apache.mahout.utils.vectors.lucene.Driver.getSeqFileWriter(Driver.java:226)
        at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:197)


In both cases, the index should contain a "contents" field.


I assume I'm doing something stupid here. If someone can tell me what
that is, then that would be great.


Thanks

-Rob
