Thanks for the help.
I hadn't realized that Java was picking up the Lucene class from the
target/dependency/ directory, rather than from my Lucene installation.
I fixed this by replacing the Lucene jar in the dependency directory
with the one from Lucene 3.0.0, and now I get the
NullPointerException for the Lucene demo index as well:
r...@rob:~/svn/mahout$ java
org.apache.mahout.utils.vectors.lucene.Driver --dir
/home/rob/Reference/Installers/lucene-3.0.0/index --output
/home/rob/test/output --dictOut /home/rob/test/dict --max 50 --field
contents
Jan 18, 2010 2:06:14 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Output File: /home/rob/test/output
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:910)
at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1074)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:284)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:265)
at org.apache.mahout.utils.vectors.lucene.Driver.getSeqFileWriter(Driver.java:226)
at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:197)
I then tried downgrading Lucene to 2.9.1 to see if this fixed the
NullPointerException, but I get the same problem:
java org.apache.mahout.utils.vectors.lucene.Driver --dir
/home/rob/Reference/Installers/lucene-2.9.1/index --output
/home/rob/test/output --dictOut /home/rob/test/dict --max 50 --field
contents
Jan 18, 2010 2:18:37 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Output File: /home/rob/test/output
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:910)
at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1074)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:284)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:265)
at org.apache.mahout.utils.vectors.lucene.Driver.getSeqFileWriter(Driver.java:226)
at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:197)
Any idea what's going on here?
Thanks
-Rob
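
The classpath mix-up described above (Java loading the Lucene classes from target/dependency/ rather than from the intended installation) can be checked directly by asking the JVM where a class came from. The following is a minimal sketch that uses only the standard JDK; the names WhichJar and locationOf are hypothetical, chosen for illustration, and are not part of Mahout or Lucene.

```java
import java.security.CodeSource;

public class WhichJar {

    /** Returns the jar or directory a class was loaded from, or a note if that is unknown. */
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader (e.g. java.lang.String) have no code source.
        if (src == null || src.getLocation() == null) {
            return "(bootstrap class path)";
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name; falls back to java.lang.String if none is given.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " loaded from: " + locationOf(Class.forName(name)));
    }
}
```

Run with the same classpath as the Driver and a class name taken from the stack trace, for example org.apache.lucene.index.IndexReader; the printed location should show which Lucene jar is actually in use.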
On Thu, Jan 14, 2010 at 10:01 PM, Shashikant Kore <[email protected]> wrote:
> The first problem seems to be index version incompatibility.
>
> Since you created the index with Lucene 3.0, you will need the same
> version to read it. It seems that the Lucene version used while creating
> the vectors is lower than that. Can you check whether you are using
> the same Lucene jar when creating the vectors?
>
> Not sure what the second problem is.
>
> --shashi
>
> On Fri, Jan 15, 2010 at 11:11 AM, Rob Ennals <[email protected]> wrote:
>> Hi Guys,
>>
>> I'm totally new to Mahout so I'm running into what I expect are newbie
>> issues.
>>
>> To get started with clustering, I tried importing some indexes from Lucene.
>>
>> Following the Lucene tutorial, I created a really simple index of the
>> Lucene source code:
>> http://lucene.apache.org/java/3_0_0/demo.html
>>
>> I then tried to convert this to a Mahout Vector, as per
>> http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
>>
>> This gives me a CorruptIndexException:
>>
>> r...@rob:~/svn/mahout$ java
>> org.apache.mahout.utils.vectors.lucene.Driver --dir
>> /home/rob/Reference/Installers/lucene-3.0.0/index --output
>> /home/rob/test/output --dictOut /home/rob/test/dict --max 50 --field
>> contents
>> Exception in thread "main" org.apache.lucene.index.CorruptIndexException: Incompatible format version: 2 expected 1 or lower
>> at org.apache.lucene.index.FieldsReader.<init>(FieldsReader.java:117)
>> at org.apache.lucene.index.SegmentReader$CoreReaders.openDocStores(SegmentReader.java:277)
>> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:640)
>> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:599)
>> at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:104)
>> at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:27)
>> at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:74)
>> at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:704)
>> at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
>> at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
>> at org.apache.lucene.index.IndexReader.open(IndexReader.java:314)
>> at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:140)
>>
>>
>> I also tried running the driver on the actual Lucene index that I want
>> to apply it to, and this time I got a NullPointerException:
>>
>> r...@rob:~/svn/mahout$ java
>> org.apache.mahout.utils.vectors.lucene.Driver --dir
>> /home/rob/git/thinklink/scala/bin/index/ --output
>> /home/rob/test/output --dictOut /home/rob/test/dict --max 50 --field
>> contents
>> Jan 14, 2010 9:40:40 PM org.slf4j.impl.JCLLoggerAdapter info
>> INFO: Output File: /home/rob/test/output
>> Exception in thread "main" java.lang.NullPointerException
>> at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
>> at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:910)
>> at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1074)
>> at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
>> at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:284)
>> at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:265)
>> at org.apache.mahout.utils.vectors.lucene.Driver.getSeqFileWriter(Driver.java:226)
>> at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:197)
>>
>>
>> In both cases, the indexes should have the "contents" field.
>>
>>
>> I assume I'm doing something stupid here. If someone can tell me what
>> that is, then that would be great.
>>
>>
>> Thanks
>>
>> -Rob
>>
>