It seems the problem is that not all the documents in my index have the
field I am using to get term vectors from. I made the following changes to
make this work, but I am not sure whether that's the right way. I wanted to
get this working so that I could run LDA topic modeling on the output from
the Driver.
Index: utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java
===================================================================
--- utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java	(revision 830343)
+++ utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java	(working copy)
@@ -42,7 +42,7 @@
         break;
       }
       //point.write(dataOut);
-      writer.append(new LongWritable(recNum++), point);
+      if (point != null) writer.append(new LongWritable(recNum++), point);
     }
     return recNum;
Index: utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java
===================================================================
--- utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java	(revision 830343)
+++ utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java	(working copy)
@@ -104,6 +104,10 @@
       try {
         indexReader.getTermFreqVector(doc, field, mapper);
         result = mapper.getVector();
+
+        if (result == null)
+          return null;
+
         if (idField != null) {
           String id = indexReader.document(doc, idFieldSelector).get(idField);
           result.setName(id);
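
Since next() returning null technically breaks the Iterator contract (and
forces every caller to null-check, as the SequenceFileVectorWriter change
above does), another option would be to skip vector-less documents inside
the iteration itself. Below is a minimal, hypothetical sketch; the class
name is made up and this is not the actual Mahout code:

import java.util.Iterator;
import java.util.NoSuchElementException;

/**
 * Hypothetical helper (not in Mahout): wraps an iterator that may yield
 * nulls and silently skips them, so a caller like SequenceFileVectorWriter
 * never sees a null element.
 */
public final class NullSkippingIterator<T> implements Iterator<T> {
  private final Iterator<T> delegate;
  private T lookahead; // next non-null element, or null when exhausted

  public NullSkippingIterator(Iterator<T> delegate) {
    this.delegate = delegate;
    advance();
  }

  private void advance() {
    lookahead = null;
    while (lookahead == null && delegate.hasNext()) {
      lookahead = delegate.next(); // documents with no vector are dropped here
    }
  }

  public boolean hasNext() {
    return lookahead != null;
  }

  public T next() {
    if (lookahead == null) {
      throw new NoSuchElementException();
    }
    T result = lookahead;
    advance();
    return result;
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }
}

With a wrapper like this, SequenceFileVectorWriter.write() could stay
unchanged, and LuceneIterable.iterator() would hand back
new NullSkippingIterator<Vector>(...) instead of an iterator that can
return null.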
sushil_kb wrote:
>
> I am having the same problem as Allan. I checked out Mahout from trunk,
> tried to create term frequency vectors from a Lucene index, and ran into
> this:
>
> 09/10/27 17:36:10 INFO lucene.Driver: Output File: /Users/shoeseal/DATA/luc2tvec.out
> 09/10/27 17:36:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 09/10/27 17:36:11 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:109)
> at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:1)
> at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:40)
> at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:200)
>
> I am running this from Eclipse (Snow Leopard with JDK 6), on an index
> that has a field with stored term vectors.
>
> my input parameters for Driver are:
> --dir <path>/smallidx/ --output <path>/luc2tvec.out --idField id_field
> --field field_with_TV --dictOut <path>/luc2tvec.dict --max 50 --weight tf
>
> Luke shows the following info on the fields I am using:
> id_field is indexed, stored, omit norms
> field_with_TV is indexed, tokenized, stored, term vector
>
> I can run the LuceneIterableTest test fine, but when I run the Driver on
> my index I get into trouble. Any possible reasons for this behavior
> besides not having an index field with a stored term vector?
>
> Thanks.
> - sushil
>
>
>
>
> Grant Ingersoll-6 wrote:
>>
>>
>> On Jul 2, 2009, at 12:09 PM, Allan Roberto Avendano Sudario wrote:
>>
>>> Regards,
>>> This is the entire exception message:
>>>
>>>
>>> java -cp $JAVACLASSPATH org.apache.mahout.utils.vectors.Driver --dir
>>> /home/hadoop/Desktop/<urls>/index --field content --dictOut
>>> /home/hadoop/Desktop/dictionary/dict.txt --output
>>> /home/hadoop/Desktop/dictionary/out.txt --max 50 --norm 2
>>>
>>>
>>> 09/07/02 09:35:47 INFO vectors.Driver: Output File: /home/hadoop/Desktop/dictionary/out.txt
>>> 09/07/02 09:35:47 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>> 09/07/02 09:35:47 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>> 09/07/02 09:35:47 INFO compress.CodecPool: Got brand-new compressor
>>> Exception in thread "main" java.lang.NullPointerException
>>> at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:111)
>>> at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:82)
>>> at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:25)
>>> at org.apache.mahout.utils.vectors.Driver.main(Driver.java:204)
>>>
>>>
>>> Well, I used a Nutch crawl index; is that correct? Hmm... I changed to
>>> the content field, but nothing happened. Possibly the Nutch crawl
>>> doesn't have term vectors indexed.
>>
>> This would be my guess. A small edit to Nutch code would probably
>> allow it. Just find where it creates a new Field and add in the TV
>> stuff.
>>
>
>
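
For anyone who wants to follow Grant's suggestion: in the Lucene 2.x API
that Nutch used at the time, term vectors are switched on per field when
the Field is constructed. A rough sketch (class and field names here are
illustrative, not the actual Nutch indexing code):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public final class TermVectorFieldExample {

  // Illustrative only: builds a Document whose "content" field stores
  // term vectors, which is what the Mahout vectors Driver requires.
  public static Document buildDocument(String id, String text) {
    Document doc = new Document();
    doc.add(new Field("id_field", id,
                      Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("content", text,
                      Field.Store.YES, Field.Index.ANALYZED,
                      Field.TermVector.YES)); // the "TV stuff"
    return doc;
  }
}

Note that term vectors cannot be added to already-indexed documents; the
index has to be rebuilt after the change.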