Hi,

I have already opened a fairly detailed JIRA ticket and submitted a patch for
this issue:
https://issues.apache.org/jira/browse/MAHOUT-675

The short of it is that the LuceneIterator throws an IllegalStateException when
it encounters a null term vector in the computeNext method. That is
problematic because it pushes the responsibility for checking whether a
document field has any legitimate terms onto the person creating and
maintaining the Lucene index. That might sound good in principle; however, it
isn't exactly intuitive to do when creating or maintaining a Lucene index. The
reason this is even an issue is that when you start creating custom Lucene
analyzers, a pretty important practice if you want to improve your text mining
results, you may end up filtering out all terms in the target field for some
documents. That is actually a desirable result, as it indicates that those
documents are noise. However, when you then attempt to dump the vectors of that
index, the noise documents cause an IllegalStateException, and the exception
does not indicate that the cause was the custom analyzer.

I believe, at least in my situation, a better approach is for the
LuceneIterator to log a warning that includes the idField value when it
encounters a problem document, and then move on to the next one.
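
To illustrate the idea, here is a minimal, self-contained sketch of the
skip-and-warn behavior. It is not the actual LuceneIterator code and not the
patch attached to the ticket: IndexedDoc and SkippingDocIterator are
hypothetical names made up for this example, and a null terms list merely
stands in for a null term vector.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.logging.Logger;

/** Hypothetical stand-in for an indexed document; not a Lucene or Mahout type. */
final class IndexedDoc {
    final String id;          // plays the role of the idField value
    final List<String> terms; // null models a missing (null) term vector

    IndexedDoc(String id, List<String> terms) {
        this.id = id;
        this.terms = terms;
    }
}

/** Wraps a document iterator and skips documents that have no term vector. */
final class SkippingDocIterator implements Iterator<IndexedDoc> {
    private static final Logger LOG =
        Logger.getLogger(SkippingDocIterator.class.getName());

    private final Iterator<IndexedDoc> source;
    private IndexedDoc next; // next usable document, or null when exhausted
    private int skipped;     // number of documents skipped so far

    SkippingDocIterator(Iterator<IndexedDoc> source) {
        this.source = source;
        advance();
    }

    /** Moves to the next document with a term vector, warning about the rest. */
    private void advance() {
        next = null;
        while (source.hasNext()) {
            IndexedDoc candidate = source.next();
            if (candidate.terms == null) {
                // Log and continue instead of throwing IllegalStateException.
                LOG.warning("Skipping document with null term vector, id="
                    + candidate.id);
                skipped++;
                continue;
            }
            next = candidate;
            return;
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public IndexedDoc next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        IndexedDoc result = next;
        advance();
        return result;
    }

    /** Documents skipped so far, including any skipped while looking ahead. */
    int skippedCount() {
        return skipped;
    }
}
```

With this shape, a caller dumping vectors gets every usable document, and the
warnings identify which documents the analyzer reduced to nothing.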

Thanks,

Chris
