[ https://issues.apache.org/jira/browse/MAHOUT-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning reopened MAHOUT-675:
--------------------------------


I think we need to limit the amount of logging.  Patch coming for review.
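
One way to keep the warning volume bounded is to log the first few skipped documents individually and then emit a single summary count at the end. The following is only a rough sketch of that idea, with made-up class and method names; it is not the patch mentioned above.

// Hypothetical sketch: cap per-document warnings and report one summary total.
// Class and method names are illustrative, not part of Mahout.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BoundedSkipLogger {

  private static final Logger log = LoggerFactory.getLogger(BoundedSkipLogger.class);
  private static final int MAX_INDIVIDUAL_WARNINGS = 10; // illustrative cap

  private int skipped = 0;

  /** Call for each document that has no term vector for the target field. */
  public void recordSkip(int docId, String field) {
    skipped++;
    if (skipped <= MAX_INDIVIDUAL_WARNINGS) {
      log.warn("document {} has no term vector for field '{}'; skipping", docId, field);
    }
  }

  /** Call once after the iteration finishes to report anything not logged above. */
  public void logSummary() {
    if (skipped > MAX_INDIVIDUAL_WARNINGS) {
      log.warn("skipped {} documents with no term vector in total", skipped);
    }
  }
}

recordSkip() would be called wherever a null term vector is detected, and logSummary() once after the iteration finishes.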

> LuceneIterator throws an IllegalStateException when a null TermFreqVector is 
> encountered for a document instead of skipping to the next one
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-675
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-675
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Utils
>            Reporter: Chris Jordan
>             Fix For: 0.5
>
>         Attachments: MAHOUT-675, MAHOUT-675-1, MAHOUT-675.patch
>
>
> The org.apache.mahout.utils.vectors.lucene.LuceneIterator currently throws an 
> IllegalStateException in its computeNext() method when it encounters a document 
> whose term frequency vector for the target field is null. That is problematic 
> for people developing text mining applications on top of Lucene, because it 
> forces them to verify that every document they add to their Lucene indexes 
> actually has terms for the target field. While that check may sound reasonable, 
> it is not practical.
> In most cases Lucene applies an analyzer to a field as the document is added to 
> the index. The StandardAnalyzer is fairly lenient and removes very few terms, 
> but if you want better text mining performance you will usually create your own 
> custom analyzer. For example, in my current work on document clustering, in 
> order to generate tighter clusters and more human-readable top terms, I am 
> using a stop word list specific to my subject domain and I am filtering out 
> terms that contain numbers (a rough sketch of such a filter appears after this 
> description). The net result is that some of my documents have no terms for the 
> target field, which is a desirable outcome. When I attempt to dump the Lucene 
> vectors, though, I encounter an IllegalStateException because of those 
> documents.
> It is possible for me to check the TokenStream of the target field before I 
> insert a document into my index; however, that approach means I would have to 
> perform the check in every one of my applications. That is not a great 
> practice: someone experimenting with custom analyzers to improve text mining 
> performance could hit this exception with no real indication that the custom 
> analyzer was the cause.
> I believe a better approach is to log a warning with the field id of the 
> problem document and then skip to the next one, as sketched below. That way a 
> warning ends up in the logs and the Lucene vector dump process does not halt.
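
For the number-filtering step described in the report, a small custom TokenFilter along the following lines would drop tokens that contain digits. This is only a sketch assuming a Lucene 3.1-style API (CharTermAttribute); it is not code from the reporter's project.

// Sketch of a TokenFilter that drops any token containing a digit.
// Assumes a Lucene 3.1-style API (CharTermAttribute); adjust for other versions.
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class NoNumbersFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public NoNumbersFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    // Consume tokens until one without a digit is found (or the stream ends).
    while (input.incrementToken()) {
      if (!containsDigit(termAtt.buffer(), termAtt.length())) {
        return true;
      }
    }
    return false;
  }

  private static boolean containsDigit(char[] buffer, int length) {
    for (int i = 0; i < length; i++) {
      if (Character.isDigit(buffer[i])) {
        return true;
      }
    }
    return false;
  }
}

Such a filter would typically be chained after the tokenizer and stop filter inside a custom analyzer, which is how documents can end up with no terms at all for the target field.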
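
And a minimal sketch of the proposed skip-and-warn behaviour itself, written as a stand-alone loop rather than the actual LuceneIterator.computeNext() implementation (the surrounding class and method names are assumptions; getTermFreqVector() and maxDoc() are standard Lucene 3.x IndexReader calls):

// Sketch of the proposed behaviour: warn about documents with a null
// TermFreqVector and skip them instead of throwing IllegalStateException.
// The surrounding loop is illustrative; it is not the actual LuceneIterator code.
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SkipNullTermVectors {

  private static final Logger log = LoggerFactory.getLogger(SkipNullTermVectors.class);

  /** Walks every document and skips (with a warning) those lacking a term vector. */
  public static void dumpVectors(IndexReader reader, String field) throws IOException {
    for (int docId = 0; docId < reader.maxDoc(); docId++) {
      TermFreqVector termFreqVector = reader.getTermFreqVector(docId, field);
      if (termFreqVector == null) {
        // Instead of throwing IllegalStateException, warn and move on.
        log.warn("document {} has no term vector for field '{}'; skipping", docId, field);
        continue;
      }
      // ... convert termFreqVector into a Mahout Vector and emit it ...
    }
  }
}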

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
