[ https://issues.apache.org/jira/browse/MAHOUT-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ted Dunning reopened MAHOUT-675:
--------------------------------

I think we need to limit the amount of logging. Patch coming for review.

> LuceneIterator throws an IllegalStateException when a null TermFreqVector is
> encountered for a document instead of skipping to the next one
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-675
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-675
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Utils
>            Reporter: Chris Jordan
>             Fix For: 0.5
>
>         Attachments: MAHOUT-675, MAHOUT-675-1, MAHOUT-675.patch
>
>
> The org.apache.mahout.utils.vectors.lucene.LuceneIterator currently throws an
> IllegalStateException in its computeNext() method when it encounters a document
> with a null term frequency vector for the target field. That is problematic for
> people developing text mining applications on top of Lucene, because it forces
> them to verify that every document they add to their Lucene indexes actually
> has terms for the target field. While that check may sound reasonable, it is
> not practical.
> In most cases, Lucene applies an analyzer to a field as the document is added
> to the index. The StandardAnalyzer is fairly lenient and removes very few
> terms. To get better text mining performance, though, you will usually create
> your own custom analyzer. For example, in my current work on document
> clustering, in order to generate tighter clusters and more human-readable top
> terms, I am using a stop word list specific to my subject domain and filtering
> out terms that contain numbers. The net result is that some of my documents
> have no terms for the target field, which is a desirable outcome. When I
> attempt to dump the Lucene vectors, though, I encounter an
> IllegalStateException because of those documents.
> Now it is possible for me to check the TokenStream of the target field before
> I insert a document into my index. However, if we were to follow that
> approach, I would have to perform this check in each of my applications. That
> isn't great practice: someone could be experimenting with custom analyzers to
> improve text mining performance and then encounter this exception without any
> real indication that it was caused by the custom analyzer.
> I believe a better approach is to log a warning with the field id of the
> problem document and then skip to the next one. That way, the warning will be
> in the logs and the Lucene vector dump process will not halt.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
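The skip-and-warn behavior proposed above can be sketched in plain Java. This is a simplified, self-contained illustration, not Mahout's actual LuceneIterator: the `Doc` and `SkippingVectorIterator` classes are hypothetical stand-ins, with `double[]` standing in for a term frequency vector.

```java
import java.util.*;
import java.util.logging.Logger;

// Hypothetical sketch of the proposed fix: instead of throwing an
// IllegalStateException when a document's term frequency vector is null,
// log a warning and advance to the next document.
public class SkippingVectorIterator implements Iterator<double[]> {
    private static final Logger LOG =
        Logger.getLogger(SkippingVectorIterator.class.getName());

    // Stand-in for an indexed document: an id plus a possibly-null vector.
    public static final class Doc {
        final int id;
        final double[] termFreqVector; // null when the analyzer removed every term
        public Doc(int id, double[] v) { this.id = id; this.termFreqVector = v; }
    }

    private final Iterator<Doc> docs;
    private double[] next;

    public SkippingVectorIterator(List<Doc> docs) {
        this.docs = docs.iterator();
        advance();
    }

    // Analogue of computeNext(): skip documents with null vectors,
    // logging one warning per skipped document rather than halting.
    private void advance() {
        next = null;
        while (docs.hasNext()) {
            Doc d = docs.next();
            if (d.termFreqVector == null) {
                LOG.warning("Document " + d.id
                    + " has no term vector for the target field; skipping");
                continue;
            }
            next = d.termFreqVector;
            return;
        }
    }

    @Override public boolean hasNext() { return next != null; }

    @Override public double[] next() {
        if (next == null) throw new NoSuchElementException();
        double[] r = next;
        advance();
        return r;
    }
}
```

With this approach, a dump over three documents where the middle one lost all its terms to the custom analyzer yields two vectors and one warning in the logs, instead of an aborted run.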