[ https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661758#comment-13661758 ]
Lance Norskog commented on LUCENE-2899:
---------------------------------------

I'm updating the patches for 4.x and trunk. Kai's fix works. The unit tests did not attempt to analyse text longer than the fixed-size temp buffer, so the code that copies successive buffers was never exercised. Kai's fix handles this problem, and I've added a unit test for it.

Em: the Lucene Tokenizer lifecycle is that the Tokenizer is created with a Reader, and each call to incrementToken() walks the input. When incrementToken() returns false, that is all: the Tokenizer is finished. A TokenStream can be 'stateful': with OpenNLPFilter, you call incrementToken() until it returns false, and then you can call reset() and it will start over from the beginning. The unit tests include a check that reset() works. The changes you made support a feature that Lucene does not support, and they also break most of the unit tests. Please create a unit test that shows the bug, and fix the existing unit tests. No unit test = no bug report.

I'm posting a patch for the current 4.x and trunk. It includes some changes to the TokenStream/TokenFilter method signatures, some refactoring in the unit tests, a little tightening in the Tokenizer & Filter, and Kai's fix. There are unit tests for the problem Kai found, and also a test that has the TokenizerFactory create multiple Tokenizer streams. If there is a bug in this patch, please write a unit test which demonstrates it.

The patch is called LUCENE-2899-current.patch. It is tested against the current 4.x branch and the current trunk.

Thanks for your interest and hard work - I know it is really tedious to understand this code :)

Lance Norskog

> Add OpenNLP Analysis capabilities as a module
> ---------------------------------------------
>
>                 Key: LUCENE-2899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2899
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 4.4
>
>         Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899-RJN.patch, OpenNLPFilter.java, OpenNLPTokenizer.java, opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice to have a submodule (under analysis) that exposed capabilities for it. Drew Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under: modules/analysis/opennlp
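
The fixed-size temp buffer problem described in the comment is the usual pattern of draining a Reader in chunks. The sketch below is only a generic illustration of that pattern, not code from the patch; the class, constant, and method names (ReaderDrain, TEMP_SIZE, fillBuffer) are invented for the example.

    import java.io.IOException;
    import java.io.Reader;

    // Generic illustration of reading an arbitrarily long Reader through a
    // fixed-size temp buffer; the names TEMP_SIZE and fillBuffer are invented
    // for this example and do not come from the patch.
    final class ReaderDrain {
        private static final int TEMP_SIZE = 1024;   // fixed-size temp buffer

        static String fillBuffer(Reader input) throws IOException {
            char[] temp = new char[TEMP_SIZE];
            StringBuilder full = new StringBuilder();
            int read;
            // Each pass copies one temp buffer's worth of characters; a bug in
            // this copy logic only shows up when the input is longer than
            // TEMP_SIZE, which is why a unit test needs input longer than the buffer.
            while ((read = input.read(temp)) != -1) {
                full.append(temp, 0, read);
            }
            return full.toString();
        }
    }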
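
The Tokenizer/TokenStream lifecycle described in the comment corresponds to the standard Lucene 4.x consumer loop sketched below. This is a rough sketch of the general API usage, not code from OpenNLPTokenizer or OpenNLPFilter; the analyzer, the "field" name, and the text argument are placeholders.

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Rough sketch of the standard Lucene 4.x consumer loop; "field" and the
    // analyzer/text arguments are placeholders, not names from the patch.
    final class TokenDump {
        static void dumpTokens(Analyzer analyzer, String text) throws Exception {
            TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            try {
                stream.reset();                    // prepare the stream before consuming
                while (stream.incrementToken()) {  // walk the input token by token
                    System.out.println(term.toString());
                }
                stream.end();                      // record end-of-stream state
            } finally {
                stream.close();
            }
        }
    }

Per the comment, a consumer of OpenNLPFilter could additionally call reset() after the loop finishes and walk the tokens again from the beginning, which the unit tests check; the general TokenStream contract does not promise that behavior.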