[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661758#comment-13661758
 ] 

Lance Norskog commented on LUCENE-2899:
---------------------------------------

I'm updating the patches for 4.x and trunk. Kai's fix works. The unit tests did 
not attempt to analyse text longer than the fixed-size temp buffer, so the code 
for copying successive buffers was never exercised. Kai's fix handles this 
problem. I've added a unit test. 
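
For anyone following along, here is a rough sketch of the kind of code path 
that went untested. It is not the code from the patch: the class name, method 
name, and buffer size are made up for illustration, and it only shows the 
general pattern of draining a Reader through a fixed-size temp buffer when the 
input is longer than one buffer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only; not part of LUCENE-2899.
final class BufferedTextReader {
    // Deliberately tiny so that even short inputs span several buffers.
    static final int BUFFER_SIZE = 8;

    /** Reads all text from the reader, appending each chunk as it arrives. */
    static String readFully(Reader input) {
        char[] temp = new char[BUFFER_SIZE];
        StringBuilder text = new StringBuilder();
        try {
            int count;
            // read() returns -1 at end of input and may fill only part of temp.
            while ((count = input.read(temp, 0, temp.length)) != -1) {
                // Append only the characters actually read on this pass.
                text.append(temp, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String longText = "This sentence is much longer than eight characters.";
        String result = readFully(new StringReader(longText));
        // True when no characters were lost across buffer boundaries.
        System.out.println(result.equals(longText));
    }
}
```

A test that only feeds inputs shorter than BUFFER_SIZE never runs the loop body 
a second time, which is how a bug in the multi-buffer path can go unnoticed.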

Em: the Lucene Tokenizer lifecycle is that the Tokenizer is created with a 
Reader, and each call to incrementToken() walks the input. When 
incrementToken() returns false, the Tokenizer is finished. A TokenStream can 
support a 'stateful' token stream: with OpenNLPFilter, you call 
incrementToken() until it returns false, and then you can call reset() and it 
will start over from the beginning. The unit tests include a check that reset() 
works. The changes you made support a feature that is not supported by Lucene. 
Also, the changes break most of the unit tests. Please create a unit test that 
shows the bug, and fix the existing unit tests. No unit test = no bug report.
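
To make the contract concrete, here is a simplified model of it in plain Java. 
These are not the Lucene classes: ReplayableTokenStream, term(), and drain() 
are hypothetical stand-ins, and whitespace splitting is used only to have 
something to iterate over. It shows the two behaviours described above: 
incrementToken() returns false once the input is exhausted, and reset() lets 
a stateful stream replay from the beginning.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the contract described above; not a Lucene class.
final class ReplayableTokenStream {
    private final List<String> tokens = new ArrayList<>();
    private int position = 0;
    private String current;

    ReplayableTokenStream(String text) {
        // Whitespace splitting is an assumption made purely for illustration.
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
    }

    /** Advances to the next token; returns false when there are no more. */
    boolean incrementToken() {
        if (position >= tokens.size()) {
            return false;
        }
        current = tokens.get(position++);
        return true;
    }

    /** The token made current by the last successful incrementToken(). */
    String term() {
        return current;
    }

    /** Rewinds so the next incrementToken() starts from the first token. */
    void reset() {
        position = 0;
    }

    /** Consumes the stream until incrementToken() returns false. */
    static List<String> drain(ReplayableTokenStream stream) {
        List<String> seen = new ArrayList<>();
        while (stream.incrementToken()) {
            seen.add(stream.term());
        }
        return seen;
    }

    public static void main(String[] args) {
        ReplayableTokenStream stream = new ReplayableTokenStream("one two three");
        List<String> first = drain(stream);
        stream.reset();
        List<String> second = drain(stream);
        // A reset() check of this kind compares the two passes.
        System.out.println(first.equals(second));
    }
}
```

A unit test for reset() would follow the same shape as main(): drain the 
stream, reset it, drain it again, and compare the two passes.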

I'm posting a patch for the current 4.x and trunk. It includes some changes for 
TokenStream/TokenFilter method signatures, some refactoring in the unit tests, 
a little tightening in the Tokenizer & Filter, and Kai's fix. There are unit 
tests for the problem Kai found, and also a test that has TokenizerFactory 
create multiple Tokenizer streams. If there is a bug in this patch, please 
write a unit test which demonstrates it.

The patch is called LUCENE-2899-current.patch. It is tested against the current 
4.x branch and the current trunk.

Thanks for your interest and hard work. I know it is really tedious to 
understand this code :)

Lance Norskog

                
> Add OpenNLP Analysis capabilities as a module
> ---------------------------------------------
>
>                 Key: LUCENE-2899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2899
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 4.4
>
>         Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899-RJN.patch, OpenNLPFilter.java, 
> OpenNLPTokenizer.java, opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp

