[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819813#comment-13819813
 ] 

Robert Muir commented on LUCENE-2899:
-------------------------------------

Just some thoughts:

I think it would be best to split out the different functionality here into 
subtasks for each piece, and figure out how each should best be integrated.

The current patch does strange things to work around an impedance mismatch in 
the design, such as the TokenFilter that consumes the entire analysis chain and 
then replays the whole thing back with POS or NER tags as payloads. Is it 
really necessary to give this thing more scope than a single sentence? 
Typically such tagging models (at least the ones I've worked with) are trained 
only within sentence scope.
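
To make the sentence-scope point concrete, here is a minimal plain-Java sketch 
of buffering one sentence at a time instead of the whole stream. It is not the 
Lucene TokenStream API and not code from the patch: SentenceScopedTagger, 
SentenceTagger and the punctuation-based sentence boundary are all assumptions 
made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Plain-Java illustration only; this is NOT the Lucene TokenStream API.
final class SentenceScopedTagger implements Iterator<String> {

    /** Hypothetical stand-in for a POS model: one tag per token of one sentence. */
    interface SentenceTagger {
        List<String> tag(List<String> sentence);
    }

    private final Iterator<String> input;
    private final SentenceTagger tagger;
    // Holds the tagged tokens of at most one sentence at a time.
    private final ArrayDeque<String> pending = new ArrayDeque<>();

    SentenceScopedTagger(Iterator<String> input, SentenceTagger tagger) {
        this.input = input;
        this.tagger = tagger;
    }

    @Override
    public boolean hasNext() {
        if (pending.isEmpty()) {
            fillFromNextSentence();
        }
        return !pending.isEmpty();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return pending.removeFirst();
    }

    // Assumption: sentence ends are marked by ".", "!" or "?" tokens.
    private static boolean isSentenceEnd(String token) {
        return token.equals(".") || token.equals("!") || token.equals("?");
    }

    // Pulls tokens from the input only up to the next sentence boundary,
    // tags that sentence, and queues the "token/TAG" results for emission.
    private void fillFromNextSentence() {
        List<String> sentence = new ArrayList<>();
        while (input.hasNext()) {
            String token = input.next();
            sentence.add(token);
            if (isSentenceEnd(token)) {
                break;
            }
        }
        if (sentence.isEmpty()) {
            return; // input exhausted
        }
        List<String> tags = tagger.tag(sentence);
        for (int i = 0; i < sentence.size(); i++) {
            pending.addLast(sentence.get(i) + "/" + tags.get(i));
        }
    }
}
```

The tagger only ever sees one sentence, and downstream consumers get tokens 
as soon as that sentence is tagged, so nothing has to replay the full stream.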

Also, payloads should not be used internally. Instead, things like 
TypeAttribute should be used for POS tags. If someone wants to filter out or 
keep certain POS tags, they can use existing components like TypeTokenFilter; 
if they want to index the type as a payload, they can use 
TypeAsPayloadTokenFilter, and so on.
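
As a rough illustration of why carrying the POS tag as the token type composes 
well, here is a plain-Java sketch of type-based filtering. It only mimics the 
idea behind TypeTokenFilter and does not use the Lucene classes; TypedToken, 
TypeFilterSketch and the tag names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; TypedToken is a hypothetical stand-in for a
// token whose type attribute carries the POS tag.
final class TypedToken {
    final String term;
    final String type;

    TypedToken(String term, String type) {
        this.term = term;
        this.type = type;
    }
}

final class TypeFilterSketch {
    /**
     * Keeps or drops tokens by type.
     * keepListed == true:  keep only tokens whose type is in `types`.
     * keepListed == false: drop tokens whose type is in `types`.
     */
    static List<TypedToken> filterByType(List<TypedToken> tokens,
                                         Set<String> types,
                                         boolean keepListed) {
        List<TypedToken> out = new ArrayList<>();
        for (TypedToken token : tokens) {
            if (types.contains(token.type) == keepListed) {
                out.add(token);
            }
        }
        return out;
    }
}
```

Once the tag lives in the type, the filter needs no knowledge of the tagger or 
of payload encoding; it just compares strings.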

While I can see POS tagging being useful inside the analysis chain, the NER 
case is much less clear. I think it's more important for NER to be integrated 
outside of the analysis chain, so that named entities/mentions can be faceted 
on, added to separate fields for search (likely with a different analysis chain 
for those fields), etc. So for Lucene that would mean an easier way to add 
these as facets; for Solr it probably makes more sense as an UpdateProcessor 
than as part of the analysis chain.
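
A minimal plain-Java sketch of the "outside the analysis chain" idea: run 
entity recognition on the raw field value before indexing and copy the 
mentions into separate fields. This is not the Solr UpdateProcessor or Lucene 
facet API; EntityFieldSketch, EntityRecognizer, Mention and the "entity_" 
field prefix are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; a document is modeled as field -> values.
final class EntityFieldSketch {

    /** Hypothetical recognized mention: surface text plus entity type. */
    static final class Mention {
        final String text;
        final String entityType;

        Mention(String text, String entityType) {
            this.text = text;
            this.entityType = entityType;
        }
    }

    /** Hypothetical stand-in for an NER model applied to raw text. */
    interface EntityRecognizer {
        List<Mention> find(String text);
    }

    /**
     * Returns a copy of `doc` with one extra field per entity type
     * (for example "entity_person") holding the mentions found in
     * `sourceField`. The input document is left unmodified.
     */
    static Map<String, List<String>> addEntityFields(Map<String, List<String>> doc,
                                                     String sourceField,
                                                     EntityRecognizer ner) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : doc.entrySet()) {
            out.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        List<String> texts = doc.get(sourceField);
        if (texts == null) {
            return out; // nothing to extract from
        }
        for (String text : texts) {
            for (Mention m : ner.find(text)) {
                out.computeIfAbsent("entity_" + m.entityType, k -> new ArrayList<>())
                   .add(m.text);
            }
        }
        return out;
    }
}
```

The new fields can then be faceted on or searched with their own analysis 
chain, independent of how the source text is tokenized.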

Finally: I'm confused as to what benefit we get from using OpenNLP directly, 
versus integrating with it via opennlp-uima. Our UIMA integration at various 
levels (analysis chain/update processor) is already there, so I'm just 
wondering if that's a much shorter path.


> Add OpenNLP Analysis capabilities as a module
> ---------------------------------------------
>
>                 Key: LUCENE-2899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2899
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 4.6
>
>         Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch, 
> OpenNLPFilter.java, OpenNLPTokenizer.java
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp



--
This message was sent by Atlassian JIRA
(v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
