[ https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819813#comment-13819813 ]
Robert Muir commented on LUCENE-2899:
-------------------------------------

Just some thoughts:

I think it would be best to split the different functionality here into subtasks, one per piece, and figure out how each should best be integrated.

The current patch does strange things to work around an impedance mismatch caused by the design here, such as the TokenFilter that consumes the entire analysis chain and then replays the whole thing back with POS or NER as payloads. Is it really necessary to give this thing more scope than a single sentence? Typically such tagging models (at least the ones I've worked with) are trained only within sentence scope.

Also, payloads should not be used internally. Instead, things like TypeAttribute should be used for POS tags. If someone wants to filter out certain POS tags or keep only certain ones, they can use existing components like TypeTokenFilter; if they want to index the type as a payload, they can use TypeAsPayloadTokenFilter; and so on.

While I can see POS tagging being useful inside the analysis chain, the NER case is much less clear. I think it's more important for NER to be integrated outside the analysis chain, so that named entities/mentions can be faceted on, added to separate fields for search (likely with a different analysis chain for that), etc. For Lucene, that would be an easier way to add these as facets; for Solr, it probably makes more sense as an UpdateProcessor than as part of the analysis chain.

Finally, I'm confused about what benefit we get from using OpenNLP directly versus integrating with it via opennlp-uima. Our UIMA integration at various levels (analysis chain/update processor) is already there, so I'm just wondering whether that's a much shorter path.
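
To make the suggestion above concrete, here is a minimal plain-Java sketch of the idea: tag one sentence at a time, carry the POS tag as the token's type, and derive filtering or payloads from that type only when asked. It deliberately does not use Lucene or OpenNLP classes. `Token`, `SentenceTagger`, `TOY_TAGGER`, `dropTypes` and `typeAsPayload` are hypothetical names invented for illustration; they only model the roles that TypeAttribute, TypeTokenFilter and TypeAsPayloadTokenFilter play in the comment, and the toy lexicon stands in for a real tagging model.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java model of the design, NOT Lucene's or OpenNLP's API.
final class PosTypeSketch {

    // Stand-in for a term plus its type attribute (here: the POS tag).
    static final class Token {
        final String term;
        final String type;
        Token(String term, String type) { this.term = term; this.type = type; }
    }

    // Hypothetical sentence-scoped tagger: one tag per input token.
    interface SentenceTagger {
        String[] tag(String[] sentenceTokens);
    }

    // Toy lexicon used only for this example; unknown words default to "NN".
    static final Map<String, String> TOY_LEXICON = new HashMap<>();
    static {
        TOY_LEXICON.put("the", "DT");
        TOY_LEXICON.put("dog", "NN");
        TOY_LEXICON.put("barks", "VBZ");
    }

    static final SentenceTagger TOY_TAGGER = sentence -> {
        String[] tags = new String[sentence.length];
        for (int i = 0; i < sentence.length; i++) {
            tags[i] = TOY_LEXICON.getOrDefault(sentence[i], "NN");
        }
        return tags;
    };

    // Tags a single sentence; nothing beyond that sentence is buffered.
    static List<Token> tagSentence(String[] sentence, SentenceTagger tagger) {
        String[] tags = tagger.tag(sentence);
        if (tags.length != sentence.length) {
            throw new IllegalStateException("tagger must return one tag per token");
        }
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < sentence.length; i++) {
            out.add(new Token(sentence[i], tags[i]));
        }
        return out;
    }

    // Analogue of type-based filtering: drop tokens whose type is listed.
    static List<Token> dropTypes(List<Token> tokens, Set<String> stopTypes) {
        List<Token> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (!stopTypes.contains(t.type)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Analogue of copying the type into a payload, done only on request.
    static byte[] typeAsPayload(Token t) {
        return t.type.getBytes(StandardCharsets.UTF_8);
    }

    static List<String> terms(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            out.add(t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tagged = tagSentence(new String[] {"the", "dog", "barks"}, TOY_TAGGER);
        List<Token> kept = dropTypes(tagged, new HashSet<>(Arrays.asList("DT")));
        System.out.println(terms(kept)); // prints [dog, barks]
    }
}
```

The point of the sketch is the separation of concerns: the tagger only sets a type, and whether that type is used for filtering or written out as a payload is decided by a later, independent step.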
> Add OpenNLP Analysis capabilities as a module
> ---------------------------------------------
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 4.6
>
> Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch, OpenNLPFilter.java, OpenNLPTokenizer.java
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice to have a submodule (under analysis) that exposed capabilities for it. Drew Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp

--
This message was sent by Atlassian JIRA (v6.1#6144)