[ https://issues.apache.org/jira/browse/LUCENE-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215434#comment-17215434 ]
Gus Heck edited comment on LUCENE-9575 at 10/16/20, 3:15 PM:
-------------------------------------------------------------

Yeah, I looked at our FST-based regex class, but as you say, no group tracking, which was critical. I had somewhat hoped that the performance of a non-FST list of regexes would force me to learn all the nitty-gritty of FSTs and do something really nifty to add group support, but the ingest for the customer (involving ~25 regexes) didn't seem to be limited by the analysis, so there was no justifying that work... optimize later. Also, no, not across multiple tokens; again, more than the customer needed, but a valid enhancement.

> Add PatternTypingFilter
> -----------------------
>
>                 Key: LUCENE-9575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9575
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Gus Heck
>            Assignee: Gus Heck
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> One of the key asks when the Library of Congress was asking me to develop the
> Advanced Query Parser was to be able to recognize arbitrary patterns that
> include punctuation, such as POW/MIA or 401(k) or C++. Additionally, they
> wanted 401k and 401(k) to match documents with either style of reference, and
> NOT match documents that happen to have isolated 401 or k tokens (i.e., not
> documents about the HTTP status code). And of course we wanted to give up as
> little as possible of the text analysis features they were already using.
> This filter, in conjunction with the filters from LUCENE-9572, LUCENE-9574,
> and one Solr-specific filter in SOLR-14597 that re-analyzes tokens with an
> arbitrary analyzer defined for a type in the Solr schema, combines to achieve
> this.
>
> This filter has the job of spotting the patterns and adding the intended
> synonym as a type to the token (from which minimal punctuation has been
> removed). It also sets flags on the token which are retained through the
> analysis chain; at the very end the type is converted to a synonym and the
> original token(s) for that type are dropped, avoiding the match on 401 (for
> example).
>
> The pattern matching is specified in a file that looks like:
> {code}
> 2 (\d+)\(?([a-z])\)? ::: legal2_$1_$2
> 2 (\d+)\(?([a-z])\)?\(?(\d+)\)? ::: legal3_$1_$2_$3
> 2 C\+\+ ::: c_plus_plus
> {code}
> That file would match legal reference patterns such as 401(k), 401k, 501(c)3,
> and C++. The format is:
> <flagsInt> <pattern> ::: <replacement>
> and groups in the pattern are substituted into the replacement, so the first
> line above would create synonyms such as:
> {code}
> 401k --> legal2_401_k
> 401(k) --> legal2_401_k
> 503(c) --> legal2_503_c
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
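As a rough illustration of the rule format above (this is a hypothetical sketch, not the actual PatternTypingFilter source), a `<flagsInt> <pattern> ::: <replacement>` line can be applied to a token with standard `java.util.regex` group substitution, where `$1`, `$2`, ... in the replacement are filled from the pattern's capture groups. The class and method names here (`PatternRuleSketch`, `applyRule`) are made up for the example, and the flags integer is parsed off but otherwise ignored:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: apply one "<flagsInt> <pattern> ::: <replacement>"
// rule to a token, producing the synonym/type string (or null on no match).
public class PatternRuleSketch {
    static String applyRule(String rule, String token) {
        // Strip the leading flags integer (unused in this sketch).
        int firstSpace = rule.indexOf(' ');
        // Split the remainder into pattern and replacement on " ::: ".
        String[] patAndRepl = rule.substring(firstSpace + 1).split(" ::: ", 2);
        Matcher m = Pattern.compile(patAndRepl[0]).matcher(token);
        // Only whole-token matches produce a synonym; replaceAll substitutes
        // $1, $2, ... in the replacement with the captured groups.
        return m.matches() ? m.replaceAll(patAndRepl[1]) : null;
    }

    public static void main(String[] args) {
        String rule = "2 (\\d+)\\(?([a-z])\\)? ::: legal2_$1_$2";
        System.out.println(applyRule(rule, "401k"));   // legal2_401_k
        System.out.println(applyRule(rule, "401(k)")); // legal2_401_k
        System.out.println(applyRule(rule, "401"));    // null (no isolated-401 match)
    }
}
```

This mirrors why 401k and 401(k) both normalize to `legal2_401_k` while a bare 401 produces nothing: the optional `\(? ... \)?` absorbs the punctuation variants, but the `([a-z])` group is required for the rule to fire.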