Gus Heck created LUCENE-9575:
--------------------------------

             Summary: Add PatternTypingFilter
                 Key: LUCENE-9575
                 URL: https://issues.apache.org/jira/browse/LUCENE-9575
             Project: Lucene - Core
          Issue Type: New Feature
          Components: modules/analysis
            Reporter: Gus Heck
            Assignee: Gus Heck


One of the key requests from the Library of Congress, when they asked me to 
develop the Advanced Query Parser, was the ability to recognize arbitrary 
patterns that include punctuation, such as POW/MIA, 401(k), or C++. 
Additionally, they wanted 401k and 401(k) to match documents with either style 
of reference, and NOT to match documents that happen to contain isolated 401 or 
k tokens (i.e. not documents about the HTTP status code). And of course we 
wanted to give up as few as possible of the text analysis features they were 
already using.

This filter, in conjunction with the filters from LUCENE-9572 and LUCENE-9574 
and one Solr-specific filter in SOLR-14597 that re-analyzes tokens with an 
arbitrary analyzer defined for a type in the Solr schema, achieves this.

This filter has the job of spotting the patterns and adding the intended 
synonym as a type on the token (from which minimal punctuation has been 
removed). It also sets flags on the token, which are retained through the 
analysis chain. At the very end of the chain, the type is converted to a 
synonym and the original token(s) for that type are dropped, avoiding the match 
on 401 (for example).

The pattern matching is specified in a file that looks like: 
{code}
2 (\d+)\(?([a-z])\)? ::: legal2_$1_$2
2 (\d+)\(?([a-z])\)?\(?(\d+)\)? ::: legal3_$1_$2_$3
2 C\+\+ ::: c_plus_plus
{code}

That file would match legal reference patterns such as 401(k), 401k, and 
501(c)3, as well as C++. The format is:

<flagsInt> <pattern> ::: <replacement>

Groups in the pattern are substituted into the replacement, so the first 
line above would create synonyms such as:

{code}
401k   --> legal2_401_k
401(k) --> legal2_401_k
503(c) --> legal2_503_c
{code}
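
To make the rule format concrete, here is a minimal standalone sketch in plain Java. It is not the filter's source and does not use any Lucene classes. The class and method names (PatternTypingSketch, parseRule, parseRules, typeFor) are hypothetical. It also assumes two behaviours that the description above does not spell out: a pattern must match the whole token, and the first matching rule wins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: NOT the PatternTypingFilter source. All names are hypothetical.
final class PatternTypingSketch {

    // The three rule lines from the example file above.
    static final String EXAMPLE_RULES =
        "2 (\\d+)\\(?([a-z])\\)? ::: legal2_$1_$2\n"
      + "2 (\\d+)\\(?([a-z])\\)?\\(?(\\d+)\\)? ::: legal3_$1_$2_$3\n"
      + "2 C\\+\\+ ::: c_plus_plus\n";

    static final String SEPARATOR = " ::: ";

    /** One parsed rule line: <flagsInt> <pattern> ::: <replacement>. */
    static final class Rule {
        final int flags;
        final Pattern pattern;
        final String template;

        Rule(int flags, Pattern pattern, String template) {
            this.flags = flags;
            this.pattern = pattern;
            this.template = template;
        }
    }

    /** Splits one line into flags (before the first space), pattern, and replacement. */
    static Rule parseRule(String line) {
        int firstSpace = line.indexOf(' ');
        int sep = line.indexOf(SEPARATOR);
        if (firstSpace < 0 || sep <= firstSpace) {
            throw new IllegalArgumentException("Malformed rule: " + line);
        }
        int flags = Integer.parseInt(line.substring(0, firstSpace));
        Pattern pattern = Pattern.compile(line.substring(firstSpace + 1, sep));
        String template = line.substring(sep + SEPARATOR.length());
        return new Rule(flags, pattern, template);
    }

    /** Parses every non-blank line of a rules file. */
    static List<Rule> parseRules(String contents) {
        List<Rule> rules = new ArrayList<>();
        for (String line : contents.split("\\R")) {
            if (!line.trim().isEmpty()) {
                rules.add(parseRule(line));
            }
        }
        return rules;
    }

    /**
     * Returns the type for a token, or empty if no rule applies.
     * Assumptions: whole-token match, first matching rule wins.
     */
    static Optional<String> typeFor(List<Rule> rules, String token) {
        for (Rule rule : rules) {
            Matcher m = rule.pattern.matcher(token);
            if (m.matches()) {
                // replaceFirst re-runs the search and expands $1, $2, ... from the groups
                return Optional.of(m.replaceFirst(rule.template));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Rule> rules = parseRules(EXAMPLE_RULES);
        String[] tokens = {"401k", "401(k)", "503(c)", "501(c)3", "C++", "401", "k"};
        for (String token : tokens) {
            System.out.println(token + " --> " + typeFor(rules, token).orElse("(no type assigned)"));
        }
    }
}
```

Under those assumptions, 401k and 401(k) both map to legal2_401_k, 501(c)3 maps to legal3_501_c_3, and C++ maps to c_plus_plus, while an isolated 401 or k gets no type. The sketch parses the flags value but does not otherwise use it. In the real filter the flags are set on the token, as described above.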




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

