[ https://issues.apache.org/jira/browse/LUCENE-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Sekiguchi updated LUCENE-5252:
-----------------------------------
    Attachment:     (was: LUCENE-5252_b4.patch)

> add NGramSynonymTokenizer
> -------------------------
>
>                 Key: LUCENE-5252
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5252
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: LUCENE-5252_4x.patch, LUCENE-5252_4x.patch,
>                      LUCENE-5252_4x.patch, LUCENE-5252_4x.patch
>
>
> I'd like to propose that we add another n-gram tokenizer that can process
> synonyms: NGramSynonymTokenizer. Note that in this ticket the gram size is
> fixed, i.e. minGramSize = maxGramSize.
>
> Today, I think we have the following problems when using SynonymFilter with
> NGramTokenizer. For the purpose of illustration, assume a synonym setting
> "ABC, DEFG" w/ expand=true and N = 2 (2-gram).
>
> # There is no consensus (I think :-) on how we should assign offsets to the
> generated synonym tokens DE, EF and FG when expanding the source tokens AB
> and BC.
> # If the query pattern looks like ABCY, it cannot be matched even if there is
> a document "…ABCY…" in the index when autoGeneratePhraseQueries is set to
> true, because there is no "CY" token (but "GY" is there) in the index.
>
> NGramSynonymTokenizer can solve these problems by behaving as follows.
>
> * NGramSynonymTokenizer reads the synonym settings (synonyms.txt) and does
> not tokenize registered words. e.g.
>
> ||source text||NGramTokenizer+SynonymFilter||NGramSynonymTokenizer||
> |ABC|AB/DE/BC/EF/FG|ABC/DEFG|
>
> * Immediately before and after a registered word, NGramSynonymTokenizer
> generates *extra* tokens w/ posInc=0. e.g.
>
> ||source text||NGramTokenizer+SynonymFilter||NGramSynonymTokenizer||
> |XYZABC123|XY/YZ/ZA/AB/DE/BC/EF/C1/FG/12/23|XY/YZ/Z/ABC/DEFG/1/12/23|
>
> In the above sample, "Z" and "1" are the extra tokens.
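The token sequences in the two tables above can be reproduced with a small standalone sketch. This is not the code from the attached patch and it does not use the Lucene API; the function name `ngram_synonym_tokens` and its parameters are hypothetical, chosen only for illustration. It assumes that registered words are matched longest-first, that extra tokens are the prefixes and suffixes shorter than N next to a registered word, and it returns token text only (offsets and position increments are not modeled). Edge cases such as plain runs shorter than N are not treated carefully.

```python
from typing import List, Sequence


def ngram_synonym_tokens(text: str,
                         groups: Sequence[Sequence[str]],
                         n: int = 2) -> List[str]:
    """Return token strings in emission order (hypothetical illustration)."""
    # Map each registered word to the other members of its synonym group,
    # mimicking expand=true. Empty strings are ignored.
    expansions = {w: [o for o in g if o and o != w]
                  for g in groups for w in g if w}
    words = sorted(expansions, key=len, reverse=True)  # assumption: longest first

    # Step 1: split the text into registered words and the plain runs between them.
    pieces = []  # list of (is_registered_word, string)
    plain, i = "", 0
    while i < len(text):
        hit = next((w for w in words if text.startswith(w, i)), None)
        if hit is None:
            plain += text[i]
            i += 1
            continue
        if plain:
            pieces.append((False, plain))
            plain = ""
        pieces.append((True, hit))
        i += len(hit)
    if plain:
        pieces.append((False, plain))

    # Step 2: emit tokens piece by piece.
    tokens: List[str] = []
    for idx, (is_word, s) in enumerate(pieces):
        if is_word:
            # Registered words are kept whole, followed by their synonyms.
            tokens.append(s)
            tokens.extend(expansions[s])
            continue
        # Plain runs are merged in step 1, so any neighbour of a plain run
        # is a registered word.
        after_word = idx > 0
        before_word = idx + 1 < len(pieces)
        if after_word:
            # Extra tokens: prefixes shorter than n (assumed order: short to long).
            tokens.extend(s[:k] for k in range(1, min(n, len(s) + 1)))
        # Regular fixed-size n-grams of the plain run.
        tokens.extend(s[k:k + n] for k in range(len(s) - n + 1))
        if before_word:
            # Extra tokens: suffixes shorter than n (assumed order: long to short).
            tokens.extend(s[-k:] for k in range(min(n - 1, len(s)), 0, -1))
    return tokens


if __name__ == "__main__":
    synonyms = [["ABC", "DEFG"]]
    print("/".join(ngram_synonym_tokens("ABC", synonyms)))        # ABC/DEFG
    print("/".join(ngram_synonym_tokens("XYZABC123", synonyms)))  # XY/YZ/Z/ABC/DEFG/1/12/23
```

With the synonym group "ABC, DEFG" and N = 2, the sketch yields the NGramSynonymTokenizer column of both tables. How the real tokenizer orders extra tokens for N > 2 is not stated in the description, so the ordering used here is only a guess.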
--
This message was sent by Atlassian JIRA
(v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org