[ 
https://issues.apache.org/jira/browse/LUCENE-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-5252:
-----------------------------------

    Description: 
I'd like to propose another n-gram tokenizer that can process synonyms: 
NGramSynonymTokenizer. Note that in this ticket, the gram size is fixed, 
i.e. minGramSize = maxGramSize.

Today, I think we have the following problems when using SynonymFilter with 
NGramTokenizer. 
For the purpose of illustration, assume a synonym setting "ABC, DEFG" w/ 
expand=true and N = 2 (2-gram).

# There is no consensus (I think :-) on how we should assign offsets to the 
generated synonym tokens DE, EF and FG when expanding the source tokens AB and BC.
# If the query pattern looks like ABCY, it cannot be matched even if there is a 
document "…ABCY…" in the index when autoGeneratePhraseQueries is set to true, 
because there is no "CY" token in the index (though "GY" is there).
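
Problem 2 can be reproduced with a few lines of plain Java. This is only an illustrative sketch, not Lucene code: `bigrams` is a stand-in for NGramTokenizer's 2-gram split, and the indexed-token list assumes the expanded pair "ABC"/"DEFG" from the example setting above.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramDemo {
    // Plain 2-gram split, the same thing NGramTokenizer does for N = 2
    static List<String> bigrams(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            out.add(s.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Index side: bigrams of "ABC" plus its expanded synonym "DEFG" (expand=true)
        List<String> indexed = new ArrayList<>(bigrams("ABC"));  // AB, BC
        indexed.addAll(bigrams("DEFG"));                         // DE, EF, FG

        // Query side: a phrase query for "ABCY" needs AB, BC and CY
        for (String gram : bigrams("ABCY")) {
            System.out.println(gram + " indexed: " + indexed.contains(gram));
        }
        // CY is absent, so the phrase query cannot match
    }
}
```
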

NGramSynonymTokenizer can solve these problems by behaving as follows.

* NGramSynonymTokenizer reads the synonym settings (synonyms.txt) and doesn't 
tokenize registered words, e.g.

||source text||NGramTokenizer+SynonymFilter||NGramSynonymTokenizer||
|ABC|AB/DE/BC/EF/FG|ABC/DEFG|

* Immediately before and after a registered word, NGramSynonymTokenizer generates 
*extra* tokens w/ posInc=0, e.g.

||source text||NGramTokenizer+SynonymFilter||NGramSynonymTokenizer||
|XYZABC123|XY/YZ/ZA/AB/DE/BC/EF/C1/FG/12/23|XY/YZ/Z/ABC/DEFG/1/12/23|

In the above sample, "Z" and "1" are the extra tokens.
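
Putting the two behaviors together, a simplified plain-Java sketch of the proposed tokenization might look like the following. It is not the patch's implementation: the `SYNONYMS` map is a hypothetical stand-in for synonyms.txt, and all offset/posInc attribute bookkeeping is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class NGramSynonymSketch {
    static final int N = 2;  // fixed gram size (minGramSize == maxGramSize)

    // Hypothetical stand-in for synonyms.txt: registered word -> its expansion
    static final Map<String, String> SYNONYMS = Map.of("ABC", "DEFG");

    // Plain N-gram split for an unregistered segment
    static void grams(String seg, List<String> out) {
        for (int i = 0; i + N <= seg.length(); i++) {
            out.add(seg.substring(i, i + N));
        }
    }

    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // find the next registered word at or after pos
            int next = text.length();
            String word = null;
            for (String w : SYNONYMS.keySet()) {
                int i = text.indexOf(w, pos);
                if (i >= 0 && i < next) {
                    next = i;
                    word = w;
                }
            }
            String seg = text.substring(pos, next);
            grams(seg, out);                    // ordinary N-grams, e.g. XY, YZ
            if (word == null) break;
            if (!seg.isEmpty()) {               // *extra* token before the word, e.g. Z
                out.add(seg.substring(Math.max(0, seg.length() - (N - 1))));
            }
            out.add(word);                      // registered word kept whole: ABC
            out.add(SYNONYMS.get(word));        // its synonym (posInc=0 in reality): DEFG
            pos = next + word.length();
            // *extra* token after the word, e.g. 1 (skipped if another word starts here)
            final int p = pos;
            int headEnd = Math.min(p + N - 1, text.length());
            if (headEnd > p && SYNONYMS.keySet().stream().noneMatch(w -> text.startsWith(w, p))) {
                out.add(text.substring(p, headEnd));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join("/", tokenize("XYZABC123")));
        // XY/YZ/Z/ABC/DEFG/1/12/23
        System.out.println(String.join("/", tokenize("ABC")));
        // ABC/DEFG
    }
}
```

Running it reproduces the right-hand column of both tables above: ABC/DEFG for input "ABC", and XY/YZ/Z/ABC/DEFG/1/12/23 for input "XYZABC123".
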


  was:
I'd like to propose another n-gram tokenizer that can process synonyms: 
NGramSynonymTokenizer. Note that in this ticket, the gram size is fixed, 
i.e. minGramSize = maxGramSize.

Today, I think we have the following problems when using SynonymFilter with 
NGramTokenizer. 
For the purpose of illustration, assume a synonym setting "ABC, DEFG" w/ 
expand=true and N = 2 (2-gram).

# There is no consensus (I think :-) on how we should assign offsets to the 
generated synonym tokens DE, EF and FG when expanding the source tokens AB and BC.
# If the query pattern looks like XABC or ABCY, it cannot be matched even if 
there is a document "…XABCY…" in the index when autoGeneratePhraseQueries is set 
to true, because there are no "XA" or "CY" tokens in the index.

NGramSynonymTokenizer can solve these problems by behaving as follows.

* NGramSynonymTokenizer reads the synonym settings (synonyms.txt) and doesn't 
tokenize registered words, e.g.

||source text||NGramTokenizer+SynonymFilter||NGramSynonymTokenizer||
|ABC|AB/DE/BC/EF/FG|ABC/DEFG|

* Immediately before and after a registered word, NGramSynonymTokenizer generates 
*extra* tokens w/ posInc=0, e.g.

||source text||NGramTokenizer+SynonymFilter||NGramSynonymTokenizer||
|XYZABC123|XY/YZ/ZA/AB/DE/BC/EF/C1/FG/12/23|XY/YZ/Z/ABC/DEFG/1/12/23|

In the above sample, "Z" and "1" are the extra tokens.



> add NGramSynonymTokenizer
> -------------------------
>
>                 Key: LUCENE-5252
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5252
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: LUCENE-5252_4x.patch, LUCENE-5252_4x.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
