[jira] [Updated] (LUCENE-5252) add NGramSynonymTokenizer

2013-10-18 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-5252:
---

Attachment: LUCENE-5252_b4.patch

New patch. For various reasons I have given up on supporting one-way synonyms 
in NGramSynonymTokenizer, so I removed the indexMode parameter in this patch.

 add NGramSynonymTokenizer
 -

 Key: LUCENE-5252
 URL: https://issues.apache.org/jira/browse/LUCENE-5252
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: LUCENE-5252_4x.patch, LUCENE-5252_4x.patch, 
 LUCENE-5252_4x.patch, LUCENE-5252_4x.patch


 I'd like to propose that we add another n-gram tokenizer which can process 
 synonyms: NGramSynonymTokenizer. Note that in this ticket, the gram size is 
 fixed, i.e. minGramSize = maxGramSize.
 Today, I think we have the following problems when using SynonymFilter with 
 NGramTokenizer.
 For purposes of illustration, assume a synonym setting ABC, DEFG w/ 
 expand=true and N = 2 (2-gram).
 # There is no consensus (I think :-) on how to assign offsets to the 
 generated synonym tokens DE, EF and FG when expanding the source tokens AB 
 and BC.
 # A query such as ABCY cannot be matched, even if a document containing 
 …ABCY… is in the index, when autoGeneratePhraseQueries is set to true, 
 because there is no CY token in the index (only GY).
 NGramSynonymTokenizer solves these problems in the following ways (see the 
 test sketch after this list).
 * NGramSynonymTokenizer reads the synonym settings (synonyms.txt) and does 
 not tokenize registered words, e.g.
 ||source text||NGramTokenizer+SynonymFilter||NGramSynonymTokenizer||
 |ABC|AB/DE/BC/EF/FG|ABC/DEFG|
 * Immediately before and after a registered word, NGramSynonymTokenizer 
 generates *extra* tokens w/ posInc=0, e.g.
 ||source text||NGramTokenizer+SynonymFilter||NGramSynonymTokenizer||
 |XYZABC123|XY/YZ/ZA/AB/DE/BC/EF/C1/FG/12/23|XY/YZ/Z/ABC/DEFG/1/12/23|
 In the sample above, Z and 1 are the extra tokens.
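 As a rough illustration of the intended behavior, a minimal test sketch is 
 below. The constructor NGramSynonymTokenizer(Reader, int, SynonymMap) is 
 hypothetical (the actual signature is whatever the attached patch defines); 
 SynonymMap.Builder and assertTokenStreamContents are standard Lucene 4.x 
 APIs.
 {code:java}
 import java.io.StringReader;

 import org.apache.lucene.analysis.BaseTokenStreamTestCase;
 import org.apache.lucene.analysis.Tokenizer;
 import org.apache.lucene.analysis.synonym.SynonymMap;
 import org.apache.lucene.util.CharsRef;

 public class TestNGramSynonymTokenizerSketch extends BaseTokenStreamTestCase {

   public void testRegisteredWordIsNotTokenized() throws Exception {
     // Register the two-way group "ABC, DEFG" (expand=true): each word
     // maps to the other, and includeOrig=true keeps the original.
     SynonymMap.Builder builder = new SynonymMap.Builder(true);
     builder.add(new CharsRef("ABC"), new CharsRef("DEFG"), true);
     builder.add(new CharsRef("DEFG"), new CharsRef("ABC"), true);
     SynonymMap synonyms = builder.build();

     // Hypothetical constructor: input reader, gram size n=2, synonym map.
     Tokenizer ts =
         new NGramSynonymTokenizer(new StringReader("XYZABC123"), 2, synonyms);

     // Expected tokens per the second table above; Z and 1 are the extra
     // tokens that the tokenizer emits with posInc=0 around "ABC".
     assertTokenStreamContents(ts,
         new String[] { "XY", "YZ", "Z", "ABC", "DEFG", "1", "12", "23" });
   }
 }
 {code}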






[jira] [Updated] (LUCENE-5252) add NGramSynonymTokenizer

2013-10-18 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-5252:
---

Attachment: (was: LUCENE-5252_b4.patch)

 add NGramSynonymTokenizer
 -

 Key: LUCENE-5252
 URL: https://issues.apache.org/jira/browse/LUCENE-5252
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: LUCENE-5252_4x.patch, LUCENE-5252_4x.patch, 
 LUCENE-5252_4x.patch, LUCENE-5252_4x.patch





[jira] [Updated] (LUCENE-5252) add NGramSynonymTokenizer

2013-10-18 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-5252:
---

Attachment: LUCENE-5252_4x.patch

Oops, replacing the previously misnamed attachment with this patch. Sorry for the noise.

 add NGramSynonymTokenizer
 -

 Key: LUCENE-5252
 URL: https://issues.apache.org/jira/browse/LUCENE-5252
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: LUCENE-5252_4x.patch, LUCENE-5252_4x.patch, 
 LUCENE-5252_4x.patch, LUCENE-5252_4x.patch, LUCENE-5252_4x.patch





[jira] [Updated] (LUCENE-5252) add NGramSynonymTokenizer

2013-10-15 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-5252:
---

Attachment: LUCENE-5252_4x.patch

Fixed code regarding one-way synonyms (aaa=bbb).
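
For reference, in the standard Solr/Lucene synonyms.txt format a one-way 
mapping is written with {{=>}}, while a comma-separated line forms a two-way 
group (expandable when expand=true):

{code}
# two-way group: with expand=true, each member also maps to the others
ABC, DEFG

# one-way mapping: aaa is rewritten to bbb, but bbb is left alone
aaa => bbb
{code}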

 add NGramSynonymTokenizer
 -

 Key: LUCENE-5252
 URL: https://issues.apache.org/jira/browse/LUCENE-5252
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: LUCENE-5252_4x.patch, LUCENE-5252_4x.patch, 
 LUCENE-5252_4x.patch, LUCENE-5252_4x.patch





[jira] [Updated] (LUCENE-5252) add NGramSynonymTokenizer

2013-10-11 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-5252:
---

Attachment: LUCENE-5252_4x.patch

Fix a bug regarding ignoreCase in the attached patch.

 add NGramSynonymTokenizer
 -

 Key: LUCENE-5252
 URL: https://issues.apache.org/jira/browse/LUCENE-5252
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: LUCENE-5252_4x.patch, LUCENE-5252_4x.patch, 
 LUCENE-5252_4x.patch





[jira] [Updated] (LUCENE-5252) add NGramSynonymTokenizer

2013-10-07 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-5252:
---

Description: 
I'd like to propose that we add another n-gram tokenizer which can process 
synonyms: NGramSynonymTokenizer. Note that in this ticket, the gram size is 
fixed, i.e. minGramSize = maxGramSize.

Today, I think we have the following problems when using SynonymFilter with 
NGramTokenizer.
For purposes of illustration, assume a synonym setting ABC, DEFG w/ 
expand=true and N = 2 (2-gram).

# There is no consensus (I think :-) on how to assign offsets to the 
generated synonym tokens DE, EF and FG when expanding the source tokens AB 
and BC.
# A query such as ABCY cannot be matched, even if a document containing 
…ABCY… is in the index, when autoGeneratePhraseQueries is set to true, 
because there is no CY token in the index (only GY).

NGramSynonymTokenizer solves these problems in the following ways.

* NGramSynonymTokenizer reads the synonym settings (synonyms.txt) and does 
not tokenize registered words, e.g.

||source text||NGramTokenizer+SynonymFilter||NGramSynonymTokenizer||
|ABC|AB/DE/BC/EF/FG|ABC/DEFG|

* Immediately before and after a registered word, NGramSynonymTokenizer 
generates *extra* tokens w/ posInc=0, e.g.

||source text||NGramTokenizer+SynonymFilter||NGramSynonymTokenizer||
|XYZABC123|XY/YZ/ZA/AB/DE/BC/EF/C1/FG/12/23|XY/YZ/Z/ABC/DEFG/1/12/23|

In the sample above, Z and 1 are the extra tokens.


  was:
I'd like to propose that we add another n-gram tokenizer which can process 
synonyms: NGramSynonymTokenizer. Note that in this ticket, the gram size is 
fixed, i.e. minGramSize = maxGramSize.

Today, I think we have the following problems when using SynonymFilter with 
NGramTokenizer.
For purposes of illustration, assume a synonym setting ABC, DEFG w/ 
expand=true and N = 2 (2-gram).

# There is no consensus (I think :-) on how to assign offsets to the 
generated synonym tokens DE, EF and FG when expanding the source tokens AB 
and BC.
# A query such as XABC or ABCY cannot be matched, even if a document 
containing …XABCY… is in the index, when autoGeneratePhraseQueries is set 
to true, because there are no XA or CY tokens in the index.

NGramSynonymTokenizer solves these problems in the following ways.

* NGramSynonymTokenizer reads the synonym settings (synonyms.txt) and does 
not tokenize registered words, e.g.

||source text||NGramTokenizer+SynonymFilter||NGramSynonymTokenizer||
|ABC|AB/DE/BC/EF/FG|ABC/DEFG|

* Immediately before and after a registered word, NGramSynonymTokenizer 
generates *extra* tokens w/ posInc=0, e.g.

||source text||NGramTokenizer+SynonymFilter||NGramSynonymTokenizer||
|XYZABC123|XY/YZ/ZA/AB/DE/BC/EF/C1/FG/12/23|XY/YZ/Z/ABC/DEFG/1/12/23|

In the sample above, Z and 1 are the extra tokens.



 add NGramSynonymTokenizer
 -

 Key: LUCENE-5252
 URL: https://issues.apache.org/jira/browse/LUCENE-5252
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: LUCENE-5252_4x.patch, LUCENE-5252_4x.patch





[jira] [Updated] (LUCENE-5252) add NGramSynonymTokenizer

2013-10-03 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-5252:
---

Attachment: LUCENE-5252_4x.patch

New patch that has tests.

Because the original tests were developed at RONDHUIT and cover not only 
NGramSynonymTokenizer but also the synonym dictionary, the attached tests may 
be somewhat redundant with respect to SynonymMap.

 add NGramSynonymTokenizer
 -

 Key: LUCENE-5252
 URL: https://issues.apache.org/jira/browse/LUCENE-5252
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: LUCENE-5252_4x.patch, LUCENE-5252_4x.patch





[jira] [Updated] (LUCENE-5252) add NGramSynonymTokenizer

2013-10-02 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-5252:
---

Attachment: LUCENE-5252_4x.patch

The draft patch, without tests.

When NGramSynonymTokenizer was originally developed at RONDHUIT, it used a 
double-array trie for the synonym dictionary.

I've tried to convert that code to Lucene's FST. As this is my first 
experience with FST, there may well be some inefficient code in here. 
Comments are welcome!
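
For anyone reviewing the FST usage, here is a minimal sketch of the canonical 
Lucene 4.x FST construction pattern (roughly the machinery that 
SynonymMap.Builder wraps); the words and output values are illustrative only:

{code:java}
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FstSketch {
  public static void main(String[] args) throws Exception {
    // Map each input word to a long output; inputs must be added to the
    // builder in sorted (byte) order.
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);

    String[] sortedWords = { "ABC", "DEFG" };
    long[] values = { 5, 7 };
    BytesRef scratchBytes = new BytesRef();
    IntsRef scratchInts = new IntsRef();
    for (int i = 0; i < sortedWords.length; i++) {
      scratchBytes.copyChars(sortedWords[i]);
      builder.add(Util.toIntsRef(scratchBytes, scratchInts), values[i]);
    }

    FST<Long> fst = builder.finish();
    Long value = Util.get(fst, new BytesRef("DEFG")); // -> 7
    System.out.println(value);
  }
}
{code}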

 add NGramSynonymTokenizer
 -

 Key: LUCENE-5252
 URL: https://issues.apache.org/jira/browse/LUCENE-5252
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: LUCENE-5252_4x.patch

