[jira] [Commented] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters

2013-05-09 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13652770#comment-13652770
 ] 

Jack Krupansky commented on LUCENE-3907:


Why not make the big changes in trunk/5.0, but leave the existing 
filters/tokenizers in 4.x as deprecated? Add the new leading-only replacements 
in 4.x as well, but be sure to preserve the existing classes - with support for 
back n-grams - in 4.x, as deprecated.



 Improve the Edge/NGramTokenizer/Filters
 ---

 Key: LUCENE-3907
 URL: https://issues.apache.org/jira/browse/LUCENE-3907
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Adrien Grand
  Labels: gsoc2013
 Fix For: 4.3

 Attachments: LUCENE-3907.patch


 Our ngram tokenizers/filters could use some love.  E.g., they output ngrams in 
 multiple passes, instead of stacked, which messes up offsets/positions and 
 requires too much buffering (it can hit OOME for long tokens).  The tokenizers 
 clip at 1024 chars but the token filters don't.  They split up surrogate 
 pairs incorrectly.
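
 As a minimal sketch (not part of the issue or any patch) of how to see these 
 position/offset problems for yourself, the following prints every gram a 4.x 
 chain emits together with its position increment and offsets. The class names 
 are the standard Lucene 4.x ones, but treat the exact constructor signatures 
 as approximate:

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.Version;

public class InspectNGrams {
  public static void main(String[] args) throws Exception {
    // Whitespace-tokenize the input, then split each token into 1- and 2-grams.
    TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_44, new StringReader("lucene"));
    ts = new NGramTokenFilter(Version.LUCENE_44, ts, 1, 2); // minGram=1, maxGram=2

    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
    OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);

    ts.reset();
    while (ts.incrementToken()) {
      // One line per gram: term text, position increment, start/end offsets.
      System.out.println(term + " posIncr=" + posIncr.getPositionIncrement()
          + " offsets=[" + offsets.startOffset() + "," + offsets.endOffset() + "]");
    }
    ts.end();
    ts.close();
  }
}
{code}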




[jira] [Commented] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters

2013-05-09 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13652878#comment-13652878
 ] 

Adrien Grand commented on LUCENE-3907:
--

The previous behaviour could trigger highlighting bugs, so I think it is 
important that we fix it in 4.x. If the broken behaviour is still needed, it 
can be emulated by providing Version.LUCENE_43 as the Lucene match version.
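
For instance, here is a sketch (not taken from the patch; the deprecated 
Side-based constructor is assumed, and exact 4.x signatures may differ 
slightly) of an analyzer that keeps the old behaviour by passing 
Version.LUCENE_43 to the filter:

{code:java}
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.util.Version;

public class LegacyEdgeNGramAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_44, reader);
    // Passing LUCENE_43 as the match version emulates the pre-fix (broken)
    // positions/offsets; passing LUCENE_44 opts into the fixed behaviour.
    TokenStream grams = new EdgeNGramTokenFilter(Version.LUCENE_43, source,
        EdgeNGramTokenFilter.Side.FRONT, 1, 3); // minGram=1, maxGram=3
    return new TokenStreamComponents(source, grams);
  }
}
{code}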





[jira] [Commented] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters

2013-05-09 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13652885#comment-13652885
 ] 

Jack Krupansky commented on LUCENE-3907:


Look, the fix for the position bugs here is to keep the position the same for 
all tokens, right? And that logic can simply be applied to back n-graming as 
well, for the same reasons and with the same effect. So how could back 
n-graming - which would apply that same position logic - be a separate cause 
of highlighting bugs?

The previous behavior (incremented positions) is simply NOT linked to front vs. 
back. I'm not sure why you are claiming that it is!

The Jira record simply shows that some people want to eliminate a feature... 
not that the feature (if fixed in the same manner as the rest of the fix) 
could trigger highlighting bugs - unless I'm missing something, and if I am, 
it is because you are not stating it clearly! So, please do so.





[jira] [Commented] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters

2013-05-09 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13653011#comment-13653011
 ] 

Adrien Grand commented on LUCENE-3907:
--

bq. previous behavior (incremented position) is simply NOT linked to front 
vs. back. I'm not sure why you are claiming that it is!

Indeed these issues are unrelated, and backward n-graming doesn't cause 
highlighting issues. Sorry if I seemed to imply the opposite; that was not my 
intention.

My main motivation was to fix the positions/offsets bugs. I also deprecated 
support for backward n-graming since there seemed to be lazy consensus: as Uwe 
noted, backward n-graming can be obtained by applying ReverseStringFilter, then 
EdgeNGramTokenFilter and then ReverseStringFilter again. This helps make 
filters simpler, hence easier to understand and to test.

So here is how you would configure the filters, depending on whether you want 
front or back n-graming and whether you want the old or the new 
positions/offsets.

| | previous positions/offsets (broken) | new positions/offsets |
| front n-graming | EdgeNGramTokenFilter(version=LUCENE_43,side=FRONT) | EdgeNGramTokenFilter(version=LUCENE_44,side=FRONT) |
| back n-graming | EdgeNGramTokenFilter(version=LUCENE_43,side=BACK) | ReverseStringFilter, EdgeNGramTokenFilter(version=LUCENE_44,side=FRONT), ReverseStringFilter |

It is true that the patch prevents users from constructing an 
EdgeNGramTokenFilter with version=LUCENE_44 and side=BACK, in order to 
encourage them to upgrade their analysis chains. But if you think we should 
allow it, I'm open to discussion.
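
As a concrete sketch of the back n-graming row of the table above (not code 
from the patch; exact 4.x constructor signatures may vary slightly), the chain 
could be built like this:

{code:java}
import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.reverse.ReverseStringFilter;
import org.apache.lucene.util.Version;

public class BackNGramChain {
  // Reverse each token, take leading (front) grams, then reverse the grams
  // back so they read forward again (i.e. grams anchored at the end of the token).
  static TokenStream build(Reader reader) {
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_44, reader);
    TokenStream ts = new ReverseStringFilter(Version.LUCENE_44, source);
    ts = new EdgeNGramTokenFilter(Version.LUCENE_44, ts, 1, 3); // front-only in 4.4+
    return new ReverseStringFilter(Version.LUCENE_44, ts);
  }
}
{code}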





[jira] [Commented] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters

2013-05-07 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13650856#comment-13650856
 ] 

Adrien Grand commented on LUCENE-3907:
--

As Steve suggested, I think these tokenizers/filters need to be renamed (trunk 
only) since they no longer support backward n-graming. Please don't hesitate 
to let me know if you have a good idea for a name; otherwise I plan to rename 
them to Leading... in the next few days.





[jira] [Commented] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters

2013-05-06 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13650530#comment-13650530
 ] 

Uwe Schindler commented on LUCENE-3907:
---

Hi Adrien, thanks for the fixes. You can take the issue and assign it to yourself!

Uwe





[jira] [Commented] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters

2013-03-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13596002#comment-13596002
 ] 

Michael McCandless commented on LUCENE-3907:


I think we should remove the Side (BACK/FRONT) enum: an app can always use 
ReverseStringFilter if it really wants BACK grams (what are BACK grams used 
for?).





[jira] [Commented] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters

2013-03-07 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13596413#comment-13596413
 ] 

Uwe Schindler commented on LUCENE-3907:
---

bq. Back grams would work for leading wildcards. They might be useful for 
things where the head is at the end (tail-first?), like domain names.

If you need reverse n-grams, you could always add a filter to do that 
afterwards. There is no need to have that as separate logic in *this* filter. 
We should separate the logic and keep the filters as simple as possible.





[jira] [Commented] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters

2013-03-07 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13596419#comment-13596419
 ] 

Steve Rowe commented on LUCENE-3907:


Edge is the wrong name for something that only works on one edge.  Maybe rename 
to LeadingNgram? 





[jira] [Commented] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters

2012-03-29 Thread Reinardus Surya Pradhitya (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241303#comment-13241303
 ] 

Reinardus Surya Pradhitya commented on LUCENE-3907:
---

Hi,

I'm interested in this project. I did a Natural Language Processing project on 
language classification in which I handled tokenization using Stanford's NLP 
tools. I'm also currently doing an Information Retrieval project on document 
indexing and searching using Lucene and Weka. I may not be very familiar with 
Lucene's ngram tokenizer yet, but I have worked with n-grams and Lucene before, 
so I believe I would be able to learn quickly. Thanks :)

Best regards,
Reinardus Surya Pradhitya





[jira] [Commented] (LUCENE-3907) Improve the Edge/NGramTokenizer/Filters

2012-03-29 Thread Michael McCandless (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241357#comment-13241357
 ] 

Michael McCandless commented on LUCENE-3907:


Awesome!  We just need a possible mentor here... volunteers...?

