Positions in EdgeNgramTokenFilter

2013-03-01 Thread Walter Underwood
I'm fixing position increment in EdgeNgramTokenFilter to act like synonyms, with each ngram at the same position as the source token. Currently, the position is incremented for each output token, which breaks phrase searching with edge ngrams. I could not find a current Jira issue for this. Is
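
The difference between the two behaviors can be illustrated outside Lucene with a small, self-contained Python sketch. It is not the filter's code or the proposed patch; `edge_ngrams`, `positions`, `phrase_match`, and the `stacked` flag are hypothetical names used only to model position increments and a simple consecutive-position phrase check.

```python
def edge_ngrams(tokens, min_gram=1, max_gram=4, stacked=True):
    """Return (term, position_increment) pairs for front edge ngrams.

    stacked=True  models the synonym-like behavior: all grams of a token
                  share that token's position (increment 0 after the first).
    stacked=False models incrementing the position for every output gram.
    """
    out = []
    for token in tokens:
        first = True
        for n in range(min_gram, min(max_gram, len(token)) + 1):
            # The first gram of each token always advances one position.
            inc = 1 if (first or not stacked) else 0
            out.append((token[:n], inc))
            first = False
    return out


def positions(stream):
    """Turn position increments into absolute positions."""
    pos = -1
    result = []
    for term, inc in stream:
        pos += inc
        result.append((term, pos))
    return result


def phrase_match(indexed, phrase):
    """True if the phrase terms occur at consecutive positions."""
    by_term = {}
    for term, pos in indexed:
        by_term.setdefault(term, set()).add(pos)
    starts = by_term.get(phrase[0], set())
    return any(
        all(start + i in by_term.get(term, set()) for i, term in enumerate(phrase))
        for start in starts
    )


tokens = ["new", "york"]
stacked = positions(edge_ngrams(tokens, stacked=True))
incremented = positions(edge_ngrams(tokens, stacked=False))

print(stacked)
print(incremented)
print(phrase_match(stacked, ["new", "yo"]))      # phrase lines up
print(phrase_match(incremented, ["new", "yo"]))  # phrase does not line up
```

In the stacked case every gram of "new" sits at position 0 and every gram of "york" at position 1, so a phrase query built from a whole word plus a prefix still finds adjacent positions. In the incremented case the grams of one token spread over several positions, which is the phrase-search breakage described above.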

Re: Positions in EdgeNgramTokenFilter

2013-03-01 Thread Robert Muir
Walter, sounds very interesting. Maybe just use this issue: https://issues.apache.org/jira/browse/LUCENE-3907 ? On Fri, Mar 1, 2013 at 10:41 AM, Walter Underwood wun...@wunderwood.org wrote: I'm fixing position increment in EdgeNgramTokenFilter to act like synonyms, with each ngram at the same

Re: Positions in EdgeNgramTokenFilter

2013-03-01 Thread Walter Underwood
That is a pretty broad bug, but this fix is somewhere in "improve ngrams". Maybe a specific bug linked to that one? Incrementing positions might be the right thing for pure ngrams. wunder On Mar 1, 2013, at 11:02 AM, Robert Muir wrote: Walter, sounds very interesting. Maybe just use this

Re: Positions in EdgeNgramTokenFilter

2013-03-01 Thread Robert Muir
sure, you could just make a new issue and link it to that one if you like. thanks for looking at this! On Fri, Mar 1, 2013 at 2:15 PM, Walter Underwood wun...@wunderwood.org wrote: That is a pretty broad bug, but this fix is somewhere in improve ngrams. Maybe a specific bug linked to that one?

Re: Positions in EdgeNgramTokenFilter

2013-03-01 Thread Otis Gospodnetic
Wunder, you may be thinking of LUCENE-1224 from a few years ago? http://search-lucene.com/?q=ngramfc_project=Lucenefc_type=issue Otis -- Solr & ElasticSearch Support http://sematext.com/ On Fri, Mar 1, 2013 at 1:41 PM, Walter Underwood wun...@wunderwood.org wrote: I'm fixing position

Re: Positions in EdgeNgramTokenFilter

2013-03-01 Thread Walter Underwood
I can see an argument for NGramTokenFilter incrementing position for each ngram, because they really are an ordered scan across the text. Pure ngrams are a different text representation than words. That could be an option on the token filter. LUCENE-1224 is mostly concerned with ngram
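
As a rough illustration of the "option on the token filter" idea for pure ngrams, here is a minimal Python sketch. It does not use the Lucene API; `char_ngrams` and its `increment_per_gram` parameter are hypothetical names chosen for this example only.

```python
def char_ngrams(token, n=2, increment_per_gram=True):
    """Return (gram, position) pairs for all n-character grams of a token.

    increment_per_gram=True  treats the grams as an ordered scan across
                             the text, one position per gram.
    increment_per_gram=False stacks all grams at the same position.
    """
    grams = [token[i:i + n] for i in range(len(token) - n + 1)]
    result = []
    pos = -1
    for i, gram in enumerate(grams):
        # The first gram always takes a new position; later grams
        # advance only when the option is enabled.
        if i == 0 or increment_per_gram:
            pos += 1
        result.append((gram, pos))
    return result


print(char_ngrams("solr"))
print(char_ngrams("solr", increment_per_gram=False))
```

With the option on, the bigrams of "solr" occupy positions 0, 1, and 2 in scan order; with it off, they all share position 0, which matches the synonym-like behavior discussed for edge ngrams.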