Re: CJKBigramFilter - position bug with outputUnigrams?

Robert Muir Mon, 21 Apr 2014 11:49:28 -0700

I think you misunderstand what the filter does. It does not "output unigrams".


When you enable this option, the positions come from the
unigrams emitted by your tokenizer (StandardTokenizer or whatever),
and the filter just adds bigrams as synonyms to those. It cannot safely do
anything else.
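
To make the position behavior concrete, here is a minimal sketch in plain Java. It does not call Lucene; the class and method names are hypothetical, ASCII letters stand in for CJK characters, and positions are numbered from 0 for readability. It assumes the behavior described above: each unigram keeps the position the tokenizer gave it, and each bigram is stacked at the position of its first character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation, not Lucene code: models where unigrams and
// bigrams land when bigrams are added as synonyms of the unigrams.
public class BigramPositionSketch {

    // Returns "term@position" entries for one run of adjacent characters.
    // Assumes every char is a single BMP character (no surrogate pairs).
    public static List<String> unigramsAndBigrams(String run) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < run.length(); i++) {
            // The unigram keeps the tokenizer's position.
            out.add(run.charAt(i) + "@" + i);
            // The bigram starting here is stacked at the same position.
            if (i + 1 < run.length()) {
                out.add(run.substring(i, i + 2) + "@" + i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(unigramsAndBigrams("AB"));
        System.out.println(unigramsAndBigrams("ABCD"));
    }
}
```

Under these assumptions the two-character input yields A and AB at position 0 and B at position 1, so the last unigram always sits one position after the last bigram, which is the case raised in the quoted message.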

There can be only one "n".

On Mon, Apr 21, 2014 at 2:15 PM, Shawn Heisey <s...@elyograg.org> wrote:
> On 4/21/2014 11:19 AM, Shawn Heisey wrote:
>> Does the two-character case need to be treated differently here?  If so,
>> it is probably something that should be configurable.
>
> An even more general idea: output the last unigram in a sequence at the
> same position as the last bigram.  If necessary, make this configurable,
> or based on luceneMatchVersion.  This would also fix what I'm seeing
> with the two-character case.
>
> Thanks,
> Shawn
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

