Re: CJKBigramFilter - position bug with outputUnigrams?

Shawn Heisey Thu, 01 May 2014 20:45:13 -0700

On 4/21/2014 12:47 PM, Robert Muir wrote:
> I think you misunderstand what the filter does. It does not "output unigrams".
> 
> In the case you choose this option, the positions are from the
> unigrams emitted by your tokenizer (StandardTokenizer or whatever),
> and it just adds bigrams as synonyms to those. It cannot safely do
> anything else.
> 
> There can be only one "n".
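
For illustration, the position layout described in the quoted text can be
sketched in plain Java. This is only a toy model under stated assumptions,
not the filter's actual code: it uses no Lucene classes, the names
BigramPositionSketch and layout are hypothetical, and it assumes the
tokenizer has already produced one unigram per position. Each unigram
keeps its own position, and each bigram is stacked on the position of its
first character, spanning two positions.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical names, no Lucene dependency.
public class BigramPositionSketch {

    // Returns entries of the form "term@position/len<positionLength>".
    // Positions are 0-based and come from the incoming unigrams.
    static List<String> layout(List<String> unigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < unigrams.size(); i++) {
            // The unigram occupies its own position, one position wide.
            out.add(unigrams.get(i) + "@" + i + "/len1");
            // A bigram, when a next character exists, is added at the
            // same position (like a synonym) and spans two positions.
            if (i + 1 < unigrams.size()) {
                out.add(unigrams.get(i) + unigrams.get(i + 1) + "@" + i + "/len2");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String entry : layout(List.of("A", "B", "C"))) {
            System.out.println(entry);
        }
    }
}
```

In this model the final unigram always sits alone at the last position,
and a single-character input yields just that one unigram.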


I took a quick look at the code.  I'm sure it's easy to grasp once
you're really familiar with everything, but I'm having a hard time
decoding exactly how the filter works.  I don't have any more time to
plow through it tonight.

Would it be possible to implement an option with a name similar to
"lastUnigramAtPreviousPosition" so that I can optionally get the
behavior I'm after when the input is two or more characters, without
changing current behavior for anyone else?  This would completely solve
my current problem.

Thanks,
Shawn


