On 4/20/2014 6:20 PM, Benson Margulies wrote:
> Could I perhaps wonder why your customer is so intent on indexing
> ngrams? Why not use Kuromoji and index words?

The data is not just Japanese; it's a mixture of languages.  For text in
the Latin character set, StandardTokenizer and other similar tokenizers
do not work for us, mostly because of the way they handle punctuation.
ICUTokenizer with its default rule set wouldn't work either, but as
you'll see below, I've got a modified ruleset for Latin.
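For context, ICUTokenizerFactory accepts per-script rule overrides via its
rulefiles parameter, so Latin text can get custom break rules while other
scripts keep the defaults.  Roughly like this (the RBBI file name here is
just the stock example shipped with Lucene, not my actual file):

```xml
<!-- Override ICU break rules for the Latin script only.
     "Latin-break-only-on-whitespace.rbbi" is the sample rule file
     bundled with the ICU analysis module; substitute your own. -->
<tokenizer class="solr.ICUTokenizerFactory"
           rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
```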

The following is what I currently have for my analysis.  A lot of this
has evolved over the last few years on my other index that is primarily
English:

http://apaste.info/ypy

We may need to have a major overhaul of our analysis chain for this
customer.  Perhaps what we've learned in the past won't apply here.

Right now we have outputUnigrams enabled for both index and query.  This
solves the phrase query problem, but it produces matches that the
customer doesn't want.
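To be concrete, the bigram stage is configured roughly like this
(simplified sketch; the real chain in the paste above has more filters,
and the rule file name is a placeholder):

```xml
<fieldType name="text_mixed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Custom Latin rules; CJK text falls through to the defaults. -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:custom-latin.rbbi"/>
    <!-- outputUnigrams="true" emits single CJK characters alongside
         the bigrams, which fixes phrase queries but also lets a single
         character match inside unrelated words. -->
    <filter class="solr.CJKBigramFilterFactory" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

Turning outputUnigrams off makes the unwanted matches go away, but then
phrase queries break, which is the trade-off we're stuck on.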

Thanks,
Shawn

