On 4/20/2014 6:20 PM, Benson Margulies wrote:
> Could I perhaps wonder why your customer is so intent on indexing
> ngrams? Why not use Kuromoji and index words?
The data is not just Japanese; there is a mixture. For text in the
Latin character set, StandardTokenizer and other similar tokenizers do
not work for us, mostly because of the way they handle punctuation.
ICUTokenizer with its default rule set wouldn't work either, but as
you'll see below, I've got a modified rule set for Latin.

The following is what I currently have for my analysis. A lot of it
has evolved over the last few years on my other index, which is
primarily English:

http://apaste.info/ypy

We may need a major overhaul of our analysis chain for this customer;
perhaps what we've learned in the past won't apply here. Right now we
have outputUnigrams enabled for both index and query. This solves the
phrase query problem, but it causes matches that the customer doesn't
want.
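Roughly, the Latin side looks something like this (a simplified
sketch, not the exact chain from the paste; the field type name, the
.rbbi file name, and the folding filter are placeholders):

  <!-- The rulefiles attribute maps an ISO 15924 script code to a
       custom RBBI break-rules file, so only Latin-script text gets
       the modified tokenization; other scripts keep ICU defaults. -->
  <fieldType name="text_mixed" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory"
                 rulefiles="Latn:custom-latin-break.rbbi"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
  </fieldType>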
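For illustration, assuming the bigramming here is CJKBigramFilter (the
filter name and flags below are a sketch, not the exact line from the
paste):

  <!-- outputUnigrams="true" emits each CJK character as a
       single-character token in addition to the overlapping bigrams,
       and we apply it at both index and query time. -->
  <filter class="solr.CJKBigramFilterFactory"
          han="true" hiragana="true" katakana="true" hangul="true"
          outputUnigrams="true"/>

Because the unigrams are in the index, a single character can match
anywhere it occurs, not just as part of an adjacent pair, which is
exactly the kind of match the customer objects to.

Thanks,
Shawn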