Have you looked at commercial offerings? At some point it becomes an ROI question. If this is becoming such a serious issue:
http://www.basistech.com/text-analytics/rosette/base-linguistics/asian-languages/
Regards,
   Alex.

P.S. This is a link, not a recommendation; I haven't tested either their quality or their pricing.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Mon, Apr 21, 2014 at 8:50 AM, Shawn Heisey <s...@elyograg.org> wrote:
> On 4/20/2014 6:20 PM, Benson Margulies wrote:
>> Could I perhaps wonder why your customer is so intent on indexing
>> ngrams? Why not use Kuromoji and index words?
>
> The data is not just Japanese. There is a mixture. For text in the
> Latin character set, StandardTokenizer and other similar things do not
> work for us, mostly because of the way that they handle punctuation.
> ICUTokenizer with its default rule set wouldn't work either, but as
> you'll see below, I've got a modified ruleset for Latin.
>
> The following is what I currently have for my analysis. A lot of this
> has evolved over the last few years on my other index, which is
> primarily English:
>
> http://apaste.info/ypy
>
> We may need a major overhaul of our analysis chain for this
> customer. Perhaps what we've learned in the past won't apply here.
>
> Right now we have outputUnigrams enabled for both index and query. This
> solves the phrase query problem but causes things to match that the
> customer doesn't want to match.
>
> Thanks,
> Shawn
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
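
[Editor's note: the apaste.info link above has expired. For readers following the thread, here is a minimal sketch of the kind of field type Shawn describes — ICUTokenizer for mixed-script text plus CJKBigramFilter with outputUnigrams — using Solr's stock factories. The field name and attribute values are illustrative, not Shawn's actual configuration, and his custom Latin ruleset (passed via the tokenizer's rulefiles attribute) is omitted.]

```xml
<fieldType name="text_cjk_mixed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ICUTokenizer handles mixed scripts; a custom ruleset for Latin
         could be supplied via the rulefiles attribute -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- CJKBigramFilter turns runs of CJK characters into bigrams.
         outputUnigrams="true" also emits single-character tokens, which
         is what fixes phrase queries but broadens matching, as Shawn notes -->
    <filter class="solr.CJKBigramFilterFactory"
            han="true" hiragana="true" katakana="true" hangul="true"
            outputUnigrams="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```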