I have used Basistech linguistics in two products at two companies and they make high-quality software. At one point, I met with our Japanese partner, in Japan, and was able to make them comfortable with using Basistech instead of their own morphological package.
wunder

On Apr 20, 2014, at 7:16 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> Have you looked at commercial offerings? At some point, it becomes an
> ROI issue. If it is becoming such a serious issue:
> http://www.basistech.com/text-analytics/rosette/base-linguistics/asian-languages/
>
> Regards,
>    Alex.
> P.s. This is a link, not a recommendation. I haven't tested either
> their quality or their pricing
>
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
>
> On Mon, Apr 21, 2014 at 8:50 AM, Shawn Heisey <s...@elyograg.org> wrote:
>> On 4/20/2014 6:20 PM, Benson Margulies wrote:
>>> Could I perhaps wonder why your customer is so intent on indexing
>>> ngrams? Why not use Kuromoji and index words?
>>
>> The data is not just Japanese. There is a mixture. For text in the
>> Latin character set, StandardTokenizer and other similar things do not
>> work for us, mostly because of the way that they handle punctuation.
>> ICUTokenizer with its default rule set wouldn't work either, but as
>> you'll see below, I've got a modified ruleset for Latin.
>>
>> The following is what I currently have for my analysis. A lot of this
>> has evolved over the last few years on my other index that is primarily
>> English:
>>
>> http://apaste.info/ypy
>>
>> We may need to have a major overhaul of our analysis chain for this
>> customer. Perhaps what we've learned in the past won't apply here.
>>
>> Right now we have outputUnigrams enabled for both index and query. This
>> solves the phrase query problem but causes things to match that the
>> customer doesn't want to match.
>>
>> Thanks,
>> Shawn
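For readers following along, the outputUnigrams trade-off Shawn describes can be sketched as a Solr fieldType. This is an illustrative sketch only, not Shawn's actual schema (his full chain is at the apaste link above); the fieldType name is hypothetical, and the attribute values are the stock defaults for CJKBigramFilterFactory:

```xml
<!-- Hypothetical sketch of a mixed CJK/Latin field.
     Not Shawn's real config; shown to illustrate the trade-off discussed. -->
<fieldType name="text_cjk_mixed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ICUTokenizer; Shawn uses a modified Latin ruleset on top of this -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- outputUnigrams="true" emits single-character tokens alongside the
         bigrams, so single-character phrase queries can match -- but those
         lone characters also match documents the customer may not want. -->
    <filter class="solr.CJKBigramFilterFactory"
            han="true" hiragana="true" katakana="true" hangul="true"
            outputUnigrams="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Benson's alternative, dictionary-based word segmentation, would swap the tokenizer for `solr.JapaneseTokenizerFactory` (Kuromoji) and drop the bigram filter entirely, but as Shawn notes, that only works if the content is actually Japanese rather than a mixture.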