Have you looked at commercial offerings? At some point this becomes an
ROI question. If it is turning into such a serious issue:
http://www.basistech.com/text-analytics/rosette/base-linguistics/asian-languages/

Regards,
   Alex.
P.S. This is a link, not a recommendation; I haven't tested either
their quality or their pricing.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Mon, Apr 21, 2014 at 8:50 AM, Shawn Heisey <s...@elyograg.org> wrote:
> On 4/20/2014 6:20 PM, Benson Margulies wrote:
>> Could I perhaps wonder why your customer is so intent on indexing
>> ngrams? Why not use Kuromoji and index words?
>
> The data is not just Japanese.  There is a mixture.  For text in the
> Latin character set, StandardTokenizer and other similar things do not
> work for us, mostly because of the way that they handle punctuation.
> ICUTokenizer with its default rule set wouldn't work either, but as
> you'll see below, I've got a modified ruleset for Latin.
>
> The following is what I currently have for my analysis.  A lot of this
> has evolved over the last few years on my other index that is primarily
> English:
>
> http://apaste.info/ypy
>
> We may need to have a major overhaul of our analysis chain for this
> customer.  Perhaps what we've learned in the past won't apply here.
>
> Right now we have outputUnigrams enabled for both index and query.  This
> solves the phrase query problem but causes things to match that the
> customer doesn't want to match.
>
> Thanks,
> Shawn
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
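
For anyone following along, the outputUnigrams behavior Shawn describes maps to a field type along these lines. This is only a sketch, not his actual schema (which is in the apaste link above); the field type name and the exact filter chain are illustrative:

```xml
<!-- Illustrative fieldType: ICU tokenization plus CJK bigrams.
     Names and filter choices here are assumptions, not Shawn's config. -->
<fieldType name="text_mixed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ICUTokenizerFactory can take a custom rule set for Latin text -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- outputUnigrams="true" emits single CJK characters alongside
         bigrams: it fixes single-character phrase queries, but also
         produces the broader matches the customer objects to -->
    <filter class="solr.CJKBigramFilterFactory"
            han="true" hiragana="true" katakana="true" hangul="true"
            outputUnigrams="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Setting outputUnigrams="false" narrows matching to bigrams only, at the cost of single-character phrase queries, which is the trade-off under discussion.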
