Solr, ICUTokenizer with Latin-break-only-on-whitespace

Jonathan Rochkind Thu, 20 Jun 2013 12:28:23 -0700

(to solr-user, CC'ing author I'm responding to)

I found the solr-user listserv contribution at:


https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3c51965e70.6070...@elyograg.org%3E

That post explains a way to supply custom rulefiles to ICUTokenizer, in this case to tell it to break only on whitespace for Latin character substrings.

I am trying to use the technique explained there in Solr 4.3, but either it's not working, or it's not doing what I'd expect.

I want, for instance, "C++ Language" to be tokenized into "C++", "Language". But the ICUTokenizer, even with rulefiles="Latn:Latin-break-only-on-whitespace.rbbi" and the rbbi file from the Solr 4.3 source [1], is still stripping the punctuation and tokenizing that into "C", "Language".
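
For reference, a field type using that rulefiles setting would look roughly like the sketch below. The field type name "text_icu_ws" is a placeholder and not taken from any actual setup; the sketch also assumes the ICU analysis jars are on the classpath and that the .rbbi file has been copied into the core's conf directory.

```xml
<!-- Sketch only: "text_icu_ws" is a placeholder name. -->
<fieldType name="text_icu_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- rulefiles maps a four-letter ISO 15924 script code (Latn) to a rule file;
         the file is assumed to sit in the core's conf directory. -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
  </analyzer>
</fieldType>
```

With only the tokenizer in the analyzer chain, any punctuation stripping seen on the analysis page would come from the tokenizer and its rules rather than from a later filter.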

Can anyone give me any guidance or hints? I don't understand the semantics of the rbbi file well enough to try debugging there. Is something not working, or does the rbbi file just not express the semantics I want?


Thanks for any tips.

[1] http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_3_0/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/Latin-break-only-on-whitespace.rbbi?revision=1479557&view=markup
