(to solr-user, CC'ing author I'm responding to)

I found the solr-user listserv contribution at:

https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3c51965e70.6070...@elyograg.org%3E

It explains a way to supply custom rulefiles to ICUTokenizer, in this case to tell it to break only on whitespace for Latin-script substrings.

I am trying to use the technique explained there in Solr 4.3, but either it's not working, or it's not doing what I'd expect.

I want, for instance, "C++ Language" to be tokenized into "C++", "Language". But the ICUTokenizer, even with rulefiles="Latn:Latin-break-only-on-whitespace.rbbi" and the rbbi file from the Solr 4.3 source [1], is still stripping the punctuation and tokenizing that into "C", "Language".
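
For concreteness, the kind of field type definition in question looks roughly like the following. This is a minimal sketch rather than an exact copy of my schema: the field type name "text_icu_ws" is made up for illustration, and it assumes the rbbi file from [1] has been copied into the core's conf directory so the tokenizer factory can load it by name.

```xml
<!-- Hypothetical field type name; only the tokenizer line matters here. -->
<fieldType name="text_icu_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- rulefiles takes script-code:filename pairs; "Latn" is the
         four-letter ISO 15924 code for Latin script. -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
  </analyzer>
</fieldType>
```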

Can anyone give me any guidance or hints? I don't understand the semantics of the rbbi file well enough to debug it myself. Is something not working, or does the rbbi file just not express the semantics I want?

Thanks for any tips.



[1] http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_3_0/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/Latin-break-only-on-whitespace.rbbi?revision=1479557&view=markup
