(to solr-user, CC'ing author I'm responding to)
I found the solr-user listserv contribution at:
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3c51965e70.6070...@elyograg.org%3E
which explains a way to supply custom rulefiles to ICUTokenizer, in
this case to tell it to break only on whitespace for Latin-script
substrings.
I am trying to use the technique explained there in Solr 4.3, but either
it's not working, or it's not doing what I'd expect.
I want, for instance, "C++ Language" to be tokenized into "C++",
"Language". But the ICUTokenizer, even with
rulefiles="Latn:Latin-break-only-on-whitespace.rbbi" and the rbbi file
from the Solr 4.3 source [1], is still stripping the punctuation and
tokenizing that into "C", "Language".
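For concreteness, the analyzer setup I'm describing looks roughly like
the sketch below. The fieldType name "text_icu_ws" is hypothetical and
only there for illustration; the tokenizer line with its rulefiles
attribute is the part that matters. It also assumes the .rbbi file has
been copied somewhere the core's resource loader can find it, such as
the conf directory.

```xml
<!-- Sketch only: the fieldType name "text_icu_ws" is hypothetical. -->
<fieldType name="text_icu_ws" class="solr.TextField">
  <analyzer>
    <!-- "Latn" is the four-letter script code; the value after the
         colon is the rule file applied to text in that script. -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
  </analyzer>
</fieldType>
```

There are deliberately no filters after the tokenizer here, so that
whatever comes out is the tokenizer's own output.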
Can anyone give me any guidance or hints? I don't understand the
semantics of the rbbi file well enough to try debugging there. Is
something not working, or does the rbbi file just not express the
semantics I want?
Thanks for any tips.
[1]
http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_3_0/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/Latin-break-only-on-whitespace.rbbi?revision=1479557&view=markup