There is often the possibility of putting another tokenizer in the chain to
create a variant analyzer.  This is not very hard at all in either Lucene or
ElasticSearch.
Extra tokenizers can often be used to tweak the overall processing, e.g. to
add a late re-tokenization step that catches something the main tokenizer
overlooked (breaking on a colon would be a simple example; a sketch follows
below).  Adding a tokenizer before the others can change a token that seems
incorrectly processed into one that is handled the way you like.
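
Here is a minimal sketch of the colon case against Lucene's analysis API
(Analyzer / Tokenizer / TokenFilter).  The class names ColonSplitAnalyzer
and ColonSplitFilter are just made up for illustration, and strictly
speaking the late step is a TokenFilter rather than a second Tokenizer,
since a Lucene chain is one Tokenizer followed by filters:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Variant analyzer: whitespace tokenization plus a late split on ':'.
public class ColonSplitAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream result = new ColonSplitFilter(source);
        return new TokenStreamComponents(source, result);
    }

    // Re-tokenizes each incoming token on ':' and emits the pieces.
    static final class ColonSplitFilter extends TokenFilter {
        private final CharTermAttribute termAtt =
                addAttribute(CharTermAttribute.class);
        private String[] pending;
        private int pendingIndex;

        ColonSplitFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            while (true) {
                // Emit pieces left over from the previous token first.
                if (pending != null) {
                    while (pendingIndex < pending.length) {
                        String piece = pending[pendingIndex++];
                        if (!piece.isEmpty()) {
                            termAtt.setEmpty().append(piece);
                            return true;
                        }
                    }
                    pending = null;
                }
                if (!input.incrementToken()) {
                    return false;
                }
                String term = termAtt.toString();
                if (term.indexOf(':') < 0) {
                    return true;  // no colon: pass the token through as-is
                }
                pending = term.split(":");
                pendingIndex = 0;
                // loop around to emit the first piece
            }
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            pending = null;
            pendingIndex = 0;
        }
    }
}

Note the sketch only rewrites the term text; a production filter would also
fix up the offset and position-increment attributes for the split pieces.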

Trejkaz, I haven't tried to use ICU yet, but from what I understand, I think
you'll find that ICU is more in agreement with your views: it embraces the
idea of refining the tokenization etc. as needed, rather than relying on the
curious (and often flawed) choices of some design committee somewhere.
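
For what it's worth, here is the kind of minimal setup I have in mind for
ICU, assuming the lucene-analysis-icu module is on the classpath (the class
name IcuWordAnalyzer is again just illustrative):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

// Analyzer built on ICU's Unicode (UAX #29) word-break rules.
public class IcuWordAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // The no-arg ICUTokenizer uses the default per-script rules;
        // custom break rules can be supplied via an ICUTokenizerConfig,
        // which is exactly the "refine it yourself" hook.
        Tokenizer source = new ICUTokenizer();
        return new TokenStreamComponents(source);
    }
}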

> -----Original Message-----
> ... no specialisation for straight Roman script, but I guess it could
> always be added.

That would be one of the main points of the whole ICU infrastructure.

-Paul 

