RE: Dubious stuff spotted in LowerCaseFilter

Uwe Schindler Thu, 22 Oct 2015 04:59:19 -0700

Hi,


> >> Setting aside the fact that Character.toLowerCase is already dubious
> >> in some locales (e.g. Turkish),
> >
> > This is not true. Character.toLowerCase() works locale-independent.
> > It is only String.toLowerCase that works using default locale.

So you mean the opposite: you wanted it to be locale-dependent. That is
already possible. LowerCaseFilter is documented to use only default Unicode
folding, with nothing locale-specific. If you have a Turkish Lucene field, you
need to do locale-specific analysis anyway (e.g. use TurkishAnalyzer), which
uses TurkishLowerCaseFilter. Having both variants as synonyms needs more work,
but that is out of the scope of this mail thread.
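
To make the difference concrete, here is a minimal sketch using only the JDK
(no Lucene involved). The class and method names are made up for this example.
It contrasts the locale-independent Character.toLowerCase(int) with
String.toLowerCase(Locale) under the root and Turkish locales:

```java
import java.util.Locale;

// Hypothetical demo class, not part of Lucene.
public class LowerCaseLocaleDemo {

    // Locale-independent: applies the Unicode simple lowercase mapping.
    static int lowerCodePoint(int codePoint) {
        return Character.toLowerCase(codePoint);
    }

    // Locale-sensitive: applies language-specific rules for the given locale.
    static String lowerString(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Print code points in hex so the console encoding does not matter.
        System.out.printf("Character.toLowerCase: U+%04X%n", lowerCodePoint('I'));
        System.out.printf("String, root locale:   U+%04X%n",
                lowerString("I", Locale.ROOT).codePointAt(0));
        // Under the Turkish locale, 'I' maps to dotless i (U+0131).
        System.out.printf("String, Turkish:       U+%04X%n",
                lowerString("I", turkish).codePointAt(0));
    }
}
```

Only the last call depends on the locale, which is why Turkish text needs its
own filter rather than the default one.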
 
> Yet if you have a field like "title" and the user and system are Turkish, the
> user would expect their locale to apply, yet LowerCaseFilter will not handle
> that. So whereas it is "safe" for English hard-coded strings, it isn't safe
> for all fields you might index in general.

That is exactly how it is documented!

> Dawid's response shows, though, that at least for the time being, there is
> nothing to worry about. Hopefully Unicode will never add a code point which
> lowercases to one with less code units (or I guess changes one of the lower
> ones to lowercase to more than one...)

There was already a discussion about that in JIRA when LowerCaseFilter was
rewritten to support supplementary characters outside the BMP. I have to look
up the issue, but I am quite sure that the Unicode Policeman did a lot of
research and found a statement in the Unicode spec that the upper- and
lowercase forms of a letter are always in the same block. I will try to look
this up.
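
In the meantime, the concern about code unit counts can be checked against
whatever Unicode data a given JDK ships. The following is only a sketch with
made-up names, not the check from the JIRA issue. It scans every code point
and counts those whose Character.toLowerCase(int) result needs a different
number of UTF-16 chars than the input:

```java
// Hypothetical helper class, not part of Lucene.
public class LowerCaseLengthCheck {

    // True if lowercasing keeps the number of UTF-16 code units unchanged.
    static boolean preservesLength(int codePoint) {
        int lower = Character.toLowerCase(codePoint);
        return Character.charCount(lower) == Character.charCount(codePoint);
    }

    // Counts code points whose lowercase form has a different char count.
    static int countLengthChanges() {
        int changes = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (!preservesLength(cp)) {
                changes++;
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        System.out.println("Code points changing length: " + countLengthChanges());
    }
}
```

The result depends on the Unicode version of the JDK running it, so it is
worth re-running after a JDK upgrade rather than trusting an old result.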

Uwe


