Hi,
> >> Setting aside the fact that Character.toLowerCase is already dubious
> >> in some locales (e.g. Turkish),
> >
> > This is not true. Character.toLowerCase() works locale-independent.
> > It is only String.toLowerCase that works using default locale.

So you mean the opposite: you wanted it to be locale-dependent. That is
already possible. LowerCaseFilter is documented to use only the default
Unicode case folding, with no locale-specific rules. If you have a Turkish
Lucene field, you need locale-specific analysis anyway (e.g. use
TurkishAnalyzer), which uses TurkishLowerCaseFilter. Indexing both variants
as synonyms needs more work, but that is out of scope for this mail thread.

> Yet if you have a field like "title" and the user and system are Turkish, the
> user would expect their locale to apply, yet LowerCaseFilter will not handle
> that. So whereas it is "safe" for English hard-coded strings, it isn't safe
> for all fields you might index in general.

That is exactly how it is documented!

> Dawid's response shows, though, that at least for the time being, there is
> nothing to worry about. Hopefully Unicode will never add a code point which
> lowercases to one with less code units (or I guess changes one of the lower
> ones to lowercase to more than one...)

There was already a discussion about that in JIRA when LowerCaseFilter was
rewritten to support supplementary characters outside the BMP. I have to
look up the issue, but I am quite sure that the Unicode Policeman did a lot
of research and found a statement in the Unicode spec that upper- and
lowercase letters are always in the same block. I will try to look this up.

Uwe
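PS: To make the distinction discussed above concrete, here is a minimal sketch in plain Java (no Lucene involved). The class name LowerCaseDemo and its helper methods are made up for illustration only. It contrasts the locale-independent Character.toLowerCase(int) with the locale-sensitive String.toLowerCase(Locale), and adds a brute-force count of code points whose lowercase form would need a different number of UTF-16 code units. That count depends on the Unicode version shipped with your JVM, so the sketch prints it rather than assuming a value.

```java
import java.util.Locale;

// Hypothetical demo class; the name and helpers are illustrative only.
class LowerCaseDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Character.toLowerCase(int) applies the Unicode simple case mapping
    // and takes no locale into account.
    static int lowerCodePoint(int codePoint) {
        return Character.toLowerCase(codePoint);
    }

    // String.toLowerCase(Locale) applies locale-specific rules, e.g. in
    // Turkish 'I' becomes dotless 'ı' (U+0131).
    static String lowerWithLocale(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    // Counts code points whose lowercase mapping uses a different number of
    // UTF-16 code units (1 for BMP, 2 for supplementary) than the original.
    static int countLengthChangingCodePoints() {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            int lower = Character.toLowerCase(cp);
            if (Character.charCount(cp) != Character.charCount(lower)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Locale-independent: always plain 'i'.
        System.out.println(lowerCodePoint('I') == 'i');                          // true
        // Turkish rules: "TITLE" becomes "tıtle" (with dotless i).
        System.out.println(lowerWithLocale("TITLE", TURKISH).equals("t\u0131tle")); // true
        // Root locale: no language-specific rules, so "title".
        System.out.println(lowerWithLocale("TITLE", Locale.ROOT).equals("title"));  // true
        // Supplementary example: Deseret capital U+10400 maps to U+10428.
        System.out.println(lowerCodePoint(0x10400) == 0x10428);                  // true
        System.out.println("length-changing code points: "
                + countLengthChangingCodePoints());
    }
}
```

Passing an explicit Locale (or Locale.ROOT) rather than relying on the default keeps the result independent of the machine the code runs on, which is the same reason a Turkish field wants its own analyzer rather than a change to the generic filter.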