Hi all. LowerCaseFilter uses CharacterUtils.toLowerCase to perform its work. The latter method looks like this:

    public final void toLowerCase(final char[] buffer, final int offset, final int limit) {
      assert buffer.length >= limit;
      assert offset <=0 && offset <= buffer.length;
      for (int i = offset; i < limit;) {
        i += Character.toChars(
                Character.toLowerCase(
                    codePointAt(buffer, i, limit)), buffer, i);
      }
    }

Setting aside the fact that Character.toLowerCase is already dubious in some locales (e.g. Turkish), I notice that this uses the same "i" index counter for both the source offset and the destination offset. So basically, this code has an undocumented assumption that Character.toLowerCase always returns a code point that takes up the same number of chars (UTF-16 code units) as the original one.

While I suppose this might be the case, did someone actually verify it? Say, by iterating over all code points or something? How confident are we that this will continue to be the case forever? :)

TX
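For what it's worth, here is a minimal sketch of the kind of check I have in mind. It walks every code point and records those whose lowercase form needs a different number of chars than the original. The class name LowerCaseWidthCheck and the method findWidthChanges are made up for illustration and are not part of Lucene. The result depends on the Unicode tables shipped with the JDK it runs on, so it only says something about that particular JDK version, not about future ones.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: tests the assumption that
// lowercasing never changes the number of UTF-16 chars a code point occupies.
class LowerCaseWidthCheck {

    /** Returns every code point whose lowercase form has a different char count. */
    static List<Integer> findWidthChanges() {
        List<Integer> mismatches = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            int lower = Character.toLowerCase(cp);
            // charCount is 1 for BMP code points and 2 for supplementary ones.
            if (Character.charCount(cp) != Character.charCount(lower)) {
                mismatches.add(cp);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        List<Integer> mismatches = findWidthChanges();
        System.out.println("Code points whose lowercase width differs: " + mismatches.size());
        for (int cp : mismatches) {
            System.out.printf("U+%04X -> U+%04X%n", cp, Character.toLowerCase(cp));
        }
    }
}
```

If the list comes back empty, the shared index in toLowerCase is safe on that JDK. A non-empty list would name the exact code points where the in-place write could overrun or leave a stale char behind.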