Dubious stuff spotted in LowerCaseFilter

Trejkaz Wed, 21 Oct 2015 22:15:50 -0700

Hi all.

LowerCaseFilter uses CharacterUtils.toLowerCase to perform its work.
The latter method looks like this:


public final void toLowerCase(final char[] buffer, final int offset,
                              final int limit) {
  assert buffer.length >= limit;
  assert offset <=0 && offset <= buffer.length;
  for (int i = offset; i < limit;) {
    i += Character.toChars(
            Character.toLowerCase(
                codePointAt(buffer, i, limit)), buffer, i);
  }
}

Setting aside the fact that Character.toLowerCase is already dubious
in some locales (e.g. Turkish), I notice that this uses the same
"i" index counter for both the source offset and the destination
offset. So basically, this code has an undocumented assumption that
Character.toLowerCase always returns a code point which occupies the
same number of chars (UTF-16 code units) as the original one.
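
To make the assumption concrete, here is a minimal sketch of a
fail-fast variant. It is not the Lucene code: the class name
CheckedLowerCase is hypothetical, and it uses the JDK's
Character.codePointAt in place of the codePointAt helper from
CharacterUtils. It compares the width of each code point before and
after lowercasing and refuses to write if they differ.

```java
final class CheckedLowerCase {

  /**
   * Lowercases buffer[offset, limit) in place. Throws instead of
   * silently corrupting the buffer if a lowercased code point needs a
   * different number of chars than the original.
   */
  static void toLowerCase(char[] buffer, int offset, int limit) {
    for (int i = offset; i < limit; ) {
      int cp = Character.codePointAt(buffer, i, limit);
      int lower = Character.toLowerCase(cp);
      int width = Character.charCount(cp);
      if (Character.charCount(lower) != width) {
        throw new IllegalStateException(String.format(
            "U+%04X lowercases to U+%04X, which has a different width",
            cp, lower));
      }
      // Safe to overwrite: the result fills exactly the chars just read.
      Character.toChars(lower, buffer, i);
      i += width;
    }
  }

  public static void main(String[] args) {
    // Three BMP letters, a space, and one supplementary letter (U+10400).
    char[] text = "ABC \uD801\uDC00".toCharArray();
    toLowerCase(text, 0, text.length);
    System.out.println(new String(text));
  }
}
```

The extra charCount call costs a little per code point, so this is
meant to illustrate the hidden invariant rather than as a drop-in
replacement.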

While I suppose this might well be the case, did someone actually
verify it? Say, by iterating over all code points or something? How
confident are we that this will continue to be the case forever? :)
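
The brute-force check I have in mind could be sketched roughly like
this. The class and method names are made up for illustration, and the
result only says something about the Unicode tables of whichever JVM
runs it, not about future Unicode versions.

```java
import java.util.ArrayList;
import java.util.List;

final class LowerCaseWidthScan {

  /** True if lowercasing cp changes how many chars it occupies. */
  static boolean changesWidth(int cp) {
    return Character.charCount(cp)
        != Character.charCount(Character.toLowerCase(cp));
  }

  /** Scans every code point and collects those that change width. */
  static List<Integer> findWidthChanges() {
    List<Integer> offenders = new ArrayList<>();
    for (int cp = Character.MIN_CODE_POINT;
         cp <= Character.MAX_CODE_POINT; cp++) {
      if (changesWidth(cp)) {
        offenders.add(cp);
      }
    }
    return offenders;
  }

  public static void main(String[] args) {
    List<Integer> offenders = findWidthChanges();
    for (int cp : offenders) {
      System.out.printf("U+%04X -> U+%04X%n", cp, Character.toLowerCase(cp));
    }
    System.out.println(offenders.size() + " code point(s) change width");
  }
}
```

If the list comes back empty, the shared-index trick is safe on that
JVM. Running something like this as a unit test would at least flag a
future JDK whose Unicode data breaks the assumption.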

TX

