Hi,

> Setting aside the fact that Character.toLowerCase is already dubious in some
> locales (e.g. Turkish),

This is not true. Character.toLowerCase() is locale-independent. Only String.toLowerCase() (called without a Locale argument) uses the default locale.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Trejkaz [mailto:trej...@trypticon.org]
> Sent: Thursday, October 22, 2015 7:15 AM
> To: Lucene Users Mailing List
> Subject: Dubious stuff spotted in LowerCaseFilter
>
> Hi all.
>
> LowerCaseFilter uses CharacterUtils.toLowerCase to perform its work.
> The latter method looks like this:
>
>     public final void toLowerCase(final char[] buffer, final int offset, final int limit) {
>       assert buffer.length >= limit;
>       assert offset <=0 && offset <= buffer.length;
>       for (int i = offset; i < limit;) {
>         i += Character.toChars(
>             Character.toLowerCase(
>                 codePointAt(buffer, i, limit)), buffer, i);
>       }
>     }
>
> Setting aside the fact that Character.toLowerCase is already dubious in some
> locales (e.g. Turkish), I notice that this is using the same "i" index counter
> to refer to both the source offset and the destination offset. So basically,
> this code has an undocumented assumption that Character.toLowerCase always
> returns a code point which takes up the same number of characters as the
> original one.
>
> Whereas I do suppose that this might be the case, did someone actually
> verify it? Say, by iterating all code points or something? How confident are
> we that this will continue to be the case forever? :)
>
> TX
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
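
For anyone who wants to check the two points from this thread on their own JVM, here is a minimal sketch. It is not Lucene code: the class name LowerCaseChecks and its three helper methods are made up for illustration, and it only calls standard JDK methods (Character.toLowerCase(int), Character.charCount(int), String.toLowerCase(Locale)). The first two helpers contrast the locale-independent Character mapping with the locale-sensitive String mapping, using Turkish as the example. The third iterates over all code points, as suggested in the original question, and counts those whose lowercase form needs a different number of chars. The result depends on the Unicode data shipped with the JDK in use, so it is printed rather than assumed.

```java
import java.util.Locale;

// Hypothetical helper class, for illustration only (not part of Lucene).
class LowerCaseChecks {

    // Character.toLowerCase(int) takes no Locale and does not consult the default one.
    static int lowerCodePoint(int codePoint) {
        return Character.toLowerCase(codePoint);
    }

    // String.toLowerCase(Locale) applies locale-specific rules, e.g. for Turkish.
    static String lowerString(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    // Counts code points whose lowercase mapping occupies a different number
    // of chars (1 for BMP, 2 for supplementary) than the original code point.
    static int countLengthChangingCodePoints() {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            int lower = Character.toLowerCase(cp);
            if (Character.charCount(cp) != Character.charCount(lower)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Locale-independent: 'I' maps to 'i' whatever the default locale is.
        System.out.println((char) lowerCodePoint('I'));

        // Locale-sensitive: in Turkish, "I" lowercases to dotless i (U+0131).
        System.out.println(lowerString("I", turkish));
        System.out.println(lowerString("I", Locale.ROOT));

        // Exhaustive check of the "same number of chars" assumption.
        System.out.println("code points changing length: "
                + countLengthChangingCodePoints());
    }
}
```

A count of zero would mean the shared index "i" in the quoted method is safe for the Unicode version of that JDK. It says nothing about future Unicode versions, so a check like this is better suited to a unit test than to a one-off run.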