Well, practice says there are no such cases...

    // Compare the UTF-16 length of every code point with that of its
    // upper- and lowercase mappings (needs java.util.Locale).
    for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
      int c1 = Character.charCount(cp);
      int c2 = Character.charCount(Character.toUpperCase(cp));
      int c3 = Character.charCount(Character.toLowerCase(cp));
      if (c1 != c2 || c1 != c3) {
        System.out.println(String.format(Locale.ROOT, "%d %d %d", c1, c2, c3));
      }
    }

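If one wanted to guard against a future Unicode version breaking this, the in-place loop could check the assumption explicitly rather than rely on it silently. The following is only a sketch: the class and method names (GuardedLowerCase, toLowerCaseInPlace) are made up for illustration, and this is not the Lucene implementation. It uses only the JDK's Character methods.

```java
public class GuardedLowerCase {

    // Hypothetical helper for illustration; NOT Lucene's CharacterUtils.toLowerCase.
    // Lowercases buffer[offset, limit) in place and fails loudly if a code point
    // and its lowercase form ever need a different number of chars.
    static void toLowerCaseInPlace(char[] buffer, int offset, int limit) {
        for (int i = offset; i < limit; ) {
            int cp = Character.codePointAt(buffer, i, limit);
            int lower = Character.toLowerCase(cp);
            int width = Character.charCount(cp);
            if (Character.charCount(lower) != width) {
                throw new IllegalStateException(String.format(
                    "U+%04X lowercases to U+%04X with a different UTF-16 length",
                    cp, lower));
            }
            // Same width, so writing at the read position cannot clobber unread input.
            Character.toChars(lower, buffer, i);
            i += width;
        }
    }

    public static void main(String[] args) {
        // U+10400 (DESERET CAPITAL LETTER LONG I) is a surrogate pair in UTF-16.
        char[] buf = "LUCENE \uD801\uDC00".toCharArray();
        toLowerCaseInPlace(buf, 0, buf.length);
        System.out.println(new String(buf));
    }
}
```

The check costs one extra charCount call per code point. Whether that is worth it in a hot path such as a token filter is a judgment call.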
D.

On Thu, Oct 22, 2015 at 10:15 AM, Dawid Weiss <dawid.we...@gmail.com> wrote:

> I think the issue here is what happens if an "uppercase" codepoint
> requires a surrogate pair and the lowercase counterpart does not -- then
> the index variable would indeed be screwed.
>
> Dawid
>
> On Thu, Oct 22, 2015 at 10:05 AM, Uwe Schindler <u...@thetaphi.de> wrote:
>
>> Hi,
>>
>> > Setting aside the fact that Character.toLowerCase is already dubious in
>> > some locales (e.g. Turkish),
>>
>> This is not true. Character.toLowerCase() works locale-independent. It is
>> only String.toLowerCase that works using default locale.
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>> > -----Original Message-----
>> > From: Trejkaz [mailto:trej...@trypticon.org]
>> > Sent: Thursday, October 22, 2015 7:15 AM
>> > To: Lucene Users Mailing List
>> > Subject: Dubious stuff spotted in LowerCaseFilter
>> >
>> > Hi all.
>> >
>> > LowerCaseFilter uses CharacterUtils.toLowerCase to perform its work.
>> > The latter method looks like this:
>> >
>> >     public final void toLowerCase(final char[] buffer, final int offset, final int limit) {
>> >       assert buffer.length >= limit;
>> >       assert offset <=0 && offset <= buffer.length;
>> >       for (int i = offset; i < limit;) {
>> >         i += Character.toChars(
>> >             Character.toLowerCase(
>> >                 codePointAt(buffer, i, limit)), buffer, i);
>> >       }
>> >     }
>> >
>> > Setting aside the fact that Character.toLowerCase is already dubious in some
>> > locales (e.g. Turkish), I notice that this is using the same "i" index counter to
>> > refer to both the source offset and the destination offset. So basically, this
>> > code has an undocumented assumption that Character.toLowerCase always
>> > returns a code point which takes up the same number of characters as the
>> > original one.
>> >
>> > Whereas I do suppose that this might be the case, did someone actually
>> > verify it? Say, by iterating all code points or something? How confident are
>> > we that this will continue to be the case forever? :)
>> >
>> > TX
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: java-user-h...@lucene.apache.org
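
On the locale point raised in the thread: the difference between Character.toLowerCase and String.toLowerCase can be seen with the Turkish dotless i. This is a small illustrative sketch; the class name LocaleLowerCase and the helper lower are invented for the example, and it prints code points in hex to avoid depending on console encoding.

```java
import java.util.Locale;

public class LocaleLowerCase {

    // Hypothetical helper: lowercases a string under an explicit locale.
    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    static String hex(int codePoint) {
        return String.format(Locale.ROOT, "U+%04X", codePoint);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Character.toLowerCase takes no locale, so 'I' always maps to 'i'.
        System.out.println(hex(Character.toLowerCase('I')));      // U+0069

        // String.toLowerCase(Locale) applies Turkish rules: 'I' maps to dotless i.
        System.out.println(hex(lower("I", turkish).charAt(0)));   // U+0131

        // With Locale.ROOT the result does not depend on the default locale.
        System.out.println(hex(lower("I", Locale.ROOT).charAt(0))); // U+0069
    }
}
```

The no-argument String.toLowerCase() behaves like the second call whenever the JVM's default locale is Turkish, which is why passing an explicit locale is the usual advice.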