On May 24, 1:13 pm, Steven D'Aprano <st...@remove-this-cybersource.com.au> wrote:
> Do unicode.lower() or unicode.upper() ever change the length of the
> string?
>
> The Unicode standard allows for case conversions that change length, e.g.
> sharp-S in German should convert to SS:
>
> http://unicode.org/faq/casemap_charprop.html#6
>
> but I see that Python doesn't do that:
>
> >>> s = "Paßstraße"
> >>> s.upper()
> 'PAßSTRAßE'
>
> The more I think about this, the more I think that upper/lower/title case
> conversions should change length (at least sometimes) and if Python
> doesn't do so, that's a bug. Any thoughts?

Digging a bit deeper, it looks like these methods use the Simple_{Upper,Lower,Title}case_Mapping properties described at

http://www.unicode.org/Public/5.1.0/ucd/UCD.html

that is, fields 12, 13 and 14 of the Unicode data. You can see this in the source in Tools/unicode/makeunicodedata.py, which is the Python code that generates the database of Unicode properties. It contains code like:

    if record[12]:
        upper = int(record[12], 16)
    else:
        upper = char
    if record[13]:
        lower = int(record[13], 16)
    else:
        lower = char
    if record[14]:
        title = int(record[14], 16)
    ...

and so on. Because each of these fields holds at most a single code point, a mapping built this way can never change the length of the string.

I agree that it might be desirable for these operations to produce the multicharacter equivalents. That idea looks like a tough sell, though: apart from backwards compatibility concerns (which could probably be worked around somehow), it looks as though it would require significant effort to implement.

--
Mark
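
To make the difference concrete, here is a minimal sketch (written for Python 3, and not the actual makeunicodedata.py code) that builds a simple uppercase table from field 12 of two sample records in the UnicodeData.txt layout, then compares it with a full mapping. The names SAMPLE_RECORDS, FULL_UPPER, upper_simple and upper_full are made up for illustration, and the one-entry FULL_UPPER table is an assumption standing in for the multicharacter mappings the Unicode FAQ describes.

```python
# Sample records in the semicolon-separated UnicodeData.txt layout.
# Fields 12, 13 and 14 hold the simple upper, lower and title mappings.
SAMPLE_RECORDS = [
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;",
]

# Hypothetical full-mapping table (illustration only): sharp-S -> "SS".
FULL_UPPER = {0x00DF: "SS"}


def simple_upper_table(records):
    """Map each code point to its single-code-point uppercase form."""
    table = {}
    for line in records:
        record = line.split(";")
        char = int(record[0], 16)
        # An empty field 12 means the character maps to itself.
        table[char] = int(record[12], 16) if record[12] else char
    return table


SIMPLE_UPPER = simple_upper_table(SAMPLE_RECORDS)


def upper_simple(s):
    """One code point in, one code point out: length never changes."""
    return "".join(chr(SIMPLE_UPPER.get(ord(c), ord(c))) for c in s)


def upper_full(s):
    """Consult the multicharacter table first, then the simple one."""
    out = []
    for c in s:
        cp = ord(c)
        if cp in FULL_UPPER:
            out.append(FULL_UPPER[cp])
        else:
            out.append(chr(SIMPLE_UPPER.get(cp, cp)))
    return "".join(out)


if __name__ == "__main__":
    word = "a\u00dfa"
    print(len(upper_simple(word)), len(upper_full(word)))
```

With only the simple table, sharp-S has no uppercase entry and is passed through unchanged; the full variant returns a longer string, which is exactly the behaviour change that would raise the compatibility concerns mentioned above.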