Marc-Andre Lemburg <m...@egenix.com> added the comment:

STINNER Victor wrote:
>
> STINNER Victor <victor.stin...@haypocalc.com> added the comment:
>
> I think that the normalization function in unicodeobject.c (only used for
> internal functions) can skip any character different than a-z, A-Z and 0-9.
> Something like:
>
> >>> import re
> >>> def normalize(name): return re.sub("[^a-z0-9]", "", name.lower())
> ...
> >>> normalize("UTF-8")
> 'utf8'
> >>> normalize("ISO-8859-1")
> 'iso88591'
> >>> normalize("latin1")
> 'latin1'
>
> So ISO-8859-1, ISO885-1, LATIN-1, latin1, UTF-8, utf8, etc. will be
> normalized to iso88591, latin1 and utf8.
>
> I don't know any encoding name where a character outside a-z, A-Z, 0-9 means
> anything special. But I don't know all encoding names! :-)

I think rather than removing any hyphens, spaces, etc. the function should
additionally:

 * add hyphens whenever (they are missing and) there's a switch from [a-z]
   to [0-9]

That way you end up with the correct names for the given set of optimized
encoding names.

----------
title: b'x'.decode('latin1') is much slower than b'x'.decode('latin-1') -> b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue11303>
_______________________________________
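
As a rough illustration of the rule suggested in the comment above (add a hyphen wherever a letter is directly followed by a digit), here is a minimal Python sketch. The function name normalize_with_hyphens is hypothetical, and the sketch only lowercases the name and inserts hyphens; it does not attempt to reproduce whatever else the C normalization function in unicodeobject.c does.

```python
import re


def normalize_with_hyphens(name):
    """Lowercase `name` and insert '-' where a letter is directly followed by a digit.

    Hypothetical helper for illustration only; not the actual CPython code.
    """
    lowered = name.lower()
    # Zero-width match between a letter and a digit: only positions that
    # lack a separator are affected, so existing hyphens are left alone.
    return re.sub(r"(?<=[a-z])(?=[0-9])", "-", lowered)


if __name__ == "__main__":
    for alias in ("latin1", "LATIN-1", "utf8", "UTF-8", "iso8859-1"):
        print(alias, "->", normalize_with_hyphens(alias))
```

With this rule, latin1 and LATIN-1 both come out as latin-1, and utf8 and UTF-8 both come out as utf-8, so the aliases converge on hyphenated spellings instead of on a fully stripped form.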