On 03/01/13 23:52, eryksun wrote:
On Tue, Jan 1, 2013 at 1:29 AM, Steven D'Aprano <st...@pearwood.info> wrote:

2. Since "wide builds" use so much extra memory for the average ASCII
   string, hardly anyone uses them.

On Windows (and I think OS X, too) a narrow build has been practical
since the wchar_t type is 16-bit. As for Linux, I'm most familiar with
Debian, which uses a wide build. Do you know off-hand which distros
release a narrow build?

CentOS does, and presumably therefore Red Hat does too. Fedora did, and I
presume still does.

I didn't actually realize until now that Debian defaults to a wide
build.
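
(For anyone who wants to check their own interpreter: on Pythons before 3.3,
where the wide/narrow distinction still exists, sys.maxunicode tells you which
you have. The session below is from a wide build; a narrow build reports 65535
and stores characters outside the BMP as surrogate pairs, which is why len()
gives 2 for them there.)

     >>> import sys
     >>> sys.maxunicode          # 65535 (0xFFFF) on a narrow build
     1114111
     >>> s = '\U0001d11e'        # MUSICAL SYMBOL G CLEF, outside the BMP
     >>> len(s)                  # 2 on a narrow build (a surrogate pair)
     1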


But more important than the memory savings, it means that for the first
time Python's handling of Unicode strings is correct for the entire range
of all one million plus characters, not just the first 65 thousand.

Still, be careful not to split 'characters':

     >>> list(normalize('NFC', '\u1ebf'))
     ['ế']
     >>> list(normalize('NFD', '\u1ebf'))
     ['e', '̂', '́']


Yes, but presumably if you are normalizing to decomposed forms (NFD or NFKD
modes), you're doing it for a reason and are taking care not to let the
accents wander away from their base character, unless you want them to.
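
If you do need to walk over a decomposed string without letting the marks
drift away from their bases, one rough way is to lean on
unicodedata.combining(). This is only a sketch (the helper name is mine, and a
proper grapheme-cluster splitter, such as the third-party regex module's \X,
handles cases this misses, e.g. marks whose combining class is zero):

    from unicodedata import combining, normalize

    def rough_clusters(s):
        """Group each base character with the combining marks that
        follow it (a rough approximation of grapheme clusters, good
        enough for simple accented Latin text)."""
        clusters = []
        for ch in normalize('NFD', s):
            if clusters and combining(ch):
                clusters[-1] += ch    # attach the mark to its base
            else:
                clusters.append(ch)   # start a new cluster
        return clusters

    # '\u1ebf' decomposes to 'e' + U+0302 + U+0301, but the marks
    # stay attached to the 'e', so you get back one cluster:
    print(rough_clusters('\u1ebf'))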

By the way, for anyone else trying this, the normalize function above is not
a built-in; it comes from the unicodedata module.
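
That is:

     >>> from unicodedata import normalize
     >>> '\u1ebf' == 'e\u0302\u0301'                    # different code points...
     False
     >>> normalize('NFC', 'e\u0302\u0301') == '\u1ebf'  # ...but canonically equivalent
     True
     >>> normalize('NFD', '\u1ebf') == 'e\u0302\u0301'
     True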

More on normalization:

https://en.wikipedia.org/wiki/Unicode_equivalence



Doing-a-lot-of-presuming-today-ly y'rs,


--
Steven
