Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> writes:
> Paul Rubin already told you about his experience using OCR to generate
> multiple terabytes of text, and how he would not be happy if that was
> stored in UCS-4.

That particular text was stored on disk as compressed XML with UTF-8 in the data fields, but I think Roy is right that it would have compressed to around the same size in UCS-4. Converting it to UCS-4 on input would have bloated the memory footprint, and that was the issue of concern to me.

> Pittance or not, I do not believe that people will widely abandon compact
> storage formats like UTF-8 and Latin-1 for UCS-4 any time soon.

Looking at http://www.icu-project.org/ the C++ classes seem to use UTF-16, sort of like Python 3.2 :(. I'm not certain of this though.
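
To put rough numbers on the UTF-8 versus UCS-4 point above, here is a minimal sketch that compares raw and zlib-compressed sizes for the two encodings. The sample string and the `size_report` name are made up for illustration; this is not the actual OCR data, and a repetitive sample like this compresses far better than real text would, so treat the compressed figures as a way to run the comparison yourself, not as a result.

```python
import zlib

def size_report(text):
    """Return raw and zlib-compressed byte counts for UTF-8 and UCS-4 (UTF-32)."""
    utf8 = text.encode("utf-8")
    # Little-endian UTF-32 without a BOM: exactly 4 bytes per code point.
    ucs4 = text.encode("utf-32-le")
    return {
        "utf8_raw": len(utf8),
        "ucs4_raw": len(ucs4),
        "utf8_zlib": len(zlib.compress(utf8, 9)),
        "ucs4_zlib": len(zlib.compress(ucs4, 9)),
    }

# Hypothetical stand-in for mostly-ASCII OCR output.
sample = "The quick brown fox jumps over the lazy dog. " * 2000
report = size_report(sample)

for name, size in sorted(report.items()):
    print(name, size)
```

For pure-ASCII input the raw UCS-4 size is four times the raw UTF-8 size, which is the in-memory bloat I was worried about; how close the compressed sizes end up depends on the text and the compressor.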