Marko Rauhamaa <ma...@pacujo.net> writes:

> Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>:
>
>> Nevertheless, there are important abstractions that are written on top
>> of the bytes layer, and in the Unix and Linux world, the most
>> important abstraction is *text*. In the Unix world, text formats and
>> text processing is much more common in user-space apps than binary
>> processing.
>
> That linux text is not the same thing as Python's text. Conceptually,
> Python text is a sequence of 32-bit integers. Linux text is a sequence
> of 8-bit integers.

_A Unicode string in Python is a sequence of Unicode codepoints_.

It is correct that a 32-bit integer is enough to represent any Unicode
codepoint: \u0000...\U0010FFFF

That says *nothing* about how Unicode strings are represented
*internally* in Python. The representation may vary with the Python
version and the build options, and it may even depend on the content of
a string at runtime.

In the past, "narrow builds" could break the abstraction in some cases,
which is why Linux distributions used wide Python builds.

_Unicode codepoint is not a Python concept_. It comes from the Unicode
standard http://unicode.org Though instead of following the web of
self-referential definitions, I find it easier to learn from examples
such as http://codepoints.net/U+0041 (A) or
http://codepoints.net/U+1F3A7 (🎧)

_There is no such thing as 8-bit text_
http://www.joelonsoftware.com/articles/Unicode.html

If you insert a space after each byte (8-bit) in the input text then
you may get garbage, i.e., you can't assume that a character is a byte:

  $ echo "Hyvää yötä" | perl -pe's/.\K/ /g'
  H y v a � � � � y � � t � �

In general, you can't assume that a character is a Unicode codepoint
either. In the input here, the first "ä" is "a" followed by U+0308
(combining diaeresis) and the second one is the single codepoint U+00E4:

  $ echo "Hyvää yötä" | perl -C -pe's/.\K/ /g'
  H y v a ̈ ä y ö t ä

eXtended grapheme clusters (user-perceived characters) may be useful in
this case:

  $ echo "Hyvää yötä" | perl -C -pe's/\X\K/ /g'
  H y v ä ä y ö t ä

In Python, the \X pattern is supported only by the third-party `regex`
module, i.e., you can't even iterate over characters (as they are seen
by a user) in Python using only stdlib.

The \w+ pattern is also broken for Unicode text
http://bugs.python.org/issue1693050 (it is fixed in the `regex`
module), i.e., you can't select a word in Unicode text using only
stdlib.

\X alone is not enough in some cases, e.g., "“ch” may be considered a
grapheme cluster in Slovak, for processes such as collation" [1]
(sorting order). The `PyICU` module might be useful here.
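
To make the three levels above concrete (bytes, codepoints,
user-perceived characters), here is a minimal stdlib-only sketch. It
assumes the first "ä" of the sample is decomposed, which is what the
perl output suggests. `approx_clusters` is a hypothetical helper, not a
stdlib function: it only attaches combining marks to the preceding
codepoint, so it is a rough stand-in for \X and not an implementation
of the segmentation rules in [1].

```python
import unicodedata

# Assumption: first "ä" is a + U+0308 (decomposed); the rest are precomposed.
text = "Hyva\u0308\u00e4 y\u00f6t\u00e4"


def approx_clusters(s):
    """Group each codepoint with the combining marks that follow it.

    Rough approximation only: this is NOT the extended grapheme cluster
    algorithm (it ignores Hangul syllables, emoji sequences, and so on).
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)
    return clusters


n_bytes = len(text.encode("utf-8"))       # what a byte-oriented tool sees
n_codepoints = len(text)                  # what Python's str sees
clusters = approx_clusters(text)          # closer to what a user sees
nfc = unicodedata.normalize("NFC", text)  # composes a + U+0308 into U+00E4

print(n_bytes, n_codepoints, len(clusters), len(nfc))  # 15 11 10 10
```

If the `regex` module is installed, `regex.findall(r'\X', text)` should
give proper extended grapheme clusters instead of this approximation.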

Knowing about Unicode normalization forms (NFC, NFKD, etc.)
http://unicode.org/reports/tr15/, Unicode text segmentation [1], and
the Unicode collation algorithm http://www.unicode.org/reports/tr10/
is also useful if you want to work with text.

[1]: http://www.unicode.org/reports/tr29/

--
akira