Re: [Python-Dev] Which direction is UnTransform? / Unicode is different
On Tue, Nov 19, 2013 at 05:28:48PM -0800, Jim J. Jewett wrote:

> (Fri Nov 15 16:57:00 CET 2013) Stephen J. Turnbull wrote:
>> Serhiy Storchaka wrote:
>>> If the transform() method will be added, I prefer to have only one transformation method and specify a direction by the transformation name (bzip2/unbzip2).
>> Me too. Until I consider special cases like compress, or lower, and realize that there are enough special cases to become a major wart if generic transforms ever became popular.

I'm not sure I understand this comment. Why are compress and lower special cases? If there's a compress codec, presumably there'll be an uncompress or expand that reverses it. In the case of lower, it's not losslessly reversible, but there's certainly a reverse transformation, upper.

Some transformations are their own reverse, e.g. rot13. In that case, there's no need for an unrot13 codec, since applying it twice undoes it.

>> People think about these transformations as en- or de-coding, not transforming, most of the time. Even for a transformation that is an involution (eg, rot13), people have a very clear idea of what's encoded and what's not, and they are going to prefer the names encode and decode for these (generic) operations in many cases.

> I think this is one of the major stumbling blocks with unicode. I originally disagreed strongly with what Stephen wrote -- but then I realized that all my counterexamples involved unicode text.

Counterexamples to what? Again, I'm afraid I can't really understand what point you're trying to make here. Perhaps an explicit counterexample, and an explicit statement of what you're disagreeing with (e.g. "I disagree that people have a clear example of what's encoded and what's not") will help.

> [...] But an 8-bit (even Latin-1, let alone ASCII) bytestring really doesn't seem encoded, and it doesn't make sense to decode a perfectly readable (ASCII) string into a sequence of code units.

Of course it is encoded.
There's nothing "a"-like about the byte 0x61, byte 0x2E is nothing like a period, and there is nothing about the byte 0x0A that forces text editors to start a new line -- or should that be 0x0D, or even possibly 0x85? There's nothing that distinguishes the text "spam" from the four-byte integer 1936744813 (0x7370616d in hex) except the semantics that we grant it, and that includes an implicit transformation 0x73 -> "s", etc.

Reading this may help: www.joelonsoftware.com/articles/Unicode.html

> Nor does it help that http://www.unicode.org/glossary/#code_unit defines code unit as
>
>     The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. (See definition D77 in Section 3.9, Unicode Encoding Forms.)

I agree that the official Unicode glossary is unfortunately confusing. It has a huge amount of information, often with confusingly similar terminology (code points and code units are, in a sense, opposites), and it's quite hard for beginners to Unicode to make sense of it all.

> I have to read that very carefully to avoid mentally translating it into Code Units are *en*coded,

Code units *are* encoded, in the sense that we say a burger is cooked. Take a raw meat patty and cook it, and you get a burger. Similarly, code units are the product of an encoding process, hence have been encoded.

Code points (think of them as characters, modulo a few technicalities) are encoded *into* code units, which are stored as one or more bytes. Which code units you get depends on the encoding form you use, i.e. the codec. If you start with the character "a", and apply the UTF-8 encoding, you get a single 8-bit (one byte) code unit, 0x61. If you apply the UTF-16 (big endian) encoding, you get a single 16-bit (two bytes) code unit, 0x0061. If you apply the UTF-32be codec, you get a single 32-bit (four bytes) code unit, 0x00000061.
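The code units described above can be checked in a Python 3 interpreter. The following is a minimal sketch using only the built-in str.encode(), bytes.hex() and int.from_bytes(); the variable names are illustrative and not from the thread.

```python
# Encode the single code point U+0061 ("a") with three Unicode encoding forms.
ch = "a"

utf8 = ch.encode("utf-8")        # one 8-bit code unit
utf16 = ch.encode("utf-16-be")   # one 16-bit code unit, big endian
utf32 = ch.encode("utf-32-be")   # one 32-bit code unit, big endian

print(utf8.hex())    # 61
print(utf16.hex())   # 0061
print(utf32.hex())   # 00000061

# The same four bytes can be read as text or as an integer; only the
# interpretation we choose to apply differs.
raw = bytes([0x73, 0x70, 0x61, 0x6D])
as_text = raw.decode("ascii")        # "spam"
as_int = int.from_bytes(raw, "big")  # 1936744813
print(as_text, as_int)
```

Using the explicit "-be" codec names avoids the byte order mark that the plain "utf-16" and "utf-32" codecs prepend, so the output shows the bare code unit.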
> and there are lots of different complicated encodings that I wouldn't use unless I were doing special processing or interchange.

Very few of those encodings are Unicode. With the exception of a small handful of UTF-* codecs, and maybe one or two others, the vast majority are legacy encodings from the Bad Old Days when just about every computer had its own distinct character set, or sets. If you're a Windows user, the non-UTF codecs (all the Latin-whatever codecs, Big5, cp-whatever, koi8-whatever, there are dozens of them) are basically old Windows code pages and the equivalent from other computer systems.

And yes, it is best to avoid them like the plague except when you need them for interoperability with legacy data.

> If I'm not using the network, or if my interchange format already looks like readable ASCII, then unicode sure sounds like a complication.

It's not, not compared to the Bad Old Days. If
Re: [Python-Dev] Which direction is UnTransform? / Unicode is different
On Wed, Nov 20, 2013 at 11:03 PM, Steven D'Aprano st...@pearwood.info wrote:

>> I *will* get confused over which direction is encoding and which is decoding. (Removing .decode() from the (unicode) str type in 3 does help a lot, if I have a Python 3 interpreter running to check against.)
>
> It took me a long time to learn that text encodes to bytes, and bytes decode back to text. Using Python 3 really helped with that.

Rule of thumb: Stuff gets encoded for transmission/storage and decoded for usage. That covers encryption (you transmit the coded form and read the decoded), compression (you store the tighter form and use the expanded), Unicode (you store bytes, you work with characters), and quite a few others. I don't know that it's an iron-clad rule (though I can't off-hand think of a contrary example), but it's certainly an easy way to remember a lot of the encode/decode pairs.

ChrisA
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
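The rule of thumb in the message above (encode for storage or transmission, decode for use) can be illustrated with two of the pairs it mentions. This is only a sketch with made-up sample data; it uses str.encode()/bytes.decode() for Unicode and the stdlib zlib module for compression.

```python
import zlib

text = "naïve café"

# Unicode: work with characters, store or send bytes.
stored = text.encode("utf-8")        # str -> bytes (the stored form)
restored = stored.decode("utf-8")    # bytes -> str (the working form)

# Compression: store the tighter form, use the expanded one.
payload = b"spam " * 100
packed = zlib.compress(payload)      # stored form
unpacked = zlib.decompress(packed)   # working form

print(type(stored).__name__, restored == text)
print(len(payload), len(packed), unpacked == payload)
```

In both pairs the "encoded" side is the one that would be written to disk or sent over the wire, and the "decoded" side is the one the program works with.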
Re: [Python-Dev] Which direction is UnTransform? / Unicode is different
On 20.11.13 02:28, Jim J. Jewett wrote:

> [...]
> Instead of relying on introspection of .decodes_to and .encodes_to, it would be useful to have charsetcodecs and transformcodecs as entirely different modules, with their own separate registries. I will even note that the existing help(codecs) seems more appropriate for charsetcodecs than it does for the current conjoined module.

I don't understand how a registry of transformation functions would simplify code. Without the transform() method I would write:

    >>> import binascii
    >>> binascii.hexlify(b'foo')
    b'666f6f'

With the transform() method I should be able to write:

    >>> b'foo'.transform("hex")

However how does the hex transformer get registered in the registry? If the hex transformer is not part of the stdlib, there must be some code that does the registration, but to get that code to execute, I'd have to import a module, so we're back to square one, as I'd have to write:

    >>> import hex_transformer
    >>> b'foo'.transform("hex")

A way around this would be some kind of import magic, but is this really necessary to be able to avoid one import statement?

Furthermore different transformation functions might have different additional options. Supporting those is simple when we have simple transformation functions: The function has arguments, and those are documented where the function is documented. If we want to support custom options for the .transform() method, transform() would have to pass along *args, **kwargs to the underlying transformer. However this is difficult to document in a way that makes it easy to find which options exist for a particular transformer.

Servus,
Walter
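The registration problem raised in the message above can be sketched in a few lines. The registry below is purely hypothetical: `_TRANSFORMS`, `register_transform` and `transform` are invented names for illustration, not part of the stdlib or of any concrete proposal in this thread. Only binascii.hexlify() is a real call.

```python
import binascii

# Hypothetical registry; these names are illustrative only.
_TRANSFORMS = {}

def register_transform(name, func):
    """Record a bytes-to-bytes function under a name."""
    _TRANSFORMS[name] = func

def transform(data, name, *args, **kwargs):
    """Look up a transformer by name and forward any extra options to it."""
    try:
        func = _TRANSFORMS[name]
    except KeyError:
        raise LookupError("unknown transform: %r" % name) from None
    return func(data, *args, **kwargs)

# Direct call: one import, and the options are documented with the function.
direct = binascii.hexlify(b"foo")

# Registry call: some module still has to execute this registration first,
# which in practice means importing that module.
register_transform("hex", binascii.hexlify)
via_registry = transform(b"foo", "hex")

print(direct, via_registry)
```

The sketch also shows the documentation concern: `transform()` can only forward `*args` and `**kwargs` blindly, so the options accepted for a given name are visible only in the documentation of whatever function was registered under it.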
Re: [Python-Dev] Which direction is UnTransform? / Unicode is different
On 20 November 2013 23:38, Walter Dörwald wal...@livinglogic.de wrote:

> On 20.11.13 02:28, Jim J. Jewett wrote:
>
>> [...]
>> Instead of relying on introspection of .decodes_to and .encodes_to, it would be useful to have charsetcodecs and transformcodecs as entirely different modules, with their own separate registries. I will even note that the existing help(codecs) seems more appropriate for charsetcodecs than it does for the current conjoined module.
>
> I don't understand how a registry of transformation functions would simplify code. Without the transform() method I would write:
>
>     >>> import binascii
>     >>> binascii.hexlify(b'foo')
>     b'666f6f'
>
> With the transform() method I should be able to write:
>
>     >>> b'foo'.transform("hex")
>
> However how does the hex transformer get registered in the registry? If the hex transformer is not part of the stdlib, there must be some code that does the registration, but to get that code to execute, I'd have to import a module, so we're back to square one, as I'd have to write:
>
>     >>> import hex_transformer
>     >>> b'foo'.transform("hex")
>
> A way around this would be some kind of import magic, but is this really necessary to be able to avoid one import statement?

Could we please move discussion of hypothetical future divisions of the codec namespace into additional APIs to python-ideas? We don't even have consensus to restore the codecs module to parity with its Python 2 functionality at this point, let alone agreement on adding more convenience APIs for functionality that we have core developers arguing shouldn't be restored at all.

Regards,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
[Python-Dev] Which direction is UnTransform? / Unicode is different
(Fri Nov 15 16:57:00 CET 2013) Stephen J. Turnbull wrote:

> Serhiy Storchaka wrote:
>> If the transform() method will be added, I prefer to have only one transformation method and specify a direction by the transformation name (bzip2/unbzip2).
> Me too. Until I consider special cases like compress, or lower, and realize that there are enough special cases to become a major wart if generic transforms ever became popular.
>
> People think about these transformations as en- or de-coding, not transforming, most of the time. Even for a transformation that is an involution (eg, rot13), people have a very clear idea of what's encoded and what's not, and they are going to prefer the names encode and decode for these (generic) operations in many cases.

I think this is one of the major stumbling blocks with unicode. I originally disagreed strongly with what Stephen wrote -- but then I realized that all my counterexamples involved unicode text.

I can tell whether something is tarred or untarred, zipped or unzipped. But an 8-bit (even Latin-1, let alone ASCII) bytestring really doesn't seem encoded, and it doesn't make sense to decode a perfectly readable (ASCII) string into a sequence of code units.

Nor does it help that http://www.unicode.org/glossary/#code_unit defines code unit as

    The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. (See definition D77 in Section 3.9, Unicode Encoding Forms.)

I have to read that very carefully to avoid mentally translating it into "Code Units are *en*coded, and there are lots of different complicated encodings that I wouldn't use unless I were doing special processing or interchange." If I'm not using the network, or if my interchange format already looks like readable ASCII, then unicode sure sounds like a complication.
I *will* get confused over which direction is encoding and which is decoding. (Removing .decode() from the (unicode) str type in 3 does help a lot, if I have a Python 3 interpreter running to check against.)

I'm not sure exactly what implications the above has, but it certainly supports separating the Text Processing from the generic codecs, both in the documentation and in any potential new methods.

Instead of relying on introspection of .decodes_to and .encodes_to, it would be useful to have charsetcodecs and transformcodecs as entirely different modules, with their own separate registries. I will even note that the existing help(codecs) seems more appropriate for charsetcodecs than it does for the current conjoined module.

-jJ

--

If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ
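The split between text encodings and generic transforms discussed above is already visible in how Python 3 behaves. The sketch below assumes Python 3.4 or later, where the "rot13" alias is available through codecs.encode() and str.encode() rejects non-text codecs with LookupError; the variable names are illustrative.

```python
import codecs

# In Python 3, text encodes to bytes and bytes decode to text; the
# "wrong direction" methods do not exist.
print(hasattr(str, "decode"))    # False
print(hasattr(bytes, "encode"))  # False

text = "spam"
data = text.encode("ascii")      # str -> bytes
back = data.decode("ascii")      # bytes -> str

# rot13 is a str-to-str transform, so it is reached through the codecs
# module functions rather than through str.encode().
scrambled = codecs.encode(text, "rot13")
round_trip = codecs.encode(scrambled, "rot13")   # involution: twice undoes it
print(scrambled, round_trip)

# str.encode() only accepts text encodings, and rot13 is not one.
refused = False
try:
    text.encode("rot13")
except LookupError as exc:
    refused = True
    print("str.encode refused:", exc)
```

This does not settle whether the two kinds of codec should live in separate modules or registries; it only shows that the str and bytes methods already treat them differently.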
Re: [Python-Dev] Which direction is UnTransform? / Unicode is different
My thought on this for the day, for what it's worth: Anything that doesn't have directions clearly identifiable as encoding and decoding maybe shouldn't be called a codec?

--
Greg