Re: unicode(s, enc).encode(enc) == s ?
On Jan 2, 9:34 pm, Martin v. Löwis [EMAIL PROTECTED] wrote:

> > In any case, it goes well beyond the situation that triggered my
> > original question in the first place, which was basically to provide
> > a reasonable check on whether round-tripping a string is successful
> > -- this is in the context of a small utility to guess an encoding
> > and to use it to decode a byte string. This utility module was
> > triggered by one that Skip Montanaro had written some time ago, but
> > I wanted to add and combine several ideas and techniques (and
> > support for my usage scenarios) for guessing a string's encoding in
> > one convenient place.
>
> Notice that this algorithm is not capable of detecting the ISO-2022
> encodings - they look like ASCII to this algorithm. This is by design,
> as the encoding was designed to use only 7-bit bytes, so that you can
> safely transport it in email and such (*)

Well, one could specify decode_heuristically(s, enc="iso-2022-jp"), and
that encoding will be checked before ascii or any other encoding in the
list.

> If you want to add support for ISO-2022, you should look for escape
> characters, and then check whether the escape sequences are among the
> ISO-2022 ones:
>
> - ESC ( - 94-character graphic character set, G0
> - ESC ) - 94-character graphic character set, G1
> - ESC * - 94-character graphic character set, G2
> - ESC + - 94-character graphic character set, G3
> - ESC - - 96-character graphic character set, G1
> - ESC . - 96-character graphic character set, G2
> - ESC / - 96-character graphic character set, G3
> - ESC $ - Multibyte: ( G0, ) G1, * G2, + G3
> - ESC % - Non-ISO-2022 (e.g. UTF-8)
>
> If you see any of these, it should be ISO-2022; see the Wiki page as
> to what subset may be in use. G0..G3 means which register the
> character set is loaded into; when you have loaded a character set
> into a register, you can switch between registers through ^N (to G1),
> ^O (to G0), ESC n (to G2), ESC o (to G3) (*)

OK, suppose we do not know the string is likely to be iso-2022, but we
still want to detect it if it is.
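For reference, the guess-and-decode idea discussed above can be sketched roughly as follows, in Python 3 syntax (a simplification for illustration only; decodeh's actual implementation at the URL below differs, and the parameter names are made up):

```python
def decode_heuristically(s, enc=None, denc="utf-8"):
    # Try a caller-supplied encoding first, then a list of common
    # candidates; the first one that decodes cleanly wins. This is a
    # simplified sketch of the idea, not decodeh's actual code.
    candidates = [enc, "ascii", denc, "iso-8859-1"]
    seen = set()
    for e in candidates:
        if e is None or e in seen:
            continue
        seen.add(e)
        try:
            return s.decode(e), e
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: decode with replacement characters.
    return s.decode(denc, "replace"), denc
```

Passing enc="iso-2022-jp" simply moves that encoding to the front of the candidate list, which is why it gets checked before ascii.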
I have added a "may_do_better" mechanism to the algorithm, to add
special checks on a *guessed* encoding. I am not sure, however, that
this will not introduce more or other problems than the one it is
addressing... I have re-instated checks for iso-8859-1 control chars
(likely to be cp1252), for special symbols in iso-8859-15 when they
occur in iso-8859-1 and cp1252, and for the iso-2022-jp escape
sequences. Fleshing this out with other checks is mechanical work...

If you could take a look at the updated page:
http://gizmojo.org/code/decodeh/

I still have an issue with what happens in situations where, for
example, a file contains iso-2022 escape sequences but is actually in
ascii or utf-8 anyway -- e.g. this mail message! I'll let this issue
turn for a little while... I will be very interested in any remarks any
of you may have!

> From a shallow inspection, it looks right. I would have spelled
> "losses" as "loses".

Yes, corrected. Thanks, mario

--
http://mail.python.org/mailman/listinfo/python-list
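The iso-8859-1 vs cp1252 check mentioned above comes down to looking for bytes in the C1 range (0x80-0x9F): under iso-8859-1 these decode as rarely intended control characters, while under cp1252 most of them are printable punctuation. A sketch in Python 3 syntax (the helper name is made up, not from decodeh):

```python
def c1_controls_present(data):
    # Bytes 0x80-0x9F decode as C1 control characters under
    # iso-8859-1, which is almost never what the author meant; under
    # cp1252 most of them are printable punctuation such as curly
    # quotes, so their presence suggests cp1252 over iso-8859-1.
    return any(0x80 <= byte <= 0x9F for byte in data)
```

If such bytes are present and the guess was iso-8859-1, re-guessing as cp1252 is usually the safer choice.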
Re: unicode(s, enc).encode(enc) == s ?
Thanks again. I will chunk my responses, as your message has too much in
it for me to process all at once...

On Jan 2, 9:34 pm, Martin v. Löwis [EMAIL PROTECTED] wrote:

> > Thanks a lot Martin and Marc for the really great explanations! I
> > was wondering if it would be reasonable to imagine a utility that
> > will determine whether, for a given encoding, two byte strings
> > would be equivalent.
>
> But that is much easier to answer:
>
>     s1.decode(enc) == s2.decode(enc)
>
> Assuming Unicode's unification, for a single encoding, this should
> produce correct results in all cases I'm aware of. If you also have
> different encodings, you should add
>
>     def normal_decode(s, enc):
>         return unicodedata.normalize("NFKD", s.decode(enc))
>
>     normal_decode(s1, enc) == normal_decode(s2, enc)
>
> This would flatten out compatibility characters, and ambiguities left
> in Unicode itself.

Hmmn, true, it would be that easy. I am now not sure why I needed that
check, or how to use this version of it... I am always starting from
one string, and decoding it... that may be lossy when the result is
re-encoded and compared to the original. However, it is clear that the
test above should always pass in this case, so doing it seems
superfluous.

Thanks for the unicodedata.normalize() tip.

mario
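In Python 3 terms (where byte strings have a .decode method and the unicode builtin is gone), the comparison Martin suggests can be exercised directly; the combining-accent example is an illustration, not from the thread:

```python
import unicodedata

def normal_decode(s, enc):
    # Decode bytes with the given encoding, then flatten compatibility
    # characters and composed forms with NFKD so differently-encoded
    # but equivalent texts compare equal.
    return unicodedata.normalize("NFKD", s.decode(enc))

# U+00E9 (precomposed e-acute) vs "e" + U+0301 (combining acute):
# different byte sequences, the same text after normalization.
assert normal_decode(b"\xc3\xa9", "utf-8") == normal_decode(b"e\xcc\x81", "utf-8")
```

Note that it must be unicodedata.normalize, from the stdlib unicodedata module; there is no unicode.normalize.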
Re: unicode(s, enc).encode(enc) == s ?
Thanks a lot Martin and Marc for the really great explanations!

I was wondering if it would be reasonable to imagine a utility that will
determine whether, for a given encoding, two byte strings would be
equivalent. But I think such a utility would require *extensive*
knowledge about many bizarreries of many encodings -- and has little
chance of being pretty!

In any case, it goes well beyond the situation that triggered my
original question in the first place, which was basically to provide a
reasonable check on whether round-tripping a string is successful --
this is in the context of a small utility to guess an encoding and to
use it to decode a byte string. This utility module was triggered by
one that Skip Montanaro had written some time ago, but I wanted to add
and combine several ideas and techniques (and support for my usage
scenarios) for guessing a string's encoding in one convenient place.

I provide a write-up and the code for it here:
http://gizmojo.org/code/decodeh/

I will be very interested in any remarks any of you may have!

Best regards, mario
Re: unicode(s, enc).encode(enc) == s ?
> Thanks a lot Martin and Marc for the really great explanations! I was
> wondering if it would be reasonable to imagine a utility that will
> determine whether, for a given encoding, two byte strings would be
> equivalent.

But that is much easier to answer:

    s1.decode(enc) == s2.decode(enc)

Assuming Unicode's unification, for a single encoding, this should
produce correct results in all cases I'm aware of. If you also have
different encodings, you should add

    def normal_decode(s, enc):
        return unicodedata.normalize("NFKD", s.decode(enc))

    normal_decode(s1, enc) == normal_decode(s2, enc)

This would flatten out compatibility characters, and ambiguities left
in Unicode itself.

> But I think such a utility would require *extensive* knowledge about
> many bizarreries of many encodings -- and has little chance of being
> pretty!

See above.

> In any case, it goes well beyond the situation that triggered my
> original question in the first place, which was basically to provide
> a reasonable check on whether round-tripping a string is successful
> -- this is in the context of a small utility to guess an encoding and
> to use it to decode a byte string. This utility module was triggered
> by one that Skip Montanaro had written some time ago, but I wanted to
> add and combine several ideas and techniques (and support for my
> usage scenarios) for guessing a string's encoding in one convenient
> place.

Notice that this algorithm is not capable of detecting the ISO-2022
encodings - they look like ASCII to this algorithm.
This is by design, as the encoding was designed to use only 7-bit
bytes, so that you can safely transport it in email and such (*)

If you want to add support for ISO-2022, you should look for escape
characters, and then check whether the escape sequences are among the
ISO-2022 ones:

- ESC ( - 94-character graphic character set, G0
- ESC ) - 94-character graphic character set, G1
- ESC * - 94-character graphic character set, G2
- ESC + - 94-character graphic character set, G3
- ESC - - 96-character graphic character set, G1
- ESC . - 96-character graphic character set, G2
- ESC / - 96-character graphic character set, G3
- ESC $ - Multibyte: ( G0, ) G1, * G2, + G3
- ESC % - Non-ISO-2022 (e.g. UTF-8)

If you see any of these, it should be ISO-2022; see the Wiki page as to
what subset may be in use. G0..G3 means which register the character
set is loaded into; when you have loaded a character set into a
register, you can switch between registers through ^N (to G1), ^O (to
G0), ESC n (to G2), ESC o (to G3) (*)

> http://gizmojo.org/code/decodeh/
>
> I will be very interested in any remarks any of you may have!

From a shallow inspection, it looks right. I would have spelled
"losses" as "loses".

Regards, Martin

(*) For completeness: ISO-2022 also supports 8-bit characters, and
there are more control codes to shift between the various registers.
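Following the table above, a minimal escape-sequence scan might look like this (a sketch in Python 3 syntax, not decodeh's actual code; the function name is made up):

```python
# Bytes that can follow ESC in an ISO-2022 designation sequence, per
# the table above. ESC % announces a non-ISO-2022 encoding (e.g.
# UTF-8), so it is deliberately excluded here.
ISO2022_DESIGNATORS = frozenset(b"()*+-./$")

def looks_like_iso2022(data):
    # data is a byte string; inspect the byte after every ESC.
    idx = data.find(b"\x1b")
    while idx != -1:
        nxt = data[idx + 1:idx + 2]
        if nxt and nxt[0] in ISO2022_DESIGNATORS:
            return True
        idx = data.find(b"\x1b", idx + 1)
    return False
```

Note that this only says the data *could* be ISO-2022: as mario observes later in the thread, an ascii or utf-8 text may legitimately contain such escape sequences (e.g. a mail message discussing them).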
Re: unicode(s, enc).encode(enc) == s ?
On Dec 27, 7:37 pm, Martin v. Löwis [EMAIL PROTECTED] wrote:

> Certainly. ISO-2022 is famous for having ambiguous encodings. Try
> these:
>
>     unicode("Hallo", "iso-2022-jp")
>     unicode("\x1b(BHallo", "iso-2022-jp")
>     unicode("\x1b(JHallo", "iso-2022-jp")
>     unicode("\x1b(BHal\x1b(Jlo", "iso-2022-jp")
>
> or likewise
>
>     unicode([EMAIL PROTECTED], "iso-2022-jp")
>     unicode("\x1b$BBB", "iso-2022-jp")
>
> In iso-2022-jp-3, there are even more ways to encode the same string.

Wow! It's not easy to see why anyone would ever want that. Is there any
logic behind this?

In your samples, both unicode("\x1b(BHallo", "iso-2022-jp") and
unicode("\x1b(JHallo", "iso-2022-jp") give u"Hallo" -- does this mean
that the ignored/lost bytes in the original strings are not illegal but
*represent nothing* in this encoding? I.e. in practice (in a context
limited to the encoding in question), should this be considered as data
loss, or should these strings be considered equivalent?

Thanks! mario
Re: unicode(s, enc).encode(enc) == s ?
On Fri, 28 Dec 2007 03:00:59 -0800, mario wrote:

> In your samples, both unicode("\x1b(BHallo", "iso-2022-jp") and
> unicode("\x1b(JHallo", "iso-2022-jp") give u"Hallo" -- does this mean
> that the ignored/lost bytes in the original strings are not illegal
> but *represent nothing* in this encoding?

They are not lost or ignored, but escape sequences that tell how the
following bytes should be interpreted: '\x1b(B' switches to ASCII and
'\x1b(J' to some roman encoding which is a superset of ASCII, so it
doesn't matter which one you choose as long as the following bytes are
all ASCII. And of course you can use such an escape prefix as often as
you want within a string of ASCII byte values.

http://en.wikipedia.org/wiki/ISO-2022-JP#ISO_2022_Character_Sets

> I.e. in practice (in a context limited to the encoding in question),
> should this be considered as data loss, or should these strings be
> considered equivalent?

Equivalent, I would say. As Unicode they contain the same characters,
just differently encoded as bytes.

Ciao, Marc 'BlackJack' Rintsch
Re: unicode(s, enc).encode(enc) == s ?
> Wow! It's not easy to see why anyone would ever want that. Is there
> any logic behind this?

It's the pre-Unicode solution to the "we want to have many characters
encoded in a single file" problem. Suppose you have pre-defined
character sets A, B, C, and you want text to contain characters from
all three sets; one possible encoding is

    <switch-to-A> characters in A <switch-to-B> characters from B ...

Now also suppose that A, B, and C are not completely different, but
have slight overlap - and you get ambiguous encodings. ISO-2022 works
that way. IPSJ maintains a registry of character sets for ISO, and
assigns escape codes to them. There are currently about 200 character
sets registered. Somebody decoding this would have to know all the
character sets (remember, it's a growing registry), hence iso-2022-jp
restricts the character sets that you can use for that particular
encoding. (Likewise, iso-2022-kr also restricts them, but to a
different set of sets.)

It's a mess, sure, and one of the primary driving forces of Unicode
(which even has the unification - i.e. lack of ambiguity - in its
name).

> In your samples, both unicode("\x1b(BHallo", "iso-2022-jp") and
> unicode("\x1b(JHallo", "iso-2022-jp") give u"Hallo" -- does this mean
> that the ignored/lost bytes in the original strings are not illegal
> but *represent nothing* in this encoding?

See above, and Marc's explanation. ESC ( B switches to ISO 646, USA
Version X3.4 - 1968; ESC ( J to ISO 646, Japanese Version for Roman
Characters JIS C6220-1969. These are identical, except for the
following differences:

- The USA version has reverse solidus at 5/12; the Japanese version,
  yen sign.
- The USA version has tilde (overline; general accent) at 7/14
  (depicted as tilde); the Japanese version, overline (depicted as
  straight overline).
- The Japanese version specifies that you can switch between roman and
  katakana mode by sending shift-out (SO, '\x0e') and shift-in (SI,
  '\x0f') respectively; this switches to the JIS KATAKANA character
  set.
(source: http://www.itscj.ipsj.or.jp/ISO-IR/006.pdf and
http://www.itscj.ipsj.or.jp/ISO-IR/014.pdf )

> I.e. in practice (in a context limited to the encoding in question),
> should this be considered as data loss, or should these strings be
> considered equivalent?

These particular differences should be considered irrelevant. There are
some cases where Unicode has introduced particular compatibility
characters to accommodate such encodings (specifically, the full-width
Latin (*) and half-width Japanese characters). Good codecs are supposed
to round-trip the relevant differences to Unicode, and generate the
appropriate compatibility characters. Bad codecs might not, and in some
cases, users might complain that certain compatibility characters are
lacking in Unicode so that correct round-tripping is not possible. I
believe the Unicode consortium has resolved all these complaints by
adding the missing characters; but I'm not sure.

Regards, Martin

(*) As an example for full-width characters, consider these two
strings: "Hello" and "Ｈｅｌｌｏ". Should they be equivalent, or not?
They are under NFKD, but not NFD.
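The full-width footnote above can be checked directly with the stdlib unicodedata module (Python 3 syntax; a small illustration, not from the original thread):

```python
import unicodedata

ascii_hello = "Hello"
fullwidth_hello = "\uFF28\uFF45\uFF4C\uFF4C\uFF4F"  # fullwidth "Hello"

# NFKD folds compatibility characters, including the fullwidth
# letters, down to their plain equivalents...
assert (unicodedata.normalize("NFKD", ascii_hello)
        == unicodedata.normalize("NFKD", fullwidth_hello))

# ...while NFD performs only canonical decomposition and leaves the
# two strings distinct.
assert (unicodedata.normalize("NFD", ascii_hello)
        != unicodedata.normalize("NFD", fullwidth_hello))
```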
unicode(s, enc).encode(enc) == s ?
I have checks in code to ensure that a decode/encode cycle returns the
original string. Given no UnicodeErrors, are there any cases for the
following not to be True?

    unicode(s, enc).encode(enc) == s

mario
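In Python 3 terms (where the unicode builtin is gone and byte strings are bytes), the check in question can be sketched as a small helper; the name roundtrips is made up for illustration:

```python
def roundtrips(s, enc):
    # s is a byte string; decode it with the given encoding,
    # re-encode the result, and compare against the original bytes.
    return s.decode(enc).encode(enc) == s
```

The rest of the thread explores exactly when this can come out False even though decoding raised no error.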
Re: unicode(s, enc).encode(enc) == s ?
> Given no UnicodeErrors, are there any cases for the following not to
> be True?
>
>     unicode(s, enc).encode(enc) == s

Certainly. ISO-2022 is famous for having ambiguous encodings. Try
these:

    unicode("Hallo", "iso-2022-jp")
    unicode("\x1b(BHallo", "iso-2022-jp")
    unicode("\x1b(JHallo", "iso-2022-jp")
    unicode("\x1b(BHal\x1b(Jlo", "iso-2022-jp")

or likewise

    unicode([EMAIL PROTECTED], "iso-2022-jp")
    unicode("\x1b$BBB", "iso-2022-jp")

In iso-2022-jp-3, there are even more ways to encode the same string.

Regards, Martin
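The first group of examples translates to Python 3 as follows; all four byte strings decode to the same five characters, so at most one of them can survive the round-trip:

```python
variants = [
    b"Hallo",              # no escape sequence at all
    b"\x1b(BHallo",        # explicit switch to ASCII first
    b"\x1b(JHallo",        # explicit switch to JIS-Roman first
    b"\x1b(BHal\x1b(Jlo",  # switch character sets mid-string
]

# Every variant decodes to the same text.
decoded = {v.decode("iso-2022-jp") for v in variants}
assert decoded == {"Hallo"}

# Re-encoding that text yields only the plain form, so the
# escape-prefixed variants fail the round-trip equality check.
assert "Hallo".encode("iso-2022-jp") == b"Hallo"
```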