On Jan 2, 9:34 pm, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > In any case, it goes well beyond the situation that triggered my
> > original question in the first place, that basically was to provide a
> > reasonable check on whether round-tripping a string is successful --
> > this is in the context of a small utility to guess an encoding and to
> > use it to decode a byte string. This utility module was triggered by
> > one that Skip Montanaro had written some time ago, but I wanted to add
> > and combine several ideas and techniques (and support for my usage
> > scenarios) for guessing a string's encoding in one convenient place.
>
> Notice that this algorithm is not capable of detecting the ISO-2022
> encodings - they look like ASCII to this algorithm. This is by design,
> as the encoding was designed to only use 7-bit bytes, so that you can
> safely transport them in Email and such (*)

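The point quoted above can be seen in a small sketch. It uses Python 3 syntax (newer than this thread), and the sample text is only an illustration: an ISO-2022-JP byte string contains nothing but 7-bit bytes, so a strict ASCII decode succeeds and the round-trip check passes, even though the result is not the intended text.

```python
# Arbitrary sample text; any Japanese string would behave the same way.
text = "日本語"
data = text.encode("iso-2022-jp")

# Every byte is below 0x80, so nothing distinguishes the data from ASCII.
all_seven_bit = all(b < 0x80 for b in data)

# A strict ASCII decode succeeds and round-trips byte for byte...
as_ascii = data.decode("ascii")
round_trips = as_ascii.encode("ascii") == data

# ...but only the ISO-2022-JP decode recovers the original text.
recovered = data.decode("iso-2022-jp") == text

print(all_seven_bit, round_trips, recovered)
```

So a successful round trip only shows that an encoding is possible, not that it is the right one.
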
Well, one could specify decode_heuristically(s, enc="iso-2022-jp"), and
that encoding will be checked before ascii or any other encoding in the
list.

> If you want to add support for ISO-2022, you should look for escape
> characters, and then check whether the escape sequences are among
> the ISO-2022 ones:
> - ESC ( - 94-character graphic character set, G0
> - ESC ) - 94-character graphic character set, G1
> - ESC * - 94-character graphic character set, G2
> - ESC + - 94-character graphic character set, G3
> - ESC - - 96-character graphic character set, G1
> - ESC . - 96-character graphic character set, G2
> - ESC / - 96-character graphic character set, G3
> - ESC $ - Multibyte
>    ( G0
>    ) G1
>    * G2
>    + G3
> - ESC % - Non-ISO-2022 (e.g. UTF-8)
>
> If you see any of these, it should be ISO-2022; see
> the Wiki page as to what subset may be in use.
>
> G0..G3 means what register the character set is loaded
> into; when you have loaded a character set into a register,
> you can switch between registers through ^N (to G1),
> ^O (to G0), ESC n (to G2), ESC o (to G3) (*)

OK, suppose we do not know that the string is likely to be iso-2022,
but we still want to detect it if it is. I have added a "may_do_better"
mechanism to the algorithm, to add special checks on a *guessed*
encoding. I am not sure, however, that this will not introduce more
(or other) problems than the one it is addressing...

I have re-instated checks for iso-8859-1 control chars (likely to be
cp1252), for special symbols in iso-8859-15 when they occur in
iso-8859-1 and cp1252, and for the iso-2022-jp escape sequences.
Fleshing this out with other checks is mechanical work...

If you could take a look at the updated page:

> >http://gizmojo.org/code/decodeh/

I still have issues with what happens when, for example, a file
contains iso-2022 escape sequences but is nevertheless actually in
ascii or utf-8, e.g. this mail message! I'll let this issue turn over
for a little while...

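For reference, the escape-sequence check from the quoted list could be sketched roughly as below. This is only a minimal sketch in Python 3 syntax, not the code on the decodeh page; the names ISO2022_ESCAPE and looks_like_iso2022 are hypothetical, and the pattern covers only the sequences listed in the quote.

```python
import re

# ESC followed by one of the bytes from the quoted list:
#   ( ) * + - . /   single-byte graphic character sets (G0..G3)
#   $               multibyte set, optionally followed by ( ) * + for the register
#   %               switch to a non-ISO-2022 encoding (e.g. UTF-8)
ISO2022_ESCAPE = re.compile(rb"\x1b(?:[()*+\-./]|\$[()*+]?|%)")

def looks_like_iso2022(data: bytes) -> bool:
    """Return True if data contains one of the escape sequences listed above.

    Hypothetical helper: a match is only a hint, not proof of the encoding.
    """
    return ISO2022_ESCAPE.search(data) is not None

print(looks_like_iso2022("日本語".encode("iso-2022-jp")))  # contains ESC $ ...
print(looks_like_iso2022(b"plain ascii text"))
```

Other uses of ESC, such as terminal colour codes (ESC [ ...), do not match this pattern. A match still cannot tell whether the bytes were meant as ISO-2022 or merely happen to contain such a sequence, which is the open issue described above.
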
> > I will be very interested in any remarks any of you may have!
>
> From a shallow inspection, it looks right. I would have spelled
> "losses" as "loses".

Yes, corrected. Thanks,

mario