On 2018-05-23 08:43:02 +1000, Chris Angelico wrote:
> On Wed, May 23, 2018 at 8:31 AM, Peter J. Holzer <hjp-pyt...@hjp.at> wrote:
> > On 2018-05-23 07:38:27 +1000, Chris Angelico wrote:
> >> > 1) For any given file it is almost always possible to find the correct
> >> > encoding (or *a* correct encoding, as there may be more than one).
> >>
> >> You can find an encoding which is capable of decoding a file. That's
> >> not the same thing.
> >
> > If the result is correct, it is the same thing.
> >
> > If I have an input file
> >
> >     4c 69 65 62 65 20 47 72 fc df 65 0a
> >
> > and I decode it correctly to
> >
> >     Liebe Grüße
> >
> > it doesn't matter whether I used ISO-8859-1 or ISO-8859-2. The mapping
> > for all bytes in the input file is the same in both encodings.
>
> Sure, but if you try it as ISO-8859-5 or -7, you won't get an error,
> but you also won't get that string. So it DOES matter.
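
This is easy to try out. Here is a minimal sketch that decodes the twelve bytes from the example with each of the four encodings and lists the words that mix scripts. The names DATA, ENCODINGS, scripts and report are made up for this illustration, and taking the first word of a character's Unicode name as its "script" is only a rough stand-in, since the standard library's unicodedata module has no real script property:

```python
import unicodedata

# The twelve bytes from the example quoted above.
DATA = bytes.fromhex("4c 69 65 62 65 20 47 72 fc df 65 0a")

# The four encodings mentioned in this thread.
ENCODINGS = ["iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7"]


def scripts(word):
    """Return the set of script names used by the letters of `word`.

    Crude approximation (an assumption of this sketch): the first word
    of each character's Unicode name, e.g. LATIN, CYRILLIC or GREEK.
    """
    return {unicodedata.name(ch).split()[0] for ch in word if ch.isalpha()}


def report(raw, encodings):
    """Decode `raw` with each encoding and collect mixed-script words."""
    results = {}
    for enc in encodings:
        text = raw.decode(enc)
        # A word is suspicious if its letters come from more than one script.
        mixed = [w for w in text.split() if len(scripts(w)) > 1]
        results[enc] = (text.strip(), mixed)
    return results


for enc, (text, mixed) in report(DATA, ENCODINGS).items():
    print(f"{enc}: {text!r}  mixed-script words: {mixed}")
```

None of the four decodes raises an error, so the only way to tell them apart is to look at the resulting text.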

I get

    Liebe Grќпe

or

    Liebe Grόίe

which I can immediately recognize as wrong: they mix Cyrillic or Greek
letters, respectively, with Latin letters in the same word, which
doesn't happen in any natural language.

Of course "Grќпe" could be a nickname in an online forum (I've seen
stranger names than that), but since "Liebe Grüße" is a common German
phrase it is much, much more likely to be the correct interpretation.

Also, a real file will usually contain more than two words. So if the
text is German, it will contain more words with umlauts, and each byte
which is part of a correctly spelled German word when interpreted
according to ISO-8859-1 increases the probability that decoding with
ISO-8859-1 will produce the correct result. There remains a tiny
probability that all those matches are mere coincidence, but I wrote
"almost always", not "always", so I can live with an error rate of
0.000001% (or something like that).

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | h...@hjp.at        | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
-- https://mail.python.org/mailman/listinfo/python-list