Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to

Peter J. Holzer Tue, 29 May 2018 05:07:28 -0700

On 2018-05-29 21:13:43 +1000, Chris Angelico wrote:
> You can always solve a subset of problems. Using your own knowledge of
> German, you are able to better solve problems involving German text.
> But that doesn't make you any better than chardet at validating
> Chinese text, or Korean text, or Klingon text, or any other language
> you don't know.


But I don't have to. Chardet has to be reasonably good at identifying
any encoding. I only have to be good at identifying the encoding of
files which I need to import (or otherwise process.).

Please go back to the original posting. The poster has one file which he
wants to read, and asked how to determine the encoding. He was told
categorically that this is impossible and he must ask the source.

THIS is what I'm responding to, not the problem of finding a generic
solution which works for every possible file.

The OP has one file. He wants to read it. The very fact that he wants to
read this particular file makes it very likely that he knows something
about the contents of the file. So he has domain knowledge. Which makes
it very likely that he can distinguish a correct from an incorrect
decoding. He probably can't distinguish Korean poetry from a Vietnamese
shopping list, but his file probably isn't either.

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | h...@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

signature.asc
Description: PGP signature

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to

Reply via email to