Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined>

Peter J. Holzer Mon, 04 Jun 2018 03:29:30 -0700

On 2018-06-03 16:36:12 -0700, [email protected] wrote:
> On Tuesday, 22 May 2018 17:23:55 UTC-4, Peter J. Holzer  wrote:
> > On 2018-05-20 15:43:54 +0200, Karsten Hilbert wrote:
> > > On Sun, May 20, 2018 at 04:59:12AM -0700, [email protected] wrote:
> > > > thank you for the reply, but how exactly am i supposed to find
> > > > out what is the correct encoding??
> > > 
> > > One CAN NOT.
> > > 
> > > The best you can do is to go ask the canonical source of the
> > > file what encoding the file is _supposed_ to be in.
> > 
> > I disagree on both counts.
> > 
> > 1) For any given file it is almost always possible to find the correct
> >    encoding (or *a* correct encoding, as there may be more than one).
> > 
> >    This may require domain-specific knowledge (e.g. it may be necessary
> >    to recognize the human language and know at least some distinctive
> >    words, or to know some special symbols likely to be used in a data
> >    file), and it almost always takes a bit of detective work and trial
> >    and error. But I don't think I ever encountered a file where I
> >    couldn't figure out the encoding.
[...]
> 
> hello peter ...how exactly would i solve this issue .....


There is no "exactly" here. Determining the encoding of a file depends
on experience and trial and error. However, I can give you some general
guidelines:

Preparation:

     Make sure you have a way to reliably display files:

     1) As a stream of bytes. On Linux hd works well for this purpose,
        although a hex editor might be even better.

     2) As Unicode text. On Linux terminal emulators usually use UTF-8
        encoding, so viewing a file with less should be sufficient.
        Beware of programs which try to guess the encoding. They can
        fool you. If you don't have anything which works reliably you
        might want to have a look at my utf8dump script
        (https://www.hjp.at/programs/utf8dump/ (Perl code, sorry ;-)).
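
If neither hd nor a hex editor is at hand, the byte view from step 1 can be approximated in Python. This is only a minimal sketch: the function name hex_dump and the sample bytes are made up for illustration, and for a real file you would read the data with open(filename, "rb").read().

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return a simple offset / hex / ASCII dump of data."""
    pad = width * 3 - 1  # room for `width` two-digit bytes plus separators
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  |{text_part}|")
    return "\n".join(lines)


# Sample data: "cafe" with an accented e, followed by a word in curly quotes.
sample = "caf\u00e9 \u201cquoted\u201d".encode("utf-8")
print(hex_dump(sample))
```

Because the dump never decodes anything, it cannot be fooled by a wrong guess about the encoding.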

First guess:

    As has already been mentioned, chardet usually does a good job.
    So first let chardet guess the encoding. Then use iconv to convert
    from this encoding to UTF-8 (or any other UTF you can reliably read)
    and open it in your text reader (preparation step 2 above) to check
    whether the result makes sense. If it does, you are done.
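
The conversion step can also be done in Python instead of iconv. The sketch below assumes you already have a guess (for example the one chardet reported); the name convert_to_utf8 and the sample text are hypothetical.

```python
def convert_to_utf8(raw: bytes, guessed_encoding: str) -> bytes:
    """Decode raw with the guessed encoding and re-encode it as UTF-8."""
    # A wrong guess may raise UnicodeDecodeError here, or it may
    # silently produce nonsense, which is why the result must be read.
    text = raw.decode(guessed_encoding)
    return text.encode("utf-8")


# Sample data: a German greeting stored as ISO-8859-1.
raw = "Gr\u00fc\u00dfe aus Wien".encode("iso-8859-1")
guess = "iso-8859-1"  # assumption: this is what the detector reported
converted = convert_to_utf8(raw, guess)
print(converted.decode("utf-8"))
```

If the printed text reads correctly in a UTF-8 terminal, the guess was good.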

Checking other encodings:

    This is where it gets tedious. You could systematically try all
    encodings supported by iconv, but there are a lot of them (over
    1000!). So you should try to narrow it down: What language is the
    file in? On what OS was the file (probably) created? If most of the
    non-ASCII characters are already correct, but a few are wrong, what
    other encodings are there in the same family? But however you
    determined the list of candidate encodings, the actual check is the
    same as above: Use iconv to convert from the candidate encoding and
    check the result for plausibility.
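
The same check can be looped over a short candidate list in Python. This is a sketch under the assumption that you have already narrowed the list down by hand; the name try_encodings, the candidate list and the sample text are illustrative only.

```python
def try_encodings(raw: bytes, candidates):
    """Map each candidate encoding to its decoded text, or None on failure."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None  # this candidate is definitely wrong
    return results


# Sample data: UTF-8 text with an accented letter and curly quotes.
raw = "na\u00efve \u201cquotes\u201d".encode("utf-8")
candidates = ["ascii", "utf-8", "cp1252", "iso-8859-1"]
for name, text in try_encodings(raw, candidates).items():
    print(f"{name:12} {text!r}")
```

Note that a decode without an error proves nothing by itself: iso-8859-1 accepts every byte sequence, so it "succeeds" here but yields garbage. Only reading the output tells you which candidate is plausible.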

Use the encoding in your program:

    When you are done, open the file in your program with open(...,
    encoding='...'), using the encoding you determined above.
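
A small self-contained sketch of that last step follows. The encoding value and the temporary sample file are assumptions made only so the example runs anywhere; in your program you would pass your own filename and the encoding you found.

```python
import os
import tempfile

encoding = "iso-8859-1"  # assumption: the encoding determined above

# Create a small sample file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("Gr\u00fc\u00dfe\n".encode(encoding))
    filename = tmp.name

# Name the encoding explicitly instead of relying on the platform default.
with open(filename, encoding=encoding) as f:
    first_line = f.readline()

os.remove(filename)
print(first_line)
```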


> i have a script that works in python 2 but not python 3..i did 2to3.py
> ...but i still get the error...character undefined..unicode decode
> error cant decode byte 1x09 in line 7414 from cp 1252..like would you
> have a straight solution answer??..i cant get a straight answer..it
> was ported from ansi to python...so its utf-8 as far as i can see

If it is utf-8, just open the file with open(filename, encoding="utf-8") 
(or open(filename, encoding="utf-8-sig"), if it starts with a BOM). 
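
To illustrate, here is a sketch with made-up sample data (a BOM followed by a word in curly quotes; this is not your file). It shows how a UTF-8 file can trigger the 0x9d error when it is read as cp1252, and what utf-8-sig changes.

```python
import codecs

# UTF-8 bytes: a BOM, then a word in curly quotes. The closing quote
# U+201D is encoded as e2 80 9d, and 0x9d is undefined in cp1252.
data = codecs.BOM_UTF8 + "\u201cquoted\u201d".encode("utf-8")

error = None
try:
    data.decode("cp1252")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

plain = data.decode("utf-8")      # BOM survives as U+FEFF
clean = data.decode("utf-8-sig")  # BOM is stripped
print(repr(plain[0]), repr(clean[0]))
```

The same encoding names work as the encoding argument of open().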

And follow Steven's advice and read all the stuff he mentioned. It is
important to have a firm understanding of what "character", "byte",
"encoding" etc. mean. If you understand that, the rest is easy (sometimes
tedious, but not difficult). If you don't understand that, you can only
resort to trial and error and will be continually baffled by unexpected
results.

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | [email protected]         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
