On 2018-06-03 16:36:12 -0700, bellcanada...@gmail.com wrote:
> On Tuesday, 22 May 2018 17:23:55 UTC-4, Peter J. Holzer wrote:
> > On 2018-05-20 15:43:54 +0200, Karsten Hilbert wrote:
> > > On Sun, May 20, 2018 at 04:59:12AM -0700, bellcanada...@gmail.com wrote:
> > > > thank you for the reply, but how exactly am i supposed to find
> > > > oout what is the correct encodeing??
> > >
> > > One CAN NOT.
> > >
> > > The best you can do is to go ask the canonical source of the
> > > file what encoding the file is _supposed_ to be in.
> >
> > I disagree on both counts.
> >
> > 1) For any given file it is almost always possible to find the correct
> >    encoding (or *a* correct encoding, as there may be more than one).
> >
> >    This may require domain-specific knowledge (e.g. it may be necessary
> >    to recognize the human language and know at least some distinctive
> >    words, or to know some special symbols likely to be used in a data
> >    file), and it almost always takes a bit of detective work and trial
> >    and error. But I don't think I ever encountered a file where I
> >    couldn't figure out the encoding.
[...]
>
> hello peter ...how exactly would i solve this issue .....

There is no "exactly" here. Determining the encoding of a file depends
on experience and trial and error. However, I can give you some general
guidelines:

Preparation:

    Make sure you have a way to reliably display files:

    1) As a stream of bytes. On Linux hd works well for this purpose,
       although a hex editor might be even better.
    2) As Unicode text. On Linux, terminal emulators usually use UTF-8
       encoding, so viewing a file with less should be sufficient.

    Beware of programs which try to guess the encoding. They can fool
    you. If you don't have anything which works reliably, you might want
    to have a look at my utf8dump script
    (https://www.hjp.at/programs/utf8dump/ (Perl code, sorry ;-)).

First guess:

    As has already been mentioned, chardet usually does a good job.
    So first let chardet guess the encoding. Then use iconv to convert
    from this encoding to UTF-8 (or any other UTF you can reliably read)
    and open the result in your text reader (preparation step 2 above)
    to check whether it makes sense. If it does, you are done.

Checking other encodings:

    This is where it gets tedious. You could systematically try all
    encodings supported by iconv, but there are a lot of them (over
    1000!). So you should try to narrow it down: What language is the
    file in? On what OS was the file (probably) created? If most of the
    non-ASCII characters are already correct, but a few are wrong, what
    other encodings are there in the same family?

    But however you determined the list of candidate encodings, the
    actual check is the same as above: Use iconv to convert from the
    candidate encoding and check the result for plausibility.

Use the encoding in your program:

    When you are done, open the file in your program with
    open(..., encoding='...'), passing the encoding you determined
    above.

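The "convert and check for plausibility" step can also be sketched in
Python instead of iconv. This is only a rough illustration, not a
finished tool: the function names try_encodings and non_ascii_contexts
are made up for this example, the candidate list is an assumption you
would replace with your own guesses, and the sample bytes are invented.
The code only tells you which candidates decode at all; judging whether
the result is plausible is still up to you.

```python
def try_encodings(data, candidates):
    """Decode the bytes with each candidate encoding.

    Returns a dict mapping encoding name to the decoded text, or to
    None if the bytes are not valid in that encoding (or Python does
    not know the encoding name).
    """
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except (UnicodeDecodeError, LookupError):
            results[name] = None
    return results


def non_ascii_contexts(text, width=10):
    """Return a short snippet around each non-ASCII character.

    These snippets are what you look at to decide whether a
    candidate encoding produced sensible text.
    """
    snippets = []
    for i, ch in enumerate(text):
        if ord(ch) > 127:
            snippets.append(text[max(0, i - width): i + width + 1])
    return snippets


# Invented sample: German text stored as cp1252 bytes.
# For a real file use: data = open(filename, "rb").read()
original = "Grüße aus Österreich"
data = original.encode("cp1252")

results = try_encodings(data, ["utf-8", "cp1252", "cp437"])
for name, text in results.items():
    if text is None:
        print(name, "-> cannot decode")
    else:
        print(name, "->", non_ascii_contexts(text, width=3))
```

Note that a successful decode proves little: single-byte encodings
like cp437 accept any byte sequence, so the wrong candidate simply
yields garbage instead of an error. That is why the snippets have to be
read by a human.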
> i have a script that works in python 2 but not pytho3..i did 2 to 3.py
> ...but i still get the errro...character undefieed..unicode decode
> error cant decode byte 1x09 in line 7414 from cp 1252..like would you
> have a sraright solution answer??..i cant get a straight answer..it
> was ported from ansi to python...so its utf-8 as far asi can see

If it is UTF-8, just open the file with open(filename, encoding="utf-8")
(or open(filename, encoding="utf-8-sig"), if it starts with a BOM).

And follow Steven's advice and read all the stuff he mentioned. It is
important to have a firm understanding of what "character", "byte",
"encoding" etc. mean. If you understand that, the rest is easy
(sometimes tedious, but not difficult). If you don't understand that,
you can only resort to trial and error and will be continuously baffled
by unexpected results.

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | h...@hjp.at        | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
-- https://mail.python.org/mailman/listinfo/python-list