Re: [Tutor] Encoding error when reading text files in Python 3

2012-07-28 Thread Dat Huynh
I changed my code and it runs on Python 3 now.

    f = open(rootdir+file, 'rb')
    data = f.read().decode('utf8', 'ignore')
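
For anyone finding this thread later, here is a rough sketch of how that change might look inside the whole loop. The function name read_all_text is made up for illustration, and using os.path.join(subdir, name) rather than rootdir+file is my assumption, so that files in subdirectories are found too. The 'ignore' handler silently drops any bytes that are not valid UTF-8.

```python
import os

def read_all_text(rootdir, encoding="utf8", errors="ignore"):
    """Return a dict mapping each file path under rootdir to its decoded text."""
    texts = {}
    for subdir, dirs, files in os.walk(rootdir):
        for name in files:
            # Join with the directory os.walk is currently visiting,
            # so files in nested folders resolve correctly.
            path = os.path.join(subdir, name)
            # Read raw bytes, then decode explicitly.
            with open(path, "rb") as f:
                texts[path] = f.read().decode(encoding, errors)
    return texts
```

Calling read_all_text(rootdir) and printing the values should give the same output as the original loop, minus the undecodable bytes.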

Thank you very much.
Sincerely,
Dat.




On Sat, Jul 28, 2012 at 6:09 PM, Steven D'Aprano  wrote:
> Dat Huynh wrote:
>>
>> Dear all,
>>
>> I have written a simple application in Python to read data from text
>> files.
>>
>> Currently I have both Python version 2.7.2 and Python 3.2.3 on my laptop.
>> I don't know why it does not run on Python version 3 while it runs
>> well on Python 2.
>
>
> Python 2 is more forgiving of beginner errors when dealing with text and
> bytes, but makes it harder to deal with text correctly.
>
> Python 3 makes it easier to deal with text correctly, but is less forgiving.
>
> When you read from a file in Python 2, it will give you *something*, even if
> it is the wrong thing. It will not give a decoding error, even if the text
> you are reading is not valid text. It will just give you junk bytes,
> sometimes known as mojibake.
>
> Python 3 no longer does that. It tells you when there is a problem, so you
> can fix it.
>
>
>
>> Could you please tell me how I can run it on python 3?
>> Following is my Python code.
>>
>>  --
>> for subdir, dirs, files in os.walk(rootdir):
>>     for file in files:
>>         print("Processing [" +file +"]...\n" )
>>         f = open(rootdir+file, 'r')
>>         data = f.read()
>>         f.close()
>>         print(data)
>> --
>>
>> This is the error message:
>
> [...]
>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position
>> 4980: ordinal not in range(128)
>
>
>
> This tells you that you are reading a non-ASCII file but haven't told Python
> what encoding to use, so Python falls back on your platform's default
> encoding, which here is ASCII.
>
> Do you know what encoding the file is?
>
> Do you understand about Unicode text and bytes? If not, I suggest you read
> this article:
>
> http://www.joelonsoftware.com/articles/Unicode.html
>
>
> In Python 3, you can either tell Python what encoding to use:
>
> f = open(rootdir+file, 'r', encoding='utf8')  # for example
>
> or you can set an error handler:
>
> f = open(rootdir+file, 'r', errors='ignore')  # for example
>
> or both
>
> f = open(rootdir+file, 'r', encoding='ascii', errors='replace')
>
>
> You can see the list of encodings and error handlers here:
>
> http://docs.python.org/py3k/library/codecs.html
>
>
> Unfortunately, Python 2 does not support this using the built-in open
> function. Instead, you have to use codecs.open in place of the built-in
> open, like this:
>
> import codecs
> f = codecs.open(rootdir+file, 'r', encoding='utf8')  # for example
>
> which fortunately works in both Python 2 and 3.
>
>
> Or you can read the file in binary mode, and then decode it into text:
>
> f = open(rootdir+file, 'rb')
> data = f.read()
> f.close()
> text = data.decode('cp866', 'replace')
> print(text)
>
>
> If you don't know the encoding, you can try opening the file in Firefox or
> Internet Explorer and see if they can guess it, or you can use the chardet
> library in Python.
>
> http://pypi.python.org/pypi/chardet
>
> Or if you don't care about getting mojibake, you can pretend that the file
> is encoded using Latin-1. That will pretty much read anything, although what
> it gives you may be junk.
>
>
>
> --
> Steven
>
> ___
> Tutor maillist  -  Tutor@python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Encoding error when reading text files in Python 3

2012-07-28 Thread Steven D'Aprano

Dat Huynh wrote:

> Dear all,
>
> I have written a simple application in Python to read data from text files.
>
> Currently I have both Python version 2.7.2 and Python 3.2.3 on my laptop.
> I don't know why it does not run on Python version 3 while it runs
> well on Python 2.


Python 2 is more forgiving of beginner errors when dealing with text and 
bytes, but makes it harder to deal with text correctly.


Python 3 makes it easier to deal with text correctly, but is less forgiving.

When you read from a file in Python 2, it will give you *something*, even if
it is the wrong thing. It will not give a decoding error, even if the text
you are reading is not valid text. It will just give you junk bytes, sometimes
known as mojibake.


Python 3 no longer does that. It tells you when there is a problem, so you can 
fix it.




> Could you please tell me how I can run it on python 3?
> Following is my Python code.
>
> --
> for subdir, dirs, files in os.walk(rootdir):
>     for file in files:
>         print("Processing [" +file +"]...\n" )
>         f = open(rootdir+file, 'r')
>         data = f.read()
>         f.close()
>         print(data)
> --
>
> This is the error message:

[...]

> UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position
> 4980: ordinal not in range(128)



This tells you that you are reading a non-ASCII file but haven't told Python
what encoding to use, so Python falls back on your platform's default
encoding, which here is ASCII.
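
If you want to see which encoding open() will pick on a particular machine when you don't pass one, something like the following should show it. This is a small sketch, assuming the usual behaviour where text-mode open() consults the locale settings; the variable name default_encoding is just for illustration.

```python
import locale
import sys

# The encoding text-mode open() falls back on when encoding= is omitted.
default_encoding = locale.getpreferredencoding(False)
print("open() default encoding:", default_encoding)

# A different setting, easy to confuse with the one above: the default
# used by str.encode() and bytes.decode() when no codec name is given.
print("sys.getdefaultencoding():", sys.getdefaultencoding())
```

If the first line reports an ASCII-flavoured name, any byte above 0x7f in the file will trigger the error shown above unless you pass encoding= yourself.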


Do you know what encoding the file is?

Do you understand about Unicode text and bytes? If not, I suggest you read 
this article:


http://www.joelonsoftware.com/articles/Unicode.html


In Python 3, you can either tell Python what encoding to use:

f = open(rootdir+file, 'r', encoding='utf8')  # for example

or you can set an error handler:

f = open(rootdir+file, 'r', errors='ignore')  # for example

or both

f = open(rootdir+file, 'r', encoding='ascii', errors='replace')
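
To make the difference between the error handlers concrete, here is a small sketch that writes a throwaway file containing the byte 0xd1 and reads it back three ways. The helper name read_text and the sample bytes are invented for the demonstration.

```python
import os
import tempfile

def read_text(path, encoding, errors="strict"):
    """Open path in text mode with an explicit encoding and error handler."""
    with open(path, "r", encoding=encoding, errors=errors) as f:
        return f.read()

# Create a temporary file holding one byte that is not valid ASCII.
fd, sample = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"abc\xd1def")

try:
    read_text(sample, "ascii")  # "strict" is the default handler
except UnicodeDecodeError as exc:
    print("strict: ", exc)

# "ignore" drops the offending byte; "replace" substitutes U+FFFD.
print("ignore: ", repr(read_text(sample, "ascii", "ignore")))
print("replace:", repr(read_text(sample, "ascii", "replace")))

os.remove(sample)
```

Both 'ignore' and 'replace' lose information, so they are best kept for cases where you really cannot find out the right encoding.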


You can see the list of encodings and error handlers here:

http://docs.python.org/py3k/library/codecs.html


Unfortunately, Python 2 does not support this using the built-in open
function. Instead, you have to use codecs.open in place of the built-in open,
like this:


import codecs
f = codecs.open(rootdir+file, 'r', encoding='utf8')  # for example

which fortunately works in both Python 2 and 3.
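
As an aside, and as an alternative that the message above does not mention: io.open is available from Python 2.6 onwards and accepts the same encoding= and errors= arguments as the built-in open in Python 3. A minimal sketch, with a made-up helper name:

```python
import io

def read_text_portable(path, encoding="utf8", errors="strict"):
    """Read a text file with the same call on Python 2.6+ and Python 3."""
    # io.open takes encoding= and errors= on both major versions and
    # returns decoded (Unicode) text from read().
    with io.open(path, "r", encoding=encoding, errors=errors) as f:
        return f.read()
```

Which of the two to use is mostly a matter of taste for code that has to run on both versions.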


Or you can read the file in binary mode, and then decode it into text:

f = open(rootdir+file, 'rb')
data = f.read()
f.close()
text = data.decode('cp866', 'replace')
print(text)


If you don't know the encoding, you can try opening the file in Firefox or 
Internet Explorer and see if they can guess it, or you can use the chardet 
library in Python.


http://pypi.python.org/pypi/chardet
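
A rough sketch of how a guess might be obtained in code. It uses chardet.detect() when the third-party chardet package is installed; otherwise it falls back on a crude stand-in of my own (trying a short list of codecs in order), which is not what chardet does internally. The function name guess_encoding and the fallback list are assumptions for illustration.

```python
def guess_encoding(data, fallbacks=("utf-8", "latin-1")):
    """Return a plausible encoding name for the bytes in data, or None."""
    try:
        import chardet  # third-party package; may not be installed
    except ImportError:
        chardet = None

    if chardet is not None:
        # detect() returns a dict; its "encoding" entry can be None.
        guess = chardet.detect(data)["encoding"]
        if guess:
            return guess

    # Crude fallback: the first codec that decodes without an error.
    for name in fallbacks:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

Typical use would be to read the file in binary mode, pass the bytes to guess_encoding, and then decode with whatever name comes back. Treat the answer as a guess: short or unusual files can easily be misidentified.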

Or if you don't care about getting mojibake, you can pretend that the file is
encoded using Latin-1. That will pretty much read anything, although what it
gives you may be junk.
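
Here is a short illustration of why Latin-1 never fails and what the junk looks like. The sample string is invented for the demonstration.

```python
original = "caf\u00e9"            # the word "café"
raw = original.encode("utf-8")    # stored on disk as UTF-8 bytes

# Decoding as Latin-1 never raises, because every byte value from 0 to 255
# maps to exactly one character...
assert len(bytes(range(256)).decode("latin-1")) == 256

# ...but multi-byte UTF-8 sequences come out as several wrong characters.
wrong = raw.decode("latin-1")
print(repr(wrong))

# The damage is reversible: re-encoding as Latin-1 restores the raw bytes,
# which can then be decoded with the right codec.
recovered = wrong.encode("latin-1").decode("utf-8")
print(recovered == original)
```

So Latin-1 is a reasonable last resort when you only need the ASCII parts of a file, or when you want to pass the bytes through untouched and sort out the encoding later.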




--
Steven


[Tutor] Encoding error when reading text files in Python 3

2012-07-28 Thread Dat Huynh
Dear all,

I have written a simple application in Python to read data from text files.

Currently I have both Python version 2.7.2 and Python 3.2.3 on my laptop.
I don't know why it does not run on Python version 3 while it runs
well on Python 2.

Could you please tell me how I can run it on python 3?
Following is my Python code.

 --
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        print("Processing [" +file +"]...\n" )
        f = open(rootdir+file, 'r')
        data = f.read()
        f.close()
        print(data)
--

This is the error message:

--
Traceback (most recent call last):
  File "/Users/dathuynh/Documents/workspace/PyTest/MyParser.py", line 53, in
    main()
  File "/Users/dathuynh/Documents/workspace/PyTest/MyParser.py", line 20, in main
    data = f.read()
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 4980: ordinal not in range(128)
--

Thank you very much for your help.

Sincerely,
Dat Huynh.