[issue15278] UnicodeDecodeError when readline in codecs.py

2013-01-07 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
resolution:  -> duplicate
stage: patch review -> committed/rejected
status: open -> closed
superseder:  -> UTF-16 incremental decoder doesn't support partial surrogate 
pair

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue15278] UnicodeDecodeError when readline in codecs.py

2012-10-24 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
keywords: +needs review
stage:  -> patch review




[issue15278] UnicodeDecodeError when readline in codecs.py

2012-10-10 Thread Walter Dörwald

Walter Dörwald added the comment:

> >>> codecs.utf_8_decode('\u20ac'.encode('utf8')[:2])
> ('', 0)
>
> Oh... codecs.CODEC_decode are incremental decoders? I completely 
> misunderstood this.

No, those functions are not decoders, they're just helper functions used to 
implement the real incremental decoders. That's why they're undocumented.

Whether codecs.utf_8_decode() returns partial results or raises an exception 
depends on the final argument::

>>> s = '\u20ac'.encode('utf8')[:2]
>>> codecs.utf_8_decode(s, 'strict')
('', 0)
>>> codecs.utf_8_decode(s, 'strict', False)
('', 0)
>>> codecs.utf_8_decode(s, 'strict', True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data

If you look at encodings/utf_8.py you see that the stateless decoder calls 
codecs.utf_8_decode() with final==True::

    def decode(input, errors='strict'):
        return codecs.utf_8_decode(input, errors, True)

so the stateless decoder *will* raise exceptions for partial results. The 
incremental decoder simply passes on the final argument given to its decode() 
method.
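
To illustrate the difference, here is a minimal sketch for Python 3 (the variable names are illustrative only, nothing here is taken from the codecs source). It assumes nothing beyond the public codecs.getincrementaldecoder() API: a truncated sequence is buffered while final is false, and only becomes an error once final=True is passed.

```python
import codecs

euro = '\u20ac'.encode('utf-8')   # three bytes: e2 82 ac

# Incremental decoder: a truncated sequence is buffered, not reported.
dec = codecs.getincrementaldecoder('utf-8')()
first = dec.decode(euro[:2])                # final defaults to False
second = dec.decode(euro[2:], final=True)   # last byte completes the character

# With final=True, truncated input is an error, just as in the
# stateless decoder.
strict_dec = codecs.getincrementaldecoder('utf-8')()
try:
    strict_dec.decode(euro[:2], final=True)
    raised = False
except UnicodeDecodeError:
    raised = True
```

Here first is empty, second holds the euro sign, and raised ends up true.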

--
nosy: +doerwalter




[issue15278] UnicodeDecodeError when readline in codecs.py

2012-10-09 Thread STINNER Victor

STINNER Victor added the comment:

> I don't understand you.

Read my last message, I was wrong.

--




[issue15278] UnicodeDecodeError when readline in codecs.py

2012-10-09 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> Hum no. The bug is an issue in the design of codecs.Stream* classes: 
> incremental decoders and encoders should be used instead of classic 
> decoders/encoders.

I don't understand you. StreamReader and IncrementalDecoder both use the same 
decoder.

class IncrementalDecoder(codecs.BufferedIncrementalDecoder):
    _buffer_decode = codecs.utf_16_le_decode

class StreamReader(codecs.StreamReader):
    decode = codecs.utf_16_le_decode

> I don't want to fix this issue: it's better to move to the io library for the 
> reasons listed in PEP 400.

The bug is in the utf-16 decoder, not in codecs.StreamReader.

--




[issue15278] UnicodeDecodeError when readline in codecs.py

2012-10-09 Thread STINNER Victor

STINNER Victor added the comment:

>>> codecs.utf_8_decode('\u20ac'.encode('utf8')[:2])
('', 0)

Oh... codecs.CODEC_decode are incremental decoders? I completely 
misunderstood this.

"The bug is an issue in the design of codecs.Stream* classes: incremental 
decoders and encoders should be used instead of classic decoders/encoders."

Hum, I suppose that the issue cannot be reproduced with TextIOWrapper, just 
because io.TextIOWrapper and codecs.StreamReader use different buffer sizes.

--




[issue15278] UnicodeDecodeError when readline in codecs.py

2012-10-09 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> This issue may be related to or a duplicate of #11461.

Oh, yes, it is a duplicate. I totally forgot about it and did the work again.

> Only incremental decoders should return partial results. Other decoders are
> strict and (usually) stateless.

Yes, there is an incremental decoder.

> >>> decoder('\u20ac'.encode('utf8')[:2], 'strict')
> 
> UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1:
> unexpected end of data

>>> codecs.utf_8_decode('\u20ac'.encode('utf8')[:2])
('', 0)

--




[issue15278] UnicodeDecodeError when readline in codecs.py

2012-10-09 Thread STINNER Victor

STINNER Victor added the comment:

> This issue may be related to or a duplicate of #11461.

Hum no. The bug is an issue in the design of codecs.Stream* classes: 
incremental decoders and encoders should be used instead of classic 
decoders/encoders.

I don't want to fix this issue: it's better to move to the io library for the 
reasons listed in PEP 400.

--




[issue15278] UnicodeDecodeError when readline in codecs.py

2012-10-09 Thread STINNER Victor

STINNER Victor added the comment:

> with codecs.open('test.txt', 'wb', 'utf-16-le') as fp:

Since Python 2.6, you can use io.open(), which uses the new io library. The io 
library uses TextIOWrapper, which uses an incremental encoder and decoder and 
so handles multibyte encodings (such as UTF-16) correctly.

Said differently, this issue is already fixed in the io library.
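
A rough Python 3 sketch of the same round trip through io.open(), using the text from the original report. The temporary path is only there to keep the example self-contained and is not part of the report:

```python
import io
import os
import tempfile

# Same text as in the original report: one BMP character followed by
# 18 characters that need surrogate pairs in UTF-16.
text = '\u6731' + '\U0002a6a5' * 18

# Hypothetical location for the test file.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

with io.open(path, 'w', encoding='utf-16-le') as fp:
    fp.write(text)

# Iterating line by line goes through TextIOWrapper's incremental decoder.
with io.open(path, 'r', encoding='utf-16-le') as fp:
    lines = list(fp)
```

The file contains no newline, so lines should hold the whole text as a single entry.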

It reminds me that I should propose my PEP 400 again :-)

--




[issue15278] UnicodeDecodeError when readline in codecs.py

2012-10-09 Thread STINNER Victor

STINNER Victor added the comment:

This issue may be related to or a duplicate of #11461.

> For example codecs.utf_16_le_decode(b'\x00\xd8\x00') should return ('', 0), 
> but raises UnicodeDecodeError.

Only incremental decoders should return partial results. Other decoders are 
strict and (usually) stateless.

$ ./python 
>>> import codecs
>>> decoder = codecs.getdecoder('utf8')
>>> decoder('\u20ac'.encode('utf8'), 'strict')
('€', 3)
>>> decoder('\u20ac'.encode('utf8')[:2], 'strict')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: 
unexpected end of data

--




[issue15278] UnicodeDecodeError when readline in codecs.py

2012-10-08 Thread Antoine Pitrou

Changes by Antoine Pitrou :


--
nosy: +haypo




[issue15278] UnicodeDecodeError when readline in codecs.py

2012-10-08 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Here are the patches.

--
keywords: +patch
Added file: http://bugs.python.org/file27495/utf16_partial_decode-3.3.patch
Added file: http://bugs.python.org/file27496/utf16_partial_decode-3.2.patch
Added file: http://bugs.python.org/file27497/utf16_partial_decode-2.7.patch

diff -r 0d41a1b4c4fe Lib/test/test_codecs.py
--- a/Lib/test/test_codecs.py   Mon Oct 08 07:46:11 2012 +0200
+++ b/Lib/test/test_codecs.py   Mon Oct 08 18:26:34 2012 +0300
@@ -330,7 +330,7 @@
 
 def test_partial(self):
 self.check_partial(
-"\x00\xff\u0100\u",
+"\x00\xff\u0100\u\U0001",
 [
 "", # first byte of BOM read
 "", # second byte of BOM read
@@ -352,6 +352,10 @@
 "\x00\xff\u0100",
 "\x00\xff\u0100",
 "\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u\U0001",
 ]
 )
 
@@ -386,7 +390,7 @@
 
 def test_partial(self):
 self.check_partial(
-"\x00\xff\u0100\u",
+"\x00\xff\u0100\u\U0001",
 [
 "",
 "",
@@ -404,6 +408,10 @@
 "\x00\xff\u0100",
 "\x00\xff\u0100",
 "\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u\U0001",
 ]
 )
 
@@ -426,7 +434,7 @@
 
 def test_partial(self):
 self.check_partial(
-"\x00\xff\u0100\u",
+"\x00\xff\u0100\u\U0001",
 [
 "",
 "",
@@ -444,6 +452,10 @@
 "\x00\xff\u0100",
 "\x00\xff\u0100",
 "\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u\U0001",
 ]
 )
 
@@ -494,7 +506,7 @@
 
 def test_partial(self):
 self.check_partial(
-"\x00\xff\u0100\u",
+"\x00\xff\u0100\u\U0001",
 [
 "", # first byte of BOM read
 "", # second byte of BOM read => byteorder known
@@ -506,6 +518,10 @@
 "\x00\xff\u0100",
 "\x00\xff\u0100",
 "\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u\U0001",
 ]
 )
 
@@ -543,7 +559,7 @@
 
 def test_partial(self):
 self.check_partial(
-"\x00\xff\u0100\u",
+"\x00\xff\u0100\u\U0001",
 [
 "",
 "\x00",
@@ -553,6 +569,10 @@
 "\x00\xff\u0100",
 "\x00\xff\u0100",
 "\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u\U0001",
 ]
 )
 
@@ -582,7 +602,7 @@
 
 def test_partial(self):
 self.check_partial(
-"\x00\xff\u0100\u",
+"\x00\xff\u0100\u\U0001",
 [
 "",
 "\x00",
@@ -592,6 +612,10 @@
 "\x00\xff\u0100",
 "\x00\xff\u0100",
 "\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u",
+"\x00\xff\u0100\u\U0001",
 ]
 )
 
@@ -621,7 +645,7 @@
 
 def test_partial(self):
 self.check_partial(
-"\x00\xff\u07ff\u0800\u",
+"\x00\xff\u07ff\u0800\u\U0001",
 [
 "\x00",
 "\x00",
@@ -634,6 +658,10 @@
 "\x00\xff\u07ff\u0800",
 "\x00\xff\u07ff\u0800",
 "\x00\xff\u07ff\u0800\u",
+"\x00\xff\u07ff\u0800\u",
+"\x00\xff\u07ff\u0800\u",
+"\x00\xff\u07ff\u0800\u",
+"\x00\xff\u07ff\u0800\u\U0001",
 ]
 )
 
@@ -812,7 +840,7 @@
 
 def test_partial(self):
 self.check_partial(
-"\ufeff\x00\xff\u07ff\u0800\u",
+"\ufeff\x00\xff\u07ff\u0800\u\U0001",
 [
 "",
 "",
@@ -831,6 +859,10 @@
 "\ufeff\x00\xff\u07f

[issue15278] UnicodeDecodeError when readline in codecs.py

2012-10-08 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

This error happens because the utf16* decoders do not handle partial decoding 
of truncated data properly. An exception is raised if the input data is 
truncated at the second surrogate of a surrogate pair. For example, 
codecs.utf_16_le_decode(b'\x00\xd8\x00') should return ('', 0), but raises 
UnicodeDecodeError.
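
The expected behaviour can be checked with a small Python 3 sketch, assuming an interpreter that already includes the fix. It feeds a surrogate pair to the incremental utf-16-le decoder one byte at a time. The names data, dec and pieces are illustrative only.

```python
import codecs

data = '\U00010000'.encode('utf-16-le')   # one surrogate pair, four bytes

dec = codecs.getincrementaldecoder('utf-16-le')()

# Feed one byte at a time; nothing can be produced until the pair is complete.
pieces = [dec.decode(data[i:i + 1]) for i in range(len(data))]
pieces.append(dec.decode(b'', final=True))

result = ''.join(pieces)
```

With the fix in place, the first three pieces are empty strings and the character only appears once the fourth byte has been fed.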

--
components: +Unicode -Library (Lib)
nosy: +ezio.melotti, serhiy.storchaka
versions: +Python 3.3, Python 3.4 -Python 2.6, Python 3.1




[issue15278] UnicodeDecodeError when readline in codecs.py

2012-10-08 Thread Marcus Gröber

Marcus Gröber added the comment:

I came across this today as well. A short way of summarizing this error seems 
to be:

Reading a file using readline (or "for line in file") fails if both of the 
following conditions are true:

•   A codec for a multi-byte encoding (e.g. UTF-8) is used, and
•   The first line of the file is at least 73 bytes long and contains a 
multi-byte sequence that starts before offset 72 and ends after offset 72

At least for UTF-8 input files, it may be possible to work around this by 
opening the input file without a codec, and then applying decode("utf-8") to 
each line.
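
A minimal Python 3 sketch of that workaround. The file contents and the temporary path are made up for illustration, with a euro sign placed so that it straddles byte offset 72 of the first line:

```python
import os
import tempfile

# Hypothetical test data: the three-byte euro sign occupies offsets 71-73.
first_line = 'x' * 71 + '\u20ac' + ' tail\n'
content = first_line + 'second line\n'

path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with open(path, 'wb') as fp:
    fp.write(content.encode('utf-8'))

# Open without a codec, split into lines on raw bytes, decode each line.
with open(path, 'rb') as fp:
    lines = [raw.decode('utf-8') for raw in fp]
```

This relies on the byte 0x0A never occurring inside a UTF-8 multi-byte sequence, so splitting on raw newlines is safe. The same trick does not carry over to UTF-16, where a newline byte can be part of another character.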

--
nosy: +Marcus.Gröber




[issue15278] UnicodeDecodeError when readline in codecs.py

2012-07-07 Thread lovelylain

New submission from lovelylain :

Here is an example in which `for line in fp` raises UnicodeDecodeError:
# -*- coding: utf-8 -*-
import codecs

text = u'\u6731' + u'\U0002a6a5' * 18
print repr(text)

with codecs.open('test.txt', 'wb', 'utf-16-le') as fp:
    fp.write(text)

with codecs.open('test.txt', 'rb', 'utf-16-le') as fp:
    print repr(fp.read())

with codecs.open('test.txt', 'rb', 'utf-16-le') as fp:
    for line in fp:
        print repr(line)

I read the code in codecs.py:
def read(self, size=-1, chars=-1, firstline=False):

    """ Decodes data from the stream self.stream and returns the
        resulting object.
        ...
        If firstline is true, and a UnicodeDecodeError happens
        after the first line terminator in the input only the first line
        will be returned, the rest of the input will be kept until the
        next call to read().

    """
    ...
    try:
        newchars, decodedbytes = self.decode(data, self.errors)
    except UnicodeDecodeError, exc:
        if firstline:
            newchars, decodedbytes = self.decode(data[:exc.start],
                                                 self.errors)
            lines = newchars.splitlines(True)
            if len(lines)<=1:
                raise
        else:
            raise
    ...

It seems that the firstline argument does not behave as its docstring describes.
I don't know why this argument was added or why the line count is checked.
If it was added so that readline could recover from some decode errors, the 
data read may still contain no EOLs, in which case UnicodeDecodeError is raised 
anyway. Maybe we should write code like the following to support readline in 
codecs:

def read(self, size=-1, chars=-1, autotruncate=False):
    ...
    try:
        newchars, decodedbytes = self.decode(data, self.errors)
    except UnicodeDecodeError, exc:
        if autotruncate and exc.start:
            newchars, decodedbytes = self.decode(data[:exc.start],
                                                 self.errors)
        else:
            raise
    ...
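
The idea can be tried outside of codecs.py with a standalone Python 3 sketch. decode_valid_prefix is a hypothetical helper written for this illustration, not an existing codecs function, and it uses UTF-8 input to keep the example short:

```python
def decode_valid_prefix(data, encoding, errors='strict'):
    """Decode as much of data as possible.

    Returns (text, rest), where rest holds the undecoded trailing bytes.
    Hypothetical helper that mirrors the proposed autotruncate branch.
    """
    try:
        return data.decode(encoding, errors), b''
    except UnicodeDecodeError as exc:
        if not exc.start:
            # Nothing decodable before the error: re-raise, as in the proposal.
            raise
        return data[:exc.start].decode(encoding, errors), data[exc.start:]


# Three ASCII bytes followed by a truncated three-byte sequence.
truncated = b'abc' + '\u20ac'.encode('utf-8')[:2]
decoded, rest = decode_valid_prefix(truncated, 'utf-8')
```

The bytes left in rest would have to be kept and prepended to the next chunk read from the stream, which is what the buffering in StreamReader.read() would be responsible for.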

--
components: Library (Lib)
messages: 164869
nosy: lovelylain
priority: normal
severity: normal
status: open
title: UnicodeDecodeError when readline in codecs.py
type: behavior
versions: Python 2.6, Python 2.7, Python 3.1, Python 3.2
