Changes by Julian Mehnle jul...@mehnle.net:
--
nosy: +jmehnle
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8271
___
___
Python-bugs-list mailing
Changes by Serhiy Storchaka storch...@gmail.com:
Removed file: http://bugs.python.org/file25709/issue8271-3.3.patch
Changes by Serhiy Storchaka storch...@gmail.com:
Removed file: http://bugs.python.org/file26116/issue8271-3.3-fast-2.patch
Changes by Serhiy Storchaka storch...@gmail.com:
--
versions: +Python 3.4 -Python 2.7, Python 3.1, Python 3.2, Python 3.3
Serhiy Storchaka added the comment:
What about committing? All of Ezio's tests passed, and the microbenchmark shows
differences of less than 10%:
vanilla        patched
MB/s           MB/s
2076 (-3%)     2007     decode utf-8 'A'*1
414 (-0%)      413      decode utf-8 '\x80'*1
1283 (-1%)     1275     decode
Roundup Robot added the comment:
New changeset 5962f192a483 by Ezio Melotti in branch '3.3':
#8271: the utf-8 decoder now outputs the correct number of U+FFFD characters
when used with the replace error handler on invalid utf-8 sequences. Patch
by Serhiy Storchaka, tests by Ezio Melotti.
Ezio Melotti added the comment:
Fixed, thanks for updating the patch!
I committed it on 3.3 too, and while this could have gone on 2.7/3.2 too IMHO,
it's too much work to port it there and not worth it.
--
status: open -> closed
versions: +Python 3.3
Serhiy Storchaka added the comment:
Agreed. In 2.7 the UTF-8 codec is still broken in corner cases (it accepts
surrogates) and 3.2 is coming to the end of its maintenance. In any case it is
only a recommendation, not a requirement.
--
Roundup Robot added the comment:
New changeset 96f4cee8ea5e by Victor Stinner in branch '3.3':
Issue #8271: Fix compilation on Windows
http://hg.python.org/cpython/rev/96f4cee8ea5e
New changeset 6f44f33460cd by Victor Stinner in branch 'default':
(Merge 3.3) Issue #8271: Fix compilation on
Changes by Serhiy Storchaka storch...@gmail.com:
Removed file: http://bugs.python.org/file25720/issue8271-3.3-fast.patch
Antoine Pitrou pit...@free.fr added the comment:
Why is this marked fixed? Is it fixed or not?
--
Serhiy Storchaka storch...@gmail.com added the comment:
I deleted the fast patch, since it is unsafe. Issue14923 should compensate for
the small slowdown more safely.
I think this change is not a bugfix (this is not a bug, the standard allows
such behavior), but a new feature, so I doubt the need to fix 2.7
Serhiy Storchaka storch...@gmail.com added the comment:
No, it is not fully fixed. Only one bug was fixed, but the current
behavior still does not conform to the Unicode Standard
*recommendations*. Not conforming to recommendations is not a bug;
conforming is a feature.
--
Serhiy Storchaka storch...@gmail.com added the comment:
Here is an updated, slightly faster patch. It is merged with
decode_utf8_range_check.patch from issue14923.
The patch contains Ezio Melotti's unmodified tests, which all pass.
--
Added file:
Serhiy Storchaka storch...@gmail.com added the comment:
Here is an updated patch that resolves the merge conflict with 3214c9ebcf5e.
--
Added file: http://bugs.python.org/file26118/issue8271-3.3-fast-3.patch
Serhiy Storchaka storch...@gmail.com added the comment:
Here are the benchmark results (numbers are speed, MB/s).
On 32-bit Linux, AMD Athlon 64 X2:
vanilla patched
utf-8 'A'*1 2016 (+5%) 2111
utf-8
Serhiy Storchaka storch...@gmail.com added the comment:
Fortunately, issue14923 (if accepted) will compensate for the slowdown.
On 32-bit Linux, AMD Athlon 64 X2:
vanilla old patch fast patch
utf-8 'A'*1 2016
Serhiy Storchaka storch...@gmail.com added the comment:
Here is a patch for 3.3. All of the tests pass successfully. Unfortunately, it
is a little slow, but I tried to minimize the losses.
--
Added file: http://bugs.python.org/file25709/issue8271-3.3.patch
Ezio Melotti ezio.melo...@gmail.com added the comment:
Do you have any benchmark results?
--
Serhiy Storchaka storch...@gmail.com added the comment:
Looks like issue14738 fixes this bug for Python 3.3.
print(ascii(b"\xc2\x41\x42".decode('utf8', 'replace')))
'\ufffdAB'
print(ascii(b"\xf1ABCD".decode('utf8', 'replace')))
'\ufffdABCD'
--
nosy: +storchaka
Ezio Melotti ezio.melo...@gmail.com added the comment:
The original bug should be fixed already in 3.3 and there should be tests
(unless they got removed/skipped after we changed unicode implementation).
The only issue left was about the number of U+FFFD generated with invalid
sequences in
Serhiy Storchaka storch...@gmail.com added the comment:
The only issue left was about the number of U+FFFD generated with invalid
sequences in some cases.
My last patch has extensive tests for this, so you could try to apply it (or
copy the tests) and see if they all pass.
The tests fail,
Saul Spatz saul.sp...@gmail.com added the comment:
b'\xe0\x80'.decode('utf-8', 'replace') returns one U+FFFD and not two. I
don't think that is right.
I think that one U+FFFD is correct. The only error is a premature end of
data.
On Thu, May 17, 2012 at 12:31 PM, Serhiy Storchaka
Ezio Melotti ezio.melo...@gmail.com added the comment:
The tests fail, but I'm not sure that the tests are correct.
b'\xe0\x00' raises 'unexpected end of data' and not 'invalid
continuation byte'. This is a terminological issue.
This might be just because it first checks if there are two more bytes
Ezio Melotti ezio.melo...@gmail.com added the comment:
Changing from 'unexpected end of data' to 'invalid continuation byte' for
b'\xe0\x00' is fine with me, but this will be a (minor) deviation from 2.7,
3.1, 3.2, and pypy (it could still be changed on all these except 3.1 though).
If you
Serhiy Storchaka storch...@gmail.com added the comment:
I think that one U+FFFD is correct. The only error is a premature end of
data.
I expressed myself poorly. I also think that there is only one decoding error,
and not two. I think the test is wrong.
--
Serhiy Storchaka storch...@gmail.com added the comment:
This might be just because it first checks if there are two more bytes before
checking if they are valid, but 'invalid continuation byte' works too.
Yes, this is an implementation detail. It is much easier and faster. Whether
it is necessary to
Serhiy Storchaka storch...@gmail.com added the comment:
Changing from 'unexpected end of data' to 'invalid continuation byte' for
b'\xe0\x00' is fine with me, but this will be a (minor) deviation from 2.7,
3.1, 3.2, and pypy (it could still be changed on all these except 3.1 though).
I
Ezio Melotti ezio.melo...@gmail.com added the comment:
\xe0\x80 is not a maximal subpart. Therefore, there must be two U+FFFD.
OK, now I get what you mean. The valid range for continuation bytes that can
follow E0 is A0-BF, not 80-BF as usual, so \x80 is not a valid continuation
byte here.
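The restriction on the byte that follows E0 is one row of Table 3-7 in chapter 3 of the Unicode Standard. The following is only a rough sketch of that table, not the code from any patch in this issue; the function names second_byte_range and is_valid_second_byte are made up for illustration.

```python
# Allowed range of the *second* byte for each UTF-8 lead byte,
# following Table 3-7 of the Unicode Standard.
# The function names are hypothetical, chosen for this sketch only.

def second_byte_range(lead):
    """Return (low, high) for the byte that may follow `lead`, or None."""
    if 0xC2 <= lead <= 0xDF:
        return (0x80, 0xBF)
    if lead == 0xE0:
        return (0xA0, 0xBF)   # excludes overlong 3-byte forms
    if lead == 0xED:
        return (0x80, 0x9F)   # excludes surrogates U+D800..U+DFFF
    if 0xE1 <= lead <= 0xEF:
        return (0x80, 0xBF)
    if lead == 0xF0:
        return (0x90, 0xBF)   # excludes overlong 4-byte forms
    if 0xF1 <= lead <= 0xF3:
        return (0x80, 0xBF)
    if lead == 0xF4:
        return (0x80, 0x8F)   # excludes code points above U+10FFFF
    return None  # ASCII, continuation byte, or disallowed byte


def is_valid_second_byte(lead, second):
    """True if `second` may directly follow `lead` in well-formed UTF-8."""
    rng = second_byte_range(lead)
    return rng is not None and rng[0] <= second <= rng[1]


print(is_valid_second_byte(0xE0, 0x80))  # False: 80 cannot follow E0
print(is_valid_second_byte(0xE0, 0xA0))  # True
print(is_valid_second_byte(0xE1, 0x80))  # True: the usual 80-BF range
```

With this table, E0 followed by 80 is rejected at the second byte, which is why E0 alone forms the maximal subpart.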
Ezio Melotti ezio.melo...@gmail.com added the comment:
I probably said it poorly. Past and current implementations raise
'unexpected end of data' and not 'invalid continuation byte'. Test
expects 'invalid continuation byte'.
I don't think it matters much either way.
--
Serhiy Storchaka storch...@gmail.com added the comment:
I don't remember all the details right now, but if that test was passing with
my patch there must be something wrong somewhere (either in the patch, in the
test, or in our understanding of the standard).
No, test correctly expects two
Changes by Stefan Ring stefan...@gmail.com:
--
nosy: +Ringding
Ezio Melotti ezio.melo...@gmail.com added the comment:
Here are some benchmarks:
Commands:
# half of the bytes are invalid
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "surrogateescape")'
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode'
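For readers who prefer to run such a measurement from a script, here is a rough equivalent of the first command using the timeit module. It is a sketch only: the loop count of 10,000 is an arbitrary choice, and the timings it prints are not comparable to the figures reported in this issue.

```python
import timeit

# In this byte order every byte from 0x80 to 0xFF is ill-formed:
# each lead byte is followed by another lead byte, never a continuation.
data = bytes(range(256))

decoded = data.decode("utf-8", "surrogateescape")
# surrogateescape maps each undecodable byte to a lone surrogate, so the
# original bytes can be recovered exactly.
roundtrip = decoded.encode("utf-8", "surrogateescape")

loops = 10_000  # arbitrary loop count for this sketch
seconds = timeit.timeit(
    lambda: data.decode("utf-8", "surrogateescape"), number=loops
)
print(f"{seconds / loops * 1e6:.2f} usec per loop")
```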
Changes by Saul Spatz saul.sp...@gmail.com:
--
nosy: +spatz123
Ezio Melotti ezio.melo...@gmail.com added the comment:
Attached patch against 3.1 fixes the number of FFFD.
A test for the range in the error message should probably be added. I haven't
done any benchmark yet. There's some code duplication, but I'm not sure it can
be factored out.
Marc-Andre Lemburg m...@egenix.com added the comment:
Ezio Melotti wrote:
Ezio Melotti ezio.melo...@gmail.com added the comment:
The patch turned out to be less trivial than I initially thought.
The current algorithm checks for invalid continuation bytes in 4 places:
1) before the
Ezio Melotti ezio.melo...@gmail.com added the comment:
The patch turned out to be less trivial than I initially thought.
The current algorithm checks for invalid continuation bytes in 4 places:
1) before the switch/case statement in Objects/unicodeobject.c when it checks
if there are enough
Ezio Melotti ezio.melo...@gmail.com added the comment:
After a mail I sent to the Unicode Consortium about the corner case I found,
they updated the Best Practices for Using U+FFFD[0] and now it says:
Another example illustrates the application of the concept of maximal subpart
for UTF-8
Changes by Alexander Belopolsky belopol...@users.sourceforge.net:
--
nosy: +belopolsky
John Machin sjmac...@users.sourceforge.net added the comment:
About the E0 80 81 61 problem: my interpretation is that you are correct, the
80 is not valid in the current state (start byte == E0), so no look-ahead,
three FFFDs must be issued followed by 0061. I don't really care about issuing
Ezio Melotti ezio.melo...@gmail.com added the comment:
I've found a subtle corner case about 3- and 4-bytes long sequences.
For example, according to http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf
(pages 94-95, table 3.7) the sequences in range \xe0\x80\x80-\xe0\x9f\xbf are
invalid.
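One way to see why that range is invalid: combining the payload bits of those bytes without any checks gives code points that already fit in one or two bytes, so the three-byte forms would be overlong. The sketch below illustrates this; naive_decode_3 is a hypothetical helper, not a function from the codec.

```python
def naive_decode_3(b0, b1, b2):
    """Combine the payload bits of a 3-byte sequence with no validity checks."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


low = naive_decode_3(0xE0, 0x80, 0x80)
high = naive_decode_3(0xE0, 0x9F, 0xBF)
print(hex(low), hex(high))  # 0x0 0x7ff

# U+07FF already has a shorter, two-byte encoding.
shortest = chr(high).encode("utf-8")
print(shortest)  # b'\xdf\xbf'

# A strict decoder therefore rejects the three-byte form.
rejected = False
try:
    b"\xe0\x9f\xbf".decode("utf-8")
except UnicodeDecodeError:
    rejected = True
print(rejected)  # True
```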
Ezio Melotti ezio.melo...@gmail.com added the comment:
Backported to 2.6 and 3.1 in r82470 and r82469.
I'll leave this open for a while to see if anyone has any comment on my
previous message.
--
resolution: -> fixed
stage: patch review -> committed/rejected
Ezio Melotti ezio.melo...@gmail.com added the comment:
Ported to py3k in r82413.
Some test with non-BMP characters should probably be added.
The patch should still be ported to 2.6 and 3.1.
--
Ezio Melotti ezio.melo...@gmail.com added the comment:
The issue about invalid surrogates in UTF-8 has been raised in #9133.
--
Ezio Melotti ezio.melo...@gmail.com added the comment:
Fixed on trunk in r81758 and r81759.
I'm leaving the issue open until I port it on the other versions.
--
Ezio Melotti ezio.melo...@gmail.com added the comment:
I added a test for the 'ignore' error handler. I will commit the patch before
the RC unless someone has something against it.
To summarize, the patch updates PyUnicode_DecodeUTF8 from RFC 2279 to RFC 3629,
so:
1) Invalid sequences are now
Marc-Andre Lemburg m...@egenix.com added the comment:
STINNER Victor wrote:
STINNER Victor victor.stin...@haypocalc.com added the comment:
I also found out that, according to RFC 3629, surrogates
are considered invalid and they can't be encoded/decoded,
but the UTF-8 codec actually
STINNER Victor victor.stin...@haypocalc.com added the comment:
I also found out that, according to RFC 3629, surrogates
are considered invalid and they can't be encoded/decoded,
but the UTF-8 codec actually does it.
Python2 does, but Python3 raises an error.
(...)
I wonder how
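The Python 3 behaviour described above can be checked with a short script. This is a sketch for a current Python 3 interpreter; the surrogatepass error handler shown at the end is the standard opt-in for code that really needs to round-trip lone surrogates.

```python
lone = "\ud800"  # a lone high surrogate

# Encoding a lone surrogate to UTF-8 fails in Python 3.
try:
    lone.encode("utf-8")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# Decoding the corresponding byte sequence ED A0 80 fails as well.
try:
    b"\xed\xa0\x80".decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# The 'surrogatepass' error handler explicitly allows both directions.
raw = lone.encode("utf-8", "surrogatepass")
back = raw.decode("utf-8", "surrogatepass")

print(encode_failed, decode_failed, raw)  # True True b'\xed\xa0\x80'
```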
Changes by Ezio Melotti ezio.melo...@gmail.com:
--
nosy: +pitrou
Ezio Melotti ezio.melo...@gmail.com added the comment:
The patch was causing a failure in test_codeccallbacks, issue8271v4 fixes the
test.
(The failing test in test_codeccallbacks was testing that registering error
handlers works, using a function that replaced \xc0\x80 with \x00. Since
now
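For context, an error handler of the kind described here is registered with codecs.register_error. The following is a sketch modelled on that description; the handler name "example.relaxedutf8" and the function name relaxed_utf8 are made up for illustration.

```python
import codecs


def relaxed_utf8(exc):
    """Hypothetical handler: decode the overlong pair C0 80 as U+0000."""
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError(f"don't know how to handle {exc!r}")
    # Look at two bytes starting at the error position, regardless of how
    # many bytes the decoder itself reported as invalid.
    if exc.object[exc.start:exc.start + 2] == b"\xc0\x80":
        return ("\x00", exc.start + 2)  # resume decoding after both bytes
    raise exc


codecs.register_error("example.relaxedutf8", relaxed_utf8)

result = b"a\xc0\x80b".decode("utf-8", "example.relaxedutf8")
print(ascii(result))  # 'a\x00b'
```

Because the handler inspects exc.object itself, it keeps working whether the decoder reports the error as one invalid byte or as two.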
Marc-Andre Lemburg m...@egenix.com added the comment:
Ezio Melotti wrote:
Ezio Melotti ezio.melo...@gmail.com added the comment:
Here's a new patch. Should be complete but I want to test it some more before
committing.
I decided to follow RFC 3629, putting 0 instead of 5/6 for bytes in
STINNER Victor victor.stin...@haypocalc.com added the comment:
I also found out that, according to RFC 3629, surrogates
are considered invalid and they can't be encoded/decoded,
but the UTF-8 codec actually does it.
Python2 does, but Python3 raises an error.
Python 2.7a4+ (trunk:79675,
Ezio Melotti ezio.melo...@gmail.com added the comment:
This new patch (v3) should be ok.
I added a few more tests and found another corner case:
'\xe1a'.decode('utf-8', 'replace') was returning u'\ufffd' because \xe1 is the
start byte of a 3-byte sequence and there were only two bytes in the
Ezio Melotti ezio.melo...@gmail.com added the comment:
Here's a new patch. Should be complete but I want to test it some more before
committing.
I decided to follow RFC 3629, putting 0 instead of 5/6 for bytes in range F5-FD
(we can always put them back in the unlikely case that the Unicode
John Machin sjmac...@users.sourceforge.net added the comment:
@ezio.melotti: Your second sentence is true, but it is not the whole truth.
Bytes in the range C0-FF (whose high bit *is* set) ALSO shouldn't be considered
part of the sequence because they (like 00-7F) are invalid as continuation
Ezio Melotti ezio.melo...@gmail.com added the comment:
Yes, right now I'm considering valid all the bytes that start with '10...'. C2
starts with '11...' so it's a failing byte.
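The '10...' bit-pattern check mentioned here can be written as a single mask. This is a sketch with a hypothetical function name; note that matching the pattern is necessary but not sufficient, since some lead bytes narrow the allowed range of the byte that follows them.

```python
def has_continuation_pattern(byte):
    """True if the top two bits of `byte` are 1 and 0 (binary 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


print(has_continuation_pattern(0x80))  # True:  0b10000000
print(has_continuation_pattern(0xBF))  # True:  0b10111111
print(has_continuation_pattern(0xC2))  # False: 0b11000010, a lead byte
print(has_continuation_pattern(0x41))  # False: 0b01000001, ASCII 'A'
```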
--
John Machin sjmac...@users.sourceforge.net added the comment:
@ezio.melotti: I'm considering valid all the bytes that start with '10...'
Sorry, WRONG. Read what I wrote: Further, some bytes in the range 80-BF are
NOT always valid as the first continuation byte, it depends on what starter
byte
Ezio Melotti ezio.melo...@gmail.com added the comment:
That's why I'm writing tests that cover all the cases, including overlong
sequences. If the test will fail I'll change the patch :)
--
Marc-Andre Lemburg m...@egenix.com added the comment:
John Machin wrote:
John Machin sjmac...@users.sourceforge.net added the comment:
@lemburg: failing byte seems rather obvious: first byte that you meet that
is not valid in the current state. I don't understand your explanation,
Ezio Melotti ezio.melo...@gmail.com added the comment:
Here is an incomplete patch. It seems to solve the problem but I still have to
add more tests and check it better.
I also wonder if the sequences with the first byte in range F5-FD (start of
4/5/6-byte sequences, restricted by RFC 3629)
Marc-Andre Lemburg m...@egenix.com added the comment:
Ezio Melotti wrote:
Ezio Melotti ezio.melo...@gmail.com added the comment:
Here is an incomplete patch. It seems to solve the problem but I still have
to add more tests and check it better.
Thanks. Please also check whether it's
John Machin sjmac...@users.sourceforge.net added the comment:
Unicode has been frozen at 0x10FFFF. That's it. There is no such thing as a
valid 5-byte or 6-byte UTF-8 string.
--
Marc-Andre Lemburg m...@egenix.com added the comment:
John Machin wrote:
John Machin sjmac...@users.sourceforge.net added the comment:
Unicode has been frozen at 0x10FFFF. That's it. There is no such thing as a
valid 5-byte or 6-byte UTF-8 string.
The UTF-8 codec was written at a time
John Machin sjmac...@users.sourceforge.net added the comment:
@lemburg: RFC 2279 was obsoleted by RFC 3629 over 6 years ago. The standard now
says 21 bits is it. F5-FF are declared to be invalid. I don't understand what
you mean by supporting those possibilities. The code is correctly issuing
Marc-Andre Lemburg m...@egenix.com added the comment:
John Machin wrote:
John Machin sjmac...@users.sourceforge.net added the comment:
@lemburg: RFC 2279 was obsoleted by RFC 3629 over 6 years ago.
I know.
The standard now says 21 bits is it.
It says that the current Unicode
John Machin sjmac...@users.sourceforge.net added the comment:
Patch review:
Preamble: pardon my ignorance of how the codebase works, but trunk
unicodeobject.c is r79494 (and allows encoding of surrogate codepoints), py3k
unicodeobject.c is r79506 (and bans the surrogate caper) and I can't
Ezio Melotti ezio.melo...@gmail.com added the comment:
Even if they are not valid they still eat all the 4/5/6 bytes, so they should
be fixed too. I haven't seen anything about these bytes in chapter 3 so far, but
there are at least two possibilities:
1) consider all the bytes in range F5-FD as
Marc-Andre Lemburg m...@egenix.com added the comment:
Ezio Melotti wrote:
Ezio Melotti ezio.melo...@gmail.com added the comment:
Even if they are not valid they still eat all the 4/5/6 bytes, so they
should be fixed too. I haven't seen anything about these bytes in chapter 3 so
far, but
John Machin sjmac...@users.sourceforge.net added the comment:
Chapter 3, page 94: As a consequence of the well-formedness conditions
specified in Table 3-7, the following byte values are disallowed in UTF-8:
C0–C1, F5–FF
Of course they should be handled by the simple expedient of setting
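The list of disallowed byte values quoted above can be cross-checked against Python's own encoder. The sketch below encodes every Unicode scalar value and collects the byte values that never occur; it is only an illustration, and it takes about a second to run.

```python
# Collect every byte value that appears in the UTF-8 encoding of some
# Unicode scalar value (all code points except the surrogates).
used = set()
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:
        continue  # surrogates cannot be encoded in UTF-8
    used.update(chr(cp).encode("utf-8"))

never_used = sorted(set(range(256)) - used)
print(" ".join(f"{b:02X}" for b in never_used))
# C0 C1 F5 F6 F7 F8 F9 FA FB FC FD FE FF
```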
John Machin sjmac...@users.sourceforge.net added the comment:
@lemburg: perhaps applying the same logic as for the other sequences is a
better strategy
What other sequences??? F5-FF are invalid bytes; they don't start valid
sequences. What same logic?? At the start of a character, they should
Changes by Ezio Melotti ezio.melo...@gmail.com:
--
components: +Unicode
nosy: +ezio.melotti
priority: -> normal
stage: -> test needed
versions: +Python 3.2
Changes by Daniel Graña dan...@gmail.com:
--
nosy: +dangra
Daniel Graña dan...@gmail.com added the comment:
Some background for this report at
http://stackoverflow.com/questions/2547262/why-is-python-decode-replacing-more-than-the-invalid-bytes-from-an-encoded-string/2548480
--
Changes by R. David Murray rdmur...@bitdance.com:
--
nosy: +lemburg
Marc-Andre Lemburg m...@egenix.com added the comment:
I guess the term "failing byte" is somewhat underdefined.
Page 95 of the standard PDF
(http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf) suggests to "Replace
each maximal subpart of an ill-formed subsequence by a single U+FFFD".
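As an illustration of the maximal-subpart practice, the byte sequence used as an example in that section of the standard can be fed to the decoder with the replace error handler. This sketch assumes CPython 3.3 or later, where the change discussed in this issue has been applied; older versions may produce a different number of U+FFFD characters.

```python
# Example byte sequence from the Unicode Standard's discussion of U+FFFD
# substitution: 61 F1 80 80 E1 80 C2 62 80 63 80 BF 64
data = bytes.fromhex("61 F1 80 80 E1 80 C2 62 80 63 80 BF 64")

text = data.decode("utf-8", "replace")
print(ascii(text))  # 'a\ufffd\ufffd\ufffdb\ufffdc\ufffd\ufffdd'

# F1 80 80, E1 80 and C2 are each a truncated but otherwise valid prefix,
# so each is replaced by one U+FFFD; the stray 80, 80 and BF get one each.
replacements = text.count("\ufffd")
print(replacements)  # 6
```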
John Machin sjmac...@users.sourceforge.net added the comment:
@lemburg: "failing byte" seems rather obvious: first byte that you meet that is
not valid in the current state. I don't understand your explanation, especially
"does not have the high bit set". I think you mean "is a valid starter byte".
Ezio Melotti ezio.melo...@gmail.com added the comment:
Having the 'high bit set' means that the first bit is set to 1.
All the continuation bytes (i.e. the 2nd, 3rd or 4th byte in a sequence) have
the first two bits set to 1 and 0 respectively, so if the first bit is not set
to 1 then the byte
Changes by Ezio Melotti ezio.melo...@gmail.com:
--
assignee: -> ezio.melotti
New submission from John Machin sjmac...@users.sourceforge.net:
Unicode 5.2.0 chapter 3 (Conformance) has a new section (headed Constraints on
Conversion Processes) after requirement D93. Recent Pythons e.g. 3.1.2 don't
comply. Using the Unicode example: