[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2014-03-31 Thread Julian Mehnle
Changes by Julian Mehnle jul...@mehnle.net: -- nosy: +jmehnle ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue8271 ___ ___ Python-bugs-list mailing

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-11-04 Thread Serhiy Storchaka
Changes by Serhiy Storchaka storch...@gmail.com: Removed file: http://bugs.python.org/file25709/issue8271-3.3.patch

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-11-04 Thread Serhiy Storchaka
Changes by Serhiy Storchaka storch...@gmail.com: Removed file: http://bugs.python.org/file26116/issue8271-3.3-fast-2.patch

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-11-04 Thread Serhiy Storchaka
Changes by Serhiy Storchaka storch...@gmail.com: -- versions: +Python 3.4 -Python 2.7, Python 3.1, Python 3.2, Python 3.3

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-11-04 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: What about committing it? All of Ezio's tests passed, and the microbenchmark shows differences of less than 10%:
    vanilla       patched
    MB/s          MB/s
    2076 (-3%)    2007    decode utf-8 'A'*1
    414  (-0%)    413     decode utf-8 '\x80'*1
    1283 (-1%)    1275    decode

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-11-04 Thread Roundup Robot
Roundup Robot added the comment: New changeset 5962f192a483 by Ezio Melotti in branch '3.3': #8271: the utf-8 decoder now outputs the correct number of U+FFFD characters when used with the replace error handler on invalid utf-8 sequences. Patch by Serhiy Storchaka, tests by Ezio Melotti.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-11-04 Thread Ezio Melotti
Ezio Melotti added the comment: Fixed, thanks for updating the patch! I committed it on 3.3 too, and while this could have gone on 2.7/3.2 too IMHO, it's too much work to port it there and not worth it. -- status: open -> closed versions: +Python 3.3

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-11-04 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Agreed. In 2.7 the UTF-8 codec is still broken in corner cases (it accepts surrogates), and 3.2 is nearing the end of its maintenance period. In any case, this is only a recommendation, not a requirement.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-11-04 Thread Roundup Robot
Roundup Robot added the comment: New changeset 96f4cee8ea5e by Victor Stinner in branch '3.3': Issue #8271: Fix compilation on Windows http://hg.python.org/cpython/rev/96f4cee8ea5e New changeset 6f44f33460cd by Victor Stinner in branch 'default': (Merge 3.3) Issue #8271: Fix compilation on

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-06-23 Thread Serhiy Storchaka
Changes by Serhiy Storchaka storch...@gmail.com: Removed file: http://bugs.python.org/file25720/issue8271-3.3-fast.patch

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-06-23 Thread Antoine Pitrou
Antoine Pitrou pit...@free.fr added the comment: Why is this marked fixed? Is it fixed or not?

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-06-23 Thread Serhiy Storchaka
Serhiy Storchaka storch...@gmail.com added the comment: I deleted the fast patch, since it is unsafe. Issue14923 should compensate for the small slowdown more safely. I think this change is not a bugfix (this is not a bug, the standard allows such behavior), but a new feature, so I doubt the need to fix 2.7

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-06-23 Thread Serhiy Storchaka
Serhiy Storchaka storch...@gmail.com added the comment: No, it is not fully fixed. Only one bug was fixed, but the current behavior still does not conform to the Unicode Standard *recommendations*. Not conforming to recommendations is not a bug; conforming is a feature.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-06-23 Thread Serhiy Storchaka
Serhiy Storchaka storch...@gmail.com added the comment: Here is an updated, slightly faster patch. It is merged with decode_utf8_range_check.patch from issue14923. The patch contains Ezio Melotti's unmodified tests, all of which pass. -- Added file:

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-06-23 Thread Serhiy Storchaka
Serhiy Storchaka storch...@gmail.com added the comment: Here is an updated patch that resolves the merge conflict with 3214c9ebcf5e. -- Added file: http://bugs.python.org/file26118/issue8271-3.3-fast-3.patch

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-26 Thread Serhiy Storchaka
Serhiy Storchaka storch...@gmail.com added the comment: Here are the benchmark results (numbers are speed, MB/s). On 32-bit Linux, AMD Athlon 64 X2: vanilla patched utf-8 'A'*1 2016 (+5%) 2111 utf-8

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-26 Thread Serhiy Storchaka
Serhiy Storchaka storch...@gmail.com added the comment: Fortunately, issue14923 (if accepted) will compensate for the slowdown. On 32-bit Linux, AMD Athlon 64 X2: vanilla old patch fast patch utf-8 'A'*1 2016

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-25 Thread Serhiy Storchaka
Serhiy Storchaka storch...@gmail.com added the comment: Here is a patch for 3.3. All of the tests pass successfully. Unfortunately, it is a little slow, but I tried to minimize the losses. -- Added file: http://bugs.python.org/file25709/issue8271-3.3.patch

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-25 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: Do you have any benchmark results?

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Serhiy Storchaka
Serhiy Storchaka storch...@gmail.com added the comment: Looks like issue14738 fixes this bug for Python 3.3.
>>> print(ascii(b"\xc2\x41\x42".decode('utf8', 'replace')))
'\ufffdAB'
>>> print(ascii(b"\xf1ABCD".decode('utf8', 'replace')))
'\ufffdABCD'
-- nosy: +storchaka

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: The original bug should be fixed already in 3.3 and there should be tests (unless they got removed/skipped after we changed unicode implementation). The only issue left was about the number of U+FFFD generated with invalid sequences in

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Serhiy Storchaka
Serhiy Storchaka storch...@gmail.com added the comment:
> The only issue left was about the number of U+FFFD generated with invalid sequences in some cases. My last patch has extensive tests for this, so you could try to apply it (or copy the tests) and see if they all pass.
The tests fail,

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Saul Spatz
Saul Spatz saul.sp...@gmail.com added the comment:
> b'\xe0\x80'.decode('utf-8', 'replace') returns one U+FFFD and not two. I don't think that is right.
I think that one U+FFFD is correct. The only error is a premature end of data. On Thu, May 17, 2012 at 12:31 PM, Serhiy Storchaka

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment:
> The tests fail, but I'm not sure that the tests are correct. b'\xe0\x00' raises 'unexpected end of data' and not 'invalid continuation byte'. This is a terminological issue.
This might be just because it first checks if there are two more bytes

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: Changing from 'unexpected end of data' to 'invalid continuation byte' for b'\xe0\x00' is fine with me, but this will be a (minor) deviation from 2.7, 3.1, 3.2, and pypy (it could still be changed on all these except 3.1 though). If you

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Serhiy Storchaka
Serhiy Storchaka storch...@gmail.com added the comment:
> I think that one U+FFFD is correct. The only error is a premature end of data.
I expressed myself poorly. I also think that there is only one decoding error, and not two. I think the test is wrong.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Serhiy Storchaka
Serhiy Storchaka storch...@gmail.com added the comment:
> This might be just because it first checks if there are two more bytes before checking if they are valid, but 'invalid continuation byte' works too.
Yes, this is an implementation detail. It is much easier and faster. Whether it is necessary to

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Serhiy Storchaka
Serhiy Storchaka storch...@gmail.com added the comment: Changing from 'unexpected end of data' to 'invalid continuation byte' for b'\xe0\x00' is fine with me, but this will be a (minor) deviation from 2.7, 3.1, 3.2, and pypy (it could still be changed on all these except 3.1 though). I

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: \xe0\x80 is not maximal subpart. Therefore, there must be two U+FFFD. OK, now I get what you mean. The valid range for continuation bytes that can follow E0 is A0-BF, not 80-BF as usual, so \x80 is not a valid continuation byte here.
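
The narrower continuation range after E0 described in this comment can be checked directly. This is a minimal sketch, assuming an interpreter that already includes the fix discussed in this thread (Python 3.3 or later); the variable names are illustrative only.

```python
# E0 must be followed by A0-BF, so 80 cannot continue the sequence.
# E0 is replaced on its own, and the stray 80 is replaced separately.
invalid_second_byte = b"\xe0\x80".decode("utf-8", "replace")

# E0 A0 is a valid but truncated prefix, so it forms a single
# maximal subpart and yields a single replacement character.
truncated_prefix = b"\xe0\xa0".decode("utf-8", "replace")

print(ascii(invalid_second_byte))
print(ascii(truncated_prefix))
```

Older interpreters (2.7, 3.1, 3.2) may report a different count for the first case, which is the discrepancy debated above.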

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment:
> I probably said it poorly. Past and current implementations raise 'unexpected end of data' and not 'invalid continuation byte'. The test expects 'invalid continuation byte'.
I don't think it matters much either way.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Serhiy Storchaka
Serhiy Storchaka storch...@gmail.com added the comment:
> I don't remember all the details right now, but if that test was passing with my patch there must be something wrong somewhere (either in the patch, in the test, or in our understanding of the standard).
No, the test correctly expects two

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2011-09-21 Thread Stefan Ring
Changes by Stefan Ring stefan...@gmail.com: -- nosy: +Ringding

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2011-08-15 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: Here are some benchmarks: Commands:
# half of the bytes are invalid
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "surrogateescape")'
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode'
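
A comparable measurement can be taken from inside Python with the standard `timeit` module. This is only a sketch: the choice of handlers and the repetition count are assumptions, not the benchmark actually run for this issue, and the timings depend entirely on the machine.

```python
import timeit

# Same input as the commands above: every byte value once, so
# roughly half of the bytes are invalid as UTF-8.
data = bytes(range(256))

timings = {}
for handler in ("replace", "ignore", "surrogateescape"):
    # Bind the handler as a default argument so each lambda uses its own value.
    timings[handler] = timeit.timeit(
        lambda h=handler: data.decode("utf-8", h), number=10000
    )

for handler, seconds in timings.items():
    print("%-16s %.4f s for 10000 calls" % (handler, seconds))
```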

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2011-07-07 Thread Saul Spatz
Changes by Saul Spatz saul.sp...@gmail.com: -- nosy: +spatz123

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2011-04-19 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: Attached patch against 3.1 fixes the number of FFFD. A test for the range in the error message should probably be added. I haven't done any benchmark yet. There's some code duplication, but I'm not sure it can be factored out.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2011-02-28 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment: Ezio Melotti wrote: Ezio Melotti ezio.melo...@gmail.com added the comment: The patch turned out to be less trivial than I initially thought. The current algorithm checks for invalid continuation bytes in 4 places: 1) before the

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2011-02-27 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: The patch turned out to be less trivial than I initially thought. The current algorithm checks for invalid continuation bytes in 4 places: 1) before the switch/case statement in Objects/unicodeobject.c when it checks if there are enough

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2011-02-25 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: After a mail I sent to the Unicode Consortium about the corner case I found, they updated the Best Practices for Using U+FFFD[0] and now it says: Another example illustrates the application of the concept of maximal subpart for UTF-8

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-12-29 Thread Alexander Belopolsky
Changes by Alexander Belopolsky belopol...@users.sourceforge.net: -- nosy: +belopolsky

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-07-03 Thread John Machin
John Machin sjmac...@users.sourceforge.net added the comment: About the E0 80 81 61 problem: my interpretation is that you are correct, the 80 is not valid in the current state (start byte == E0), so no look-ahead, three FFFDs must be issued followed by 0061. I don't really care about issuing
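
The E0 80 81 61 case from this comment can be reproduced as follows. This is a sketch that assumes Python 3.3 or later, where the decoder emits one U+FFFD per maximal subpart.

```python
# E0 cannot be followed by 80, so E0, 80 and 81 are each replaced
# individually; the trailing 61 decodes normally as 'a'.
data = bytes([0xE0, 0x80, 0x81, 0x61])
decoded = data.decode("utf-8", "replace")

print(ascii(decoded))
```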

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-07-02 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: I've found a subtle corner case about 3- and 4-bytes long sequences. For example, according to http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (pages 94-95, table 3.7) the sequences in range \xe0\x80\x80-\xe0\x9f\xbf are invalid.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-07-02 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: Backported to 2.6 and 3.1 in r82470 and r82469. I'll leave this open for a while to see if anyone has any comment on my previous message. -- resolution: -> fixed stage: patch review -> committed/rejected

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-07-01 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: Ported to py3k in r82413. Some tests with non-BMP characters should probably be added. The patch should still be ported to 2.6 and 3.1.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-06-30 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: The issue about invalid surrogates in UTF-8 has been raised in #9133.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-06-05 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: Fixed on trunk in r81758 and r81759. I'm leaving the issue open until I port it on the other versions.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-06-04 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: I added a test for the 'ignore' error handler. I will commit the patch before the RC unless someone has something against it. To summarize, the patch updates PyUnicode_DecodeUTF8 from RFC 2279 to RFC 3629, so: 1) Invalid sequences are now
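
For comparison with 'replace', the 'ignore' handler mentioned here drops ill-formed bytes instead of substituting U+FFFD. A small sketch, assuming Python 3.3 or later; the input bytes are chosen for illustration and are not taken from the patch's test suite.

```python
data = b"a\xe0\x80b"

# 'ignore' silently discards the two ill-formed bytes.
ignored = data.decode("utf-8", "ignore")

# 'replace' marks each maximal subpart with U+FFFD instead.
replaced = data.decode("utf-8", "replace")

print(ascii(ignored))
print(ascii(replaced))
```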

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-07 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment: STINNER Victor wrote: STINNER Victor victor.stin...@haypocalc.com added the comment: I also found out that, according to RFC 3629, surrogates are considered invalid and they can't be encoded/decoded, but the UTF-8 codec actually

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-07 Thread STINNER Victor
STINNER Victor victor.stin...@haypocalc.com added the comment: I also found out that, according to RFC 3629, surrogates are considered invalid and they can't be encoded/decoded, but the UTF-8 codec actually does it. Python2 does, but Python3 raises an error. (...) I wonder how

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-07 Thread Ezio Melotti
Changes by Ezio Melotti ezio.melo...@gmail.com: -- nosy: +pitrou

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-06 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: The patch was causing a failure in test_codeccallbacks, issue8271v4 fixes the test. (The failing test in test_codeccallbacks was testing that registering error handlers works, using a function that replaced \xc0\x80 with \x00. Since now
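
The kind of error handler this comment refers to can be sketched with `codecs.register_error`. The handler name `example.relaxed_utf8` and the function below are hypothetical and only modelled on the description above (replace \xc0\x80 with \x00); they are not the code from test_codeccallbacks.

```python
import codecs


def relaxed_utf8(exc):
    """Decode the overlong pair C0 80 as U+0000; re-raise anything else."""
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("don't know how to handle %r" % exc)
    if exc.object[exc.start:exc.start + 2] == b"\xc0\x80":
        # Return the replacement text and the position to resume at,
        # which skips both bytes of the pair.
        return ("\x00", exc.start + 2)
    raise exc


# Register under a hypothetical name, then use it like a built-in handler.
codecs.register_error("example.relaxed_utf8", relaxed_utf8)
decoded = b"a\xc0\x80b".decode("utf-8", "example.relaxed_utf8")

print(ascii(decoded))
```

Because the decoder now reports C0 as an invalid start byte on its own, the handler has to look ahead in `exc.object` rather than rely on the reported error range covering both bytes.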

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-03 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment: Ezio Melotti wrote: Ezio Melotti ezio.melo...@gmail.com added the comment: Here's a new patch. Should be complete but I want to test it some more before committing. I decided to follow RFC 3629, putting 0 instead of 5/6 for bytes in

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-03 Thread STINNER Victor
STINNER Victor victor.stin...@haypocalc.com added the comment: I also found out that, according to RFC 3629, surrogates are considered invalid and they can't be encoded/decoded, but the UTF-8 codec actually does it. Python2 does, but Python3 raises an error. Python 2.7a4+ (trunk:79675,
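
The Python 3 behavior described here (surrogates are rejected by the UTF-8 codec) can be observed with a short sketch. It assumes Python 3.1 or later, where the 'surrogatepass' handler is available as the explicit opt-in.

```python
lone_surrogate = "\ud800"

# Encoding a lone surrogate with the default strict handler fails.
try:
    lone_surrogate.encode("utf-8")
    encode_outcome = "encoded"
except UnicodeEncodeError as exc:
    encode_outcome = exc.reason

# Decoding the three bytes that would represent it also fails.
try:
    b"\xed\xa0\x80".decode("utf-8")
    decode_outcome = "decoded"
except UnicodeDecodeError as exc:
    decode_outcome = exc.reason

# 'surrogatepass' allows the round trip when it is really wanted.
raw = lone_surrogate.encode("utf-8", "surrogatepass")
restored = raw.decode("utf-8", "surrogatepass")

print(encode_outcome)
print(decode_outcome)
print(raw)
```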

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-03 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: This new patch (v3) should be ok. I added a few more tests and found another corner case: '\xe1a'.decode('utf-8', 'replace') was returning u'\ufffd' because \xe1 is the start byte of a 3-byte sequence and there were only two bytes in the
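
The corner case described here, written in Python 3 syntax, looks like this. A sketch assuming an interpreter that includes the fix (Python 3.3 or later): the valid byte after the incomplete start byte must survive.

```python
# E1 starts a 3-byte sequence, but 'a' (0x61) is not a continuation
# byte, so only E1 is replaced and 'a' is decoded normally.
decoded = b"\xe1a".decode("utf-8", "replace")

print(ascii(decoded))
```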

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-02 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: Here's a new patch. Should be complete but I want to test it some more before committing. I decided to follow RFC 3629, putting 0 instead of 5/6 for bytes in range F5-FD (we can always put them back in the unlikely case that the Unicode

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin
John Machin sjmac...@users.sourceforge.net added the comment: @ezio.melotti: Your second sentence is true, but it is not the whole truth. Bytes in the range C0-FF (whose high bit *is* set) ALSO shouldn't be considered part of the sequence because they (like 00-7F) are invalid as continuation

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: Yes, right now I'm considering valid all the bytes that start with '10...'. C2 starts with '11...' so it's a failing byte.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin
John Machin sjmac...@users.sourceforge.net added the comment: #ezio.melotti: I'm considering valid all the bytes that start with '10...' Sorry, WRONG. Read what I wrote: Further, some bytes in the range 80-BF are NOT always valid as the first continuation byte, it depends on what starter byte

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: That's why I'm writing tests that cover all the cases, including overlong sequences. If the tests fail I'll change the patch :)

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment: John Machin wrote: John Machin sjmac...@users.sourceforge.net added the comment: @lemburg: failing byte seems rather obvious: first byte that you meet that is not valid in the current state. I don't understand your explanation,

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: Here is an incomplete patch. It seems to solve the problem but I still have to add more tests and check it better. I also wonder if the sequences with the first byte in range F5-FD (start of 4/5/6-byte sequences, restricted by RFC 3629)

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment: Ezio Melotti wrote: Ezio Melotti ezio.melo...@gmail.com added the comment: Here is an incomplete patch. It seems to solve the problem but I still have to add more tests and check it better. Thanks. Please also check whether it's

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin
John Machin sjmac...@users.sourceforge.net added the comment: Unicode has been frozen at 0x10FFFF. That's it. There is no such thing as a valid 5-byte or 6-byte UTF-8 string.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment: John Machin wrote: John Machin sjmac...@users.sourceforge.net added the comment: Unicode has been frozen at 0x10FFFF. That's it. There is no such thing as a valid 5-byte or 6-byte UTF-8 string. The UTF-8 codec was written at a time

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin
John Machin sjmac...@users.sourceforge.net added the comment: @lemburg: RFC 2279 was obsoleted by RFC 3629 over 6 years ago. The standard now says 21 bits is it. F5-FF are declared to be invalid. I don't understand what you mean by supporting those possibilities. The code is correctly issuing
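
The 21-bit limit mentioned here can be illustrated from the Python side. A minimal sketch, assuming Python 3; the variable names are illustrative.

```python
# The highest code point, U+10FFFF, encodes to four bytes starting with F4.
highest = chr(0x10FFFF).encode("utf-8")

# One past the limit is not a code point at all.
try:
    chr(0x110000)
    beyond_limit = "accepted"
except ValueError:
    beyond_limit = "rejected"

# F4 90 80 80 would encode 0x110000 and is therefore ill-formed.
try:
    b"\xf4\x90\x80\x80".decode("utf-8")
    beyond_limit_bytes = "accepted"
except UnicodeDecodeError:
    beyond_limit_bytes = "rejected"

print(highest, beyond_limit, beyond_limit_bytes)
```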

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment: John Machin wrote: John Machin sjmac...@users.sourceforge.net added the comment: @lemburg: RFC 2279 was obsoleted by RFC 3629 over 6 years ago. I know. The standard now says 21 bits is it. It says that the current Unicode

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin
John Machin sjmac...@users.sourceforge.net added the comment: Patch review: Preamble: pardon my ignorance of how the codebase works, but trunk unicodeobject.c is r79494 (and allows encoding of surrogate codepoints), py3k unicodeobject.c is r79506 (and bans the surrogate caper) and I can't

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: Even if they are not valid they still eat all the 4/5/6 bytes, so they should be fixed too. I haven't seen anything about these bytes in chapter 3 so far, but there are at least two possibilities: 1) consider all the bytes in range F5-FD as

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment: Ezio Melotti wrote: Ezio Melotti ezio.melo...@gmail.com added the comment: Even if they are not valid they still eat all the 4/5/6 bytes, so they should be fixed too. I haven't see anything about these bytes in chapter 3 so far, but

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin
John Machin sjmac...@users.sourceforge.net added the comment: Chapter 3, page 94: As a consequence of the well-formedness conditions specified in Table 3-7, the following byte values are disallowed in UTF-8: C0–C1, F5–FF Of course they should be handled by the simple expedient of setting
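
The list of disallowed byte values quoted from the standard can be checked against the decoder. This is a sketch assuming Python 3.3 or later; it only verifies that each such byte is rejected in strict mode and replaced individually, without consuming the byte that follows.

```python
# Byte values that can never appear in well-formed UTF-8: C0-C1 and F5-FF.
disallowed = list(range(0xC0, 0xC2)) + list(range(0xF5, 0x100))

strict_outcomes = {}
for value in disallowed:
    try:
        bytes([value]).decode("utf-8")
        strict_outcomes[value] = "accepted"
    except UnicodeDecodeError as exc:
        strict_outcomes[value] = exc.reason

# With 'replace', the following 'A' must not be swallowed.
replaced = [bytes([value, 0x41]).decode("utf-8", "replace") for value in disallowed]

for value in disallowed:
    print("%02X: %s" % (value, strict_outcomes[value]))
```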

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin
John Machin sjmac...@users.sourceforge.net added the comment: @lemburg: perhaps applying the same logic as for the other sequences is a better strategy What other sequences??? F5-FF are invalid bytes; they don't start valid sequences. What same logic?? At the start of a character, they should

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread Ezio Melotti
Changes by Ezio Melotti ezio.melo...@gmail.com: -- components: +Unicode nosy: +ezio.melotti priority: -> normal stage: -> test needed versions: +Python 3.2

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread Daniel Graña
Changes by Daniel Graña dan...@gmail.com: -- nosy: +dangra

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread Daniel Graña
Daniel Graña dan...@gmail.com added the comment: Some background for this report at http://stackoverflow.com/questions/2547262/why-is-python-decode-replacing-more-than-the-invalid-bytes-from-an-encoded-string/2548480

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread R. David Murray
Changes by R. David Murray rdmur...@bitdance.com: -- nosy: +lemburg

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment: I guess the term "failing byte" is somewhat underdefined. Page 95 of the standard PDF (http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf) suggests to "Replace each maximal subpart of an ill-formed subsequence by a single U+FFFD."
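
The "maximal subpart" rule quoted here can be tried on the byte sequence that the Unicode Standard uses to illustrate U+FFFD substitution. This is a sketch assuming Python 3.3 or later, where the decoder follows the recommendation; whether this is the same example quoted in the original report is not visible in this listing.

```python
# 61 | F1 80 80 | E1 80 | C2 | 62 | 80 | 63 | 80 | BF | 64
# Each truncated-but-valid prefix (F1 80 80, E1 80, C2) and each stray
# continuation byte (80, 80, BF) becomes one U+FFFD.
data = bytes.fromhex("61 F1 80 80 E1 80 C2 62 80 63 80 BF 64")
decoded = data.decode("utf-8", "replace")

code_points = ["%04X" % ord(ch) for ch in decoded]
print(" ".join(code_points))
```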

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread John Machin
John Machin sjmac...@users.sourceforge.net added the comment: @lemburg: "failing byte" seems rather obvious: first byte that you meet that is not valid in the current state. I don't understand your explanation, especially "does not have the high bit set". I think you mean "is a valid starter byte".

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: Having the 'high bit set' means that the first bit is set to 1. All the continuation bytes (i.e. the 2nd, 3rd or 4th byte in a sequence) have the first two bits set to 1 and 0 respectively, so if the first bit is not set to 1 then the byte
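
The bit pattern described here (continuation bytes start with the bits 1 and 0) can be expressed as a mask test. The helper name `is_continuation_byte` is hypothetical. As other comments in this thread point out, matching this pattern is necessary but not sufficient, because some start bytes (E0, for example) allow only part of the 80-BF range as their first continuation byte.

```python
def is_continuation_byte(value):
    """Return True if the byte has the form 10xxxxxx (range 80-BF)."""
    return (value & 0b11000000) == 0b10000000


for value in (0x41, 0x80, 0xBF, 0xC2):
    print("%02X %s" % (value, is_continuation_byte(value)))
```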

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread Ezio Melotti
Changes by Ezio Melotti ezio.melo...@gmail.com: -- assignee: -> ezio.melotti

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-30 Thread John Machin
New submission from John Machin sjmac...@users.sourceforge.net: Unicode 5.2.0 chapter 3 (Conformance) has a new section (headed Constraints on Conversion Processes) after requirement D93. Recent Pythons e.g. 3.1.2 don't comply. Using the Unicode example: