[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-07-07 Thread Roundup Robot

Roundup Robot devnull@devnull added the comment:

New changeset 16cbd84de848 by Victor Stinner in branch 'default':
Issue #12016: Multibyte CJK decoders now resynchronize faster
http://hg.python.org/cpython/rev/16cbd84de848

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12016
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-07-07 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

> Because I consider this issue as a bug, I would like
> to apply this patch to 2.7, 3.2 and 3.3.

It may be a bug, but it is also an important change in Python's behaviour, so 
in the end I prefer to change (fix) only Python 3.3.

Thanks for reporting the bug, zy (cdqzzy). Tell me if it now behaves as you 
expected.

I'm closing this issue because the initial issue is now fixed.

--
resolution:  -> fixed
status: open -> closed
versions:  -Python 2.7, Python 3.1, Python 3.2




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-06-03 Thread Roundup Robot

Roundup Robot devnull@devnull added the comment:

New changeset 3610841f7357 by Victor Stinner in branch '3.2':
Issue #12016: Reindent decoders of HK and JP codecs
http://hg.python.org/cpython/rev/3610841f7357

New changeset aa07c1237f4e by Victor Stinner in branch 'default':
(Merge 3.2) Issue #12016: Reindent decoders of HK and JP codecs
http://hg.python.org/cpython/rev/aa07c1237f4e

New changeset 685351d65592 by Victor Stinner in branch '2.7':
Issue #12016: Reindent decoders of HK and JP codecs
http://hg.python.org/cpython/rev/685351d65592

--




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-06-03 Thread Roundup Robot

Roundup Robot devnull@devnull added the comment:

New changeset 8572bf1b56ec by Victor Stinner in branch '3.2':
Issue #12016: Add test_errorhandle() to TestBase_Mapping of
http://hg.python.org/cpython/rev/8572bf1b56ec

New changeset c3dc94d53ef8 by Victor Stinner in branch 'default':
(Merge 3.2) Issue #12016: Add test_errorhandle() to TestBase_Mapping of
http://hg.python.org/cpython/rev/c3dc94d53ef8

New changeset 53912b58eee6 by Victor Stinner in branch '2.7':
Issue #12016: Add test_errorhandle() to TestBase_Mapping of
http://hg.python.org/cpython/rev/53912b58eee6

--




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-06-03 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

cjk_decode.patch:
 - patch *all* CJK decoders to replace only the first byte of an invalid byte 
sequence (by U+FFFD). Example from the issue title: b'\xff\n'.decode('gb2312', 
'replace') now gives '�\n' instead of just '�'
 - add at least one unit test for *each* path in the decoder (sometimes it was 
really hard to see how to reach a specific path, especially for the johab 
decoder!)
 - add test cases for euc_jis_2004 and shift_jis_2004
 - factor out the codec tests (codectests) of all Japanese EUC tests 
(euc_commontests)

Because I consider this issue as a bug, I would like to apply this patch to 
2.7, 3.2 and 3.3.
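
The behaviour described in the first item can be checked with a few lines. This is only a minimal sketch, assuming a Python 3 interpreter that already contains the patched decoders (3.3 or later); the variable names are illustrative and not part of the patch.

```python
# Input from the issue title: an invalid lead byte followed by a newline.
data = b'\xff\n'

# With the patched decoders only the invalid byte 0xFF is replaced or
# dropped; the ASCII newline that follows it is kept.
replaced = data.decode('gb2312', 'replace')
ignored = data.decode('gb2312', 'ignore')

print(ascii(replaced))  # '\ufffd\n' on a patched interpreter
print(ascii(ignored))   # '\n' on a patched interpreter
```

On an unpatched interpreter the newline would be swallowed together with 0xFF, which is the behaviour this issue reports.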

--
keywords: +patch
Added file: http://bugs.python.org/file22241/cjk_decode.patch




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-30 Thread Roundup Robot

Roundup Robot devnull@devnull added the comment:

New changeset 3b1b06570cf9 by Victor Stinner in branch '2.7':
Issue #12016: my_fgets() now always clears errors before calling fgets(). Fix
http://hg.python.org/cpython/rev/3b1b06570cf9

New changeset de07f90ef45c by Victor Stinner in branch '3.2':
Issue #12016: my_fgets() now always clears errors before calling fgets(). Fix
http://hg.python.org/cpython/rev/de07f90ef45c

New changeset deb6e7859211 by Victor Stinner in branch 'default':
(Merge 3.2) Issue #12016: my_fgets() now always clears errors before calling
http://hg.python.org/cpython/rev/deb6e7859211

--
nosy: +python-dev




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-30 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@haypocalc.com:


--
Removed message: http://bugs.python.org/msg137334




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-30 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

 - I added tests for the HZ codec and some ISO 2022 codecs: #12057
 - I fixed IncrementalEncoder.encode() (of multibytecodec): #12100
 - I fixed IncrementalEncoder.reset() (of multibytecodec): #12171

I can now work confidently on this issue. I will try to patch all CJK decoders 
to only replace 1 invalid byte by U+FFFD (and not 2, 3 or 4 bytes) and try to 
write a test for each case (each byte sequence generating a different error).

--




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-11 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

I asked on the iconv mailing list whether the change is correct. Here is a copy 
of an answer.

From: Bruno Haible
To:  [iconv mailing list]
Cc: Victor Stinner
Subject:  Re: [bug-gnu-libiconv] Invalid byte sequences and multiybyte encodings
Date:   Tue, 10 May 2011 14:52:09 +0200

Hi,

> Someone opened an issue in Python bug tracker asking to change how
> invalid multibyte sequences are handled.
> http://bugs.python.org/issue12016

For UTF-8 the recommended way of handling malformed input is written down
in http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt. But the
principle applies to any encoding with a variable number of bytes per
character:
  When an invalid or malformed byte sequence is found, the smallest
  such byte sequence is transformed to U+FFFD (replacement character).

In particular, normally, if the first byte that is considered wrong
or invalid is a valid starter byte, the malformed byte sequence should
be considered to end before that byte. If it is not a valid starter
byte, then use your judgement.

For an example implementation, see
http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/unistr/u8-mbtouc.c;hb=HEAD
Here the return value is the number of bytes consumed. Look carefully
when it is 1, 2, 3, or 4.

> b'\xffabc'.decode('gb2312', 'replace') gives �bc. The 'a' character is
> seen as part of a multibyte character of 2 bytes. Because {0xFF, 0x61}
> is invalid in GB2312, the two bytes are replaced by U+FFFD.
> 
> Is it the right way to do it?

It is better to replace only the 0xFF byte with U+FFFD, because 0x61 is a
valid first byte (even a complete character).

> UTF-8 decoder changed recently to ignore a single byte and restart the
> decoder, so '\xF1\x80\x41\x42\x43' is now decoded �ABC instead of �C.
> Should we do the same for all encodings?

Generally, yes.

> Or at least for asian encodings 
> (gb2312, gbk, gb18030, big5 family, ISO 2022 family, JIS family, EUC_KR,
> CP949, Big5, CP950, ...)?

For stateful encodings of the ISO 2022 family, you may want to ignore/replace
a complete escape sequence, where the syntax of escape sequences is defined
through general rules.

Bruno
-- 
In memoriam Siegfried Rädel http://en.wikipedia.org/wiki/Siegfried_Rädel
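
The rule quoted above (replace only the smallest malformed sequence, then resume decoding at the next byte) can be illustrated with a toy decoder. This is a hedged sketch, not the CPython or libiconv implementation: `decode_double_byte` and `TOY_TABLE` are hypothetical names, and the encoding is a simplified double-byte scheme in which bytes below 0x80 are ASCII and every other character is exactly two bytes long.

```python
REPLACEMENT = "\ufffd"

# Hypothetical one-entry mapping table, for illustration only.
TOY_TABLE = {b"\xb0\xa1": "\u554a"}


def decode_double_byte(data, table, errors="replace"):
    """Decode a toy double-byte encoding, resynchronizing after one byte."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            # ASCII byte: always a complete character on its own.
            out.append(chr(byte))
            i += 1
            continue
        pair = bytes(data[i:i + 2])
        if len(pair) == 2 and pair in table:
            out.append(table[pair])
            i += 2
            continue
        # Invalid or truncated sequence: consume only the first byte so
        # that the following byte gets a chance to start a new character.
        if errors == "replace":
            out.append(REPLACEMENT)
        elif errors != "ignore":
            raise ValueError("invalid byte 0x%02x at position %d" % (byte, i))
        i += 1
    return "".join(out)


print(ascii(decode_double_byte(b"\xffabc", TOY_TABLE)))
print(ascii(decode_double_byte(b"\xff\n", TOY_TABLE, "ignore")))
```

Always skipping exactly one byte is the simplest reading of the advice; a real decoder may skip more when the second byte cannot start a character either.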

--




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-11 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

Oh, the HZ codec has no test! And what is this horrible BLOB, 
Lib/test/cjkencodings_test.py?

--




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-11 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@haypocalc.com:


--
dependencies: +HZ codec has no test




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-11 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@haypocalc.com:


--
nosy: +hyeshik.chang




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-07 Thread Terry J. Reedy

Terry J. Reedy tjre...@udel.edu added the comment:

u'' in 2.7.1 also, on winxp

--
nosy: +terry.reedy




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-07 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

So the correct result for b'\xff\n'.decode('gb2312', 'replace') is u'?\n'?

--
versions: +Python 3.1, Python 3.2, Python 3.3




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-07 Thread zy

zy cdq...@gmail.com added the comment:

> So the correct result for b'\xff\n'.decode('gb2312', 'replace') is u'?\n'?

I think it should be so. This behavior does not leave out possible information 
and has no side effect on later decodings. Even if the '\n' is indeed 
redundant, an output of u'?\n' would be unlikely to cause confusion.

However, I have no knowledge of this subject code-wise. If changing the 
behavior would have an impact on performance, maybe the change should not go 
in.

--




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-07 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

_codecs_cn implements different multibyte encodings: gb2312, gbkext, gbcommon, 
gb18030ext, gbk, gb18030.

And there are other Asian multibyte encodings: big5 family, ISO 2022 family, 
JIS family, Korean encodings (KSX1001, EUC_KR, CP949, ...), Big5, CP950, ...

All of them ignore all the bytes if one byte of a multibyte sequence is invalid 
(like 0xFF 0x0A: replaced by ? instead of ?\n using the replace error handler).

I don't think that you can/should patch only one encoding: we should use the 
same rule for all encodings.
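
To compare several codecs on the same input, something like the following can be used. It is a rough sketch: `CODECS` and `decode_all` are names made up for this example, and the list covers only a few of the encodings mentioned above. An interpreter without the change should show U+FFFD alone for each codec; one with the change (Python 3.3 or later) should show U+FFFD followed by the newline.

```python
# A small, non-exhaustive selection of the CJK codecs discussed here.
CODECS = ['gb2312', 'gbk', 'gb18030', 'big5', 'euc_kr']


def decode_all(data, errors='replace'):
    """Return {codec name: decoded text} for each codec in CODECS."""
    return {name: data.decode(name, errors) for name in CODECS}


results = decode_all(b'\xff\n')
for name, text in results.items():
    # ascii() makes U+FFFD and the newline visible in the output.
    print(name, ascii(text))
```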

By the way, do you have any document explaining which result is the right one (? 
or ?\n)? For UTF-8, we have well-defined standards explaining exactly what to 
do with invalid byte sequences: see issue #8271. It is easy to fix the 
decoders, but I would like to be sure that your proposed change is the right 
way to decode these encodings.

Changing the multibyte encodings can also affect security. Read for example 
the section "Check byte strings before decoding them to character 
strings" of my book:
http://www.haypocalc.com/tmp/unicode-2011-03-25/html/issues.html#check-byte-strings-before-decoding-them-to-character-strings
(https://github.com/haypo/unicode_book/wiki)

--




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-07 Thread zy

zy cdq...@gmail.com added the comment:

I do not have documents on this subject. However, I found that GNU iconv(1) 
behaves the same as my proposed behavior. My reading of the source code 
suggests that iconv(1) treats all encodings equally, which I think should also 
be true for Python.

As for security concerns, I do not think the change in the decoding function 
itself would introduce any security vulnerabilities. If a security issue arises 
because of the proposed change, there must be improper code outside of Python, 
which is out of Python's control. That said, the proposed change is unlikely to 
introduce a new security vulnerability, as all it does in effect is retain a 
few ASCII characters in the output instead of removing them. In 
the WordPress issue, if we suppose that WordPress was written in Python and 
that the attacker was using gb2312-encoded strings instead of gbk, then my 
proposed change would by chance fix the issue, as the backslash would be 
retained when we decode the string.
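
The backslash argument can be made concrete with a short sketch. It assumes a Python 3 interpreter with the proposed behaviour (3.3 or later), and the byte string is a hypothetical escaped input of the kind involved in that class of problem: a stray lead byte 0xBF, then a backslash added by an escaping step, then a quote.

```python
# 0xBF, then a backslash (0x5C), then a single quote (0x27).
escaped = b"\xbf\\'"

# In gbk, 0xBF 0x5C is a valid two-byte character, so the backslash is
# absorbed into that character and the quote is left unescaped.
as_gbk = escaped.decode('gbk', 'ignore')

# In gb2312, 0x5C cannot be a trail byte. With the proposed behaviour only
# 0xBF is dropped, so the backslash in front of the quote survives.
as_gb2312 = escaped.decode('gb2312', 'ignore')

print(ascii(as_gbk))
print(ascii(as_gb2312))
```

With the old behaviour the gb2312 decoder would have dropped both 0xBF and the backslash, leaving a bare quote.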

--




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-06 Thread zy

New submission from zy cdq...@gmail.com:

Let s = '\xff\n'.
The expected result of s.decode('gb2312', 'ignore') is u'\n', while in 2.6.6 it 
is u''.
  s can be replaced with chr(m) + chr(n), where m is in the range 128~255 and 
n in 0~127.
  In the above cases, trying to decode from chr(n) will never interfere with 
later parts of the string, if there are any, since chr(n) does not start a 
multibyte sequence.
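
The whole range described above can be checked exhaustively. The following is a sketch in Python 3 syntax (bytes objects instead of 2.x str); `lost_ascii_bytes` is a hypothetical helper name. It lists every pair (m, n) for which decoding with 'ignore' does not give back exactly chr(n).

```python
def lost_ascii_bytes(codec='gb2312'):
    """Return the (m, n) pairs whose ASCII byte n is not preserved."""
    lost = []
    for m in range(128, 256):      # non-ASCII first byte
        for n in range(0, 128):    # ASCII second byte
            decoded = bytes([m, n]).decode(codec, 'ignore')
            if decoded != chr(n):
                lost.append((m, n))
    return lost


# An empty list means every ASCII byte survived the invalid byte before it.
print(len(lost_ascii_bytes()))
```

An interpreter with the behaviour reported here would return a non-empty list, since the ASCII byte is dropped along with the byte before it.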

--
components: Unicode
messages: 135268
nosy: cdqzzy
priority: normal
severity: normal
status: open
title: Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')
type: behavior
versions: Python 2.6




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-06 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +ezio.melotti
stage:  -> test needed
versions: +Python 2.7 -Python 2.6




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-06 Thread Éric Araujo

Changes by Éric Araujo mer...@netwok.org:


--
nosy: +haypo, lemburg
