[issue9804] ascii() does not always join surrogate pairs

2010-09-09 Thread Antoine Pitrou
Antoine Pitrou added the comment: Modified patch committed in r84655 (3.x) and r84656 (3.1). Thanks! -- resolution: -> fixed stage: needs patch -> committed/rejected status: open -> closed

[issue9804] ascii() does not always join surrogate pairs

2010-09-09 Thread Amaury Forgeot d'Arc
Amaury Forgeot d'Arc added the comment:

I agree with the feature and the patch, with two minor nits:
- Py_UCS4 should be used in place of "unsigned long"
- "*p >= 0xD800" is the most selective test and should be the first

-- nosy: +amaury.forgeotdarc
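
For readers who want the arithmetic behind "joining" a surrogate pair, here is a minimal Python sketch. It is not the C code from the patch; the helper names `join_surrogates` and `fuse_pairs` are hypothetical and exist only for illustration. The range check on the first unit starts with the `>= 0xD800` comparison, in the spirit of the second nit above.

```python
# Illustrative sketch only: these helpers are hypothetical, not part of the patch.

def join_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF):
        raise ValueError("not a high surrogate: %#x" % high)
    if not (0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a low surrogate: %#x" % low)
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def fuse_pairs(units):
    """Turn a list of UTF-16 code units into code points.

    Valid pairs are fused; lone surrogates are passed through unchanged.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        # Check the lower bound first: most code units fail it immediately.
        if (u >= 0xD800 and u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            out.append(join_surrogates(u, units[i + 1]))
            i += 2
        else:
            out.append(u)
            i += 1
    return out


print(hex(join_surrogates(0xD834, 0xDD21)))
```

The pair 0xD834, 0xDD21 is the one shown in the original report for U+1D121.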

[issue9804] ascii() does not always join surrogate pairs

2010-09-09 Thread Antoine Pitrou
Antoine Pitrou added the comment: New patch with tests. -- Added file: http://bugs.python.org/file18810/backslashsurrogates2.patch

[issue9804] ascii() does not always join surrogate pairs

2010-09-08 Thread STINNER Victor
STINNER Victor added the comment:

> >>> s = "'\0\"\n\r\t abcd\x85é\U00012fff\U0001D121xxx\uD800."
> (...)
> (I think I've included everything:
> - normal chars
> - control chars
> - one-byte non-ASCII
> - two-byte non-ASCII (and lone surrogate)
> - printable and non-printable surrogate pairs)
>

[issue9804] ascii() does not always join surrogate pairs

2010-09-08 Thread Antoine Pitrou
Antoine Pitrou added the comment: Here is a patch (lacking tests for now). -- keywords: +patch Added file: http://bugs.python.org/file18805/backslashsurrogates.patch

[issue9804] ascii() does not always join surrogate pairs

2010-09-08 Thread Antoine Pitrou
Antoine Pitrou added the comment:

Or perhaps not, since we would like surrogate pairs to be fused in other cases (ascii() of other types) as well. So "backslashreplace" would need to be changed instead:

>>> print("\U00012345".encode('ascii', 'backslashreplace'))
b'\\ud808\\udf45'

Expected re
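
As a point of comparison, the snippet below shows what the "backslashreplace" error handler produces on an interpreter where a str stores full code points. This is a sketch under the assumption of CPython 3.3 or later, which has no UCS-2 builds; it does not reproduce the narrow-build output quoted above.

```python
# Assumes CPython 3.3+ (no narrow/UCS-2 builds).
encoded = "\U00012345".encode("ascii", "backslashreplace")
print(encoded)

# A lone surrogate has no partner to fuse with, so it keeps its \uXXXX form.
lone = "\ud800".encode("ascii", "backslashreplace")
print(lone)
```

A non-BMP character comes out as one eight-digit `\U` escape rather than two `\u` escapes.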

[issue9804] ascii() does not always join surrogate pairs

2010-09-08 Thread Antoine Pitrou
Antoine Pitrou added the comment: Actually, it would probably be simpler to export a _PyUnicode_Repr(PyUnicodeObject *, int only_ascii) function since all the code is already there in unicodeobject.c.

[issue9804] ascii() does not always join surrogate pairs

2010-09-08 Thread Antoine Pitrou
Antoine Pitrou added the comment:

How about the following solution:

>>> def a(s):
...     s = s.encode('unicode-escape').decode('ascii')
...     s = s.replace("'", r"\'")
...     return "'" + s + "'"
...
>>> s = "'\0\"\n\r\t abcd\x85é\U00012fff\U0001D121xxx\uD800."
>>> print(ascii(s)); print(a(s)
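
The idea above can be tried as a standalone script. This is a sketch assuming CPython 3.3 or later; the function name `ascii_via_unicode_escape` is made up for this example, and the comparison against the built-in ascii() only illustrates behaviour on such builds, not the UCS-2 build discussed in this issue.

```python
def ascii_via_unicode_escape(s):
    """Escape every non-ASCII or control character, then wrap in single quotes."""
    escaped = s.encode("unicode-escape").decode("ascii")
    # unicode-escape leaves quote characters alone, so escape ' by hand.
    escaped = escaped.replace("'", r"\'")
    return "'" + escaped + "'"


# Test string from the thread: it contains both quote characters.
s = "'\0\"\n\r\t abcd\x85é\U00012fff\U0001D121xxx\uD800."
print(ascii(s))
print(ascii_via_unicode_escape(s))
```

One difference to keep in mind: this sketch always uses single quotes, whereas repr() and ascii() switch to double quotes for a string that contains ' but no ", so the two only agree when that quote switch does not happen.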

[issue9804] ascii() does not always join surrogate pairs

2010-09-08 Thread STINNER Victor
STINNER Victor added the comment: For unicode, ascii(x) is implemented as repr(x).encode('ascii', 'backslashreplace').decode('ascii'). repr(x) is "'" + x + "'" for printable characters (eg. U+1D121), and "'U+%08x'" % ord(x) for not printable characters (eg. U+12FFF). About the unexpected out
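
The equivalence described at the start of this comment can be checked directly. The following is a small sketch assuming CPython 3.3 or later; `ascii_via_repr` is a hypothetical name used only here.

```python
def ascii_via_repr(x):
    """Spell out ascii(x) as repr() followed by a backslashreplace round trip."""
    return repr(x).encode("ascii", "backslashreplace").decode("ascii")


samples = ["abc", "\x85\xe9", "\U00012fff", "\U0001D121", "\ud800", "it's"]
for sample in samples:
    # Both columns should print the same text for each sample.
    print(ascii(sample), ascii_via_repr(sample))
```

On a UCS-2 build such as the one in the report, the encode step saw a printable non-BMP character as two separate surrogate code units, which is consistent with the `\ud834\udd21` output in the original submission.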

[issue9804] ascii() does not always join surrogate pairs

2010-09-08 Thread Antoine Pitrou
New submission from Antoine Pitrou:

This is on a UCS-2 py3k build:

>>> ascii('\U00012FFF')
"'\\U00012fff'"
>>> ascii('\U0001D121')
"'\\ud834\\udd21'"

--
components: Interpreter Core, Unicode
messages: 115905
nosy: haypo, pitrou
priority: normal
severity: normal
stage: needs patch
stat