[issue15866] encode(..., 'xmlcharrefreplace') produces entities for surrogate pairs

Serhiy Storchaka Mon, 04 Mar 2013 05:20:40 -0800

Serhiy Storchaka added the comment:

I prefer a slightly different form (simpler for me):


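                for (p = collstart; p < collend;) {
                    Py_UCS4 ch = *p++;
                    /* Combine a high surrogate followed by a low surrogate
                       into a single code point. */
                    if ((0xD800 <= ch && ch <= 0xDBFF) &&
                        (p < collend) &&
                        (0xDC00 <= *p && *p <= 0xDFFF)) {
                        ch = ((((ch & 0x03FF) << 10) |
                               ((Py_UCS4)*p++ & 0x03FF)) + 0x10000);
                    }
                    /* Emit the decimal character reference. */
                    str += sprintf(str, "&#%d;", (int)ch);
                }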

Please also look at the loop above ("determine replacement size"); it should be 
corrected too. It would be simpler to use a statically sized buffer (``char 
buffer[2+29+1+1];``), as in the charmap encoder. Perhaps the charmap encoder 
should be fixed too (with the common code extracted into a separate function).
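
As a rough illustration of the fixed-buffer idea, the sketch below formats each 
reference into a small buffer and sums the lengths, combining surrogate pairs 
the same way as the loop above. It is standalone C, not CPython code: the names 
CHARREF_BUFSIZE, format_charref and replacement_size are hypothetical, and 
uint32_t stands in for Py_UCS4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Same size as the buffer quoted above:
   "&#" + up to 29 digits + ";" + terminating NUL. */
#define CHARREF_BUFSIZE (2 + 29 + 1 + 1)

/* Hypothetical helper: format one code point as a decimal character
   reference and return its length, excluding the terminating NUL. */
static int format_charref(char buffer[CHARREF_BUFSIZE], uint32_t ch)
{
    int n = snprintf(buffer, CHARREF_BUFSIZE, "&#%lu;", (unsigned long)ch);
    assert(n > 0 && n < CHARREF_BUFSIZE);
    return n;
}

/* Hypothetical helper: total number of bytes needed to replace the
   units in [p, end), treating a high surrogate followed by a low
   surrogate as one code point. */
static size_t replacement_size(const uint32_t *p, const uint32_t *end)
{
    char buffer[CHARREF_BUFSIZE];
    size_t total = 0;

    while (p < end) {
        uint32_t ch = *p++;
        if (0xD800 <= ch && ch <= 0xDBFF &&
            p < end &&
            0xDC00 <= *p && *p <= 0xDFFF) {
            ch = (((ch & 0x03FF) << 10) | (*p++ & 0x03FF)) + 0x10000;
        }
        total += (size_t)format_charref(buffer, ch);
    }
    return total;
}
```

Because the length comes from the formatting call itself, the sizing pass and 
the output pass cannot disagree about how many digits a code point needs.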

I have doubts about '\ud83d\udc9d' on a wide build. Is it right to encode it as 
b'&#128157;' and not as b'&#55357;&#56477;'? This would be compatible with the 
narrow build but would break compatibility with 3.3+. Which is the lesser evil?
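
To make the two candidate outputs concrete, here is a minimal standalone 
sketch of the loop with the pair-combining step made optional. It is not 
CPython code: encode_charrefs and its combine_pairs flag are hypothetical, and 
uint32_t stands in for Py_UCS4 as stored in a wide build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write decimal character references for the units
   in [p, end) to out and return the number of bytes written, excluding
   the terminating NUL. out must have room for 13 bytes per unit plus one.
   If combine_pairs is nonzero, a high surrogate followed by a low
   surrogate is emitted as a single code point. */
static size_t encode_charrefs(const uint32_t *p, const uint32_t *end,
                              char *out, int combine_pairs)
{
    char *str = out;

    assert(p != NULL && end != NULL && out != NULL);
    *str = '\0';
    while (p < end) {
        uint32_t ch = *p++;
        if (combine_pairs &&
            0xD800 <= ch && ch <= 0xDBFF &&
            p < end &&
            0xDC00 <= *p && *p <= 0xDFFF) {
            ch = (((ch & 0x03FF) << 10) | (*p++ & 0x03FF)) + 0x10000;
        }
        str += sprintf(str, "&#%lu;", (unsigned long)ch);
    }
    return (size_t)(str - out);
}
```

For the units 0xD83D, 0xDC9D, calling this with combine_pairs set to 1 writes 
&#128157; (9 bytes), and with 0 it writes &#55357;&#56477; (16 bytes), which 
are the two encodings being compared.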

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue15866>
_______________________________________
