On Wed, Apr 27, 2011 at 03:05:30PM -0400, Andrew Dunstan wrote:
> On 04/26/2011 05:11 PM, Noah Misch wrote:
>> On Mon, Apr 25, 2011 at 07:25:02PM -0400, Andrew Dunstan wrote:
>>> I came across this today, while helping a customer. The following will
>>> happily create a piece of XML with an embedded ^A:
>>>
>>>     select xmlelement(name foo, null, E'abc\x01def');
>>>
>>> Now, a ^A is totally forbidden in XML version 1.0, and allowed but only
>>> as "" or equivalent in XML version 1.1, and not as a 0x01 byte
>>> (see<http://en.wikipedia.org/wiki/XML#Valid_characters>)
>>>
>>> ISTM this is something we should definitely try to fix ASAP, even if we
>>> probably can't backpatch the fix.
>> +1.  Given that such a datum breaks dump+reload, it seems risky to do 
>> nothing at
>> all in the back branches.
>
> Here's a patch along the lines suggested by Peter.
>
> I'm not sure what to do about the back branches and cases where data is  
> already in databases. This is fairly ugly. Suggestions welcome.

We could provide a script in (or linked from) the release notes for testing the
data in all your xml columns.

To make things worse, the dump/reload problems seems to depend on your version
of libxml2, or something.  With git master, a CentOS 5 system with
2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with
2.6.31.dfsg-2ubuntu rejects it.  Even with a patch like this, systems with a
lenient libxml2 will be liable to store XML data that won't restore on a system
with a strict libxml2.  Perhaps we should emit a build-time warning if the local
libxml2 is lenient?

> *** a/src/backend/utils/adt/xml.c
> --- b/src/backend/utils/adt/xml.c
> ***************
> *** 1844,1850 **** escape_xml(const char *str)
> --- 1844,1865 ----
>                       case '\r':
>                               appendStringInfoString(&buf, "&#x0d;");
>                               break;
> +                     case '\n':
> +                     case '\t':
> +                             appendStringInfoCharMacro(&buf, *p);
> +                             break;
>                       default:
> +                             /* 
> +                              * Any control char we haven't already 
> explicitly handled
> +                              * (i.e. TAB, NL and CR)is an error. 
> +                              * If we ever support XML 1.1 they will be 
> allowed,
> +                              * but will have to be escaped.
> +                              */
> +                             if (*p < '\x20')

This needs to be an unsigned comparison.  On my system, "char" is signed, so
"SELECT xmlelement(name foo, null, E'\u0550')" fails incorrectly.

The XML character set forbids more than just control characters; see
http://www.w3.org/TR/xml/#charsets.  We also ought to reject, for example,
"SELECT xmlelement(name foo, null, E'\ufffe')".

Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment"
and "xmlpi" still let the invalid byte through.  You can also still inject the
byte into an attribute value via "xmlelement".  I wonder if it wouldn't make
more sense to just pass any XML that we generate from scratch through libxml2.
There are a lot of holes to plug, otherwise.

> +                                     ereport(ERROR,
> +                                                     
> (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
> +                                                      errmsg("character out 
> of range"),
> +                                                      errdetail("XML does 
> not support control characters.")));
>                               appendStringInfoCharMacro(&buf, *p);
>                               break;
>               }


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to