On Sun, May 08, 2011 at 06:25:27PM -0400, Andrew Dunstan wrote: > On 04/27/2011 11:41 PM, Noah Misch wrote: >> On Wed, Apr 27, 2011 at 11:22:37PM -0400, Andrew Dunstan wrote: >>> On 04/27/2011 05:30 PM, Noah Misch wrote: >>>> To make things worse, the dump/reload problems seems to depend on your >>>> version >>>> of libxml2, or something. With git master, a CentOS 5 system with >>>> 2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system >>>> with >>>> 2.6.31.dfsg-2ubuntu rejects it. Even with a patch like this, systems with >>>> a >>>> lenient libxml2 will be liable to store XML data that won't restore on a >>>> system >>>> with a strict libxml2. Perhaps we should emit a build-time warning if the >>>> local >>>> libxml2 is lenient? >>> No, I think we need to be strict ourselves. >> Then I suppose we'd also scan for invalid characters in xml_parse()? Or, at >> least, do so when linked to a libxml2 that neglects to do so itself? > > Yep.
I see you've gone with doing it unconditionally. I'd lean toward testing the library in pg_xml_init and setting a flag indicating whether we need the extra pass. However, a later patch can always optimize that. >>>> Injecting the check here aids "xmlelement" and "xmlforest" , but >>>> "xmlcomment" >>>> and "xmlpi" still let the invalid byte through. You can also still inject >>>> the >>>> byte into an attribute value via "xmlelement". I wonder if it wouldn't >>>> make >>>> more sense to just pass any XML that we generate from scratch through >>>> libxml2. >>>> There are a lot of holes to plug, otherwise. >>> Maybe there are, but I'd want lots of convincing that we should do that >>> at this stage. Maybe for 9.2. I think we can plug the holes fairly >>> simply for xmlpi and xmlcomment, and catch the attributes by moving this >>> check up into map_sql_value_to_xml_value(). >> I don't have much convincing to offer -- hunting down the holes seem fine, >> too. > > I think I've done that. Here's the patch I have now. It looks like we > can catch pretty much everything by putting checks in four places, which > isn't too bad. > > Please review and try to break. Here are the test cases I tried: -- caught successfully SELECT E'\x01'::xml; SELECT xmlcomment(E'\x01'); SELECT xmlelement(name foo, xmlattributes(E'\x01' AS bar), ''); SELECT xmlelement(name foo, NULL, E'\x01'); SELECT xmlforest(E'\x01' AS foo); SELECT xmlpi(name foo, E'\x01'); SELECT query_to_xml($$SELECT E'\x01'$$, true, false, ''); -- not caught SELECT xmlroot('<root/>', version E'\x01'); SELECT xmlcomment(E'\ufffe'); -- not directly related, but also wrongly accepted SELECT xmlroot('<root/>', version ' '); SELECT xmlroot('<root/>', version 'foo'); Offhand, I don't find libxml2's handling of XML declarations particularly consistent. My copy's xmlCtxtReadDoc() API (used by xml_in when xmloption = document) accepts '<?xml version="foo"?>' but rejects '<?xml version=" "?>'. Its xmlParseBalancedChunkMemory() API (used by xml_in when xmloption = content) accepts anything, even control characters. The XML 1.0 standard is stricter: the version must match ^1\.[0-9]+$. We might want to tighten this at the same time. > diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c > index ee82d46..12cfd56 100644 > --- a/src/backend/utils/adt/xml.c > +++ b/src/backend/utils/adt/xml.c > @@ -142,6 +142,20 @@ static void SPI_sql_row_to_xmlelement(int rownum, > StringInfo result, > #define NAMESPACE_XSI "http://www.w3.org/2001/XMLSchema-instance" > #define NAMESPACE_SQLXML "http://standards.iso.org/iso/9075/2003/sqlxml" > > +/* forbidden C0 control chars */ > +#define FORBIDDEN_C0 \ > + "\x01\x02\x03\x04\x05\x06\x07\x08\x0B\x0C\x0E\x0F\x10\x11" \ > + "\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F" > + > +static inline void > +check_forbidden_c0(char * str) > +{ > + if (strpbrk(str,FORBIDDEN_C0) != NULL) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), > + errmsg("character out of range"), > + errdetail("XML does not support control > characters."))); This would be an errhint, I think. However, the message seems to emphasize the wrong thing. XML 1.0 defines a lexical production called Char that includes various Unicode character ranges. Control characters as we know them happen to not fall in any of those ranges. The characters aren't unsupported in the sense of being missing features; the language simply forbids them. libxml2's error message for this case is "PCDATA invalid Char value 1" (assuming \x01). Mentioning PCDATA seems redundant, since no other context offers greater freedom. How about: ereport(ERROR, (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), errmsg("invalid XML 1.0 Char \\U%08x", char_val))); nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers