Re: [HACKERS] XML Issue with DTDs
On Dec26, 2013, at 21:30, Florian Pflug <f...@phlo.org> wrote:
> On Dec23, 2013, at 18:39, Peter Eisentraut <pete...@gmx.net> wrote:
>> On 12/19/13, 6:40 PM, Florian Pflug wrote:
>>> The following example fails for XMLOPTION set to DOCUMENT as well as
>>> for XMLOPTION set to CONTENT.
>>>
>>>   select xmlconcat(
>>>     xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
>>>     xmlparse(content '<test/>')
>>>   )::text::xml;
>>
>> The SQL standard specifies that DTDs are dropped by xmlconcat. It's
>> just not implemented.
>
> OK, cool, I'll try to figure out how to do that with libxml

Hm, I've read through the (draft) SQL/XML 2003 standard, and it seems that it mandates more processing of DTDs than we currently do. In particular, it says that attribute default values and custom entities are to be expanded by xmlparse(). Without doing that, stripping the DTD can change the meaning of an XML document, or make it not well-formed (in the case of custom entity definitions).

So I think that unless we implement that, we have to raise an error, not silently strip the DTD.

best regards,
Florian Pflug

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
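The point about custom entities can be illustrated outside PostgreSQL. The following is a minimal sketch that uses Python's standard-library parser (not libxml, and not PostgreSQL's xmlparse()); the entity name `greeting` and the helper `parse_text` are made up for illustration. It only shows that a document relying on a DTD-defined entity stops being well-formed once the DTD is removed without expanding the entity first.

```python
import xml.etree.ElementTree as ET

# Hypothetical example document: the DTD defines a custom entity
# that the body then references.
DTD = '<!DOCTYPE test [<!ENTITY greeting "hello">]>'
BODY = "<test>&greeting;</test>"


def parse_text(xml_text):
    """Return the root element's text, or None if the input is not well-formed."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None


# With the DTD present, the parser can resolve the entity reference.
with_dtd = parse_text(DTD + BODY)

# With the DTD silently stripped, the reference is undefined and the
# parse fails, i.e. the value is no longer well-formed.
stripped = parse_text(BODY)

print("with DTD:    ", with_dtd)
print("DTD stripped:", stripped)
```

Expanding entity references before dropping the DTD, as the standard is said to require above, would avoid this failure mode; raising an error is the conservative alternative.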
Re: [HACKERS] XML Issue with DTDs
On Dec23, 2013, at 03:45, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Dec 20, 2013 at 8:16 PM, Florian Pflug <f...@phlo.org> wrote:
>> On Dec20, 2013, at 18:52, Robert Haas <robertmh...@gmail.com> wrote:
>>> On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug <f...@phlo.org> wrote:
>>>> Solving this seems a bit messy, unfortunately. First, I think we need
>>>> to have some XMLOPTION value which is a superset of all the others -
>>>> otherwise, dump and restore won't work reliably. That means either
>>>> allowing DTDs if XMLOPTION is CONTENT, or inventing a third
>>>> XMLOPTION, say ANY.
>>>
>>> Or we can just decide that it was a bug that this was ever allowed,
>>> and if you upgrade to $FIXEDVERSION you'll need to sanitize your
>>> data. This is roughly what we did with encoding checks.
>>
>> What exactly do you suggest we outlaw?
>
> <!DOCTYPE> anywhere but at the beginning.

I think we're talking past one another here. Fixing XMLCONCAT/XMLAGG to not produce XML values which are neither valid DOCUMENTS nor valid CONTENT fixes *one* part of the problem.

The other part of the problem is that since not every DOCUMENT is valid CONTENT (because CONTENT forbids DTDs) and not every CONTENT is a valid DOCUMENT (because DOCUMENT forbids multiple root nodes), it's impossible to set XMLOPTION to a value which accepts *all* valid XML values. That breaks pg_dump/pg_restore.

To fix this, we must provide a way to insert XML data which accepts both DOCUMENTS and CONTENT, and not only one or the other. Due to the way COPY works, we cannot call a special conversion function, so we must modify the input functions.

My initial thought was to simply allow XML values which are CONTENT, not DOCUMENTS, to contain a DTD (at the beginning), thus making CONTENT a superset of DOCUMENT. But I've since realized that the 2003 standard explicitly constrains CONTENT to *not* contain a DTD.

The only other option that I can see is to invent a third, non-standard XMLOPTION value, ANY. ANY would accept anything accepted by either DOCUMENT or CONTENT, but no more than that.

best regards,
Florian Pflug
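The dump/restore argument can be written down as a small set-membership model. This is a toy sketch, not PostgreSQL code: `XmlValue`, `accepted` and `restorable` are hypothetical names, and a value is reduced to the two properties the discussion turns on (whether it carries a leading DTD, and how many top-level elements it has).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XmlValue:
    """Toy summary of an XML value (hypothetical, for illustration only)."""
    has_dtd: bool     # value starts with a DTD
    root_count: int   # number of top-level element nodes


def is_document(v: XmlValue) -> bool:
    # DOCUMENT: exactly one root element; a leading DTD is allowed.
    return v.root_count == 1


def is_content(v: XmlValue) -> bool:
    # CONTENT: any number of top-level nodes, but no DTD.
    return not v.has_dtd


def accepted(option: str, v: XmlValue) -> bool:
    if option == "DOCUMENT":
        return is_document(v)
    if option == "CONTENT":
        return is_content(v)
    if option == "ANY":
        # The proposed third option: the union of the two, and no more.
        return is_document(v) or is_content(v)
    raise ValueError(option)


def restorable(option: str, values) -> bool:
    """True if every dumped value would be accepted under this one setting."""
    return all(accepted(option, v) for v in values)


doc_with_dtd = XmlValue(has_dtd=True, root_count=1)   # DOCUMENT, not CONTENT
fragment = XmlValue(has_dtd=False, root_count=2)      # CONTENT, not DOCUMENT
dump = [doc_with_dtd, fragment]

for option in ("DOCUMENT", "CONTENT", "ANY"):
    print(option, restorable(option, dump))
```

In this model only ANY restores a table holding both kinds of value, while a value with a DTD and two root elements is still rejected by all three settings, which is the "no more than that" part.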
Re: [HACKERS] XML Issue with DTDs
On Dec23, 2013, at 18:39, Peter Eisentraut <pete...@gmx.net> wrote:
> On 12/19/13, 6:40 PM, Florian Pflug wrote:
>> The following example fails for XMLOPTION set to DOCUMENT as well as
>> for XMLOPTION set to CONTENT.
>>
>>   select xmlconcat(
>>     xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
>>     xmlparse(content '<test/>')
>>   )::text::xml;
>
> The SQL standard specifies that DTDs are dropped by xmlconcat. It's
> just not implemented.

OK, cool, I'll try to figure out how to do that with libxml.

best regards,
Florian Pflug
Re: [HACKERS] XML Issue with DTDs
On 12/19/13, 6:40 PM, Florian Pflug wrote:
> The following example fails for XMLOPTION set to DOCUMENT as well as
> for XMLOPTION set to CONTENT.
>
>   select xmlconcat(
>     xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
>     xmlparse(content '<test/>')
>   )::text::xml;

The SQL standard specifies that DTDs are dropped by xmlconcat. It's just not implemented.
Re: [HACKERS] XML Issue with DTDs
On Fri, Dec 20, 2013 at 8:16 PM, Florian Pflug <f...@phlo.org> wrote:
> On Dec20, 2013, at 18:52, Robert Haas <robertmh...@gmail.com> wrote:
>> On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug <f...@phlo.org> wrote:
>>> Solving this seems a bit messy, unfortunately. First, I think we need
>>> to have some XMLOPTION value which is a superset of all the others -
>>> otherwise, dump and restore won't work reliably. That means either
>>> allowing DTDs if XMLOPTION is CONTENT, or inventing a third
>>> XMLOPTION, say ANY.
>>
>> Or we can just decide that it was a bug that this was ever allowed,
>> and if you upgrade to $FIXEDVERSION you'll need to sanitize your
>> data. This is roughly what we did with encoding checks.
>
> What exactly do you suggest we outlaw?

<!DOCTYPE> anywhere but at the beginning.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] XML Issue with DTDs
On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug <f...@phlo.org> wrote:
> While looking into ways to implement an XMLSTRIP function which extracts
> the textual contents of an XML value and de-escapes them (i.e. [...]
>
> Solving this seems a bit messy, unfortunately. First, I think we need
> to have some XMLOPTION value which is a superset of all the others -
> otherwise, dump and restore won't work reliably. That means either
> allowing DTDs if XMLOPTION is CONTENT, or inventing a third XMLOPTION,
> say ANY.

Or we can just decide that it was a bug that this was ever allowed, and if you upgrade to $FIXEDVERSION you'll need to sanitize your data. This is roughly what we did with encoding checks.

> We then need to ensure that combining XML values yields something that
> is valid according to the most general XMLOPTION setting. That means
> either
>
>   (1) Removing the DTD from all but the first argument to XMLCONCAT,
>       and similarly all but the first value passed to XMLAGG, or
>   (2) Complaining if these values contain a DTD, or
>   (3) Allowing multiple DTDs in a document if XMLOPTION is, say, ANY.
>
> I'm not in favour of (3), since clients are unlikely to be able to
> process such a value. (1) matches how we currently handle XML
> declarations (<?xml …?>), so I'm slightly in favour of that.

I don't like #3, mostly because I don't like XMLOPTION ANY in the first place. Either #1 or #2 sounds OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] XML Issue with DTDs
On Dec20, 2013, at 18:52, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug <f...@phlo.org> wrote:
>> Solving this seems a bit messy, unfortunately. First, I think we need
>> to have some XMLOPTION value which is a superset of all the others -
>> otherwise, dump and restore won't work reliably. That means either
>> allowing DTDs if XMLOPTION is CONTENT, or inventing a third XMLOPTION,
>> say ANY.
>
> Or we can just decide that it was a bug that this was ever allowed,
> and if you upgrade to $FIXEDVERSION you'll need to sanitize your
> data. This is roughly what we did with encoding checks.

What exactly do you suggest we outlaw?

If there are XML values which are CONTENT but not a DOCUMENT, and other values which are a DOCUMENT but not CONTENT, then what is pg_restore supposed to set XMLOPTION to?

best regards,
Florian Pflug
[HACKERS] XML Issue with DTDs
Hi,

While looking into ways to implement an XMLSTRIP function which extracts the textual contents of an XML value and de-escapes them (i.e. replaces entity references by their text equivalent), I've run into another issue with the XML type.

XML values can either contain a DOCUMENT or CONTENT. In the first case, the value is well-formed XML according to the XML specification. In the latter case, the value is a collection of nodes, each of which may contain children. Without DTDs in the mix, CONTENT is thus a generalization of DOCUMENT, i.e. a DOCUMENT may contain only a single root node while a CONTENT may contain multiple. That guarantees that a concatenation of two XML values is always at least valid CONTENT.

That, however, is no longer true once DTDs enter the picture. A DOCUMENT may contain a DTD as long as it precedes the root node (processing instructions and comments may precede the DTD, though). Yet CONTENT may not include a DTD at all. A concatenation of a DOCUMENT with a DTD and CONTENT thus yields something that is neither a DOCUMENT nor a CONTENT, yet XMLCONCAT fails to complain. The following example fails for XMLOPTION set to DOCUMENT as well as for XMLOPTION set to CONTENT.

  select xmlconcat(
    xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
    xmlparse(content '<test/>')
  )::text::xml;

Solving this seems a bit messy, unfortunately. First, I think we need to have some XMLOPTION value which is a superset of all the others - otherwise, dump and restore won't work reliably. That means either allowing DTDs if XMLOPTION is CONTENT, or inventing a third XMLOPTION, say ANY.

We then need to ensure that combining XML values yields something that is valid according to the most general XMLOPTION setting. That means either

  (1) Removing the DTD from all but the first argument to XMLCONCAT,
      and similarly all but the first value passed to XMLAGG, or
  (2) Complaining if these values contain a DTD, or
  (3) Allowing multiple DTDs in a document if XMLOPTION is, say, ANY.

I'm not in favour of (3), since clients are unlikely to be able to process such a value. (1) matches how we currently handle XML declarations (<?xml …?>), so I'm slightly in favour of that.

Thoughts?

best regards,
Florian Pflug
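The "neither a DOCUMENT nor CONTENT" claim can be checked with any XML parser. Below is a rough sketch using Python's standard library rather than PostgreSQL or libxml; the helpers `is_document` and `is_content` are hypothetical, and the CONTENT check is only approximated by wrapping the value in a dummy root element (an assumption of this sketch, not how PostgreSQL implements it).

```python
import xml.etree.ElementTree as ET


def is_document(text: str) -> bool:
    """True if text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def is_content(text: str) -> bool:
    """Approximate CONTENT check: wrap the nodes in a dummy root element.

    A DTD may not appear inside an element, so any value carrying a
    DOCTYPE fails here. XML declarations are ignored in this sketch.
    """
    return is_document("<wrapper>" + text + "</wrapper>")


doc = "<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>"
frag = "<test/>"
combined = doc + frag  # what a plain textual concatenation produces

for label, value in (("doc", doc), ("frag+frag", frag + frag), ("combined", combined)):
    print(label, "document:", is_document(value), "content:", is_content(value))
```

The document with a DTD passes only the DOCUMENT check, two sibling elements pass only the CONTENT check, and the concatenation of a DTD-carrying document with further content passes neither, which is the case the XMLCONCAT example above runs into.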