Re: [HACKERS] XML Issue with DTDs

2013-12-29 Thread Florian Pflug
On Dec26, 2013, at 21:30 , Florian Pflug f...@phlo.org wrote:
 On Dec23, 2013, at 18:39 , Peter Eisentraut pete...@gmx.net wrote:
 On 12/19/13, 6:40 PM, Florian Pflug wrote:
 The following example fails for XMLOPTION set to DOCUMENT as well as for 
 XMLOPTION set to CONTENT.
 
 select xmlconcat(
   xmlparse(document '!DOCTYPE test [!ELEMENT test EMPTY]test/'),
   xmlparse(content 'test/')
 )::text::xml;
 
 The SQL standard specifies that DTDs are dropped by xmlconcat.  It's
 just not implemented.
 
 OK, cool, I'll try to figure out how to do that with libxml

Hm, I've read through the (draft) SQL/XML 2003 standard, and it seems that
it mandates more processing of DTDs than we currently do. In particular, it
says that attribute default values and custom entities are to be expanded
by xmlparse(). Without doing that, stripping the DTD can change the meaning
of an XML document, or make it not well-formed (in the case of custom
entity definitions). So I think that we unless we implement that, I we have
to raise an error, not silently strip the DTD. 

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XML Issue with DTDs

2013-12-26 Thread Florian Pflug
On Dec23, 2013, at 03:45 , Robert Haas robertmh...@gmail.com wrote:
 On Fri, Dec 20, 2013 at 8:16 PM, Florian Pflug f...@phlo.org wrote:
 On Dec20, 2013, at 18:52 , Robert Haas robertmh...@gmail.com wrote:
 On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug f...@phlo.org wrote:
 Solving this seems a bit messy, unfortunately. First, I think we need to 
 have some XMLOPTION value which is a superset of all the others - 
 otherwise, dump  restore won't work reliably. That means either allowing 
 DTDs if XMLOPTION is CONTENT, or inventing a third XMLOPTION, say ANY.
 
 Or we can just decide that it was a bug that this was ever allowed,
 and if you upgrade to $FIXEDVERSION you'll need to sanitize your data.
 This is roughly what we did with encoding checks.
 
 What exactly do you suggest we outlaw?
 
 !DOCTYPE anywhere but at the beginning.

I think we're talking past one another here. Fixing XMLCONCAT/XMLAGG
to not produce XML values which are neither valid DOCUMENTS nor valid
CONTENT fixes *one* part of the problem.

The other part of the problem is that since not every DOCUMENT
is valid CONTENT (because CONTENT forbids DTDs) and not every CONTENT
is a valid DOCUMENT (because DOCUMENT forbids multiple root nodes), it's
impossible to set XMLOPTION to a value which accepts *all* valid XML
values. That breaks pg_dump/pg_restore. To fix this, we must provide
a way to insert XML data which accepts both DOCUMENTS and CONTENT, and
not only one or the other. Due to the way COPY works, we cannot call
a special conversion function, so we must modify the input functions.

My initial thought was to simply allow XML values which are CONTENT,
not DOCUMENTS, to contain a DTD (at the beginning), thus making CONTENT
a superset of DOCUMENT. But I've since then realized that the 2003
standard explicitly constrains CONTENT to *not* contain a DTD. The
only other option that I can see is to invert a third, non-standard
XMLOPTION value, ANY. ANY would accept anything accepted by either
DOCUMENT or CONTENT, but no more than that.

best regards,
Florian Pflug






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XML Issue with DTDs

2013-12-26 Thread Florian Pflug
On Dec23, 2013, at 18:39 , Peter Eisentraut pete...@gmx.net wrote:
 On 12/19/13, 6:40 PM, Florian Pflug wrote:
 The following example fails for XMLOPTION set to DOCUMENT as well as for 
 XMLOPTION set to CONTENT.
 
  select xmlconcat(
xmlparse(document '!DOCTYPE test [!ELEMENT test EMPTY]test/'),
xmlparse(content 'test/')
  )::text::xml;
 
 The SQL standard specifies that DTDs are dropped by xmlconcat.  It's
 just not implemented.

OK, cool, I'll try to figure out how to do that with libxml

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XML Issue with DTDs

2013-12-23 Thread Peter Eisentraut
On 12/19/13, 6:40 PM, Florian Pflug wrote:
 The following example fails for XMLOPTION set to DOCUMENT as well as for 
 XMLOPTION set to CONTENT.
 
   select xmlconcat(
 xmlparse(document '!DOCTYPE test [!ELEMENT test EMPTY]test/'),
 xmlparse(content 'test/')
   )::text::xml;

The SQL standard specifies that DTDs are dropped by xmlconcat.  It's
just not implemented.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XML Issue with DTDs

2013-12-22 Thread Robert Haas
On Fri, Dec 20, 2013 at 8:16 PM, Florian Pflug f...@phlo.org wrote:
 On Dec20, 2013, at 18:52 , Robert Haas robertmh...@gmail.com wrote:
 On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug f...@phlo.org wrote:
 Solving this seems a bit messy, unfortunately. First, I think we need to 
 have some XMLOPTION value which is a superset of all the others - 
 otherwise, dump  restore won't work reliably. That means either allowing 
 DTDs if XMLOPTION is CONTENT, or inventing a third XMLOPTION, say ANY.

 Or we can just decide that it was a bug that this was ever allowed,
 and if you upgrade to $FIXEDVERSION you'll need to sanitize your data.
 This is roughly what we did with encoding checks.

 What exactly do you suggest we outlaw?

!DOCTYPE anywhere but at the beginning.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XML Issue with DTDs

2013-12-20 Thread Robert Haas
On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug f...@phlo.org wrote:
 While looking into ways to implement a XMLSTRIP function which extracts the 
 textual contents of an XML value and de-escapes them (i.e.  Solving this 
 seems a bit messy, unfortunately. First, I think we need to have some 
 XMLOPTION value which is a superset of all the others - otherwise, dump  
 restore won't work reliably. That means either allowing DTDs if XMLOPTION is 
 CONTENT, or inventing a third XMLOPTION, say ANY.

Or we can just decide that it was a bug that this was ever allowed,
and if you upgrade to $FIXEDVERSION you'll need to sanitize your data.
 This is roughly what we did with encoding checks.

 We then need to ensure that combining XML values yields something that is 
 valid according to the most general XMLOPTION setting. That means either

 (1) Removing the DTD from all but the first argument to XMLCONCAT, and 
 similarly all but the first value passed to XMLAGG

 or

 (2) Complaining if these values contain a DTD.

 or

 (3) Allowing multiple DTDs in a document if XMLOPTION is, say, ANY.

 I'm not in favour of (3), since clients are unlikely to be able to process 
 such a value. (1) matches how we currently handle XML declarations (?xml 
 …?), so I'm slightly in favour of that.

I don't like #3, mostly because I don't like XMLOPTION ANY in the
first place.  Either #1 or #2 sounds OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XML Issue with DTDs

2013-12-20 Thread Florian Pflug
On Dec20, 2013, at 18:52 , Robert Haas robertmh...@gmail.com wrote:
 On Thu, Dec 19, 2013 at 6:40 PM, Florian Pflug f...@phlo.org wrote:
 Solving this seems a bit messy, unfortunately. First, I think we need to 
 have some XMLOPTION value which is a superset of all the others - otherwise, 
 dump  restore won't work reliably. That means either allowing DTDs if 
 XMLOPTION is CONTENT, or inventing a third XMLOPTION, say ANY.
 
 Or we can just decide that it was a bug that this was ever allowed,
 and if you upgrade to $FIXEDVERSION you'll need to sanitize your data.
 This is roughly what we did with encoding checks.

What exactly do you suggest we outlaw? If there are XML values which
are CONTENT but not a DOCUMENT, and other values which are a DOCUMENT
but not CONTENT, then what is pg_restore supposed to set XMLOPTION
to?

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] XML Issue with DTDs

2013-12-19 Thread Florian Pflug
Hi,

While looking into ways to implement a XMLSTRIP function which extracts the 
textual contents of an XML value and de-escapes them (i.e. replaces entity 
references by their text equivalent), I've ran into another issue with the XML 
type.

XML values can either contain a DOCUMENT or CONTENT. In the first case, the 
value is well-formed XML according to the XML specification. In the latter 
case, the value is a collection of nodes, each of which may contain children. 
Without DTDs in the mix, CONTENT is thus a generalization of DOCUMENT, i.e. a 
DOCUMENT may contain only a single root node while a CONTENT may contain 
multiple. That guarantees that a concatenation of two XML values is always at 
least valid CONTENT. That, however, is no longer true once DTDs enter the 
picture. A DOCUMENT may contain a DTD as long as it precedes the root node 
(processing instructions and comments may precede the DTD, though). Yet CONTENT 
may not include a DTD at all. A concatenation of a DOCUMENT with a DTD and 
CONTENT thus yields something that is neither a DOCUMENT nor a CONTENT, yet 
XMLCONCAT fails to complain. The following example fails for XMLOPTION set to 
DOCUMENT as well as for XMLOPTION set to CONTENT.

  select xmlconcat(
xmlparse(document '!DOCTYPE test [!ELEMENT test EMPTY]test/'),
xmlparse(content 'test/')
  )::text::xml;

Solving this seems a bit messy, unfortunately. First, I think we need to have 
some XMLOPTION value which is a superset of all the others - otherwise, dump  
restore won't work reliably. That means either allowing DTDs if XMLOPTION is 
CONTENT, or inventing a third XMLOPTION, say ANY.

We then need to ensure that combining XML values yields something that is valid 
according to the most general XMLOPTION setting. That means either 

(1) Removing the DTD from all but the first argument to XMLCONCAT, and 
similarly all but the first value passed to XMLAGG

or

(2) Complaining if these values contain a DTD. 

or 

(3) Allowing multiple DTDs in a document if XMLOPTION is, say, ANY.

I'm not in favour of (3), since clients are unlikely to be able to process such 
a value. (1) matches how we currently handle XML declarations (?xml …?), so 
I'm slightly in favour of that.

Thoughts?

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers