Hi,

While looking into ways to implement an XMLSTRIP function which extracts the 
textual contents of an XML value and de-escapes them (i.e. replaces entity 
references by their text equivalent), I've run into another issue with the XML 
type.

XML values can either contain a DOCUMENT or CONTENT. In the first case, the 
value is well-formed XML according to the XML specification. In the latter 
case, the value is a collection of nodes, each of which may contain children. 
Without DTDs in the mix, CONTENT is thus a generalization of DOCUMENT, i.e. a 
DOCUMENT may contain only a single root node while a CONTENT may contain 
multiple. That guarantees that a concatenation of two XML values is always at 
least valid CONTENT. That, however, is no longer true once DTDs enter the 
picture. A DOCUMENT may contain a DTD as long as it precedes the root node 
(processing instructions and comments may precede the DTD, though). Yet CONTENT 
may not include a DTD at all. A concatenation of a DOCUMENT with a DTD and 
CONTENT thus yields something that is neither a DOCUMENT nor a CONTENT, yet 
XMLCONCAT fails to complain. In the following example, XMLCONCAT itself 
succeeds, but the cast of its result back to XML fails for XMLOPTION set to 
DOCUMENT as well as for XMLOPTION set to CONTENT.

  select xmlconcat(
    xmlparse(document '<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>'),
    xmlparse(content '<test/>')
  )::text::xml;
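
To make the two failure modes concrete outside the database, here is a rough 
Python sketch. It is only an analogy and not how the server parses XML: it 
assumes that DOCUMENT means "well-formed on its own" and that CONTENT can be 
approximated as "well-formed once wrapped in a dummy root element". The helper 
names and the <wrapper> element are made up for illustration, and the wrapping 
trick is naive (a value containing a stray </wrapper> would fool it).

```python
from xml.parsers.expat import ExpatError, ParserCreate


def _well_formed(text: str) -> bool:
    """Return True if expat accepts `text` as a complete XML document."""
    parser = ParserCreate()
    try:
        parser.Parse(text, True)  # True marks this as the final chunk
        return True
    except ExpatError:
        return False


def is_document(value: str) -> bool:
    # Assumption: DOCUMENT == well-formed XML document (optional DTD, one root).
    return _well_formed(value)


def is_content(value: str) -> bool:
    # Assumption: CONTENT == a node sequence that is well-formed once wrapped
    # in a dummy root. A DTD is not allowed inside an element, so it is rejected.
    return _well_formed("<wrapper>" + value + "</wrapper>")


doc = "<!DOCTYPE test [<!ELEMENT test EMPTY>]><test/>"
content = "<test/>"
# Assumption: the concatenation is modeled as plain textual concatenation.
combined = doc + content

for label, value in (("doc", doc), ("content", content), ("combined", combined)):
    print(label, "document:", is_document(value), "content:", is_content(value))
```

Under these assumptions the combined value is rejected by both checks: as a 
DOCUMENT because of the second root node, and as CONTENT because of the DTD.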

Solving this seems a bit messy, unfortunately. First, I think we need to have 
some XMLOPTION value which is a superset of all the others - otherwise, dump & 
restore won't work reliably. That means either allowing DTDs if XMLOPTION is 
CONTENT, or inventing a third XMLOPTION, say ANY.

We then need to ensure that combining XML values yields something that is valid 
according to the most general XMLOPTION setting. That means either 

(1) Removing the DTD from all but the first argument to XMLCONCAT, and 
similarly all but the first value passed to XMLAGG

or

(2) Complaining if any of these values (i.e. any but the first) contains a DTD.

or 

(3) Allowing multiple DTDs in a document if XMLOPTION is, say, ANY.

I'm not in favour of (3), since clients are unlikely to be able to process such 
a value. (1) matches how we currently handle XML declarations (<?xml …?>), so 
I'm slightly in favour of that.

Thoughts?

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
