Re: xmlconcat (was [HACKERS] 9.0 release notes done)

2010-04-04 Thread Takahiro Itagaki

Tom Lane t...@sss.pgh.pa.us wrote:

 Takahiro Itagaki itagaki.takah...@oss.ntt.co.jp writes:
  Can we take the patch for 9.0? The bug is registered as an open item:
  http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items
 
 Given that there are still problems with it, applying the patch for 9.0
 would mean changing the behavior of xmlconcat in 9.0 and then again in
 9.1.  I don't think that's a good idea.  Better to leave it alone until
 we have a full fix.

Ok, I added it in ToDo list, and removed it from 9.0 open items.
better handling of PIs and DTDs in xmlconcat()

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: xmlconcat (was [HACKERS] 9.0 release notes done)

2010-04-02 Thread Takahiro Itagaki

Andrew Dunstan and...@dunslane.net wrote:

 Hmm. OK. Well here is a patch that tries to fix the xmlconcat error, 
 anyway. It seems to work, but maybe could stand a little tightening.

Can we take the patch for 9.0? The bug is registered as an open item:
http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items

As far as the patch, I found there are still two issues even after it applied:

1. A linebreak is added at the line end DOCTYPE exists.
=# SELECT xmlconcat('foo', xmlparse(DOCUMENT '!DOCTYPE htmlhtml/'));
 xmlconcat

 foohtml/+

(1 row)

2. DOCUMENT could have ?xml before DOCTYPE.
=# SELECT xmlconcat('foo', xmlparse(DOCUMENT '?xml version=1.0? 
!DOCTYPE  html html/'));
 xmlconcat
---
 foo
(1 row)

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: xmlconcat (was [HACKERS] 9.0 release notes done)

2010-04-02 Thread Tom Lane
Takahiro Itagaki itagaki.takah...@oss.ntt.co.jp writes:
 Can we take the patch for 9.0? The bug is registered as an open item:
 http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items

Given that there are still problems with it, applying the patch for 9.0
would mean changing the behavior of xmlconcat in 9.0 and then again in
9.1.  I don't think that's a good idea.  Better to leave it alone until
we have a full fix.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: xmlconcat (was [HACKERS] 9.0 release notes done)

2010-03-24 Thread Andrew Dunstan



Peter Eisentraut wrote:

On mån, 2010-03-22 at 19:38 -0400, Andrew Dunstan wrote:
  

But if we are not comfortable about being able to do that safely, I
would be OK with just raising an error if a concatenation is
  

attempted


where one value contains a DTD.  The impact in practice should be
  

low.

  
  
Right. Can you find a way to do that using the libxml API? I haven't 
managed to, and I'm pretty sure I can construct XML that fails every 
simple string search test I can think of, either with a false negative

or a false positive.



The documentation on that is terse as usual.  In any case, you will need
to XML parse the input values, and so you might as well resort to
parsing the output value to see if it is well-formed, which should catch
this mistake and possibly others.

  


Actually, I have come to the conclusion that the biggest problem in this 
area is that we accept XML documents with a leading DOCTYPE node at all. 
Our docs state:


   The xml type can store well-formed documents, as defined by the
   XML standard, as well as content fragments, which are defined by
   the production XMLDecl? content in the XML standard.

A document with a leading DOCTYPE node matches neither of these rules, 
and when we strip the XMLDecl from a piece of XML where it's followed by 
a DOCTYPE node we turn something that is legal XML into something that 
isn't, even by our own (or possibly the standard's) relaxed definition. 
A doctypedecl can only follow an  XMLDecl, see 
http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-prolog-dtd.


So I think we need to go back to the drawing board a bit, rather than 
patch a particular reported error case. But these problems are not at 
all new to 9.0, and coming up to beta as I hope we are is not the time 
for it. I think it will have to wait to 9.1.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: xmlconcat (was [HACKERS] 9.0 release notes done)

2010-03-24 Thread Peter Eisentraut
On ons, 2010-03-24 at 14:51 -0400, Andrew Dunstan wrote:
 Actually, I have come to the conclusion that the biggest problem in
 this 
 area is that we accept XML documents with a leading DOCTYPE node at
 all. 
 Our docs state:
 
 The xml type can store well-formed documents, as defined by the
 XML standard, as well as content fragments, which are defined by
 the production XMLDecl? content in the XML standard.
 
 A document with a leading DOCTYPE node matches neither of these
 rules, 
 and when we strip the XMLDecl from a piece of XML where it's followed
 by 
 a DOCTYPE node we turn something that is legal XML into something
 that 
 isn't, even by our own (or possibly the standard's) relaxed
 definition. 
 A doctypedecl can only follow an  XMLDecl, see 
 http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-prolog-dtd.

Our version of SQL/XML support references SQL:2003 which references XML
1.0, where omitting the XMLDecl is legal.  You can't omit the XMLDecl in
XML 1.1, because you need it to communicate the fact that it's version
1.1.

But note that that is correctly supported:

=# select xmlconcat('?xml version=1.0?foo/', '?xml
version=1.0?bar/');
  xmlconcat
--
 foo/bar/

and

=# select xmlconcat('?xml version=1.1?foo/', '?xml
version=1.1?bar/');
 xmlconcat
---
 ?xml version=1.1?foo/bar/



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: xmlconcat (was [HACKERS] 9.0 release notes done)

2010-03-24 Thread Andrew Dunstan



Peter Eisentraut wrote:

Our version of SQL/XML support references SQL:2003 which references XML
1.0, where omitting the XMLDecl is legal.  You can't omit the XMLDecl in
XML 1.1, because you need it to communicate the fact that it's version
1.1.


  


Hmm. OK. Well here is a patch that tries to fix the xmlconcat error, 
anyway. It seems to work, but maybe could stand a little tightening.


cheers

andrew
Index: src/backend/utils/adt/xml.c
===
RCS file: /cvsroot/pgsql/src/backend/utils/adt/xml.c,v
retrieving revision 1.97
diff -c -r1.97 xml.c
*** src/backend/utils/adt/xml.c	3 Mar 2010 17:29:45 -	1.97
--- src/backend/utils/adt/xml.c	24 Mar 2010 22:05:19 -
***
*** 418,424 
--- 418,466 
  #endif
  }
  
+ #ifdef USE_LIBXML
+ static inline void
+ strip_dtd(char ** xmlstr)
+ {
+ 
+ 	xmlDocPtr	doc;
+ 	xmlChar *xmlbuff;
+ 	int buffersize;
+ 	booldtd_found = false;
+ 	xmlNodePtr child;
+ 	char * skip_xmldecl;
+ 
+ 	if (strstr(*xmlstr,!DOCTYPE) == NULL)
+ 		return ;
+ 
+ 	doc = xml_parse(cstring_to_text(*xmlstr), XMLOPTION_DOCUMENT, true, GetDatabaseEncoding());
+ 
+ 	for (child = doc-children; child != NULL; child = child-next)
+ 	{
+ 		if (child-type == XML_DOCUMENT_TYPE_NODE || 
+ 			child-type == XML_DTD_NODE)
+ 		{
+ 			xmlUnlinkNode(child);
+ 			xmlFreeNode(child);
+ 			dtd_found = true;
+ 		}
+ 	}
+ 	if (dtd_found)
+ 	{
+ 		pfree(*xmlstr);
+ 		xmlDocDumpFormatMemory(doc, xmlbuff, buffersize, 0);
+ 		if (strncmp((char *)xmlbuff,?xml,5) == 0)
+ 			skip_xmldecl = strstr((char *)xmlbuff,?\n) + 3;
+ 		else
+ 			skip_xmldecl = (char *) xmlbuff;
+ 		*xmlstr = pstrdup(skip_xmldecl);
+ 		xmlFree(xmlbuff);
+ 		
+ 	}
+ 	xmlFreeDoc(doc);
  
+ }
+ #endif
  
  /*
   * TODO: xmlconcat needs to merge the notations and unparsed entities
***
*** 460,465 
--- 502,509 
  		else if (xmlStrcmp(version, global_version) != 0)
  			global_version_no_value = true;
  
+ 		strip_dtd(str);
+ 
  		appendStringInfoString(buf, str + len);
  		pfree(str);
  	}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: xmlconcat (was [HACKERS] 9.0 release notes done)

2010-03-24 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 Hmm. OK. Well here is a patch that tries to fix the xmlconcat error, 
 anyway. It seems to work, but maybe could stand a little tightening.

I liked your previous idea (rethink the whole mess in 9.1) better.

As far as the patch itself is concerned, the complete lack of error
checks seems scary, and I wonder whether the case sensitivity and
lack of whitespace tolerance in the string comparisons is OK.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: xmlconcat (was [HACKERS] 9.0 release notes done)

2010-03-24 Thread Andrew Dunstan



Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:
  
Hmm. OK. Well here is a patch that tries to fix the xmlconcat error, 
anyway. It seems to work, but maybe could stand a little tightening.



I liked your previous idea (rethink the whole mess in 9.1) better.

As far as the patch itself is concerned, the complete lack of error
checks seems scary, 


Yes, this wasn't intended as the final patch. If it's not wanted right 
now, that's fine too. I just wanted to get it on the record as possibly 
something useful when we do come to reconsider the whole mess. Getting 
to grips with the libxml2 API is no fun, and it's better not to have to 
repeat it if possible ;-)



and I wonder whether the case sensitivity and
lack of whitespace tolerance in the string comparisons is OK.


  


The tokens were chosen with some care to be such that no whitespace 
tolerance would be needed (or correct). XML is case sensitive, so that's 
not an issue either.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: xmlconcat (was [HACKERS] 9.0 release notes done)

2010-03-23 Thread Peter Eisentraut
On mån, 2010-03-22 at 19:38 -0400, Andrew Dunstan wrote:
  But if we are not comfortable about being able to do that safely, I
  would be OK with just raising an error if a concatenation is
 attempted
  where one value contains a DTD.  The impact in practice should be
 low.

 
 Right. Can you find a way to do that using the libxml API? I haven't 
 managed to, and I'm pretty sure I can construct XML that fails every 
 simple string search test I can think of, either with a false negative
 or a false positive.

The documentation on that is terse as usual.  In any case, you will need
to XML parse the input values, and so you might as well resort to
parsing the output value to see if it is well-formed, which should catch
this mistake and possibly others.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: xmlconcat (was [HACKERS] 9.0 release notes done)

2010-03-22 Thread Peter Eisentraut
On sön, 2010-03-21 at 13:07 -0400, Andrew Dunstan wrote:
 Yeah, maybe. According to 
 http://www.w3.org/TR/REC-DOM-Level-1/level-one-core.html the only 
 legal child of an XML Document node that is not also a legal child of a 
 DocumentFragment node is a DocumentType node. So we could probably just 
 look for one of those in each argument node and strip it out. That 
 should be fairly lightweight in the common case where it's not present - 
 we'd just be searching for a fixed string. Removing it if found would be 
 more complex. We'd have to parse the node to remove it, since a legal 
 DocumentType node string could appear legally inside a CDATA node.

According to the SQL/XML standard, the document type declaration should
apparently be stripped when doing a concatenation.  (This makes sense
because the result of a concatenation can never be valid according to a
DTD.)

But if we are not comfortable about being able to do that safely, I
would be OK with just raising an error if a concatenation is attempted
where one value contains a DTD.  The impact in practice should be low.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: xmlconcat (was [HACKERS] 9.0 release notes done)

2010-03-22 Thread Andrew Dunstan



Peter Eisentraut wrote:

On sön, 2010-03-21 at 13:07 -0400, Andrew Dunstan wrote:
  
Yeah, maybe. According to 
http://www.w3.org/TR/REC-DOM-Level-1/level-one-core.html the only 
legal child of an XML Document node that is not also a legal child of a 
DocumentFragment node is a DocumentType node. So we could probably just 
look for one of those in each argument node and strip it out. That 
should be fairly lightweight in the common case where it's not present - 
we'd just be searching for a fixed string. Removing it if found would be 
more complex. We'd have to parse the node to remove it, since a legal 
DocumentType node string could appear legally inside a CDATA node.



According to the SQL/XML standard, the document type declaration should
apparently be stripped when doing a concatenation.  (This makes sense
because the result of a concatenation can never be valid according to a
DTD.)

But if we are not comfortable about being able to do that safely, I
would be OK with just raising an error if a concatenation is attempted
where one value contains a DTD.  The impact in practice should be low.
  


Right. Can you find a way to do that using the libxml API? I haven't 
managed to, and I'm pretty sure I can construct XML that fails every 
simple string search test I can think of, either with a false negative 
or a false positive.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: xmlconcat (was [HACKERS] 9.0 release notes done)

2010-03-21 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items

 I have just been looking at the xmlconcat bug on that list. I can't 
 think of any better solution than parsing the resulting string to make 
 sure it is well-formed before we return,

That might be a reasonable thing to do as a safety check, but I can't
escape the feeling that what this fundamentally is is a data typing
error, traceable to the lack of differentiation between xml documents
and xml fragments.  Is there a way to attack it based on saying that the
inputs can't be documents, or stripping the document overhead if they are?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: xmlconcat (was [HACKERS] 9.0 release notes done)

2010-03-21 Thread Andrew Dunstan



Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:
  

http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items
  


  
I have just been looking at the xmlconcat bug on that list. I can't 
think of any better solution than parsing the resulting string to make 
sure it is well-formed before we return,



That might be a reasonable thing to do as a safety check, but I can't
escape the feeling that what this fundamentally is is a data typing
error, traceable to the lack of differentiation between xml documents
and xml fragments.  Is there a way to attack it based on saying that the
inputs can't be documents, or stripping the document overhead if they are?
  


Yeah, maybe. According to 
http://www.w3.org/TR/REC-DOM-Level-1/level-one-core.html the only 
legal child of an XML Document node that is not also a legal child of a 
DocumentFragment node is a DocumentType node. So we could probably just 
look for one of those in each argument node and strip it out. That 
should be fairly lightweight in the common case where it's not present - 
we'd just be searching for a fixed string. Removing it if found would be 
more complex. We'd have to parse the node to remove it, since a legal 
DocumentType node string could appear legally inside a CDATA node.


That has the advantage that it would fix the error rather than failing, 
but I'm slightly nervous about silently mangling user supplied XML. I 
guess we do that in a few other cases to make other combinations 
function sanely.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers