Re: [PATCH] Re: [HACKERS] Issue: Deprecation of the XML2 module 'xml_is_well_formed' function
On 10 July 2010 14:12, Mike Fowler m...@mlfowler.com wrote:
> Robert Haas wrote:
>> On Fri, Jul 9, 2010 at 4:06 PM, Peter Eisentraut pete...@gmx.net wrote:
>>> On ons, 2010-07-07 at 16:37 +0100, Mike Fowler wrote:
>>>> Here's the patch to add the 'xml_is_well_formed' function.
>>>
>>> I suppose we should remove the function from contrib/xml2 at the same time.
>>
>> Yep
>
> Revised patch deleting the contrib/xml2 version of the function attached.
>
> Regards,
>
> --
> Mike Fowler
> Registered Linux user: 379787

Would a test for mismatched or undefined namespaces be necessary? For example:

Mismatched namespace:
<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</my:foo>

Undefined namespace when used in conjunction with IS DOCUMENT:
<pg:foo xmlns:my="http://postgresql.org/stuff">bar</pg:foo>

Also, having a look at the following example from the patch:

SELECT xml_is_well_formed('<local:data xmlns:local="http://127.0.0.1";><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>');
 xml_is_well_formed
--------------------
 t
(1 row)

Just wondering about that semi-colon after the namespace definition.

Thom

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [PATCH] Re: [HACKERS] Issue: Deprecation of the XML2 module 'xml_is_well_formed' function
Thom Brown wrote:
> Would a test for mismatched or undefined namespaces be necessary? For example:
>
> Mismatched namespace:
> <pg:foo xmlns:pg="http://postgresql.org/stuff">bar</my:foo>
>
> Undefined namespace when used in conjunction with IS DOCUMENT:
> <pg:foo xmlns:my="http://postgresql.org/stuff">bar</pg:foo>

Thanks for looking at my patch, Thom. I hadn't thought of that particular scenario, and even though I didn't specifically code for it, the underlying libxml call does correctly reject the mismatched namespace:

template1=# SELECT xml_is_well_formed('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</my:foo>');
 xml_is_well_formed
--------------------
 f
(1 row)

In the attached patch I've added the example to the SGML documentation and the regression tests.

> Also, having a look at the following example from the patch:
>
> SELECT xml_is_well_formed('<local:data xmlns:local="http://127.0.0.1";><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>');
>  xml_is_well_formed
> --------------------
>  t
> (1 row)
>
> Just wondering about that semi-colon after the namespace definition.
>
> Thom

The semi-colon is not supposed to be there, and I'm not sure where it has come from. With Thunderbird I see the email with my patch as an attachment; having downloaded and viewed the file, there are no instances of a " followed by a ;. However, if I look at the message on the archive at http://archives.postgresql.org/message-id/4c3871c2.8000...@mlfowler.com I can see that every URL that ends with a " has a ; following it. Should I be escaping the " in the patch file in some way, or is this just an artifact of HTML parsing a patch?
Regards,

-- 
Mike Fowler
Registered Linux user: 379787

*** a/contrib/xml2/xpath.c
--- b/contrib/xml2/xpath.c
***************
*** 27,33 **** PG_MODULE_MAGIC;
  
  /* externally accessible functions */
  
- Datum		xml_is_well_formed(PG_FUNCTION_ARGS);
  Datum		xml_encode_special_chars(PG_FUNCTION_ARGS);
  Datum		xpath_nodeset(PG_FUNCTION_ARGS);
  Datum		xpath_string(PG_FUNCTION_ARGS);
--- 27,32 ----
***************
*** 70,97 **** pgxml_parser_init(void)
  	xmlLoadExtDtdDefaultValue = 1;
  }
  
- 
- /* Returns true if document is well-formed */
- 
- PG_FUNCTION_INFO_V1(xml_is_well_formed);
- 
- Datum
- xml_is_well_formed(PG_FUNCTION_ARGS)
- {
- 	text	   *t = PG_GETARG_TEXT_P(0);		/* document buffer */
- 	int32		docsize = VARSIZE(t) - VARHDRSZ;
- 	xmlDocPtr	doctree;
- 
- 	pgxml_parser_init();
- 
- 	doctree = xmlParseMemory((char *) VARDATA(t), docsize);
- 	if (doctree == NULL)
- 		PG_RETURN_BOOL(false);	/* i.e. not well-formed */
- 	xmlFreeDoc(doctree);
- 	PG_RETURN_BOOL(true);
- }
- 
- 
  /* Encodes special characters (<, >, &, " and \r) as XML entities */
  
  PG_FUNCTION_INFO_V1(xml_encode_special_chars);
--- 69,74 ----
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 8554,8562 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
  ]]></screen>
      </para>
     </sect3>
  
     <sect3>
!     <title>XML Predicates</title>
  
      <indexterm>
       <primary>IS DOCUMENT</primary>
--- 8554,8566 ----
  ]]></screen>
      </para>
     </sect3>
+   </sect2>
+ 
+   <sect2>
+    <title>XML Predicates</title>
  
     <sect3>
!     <title>IS DOCUMENT</title>
  
      <indexterm>
       <primary>IS DOCUMENT</primary>
***************
*** 8574,8579 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
--- 8578,8675 ----
       between documents and content fragments.
      </para>
     </sect3>
+ 
+    <sect3>
+     <title>xml_is_well_formed</title>
+ 
+     <indexterm>
+      <primary>xml_is_well_formed</primary>
+      <secondary>well formed</secondary>
+     </indexterm>
+ 
+     <synopsis>
+ <function>xml_is_well_formed</function>(<replaceable>text</replaceable>)
+ </synopsis>
+ 
+     <para>
+      The function <function>xml_is_well_formed</function> evaluates whether
+      the <replaceable>text</replaceable> is well formed XML content, returning
+      a boolean.
+     </para>
+     <para>
+      Example:
+ <screen><![CDATA[
+ SELECT xml_is_well_formed('<foo>bar</foo>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+ 
+ SELECT xml_is_well_formed('<foo>bar</foo');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+ 
+ SELECT xml_is_well_formed('<foo><bar>stuff</foo>');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+ ]]></screen>
+     </para>
+     <para>
+      In addition to the structure checks, the function ensures that namespaces are correctly matched.
+ <screen><![CDATA[
+ SELECT xml_is_well_formed('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</my:foo>');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+ 
+ SELECT xml_is_well_formed('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</pg:foo>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+ ]]></screen>
+     </para>
+     <para>
+      This function can be combined with the IS DOCUMENT predicate to prevent
+      invalid XML content errors from occurring in queries. For example, given a
+      table that may have rows with invalid XML mixed in with rows of valid
Re: [PATCH] Re: [HACKERS] Issue: Deprecation of the XML2 module 'xml_is_well_formed' function
On 12 July 2010 13:07, Mike Fowler m...@mlfowler.com wrote:
> Thom Brown wrote:
>> Just wondering about that semi-colon after the namespace definition.
>>
>> Thom
>
> The semi-colon is not supposed to be there, and I'm not sure where it has come from. With Thunderbird I see the email with my patch as an attachment; having downloaded and viewed the file, there are no instances of a " followed by a ;. However, if I look at the message on the archive at http://archives.postgresql.org/message-id/4c3871c2.8000...@mlfowler.com I can see that every URL that ends with a " has a ; following it. Should I be escaping the " in the patch file in some way, or is this just an artifact of HTML parsing a patch?

Yeah, I guess it's a parsing issue related to the archive viewer. I arrived there from the commitfest page and should really have looked directly at the patch. No problem there then, I guess.

Thanks for the work you've done on this. :)

Thom
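For reference, the namespace cases raised in this exchange can be written out as plain queries with the stray semi-colon removed. This is only a sketch: it assumes a server built with libxml support on which xml_is_well_formed(text) is available, and the results mentioned in the comments are the ones reported in this thread, not independently verified output.

```sql
-- Closing tag uses a different prefix (my:) than the opening tag (pg:).
-- Mike reports that the underlying libxml call rejects this (f).
SELECT xml_is_well_formed('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</my:foo>');

-- Opening and closing tags agree and the pg: prefix is declared.
-- The patch documentation shows this returning t.
SELECT xml_is_well_formed('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</pg:foo>');

-- Thom's second case: the pg: prefix is used but only my: is declared.
-- The thread does not report a result for this query.
SELECT xml_is_well_formed('<pg:foo xmlns:my="http://postgresql.org/stuff">bar</pg:foo>');
```

Running the third query both on its own and together with IS DOCUMENT would show whether the undefined-namespace case needs a regression test of its own.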
Re: [PATCH] Re: [HACKERS] Issue: Deprecation of the XML2 module 'xml_is_well_formed' function
Robert Haas wrote:
> On Fri, Jul 9, 2010 at 4:06 PM, Peter Eisentraut pete...@gmx.net wrote:
>> On ons, 2010-07-07 at 16:37 +0100, Mike Fowler wrote:
>>> Here's the patch to add the 'xml_is_well_formed' function.
>>
>> I suppose we should remove the function from contrib/xml2 at the same time.
>
> Yep

Revised patch deleting the contrib/xml2 version of the function attached.

Regards,

-- 
Mike Fowler
Registered Linux user: 379787

*** a/contrib/xml2/xpath.c
--- b/contrib/xml2/xpath.c
***************
*** 27,33 **** PG_MODULE_MAGIC;
  
  /* externally accessible functions */
  
- Datum		xml_is_well_formed(PG_FUNCTION_ARGS);
  Datum		xml_encode_special_chars(PG_FUNCTION_ARGS);
  Datum		xpath_nodeset(PG_FUNCTION_ARGS);
  Datum		xpath_string(PG_FUNCTION_ARGS);
--- 27,32 ----
***************
*** 70,97 **** pgxml_parser_init(void)
  	xmlLoadExtDtdDefaultValue = 1;
  }
  
- 
- /* Returns true if document is well-formed */
- 
- PG_FUNCTION_INFO_V1(xml_is_well_formed);
- 
- Datum
- xml_is_well_formed(PG_FUNCTION_ARGS)
- {
- 	text	   *t = PG_GETARG_TEXT_P(0);		/* document buffer */
- 	int32		docsize = VARSIZE(t) - VARHDRSZ;
- 	xmlDocPtr	doctree;
- 
- 	pgxml_parser_init();
- 
- 	doctree = xmlParseMemory((char *) VARDATA(t), docsize);
- 	if (doctree == NULL)
- 		PG_RETURN_BOOL(false);	/* i.e. not well-formed */
- 	xmlFreeDoc(doctree);
- 	PG_RETURN_BOOL(true);
- }
- 
- 
  /* Encodes special characters (<, >, &, " and \r) as XML entities */
  
  PG_FUNCTION_INFO_V1(xml_encode_special_chars);
--- 69,74 ----
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 8554,8562 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
  ]]></screen>
      </para>
     </sect3>
  
     <sect3>
!     <title>XML Predicates</title>
  
      <indexterm>
       <primary>IS DOCUMENT</primary>
--- 8554,8566 ----
  ]]></screen>
      </para>
     </sect3>
+   </sect2>
+ 
+   <sect2>
+    <title>XML Predicates</title>
  
     <sect3>
!     <title>IS DOCUMENT</title>
  
      <indexterm>
       <primary>IS DOCUMENT</primary>
***************
*** 8574,8579 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
--- 8578,8653 ----
       between documents and content fragments.
      </para>
     </sect3>
+ 
+    <sect3>
+     <title>xml_is_well_formed</title>
+ 
+     <indexterm>
+      <primary>xml_is_well_formed</primary>
+      <secondary>well formed</secondary>
+     </indexterm>
+ 
+     <synopsis>
+ <function>xml_is_well_formed</function>(<replaceable>text</replaceable>)
+ </synopsis>
+ 
+     <para>
+      The function <function>xml_is_well_formed</function> evaluates whether
+      the <replaceable>text</replaceable> is well formed XML content, returning
+      a boolean.
+     </para>
+     <para>
+      Example:
+ <screen><![CDATA[
+ SELECT xml_is_well_formed('<foo>bar</foo>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+ 
+ SELECT xml_is_well_formed('<foo>bar</foo');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+ ]]></screen>
+     </para>
+     <para>
+      This function can be combined with the IS DOCUMENT predicate to prevent
+      invalid XML content errors from occurring in queries. For example, given a
+      table that may have rows with invalid XML mixed in with rows of valid
+      XML, <function>xml_is_well_formed</function> can be used to filter out all
+      the invalid rows.
+     </para>
+     <para>
+      Example:
+ <screen><![CDATA[
+ SELECT * FROM mixed;
+              data
+ ------------------------------
+  <foo>bar</foo>
+  <foo>bar</foo
+  <foo>bar</foo><bar>foo</bar>
+  <foo>bar</foo><bar>foo</bar
+ (4 rows)
+ 
+ SELECT COUNT(data) FROM mixed WHERE data::xml IS DOCUMENT;
+ ERROR:  invalid XML content
+ DETAIL:  Entity: line 1: parser error : expected '>'
+ <foo>bar</foo
+              ^
+ Entity: line 1: parser error : chunk is not well balanced
+ <foo>bar</foo
+              ^
+ 
+ SELECT COUNT(data) FROM mixed WHERE xml_is_well_formed(data) AND data::xml IS DOCUMENT;
+  count
+ -------
+      1
+ (1 row)
+ ]]></screen>
+     </para>
+    </sect3>
    </sect2>
  
    <sect2 id="functions-xml-processing">
*** a/src/backend/utils/adt/xml.c
--- b/src/backend/utils/adt/xml.c
***************
*** 3293,3298 **** xml_xmlnodetoxmltype(xmlNodePtr cur)
--- 3293,3365 ----
  }
  #endif
  
+ Datum
+ xml_is_well_formed(PG_FUNCTION_ARGS)
+ {
+ #ifdef USE_LIBXML
+ 	text	   *data = PG_GETARG_TEXT_P(0);
+ 	bool		result;
+ 	int			res_code;
+ 	int32		len;
+ 	const xmlChar *string;
+ 	xmlParserCtxtPtr ctxt;
+ 	xmlDocPtr	doc = NULL;
+ 
+ 	len = VARSIZE(data) - VARHDRSZ;
+ 	string = xml_text2xmlChar(data);
+ 
+ 	/* Start up libxml and its parser (no-ops if already done) */
+ 	pg_xml_init();
+ 	xmlInitParser();
+ 
+ 	ctxt = xmlNewParserCtxt();
+ 	if (ctxt == NULL)
+ 		xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY,
+ 					"could not allocate parser context");
+ 
+ 	PG_TRY();
+ 	{
+ 		size_t		count;
+ 		xmlChar    *version = NULL;
+ 		int			standalone = -1;
+ 
+ 		res_code = parse_xml_decl(string, &count, &version, NULL, &standalone);
+ 		if (res_code != 0)
+ 			xml_ereport_by_code(ERROR,
Re: [PATCH] Re: [HACKERS] Issue: Deprecation of the XML2 module 'xml_is_well_formed' function
On ons, 2010-07-07 at 16:37 +0100, Mike Fowler wrote:
> Here's the patch to add the 'xml_is_well_formed' function.

I suppose we should remove the function from contrib/xml2 at the same time.
Re: [PATCH] Re: [HACKERS] Issue: Deprecation of the XML2 module 'xml_is_well_formed' function
On Fri, Jul 9, 2010 at 4:06 PM, Peter Eisentraut pete...@gmx.net wrote:
> On ons, 2010-07-07 at 16:37 +0100, Mike Fowler wrote:
>> Here's the patch to add the 'xml_is_well_formed' function.
>
> I suppose we should remove the function from contrib/xml2 at the same time.

Yep.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
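The documentation example in the patch guards the cast with xml_is_well_formed(data) AND data::xml IS DOCUMENT. PostgreSQL does not promise that the left operand of AND is evaluated first, so a more defensive variant puts the cast inside a CASE expression. What follows is a minimal sketch, not part of the patch: the temporary table is purely illustrative, and the four sample rows are a reconstruction of the "mixed" table from the patch documentation.

```sql
-- Illustrative scratch table modelled on the "mixed" table in the patch docs.
CREATE TEMPORARY TABLE mixed (data text);

INSERT INTO mixed (data) VALUES
    ('<foo>bar</foo>'),                -- well-formed document
    ('<foo>bar</foo'),                 -- closing tag is missing its '>'
    ('<foo>bar</foo><bar>foo</bar>'),  -- well-formed content with two top-level elements
    ('<foo>bar</foo><bar>foo</bar');   -- broken content

-- CASE is the documented way to force evaluation order: the cast to xml is
-- only reached for rows that already passed the well-formedness check.
SELECT count(*)
FROM mixed
WHERE CASE
          WHEN xml_is_well_formed(data) THEN data::xml IS DOCUMENT
          ELSE false
      END;
```

The plain AND form from the patch may well behave the same in practice; the CASE form simply removes the dependence on how the planner orders the two conditions.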
[PATCH] Re: [HACKERS] Issue: Deprecation of the XML2 module 'xml_is_well_formed' function
Peter Eisentraut wrote:
> On lör, 2010-07-03 at 09:26 +0100, Mike Fowler wrote:
>> What I will do instead is implement the xml_is_well_formed function and get a patch out in the next day or two.
>
> That sounds very useful.

Here's the patch to add the 'xml_is_well_formed' function. Paraphrasing the SGML, the syntax is:

|xml_is_well_formed|(/text/)

The function |xml_is_well_formed| evaluates whether the /text/ is well formed XML content, returning a boolean.

I've done some tests (included in the patch) with tables containing a mixture of well formed documents and content, and the function is happily returning the expected result. Combining with IS (NOT) DOCUMENT is working nicely for pulling out content or documents from a table of text.

Unless I missed something in the original correspondence, I think this patch will solve the issue.

Regards,

-- 
Mike Fowler
Registered Linux user: 379787

*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 8554,8562 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
  ]]></screen>
      </para>
     </sect3>
  
     <sect3>
!     <title>XML Predicates</title>
  
      <indexterm>
       <primary>IS DOCUMENT</primary>
--- 8554,8566 ----
  ]]></screen>
      </para>
     </sect3>
+   </sect2>
+ 
+   <sect2>
+    <title>XML Predicates</title>
  
     <sect3>
!     <title>IS DOCUMENT</title>
  
      <indexterm>
       <primary>IS DOCUMENT</primary>
***************
*** 8574,8579 **** SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
--- 8578,8653 ----
       between documents and content fragments.
      </para>
     </sect3>
+ 
+    <sect3>
+     <title>xml_is_well_formed</title>
+ 
+     <indexterm>
+      <primary>xml_is_well_formed</primary>
+      <secondary>well formed</secondary>
+     </indexterm>
+ 
+     <synopsis>
+ <function>xml_is_well_formed</function>(<replaceable>text</replaceable>)
+ </synopsis>
+ 
+     <para>
+      The function <function>xml_is_well_formed</function> evaluates whether
+      the <replaceable>text</replaceable> is well formed XML content, returning
+      a boolean.
+     </para>
+     <para>
+      Example:
+ <screen><![CDATA[
+ SELECT xml_is_well_formed('<foo>bar</foo>');
+  xml_is_well_formed
+ --------------------
+  t
+ (1 row)
+ 
+ SELECT xml_is_well_formed('<foo>bar</foo');
+  xml_is_well_formed
+ --------------------
+  f
+ (1 row)
+ ]]></screen>
+     </para>
+     <para>
+      This function can be combined with the IS DOCUMENT predicate to prevent
+      invalid XML content errors from occurring in queries. For example, given a
+      table that may have rows with invalid XML mixed in with rows of valid
+      XML, <function>xml_is_well_formed</function> can be used to filter out all
+      the invalid rows.
+     </para>
+     <para>
+      Example:
+ <screen><![CDATA[
+ SELECT * FROM mixed;
+              data
+ ------------------------------
+  <foo>bar</foo>
+  <foo>bar</foo
+  <foo>bar</foo><bar>foo</bar>
+  <foo>bar</foo><bar>foo</bar
+ (4 rows)
+ 
+ SELECT COUNT(data) FROM mixed WHERE data::xml IS DOCUMENT;
+ ERROR:  invalid XML content
+ DETAIL:  Entity: line 1: parser error : expected '>'
+ <foo>bar</foo
+              ^
+ Entity: line 1: parser error : chunk is not well balanced
+ <foo>bar</foo
+              ^
+ 
+ SELECT COUNT(data) FROM mixed WHERE xml_is_well_formed(data) AND data::xml IS DOCUMENT;
+  count
+ -------
+      1
+ (1 row)
+ ]]></screen>
+     </para>
+    </sect3>
    </sect2>
  
    <sect2 id="functions-xml-processing">
*** a/src/backend/utils/adt/xml.c
--- b/src/backend/utils/adt/xml.c
***************
*** 3293,3298 **** xml_xmlnodetoxmltype(xmlNodePtr cur)
--- 3293,3365 ----
  }
  #endif
  
+ Datum
+ xml_is_well_formed(PG_FUNCTION_ARGS)
+ {
+ #ifdef USE_LIBXML
+ 	text	   *data = PG_GETARG_TEXT_P(0);
+ 	bool		result;
+ 	int			res_code;
+ 	int32		len;
+ 	const xmlChar *string;
+ 	xmlParserCtxtPtr ctxt;
+ 	xmlDocPtr	doc = NULL;
+ 
+ 	len = VARSIZE(data) - VARHDRSZ;
+ 	string = xml_text2xmlChar(data);
+ 
+ 	/* Start up libxml and its parser (no-ops if already done) */
+ 	pg_xml_init();
+ 	xmlInitParser();
+ 
+ 	ctxt = xmlNewParserCtxt();
+ 	if (ctxt == NULL)
+ 		xml_ereport(ERROR, ERRCODE_OUT_OF_MEMORY,
+ 					"could not allocate parser context");
+ 
+ 	PG_TRY();
+ 	{
+ 		size_t		count;
+ 		xmlChar    *version = NULL;
+ 		int			standalone = -1;
+ 
+ 		res_code = parse_xml_decl(string, &count, &version, NULL, &standalone);
+ 		if (res_code != 0)
+ 			xml_ereport_by_code(ERROR, ERRCODE_INVALID_XML_CONTENT,
+ 								"invalid XML content: invalid XML declaration",
+ 								res_code);
+ 
+ 		doc = xmlNewDoc(version);
+ 		doc->encoding = xmlStrdup((const xmlChar *) "UTF-8");
+ 		doc->standalone = 1;
+ 
+ 		res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0, string + count, NULL);
+ 
+ 		result = !res_code;
+ 	}
+ 	PG_CATCH();
+ 	{
+ 		if (doc)
+ 			xmlFreeDoc(doc);
+ 		if (ctxt)
+ 			xmlFreeParserCtxt(ctxt);
+ 
+ 		PG_RE_THROW();
+ 	}
+ 	PG_END_TRY();
+ 
+ 	if (doc)
+ 		xmlFreeDoc(doc);
+ 	if (ctxt)
+ 		xmlFreeParserCtxt(ctxt);
+ 
+ 	return result;
+ #else
+ 	NO_XML_SUPPORT();
+ 	return 0;