2017-08-20 9:21 GMT+02:00 Noah Misch <n...@leadboat.com>:

> On Sun, Aug 20, 2017 at 08:46:03AM +0200, Pavel Stehule wrote:
> > 2017-08-20 4:17 GMT+02:00 Noah Misch <n...@leadboat.com>:
> > > On Sat, Aug 19, 2017 at 10:53:19PM +0200, Pavel Stehule wrote:
> > > > I am sending some POC  - it does support XPATH and XMLTABLE for not
> UTF8
> > > > server encoding.
> > > >
> > > > In this case, all strings should be converted to UTF8 before call
> libXML2
> > > > functions, and result should be converted back from UTF8.
> > >
> > > Adding support for xpath in non-UTF8 databases is a v11 feature
> proposal.
> > > Please start a new thread for this, and add it to the open CommitFest.
> > >
> > > In this thread, would you provide the version of your patch that I
> > > described
> > > in my 2017-08-08 post to this thread?  That's a back-patchable bug fix.
> >
> >
> > There are three issues:
> >
> > 1. processing 1byte encoding XMLs documents with encoding declaration -
> > should be fixed by ecoding_for_xmlCtxtReadMemory.patch. This patch is
> very
> > short and safe - can be apply immediately (there is regress tests)
>
> We should fix that problem, yes.  encoding_for_xmlCtxtReadMemory.patch is
> not
> the right fix; see below.
>
> > 2 encoding issues in XPath specification (and  namespaces) - because
> > multibytes chars are not usually used in tag names, this issue hit
> minimum
> > users.
> >
> > 3. encoding issues in XPath and XMLTABLE results - this is bad issue -
> the
> > function XMLTABLE will not be functional on non UTF8 databases.
> Fortunately
> > - there are less users with this encoding, but probably should be apply
> as
> > fix in 10/11 Postgres.
>
> (2) and (3) are long-documented (since commit 14180f9, 2009-06)
> limitations in
> xpath functions.  That's why I would treat an improvement as a new feature
> and
> not back-patch it.  It is tempting to fix v10 so XMLTABLE is born without
> this
> limitation, but it is too late in the release cycle.
>

I agree

>
>
> Reviewing encoding_for_xmlCtxtReadMemory.patch:
>
> On Sat, Aug 19, 2017 at 09:13:50AM +0200, Pavel Stehule wrote:
> > -             doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL,
> NULL, 0);
> > +             doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL,
> > +
>  pg_encoding_to_char(GetDatabaseEncoding()), 0);
>
> This assumes libxml2 understands every encoding name that
> pg_encoding_to_char() can return, but it does not.  xpath-bugfix.patch was
> better.  Here are the relevant parts of my review of that patch.
>

I understand to this argument.

>
> On Mon, Aug 07, 2017 at 05:10:14PM -0700, Noah Misch wrote:
> > On Wed, Apr 05, 2017 at 10:53:39PM +0200, Pavel Stehule wrote:
> > > One possible fix - and similar technique is used more times in xml.c
> code
> > > .. xmlroot
> >
> > > +   /* remove header */
> > > +   parse_xml_decl(string, &header_len, NULL, NULL, NULL);
> >
> > > -       doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL,
> NULL, 0);
> > > +       doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len,
> len -
>
> > I like this parse_xml_decl() approach better.  Would you update
> > your patch to use it and to add the test case I give above?
>
> Again, would you make those two edits?
>

Now, I am not sure so it is correct fix. We will fix case, when server is
in UTF8, but maybe we will break some cases when server is not in UTF8
(although it is broken already).

I am think so correct solution is encoding to UTF8 and passing encoding
parameter. It will works safely in all cases - and we don't need cut
header. We should not to cut header if server encoding is not in UTF8 and
we don't pass encoding as parameter. When we pass encoding as parameter,
then cutting header is not necessary.

What do you think about last patch?

Regards

Pavel



>
> Thanks,
> nm
>
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index c47624eff6..5ceda8034d 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3895,6 +3895,13 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
 				 errmsg("empty XPath expression")));
 
 	string = pg_xmlCharStrndup(datastr, len);
+
+	/* It does nothing when string is in UTF8 already */
+	string = pg_do_encoding_conversion(string,
+									   len,
+									   GetDatabaseEncoding(),
+									   PG_UTF8);
+
 	xpath_expr = pg_xmlCharStrndup(VARDATA_ANY(xpath_expr_text), xpath_len);
 
 	xmlerrcxt = pg_xml_init(PG_XML_STRICTNESS_ALL);
@@ -3911,7 +3918,7 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
 		if (ctxt == NULL || xmlerrcxt->err_occurred)
 			xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
 						"could not allocate parser context");
-		doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
+		doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, "UTF-8", 0);
 		if (doc == NULL || xmlerrcxt->err_occurred)
 			xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
 						"could not parse XML document");
diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out
index bcc585d427..0a5a80255f 100644
--- a/src/test/regress/expected/xml.out
+++ b/src/test/regress/expected/xml.out
@@ -1452,3 +1452,29 @@ SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c
  14
 (4 rows)
 
+-- XML is saved in database encoding with original encoding declaration.
+-- There can be incosistency based on wrong user input, different server/client
+-- encoding or reading XML with recv function. All XML functions should to
+-- work with this partially broken XML.
+DO $$
+DECLARE
+  str text;
+  result text;
+BEGIN
+  -- leave early without error, when we are not sure about result of conversion
+  IF current_setting('server_encoding') NOT IN ('UTF8', 'LATIN2') THEN return; END IF;
+
+  -- build valid UTF8 XML with broken encoding declaration
+  str = '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>'
+          || convert_from('\xf2', 'windows-1250')
+          || '</remark></vino></enprimeur>';
+
+  -- should not to raise XML parsing error
+  result := (xpath('/enprimeur/vino/id/text()', str::xml))[1];
+
+  -- should be quiet when result is expected
+  -- result be same for all encodings when all is ok
+  IF result <> '909' THEN
+    RAISE EXCEPTION 'unexpected result: %s', result;
+  END IF;
+END; $$;
diff --git a/src/test/regress/expected/xml_1.out b/src/test/regress/expected/xml_1.out
index d3bd8c91d7..9c20d384c9 100644
--- a/src/test/regress/expected/xml_1.out
+++ b/src/test/regress/expected/xml_1.out
@@ -1302,3 +1302,33 @@ SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c
 ---
 (0 rows)
 
+-- XML is saved in database encoding with original encoding declaration.
+-- There can be incosistency based on wrong user input, different server/client
+-- encoding or reading XML with recv function. All XML functions should to
+-- work with this partially broken XML.
+DO $$
+DECLARE
+  str text;
+  result text;
+BEGIN
+  -- leave early without error, when we are not sure about result of conversion
+  IF current_setting('server_encoding') NOT IN ('UTF8', 'LATIN2') THEN return; END IF;
+
+  -- build valid UTF8 XML with broken encoding declaration
+  str = '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>'
+          || convert_from('\xf2', 'windows-1250')
+          || '</remark></vino></enprimeur>';
+
+  -- should not to raise XML parsing error
+  result := (xpath('/enprimeur/vino/id/text()', str::xml))[1];
+
+  -- should be quiet when result is expected
+  -- result be same for all encodings when all is ok
+  IF result <> '909' THEN
+    RAISE EXCEPTION 'unexpected result: %s', result;
+  END IF;
+END; $$;
+ERROR:  unsupported XML feature
+DETAIL:  This functionality requires the server to be built with libxml support.
+HINT:  You need to rebuild PostgreSQL using --with-libxml.
+CONTEXT:  PL/pgSQL function inline_code_block line 15 at assignment
diff --git a/src/test/regress/expected/xml_2.out b/src/test/regress/expected/xml_2.out
index ff77132803..694163c7fb 100644
--- a/src/test/regress/expected/xml_2.out
+++ b/src/test/regress/expected/xml_2.out
@@ -1432,3 +1432,29 @@ SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c
  14
 (4 rows)
 
+-- XML is saved in database encoding with original encoding declaration.
+-- There can be incosistency based on wrong user input, different server/client
+-- encoding or reading XML with recv function. All XML functions should to
+-- work with this partially broken XML.
+DO $$
+DECLARE
+  str text;
+  result text;
+BEGIN
+  -- leave early without error, when we are not sure about result of conversion
+  IF current_setting('server_encoding') NOT IN ('UTF8', 'LATIN2') THEN return; END IF;
+
+  -- build valid UTF8 XML with broken encoding declaration
+  str = '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>'
+          || convert_from('\xf2', 'windows-1250')
+          || '</remark></vino></enprimeur>';
+
+  -- should not to raise XML parsing error
+  result := (xpath('/enprimeur/vino/id/text()', str::xml))[1];
+
+  -- should be quiet when result is expected
+  -- result be same for all encodings when all is ok
+  IF result <> '909' THEN
+    RAISE EXCEPTION 'unexpected result: %s', result;
+  END IF;
+END; $$;
diff --git a/src/test/regress/sql/xml.sql b/src/test/regress/sql/xml.sql
index eb4687fb09..37b87ce5a2 100644
--- a/src/test/regress/sql/xml.sql
+++ b/src/test/regress/sql/xml.sql
@@ -558,3 +558,30 @@ INSERT INTO xmltest2 VALUES('<d><r><dc>2</dc></r></d>', 'D');
 SELECT xmltable.* FROM xmltest2, LATERAL xmltable('/d/r' PASSING x COLUMNS a int PATH '' || lower(_path) || 'c');
 SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c') PASSING x COLUMNS a int PATH '.');
 SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c') PASSING x COLUMNS a int PATH 'x' DEFAULT ascii(_path) - 54);
+
+-- XML is saved in database encoding with original encoding declaration.
+-- There can be incosistency based on wrong user input, different server/client
+-- encoding or reading XML with recv function. All XML functions should to
+-- work with this partially broken XML.
+DO $$
+DECLARE
+  str text;
+  result text;
+BEGIN
+  -- leave early without error, when we are not sure about result of conversion
+  IF current_setting('server_encoding') NOT IN ('UTF8', 'LATIN2') THEN return; END IF;
+
+  -- build valid UTF8 XML with broken encoding declaration
+  str = '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>'
+          || convert_from('\xf2', 'windows-1250')
+          || '</remark></vino></enprimeur>';
+
+  -- should not to raise XML parsing error
+  result := (xpath('/enprimeur/vino/id/text()', str::xml))[1];
+
+  -- should be quiet when result is expected
+  -- result be same for all encodings when all is ok
+  IF result <> '909' THEN
+    RAISE EXCEPTION 'unexpected result: %s', result;
+  END IF;
+END; $$;
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to