Re: [HACKERS] possible encoding issues with libxml2 functions

Pavel Stehule Fri, 18 Aug 2017 14:44:44 -0700

Hi

2017-08-08 2:10 GMT+02:00 Noah Misch <n...@leadboat.com>:


> On Wed, Apr 05, 2017 at 10:53:39PM +0200, Pavel Stehule wrote:
> > 2017-03-17 4:23 GMT+01:00 Noah Misch <n...@leadboat.com>:
> > > On Sun, Mar 12, 2017 at 10:26:33PM +0100, Pavel Stehule wrote:
> > > > 2017-03-12 21:57 GMT+01:00 Noah Misch <n...@leadboat.com>:
> > > > > On Sun, Mar 12, 2017 at 08:36:58PM +0100, Pavel Stehule wrote:
> > > > > > 2017-03-12 0:56 GMT+01:00 Noah Misch <n...@leadboat.com>:
> > > > > Please add a test case.
> >
> > I am sending test case.
> >
> > I am afraid so this cannot be part of regress tests due strong dependency
> > on locale win1250.
>
> Right.  Based on that, I've distilled this portable test case:
>
> -- Round-trip non-ASCII data through xpath().
> DO $$
> DECLARE
>         xml_declaration text := '<?xml version="1.0"
> encoding="ISO-8859-1"?>';
>         degree_symbol text;
>         res xml[];
> BEGIN
>         -- Per the documentation, xpath() doesn't work on non-ASCII data
> when
>         -- the server encoding is not UTF8.  The EXCEPTION block below,
>         -- currently dead code, will be relevant if we remove this
> limitation.
>         IF current_setting('server_encoding') <> 'UTF8' THEN
>                 RAISE LOG 'skip: encoding % unsupported for xml',
>                         current_setting('server_encoding');
>                 RETURN;
>         END IF;
>
>         degree_symbol := convert_from('\xc2b0', 'UTF8');
>         res := xpath('text()', (xml_declaration ||
>                 '<x>' || degree_symbol || '</x>')::xml);
>         IF degree_symbol <> res[1]::text THEN
>                 RAISE 'expected % (%), got % (%)',
>                         degree_symbol, convert_to(degree_symbol, 'UTF8'),
>                         res[1], convert_to(res[1]::text, 'UTF8');
>         END IF;
> EXCEPTION
>         -- character with byte sequence 0xc2 0xb0 in encoding "UTF8" has
> no equivalent in encoding "LATIN8"
>         WHEN untranslatable_character THEN RAISE LOG 'skip: %', SQLERRM;
>         -- default conversion function for encoding "UTF8" to
> "MULE_INTERNAL" does not exist
>         WHEN undefined_function THEN RAISE LOG 'skip: %', SQLERRM;
> END
> $$
>

yes, probably libXML2 try to do check from utf8 encoding to header
specified encoding.


> > > > > Why not use xml_parse() instead of calling xmlCtxtReadMemory()
> > > > > directly?  The answer is probably in the archives, because someone
> > > > > understood the problem enough to document "Some XML-related
> functions
> > > > > may not work at all on non-ASCII data when the server encoding is
> not
> > > > > UTF-8. This is known to be an issue for xpath() in particular."
> > > >
> > > > Probably there are two possible issues
> > >
> > > Would you research in the archives to confirm?
> > >
> > > > 1. what I touched - recv function does encoding to database encoding
> -
> > > > but document encoding is not updated.
> > >
> > > Using xml_parse() would fix that, right?
> > >
> >
> > It should to help, because it cuts a header - but it does little bit more
> > work, than we would - it check if xml is doc or not too.
>
> That's true.  xpath() currently works on both DOCUMENT and CONTENT xml
> values.
> If xpath() used xml_parse(), this would stop working:
>
>   SELECT xpath('text()', XMLPARSE (DOCUMENT '<!DOCTYPE foo
> []><x>bar</x>'));
>
> > One possible fix - and similar technique is used more times in xml.c code
> > .. xmlroot
>
> > +   /* remove header */
> > +   parse_xml_decl(string, &header_len, NULL, NULL, NULL);
>
> > -       doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL,
> 0);
> > +       doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len -
>
> > another possibility is using xml_out_internal - that is used in XMLTABLE
> > function - what was initial fix.
> >
> > xml_out_internal uses parse_xml_decl too, but it is little bit more
> > expensive due call print_xml_decl that has not any effect in this case
> > (where only encoding is interesting) - but it can has sense, when server
> > encoding is not UTF8, because in this case, xmlstr is not encoded to
> UTF8 -
> > so now, I am think the correct solution should be using xml_out_internal
> -
> > because it appends a header with server encoding to XML doc
>
> As you may have been implying here, your patch never adds an xml
> declaration
> that bears an encoding.  (That is because it passes targetencoding=0 to
> xml_out_internal().)  If we were going to fix xpath() to support non-ASCII
> characters in non-UTF8 databases, we would not do that by adding xml
> declarations.  We would do our own conversion to UTF8, like xml_parse()
> does.
> In that light, I like this parse_xml_decl() approach better.  Would you
> update
> your patch to use it and to add the test case I give above?
>
>
I need to do some recapitulation (my analyses was wrong):

a) all values created by  xml_in iterface are in database encoding - input
string is stored without any change. xml_parse is called only due
validation.

b) inside xml_parse, the input is converted to UTF8, and document is read
by xmlCtxtReadDoc with explicitly specified "UTF-8" encoding or
by xmlParseBalancedChunkMemory with explicitly specified encoding "UTF8"
and removed decl section.

So for "xml_parse" based functions (xml_in, texttoxml, xml_is_document,
wellformated_xml) the database encoding is not important

c) xml_recv function does validation by xml_parse and translation to
database encoding.

Now I don't see a difference between @b and @c - so my hypotheses about
necessity to use recv interface was wrong.

So Looks like libXML2 behave - when we push document with encoding decl,
then this library expects document in this encoding

So test case is simple

 -- should to work
select xpath('/enprimeur/vino/id'/,
'<enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>');

 -- should to fail
select xpath('/enprimeur/vino/id'/, '<?xml version="1.0"
encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>');

I didn't expect this error on recv API, because we do implicit encoding to
database encoding - and I expected implicit removing of encoding
declaration. The problematic char is ok, the issue is different length

Other functions is working - so I have a expectation so xpath should to work

postgres=# select '<?xml version="1.0"
encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>'::xml;
┌────────────────────────────────────────────────────────────────────┐
│                                xml                                 │
╞════════════════════════════════════════════════════════════════════╡
│ <enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur> │
└────────────────────────────────────────────────────────────────────┘
(1 row)

postgres=# select '<?xml version="1.0"
encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>'
is document
postgres-# ;
┌──────────┐
│ ?column? │
╞══════════╡
│ t        │
└──────────┘
(1 row)
postgres=# select xml_is_well_formed_document('<?xml version="1.0"
encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>');
┌─────────────────────────────┐
│ xml_is_well_formed_document │
╞═════════════════════════════╡
│ t                           │
└─────────────────────────────┘
(1 row)

postgres=# select xml_is_well_formed_content('<?xml version="1.0"
encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>');
┌────────────────────────────┐
│ xml_is_well_formed_content │
╞════════════════════════════╡
│ t                          │
└────────────────────────────┘
(1 row)

d) XMLEXISTS doesn't work too .. because it share code wit xpath function



This change can make things worse in a non-UTF8 database.  The following
> happens to work today in a LATIN1 database, but it will cease to work:
>
> SELECT xpath('string-length()', ('<?xml version="1.0"
> encoding="ISO-8859-1"?>' ||
>                 '<x>' || chr(176) || '</x>')::xml);
>
> However, extracting actual text already gives wrong answers, because we
> treat
> the UTF8 response from libxml2 as though it were LATIN1:
>
> SELECT xpath('string()', ('<?xml version="1.0" encoding="ISO-8859-1"?>' ||
>                 '<x>' || chr(176) || '</x>')::xml);
>
> I plan to back-patch this.  The documentation says "Some XML-related
> functions
> may not work at all on non-ASCII data when the server encoding is not
> UTF-8. This is known to be an issue for xpath() in particular."  Your patch
> fixes a case where even a UTF8 database mishandles non-ASCII data.
> (Sometimes
> it mishandles the data silently.  Other times, it raises an error.)  It's
> worth further breaking a case where we already disclaim support to repair a
> supported case.  Other opinions?
>

Isn't the most correct solution to call xml_parse function?

Regards

Pavel

Re: [HACKERS] possible encoding issues with libxml2 functions

Reply via email to