Hi 2017-08-08 2:10 GMT+02:00 Noah Misch <n...@leadboat.com>:
> On Wed, Apr 05, 2017 at 10:53:39PM +0200, Pavel Stehule wrote: > > 2017-03-17 4:23 GMT+01:00 Noah Misch <n...@leadboat.com>: > > > On Sun, Mar 12, 2017 at 10:26:33PM +0100, Pavel Stehule wrote: > > > > 2017-03-12 21:57 GMT+01:00 Noah Misch <n...@leadboat.com>: > > > > > On Sun, Mar 12, 2017 at 08:36:58PM +0100, Pavel Stehule wrote: > > > > > > 2017-03-12 0:56 GMT+01:00 Noah Misch <n...@leadboat.com>: > > > > > Please add a test case. > > > > I am sending test case. > > > > I am afraid so this cannot be part of regress tests due strong dependency > > on locale win1250. > > Right. Based on that, I've distilled this portable test case: > > -- Round-trip non-ASCII data through xpath(). > DO $$ > DECLARE > xml_declaration text := '<?xml version="1.0" > encoding="ISO-8859-1"?>'; > degree_symbol text; > res xml[]; > BEGIN > -- Per the documentation, xpath() doesn't work on non-ASCII data > when > -- the server encoding is not UTF8. The EXCEPTION block below, > -- currently dead code, will be relevant if we remove this > limitation. > IF current_setting('server_encoding') <> 'UTF8' THEN > RAISE LOG 'skip: encoding % unsupported for xml', > current_setting('server_encoding'); > RETURN; > END IF; > > degree_symbol := convert_from('\xc2b0', 'UTF8'); > res := xpath('text()', (xml_declaration || > '<x>' || degree_symbol || '</x>')::xml); > IF degree_symbol <> res[1]::text THEN > RAISE 'expected % (%), got % (%)', > degree_symbol, convert_to(degree_symbol, 'UTF8'), > res[1], convert_to(res[1]::text, 'UTF8'); > END IF; > EXCEPTION > -- character with byte sequence 0xc2 0xb0 in encoding "UTF8" has > no equivalent in encoding "LATIN8" > WHEN untranslatable_character THEN RAISE LOG 'skip: %', SQLERRM; > -- default conversion function for encoding "UTF8" to > "MULE_INTERNAL" does not exist > WHEN undefined_function THEN RAISE LOG 'skip: %', SQLERRM; > END > $$ > yes, probably libXML2 try to do check from utf8 encoding to header specified encoding. > > > > > Why not use xml_parse() instead of calling xmlCtxtReadMemory() > > > > > directly? The answer is probably in the archives, because someone > > > > > understood the problem enough to document "Some XML-related > functions > > > > > may not work at all on non-ASCII data when the server encoding is > not > > > > > UTF-8. This is known to be an issue for xpath() in particular." > > > > > > > > Probably there are two possible issues > > > > > > Would you research in the archives to confirm? > > > > > > > 1. what I touched - recv function does encoding to database encoding > - > > > > but document encoding is not updated. > > > > > > Using xml_parse() would fix that, right? > > > > > > > It should to help, because it cuts a header - but it does little bit more > > work, than we would - it check if xml is doc or not too. > > That's true. xpath() currently works on both DOCUMENT and CONTENT xml > values. > If xpath() used xml_parse(), this would stop working: > > SELECT xpath('text()', XMLPARSE (DOCUMENT '<!DOCTYPE foo > []><x>bar</x>')); > > > One possible fix - and similar technique is used more times in xml.c code > > .. xmlroot > > > + /* remove header */ > > + parse_xml_decl(string, &header_len, NULL, NULL, NULL); > > > - doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, > 0); > > + doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len - > > > another possibility is using xml_out_internal - that is used in XMLTABLE > > function - what was initial fix. > > > > xml_out_internal uses parse_xml_decl too, but it is little bit more > > expensive due call print_xml_decl that has not any effect in this case > > (where only encoding is interesting) - but it can has sense, when server > > encoding is not UTF8, because in this case, xmlstr is not encoded to > UTF8 - > > so now, I am think the correct solution should be using xml_out_internal > - > > because it appends a header with server encoding to XML doc > > As you may have been implying here, your patch never adds an xml > declaration > that bears an encoding. (That is because it passes targetencoding=0 to > xml_out_internal().) If we were going to fix xpath() to support non-ASCII > characters in non-UTF8 databases, we would not do that by adding xml > declarations. We would do our own conversion to UTF8, like xml_parse() > does. > In that light, I like this parse_xml_decl() approach better. Would you > update > your patch to use it and to add the test case I give above? > > I need to do some recapitulation (my analyses was wrong): a) all values created by xml_in iterface are in database encoding - input string is stored without any change. xml_parse is called only due validation. b) inside xml_parse, the input is converted to UTF8, and document is read by xmlCtxtReadDoc with explicitly specified "UTF-8" encoding or by xmlParseBalancedChunkMemory with explicitly specified encoding "UTF8" and removed decl section. So for "xml_parse" based functions (xml_in, texttoxml, xml_is_document, wellformated_xml) the database encoding is not important c) xml_recv function does validation by xml_parse and translation to database encoding. Now I don't see a difference between @b and @c - so my hypotheses about necessity to use recv interface was wrong. So Looks like libXML2 behave - when we push document with encoding decl, then this library expects document in this encoding So test case is simple -- should to work select xpath('/enprimeur/vino/id'/, '<enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>'); -- should to fail select xpath('/enprimeur/vino/id'/, '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>'); I didn't expect this error on recv API, because we do implicit encoding to database encoding - and I expected implicit removing of encoding declaration. The problematic char is ok, the issue is different length Other functions is working - so I have a expectation so xpath should to work postgres=# select '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>'::xml; ┌────────────────────────────────────────────────────────────────────┐ │ xml │ ╞════════════════════════════════════════════════════════════════════╡ │ <enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur> │ └────────────────────────────────────────────────────────────────────┘ (1 row) postgres=# select '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>' is document postgres-# ; ┌──────────┐ │ ?column? │ ╞══════════╡ │ t │ └──────────┘ (1 row) postgres=# select xml_is_well_formed_document('<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>'); ┌─────────────────────────────┐ │ xml_is_well_formed_document │ ╞═════════════════════════════╡ │ t │ └─────────────────────────────┘ (1 row) postgres=# select xml_is_well_formed_content('<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>'); ┌────────────────────────────┐ │ xml_is_well_formed_content │ ╞════════════════════════════╡ │ t │ └────────────────────────────┘ (1 row) d) XMLEXISTS doesn't work too .. because it share code wit xpath function This change can make things worse in a non-UTF8 database. The following > happens to work today in a LATIN1 database, but it will cease to work: > > SELECT xpath('string-length()', ('<?xml version="1.0" > encoding="ISO-8859-1"?>' || > '<x>' || chr(176) || '</x>')::xml); > > However, extracting actual text already gives wrong answers, because we > treat > the UTF8 response from libxml2 as though it were LATIN1: > > SELECT xpath('string()', ('<?xml version="1.0" encoding="ISO-8859-1"?>' || > '<x>' || chr(176) || '</x>')::xml); > > I plan to back-patch this. The documentation says "Some XML-related > functions > may not work at all on non-ASCII data when the server encoding is not > UTF-8. This is known to be an issue for xpath() in particular." Your patch > fixes a case where even a UTF8 database mishandles non-ASCII data. > (Sometimes > it mishandles the data silently. Other times, it raises an error.) It's > worth further breaking a case where we already disclaim support to repair a > supported case. Other opinions? > Isn't the most correct solution to call xml_parse function? Regards Pavel