2017-03-17 4:23 GMT+01:00 Noah Misch <n...@leadboat.com>: > On Sun, Mar 12, 2017 at 10:26:33PM +0100, Pavel Stehule wrote: > > 2017-03-12 21:57 GMT+01:00 Noah Misch <n...@leadboat.com>: > > > On Sun, Mar 12, 2017 at 08:36:58PM +0100, Pavel Stehule wrote: > > > > 2017-03-12 0:56 GMT+01:00 Noah Misch <n...@leadboat.com>: > > > Please add a test case. >
I am sending test case. I am afraid so this cannot be part of regress tests due strong dependency on locale win1250. > > > > It needs a application - currently there is not possibility to import XML > > document via recv API :( > > I think xml_in() can create every value that xml_recv() can create; > xml_recv() > is just more convenient given diverse source encodings. If you make your > application store the value into a table, does "pg_dump --inserts" emit > code > that reproduces the same value? If so, you can use that in your test case. > If not, please provide precise instructions (code, SQL commands) for > reproducing the bug manually. > I was wrong - both methods produces broken internal XML document <?xml version="1.0" encoding="windows-1250"?> <enprimeur> <vino> <id>1</id> <nazev>Alter Ego de Palmer</nazev> <vyrobce>63</vyrobce> <rocnik>2012</rocnik> <cena0375>0</cena0375> <cena1500>0</cena1500> <cena3000>0</cena3000> <cena6000>0</cena6000> <cena0750>1425</cena0750> <cenastart>1085</cenastart> <min0375>0</min0375> <min0750>0</min0750> <odrudy>51 % Merlot, 40 % Cabernet Sauvignon,9 % Petit Verdot</odrudy> <bestin>2017 - 2026</bestin> <klas>2</klas> <sklad0375>0</sklad0375> <sklad0750>0</sklad0750> <sklad1500>0</sklad1500> <sklad3000>0</sklad3000> <sklad6000>0</sklad6000> <alk>13,4 %</alk> <remark>Premiant oblasti Margaux Ch. Palmer tentokrát ve svých obou vínech těžil z dokonale zralého Merlotu, kterého do svých směsí naládoval více než Cabernet Sauvignonu. 2. víno Alter Ego de Palmer 2012 se může p ochlubit skvělou kondicí. Není pochyb o tom, že na Ch. Palmer sklízeli z vinice dokonale zralé hrozny. Alter Ego je mohutné, husté a syté víno, nabízí objemnou dávku ovoce, voní po ostružinách, chuť je krásně kulatá, pevná a svěží, taniny vykazují fenomenální strukturu, takto krásné spojení všech aromatických a chuťových složek s příměsí úžasných kyselin a alkoholu je radost mít ve sklepě.</remark> <rating>Robert Parker: /100 TOPVINO SCORE: 92-94/100 James Suckling: 92-93/100 Wine Spectator: 90-93/100</rating> <zalozeno></zalozeno> <rozloha></rozloha> <stari></stari> <puda></puda> <produkce></produkce> <zrani></zrani> <active>1</active> <stitky> <stitek>8</stitek> <stitek>1</stitek> </stitky> </vino> </enprimeur> Document is well encoded from win1250 to UTF8 - it is well readable, but the header is wrong - it is not in win1250 now (after encoding). This xml is attached (in original form (win1250 encoding)). Import with create table target(x xml); \set xml `cat ~/Downloads/e6.xml` postgres=# set client_encoding to win1250; postgres=# insert into target values(:'xml'); disconnect or reset client encoding Document should be correctly inserted. Then run a query postgres=# select * from target, xpath('/enprimeur/vino', x); ERROR: could not parse XML document DETAIL: input conversion failed due to input error, bytes 0x88 0x73 0x6B 0x75 input conversion failed due to input error, bytes 0x88 0x73 0x6B 0x75 encoder error line 25: Premature end of data in tag remark line 25 line 25: Premature end of data in tag vino line 3 line 25: Premature end of data in tag enprimeur line 2 select xmltable.* from target, xmltable('/enprimeur/vino' passing x columns id int, nazev text, remark text); XMLTABLE is not vulnerable against this bug and result should be correct. when you do select x from target you can see a correct xml without header but select x::text from target; shows wrong header > > > > Why not use xml_parse() instead of calling xmlCtxtReadMemory() > directly? > > > The > > > answer is probably in the archives, because someone understood the > problem > > > enough to document "Some XML-related functions may not work at all on > > > non-ASCII data when the server encoding is not UTF-8. This is known to > be > > > an > > > issue for xpath() in particular." > > > > > > Probably there are two possible issues > > Would you research in the archives to confirm? > > > 1. what I touched - recv function does encoding to database encoding - > but > > document encoding is not updated. > > Using xml_parse() would fix that, right? > It should to help, because it cuts a header - but it does little bit more work, than we would - it check if xml is doc or not too. In mostly functions there the important work is done in parse_xml_decl function One possible fix - and similar technique is used more times in xml.c code .. xmlroot diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c index 2f87151..45faf83 100644 --- a/src/backend/utils/adt/xml.c +++ b/src/backend/utils/adt/xml.c @@ -3834,6 +3834,7 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces, Datum *ns_names_uris; bool *ns_names_uris_nulls; int ns_count; + size_t header_len; /* * Namespace mappings are passed as text[]. If an empty array is passed @@ -3882,6 +3883,10 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces, errmsg("empty XPath expression"))); string = pg_xmlCharStrndup(datastr, len); + + /* remove header */ + parse_xml_decl(string, &header_len, NULL, NULL, NULL); + xpath_expr = pg_xmlCharStrndup(VARDATA_ANY(xpath_expr_text), xpath_len); xmlerrcxt = pg_xml_init(PG_XML_STRICTNESS_ALL); @@ -3898,7 +3903,7 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces, if (ctxt == NULL || xmlerrcxt->err_occurred) xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY, "could not allocate parser context"); - doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0); + doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len - header_len, NULL, NULL, 0); if (doc == NULL || xmlerrcxt->err_occurred) xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT, "could not parse XML document"); another possibility is using xml_out_internal - that is used in XMLTABLE function - what was initial fix. xml_out_internal uses parse_xml_decl too, but it is little bit more expensive due call print_xml_decl that has not any effect in this case (where only encoding is interesting) - but it can has sense, when server encoding is not UTF8, because in this case, xmlstr is not encoded to UTF8 - so now, I am think the correct solution should be using xml_out_internal - because it appends a header with server encoding to XML doc Regards Pavel > > > 2. there are not possibility to encode from document encoding to database > > encoding. > > Both xml_in() and xml_recv() require the value to be representable in the > database encoding, so I don't think this particular problem can remain by > the > time we reach an xpath_internal() call. >
<?xml version="1.0" encoding="windows-1250"?> <enprimeur> <vino> <id>909</id> <nazev>Domaine de l´A</nazev> <vyrobce>660</vyrobce> <rocnik>2012</rocnik> <cena0375>0</cena0375> <cena1500>0</cena1500> <cena3000>0</cena3000> <cena6000>0</cena6000> <cena0750>468</cena0750> <cenastart>425</cenastart> <min0375>0</min0375> <min0750>0</min0750> <odrudy>80 % Merlot, 20 % Cabernet Franc</odrudy> <bestin>2016 - 2024</bestin> <klas>62</klas> <sklad0375>0</sklad0375> <sklad0750>0</sklad0750> <sklad1500>0</sklad1500> <sklad3000>0</sklad3000> <sklad6000>0</sklad6000> <alk>13,5 %</alk> <remark>Domaine de l´A je pozoruhodné víno v majetku předního bordeauxského enologa/konzultanta Stéphane Derenoncourta. Nabízí nádhernou vůni černého rybízu, švestek, ostružin a exotického koření, tělo je plné, chuť hluboká a masitá. Přítomnost dubu je zřejmá, velice lehká a jemná moderna tomuto vínu sluší. Cotes de Castillon sice není Saint-Emilion, nicméně velká bordeauxská scéna nabízí nepřeberné množství možností. Domaine de I´A je skvělé víno nejen na poklidnému pití, ale také minimálně na deset let k uložení do sklepa. Letos proti loňsku za cenu nižší o 12 %.</remark> <rating>Wine Spectator: 87-90/100</rating> <zalozeno></zalozeno> <rozloha>10 ha</rozloha> <stari>35 let</stari> <puda>Hlinito-vápencová hluboká půda</puda> <produkce>23 000 lahví</produkce> <zrani>16 měsíců</zrani> <active>1</active> <stitky> <stitek>1</stitek> </stitky> </vino> </enprimeur>
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers