[ https://issues.apache.org/jira/browse/XERCESC-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Boris Kolpackov updated XERCESC-1936:
-------------------------------------

    Fix Version/s: 3.1.2
                   3.2.0
                   4.0.0

Yes, I just tried your test with ICU and I get the error. Scheduling this bug for the next release.

> ICUTransService and IconvGNUTransService CAN NOT deal with huge file.
> ---------------------------------------------------------------------
>
>                 Key: XERCESC-1936
>                 URL: https://issues.apache.org/jira/browse/XERCESC-1936
>             Project: Xerces-C++
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.8.0, 3.1.1
>         Environment: RHEL-5.5
> glibc-2.5-49.el5_5.2
> libicu-3.6-5.11.4
>            Reporter: kirby zhou
>             Fix For: 3.1.2, 3.2.0, 4.0.0
>
>
> If a huge file is passed to XMLReader, it calls the TransService multiple times
> and splits the file content into several fragments.
> Unfortunately, a fragment can end with an incomplete multi-byte character,
> and neither ICUTransService nor IconvGNUTransService deals with that case:
> ICUTransService does not handle U_TRUNCATED_CHAR_FOUND, and
> IconvGNUTransService does not handle EINVAL.
> Both 2.8.0 and 3.1.1 have the same bug.
> For example, create two XML files like this:
> ]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for ((i=0;i<2;++i)); do echo -n '中文汉字A'; done ; echo; echo '</data>' ) > ~/small.xml
> ]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for ((i=0;i<100000;++i)); do echo -n '中文汉字A'; done ; echo; echo '</data>' ) > ~/big.xml
> # small.xml and big.xml have the same structure; they differ only in the number of repetitions.
> ]# samples/SAXPrint -x=gbk ~/small.xml
> <?xml version="1.0" encoding="gbk"?>
> <data>
> 中文汉字A中文汉字A
> </data>
> # with icu
> ]# samples/SAXPrint -x=gbk ~/big.xml
> <?xml version="1.0" encoding="gbk"?>
> <data>
> Fatal Error at file /root/big.xml, line 3, char 16377
> Message: char 0x6C49 is not representable in 'gbk' encoding
> # with iconvgnu
> ]# samples/SAXPrint -x=gbk ~/big.xml
> <?xml version="1.0" encoding="gbk"?>
> <data>
> Fatal Error at file /root/big.xml, line 3, char 16377
> Message: invalid multi-byte sequence
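
For illustration only, here is a minimal sketch of the iconv half of the report. It is not the Xerces-C++ fix. When an input chunk ends in the middle of a multi-byte character, iconv(3) returns (size_t)-1 with errno set to EINVAL and leaves the input pointer at the start of the incomplete sequence. A caller can carry those bytes over and prepend them to the next chunk instead of reporting an error. The names ChunkedDecoder, feed and hasPending are hypothetical and exist only for this example. The code assumes a POSIX iconv with the glibc-style `char**` input argument.

```cpp
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <iconv.h>

// Hypothetical helper (not part of Xerces-C++): converts a byte stream that
// arrives in arbitrary chunks, tolerating characters split across chunks.
class ChunkedDecoder {
public:
    ChunkedDecoder(const char* fromCode, const char* toCode)
        : cd_(iconv_open(toCode, fromCode)) {
        if (cd_ == (iconv_t)-1)
            throw std::runtime_error("iconv_open failed");
    }
    ~ChunkedDecoder() { iconv_close(cd_); }
    ChunkedDecoder(const ChunkedDecoder&) = delete;
    ChunkedDecoder& operator=(const ChunkedDecoder&) = delete;

    // Converts one chunk and appends the result to `out`. Bytes of an
    // incomplete trailing character are kept for the next call.
    void feed(const std::string& chunk, std::string& out) {
        std::string input = pending_ + chunk;
        pending_.clear();
        char* in = &input[0];
        std::size_t inLeft = input.size();
        char buf[4096];
        while (inLeft > 0) {
            char* outPtr = buf;
            std::size_t outLeft = sizeof buf;
            const std::size_t rc = iconv(cd_, &in, &inLeft, &outPtr, &outLeft);
            const int err = errno;  // save before any call that may change it
            out.append(buf, sizeof buf - outLeft);
            if (rc != static_cast<std::size_t>(-1))
                break;              // whole input consumed
            if (err == E2BIG)
                continue;           // output buffer full: flush and go on
            if (err == EINVAL) {    // chunk ends inside a character
                pending_.assign(in, inLeft);
                return;
            }
            throw std::runtime_error("invalid multi-byte sequence");
        }
    }

    // True if the last chunk ended with a partial character.
    bool hasPending() const { return !pending_.empty(); }

private:
    iconv_t cd_;
    std::string pending_;
};
```

A caller would construct one decoder per input stream, for example ChunkedDecoder("GBK", "UTF-8") where the local iconv provides GBK, call feed() once per fragment, and treat hasPending() being true after the final fragment as a truncated file.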