[ 
https://issues.apache.org/jira/browse/XERCESC-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boris Kolpackov updated XERCESC-1936:
-------------------------------------

    Fix Version/s: 3.1.2
                   3.2.0
                   4.0.0

Yes, I just tried your test with ICU and I get the error. Scheduling this bug 
for the next release.

> ICUTransService and IconvGNUTransService cannot deal with huge files.
> ---------------------------------------------------------------------
>
>                 Key: XERCESC-1936
>                 URL: https://issues.apache.org/jira/browse/XERCESC-1936
>             Project: Xerces-C++
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.8.0, 3.1.1
>         Environment: RHEL-5.5
> glibc-2.5-49.el5_5.2
> libicu-3.6-5.11.4
>            Reporter: kirby zhou
>             Fix For: 3.1.2, 3.2.0, 4.0.0
>
>
> If a huge file is passed to XMLReader, it calls TransService multiple times, 
> splitting the file content into several fragments.
> Unfortunately, a fragment can end in the middle of a multi-byte character,
> and neither ICUTransService nor IconvGNUTransService handles that case: 
> ICUTransService does not handle U_TRUNCATED_CHAR_FOUND, and 
> IconvGNUTransService does not handle EINVAL.
> Both 2.8.0 and 3.1.1 have the same bug.
> For example, create two XML files like this:
> ]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for 
> ((i=0;i<2;++i)); do echo -n '中文汉字A'; done ; echo; echo '</data>' ) > 
> ~/small.xml
> ]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for 
> ((i=0;i<100000;++i)); do echo -n '中文汉字A'; done ; echo; echo '</data>' ) > 
> ~/big.xml
> # small.xml and big.xml have the same structure; only the repeat count differs.
> ]# samples/SAXPrint -x=gbk ~/small.xml 
> <?xml version="1.0" encoding="gbk"?>
> <data>
> 中文汉字A中文汉字A
> </data>
> # with icu
> ]# samples/SAXPrint -x=gbk ~/big.xml
> <?xml version="1.0" encoding="gbk"?>
> <data>
> Fatal Error at file /root/big.xml, line 3, char 16377
>   Message: char 0x6C49 is not representable in 'gbk' encoding
> # with iconvgnu
> ]# samples/SAXPrint -x=gbk ~/big.xml
> <?xml version="1.0" encoding="gbk"?>
> <data>
> Fatal Error at file /root/big.xml, line 3, char 16377
>   Message: invalid multi-byte sequence
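
The fragment-boundary problem in the report can be illustrated without any transcoding library. The following is only a minimal sketch, not the Xerces-C++ fix: `completeGbkPrefix` and `GbkChunker` are hypothetical names, and GBK is simplified to "a byte in 0x81-0xFE is a lead byte followed by exactly one trail byte", with no validation of the trail byte. The idea is to hand the transcoder only whole characters and to carry a dangling lead byte over into the next fragment.

```cpp
#include <cstddef>
#include <string>

// Returns how many leading bytes of buf[0..len) form whole characters.
// Assumption: buf starts on a character boundary, and GBK is simplified to
// single bytes (0x00-0x80, 0xFF) or a lead byte 0x81-0xFE plus one trail byte.
std::size_t completeGbkPrefix(const unsigned char* buf, std::size_t len) {
    std::size_t i = 0;
    while (i < len) {
        const bool isLead = buf[i] >= 0x81 && buf[i] <= 0xFE;
        const std::size_t width = isLead ? 2 : 1;
        if (i + width > len) {
            break;  // lead byte whose trail byte is in the next fragment
        }
        i += width;
    }
    return i;
}

// Hypothetical helper: buffers raw fragments and releases only the bytes
// that end on a character boundary; the remainder waits for more input.
class GbkChunker {
public:
    std::string feed(const std::string& raw) {
        pending_ += raw;
        const std::size_t n = completeGbkPrefix(
            reinterpret_cast<const unsigned char*>(pending_.data()),
            pending_.size());
        std::string ready = pending_.substr(0, n);
        pending_.erase(0, n);
        return ready;
    }

    // Number of bytes held back because they do not yet form a character.
    std::size_t pendingBytes() const { return pending_.size(); }

private:
    std::string pending_;
};
```

Feeding the four bytes D6 D0 41 D6 (GBK for '中', then 'A', then the first byte of another '中') releases three bytes and holds one back; feeding the missing D0 afterwards releases the completed two-byte character. A transcoder that reacts to U_TRUNCATED_CHAR_FOUND or EINVAL by keeping the unconsumed tail in the same way would avoid the errors shown above.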


