[jira] Created: (XERCESC-1936) ICUTransService and IconvGNUransService CAN NOT deal with huge file.

2010-08-03 Thread kirby zhou (JIRA)
ICUTransService and IconvGNUransService CAN NOT deal with huge file.


 Key: XERCESC-1936
 URL: https://issues.apache.org/jira/browse/XERCESC-1936
 Project: Xerces-C++
  Issue Type: Bug
  Components: Utilities
Affects Versions: 2.8.0, 3.1.1
 Environment: RHEL-5.5
glibc-2.5-49.el5_5.2
libicu-3.6-5.11.4

Reporter: kirby zhou


If a huge file passed to XMLReader, it will call TransService mulitple times, 
and splite the file content into several fragments.
Unfortunately, the fragment will contain incomplete multi-byte characters.
But neither ICUTransService nor IconvGNUransService deal with it. 
ICUTransService did not deal with U_TRUNCATED_CHAR_FOUND, and 
IconvGNUransService did not deal with EINVAL.

Both 2.8.0 and 3.1.1 have the same bug.

For example, make 2 XML like that:

]# ( echo '?xml version=1.0 encoding=GBK ?'; echo 'data'; for 
((i=0;i2;++i)); do echo -n '中文汉字A'; done ; echo; echo '/data' )  ~/small.xml
]# ( echo '?xml version=1.0 encoding=GBK ?'; echo 'data'; for 
((i=0;i10;++i)); do echo -n '中文汉字A'; done ; echo; echo '/data' )  
~/big.xml

# the small.xml and big.xml are analogical. 

]# samples/SAXPrint -x=gbk ~/small.xml 
?xml version=1.0 encoding=gbk?
data
中文汉字A中文汉字A
/data

# with icu
]# samples/SAXPrint -x=gbk ~/big.xml
?xml version=1.0 encoding=gbk?
data
Fatal Error at file /root/big.xml, line 3, char 16377
  Message: char 0x6C49 is not representable in 'gbk' encoding

# with iconvgnu
]# samples/SAXPrint -x=gbk ~/big.xml
]# samples/SAXPrint -x=gbk ~/big.xml 
?xml version=1.0 encoding=gbk?
data
Fatal Error at file /root/big.xml, line 3, char 16377
  Message: invalid multi-byte sequence





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: c-dev-unsubscr...@xerces.apache.org
For additional commands, e-mail: c-dev-h...@xerces.apache.org



[jira] Commented: (XERCESC-1936) ICUTransService and IconvGNUransService CAN NOT deal with huge file.

2010-08-03 Thread kirby zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/XERCESC-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894980#action_12894980
 ] 

kirby zhou commented on XERCESC-1936:
-

The following 2 lines are more suitable for UTF-8 locale users to debug.

]# ( echo '?xml version=1.0 encoding=GBK ?'; echo 'data'; for 
((i=0;i2;++i)); do echo -en '\xd6\xd0\xce\xc4\xba\xba\xd7\xd6A'; done ; echo; 
echo '/data' )   /small.xml

]# ( echo '?xml version=1.0 encoding=GBK ?'; echo 'data'; for 
((i=0;i10;++i)); do echo -en '\xd6\xd0\xce\xc4\xba\xba\xd7\xd6A'; done ; 
echo; echo '/data' )  ~/big.xml 


diff -x .svn -x CVS -ru --show-c-function 
xerces-c-3.1.1.bak/src/xercesc/util/Transcoders/IconvGNU/IconvGNUTransService.cpp
 xerces-c-3.1.1/src/xercesc/util/Transcoders/IconvGNU/IconvGNUTransService.cpp
--- 
xerces-c-3.1.1.bak/src/xercesc/util/Transcoders/IconvGNU/IconvGNUTransService.cpp
   2010-01-20 16:45:02.0 +0800
+++ 
xerces-c-3.1.1/src/xercesc/util/Transcoders/IconvGNU/IconvGNUTransService.cpp   
2010-08-04 02:07:06.0 +0800
@@ -1049,6 +1049,9 @@ XMLSize_tIconvGNUTranscoder::transco
 for (size_t cnt = 0; cnt  maxChars  srcLen; cnt++) {
 size_trc = iconvFrom(startSrc, srcLen, orgTarget, uChSize());
 if (rc == (size_t)-1) {
+if (errno == EINVAL) {
+break;
+}
 if (errno != E2BIG || prevSrcLen == srcLen) {
 ThrowXMLwithMemMgr(TranscodingException, 
XMLExcepts::Trans_BadSrcSeq, getMemoryManager());
 }
diff -x .svn -x CVS -ru --show-c-function 
xerces-c-3.1.1.bak/src/xercesc/util/Transcoders/ICU/ICUTransService.cpp 
xerces-c-3.1.1/src/xercesc/util/Transcoders/ICU/ICUTransService.cpp
--- xerces-c-3.1.1.bak/src/xercesc/util/Transcoders/ICU/ICUTransService.cpp 
2010-01-20 16:45:02.0 +0800
+++ xerces-c-3.1.1/src/xercesc/util/Transcoders/ICU/ICUTransService.cpp 
2010-08-04 02:28:46.0 +0800
@@ -666,7 +666,7 @@ ICUTranscoder::transcodeTo( const   XMLC
 );
 
 // Rememember the status before we possibly overite the error code
-const bool res = (err == U_ZERO_ERROR);
+const bool res = (err == U_ZERO_ERROR || (err == U_BUFFER_OVERFLOW_ERROR 
 startSrc  srcPtr));
 
 // Put the old handler back
 err = U_ZERO_ERROR;
[



 ICUTransService and IconvGNUransService CAN NOT deal with huge file.
 

 Key: XERCESC-1936
 URL: https://issues.apache.org/jira/browse/XERCESC-1936
 Project: Xerces-C++
  Issue Type: Bug
  Components: Utilities
Affects Versions: 2.8.0, 3.1.1
 Environment: RHEL-5.5
 glibc-2.5-49.el5_5.2
 libicu-3.6-5.11.4
Reporter: kirby zhou

 If a huge file passed to XMLReader, it will call TransService mulitple times, 
 and splite the file content into several fragments.
 Unfortunately, the fragment will contain incomplete multi-byte characters.
 But neither ICUTransService nor IconvGNUransService deal with it. 
 ICUTransService did not deal with U_TRUNCATED_CHAR_FOUND, and 
 IconvGNUransService did not deal with EINVAL.
 Both 2.8.0 and 3.1.1 have the same bug.
 For example, make 2 XML like that:
 ]# ( echo '?xml version=1.0 encoding=GBK ?'; echo 'data'; for 
 ((i=0;i2;++i)); do echo -n '中文汉字A'; done ; echo; echo '/data' )  
 ~/small.xml
 ]# ( echo '?xml version=1.0 encoding=GBK ?'; echo 'data'; for 
 ((i=0;i10;++i)); do echo -n '中文汉字A'; done ; echo; echo '/data' )  
 ~/big.xml
 # the small.xml and big.xml are analogical. 
 ]# samples/SAXPrint -x=gbk ~/small.xml 
 ?xml version=1.0 encoding=gbk?
 data
 中文汉字A中文汉字A
 /data
 # with icu
 ]# samples/SAXPrint -x=gbk ~/big.xml
 ?xml version=1.0 encoding=gbk?
 data
 Fatal Error at file /root/big.xml, line 3, char 16377
   Message: char 0x6C49 is not representable in 'gbk' encoding
 # with iconvgnu
 ]# samples/SAXPrint -x=gbk ~/big.xml
 ]# samples/SAXPrint -x=gbk ~/big.xml 
 ?xml version=1.0 encoding=gbk?
 data
 Fatal Error at file /root/big.xml, line 3, char 16377
   Message: invalid multi-byte sequence

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: c-dev-unsubscr...@xerces.apache.org
For additional commands, e-mail: c-dev-h...@xerces.apache.org