[jira] Created: (XERCESC-1936) ICUTransService and IconvGNUransService CAN NOT deal with huge file.

2010-08-03 Thread kirby zhou (JIRA)
ICUTransService and IconvGNUransService CAN NOT deal with huge file.


 Key: XERCESC-1936
 URL: https://issues.apache.org/jira/browse/XERCESC-1936
 Project: Xerces-C++
  Issue Type: Bug
  Components: Utilities
Affects Versions: 2.8.0, 3.1.1
 Environment: RHEL-5.5
glibc-2.5-49.el5_5.2
libicu-3.6-5.11.4

Reporter: kirby zhou


If a huge file is passed to XMLReader, it calls the TransService multiple
times, splitting the file content into several fragments.
Unfortunately, a fragment can end with an incomplete multi-byte character,
and neither ICUTransService nor IconvGNUTransService handles that case:
ICUTransService does not handle U_TRUNCATED_CHAR_FOUND, and
IconvGNUTransService does not handle EINVAL.

Both 2.8.0 and 3.1.1 have the same bug.
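
To make the fragment problem concrete, here is a minimal sketch of how a
caller could avoid handing a transcoder half a character. It is not Xerces
code: completeGbkPrefix and splitOnCharBoundaries are hypothetical helpers,
and the sketch assumes plain GBK, where a byte in 0x81..0xFE starts a
two-byte character and every other byte stands alone.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Xerces-C++): number of leading bytes of a
// GBK chunk that form complete characters. Assumes plain GBK: a lead byte in
// 0x81..0xFE is followed by exactly one trail byte, other bytes are single.
std::size_t completeGbkPrefix(const unsigned char* data, std::size_t len)
{
    assert(data != nullptr || len == 0);
    std::size_t i = 0;
    while (i < len) {
        const bool isLead = data[i] >= 0x81 && data[i] <= 0xFE;
        const std::size_t need = isLead ? 2 : 1;
        if (i + need > len)
            break;                  // chunk ends in the middle of a character
        i += need;
    }
    return i;
}

// Hypothetical reader loop: consumes `input` in blocks of `blockSize` bytes
// and returns the pieces that would be handed to a transcoder. A trailing
// partial character is carried over and prepended to the next block.
std::vector<std::string> splitOnCharBoundaries(const std::string& input,
                                               std::size_t blockSize)
{
    assert(blockSize >= 2);
    std::vector<std::string> pieces;
    std::string pending;            // bytes not yet handed out
    for (std::size_t pos = 0; pos < input.size(); pos += blockSize) {
        pending += input.substr(pos, blockSize);
        const std::size_t n = completeGbkPrefix(
            reinterpret_cast<const unsigned char*>(pending.data()),
            pending.size());
        if (n > 0)
            pieces.push_back(pending.substr(0, n));
        pending.erase(0, n);
    }
    if (!pending.empty())
        pieces.push_back(pending);  // input itself was truncated
    return pieces;
}
```

With a block size of 5 and the 9-byte test string repeated twice, every
piece ends on a character boundary even though the raw blocks do not. The
same carry-over idea is what an EINVAL or U_TRUNCATED_CHAR_FOUND handler
would need to implement inside the transcoder.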

For example, create two XML files like this:

]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for ((i=0;i<2;++i)); do echo -n '中文汉字A'; done ; echo; echo '</data>' ) > ~/small.xml
]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for ((i=0;i<10;++i)); do echo -n '中文汉字A'; done ; echo; echo '</data>' ) > ~/big.xml

# small.xml and big.xml are analogous.

]# samples/SAXPrint -x=gbk ~/small.xml
<?xml version="1.0" encoding="gbk"?>
<data>
中文汉字A中文汉字A
</data>

# with icu
]# samples/SAXPrint -x=gbk ~/big.xml
<?xml version="1.0" encoding="gbk"?>
<data>
Fatal Error at file /root/big.xml, line 3, char 16377
  Message: char 0x6C49 is not representable in 'gbk' encoding

# with iconvgnu
]# samples/SAXPrint -x=gbk ~/big.xml
<?xml version="1.0" encoding="gbk"?>
<data>
Fatal Error at file /root/big.xml, line 3, char 16377
  Message: invalid multi-byte sequence





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: c-dev-unsubscr...@xerces.apache.org
For additional commands, e-mail: c-dev-h...@xerces.apache.org



[jira] Updated: (XERCESC-1936) ICUTransService and IconvGNUransService CAN NOT deal with huge file.

2010-08-03 Thread Boris Kolpackov (JIRA)

 [ 
https://issues.apache.org/jira/browse/XERCESC-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boris Kolpackov updated XERCESC-1936:
-


Hi,

Can you attach the sample files to the bug report? The content that you have 
pasted in the description is all garbled. Also, would you be able to come up 
with a patch for this issue?




[jira] Commented: (XERCESC-1936) ICUTransService and IconvGNUransService CAN NOT deal with huge file.

2010-08-03 Thread kirby zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/XERCESC-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894980#action_12894980
 ] 

kirby zhou commented on XERCESC-1936:
-

The following two commands are better suited for users in a UTF-8 locale who
want to reproduce the problem, because they emit the GBK bytes directly.

]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for ((i=0;i<2;++i)); do echo -en '\xd6\xd0\xce\xc4\xba\xba\xd7\xd6A'; done ; echo; echo '</data>' ) > ~/small.xml

]# ( echo '<?xml version="1.0" encoding="GBK" ?>'; echo '<data>'; for ((i=0;i<10;++i)); do echo -en '\xd6\xd0\xce\xc4\xba\xba\xd7\xd6A'; done ; echo; echo '</data>' ) > ~/big.xml


diff -x .svn -x CVS -ru --show-c-function xerces-c-3.1.1.bak/src/xercesc/util/Transcoders/IconvGNU/IconvGNUTransService.cpp xerces-c-3.1.1/src/xercesc/util/Transcoders/IconvGNU/IconvGNUTransService.cpp
--- xerces-c-3.1.1.bak/src/xercesc/util/Transcoders/IconvGNU/IconvGNUTransService.cpp   2010-01-20 16:45:02.0 +0800
+++ xerces-c-3.1.1/src/xercesc/util/Transcoders/IconvGNU/IconvGNUTransService.cpp   2010-08-04 02:07:06.0 +0800
@@ -1049,6 +1049,9 @@ XMLSize_t IconvGNUTranscoder::transco
     for (size_t cnt = 0; cnt < maxChars && srcLen; cnt++) {
         size_t rc = iconvFrom(startSrc, &srcLen, &orgTarget, uChSize());
         if (rc == (size_t)-1) {
+            if (errno == EINVAL) {
+                break;
+            }
             if (errno != E2BIG || prevSrcLen == srcLen) {
                 ThrowXMLwithMemMgr(TranscodingException, XMLExcepts::Trans_BadSrcSeq, getMemoryManager());
             }
diff -x .svn -x CVS -ru --show-c-function xerces-c-3.1.1.bak/src/xercesc/util/Transcoders/ICU/ICUTransService.cpp xerces-c-3.1.1/src/xercesc/util/Transcoders/ICU/ICUTransService.cpp
--- xerces-c-3.1.1.bak/src/xercesc/util/Transcoders/ICU/ICUTransService.cpp 2010-01-20 16:45:02.0 +0800
+++ xerces-c-3.1.1/src/xercesc/util/Transcoders/ICU/ICUTransService.cpp 2010-08-04 02:28:46.0 +0800
@@ -666,7 +666,7 @@ ICUTranscoder::transcodeTo( const   XMLC
     );
 
     // Rememember the status before we possibly overite the error code
-    const bool res = (err == U_ZERO_ERROR);
+    const bool res = (err == U_ZERO_ERROR || (err == U_BUFFER_OVERFLOW_ERROR && startSrc > srcPtr));
 
     // Put the old handler back
     err = U_ZERO_ERROR;



Memory management bugs

2010-08-03 Thread Lyublena Antova
I tried to use Xerces with the pluggable MemoryManager and discovered that on
several occasions objects are instantiated with the global new operator, which
bypasses the memory manager. Here are some of those cases:

 *   initializing the EncodingValidator in EncodingValidator.cpp
 *   creating a DOMImplementationListImpl in DOMImplementationImpl.cpp and 
DOMImplementationRegistry.cpp
 *   creating a DOMNodeListImpl in DOMNodeImpl.cpp
 *   creating a DOMDocumentTypeImpl in DOMImplementationImpl.cpp
 *   ...

In our code we essentially forbid the use of plain global new so the above 
cases blow up when Xerces is linked against our codebase.

To my understanding the pluggable memory manager is used either:

 *   by making classes derive from the XMemory class, which overloads new and 
delete, or
 *   by using the global overloaded placement new operators that take a 
DOMDocument(Impl) object
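
As an illustration of the first mechanism, here is a minimal standalone
sketch. SketchManager, globalManager and SketchMemory are hypothetical
stand-ins written for this mail, not the actual Xerces MemoryManager and
XMemory classes; the point is only that a class-level operator new/delete
pair reroutes every plain "new Derived", while an unrelated class still
goes to the global allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical stand-in for a pluggable memory manager (not the Xerces class).
struct SketchManager {
    std::size_t live = 0;                   // outstanding allocations
    void* allocate(std::size_t n) {
        void* p = std::malloc(n);
        if (!p) throw std::bad_alloc();
        ++live;
        return p;
    }
    void deallocate(void* p) {
        if (!p) return;
        assert(live > 0);
        --live;
        std::free(p);
    }
};

// One process-wide manager, playing the role of the global default.
inline SketchManager& globalManager() {
    static SketchManager instance;
    return instance;
}

// Hypothetical base class in the spirit of XMemory: the member operators
// route every "new Derived" and "delete ptr" through the manager.
struct SketchMemory {
    static void* operator new(std::size_t n) {
        return globalManager().allocate(n);
    }
    static void operator delete(void* p) {
        globalManager().deallocate(p);
    }
};

struct Tracked : SketchMemory { int value = 0; };  // goes through the manager
struct Untracked { int value = 0; };               // uses plain global new
```

Creating a Tracked object bumps the manager's counter, while creating an
Untracked object leaves it unchanged, which is the situation described for
the classes listed earlier.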

The problem classes mentioned above are not derived from XMemory, yet they are 
occasionally instantiated with a plain new operator instead of the placement 
new operators.

I have a fix that makes those classes inherit from XMemory, so that they get 
instantiated with the global memory manager. That caused some problems because 
on some occasions the global placement new operators were shadowed by the 
XMemory member new operators, which produced unexpected results. The solution 
was to force the use of the global new (::new) so that the operator calls 
resolve correctly.
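
The shadowing can be reproduced in isolation. The sketch below uses
hypothetical names (SketchDoc, SketchBase, SketchNode) and assumes the base
class declares a member placement form taking a void*; that is one way the
effect can arise, not necessarily the exact overload set in Xerces. Once a
class has any member operator new, "new (doc) T" only considers the member
overloads, and a document pointer converts silently to void*.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical document type that hands out node storage.
struct SketchDoc { int allocations = 0; };

// Global placement form keyed on the document: "new (doc) Node".
void* operator new(std::size_t n, SketchDoc* doc)
{
    void* p = std::malloc(n);
    if (!p) throw std::bad_alloc();
    ++doc->allocations;
    return p;
}

// Matching placement delete, used only if a constructor throws.
void operator delete(void* p, SketchDoc* doc) noexcept
{
    --doc->allocations;
    std::free(p);
}

// Hypothetical XMemory-like base. Its member operators hide every global
// operator new for derived classes, including the document-keyed one.
struct SketchBase {
    static int memberCalls;
    static void* operator new(std::size_t n, void* /*hint*/)
    {
        void* p = std::malloc(n);
        if (!p) throw std::bad_alloc();
        ++memberCalls;                      // records that the member form ran
        return p;
    }
    static void operator delete(void* p, void*) noexcept { std::free(p); }
    static void operator delete(void* p) noexcept { std::free(p); }
};
int SketchBase::memberCalls = 0;

struct SketchNode : SketchBase { int value = 7; };

// Resolves to SketchBase::operator new, because SketchDoc* converts to void*.
SketchNode* makeShadowed(SketchDoc* doc) { return new (doc) SketchNode; }

// The leading :: skips class scope and picks the document-keyed global form.
SketchNode* makeGlobal(SketchDoc* doc) { return ::new (doc) SketchNode; }
```

Here makeShadowed never touches the document, while makeGlobal does. In
this sketch both paths allocate with malloc, so a plain "delete node" is
safe for either; real code would have to keep allocation and release
matched.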

Was there any reason why the classes above do not inherit from XMemory in the 
first place?

On a broader note, is there a particular reason not to have a placement new 
operator that takes a MemoryManager instance? Perhaps deallocation issues?
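
To spell out the deallocation concern as I understand it: a delete
expression cannot pass extra arguments, so the object has to remember which
manager it came from. Below is a minimal sketch of one possible workaround,
storing the owner in a header in front of each block. PoolSketch,
BlockHeader, Item and destroyWithOwner are hypothetical names, not Xerces
API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical manager type (stand-in for a MemoryManager instance).
struct PoolSketch {
    int live = 0;                           // outstanding allocations
    void* allocate(std::size_t n) {
        void* p = std::malloc(n);
        if (!p) throw std::bad_alloc();
        ++live;
        return p;
    }
    void deallocate(void* p) {
        assert(live > 0);
        --live;
        std::free(p);
    }
};

// Header placed in front of every block so the block knows its owner.
// Aligned so that the object stored after it stays suitably aligned.
struct alignas(std::max_align_t) BlockHeader { PoolSketch* owner; };

// Placement form taking the manager: "new (pool) T(args)".
void* operator new(std::size_t n, PoolSketch& pool)
{
    void* raw = pool.allocate(sizeof(BlockHeader) + n);
    BlockHeader* header = ::new (raw) BlockHeader;
    header->owner = &pool;
    return header + 1;
}

// Matching placement delete, called only if the constructor throws.
void operator delete(void* p, PoolSketch& pool) noexcept
{
    if (p) pool.deallocate(static_cast<BlockHeader*>(p) - 1);
}

// Release helper used instead of a delete expression. `obj` must be the
// exact pointer returned by "new (pool) T", not a pointer to a base.
template <typename T>
void destroyWithOwner(T* obj)
{
    if (!obj) return;
    obj->~T();
    BlockHeader* header = reinterpret_cast<BlockHeader*>(obj) - 1;
    header->owner->deallocate(header);
}

struct Item {
    int v;
    explicit Item(int x) : v(x) {}
};
```

Usage would be "Item* it = new (pool) Item(42);" followed later by
"destroyWithOwner(it);". The cost is one header per object and the rule
that a plain delete must never be used on such objects, which may be part
of the answer to the question above.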

Thanks,
Lyublena