Wons, Jean-Baptiste wrote:
Hello. I am not sure if this is a bug in xerces or me not using xerces well. This is my code:
<code>
#include <string>
#include <iostream>
#include <xercesc/dom/DOM.hpp>
#include <xercesc/dom/DOMException.hpp>
#include <xercesc/dom/DOMImplementationRegistry.hpp>
#include <xercesc/framework/MemBufInputSource.hpp>
#include <xercesc/parsers/XercesDOMParser.hpp>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>

using namespace std;
using namespace XERCES_CPP_NAMESPACE;

void replaceSpecialCharactersXML(std::string &s)
{
    string cp;
    unsigned int i;
    cp.reserve(s.size()*2);
    for (i = 0; i < s.size(); i++) {
        const unsigned char c = s[i];
        if ((c < 32 && c != '\012' && c != '\015') || c > 127) {
            char buffer[10000];
            sprintf(buffer, "&#x%02x;", c);
            cp += buffer;
        } else {
            cp += c;
        }
    }
    s = cp;
}

int main()
{
    XMLPlatformUtils::Initialize();
    string aString0 ("This will crash ££££ ...");
    XMLCh* fUnicodeForm = XMLString::transcode(aString0.c_str());
    char *pMsg = XMLString::transcode(fUnicodeForm);
    string res(pMsg);
    replaceSpecialCharactersXML(res);
    cout << aString0 << " -> " << pMsg << " -> " << res << endl;
    return 0;
}
</code>
When I compile and run, I have that output:
<output>
sh$ ./testxerces
This will crash ££££ ... -> This will crash ... -> This will crash &#x1a;&#x1a;&#x1a;&#x1a; ...
</output>
I ran your code on Windows XP with the default Windows code page for English and got the following result:

This will crash úúúú ... -> This will crash úúúú ... -> This will crash &#xa3;&#xa3;&#xa3;&#xa3; ...

The fact that your system displays "ú" instead of the pound sign is your first clue that something is very wrong.


When I transcode the £ sign to XMLCh, then transcode it back to a char*, it is transformed to 0x1a. Is it a real bug, or is it just me missing something ?
It's generally dangerous to transcode between the local code page and Unicode because it's easy to lose data. It may be that your current code page encodes the Unicode character U+00A3 Pound Sign as 0x1A, although that seems unlikely. Without knowing anything about your system's local code page, we can only guess. Also, if your code will run on other systems, you can't make any assumptions about the local code page.
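
To make the data-loss point concrete, here is a minimal sketch that does not use Xerces at all. It assumes a local code page of plain 7-bit ASCII and a transcoder that substitutes 0x1A (the ASCII SUB control character) for anything it cannot represent. Both are assumptions for illustration only, and `toAsciiLossy` and `fromAscii` are hypothetical helpers, not Xerces functions. Whether this is what actually happens on the system in question is not established here.

```cpp
#include <string>

// Hypothetical stand-in for a transcoder whose target code page is
// 7-bit ASCII (an assumption for this sketch). Code units it cannot
// represent are replaced by 0x1A, the ASCII SUB control character.
std::string toAsciiLossy(const std::u16string& in)
{
    std::string out;
    for (char16_t c : in) {
        out += (c < 0x80) ? static_cast<char>(c) : '\x1a';
    }
    return out;
}

// Widening ASCII bytes back to UTF-16 is exact, but it cannot restore
// characters that were already replaced on the way down.
std::u16string fromAscii(const std::string& in)
{
    std::u16string out;
    for (unsigned char c : in) {
        out += static_cast<char16_t>(c);
    }
    return out;
}
```

Under these assumptions, round-tripping `u"a\u00A3b"` through `toAsciiLossy` and `fromAscii` yields the code units 0x61, 0x1A, 0x62, so the pound sign is gone for good.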

It's also dangerous to embed strings in your program with code units outside of a very limited set, because they will be sensitive to the compiler's idea of how characters are encoded. For example, you may be using an editor that supports UTF-8 or ISO-8859-1, while your compiler assumes some other encoding for the bytes of the source file. Since your email arrived encoded in ISO-8859-1, perhaps your editor also uses that encoding.
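
As a small illustration of why the source file's encoding matters, the sketch below writes out the bytes a compiler would see for a pound sign literal under two common encodings. The byte values are spelled with explicit escapes so the example does not depend on how this text is stored. `hexBytes` is a hypothetical helper introduced only for this sketch.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render each byte of a narrow string as two
// lowercase hex digits followed by a space.
std::string hexBytes(const std::string& s)
{
    std::string out;
    for (unsigned char c : s) {
        char buffer[4];
        std::snprintf(buffer, sizeof buffer, "%02x ", c);
        out += buffer;
    }
    return out;
}

// The same visible character, U+00A3 Pound Sign, as it is stored in
// two differently encoded source files.
const std::string poundLatin1 = "\xA3";      // ISO-8859-1: one byte
const std::string poundUtf8   = "\xC2\xA3";  // UTF-8: two bytes
```

A transcoder that assumes one of these encodings while the literal was stored in the other will decode the wrong character or characters.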

The best thing to do is to use Unicode strings throughout your code, and only transcode to the local code page when you absolutely must, making sure you never assume that any particular character can be represented in the local code page. Also, construct hard-coded strings directly in UTF-16, instead of embedding character string constants and transcoding them. You can look at src/xercesc/util/XMLUni.cpp for some examples of how to do that.
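
A rough sketch of that approach, written without Xerces so it stands alone: the test string is spelled out as UTF-16 code units and escaped straight from those code units, with no trip through the local code page. `Utf16Unit` is a stand-in for `XMLCh`, on the assumption that `XMLCh` is a 16-bit UTF-16 code unit type. `kMessage` and `escapeNonAscii` are hypothetical names. Surrogate pairs are not combined here.

```cpp
#include <cstdio>
#include <string>

// Stand-in for Xerces' XMLCh (assumption: a 16-bit UTF-16 code unit).
typedef char16_t Utf16Unit;

// "This will crash ££££ ..." as explicit UTF-16 code units, so the
// value does not depend on the source file encoding or the local
// code page. 0x00A3 is U+00A3 Pound Sign.
static const Utf16Unit kMessage[] = {
    u'T', u'h', u'i', u's', u' ', u'w', u'i', u'l', u'l', u' ',
    u'c', u'r', u'a', u's', u'h', u' ',
    0x00A3, 0x00A3, 0x00A3, 0x00A3,
    u' ', u'.', u'.', u'.', 0
};

// Hypothetical helper: copy ASCII through unchanged and write every
// other code unit (and control characters other than LF and CR) as a
// numeric character reference.
std::string escapeNonAscii(const Utf16Unit* s)
{
    std::string out;
    for (; *s != 0; ++s) {
        const unsigned int c = *s;
        if ((c < 0x20 && c != 0x0A && c != 0x0D) || c >= 0x7F) {
            char buffer[16];
            std::snprintf(buffer, sizeof buffer, "&#x%02x;", c);
            out += buffer;
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

Calling `escapeNonAscii(kMessage)` gives `This will crash &#xa3;&#xa3;&#xa3;&#xa3; ...` regardless of the locale the program runs in, because no narrow-character transcoding takes place.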

Dave
