Re: Sterling pound sign encoding with XML string

David Bertoni Thu, 19 Jun 2008 18:17:10 -0700

Wons, Jean-Baptiste wrote:

Hello.

I am not sure if this is a bug in xerces or me not using xerces well.

This is my code:

<code>
#include <string>
#include <iostream>
#include <xercesc/dom/DOM.hpp>
#include <xercesc/dom/DOMException.hpp>
#include <xercesc/dom/DOMImplementationRegistry.hpp>
#include <xercesc/framework/MemBufInputSource.hpp>
#include <xercesc/parsers/XercesDOMParser.hpp>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>

using namespace std;
using namespace XERCES_CPP_NAMESPACE;

void replaceSpecialCharactersXML(std::string &s)
{
    string cp;
    unsigned int i;
    cp.reserve(s.size()*2);
    for (i = 0; i < s.size(); i++)
    {
        const unsigned char c = s[i];
        if ((c < 32 && c != '\012' && c != '\015') || c > 127)
        {
            char buffer[10000];
            sprintf(buffer, "&#x%02x;", c);
            cp += buffer;
        }
        else
        {
            cp += c;
        }
    }
    s = cp;
}

int main()
{
    XMLPlatformUtils::Initialize();
    string aString0 ("This will crash ££££ ...");
    XMLCh* fUnicodeForm = XMLString::transcode(aString0.c_str());
    char *pMsg = XMLString::transcode(fUnicodeForm);
    string res(pMsg);
    replaceSpecialCharactersXML(res);
    cout << aString0 << " -> " << pMsg << " -> " << res << endl;
    return 0;
}
</code>

When I compile and run, I have that output:

<output>
sh$ ./testxerces
This will crash ££££ ... -> This will crash ... -> This will crash  ...
</output>

I ran your code on Windows XP with the default Windows code page for English and got the following result:

This will crash úúúú ... -> This will crash úúúú ... -> This will crash ££££ ...

The fact that your system displays "ú" instead of the pound sign is your first clue that something is very wrong.

When I transcode the £ sign to XMLCh, then transcode it back to a char*, it is transformed to 0x1a. Is it a real bug, or is it just me missing something?

It's generally dangerous to transcode between the local code page and Unicode because it's easy to lose data. It may be that your current code page encodes the Unicode character U+00A3 Pound Sign as 0x1A, although that seems unlikely. Without knowing anything about your system's local code page, we can only guess. Also, if your code will run on other systems, you can't make any assumptions about the local code page.
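
To illustrate how such a round trip can lose data, here is a minimal sketch that does not use Xerces at all. It assumes a hypothetical local code page limited to 7-bit ASCII, and a made-up function, toAsciiLossy, that replaces anything it cannot represent with the ASCII SUB control character, 0x1A. Whether your platform's transcoder behaves this way is only a guess; nothing in this thread confirms it.

```cpp
#include <string>

// Hypothetical stand-in for a lossy "Unicode -> local code page" transcoder.
// Assumption: the local code page is plain 7-bit ASCII, and every code unit
// it cannot represent is replaced by the ASCII SUB character (0x1A).
std::string toAsciiLossy(const std::u16string& utf16)
{
    std::string out;
    out.reserve(utf16.size());
    for (char16_t unit : utf16) {
        // ASCII passes through unchanged; everything else is substituted.
        out += (unit < 0x80) ? static_cast<char>(unit) : '\x1A';
    }
    return out;
}
```

With this stand-in, the input u"A\u00A3B" comes back as the three bytes 'A', 0x1A, 'B'. The pound sign is gone and cannot be recovered from the result.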

It's also dangerous to embed strings in your program with code units outside of a very limited set, because they will be sensitive to the compiler's idea of how characters are encoded. For example, you may be using an editor that supports UTF-8 or ISO-8859-1, while your compiler assumes some other encoding for the bytes of the source file. Since your email arrived encoded in ISO-8859-1, perhaps your editor also uses that encoding.
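
One way to see what the compiler actually stored for a literal is to dump its bytes. The sketch below uses a hypothetical helper, hexDump, that is not part of Xerces. The escaped literals are unambiguous: the pound sign is the single byte 0xA3 in ISO-8859-1 and the two bytes 0xC2 0xA3 in UTF-8. What hexDump("£") returns depends on your editor and compiler settings, which is exactly the problem.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render each byte of a narrow string as two lowercase
// hex digits, separated by single spaces.
std::string hexDump(const std::string& s)
{
    std::string out;
    char buf[4];
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        // Go through unsigned char so bytes above 0x7F are not sign-extended.
        const unsigned value = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

For example, hexDump("\xA3") gives "a3" and hexDump("\xC2\xA3") gives "c2 a3", regardless of how the source file is encoded.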

The best thing to do is to use Unicode strings throughout your code, and only transcode to the local code page when you absolutely must, making sure you never assume that any particular character can be represented in the local code page. Also, construct hard-coded strings directly in UTF-16, instead of embedding character string constants and transcoding them. You can look at src/xercesc/util/XMLUni.cpp for some examples of how to do that.
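
As a rough sketch of that approach, the example below writes a string out as UTF-16 code units in a null-terminated array and escapes it without ever going through the local code page. To keep it compilable without Xerces, it makes several assumptions: char16_t stands in for XMLCh, raw numeric values stand in for the named character constants, and escapeNonAscii is a hypothetical helper, not a Xerces function. It does not handle markup characters such as & and <.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// "A£" spelled out as UTF-16 code units with a null terminator.
// Assumption: char16_t is used here in place of XMLCh.
const char16_t kPoundSample[] = { 0x0041, 0x00A3, 0x0000 };

// Hypothetical helper: copy ASCII through unchanged and write every other
// code point as an XML numeric character reference.
std::string escapeNonAscii(const char16_t* s)
{
    std::string out;
    for (std::size_t i = 0; s[i] != 0; ++i) {
        unsigned long cp = s[i];
        // Combine a high/low surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else {
            char buf[16];
            std::snprintf(buf, sizeof buf, "&#x%lX;", cp);
            out += buf;
        }
    }
    return out;
}
```

Calling escapeNonAscii(kPoundSample) yields "A&#xA3;". Because the input is already Unicode, the result does not depend on the source file encoding or on the local code page.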


Dave
