Hello,

I only hardcoded the pound sign in this piece of code to demonstrate my 
problem.
In my program, I actually read the data from a file, and then I try to build an 
XML document from the data in this file.

Sometimes this file, which is encoded in ISO-8859-1, contains the pound sign.
The transcode method of XMLString does not handle it, though: when I 
transcode the XML back to ISO-8859-1, I get 0x1a.

Is there any way to work around this?

Thanks,
Jean-Baptiste


-----Original Message-----
From: David Bertoni [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 19, 2008 9:17 PM
To: [email protected]
Subject: Re: Sterling pound sign encoding with XML string

Wons, Jean-Baptiste wrote:
> Hello. 
> 
> I am not sure if this is a bug in xerces or me not using xerces well. 
> 
> This is my code: 
> 
> <code> 
> 
> #include <cstdio> 
> #include <string> 
> #include <iostream> 
> 
> #include <xercesc/dom/DOM.hpp> 
> #include <xercesc/dom/DOMException.hpp> 
> #include <xercesc/dom/DOMImplementationRegistry.hpp> 
> #include <xercesc/framework/MemBufInputSource.hpp> 
> #include <xercesc/parsers/XercesDOMParser.hpp> 
> #include <xercesc/util/PlatformUtils.hpp> 
> #include <xercesc/util/XMLString.hpp> 
> 
> 
> using namespace std; 
> using namespace XERCES_CPP_NAMESPACE; 
> 
> void replaceSpecialCharactersXML(std::string &s) 
> { 
>     string cp; 
>     unsigned int i; 
>     cp.reserve(s.size()*2); 
>     for (i = 0; i < s.size(); i++) 
>     { 
>         const unsigned char c = s[i]; 
> 
>         if ((c < 32 && c != '\012' && c != '\015') || c > 127) 
>         { 
>             char buffer[10000]; 
>             sprintf(buffer, "&#x%02x;", c); 
>             cp += buffer; 
>         } 
>         else 
>         { 
>             cp += c; 
>         } 
>     } 
>     s = cp; 
> } 
> 
> 
> int main() 
> { 
>     XMLPlatformUtils::Initialize(); 
>     string aString0 ("This will crash ££££ ..."); 
>     XMLCh* fUnicodeForm = XMLString::transcode(aString0.c_str()); 
>     char *pMsg = XMLString::transcode(fUnicodeForm); 
>     string res(pMsg); 
>     replaceSpecialCharactersXML(res); 
> 
>     cout << aString0 << " -> " << pMsg << " -> " << res << endl; 
> 
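>     // Strings returned by transcode() must be released by the caller. 
>     XMLString::release(&fUnicodeForm); 
>     XMLString::release(&pMsg); 
>     XMLPlatformUtils::Terminate(); 
>     return 0; 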
> } 
> 
> </code> 
> 
> When I compile and run, I have that output: 
> 
> <output> 
> sh$ ./testxerces 
> This will crash ££££ ... -> This will crash ... -> This will crash 
> &#x1a;&#x1a;&#x1a;&#x1a; ... 
> </output> 
I ran your code on Windows XP with the default Windows code page for 
English and got the following result:

This will crash úúúú ... -> This will crash úúúú ... -> This will crash 
&#xa3;&#xa3;&#xa3;&#xa3; ...

The fact that the output shows "ú" instead of the pound sign is the 
first clue that something is very wrong.

> 
> When I transcode the £ sign to XMLCh, then transcode it back to a char*, it 
> is transformed to 0x1a. 
> 
> Is it a real bug, or is it just me missing something ? 
It's generally dangerous to transcode between the local code page and 
Unicode because it's easy to lose data.  It may be that your current 
code page encodes the Unicode character U+00A3 Pound Sign as 0x1A, 
although that seems unlikely.  Without knowing anything about your 
system's local code page, we can only guess.  Also, if your code will 
run on other systems, you can't make any assumptions about the local 
code page.
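
As a rough sketch of one workaround, assuming the input really is 
ISO-8859-1: decode the bytes yourself instead of going through the local 
code page.  ISO-8859-1 byte values coincide with the first 256 Unicode 
code points, so the mapping is a plain widening.  The helper names 
latin1ToUtf16 and utf16ToAsciiXml are hypothetical, char16_t stands in 
for a 16-bit XMLCh (which may be a different type in your Xerces build), 
and a C++11 compiler is assumed.

```cpp
#include <cstdio>
#include <string>

// Decode ISO-8859-1 bytes to UTF-16 code units. Each ISO-8859-1 byte
// value equals the Unicode code point of the character it encodes, so
// neither a lookup table nor the local code page is involved.
std::u16string latin1ToUtf16(const std::string& bytes)
{
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes)
        out.push_back(static_cast<char16_t>(b));
    return out;
}

// Write UTF-16 code units as ASCII-only text: ASCII passes through and
// everything else becomes a numeric character reference.
// Simplifications (assumptions of this sketch): surrogate pairs are not
// combined, markup characters such as '<' and '&' are not escaped, and
// control characters are not filtered.
std::string utf16ToAsciiXml(const std::u16string& text)
{
    std::string out;
    for (char16_t u : text) {
        if (u < 0x80) {
            out.push_back(static_cast<char>(u));
        } else {
            char ref[16];
            std::snprintf(ref, sizeof ref, "&#x%x;", static_cast<unsigned>(u));
            out += ref;
        }
    }
    return out;
}
```

With helpers like these, a byte 0xA3 read from the file becomes U+00A3 
and is written out as &#xa3;, whatever the local code page happens to be.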

It's also dangerous to embed strings in your program with code units 
outside of a very limited set, because they will be sensitive to the 
compiler's idea of how characters are encoded.  For example, you may be 
using an editor that supports UTF-8 or ISO-8859-1, while your compiler 
assumes some other encoding for the bytes of the source file.  Since 
your email arrived encoded in ISO-8859-1, perhaps your editor also uses 
that encoding.
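
One way to check what the compiler actually stored for a literal is to 
dump its bytes.  The sketch below is only an illustration: hexBytes is a 
hypothetical helper, and a C++11 compiler is assumed.  Writing the 
character with explicit escapes ("\xA3" for ISO-8859-1, "\xC2\xA3" for 
UTF-8) takes the editor and the compiler's source encoding out of the 
picture.

```cpp
#include <cstdio>
#include <string>

// Return the bytes of s as space-separated lowercase hex pairs.
std::string hexBytes(const std::string& s)
{
    std::string out;
    for (unsigned char b : s) {
        char buf[4];
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(b));
        if (!out.empty())
            out += ' ';
        out += buf;
    }
    return out;
}
```

Calling hexBytes on the literal containing the pound sign shows whether 
the compiler stored one byte (a3), two bytes (c2 a3), or something else 
entirely.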

The best thing to do is to use Unicode strings throughout your code, and 
only transcode to the local code page when you absolutely must, making 
sure you never assume that any particular character can be represented 
in the local code page.  Also, construct hard-coded strings directly in 
UTF-16, instead of embedding character string constants and transcoding 
them.  You can look at src/xercesc/util/XMLUni.cpp for some examples of 
how to do that.
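
A minimal sketch of that idea, not taken from XMLUni.cpp itself: the 
string is spelled out as an array of 16-bit code units with an explicit 
terminator.  CodeUnit is a hypothetical stand-in for XMLCh (assumed here 
to be a 16-bit type), and kPoundFive and codeUnitLength are made-up names 
for illustration.

```cpp
#include <cstddef>

// Stand-in for Xerces' XMLCh; assumed to be a 16-bit code unit type.
typedef char16_t CodeUnit;

// The two-character string "£5" as UTF-16 code units, null-terminated.
// No character literal is involved, so the source file encoding and the
// local code page cannot change the stored values.
const CodeUnit kPoundFive[] = { 0x00A3, 0x0035, 0x0000 };

// Count code units up to, but not including, the terminator.
std::size_t codeUnitLength(const CodeUnit* s)
{
    std::size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}
```

In real Xerces code the array would be declared as const XMLCh[], and it 
can be passed to the DOM APIs without any transcoding step.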

Dave
