Hi Dara,
this is a recurring issue with non-ASCII characters (in your case,
accented characters). Xerces works with Unicode internally, so as
long as you use XMLCh strings, whatever you store is what you get
back. The weak point is how those accented characters got into the
DOM in the first place; I guess you are using XMLString::transcode to
convert them from an ISO-8859-1 literal stored in your C++ source code.
But XMLString::transcode interprets its input as if it were encoded in
the current locale (as it would be for data read with scanf or cin),
so if the locale's encoding is not ISO-8859-1, the bytes will be
misinterpreted.
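To see why the locale matters here: a C or C++ program always starts
in the "C" locale, whatever LANG or LC_ALL say in the shell, and only
picks up the environment's locale once it calls setlocale(LC_ALL, "").
That would explain your application reporting "C". A minimal sketch
using only the standard library (the function names are mine, for
illustration only, and are not part of Xerces):

```cpp
#include <cassert>
#include <clocale>
#include <string>

// Returns the name of the locale currently in effect for this process.
std::string currentLocaleName() {
    // Passing a null pointer only queries the locale; it changes nothing.
    const char* name = std::setlocale(LC_ALL, nullptr);
    assert(name != nullptr);
    return name;
}

// Adopts the locale described by the environment (LANG, LC_ALL, ...).
// Returns false if the environment names a locale that is not installed.
bool adoptEnvironmentLocale() {
    return std::setlocale(LC_ALL, "") != nullptr;
}
```

Calling adoptEnvironmentLocale() at the top of main, before
XMLPlatformUtils::Initialize, is one way to make the process locale
follow the shell's. Whether that locale actually uses ISO-8859-1
depends on your environment, so it is worth checking the value
returned by currentLocaleName() afterwards.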
You have several options:
1) if the characters are stored in the C++ source, use Unicode
strings (if your compiler supports the L"string" format and its
wchar_t is the same size as XMLCh, try using it; otherwise, create
arrays of XMLCh elements and initialize them using the Unicode
code points of the characters)
2) use normal C literals, but replace XMLString::transcode with a
transcoder for ISO-8859-1
3) keep using XMLString::transcode, but ensure it is working with the
encoding you need (e.g. try calling setlocale before
XMLPlatformUtils::Initialize)
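As a rough illustration of options 1 and 2, here is a sketch in plain
standard C++. It uses char16_t as a stand-in for XMLCh, on the
assumption that XMLCh is a 16-bit UTF-16 code unit; Utf16Unit, kCafe
and latin1ToUtf16 are hypothetical names, not Xerces API. It relies on
the fact that every ISO-8859-1 byte value is equal to the Unicode code
point of the character it encodes:

```cpp
#include <string>

// Stand-in for XMLCh (assumption: a 16-bit UTF-16 code unit).
using Utf16Unit = char16_t;

// Option 1: spell the text as Unicode code points, so it no longer
// depends on the encoding of the source file or on the locale.
// "café" = 'c', 'a', 'f', U+00E9, followed by the terminator.
const Utf16Unit kCafe[] = { 0x0063, 0x0061, 0x0066, 0x00E9, 0x0000 };

// Option 2: widen ISO-8859-1 bytes explicitly instead of relying on
// the locale. Going through unsigned char avoids sign extension of
// bytes above 0x7F on platforms where char is signed.
std::u16string latin1ToUtf16(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (char c : bytes) {
        out.push_back(static_cast<Utf16Unit>(static_cast<unsigned char>(c)));
    }
    return out;
}
```

For example, latin1ToUtf16("caf\xE9") yields the same four code units
as kCafe. In a real application you would pass the result to the DOM
as XMLCh data (with a copy or cast if XMLCh is a different 16-bit
type), or use an ISO-8859-1 transcoder from Xerces rather than writing
your own.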
Hope this helps,
Alberto
At 04:14 PM 2/24/2006 +0000, dara wrote:
Hi all,
I am having a little fun with 'accented characters' in an
application I'm working with at the minute.
Through various tracing and debug methods, I can see that the
characters are correctly being propagated around the system until
the point where they are added to a DOM in Xerces.
After subsequent writing of the DOM to a MemBufFormatTarget* and
retrieval through the getRawBuffer() interface, the data is
-corrupt-. (I get characters out, but they are not accented and not
the chars from my buffer)
To remove any potential issues in the application, I've altered the
DOMPrint sample to write to the same type of target. It correctly
parses and writes the small XML document I've given it with accented
characters and in the codepage iso-8859-1.
I then added some code to add another node with some accented
characters, and these do not appear in the output. (which is
interesting, my app is giving me corrupt data and the sample isn't
giving me anything).
I'm running v240 at the moment under a Linux variant, and I've set
the shell locale to the same as the xml document being parsed.
However the application still reports the locale as "C", so I'm
going to try a few more things.
But I've a feeling I'm missing something basic. Do I need to do
something special with the Transcoder class, or the DOM document
when adding an element and text node?
I'm sure somebody has met this already. Any pointers?
Thanks and Regards
Dara