Hi Vincent,
You can ask Xerces to give you a transcoder for any particular encoding it
supports. In addition, if you've built Xerces with ICU integration, you'll
get transcoders for just about every encoding there is. There's a FAQ that
lists the encodings Xerces supports:

   http://xml.apache.org/xerces-c/faq-parse.html#faq-19

Latin1, or more properly ISO-8859-1, maps directly to the first 256
characters of Unicode, so if you want a cheap ISO-8859-1 transcoder, just
check whether each Unicode value is less than 256. If it is, truncate it to
a char. Otherwise, signal an error, or recover in whatever manner you see
fit.

If your input documents and stylesheets are encoded in ISO-8859-1, then
most likely all of the characters in the result tree will also be in that
encoding. The only other way to get unrepresentable characters into the
result tree would be if your documents contain numeric character
references, or other entity references to characters outside the range
supported by ISO-8859-1.

Xalan is using GCC's transcoding APIs, so if they can't transcode
characters, then Xalan can't either. This may be a limitation in GCC's
transcoding, or it may be that it really thinks it can't represent the
characters. I have no idea which it is.

I'm sorry if I haven't explained the transcoding issues very clearly, but
it's a very complex topic and difficult to explain without going into great
detail. Unfortunately, I really haven't had the time to reply to your
message in that much detail.

Dave

"Vincent Berruchon" <vincent.berruchon@neo-logism.fr>
04/16/2002 10:21 AM
Please respond to xalan-dev

To:      <[EMAIL PROTECTED]>
cc:      (bcc: David N Bertoni/Cambridge/IBM)
Subject: Smartest way from XalanDOMString to char*?
I'm still quite confused by the encoding problem.

In fact, my problem is just to get a char array (a C-style string: char[]
or char*) from a XalanDOMString (on Linux Mandrake 8.2 with Xalan-C and
gcc-2.96). Should I transcode the XalanDOMString to the local code page to
get a char*?

I have no problem with accented characters in my char*, but it seems that
Xalan's transcode function doesn't like these characters and can't
transcode them (perhaps because it doesn't know the extended character
table?).

During my experiments I tried copying the Unicode value of each character
in my XalanDOMString directly, one by one, into a char array (so an
unsigned short straight into a char). That's not a very good idea, since
XalanDOMString characters (unsigned short) can use 16 bits and a char only
8 bits. I can limit this to values no greater than xFF (255) if I only use
Basic Latin and Latin-1 Supplement characters, which need no more than 8
bits (x00 to xFF in hexadecimal) in Unicode.

But the biggest problem is that I'm assuming Unicode values from 00 to FF
match the ANSI or ISO Latin1 encoding, and I don't know whether that's
right or which one is used in char.

So can someone tell me the best way to get from a XalanDOMString to a
char*, and how to find out what the local code page encoding and/or the
extended encoding of my C char is (ANSI or ISO Latin 1 in my case)?

Thanks for your help
Vincent

Subject: Re: accent
From: "David N Bertoni/Cambridge/IBM" <[EMAIL PROTECTED]>
Date: 2002-04-11 15:45:28

>You can transcode to any encoding, but the transcode() call on
>XalanDOMString transcodes to the local code page. As I said before, if the
>local code page does not support that character, you cannot transcode the
>string to it. If a code page cannot represent a character, there's no
>other solution.
>If you want to transcode to something else, like iso-8859-1, you'll need to
>get a transcoder for the encoding and transcode the string. Whether or not
>your environment supports and can display that encoding, I don't know. See
>the Xerces documentation for more information on transcoding, or search the
>source files for examples.

>Xalan has a collection of serializers that you can use if you want to
>serialize an entire document, or sub-tree, but it's overkill for simple
>string transcoding.

>Dave

Tell me if I'm wrong: the XalanDOMString data (m_data) is a UTF-16 string
(stored as unsigned short on most platforms...)
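
For what it's worth, the "cheap ISO-8859-1 transcoder" described in the reply at the top of this thread could be sketched roughly as follows. This is only an illustration under stated assumptions: `Utf16Unit` is a hypothetical stand-in for Xalan's 16-bit character type, and `toLatin1` is a made-up helper name, not part of Xalan or Xerces. It assumes the input is a buffer of UTF-16 code units and simply reports failure at the first value that ISO-8859-1 cannot hold.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a 16-bit UTF-16 code unit
// (assumed here to match what XalanDOMString stores).
typedef unsigned short Utf16Unit;

// Convert `length` UTF-16 code units to ISO-8859-1 bytes.
// Returns true if every code unit was below 256. Returns false at the
// first code unit above 0xFF, leaving `out` holding only the characters
// converted up to that point.
bool toLatin1(const Utf16Unit* src, std::size_t length, std::string& out)
{
    out.clear();
    out.reserve(length);

    for (std::size_t i = 0; i < length; ++i)
    {
        if (src[i] > 0xFF)
        {
            // Not representable in ISO-8859-1: signal an error.
            return false;
        }

        // Values 0x00..0xFF map one-to-one onto ISO-8859-1 bytes.
        out.push_back(static_cast<char>(static_cast<unsigned char>(src[i])));
    }

    return true;
}
```

To use something like this with a XalanDOMString, you would pass its character buffer and length, assuming your Xalan build exposes them and that its character type really is 16 bits wide. Instead of returning false, the helper could substitute a placeholder such as '?' if that recovery suits you better. The resulting bytes are ISO-8859-1, which is not necessarily the local code page.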
