Re: Smartest way from XalanDOMString to char*?

David N Bertoni/Cambridge/IBM Tue, 16 Apr 2002 13:14:25 -0700


Hi Vincent,


You can ask Xerces to give you a transcoder for any particular encoding it
supports.  In addition, if you've built Xerces with ICU integration, you'll
get transcoders for just about every encoding there is.  There's a FAQ that
has the list of encodings that Xerces supports:

   http://xml.apache.org/xerces-c/faq-parse.html#faq-19

Latin1, or more properly ISO-8859-1, maps directly to the first 256
characters of Unicode, so if you want a cheap ISO-8859-1 transcoder, just
check to see if the Unicode value is less than 256.  If it is, just
truncate it to a char.  Otherwise, signal an error, or recover in whatever
manner you see fit.  If your input documents and stylesheets are encoded in
ISO-8859-1, then most likely all of the characters in the result tree will
also be in that encoding.  The only other way to get unrepresentable
characters into the result tree would be if your documents contain
numeric character references, or entity references to characters outside
the range supported by ISO-8859-1.
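
As a rough illustration of the cheap ISO-8859-1 approach, here is a minimal
sketch.  The helper name toLatin1 is hypothetical and is not part of Xalan or
Xerces.  It assumes the string's characters are available as an array of
16-bit unsigned values; how you get that pointer and length out of a
XalanDOMString depends on your Xalan version.  It throws on unrepresentable
characters, but substituting '?' would be an equally reasonable way to
recover.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a Xalan or Xerces API): converts UTF-16 code
// units to ISO-8859-1 bytes by truncation, which is only valid for values
// below 256.
std::string toLatin1(const unsigned short* src, std::size_t length)
{
    std::string result;
    result.reserve(length);

    for (std::size_t i = 0; i < length; ++i)
    {
        if (src[i] > 0xFF)
        {
            // Outside ISO-8859-1: signal an error rather than silently
            // dropping the high bits.
            throw std::range_error(
                "character not representable in ISO-8859-1");
        }

        // Values 0x00-0xFF map one-to-one onto ISO-8859-1 bytes.
        result += static_cast<char>(static_cast<unsigned char>(src[i]));
    }

    return result;
}
```

The resulting std::string owns its buffer, so result.c_str() gives a char*
style view that stays valid for as long as the std::string does.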

Xalan is using GCC's transcoding APIs, so if they can't transcode
characters, then Xalan cannot.  This may be a limitation in GCC's
transcoding, or it may be that it really thinks it can't represent the
characters.  I have no idea which it is.
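
One thing that may help narrow this down is to check which character set the
C runtime thinks the local code page is.  The sketch below is POSIX-only and
rests on an assumption: that local code page transcoding follows the C
library's current locale.  The function name localCodeset is hypothetical.
If it reports an ASCII-only codeset, accented characters would not be
representable.

```cpp
#include <clocale>
#include <langinfo.h>
#include <string>

// Hypothetical helper: reports the codeset of the locale selected by the
// environment (LANG, LC_ALL, LC_CTYPE).  POSIX-only.
std::string localCodeset()
{
    // Adopt the environment's character-type locale.  A program that never
    // calls setlocale() runs in the "C" locale.  If this call fails, the
    // previous locale simply stays in effect.
    std::setlocale(LC_CTYPE, "");

    // nl_langinfo(CODESET) names the character encoding of that locale.
    return std::string(nl_langinfo(CODESET));
}
```

Printing the returned string at program startup is enough to see whether the
environment is set up for ISO-8859-1, UTF-8, or plain ASCII.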

I'm sorry if I haven't explained the transcoding issues very clearly, but
it's a very complex topic and difficult to explain without going into great
detail.  Unfortunately, I really haven't had the time to reply to your
message in that much detail.

Dave



"Vincent Berruchon" <vincent.berruchon@neo-logism.fr>
04/16/2002 10:21 AM
Please respond to xalan-dev

To:      <[EMAIL PROTECTED]>
cc:      (bcc: David N Bertoni/Cambridge/IBM)
Subject: Smartest way from XalanDOMString to char*?



I'm still quite confused by the encoding problem.
In fact, my problem is just to get a char array (a C-style string: char[] or
char*) from a XalanDOMString (on Linux Mandrake 8.2 with Xalan-C and
gcc-2.96).

Should I transcode the XalanDOMString to the local code page to get a char*?

I have no problem with accented characters in my char*, but it seems that
the transcode function of Xalan doesn't like these characters and can't
transcode them (perhaps because it doesn't know the extended character table?).


During my experiments I tried copying the Unicode value of each character
in my XalanDOMString, one by one, directly into a char array (so an
unsigned short goes directly into a char).
That's not a very good idea, since XalanDOMString characters (unsigned
short) can use 16 bits and a char only 8 bits.
I can limit this to values up to xFF (255) if I only use Basic Latin
and Latin-1 Supplement characters, whose values fit in 8 bits
(x00 to xFF in hexadecimal) in the Unicode encoding.
But the biggest problem is that I'm assuming Unicode values 00 to FF match
the ANSI or ISO Latin1 encoding; I don't know whether that's right, or which
one is used in a char.

So can someone tell me the best way to get from a XalanDOMString to a char*,
and how to find out what the local code page encoding is,
and/or the extended encoding in my C char (ANSI or ISO Latin 1, in my
case)?

Thanks for your help
Vincent

Subject:  Re: accent
From:     "David N Bertoni/Cambridge/IBM" <[EMAIL PROTECTED]>
Date:     2002-04-11 15:45:28
>You can transcode to any encoding, but the transcode() call on
>XalanDOMString transcodes to the local code page.  As I said before, if the
>local code page does not support that character, you cannot transcode the
>string to it.  If a code page cannot represent a character, there's no
>other solution.
>If you want to transcode to something else, like iso-8859-1, you'll need to
>get a transcoder for the encoding and transcode the string.  Whether or not
>your environment supports and can display that encoding, I don't know.  See
>the Xerces documentation for more information on transcoding, or search the
>source files for examples.

>Xalan has a collection of serializers that you can use if you want to
>serialize an entire document, or sub-tree, but it's overkill for simple
>string transcoding.

>Dave



Tell me if I'm wrong:
 XalanDOMString data (m_data) is a UTF-16 string (stored as unsigned short
on most platforms...)
