On Thu, Sep 15, 2005 at 05:24:29PM -0400, Fred Fung wrote:
> The byte sequence for "Ç" that would appear in an xml or html page
> is "Ç" as I stated in my first email.

Yes, and that's why I said in my previous mail that the internal
representation does not depend on the encoding information. I *did* read
and remember your initial message when answering.

> I understand that all strings are internally encoded as UTF-8. But what
> I want to achieve is that, once I retrieve the UTF-8 encoded string into
> a C variable, how can I convert the UTF-8 encoded sequence "#C3#87" back
> to the corresponding "Ç" character so that other part of my application
> can use this character instead of the UTF-8 sequence ?

That is not the question you asked in the first mail. You can use
UTF8Toisolat1(), which is declared in <libxml/encoding.h>, or the iconv
library, which is part of the POSIX subsystem.

> As I said in my original email, I ran xmllint on the xml file and
> it was able to output "Ç" properly on my screen, NOT the UTF-8
> encoded string.

When an encoding is provided as part of the document, the serialization
routines try to convert back to that encoding. They use UTF8Toisolat1()
internally unless iconv or converters provided by the application
override the default provided by libxml2.

> So there must be something that I should be calling to do the conversion.

  http://xmlsoft.org/encoding.html#Default

The page I pointed to states:

  "libxml2 has a set of default converters for the following encodings
   (located in encoding.c):

    1. UTF-8 is supported by default (null handlers)
    2. UTF-16, both little and big endian
    3. ISO-Latin-1 (ISO-8859-1) covering most western languages
    4. ASCII, useful mostly for saving
    5. HTML, a specific handler for the conversion of UTF-8 to ASCII with
       HTML predefined entities like &copy; for the Copyright sign.

   More over when compiled on an Unix platform with iconv support the full
   set of encodings supported by iconv can be instantly be used by libxml"

I can't really list all encodings one by one and point to the associated
converter.
But I do point to the module holding them and to the iconv system routine,
which can be used too.

> Please, if you are not able to help, just say so, or just don't bother to
> reply.

I can help if I get a coherent question. To me your initial set of
questions was not coherent, so I pointed to the documentation. Your second
mail was no clearer about what you wanted to do: sometimes you took
examples from the input, sometimes from the values returned by the API,
and sometimes from the reserialized form after parsing. Reread your mails;
they mixed all 3, plus they mixed encoding, code point, and representation
issues:

  "it was able to output "Ç" properly on my screen"

really means to me that you still don't understand the problem. The output
on your screen is a *representation* of the sequence of bytes emitted. If
it had been serialized in UTF-8, so that the character was represented by
2 bytes, you would still see one character glyph on screen if your locale
was fr_FR.UTF-8 or any other locale using a UTF-8 representation.

There are 3 layers: the byte sequence, the character sequence based on
Unicode code points, and the representation as a sequence of glyphs.
Mixing the 3 is a common problem even for a "fellow competent programmer",
and Joel Spolsky is right when he says that it's a very serious problem:

  http://www.joelonsoftware.com/articles/Unicode.html

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
[EMAIL PROTECTED]    | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
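
As an illustration of the layers described above, here is a minimal
hand-written sketch in C. It goes from the UTF-8 byte sequence to a
Unicode code point, and from there to the single byte used by ISO-8859-1.
The function names decode_utf8_upto_2() and utf8_to_latin1_sketch() are
hypothetical and are not part of libxml2. This is not how libxml2
implements its converter; in real code UTF8Toisolat1() from
<libxml/encoding.h> or iconv would do this job. The sketch only handles
1-byte and 2-byte UTF-8 sequences, which is enough because every
ISO-8859-1 character has a code point at or below U+00FF.

```c
#include <assert.h>
#include <stddef.h>

/* Layer 1 -> layer 2: decode one UTF-8 sequence of at most 2 bytes into a
 * Unicode code point. Returns the number of bytes consumed, or 0 when the
 * input is malformed or needs more than 2 bytes. (Hypothetical helper.) */
size_t decode_utf8_upto_2(const unsigned char *in, size_t len,
                          unsigned int *cp)
{
    if (len == 0)
        return 0;
    if (in[0] < 0x80) {                 /* plain ASCII, 1 byte */
        *cp = in[0];
        return 1;
    }
    if ((in[0] & 0xE0) == 0xC0 && len >= 2 && (in[1] & 0xC0) == 0x80) {
        *cp = ((unsigned int)(in[0] & 0x1F) << 6) | (in[1] & 0x3F);
        if (*cp < 0x80)                 /* reject overlong encodings */
            return 0;
        return 2;
    }
    return 0;
}

/* Re-encode the code points as ISO-8859-1, one byte per character.
 * Returns the number of bytes written, or -1 if the input is malformed,
 * a character does not exist in ISO-8859-1, or the output buffer is too
 * small. (Hypothetical helper, assumed names and return convention.) */
long utf8_to_latin1_sketch(const unsigned char *in, size_t inlen,
                           unsigned char *out, size_t outlen)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    while (i < inlen) {
        unsigned int cp;
        size_t n = decode_utf8_upto_2(in + i, inlen - i, &cp);

        if (n == 0 || cp > 0xFF || o >= outlen)
            return -1;
        out[o++] = (unsigned char)cp;   /* code point == Latin-1 byte */
        i += n;
    }
    return (long)o;
}
```

Feeding it the two bytes 0xC3 0x87 gives the code point U+00C7 and then
the single ISO-8859-1 byte 0xC7. Whether that byte shows up as "Ç" on
screen still depends on the locale and font of the terminal, which is the
third layer.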