RE: Xalan-C 1.4 special character question

Mark Weaver 5 Sep 2003 18:26:40 -0000

>
> Hello,
>
> I forgot to remove the prefix TL. It is not the real problem however.
> Regard it as:
> <?xml version="1.0" encoding="UTF-8"?>
> <?xml-stylesheet type="text/xsl" href="Stylesheet.xsl"?>
> ...
> <Element description="1234567890!&quot;§$%&amp;/()=?´"/>
> ...
>
> What do you mean with: "the XML document snippet you provided contains an
> illegal UTF-8 byte sequence"
>
It's still much easier if you provide a test case -- this means attach, and
probably zip to be sure, your stylesheet and xml file.  The problem is that
when you cut and paste to an email, you change the character set, and we
don't have exactly what you are feeding to the parser anymore.


The basic problem is that UTF-8 is often misunderstood.  It is an 8-bit
encoding of a 21-bit character set.  Any character >= 0x80 is encoded as
more than one byte.  That covers the section sign (§, U+00A7) and the acute
accent (´, U+00B4).  Each is two bytes in the UTF-8 stream (0xC2 0xA7 and
0xC2 0xB4), the first of which is 0xC2, which a single-byte encoding shows
as Â.  So, if you view the output in an editor which assumes a single-byte
encoding, you see that "extra" character, and some "gibberish".  To make the
complication complete, there is a standard single-byte encoding (ISO-8859-1)
and a Microsoft one (windows-1252) -- these contain the characters you want,
but in a single-byte format.  If you think you are dealing with UTF-8, you
need to be sure you really are -- and the only way we can be sure is with
actual, unmodified samples.
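
To make the byte-level picture concrete, here is a minimal sketch in Python
(chosen only for illustration; nothing in this thread uses it, and the
function names are made up).  It encodes the two characters as UTF-8 and
then decodes those bytes the way an editor assuming ISO-8859-1 would.

```python
# Sketch: how the two characters look as bytes, and what an editor
# that assumes a single-byte encoding would display.
SAMPLE = "\u00a7\u00b4"  # section sign, acute accent


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


def misread_as_latin1(text):
    """Encode as UTF-8, then decode the same bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


if __name__ == "__main__":
    print(utf8_hex(SAMPLE))           # c2 a7 c2 b4
    print(misread_as_latin1(SAMPLE))  # each character gains a leading 0xC2 (Â)
```

The same two characters take one byte each in ISO-8859-1 (0xA7 and 0xB4),
which is why the byte count alone already tells the two encodings apart.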

> I noticed that with notepad everything was ok, but this was the
> only editor
> recognizing the encoding from scratch. All other editors showed
> sth from the
> kind : 1234567890!"§$%&/()=?´
>
This backs up what I said above -- if you don't view the result as UTF-8,
you will be misled.  Notepad on Windows 2000, I believe, supports UTF-8
natively.  The names of the other editors would probably be enlightening.
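
If you want to check for yourself whether a file really is UTF-8 before
sending it, something like the following Python sketch would do (an
assumption on my part that Python is at hand; the helper name is invented).
It also shows one plausible way to get an "illegal UTF-8 byte sequence":
a § saved in a single-byte encoding is the lone byte 0xA7, which is not
valid UTF-8 on its own.  Whether that is what happened to your file I can't
tell without the actual sample.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same character, saved two different ways.
as_latin1 = "\u00a7".encode("iso-8859-1")  # single byte 0xA7
as_utf8 = "\u00a7".encode("utf-8")         # two bytes 0xC2 0xA7

if __name__ == "__main__":
    print(is_valid_utf8(as_latin1))  # False
    print(is_valid_utf8(as_utf8))    # True
```

To test a real file, read it in binary mode (open(path, "rb").read()) and
pass the bytes to the function unchanged.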

Mark
