[CTJUG Tech] Re: XML Encodings

Paul Gilowey Wed, 03 May 2006 13:59:54 -0700

Hi again

I've done some reading and I've found a guy who reckons that although 1A (x001A) is a valid UTF-8 character it is not a valid XML character. What?! There are invalid XML characters??

Visit this link http://www.dpawson.co.uk/xsl/sect2/N3353.html and see section 3 - How to parse internal data which is in UTF-8 format.

mmmm... sigh... so yes, you were more or less right. The base64 option is an option. Perhaps I'll try to get the source data cleaned up. That should be exciting :).
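
If the source data is cleaned up in code, one way to do it is to check each character against the XML 1.0 "Char" production before the text goes into the document. What follows is only a sketch: the class and method names (XmlChars, isLegalXmlChar, stripIllegalXmlChars) are made up for illustration and are not part of DOM4J, and silently dropping characters changes the data, so it may not suit every case.

```java
// Illustrative helper only; these names are assumptions, not DOM4J API.
final class XmlChars {
    private XmlChars() {}

    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of s with every code point that XML 1.0 forbids removed. */
    static String stripIllegalXmlChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return out.toString();
    }
}
```

The cleaned string could then be passed to setText in place of the raw mainframe value.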

Regards
Paul

On 5/3/06, Paul Gilowey <[EMAIL PROTECTED]> wrote:

Hello Gary

Yeah, I realise that c26 isn't in the English text range, but it is a "valid" character. I'm under the impression that UTF-8 encoding leaves all "English text" characters as single byte characters and encodes all characters outside of that realm as Unicode. I believe this is to save on wasted space... because most of the time in the western world we use normal English text.
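
A small sketch of the byte lengths in question, assuming a Java 7 or later runtime for StandardCharsets (which postdates this thread). The class name Utf8Lengths is made up for illustration. It shows that every code point below 0x80, including control character 0x1A, takes one byte in UTF-8, while characters above that range take two or more bytes.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: prints how many bytes UTF-8 needs for a few sample strings.
final class Utf8Lengths {
    private Utf8Lengths() {}

    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));                          // ASCII letter
        System.out.println(utf8Length(String.valueOf((char) 0x1A)));  // control character 0x1A
        System.out.println(utf8Length("\u00E9"));                     // e with acute accent
        System.out.println(utf8Length("\u65E5"));                     // a CJK character
    }
}
```

Note that the byte-level encoding is a separate question from whether XML allows the character at all.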

The setText function correctly encodes the single byte character 26 as &#26; and here I'm assuming that it would encode a Japanese character as something like &#nnnnn;. This causes the XML to be valid plain ASCII text... encoded - sure.

If what you say is correct then that would mean that to cater for internationalisation the de facto standard for XML would be to encode all text as base64 (yet just another encoding).

So what do you reckon? Am I off base here? I know that your solution will work, but is it the correct approach to take?

Thanks for your input so far.

Regards
Paul

On 5/3/06, Gary Jacobson <[EMAIL PROTECTED]> wrote:
Hi Paul

Character 26 is not a legal text character. The setText function should really have thrown an error and not tried to encode it. If you have binary data in your Strings, you need to use some kind of binary-to-text conversion (like Base64) BEFORE you call setText.
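
A minimal sketch of the binary-to-text step, assuming java.util.Base64 is available (Java 8 or later, so newer than this thread; at the time a library such as Apache Commons Codec would have filled the same role). The class name Base64Text and its methods are made up for illustration, and the choice of UTF-8 for turning the String into bytes is also an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative helper only; not part of DOM4J.
final class Base64Text {
    private Base64Text() {}

    /** Encodes the string's UTF-8 bytes as Base64, which uses only XML-safe ASCII. */
    static String encode(String raw) {
        return Base64.getEncoder().encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    /** Reverses encode(): Base64 text back to the original string. */
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }
}
```

With this in place the call would look something like myElement.setText(Base64Text.encode(value)), and whoever reads the XML has to decode the element text again.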

Incidentally, XML encodes all non-ASCII characters as &#<integer value of character>;

Cheers
Gary

On 5/3/06, Paul <[EMAIL PROTECTED]> wrote:

Hello,

I receive some data from a mainframe application and have to place it
into an XML file. I'm using DOM4J 1.6.1 to do this. I set the encoding
to UTF-8.

I build the document by calling a method to place text into a node,
e.g. myElement.setText("some text");

I've come across a situation where a particular string contains the hex
value 1A. When I place the text into a DOM element it gets converted to
&#26;

This is all great - I assume the &#26; is the UTF-8 encoding for 1A (hex).

The problem is that if I spool the raw xml into a file on disk, and
then try to load it into a DOM I get an error "Character reference
"&#26" is an invalid XML character. Nested exception: Character
reference "&#26" is an invalid XML character."

The XML doc looks as follows in the text file:
<?xml version="1.0" encoding="UTF-8"?>
<text>&#26;</text>
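
The failure can be reproduced without DOM4J. The sketch below uses the JDK's built-in JAXP parser rather than the parser from the original setup, which is an assumption, and the class name InvalidCharRefDemo is made up for illustration. It feeds the parser a document containing the character reference &#26; and reports whether parsing succeeds.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Illustrative only: checks whether the JDK parser accepts a given XML string.
final class InvalidCharRefDemo {
    private InvalidCharRefDemo() {}

    /** Returns true if the XML parses, false if the parser rejects it. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, e.g. an illegal character reference
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String bad = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<text>&#26;</text>";
        System.out.println("parses: " + parses(bad));
    }
}
```

The reference is rejected because XML 1.0 does not allow character 26 in a document, whether it appears as a raw byte or as a numeric reference.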

Does anybody know how to deal with this scenario? It seems as though
nobody I know really knows how these encodings work.

Thanks for your time
Paul

--
Red Balloon Craft Junction
South Africa's premier online source of crafting information.
http://www.redballoon.co.za
