RE: Invalid byte 1 (£) of a 1-byte sequence

John Lilley Fri, 14 Aug 2009 04:42:18 -0700

I will also quite likely say something stupid, but here goes :)

I suggest that there are two possibilities:

1) Xerces on AIX is ignoring your request to use UTF-8, and is instead using 
the default ISO-8859-1
2) Xerces, or the underlying transcoder it uses, is decoding UTF-8, but is 
too lenient when it encounters the invalid byte sequence, and makes some 
ad-hoc (or buggy) attempt to convert the byte anyway.

I would suggest this experiment: feed the parser a document containing the 
valid UTF-8 sequence for the pound sign (C2 A3) and see if it is parsed 
correctly. If so, the parser really is decoding UTF-8 and the answer is most 
likely (2); if instead it comes out as two characters, the answer is most 
likely (1). Armed with that information you can seek the appropriate 
corrective action.
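
As a rough sketch of that experiment, the two test inputs could be generated and sanity-checked like this before they go anywhere near the parser. This is Python, not Xerces, and the helper names `make_doc` and `is_valid_utf8` and the element name `amount` are made up for illustration.

```python
# Build two tiny documents that both claim UTF-8 but encode the pound sign differently.
XML_DECL = b'<?xml version="1.0" encoding="UTF-8"?>'


def make_doc(pound_bytes: bytes) -> bytes:
    """Wrap the given raw bytes in a one-element document (element name is hypothetical)."""
    return XML_DECL + b"<amount>" + pound_bytes + b"100</amount>"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts every byte sequence in data."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


good = make_doc(b"\xc2\xa3")  # U+00A3 encoded as UTF-8
bad = make_doc(b"\xa3")       # U+00A3 as a single byte: a lone continuation byte in UTF-8

if __name__ == "__main__":
    print("C2 A3 document valid UTF-8:", is_valid_utf8(good))
    print("A3 document valid UTF-8:   ", is_valid_utf8(bad))
```

Writing `good` and `bad` to two files and parsing each on both platforms should show which of the two possibilities applies.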

john

-----Original Message-----
From: Giulio Troccoli [mailto:[email protected]] 
Sent: Friday, August 14, 2009 2:29 AM
To: [email protected]
Subject: RE: Invalid byte 1 (£) of a 1-byte sequence

-----Original Message-----

> From: David Bertoni [mailto:[email protected]]
> Sent: 13 August 2009 18:38
> To: [email protected]
> Subject: Re: Invalid byte 1 (£) of a 1-byte sequence
>
> Giulio Troccoli wrote:
> > Well, I configured the build as follows
> >
> > runConfigure -paix -cxlc -xxlC_r
> >
> > So it should have used the 'native' transcoder. I'm
> > afraid I don't know enough about this to be of more help.
>
> What could be happening is there's local code page
> transcoding somewhere in your processing stream, and on
> Windows, you're getting a Windows code page (probably 1252),
> not UTF-8.
>
> Note that Windows-1252 and ISO-8859-1 are not compatible, so
> don't assume you can interchange them. Rather than hack
> around your source files, you should make sure all of your
> processing is in UTF-8.
>
> You should also run the locale command on your AIX machine to
> verify what code page it's using. If it's UTF-8, that would
> be further evidence your application is doing inappropriate
> local code page transcoding.
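
To illustrate the point about Windows-1252 and ISO-8859-1, here is a small Python sketch (the helper name `decode_both` is hypothetical). The two encodings agree on A3 but differ in the 80-9F range, which is why they cannot be interchanged in general.

```python
# Compare how the two single-byte encodings decode the same raw bytes.
def decode_both(raw: bytes):
    """Return (Windows-1252 text, ISO-8859-1 text) for the same raw bytes."""
    return raw.decode("cp1252"), raw.decode("iso-8859-1")


pound = decode_both(b"\xa3")  # the pound sign in both encodings
euro = decode_both(b"\x80")   # euro sign in Windows-1252, a C1 control character in ISO-8859-1

if __name__ == "__main__":
    print([hex(ord(ch)) for ch in pound])
    print([hex(ord(ch)) for ch in euro])
```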

I'm a bit out of my depth here so forgive me if I say something really stupid 
(which is quite likely).

My XML document is not in UTF-8. The pound sign is just A3, not C2 A3.

But I'm telling my application that the document IS in UTF-8 (using the 
encoding="UTF-8" option).

Windows correctly rejects it. AIX does not.

When you say "make sure all of your processing is in UTF-8", I can't do that. 
The XML is not in UTF-8 and I cannot change that (it's created by a C 
programme and I have no idea how to change the encoding it writes).
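
If the generating programme cannot be changed, one option is to transcode its output before parsing, so that the encoding="UTF-8" declaration becomes true. A minimal Python sketch, assuming the bytes really are ISO-8859-1 (an assumption, not something confirmed here); the function name `latin1_to_utf8` is made up for illustration.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes assumed to be ISO-8859-1 as UTF-8.

    Every byte is a valid ISO-8859-1 character, so the decode step cannot fail;
    it will silently produce wrong text if the input is in some other encoding.
    """
    return data.decode("iso-8859-1").encode("utf-8")


if __name__ == "__main__":
    # The single-byte pound sign A3 becomes the two-byte sequence C2 A3.
    print(latin1_to_utf8(b"\xa3").hex())
```

The other route, which leaves the bytes alone, would be to have the document declare the encoding it is actually written in instead of UTF-8.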

I ran some locale commands on my AIX box and here's the result

ibu...@kylie% locale
LANG=en_US
LC_COLLATE="en_US"
LC_CTYPE="en_US"
LC_MONETARY="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_MESSAGES="en_US"
LC_ALL=

ibu...@kylie% locale charmap
ISO8859-1

Is this why Xerces on AIX understands that A3 is in fact the pound sign?

Giulio
