I will also quite likely say something stupid, but here goes :) I suggest that there are two possibilities:
1) Xerces on AIX is ignoring your request to use UTF-8, and is instead
   using the default 8859-1.
2) Xerces, or the underlying transcoder it uses, is translating UTF-8,
   but is too lenient when it encounters the invalid byte sequence, and
   makes some ad-hoc (or buggy) attempt to convert the code anyway.

I would suggest this experiment: feed the parser a document containing
the valid sequence (C2 A3) and see if it is parsed correctly. If so,
then the answer is most likely (2), else (1). Armed with that
information you can seek the appropriate corrective action.

john

-----Original Message-----
From: Giulio Troccoli [mailto:[email protected]]
Sent: Friday, August 14, 2009 2:29 AM
To: [email protected]
Subject: RE: Invalid byte 1 (£) of a 1-byte sequence

> Linedata Services (UK) Ltd
> Registered Office: Bishopsgate Court, 4-12 Norton Folgate, London, E1 6DB
> Registered in England and Wales No 3027851 VAT Reg No 778499447

> -----Original Message-----
> From: David Bertoni [mailto:[email protected]]
> Sent: 13 August 2009 18:38
> To: [email protected]
> Subject: Re: Invalid byte 1 (£) of a 1-byte sequence
>
> Giulio Troccoli wrote:
> > Well, I configured the build as follows
> >
> > runConfigure -paix -cxlc -xxlC_r
> >
> > So it should have used the 'native' transcoder. I'm afraid I don't
> > know enough about this to be of more help.
>
> What could be happening is there's local code page transcoding
> somewhere in your processing stream, and on Windows, you're getting
> a Windows code page (probably 1252), not UTF-8.
>
> Note that Windows-1252 and ISO-8859-1 are not compatible, so don't
> assume you can interchange them. Rather than hack around your source
> files, you should make sure all of your processing is in UTF-8.
>
> You should also run the locale command on your AIX machine to verify
> what code page it's using. If it's UTF-8, that would be further
> evidence your application is doing inappropriate local code page
> transcoding.

I'm a bit out of my depth here, so forgive me if I say something really
stupid (which is quite likely).

My XML document is not in UTF-8. The pound sign is just A3, not C2 A3.
But I'm telling my application that the document IS in UTF-8 (using the
encoding="UTF-8" option). Windows correctly rejects it. AIX does not.

When you say "make sure all of your processing is in UTF-8", I can't do
that. The XML is not in UTF-8 and I cannot change that (it's created by
a C programme and I have no idea how to do that).

I ran some locale commands on my AIX box and here's the result:

    ibu...@kylie% locale
    LANG=en_US
    LC_COLLATE="en_US"
    LC_CTYPE="en_US"
    LC_MONETARY="en_US"
    LC_NUMERIC="en_US"
    LC_TIME="en_US"
    LC_MESSAGES="en_US"
    LC_ALL=
    ibu...@kylie% locale charmap
    ISO8859-1

Is this why Xerces on AIX understands that A3 is in fact the pound sign?

Giulio
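
For anyone following along, here is a rough sketch in Python (not Xerces code) of why a strict UTF-8 decoder rejects a lone A3 byte while an ISO-8859-1 decoder accepts it, plus a way to build the two test documents for the experiment suggested above. The helper names and the <price> element are made up for illustration; only the byte values A3 and C2 A3 come from this thread.

```python
# Hypothetical helpers for illustration; only the byte values are from the thread.
POUND_LATIN1 = b"\xa3"      # pound sign as a single ISO-8859-1 byte
POUND_UTF8 = b"\xc2\xa3"    # pound sign as the two-byte UTF-8 sequence


def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


def make_test_document(pound_bytes):
    """Build a tiny XML document, declared as UTF-8, around the given bytes."""
    return (
        b'<?xml version="1.0" encoding="UTF-8"?>\n'
        b"<price>" + pound_bytes + b"100</price>\n"
    )


if __name__ == "__main__":
    # Show how each byte sequence fares under each encoding.
    for label, raw in (("A3", POUND_LATIN1), ("C2 A3", POUND_UTF8)):
        for encoding in ("utf-8", "iso-8859-1"):
            print(label, encoding, repr(try_decode(raw, encoding)))
```

A lone A3 is a UTF-8 continuation byte with no lead byte before it, so the strict decoder refuses it. An ISO-8859-1 decoder maps every byte to a character, so it never fails; it reads A3 as the pound sign and C2 A3 as two characters. Writing make_test_document(POUND_UTF8) and make_test_document(POUND_LATIN1) to files in binary mode gives one valid and one invalid input to feed to the parser on both platforms.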
