RE: Two questions - BOM in UTF-8, and manually cleaning XML

Matthew Brown Wed, 05 Jul 2006 08:04:13 -0700

Manuel,

I believe you hit the nail on the head - the response prolog says utf-8 but 
(according to Etherpeak) the BOM is ff fe. Coincidentally, by the time the 
response XML gets logged by Axis, these initial characters are logged as ef bf 
bd ef bf bd.
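
One plausible explanation for the logged bytes, sketched below in Java: neither 
byte of a UTF-16 BOM (ff fe or fe ff) is valid UTF-8, so a UTF-8 decoder replaces 
each one with U+FFFD, and U+FFFD written back out as UTF-8 is ef bf bd. This 
assumes the logging path decodes and re-encodes as UTF-8; the class and method 
names are made up for illustration and are not part of Axis.

```java
import java.nio.charset.StandardCharsets;

final class ReplacementCharDemo {
    /** Decodes raw bytes as UTF-8, re-encodes the result, and returns it as hex. */
    static String roundTripAsUtf8Hex(byte[] raw) {
        // Malformed input is replaced with U+FFFD by the String constructor.
        String decoded = new String(raw, StandardCharsets.UTF_8);
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : reencoded) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // Assumption: the wire bytes are the UTF-16 little-endian BOM.
        byte[] utf16LeBom = {(byte) 0xFF, (byte) 0xFE};
        System.out.println(roundTripAsUtf8Hex(utf16LeBom)); // prints "ef bf bd ef bf bd"
    }
}
```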

Unfortunately, getting the producer of the XML to change it may be difficult; 
the customer whose web services we are consuming doesn't seem to see any issue 
with this (as they are fine with their .NET tools).

If we are in fact seeing a UTF-16 BOM together with a prolog that declares 
UTF-8, is there any way to instruct Axis/Xerces to parse it as UTF-16? Sorry if 
this question doesn't make much sense, but I'm not too familiar with how Axis 
and/or Xerces decide which character encoding to use when reading the XML.
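
For reference, a minimal sketch of one way to force the decoding at the plain 
JAXP/SAX level: when an InputSource carries a character stream (a Reader), the 
parser reads the characters directly and disregards the encoding declared in the 
prolog. The class and helper names are hypothetical, and this does not show how 
(or whether) such a step can be hooked into Axis's own response handling.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ForcedEncodingParse {
    /** Decodes the bytes as UTF-16 ourselves, then hands characters to the parser. */
    static Document parseAsUtf16(InputStream rawBytes) throws Exception {
        // Java's "UTF-16" decoder uses the BOM to pick the byte order and consumes it.
        Reader chars = new InputStreamReader(rawBytes, StandardCharsets.UTF_16);
        InputSource source = new InputSource(chars);
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
    }

    /** Hypothetical helper: builds UTF-16LE bytes with a leading ff fe BOM. */
    static byte[] utf16LeWithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] wire = new byte[body.length + 2];
        wire[0] = (byte) 0xFF;
        wire[1] = (byte) 0xFE;
        System.arraycopy(body, 0, wire, 2, body.length);
        return wire;
    }

    public static void main(String[] args) throws Exception {
        // Simulated response: UTF-16 bytes whose prolog wrongly claims utf-8.
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><reply>ok</reply>";
        Document doc = parseAsUtf16(new ByteArrayInputStream(utf16LeWithBom(xml)));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```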

Thanks again
Matt

-----Original Message-----
From: Manuel Mall [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 05, 2006 10:58 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML

On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> Yes, there is a work-around. It works if you encode the file with
> UTF-8 (for example), and do not include the BOM at the beginning. I
> use notepad++ for that task, where you can save in "UTF-8 without
> BOM".
>
> The process for that is easy:
> 1. open the file in notepad++
> 2. mark everything via CTRL-A
> 3. cut (not copy!)
> 4. in the format menu, choose "ANSI" formatting and select "UTF
> without BOM" at the bottom
> 5. paste
> 6. save.
>
> that is a crap workaround, but works for me. for automatically
> generated files ..... I dunno :-)
>
>
> Greetings,
> Axel.
>
>
> On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
>
> Hi all,
>
> I hate to do this, but can anyone please help me with either of these
> issues? I've tried to upgrade Xerces to 2.8.0 but to no avail.
>
> Is there anything else I could be doing?

Just wondering if your file in question starts with hex 'ef bb bf' 
or 'ff fe' or 'fe ff'. If it is one of the latter two forms I believe 
you have a UTF-16 encoded file (little endian or big endian), not 
UTF-8. If it is the 'ef bb bf' sequence then it starts correctly with 
the UTF-8 encoded Unicode code point for the BOM, U+FEFF. In all cases 
Xerces should be able to handle it. A problem may arise if it starts 
with 'ff fe' but the XML prolog says encoding="utf-8", as that is a 
contradiction I believe.
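
A minimal Java sketch of that check, for anyone who wants to test the first 
bytes of a captured response. The standard marks are ef bb bf for UTF-8, ff fe 
for UTF-16 little endian and fe ff for UTF-16 big endian. The class and method 
names are made up for illustration, and UTF-32 marks are deliberately ignored.

```java
final class BomSniffer {
    /** Returns the encoding implied by a leading BOM, or "none" if there is no BOM. */
    static String describeBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE"; // UTF-32LE (ff fe 00 00) is not distinguished here
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        return "none";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned byte values.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = {(byte) 0xFF, (byte) 0xFE, '<', 0};
        System.out.println(describeBom(head)); // prints "UTF-16LE"
    }
}
```

If this reports UTF-16LE or UTF-16BE while the prolog declares utf-8, the 
document contradicts itself and the producer is the likelier culprit.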

I know this does not help directly but may help to check if the problem 
is with the producer of the XML document or your consumer.

Manuel
>
> What about the possibility of programmatically editing/cleaning the
> response XML before it is given to the parser?
>
> Thanks
> Matt
>
> -----Original Message-----
> From: Matthew Brown [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, July 01, 2006 12:41 PM
> To: axis-user@ws.apache.org
> Subject: Two questions - BOM in UTF-8, and manually cleaning XML
>
>
> 1. From searching the mailing list archives, I see several references
> to people having problems with Byte Order Mark characters appearing
> before the prolog in their UTF-8 messages. However I can't seem to
> find much of a known resolution to these issues. Is there a
> standard/common workaround for these BOM and UTF-8 issues?
>
> 2. If there is no answer to my #1, is there any way that Axis will
> allow me to programmatically edit the response XML before it is passed
> to the parser and deserialized? I've tried adding Handlers, but I'm
> assuming that the Handler comes into the picture after the message is
> parsed, because my Handler is only ever seeing the request message,
> and not the response.
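
As a rough illustration of what "cleaning" could mean at the stream level, here 
is a Java sketch that drops a leading UTF-8 BOM (ef bb bf) and leaves everything 
else untouched. The class and method names are hypothetical, and the sketch says 
nothing about where Axis would let you wrap the response stream; it only shows 
the byte handling.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

final class BomStripper {
    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = readUpTo(pushback, head);
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            // Not a BOM: put the bytes back so the caller sees the full stream.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    /** Reads until the buffer is full or the stream ends; returns the byte count. */
    private static int readUpTo(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int r = in.read(buf, total, buf.length - total);
            if (r < 0) {
                break;
            }
            total += r;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream cleaned = skipUtf8Bom(new ByteArrayInputStream(withBom));
        System.out.println((char) cleaned.read()); // prints "<"
    }
}
```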
>
> Thanks
> Matt Brown
