On Wednesday 05 July 2006 23:37, Matthew Brown wrote:
> I've tried to add a handler to simply log the messages but it seems
> to (a beginner like) me that the Handler doesn't come into play until
> after the XML is parsed/deserialized.
>
> Just to serve as a confirmation, can anyone comment on how Xerces
> will determine what type of encoding the xml is in? Will it look at
> the prolog, the byte order mark, etc?
>

See Appendix F of the XML 1.0 spec
(http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing)

Manuel

> Thanks
>
>
> -----Original Message-----
> From: Manuel Mall [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 11:24 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> On Wednesday 05 July 2006 23:12, Matthew Brown wrote:
> > Two bytes per char; Etherpeak is showing the second byte as 00.
>
> Seems you are stuck between a "rock and a hard place" here. The byte
> stream appears to be correctly utf-16 encoded but the xml prolog says
> utf-8. Not sure what to recommend. Fixing it at the source is the
> obvious answer but not easily done. You may be able to write a handler
> that re-encodes the byte stream into utf-8 before giving it to the
> Axis stack. But how to write such an Axis handler and how to hook it
> correctly into the Axis processing chain is outside my area of
> expertise.
>
> Maybe someone else can give advice on how to attempt such a thing.
>
> Manuel
>
> > -----Original Message-----
> > From: Manuel Mall [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, July 05, 2006 11:09 AM
> > To: axis-user@ws.apache.org
> > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning
> > XML
> >
> > On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
> > > Manuel,
> > >
> > > I believe you hit the problem on the head - the response prolog
> > > says utf-8 but (according to Etherpeak) the BOM is ff/ef.
> > > Coincidentally, by the time the response XML gets logged by axis,
> > > these initial characters are logged as ef bf bd ef bf bd.
> >
> > Matt,
> >
> > What about the rest of the byte stream when you look at it in
> > Etherpeak? Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded
> > (1 byte per char for all typical ascii characters)?
> >
> > Manuel
> >
> > > Unfortunately we may be in a bit of a tough place with having the
> > > producer of the XML change it; the customer whose web services we
> > > are consuming doesn't seem to see any issue with this (as they
> > > are fine with their .NET tools).
> > >
> > > If it is the case where we are seeing a UTF-16 BOM but a prolog
> > > that declares UTF-8; is there any way to instruct Axis/Xerces to
> > > parse it as UTF-16? Sorry if this question doesn't make much
> > > sense, but I'm not too familiar with how Axis and/or Xerces
> > > decide which character encoding to use when reading the XML.
> > >
> > > Thanks again
> > > Matt
> > >
> > > -----Original Message-----
> > > From: Manuel Mall [mailto:[EMAIL PROTECTED]
> > > Sent: Wednesday, July 05, 2006 10:58 AM
> > > To: axis-user@ws.apache.org
> > > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning
> > > XML
> > >
> > > On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> > > > Yes, there is a work-around. It works if you encode the file
> > > > with UTF-8 (for example), and do not include the BOM at the
> > > > beginning. I use notepad++ for that task, where you can save in
> > > > "UTF-8 without BOM".
> > > >
> > > > The process for that is easy:
> > > > 1. open the file in notepad++
> > > > 2. mark everything via CTRL-A
> > > > 3. cut (not copy!)
> > > > 4. in the format menu, choose "ANSI" formatting and select "UTF
> > > >    without BOM" at the bottom
> > > > 5. paste
> > > > 6. save.
> > > >
> > > > that is a crap workaround, but works for me. for automatically
> > > > generated files ..... I dunno :-)
> > > >
> > > >
> > > > Greetings,
> > > > Axel.
> > > >
> > > >
> > > > On 7/5/06, Matthew Brown < [EMAIL PROTECTED]
> > > > <mailto:[EMAIL PROTECTED]> > wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I hate to do this, but can anyone please help me with either of
> > > > these issues? I've tried to upgrade Xerces to 2.8.0 but to no
> > > > avail.
> > > >
> > > > Is there anything else I could be doing?
> > >
> > > Just wondering if your file in question starts with hex 'ef bb
> > > bf' or 'ff fe' or 'fe ff'. If it is one of the latter two forms I
> > > believe you have a utf-16 encoded file (little endian or big
> > > endian respectively), not utf-8. If it is the 'ef bb bf' sequence
> > > then it starts correctly with the utf-8 encoded unicode code
> > > point for BOM U+FEFF. In all cases xerces should be able to
> > > handle it. A problem may arise if it starts with 'ff fe' but the
> > > XML prolog says encoding="utf-8", as that is a contradiction I
> > > believe.
> > >
> > > I know this does not help directly but may help to check if the
> > > problem is with the producer of the XML document or your
> > > consumer.
> > >
> > > Manuel
> > >
> > > > What about the possibility of programmatically editing/cleaning
> > > > the response XML before it is given to the parser?
> > > >
> > > > Thanks
> > > > Matt
> > > >
> > > > -----Original Message-----
> > > > From: Matthew Brown [mailto: [EMAIL PROTECTED]
> > > > <mailto:[EMAIL PROTECTED]> ]
> > > > Sent: Saturday, July 01, 2006 12:41 PM
> > > > To: axis-user@ws.apache.org <mailto:axis-user@ws.apache.org>
> > > > Subject: Two questions - BOM in UTF-8, and manually cleaning
> > > > XML
> > > >
> > > >
> > > > 1. From searching the mailing list archives, I see several
> > > > references to people having problems with Byte Order Mark
> > > > characters appearing before the prolog in their UTF-8 messages.
> > > > However I can't seem to find much of a known resolution to
> > > > these issues. Is there a standard/common workaround for these
> > > > BOM and UTF-8 issues?
> > > >
> > > > 2. If there is no answer to my #1, is there any way that Axis
> > > > will allow me to programmatically edit the response XML before
> > > > it is passed to the parser and de-serialized?
> > > > I've tried adding
> > > > Handlers, but I'm assuming that the Handler comes into the
> > > > picture after the message is parsed, because my Handler is only
> > > > ever seeing the request message, and not the response.
> > > >
> > > > Thanks
> > > > Matt Brown

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
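
The re-encoding idea raised in the thread (detect a UTF-16 byte order mark, then hand the parser UTF-8 bytes that match the encoding="utf-8" prolog) could be sketched roughly as below. This is only a sketch under assumptions: the class and method names (BomRecoder, charsetFromBom, toUtf8WithoutBom) are hypothetical and are not part of Axis or Xerces, it assumes the whole response is available as a byte array, and it assumes the prolog declares utf-8 as in the case discussed here. Where to hook such a step into the Axis processing chain is left open, as it is in the thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not an Axis or Xerces API.
final class BomRecoder {

    /** Returns the charset implied by a leading byte order mark, or null if there is none. */
    static Charset charsetFromBom(byte[] b) {
        if (b.length >= 3 && u(b[0]) == 0xEF && u(b[1]) == 0xBB && u(b[2]) == 0xBF) {
            return StandardCharsets.UTF_8;      // ef bb bf
        }
        if (b.length >= 2 && u(b[0]) == 0xFE && u(b[1]) == 0xFF) {
            return StandardCharsets.UTF_16BE;   // fe ff
        }
        if (b.length >= 2 && u(b[0]) == 0xFF && u(b[1]) == 0xFE) {
            return StandardCharsets.UTF_16LE;   // ff fe (UTF-32 BOMs are not handled in this sketch)
        }
        return null;
    }

    /**
     * Strips the BOM and re-encodes the payload as UTF-8.
     * Input without a BOM is returned unchanged.
     * Assumption: the XML declaration says utf-8, so it is consistent after re-encoding.
     */
    static byte[] toUtf8WithoutBom(byte[] raw) {
        Charset detected = charsetFromBom(raw);
        if (detected == null) {
            return raw;
        }
        int bomLength = detected.equals(StandardCharsets.UTF_8) ? 3 : 2;
        String text = new String(raw, bomLength, raw.length - bomLength, detected);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Converts a signed Java byte to its unsigned value (0..255).
    private static int u(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        // Simulates the problem case: UTF-16LE bytes with a BOM, but a prolog that claims utf-8.
        byte[] wire = "\uFEFF<?xml version=\"1.0\" encoding=\"utf-8\"?><a/>"
                .getBytes(StandardCharsets.UTF_16LE);
        byte[] cleaned = toUtf8WithoutBom(wire);
        System.out.println(new String(cleaned, StandardCharsets.UTF_8));
    }
}
```

The sketch decodes using the BOM rather than the declared encoding because, in the case described above, the BOM matches the actual bytes and the declaration does not. If the producer's declaration named some other encoding, the declaration would also have to be rewritten, which this sketch does not attempt.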