Re: Two questions - BOM in UTF-8, and manually cleaning XML
>                 else {
>                   String cleaned = strMessage.substring(idx);
>                   log.debug("invoke - Setting SOAPPart.currentMessage to: " + cleaned);
>                   axisPart.setCurrentMessage(cleaned, axisPart.getCurrentForm());
>                 }
>               }
>             }
>           }
>         }
>       }
>       if (log.isInfoEnabled()) log.info("invoke - complete");
>     }
>     catch (Exception ex) {
>       log.error("Caught exception in invoke()", ex);
>     }
>   }
>
> }
RE: Two questions - BOM in UTF-8, and manually cleaning XML
Davanum,

I had tried this previously, and the only effect I noticed was that the encoding attribute of my request message's prolog changed. The response message was still being parsed as UTF-8 (which is what the headers said) although it was really UTF-16.

Anyway, now that the service provider has changed their service to return true UTF-8 data, and Xerces still has trouble interpreting the UTF-8 BOM before the prolog, I have found a very hack-ish solution: add a handler that removes any leading characters from the currentMessage if the MessageContext is past the pivot.

This doesn't feel like a great solution to me (why isn't the XML parser prepared to handle the BOM? Is the wrong parse method being used?), but it works for us for right now.

Thanks for the help, all
Matt

-

package com.viecore.ipl.ws;

import javax.xml.soap.SOAPMessage;

import org.apache.axis.AxisFault;
import org.apache.axis.Message;
import org.apache.axis.MessageContext;
import org.apache.axis.SOAPPart;
import org.apache.axis.handlers.BasicHandler;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;

public class MyHandler extends BasicHandler {

  private static Logger log = LogManager.getLogger(MyHandler.class);

  public void invoke(MessageContext messageContext) throws AxisFault {
    try {
      if (log.isInfoEnabled()) log.info("invoke - start");
      log.info("invoke - past pivot [" + messageContext.getPastPivot() + "]");

      SOAPMessage rpcMsg = messageContext.getMessage();
      if (rpcMsg instanceof Message) {
        Message axisMsg = (Message) rpcMsg;
        if (log.isDebugEnabled()) log.debug("invoke - cast javax.xml.soap.SOAPMessage to org.apache.axis.Message");

        javax.xml.soap.SOAPPart rpcPart = axisMsg.getSOAPPart();
        if (rpcPart instanceof SOAPPart) {
          SOAPPart axisPart = (SOAPPart) rpcPart;
          if (log.isDebugEnabled()) log.debug("invoke - cast javax.xml.soap.SOAPPart to org.apache.axis.SOAPPart");

          Object currentMessage = axisPart.getCurrentMessage();
          if (currentMessage == null) {
            log.debug("invoke - current message is null, cannot clean");
          }
          else {
            if (log.isDebugEnabled()) log.debug("invoke - current message of SOAP part has type [" + currentMessage.getClass().getName() + "] content [" + currentMessage.toString() + "]");

            // attempt to remove bad characters from the response
            if (messageContext.getPastPivot()) {
              if (currentMessage instanceof String) {
                String strMessage = (String) currentMessage;
                int idx = strMessage.indexOf("
                [... the rest of this statement and the branch before the else are missing from the archived message ...]
                else {
                  String cleaned = strMessage.substring(idx);
                  log.debug("invoke - Setting SOAPPart.currentMessage to: " + cleaned);
                  axisPart.setCurrentMessage(cleaned, axisPart.getCurrentForm());
                }
              }
            }
          }
        }
      }
      if (log.isInfoEnabled()) log.info("invoke - complete");
    }
    catch (Exception ex) {
      log.error("Caught exception in invoke()", ex);
    }
  }

}
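The string-cleaning step of the handler can be shown on its own, outside Axis. This is only a minimal sketch: the search string Matt actually passed to indexOf is not preserved in the archive, so the sketch assumes the goal is to drop everything before the first '<'. The class and method names (XmlPrefixCleaner, stripBeforeFirstTag) are made up for illustration.

```java
/** Hypothetical helper, not part of Axis or of the original handler. */
final class XmlPrefixCleaner {
    private XmlPrefixCleaner() {}

    /**
     * Returns the text starting at the first '<'. Anything before it
     * (a decoded BOM such as U+FEFF, replacement characters, stray
     * whitespace) is dropped. Input without any '<' is returned unchanged.
     * Assumption: the payload is XML, so its first '<' starts the document.
     */
    static String stripBeforeFirstTag(String message) {
        if (message == null) {
            return null;
        }
        int idx = message.indexOf('<');
        return idx > 0 ? message.substring(idx) : message;
    }

    public static void main(String[] args) {
        // A UTF-8 BOM shows up as the single character U+FEFF once decoded.
        String withBom = "\uFEFF<?xml version=\"1.0\" encoding=\"utf-8\"?><a/>";
        System.out.println(stripBeforeFirstTag(withBom));
    }
}
```

Inside a handler like the one above, the result would be what gets passed to setCurrentMessage; whether that is the right place in the chain depends on the deployment.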
Re: Two questions - BOM in UTF-8, and manually cleaning XML
Did you see my response on setting the CHARACTER_SET_ENCODING? What is the exact stack trace you get on the client?

thanks,
dims
RE: Two questions - BOM in UTF-8, and manually cleaning XML
text/xml and utf-8, which I suppose explains the attempt to parse the UTF-16 message as UTF-8. The customer has changed the format of the message to correctly be UTF-8 in actuality, although Xerces still isn't a fan of the UTF-8 BOM (ef bb bf).

-----Original Message-----
From: Simon Fell [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 2:46 PM
To: axis-user@ws.apache.org
Subject: RE: Two questions - BOM in UTF-8, and manually cleaning XML

What does the content-type header say the charset is? That takes precedence over the payload (at least for SOAP 1.1)

Cheers
Simon
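Parsing UTF-16 data as UTF-8 also accounts for the bytes seen earlier in the thread, where the start of the response was logged as ef bf bd ef bf bd. A small sketch in plain Java (no Axis or Xerces involved; the class name MisdecodeDemo is invented for illustration): the two bytes of a UTF-16 little-endian BOM are not valid UTF-8, so a lenient decoder replaces each with U+FFFD, and writing those characters back out as UTF-8 gives ef bf bd twice.

```java
import java.nio.charset.StandardCharsets;

/** Illustration only: shows what a lenient UTF-8 decode does to a UTF-16 BOM. */
final class MisdecodeDemo {
    private MisdecodeDemo() {}

    /** Formats bytes as lowercase hex pairs separated by single spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes raw bytes as UTF-8 (malformed input is replaced) and re-encodes as UTF-8. */
    static byte[] roundTripAsUtf8(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf16LeBom = {(byte) 0xFF, (byte) 0xFE};
        // prints: ef bf bd ef bf bd
        System.out.println(hex(roundTripAsUtf8(utf16LeBom)));
    }
}
```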
RE: Two questions - BOM in UTF-8, and manually cleaning XML
What does the content-type header say the charset is? That takes precedence over the payload (at least for SOAP 1.1)

Cheers
Simon

-----Original Message-----
From: Rodrigo Ruiz [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 8:30 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML

Maybe changing the xml prolog from "utf-8" to "utf-16" will be easier. It seems like a demo example for a servlet filter ;-)

Hope this helps,
Rodrigo

Manuel Mall wrote:
> On Wednesday 05 July 2006 23:12, Matthew Brown wrote:
>> Two bytes per char; Etherpeak is showing the second byte as 00.
>
> Seems you are stuck between a "rock and a hard place" here. The byte
> stream appears to be correctly utf-16 encoded but the xml prolog says
> utf-8. Not sure what to recommend. Fix it at the source is obvious but
> not easily done. You may be able to write a handler that re-encodes
> the byte stream into utf-8 before giving it to the Axis stacks. But
> how to write such an Axis handler and how to hook it correctly into
> the Axis processing chain is outside my area of expertise.
>
> Maybe someone else can give advice on how to attempt such a thing.
>
> Manuel
>
>> -----Original Message-----
>> From: Manuel Mall [mailto:[EMAIL PROTECTED]
>> Sent: Wednesday, July 05, 2006 11:09 AM
>> To: axis-user@ws.apache.org
>> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>>
>> On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
>>> Manuel,
>>>
>>> I believe you hit the problem on the head - the response prolog says
>>> utf-8 but (according to Etherpeak) the BOM is ff/fe.
>>> Coincidentally, by the time the response XML gets logged by axis,
>>> these initial characters are logged as ef bf bd ef bf bd.
>>
>> Matt,
>>
>> what about the rest of the byte stream when you look at it in
>> Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded
>> (1 byte per char for all typical ascii characters)?
>>
>> Manuel
>>
>>> Unfortunately we may be in a bit of a tough place with having the
>>> producer of the XML change it; the customer whose web services we
>>> are consuming doesn't seem to see any issue with this (as they are
>>> fine with their .NET tools).
>>>
>>> If it is the case where we are seeing a UTF-16 BOM but a prolog that
>>> declares UTF-8; is there any way to instruct Axis/Xerces to parse it
>>> as UTF-16? Sorry if this question doesn't make much sense, but I'm
>>> not too familiar with how Axis and/or Xerces decide which character
>>> encoding to use when reading the XML.
>>>
>>> Thanks again
>>> Matt
>>>
>>> -----Original Message-----
>>> From: Manuel Mall [mailto:[EMAIL PROTECTED]
>>> Sent: Wednesday, July 05, 2006 10:58 AM
>>> To: axis-user@ws.apache.org
>>> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>>>
>>> On Wednesday 05 July 2006 22:16, Axel Bock wrote:
>>>> Yes, there is a work-around. It works if you encode the file with
>>>> UTF-8 (for example), and do not include the BOM at the beginning.
>>>> I use notepad++ for that task, where you can save in "UTF-8 without
>>>> BOM".
>>>>
>>>> The process for that is easy:
>>>> 1. open the file in notepad++
>>>> 2. mark everything via CTRL-A
>>>> 3. cut (not copy!)
>>>> 4. in the format menu, choose "ANSI" formatting and select "UTF
>>>>    without BOM" at the bottom
>>>> 5. paste
>>>> 6. save.
>>>>
>>>> that is a crap workaround, but works for me. for automatically
>>>> generated files . I dunno :-)
>>>>
>>>>
>>>> Greetings,
>>>> Axel.
>>>>
>>>>
>>>> On 7/5/06, Matthew Brown < [EMAIL PROTECTED]
>>>> <mailto:[EMAIL PROTECTED]> > wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I hate to do this, but can anyone please help me with either of
>>>> these issues? I've tried to upgrade Xerces to 2.8.0 but to no
>>>> avail.
>>>>
>>>> Is there anything else I could be doing?
>>>
>>> Just wondering if your file in question starts with hex 'ef bb bf'
>>> or 'ff fe' or 'fe ff'. If it is one of the latter two forms I
>>> believe you have an utf-16 encoded file (little endian or big
>>> endian) not utf-8. If
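The check Manuel describes, looking at the first bytes of the file, is easy to automate. A minimal sketch in plain Java, with invented names (BomSniffer, encodingFromBom): the UTF-8 BOM is ef bb bf, the UTF-16 little-endian BOM is ff fe and the big-endian one is fe ff. It deliberately ignores UTF-32, whose little-endian BOM also begins with ff fe.

```java
/** Hypothetical helper: maps a leading byte order mark to an encoding name. */
final class BomSniffer {
    private BomSniffer() {}

    /**
     * Returns the encoding implied by a leading BOM, or null when the data
     * does not start with one of the three BOMs checked here.
     */
    static String encodingFromBom(byte[] data) {
        if (data == null) {
            return null;
        }
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        return null;
    }

    /** True when data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};
        System.out.println(encodingFromBom(sample));
    }
}
```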
Re: Two questions - BOM in UTF-8, and manually cleaning XML
call.setProperty(Call.CHARACTER_SET_ENCODING, "UTF-16");

On 7/5/06, Davanum Srinivas <[EMAIL PROTECTED]> wrote:
> Matt,
>
> Please try setting the CHARACTER_SET_ENCODING in call's properties to
> utf-16 and see if that works.
>
> -- dims
Re: Two questions - BOM in UTF-8, and manually cleaning XML
Matt,

Please try setting the CHARACTER_SET_ENCODING in call's properties to utf-16 and see if that works.

-- dims
Re: Two questions - BOM in UTF-8, and manually cleaning XML
On Wednesday 05 July 2006 23:37, Matthew Brown wrote:
> I've tried to add a handler to simply log the messages but it seems
> to (a beginner like) me that the Handler doesn't come into play until
> after the XML is parsed/deserialized.
>
> Just to serve as a confirmation, can anyone comment on how Xerces
> will determine what type of encoding the xml is in? Will it look at
> the prolog, the byte order mark, etc?

See section F of the XML 1.0 spec (http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing)

Manuel
RE: Two questions - BOM in UTF-8, and manually cleaning XML
I've tried to add a handler to simply log the messages but it seems to (a beginner like) me that the Handler doesn't come into play until after the XML is parsed/deserialized. Just to serve as a confirmation, can anyone comment on how Xerces will determine what type of encoding the xml is in? Will it look at the prolog, the byte order mark, etc? Thanks -Original Message- From: Manuel Mall [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 05, 2006 11:24 AM To: axis-user@ws.apache.org Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML On Wednesday 05 July 2006 23:12, Matthew Brown wrote: > Two bytes per char; Etherpeak is showing the second byte as 00. > Seems you are stuck between a "rock and a hard place" here. The byte stream appears to be correctly utf-16 encoded but the xml prolog says utf-8. Not sure what to recommend. Fix it at the source is obvious but not easily done. You may be able to write a handler that re-encodes the byte stream into utf-8 before giving it to the Axis stacks. But how to write such an Axis handler and how to hook it correctly into the Axis processing chain is outside my area of expertise. May be someone else can give advice on how to attempt such a thing. Manuel > -Original Message- > From: Manuel Mall [mailto:[EMAIL PROTECTED] > Sent: Wednesday, July 05, 2006 11:09 AM > To: axis-user@ws.apache.org > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML > > On Wednesday 05 July 2006 23:04, Matthew Brown wrote: > > Manuel, > > > > I believe you hit the problem on the head - the response prolog > > says utf-8 but (according to Etherpeak) the BOM is ff/ef. > > Coincidentally, by the time the response XML gets logged by axis, > > these initial characters are logged as ef bf bd ef bf bd. > > Matt, > > what about the rest of the byte stream when you look at it in > Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded > (1 byte per char for all typical ascii characters)? 
> > Manuel > > > Unfortunately we may be in a bit of a tough place with having the > > producer of the XML change it; the customer whose web services we > > are consuming doesn't seem to see any issue with this (as they are > > fine with their .NET tools). > > > > If it is the case where we are seeing a UTF-16 BOM but a prolog > > that declares UTF-8; is there any way to instruct Axis/Xerces to > > parse it as UTF-16? Sorry if this question doesn't make much sense, > > but I'm not too familiar with how Axis and/or Xerces decide which > > character encoding to use when reading the XML. > > > > Thanks again > > Matt > > > > -Original Message- > > From: Manuel Mall [mailto:[EMAIL PROTECTED] > > Sent: Wednesday, July 05, 2006 10:58 AM > > To: axis-user@ws.apache.org > > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning > > XML > > > > On Wednesday 05 July 2006 22:16, Axel Bock wrote: > > > Yes, there is a work-around. It works if you encode the file with > > > UTF-8 (for example), and do not include the BOM at the beginning. > > > I use notepad++ for that task, where you can save in "UTF-8 > > > without BOM". > > > > > > The process for that is easy: > > > 1. open the file in notepad++ > > > 2. mark everything via CTRL-A > > > 3. cut (not copy!) > > > 4. in the format menu, choose "ANSI" formatting and select "UTF > > > without BOM" at the bottom > > > 5. paste > > > 6. save. > > > > > > that is a crap workaround, but works for me. for automatically > > > generated files . I dunno :-) > > > > > > > > > Greetings, > > > Axel. > > > > > > > > > On 7/5/06, Matthew Brown < [EMAIL PROTECTED] > > > <mailto:[EMAIL PROTECTED]> > wrote: > > > > > > Hi all, > > > > > > I hate to do this, but can anyone please help me with either of > > > these issues? I've tried to upgrade Xerces to 2.8.0 but to no > > > avail. > > > > > > Is there anything else I could be doing? > > > > Just wondering if your file in question starts with hex 'ef bb bf' > > or 'ff ef' or 'ef ff'. 
If it is one of the latter two forms I > > believe you have an utf-16 encoded file (little endian or big > > endian) not utf-8. If it is the 'ef bb bf' sequence then it starts > > correctly with the utf-8 encoded unicode code point for BOM U+FEFF. > > In all cases xerces should be able to handle it. A
Re: Two questions - BOM in UTF-8, and manually cleaning XML
I think you guys are being too lenient on the service provider. I would shame them into fixing the problem. :-) Clearly it's not cool to publish a non-interoperable service implementation! Why not use .NET remoting or Java RMI in that case? This assumes you can have a productive discussion with the service team. Jim Murphy Mindreef, Inc. On 7/5/06, Rodrigo Ruiz <[EMAIL PROTECTED]> wrote: Maybe changing the xml prolog from "utf-8" to "utf-16" will be easier. It seems like a demo example for a servlet filter ;-) Hope this helps, Rodrigo Manuel Mall wrote: > On Wednesday 05 July 2006 23:12, Matthew Brown wrote: >> Two bytes per char; Etherpeak is showing the second byte as 00. >> > Seems you are stuck between a "rock and a hard place" here. The byte > stream appears to be correctly utf-16 encoded but the xml prolog says > utf-8. Not sure what to recommend. Fix it at the source is obvious but > not easily done. You may be able to write a handler that re-encodes the > byte stream into utf-8 before giving it to the Axis stacks. But how to > write such an Axis handler and how to hook it correctly into the Axis > processing chain is outside my area of expertise. > > May be someone else can give advice on how to attempt such a thing. > > Manuel >> -Original Message- >> From: Manuel Mall [mailto:[EMAIL PROTECTED] >> Sent: Wednesday, July 05, 2006 11:09 AM >> To: axis-user@ws.apache.org >> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML >> >> On Wednesday 05 July 2006 23:04, Matthew Brown wrote: >>> Manuel, >>> >>> I believe you hit the problem on the head - the response prolog >>> says utf-8 but (according to Etherpeak) the BOM is ff/ef. >>> Coincidentally, by the time the response XML gets logged by axis, >>> these initial characters are logged as ef bf bd ef bf bd. >> Matt, >> >> what about the rest of the byte stream when you look at it in >> Etherpeak. 
Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded >> (1 byte per char for all typical ascii characters)? >> >> Manuel >> >>> Unfortunately we may be in a bit of a tough place with having the >>> producer of the XML change it; the customer whose web services we >>> are consuming doesn't seem to see any issue with this (as they are >>> fine with their .NET tools). >>> >>> If it is the case where we are seeing a UTF-16 BOM but a prolog >>> that declares UTF-8; is there any way to instruct Axis/Xerces to >>> parse it as UTF-16? Sorry if this question doesn't make much sense, >>> but I'm not too familiar with how Axis and/or Xerces decide which >>> character encoding to use when reading the XML. >>> >>> Thanks again >>> Matt >>> >>> -Original Message- >>> From: Manuel Mall [mailto:[EMAIL PROTECTED] >>> Sent: Wednesday, July 05, 2006 10:58 AM >>> To: axis-user@ws.apache.org >>> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning >>> XML >>> >>> On Wednesday 05 July 2006 22:16, Axel Bock wrote: >>>> Yes, there is a work-around. It works if you encode the file with >>>> UTF-8 (for example), and do not include the BOM at the beginning. >>>> I use notepad++ for that task, where you can save in "UTF-8 >>>> without BOM". >>>> >>>> The process for that is easy: >>>> 1. open the file in notepad++ >>>> 2. mark everything via CTRL-A >>>> 3. cut (not copy!) >>>> 4. in the format menu, choose "ANSI" formatting and select "UTF >>>> without BOM" at the bottom >>>> 5. paste >>>> 6. save. >>>> >>>> that is a crap workaround, but works for me. for automatically >>>> generated files . I dunno :-) >>>> >>>> >>>> Greetings, >>>> Axel. >>>> >>>> >>>> On 7/5/06, Matthew Brown < [EMAIL PROTECTED] >>>> <mailto:[EMAIL PROTECTED]> > wrote: >>>> >>>> Hi all, >>>> >>>> I hate to do this, but can anyone please help me with either of >>>> these issues? I've tried to upgrade Xerces to 2.8.0 but to no >>>> avail. >>>> >>>> Is there anything else I could be doing? 
>>> Just wondering if your file in question starts with hex 'ef bb bf' >>> or 'ff ef' or 'ef ff'. If it is one of the latter two forms I >>> believe
Re: Two questions - BOM in UTF-8, and manually cleaning XML
Maybe changing the xml prolog from "utf-8" to "utf-16" will be easier. It seems like a demo example for a servlet filter ;-) Hope this helps, Rodrigo Manuel Mall wrote: On Wednesday 05 July 2006 23:12, Matthew Brown wrote: Two bytes per char; Etherpeak is showing the second byte as 00. Seems you are stuck between a "rock and a hard place" here. The byte stream appears to be correctly utf-16 encoded but the xml prolog says utf-8. Not sure what to recommend. Fix it at the source is obvious but not easily done. You may be able to write a handler that re-encodes the byte stream into utf-8 before giving it to the Axis stacks. But how to write such an Axis handler and how to hook it correctly into the Axis processing chain is outside my area of expertise. May be someone else can give advice on how to attempt such a thing. Manuel -Original Message- From: Manuel Mall [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 05, 2006 11:09 AM To: axis-user@ws.apache.org Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML On Wednesday 05 July 2006 23:04, Matthew Brown wrote: Manuel, I believe you hit the problem on the head - the response prolog says utf-8 but (according to Etherpeak) the BOM is ff/ef. Coincidentally, by the time the response XML gets logged by axis, these initial characters are logged as ef bf bd ef bf bd. Matt, what about the rest of the byte stream when you look at it in Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded (1 byte per char for all typical ascii characters)? Manuel Unfortunately we may be in a bit of a tough place with having the producer of the XML change it; the customer whose web services we are consuming doesn't seem to see any issue with this (as they are fine with their .NET tools). If it is the case where we are seeing a UTF-16 BOM but a prolog that declares UTF-8; is there any way to instruct Axis/Xerces to parse it as UTF-16? 
Sorry if this question doesn't make much sense, but I'm not too familiar with how Axis and/or Xerces decide which character encoding to use when reading the XML. Thanks again Matt -Original Message- From: Manuel Mall [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 05, 2006 10:58 AM To: axis-user@ws.apache.org Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML On Wednesday 05 July 2006 22:16, Axel Bock wrote: Yes, there is a work-around. It works if you encode the file with UTF-8 (for example), and do not include the BOM at the beginning. I use notepad++ for that task, where you can save in "UTF-8 without BOM". The process for that is easy: 1. open the file in notepad++ 2. mark everything via CTRL-A 3. cut (not copy!) 4. in the format menu, choose "ANSI" formatting and select "UTF without BOM" at the bottom 5. paste 6. save. that is a crap workaround, but works for me. for automatically generated files . I dunno :-) Greetings, Axel. On 7/5/06, Matthew Brown < [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]> > wrote: Hi all, I hate to do this, but can anyone please help me with either of these issues? I've tried to upgrade Xerces to 2.8.0 but to no avail. Is there anything else I could be doing? Just wondering if your file in question starts with hex 'ef bb bf' or 'ff ef' or 'ef ff'. If it is one of the latter two forms I believe you have an utf-16 encoded file (little endian or big endian) not utf-8. If it is the 'ef bb bf' sequence then it starts correctly with the utf-8 encoded unicode code point for BOM U+FEFF. In all cases xerces should be able to handle it. A problem may arise if it starts with 'ff ef' but the XML prolog says encoding="utf-8" as that is a contradiction I believe. I know this does not help directly but may help to check if the problem is with the producer of the XML document or your consumer. Manuel What about the possibility of programmatically editing/cleaning the response XML before it is given to the parser? 
Thanks Matt -Original Message- From: Matthew Brown [mailto: [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]> ] Sent: Saturday, July 01, 2006 12:41 PM To: axis-user@ws.apache.org <mailto:axis-user@ws.apache.org> Subject: Two questions - BOM in UTF-8, and manually cleaning XML 1. From searching the mailing list archives, I see several references to people having problems with Byte Order Mark characters appearing before the prolog in their UTF-8 messages. However I can't seem to find much of a known resolution to these issues. Is there a standard/common workaround for these BOM and UTF-8 issues? 2. If there is no answer to my #1, is there anyway that Axis will allow me to pragmatically edit the response XML before it is passed to the parser and de-serialized? I've tried adding Handlers, but I'm
Re: Two questions - BOM in UTF-8, and manually cleaning XML
On Wednesday 05 July 2006 23:12, Matthew Brown wrote: > Two bytes per char; Etherpeak is showing the second byte as 00. > Seems you are stuck between a "rock and a hard place" here. The byte stream appears to be correctly utf-16 encoded but the xml prolog says utf-8. Not sure what to recommend. Fix it at the source is obvious but not easily done. You may be able to write a handler that re-encodes the byte stream into utf-8 before giving it to the Axis stacks. But how to write such an Axis handler and how to hook it correctly into the Axis processing chain is outside my area of expertise. May be someone else can give advice on how to attempt such a thing. Manuel > -Original Message- > From: Manuel Mall [mailto:[EMAIL PROTECTED] > Sent: Wednesday, July 05, 2006 11:09 AM > To: axis-user@ws.apache.org > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML > > On Wednesday 05 July 2006 23:04, Matthew Brown wrote: > > Manuel, > > > > I believe you hit the problem on the head - the response prolog > > says utf-8 but (according to Etherpeak) the BOM is ff/ef. > > Coincidentally, by the time the response XML gets logged by axis, > > these initial characters are logged as ef bf bd ef bf bd. > > Matt, > > what about the rest of the byte stream when you look at it in > Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded > (1 byte per char for all typical ascii characters)? > > Manuel > > > Unfortunately we may be in a bit of a tough place with having the > > producer of the XML change it; the customer whose web services we > > are consuming doesn't seem to see any issue with this (as they are > > fine with their .NET tools). > > > > If it is the case where we are seeing a UTF-16 BOM but a prolog > > that declares UTF-8; is there any way to instruct Axis/Xerces to > > parse it as UTF-16? 
Sorry if this question doesn't make much sense, > > but I'm not too familiar with how Axis and/or Xerces decide which > > character encoding to use when reading the XML. > > > > Thanks again > > Matt > > > > -Original Message- > > From: Manuel Mall [mailto:[EMAIL PROTECTED] > > Sent: Wednesday, July 05, 2006 10:58 AM > > To: axis-user@ws.apache.org > > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning > > XML > > > > On Wednesday 05 July 2006 22:16, Axel Bock wrote: > > > Yes, there is a work-around. It works if you encode the file with > > > UTF-8 (for example), and do not include the BOM at the beginning. > > > I use notepad++ for that task, where you can save in "UTF-8 > > > without BOM". > > > > > > The process for that is easy: > > > 1. open the file in notepad++ > > > 2. mark everything via CTRL-A > > > 3. cut (not copy!) > > > 4. in the format menu, choose "ANSI" formatting and select "UTF > > > without BOM" at the bottom > > > 5. paste > > > 6. save. > > > > > > that is a crap workaround, but works for me. for automatically > > > generated files . I dunno :-) > > > > > > > > > Greetings, > > > Axel. > > > > > > > > > On 7/5/06, Matthew Brown < [EMAIL PROTECTED] > > > <mailto:[EMAIL PROTECTED]> > wrote: > > > > > > Hi all, > > > > > > I hate to do this, but can anyone please help me with either of > > > these issues? I've tried to upgrade Xerces to 2.8.0 but to no > > > avail. > > > > > > Is there anything else I could be doing? > > > > Just wondering if your file in question starts with hex 'ef bb bf' > > or 'ff ef' or 'ef ff'. If it is one of the latter two forms I > > believe you have an utf-16 encoded file (little endian or big > > endian) not utf-8. If it is the 'ef bb bf' sequence then it starts > > correctly with the utf-8 encoded unicode code point for BOM U+FEFF. > > In all cases xerces should be able to handle it. 
A problem may > > arise if it starts with 'ff ef' but the XML prolog says > > encoding="utf-8" as that is a contradiction I believe. > > > > I know this does not help directly but may help to check if the > > problem is with the producer of the XML document or your consumer. > > > > Manuel > > > > > What about the possibility of programmatically editing/cleaning > > > the response XML before it is given to the parser? > > > > > > Thanks > > > Matt > > > > >
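The re-encoding step Manuel suggests can be sketched in plain Java. This is only a minimal sketch under stated assumptions: the raw response bytes are available before parsing, they really are UTF-16 with a leading BOM, and `Utf16ToUtf8` / `reencode` are hypothetical names, not part of Axis or Xerces. How to hook such a step into the Axis processing chain is not covered here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Axis API: re-encodes a UTF-16 payload as UTF-8
// so the bytes agree with a prolog that says encoding="utf-8".
class Utf16ToUtf8 {

    static byte[] reencode(byte[] raw) {
        // Java's "UTF-16" decoder uses a leading BOM (FF FE or FE FF) to pick
        // the byte order and consumes it; with no BOM it assumes big-endian.
        String text = new String(raw, StandardCharsets.UTF_16);
        // Defensive: drop a leading U+FEFF if one is still present.
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        // Write the characters back out as UTF-8, without any BOM.
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><a/>";
        byte[] body = xml.getBytes(StandardCharsets.UTF_16LE);
        byte[] raw = new byte[body.length + 2];
        raw[0] = (byte) 0xFF; // little-endian UTF-16 BOM
        raw[1] = (byte) 0xFE;
        System.arraycopy(body, 0, raw, 2, body.length);
        System.out.println(new String(reencode(raw), StandardCharsets.UTF_8));
    }
}
```

After this step the prolog's "utf-8" label is true, so the parser no longer sees a contradiction between the declared and the actual encoding.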
RE: Two questions - BOM in UTF-8, and manually cleaning XML
Two bytes per char; Etherpeak is showing the second byte as 00. -Original Message- From: Manuel Mall [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 05, 2006 11:09 AM To: axis-user@ws.apache.org Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML On Wednesday 05 July 2006 23:04, Matthew Brown wrote: > Manuel, > > I believe you hit the problem on the head - the response prolog says > utf-8 but (according to Etherpeak) the BOM is ff/ef. Coincidentally, > by the time the response XML gets logged by axis, these initial > characters are logged as ef bf bd ef bf bd. > Matt, what about the rest of the byte stream when you look at it in Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded (1 byte per char for all typical ascii characters)? Manuel > Unfortunately we may be in a bit of a tough place with having the > producer of the XML change it; the customer whose web services we are > consuming doesn't seem to see any issue with this (as they are fine > with their .NET tools). > > If it is the case where we are seeing a UTF-16 BOM but a prolog that > declares UTF-8; is there any way to instruct Axis/Xerces to parse it > as UTF-16? Sorry if this question doesn't make much sense, but I'm > not too familiar with how Axis and/or Xerces decide which character > encoding to use when reading the XML. > > Thanks again > Matt > > -Original Message- > From: Manuel Mall [mailto:[EMAIL PROTECTED] > Sent: Wednesday, July 05, 2006 10:58 AM > To: axis-user@ws.apache.org > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML > > On Wednesday 05 July 2006 22:16, Axel Bock wrote: > > Yes, there is a work-around. It works if you encode the file with > > UTF-8 (for example), and do not include the BOM at the beginning. I > > use notepad++ for that task, where you can save in "UTF-8 without > > BOM". > > > > The process for that is easy: > > 1. open the file in notepad++ > > 2. mark everything via CTRL-A > > 3. cut (not copy!) > > 4. 
in the format menu, choose "ANSI" formatting and select "UTF > > without BOM" at the bottom > > 5. paste > > 6. save. > > > > that is a crap workaround, but works for me. for automatically > > generated files . I dunno :-) > > > > > > Greetings, > > Axel. > > > > > > On 7/5/06, Matthew Brown < [EMAIL PROTECTED] > > <mailto:[EMAIL PROTECTED]> > wrote: > > > > Hi all, > > > > I hate to do this, but can anyone please help me with either of > > these issues? I've tried to upgrade Xerces to 2.8.0 but to no > > avail. > > > > Is there anything else I could be doing? > > Just wondering if your file in question starts with hex 'ef bb bf' > or 'ff ef' or 'ef ff'. If it is one of the latter two forms I believe > you have an utf-16 encoded file (little endian or big endian) not > utf-8. If it is the 'ef bb bf' sequence then it starts correctly with > the utf-8 encoded unicode code point for BOM U+FEFF. In all cases > xerces should be able to handle it. A problem may arise if it starts > with 'ff ef' but the XML prolog says encoding="utf-8" as that is a > contradiction I believe. > > I know this does not help directly but may help to check if the > problem is with the producer of the XML document or your consumer. > > Manuel > > > What about the possibility of programmatically editing/cleaning the > > response XML before it is given to the parser? > > > > Thanks > > Matt > > > > -Original Message- > > From: Matthew Brown [mailto: [EMAIL PROTECTED] > > <mailto:[EMAIL PROTECTED]> ] > > Sent: Saturday, July 01, 2006 12:41 PM > > To: axis-user@ws.apache.org <mailto:axis-user@ws.apache.org> > > Subject: Two questions - BOM in UTF-8, and manually cleaning XML > > > > > > 1. From searching the mailing list archives, I see several > > references to people having problems with Byte Order Mark > > characters appearing before the prolog in their UTF-8 messages. > > However I can't seem to find much of a known resolution to these > > issues. 
Is there a standard/common workaround for these BOM and > > UTF-8 issues? > > > > 2. If there is no answer to my #1, is there any way that Axis will > > allow me to programmatically edit the response XML before it is passed > > to the parser and de-serialized? I've tried adding Handlers, but > > I'm assuming that the Handler comes into the picture after the > > message is parsed, because my Handler is only ever seeing the > > request message, and not the response. > > > > Thanks > > Matt Brown
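The two observations in this message (a leading BOM, and every second byte being 00) are enough to classify the stream from its first few bytes. A minimal sketch, assuming the standard BOM values EF BB BF (UTF-8), FF FE (UTF-16 little-endian) and FE FF (UTF-16 big-endian); `BomSniffer` and `sniff` are hypothetical names, not Axis or Xerces API, and UTF-32 is deliberately ignored:

```java
// Hypothetical helper: guesses the encoding of an XML payload from its
// first bytes. Only UTF-8 and UTF-16 are considered in this sketch.
class BomSniffer {

    static String sniff(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";    // UTF-8 encoding of U+FEFF
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE"; // BOM, little-endian
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE"; // BOM, big-endian
        }
        // No BOM: an XML document starts with '<', so a 00 byte next to it
        // hints at two bytes per character.
        if (b.length >= 2 && b[0] == '<' && b[1] == 0) {
            return "UTF-16LE";
        }
        if (b.length >= 2 && b[0] == 0 && b[1] == '<') {
            return "UTF-16BE";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, '<', 0, '?', 0};
        System.out.println(sniff(sample));
    }
}
```

A stream that sniffs as UTF-16 while its prolog declares utf-8 is the mismatch discussed in this thread.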
Re: Two questions - BOM in UTF-8, and manually cleaning XML
On Wednesday 05 July 2006 23:04, Matthew Brown wrote: > Manuel, > > I believe you hit the problem on the head - the response prolog says > utf-8 but (according to Etherpeak) the BOM is ff/ef. Coincidentally, > by the time the response XML gets logged by axis, these initial > characters are logged as ef bf bd ef bf bd. > Matt, what about the rest of the byte stream when you look at it in Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded (1 byte per char for all typical ascii characters)? Manuel > Unfortunately we may be in a bit of a tough place with having the > producer of the XML change it; the customer whose web services we are > consuming doesn't seem to see any issue with this (as they are fine > with their .NET tools). > > If it is the case where we are seeing a UTF-16 BOM but a prolog that > declares UTF-8; is there any way to instruct Axis/Xerces to parse it > as UTF-16? Sorry if this question doesn't make much sense, but I'm > not too familiar with how Axis and/or Xerces decide which character > encoding to use when reading the XML. > > Thanks again > Matt > > -Original Message- > From: Manuel Mall [mailto:[EMAIL PROTECTED] > Sent: Wednesday, July 05, 2006 10:58 AM > To: axis-user@ws.apache.org > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML > > On Wednesday 05 July 2006 22:16, Axel Bock wrote: > > Yes, there is a work-around. It works if you encode the file with > > UTF-8 (for example), and do not include the BOM at the beginning. I > > use notepad++ for that task, where you can save in "UTF-8 without > > BOM". > > > > The process for that is easy: > > 1. open the file in notepad++ > > 2. mark everything via CTRL-A > > 3. cut (not copy!) > > 4. in the format menu, choose "ANSI" formatting and select "UTF > > without BOM" at the bottom > > 5. paste > > 6. save. > > > > that is a crap workaround, but works for me. for automatically > > generated files . I dunno :-) > > > > > > Greetings, > > Axel. 
> > > > > > On 7/5/06, Matthew Brown < [EMAIL PROTECTED] > > <mailto:[EMAIL PROTECTED]> > wrote: > > > > Hi all, > > > > I hate to do this, but can anyone please help me with either of > > these issues? I've tried to upgrade Xerces to 2.8.0 but to no > > avail. > > > > Is there anything else I could be doing? > > Just wondering if your file in question starts with hex 'ef bb bf' > or 'ff ef' or 'ef ff'. If it is one of the latter two forms I believe > you have an utf-16 encoded file (little endian or big endian) not > utf-8. If it is the 'ef bb bf' sequence then it starts correctly with > the utf-8 encoded unicode code point for BOM U+FEFF. In all cases > xerces should be able to handle it. A problem may arise if it starts > with 'ff ef' but the XML prolog says encoding="utf-8" as that is a > contradiction I believe. > > I know this does not help directly but may help to check if the > problem is with the producer of the XML document or your consumer. > > Manuel > > > What about the possibility of programmatically editing/cleaning the > > response XML before it is given to the parser? > > > > Thanks > > Matt > > > > -Original Message- > > From: Matthew Brown [mailto: [EMAIL PROTECTED] > > <mailto:[EMAIL PROTECTED]> ] > > Sent: Saturday, July 01, 2006 12:41 PM > > To: axis-user@ws.apache.org <mailto:axis-user@ws.apache.org> > > Subject: Two questions - BOM in UTF-8, and manually cleaning XML > > > > > > 1. From searching the mailing list archives, I see several > > references to people having problems with Byte Order Mark > > characters appearing before the prolog in their UTF-8 messages. > > However I can't seem to find much of a known resolution to these > > issues. Is there a standard/common workaround for these BOM and > > UTF-8 issues? > > > > 2. If there is no answer to my #1, is there anyway that Axis will > > allow me to pragmatically edit the response XML before it is passed > > to the parser and de-serialized? 
I've tried adding Handlers, but > > I'm assuming that the Handler comes into the picture after the > > message is parsed, because my Handler is only ever seeing the > > request message, and not the response. > > > > Thanks > > Matt Brown
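The "ef bf bd ef bf bd" seen in the Axis log is consistent with a UTF-16 BOM having been decoded as UTF-8: neither FF nor FE is a valid UTF-8 byte, so a lenient decoder replaces each with U+FFFD, whose UTF-8 encoding is EF BF BD. The sketch below reproduces that effect in plain Java. It assumes (the thread does not confirm this) that the bytes were decoded as UTF-8 somewhere before logging; `ReplacementCharDemo` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: shows what a UTF-16 BOM looks like after it has been
// wrongly decoded as UTF-8 and then written out again as UTF-8.
class ReplacementCharDemo {

    // Formats bytes as lower-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes as UTF-8 (malformed bytes become U+FFFD), then re-encodes.
    static String decodeThenReencode(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return hex(decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] utf16Bom = {(byte) 0xFF, (byte) 0xFE};
        // prints: ef bf bd ef bf bd
        System.out.println(decodeThenReencode(utf16Bom));
    }
}
```

Once the bytes have been turned into U+FFFD the original values are gone, so any repair has to work on the raw byte stream rather than on the logged text.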
RE: Two questions - BOM in UTF-8, and manually cleaning XML
Manuel, I believe you hit the nail on the head - the response prolog says utf-8 but (according to Etherpeak) the BOM is ff fe. Coincidentally, by the time the response XML gets logged by axis, these initial characters are logged as ef bf bd ef bf bd. Unfortunately we may be in a bit of a tough place with having the producer of the XML change it; the customer whose web services we are consuming doesn't seem to see any issue with this (as they are fine with their .NET tools). If it is the case that we are seeing a UTF-16 BOM but a prolog that declares UTF-8, is there any way to instruct Axis/Xerces to parse it as UTF-16? Sorry if this question doesn't make much sense, but I'm not too familiar with how Axis and/or Xerces decide which character encoding to use when reading the XML. Thanks again Matt -Original Message- From: Manuel Mall [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 05, 2006 10:58 AM To: axis-user@ws.apache.org Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML On Wednesday 05 July 2006 22:16, Axel Bock wrote: > Yes, there is a work-around. It works if you encode the file with > UTF-8 (for example), and do not include the BOM at the beginning. I > use notepad++ for that task, where you can save in "UTF-8 without > BOM". > > The process for that is easy: > 1. open the file in notepad++ > 2. mark everything via CTRL-A > 3. cut (not copy!) > 4. in the format menu, choose "ANSI" formatting and select "UTF > without BOM" at the bottom > 5. paste > 6. save. > > that is a crap workaround, but works for me. for automatically > generated files . I dunno :-) > > > Greetings, > Axel. > > > On 7/5/06, Matthew Brown < [EMAIL PROTECTED] > <mailto:[EMAIL PROTECTED]> > wrote: > > Hi all, > > I hate to do this, but can anyone please help me with either of these > issues? I've tried to upgrade Xerces to 2.8.0 but to no avail. > > Is there anything else I could be doing? 
Just wondering if your file in question starts with hex 'ef bb bf' or 'ff fe' or 'fe ff'. If it is one of the latter two forms I believe you have a UTF-16 encoded file (little endian or big endian respectively), not UTF-8. If it is the 'ef bb bf' sequence then it starts correctly with the UTF-8 encoded Unicode code point for BOM U+FEFF. In all cases xerces should be able to handle it. A problem may arise if it starts with 'ff fe' but the XML prolog says encoding="utf-8" as that is a contradiction I believe. I know this does not help directly but may help to check if the problem is with the producer of the XML document or your consumer. Manuel > > What about the possibility of programmatically editing/cleaning the > response XML before it is given to the parser? > > Thanks > Matt > > -----Original Message----- > From: Matthew Brown [mailto: [EMAIL PROTECTED] > <mailto:[EMAIL PROTECTED]> ] > Sent: Saturday, July 01, 2006 12:41 PM > To: axis-user@ws.apache.org <mailto:axis-user@ws.apache.org> > Subject: Two questions - BOM in UTF-8, and manually cleaning XML > > > 1. From searching the mailing list archives, I see several references > to people having problems with Byte Order Mark characters appearing > before the prolog in their UTF-8 messages. However I can't seem to > find much of a known resolution to these issues. Is there a > standard/common workaround for these BOM and UTF-8 issues? > > 2. If there is no answer to my #1, is there any way that Axis will > allow me to programmatically edit the response XML before it is passed > to the parser and de-serialized? I've tried adding Handlers, but I'm > assuming that the Handler comes into the picture after the message is > parsed, because my Handler is only ever seeing the request message, > and not the response. > > Thanks > Matt Brown
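On the question of getting the parser to read the document as UTF-16 despite the prolog: one option with the standard JAXP/SAX API is to decode the bytes yourself and hand the parser a character stream, since the SAX `InputSource` documentation says a parser given a character stream disregards any encoding declaration found in it. A minimal sketch, assuming the payload starts with a UTF-16 BOM; `ParseMislabelledXml` and `parseAsUtf16` are hypothetical names, and nothing here is Axis-specific or shows where Axis would let you intercept the response:

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: parses XML whose bytes are UTF-16 even though the
// prolog claims utf-8, by decoding first and parsing characters.
class ParseMislabelledXml {

    static Document parseAsUtf16(byte[] raw) throws Exception {
        // The "UTF-16" decoder consumes the BOM and uses it for byte order.
        String text = new String(raw, StandardCharsets.UTF_16);
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // A character stream makes the parser skip encoding detection.
        return builder.parse(new InputSource(new StringReader(text)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><r>ok</r>";
        byte[] body = xml.getBytes(StandardCharsets.UTF_16LE);
        byte[] raw = new byte[body.length + 2];
        raw[0] = (byte) 0xFF; // little-endian UTF-16 BOM
        raw[1] = (byte) 0xFE;
        System.arraycopy(body, 0, raw, 2, body.length);
        Document doc = parseAsUtf16(raw);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

The trade-off is that the caller, not the parser, now decides the encoding, so this only makes sense once the byte stream has been checked to be UTF-16.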
RE: Two questions - BOM in UTF-8, and manually cleaning XML
Hi Davanum, Sorry if I didn't give all of the details before - we are using Axis as a client and communicating with an ASP.NET (v1.1) server. Just for testing, we built a client in .NET off of the same WSDL, and although the response XML/data from the service looks the same, .NET was somehow able to parse it fine. So at this point I'm not sure if this is a problem I should be tackling in Axis or somehow through the XML parser. In my searches I've found some previous discussion of this problem on the list, but no known solution posted. Thanks Matt -Original Message- From: Davanum Srinivas [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 05, 2006 10:44 AM To: axis-user@ws.apache.org Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML Matthew, Is this from a non-axis web service? and you are having problems with an axis client? -- dims On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote: > > > Axel, > > The problem I am having is with the SOAP response from the web service; so > I'm not really sure how we'd be saving that to a file... this isn't a static > piece of text. > > -Original Message- > From: Axel Bock [mailto:[EMAIL PROTECTED] > Sent: Wednesday, July 05, 2006 10:17 AM > To: axis-user@ws.apache.org > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML > > Yes, there is a work-around. It works if you encode the file with UTF-8 (for > example), and do not include the BOM at the beginning. I use notepad++ for > that task, where you can save in "UTF-8 without BOM". > > The process for that is easy: > 1. open the file in notepad++ > 2. mark everything via CTRL-A > 3. cut (not copy!) > 4. in the format menu, choose "ANSI" formatting and select "UTF without BOM" > at the bottom > 5. paste > 6. save. > > that is a crap workaround, but works for me. for automatically generated > files . I dunno :-) > > > Greetings, > Axel. 
> > > On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote: > > > > > > > > Hi all, > > > > I hate to do this, but can anyone please help me with either of these > issues? I've tried to upgrade Xerces to 2.8.0 but to no avail. > > > > Is there anything else I could be doing? > > > > What about the possibility of programmatically editing/cleaning the > response XML before it is given to the parser? > > > > Thanks > > Matt > > > > -Original Message- > > From: Matthew Brown [mailto:[EMAIL PROTECTED] > > Sent: Saturday, July 01, 2006 12:41 PM > > To: axis-user@ws.apache.org > > Subject: Two questions - BOM in UTF-8, and manually cleaning XML > > > > > > 1. From searching the mailing list archives, I see several references to > people having problems with Byte Order Mark characters appearing before the > prolog in their UTF-8 messages. However I can't seem to find much of a known > resolution to these issues. Is there a standard/common workaround for these > BOM and UTF-8 issues? > > > > 2. If there is no answer to my #1, is there anyway that Axis will allow me > to pragmatically edit the response XML before it is passed to the parser and > de-serialized? I've tried adding Handlers, but I'm assuming that the Handler > comes into the picture after the message is parsed, because my Handler is > only ever seeing the request message, and not the response. > > > > Thanks > > Matt Brown > > -- Davanum Srinivas : http://people.apache.org/~dims/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Two questions - BOM in UTF-8, and manually cleaning XML
On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> Yes, there is a work-around. It works if you encode the file with
> UTF-8 (for example), and do not include the BOM at the beginning. I
> use notepad++ for that task, where you can save in "UTF-8 without
> BOM".
>
> The process for that is easy:
> 1. open the file in notepad++
> 2. mark everything via CTRL-A
> 3. cut (not copy!)
> 4. in the format menu, choose "ANSI" formatting and select "UTF
>    without BOM" at the bottom
> 5. paste
> 6. save.
>
> that is a crap workaround, but works for me. for automatically
> generated files ... I dunno :-)
>
> Greetings,
> Axel.
>
> On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> > Hi all,
> >
> > I hate to do this, but can anyone please help me with either of
> > these issues? I've tried to upgrade Xerces to 2.8.0 but to no avail.
> >
> > Is there anything else I could be doing?

Just wondering if your file in question starts with hex 'ef bb bf', 'ff fe' or 'fe ff'. If it is one of the latter two forms, I believe you have a UTF-16 encoded file (little endian or big endian respectively), not UTF-8. If it is the 'ef bb bf' sequence, then it starts correctly with the UTF-8 encoding of the BOM code point U+FEFF. In all of these cases Xerces should be able to handle it. A problem may arise if it starts with 'ff fe' but the XML prolog says encoding="utf-8", as that is a contradiction, I believe.

I know this does not help directly, but it may help to check whether the problem is with the producer of the XML document or with your consumer.

Manuel

> > What about the possibility of programmatically editing/cleaning the
> > response XML before it is given to the parser?
> >
> > Thanks
> > Matt
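A minimal sketch of the byte check described in this message, for anyone who wants to run it against a captured response. For reference, the UTF-16 marks are FF FE (little endian) and FE FF (big endian), and the UTF-8 mark is EF BB BF. The class and method names (`BomSniffer`, `detect`) are made up for illustration and are not part of Axis or Xerces.

```java
final class BomSniffer {

    /**
     * Returns the encoding implied by a leading byte order mark,
     * or null if the data does not start with a recognized BOM.
     * Note: a UTF-32LE BOM (FF FE 00 00) would be reported as UTF-16LE here;
     * this sketch only covers the cases discussed in the thread.
     */
    static String detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        return null;
    }

    // Compares the first bytes of data against the given unsigned values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

If `detect` reports UTF-16 while the prolog declares encoding="utf-8", that is the contradiction mentioned above, and the producer rather than the parser is the likely culprit.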
Re: Two questions - BOM in UTF-8, and manually cleaning XML
Hi,

hm. Then maybe you could write an Axis handler which actually modifies the response buffer before Xerces kicks in. I don't know how to do that, though, so you'd have to ask some other people who know better :-)

And, ah, it's AXEL ;-)

Greetings,
Axel.

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> Alex,
>
> The problem I am having is with the SOAP response from the web service; so I'm not really sure how we'd be saving that to a file... this isn't a static piece of text.
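As a rough illustration of "modifying the response buffer before the parser kicks in", here is a sketch that wraps a raw byte stream and skips a leading UTF-8 BOM using only JDK classes. It deliberately does not show how to hook this into the Axis handler chain, since nobody in the thread has confirmed where that hook would be; `BomStripper` and `skipUtf8Bom` are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

final class BomStripper {

    /**
     * Returns a stream positioned after a leading UTF-8 BOM (EF BB BF).
     * If the stream does not start with that BOM, the bytes that were
     * peeked at are pushed back and the content is left untouched.
     */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or EOF
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) break;
            n += r;
        }
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: give the bytes back
        }
        return pushback;
    }
}
```

The returned stream could then be handed to whatever parser entry point accepts an `InputStream`. A stream that is really UTF-16 would need re-encoding rather than BOM stripping, which this sketch does not attempt.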
Re: Two questions - BOM in UTF-8, and manually cleaning XML
Matthew,

Is this from a non-axis web service? and you are having problems with an axis client?

-- dims

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> Alex,
>
> The problem I am having is with the SOAP response from the web service; so I'm not really sure how we'd be saving that to a file... this isn't a static piece of text.

--
Davanum Srinivas : http://people.apache.org/~dims/
RE: Two questions - BOM in UTF-8, and manually cleaning XML
Alex,

The problem I am having is with the SOAP response from the web service; so I'm not really sure how we'd be saving that to a file... this isn't a static piece of text.

-----Original Message-----
From: Axel Bock [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 05, 2006 10:17 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML

Yes, there is a work-around. It works if you encode the file with UTF-8 (for example), and do not include the BOM at the beginning. I use notepad++ for that task, where you can save in "UTF-8 without BOM".

The process for that is easy:
1. open the file in notepad++
2. mark everything via CTRL-A
3. cut (not copy!)
4. in the format menu, choose "ANSI" formatting and select "UTF without BOM" at the bottom
5. paste
6. save.

that is a crap workaround, but works for me. for automatically generated files ... I dunno :-)

Greetings,
Axel.
Re: Two questions - BOM in UTF-8, and manually cleaning XML
Yes, there is a work-around. It works if you encode the file with UTF-8 (for example), and do not include the BOM at the beginning. I use notepad++ for that task, where you can save in "UTF-8 without BOM".

The process for that is easy:
1. open the file in notepad++
2. mark everything via CTRL-A
3. cut (not copy!)
4. in the format menu, choose "ANSI" formatting and select "UTF without BOM" at the bottom
5. paste
6. save.

that is a crap workaround, but works for me. for automatically generated files ... I dunno :-)

Greetings,
Axel.

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I hate to do this, but can anyone please help me with either of these issues? I've tried to upgrade Xerces to 2.8.0 but to no avail.
>
> Is there anything else I could be doing?
>
> What about the possibility of programmatically editing/cleaning the response XML before it is given to the parser?
>
> Thanks
> Matt
RE: Two questions - BOM in UTF-8, and manually cleaning XML
Hi all,

I hate to do this, but can anyone please help me with either of these issues? I've tried to upgrade Xerces to 2.8.0 but to no avail.

Is there anything else I could be doing?

What about the possibility of programmatically editing/cleaning the response XML before it is given to the parser?

Thanks
Matt

-----Original Message-----
From: Matthew Brown [mailto:[EMAIL PROTECTED]]
Sent: Saturday, July 01, 2006 12:41 PM
To: axis-user@ws.apache.org
Subject: Two questions - BOM in UTF-8, and manually cleaning XML

1. From searching the mailing list archives, I see several references to people having problems with Byte Order Mark characters appearing before the prolog in their UTF-8 messages. However I can't seem to find much of a known resolution to these issues. Is there a standard/common workaround for these BOM and UTF-8 issues?

2. If there is no answer to my #1, is there any way that Axis will allow me to programmatically edit the response XML before it is passed to the parser and de-serialized? I've tried adding Handlers, but I'm assuming that the Handler comes into the picture after the message is parsed, because my Handler is only ever seeing the request message, and not the response.

Thanks
Matt Brown
Two questions - BOM in UTF-8, and manually cleaning XML
1. From searching the mailing list archives, I see several references to people having problems with Byte Order Mark characters appearing before the prolog in their UTF-8 messages. However I can't seem to find much of a known resolution to these issues. Is there a standard/common workaround for these BOM and UTF-8 issues?

2. If there is no answer to my #1, is there any way that Axis will allow me to programmatically edit the response XML before it is passed to the parser and de-serialized? I've tried adding Handlers, but I'm assuming that the Handler comes into the picture after the message is parsed, because my Handler is only ever seeing the request message, and not the response.

Thanks
Matt Brown
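For question 2, the cleaning step itself can be sketched independently of Axis: once the response is available as a String, drop anything that precedes the first '<' (for example a stray U+FEFF) and hand the rest to a standard JAXP parser. This is only an illustration under that assumption; `XmlCleaner`, `dropLeadingJunk` and `parse` are hypothetical names, and where Axis exposes the raw response text is not shown here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XmlCleaner {

    /** Drops everything before the first '<', such as a leading U+FEFF. */
    static String dropLeadingJunk(String xml) {
        int idx = xml.indexOf('<');
        if (idx <= 0) {
            return xml; // already starts with '<', or contains no markup at all
        }
        return xml.substring(idx);
    }

    /** Cleans the text and parses it with the JDK's default JAXP parser. */
    static Document parse(String rawXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        String cleaned = dropLeadingJunk(rawXml);
        // Parsing from a Reader means the characters are already decoded,
        // so the encoding named in the prolog is not used for decoding.
        return builder.parse(new InputSource(new StringReader(cleaned)));
    }
}
```

This only helps when the bytes were decoded correctly in the first place. If a UTF-16 payload was decoded as UTF-8, the text is already damaged and stripping leading characters will not repair it.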