RE: Two questions - BOM in UTF-8, and manually cleaning XML
Davanum,

I had tried this previously, and the only effect I noticed was that the encoding attribute of my request message's prolog changed. The response message was still being parsed as UTF-8 (which is what the headers said) although it was actually UTF-16.

Anyway, now that the service provider has changed their service to return true UTF-8 data, and Xerces still has trouble interpreting the UTF-8 BOM before the prolog, I have found a very hack-ish solution: add a handler that removes the bad characters from the currentMessage if the MessageContext is past the pivot.

This doesn't feel like a great solution to me (why isn't the XML parser prepared to handle the BOM? Is the wrong parse method being used?), but it works for us for now.

Thanks for the help, all
Matt

    package com.viecore.ipl.ws;

    import javax.xml.soap.SOAPMessage;

    import org.apache.axis.AxisFault;
    import org.apache.axis.Message;
    import org.apache.axis.MessageContext;
    import org.apache.axis.SOAPPart;
    import org.apache.axis.handlers.BasicHandler;
    import org.apache.log4j.LogManager;
    import org.apache.log4j.Logger;

    public class MyHandler extends BasicHandler {

        private static Logger log = LogManager.getLogger(MyHandler.class);

        public void invoke(MessageContext messageContext) throws AxisFault {
            try {
                if (log.isInfoEnabled())
                    log.info("invoke - start");
                log.info("invoke - past pivot [" + messageContext.getPastPivot() + "]");

                // Narrow the JAX-RPC/SAAJ types down to the Axis implementations.
                SOAPMessage rpcMsg = messageContext.getMessage();
                if (rpcMsg instanceof Message) {
                    Message axisMsg = (Message) rpcMsg;
                    if (log.isDebugEnabled())
                        log.debug("invoke - cast javax.xml.soap.SOAPMessage to org.apache.axis.Message");

                    javax.xml.soap.SOAPPart rpcPart = axisMsg.getSOAPPart();
                    if (rpcPart instanceof SOAPPart) {
                        SOAPPart axisPart = (SOAPPart) rpcPart;
                        if (log.isDebugEnabled())
                            log.debug("invoke - cast javax.xml.soap.SOAPPart to org.apache.axis.SOAPPart");

                        Object currentMessage = axisPart.getCurrentMessage();
                        if (currentMessage == null) {
                            log.debug("invoke - current message is null, cannot clean");
                        } else {
                            if (log.isDebugEnabled())
                                log.debug("invoke - current message of SOAP part has type ["
                                        + currentMessage.getClass().getName()
                                        + "] content [" + currentMessage.toString() + "]");

                            // attempt to remove bad characters from the response
                            if (messageContext.getPastPivot()) {
                                if (currentMessage instanceof String) {
                                    String strMessage = (String) currentMessage;
                                    int idx = strMessage.indexOf("

mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 3:41 PM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML

Did you see my response on setting the CHARACTER_SET_ENCODING? What is the exact stack trace you get on the client?

thanks,
dims

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> text/xml and utf-8, which I suppose explains the attempt to parse the UTF-16
> message as UTF-8. The customer has changed the format of the message to
> correctly be UTF-8 in actuality, although Xerces still isn't a fan of the
> UTF-8 BOM (ef bb bf).
>
> -----Original Message-----
> From: Simon Fell [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 2:46 PM
> To: axis-user@ws.apache.org
> Subject: RE: Two questions - BOM in UTF-8, and manually cleaning XML
>
> What does the content-type header say the charset is? That takes precedence
> over the payload (at least for SOAP 1.1)
>
> Cheers
> Simon
>
> -----Original Message-----
> From: Rodrigo Ruiz [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 8:30 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> Maybe changing the xml prolog from "utf-8" to "utf-16" will be easier.
> It seems like a demo example for a servlet filter ;-)
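For readers who land on this thread later: the cleaning step that a handler like the one above performs on the String form of the message can be very small. The sketch below is not Matt's code (his listing is cut off in the archive). It only assumes that the message is available as a String and that the unwanted character is a single leading U+FEFF, which is what a UTF-8 BOM becomes once the bytes have been decoded. The class and method names are made up for illustration and are not part of Axis.

```java
/** Hypothetical helper; the names are illustrative and not part of Axis. */
final class BomStripper {

    /** The byte order mark as it appears after the bytes were decoded to chars. */
    private static final char BOM = '\uFEFF';

    private BomStripper() {
    }

    /** Returns the text without a leading U+FEFF; any other input is returned unchanged. */
    static String stripLeadingBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        String response = BOM + "<?xml version=\"1.0\" encoding=\"utf-8\"?><a/>";
        // Prints the document starting directly at the prolog.
        System.out.println(stripLeadingBom(response));
    }
}
```

Checking only the first character keeps the helper from touching a U+FEFF that appears legitimately later in the payload.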
RE: Two questions - BOM in UTF-8, and manually cleaning XML
text/xml and utf-8, which I suppose explains the attempt to parse the UTF-16 message as UTF-8. The customer has changed the format of the message to correctly be UTF-8 in actuality, although Xerces still isn't a fan of the UTF-8 BOM (ef bb bf). -Original Message- From: Simon Fell [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 05, 2006 2:46 PM To: axis-user@ws.apache.org Subject: RE: Two questions - BOM in UTF-8, and manually cleaning XML What does the content-type header say the charset is? That takes precedence over the payload (at least for SOAP 1.1) Cheers Simon -Original Message- From: Rodrigo Ruiz [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 05, 2006 8:30 AM To: axis-user@ws.apache.org Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML Maybe changing the xml prolog from "utf-8" to "utf-16" will be easier. It seems like a demo example for a servlet filter ;-) Hope this helps, Rodrigo Manuel Mall wrote: > On Wednesday 05 July 2006 23:12, Matthew Brown wrote: >> Two bytes per char; Etherpeak is showing the second byte as 00. >> > Seems you are stuck between a "rock and a hard place" here. The byte > stream appears to be correctly utf-16 encoded but the xml prolog says > utf-8. Not sure what to recommend. Fix it at the source is obvious but > not easily done. You may be able to write a handler that re-encodes > the byte stream into utf-8 before giving it to the Axis stacks. But > how to write such an Axis handler and how to hook it correctly into > the Axis processing chain is outside my area of expertise. > > May be someone else can give advice on how to attempt such a thing. 
RE: Two questions - BOM in UTF-8, and manually cleaning XML
I've tried to add a handler to simply log the messages but it seems to (a beginner like) me that the Handler doesn't come into play until after the XML is parsed/deserialized. Just to serve as a confirmation, can anyone comment on how Xerces will determine what type of encoding the xml is in? Will it look at the prolog, the byte order mark, etc? Thanks -Original Message- From: Manuel Mall [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 05, 2006 11:24 AM To: axis-user@ws.apache.org Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML On Wednesday 05 July 2006 23:12, Matthew Brown wrote: > Two bytes per char; Etherpeak is showing the second byte as 00. > Seems you are stuck between a "rock and a hard place" here. The byte stream appears to be correctly utf-16 encoded but the xml prolog says utf-8. Not sure what to recommend. Fix it at the source is obvious but not easily done. You may be able to write a handler that re-encodes the byte stream into utf-8 before giving it to the Axis stacks. But how to write such an Axis handler and how to hook it correctly into the Axis processing chain is outside my area of expertise. May be someone else can give advice on how to attempt such a thing. Manuel > -Original Message- > From: Manuel Mall [mailto:[EMAIL PROTECTED] > Sent: Wednesday, July 05, 2006 11:09 AM > To: axis-user@ws.apache.org > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML > > On Wednesday 05 July 2006 23:04, Matthew Brown wrote: > > Manuel, > > > > I believe you hit the problem on the head - the response prolog > > says utf-8 but (according to Etherpeak) the BOM is ff/ef. > > Coincidentally, by the time the response XML gets logged by axis, > > these initial characters are logged as ef bf bd ef bf bd. > > Matt, > > what about the rest of the byte stream when you look at it in > Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded > (1 byte per char for all typical ascii characters)? 
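The re-encoding idea mentioned above ("a handler that re-encodes the byte stream into utf-8") can be sketched in plain Java, independent of Axis. This is only an illustration under two assumptions: the raw response bytes are available as a byte array, and they really are UTF-16 with a BOM. The class name is hypothetical, and how to hook such a step into the Axis chain before parsing is left open here, just as it is in the thread.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; not part of Axis or Xerces. */
final class Utf16ToUtf8 {

    private Utf16ToUtf8() {
    }

    /**
     * Decodes UTF-16 bytes and encodes the text again as UTF-8.
     * Java's "UTF-16" charset reads the BOM to pick the byte order
     * (big endian if there is none) and does not keep it in the result.
     */
    static byte[] reencode(byte[] utf16Bytes) {
        String text = new String(utf16Bytes, StandardCharsets.UTF_16);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // FF FE followed by "<a/>" in little-endian UTF-16.
        byte[] wire = {(byte) 0xFF, (byte) 0xFE, '<', 0, 'a', 0, '/', 0, '>', 0};
        System.out.println(new String(reencode(wire), StandardCharsets.UTF_8));
    }
}
```

After such a step the bytes would agree with a prolog that declares encoding="utf-8", so the contradiction between prolog and payload disappears.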
RE: Two questions - BOM in UTF-8, and manually cleaning XML
Two bytes per char; Etherpeak is showing the second byte as 00.

-----Original Message-----
From: Manuel Mall [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 11:09 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML

On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
> Manuel,
>
> I believe you hit the problem on the head - the response prolog says
> utf-8 but (according to Etherpeak) the BOM is ff/fe. Coincidentally,
> by the time the response XML gets logged by axis, these initial
> characters are logged as ef bf bd ef bf bd.

Matt,

what about the rest of the byte stream when you look at it in Etherpeak? Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded (1 byte per char for all typical ascii characters)?

Manuel

> Unfortunately we may be in a bit of a tough place with having the
> producer of the XML change it; the customer whose web services we are
> consuming doesn't seem to see any issue with this (as they are fine
> with their .NET tools).
>
> If it is the case where we are seeing a UTF-16 BOM but a prolog that
> declares UTF-8; is there any way to instruct Axis/Xerces to parse it
> as UTF-16? Sorry if this question doesn't make much sense, but I'm
> not too familiar with how Axis and/or Xerces decide which character
> encoding to use when reading the XML.
>
> Thanks again
> Matt
>
> -----Original Message-----
> From: Manuel Mall [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 10:58 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> > Yes, there is a work-around. It works if you encode the file with
> > UTF-8 (for example), and do not include the BOM at the beginning. I
> > use notepad++ for that task, where you can save in "UTF-8 without
> > BOM".
> >
> > The process for that is easy:
> > 1. open the file in notepad++
> > 2. mark everything via CTRL-A
> > 3. cut (not copy!)
> > 4. in the format menu, choose "ANSI" formatting and select "UTF
> > without BOM" at the bottom
> > 5. paste
> > 6. save.
> >
> > that is a crap workaround, but works for me. for automatically
> > generated files ..... I dunno :-)
> >
> > Greetings,
> > Axel.
> >
> > On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> >
> > Hi all,
> >
> > I hate to do this, but can anyone please help me with either of
> > these issues? I've tried to upgrade Xerces to 2.8.0 but to no
> > avail.
> >
> > Is there anything else I could be doing?
>
> Just wondering if your file in question starts with hex 'ef bb bf'
> or 'ff fe' or 'fe ff'. If it is one of the latter two forms I believe
> you have an utf-16 encoded file (little endian or big endian) not
> utf-8. If it is the 'ef bb bf' sequence then it starts correctly with
> the utf-8 encoded unicode code point for BOM U+FEFF. In all cases
> xerces should be able to handle it. A problem may arise if it starts
> with 'ff fe' but the XML prolog says encoding="utf-8" as that is a
> contradiction I believe.
>
> I know this does not help directly but may help to check if the
> problem is with the producer of the XML document or your consumer.
>
> Manuel
>
> > What about the possibility of programmatically editing/cleaning the
> > response XML before it is given to the parser?
> >
> > Thanks
> > Matt
> >
> > -----Original Message-----
> > From: Matthew Brown [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, July 01, 2006 12:41 PM
> > To: axis-user@ws.apache.org
> > Subject: Two questions - BOM in UTF-8, and manually cleaning XML
> >
> > 1. From searching the mailing list archives, I see several
> > references to people having problems with Byte Order Mark
> > characters appearing before the prolog in their UTF-8 messages.
> > However I can't seem to find much of a known resolution to these
> > issues. Is there a standard/common workaround for these BOM and
> > UTF-8 issues?
> >
> > 2. If there is no answer to my #1, is there any way that Axis will
> > allow me to programmatically edit the response XML before it is passed
> > to the parser and de-serialized? I've tried adding Handlers, but
> > I'm assuming that the Handler comes into the picture after the
> > message is parsed, because my Handler is only ever seeing the
> > request message, and not the response.
> >
> > Thanks
> > Matt Brown
>
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
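The check on the first bytes that Manuel suggests is easy to automate. The byte order marks that matter here are EF BB BF (UTF-8), FF FE (UTF-16 little endian) and FE FF (UTF-16 big endian). The following is a rough sketch with hypothetical names; it is not how Xerces implements its detection, and it deliberately ignores UTF-32 and documents without a BOM.

```java
/** Hypothetical helper that maps a leading byte order mark to a charset name. */
final class BomSniffer {

    private BomSniffer() {
    }

    /** Returns the charset implied by the BOM, or null if the data has no known BOM. */
    static String charsetFromBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        return null;
    }

    /** Compares the leading bytes of data with the given unsigned values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] wire = {(byte) 0xFF, (byte) 0xFE, '<', 0, '?', 0};
        System.out.println(charsetFromBom(wire));
    }
}
```

Comparing the result with the encoding named in the prolog (or in the Content-Type header) would flag exactly the contradiction described in this thread.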
RE: Two questions - BOM in UTF-8, and manually cleaning XML
Manuel,

I believe you hit the problem on the head - the response prolog says utf-8 but (according to Etherpeak) the BOM is ff/fe. Coincidentally, by the time the response XML gets logged by axis, these initial characters are logged as ef bf bd ef bf bd.

Unfortunately we may be in a bit of a tough place with having the producer of the XML change it; the customer whose web services we are consuming doesn't seem to see any issue with this (as they are fine with their .NET tools).

If it is the case that we are seeing a UTF-16 BOM but a prolog that declares UTF-8, is there any way to instruct Axis/Xerces to parse it as UTF-16? Sorry if this question doesn't make much sense, but I'm not too familiar with how Axis and/or Xerces decide which character encoding to use when reading the XML.

Thanks again
Matt

-----Original Message-----
From: Manuel Mall [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 10:58 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML

On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> Yes, there is a work-around. It works if you encode the file with
> UTF-8 (for example), and do not include the BOM at the beginning. I
> use notepad++ for that task, where you can save in "UTF-8 without
> BOM".
>
> The process for that is easy:
> 1. open the file in notepad++
> 2. mark everything via CTRL-A
> 3. cut (not copy!)
> 4. in the format menu, choose "ANSI" formatting and select "UTF
> without BOM" at the bottom
> 5. paste
> 6. save.
>
> that is a crap workaround, but works for me. for automatically
> generated files ..... I dunno :-)
>
> Greetings,
> Axel.
>
> On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
>
> Hi all,
>
> I hate to do this, but can anyone please help me with either of these
> issues? I've tried to upgrade Xerces to 2.8.0 but to no avail.
>
> Is there anything else I could be doing?

Just wondering if your file in question starts with hex 'ef bb bf' or 'ff fe' or 'fe ff'. If it is one of the latter two forms I believe you have an utf-16 encoded file (little endian or big endian) not utf-8. If it is the 'ef bb bf' sequence then it starts correctly with the utf-8 encoded unicode code point for BOM U+FEFF. In all cases xerces should be able to handle it. A problem may arise if it starts with 'ff fe' but the XML prolog says encoding="utf-8" as that is a contradiction I believe.

I know this does not help directly but may help to check if the problem is with the producer of the XML document or your consumer.

Manuel

> What about the possibility of programmatically editing/cleaning the
> response XML before it is given to the parser?
>
> Thanks
> Matt
>
> -----Original Message-----
> From: Matthew Brown [mailto:[EMAIL PROTECTED]
> Sent: Saturday, July 01, 2006 12:41 PM
> To: axis-user@ws.apache.org
> Subject: Two questions - BOM in UTF-8, and manually cleaning XML
>
> 1. From searching the mailing list archives, I see several references
> to people having problems with Byte Order Mark characters appearing
> before the prolog in their UTF-8 messages. However I can't seem to
> find much of a known resolution to these issues. Is there a
> standard/common workaround for these BOM and UTF-8 issues?
>
> 2. If there is no answer to my #1, is there any way that Axis will
> allow me to programmatically edit the response XML before it is passed
> to the parser and de-serialized? I've tried adding Handlers, but I'm
> assuming that the Handler comes into the picture after the message is
> parsed, because my Handler is only ever seeing the request message,
> and not the response.
>
> Thanks
> Matt Brown

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
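The logged bytes ef bf bd ef bf bd are probably not a coincidence. EF BF BD is the UTF-8 encoding of U+FFFD, the replacement character a decoder substitutes for bytes it cannot interpret, and neither FF nor FE can occur in valid UTF-8. The small demo below (hypothetical class name, plain Java, no Axis involved) decodes a UTF-16 BOM as if it were UTF-8 and then writes the result back out as UTF-8, which is roughly what happens when a mis-decoded message is logged.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class; it only illustrates the effect discussed above. */
final class ReplacementCharDemo {

    private ReplacementCharDemo() {
    }

    /** Formats bytes as lower-case hex pairs separated by single spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes raw bytes as UTF-8 (invalid bytes become U+FFFD) and re-encodes them. */
    static String decodeAsUtf8AndReencode(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return hex(decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] utf16LeBom = {(byte) 0xFF, (byte) 0xFE};
        System.out.println(decodeAsUtf8AndReencode(utf16LeBom)); // ef bf bd ef bf bd
    }
}
```

Seeing this pattern in a log is therefore a hint that the original bytes were lost during decoding, so a handler that only sees the decoded String cannot recover them.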
RE: Two questions - BOM in UTF-8, and manually cleaning XML
Hi Davanum,

Sorry if I didn't give all of the details before - we are using Axis as a client and communicating with an ASP.NET (v1.1) server. Just for testing, we built a client in .NET off of the same WSDL, and although the response XML/data from the service looks the same, .NET was somehow able to parse it fine.

So at this point I'm not sure whether this is a problem I should be tackling in Axis or somehow through the XML parser. In my searches I've found some previous discussion of this problem on the list, but no known solution posted.

Thanks
Matt

-----Original Message-----
From: Davanum Srinivas [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 10:44 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML

Matthew,

Is this from a non-axis web service? and you are having problems with an axis client?

-- dims

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> Axel,
>
> The problem I am having is with the SOAP response from the web service; so
> I'm not really sure how we'd be saving that to a file... this isn't a static
> piece of text.
>
> -----Original Message-----
> From: Axel Bock [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 10:17 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> Yes, there is a work-around. It works if you encode the file with UTF-8 (for
> example), and do not include the BOM at the beginning. I use notepad++ for
> that task, where you can save in "UTF-8 without BOM".
>
> The process for that is easy:
> 1. open the file in notepad++
> 2. mark everything via CTRL-A
> 3. cut (not copy!)
> 4. in the format menu, choose "ANSI" formatting and select "UTF without BOM"
> at the bottom
> 5. paste
> 6. save.
>
> that is a crap workaround, but works for me. for automatically generated
> files ..... I dunno :-)
>
> Greetings,
> Axel.
> > > On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote: > > > > > > > > Hi all, > > > > I hate to do this, but can anyone please help me with either of these > issues? I've tried to upgrade Xerces to 2.8.0 but to no avail. > > > > Is there anything else I could be doing? > > > > What about the possibility of programmatically editing/cleaning the > response XML before it is given to the parser? > > > > Thanks > > Matt > > > > -Original Message- > > From: Matthew Brown [mailto:[EMAIL PROTECTED] > > Sent: Saturday, July 01, 2006 12:41 PM > > To: axis-user@ws.apache.org > > Subject: Two questions - BOM in UTF-8, and manually cleaning XML > > > > > > 1. From searching the mailing list archives, I see several references to > people having problems with Byte Order Mark characters appearing before the > prolog in their UTF-8 messages. However I can't seem to find much of a known > resolution to these issues. Is there a standard/common workaround for these > BOM and UTF-8 issues? > > > > 2. If there is no answer to my #1, is there anyway that Axis will allow me > to pragmatically edit the response XML before it is passed to the parser and > de-serialized? I've tried adding Handlers, but I'm assuming that the Handler > comes into the picture after the message is parsed, because my Handler is > only ever seeing the request message, and not the response. > > > > Thanks > > Matt Brown > > -- Davanum Srinivas : http://people.apache.org/~dims/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Two questions - BOM in UTF-8, and manually cleaning XML
Axel,

The problem I am having is with the SOAP response from the web service, so I'm not really sure how we'd be saving that to a file... this isn't a static piece of text.

-----Original Message-----
From: Axel Bock [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 10:17 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML

Yes, there is a work-around. It works if you encode the file with UTF-8 (for example), and do not include the BOM at the beginning. I use notepad++ for that task, where you can save in "UTF-8 without BOM".

The process for that is easy:
1. open the file in notepad++
2. mark everything via CTRL-A
3. cut (not copy!)
4. in the format menu, choose "ANSI" formatting and select "UTF without BOM" at the bottom
5. paste
6. save.

that is a crap workaround, but works for me. for automatically generated files ..... I dunno :-)

Greetings,
Axel.

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:

Hi all,

I hate to do this, but can anyone please help me with either of these issues? I've tried to upgrade Xerces to 2.8.0 but to no avail.

Is there anything else I could be doing?

What about the possibility of programmatically editing/cleaning the response XML before it is given to the parser?

Thanks
Matt

-----Original Message-----
From: Matthew Brown [mailto:[EMAIL PROTECTED]
Sent: Saturday, July 01, 2006 12:41 PM
To: axis-user@ws.apache.org
Subject: Two questions - BOM in UTF-8, and manually cleaning XML

1. From searching the mailing list archives, I see several references to people having problems with Byte Order Mark characters appearing before the prolog in their UTF-8 messages. However I can't seem to find much of a known resolution to these issues. Is there a standard/common workaround for these BOM and UTF-8 issues?

2. If there is no answer to my #1, is there any way that Axis will allow me to programmatically edit the response XML before it is passed to the parser and de-serialized? I've tried adding Handlers, but I'm assuming that the Handler comes into the picture after the message is parsed, because my Handler is only ever seeing the request message, and not the response.

Thanks
Matt Brown
RE: Two questions - BOM in UTF-8, and manually cleaning XML
Hi all,

I hate to do this, but can anyone please help me with either of these issues? I've tried to upgrade Xerces to 2.8.0 but to no avail.

Is there anything else I could be doing?

What about the possibility of programmatically editing/cleaning the response XML before it is given to the parser?

Thanks
Matt

-----Original Message-----
From: Matthew Brown [mailto:[EMAIL PROTECTED]
Sent: Saturday, July 01, 2006 12:41 PM
To: axis-user@ws.apache.org
Subject: Two questions - BOM in UTF-8, and manually cleaning XML

1. From searching the mailing list archives, I see several references to people having problems with Byte Order Mark characters appearing before the prolog in their UTF-8 messages. However I can't seem to find much of a known resolution to these issues. Is there a standard/common workaround for these BOM and UTF-8 issues?

2. If there is no answer to my #1, is there any way that Axis will allow me to programmatically edit the response XML before it is passed to the parser and de-serialized? I've tried adding Handlers, but I'm assuming that the Handler comes into the picture after the message is parsed, because my Handler is only ever seeing the request message, and not the response.

Thanks
Matt Brown
RE: Content is not allowed in prolog
org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source) at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source) at org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227) at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:696) ... 40 more -Original Message- From: Dies Koper [mailto:[EMAIL PROTECTED] Sent: Sunday, July 02, 2006 8:45 PM To: axis-user@ws.apache.org Cc: [EMAIL PROTECTED] Subject: Re: Content is not allowed in prolog Hello Derek, I used Xerces-J 2.7.1 and had no problems with a Unicode Byte Order Mark (BOM) in my UTF-8 and UTF-16 messages using Axis 1.3. Can you try reproducing the error message with this parser? Regards, Dies Matthew Brown wrote: > Thanks Derek. I've etherpeak to capture the raw packets coming across > and using it's hex editor, have found that they appear to be hex FF > FE. > > I understand from searching and from old posts on this list that > Xerces will have trouble that starts with this byte-order-mark. Is > this still the case? If so, can anyone provide the known workaround > for this? > > Thanks again Matt > -----Original Message- From: Matthew Brown > [mailto:[EMAIL PROTECTED] Sent: Friday, June 30, 2006 7:16 > AM To: axis-user@ws.apache.org Subject: RE: Content is not allowed in > prolog > > > Some followup information.. > > I've tested using .NET and their wsdl.exe tool to create a client to > use the customer's web service. The response still looks the same, > but .NET has zero issues parsing. Could this just be an XML parser > issue? Can someone point me in the direction of how to > change/configure the parser, or find out if parsing a message such as > the one below (with all those extra spaces) is possible? 
> -Original Message- From: Matthew Brown > [mailto:[EMAIL PROTECTED] Sent: Friday, June 30, 2006 9:23 > AM To: axis-user@ws.apache.org Subject: RE: Content is not allowed in > prolog > > > I happen to be having a similar error, although it isn't an endpoint > issue. > > The response we are getting back from the server looks like this: > > ??< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n g = " u t f - > 8 " ? > < s o a p : E n v e l o p e x m l n s : s o a p = " h t t p > : / / s c h e m a s . x m l s o a p . o r g / s o a p / e n v e l o p > e / " x m l n s : x s i = " h t t p : / / w w w . w 3 . o r g / 2 0 > 0 1 / X M L S c h e m a - i n s t a n c e " x m l n s : x s d = " h > t t p : / / w w w . w 3 . o r g / 2 0 0 1 / X M L S c h e m a " > < s > o a p : H e a d e r > < R e s p o n s e H e a d e r x m l n s = " h > t t p : / / b l a h . c o m / C A S / " > < H e a d e r s > < / H e a > d e r s > < / R e s p o n s e H e a d e r > < / s o a p : H e a d e r > > < s o a p : B o d y > < G e t A c c o u n t I n f o r m a t i o n R > e s p o n s e x m l n s = " h t t p : / / b l a h . c o m / C A S / > " > < A c c o u n t I n f o r m a t i o n R e s p o n s e x m l n s > : x s d = " h t t p : / / w w w . w 3 . o r g / 2 0 0 1 / X M L S c h > e m a " x m l n s : x s i = " h t t p : / / w w w . w 3 . o r g / 2 > 0 0 1 / X M L S c h e m a - i n s t a n c e " x m l n s = " h t t p > : / / b l a h . c o m / C A S / I V R . M e s s a g e D e f i n i t i > o n s . x s d " > > > < N u m b e r O f M a t c h e s > 0 < / N u m b e r O f M a t c h e s > > > > < M o n t h l y E x t e n s i o n A m o u n t > 0 < / M o n t h l y E > x t e n s i o n A m o u n t > > > > > > with garbage characters inserted between each legit XML character > (and two before the prolog). > > Is it possible to add a handler to modify the raw response XML before > Axis passes it off to the XML parser? Does anyone know? 
Is there some > other simple setting I might be overlooking that might be causing > this? > > Thanks in advance. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Two questions - BOM in UTF-8, and manually cleaning XML
1. From searching the mailing list archives, I see several references to people having problems with Byte Order Mark characters appearing before the prolog in their UTF-8 messages. However, I can't seem to find a known resolution to these issues. Is there a standard/common workaround for these BOM and UTF-8 issues?
2. If there is no answer to #1, is there any way that Axis will allow me to programmatically edit the response XML before it is passed to the parser and deserialized? I've tried adding Handlers, but I'm assuming that the Handler comes into the picture after the message is parsed, because my Handler only ever sees the request message, and not the response.
Thanks
Matt Brown
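For the first question, one generic workaround is to drop the three BOM bytes (EF BB BF) from the raw byte stream before any XML parser sees it. The following is only a sketch of that idea in plain Java; the class and method names are made up for illustration, and where (or whether) such a stream can be hooked into Axis ahead of the parser is exactly the open question above, so that part is not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, not part of Axis or Xerces.
final class Utf8BomStripper {
    private Utf8BomStripper() {}

    /** Returns a stream positioned after a leading UTF-8 BOM (EF BB BF), if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or EOF.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        // Not a BOM: push the bytes back so the caller sees the stream unchanged.
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }
}
```

The returned stream can then be handed to whatever parser is in use, for example wrapped in an `org.xml.sax.InputSource`. This only covers the UTF-8 BOM; a UTF-16 payload needs to be decoded as UTF-16, not stripped.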
RE: Content is not allowed in prolog
Thanks Derek. I've used EtherPeek to capture the raw packets coming across and, using its hex editor, have found that the two characters before the prolog appear to be hex FF FE. I understand from searching and from old posts on this list that Xerces will have trouble with a document that starts with this byte order mark. Is this still the case? If so, can anyone provide the known workaround for this? Thanks again Matt

-Original Message-
From: Derek [mailto:[EMAIL PROTECTED]
Sent: Friday, June 30, 2006 1:43 PM
To: axis-user@ws.apache.org
Subject: RE: Content is not allowed in prolog

Just a suggestion: The message that you list below, with blanks between each character, looks to me like you might be trying to view Unicode text as if it were ASCII. Unicode uses sixteen bits to represent a character, while ASCII uses 8 (technically, 7), so each unicode character in the ASCII numeric range constitutes an all-zeroes byte plus a character byte. Perhaps the extra characters you are seeing in the message aren't really spaces, but are really null characters (0x00) and your editor or viewer translates them to spaces because it has no way to display nulls. The two question marks before the initial " Just a thought. That's the problem I've usually had when I see text files that look like this one. Derek

-----Original Message-
From: Matthew Brown [mailto:[EMAIL PROTECTED]
Sent: Friday, June 30, 2006 7:16 AM
To: axis-user@ws.apache.org
Subject: RE: Content is not allowed in prolog

Some followup information.. I've tested using .NET and their wsdl.exe tool to create a client to use the customer's web service. The response still looks the same, but .NET has zero issues parsing. Could this just be an XML parser issue? Can someone point me in the direction of how to change/configure the parser, or find out if parsing a message such as the one below (with all those extra spaces) is possible?
-Original Message-From: Matthew Brown [mailto:[EMAIL PROTECTED]Sent: Friday, June 30, 2006 9:23 AMTo: axis-user@ws.apache.orgSubject: RE: Content is not allowed in prolog I happen to be having a similar error, although it isn't an endpoint issue. The response we are getting back from the server looks like this: ??< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n g = " u t f - 8 " ? > < s o a p : E n v e l o p e x m l n s : s o a p = " h t t p : / / s c h e m a s . x m l s o a p . o r g / s o a p / e n v e l o p e / " x m l n s : x s i = " h t t p : / / w w w . w 3 . o r g / 2 0 0 1 / X M L S c h e m a - i n s t a n c e " x m l n s : x s d = " h t t p : / / w w w . w 3 . o r g / 2 0 0 1 / X M L S c h e m a " > < s o a p : H e a d e r > < R e s p o n s e H e a d e r x m l n s = " h t t p : / / b l a h . c o m / C A S / " > < H e a d e r s > < / H e a d e r s > < / R e s p o n s e H e a d e r > < / s o a p : H e a d e r > < s o a p : B o d y > < G e t A c c o u n t I n f o r m a t i o n R e s p o n s e x m l n s = " h t t p : / / b l a h . c o m / C A S / " > < A c c o u n t I n f o r m a t i o n R e s p o n s e x m l n s : x s d = " h t t p : / / w w w . w 3 . o r g / 2 0 0 1 / X M L S c h e m a " x m l n s : x s i = " h t t p : / / w w w . w 3 . o r g / 2 0 0 1 / X M L S c h e m a - i n s t a n c e " x m l n s = " h t t p : / / b l a h . c o m / C A S / I V R . M e s s a g e D e f i n i t i o n s . x s d " > < N u m b e r O f M a t c h e s > 0 < / N u m b e r O f M a t c h e s > < M o n t h l y E x t e n s i o n A m o u n t > 0 < / M o n t h l y E x t e n s i o n A m o u n t > with garbage characters inserted between each legit XML character (and two before the prolog). Is it possible to add a handler to modify the raw response XML before Axis passes it off to the XML parser? Does anyone know? Is there some other simple setting I might be overlooking that might be causing this? Thanks in advance.
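Derek's reading, combined with the FF FE bytes found in the packet capture, matches what UTF-16 little-endian text looks like on the wire: a two-byte BOM, then each ASCII-range character followed by a 0x00 byte. The small Java sketch below (class and method names are illustrative only) builds such a byte sequence and dumps it in hex so the pattern is visible.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: shows the byte layout of UTF-16LE text with a BOM.
final class Utf16Demo {
    private Utf16Demo() {}

    /** Encodes text as UTF-16LE and prepends the FF FE byte order mark. */
    static byte[] utf16LeWithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFF;
        out[1] = (byte) 0xFE;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    /** Formats bytes as space-separated two-digit uppercase hex. */
    static String hexDump(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }
}
```

`Utf16Demo.hexDump(Utf16Demo.utf16LeWithBom("<?x"))` gives `FF FE 3C 00 3F 00 78 00`. A viewer that treats these bytes as single-byte text shows two unprintable characters followed by `<`, `?`, `x` with a filler character after each, which is the pattern in the response above. Decoding the same bytes with Java's `StandardCharsets.UTF_16` honours the BOM and yields the original string.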
RE: Content is not allowed in prolog
Some followup information.. I've tested using .NET and their wsdl.exe tool to create a client to use the customer's web service. The response still looks the same, but .NET has zero issues parsing. Could this just be an XML parser issue? Can someone point me in the direction of how to change/configure the parser, or find out if parsing a message such as the one below (with all those extra spaces) is possible? -Original Message-From: Matthew Brown [mailto:[EMAIL PROTECTED]Sent: Friday, June 30, 2006 9:23 AMTo: axis-user@ws.apache.orgSubject: RE: Content is not allowed in prolog I happen to be having a similar error, although it isn't an endpoint issue. The response we are getting back from the server looks like this: ??< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n g = " u t f - 8 " ? > < s o a p : E n v e l o p e x m l n s : s o a p = " h t t p : / / s c h e m a s . x m l s o a p . o r g / s o a p / e n v e l o p e / " x m l n s : x s i = " h t t p : / / w w w . w 3 . o r g / 2 0 0 1 / X M L S c h e m a - i n s t a n c e " x m l n s : x s d = " h t t p : / / w w w . w 3 . o r g / 2 0 0 1 / X M L S c h e m a " > < s o a p : H e a d e r > < R e s p o n s e H e a d e r x m l n s = " h t t p : / / b l a h . c o m / C A S / " > < H e a d e r s > < / H e a d e r s > < / R e s p o n s e H e a d e r > < / s o a p : H e a d e r > < s o a p : B o d y > < G e t A c c o u n t I n f o r m a t i o n R e s p o n s e x m l n s = " h t t p : / / b l a h . c o m / C A S / " > < A c c o u n t I n f o r m a t i o n R e s p o n s e x m l n s : x s d = " h t t p : / / w w w . w 3 . o r g / 2 0 0 1 / X M L S c h e m a " x m l n s : x s i = " h t t p : / / w w w . w 3 . o r g / 2 0 0 1 / X M L S c h e m a - i n s t a n c e " x m l n s = " h t t p : / / b l a h . c o m / C A S / I V R . M e s s a g e D e f i n i t i o n s . 
x s d " > < N u m b e r O f M a t c h e s > 0 < / N u m b e r O f M a t c h e s > < M o n t h l y E x t e n s i o n A m o u n t > 0 < / M o n t h l y E x t e n s i o n A m o u n t > with garbage characters inserted between each legit XML character (and two before the prolog). Is it possible to add a handler to modify the raw response XML before Axis passes it off to the XML parser? Does anyone know? Is there some other simple setting I might be overlooking that might be causing this? Thanks in advance.
RE: Content is not allowed in prolog
I happen to be having a similar error, although it isn't an endpoint issue. The response we are getting back from the server looks like this: ??< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n g = " u t f - 8 " ? > < s o a p : E n v e l o p e x m l n s : s o a p = " h t t p : / / s c h e m a s . x m l s o a p . o r g / s o a p / e n v e l o p e / " x m l n s : x s i = " h t t p : / / w w w . w 3 . o r g / 2 0 0 1 / X M L S c h e m a - i n s t a n c e " x m l n s : x s d = " h t t p : / / w w w . w 3 . o r g / 2 0 0 1 / X M L S c h e m a " > < s o a p : H e a d e r > < R e s p o n s e H e a d e r x m l n s = " h t t p : / / b l a h . c o m / C A S / " > < H e a d e r s > < / H e a d e r s > < / R e s p o n s e H e a d e r > < / s o a p : H e a d e r > < s o a p : B o d y > < G e t A c c o u n t I n f o r m a t i o n R e s p o n s e x m l n s = " h t t p : / / b l a h . c o m / C A S / " > < A c c o u n t I n f o r m a t i o n R e s p o n s e x m l n s : x s d = " h t t p : / / w w w . w 3 . o r g / 2 0 0 1 / X M L S c h e m a " x m l n s : x s i = " h t t p : / / w w w . w 3 . o r g / 2 0 0 1 / X M L S c h e m a - i n s t a n c e " x m l n s = " h t t p : / / b l a h . c o m / C A S / I V R . M e s s a g e D e f i n i t i o n s . x s d " > < N u m b e r O f M a t c h e s > 0 < / N u m b e r O f M a t c h e s > < M o n t h l y E x t e n s i o n A m o u n t > 0 < / M o n t h l y E x t e n s i o n A m o u n t > with garbage characters inserted between each legit XML character (and two before the prolog). Is it possible to add a handler to modify the raw response XML before Axis passes it off to the XML parser? Does anyone know? Is there some other simple setting I might be overlooking that might be causing this? Thanks in advance. -Original Message-From: Luanne Coutinho [mailto:[EMAIL PROTECTED]Sent: Friday, June 30, 2006 1:18 AMTo: axis-user@ws.apache.orgSubject: RE: Content is not allowed in prolog Hi, Turns out that the endpoint supplied by our client was wrong! 
I wonder why Axis kept throwing this particular error… -Luanne

-Original Message-
From: Luanne Coutinho
Sent: Friday, June 30, 2006 9:41 AM
To: Luanne Coutinho
Subject:

Hello, I had this same error before. Question though, what version of Axis are you using? Also, if you are using any attachments in your program, you need to include the activation.jar. Tom

Luanne Coutinho wrote:
>> Hi,
>>
>> I used wsdl2Java to generate stubs so that I can access a web service hosted elsewhere.
>> I wrote a test program to invoke an operation, but I keep getting this error:
>>
>> AxisFault
>> faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
>> faultSubcode:
>> faultString: org.xml.sax.SAXParseException: Content is not allowed in prolog.
>> faultActor:
>> faultNode:
>> faultDetail:
>> {http://xml.apache.org/axis/}stackTrace:org.xml.sax.SAXParseException: Content is not allowed in prolog.
>> at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
>> at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>> at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>> at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>> at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>> at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
>> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>> at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
>> at javax.xml.parsers.SAXParser.parse(SAXParser.java:375)
>> at org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227)
>> at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:696)
>> at org.apache.axis.Message.getSOAPEnvelope(Message.java:435)
>> at org.apache.axis.handlers.soap.MustUnderstandChecker.invoke(MustUnderstandChecker.java:62)
>> at org.apache.axis.client.AxisClient.invoke(AxisClient.java:206)
>> at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>> at org.apache.axis.client.Call.invoke(Call.java:2767)
>> at org.apache.axis.client.Call.invoke(Call.java:2443)
>> at org.apache.axis.client.Call.invoke(Call.java:2366)
Strange format of SOAP Response causing errors
We are using stub classes generated by WSDL2Java to communicate with a customer's web service. Axis (1.3 and 1.4) seems unable to parse the SOAP response, and eyeballing the response in a tool like tcpmon, one can see junk characters inserted between every valid XML character (the typical ASCII square), and two more before the opening XML bracket. Using the default HTTP sender, Axis reports an IO exception with a message like "Invalid byte 1 of 1 byte UTF-8 sequence". Using commons-httpclient, this becomes a SAXParseException of "Content is not allowed in prolog". The SOAP response's header claims a content type of UTF-8, although the body does not appear to be UTF-8. I've been able to test communication with the same web service using a .NET-generated proxy. Watching the traffic in tcpmon, the response looks the same, but the .NET client understands it. Should we be manually setting the character set / encoding expected in the response stream somewhere? Thanks Matthew Brown
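When the declared charset and the actual bytes disagree like this, one possible approach is to trust a leading byte order mark over the header when choosing how to decode the body. The following Java sketch shows that choice on a raw byte array; `BomSniffer` and its methods are hypothetical names, the fallback charset stands in for whatever the header declared, and nothing here is an Axis setting.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: picks a charset from a leading BOM, else uses the declared one.
final class BomSniffer {
    private BomSniffer() {}

    /** Returns the charset implied by a leading BOM, or the fallback if there is none. */
    static Charset detect(byte[] data, Charset fallback) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        // Caveat: FF FE 00 00 is the UTF-32LE BOM; this sketch does not handle UTF-32.
        if (startsWith(data, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        return fallback;
    }

    /** Decodes the bytes with the detected charset and drops a leading U+FEFF. */
    static String decode(byte[] data, Charset fallback) {
        String text = new String(data, detect(data, fallback));
        // These decoders keep the BOM as a U+FEFF character, so remove it explicitly.
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        return text;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

For a body that begins with FF FE, `BomSniffer.decode(body, StandardCharsets.UTF_8)` decodes it as UTF-16LE even though UTF-8 was declared, and the resulting string starts directly at the XML prolog.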