RE: Two questions - BOM in UTF-8, and manually cleaning XML

2006-07-05 Thread Matthew Brown
Davanum,

I had tried this previously and the only effect that I noticed was that the 
encoding attribute of my request message's prolog changed. The response message 
was still being parsed as UTF-8 (as the headers declared) although it was 
actually UTF-16.

Anyway, now that the service provider has changed their service to return true 
UTF-8 data, and Xerces still has trouble interpreting the UTF-8 BOM before the 
prolog, I have found a very hack-ish solution: add a handler that removes 
any bad characters from the currentMessage if the MessageContext is past the 
pivot. This doesn't feel like a great solution to me (why isn't the XML parser 
prepared to handle the BOM? Is the wrong parse method being used?), but it 
works for us for now.

Thanks for the help all
Matt
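
For readers who only need the string clean-up step, here is a minimal sketch 
of the idea, separate from Axis. It assumes that "bad characters" means 
whatever precedes the first '<' of the message (a BOM, for instance). The 
class and method names are hypothetical and are not taken from the handler in 
this message.

```java
// Hypothetical helper, not part of Axis: removes anything that precedes the XML prolog.
final class PrologCleaner {
    private PrologCleaner() {}

    /** Returns the message with any characters before the first '<' removed. */
    static String dropLeadingJunk(String message) {
        if (message == null) {
            return null;
        }
        int idx = message.indexOf('<');
        // If there is no markup at all, or the markup already starts at index 0,
        // return the input unchanged rather than guess.
        return idx > 0 ? message.substring(idx) : message;
    }
}
```

A handler could apply something like this to the response string once the 
MessageContext is past the pivot, as described above.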

-

package com.viecore.ipl.ws;

import javax.xml.soap.SOAPMessage;

import org.apache.axis.AxisFault;
import org.apache.axis.Message;
import org.apache.axis.MessageContext;
import org.apache.axis.SOAPPart;
import org.apache.axis.handlers.BasicHandler;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;

public class MyHandler extends BasicHandler {

    private static Logger log = LogManager.getLogger(MyHandler.class);

    public void invoke(MessageContext messageContext) throws AxisFault {

        try {
            if (log.isInfoEnabled()) log.info("invoke - start");
            log.info("invoke - past pivot [" + messageContext.getPastPivot() + "]");

            SOAPMessage rpcMsg = messageContext.getMessage();

            if (rpcMsg instanceof Message) {
                Message axisMsg = (Message) rpcMsg;

                if (log.isDebugEnabled())
                    log.debug("invoke - cast javax.xml.soap.SOAPMessage to org.apache.axis.Message");

                javax.xml.soap.SOAPPart rpcPart = axisMsg.getSOAPPart();
                if (rpcPart instanceof SOAPPart) {
                    SOAPPart axisPart = (SOAPPart) rpcPart;

                    if (log.isDebugEnabled())
                        log.debug("invoke - cast javax.xml.soap.SOAPPart to org.apache.axis.SOAPPart");

                    Object currentMessage = axisPart.getCurrentMessage();
                    if (currentMessage == null) {
                        log.debug("invoke - current message is null, cannot clean");
                    }
                    else {
                        if (log.isDebugEnabled())
                            log.debug("invoke - current message of SOAP part has type ["
                                + currentMessage.getClass().getName()
                                + "] content [" + currentMessage.toString() + "]");

                        // attempt to remove bad characters from the response
                        if (messageContext.getPastPivot()) {

                            if (currentMessage instanceof String) {
                                String strMessage = (String) currentMessage;
                                int idx = 
strMessage.indexOf("mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 3:41 PM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML


did you see my response on setting the CHARACTER_SET_ENCODING? what is
the exact stack trace you get on the client?

thanks,
dims

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> text/xml and utf-8, which I suppose explains the attempt to parse the UTF-16 
> message as UTF-8. The customer has changed the format of the message to 
> correctly be UTF-8 in actuality, although Xerces still isn't a fan of the 
> UTF-8 BOM (ef bb bf).
>
>
>
> -Original Message-
> From: Simon Fell [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 2:46 PM
> To: axis-user@ws.apache.org
> Subject: RE: Two questions - BOM in UTF-8, and manually cleaning XML
>
>
> What does the content-type header say the charset is? That takes precedence 
> over the payload (at least for SOAP 1.1)
>
> Cheers
> Simon
>
> -Original Message-
> From: Rodrigo Ruiz [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 8:30 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> Maybe changing the xml prolog from "utf-8" to "utf-16" will be easier.
> It seems like a demo example for a servlet filter ;-)

RE: Two questions - BOM in UTF-8, and manually cleaning XML

2006-07-05 Thread Matthew Brown
text/xml and utf-8, which I suppose explains the attempt to parse the UTF-16 
message as UTF-8. The customer has changed the format of the message to 
correctly be UTF-8 in actuality, although Xerces still isn't a fan of the UTF-8 
BOM (ef bb bf).



-Original Message-
From: Simon Fell [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 2:46 PM
To: axis-user@ws.apache.org
Subject: RE: Two questions - BOM in UTF-8, and manually cleaning XML


What does the content-type header say the charset is? That takes precedence 
over the payload (at least for SOAP 1.1) 

Cheers
Simon
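
To check what the header actually declares, the charset parameter can be 
pulled out of the Content-Type value. The following is only a sketch: the 
class name is hypothetical and the regular expression handles the common 
`charset=...` forms, not every variation the HTTP grammar allows.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the charset parameter from a Content-Type header value.
final class ContentTypeCharset {
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    private ContentTypeCharset() {}

    /** Returns the declared charset in lower case, or null if none is declared. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : null;
    }
}
```

For a header such as `text/xml; charset=utf-8` this yields "utf-8", which can 
then be compared with what the payload bytes really look like.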

-Original Message-
From: Rodrigo Ruiz [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 05, 2006 8:30 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML

Maybe changing the xml prolog from "utf-8" to "utf-16" will be easier. 
It seems like a demo example for a servlet filter ;-)


Hope this helps,
Rodrigo



Manuel Mall wrote:
> On Wednesday 05 July 2006 23:12, Matthew Brown wrote:
>> Two bytes per char; Etherpeak is showing the second byte as 00.
>>
> Seems you are stuck between a "rock and a hard place" here. The byte 
> stream appears to be correctly utf-16 encoded but the xml prolog says 
> utf-8. Not sure what to recommend. Fix it at the source is obvious but 
> not easily done. You may be able to write a handler that re-encodes 
> the byte stream into utf-8 before giving it to the Axis stacks. But 
> how to write such an Axis handler and how to hook it correctly into 
> the Axis processing chain is outside my area of expertise.
> 
> Maybe someone else can give advice on how to attempt such a thing.
> 
> Manuel
>> -Original Message-
>> From: Manuel Mall [mailto:[EMAIL PROTECTED]
>> Sent: Wednesday, July 05, 2006 11:09 AM
>> To: axis-user@ws.apache.org
>> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>>
>> On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
>>> Manuel,
>>>
>>> I believe you hit the problem on the head - the response prolog says 
>>> utf-8 but (according to Etherpeak) the BOM is ff/fe.
>>> Coincidentally, by the time the response XML gets logged by axis, 
>>> these initial characters are logged as ef bf bd ef bf bd.
>> Matt,
>>
>> what about the rest of the byte stream when you look at it in 
>> Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded
>> (1 byte per char for all typical ascii characters)?
>>
>> Manuel
>>
>>> Unfortunately we may be in a bit of a tough place with having the 
>>> producer of the XML change it; the customer whose web services we 
>>> are consuming doesn't seem to see any issue with this (as they are 
>>> fine with their .NET tools).
>>>
>>> If it is the case where we are seeing a UTF-16 BOM but a prolog that 
>>> declares UTF-8; is there any way to instruct Axis/Xerces to parse it 
>>> as UTF-16? Sorry if this question doesn't make much sense, but I'm 
>>> not too familiar with how Axis and/or Xerces decide which character 
>>> encoding to use when reading the XML.
>>>
>>> Thanks again
>>> Matt
>>>
>>> -Original Message-
>>> From: Manuel Mall [mailto:[EMAIL PROTECTED]
>>> Sent: Wednesday, July 05, 2006 10:58 AM
>>> To: axis-user@ws.apache.org
>>> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>>>
>>> On Wednesday 05 July 2006 22:16, Axel Bock wrote:
>>>> Yes, there is a work-around. It works if you encode the file with
>>>> UTF-8 (for example), and do not include the BOM at the beginning.
>>>> I use notepad++ for that task, where you can save in "UTF-8 without 
>>>> BOM".
>>>>
>>>> The process for that is easy:
>>>> 1. open the file in notepad++
>>>> 2. mark everything via CTRL-A
>>>> 3. cut (not copy!)
>>>> 4. in the format menu, choose "ANSI" formatting and select "UTF 
>>>> without BOM" at the bottom
>>>> 5. paste
>>>> 6. save.
>>>>
>>>> that is a crap workaround, but works for me. for automatically 
>>>> generated files . I dunno :-)
>>>>
>>>>
>>>> Greetings,
>>>> Axel.
>>>>
>>>>
>>>> On 7/5/06, Matthew Brown < [EMAIL PROTECTED] 
>>>> <mailto:[EMAIL PROTECTED]> > wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I hate to do this, but can anyone please help me with either of 
>>>>

RE: Two questions - BOM in UTF-8, and manually cleaning XML

2006-07-05 Thread Matthew Brown
I've tried to add a handler to simply log the messages, but it seems to me (a 
beginner) that the Handler doesn't come into play until after the XML 
is parsed/deserialized.

Just to serve as a confirmation, can anyone comment on how Xerces will 
determine what type of encoding the xml is in? Will it look at the prolog, the 
byte order mark, etc?

Thanks
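
As a rough illustration of the byte-order-mark part of this question, the 
sketch below inspects the first bytes of a stream and names the encoding the 
BOM implies. It is not Xerces' actual detection code, only the general idea. 
The class name is hypothetical, and UTF-32 BOMs are deliberately ignored to 
keep it short.

```java
// Hypothetical helper: guesses an encoding from a leading byte order mark.
final class BomSniffer {
    private BomSniffer() {}

    /** Returns "UTF-8", "UTF-16LE" or "UTF-16BE" if a matching BOM is present, else null. */
    static String detect(byte[] b) {
        if (b == null) {
            return null;
        }
        if (b.length >= 3
                && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        return null; // no BOM: the prolog or an external header has to decide
    }
}
```

Comparing this result with the encoding named in the prolog and in the 
Content-Type header shows quickly whether the three disagree.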


-Original Message-
From: Manuel Mall [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 11:24 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML


On Wednesday 05 July 2006 23:12, Matthew Brown wrote:
> Two bytes per char; Etherpeak is showing the second byte as 00.
>
Seems you are stuck between a "rock and a hard place" here. The byte 
stream appears to be correctly utf-16 encoded but the xml prolog says 
utf-8. Not sure what to recommend. Fix it at the source is obvious but 
not easily done. You may be able to write a handler that re-encodes the 
byte stream into utf-8 before giving it to the Axis stacks. But how to 
write such an Axis handler and how to hook it correctly into the Axis 
processing chain is outside my area of expertise.

Maybe someone else can give advice on how to attempt such a thing.

Manuel
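
The re-encoding step itself is small in plain Java; hooking it into the Axis 
transport is the part this sketch does not cover. It assumes the whole 
response is available as a byte array and that it really is UTF-16. The class 
name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: turns a UTF-16 byte stream into BOM-free UTF-8 bytes.
final class Utf16ToUtf8 {
    private Utf16ToUtf8() {}

    static byte[] reencode(byte[] utf16) {
        // Java's "UTF-16" charset honours a leading BOM and falls back to big endian without one.
        String text = new String(utf16, StandardCharsets.UTF_16);
        // Defensive: drop a BOM character if one survived decoding.
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

After this step the bytes agree with a prolog that declares utf-8, so the 
prolog would not need to be rewritten.
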
> -Original Message-
> From: Manuel Mall [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 11:09 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
> > Manuel,
> >
> > I believe you hit the problem on the head - the response prolog
> > says utf-8 but (according to Etherpeak) the BOM is ff/fe.
> > Coincidentally, by the time the response XML gets logged by axis,
> > these initial characters are logged as ef bf bd ef bf bd.
>
> Matt,
>
> what about the rest of the byte stream when you look at it in
> Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded
> (1 byte per char for all typical ascii characters)?
>
> Manuel
>
> > Unfortunately we may be in a bit of a tough place with having the
> > producer of the XML change it; the customer whose web services we
> > are consuming doesn't seem to see any issue with this (as they are
> > fine with their .NET tools).
> >
> > If it is the case where we are seeing a UTF-16 BOM but a prolog
> > that declares UTF-8; is there any way to instruct Axis/Xerces to
> > parse it as UTF-16? Sorry if this question doesn't make much sense,
> > but I'm not too familiar with how Axis and/or Xerces decide which
> > character encoding to use when reading the XML.
> >
> > Thanks again
> > Matt
> >
> > -Original Message-
> > From: Manuel Mall [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, July 05, 2006 10:58 AM
> > To: axis-user@ws.apache.org
> > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning
> > XML
> >
> > On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> > > Yes, there is a work-around. It works if you encode the file with
> > > UTF-8 (for example), and do not include the BOM at the beginning.
> > > I use notepad++ for that task, where you can save in "UTF-8
> > > without BOM".
> > >
> > > The process for that is easy:
> > > 1. open the file in notepad++
> > > 2. mark everything via CTRL-A
> > > 3. cut (not copy!)
> > > 4. in the format menu, choose "ANSI" formatting and select "UTF
> > > without BOM" at the bottom
> > > 5. paste
> > > 6. save.
> > >
> > > that is a crap workaround, but works for me. for automatically
> > > generated files . I dunno :-)
> > >
> > >
> > > Greetings,
> > > Axel.
> > >
> > >
> > > On 7/5/06, Matthew Brown < [EMAIL PROTECTED]
> > > <mailto:[EMAIL PROTECTED]> > wrote:
> > >
> > > Hi all,
> > >
> > > I hate to do this, but can anyone please help me with either of
> > > these issues? I've tried to upgrade Xerces to 2.8.0 but to no
> > > avail.
> > >
> > > Is there anything else I could be doing?
> >
> > Just wondering if your file in question starts with hex 'ef bb bf'
> > or 'ff fe' or 'fe ff'. If it is one of the latter two forms I
> > believe you have a utf-16 encoded file (little endian or big
> > endian) not utf-8. If it is the 'ef bb bf' sequence then it starts
> > correctly with the utf-8 encoded unicode code point for BOM U+FEFF.
> > In all cases xerces should be able to handle it. A

RE: Two questions - BOM in UTF-8, and manually cleaning XML

2006-07-05 Thread Matthew Brown
Two bytes per char; Etherpeak is showing the second byte as 00.

-Original Message-
From: Manuel Mall [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 11:09 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML


On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
> Manuel,
>
> I believe you hit the problem on the head - the response prolog says
> utf-8 but (according to Etherpeak) the BOM is ff/fe. Coincidentally,
> by the time the response XML gets logged by axis, these initial
> characters are logged as ef bf bd ef bf bd.
>
Matt,

what about the rest of the byte stream when you look at it in Etherpeak. 
Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded (1 byte per 
char for all typical ascii characters)?

Manuel
> Unfortunately we may be in a bit of a tough place with having the
> producer of the XML change it; the customer whose web services we are
> consuming doesn't seem to see any issue with this (as they are fine
> with their .NET tools).
>
> If it is the case where we are seeing a UTF-16 BOM but a prolog that
> declares UTF-8; is there any way to instruct Axis/Xerces to parse it
> as UTF-16? Sorry if this question doesn't make much sense, but I'm
> not too familiar with how Axis and/or Xerces decide which character
> encoding to use when reading the XML.
>
> Thanks again
> Matt
>
> -Original Message-
> From: Manuel Mall [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 10:58 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> > Yes, there is a work-around. It works if you encode the file with
> > UTF-8 (for example), and do not include the BOM at the beginning. I
> > use notepad++ for that task, where you can save in "UTF-8 without
> > BOM".
> >
> > The process for that is easy:
> > 1. open the file in notepad++
> > 2. mark everything via CTRL-A
> > 3. cut (not copy!)
> > 4. in the format menu, choose "ANSI" formatting and select "UTF
> > without BOM" at the bottom
> > 5. paste
> > 6. save.
> >
> > that is a crap workaround, but works for me. for automatically
> > generated files . I dunno :-)
> >
> >
> > Greetings,
> > Axel.
> >
> >
> > On 7/5/06, Matthew Brown < [EMAIL PROTECTED]
> > <mailto:[EMAIL PROTECTED]> > wrote:
> >
> > Hi all,
> >
> > I hate to do this, but can anyone please help me with either of
> > these issues? I've tried to upgrade Xerces to 2.8.0 but to no
> > avail.
> >
> > Is there anything else I could be doing?
>
> Just wondering if your file in question starts with hex 'ef bb bf'
> or 'ff fe' or 'fe ff'. If it is one of the latter two forms I believe
> you have a utf-16 encoded file (little endian or big endian) not
> utf-8. If it is the 'ef bb bf' sequence then it starts correctly with
> the utf-8 encoded unicode code point for BOM U+FEFF. In all cases
> xerces should be able to handle it. A problem may arise if it starts
> with 'ff fe' but the XML prolog says encoding="utf-8" as that is a
> contradiction I believe.
>
> I know this does not help directly but may help to check if the
> problem is with the producer of the XML document or your consumer.
>
> Manuel
>
> > What about the possibility of programmatically editing/cleaning the
> > response XML before it is given to the parser?
> >
> > Thanks
> > Matt
> >
> > -Original Message-
> > From: Matthew Brown [mailto: [EMAIL PROTECTED]
> > <mailto:[EMAIL PROTECTED]> ]
> > Sent: Saturday, July 01, 2006 12:41 PM
> > To: axis-user@ws.apache.org <mailto:axis-user@ws.apache.org>
> > Subject: Two questions - BOM in UTF-8, and manually cleaning XML
> >
> >
> > 1. From searching the mailing list archives, I see several
> > references to people having problems with Byte Order Mark
> > characters appearing before the prolog in their UTF-8 messages.
> > However I can't seem to find much of a known resolution to these
> > issues. Is there a standard/common workaround for these BOM and
> > UTF-8 issues?
> >
> > 2. If there is no answer to my #1, is there any way that Axis will
> > allow me to programmatically edit the response XML before it is passed
> > to the parser and de-serialized? I've tried adding Handlers, but
> > I'm assuming that the Handler comes into the picture after the
> > message is parsed, because my Handler is only ever seeing the
> > request message, and not the response.
> >
> > Thanks
> > Matt Brown
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]




RE: Two questions - BOM in UTF-8, and manually cleaning XML

2006-07-05 Thread Matthew Brown
Manuel,

I believe you hit the problem on the head - the response prolog says utf-8 but 
(according to Etherpeak) the BOM is ff/fe. Coincidentally, by the time the 
response XML gets logged by axis, these initial characters are logged as ef bf 
bd ef bf bd.
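
The logged bytes are consistent with the UTF-16 BOM having been decoded as 
UTF-8: neither byte of the pair ff fe is valid UTF-8, so each is replaced by 
U+FFFD, and U+FFFD written back out as UTF-8 is ef bf bd. A small sketch that 
reproduces this with the standard Java decoder (the class name is 
hypothetical):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: decode bytes as UTF-8, re-encode, and show the result as hex.
final class ReplacementCharDemo {
    private ReplacementCharDemo() {}

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes raw bytes as UTF-8 (malformed input becomes U+FFFD) and re-encodes them. */
    static String roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return hex(decoded.getBytes(StandardCharsets.UTF_8));
    }
}
```

Calling `roundTrip` on the two bytes ff fe gives `ef bf bd ef bf bd`, the same 
sequence that shows up in the log.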

Unfortunately we may be in a bit of a tough place with having the producer of 
the XML change it; the customer whose web services we are consuming doesn't 
seem to see any issue with this (as they are fine with their .NET tools).

If it is the case where we are seeing a UTF-16 BOM but a prolog that declares 
UTF-8; is there any way to instruct Axis/Xerces to parse it as UTF-16? Sorry if 
this question doesn't make much sense, but I'm not too familiar with how Axis 
and/or Xerces decide which character encoding to use when reading the XML.

Thanks again
Matt

-Original Message-
From: Manuel Mall [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 10:58 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML


On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> Yes, there is a work-around. It works if you encode the file with
> UTF-8 (for example), and do not include the BOM at the beginning. I
> use notepad++ for that task, where you can save in "UTF-8 without
> BOM".
>
> The process for that is easy:
> 1. open the file in notepad++
> 2. mark everything via CTRL-A
> 3. cut (not copy!)
> 4. in the format menu, choose "ANSI" formatting and select "UTF
> without BOM" at the bottom
> 5. paste
> 6. save.
>
> that is a crap workaround, but works for me. for automatically
> generated files . I dunno :-)
>
>
> Greetings,
> Axel.
>
>
> On 7/5/06, Matthew Brown < [EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]> > wrote:
>
> Hi all,
>
> I hate to do this, but can anyone please help me with either of these
> issues? I've tried to upgrade Xerces to 2.8.0 but to no avail.
>
> Is there anything else I could be doing?

Just wondering if your file in question starts with hex 'ef bb bf' 
or 'ff fe' or 'fe ff'. If it is one of the latter two forms I believe 
you have a utf-16 encoded file (little endian or big endian) not 
utf-8. If it is the 'ef bb bf' sequence then it starts correctly with 
the utf-8 encoded unicode code point for BOM U+FEFF. In all cases 
xerces should be able to handle it. A problem may arise if it starts 
with 'ff fe' but the XML prolog says encoding="utf-8" as that is a 
contradiction I believe.

I know this does not help directly but may help to check if the problem 
is with the producer of the XML document or your consumer.

Manuel
>
> What about the possibility of programmatically editing/cleaning the
> response XML before it is given to the parser?
>
> Thanks
> Matt
>
> -Original Message-
> From: Matthew Brown [mailto: [EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]> ]
> Sent: Saturday, July 01, 2006 12:41 PM
> To: axis-user@ws.apache.org <mailto:axis-user@ws.apache.org>
> Subject: Two questions - BOM in UTF-8, and manually cleaning XML
>
>
> 1. From searching the mailing list archives, I see several references
> to people having problems with Byte Order Mark characters appearing
> before the prolog in their UTF-8 messages. However I can't seem to
> find much of a known resolution to these issues. Is there a
> standard/common workaround for these BOM and UTF-8 issues?
>
> 2. If there is no answer to my #1, is there any way that Axis will
> allow me to programmatically edit the response XML before it is passed
> to the parser and de-serialized? I've tried adding Handlers, but I'm
> assuming that the Handler comes into the picture after the message is
> parsed, because my Handler is only ever seeing the request message,
> and not the response.
>
> Thanks
> Matt Brown




RE: Two questions - BOM in UTF-8, and manually cleaning XML

2006-07-05 Thread Matthew Brown
Hi Davanum

Sorry if I didn't give all of the details before - we are using Axis as a 
client and communicating with an ASP.NET (v1.1) server.

Just for testing, we built a client in .NET off of the same WSDL, and although 
the response XML/data from the service looks the same, .NET was somehow able to 
parse it fine.

So at this point I'm not sure if this is a problem I should be tackling in Axis 
or somehow through the XML parser. In my searches I've found some previous 
discussion of this problem on the list, but no known solution posted.

Thanks
Matt

-Original Message-
From: Davanum Srinivas [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 10:44 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML


Matthew,

Is this from a non-axis web service? and you are having problems with
an axis client?

-- dims

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
>
>
> Alex,
>
> The problem I am having is with the SOAP response from the web service; so
> I'm not really sure how we'd be saving that to a file... this isn't a static
> piece of text.
>
> -Original Message-
> From: Axel Bock [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 10:17 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> Yes, there is a work-around. It works if you encode the file with UTF-8 (for
> example), and do not include the BOM at the beginning. I use notepad++ for
> that task, where you can save in "UTF-8 without BOM".
>
> The process for that is easy:
> 1. open the file in notepad++
> 2. mark everything via CTRL-A
> 3. cut (not copy!)
> 4. in the format menu, choose "ANSI" formatting and select "UTF without BOM"
> at the bottom
> 5. paste
> 6. save.
>
> that is a crap workaround, but works for me. for automatically generated
> files . I dunno :-)
>
>
> Greetings,
> Axel.
>
>
> On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> >
> >
> >
> > Hi all,
> >
> > I hate to do this, but can anyone please help me with either of these
> issues? I've tried to upgrade Xerces to 2.8.0 but to no avail.
> >
> > Is there anything else I could be doing?
> >
> > What about the possibility of programmatically editing/cleaning the
> response XML before it is given to the parser?
> >
> > Thanks
> > Matt
> >
> > -Original Message-
> > From: Matthew Brown [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, July 01, 2006 12:41 PM
> > To: axis-user@ws.apache.org
> > Subject: Two questions - BOM in UTF-8, and manually cleaning XML
> >
> >
> > 1. From searching the mailing list archives, I see several references to
> people having problems with Byte Order Mark characters appearing before the
> prolog in their UTF-8 messages. However I can't seem to find much of a known
> resolution to these issues. Is there a standard/common workaround for these
> BOM and UTF-8 issues?
> >
> > 2. If there is no answer to my #1, is there any way that Axis will allow me
> to programmatically edit the response XML before it is passed to the parser and
> de-serialized? I've tried adding Handlers, but I'm assuming that the Handler
> comes into the picture after the message is parsed, because my Handler is
> only ever seeing the request message, and not the response.
> >
> > Thanks
> > Matt Brown
>
>


-- 
Davanum Srinivas : http://people.apache.org/~dims/




RE: Two questions - BOM in UTF-8, and manually cleaning XML

2006-07-05 Thread Matthew Brown



Alex,

The problem I am having is with the SOAP response from the web service; so I'm
not really sure how we'd be saving that to a file... this isn't a static piece
of text.

-Original Message-
From: Axel Bock [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 10:17 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML

Yes, there is a work-around. It works if you encode the file with UTF-8 (for
example), and do not include the BOM at the beginning. I use notepad++ for
that task, where you can save in "UTF-8 without BOM".

The process for that is easy:
1. open the file in notepad++
2. mark everything via CTRL-A
3. cut (not copy!)
4. in the format menu, choose "ANSI" formatting and select "UTF without BOM"
   at the bottom
5. paste
6. save.

that is a crap workaround, but works for me. for automatically generated
files ..... I dunno :-)

Greetings,
Axel.

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I hate to do this, but can anyone please help me with either of these
> issues? I've tried to upgrade Xerces to 2.8.0 but to no avail.
>
> Is there anything else I could be doing?
>
> What about the possibility of programmatically editing/cleaning the
> response XML before it is given to the parser?
>
> Thanks
> Matt
>
> -Original Message-
> From: Matthew Brown [mailto:[EMAIL PROTECTED]
> Sent: Saturday, July 01, 2006 12:41 PM
> To: axis-user@ws.apache.org
> Subject: Two questions - BOM in UTF-8, and manually cleaning XML
>
> 1. From searching the mailing list archives, I see several references to
> people having problems with Byte Order Mark characters appearing before the
> prolog in their UTF-8 messages. However I can't seem to find much of a known
> resolution to these issues. Is there a standard/common workaround for these
> BOM and UTF-8 issues?
>
> 2. If there is no answer to my #1, is there any way that Axis will allow me
> to programmatically edit the response XML before it is passed to the parser
> and de-serialized? I've tried adding Handlers, but I'm assuming that the
> Handler comes into the picture after the message is parsed, because my
> Handler is only ever seeing the request message, and not the response.
>
> Thanks
> Matt Brown


RE: Two questions - BOM in UTF-8, and manually cleaning XML

2006-07-05 Thread Matthew Brown



Hi all,

I hate to do this, but can anyone please help me with either of these issues?
I've tried to upgrade Xerces to 2.8.0 but to no avail.

Is there anything else I could be doing?

What about the possibility of programmatically editing/cleaning the response
XML before it is given to the parser?

Thanks
Matt

-Original Message-
From: Matthew Brown [mailto:[EMAIL PROTECTED]
Sent: Saturday, July 01, 2006 12:41 PM
To: axis-user@ws.apache.org
Subject: Two questions - BOM in UTF-8, and manually cleaning XML

> 1. From searching the mailing list archives, I see several references to
> people having problems with Byte Order Mark characters appearing before the
> prolog in their UTF-8 messages. However I can't seem to find much of a known
> resolution to these issues. Is there a standard/common workaround for these
> BOM and UTF-8 issues?
>
> 2. If there is no answer to my #1, is there any way that Axis will allow me
> to programmatically edit the response XML before it is passed to the parser
> and de-serialized? I've tried adding Handlers, but I'm assuming that the
> Handler comes into the picture after the message is parsed, because my
> Handler is only ever seeing the request message, and not the response.
>
> Thanks
> Matt Brown


RE: Content is not allowed in prolog

2006-07-05 Thread Matthew Brown
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227)
at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:696)
... 40 more




-Original Message-
From: Dies Koper [mailto:[EMAIL PROTECTED]
Sent: Sunday, July 02, 2006 8:45 PM
To: axis-user@ws.apache.org
Cc: [EMAIL PROTECTED]
Subject: Re: Content is not allowed in prolog


Hello Derek,

I used Xerces-J 2.7.1 and had no problems with a Unicode Byte Order Mark 
(BOM) in my UTF-8 and UTF-16 messages using Axis 1.3.

Can you try reproducing the error message with this parser?

Regards,
Dies


Matthew Brown wrote:
> Thanks Derek. I've used Etherpeak to capture the raw packets coming across
> and, using its hex editor, have found that they appear to be hex FF
> FE.
> 
> I understand from searching and from old posts on this list that
> Xerces will have trouble with input that starts with this byte-order
> mark. Is this still the case? If so, can anyone provide the known
> workaround for this?
> 
> Thanks again
> Matt
>
> -Original Message-
> From: Matthew Brown [mailto:[EMAIL PROTECTED]
> Sent: Friday, June 30, 2006 7:16 AM
> To: axis-user@ws.apache.org
> Subject: RE: Content is not allowed in prolog
> 
> 
> Some followup information..
> 
> I've tested using .NET and their wsdl.exe tool to create a client to
> use the customer's web service. The response still looks the same,
> but .NET has zero issues parsing. Could this just be an XML parser
> issue? Can someone point me in the direction of how to
> change/configure the parser, or find out if parsing a message such as
> the one below (with all those extra spaces) is possible? 
>
> -Original Message-
> From: Matthew Brown [mailto:[EMAIL PROTECTED]
> Sent: Friday, June 30, 2006 9:23 AM
> To: axis-user@ws.apache.org
> Subject: RE: Content is not allowed in prolog
> 
> 
> I happen to be having a similar error, although it isn't an endpoint
> issue.
> 
> The response we are getting back from the server looks like this:
> 
> ??< ? x m l   v e r s i o n = " 1 . 0 "   e n c o d i n g = " u t f -
> 8 " ? > < s o a p : E n v e l o p e   x m l n s : s o a p = " h t t p
> : / / s c h e m a s . x m l s o a p . o r g / s o a p / e n v e l o p
> e / "   x m l n s : x s i = " h t t p : / / w w w . w 3 . o r g / 2 0
> 0 1 / X M L S c h e m a - i n s t a n c e "   x m l n s : x s d = " h
> t t p : / / w w w . w 3 . o r g / 2 0 0 1 / X M L S c h e m a " > < s
> o a p : H e a d e r > < R e s p o n s e H e a d e r   x m l n s = " h
> t t p : / / b l a h . c o m / C A S / " > < H e a d e r s > < / H e a
> d e r s > < / R e s p o n s e H e a d e r > < / s o a p : H e a d e r
> > < s o a p : B o d y > < G e t A c c o u n t I n f o r m a t i o n R
> e s p o n s e   x m l n s = " h t t p : / / b l a h . c o m / C A S /
> " > < A c c o u n t I n f o r m a t i o n R e s p o n s e   x m l n s
> : x s d = " h t t p : / / w w w . w 3 . o r g / 2 0 0 1 / X M L S c h
> e m a "   x m l n s : x s i = " h t t p : / / w w w . w 3 . o r g / 2
> 0 0 1 / X M L S c h e m a - i n s t a n c e "   x m l n s = " h t t p
> : / / b l a h . c o m / C A S / I V R . M e s s a g e D e f i n i t i
> o n s . x s d " >
> 
> < N u m b e r O f M a t c h e s > 0 < / N u m b e r O f M a t c h e s
> >
> 
> < M o n t h l y E x t e n s i o n A m o u n t > 0 < / M o n t h l y E
> x t e n s i o n A m o u n t >
> 
> 
> 
> 
> with garbage characters inserted between each legit XML character
> (and two before the prolog).
> 
> Is it possible to add a handler to modify the raw response XML before
> Axis passes it off to the XML parser? Does anyone know? Is there some
> other simple setting I might be overlooking that might be causing
> this?
> 
> Thanks in advance.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Two questions - BOM in UTF-8, and manually cleaning XML

2006-07-01 Thread Matthew Brown



1. From searching the mailing list archives, I see several references to
people having problems with Byte Order Mark characters appearing before the
prolog in their UTF-8 messages. However, I can't seem to find much of a known
resolution to these issues. Is there a standard/common workaround for these
BOM and UTF-8 issues?

2. If there is no answer to my #1, is there any way that Axis will allow me
to programmatically edit the response XML before it is passed to the parser
and deserialized? I've tried adding Handlers, but I'm assuming that the
Handler comes into the picture after the message is parsed, because my
Handler only ever sees the request message, and not the response.

Thanks
Matt Brown
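
For readers wondering what "cleaning" the stream before parsing could look like, here is a minimal sketch in plain Java. It is not Axis API and not the handler discussed in this thread; the class name Utf8BomStripper and the method skipUtf8Bom are made up for illustration. It assumes you can get at the raw response InputStream before the parser does, which is exactly the part the question above is asking about.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, not part of Axis or Xerces.
final class Utf8BomStripper {
    private Utf8BomStripper() {}

    /**
     * Returns a stream positioned after a leading UTF-8 BOM (EF BB BF)
     * if one is present; otherwise the stream content is left unchanged.
     */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];

        // Read up to three bytes; a single read() may return fewer.
        int n = 0;
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }

        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // Not a BOM: push the bytes back so the caller sees them again.
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }
}
```

The returned stream can then be handed to whatever parser is in use. Whether and where Axis lets you wrap the response stream this way is the open question; the sketch only covers the byte handling.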


RE: Content is not allowed in prolog

2006-06-30 Thread Matthew Brown



Thanks Derek. I've used EtherPeek to capture the raw packets coming across
and, using its hex editor, have found that the two leading characters appear
to be hex FF FE.

I understand from searching and from old posts on this list that Xerces will
have trouble with a document that starts with this byte-order mark. Is this
still the case? If so, can anyone provide the known workaround for this?

Thanks again
Matt
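
A small illustration of why FF FE followed by a zero byte after every character points to UTF-16 (little-endian) rather than UTF-8. This is a standalone sketch using only the JDK; the class and method names are invented for the example and nothing here touches Axis or Xerces.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what UTF-16LE text with a BOM looks like on the wire.
final class Utf16WireDemo {
    private Utf16WireDemo() {}

    /** Encodes text as UTF-16LE and prepends the little-endian BOM (FF FE). */
    static byte[] utf16LeWithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFF;
        out[1] = (byte) 0xFE;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes with Java's BOM-aware "UTF-16" charset; the BOM itself is consumed. */
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] wire = utf16LeWithBom("<?");
        System.out.println(hex(wire));         // FF FE 3C 00 3F 00
        System.out.println(decodeUtf16(wire)); // <?
    }
}
```

Viewed as single-byte text, the two BOM bytes show up as two unprintable characters and every 00 byte shows up as a blank or square, which matches the capture described in this thread.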

  -----Original Message-----
  From: Derek [mailto:[EMAIL PROTECTED]
  Sent: Friday, June 30, 2006 1:43 PM
  To: axis-user@ws.apache.org
  Subject: RE: Content is not allowed in prolog

  Just a suggestion:

  The message that you list below, with blanks between each character, looks
  to me like you might be trying to view Unicode text as if it were ASCII.
  Unicode (when encoded as UTF-16) uses sixteen bits to represent each of
  these characters, while ASCII uses 8 (technically, 7), so each Unicode
  character in the ASCII numeric range consists of an all-zeroes byte plus a
  character byte. Perhaps the extra characters you are seeing in the message
  aren't really spaces, but are really null characters (0x00), and your
  editor or viewer translates them to spaces because it has no way to
  display nulls.

  The two question marks before the initial "

  Just a thought. That's the problem I've usually had when I see text files
  that look like this one.

  Derek
  



RE: Content is not allowed in prolog

2006-06-30 Thread Matthew Brown



Some followup information..

I've tested using .NET and their wsdl.exe tool to create a client to use the
customer's web service. The response still looks the same, but .NET has zero
issues parsing. Could this just be an XML parser issue? Can someone point me
in the direction of how to change/configure the parser, or find out if
parsing a message such as the one below (with all those extra spaces) is
possible?



RE: Content is not allowed in prolog

2006-06-30 Thread Matthew Brown



I happen to be having a similar error, although it isn't an endpoint issue.

The response we are getting back from the server looks like this:
 
??< 
? x m l   v e r s i o n = " 1 . 0 "   e n c o d i n g = " u 
t f - 8 " ? > < s o a p : E n v e l o p e   x m l n s : s o a p 
= " h t t p : / / s c h e m a s . x m l s o a p . o r g / s o a p / e n v e l o 
p e / "   x m l n s : x s i = " h t t p : / / w w w . w 3 . o r g / 2 
0 0 1 / X M L S c h e m a - i n s t a n c e "   x m l n s : x s d = " 
h t t p : / / w w w . w 3 . o r g / 2 0 0 1 / X M L S c h e m a " > < s o 
a p : H e a d e r > < R e s p o n s e H e a d e r   x m l n s = 
" h t t p : / / b l a h . c o m / C A S / " > < H e a d e r s > < / 
H e a d e r s > < / R e s p o n s e H e a d e r > < / s o a p : H e 
a d e r > < s o a p : B o d y > < G e t A c c o u n t I n f o r m a 
t i o n R e s p o n s e   x m l n s = " h t t p : / / b l a h . c o m 
/ C A S / " > < A c c o u n t I n f o r m a t i o n R e s p o n s 
e   x m l n s : x s d = " h t t p : / / w w w . w 3 . o r g / 2 0 0 1 
/ X M L S c h e m a "   x m l n s : x s i = " h t t p : / / w w w . w 
3 . o r g / 2 0 0 1 / X M L S c h e m a - i n s t a n c e "   x m l n 
s = " h t t p : / / b l a h . c o m / C A S / I V R . M e s s a g e D e f i n i 
t i o n s . x s d " >   < N u m b e r 
O f M a t c h e s > 0 < / N u m b e r O f M a t c h e s > 
  < M o n t h l y E x t e n s i o n A m 
o u n t > 0 < / M o n t h l y E x t e n s i o n A m o u n t > 
  
 
with garbage characters inserted between each legit XML character (and two
before the prolog).

Is it possible to add a handler to modify the raw response XML before Axis
passes it off to the XML parser? Does anyone know? Is there some other simple
setting I might be overlooking that might be causing this?

Thanks in advance.


  -----Original Message-----
  From: Luanne Coutinho [mailto:[EMAIL PROTECTED]
  Sent: Friday, June 30, 2006 1:18 AM
  To: axis-user@ws.apache.org
  Subject: RE: Content is not allowed in prolog

  Hi,

  Turns out that the endpoint supplied by our client was wrong! I wonder why
  Axis kept throwing this particular error…

  -Luanne

  -----Original Message-----
  From: Luanne Coutinho
  Sent: Friday, June 30, 2006 9:41 AM
  To: Luanne Coutinho
  Subject:

  Hello,

  I had this same error before. Question though, what version of Axis are
  you using? Also, if you are using any attachments in your program, you
  need to include the activation.jar.

  Tom

  Luanne Coutinho wrote:
  > Hi,
  >
  > I used wsdl2Java to generate stubs so that I can access a web service
  > hosted elsewhere.
  > I wrote a test program to invoke an operation, but I keep getting this
  > error:
  >
  > AxisFault
  >  faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
  >  faultSubcode:
  >  faultString: org.xml.sax.SAXParseException: Content is not allowed in prolog.
  >  faultActor:
  >  faultNode:
  >  faultDetail:
  >  {http://xml.apache.org/axis/}stackTrace:org.xml.sax.SAXParseException: Content is not allowed in prolog.
  >  at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
  >  at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
  >  at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
  >  at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
  >  at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
  >  at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
  >  at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
  >  at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
  >  at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
  >  at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
  >  at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
  >  at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
  >  at javax.xml.parsers.SAXParser.parse(SAXParser.java:375)
  >  at org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227)
  >  at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:696)
  >  at org.apache.axis.Message.getSOAPEnvelope(Message.java:435)
  >  at org.apache.axis.handlers.soap.MustUnderstandChecker.invoke(MustUnderstandChecker.java:62)
  >  at org.apache.axis.client.AxisClient.invoke(AxisClient.java:206)
  >  at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
  >  at org.apache.axis.client.Call.invoke(Call.java:2767)
  >  at org.apache.axis.client.Call.invoke(Call.java:2443)
  >  at org.apache.axis.client.Call.invoke(Call.java:2366)

Strange format of SOAP Response causing errors

2006-06-29 Thread Matthew Brown



We are using stub classes created from WSDL2Java to communicate with a
customer's web service. Axis (1.3 and 1.4) seems unable to parse the SOAP
response, and eyeballing the response in a tool like tcpmon one can see junk
characters inserted between every valid XML character (the typical ASCII
square), and two before the opening xml bracket.

Using the default HTTP sender, Axis reports an IO exception with a message
like "Invalid byte 1 of 1-byte UTF-8 sequence". Using the commons-http-client,
this becomes a SAXParseException of "Content is not allowed in prolog". The
SOAP response's Content-Type header claims a charset of UTF-8, although the
body does not appear to be UTF-8.

I've been able to test out communications with the same web services using a
.NET generated proxy. Watching the traffic in tcpmon, the response looks the
same, but is understood by the client.

Should we be setting the character set / encoding expected in the response
stream manually somewhere?

Thanks
Matthew Brown
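
On the question of setting the expected encoding manually: outside of Axis, one way to parse a body whose real encoding disagrees with its label is to decode the bytes yourself and hand the parser a character stream, since a SAX InputSource built from a Reader is read directly and any encoding declaration in the text is disregarded. The sketch below uses only JAXP from the JDK; the class name MislabeledResponseParser is hypothetical, and it assumes you already have the raw response bytes and know (or have detected) the real charset. It does not show how to hook this into Axis.

```java
import java.io.StringReader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Axis.
final class MislabeledResponseParser {
    private MislabeledResponseParser() {}

    /**
     * Decodes raw bytes with the charset the caller believes is correct,
     * then parses from a character stream so that the (wrong) encoding
     * named in the XML declaration or HTTP header plays no role.
     */
    static Document parse(byte[] raw, Charset actual) throws Exception {
        // Java's "UTF-16" charset consumes a leading FF FE / FE FF BOM
        // and uses it to pick the byte order.
        String text = new String(raw, actual);

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(text)));
    }
}
```

For a response like the one described here (FF FE followed by two bytes per character), calling parse(raw, StandardCharsets.UTF_16) would be the natural thing to try, under the assumption that the body really is UTF-16.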