Subject: Two questions - BOM in UTF-8, and manually cleaning XML

>                 else {
>                   String cleaned = strMessage.substring(idx);
>                   log.debug("invoke - Setting SOAPPart.currentMessage to: " + cleaned);
>                   axisPart.setCurrentMessage(cleaned, axisPart.getCurrentForm());
>                 }
>               }
>             }
>           }
>         }
>       }
>       if (log.isInfoEnabled()) log.info("invoke - complete");
>     }
>     catch (Exception ex) {
>       log.error("Caught exception in invoke()", ex);
>     }
>   }
> }
>
> -Original Message-
> From: Davanum Srinivas [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 3:41 PM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
>
> did you see my response on setting the CHARACTER_SET_ENCODING? what
> is the exact stack trace you get on the client?
>
> thanks,
> dims
>
> On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> > text/xml and utf-8, which I suppose explains the attempt to parse
> > the UTF-16 message as UTF-8. The customer has changed the format of
> > the message to correctly be UTF-8 in actuality, although Xerces
> > still isn't a fan of the UTF-8 BOM (ef bb bf).
> >
> >
> >
> > -----Original Message-
> > From: Simon Fell [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, July 05, 2006 2:46 PM
> > To: axis-user@ws.apache.org
> > Subject: RE: Two questions - BOM in UTF-8, and manually cleaning
> > XML
> >
> >
> > What does the content-type header say the charset is? That takes
> > precedence over the payload (at least for SOAP 1.1)
> >
> > Cheers
> > Simon
> >
> > -Original Message-
> > From: Rodrigo Ruiz [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, July 05, 2006 8:30 AM
> > To: axis-user@ws.apache.org
> > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning
> > XML
> >
> > Maybe changing the xml prolog from "utf-8" to "utf-16" will be
> > easier. It seems like a demo example for a servlet filter ;-)
> >
> >
> > Hope this helps,
> > Rodrigo
> >
> > Manuel Mall wrote:
> > > On Wednesday 05 July 2006 23:12, Matthew Brown wrote:
> > >> Two bytes per char; Etherpeak is showing the second byte as 00.
> > >
> > > Seems you are stuck between a "rock and a hard place" here. The
> > > byte stream appears to be correctly utf-16 encoded but the xml
> > > prolog says utf-8. Not sure what to recommend. Fix it at the
> > > source is obvious but not easily done. You may be able to write a
> > > handler that re-encodes the byte stream into utf-8 before giving
> > > it to the Axis stacks. But how to write such an Axis handler and
> > > how to hook it correctly into the Axis processing chain is
> > > outside my area of expertise.
> > >
> > > May be someone else can give advice on how to attempt such a
> > > thing.
> > >
> > > Manuel
> > >
> > >> -Original Message-
> > >> From: Manuel Mall [mailto:[EMAIL PROTECTED]
> > >> Sent: Wednesday, July 05, 2006 11:09 AM
> > >> To: axis-user@ws.apache.org
> > >> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning
> > >> XML
> > >>
> > >> On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
> > >>> Manuel,
> > >>>
> > >>> I believe you hit the problem on the head - the response prolog
> > >>> says utf-8 but (according to Etherpeak) the BOM is ff fe.
> > >>> Coincidentally, by the time the response XML gets logged by
> > >>> axis, these initial characters are logged as ef bf bd ef bf bd.
> > >>
> > >> Matt,
> > >>
> > >> what about the rest of the byte stream when you look at it in
> > >> Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8
> > >> encoded (1 byte per char for all typical ascii characters)?
> > >>
> > >> Manuel
> > >>
> > >>> Unfortunately we may be in a bit of a tough place with having

RE: Two questions - BOM in UTF-8, and manually cleaning XML

Davanum,

I had tried this previously, and the only effect I noticed was that the
encoding attribute of my request message's prolog changed. The response message
was still being parsed as UTF-8 (as the headers declared), although it was
actually UTF-16.
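
A minimal sketch of what decoding UTF-16 bytes as UTF-8 does to the BOM, assuming the BOM on the wire was ff fe (UTF-16, little endian). The class and method names are hypothetical and nothing here is Axis code. Each invalid byte is replaced with U+FFFD, which matches the ef bf bd ef bf bd sequence reported in the Axis log.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Axis or Xerces.
final class MisdecodeDemo {

    /** Decodes raw bytes as UTF-8, re-encodes as UTF-8, and returns spaced hex. */
    static String roundTripAsUtf8Hex(byte[] raw) {
        // Bytes that are not valid UTF-8 are replaced with U+FFFD while decoding.
        String decoded = new String(raw, StandardCharsets.UTF_8);
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);

        StringBuilder hex = new StringBuilder();
        for (byte b : reencoded) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}
```

Calling `MisdecodeDemo.roundTripAsUtf8Hex(new byte[] {(byte) 0xFF, (byte) 0xFE})` yields `ef bf bd ef bf bd`, so the original BOM is no longer recoverable once the stream has been decoded with the wrong charset.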

Anyway, now that the service provider has changed their service to return true
UTF-8 data, and Xerces still has trouble interpreting the UTF-8 BOM before the
prolog, I have found a very hack-ish solution: add a handler that removes the
bad leading characters from the currentMessage if the MessageContext is past
the pivot. This doesn't feel like a great solution to me (why isn't the XML
parser prepared to handle the BOM? Is the wrong parse method being used?), but
it works for us for now.
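
The core of such a cleanup can be sketched independently of Axis. This is not the handler's actual logic (part of it is cut off in the archive); it is a hypothetical helper that assumes the BOM has already been decoded into the single character U+FEFF at the start of the message string.

```java
// Hypothetical helper, not part of Axis; shows only the string cleanup step.
final class BomStripper {

    /** The byte order mark as it appears after correct decoding. */
    private static final char BOM = '\uFEFF';

    /** Returns the input without a leading U+FEFF; other input is returned unchanged. */
    static String stripLeadingBom(String xml) {
        if (xml != null && !xml.isEmpty() && xml.charAt(0) == BOM) {
            return xml.substring(1);
        }
        return xml;
    }
}
```

Checking only the first character is narrower than searching for the start of the XML: it leaves the message untouched unless a BOM is really there.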

Thanks for the help all
Matt

-

package com.viecore.ipl.ws;

import javax.xml.soap.SOAPMessage;

import org.apache.axis.AxisFault;
import org.apache.axis.Message;
import org.apache.axis.MessageContext;
import org.apache.axis.SOAPPart;
import org.apache.axis.handlers.BasicHandler;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;

public class MyHandler extends BasicHandler {

  private static Logger log = LogManager.getLogger(MyHandler.class);

  public void invoke(MessageContext messageContext) throws AxisFault {
    try {
      if (log.isInfoEnabled()) log.info("invoke - start");
      log.info("invoke - past pivot [" + messageContext.getPastPivot() + "]");

      SOAPMessage rpcMsg = messageContext.getMessage();

      if (rpcMsg instanceof Message) {
        Message axisMsg = (Message) rpcMsg;

        if (log.isDebugEnabled())
          log.debug("invoke - cast java.xml.rpc.SOAPMessage to org.apache.axis.Message");

        javax.xml.soap.SOAPPart rpcPart = axisMsg.getSOAPPart();
        if (rpcPart instanceof SOAPPart) {
          SOAPPart axisPart = (SOAPPart) rpcPart;

          if (log.isDebugEnabled())
            log.debug("invoke - cast java.xml.rpc.SOAPPart to org.apache.axis.SOAPPart");

          Object currentMessage = axisPart.getCurrentMessage();
          if (currentMessage == null) {
            log.debug("invoke - current message is null, cannot clean");
          }
          else {
            if (log.isDebugEnabled())
              log.debug("invoke - current message of SOAP part has type ["
                  + currentMessage.getClass().getName()
                  + "] content [" + currentMessage.toString() + "]");

            // attempt to remove bad characters from the response
            if (messageContext.getPastPivot() == true) {
              if (currentMessage instanceof String) {
                String strMessage = (String) currentMessage;
                int idx = strMessage.indexOf("

-Original Message-
From: Davanum Srinivas [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 3:41 PM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML


did you see my response on setting the CHARACTER_SET_ENCODING? what is
the exact stack trace you get on the client?

thanks,
dims

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> text/xml and utf-8, which I suppose explains the attempt to parse the UTF-16 
> message as UTF-8. The customer has changed the format of the message to 
> correctly be UTF-8 in actuality, although Xerces still isn't a fan of the 
> UTF-8 BOM (ef bb bf).
>
>
>
> -Original Message-
> From: Simon Fell [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 2:46 PM
> To: axis-user@ws.apache.org
> Subject: RE: Two questions - BOM in UTF-8, and manually cleaning XML
>
>
> What does the content-type header say the charset is? That takes precedence 
> over the payload (at least for SOAP 1.1)
>
> Cheers
> Simon
>
> -----Original Message-
> From: Rodrigo Ruiz [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 8:30 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> Maybe changing the xml prolog from "utf-8" to "utf-16" will be easier.
> It seems like a demo example for a servlet filter ;-)

Re: Two questions - BOM in UTF-8, and manually cleaning XML


Did you see my response about setting CHARACTER_SET_ENCODING? What is
the exact stack trace you get on the client?

thanks,
dims


RE: Two questions - BOM in UTF-8, and manually cleaning XML

The Content-Type header says text/xml and utf-8, which I suppose explains the
attempt to parse the UTF-16 message as UTF-8. The customer has since changed
the message so that it really is UTF-8, although Xerces still isn't a fan of
the UTF-8 BOM (ef bb bf).
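
If the raw response bytes can be reached before they are parsed, the three BOM bytes can be dropped at that level. The following is a minimal sketch under that assumption; the class and method names are hypothetical, and where such a step would hook into Axis is not shown.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Axis or Xerces.
final class Utf8BomSkipper {

    /** Returns a copy without a leading ef bb bf; other input is returned as-is. */
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xff) == 0xEF
                && (data[1] & 0xff) == 0xBB
                && (data[2] & 0xff) == 0xBF;
        if (hasBom) {
            return Arrays.copyOfRange(data, 3, data.length);
        }
        return data;
    }
}
```

Working on bytes avoids depending on how the payload was decoded, at the cost of needing access to the stream before any parsing has happened.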




RE: Two questions - BOM in UTF-8, and manually cleaning XML

2006-07-05 Thread Simon Fell

What does the content-type header say the charset is? That takes precedence 
over the payload (at least for SOAP 1.1) 
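
To see which charset a response header declares, the charset parameter can be pulled out of the Content-Type value. This is a simplified sketch with hypothetical names; it assumes a plain header such as `text/xml; charset=utf-8` and does not handle every quoting rule allowed in MIME parameters.

```java
import java.util.Locale;

// Hypothetical helper, not part of Axis.
final class ContentTypeCharset {

    /** Returns the charset parameter of a Content-Type value, or the fallback if absent. */
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Drop surrounding double quotes, as in charset="UTF-16".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }
}
```

For example, `ContentTypeCharset.charsetOf("text/xml; charset=utf-8", "ISO-8859-1")` returns `utf-8`, which is the value that would win over the prolog under the rule described above.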

Cheers
Simon

-Original Message-
From: Rodrigo Ruiz [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 05, 2006 8:30 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML

Maybe changing the xml prolog from "utf-8" to "utf-16" will be easier. 
It seems like a demo example for a servlet filter ;-)


Hope this helps,
Rodrigo



Manuel Mall wrote:
> On Wednesday 05 July 2006 23:12, Matthew Brown wrote:
>> Two bytes per char; Etherpeak is showing the second byte as 00.
>>
> Seems you are stuck between a "rock and a hard place" here. The byte 
> stream appears to be correctly utf-16 encoded but the xml prolog says 
> utf-8. Not sure what to recommend. Fix it at the source is obvious but 
> not easily done. You may be able to write a handler that re-encodes 
> the byte stream into utf-8 before giving it to the Axis stacks. But 
> how to write such an Axis handler and how to hook it correctly into 
> the Axis processing chain is outside my area of expertise.
> 
> May be someone else can give advice on how to attempt such a thing.
> 
> Manuel
>> -Original Message-
>> From: Manuel Mall [mailto:[EMAIL PROTECTED]
>> Sent: Wednesday, July 05, 2006 11:09 AM
>> To: axis-user@ws.apache.org
>> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>>
>> On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
>>> Manuel,
>>>
>>> I believe you hit the problem on the head - the response prolog says 
>>> utf-8 but (according to Etherpeak) the BOM is ff fe.
>>> Coincidentally, by the time the response XML gets logged by axis, 
>>> these initial characters are logged as ef bf bd ef bf bd.
>> Matt,
>>
>> what about the rest of the byte stream when you look at it in 
>> Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded
>> (1 byte per char for all typical ascii characters)?
>>
>> Manuel
>>
>>> Unfortunately we may be in a bit of a tough place with having the 
>>> producer of the XML change it; the customer whose web services we 
>>> are consuming doesn't seem to see any issue with this (as they are 
>>> fine with their .NET tools).
>>>
>>> If it is the case where we are seeing a UTF-16 BOM but a prolog that 
>>> declares UTF-8; is there any way to instruct Axis/Xerces to parse it 
>>> as UTF-16? Sorry if this question doesn't make much sense, but I'm 
>>> not too familiar with how Axis and/or Xerces decide which character 
>>> encoding to use when reading the XML.
>>>
>>> Thanks again
>>> Matt
>>>
>>> -Original Message-
>>> From: Manuel Mall [mailto:[EMAIL PROTECTED]
>>> Sent: Wednesday, July 05, 2006 10:58 AM
>>> To: axis-user@ws.apache.org
>>> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>>>
>>> On Wednesday 05 July 2006 22:16, Axel Bock wrote:
>>>> Yes, there is a work-around. It works if you encode the file with
>>>> UTF-8 (for example), and do not include the BOM at the beginning.
>>>> I use notepad++ for that task, where you can save in "UTF-8 without 
>>>> BOM".
>>>>
>>>> The process for that is easy:
>>>> 1. open the file in notepad++
>>>> 2. mark everything via CTRL-A
>>>> 3. cut (not copy!)
>>>> 4. in the format menu, choose "ANSI" formatting and select "UTF 
>>>> without BOM" at the bottom 5. paste 6. save.
>>>>
>>>> that is a crap workaround, but works for me. for automatically 
>>>> generated files . I dunno :-)
>>>>
>>>>
>>>> Greetings,
>>>> Axel.
>>>>
>>>>
>>>> On 7/5/06, Matthew Brown < [EMAIL PROTECTED] 
>>>> <mailto:[EMAIL PROTECTED]> > wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I hate to do this, but can anyone please help me with either of 
>>>> these issues? I've tried to upgrade Xerces to 2.8.0 but to no 
>>>> avail.
>>>>
>>>> Is there anything else I could be doing?
>>> Just wondering if your file in question starts with hex 'ef bb bf'
>>> or 'ff fe' or 'fe ff'. If it is one of the latter two forms I
>>> believe you have a UTF-16 encoded file (little endian or big
>>> endian) not utf-8. If

Re: Two questions - BOM in UTF-8, and manually cleaning XML


call.setProperty(Call.CHARACTER_SET_ENCODING, "UTF-16");

On 7/5/06, Davanum Srinivas <[EMAIL PROTECTED]> wrote:

Matt,

Please try setting CHARACTER_SET_ENCODING in the call's properties to
UTF-16 and see if that works.

-- dims

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> I've tried to add a handler to simply log the messages but it seems to (a
> beginner like) me that the Handler doesn't come into play until after the XML
> is parsed/deserialized.
>
> Just to serve as a confirmation, can anyone comment on how Xerces will
> determine what type of encoding the xml is in? Will it look at the prolog, the
> byte order mark, etc?
>
> Thanks
>
>
> -Original Message-
> From: Manuel Mall [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 11:24 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
>
> On Wednesday 05 July 2006 23:12, Matthew Brown wrote:
> > Two bytes per char; Etherpeak is showing the second byte as 00.
> >
> Seems you are stuck between a "rock and a hard place" here. The byte
> stream appears to be correctly utf-16 encoded but the xml prolog says
> utf-8. Not sure what to recommend. Fix it at the source is obvious but
> not easily done. You may be able to write a handler that re-encodes the
> byte stream into utf-8 before giving it to the Axis stacks. But how to
> write such an Axis handler and how to hook it correctly into the Axis
> processing chain is outside my area of expertise.
>
> May be someone else can give advice on how to attempt such a thing.
>
> Manuel
> > -Original Message-
> > From: Manuel Mall [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, July 05, 2006 11:09 AM
> > To: axis-user@ws.apache.org
> > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
> >
> > On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
> > > Manuel,
> > >
> > > I believe you hit the problem on the head - the response prolog
> > > says utf-8 but (according to Etherpeak) the BOM is ff fe.
> > > Coincidentally, by the time the response XML gets logged by axis,
> > > these initial characters are logged as ef bf bd ef bf bd.
> >
> > Matt,
> >
> > what about the rest of the byte stream when you look at it in
> > Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded
> > (1 byte per char for all typical ascii characters)?
> >
> > Manuel
> >
> > > Unfortunately we may be in a bit of a tough place with having the
> > > producer of the XML change it; the customer whose web services we
> > > are consuming doesn't seem to see any issue with this (as they are
> > > fine with their .NET tools).
> > >
> > > If it is the case where we are seeing a UTF-16 BOM but a prolog
> > > that declares UTF-8; is there any way to instruct Axis/Xerces to
> > > parse it as UTF-16? Sorry if this question doesn't make much sense,
> > > but I'm not too familiar with how Axis and/or Xerces decide which
> > > character encoding to use when reading the XML.
> > >
> > > Thanks again
> > > Matt
> > >
> > > -Original Message-
> > > From: Manuel Mall [mailto:[EMAIL PROTECTED]
> > > Sent: Wednesday, July 05, 2006 10:58 AM
> > > To: axis-user@ws.apache.org
> > > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning
> > > XML
> > >
> > > On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> > > > Yes, there is a work-around. It works if you encode the file with
> > > > UTF-8 (for example), and do not include the BOM at the beginning.
> > > > I use notepad++ for that task, where you can save in "UTF-8
> > > > without BOM".
> > > >
> > > > The process for that is easy:
> > > > 1. open the file in notepad++
> > > > 2. mark everything via CTRL-A
> > > > 3. cut (not copy!)
> > > > 4. in the format menu, choose "ANSI" formatting and select "UTF
> > > > without BOM" at the bottom
> > > > 5. paste
> > > > 6. save.
> > > >
> > > > that is a crap workaround, but works for me. for automatically
> > > > generated files . I dunno :-)
> > > >
> > > >
> > > > Greetings,
> > > > Axel.
> > > >
> > > >
> > > > On 7/5/06, Matthew Brown < [EMAIL PROTECTED]
> > > > <mailto:[EMAIL PROTECTED]> > wrote:
> > > >

Re: Two questions - BOM in UTF-8, and manually cleaning XML


Matt,

Please try setting CHARACTER_SET_ENCODING in the call's properties to
UTF-16 and see if that works.

-- dims

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:

I've tried to add a handler to simply log the messages but it seems to (a 
beginner like) me that the Handler doesn't come into play until after the XML 
is parsed/deserialized.

Just to serve as a confirmation, can anyone comment on how Xerces will 
determine what type of encoding the xml is in? Will it look at the prolog, the 
byte order mark, etc?

Thanks


-Original Message-
From: Manuel Mall [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 11:24 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML


On Wednesday 05 July 2006 23:12, Matthew Brown wrote:
> Two bytes per char; Etherpeak is showing the second byte as 00.
>
Seems you are stuck between a "rock and a hard place" here. The byte
stream appears to be correctly utf-16 encoded but the xml prolog says
utf-8. Not sure what to recommend. Fix it at the source is obvious but
not easily done. You may be able to write a handler that re-encodes the
byte stream into utf-8 before giving it to the Axis stacks. But how to
write such an Axis handler and how to hook it correctly into the Axis
processing chain is outside my area of expertise.

May be someone else can give advice on how to attempt such a thing.
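
The re-encoding step itself is small once the raw bytes are available. Below is a minimal sketch with hypothetical names; it assumes the payload is UTF-16 with a leading BOM and leaves open how to hook it into the Axis processing chain.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Axis.
final class Utf16ToUtf8 {

    /** Decodes UTF-16 bytes (byte order taken from the BOM) and re-encodes them as UTF-8. */
    static byte[] reencode(byte[] utf16WithBom) {
        // Java's "UTF-16" charset reads the BOM to pick the byte order and
        // falls back to big endian when no BOM is present.
        String text = new String(utf16WithBom, StandardCharsets.UTF_16);
        // Defensive: drop a BOM character if one survived decoding.
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

After this step the bytes agree with a prolog that declares utf-8, which is the mismatch described in this thread.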

Manuel
> -Original Message-
> From: Manuel Mall [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 11:09 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
> > Manuel,
> >
> > I believe you hit the problem on the head - the response prolog
> > says utf-8 but (according to Etherpeak) the BOM is ff fe.
> > Coincidentally, by the time the response XML gets logged by axis,
> > these initial characters are logged as ef bf bd ef bf bd.
>
> Matt,
>
> what about the rest of the byte stream when you look at it in
> Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded
> (1 byte per char for all typical ascii characters)?
>
> Manuel
>
> > Unfortunately we may be in a bit of a tough place with having the
> > producer of the XML change it; the customer whose web services we
> > are consuming doesn't seem to see any issue with this (as they are
> > fine with their .NET tools).
> >
> > If it is the case where we are seeing a UTF-16 BOM but a prolog
> > that declares UTF-8; is there any way to instruct Axis/Xerces to
> > parse it as UTF-16? Sorry if this question doesn't make much sense,
> > but I'm not too familiar with how Axis and/or Xerces decide which
> > character encoding to use when reading the XML.
> >
> > Thanks again
> > Matt
> >
> > -Original Message-
> > From: Manuel Mall [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, July 05, 2006 10:58 AM
> > To: axis-user@ws.apache.org
> > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning
> > XML
> >
> > On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> > > Yes, there is a work-around. It works if you encode the file with
> > > UTF-8 (for example), and do not include the BOM at the beginning.
> > > I use notepad++ for that task, where you can save in "UTF-8
> > > without BOM".
> > >
> > > The process for that is easy:
> > > 1. open the file in notepad++
> > > 2. mark everything via CTRL-A
> > > 3. cut (not copy!)
> > > 4. in the format menu, choose "ANSI" formatting and select "UTF
> > > without BOM" at the bottom
> > > 5. paste
> > > 6. save.
> > >
> > > that is a crap workaround, but works for me. for automatically
> > > generated files . I dunno :-)
> > >
> > >
> > > Greetings,
> > > Axel.
> > >
> > >
> > > On 7/5/06, Matthew Brown < [EMAIL PROTECTED]
> > > <mailto:[EMAIL PROTECTED]> > wrote:
> > >
> > > Hi all,
> > >
> > > I hate to do this, but can anyone please help me with either of
> > > these issues? I've tried to upgrade Xerces to 2.8.0 but to no
> > > avail.
> > >
> > > Is there anything else I could be doing?
> >
> > Just wondering if your file in question starts with hex 'ef bb bf'
> > or 'ff fe' or 'fe ff'. If it is one of the latter two forms I
> > believe you have a UTF-16 encoded file (little endian or big
> > endian) not utf-8. If it is the 'e

Re: Two questions - BOM in UTF-8, and manually cleaning XML

On Wednesday 05 July 2006 23:37, Matthew Brown wrote:
> I've tried to add a handler to simply log the messages but it seems
> to (a beginner like) me that the Handler doesn't come into play until
> after the XML is parsed/deserialized.
>
> Just to serve as a confirmation, can anyone comment on how Xerces
> will determine what type of encoding the xml is in? Will it look at
> the prolog, the byte order mark, etc?
>

See Appendix F of the XML 1.0 spec
(http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing)
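
The BOM part of that autodetection can be sketched as follows. This is a simplified illustration with hypothetical names, not how Xerces implements it: it covers only the UTF-8 and UTF-16 byte order marks, and ignores the UCS-4 marks and the cases without a BOM that the appendix also describes.

```java
// Hypothetical helper, not part of Xerces.
final class BomSniffer {

    /** Returns the encoding implied by a leading BOM, or null if none is recognised. */
    static String detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        return null;
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xff) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A stream beginning with ff fe is reported as `UTF-16LE` here, whatever the prolog claims. That is why a prolog saying utf-8 on such a stream is contradictory.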

Manuel

> Thanks
>
>
> -Original Message-
> From: Manuel Mall [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 11:24 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> On Wednesday 05 July 2006 23:12, Matthew Brown wrote:
> > Two bytes per char; Etherpeak is showing the second byte as 00.
>
> Seems you are stuck between a "rock and a hard place" here. The byte
> stream appears to be correctly utf-16 encoded but the xml prolog says
> utf-8. Not sure what to recommend. Fix it at the source is obvious
> but not easily done. You may be able to write a handler that
> re-encodes the byte stream into utf-8 before giving it to the Axis
> stacks. But how to write such an Axis handler and how to hook it
> correctly into the Axis processing chain is outside my area of
> expertise.
>
> May be someone else can give advice on how to attempt such a thing.
>
> Manuel
>
> > -Original Message-
> > From: Manuel Mall [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, July 05, 2006 11:09 AM
> > To: axis-user@ws.apache.org
> > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning
> > XML
> >
> > On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
> > > Manuel,
> > >
> > > I believe you hit the problem on the head - the response prolog
> > > says utf-8 but (according to Etherpeak) the BOM is ff fe.
> > > Coincidentally, by the time the response XML gets logged by axis,
> > > these initial characters are logged as ef bf bd ef bf bd.
> >
> > Matt,
> >
> > what about the rest of the byte stream when you look at it in
> > Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded
> > (1 byte per char for all typical ascii characters)?
> >
> > Manuel
> >
> > > Unfortunately we may be in a bit of a tough place with having the
> > > producer of the XML change it; the customer whose web services we
> > > are consuming doesn't seem to see any issue with this (as they
> > > are fine with their .NET tools).
> > >
> > > If it is the case where we are seeing a UTF-16 BOM but a prolog
> > > that declares UTF-8; is there any way to instruct Axis/Xerces to
> > > parse it as UTF-16? Sorry if this question doesn't make much
> > > sense, but I'm not too familiar with how Axis and/or Xerces
> > > decide which character encoding to use when reading the XML.
> > >
> > > Thanks again
> > > Matt
> > >
> > > -Original Message-
> > > From: Manuel Mall [mailto:[EMAIL PROTECTED]
> > > Sent: Wednesday, July 05, 2006 10:58 AM
> > > To: axis-user@ws.apache.org
> > > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning
> > > XML
> > >
> > > On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> > > > Yes, there is a work-around. It works if you encode the file
> > > > with UTF-8 (for example), and do not include the BOM at the
> > > > beginning. I use notepad++ for that task, where you can save in
> > > > "UTF-8 without BOM".
> > > >
> > > > The process for that is easy:
> > > > 1. open the file in notepad++
> > > > 2. mark everything via CTRL-A
> > > > 3. cut (not copy!)
> > > > 4. in the format menu, choose "ANSI" formatting and select "UTF
> > > > without BOM" at the bottom
> > > > 5. paste
> > > > 6. save.
> > > >
> > > > that is a crap workaround, but works for me. for automatically
> > > > generated files . I dunno :-)
> > > >
> > > >
> > > > Greetings,
> > > > Axel.
> > > >
> > > >
> > > > On 7/5/06, Matthew Brown < [EMAIL PROTECTED]
> > > > <mailto:[EMAIL PROTECTED]> > wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I hate to do this, but can anyone please help me w

RE: Two questions - BOM in UTF-8, and manually cleaning XML

I've tried to add a handler to simply log the messages, but it seems to me (a 
beginner) that the Handler doesn't come into play until after the XML is 
parsed/deserialized.

Just to confirm, can anyone comment on how Xerces determines which encoding 
the XML is in? Does it look at the prolog, the byte order mark, etc.?

Thanks


-Original Message-
From: Manuel Mall [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 11:24 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML


On Wednesday 05 July 2006 23:12, Matthew Brown wrote:
> Two bytes per char; Etherpeak is showing the second byte as 00.
>
Seems you are stuck between a "rock and a hard place" here. The byte 
stream appears to be correctly utf-16 encoded but the xml prolog says 
utf-8. Not sure what to recommend. Fix it at the source is obvious but 
not easily done. You may be able to write a handler that re-encodes the 
byte stream into utf-8 before giving it to the Axis stacks. But how to 
write such an Axis handler and how to hook it correctly into the Axis 
processing chain is outside my area of expertise.

Maybe someone else can give advice on how to attempt such a thing.

Manuel
> -Original Message-
> From: Manuel Mall [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 11:09 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
> > Manuel,
> >
> > I believe you hit the problem on the head - the response prolog
> > says utf-8 but (according to Etherpeak) the BOM is ff/ef.
> > Coincidentally, by the time the response XML gets logged by axis,
> > these initial characters are logged as ef bf bd ef bf bd.
>
> Matt,
>
> what about the rest of the byte stream when you look at it in
> Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded
> (1 byte per char for all typical ascii characters)?
>
> Manuel
>
> > Unfortunately we may be in a bit of a tough place with having the
> > producer of the XML change it; the customer whose web services we
> > are consuming doesn't seem to see any issue with this (as they are
> > fine with their .NET tools).
> >
> > If it is the case where we are seeing a UTF-16 BOM but a prolog
> > that declares UTF-8; is there any way to instruct Axis/Xerces to
> > parse it as UTF-16? Sorry if this question doesn't make much sense,
> > but I'm not too familiar with how Axis and/or Xerces decide which
> > character encoding to use when reading the XML.
> >
> > Thanks again
> > Matt
> >
> > -Original Message-
> > From: Manuel Mall [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, July 05, 2006 10:58 AM
> > To: axis-user@ws.apache.org
> > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning
> > XML
> >
> > On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> > > Yes, there is a work-around. It works if you encode the file with
> > > UTF-8 (for example), and do not include the BOM at the beginning.
> > > I use notepad++ for that task, where you can save in "UTF-8
> > > without BOM".
> > >
> > > The process for that is easy:
> > > 1. open the file in notepad++
> > > 2. mark everything via CTRL-A
> > > 3. cut (not copy!)
> > > 4. in the format menu, choose "ANSI" formatting and select "UTF
> > > without BOM" at the bottom
> > > 5. paste
> > > 6. save.
> > >
> > > that is a crap workaround, but works for me. for automatically
> > > generated files . I dunno :-)
> > >
> > >
> > > Greetings,
> > > Axel.
> > >
> > >
> > > On 7/5/06, Matthew Brown < [EMAIL PROTECTED]
> > > <mailto:[EMAIL PROTECTED]> > wrote:
> > >
> > > Hi all,
> > >
> > > I hate to do this, but can anyone please help me with either of
> > > these issues? I've tried to upgrade Xerces to 2.8.0 but to no
> > > avail.
> > >
> > > Is there anything else I could be doing?
> >
> > Just wondering if your file in question starts with hex 'ef bb bf'
> > or 'ff fe' or 'fe ff'. If it is one of the latter two forms I
> > believe you have a UTF-16 encoded file (little endian or big
> > endian), not UTF-8. If it is the 'ef bb bf' sequence then it starts
> > correctly with the UTF-8 encoded Unicode code point for BOM U+FEFF.
> > In all cases xerces should be able to handle it. A

Re: Two questions - BOM in UTF-8, and manually cleaning XML

2006-07-05 Thread Jim Murphy


I think you guys are being too lenient on the service provider.  I
would shame them into fixing the problem. :-)

Clearly it's not cool to publish a non-interoperable service
implementation!  Why not use .NET remoting or Java RMI in that case?

This assumes you can have a productive discussion with the service team.

Jim Murphy
Mindreef, Inc.



On 7/5/06, Rodrigo Ruiz <[EMAIL PROTECTED]> wrote:

Maybe changing the xml prolog from "utf-8" to "utf-16" will be easier.
It seems like a demo example for a servlet filter ;-)


Hope this helps,
Rodrigo



Manuel Mall wrote:
> On Wednesday 05 July 2006 23:12, Matthew Brown wrote:
>> Two bytes per char; Etherpeak is showing the second byte as 00.
>>
> Seems you are stuck between a "rock and a hard place" here. The byte
> stream appears to be correctly utf-16 encoded but the xml prolog says
> utf-8. Not sure what to recommend. Fix it at the source is obvious but
> not easily done. You may be able to write a handler that re-encodes the
> byte stream into utf-8 before giving it to the Axis stacks. But how to
> write such an Axis handler and how to hook it correctly into the Axis
> processing chain is outside my area of expertise.
>
> Maybe someone else can give advice on how to attempt such a thing.
>
> Manuel
>> -Original Message-
>> From: Manuel Mall [mailto:[EMAIL PROTECTED]
>> Sent: Wednesday, July 05, 2006 11:09 AM
>> To: axis-user@ws.apache.org
>> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>>
>> On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
>>> Manuel,
>>>
>>> I believe you hit the problem on the head - the response prolog
>>> says utf-8 but (according to Etherpeak) the BOM is ff/ef.
>>> Coincidentally, by the time the response XML gets logged by axis,
>>> these initial characters are logged as ef bf bd ef bf bd.
>> Matt,
>>
>> what about the rest of the byte stream when you look at it in
>> Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded
>> (1 byte per char for all typical ascii characters)?
>>
>> Manuel
>>
>>> Unfortunately we may be in a bit of a tough place with having the
>>> producer of the XML change it; the customer whose web services we
>>> are consuming doesn't seem to see any issue with this (as they are
>>> fine with their .NET tools).
>>>
>>> If it is the case where we are seeing a UTF-16 BOM but a prolog
>>> that declares UTF-8; is there any way to instruct Axis/Xerces to
>>> parse it as UTF-16? Sorry if this question doesn't make much sense,
>>> but I'm not too familiar with how Axis and/or Xerces decide which
>>> character encoding to use when reading the XML.
>>>
>>> Thanks again
>>> Matt
>>>
>>> -Original Message-
>>> From: Manuel Mall [mailto:[EMAIL PROTECTED]
>>> Sent: Wednesday, July 05, 2006 10:58 AM
>>> To: axis-user@ws.apache.org
>>> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning
>>> XML
>>>
>>> On Wednesday 05 July 2006 22:16, Axel Bock wrote:
>>>> Yes, there is a work-around. It works if you encode the file with
>>>> UTF-8 (for example), and do not include the BOM at the beginning.
>>>> I use notepad++ for that task, where you can save in "UTF-8
>>>> without BOM".
>>>>
>>>> The process for that is easy:
>>>> 1. open the file in notepad++
>>>> 2. mark everything via CTRL-A
>>>> 3. cut (not copy!)
>>>> 4. in the format menu, choose "ANSI" formatting and select "UTF
>>>> without BOM" at the bottom
>>>> 5. paste
>>>> 6. save.
>>>>
>>>> that is a crap workaround, but works for me. for automatically
>>>> generated files . I dunno :-)
>>>>
>>>>
>>>> Greetings,
>>>> Axel.
>>>>
>>>>
>>>> On 7/5/06, Matthew Brown < [EMAIL PROTECTED]
>>>> <mailto:[EMAIL PROTECTED]> > wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I hate to do this, but can anyone please help me with either of
>>>> these issues? I've tried to upgrade Xerces to 2.8.0 but to no
>>>> avail.
>>>>
>>>> Is there anything else I could be doing?
>>> Just wondering if your file in question starts with hex 'ef bb bf'
>>> or 'ff fe' or 'fe ff'. If it is one of the latter two forms I
>>> believe

Re: Two questions - BOM in UTF-8, and manually cleaning XML

2006-07-05 Thread Rodrigo Ruiz

Maybe changing the xml prolog from "utf-8" to "utf-16" will be easier. 
It seems like a demo example for a servlet filter ;-)
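
A rough sketch of the prolog rewrite in Java, in case it helps. It assumes the response has already been decoded into a String (so any BOM is gone) and only touches the encoding pseudo-attribute of a leading XML declaration. `PrologFixer` and `declareEncoding` are made-up names for illustration, not part of Axis, Xerces or the servlet API, and the filter plumbing around it is not shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: rewrites the encoding named in a leading XML declaration. */
final class PrologFixer {
    // Group 1: everything up to and including the opening quote of the encoding value.
    // Group 2: the declared encoding name. Group 3: the closing quote.
    private static final Pattern DECL_ENCODING = Pattern.compile(
            "^(<\\?xml[^>]*?encoding\\s*=\\s*[\"'])([A-Za-z0-9._-]+)([\"'])");

    private PrologFixer() {}

    /** Returns xml with its declared encoding replaced, or xml unchanged if none is declared. */
    static String declareEncoding(String xml, String encoding) {
        Matcher m = DECL_ENCODING.matcher(xml);
        if (!m.find()) {
            return xml; // no XML declaration with an encoding at the very start
        }
        return m.group(1) + encoding + m.group(3) + xml.substring(m.end());
    }
}
```

Something like `PrologFixer.declareEncoding(responseText, "utf-16")` would then make the declaration agree with the bytes actually sent.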



Hope this helps,
Rodrigo



Manuel Mall wrote:

On Wednesday 05 July 2006 23:12, Matthew Brown wrote:

Two bytes per char; Etherpeak is showing the second byte as 00.

Seems you are stuck between a "rock and a hard place" here. The byte 
stream appears to be correctly utf-16 encoded but the xml prolog says 
utf-8. Not sure what to recommend. Fix it at the source is obvious but 
not easily done. You may be able to write a handler that re-encodes the 
byte stream into utf-8 before giving it to the Axis stacks. But how to 
write such an Axis handler and how to hook it correctly into the Axis 
processing chain is outside my area of expertise.


Maybe someone else can give advice on how to attempt such a thing.

Manuel

-Original Message-
From: Manuel Mall [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 11:09 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML

On Wednesday 05 July 2006 23:04, Matthew Brown wrote:

Manuel,

I believe you hit the problem on the head - the response prolog
says utf-8 but (according to Etherpeak) the BOM is ff/ef.
Coincidentally, by the time the response XML gets logged by axis,
these initial characters are logged as ef bf bd ef bf bd.

Matt,

what about the rest of the byte stream when you look at it in
Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded
(1 byte per char for all typical ascii characters)?

Manuel


Unfortunately we may be in a bit of a tough place with having the
producer of the XML change it; the customer whose web services we
are consuming doesn't seem to see any issue with this (as they are
fine with their .NET tools).

If it is the case where we are seeing a UTF-16 BOM but a prolog
that declares UTF-8; is there any way to instruct Axis/Xerces to
parse it as UTF-16? Sorry if this question doesn't make much sense,
but I'm not too familiar with how Axis and/or Xerces decide which
character encoding to use when reading the XML.

Thanks again
Matt

-Original Message-
From: Manuel Mall [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 10:58 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning
XML

On Wednesday 05 July 2006 22:16, Axel Bock wrote:

Yes, there is a work-around. It works if you encode the file with
UTF-8 (for example), and do not include the BOM at the beginning.
I use notepad++ for that task, where you can save in "UTF-8
without BOM".

The process for that is easy:
1. open the file in notepad++
2. mark everything via CTRL-A
3. cut (not copy!)
4. in the format menu, choose "ANSI" formatting and select "UTF
without BOM" at the bottom
5. paste
6. save.

that is a crap workaround, but works for me. for automatically
generated files . I dunno :-)
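
For automatically generated files, the same clean-up can be scripted instead of done by hand. A minimal sketch in Java, assuming the file is small enough to read into memory and that only the UTF-8 BOM (ef bb bf) needs removing; `BomStripper` is a hypothetical helper name, and `java.nio.file` needs Java 7 or later.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: removes a leading UTF-8 BOM (EF BB BF) if one is present. */
final class BomStripper {
    private BomStripper() {}

    /** Returns the data without its UTF-8 BOM, or the same array if there is no BOM. */
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    /** Rewrites the file only when a BOM was actually found. */
    static void stripInPlace(Path file) throws IOException {
        byte[] original = Files.readAllBytes(file);
        byte[] cleaned = stripUtf8Bom(original);
        if (cleaned != original) { // a new array means a BOM was removed
            Files.write(file, cleaned);
        }
    }
}
```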


Greetings,
Axel.


On 7/5/06, Matthew Brown < [EMAIL PROTECTED]
<mailto:[EMAIL PROTECTED]> > wrote:

Hi all,

I hate to do this, but can anyone please help me with either of
these issues? I've tried to upgrade Xerces to 2.8.0 but to no
avail.

Is there anything else I could be doing?

Just wondering if your file in question starts with hex 'ef bb bf'
or 'ff fe' or 'fe ff'. If it is one of the latter two forms I
believe you have a UTF-16 encoded file (little endian or big
endian), not UTF-8. If it is the 'ef bb bf' sequence then it starts
correctly with the UTF-8 encoded Unicode code point for BOM U+FEFF.
In all cases Xerces should be able to handle it. A problem may
arise if it starts with 'ff fe' but the XML prolog says
encoding="utf-8", as that is a contradiction, I believe.

I know this does not help directly but may help to check if the
problem is with the producer of the XML document or your consumer.

Manuel


What about the possibility of programmatically editing/cleaning
the response XML before it is given to the parser?
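
The cleaning step itself could be as simple as dropping whatever sits in front of the first '<'. A minimal sketch in Java, assuming the response is available as a String before parsing; `XmlCleaner` is a made-up name, and where to hook this into Axis is not shown here.

```java
/** Hypothetical helper: trims anything (BOM, replacement characters) before the first '<'. */
final class XmlCleaner {
    private XmlCleaner() {}

    /** Returns the message starting at its first '<', or unchanged if there is nothing to trim. */
    static String dropLeadingJunk(String message) {
        int idx = message.indexOf('<');
        // idx == 0: already clean; idx == -1: no markup at all, so leave it alone.
        return idx > 0 ? message.substring(idx) : message;
    }
}
```

This only removes stray leading characters. It does not help when the body itself was decoded with the wrong charset.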

Thanks
Matt

-Original Message-
From: Matthew Brown [mailto: [EMAIL PROTECTED]
<mailto:[EMAIL PROTECTED]> ]
Sent: Saturday, July 01, 2006 12:41 PM
To: axis-user@ws.apache.org <mailto:axis-user@ws.apache.org>
Subject: Two questions - BOM in UTF-8, and manually cleaning XML


1. From searching the mailing list archives, I see several
references to people having problems with Byte Order Mark
characters appearing before the prolog in their UTF-8 messages.
However I can't seem to find much of a known resolution to these
issues. Is there a standard/common workaround for these BOM and
UTF-8 issues?

2. If there is no answer to my #1, is there any way that Axis will
allow me to programmatically edit the response XML before it is
passed to the parser and de-serialized? I've tried adding
Handlers, but I'm

Re: Two questions - BOM in UTF-8, and manually cleaning XML

On Wednesday 05 July 2006 23:12, Matthew Brown wrote:
> Two bytes per char; Etherpeak is showing the second byte as 00.
>
Seems you are stuck between a "rock and a hard place" here. The byte 
stream appears to be correctly utf-16 encoded but the xml prolog says 
utf-8. Not sure what to recommend. Fix it at the source is obvious but 
not easily done. You may be able to write a handler that re-encodes the 
byte stream into utf-8 before giving it to the Axis stacks. But how to 
write such an Axis handler and how to hook it correctly into the Axis 
processing chain is outside my area of expertise.

Maybe someone else can give advice on how to attempt such a thing.
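
For what it is worth, the re-encoding step on its own is small. A minimal sketch in Java, assuming the whole response is available as a byte array and really is UTF-16 with a BOM; `Utf16ToUtf8` is a hypothetical helper name, `StandardCharsets` needs Java 7 or later, and the Axis hook-up is not shown.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: turns a UTF-16 byte stream (with BOM) into UTF-8 bytes. */
final class Utf16ToUtf8 {
    private Utf16ToUtf8() {}

    static byte[] reencode(byte[] utf16WithBom) {
        // Java's "UTF-16" decoder reads a leading BOM to pick the byte order
        // and does not include the BOM in the decoded text.
        String text = new String(utf16WithBom, StandardCharsets.UTF_16);
        // Encoding to UTF-8 this way does not write a BOM.
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

After this step the bytes agree with a prolog that says utf-8, so the declaration can stay as it is.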

Manuel
> -Original Message-
> From: Manuel Mall [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 11:09 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
> > Manuel,
> >
> > I believe you hit the problem on the head - the response prolog
> > says utf-8 but (according to Etherpeak) the BOM is ff/ef.
> > Coincidentally, by the time the response XML gets logged by axis,
> > these initial characters are logged as ef bf bd ef bf bd.
>
> Matt,
>
> what about the rest of the byte stream when you look at it in
> Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded
> (1 byte per char for all typical ascii characters)?
>
> Manuel
>
> > Unfortunately we may be in a bit of a tough place with having the
> > producer of the XML change it; the customer whose web services we
> > are consuming doesn't seem to see any issue with this (as they are
> > fine with their .NET tools).
> >
> > If it is the case where we are seeing a UTF-16 BOM but a prolog
> > that declares UTF-8; is there any way to instruct Axis/Xerces to
> > parse it as UTF-16? Sorry if this question doesn't make much sense,
> > but I'm not too familiar with how Axis and/or Xerces decide which
> > character encoding to use when reading the XML.
> >
> > Thanks again
> > Matt
> >
> > -Original Message-
> > From: Manuel Mall [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, July 05, 2006 10:58 AM
> > To: axis-user@ws.apache.org
> > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning
> > XML
> >
> > On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> > > Yes, there is a work-around. It works if you encode the file with
> > > UTF-8 (for example), and do not include the BOM at the beginning.
> > > I use notepad++ for that task, where you can save in "UTF-8
> > > without BOM".
> > >
> > > The process for that is easy:
> > > 1. open the file in notepad++
> > > 2. mark everything via CTRL-A
> > > 3. cut (not copy!)
> > > 4. in the format menu, choose "ANSI" formatting and select "UTF
> > > without BOM" at the bottom
> > > 5. paste
> > > 6. save.
> > >
> > > that is a crap workaround, but works for me. for automatically
> > > generated files . I dunno :-)
> > >
> > >
> > > Greetings,
> > > Axel.
> > >
> > >
> > > On 7/5/06, Matthew Brown < [EMAIL PROTECTED]
> > > <mailto:[EMAIL PROTECTED]> > wrote:
> > >
> > > Hi all,
> > >
> > > I hate to do this, but can anyone please help me with either of
> > > these issues? I've tried to upgrade Xerces to 2.8.0 but to no
> > > avail.
> > >
> > > Is there anything else I could be doing?
> >
> > Just wondering if your file in question starts with hex 'ef bb bf'
> > or 'ff fe' or 'fe ff'. If it is one of the latter two forms I
> > believe you have a UTF-16 encoded file (little endian or big
> > endian), not UTF-8. If it is the 'ef bb bf' sequence then it starts
> > correctly with the UTF-8 encoded Unicode code point for BOM U+FEFF.
> > In all cases Xerces should be able to handle it. A problem may
> > arise if it starts with 'ff fe' but the XML prolog says
> > encoding="utf-8", as that is a contradiction, I believe.
> >
> > I know this does not help directly but may help to check if the
> > problem is with the producer of the XML document or your consumer.
> >
> > Manuel
> >
> > > What about the possibility of programmatically editing/cleaning
> > > the response XML before it is given to the parser?
> > >
> > > Thanks
> > > Matt
> > >
> >

RE: Two questions - BOM in UTF-8, and manually cleaning XML

Two bytes per char; Etherpeak is showing the second byte as 00.

-Original Message-
From: Manuel Mall [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 11:09 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML


On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
> Manuel,
>
> I believe you hit the problem on the head - the response prolog says
> utf-8 but (according to Etherpeak) the BOM is ff/ef. Coincidentally,
> by the time the response XML gets logged by axis, these initial
> characters are logged as ef bf bd ef bf bd.
>
Matt,

what about the rest of the byte stream when you look at it in Etherpeak. 
Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded (1 byte per 
char for all typical ascii characters)?

Manuel
> Unfortunately we may be in a bit of a tough place with having the
> producer of the XML change it; the customer whose web services we are
> consuming doesn't seem to see any issue with this (as they are fine
> with their .NET tools).
>
> If it is the case where we are seeing a UTF-16 BOM but a prolog that
> declares UTF-8; is there any way to instruct Axis/Xerces to parse it
> as UTF-16? Sorry if this question doesn't make much sense, but I'm
> not too familiar with how Axis and/or Xerces decide which character
> encoding to use when reading the XML.
>
> Thanks again
> Matt
>
> -Original Message-
> From: Manuel Mall [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 10:58 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> > Yes, there is a work-around. It works if you encode the file with
> > UTF-8 (for example), and do not include the BOM at the beginning. I
> > use notepad++ for that task, where you can save in "UTF-8 without
> > BOM".
> >
> > The process for that is easy:
> > 1. open the file in notepad++
> > 2. mark everything via CTRL-A
> > 3. cut (not copy!)
> > 4. in the format menu, choose "ANSI" formatting and select "UTF
> > without BOM" at the bottom
> > 5. paste
> > 6. save.
> >
> > that is a crap workaround, but works for me. for automatically
> > generated files . I dunno :-)
> >
> >
> > Greetings,
> > Axel.
> >
> >
> > On 7/5/06, Matthew Brown < [EMAIL PROTECTED]
> > <mailto:[EMAIL PROTECTED]> > wrote:
> >
> > Hi all,
> >
> > I hate to do this, but can anyone please help me with either of
> > these issues? I've tried to upgrade Xerces to 2.8.0 but to no
> > avail.
> >
> > Is there anything else I could be doing?
>
> Just wondering if your file in question starts with hex 'ef bb bf'
> or 'ff fe' or 'fe ff'. If it is one of the latter two forms I believe
> you have a UTF-16 encoded file (little endian or big endian), not
> UTF-8. If it is the 'ef bb bf' sequence then it starts correctly with
> the UTF-8 encoded Unicode code point for BOM U+FEFF. In all cases
> Xerces should be able to handle it. A problem may arise if it starts
> with 'ff fe' but the XML prolog says encoding="utf-8", as that is a
> contradiction, I believe.
>
> I know this does not help directly but may help to check if the
> problem is with the producer of the XML document or your consumer.
>
> Manuel
>
> > What about the possibility of programmatically editing/cleaning the
> > response XML before it is given to the parser?
> >
> > Thanks
> > Matt
> >
> > -Original Message-
> > From: Matthew Brown [mailto: [EMAIL PROTECTED]
> > <mailto:[EMAIL PROTECTED]> ]
> > Sent: Saturday, July 01, 2006 12:41 PM
> > To: axis-user@ws.apache.org <mailto:axis-user@ws.apache.org>
> > Subject: Two questions - BOM in UTF-8, and manually cleaning XML
> >
> >
> > 1. From searching the mailing list archives, I see several
> > references to people having problems with Byte Order Mark
> > characters appearing before the prolog in their UTF-8 messages.
> > However I can't seem to find much of a known resolution to these
> > issues. Is there a standard/common workaround for these BOM and
> > UTF-8 issues?
> >
> > 2. If there is no answer to my #1, is there any way that Axis will
> > allow me to programmatically edit the response XML before it is passed
> > to the parser and de-serialized? I've tried adding Handlers, but
> > I'm assuming that the Handler comes into the picture after the
> > message is parsed, because my Handler is only ever seeing the
> > request message, and not the response.
> >
> > Thanks
> > Matt Brown
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]


Re: Two questions - BOM in UTF-8, and manually cleaning XML

On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
> Manuel,
>
> I believe you hit the problem on the head - the response prolog says
> utf-8 but (according to Etherpeak) the BOM is ff/ef. Coincidentally,
> by the time the response XML gets logged by axis, these initial
> characters are logged as ef bf bd ef bf bd.
>
Matt,

what about the rest of the byte stream when you look at it in Etherpeak. 
Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded (1 byte per 
char for all typical ascii characters)?

Manuel
> Unfortunately we may be in a bit of a tough place with having the
> producer of the XML change it; the customer whose web services we are
> consuming doesn't seem to see any issue with this (as they are fine
> with their .NET tools).
>
> If it is the case where we are seeing a UTF-16 BOM but a prolog that
> declares UTF-8; is there any way to instruct Axis/Xerces to parse it
> as UTF-16? Sorry if this question doesn't make much sense, but I'm
> not too familiar with how Axis and/or Xerces decide which character
> encoding to use when reading the XML.
>
> Thanks again
> Matt
>
> -Original Message-
> From: Manuel Mall [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 10:58 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> > Yes, there is a work-around. It works if you encode the file with
> > UTF-8 (for example), and do not include the BOM at the beginning. I
> > use notepad++ for that task, where you can save in "UTF-8 without
> > BOM".
> >
> > The process for that is easy:
> > 1. open the file in notepad++
> > 2. mark everything via CTRL-A
> > 3. cut (not copy!)
> > 4. in the format menu, choose "ANSI" formatting and select "UTF
> > without BOM" at the bottom
> > 5. paste
> > 6. save.
> >
> > that is a crap workaround, but works for me. for automatically
> > generated files . I dunno :-)
> >
> >
> > Greetings,
> > Axel.
> >
> >
> > On 7/5/06, Matthew Brown < [EMAIL PROTECTED]
> > <mailto:[EMAIL PROTECTED]> > wrote:
> >
> > Hi all,
> >
> > I hate to do this, but can anyone please help me with either of
> > these issues? I've tried to upgrade Xerces to 2.8.0 but to no
> > avail.
> >
> > Is there anything else I could be doing?
>
> Just wondering if your file in question starts with hex 'ef bb bf'
> or 'ff fe' or 'fe ff'. If it is one of the latter two forms I believe
> you have a UTF-16 encoded file (little endian or big endian), not
> UTF-8. If it is the 'ef bb bf' sequence then it starts correctly with
> the UTF-8 encoded Unicode code point for BOM U+FEFF. In all cases
> Xerces should be able to handle it. A problem may arise if it starts
> with 'ff fe' but the XML prolog says encoding="utf-8", as that is a
> contradiction, I believe.
>
> I know this does not help directly but may help to check if the
> problem is with the producer of the XML document or your consumer.
>
> Manuel
>
> > What about the possibility of programmatically editing/cleaning the
> > response XML before it is given to the parser?
> >
> > Thanks
> > Matt
> >
> > -Original Message-
> > From: Matthew Brown [mailto: [EMAIL PROTECTED]
> > <mailto:[EMAIL PROTECTED]> ]
> > Sent: Saturday, July 01, 2006 12:41 PM
> > To: axis-user@ws.apache.org <mailto:axis-user@ws.apache.org>
> > Subject: Two questions - BOM in UTF-8, and manually cleaning XML
> >
> >
> > 1. From searching the mailing list archives, I see several
> > references to people having problems with Byte Order Mark
> > characters appearing before the prolog in their UTF-8 messages.
> > However I can't seem to find much of a known resolution to these
> > issues. Is there a standard/common workaround for these BOM and
> > UTF-8 issues?
> >
> > 2. If there is no answer to my #1, is there any way that Axis will
> > allow me to programmatically edit the response XML before it is passed
> > to the parser and de-serialized? I've tried adding Handlers, but
> > I'm assuming that the Handler comes into the picture after the
> > message is parsed, because my Handler is only ever seeing the
> > request message, and not the response.
> >
> > Thanks
> > Matt Brown
>

RE: Two questions - BOM in UTF-8, and manually cleaning XML

Manuel,

I believe you hit the problem on the head - the response prolog says utf-8 but 
(according to Etherpeak) the BOM is ff/ef. Coincidentally, by the time the 
response XML gets logged by axis, these initial characters are logged as ef bf 
bd ef bf bd.

Unfortunately we may be in a bit of a tough place with having the producer of 
the XML change it; the customer whose web services we are consuming doesn't 
seem to see any issue with this (as they are fine with their .NET tools).

If it is the case where we are seeing a UTF-16 BOM but a prolog that declares 
UTF-8; is there any way to instruct Axis/Xerces to parse it as UTF-16? Sorry if 
this question doesn't make much sense, but I'm not too familiar with how Axis 
and/or Xerces decide which character encoding to use when reading the XML.

Thanks again
Matt

-Original Message-
From: Manuel Mall [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 10:58 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML

On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> Yes, there is a work-around. It works if you encode the file with
> UTF-8 (for example), and do not include the BOM at the beginning. I
> use notepad++ for that task, where you can save in "UTF-8 without
> BOM".
>
> The process for that is easy:
> 1. open the file in notepad++
> 2. mark everything via CTRL-A
> 3. cut (not copy!)
> 4. in the format menu, choose "ANSI" formatting and select "UTF
> without BOM" at the bottom
> 5. paste
> 6. save.
>
> that is a crap workaround, but works for me. for automatically
> generated files . I dunno :-)
>
>
> Greetings,
> Axel.
>
>
> On 7/5/06, Matthew Brown < [EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]> > wrote:
>
> Hi all,
>
> I hate to do this, but can anyone please help me with either of these
> issues? I've tried to upgrade Xerces to 2.8.0 but to no avail.
>
> Is there anything else I could be doing?

Just wondering if your file in question starts with hex 'ef bb bf' 
or 'ff fe' or 'fe ff'. If it is one of the latter two forms I believe 
you have a UTF-16 encoded file (little endian or big endian), not 
UTF-8. If it is the 'ef bb bf' sequence then it starts correctly with 
the UTF-8 encoded Unicode code point for BOM U+FEFF. In all cases 
Xerces should be able to handle it. A problem may arise if it starts 
with 'ff fe' but the XML prolog says encoding="utf-8", as that is a 
contradiction, I believe.

I know this does not help directly but may help to check if the problem 
is with the producer of the XML document or your consumer.
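
As a rough illustration of that check, the first few bytes of the response can be classified like this. It is only a sketch: `BomSniffer` and `detect` are made-up names, not Axis or Xerces API, and it uses the standard byte order marks, which are EF BB BF for UTF-8, FF FE for little-endian UTF-16 and FE FF for big-endian UTF-16.

```java
/** Hypothetical helper: guesses an encoding from the leading byte order mark, if any. */
final class BomSniffer {
    private BomSniffer() {}

    /** Returns the encoding implied by a BOM, or null when no known BOM is present. */
    static String detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        // Simplification: a UTF-32LE BOM (FF FE 00 00) would also match the next line.
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values, since Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

If the result disagrees with the encoding named in the prolog, the problem is on the producer side.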

Manuel
>
> What about the possibility of programmatically editing/cleaning the
> response XML before it is given to the parser?
>
> Thanks
> Matt
>
> -----Original Message-
> From: Matthew Brown [mailto: [EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]> ]
> Sent: Saturday, July 01, 2006 12:41 PM
> To: axis-user@ws.apache.org <mailto:axis-user@ws.apache.org>
> Subject: Two questions - BOM in UTF-8, and manually cleaning XML
>
>
> 1. From searching the mailing list archives, I see several references
> to people having problems with Byte Order Mark characters appearing
> before the prolog in their UTF-8 messages. However I can't seem to
> find much of a known resolution to these issues. Is there a
> standard/common workaround for these BOM and UTF-8 issues?
>
> 2. If there is no answer to my #1, is there any way that Axis will
> allow me to programmatically edit the response XML before it is passed
> to the parser and de-serialized? I've tried adding Handlers, but I'm
> assuming that the Handler comes into the picture after the message is
> parsed, because my Handler is only ever seeing the request message,
> and not the response.
>
> Thanks
> Matt Brown


RE: Two questions - BOM in UTF-8, and manually cleaning XML

Hi Davanum

Sorry if I didn't give all of the details before - we are using Axis as a 
client and communicating with an ASP.NET (v1.1) server.

Just for testing, we built a client in .NET off of the same WSDL, and although 
the response XML/data from the service looks the same, .NET was somehow able to 
parse it fine.

So at this point I'm not sure whether this is a problem I should be tackling in 
Axis or somehow through the XML parser. In my searches I've found some previous 
discussion of this problem on the list, but no known solution posted.

Thanks
Matt

-----Original Message-----
From: Davanum Srinivas [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 05, 2006 10:44 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML


Matthew,

Is this from a non-axis web service? and you are having problems with
an axis client?

-- dims

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
>
>
> Alex,
>
> The problem I am having is with the SOAP response from the web service; so
> I'm not really sure how we'd be saving that to a file... this isn't a static
> piece of text.
>
> -----Original Message-----
> From: Axel Bock [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, July 05, 2006 10:17 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> Yes, there is a work-around. It works if you encode the file with UTF-8 (for
> example), and do not include the BOM at the beginning. I use notepad++ for
> that task, where you can save in "UTF-8 without BOM".
>
> The process for that is easy:
> 1. open the file in notepad++
> 2. mark everything via CTRL-A
> 3. cut (not copy!)
> 4. in the format menu, choose "ANSI" formatting and select "UTF without BOM"
> at the bottom
> 5. paste
> 6. save.
>
> that is a crap workaround, but works for me. for automatically generated
> files . I dunno :-)
>
>
> Greetings,
> Axel.
>
>
> On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> >
> >
> >
> > Hi all,
> >
> > I hate to do this, but can anyone please help me with either of these
> > issues? I've tried to upgrade Xerces to 2.8.0 but to no avail.
> >
> > Is there anything else I could be doing?
> >
> > What about the possibility of programmatically editing/cleaning the
> > response XML before it is given to the parser?
> >
> > Thanks
> > Matt
> >
> > -----Original Message-----
> > From: Matthew Brown [mailto:[EMAIL PROTECTED]]
> > Sent: Saturday, July 01, 2006 12:41 PM
> > To: axis-user@ws.apache.org
> > Subject: Two questions - BOM in UTF-8, and manually cleaning XML
> >
> >
> > 1. From searching the mailing list archives, I see several references to
> > people having problems with Byte Order Mark characters appearing before the
> > prolog in their UTF-8 messages. However I can't seem to find much of a known
> > resolution to these issues. Is there a standard/common workaround for these
> > BOM and UTF-8 issues?
> >
> > 2. If there is no answer to my #1, is there any way that Axis will allow me
> > to programmatically edit the response XML before it is passed to the parser
> > and de-serialized? I've tried adding Handlers, but I'm assuming that the
> > Handler comes into the picture after the message is parsed, because my
> > Handler is only ever seeing the request message, and not the response.
> >
> > Thanks
> > Matt Brown
>
>


-- 
Davanum Srinivas : http://people.apache.org/~dims/


Re: Two questions - BOM in UTF-8, and manually cleaning XML

On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> Yes, there is a work-around. It works if you encode the file with
> UTF-8 (for example), and do not include the BOM at the beginning. I
> use notepad++ for that task, where you can save in "UTF-8 without
> BOM".
>
> The process for that is easy:
> 1. open the file in notepad++
> 2. mark everything via CTRL-A
> 3. cut (not copy!)
> 4. in the format menu, choose "ANSI" formatting and select "UTF
> without BOM" at the bottom
> 5. paste
> 6. save.
>
> that is a crap workaround, but works for me. for automatically
> generated files . I dunno :-)
>
>
> Greetings,
> Axel.
>
>
> On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
>
> Hi all,
>
> I hate to do this, but can anyone please help me with either of these
> issues? I've tried to upgrade Xerces to 2.8.0 but to no avail.
>
> Is there anything else I could be doing?

Just wondering if your file in question starts with hex 'ef bb bf', 
'ff fe' or 'fe ff'. If it is one of the latter two forms, I believe 
you have a UTF-16 encoded file (little endian or big endian, 
respectively), not UTF-8. If it is the 'ef bb bf' sequence, then it 
starts correctly with the UTF-8 encoding of the BOM code point U+FEFF. 
In all cases Xerces should be able to handle it. A problem may arise if 
it starts with 'ff fe' but the XML prolog says encoding="utf-8", as 
that is a contradiction, I believe.

I know this does not help directly but may help to check if the problem 
is with the producer of the XML document or your consumer.

Manuel
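
The byte check described above can be sketched in Java. This is only an illustration, not anything from Axis or Xerces: the class name `BomSniffer` and its `detect` method are made up for this example, and it uses the standard BOM byte sequences (EF BB BF for UTF-8, FF FE for UTF-16 little endian, FE FF for UTF-16 big endian). It does not try to tell UTF-32 apart from UTF-16.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: guesses an encoding from a leading byte order mark. */
final class BomSniffer {
    private BomSniffer() {}

    /** Returns the charset implied by a leading BOM, or null when no BOM is present. */
    static Charset detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        return null;
    }

    /** Compares the first bytes of data (as unsigned values) against prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Comparing the result of `BomSniffer.detect(firstBytes)` with the encoding named in the XML prolog (and in the HTTP Content-Type header) would show whether the producer is sending contradictory information.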
>
> What about the possibility of programmatically editing/cleaning the
> response XML before it is given to the parser?
>
> Thanks
> Matt
>
> -----Original Message-----
> From: Matthew Brown [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, July 01, 2006 12:41 PM
> To: axis-user@ws.apache.org
> Subject: Two questions - BOM in UTF-8, and manually cleaning XML
>
>
> 1. From searching the mailing list archives, I see several references
> to people having problems with Byte Order Mark characters appearing
> before the prolog in their UTF-8 messages. However I can't seem to
> find much of a known resolution to these issues. Is there a
> standard/common workaround for these BOM and UTF-8 issues?
>
> 2. If there is no answer to my #1, is there any way that Axis will
> allow me to programmatically edit the response XML before it is passed
> to the parser and de-serialized? I've tried adding Handlers, but I'm
> assuming that the Handler comes into the picture after the message is
> parsed, because my Handler is only ever seeing the request message,
> and not the response.
>
> Thanks
> Matt Brown


Re: Two questions - BOM in UTF-8, and manually cleaning XML

2006-07-05 Thread Axel Bock

Hi,

hm. Then... maybe you could write an Axis handler which actually modifies
the response buffer before Xerces kicks in. I don't know how to do that,
though, so you'd have to refer to some other guys who know better :-)

And, ah, it's AXEL ;-)

Greetings,
Axel.

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> Alex,
>
> The problem I am having is with the SOAP response from the web service; so
> I'm not really sure how we'd be saving that to a file... this isn't a static
> piece of text.
>
> -----Original Message-----
> From: Axel Bock [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, July 05, 2006 10:17 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> Yes, there is a work-around. It works if you encode the file with UTF-8 (for
> example), and do not include the BOM at the beginning. I use notepad++ for
> that task, where you can save in "UTF-8 without BOM".
>
> The process for that is easy:
> 1. open the file in notepad++
> 2. mark everything via CTRL-A
> 3. cut (not copy!)
> 4. in the format menu, choose "ANSI" formatting and select "UTF without BOM"
> at the bottom
> 5. paste
> 6. save.
>
> that is a crap workaround, but works for me. for automatically generated
> files . I dunno :-)
>
> Greetings,
> Axel.
>
> On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> > Hi all,
> >
> > I hate to do this, but can anyone please help me with either of these
> > issues? I've tried to upgrade Xerces to 2.8.0 but to no avail.
> >
> > Is there anything else I could be doing?
> >
> > What about the possibility of programmatically editing/cleaning the
> > response XML before it is given to the parser?
> >
> > Thanks
> > Matt
> >
> > -----Original Message-----
> > From: Matthew Brown [mailto:[EMAIL PROTECTED]]
> > Sent: Saturday, July 01, 2006 12:41 PM
> > To: axis-user@ws.apache.org
> > Subject: Two questions - BOM in UTF-8, and manually cleaning XML
> >
> > 1. From searching the mailing list archives, I see several references to
> > people having problems with Byte Order Mark characters appearing before the
> > prolog in their UTF-8 messages. However I can't seem to find much of a known
> > resolution to these issues. Is there a standard/common workaround for these
> > BOM and UTF-8 issues?
> >
> > 2. If there is no answer to my #1, is there any way that Axis will allow me
> > to programmatically edit the response XML before it is passed to the parser
> > and de-serialized? I've tried adding Handlers, but I'm assuming that the
> > Handler comes into the picture after the message is parsed, because my
> > Handler is only ever seeing the request message, and not the response.
> >
> > Thanks
> > Matt Brown
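
The cleaning step that such a handler would perform could look roughly like the sketch below. It assumes the response is already available as a Java String and only shows the clean-up itself; how to hook it into Axis so that it runs before parsing is not shown here. The class name `XmlPrefixCleaner` and the method `stripBeforeProlog` are hypothetical names chosen for this example.

```java
/** Hypothetical helper: removes anything that precedes the first '<' of an XML message. */
final class XmlPrefixCleaner {
    private XmlPrefixCleaner() {}

    /**
     * Returns the message starting at its first '<'. A decoded BOM (U+FEFF),
     * or the characters a mis-decoded BOM turns into, are dropped along with
     * any other leading text. Messages without '<' are returned unchanged.
     */
    static String stripBeforeProlog(String message) {
        if (message == null) return null;
        int idx = message.indexOf('<');
        if (idx <= 0) return message; // already starts with '<', or has no markup at all
        return message.substring(idx);
    }
}
```

Cutting at the first '<' rather than testing for U+FEFF alone also covers the case where the three BOM bytes were decoded with the wrong charset and show up as stray characters.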

Re: Two questions - BOM in UTF-8, and manually cleaning XML

Matthew,

Is this from a non-axis web service? and you are having problems with
an axis client?

-- dims

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:

Alex,

The problem I am having is with the SOAP response from the web service; so
I'm not really sure how we'd be saving that to a file... this isn't a static
piece of text.

-----Original Message-----
From: Axel Bock [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 05, 2006 10:17 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML

Yes, there is a work-around. It works if you encode the file with UTF-8 (for
example), and do not include the BOM at the beginning. I use notepad++ for
that task, where you can save in "UTF-8 without BOM".

The process for that is easy:
1. open the file in notepad++
2. mark everything via CTRL-A
3. cut (not copy!)
4. in the format menu, choose "ANSI" formatting and select "UTF without BOM"
at the bottom
5. paste
6. save.

that is a crap workaround, but works for me. for automatically generated
files . I dunno :-)

Greetings,
Axel.

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
>
>
>
> Hi all,
>
> I hate to do this, but can anyone please help me with either of these
> issues? I've tried to upgrade Xerces to 2.8.0 but to no avail.
>
> Is there anything else I could be doing?
>
> What about the possibility of programmatically editing/cleaning the
> response XML before it is given to the parser?
>
> Thanks
> Matt
>
> -----Original Message-----
> From: Matthew Brown [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, July 01, 2006 12:41 PM
> To: axis-user@ws.apache.org
> Subject: Two questions - BOM in UTF-8, and manually cleaning XML
>
>
> 1. From searching the mailing list archives, I see several references to
> people having problems with Byte Order Mark characters appearing before the
> prolog in their UTF-8 messages. However I can't seem to find much of a known
> resolution to these issues. Is there a standard/common workaround for these
> BOM and UTF-8 issues?
>
> 2. If there is no answer to my #1, is there any way that Axis will allow me
> to programmatically edit the response XML before it is passed to the parser
> and de-serialized? I've tried adding Handlers, but I'm assuming that the
> Handler comes into the picture after the message is parsed, because my
> Handler is only ever seeing the request message, and not the response.
>
> Thanks
> Matt Brown

--
Davanum Srinivas : http://people.apache.org/~dims/


RE: Two questions - BOM in UTF-8, and manually cleaning XML




Alex,

The problem I am having is with the SOAP response from the web service; so
I'm not really sure how we'd be saving that to a file... this isn't a static
piece of text.

-----Original Message-----
From: Axel Bock [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 05, 2006 10:17 AM
To: axis-user@ws.apache.org
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML

Yes, there is a work-around. It works if you encode the file with UTF-8 (for
example), and do not include the BOM at the beginning. I use notepad++ for
that task, where you can save in "UTF-8 without BOM".

The process for that is easy:
1. open the file in notepad++
2. mark everything via CTRL-A
3. cut (not copy!)
4. in the format menu, choose "ANSI" formatting and select "UTF without BOM"
at the bottom
5. paste
6. save.

that is a crap workaround, but works for me. for automatically generated
files . I dunno :-)

Greetings,
Axel.

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I hate to do this, but can anyone please help me with either of these
> issues? I've tried to upgrade Xerces to 2.8.0 but to no avail.
>
> Is there anything else I could be doing?
>
> What about the possibility of programmatically editing/cleaning the
> response XML before it is given to the parser?
>
> Thanks
> Matt
>
> -----Original Message-----
> From: Matthew Brown [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, July 01, 2006 12:41 PM
> To: axis-user@ws.apache.org
> Subject: Two questions - BOM in UTF-8, and manually cleaning XML
>
> 1. From searching the mailing list archives, I see several references to
> people having problems with Byte Order Mark characters appearing before the
> prolog in their UTF-8 messages. However I can't seem to find much of a known
> resolution to these issues. Is there a standard/common workaround for these
> BOM and UTF-8 issues?
>
> 2. If there is no answer to my #1, is there any way that Axis will allow me
> to programmatically edit the response XML before it is passed to the parser
> and de-serialized? I've tried adding Handlers, but I'm assuming that the
> Handler comes into the picture after the message is parsed, because my
> Handler is only ever seeing the request message, and not the response.
>
> Thanks
> Matt Brown

Re: Two questions - BOM in UTF-8, and manually cleaning XML

2006-07-05 Thread Axel Bock

Yes, there is a work-around. It works if you encode the file with UTF-8 (for
example), and do not include the BOM at the beginning. I use notepad++ for
that task, where you can save in "UTF-8 without BOM".

The process for that is easy:
1. open the file in notepad++
2. mark everything via CTRL-A
3. cut (not copy!)
4. in the format menu, choose "ANSI" formatting and select "UTF without BOM"
at the bottom
5. paste
6. save.

that is a crap workaround, but works for me. for automatically generated
files . I dunno :-)

Greetings,
Axel.

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I hate to do this, but can anyone please help me with either of these
> issues? I've tried to upgrade Xerces to 2.8.0 but to no avail.
>
> Is there anything else I could be doing?
>
> What about the possibility of programmatically editing/cleaning the
> response XML before it is given to the parser?
>
> Thanks
> Matt
>
> -----Original Message-----
> From: Matthew Brown [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, July 01, 2006 12:41 PM
> To: axis-user@ws.apache.org
> Subject: Two questions - BOM in UTF-8, and manually cleaning XML
>
> 1. From searching the mailing list archives, I see several references to
> people having problems with Byte Order Mark characters appearing before the
> prolog in their UTF-8 messages. However I can't seem to find much of a known
> resolution to these issues. Is there a standard/common workaround for these
> BOM and UTF-8 issues?
>
> 2. If there is no answer to my #1, is there any way that Axis will allow me
> to programmatically edit the response XML before it is passed to the parser
> and de-serialized? I've tried adding Handlers, but I'm assuming that the
> Handler comes into the picture after the message is parsed, because my
> Handler is only ever seeing the request message, and not the response.
>
> Thanks
> Matt Brown
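
For automatically generated content, where re-saving in an editor is not an option, the same "UTF-8 without BOM" result can be approximated in code. The sketch below is an assumption-laden illustration, not part of Axis or Xerces: the class `Utf8BomSkipper` and its method `skipUtf8Bom` are hypothetical, and it handles only the UTF-8 BOM (EF BB BF), leaving every other stream untouched.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/** Hypothetical helper: hides a leading UTF-8 BOM from whatever reads the stream next. */
final class Utf8BomSkipper {
    private Utf8BomSkipper() {}

    /** Wraps the stream; the first three bytes are consumed only if they are EF BB BF. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // A single read may return fewer bytes than requested, so loop until 3 or EOF.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) break;
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: give the bytes back
        }
        return pushback;
    }
}
```

The returned stream can then be handed to the parser in place of the original one, for example via `new org.xml.sax.InputSource(Utf8BomSkipper.skipUtf8Bom(rawStream))`, where `rawStream` stands for wherever the response bytes come from.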

RE: Two questions - BOM in UTF-8, and manually cleaning XML