Re: Encoding issues for 2.0.8

Kenney Westerhof Fri, 22 Jun 2007 03:37:03 -0700

Hi,


Indeed, it's a case of doing new InputStreamReader( something, "encoding" ),
or a similar reader. Some work has been done on this, IIRC.

The problem is that you need to prescan the xml declaration, so you start
parsing until you get the first xml language element that is not a comment
(an xml element, in which case the encoding is utf-8, or
a doctype declaration, in which case the encoding is utf-8, or
a processing instruction: if it's the xml processing instruction, parse the
encoding attribute and use that; otherwise it's utf-8).
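
A minimal sketch of the prescan decision described above, in Java. The names
(XmlDeclEncoding, detect) are hypothetical, and it assumes the first chunk of
the file has already been decoded with an ASCII-compatible single-byte charset
such as ISO-8859-1, so the declaration can be inspected as text. It skips
leading comments as the description does; strictly, the XML spec requires the
declaration to be the very first thing in the file.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlDeclEncoding {
    // Matches encoding="..." or encoding='...' inside an XML declaration.
    private static final Pattern ENCODING =
        Pattern.compile("encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /**
     * Returns the encoding named in the XML declaration at the start of
     * 'head', or "UTF-8" when the first non-comment construct is anything
     * else (an element, a doctype, or another processing instruction).
     */
    public static String detect(String head) {
        int i = 0;
        while (true) {
            // Skip whitespace before the next construct.
            while (i < head.length() && Character.isWhitespace(head.charAt(i))) {
                i++;
            }
            if (!head.startsWith("<!--", i)) {
                break;
            }
            // Skip a leading comment; give up if it is not closed in 'head'.
            int end = head.indexOf("-->", i + 4);
            if (end < 0) {
                return "UTF-8";
            }
            i = end + 3;
        }
        // Only "<?xml" followed by whitespace is the XML declaration;
        // "<?xml-stylesheet ...?>" and similar are ordinary PIs.
        boolean isXmlDecl = head.startsWith("<?xml", i)
            && i + 5 < head.length()
            && Character.isWhitespace(head.charAt(i + 5));
        if (isXmlDecl) {
            int end = head.indexOf("?>", i);
            if (end > 0) {
                Matcher m = ENCODING.matcher(head.substring(i, end));
                if (m.find()) {
                    return m.group(1);
                }
            }
        }
        return "UTF-8";
    }
}
```

For example, detect applied to a file starting with
<?xml version="1.0" encoding="ISO-8859-1"?> yields "ISO-8859-1", while a file
starting directly with an element or a doctype falls back to "UTF-8".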

This isn't too hard to do, except you need to restart reading the xml file from
the start if the encoding is not utf-8. The real problem is in the APIs: you
cannot take a reader and restart it, since you cannot change the encoding on an
instantiated reader, and you certainly don't want to wrap it. You'd need access
to a raw InputStream that doesn't apply encoding transformations to the bytes,
wrap that in a Pushback something, and then rewrap it once you've found the
encoding.
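
One way the pushback-and-rewrap idea could look with plain java.io, as a rough
sketch rather than what plexus or maven actually does. It assumes the
"Pushback something" is java.io.PushbackInputStream; the class name
RestartableXmlReader, the 1024-byte peek size and the simplified sniff helper
are all hypothetical, and BOM handling is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RestartableXmlReader {
    // Assumption: the XML declaration fits in the first 1024 bytes.
    private static final int PEEK = 1024;
    private static final Pattern ENC =
        Pattern.compile("encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Peeks at the raw bytes, pushes them back, then wraps with the right charset. */
    public static Reader open(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, PEEK);
        byte[] buf = new byte[PEEK];
        int n = 0;
        // read() may return fewer bytes than asked for, so loop until full or EOF.
        while (n < PEEK) {
            int r = in.read(buf, n, PEEK - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        // "Restart" the stream: the peeked bytes will be read again by the reader.
        in.unread(buf, 0, n);
        // ISO-8859-1 maps every byte to one char, so the ASCII declaration survives.
        String head = new String(buf, 0, n, StandardCharsets.ISO_8859_1);
        return new InputStreamReader(in, sniff(head));
    }

    // Simplified: only looks at a declaration at the very start of the stream.
    static String sniff(String head) {
        if (head.startsWith("<?xml")) {
            int end = head.indexOf("?>");
            if (end > 0) {
                Matcher m = ENC.matcher(head.substring(0, end));
                if (m.find()) {
                    return m.group(1);
                }
            }
        }
        return "UTF-8";
    }
}
```

The point of the design is that the caller has to hand over an InputStream (or
a File), not a Reader, which is exactly the API change discussed here.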

I'm a bit fuzzy on all the java.io APIs, so we'll have to find the proper
class to use in the API so we can do this; a File would work.

Anyway, I once tried to fix this issue, but the API had to be changed and there
were just too many changes across plexus and maven at the time to push this
through.

-- Kenney

Hervé BOUTEMY wrote:

On Thursday 21 June 2007, Jason van Zyl wrote:
It seems like there are many problems with encoding that could be
easily solved with a couple tweaks to modello, specifically the
reader and writer, so I've scheduled these for 2.0.8. There are some
patches for these and hopefully Herve will work his magic with his
suggested fix. I like the idea of borrowing the idea from the Rome IO
utils to find the right encoding by default. That could easily be
integrated into modello. Herve if you need access to Modello we can
set you up.

I'm interested in working on that. Do I need Modello access, or other
components? I don't really know; these Modello things are the parts I haven't
really dived into for the moment.

The magic of the idea is that the encoding handling is not done by the parser,
but by the reader. Then, the code that has to change is the code creating the
Reader from a File: it must be changed from "new FileReader(file)" to
"new XmlReader(file)".

We need to:
1. choose where we put the XmlReader so that any code can use it when
   necessary. Or have a dependency on Rome: but all of Rome for only 1 class
   (even if this class is really great)...
2. change all code that creates a Reader for XML parsing

WDYT?
Thanks,

Jason

----------------------------------------------------------
Jason van Zyl
Founder and PMC Chair, Apache Maven
jason at sonatype dot com
----------------------------------------------------------




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]