Hi,

I've got a terrible headache... It happens every time I try to touch the
bugs related to encodings - any of them...

I'm sure you already know ( but I just found out ) what
"surrogate" characters are. I knew that Unicode is _not_ 16 bits, but I had no
idea the range reachable through UTF-16 is 21 bits ( as opposed to UCS-4 - 31 bits ).
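
To make the surrogate arithmetic concrete, here is a minimal sketch of how a code point above U+FFFF is split into a high/low surrogate pair. The class and method names are made up for illustration; only `Character.toChars` is a real JDK call, used as a cross-check.

```java
import java.util.Arrays;

final class SurrogateSketch {

    /** Splits a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair by hand. */
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = codePoint - 0x10000;                // 20 bits remain
        char high = (char) (0xD800 + (v >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // MUSICAL SYMBOL G CLEF, one of the "musical signs"
        char[] pair = toSurrogates(clef);
        System.out.printf("U+%X -> %04X %04X%n", clef, (int) pair[0], (int) pair[1]);
        // Cross-check the hand-rolled split against the JDK's own conversion.
        System.out.println(Arrays.equals(pair, Character.toChars(clef)));
    }
}
```

The largest code point, U+10FFFF, needs 21 bits, which is where the "21 bits" figure comes from.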

I'll try to get something working this weekend. Craig - you may want to
take a look: the code in "DefaultServlet" is creating a writer for each
encoding ( that's terribly expensive ), and doesn't seem to deal with
surrogates ( well, the second part is not a problem - I doubt anyone
would use hieroglyphs or musical signs in a URL ). 

Now, the biggest problem is, as usual, M$. For some strange reason, MSIE's
javascript encode() method is generating %XXXX sequences instead of %XX%XX
( as most would expect ). That means the whole decoding might have to be
rewritten in 3.3 ( Apache doesn't deal with that either ). 
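
A rough sketch of what a decoder accepting both forms could look like. It assumes the 4-digit sequences carry a `u` marker ( `%uXXXX`, the form I understand the legacy JavaScript escape() function to emit ); a bare `%XXXX` would be ambiguous with `%XX` followed by two literal hex digits. It also assumes the `%XX` bytes are UTF-8. The class and method names are hypothetical, this is not the Tomcat decoder.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class PercentDecoderSketch {

    /** Decodes %XX (taken as UTF-8 bytes) and %uXXXX (taken as one UTF-16 code unit). */
    static String decode(String s) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream(); // undecoded %XX bytes
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            boolean wide = c == '%' && i + 6 <= s.length()
                    && (s.charAt(i + 1) == 'u' || s.charAt(i + 1) == 'U');
            if (wide) {
                flush(pending, out);
                out.append((char) hex(s, i + 2, 4)); // may be half of a surrogate pair
                i += 6;
            } else if (c == '%' && i + 3 <= s.length()) {
                pending.write(hex(s, i + 1, 2));
                i += 3;
            } else {
                flush(pending, out);
                out.append(c); // literal character; a trailing lone '%' is kept as-is
                i++;
            }
        }
        flush(pending, out);
        return out.toString();
    }

    /** Parses len hex digits starting at from; rejects anything that is not a hex digit. */
    private static int hex(String s, int from, int len) {
        int value = 0;
        for (int k = from; k < from + len; k++) {
            int d = Character.digit(s.charAt(k), 16);
            if (d < 0) {
                throw new IllegalArgumentException("bad escape at index " + k);
            }
            value = value * 16 + d;
        }
        return value;
    }

    /** Converts the accumulated %XX bytes to characters, assuming UTF-8. */
    private static void flush(ByteArrayOutputStream pending, StringBuilder out) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), StandardCharsets.UTF_8));
            pending.reset();
        }
    }

    public static void main(String[] args) {
        // Both spellings of the same path should decode to the same string.
        System.out.println(decode("/caf%C3%A9").equals(decode("/caf%u00E9")));
    }
}
```

Collecting the `%XX` bytes before converting matters because one character may span several escapes; two consecutive `%uXXXX` escapes can likewise form one surrogate pair.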

Question: what should happen with the context path ? It is supposed to be
returned in the original form ( not decoded ) - but that can't work, as a
given path can be encoded in many ways. I'm also not sure what should
happen in web.xml and in server.xml ( where the path is defined ) - should we
use %xx encoded URLs ? But what would that mean for characters that have
multiple encodings ? 
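
A small illustration of the "many encodings" problem, using `java.net.URI` ( `getRawPath()` returns the path as sent, `getPath()` returns it with the escapes decoded ). The example paths and the `samePath` helper are invented for the sketch.

```java
import java.net.URI;

final class ContextPathSketch {

    /** Compares two request paths by their decoded form rather than their raw spelling. */
    static boolean samePath(String rawA, String rawB) {
        return URI.create(rawA).getPath().equals(URI.create(rawB).getPath());
    }

    public static void main(String[] args) {
        String a = "/my%20app";    // space escaped, letters literal
        String b = "/%6Dy%20app";  // the 'm' is escaped as well
        System.out.println(a.equals(b));     // raw spellings differ
        System.out.println(samePath(a, b));  // decoded paths agree
    }
}
```

So a context path compared in its raw form would miss requests that spell the same path differently, which is why the matching probably has to happen on the decoded form.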
 

The solution I have in mind right now is to do all the mappings, the
web.xml processing and all other internal operations with decoded
characters, while keeping the "original" form for the facade, so servlets
get what they expect.
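
Roughly what I mean, as a sketch only: a holder that keeps both forms, with mapping done on the decoded one. `DualFormPathSketch`, `lookup` and the mapping table are hypothetical names, not existing Tomcat classes, and the decoding here just borrows `java.net.URI`.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

final class DualFormPathSketch {
    final String original; // as received; what the facade would hand back to servlets
    final String decoded;  // used for all internal matching

    DualFormPathSketch(String original) {
        this.original = original;
        this.decoded = URI.create(original).getPath();
    }

    /** Looks up a handler name using the decoded form only (exact match, for brevity). */
    static String lookup(Map<String, String> mappings, DualFormPathSketch path) {
        return mappings.get(path.decoded);
    }

    public static void main(String[] args) {
        // Mappings (e.g. read from web.xml) are stored in decoded form.
        Map<String, String> mappings = new HashMap<>();
        mappings.put("/my app/index.html", "defaultHandler");

        DualFormPathSketch p = new DualFormPathSketch("/%6Dy%20app/index.html");
        System.out.println(lookup(mappings, p)); // matched via the decoded form
        System.out.println(p.original);          // still available untouched
    }
}
```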

Any ideas ? I'm not sure I can handle this.


Costin
