RE: cvs commit: jakarta-tomcat RELEASE-PLAN-3.3.1.txt

costinm Mon, 04 Feb 2002 10:31:31 -0800

On Mon, 4 Feb 2002, Jonathan Reichhold wrote:

> /el-niño.jsp should be sent (per the w3c recommendation) as
> /el-ni%c3%b1o.jsp, i.e. with a UTF-8 encoded byte sequence for any
> character that isn't among the ~60 characters allowed from ASCII.  The
> encoding used for the byte conversion is not specified in the official
> URI spec (RFC 2396), but in December the w3c recommended that UTF-8 be
> used by all.  IE and Mozilla already appear to encode requests this way.
> The server is technically supposed to attempt to read the bytes as UTF-8
> and decode with the platform default as a fallback.


If UTF8 is sent - we're all happy, and %c3%b1 will be used in the
encoded url ( regardless of whether the request came url encoded or with
binary UTF8 in it ). That assumes the char encoding is UTF8 for
the body as well ( which it should be in any browser that supports
sending the URL as UTF8 ).
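
To make the byte-level difference concrete, here is a minimal sketch of
percent-encoding a path under a chosen charset. PathEncoder and
encodePath are hypothetical names, not Tomcat code, and the set of
"safe" characters is a simplified assumption rather than the full
RFC 2396 list.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Tomcat.
class PathEncoder {

    // Percent-encodes every byte that is not in a small, assumed-safe
    // ASCII set. The charset decides how non-ASCII characters map to bytes.
    static String encodePath(String path, Charset cs) {
        StringBuilder out = new StringBuilder();
        for (byte b : path.getBytes(cs)) {
            int c = b & 0xff;
            boolean safe = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || c == '/' || c == '-' || c == '_' || c == '.' || c == '~';
            if (safe) {
                out.append((char) c);
            } else {
                out.append(String.format("%%%02x", c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String path = "/el-ni\u00f1o.jsp"; // \u00f1 is the n-with-tilde
        System.out.println(encodePath(path, StandardCharsets.UTF_8));      // /el-ni%c3%b1o.jsp
        System.out.println(encodePath(path, StandardCharsets.ISO_8859_1)); // /el-ni%f1o.jsp
    }
}
```

The same character becomes two escaped bytes under UTF8 and one under
8859_1, which is why the server has to know ( or guess ) the charset.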

Having the body and the URL in different encodings is very problematic.
Regardless of W3C recommendations, the servlet spec requires 8859_1
if no encoding is detected ( which is a huge problem ).

The current code can deal with UTF8 correctly, but it can also
deal with old browsers that send the URL using the same encoding
as the body ( if you are on an 8859_2 browser, it's likely that will
be used for both; I doubt any of them will send UTF8 ).
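
One way to handle both cases could look like the sketch below: try a
strict UTF8 decode of the raw URL bytes first, and fall back to the
body's charset if the bytes are not valid UTF8. This is only an
illustration under that assumption, not what the Tomcat code actually
does; UrlBytesDecoder and decode are made-up names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Tomcat.
class UrlBytesDecoder {

    // raw: the URL bytes after %xx unescaping.
    // fallback: the charset to use when the bytes are not valid UTF-8,
    // e.g. the charset detected for the request body.
    static String decode(byte[] raw, Charset fallback) {
        try {
            // REPORT makes the decoder throw instead of silently
            // substituting U+FFFD for malformed input.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(raw, fallback);
        }
    }
}
```

The weak point of this approach is that some byte sequences in other
charsets happen to be valid UTF8 too, so the guess can be wrong.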


> For the record, /el-niño.jsp is /el-ni%f1o.jsp if the bytes are encoded
> via iso-latin-1.  Any character >0x7f isn't safe and will be encoded as
> 2-4 bytes under UTF-8.  Certain byte sequences are also reserved.  I've
> spent a long time with this trying to create truly internationalized
> code.

Great to have you on tomcat-dev !



> If you look at the Java 1.4 Release Candidate you will see that they now
> recognize in URLEncoder and URLDecoder that this is the correct behaviour.
> URLEncoder and URLDecoder have deprecated the methods that don't take an
> encoding.  I think they should default to UTF-8, but the default is the
> platform default.

On java's URLEncoder - yes, the default should be utf8 ( but it is the
platform default ). On servlets - no, the spec is clear about that,
the default is 8859_1, and there's little we can do about it ( except
complain, which we have done over the last year or so ).
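
For reference, a small sketch of the two-argument
java.net.URLEncoder.encode / URLDecoder.decode calls that take the
encoding explicitly. The wrapper class and method names are made up
for the example. Note that URLEncoder does form-style encoding
( space becomes "+" ) and emits uppercase hex.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical wrapper for illustration only.
class ExplicitCharsetDemo {

    // Encode with an explicitly named charset instead of the platform default.
    static String encode(String s, String enc) {
        try {
            return URLEncoder.encode(s, enc);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown encoding: " + enc, e);
        }
    }

    // Decode with an explicitly named charset.
    static String decode(String s, String enc) {
        try {
            return URLDecoder.decode(s, enc);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown encoding: " + enc, e);
        }
    }

    public static void main(String[] args) {
        String name = "el-ni\u00f1o";
        System.out.println(encode(name, "UTF-8"));      // el-ni%C3%B1o
        System.out.println(encode(name, "ISO-8859-1")); // el-ni%F1o
        // Decoding with the wrong charset does not give the original back.
        System.out.println(name.equals(decode(encode(name, "ISO-8859-1"), "UTF-8"))); // false
    }
}
```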

I spent a lot of time making sure all URLEncode/URLDecode are done
with the right charset, i.e. whatever is detected from the request
or session ( since most browsers today are just broken ).
You can override the default to UTF8 - but that breaks the servlet
spec and we can't ship with this setting on. And I'm sure
there are many bugs and cases the code can't handle.
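
The selection order described above could be sketched roughly as
follows: the charset from the request's Content-Type if present, then
one remembered in the session, then the 8859_1 default the servlet
spec requires. This is a simplified illustration and not the actual
Tomcat logic; RequestCharset, fromContentType and pick are
hypothetical names, and real header parsing has more corner cases.

```java
// Hypothetical helper for illustration only; not part of Tomcat.
class RequestCharset {

    // Default required by the servlet spec when nothing is detected.
    static final String SPEC_DEFAULT = "ISO-8859-1";

    // Extracts the charset parameter from a Content-Type header value,
    // or returns null if there is none.
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String v = p.substring(8).trim();
                if (v.length() >= 2 && v.startsWith("\"") && v.endsWith("\"")) {
                    v = v.substring(1, v.length() - 1);
                }
                return v.isEmpty() ? null : v;
            }
        }
        return null;
    }

    // Request header first, then the session value, then the spec default.
    static String pick(String contentType, String sessionCharset) {
        String cs = fromContentType(contentType);
        if (cs != null) {
            return cs;
        }
        if (sessionCharset != null) {
            return sessionCharset;
        }
        return SPEC_DEFAULT;
    }
}
```

Overriding the default to UTF8 would amount to changing SPEC_DEFAULT
in this sketch, which is exactly the part the spec does not allow.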


> The w3c has a good section on this at
> http://www.w3.org/International/O-URL-and-ident.html

Yes, but what's important for now is the reality that most software
is not designed with internationalization in mind ( and browsers
are the best example ) :-)

Not sending the charset header when a non-standard encoding is used
is absolutely stupid and against the HTTP/1.1 spec - but it's what we
have.

Costin


