Christopher Schultz wrote:

André,

(Marking OT because, well... just because).

On 1/22/2010 2:59 PM, Warnier wrote:
Christopher Schultz wrote:
That "authorization.getBytes()" is just asking for trouble, because it
uses the platform default encoding to convert characters to bytes. It
should be using US-ASCII, ISO-8859-1, or something like that.
-1
I don't think you have a problem there, because what you are decoding
into bytes there IS bytes (it is base64-encoded).

Maybe all character sets have bytes 0-127 the same as US-ASCII, but I
don't know about some of those I never see myself: Shift-JIS and all
those Asian encodings, etc. It would be better to be explicit.

With respect, I think you are mistaken here.
Base64 encoding is essentially a method to encode groups of 3 bytes into
groups of 4 bytes, in such a way that no byte in the resulting group
has the high bit set. (Use "octet" instead of "byte" if it is more
comfortable).
Basically, it was created in order to allow 8-bit data to be
sent over a 7-bit channel.
So there is no character set implication at all in either encoding or
decoding :
- to encode, you take each group of 3 bytes, and encode it into a group
of 4 bytes (padding the last group with "=" if it is short)
- to decode, you take each group of 4 bytes, and decode it into a group
of 3 bytes.
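
A minimal sketch of the base64 step on its own, assuming Java 8 or later for java.util.Base64 (which is newer than this thread). It shows 3 input octets, two of them with the high bit set, becoming 4 output characters that all fit in 7 bits. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class Base64GroupingDemo {
    // Encode raw octets into base64 text.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Decode base64 text back into octets. The text is converted to bytes
    // with an explicit US-ASCII charset instead of the platform default.
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text.getBytes(StandardCharsets.US_ASCII));
    }

    // True if no character in the string needs more than 7 bits.
    static boolean isSevenBit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xC3, (byte) 0xA9, (byte) 0x74 }; // 3 octets
        String encoded = encode(raw);
        System.out.println(encoded + " (length " + encoded.length() + ")");
        System.out.println("7-bit only: " + isSevenBit(encoded));
        System.out.println("round trip ok: " + Arrays.equals(raw, decode(encoded)));
    }
}
```

Since the encoded text only ever contains characters from the base64 alphabet, an explicit US-ASCII conversion costs nothing and removes the dependence on the platform default.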

So maybe the "authorization.getBytes()" above is wrong intellectually
(if it implies that "authorization" is some kind of string expressed in
a character set). The Base64-encoded "string" should really be read as bytes, because that is what it is.

The next step after the base64-decoding is where it matters : now we have an array of bytes with values 0-255, and we have to interpret it into a "userid:password" string which /might/ be us-ascii or iso-8859-1, but might also be something else.
But it is impossible to know which character set the browser used,
just by examining that series of bytes.  Inherently, nothing
distinguishes a series of bytes from another, and they could just as
well represent an iso-8859-1 string, as an iso-8859-2,3,4,5.. or a UTF-8
string.
You can examine a series of bytes and tell whether it could
be a valid UTF-8 string (because some byte sequences are not possible
under UTF-8).  But the fact that it could be valid UTF-8 does not mean that
it is UTF-8; and distinguishing different iso-8859-x byte sequences from one another is totally impossible.
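
The "could it be valid UTF-8" check can be sketched in Java with a strict decoder. This is only an illustration with made-up names; a true result means the octets are well-formed UTF-8, not that the sender actually used UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Plausibility {
    // Returns true if the octets form a well-formed UTF-8 sequence.
    // The decoder is told to REPORT bad input instead of silently
    // substituting a replacement character.
    static boolean couldBeUtf8(byte[] octets) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(octets));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8EAcute = { (byte) 0xC3, (byte) 0xA9 }; // "é" in UTF-8
        byte[] latin1EAcute = { (byte) 0xE9 };            // "é" in iso-8859-1
        System.out.println(couldBeUtf8(utf8EAcute));
        System.out.println(couldBeUtf8(latin1EAcute));
    }
}
```

Note that the two octets C3 A9 pass this check and are also a perfectly legal iso-8859-1 sequence, so the check can only rule UTF-8 out, never confirm it.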

Example :
We receive a base64 authorization token which, once it is base64-decoded, results in the following series of octets shown in hex :
73 63 68 75 6C 74 7A 3A C3 A9 74 C3 A9
If we decode this as being utf-8, we get the string
schultz:été
and we would thus suppose that this userid is "schultz" and his password
is "été".
But if we decide that the origin character set was iso-8859-1, then we
would decode it into
schultz:Ã©tÃ©
and the user would still be "schultz", but his password would be "Ã©tÃ©"
(which would be an equally-valid password).
There is no way to decide in the absolute which decoding is "right",
in the absence of more information.
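
The same example as a small Java sketch (class and method names are made up for illustration). It builds the 13 octets from the hex listing, then interprets them once as UTF-8 and once as iso-8859-1. Both conversions succeed, which is the whole problem.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class AmbiguousCredentials {
    // The octets from the example: 73 63 68 75 6C 74 7A 3A C3 A9 74 C3 A9
    static byte[] exampleOctets() {
        int[] hex = { 0x73, 0x63, 0x68, 0x75, 0x6C, 0x74, 0x7A, 0x3A,
                      0xC3, 0xA9, 0x74, 0xC3, 0xA9 };
        byte[] octets = new byte[hex.length];
        for (int i = 0; i < hex.length; i++) {
            octets[i] = (byte) hex[i];
        }
        return octets;
    }

    // Interpret the octets under a given charset and return the password,
    // i.e. everything after the first ':' of "userid:password".
    static String passwordUnder(byte[] octets, Charset charset) {
        String credentials = new String(octets, charset);
        return credentials.substring(credentials.indexOf(':') + 1);
    }

    public static void main(String[] args) {
        byte[] octets = exampleOctets();
        String asUtf8 = passwordUnder(octets, StandardCharsets.UTF_8);
        String asLatin1 = passwordUnder(octets, StandardCharsets.ISO_8859_1);
        System.out.println("UTF-8 password length:      " + asUtf8.length());
        System.out.println("iso-8859-1 password length: " + asLatin1.length());
        System.out.println("same password: " + asUtf8.equals(asLatin1));
    }
}
```

The lengths are printed rather than the strings themselves so that the output does not depend on the console encoding.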


So there are only 2 choices possible :

1) the rules specify that the base64-decoded "userid:password"
string is always encoded using one specific charset.  In the case of
HTTP, this would have to be iso-8859-1.
(And in that case, HTTP Basic Authentication does not allow for
non-iso-8859-1 userid's and passwords, and too bad for 80% of the world population)

or

2) the rules specify something like :
- if the base64-decoded authorization token does not start with the
iso-8859-1 characters "=?", then it is interpreted as iso-8859-1 (the default)
- if it starts with "=?" and ends with "?=", then it is interpreted as an RFC 2047-encoded token, to be decoded using the charset indicated after the leading "=?". (And user-id's starting with "=?" are forbidden, but that's not a very likely case nor a big limitation).
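
A rough Java sketch of what rule 2 could look like on the server side. This is a hypothetical scheme, not something any spec or Tomcat implements; the class and method names are invented, and only the "B" (base64) form of the RFC 2047 encoded-word is handled. The "Q" form and unknown charset names are left out to keep it short.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class CredentialCharsetSketch {
    // Decode a Basic auth token into "userid:password" following the
    // hypothetical rule: iso-8859-1 by default, unless the decoded token
    // looks like =?charset?B?base64-text?=
    static String decodeCredentials(String base64Token) {
        byte[] octets = Base64.getDecoder()
                .decode(base64Token.getBytes(StandardCharsets.US_ASCII));
        // iso-8859-1 maps every octet to exactly one character, so this
        // conversion cannot fail and is safe for inspecting the markers.
        String latin1 = new String(octets, StandardCharsets.ISO_8859_1);

        if (latin1.length() > 4 && latin1.startsWith("=?") && latin1.endsWith("?=")) {
            // Strip "=?" and "?=", then split into charset, encoding, text.
            String[] parts = latin1.substring(2, latin1.length() - 2).split("\\?", 3);
            if (parts.length == 3 && parts[1].equalsIgnoreCase("B")) {
                byte[] inner = Base64.getDecoder().decode(parts[2]);
                // Charset.forName throws for an unknown charset name.
                return new String(inner, Charset.forName(parts[0]));
            }
        }
        return latin1; // the default interpretation
    }

    public static void main(String[] args) {
        // Build a token the way a cooperating client might.
        String inner = Base64.getEncoder()
                .encodeToString("schultz:\u00e9t\u00e9".getBytes(StandardCharsets.UTF_8));
        String word = "=?UTF-8?B?" + inner + "?=";
        String token = Base64.getEncoder()
                .encodeToString(word.getBytes(StandardCharsets.US_ASCII));
        String decoded = decodeCredentials(token);
        System.out.println("password length: "
                + decoded.substring(decoded.indexOf(':') + 1).length());
    }
}
```

A token from a standard client never starts with "=?" (unless the userid does), so it falls through to the iso-8859-1 default and behaves as before.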

So back to Gábor's original problem :

His specific "client" is not a browser, and it allows a user:password string to contain non-iso-8859-1 characters, and it encodes it in UTF-8, prior to encoding it with base64.

At the Tomcat level :

If Gábor modifies the Tomcat container-managed Basic Authentication code, so that it will first base64-decode the token, then convert it to a string using UTF-8 encoding, that will work for requests from this special client. But it will break with any other client.

If Gábor can distinguish requests from this special client, from requests from standard clients, then he could make the UTF-8 decoding conditional on where the request comes from. If this is done in the container-based Basic Authentication code, then it would still result in a non-standard Tomcat, but at least it would not break with normal clients.
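
As an illustration of the conditional decoding, here is a sketch that picks the charset from the User-Agent header. How the special client can really be recognised is not known here; the "SpecialClient/" prefix and all names below are assumptions made up for the example, and the servlet plumbing around it is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class ClientAwareDecoder {
    // Hypothetical marker identifying the special client.
    static final String SPECIAL_CLIENT_PREFIX = "SpecialClient/";

    // UTF-8 for the special client, iso-8859-1 for everybody else.
    static Charset charsetFor(String userAgent) {
        boolean special = userAgent != null
                && userAgent.startsWith(SPECIAL_CLIENT_PREFIX);
        return special ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1;
    }

    // Base64-decode the token, then interpret the octets with the
    // charset chosen for this client.
    static String decodeCredentials(String base64Token, String userAgent) {
        byte[] octets = Base64.getDecoder()
                .decode(base64Token.getBytes(StandardCharsets.US_ASCII));
        return new String(octets, charsetFor(userAgent));
    }

    public static void main(String[] args) {
        String token = Base64.getEncoder()
                .encodeToString("schultz:\u00e9t\u00e9".getBytes(StandardCharsets.UTF_8));
        System.out.println(decodeCredentials(token, "SpecialClient/1.0").length());
        System.out.println(decodeCredentials(token, "Mozilla/5.0").length());
    }
}
```

The same helper could be called from the container's authentication code or from a filter; only the place where the header is read would differ.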

If Gábor drops the container-based authentication, and uses a servlet filter like SecurityFilter (modified the same way), then that would have the advantage of keeping a standard Tomcat, and also of working with other servlet containers.

But if Gábor can modify the client to first encode the token following RFC 2047, and then modify the Tomcat container-based Basic Authentication code to handle it as suggested above, then he could probably claim the first client/server combination which is totally spec-compliant.
;-)

