Re: [OT] Basic Authentication Failed with multibyte username

André Warnier Sun, 24 Jan 2010 06:23:14 -0800

Christopher Schultz wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


André,

(Marking OT because, well... just because).

On 1/22/2010 2:59 PM, Warnier wrote:

Christopher Schultz wrote:

That "authorization.getBytes()" is just asking for trouble, because it
uses the platform default encoding to convert characters to bytes. It
should be using US-ASCII, ISO-8859-1, or something like that.

-1
I don't think you have a problem there, because what you are decoding
into bytes there IS bytes (it is base64-encoded).


Maybe all character sets have bytes 0-127 the same as US-ASCII, but I
don't know about some of those I never see myself: Shift-JIS and all
those Asian encodings, etc. It would be better to be explicit.


With respect, I think you are mistaken here.
Base64 encoding is essentially a method to encode groups of 3 bytes into
groups of 4 bytes, in such a way that no byte in the resulting group
has the high bit set. (Use "octet" instead of "byte" if it is more
comfortable).
Basically, it was created in order to allow 8-bit data to be sent over
a 7-bit channel.
So there is no character set implication at all in either encoding or
decoding :
- to encode, you take each group of 3 bytes, and encode it into a group
of 4 bytes
- to decode, you take each group of 4 bytes, and decode it into a group
of 3 bytes.
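
The claim that Base64 output never has the high bit set, whatever the
input bytes are, can be checked directly. This is only a minimal sketch:
it assumes the java.util.Base64 class from Java 8 or later, and the class
and method names (Base64IsAscii, encodedIsSevenBit) are made up for the
illustration.

```java
import java.util.Base64;

class Base64IsAscii {
    // Returns true if every byte of the Base64 form of 'raw' is below 0x80.
    static boolean encodedIsSevenBit(byte[] raw) {
        byte[] encoded = Base64.getEncoder().encode(raw);
        for (byte b : encoded) {
            if ((b & 0x80) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Three arbitrary octets, all with the high bit set.
        byte[] raw = {(byte) 0xC3, (byte) 0xA9, (byte) 0xFF};
        System.out.println(Base64.getEncoder().encodeToString(raw)); // w6n/
        System.out.println(encodedIsSevenBit(raw));                  // true
    }
}
```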

So maybe the "authorization.getBytes()" above is wrong intellectually
(if it implies that "authorization" is some kind of string expressed in
a character set). The Base64-encoded "string" should really be read as
bytes, because that is what it is.

The next step after the base64-decoding is where it matters : now we
have an array of bytes with values 0-255, and we have to interpret it
into a "userid:password" string which /might/ be us-ascii or iso-8859-1,
but might also be something else.

But it is impossible to know which character set the browser used,
just by examining that series of bytes.  Inherently, nothing
distinguishes a series of bytes from another, and they could just as
well represent an iso-8859-1 string, as an iso-8859-2,3,4,5.. or a UTF-8
string.
You can examine a series of bytes and tell whether it could
be a valid UTF-8 string (because some byte sequences are not possible
under UTF-8).  But even if it could be valid UTF-8, that does not mean
that it is UTF-8; and distinguishing different iso-8859-x byte sequences
from one another is totally impossible.
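
The "could it be valid UTF-8" test mentioned above can be sketched in
Java with a strict CharsetDecoder. The class and method names
(Utf8Check, couldBeUtf8) are invented for the example, and a "true"
answer only means the bytes are well-formed UTF-8, not that the sender
actually used UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // True if 'bytes' is a well-formed UTF-8 sequence.
    // This cannot prove that the sender really meant UTF-8.
    static boolean couldBeUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder met a byte sequence that UTF-8 does not allow.
            return false;
        }
    }
}
```

For instance, the octets E9 74 E9 (the iso-8859-1 form of "été") are
rejected, while C3 A9 74 C3 A9 are accepted.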


Example :

We receive a base64 authorization token, which once it is base64-decoded, results in the following series of octets shown in hex :

73 63 68 75 6C 74 7A 3A C3 A9 74 C3 A9
If we decode this as being utf-8, we get the string
schultz:été
and we would thus suppose that this userid is "schultz" and his password
is "été".
But if we decide that the origin character set was iso-8859-1, then we
would decode it into
schultz:Ã©tÃ©
and the user would still be "schultz", but his password would be "Ã©tÃ©"
(which would be an equally-valid password).
There is no way to decide in the absolute which decoding is "right",
in the absence of more information.
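
The same example, as a small Java sketch: the 13 octets are decoded
twice, once per charset, and the part after the first ":" is taken as
the password. The names TwoDecodings, TOKEN and passwordPart are only
for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TwoDecodings {
    // The base64-decoded octets from the example: 73 63 68 ... C3 A9
    static final byte[] TOKEN = {
        0x73, 0x63, 0x68, 0x75, 0x6C, 0x74, 0x7A, 0x3A,
        (byte) 0xC3, (byte) 0xA9, 0x74, (byte) 0xC3, (byte) 0xA9
    };

    // Decodes the octets with the given charset and returns what
    // follows the first ':' (the password part).
    static String passwordPart(byte[] bytes, Charset charset) {
        String credentials = new String(bytes, charset);
        return credentials.substring(credentials.indexOf(':') + 1);
    }

    public static void main(String[] args) {
        String asUtf8 = passwordPart(TOKEN, StandardCharsets.UTF_8);
        String asLatin1 = passwordPart(TOKEN, StandardCharsets.ISO_8859_1);
        // Same octets, two different passwords: 3 characters vs 5.
        System.out.println(asUtf8.length() + " vs " + asLatin1.length());
    }
}
```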


So there are only 2 choices possible :

1) the rules specify that the base64-decoded "userid:password"
string is always encoded using one specific charset.  In the case of
HTTP, this would have to be iso-8859-1.
(And in that case, HTTP Basic Authentication does not allow for
non-iso-8859-1 userid's and passwords, and too bad for 80% of the world
population)


or

2) the rules specify something like :
- if the base64-decoded authorization token does not start with the
iso-8859-1 characters "=?", then it is interpreted as iso-8859-1 (the
default)
- if it starts with "=?" and ends with "?=", then it is interpreted as
an rfc2047-encoded token, to be decoded using the charset indicated
after the leading "=?".
(And user-id's starting with "=?" are forbidden, but that's not a very
likely case nor a big limitation).
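
A rough Java sketch of what such a rule could look like, purely as an
illustration of option 2: it is not something the HTTP specification
defines. It assumes the RFC 2047 encoded-word layout
"=?charset?encoding?text?=", handles only the "B" (base64) encoding,
and uses invented names (CredentialCharsetRule, decodeCredentials).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class CredentialCharsetRule {
    // 'decoded' is the Authorization token AFTER the outer base64-decoding.
    static String decodeCredentials(byte[] decoded) {
        // Default: interpret the octets as iso-8859-1.
        String latin1 = new String(decoded, StandardCharsets.ISO_8859_1);
        boolean wrapped = latin1.length() >= 4
                && latin1.startsWith("=?") && latin1.endsWith("?=");
        if (!wrapped) {
            return latin1;
        }
        // Between the markers we expect: charset ? encoding ? text
        String[] parts = latin1.substring(2, latin1.length() - 2).split("\\?", -1);
        if (parts.length != 3 || !parts[1].equalsIgnoreCase("B")) {
            throw new IllegalArgumentException("unsupported encoded-word");
        }
        Charset charset = Charset.forName(parts[0]); // e.g. UTF-8
        byte[] inner = Base64.getDecoder().decode(parts[2]);
        return new String(inner, charset);
    }
}
```

A token that does not start with "=?" goes through unchanged as
iso-8859-1, so existing clients would not be affected by the rule.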


So back to Gábor's original problem :

His specific "client" is not a browser, and it allows a user:password
string to contain non-iso-8859-1 characters, and it encodes it in UTF-8,
prior to encoding it with base64.


At the Tomcat level :

If Gábor modifies the Tomcat container-managed Basic Authentication
code, so that it will first base64-decode the token, then convert it to
a string using UTF-8 encoding, that will work for requests from this
special client. But it will break with any other client.

If Gábor can distinguish requests from this special client, from
requests from standard clients, then he could make the UTF-8 decoding
conditional on where the request comes from.
If this is done in the container-based Basic Authentication code, then
it would still result in a non-standard Tomcat, but at least it would
not break with normal clients.
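
The conditional decoding could look roughly like the Java sketch below.
How the special client is recognised is an assumption: here it is a
hypothetical User-Agent prefix ("SpecialClient/"), and the class and
method names are invented. This is not Tomcat's actual authenticator
code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class ConditionalBasicDecoder {
    // Hypothetical marker identifying the non-browser client.
    static final String SPECIAL_CLIENT_PREFIX = "SpecialClient/";

    // UTF-8 for the special client, iso-8859-1 for everybody else.
    static Charset charsetFor(String userAgent) {
        boolean special = userAgent != null
                && userAgent.startsWith(SPECIAL_CLIENT_PREFIX);
        return special ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1;
    }

    // Takes the value of the Authorization header ("Basic xxxx") and
    // returns the "userid:password" string.
    static String decodeBasic(String authorizationHeader, String userAgent) {
        String scheme = "Basic ";
        if (authorizationHeader == null
                || !authorizationHeader.regionMatches(true, 0, scheme, 0, scheme.length())) {
            throw new IllegalArgumentException("not a Basic Authorization header");
        }
        String token = authorizationHeader.substring(scheme.length()).trim();
        byte[] octets = Base64.getDecoder().decode(token);
        return new String(octets, charsetFor(userAgent));
    }
}
```

The same header value then yields "schultz:été" for the special client
and "schultz:Ã©tÃ©" for any other one.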

If Gábor drops the container-based authentication, and uses a servlet
filter like SecurityFilter (modified the same way), then that would have
the advantage of keeping a standard Tomcat, and also of working with
other servlet containers.

But if Gábor can modify the client to first encode the token following
RFC 2047, and then modify the Tomcat container-based Basic
Authentication code to handle it as suggested above, then he could
probably claim the first client/server combination which is totally
spec-compliant.

;-)


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org
