Re: [OT] Basic Authentication Failed with multibyte username
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 André, On 1/24/2010 9:22 AM, André Warnier wrote: > Christopher Schultz wrote: > >> Maybe all character sets have bytes 0-127 the same as US-ASCII, but I >> don't know about some of those I never see myself: Shift-JS and all >> those Asian encodings, etc. It would be better to be explicit. > > With respect, I think you are mistaken here. > Base64 encoding is essentially a method to encode pairs of bytes into > triplets of bytes, in such a way that no byte in the resulting triplet > has the high bit set. (Use "octet" instead of "byte" if it is more > comfortable). It's more than that: it uses an explicit set of characters in the US-ASCII encoding as display. If you were to Base64 encode a string and then transmit it as EBCDIC, it would look the same to human eyes but have different underlying byte values (octets, if you prefer). > Basically, it was created in order to allow 8-bit character data to be > sent over an 7-bit channel. > So there is no character set implication at all in either encoding or > decoding : > - to encode, you take each group of 2 bytes, and encode it into a group > of 3 bytes > - to decode, you take each group of 3 bytes, and decode it into a group > of 2 bytes. Actually, I was wrong above: it's not a US-ASCII encoding. Instead, the byte values are an index into a string of characters, as described in the reference-less Wikipedia article: " The buffer is then used, six bits at a time, most significant first, as indices into the string: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/", and the indicated character is output. " So, in DBCDEC, the human reader would be confused :( > So maybe the "authorization.getBytes()" above is wrong intellectually > (if it implies that "authorization" is some kind of string expressed in > a character set). The Base64-encoded "string" should really be read as > bytes, because that is what it is. 
Fair enough, though the above string fits nicely into US-ASCII which, coincidentally, is the official encoding of HTTP headers :) > The next step after the base64-decoding is where it matters I agree, and here's where your arguments fall on deaf ears: each client does whatever it wants with regard to encoding of this data. The major web browsers don't even agree on what to do. Since the OP has his own client (right? or have I gotten confused with one or two other threads this week), he can do whatever he wants as long as the authentication mechanism agrees with the client. > But is is impossible to know which character set the browser used, > just by examining that series of bytes. Almost certainly true, although a tight client/server relationship could include a scheme to indicate the encoding in the value itself. Something like RFC2047, for instance. > So there are only 2 choices possible : > > 1) the rules specify that the base64-decoded "userid:password" > string is always encoded using one specific charset. In the case of > HTTP, this would have to be iso-8859-1. > (And in that case, HTTP Basic Authentication does not allow for > non-iso-8859-1 userid's and passwords, and too bad for 80% of the world > population) I disagree: the spec is unclear about the encoding used before the Base64 encoding. This is the source of the problem because clients have decided to take it upon themselves to decide what is best (UTF-8, page encoding, random encoding, no encoding, etc.). > 2) the rules specify something like : > - if the base64-decoded authorization token does not start with the > iso-8859-1 characters "=?", then it is interpreted as iso-8859-1 (the > default) > - if it starts with "=?" and ends with "?=", then it is interpreted as a > rfc2047-encoded token, to be decoded using the charset indicated after > the leading "=?". > (And user-id's starting with "=?" are forbidden, but that's not a very > likely case nor a big limitation). 
That would be a great implementation, but nobody appears to have done it. If the OP wants to use this strategy, he'll have to hack Tomcat's authenticator to accept this type of encoding... or use something like Securityfilter, again, with a patch to accept this type of encoding. > So back to Gábor's original problem : > > His specific "client" is not a browser, and it allows a user:password > string to contain non-iso-8859-1 characters, and it encodes it in UTF-8, > prior to encoding it with base64. Fortunately, he has control over the client, which is great. > At the Tomcat level : > > If Gábor modifies the Tomcat container-managed Basic Authentication > code, so that it will first base64-decode the token, then convert it to > a string using UTF-8 encoding, that will work for requests from this > special client. But it will break with any other client. +1 > If Gábor can distinguish requests from this special client, from > requests from standard clients, then he could make the UTF-8 decoding > conditional on where the request comes from. +1
Re: [OT] Basic Authentication Failed with multibyte username
Christopher Schultz wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 André, (Marking OT because, well... just because). On 1/22/2010 2:59 PM, Warnier wrote: Christopher Schultz wrote: That "authorization.getBytes()" is just asking for trouble, because it uses the platform default encoding to convert characters to bytes. It should be using US-ASCII, ISO-8859-1, or something like that. -1 I don't think you have a problem there, because what you are decoding into bytes there IS bytes (it is base64-encoded). Maybe all character sets have bytes 0-127 the same as US-ASCII, but I don't know about some of those I never see myself: Shift-JS and all those Asian encodings, etc. It would be better to be explicit. With respect, I think you are mistaken here. Base64 encoding is essentially a method to encode pairs of bytes into triplets of bytes, in such a way that no byte in the resulting triplet has the high bit set. (Use "octet" instead of "byte" if it is more comfortable). Basically, it was created in order to allow 8-bit character data to be sent over an 7-bit channel. So there is no character set implication at all in either encoding or decoding : - to encode, you take each group of 2 bytes, and encode it into a group of 3 bytes - to decode, you take each group of 3 bytes, and decode it into a group of 2 bytes. So maybe the "authorization.getBytes()" above is wrong intellectually (if it implies that "authorization" is some kind of string expressed in a character set). The Base64-encoded "string" should really be read as bytes, because that is what it is. The next step after the base64-decoding is where it matters : now we have an array of bytes with values 0-255, and we have to interpret it into a "userid:password" string which /might/ be us-ascii or iso-8859-1, but might also be something else. But is is impossible to know which character set the browser used, just by examining that series of bytes. 
Inherently, nothing distinguishes one series of bytes from another: they could just as well represent an iso-8859-1 string as an iso-8859-2, -3, -4, -5... or a UTF-8 string. You can examine a series of bytes and tell whether it could be a valid UTF-8 string (because some byte sequences are not possible under UTF-8). But even if it could be valid UTF-8, that does not mean that it is UTF-8; and distinguishing different iso-8859-x byte sequences from one another is totally impossible.

Example: we receive a base64 authorization token which, once it is base64-decoded, results in the following series of octets, shown in hex:

73 63 68 75 6C 74 7A 3A C3 A9 74 C3 A9

If we decode this as being utf-8, we get the string

schultz:été

and we would thus suppose that this userid is "schultz" and his password is "été". But if we decide that the original character set was iso-8859-1, then we would decode it into

schultz:Ã©tÃ©

and the user would still be "schultz", but his password would be "Ã©tÃ©" (which would be an equally valid password). There is no way to decide in the absolute which decoding is "right", in the absence of more information.

So there are only 2 choices possible:

1) the rules specify that the base64-decoded "userid:password" string is always encoded using one specific charset. In the case of HTTP, this would have to be iso-8859-1. (And in that case, HTTP Basic Authentication does not allow for non-iso-8859-1 userids and passwords, and too bad for 80% of the world population.)

or

2) the rules specify something like:
- if the base64-decoded authorization token does not start with the iso-8859-1 characters "=?", then it is interpreted as iso-8859-1 (the default)
- if it starts with "=?" and ends with "?=", then it is interpreted as an rfc2047-encoded token, to be decoded using the charset indicated after the leading "=?".
(And userids starting with "=?" are forbidden, but that's not a very likely case nor a big limitation.)
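The ambiguity in the octet example above can be reproduced in a few lines of Java. This is only a sketch: the class and method names (CharsetAmbiguity, asUtf8, asLatin1) are made up for illustration and are not part of Tomcat or SecurityFilter.

```java
import java.nio.charset.StandardCharsets;

public class CharsetAmbiguity {
    // The octets from the example: "schultz:" followed by C3 A9 74 C3 A9.
    static final byte[] OCTETS = {
        0x73, 0x63, 0x68, 0x75, 0x6C, 0x74, 0x7A, 0x3A,
        (byte) 0xC3, (byte) 0xA9, 0x74, (byte) 0xC3, (byte) 0xA9
    };

    // Interpret the octets as UTF-8: C3 A9 becomes one character (U+00E9).
    public static String asUtf8() {
        return new String(OCTETS, StandardCharsets.UTF_8);
    }

    // Interpret the same octets as ISO-8859-1: every octet is one character,
    // so C3 A9 becomes two characters (U+00C3 followed by U+00A9).
    public static String asLatin1() {
        return new String(OCTETS, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Both decodings succeed; only the lengths and characters differ.
        System.out.println(asUtf8().length());   // 11 characters
        System.out.println(asLatin1().length()); // 13 characters
    }
}
```

Neither call fails, which is the point: nothing in the bytes themselves tells the server which of the two strings the client meant.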
So back to Gábor's original problem : His specific "client" is not a browser, and it allows a user:password string to contain non-iso-8859-1 characters, and it encodes it in UTF-8, prior to encoding it with base64. At the Tomcat level : If Gábor modifies the Tomcat container-managed Basic Authentication code, so that it will first base64-decode the token, then convert it to a string using UTF-8 encoding, that will work for requests from this special client. But it will break with any other client. If Gábor can distinguish requests from this special client, from requests from standard clients, then he could make the UTF-8 decoding conditional on where the request comes from. If this is done in the container-based Basic Authentication code, then it would still result in a non-standard Tomcat, but at least it would not break with normal clients. If Gábor drops the container-based authentication, and uses a servlet filter like SecurityFilter (modified the same way), then that would have the advantage of keeping a standard Tomcat, and also of working
Re: [OT] Basic Authentication Failed with multibyte username
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 André, (Marking OT because, well... just because). On 1/22/2010 2:59 PM, Warnier wrote: > Christopher Schultz wrote: >> That "authorization.getBytes()" is just asking for trouble, because it >> uses the platform default encoding to convert characters to bytes. It >> should be using US-ASCII, ISO-8859-1, or something like that. > > -1 > I don't think you have a problem there, because what you are decoding > into bytes there IS bytes (it is base64-encoded). Maybe all character sets have bytes 0-127 the same as US-ASCII, but I don't know about some of those I never see myself: Shift-JS and all those Asian encodings, etc. It would be better to be explicit. >> It also calls the String constructor with a byte array without >> specifying the encoding, therefore using the platform default. > > +1 > That is indeed where you have a problem. There you SHOULD always decode > it as US-ASCII (or maybe iso-8859-1, I'm not quite sure what the spec > says exactly). - From my reading, the spec is silent but one can draw the conclusion that US-ASCII is basically all that is supported. I should all the capability of configuring this encoding to override the (soon to be) default of US-ASCII: if the user knows the client will use UTF-8, they should be allowed to force that encoding to be used. > Let's say that the spec is clear and says that the header value is > *TEXT, and that *TEXT is always US-ASCII (or ISO-8859-1) by default. > > Let's take it from the browser side first. > If the "userid:password" is indeed composed only of us-ascii characters, > then the browser base64-encodes this directly and it is trivial.(*) > > But let's say that "userid:password" is something else than us-ascii. > Another part of the spec says that then, you have to encode it according > to RFC2047. No, I don't think this is correct: the spec says that the HTTP header values must be in US-ASCII, and may be encoded using RFC2047 in order to achieve that. 
Since Base64 encoding always results in a US-ASCII-compatible value, there is no reason to involve RFC2047. > My contention is then that the browser should first RFC2047-encode > "userid:password", and then base64-encode the result. While that sounds like a good idea, it's almost certainly never done that way. > Back on the server side. > The server base64-decodes the authorization token, into an ascii string. > It can do that always, because either the string was ascii to start > with, or else it was not, but then it has been RFC2047-encoded, yelding > a result that is ascii. > (like : =?iso-8859-2?B?base64-encoded stuff...?= ) This would be a decent configurable setting for a BASIC authenticator... something like "allow-rfc2047" or whatever. What about those people who really want to have a username like "=?whatever" and a password like "whatever?="? They can't login? :) > The above, I believe, would be totally consistent with the current RFCs. Yes, but for whatever reason, nobody ever fully implements the RFCs :) There are standards and there are practices. In this case, I think practices outweigh the standards :) > But there is a major catch : I don't believe that there is a browser on > the market today, which "properly" encodes the "userid:password" string > via rfc2047 when it isn't ascii. Nor would it be appropriate to do so, because base64 encoding is /always/ used and will therefore /always/ result in a valid HTTP Authenticate header value. - -chris -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.10 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAktaFaQACgkQ9CaO5/Lv0PBMcACgpSL6QcBn6C2thQash4W/LIhg 5VgAn2hmTLmwdgk1HkhDxOshDDyZkBr0 =xBQs -END PGP SIGNATURE- - To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org For additional commands, e-mail: users-h...@tomcat.apache.org
Re: Basic Authentication Failed with multibyte username
Christopher Schultz wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 André, On 1/21/2010 6:35 PM, André Warnier wrote: Basically, I would tend to say that if the server knows who the clients are and vice-versa, you should be free to use any encoding you want, with the limitation that what is exchanged on the wire conforms to HTTP (because there may be proxies on the way which are not so tolerant). +1 What the client is sending is already (in a way) conformant to HTTP, because it is base64 encoded and so, on the surface, it does not contain non-ascii characters. +1 But the problem is that the standard Tomcat code which decodes the Basic Authorization header does not work in the way you want, for these illegal headers. And this code should preferably not be changed in a way which breaks the conformance with standard HTTP. Because if you do that, then your Tomcat becomes useless for anything else than your special client. +1 Another possibility would be to use something like SecurityFilter, which allows you to (more easily) write your own authenticator and realm implementations, and you could write a BasicAuthenticator that reads these specially-formatted credentials. I checked the sf source, and it looks like we might have a bug: private String decodeBasicAuthorizationString(String authorization) { if (authorization == null || !authorization.toLowerCase().startsWith("basic ")) { return null; } else { authorization = authorization.substring(6).trim(); // Decode and parse the authorization credentials return new String(Base64.decodeBase64(authorization.getBytes())); } } That "authorization.getBytes()" is just asking for trouble, because it uses the platform default encoding to convert characters to bytes. It should be using US-ASCII, ISO-8859-1, or something like that. -1 I don't think you have a problem there, because what you are decoding into bytes there IS bytes (it is base64-encoded). 
It also calls the String constructor with a byte array without specifying the encoding, therefore using the platform default. +1 That is indeed where you have a problem. There you SHOULD always decode it as US-ASCII (or maybe iso-8859-1, I'm not quite sure what the spec says exactly). Let's say that the spec is clear and says that the header value is *TEXT, and that *TEXT is always US-ASCII (or ISO-8859-1) by default. Let's take it from the browser side first. If the "userid:password" is indeed composed only of us-ascii characters, then the browser base64-encodes this directly and it is trivial.(*) But let's say that "userid:password" is something else than us-ascii. Another part of the spec says that then, you have to encode it according to RFC2047. My contention is then that the browser should first RFC2047-encode "userid:password", and then base64-encode the result. Back on the server side. The server base64-decodes the authorization token, into an ascii string. It can do that always, because either the string was ascii to start with, or else it was not, but then it has been RFC2047-encoded, yelding a result that is ascii. (like : =?iso-8859-2?B?base64-encoded stuff...?= ) Then the server must do another round of decoding via RFC2047. That consists of a double decoding again : base64-decode the string between the ?? into bytes, and then decode those bytes into Unicode, using the charset indicated at the beginning of the rfc2047-encoded sequence. The above, I believe, would be totally consistent with the current RFCs. But there is a major catch : I don't believe that there is a browser on the market today, which "properly" encodes the "userid:password" string via rfc2047 when it isn't ascii. And the OP's special client sends UTF-8, but also does not rfc2047-encode it. - To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org For additional commands, e-mail: users-h...@tomcat.apache.org
Re: Basic Authentication Failed with multibyte username
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 André, On 1/21/2010 6:35 PM, André Warnier wrote: > Basically, I would tend to say that if the server knows who the clients > are and vice-versa, you should be free to use any encoding you want, > with the limitation that what is exchanged on the wire conforms to HTTP > (because there may be proxies on the way which are not so tolerant). +1 > What the client is sending is already (in a way) conformant to HTTP, > because it is base64 encoded and so, on the surface, it does not contain > non-ascii characters. +1 > But the problem is that the standard Tomcat code which decodes the Basic > Authorization header does not work in the way you want, for these > illegal headers. > And this code should preferably not be changed in a way which breaks the > conformance with standard HTTP. > Because if you do that, then your Tomcat becomes useless for anything > else than your special client. +1 Another possibility would be to use something like SecurityFilter, which allows you to (more easily) write your own authenticator and realm implementations, and you could write a BasicAuthenticator that reads these specially-formatted credentials. I checked the sf source, and it looks like we might have a bug: private String decodeBasicAuthorizationString(String authorization) { if (authorization == null || !authorization.toLowerCase().startsWith("basic ")) { return null; } else { authorization = authorization.substring(6).trim(); // Decode and parse the authorization credentials return new String(Base64.decodeBase64(authorization.getBytes())); } } That "authorization.getBytes()" is just asking for trouble, because it uses the platform default encoding to convert characters to bytes. It should be using US-ASCII, ISO-8859-1, or something like that. It also calls the String constructor with a byte array without specifying the encoding, therefore using the platform default. 
Finally, this method is private, so a subclass cannot override it; being able to do so would be a nice feature. Maybe I'll fix all that. :)

> Or, you drop the container-managed security, and you use something like
> the SecurityFilter (http://securityfilter.sourceforge.net/), but read
> the homepage carefully first.

Note that the warning about BASIC authentication is waaay outdated: sf definitely does support BASIC auth.

-chris
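For illustration, here is one way the three points raised about decodeBasicAuthorizationString (platform-default getBytes(), platform-default String constructor, private visibility) might be addressed. This is a sketch and not the actual SecurityFilter fix: the class name BasicAuthDecoder and the credentialCharset parameter are assumptions, and java.util.Base64 (Java 8+) is swapped in for the commons-codec Base64 used in the quoted code so that the example is self-contained.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthDecoder {
    /**
     * Decodes "Basic <token>" into the "userid:password" string.
     * Not private, so a subclass can override it.
     * credentialCharset (hypothetical parameter) says how the decoded
     * octets are turned into characters, e.g. ISO-8859-1 or UTF-8.
     * Returns null if the header is not a usable Basic header.
     */
    public String decodeBasicAuthorizationString(String authorization,
                                                 Charset credentialCharset) {
        if (authorization == null
                || !authorization.regionMatches(true, 0, "Basic ", 0, 6)) {
            return null;
        }
        String token = authorization.substring(6).trim();
        try {
            // The token itself is US-ASCII by construction: be explicit.
            byte[] raw = Base64.getDecoder()
                    .decode(token.getBytes(StandardCharsets.US_ASCII));
            // Never rely on the platform default for the credentials.
            return new String(raw, credentialCharset);
        } catch (IllegalArgumentException e) {
            // Not valid base64.
            return null;
        }
    }
}
```

A caller that knows its clients send UTF-8 would pass StandardCharsets.UTF_8; anything else would pass StandardCharsets.ISO_8859_1 to stay with the default discussed in this thread.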
Re: Basic Authentication Failed with multibyte username
To get back to the underlying issue : Auth Gábor wrote: So... this is the real chaos... :) Yes. By the way, my users are not use HTML browsers, they are using JAX-WS in their client program, and the JAX-WS sends authentication data in UTF-8 (like Opera), because the default encoding is UTF-8 in the client JVM (and the server too). Basically, I would tend to say that if the server knows who the clients are and vice-versa, you should be free to use any encoding you want, with the limitation that what is exchanged on the wire conforms to HTTP (because there may be proxies on the way which are not so tolerant). What the client is sending is already (in a way) conformant to HTTP, because it is base64 encoded and so, on the surface, it does not contain non-ascii characters. And (I presume) you cannot change the code of the client, so it will continue to send these "invalid" headers with a UTF-8 value, base64-encoded. But the problem is that the standard Tomcat code which decodes the Basic Authorization header does not work in the way you want, for these illegal headers. And this code should preferably not be changed in a way which breaks the conformance with standard HTTP. Because if you do that, then your Tomcat becomes useless for anything else than your special client. An additional complication is that, if you want to use the embedded "container-managed" Tomcat authentication mechanisms, then you have to do something very early in the cycle, because that authentication takes place even before any servlet filter is invoked. Up to Tomcat 5.5, you would have to do this in a Valve then, which has the inconvenient that it is Tomcat-specific. (I think Tomcat 6 may give other options, maybe not Tomcat-specific.) Or, you drop the container-managed security, and you use something like the SecurityFilter (http://securityfilter.sourceforge.net/), but read the homepage carefully first. 
So, to be pragmatic, I would tend to go in the following direction:

- create a Valve which
  - checks the User-Agent. If it does not match your special client, do nothing. If it matches, then
  - get the Authorization header. If there is none, do nothing
  - else, decode its value properly into a Unicode string
  - re-encode this string in a way that fits with standard HTTP. For example, replace each character by a string like {XXXX}, where XXXX is the hex value of the Unicode codepoint of the character. (That is always valid us-ascii, but check the maximum length.)
  - re-encode the result using base64
  - replace the Authorization header value with this new string
- in your back-end authentication mechanism (I will suppose it is a database of userids/passwords), encode the userids/passwords the same way, and make this an alternate key

The embedded Tomcat authentication will then decode the new base64 string, split it into userid:password, and use them to verify the credentials, which will match.

If you do not like a Valve, then use a front-end server like Apache, and do the transformation of the header there, before the request is passed to Tomcat. Alternatively, you could also do the user authentication at the Apache level, and just pass the user-id to Tomcat. (Being an Apache/mod_perl guy myself, I find this last option much easier, but YMMV.)

And all that for a few Ö's and Á's and ß's.

Another option is to use a front-end Apache httpd server, which would modify the requests as follows. (I presume that you have a way to identify requests coming from this particular client, e.g. the User-Agent header.) Create a filter at the Apache level, which detects your special client. If it detects it, then it adds an additional header to the request
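The header transformation in the Valve steps above can be sketched as a plain string function, leaving the Valve plumbing itself out. Everything here is an assumption made for illustration: the class and method names, the choice to escape only non-ASCII characters, and the exact escape form (a brace-wrapped lowercase hex code point). A literal "{" in a userid or password is not handled, and the back-end store would have to apply the same escaping.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AuthHeaderRewriter {
    // Replace every non-ASCII code point by "{hex}", e.g. U+00E1 -> "{e1}".
    public static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append('{').append(Integer.toHexString(cp)).append('}');
            }
        });
        return out.toString();
    }

    // Takes the Authorization header value sent by the special client
    // (UTF-8 credentials, base64-encoded) and returns a pure-ASCII equivalent.
    // Headers that are not usable Basic headers are returned unchanged.
    public static String rewrite(String headerValue) {
        if (headerValue == null
                || !headerValue.regionMatches(true, 0, "Basic ", 0, 6)) {
            return headerValue;
        }
        byte[] raw;
        try {
            raw = Base64.getDecoder().decode(headerValue.substring(6).trim());
        } catch (IllegalArgumentException e) {
            return headerValue; // not valid base64: leave it alone
        }
        String credentials = new String(raw, StandardCharsets.UTF_8);
        String escaped = escapeNonAscii(credentials);
        return "Basic " + Base64.getEncoder()
                .encodeToString(escaped.getBytes(StandardCharsets.US_ASCII));
    }
}
```

With this reading, a userid such as "Gábor" would reach the standard authenticator as "G{e1}bor", which is the alternate key the back-end would need to know about.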
Re: Basic Authentication Failed with multibyte username
Christopher Schultz wrote: ... Nice that someone looked at actual behavior of the browsers. There is an easy way to find out what really happens. Gábor, I presume that you have a workstation set for iso-8859-2 (or whichever non iso-8859-1 charset is appropriate for Magyar, I forgot), and a browser set up similarly. Could you get one of these add-ons like Fiddler2 or LiveHttpHeaders, and arrange to capture what is sent by the browser in its authorization header when you enter a user-id/password containing some characters of the range above \x9F ? That should be the base64 encoding of whatever the browser is sending. Then of course you'll have to find a way to show us the base64-encoded form, and the corresponding non-encoded form of ditto (but I think that composing and sending your post as UTF-8 should do the trick). We could probably do much the same with our own charset-challenged browsers, but we don't have the easiest keyboards for that. It is my deep suspicion that the browsers will just take the input as iso-latin-x (whatever the workstation/browser is set for), and base64-encode it, without bothering to indicate the real charset in any way. But we'll see. Kösönöm szepen, I think it is... - To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org For additional commands, e-mail: users-h...@tomcat.apache.org
Re: Basic Authentication Failed with multibyte username
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Gábor, On 1/21/2010 9:16 AM, Auth Gábor wrote: > Mark Thomas wrote: >>OCTET = >>CTL= > (octets 0 - 31) and DEL (127)> >> >> So actually, Tomcat is correct in the current treatment of credentials. >> Therefore, not a bug. > > Yes, but the UTF-8 encoded text is contains any 8-bit sequence of data except > control characters, so IMHO the UTF-8 encoded text is TEXT. Sure, UTF-8 encoded text is TEXT, but you may not get the String value you expect. André is correct in that non-Latin characters appear to be unsupported by the HTTP Authenticate header. Now, there /are/ things that can be done to accommodate you. See below. The patch you posted probably will only work when the platform encoding is set to UTF-8. Instead, an encoding setting would probably have to be provided to the BasicAuthenticator to allow the Base64-encoded header value to use the desired encoding. Actually, the code as it looks right now does have a bug: the platform default encoding is used to decode Base-64 decoded bytes in the Authenticate header. Instead, it should probably be ASCII or maybe ISO-8859-1. >> Also André's comments regarding ISO-8859-1 were right if considering the >> actual user name and password rather than the header. > > Yes, thats right. The default header encoding is ISO-8859-1. It's ASCII, though ISO-8859-1 is backward-compatible (as is UTF-8). > I've found some information about this issue: > http://stackoverflow.com/questions/702629/utf-8-characters-mangled-in-http- > basic-auth-username Nice that someone looked at actual behavior of the browsers. It would be pretty trivial to add a settable charset to the BasicAuthenticator, and also to allow things like RFC 2047 charset-in-value decoding, though I don't think that's appropriate because the Bas64 value has already been decoded. 
-chris
Re: Basic Authentication Failed with multibyte username
Hi,

Mark Thomas wrote:
> OCTET = <any 8-bit sequence of data>
> CTL = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
>
> So actually, Tomcat is correct in the current treatment of credentials.
> Therefore, not a bug.

Yes, but UTF-8 encoded text contains any 8-bit sequence of data except control characters, so IMHO UTF-8 encoded text is TEXT.

> Also André's comments regarding ISO-8859-1 were right if considering the
> actual user name and password rather than the header.

Yes, that's right. The default header encoding is ISO-8859-1.

> Supporting other encodings would be a useful enhancement but the default
> will have to be ISO-8859-1 to remain spec compliant. What the browsers
> will do for user names and passwords in other encodings is not defined
> so it will be a case of YMMV.

I've found some information about this issue:
http://stackoverflow.com/questions/702629/utf-8-characters-mangled-in-http-basic-auth-username

So... this is the real chaos... :)

By the way, my users do not use HTML browsers; they are using JAX-WS in their client program, and JAX-WS sends authentication data in UTF-8 (like Opera), because the default encoding is UTF-8 in the client JVM (and the server too).

Gábor Auth
Re: Basic Authentication Failed with multibyte username
Mark Thomas wrote:
> On 21/01/2010 06:55, André Warnier wrote:
>> Mark Thomas wrote:
>>> The authorisation header is base64 encoded so it is automatically
>>> compliant with RFC2616.
>>
>> Yes, it sounds like you're right; my mistake.
>> (Also for Gabor, I admit my mistake.)
>>
>> I agree that the HTTP header itself is correct.
>> But there is still something which puzzles me in the absolute.
>> Suppose that the browser and the server know nothing particular about
>> one another, and that the server gets such an Authorization header from
>> the browser.
>> The Base64 decoding is done, and yields a series of bytes.
>> Now this series of bytes has to be interpreted, to be translated into a
>> string in Java (which is Unicode). Which encoding should be chosen to
>> decode the byte array ?
>> If you use the default platform JVM encoding, you are making the
>> assumption that the browser knew what this encoding is, aren't you ?
>> On the other hand, the browser sent nothing to indicate in which
>> encoding this string was, before it encoded it using Base64, or did it ?
>
> RFC2617 to the rescue...
>
> basic-credentials = base64-user-pass
> base64-user-pass  = <base64 [4] encoding of user-pass,
>                     except not limited to 76 char/line>
> user-pass         = userid ":" password
> userid            = *<TEXT excluding ":">
> password          = *TEXT
>
> *TEXT is defined in RFC2616
> TEXT  = <any OCTET except CTLs, but including LWS>
>
> and finally
> OCTET = <any 8-bit sequence of data>
> CTL   = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
>
> So actually, Tomcat is correct in the current treatment of credentials.
> Therefore, not a bug. Also André's comments regarding ISO-8859-1 were
> right if considering the actual user name and password rather than the
> header.
>
> Supporting other encodings would be a useful enhancement but the default
> will have to be ISO-8859-1 to remain spec compliant. What the browsers
> will do for user names and passwords in other encodings is not defined
> so it will be a case of YMMV.
>
> Mark

Let me be even more pernickety :

According to the HTTP 1.1 RFC 2616, HTTP header fields MAY contain *TEXT portions representing character sets other than US-ASCII. But then, such header field values MUST be encoded according to the rules of RFC 2047.

RFC 2047 in turn, in "2. Syntax of encoded-words", indicates that this should be done using the form :

encoded-word = "=?" charset "?" encoding "?" encoded-text "?="

for example :
Header-name: =?iso-8859-1?B?some iso-8859-1 text, base-64 encoded?=
or
Header-name: =?utf-8?B?some unicode/utf-8 text, base-64 encoded?=
(I am not quite sure here of the "utf-8" part as the correct name for the charset.)
(Note: That is something one does find regularly in email headers; but I have never seen it used in HTTP headers until now.)

On the other hand, regarding authentication mechanisms, RFC 2616 refers to RFC 2617, which itself indicates the following format for an authorization header sent by the browser to the server :

Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==

When base64-decoded, the above string should look like "userid:password".

I did not find in RFC 2617 any specific mention of character set encoding, but it itself refers back to RFC 2616 as being the "base rules". And the base rules in RFC 2616 seem to be that header values are US-ASCII unless otherwise indicated.

In other words, my contention is as follows :
- if the "userid:password" above contains only US-ASCII characters, then the above simple form of the header is fine.
- if the "userid:password" string above contains characters other than US-ASCII however, then they should be further encoded, using the rules of RFC 2047.

This would mean that you should have something like :

Authorization: Basic =?utf-8?B?QWxhZGRpbjpvcGVuIHNlc2FtZQ==?=

(or, maybe, the other way around : it is the "QWxhZGRpbjpvcGVuIHNlc2FtZQ" string which, when base64-decoded, should yield a new string of the form "=?utf-8?B?QWxhZGRpbjpvcGVuIHNlc2FtZQ==?=", which should then be decoded once more to give the "userid:password" string).

Now, I am not sure that if you pass such an HTTP header, encoded as above, from Apache to Tomcat, the Tomcat getHeader() call will properly decode it, using the indicated charset. And I am not sure either that there exists any browser on the market that will encode a userid:password string that way.

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org
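For readers unfamiliar with the RFC 2047 form discussed above, here is a minimal sketch of how a single "B" (base64) encoded-word could be unpacked in Java. It is only an illustration: the class name EncodedWord and its decode method are hypothetical, it assumes Java 8 or later for java.util.Base64, it ignores the "Q" encoding and the length limits of RFC 2047, and nothing in this thread suggests that Tomcat or any browser does this for the Authorization header.

```java
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class EncodedWord {

    // encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
    // Only the "B" (base64) encoding is handled in this sketch.
    private static final Pattern B_WORD =
            Pattern.compile("=\\?([^?\\s]+)\\?[Bb]\\?([A-Za-z0-9+/=]*)\\?=");

    /** Decodes one B-encoded word using the charset named inside the word. */
    static String decode(String word) {
        Matcher m = B_WORD.matcher(word);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a B-encoded word: " + word);
        }
        // Charset names are matched case-insensitively by Charset.forName.
        Charset charset = Charset.forName(m.group(1));
        byte[] raw = Base64.getDecoder().decode(m.group(2));
        return new String(raw, charset);
    }
}
```

The point of the form is that the charset travels with the data, so the receiver does not have to guess; the plain Basic header carries no such label.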
Re: Basic Authentication Failed with multibyte username
On 21/01/2010 06:55, André Warnier wrote:
> Mark Thomas wrote:
>> The authorisation header is base64
>> encoded so it is automatically compliant with RFC2616.
>>
> Yes, it sounds like you're right; my mistake.
> (Also for Gabor, I admit my mistake.)
>
> I agree that the HTTP header itself is correct.
> But there is still something which puzzles me in the absolute.
> Suppose that the browser and the server know nothing particular about
> one another, and that the server gets such an Authorization header from
> the browser.
> The Base64 decoding is done, and yields a series of bytes.
> Now this series of bytes has to be interpreted, to be translated into a
> string in Java (which is Unicode). Which encoding should be chosen to
> decode the byte array ?
> If you use the default platform JVM encoding, you are making the
> assumption that the browser knew what this encoding is, aren't you ?
> On the other hand, the browser sent nothing to indicate in which
> encoding this string was, before it encoded it using Base64, or did it ?

RFC2617 to the rescue...

basic-credentials = base64-user-pass
base64-user-pass  = <base64 [4] encoding of user-pass,
                    except not limited to 76 char/line>
user-pass         = userid ":" password
userid            = *<TEXT excluding ":">
password          = *TEXT

*TEXT is defined in RFC2616
TEXT  = <any OCTET except CTLs, but including LWS>

and finally
OCTET = <any 8-bit sequence of data>
CTL   = <any US-ASCII control character (octets 0 - 31) and DEL (127)>

So actually, Tomcat is correct in the current treatment of credentials. Therefore, not a bug. Also André's comments regarding ISO-8859-1 were right if considering the actual user name and password rather than the header.

Supporting other encodings would be a useful enhancement but the default will have to be ISO-8859-1 to remain spec compliant. What the browsers will do for user names and passwords in other encodings is not defined so it will be a case of YMMV.

Mark
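To make the grammar in this message concrete, here is a rough Java sketch of a check that decoded user-pass bytes satisfy the TEXT rule. The class Rfc2616Text and its methods are hypothetical names, and the sketch simplifies LWS to "HT and SP are allowed" (it does not handle CRLF folding). It is meant only to show that the grammar admits any octet above 127, so UTF-8 and ISO-8859-1 byte sequences both pass; the syntax alone does not say which charset the bytes are in.

```java
// Hypothetical helper, for illustration only.
final class Rfc2616Text {

    /**
     * True when the octet is allowed in TEXT: any OCTET except CTLs
     * (octets 0 - 31 and DEL 127). HT is accepted here as part of LWS;
     * CRLF folding is deliberately not handled in this sketch.
     */
    static boolean isTextOctet(byte b) {
        int v = b & 0xff;            // treat the byte as an unsigned octet
        if (v == '\t') {
            return true;             // HT belongs to LWS
        }
        return v >= 32 && v != 127;  // SP and everything above, except DEL
    }

    /** True when every decoded user-pass octet is a TEXT octet. */
    static boolean isValidUserPass(byte[] userPass) {
        for (byte b : userPass) {
            if (!isTextOctet(b)) {
                return false;
            }
        }
        return true;
    }
}
```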
Re: Basic Authentication Failed with multibyte username
Mark Thomas wrote:
> On 21/01/2010 06:12, André Warnier wrote:
>> Auth Gábor wrote:
>>> Hi,
>>>
>>> I've found a potential bug in the Basic Authentication module. I have
>>> users, and some usernames contain national characters (encoded in
>>> UTF-8). HTTP header based authentication fails when the username or
>>> the password contains multibyte characters.
>>>
>>> The root of the bug is the Base64 decoder, which decodes the Base64
>>> stream to a char array: it converts each byte to an individual char,
>>> and this decode method corrupts the multibyte characters...
>>>
>> Hi.
>> Before declaring that this is a bug, I suggest that you read the other
>> thread entitled "mod_jk codepage in header values".
>> The main point is : according to the HTTP RFCs, a HTTP header value is
>> supposed to contain /only/ US-ASCII characters. Some byte values in
>> UTF-8 encoding are /not/ valid US-ASCII characters, so strictly speaking
>> and according to the RFC, HTTP headers which would contain them are
>> invalid.
>> It's a pain, but it's (probably) not a bug.
>
> In this case I think it is a bug. The authorisation header is base64
> encoded so it is automatically compliant with RFC2616.

Yes, it sounds like you're right; my mistake.
(Also for Gabor, I admit my mistake.)

I agree that the HTTP header itself is correct.
But there is still something which puzzles me in the absolute.
Suppose that the browser and the server know nothing particular about one another, and that the server gets such an Authorization header from the browser.
The Base64 decoding is done, and yields a series of bytes.
Now this series of bytes has to be interpreted, to be translated into a string in Java (which is Unicode). Which encoding should be chosen to decode the byte array ?
If you use the default platform JVM encoding, you are making the assumption that the browser knew what this encoding is, aren't you ?
On the other hand, the browser sent nothing to indicate in which encoding this string was, before it encoded it using Base64, or did it ?
Re: Basic Authentication Failed with multibyte username
On 21/01/2010 06:12, André Warnier wrote:
> Auth Gábor wrote:
>> Hi,
>>
>> I've found a potential bug in the Basic Authentication module. I have
>> users, and some usernames contain national characters (encoded in
>> UTF-8). HTTP header based authentication fails when the username or
>> the password contains multibyte characters.
>>
>> The root of the bug is the Base64 decoder, which decodes the Base64
>> stream to a char array: it converts each byte to an individual char,
>> and this decode method corrupts the multibyte characters...
>>
> Hi.
> Before declaring that this is a bug, I suggest that you read the other
> thread entitled "mod_jk codepage in header values".
> The main point is : according to the HTTP RFCs, a HTTP header value is
> supposed to contain /only/ US-ASCII characters. Some byte values in
> UTF-8 encoding are /not/ valid US-ASCII characters, so strictly speaking
> and according to the RFC, HTTP headers which would contain them are
> invalid.
> It's a pain, but it's (probably) not a bug.

In this case I think it is a bug. The authorisation header is base64 encoded so it is automatically compliant with RFC2616.

Mark
Re: Basic Authentication Failed with multibyte username
Hi,

André Warnier wrote:
>> I've found a potential bug in the Basic Authentication module. I have
>> users, and some usernames contain national characters (encoded in
>> UTF-8). HTTP header based authentication fails when the username or
>> the password contains multibyte characters.
>>
>> The root of the bug is the Base64 decoder, which decodes the Base64
>> stream to a char array: it converts each byte to an individual char,
>> and this decode method corrupts the multibyte characters...
> Before declaring that this is a bug, I suggest that you read the other
> thread entitled "mod_jk codepage in header values".

I've read that.

> The main point is : according to the HTTP RFCs, a HTTP header value is
> supposed to contain /only/ US-ASCII characters. Some byte values in
> UTF-8 encoding are /not/ valid US-ASCII characters, so strictly speaking
> and according to the RFC, HTTP headers which would contain them are
> invalid. It's a pain, but it's (probably) not a bug.

Hmm... the Basic Authorization header looks like this:

Authorization: BASIC w7pzZXJfMDA3MjpqZWxzem8xMkFB

Where do you see a non-US-ASCII character in the header? :)

Gábor Auth
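The effect described in this message can be reproduced with the header value quoted above. The following is a small sketch, not the Tomcat code: it assumes Java 8 or later (java.util.Base64), and the class name CredentialCharsetDemo is made up for the example. It decodes the same token once as UTF-8 and once as ISO-8859-1, the latter being equivalent to turning each byte into one char.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class, for illustration only.
final class CredentialCharsetDemo {

    /** Base64-decodes the token, then interprets the bytes in the given charset. */
    static String decodeCredentials(String token, Charset charset) {
        byte[] raw = Base64.getDecoder().decode(token);
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        String token = "w7pzZXJfMDA3MjpqZWxzem8xMkFB";

        // The first two decoded bytes are 0xC3 0xBA.
        // As UTF-8 they form one character, U+00FA (u with acute).
        System.out.println(decodeCredentials(token, StandardCharsets.UTF_8));

        // As ISO-8859-1 each byte becomes its own char: U+00C3 then U+00BA,
        // so the username starts with two wrong characters.
        System.out.println(decodeCredentials(token, StandardCharsets.ISO_8859_1));
    }
}
```

The header itself contains only US-ASCII characters in both cases; the difference appears only after the Base64 step, when the bytes are turned into a String.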
Re: Basic Authentication Failed with multibyte username
On 21/01/2010 05:54, Auth Gábor wrote:
> Hi,
>
> I've found a potential bug in the Basic Authentication module. I have
> users, and some usernames contain national characters (encoded in UTF-8).
> HTTP header based authentication fails when the username or the
> password contains multibyte characters.

That sounds like a bug to me.

> The root of the bug is the Base64 decoder, which decodes the Base64 stream
> to a char array: it converts each byte to an individual char, and this
> decode method corrupts the multibyte characters...

And that sounds like the root cause.

> It works, because the byte[] to String conversion supports the multibyte
> conversion and uses the encoding of the JVM.
>
> What do you think about it?

I haven't tested it or looked at the detail of the base 64 decoding but on the basis it works for you then... Great! Many thanks.

Please create a Bugzilla entry and add your patch to it. Patches sent to the mailing list are too easy to forget.

Before you do, I have one improvement suggestion. Using the platform default encoding to convert bytes to String is something that itself has caused bugs in the past and I can see it doing so here too. I'd suggest adding a characterEncoding attribute to the BasicAuthenticator (like there is for FormAuthenticator). Don't forget to include documenting this new attribute in your patch.

The tricky question is what should the default be. I see the options as ISO-8859-1 or UTF-8. I'd use UTF-8 since that will work for most input including all ISO-8859-1 input.

Thanks again for the patch.

Mark
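As a rough sketch of the suggestion in this message, the following shows credential parsing with an explicit charset instead of the platform default. It is not the BasicAuthenticator code and not the proposed patch: the class BasicCredentials, its parse method and the charset parameter are hypothetical stand-ins for whatever a characterEncoding attribute would feed, and java.util.Base64 (Java 8+) is used in place of Tomcat's own Base64 class. The colon is located in the raw bytes before any conversion, which is safe for both UTF-8 and ISO-8859-1 because 0x3A never occurs inside a multibyte UTF-8 sequence.

```java
import java.nio.charset.Charset;
import java.util.Base64;

// Hypothetical value class, for illustration only.
final class BasicCredentials {
    final String username;
    final String password; // null when the decoded value has no colon

    private BasicCredentials(String username, String password) {
        this.username = username;
        this.password = password;
    }

    /** Parses "Basic <base64>" and decodes user and password with the given charset. */
    static BasicCredentials parse(String headerValue, Charset charset) {
        String value = headerValue.trim();
        int space = value.indexOf(' ');
        // The scheme name is compared case-insensitively ("Basic", "BASIC", ...).
        if (space < 0 || !value.substring(0, space).equalsIgnoreCase("Basic")) {
            throw new IllegalArgumentException("Not a Basic Authorization header");
        }
        byte[] raw = Base64.getDecoder().decode(value.substring(space + 1).trim());

        // Find the first ':' on the byte level, before charset conversion.
        int colon = -1;
        for (int i = 0; i < raw.length; i++) {
            if (raw[i] == ':') {
                colon = i;
                break;
            }
        }
        if (colon < 0) {
            return new BasicCredentials(new String(raw, charset), null);
        }
        String user = new String(raw, 0, colon, charset);
        String pass = new String(raw, colon + 1, raw.length - colon - 1, charset);
        return new BasicCredentials(user, pass);
    }
}
```

Whatever default is chosen, bytes above 0x7F decode differently under UTF-8 and ISO-8859-1, so the configured charset has to match what the client actually sent.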
Re: Basic Authentication Failed with multibyte username
Auth Gábor wrote:
> Hi,
>
> I've found a potential bug in the Basic Authentication module. I have
> users, and some usernames contain national characters (encoded in UTF-8).
> HTTP header based authentication fails when the username or the
> password contains multibyte characters.
>
> The root of the bug is the Base64 decoder, which decodes the Base64 stream
> to a char array: it converts each byte to an individual char, and this
> decode method corrupts the multibyte characters...

Hi.
Before declaring that this is a bug, I suggest that you read the other thread entitled "mod_jk codepage in header values".
The main point is : according to the HTTP RFCs, a HTTP header value is supposed to contain /only/ US-ASCII characters. Some byte values in UTF-8 encoding are /not/ valid US-ASCII characters, so strictly speaking and according to the RFC, HTTP headers which would contain them are invalid.
It's a pain, but it's (probably) not a bug.
Basic Authentication Failed with multibyte username
Hi,

I've found a potential bug in the Basic Authentication module. I have users, and some usernames contain national characters (encoded in UTF-8). HTTP header based authentication fails when the username or the password contains multibyte characters.

The root of the bug is the Base64 decoder, which decodes the Base64 stream to a char array: it converts each byte to an individual char, and this decode method corrupts the multibyte characters...

Here is the patch:

===
Index: java/org/apache/catalina/util/Base64.java
===
--- java/org/apache/catalina/util/Base64.java (revision 901368)
+++ java/org/apache/catalina/util/Base64.java (working copy)
@@ -283,5 +283,84 @@
         }
     }

+    /**
+     * Decodes Base64 data into octets
+     *
+     * @param base64DataBC Byte array containing Base64 data
+     * @param decodedDataBC The decoded data bytes
+     */
+    public static void decode( ByteChunk base64DataBC, ByteChunk decodedDataBC)
+    {
+        int start = base64DataBC.getStart();
+        int end = base64DataBC.getEnd();
+        byte[] base64Data = base64DataBC.getBuffer();
+
+        decodedDataBC.recycle();
+
+        // handle the edge case, so we don't have to worry about it later
+        if(end - start == 0) { return; }
+        int numberQuadruple= (end - start)/FOURBYTE;
+        byte b1=0,b2=0,b3=0, b4=0, marker0=0, marker1=0;
+
+        // Throw away anything not in base64Data
+
+        int encodedIndex = 0;
+        int dataIndex = start;
+        byte[] decodedData = null;
+
+        {
+            // this sizes the output array properly - rlw
+            int lastData = end - start;
+            // ignore the '=' padding
+            while (base64Data[start+lastData-1] == PAD)
+            {
+                if (--lastData == 0)
+                {
+                    return;
+                }
+            }
+            decodedDataBC.allocate(lastData - numberQuadruple, -1);
+            decodedDataBC.setEnd(lastData - numberQuadruple);
+            decodedData = decodedDataBC.getBuffer();
+        }
+
+        for (int i = 0; i < numberQuadruple; i++)
+        {
+            dataIndex = start + i * 4;
+            marker0 = base64Data[dataIndex + 2];
+            marker1 = base64Data[dataIndex + 3];
+
+            b1 = base64Alphabet[base64Data[dataIndex]];
+            b2 = base64Alphabet[base64Data[dataIndex +1]];
+
+            if (marker0 != PAD && marker1 != PAD)
+            {
+                //No PAD e.g 3cQl
+                b3 = base64Alphabet[ marker0 ];
+                b4 = base64Alphabet[ marker1 ];
+
+                decodedData[encodedIndex] = (byte) (( b1 <<2 | b2>>4 ) & 0xff);
+                decodedData[encodedIndex + 1] =
+                    (byte) (( ((b2 & 0xf)<<4 ) |( (b3>>2) & 0xf) ) & 0xff);
+                decodedData[encodedIndex + 2] = (byte) (( b3<<6 | b4 ) & 0xff);
+            }
+            else if (marker0 == PAD)
+            {
+                //Two PAD e.g. 3c[Pad][Pad]
+                decodedData[encodedIndex] = (byte) (( b1 <<2 | b2>>4 ) & 0xff);
+            }
+            else if (marker1 == PAD)
+            {
+                //One PAD e.g. 3cQ[Pad]
+                b3 = base64Alphabet[ marker0 ];
+
+                decodedData[encodedIndex] = (byte) (( b1 <<2 | b2>>4 ) & 0xff);
+                decodedData[encodedIndex + 1] =
+                    (byte) (( ((b2 & 0xf)<<4 ) |( (b3>>2) & 0xf) ) & 0xff);
+            }
+            encodedIndex += 3;
+        }
+    }
+
 }
Index: java/org/apache/catalina/authenticator/BasicAuthenticator.java
===
--- java/org/apache/catalina/authenticator/BasicAuthenticator.java (revision 901368)
+++ java/org/apache/catalina/authenticator/BasicAuthenticator.java (working copy)
@@ -161,18 +161,18 @@
             // FIXME: Add trimming
             // authorizationBC.trim();

-            CharChunk authorizationCC = authorization.getCharChunk();
-            Base64.decode(authorizationBC, authorizationCC);
+            ByteChunk authorizationBCC = authorization.getByteChunk();
+            Base64.decode(authorizationBC, authorizationBCC);

             // Get username and password
-            int colon = authorizationCC.indexOf(':');
+            int colon = authorizationBCC.indexOf(':',0);
             if (colon < 0) {
-                username = authorizationCC.toString();
+                username = authorizationBCC.toString();
             } else {
-                char[] buf = authorizationCC.getBuffer();
+                byte[] buf = authorizationBCC.getBuffer();
                 username = new String(buf, 0, colon);
                 password = new String(buf, colon + 1,
-                    authorizationCC.getEnd() - colon - 1);
+