This was quite a reply :). Thanks for your time and knowledge.

On Thu, 2010-01-21 at 11:30 +0100, André Warnier wrote:
> Mirko,
> I am from Belgium, Europe too. I live in Spain and work mostly for
> German and other international customers (among which are some from
> Poland too). This is to say that I am well-aware of multi-lingual character
> set issues, and confront them every day.
> So, just so as to give you some "context" for your issues :
>
> Despite the fact that Unicode and UTF-8 are now being increasingly used
> on the web, the fact is that HTTP, and HTML, and most of the other
> WWW-relevant RFCs, are still US-ASCII and ISO-8859-1 (latin-1) based.
>
> For example, HTTP header values are /supposed/ to contain only
> single-byte character codes that are part of the (printable subset of)
> US-ASCII character set.
> For example also, by default, all "content" exchanged between browsers
> and web servers is iso-8859-1.
> And it is so because the relevant RFCs say that it should be.
> (So the developers of Apache and mod_jk and Tomcat have little choice in
> the matter; they have to follow the RFCs).

I agree, the RFCs are there to be followed.

>
> This does not mean that you cannot handle other character sets on the
> web. But it means that whenever you do, you have to be attentive to the
> fact that it is /not/ the standard, and that you may have to do
> character set translations and/or encoding.
> It may even mean that, in order to exchange non-US-ASCII or
> non-ISO-8859-1 data, you may have to use "tricks".
> It also means that, in some cases, by using such "tricks", your
> applications may become "non-standard", and will not necessarily work
> with all servers and all clients.
>
> So for example, to get back to your question above : mod_jk is not
> responsible for translating anything, and will not translate anything.
> That is because mod_jk follows the relevant WWW RFCs, which specify that
> such and such data is ASCII or ISO-8859-1.
>
> If the original HTTP request, as it is given by Apache to mod_jk,
> contains HTTP headers, mod_jk will forward these headers "as is" to the
> back-end Tomcat. But, because the HTTP RFC specifies that HTTP headers
> should contain only US-ASCII character data, mod_jk would be allowed, if
> it finds non-US-ASCII data in a HTTP header, to strip this data or
> ignore the header or something like that. I don't know if mod_jk
> actually does this, but if it did, it would be justified, because
> according to the HTTP RFC this would be an invalid header.

That is what I'm afraid of. This code:

    new String(request.getHeader(headerName).getBytes("ISO-8859-1"))

works for now, but it really shouldn't. That is why I'm searching for a
more legitimate way.

>
> So, to be practical :
> - the current HTTP 1.1 RFC specifies that HTTP headers can only contain
> US-ASCII printable character data
> - some UTF-8 codes contain bytes that are not part of the US-ASCII
> character set (e.g. : bytes with values above 0x7F)
> - so, if you want to forward such a header from Apache to Tomcat, in
> principle the "right" way is to "encode" the value of this header on the
> Apache side, in such a way that it contains only US-ASCII data (for
> example, using Base64 encoding), then pass it to mod_jk.
> - at the other end, your application would have to decode this header
> (using Base64 decoding) back into UTF-8, and then it would have to read
> this header value as UTF-8/Unicode.
>
> There is no guarantee that any standard module or class under Apache or
> mod_jk or Tomcat would properly handle a header that contains
> non-US-ASCII data. That is because, in principle, they never have to.
>
> I know it is a mess. It is possible that there are shortcuts. It is
> possible that mod_jk would transmit a HTTP header, even if it contains
> non-US-ASCII data. But it is not sure, because "the bible" for mod_jk,
> as for Apache and as for Tomcat, are the RFCs.
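
To make the two approaches above concrete, here is a minimal Java sketch of
(a) re-reading a header value that was decoded as ISO-8859-1 although its
bytes were really UTF-8, and (b) the Base64 round trip described in the
quoted list. It is only an illustration under assumptions: it needs Java 8
or later for java.util.Base64, the class and method names are hypothetical,
and the encoding step that would happen on the Apache side is merely
simulated in Java. Note that the one-liner quoted above calls
new String(bytes) without a charset, so it depends on the platform default
encoding; the sketch names UTF-8 explicitly.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper class, not part of Tomcat, mod_jk or the servlet API.
final class HeaderCharsetSketch {

    // (a) The "trick": take a String that was built by decoding bytes as
    // ISO-8859-1, recover the original bytes, and decode them as UTF-8.
    // ISO-8859-1 maps every byte to exactly one char, so no byte is lost.
    static String reinterpretLatin1AsUtf8(String headerValue) {
        byte[] rawBytes = headerValue.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    // (b) Sender side (assumed to run before the value reaches the header):
    // turn UTF-8 text into a pure US-ASCII Base64 string.
    static String encodeForHeader(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(utf8);
    }

    // (b) Receiver side: decode the Base64 header value back into UTF-8 text.
    static String decodeFromHeader(String headerValue) {
        byte[] utf8 = Base64.getDecoder().decode(headerValue);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "Andr\u00e9"; // "André"

        // Simulate a container that decoded UTF-8 header bytes as ISO-8859-1.
        String asSeenByServlet = new String(
                name.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(reinterpretLatin1AsUtf8(asSeenByServlet).equals(name));

        // Base64 round trip: the encoded form contains only ASCII characters.
        String encoded = encodeForHeader(name);
        System.out.println(encoded);
        System.out.println(decodeFromHeader(encoded).equals(name));
    }
}
```

Approach (a) only works if every component between Apache and the servlet
passes the raw bytes through untouched, which the RFCs do not require.
Approach (b) does not depend on that, at the cost of an extra encoding step
on the sending side.
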
But where should this Base64 encoding go? (I do not use Apache often :(
I'm a Java programmer using Tomcat.)

From the IdP (AAI identity provider) I get user data, and the SP (AAI
service provider, which is a module in Apache) puts this data into Apache
environment variables with UTF-8 values. Then, as I understand it, mod_jk
takes these variables and packs them into HTTP headers. I would like to
keep the environment variables on Apache with UTF-8 values, so that other
applications on this Apache (e.g. PHP web pages) would still work. So my
guess is that the Base64 encoding should happen before mod_jk takes the
values from the environment variables and puts them into the HTTP headers.
Is this possible (I mean, other than by changing the mod_jk code)? Or is
this a topic for some other mailing list :).

> We non-English speakers worldwide desperately need a new version of the
> HTTP protocol where the default would be Unicode/UTF-8, for everything.
> But I do not see much happening right now in that direction.

Oh, I do agree on that :)

>
>
> Maybe a tip for your authentication issues :
> If, in the AJP <Connector> on the Tomcat side, you set the attribute
> tomcatAuthentication="false"
> then Tomcat will accept the user-id authenticated by Apache, as the
> user-id for Tomcat (mod_jk transmits it).
> So if your user authentication mechanism works fine at the Apache level,
> and generates a user-id that is "acceptable" by Tomcat, this may be a
> solution for your issue.
> I have no idea if this user-id, for Tomcat, can or cannot contain
> non-US-ASCII characters.

AAI returns more than just the user-id. The idea behind AAI is that an
application stores as little user data as possible; all data is provided
by AAI. This data includes, for example, first name, last name, address,
.... It would be perfect if we had this SP running on Tomcat and did not
need Apache, but at this time there is no such SP.

mirko