Re: mod_jk codepage in header values
Rainer,

On 1/30/2010 7:56 AM, Rainer Jung wrote:
> So I expect you can forward any binary garbage you like, as long as you make sure the code putting it into the environment variables doesn't already do any encoding or decoding.

This was pretty much just as I expected.

> Now: it seems that Tomcat is by default assuming it needs to transform the binary AJP data stream for request attributes into ISO-8859-1 decoded Java strings. I'm not 100% sure here, but this is likely the most important part of the game.

It looks like AprProtocol.java, in prepareRequest, handles request attributes in the SC_A_REQ_ATTRIBUTE case. No encoding/decoding is done there. Instead, it is done by the MessageBytes class, indirectly by the ByteChunk class. The documentation for ByteChunk says:

 * In a server it is very important to be able to operate on
 * the original byte[] without converting everything to chars.
 * Some protocols are ASCII only, and some allow different
 * non-UNICODE encodings. The encoding is not known beforehand,
 * and can even change during the execution of the protocol.
 * ( for example a multipart message may have parts with different
 * encoding )
 *
 * For HTTP it is not very clear how the encoding of RequestURI
 * and mime values can be determined, but it is a great advantage
 * to be able to parse the request without converting to string.

Later:

 /** Default encoding used to convert to strings. It should be UTF8,
     as most standards seem to converge, but the servlet API requires
     8859_1, and this object is used mostly for servlets.
 */
 public static final String DEFAULT_CHARACTER_ENCODING="ISO-8859-1";

If ByteChunk.setEncoding has not been called, this default encoding is used to decode bytes. Unfortunately, setEncoding is not static, so you have to have a reference to the ByteChunk object in order to fix it. Then again, knowing that ISO-8859-1 is being used may make it easier to write a transcoder:

 new String(myString.getBytes("ISO-8859-1"), "UTF-8")

That's ugly and I feel like it's asking for problems, but it might be your only reasonable recourse.

-chris

To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org
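For readers following along, the transcoding idea in the message above can be written out as a small helper. This is only a sketch: the class and method names (Latin1ToUtf8, recode) are hypothetical, and it assumes the value really was UTF-8 on the wire and was decoded as ISO-8859-1 on the Tomcat side. ISO-8859-1 maps every byte value to exactly one character, which is why the original bytes can be recovered.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tomcat or mod_jk.
final class Latin1ToUtf8 {

    /**
     * Undoes an ISO-8859-1 decode of bytes that were really UTF-8.
     * Assumption: the input was produced by decoding UTF-8 bytes as ISO-8859-1.
     */
    static String recode(String misdecoded) {
        if (misdecoded == null) {
            return null;
        }
        // Step 1: get back the original bytes (lossless for ISO-8859-1).
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Step 2: decode those bytes with the charset they were written in.
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Gori\u010Dan"; // contains U+010D (c with caron)
        // Simulate what the webapp sees: UTF-8 bytes decoded as ISO-8859-1.
        String seen = new String(original.getBytes(StandardCharsets.UTF_8),
                                 StandardCharsets.ISO_8859_1);
        System.out.println(recode(seen).equals(original));
    }
}
```

Applying recode to a value that was not mis-decoded in this way will damage it, so it should only be used on values known to come from this path.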
Re: mod_jk codepage in header values
>> OK. It was my mistake: I thought that mod_jk automatically takes environment variables and puts them in headers. But yes, as you said, this is done by AAI. So the right encoding should be done on the AAI side. Thank you for clearing that up.
>
> Let us know what AAI says about this.

OK.

>> Just for info: I tried putting a value with UTF-8 character encoding in a JkEnvVar directive and the result was the same. On the Tomcat side I got (through request.getAttribute(attributeName)) the value in ISO-8859-1 character encoding.
>
> How did you construct your UTF-8-encoded environment variable? Can you give us an example for how to reproduce this?

I tried this with two different approaches.

1. AAI (this is done by the Apache Shibboleth module mod_shibx.so) puts AAI attributes in headers and in environment variables (version 1.3 just in headers; version 2 in environment variables, but with the directive ShibUseHeaders On also in headers). So in the mod_jk conf file I define with the JkEnvVar directive which environment variable should be passed over to Tomcat. I chose one of the AAI attributes that has a UTF-8 character in it.

2. Secondly, I defined a JkEnvVar directive for a non-existent environment variable and also added a default value with some non-ISO-8859-1 characters. My conf file is in UTF-8 encoding, so the default value should also be in UTF-8 encoding.

I believe you could reproduce the second approach.

mirko
Re: mod_jk codepage in header values
Mirko,

On 1/29/2010 4:02 AM, Mirko Solic wrote:
> Secondly, I defined a JkEnvVar directive for a non-existent environment variable and also added a default value with some non-ISO-8859-1 characters. My conf file is in UTF-8 encoding, so the default value should also be in UTF-8 encoding.

I'd be interested in how Apache httpd reads the httpd.conf file. If it reads the file in UTF-8 encoding, then this could be a problem with mod_jk. If it reads it using ISO-8859-1 or US-ASCII or something like that, then the data is already broken before mod_jk gets ahold of it.

You might want to re-post your question by saying that UTF-8 data is incorrectly transmitted to request /attributes/ and see if any of the mod_jk devs can take a look at that.

-chris
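One way to tell where a value gets broken is to look at the code points the webapp actually receives, rather than at how they are rendered. The following is a minimal diagnostic sketch; the class name CodePointDump is hypothetical. If a character such as U+010D arrives as the pair U+00C4 U+008D, the UTF-8 bytes were transmitted intact and only the decoding charset was wrong; if it arrives as U+003F or U+FFFD, the data was already damaged before decoding.

```java
// Hypothetical diagnostic helper for inspecting received strings.
final class CodePointDump {

    /** Returns the code points of s as "U+XXXX" tokens separated by spaces. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // In a servlet this would be the attribute or header value under test.
        System.out.println(dump("\u00C4\u008D"));
    }
}
```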
Re: mod_jk codepage in header values
>> According to André Warnier: "But, because the HTTP RFC specifies that HTTP headers should contain only US-ASCII character data, mod_jk would be allowed, if it finds non-US-ASCII data in a HTTP header, to strip this data or ignore the header or something like that. I don't know if mod_jk actually does this, but if it did, it would be justified, because according to the HTTP RFC this would be an invalid header." Then I have no values to decode.
>
> I can tell you there's no reason for mod_jk to do this, and I don't believe it does, for the testing I have performed does not demonstrate that behavior.

Yes. It is also working for me. I have no problem with that at the moment. My fear is just that at some point in the future it won't work any more.

>> I agree with you here: using HTTP headers for text data sucks! But AAI is not supported on Tomcat yet. However, it is supported on Apache, and the only way for me to use AAI and Tomcat together is the mod_jk connector. But mod_jk is transporting environment variables from Apache to Tomcat in HTTP headers.
>
> That sounds like an AAI bug, not an httpd/mod_jk/Tomcat bug: mod_jk sends environment variables as request /attributes/, not request headers. (See the JkEnvVar directive in http://tomcat.apache.org/connectors-doc/reference/apache.html). If AAI is creating new request headers, it's AAI's fault for incorrectly formatting them. If you can get this data from a request /attribute/ instead, then maybe that's a better option (though there are no references to character encoding in the documentation for JkEnvVar).

OK. It was my mistake: I thought that mod_jk automatically takes environment variables and puts them in headers. But yes, as you said, this is done by AAI. So the right encoding should be done on the AAI side. Thank you for clearing that up.

Just for info: I tried putting a value with UTF-8 character encoding in a JkEnvVar directive and the result was the same. On the Tomcat side I got (through request.getAttribute(attributeName)) the value in ISO-8859-1 character encoding.

Regards,
mirko
Re: mod_jk codepage in header values
Mirko,

On 1/27/2010 3:02 AM, Mirko Solic wrote:
> OK. It was my mistake: I thought that mod_jk automatically takes environment variables and puts them in headers. But yes, as you said, this is done by AAI. So the right encoding should be done on the AAI side. Thank you for clearing that up.

Let us know what AAI says about this.

> Just for info: I tried putting a value with UTF-8 character encoding in a JkEnvVar directive and the result was the same. On the Tomcat side I got (through request.getAttribute(attributeName)) the value in ISO-8859-1 character encoding.

How did you construct your UTF-8-encoded environment variable? Can you give us an example for how to reproduce this?

-chris
Re: mod_jk codepage in header values
On Thu, 2010-01-21 at 15:21 +0100, André Warnier wrote:
> Mirko, just for info: there is another, related thread taking place at the same time, entitled "Basic Authentication Failed with multibyte username".

I have read it.

> Basically, I am interested in those topics because I encounter them myself often in our own web applications. I don't know all the answers, but I know that it is confusing.
>
> As far as I can interpret: according to the HTTP 1.1 RFC 2616, HTTP header fields MAY contain *TEXT portions representing character sets other than US-ASCII. But then, such header field values MUST be encoded according to the rules of RFC 2047.
>
> RFC 2047 in turn, in "2. Syntax of encoded-words", indicates that this should be done using the form:
>
>   encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
>
> for example:
>
>   Header-name: =?iso-8859-1?B?some iso-8859-1 text, base-64 encoded?=
> or
>   Header-name: =?utf-8?B?some unicode/utf-8 text, base-64 encoded?=
>
> (I am not quite sure here of the "utf-8" part as the correct name for the charset.)
>
> Now, I am not sure that if you pass a HTTP header, encoded as above, from Apache to Tomcat, the Tomcat getHeader() call will properly decode it, using the indicated charset. If not, you will have to do the decoding yourself, if you want to pass non-ASCII (or non-ISO-8859-1) characters in those headers.
>
> Admittedly, it is a pain; but there are still quite a few grey areas like that in the WWW-related RFCs where character sets are concerned.
>
> If you have to do this kind of encoding/decoding, I suggest having a look at MIME (email) libraries. This kind of encoding/decoding is regularly used in email headers. Save the original text (.eml) format of an email with a non-ASCII subject line for an example.

As I understand it, I don't have control over when environment variables on the Apache side are put into HTTP headers and sent to the Tomcat side. This is done by mod_jk automatically. I would hate to put already-encoded values into environment variables on the Apache side so that mod_jk would transfer them correctly to the Tomcat side, because then other web pages that use these variables wouldn't work any more. The right way (to my understanding) would be for mod_jk to encode environment variables according to the rules of RFC 2047.

Regards,
mirko
Re: mod_jk codepage in header values
On Thu, 2010-01-21 at 10:34 -0500, Christopher Schultz wrote:
> On 1/21/2010 6:43 AM, Mirko Solic wrote:
>> That is what I'm afraid of. This code: new String(request.getHeader(headerName).getBytes("ISO-8859-1")) works for now, but it really shouldn't work. That is why I'm searching for a more legitimate way.
>
> What would be better is to do something like this: java.net.URLEncoder.encode(request.getHeader(headerName), "UTF-8"). Of course, this will only work if your client knows that's how the encoding will be done.

Yes, but what if mod_jk chooses not to send non-ISO-8859-1 header values over to the Tomcat side? According to André Warnier: "But, because the HTTP RFC specifies that HTTP headers should contain only US-ASCII character data, mod_jk would be allowed, if it finds non-US-ASCII data in a HTTP header, to strip this data or ignore the header or something like that. I don't know if mod_jk actually does this, but if it did, it would be justified, because according to the HTTP RFC this would be an invalid header." Then I have no values to decode.

> AAI needs to support whatever encoding you intend to use. You can't simply transcode things in an arbitrary way and expect AAI to work properly. What does their documentation say about what format these values should take?

The problem is when I want to get data from AAI. AAI is sending data in UTF-8, but this is broken when the data is sent from the Apache side to the Tomcat side.

> A better strategy would be for AAI to provide a numeric token (easily passable in HTTP headers without any encoding issues) and then provide an HTTP-based and/or XML-based API that uses proper document encoding to send textual data across the wire. Using HTTP headers for text data sucks!

I agree with you here: using HTTP headers for text data sucks! But AAI is not supported on Tomcat yet. However, it is supported on Apache, and the only way for me to use AAI and Tomcat together is the mod_jk connector. But mod_jk is transporting environment variables from Apache to Tomcat in HTTP headers. And yes, AAI sends data to Apache in an XML document, not over HTTP headers. On the Apache side, when the data is received, it is put in environment variables.

mirko
Re: mod_jk codepage in header values
Mirko,

On 1/25/2010 4:06 AM, Mirko Solic wrote:
> As I understand it, I don't have control over when environment variables on the Apache side are put into HTTP headers and sent to the Tomcat side. This is done by mod_jk automatically. I would hate to put already-encoded values into environment variables on the Apache side so that mod_jk would transfer them correctly to the Tomcat side, because then other web pages that use these variables wouldn't work any more. The right way (to my understanding) would be for mod_jk to encode environment variables according to the rules of RFC 2047.

Again, mod_jk's job is to deliver the variables (as HTTP headers, right?) without any manipulation whatsoever. It is not mod_jk's job to re-encode things that aren't acceptable to your webapp.

If you want to use RFC 2047-encoded values, then go ahead and use them. You'll just have to write some code into your webapp to decode them, as Tomcat does not directly support RFC 2047 (though it also doesn't interfere with it).

-chris
Re: mod_jk codepage in header values
Mirko,

On 1/25/2010 4:24 AM, Mirko Solic wrote:
> On Thu, 2010-01-21 at 10:34 -0500, Christopher Schultz wrote:
>> What would be better is to do something like this: java.net.URLEncoder.encode(request.getHeader(headerName), "UTF-8"). Of course, this will only work if your client knows that's how the encoding will be done.
>
> Yes, but what if mod_jk chooses not to send non-ISO-8859-1 header values over to the Tomcat side?

This is simply not mod_jk's job: mod_jk pretty much delivers the exact bytes sent by the client. Trust me: it's better that way.

> According to André Warnier: "But, because the HTTP RFC specifies that HTTP headers should contain only US-ASCII character data, mod_jk would be allowed, if it finds non-US-ASCII data in a HTTP header, to strip this data or ignore the header or something like that. I don't know if mod_jk actually does this, but if it did, it would be justified, because according to the HTTP RFC this would be an invalid header." Then I have no values to decode.

I can tell you there's no reason for mod_jk to do this, and I don't believe it does, for the testing I have performed does not demonstrate that behavior.

>> AAI needs to support whatever encoding you intend to use. You can't simply transcode things in an arbitrary way and expect AAI to work properly. What does their documentation say about what format these values should take?
>
> The problem is when I want to get data from AAI. AAI is sending data in UTF-8, but this is broken when the data is sent from the Apache side to the Tomcat side.

So, the bytes are being sent as UTF-8 instead of US-ASCII. I think you're back to where we started: re-encoding strings. It's possible that you may run into a situation where the re-encoding is simply going to fail because of how badly the string has been damaged by an incorrect decoding. Maybe that's not an issue with ISO-8859-1 (at least it's a 1-byte encoding and all bytes are ostensibly legal).

>> A better strategy would be for AAI to provide a numeric token (easily passable in HTTP headers without any encoding issues) and then provide an HTTP-based and/or XML-based API that uses proper document encoding to send textual data across the wire. Using HTTP headers for text data sucks!
>
> I agree with you here: using HTTP headers for text data sucks! But AAI is not supported on Tomcat yet. However, it is supported on Apache, and the only way for me to use AAI and Tomcat together is the mod_jk connector. But mod_jk is transporting environment variables from Apache to Tomcat in HTTP headers.

That sounds like an AAI bug, not an httpd/mod_jk/Tomcat bug: mod_jk sends environment variables as request /attributes/, not request headers. (See the JkEnvVar directive in http://tomcat.apache.org/connectors-doc/reference/apache.html). If AAI is creating new request headers, it's AAI's fault for incorrectly formatting them.

If you can get this data from a request /attribute/ instead, then maybe that's a better option (though there are no references to character encoding in the documentation for JkEnvVar).

-chris
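The remark above about re-encoding failing after an incorrect decoding can be checked with a short experiment. This is a sketch under stated assumptions: the class name LossyDecodeDemo and the method survives are hypothetical, and the sample bytes are simply the UTF-8 encoding of U+010D. It decodes the bytes with a "wrong" charset, encodes them again with the same charset, and reports whether the original bytes came back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; it only uses java.lang.String and java.nio.charset.
final class LossyDecodeDemo {

    /** True if decoding raw with wrongCharset and re-encoding restores raw exactly. */
    static boolean survives(byte[] raw, Charset wrongCharset) {
        String decoded = new String(raw, wrongCharset);
        return Arrays.equals(raw, decoded.getBytes(wrongCharset));
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u010D".getBytes(StandardCharsets.UTF_8); // bytes C4 8D
        // ISO-8859-1 assigns a character to every byte value, so nothing is lost.
        System.out.println(survives(utf8, StandardCharsets.ISO_8859_1));
        // US-ASCII has no mapping for bytes above 0x7F, so they are replaced.
        System.out.println(survives(utf8, StandardCharsets.US_ASCII));
    }
}
```

This is why the getBytes("ISO-8859-1") workaround discussed in this thread can recover the data, while the same trick would not work if the container had decoded the bytes as US-ASCII.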
Re: mod_jk codepage in header values
Christopher, thanks for the quick reply.

>> To connect Tomcat with Apache I'm using the mod_jk connector. But I'm having a problem with header values. On the Apache side header values are in the UTF-8 code page, but on the Tomcat side I have to convert from the Latin-1 code page.
>
> Hmm. HTTP defines header values as ASCII (well, it inherits that from other RFCs, but, whatever). If you need to encode non-ASCII data in header values, you'll need to do it in such a way that your client understands them. Often, URL-encoding (aka %-encoding) is used in these situations.
>
>> I'm using this code:
>>
>>   for (Enumeration en = request.getHeaderNames(); en.hasMoreElements();) {
>>       header = new Header();
>>       headerName = (String) en.nextElement();
>>       header.setHeaderName(headerName);
>>       header.setHeaderValue(new String(request.getHeader(headerName).getBytes("ISO-8859-1")));
>
> For most values, this will work. On the other hand, the response already knows how to convert a String into ASCII, so you probably don't have to do this.
>
>>       headers.add(header);
>>       header = new Header();
>>       header.setHeaderName(headerName);
>>       header.setHeaderValue(request.getHeader(headerName));
>>       headers.add(header);
>>   }
>
> The Header class is not part of the Servlet API. What does all of this do?

This is just a snapshot of my code. I use the Header class to save values; it is just a data holder.

> What information are you passing through the HTTP headers that needs to be in a particular encoding? These issues are typically handled using the response body coupled with a Content-Type header which specifies a character encoding.

I'm from Slovenia, Europe. We use characters that are not defined in ASCII, so we are using the UTF-8 code page.

I will try to explain what this application is about. This project (web page) is protected with AAI (http://www.switch.ch/aai/about/). This Authentication and Authorization Infrastructure is roughly divided into the SP (service provider) and the IdP (identity provider). The SP is a module in Apache. So when a user tries to get a web page that is protected with AAI through Apache, the SP module checks whether the user is already logged in. If not, the SP redirects the user to the IdP, where the user can enter his/her username and password. If everything is OK, the IdP sends the user's data in XML to the SP. The SP puts this data into Apache environment variables so applications (web pages) can access it. Here I use mod_jk to get these environment variables into Tomcat in HTTP headers.

If I print the user data on the Apache side I get values in UTF-8 encoding, but if I try this on Tomcat I don't get the right values; I have to do a conversion. Is mod_jk responsible for converting UTF-8 environment variables to ASCII header values, or is this conversion made elsewhere?

mirko
Re: mod_jk codepage in header values
Mirko Solic wrote:
> Christopher, thanks for the quick reply.
...
> I'm from Slovenia, Europe. We use characters that are not defined in ASCII, so we are using the UTF-8 code page.
>
> I will try to explain what this application is about. This project (web page) is protected with AAI (http://www.switch.ch/aai/about/). This Authentication and Authorization Infrastructure is roughly divided into the SP (service provider) and the IdP (identity provider). The SP is a module in Apache. So when a user tries to get a web page that is protected with AAI through Apache, the SP module checks whether the user is already logged in. If not, the SP redirects the user to the IdP, where the user can enter his/her username and password. If everything is OK, the IdP sends the user's data in XML to the SP. The SP puts this data into Apache environment variables so applications (web pages) can access it. Here I use mod_jk to get these environment variables into Tomcat in HTTP headers.
>
> If I print the user data on the Apache side I get values in UTF-8 encoding, but if I try this on Tomcat I don't get the right values; I have to do a conversion. Is mod_jk responsible for converting UTF-8 environment variables to ASCII header values, or is this conversion made elsewhere?

Mirko,

I am from Belgium, Europe too. I live in Spain and work mostly for German and other international customers (among which are some from Poland too). This is to say that I am well aware of multi-lingual character set issues, and confront them every day.

So, just to give you some context for your issues:

Despite the fact that Unicode and UTF-8 are now being increasingly used on the web, the fact is that HTTP, and HTML, and most of the other WWW-relevant RFCs, are still US-ASCII and ISO-8859-1 (Latin-1) based. For example, HTTP header values are /supposed/ to contain only single-byte character codes that are part of the (printable subset of the) US-ASCII character set. For example also, by default, all content exchanged between browsers and web servers is ISO-8859-1. And it is so because the relevant RFCs say that it should be. (So the developers of Apache and mod_jk and Tomcat have little choice in the matter; they have to follow the RFCs.)

This does not mean that you cannot handle other character sets on the web. But it means that whenever you do, you have to be attentive to the fact that it is /not/ the standard, and that you may have to do character set translations and/or encoding. It may even mean that, in order to exchange non-US-ASCII or non-ISO-8859-1 data, you may have to use tricks. It also means that, in some cases, by using such tricks, your applications may become non-standard, and will not necessarily work with all servers and all clients.

So for example, to get back to your question above: mod_jk is not responsible for translating anything, and will not translate anything. That is because mod_jk follows the relevant WWW RFCs, which specify that such and such data is ASCII or ISO-8859-1. If the original HTTP request, as it is given by Apache to mod_jk, contains HTTP headers, mod_jk will forward these headers as is to the back-end Tomcat. But, because the HTTP RFC specifies that HTTP headers should contain only US-ASCII character data, mod_jk would be allowed, if it finds non-US-ASCII data in a HTTP header, to strip this data or ignore the header or something like that. I don't know if mod_jk actually does this, but if it did, it would be justified, because according to the HTTP RFC this would be an invalid header.

So, to be practical:
- the current HTTP 1.1 RFC specifies that HTTP headers can only contain US-ASCII printable character data
- some UTF-8 codes contain bytes that are not part of the US-ASCII character set (e.g. bytes with values above 0x7F)
- so, if you want to forward such a header from Apache to Tomcat, in principle the right way is to encode the value of this header on the Apache side, in such a way that it contains only US-ASCII data (for example, using Base64 encoding), then pass it to mod_jk.
- at the other end, your application would have to decode this header (using Base64 decoding) back into UTF-8, and then it would have to read this header value as UTF-8/Unicode.

There is no guarantee that any standard module or class under Apache or mod_jk or Tomcat would properly handle a header that contains non-US-ASCII data. That is because, in principle, they never have to.

I know it is a mess. It is possible that there are shortcuts. It is possible that mod_jk would transmit a HTTP header even if it contains non-US-ASCII data. But it is not certain, because the bible for mod_jk, as for Apache and as for Tomcat, is the RFCs.

We non-English speakers worldwide desperately need a new version of the HTTP protocol where the default would be Unicode/UTF-8, for everything. But I do not see much happening right now in that direction.

Maybe a tip for your authentication issues: If, in the AJP Connector on the Tomcat side, you set the attribute
Re: mod_jk codepage in header values
On Thu, 2010-01-21 at 11:30 +0100, André Warnier wrote:

That was quite a reply :). Thanks for your time and knowledge.

> Mirko,
>
> I am from Belgium, Europe too. I live in Spain and work mostly for German and other international customers (among which are some from Poland too). This is to say that I am well aware of multi-lingual character set issues, and confront them every day.
>
> So, just to give you some context for your issues:
>
> Despite the fact that Unicode and UTF-8 are now being increasingly used on the web, the fact is that HTTP, and HTML, and most of the other WWW-relevant RFCs, are still US-ASCII and ISO-8859-1 (Latin-1) based. For example, HTTP header values are /supposed/ to contain only single-byte character codes that are part of the (printable subset of the) US-ASCII character set. For example also, by default, all content exchanged between browsers and web servers is ISO-8859-1. And it is so because the relevant RFCs say that it should be. (So the developers of Apache and mod_jk and Tomcat have little choice in the matter; they have to follow the RFCs.)

I agree, RFCs are there to be used.

> This does not mean that you cannot handle other character sets on the web. But it means that whenever you do, you have to be attentive to the fact that it is /not/ the standard, and that you may have to do character set translations and/or encoding. It may even mean that, in order to exchange non-US-ASCII or non-ISO-8859-1 data, you may have to use tricks. It also means that, in some cases, by using such tricks, your applications may become non-standard, and will not necessarily work with all servers and all clients.
>
> So for example, to get back to your question above: mod_jk is not responsible for translating anything, and will not translate anything. That is because mod_jk follows the relevant WWW RFCs, which specify that such and such data is ASCII or ISO-8859-1. If the original HTTP request, as it is given by Apache to mod_jk, contains HTTP headers, mod_jk will forward these headers as is to the back-end Tomcat. But, because the HTTP RFC specifies that HTTP headers should contain only US-ASCII character data, mod_jk would be allowed, if it finds non-US-ASCII data in a HTTP header, to strip this data or ignore the header or something like that. I don't know if mod_jk actually does this, but if it did, it would be justified, because according to the HTTP RFC this would be an invalid header.

That is what I'm afraid of. This code: new String(request.getHeader(headerName).getBytes("ISO-8859-1")) works for now, but it really shouldn't work. That is why I'm searching for a more legitimate way.

> So, to be practical:
> - the current HTTP 1.1 RFC specifies that HTTP headers can only contain US-ASCII printable character data
> - some UTF-8 codes contain bytes that are not part of the US-ASCII character set (e.g. bytes with values above 0x7F)
> - so, if you want to forward such a header from Apache to Tomcat, in principle the right way is to encode the value of this header on the Apache side, in such a way that it contains only US-ASCII data (for example, using Base64 encoding), then pass it to mod_jk.
> - at the other end, your application would have to decode this header (using Base64 decoding) back into UTF-8, and then it would have to read this header value as UTF-8/Unicode.
>
> There is no guarantee that any standard module or class under Apache or mod_jk or Tomcat would properly handle a header that contains non-US-ASCII data. That is because, in principle, they never have to.
>
> I know it is a mess. It is possible that there are shortcuts. It is possible that mod_jk would transmit a HTTP header even if it contains non-US-ASCII data. But it is not certain, because the bible for mod_jk, as for Apache and as for Tomcat, is the RFCs.

But where should this Base64 encoding go? (I do not use Apache often :( I'm a Java programmer using Tomcat.) From the IdP (AAI identity provider) I get user data, and the SP (AAI service provider, which is a module in Apache) puts this data into Apache environment variables with UTF-8 values. Then, as I understand it, mod_jk takes these variables and packs them into HTTP headers. I would like to keep the environment variables on Apache with UTF-8 values so that other applications on this Apache (e.g. PHP web pages) would still work. So my guess is that Base64 encoding should happen before mod_jk takes the values from the environment variables and puts them into HTTP headers. Is this possible (I mean, other than by changing the mod_jk code)? Or is this a topic for some other mailing list :).

> We non-English speakers worldwide desperately need a new version of the HTTP protocol where the default would be Unicode/UTF-8, for everything. But I do not see much happening right now in that direction.

Oh, I do agree on that :)

> Maybe a tip for your authentication issues: If, in the AJP Connector on the Tomcat side, you set the attribute
Re: mod_jk codepage in header values
Mirko Solic wrote:
> On Thu, 2010-01-21 at 11:30 +0100, André Warnier wrote:

Mirko, just for info: there is another, related thread taking place at the same time, entitled "Basic Authentication Failed with multibyte username".

Basically, I am interested in those topics because I encounter them myself often in our own web applications. I don't know all the answers, but I know that it is confusing.

As far as I can interpret: according to the HTTP 1.1 RFC 2616, HTTP header fields MAY contain *TEXT portions representing character sets other than US-ASCII. But then, such header field values MUST be encoded according to the rules of RFC 2047.

RFC 2047 in turn, in "2. Syntax of encoded-words", indicates that this should be done using the form:

  encoded-word = "=?" charset "?" encoding "?" encoded-text "?="

for example:

  Header-name: =?iso-8859-1?B?some iso-8859-1 text, base-64 encoded?=
or
  Header-name: =?utf-8?B?some unicode/utf-8 text, base-64 encoded?=

(I am not quite sure here of the "utf-8" part as the correct name for the charset.)

Now, I am not sure that if you pass a HTTP header, encoded as above, from Apache to Tomcat, the Tomcat getHeader() call will properly decode it, using the indicated charset. If not, you will have to do the decoding yourself, if you want to pass non-ASCII (or non-ISO-8859-1) characters in those headers.

Admittedly, it is a pain; but there are still quite a few grey areas like that in the WWW-related RFCs where character sets are concerned.

If you have to do this kind of encoding/decoding, I suggest having a look at MIME (email) libraries. This kind of encoding/decoding is regularly used in email headers. Save the original text (.eml) format of an email with a non-ASCII subject line for an example.
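To make the encoded-word form above concrete, here is a minimal Java sketch of the "B" (Base64) variant only. It is an illustration, not a complete RFC 2047 implementation: the class name EncodedWord is hypothetical, the "Q" encoding and the 75-character length limit on encoded-words are ignored, and a value that is not a single encoded-word is returned unchanged. It relies on java.util.Base64, which requires Java 8 or later; a MIME library would be the more robust choice in production.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper covering only the Base64 ("B") form of an encoded-word.
final class EncodedWord {

    // Matches =?charset?B?encoded-text?= as a whole value.
    private static final Pattern B_WORD =
            Pattern.compile("=\\?([^?]+)\\?[Bb]\\?([^?]*)\\?=");

    /** Wraps text as a UTF-8, Base64 encoded-word containing only ASCII. */
    static String encodeUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return "=?UTF-8?B?" + Base64.getEncoder().encodeToString(bytes) + "?=";
    }

    /** Decodes a single "B" encoded-word; any other value is returned as is. */
    static String decode(String value) {
        Matcher m = B_WORD.matcher(value);
        if (!m.matches()) {
            return value;
        }
        // Charset.forName throws if the named charset is unknown to the JVM.
        Charset charset = Charset.forName(m.group(1));
        return new String(Base64.getDecoder().decode(m.group(2)), charset);
    }

    public static void main(String[] args) {
        String wire = encodeUtf8("\u017Diga");
        System.out.println(wire);
        System.out.println(decode(wire).equals("\u017Diga"));
    }
}
```

With this approach the encoding would happen wherever the header is created on the Apache side, and decode would be applied to the result of getHeader() in the webapp.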
Re: mod_jk codepage in header values
Mirko,

On 1/21/2010 6:43 AM, Mirko Solic wrote:
> That's what I'm afraid of. This code:
>
>     new String(request.getHeader(headerName).getBytes("ISO-8859-1"))
>
> works for now but it really shouldn't work. That's why I'm searching for a more legitimate way.

What would be better is to do something like this:

    java.net.URLEncoder.encode(request.getHeader(headerName), "UTF-8")

Of course, this will only work if your client knows that's how the encoding will be done.

> From the IdP (AAI identity provider) I get user data, and the SP (AAI service provider, which is a module in Apache) puts this data into Apache environment variables with UTF-8 values. Then, as I understand it, mod_jk takes these variables and packs them into HTTP headers. I would like to keep the environment variables on Apache in UTF-8 so that other applications on this Apache (e.g. PHP web pages) still work.

AAI needs to support whatever encoding you intend to use. You can't simply transcode things in an arbitrary way and expect AAI to work properly. What does their documentation say about what format these values should take?

> AAI returns more than just a user-id. The idea behind AAI is that the application saves as little data about the user as possible. All data is provided by AAI. This data includes, for example, first-name, last-name, address, ... It would be perfect if we had this SP running on Tomcat and didn't need Apache, but at this time there is no such SP.

A better strategy would be for AAI to provide a numeric token (easily passable in HTTP headers without any encoding issues) and then provide an HTTP-based and/or XML-based API that uses proper document encoding to send textual data across the wire. Using HTTP headers for text data sucks!
-chris
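The URL-encoding approach mentioned in this message can be sketched as a round trip: whatever sets the header encodes the value, and the servlet decodes what getHeader() returns. The class and method names below are hypothetical. Note that java.net.URLEncoder produces application/x-www-form-urlencoded output, so a space becomes "+" rather than "%20"; this is only a sketch under the assumption that both sides agree on that convention and on UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper for illustration only.
class PercentEncodingSketch {

    // Sender side: turn arbitrary text into an ASCII-only header value.
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Receiver side: recover the original text from the header value.
    static String decode(String headerValue) {
        try {
            return URLDecoder.decode(headerValue, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String wire = encode("caf\u00e9 au lait");
        System.out.println(wire);
        // The round trip should give back the original text.
        System.out.println(decode(wire).equals("caf\u00e9 au lait"));
    }
}
```

Because the encoded value is pure ASCII, it does not matter which single-byte charset the container uses to turn the header bytes into a String.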
Re: mod_jk codepage in header values
André,

On 1/21/2010 9:21 AM, André Warnier wrote:
> But then, such header field values MUST be encoded according to the rules of RFC 2047.

Unfortunately, Tomcat does not follow RFC2047, at least not according to http://stackoverflow.com/questions/324470/http-headers-encoding-decoding-in-java and not according to my simple test:

    $ wget -O - --header 'Test-Value: =?iso-8859-1?q?this=20is=20some=20text?=' http://myhost/SessionSnooper.jsp | grep -C 1 'some=20text'
    <td>
    =?iso-8859-1?q?this=20is=20some=20text?=<br />
    </td>

The value is preserved as-is. (The SessionSnooper.jsp file referenced above can be found here: http://www.christopherschultz.net/projects/java/).

Fortunately, the value /is/ passed through without modification. That means that we can read it ourselves! Let's figure out how to decode the string =?iso-8859-1?q?this=20is=20some=20text?=:

1. Check that the string matches the pattern =\?[^?]*\?(B|Q)\?[^?]*\?=.
2. Extract the charset and encoding.
3. If encoding is 'Q', convert value characters to bytes: =HL -> 0xHL, others direct.
4. If encoding is 'B', base64-decode the value into bytes.
5. Convert bytes to characters using charset: new String(bytes, charset)

As I started to write code to do this, it occurred to me that it must already exist. Googling for "java rfc2047 decode" shows that the javax.mail.internet.MimeUtility class (packaged with the JavaMail API) already has a method called decodeText that will do this for us.

I wrote a simple wrapper around that method, and you can see that it works:

    $ java -classpath javamail-1.4.2.jar:. RFC2047Codec '=?iso-8859-1?q?this=20is=20some=20text?='
    this is some text
    $ java -classpath javamail-1.4.2.jar:. RFC2047Codec '=?UTF-8?q?this=20is=20some=20text?='
    this is some text
    $ java -classpath javamail-1.4.2.jar:. RFC2047Codec '=?utf-8?q?this=20is=20some=20text?='
    this is some text
    $ java -classpath javamail-1.4.2.jar:. RFC2047Codec '=?utf-8?q?this=20is=20a=20pi:=20=cf=80?='
    this is a pi:
    # Er, the pi wouldn't copy correctly from my terminal, but I assure you that the pi character was dumped to my terminal.

So, if you have to decode RFC2047-compliant values, MimeUtility can help you do that. It can help you encode them, too.

It sounds like you have everything you need at this point, as long as AAI recognizes RFC2047-formatted HTTP header values.

Good luck,
-chris
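For readers who cannot add JavaMail to the classpath, the five decoding steps listed in this message can be sketched with the JDK alone. This is not the RFC2047Codec wrapper mentioned above (its source is not shown in the thread); the class and method names are hypothetical. The sketch assumes the whole header value is exactly one encoded-word, and it additionally maps "_" to a space in the "Q" encoding, which RFC 2047 specifies but the steps above do not mention.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class Rfc2047DecodeSketch {

    // Step 1: the pattern from the message, with capture groups added.
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?]*)\\?([BbQq])\\?([^?]*)\\?=");

    static String decode(String value) {
        Matcher m = ENCODED_WORD.matcher(value);
        if (!m.matches()) {
            return value; // not a single encoded-word: return it unchanged
        }
        // Step 2: extract charset and encoding.
        // Charset.forName throws an unchecked exception for unknown names.
        Charset charset = Charset.forName(m.group(1));
        String payload = m.group(3);
        // Steps 3 and 4: turn the payload into raw bytes.
        byte[] bytes = m.group(2).equalsIgnoreCase("Q")
                ? decodeQ(payload)
                : Base64.getDecoder().decode(payload);
        // Step 5: interpret the bytes in the declared charset.
        return new String(bytes, charset);
    }

    // "Q" encoding: "=HL" becomes the byte 0xHL, "_" becomes a space,
    // every other character is copied as its own byte value.
    private static byte[] decodeQ(String payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < payload.length(); i++) {
            char c = payload.charAt(i);
            if (c == '=' && i + 2 < payload.length()) {
                int hi = Character.digit(payload.charAt(i + 1), 16);
                int lo = Character.digit(payload.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2;
                    continue;
                }
            }
            out.write(c == '_' ? ' ' : c);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(decode("=?iso-8859-1?q?this=20is=20some=20text?="));
    }
}
```

A header value that mixes plain text with several encoded-words is outside the scope of this sketch; MimeUtility.decodeText is the more complete option for that case.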
mod_jk codepage in header values
Hello,

to connect Tomcat with Apache I'm using the mod_jk connector, but I'm having a problem with header values. On the Apache side the header values are in the UTF-8 codepage, but on the Tomcat side I have to convert them from the Latin-1 codepage. I'm using this code:

    for (Enumeration en = request.getHeaderNames(); en.hasMoreElements();) {
        header = new Header();
        headerName = (String) en.nextElement();
        header.setHeaderName(headerName);
        header.setHeaderValue(new String(request.getHeader(headerName).getBytes("ISO-8859-1")));
        headers.add(header);
        header = new Header();
        header.setHeaderName(headerName);
        header.setHeaderValue(request.getHeader(headerName));
        headers.add(header);
    }

Is it possible to configure mod_jk somehow so that this conversion is no longer needed? I went through the configuration documentation but didn't find anything that could solve my problem.

Any help will be much appreciated.

SW versions:
Tomcat 6.0.18
Apache 2.2.3
mod_jk 1.2.28
OS: Linux CentOS 5.3

Best regards,
mirko
Re: mod_jk codepage in header values
Mirko,

On 1/20/2010 9:42 AM, Mirko Solic wrote:
> to connect Tomcat with Apache I'm using the mod_jk connector, but I'm having a problem with header values. On the Apache side the header values are in the UTF-8 codepage, but on the Tomcat side I have to convert them from the Latin-1 codepage.

Hmm. HTTP defines header values as ASCII (well, it inherits that from other RFCs, but, whatever). If you need to encode non-ASCII data in header values, you'll need to do it in such a way that your client understands them. Often, URL-encoding (aka %-encoding) is used in these situations.

> I'm using this code:
>
>     for (Enumeration en = request.getHeaderNames(); en.hasMoreElements();) {
>         header = new Header();
>         headerName = (String) en.nextElement();
>         header.setHeaderName(headerName);
>         header.setHeaderValue(new String(request.getHeader(headerName).getBytes("ISO-8859-1")));

For most values, this will work. On the other hand, the response already knows how to convert a String into ASCII, so you probably don't have to do this.

>         headers.add(header);
>         header = new Header();
>         header.setHeaderName(headerName);
>         header.setHeaderValue(request.getHeader(headerName));
>         headers.add(header);
>     }

The Header class is not part of the Servlet API. What does all of this do?

> Is it possible to configure mod_jk somehow so that this conversion is no longer needed?

I don't believe so. mod_jk simply moves bytes back and forth across the wire. There is little to no interference with the HTTP protocol.

What information are you passing through the HTTP headers that needs to be in a particular encoding? These issues are typically handled using the response body coupled with a Content-Type header which specifies a character encoding.
-chris
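One detail of the conversion quoted in this message is worth spelling out: new String(byte[]) without a charset argument decodes with the platform default charset, so the quoted code only yields UTF-8 text on machines whose default happens to be UTF-8. A minimal sketch that names both charsets explicitly follows. The class and method names are hypothetical, and the sketch assumes that the bytes sent by Apache really are UTF-8 and that the container decoded them as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class HeaderTranscodeSketch {

    // Undo an ISO-8859-1 decoding of bytes that were actually UTF-8.
    // ISO-8859-1 maps every byte value to exactly one char, so getBytes
    // recovers the original bytes without loss.
    static String latin1ToUtf8(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate what the servlet would see: UTF-8 bytes read as ISO-8859-1.
        String seen = new String("caf\u00e9".getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(latin1ToUtf8(seen).equals("caf\u00e9"));
    }
}
```

In a servlet this would be applied to the result of request.getHeader(headerName). If the incoming bytes are not valid UTF-8, the String constructor substitutes replacement characters rather than throwing, so bad input can pass silently.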