Re: mod_jk codepage in header values

2010-02-01 Thread Christopher Schultz

Rainer,

On 1/30/2010 7:56 AM, Rainer Jung wrote:
 So I expect you can forward any binary garbage you like, as long as you
 make sure the code putting it into the environment variables doesn't
 already do any encoding or decoding.

This was pretty much just as I expected.

 Now: it seems that Tomcat is by default assuming it needs to transform
 the binary AJP data stream for request attributes into ISO-8859-1
 decoded Java strings. I'm not 100% sure here, but this is likely the
 most important part of the game.

It looks like AprProtocol.java, in prepareRequest, handles request
attributes in the SC_A_REQ_ATTRIBUTE case. No encoding/decoding is done
there. Instead, it is done by the MessageBytes class, indirectly by the
ByteChunk class.

The documentation for ByteChunk says:

 * In a server it is very important to be able to operate on
 * the original byte[] without converting everything to chars.
 * Some protocols are ASCII only, and some allow different
 * non-UNICODE encodings. The encoding is not known beforehand,
 * and can even change during the execution of the protocol.
 * ( for example a multipart message may have parts with different
 *  encoding )
 *
 * For HTTP it is not very clear how the encoding of RequestURI
 * and mime values can be determined, but it is a great advantage
 * to be able to parse the request without converting to string.

Later:

/** Default encoding used to convert to strings. It should be UTF8,
as most standards seem to converge, but the servlet API requires
8859_1, and this object is used mostly for servlets.
*/
public static final String DEFAULT_CHARACTER_ENCODING="ISO-8859-1";

If ByteChunk.setEncoding has not been called, this default encoding is
used to decode bytes. Unfortunately, setEncoding is not static, so you
have to have a reference to the ByteChunk object in order to fix it.

Then again, knowing that ISO-8859-1 is being used may make it easier to
write a transcoder...

new String(myString.getBytes("ISO-8859-1"), "UTF-8")

That's ugly and I feel like it's asking for problems, but it might be
your only reasonable recourse.
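
A minimal sketch of that transcoding step, assuming the value really is
UTF-8 data that was decoded as ISO-8859-1. The class and method names are
hypothetical (they are not part of Tomcat or mod_jk), and StandardCharsets
requires Java 7 or later:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tomcat or mod_jk.
final class AttributeTranscoder {
    private AttributeTranscoder() {}

    /**
     * Re-interprets a string that was decoded as ISO-8859-1 as UTF-8.
     * Assumes the underlying bytes were UTF-8 in the first place.
     */
    static String latin1ToUtf8(String misdecoded) {
        if (misdecoded == null) {
            return null;
        }
        // ISO-8859-1 maps every char 0..255 back to the same byte value,
        // so this recovers the bytes that arrived over AJP.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

If the value was decoded with some other charset, or was genuine
ISO-8859-1 text with non-ASCII characters, this produces garbage, so it
is only safe where the mis-decoding is known to happen.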

-chris

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org



Re: mod_jk codepage in header values

2010-01-29 Thread Mirko Solic

  OK. He was my mistake i thought that mod_jk automatically takes
  environment  variables and puts them in header. But, yes, as you said
  this is done by AAI. So right encoding should be done by AAI side. Thank
  you for clearing that up.
 
 Let us know what AAI says about this.

OK.

 
  Just for info. I try to put in JkEnvVar directive, value with utf8
  character encoding and the result was the same. On the tomcat side i got
  (through  request.getAttribute(attributeName)) value in ISO-8859-1
  character encoding.
 
 How did you construct your UTF-8-encoded environment variable? Can you
 give us an example for how to reproduce this?

I tried this with two different approaches.

1.
AAI (this is done by the Apache Shibboleth module mod_shibx.so) puts AAI
attributes in headers and in environment variables (version 1.3 only in
headers; version 2 in environment variables, and with the directive
ShibUseHeaders On also in headers).
So in the mod_jk conf file I define with the JkEnvVar directive which
environment variable should be passed over to Tomcat. I chose one of the
AAI attributes that has a UTF-8 character in it.

2.
Secondly, I tried defining a JkEnvVar directive for a nonexistent
environment variable, and I also added a default value with some
non-ISO-8859-1 characters. My conf file is in UTF-8 encoding, so the
default value should also be in UTF-8 encoding.

I believe you could reproduce the second approach.

mirko







Re: mod_jk codepage in header values

2010-01-29 Thread Christopher Schultz

Mirko,

On 1/29/2010 4:02 AM, Mirko Solic wrote:
 Secondly i try to define JkEnvVar directive for non existent environment
 variable and i added also default value with some no ISO-8859-1
 characters. My conf file is in utf8 encoding so default value should
 also be in utf8 encoding.

I'd be interested in how Apache httpd reads the httpd.conf file. If it
reads the file in utf-8 encoding, then this could be a problem with
mod_jk. If it reads it using ISO-8859-1 or US-ASCII or something like
that, then the data is already broken before mod_jk gets ahold of it.

You might want to re-post your question by saying that UTF-8 data is
incorrectly transmitted to request /attributes/ and see if any of the
mod_jk devs can take a look at that.

-chris




Re: mod_jk codepage in header values

2010-01-27 Thread Mirko Solic
  According to André Warnier:
 
  But, because the HTTP RFC specifies that HTTP headers 
  should contain only US-ASCII character data, mod_jk would be allowed,
  if 
  it finds non-US-ASCII data in a HTTP header, to strip this data or 
  ignore the header or something like that.  I don't know if mod_jk 
  actually does this, but if it did, it would be justified, because 
  according to the HTTP RFC this would be an invalid header.
  
  Than i have no values to decode to.
 
 I can tell you there's no reason for mod_jk to do this, and I don't
 believe it does, for the testing I have performed does not demonstrate
 that behavior.

Yes. It is also working for me. I have no problem with that at the
moment. My fear is just that at some point in the future it won't work
any more.

  
  I agree with you here: Using HTTP headers for text data sucks!. But AAI
  is not supported on tomcat yet. However it is supported on apache and
  the only way for me if i want to use AAI and tomcat is to use mod_jk
  connector. But mod_jk is transporting environment variables from apache
  to tomcat in HTTP header.
 
 That sounds like an AAI bug, not an httpd/mod_jk/Tomcat bug: mod_jk
 sends environment variables as request /attributes/, not request
 headers. (See the JkEnvVar directive in
 http://tomcat.apache.org/connectors-doc/reference/apache.html). If AAI
 is creating new request headers, it's AAI's fault for incorrectly
 formatting them. If you can get this data from a request /attribute/
 instead, then maybe that's a better option (though there are no
 references to character encoding in the documentation for JkEnvVar).

OK. It was my mistake: I thought that mod_jk automatically takes
environment variables and puts them in headers. But, yes, as you said,
this is done by AAI. So the right encoding should be done on the AAI
side. Thank you for clearing that up.

Just for info: I tried putting a value with UTF-8 character encoding in
the JkEnvVar directive, and the result was the same. On the Tomcat side
I got (through request.getAttribute(attributeName)) the value in
ISO-8859-1 character encoding.

Regards, mirko









Re: mod_jk codepage in header values

2010-01-27 Thread Christopher Schultz

Mirko,

On 1/27/2010 3:02 AM, Mirko Solic wrote:
 OK. He was my mistake i thought that mod_jk automatically takes
 environment  variables and puts them in header. But, yes, as you said
 this is done by AAI. So right encoding should be done by AAI side. Thank
 you for clearing that up.

Let us know what AAI says about this.

 Just for info. I try to put in JkEnvVar directive, value with utf8
 character encoding and the result was the same. On the tomcat side i got
 (through  request.getAttribute(attributeName)) value in ISO-8859-1
 character encoding.

How did you construct your UTF-8-encoded environment variable? Can you
give us an example for how to reproduce this?

-chris




Re: mod_jk codepage in header values

2010-01-25 Thread Mirko Solic
On Thu, 2010-01-21 at 15:21 +0100, André Warnier wrote:
 Mirko Solic wrote:
  On Thu, 2010-01-21 at 11:30 +0100, André Warnier wrote:
  
 Mirko,
 just for info : there is a related other thread taking place at the same 
 time, entitled Basic Authentication Failed with multibyte username.

I have read it. 

 
 Basically, I am interested in those topics because I encounter them 
 myself often in our own web applications.
 I don't know all the answers, but I know that it is confusing.
 
 As far as I can interpret :
 
 According to the HTTP 1.1 RFC 2616, HTTP header fields MAY contain *TEXT 
 portions representing character sets other than US-ASCII.
 But then, such header field values MUST be encoded according to the 
 rules of RFC 2047.
 RFC 2047 in turn, in 2. Syntax of encoded-words , indicates that this 
 should be done using the form :
 encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
 for example :
 
 Header-name: =?iso-8859-1?B?some iso-8859-1 text, base-64 encoded?=
 or
 Header-name: =?utf-8?B?some unicode/utf-8 text, base-64 encoded?=
 (I am not quite sure here of the utf-8 part as the correct name for 
 the charset.)
 
 Now, I am not sure that if you pass a HTTP header, encoded as above, 
 from Apache to Tomcat, the Tomcat getHeader() call will properly decode 
 it, using the indicated charset.
 
 If not, you will have to do the decoding yourself, if you want to pass 
 non-ascii (or non-iso-8859-1) characters in those headers.
 Admittedly, it is a pain; but there are still quite a few grey areas 
 like that in the WWW-related RFCs in what concerns character sets.
 If you have to do this kind of encoding/decoding, I suggest to have a 
 look in MIME (email) libraries.  Such kind of encoding/decoding is 
 regularly used in email headers.  Save the original text (.eml) format 
 of an email, with a non-ascii subject line, for an example.

As I understand it, I have no control over when environment variables on
the Apache side are put into HTTP headers and sent to the Tomcat side.
This is done by mod_jk automatically.
I would hate to put already-encoded values into environment variables on
the Apache side so that mod_jk transfers them correctly to the Tomcat
side, because then other web pages that use these variables would no
longer work.

The right way (in my understanding) would be for mod_jk to encode
environment variables according to the rules of RFC 2047.

Regards, mirko





Re: mod_jk codepage in header values

2010-01-25 Thread Mirko Solic
On Thu, 2010-01-21 at 10:34 -0500, Christopher Schultz wrote:
 On 1/21/2010 6:43 AM, Mirko Solic wrote:
  That what i'm afraid of. This code: new
   String(request.getHeader(headerName).getBytes("ISO-8859-1")) works for
  now but it really shouldn't work.
  That way i'm searching for more legitimate way.
 
 What would be better is to do something like this:
 
 java.net.URLEncoder.encode(request.getHeader(headerName), "UTF-8")
 
 Of course, this will only work if your client knows that's how the
 encoding will be done.

Yes, but what if mod_jk chooses not to send non-ISO-8859-1 header values
over to the Tomcat side? According to André Warnier:
 But, because the HTTP RFC specifies that HTTP headers 
 should contain only US-ASCII character data, mod_jk would be allowed,
 if 
 it finds non-US-ASCII data in a HTTP header, to strip this data or 
 ignore the header or something like that.  I don't know if mod_jk 
 actually does this, but if it did, it would be justified, because 
 according to the HTTP RFC this would be an invalid header.

Then I have no values to decode.


 AAI needs to support whatever encoding you intend to use. You can't
 simply transcode things in an arbitrary way and expect AAI to work
 properly. What does their documentation say about what format these
 values should take?

The problem arises when I want to get data from AAI. AAI sends data in
UTF-8, but it gets broken when the data is sent from the Apache side to
the Tomcat side.

 A better strategy would be for AAI to provide a numeric token (easily
 passable in HTTP headers without any encoding issues) and then provide
 an HTTP-based and/or XML-based API that uses proper document encoding to
 send textual data across the wire.
 
 Using HTTP headers for text data sucks!

I agree with you here: using HTTP headers for text data sucks! But AAI
is not supported on Tomcat yet. It is, however, supported on Apache, and
the only way for me to use AAI with Tomcat is the mod_jk connector. But
mod_jk is transporting environment variables from Apache to Tomcat in
HTTP headers.
And yes, AAI sends data to Apache in an XML document, not over HTTP
headers. On the Apache side, when the data is received, it is put into
environment variables.

mirko





Re: mod_jk codepage in header values

2010-01-25 Thread Christopher Schultz

Mirko,

On 1/25/2010 4:06 AM, Mirko Solic wrote:
 How i understand i don't have control when environment variables on
 apache side are putted in http header and sent to tomcat side. This is
 done by mode_jk automatically. 
 I would hate to put encoded values already in environment variables on
 apache side so mod_jk would transfer them corectly on tomcat side but
 then other web pages that uses this variables wouldn't work no more.
 
 Right way would be (for my understanding) that mod_jk would encode
 environment varibales according to the rules of RFC 2047.

Again, mod_jk's job is to deliver the variables (as HTTP headers, right?)
without any manipulation whatsoever. It is not mod_jk's job to re-encode
things that aren't acceptable to your webapp.

If you want to use RFC2047-encoded values, then go ahead and use them.
You'll just have to write some code into your webapp to decode them, as
Tomcat does not directly support RFC2047 (though it also doesn't
interfere with it).
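
A rough sketch of such decoding code, assuming only the Base64 ("B")
form of RFC 2047 encoded-words is used; the "Q" form is left untouched,
and whitespace between adjacent encoded-words is not collapsed. The class
name is hypothetical, and java.util.Base64 requires Java 8 or later. A
MIME library (for example JavaMail's MimeUtility.decodeText) is probably
the better choice in practice:

```java
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes "=?charset?B?base64?=" words in a value.
final class EncodedWordDecoder {
    // Groups: 1 = charset name, 2 = encoding letter, 3 = encoded text.
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?]+)\\?([bB])\\?([^?]*)\\?=");

    private EncodedWordDecoder() {}

    static String decode(String headerValue) {
        Matcher m = ENCODED_WORD.matcher(headerValue);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Throws if the charset name is unknown or the Base64 is
            // malformed; a real webapp would want to handle that.
            Charset charset = Charset.forName(m.group(1));
            byte[] raw = Base64.getDecoder().decode(m.group(3));
            String text = new String(raw, charset);
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Text outside encoded-words is passed through unchanged, so the method
can be applied to header values that are not encoded at all.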

-chris




Re: mod_jk codepage in header values

2010-01-25 Thread Christopher Schultz

Mirko,

On 1/25/2010 4:24 AM, Mirko Solic wrote:
 On Thu, 2010-01-21 at 10:34 -0500, Christopher Schultz wrote:
 What would be better is to do something like this:

 java.net.URLEncoder.encode(request.getHeader(headerName), "UTF-8")

 Of course, this will only work if your client knows that's how the
 encoding will be done.
 
 Yes but what if mod_jk chooses not to send non ISO-8859-1 header values
 over to tomcat side.

This is simply not mod_jk's job: mod_jk pretty much delivers the exact
bytes sent by the client. Trust me: it's better that way.

 According to André Warnier:

 But, because the HTTP RFC specifies that HTTP headers 
 should contain only US-ASCII character data, mod_jk would be allowed,
 if 
 it finds non-US-ASCII data in a HTTP header, to strip this data or 
 ignore the header or something like that.  I don't know if mod_jk 
 actually does this, but if it did, it would be justified, because 
 according to the HTTP RFC this would be an invalid header.
 
 Than i have no values to decode to.

I can tell you there's no reason for mod_jk to do this, and I don't
believe it does, for the testing I have performed does not demonstrate
that behavior.

 AAI needs to support whatever encoding you intend to use. You can't
 simply transcode things in an arbitrary way and expect AAI to work
 properly. What does their documentation say about what format these
 values should take?
 
 The problem is when i want to get data from AAI. AAI is sending data in
 utf-8 but this is broken when data is send from apache side to tomcat
 side.

So, the bytes are being sent as UTF-8 instead of US-ASCII. I think
you're back to where we started: re-encoding strings. It's possible that
you may run into a situation where the re-encoding is simply going to
fail because of how badly the string has been damaged by an incorrect
decoding. Maybe that's not an issue with ISO-8859-1 (at least it's a
1-byte encoding and all bytes are ostensibly legal).
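
The difference can be checked with a small sketch (hypothetical class
name; StandardCharsets requires Java 7 or later). It decodes raw bytes
with a given charset and encodes them again, which only returns the
original bytes if the decoding step lost nothing:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for testing whether a wrong decoding is reversible.
final class RoundTripCheck {
    private RoundTripCheck() {}

    /** True if decoding raw with the charset and re-encoding is lossless. */
    static boolean survives(byte[] raw, Charset charset) {
        String decoded = new String(raw, charset);
        return Arrays.equals(raw, decoded.getBytes(charset));
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u0161".getBytes(StandardCharsets.UTF_8); // 0xC5 0xA1
        // ISO-8859-1 assigns a character to every byte value.
        System.out.println(survives(utf8, StandardCharsets.ISO_8859_1));
        // US-ASCII has no mapping for bytes above 0x7F, so they are replaced.
        System.out.println(survives(utf8, StandardCharsets.US_ASCII));
    }
}
```

With ISO-8859-1 the check succeeds for any byte sequence, which is why
the re-encoding trick is recoverable there; with US-ASCII, bytes above
0x7F are replaced during decoding and cannot be restored.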

 A better strategy would be for AAI to provide a numeric token (easily
 passable in HTTP headers without any encoding issues) and then provide
 an HTTP-based and/or XML-based API that uses proper document encoding to
 send textual data across the wire.

 Using HTTP headers for text data sucks!
 
 I agree with you here: Using HTTP headers for text data sucks!. But AAI
 is not supported on tomcat yet. However it is supported on apache and
 the only way for me if i want to use AAI and tomcat is to use mod_jk
 connector. But mod_jk is transporting environment variables from apache
 to tomcat in HTTP header.

That sounds like an AAI bug, not an httpd/mod_jk/Tomcat bug: mod_jk
sends environment variables as request /attributes/, not request
headers. (See the JkEnvVar directive in
http://tomcat.apache.org/connectors-doc/reference/apache.html). If AAI
is creating new request headers, it's AAI's fault for incorrectly
formatting them. If you can get this data from a request /attribute/
instead, then maybe that's a better option (though there are no
references to character encoding in the documentation for JkEnvVar).

-chris




Re: mod_jk codepage in header values

2010-01-21 Thread Mirko Solic
Christopher, thanks for the quick reply.

  for connecting tomcat with apache i'm using mod_jk connector. But i'm
  having problem with header values. On apache side headers values are in
  UTF-8 cp but on tomcat side i have to make conversion from latin-1 cp.
 
 Hmm.
 
 HTTP defines header values as ASCII (well, it inherits that from other
 RFCs, but, whatever). If you need to encode non-ASCII data in header
 values, you'll need to do it in such a way that your client understands
 them. Often, URL-encoding (aka %-encoding) is used in these situations.
 
  I'm using this code:
  
  for(Enumeration en = request.getHeaderNames(); en.hasMoreElements();){
  header = new Header();
  headerName = (String) en.nextElement();
  header.setHeaderName(headerName);
  header.setHeaderValue(new
  String(request.getHeader(headerName).getBytes("ISO-8859-1")));
 
 For most values, this will work. On the other hand, the response already
 knows how to convert a String into ASCII, so you probably don't have to
 do this.
 
  headers.add(header);
  
  header = new Header();
  header.setHeaderName(headerName);
  header.setHeaderValue(request.getHeader(headerName));
  headers.add(header);
  }
 
 The Header class is not part of the Servlet API. What does all of this do?
This is just a snapshot of my code. I use the Header class to store
values; it is just a data holder.


 
 What information are you passing through the HTTP headers that needs to
 be in a particular encoding? These issues are typically handled using
 the response body coupled with a Content-Type header which specifies a
 character encoding.

I'm from Slovenia, Europe. We use characters that are not defined
in ASCII, so we use the UTF-8 code page.

I will try to explain what this application is about.

This project (web page) is protected with AAI
(http://www.switch.ch/aai/about/). This Authentication and
Authorization Infrastructure is roughly divided into the SP (service
provider) and the Idp (identity provider). The SP is a module in Apache.
When a user tries to get a web page that is protected with AAI through
Apache, the SP module checks whether the user is already logged in. If
not, the SP redirects the user to the Idp, where the user can enter
his/her username and password. If everything is OK, the Idp sends the
user's data in XML to the SP. The SP puts this data into Apache
environment variables so applications (web pages) can access it.
Here I use mod_jk to get these environment variables into Tomcat in HTTP
headers. If I print the user data on the Apache side, I get values in
UTF-8 encoding, but if I try this on Tomcat I don't get the right
values; I have to make a conversion.

Is mod_jk responsible for converting UTF-8 environment variables to
ASCII header values, or is this conversion made elsewhere?

mirko






Re: mod_jk codepage in header values

2010-01-21 Thread André Warnier

Mirko Solic wrote:

Christopher thanks for quick replay.


...



I'm from Slovenija, Europe. We are using character that are not defined
in ASCII so we are using UTF-8 cp. 


I will try to explain what is this application about.

This project (web page) is protected with AAI
(http://www.switch.ch/aai/about/). This  Authentication and
Authorization infrastructure is roughly divided on SP (service provider)
and Idp (identity provider). SP is module in apache. So when user try to
get web page that is protected with AAI through apache, SP module checks
if user is alredy logged in. If not SP redirects user to Idp where user
can put his/her username and password. If everything is ok Idp sends
users data in xml to SP. SP puts this data into apache 
environment variables so applications (web pages) can access it.

Here i use mod_jk to get this environment variables in tomcat in HTTP
header. If i print user data on apache side i get values in UTF-8
encoding but if i try this on tomcat i don't get right values i have to
make conversion.

Is it mod_jk responsible for converting UTF-8 environment variable to
ACSII header values or is this conversion made elsewhere? 


Mirko,
I am from Belgium, Europe too. I live in Spain and work mostly for 
German and other international customers (among which are some from 
Poland too). This to say that I am well-aware of multi-lingual character 
set issues, and confront them every day.

So, just so as to give you some context for your issues :

Despite the fact that Unicode and UTF-8 are now being increasingly used 
on the web, the fact is that HTTP, and HTML, and most of the other 
WWW-relevant RFCs, are still US-ASCII and ISO-8859-1 (latin-1) based.


For example, HTTP header values are /supposed/ to contain only 
single-byte character codes that are part of the (printable subset of) 
US-ASCII character set.
For example also, by default, all content exchanged between browsers 
and web servers is iso-8859-1.

And it is so because the relevant RFCs say that it should be.
(So the developers of Apache and mod_jk and Tomcat have little choice in 
the matter; they have to follow the RFCs).


This does not mean that you cannot handle other character sets on the 
web.  But it means that whenever you do, you have to be attentive to the 
fact that it is /not/ the standard, and that you may have to do 
character set translations and/or encoding.
It may even mean that, in order to exchange non-US-ASCII or 
non-ISO-8859-1 data, you may have to use tricks.
It also means that, in some cases, by using such tricks, your 
applications may become non-standard, and will not necessarily work 
with all servers and all clients.


So for example, to get back to your question above : mod_jk is not 
responsible for translating anything, and will not translate anything. 
That is because mod_jk follows the relevant WWW RFCs, which specify that 
such and such data is ASCII or ISO-8859-1.


If the original HTTP request, as it is given by Apache to mod_jk, 
contains HTTP headers, mod_jk will forward these headers as is to the 
back-end Tomcat.  But, because the HTTP RFC specifies that HTTP headers 
should contain only US-ASCII character data, mod_jk would be allowed, if 
it finds non-US-ASCII data in a HTTP header, to strip this data or 
ignore the header or something like that.  I don't know if mod_jk 
actually does this, but if it did, it would be justified, because 
according to the HTTP RFC this would be an invalid header.


So, to be practical :
- the current HTTP 1.1 RFC specifies that HTTP headers can only contain 
US-ASCII printable character data
- some UTF-8 codes contain bytes that are not part of the US-ASCII 
character set (e.g. : bytes with values above 0x7F)
- so, if you want to forward such a header from Apache to Tomcat, in 
principle the right way is to encode the value of this header on the 
Apache side, in such a way that it contains only US-ASCII data (for 
example, using Base64 encoding), then pass it to mod_jk.
- at the other end, your application would have to decode this header 
(using Base64 decoding) back into UTF-8, and then it would have to read 
this header value as UTF-8/Unicode.
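
The encode and decode steps in the list above can be sketched as follows.
This is only an illustration in Java: the class name is hypothetical,
java.util.Base64 requires Java 8 or later, and in the setup discussed
here the encoding side would have to run inside Apache rather than in
Java, which this sketch does not cover:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper showing both directions of the Base64 transport.
final class HeaderBase64 {
    private HeaderBase64() {}

    /** Sender side: UTF-8 text to a US-ASCII-only header value. */
    static String encode(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(utf8);
    }

    /** Receiver side (the webapp): header value back to the UTF-8 text. */
    static String decode(String headerValue) {
        byte[] utf8 = Base64.getDecoder().decode(headerValue);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

Because the Base64 alphabet is plain US-ASCII, the encoded value is not
affected by whichever single-byte charset the intermediate components
use to decode header values.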


There is no guarantee that any standard module or class under Apache or 
mod_jk or Tomcat would properly handle a header that contains 
non-US-ASCII data.  That is because, in principle, they never have to.


I know it is a mess. It is possible that there are shortcuts.  It is 
possible that mod_jk would transmit a HTTP header, even if it contains 
non-US-ASCII data. But it is not sure, because the bible for mod_jk, 
as for Apache and as for Tomcat, are the RFCs.


We non-English speakers worldwide desperately need a new version of the 
HTTP protocol where the default would be Unicode/UTF-8, for everything.

But I do not see much happening right now in that direction.


Maybe a tip for your authentication issues :
If, in the AJP Connector on the Tomcat side, you set the attribute

Re: mod_jk codepage in header values

2010-01-21 Thread Mirko Solic
On Thu, 2010-01-21 at 11:30 +0100, André Warnier wrote:

This was quite a reply :). Thanks for your time and
knowledge.

 Mirko,
 I am from Belgium, Europe too. I live in Spain and work mostly for 
 German and other international customers (among which are some from 
 Poland too). This to say that I am well-aware of multi-lingual character 
 set issues, and confront them every day.
 So, just so as to give you some context for your issues :
 
 Despite the fact that Unicode and UTF-8 are now being increasingly used 
 on the web, the fact is that HTTP, and HTML, and most of the other 
 WWW-relevant RFCs, are still US-ASCII and ISO-8859-1 (latin-1) based.
 
 For example, HTTP header values are /supposed/ to contain only 
 single-byte character codes that are part of the (printable subset of) 
 US-ASCII character set.
 For example also, by default, all content exchanged between browsers 
 and web servers is iso-8859-1.
 And it is so because the relevant RFCs say that it should be.
 (So the developers of Apache and mod_jk and Tomcat have little choice in 
 the matter; they have to follow the RFCs).

I agree, RFCs are there to be used.

 
 This does not mean that you cannot handle other character sets on the 
 web.  But it means that whenever you do, you have to be attentive to the 
 fact that it is /not/ the standard, and that you may have to do 
 character set translations and/or encoding.
 It may even mean that, in order to exchange non-US-ASCII or 
 non-ISO-8859-1 data, you may have to use tricks.
 It also means that, in some cases, by using such tricks, your 
 applications may become non-standard, and will not necessarily work 
 with all servers and all clients.
 
 So for example, to get back to your question above : mod_jk is not 
 responsible for translating anything, and will not translate anything. 
 That is because mod_jk follows the relevant WWW RFCs, which specify that 
 such and such data is ASCII or ISO-8859-1.
 
 If the original HTTP request, as it is given by Apache to mod_jk, 
 contains HTTP headers, mod_jk will forward these headers as is to the 
 back-end Tomcat.  But, because the HTTP RFC specifies that HTTP headers 
 should contain only US-ASCII character data, mod_jk would be allowed, if 
 it finds non-US-ASCII data in a HTTP header, to strip this data or 
 ignore the header or something like that.  I don't know if mod_jk 
 actually does this, but if it did, it would be justified, because 
 according to the HTTP RFC this would be an invalid header.

That is what I'm afraid of. This code: new
String(request.getHeader(headerName).getBytes("ISO-8859-1")) works for
now, but it really shouldn't work.
That is why I'm searching for a more legitimate way.
 
 So, to be practical :
 - the current HTTP 1.1 RFC specifies that HTTP headers can only contain 
 US-ASCII printable character data
 - some UTF-8 codes contain bytes that are not part of the US-ASCII 
 character set (e.g. : bytes with values above 0x7F)
 - so, if you want to forward such a header from Apache to Tomcat, in 
 principle the right way is to encode the value of this header on the 
 Apache side, in such a way that it contains only US-ASCII data (for 
 example, using Base64 encoding), then pass it to mod_jk.
 - at the other end, your application would have to decode this header 
 (using Base64 decoding) back into UTF-8, and then it would have to read 
 this header value as UTF-8/Unicode.
 
 There is no guarantee that any standard module or class under Apache or 
 mod_jk or Tomcat would properly handle a header that contains 
 non-US-ASCII data.  That because, in principle, they never have to.
 
 I know it is a mess. It is possible that there are shortcuts.  It is 
 possible that mod_jk would transmit a HTTP header, even if it contains 
 non-US-ASCII data. But it is not sure, because the bible for mod_jk, 
 as for Apache and as for Tomcat, are the RFCs.

But where should this Base64 encoding go? (I do not use Apache often :(
I'm a Java programmer using Tomcat.)
From the Idp (AAI identity provider) I get user data, and the SP (AAI
service provider, a module in Apache) puts this data into Apache
environment variables with UTF-8 values. Then, as I understand it,
mod_jk takes these variables and packs them into HTTP headers. I would
like to keep the environment variables on Apache with UTF-8 values so
that applications (e.g. PHP web pages) on this Apache would still work.
So my guess is that the Base64 encoding should happen before mod_jk
takes values from the environment variables and puts them into HTTP
headers. Is this possible (I mean, other than by changing the mod_jk
code)? Or is this a topic for some other mailing list :).


 We non-English speakers worldwide desperately need a new version of the 
 HTTP protocol where the default would be Unicode/UTF-8, for everything.
 But I do not see much happening right now in that direction.

Oh, I do agree on that :)

 
 
 Maybe a tip for your authentication issues :
 If, in the AJP Connector on the Tomcat side, you set the attribute
 

Re: mod_jk codepage in header values

2010-01-21 Thread André Warnier

Mirko Solic wrote:

On Thu, 2010-01-21 at 11:30 +0100, André Warnier wrote:


Mirko,
just for info : there is another, related thread taking place at the same 
time, entitled "Basic Authentication Failed with multibyte username".


Basically, I am interested in those topics because I encounter them 
myself often in our own web applications.

I don't know all the answers, but I know that it is confusing.

As far as I can interpret :

According to the HTTP 1.1 RFC 2616, HTTP header fields MAY contain *TEXT 
portions representing character sets other than US-ASCII.
But then, such header field values MUST be encoded according to the 
rules of RFC 2047.
RFC 2047 in turn, in "2. Syntax of encoded-words", indicates that this 
should be done using the form :

encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
for example :

Header-name: =?iso-8859-1?B?some iso-8859-1 text, base-64 encoded?=
or
Header-name: =?utf-8?B?some unicode/utf-8 text, base-64 encoded?=
(I am not quite sure here of the "utf-8" part as the correct name for 
the charset.)


Now, I am not sure that if you pass an HTTP header, encoded as above, 
from Apache to Tomcat, the Tomcat getHeader() call will properly decode 
it, using the indicated charset.


If not, you will have to do the decoding yourself, if you want to pass 
non-ascii (or non-iso-8859-1) characters in those headers.
Admittedly, it is a pain; but there are still quite a few grey areas 
like that in the WWW-related RFCs in what concerns character sets.
If you have to do this kind of encoding/decoding, I suggest having a 
look at MIME (email) libraries.  This kind of encoding/decoding is 
regularly used in email headers.  Save the original text (.eml) format 
of an email, with a non-ascii subject line, for an example.
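
As an illustration of the encoded-word form described above, here is a minimal sketch that builds the "B" (Base64) variant with the JDK alone, assuming Java 8 or later. The class name EncodedWordBuilder is hypothetical. The sketch always labels the charset as UTF-8 and ignores the 75-character limit that RFC 2047 places on a single encoded-word, so long values would need to be split.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration; a MIME library would normally do this.
final class EncodedWordBuilder {

    /** Wraps text as =?UTF-8?B?...?= : UTF-8 bytes, Base64-encoded, in encoded-word syntax. */
    static String encodeB(String text) {
        String base64 = Base64.getEncoder()
                .encodeToString(text.getBytes(StandardCharsets.UTF_8));
        return "=?UTF-8?B?" + base64 + "?=";
    }

    public static void main(String[] args) {
        // The result contains only US-ASCII characters.
        System.out.println(encodeB("Andr\u00e9 Warnier"));
    }
}
```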



-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org



Re: mod_jk codepage in header values

2010-01-21 Thread Christopher Schultz

Mirko,

On 1/21/2010 6:43 AM, Mirko Solic wrote:
 That is what I'm afraid of. This code: new
  String(request.getHeader(headerName).getBytes("ISO-8859-1")) works for
 now but it really shouldn't work.
 That is why I'm searching for a more legitimate way.

What would be better is to do something like this:

java.net.URLEncoder.encode(request.getHeader(headerName), "UTF-8")

Of course, this will only work if your client knows that's how the
encoding will be done.
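
A minimal sketch of what such an agreement between the two ends could look like, assuming Java 10 or later for the Charset overloads of URLEncoder.encode and URLDecoder.decode. The class name PercentHeaderCodec is hypothetical. The sending side percent-encodes the text so that only ASCII crosses the wire, and the receiving side reverses it.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; both ends must agree to use it.
final class PercentHeaderCodec {

    /** Sender side: percent-encodes the UTF-8 bytes of the text. */
    static String encode(String text) {
        return URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    /** Receiver side: turns the percent-encoded value back into text. */
    static String decode(String headerValue) {
        return URLDecoder.decode(headerValue, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String wire = encode("pi: \u03c0");
        System.out.println(wire);
        System.out.println(decode(wire).equals("pi: \u03c0"));
    }
}
```

Note that URLEncoder follows the HTML form rules, so a space becomes a plus sign rather than %20. That is harmless as long as the same pair of methods is used on both ends.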

 From the IdP (AAI identity provider) I get user data, and the SP (AAI service
 provider, which is a module in Apache) puts this data in Apache environment
 variables with UTF-8 values. Then, as I understand it, mod_jk takes these
 variables and packs them into HTTP headers. I would like to keep the environment
 variables on Apache as UTF-8 values, so that applications (e.g. PHP web
 pages) running on this Apache still work.

AAI needs to support whatever encoding you intend to use. You can't
simply transcode things in an arbitrary way and expect AAI to work
properly. What does their documentation say about what format these
values should take?

 AAI returns more than just the user-id. The idea behind AAI is that the application
 saves as little data about the user as possible. All data is provided by AAI.
 This data includes, for example, first-name, last-name, address, ... It
 would be perfect if we had this SP running on Tomcat, so that we
 wouldn't need Apache, but at this time there is no such SP.

A better strategy would be for AAI to provide a numeric token (easily
passable in HTTP headers without any encoding issues) and then provide
an HTTP-based and/or XML-based API that uses proper document encoding to
send textual data across the wire.

Using HTTP headers for text data sucks!

-chris



Re: mod_jk codepage in header values

2010-01-21 Thread Christopher Schultz

André,

On 1/21/2010 9:21 AM, André Warnier wrote:
 But then, such header field values MUST be encoded according to the
 rules of RFC 2047.

Unfortunately, Tomcat does not follow RFC2047, at least not according to
http://stackoverflow.com/questions/324470/http-headers-encoding-decoding-in-java
and not according to my simple test:

$ wget -O - --header 'Test-Value: =?iso-8859-1?q?this=20is=20some=20text?=' \
    http://myhost/SessionSnooper.jsp | grep -C 1 'some=20text'

   <td>
=?iso-8859-1?q?this=20is=20some=20text?=<br />
   </td>

The value is preserved as-is. (The SessionSnooper.jsp file referenced
above can be found here: http://www.christopherschultz.net/projects/java/).

Fortunately, the value /is/ passed-through without modification. That
means that we can read it ourselves!

Let's figure out how to decode the string
=?iso-8859-1?q?this=20is=20some=20text?=:

1. Check that the string matches the pattern =\?[^?]*\?(B|Q)\?[^?]*\?=
   (case-insensitively, since the example above uses a lowercase q).
2. Extract the charset and encoding
3. If encoding is 'Q', convert value characters to bytes:
  =HL -> 0xHL
  _   -> 0x20 (per RFC 2047)
  others direct
4. If encoding is 'B', base64 decode value into bytes
5. Convert bytes to characters using charset:
 new String(bytes, charset)
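
The steps above can be sketched with the JDK alone, assuming Java 8 or later for java.util.Base64. The class name EncodedWordDecoder is hypothetical. The sketch handles a value that consists of exactly one encoded-word and returns anything else unchanged. It also maps an underscore to a space in 'Q' text, which RFC 2047 specifies for that encoding. Malformed hex digits or an unknown charset name cause an unchecked exception.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; handles a single encoded-word only.
final class EncodedWordDecoder {

    // Step 1: the pattern from the list, with capture groups, accepting b/q in either case.
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?]*)\\?([BbQq])\\?([^?]*)\\?=");

    static String decode(String value) {
        Matcher m = ENCODED_WORD.matcher(value);
        if (!m.matches()) {
            return value;                       // not an encoded-word: pass through
        }
        // Step 2: extract charset and encoding.
        Charset charset = Charset.forName(m.group(1));
        String encoding = m.group(2);
        String text = m.group(3);
        // Steps 3 and 4: turn the encoded text into bytes.
        byte[] bytes = encoding.equalsIgnoreCase("Q")
                ? decodeQ(text)
                : Base64.getDecoder().decode(text);
        // Step 5: convert bytes to characters using the charset.
        return new String(bytes, charset);
    }

    private static byte[] decodeQ(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '=' && i + 2 < text.length()) {
                // =HL becomes the single byte 0xHL
                out.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 2;
            } else if (c == '_') {
                out.write(0x20);                // underscore stands for a space
            } else {
                out.write(c);                   // other (ASCII) characters map directly
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(decode("=?iso-8859-1?q?this=20is=20some=20text?="));
    }
}
```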

As I started to write code to do this, it occurred to me that it must
already exist. Googling for "java rfc2047 decode" shows that the
javax.mail.internet.MimeUtility class (packaged with the JavaMail API)
already has a method called decodeText that will do this for us.

I wrote a simple wrapper around that method, and you can see that it works:

$ java -classpath javamail-1.4.2.jar:. RFC2047Codec \
    '=?iso-8859-1?q?this=20is=20some=20text?='
this is some text
$ java -classpath javamail-1.4.2.jar:. RFC2047Codec \
    '=?UTF-8?q?this=20is=20some=20text?='
this is some text
$ java -classpath javamail-1.4.2.jar:. RFC2047Codec \
    '=?utf-8?q?this=20is=20some=20text?='
this is some text
$ java -classpath javamail-1.4.2.jar:. RFC2047Codec \
    '=?utf-8?q?this=20is=20a=20pi:=20=cf=80?='
this is a pi: #

Er... the pi wouldn't copy correctly from my terminal, but I assure you
that the pi character was dumped to my terminal.

So, if you have to decode RFC2047-compliant values, MimeUtility can help
you do that. It can also help you encode them, too.

It sounds like you have everything you need at this point, as long as
AAI recognizes RFC2047-formatted HTTP header values.

Good luck,
-chris



mod_jk codepage in header values

2010-01-20 Thread Mirko Solic
Hello,

To connect Tomcat with Apache I'm using the mod_jk connector, but I'm
having a problem with header values. On the Apache side the header values are in
the UTF-8 codepage, but on the Tomcat side I have to convert them from the
Latin-1 codepage. I'm using this code:

for (Enumeration en = request.getHeaderNames(); en.hasMoreElements();) {
    // First entry: value turned back into bytes via ISO-8859-1, then
    // decoded again with the platform default charset
    header = new Header();
    headerName = (String) en.nextElement();
    header.setHeaderName(headerName);
    header.setHeaderValue(new
        String(request.getHeader(headerName).getBytes("ISO-8859-1")));
    headers.add(header);

    // Second entry: value exactly as returned by getHeader()
    header = new Header();
    header.setHeaderName(headerName);
    header.setHeaderValue(request.getHeader(headerName));
    headers.add(header);
}


Is it possible to configure mod_jk somehow so that this conversion would
no longer be needed? I went through the configuration documentation but I
didn't find anything that could solve my problem.

Any help will be much appreciated. 


SW versions:
Tomcat 6.0.18
Apache 2.2.3
mod_jk 1.2.28

OS:Linux Centos 5.3

Regards, Mirko







Re: mod_jk codepage in header values

2010-01-20 Thread Christopher Schultz

Mirko,

On 1/20/2010 9:42 AM, Mirko Solic wrote:
 To connect Tomcat with Apache I'm using the mod_jk connector, but I'm
 having a problem with header values. On the Apache side the header values are in
 the UTF-8 codepage, but on the Tomcat side I have to convert them from the
 Latin-1 codepage.

Hmm.

HTTP defines header values as ASCII (well, it inherits that from other
RFCs, but, whatever). If you need to encode non-ASCII data in header
values, you'll need to do it in such a way that your client understands
them. Often, URL-encoding (aka %-encoding) is used in these situations.

 I'm using this code:
 
 for(Enumeration en = request.getHeaderNames(); en.hasMoreElements();){
 header = new Header();
 headerName = (String) en.nextElement();
 header.setHeaderName(headerName);
 header.setHeaderValue(new
 String(request.getHeader(headerName).getBytes("ISO-8859-1")));

For most values, this will work. On the other hand, the response already
knows how to convert a String into ASCII, so you probably don't have to
do this.

 headers.add(header);
 
 header = new Header();
 header.setHeaderName(headerName);
 header.setHeaderValue(request.getHeader(headerName));
 headers.add(header);
 }

The Header class is not part of the Servlet API. What does all of this do?

 Is it possible to configure mod_jk somehow so that this conversion would
 be no longer needed?

I don't believe so. mod_jk simply moves bytes back and forth across the
wire. There is little to no interference with the HTTP protocol.
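
If the bytes really do cross the wire untouched, and the Tomcat side decodes them as ISO-8859-1 (an assumption here, not something confirmed above), then the original UTF-8 text is still recoverable, because ISO-8859-1 maps every byte to exactly one character. A minimal sketch of that recovery follows; the class and method names are hypothetical. Unlike the code quoted earlier, it names UTF-8 explicitly instead of relying on the platform default charset.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration of the byte-preserving round trip.
final class Latin1Roundtrip {

    /** Simulates the receiving side: UTF-8 bytes on the wire, read as ISO-8859-1. */
    static String asSeenByServlet(String sent) {
        byte[] wire = sent.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    /** Gets the original bytes back via ISO-8859-1, then reads them as UTF-8. */
    static String recoverUtf8(String misdecoded) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sent = "\u0160oli\u0107";
        String seen = asSeenByServlet(sent);    // garbled, but no byte is lost
        System.out.println(recoverUtf8(seen).equals(sent));
    }
}
```

The recovery only holds when the value was genuinely UTF-8 to begin with; applied to text that already was ISO-8859-1, it would corrupt non-ASCII characters.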

What information are you passing through the HTTP headers that needs to
be in a particular encoding? These issues are typically handled using
the response body coupled with a Content-Type header which specifies a
character encoding.

-chris