Re: [PATCH] '8859_1' is not a valid charset alias
Costin:

I'm not yet familiar with the Tomcat or Jasper code (and I've only been on
this list for a couple of weeks), but in general I concur with Vince's
analysis. I can corroborate his benchmark testing, since I've seen this
contribute to performance problems under very heavy load with a large
number of threads (200). I'm baffled why the Java implementors allowed any
synchronization in such a fundamental class as String.

Furthermore, it has been my experience that it is necessary to map
internally between the IANA character set names and the Java encoding
names (I have no idea why the Java implementors chose to use non-standard
encoding names).

I have also never seen the Content-Encoding in use, and rarely have I seen
the charset specified as part of the Content-Type (I've heard that some
user-agents and some servers choke on it). Browsers (user agents) almost
universally post content in the encoding of the original document.

For applications I recommend strictly using a single encoding per session
and limiting query parameters to US-ASCII. Of course, Tomcat needs to do
something reasonable regardless.

Cheers,

Forrest

Vincent Schonau wrote:
>
> On Sat, May 19, 2001 at 03:19:09PM -0700, [EMAIL PROTECTED] wrote:
> > Vincent, Forrest,
> >
> > Thanks for the patch & review.
> >
> > Could you summarize and/or expand a bit :-) ?
>
> The changes I made affect two uses of the concept of character encodings:
>
>  1. what's being sent to the browser (i.e. in JspParseEventListener) as
>     HTTP headers (as literals)
>  2. what's being used to set the CharacterEncoding of input and output
>     streams
>
> The reason I made the patch is that an (older) version of Lynx that I use
> to test apps barfed on the "text/html; charset=8859_1" header. I noticed
> this was non-standard, and that it's all over the tree; hence the patch.
> It's just a standards thing. ("iso-8859-1" is the 'preferred MIME name'
> for this charset; see the IANA charset list that I pointed to.) That's
> category 1.
>
> Forrest then pointed out that the code I touched affects the selection of
> encodings in Java, and that there is a performance gain to be had.
>
> I did a little investigation into Forrest's remarks, and it turns out that
> _consistently_ using something other than what Java treats as the
> canonical name of a string's encoding can have an enormous impact (on my
> benchmark): using the canonical name ("ISO8859_1") instead of some alias
> ("ISO-8859-1" or "8859_1") can give a performance win of up to 20x (!).
> If one looks up the canonical name of the charset before accessing a
> String with a non-default encoding, the total cost is only 1.5x the cost
> of accessing it with the encoding's canonical name.
>
> I've looked at the 3.x tree, and from superficial tests it looks like this
> specific code is hardly ever reached by Tomcat, so optimising it may not,
> in fact, do any good for anyone using iso-8859-1 for most content &
> user-agents. Most of the work is already done, so I'll do it anyway.
>
> That's category 2 (patch coming up).
>
> > Also, has anyone played with the various browsers? Is any browser
> > sending the charset encoding? What format?
>
> I've been playing with this, but I don't have any definite results. As
> part of the work for issue 2 above, I'll be testing this.
>
> There isn't actually any reference to charsets used in the request in
> Servlet 2.2, but there is in 2.3 (SRV.4.9 Request data encoding). (They
> say there that not many browsers currently send a character encoding
> with the request's Content-Type.)
>
> > I know that some browsers are encoding the URL with the same charset
> > that is used in the page, while some are using UTF (there was
> > discussion about that somewhere).
>
> If you have a reference to this, I'll be happy to look into it.
>
> > Is it true that browsers that are using UTF (like IE on NT?) do send
> > the body as UTF? Do they set the Charset-Encoding header?
> >
> > I would really appreciate some info (I don't use Windows, and I heard
> > there are differences between IE/Win9x and IE/NT).
>
> I have no data on this yet, but I will soon.
>
> Hope this helps,
>
> Vince.

--
Forrest Girouard @ Openwave Systems Inc.
phone: +1-650-817-1556
mailto:[EMAIL PROTECTED]
http://www.openwave.com
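Forrest's point that a server has to "do something reasonable" when the charset is absent or unusable can be sketched roughly as follows. This is only an illustration, not Tomcat code: the class RequestCharset and its method are hypothetical names, and it relies on java.nio.charset, which is newer than the JDKs discussed in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper (not part of Tomcat): chooses a charset for a request body.
final class RequestCharset {
    // Assumed fallback when the client does not name a usable charset.
    static final Charset DEFAULT = StandardCharsets.ISO_8859_1;

    // Returns the charset named by the "charset" parameter of a Content-Type
    // value, or DEFAULT when the header is missing, has no charset parameter,
    // or names a charset this JVM does not know.
    static Charset fromContentType(String contentType) {
        if (contentType == null) {
            return DEFAULT;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // Drop optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    // Accepts aliases such as "8859_1" as well as IANA names.
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: fall back.
                    return DEFAULT;
                }
            }
        }
        return DEFAULT;
    }
}
```

Falling back silently is a design choice made for the sketch; a real container might prefer to log the bad name.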
Re: [PATCH] '8859_1' is not a valid charset alias
On Fri, May 18, 2001 at 01:42:17PM +0200, Vincent Schonau wrote:
> [this has also been entered as bug #1808]
>
> Both Tomcat and Apache have the string '8859_1' hard-coded and as a public
                  ^typo: I meant Jasper
> static final String in several places.

Just to clarify: this all relates to the jakarta-tomcat tree (3.3).

Vince.
Re: [PATCH] '8859_1' is not a valid charset alias
On Sat, May 19, 2001 at 03:19:09PM -0700, [EMAIL PROTECTED] wrote:
> Vincent, Forrest,
>
> Thanks for the patch & review.
>
> Could you summarize and/or expand a bit :-) ?

The changes I made affect two uses of the concept of character encodings:

 1. what's being sent to the browser (i.e. in JspParseEventListener) as
    HTTP headers (as literals)
 2. what's being used to set the CharacterEncoding of input and output
    streams

The reason I made the patch is that an (older) version of Lynx that I use to
test apps barfed on the "text/html; charset=8859_1" header. I noticed this
was non-standard, and that it's all over the tree; hence the patch. It's
just a standards thing. ("iso-8859-1" is the 'preferred MIME name' for this
charset; see the IANA charset list that I pointed to.) That's category 1.

Forrest then pointed out that the code I touched affects the selection of
encodings in Java, and that there is a performance gain to be had.

I did a little investigation into Forrest's remarks, and it turns out that
_consistently_ using something other than what Java treats as the canonical
name of a string's encoding can have an enormous impact (on my benchmark):
using the canonical name ("ISO8859_1") instead of some alias ("ISO-8859-1"
or "8859_1") can give a performance win of up to 20x (!). If one looks up
the canonical name of the charset before accessing a String with a
non-default encoding, the total cost is only 1.5x the cost of accessing it
with the encoding's canonical name.

I've looked at the 3.x tree, and from superficial tests it looks like this
specific code is hardly ever reached by Tomcat, so optimising it may not, in
fact, do any good for anyone using iso-8859-1 for most content &
user-agents. Most of the work is already done, so I'll do it anyway.

That's category 2 (patch coming up).

> Also, has anyone played with the various browsers? Is any browser
> sending the charset encoding? What format?

I've been playing with this, but I don't have any definite results. As part
of the work for issue 2 above, I'll be testing this.

There isn't actually any reference to charsets used in the request in
Servlet 2.2, but there is in 2.3 (SRV.4.9 Request data encoding). (They say
there that not many browsers currently send a character encoding with the
request's Content-Type.)

> I know that some browsers are encoding the URL with the same charset that
> is used in the page, while some are using UTF (there was discussion about
> that somewhere).

If you have a reference to this, I'll be happy to look into it.

> Is it true that browsers that are using UTF (like IE on NT?) do send
> the body as UTF? Do they set the Charset-Encoding header?
>
> I would really appreciate some info (I don't use Windows, and I heard
> there are differences between IE/Win9x and IE/NT).

I have no data on this yet, but I will soon.

Hope this helps,

Vince.
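The two categories Vince separates (the literal written into HTTP headers versus the encoding handed to Java for streams and Strings) could be kept apart along these lines. This is a sketch under assumptions, not the patch itself: the class and member names are invented, and it uses java.nio.charset.Charset, which postdates the JDKs in this thread.

```java
import java.nio.charset.Charset;

// Hypothetical constants holder; names are illustrative, not from the patch.
final class EncodingNames {
    // Category 1: the name written into HTTP headers, i.e. the IANA
    // preferred MIME name rather than a Java-only alias.
    static final String HEADER_CHARSET = "iso-8859-1";

    // Category 2: what Java uses to convert bytes and chars. Resolving the
    // name once and passing the Charset object avoids a by-name lookup on
    // every conversion.
    static final Charset DEFAULT_CHARSET = Charset.forName(HEADER_CHARSET);

    // Builds a Content-Type header value such as "text/html; charset=iso-8859-1".
    static String contentType(String mimeType) {
        return mimeType + "; charset=" + HEADER_CHARSET;
    }

    // Decodes a request body with the default charset.
    static String decode(byte[] data) {
        return new String(data, DEFAULT_CHARSET);
    }
}
```

Whether resolving the name once recovers the speed-up Vince measured on a 2001-era JDK is not something this sketch demonstrates; it only shows the separation of concerns.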
Re: [PATCH] '8859_1' is not a valid charset alias
Vincent, Forrest,

Thanks for the patch & review.

Could you summarize and/or expand a bit :-) ?

Also, has anyone played with the various browsers? Is any browser
sending the charset encoding? What format?

I know that some browsers are encoding the URL with the same charset that
is used in the page, while some are using UTF (there was discussion about
that somewhere).

Is it true that browsers that are using UTF (like IE on NT?) do send
the body as UTF? Do they set the Charset-Encoding header?

I would really appreciate some info (I don't use Windows, and I heard
there are differences between IE/Win9x and IE/NT).

Costin

On Sat, 19 May 2001, Vincent Schonau wrote:

> On Fri, May 18, 2001 at 12:40:04PM -0700, Forrest R. Girouard wrote:
> >
> > It is my understanding that '8859_1' is an alias for a Java encoding
> > which maps to the 'ISO-8859-1' character set. The Java encoding and
> > the character set name are not always the same.
> >
> > Furthermore, while it's not readily apparent, using 'ISO8859_1' for
> > the Java encoding is far preferable to using '8859_1' (or anything
> > else) under Java 2.
> >
> > Look at the private getBTCConverter() method in the String.java source
> > and note the use of the following:
> >
> >     !encoding.equals(btc.getCharacterEncoding())
> >
> > The ByteToCharConverter instance for ISO-8859-1 always returns
> > 'ISO8859_1' for the getCharacterEncoding() method, and this means that
> > while other names may work, the ThreadLocal caching will be subverted.
> > Since the ByteToCharConverter.getConverter() method involves
> > synchronization, it is not a good thing to subvert the ThreadLocal cache.
>
> Thanks for pointing this out. AFAICS, the use of 'iso-8859-1' instead of
> '8859_1' (my patch) does not make this situation any better or worse in
> the Tomcat code.
>
> The Tomcat 3.x code doesn't look like it takes this into account at all.
> I wonder if looking up the Java encoding name associated with the encoding
> name supplied by user-agents etc. is an optimisation worth making. I'll
> look into that.
>
> Vince.
>
Re: [PATCH] '8859_1' is not a valid charset alias
On Fri, May 18, 2001 at 12:40:04PM -0700, Forrest R. Girouard wrote:
>
> It is my understanding that '8859_1' is an alias for a Java encoding
> which maps to the 'ISO-8859-1' character set. The Java encoding and
> the character set name are not always the same.
>
> Furthermore, while it's not readily apparent, using 'ISO8859_1' for
> the Java encoding is far preferable to using '8859_1' (or anything
> else) under Java 2.
>
> Look at the private getBTCConverter() method in the String.java source
> and note the use of the following:
>
>     !encoding.equals(btc.getCharacterEncoding())
>
> The ByteToCharConverter instance for ISO-8859-1 always returns 'ISO8859_1'
> for the getCharacterEncoding() method, and this means that while other
> names may work, the ThreadLocal caching will be subverted. Since the
> ByteToCharConverter.getConverter() method involves synchronization, it
> is not a good thing to subvert the ThreadLocal cache.

Thanks for pointing this out. AFAICS, the use of 'iso-8859-1' instead of
'8859_1' (my patch) does not make this situation any better or worse in the
Tomcat code.

The Tomcat 3.x code doesn't look like it takes this into account at all. I
wonder if looking up the Java encoding name associated with the encoding
name supplied by user-agents etc. is an optimisation worth making. I'll look
into that.

Vince.
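The optimisation Vince is wondering about, resolving whatever name a user-agent supplies to one canonical encoding and remembering the answer, might look something like the sketch below. It is an assumption-laden illustration: CharsetCache is a made-up name, and the lookup goes through java.nio.charset.Charset rather than the sun.io converters discussed in the thread.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical cache from client-supplied charset labels to resolved charsets.
final class CharsetCache {
    private static final ConcurrentMap<String, Charset> CACHE =
            new ConcurrentHashMap<>();

    // Resolves a label such as "8859_1", "latin1" or "ISO-8859-1" to a
    // Charset. Only successfully resolved labels are cached.
    // Throws IllegalArgumentException (via Charset.forName) for labels that
    // are illegal or unsupported.
    static Charset lookup(String label) {
        // Charset names are case-insensitive, so normalise the cache key.
        String key = label.trim().toLowerCase(Locale.ROOT);
        Charset cs = CACHE.get(key);
        if (cs == null) {
            cs = Charset.forName(key);
            CACHE.putIfAbsent(key, cs);
        }
        return cs;
    }
}
```

Because every alias resolves to the same canonical charset, code downstream of lookup() always sees one name regardless of what the client sent.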
Re: [PATCH] '8859_1' is not a valid charset alias
It is my understanding that '8859_1' is an alias for a Java encoding
which maps to the 'ISO-8859-1' character set. The Java encoding and
the character set name are not always the same.

Furthermore, while it's not readily apparent, using 'ISO8859_1' for
the Java encoding is far preferable to using '8859_1' (or anything
else) under Java 2.

Look at the private getBTCConverter() method in the String.java source
and note the use of the following:

    !encoding.equals(btc.getCharacterEncoding())

The ByteToCharConverter instance for ISO-8859-1 always returns 'ISO8859_1'
for the getCharacterEncoding() method, and this means that while other
names may work, the ThreadLocal caching will be subverted. Since the
ByteToCharConverter.getConverter() method involves synchronization, it
is not a good thing to subvert the ThreadLocal cache.

Cheers,

Forrest

Vincent Schonau wrote:
>
> [this has also been entered as bug #1808]
>
> Both Tomcat and Apache have the string '8859_1' hard-coded and as a public
> static final String in several places.
>
> Although Java accepts '8859_1' as an alias for the ISO-8859-1 character
> set, this isn't a valid name anywhere else; the valid aliases are listed at
>
> <http://www.iana.org/assignments/character-sets>
>
> Some user-agents (I first noticed this on an older version of Lynx) are
> confused by this.
>
> This patch will:
>
>  - remove all references in code (not comments) to '8859_1'
>  - In classes where this string was used, add a
>        public static final String DEFAULT_CHAR_ENCODING
>    if none was present (this is the most frequently used name
>    when such a field is present)
>  - In the src/org/apache/jasper tree:
>      - add a
>            public static final String DEFAULT_CHAR_ENCODING
>        to Constants.java
>      - replace all occurrences of '8859_1' in code
>        with Constants.DEFAULT_CHAR_ENCODING
>    as this seems to me to be the proper way to do this in Jasper.
>
> Regards,
>
> Vince.
>
> [attachment] Name: iso2.patch
>              Type: Plain Text (text/plain)

--
Forrest Girouard @ Openwave Systems Inc.
phone: +1-650-817-1556
mailto:[EMAIL PROTECTED]
http://www.openwave.com
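Forrest's distinction between the Java encoding name and the character set name can still be observed through public APIs, without touching the internal sun.io.ByteToCharConverter class he refers to. The sketch below is a hedged illustration with an invented class name: it asks java.io for the name it reports for an encoding and asks java.nio.charset for the IANA-style name. On current JDKs the first is expected to be the historical 'ISO8859_1' and the second 'ISO-8859-1', whichever alias is passed in.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper showing the two naming schemes for the same charset.
final class CanonicalNames {
    // Name reported by java.io for the given alias (the historical name
    // where one exists).
    static String javaIoName(String alias) {
        try (InputStreamReader reader = new InputStreamReader(
                new ByteArrayInputStream(new byte[0]), alias)) {
            // Must be read before the reader is closed; getEncoding()
            // returns null on a closed stream.
            return reader.getEncoding();
        } catch (IOException e) {
            // Covers UnsupportedEncodingException for unknown aliases.
            throw new IllegalArgumentException("unusable encoding: " + alias, e);
        }
    }

    // Canonical name reported by java.nio.charset for the given alias.
    static String ianaName(String alias) {
        return Charset.forName(alias).name();
    }
}
```

The ThreadLocal cache check that Forrest quotes compares against the first kind of name, which is why the alias passed in mattered on the JDKs of that era.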
[PATCH] '8859_1' is not a valid charset alias
[this has also been entered as bug #1808]

Both Tomcat and Apache have the string '8859_1' hard-coded and as a public
static final String in several places.

Although Java accepts '8859_1' as an alias for the ISO-8859-1 character set,
this isn't a valid name anywhere else; the valid aliases are listed at

<http://www.iana.org/assignments/character-sets>

Some user-agents (I first noticed this on an older version of Lynx) are
confused by this.

This patch will:

 - remove all references in code (not comments) to '8859_1'
 - In classes where this string was used, add a
       public static final String DEFAULT_CHAR_ENCODING
   if none was present (this is the most frequently used name
   when such a field is present)
 - In the src/org/apache/jasper tree:
     - add a
           public static final String DEFAULT_CHAR_ENCODING
       to Constants.java
     - replace all occurrences of '8859_1' in code
       with Constants.DEFAULT_CHAR_ENCODING
   as this seems to me to be the proper way to do this in Jasper.

Regards,

Vince.

Index: src/share/org/apache/tomcat/util/buf/MessageBytes.java
===
RCS file: /home/cvspublic/jakarta-tomcat/src/share/org/apache/tomcat/util/buf/MessageBytes.java,v
retrieving revision 1.1
diff -u -r1.1 MessageBytes.java
--- src/share/org/apache/tomcat/util/buf/MessageBytes.java 2001/02/20 03:12:13 1.1
+++ src/share/org/apache/tomcat/util/buf/MessageBytes.java 2001/05/18 11:05:42
@@ -74,7 +74,7 @@
  * @author Costin Manolache
  */
 public final class MessageBytes implements Cloneable, Serializable {
-public static final String DEFAULT_CHAR_ENCODING="8859_1";
+public static final String DEFAULT_CHAR_ENCODING="iso-8859-1";
 // primary type ( whatever is set as original value )
 private int type = T_NULL;
Index: src/share/org/apache/tomcat/util/http/Parameters.java
===
RCS file: /home/cvspublic/jakarta-tomcat/src/share/org/apache/tomcat/util/http/Parameters.java,v
retrieving revision 1.11
diff -u -r1.11 Parameters.java
--- src/share/org/apache/tomcat/util/http/Parameters.java 2001/02/20 03:14:11 1.11
+++ src/share/org/apache/tomcat/util/http/Parameters.java 2001/05/18 11:06:04
@@ -83,7 +83,8 @@
 MimeHeaders headers;
 public static final int INITIAL_SIZE=4;
-
+public static final String DEFAULT_CHAR_ENCODING = "iso-8859-1";
+
 // Garbage-less parameter merging.
 // In a sub-request with parameters, the new parameters
 // will be stored in child. When a getParameter happens,
@@ -265,7 +266,7 @@
 try {
 String postedBody = new String(data, 0, data.length,
- "8859_1");
+ DEFAULT_CHAR_ENCODING);
 // XXX encoding !!!
 processFormData( postedBody );
Index: src/share/org/apache/tomcat/modules/server/Ajp13.java
===
RCS file: /home/cvspublic/jakarta-tomcat/src/share/org/apache/tomcat/modules/server/Ajp13.java,v
retrieving revision 1.17
diff -u -r1.17 Ajp13.java
--- src/share/org/apache/tomcat/modules/server/Ajp13.java 2001/02/28 19:41:23 1.17
+++ src/share/org/apache/tomcat/modules/server/Ajp13.java 2001/05/18 11:06:26
@@ -913,7 +913,7 @@
 return (getByte() == (byte) 1);
 }
- public static final String DEFAULT_CHAR_ENCODING = "8859_1";
+ public static final String DEFAULT_CHAR_ENCODING = "iso-8859-1";
 public void getMessageBytes( MessageBytes mb ) {
 int length = getInt();
Index: src/share/org/apache/tomcat/core/OutputBuffer.java
===
RCS file: /home/cvspublic/jakarta-tomcat/src/share/org/apache/tomcat/core/OutputBuffer.java,v
retrieving revision 1.13
diff -u -r1.13 OutputBuffer.java
--- src/share/org/apache/tomcat/core/OutputBuffer.java 2001/02/27 02:45:02 1.13
+++ src/share/org/apache/tomcat/core/OutputBuffer.java 2001/05/18 11:06:48
@@ -80,6 +80,7 @@
 int defaultBufferSize = DEFAULT_BUFFER_SIZE;
 int defaultCharBufferSize = DEFAULT_BUFFER_SIZE / 2 ;
+public static final String DEFAULT_CHAR_ENCODING = "iso-8859-1";
 // The buffer can be used for byte[] and char[] writing
 // ( this is needed to support ServletOutputStream and for
 // efficient implementations of templating systems )
@@ -426,7 +427,7 @@
 if( resp!=null )
 enc = resp.getCharacterEncoding();
 gotEnc=true;
- if(enc==null) enc="8859_1";
+ if(enc==null) enc=DEFAULT_CHAR_ENCODING;
 conv=(WriteConvertor)encoders.get(enc);
 if(conv==null) {
 IntermediateOutputStream ios=new IntermediateOutputStream(this);
@@ -434,11 +435,11 @@
 conv=new WriteConvertor(ios,enc);
 encoders.put(