Re: [PATCH] '8859_1' is not a valid charset alias

2001-05-20 Thread Forrest R. Girouard

Costin:

I'm not yet familiar with the Tomcat or Jasper code (and I've only
been on this list for a couple weeks) but in general I concur with 
Vince's analysis.  I can corroborate his benchmark testing since 
I've seen this synchronization contribute to performance problems under
very heavy load with a large number of threads (200).  I'm baffled why the 
Java implementors allowed any synchronization in such a fundamental 
class as String.

Furthermore, it has been my experience that it is necessary to 
internally map between the IANA character set names and the Java 
encoding names (I have no idea why the Java implementors chose to 
use non-standard encoding names).  
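A minimal sketch of such an internal mapping, assuming a small hand-maintained table. The class name EncodingNames and the choice of entries are hypothetical and not taken from Tomcat; the values on the right are the historical java.io names for those charsets.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not Tomcat code): IANA charset name -> Java encoding name. */
final class EncodingNames {
    private static final Map<String, String> IANA_TO_JAVA = new HashMap<>();
    static {
        // Keys are lower-cased IANA names; values are Java's historical names.
        IANA_TO_JAVA.put("iso-8859-1", "ISO8859_1");
        IANA_TO_JAVA.put("us-ascii", "ASCII");
        IANA_TO_JAVA.put("utf-8", "UTF8");
    }

    private EncodingNames() {}

    /** Returns the Java name for an IANA name, or the input unchanged if unknown. */
    static String toJavaName(String ianaName) {
        String javaName = IANA_TO_JAVA.get(ianaName.toLowerCase(Locale.ROOT));
        return javaName != null ? javaName : ianaName;
    }
}
```

With this table, EncodingNames.toJavaName("ISO-8859-1") yields ISO8859_1, and a name that is not in the table is passed through as-is.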

I have also never seen the Content-Encoding in use and rarely have I 
seen the charset specified as part of the Content-Type (I've heard 
that some user-agents and some servers choke on it).   Browsers 
(user agents) almost universally post content in the encoding of the 
original document.

For applications I recommend strictly using a single encoding per 
session and limiting query parameters to US-ASCII.  Of course, 
Tomcat needs to do something reasonable regardless.

Cheers,
Forrest

Vincent Schonau wrote:
> 
> On Sat, May 19, 2001 at 03:19:09PM -0700, [EMAIL PROTECTED] wrote:
> > Vincent, Forrest,
> >
> > Thanks for the patch & review.
> >
> > Could you summarize and/or expand a bit :-) ?
> 
> The changes I made affect two uses of the concept of Character encodings:
> 
>   1 what's being sent to the browser (i.e. in JspParseEventListener) as
>     HTTP headers (as literals)
>   2 what's being used to set the CharacterEncoding of input and output
>     streams
> 
> The reason I made the patch is that an (older) version of Lynx that I use to
> test apps barfed on the "text/html; charset=8859_1" header. I noticed this
> was non-standard, and that it's all over the tree; hence the patch.
> It's just a standards thing. ("iso-8859-1" is the 'preferred mime name' for
> this charset; see the IANA charset list that I pointed to). That's category
> 1.
> 
> Forrest then pointed out that the code I touched affects the selection of
> encodings in Java, and that there is a performance gain to be had.
> 
> I did a little investigation into Forrest's remarks, and it turns out that
> _consistently_ using something other than what Java regards as the name of
> the encoding of a string can have an enormous impact (on my benchmark); using
> the canonical name ("ISO8859_1") instead of some alias ("ISO-8859-1" or
> "8859_1") can cause a performance win of up to 20x (!). If one looks up the
> canonical name of the charset before accessing a String with a non-default
> encoding, the total cost is only 1.5x the cost of accessing it with the
> encoding's canonical name.
> 
> I've looked at the 3.x tree, and from superficial tests, it looks like this
> specific code is hardly ever reached by tomcat, so optimising it may not, in
> fact, do any good for anyone using iso-8859-1 for most content &
> user-agents. Most of the work is already done, so I'll do it anyway.
> 
> That's category 2. (patch coming up).
> 
> > Also, has anyone played with the various browsers ? Is any browser
> > sending the charset encoding ? What format ?
> 
> I've been playing with this, but I don't have any definite results. As part
> of the work for issue 2 above, I'll be testing this.
> 
> There isn't actually any reference to charsets used in the request in
> Servlet 2.2; but there is in 2.3 (SRV.4.9 Request data encoding). (They say
> there that there aren't many browsers sending Content-Encoding with the
> request, currently.)
> 
> > I know that some browsers are encoding the URL with the same charset that
> > is used in the page, while some are using UTF ( there was discussion about
> > that somewhere).
> 
> If you have a reference to this, I'll be happy to look into it.
> 
> > Is it true that browsers that are using UTF ( like IE on NT ? ) do send
> > the body as UTF ? Do they set the Charset-Encoding header ?
> >
> > I would really appreciate some info ( I don't use Windows, and I heard
> > there are differences between IE/Win9x and IE/NT )
> 
> I have no data on this yet, but I will soon.
> 
> Hope this helps,
> 
> Vince.

-- 
Forrest Girouard @ Openwave Systems Inc.
phone: +1-650-817-1556
mailto:[EMAIL PROTECTED]
http://www.openwave.com





Re: [PATCH] '8859_1' is not a valid charset alias

2001-05-20 Thread Vincent Schonau

On Fri, May 18, 2001 at 01:42:17PM +0200, Vincent Schonau wrote:
> [this has also been entered as bug #1808]
> 
> Both Tomcat and Apache have the string '8859_1' hard-coded and as a public
  ^typo: I meant Jasper
> static final String in several places.

Just to clarify; this all relates to the jakarta-tomcat tree (3.3).


Vince.



Re: [PATCH] '8859_1' is not a valid charset alias

2001-05-19 Thread Vincent Schonau

On Sat, May 19, 2001 at 03:19:09PM -0700, [EMAIL PROTECTED] wrote:
> Vincent, Forrest,
> 
> Thanks for the patch & review. 
> 
> Could you summarize and/or expand a bit :-) ? 

The changes I made affect two uses of the concept of Character encodings:

  1 what's being sent to the browser (i.e. in JspParseEventListener) as
    HTTP headers (as literals)
  2 what's being used to set the CharacterEncoding of input and output
    streams

The reason I made the patch is that an (older) version of Lynx that I use to
test apps barfed on the "text/html; charset=8859_1" header. I noticed this
was non-standard, and that it's all over the tree; hence the patch.
It's just a standards thing. ("iso-8859-1" is the 'preferred mime name' for
this charset; see the IANA charset list that I pointed to). That's category
1.
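
As a rough illustration of category 1, the header value can be derived from whatever name Java was given, rather than written as a literal. This is only a sketch: ContentTypeHeader is a hypothetical helper, not part of the patch, and it relies on java.nio.charset.Charset, which is newer than the JDKs this thread is about.

```java
import java.nio.charset.Charset;

/** Hypothetical helper, not part of the patch. */
final class ContentTypeHeader {
    private ContentTypeHeader() {}

    /**
     * Builds a Content-Type value whose charset parameter is the
     * charset's canonical name rather than a Java-only alias.
     */
    static String forHtml(String javaEncoding) {
        // Charset.forName accepts aliases such as "8859_1";
        // name() returns the canonical name, e.g. "ISO-8859-1".
        String mimeName = Charset.forName(javaEncoding).name();
        return "text/html; charset=" + mimeName;
    }
}
```

Calling ContentTypeHeader.forHtml("8859_1") gives "text/html; charset=ISO-8859-1". Charset names are case-insensitive, so this is equivalent to the lower-case "iso-8859-1" used in the patch.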

Forrest then pointed out that the code I touched affects the selection of
encodings in Java, and that there is a performance gain to be had.

I did a little investigation into Forrest's remarks, and it turns out that
_consistently_ using something other than what Java regards as the name of
the encoding of a string can have an enormous impact (on my benchmark); using
the canonical name ("ISO8859_1") instead of some alias ("ISO-8859-1" or
"8859_1") can cause a performance win of up to 20x (!). If one looks up the
canonical name of the charset before accessing a String with a non-default
encoding, the total cost is only 1.5x the cost of accessing it with the
encoding's canonical name.
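
The "look up the canonical name first" approach might be sketched as follows. This is an assumption-laden illustration, not the forthcoming patch: CanonicalEncodingCache is a hypothetical class, it asks InputStreamReader.getEncoding() for the name Java itself uses, and it remembers the answer so the lookup cost is paid once per alias. The timings above come from the benchmark described in this thread; nothing here claims the same gain on other JDKs.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: resolves encoding aliases to Java's own name, with a cache. */
final class CanonicalEncodingCache {
    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    private CanonicalEncodingCache() {}

    /** Returns the name Java reports for the given encoding name or alias. */
    static String canonicalName(String name) {
        String cached = CACHE.get(name);
        if (cached != null) {
            return cached;
        }
        try {
            // A reader over an empty stream is enough to ask Java for the name.
            InputStreamReader probe =
                new InputStreamReader(new ByteArrayInputStream(new byte[0]), name);
            String canonical = probe.getEncoding();
            CACHE.put(name, canonical);
            return canonical;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + name, e);
        }
    }

    /** Decodes bytes using the canonical name instead of the caller's alias. */
    static String decode(byte[] data, String requestedEncoding) {
        try {
            return new String(data, 0, data.length, canonicalName(requestedEncoding));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + requestedEncoding, e);
        }
    }
}
```

For example, CanonicalEncodingCache.canonicalName("8859_1") returns "ISO8859_1", and decode() then always hands String the canonical name.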

I've looked at the 3.x tree, and from superficial tests, it looks like this
specific code is hardly ever reached by tomcat, so optimising it may not, in
fact, do any good for anyone using iso-8859-1 for most content &
user-agents. Most of the work is already done, so I'll do it anyway.

That's category 2. (patch coming up).

> Also, has anyone played with the various browsers ? Is any browser
> sending the charset encoding ? What format ? 

I've been playing with this, but I don't have any definite results. As part
of the work for issue 2 above, I'll be testing this.

There isn't actually any reference to charsets used in the request in
Servlet 2.2; but there is in 2.3 (SRV.4.9 Request data encoding). (They say
there that there aren't many browsers sending Content-Encoding with the
request, currently.)

> I know that some browsers are encoding the URL with the same charset that
> is used in the page, while some are using UTF ( there was discussion about
> that somewhere). 

If you have a reference to this, I'll be happy to look into it.

> Is it true that browsers that are using UTF ( like IE on NT ? ) do send
> the body as UTF ? Do they set the Charset-Encoding header ?
> 
> I would really appreciate some info ( I don't use Windows, and I heard
> there are differences between IE/Win9x and IE/NT )

I have no data on this yet, but I will soon.


Hope this helps,


Vince.



Re: [PATCH] '8859_1' is not a valid charset alias

2001-05-19 Thread cmanolache

Vincent, Forrest,

Thanks for the patch & review. 

Could you summarize and/or expand a bit :-) ? 

Also, has anyone played with the various browsers ? Is any browser
sending the charset encoding ? What format ? 

I know that some browsers are encoding the URL with the same charset that
is used in the page, while some are using UTF ( there was discussion about
that somewhere). 

Is it true that browsers that are using UTF ( like IE on NT ? ) do send
the body as UTF ? Do they set the Charset-Encoding header ?

I would really appreciate some info ( I don't use Windows, and I heard
there are differences between IE/Win9x and IE/NT )

Costin


On Sat, 19 May 2001, Vincent Schonau wrote:

> On Fri, May 18, 2001 at 12:40:04PM -0700, Forrest R. Girouard wrote:
> > 
> > It is my understanding that '8859_1' is an alias for a Java encoding 
> > which maps to the 'ISO-8859-1' character set.  The Java encoding and
> > the character set name are not always the same.
> > 
> > Furthermore, while it's not readily apparent, using 'ISO8859_1' for
> > the Java encoding is far preferable to using '8859_1' (or anything 
> > else) under Java 2.  
> > 
> > Look at the private getBTCConverter() method in the String.java source
> > and note the use of the following:
> > 
> > !encoding.equals(btc.getCharacterEncoding())
> > 
> > The ByteToCharConverter instance for ISO-8859-1 always returns 'ISO8859_1'
> > for the getCharacterEncoding() method and this means that while other
> > names may work the ThreadLocal caching will be subverted.  Since the
> > ByteToCharConverter.getConverter() method involves synchronization it
> > is not a good thing to subvert the ThreadLocal cache.
> 
> Thanks for pointing this out. AFAICS, the use of 'iso-8859-1' instead of
> '8859_1' (my patch) does not make this situation any better or worse in the
> tomcat code. 
> 
> The tomcat 3.x code doesn't look like it takes this into account at all. I
> wonder if looking up the Java Encoding name associated with the encoding
> name supplied by user-agents etc. is an optimisation worth making. I'll look
> into that.
> 
> 
> 
> Vince.
> 




Re: [PATCH] '8859_1' is not a valid charset alias

2001-05-19 Thread Vincent Schonau

On Fri, May 18, 2001 at 12:40:04PM -0700, Forrest R. Girouard wrote:
> 
> It is my understanding that '8859_1' is an alias for a Java encoding 
> which maps to the 'ISO-8859-1' character set.  The Java encoding and
> the character set name are not always the same.
> 
> Furthermore, while it's not readily apparent, using 'ISO8859_1' for
> the Java encoding is far preferable to using '8859_1' (or anything 
> else) under Java 2.  
> 
> Look at the private getBTCConverter() method in the String.java source
> and note the use of the following:
> 
>   !encoding.equals(btc.getCharacterEncoding())
> 
> The ByteToCharConverter instance for ISO-8859-1 always returns 'ISO8859_1'
> for the getCharacterEncoding() method and this means that while other
> names may work the ThreadLocal caching will be subverted.  Since the
> ByteToCharConverter.getConverter() method involves synchronization it
> is not a good thing to subvert the ThreadLocal cache.

Thanks for pointing this out. AFAICS, the use of 'iso-8859-1' instead of
'8859_1' (my patch) does not make this situation any better or worse in the
tomcat code. 

The tomcat 3.x code doesn't look like it takes this into account at all. I
wonder if looking up the Java Encoding name associated with the encoding
name supplied by user-agents etc. is an optimisation worth making. I'll look
into that.



Vince.




Re: [PATCH] '8859_1' is not a valid charset alias

2001-05-18 Thread Forrest R. Girouard


It is my understanding that '8859_1' is an alias for a Java encoding 
which maps to the 'ISO-8859-1' character set.  The Java encoding and
the character set name are not always the same.

Furthermore, while it's not readily apparent, using 'ISO8859_1' for
the Java encoding is far preferable to using '8859_1' (or anything 
else) under Java 2.  

Look at the private getBTCConverter() method in the String.java source
and note the use of the following:

!encoding.equals(btc.getCharacterEncoding())

The ByteToCharConverter instance for ISO-8859-1 always returns 'ISO8859_1'
for the getCharacterEncoding() method and this means that while other
names may work the ThreadLocal caching will be subverted.  Since the
ByteToCharConverter.getConverter() method involves synchronization it
is not a good thing to subvert the ThreadLocal cache.
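
To make the effect concrete, here is a small self-contained model of that caching pattern. It is illustrative only and is not the JDK's String or ByteToCharConverter source: ConverterCacheModel, Converter and slowLookups are hypothetical names, and the synchronized lookup is a stand-in that treats every name as an alias of ISO8859_1.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative model only; not the JDK's String or ByteToCharConverter source. */
final class ConverterCacheModel {
    /** Counts trips through the synchronized slow path. */
    static final AtomicInteger slowLookups = new AtomicInteger();

    /** Stand-in for a converter that only knows its canonical name. */
    static final class Converter {
        final String canonicalName;
        Converter(String canonicalName) { this.canonicalName = canonicalName; }
    }

    private static final ThreadLocal<Converter> CACHED = new ThreadLocal<>();

    private ConverterCacheModel() {}

    /** Stand-in for the synchronized global lookup, which resolves aliases. */
    private static synchronized Converter lookup(String encoding) {
        slowLookups.incrementAndGet();
        // Every name is treated as an alias of ISO8859_1 to keep the model small.
        return new Converter("ISO8859_1");
    }

    static Converter get(String encoding) {
        Converter c = CACHED.get();
        // Same shape as the check quoted above: a cache hit requires the
        // requested name to equal the converter's canonical name.
        if (c == null || !encoding.equals(c.canonicalName)) {
            c = lookup(encoding);
            CACHED.set(c);
        }
        return c;
    }
}
```

In this model, repeated calls to get("ISO8859_1") on one thread take the slow path at most once, while every call to get("8859_1") takes it again, because the cached converter never reports the alias as its name.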

Cheers,
Forrest

Vincent Schonau wrote:
> 
> [this has also been entered as bug #1808]
> 
> Both Tomcat and Apache have the string '8859_1' hard-coded and as a public
> static final String in several places.
> 
> Although Java accepts '8859_1' as an alias for the ISO-8859-1 character set,
> this isn't a valid name anywhere else; the valid aliases are listed at
> 
> http://www.iana.org/assignments/character-sets
> 
> Some user-agents (I first noticed this on an older version of Lynx) are
> confused by this.
> 
> This patch will:
> 
>   - remove all references in code (not comments) to '8859_1'
>   - In classes where this string was used, add a
>         public static final String DEFAULT_CHAR_ENCODING
>     if none was present (this is the most frequently used name
>     when such a field is present)
>   - In the src/org/apache/jasper tree:
>       - add a
>             public static final String DEFAULT_CHAR_ENCODING
>         to Constants.java
>       - replace all occurrences of '8859_1' in code
>         with Constants.DEFAULT_CHAR_ENCODING
>         as this seems to me to be the proper way to do this in Jasper.
> 
> Regards,
> 
> Vince.
> 
>
> iso2.patch
>   Name: iso2.patch
>   Type: Plain Text (text/plain)

-- 
Forrest Girouard @ Openwave Systems Inc.
phone: +1-650-817-1556
mailto:[EMAIL PROTECTED]
http://www.openwave.com





[PATCH] '8859_1' is not a valid charset alias

2001-05-18 Thread Vincent Schonau

[this has also been entered as bug #1808]

Both Tomcat and Apache have the string '8859_1' hard-coded and as a public
static final String in several places.

Although Java accepts '8859_1' as an alias for the ISO-8859-1 character set,
this isn't a valid name anywhere else; the valid aliases are listed at

http://www.iana.org/assignments/character-sets

Some user-agents (I first noticed this on an older version of Lynx) are
confused by this.

This patch will:

  - remove all references in code (not comments) to '8859_1'
  - In classes where this string was used, add a
        public static final String DEFAULT_CHAR_ENCODING
    if none was present (this is the most frequently used name
    when such a field is present)
  - In the src/org/apache/jasper tree:
      - add a
            public static final String DEFAULT_CHAR_ENCODING
        to Constants.java
      - replace all occurrences of '8859_1' in code
        with Constants.DEFAULT_CHAR_ENCODING
        as this seems to me to be the proper way to do this in Jasper.


Regards,


Vince.




Index: src/share/org/apache/tomcat/util/buf/MessageBytes.java
===
RCS file: /home/cvspublic/jakarta-tomcat/src/share/org/apache/tomcat/util/buf/MessageBytes.java,v
retrieving revision 1.1
diff -u -r1.1 MessageBytes.java
--- src/share/org/apache/tomcat/util/buf/MessageBytes.java  2001/02/20 03:12:13 1.1
+++ src/share/org/apache/tomcat/util/buf/MessageBytes.java  2001/05/18 11:05:42
@@ -74,7 +74,7 @@
  * @author Costin Manolache
  */
 public final class MessageBytes implements Cloneable, Serializable {
-public static final String DEFAULT_CHAR_ENCODING="8859_1";
+public static final String DEFAULT_CHAR_ENCODING="iso-8859-1";
 
 // primary type ( whatever is set as original value )
 private int type = T_NULL;
Index: src/share/org/apache/tomcat/util/http/Parameters.java
===
RCS file: /home/cvspublic/jakarta-tomcat/src/share/org/apache/tomcat/util/http/Parameters.java,v
retrieving revision 1.11
diff -u -r1.11 Parameters.java
--- src/share/org/apache/tomcat/util/http/Parameters.java   2001/02/20 03:14:11 1.11
+++ src/share/org/apache/tomcat/util/http/Parameters.java   2001/05/18 11:06:04
@@ -83,7 +83,8 @@
 MimeHeaders  headers;
 
 public static final int INITIAL_SIZE=4;
-
+public static final String DEFAULT_CHAR_ENCODING = "iso-8859-1";
+
 // Garbage-less parameter merging.
 // In a sub-request with parameters, the new parameters
 // will be stored in child. When a getParameter happens,
@@ -265,7 +266,7 @@

try {
String postedBody = new String(data, 0, data.length,
-  "8859_1");
+  DEFAULT_CHAR_ENCODING);
// XXX encoding !!!
 
processFormData( postedBody );
Index: src/share/org/apache/tomcat/modules/server/Ajp13.java
===
RCS file: /home/cvspublic/jakarta-tomcat/src/share/org/apache/tomcat/modules/server/Ajp13.java,v
retrieving revision 1.17
diff -u -r1.17 Ajp13.java
--- src/share/org/apache/tomcat/modules/server/Ajp13.java   2001/02/28 19:41:23 1.17
+++ src/share/org/apache/tomcat/modules/server/Ajp13.java   2001/05/18 11:06:26
@@ -913,7 +913,7 @@
return (getByte() == (byte) 1);
}
 
-   public static final String DEFAULT_CHAR_ENCODING = "8859_1";
+   public static final String DEFAULT_CHAR_ENCODING = "iso-8859-1";
 
public void getMessageBytes( MessageBytes mb ) {
int length = getInt();
Index: src/share/org/apache/tomcat/core/OutputBuffer.java
===
RCS file: /home/cvspublic/jakarta-tomcat/src/share/org/apache/tomcat/core/OutputBuffer.java,v
retrieving revision 1.13
diff -u -r1.13 OutputBuffer.java
--- src/share/org/apache/tomcat/core/OutputBuffer.java  2001/02/27 02:45:02 1.13
+++ src/share/org/apache/tomcat/core/OutputBuffer.java  2001/05/18 11:06:48
@@ -80,6 +80,7 @@
 int defaultBufferSize = DEFAULT_BUFFER_SIZE;
 int defaultCharBufferSize = DEFAULT_BUFFER_SIZE / 2 ;
 
+public static final String DEFAULT_CHAR_ENCODING = "iso-8859-1";
 // The buffer can be used for byte[] and char[] writing
 // ( this is needed to support ServletOutputStream and for
 // efficient implementations of templating systems )
@@ -426,7 +427,7 @@
if( resp!=null ) 
enc = resp.getCharacterEncoding();
gotEnc=true;
-   if(enc==null) enc="8859_1";
+   if(enc==null) enc=DEFAULT_CHAR_ENCODING;
conv=(WriteConvertor)encoders.get(enc);
if(conv==null) {
IntermediateOutputStream ios=new IntermediateOutputStream(this);
@@ -434,11 +435,11 @@
conv=new WriteConvertor(ios,enc);
encoders.put(