Hi,

I volunteered to resolve the charset issues, but the problem is very complex
and difficult ( and I have had very little time for it ). (I'm also an 8859-2
user.)

In other words - I need help ( patches, ideas, code to reproduce, etc.
) ... This is too big and scary an issue - and I don't feel confident enough
in my knowledge to start anything major.

I already added code in 3.3 that would allow a module to set the charset, 
I fixed the generation of UTF8, and I'm looking at the changes that set
the platform encoding - I believe it's a very bad idea ( since you can't
know that the browser is using the same encoding as the server - unless
you are inside intranets ), but I have no other solution right now.

Anyway - I do believe we can't ship 3.3 final without this problem
resolved, and the current design ( with MessageBytes and lazy conversion
from byte->String ) should be able to support a solution ( if we find one ).

Maybe using UTF8 as the default for input and output ? ( I saw a few RFCs
mentioning that as the best solution - given that most current browsers
do support UTF8 ). Of course, this can't be enabled by default ( spec
issues ), but it's better than the local server encoding...
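
To make the difference concrete, here is a minimal sketch ( not Tomcat code )
of how the same urlencoded bytes decode under UTF8 versus 8859-1. The class
Utf8DefaultDemo and its decode() helper are hypothetical names, and the sketch
assumes a JDK that provides URLDecoder.decode(String, String) ( 1.4 or later ):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; not part of Tomcat.
class Utf8DefaultDemo {

    // Decodes one urlencoded value using the named charset.
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + charset);
        }
    }

    public static void main(String[] args) {
        // U+0151 (o with double acute) as a UTF8 browser would send it: bytes C5 91.
        String raw = "%C5%91";
        String asUtf8 = decode(raw, "UTF-8");
        String asLatin1 = decode(raw, "ISO-8859-1");
        // One character when decoded as UTF8, two unrelated ones as 8859-1.
        System.out.println(asUtf8.length() + " vs " + asLatin1.length());
    }
}
```

With UTF8 the two escaped bytes come back as the single character the user
typed; with 8859-1 they come back as two unrelated characters.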


Costin





On Mon, 19 Mar 2001, Szegedi, Attila wrote:

> I have also done this once in my private copy of Tomcat, but have abandoned
> it.
> The problem is standards compliance, and standards (both the HTML standard
> and the Servlet spec) are somewhat internationalization-ignorant on this
> point.
> 
> Tomcat follows the HTML standard, which explicitly declares that MIME type
> "application/x-www-form-urlencoded" is suitable ONLY for transferring ASCII
> (but will of course work for ISO 8859-1 as well). See
> http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1
> It says:
> 
> <citation>
> "The content type "application/x-www-form-urlencoded" is inefficient for
> sending large quantities of binary data or text containing non-ASCII
> characters. The content type "multipart/form-data" should be used for
> submitting forms that contain files, non-ASCII data, and binary data."
> </citation>
> 
> So, if you want to comply with the HTML standard, you should force sending
> all of your forms containing non-ASCII characters as "multipart/form-data"
> using the "enctype" attribute of the form. Unfortunately, Tomcat will not
> present "multipart/form-data" to your servlet as request parameters.
> 
> The HTML standard is further flawed in that it
> 1. defaults the encoding type of the form to
> "application/x-www-form-urlencoded"
> 2. requires browsers to send form data in the same encoding they received
> the HTML page in, (except if "accept-charset" attribute is set, which is
> usually not).
> So, a complying browser will by default use
> "application/x-www-form-urlencoded" and send data through it in the same
> encoding it received the HTML page in. The trouble is that it won't send
> the *ENCODING* back to the server in the Content-Type header (at least all
> IE (up to 5.5) and NN (up to 4.75) won't). The header will always be
> "application/x-www-form-urlencoded" and not
> "application/x-www-form-urlencoded; charset=whatever", so Tomcat's
> parsePostData can't determine the charset; it will always assume ISO 8859-1,
> as this is the default.
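
As a rough illustration of that fallback ( a sketch only, not Tomcat's actual
code ), the charset lookup amounts to something like the following. The class
ContentTypeCharset and the method charsetOf() are made-up names:

```java
// Hypothetical helper; illustrates the fallback, not Tomcat's implementation.
class ContentTypeCharset {

    // The default that applies when the header carries no charset parameter.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Returns the charset parameter of a Content-Type header value,
    // or the default when the header or the parameter is missing.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return DEFAULT_CHARSET;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = param.substring(8).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (value.length() > 0) {
                    return value;
                }
            }
        }
        return DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        // What the browsers mentioned above actually send: no charset at all.
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
    }
}
```

Since the browsers never add the parameter, the default branch is the one
that is taken in practice.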
> 
> I have some past experience working with Microsoft's ASP technology. They
> solved the problem partially by introducing the "session encoding" -- all
> HTML responses used this encoding, and all request parameters were parsed
> according to that encoding.
> 
> This could be a solution; however, it should go into the servlet spec. (Can
> you hear us, servlet spec people?)
> 
> My own app uses ISO 8859-2 (as it's in Hungarian), and for now I just
> transcode 8859-1 into 8859-2. I'm lucky I use the Model 2 paradigm, so I have
> a single servlet handling all requests and a single central place to
> transcode request parameters.
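
For reference, the transcoding workaround described above can be sketched
like this. Latin2Transcoder and transcode() are hypothetical names. The trick
relies on the container having decoded the parameter as 8859-1, which maps
every byte to exactly one character, so the original bytes can be recovered
without loss:

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper; a sketch of the workaround, not code from the thread.
class Latin2Transcoder {

    // Re-interprets a String that was wrongly decoded as ISO-8859-1
    // using the charset the browser really used.
    static String transcode(String misdecoded, String actualCharset) {
        try {
            // Recover the original bytes (lossless for ISO-8859-1).
            byte[] raw = misdecoded.getBytes("ISO-8859-1");
            // Decode them again with the real charset.
            return new String(raw, actualCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + actualCharset);
        }
    }

    public static void main(String[] args) {
        // Byte 0xF5 is U+0151 in 8859-2 but shows up as U+00F5 after 8859-1 decoding.
        String fixed = transcode("\u00F5", "ISO-8859-2");
        System.out.println(fixed.equals("\u0151"));
    }
}
```

This only works while the wrong decoding step was 8859-1; a lossy default
charset would already have destroyed the bytes.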
> 
> Cheers,
>   Attila.
> 
> > -----Original Message-----
> > From: Aleksandras Novikovas [mailto:[EMAIL PROTECTED]]
> > Sent: Friday, March 16, 2001 10:32 AM
> > To: '[EMAIL PROTECTED]'
> > Subject: problem with national language in html form input
> >
> >
> > Hello All,
> >
> > I'm posting for the first time, so please inform me if I do
> > something wrong ...
> >
> > First of all - problem description :
> > I have a multilanguage application (where the user can
> > dynamically change the charset).
> > The problem arises when the user enters information in the selected language.
> > After parsePostData in HttpUtils I get lots of "????" instead of text.
> > I can not rely on the default system encoding, because the
> > application has the ability to add languages dynamically
> > without recompilation.
> > So I never know what encoding the system will need next.
> >
> > I have written some code to work around this problem and
> > think it would be nice to have it in the standard package.
> > Actually I've changed parsePostData - added an encoding parameter.
> > Now the programmer can choose in what encoding the InputStream is supplied.
> > I have tested it with windows-1257 (Baltic) and windows-1251
> > (Cyrillic) - for me it worked.
> > If someone finds any errors - please let me know.
> > Here is the code of that method :
> >
> > //////////////////////////////////////////////////////////////
> > //////////////////
> > // Parses data from an HTML form that the client sends to
> > // the server using the HTTP POST method and the
> > // <i>application/x-www-form-urlencoded</i> MIME type.
> > //
> > // <p>The data sent by the POST method contains key-value
> > // pairs. A key can appear more than once in the POST data
> > // with different values. However, the key appears only once in
> > // the hashtable, with its value being
> > // an array of strings containing the multiple values sent
> > // by the POST method.
> > //
> > // <p>The keys and values in the hashtable are stored in their
> > // decoded form, so
> > // any + characters are converted to spaces, and characters
> > // sent in hexadecimal notation (like <i>%xx</i>) are
> > // converted to specified encoding.
> > //
> > // @param len       an integer specifying the length,
> > //                          in characters, of the
> > //                          <code>ServletInputStream</code>
> > //                          object that is also passed to this
> > //                          method
> > // @param in        the <code>ServletInputStream</code>
> > //                          object that contains the data sent
> > //                          from the client
> > // @param enc       a String specifying the character encoding
> > //                          of the <code>ServletInputStream</code>
> > //                          object
> > //
> > // @return          a <code>HashTable</code> object built
> > //                          from the parsed key-value pairs
> > //
> > // @exception IllegalArgumentException      if the data
> > //                          sent by the POST method is invalid
> > //////////////////////////////////////////////////////////////
> > //////////////////
> >
> > public Hashtable parsePostData (int len, ServletInputStream in, String enc)
> > {
> >     // XXX
> >     // should a length of 0 be an IllegalArgumentException
> >
> >     if (len <=0)
> >         return new Hashtable (); // cheap hack to return an empty hash
> >
> >     if (in == null) {
> >         throw new IllegalArgumentException ();
> >     }
> >
> >     // Make sure we read the entire POSTed body.
> >     byte [] postedBytes = new byte [len];
> >     try {
> >             int offset = 0;
> >             do {
> >                     int inputLen = in.read (postedBytes, offset, len - offset);
> >                     if (inputLen <= 0) {
> >                             throw new IllegalArgumentException (lStrings.getString("err.io.short_read"));
> >                     }
> >                     offset += inputLen;
> >             } while ((len - offset) > 0);
> >     }
> >     catch (IOException e) {
> >             throw new IllegalArgumentException (e.getMessage ());
> >     }
> >
> >     // Here some changes ...
> >     // Direct parsing of postedBytes, converting to
> >     // desired unicode symbol and forming final string
> >
> >     StringBuffer sb = new StringBuffer ();
> >     Integer unicodeInteger;
> >     // Walk every byte; stopping at length - 1 would drop the last one
> >     for (int i = 0; i < postedBytes.length; i++) {
> >             String testString = new String (postedBytes, i, 1);
> >             switch (testString.charAt (0)) {
> >                     case '+' :
> >                             sb.append (' ');
> >                             break;
> >                     case '%' :
> >                             try {
> >                                     // Here is actual conversion to unicode
> >                                     unicodeInteger = Integer.valueOf (new String (postedBytes, i + 1, 2), 16);
> >                                     sb.append (new String (new byte [] {unicodeInteger.byteValue ()}, enc));
> >                                     i += 2;
> >                             }
> >                             catch (NumberFormatException e) {
> >                                     throw new IllegalArgumentException ();
> >                             }
> >                             catch (UnsupportedEncodingException e) {
> >                                     throw new IllegalArgumentException ();
> >                             }
> >                             catch (IndexOutOfBoundsException e) {
> >                                     // The String constructor may report a truncated escape as a
> >                                     // StringIndexOutOfBoundsException, so catch the common superclass
> >                                     // This can happen only at the end of stream
> >                                     // So just add the rest and stop loop
> >                                     String rest = new String (postedBytes, i, postedBytes.length - i);
> >                                     sb.append (rest);
> >                                     i += rest.length ();
> >                             }
> >                             break;
> >                     default:
> >                             // Here do not use encoding
> >                             // It is expected, that request is sent in
> >                             sb.append (new String (postedBytes, i, 1));
> >                             break;
> >             }
> >     }
> >     return (parseQueryString (sb.toString ()));
> > }
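
One possible variation on the method above, offered as a sketch under
assumptions rather than a tested patch: collect the raw bytes of each key or
value first and convert them to a String in one step. Converting each %xx
byte on its own cannot work for multi-byte encodings such as UTF8, because a
single byte of a multi-byte sequence is not a valid character by itself.
FormValueDecoder and decode() are hypothetical names, and the sketch assumes
an ASCII-compatible charset:

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of HttpUtils or Tomcat.
class FormValueDecoder {

    // Decodes data[start..end): '+' becomes a space, %xx becomes one raw byte,
    // and the collected bytes are converted to a String with enc at the end.
    static String decode(byte[] data, int start, int end, String enc) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = start; i < end; i++) {
            int b = data[i];
            if (b == '+') {
                out.write(' ');
            } else if (b == '%') {
                if (i + 2 >= end) {
                    throw new IllegalArgumentException("truncated escape");
                }
                int hi = Character.digit((char) (data[i + 1] & 0xFF), 16);
                int lo = Character.digit((char) (data[i + 2] & 0xFF), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("bad hex digits");
                }
                out.write((hi << 4) | lo);
                i += 2;
            } else {
                out.write(b);
            }
        }
        try {
            return new String(out.toByteArray(), enc);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported encoding: " + enc);
        }
    }

    public static void main(String[] args) {
        byte[] body = "a+b%C5%91".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decode(body, 0, body.length, "UTF-8").length());
    }
}
```

A helper like this would be applied to each key and each value after the raw
body has been split on '&' and '=', so that an escaped %26 or %3D in the data
is not mistaken for a separator.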
> >
> >
> > Best regards,
> > Aleksandras Novikovas [EMAIL PROTECTED]
> > IT manager
> > Baltic Logistic System Vilnius Ltd.
> > Kirtumu 51, Vilnius, Lithuania
> > Phone: +370-2-390874; FAX: +370-2-390899; Mobile: +370-99-21678
> >
> >
> >
> >
> 
