Hi,

I have a problem reading non-ISO8859_1 characters from HTTP requests.
I think it is a bug in all major browsers (NC 3, 4, 4.5; MSIE 3, 4, 5).
Does anybody have a solution?

The problem is this:
Internally, Java uses the 16-bit Unicode character set, which can hold
characters from all languages. But the default 8-bit character
encoding for "text/html" documents is ISO8859_1 (Western Europe).
It is very easy to output an HTML page from a servlet in some
other character encoding with a construction like this:

 response.setContentType("text/html; charset=iso-8859-2");
 PrintWriter out = response.getWriter();
 out.println("\u00e1\u010d\u010f");
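
To make the output conversion concrete, here is a minimal standalone
sketch (no servlet engine needed) of what I assume the PrintWriter does
with those three characters. The class name EncodeDemo and the helper
toHex are made up for illustration, and the sketch assumes the JVM
supports the ISO-8859-2 charset:

```java
import java.io.UnsupportedEncodingException;

public class EncodeDemo {

    // Encode the text in the named charset and return the bytes as hex,
    // e.g. "E1 E8 EF". Hypothetical helper, not part of any servlet API.
    static String toHex(String text, String charset)
            throws UnsupportedEncodingException {
        byte[] bytes = text.getBytes(charset);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            // Mask to 0..255 so high bytes are not printed as negative ints
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // a-acute, c-caron, d-caron: one byte each in ISO-8859-2
        System.out.println(toHex("\u00e1\u010d\u010f", "ISO-8859-2")); // prints "E1 E8 EF"
    }
}
```

So each 16-bit character becomes a single byte on the wire, and the
"charset=" field tells the browser how to read those bytes.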

But when a servlet reads an HTTP request, it must perform
the reverse conversion from 8-bit to 16-bit characters. The servlet
tries to determine the encoding used from the "charset=" field
of the MIME type of the HTTP request. But browsers don't
set this field, so the characters are always converted as if
they were ISO8859_1 characters. You can try it with this servlet:

//-----------Servlet demonstrating encoding of HTTP requests------
import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class Cti extends HttpServlet
{
  // Test string: a-acute, c-caron, d-caron, e-acute (all exist in ISO-8859-2)
  String original_text = "\u00e1\u010d\u010f\u00e9";

  // GET: send a form whose text field is pre-filled with the test string
  public void doGet (HttpServletRequest request, HttpServletResponse response)
       throws ServletException, IOException
  {
    response.setContentType("text/html; charset=iso-8859-2");
    PrintWriter out = response.getWriter();
    out.println( "<FORM METHOD=POST>" );
    out.println( "<INPUT TYPE=TEXT NAME=textname VALUE=\"" + original_text + "\">" );
    out.println( "<INPUT TYPE=SUBMIT VALUE=Send>" );
    out.println( "</FORM>" );
  }

  // POST: report how the request was labelled and what text was decoded
  public void doPost (HttpServletRequest request, HttpServletResponse response)
       throws ServletException, IOException
  {
    String mimeType = request.getContentType();
    String charEnc  = request.getCharacterEncoding();
    response.setContentType("text/html; charset=iso-8859-2");
    PrintWriter out = response.getWriter();
    String read_text = request.getParameter("textname");
    out.println("<PRE>");
    out.println("MIME type of request:<B> " + mimeType + "</B>");
    out.println("   CharacterEncoding:<B> " + charEnc + "</B>");
    out.println("       original_text:<B> " + original_text + "</B>");
    out.println("           read_text:<B> " + read_text + "</B>");
    if (original_text.equals(read_text))
      out.println("<BR>They are the same!");
    else
      out.println("<BR>They are different!");
    out.println("</PRE>");
  }
}
//---------------------------------------------------------------------------------------------

Every request has the MIME type "application/x-www-form-urlencoded" with no
"charset=" field set. I don't know whether this is a bug in the browsers or
in the standard definition of MIME types.
Browsers usually send form data in the character encoding of the page that
contains the FORM tag, but they don't declare it in the request's MIME type header.
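
The effect can be reproduced outside a servlet engine. The sketch below
is only my assumption of what happens: the browser sends the text as
ISO-8859-2 bytes, and the receiving side decodes those bytes as
ISO8859_1. The class RecodeDemo and the methods misdecode and repair are
hypothetical names, and the sketch assumes the JVM supports ISO-8859-2:

```java
import java.io.UnsupportedEncodingException;

public class RecodeDemo {

    // Simulate the round trip: encode in the page's charset (what I assume
    // the browser sends), then decode as ISO-8859-1 (what the servlet sees).
    static String misdecode(String text, String pageCharset)
            throws UnsupportedEncodingException {
        byte[] sent = text.getBytes(pageCharset);
        return new String(sent, "ISO-8859-1");
    }

    // Undo the wrong decoding: ISO-8859-1 maps every byte to exactly one
    // character, so the original bytes can be recovered and decoded again.
    static String repair(String misread, String pageCharset)
            throws UnsupportedEncodingException {
        byte[] raw = misread.getBytes("ISO-8859-1");
        return new String(raw, pageCharset);
    }

    public static void main(String[] args) throws Exception {
        String original = "\u00e1\u010d\u010f\u00e9";
        String misread = misdecode(original, "ISO-8859-2");

        // c-caron and d-caron come back as e-grave and i-diaeresis
        System.out.println(misread.equals("\u00e1\u00e8\u00ef\u00e9")); // prints "true"
        System.out.println(original.equals(misread));                   // prints "false"
        System.out.println(original.equals(repair(misread, "ISO-8859-2"))); // prints "true"
    }
}
```

In this isolated sketch the wrong decoding is reversible, but that only
helps if the servlet already knows which encoding the page used, and it
has to be repeated for every parameter.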

A possible solution would be to define a new method
javax.servlet.ServletRequest.setCharacterEncoding(String),
but that would be an API change.
Does anybody know a workaround?

Martin Kuba
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   INET, a.s.                          Mgr. Martin Kuba
Kralovopolska 139                  e-mail: [EMAIL PROTECTED]
  601 12 Brno                      WWW: http://www.inet.cz/~makub/
 Czech Republic                    tel: +420-5-41242414/33
--------------------------------------------------------------------
PGP fingerprint = D8 57 47 E5 36 D2 C1 A1  C3 48 B2 59 00 58 42 27
 http://wwwkeys.cz.pgp.net:11371/pks/lookup?op=index&search=makub
--------------------------------------------------------------------
