Re: ++ Best practive ?? ++ (JSP-->Servlet-->Database) character encoding.

Nikola Milutinovic Wed, 01 Sep 2004 03:29:16 -0700

Ben Bookey wrote:

Dear list,

We have a web-based JSP/servlet application performing updates, deletes and
inserts into an Oracle database, running on Tomcat 5. We want to support
both American and European customer client locales, so we want to use either
ISO-8859-15 or UTF-8. But we are having problems saving the Euro symbol when
using ISO-8859-15 encoding.

Since you have to support multiple character sets, it would be cleaner to choose UTF-8 for your DB in the first place. I do realise that data conversion can be a tremendous task, so your mileage may vary.

I had previously assumed that, because Java works with Unicode by default,
all data entered in an HTML form would therefore be saved as UTF-8 in
the database (i.e. as soon as a value is assigned to a Java data object,
e.g. a String or an int). I am beginning to think this is not the case, and
that all data is saved in the database in the original encoding as posted
by the browser. Can someone please explain what is really going on? Do I
need some code which checks the browser encoding in the HTTP
header and then converts/parses accordingly to a chosen standard? This would
avoid the situation where our database ends up containing records
in different character encodings, which I suspect is what is now
happening.

First of all, Tomcat, being a Java-based application, uses Unicode internally. A JSP page can specify its *output* encoding, and that should match whatever the browser expects. Tomcat *should* (I haven't checked) output HTTP headers to match the declared encoding. Additionally, you as a web page designer may specify a <meta ...> tag to set your own encoding, but it will be ignored if the web server (Tomcat in this case) sets an HTTP header for the character encoding. I've seen that on Apache, which kept overriding our Windows-1250 pages to ISO-8859-1. The path of your data, in the display case, is:

DB(Oracle)--[JDBC]-->JVM-->Tomcat-->JSP--[Jasper]-->HTML==[HTTP]==>Mozilla

The JDBC driver should transform the data into Unicode correctly, provided the DB encoding is set correctly and the stored data really is in that encoding. This simply means that you cannot put, for instance, Windows-1250 data into a Latin-1 database and expect it to come out OK. The JVM will then try to convert Unicode into the requested output encoding; if a character cannot be represented in that encoding, it will read "?".
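To make the "?" case concrete, here is a minimal sketch using only the JDK. The class name EuroEncodingDemo and the helper hex() are made up for illustration, and it assumes your JVM ships the ISO-8859-15 and windows-1250 charsets (common JDKs do, but only a handful of charsets are guaranteed). It shows the bytes the Euro sign becomes in three encodings, and what happens when Windows-1250 bytes are read back as Latin-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EuroEncodingDemo {

    /** Hypothetical helper: hex dump of the bytes a string becomes in a charset. */
    static String hex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String euro = "\u20ac"; // the Euro sign, written as an escape to avoid source-encoding issues

        // Assumption: these two charsets are available in the running JVM.
        Charset latin9 = Charset.forName("ISO-8859-15");
        Charset cp1250 = Charset.forName("windows-1250");

        System.out.println(hex(euro, StandardCharsets.UTF_8));      // E282AC
        System.out.println(hex(euro, latin9));                      // A4
        // Latin-1 has no Euro sign, so getBytes() substitutes '?' (0x3F).
        System.out.println(hex(euro, StandardCharsets.ISO_8859_1)); // 3F

        // Wrong-encoding round trip: Windows-1250 bytes read as Latin-1.
        byte[] stored = "\u0161".getBytes(cp1250); // s with caron, byte 0x9A
        String misread = new String(stored, StandardCharsets.ISO_8859_1);
        System.out.println(misread.equals("\u0161"));               // false
    }
}
```

The last case is the nasty one: nothing fails, you just silently get a different character back.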

For the input path, the situation is similar, with one catch. Not only can (and should) the JSP or HTML page holding the form have a character encoding, but the HTML form itself can have an encoding specified (the accept-charset attribute). Logic would suggest that if it is not specified, it should be inherited from the page. Logic fails on some browsers, so it would be prudent to specify it on the form as well.
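Why the form's encoding matters can be seen from what ends up on the wire. The following is only a rough sketch with the JDK's URLEncoder standing in for the browser (the class name FormEncodingDemo and the helper encode() are made up, and real browsers differ in details such as how they handle characters the charset cannot represent). The same field value produces different bytes depending on the charset used to submit it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class FormEncodingDemo {

    /** Hypothetical helper: form-encode one field value in the named charset. */
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String euro = "\u20ac"; // the Euro sign

        // Same character, different bytes in the submitted form data.
        System.out.println(encode(euro, "UTF-8"));       // %E2%82%AC
        System.out.println(encode(euro, "ISO-8859-15")); // %A4
    }
}
```

Nothing in "%A4" itself says which encoding it came from, so the receiving side has to be told or has to guess.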

The last step is informing the processing Servlet/JSP of the character encoding of the incoming data. That should be done by the browser in the case of a POST request; I'm not sure what happens for GET requests. The browser should set the HTTP headers of its request (the form data submission) accordingly. Of course there is a slight difference between "should", "must" and "will". :-)
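Here is a small sketch of what goes wrong when the receiving side assumes the wrong encoding, again with JDK classes only (FormDecodingDemo and decode() are made-up names, and URLDecoder is just a stand-in for whatever the container does internally). In a real servlet, the usual way to state the encoding yourself is to call request.setCharacterEncoding(...) before reading any parameters, rather than relying on the browser's headers.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class FormDecodingDemo {

    /** Hypothetical helper: decode one submitted field value using the named charset. */
    static String decode(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String sentAsUtf8 = "%E2%82%AC"; // Euro sign submitted from a UTF-8 form
        String sentAsLatin9 = "%A4";     // Euro sign submitted from an ISO-8859-15 form

        // Decoding with the charset the browser actually used recovers the Euro sign.
        System.out.println(decode(sentAsUtf8, "UTF-8").equals("\u20ac"));         // true
        System.out.println(decode(sentAsLatin9, "ISO-8859-15").equals("\u20ac")); // true

        // Decoding with a wrong guess (Latin-1) silently yields other characters.
        System.out.println(decode(sentAsUtf8, "ISO-8859-1").length());            // 3
        System.out.println(decode(sentAsLatin9, "ISO-8859-1").equals("\u00a4"));  // true (currency sign)
    }
}
```

If different requests get decoded with different guesses, you end up with exactly the mixed-encoding records suspected in the question.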

In addition, how does TC deal with framesets containing many HTML pages? Are
they all treated individually (in theory allowing a different character
encoding to be used in each HTML frame), or as one unit?

TC deals with frames just as any other web server does - it doesn't. HTML frames are a client-side construction. Web servers don't care about them and do not notice them, just as they don't care about multiple images in one single HTML page. A browser may request them after it has gotten the page, or it may simply ignore them - the web server doesn't care. It will answer ANY request, be it for HTML, JPEG or GIF, provided it is valid. So each frame's page is fetched by its own request and carries its own encoding.

Nix.

