Jiří Eichler wrote:
I didn't program MediaWiki, but on Wikipedia it seems to be working well. I just realize that we haven't solved that problem with charset, I have just changed charset sent by php ... you're right with "double encoding" to utf-8, Apache/php think that it is something else and encode it once more. But how can we tell php that it is in utf-8? I don't know. :-D Can it be called bug when there is no way to detect charset of uploaded filename?

Well...
One basic problem is that there are "holes" in the HTTP 1.x specification, at least when considering the multilingual, increasingly Unicode-centric world in which we are living. The second problem is that browsers do not always respect even the HTTP 1.x specification. The third is that not all browsers fail to respect it in the same way (but they are getting better at this). The fourth is that, the WWW being what it is, with clients that the server does not control, you can never be sure of anything. The fifth is that programming languages like PHP do not necessarily offer very good tools to "mark" a string as being in a particular encoding. And the last is that it is relatively easy to check whether a received text is valid UTF-8, but very hard to check whether it is valid iso-8859-1, iso-8859-2, cp-1250, or any other 8-bit character set, because almost any byte sequence is valid in those; it is harder still to find out which one of them it is.
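
To illustrate that last point, here is a minimal sketch in Python (chosen only for illustration; the thread itself is about PHP). The function name is_valid_utf8 and the sample name are my own assumptions, not something from your application. It shows that a strict UTF-8 decode rejects cp-1250 bytes, while the same bytes decode without any error as both iso-8859-1 and iso-8859-2, so a successful decode tells you nothing about which 8-bit charset was meant.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The same name, encoded two different ways (hypothetical sample data).
name_utf8 = "Jiří".encode("utf-8")
name_cp1250 = "Jiří".encode("cp1250")

print(is_valid_utf8(name_utf8))    # the UTF-8 bytes pass the strict check
print(is_valid_utf8(name_cp1250))  # the cp-1250 bytes do not

# The cp-1250 bytes decode "successfully" under two other 8-bit charsets.
# Neither call raises, so the decoder cannot tell us which guess is right.
print(name_cp1250.decode("iso-8859-1"))
print(name_cp1250.decode("iso-8859-2"))
```

Only a human (or a dictionary-based heuristic) can tell that one of the two decoded strings is a plausible name and the other is not.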

And one overall issue is that it is not always easy to change any of the above without suddenly breaking many WWW applications.

Taking all of the above into account, however, there are some things you can do in your applications to minimise the consequences.

The first thing is to be correct, consistent, and precise in what you send to the browser.
("Be strict in what you send, and tolerant in what you receive")

So if you have chosen Unicode/UTF-8 for your basic charset and encoding (the best choice nowadays), here is what to do:
- each time your server sends a text page to the client, send a proper "Content-Type: xxx/yyyy; charset=utf-8" HTTP header with the response (see *1 below)
- each time your server sends an HTML or XML page to the client, make sure that it has an explicit charset declaration inside
- always verify that your pages *are* encoded in UTF-8, and that nobody has been editing them with an old editor which knows only iso-latin-2 or cp-1250
- when you send a <form> to the client, specify the accept-charset="utf-8" attribute in the <form> tag
- when you send a <form> to the client which will later be submitted back, include some
<input name="test-encoding" type="hidden" value="xxxxxxxxxx">
where "xxxxxxxxxx" is a valid UTF-8 string containing non-US-ASCII characters. Then, in the script that receives the data from this form, test this parameter to see whether what you received is indeed UTF-8.
The way to do that varies depending on the programming language.
(Maybe you can compare the length in bytes and/or the length in characters, or compare it with an internal identical string known to be UTF-8.)
- be "defensive" in your cgi-bin scripts. Everything you receive from the client is suspect.
- never forget that on the WWW, "the client is king". The user /can/ change the charset of his browser, no matter what the server tells it.
(Firefox 3.1 : View..Character encoding; IE 7 : same)
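
The hidden "test-encoding" check from the list above could look roughly like the following. This is only a sketch in Python, under the assumption that your script can get at the raw bytes of the parameter (after percent-decoding, before any charset conversion). SENTINEL and classify_submission are hypothetical names, and the list of fallback charsets is just a guess at what Central European clients might send.

```python
# Hypothetical value placed in the hidden <input>; it must contain
# characters outside US-ASCII, otherwise all encodings look the same.
SENTINEL = "žluťoučký kůň"

# Charsets to try if the value did not come back as UTF-8 (an assumption).
FALLBACK_CHARSETS = ("cp1250", "iso-8859-2")


def classify_submission(raw_value: bytes, sentinel: str = SENTINEL) -> str:
    """Guess the encoding the client used, by comparing the raw bytes of
    the hidden field with the sentinel encoded in each candidate charset."""
    if raw_value == sentinel.encode("utf-8"):
        return "utf-8"
    for charset in FALLBACK_CHARSETS:
        try:
            if raw_value == sentinel.encode(charset):
                return charset
        except UnicodeEncodeError:
            # The sentinel cannot be represented in this charset at all.
            continue
    return "unknown"


print(classify_submission(SENTINEL.encode("utf-8")))
print(classify_submission(SENTINEL.encode("cp1250")))
print(classify_submission(b"garbage"))
```

If the answer is anything other than "utf-8", the safest reaction is probably to treat the other fields of that submission, including uploaded file names, as suspect rather than to convert them blindly.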



(*1) :
when I use your PHP upload page, the response page that I get from your server has these HTTP headers :
HTTP/1.1 200 OK
Date: Wed, 01 Jul 2009 19:44:31 GMT
Server: Apache/2.2.11 (Win32) PHP/5.2.8
X-Powered-By: PHP/5.2.8
Content-Length: 716
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=windows-1250


However, the HTML page itself contains :
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

That is /not/ consistent.
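
To see why this matters: browsers generally give the HTTP header precedence over the <meta> declaration, so a page stored as UTF-8 but announced as windows-1250 gets decoded with the wrong table. A small Python sketch of the effect, and of the "double encoding" it leads to, follows. The sample text is my own; nothing here is taken from your actual page.

```python
page_text = "Jiří"                 # hypothetical text on the page

body = page_text.encode("utf-8")   # bytes actually stored in the file
shown = body.decode("cp1250")      # what a client sees if it trusts the
                                   # "charset=windows-1250" header

print(len(page_text), len(shown))  # the garbled text has more characters
print(shown == page_text)          # and it is no longer the same string

# If the garbled text is then encoded to UTF-8 again, the result is the
# "double encoded" form: UTF-8 bytes of characters that were themselves
# a misreading of UTF-8 bytes.
double_encoded = shown.encode("utf-8")

# In this particular case the damage can be undone by reversing the steps,
# because every byte involved happens to be defined in cp-1250.
repaired = double_encoded.decode("utf-8").encode("cp1250").decode("utf-8")
print(repaired == page_text)
```

The repair at the end is not something to rely on in general: cp-1250 leaves a few byte values undefined, so some UTF-8 sequences cannot make the round trip at all.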

On the other hand, the index page received after you click on the /data link has the following HTTP headers, which do declare UTF-8 :

HTTP/1.1 200 OK
Date: Wed, 01 Jul 2009 19:54:01 GMT
Server: Apache/2.2.11 (Win32) PHP/5.2.8
Content-Length: 264
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: text/html;charset=UTF-8
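
The comparison I made by hand in (*1) can also be automated. The following Python sketch takes a Content-Type header value and the HTML text, and reports whether the two charset declarations agree. The function names are hypothetical, and the regular expression is a deliberately simple assumption about how the <meta> tag is written, not a real HTML parser.

```python
import re
from email.message import Message
from typing import Optional


def header_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None if absent


def meta_charset(html: str) -> Optional[str]:
    """Find the charset declared in the first <meta ... charset=...> tag."""
    match = re.search(
        r'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_-]+)', html, re.IGNORECASE
    )
    return match.group(1).lower() if match else None


def charsets_agree(content_type: str, html: str) -> bool:
    """True only if both declarations exist and name the same charset."""
    from_header = header_charset(content_type)
    from_meta = meta_charset(html)
    return from_header is not None and from_header == from_meta


# The header and <meta> line quoted above for the upload response page.
upload_header = "text/html; charset=windows-1250"
upload_meta = '<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />'

print(header_charset(upload_header), meta_charset(upload_meta))
print(charsets_agree(upload_header, upload_meta))
```

Note that this compares the names only as written; two different aliases of the same charset would be reported as a mismatch.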



---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscr...@httpd.apache.org
  "   from the digest: users-digest-unsubscr...@httpd.apache.org
For additional commands, e-mail: users-h...@httpd.apache.org
