Re: [users@httpd] Wrong charset convert

Jiří Eichler Wed, 01 Jul 2009 07:00:00 -0700

Thank you, André, for the perfect explanation. The web browser converts 'ž' to
%C5%BE, which is two bytes, and this is what is sent to Apache: GET
/slo%C5%BEka.png HTTP/1.1. In its "Not found" message Apache shows it as
/sloÅ¾ka.png, which is probably right for a single-byte charset, but it looks really
strange :) I don't think the OS changes the name during the upload; it
has to save it in UTF-8, and it is saved correctly with the bytes C5BE
('ž'), even though in Windows that is of course the wrong charset. I tried to open
such a file from a C program:


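#include <windows.h>

/* "složka.png" with 'ž' given as its two raw UTF-8 bytes, C5 BE */
char name[] = {'s', 'l', 'o', (char)0xC5, (char)0xBE, 'k', 'a', '.', 'p', 'n', 'g', 0};
OFSTRUCT o;
HFILE f = OpenFile(name, &o, OF_READ);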

And it opened that file. Windows didn't change anything. Apache
receives a GET with exactly the same bytes as are in the file system on the hard
drive. I would expect Apache not to convert anything and simply to
call OpenFile or something similar.
When I try to load the file "/sloĹľka.png", Apache finds it. Apache
receives from the browser: GET /slo%C4%B9%C4%BEka.png HTTP/1.1, and that
works. This filename works with the Windows API OpenFile function too, but
only because Windows tries to convert it; it is not saved on the hard drive
this way.

Apache must be converting the received request somehow...
You wrote: "The webserver should take this path exactly as received, and
look for a file on disk whose name matches exactly that path, byte by byte."
If that were so, then it MUST work with the bytes C5BE, since that works with the Windows
API and the Hexplorer view shows only C5BE on the hard drive, not C4B9C4BE.

I hope the 'ž' character is displayed correctly, and sorry for my English;
I'm sure I made a lot of mistakes :-)

André Warnier wrote:

[email protected] wrote:
This is the problem: http://sgo.happyforever.com/test.php
Please try it, thanks.

------------ Original message ------------
From: <[email protected]>
Subject: [us...@httpd] Wrong charset convert
Date: 01.7.2009 00:03:06
---------------------------------------------
I have installed Apache 2.2.11 with PHP 5.2.8 on Windows XP SP3. Windows uses the Windows-1250 charset (Czech localization). I want to install the MediaWiki software, which uses the UTF-8 charset.
When I upload a file with non-English characters in its name, the name is saved in UTF-8. When I try to open such a file in a web browser, the server returns a 404 Not Found status.
Example:
Upload a file using a simple HTML upload form, which is encoded in UTF-8:
<!-- this is only part of the whole code -->
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
</head>
<body>

<form enctype="multipart/form-data" action="uploader.php" method="POST">
<input type="hidden" name="MAX_FILE_SIZE" value="100000" />
Choose a file to upload: <input name="uploadedfile" type="file" /><br />
<input type="submit" value="Upload File" />

</form>
</body>
</html>

A file named, for example, "složka.png" is saved to the hard drive under the name
"sloĹľka.png" in Windows-1250 encoding.
(This is not true, see below)

If the upload form were
encoded with charset=Windows-1250, the file would be correctly named "složka.png",
but the charset must be UTF-8.

So suppose we have a server with an uploaded file: http://something.com/složka.png.
On Linux it works fine. But on a Windows server you must use an address like http://something.com/sloĹľka.png, and
that's not good for MediaWiki.
I don't know if this is understandable enough; I need to set up Apache to ignore the Windows-1250 charset and use the original UTF-8 when decoding the URL.
httpd.conf is the original one (with the PHP installation).

Thanks for help
Jiri Eichler
Jiri,
the issue you are explaining above is not an easy one.
It will really be solved only when the powers-that-be on the Internet finally decide to move to an HTTP version 2.0, where everything by default would be Unicode, UTF-8 encoded. Until then, there will be confusion and difficulties for whoever does not use English as his main language.
--- Part I -------

First, about your last paragraph:
Apache will not use UTF-8 to decode a URL, because that would be wrong according to the current RFCs that specify how the WWW works.
The "law" in that respect is defined here:
http://www.ietf.org/rfc/rfc2396.txt
See section 1.5, URI Transcribability.

It is all a bit obscure, but basically what it boils down to is this:
when a server receives a URL:
- it first decodes the URL, to convert the "percent-escaped" characters back into single bytes. That means, for instance, that a "%20" is decoded into a space.
- then it does *no further decoding*; it takes the bytes *as they are*.
They are *not supposed* to be decoded any further, using iso-8859-1, cp-1250, UTF-8 or whatever.
(If Apache did that, then Apache would not respect the RFC).
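
The two steps above can be sketched in a few lines of Python. This is only an illustration of the principle, not of Apache's actual code; the request path is the one quoted earlier in this thread, and the three charsets tried at the end are simply the ones mentioned in the discussion.

```python
from urllib.parse import unquote_to_bytes

# Request path as the browser sent it (taken from the thread).
raw_path = "/slo%C5%BEka.png"

# Step 1: undo the percent-escapes. The result is bytes, not text.
path_bytes = unquote_to_bytes(raw_path)   # b'/slo\xc5\xbeka.png'

# Step 2 is what the server is NOT supposed to do: guess a charset.
# The same bytes read as three different names depending on the guess:
#   utf-8   -> /složka.png
#   cp1250  -> /sloĹľka.png
#   latin-1 -> /sloÅ¾ka.png
readings = {cs: path_bytes.decode(cs) for cs in ("utf-8", "cp1250", "latin-1")}
```

The point is that `path_bytes` is the only unambiguous result; each entry in `readings` is just one possible interpretation of it.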
Now, let's say that this URL contains a path pointing to some resource, which in this case is a file on disk.
Well then, the webserver should take this path exactly as received, and look for a file on disk whose name matches exactly that path, byte by byte.
But, between the webserver and the disk, there is an operating system.
The webserver does not read the disk directly. It does that through the OS I/O interface calls. So, it is possible that when the webserver looks for a file called "xyz123.html", the OS interface translates that to "XYZ123.HTML" for example, and returns /that/ file.
That is for example the case for Windows. For "xyz123.html", Windows will return any file that is named "Xyz123.html", or "xYz123.html", or "XYZ123.html" etc., because when looking for files, Windows is case-insensitive. If the webserver does not double-check this (some do), then it may thus return the wrong file.
The same kind of thing can happen with "diacritic" characters, such as in your "složka.png".
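
As a rough sketch of that "double-check", here is a toy model in Python. Both function names are made up for the illustration, the directory is just a list of strings, and `str.lower()` is only a stand-in for whatever case folding the real file system applies.

```python
def lookup_case_insensitive(requested, directory):
    """Toy model of a case-insensitive lookup: return the first entry
    whose name matches the request when case is ignored."""
    for entry in directory:
        if entry.lower() == requested.lower():
            return entry
    return None


def lookup_checked(requested, directory):
    """Same lookup, but reject a hit whose stored name is not exactly
    the requested name."""
    entry = lookup_case_insensitive(requested, directory)
    return entry if entry == requested else None


directory = ["XYZ123.HTML"]
loose = lookup_case_insensitive("xyz123.html", directory)   # "XYZ123.HTML"
strict = lookup_checked("xyz123.html", directory)           # None
```

Without the extra comparison the server would hand out "XYZ123.HTML" for a request that named a different file.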
--------- Part II -----------

Uploading files and writing them to disk.
This is a separate issue.
The script that handles the <form> which is used to upload the file knows that the filename is Unicode, encoded as UTF-8. (It knows that because you wrote the <form> and the script, and in your <form> you have told the browser to send information in UTF-8.)
In the UTF-8 encoding, the filename "složka.png" consists of *10 characters*, but of *11 bytes*. That is because the "ž" in the middle is encoded using 2 bytes in UTF-8.
If you look at this filename with an editor which understands UTF-8, you will see this as "složka.png".
If you look at this same filename with an editor which does not understand UTF-8 (or is set to iso-8859-2), then you will see this same string as something like "sloĹľka.png" (or something else like that, I have not really checked).
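
The character and byte counts can be checked with a short Python sketch. Windows-1250 is used for the "wrong" reading only because it is the charset named at the start of this thread; other single-byte charsets give other results.

```python
name = "slo\u017eka.png"            # "složka.png"; \u017e is the letter ž

utf8_bytes = name.encode("utf-8")   # what the upload form sends
char_count = len(name)              # 10 characters
byte_count = len(utf8_bytes)        # 11 bytes: ž takes two (C5 BE)
hex_dump = utf8_bytes.hex()         # '736c6fc5be6b612e706e67'

# A viewer that assumes one byte per character (here Windows-1250)
# shows the two bytes of ž as two separate letters.
misread = utf8_bytes.decode("cp1250")   # "sloĹľka.png", 11 characters
```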
But back to your upload script.

It has this uploaded file name, in Unicode UTF-8, as "složka.png".
Now it wants to create this file on disk.
For that, it tells the OS : create file "složka.png".
The OS takes this file name and, depending on several conditions (**), understands this name literally as either a series of *bytes* (11 of them), or as a series of *characters* (10 of them) in UTF-8 encoding. And the OS, according to its understanding, creates a directory entry on disk for this filename.
In your case, it creates an entry in the disk directory containing the /bytes/ (or /characters/) "sloĹľka.png".
It does that because your script does it wrong:
The script "knows" that this filename is encoded in UTF-8.
But the OS does not know that.
The script /should know/ how the OS is going to understand that, and should, if needed, re-encode this filename in the proper encoding, so that the OS understands it correctly and creates a file named "složka.png".
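
A minimal sketch of such a re-encoding step, written in Python rather than PHP purely for illustration. The function name `to_os_filename` is hypothetical, and the default of Windows-1250 is an assumption taken from the setup described in this thread; a real script would have to find out what its OS actually expects.

```python
def to_os_filename(utf8_name: bytes, os_charset: str = "cp1250") -> bytes:
    """Re-encode a UTF-8 filename into the charset the OS is assumed to use.

    Raises UnicodeEncodeError if the name contains a character that
    os_charset cannot represent.
    """
    text = utf8_name.decode("utf-8")   # bytes -> characters
    return text.encode(os_charset)     # characters -> bytes for the OS


uploaded = b"slo\xc5\xbeka.png"        # "složka.png" as sent by the form
for_os = to_os_filename(uploaded)      # b'slo\x9eka.png' (ž is 9E in Windows-1250)
```

Names that the target charset cannot express raise an error here instead of being silently mangled, which is one more reason for the alias approach suggested in Part IV.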
It is not that a file named "sloĹľka.png" is wrong. It is, in itself, a perfectly valid filename.
But the problem is that, considering Part I above:
- your users are going to type a URL in the location bar of their browser
- for that, they are going to use the keyboard that they have, on their workstation, with their OS and their browser etc. (for example, I could never type it, because I don't have a key for "ž" on my keyboard; so I have to cut and paste from your email ;-))
- So they are going to type, for example:
http://yourhost.yourcompany.com/uploadedfiles/složka.png
- The browser is going to URL-encode that, probably replacing the "ž" by a 3-character "percent-sequence" like %B3 (or even two 3-character sequences, if the browser thinks it must encode the URL as UTF-8).
- the browser is then going to "send this URL" to Apache.
- Apache will receive this URL, decode the %-sequences into *bytes*, and ask the OS for this file.
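
How the percent-sequences depend on the charset the browser picks can be sketched with Python's standard `urllib.parse.quote`. The two charsets below are assumptions chosen from the ones discussed in this thread; which one a given browser really uses is not something this sketch can tell you.

```python
from urllib.parse import quote

path = "/uploadedfiles/slo\u017eka.png"    # "složka.png" typed by the user

# Browser encodes the URL as UTF-8: ž becomes two 3-character sequences.
as_utf8 = quote(path, encoding="utf-8")     # '/uploadedfiles/slo%C5%BEka.png'

# Browser encodes the URL in a single-byte charset: one sequence only.
as_cp1250 = quote(path, encoding="cp1250")  # '/uploadedfiles/slo%9Eka.png'
```

The server sees only the percent-sequences, so these two requests name two different byte strings even though the user typed the same thing.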
------ Part III ----
Now, IF the two translations match (the one which happened when you uploaded the file, and the one which happens between the user and the server disk), then the file will be found.
And otherwise, it will not be.

Your case is that the two translations do not match.
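
A toy model of the mismatch, in Python. It rests on two assumptions that are NOT established facts about Windows or Apache: that on upload the name's bytes are read as Windows-1250, and that on request the decoded URL bytes are read as UTF-8. They are chosen only because together they reproduce the two observations reported in this thread. Both function names are hypothetical.

```python
from urllib.parse import unquote_to_bytes


def stored_name(upload_bytes: bytes, os_charset: str = "cp1250") -> str:
    """Name recorded on disk if the OS reads the script's bytes in os_charset."""
    return upload_bytes.decode(os_charset)


def requested_name(url_path: str, server_charset: str = "utf-8") -> str:
    """Name looked up after percent-decoding, if read in server_charset."""
    return unquote_to_bytes(url_path).decode(server_charset)


# The upload script hands over the UTF-8 bytes of "složka.png".
on_disk = stored_name("slo\u017eka.png".encode("utf-8"))

found_plain = requested_name("slo%C5%BEka.png") == on_disk          # False
found_odd = requested_name("slo%C4%B9%C4%BEka.png") == on_disk      # True
```

In this model the first request looks for "složka.png" while the disk holds "sloĹľka.png", so only the second, odd-looking URL is found.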

----- Part IV : how to resolve this --------

My suggestion :
do /not/ allow the users to decide under which name the file is really stored on the disk.
Create an "alias" for the filename, containing only US-ASCII characters, and store the file under that name.
And then, arrange that when the users ask for the file "složka.png" (this name appears for example on an index page that you create), in reality your webserver is looking for this alias name. (*)
This is the only way to make your application really portable, because in the end, on the WWW, you never know who or where the user is, what his workstation is, what his OS is, etc.
So the user could upload a file under a name that gives you a lot of trouble on your server (as you have discovered already, but not entirely).
For example, one user could upload a file named "složka.png", and another user could upload a file called "Složka.png". If your server is Windows, and if you are not careful, the second file will overwrite the first.
There are many other such problematic cases.
And if MediaWiki does not do that, then MediaWiki is not a portable application, sorry. The problem is not the webserver, the problem is the application.
(and, in part, HTTP 1.x)

(*) you show for example an index page like :
<a href="/files/20090630-180667-123456.png">složka.png</a>
(**) which can be, for example, the "locale" under which the Apache process is running.
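
One possible shape for the alias scheme, sketched in Python. Everything here is hypothetical: the names `make_alias`, `store` and `index_link`, the "/files/" prefix (borrowed from the example link above), and the timestamp-plus-random layout of the alias. The index is an in-memory dictionary, where a real application would use a database.

```python
import html
import os
import secrets
import time


def make_alias(original_name: str) -> str:
    """Build an ASCII-only storage name. The extension is kept only if it
    consists of plain ASCII letters and digits."""
    ext = os.path.splitext(original_name)[1].lower()
    if not (ext[1:].isascii() and ext[1:].isalnum()):
        ext = ""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    return f"{stamp}-{secrets.token_hex(4)}{ext}"


def store(index: dict, original_name: str) -> str:
    """Record original_name under a fresh alias and return the alias."""
    alias = make_alias(original_name)
    while alias in index:              # regenerate on the rare collision
        alias = make_alias(original_name)
    index[alias] = original_name
    return alias


def index_link(alias: str, display_name: str) -> str:
    """HTML link that shows the user's name but points at the alias."""
    return f'<a href="/files/{alias}">{html.escape(display_name)}</a>'


index = {}
first = store(index, "slo\u017eka.png")    # "složka.png"
second = store(index, "Slo\u017eka.png")   # "Složka.png"
```

Because the stored names never contain anything but ASCII, the two uploads above cannot overwrite each other on a case-insensitive file system, and the URL the browser sends needs no charset guess at all.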
---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: [email protected]
  "   from the digest: [email protected]
For additional commands, e-mail: [email protected]



