Thank you, André, for the perfect explanation. The web browser converts 'ž' to
%C5%BE, which is two bytes, and this is what is sent to Apache: GET
/slo%C5%BEka.png HTTP/1.1. In its "Not found" message Apache translates it back to
/složka.png, which is probably right (ASCII). But it seems really
strange :) I don't think the OS changes the name on upload; it
has to save it in UTF-8 format, and it is saved correctly with the bytes C5BE
('ž'), even though for Windows that is of course the wrong charset. I tried to open
such a file from a C program:

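#include <windows.h>
/* "složka.png" with 'ž' given as its two UTF-8 bytes, C5 BE */
char name[] = {'s', 'l', 'o', (char)0xC5, (char)0xBE, 'k', 'a', '.', 'p', 'n', 'g', 0};
OFSTRUCT o;
HFILE f = OpenFile(name, &o, OF_READ); /* HFILE_ERROR on failure */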

And it opened that file. Windows didn't change anything. Apache
receives a GET with exactly the same bytes as are in the file system on the hard
drive. I would expect Apache not to convert anything and simply to
call OpenFile or something similar.
When I try to load the file "/sloĹľka.png", Apache finds it. Apache
receives from the browser: GET /slo%C4%B9%C4%BEka.png HTTP/1.1, and that
works. This file name works with the Windows API OpenFile function too, but
that is because Windows tries to convert it; it is not saved on the hard drive
this way.
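
The byte arithmetic above can be checked with a short sketch. Python is used here only for illustration (the thread itself uses C and PHP), and the sketch assumes that the UTF-8 bytes of 'ž' are being read back as Windows-1250; it says nothing about what Apache does internally.

```python
from urllib.parse import quote

# 'ž' is U+017E; written with an escape so the source file encoding does not matter.
utf8_bytes = "\u017e".encode("utf-8")      # the two bytes sent as %C5%BE
misread = utf8_bytes.decode("cp1250")      # the same two bytes read as Windows-1250
# quote() percent-encodes non-ASCII characters as UTF-8 by default.
url_that_works = quote("/slo" + misread + "ka.png")

print(utf8_bytes.hex().upper())   # C5BE
print(url_that_works)             # /slo%C4%B9%C4%BEka.png
```

So the URL that works is the UTF-8 percent-encoding of the two Windows-1250 characters that the bytes C5 BE stand for.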

Apache must convert the received request somehow...
You wrote: "The webserver should take this path exactly as received, and
look for a file on disk whose name matches exactly that path, byte by byte."
If that were so, then it MUST work with the bytes C5BE, since it works with the Windows
API and the Hexplorer view shows only C5BE on the hard drive, not C4B9C4BE.

I hope that the 'ž' character is displayed correctly, and sorry for my English;
I'm sure that I made a lot of mistakes :-)

André Warnier wrote:
ejir...@seznam.cz wrote:
This is the problem: http://sgo.happyforever.com/test.php
Please try it, thanks.

------------ Original message ------------
From: <ejir...@seznam.cz>
Subject: [us...@httpd] Wrong charset convert
Date: 01.7.2009 00:03:06
---------------------------------------------
I have installed Apache 2.2.11 with PHP 5.2.8 on Windows XP SP3. Windows is using the Windows-1250 charset (Czech localization). I want to install the MediaWiki software, which uses the UTF-8 charset.

When I upload a file with non-English characters in its name, the name is saved in UTF-8 format. When I try to open such a file in a web browser, the server returns a 404 Not Found status.

Example:

Upload a file using a simple HTML upload form that is encoded in UTF-8:

<!-- this is only part of the whole code -->
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
</head>
<body>

<form enctype="multipart/form-data" action="uploader.php" method="POST">
<input type="hidden" name="MAX_FILE_SIZE" value="100000" />
Choose a file to upload: <input name="uploadedfile" type="file" /><br />
<input type="submit" value="Upload File" />

</form>
</body>
</html>

A file named, for example, "složka.png" is saved to the hard drive with the name
"sloĹľka.png" in Windows-1250 encoding.
(This is not true, see below)

If the upload form were encoded with charset=Windows-1250, the file would be correctly named "složka.png", but the charset must be UTF-8.

So suppose that we have a server with the uploaded file http://something.com/složka.png. On Linux it works fine. But on a Windows server you must use an address like http://something.com/sloĹľka.png, and
that's not good for MediaWiki.

I don't know if this is understandable enough; I need to set up Apache to ignore the Windows-1250 charset and use the original UTF-8 for decoding URLs.
httpd.conf is the original one (with the PHP installation).

Thanks for help
Jiri Eichler

Jiri,
the issue you describe above is not an easy one.
It will really be solved only when the powers-that-be on the Internet finally decide to move to an HTTP version 2.0 in which everything is, by default, Unicode, UTF-8 encoded. Until then, there will be confusion and difficulties for anyone whose main language is not English.

--- Part I -------

First, about your last paragraph :
Apache will not use UTF-8 to decode a URL, because that would be wrong according to the current RFCs that specify how the WWW works.
The "law" in that respect is defined here :
http://www.ietf.org/rfc/rfc2396.txt
See section : 1.5. URI Transcribability

It is all a bit obscure, but basically what it boils down to is :
when a server receives a URL :
- it first decodes the URL, to convert the "percent-escaped" characters back into single characters. That means, for instance, that a "%20" is decoded into a space.
- then it does *no further decoding*, it takes the bytes *as they are*.
They are *not supposed* to be decoded any further, using iso-8859-1, cp-1250, UTF-8 or whatever.
(If Apache did that, then Apache would not respect the RFC).
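
As a rough illustration of these two steps (a Python sketch, not Apache's actual code; the helper name `percent_decode` is made up for this example), percent-decoding yields raw bytes and stops there:

```python
from urllib.parse import unquote_to_bytes

def percent_decode(url_path: str) -> bytes:
    """Undo the percent-escapes only; the result is raw bytes, not text."""
    return unquote_to_bytes(url_path)

space = percent_decode("%20")
path_bytes = percent_decode("/slo%C5%BEka.png")

print(space)        # b' '
print(path_bytes)   # b'/slo\xc5\xbeka.png'
```

Nothing in this step says whether the bytes C5 BE are one UTF-8 character or two characters of some single-byte charset; that interpretation only happens later, if at all.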

Now, let's say that in this URL there is a path pointing to some resource, which in this case is a file on disk. Well then, the webserver should take this path exactly as received, and look for a file on disk whose name matches exactly that path, byte by byte.

But, between the webserver and the disk, there is an operating system.
The webserver does not read the disk directly. It does that through the OS I/O interface calls. So it is possible that when the webserver looks for a file called "xyz123.html", the OS interface translates that to "XYZ123.HTML", for example, and returns /that/ file. That is the case for Windows: for "xyz123.html", Windows will return any file that is named "Xyz123.html", or "xYz123.html", or "XYZ123.html", etc., because when looking for files, Windows is case-insensitive. If the webserver does not double-check this (some do), then it may thus return the wrong file. The same kind of thing can happen with "diacritic" characters, such as the one in your "složka.png".
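
The "double-check" mentioned above could look roughly like the following Python sketch. The function name `exists_with_exact_case` is hypothetical, and real servers may do this differently; the idea is simply to compare the requested name against the names the directory actually lists, instead of trusting the OS lookup.

```python
import os
import tempfile

def exists_with_exact_case(directory: str, name: str) -> bool:
    """True only if the directory lists an entry spelled exactly like `name`."""
    return name in os.listdir(directory)

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "Xyz123.html"), "w").close()
    # The OS lookup may succeed on a case-insensitive file system...
    loose = os.path.exists(os.path.join(d, "xyz123.html"))
    # ...but the exact comparison rejects the differently-cased name everywhere.
    strict = exists_with_exact_case(d, "xyz123.html")
    exact = exists_with_exact_case(d, "Xyz123.html")
```

On Windows `loose` would be expected to be true while `strict` stays false; on a case-sensitive file system both are false.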

--------- Part II -----------

Uploading files and writing them to disk.
This is a separate issue.

The script that handles the <form> which is used to upload the file, knows that the filename is Unicode, encoded as UTF-8. (It knows that, because you wrote the <form> and the script, and in your <form>, you have told the browser to send information in UTF-8).

In the UTF-8 encoding, the filename "složka.png" consists of *10 characters*, but of *11 bytes*. That is because the "ž" in the middle is encoded using 2 bytes in UTF-8. If you look at this filename with an editor which understands UTF-8, you will see this as "složka.png". If you look at this same filename with an editor which does not understand UTF-8 (for example one set to Windows-1250), then you will see this same string as "sloĹľka.png" (with iso-8859-2 it would appear as "sloĹžka.png").
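
The character and byte counts can be verified with a few lines of Python (an illustrative sketch only; the two legacy charsets are picked as examples of what such an editor might be set to):

```python
name = "slo\u017eka.png"          # "složka.png", with 'ž' written as U+017E
raw = name.encode("utf-8")

print(len(name))   # 10
print(len(raw))    # 11

# The same 11 bytes as shown by a viewer that assumes a single-byte charset:
as_latin2 = raw.decode("iso-8859-2")
as_cp1250 = raw.decode("cp1250")
```

Both single-byte readings produce 11 characters, because each of the two bytes C5 and BE is displayed as a character of its own.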

But back to your upload script.

It has this uploaded file name, in Unicode UTF-8, as "složka.png".
Now it wants to create this file on disk.
For that, it tells the OS : create file "složka.png".
The OS takes this file name, and depending on several conditions (**), understands this name literally as either a series of *bytes* (11 of them), or as a series of *characters* (10 of them) in UTF-8 encoding. And the OS, according to its understanding, creates a directory entry on disk for this filename. In your case, it creates an entry in the disk directory, containing the /bytes/ (or /characters/) "sloĹľka.png".

It does that, because your script does it wrong :
The script "knows" that this filename is encoded in UTF-8.
But the OS does not know that.
The script /should know/ how the OS is going to understand that, and should, if needed, re-encode this filename in the proper encoding, so that the OS understands it correctly, and creates a file named "složka.png".
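
A minimal sketch of such a re-encoding step, written in Python rather than in the PHP of the actual upload script. The helper name `reencode_for_os` is made up, and the default target charset of Windows-1250 is an assumption taken from the setup described in this thread:

```python
def reencode_for_os(name_utf8: bytes, os_charset: str = "cp1250") -> bytes:
    """Convert a UTF-8 file name into the byte form the OS is assumed to expect."""
    # Raises UnicodeEncodeError if the name cannot be represented in os_charset.
    return name_utf8.decode("utf-8").encode(os_charset)

uploaded = "slo\u017eka.png".encode("utf-8")   # 11 bytes, 'ž' is C5 BE
for_os = reencode_for_os(uploaded)               # 10 bytes, 'ž' is 9E
```

A name containing characters that Windows-1250 cannot represent makes this helper fail, which is one reason the alias approach suggested in Part IV is more robust.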

It is not that a file named "sloĹľka.png" is wrong. It is, in itself, a perfectly valid filename.
But the problem is that, considering Part I above :
- your users are going to type a URL in the location bar of their browser
- for that, they are going to use the keyboard that they have, on their workstation, with their OS and their browser etc... (for example, I could never type it, because I don't have a key for "ž" on my keyboard; so I have to cut and paste from your email ;-))
- So they are going to type, for example :
http://yourhost.yourcompany.com/uploadedfiles/složka.png

- The browser is going to URL-encode that, probably replacing the "ž" by a 3-character "percent-sequence" like %9E (its Windows-1250 byte), or even by two 3-character sequences, %C5%BE, if the browser thinks it must encode the URL as UTF-8.
- the browser is then going to "send this URL" to Apache.
- Apache will receive this URL, decode the %-sequences into *bytes*, and ask the OS for this file.
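
To make the browser's choice concrete, here is a Python sketch of the same path percent-encoded under three charsets. Which one a given browser actually picks is not something this sketch can tell you; the charsets are just plausible candidates for a Czech workstation.

```python
from urllib.parse import quote

path = "/uploadedfiles/slo\u017eka.png"   # ".../složka.png"

as_cp1250 = quote(path, encoding="cp1250")
as_latin2 = quote(path, encoding="iso-8859-2")
as_utf8 = quote(path, encoding="utf-8")

print(as_cp1250)   # /uploadedfiles/slo%9Eka.png
print(as_latin2)   # /uploadedfiles/slo%BEka.png
print(as_utf8)     # /uploadedfiles/slo%C5%BEka.png
```

Three users typing the same visible URL can therefore send three different byte sequences to the server.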

------ Part III ----

Now, IF the two translations match (the one which happened when you uploaded the file, and the one which happens between the user and the server disk), then the file will be found.
And otherwise, it will not be.

Your case is that the two translations do not match.
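
The mismatch can be modelled in a few lines of Python. This is only a toy model: the function names are invented, and the two charsets (Windows-1250 on the upload side, UTF-8 on the request side) are assumptions chosen because they reproduce the symptoms reported in this thread, not a statement about how Apache or Windows are implemented.

```python
from urllib.parse import unquote_to_bytes

def stored_name(uploaded: bytes, os_charset: str) -> str:
    """Name the OS ends up with when it reads the uploaded bytes in os_charset."""
    return uploaded.decode(os_charset)

def requested_name(url_path: str, server_charset: str) -> str:
    """Name looked up when the percent-decoded bytes are read in server_charset."""
    return unquote_to_bytes(url_path).decode(server_charset)

on_disk = stored_name("slo\u017eka.png".encode("utf-8"), "cp1250")

# The "natural" URL does not match what is on disk...
found_plain = requested_name("slo%C5%BEka.png", "utf-8") == on_disk
# ...while the doubly encoded URL does.
found_mangled = requested_name("slo%C4%B9%C4%BEka.png", "utf-8") == on_disk

print(found_plain, found_mangled)   # False True
```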

----- Part IV : how to resolve this --------

My suggestion :
do /not/ allow the users to decide under which name the file is really stored on the disk. Create an "alias" for the filename, containing only US-ASCII characters, and store the file under that name. And then, arrange that when the users ask for the file "složka.png" (this name appears for example on an index page that you create), in reality your webserver is looking for this alias name. (*)
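
One way to sketch this alias idea in Python is shown below. Everything here is illustrative: the names `aliases` and `make_alias` are made up, a random hexadecimal identifier is just one possible alias scheme, and a real application would keep the mapping in a database rather than in memory.

```python
import os
import uuid

aliases = {}   # alias used on disk -> name shown to the users

def make_alias(original_name: str) -> str:
    """Return a US-ASCII-only file name to store the upload under."""
    ext = os.path.splitext(original_name)[1].lower()
    # Keep the extension only if it is plain ASCII letters and digits.
    if not (ext[1:].isascii() and ext[1:].isalnum()):
        ext = ""
    alias = uuid.uuid4().hex + ext
    aliases[alias] = original_name
    return alias

first = make_alias("slo\u017eka.png")    # "složka.png"
second = make_alias("Slo\u017eka.png")   # "Složka.png"
```

Because the two aliases differ in more than letter case, the second upload cannot overwrite the first, even on a case-insensitive file system.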

This is the only way to make your application really portable, because in the end, on the WWW, you never know who or where the user is, what his workstation is, what his OS is, etc.. So the user could upload a file under a name that gives you a lot of trouble on your server (as you have discovered already, but not entirely). For example, one user could upload a file named "složka.png", and another user could upload a file called "Složka.png". If your server is Windows, and if you are not careful, the second file will overwrite the first.
There are many other such problematic cases.

And if MediaWiki does not do that, then MediaWiki is not a portable application, sorry. The problem is not the webserver, the problem is the application.

(and, in part, HTTP 1.x)

(*) you show for example an index page like :
<a href="/files/20090630-180667-123456.png">složka.png</a>

(**) which can be, for example, the "locale" under which the Apache process is running.


---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscr...@httpd.apache.org
" from the digest: users-digest-unsubscr...@httpd.apache.org
For additional commands, e-mail: users-h...@httpd.apache.org

