The following reply was made to PR general/4492; it has been noted by GNATS.
From: Dirk-Willem van Gulik <[EMAIL PROTECTED]>
To: Ralf Weinand <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
Subject: Re: general/4492: UTF-8 encoding URL's at IE5 won't work with special
directory-names and apache
Date: Sun, 30 May 1999 14:00:27 +0200 (CEST)
On 29 May 1999, Ralf Weinand wrote:
> After the Installation of IE5 (german version) i got Problems with my
> websites
> Some links, that i created via javascript won't work.
> But javascript isn't the problem.
> non english URL's are standardly encoded in the UTF-8 mode, so i can't
> reach the sites with special words. when i disable the utf-8 in the
> IE5-properties (deep inside), all will work.
> i searched a while about the UTF-8 Meaning, but i do nor Know, whether
> UTF-8 is a standard real planned for the Internet.
> this .txt file will not be reached with IE5 and the standard-installation
This may be more technical than you wanted, but it is not really a server
problem; it has to do with the way IE5 implements some of its
internationalization and localization. And some of that is plain wrong,
wrong and wrong. Sorry. But there is a way round it; see the end of this
longish msg.
As for Apache: it can deal with UTF-8 files just fine. They are sent
out exactly as they are, but you should of course make sure that the
charset is set correctly. See www.w3.org/International for more information.
As for UTF-8 inside a URI: there are rules that all URIs must adhere to,
including which characters they may contain. Unfortunately your ringel-ss
or sz is not one of them, nor are, say, Chinese characters. This page
explains it in detail:
http://www.w3.org/International/O-URL-and-ident.html
In short the rules are
0. the URI is an octet stream with no real meaning,
i.e. just a sequence of numbers.
1. any special character (i.e. not a-z, 0-9 and a few
more) is to be encoded as a '%xy' where x and y are
hex numbers 0..9a..f.
To confuse matters, that sequence of numbers just _HAPPENS_ (but this
is entirely coincidental and of no substance) to look like a human
readable string when you look up the numbers in an ASCII table. But
you should completely forget this :-)
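As a rough illustration of the two rules above, here is a minimal Python
sketch. It is my own example, not what Apache or any browser runs. The
name escape_octets and the decision to leave '/' alone are assumptions
made for the example; the characters left unescaped are the "unreserved"
set from RFC 2396.

```python
# Octets that RFC 2396 calls "unreserved": letters, digits and a few marks.
UNRESERVED = set(
    b"abcdefghijklmnopqrstuvwxyz"
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"0123456789"
    b"-_.!~*'()"
)


def escape_octets(octets: bytes, keep: bytes = b"/") -> str:
    """Turn a sequence of octets into a URI-safe string.

    Hypothetical helper: every octet that is neither unreserved nor
    listed in `keep` is written as '%xy' with two hex digits.
    """
    parts = []
    for octet in octets:
        if octet in UNRESERVED or octet in keep:
            parts.append(chr(octet))
        else:
            parts.append("%%%02x" % octet)
    return "".join(parts)


# 0xDF is the sz when the name is stored as ISO-8859-1 (an assumption).
print(escape_octets(b"/Teddys/Gro\xdf/Teddy-schwarz.txt"))
# /Teddys/Gro%df/Teddy-schwarz.txt
```

Note that the function takes bytes, not text: the input already is a
sequence of numbers, which is the whole point of rule 0.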
What now follows is an incredible simplification of the real story. But
it might help. The 'solution' for your problem is at the end. I hope.
What generally happens is that a user enters a URL in the bar of the
browser. The browser, together with the OS, then translates this into
a valid octet string, as per RFC 2396, according to localization rules.
I.e. the user can actually type in strange chars, such as the sz,
the ae, ij and many others needed in Dutch, Danish, Chinese, German
and so on... but the browser, helped by the OS (which has details on
what the user meant when they typed in the string), is to translate
those to a simple octet string.
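To make the "according to localization rules" part concrete: the same
typed character turns into different octets depending on which charset
is picked for the translation. A small Python sketch, assuming a path
that contains the sz and two candidate charsets, ISO-8859-1 and UTF-8
(the variable names are made up for the example):

```python
from urllib.parse import quote

# What the user typed; "\u00df" is the sz character.
typed = "/Teddys/Gro\u00df/Teddy-schwarz.txt"

# Same text, two different octet strings on the wire.
as_latin1 = quote(typed, encoding="iso-8859-1")
as_utf8 = quote(typed, encoding="utf-8")

print(as_latin1)  # /Teddys/Gro%DF/Teddy-schwarz.txt
print(as_utf8)    # /Teddys/Gro%C3%9F/Teddy-schwarz.txt
```

The second form is what a browser set to send URLs as UTF-8 would
produce, and it is a different request from the first one.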
This string then goes to the server. The Apache server decodes part
of this string, but basically passes it on to the OS, which then tries
to work out which file you want. If the OS understands UTF-8 encoded
file names you are usually all right. But obviously there is a big i18n
problem here.
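A sketch of why a UTF-8 encoded request can miss the file, assuming the
directory name on disk was stored with the single octet 0xDF
(ISO-8859-1). The names on_disk and matches are hypothetical, and this
only mimics the unescaping step, not Apache itself:

```python
from urllib.parse import unquote_to_bytes

# Assumed name of the directory as the file system stores it.
on_disk = b"Gro\xdf"


def matches(escaped_segment: str) -> bool:
    """Undo the %xy escapes and compare octet for octet."""
    return unquote_to_bytes(escaped_segment) == on_disk


print(matches("Gro%df"))     # True: same octets as on disk
print(matches("Gro%C3%9F"))  # False: UTF-8 form, two octets instead of one
```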
But... in an HTML page, regardless of the charset it is written in,
whether it is in Chinese, German or Greek, the URIs, i.e. the bits
between the href="...." quotes, are _NOT_ in the charset of that page
but are to be treated as an octet stream and sent on the wire exactly
like that. So even though one would type in the browser window's
location bar
http://www.teddy-online.de/Teddys/Gro_/Teddy-schwarz.txt
(where the '_' is the beta-shaped German 'sz' char), you would code it in
the HTML as
<a href="/images/Teddys/Gro%df/Teddy-schwarz.txt">
i.e. use a 'hex' escape instead of the ringel-ess/sz. The same applies
for JavaScript _AND_ for Java; despite the fact that all code, comments
and displayable strings in Java are in Unicode, you are to treat the URIs
strictly as octet strings if you encode them directly.
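For pages generated by a script, the escaping can be done at the moment
the link is written out. A minimal Python sketch, assuming the file name
is known as raw octets in ISO-8859-1; make_link is a made-up helper, not
part of any library:

```python
from html import escape
from urllib.parse import quote_from_bytes


def make_link(path_octets: bytes, label: str) -> str:
    """Build an <a> element whose href contains only escaped octets."""
    href = quote_from_bytes(path_octets)  # leaves '/' unescaped by default
    return '<a href="%s">%s</a>' % (escape(href), escape(label))


print(make_link(b"/images/Teddys/Gro\xdf/Teddy-schwarz.txt", "Teddy schwarz"))
# <a href="/images/Teddys/Gro%DF/Teddy-schwarz.txt">Teddy schwarz</a>
```

The library writes the hex digits in upper case; %DF and %df mean the
same octet.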
Hope this helps,
Dw.