RE: unicode URIs

Paul Deuter Tue, 15 May 2001 14:26:40 -0700
Even though I am not a Mozilla user, I have been reading the discussion of
Unicode URIs with great interest.  
I think that Naoki is exactly correct: the %HH format is already extensively
used and is context sensitive.  The character encoding is an agreement
between the sender and receiver.  The encoding is not always (indeed rarely)
UTF-8.

Rather, I believe there is a need for a new encoding format explicitly for
Unicode.  I like the %uHHHH format because it is already in use by many user
agents and already correctly decoded by some servers.  But whatever format
is chosen, I would just like to see something that says explicitly "I am a
Unicode code point".  I don't believe that the %HH format can serve as this
explicit Unicode format, because %HH is already used by lots of software
to specify other character sets (see Naoki's examples below).
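
To make the %uHHHH idea concrete, here is a minimal Python sketch. It is only an illustration: the function names are hypothetical, it assumes each %uHHHH escape carries one code point in the Basic Multilingual Plane (no surrogate pairs), and it leaves plain ASCII untouched, which a real encoder would not do for reserved characters. %uHHHH is a non-standard form (it is what JavaScript's escape() emits), so nothing here describes a specified behavior.

```python
import re

# Matches one %uHHHH escape: "%u" followed by exactly four hex digits.
_U_ESCAPE = re.compile(r"%u([0-9A-Fa-f]{4})")


def decode_percent_u(value):
    """Replace each %uHHHH escape with the code point it names.

    Assumption: every escape is a BMP code point; surrogate pairs
    are not recombined in this sketch.
    """
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), value)


def encode_percent_u(text):
    """Write every non-ASCII character as %uHHHH; pass ASCII through as-is."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("%u{:04X}".format(cp))
        else:
            raise ValueError("code point outside the BMP: U+{:X}".format(cp))
    return "".join(out)


# The two characters of "baseball" in Japanese are U+91CE and U+7403.
print(encode_percent_u("\u91ce\u7403"))
print(decode_percent_u("p=%u91CE%u7403") == "p=\u91ce\u7403")
```

The point of the sketch is that no charset argument is needed anywhere: the escape itself says which code point is meant, which is exactly what %HH cannot do.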

-Paul Deuter
Plumtree Software

-----Original Message-----
From: Naoki Hotta [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, May 15, 2001 2:22 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: unicode URIs


> Second phase:
> - ASCII % encoding would be removed from the url implementation(s), and
> pushed out to the protocols who need it. Callers expecting the encoding
> would also need to be repaired to handle the new UTF8 format.
> 

There are cases where a %-escape cannot be unescaped on the client side.
I have examples here (search results from different engines).

I searched for "baseball" in Japanese, which takes two characters.
In typical Japanese charsets, they are represented as 4 bytes (2 bytes
per character).

1)
http://search.yahoo.co.jp/bin/search?p=%CC%EE%B5%E5

2)
http://search.netscape.com/ja/search.tmpl?charset=x-sjis&cp=nsiwidsrc&;
cat=World/Japanese&search=%96%EC%8B%85

3)
http://www.google.com/search?q=%96%EC%8B%85&btnG=Google+%8C%9F%8D%F5&hl=ja&lr=

* In the first example, the charset is "EUC-JP", but you cannot tell
the charset just by looking at the URI.
* In the second one, it is "x-sjis" (an alias of "Shift_JIS"). The charset
is named in the query part, but that is supposed to be parsed by the server.
* In the third case, it's "Shift_JIS" (the same charset as the second case),
but again the client has no way to know. There is also an additional
escaped string, "%8C%9F%8D%F5", and I have no idea what it is (it
could be binary data instead of text).
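
The examples above can be checked with a short Python sketch. It uses the standard library's urllib.parse.unquote_to_bytes to recover the raw bytes and then decodes them with a charset supplied by the caller; decode_escaped is a hypothetical helper name, and the charset for each URI comes from the descriptions above, not from anything in the URI itself.

```python
from urllib.parse import unquote_to_bytes


def decode_escaped(value, charset, errors="strict"):
    """Turn %HH escapes back into bytes, then decode with the given charset."""
    raw = unquote_to_bytes(value)
    return raw.decode(charset, errors)


# Query values taken from examples 1 and 2/3 above.
from_yahoo = decode_escaped("%CC%EE%B5%E5", "euc_jp")
from_google = decode_escaped("%96%EC%8B%85", "shift_jis")

# Different escaped bytes, same two characters once the right charset is known.
print(from_yahoo == from_google)

# Guessing the wrong charset does not recover the text.
wrong_guess = decode_escaped("%96%EC%8B%85", "euc_jp", errors="replace")
print(wrong_guess == from_google)
```

The decoder only works because the charset is passed in from outside. A client holding nothing but the URI has no such argument to pass.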

So the client cannot always unescape a URI when the URI has already been
escaped by the server or placed in a document in escaped form (e.g. in "HREF=").
So I think we need exception cases to allow the %-escaped representation in
necko.
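
One possible shape for such an exception, sketched in Python under my own assumptions (this is not necko's behavior, and unescape_if_utf8 is a hypothetical name): unescape only when the escaped bytes are valid UTF-8, and otherwise keep the %-escaped string exactly as received.

```python
from urllib.parse import unquote_to_bytes


def unescape_if_utf8(value):
    """Return the unescaped text if the bytes are valid UTF-8.

    Otherwise return the input unchanged, so the original %HH bytes
    still reach the server exactly as they were written.
    """
    raw = unquote_to_bytes(value)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return value


# UTF-8 escapes are unescaped; Shift_JIS and EUC-JP escapes are left alone.
print(unescape_if_utf8("%E9%87%8E%E7%90%83") == "\u91ce\u7403")
print(unescape_if_utf8("%96%EC%8B%85"))
print(unescape_if_utf8("%CC%EE%B5%E5"))
```

This is a heuristic only. Bytes in a legacy charset can happen to form valid UTF-8, so a real implementation would still need a way to keep a URI escaped on request.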

Naoki