unicode URIs

Judson Valeski Thu, 10 May 2001 10:47:18 -0700
I've been hearing rumblings from various folks about URI encodings. 
Within the last week people have suggested everything from making *all* 
URIs totally flat ASCII (char*, %escaping, *no* char encoding) to making 
them unicode. I've put together a 1 hour meeting tomorrow (Friday, May 11th) 
at 1pm Pacific time to discuss the issues and the model we want to support.

If you're at Netscape, go to the Quincy conf. room. Otherwise, dial in 
using the following (and yes, I've officially changed my name to "Mr. 
Judson Valeski" :-)):

USA Toll Free Number: 888-282-0360
PASSCODE: 47954
LEADER:    Mr. Judson Valeski

Currently we represent URIs the old/traditional way, which is to % escape a 
set of characters defined in the URL spec. That doesn't cover unicode or UTF8 
encoding. This issue is being raised because we have existing bugs that 
are forcing it to the foreground.
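
To make the gap concrete, here is a minimal sketch of one way to close it: encode the non-ASCII characters as UTF8 bytes first, then % escape those bytes. It uses Python's standard urllib.parse rather than anything in necko; the function name escape_utf8 and the set of delimiters left unescaped are assumptions for illustration, not what the URL spec or the current code mandates.

```python
from urllib.parse import quote, unquote

# Assumption: these URL delimiters are left alone. The right set really
# depends on which URL component is being escaped.
DELIMITERS = "/:?#[]@!$&'()*+,;="


def escape_utf8(text: str) -> str:
    """Percent-escape every character outside ASCII letters, digits,
    "_.-~" and DELIMITERS, using the UTF-8 bytes of each character."""
    return quote(text, safe=DELIMITERS, encoding="utf-8")


original = "http://example.com/caf\u00e9"
escaped = escape_utf8(original)
print(escaped)                             # flat ASCII form
print(unquote(escaped, encoding="utf-8"))  # back to the unicode form
```

The escaped string is plain ASCII, so it still fits in a flat char*, and unescaping it as UTF8 recovers the original characters. Note that '%' itself is not in the safe set here, so an already-escaped string would be escaped a second time.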

I see four layers here.
1. UI layer. It's possible for me to type unicode into a URL bar, and it's
possible that I'm viewing unicode content in the browser window that has
unicode links in there that, when I hover over them, I want to have them
display as unicode (not encoded or escaped). 
2. Loading layer. This is the uriloader/top-level-necko/docshell layer that
takes strings from the UI level, and hands them off to protocol handlers.
3. Protocol handling layer. Some protocols want to play w/ Unicode (UTF8 most
likely) and some don't (HTTP for example).
4. DNS layer. IDNS is a proposed standard that allows for UTF8 (right frank?)
hostnames.

5(?). the IP transport layer. I'm probably erroneously ignoring this level.

If I'm reading everyone's needs correctly here, we need to hash out what each
layer needs to do to support, at least, UTF8 (a unicode encoding) URLs. From
10k feet, it seems that we can tinker w/ interfaces, and just say it's up to a
protocol impl to determine whether or not they can handle the non-ASCII data.
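
A rough sketch of that "leave it to the protocol impl" idea, in Python for brevity. Everything named here is hypothetical: the SCHEME_ACCEPTS_UTF8 table, its entries, and prepare_for_handler are made up for illustration and do not describe any existing necko interface. The loading layer looks up the scheme, passes the string through untouched if the handler says it can take UTF8, and otherwise falls back to % escaped UTF8 bytes.

```python
from urllib.parse import quote, urlsplit

# Hypothetical capability table. The entries are illustrative assumptions,
# not a statement about what any real protocol handler supports.
SCHEME_ACCEPTS_UTF8 = {"http": False, "ldap": True}


def prepare_for_handler(uri: str) -> str:
    """Return the string the loading layer would hand to the handler."""
    scheme = urlsplit(uri).scheme
    if SCHEME_ACCEPTS_UTF8.get(scheme, False):
        # The handler has said it copes with non-ASCII data itself.
        return uri
    # Unknown or ASCII-only handler: escape the UTF-8 bytes. '%' is kept
    # safe so that escapes already present are not escaped twice.
    return quote(uri, safe="/:?#[]@!$&'()*+,;=%", encoding="utf-8")


print(prepare_for_handler("http://example.com/caf\u00e9"))
print(prepare_for_handler("ldap://example.com/cn=Jos\u00e9"))
```

Defaulting unknown schemes to the escaped form keeps existing ASCII-only handlers working unchanged, which matches the goal of not disrupting the current code-base.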

I'd prefer not to spend a lot of time in the meeting talking about fictitious worlds 
where flat char*'s don't exist and life is represented in unicode. Our master here is 
reality (*not* RFCs and specs), and we don't want to spend cycles over-planning and 
disrupting the current code-base to handle some edge case.

Jud

reference:
- nsIURI definition:
  http://lxr.mozilla.org/seamonkey/source/netwerk/base/public/nsIURI.idl
- current necko utility URI creation function (notice the UTF8 encoding call):
  http://lxr.mozilla.org/seamonkey/source/netwerk/base/public/nsNetUtil.h#81
- bug on that URI creation function for doing the UTF8 encoding:
  http://bugzilla.mozilla.org/show_bug.cgi?id=66515
- new URI scheme proposal: http://www.ietf.org/rfc/rfc2718.txt
- URIs: ftp://ftp.isi.edu/in-notes/rfc2396.txt
- nice'n'nasty non-ASCII chars in the spec:
  http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1
- LDAP UTF8 needs: ftp://ftp.isi.edu/in-notes/rfc2253.txt ([EMAIL PROTECTED] has a bug
  against him to support this)
- LDAP URL format: ftp://ftp.isi.edu/in-notes/rfc2255.txt
- IMAP URLs: ftp://ftp.isi.edu/in-notes/rfc2192.txt