Re: unicode URIs

Andreas Otte Thu, 17 May 2001 12:12:24 -0700
Hi!

Judson Valeski wrote:
> 
> bringing this back onto the newsgroup and cc'ing rick.
> 
> Andreas Otte wrote:
> 
> >Hi!
> >
> >Judson Valeski wrote:
> >
> >>We decided on the following proposal. dougt, rpotts, chak, dmose, gagan, valeski,
> >>and nhotta attended the meeting.
> >>
> >>URI's would accept, and store, only UTF8 encoded strings. Protocols not able to
> >>handle UTF8 (HTTP for example), would access the charset attribute (proposed) off of
> >>nsIURI to convert back to the original string. The charset would be set by the URI
> >>creator as they have the best charset context. Is nsIURI the right
> >>place for the charset attribute?
> >>
> >
> >I think it is. Also do away with the char representation of the uri
> >components. Use strings instead.
> >
> >
> >>The current ASCII % encoding would be removed from the internal URI
> >>representation. Again, this encoding would be pushed out to the protocol level.
> >>
> >
> >So we will have two levels of %-encoding? I don't think the
> >%-encoding can be removed completely. The first level applies to all URIs
> >and masks reserved chars as the current stuff does. On a second level
> >non-ascii chars can be encoded as the protocol needs it.
> >
> As I understood it, there would be no % encoding at all unless a
> protocol (say HTTP) needed it, in which case it would pull the UTF8
> data out of the uri object and % encode it on its own.
> 
> Jud

Sometimes you have to %-encode the url before giving it to the
urlparser, to avoid parser confusion over reserved chars. You then
have to decide how you store the parsed components: as you got
them (which could mean partially escaped) or unescaped. Currently we
store them as we get them (as 8bit chars) and return them unescaped as
url components and escaped as a whole or as any combined portion of url
components. I think this part still has to stay; it is necessary even
when using UTF-8, for urlparser reasons. What can go is the conversion of
chars > 127 (non-ascii) for some protocols like ldap, but not for http for
example.
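
To make the parser-confusion point concrete, here is a rough sketch. It
uses Python's urllib.parse as a stand-in, not our urlparser, and the host
and file name are made up for illustration. It only shows why a reserved
char has to be escaped before parsing, and the "store escaped, hand out
unescaped components" idea described above.

```python
from urllib.parse import quote, unquote, urlsplit

# Hypothetical file name that contains the reserved chars '?' and '#'.
name = "what?#1.txt"

# Unescaped: the parser treats '?' and '#' as delimiters, so the
# file name is torn apart into path, query and fragment.
naive = urlsplit("http://example.org/dir/" + name)
print(naive.path, repr(naive.query), naive.fragment)

# First level of %-encoding: mask the reserved chars before parsing.
escaped = quote(name, safe="")
parsed = urlsplit("http://example.org/dir/" + escaped)

# The component is stored as it was received (escaped) ...
print(parsed.path)
# ... handed out unescaped when asked for as a single component ...
print(unquote(parsed.path))
# ... and left escaped when the url is reassembled as a whole.
print(parsed.geturl())
```

Whether non-ascii chars get a second level of escaping is then a separate,
per-protocol decision, which is the part that could move out of the uri
object.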

Andreas