Based on the responses, I guess my original question/problem was not
very well written.

UTF-7 won't work because it cannot be distinguished from ASCII without 
something that identifies it as UTF-7.
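
To illustrate the point, here is a minimal Python sketch (the sample string
is only an example, not something from the thread). It shows that
UTF-7 output is made up entirely of ASCII bytes, so a receiver that assumes
ASCII accepts it without complaint and simply sees different text:

```python
# Illustrative sample text; U+263A is WHITE SMILING FACE.
original = "Hi Mom \u263a"

# UTF-7 output consists solely of bytes below 0x80.
wire = original.encode("utf-7")

# A receiver that assumes ASCII decodes the bytes without any error...
as_ascii = wire.decode("ascii")

# ...while a receiver that was told "this is UTF-7" recovers the original.
as_utf7 = wire.decode("utf-7")

print(ascii(as_ascii))
print(ascii(as_utf7))
```

Nothing in the byte stream itself says which of the two readings was intended.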

The %XX idea does not work because it is already in use by lots of
software to encode many different character sets.  So again we need
something that identifies it as UTF-8.
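
A small Python sketch of the ambiguity, using a made-up escaped string
(the value "caf%C3%A9" is only an assumption for illustration). The %XX
escapes yield raw bytes, and both decodings below succeed:

```python
from urllib.parse import unquote_to_bytes

# %XX escapes only describe bytes, not the character set behind them.
raw = unquote_to_bytes("caf%C3%A9")

as_utf8 = raw.decode("utf-8")      # two bytes read as one character
as_latin1 = raw.decode("latin-1")  # the same two bytes read as two characters

print(ascii(as_utf8))
print(ascii(as_latin1))
```

Both results are valid text, so the receiver cannot tell from the escapes
alone which one the sender meant.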

What is needed is an escape code that implicitly indicates the Unicode 
character set.

I see this as somewhat analogous to the invention of the U+XXXX notation 
in Unicode Consortium writings.  They needed a completely unambiguous way 
to tell their readers that the 16-bit value was not just "any" 16-bit value 
but rather a specific Unicode code point.  They invented a new kind of escape
sequence that said two things: what follows is hex *and* Unicode.

I see the BOM as filling the same need for text files.  It was not enough
to invent Unicode; there also had to be a way to identify the encoding.
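
As a rough sketch of how a BOM identifies the encoding, the Python below
checks the leading bytes of a buffer. The function name sniff_bom is
hypothetical, and only the UTF-8 and UTF-16 BOMs are handled here:

```python
import codecs

# Checked in order; only UTF-8 and UTF-16 are covered in this sketch.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, payload without BOM), or (None, data) if no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data

encoding, payload = sniff_bom(codecs.BOM_UTF8 + "Unicode".encode("utf-8"))
print(encoding, payload.decode(encoding))
```

A URL has no equivalent leading marker, which is the gap described above.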

Paul Deuter
Internationalization Manager
Plumtree Software
[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]> 
 


-----Original Message-----
From: Markus Scherer [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 26, 2001 11:29 AM
To: unicode
Subject: Re: Unicode in a URL


Paul Deuter wrote:
> I am wondering if there isn't a need for the Unicode Spec to also
> dictate a way of encoding Unicode in an ASCII stream.  Perhaps

How many more ways do we need?

To be 8-bit-friendly, we have UTF-8.
To get everything into ASCII characters, we have UTF-7.
W3C specifies to use %-encoded UTF-8 for URLs.
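
A minimal Python sketch of %-encoded UTF-8 for a URL path, using the
standard urllib.parse module (the path itself is a made-up example):

```python
from urllib.parse import quote, unquote

# Hypothetical path containing non-ASCII characters.
path = "/docs/\u65e5\u672c\u8a9e.html"

# quote() encodes to UTF-8 by default, then escapes each non-ASCII byte as %XX.
escaped = quote(path)

# unquote() reverses this, again assuming UTF-8 by default.
restored = unquote(escaped)

print(escaped)
```

The result contains only ASCII characters, but the round trip works only
because both sides agree that the escaped bytes are UTF-8.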

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]

> itself. The best way to handle it (from a reliability point of view) is to
> use UTF-8 for everything and to reinterpret the URL using code. The idea

This sounds good, too. Have your pages in UTF-8 and have all servers
interpret URLs as UTF-8.
Especially if browsers encode URLs differently, this is your best choice.


Of course, if this all does not work, the obvious choice for Unicode-broken
systems is to use only ASCII characters to begin with...

markus
