At 15:02 01/04/26 -0700, Paul Deuter wrote:
>Based on the responses, I guess my original question/problem was not
>very well written.

>The %XX idea does not work because it is already in use by lots of
>software to encode many different character sets. So again we need
>something that identifies it as UTF-8.

First, %XX is already used with lots of different encodings; adding
one more (UTF-8) won't make the situation much worse.
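
For illustration, a minimal Python sketch of the scheme (the sample
string is only an example): each byte of the UTF-8 form becomes one
%XX escape.

    from urllib.parse import quote, unquote

    # Encode a non-ASCII string as UTF-8, then percent-encode each byte.
    text = "Grüße"
    encoded = quote(text, encoding='utf-8')
    print(encoded)                             # Gr%C3%BC%C3%9Fe

    # The receiver reverses both steps: %XX -> bytes -> UTF-8 decode.
    print(unquote(encoded, encoding='utf-8'))  # Grüße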

Second, it turns out that UTF-8 is extremely easy to detect/check,
the easiest of all encodings. For details, see
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf
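
As a rough sketch of why checking is easy (Python, using a strict
decode as the test; this is not the algorithm from the paper, just
one way to apply the idea):

    def looks_like_utf8(data: bytes) -> bool:
        # UTF-8 has a rigid structure: each lead byte determines how
        # many 10xxxxxx continuation bytes must follow, so data in
        # other encodings almost never decodes cleanly.
        try:
            data.decode('utf-8', errors='strict')
            return True
        except UnicodeDecodeError:
            return False

    print(looks_like_utf8('Grüße'.encode('utf-8')))    # True
    print(looks_like_utf8('Grüße'.encode('latin-1')))  # False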

Apart from that, the HTTP protocol specifies exactly what you can
send, so you can't just invent something new (such as %u....), even
though it might work 'sometimes'.


>I see this as somewhat analogous to the invention of the U+XXXX notation
>in Unicode consortium writings. They needed a completely unambiguous way
>to tell their readers that the 16-bit value was not "any" 16-bit value
>but rather a specific Unicode codepoint. They invented a new kind of
>escape sequence that said two things: what follows is hex *and* Unicode.
>
>I see the BOM as filling the same need for text files. It was not enough
>to invent Unicode; we also needed a way to identify the encoding.

The BOM for UTF-8 is doing a lot of damage. All the tools that
would work perfectly well without the BOM stop working.
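
For example, anything that matches on the first bytes of a file (a
'#!' line, a magic number, a plain byte comparison) breaks once the
three BOM bytes EF BB BF are silently prepended. A minimal Python
sketch of the usual workaround (the function name is made up):

    import codecs

    def strip_utf8_bom(data: bytes) -> bytes:
        # codecs.BOM_UTF8 is the three bytes EF BB BF.
        if data.startswith(codecs.BOM_UTF8):
            return data[len(codecs.BOM_UTF8):]
        return data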


Regards,    Martin.
