As the person who implemented UTF-8 checking for http://validator.w3.org,
I beg to disagree. In order to validate correctly, the validator has
to make sure it correctly interprets the incoming byte sequence as
a sequence of characters. For this, it has to know the character
encoding. As an example, there are many files in iso-2022-jp or
shift_jis that are perfectly valid as such, but will get rejected
by some tools because they contain bytes that correspond to '<' in
ASCII as part of a double-byte character.
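
To illustrate (a rough sketch in Python, not the validator's actual
code; the sample bytes are made up for the example): the byte 0x3C,
which is '<' in ASCII, can appear as the trailing byte of a
double-byte ISO-2022-JP character, so the stream has to be decoded
with the declared charset before the markup is inspected.

    # ESC $ B switches to JIS X 0208; 0x30 0x3C is one double-byte
    # character; ESC ( B switches back to ASCII.
    raw = b"\x1b$B0<\x1b(B and a real <p> tag"

    print(b"<" in raw)                # True: a naive byte scan sees a stray '<'
    text = raw.decode("iso-2022-jp")  # decode with the declared encoding first
    print(text.count("<"))            # 1: only the real markup '<' remains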

So the UTF-8 check is just to make sure we validate something
reasonable, and to avoid GIGO (garbage in, garbage out).
Of course, this cannot be avoided completely; the validator
has no way to check whether something that is sent in as
iso-8859-1 was actually meant to be iso-8859-2 (humans can check
by looking at the source).
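
As a sketch of what that check amounts to (again illustrative
Python, not the validator's code), it is enough to confirm that the
bytes decode as UTF-8 at all:

    def is_well_formed_utf8(data: bytes) -> bool:
        # Accept only byte sequences that decode as valid UTF-8.
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False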

Regards,  Martin.

At 12:26 01/12/14 -0800, James Kass wrote:
>There is so much text on the web using many different
>encoding methods.  Big-5, Shift-JIS, and similar encodings
>are fairly well standardised and supported.  Now, in addition
>to UTF-8, a web page might be in UTF-16 or perhaps even
>UTF-32, eventually.  Plus, there's a plethora of non-standard
>encodings in common use today.  An HTML validator should
>validate the mark-up, assuring an author that (s)he hasn't
>done anything incredibly dumb like having two </title>
>tags appearing consecutively.  Really, this is all that we should
>expect from an HTML validator.  Extra features such as
>checking for invalid UTF-8 sequences would probably be most
>welcome, but there are other tools for doing this which an
>author should already be using.
>
>Best regards,
>
>James Kass.
>

