As the person who implemented UTF-8 checking for http://validator.w3.org, I beg to disagree. To validate correctly, the validator has to make sure it interprets the incoming byte sequence as the right sequence of characters, and for that it has to know the character encoding. As an example, there are many files in iso-2022-jp or shift_jis that are perfectly valid as such, but that some tools reject because they contain bytes that correspond to '<' in ASCII as part of a double-byte character.
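To make this concrete, here is a small Python sketch (my own illustration, not code from the validator). In iso-2022-jp, the kanji 車 has the JIS code 0x3C 0x56, and 0x3C is exactly the byte for '<' in ASCII:

    # Illustrative sketch: a byte-level scan for '<' misfires on a
    # perfectly valid iso-2022-jp file.
    text = "この車"                  # "this car"; 車 is JIS 0x3C56
    data = text.encode("iso-2022-jp")
    print(data)                      # b'\x1b$B$3$N<V\x1b(B'

    # A tool that scans raw bytes for tag delimiters "finds" a '<' (0x3C)
    # that is really the first byte of the double-byte character 車:
    print(b"<" in data)              # True, although the text has no markup

    # Decoding with the declared encoding first removes the false positive:
    print("<" in data.decode("iso-2022-jp"))   # False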
So the UTF-8 check is just to make sure we validate something reasonable, and to avoid GIGO (garbage in, garbage out). Of course, this cannot be avoided completely; the validator has no way to check whether something that is sent in as iso-8859-1 would actually be iso-8859-2 (humans can check by looking at the source). A short sketch of both points follows the quoted message below.

Regards,    Martin.

At 12:26 01/12/14 -0800, James Kass wrote:
>There is so much text on the web using many different
>encoding methods. Big-5, Shift-JIS, and similar encodings
>are fairly well standardised and supported. Now, in addition
>to UTF-8, a web page might be in UTF-16 or perhaps even
>UTF-32, eventually. Plus, there's a plethora of non-standard
>encodings in common use today. An HTML validator should
>validate the mark-up, assuring an author that (s)he hasn't
>done anything incredibly dumb like having two </title>
>tags appearing consecutively. Really, this is all that we should
>expect from an HTML validator. Extra features such as
>checking for invalid UTF-8 sequences would probably be most
>welcome, but there are other tools for doing this which an
>author should already be using.
>
>Best regards,
>
>James Kass.
>
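P.S. Here is the promised sketch (again my own illustration in Python, not the validator's actual code). The first part is a minimal well-formedness check that rejects byte sequences that cannot be interpreted as UTF-8; the second shows the GIGO limit: both iso-8859-1 and iso-8859-2 assign a character to every byte value, so the same bytes decode cleanly either way, and only a human reading the result can tell which was meant.

    def is_well_formed_utf8(data: bytes) -> bool:
        """True if data is a valid UTF-8 byte sequence."""
        try:
            data.decode("utf-8", errors="strict")
            return True
        except UnicodeDecodeError:
            return False

    print(is_well_formed_utf8("café".encode("utf-8")))   # True
    print(is_well_formed_utf8(b"caf\xe9"))               # False: a bare Latin-1 byte

    # The GIGO limit: every byte value decodes in both Latin alphabets,
    # so the check cannot distinguish them.
    blob = b"caf\xe9 \xb1"
    print(blob.decode("iso-8859-1"))   # 'café ±'
    print(blob.decode("iso-8859-2"))   # 'café ą'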