HTML Validation (was Re: Clean and Unicode compliance)

2001-12-16 Thread James Kass
The HTML validation service from W3C at http://validator.w3.org has been commended on this list and appears to be sophisticated and fast. Tests run on non-BMP text show no problem for Plane One using UTF-8 encoding, but error messages are generated when these characters are referenced as NCRs.
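For readers unfamiliar with the two forms being compared, here is a minimal Python sketch (not from the thread; the Plane One character U+10330 GOTHIC LETTER AHSA is chosen arbitrarily) showing the same character written as raw UTF-8 bytes and as a hexadecimal NCR, both of which a conforming validator should accept:

    # Minimal sketch (not from the thread): one Plane One character,
    # U+10330 GOTHIC LETTER AHSA, written the two ways being compared.
    ahsa = "\U00010330"

    # As raw UTF-8 bytes, the form the validator accepted:
    utf8_bytes = ahsa.encode("utf-8")
    print(" ".join(f"{b:02x}" for b in utf8_bytes))   # f0 90 8c b0

    # As a hexadecimal numeric character reference, the form that drew errors:
    ncr = "&#x{:X};".format(ord(ahsa))
    print(ncr)                                        # &#x10330;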

Re: HTML Validation (was Re: Clean and Unicode compliance)

2001-12-16 Thread Elliotte Rusty Harold
At 3:07 AM -0800 12/16/01, James Kass wrote: Tests run on non-BMP text show no problem for Plane One using UTF-8 encoding, but error messages are generated when these characters are referenced as NCRs. I suspect there are a lot of random mistakes like this waiting to be discovered. I recently

Re: HTML Validation (was Re: Clean and Unicode compliance)

2001-12-16 Thread James Kass
Elliotte Rusty Harold wrote, I suspect a lot of our tools haven't been thoroughly tested with Plane-1 and are likely to have these sorts of bugs in them. Since Plane One is still fairly new, this is understandable. I'm also having trouble getting Plane Zero pages to validate. Spent

Re: HTML Validation (was Re: Clean and Unicode compliance)

2001-12-16 Thread Martin Duerst
Hello James (and everybody else), Can you please send comments and bug reports on the validator to [EMAIL PROTECTED]? Sending bug reports to the right address seriously increases the chance that they get fixed. Regards, Martin. At 14:46 01/12/16 -0800, James Kass wrote: Elliotte Rusty Harold

Re: Clean and Unicode compliance

2001-12-16 Thread Martin Duerst
At 07:16 01/12/14 -0800, James Kass wrote: Having an HTML validator, like Tidy.exe, which generates errors or warnings every time it encounters a UTF-8 sequence is unnerving. It's especially irritating when the validator automatically converts each string making up a single UTF-8 character into two
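To see concretely what James describes, here is an illustrative Python sketch (an assumption about the failure mode, not Tidy's actual code path): a tool that misreads UTF-8 bytes as a single-byte encoding turns one two-byte character into two characters.

    # Illustrative sketch, not Tidy's actual behaviour: misreading UTF-8
    # bytes as a single-byte encoding splits one character into two.
    e_acute = "\u00e9"                  # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    utf8 = e_acute.encode("utf-8")      # b'\xc3\xa9' - two bytes
    misread = utf8.decode("latin-1")    # 'Ã©' - now two characters
    print(len(e_acute), len(misread))   # 1 2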

Re: Clean and Unicode compliance

2001-12-16 Thread Martin Duerst
As the person who implemented UTF-8 checking for http://validator.w3.org, I beg to disagree. In order to validate correctly, the validator has to make sure it correctly interprets the incoming byte sequence as a sequence of characters. For this, it has to know the character encoding. As an
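As a rough illustration of Martin's point (a sketch only, not the validator's actual implementation; the function name is invented), strict decoding under the declared encoding is the step that both yields characters and catches ill-formed byte sequences:

    # Rough sketch, not the W3C validator's actual code: interpreting the
    # incoming bytes under the declared encoding is what turns them into
    # characters, and it is also where ill-formed sequences are caught.
    def decode_document(raw: bytes, declared_encoding: str = "utf-8") -> str:
        # errors="strict" makes ill-formed sequences fatal rather than
        # silently replaced.
        return raw.decode(declared_encoding, errors="strict")

    decode_document(b"\xf0\x90\x8c\xb0")   # well-formed UTF-8 for U+10330
    # decode_document(b"\xf0\x90\x8c")     # truncated sequence: raises UnicodeDecodeError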

Re: Clean and Unicode compliance

2001-12-16 Thread James Kass
Martin Duerst wrote, This is really bad. Have you made sure you have the right options? Tidy has a lot of options. It sure does, one of which is -utf8. Using this option (tidy -utf8 -f output.txt -m input.htm) works like a charm, directing the errors and warnings for an HTML file called

Re: Clean and Unicode compliance

2001-12-16 Thread James Kass
Martin Duerst wrote, As the person who implemented UTF-8 checking for http://validator.w3.org, I beg to disagree. In order to validate correctly, the validator has to make sure it correctly interprets the incoming byte sequence as a sequence of characters. For this, it has to know the

Clean and Unicode compliance

2001-12-14 Thread W4z5m4
Hello, Does the Clean development team plan to make Concurrent Clean partially or fully Unicode compliant in their future releases? This is crucial for those of us who use non-European writing systems, and more generally for those who develop truly global applications. Thanks in

Re: Clean and Unicode compliance

2001-12-14 Thread James Kass
Welé Negga wrote, Does the Clean development team plan to make Concurrent Clean partially or fully Unicode compliant in their future releases? This is crucial for those of us who use non-European writing systems, and more generally for those who develop truly global applications. It is

Re: Clean and Unicode compliance

2001-12-14 Thread Asmus Freytag
W3C's HTML validation service seems to have no such problems. We've been using it to validate all the files on the Unicode site regularly. A validator *should* look between the tags in order to catch invalid entity references, esp. invalid NCRs. For UTF-8, it would ideally also check that no ill-formed, and therefore illegal, sequences are part of the UTF-8.
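The NCR check Asmus calls for might look roughly like the following Python sketch (the rules shown, rejecting surrogate code points and values above U+10FFFF, are an assumption about what "invalid" means here, and the pattern and function name are invented for illustration):

    import re

    # Sketch only, with assumed rules: an NCR is treated as invalid if it
    # refers to a surrogate code point or to a value beyond U+10FFFF.
    NCR_PATTERN = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

    def invalid_ncrs(markup):
        bad = []
        for match in NCR_PATTERN.finditer(markup):
            hex_digits, dec_digits = match.groups()
            cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                bad.append(match.group(0))
        return bad

    print(invalid_ncrs("&#x10330; &#xD800; &#1114112;"))   # ['&#xD800;', '&#1114112;']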

Re: Clean and Unicode compliance

2001-12-14 Thread James Kass
Asmus Freytag wrote, A validator *should* look between the tags in order to catch invalid entity references, esp. invalid NCRs. For UTF-8, it would ideally also check that no ill-formed, and therefore illegal, sequences are part of the UTF-8. You've made a good point about invalid NCRs

Re: Clean and Unicode compliance

2001-12-14 Thread Asmus Freytag
James, NCRs *are* markup. And validating that the encoding matches the declaration (e.g. that UTF-8 is not ill-formed) has nothing whatsoever to do with content, but everything to do with verifying that the file conforms to the HTML specification. All this is completely different from spelling and grammar

Re: Clean and Unicode compliance

2001-12-14 Thread James Kass
Asmus Freytag wrote, NCRs *are* markup. Whether they are called markup or macros, they are certainly part of HTML, and I was not disagreeing with you that they should be checked by the validator. And validating that the encoding matches the declaration (e.g. that UTF-8 is not ill-formed)