Re: Clean and Unicode compliance

2001-12-14 Thread James Kass


Welé Negga wrote,

> Does the Clean development team plan to make Concurrent
> Clean partially or fully Unicode compliant in their future
> releases, as this is crucial for those of us who use non-European
> writing systems, and more generally for those who develop
> truly global applications.

It is crucial for everyone.

Having an HTML validator, like Tidy.exe, which generates errors
or warnings every time it encounters a UTF-8 sequence is
unnerving.  It's especially irritating when the validator
automatically converts each byte sequence making up a single UTF-8
character into two or three HTML named entities.
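What appears to be happening is that the tool decodes the UTF-8 bytes as Latin-1 and then escapes each byte as its own named entity. A minimal Python sketch of that failure mode (that this is Tidy's internal mechanism is my surmise, not confirmed; the sketch only illustrates the byte-level arithmetic):

```python
import html.entities

# A single character, 'é' (U+00E9), is two bytes in UTF-8.
utf8_bytes = "é".encode("utf-8")        # b'\xc3\xa9'

# A tool that wrongly assumes Latin-1 sees two separate characters...
misread = utf8_bytes.decode("latin-1")  # 'Ã©'

# ...and escapes each one as its own named entity.
entities = "".join(
    "&%s;" % html.entities.codepoint2name[ord(ch)] for ch in misread
)
print(entities)  # &Atilde;&copy;
```

So one character in the source comes back as two entities, and a three-byte UTF-8 character would come back as three.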

If UTF-8 is needed in the HTML, and the HTML needs to be valid,
the user must make a back-up copy of the original HTML, run
the validator on the back-up, and then manually make corrections
to the source.  This is quite cumbersome and should really be
unnecessary.

HTML validators should only validate the HTML, that is the text
between the HTML brackets "<" and ">", and not affect the actual
text of the file.

Best regards,

James Kass.







Re: Clean and Unicode compliance

2001-12-14 Thread Asmus Freytag

W3C's HTML validation service seems to have no such problems.
We've been using it to validate all the files on the unicode
site regularly.

A validator *should* look between the > and < in order to
catch invalid entity references, especially invalid NCRs.

For UTF-8, it would ideally also check that no ill-formed,
and therefore illegal, sequences are part of the UTF-8.
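Such a check is straightforward wherever a strict UTF-8 decoder is available. A minimal sketch in Python (a strict decoder rejects overlong sequences and encoded surrogates, both of which are ill-formed UTF-8):

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

print(is_well_formed_utf8("Welé".encode("utf-8")))  # True
print(is_well_formed_utf8(b"\xc0\xaf"))             # False: overlong '/'
print(is_well_formed_utf8(b"\xed\xa0\x80"))         # False: lone surrogate
```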

A./

At 07:16 AM 12/14/01 -0800, James Kass wrote:

>Welé Negga wrote,
>
> > Does the Clean development team plan to make Concurrent
> > Clean partially or fully Unicode compliant in their future
> > releases, as this is crucial for those of us who use non-European
> > writing systems, and more generally for those who develop
> > truly global applications.
>
>It is crucial for everyone.
>
>Having an HTML validator, like Tidy.exe, which generates errors
>or warnings every time it encounters a UTF-8 sequence is
>unnerving.  It's especially irritating when the validator
>automatically converts each byte sequence making up a single UTF-8
>character into two or three HTML named entities.
>
>If UTF-8 is needed in the HTML, and the HTML needs to be valid,
>the user must make a back-up copy of the original HTML, run
>the validator on the back-up, and then manually make corrections
>to the source.  This is quite cumbersome and should really be
>unnecessary.
>
>HTML validators should only validate the HTML, that is the text
>between the HTML brackets "<" and ">", and not affect the actual
>text of the file.
>
>Best regards,
>
>James Kass.
>
>





Re: Clean and Unicode compliance

2001-12-14 Thread James Kass


Asmus Freytag wrote,

> A validator *should* look between the > and < in order to
> catch invalid entity references, especially invalid NCRs.
> 
> For UTF-8, it would ideally also check that no ill-formed,
> and therefore illegal, sequences are part of the UTF-8.

You've made a good point about invalid NCRs or named entities.

But, I think it's up to the author to proofread the actual text
in an appropriate application.

Is the HTML validator going to also be expected to check for
grammar, spelling, and use of punctuation?

There is so much text on the web using many different
encoding methods.  Big-5, Shift-JIS, and similar encodings
are fairly well standardised and supported.  Now, in addition
to UTF-8, a web page might be in UTF-16 or perhaps even 
UTF-32, eventually.  Plus, there's a plethora of non-standard 
encodings in common use today.  An HTML validator should
validate the mark-up, assuring an author that (s)he hasn't
done anything incredibly dumb like having two 
tags appearing consecutively.  Really, this is all that we should
expect from an HTML validator.  Extra features such as 
checking for invalid UTF-8 sequences would probably be most 
welcome, but there are other tools for doing this which an 
author should already be using.

Best regards,

James Kass.






Re: Clean and Unicode compliance

2001-12-14 Thread Asmus Freytag

James,

NCRs *are* markup. And validating that the encoding matches
the declaration (e.g. UTF-8 is not ill-formed) has nothing
whatsoever to do with content, but all with verifying that
the file conforms to the HTML specification.

All this is completely different from spelling and grammar
checking.

The thread started when someone complained that a validator
was unable to understand UTF-8 encoded files. Once you go
from HTML to XHTML or XML, those 'validators' are
themselves invalid, as XML requires all parsers to support
UTF-8.

Again, the HTML validation service from W3C is able to deal
with UTF-8 and even will warn about the 'UTF-8BOM' issue.
You should reasonably be able to expect that other tools
that call themselves validators match the functionality
of that service - or get out of that business.

A./






Re: Clean and Unicode compliance

2001-12-14 Thread James Kass


Asmus Freytag wrote,

>
> NCRs *are* markup. 

Whether they are called "mark-up" or "macros", they are 
certainly part of HTML and I was not disagreeing with you 
that they should be checked by the validator.

> And validating that the encoding matches
> the declaration (e.g. UTF-8 is not ill-formed) has nothing
> whatsoever to do with content, but all with verifying that
> the file conforms to the HTML specification.
>
> All this is completely different from spelling and grammar
> checking.
>

My point here is that the human being who created the text
needs to look at the text personally to assure that it is free
of errors and displays as expected.  During that essential
proofreading, any encoding errors should be obvious.

HTML consists of plain text and mark-up tags.  The tags enable 
fancy typography like bold or italics to be included in a file
generated from a plain text editor (like the DOS editor or 
SCUnipad) and then displayed in an HTML browser.  

UTF-8 is now considered to be plain text, and plain text isn't 
mark-up.

However, if an HTML validator finds malformed UTF-8 material,
it's good to be advised and I never said it wasn't.

Best regards,

James Kass.






Re: Clean and Unicode compliance

2001-12-16 Thread Martin Duerst

At 07:16 01/12/14 -0800, James Kass wrote:
>Having an HTML validator, like Tidy.exe, which generates errors
>or warnings every time it encounters a UTF-8 sequence is
>unnerving.  It's especially irritating when the validator
>automatically converts each string making a single UTF-8
>character into two or three HTML named entities.

This is really bad. Have you made sure you have the right
options? Tidy has a lot of options.


Regards,   Martin.




Re: Clean and Unicode compliance

2001-12-16 Thread Martin Duerst

As the person who implemented UTF-8 checking for http://validator.w3.org,
I beg to disagree. In order to validate correctly, the validator has
to make sure it correctly interprets the incoming byte sequence as
a sequence of characters. For this, it has to know the character
encoding. As an example, there are many files in iso-2022-jp or
shift_jis that are perfectly valid as such, but will get rejected
by some tools because they contain bytes that correspond to '<' in
ASCII as part of a double-byte character.
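The point is easy to demonstrate for ISO-2022-JP, where both bytes of a double-byte character fall in the printable ASCII range, so one of them can be 0x3C, the code for '<'. A small Python sketch that searches for such a character (an editorial illustration of the claim, not code from the original mail):

```python
# Find a kanji whose ISO-2022-JP encoding contains the byte 0x3C ('<').
for cp in range(0x4E00, 0xA000):
    ch = chr(cp)
    try:
        encoded = ch.encode("iso-2022-jp")
    except UnicodeEncodeError:
        continue  # character is not in JIS X 0208
    if b"<" in encoded:
        # The '<' byte here is the middle of a character, not markup.
        print(hex(cp), repr(encoded))
        break
```

A naive tool that scans raw bytes for '<' would mistake part of such a character for the start of a tag, which is exactly why the validator must know the encoding before it parses.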

So the UTF-8 check is just to make sure we validate something
reasonable, and to avoid GIGO (garbage in, garbage out).
Of course, this cannot be avoided completely; the validator
has no way to check whether something that is sent in as
iso-8859-1 would actually be iso-8859-2 (humans can check
by looking at the source).

Regards,  Martin.

At 12:26 01/12/14 -0800, James Kass wrote:
>There is so much text on the web using many different
>encoding methods.  Big-5, Shift-JIS, and similar encodings
>are fairly well standardised and supported.  Now, in addition
>to UTF-8, a web page might be in UTF-16 or perhaps even
>UTF-32, eventually.  Plus, there's a plethora of non-standard
>encodings in common use today.  An HTML validator should
>validate the mark-up, assuring an author that (s)he hasn't
>done anything incredibly dumb like having two 
>tags appearing consecutively.  Really, this is all that we should
>expect from an HTML validator.  Extra features such as
>checking for invalid UTF-8 sequences would probably be most
>welcome, but there are other tools for doing this which an
>author should already be using.
>
>Best regards,
>
>James Kass.
>





Re: Clean and Unicode compliance

2001-12-16 Thread James Kass

Martin Duerst wrote,

> 
> This is really bad. Have you made sure you have the right
> options? Tidy has a lot of options.
> 

It sure does.  One of which is "-utf8".  Using this option
(tidy -utf8 -f output.txt -m input.htm)
works like a charm, directing the errors and warnings for
an HTML file called input.htm to a text file called output.txt.

So, when installing a newer version of an application like Tidy,
it's good practice to read the newer version of the help file, eh?

Thank you.

Best regards,

James Kass.





Re: Clean and Unicode compliance

2001-12-16 Thread James Kass


Martin Duerst wrote,

> As the person who implemented UTF-8 checking for http://validator.w3.org,
> I beg to disagree. In order to validate correctly, the validator has
> to make sure it correctly interprets the incoming byte sequence as
> a sequence of characters. For this, it has to know the character
> encoding. As an example, there are many files in iso-2022-jp or
> shift_jis that are perfectly valid as such, but will get rejected
> by some tools because they contain bytes that correspond to '<' in
> ASCII as part of a double-byte character.
> 

Excellent example.  Use of less-than bracket bytes in certain 
encoding methods hadn't occurred to me.

HTML validators need to be aware of the encoding used in the
file.  Based on your comments and other comments in this thread, 
I concede the point.  A validator should validate that the plain
text portion of an HTML file is properly encoded/well formed.

Best regards,

James Kass.






HTML Validation (was Re: Clean and Unicode compliance)

2001-12-16 Thread James Kass


The HTML validation service from W3C at:

http://validator.w3.org

has been commended on this list and appears to be sophisticated
and fast.

Tests run on non-BMP text show no problem for Plane One using
UTF-8 encoding but error messages are generated when these
characters are referenced as NCRs.

Plane One support in M.S.I.E. 5.x is just the opposite: NCRs work
and UTF-8 doesn't.

-
Example from a page using NCRs:

Line 117, column 9:
  โ€ฎ๐Œ“๐Œ€๐Œ”๐Œ๐Œ€โ€ฌ -
   ^
Error: "66323" is not a character number in the document character set
-
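The rejected character number 66323 is simply a supplementary-plane code point written in decimal, an Old Italic letter in this case. A quick check in Python:

```python
cp = 66323
print(hex(cp))                   # 0x10313
# U+10313 lies in the Old Italic block (U+10300..U+1032F), i.e. Plane One.
print(0x10300 <= cp <= 0x1032F)  # True
print("&#%d;" % cp)              # the NCR form: &#66323;
```

So the validator is rejecting a perfectly good Plane One NCR, presumably because its document character set stops at the BMP.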

Best regards,

James Kass.







Re: HTML Validation (was Re: Clean and Unicode compliance)

2001-12-16 Thread Elliotte Rusty Harold

At 3:07 AM -0800 12/16/01, James Kass wrote:

>Tests run on non-BMP text show no problem for Plane One using
>UTF-8 encoding but error messages are generated when these
>characters are referenced as NCRs.
>

I suspect there are a lot of random mistakes like this waiting to be
discovered. I recently added a Plane-1 musical symbol to a book I'm 
working on, and watched Xerces's XMLSerializer class trip over it. It 
emitted the character as two character references, one for each half 
of the surrogate pair, rather than one, thus producing malformed 
HTML. It worked when I switched to UTF-8 encoding though.
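The bug described here is emitting one numeric character reference per UTF-16 code unit instead of one per code point. A Python sketch of the difference, using U+1D11E MUSICAL SYMBOL G CLEF as a stand-in (the specific symbol in the book is not given):

```python
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a Plane One character

# Correct: one NCR for the code point.
correct = "&#%d;" % ord(clef)       # &#119070;

# The buggy serializer: one NCR per UTF-16 surrogate code unit.
offset = ord(clef) - 0x10000
high = 0xD800 + (offset >> 10)      # 0xD834
low = 0xDC00 + (offset & 0x3FF)     # 0xDD1E
buggy = "&#%d;&#%d;" % (high, low)  # &#55348;&#56606;

print(correct)
print(buggy)  # malformed: surrogates are not legal characters in HTML/XML
```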

I suspect a lot of our tools haven't been thoroughly tested with 
Plane-1 and are likely to have these sorts of bugs in them.
-- 

Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer
The XML Bible, 2nd Edition (Hungry Minds, 2001)
http://www.ibiblio.org/xml/books/bible2/
http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/
Read Cafe au Lait for Java News:  http://www.cafeaulait.org/
Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/




Re: HTML Validation (was Re: Clean and Unicode compliance)

2001-12-16 Thread James Kass


Elliotte Rusty Harold wrote,

>
> I suspect a lot of our tools haven't been thoroughly tested with
> > Plane-1 and are likely to have these sorts of bugs in them.

Since Plane One is still fairly new, this is understandable.

I'm also having trouble getting Plane Zero pages to validate.

Spent several hours revising some of my pages as a result of 
some kindly off-list suggestions.  (Most of the pages on my site
were rewritten to pass Tidy.exe long ago, and apparently were
already correct.)  After getting the revised pages to pass the 
Tidy validator (which is also from the W3C), it was a big surprise 
that the first four pages checked with the W3 validator failed 
to pass.

Amazingly, some pages didn't pass because &quot; wasn't recognized
as a valid named entity.

After tidy warns that 

Re: HTML Validation (was Re: Clean and Unicode compliance)

2001-12-16 Thread Martin Duerst

Hello James (and everybody else),

Can you please send comments and bug reports on the validator to
[EMAIL PROTECTED]? Sending bug reports to the right address
seriously increases the chance that they get fixed.

Regards,  Martin.

At 14:46 01/12/16 -0800, James Kass wrote:

>Elliotte Rusty Harold wrote,
>
> >
> > I suspect a lot of our tools haven't been thoroughly tested with
> > Plane-1 and are likely to have these sorts of bugs in them.
>
>Since Plane One is still fairly new, this is understandable.
>
>I'm also having trouble getting Plane Zero pages to validate.
>
>Spent several hours revising some of my pages as a result of
>some kindly off-list suggestions.  (Most of the pages on my site
>were rewritten to pass Tidy.exe long ago, and apparently were
>already correct.)  After getting the revised pages to pass the
>Tidy validator (which is also from w3), it was a big surprise
>that the first four pages checked with the W3 validator failed
>to pass.
>
>Amazingly, some pages didn't pass because &quot; wasn't recognized
>as a valid named entity.
>
>After tidy warns that