Re: [whatwg] Thesis draft about HTML5 conformance checking

2007-04-09 Thread olivier Thereaux


On Mar 28, 2007, at 21:24, Henri Sivonen wrote:
It's been a truly informative and enlightening read, especially
the parts where you expand on the (im)possibility of using only
schemas to describe conformance to the HTML5 specs. This is a
question that has been bothering me for a long time, especially as
there is only one (as of today) production-ready conformance
checking tool not based on some kind (or combination) of
schema-based parsers,


I take it that you mean the Feed Validator?


Indeed.

Also, in the future, I'd like to make it super-easy for CMS
developers to integrate the conformance checker back end into their
products. To enable this, the barrier for getting a runnable copy
should be low.


Libraries/APIs in a few languages?

I'm very pessimistic about translations. Even the online markup  
checkers whose authors have borne the burden of making the messages  
translatable aren't getting numerous translation contributions.


It depends. Projects with large user bases do get a lot of volunteers
for translation.



Indeed. Did you have a chance to look at EARL?


I did. I also had a look at the SOAP and Unicorn outputs of the W3C
Validator. I like EARL the least of the three, because its
assumptions about the nature of the checker software do not work
well with implementations that have a grammar-based schema inside.
Grammar-based implementations cannot cite an exact conformance
criterion when a derivation in the grammar fails, as demonstrated by
the EARL output of the W3C Validator. The SOAP and Unicorn formats,
even if crufty to my taste, match the SAX ErrorHandler interface
better.
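(For illustration, a minimal sketch, not the checker's actual code, of
the shape of SAX error reporting: the ErrorHandler callbacks deliver a
flat stream of severity/message/line/column records, with no pointer
to the violated conformance criterion, which is exactly what a
message-oriented output format wants.)

    import org.xml.sax.ErrorHandler;
    import org.xml.sax.SAXParseException;
    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch (not the checker's actual code): collect the flat
    // stream of messages that SAX reports through ErrorHandler.
    public class CollectingErrorHandler implements ErrorHandler {
        public static final class Message {
            public final String severity;
            public final String text;
            public final int line;    // 1-based; -1 if unavailable
            public final int column;  // 1-based; -1 if unavailable
            Message(String severity, SAXParseException e) {
                this.severity = severity;
                this.text = e.getMessage();
                this.line = e.getLineNumber();
                this.column = e.getColumnNumber();
            }
        }

        private final List<Message> messages = new ArrayList<Message>();

        public void warning(SAXParseException e)    { messages.add(new Message("warning", e)); }
        public void error(SAXParseException e)      { messages.add(new Message("error", e)); }
        public void fatalError(SAXParseException e) { messages.add(new Message("fatal", e)); }

        public List<Message> getMessages() { return messages; }
    }

Each such record maps one-to-one onto a message element in a SOAP- or
Unicorn-style report; there is nothing in it to hang an EARL-style
citation of a conformance criterion on.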


Interesting, thanks for your thoughts. Which version of EARL did you
look at? If you made up your mind based on the EARL output of the
markup validator, note that it's due for an update. The EARL spec has
gone through a lot of development and change, and the new version
clearly takes conformance checkers as a use case:

http://www.w3.org/TR/2007/WD-EARL10-Schema-20070323/
The group developing EARL is really eager to get feedback, so if you  
find that it has shortcomings in some areas, I think you could easily  
get that changed.


--
olivier




Re: [whatwg] Thesis draft about HTML5 conformance checking

2007-03-28 Thread Henri Sivonen

On Mar 12, 2007, at 05:27, olivier Thereaux wrote:


On Mar 11, 2007, at 02:15, Henri Sivonen wrote:

The draft of my master's thesis is available for commenting at:
http://hsivonen.iki.fi/thesis/


Henri, congratulations on your work on the HTML conformance checker  
and on the thesis.


Thanks.

It's been a truly informative and enlightening read, especially
the parts where you expand on the (im)possibility of using only
schemas to describe conformance to the HTML5 specs. This is a
question that has been bothering me for a long time, especially as
there is only one (as of today) production-ready conformance
checking tool not based on some kind (or combination) of
schema-based parsers,


I take it that you mean the Feed Validator?

[2.3.2] I share the view of the Web that holds WebKit, Presto,  
Gecko and Trident (the engines of Safari, Opera, Mozilla/Firefox  
and IE, respectively) to be the most important browser engines.


Did you have a chance to look at engines in authoring tools?


I didn't investigate them beyond mentioning three authoring tools  
that have a RELAX NG-driven auto-completion feature.



What type of parser do NVU, Amaya, GoLive etc. work on?


For authoring tools, the key thing is that their serializers work
with browser parsers. The details of how authoring tools recover
from bad markup are not as crucial as recovery in browsers, because
with authoring tools the author has a chance to review the recovery
result.


How about parsing engines for search engine robots? These are
probably as important as, if not more important than, some of the
browser engines in defining the generic engine for the web today.


Search engines are secretive about what they do, but I would assume
that they'd want to be compatible with browsers in order to fight SEO
cloaking.


[4.1] The W3C Validator sticks strictly to the SGML validity  
formalism. It is often argued that it would be inappropriate for a  
program to be called a “validator” unless it checks exactly for  
validity in the SGML sense of the word – nothing more, nothing less.


That's very true; there's a strong reluctance from part of the
validator tool's user community to do anything other than formal
validation, mostly (?) out of fear that it would eventually make
the term 'validation' meaningless. The only things the validator
does beyond DTD validation are the preparse checks on encoding,
presence of doctype, media type etc.


ISO and the W3C have already expanded the notion of validation to
cover schema languages other than DTDs. In colloquial usage,
validation is already understood to mean checking in general. The
notion of a schema could be detached from any schema language and
taken to be an abstract partitioning of the set of possible XML
documents into two disjoint sets: valid and invalid. Calling the
process of deciding which set a given document instance belongs to
'validation' would give a formal definition that matches the
colloquial usage.
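(In code, the abstract notion could be as small as this; a
hypothetical interface, not any existing API:)

    import org.w3c.dom.Document;

    // Hypothetical interface, not an existing API: a schema detached
    // from any schema language is just a decision procedure that
    // partitions the set of possible documents into two disjoint sets,
    // valid and invalid.
    public interface AbstractSchema {
        boolean isValid(Document instance);
    }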


I do sympathize with Hixie's reluctance to call HTML5 conformance
checking 'HTML5 validation', though. Calling it 'conformance
checking' makes sure that others don't have a claim on defining what
it means. Fighting the colloquial usage will probably be futile,
though, outside spec lawyerism.



[6.1.3] Erroneous Source Is Not Shown
The error messages do not show the erroneous markup. For this  
reason it is unnecessarily hard for the user to see where the  
problem is.


Was this by lack of time?


Yes. Showing the source code based on the SAX-reported line and
column numbers is useful, but it isn't novel or central enough to
proving the feasibility of the chosen implementation approach to
justify delaying the publication of the thesis.
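(A rough sketch of what such a feature could look like, assuming the
source lines are buffered in memory; this is a hypothetical helper,
not part of the checker:)

    import java.util.List;
    import org.xml.sax.SAXParseException;

    // Hypothetical helper, not part of the checker: excerpt the line
    // a SAXParseException points at and put a caret under the
    // reported column.
    public final class SourceExcerpt {
        private SourceExcerpt() {}

        public static String excerpt(List<String> sourceLines, SAXParseException e) {
            int line = e.getLineNumber();   // 1-based; -1 if unavailable
            int column = e.getColumnNumber();
            if (line < 1 || line > sourceLines.size()) {
                return ""; // no usable location reported
            }
            StringBuilder sb = new StringBuilder(sourceLines.get(line - 1));
            sb.append('\n');
            for (int i = 1; i < column; i++) {
                sb.append(' ');
            }
            return sb.append('^').toString();
        }
    }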


Observing the thesis projects of my friends who started before me has  
taught me that it is a mistake to promise a complete software product  
as a precondition for the completion of the thesis. Software always  
has one more bug to fix or one more feature to add. On the other  
hand, as far as the academic requirements go, one could even write a  
thesis explaining why a project failed.



Did you have a look at existing implementations?


On this particular point, not yet.

Oh, I see [8.10 Showing the Erroneous Source Markup] as future
work. If you're looking for a decent, though by no means perfect,
implementation, look for sub truncate_line in

http://dev.w3.org/cvsweb/~checkout~/validator/httpd/cgi-bin/check


Thanks. I'll keep this in mind.

[8.1] Even though the software developed in this project is Free  
Software / Open Source, it has not been developed in a way that  
would make it easily approachable to potential contributors.  
Perhaps the most pressing need for change in order to move the  
software forward after the completion of this thesis is moving the  
software to a public version control system and making building  
and deploying the software easy.


Making it available on a more open-sourcey 

Re: [whatwg] Thesis draft about HTML5 conformance checking

2007-03-16 Thread Ian Hickson
On Wed, 14 Mar 2007, olivier Thereaux wrote:
 
If lightweight browsers [on mobile devices] with less tolerance of tag
soup carry more weight

I don't know why you think that browsers on mobile phones have less
tolerance of tag soup. All the testing I have seen shows that they support
tag soup as much as the desktop browsers. In fact, the only browser that I
am aware of that actually has stricter (XML) parsing on mobile phones is
Opera, running the same core engine as the desktop Opera browser.

(See, e.g., http://simon.html5.org/articles/mobile-results but note the
paragraph at the bottom of http://simon.html5.org/test/mobile/ which points
out that the only pass line for a non-Opera browser is in fact a false
positive, that browser in fact having even more tolerant parsing and even
less support for the relevant standards.)


All considered, of course I understand your point that desktop browsers
*today* have a considerable influence in defining the state of the art
of the web. But any standardization work, or study of the web, made
under the assumption that other classes of product have only minor
importance, because for the most part they follow this current balance
of power and mimic the desktop browsers, is IMHO missing a good chunk
of the big picture.

Given that I work for a company that authors content by hand, provides a
template-based Web authoring tool, runs a search engine, contributes to a
browser's development, and is working on mobile device software, I assure
you that I agree that all of these things should be considered (and, in
the WHATWG context, are). My original point, which I still believe is
true, is that the details of the _parsing model_ of search engines are
not important. That is what is relevant in the context of Henri's thesis.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Thesis draft about HTML5 conformance checking

2007-03-13 Thread olivier Thereaux

Hi Ian,
Thank you for the insightful comments and information about parsers.

On Mar 12, 2007, at 16:21, Ian Hickson wrote:

Why do you think search engine behaviour is more important than browser
engine behaviour? For what it's worth, search engine engineers I have
spoken to have told me that what browsers do is far more important than
what a particular version of a search engine does in terms of what the
specification should say, because their results are better when their
algorithms match the browsers' behaviours.


My opinion, which may be wrong, is that the current balance of power
is not the only factor that matters in the design/study of markup
languages.


* Browsers determine (largely) whether and how documents are
presented to the user. Most of it is actually a question of CSS
support, not relevant to this discussion. The rest is a matter of
parsing model and element/attribute support, where the browsers do
have an enormous influence, but even that may shift as more people in
the world move to lightweight browsers on mobile devices than to
desktop browsers. If lightweight browsers with less tolerance of tag
soup carry more weight, the state of the art will be whatever parsing
model is standardized, less so what browser foo or bar does on the
desktop. Ditto for apps/widgets in non-browser environments. There's
more to browsing than the desktop computer.


* Authoring tools and CMSs determine (largely) how documents are
structured, and what features go into documents. If a feature of a
markup language gets no adoption from them, then regardless of
browser support, it will remain in the confidential little world of
web geeks (no disrespect meant; I'm putting myself in this category)
who edit their pages by hand.


* Search engines and their indexing mechanisms determine (largely)
how documents get found. I've seen estimates that content-rich sites
get half of their traffic through search engines. As you aptly point
out, search engines mostly mimic browsers' behavior in parsing HTML
documents, but that's not all.
An example: 10 years ago any serious Web page had to have meta
description and keywords information, because that was the key to
being listed in search engines. When search engines started ignoring
those because of spam, usage fell. If today a feature of HTML, or
RDFa, or microformats, caught the fancy of the major search engines
and gave their users a serious boost in visits, the adoption rate
would soar, regardless of browser support.


* Servers, proxies, and caches have their say too, though probably
not much when it comes to markup languages.



All considered, of course I understand your point that desktop
browsers *today* have a considerable influence in defining the state
of the art of the web. But any standardization work, or study of the
web, made under the assumption that other classes of product have
only minor importance, because for the most part they follow this
current balance of power and mimic the desktop browsers, is IMHO
missing a good chunk of the big picture.



I hope this helps clarify why I was wondering whether Henri had
considered classes of products other than desktop browsers in his study.


Regards,
--
olivier


Re: [whatwg] Thesis draft about HTML5 conformance checking

2007-03-12 Thread Ian Hickson
On Mon, 12 Mar 2007, olivier Thereaux wrote:
 
 Did you have a chance to look at engines in authoring tools? What type of
 parser do NVU

Gecko, same as Firefox.


 Amaya,

Amaya's editor uses the same rendering engine as Amaya's browser, which I 
presume was ignored due to its negligible market share.


GoLive etc. work on?

GoLive uses Opera's rendering engine.


How about parsing engines for search engine robots? These are probably
as important as, if not more important than, some of the browser engines
in defining the generic engine for the web today.

Search engine companies are notoriously secretive about what their 
indexing pipelines support, since any insight into how they work can be 
abused by people attempting to game their ranking algorithms. The WHATWG 
specification (in particular the parsing part, but other parts as well) 
has, however, been influenced by what information search engine 
implementors have confidentially contacted me with, and what suggestions 
they have anonymously or subtly sent to the list over the years. (This is 
why a careful study of the specification's acknowledgements will reveal 
employees from several search engine implementors.) In any case, reverse 
engineering search engine indexing pipelines is extremely difficult and 
tedious, orders of magnitude more so than even browsers.

Why do you think search engine behaviour is more important than browser 
engine behaviour? For what it's worth, search engine engineers I have 
spoken to have told me that what browsers do is far more important than 
what a particular version of a search engine does in terms of what the 
specification should say, because their results are better when their 
algorithms match the browsers' behaviours.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Thesis draft about HTML5 conformance checking

2007-03-11 Thread David Håsäther

mozer wrote:


[[
common.inner.strict-inline =
  ( text )
]]
appears twice in the HTML file


If you're referring to 5.6.2.1 Common Content Models, it's  
...strict-inline and ...struct-inline (unless this has been fixed  
since you read it).


--
David Håsäther


Re: [whatwg] Thesis draft about HTML5 conformance checking

2007-03-11 Thread mozer

Oh man!!
Two in a row!!

Hope the last two will help...

Thanks David²

On 3/11/07, David Håsäther [EMAIL PROTECTED] wrote:

mozer wrote:

 [[
 common.inner.strict-inline =
   ( text )
 ]]
appears twice in the HTML file

If you're referring to 5.6.2.1 Common Content Models, it's
...strict-inline and ...struct-inline (unless this has been fixed
since you read it).

--
David Håsäther



Re: [whatwg] Thesis draft about HTML5 conformance checking

2007-03-11 Thread olivier Thereaux


On Mar 11, 2007, at 02:15, Henri Sivonen wrote:

The draft of my master's thesis is available for commenting at:
http://hsivonen.iki.fi/thesis/


Henri, congratulations on your work on the HTML conformance checker
and on the thesis. It's been a truly informative and enlightening
read, especially the parts where you expand on the (im)possibility
of using only schemas to describe conformance to the HTML5 specs.
This is a question that has been bothering me for a long time,
especially as there is only one (as of today) production-ready
conformance checking tool not based on some kind (or combination) of
schema-based parsers, and although, as is often pointed out, no
browser uses a DTD-based parser in its engine today, I still think
producing a schema representation of (most of) the conformance
criteria helps adoption and implementation.



Some comments based on a first read-through of the thesis, below.
I'm cross-posting them to the www-validator list at W3C, as I think
your thesis will be of interest to a number of subscribers of that
list too.

For www-validator, Henri's announcement and request for comments:
http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-March/009941.html




[2.3.2] I share the view of the Web that holds WebKit, Presto,  
Gecko and Trident (the engines of Safari, Opera, Mozilla/Firefox  
and IE, respectively) to be the most important browser engines.


Did you have a chance to look at engines in authoring tools? What
type of parser do NVU, Amaya, GoLive etc. work on?
How about parsing engines for search engine robots? These are
probably as important as, if not more important than, some of the
browser engines in defining the generic engine for the web today.



[4.1] The W3C Validator sticks strictly to the SGML validity  
formalism. It is often argued that it would be inappropriate for a  
program to be called a “validator” unless it checks exactly for  
validity in the SGML sense of the word – nothing more, nothing less.


That's very true; there's a strong reluctance from part of the
validator tool's user community to do anything other than formal
validation, mostly (?) out of fear that it would eventually make the
term 'validation' meaningless. The only things the validator does
beyond DTD validation are the preparse checks on encoding, presence
of doctype, media type etc.


I think it will change over time; in fact, it's already changing, as
the innards of the validator have moved to SAX-based parsing. It's
going to be an opportunity to add data type checking and move closer
to a conformance checker than a validator. Work at W3C on Unicorn [1]
and little modules such as the Appendix C checker [2] for XHTML 1.0
also goes in that direction.


[1] http://www.w3.org/QA/Tools/Unicorn/
[2] http://dev.w3.org/cvsweb/perl/modules/W3C/XHTML/HTMLCompatChecker/



[6.1.3] Erroneous Source Is Not Shown
The error messages do not show the erroneous markup. For this  
reason it is unnecessarily hard for the user to see where the  
problem is.


Was this by lack of time? Did you have a look at existing
implementations? Oh, I see [8.10 Showing the Erroneous Source Markup]
as future work. If you're looking for a decent, though by no means
perfect, implementation, look for sub truncate_line in

http://dev.w3.org/cvsweb/~checkout~/validator/httpd/cgi-bin/check
(this is to be modularized out of the check script and into a CPAN
module sooner or later, see [3])


[3] http://esw.w3.org/topic/SoftwareProjects


[6.2] Instead of modifying the libraries themselves, an alternative  
approach to localization would be reverse templating. The English  
messages would be matched against known patterns that would allow  
the variable parts to be extracted. The variable parts could then  
be plugged into a translated message corresponding to the matched  
pattern.


This is something I have been looking at, and I had come to the same
conclusion. I'm hoping to be able to reuse, in one way or another,
the existing localization of some of the libraries being used (e.g.
OpenSP, which, with all its issues, has a very impressive
localization record).
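(A minimal sketch of the reverse-templating idea; the message pattern
and the translation below are hypothetical, and a real table would be
generated from the libraries' actual message strings:)

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Minimal sketch of reverse templating: match the English message
    // against known patterns, extract the variable parts, and plug
    // them into a translated template. The entries here are
    // hypothetical examples.
    public class ReverseTemplater {
        private static final Map<Pattern, String> TABLE = new LinkedHashMap<Pattern, String>();
        static {
            TABLE.put(Pattern.compile("Element \"(.+)\" not allowed as child of element \"(.+)\"\\."),
                      "L'élément « $1 » n'est pas autorisé comme enfant de l'élément « $2 ».");
        }

        public static String translate(String englishMessage) {
            for (Map.Entry<Pattern, String> entry : TABLE.entrySet()) {
                Matcher m = entry.getKey().matcher(englishMessage);
                if (m.matches()) {
                    // $1, $2 carry the extracted variable parts over.
                    return m.replaceAll(entry.getValue());
                }
            }
            return englishMessage; // no pattern matched; fall back to English
        }
    }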



[8.1] Even though the software developed in this project is Free  
Software / Open Source, it has not been developed in a way that  
would make it easily approachable to potential contributors.  
Perhaps the most pressing need for change in order to move the  
software forward after the completion of this thesis is moving the  
software to a public version control system and making building and  
deploying the software easy.


Making it available on a more open-sourcey system, with a multi-user
revision system, will probably not create an explosion of code
contributors (you've had very helpful contributions from e.g. Elika,
and most OS projects, even successful ones, never have more than a
handful of coders), but you may be able to create a healthy community
of users, reviewers, bug spotters, translators, document editors,
beyond

Re: [whatwg] Thesis draft about HTML5 conformance checking

2007-03-10 Thread mozer

Henri,

Here are a few remarks

In RELAX NG Datatyping
Why is there no mention of DTLL from DSDL?

For the sake of completeness
In Schematron
You should mention that XML Schema 1.1, which is still a WD, tries to
add assertions too

Liam Quin (with only one n)

[[
common.inner.strict-inline =
 ( text )
]]
appears twice in the HTML file

Regards,

Xmlizer

On 3/10/07, Henri Sivonen [EMAIL PROTECTED] wrote:

The reason why I haven't updated the software at
http://hsivonen.iki.fi/validator/html5/ lately is that I have been
writing about it.

The draft of my master's thesis is available for commenting at:
http://hsivonen.iki.fi/thesis/

I'd appreciate comments on the draft. However, I won't be able to
respond to comments next week. I'll continue polishing the thesis the
week after next.



I have a couple of specific questions:

 * Besides perhaps IRC logs, is there any less ephemeral reference
about the WHATWG members being able to impeach and replace the editor?

 * What would be a good reference for the design policy of Web Forms
2.0 regarding implementability on top of IE6? Is
http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2005-April/003818.html
the best reference?

(These are things I think I know, but I don't know where they are
said. :-)

--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/





Re: [whatwg] Thesis draft about HTML5 conformance checking

2007-03-10 Thread L. David Baron
On Saturday 2007-03-10 23:41 +0100, mozer wrote:
 Liam Quin (with only one n)

No, Liam Quin [1] and Liam Quinn [2] are two different Canadian
members of the Web standards community, and should not be confused
with each other.  The latter was responsible for the WDG HTML
Validator, so Henri's spelling is correct.

-David

[1] http://www.w3.org/People/Quin/
[2] http://htmlhelp.com/~liam/

-- 
L. David Baron                                URL: http://dbaron.org/
   Technical Lead, Layout & CSS, Mozilla Corporation

