On Mar 12, 2007, at 05:27, olivier Thereaux wrote:

On Mar 11, 2007, at 02:15 , Henri Sivonen wrote:
The draft of my master's thesis is available for commenting at:
http://hsivonen.iki.fi/thesis/

Henri, congratulations on your work on the HTML conformance checker and on the Thesis.

Thanks.

It's been a truly informative and enlightening reading, especially the parts where you develop on the (im)possibility of using only schemas to describe conformance to the html5 specs. This is a question that has been bothering me for a long time, especially as there is only one (as of today) production-ready conformance checking tool not based on some kind (or combination) of schema-based parsers,

I take it that you mean the Feed Validator?

[2.3.2] I share the view of the Web that holds WebKit, Presto, Gecko and Trident (the engines of Safari, Opera, Mozilla/Firefox and IE, respectively) to be the most important browser engines.

Did you have a chance to look at engines in authoring tools?

I didn't investigate them beyond mentioning three authoring tools that have a RELAX NG-driven auto-completion feature.

What type of parser do NVU, Amaya, golive etc work on?

For authoring tools, the key thing is that their serializers work with browser parsers. The details of how authoring tools recover from bad markup are not as crucial as recovery in browsers, because with authoring tools the author has a chance to review the recovery result.

How about parsing engines for search engine robots? These are probably as important as some of the browser engines, if not more so, in defining the "generic" engine for the web today.

Search engines are secretive about what they do, but I would assume that they'd want to be compatible with browsers in order to fight SEO cloaking.

[4.1] The W3C Validator sticks strictly to the SGML validity formalism. It is often argued that it would be inappropriate for a program to be called a “validator” unless it checks exactly for validity in the SGML sense of the word – nothing more, nothing less.

That's very true, there's a strong reluctance in part of the validator's user community to do anything other than formal validation, mostly (?) out of fear that it would eventually make the term "validation" meaningless. The only things the validator does beyond DTD validation are the preparse checks on encoding, presence of doctype, media type etc.

ISO and the W3C have already expanded the notion of validation to cover schema languages other than DTDs. In colloquial usage "validation" is already understood to mean checking in general. The notion of a "schema" could be detached from a schema language to be an abstract partitioning of the set of possible XML documents into two disjoint sets: valid and invalid. Calling the process of deciding which set a given document instance belongs to "validation" would give a formal definition that matched the colloquial usage.
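To make the "abstract partitioning" idea concrete, here is a minimal sketch in Python. It treats a schema as nothing more than a yes/no predicate over a parsed document, so a grammar-style check and a non-grammar check combine into one "schema". All names (Schema, root_is, ids_unique, all_of, validate) are hypothetical and are not taken from the thesis or from any checker.

```python
import xml.etree.ElementTree as ET
from typing import Callable

# Assumption: a "schema" is any predicate that places a document
# in exactly one of two sets, valid (True) or invalid (False).
Schema = Callable[[ET.Element], bool]


def root_is(name: str) -> Schema:
    """A grammar-like constraint: the root element must have this name."""
    return lambda doc: doc.tag == name


def ids_unique(doc: ET.Element) -> bool:
    """A constraint that is awkward to state in a grammar: no duplicate ids."""
    ids = [el.get("id") for el in doc.iter() if el.get("id") is not None]
    return len(ids) == len(set(ids))


def all_of(*schemas: Schema) -> Schema:
    """Combine several predicates into one abstract schema."""
    return lambda doc: all(schema(doc) for schema in schemas)


def validate(schema: Schema, source: str) -> bool:
    """Decide which of the two sets a well-formed document belongs to."""
    return schema(ET.fromstring(source))


html_like = all_of(root_is("html"), ids_unique)
print(validate(html_like, '<html><p id="a"/><p id="b"/></html>'))
print(validate(html_like, '<html><p id="a"/><p id="a"/></html>'))
```

Under this view the word "validation" covers the combined predicate, regardless of which parts were written in a schema language.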

I do sympathize with Hixie's reluctance to call "HTML5 conformance checking" "HTML5 validation", though. Calling it "conformance checking" makes sure that others don't have a claim on defining what it means. Fighting the colloquial usage will probably be futile, though, outside spec lawyerism.

[6.1.3] Erroneous Source Is Not Shown
The error messages do not show the erroneous markup. For this reason it is unnecessarily hard for the user to see where the problem is.

Was this by lack of time?

Yes. Showing the source code based on the SAX-reported line and column numbers is useful but it isn't novel enough or central enough to proving the feasibility of the chosen implementation approach for it to delay the publication of the thesis.
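As a rough sketch of what showing the source from a reported position could look like (not the checker's code, and not the W3C Validator's truncate_line), the Python function below trims the reported line around the column and adds a caret line. The function name, the `context` parameter, and the 1-based column convention are assumptions made for illustration.

```python
def excerpt(source: str, line: int, column: int, context: int = 30) -> str:
    """Return the reported line, trimmed around the column, plus a caret line.

    Assumptions: `line` and `column` are both 1-based, and the caret
    alignment ignores tab expansion.
    """
    lines = source.splitlines()
    if not 1 <= line <= len(lines):
        return ""  # position is outside the document
    text = lines[line - 1]
    # Clamp the column so a position just past the end of the line still works.
    col = min(max(column, 1), len(text) + 1)
    start = max(0, col - 1 - context)
    end = min(len(text), col - 1 + context)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    snippet = prefix + text[start:end] + suffix
    caret = " " * (len(prefix) + col - 1 - start) + "^"
    return snippet + "\n" + caret


print(excerpt("<p>\n<b>bold</i>\n</p>", line=2, column=8))
```

The truncation keeps very long lines readable in an error report while still pointing at the reported column.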

Observing the thesis projects of my friends who started before me has taught me that it is a mistake to promise a complete software product as a precondition for the completion of the thesis. Software always has one more bug to fix or one more feature to add. On the other hand, as far as the academic requirements go, one could even write a thesis explaining why a project failed.

Did you have a look at existing implementations?

On this particular point, not yet.

Oh, I see [8.10 Showing the Erroneous Source Markup] listed as future work. If you're looking for a decent, though by no means perfect, implementation, look for sub truncate_line in
http://dev.w3.org/cvsweb/~checkout~/validator/httpd/cgi-bin/check

Thanks. I'll keep this in mind.

[8.1] Even though the software developed in this project is Free Software / Open Source, it has not been developed in a way that would make it easily approachable to potential contributors. Perhaps the most pressing need for change in order to move the software forward after the completion of this thesis is moving the software to a public version control system and making building and deploying the software easy.

Making it available on a more open-sourcey system, with a multi-user revision system will probably not create an explosion of code contributors (you've had very helpful contributions from e.g. Elika, and most OS projects, even successful ones, never have more than a handful of coders), but you may be able to create a healthy community of users, reviewers, bug spotters, translators, document editors, beyond the whatwg community.

I am not expecting an explosion of contributors. However, I have a reason to believe that my current arrangement has caused at least one potential contributor to walk away. I'd rather avoid turning people away.

Also, in the future, I'd like to make it super-easy for CMS developers to integrate the conformance checker back end to their products. To enable this, the barrier for getting a runnable copy should be low.

I'm very pessimistic about translations. Even the online markup checkers whose authors have borne the burden of making the messages translatable aren't getting numerous translation contributions.

If you're interested in using w3c logistics, and benefit from the existing communities around w3c, I'm happy to invite you.

Thank you. I'll keep your offer in mind when it is time to figure out where to put the source.

[8.8] To support the use of the conformance checker back end from other applications (non-Java applications in particular), a Web service would be useful.

Indeed. Did you have a chance to look at EARL?

I did. I also had a look at the SOAP and Unicorn outputs of the W3C Validator. I like EARL the least of the three, because its assumptions about the nature of the checker software do not work well with implementations that have a grammar-based schema inside. Grammar-based implementations cannot cite an exact conformance criterion when a derivation in the grammar fails, as demonstrated by the EARL output of the W3C Validator. The SOAP and Unicorn formats, even if crufty to my taste, match better the SAX ErrorHandler interface.
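To illustrate what "matching the SAX ErrorHandler interface" means, here is a minimal sketch using Python's xml.sax module, whose ErrorHandler mirrors the SAX one (warning, error, fatalError). Each callback becomes one flat record with a severity, a position, and a message, and no reference to a conformance criterion. The class name, the check function, and the record field names are hypothetical; they are not the SOAP or Unicorn format.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class CollectingErrorHandler(ErrorHandler):
    """Turn each SAX error callback into a flat message record."""

    def __init__(self):
        self.messages = []

    def _record(self, severity, exc):
        # Field names are illustrative only.
        self.messages.append({
            "severity": severity,
            "line": exc.getLineNumber(),
            "column": exc.getColumnNumber(),
            "message": exc.getMessage(),
        })

    def warning(self, exc):
        self._record("warning", exc)

    def error(self, exc):
        self._record("error", exc)

    def fatalError(self, exc):
        self._record("fatal", exc)
        raise exc  # stop parsing after a fatal error


def check(document: bytes):
    """Parse the document and return the collected message records."""
    handler = CollectingErrorHandler()
    try:
        xml.sax.parseString(document, ContentHandler(), handler)
    except xml.sax.SAXParseException:
        pass  # the fatal error was already recorded by the handler
    return handler.messages


for message in check(b"<a><b></a>"):
    print(message)
```

A web service output format shaped like these records can be produced directly from the callbacks, which is the property that EARL's criterion-oriented model lacks for grammar-based checkers.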

I think I saw Relaxed having its own SAX ErrorHandler-friendly format, but now I can't find it.

I wrote some basic notes at http://lists.w3.org/Archives/Public/www-validator/2007Mar/0005

Thanks. My notes are at http://lists.w3.org/Archives/Public/www-validator/2006Dec/0060.html and http://wiki.whatwg.org/wiki/Conformance_Checker_Web_Service_Interface_Ideas

Thank you for your comments.

--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/

