On Mar 12, 2007, at 05:27, olivier Thereaux wrote:
On Mar 11, 2007, at 02:15 , Henri Sivonen wrote:
The draft of my master's thesis is available for commenting at:
http://hsivonen.iki.fi/thesis/
Henri, congratulations on your work on the HTML conformance checker
and on the thesis.
Thanks.
It's been a truly informative and enlightening read, especially
the parts where you elaborate on the (im)possibility of using only
schemas to describe conformance to the HTML5 specs. This is a
question that has been bothering me for a long time, especially as
there is only one (as of today) production-ready conformance
checking tool not based on some kind (or combination) of
schema-based parsers.
I take it that you mean the Feed Validator?
[2.3.2] I share the view of the Web that holds WebKit, Presto,
Gecko and Trident (the engines of Safari, Opera, Mozilla/Firefox
and IE, respectively) to be the most important browser engines.
Did you have a chance to look at engines in authoring tools?
I didn't investigate them beyond mentioning three authoring tools
that have a RELAX NG-driven auto-completion feature.
What type of parser do NVU, Amaya, GoLive, etc. work on?
For authoring tools, the key thing is that their serializers work
with browser parsers. The details of how authoring tools recover
from bad markup are not as crucial as recovery in browsers, because
with authoring tools the author has a chance to review the recovery result.
How about parsing engines for search engine robots? These are
probably as important as, if not more important than, some of the
browser engines in defining the "generic" engine for the web today.
Search engines are secretive about what they do, but I would assume
that they'd want to be compatible with browsers in order to fight SEO
cloaking.
[4.1] The W3C Validator sticks strictly to the SGML validity
formalism. It is often argued that it would be inappropriate for a
program to be called a “validator” unless it checks exactly for
validity in the SGML sense of the word – nothing more, nothing less.
That's very true; there's a strong reluctance from part of the
validator user community to do anything other than formal
validation, mostly (?) out of fear that it would eventually make
the term "validation" meaningless. The only things the validator
does beyond DTD validation are the preparse checks on encoding,
presence of doctype, media type, etc.
ISO and the W3C have already expanded the notion of validation to
cover schema languages other than DTDs. In colloquial usage
"validation" is already understood to mean checking in general. The
notion of a "schema" could be detached from any particular schema
language to be an abstract partitioning of the set of possible XML
documents into two disjoint sets: valid and invalid. Calling the
process of deciding which set a given document instance belongs to
"validation" would give a formal definition that matched the
colloquial usage.
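As a purely illustrative sketch of that abstract definition (my own, not
from the thesis or from either checker), a "schema" is then just any
decision procedure over documents, and "validation" is evaluating it.
The `well_formed` predicate below is a hypothetical stand-in; a DTD,
RELAX NG, or Schematron schema would carve out a different valid set
the same way:

```python
# Hypothetical sketch: a "schema" as an abstract partitioning of
# documents into two disjoint sets, independent of any schema language.
import xml.etree.ElementTree as ET

def well_formed(document):
    """One possible predicate: accept exactly the well-formed XML
    documents. Any other schema formalism would define a different
    (usually smaller) valid set the same way."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

def validate(schema, document):
    """'Validation' in the abstract sense: decide which of the two
    disjoint sets (valid / invalid) the instance belongs to."""
    return schema(document)
```

Under this definition, any checker that answers yes/no is doing
"validation", regardless of whether a grammar formalism sits behind the
predicate.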
I do sympathize with Hixie's reluctance to call "HTML5 conformance
checking" "HTML5 validation", though. Calling it "conformance
checking" makes sure that others don't have a claim on defining what
it means. Fighting the colloquial usage will probably be futile,
though, outside spec lawyerism.
[6.1.3] Erroneous Source Is Not Shown
The error messages do not show the erroneous markup. For this
reason it is unnecessarily hard for the user to see where the
problem is.
Was this by lack of time?
Yes. Showing the source code based on the SAX-reported line and
column numbers is useful but it isn't novel enough or central enough
to proving the feasibility of the chosen implementation approach for
it to delay the publication of the thesis.
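For what it's worth, here is a minimal sketch of one way the feature
could work (my assumption, not the thesis implementation, and in
Python's SAX API, which mirrors the Java one the checker uses): take
the line and column reported with the SAX parse error and excerpt the
offending source line with a caret marker.

```python
import io
import xml.sax

def show_error_context(source):
    """Parse the document; on the first fatal error, return the
    offending source line plus a caret under the reported column.
    Returns None when the document parses cleanly. Column numbers
    are parser-dependent, so the caret position is approximate."""
    try:
        xml.sax.parse(io.StringIO(source), xml.sax.handler.ContentHandler())
    except xml.sax.SAXParseException as e:
        line_no = e.getLineNumber()        # 1-based
        col_no = e.getColumnNumber() or 0  # 0-based with expat
        bad_line = source.splitlines()[line_no - 1]
        return bad_line + "\n" + " " * col_no + "^"
    return None
```

A real implementation would also need to truncate very long lines and
cope with the column pointing past the recovered text, which is where
most of the fiddly work lies.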
Observing the thesis projects of my friends who started before me has
taught me that it is a mistake to promise a complete software product
as a precondition for the completion of the thesis. Software always
has one more bug to fix or one more feature to add. On the other
hand, as far as the academic requirements go, one could even write a
thesis explaining why a project failed.
Did you have a look at existing implementations?
On this particular point, not yet.
Oh I see [ 8.10 Showing the Erroneous Source Markup] as future
work. If you're looking for a decent, though by no means perfect,
implementation, look for sub truncate_line in
http://dev.w3.org/cvsweb/~checkout~/validator/httpd/cgi-bin/check
Thanks. I'll keep this in mind.
[8.1] Even though the software developed in this project is Free
Software / Open Source, it has not been developed in a way that
would make it easily approachable to potential contributors.
Perhaps the most pressing need for change in order to move the
software forward after the completion of this thesis is moving the
software to a public version control system and making building
and deploying the software easy.
Making it available on a more open-sourcey system, with a multi-user
revision system, will probably not create an explosion of code
contributors (you've had very helpful contributions from e.g. Elika,
and most OS projects, even successful ones, never have more than a
handful of coders), but you may be able to create a healthy
community of users, reviewers, bug spotters, translators, and
document editors, beyond the whatwg community.
I am not expecting an explosion of contributors. However, I have a
reason to believe that my current arrangement has caused at least one
potential contributor to walk away. I'd rather avoid turning people
away.
Also, in the future, I'd like to make it super-easy for CMS
developers to integrate the conformance checker back end into their
products. To enable this, the barrier to getting a runnable copy
should be low.
I'm very pessimistic about translations. Even the online markup
checkers whose authors have borne the burden of making the messages
translatable aren't getting numerous translation contributions.
If you're interested in using W3C logistics and benefiting from the
existing communities around the W3C, I'm happy to invite you.
Thank you. I'll keep your offer in mind when it is time to figure out
where to put the source.
[8.8] To support the use of the conformance checker back end from
other applications (non-Java applications in particular), a Web
service would be useful.
Indeed. Did you have a chance to look at EARL?
I did. I also had a look at the SOAP and Unicorn outputs of the W3C
Validator. I like EARL the least of the three, because its
assumptions about the nature of the checker software do not work well
with implementations that have a grammar-based schema inside.
Grammar-based implementations cannot cite an exact conformance
criterion when a derivation in the grammar fails, as demonstrated by
the EARL output of the W3C Validator. The SOAP and Unicorn formats,
even if crufty to my taste, match the SAX ErrorHandler interface
better.
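To illustrate why the ErrorHandler interface maps naturally onto flat,
per-message formats (a sketch of my own, again in Python's mirror of
the Java SAX API, not code from either checker): each callback becomes
one (severity, line, column, message) record, with no violated-criterion
field that a grammar-based checker would be unable to fill in.

```python
import io
import xml.sax

class CollectingErrorHandler(xml.sax.handler.ErrorHandler):
    """Accumulate SAX-reported problems as flat records: exactly the
    shape of a per-message output format like SOAP or Unicorn."""
    def __init__(self):
        self.messages = []

    def _record(self, severity, exc):
        self.messages.append({
            "severity": severity,
            "line": exc.getLineNumber(),
            "column": exc.getColumnNumber(),
            "message": exc.getMessage(),
        })

    def warning(self, exc):
        self._record("warning", exc)

    def error(self, exc):
        self._record("error", exc)

    def fatalError(self, exc):
        self._record("fatal", exc)  # record instead of raising

def check(source):
    """Run a parse and return the flat message list."""
    handler = CollectingErrorHandler()
    xml.sax.parse(io.StringIO(source), xml.sax.handler.ContentHandler(),
                  handler)
    return handler.messages
```

An EARL-style report, by contrast, wants each message tied to a named
test criterion, which is exactly the information a failed grammar
derivation does not provide.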
I think I saw Relaxed having its own SAX ErrorHandler-friendly
format, but now I can't find it.
I wrote some basic notes at
http://lists.w3.org/Archives/Public/www-validator/2007Mar/0005
Thanks. My notes are at
http://lists.w3.org/Archives/Public/www-validator/2006Dec/0060.html
and
http://wiki.whatwg.org/wiki/Conformance_Checker_Web_Service_Interface_Ideas
Thank you for your comments.
--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/