Package: www.debian.org
User: www.debian....@packages.debian.org
Usertag: scripts
Severity: important

Hi all
I'm starting to work on bug #980921 (Pages in HTML5) and, as mentioned
there, we need to adapt our "validate" script so that it correctly processes
the pages declared as HTML5 (currently, only the homepage in the different
languages).

The current status is as follows:

Related scripts:

https://salsa.debian.org/webmaster-team/cron/-/blob/master/lessoften is
executed once a day and calls (via run-parts) the following script:
https://salsa.debian.org/webmaster-team/cron/-/blob/master/scripts/999Xvalidate
which gets the list of languages and folders to process and then calls:

https://salsa.debian.org/webmaster-team/cron/-/blob/master/scripts/validate

which is the actual script doing the HTML validation, using the onsgmls
command (part of the opensp package).

This command validates an SGML file against a DTD. The issue (as far as I
know) is that there is no "official" SGML DTD to use when parsing HTML5
files.

I have tried adapting the "validate" script so that it recognizes the DOCTYPE
declaration used in HTML5 files, and then tried to pass it a DTD (I tried
downloading the ones here http://sgmljs.net/docs/w3c-html5-dtd.html and here
http://sgmljs.net/docs/w3c-html52-dtd.html and also here
https://jkorpela.fi/html5-dtd.html ) but I couldn't make it work, and I was
also not convinced that it is the best approach.
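
To illustrate the DOCTYPE detection step, here is a minimal sketch (not the
code in the current "validate" script). The function name is_html5 is
hypothetical, and the sketch assumes the declaration appears within the first
5 lines of the file and is written as <!DOCTYPE html> in any letter case:

```shell
#!/bin/sh
# is_html5 FILE: succeed if FILE declares the HTML5 doctype near the top.
# Assumption: the doctype sits within the first 5 lines of the file.
# The pattern requires ">" right after "html", so the longer HTML 4
# declaration (<!DOCTYPE HTML PUBLIC "..." ...>) does not match.
is_html5() {
    head -n 5 "$1" | grep -Eiq '<!DOCTYPE[[:space:]]+html[[:space:]]*>'
}

# Hypothetical dispatch: print which validator each file would go to.
for f in "$@"; do
    if is_html5 "$f"; then
        echo "html5 $f"
    else
        echo "sgml $f"
    fi
done
```

With something like this, pages reported as "sgml" could keep going through
onsgmls while the "html5" ones are skipped or sent to another checker.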

I've looked at what the W3C validator uses, and it is the Nu Html Checker:

https://validator.w3.org/nu/about.html
https://github.com/validator/validator/releases/latest

But I'm not sure if this is packaged in Debian in any of its flavours.

I have searched https://packages.debian.org/search?keywords=html5 but none of
the results looks like a command-line tool that we could call instead of
onsgmls.

So I don't know what to do at this point.

On my local machine, I have downloaded the vnu.jar file from the latest Nu
checker release, tried to validate files with it, and it works. But I don't
know if asking DSA to install openjdk on www-master and including a copy of
vnu.jar in our cron scripts is good and/or elegant.
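
For reference, the kind of call I mean looks roughly like the sketch below.
The location in VNU_JAR and the function name validate_html5 are assumptions
for illustration only; --errors-only and --format are options of the vnu.jar
command line:

```shell
#!/bin/sh
# Assumption: VNU_JAR points at a locally downloaded copy of vnu.jar.
VNU_JAR="${VNU_JAR:-$HOME/vnu/vnu.jar}"

# validate_html5 FILE...: run the Nu checker on the given files.
# --errors-only hides warnings; --format gnu prints one message per line.
# The exit status is non-zero when the checker reports errors.
validate_html5() {
    java -jar "$VNU_JAR" --errors-only --format gnu "$@"
}
```

It could then be called as: validate_html5 index.en.html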

Opinions, advice and patches are very welcome.

Meanwhile, I guess we can modify 999Xvalidate to add file exclusions and
exclude, for now, /index.*.html, and later the few other files we have with
HTML5 tags. I don't know how to exclude the index.*.html files in the top
folder only and not in subfolders, but I guess playing with find -wholename
and -prune will do the trick (if you know how, please go ahead).
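
One possible way to do that exclusion, as a sketch only: it uses -path (the
portable spelling of -wholename) and needs no -prune, because only files are
excluded, not directories. The function name list_pages is hypothetical and
is not part of 999Xvalidate:

```shell
#!/bin/sh
# list_pages DIR: print every *.html file under DIR except the
# index.*.html files that sit directly in DIR itself.
list_pages() {
    (
        cd "$1" || exit 1
        # "! -path './*/*'" is true only for entries directly under ".",
        # so the parenthesised test matches top-level index.*.html only.
        find . -type f -name '*.html' \
            ! \( -name 'index.*.html' ! -path './*/*' \)
    )
}
```

Run from the top folder, this would skip ./index.en.html but still list a
file such as ./News/index.en.html.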

Kind regards,
-- 
Laura Arjona
https://wiki.debian.org/LauraArjona
