Hello,

A recent thread on debian-science@ [1] motivated me to look deeper into enforcing quality standards for the debian/upstream/metadata files (a.k.a. DEP 12) we ship with Debian packages. I learnt that lintian already runs a YAML syntax check on debian/upstream/metadata files, but to my knowledge no further validation is performed. I have therefore developed a formal validation tool [2] to check the contents of these YAML files, mostly the syntax of URLs and of some fields that are defined to correspond to BibTeX as per [3].
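To illustrate the kind of checks meant here, below is a minimal Python sketch. It is not the code of [2]; the function names (check_url, check_reference), the accepted URL schemes, the four-digit year rule and the capitalisation of the "Year" and "Month" keys are assumptions made for this example only.

```python
import re
from urllib.parse import urlsplit

# Standard three-letter month abbreviations as recommended by the BibTeX manual.
BIBTEX_MONTHS = {"jan", "feb", "mar", "apr", "may", "jun",
                 "jul", "aug", "sep", "oct", "nov", "dec"}

# Assumption: only these schemes count as acceptable URLs in this sketch.
URL_SCHEMES = ("http", "https", "ftp")


def check_url(field, value):
    """Return a list of messages for a value that is expected to be a URL."""
    messages = []
    if value != value.strip():
        # Typically a trailing newline left over from a YAML multiline string.
        messages.append(f"{field}: URL has leading or trailing whitespace")
        value = value.strip()
    if re.search(r"\s", value):
        messages.append(f"{field}: URL contains whitespace")
    parts = urlsplit(value)
    if parts.scheme not in URL_SCHEMES or not parts.netloc:
        messages.append(f"{field}: value does not look like a URL")
    return messages


def check_reference(ref):
    """Return a list of messages for one Reference mapping (a dict)."""
    messages = []
    year = str(ref.get("Year", ""))
    # Assumption: a plausible year consists of exactly four digits.
    if year and not re.fullmatch(r"\d{4}", year):
        messages.append(f"Reference.Year: '{year}' is not a four-digit year")
    month = str(ref.get("Month", ""))
    if month and month.lower() not in BIBTEX_MONTHS:
        messages.append(
            f"Reference.Month: '{month}' is not a three-letter abbreviation")
    return messages
```

With these rules, check_reference({"Year": "20015"}) reports the year, and check_url("Repository", "https://example.org/\n") reports only the surrounding whitespace. An e-mail address passed to check_url would be reported as not looking like a URL.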
Yesterday I downloaded the debian/upstream/metadata files from all >1300 projects under https://salsa.debian.org/debian-med/ and ran them through my validator. The resulting validation messages can be grouped into the following categories:

1. Highly probable typos: reference year '200' (bagpipe), '20015' (rambo-k), URLs with spaces (bio-tradis) and so on. This category is the one I was actually aiming at.

2. URLs with trailing newlines (adapterremoval, aevol, amos, just to name a few). This is most likely due to YAML appending a newline to the end of multiline strings, which can be avoided quite easily [4]. On the other hand, trailing newlines in URLs could simply be ignored, as they are clearly not intentional.

3. Numeric months in references (augustus, cluster3, haploview, just to name a few). According to [3], "[Reference] keys that correspond to standard BibTeX entries must provide the same content", and the 1988 BibTeX manual from CTAN [5] says "[month:] You should use the standard three-letter abbreviation". Of course "should" is not "must" (in terms of RFC 2119), but machine-reading would be easier with a consistent definition.

4. E-mail addresses in Bug-Submit (htslib, last-align, nanook, just to name a few). Per [3], values of Bug-Submit are URLs. Maybe [3] could be amended to cover e-mail addresses too?

5. Unclear scalar/list status of some fields. Only Screenshots is defined as "One or more URLs", while in reality lists also appear for Webservice (clustalw, primer3) and Bug-Submit (mira, although that one seems broken). Maybe these too could be defined as "One or more URLs"?

6. Empty templates (agat, intake, libpll-2, just to name a few). I would suggest removing the templates, as they do not carry anything meaningful.

7. DOIs written as URLs (fast, libnewuoa). This is debatable, and [5] does not talk about DOIs at all.

As said earlier, I would be interested in implementing formal validation of debian/upstream/metadata in lintian to catch typos and so on.
However, there are a few ambiguities in the specification that I would like to discuss and resolve first.

Please do not take any part of my text as criticism of anyone. Package names are given here only for the purpose of illustration.

[1] https://lists.debian.org/debian-science/2021/01/msg00050.html
[2] https://github.com/merkys/Debian-DEP12, no stable release yet
[3] https://wiki.debian.org/UpstreamMetadata
[4] https://yaml-multiline.info/
[5] https://mirror.datacenter.by/pub/mirrors/CTAN/biblio/bibtex/base/btxdoc.pdf

Best wishes,
Andrius