Hi everyone!

Early this summer I did a bunch of work trying to figure out which Guix sources are preserved by the Software Heritage (SWH) archive. I’m finally ready to share some preliminary results!
https://ngyro.com/pog-reports/2021-10-20/

This report is already quite outdated, though. It only covers commits up to the end of May, and the sources were checked against the SWH archive sometime in June. I’m sharing it now to avoid any further delays.

What’s cool is that the report is automated. Next on my list is to update the database and generate a new report. Then, we can compare the results and see if we are improving. (My read on the results so far is that improving “sources.json” will yield big improvements, but we might not be able to get to that before the next report.)

The report itself only provides a very high-level overview. If you want to check on specifics, you will have to download the database. There’s a link at the bottom of the report as well as a link to a detailed schema definition.

Anyone interested in making some sense of the 5,043 known missing sources is encouraged to look there. However, I can say from my own investigation that a lot of them are kinda boring. For instance, 3,435 are from crates.io, CRAN, Hackage, Bioconductor, and CPAN:

    select count(*)
    from fods join fod_references using (fod_id)
    where not is_in_swh
      and (reference like '%crates.io%'
           or reference like '%/cran/%'
           or reference like '%hackage%'
           or reference like '%/bioconductor.%'
           or reference like '%/cpan/%');
    => 3435

It’s surprising to me that SWH is not already getting these from “sources.json”. I picked an arbitrary one, “rust-quote-0.6”, and it’s simply not in “sources.json”. On the other hand, I bet SWH would like a crates.io (and CRAN, etc.) loader, too.

Another approach is to check Git sources:

    select count(*)
    from fods join fod_references using (fod_id)
    where not is_in_swh
      and reference like '(git-reference%';
    => 336

There are fewer, but they might be more interesting. Just be sure to check that they haven’t made it into the SWH archive since June.
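For anyone who would rather list those Git sources than count them, here is a minimal sketch in Python. It assumes the downloaded database is an SQLite file, which this message does not actually state. The toy tables and sample rows below are hypothetical stand-ins inferred only from the queries above; the schema definition linked from the report is the authoritative layout.

```python
import sqlite3

# Same filter as the Git query above, but returning the references
# themselves instead of a count.
MISSING_GIT_SOURCES = """
    select reference
    from fods join fod_references using (fod_id)
    where not is_in_swh
      and reference like '(git-reference%'
    order by reference
"""


def missing_git_references(conn):
    """Return the references of Git sources marked as not in SWH."""
    return [row[0] for row in conn.execute(MISSING_GIT_SOURCES)]


def toy_database():
    """Build a tiny in-memory database for illustration.

    The column placement and the sample references are assumptions,
    not the real schema or real data.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table fods (fod_id integer primary key, is_in_swh integer);
        create table fod_references (fod_id integer, reference text);
    """)
    conn.executemany(
        "insert into fods values (?, ?)",
        [(1, 0), (2, 1), (3, 0)],
    )
    conn.executemany(
        "insert into fod_references values (?, ?)",
        [
            (1, '(git-reference (url "https://example.org/a.git"))'),
            (2, '(git-reference (url "https://example.org/b.git"))'),
            (3, "https://crates.io/api/v1/crates/example/0.1.0/download"),
        ],
    )
    return conn


if __name__ == "__main__":
    for reference in missing_git_references(toy_database()):
        print(reference)
```

With the real data, `sqlite3.connect` would be given the path to the downloaded file instead of `":memory:"`. Each reference it returns would still need to be re-checked against the current SWH archive, since the database only reflects the archive as of June.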
For instance, I just checked “asciidoc@9.1.0” and learned that the database has “NOT is_in_swh”, but it is now in the SWH archive. So, caveat emptor, I guess. Maybe it would be wise to wait for a more recent report before diving in.

One other way to help would be to suggest improvements to the report. I don’t want to fiddle with it too much, but if there is some simple graph or table or list that should be there, I’m happy to give it a go.

-- Tim