Hey,

Ludovic Courtès <l...@gnu.org> writes:

> Timothy Sample <samp...@ngyro.com> writes:
>
>> Early this summer I did a bunch of work trying to figure out which Guix
>> sources are preserved by the SWH archive. I’m finally ready to share
>> some preliminary results!
>>
>> https://ngyro.com/pog-reports/2021-10-20/
>>
>> This report is already quite outdated, though. It only covers commits
>> up to the end of May, and sometime in June is when the sources were
>> checked against the SWH archive. I’m sharing it now to avoid any
>> further delays.
>
> This is truly awesome! (Did you manage to grab all that info with the
> default rate limit?!)

Yes, but I have another trick: the “known” endpoint [1]. If you
already know the SWHIDs you want to check, you can check 1,000 per call.
With the anonymous rate limit, I can check 120,000 every hour, which is
plenty.

[1] https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html#get--api-1-content-known-(sha1)[,(sha1),%20...,(sha1)]-

> I can’t wait for the updated report now that Simon and yourself have
> identified that SWHID computation bug!

I’m computing SWHIDs while writing this. Not long now!

> Some of our <git-reference> refer to tags, not commits. How do you
> determine whether they’re saved?

The short answer is “elbow grease”. Basically, I’m taking a “work
harder, not smarter” approach. :p I go out and obtain the source,
verify it with Guix’s hash, and then compute the SWHID. This is another
thing we could move to the CI infrastructure, but I think there might be
some hiccoughs. For git-references, I believe we can’t just compute the
ID after the download derivation – we would have to change the download
derivation itself. Maybe add an ‘swhid’ output? It’s a little more
complicated than just throwing up some scripts, anyway.

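To make the "compute the SWHID, then check in batches" idea above concrete, here is a minimal Python sketch. It is not the code used for the report. It only covers single-file (content) SWHIDs, which use the same hash as a Git blob. A git-reference would need a directory identifier, which is more involved. The function names are made up for illustration, the figure of 120 calls per hour is simply inferred from the numbers above (120,000 / 1,000), and no request is actually sent to the archive.

```python
import hashlib
from typing import Iterable, Iterator, List

BATCH_SIZE = 1_000      # SWHIDs per call, as stated above
CALLS_PER_HOUR = 120    # inferred from 120,000 / 1,000; an assumption


def content_swhid(data: bytes) -> str:
    """Return the content SWHID of `data` (same hash as a Git blob)."""
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def batches(swhids: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield lists of at most `size` SWHIDs, one list per API call."""
    batch: List[str] = []
    for swhid in swhids:
        batch.append(swhid)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def hours_needed(count: int, size: int = BATCH_SIZE,
                 calls_per_hour: int = CALLS_PER_HOUR) -> float:
    """Estimate how long checking `count` SWHIDs takes at the rate limit."""
    calls = -(-count // size)  # ceiling division
    return calls / calls_per_hour


if __name__ == "__main__":
    print(content_swhid(b""))
    print(hours_needed(120_000))
```

Each list produced by `batches` would become the payload of one call to the “known” endpoint. The request format itself is left out here on purpose.
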
> ‘guix lint -c archival’ uses ‘lookup-origin-revision’, which is a good
> approximation, but it’s not 100% reliable because tags can be modified
> and that procedure only tells you that a same-named tag was found, not
> that it’s the commit you were expecting. (And really, we should stop
> referring to tags.)

Like zimoun said elsewhere in this thread, having an explicit mapping
from Guix hash to SWHID will improve reliability quite a bit. It’s hard
to get to 100%, though! With the reports, we will eventually be able to
check everything. However, there’s still a small possibility of bugs
and false positives.

Ultimately, I’m hoping the reports will help detect small problems (some
specific source is missing) and guide our efforts on big problems (xz
support in Disarchive or support for more version control systems,
etc.).


-- Tim