Re: Preservation of Guix (PoG) report 2023-03-13

2023-03-22 Thread Ludovic Courtès
Hello!

Timothy Sample  skribis:

[...]

>>> • Subversion support (for TeX-based documentation stuff, I guess)
>>
>> For the interested reader, details for helping in the implementation:
>>
>> https://issues.guix.gnu.org/issue/43442#9
>> https://issues.guix.gnu.org/issue/43442#11
>
> Fantastic.  That looks very promising!
>
>> However, it would ease the whole dance if SWH would consider storing and
>> exposing NAR hashes on their side.  As discussed here:
>>
>> https://gitlab.softwareheritage.org/swh/meta/-/issues/4538
>
> This would be nice, yes.

Good news is progress is happening on these fronts…

Ludo’.



Re: Preservation of Guix (PoG) report 2023-03-13

2023-03-18 Thread Timothy Sample
Hi Ludo,

Ludovic Courtès  writes:

> Do you think this could be turned into a Guix System service, with an
> eye towards making it run on project infrastructure?

I need to revisit what you did with Disarchive and Cuirass.  The process
for the PoG report is very similar.  I can’t jump into it right away,
but I agree that it would be better and I hope to work on it eventually.

> Subversion support should probably be high-priority because TeX Live is
> deep down in the dependency graph so if we ever lose it, we’re doomed.

I will start processing Subversion references on my end.  That way, we
will at least know how many of them are in the SWH archive.


-- Tim



Re: Preservation of Guix (PoG) report 2023-03-13

2023-03-18 Thread Timothy Sample
Hey,

Simon Tournier  writes:

> Well, I do not remember whether you also count the ’origin’ objects
> (fixed outputs) that appear as ’inputs’ or ’patches’.  Do you?

I’m quite confident I’m getting everything.  I’ll describe my approach,
because I’m happy with it.  :)

The Guix package graph exists twice, essentially.  There’s the
high-level representation made up of packages, origins, gexps, etc.
Then, there is the low-level representation which is just derivations.
The high-level representation has nice metadata and makes sense to
humans, while the low-level representation is easy to traverse.

AFAICT, there’s no generic way to traverse the high-level
representation.  Every lowerable object has complete control over how it
references other lowerable objects, and is not obliged to provide any means
of listing those references.  That is, there’s no ‘lowerable-inputs’
procedure or anything like that.  (We have ‘bag-node-edges’ in ‘(guix
scripts graph)’, but it doesn’t cover everything.)

What I do for the report is traverse (as best I can) the high-level
representation and construct a map from derivations to origin objects.
Then, I traverse the low-level representation to find all the
fixed-output derivations.  Finally, I use the map to look up origin
objects for each fixed-output derivation.  If I miss an origin object,
the fixed-output derivation still gets recorded.  It will show up in the
report as “unknown” until I investigate why it’s missing and correct it.
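
A minimal sketch of that two-pass idea, written in Python with toy data rather than in Guile against real derivations.  Everything in it is an assumption made for illustration: the dictionary layout and the names ‘fixed_output_derivations’, ‘label_sources’, and ‘origin_map’ are hypothetical and are not taken from the actual report code.

```python
from collections import deque


def fixed_output_derivations(graph, roots):
    """Walk the low-level graph breadth-first and collect fixed-output nodes.

    `graph` maps a derivation name to a dict with an "inputs" list and a
    "fixed_output" flag (a toy stand-in for real derivations).
    """
    seen, found = set(), []
    queue = deque(roots)
    while queue:
        drv = queue.popleft()
        if drv in seen:
            continue
        seen.add(drv)
        node = graph[drv]
        if node["fixed_output"]:
            found.append(drv)
        queue.extend(node["inputs"])
    return found


def label_sources(graph, roots, origin_map):
    """Pair each fixed-output derivation with its origin, or "unknown"."""
    return {drv: origin_map.get(drv, "unknown")
            for drv in fixed_output_derivations(graph, roots)}


# Toy low-level graph: one package derivation with two source derivations.
graph = {
    "bash.drv": {"inputs": ["bash-src.drv", "patch.drv"], "fixed_output": False},
    "bash-src.drv": {"inputs": [], "fixed_output": True},
    "patch.drv": {"inputs": [], "fixed_output": True},
}

# Map built by the (imperfect) high-level pass; it missed the patch origin.
origin_map = {"bash-src.drv": "origin: bash tarball"}

report = label_sources(graph, ["bash.drv"], origin_map)
print(report)
```

The point of the design is that the low-level walk is the source of truth: a source that the high-level pass misses is still recorded, only without metadata.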

There are currently 56 (out of 54K) fixed-output derivations that are
missing metadata in my database.  A fair few of them have to do with
Telegram, Thunderbird, and UBlock Origin.  All it means is that those
packages have sneaky ways of referencing origins that my code can’t
handle.  It’s harmless and easy to fix as time permits.

>> Over the whole set, 77.1% are known to be safely tucked away in the
>> Software Heritage archive.  But it’s actually much better than that.  If
>> we only look at the most recent sampled commit (from Sunday the 5th),
>> that number becomes 87.4%, which is starting to look pretty good!
>
> Just to point out: the new nixguix loader [1] is still in SWH staging and
> not yet deployed, IIRC.  It will not change our coverage much, but it
> should fix some corner cases.
>
> 1: 

Good to know!

>>  This is kinda like an automated version of Simon’s recent
>> investigation.
>
> Neat!  Note that I also wanted to check the SWH capacity for cooking,
> not only checking the end points.  For instance, it allowed us to discover
> a mismatch due to uncovered CR/LF normalization; now fixed with:
> 58f20fa8181bdcd4269671e1d3cef1268947af3a.

Maybe we need a “chaos monkey mode” for Guix.  It could randomly select
packages to build, randomly pick source code fallback methods, and also
test reproducibility (like “--check”).  You could have a blocklist for
browsers, etc., but otherwise it could pick the odd package to test
thoroughly.  Those of us with the time and inclination could crank up
that knob and get interesting feedback about reproducibility at the cost
of doing a few package builds here and there.
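
To make the idea concrete, here is a rough Python sketch of how such a mode might choose its work.  Nothing like this exists in Guix as far as this message says; the function ‘pick_chaos_jobs’, the fallback labels, and the job dictionary are all hypothetical.

```python
import random


def pick_chaos_jobs(packages, blocklist, fallbacks, count, seed=None):
    """Pick up to `count` random packages that are not blocklisted.

    Each job also gets a randomly chosen source fallback method and a flag
    asking for a reproducibility rebuild (in the spirit of "--check").
    """
    rng = random.Random(seed)
    candidates = sorted(p for p in packages if p not in blocklist)
    chosen = rng.sample(candidates, min(count, len(candidates)))
    return [{"package": p, "fallback": rng.choice(fallbacks), "check": True}
            for p in chosen]


# Hypothetical inputs: package names, a blocklist for heavy builds, and
# made-up labels for the available source fallback methods.
packages = ["bash", "ed", "gmp", "icecat", "hello"]
blocklist = {"icecat"}
fallbacks = ["swh", "disarchive", "substitute-server"]

jobs = pick_chaos_jobs(packages, blocklist, fallbacks, count=2, seed=42)
for job in jobs:
    print(job)
```

The ‘count’ argument plays the role of the knob: raising it trades more local builds for more feedback.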

>> Here’s a rough road map for that based on a glance at the script’s
>> output:
>>
>> • Subversion support (for TeX-based documentation stuff, I guess)
>
> For the interested reader, details for helping in the implementation:
>
> https://issues.guix.gnu.org/issue/43442#9
> https://issues.guix.gnu.org/issue/43442#11

Fantastic.  That looks very promising!

> However, it would ease the whole dance if SWH would consider storing and
> exposing NAR hashes on their side.  As discussed here:
>
> https://gitlab.softwareheritage.org/swh/meta/-/issues/4538

This would be nice, yes.

>>  However, 42% of them are old Bioconductor packages.  They
>> seem to be lost.  It looks like Bioconductor now stores multiple package
>> versions per Bioconductor version [2], but before version 3.15 that was
>> not the case.  As an example, take “ggcyto” from Bioconductor 3.10 [3].
>> We packaged version 1.14.0, and then at some point Bioconductor 3.10
>> switched to version 1.14.1.  We packaged that, too, but now 1.14.0 is
>> gone.
>
> Well, I have not investigated much because it is between December 2019
> and March 2020, so “guix time-machine” does not run smoothly that far back.
>
> First question: do we have the source tarball in Berlin or Bordeaux or
> somewhere else?  If yes, there is hope. :-) Else, it is probably gone
> forever.

Like I wrote, I picked up a handful from Bordeaux, but not much.

> The hope is: https://git.bioconductor.org/packages/ggcyto
>
> If we have the tarball with the correct checksum from commit
> f5f440312d848e12463f0c6f7510a86b623a9e27
>
> +(version "1.14.0")
> +(source
> + (origin
> +   (method url-fetch)
> +   (uri (bioconductor-uri "ggcyto" version))
> +   (sha256
> +(base32
> + "165qszvy5z176h1l3dnjb5dcm279b6bjl5n

Re: Preservation of Guix (PoG) report 2023-03-13

2023-03-16 Thread Ludovic Courtès
Hi Timothy,

Timothy Sample  skribis:

> Allow me to present to you a long-overdue update to the Preservation of
> Guix (PoG) report: .  🎉

Yay, thank you!!

Do you think this could be turned into a Guix System service, with an
eye towards making it run on project infrastructure?

> Over the whole set, 77.1% are known to be safely tucked away in the
> Software Heritage archive.  But it’s actually much better than that.  If
> we only look at the most recent sampled commit (from Sunday the 5th),
> that number becomes 87.4%, which is starting to look pretty good!

Truly impressive!

> I have a few more notes on the report, but I want to put this near the
> top of the message so that people will see it.  :)  I wrote a script
> (see attached) that uses the PoG database to find missing sources on a
> package-by-package basis.  That is, you can run
>
> guix repl specification-to-swhids.scm pog.db bash

Neat!

> • Subversion support (for TeX-based documentation stuff, I guess)
> • bzip2 support for Disarchive (there are 45 bzip2 tarballs)
> • ZIP support for Disarchive (for the 8 ZIP files)
> • lzip support for Disarchive (or a workaround for ed)
> • Fix some issues (gettext is .tar.gz, but something went wrong)
> • Do something with the static bootstrap binaries

Subversion support should probably be high-priority because TeX Live is
deep down in the dependency graph so if we ever lose it, we’re doomed.

Thanks for the update!

Ludo’.



Re: Preservation of Guix (PoG) report 2023-03-13

2023-03-14 Thread Simon Tournier
Hi,

On Mon, 13 Mar 2023 at 19:37, Timothy Sample  wrote:

> Note that you can link to the most recent version of the report using
> .

Awesome! \o/

Well, I do not remember whether you also count the ’origin’ objects
(fixed outputs) that appear as ’inputs’ or ’patches’.  Do you?

Basically, ’package-direct-sources’ from (guix packages).

For instance, see the package ’ntp’,

--8<---cut here---start->8---
(source
 (origin
   (method url-fetch)
   (uri (list (string-append
               "https://www.eecis.udel.edu/~ntp/ntp_spool/ntp4/ntp-"
               (version-major+minor version)
[...]
   (sha256
    (base32 "06cwhimm71safmwvp6nhxp6hvxsg62whnbgbgiflsqb8mgg40n7n"))
   ;; Add an upstream patch to fix build with GCC 10.  Taken from
   ;; .
   (patches (list (origin
                    (method url-fetch)
                    (uri "https://bugs.ntp.org/attachment.cgi?id=1760\
&action=diff&context=patch&collapsed=&headers=1&format=raw")
                    (file-name "ntp-gcc-compat.patch")
                    (sha256
                     (base32
                      "13d28sg45rflc7kqiv30asrhna8n69wlpwx16l65rravgpvp90h2")))
--8<---cut here---end--->8---

or see the package ’tensorflow’,

--8<---cut here---start->8---
(native-inputs
 `(("pkg-config" ,pkg-config)
[...]
   ("boringssl-src"
    ,(let ((commit "ee7aa02")
           (revision "1"))
       (origin
         (method git-fetch)
         (uri (git-reference
               (url "https://boringssl.googlesource.com/boringssl")
               (commit commit)))
         (file-name (string-append "boringssl-0-" revision
                                   (string-take commit 7)
                                   "-checkout"))
         (sha256
          (base32
           "1jf693q0nw0adsic6cgmbdx6g7wr4rj4vxa8j1hpn792fqhd8wgw")
--8<---cut here---end--->8---


> Over the whole set, 77.1% are known to be safely tucked away in the
> Software Heritage archive.  But it’s actually much better than that.  If
> we only look at the most recent sampled commit (from Sunday the 5th),
> that number becomes 87.4%, which is starting to look pretty good!

Just to point out: the new nixguix loader [1] is still in SWH staging and
not yet deployed, IIRC.  It will not change our coverage much, but it
should fix some corner cases.

1: 


>  This is kinda like an automated version of Simon’s recent
> investigation.

Neat!  Note that I also wanted to check the SWH capacity for cooking,
not only checking the end points.  For instance, it allowed us to discover
a mismatch due to uncovered CR/LF normalization; now fixed with:
58f20fa8181bdcd4269671e1d3cef1268947af3a.


> Here’s a rough road map for that based on a glance at the script’s
> output:
>
> • Subversion support (for TeX-based documentation stuff, I guess)

For the interested reader, details for helping in the implementation:

https://issues.guix.gnu.org/issue/43442#9
https://issues.guix.gnu.org/issue/43442#11

However, it would ease the whole dance if SWH would consider storing and
exposing NAR hashes on their side.  As discussed here:

https://gitlab.softwareheritage.org/swh/meta/-/issues/4538


>  However, 42% of them are old Bioconductor packages.  They
> seem to be lost.  It looks like Bioconductor now stores multiple package
> versions per Bioconductor version [2], but before version 3.15 that was
> not the case.  As an example, take “ggcyto” from Bioconductor 3.10 [3].
> We packaged version 1.14.0, and then at some point Bioconductor 3.10
> switched to version 1.14.1.  We packaged that, too, but now 1.14.0 is
> gone.

Well, I have not investigated much because it is between December 2019
and March 2020, so “guix time-machine” does not run smoothly that far back.

First question: do we have the source tarball in Berlin or Bordeaux or
somewhere else?  If yes, there is hope. :-) Else, it is probably gone
forever.

The hope is: https://git.bioconductor.org/packages/ggcyto

If we have the tarball with the correct checksum from commit
f5f440312d848e12463f0c6f7510a86b623a9e27

--8<---cut here---start->8---
+    (version "1.14.0")
+    (source
+     (origin
+       (method url-fetch)
+       (uri (bioconductor-uri "ggcyto" version))
+       (sha256
+        (base32
+         "165qszvy5z176h1l3dnjb5dcm279b6bjl5n5gzz8wfn4xpn8anc8"
--8<---cut here---end--->8---

then we can disassemble it and, using the Git repository, try to
reassemble it from the content in SWH and the metadata in the Disarchive DB.

For sure, it is again another 

Preservation of Guix (PoG) report 2023-03-13

2023-03-13 Thread Timothy Sample
Hi Guix,

It’s been a while!  :)

Allow me to present to you a long-overdue update to the Preservation of
Guix (PoG) report: .  🎉

Note that you can link to the most recent version of the report using
.

What is this?  Well, I added a description to the report itself, but
here’s a brief teaser.  The PoG report shows what we know about the
archival status of the approximately 54K sources (and counting) Guix has
linked to since around the time of the 1.0 release.

For this edition, I took a bit of time to fix the contrast and colours
to be a bit more accessible.  They’re about half as garish as they used
to be, too.

Over the whole set, 77.1% are known to be safely tucked away in the
Software Heritage archive.  But it’s actually much better than that.  If
we only look at the most recent sampled commit (from Sunday the 5th),
that number becomes 87.4%, which is starting to look pretty good!

I have a few more notes on the report, but I want to put this near the
top of the message so that people will see it.  :)  I wrote a script
(see attached) that uses the PoG database to find missing sources on a
package-by-package basis.  That is, you can run

guix repl specification-to-swhids.scm pog.db bash

and it will print a table of all of the transitive sources needed to
build Bash, along with their preservation status.  Here’s a (heavily
edited and snipped to fit an email message) sample of its output:

[... many “stored” inputs]
sha256 0r5p. swh:1:dir:02f7. stored  /gnu/store/.-gmp-6.0.0a.tar.xz
sha256 0c3k. swh:1:dir:6027. stored  /gnu/store/.-mescc-tar.xz
sha256 1r1z. swh:1:dir:6087. stored  /gnu/store/.-bash-2.05b.tar.gz
sha256 14l0. unknown unknown /gnu/store/.-gcc-4.9.4.tar.bz2
sha256 0m2y. unknown unknown /gnu/store/.-ed-1.17.tar.lz
[... more “unknown” inputs]

(I had to pipe the output to “sort -k 4” to have it sorted by status.)

The first two columns are the Guix hash.  The next two columns are the
SWHID (if known) and whether SWH has it (if known).  That last column is
the store filename (which is nice because it usually tells you what it
is we are looking at).  In this sample, you can see that GMP, MesCC
Tools, and Bash are all safe.  However, we don’t know about GCC 4 and
ed.  This is kinda like an automated version of Simon’s recent
investigation [1].  The “unknown” two are due to Disarchive’s lack of
support for those compression formats.  I just wrote this script today
(mind the rough edges), and I’ve learned a lot from trying it on a few
packages.  It’s a little like a terrifying robotic TODO list, since it
shows a lot of problems, but it’s also exciting because solving all the
problems for the Guix package, say, would be a massive leap forward.
Here’s a rough road map for that based on a glance at the script’s
output:

• Subversion support (for TeX-based documentation stuff, I guess)
• bzip2 support for Disarchive (there are 45 bzip2 tarballs)
• ZIP support for Disarchive (for the 8 ZIP files)
• lzip support for Disarchive (or a workaround for ed)
• Fix some issues (gettext is .tar.gz, but something went wrong)
• Do something with the static bootstrap binaries

[1] https://lists.gnu.org/archive/html/guix-devel/2023-02/msg00398.html
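
For anyone who wants to post-process the table, here is a small Python sketch that splits rows like the ones in the sample above into the five columns and groups the store file names by status.  It assumes whitespace-separated columns exactly as in the edited sample (the hashes and SWHIDs there are abbreviated); the real output may differ, and the names ‘parse_row’ and ‘group_by_status’ are hypothetical.

```python
from collections import defaultdict

# Rows copied from the (heavily edited) sample; hashes are abbreviated.
SAMPLE = """\
sha256 0r5p. swh:1:dir:02f7. stored  /gnu/store/.-gmp-6.0.0a.tar.xz
sha256 0c3k. swh:1:dir:6027. stored  /gnu/store/.-mescc-tar.xz
sha256 1r1z. swh:1:dir:6087. stored  /gnu/store/.-bash-2.05b.tar.gz
sha256 14l0. unknown unknown /gnu/store/.-gcc-4.9.4.tar.bz2
sha256 0m2y. unknown unknown /gnu/store/.-ed-1.17.tar.lz
"""


def parse_row(line):
    """Split one row into hash algorithm, hash, SWHID, status, store file."""
    algorithm, digest, swhid, status, store_file = line.split()
    return {"algorithm": algorithm, "hash": digest, "swhid": swhid,
            "status": status, "store_file": store_file}


def group_by_status(text):
    """Return a dict mapping each status to the store files that have it."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if line.strip():
            row = parse_row(line)
            groups[row["status"]].append(row["store_file"])
    return dict(groups)


groups = group_by_status(SAMPLE)
for status, files in groups.items():
    print(status, len(files))
```

Grouping like this does the same job as piping through “sort -k 4”, but it also gives a per-status count.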

If you want to try it out for yourself, you’ll need to download the
database .  Heads up:
it’s just over 200M, and my server can be pretty slow.

One other stray thought: the script should work with the time machine,
so you can check on packages from the past.  I didn’t test it, but I bet
it’s fine.

Okay.  Here are the rest of my notes about the report itself.

One thing that jumps out at me is 189 Git sources that SWH does not
have.  Usually they have basically all of the non-recursive Git sources.
It’s something to look into.

I also took a quick peek at the 1.9K “unknown” tar-gz sources.  About
39% of them are old Rust crates.  It’s a known problem with
Disarchive.  However, 42% of them are old Bioconductor packages.  They
seem to be lost.  It looks like Bioconductor now stores multiple package
versions per Bioconductor version [2], but before version 3.15 that was
not the case.  As an example, take “ggcyto” from Bioconductor 3.10 [3].
We packaged version 1.14.0, and then at some point Bioconductor 3.10
switched to version 1.14.1.  We packaged that, too, but now 1.14.0 is
gone.  I know it’s been discussed before, but I can’t remember what the
conclusion was.  Are these just gone forever?  I’m doing another pass
through all of them and recovering a few from the bordeaux substitute
server, but only a handful.

[2] https://bioconductor.org/packages/3.15/bioc/src/contrib/Archive/DiffBind/
[3] https://bioconductor.org/packages/3.10/bioc/html/ggcyto.html
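
To put rough numbers on those shares, here is a short Python calculation.  It assumes exactly 1,900 “unknown” tar-gz sources, which is only the rounded “1.9K” figure from above, so the results are approximations and not counts taken from the database.

```python
# "1.9K" from the text, taken as exactly 1900 for the sake of the estimate.
total_unknown = 1900

# Percentage shares quoted in the text.
shares = {"old Rust crates": 39, "old Bioconductor packages": 42}

# Integer arithmetic keeps the estimate free of floating-point noise.
counts = {name: total_unknown * pct // 100 for name, pct in shares.items()}
other = total_unknown - sum(counts.values())

for name, count in counts.items():
    print(f"{name}: about {count}")
print(f"everything else: about {other}")
```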

That’s all for now.  Enjoy the update and the script!


-- Tim

;;; specification-to-swhids.scm
;;; Copyright © 2023 Timothy Sample 
;;;