Re: [swh-devel] Call for public review - SWH Nix/GNU Guix stack
Hello,

> Is that because the changes you describe were done after the staging
> data was loaded or is it a bug?

Our staging instance inherits its append-only property from our main
archive. On staging (used for "prototypes", soon-to-be-deployed new
features and so on), that makes it hard to see through the "old bug"
noise. These are old origins that were initially ingested with a first
version of the lister (which got iteratively fixed).

@anlambert made a pass this week in docker (from scratch) to check (thx ;)

> Excellent! I believe this addresses a problem we recently reported
> regarding tarballs published with our own content-addressed URLs, which
> look like:
>
>   https://bordeaux.guix.gnu.org/file/BiocNeighbors_1.20.0.tar.gz/sha256/0a5wg099fgwjbzd6r3mr4l02rcmjqlkdcz1w97qzwx1mir41fmas

As a result, he actually enhanced the listing so that the URL mentioned
above is treated correctly, based on the data in the URL. (@me: that
needs a bump in deployment [for next week].)

Earlier on, I was referring to another heuristic, which uses a HEAD query
to parse header information [if any]. As that specific URL does not
provide any, it passed through.

Note: cc-ed jul...@malka.sh instead of commun...@nixos.org (as asked in
the thread)

Cheers,
--
tony / Antoine R. Dumont (@ardumont)
gpg fingerprint BF00 203D 741A C9D5 46A8 BE07 52E2 E984 0D10 C3B8

Timothy Sample writes:

> Hello,
>
> This is very exciting work, thanks everyone!
>
> "Antoine R. Dumont (@ardumont)" writes:
>
>> FWIW, in the "new" lister [1] implementation, there are a bunch of extra
>> computations done [1] to try and resolve those situations. It now tries
>> to fetch more information from the upstream server (e.g. crates URLs
>> which end in /download, ...). It's probably not exhaustive though.
>>
>> [1]
>> https://gitlab.softwareheritage.org/swh/devel/swh-lister/-/blob/master/swh/lister/nixguix/lister.py?ref_type=heads
>
> I was just looking over some of the new results and noticed that crates
> are being treated as ‘content’ rather than ‘tarball-directory’. E.g.:
>
> https://webapp.staging.swh.network/browse/content/sha1_git:e05b33b2d3b40254ceaaa5fe4c501d1b15c75ea6/?origin_url=https://crates.io/api/v1/crates/diff/0.1.12/download
>
> Is that because the changes you describe were done after the staging
> data was loaded or is it a bug?
>
>
> -- Tim
Re: [swh-devel] Call for public review - SWH Nix/GNU Guix stack
Hello,

This is very exciting work, thanks everyone!

"Antoine R. Dumont (@ardumont)" writes:

> FWIW, in the "new" lister [1] implementation, there are a bunch of extra
> computations done [1] to try and resolve those situations. It now tries
> to fetch more information from the upstream server (e.g. crates URLs
> which end in /download, ...). It's probably not exhaustive though.
>
> [1]
> https://gitlab.softwareheritage.org/swh/devel/swh-lister/-/blob/master/swh/lister/nixguix/lister.py?ref_type=heads

I was just looking over some of the new results and noticed that crates
are being treated as ‘content’ rather than ‘tarball-directory’. E.g.:

https://webapp.staging.swh.network/browse/content/sha1_git:e05b33b2d3b40254ceaaa5fe4c501d1b15c75ea6/?origin_url=https://crates.io/api/v1/crates/diff/0.1.12/download

Is that because the changes you describe were done after the staging
data was loaded or is it a bug?

-- Tim
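To make the difference between the two kinds of URL concrete, here is a minimal sketch of the two heuristics discussed in this thread: looking for an archive-like name in the URL itself, then falling back to a file name taken from response headers. It is not the swh-lister code; the function names, the suffix list, and the use of a Content-Disposition header are assumptions made for illustration. Only the ‘content’ and ‘tarball-directory’ labels come from the thread.

```python
import re
from email.message import Message
from typing import Mapping, Optional
from urllib.parse import urlparse

# Suffixes treated as "archive-like" in this sketch (an assumption,
# not the list used by the real lister).
ARCHIVE_RE = re.compile(r"\.(tar(\.[a-z0-9]+)?|tgz|tbz2|txz|zip)$", re.IGNORECASE)


def looks_like_archive(name: str) -> bool:
    """True if the name ends with one of the archive suffixes above."""
    return ARCHIVE_RE.search(name) is not None


def filename_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Extract a file name from a Content-Disposition header, if present.

    The lookup is case-sensitive here; a real HTTP client would normalise
    header names.
    """
    value = headers.get("Content-Disposition")
    if not value:
        return None
    msg = Message()
    msg["Content-Disposition"] = value
    return msg.get_filename()


def classify(url: str, headers: Optional[Mapping[str, str]] = None) -> str:
    """Return 'tarball-directory' or 'content' for a source URL.

    `headers` stands for whatever a HEAD request returned; it is passed in
    so that the sketch stays offline.
    """
    # Heuristic 1: any path segment, not only the last, may carry the name.
    segments = [s for s in urlparse(url).path.split("/") if s]
    if any(looks_like_archive(s) for s in segments):
        return "tarball-directory"
    # Heuristic 2: fall back to the file name advertised by the server.
    name = filename_from_headers(headers or {})
    if name and looks_like_archive(name):
        return "tarball-directory"
    return "content"
```

With no headers available, a URL such as the crates.io one above falls through both checks and is classified as ‘content’, whereas the content-addressed Guix URLs quoted elsewhere in the thread are caught by the first check because an earlier path segment ends in .tar.gz.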
Re: [swh-devel] Call for public review - SWH Nix/GNU Guix stack
Hey Antoine,

"Antoine R. Dumont (@ardumont)" skribis:

>>> My understanding is that so far these URLs were ignored by the
>>> lister/loader because they didn’t end in *.tar.*.⁰
>
> FWIW, in the "new" lister [1] implementation, there are a bunch of extra
> computations done [1] to try and resolve those situations. It now tries
> to fetch more information from the upstream server (e.g. crates URLs
> which end in /download, ...). It's probably not exhaustive though.
>
> [1]
> https://gitlab.softwareheritage.org/swh/devel/swh-lister/-/blob/master/swh/lister/nixguix/lister.py?ref_type=heads
>
>>> I’m sure Simon Tournier (Cc’d) already discussed with others at SWH
>>> how crucial it is for us to be able to query content by nar hash.
>
>> So yeah, we are looking forward to some ExtID interface. :-)
>
> Yes, and there is an ongoing merge request about the new interface [2]
>
> [2]
> https://gitlab.softwareheritage.org/swh/devel/swh-web/-/merge_requests/1220

That is excellent news on both counts, thank you!

Ludo’.
RE: [swh-devel] Call for public review - SWH Nix/GNU Guix stack
>> My understanding is that so far these URLs were ignored by the
>> lister/loader because they didn’t end in *.tar.*.⁰

FWIW, in the "new" lister [1] implementation, there are a bunch of extra
computations done [1] to try and resolve those situations. It now tries
to fetch more information from the upstream server (e.g. crates URLs
which end in /download, ...). It's probably not exhaustive though.

[1]
https://gitlab.softwareheritage.org/swh/devel/swh-lister/-/blob/master/swh/lister/nixguix/lister.py?ref_type=heads

>> I’m sure Simon Tournier (Cc’d) already discussed with others at SWH
>> how crucial it is for us to be able to query content by nar hash.

> So yeah, we are looking forward to some ExtID interface. :-)

Yes, and there is an ongoing merge request about the new interface [2]

[2]
https://gitlab.softwareheritage.org/swh/devel/swh-web/-/merge_requests/1220

Cheers,
tony / Antoine R. Dumont (@ardumont)
gpg fingerprint BF00 203D 741A C9D5 46A8 BE07 52E2 E984 0D10 C3B8

Simon TOURNIER writes:

> Hi,
>
>> The initial NixGuix loader (currently in production) lists and loads
>> origins from a manifest, ignoring the specific origins mentioned above. The
>> new stack will be able to ingest those origins. It will also optionally
>> associate, if present, a NAR hash (specific intrinsic identifier to Nix and
>> Guix) to what’s called an ExtID (SWH side).
>
> Cool! Thank you.
>
>> The SWH API reading side of the ExtID, though, is still work to be done.
>
> In short, currently Guix relies on the SWH API for resolving from
> “something” to SWHID, where “something” can be:
>
>  + Git label tag + url
>  + Git commit hash
>  + plain url
>
> Well, the situation is in good shape IMHO – I do not have recent
> numbers, say all is fine for 75% of all Guix packages and for 90% of
> Guix packages coming from Git repositories – but still, we have
> examples where “Git label tag + url” fails. For one instance, see [1],
> pointed to by [2].
>
> The information – history of history – is there in SWH but it would
> require the Guix side to parse the snapshot information and extract
> what it can, trying several SWH snapshots until one matches. Something
> like that. Chance of success until completion? Weak. :-)
>
> Moreover, what about the missing 25%? They are Guix packages coming
> from Mercurial repositories or from Subversion repositories or some
> others.
>
> Back in October 2020, we had a discussion [3] about sending a save
> request for packages using SVN checkouts, but at the time we did not
> have a clear path for retrieving them. Then in March 2023, maybe a path
> for retrieving appeared with this discussion [4]… but still many hacks
> are required [5].
>
> Again, the information is there in SWH but it would require the Guix
> side to parse the snapshot information and extract what it can, trying
> several SWH snapshots until one matches. Something like that.
> Chance of success until completion? Weak. :-)
>
> If only one source is missing, the whole castle potentially falls down.
> Somehow, a dictionary from ExtID as nar hash to SWHID would help make
> the castle more robust. :-)
>
> The SWH archive coverage of Guix packages would not be 75% because we, on
> Guix side, are not able to know or retrieve these missing 25%. Such a
> dictionary could reinforce the bridge between reproducible computational
> environment and archiving, IMHO.
>
> So yeah, we are looking forward to some ExtID interface. :-)
>
> Cheers,
> simon
>
>
> 1: https://issues.guix.gnu.org/66015#0-lineno53
> 2: https://gitlab.softwareheritage.org/swh/devel/swh-loader-git/-/issues/4751#note_148587
> 3: https://issues.guix.gnu.org/43442#9
> 4: https://sympa.inria.fr/sympa/arc/swh-devel/2023-03/msg9.html
> 5: https://issues.guix.gnu.org/43442#13
RE: [swh-devel] Call for public review - SWH Nix/GNU Guix stack
Hi,

> The initial NixGuix loader (currently in production) lists and loads
> origins from a manifest, ignoring the specific origins mentioned above. The
> new stack will be able to ingest those origins. It will also optionally
> associate, if present, a NAR hash (specific intrinsic identifier to Nix and
> Guix) to what’s called an ExtID (SWH side).

Cool! Thank you.

> The SWH API reading side of the ExtID, though, is still work to be done.

In short, currently Guix relies on the SWH API for resolving from
“something” to SWHID, where “something” can be:

 + Git label tag + url
 + Git commit hash
 + plain url

Well, the situation is in good shape IMHO – I do not have recent
numbers, say all is fine for 75% of all Guix packages and for 90% of
Guix packages coming from Git repositories – but still, we have
examples where “Git label tag + url” fails. For one instance, see [1],
pointed to by [2].

The information – history of history – is there in SWH but it would
require the Guix side to parse the snapshot information and extract
what it can, trying several SWH snapshots until one matches. Something
like that. Chance of success until completion? Weak. :-)

Moreover, what about the missing 25%? They are Guix packages coming
from Mercurial repositories or from Subversion repositories or some
others.

Back in October 2020, we had a discussion [3] about sending a save
request for packages using SVN checkouts, but at the time we did not
have a clear path for retrieving them. Then in March 2023, maybe a path
for retrieving appeared with this discussion [4]… but still many hacks
are required [5].

Again, the information is there in SWH but it would require the Guix
side to parse the snapshot information and extract what it can, trying
several SWH snapshots until one matches. Something like that.
Chance of success until completion? Weak. :-)

If only one source is missing, the whole castle potentially falls down.
Somehow, a dictionary from ExtID as nar hash to SWHID would help make
the castle more robust. :-)

The SWH archive coverage of Guix packages would not be 75% because we, on
Guix side, are not able to know or retrieve these missing 25%. Such a
dictionary could reinforce the bridge between reproducible computational
environment and archiving, IMHO.

So yeah, we are looking forward to some ExtID interface. :-)

Cheers,
simon

1: https://issues.guix.gnu.org/66015#0-lineno53
2: https://gitlab.softwareheritage.org/swh/devel/swh-loader-git/-/issues/4751#note_148587
3: https://issues.guix.gnu.org/43442#9
4: https://sympa.inria.fr/sympa/arc/swh-devel/2023-03/msg9.html
5: https://issues.guix.gnu.org/43442#13
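The resolution order argued for in this message can be sketched as follows: consult a nar hash to SWHID dictionary first, and only fall back to the existing strategies (tag + url, commit hash, plain url) when the dictionary has no entry. This is a minimal illustration under stated assumptions, not Guix or SWH code: `resolve_swhid`, `extid_index` and the fallback callables are hypothetical names, and the dictionary stands in for whatever the future ExtID interface would return.

```python
from typing import Callable, Iterable, Mapping, Optional

# A fallback is any zero-argument callable returning a SWHID or None,
# e.g. a closure wrapping a "tag + url" or "plain url" lookup (hypothetical).
Fallback = Callable[[], Optional[str]]


def resolve_swhid(
    nar_hash: str,
    extid_index: Mapping[str, str],
    fallbacks: Iterable[Fallback],
) -> Optional[str]:
    """Resolve a source to a SWHID, preferring the nar-hash dictionary.

    `extid_index` maps a nar hash to a SWHID; it is a stand-in for an
    ExtID lookup service, not an existing API.
    """
    swhid = extid_index.get(nar_hash)
    if swhid is not None:
        return swhid
    # No dictionary entry: try the slower, less reliable strategies in order.
    for fallback in fallbacks:
        swhid = fallback()
        if swhid is not None:
            return swhid
    return None
```

The point of the ordering is that a dictionary hit does not depend on the version-control system the source came from, so Mercurial and Subversion sources would resolve the same way as Git ones.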
Re: [swh-devel] Call for public review - SWH Nix/GNU Guix stack
Hi Benoit and all!

(Cc: guix-devel rather than gnu-system-discuss.)

Benoit Chauvet skribis:

> Regarding the Nix/GNU Guix stack, Software Heritage will soon be ready to
> support the ingestion of specific versioned files, tarballs, git, hg, svn
> source code listed in their respective manifests [1] (as origins). The new
> lister (and extra loaders, namely
> {Content|Directory|GitCheckout|SvnExport|HgCheckout}Loader) have been
> deployed in our staging infrastructure [2].

Excellent! I believe this addresses a problem we recently reported
regarding tarballs published with our own content-addressed URLs, which
look like:

  https://bordeaux.guix.gnu.org/file/BiocNeighbors_1.20.0.tar.gz/sha256/0a5wg099fgwjbzd6r3mr4l02rcmjqlkdcz1w97qzwx1mir41fmas

My understanding is that so far these URLs were ignored by the
lister/loader because they didn’t end in *.tar.*.⁰

> The initial NixGuix loader (currently in production) lists and loads
> origins from a manifest, ignoring the specific origins mentioned above. The
> new stack will be able to ingest those origins. It will also optionally
> associate, if present, a NAR hash (specific intrinsic identifier to Nix and
> Guix) to what’s called an ExtID (SWH side).
> The SWH API reading side of the ExtID, though, is still work to be done.
>
> On staging, we have currently ingested origins that were listed from the
> GNU Guix manifest [3].
>
> We have already improved the implementations after discussing multiple
> limitations encountered along the way with the Guix community [4].

I’m sure Simon Tournier (Cc’d) already discussed with others at SWH
how crucial it is for us to be able to query content by nar hash.
Essentially, it would fill the gap that currently prevents us from
retrieving Subversion checkouts from SWH¹ and more generally
complicates retrieval of anything not referenced by a Git hash.

So obviously, we’re looking forward to that ExtID interface for SWH.

Thanks for sharing this status update; the news and perspectives are all
exciting!

Ludo’.

⁰ https://issues.guix.gnu.org/39885#15-lineno60
¹ https://issues.guix.gnu.org/43442#13-lineno37