Re: Identical files across subsequent package revisions

2021-01-06 Thread Ludovic Courtès
Hi, Ludovic Courtès wrote: > It could go along these lines: > > 1. GET /digest/xyz-emacs; the digest contains a list of file/hash > pairs essentially; > > 2. traverse digest and hardlink the files already in /gnu/store/.links > to the target directory; > > 3. pipeline-GET the r
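Steps 1 and 2 of the quoted outline can be sketched roughly as below. This is only an illustration in Python, not code from the thread: the function name `link_known_files`, the shape of the digest (an iterable of relative-path/hash pairs), and the assumption that the links directory holds one file per content hash, named after that hash, are all hypothetical. Step 3 is cut off in the snippet and is not covered.

```python
import os
from pathlib import Path


def link_known_files(digest, links_dir, target_dir):
    """Hardlink every digest entry whose content is already known locally.

    digest     -- iterable of (relative_path, content_hash) pairs
                  (hypothetical format, assumed for this sketch)
    links_dir  -- directory assumed to hold one file per content hash,
                  named after that hash
    target_dir -- directory being populated

    Returns the relative paths that were not found and would still
    have to be downloaded.
    """
    missing = []
    for rel_path, content_hash in digest:
        source = Path(links_dir) / content_hash
        dest = Path(target_dir) / rel_path
        if source.is_file():
            # Content already present: create a hard link instead of
            # fetching the file again.
            dest.parent.mkdir(parents=True, exist_ok=True)
            os.link(source, dest)
        else:
            missing.append(rel_path)
    return missing
```

The returned list is what a client would then have to request from the server; how that request is made is left open here.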

Re: Identical files across subsequent package revisions

2021-01-06 Thread Ludovic Courtès
Hi! pukkamustard wrote: > Your research inspired me to conduct some experiments towards > de-duplication. > > For two similar packages (emacs-27.1 and emacs-26.3) I was able to > de-duplicate ~12% using EROFS and ERIS. Still far from the ~85% > similarity, but an attempt I'd like to share.

Re: Identical files across subsequent package revisions

2020-12-30 Thread pukkamustard
Hi Ludo, Thoughts? :-) Super cool! :) Your research inspired me to conduct some experiments towards de-duplication. For two similar packages (emacs-27.1 and emacs-26.3) I was able to de-duplicate ~12% using EROFS and ERIS. Still far from the ~85% similarity, but an attempt I'd like t

Re: Identical files across subsequent package revisions

2020-12-27 Thread Ludovic Courtès
Hi! Miguel Ángel Arruga Vivas wrote: > Another idea that might fit well into that kind of protocol---with > harder impact on the design, and probably with a high cost on the > runtime---would be the "upgrade" of the deduplication process towards a > content-based file system as git does[2]. T

Re: Identical files across subsequent package revisions

2020-12-27 Thread Ludovic Courtès
Hi! zimoun wrote: > What I wanted to illustrate is “revision” does not mean “new version” > and I do not know what is the typical usage by people; aside Guix > dev. :-) > > How many times per week or month people are doing “guix pull && guix > upgrade” almost blindly without reviewing what is

Re: Identical files across subsequent package revisions

2020-12-23 Thread Mark H Weaver
I wrote: > FYI, here's the reason that IceCat is unusual among the projects sampled > in Ludovic's analysis: the large collection of JavaScript source files > and other auxiliary files, which are mostly unchanged from one release > to the next, are bundled into a single zip file in lib/icecat/omni.

Re: Identical files across subsequent package revisions

2020-12-23 Thread Jonathan Brielmaier
On 23.12.20 23:06, Mark H Weaver wrote: Hi Taylan, Taylan Kammer writes: My second thought: it's surprising that IceCat supposedly changes so much between releases. I suppose the reason is that this analysis is on a per-file basis, and IceCat is mostly a massive binary. FYI, here's the reas

Re: Identical files across subsequent package revisions

2020-12-23 Thread Mark H Weaver
Hi Taylan, Taylan Kammer writes: > My second thought: it's surprising that IceCat supposedly changes so > much between releases. I suppose the reason is that this analysis is on > a per-file basis, and IceCat is mostly a massive binary. FYI, here's the reason that IceCat is unusual among the

Re: Identical files across subsequent package revisions

2020-12-23 Thread Miguel Ángel Arruga Vivas
Hi Julien and Simon, Julien Lepiller writes: > On 23 December 2020 09:07:23 GMT-05:00, zimoun > wrote: >>Hi, >> >>On Wed, 23 Dec 2020 at 14:10, Miguel Ángel Arruga Vivas >> wrote: >>> Another idea that might fit well into that kind of protocol---with >>> harder impact on the design, and pro

Re: Identical files across subsequent package revisions

2020-12-23 Thread Julien Lepiller
On 23 December 2020 09:07:23 GMT-05:00, zimoun wrote: >Hi, > >On Wed, 23 Dec 2020 at 14:10, Miguel Ángel Arruga Vivas > wrote: > >> Probably you're already aware of it, but I want to mention that >> Tridgell's thesis[1] contains a very neat approach to this problem. > >This thesis is a must

Re: Identical files across subsequent package revisions

2020-12-23 Thread zimoun
Hi, On Wed, 23 Dec 2020 at 14:10, Miguel Ángel Arruga Vivas wrote: > Probably you're already aware of it, but I want to mention that > Tridgell's thesis[1] contains a very neat approach to this problem. This thesis is a must to read! :-) > A naive prototype would be copying of the latest ava

Re: Identical files across subsequent package revisions

2020-12-23 Thread Miguel Ángel Arruga Vivas
Hi Ludo, Just one interjection: wow! :-) Ludovic Courtès writes: > Hello Guix! > > Every time a package changes, we end up downloading complete substitutes > for itself and for all its dependents, even though we have the intuition > that a large fraction of the files in those store items are un

Re: Identical files across subsequent package revisions

2020-12-23 Thread Christopher Baines
Ludovic Courtès writes: > The reason I’m looking at this is to understand how much would be gained > in terms of bandwidth usage if we were able to avoid downloading > individual files already in the store. It would seem to be rather > encouraging. Very cool! This might help guide the implemen

Re: Identical files across subsequent package revisions

2020-12-23 Thread Pierre Neidhardt
Taylan Kammer writes: > On 22.12.2020 23:01, Ludovic Courtès wrote: >> >> Thoughts? :-) >> > > My first thought: Neat, would love to see this implemented! :D I second that! :) -- Pierre Neidhardt https://ambrevar.xyz/

Re: Identical files across subsequent package revisions

2020-12-23 Thread zimoun
Hi again, Sorry, I have not finished my previous email. Wrong keystroke in the wrong buffer because of my misuse of Ludo’s code. :-) On Wed, 23 Dec 2020 at 11:19, zimoun wrote: >> I wanted to evaluate that by looking at store items corresponding to >> subsequent revisions of a package (be it di

Re: Identical files across subsequent package revisions

2020-12-23 Thread zimoun
Hi Ludo, On Tue, 22 Dec 2020 at 23:01, Ludovic Courtès wrote: > I wanted to evaluate that by looking at store items corresponding to > subsequent revisions of a package (be it different versions or rebuilds > induced by dependencies), and this is what the program below does. > > Here are prelimi

Re: Identical files across subsequent package revisions

2020-12-23 Thread Taylan Kammer
On 22.12.2020 23:01, Ludovic Courtès wrote: Thoughts? :-) My first thought: Neat, would love to see this implemented! :D My second thought: it's surprising that IceCat supposedly changes so much between releases. I suppose the reason is that this analysis is on a per-file basis, and IceC

Identical files across subsequent package revisions

2020-12-22 Thread Ludovic Courtès
Hello Guix! Every time a package changes, we end up downloading complete substitutes for itself and for all its dependents, even though we have the intuition that a large fraction of the files in those store items are unchanged. I wanted to evaluate that by looking at store items corresponding to
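The kind of per-file comparison this message sets out to make can be sketched as follows. This is not the program from the original message, which is not reproduced in this listing. It is a minimal Python sketch under stated assumptions: two revisions are available as plain directory trees, "identical" means equal SHA-256 of the file contents regardless of file name, and similarity is counted per file rather than per byte. The function names `file_hashes` and `similarity` are made up for illustration.

```python
import hashlib
from pathlib import Path


def file_hashes(root):
    """Map each regular file under root (relative path) to its SHA-256."""
    root = Path(root)
    hashes = {}
    for path in root.rglob("*"):
        # Skip directories and symlinks; only regular file contents count.
        if path.is_file() and not path.is_symlink():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            hashes[path.relative_to(root).as_posix()] = digest
    return hashes


def similarity(old_root, new_root):
    """Fraction of files in new_root whose contents already occur in old_root.

    Assumption of this sketch: files are matched by content hash only,
    and every file counts equally whatever its size.
    """
    old_contents = set(file_hashes(old_root).values())
    new = file_hashes(new_root)
    if not new:
        return 0.0
    identical = sum(1 for h in new.values() if h in old_contents)
    return identical / len(new)
```

Weighting each file by its size instead of counting files would be closer to a bandwidth estimate; the sketch keeps the simpler count.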