Re: Are 'guix gc' stats exaggerated?

2024-06-17 Thread Ludovic Courtès
Andreas Enge wrote:

> In my experience on ext4 (also not backed by looking at the code), "guix gc"
> always deletes substantially less than what I ask for. I always thought it
> just counted hard linked files even when the link count does not go to 0
> and the file is not actually deleted.

Yes, that’s also my experience.  I did look at the code several times, I
even thought 7033c7692ccbbbad8f7b9952015de071a5588e87 in 2020 would fix
that estimate, but it didn’t.  I guess I’m bad at maths and logic; we
should take another look at that part of the code!

(Note that creation of sparse files will be another source of
discrepancy, though there will be few of them.)
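
To illustrate the sparse-file point, here is a minimal Python sketch. It is not taken from the daemon; the file name and sizes are made up for the example. It shows that a file's apparent size (st_size) and the space actually allocated to it (st_blocks, assumed to be in 512-byte units as on Linux) can differ, so an estimate based on apparent size can overstate what deletion gives back.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "sparse-example")   # hypothetical file name
    with open(path, "wb") as f:
        f.seek(1024 * 1024)                    # skip 1 MiB without writing: a hole
        f.write(b"\0")                         # one real byte at the end
    st = os.stat(path)
    apparent = st.st_size                      # what a naive size count would add up
    allocated = st.st_blocks * 512             # assumption: 512-byte units (Linux)

print(f"apparent size: {apparent} bytes, allocated: {allocated} bytes")
```

On filesystems that support holes, the allocated figure is typically far below the apparent size; the exact value depends on the filesystem.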

Ludo’.



Re: Are 'guix gc' stats exaggerated?

2024-06-09 Thread Andreas Enge
On Sun, Jun 09, 2024 at 12:19:55PM +0300, Efraim Flashner wrote:
> Without having looked at the code, I'll point out that running `guix
> gc -C 10G` will clear 10G of items from the store, but will return
> between 2G and 10G of real space for future use on the hard drive.
> Thinking across my various machines, this is the case on my desktop and
> laptop, which use btrfs.  On my other machines, which use ext4, I think
> the space cleared and what I expect to have free do actually match up,
> but I don't remember paying that much attention to the numbers on
> those machines.

In my experience on ext4 (also not backed by looking at the code), "guix gc"
always deletes substantially less than what I ask for. I always thought it
just counted hard linked files even when the link count does not go to 0
and the file is not actually deleted.
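
The suspicion above can be made concrete with a minimal Python sketch. It is not the daemon's code, and the function name and file names are hypothetical. It compares a naive estimate, which adds up the size of every path being deleted, with an estimate that counts an inode only when all of its links are being removed.

```python
import os
import tempfile

def naive_and_actual_savings(paths):
    """Return (naive, actual) byte counts for deleting every path in `paths`.

    naive adds up each path's size; actual counts an inode only if the
    number of links being removed equals its total link count.
    """
    naive = 0
    seen = {}  # (st_dev, st_ino) -> [links being removed, total links, size]
    for path in paths:
        st = os.lstat(path)
        naive += st.st_size
        entry = seen.setdefault((st.st_dev, st.st_ino), [0, st.st_nlink, st.st_size])
        entry[0] += 1
    actual = sum(size for removed, total, size in seen.values() if removed == total)
    return naive, actual

with tempfile.TemporaryDirectory() as d:
    dead = os.path.join(d, "dead-item")   # stands in for a store file being collected
    live = os.path.join(d, "live-item")   # stands in for a file that stays
    with open(dead, "wb") as f:
        f.write(b"x" * 10_000)
    os.link(dead, live)                   # both names now share one inode
    naive, actual = naive_and_actual_savings([dead])

print(f"naive estimate: {naive} bytes, actually reclaimable: {actual} bytes")
```

Here the naive estimate is 10000 bytes while nothing is reclaimable, because the surviving name keeps the inode alive. The sketch uses st_size for simplicity; allocated blocks would be the more precise measure.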

For instance, I have tried it just now:
$ df -h .
/dev/mapper/cryptroot  468G  427G   18G   97% /

$ guix gc -F 20G
guix gc: 2.931,84 MiB werden freigegeben
...
deleted or invalidated more than 3074252800 bytes; stopping

$ df -h .
/dev/mapper/cryptroot  468G  427G   18G   96% /
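
The figures in this transcript can be cross-checked with a few lines of Python. This is only a sketch: it assumes that the "G" in `-F 20G` means GiB and that the byte count in the last message is the amount the collector set out to free.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

target_free = 20 * GIB        # 'guix gc -F 20G', assuming G means GiB
to_free = 3_074_252_800       # from "deleted or invalidated more than ... bytes"

to_free_mib = to_free / MIB
implied_free_before_gib = (target_free - to_free) / GIB

print(f"amount to free: {to_free_mib:.2f} MiB")
print(f"implied free space before the run: {implied_free_before_gib:.1f} GiB")
```

Under those assumptions the byte count matches the 2.931,84 MiB announced at the start, and the implied free space before the run is about 17.1 GiB, which fits the 18G shown by `df -h` (it rounds up).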

Andreas




Re: Are 'guix gc' stats exaggerated?

2024-06-09 Thread Efraim Flashner
On Thu, Jun 06, 2024 at 12:32:52PM -0700, Felix Lechner wrote:
> Hi Ludo'
> 
> On Thu, Jun 06 2024, Ludovic Courtès wrote:
> 
> > Where does that 3:1 figure come from?
> 
> Efraim's experience, I believe.

I've found that to be my experience, and posted two compsize outputs to
show where I got my numbers from.

> > Where do you see that in the code?  After checking
> > ‘removeUnusedLinks’, I think it counts space savings right.
> 
> Sorry, I didn't look at the code.  I was merely prompted to speculate by
> the mentioning of hard links and inferred wrongly, it seems, that the
> discrepancy was related---although in fairness I also doubted that a
> fixed 3:1 ratio could be credibly explained by deduplication alone.
> 
> Also, I don't mean to appear critical.  Thanks to everyone for your hard
> work on Guix!

Without having looked at the code, I'll point out that running `guix
gc -C 10G` will clear 10G of items from the store, but will return
between 2G and 10G of real space for future use on the hard drive.
Thinking across my various machines, this is the case on my desktop and
laptop, which use btrfs.  On my other machines, which use ext4, I think
the space cleared and what I expect to have free do actually match up,
but I don't remember paying that much attention to the numbers on
those machines.

-- 
Efraim Flashner  רנשלפ םירפא
GPG key = A28B F40C 3E55 1372 662D  14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted




Re: Are 'guix gc' stats exaggerated?

2024-06-06 Thread Felix Lechner via Development of GNU Guix and the GNU System distribution.
Hi Ludo'

On Thu, Jun 06 2024, Ludovic Courtès wrote:

> Where does that 3:1 figure come from?

Efraim's experience, I believe.

> Where do you see that in the code?  After checking
> ‘removeUnusedLinks’, I think it counts space savings right.

Sorry, I didn't look at the code.  I was merely prompted to speculate by
the mentioning of hard links and inferred wrongly, it seems, that the
discrepancy was related---although in fairness I also doubted that a
fixed 3:1 ratio could be credibly explained by deduplication alone.

Also, I don't mean to appear critical.  Thanks to everyone for your hard
work on Guix!

Kind regards
Felix



Re: Are 'guix gc' stats exaggerated?

2024-06-06 Thread Ludovic Courtès
Hi Felix,

Felix Lechner via "Development of GNU Guix and the GNU System
distribution." wrote:

> It probably makes more sense to focus on the Guix daemon here.  I hope
> you don't mind a few clarifying questions.
>
> Why, please, does the benefit of de-duplication approach a fixed ratio
> of 3:1?  Does the benefit not depend on the number of copies in the
> store, which can vary by any number?  (It sounds like the answer may
> have something to do with store size.)

Where does that 3:1 figure come from?

> Further, why is the removal of hardlinks counted as saving space even
> when their inode reference count, which is widely available [1] is
> greater than one?

Where do you see that in the code?  After checking ‘removeUnusedLinks’,
I think it counts space savings right.  (OTOH, something somewhere is
counted wrong, as anyone who’s used ‘guix gc -F…’ has seen; not sure
where the bug is!)

Thanks,
Ludo’.



Daemon deduplication and btrfs compression [was Re: Are 'guix gc' stats exaggerated?]

2024-06-02 Thread Efraim Flashner
On Fri, May 31, 2024 at 03:03:47PM -0700, Felix Lechner wrote:
> Hi Efraim,
> 
> On Tue, May 28 2024, Efraim Flashner wrote:
> 
> > As your store grows larger the inherent deduplication from the
> > guix-daemon approaches a 3:1 file deduplication ratio.
> 
> Thank you for your explanations and your data about btrfs!  Btrfs
> compression is a well-understood feature, although even its developers
> acknowledge that the benefit is hard to quantify.
> 
> It probably makes more sense to focus on the Guix daemon here.  I hope
> you don't mind a few clarifying questions.
> 
> Why, please, does the benefit of de-duplication approach a fixed ratio
> of 3:1?  Does the benefit not depend on the number of copies in the
> store, which can vary by any number?  (It sounds like the answer may
> have something to do with store size.)

It would seem that this is just my experience and I'm not sure of an
actual reason why this is the case. I believe that with the hardlinks
only files which are identical would share a link, as opposed to a block
based deduplication, where there could be more granular deduplication,
so it's quite likely that multiple copies of the same package at the
same version would share the majority of their files with the other
copies of the package.

> Further, why is the removal of hardlinks counted as saving space even
> when their inode reference count, which is widely available [1] is
> greater than one?

I suspect that this part of the code is in the C++ daemon, which no one
really wants to hack on.  AFAIK Nix turned off deduplication by default
years ago to speed up store operations, so I wouldn't be surprised if
they also haven't worked on that part of the code.

> Finally, barring a better solution, should our output numbers be divided
> by three to bring them closer to the expected result for users?
> 
> [1] https://en.wikipedia.org/wiki/Hard_link#Reference_counting

(ins)efraim@3900XT ~$ sudo compsize -x /gnu
Processed 39994797 files, 12867013 regular extents (28475611 refs), 20558307 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       56%      437G         776G         2.1T
none       100%      275G         275G         723G
zstd        32%      161G         500G         1.4T

It looks like right now my store is physically using 437G of space.
Looking only at the totals, the Uncompressed -> Referenced ratio is
about 2.77:1 and Disk Usage -> Uncompressed is about 1.78:1, so I'm
netting a total of about 4.92:1.

Numbers on Berlin are a bit different:

(ins)efraim@berlin ~$ time guix shell compsize -- sudo compsize -x /gnu
Processed 41030472 files, 14521470 regular extents (37470325 refs), 17429255 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       59%      578G         970G         3.2T
none       100%      402G         402G         1.1T
zstd        31%      176G         567G         2.1T

real    45m9.762s
user    1m53.984s
sys     24m37.338s

Uncompressed -> Referenced: 3.4:1
Disk Usage -> Uncompressed: 1.68:1
Total:  5.67:1

Looking at it another way, on Berlin the data that is compressed with
zstd moves from 3.79:1 (Referenced to Uncompressed) to 12.22:1
(Referenced to Disk Usage), with no change (2.8:1) for the uncompressed
data.
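
The ratios quoted above can be reproduced from the TOTAL rows of the two compsize outputs with a short Python sketch. The variable and function names are made up for the example, and because compsize rounds its figures (2.1T, 3.2T) the results are only approximate.

```python
GIB_PER_TIB = 1024

# (disk usage, uncompressed, referenced) in GiB, from the TOTAL rows
stores = {
    "3900XT": (437, 776, 2.1 * GIB_PER_TIB),
    "berlin": (578, 970, 3.2 * GIB_PER_TIB),
}

def ratios(disk, uncompressed, referenced):
    """Return (Uncompressed -> Referenced, Disk Usage -> Uncompressed, total)."""
    return referenced / uncompressed, uncompressed / disk, referenced / disk

results = {name: ratios(*row) for name, row in stores.items()}

for name, (ref_ratio, comp_ratio, total) in results.items():
    print(f"{name}: Uncompressed -> Referenced {ref_ratio:.2f}:1, "
          f"Disk Usage -> Uncompressed {comp_ratio:.2f}:1, total {total:.2f}:1")
```

The total is simply the product of the two partial ratios, which is why roughly 2.77 times 1.78 gives about 4.92 on the first machine.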

-- 
Efraim Flashner  רנשלפ םירפא
GPG key = A28B F40C 3E55 1372 662D  14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted




Re: Are 'guix gc' stats exaggerated?

2024-05-31 Thread Felix Lechner via Development of GNU Guix and the GNU System distribution.
Hi Efraim,

On Tue, May 28 2024, Efraim Flashner wrote:

> As your store grows larger the inherent deduplication from the
> guix-daemon approaches a 3:1 file deduplication ratio.

Thank you for your explanations and your data about btrfs!  Btrfs
compression is a well-understood feature, although even its developers
acknowledge that the benefit is hard to quantify.

It probably makes more sense to focus on the Guix daemon here.  I hope
you don't mind a few clarifying questions.

Why, please, does the benefit of de-duplication approach a fixed ratio
of 3:1?  Does the benefit not depend on the number of copies in the
store, which can vary by any number?  (It sounds like the answer may
have something to do with store size.)

Further, why is the removal of hardlinks counted as saving space even
when their inode reference count, which is widely available [1] is
greater than one?

Finally, barring a better solution, should our output numbers be divided
by three to bring them closer to the expected result for users?

Thanks!

Kind regards,
Felix

[1] https://en.wikipedia.org/wiki/Hard_link#Reference_counting



Re: Are 'guix gc' stats exaggerated?

2024-05-31 Thread Simon Tournier
Hi,

On Sun, 26 May 2024 at 13:13, Felix Lechner via "Development of GNU Guix and 
the GNU System distribution."  wrote:

> By my math, about 65.8 GiB were recovered.
>
> When 'guix gc' was done, it announced:
>
> [184389 MiB] deleting '/gnu/store/...'
> deleting `/gnu/store/trash'
> deleting unused links...
> note: currently hard linking saves 59224.03 MiB
> guix gc: freed 110,649.49 MiBs

Well, 180 GiB does not count deduplication, I guess.  And as Efraim
said, the ratio on average is 3:1 so 65 GiB vs 180 GiB seems consistent,
right?

However, the question is then: what are these 110 GiB?

Cheers,
simon



Re: Are 'guix gc' stats exaggerated?

2024-05-28 Thread Efraim Flashner
On Sun, May 26, 2024 at 01:13:45PM -0700, Felix Lechner via Development of GNU 
Guix and the GNU System distribution. wrote:
> Hi,
> 
> Today I ran 'guix gc' on equipment with an ext4 root partition.  It had
> these space characteristics beforehand:
> 
> Filesystem  Size   Used  Avail Use% Mounted on
> /dev/dm-3   309047680  157252980 138126064  54% /
> 
> or for human eyes:
> 
> /dev/dm-3   295G  150G  132G  54% /
> 
> After the run, the drive showed:
> 
> /dev/dm-3   309047680   88267956 207111088  30% /
> 
> or for human eyes:
> 
> /dev/dm-3   295G   85G  198G  30% /
> 
> By my math, about 65.8 GiB were recovered.
> 
> When 'guix gc' was done, it announced:
> 
> [184389 MiB] deleting '/gnu/store/...'
> deleting `/gnu/store/trash'
> deleting unused links...
> note: currently hard linking saves 59224.03 MiB
> guix gc: freed 110,649.49 MiBs
> 
> Seeing the 184389 MiB number, or 180 GiB, already made me suspicious.
> It exceeded my drive usage by 30 GiB.  Even the more conservative 110649
> MiB "freed," however, are off by a mile. That would have been 108 GiB,
> or 42 GiB more than the space actually recovered.
> 
> Am I looking at those numbers the wrong way?  Thanks!

As your store grows larger the inherent deduplication from the
guix-daemon approaches a 3:1 file deduplication ratio.  If two files are
identical then they are hard-linked to the same inode, so their data is
stored on the drive only once and you save some space.
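
As an illustration of what hard-link deduplication means at the filesystem level, here is a minimal Python sketch. It only demonstrates hard links in a temporary directory; the file names are hypothetical and it does not reproduce how the daemon finds identical files.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as store:
    first = os.path.join(store, "aaa-package-file")    # hypothetical names
    second = os.path.join(store, "bbb-package-file")

    with open(first, "wb") as f:
        f.write(b"x" * 4096)

    # Deduplicate: give the second name the same inode instead of copying.
    os.link(first, second)

    a, b = os.stat(first), os.stat(second)
    same_inode = a.st_ino == b.st_ino
    links_before = a.st_nlink

    # Removing one name leaves the data in place while another link exists.
    os.unlink(first)
    links_after = os.stat(second).st_nlink
    with open(second, "rb") as f:
        surviving_bytes = len(f.read())

print(same_inode, links_before, links_after, surviving_bytes)
```

The data is only released once the link count reaches zero, which is why deleting one of several identical files need not free any space.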

I have found that if you switch to btrfs and add zstd (level 3)
compression then you get about another 2:1 on top of that, for around
5.5:1.

-- 
Efraim Flashner  רנשלפ םירפא
GPG key = A28B F40C 3E55 1372 662D  14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted




Re: Are 'guix gc' stats exaggerated?

2024-05-27 Thread Felix Lechner via Development of GNU Guix and the GNU System distribution.
Hi raingloom,

On Mon, May 27 2024, raingl...@riseup.net wrote:

> Are you using compression? (BTRFS, ZFS, etc)

No, I thought about that, too, but that volume, like all my root
volumes, is straight ext4 on LVM2, on bare metal.

Kind regards
Felix



Re: Are 'guix gc' stats exaggerated?

2024-05-27 Thread raingloom
On 2024-05-26 22:13, Felix Lechner via "Development of GNU Guix and the
GNU System distribution." wrote:
> Hi,
> 
> Today I ran 'guix gc' on equipment with an ext4 root partition.  It had
> these space characteristics beforehand:
> 
> Filesystem  Size   Used  Avail Use% Mounted on
> /dev/dm-3   309047680  157252980 138126064  54% /
> 
> or for human eyes:
> 
> /dev/dm-3   295G  150G  132G  54% /
> 
> After the run, the drive showed:
> 
> /dev/dm-3   309047680   88267956 207111088  30% /
> 
> or for human eyes:
> 
> /dev/dm-3   295G   85G  198G  30% /
> 
> By my math, about 65.8 GiB were recovered.
> 
> When 'guix gc' was done, it announced:
> 
> [184389 MiB] deleting '/gnu/store/...'
> deleting `/gnu/store/trash'
> deleting unused links...
> note: currently hard linking saves 59224.03 MiB
> guix gc: freed 110,649.49 MiBs
> 
> Seeing the 184389 MiB number, or 180 GiB, already made me suspicious.
> It exceeded my drive usage by 30 GiB.  Even the more conservative 110649
> MiB "freed," however, are off by a mile. That would have been 108 GiB,
> or 42 GiB more than the space actually recovered.
> 
> Am I looking at those numbers the wrong way?  Thanks!
> 
> Kind regards
> Felix

Are you using compression? (BTRFS, ZFS, etc)



Are 'guix gc' stats exaggerated?

2024-05-26 Thread Felix Lechner via Development of GNU Guix and the GNU System distribution.
Hi,

Today I ran 'guix gc' on equipment with an ext4 root partition.  It had
these space characteristics beforehand:

Filesystem  Size   Used  Avail Use% Mounted on
/dev/dm-3   309047680  157252980 138126064  54% /

or for human eyes:

/dev/dm-3   295G  150G  132G  54% /

After the run, the drive showed:

/dev/dm-3   309047680   88267956 207111088  30% /

or for human eyes:

/dev/dm-3   295G   85G  198G  30% /

By my math, about 65.8 GiB were recovered.

When 'guix gc' was done, it announced:

[184389 MiB] deleting '/gnu/store/...'
deleting `/gnu/store/trash'
deleting unused links...
note: currently hard linking saves 59224.03 MiB
guix gc: freed 110,649.49 MiBs

Seeing the 184389 MiB number, or 180 GiB, already made me suspicious.
It exceeded my drive usage by 30 GiB.  Even the more conservative 110649
MiB "freed," however, are off by a mile. That would have been 108 GiB,
or 42 GiB more than the space actually recovered.
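
The arithmetic above can be checked with a short Python sketch. It assumes that the raw df figures are in 1 KiB blocks (df's default), which is consistent with 309047680 being shown as 295G.

```python
KIB_PER_GIB = 1024 * 1024
MIB_PER_GIB = 1024

used_before_kib = 157_252_980    # df "Used" before the run
used_after_kib = 88_267_956      # df "Used" after the run

recovered_gib = (used_before_kib - used_after_kib) / KIB_PER_GIB
progress_gib = 184_389 / MIB_PER_GIB       # "[184389 MiB] deleting ..."
freed_gib = 110_649.49 / MIB_PER_GIB       # "guix gc: freed 110,649.49 MiBs"

print(f"recovered per df:  {recovered_gib:6.1f} GiB")
print(f"progress counter:  {progress_gib:6.1f} GiB")
print(f"reported as freed: {freed_gib:6.1f} GiB")
print(f"overstatement:     {freed_gib - recovered_gib:6.1f} GiB")
```

The results agree with the figures in the message: about 65.8 GiB recovered, against roughly 180 GiB and 108 GiB reported.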

Am I looking at those numbers the wrong way?  Thanks!

Kind regards
Felix