In the old pkg_add model (prior to 7.1), we create tmp files anyway, so the
code is fine.

7.1 and later, we try to avoid creating temp files, so a better name matching
correspondence would be cool

What this patch does:

the old code creates a simple hash where we associate ONE packing-list element
with a given sha256 value.

the new code creates a list of entries for an sha256 hash and iterates 
over them.

Of course, it's unlikely two entries with the same sha256 will have different
content (though we still check the size, just in case, and -Dchecksum is
really paranoid about it and also really slow).

BUT, even if we find a good match, we will keep going... until we find
the exact same name ! (or settle on the last one we saw if we don't)

[... which means no temp file creation]


It might be quite accidental, but python has lots of __init__.py files with
0 size, and ALSO a significant amount of file with non empty content, but which
are identical (for reasons...? I don't want to know why the python guys think
it's a good idea to ship LOTS of identical files, and I don't mean symlinks
or hardlinks but actual copies)

As a measure point, on the python 3.9 package, without this diff, we
misidentify ~500 names out of 2700, and so churn 500 temp files if we
replace python with itself.

I haven't measured data points for other packages, but I would guess most 
python packages are probably similarly afflicted (yeah, not affected, 
I think it's a disease :D )


Please test.

Something else might have been fucked up.


Index: OpenBSD/PkgAdd.pm
===================================================================
RCS file: /cvs/src/usr.sbin/pkg_add/OpenBSD/PkgAdd.pm,v
retrieving revision 1.130
diff -u -p -r1.130 PkgAdd.pm
--- OpenBSD/PkgAdd.pm   27 Apr 2022 15:27:45 -0000      1.130
+++ OpenBSD/PkgAdd.pm   5 May 2022 17:55:01 -0000
@@ -84,7 +84,7 @@ sub hash_files
        my ($self, $sha, $state) = @_;
        return if $self->{link} or $self->{symlink} or $self->{nochecksum};
        if (defined $self->{d}) {
-               $sha->{$self->{d}->key} = $self;
+               push @{$sha->{$self->{d}->key}}, $self;
        }
 }
 
@@ -94,13 +94,17 @@ sub tie_files
        return if $self->{link} or $self->{symlink} or $self->{nochecksum};
        # XXX python doesn't like this, overreliance on timestamps
        return if $self->{name} =~ m/\.py$/ && !defined $self->{ts};
-       if (defined $sha->{$self->{d}->key}) {
-               my $tied = $sha->{$self->{d}->key};
-               # don't tie if there's a problem with the file
-               my $realname = $tied->realname($state);
-               return unless -f $realname;
-               # and do a sanity check that this file wasn't altered
-               return unless (stat _)[7] == $self->{size};
+       if (exists $sha->{$self->{d}->key}) {
+               my ($tied, $realname);
+               for my $c (@{$sha->{$self->{d}->key}}) {
+                       # don't tie if there's a problem with the file
+                       my $realname = $c->realname($state);
+                       next unless -f $realname;
+                       # and do a sanity check that this file wasn't altered
+                       next unless (stat _)[7] == $self->{size};
+                       $tied = $c;
+                       last if $tied->name eq $self->name;
+               }
                if ($state->defines('checksum')) {
                        my $d = $self->compute_digest($realname, $self->{d});
                        # XXX we don't have to display anything here

Reply via email to