I'm starting a separate thread, because I don't want to confuse this with
the old db removal stuff.

Vasiliy commented on the need to get rid of the contents file.
As usual, it sounds eminently reasonable that getting rid of the
contents file is going to radically improve performance, right?

OK, so I did some measurements. They're at the end. First some commentary
on Vasiliy's comments.

> We have huge opportunity to bust performance if we get rid of contents file -
> [...]
> This is next bottleneck
> need to be fixed to improve installation performance.

It's inefficient, for sure. I'm hoping we'll get to the point where it
is actually the next bottleneck, but I think we're still a long way
from there.

> I think we should keep this records locally for each package

Well, actually, we do already, e.g.

/var/sadm/pkg/SYMhisl/save/pspool/SYMhisl/pkgmap

although I'm not sure what uses it (I don't see it getting
updated by patches). I think this is where editable files
are stored so you can populate a zone cleanly.

> contents file which is in nevada over 1M

12Meg seems about typical. My home test machine is about
28Meg, but I have a lot of junk installed.

So, a quick test: how long do a random pkgadd and pkgrm take?
This is on my W2100z with a 28M contents file, so any effect that
a contents file has is going to be more noticeable on this system
than most.

pkgadd (cold): 3.97s
pkgadd (warm): 2.65s
pkgrm (cold): 17.28s
pkgrm (warm): 4.87s

The cold results are from the first time round.

OK, so I can install to an alternate root, so the contents
file will be empty. This ought to be way quicker.

pkgadd: 1.20s
pkgrm: 0.53s

All these are essentially warm, so the warm numbers above are the
ones to compare against.

Now, two things are clear.

The first is that, as expected, there is some improvement.
The contents file accounts for about half the time. (And the
difference isn't far off the 1.4s it's expected to take to do two
28M writes to disk at the 40M/s it'll manage.) If we translate this
to a more normal system then the typical effect of the contents
file is about 20%. (And for a system install it's on average half
that, so a 10% effect, which is basically what I worked out
before.)
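
As a rough check on that arithmetic, here is a minimal Python sketch
using only the numbers quoted above. The variable names are mine, and
it assumes "M" means megabytes and that the two writes are sequential.

```python
# Figures taken from the measurements above; names are illustrative only.
contents_mb = 28.0      # size of the contents file on the test machine
disk_mb_per_s = 40.0    # assumed sustained write rate
writes = 2              # number of full rewrites of the contents file

# Expected time just to write the file out twice.
expected_write_s = writes * contents_mb / disk_mb_per_s

pkgadd_warm_s = 2.65    # warm pkgadd with the 28M contents file
pkgadd_empty_s = 1.20   # pkgadd into an alternate root (empty contents)

# Time attributable to the contents file, and its share of the total.
pkgadd_overhead_s = pkgadd_warm_s - pkgadd_empty_s
pkgadd_fraction = pkgadd_overhead_s / pkgadd_warm_s

print(f"expected write time: {expected_write_s:.2f}s")
print(f"measured overhead:   {pkgadd_overhead_s:.2f}s")
print(f"share of pkgadd:     {pkgadd_fraction:.0%}")
```

The measured 1.45s overhead sits close to the 1.4s write estimate, which
is what the "about half the time" figure rests on.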

The second is that pkgrm is much more sensitive. I need
to work out why that is. One thing that pkgrm needs to do
that pkgadd doesn't is to fully parse the contents file (it
needs to parse every line to see if that pathname is
in the given package, whereas pkgadd knows the list of
filenames and can use a binary search to find them in
the contents file because it's sorted). But that's only about
0.8s (I know that from how long pkginfo -l on the package
takes).
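
To illustrate the difference in access pattern, here is a small Python
sketch. It is not how the packaging tools are implemented; the entry
layout (a path plus a set of owning packages) is a simplified stand-in
for the real contents format, and the package and path names are made up.

```python
import bisect

# Hypothetical, simplified contents data: one entry per pathname,
# kept sorted by path, each tagged with the packages that own it.
ENTRIES = sorted(
    [
        ("/opt/demo/bin/tool", {"DEMOpkg"}),
        ("/opt/demo/etc/tool.conf", {"DEMOpkg"}),
        ("/usr/bin/ls", {"OTHERpkg"}),
        ("/usr/lib/libfoo.so", {"OTHERpkg", "DEMOpkg"}),
    ],
    key=lambda entry: entry[0],
)
PATHS = [path for path, _ in ENTRIES]


def lookup_path(path):
    """pkgadd-style access: the path is known, so binary search works."""
    i = bisect.bisect_left(PATHS, path)
    if i < len(PATHS) and PATHS[i] == path:
        return ENTRIES[i]
    return None


def paths_owned_by(pkg):
    """pkgrm-style access: only the package is known, so scan every entry."""
    return [path for path, pkgs in ENTRIES if pkg in pkgs]


print(lookup_path("/usr/bin/ls"))
print(paths_owned_by("DEMOpkg"))
```

The lookup touches O(log n) entries per path, while the scan touches all
n entries regardless of how small the package is.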

The next test involved a simple patchadd. This is of a patch
with a single file.

patchadd: 3.51s
patchrm: 14.07s

Ouch to that second one. OK, so the contents file effect
is about 40% for the patchadd, and about 10% for the patchrm.
And on a more typical system it will be something
like half that. (And again note that removal is expensive.)
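
One way to read those percentages, sketched in Python: assume the
contents file costs the same roughly 1.4s per operation as estimated
for the two 28M writes, and divide by the measured times. That fixed
cost per operation is my assumption here, not something measured
separately for patches.

```python
# Assumed fixed cost of handling the 28M contents file (two writes at 40M/s).
contents_cost_s = 1.4

patchadd_s = 3.51   # measured patchadd time
patchrm_s = 14.07   # measured patchrm time

# Share of each operation that the contents file would account for.
patchadd_share = contents_cost_s / patchadd_s
patchrm_share = contents_cost_s / patchrm_s

print(f"patchadd: {patchadd_share:.0%}")
print(f"patchrm:  {patchrm_share:.0%}")
```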

Conclusion: there's still a lot of work to do to improve the
performance of the package and patch tools before we
start looking at the contents file.

A couple of other observations:
1. The contents file is quite compressible. I got about
a factor of 8 with gzip. Clearly there's quite a lot of redundant
information in the file, so it should be feasible to come up
with a way to use that redundancy to make the file much
smaller. I'm not expecting a factor of 8, but a factor of 3
seems eminently reasonable.
2. Is the contents file in its current format causing
functionality problems, as opposed to just the performance
issues? For example, I would like to see stronger
checksums, and the ability to describe ACLs and
extended attributes.
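
For anyone wanting to repeat the compressibility check from the first
observation, a minimal Python sketch follows. The sample lines are
invented and only loosely shaped like contents entries, so the ratio it
prints says nothing about a real file; point file_gzip_ratio() at your
own contents file (usually /var/sadm/install/contents) for a real number.

```python
import gzip


def gzip_ratio(data: bytes) -> float:
    """Return original size divided by gzip-compressed size."""
    return len(data) / len(gzip.compress(data))


def file_gzip_ratio(path: str) -> float:
    """Compression ratio for a file on disk, read fully into memory."""
    with open(path, "rb") as f:
        return gzip_ratio(f.read())


# Made-up, highly repetitive sample data standing in for a contents file.
sample = "".join(
    f"/opt/demo/lib/file{i} f none 0644 root bin 1024 12345 DEMOpkg\n"
    for i in range(2000)
).encode()

print(f"sample ratio: {gzip_ratio(sample):.1f}")
```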

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
