Hi,
So, I've been having difficulties with dpkg's performance on cold disk
cache. The per-package list files in /var/lib/dpkg/info are inefficient
to load in bulk. Before doing most operations, dpkg calls
ensure_allinstfiles_available(), which reads the contents of each file
into a global hash table. As there are thousands of these small files
scattered throughout the disk, reading them all one after another
requires a great many seeks and is a very expensive operation.
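
To make the access pattern concrete, here is a rough Python sketch of what loading every list file into one table amounts to. It is not dpkg's actual C code; the function name load_list_files and the dict-of-sets layout are my own assumptions for illustration.

```python
import os
from collections import defaultdict


def load_list_files(info_dir):
    """Read every <package>.list file in info_dir into one in-memory index.

    Returns a dict mapping each listed path to the set of package names
    whose list file contains it. (Hypothetical helper, not a dpkg API.)
    """
    owners = defaultdict(set)
    for name in sorted(os.listdir(info_dir)):
        if not name.endswith(".list"):
            continue  # skip maintainer scripts, md5sums, and so on
        package = name[: -len(".list")]
        # One open() per package: on a cold cache each one may cost a seek.
        with open(os.path.join(info_dir, name),
                  encoding="utf-8", errors="surrogateescape") as fh:
            for line in fh:
                path = line.rstrip("\n")
                if path:
                    owners[path].add(package)
    return owners
```

With an index like this, a --search query is a single lookup such as owners["/bin/ls"]; the cost is almost entirely in opening thousands of small files to build it.
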
On my machine (Ubuntu Intrepid), dpkg-query --search takes nearly 30
seconds on cold cache:
% dump-disk-cache
% time dpkg-query --search /bin/ls
coreutils: /bin/ls
dpkg-query --search /bin/ls 0.47s user 0.45s system 3% cpu 29.536 total
dpkg also reads the list files when installing packages, so installs
are affected as well.
The current list files are well suited to the query "given a package,
what did it install", and they are fairly cheap to update. However, they
are extremely poorly suited to the query "given a file, what package(s)
installed it", or to any case where everything must be read in at once.
So, I propose adding a cache for the data.
As a proof-of-concept, I have a series of patches[1] which implement a
simple cache: putting everything into a tar file. The first two refactor
some of the code; they can probably be merged now. The other two add a
cache. Note: the latter two are purely proof-of-concept and have
numerous technical problems, are incomplete, etc. Please do not consider
them for merging.
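
For anyone who wants the idea without reading the patches, here is a rough Python sketch of the tar-file approach. It is not the code in my branch (that is C and pipes to the system tar binary); it uses Python's standard tarfile module instead, and the names build_cache and load_cache are made up for illustration.

```python
import os
import tarfile
from collections import defaultdict


def build_cache(info_dir, cache_path):
    """Pack every *.list file in info_dir into one uncompressed tar file."""
    with tarfile.open(cache_path, "w") as tar:
        for name in sorted(os.listdir(info_dir)):
            if name.endswith(".list"):
                tar.add(os.path.join(info_dir, name), arcname=name)


def load_cache(cache_path):
    """Rebuild the path -> packages index from the single tar file."""
    owners = defaultdict(set)
    with tarfile.open(cache_path, "r:") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(".list"):
                continue
            package = member.name[: -len(".list")]
            data = tar.extractfile(member).read()
            for path in data.decode("utf-8", "surrogateescape").split("\n"):
                if path:
                    owners[path].add(package)
    return owners
```

The point of the design is that the cache is one contiguous file, so a cold-cache load reads it front to back instead of opening thousands of small files.
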
Some numbers:
[dpkg master (1b5a009da6fdd38b2b51bd551c09880f890566f7)]
% dump-disk-cache
% time dpkg-query --admindir=/var/lib/dpkg --search /bin/ls
coreutils: /bin/ls
dpkg-query --admindir=/var/lib/dpkg --search /bin/ls 0.51s user 0.53s
system 3% cpu 30.324 total
% time dpkg-query --admindir=/var/lib/dpkg --search /bin/ls
coreutils: /bin/ls
dpkg-query --admindir=/var/lib/dpkg --search /bin/ls 0.33s user 0.08s
system 93% cpu 0.435 total
[dpkg tarfile-proof-of-concept]
% dump-disk-cache
% time dpkg-query --admindir=/var/lib/dpkg --search /bin/ls
coreutils: /bin/ls
dpkg-query --admindir=/var/lib/dpkg --search /bin/ls 0.47s user 0.07s
system 37% cpu 1.461 total
% time dpkg-query --admindir=/var/lib/dpkg --search /bin/ls
coreutils: /bin/ls
dpkg-query --admindir=/var/lib/dpkg --search /bin/ls 0.42s user 0.08s
system 83% cpu 0.587 total
There is a performance regression on warm cache, but I suspect this is
largely because I'm piping to the system tar binary. Of course, a
real implementation would be more reasonable.
The time to do a cold-cache search (very strongly bottlenecked on
reading the list files) goes down from over 30 seconds to under 1.5. I
think this is a clear improvement. Only ensure_allinstfiles_available()
is touched and it gives the same result, so this should not affect the
rest of the program.
I'm not sure what implementation is the most acceptable. Ideally, a
database of sorts that avoids reading in all the data would be best, but
it seems that will be difficult to integrate with the existing code
without touching many code paths. A tar file has its own drawback:
removing a package's entry with --delete may mean rewriting the entire
file.
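
To illustrate the removal cost, here is a small Python sketch. It assumes the cache is a plain uncompressed tar file; remove_member is a made-up helper, and Python's tarfile module (used here in place of the tar binary) has no in-place delete, so the only option is to copy every surviving member into a new archive.

```python
import os
import tarfile


def remove_member(cache_path, member_name):
    """Drop one member by copying all other members into a new archive."""
    tmp_path = cache_path + ".tmp"
    with tarfile.open(cache_path, "r:") as src, \
            tarfile.open(tmp_path, "w") as dst:
        for member in src:
            if member.name == member_name:
                continue  # this is the entry being removed
            fileobj = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, fileobj)
    # Swap the rewritten archive into place in one step.
    os.replace(tmp_path, cache_path)
```

Removing a single package therefore touches the whole cache, which is the update cost a real implementation would have to weigh against the read-side gain.
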
Thoughts?
David Benjamin
[1] http://github.com/davidben/dpkg/tree/tarfile-proof-of-concept