On 03/05/2013 08:11 PM, Michael Schroeder wrote:
On Mon, Mar 04, 2013 at 12:22:31PM +0100, Michael Schroeder wrote:
For 2000 packages we have about... ugh, that's actually hard
to tell as the avg and the median differ that much. Let's
use the average: 2000 * 130 = 260000 files.
I would hash them using just a 32-bit number for each hash entry,
the number would be a combination of the hash value and a header
number. Thus:
260000 -> 2^19 = 524288 hash entries (49% fill size)
hash(basename) = xxxx...xxxx yyyy....yyyy
--13 bits-- -- 19 bits--
Thus we store 13 bits of the hash value together with the header
number (we may need a header number -> header db idx array to keep
the header number small, but that just needs 2000 * 4 bytes).
I would not store the real string on disk, in most cases we
need to retrieve the real header anyway, our db function
can simply fetch the header and make sure that it's not a false
positive. Chances for that are pretty slim anyway, as we have the
extra 13 bits in the entry.
Hmm, turns out that it works only well if we have a really good
hash function. The simple hashing I do in libsolv can't be used
as too many files map to the same hash value. I tried the
popular "MurMurHash", it seems to do a good job. Disadvantages
of the scheme are the excess header fetches and there's also no
fast way to get all the keys from the hash (needed for rpm's new
IndexIterator).
If we don't want to access excess headers, we need to put the
strings into the hash. Here's the data for my machine:
filelist entries: 238032
-> hash entries: 524288
-> hash memory used: 4 * 524288 = 2 Mbyte
unique basenames: 207686
basename stringspace needed: 3355117 bytes = 3.2 MByte
This assumes that we can somehow address the string from the
32bit hash value. I have 2099 package installed, this needs
12 bits. That leaves us with 20 bits for the string offset, which
is not enough (2^20 = 1048576). We might put the strings at
a somewhat aligned offsets, thus we would need max 3 excess bytes
for aligning, making the needed space 3355117 + 3 * 207686 =
3.8 MBytes, which is smalled than 1048576 * 4.
It's probably easier to just use 2 * 32bit for the hashing, i.e.
the needed size is 2 * 2 MByte hash + 3.2 MByte strings.
I wouldn't worry too much about hash algorithms and storage optimization
at this point: that's something that can be tweaked and tuned over time
as long as the cache structure is internally versioned so we know when
we need to rebuild it.
Right now I'm more interested in what the overall design of this all
might look like. Like said, I'd like to see the cache be a "read-only
media" so there are zero locking needed for queries that only need data
from the cache. It'll undoubtedly penalize writers (ie transactions) as
the entire cache probably needs to be regenerated even if just one
package is installed/removed, but then we're not in the "millions of
transactions per second" database business at all, in rpm's case
painless (say, without having a library steal your signals and blow up
in all directions if you miss a single iterator free, etc) and fast
reads are what really counts I think.
One possibility for (supposedly :) hassle-free Packages replacement
might be just storing the headers as individual files (named by their
"instance" or sha1 hash or something) in a directory of their own.
That'd eliminate the need for complex book-keeping for free slots and
stuff which is pretty much required for a database-style single file,
plus adding and removing things should be very low-cost. Dunno about
looping through them all, compared to eg Packages, but then for the
added cost of separate open+close etc for each there are no
middle-layers adding cost of their own. I'm sure there are downsides
too, such as making the whole thing more exposed and easier for outside
abuse than manipulating a BDB database file but dunno if it matters in
reality - root is required to mess with it, and root can already rm -f
/var/lib/rpm/Packages or replace it with whatever and so on.
- Panu -
_______________________________________________
Rpm-maint mailing list
[email protected]
http://lists.rpm.org/mailman/listinfo/rpm-maint