On Mon, Mar 04, 2013 at 12:22:31PM +0100, Michael Schroeder wrote:
> For 2000 packages we have about... ugh, that's actually hard
> to tell as the avg and the median differ that much. Let's
> use the average: 2000 * 130 = 260000 files.
> 
> I would hash them using just a 32-bit number for each hash entry,
> the number would be a combination of the hash value and a header
> number. Thus:
> 
> 260000  ->  2^19 = 524288 hash entries (49% fill size)
> hash(basename) = xxxx...xxxx yyyy....yyyy
>                  --13 bits-- -- 19 bits--
> 
> Thus we store 13 bits of the hash value together with the header
> number (we may need a header number -> header db idx array to keep
> the header number small, but that just needs 2000 * 4 bytes).
> 
> I would not store the real string on disk, in most cases we
> need to retrieve the real header anyway, our db function
> can simply fetch the header and make sure that it's not a false
> positive. Chances for that are pretty slim anyway, as we have the
> extra 13 bits in the entry.

Hmm, turns out that it works only well if we have a really good
hash function. The simple hashing I do in libsolv can't be used
as too many files map to the same hash value. I tried the
popular "MurMurHash", it seems to do a good job. Disadvantages
of the scheme are the excess header fetches and there's also no
fast way to get all the keys from the hash (needed for rpm's new
IndexIterator).


If we don't want to access excess headers, we need to put the
strings into the hash. Here's the data for my machine:

filelist entries: 238032
-> hash entries: 524288
-> hash memory used: 4 * 524288 = 2 Mbyte

unique basenames: 207686
basename stringspace needed: 3355117 bytes = 3.2 MByte

This assumes that we can somehow address the string from the
32bit hash value. I have 2099 package installed, this needs
12 bits. That leaves us with 20 bits for the string offset, which
is not enough (2^20 = 1048576). We might put the strings at
a somewhat aligned offsets, thus we would need max 3 excess bytes
for aligning, making the needed space 3355117 + 3 * 207686 =
3.8 MBytes, which is smalled than 1048576 * 4.

It's probably easier to just use 2 * 32bit for the hashing, i.e.
the needed size is 2 * 2 MByte hash + 3.2 MByte strings.

Cheers,
  Michael.

-- 
Michael Schroeder                                   m...@suse.de
SUSE LINUX Products GmbH,  GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}
_______________________________________________
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint

Reply via email to