On Dec 20, 2010, at 10:52 AM, Jeff Johnson wrote: > > For KISS illustration purposes, I will use n=4096, chosen as the usual ~2x > larger > than observed conservative engineering policy. I do point out that almost all > packages > would fit comfortably in n=128 or n=256 so the reality of n=4096 is actually > 16-32x larger > than necessary. Since RPMTAG_DIRNAMES Bloom filters will likely have to be > done per-package, > there's a builtin reduction of 16-32x in the no. of bits needed implied by > n=4096 (but > with additional overhead carrying per-package parameters around). > > Next will be to adjust lib/tbf.c to calculate the no. of bits needed with > per-package > RPMTAG_DIRNAMES Bloom filters. >
After a small amount of dinky coding, here's the preliminary result [...@rhel6 lib]$ ./tbf Dirnames: 722712 bytes (20886 items) with XZDIO: 62416 bytes Bloom filter: {4096,78521}[13] false positives: 0.0001 Uncompressed: 11985336 = 1221 * 9816 with XZDIO: 311808 bytes The Dirnames: uses '\n' separators with a blank line after each package. That is KISS for illsutration only, avoiding issues like how compressible XML markup is or isn't, with or without --text passed to XZ. The Bloom filter (with ultra-conservative parameters) is an array of fixed size bit maps, one for each package, concatenated, all with exactly the same {n,e,m,k} choice to avoid overhead. Clearly the {n=4096, e=0.0001} parameters are sub-optimal. Why does one need an array of 78521 bits (of which 13 bits will be set per-item, one per hash) when the majority of packages have <100 elements in RPMTAG_DIRNAMES. Let's try e=0.01 first. From the WikiPedia entry, that saves 2 * 4.8 bits per-item on average: Dirnames: 722712 bytes (20886 items) with XZDIO: 62416 bytes Bloom filter: {4096,39260}[6] false positives: 0.01 Uncompressed: 5992668 = 1221 * 4908 with XZDIO: 149416 bytes Now let's reduce the estimate from n=4096 to n=2561 (the actual kernel-devel worst case): Dirnames: 722712 bytes (20886 items) with XZDIO: 62416 bytes Bloom filter: {2561,24547}[6] false positives: 0.01 Uncompressed: 3747249 = 1221 * 3069 with XZDIO: 127408 bytes Hmmm, not much savings (in the compressed bit maps), likely because the bits are randomly distributed in the sparse bit maps. But remember there's a 16-32x overkill built into the original Bloom filter parameters, so its pretty clear a Bloom filter is gonna start "winning" soon, based solely on minimum size criteria, compared to a string representation. Next will be to measure specific per-package parameterized Bloom filters with the original e=0.0001 false positive rate. hth 73 de Jeff ______________________________________________________________________ RPM Package Manager http://rpm5.org Developer Communication List rpm-devel@rpm5.org