On Dec 20, 2010, at 10:52 AM, Jeff Johnson wrote:
> 
> For KISS illustration purposes, I will use n=4096, chosen as the usual ~2x 
> larger
> than observed conservative engineering policy. I do point out that almost all 
> packages
> would fit comfortably in n=128 or n=256 so the reality of n=4096 is actually 
> 16-32x larger
> than necessary. Since RPMTAG_DIRNAMES Bloom filters will likely have to be 
> done per-package,
> there's a builtin reduction of 16-32x in the no. of bits needed implied by 
> n=4096 (but
> with additional overhead carrying per-package parameters around).
> 
> Next will be to adjust lib/tbf.c to calculate the no. of bits needed with 
> per-package
> RPMTAG_DIRNAMES Bloom filters.
> 

After a small amount of dinky coding, here's the preliminary result

[...@rhel6 lib]$ ./tbf
Dirnames: 722712 bytes (20886 items)
  with XZDIO: 62416 bytes
Bloom filter: {4096,78521}[13] false positives: 0.0001
Uncompressed: 11985336 = 1221 * 9816
  with XZDIO: 311808 bytes


The Dirnames: uses '\n' separators with a blank line after each package.
That is KISS for illsutration only, avoiding issues like how compressible
XML markup is or isn't, with or without --text passed to XZ.

The Bloom filter (with ultra-conservative parameters) is an array
of fixed size bit maps, one for each package, concatenated, all
with exactly the same {n,e,m,k} choice to avoid overhead.

Clearly the {n=4096, e=0.0001} parameters are sub-optimal. Why does
one need an array of 78521 bits (of which 13 bits will be set per-item, one per 
hash)
when the majority of packages have <100 elements in RPMTAG_DIRNAMES.

Let's try e=0.01 first. From the WikiPedia entry, that saves 2 * 4.8 bits 
per-item
on average:

Dirnames: 722712 bytes (20886 items)
  with XZDIO: 62416 bytes
Bloom filter: {4096,39260}[6] false positives:  0.01
Uncompressed: 5992668 = 1221 * 4908
  with XZDIO: 149416 bytes

Now let's reduce the estimate from n=4096 to n=2561 (the actual kernel-devel 
worst case):

Dirnames: 722712 bytes (20886 items)
  with XZDIO: 62416 bytes
Bloom filter: {2561,24547}[6] false positives:  0.01
Uncompressed: 3747249 = 1221 * 3069
  with XZDIO: 127408 bytes

Hmmm, not much savings (in the compressed bit maps), likely because the
bits are randomly distributed in the sparse bit maps.

But remember there's a 16-32x overkill built into the original Bloom filter 
parameters,
so its pretty clear a Bloom filter is gonna start "winning" soon, based solely 
on minimum
size criteria, compared to a string representation.

Next will be to measure specific per-package parameterized Bloom filters with 
the original e=0.0001
false positive rate.

hth

73 de Jeff
______________________________________________________________________
RPM Package Manager                                    http://rpm5.org
Developer Communication List                        rpm-devel@rpm5.org

Reply via email to