On Jun 21, 2012, at 8:27 PM, Alexey Tourbin <alexey.tour...@gmail.com> wrote:
> On Thu, Jun 21, 2012 at 7:28 PM, Jeffrey Johnson <n3...@me.com> wrote:
>>> More precisely, a set-version can be (in principle) converted to a
>>> Bloom filter which uses only one hash function. The idea is that such
>>> a filter will set bits in a highly sparse set of bits, one by one.
>>> Instead, a set-version remembers the bits simply by their indices.
>>> Setting a bit becomes taking a number, and there is a
>>> straightforward correspondence. It also turns out that a set of
>>> numbers can be easier to deal with, as opposed to a sparse set of
>>> bits.
>>
>> You've made an unsupported claim here re "numbers can be easier".
>> Presumably you are talking about means of encoding, where clever
>> mathematics derived from numerical transforms can be used to
>> remove redundancy. With a pile of bits in a Bloom filter all one
>> has to work with is a pile of bits.
>
> Numbers are easier to deal with because you can use an 'unsigned v[]'
> array to represent them, which can be seen throughout the code,
> accompanied by an 'int c' (argc/argv[] style). Bits are somewhat
> more complicated: you need to use something like the <sys/select.h>
> macros, and you cannot use pointer arithmetic to implement traversal,
> etc.

Well, different strokes here. The #include <sys/select.h> is complicated
only because of effete Linux kernel debutantes with their precious
kernel stack advocacy issues re select(2). These minor defines (swiped
directly from <sys/select.h> mind you) are hugely portable:

#define __PBM_NBITS /*@-sizeoftype@*/(8 * sizeof(__pbm_bits))/*@=sizeoftype@*/
#define __PBM_IX(d)     ((d) / __PBM_NBITS)
#define __PBM_MASK(d)   ((__pbm_bits) 1 << (((unsigned)(d)) % __PBM_NBITS))
#define __PBM_BITS(set) ((__pbm_bits *)(set)->bits)

#define PBM_FREE(s)     _free(s);
#define PBM_SET(d, s)   (__PBM_BITS (s)[__PBM_IX (d)] |= __PBM_MASK (d))
#define PBM_CLR(d, s)   (__PBM_BITS (s)[__PBM_IX (d)] &= ~__PBM_MASK (d))
#define PBM_ISSET(d, s) ((__PBM_BITS (s)[__PBM_IX (d)] & __PBM_MASK (d)) != 0)
#define PBM_ALLOC(d)    xcalloc(__PBM_IX (d) + 1, __PBM_NBITS/8)

See <rpmio/rpmbf.h>.
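To see those macros in action, here is a small self-contained sketch.
The scaffolding is mine, not rpm's: __pbm_bits is assumed typedef'd to
unsigned long, the set object is reduced to a bare struct with a 'bits'
member, and plain calloc/free stand in for xcalloc/_free:

#include <stdio.h>
#include <stdlib.h>

typedef unsigned long __pbm_bits;
typedef struct { void * bits; } pbm_set;   /* stand-in, not rpm's type */

#define __PBM_NBITS     (8 * sizeof(__pbm_bits))
#define __PBM_IX(d)     ((d) / __PBM_NBITS)
#define __PBM_MASK(d)   ((__pbm_bits) 1 << (((unsigned)(d)) % __PBM_NBITS))
#define __PBM_BITS(set) ((__pbm_bits *)(set)->bits)

#define PBM_SET(d, s)   (__PBM_BITS (s)[__PBM_IX (d)] |= __PBM_MASK (d))
#define PBM_CLR(d, s)   (__PBM_BITS (s)[__PBM_IX (d)] &= ~__PBM_MASK (d))
#define PBM_ISSET(d, s) ((__PBM_BITS (s)[__PBM_IX (d)] & __PBM_MASK (d)) != 0)

int main(void)
{
    pbm_set s;
    unsigned maxbit = 1023;     /* highest bit index we intend to use */

    /* PBM_ALLOC(maxbit), spelled out with calloc: enough words to
     * cover bits 0..maxbit, zero-filled. */
    s.bits = calloc(__PBM_IX(maxbit) + 1, __PBM_NBITS / 8);

    PBM_SET(13, &s);
    PBM_SET(700, &s);
    printf("%d %d %d\n", PBM_ISSET(13, &s), PBM_ISSET(14, &s),
           PBM_ISSET(700, &s));             /* prints: 1 0 1 */

    PBM_CLR(13, &s);
    printf("%d\n", PBM_ISSET(13, &s));      /* prints: 0 */

    free(s.bits);
    return 0;
}

Compiled as-is, this prints "1 0 1" and then "0".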
> Also, to me, a set of numbers just "makes sense", and is good to go.
> If you wonder how complicated things can get, there have been some
> recent papers which use matrix solving for approximate set
> representation. They are basically unreadable (given undergraduate
> skills), and you can be completely lost because there is apparently no
> connection between the Galois fields and what you are actually trying
> to do. After you try to read these papers, the "set of numbers"
> becomes a wonderful salvation which comes directly from God and the
> Holy Spirit and brings peace to your soul. This wonderful salvation,
> which comes from God, turns out to be a good thing. You should use it,
> when you have some. :-)

Sure, numbers "make sense". But God invented 0 and 1, and who needs
steenkin' carries to do arithmetic in Galois fields?!?

>> The counter argument is this: testing bits (in a Bloom filter) is
>> rather KISS. Other compression/redundancy-removal schemes end up
>> trading storage for implementation complexity cost. In a running
>> installer like RPM, there will always be a need for memory dealing
>> with payload contents. Since large amounts of memory are eventually
>> going to be needed/used, savings from Golomb-Rice codes are mostly
>> irrelevant for the 100K -> 1M set sizes used by RPM, no matter
>> whether bits or numbers are used as the representation.
>
> I do not agree that testing bits is more KISS than a set of numbers.
> If you think that testing bits is the only natural thing to do, I
> disagree. The reason is: Requires versions are much more sparse than
> Provides versions. If the Provides version is optimal in terms of bit
> usage (50% of bits set), the Requires version must necessarily be
> suboptimal (too few bits set). It can be pictured like this:
>
> P: 01010010010101
> R: 01000000010000

Yes … but this is "premature optimization" …

> The conclusion is: if you want to stick to bitmasks and reduce subset
> comparison to bitmask tests, there will be a great deal of
> inefficiency because of the sparse R-sets. On the other hand, sets of
> numbers can be pictured like this:
>
> P: 1,3,7,13,15
> R: 1,13
>
> Note that the R-set does not take extra space, and there is a simple
> merge-like algorithm which advances either R or P (or both), which is
> how the rpmsetcmp routine is implemented.
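For concreteness, the merge can be sketched like this. A minimal
sketch, not the actual rpmsetcmp source: it assumes both sets have
already been decoded into sorted arrays of hash values, and the names
are illustrative:

#include <stddef.h>

/* Return 1 if every element of the sorted array R also occurs in the
 * sorted array P, else 0.  Each step advances P, or both P and R on a
 * match, so the cost is O(np + nr) comparisons. */
static int
subsetof(const unsigned * R, size_t nr, const unsigned * P, size_t np)
{
    size_t i = 0, j = 0;
    while (i < nr) {
        if (j == np || R[i] < P[j])
            return 0;   /* R[i] cannot occur in what is left of P */
        if (R[i] == P[j])
            i++;        /* matched: advance R */
        j++;            /* advance P in either case */
    }
    return 1;           /* every R element was matched */
}

With the P and R pictured above, subsetof() answers 1 in a single pass;
with R: 1,14 it would answer 0 as soon as it sees 14 < 15. No searching
per R element is needed.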
… if you are serious about minimal size, compressing headers in *.rpm
packages will save more bandwidth than any other single optimization
one might choose to implement.

>> But don't take my argument as being opposed to set:versions whatsoever.
>> Just (for RPM and even for *.rpm) that compression size isn't as
>> important to me as it is to you as the "primary goal" of set:versions.
>
> The size of *.rpm is very important when you want to update the
> information about a repo, like in "apt-get update" - you have to
> download it all. The question is then how big the repo is, and how
> many set-versions you can afford to download. :-) The "primary goal"
> was not to make things much worse. In other words, the price must not
> be prohibitive even for a repo 10k or 15k packages strong.
> (Please also realize that the Internet was not a given until very
> recent times - some shoddy ISPs in Kamchatka still want to charge
> something like $0.10 per 1Mb.)

How big the repo is determines how important the distro is and nothing
else. Just look at how impo'tent Debian is …

>> AFAICT set:versions also has a finite probability of failure due to
>> (relatively rare) hash collisions converting strings to numbers.
>>
>> Is there any hope that set:versions (or the Golomb-Rice code) can
>> supply a parameterization to "tune" the probability of false
>> positives?
>>
>> Or am I mis-understanding what is implemented?
>
> You can tune the probability by selecting the appropriate "bpp"
> parameter and passing it to "mkset" (which is how the scripts work).
> The "bpp" says how many bits must be used per hash value (the range
> is 10..32). For example, if you want to encode a Provides version
> which has 1024 symbols, and the error rate has to be fixed at about
> 0.1%, you need to use bpp=20 - that is, after each symbol is hashed,
> its 20 lower bits will be taken into account and encoded. The
> probability of a false match is then simply the number of symbols
> over the capacity of the universe: 2^{10}/2^{20} = 2^{-10}, about
> 0.1%. On the other hand, when you encode Requires versions, you
> should use the same bpp as the corresponding Provides version has.

Noted. Nice. Thanks.

> As I've mentioned, there is a program "mkset" that does just that, and
> you can play with it a bit to understand how things work.
>
> $ print -l sym1 sym2 |/usr/lib/rpm/mkset 10
> set:dbUnz4
> $ print -l sym1 sym2 |/usr/lib/rpm/mkset 20
> set:nl2ZEALdS
>
> In the first run, the false-match probability is modest, about
> 2/2^{10} ("print -l" is zsh syntax to print one arg per line). In the
> second run, the probability is much lower, about 2/2^{20} (and the
> encoded set takes more space). There is also a function in the
> scripts to sustain probabilities at about 0.1%.

Hmmm … I've been defaulting Bloom filters ultra-conservatively at
10**-4 and moving to 10**-6 at the first hint of a possible problem.
Perhaps it is time to become more aggressive in parameter tuning:
todo++.

> SuggestBPP()
> {
>     # For the number of symbols in the range 2^{m-1}..2^m-1,
>     # use m+10 bits per symbol. The upper bound on the error rate
>     # is then 2^{-10} (about 0.1%).
>     perl -le 'print int log(shift) / log(2) + 11' "$1"
> }

Thanks for the info.

73 de Jeff

______________________________________________________________________
RPM Package Manager                                    http://rpm5.org
Developer Communication List                        rpm-devel@rpm5.org