On Jun 21, 2012, at 8:27 PM, Alexey Tourbin <alexey.tour...@gmail.com> wrote:

> On Thu, Jun 21, 2012 at 7:28 PM, Jeffrey Johnson <n3...@me.com> wrote:
>>> More precisely, a set-version can be (in principle) converted to a
>>> Bloom filter which uses only one hash function. The idea is that such
>>> a filter will set bits in a highly sparse set of bits, one by one.
>>> Instead, a set-version remembers the bits simply by their indices.
>>> Setting the bit becomes taking the number, and there is a
>>> straightforward correspondence. It also turns out that a set of
>>> numbers can be easier to deal with, as opposed to a sparse set of
>>> bits.
>> 
>> You've made an unsupported claim here re "numbers can be easier".
>> Presumably you are talking about means of encoding, where clever
>> mathematics derived from numerical transforms can be used to
>> remove redundancy. With a pile of bits in a Bloom filter all one
>> has to work with is a pile of bits.
> 
> Numbers are easier to deal with because you can use 'unsigned v[]'
> array to represent them, which can be seen throughout the code,
> accompanied by 'int c' (which is argc/argv[] style). Bits are somewhat
> more complicated: you need to use something like <sys/select.h>
> macros, and you cannot use pointer arithmetic to implement traversal,
> etc.
> 

Well different strokes here. The
        #include <sys/select.h>
is complicated only because of effete Linux kernel debutantes
with their precious kernel stack advocacy issues re select(2).
These minor defines (swiped directly from <sys/select.h> mind you)

are hugely portable:
#define __PBM_NBITS /*@-sizeoftype@*/(8 * sizeof(__pbm_bits))/*@=sizeoftype@*/
#define __PBM_IX(d)     ((d) / __PBM_NBITS)
#define __PBM_MASK(d)   ((__pbm_bits) 1 << (((unsigned)(d)) % __PBM_NBITS))
#define __PBM_BITS(set) ((__pbm_bits *)(set)->bits)

#define PBM_FREE(s)     _free(s);
#define PBM_SET(d, s)   (__PBM_BITS (s)[__PBM_IX (d)] |= __PBM_MASK (d))
#define PBM_CLR(d, s)   (__PBM_BITS (s)[__PBM_IX (d)] &= ~__PBM_MASK (d))
#define PBM_ISSET(d, s) ((__PBM_BITS (s)[__PBM_IX (d)] & __PBM_MASK (d)) != 0)

#define PBM_ALLOC(d)    xcalloc(__PBM_IX (d) + 1, __PBM_NBITS/8)

See <rpmio/rpmbf.h>.

> Also, to me, a set of numbers just "makes sense", and is good to go.
> If you wonder how much complicated things can get, there's been some
> recent papers out there which use matrix solving for approximate set
> representation. They are basically unreadable (given the undergraduate
> skills), and you can be completely lost because there is apparently no
> connection between the Galois fields and what you actually try to do.
> After you try to read these papers, the "set of numbers" becomes a
> wonderful salvation which comes directly from God and the Holy Spirit
> and brings peace to your soul. This wonderful salvation, which comes
> from God, turns out to be a good thing. You should use it, when you
> have some. :-)
> 

Sure numbers "make sense".

But God invented 0 and 1 and who needs steenkin carries to do
arithmetic in Galois fields?!?

>> The counter argument is this: testing bits (in a Bloom filter) is
>> rather KISS. Other compression/redundancy removal schemes
>> end up trading storage for implementation complexity cost.
>> In a running installer like RPM, there will always be a need
>> for memory dealing with payload contents. Since large amounts
>> of memory are eventually going to be needed/used, savings
>> by using Golomb-Reid codes are mostly irrelevant for the
>> 100K -> 1M set sizes used by RPM no matter whether bits or
>> numbers are used to represent.
> 
> I do not agree that testing bits is more KISS than a set of numbers.
> It you think that testing bits is the only and natural thing to do, I
> disagree. The reason is: Requires versions are much more sparse than
> Provides versions. If the Provides version is optimal, in terms of bit
> usage (50% set), the Requires version must be necessarily suboptimal
> (too few bits set). It can be pictured like this:
> 
> P: 01010010010101
> R: 01000000010000
> 

Yes … but this is "premature optimization" …

> The conclusion is, if you want to stick to bitmasks and reduce subset
> comparison to bitmask tests, there will be a great deal of
> inefficiency because of sparse R-sets. On the other hand, sets of
> numbers can be pictured like this:
> 
> P: 1,3,7,13,15
> R: 1,13
> 
> Note that R-set does not take extra space, and there is a simple
> merge-like algorithm which advances either R or P (or both - which is
> how rpmsetcmp routine is implemented).
> 

… if you are serious abtout minimal size, compressing headers
in *.rpm packages will save more bandwidth than any single other
optimization one might choose to implement.

>> But don't take my argument as being opposed to set:versions whatsoever.
>> Just (for RPM and even for *.rpm) that compression size isn't as important
>> to me as it is to you as the "primary goal" of set: versions.
> 
> The size of *.rpm is very important when you want to update the
> information about a repo, like in "apt-get update" - you have to
> download it all. The question is then how big the repo is, and how
> many set-versions you can afford to download. :-) The "primary goal"
> was not to make things much worse. In other words, the price must not
> be prohibitive even for the repo of 10k or 15k packages strong.
> (Please also realize that the Internet was not a given until very
> recent times - some shoddy ISPs in Kamchatka still want to charge
> something like $0.1 per 1Mb.)
> 

How big the repo is determines how important the
distro is and nothing else. Just look at how impo'tent Debian is …

>> AFAICT set:versions also has a finite probability of failure due to 
>> (relatively rare)
>> hash collisions converting strings to numbers.
>> 
>> Is there any hope that set:versions (or the Golomb-Rice code) can supply a 
>> parameterization
>> to "tune" the probability of false positives?
>> 
>> Or am I mis-understanding what is implemented?
> 
> You can tune the probability by selecting the appropriate "bpp"
> paramter, and passing it to "mkset" (which is how scripts work). The
> "bpp" indicates how much bits must be used per hash value (the range
> is 10..32). For example, if you want to encode Provides versions which
> has 1024 symbols, and the error rate has to be fixed at about 0.1%,
> you need to use bpp=20 - that is, after each symbol is hashed, its 20
> lower bits will be taken into account and encoded. The probability
> then is simply the number of symbols over the capacity of the
> universe, which is 2^{10}/2^{20}=0.1%. On the other hand, when you
> encode Requires versions, you should use the same bpp as the
> corresponding Provides version have.
> 

Noted. Nice. Thanks.

> As I've mentioned there is a program "mkset" that does just that, and
> you can play with it a bit to understand how things work.
> 
> $ print -l sym1 sym2 |/usr/lib/rpm/mkset 10
> set:dbUnz4
> $ print -l sym1 sym2 |/usr/lib/rpm/mkset 20
> set:nl2ZEALdS
> 
> In the first run, the probability is modest ("print -l" is zsh syntax
> to print an arg per line). In the second run, the probability is much
> higher (and it takes more space). There is also a function in scripts
> to sustain probabilities at about o 0.1%.
> 

Hmmm … I've been defaulting Bloom filters ultra conservatively
at 10**-4 and mobing to 10**-6 at the first hint of a possible problem.

Perhaps time to become more aggressive in parameter tuning: todo++.

> SuggestBPP()
> {
>        # For the number of symbols in the range 2^{m-1}..2^m-1,
>        # use m+10 bits per symbol.  Upper bound on the error rate
>        # is 2^{-10} (about 0.1%).
>        perl -le 'print int log(shift) / log(2) + 11' "$1"
> }

Thanks for the info.

73 de Jeff

> ______________________________________________________________________
> RPM Package Manager                                    http://rpm5.org
> Developer Communication List                        rpm-devel@rpm5.org

______________________________________________________________________
RPM Package Manager                                    http://rpm5.org
Developer Communication List                        rpm-devel@rpm5.org

Reply via email to