On Jun 22, 2012, at 5:26 PM, Alexey Tourbin <alexey.tour...@gmail.com> wrote:

> On Fri, Jun 22, 2012 at 5:30 AM, Jeffrey Johnson <n3...@me.com> wrote:
>> Sure numbers "make sense".
>> 
>> But God invented 0 and 1 and who needs steenkin carries to do
>> arithmetic in Galois fields?!?
> 
> Jeffery, I understand that you are ironic and sarcastic, but I can't
> see the reason why, as per our discussion. If you ask honest
> questions, like "what this means", I try to do my best to answer the
> question, possibly involving considerations like "numbers comes from
> God", which are questionable but not irrelevant. They may help to
> understand, or may not.
> 

Apologies for the sarcasm. My comments were inline with your observations
regarding undergraduate difficulties with Galois fields and matrices imho.

But math jokes are bone dry …


>>> P: 01010010010101
>>> R: 01000000010000
>> 
>> Yes … but this is "premature optimization" …
> 
> There is no premature optimization here, and it is fair to ask what we
> may possibly want to try to optimize. My answer is: first, it must
> take less space (given the probability) because we have to download
> the repo metadata every time we run the "update"; second, set-versions
> must compare quickly (we must be able to compare them all within a
> second). What's your suggestion? Do you want them to take more space
> and compare slowly, so as to avoid premature optimizations? :-) Or do
> you dislike them exactly because they are not Bloom filters? Go on
> then, make shorter strings, given the same probability, which compare
> faster! But this is largely impossible, exactly because of these
> laughable topics like "what is a number" or "what is a probability"
> which I'm trying to present.
> 

I would state that compression (of any sort) to minimize
bandwidth is entirely the wrong problem to solve.

Even in Kazakistan, one can minimize bandwidth by
        Upgrading less often!


My contrarian POV should not be taken as opposed to compression
or elegance or anything else. Just that minimal size in the
representation of bit fields (or bit positions as numbers)
overly limits the applicability of set:versions.

There are lots of usage cases for efficient sub-set computations
in package management, not just as a de facto API/ABI check
using ELF symbols. Most of the other usage cases for efficient
sub-set computations are not subject to an ultimately optimal
string encoding.

E.g. ripping off the base62 encoding and distributing binary
assertion fields saves at least as much bandwidth as your
guesstimate that a Golomb-Rice code is ~20% more efficient
than a Bloom filter.

So don't take my flip comments as being opposed to what you have
implemented.

>> How big the repo is determines how important the
>> distro is and nothing else. Just look at how impo'tent Debian is …
> 
> Your considerations are probably true but are largely irrelevant to
> the discussion.
> Besides that, I can't understand them. Are you sarcastic? I am
> sarcastic too, that's not at all a problem. Shall we talk? :-)
> 

Just in case: yes acidic sarcasm was fully intended.

>>> $ print -l sym1 sym2 |/usr/lib/rpm/mkset 10
>>> set:dbUnz4
>>> $ print -l sym1 sym2 |/usr/lib/rpm/mkset 20
>>> set:nl2ZEALdS
>>> 
>>> In the first run, the probability is modest ("print -l" is zsh syntax
>>> to print an arg per line). In the second run, the probability is much
>>> higher (and it takes more space). There is also a function in scripts
>>> to sustain probabilities at about o 0.1%.
>> 
>> Hmmm … I've been defaulting Bloom filters ultra conservatively
>> at 10**-4 and mobing to 10**-6 at the first hint of a possible problem.
> 
> The number does not matter here, the only consideration was that
> 2^{-10} was kind of cute number, and it works well in practice - that
> is, we catch all or most of the bugs where a library symbol has been
> deleted. But it could also be 2^{-16} or even 2^{-20}. It is a
> trade-off, and I see no ground for criticizing it for just that. Or
> again it boils down to the question how many megabytes you expect to
> download when you run the "update".

You already made a valid point asking what probability means wrto
false positives. In fact your 2**10/2**20 is a very different
estimate of false positive probability than the approx. currently
in use in rpmbf.c (which is based on a statistical model for
Bloom filter parameters).

Below is what is in use by RPM which chooses {m, k} given
an estimated population n and a desired probability of false
positives e (I forget where I found the actual computation,
can likely find it again if/when necessary).

What I would like to be able to do with set:versions
in RPM is to be able to use either of these 2 "container"
representations interchangeably.

I'd also like to be able to pre-compute either form
in package tags for whatever purpose without the
dreadful tyranny of being "legacy compatible" with
older versions of RPM. Adding a new tag that isn't
used by older RPM is exactly as compatible as following
existing string representations for NEVR dependencies
in existing tags, arguably more compatible because
older versions of RPM are incapable of interpreting
unknown tags.

And again: the best solution for the download bandwidth
problem is
        Update the metadata less often.
if you think a bit.

73 de Jeff


====================
void rpmbfParams(size_t n, double e, size_t * mp, size_t * kp)
{
    static size_t _nmin = 10;
    static size_t _n = 10000;
    static size_t _nmax = 1 * 1000 * 1000 * 1000;
    static double _emin = 1.0e-10;
    static double _e = 1.0e-4;
    static double _emax = 1.0e-2;
    size_t m;
    size_t k;

    /* XXX sanity checks on (n,e) args. */
    if (!(n >= _nmin && _n <= _nmax))
        n = _n;
    if (!(e >= _emin && _e <= _emax))
        e = _e;

    m = (size_t)((n * log(e)) / (log(1.0 / pow(2.0, log(2.0)))) + 0.5);
    k = (size_t) ((m * log(2.0)) / n);
    if (mp) *mp = m;
    if (kp) *kp = k;
if (_rpmbf_debug)
fprintf(stderr, "<-- %s(%u, %g) m %u k %u\n", __FUNCTION__, (unsigned)n, e, 
(unsigned)m, (unsigned)k);
}

______________________________________________________________________
RPM Package Manager                                    http://rpm5.org
Developer Communication List                        rpm-devel@rpm5.org

Reply via email to