On Jun 22, 2012, at 5:26 PM, Alexey Tourbin <alexey.tour...@gmail.com> wrote:
> On Fri, Jun 22, 2012 at 5:30 AM, Jeffrey Johnson <n3...@me.com> wrote: >> Sure numbers "make sense". >> >> But God invented 0 and 1 and who needs steenkin carries to do >> arithmetic in Galois fields?!? > > Jeffery, I understand that you are ironic and sarcastic, but I can't > see the reason why, as per our discussion. If you ask honest > questions, like "what this means", I try to do my best to answer the > question, possibly involving considerations like "numbers comes from > God", which are questionable but not irrelevant. They may help to > understand, or may not. > Apologies for the sarcasm. My comments were inline with your observations regarding undergraduate difficulties with Galois fields and matrices imho. But math jokes are bone dry … >>> P: 01010010010101 >>> R: 01000000010000 >> >> Yes … but this is "premature optimization" … > > There is no premature optimization here, and it is fair to ask what we > may possibly want to try to optimize. My answer is: first, it must > take less space (given the probability) because we have to download > the repo metadata every time we run the "update"; second, set-versions > must compare quickly (we must be able to compare them all within a > second). What's your suggestion? Do you want them to take more space > and compare slowly, so as to avoid premature optimizations? :-) Or do > you dislike them exactly because they are not Bloom filters? Go on > then, make shorter strings, given the same probability, which compare > faster! But this is largely impossible, exactly because of these > laughable topics like "what is a number" or "what is a probability" > which I'm trying to present. > I would state that compression (of any sort) to minimize bandwidth is entirely the wrong problem to solve. Even in Kazakistan, one can minimize bandwidth by Upgrading less often! My contrarian POV should not be taken as opposed to compression or elegance or anything else. Just that minimal size in the representation of bit fields (or bit positions as numbers) overly limits the applicability of set:versions. There are lots of usage cases for efficient sub-set computations in package management, not just as a de facto API/ABI check using ELF symbols. Most of the other usage cases for efficient sub-set computations are not subject to an ultimately optimal string encoding. E.g. ripping off the base62 encoding and distributing binary assertion fields saves at least as much bandwidth as your guesstimate that a Golomb-Rice code is ~20% more efficient than a Bloom filter. So don't take my flip comments as being opposed to what you have implemented. >> How big the repo is determines how important the >> distro is and nothing else. Just look at how impo'tent Debian is … > > Your considerations are probably true but are largely irrelevant to > the discussion. > Besides that, I can't understand them. Are you sarcastic? I am > sarcastic too, that's not at all a problem. Shall we talk? :-) > Just in case: yes acidic sarcasm was fully intended. >>> $ print -l sym1 sym2 |/usr/lib/rpm/mkset 10 >>> set:dbUnz4 >>> $ print -l sym1 sym2 |/usr/lib/rpm/mkset 20 >>> set:nl2ZEALdS >>> >>> In the first run, the probability is modest ("print -l" is zsh syntax >>> to print an arg per line). In the second run, the probability is much >>> higher (and it takes more space). There is also a function in scripts >>> to sustain probabilities at about o 0.1%. >> >> Hmmm … I've been defaulting Bloom filters ultra conservatively >> at 10**-4 and mobing to 10**-6 at the first hint of a possible problem. > > The number does not matter here, the only consideration was that > 2^{-10} was kind of cute number, and it works well in practice - that > is, we catch all or most of the bugs where a library symbol has been > deleted. But it could also be 2^{-16} or even 2^{-20}. It is a > trade-off, and I see no ground for criticizing it for just that. Or > again it boils down to the question how many megabytes you expect to > download when you run the "update". You already made a valid point asking what probability means wrto false positives. In fact your 2**10/2**20 is a very different estimate of false positive probability than the approx. currently in use in rpmbf.c (which is based on a statistical model for Bloom filter parameters). Below is what is in use by RPM which chooses {m, k} given an estimated population n and a desired probability of false positives e (I forget where I found the actual computation, can likely find it again if/when necessary). What I would like to be able to do with set:versions in RPM is to be able to use either of these 2 "container" representations interchangeably. I'd also like to be able to pre-compute either form in package tags for whatever purpose without the dreadful tyranny of being "legacy compatible" with older versions of RPM. Adding a new tag that isn't used by older RPM is exactly as compatible as following existing string representations for NEVR dependencies in existing tags, arguably more compatible because older versions of RPM are incapable of interpreting unknown tags. And again: the best solution for the download bandwidth problem is Update the metadata less often. if you think a bit. 73 de Jeff ==================== void rpmbfParams(size_t n, double e, size_t * mp, size_t * kp) { static size_t _nmin = 10; static size_t _n = 10000; static size_t _nmax = 1 * 1000 * 1000 * 1000; static double _emin = 1.0e-10; static double _e = 1.0e-4; static double _emax = 1.0e-2; size_t m; size_t k; /* XXX sanity checks on (n,e) args. */ if (!(n >= _nmin && _n <= _nmax)) n = _n; if (!(e >= _emin && _e <= _emax)) e = _e; m = (size_t)((n * log(e)) / (log(1.0 / pow(2.0, log(2.0)))) + 0.5); k = (size_t) ((m * log(2.0)) / n); if (mp) *mp = m; if (kp) *kp = k; if (_rpmbf_debug) fprintf(stderr, "<-- %s(%u, %g) m %u k %u\n", __FUNCTION__, (unsigned)n, e, (unsigned)m, (unsigned)k); } ______________________________________________________________________ RPM Package Manager http://rpm5.org Developer Communication List rpm-devel@rpm5.org