Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Sat, Jun 23, 2012 at 7:10 AM, Jeffrey Johnson n3...@me.com wrote: Why is it any wrong to minimize bandwidth, or, in other words, why it is bad to spend less money? Your answer is like, because the meaning of life is not to spend less money, which is a wrong perspective. Okay, but what's a better perspective? Spending more money for less good is not at all a better perspective. Nothing I said implies that bandwidth reduction is wrong. If you want to minimize bandwidth downloading packages compress the metadata and deal with the legacy compatible issues however you want/need. My rule of thumb is that metadata is ~15% of the size of a typical *.rpm package: assuming 50% compression one might save 7.5% of the package download cost (0.50 * 0.15 = 7.5%). An interesting practical matter behind set-versions is that we can simply represent the exact set of symbols, which can be encoded and pictured like this: Provides: libfoo.so.0 = set:sym1,sym2,sym3,sym4 Requires: libfoo.so.0 = set:sym1,sym3 (where symN stands for direct ELF symbol name - e.g. strcmp). You simply name the symbols which you require or provide! In fact, this is exactly how early alpha set-versions were implemented - before I had some time to ponder over approximate set/subset encoding problem (it is then how sym1,sym2... sets where converted into numbers, per symbol, and compressed). The point here is that the price of the exact set representation may, or may not, be prohibitive. If the price is not prohibitive, it's a no-brainer: you don't have to involve into approximate subset business at all. Sometimes, you simply should not use bloom filters, despite the fact that they might seem appealing. However, if the price is prohibitive, which it was, the reason for going into approximate subset business is also a no-brainer: you should cut down heavily and optimize for size first. If you simply introduced probabilities without making things less prohibitive, did you do anything useful at all? You only spoiled things a bit! There is also a philosophical consideration which somehow accompanies this practical consideration. There is a short story, I believe by Borges, where a clever scientist devises a 1-1 map of reality. A 1-1 map of reality turns out to be a very true picture of reality. What's wrong with this approach? Well, a 1-1 map of the world turns out to be an exact copy of the world, which is of no use in terms of being a map. Somehow, the reality must be construed and reduced to a simpler (and somewhat coarser) description to become a useful model. This is also why we don't plug ELF sections into RPM headers: we believe that much simpler (or at least much shorter) dependencies must be used to represent ELF binaries in terms of their interconnectedness, and must also omit other less important details. Back to the story of set-versions, with the original implementation (which introduced full Golomb coding), it was estimated that the size of architecture-dependent pkglist.classic.bz2 metadata is going to go up from about 3M to about 12M, four-fold! This still was almost prohibitive to soar up like this. It was considered non-prohibitive only by the virtue of information-theoretical considerations: since we are going to encode that many symbols, we must not fool ourselves into thinking that we could somehow pay a smaller price - that is, without violating fundamental laws. By the way, what's the information-theoretical minimum? Say, we want to encode 1024 20-bit hash values (which yields the false positive rate at about 0.1%). Well, the first mathematical intuition is that we need to cut 20-bit range into 1024 smaller stripes, which gives 10 bits per stripe, on average. It is a little bit more complicated than that, though, exactly because of this on average business: we must also take some bits to encode stripe boundaries. But this is only a mathematical intuition. The exact formula, in R, is: lchoose(2**20,2**10)/log(2)/2**10 [1] 11.43581 (so, on the other hand, and somewhat unexpectedly, it is that old good n choose k business.) Current implementation lines up at about 11.6 bits per symbol. This is why sometimes I say that set-versions currently take about 2 alnum character per Provides symbol - this is because each alnum character can stuff about log2(62)=5.9+ bits. But compare this to early alpha set-versions which represented exact sets in form of set:sym1,sym2,sym3,sym4, that is, in terms of enumerating symbols. With exact sets, can you think of going to anywhere near 2 characters per symbol, especially that you also need a separator? This leaves you 1 character per symbol! :-) There's a saying that Perl makes easy things easy and hard things possible. Set-versions were designed to make hard things possible. __ RPM Package Managerhttp://rpm5.org Developer Communication List
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jun 28, 2012, at 12:18 PM, Alexey Tourbin wrote: There is also a philosophical consideration which somehow accompanies this practical consideration. There is a short story, I believe by Borges, where a clever scientist devises a 1-1 map of reality. A 1-1 map of reality turns out to be a very true picture of reality. What's wrong with this approach? Well, a 1-1 map of the world turns out to be an exact copy of the world, which is of no use in terms of being a map. Somehow, the reality must be construed and reduced to a simpler (and somewhat coarser) description to become a useful model. This is also why we don't plug ELF sections into RPM headers: we believe that much simpler (or at least much shorter) dependencies must be used to represent ELF binaries in terms of their interconnectedness, and must also omit other less important details. (aside) Nice: I'll dig out the Borges story … meanwhile (iirc) there is a story (and surely some sort of literary response to Borges) called How Much Shall We Bet in Cosmicomics that I pass along for your amusement. Seriously yes: carrying redundant metadata _EXACTLY_ as is in the interest of accurate data -- while well intended -- isn't needed. And both Bloom filters and set:versions are useful/compact data stores for existence tests without a need to also carry the _EXACT_ values (if/when the false positive rate can be tolerated, which is often true for the moderate sized sets needed in package management). The rest I agree with. 73 de Jeff__ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Thu, Jun 28, 2012 at 8:18 PM, Alexey Tourbin alexey.tour...@gmail.com wrote: There is also a philosophical consideration which somehow accompanies this practical consideration. There is a short story, I believe by Borges, where a clever scientist devises a 1-1 map of reality. A 1-1 map of reality turns out to be a very true picture of reality. What's wrong with this approach? Well, a 1-1 map of the world turns out to be an exact copy of the world, which is of no use in terms of being a map. Somehow, the reality must be construed and reduced to a simpler (and somewhat coarser) description to become a useful model. This is also why we don't plug ELF sections into RPM headers: we believe that much simpler (or at least much shorter) dependencies must be used to represent ELF binaries in terms of their interconnectedness, and must also omit other less important details. This philosophical argument applies to set-versions in a (not-so-)obvious manner, which I will now clarify. It goes like this: although the ultimate goal is to check that R-set of symbols is a subset of P-set of symbols, you do not necessarily have to store the full names of the symbols in order to perform a somewhat stripped-down check itself. When it only matters if R is subset of P, the names themselves become largely irrelevant, provided that you can devise a very clever substitution/encoding scheme. You can make much simpler (or at least shorter) dependencies by getting rid of the names in a manner which does not destroy the check. The downside is, of course, that when a dependency R subset P is broken, it is not easy to find out which P symbols were deleted or renamed (or which R symbols are missing). But this is largely a developer's, or should I say a hacker's, problem. On the other hand, from the user point of view, and also from rpm(1) perspective, this approach simply promotes synchronous or rather transactional upgrades. It says like, guys, I will not apply half-baked updates before you fix it all - so that apps and libraries match. Which totally makes sense! __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jun 28, 2012, at 1:27 PM, Alexey Tourbin wrote: The downside is, of course, that when a dependency R subset P is broken, it is not easy to find out which P symbols were deleted or renamed (or which R symbols are missing). But this is largely a developer's, or should I say a hacker's, problem. Calculating set union/disjunction (for Bloom filters, surely almost as easy with set:versions) for P/R sets union P | R intersectionP R is natural/rapid/easy even if the symbols are unknown. FWIW: the arbitrary tags implemented @rpm5.org have a similar one-way lossy mapping where the string representation of the tag name cannot be unmapped from the 30 bit (truncated SHA1) tag number. The lack of mapping invertibility matters little to batch oriented installer usage cases in RPM libraries even if enquiring minds wish concise/complete/informative error messages for diagnosis of problems. Error messages (and symbols) are overrated and less useful with moderately sized token dictionaries imho. I hope to get around to generating set:versions this weekend. I was almost there 10 days ago, but got side-tracked by 1 week's worth of cable modem issues which has now led to postponed backup/upgrade chores … almost finished. 73 de Jeff__ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org