Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-28 Thread Alexey Tourbin
On Sat, Jun 23, 2012 at 7:10 AM, Jeffrey Johnson n3...@me.com wrote:
 Why is it any wrong to minimize bandwidth, or, in other words, why it
 is bad to spend less money? Your answer is like, because the meaning
 of life is not to spend less money, which is a wrong perspective.
 Okay, but what's a better perspective? Spending more money for less
 good is not at all a better perspective.

 Nothing I said implies that bandwidth reduction is wrong.

 If you want to minimize bandwidth downloading packages compress
 the metadata and deal with the legacy compatible issues however
 you want/need.

 My rule of thumb is that metadata is ~15% of the size of a
 typical *.rpm package: assuming 50% compression one might save
 7.5% of the package download cost (0.50 * 0.15 = 7.5%).

An interesting practical matter behind set-versions is that we can
simply represent the exact set of symbols, which can be encoded and
pictured like this:

Provides: libfoo.so.0 = set:sym1,sym2,sym3,sym4
Requires: libfoo.so.0 = set:sym1,sym3
(where symN stands for direct ELF symbol name - e.g. strcmp). You
simply name the symbols which you require or provide! In fact, this is
exactly how early alpha set-versions were implemented - before I had
some time to ponder over approximate set/subset encoding problem (it
is then how sym1,sym2... sets where converted into numbers, per
symbol, and compressed). The point here is that the price of the exact
set representation may, or may not, be prohibitive. If the price is
not prohibitive, it's a no-brainer: you don't have to involve into
approximate subset business at all. Sometimes, you simply should not
use bloom filters, despite the fact that they might seem appealing.
However, if the price is prohibitive, which it was, the reason for
going into approximate subset business is also a no-brainer: you
should cut down heavily and optimize for size first. If you simply
introduced probabilities without making things less prohibitive, did
you do anything useful at all? You only spoiled things a bit!

There is also a philosophical consideration which somehow accompanies
this practical consideration. There is a short story, I believe by
Borges, where a clever scientist devises a 1-1 map of reality. A 1-1
map of reality turns out to be a very true picture of reality. What's
wrong with this approach? Well, a 1-1 map of the world turns out to be
an exact copy of the world, which is of no use in terms of being a
map. Somehow, the reality must be construed and reduced to a
simpler (and somewhat coarser) description to become a useful model.
This is also why we don't plug ELF sections into RPM headers: we
believe that much simpler (or at least much shorter) dependencies must
be used to represent ELF binaries in terms of their
interconnectedness, and must also omit other less important details.

Back to the story of set-versions, with the original implementation
(which introduced full Golomb coding), it was estimated that the size
of architecture-dependent pkglist.classic.bz2 metadata is going to go
up from about 3M to about 12M, four-fold! This still was almost
prohibitive to soar up like this. It was considered non-prohibitive
only by the virtue of information-theoretical considerations: since we
are going to encode that many symbols, we must not fool ourselves into
thinking that we could somehow pay a smaller price - that is, without
violating fundamental laws.

By the way, what's the information-theoretical minimum? Say, we want
to encode 1024 20-bit hash values (which yields the false positive
rate at about 0.1%). Well, the first mathematical intuition is that we
need to cut 20-bit range into 1024 smaller stripes, which gives 10
bits per stripe, on average. It is a little bit more complicated than
that, though, exactly because of this on average business: we must
also take some bits to encode stripe boundaries. But this is only a
mathematical intuition. The exact formula, in R, is:

 lchoose(2**20,2**10)/log(2)/2**10
[1] 11.43581

(so, on the other hand, and somewhat unexpectedly, it is that old good
n choose k business.)  Current implementation lines up at about 11.6
bits per symbol. This is why sometimes I say that set-versions
currently take about 2 alnum character per Provides symbol - this is
because each alnum character can stuff about log2(62)=5.9+ bits.

But compare this to early alpha set-versions which represented exact
sets in form of set:sym1,sym2,sym3,sym4, that is, in terms of
enumerating symbols. With exact sets, can you think of going to
anywhere near 2 characters per symbol, especially that you also need a
separator? This leaves you 1 character per symbol! :-)

There's a saying that Perl makes easy things easy and hard things
possible. Set-versions were designed to make hard things possible.
__
RPM Package Managerhttp://rpm5.org
Developer Communication List

Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-28 Thread Jeffrey Johnson

On Jun 28, 2012, at 12:18 PM, Alexey Tourbin wrote:

 
 There is also a philosophical consideration which somehow accompanies
 this practical consideration. There is a short story, I believe by
 Borges, where a clever scientist devises a 1-1 map of reality. A 1-1
 map of reality turns out to be a very true picture of reality. What's
 wrong with this approach? Well, a 1-1 map of the world turns out to be
 an exact copy of the world, which is of no use in terms of being a
 map. Somehow, the reality must be construed and reduced to a
 simpler (and somewhat coarser) description to become a useful model.
 This is also why we don't plug ELF sections into RPM headers: we
 believe that much simpler (or at least much shorter) dependencies must
 be used to represent ELF binaries in terms of their
 interconnectedness, and must also omit other less important details.
 

(aside)
Nice: I'll dig out the Borges story … meanwhile (iirc) there is a story (and 
surely
some sort of literary response to Borges) called
How Much Shall We Bet
in Cosmicomics that I pass along for your amusement.

Seriously yes: carrying redundant metadata _EXACTLY_ as is
in the interest of accurate data -- while well intended -- isn't
needed.

And both Bloom filters and set:versions are useful/compact data
stores for existence tests without a need to also carry the _EXACT_
values (if/when the false positive rate can be tolerated, which
is often true for the moderate sized sets needed in package management).

The rest I agree with.

73 de Jeff__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-28 Thread Alexey Tourbin
On Thu, Jun 28, 2012 at 8:18 PM, Alexey Tourbin
alexey.tour...@gmail.com wrote:
 There is also a philosophical consideration which somehow accompanies
 this practical consideration. There is a short story, I believe by
 Borges, where a clever scientist devises a 1-1 map of reality. A 1-1
 map of reality turns out to be a very true picture of reality. What's
 wrong with this approach? Well, a 1-1 map of the world turns out to be
 an exact copy of the world, which is of no use in terms of being a
 map. Somehow, the reality must be construed and reduced to a
 simpler (and somewhat coarser) description to become a useful model.
 This is also why we don't plug ELF sections into RPM headers: we
 believe that much simpler (or at least much shorter) dependencies must
 be used to represent ELF binaries in terms of their
 interconnectedness, and must also omit other less important details.

This philosophical argument applies to set-versions in a
(not-so-)obvious manner, which I will now clarify. It goes like this:
although the ultimate goal is to check that R-set of symbols is a
subset of P-set of symbols, you do not necessarily have to store the
full names of the symbols in order to perform a somewhat stripped-down
check itself. When it only matters if R is subset of P, the names
themselves become largely irrelevant, provided that you can devise a
very clever substitution/encoding scheme. You can make much simpler
(or at least shorter) dependencies by getting rid of the names in a
manner which does not destroy the check.

The downside is, of course, that when a dependency R subset P is
broken, it is not easy to find out which P symbols were deleted or
renamed (or which R symbols are missing).  But this is largely a
developer's, or should I say a hacker's, problem.

On the other hand, from the user point of view, and also from rpm(1)
perspective, this approach simply promotes synchronous or rather
transactional upgrades. It says like, guys, I will not apply
half-baked updates before you fix it all - so that apps and libraries
match. Which totally makes sense!
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-28 Thread Jeffrey Johnson

On Jun 28, 2012, at 1:27 PM, Alexey Tourbin wrote:

 
 The downside is, of course, that when a dependency R subset P is
 broken, it is not easy to find out which P symbols were deleted or
 renamed (or which R symbols are missing).  But this is largely a
 developer's, or should I say a hacker's, problem.
 

Calculating set union/disjunction (for Bloom filters, surely almost as
easy with set:versions) for P/R sets
union   P | R
intersectionP  R
is natural/rapid/easy even if the symbols are unknown.

FWIW: the arbitrary tags implemented @rpm5.org have a similar
one-way lossy mapping where the string representation of the
tag name cannot be unmapped from the 30 bit (truncated SHA1) tag number.

The lack of mapping invertibility matters little to batch oriented installer 
usage cases
in RPM libraries even if enquiring minds wish concise/complete/informative 
error
messages for diagnosis of problems.

Error messages (and symbols) are overrated and less useful with moderately
sized token dictionaries imho.

I hope to get around to generating set:versions this weekend. I was almost there
10 days ago, but got side-tracked by 1 week's worth of cable modem issues which
has now led to postponed backup/upgrade chores … almost finished.

73 de Jeff__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org