Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-07-02 Thread Alexey Tourbin
On Mon, Jul 2, 2012 at 9:17 AM, Jeffrey Johnson  wrote:
>> All RPMv4.4+ packages, that is, but not RPMv4.0. I find this "file
>> coloring" business very annoying, by the way, and it took me some time
>> to realize that "fc" actually stands for "file coloring". :-)
>
> RPMv4.0 was a l-o-n-g time ago.

It was a long time ago, but it was not that bad. In ALT Linux we still
use rpm-4.0 code base (with many important backports etc.). Since I've
recently broken with ALT Linux, I cannot make more claims, you see...

> I find multilib quite annoying: I was asked for an implementation,
> and did so. 'Twas a job mon: already 7y since leaving @redhat, and file
> colors for multi lib were several years before that.

I don't like how "multilib" works either. You can no longer identify
packages by their names, and then there are special rules to resolve
file conflicts, which basically say that you are licensed to swap
files as much as you want, provided that you got the first hand
in that strange "x86-64" relationship. Then there is that recent "x32"
stuff where you can run in 64-bit mode using only 32-bit pointers. How
can you address THAT? To me, the world is declining, like the Roman
Empire. ;-)

>> I see no reason why rpm(1) should ever consider any preferences. To
>> me, rpm(1) is exactly black-and-white thing, a watchdog which checks
>> logical assertions. Things are either consistent enough to proceed
>> (and e.g. to upgrade the packages), or not - and then you get e.g.
>> non-zero exit code and you are forced to bail out. Higher-level logic
>> of finding an upgrade plan anyways belongs to something like apt(1) or
>> yum(1), although it is executed in some basic librpm terms which we
>> must make efficient enough. I see no reason to discuss closest metrics
>> or largest overlaps - this is as interesting as irrelevant to our
>> basic tasks.
>
> Everyone has an opinion: yours is particularly brittle but
> entirely logical. Now go persuade everyone else that your
> opinion is The One True Opinion and RPM will surely change too.
>
>>> There's a similar application with dual/triple/... licensed software and
>>> computing
>>> per-file, not per-package, license affinity precisely where set:versions (or
>>> Bloom
>>> filters) will represent keywords (like "LGPLv2" or "BSD") easily. Licenses
>>> unlike
>>> file(1) magic keywords will require name space administration. Surely LSB
>>> and LFF
>>> are seeking something useful to do for RPM packaging these days, and might
>>> be convinced to make some set of license tokens "standard" so that license
>>> affinity can be precisely computed in distributed software.
>>
>> You then discuss more applications which are largely irrelevant to our
>> basic tasks. (I realize that I'm revisiting an older discussion,
>> which might not be completely fair because our understanding might
>> have evolved since.)
>
> Um … what are "… our basic tasks."?

Our basic task is this: when we feed packages to rpm(1), it must
quickly decide whether the upgrade is feasible or not. Of course this
involves hard and sometimes speculative considerations about whether
things are going to work. But if things are definitely not going to
work, rpm should never upgrade! Not without a special flag which reads
'--upgrade-as-i-wish'.

What is not part of "our basic tasks" is to find an upgrade plan. How
they do that - it's another business, and completely another story. We
only check if the upgrade is consistent or not.

> I cannot determine what is "fair" without knowing what is being compared …
>
>> Anyway, set-versions are not the "next big thing" with plenty of
>> applications. It's rather a very boring stuff which nevertheless
>> answers the question "how we can possibly enhance ABI compatibility
>> control beyond sonames". The answer is that we must involve into
>> set/subset testing - that's the model, that it is very expensive, and
>> that the only reasonable and possibly the best way to go is to replace
>> symbols with numbers, and to treat sets of numbers as special kind of
>> versions. Now why is that? But that's a much better perspective for
>> discussion.
>
> We differ in usage cases. I see a "container", you see a "version".

I realize that the "version" is a somewhat contrived concept for what
is really a container. Like: - What's that? - It's a version. - Oh my
gosh!  But then again, if you try to organize your thinking beyond
sets, bytes, and characters, the concept of "version" pops up pretty
naturally. If we are going to satisfy some requirement, that's good
for "version".

> There are many usage cases for subset operations in "package management"
> no matter what we think or how the operations are implemented and represented.

Not all subset operations are equally important, or complex. As I
said, sometimes you should make easy things easy, and simply use
 which is Rb-tree or something like this. Sometimes things
go off, though. Like this: - How many symbols do you want to encode? -
10M. - That'

Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-07-01 Thread Alexey Tourbin
On Sat, Jun 23, 2012 at 10:29 PM, Jeffrey Johnson  wrote:
> In the interest of getting off negative nerdy obscure discussions, let's
> try a positive alternative application for Golomb-Rice subset operations.
>
> All RPMv4 packages attach (a lightly filtered) file(1) magic string to
> every file.

All RPMv4.4+ packages, that is, but not RPMv4.0. I find this "file
coloring" business very annoying, by the way, and it took me some time
to realize that "fc" actually stands for "file coloring". :-)

> The file(1) data is mostly usable as a "keyword" namespace exactly as is.
> Yes there are flaws: however magic strings from file(1) are about
> as good as any other de facto keyword tagging of file content.
>
> keywords are strings just like elf symbols are, and set:versions (or Bloom
> filters)
> are a compact representation from which it's rather easy to do subset
> computations.
>
> One extension that would be needed is a "closest" metric in order to
> "prefer"
> the largest subset overlap: with set:versions any contained subset will
> satisfy the
> logical assertions, and there's no easy way to prefer the larger sub-set.

I see no reason why rpm(1) should ever consider any preferences. To
me, rpm(1) is exactly black-and-white thing, a watchdog which checks
logical assertions. Things are either consistent enough to proceed
(and e.g. to upgrade the packages), or not - and then you get e.g.
non-zero exit code and you are forced to bail out. Higher-level logic
of finding an upgrade plan anyways belongs to something like apt(1) or
yum(1), although it is executed in some basic librpm terms which we
must make efficient enough. I see no reason to discuss closest metrics
or largest overlaps - this is as interesting as it is irrelevant to our
basic tasks.

> There's a similar application with dual/triple/... licensed software and
> computing
> per-file, not per-package, license affinity precisely where set:versions (or
> Bloom
> filters) will represent keywords (like "LGPLv2" or "BSD") easily. Licenses
> unlike
> file(1) magic keywords will require name space administration. Surely LSB
> and LFF
> are seeking something useful to do for RPM packaging these days, and might
> be convinced to make some set of license tokens "standard" so that license
> affinity can be precisely computed in distributed software.

You then discuss more applications which are largely irrelevant to our
basic tasks. (I realize that I'm revisiting an older discussion,
which might not be completely fair because our understanding might
have evolved since.)

Anyway, set-versions are not the "next big thing" with plenty of
applications. They are rather boring stuff which nevertheless
answers the question "how can we possibly enhance ABI compatibility
control beyond sonames". The answer is that we must get involved in
set/subset testing - that's the model; that it is very expensive; and
that the only reasonable and possibly the best way to go is to replace
symbols with numbers, and to treat sets of numbers as a special kind of
versions. Now why is that? But that's a much better perspective for
discussion.
__
RPM Package Manager                                    http://rpm5.org
Developer Communication List                        rpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-07-01 Thread Alexey Tourbin
On Sat, Jun 23, 2012 at 10:10 PM, Jeffrey Johnson  wrote:
>> So I suggest that set-strings must cover all usages of set-subset
>> computations, where static data structures are appropriate (so that
>> you compute it, write into RPM header, and do not expect to modify
>> it). it is still "efficient", despite the fact that we try hard to
>> achieve a better string representation. The fact that it can be
>> packaged nicely does not make it any worse in other respects, except
>> that obviously we can stuff somewhat less bits into alnum characters.
>> But I also wanted it to look "nice" (this is why base62 was devised
>> over much simpler base64 encoding). But so, it looks nice, and it is
>> efficient. Why are you unhappy? Do you necessarily require that you
>> have to do bitmask tests yourself?
>
> I'm not "unhappy". Rather I'm pointing out that -- if your primary
> goal is bandwidth reduction in *.rpm package metadata -- that there are
> equivalent savings that aren't from "pretty" encodings.

Okay, the primary goal was not so much to minimize bandwidth, but just
to make things "nice" overall. But again, realize that set-versions
are O(N) in size, where N is the number of symbols. For example, here
is the Provides of perl-base package in ALT Linux:

$ rpm -q --provides perl-base
perl = 1:5.14.2
perl-version = 0.88
perl-PerlIO = 1:5.14.2
perl-Storable = 1:5.14.2
perl-Digest-MD5
perl-Time-HiRes
perl-MIME-Base64
libperl-5.14.so()(64bit) =
set:odiA4ZlKzevK5y39eQsWZckcCoxsZqg4k4MIETgeNn0RxqjzEscOAdkenBZ7a373IGkMzlRZcFtq8vCvMeJMgzVyfiFR61GhMIUZ81a3aDJtilWwCksMTw3vQxUgc4x1QMFnA2h8lmdZpJ4jHU4wlFn4W2wHpFabuLlqZiIyviaTYTPl1PjrcIm4iZdvp5fOXR9PKnYQyfo6w1AOS7V5qO5ND9AGGby7LRkpBYZuOEZhvKfx9ZEGiSTLG9vx8ka0dtcUm5WXZtFUvjG92kZwnUjHVWc9NZapw9ocaec37K7rO7N4rl3mGGxh21z3bE4FdZ6RI6k52BrCvogEqZqeVZLcBZ9tcAbHq6ObepE2vXR9TQODoNkoeyaOQGbrDchblKzS8wBPTGtnv5Ze8K9FAsQN2dsQxAJ2hD2NOIL1ROhhFydp27xpqCYssiowuiYfJtCdofpKW9qohptrbV1pmkjVb4kI6lhKHZb5ae7p4u65NodjOSHmOw81L7fjwZjoK2uwH5rACj7NxXW9kk10uOsM6EJdRY4nrZqYbkKN2G1gQvUtZzqxPMjWgkh3P9LmcluzYzgCfZoexmZxQP4nRLYuemEWzHXkX1Ys5wqbfMZDxBbi7PtkohqPCqcB4f6zurrib4y7aUHs6aqCiVRHGDdR4thBgZFjpXDxIDN4X2QjAOjRQZhNzilkpBXmCgy75JDndDdw1DCMEZkmfgAnLU4rUEDIZ2pUvC5q3S2lMtq2Zj17f2xDL80cyg11BFQ8thh9iR2SIXCB69Jrda3E7ArKzyXYci6KlHo1jfvWP6Q2lre8VQCGQZoH8qDo7vaq2MGlNFgtgJGQgTGVx0CKZAAhlijag1ZKUGmgqFOdprCVtuZc8IoZBsuiY1DKZyS6e1HPHGYrZcOcd4KNcminIr9spwghADXzujggTmVsCutsEh6LLbjD2boZan6ycEnBMoFsbyIoz5ggnxFKIjuJyZvM6voYIs2qurMGOCoeB2csGGtTV6b82ItcZ2kj7mbamrZsQscXQ8EUT7lflT6IamZAZLZx7GTKUaZjwPyUltZ6gHWKIvw5wbNQMZGvOU6F43xdP7M2lJ0R9g3cVd4oiQig2F8K0AUdnwXz7LVPbjJMk8iBytXGntBKoPI6G3ahWHgSvmhGVVdNJeNXZl0AkcxhvhFqRujtasuOxCEz8WJCLZaO642OhFjtOVZliv4b4Y34Uh2vftwCo1vXxEJxtVVek42wG9La5uiLoz9o77UfQz4pu0yvZmTR8rs1X05ykr61bK73IOzGuMFjfKxQX2uUUL7zkl661NxWK8VR04bP7p1GVLn7SWPfXyfm2MbTZG8ytPzT2vDJYjmKTi6hRJ9RFPpZegntTk7uuL7ahsSej3aiBbEWVKYXWhkntIZ3YfDPOzRxosdgSKMNRY1llmBvZk5JlN3ETltMlAI7v7bq24bCvG9iGyCPyoUoKVLYpL8cycfepxYcgm4ioyz8zt3ZkAMQi4iZfs2zbd34VnocYTB6cyEZxR67x0i7wQRP7VrwDilZfniZgmdH5pAsWxXX0auKdd1nSN93XGZECEQP4nKKWCLwjYVjjAB1iRBohqEbRBstuQLCBMEL8OLapxqEYozeGlFMoSbo6eV6lCFuOKbYwvN6mjofKRtjavsU4hYm2ArDYgZ8alF8tZCw1BFvmhI0zL4w2un5m3S3G2tcZpWdVODLSi3IB4pQyR755EB7zUaPCF8LafHzDKHBSKEZohdNEooVN3lY5DnGpz25rW38dU8TXRG2pckDnGb5LVBVYpxK7VsybdZGsy5JP8Alk1Mf5O4zFmGcpZLs172YECGj3f8JzUcpknwF2PWxqRZgnZapssvh3CIOl8fA1bfZiUGMFg1eu62GXYlWiqWaA9g627YjeQkGdz9AhZCVECGOqNEZFX2Jy6O23lHvgJ2ylDSwi2RJGZdl1ckGMVwZ5z3uKJ207KVuQWyeCv7Ne4wxyWj9JgP0
Z2PEmyXsan5QBFEZq323vZmlRyhhmPoPDle2zyxxT7AZ3vL42k3MLOzNPtGpZcTgIJfhy2H38fUU0YIFNpi9ygZGg5NBZdu1XL0NWesCbk2PapUsltirZ7qo6Jwbg2UkwsgEaVsclZ2EPrD5brplmPl9DHlrgLcAZCaNIG9HIFSfZErD7U0tJob9JvoHKZogXZgtaH0UkZjrpma2lCxPpj9qdZdh4gli6C9mqCKElC2NjZKh9T0Uw1ddLDYNmvPB971Xyzx9qSNQ99acW1uP36D7qgjyPsf9qO1ZDqA9fZuJ5oEkNIpIoatW0f157Sp3bTZ7o58XhKir4N8PnJIfj3P4NQjtrjW85vAEHU9B2dZACzNfNACZn7TO2AyRBmnPn6XWYKoAEW9UiH8Z12Z5OBMa2ovftCZJ3Zpqoc0GG9kpGjXEZCN0ypGqk4lnYTCL4X8i1B2AabsfEG100r2HQWbjoIrM1LtGBAqSuPAPwIEVnaSqAvQFfCi9xCHS3EKRSZa9EsDZktcKHply2cMb5ldXl06eDg3i1xHM9uPTvm7laFyWorkUMcCZiV5d0tlXzFuiDjA3pOsOjUFmgAzPNJK17LfQSHA4avpaLbZ4lGSPRdQWOOKBxGDgOYMnK2Z11iom0XUBCvybJ2POnO81yDKHCEtVHJE4ZKTMiruLDjAe876HiwstanPxJ0m6ZeYmWgraCbEF7iaQZoZkgkwHS8ONi1OXSBF7crddfeYDwWLZk9w4lRr5lj5b3bZx2DFkDwk1IwczPHksFPHGgM8hSnBhfBXpcRLalykEkN9p0LGi6u5Zzs7q7dLLOPRrRLNtaz83XPT1ytZAbLKy7dKfRzTBQ314Zbe7tf8HZwplZuD7ZJZi2Juyxq8JdDPBRCJYgD8ZoDjoNbfrnxv7oJ60T8OyKLvkuLjFnWnHZd1M3OwsOM4uDRX3hk8OsAzsQSehIcdLYzRjYAloGknyZGmDCQLMlfpzzz18aIP9BMhLr364JPk6sSFK4qDU8EMcHGgYUWRUaJE9HnLTs9TzLxTveQnBch7PRnniDwgh2hUF0sDUbQmR9wRcDHtZD5p4QZBHaJYKWpdRtiL1nSsNEzvsMZ9PI05mLLNmZIfluzk3mopgeSQWtQGoiHzmRlNFnrhy5AxC9hHegoVllHAHc8VZrOZyNnBZD9b3Vaqrybs1uveBSfZG3mohKqO2WbTUjg3FXD1kkQPIWA0Z2Z3d6Q77wriiou1ZldJgnkru5WtLsY5YAvwTMdFzoa82XjjDRQwCXnzyRRuXqLF6TGDcrEZIAlAtgs8qgZr10JIaHcp7ZiG4oZpCP8nBCCWGxkBOrue2foa1QoJ30LXN7tMFc5U3wuIoXZmSIyBApsgPkgk5
perl(AutoLoader.pm) = 5.710
perl(B.pm) = 1.290
perl(Benchmark.pm) = 1.120
perl(Carp.pm) = 1.200
perl(Carp/Heavy.pm)
perl(Class/Struct.pm) = 0.630
perl(Config

Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-28 Thread Alexey Tourbin
On Thu, Jun 28, 2012 at 8:18 PM, Alexey Tourbin
 wrote:
> There is also a philosophical consideration which somehow accompanies
> this practical consideration. There is a short story, I believe by
> Borges, where a clever scientist devises a 1-1 map of reality. A 1-1
> map of reality turns out to be a very true picture of reality. What's
> wrong with this approach? Well, a 1-1 map of the world turns out to be
> an exact copy of the world, which is of no use in terms of being "a
> map". Somehow, the reality must be "construed" and "reduced" to a
> simpler (and somewhat coarser) description to become a useful model.
> This is also why we don't plug ELF sections into RPM headers: we
> believe that much simpler (or at least much shorter) dependencies must
> be used to represent ELF binaries in terms of their
> interconnectedness, and must also omit other less important details.

This philosophical argument applies to set-versions in a
(not-so-)obvious manner, which I will now clarify. It goes like this:
although the ultimate goal is to check that R-set of symbols is a
subset of P-set of symbols, you do not necessarily have to store the
full names of the symbols in order to perform a somewhat stripped-down
check itself. When it only matters if R is subset of P, the names
themselves become largely irrelevant, provided that you can devise a
very clever substitution/encoding scheme. You can make "much simpler
(or at least shorter) dependencies" by getting rid of the names in a
manner which does not destroy the check.
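To make the substitution idea concrete, here is a minimal Python sketch. It assumes a stand-in hash (truncated SHA-1) and a 20-bit range; neither is taken from the actual set-versions code, and the helper names are hypothetical. The point is only that the R-subset-of-P check survives once the names are replaced by numbers.

```python
import hashlib

BPP = 20  # bits kept per hashed symbol; an assumed value for illustration

def sym_to_num(name, bpp=BPP):
    """Map a symbol name to a bpp-bit number (stand-in hash, not rpm's)."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest[:4], "big") & ((1 << bpp) - 1)

def to_numset(symbols, bpp=BPP):
    """Replace a collection of symbol names by the set of their numbers."""
    return frozenset(sym_to_num(s, bpp) for s in symbols)

provides = to_numset(["strcmp", "strlen", "memcpy", "malloc"])
requires_ok = to_numset(["strcmp", "malloc"])
requires_bad = to_numset(["strcmp", "no_such_symbol"])

# The check itself never looks at a name again.
print(requires_ok <= provides)    # True
# A missing symbol is caught unless its number collides with a provided one.
print(requires_bad <= provides)
```

The second check can only pass by accident, when the missing symbol happens to hash onto a number that is already provided; that accident is the false positive rate discussed elsewhere in this thread.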

The downside is, of course, that when a dependency R subset P is
broken, it is not easy to find out which P symbols were deleted or
renamed (or which R symbols are missing).  But this is largely a
developer's, or should I say a hacker's, problem.

On the other hand, from the user point of view, and also from rpm(1)
perspective, this approach simply promotes synchronous or rather
"transactional" upgrades. It says like, guys, I will not apply
half-baked updates before you fix it all - so that apps and libraries
match. Which totally makes sense!


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-28 Thread Alexey Tourbin
On Sat, Jun 23, 2012 at 7:10 AM, Jeffrey Johnson  wrote:
>> Why is it any wrong to minimize bandwidth, or, in other words, why it
>> is bad to spend less money? Your answer is like, because the meaning
>> of life is not to spend less money, which is a wrong perspective.
>> Okay, but what's a better perspective? Spending more money for less
>> good is not at all a better perspective.
>
> Nothing I said implies that bandwidth reduction is "wrong".
>
> If you want to minimize bandwidth downloading packages compress
> the metadata and deal with the "legacy compatible" issues however
> you want/need.
>
> My rule of thumb is that metadata is ~15% of the size of a
> typical *.rpm package: assuming 50% compression one might save
> 7.5% of the package download cost (0.50 * 0.15 = 7.5%).

An interesting practical matter behind set-versions is that we can
simply represent the exact set of symbols, which can be encoded and
pictured like this:

Provides: libfoo.so.0 = set:sym1,sym2,sym3,sym4
Requires: libfoo.so.0 = set:sym1,sym3
(where symN stands for a direct ELF symbol name - e.g. strcmp). You
simply name the symbols which you require or provide! In fact, this is
exactly how early alpha set-versions were implemented - before I had
some time to ponder over the approximate set/subset encoding problem
(that is when sym1,sym2... sets were converted into numbers, one per
symbol, and compressed). The point here is that the price of the exact
set representation may, or may not, be prohibitive. If the price is
not prohibitive, it's a no-brainer: you don't have to get involved in
the approximate subset business at all. Sometimes, you simply should not
use bloom filters, despite the fact that they might seem appealing.
However, if the price is prohibitive, which it was, the reason for
going into approximate subset business is also a no-brainer: you
should cut down heavily and optimize for size first. If you simply
introduced probabilities without making things less prohibitive, did
you do anything useful at all? You only spoiled things a bit!
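For comparison, the exact representation is trivial to check, and it can also report what is missing. The sketch below is in Python and assumes the comma-separated "set:sym1,sym2" form shown above; the function name and the sample symbol names are made up for illustration.

```python
def parse_exact_set(version):
    """Parse 'set:sym1,sym2,...' into a set of symbol names."""
    prefix = "set:"
    if not version.startswith(prefix):
        raise ValueError("not a set-version: " + version)
    return frozenset(version[len(prefix):].split(","))

provides = parse_exact_set("set:sym1,sym2,sym3,sym4")
requires = parse_exact_set("set:sym1,sym3")
print(requires <= provides)                      # True

# With exact sets, a broken dependency can name the missing symbols.
broken = parse_exact_set("set:sym1,sym5")
print(sorted(broken - provides))                 # ['sym5']

# The price: characters spent per symbol, using a few real-looking names.
names = ["strcmp", "memcpy", "pthread_mutex_lock", "malloc"]
exact = "set:" + ",".join(names)
print(len(exact) / len(names))
```

The last figure is the cost that has to be weighed against the roughly 2 characters per symbol of the approximate encoding.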

There is also a philosophical consideration which somehow accompanies
this practical consideration. There is a short story, I believe by
Borges, where a clever scientist devises a 1-1 map of reality. A 1-1
map of reality turns out to be a very true picture of reality. What's
wrong with this approach? Well, a 1-1 map of the world turns out to be
an exact copy of the world, which is of no use in terms of being "a
map". Somehow, the reality must be "construed" and "reduced" to a
simpler (and somewhat coarser) description to become a useful model.
This is also why we don't plug ELF sections into RPM headers: we
believe that much simpler (or at least much shorter) dependencies must
be used to represent ELF binaries in terms of their
interconnectedness, and must also omit other less important details.

Back to the story of set-versions, with the "original" implementation
(which introduced full Golomb coding), it was estimated that the size
of architecture-dependent pkglist.classic.bz2 metadata is going to go
up from about 3M to about 12M, four-fold! Such a jump was still almost
prohibitive. It was considered non-prohibitive
only by virtue of information-theoretical considerations: since we
are going to encode that many symbols, we must not fool ourselves into
thinking that we could somehow pay a smaller price - that is, without
violating fundamental laws.
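A rough sketch of the Golomb-Rice idea, for readers who have not seen it: sort the hashed values, take the gaps between neighbours, and code each gap as a unary quotient plus a fixed number of low bits. This is the textbook scheme under assumed parameters (20-bit values, Rice parameter 10, a string of '0'/'1' characters instead of packed bits), not the actual rpm encoder.

```python
import random

def rice_encode(values, m):
    """Encode distinct non-negative integers as Rice-coded gaps (m >= 1)."""
    bits = []
    prev = -1
    for v in sorted(set(values)):
        gap = v - prev - 1          # gaps between distinct sorted values are >= 0
        prev = v
        bits.append("1" * (gap >> m) + "0")                   # quotient, unary
        bits.append(format(gap & ((1 << m) - 1), f"0{m}b"))   # remainder, m bits
    return "".join(bits)

def rice_decode(bits, m):
    """Inverse of rice_encode; returns the sorted values."""
    values, prev, i = [], -1, 0
    while i < len(bits):
        q = 0
        while bits[i] == "1":
            q += 1
            i += 1
        i += 1                      # skip the 0 that ends the unary part
        r = int(bits[i:i + m], 2)
        i += m
        prev += 1 + ((q << m) | r)
        values.append(prev)
    return values

rng = random.Random(0)
hashes = rng.sample(range(2 ** 20), 1024)   # 1024 distinct 20-bit values
encoded = rice_encode(hashes, 10)
print(rice_decode(encoded, 10) == sorted(hashes))   # True
print(len(encoded) / len(hashes))                   # bits per value
```

With 1024 values spread over a 20-bit range the average gap is about 1024, so each value costs 10 remainder bits plus a short unary part, which is why the cost stays between 11 and 12 bits per value in this setup.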

By the way, what's the information-theoretical minimum? Say, we want
to encode 1024 20-bit hash values (which yields the false positive
rate at about 0.1%). Well, the first mathematical intuition is that we
need to cut 20-bit range into 1024 smaller stripes, which gives 10
bits per stripe, on average. It is a little bit more complicated than
that, though, exactly because of this "on average" business: we must
also take some bits to encode stripe boundaries. But this is only a
mathematical intuition. The exact formula, in R, is:

> lchoose(2**20,2**10)/log(2)/2**10
[1] 11.43581

(so, on the other hand, and somewhat unexpectedly, it is that good old
"n choose k" business.)  The current implementation comes in at about
11.6 bits per symbol. This is why sometimes I say that set-versions
currently take about 2 alnum characters per Provides symbol - this is
because each alnum character can hold about log2(62)=5.9+ bits.
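The same arithmetic can be reproduced in Python (3.8 or later, for math.comb). This only restates the R one-liner and the base62 estimate above; the 11.6 figure is the one quoted in the text, not something computed here.

```python
import math

range_size = 2 ** 20   # 20-bit hash values
count = 2 ** 10        # 1024 symbols

# log2 of "n choose k", divided by k: the minimum number of bits per symbol
min_bits = math.log2(math.comb(range_size, count)) / count
print(round(min_bits, 4))

# Each base62 character carries log2(62) bits.
bits_per_char = math.log2(62)
print(round(bits_per_char, 3))            # 5.954

# Characters per symbol at the quoted 11.6 bits per symbol.
print(round(11.6 / bits_per_char, 2))     # 1.95
```

The first figure should agree with the R result of 11.43581, so the quoted 11.6 bits per symbol sits within a fraction of a bit of the theoretical floor.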

But compare this to "early alpha" set-versions which represented exact
sets in form of "set:sym1,sym2,sym3,sym4", that is, in terms of
enumerating symbols. With exact sets, can you think of getting
anywhere near 2 characters per symbol, especially given that you also
need a separator? This leaves you 1 character per symbol! :-)

There's a saying that Perl makes easy things easy and hard things
possible. Set-versions were designed to make hard things possible.

Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-23 Thread Alexey Tourbin
On Sat, Jun 23, 2012 at 1:55 AM, Jeffrey Johnson  wrote:
> There are lots of usage cases for efficient sub-set computations
> in package management, not just as a de facto API/ABI check
> using ELF symbols. Most of the other usage cases for efficient
> sub-set computations are not subject to an ultimately optimal
> string encoding.

The possibility of optimal string encoding should not be
underestimated. If you can't "write it down with a pencil", which by
the way refers to Alan Turing's style of reasoning, it becomes very
problematic anyway. Your intuition is probably that raw bytes are
cheap because you can bitwise-AND them in terms of direct CPU
instructions, and any "encoding" is expensive because you have to
crunch bits a lot. This is not very true in practice, though. But you
should arm yourself with valgrind(1) and spend a few days with it
before you understand how you waste your CPU power (and bandwidth, for
that matter). Set-string encoding has a reasonable, and affordable,
cost. It can be all put together and presented in a rather less
pessimistic manner. (For example, there is a cache_decode_set routine
that boosts things by a factor of about 5. This is simply due to the
fact that you do not always have to decode the same Provides version
all over again. This is part of that "fancy-schmancy" stuff which you
must ignore on the first reading.)

So I suggest that set-strings must cover all usages of set-subset
computations, where static data structures are appropriate (so that
you compute it, write into RPM header, and do not expect to modify
it). It is still "efficient", despite the fact that we try hard to
achieve a better string representation. The fact that it can be
packaged nicely does not make it any worse in other respects, except
that obviously we can stuff somewhat fewer bits into alnum characters.
But I also wanted it to look "nice" (this is why base62 was devised
over much simpler base64 encoding). But so, it looks nice, and it is
efficient. Why are you unhappy? Do you necessarily require that you
have to do bitmask tests yourself?


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-22 Thread Alexey Tourbin
On Sat, Jun 23, 2012 at 1:55 AM, Jeffrey Johnson  wrote:
> I would state that compression (of any sort) to minimize
> bandwidth is entirely the wrong problem to solve.

So what kind of a problem are we trying to solve? Why are you making
all these small puns? Are you ready to go out and pronounce that "I,
Jeffrey Johnson, am entirely confident that whichever problem we
face we must address, but to minimize bandwidth is entirely wrong and
shall never be addressed!" :-)

Why is it wrong to minimize bandwidth, or, in other words, why is it
bad to spend less money? Your answer is like, because the meaning
of life is not to spend less money, which is a wrong perspective.
Okay, but what's a better perspective? Spending more money for less
good is not at all a better perspective.

Okay, but what do we actually try to do? Um, we try to bring some
binary compatibility by using the limited amount of information with
some contrived data structures which test set-subset relation. If
space were not a problem, we could simply plug ELF sections into RPM
headers, couldn't we? Why not? And why is there a distinction between
the header and the payload? :-)

> My contrarian POV should not be taken as opposed to compression
> or elegance or anything else. Just that minimal size in the
> representation of bit fields (or bit positions as numbers)
> overly limits the applicability of set:versions.
>
> There are lots of usage cases for efficient sub-set computations
> in package management, not just as a de facto API/ABI check
> using ELF symbols. Most of the other usage cases for efficient
> sub-set computations are not subject to an ultimately optimal
> string encoding.

My POV is that I ask "how best can I do what I need to do, is it
doable, what is the price, can it be reduced, etc". Of course, these
"best to do" and "need to do" terms are not exactly mathematical, and
I already hear some laughs in the audience. Nevertheless, the only
thing which you can set against this is simply a better implementation.
Anything else does not count. Why? That's because! Go tell people they
should spend more money. ;-)

> E.g. ripping off the base62 encoding and distributing binary
> assertion fields saves at least as much bandwidth as your
> guesstimate that a Golomb-Rice code is ~20% more efficient
> than a Bloom filter.

I see no point in criticizing base62 encoding as suboptimal
compared to raw bytes, because it does just that: squeezes bits into
alnum characters. It is pretty clear, and beside the point, that raw
bytes hold more bits. The goal was just that - to "dump" bits into a
"readable" and "usable" form.

> Just in case: yes acidic sarcasm was fully intended.

There is no point in your sarcasm, except probably to show that you
are very clever (with which I totally agree).

> You already made a valid point asking what probability means wrto
> false positives. In fact your 2**10/2**20 is a very different
> estimate of false positive probability than the approx. currently
> in use in rpmbf.c (which is based on a statistical model for
> Bloom filter parameters).

The 2**10/2**20 is the best estimate and you probably cannot further
improve it. The whole business of set-versions, which I have been
pondering recently, boils down to the questions: can you improve the
ratio just a little bit? Or can you pay just a little bit less for a
little bit more? Is there a better data structure? The answer
basically turns out to be "NO". The "set of numbers" is good to go,
and the Galois fields have some discrepancies which you do not want to
know, and the benefit is very marginal.
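As a cross-check on the "roughly 20%" figure quoted earlier in this message, the space cost of a Bloom filter can be put next to the set-of-numbers cost. The sketch below is in Python and assumes the standard textbook result for an optimally tuned classic Bloom filter, log2(1/e)/ln(2) bits per element at false positive rate e; it says nothing about what rpmbf.c actually does.

```python
import math

def bloom_bits_per_element(fp_rate):
    """Bits per element of an optimally tuned classic Bloom filter."""
    return math.log2(1.0 / fp_rate) / math.log(2)

fp = 2.0 ** -10                 # the 2**10/2**20 estimate from the thread
bloom = bloom_bits_per_element(fp)
set_version = 11.6              # bits per symbol quoted for set-versions

print(round(bloom, 2))                     # 14.43
print(round(1 - set_version / bloom, 2))   # 0.2
```

Under these assumptions the set of numbers needs about a fifth fewer bits than the Bloom filter at the same false positive rate, and it is already close to the "n choose k" floor, which leaves little room for a better structure.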

> Below is what is in use by RPM which chooses {m, k} given
> an estimated population n and a desired probability of false
> positives e (I forget where I found the actual computation,
> can likely find it again if/when necessary).
>
> What I would like to be able to do with set:versions
> in RPM is to be able to use either of these 2 "container"
> representations interchangeably.
>
> I'd also like to be able to pre-compute either form
> in package tags for whatever purpose without the
> dreadful tyranny of being "legacy compatible" with
> older versions of RPM. Adding a new tag that isn't
> used by older RPM is exactly as compatible as following
> existing string representations for NEVR dependencies
> in existing tags, arguably more compatible because
> older versions of RPM are incapable of interpreting
> unknown tags.
>
> And again: the best solution for the download bandwidth
> problem is
>        Update the metadata less often.
> if you think a bit.

The best solution for "don't be stupid" is just don't be stupid. Of
course, downloading metadata is not the only problem, and not exactly
the one which we are trying to solve. It all combines into a complex
system. The question is then, to me, can we do any better? How much
does this whole business cost? If it's still expensive, which it is,
how can we cut down on the costs? These are the kind of questions I
am willing
to 

Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-22 Thread Alexey Tourbin
On Fri, Jun 22, 2012 at 5:30 AM, Jeffrey Johnson  wrote:
> Sure numbers "make sense".
>
> But God invented 0 and 1 and who needs steenkin carries to do
> arithmetic in Galois fields?!?

Jeffrey, I understand that you are being ironic and sarcastic, but I
can't see the reason why, as per our discussion. If you ask honest
questions, like "what does this mean", I try to do my best to answer
the question, possibly involving considerations like "numbers come from
God", which are questionable but not irrelevant. They may help to
understand, or may not.

>> P: 01010010010101
>> R: 010001
>
> Yes … but this is "premature optimization" …

There is no premature optimization here, and it is fair to ask what we
may possibly want to try to optimize. My answer is: first, it must
take less space (given the probability) because we have to download
the repo metadata every time we run the "update"; second, set-versions
must compare quickly (we must be able to compare them all within a
second). What's your suggestion? Do you want them to take more space
and compare slowly, so as to avoid premature optimizations? :-) Or do
you dislike them exactly because they are not Bloom filters? Go on
then, make shorter strings, given the same probability, which compare
faster! But this is largely impossible, exactly because of these
laughable topics like "what is a number" or "what is a probability"
which I'm trying to present.

> How big the repo is determines how important the
> distro is and nothing else. Just look at how impo'tent Debian is …

Your considerations are probably true but are largely irrelevant to
the discussion.
Besides that, I can't understand them. Are you sarcastic? I am
sarcastic too, that's not at all a problem. Shall we talk? :-)

>> $ print -l sym1 sym2 |/usr/lib/rpm/mkset 10
>> set:dbUnz4
>> $ print -l sym1 sym2 |/usr/lib/rpm/mkset 20
>> set:nl2ZEALdS
>>
>> In the first run, the probability is modest ("print -l" is zsh syntax
>> to print an arg per line). In the second run, the probability is much
>> higher (and it takes more space). There is also a function in scripts
>> to sustain probabilities at about 0.1%.
>
> Hmmm … I've been defaulting Bloom filters ultra conservatively
> at 10**-4 and moving to 10**-6 at the first hint of a possible problem.

The number does not matter here, the only consideration was that
2^{-10} was a kind of cute number, and it works well in practice - that
is, we catch all or most of the bugs where a library symbol has been
deleted. But it could also be 2^{-16} or even 2^{-20}. It is a
trade-off, and I see no ground for criticizing it for just that. Or
again it boils down to the question how many megabytes you expect to
download when you run the "update".
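The trade-off can be put into numbers with a short Python sketch. It assumes the simple model used in this thread, where a missing symbol slips through with probability about n/2^bpp for n provided symbols hashed into bpp bits. The rule for choosing bpp is only a guess at what a "sustain about 0.1%" helper might do; the function names are hypothetical.

```python
def false_positive_rate(n_symbols, bpp):
    """Chance that one missing symbol hashes onto a provided number."""
    return min(1.0, n_symbols / 2 ** bpp)

def bpp_for_target(n_symbols, target_bits=10):
    """Smallest bpp keeping the rate at or below 2**-target_bits (assumed rule)."""
    return (n_symbols - 1).bit_length() + target_bits   # ceil(log2(n)) + target

for target in (10, 16, 20):
    bpp = bpp_for_target(1024, target)
    print(target, bpp, false_positive_rate(1024, bpp))
```

For 1024 symbols this gives 20, 26 and 30 bits per hashed value. Every extra bit of precision costs roughly one more encoded bit per symbol, which is where the extra megabytes come from.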
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-22 Thread Alexey Tourbin
On Thu, Jun 21, 2012 at 7:51 PM, Jeffrey Johnson  wrote:
> On Jun 18, 2012, at 2:32 PM, Jeffrey Johnson wrote:
>>
>> The "contained in" or subset semantic that applies to the operations "<"
>> and ">=" is rather easy to do as well. E.g. if (assuming only existence,
>> not versioned inequality ranges)
>>       P == Bloom filter of Provides: tokens
>>       R == Bloom filter of a (possible) subset of tokens
>> then the subset computation is nothing more than
>>       (P & R) == R

That's right. Set-versions do the same thing, only reframed in terms
of set of numbers (as opposed to explicit bit masks). The rpmsetcmp
routine basically does just that: (P & R) == R check, written in a
slightly different manner, with a bunch of fancy stuff. :-)
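A minimal Python sketch of the equivalence, under the assumption that a set-version decodes to a sorted list of numbers. The lock-step walk below is one plain way to test the subset relation on such lists; it is not a transcription of rpmsetcmp, and the function names are made up.

```python
def to_mask(numbers):
    """Explicit bit mask with one bit set per number."""
    mask = 0
    for n in numbers:
        mask |= 1 << n
    return mask

def mask_subset(p_mask, r_mask):
    """The Bloom-filter style check: (P & R) == R."""
    return (p_mask & r_mask) == r_mask

def sorted_subset(p, r):
    """Same question asked of two sorted lists of numbers."""
    i = 0
    for v in r:
        while i < len(p) and p[i] < v:
            i += 1                  # skip provided numbers below v
        if i == len(p) or p[i] != v:
            return False            # v is required but not provided
    return True

P = [1, 3, 6, 9, 11, 13]
R_ok = [1, 9]
R_bad = [1, 10]

print(mask_subset(to_mask(P), to_mask(R_ok)), sorted_subset(P, R_ok))    # True True
print(mask_subset(to_mask(P), to_mask(R_bad)), sorted_subset(P, R_bad))  # False False
```

The list form never materializes the mask, which matters when the numbers range over 2^20 or more positions and only about a thousand of them are set.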

> I should point out the implicit (and tricky) assumption regarding
> using Bloom filters to easily compute set union and intersection:
>        Both P and R MUST have exactly the same parameterization.
> Choosing an a priori parametrization is hard because it MUST
> deal with the worst possible case of the largest estimated population
> which increases the sparseness of all other Bloom filters.

There are a few kinds of "parametrization" which we must ponder.
First, you should use the same hash function for R and P sets, so that
it makes the same number per symbol (or sets the same bits). With
different hash functions, there is no chance to compare sets
meaningfully. Second, you should also ponder what a "number" actually
is (or how high a bit you may want to be able to set). The thing is,
the numbers are not unlimited (or otherwise the whole world can be
represented with just a big number). It helps to think of a number as
a tiny bit of information within a limited range. The range is another
parameter.

> This also applies to "tuning" to reduce the probability of false positives.
>
> There are no obvious ways to rescale a Bloom filter meaningfully either.

There IS a (not-so-)obvious way to rescale Bloom filters, and numbers,
for that matter. Big surprise, big surprise! It boils down to the
question of what a number is. If your number is just a few bits in the
range modulo a power of 2, you can "rescale" the number down to a
smaller range by simply stripping its higher bits. The equivalent
operation can be devised to "downsample" a Bloom filter into a lesser
precision, provided that some (not-so-)obvious conditions are met.
Basically you need to split the filter into two halves and bitwise-OR
them.
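
A minimal sketch of the folding step, under stated assumptions: the
filter length m is a power of two, each bit position is computed as
"hash mod m", and the filter is held in a plain Python integer. The
helper names are hypothetical.

```python
from typing import Iterable


def bloom_add(bits: int, m: int, hashes: Iterable[int]) -> int:
    """Set bit (h mod m) for every hash value h; return the new bitmask."""
    for h in hashes:
        bits |= 1 << (h % m)
    return bits


def bloom_fold(bits: int, m: int) -> int:
    """Halve an m-bit filter by OR-ing its upper half onto its lower half."""
    if m < 2 or m & (m - 1):
        raise ValueError("m must be a power of two, at least 2")
    half = m // 2
    low = bits & ((1 << half) - 1)
    high = bits >> half
    return low | high


if __name__ == "__main__":
    hashes = [3, 200, 77, 1029]
    big = bloom_add(0, 256, hashes)
    # Folding gives the same filter as building it with m = 128 directly.
    print(bloom_fold(big, 256) == bloom_add(0, 128, hashes))
```

This works because bit h mod m, once the top half is shifted down, lands
on bit h mod (m/2). With a different indexing scheme the fold would not
be meaningful.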

> Do set: versions have the same difficulty (even if the parameters are hidden
> as implementation/design constraints somehow)?

There is some difficulty that we always need to use the same hash
function, which is hardwired, and cannot be easily changed (it is
Jenkins one-at-a-time hash with a fancy initial constant). There is no
difficulty in comparing set-versions per se, though, because a
set-version is just a set of numbers within a limited range modulo
power of two. If ranges are different, you first need to "downsample"
the higher-ranged set by stripping its higher bits and sorting the
numbers again, but then you proceed normally. Note that the range, or
"bpp", has to be encoded explicitly, and is part of a set-version.
Again, it helps to think in terms of what a number is. A number is a
thing within the range, modulo power of two. If you don't know the
range, you don't know what the number is.
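
The downsampling step can be sketched as follows. This is only an
illustration with a hypothetical function name, not the code from set.c:
keep the low bits, drop duplicates that appear after stripping, and sort
again.

```python
def downsample(values, bpp_from, bpp_to):
    """Reduce a set of bpp_from-bit numbers to bpp_to bits per number."""
    if not 0 < bpp_to <= bpp_from:
        raise ValueError("bpp_to must be in 1..bpp_from")
    mask = (1 << bpp_to) - 1
    # Distinct numbers may collapse into one after the high bits are gone.
    return sorted({v & mask for v in values})


if __name__ == "__main__":
    print(downsample([1, 5, 9, 13], 4, 3))
```

To compare two sets with different bpp, the one with the higher bpp is
brought down to the lower bpp first, and then the usual comparison runs.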

> Or does the Golomb-Rice encoding "scale" more naturally than Bloom filter
> parameters? If the set:versions implementation "scales" better than Bloom
> filters (which seems to be the case), then there are lots of usage cases
> (like packages in a repository, or files in a package) where an efficient
> means to compute set membership is quite useful.
>
> (aside)
> Apologies for "scales" imprecision. I'm merely trying to ask is
>        To what extent does set:versions depend on a priori assumptions?
>
> BTW, what is the current false positive failure probability for set:versions?

What is a probability? (That's the second stunning question after
what is a number.) If numbers were unlimited, they could have
represented symbols exactly. The only reason the numbers "clash" is
because they "sit" within a limited range. What you have to do to
control probabilities is to select the appropriate range, which is a
trade-off between how many clashes are possible and how many bits per
number you are allowed to take. The basic idea is that, to encode a
Provides version, you first need to know how many symbols you are
going to encode, say 1024. You then ask yourself, how many bits per
symbol should I take. If you take e.g. only 10 bits, that's plain
stupid, because 10-bit numbers can address only 1024 values (so you
end up with most of the bits set, or most of the numbers taken, that
is to say). If you take 20 bits per symbol, though, things get more
interesting. 20-bit numbers can address 2^{20} different symbols.
Since the probability is due to range limitations, you get
2^{10}/2^{20

Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-21 Thread Alexey Tourbin
On Thu, Jun 21, 2012 at 7:28 PM, Jeffrey Johnson  wrote:
>> More precisely, a set-version can be (in principle) converted to a
>> Bloom filter which uses only one hash function. The idea is that such
>> a filter will set bits in a highly sparse set of bits, one by one.
>> Instead, a set-version remembers the bits simply by their indices.
>> Setting the bit becomes taking the number, and there is a
>> straightforward correspondence. It also turns out that a set of
>> numbers can be easier to deal with, as opposed to a sparse set of
>> bits.
>
> You've made an unsupported claim here re "numbers can be easier".
> Presumably you are talking about means of encoding, where clever
> mathematics derived from numerical transforms can be used to
> remove redundancy. With a pile of bits in a Bloom filter all one
> has to work with is a pile of bits.

Numbers are easier to deal with because you can use 'unsigned v[]'
array to represent them, which can be seen throughout the code,
accompanied by 'int c' (which is argc/argv[] style). Bits are somewhat
more complicated: you need to use something like 
macros, and you cannot use pointer arithmetic to implement traversal,
etc.

Also, to me, a set of numbers just "makes sense", and is good to go.
If you wonder how complicated things can get, there have been some
recent papers out there which use matrix solving for approximate set
representation. They are basically unreadable (given the undergraduate
skills), and you can be completely lost because there is apparently no
connection between the Galois fields and what you actually try to do.
After you try to read these papers, the "set of numbers" becomes a
wonderful salvation which comes directly from God and the Holy Spirit
and brings peace to your soul. This wonderful salvation, which comes
from God, turns out to be a good thing. You should use it, when you
have some. :-)

> The counter argument is this: testing bits (in a Bloom filter) is
> rather KISS. Other compression/redundancy removal schemes
> end up trading storage for implementation complexity cost.
> In a running installer like RPM, there will always be a need
> for memory dealing with payload contents. Since large amounts
> of memory are eventually going to be needed/used, savings
> by using Golomb-Rice codes are mostly irrelevant for the
> 100K -> 1M set sizes used by RPM no matter whether bits or
> numbers are used to represent.

I do not agree that testing bits is more KISS than a set of numbers.
If you think that testing bits is the only and natural thing to do, I
disagree. The reason is: Requires versions are much more sparse than
Provides versions. If the Provides version is optimal, in terms of bit
usage (50% set), the Requires version must be necessarily suboptimal
(too few bits set). It can be pictured like this:

P: 01010010010101
R: 010001

The conclusion is, if you want to stick to bitmasks and reduce subset
comparison to bitmask tests, there will be a great deal of
inefficiency because of sparse R-sets. On the other hand, sets of
numbers can be pictured like this:

P: 1,3,7,13,15
R: 1,13

Note that R-set does not take extra space, and there is a simple
merge-like algorithm which advances either R or P (or both - which is
how rpmsetcmp routine is implemented).
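
The merge-like walk can be sketched like this. It is an illustration
only, not the actual rpmsetcmp code, and it assumes both inputs are
sorted and free of duplicates.

```python
def is_subset(r, p):
    """True if every number of sorted-unique r also occurs in sorted-unique p."""
    i = j = 0
    while i < len(r) and j < len(p):
        if r[i] == p[j]:
            # Found this R element in P: advance both.
            i += 1
            j += 1
        elif r[i] > p[j]:
            # P has an extra element that R does not need: skip it.
            j += 1
        else:
            # r[i] is smaller than everything left in P: it is missing.
            return False
    return i == len(r)


if __name__ == "__main__":
    print(is_subset([1, 13], [1, 3, 7, 13, 15]))
    print(is_subset([1, 4], [1, 3, 7, 13, 15]))
```

The walk takes at most len(r) + len(p) steps and needs no extra memory,
however sparse R is relative to P.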

> But don't take my argument as being opposed to set:versions whatsoever.
> Just (for RPM and even for *.rpm) that compression size isn't as important
> to me as it is to you as the "primary goal" of set: versions.

The size of *.rpm is very important when you want to update the
information about a repo, like in "apt-get update" - you have to
download it all. The question is then how big the repo is, and how
many set-versions you can afford to download. :-) The "primary goal"
was not to make things much worse. In other words, the price must not
be prohibitive even for the repo of 10k or 15k packages strong.
(Please also realize that the Internet was not a given until very
recent times - some shoddy ISPs in Kamchatka still want to charge
something like $0.1 per 1Mb.)

> AFAICT set:versions also has a finite probability of failure due to
> (relatively rare) hash collisions converting strings to numbers.
>
> Is there any hope that set:versions (or the Golomb-Rice code) can supply a
> parameterization to "tune" the probability of false positives?
>
> Or am I mis-understanding what is implemented?

You can tune the probability by selecting the appropriate "bpp"
parameter, and passing it to "mkset" (which is how the scripts work). The
"bpp" indicates how many bits must be used per hash value (the range
is 10..32). For example, if you want to encode a Provides version which
has 1024 symbols, and the error rate has to be fixed at about 0.1%,
you need to use bpp=20 - that is, after each symbol is hashed, its 20
lower bits will be taken into account and encoded. The probability
then is simply the number of symbols over the capacity of the
universe, which is 2^{10}/2^{20}=0.1%. On the other hand,

Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-20 Thread Alexey Tourbin
On Thu, Jun 21, 2012 at 12:15 AM, Alexey Tourbin
 wrote:
> On Mon, Jun 18, 2012 at 10:32 PM, Jeffrey Johnson  wrote:
>> Good: the above confirmation of the characteristics allows a set:versions
>> implementation to proceed.
>
> Hello, there's been some speculation about Bloom filters below, which
> I cannot address right now, offhand. Nevertheless, I can say that, in
> some highly mathematical sense, set-versions are exactly equivalent to
> Bloom filters. They do just the same thing, if you will. The only
> difference is that set-versions are more compact: they take somewhat
> less space, which was, if you remember, the number one goal of the
> original implementation.

More precisely, a set-version can be (in principle) converted to a
Bloom filter which uses only one hash function. The idea is that such
a filter will set bits in a highly sparse set of bits, one by one.
Instead, a set-version remembers the bits simply by their indices.
Setting the bit becomes taking the number, and there is a
straightforward correspondence. It also turns out that a set of
numbers can be easier to deal with, as opposed to a sparse set of
bits.

If you want to know more about why things have to work like this, and e.g.
where the constants pop up, a good starting point is the "Cache-,
Hash-, and Space-efficient Bloom filters" paper by Felix Putze, Peter
Sanders, and Johannes Singler. Actually this paper helped me a lot to
put things together and to produce the original implementation.
Reading this paper requires some working mathematical knowledge,
though. This requirement must not be underestimated, but also should
not be overestimated. The paper is very readable: it tells you what
you may want to do and what you have to do.
http://algo2.iti.kit.edu/singler/publications/cacheefficientbloomfilters-wea2007.pdf


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-20 Thread Alexey Tourbin
On Mon, Jun 18, 2012 at 10:32 PM, Jeffrey Johnson  wrote:
> Good: the above confirmation of the characteristics allows a set:versions
> implementation to proceed.

Hello, there's been some speculation about Bloom filters below, which
I cannot address right now, offhand. Nevertheless, I can say that, in
some highly mathematical sense, set-versions are exactly equivalent to
Bloom filters. They do just the same thing, if you will. The only
difference is that set-versions are more compact: they take somewhat
less space, which was, if you remember, the number one goal of the
original implementation.

Very informally, if you want to encode 1K symbols out of 1M symbols
using a bloom filter, you need at least
1024 * 1.44 * 10 bits of information (which is a factor of 1.44 per symbol)
while with a set-version you need at least
1024 * (10 + 1.44) bits of information (which is an additive constant
of 1.44 per symbol).
It is basically fair to say that set-versions are 20% shorter than the
equivalent bloom filters, which is not unimportant.
By the way, the information-theoretical minimum is
1024 * 10 bits of information -
you cannot go beyond that without violating fundamental laws.
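
The arithmetic above can be checked with a few lines. This sketch only
reproduces the figures quoted in this message; it uses the constant 1.44
as given (an approximation of log2(e)), and the function names are
hypothetical.

```python
OVERHEAD = 1.44  # constant quoted in the message, approximately log2(e)


def bloom_bits(n, log2_inv_rate):
    """Bloom filter estimate: a multiplicative factor of 1.44 per symbol."""
    return n * OVERHEAD * log2_inv_rate


def set_version_bits(n, log2_inv_rate):
    """Set-version estimate: an additive constant of 1.44 bits per symbol."""
    return n * (log2_inv_rate + OVERHEAD)


def minimum_bits(n, log2_inv_rate):
    """Information-theoretical minimum quoted in the message."""
    return n * log2_inv_rate


if __name__ == "__main__":
    n, k = 1024, 10  # 1K symbols out of 1M, i.e. an error rate of 2**-10
    bloom = bloom_bits(n, k)
    setv = set_version_bits(n, k)
    print(round(bloom), round(setv), minimum_bits(n, k))
    print("saving: %.1f%%" % (100 * (1 - setv / bloom)))
```

With these inputs the saving comes out at a little over 20%, which
matches the "20% shorter" estimate.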


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-15 Thread Alexey Tourbin
On Fri, Jun 1, 2012 at 6:07 AM, Jeffrey Johnson  wrote:
> I asked 2 very specific questions … the rest is quite important also,
> but I need to understand precisely what properties set:versions have in order
> to implement correctly (and I don't fully understand your reply).
>
> Specifically:
>
>        1) Is the set:versions VERSION independent of the order of the
>        calls to rpmsetAdd()? (you know the routine as set_add())

Completely independent - you can add symbols in any order. The symbols
are then hashed and sorted by their numeric values. The underlying
idea is that a set-version is just a (sorted) set of numbers. You can
add whatever symbol to it, possibly twice, the symbol will be hashed
to a number, in a unique manner, and finally you can get the string
representation of the set of numbers. This involves much fuss under
the hood, but basically, you should think of the set of symbols, which
is just the set of numbers, after each symbol has been hashed
individually.
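
Order independence can be illustrated with a short sketch. It uses the
standard Jenkins one-at-a-time hash with an initial value of 0, which is
an assumption: the real implementation uses its own initial constant, so
the numbers produced here will not match actual set-versions. The
make_set() helper is hypothetical.

```python
def jenkins_oaat(data: bytes, init: int = 0) -> int:
    """Standard 32-bit Jenkins one-at-a-time hash (init=0 is an assumption)."""
    h = init & 0xFFFFFFFF
    for byte in data:
        h = (h + byte) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h


def make_set(symbols, bpp):
    """Hash each symbol individually, keep the low bpp bits, sort, dedupe."""
    mask = (1 << bpp) - 1
    return sorted({jenkins_oaat(s.encode()) & mask for s in symbols})


if __name__ == "__main__":
    a = make_set(["sym1", "sym2"], 20)
    b = make_set(["sym2", "sym1", "sym2"], 20)  # other order, one duplicate
    print(a == b)
```

Since every symbol is hashed on its own and the result is sorted, neither
the order of the calls nor repeated symbols can change the outcome.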

>        2) Can the set:versions encoding be compared for more than equality?
>        What set/arithmetic property is the basis for the comparison? What
>        circumstances/constraints are there related to
>                        … You cannot always compare
>                set-versions in terms of "greater or equal" (but when you
>                can, it's important).

Set-versions compare as sets. There are Euler diagrams to visualise
set comparison, which is undergraduate material. The idea is that
real numbers form a linear order: you can always tell either V1 < V2 or
V1 >= V2. Sets are quite another matter: you cannot always appeal to
"tertium non datur" (either lt or ge). Which is to say that sets can
be quite different and do not compare easily. The order can be imposed
on the sets, though, by requiring the "greater" sets to have at least
the same elements they compare against (perhaps I'm starting to retell
the undergraduate material, which is not going to last). To sum up,
there IS a mathematical basis behind "Requires: foo >= set:asdf"
dependencies.

> I can of course answer my own questions with try-and-see test cases.


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-05-31 Thread Alexey Tourbin
On Thu, May 31, 2012 at 9:03 AM, Jeffrey Johnson  wrote:
>>> The mixed code case is interesting: what happens
>>> if a set:version encoding contains the literal string "0:V-R"
>>
>> I can't understand your question. A version is either a set-version, or
>> not a set-version.  If a version is a set-version, it has to be
>> prefixed by "set:". Regular RPM versions cannot be prefixed by "set:",
>> because "set" cannot be decoded as a valid serial number, which has to
>> be an integer. There might be some implementation sloppiness out
>> there, but in principle, I believe the encoding scheme is sane, and
>> makes sense.
>>
>> Now the question is, what if a set-version cannot be decoded? But that
>> can be perplexed by a question, what if a regular rpm version cannot
>> be decoded? Or can you decode any junk as a valid RPM version?
>
> I'm trying to understand rpmsetcmp() as a "black box" independent
> of all the gory implementation details of ELF symbols, base62 encoding,
> and RPM dependencies.
>
> I believe that set:versions are much like Bloom filters:
>        1) strings can be added to a "set" in any order
>        2) the comparison operation implied by
>                Requires: foo >= set:….
>        is identical to "contained in" or "is subset of"
> Is that the case?

To me, a set-version is just a VERSION. I can't stress that enough.
When you need a library, it is a legitimate question to ask which
version you need. If you answer that you need at least version 1.0, a
conventional version, you must be kidding. Because there is no
connection between what you actually need and a god-damn number. So a
plausible answer must be "Well, we need at least a version which
provides  symbols which we need to use". Let's take
this approach to the extreme, and say that the VERSION which we need
is simply the one which provides at least  symbols. To
me, this is much better an approach to library versioning, and
possibly the only viable approach. How else can you express your
expectations about a library? Suppose someone is talking to you at a
conference, and says "Mr. Johnson, we are very proud, blah-blah-blah,
because blah-blah-blah". What you want to tell them is basically "Go
out, folks, it works". Now, with set-versions, things really work. :-)

Back to implementation,
1) set-strings should be considered opaque, static, and unmodifiable.
Once they are formed, there is no useful way to alter them. They
express an idea of a VERSION in a manner which cannot be easily
comprehended, but that's not a problem (or otherwise there are
perplexing questions of what regular rpm versions must mean, and
whether any piece of junk can be decoded as a regular rpm version).
2) They must compare as sets, in terms of elements unique to the first
set, common elements, and elements unique to the second set. The
implementation does not quite match this yet, because it already tries to
mimic regular versions. Regular versions form a linear order.
Set-versions form a partial order.  You cannot always compare
set-versions in terms of "greater or equal" (but when you can, it's
important).
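
A sketch of the three-way comparison described in point 2, assuming both
sets are already decoded into sorted arrays of unique numbers. The
function names and the returned labels are hypothetical and do not
reflect the actual return values of rpmsetcmp().

```python
def compare_sets(a, b):
    """Count elements unique to a, common to both, and unique to b."""
    i = j = only_a = common = only_b = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            only_a += 1
            i += 1
        else:
            only_b += 1
            j += 1
    only_a += len(a) - i  # leftovers belong to one side only
    only_b += len(b) - j
    return only_a, common, only_b


def relation(a, b):
    """Partial-order verdict: 'equal', 'greater', 'less' or 'incomparable'."""
    only_a, _, only_b = compare_sets(a, b)
    if only_a == 0 and only_b == 0:
        return "equal"
    if only_b == 0:
        return "greater"  # a has every element of b, plus more
    if only_a == 0:
        return "less"
    return "incomparable"


if __name__ == "__main__":
    print(relation([1, 3, 7, 13, 15], [1, 13]))
    print(relation([1, 2], [2, 3]))
```

The fourth outcome is what separates a partial order from the linear
order of regular versions: neither "less" nor "greater or equal" applies.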


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-05-30 Thread Alexey Tourbin
On Thu, May 31, 2012 at 6:51 AM, Jeffrey Johnson  wrote:
> We are in violent agreement here over a minor issue
> of implementation/representation.

By the way, actual problems that will arise are rarely what you expect
them to be. In 2010, I was naive and I thought that "char bitv[]" was
a pretty good representation of a bit sequence (which can be still seen
in set.c). It then took many days to devise a sophisticated decoding
routine which avoids bitv[] altogether and makes things smooth. So, in
a violent agreement, don't take things for granted. :-)

> The mixed code case is interesting: what happens
> if a set:version encoding contains the literal string "0:V-R"

I can't understand your question. A version is either a set-version, or
not a set-version.  If a version is a set-version, it has to be
prefixed by "set:". Regular RPM versions cannot be prefixed by "set:",
because "set" cannot be decoded as a valid serial number, which has to
be an integer. There might be some implementation sloppiness out
there, but in principle, I believe the encoding scheme is sane, and
makes sense.

Now the question is, what if a set-version cannot be decoded? But that
can be perplexed by a question, what if a regular rpm version cannot
be decoded? Or can you decode any junk as a valid RPM version?

> and a match is attempted against a traditional dependency like
>        Provides: foo = 0:V-R
> If the literal string in the Provides: is encoded on the
> fly, will setcmp(…) match or not?


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-05-30 Thread Alexey Tourbin
On Thu, May 31, 2012 at 6:21 AM, Jeffrey Johnson  wrote:
>> On Mon, Apr 23, 2012 at 6:32 PM, Jeff Johnson  wrote:
>>> I should point out that writing the attached
>>> message (and sending from the wrong e-mail address) has instantly
>>> led to a different -- and perhaps more natural -- syntax like
>>>
>>> Requires: set(libfoo.so.1) >= whatever
>>
>> Hello,
>> Set-versions are just that - versions. One must arguably think of them
>> in terms of VERSIONS. If you need a library, it is a legitimate
>> question to ask which version you need. If you think you need the
>> version at least 1.0, there's a good question: why the heck you think
>> you need the version at least 1.0 (and whether 2.0 would still fit).
>> With set-versions, things get straightforward: you need at least a
>> version which provides  API symbols - that's much
>> better a description of a library than a god-damn arbitrary number.
>
> Yes VERSIONS.
> The issue for RPM is how to represent/attach a different VERSION comparison.

If those are only VERSIONS, they must apply to the same NAME. Look,
dependency resolution in rpm is two-fold. Set-versions
were designed to fit into the scheme. First, you look up the name in
rpmdb/Providename, and fetch the headers. Second, you decide whether
the versions match.

The real question is what to do if we are forced to match a
set-version against a non-set/another kind of version. Well, the
answer is that you should return as if the NAMEs were different. And
unless you're in a deeply theoretical mood, I believe the approach is
perfectly valid.  If two kinds of versions cannot be matched, pretend
they apply to different names.
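
The rule can be sketched as a thin wrapper around the second step. This
is only an illustration: the function names are hypothetical, the two
comparators are passed in by the caller, and version strings are treated
as opaque apart from the "set:" prefix.

```python
def is_set_version(version: str) -> bool:
    """A set-version is recognised solely by its "set:" prefix."""
    return version.startswith("set:")


def versions_match(required, provided, cmp_set, cmp_regular):
    """Second step of resolution, after the name lookup has succeeded.

    cmp_set and cmp_regular are caller-supplied predicates taking
    (provided, required) and returning True when the dependency is met.
    """
    if is_set_version(required) != is_set_version(provided):
        # Different kinds of versions: behave as if the NAMEs differed.
        return False
    if is_set_version(required):
        return cmp_set(provided, required)
    return cmp_regular(provided, required)


if __name__ == "__main__":
    always = lambda provided, required: True
    print(versions_match("set:asdf", "1.0-alt1", always, always))
    print(versions_match("set:asdf", "set:asdfgh", always, always))
```

A mismatch is thus reported as "not satisfied" rather than as an error,
which keeps the name lookup and the version check independent.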


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-05-30 Thread Alexey Tourbin
On Mon, Apr 23, 2012 at 6:32 PM, Jeff Johnson  wrote:
> I should point out that writing the attached
> message (and sending from the wrong e-mail address) has instantly
> led to a different -- and perhaps more natural -- syntax like
>
> Requires: set(libfoo.so.1) >= whatever

Hello,
Set-versions are just that - versions. One must arguably think of them
in terms of VERSIONS. If you need a library, it is a legitimate
question to ask which version you need. If you think you need the
version at least 1.0, there's a good question: why the heck you think
you need the version at least 1.0 (and whether 2.0 would still fit).
With set-versions, things get straightforward: you need at least a
version which provides  API symbols - that's much
better a description of a library than a god-damn arbitrary number.

There are some philosophical implications of introducing set-versions,
in particular, whether it can be extended to describe prototypes and
calling conventions (e.g. the number of arguments which must be passed
to a function). This is why I might seem reluctant to participate in
discussions. I'm thinking! (And perhaps I'm arrogant.)

> for set:versions, and for the generalization (for writing strict regression
> tests, it's mostly useless in packaging because there is no mapping that
> specifies how the mixed DEB <-> RPM version comparison might be done
> "naturally")
>
> Requires: deb(foo) >= E:V-R
> Requires: rpm(foo) >= E:V-R
>
> The precedent for foo(bar) name spacing in RPM dependencies with
> the above syntax is already widely deployed although entirely
> de facto.
>
> Sure would be nice _NOT_ to have to consider "Have it your own way!"
> competing syntaxes like
> Requires: libfoo.so.1 >= set:whatever
> and
> Requires: set(libfoo.so.1) >= whatever


Re: refined implementation of set-versions

2012-04-21 Thread Alexey Tourbin
On Fri, Apr 20, 2012 at 5:16 PM, Jeffrey Johnson  wrote:
> The methods in the existing encoding/decoding are in rpmio/set.c @rpm5.org:
> the algorithm is unchanged from Alt.
>
> A change to the existing scheme over the next few months doesn't bother
> me at all. But "legacy compatibility" has instantly appeared as an issue
> for an @rpm5.org implementation that has been "working" for less than
> 24 hours, and where the need is to attempt to install Alt packages
> into a chroot, sounds like @rpm5.org is going to be forced to both
> "old" and "new" encodings merely to continue trying to do "continuous
> integration" with Alt packages.
>
> But "legacy compatibility" is an insoluble problem which need not be
> discussed.
> If there's a better encoding scheme available soon, then switching is
> better done earlier than later.

The amount of compatibility with the existing Alt format is
negotiable. If you think that rpm5 must be able to install Alt
packages into chroot, then there is little choice but to 1) design the
new format with a clear distinction, so that older set-strings don't
get confused with the newer ones; and 2) to provide an additional
decoding routine. However, no additional support is required in e.g.
the comparison routine, since the decoding routine simply restores the
array of hash values. So this is doable.

There is another option, however. For the reason which shall remain
nameless, I find it tempting to produce the new and incompatible
format without any clear signs of distinction. :-)

> ATM, rpm-5.4.9 does only decoding (and comparison) of set:versions.
> The need was to be able to install Alt sisyphus packages (with set:versions
> dependencies) into a chroot. Generating set:versions (Alt uses a helper
> script, "multilib" packaging needs to use the gelf* API) will be harder,
> particularly if interoperability is desired.

Using the gelf* API probably won't do, since it is best to use an
ldd(1)-based tool which will basically invoke ld.so(8) to resolve
symbols and dump associations between the symbols and the libraries.
It can't be easily done with the gelf* API, unless you actually try to
reimplement a substantial portion of the dynamic linker, which is a
bad idea anyway. So using the script is the only realistic option. If
the script cannot be used at all (e.g. due to issues with multilib
attributes), this probably indicates a problem within rpmbuild itself.

> But set:versions looks quite useful, and far more effective at reducing the
> number of dependencies than attempting "pin-hole" optimizations with boolean
> expressions, discarding inequalities which are implied by other dependencies,
> as Per Oyvind has been attempting in Mandriva.

Where can I find more information about this work?


Re: refined implementation of set-versions

2012-04-20 Thread Alexey Tourbin
On Fri, Apr 20, 2012 at 4:12 PM, R P Herrold  wrote:
> On Fri, 20 Apr 2012, Alexey Tourbin wrote:
>
>> I have just learnt that rpm5 project has borrowed set-string
>> implementation recently from ALT Linux. At the very same time, I was
>> working on a new and improved encoding scheme which can make
>
> ... This is exciting news -- This seems like it would be a useful library.
>  What package in ALT are you doing this work in, or alternatively, is there a
> version control repository that I could check out to read your ongoing work?

The goal is not only to improve the implementation, but also to refine
basic concepts and designs.  Actually, set.c is already usable as a
library on its own. Why, it provides an API for creating set-versions,
and it also implements rpmsetcmp() comparison routine. That's just
enough to get the job done, and everything else then is the
implementation details which it hides in a particularly perfect
manner. On the other hand, the details are concealed perhaps a bit too
much - up to the point where it is not clear what set-versions are and
how exactly they are supposed to work. It is desirable then to expose
a lower-level API which clarifies the concept of set-versions while
still hiding less important implementation details.

So what's a set-version?

Intuitively, a set-version represents a set of symbols. This actually
suggests a new "de facto" approach to library versioning: the required
version of a library, in addition to its soname, is simply the set of
library symbols (i.e. functions and global variables) which we need to
use. Likewise, the version provided by a library is simply the set of
all symbols exported by the library. The key point is that Requires
versions and Provides versions can be produced in a relatively
independent manner, and meaningfully compared at later stages; that
is, it is possible to check if R \subset P.  The check is
probabilistic, which indicates a possibility of error.  The error rate
is reasonably small, though, and can be further controlled by a
parameter. What's more important is that only "false positive" kind of
error is possible - that is, in the worst case, the check simply does
not work, but at least we lose nothing. Another kind of error, a
"false alarm", is not possible. In this respect, set-versions are
similar to Bloom filters.
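
The one-sided nature of the error can be demonstrated with a small
sketch. It uses zlib.crc32 merely as a stand-in hash (an assumption, the
real implementation uses a different hash) and a deliberately tiny range
so that clashes are easy to find. All function names are hypothetical.

```python
import zlib


def to_numbers(symbols, bpp):
    """Hash each symbol and keep only the low bpp bits."""
    mask = (1 << bpp) - 1
    return {zlib.crc32(s.encode()) & mask for s in symbols}


def check(required, provided, bpp):
    """Probabilistic R-subset-of-P check, performed on the numbers."""
    return to_numbers(required, bpp) <= to_numbers(provided, bpp)


def find_false_positive(provided, bpp, limit=10000):
    """Look for a symbol absent from `provided` which the check still accepts."""
    numbers = to_numbers(provided, bpp)
    for k in range(limit):
        candidate = "missing%d" % k
        if candidate not in provided and to_numbers([candidate], bpp) <= numbers:
            return candidate
    return None


if __name__ == "__main__":
    provides = ["open", "read", "write", "close"]
    # A true subset always passes: equal symbols give equal numbers.
    print(check(["read", "close"], provides, 4))
    # With only 4 bits per number, a clash is easy to find.
    print(find_false_positive(provides, 4))
```

A symbol that really is provided can never be reported as missing. The
only possible mistake is accepting a symbol that clashes with a provided
one, and that becomes rarer as bpp grows.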

Set-versions have other nice properties which you might suspect, and
the one which is not so nice: the length of n-element set-version is
O(n). This is a fundamental limit which cannot be overcome. However, a
practical and feasible implementation is still possible. This outlines
two implementation priorities: 1) set-versions should be as short in
size as possible - actually their size should be close to the
information-theoretical minimum; 2) however, this must not tamper with
the possibility of fast decoding and comparison. With current
implementation, when the error rate is fixed at about 0.1%, Provides
versions take about 2 alphanumeric characters per symbol, and Requires
versions, since they are much more sparse, can take up to 3
alphanumeric characters per symbol. For a repo of about 10,000
packages, Requires and Provides set-versions can take only about 10M
total (but this assumes that some dependency optimizations are
performed and also that superfluous plugin-like Provides are
excluded). The check of all set-versioned dependencies, such as
performed by "apt-cache unmet", can be finished within a second. To
sum up, this is a compromise; but it is a favorable compromise.

So what is a set-version, really?  What if we say that a set-version is
just a set of numbers, such as hash values obtained after hashing each
symbol individually? Simple as it is, this approach can be used to
express everything else. More precisely, we need a scheme to encode n
m-bit numbers. For some reason which will become apparent later, we
need to supply m, which is actual bits per hash value, aka bpp,
explicitly.

So I think we can define a lower-level "set-string" encoding API as follows:

/** \ingroup setstring
 * Estimate the size of a string buffer for encoding.
 * @param v the values, sorted and unique
 * @param n number of values
 * @param bpp   actual bits per value, 8..32
 * @return  buffer size for encoding, < 0 on error
 */
int setstringEncodeSize(const unsigned *v, int n, int bpp);

/** \ingroup setstring
 * Encode a set of numeric values into alnum string.
 * @param v the values, sorted and unique
 * @param n number of values
 * @param bpp   actual bits per value, 8..32
 * @param s alnum output, null-terminated on success
 * @return  alnum string length, < 0 on error
 */
int setstringEncode(const unsigned *v, int n, int bpp, char *s);

The decoding API mirrors the encoding routines:

/** \ingroup setstring
 * 

refined implementation of set-versions

2012-04-20 Thread Alexey Tourbin
Hello,

I have just learnt that rpm5 project has borrowed set-string
implementation recently from ALT Linux. At the very same time, I was
working on a new and improved encoding scheme which can make
set-versions about 1% shorter in size, and which also permits more
efficient decoding. There are also other improvements, such as integer
overflow checking and revised API. Assuming that set-strings are not
widely used yet (except for ALT Linux), and compatibility is not an
issue, I'm willing to provide a refined implementation.
__
RPM Package Manager                                    http://rpm5.org
Developer Communication List                        rpm-devel@rpm5.org


ANN: RPM-Payload-0.10 on CPAN

2009-04-04 Thread Alexey Tourbin
It's better late than never that I released RPM::Payload perl module
on CPAN.  The code is almost trivial, and I would doubt if it's worth
releasing at all, except that it does a good job.  (I am going to
release more code that depends on RPM::Payload.)

http://search.cpan.org/dist/RPM-Payload/
http://git.altlinux.org/people/at/packages/perl-RPM-Payload.git

Also note that there is Archive::Cpio module on CPAN, written by Pixel,
which may or may not suit one's needs better.
http://search.cpan.org/dist/Archive-Cpio/


RPM::Payload(3)   User Contributed Perl Documentation  RPM::Payload(3)

NAME
   RPM::Payload - simple in-memory access to RPM cpio archive

SYNOPSIS
   use RPM::Payload;
   my $cpio = RPM::Payload->new("rpm-3.0.4-0.48.i386.rpm");
   while (my $entry = $cpio->next) {
       print $entry->filename, "\n";
   }

DESCRIPTION
   "RPM::Payload" provides in-memory access to RPM cpio archive.  Cpio
   headers and file data can be read in a simple loop.  "RPM::Payload"
   uses "rpm2cpio" program which comes with RPM.

EXAMPLE
   Piece of Bourne shell code:

   rpmfile()
   {
       tmpdir=`mktemp -dt rpmfile.`
       rpm2cpio "$1" |(cd "$tmpdir"
           cpio -idmu --quiet --no-absolute-filenames
           chmod -Rf u+rwX .
           find -type f -print0 |xargs -r0 file)
       rm -rf "$tmpdir"
   }

   Sample output:

   $ rpmfile rss2mail2-2.25-alt1.noarch.rpm
   ./usr/share/man/man1/rss2mail2.1.gz: gzip compressed data, from Unix, max compression
   ./usr/bin/rss2mail2: perl script text executable
   ./etc/rss2mail2rc:   ASCII text
   $

   Perl implementation:

   use RPM::Payload;
   use Fcntl qw(S_ISREG);
   use File::LibMagic qw(MagicBuffer);
   sub rpmfile {
       my $f = shift;
       my $cpio = RPM::Payload->new($f);
       while (my $entry = $cpio->next) {
           next unless S_ISREG($entry->mode);
           next unless $entry->size > 0;
           $entry->read(my $buf, 8192) > 0 or die "read error";
           print $entry->filename, "\t", MagicBuffer($buf), "\n";
       }
   }

CAVEATS
   "rpm2cpio" program (which comes with RPM) must be installed.

   It will die on error, so you may need an enclosing eval block.  How‐
   ever, they say "when you must fail, fail noisily and as soon as possi‐
   ble".

   Entries obtained with "$cpio->next" are coupled with current position
   in $cpio stream.  Thus, "$entry->read" and "$entry->readlink" methods
   may only be invoked before the next "$cpio->next" call.

   Hardlinks must be handled manually.  Alternatively, you may want to
   skip entries with "$entry->size == 0" altogether.

AUTHOR
   Written by Alexey Tourbin.

COPYING
   Copyright (c) 2006, 2009  Alexey Tourbin, ALT Linux Team.

   This is free software; you can redistribute it and/or modify it under
   the terms of the GNU General Public License as published by the Free
   Software Foundation; either version 2 of the License, or (at your
   option) any later version.

SEE ALSO
   rpm2cpio(8).

   Edward C. Bailey.  Maximum RPM.
   <http://www.rpm.org/max-rpm/index.html> (RPM File Format).

   Eric S. Raymond.  The Art of Unix Programming.
   <http://www.faqs.org/docs/artu/index.html> (Rule of Repair).

perl v5.8.9   2009-04-03   RPM::Payload(3)




Re: Remapping lib/rpmal.c to use a backing store.

2008-11-10 Thread Alexey Tourbin
On Mon, Nov 10, 2008 at 12:38:32PM -0500, Jeff Johnson wrote:
> Here's details of the history of rpmal.c and what I think needs to
> be done instead.
> 
> RPM started life as the engine for the Red Hat installer. The
> malloc() in libc.so.5 was buggy, so the entire index was designed
> as a huge array w/o ptrs, to simplify debugging and avoid memory
> fragmentation.

Frankly I don't quite understand what the "al" thing is, except that
"al" is used to reorder packages in rpmtsOrder().  To me, al is just
a list of headers with TR_ADDED/TR_REMOVED flags attached to them.




Re: RPM: rpm/lib/ rpmal.c

2008-11-10 Thread Alexey Tourbin
On Mon, Nov 10, 2008 at 12:01:36PM -0500, Jeff Johnson wrote:
> BTW, where & how are you seeing a flaw? What are the symptoms
> or usage case? This code has survived on 64bit platforms so

I stumbled upon various bugs in rpmtsOrder: certain ordering
relations ultimately were *NOT* added (T3) to tsi structures.

--- lib/depends.c-  2008-11-09 13:46:19 +
+++ lib/depends.c   2008-11-10 17:16:00 +
@@ -2142,12 +2143,14 @@ static inline int addRelation(rpmts ts,
pkgKey = (alKey)(((long)pkgKey) + ts->numAddedPackages);
 
 for (qi = rpmtsiInit(ts), i = 0; (q = rpmtsiNext(qi, 0)) != NULL; i++) {
+   // pkgKey had garbage in its high bits
if (pkgKey == rpmteAddedKey(q))
break;
 }
 qi = rpmtsiFree(qi);
-if (q == NULL || i >= ts->orderCount)
-   return 0;
+if (q == NULL || i >= ts->orderCount) { 
+   fprintf(stderr, "RET2 q=%p %s <- %s\n", q, rpmdsN(requires), rpmteNEVRA(p));
+   return 0; }
 
 /* Avoid certain dependency relations. */
 if (ignoreDep(ts, p, q))
End of diff

So, with this debugging output, I've seen a lot of early "RET"s.
The code "survived" only because it silently returned for "can't
happen" conditions.  I'd rather use assert() to catch "can't happen"
conditions (assuming that rpm should not be compiled with -DNDEBUG).

> I'd like to understand why not reported. Presumably you
> are using alNum2Key() in some new context where the garbage
> bits matter.




Re: RPM: rpm/lib/ rpmal.c

2008-11-10 Thread Alexey Tourbin
On Sun, Nov 09, 2008 at 05:01:29PM -0500, Jeff Johnson wrote:
> Hehe, been there, done that. I feel your pain ...

This fixes various issues on 64 bit platforms.
I think the fix should be backported to relevant 5_x branches.

> Do we agree that lib/rpmal.c code needs to DIE! DIE! DIE!?
> 
> Seriously, its insane to have an __IN MEMORY__ 2 level
> dir/file store based on __BSEARCH__ for portability in the year 2008.

I think that the code is complicated, and the underlying data model
is hard to understand.  The complication is partly due to opaque
interfaces.  A possible solution is to export fewer public interfaces
while using plain data structures internally.

> > --- rpm/lib/rpmal.c 2 Aug 2008 00:38:04 -   2.71
> > +++ rpm/lib/rpmal.c 9 Nov 2008 21:38:03 -   2.72
> > @@ -154,6 +154,7 @@
> >  {
> >  /[EMAIL PROTECTED] -temptrans -retalias @*/
> >  union { alKey key; alNum num; } u;
> > +u.num = 0;
> >  u.key = pkgKey;
> >  return u.num;
> >  /[EMAIL PROTECTED] =temptrans =retalias @*/
> > @@ -165,6 +166,7 @@
> >  {
> >  /[EMAIL PROTECTED] -temptrans -retalias @*/
> >  union { alKey key; alNum num; } u;
> > +u.key = 0;
> >  u.num = pkgNum;
> >  return u.key;

(Without this change, u.key had garbage in its high bits.)
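
For illustration only, here is a minimal sketch of a conversion that does not
go through a union at all. The typedefs mimic alKey/alNum but are assumptions
(a pointer-sized key and an int index), and key2num()/num2key() are
hypothetical names, not the functions in lib/rpmal.c. Casting through intptr_t
defines every bit of the wider type, so no stale high bits can survive on
64-bit platforms.

```c
#include <stdint.h>

typedef void *alKey_t;   /* assumption: pointer-sized opaque key */
typedef int alNum_t;     /* assumption: plain int index */

/* Narrow a key back to its index; only the low bits carry information. */
static alNum_t key2num(alKey_t key)
{
    return (alNum_t)(intptr_t)key;
}

/* Widening int -> intptr_t sign-extends, so all bits of the key are
 * defined, unlike writing the narrow member of a union. */
static alKey_t num2key(alNum_t num)
{
    return (alKey_t)(intptr_t)num;
}
```

The integer/pointer round trip is implementation-defined in C, but it is the
usual idiom on the platforms rpm targets; this is an alternative to the patch
quoted above, not a description of it.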




Re: %post-script prerequisites

2008-09-24 Thread Alexey Tourbin
On Wed, Sep 24, 2008 at 07:08:54PM +, Alexey Tourbin wrote:
> In package foo, program /usr/bin/foo is both packaged *and* called
> in its %post script.  The program /usr/bin/foo runs /usr/bin/bar,
> for which we have the dependency "Requires: /usr/bin/bar".

Here is a similar example that does not require the --noorder option
to demonstrate the problem.  The difference is that packages A
and B have circular dependencies, so, unless we have "Requires(post)",
rpm chooses to install A first, and its %post script fails.

Name: A
Version: 1.0
Release: 1
Summary: A
License: GPL
Group: Development/Other
Requires: /usr/bin/B
#Requires(post): /usr/bin/B
BuildArch: noarch
AutoReqProv: no
%package -n B
Summary: B
Group: Development/Other
Requires: A
AutoReqProv: no
%description
%description -n B
%install
mkdir -p %buildroot/usr/bin
cat >%buildroot/usr/bin/A <%buildroot/usr/bin/B <



Re: %post-script prerequisites

2008-09-24 Thread Alexey Tourbin
On Wed, Sep 24, 2008 at 01:56:35PM -0400, Jeff Johnson wrote:
> >Anyway, perhaps I should do some rewording in my initial description
> >of the problem.  In ALT Linux mailing list (in Russian), there seems
> >to be some misunderstanding (or maybe a lack thereof), too.
> 
> Lots of misunderstandings wrto Requires(post): and PreReq:.
> 
> But I still question whether "(post)" is fixing anything at all. Try  
> and see,
> push a reproducer to me if you'ld like comments.

Sample specfile:

Name: foo
Version: 1.0
Release: 1
Summary: foo
License: GPL
Group: Development/Other
Requires: /usr/bin/bar
BuildArch: noarch
AutoReqProv: no
%package -n bar
Summary: bar
Group: Development/Other
AutoReqProv: no
%description
%description -n bar
%install
mkdir -p %buildroot/usr/bin
cat >%buildroot/usr/bin/foo <%buildroot/usr/bin/bar <



Re: %post-script prerequisites

2008-09-24 Thread Alexey Tourbin
> >Think about this again: package foo has program /usr/bin/update-foo,
> >which is invoked in %post-script of the package.  The program is linked
> >with e.g. libglib-2.0.so.0(GLIB_2.18), and we have the dependency
> >"Requires: libglib-2.0.so.0(GLIB_2.18)".  However, this is merely
> >"Requires".  To run /usr/bin/update-foo in the %post script reliably,
> >that must be "Requires(post)".  Or otherwise there's a possibility that,
> >despite topological reordering, glib2 gets installed or upgraded after
> >foo, and so the %post scriptlet fails miserably.
> >
> >So, there's a general problem: if you both package a program and run
> >it in the %post-script (in the very same package), then bare Requires
> >are not enough: some of them (namely, which are used by the program)
> >should also become Requires(post).
> 
> Mebbe.
> 
> IMHO, there are several flaws in the above.
> 
> The fundamental flaw (imho) is trying to add secondary
> dependencies to a package node in the dependency graph.
> 
> If the chain A -> B -> C is necessary for running /usr/bin/update-foo
> in a script, then A should have
>   Requires: B
> and B should have
>   Requires: C
> and the dependency graph should be assembled dynamically,
> not added statically to A.

+# The solution is: 1) to detect all packaged programs (and files), recursively,
+# which are used in %post-script; and 2) to find prerequisites for such files
+# and programs (which are not provided by the package itself), and add them to
+# Requires(post) dependencies.  Also, we want to ensure that 3) the list of
+# Requires(post) additional dependencies is only a subset of original Requires.

Note that there are three steps, and the last step is explicitly
about not adding secondary dependencies.  So, for the package foo,
it goes like this:
1) We see that the packaged program /usr/bin/update-foo is invoked
in %post script (in the very same package!).
2) We run find-requires for /usr/bin/update-foo, and get some
dependencies, which are candidates for Requires(post); they are
libglib-2.0.so(GLIB_2.18) and others (but no secondary dependencies
anyway).
3) Requires(post) candidates are intersected with the "Requires"
of the package.  Since /usr/bin/update-foo is packaged, there must
be "Requires: libglib-2.0.so(GLIB_2.18)", too.  This step is basically
required only because of some glitches in dependency generators (i.e.
some generators work best when they process the whole list of specific
files, to make some folding).
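
Step 3 is a plain set intersection. A minimal sketch, assuming both
dependency lists are already sorted in strcmp() order; intersect_sorted() is a
hypothetical helper, not an rpm-build function:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: cand[] holds Requires(post) candidates, req[] the
 * package's plain Requires; both must be sorted in strcmp() order.
 * Writes the common entries to out[] and returns how many there are. */
static size_t intersect_sorted(const char **cand, size_t ncand,
                               const char **req, size_t nreq,
                               const char **out)
{
    size_t i = 0, j = 0, n = 0;
    while (i < ncand && j < nreq) {
        int c = strcmp(cand[i], req[j]);
        if (c < 0) {
            i++;                /* candidate is not in Requires: drop it */
        } else if (c > 0) {
            j++;                /* Requires entry is not a candidate */
        } else {
            out[n++] = cand[i]; /* in both: becomes Requires(post) */
            i++;
            j++;
        }
    }
    return n;
}
```

Because only entries present in both lists survive, the result can never
contain a dependency the package did not already have.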

> Also your claim
> 
> >To run /usr/bin/update-foo in the %post script reliably,
> >that must be "Requires(post)".
> 
> is false. The "(post) marker limits the context where a
> dependency applies, and does not improve reliability at all.

To run /usr/bin/foo in the %post-script reliably, there must be a
"Requires(post): libglib-2.0.so(GLIB_2.18)" dependency.  "Reliably"
means that, unless we have this "Requires(post)" dependency on recent
glib2 version, there's a possibility that glib2 gets installed/upgraded
*after* foo (which is too late for its %post-script).  What's false in
this claim?

Anyway, perhaps I should do some rewording in my initial description
of the problem.  In ALT Linux mailing list (in Russian), there seems
to be some misunderstanding (or maybe a lack thereof), too.




Re: %post-script prerequisites

2008-09-24 Thread Alexey Tourbin
On Wed, Sep 24, 2008 at 12:26:10PM -0400, Jeff Johnson wrote:
> You do know that bash --rpm-requires will extract
> dependencies for all scriptlets, not just %post and %preun,
> automagically for several years now?

Think about this again: package foo has program /usr/bin/update-foo,
which is invoked in %post-script of the package.  The program is linked
with e.g. libglib-2.0.so.0(GLIB_2.18), and we have the dependency
"Requires: libglib-2.0.so.0(GLIB_2.18)".  However, this is merely
"Requires".  To run /usr/bin/update-foo in the %post script reliably,
that must be "Requires(post)".  Or otherwise there's a possibility that,
despite topological reordering, glib2 gets installed or upgraded after
foo, and so the %post scriptlet fails miserably.

So, there's a general problem: if you both package a program and run
it in the %post-script (in the very same package), then bare Requires
are not enough: some of them (namely, which are used by the program)
should also become Requires(post).




Re: Two limitations of triggers in rpm

2008-09-22 Thread Alexey Tourbin
On Mon, Sep 22, 2008 at 07:38:03AM -0400, Jeff Johnson wrote:
> >Now, some paths are "virtual", e.g. executable paths under
> >update-alternatives(1) control.  Those paths are not packaged (and hence
> >cannot be accessed via Basenames index), but rather created in %post script.
> >We have find-provides hook which automatically provides virtual paths
> >(e.g. "Provides: /usr/bin/xvt" for xterm, rxvt-unicode etc.)
> 
> So use a probe dependency like
> Requires: executable(/path/to/alternative)
> The probe will be evaluated at run-time, and even permits rpm to
> interoperate with dpkg alternatives.

I don't need runtime probe at all, I need to resolve/install the
dependency.

# apt-get install /usr/bin/xvt
Reading Package Lists... Done
Building Dependency Tree... Done
Package /usr/bin/xvt is a virtual package provided by:
  xterm 237-alt1 [Installed]
  termit 1.3.5-alt1
  rxvt-unicode 9.02-alt1 [Installed]
  kdebase-wm 3.5.10-alt4
  kde4base-konsole 4.1.1-alt1
  gnome-terminal 2.22.3-alt1
  aterm 1.0.1-alt3 [Installed]
You should explicitly select one to install.
E: Package /usr/bin/xvt is a virtual package with multiple good providers.
# 




Re: Two limitations of triggers in rpm

2008-09-21 Thread Alexey Tourbin
On Sat, Sep 20, 2008 at 01:37:59PM -0400, Jeff Johnson wrote:
>   There's the additional wrinkle of handling
>   Provides: /path/to/file
>   which muddles the implementation further, because 2 indices need to  
> be searched.
> 
>   Personally, I think its way past time to prohibit file paths in the  
> Providename index. A second source
>   for paths adds a huge amount of complexity to all application  
> accesses, not just rpm, of an rpmdb for
>   very little benefit other than that packagers and vendors get to  
> pretend that file paths in the Providename
>   index is some sort of cool and useful feature.

In ALT Linux, "file-level" dependencies are essential.  I.e. when
a file path is known in advance, we use dependency on that path (e.g.
"Requires: /bin/sh").  And we have some complicated logic to translate
file-level dependencies with intermediate symlinks in path components,
e.g. /etc/init.d/functions -> /etc/rc.d/init.d/functions.

Now, some paths are "virtual", e.g. executable paths under
update-alternatives(1) control.  Those paths are not packaged (and hence
cannot be accessed via Basenames index), but rather created in %post script.
We have find-provides hook which automatically provides virtual paths
(e.g. "Provides: /usr/bin/xvt" for xterm, rxvt-unicode etc.)

Other packages might want to require /usr/bin/xvt, which they actually
do require.  So, due to "virtual paths", file-like Provides cannot
be prohibited.

>   From an engineering POV, paths in the Providename index just doubles 
> the amount of work needed
>   to ensure whether a path is present (or not).

Most of the time, Basenames index lookup will do (since most paths are
non-virtual).  The amount of work gets doubled only when Providename
fallback is invoked.
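
A toy sketch of that lookup order, with two in-memory arrays standing in for
the Basenames and Providename indices. The table contents and the
resolve_path() name are made up for illustration; a real rpmdb lookup goes
through librpm, not through arrays like these.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for the two rpmdb indices; contents are made up. */
static const char *basenames_index[] = { "/bin/sh", "/usr/bin/xterm" };
static const char *providename_index[] = { "/usr/bin/xvt" };

static int in_index(const char **idx, size_t n, const char *path)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(idx[i], path) == 0)
            return 1;
    return 0;
}

/* Returns 1 if the path is a packaged file (Basenames hit, one lookup),
 * 2 if it was only found through the Providename fallback (two lookups),
 * and 0 if neither index knows it. */
static int resolve_path(const char *path)
{
    size_t nb = sizeof basenames_index / sizeof basenames_index[0];
    size_t np = sizeof providename_index / sizeof providename_index[0];

    if (in_index(basenames_index, nb, path))
        return 1;
    if (in_index(providename_index, np, path))
        return 2;
    return 0;
}
```

The second index is consulted only for virtual paths and for paths that do
not exist at all, which is the sense in which the work is doubled only on
fallback.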




Re: Conflicts on files not symmetric

2008-09-19 Thread Alexey Tourbin
On Fri, Sep 19, 2008 at 04:01:13PM -0400, Jeff Johnson wrote:
> >1)
> >%triggerin --posttrans -- /usr/share/icons/hicolor/*/*/
> >gtk-update-icon-cache /usr/share/icons/hicolor
> >
> >This trigger can be triggered/folded/called either by dirname or by
> >glob pattern itself.  Since there is no way to pass the matching
> >dirname, which is limitation by itself, the only sane possibility
> >is that DIRNAMES triggers are triggered/folded/called by glob  
> >patterns.
> 
> You're worried about a package that has __LITERALLY__ a path
> that includes glob characters?!?

No-no-no.  I mean something else.  Dirname triggers are called *per
what*?  They should be called *per matching dirname*.  On the other
hand, using glob patterns, there is no way to pass the dirname to the
trigger, so we fold matching dirnames *by dirname patterns*.

E.g. for
%triggerin --posttrans -- /usr/share/icons/hicolor/*/*/
...

Possibility #1) when dirname matches, call the trigger;
so the trigger gets called multiple times for e.g.
/usr/share/icons/hicolor/a/b/
/usr/share/icons/hicolor/a/c/
/usr/share/icons/hicolor/a/d/
etc.

Possibility #2) getting sober: no way to know which dirnames really
matched; call once for all matching dirnames.

My point was that, with glob-dirname triggers (which is what you
propose), I still cannot do what I need (actually what *they* need).
Triggers simply cannot have any special "arguments" "that matched",
at least not now.

[... I need some time to study other points ...]




Re: Conflicts on files not symmetric

2008-09-19 Thread Alexey Tourbin
On Fri, Sep 19, 2008 at 12:40:55PM -0400, Jeff Johnson wrote:
> >On Fri, Sep 19, 2008 at 04:21:50PM +, Alexey Tourbin wrote:
> >>Technically there's no piping, only a file duplicated on stdin.  And
> >>"filetriggers" are run only once, at the end of transaction (they're
> >>actually "posttrans filetriggers"), which saves consecutive ldconfig,
> >>gtk-update-icon-cache, or whatever calls.
> >
> >Uh, but can that work? A Prereq to another package basically says
> >that the package must be fully configured before installation,
> >so all triggers must be run. Post-transaction is a bit late...
> >
> 
> There's need for a IMMEDIATE as well as a ONETIME (as in delayed)  
> trigger attribute.
> 
> The ONETIME mechanism can be handled by appending to existing
> %posttrans, the IMMEDIATE attribute is essentially the existing trigger
> mechanism(s).

Okay, with DIRNAMES patterns and "posttrans" trigger flag,
you can implement something like "posttrans filetriggers" on
behalf of specfile/rpmdb.

There are still issues.

1) 
%triggerin --posttrans -- /usr/share/icons/hicolor/*/*/
gtk-update-icon-cache /usr/share/icons/hicolor

This trigger can be triggered/folded/called either by dirname or by
glob pattern itself.  Since there is no way to pass the matching
dirname, which is limitation by itself, the only sane possibility
is that DIRNAMES triggers are triggered/folded/called by glob patterns.

2)
%triggerin --posttrans -- /usr/share/icons/hicolor/*/*/
gtk-update-icon-cache /usr/share/icons/hicolor
%triggerun --posttrans -- /usr/share/icons/hicolor/*/*/
gtk-update-icon-cache /usr/share/icons/hicolor

How do you pass "$2" argument to these triggers?  What is "$2"?  If you
pass different "$2" for in/un, you can no longer fold basically the same
in/un triggers (and they run twice).  Or you do not pass "$2" at all.
Anyway, doing just something about "$2" is weird.

And this is still not enough.

3) There are a dozen icon themes, and their gtk2 icon cache is specific
to gtk2.  The above triggers imply that I process "hicolor" theme
specially.  However, I do not.  I want gtk2 to update caches for all its
themes as needed.

Here is gtk-icon-cache.filetrigger for the gtk2 package as (presumably)
implemented for ALT Linux:

#!/bin/sh
egrep -o '^/usr/share/icons/[^/]+/' |sort -u |
# doing /usr/share/icons/*/ directories
while read -r dir; do
    if [ -f "$dir"/index.theme ]; then
        # something changed for this theme
        gtk-update-icon-cache "$dir"
    elif [ -f "$dir"/icon-theme.cache ]; then
        # theme was removed, nuke stale cache
        rm -f "$dir"/icon-theme.cache
        rmdir --ignore-fail-on-non-empty "$dir"
    fi
done

Now you cannot implement this with glob-dirname triggers, because
you need to know the name of icon theme dir.

gtk2.spec:
%triggerin -- /usr/share/icons/*/*/*/
# cannot deduce /usr/share/icons/hicolor/ prefix




Re: Conflicts on files not symmetric

2008-09-19 Thread Alexey Tourbin
On Fri, Sep 19, 2008 at 12:08:14PM -0400, Jeff Johnson wrote:
> >>So? Use a glob pattern against RPMTAG_DIRNAMES
> >>elements to detect condition pkg-contains-directory.
> >
> >Do you mean something like -- ?
> >%triggerin -- /usr/share/icons/hicolor/*/*/
> >gtk-update-icon-cache /usr/share/icons/hicolor
> 
> Yes.
> 
> >Possible implementation is: retrieve all Triggername index keys
> >with leading "/", and treat them as patterns.  Then do O(N^2) nested
> >loop: for each DIRNAME in a package, for each Triggername pattern,
> >check for fnmatch(pattern, dirname).
> 
> No. I haven't said anything at all about loops or implementation.

But it has to be implemented somehow...

> And how is piping every file to an external script any better or faster?

Technically there's no piping, only a file duplicated on stdin.  And
"filetriggers" are run only once, at the end of transaction (they're
actually "posttrans filetriggers"), which saves consecutive ldconfig,
gtk-update-icon-cache, or whatever calls.

> For starters, there are many fewer directories, already uniqified, than
> there are file paths in packages ...




Re: Conflicts on files not symmetric

2008-09-19 Thread Alexey Tourbin
On Fri, Sep 19, 2008 at 11:51:32AM -0400, Jeff Johnson wrote:
> >$ find /usr/share/icons/hicolor -mindepth 2 -type d |sort -u |head
> >/usr/share/icons/hicolor/128x128/actions
> >/usr/share/icons/hicolor/128x128/animations
> >/usr/share/icons/hicolor/128x128/apps
> >/usr/share/icons/hicolor/128x128/categories
> >/usr/share/icons/hicolor/128x128/devices
> >/usr/share/icons/hicolor/128x128/emblems
> >/usr/share/icons/hicolor/128x128/emotes
> >/usr/share/icons/hicolor/128x128/filesystems
> >/usr/share/icons/hicolor/128x128/intl
> >/usr/share/icons/hicolor/128x128/mimetypes
> >$ find /usr/share/icons/hicolor -mindepth 2 -type d  |wc -l
> >156
> >$
> 
> So? Use a glob pattern against RPMTAG_DIRNAMES
> elements to detect condition pkg-contains-directory.

Do you mean something like -- ?
%triggerin -- /usr/share/icons/hicolor/*/*/
gtk-update-icon-cache /usr/share/icons/hicolor

Possible implementation is: retrieve all Triggername index keys
with leading "/", and treat them as patterns.  Then do O(N^2) nested
loop: for each DIRNAME in a package, for each Triggername pattern,
check for fnmatch(pattern, dirname).
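
A minimal sketch of that nested loop, using POSIX fnmatch(3). The function
name match_dir_triggers() and the choice of FNM_PATHNAME (so that "*" does not
cross a "/") are assumptions made for the example; nothing here is existing
librpm code.

```c
#include <fnmatch.h>
#include <stddef.h>

/* For each trigger pattern, set fired[j] = 1 if any dirname of the
 * package matches it.  Returns the number of patterns that fired.
 * A pattern fires at most once, however many dirnames match. */
static size_t match_dir_triggers(const char **dirnames, size_t ndirs,
                                 const char **patterns, size_t npats,
                                 int *fired)
{
    size_t count = 0;

    for (size_t j = 0; j < npats; j++)
        fired[j] = 0;
    for (size_t i = 0; i < ndirs; i++) {
        for (size_t j = 0; j < npats; j++) {
            if (fired[j])
                continue;       /* already triggered by an earlier dirname */
            if (fnmatch(patterns[j], dirnames[i], FNM_PATHNAME) == 0) {
                fired[j] = 1;
                count++;
            }
        }
    }
    return count;
}
```

The cost is the number of dirnames times the number of patterns, and the
result records only which patterns fired, not which dirnames matched them.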




Re: Conflicts on files not symmetric

2008-09-19 Thread Alexey Tourbin
On Fri, Sep 19, 2008 at 11:36:52AM -0400, Jeff Johnson wrote:
> >On Fri, Sep 19, 2008 at 11:26:23AM -0400, Jeff Johnson wrote:
> >>Likely the 1st thing to get into place is the ability to trigger from
> >>adding a file to a directory, i.e. trigger if RPMTAG_DIRNAMES matches
> >>a trigger pattern, add trailing / to pattern to differentiate a
> >>dirname trigger.
> >
> >Triggers based on DIRNAMES are not enough.  Here is an example:
> >when e.g. /usr/share/icons/hicolor/32x32/apps/kpdf.png is installed,
> >upgraded, or removed, "gtk-update-icon-cache /usr/share/icons/hicolor"
> >must be triggered.
> 
> And that is exactly the same condition as RPMTAG_DIRNAMES contains
> /usr/share/icons/hicolor/32x32/apps/ for a package that is
> installed/erased.

$ find /usr/share/icons/hicolor -mindepth 2 -type d |sort -u |head
/usr/share/icons/hicolor/128x128/actions
/usr/share/icons/hicolor/128x128/animations
/usr/share/icons/hicolor/128x128/apps
/usr/share/icons/hicolor/128x128/categories
/usr/share/icons/hicolor/128x128/devices
/usr/share/icons/hicolor/128x128/emblems
/usr/share/icons/hicolor/128x128/emotes
/usr/share/icons/hicolor/128x128/filesystems
/usr/share/icons/hicolor/128x128/intl
/usr/share/icons/hicolor/128x128/mimetypes
$ find /usr/share/icons/hicolor -mindepth 2 -type d  |wc -l
156
$




Re: Conflicts on files not symmetric

2008-09-19 Thread Alexey Tourbin
On Fri, Sep 19, 2008 at 11:26:23AM -0400, Jeff Johnson wrote:
> Likely the 1st thing to get into place is the ability to trigger from  
> adding
> a file to a directory, i.e. trigger if RPMTAG_DIRNAMES matches a trigger
> pattern, add trailing / to pattern to differentiate a dirname trigger.

Triggers based on DIRNAMES are not enough.  Here is an example:
when e.g. /usr/share/icons/hicolor/32x32/apps/kpdf.png is installed,
upgraded, or removed, "gtk-update-icon-cache /usr/share/icons/hicolor"
must be triggered.




Re: Conflicts on files not symmetric

2008-09-19 Thread Alexey Tourbin
On Fri, Sep 19, 2008 at 04:59:35PM +0200, Michael Schroeder wrote:
> On Fri, Sep 19, 2008 at 02:52:21PM +0000, Alexey Tourbin wrote:
> > On Fri, Sep 19, 2008 at 11:07:04AM +0200, Michael Schroeder wrote:
> > > while implementing virtual triggers
> > [...]
> > May I perhaps take a look at where you are?
> > I'm implementing some sort of triggers, too.
> 
> Oh, I'm not implementing new triggers, I'm just changing the current
> implementation so that they also trigger on package provides (and
> maybe the file list) and not just package NEVR.
> 
> What are you working on?

File triggers for ALT Linux rpm, based on Mandriva patch, but with a few
design decisions different.  Main differences are:

1) The list of files is not prefixed with "+" or "-".
When some package is upgraded, the same files are both "added"
and "removed" (in terms of rpm), so the distinction is not reliable
(especially if the triggers are postponed/resumed).  We can still
use simple file test like [ -f file ] to see if the files were actually
added/upgraded or removed.  This also means we can make the list
of files unique with "sort -u".

2) No separate files with regular expressions (and no internal grep in
librpm) -- file triggers are black boxes.  All triggers are run with
full file list attached to stdin.

Here is, for example, how an ldconfig trigger can be implemented.

#!/bin/sh -e
while read -r f; do
    case "$f" in
        /lib/lib*/* |\
        /lib64/lib*/* |\
        /usr/lib/lib*/* |\
        /usr/lib64/lib*/* )
            # false positives
            continue ;;
        /lib/lib*.so |\
        /lib64/lib*.so |\
        /usr/lib/lib*.so |\
        /usr/lib64/lib*.so )
            # maybe soname
            if set "$f".* && [ -f "$1" ]; then
                continue
            fi
            ;;
        /lib/lib*.so.* |\
        /lib64/lib*.so.* |\
        /usr/lib/lib*.so.* |\
        /usr/lib64/lib*.so.* )
            # soname
            ;;
        /etc/ld.so.conf.d/*.conf)
            ;;
        *) continue ;;
    esac
    exec /sbin/ldconfig
done

In "maybe soname" case, I check for something like *-devel packages
installed or removed (they have symbolic links for which ldconfig
should not be invoked).




Re: Conflicts on files not symmetric

2008-09-19 Thread Alexey Tourbin
On Fri, Sep 19, 2008 at 11:07:04AM +0200, Michael Schroeder wrote:
> while implementing virtual triggers
[...]
May I perhaps take a look at where you are?
I'm implementing some sort of triggers, too.




Fwd: Re: v4.999.5alpha LZMA_STREAM_INIT_VAR

2008-09-14 Thread Alexey Tourbin
- Forwarded message from Lasse Collin <[EMAIL PROTECTED]> -

Date: Sat, 13 Sep 2008 19:08:21 +0300
From: Lasse Collin <[EMAIL PROTECTED]>
To: Alexey Tourbin <[EMAIL PROTECTED]>
Subject: Re: v4.999.5alpha LZMA_STREAM_INIT_VAR

Alexey Tourbin wrote:
> On Sat, Sep 13, 2008 at 01:17:42PM +0300, Lasse Collin wrote:
> > Alexey Tourbin wrote:
> > > rpm5.org/rpmio/lzdio.c:
> > > 81  lzfile = calloc(1, sizeof(*lzfile));
> > > 82  if (!lzfile) {
> > > 83  (void) fclose(fp);
> > > 84  return NULL;
> > > 85  }
> > > 86  lzfile->fp = fp;
> > > 87  lzfile->encoding = encoding;
> > > 88  lzfile->eof = 0;
> > > 89  lzfile->strm = LZMA_STREAM_INIT_VAR;
> >
> > You can use for example memset(&lzfile->strm, 0,
> > sizeof(lzfile->strm)). See the comment of LZMA_STREAM_INIT in
> > src/liblzma/api/lzma/base.h for details.
>
> There's a calloc() call which is already there, which means there's
> no need to explicitly invoke memset.  But LZMA_STREAM_INIT_VAR has
> been there for a while.

Oh, I didn't read the code carefully, sorry.

If the previous liblzma version you built against was 4.999.3alpha, you
will have a bunch of other problems too. For example, the prototypes of
lzma_alone_encoder() and lzma_alone_decoder() have changed.

> May I perhaps forward your message to rpm5.org development list?
> (So that we can fix the code in some way.)

Sure. I can go and idle on [EMAIL PROTECTED] for a while too.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode

- End forwarded message -




Fwd: Re: v4.999.5alpha LZMA_STREAM_INIT_VAR

2008-09-14 Thread Alexey Tourbin
- Forwarded message from Lasse Collin <[EMAIL PROTECTED]> -

Date: Sat, 13 Sep 2008 13:17:42 +0300
From: Lasse Collin <[EMAIL PROTECTED]>
To: Alexey Tourbin <[EMAIL PROTECTED]>
Subject: Re: v4.999.5alpha LZMA_STREAM_INIT_VAR

Alexey Tourbin wrote:
> Upgrading to v4.999.5alpha breaks existing software builds, since
> LZMA_STREAM_INIT_VAR apparently has been removed from API (without
> deprecation note or something).

There are lots of other API changes in addition to LZMA_STREAM_INIT_VAR 
in 4.999.5alpha, and I've already made a few more in the git 
repository. So you need to be careful when upgrading until the first 
stable release, because it is possible that some changes don't get 
detected by the compiler, e.g. if a new member is added to a structure.

I know how much people hate API and ABI breakages. Once the first stable 
release is out, I won't break the API or ABI easily. But before that, I 
won't promise anything, because it would complicate development far too 
much. Stable release should be out before end of this year.

> rpm5.org/rpmio/lzdio.c:
> 81  lzfile = calloc(1, sizeof(*lzfile));
> 82  if (!lzfile) {
> 83  (void) fclose(fp);
> 84  return NULL;
> 85  }
> 86  lzfile->fp = fp;
> 87  lzfile->encoding = encoding;
> 88  lzfile->eof = 0;
> 89  lzfile->strm = LZMA_STREAM_INIT_VAR;

You can use for example memset(&lzfile->strm, 0, sizeof(lzfile->strm)). 
See the comment of LZMA_STREAM_INIT in src/liblzma/api/lzma/base.h for 
details.

If you think the initialization should be done in some other way, for 
example by having a separate function or macro to do the 
initialization, let me know. I'm going to remove all exported variables 
from the API, so LZMA_STREAM_INIT_VAR won't be added back as is.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode

- End forwarded message -




Re: rpm infinite recursions using manifests

2008-09-07 Thread Alexey Tourbin
On Sun, Sep 07, 2008 at 02:32:57PM -0400, Jeff Johnson wrote:
> On Sep 7, 2008, at 2:28 PM, Alexey Tourbin wrote:
> >On Sun, Sep 07, 2008 at 02:22:26PM -0400, Jeff Johnson wrote:
> >>>Forbid manifest files from within manifests.
> >>Forbid manifests entirely is a similarly Draconian solution.
> >
> >I expect manifests to have list semantics, not #include semantics.
> 
> Sure manifests have list semantics.

List semantics means that we have basic type rpm, and manifests are
of type list.  Include semantics has a somewhat vaguer notion of
basic type: [  |  ]+, the latter
is recursive and should be reduced to terminal code_snippets.  cpp(1)
provides a device to prevent infinite recursion, which is

#ifndef FOO_H
#define FOO_H
[  |  ]+
#endif

and otherwise has no special way to handle recursion, except for
a nesting limit.

$ cat foo.h
#include "foo.h"
$ cpp -I. -E foo.h 2>&1 >/dev/null |tail
 from foo.h:1,
 from foo.h:1,
 from foo.h:1,
 from foo.h:1,
 from foo.h:1,
 from foo.h:1,
 from foo.h:1,
 from foo.h:1,
 from foo.h:1:
foo.h:1:17: error: #include nested too deeply
$ 

So, right, the best thing you can do is set up a nesting limit.  However,
to me, the fact that manifests can include manifests is not the least
surprising thing.
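
A nesting limit of the kind cpp(1) applies can be sketched like this.  It is
a toy model, not rpm code: the in-memory table, the entry names and
MAX_NESTING are all assumptions made for illustration.  An entry without
members stands for a package, an entry with members for a manifest.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NESTING 8   /* hypothetical limit, in the spirit of cpp's */
#define MAX_MEMBERS 4

struct entry {
    const char *name;
    const char *members[MAX_MEMBERS];  /* NULL-terminated; empty = package */
};

/* Toy "file system" used by the example. */
static const struct entry demo_tab[] = {
    { "a.rpm",      { NULL } },
    { "b.rpm",      { NULL } },
    { "ok.list",    { "a.rpm", "b.rpm", NULL } },
    { "outer.list", { "ok.list", "a.rpm", NULL } },
    { "loop.list",  { "loop.list", NULL } },   /* includes itself */
};
#define DEMO_N (sizeof demo_tab / sizeof demo_tab[0])

static const struct entry *lookup(const struct entry *tab, size_t n,
                                  const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].name, name) == 0)
            return &tab[i];
    return NULL;
}

/* Returns the number of packages reachable from `name`, or -1 if an
 * entry is missing or manifests are nested deeper than MAX_NESTING. */
static int expand(const struct entry *tab, size_t n, const char *name,
                  int depth)
{
    const struct entry *e;
    int total = 0;

    assert(tab != NULL && name != NULL);
    e = lookup(tab, n, name);
    if (e == NULL)
        return -1;
    if (e->members[0] == NULL)
        return 1;               /* a plain package */
    if (depth >= MAX_NESTING)
        return -1;              /* nested too deeply */

    for (size_t i = 0; i < MAX_MEMBERS && e->members[i] != NULL; i++) {
        int sub = expand(tab, n, e->members[i], depth + 1);
        if (sub < 0)
            return -1;
        total += sub;
    }
    return total;
}
```

With this table, expand(demo_tab, DEMO_N, "outer.list", 0) counts three
packages, while the self-including "loop.list" is cut off by the limit and
reports an error instead of recursing forever.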




Re: rpm infinite recursions using manifests

2008-09-07 Thread Alexey Tourbin
On Sun, Sep 07, 2008 at 02:22:26PM -0400, Jeff Johnson wrote:
> >Forbid manifest files from within manifests.
> Forbid manifests entirely is a similarly Draconian solution.

I expect manifests to have list semantics, not #include semantics.

> I want an acceptable general solution, forbidding solves no  
> engineering problem.




Re: rpm infinite recursions using manifests

2008-09-07 Thread Alexey Tourbin
On Sun, Sep 07, 2008 at 12:10:18PM -0400, Jeff Johnson wrote:
> There's a class of infinite recursion problems with manifests used
> on the rpm CLI that I don't know to fix.
> 
> A manifest is a file containing a list of paths to packages (or other  
> manifests)

Forbid manifest files from within manifests.




lua bindings for rpmdb and header

2008-09-04 Thread Alexey Tourbin
On Thu, Sep 04, 2008 at 05:08:21PM -0400, Jeff Johnson wrote:
> >Also, it is necessary to provide lua bindings for 1) retrieving
> >headers by instance, perhaps something like "h = getAddedHeader(num)"
> >and "h = getRemovedHeader(num)"; and 2) retrieving header entries,
> >e.g. "h.filenames".

BTW, back in 2005 I wrote a file luarpm.c, which provides lua bindings to
rpm, akin to the old Perl-RPM.  I don't remember many details, but perhaps
it was for lua-5.0, and perhaps it even worked.

#include 
#include 

#include 
#include 

#include 
#include 
#include 

static const char RPM_Database[] = "RPM_Database";
static const char RPM_Header[] = "RPM_Header";

static int luaRPM_dbopen(lua_State *L)
{
    char *prefix = NULL;
    int mode = O_RDONLY;
    int perms = 0;
    rpmdb db;

    if (rpmdbOpen(prefix, &db, mode, perms) != 0) {
        lua_pushnil(L);
        lua_pushstring(L, "rpmdbOpen failed");
        return 2;
    }

    void *ptr = lua_newuserdata(L, sizeof db);
    memmove(ptr, &db, sizeof db);
    luaL_getmetatable(L, RPM_Database);
    lua_setmetatable(L, -2);
    return 1;
}

static int luaRPM_fopen(lua_State *L)
{
    const char *fname = luaL_checkstring(L, 1);
    FD_t fd = Fopen(fname, "r");

    if (fd == 0) {
        lua_pushnil(L);
        lua_pushstring(L, strerror(errno));
        lua_pushstring(L, "open");
        return 3;
    }

    Header hdr;
    rpmRC rc = rpmReadPackageHeader(fd, &hdr, 0, 0, 0);
    Fclose(fd);

    if (rc != RPMRC_OK) {
        lua_pushnil(L);
        lua_pushstring(L, "rpmReadPackageHeader failed");
        lua_pushstring(L, "init");
        return 3;
    }

    void *ptr = lua_newuserdata(L, sizeof hdr);
    memmove(ptr, &hdr, sizeof hdr);
    luaL_getmetatable(L, RPM_Header);
    lua_setmetatable(L, -2);
    return 1;
}

static int db_find(lua_State *L, int tag)
{
    rpmdb db = *(rpmdb *)luaL_checkudata(L, 1, RPM_Database);
    luaL_argcheck(L, db != NULL, 1, "RPM_Database expected");

    const char *name = luaL_checkstring(L, 2);

    rpmdbMatchIterator mi = rpmdbInitIterator(db, tag, name, 0);
    Header hdr;
    int n = 0;

    lua_newtable(L);

    while ((hdr = rpmdbNextIterator(mi)) != NULL) {
        headerLink(hdr);
        lua_pushnumber(L, ++n);
        void *ptr = lua_newuserdata(L, sizeof hdr);
        memmove(ptr, &hdr, sizeof hdr);
        luaL_getmetatable(L, RPM_Header);
        lua_setmetatable(L, -2);
        lua_settable(L, -3);
    }

    rpmdbFreeIterator(mi);
    return 1;
}

static int luaRPM_db_pkg(lua_State *L)
{
    db_find(L, RPMTAG_NAME);
    lua_pushnumber(L, 1);
    lua_gettable(L, -2);
    lua_remove(L, -2);
    return 1;
}

static int luaRPM_db_whatrequires(lua_State *L)
{
    return db_find(L, RPMTAG_REQUIRENAME);
}

static int luaRPM_db_whatprovides(lua_State *L)
{
    return db_find(L, RPMTAG_PROVIDENAME);
}

static int tag2num(const char *tag)
{
    int i;
    for (i = 0; i < rpmTagTableSize; i++)
        if (strcasecmp(tag, rpmTagTable[i].name + 7) == 0)
            return rpmTagTable[i].val;
    return -1;
}

static int luaRPM_hdr_tag(lua_State *L)
{
    Header hdr = *(Header *)luaL_checkudata(L, 1, RPM_Header);
    const char *key = luaL_checkstring(L, 2);
    int tag = tag2num(key);

    if (tag < 0) {
        lua_pushnil(L);
        lua_pushfstring(L, "unknown tag: %s", key);
        return 2;
    }

    int type, size;
    char *data;

    if (headerGetEntry(hdr, tag, &type, (void**)&data, &size) == 0) {
        lua_pushnil(L);
        lua_pushfstring(L, "no tag in header: %s", key);
        return 2;
    }

    switch (type) {
    case RPM_NULL_TYPE:
        lua_pushnil(L);
        return 1;
    case RPM_BIN_TYPE:
        lua_pushlstring(L, data, size);
        return 1;
    }

    lua_newtable(L);

    int i;

    switch (type) {
    case RPM_CHAR_TYPE: {
        char *ptr;
        for (ptr = (char *)data, i = 0; i < size; i++, ptr++) {
            lua_pushnumber(L, i + 1);
            lua_pushlstring(L, ptr, 1);
            lua_settable(L, -3);
        }
        break;
    }
    case RPM_INT8_TYPE: {
        int_8 *ptr;
        for (ptr = (int_8 *)data, i = 0; i < size; i++, ptr++) {
            lua_pushnumber(L, i + 1);
            lua_pushnumber(L, *ptr & 0xff);
            lua_settable(L, -3);
        }
        break;
    }
    case RPM_INT16_TYPE: {
        int_16 *ptr;

Re: modular %posttrans-like scripts

2008-09-04 Thread Alexey Tourbin
On Wed, Aug 27, 2008 at 09:13:18AM -0400, Jeff Johnson wrote:
> There's the Mandriva solution, called "file triggers", to the cache update
> problem in lib/filetriggers.c. I dislike several things with the specific
> Mandriva implementation, but the idea is closest to being generally useful
> IMHO.

lib/transaction.c:
    if ((rpmtsFlags(ts) & _noTransTriggers) != _noTransTriggers)
        rpmRunFileTriggers(rpmtsRootDir(ts));

Perhaps I need a more general *mechanism* which can implement file
triggers as a site/vendor *policy*, and which is not itself limited
to file triggers.

-   rpmRunFileTriggers(rpmtsRootDir(ts));
+   rpmRunSitePosttrans(ts);

In rpmRunSitePosttrans, what could be done is to give
a lua script the ability to access installed and removed headers.

That is, rpmRunSitePosttrans can call lua script
/usr/lib/rpm/posttrans.lua with basically two arguments:
the list of removed package instance numbers, and the list
of added instance numbers (in the transaction ts).

Also, it is necessary to provide lua bindings for 1) retrieving
headers by instance, perhaps something like "h = getAddedHeader(num)"
and "h = getRemovedHeader(num)"; and 2) retrieving header entries,
e.g. "h.filenames".

Then the whole notion of "posttrans file triggers" or whatever posttrans
triggers can be implemented in that posttrans.lua script.

Now, it is easy to retrieve headers of added packages, but it is a tricky
question whether I can access headers of removed packages, i.e. the headers
that have been removed from rpmdb while the transaction ts is not finished
yet.  There seems to be a temporary RPMDBI_REMOVED database, but I am not
sure how it is supposed to work.




tagNum and fpNum

2008-08-29 Thread Alexey Tourbin
rpmdb/rpmdb.h:
struct _dbiIndexItem {
    rpmuint32_t hdrNum;     /*!< header instance in db */
    rpmuint32_t tagNum;     /*!< tag index in header */
    rpmuint32_t fpNum;      /*!< finger print index */
};

Please explain what tagNum is and how it is used.
(I just need to understand the code better.)

Here is some reverse engineering (against an older rpmdb,
actually created with rpm-4.0.4+).
$ ./rpm -qa --qf '%{NAME}\t%{DBINSTANCE}\n' |grep -w perl-base
perl-base   1068
$

Package perl-base is instance #1068.

$ rpm -q --qf '[%{BASENAMES}\t%{FILENAMES}\n]' perl-base |cat -n |awk '{$1--}($2=="perl5")'
0 perl5 /etc/perl5
2 perl5 /usr/bin/perl5
9 perl5 /usr/lib/perl5
$

Package perl-base has 3 entries for the "perl5" basename.

$ perl -MDB_File -le 'tie %db, "DB_File", "/var/lib/rpm/Basenames", 0 and print join ",", unpack "I*", $db{perl5}'
707,165,775,0,800,14,1068,0,1068,2,1068,9
$

We can see (1068,0), (1068,2), and (1068,9) pairs.
Okay, perhaps I can understand what tagNum is.

What is fpNum then?
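
The reading of the Basenames dump above can be checked mechanically.  The
sketch below assumes what the dump suggests for this older rpmdb: each index
record is a flat array of 32-bit words holding (hdrNum, tagNum) pairs, with
no fpNum word.  The helper names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The words unpacked from the "perl5" key in the dump above. */
static const uint32_t demo_raw[] = {
    707, 165,  775, 0,  800, 14,  1068, 0,  1068, 2,  1068, 9
};
#define DEMO_NRAW (sizeof demo_raw / sizeof demo_raw[0])

/* Counts the (hdrNum, tagNum) pairs that belong to one header instance. */
static size_t count_for_header(const uint32_t *raw, size_t nraw,
                               uint32_t hdrNum)
{
    size_t n = 0;

    assert(nraw % 2 == 0);      /* words come in pairs */
    for (size_t i = 0; i + 1 < nraw; i += 2)
        if (raw[i] == hdrNum)
            n++;
    return n;
}

/* Returns the k-th (0-based) tagNum stored for hdrNum, or -1 if absent. */
static long nth_tagnum(const uint32_t *raw, size_t nraw, uint32_t hdrNum,
                       size_t k)
{
    for (size_t i = 0; i + 1 < nraw; i += 2) {
        if (raw[i] != hdrNum)
            continue;
        if (k == 0)
            return (long) raw[i + 1];
        k--;
    }
    return -1;
}
```

For instance #1068 this yields three pairs with tagNum 0, 2 and 9, which
are exactly the positions of the three "perl5" basenames in the perl-base
file list shown earlier.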




Re: damaged headers are due to FILESTATES RPM_CHAR_TYPE

2008-08-29 Thread Alexey Tourbin
On Wed, Aug 27, 2008 at 12:55:06PM -0400, Jeff Johnson wrote:
> >$ ./rpm -q -vvv --whatprovides /a
> >D: opening  db index   /var/lib/rpm/Packages rdonly mode=0x0
> >D: locked   db index   /var/lib/rpm/Packages
> >D: opening  db index   /var/lib/rpm/Basenames rdonly mode=0x0
> >error: rpmdb: damaged header #625 retrieved -- skipping.
> >error: rpmdb: damaged header #625 retrieved -- skipping.
> >D: opening  db index   /var/lib/rpm/Providename rdonly mode=0x0
> >file /a: No such file or directory
> >D: closed   db index   /var/lib/rpm/Providename
> >D: closed   db index   /var/lib/rpm/Basenames
> >D: closed   db index   /var/lib/rpm/Packages
> >$ ./rpm -q -vvv --whatprovides /bin/cat
> >D: opening  db index   /var/lib/rpm/Packages rdonly mode=0x0
> >D: locked   db index   /var/lib/rpm/Packages
> >D: opening  db index   /var/lib/rpm/Basenames rdonly mode=0x0
> >error: rpmdb: damaged header #90 retrieved -- skipping.
> >error: rpmdb: damaged header #90 retrieved -- skipping.
> >error: rpmdb: damaged header #531 retrieved -- skipping.
> >error: rpmdb: damaged header #585 retrieved -- skipping.
> >error: rpmdb: damaged header #1101 retrieved -- skipping.
> >error: rpmdb: damaged header #1173 retrieved -- skipping.
> >D: opening  db index   /var/lib/rpm/Providename rdonly mode=0x0
> >file /bin/cat is not owned by any package
> >D: closed   db index   /var/lib/rpm/Providename
> >D: closed   db index   /var/lib/rpm/Basenames
> >D: closed   db index   /var/lib/rpm/Packages
> >$
> >
> >(I don't know why --whatprovides is special; simple -q queries
> >without join-key lookup work fine.)
> >
> 
> Hmmm, there is nothing special abt --whatprovides that should wander
> into code that depends on RPM_MIN_TYPE (unless I'm missing something).
> 
> RPMTAG_FILESTATES is the only tag that was ever type'd as RPM_CHAR_TYPE
> iirc. And here are the uses, none of the code is directly on a query  
> code path (this is HEAD):

There is code that handles this case.

rpmdb/header_internal.c:
int headerVerifyInfo(rpmuint32_t il, rpmuint32_t dl, const void * pev,
                     void * iv, int negate)
{
    /[EMAIL PROTECTED]@*/
    entryInfo pe = (entryInfo) pev;
    /[EMAIL PROTECTED]@*/
    entryInfo info = iv;
    rpmuint32_t i;

(The following line was added by me.)
    fprintf(stderr, "headerVerifyInfo\n");

    for (i = 0; i < il; i++) {
        info->tag = (rpmuint32_t) ntohl(pe[i].tag);
        info->type = (rpmuint32_t) ntohl(pe[i].type);
        /* XXX Convert RPMTAG_FILESTATE to RPM_UINT8_TYPE. */
        if (info->tag == 1029 && info->type == 1) {
            info->type = RPM_UINT8_TYPE;
            pe[i].type = (rpmuint32_t) htonl(info->type);
        }

Now, for some reason, certain rpmquery calls do trigger a headerVerifyInfo
call, and others do not.

$ ./rpm -q --qf '%{NAME}\n' coreutils  
headerVerifyInfo
headerVerifyInfo
headerVerifyInfo
headerVerifyInfo
headerVerifyInfo
headerVerifyInfo
headerVerifyInfo
headerVerifyInfo
headerVerifyInfo
coreutils
$ ./rpm -q --qf '%{NAME}\n' --whatprovides /bin/cat
error: rpmdb: damaged header #90 retrieved -- skipping.
error: rpmdb: damaged header #90 retrieved -- skipping.
error: rpmdb: damaged header #531 retrieved -- skipping.
error: rpmdb: damaged header #585 retrieved -- skipping.
error: rpmdb: damaged header #1101 retrieved -- skipping.
error: rpmdb: damaged header #1173 retrieved -- skipping.
file /bin/cat is not owned by any package
$

There are 9 headerVerifyInfo() calls with '-q' and not a single
headerVerifyInfo() call with '-q --whatprovides'.
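
Why the order of the two steps matters can be shown with a toy model.  The
tag number 1029, the old type 1 and the new minimum type 2 come from the
listings in this thread; DEMO_MAX_TYPE and the assumption that the type
check is a plain range test are mine, and none of the names below are rpm
API.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_TAG_FILESTATES 1029u  /* tag number from the listing above */
#define DEMO_CHAR_TYPE      1u     /* old type of FILESTATES */
#define DEMO_UINT8_TYPE     2u     /* new type, also the new minimum */
#define DEMO_MIN_TYPE       2u
#define DEMO_MAX_TYPE       9u     /* hypothetical upper bound */

/* The conversion done in headerVerifyInfo: old FILESTATES become uint8. */
static uint32_t normalize_type(uint32_t tag, uint32_t type)
{
    if (tag == DEMO_TAG_FILESTATES && type == DEMO_CHAR_TYPE)
        return DEMO_UINT8_TYPE;
    return type;
}

/* Assumed shape of the type check: a simple range test. */
static int type_in_range(uint32_t type)
{
    return type >= DEMO_MIN_TYPE && type <= DEMO_MAX_TYPE;
}

/* Returns 1 if the entry passes.  verify_first models a code path that
 * runs the conversion before the range check; 0 models one that skips it. */
static int entry_accepted(uint32_t tag, uint32_t type, int verify_first)
{
    assert(verify_first == 0 || verify_first == 1);
    if (verify_first)
        type = normalize_type(tag, type);
    return type_in_range(type);
}
```

Under these assumptions an old FILESTATES entry is accepted on a path that
converts first and rejected on a path that goes straight to the range
check, which would match the difference between plain '-q' and
'-q --whatprovides' seen above.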




Re: damaged headers are due to FILESTATES RPM_CHAR_TYPE

2008-08-27 Thread Alexey Tourbin
On Wed, Aug 27, 2008 at 03:36:26PM -0400, Jeff Johnson wrote:
> >Is there any benefit for them to join rpm5.org?  There must be a good
> >reason.  There's nothing more than that.  (They might know that there's
> >a good reason, or they might know that there is not.  This is rather
> >non-political argument.)
> 
> Asking me what their reason(s) are is pointless. All I know is I asked,
> and saw several SuSE developers at both FOSDEM in February and OLS
> last month.
> 
> There's lots that is dead-on with the OpenSUSE build system. OTOH,
> some of the compatibility issues integrating "weak dependencies" and
> the other RFE's at
> http://wiki.rpm.org/Problems_of_Building
> will be tricky, not all of the mentioned problems are properly  
> formulated yet imho.

To me, weak dependencies are bullshit (that's straight, yeah).
Either there is a dependency, or there is not.  BTW, SuSE paper
on their smart SAT solver says the SAT solver can't handle weak
dependencies anyway.

(This does not mean I object to weak dependencies.  Let them be.
I just see them as a sort of some special comments.)

(Also, to me, rpm is not a "deus ex machina", by which I mean that
rpm is not supposed to resolve complicated dependency stuff.  There
must be external solvers associated with external rpm repositories,
such as apt-get or yum.  In other words, I see librpm as a simple
yes-or-no thing.)

> >Well, something what, basically I like rpm5.org because things are
> >straight and there's no any kind of political endorsement or whatever.
> >I can hack as long as I like and as long as I don't break too much.
> >This also means that I can contribute back and forth (even if I don't
> >like some of rpm5 new pieces.)  Perhaps this is a good reason for
> >developers to join.
> 
> ;-) Basically why I joined @rpm5.org too.

Let's gather more developers then.  I would rather say hackers,
unencumbered with political stuff or something.  People who want
to hack should join.  Perhaps the worst thing that can happen is
that their code is ifdeffed.

Honestly, there's another approach: don't touch that "rpm" thing
unless it breaks; that's probably okay, too.

> Seriously, rpm "make check" is starting to congeal sufficiently
> that I'm way less worried about undetected breakage like
> RPM_CHAR_TYPE. Yes better "legacy" checks need to be added.




Re: damaged headers are due to FILESTATES RPM_CHAR_TYPE

2008-08-27 Thread Alexey Tourbin
On Wed, Aug 27, 2008 at 01:44:15PM -0400, Jeff Johnson wrote:
> >Anyway, perhaps you should try to persuade SuSE guys that rpm5 is
> >beneficial for them.  That could be a big win.
> 
> All the SuSE guys were invited @rpm5.org last October. None bothered to
> even acknowledge the invitation. They are certainly still welcome  
> @rpm5.org.

Is there any benefit for them to join rpm5.org?  There must be a good
reason.  There's nothing more than that.  (They might know that there's
a good reason, or they might know that there is not.  This is rather
non-political argument.)

Well, something what, basically I like rpm5.org because things are
straight and there's no political endorsement of any kind or whatever.
I can hack as long as I like and as long as I don't break too much.
This also means that I can contribute back and forth (even if I don't
like some of the new rpm5 pieces.)  Perhaps this is a good reason for
developers to join.




Re: damaged headers are due to FILESTATES RPM_CHAR_TYPE

2008-08-27 Thread Alexey Tourbin
On Wed, Aug 27, 2008 at 12:55:06PM -0400, Jeff Johnson wrote:
> >This apparently means that rpm5 is not that widely used.
> >Perhaps you should call for yet more major distributions.
> 
> I'd agree that rpm5 is not widely used on "legacy" distributions.

Face it, people use what they are given, and what they are given is
something like rpm-4.4.2+.

> I'm more interested in best possible engineering than with
> widest possible usage atm for rpm-5.x. Better engineering
> will eventually be adopted is my guess, and I most definitely
> do not want the burden of support for rpm everywhere, I'm
> just one guy who wants his life back.

(aside) You seem to use colloquial English which is not always easy
to grasp.  What is "one guy who wants his life back" supposed to mean
with respect to rpm5.org?

Anyway, perhaps you should try to persuade SuSE guys that rpm5 is
beneficial for them.  That could be a big win.




Re: damaged headers are due to FILESTATES RPM_CHAR_TYPE

2008-08-27 Thread Alexey Tourbin
On Wed, Aug 27, 2008 at 09:16:50AM -0400, Jeff Johnson wrote:
> >Damaged headers are due to FILESTATES from older rpmdb.
> >
> >rpmdb/header.c (regionSwab):
> >   522  for (; il > 0; il--, pe++) {
> >   523  struct indexEntry_s ie;
> >   524  rpmTagType type;
> >   525
> >   526  ie.info.tag = (rpmuint32_t) ntohl(pe->tag);
> >   527  ie.info.type = (rpmuint32_t) ntohl(pe->type);
> >   528  ie.info.count = (rpmuint32_t) ntohl(pe->count);
> >   529  ie.info.offset = (rpmint32_t) ntohl(pe->offset);
> >   530  assert(ie.info.offset >= 0);/* XXX insurance */
> >   531
> >Bails out right here:
> >   532  if (hdrchkType(ie.info.type))
> >   533  return 0;
> >   534  if (hdrchkData(ie.info.count))
> >   535  return 0;
> >   536  if (hdrchkData(ie.info.offset))
> >   537  return 0;
> >   538  if (hdrchkAlign(ie.info.type, ie.info.offset))
> >   539  return 0;
> >
> >Older FILESTATES have type RPM_CHAR_TYPE (= 1), and new value
> >for RPM_MIN_TYPE is 2, which is RPM_UINT8_TYPE.
> 
> Nice catch! Changing RPM_MIN_TYPE back to 1 is the obvious fix.
> 
> However, I do wonder why this has not been reported before. AFAICT
> the issue should have been very very loud and obvious.

This apparently means that rpm5 is not that widely used.  
Perhaps you should call for yet more major distributions.

> What was the full calling context where the problem was seen?

It goes like this:

$ ./rpm -q -vvv --whatprovides /a
D: opening  db index   /var/lib/rpm/Packages rdonly mode=0x0
D: locked   db index   /var/lib/rpm/Packages
D: opening  db index   /var/lib/rpm/Basenames rdonly mode=0x0
error: rpmdb: damaged header #625 retrieved -- skipping.
error: rpmdb: damaged header #625 retrieved -- skipping.
D: opening  db index   /var/lib/rpm/Providename rdonly mode=0x0
file /a: No such file or directory
D: closed   db index   /var/lib/rpm/Providename
D: closed   db index   /var/lib/rpm/Basenames
D: closed   db index   /var/lib/rpm/Packages
$ ./rpm -q -vvv --whatprovides /bin/cat
D: opening  db index   /var/lib/rpm/Packages rdonly mode=0x0
D: locked   db index   /var/lib/rpm/Packages
D: opening  db index   /var/lib/rpm/Basenames rdonly mode=0x0
error: rpmdb: damaged header #90 retrieved -- skipping.
error: rpmdb: damaged header #90 retrieved -- skipping.
error: rpmdb: damaged header #531 retrieved -- skipping.
error: rpmdb: damaged header #585 retrieved -- skipping.
error: rpmdb: damaged header #1101 retrieved -- skipping.
error: rpmdb: damaged header #1173 retrieved -- skipping.
D: opening  db index   /var/lib/rpm/Providename rdonly mode=0x0
file /bin/cat is not owned by any package
D: closed   db index   /var/lib/rpm/Providename
D: closed   db index   /var/lib/rpm/Basenames
D: closed   db index   /var/lib/rpm/Packages
$ 

(I don't know why --whatprovides is special; simple -q queries
without join-key lookup work fine.)

Actually it was an infinite loop with ever increasing mi->mi_setx
and all signals blocked, so I thought that first I had to fix that
infinite loop.  And then it was some printf-style debugging; the
surprising thing is that printf debugging sometimes goes faster than
gdb breakpoints etc.




damaged headers are due to FILESTATES RPM_CHAR_TYPE

2008-08-27 Thread Alexey Tourbin
Damaged headers are due to FILESTATES from older rpmdb.

rpmdb/header.c (regionSwab):
    for (; il > 0; il--, pe++) {
        struct indexEntry_s ie;
        rpmTagType type;

        ie.info.tag = (rpmuint32_t) ntohl(pe->tag);
        ie.info.type = (rpmuint32_t) ntohl(pe->type);
        ie.info.count = (rpmuint32_t) ntohl(pe->count);
        ie.info.offset = (rpmint32_t) ntohl(pe->offset);
        assert(ie.info.offset >= 0);    /* XXX insurance */

Bails out right here:
        if (hdrchkType(ie.info.type))
            return 0;
        if (hdrchkData(ie.info.count))
            return 0;
        if (hdrchkData(ie.info.offset))
            return 0;
        if (hdrchkAlign(ie.info.type, ie.info.offset))
            return 0;

Older FILESTATES have type RPM_CHAR_TYPE (= 1), and the new value
for RPM_MIN_TYPE is 2, which is RPM_UINT8_TYPE.




modular %posttrans-like scripts

2008-08-26 Thread Alexey Tourbin
I need modular and configurable %posttrans-like scripts.
Examples:
1) automatic run of update-menus, if package has /usr/lib/menu/*
2) update-desktop-database -- /usr/share/applications/*.desktop
3) gtk-update-icon-cache -- /usr/share/icons/hicolor/*

I mean that, for common tasks, I do not want packagers to write
their %post-like scripts at all.  Simple cache updates should
be handled by rpm itself.
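
What I have in mind can be sketched as a table that maps file patterns to
cache-update commands.  This is only an illustration built on POSIX
fnmatch(3); the rule table and the function names are hypothetical and do
not correspond to anything in the rpm tree.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>
#include <string.h>

/* One rule: if a package file matches `glob`, `command` should be run. */
struct path_rule {
    const char *glob;
    const char *command;
};

/* The three examples from the list above. */
static const struct path_rule demo_rules[] = {
    { "/usr/lib/menu/*",                   "update-menus" },
    { "/usr/share/applications/*.desktop", "update-desktop-database" },
    { "/usr/share/icons/hicolor/*",        "gtk-update-icon-cache" },
};

/* Returns the command triggered by `path`, or NULL if no rule matches.
 * Flags are 0, so '*' also matches '/' and covers subdirectories. */
static const char *action_for_path(const char *path)
{
    assert(path != NULL);
    for (size_t i = 0; i < sizeof demo_rules / sizeof demo_rules[0]; i++)
        if (fnmatch(demo_rules[i].glob, path, 0) == 0)
            return demo_rules[i].command;
    return NULL;
}

/* Returns 1 if `path` triggers exactly `command`. */
static int path_triggers(const char *path, const char *command)
{
    const char *action = action_for_path(path);
    return action != NULL && strcmp(action, command) == 0;
}
```

A post-transaction step would walk the file lists of the added and removed
packages, collect the distinct commands returned by action_for_path(), and
run each of them once.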

Now, is there something like this in rpm5.org, to start looking at?




mi->mi_offset == mi->mi_prevoffset

2008-08-24 Thread Alexey Tourbin
rpmdb/rpmdb.c (rpmdbNextIterator):
  2418  /* If next header is identical, return it now. */
  2419  /[EMAIL PROTECTED] -refcounttrans -retalias -retexpose -usereleased @*/
  2420  if (mi->mi_prevoffset && mi->mi_offset == mi->mi_prevoffset)
  2421  return mi->mi_h;
  2422  /[EMAIL PROTECTED] =refcounttrans =retalias =retexpose =usereleased @*/

It looks like the condition never holds, because: 1) when doing mi_set,
each set element should be different; 2) when doing DB_NEXT, each header
instance should be different as well.




rpmdbNextIterator infinite loop

2008-08-24 Thread Alexey Tourbin
rpmdb/rpmdb.c (rpmdbNextIterator):
  2381  top:
  2382  uh = NULL;
  2383  uhlen = 0;
  2384  
  2385  do {
  2386  union _dbswap mi_offset;
  2387  
  2388  if (mi->mi_set) {
  2389  if (!(mi->mi_setx < mi->mi_set->count))
  2390  return NULL;
  2391  mi->mi_offset = dbiIndexRecordOffset(mi->mi_set, mi->mi_setx);
  2392  mi->mi_filenum = dbiIndexRecordFileNumber(mi->mi_set, mi->mi_setx);
  2393  mi_offset.ui = mi->mi_offset;
  2394  if (dbiByteSwapped(dbi) == 1)
  2395  _DBSWAP(mi_offset);
  2396  keyp = &mi_offset;
  2397  keylen = sizeof(mi_offset.ui);
  2398  } else {
* 2399  key->data = (void *)mi->mi_keyp;
  2400  key->size = (UINT32_T) mi->mi_keylen;
  2401  data->data = uh;
  2402  data->size = (UINT32_T) uhlen;
  2403  #if !defined(_USE_COPY_LOAD)
  2404  data->flags |= DB_DBT_MALLOC;
  2405  #endif
  2406  rc = dbiGet(dbi, mi->mi_dbc, key, data,
  2407  (key->data == NULL ? DB_NEXT : DB_SET));
  2408  data->flags = 0;
  2409  keyp = key->data;
  2410  keylen = key->size;
  2411  uh = data->data;
  2412  uhlen = data->size;
  2413  
  2414  /*
  2415   * If we got the next key, save the header instance number.
  2416   *
  2417   * For db3 Packages, instance 0 (i.e. mi->mi_setx == 0) is the
  2418   * largest header instance in the database, and should be
  2419   * skipped.
  2420   */
  2421  if (keyp && mi->mi_setx && rc == 0) {
  2422  memcpy(&mi_offset, keyp, sizeof(mi_offset.ui));
  2423  if (dbiByteSwapped(dbi) == 1)
  2424  _DBSWAP(mi_offset);
  2425  mi->mi_offset = (unsigned) mi_offset.ui;
  2426  }
  2427  
  2428  /* Terminate on error or end of keys */
  2429  /[EMAIL PROTECTED]@*/
  2430  if (rc || (mi->mi_setx && mi->mi_offset == 0))
  2431  return NULL;
  2432  /[EMAIL PROTECTED]@*/
  2433  #ifdef  REFERENCE
  2434  if (mi->mi_offset & 0x) {
  2435  fprintf(stderr, "*** damaged key 0x%x reset to 0\n", mi->mi_offset);
  2436  mi->mi_offset = 0;
  2437  }
  2438  #endif
  2439  }
  2440  mi->mi_setx++;
  2441  } while (mi->mi_offset == 0);
...
  2532  if (mi->mi_h == NULL || !headerIsEntry(mi->mi_h, RPMTAG_NAME)) {
  2533  rpmlog(RPMLOG_ERR,
  2534  _("rpmdb: damaged header #%u retrieved -- skipping.\n"),
  2535  mi->mi_offset);
* 2536  goto top;
  2537  }

This code is subject to an infinite loop.
Consider how it is called:

mi = rpmdbInitIterator(db, RPMDBI_PACKAGES, hdrNum, sizeof(hdrNum));
h = rpmdbNextIterator(mi);

and consider that the header identified with hdrNum is damaged.
What happens then?  In line 2536, you "goto top".  And on top,
mi->mi_set is empty (because it is not an index-to-join-key lookup),
and key->data is hdrNum.  This means that dbiGet gets called with
DB_SET, you get the same damaged header again, and so on.

(I'll try to fix this but I'm not sure what's the best way to fix
this yet.)
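
One possible shape of a fix, shown on a toy iterator rather than on the
real rpmdb code: retrying after a damaged record is only safe when the
retry advances to a different record.  An exact-key lookup (the DB_SET
case) can yield at most one record, so it must give up instead of going
back to the top.  All names below are invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>

struct toy_iter {
    const int *damaged;  /* damaged[i] != 0: record i cannot be loaded */
    size_t nrec;
    int keyed;           /* nonzero: exact-key lookup, like DB_SET */
    size_t key;          /* record number for a keyed lookup */
    int key_done;        /* keyed lookup already consumed its one try */
    size_t next;         /* cursor for a sequential scan, like DB_NEXT */
};

/* Returns the next good record number, or -1 when there is none. */
static long toy_next(struct toy_iter *it)
{
    assert(it != NULL);
    for (;;) {
        size_t rec;

        if (it->keyed) {
            if (it->key_done)
                return -1;      /* do not fetch the same key again */
            it->key_done = 1;
            rec = it->key;
        } else {
            rec = it->next++;   /* every retry moves forward */
        }
        if (rec >= it->nrec)
            return -1;
        if (!it->damaged[rec])
            return (long) rec;
        /* damaged record: skip it and go back to the top */
    }
}

static const int demo_damaged[] = { 0, 1, 0, 1 };

/* Exact-key lookup of one record. */
static long toy_lookup(const int *damaged, size_t nrec, size_t key)
{
    struct toy_iter it = { damaged, nrec, 1, key, 0, 0 };
    return toy_next(&it);
}

/* Sequential scan: counts the records that can be loaded. */
static size_t toy_count_good(const int *damaged, size_t nrec)
{
    struct toy_iter it = { damaged, nrec, 0, 0, 0, 0 };
    size_t n = 0;

    while (toy_next(&it) >= 0)
        n++;
    return n;
}
```

Looking up the damaged record 1 returns -1 after a single attempt, while
the sequential scan still skips damaged records and finds the two good
ones.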




Re: RPM: rpm/rpmdb/ db3.c dbconfig.c rpmdb.c rpmdb.h

2008-08-18 Thread Alexey Tourbin
On Mon, Aug 18, 2008 at 12:20:36PM -0400, Jeff Johnson wrote:
> >  @@ -2878,6 +2878,7 @@
> >
> > if (db->db_tags != NULL)
> > for (dbix = 0; dbix < db->db_ndbi; dbix++) {
> >  +  dbiIndex dbi;

(Actually one of my previous changes moved these variables from the
outer scope to the inner loop scope.)

Note that dbi corresponds to dbix here, and this is the right scope
for dbi.  There's a temptation to reuse this dbi in another loop,
but that would be a different "dbi" for another purpose.

So, sometimes I ask, like, "what is this variable for?"  One approach
is to use more distinctive names, e.g. s/dbi_for_dbix_index_update/ (and
another would be dbi_to_fetch_max_header_instance_number).  Possibly a
better approach is still to use short names but within the inner scope.
The very scope can tell then what the variable is supposed to mean.

> > DBC * dbcursor = NULL;
> > DBT k = DBT_INIT;
> > DBT v = DBT_INIT;

Now note that k and v are also connected to "dbi" index (k has tag value
for dbi index and v is a set of header instances).  As soon as "dbi" goes
away, k and v have garbage.  They'd better go away together.

> >  @@ -2887,7 +2888,6 @@
> > rpmTag rpmtag = dbiTag->tag;
> > const char * dbiBN = (dbiTag->str != NULL
> > ? dbiTag->str : tagName(rpmtag));
> >  -  dbiIndex dbi;
> > rpmuint8_t * bin = NULL;
> > int i;
> >
> >  @@ -2939,7 +2939,6 @@
> > if (!xx)
> > continue;
> > /[EMAIL PROTECTED]@*/ break;
> >  -
> > }
> > 
> >   dbi = dbiOpen(db, he->tag, 0);

Note that dbiOpen happens to be at the very same scope/indentation as
dbi was first declared.  So is dbiClose, right before the bracket that
closes the scope of the loop.  The benefit, at least, is that you don't have
to set dbi to NULL so that it does not dangle throughout the rest of the
code.

> Be very careful re-scoping dbiIndex here as well. Or at least watch  
> for subtle flaws,
> valgrind should catch any flaws iirc.

I hope I am careful (well, I did not run valgrind, I just asked myself
questions, like, "what is this variable for").




Re: rpmdb.c blockSignals()

2008-08-18 Thread Alexey Tourbin
On Mon, Aug 18, 2008 at 09:13:14AM -0400, Jeff Johnson wrote:
> >Consider that you open two rpmdb databases simultaneously.
> >Signal handling is screwed, and after you close them both,
> >rpmsq handler is still installed, and "oact" is lost.
> 
> Screwed? signal handling reverts to the original state is/was the  
> intent.

When you call rpmdbOpen twice (without calling rpmdbClose before the
second time), rpmsqEnable will be called twice, and the original state
gets lost.

Okay, I know how to fix this.
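
One way to keep the original state across nested opens is a reference
count: save the old action only on the first enable and restore it only on
the last disable.  This is a minimal sketch with plain sigaction(2) and
SIGUSR1, not the rpmsq code and not necessarily the fix that ends up in the
tree; the demo_* names are invented.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static struct sigaction demo_oact;  /* action in place before first enable */
static int demo_refs;               /* number of outstanding enables */

static void demo_handler(int signum)
{
    (void) signum;
}

/* Installs the handler on the first call only; returns the new count. */
static int demo_enable(void)
{
    if (demo_refs == 0) {
        struct sigaction act;

        memset(&act, 0, sizeof act);
        act.sa_handler = demo_handler;
        sigemptyset(&act.sa_mask);
        if (sigaction(SIGUSR1, &act, &demo_oact) != 0)
            return -1;
    }
    return ++demo_refs;
}

/* Restores the saved action on the last call only; returns the new count. */
static int demo_disable(void)
{
    assert(demo_refs > 0);
    if (--demo_refs == 0 && sigaction(SIGUSR1, &demo_oact, NULL) != 0)
        return -1;
    return demo_refs;
}

/* Returns 1 if our handler is currently installed for SIGUSR1. */
static int demo_handler_installed(void)
{
    struct sigaction cur;

    if (sigaction(SIGUSR1, NULL, &cur) != 0)
        return -1;
    return cur.sa_handler == demo_handler;
}
```

With two enables followed by two disables, the handler stays installed
after the first disable and the original action is back after the second,
so "oact" is never overwritten by the nested call.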

> >Actually, if we consider rpmdb a *library*, rpmsq is a bad idea.
> >A library must not interfere with signal handling; it may only block
> >signals for a short period of time to perform its critical sections.
> >
> >However, from the "library" point of view, there is no chance
> >to shut down rpmdb gracefully, and rpmdbRock is then useless.
> 
> Yes having a signal handler in a library leads to many difficulties.
> 
> I can go through the design choices again again again, but having
> a means to handle state associated with using a Berkeley DB is most
> definitely MUSTHAVE.

Ideally, only blocking signals for critical sections (which is
dbiPut+dbiSync) should be enough.  If Berkeley DB was not shut
down gracefully, it might require only minor repair (e.g. removing
stale locks), which can be performed automatically.
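
Blocking signals only around a critical section can be sketched with
sigprocmask(2) alone, without installing any handler.  The signal list
mirrors the one in blockSignals(); run_critical() and the demo callback are
hypothetical names, and the callback merely stands in for dbiPut+dbiSync.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Returns 1 if SIGINT is blocked in the calling thread right now. */
static int sigint_blocked_now(void)
{
    sigset_t cur;

    if (sigprocmask(SIG_BLOCK, NULL, &cur) != 0)
        return -1;
    return sigismember(&cur, SIGINT);
}

/* Runs fn(arg) with the termination signals blocked, then restores the
 * previous mask.  Returns fn's result, or -1 if the mask could not be set. */
static int run_critical(int (*fn)(void *), void *arg)
{
    sigset_t block, old;
    int rc;

    assert(fn != NULL);
    sigemptyset(&block);
    sigaddset(&block, SIGINT);
    sigaddset(&block, SIGQUIT);
    sigaddset(&block, SIGHUP);
    sigaddset(&block, SIGTERM);
    sigaddset(&block, SIGPIPE);

    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;
    rc = fn(arg);                   /* the critical section */
    if (sigprocmask(SIG_SETMASK, &old, NULL) != 0)
        return -1;
    return rc;
}

/* Demo critical section: reports whether SIGINT is blocked while it runs. */
static int demo_section(void *arg)
{
    (void) arg;
    return sigint_blocked_now();
}

/* Returns 1 if the mask after run_critical() equals the mask before it. */
static int demo_mask_restored(void)
{
    int before = sigint_blocked_now();

    (void) run_critical(demo_section, NULL);
    return sigint_blocked_now() == before;
}
```

A signal that arrives inside the section stays pending and is delivered
once the old mask is restored, so the caller's own signal disposition is
left untouched.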




Re: rpmdb.c blockSignals()

2008-08-18 Thread Alexey Tourbin
On Sat, Aug 16, 2008 at 12:12:13PM -0400, Jeff Johnson wrote:
> >I think this is wrong -- with this change, blockSignals() now does
> >NOT block signals.  Note that e.g. db->put is neither atomic nor
> >reenterable (possibly cannot even close db if db->put is in progress).
> 
> Correct, signals are not blocked, they are caught by the rpmsq
> handler, with the patch applied. Assuming the code is still correct.
> 
> No exit is undertaken until the caught signal mask is tested. db->put
> is still atomic, and no re-entry is undertaken.
> 
> >However, I do not quite understand what rpmsq does.
> 
> SQ == Signal Queue.
> rpmsq registers, catches and delivers signals.

Oh, I see now.  I don't like rpmsq then.

Consider that you open two rpmdb databases simultaneously.
Signal handling is screwed, and after you close them both,
rpmsq handler is still installed, and "oact" is lost.

Actually, if we consider rpmdb a *library*, rpmsq is a bad idea.
A library must not interfere with signal handling; it may only block
signals for a short period of time to perform its critical sections.

However, from the "library" point of view, there is no chance
to shut down rpmdb gracefully, and rpmdbRock is then useless.




Re: rpmdb.c blockSignals()

2008-08-16 Thread Alexey Tourbin
On Sat, Aug 16, 2008 at 05:57:30AM -0400, Jeff Johnson wrote:
> 
> On Aug 16, 2008, at 4:12 AM, Jeff Johnson wrote:
> 
> >
> >Please do not change anything with blockSignals.
> >
> 
> Hmmm, I've neglected to supply sufficient details, and so
> it sounds like I'm forbidding better engineering. That was not my  
> intent.
> In fact, I'd much rather have better engineering than risk aversion in  
> RPM.
> 
> So let's say this patch is attempted:
> 
> @@ -822,7 +822,7 @@ static int blockSignals(/[EMAIL PROTECTED]@*/ rpm
>  (void) sigdelset(&newMask, SIGHUP);
>  (void) sigdelset(&newMask, SIGTERM);
>  (void) sigdelset(&newMask, SIGPIPE);
> -return sigprocmask(SIG_BLOCK, &newMask, NULL);
> +return sigprocmask(SIG_SETMASK, &newMask, NULL);
>  }
> 
>  /**
> 
> The net result is that rpm would run without blocking any
> of those explicitly mentioned signals, and ^C et al would be handled
> sooner. There are certain long running loops like rpm -qa whose

I think this is wrong -- with this change, blockSignals() now does
NOT block signals.  Note that e.g. db->put is neither atomic nor
reentrant (possibly cannot even close db if db->put is in progress).

However, I do not quite understand what rpmsq does.

> responsiveness might be improved. Note that how often the
> received signal mask is polled, not how long ^C is blocked,
> ultimately determines "responsiveness".
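
The effect of the one-word change in the quoted patch can be demonstrated
in isolation.  The function below replays the sequence of calls from
blockSignals() with either SIG_BLOCK or SIG_SETMASK in the final call and
reports whether SIGINT ends up blocked; sigint_blocked_after() is a name
made up for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Replays blockSignals() with `how` in the last sigprocmask() call.
 * Returns 1 if SIGINT is blocked afterwards, 0 if not.  The caller's
 * original mask is restored before returning. */
static int sigint_blocked_after(int how)
{
    sigset_t newMask, oldMask, cur;
    int blocked;

    assert(how == SIG_BLOCK || how == SIG_SETMASK);

    (void) sigfillset(&newMask);
    (void) sigprocmask(SIG_BLOCK, &newMask, &oldMask); /* all blocked */
    (void) sigdelset(&newMask, SIGINT);
    (void) sigdelset(&newMask, SIGQUIT);
    (void) sigdelset(&newMask, SIGHUP);
    (void) sigdelset(&newMask, SIGTERM);
    (void) sigdelset(&newMask, SIGPIPE);
    (void) sigprocmask(how, &newMask, NULL);

    (void) sigprocmask(SIG_BLOCK, NULL, &cur);         /* query only */
    blocked = sigismember(&cur, SIGINT);

    (void) sigprocmask(SIG_SETMASK, &oldMask, NULL);   /* restore */
    return blocked;
}
```

SIG_BLOCK adds the given set to the current mask, so the second call
changes nothing and SIGINT stays blocked.  SIG_SETMASK replaces the mask,
so the five deleted signals become deliverable again.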




damaged rpmdb headers

2008-08-16 Thread Alexey Tourbin
On Sat, Aug 16, 2008 at 04:12:49AM -0400, Jeff Johnson wrote:
> (aside) btw, I'm seeing issues with damaged (truncated is my guess,  
> not looked)
> headers retrieved from rpmdb on HEAD, rpm-5_1_4 is fine.

How do I reproduce?  At least simple *read-only* rpmquery works fine.

$ ./rpm -qa --qf '%{NAME}\n' >/dev/null
$ echo $?
0
$ 




mire

2008-08-16 Thread Alexey Tourbin
I wonder if mires are of any use.
Suppose I want to query provides by glob pattern 'perl(*)'.
The desired output is 4 columns

perl(...)   perl(...)-version   %{NAME} %{VERSION}

E.g. something like

$ rpm -qa --qf '%{PROVIDENAME}\t%{PROVIDEVERSION}\t%{NAME}\t%{VERSION}\n' |grep '^perl(' |head
perl(Unicode/UCD.pm)             0.250   perl-unicore                 5.8.8
perl(Text/Reform.pm)             1.012   perl-Text-Reform             1.12.2
perl(Locale/Maketext/Simple.pm)  0.160   perl-Locale-Maketext-Simple  0.16
perl(XML/Parser.pm)              2.360   perl-XML-Parser              2.36
perl(FreezeThaw.pm)              0.430   perl-FreezeThaw              0.43
perl(List/MoreUtils.pm)          0.220   perl-List-MoreUtils          0.22
perl(CGI.pm)                     3.390   perl-CGI                     3.39
perl(Thread.pm)                  2.010   perl-threads                 5.8.8
perl(DateTime/Format/Mail.pm)    0.300   perl-DateTime-Format-Mail    0.30
perl(Exception/Class.pm)         1.240   perl-Exception-Class         1.24
$ 

done with mire glob, not with external grep.
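
To make the request concrete, here is a minimal sketch of the matching
step done in-process.  It uses plain fnmatch(3) rather than the mire API
(whose calls I do not want to guess at), and count_glob_matches() is a
hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <fnmatch.h>
#include <stddef.h>

/*
 * Count how many names match a shell glob.
 * fnmatch(3) returns 0 on a match; '(' and ')' are ordinary characters.
 */
static size_t count_glob_matches(const char *pattern,
				 const char *const *names, size_t n)
{
    size_t i, hits = 0;

    for (i = 0; i < n; i++)
	if (fnmatch(pattern, names[i], 0) == 0)
	    hits++;
    return hits;
}
```

With flags set to 0 the '*' also matches '/', so 'perl(*)' covers
provides such as perl(XML/Parser.pm).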




rpmdb.c blockSignals()

2008-08-16 Thread Alexey Tourbin
rpmdb/rpmdb.c:
   808  /**
   809   * Block all signals, returning previous signal mask.
   810   * @param dbrpm database
   811   * @retval *oldMask previous sigset
   812   * @return  0 on success
   813   */
   814  static int blockSignals(/[EMAIL PROTECTED]@*/ rpmdb db, /[EMAIL 
PROTECTED]@*/ sigset_t * oldMask)
   815  /[EMAIL PROTECTED] fileSystem @*/
   816  /[EMAIL PROTECTED] *oldMask, fileSystem @*/
   817  {
   818  sigset_t newMask;
   819  
   820  (void) sigfillset(&newMask);/* block all signals */
   821  (void) sigprocmask(SIG_BLOCK, &newMask, oldMask);
   822  (void) sigdelset(&newMask, SIGINT);
   823  (void) sigdelset(&newMask, SIGQUIT);
   824  (void) sigdelset(&newMask, SIGHUP);
   825  (void) sigdelset(&newMask, SIGTERM);
   826  (void) sigdelset(&newMask, SIGPIPE);
   827  return sigprocmask(SIG_BLOCK, &newMask, NULL);
   828  }

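Why are the sigdelset() calls and the second sigprocmask() call necessary
here?  As far as I can see, they do nothing: SIG_BLOCK adds the given set
to the current mask, and everything is already blocked by the first call.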
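
The point can be checked with a small stand-alone sketch (not the rpmdb
code itself; sigint_blocked_after() is a hypothetical helper).  It repeats
the same sequence of calls and then asks whether SIGINT is still blocked,
once with SIG_BLOCK and once with SIG_SETMASK as the second "how".

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/*
 * Block everything, drop SIGINT from the set, then call
 * sigprocmask(how, ...) again, as blockSignals() does.
 * Returns 1 if SIGINT is still blocked afterwards, 0 if not, -1 on error.
 * The caller's signal mask is restored before returning.
 */
static int sigint_blocked_after(int how)
{
    sigset_t newMask, oldMask, curMask;
    int blocked;

    if (sigfillset(&newMask) != 0)
	return -1;
    if (sigprocmask(SIG_BLOCK, &newMask, &oldMask) != 0)
	return -1;
    (void) sigdelset(&newMask, SIGINT);
    if (sigprocmask(how, &newMask, NULL) != 0 ||
	sigprocmask(SIG_BLOCK, NULL, &curMask) != 0) {	/* query only */
	(void) sigprocmask(SIG_SETMASK, &oldMask, NULL);
	return -1;
    }
    blocked = sigismember(&curMask, SIGINT);
    (void) sigprocmask(SIG_SETMASK, &oldMask, NULL);
    return blocked;
}
```

sigint_blocked_after(SIG_BLOCK) reports that SIGINT is still blocked, so
the second call changes nothing; with SIG_SETMASK the signal becomes
unblocked.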




Re: RPM: rpm/rpmdb/ rpmdb.c

2008-08-07 Thread Alexey Tourbin
On Fri, Aug 08, 2008 at 01:37:42AM -0400, Jeff Johnson wrote:
> Careful with this change. It's quite easy to end up truncating
> unloaded header blobs accidentally by 8b (which is what your
> deletion does afaict, not checked).
> 
> headerGetMagic() is just a complicated way of setting nb = 8.

> >  --- rpm/rpmdb/rpmdb.c  7 Aug 2008 19:17:20 -   1.260
> >  +++ rpm/rpmdb/rpmdb.c  8 Aug 2008 02:07:11 -   1.261
> >  @@ -3221,13 +3221,6 @@
> > dbi = dbiOpen(db, RPMDBI_PACKAGES, 0);
> > if (dbi != NULL) {
> >   
> >  -  nb = 0;
> >  -  (void) headerGetMagic(h, NULL, &nb);
> >  -  /* XXX db0: hack to pass sizeof header to fadAlloc */
> >  -  datap = h;
> >  -  datalen = headerSizeof(h);
> >  -  datalen -= nb;  /* XXX HEADER_MAGIC_NO */
> >  -
> > xx = dbiCopen(dbi, dbi->dbi_txnid, &dbcursor, DB_WRITECURSOR);
> >   
> > /* Retrieve join key for next header instance. */

The code was like this:

  3214  {
  3215  unsigned int firstkey = 0;
  3216  void * keyp = &firstkey;
  3217  size_t keylen = sizeof(firstkey);
  3218  void * datap = NULL;
  3219  size_t datalen = 0;
  3220  
  3221dbi = dbiOpen(db, RPMDBI_PACKAGES, 0);
  3222if (dbi != NULL) {
  3223  
  3224  nb = 0;
  3225  (void) headerGetMagic(h, NULL, &nb);
  3226  /* XXX db0: hack to pass sizeof header to fadAlloc */
  3227  datap = h;
  3228  datalen = headerSizeof(h);
  3229  datalen -= nb;  /* XXX HEADER_MAGIC_NO */

So here was the hack that affects datap and datalen.

  3230  
  3231  xx = dbiCopen(dbi, dbi->dbi_txnid, &dbcursor, DB_WRITECURSOR);
  3232  
  3233  /* Retrieve join key for next header instance. */
  3234  
  3235  /[EMAIL PROTECTED]@*/
  3236  key->data = keyp;
  3237  key->size = (UINT32_T) keylen;
  3238  /[EMAIL PROTECTED]@*/ data->data = datap;
  3239  data->size = (UINT32_T) datalen;
  3240  ret = dbiGet(dbi, dbcursor, key, data, DB_SET);

Here goes the dbiGet call with datap.  I ASSUME that, with the DB_SET flag,
"key" is used only as an input parameter, and "data" is used only as an
output parameter (so that the key is left unchanged, and the initial
"data" is not checked at all).

There is simply no point in setting the "data" fields before the
dbiGet(..., key, data, DB_SET) call, is there?
(Except for some old hack.)

  3241  keyp = key->data;
  3242  keylen = key->size;
  3243  datap = data->data;
  3244  datalen = data->size;
  3245  /[EMAIL PROTECTED]@*/




Re: Provideversion empty string index

2008-08-07 Thread Alexey Tourbin
On Thu, Aug 07, 2008 at 06:27:37PM -0400, Jeff Johnson wrote:
>  
> On Thursday, August 07, 2008, at 06:21PM, "Alexey Tourbin" <[EMAIL 
> PROTECTED]> wrote:
> >If you try something like
> >perl -MDB_File -MData::Dumper -le 'tie 
> >%db,"DB_File","/var/lib/rpm/Provideversion",0,0,$DB_BTREE or die; print 
> >Dumper(\%db)' |less
> >you can see that the biggest Provideversion entry has "\0" key.
> >Actually It takes about one half of the whole index.
> >
> >Was it really intended?
> 
> Yes intended.
>
> The underlying issue with a data store (like Headers) is
>What to do with missing/optional values?
> 
> I chose to make NULL == "" == missing.

However, there is a special case for empty file digests.

rpmdb.c (rpmdbAdd):
  3402  case RPMTAG_FILEDIGESTS:
  3403  /* Filter out empty MD5 strings. */
  3404  if (!(he->p.argv[i] && *he->p.argv[i] != '\0'))
  3405  /[EMAIL PROTECTED]@*/ continue;
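
A rough sketch of the two policies side by side, with hypothetical
helpers index_key() and count_indexed() (this is not the rpmdbAdd code):
by default a missing value is indexed under the empty key, while the file
digest case filters empty values out.

```c
#include <stddef.h>

/* NULL == "" == missing: map a possibly absent value to its index key. */
static const char *index_key(const char *value)
{
    return (value != NULL) ? value : "";
}

/*
 * Count the values that would get an index entry.  With skip_empty set,
 * empty keys are filtered out, as in the RPMTAG_FILEDIGESTS special case.
 */
static size_t count_indexed(const char *const *argv, size_t argc,
			    int skip_empty)
{
    size_t i, n = 0;

    for (i = 0; i < argc; i++) {
	if (skip_empty && *index_key(argv[i]) == '\0')
	    continue;
	n++;
    }
    return n;
}
```

Without the filter, every unversioned provide lands under the same empty
key, which is why that entry is so large.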




Re: rpmdbAdd

2008-08-07 Thread Alexey Tourbin
On Thu, Aug 07, 2008 at 09:08:01PM -0400, Jeff Johnson wrote:
> Header indices are monotonically increasing integer instances starting with 1.
> And header instance #0 is where the monotonically increasing integer is 
> stored.

Thanks, that's the key point that I missed.
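
To check my understanding, here is a minimal sketch of the counter
logic.  The record stored under key 0 is modelled as a 4-byte buffer, and
swab32() and next_instance() are hypothetical stand-ins for _DBSWAP and
the dbiGet/dbiPut sequence in rpmdbAdd().

```c
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 32-bit value. */
static uint32_t swab32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
	   ((v << 8) & 0x00ff0000u) | (v << 24);
}

/*
 * rec0 is the payload stored under key 0.  Read the last instance
 * number, increment it, write it back in the database byte order,
 * and return the new instance number in host byte order.
 */
static uint32_t next_instance(unsigned char rec0[4], int byte_swapped)
{
    uint32_t n, stored;

    memcpy(&n, rec0, sizeof(n));
    if (byte_swapped)
	n = swab32(n);
    ++n;
    stored = byte_swapped ? swab32(n) : n;
    memcpy(rec0, &stored, sizeof(stored));
    return n;
}
```

Starting from an all-zero record, successive calls hand out 1, 2, 3, ...,
which matches "instances starting with 1".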

> I can lay out what I think needs to happen with rpmdbAdd() early next week so
> hold off on any radical rpmdb surgery please.

Actually I'm trying to synchronize our (ALT Linux) rpmdb code (based
on 4.0.4 with a number of backports) with the rpm5 rpmdb code.  So I need
something like a "current stable" rpmdb code base.  As I'm trying to
understand the code while copying and pasting (which is a code review),
I can also make minor changes to make the code a bit simpler and
prettier.  No radical surgery intended.




rpmdbAdd

2008-08-07 Thread Alexey Tourbin
Could you help me understand the code, please?
What is the "firstkey"?  What's going on?

rpmdb.c (rpmdbAdd):
  3214  {
  3215  unsigned int firstkey = 0;
  3216  void * keyp = &firstkey;
  3217  size_t keylen = sizeof(firstkey);
  3218  void * datap = NULL;
  3219  size_t datalen = 0;
  3220  
  3221dbi = dbiOpen(db, RPMDBI_PACKAGES, 0);
  3222if (dbi != NULL) {
  3223  
  3224  nb = 0;
  3225  (void) headerGetMagic(h, NULL, &nb);
  3226  /* XXX db0: hack to pass sizeof header to fadAlloc */
  3227  datap = h;
  3228  datalen = headerSizeof(h);
  3229  datalen -= nb;  /* XXX HEADER_MAGIC_NO */
  3230  
  3231  xx = dbiCopen(dbi, dbi->dbi_txnid, &dbcursor, DB_WRITECURSOR);
  3232  
  3233  /* Retrieve join key for next header instance. */
  3234  
  3235  /[EMAIL PROTECTED]@*/
  3236  key->data = keyp;
  3237  key->size = (UINT32_T) keylen;
  3238  /[EMAIL PROTECTED]@*/ data->data = datap;
  3239  data->size = (UINT32_T) datalen;
  3240  ret = dbiGet(dbi, dbcursor, key, data, DB_SET);
  3241  keyp = key->data;
  3242  keylen = key->size;
  3243  datap = data->data;
  3244  datalen = data->size;
  3245  /[EMAIL PROTECTED]@*/
  3246  
  3247  hdrNum = 0;
  3248  if (ret == 0 && datap) {
  3249  memcpy(&mi_offset, datap, sizeof(mi_offset.ui));
  3250  if (dbiByteSwapped(dbi) == 1)
  3251  _DBSWAP(mi_offset);
  3252  hdrNum = (unsigned) mi_offset.ui;
  3253  }
  3254  ++hdrNum;
  3255  mi_offset.ui = hdrNum;
  3256  if (dbiByteSwapped(dbi) == 1)
  3257  _DBSWAP(mi_offset);
  3258  if (ret == 0 && datap) {
  3259  memcpy(datap, &mi_offset, sizeof(mi_offset.ui));
  3260  } else {
  3261  datap = &mi_offset;
  3262  datalen = sizeof(mi_offset.ui);
  3263  }
  3264  
  3265  key->data = keyp;
  3266  key->size = (UINT32_T) keylen;
  3267  /[EMAIL PROTECTED]@*/
  3268  data->data = datap;
  3269  /[EMAIL PROTECTED]@*/
  3270  data->size = (UINT32_T) datalen;
  3271  
  3272  /[EMAIL PROTECTED]@*/
  3273  ret = dbiPut(dbi, dbcursor, key, data, DB_KEYLAST);
  3274  /[EMAIL PROTECTED]@*/
  3275  xx = dbiSync(dbi, 0);
  3276  
  3277  xx = dbiCclose(dbi, dbcursor, DB_WRITECURSOR);
  3278  dbcursor = NULL;
  3279}




Provideversion empty string index

2008-08-07 Thread Alexey Tourbin
If you try something like
perl -MDB_File -MData::Dumper -le 'tie 
%db,"DB_File","/var/lib/rpm/Provideversion",0,0,$DB_BTREE or die; print 
Dumper(\%db)' |less
you can see that the biggest Provideversion entry has the "\0" key.
Actually it takes up about one half of the whole index.

Was it really intended?




Re: - jbj: replace with private typedefs.

2008-08-07 Thread Alexey Tourbin
On Thu, Aug 07, 2008 at 08:40:26PM +0200, Ralf S. Engelschall wrote:
> On Thu, Aug 07, 2008, Alexey Tourbin wrote:
> 
> > [...]
> > > - jbj: replace  with private typedefs.
> >
> > Why are private typedefs any better?
> 
> For instance because the private ones are available everywhere while
> the  ones require at least a C99 environment which in turn
> unnecessarily increases the entry barrier for a bootstrapping tool like
> RPM when it comes to non-Linux platforms.

Arguably a better approach is to define uint32_t etc. on systems that
lack them instead of introducing rpmuint32_t.

My $EDITOR can highlight uint32_t but knows nothing about rpmuint32_t.
As a developer, I'm disappointed that reading the code has been made harder.
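
Something along these lines is what I have in mind.  It is only a sketch:
a real tree would key the fallback on a configure result (e.g. a
HAVE_STDINT_H style macro) rather than on __STDC_VERSION__, and the
assumption that unsigned int is 32 bits wide must hold on the pre-C99
target.  rotl32() is just a hypothetical user of the type.

```c
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
# include <stdint.h>		/* C99 and later: use the standard names */
#else
/* Assumption: unsigned int is 32 bits wide on the pre-C99 target. */
typedef unsigned int uint32_t;
#endif

/* Compile-time check that the chosen type really is 4 bytes wide. */
typedef char uint32_t_is_4_bytes[(sizeof(uint32_t) == 4) ? 1 : -1];

/* The rest of the code keeps using the standard name either way. */
static uint32_t rotl32(uint32_t x, unsigned k)
{
    k &= 31;
    return k ? (uint32_t)((x << k) | (x >> (32 - k))) : x;
}
```

The array typedef fails to compile if the fallback picks a type of the
wrong size, so a bad guess is caught at build time.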




- jbj: replace with private typedefs.

2008-08-07 Thread Alexey Tourbin
On Thu, Jul 31, 2008 at 04:40:11AM +0200, Jeff Johnson wrote:
>   Server: rpm5.org Name:   Jeff Johnson
>   Root:   /v/rpm/cvs   Email:  [EMAIL PROTECTED]
>   Module: rpm  Date:   31-Jul-2008 04:40:11
>   Branch: HEAD Handle: 2008073102400505
> 
>   Modified files:
> rpm configure.ac system.h
> rpm/build   buildio.h files.c misc.c names.c pack.c
> parseChangelog.c parsePreamble.c parsePrep.c
> parseReqs.c parseScript.c reqprov.c rpmbuild.h
> rpmspec.h spec.c
> rpm/lib depends.c formats.c fs.c fs.h fsm.c poptI.c psm.c
> query.c rpmal.c rpmal.h rpmchecksig.c rpmcli.h
> rpmds.c rpmds.h rpmfc.c rpmfc.h rpmfi.c rpmfi.h
> rpminstall.c rpmps.c rpmps.h rpmrc.c rpmrollback.c
> rpmte.c rpmte.h rpmts.c rpmts.h rpmversion.c
> rpmversion.h.in tevr.c transaction.c verify.c
> rpm/rpmdb   db3.c db_emu.h dbconfig.c fprint.c fprint.h
> hdrNVR.c hdrfmt.c header.c header_internal.c
> header_internal.h pkgio.c rpmdb.c rpmdb.h rpmevr.c
> rpmns.c rpmtag.h rpmtd.c rpmtd.h rpmwf.c
> signature.c sqlite.c tjfn.c
> rpm/rpmio   digest.c gzdio.c iosm.c iosm.h lookup3.c lzdio.c
> poptIO.h rpmbc.c rpmdav.c rpmdav.h rpmdigest.c
> rpmgc.c rpmhash.c rpmhash.h rpmio.c
> rpmio_internal.h rpmiotypes.h rpmkeyring.c
> rpmkeyring.h rpmlua.c rpmnss.c rpmpgp.c rpmpgp.h
> rpmssl.c rpmsw.c rpmwget.c rpmz.c thkp.c
> 
>   Log:
> - jbj: replace  with private typedefs.

Why are private typedefs any better?




Re: RPM: rpm/ CHANGES rpm/rpmio/ gzdio.c

2008-08-06 Thread Alexey Tourbin
On Wed, Aug 06, 2008 at 05:09:30AM -0400, Jeff Johnson wrote:
> So that's what I screwed. Tired old blind eyes here, sigh.

void pointers are evil.  gzFile is typedef'd as "void *",
which means there's no chance left for the compiler to complain
about a pointer type mismatch.  They'd better typedef gzFile
as "struct gzFile_s *" or something.
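
A small sketch of the difference, using a hypothetical my_gzFile handle
and helpers (none of this is zlib API):

```c
#include <stdlib.h>

/* Hypothetical handle: a pointer to a named struct, not "void *". */
struct my_gzFile_s {
    int fd;
};
typedef struct my_gzFile_s *my_gzFile;

static my_gzFile my_gzopen(int fd)
{
    my_gzFile f = malloc(sizeof(*f));

    if (f != NULL)
	f->fd = fd;
    return f;
}

static int my_gzfd(my_gzFile f)
{
    return f->fd;
}

static void my_gzclose(my_gzFile f)
{
    free(f);
}

/*
 * With the struct typedef, a call such as
 *     FILE *fp = stdin;  my_gzfd(fp);
 * is diagnosed as an incompatible pointer type.
 * With "typedef void *my_gzFile;" the same call compiles silently.
 */
```

The struct can stay opaque to callers (declared but not defined in the
public header), so nothing about the layout leaks.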

> Thanks for the fix! Next tasks are to unwire internal zlib, and hint
> the need for compressor flushing slightly differently (from iosm.c) ...

BTW, it is not possible to make lzma rsyncable, at least for now.
In lzma_alone format, there's no block partitioning at all.
It's just a huge single block, as far as I see.

There are other reasons why lzma compressed data cannot be effectively
rsyncable.  Certain conditions must be met, including small dictionary size
(gzip has 32K dictionary size, and lzma works best with ~2M dictionary
size).  Basically it's either ultimate compression with lzma or
rsyncability with good old gzip.




Re: Fwd: Re: rsyncable gzdio

2008-07-09 Thread Alexey Tourbin
On Wed, Jul 09, 2008 at 05:35:46AM -0700, Jeff Johnson wrote:
>2) From a fundamental coding design POV, adding --rsyncable to rpmio  
>code is just
>plain wrong. That's as true for the patched internal zlib as it is  
>for your gzdio
>patches. The issue of when gzflush() is called is fundamentally a  
>compression
>and rsync transport, not a *.rpm payload packaging, issue.
>
>So isn't a pure API/ABI drop-in replacement for gzio(3), with a  
>"gzwrite" rather
>than a "rsyncable_gzwrite" symbol a better approach? There are many  
>applications,
>not just rpm, that _MUST_ participate in an efficient *.rpm  
>distribution framework,
>so patching gzdio in rpmio is just a tiny piece of a much bigger  
>puzzle, where "gzwrite"
>rather than "rsyncable_gzwrite" is likelier to be successful. JMHO,  
>YMMV, as always.

Hmm.  Let me argue here just a bit.  You're saying that indirect
gzflush() calls are somewhat alien to the plain gzdio abstraction.  However,
from the higher-level (logical) gzFile perspective, calling gzflush() is
also simply a no-op.  The worst thing that can ever happen is calling
gzflush() too often, which can degrade compression.

Now, rsyncable_gzwrite is simply about calling the real gzwrite+gzflush
in a loop.  Provided that the pointer arithmetic in rsyncable_gzwrite()
is valid, gzflush is still a no-op, the gzdio abstraction is preserved,
and nothing bad can ever happen.  The rest of the code is sync_hint(),
which is about finding the right sync points and also, basically, not
calling gzflush() too often.

So, the code may look weird, but the idea is simple.  I introduce
rsyncable_gzwrite, which is equivalent to gzwrite from the higher-level
(logical) point of view.  There is no possibility for rsyncable_gzwrite
to produce logically invalid output.  The rest of the code is just
a hint for the rsyncable_gzwrite() loop.
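
The shape of the loop can be sketched as follows.  This is not the code
from the patches: the sink below stands in for gzwrite()/gzflush(), and
the hint is a plain rolling-sum rule (the trick known from the gzip
--rsyncable patch, with a tiny window) rather than the cpio-aware
sync_hint().  All names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define WIN 16u			/* rolling window; tiny, for illustration */

struct rsync_state {		/* must start zero-initialized */
    unsigned char win[WIN];	/* last WIN bytes seen */
    size_t pos;			/* next slot in win[] */
    unsigned long sum;		/* sum of the bytes in win[] */
    size_t seen;		/* total bytes seen so far */
};

/* Feed one byte; return 1 if a sync point follows it. */
static int sync_hint(struct rsync_state *st, unsigned char c)
{
    st->sum += c;
    st->sum -= st->win[st->pos];
    st->win[st->pos] = c;
    st->pos = (st->pos + 1) % WIN;
    st->seen++;
    return st->seen >= WIN && st->sum % WIN == 0;
}

struct sink {			/* stand-in for a gzFile */
    unsigned char data[256];	/* bytes "written" so far */
    size_t len;
    size_t flushes;		/* number of flush calls */
};

static int sink_write(struct sink *s, const unsigned char *buf, size_t n)
{
    if (n > sizeof(s->data) - s->len)
	return -1;
    memcpy(s->data + s->len, buf, n);
    s->len += n;
    return 0;
}

static int sink_flush(struct sink *s)
{
    s->flushes++;
    return 0;
}

/* Same contract as a plain write: all len bytes go out, in order. */
static int rsyncable_write(struct sink *s, struct rsync_state *st,
			   const unsigned char *buf, size_t len)
{
    size_t start = 0, i;

    for (i = 0; i < len; i++) {
	if (!sync_hint(st, buf[i]))
	    continue;
	if (sink_write(s, buf + start, i + 1 - start) != 0)
	    return -1;
	if (sink_flush(s) != 0)
	    return -1;
	start = i + 1;
    }
    if (start < len && sink_write(s, buf + start, len - start) != 0)
	return -1;
    return 0;
}
```

Whatever the hint decides, the bytes reaching the sink are the input
bytes in the same order; a poor hint only costs extra flushes.  An
all-zero input is the degenerate case, where this rule flushes after
every byte once the window is full.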

BTW, we can have something like %_rsyncable_gzwrite macro, to switch
between plain gzwrite and rsyncable_gzwrite.




Re: rsyncable gzdio

2008-07-09 Thread Alexey Tourbin
On Wed, Jul 09, 2008 at 05:35:46AM -0700, Jeff Johnson wrote:
>0) No one understands why --rsyncable is important, or why gzip != zlib,
>or why the "fuzzy" name patch in rsync would be a tremendous bandwidth
>saving for *.rpm packages. I've been tracking the issue for like 6+  
>years,
>and what is fundamentally needed is a very clear demonstration,  
>including
>publicized benchmarks and likely a drop-in "production" ready transport
>implementation, for any --rsyncable code to be worth the effort. JMHO
>based on 6+ years of explaining ...

I have done some comprehensive testing of rsyncable gzdio with respect
to rpm packaging.  I posted the results (in Russian) to our ALT Linux Team
development list.  The upshot is that 1) it is known to work well,
which means at least no segfaults or corrupted data; 2) it barely
degrades the compression ratio, thanks to cpio hints (avg 0.09% compared
to avg 1% for patched zlib); 3) the rsyncability effect can be worthwhile,
which is about 1/3 bandwidth saving on real data transfer.

Some details.  I've tested rsyncability of our two directories:
/ALT/archive/Sisyphus/2008/03/01/files/x86_64/RPMS
/ALT/archive/Sisyphus/2008/04/01/files/x86_64/RPMS
(they are just what you may think.)

1) From these two directories, I select package tuples which have the
same %{NAME} (but whose file names %name-%version-%release.x86_64.rpm differ).
This means I test whether rsyncability is worthwhile for the packages that
have been updated within one month.  This includes %version upgrades as well as
minor %release updates (something like representative data).
2) For each package in a tuple, I repackage its cpio archive
with rsyncable gzdio.
3) Small packages are excluded: repackaged cpio must be at least
32K each.
4) rsync is run (with a small trick) to diagnose if there is any
speedup.

The resulting table is

rpm-1  size-1  rpm-2  size-2  rsync-sent  rsync-recv  speedup
-----  ------  -----  ------  ----------  ----------  -------

The table (the attachment) and some more details are available here:
http://lists.altlinux.org/pipermail/devel/2008-May/074937.html

$ wc -l 2' rsyncability.txt |wc -l
211
$

-- 211 packages have high rsyncability rate (one has to download
less than 1/2 of new package size).

$ sum() { perl -MList::Util=sum -ln0 -e 'print sum split'; }
$ cut -f4 rsyncability.txt |sum
2433627
$

-- New packages are 2.32G total.

$ cut -f5 rsyncability.txt |sum
14017
$

-- rsync downloaded 1.57G.

(End of details.)

This means that rsyncable gzdio *can* be worthwhile -- one can expect
to save about 1/3 of bandwidth.  However, this also has some
requirements: 1) you must have older rpms (or you are going to save
nothing anyway); 2) you must synchronize two directories, and you must
use 'rsync --fuzzy', to catch up file renames; 3) both old and new files
must be compressed with rsyncable gzdio.
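
For the record, the "about 1/3" figure is just the ratio of the two
totals above.  A trivial sketch with hypothetical helper names:

```c
/* Fraction of the download avoided: 1 - transferred/total. */
static double bandwidth_saving(double total, double transferred)
{
    return (total > 0.0) ? 1.0 - transferred / total : 0.0;
}

/* "High rsyncability": less than half of the new package is downloaded. */
static int highly_rsyncable(double new_size, double downloaded)
{
    return downloaded < new_size / 2.0;
}
```

bandwidth_saving(2.32, 1.57) comes out at roughly 0.32, i.e. a bit under
one third.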

I hope this gives some idea of what rsyncable gzdio can do.




rsyncable gzdio

2008-07-08 Thread Alexey Tourbin
On Mon, Jul 07, 2008 at 11:44:07PM +0200, Jeff Johnson wrote:
> - make gzdio.c standalone.

BTW, I have an rsyncable gzdio implementation (it does not require a
patched zlib; one only has to call gzflush() at certain sync points).

It is known to work well.  Please review the patches and tell me
whether you want it or not.
http://git.altlinux.org/people/at/packages/rpm.git?a=commitdiff;h=c761902b
http://git.altlinux.org/people/at/packages/rpm.git?a=commitdiff;h=f7b5ee1e
http://git.altlinux.org/people/at/packages/rpm.git?a=commitdiff;h=8d5e355e
http://git.altlinux.org/people/at/packages/rpm.git?a=commitdiff;h=52b2499a




Re: RPM: rpm/ CHANGES rpmpopt.in

2008-07-08 Thread Alexey Tourbin
On Mon, Jul 07, 2008 at 09:08:05PM -0400, Jeff Johnson wrote:
> >From another side, it is not obvious how recursive --needswhat should
> >traverse virtual packages where more than one alternative is  
> >available.
> >
> 
> Except for multilib (which I personally don't use), what categories  
> for multiple provides exist?

E.g. executable(foo), or /usr/bin/foo (which can be under
update-alternatives(8) control), or MTA.

> All that is needed is criteria for preferring a Provides:. Even for  
> multilib, there is now %_prefer_color which
> can be added to display the "preferred" answer if necessary. Should I  
> implement?
> 
> Note also that the examples I've given for --needswhat/--whatneeds  
> are slyly/implicitly dependent
> on whatever packages are already installed, which is likely to be  
> whatever was "preferred".
> 
> A general answer for "preferred" is more complex however ...




Re: RPM: rpm/build/ files.c

2008-07-07 Thread Alexey Tourbin
On Mon, Jul 07, 2008 at 01:31:55AM -0400, Jeff Johnson wrote:
> Here's the flaw (from the edos-test*.src.rpm build):
> 
> ...
> re --nodeps edos-test-*.src.rpm
> (cd edos-test && /X/src/wdj/rpm --macros /X/src/wdj/macros:/X/src/wdj/ 
> tests/macros -q --specfile edos-test.spec && /X/src/wdj/rpm --macros / 
> X/src/wdj/macros:/X/src/wdj/tests/macros -q --specsrpm edos-test.spec  
> && /X/src/wdj/rpm --macros /X/src/wdj/macros:/X/src/wdj/tests/macros - 
> q --specfile --specedit edos-test.spec && /X/src/wdj/rpm --macros /X/ 
> src/wdj/macros:/X/src/wdj/tests/macros -bb --nodeps edos-test.spec)  
> > /dev/null
> /usr/lib/rpm/check-files: line 11: cd: /X/src/wdj/tests/tmp//edos- 
> test-root: No such file or directory
>  
> ^
> /X/src/wdj/rpm --macros /X/src/wdj/macros:/X/src/wdj/tests/macros -i  
> --nosignature --nodeps probes-test-*.src.rpm

Hmm, in tests/edos-test.spec, there is no %install section,
and %buildroot is not created at all.  Also, all %files sections
are empty there.

And so, there is simply no buildroot.
Here is how check-files changed:

-RPM_BUILD_ROOT=`echo $1 | sed 's://*:/:g'`
-
-if [ ! -d "$RPM_BUILD_ROOT" ] ; then
-   cat > /dev/null
+RPM_BUILD_ROOT=$1
+if ! cd "${RPM_BUILD_ROOT:?}"; then
+   cat >/dev/null
exit 1

Old behaviour was: test for buildroot or exit 1.
New behaviour: cd to buildroot, or fail noisily and exit 1.

Let me think!




rpmfiFMaxLen

2008-07-05 Thread Alexey Tourbin
On Sun, Jul 06, 2008 at 06:24:43AM +0200, Alexey Tourbin wrote:
>   Server: rpm5.org Name:   Alexey Tourbin
>   Root:   /v/rpm/cvs   Email:  [EMAIL PROTECTED]
>   Module: rpm  Date:   06-Jul-2008 06:24:43
>   Branch: HEAD Handle: 2008070604244200
> 
>   Modified files:
> rpm CHANGES
> rpm/build   files.c
> rpm/lib librpm.vers rpmfi.c rpmfi.h
> 
>   Log:
> rpmfi: changed fi->fnlen meaning, added rpmfiFMaxLen() function
> 
> - fi->fnlen: now indicates max file name length, without '\0' (like 
> strlen)
> - rpmfiNew: find the exact max file name length, not dnlmax + bnlmax
> - rpmfiFN: allocate (fi->fnlen + 1) bytes
> - rpmfiFMaxLen: new function

rpmfiFNMaxLen is possibly a better name - s/F/FN/.
However, there is also rpmfiFNlink.




Re: RPM: rpm/ CHANGES rpm/build/ files.c

2008-07-04 Thread Alexey Tourbin
On Fri, Jul 04, 2008 at 09:48:19PM -0400, Jeff Johnson wrote:
> >  +  if (!N1) headerNEVRA(h1, &N1, NULL, NULL, NULL, NULL);
> >  +  if (!N2) headerNEVRA(h2, &N2, NULL, NULL, NULL, NULL);
> >  +  rpmlog(RPMLOG_WARNING,
> >  + _("file %s is packaged into both %s and %s\n"),
> >  + fn1, N1, N2);
> >
> 
> This paradigm of displaying N-V-R.A (or whatever is deemed informative)
> is an obviously (duh!) widely repeated paradigm throughout rpm.

As for me, I think that displaying N-V-R.A is (sometimes) nothing more
than displaying redundant information.  Here is why.  Suppose that we
store rpmbuild build logs in some SCM system (well, we do), to study
e.g. new compiler warnings or something.  And then, simply changing the
release is going to introduce quite a few changes in the build log, and
hence this will yield new diff hunks.  Now guess what.  When studying
build logs, the last thing we need is those funky new diff hunks.




Re: RPM: rpm/ CHANGES rpm/build/ files.c

2008-06-16 Thread Alexey Tourbin
On Mon, Jun 16, 2008 at 10:21:56AM -0400, Jeff Johnson wrote:
> Hmm, at some point, I start to question whether permitting duplicates  
> like
>/foo/1
>/foo/1
> in %files, or worrying about %ghost (and %attr and %verify and % 
> exclude and ...,
> there's *at least* one other place that the test for %ghost/%exclude  
> is needed)
> scoping over pathologies as in your example above, is worth the effort.

It is worth having a correct algorithm for the RPMTAG_SIZE counter.
The algorithm is correct iff the cpio file data bytes match the RPMTAG_SIZE
value (this is why we move the code to genCpioListAndHeader()).

rpm2cpio foo.rpm |catenate_cpio_file_data |wc -c
rpmquery --qf '%{SIZE}\n' -p foo.rpm

My point is that as far as we can build valid cpio archive, we can
also mimic some cpio logic a bit and get valid RPMTAG_SIZE value.
It has nothing to do with specfile pathologies (or we just can't
build cpio archive otherwise).

> With %ghost, the issue runs to a fundamental spec file design flaw,  
> there
> is plain and simply no way _BY DEFINITION_ to know the file type  
> associated
> with the path that has a %ghost attribute.
> 
> The RPMTAG_FILEMODES associated with %ghost files has ugo rwx, but  
> not the file type.

I can't quite understand what you mean.  It looks like you're trying
to say that, in genCpioListAndHeader(), if (flp->flags & RPMFILE_GHOST)
holds, then we cannot reliably check for S_ISREG(flp->fl_mode).  I can't
see why yet.

> Why shouldn't all of the above be treated as syntax errors instead
> of quietly assuming that indeed, there is some real world need to
> have sloppy goosey-loosey spec file syntax permitting duplicates (and
> file marker attribute scoping across hard links, or primitive
> filtering directives like %exclude) anyways.

rpm seems to permit file dups by design, and, while issuing a warning,
it also has some code to fold dups and merge their flags correctly.

There is a good reason -- glob(3) is not flexible enough at times.
Sometimes you do:

%files
/foo/prefix.*
/foo/*.suffix

Overlaps are okay here.
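
A quick way to see the overlap, as a sketch with a hypothetical
times_listed() helper built on fnmatch(3):

```c
#define _POSIX_C_SOURCE 200809L
#include <fnmatch.h>
#include <stddef.h>

/*
 * How many of the %files patterns match a given path, i.e. how many
 * times the path ends up in the expanded file list.
 * FNM_PATHNAME keeps '*' from matching across '/'.
 */
static size_t times_listed(const char *path,
			   const char *const *patterns, size_t n)
{
    size_t i, hits = 0;

    for (i = 0; i < n; i++)
	if (fnmatch(patterns[i], path, FNM_PATHNAME) == 0)
	    hits++;
    return hits;
}
```

A file named /foo/prefix.suffix is matched by both patterns and so is
listed twice, which is exactly the kind of dup that gets folded.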

> But as always, since there's no grammar for spec files, anything goes.
> Even with a grammar, the issues of scoping through implicit hard link  
> aliasing,
> are semantic, not syntactical (but duplicates are syntax).
> 
> Guess what happens in your example when the install runs with
> --excludepath=/foo/1
> 
> What link count should be checked with --verify, particularly if
> another package also contains
> %ghost /foo/2
> and /foo/1 explicitly had --excludepath when installed?
> 
> And to return to the original RPMTAG_SIZE issue, what value should
> --qf '%{size}
> report for a given package with excluded hard linked paths that span  
> multiple
> packages as above?

I think that %{SIZE} should report cpio file data bytes (i.e. cpio
archive size excluding 110 bytes per cpio entry, filenames, and
alignment bytes).
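
To spell out the accounting, here is a sketch that assumes the "newc"
cpio layout (110-byte header, NUL-terminated name, 4-byte alignment after
the name and after the data).  The helper names are hypothetical.

```c
#include <stddef.h>

#define CPIO_NEWC_HDR 110u	/* fixed header size of a "newc" entry */

/* Round up to the next multiple of 4. */
static size_t pad4(size_t n)
{
    return (n + 3u) & ~(size_t)3u;
}

/*
 * Bytes one entry occupies in the archive.
 * name_len excludes the trailing NUL stored after the name.
 */
static size_t cpio_entry_size(size_t name_len, size_t data_len)
{
    return pad4(CPIO_NEWC_HDR + name_len + 1) + pad4(data_len);
}

/* Everything in the entry that is not file data. */
static size_t cpio_entry_overhead(size_t name_len, size_t data_len)
{
    return cpio_entry_size(name_len, data_len) - data_len;
}
```

Summing data_len over the entries gives the value I would expect in
%{SIZE}; summing cpio_entry_size() (plus the trailer entry) gives the
archive size.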




Re: RPM: rpm/ CHANGES rpm/build/ files.c

2008-06-16 Thread Alexey Tourbin
On Mon, Jun 16, 2008 at 04:38:03AM -0400, Jeff Johnson wrote:
> Alexey:
>This is your patch reworked slightly. See what you think.

> >  +  if (S_ISREG(flp->fl_mode)) {
> >  +  int bingo = 1;
> >  +  /* Hard links need be tallied only once. */
> >  +  if (flp->fl_nlink > 1) {
> >  +  FileListRec jlp = flp + 1;
> >  +  int j = i + 1;
> >  +  for (; (unsigned)j < fi->fc; j++, jlp++) {

Loop post-increment "j++, jlp++" is not enough here.
Remember that "jlp" can go ahead of "j" when folding dups.
You've got to rewind dups here the same way it is done in the outer loop.

while (...)
jlp++;

Here is a counterexample:

%install
mkdir -p %buildroot/foo
head -c 1024 /dev/zero >%buildroot/foo/1
head -c 1024 /dev/zero >%buildroot/foo/2
ln %buildroot/foo/1 %buildroot/foo/3
%files
/foo/1
/foo/2
/foo/2
/foo/3

We expect RPMTAG_SIZE == 2048 (1024 per /foo/1+3 hardlink set
plus 1024 per /foo/2).  However, we get 3072.  It is easy to
see why: when doing /foo/1, /foo/2 gets checked twice, and /foo/3
is not checked at all (and /foo/1 gets bingo but it shouldn't).
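
Here is a simplified model of what I mean, not the files.c code: the
list is sorted, dups are adjacent, the last dup carries the merged flags,
all entries are regular files, and the fl_dev/fl_nlink comparisons are
left out for brevity.  struct rec, F_EXCLUDE/F_GHOST, last_dup() and
total_size() are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

enum { F_EXCLUDE = 1, F_GHOST = 2 };	/* stand-ins for RPMFILE_* flags */

struct rec {
    const char *path;
    unsigned long ino;
    unsigned nlink;
    unsigned long size;
    unsigned flags;
};

/* Index of the last entry in the run of duplicates starting at i. */
static size_t last_dup(const struct rec *fl, size_t n, size_t i)
{
    while (i + 1 < n && strcmp(fl[i].path, fl[i + 1].path) == 0)
	i++;
    return i;
}

/* Tally file sizes; each hard link set is counted at its last member. */
static unsigned long total_size(const struct rec *fl, size_t n)
{
    unsigned long total = 0;
    size_t i, j;

    for (i = 0; i < n; i++) {
	int later = 0;		/* a later packaged link to the same inode? */

	i = last_dup(fl, n, i);	/* rewind dups in the outer loop */
	if (fl[i].flags & (F_EXCLUDE | F_GHOST))
	    continue;
	if (fl[i].nlink > 1) {
	    for (j = i + 1; j < n && !later; j++) {
		j = last_dup(fl, n, j);	/* ...and in the inner loop */
		later = fl[j].ino == fl[i].ino &&
			!(fl[j].flags & (F_EXCLUDE | F_GHOST));
	    }
	}
	if (!later)
	    total += fl[i].size;
    }
    return total;
}
```

On the counterexample above (/foo/1, /foo/2, /foo/2, /foo/3) this model
yields 2048, because the inner loop steps over the /foo/2 dup as a single
entry and still reaches /foo/3.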

> >  +  if (!S_ISREG(jlp->fl_mode))
> >  +  continue;
> >  +  if (flp->fl_nlink != jlp->fl_nlink)
> >  +  continue;
> >  +  if (flp->fl_ino != jlp->fl_ino)
> >  +  continue;
> >  +  if (flp->fl_dev != jlp->fl_dev)
> >  +  continue;

Now assume that the rewind went well, and we are at the last dup
entry (which has valid file flags that have been merged).  You still
have to check:

if (jlp->fl_flags & (RPMFILE_EXCLUDE | RPMFILE_GHOST))
continue;

Think of this example:

%install
mkdir -p %buildroot/foo
head -c 1024 /dev/zero >%buildroot/foo/1
ln %buildroot/foo/1 %buildroot/foo/2
%files
/foo/1
%ghost /foo/2

> >  +  bingo = 0;  /* don't tally hardlink yet. */
> >  +  break;
> >  +  }
> >  +  }
> >  +  if (bingo)
> >  +  fl->totalFileSize += flp->fl_size;
> >  +  }




Re: noarch sub-packages

2008-06-15 Thread Alexey Tourbin
On Mon, Jun 16, 2008 at 01:57:13AM -0400, Jeff Johnson wrote:
> >It looks like you cannot specify "BuildArch: %_target_cpu" in toplevel
> >package.
> 
> Ah, I have a reproducer for the flaw now, fixing todo++.

I believe there's an easier way to introduce noarch subpackages
(i.e. without explicit spec->toplevel and pkg->noarch bookkeeping).
You can simply check for (pkg == spec->packages).
http://git.altlinux.org/people/at/packages/rpm.git?a=commitdiff;h=3ad2b101
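
The idea in miniature, with hypothetical cut-down structs standing in for
rpmbuild's Spec and Package (the real ones have many more fields):

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for rpmbuild's Package and Spec. */
struct package {
    const char *name;
    struct package *next;
};

struct spec {
    struct package *packages;	/* head of the list: the toplevel package */
};

/* The toplevel package is simply the first one in the list. */
static int is_toplevel(const struct spec *spec, const struct package *pkg)
{
    return pkg == spec->packages;
}
```

No extra flag has to be kept in sync: whether a BuildArch line belongs to
a sub-package follows from the position in the list.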




Re: noarch sub-packages

2008-06-15 Thread Alexey Tourbin
On Mon, Jun 16, 2008 at 01:20:31AM -0400, Jeff Johnson wrote:
> Certainly "they really work" (unless I missed pushing some patch to  
> rpm-5.1.3):

It looks like you cannot specify "BuildArch: %_target_cpu" in toplevel
package.




[PATCH] fix RPMTAG_SIZE counter

2008-06-15 Thread Alexey Tourbin
RPMTAG_SIZE counter is broken a bit.

1) Duplicate file entries are not getting merged.

%install
mkdir -p %buildroot/foo
head -c 1024 /dev/zero >%buildroot/foo/1
%files
/foo/1
/foo/1

RPMTAG_SIZE == 2048.

2) File flags are not getting merged.

%install
mkdir -p %buildroot/foo
head -c 1024 /dev/zero >%buildroot/foo/1
%files
/foo/1
%exclude /foo/1

RPMTAG_SIZE == 1024 (empty package).

3) Similar problems with %ghost files.

To fix the problem, RPMTAG_SIZE counter should be invoked after
file dups are merged.  This means the code should be moved from
addFile() to genCpioListAndHeader().
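
The intended order of operations, as a simplified model rather than the
actual files.c logic: fold adjacent dups first, then count.  Merging
flags with a bitwise OR is an assumption made for this sketch, and
struct entry, fold_dups() and size_after_folding() are hypothetical
names.

```c
#include <stddef.h>
#include <string.h>

enum { F_EXCLUDE = 1, F_GHOST = 2 };	/* stand-ins for RPMFILE_* flags */

struct entry {
    const char *path;
    unsigned long size;
    unsigned flags;
};

/* Fold adjacent duplicates in place, OR-ing flags; returns new count. */
static size_t fold_dups(struct entry *fl, size_t n)
{
    size_t in, out = 0;

    for (in = 0; in < n; in++) {
	if (out > 0 && strcmp(fl[out - 1].path, fl[in].path) == 0)
	    fl[out - 1].flags |= fl[in].flags;
	else
	    fl[out++] = fl[in];
    }
    return out;
}

/* Count sizes only after folding, skipping excluded and ghost entries. */
static unsigned long size_after_folding(struct entry *fl, size_t n)
{
    unsigned long total = 0;
    size_t i;

    n = fold_dups(fl, n);
    for (i = 0; i < n; i++)
	if (!(fl[i].flags & (F_EXCLUDE | F_GHOST)))
	    total += fl[i].size;
    return total;
}
```

In this model case 1 gives 1024 instead of 2048, and case 2 gives 0
instead of 1024.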

--- rpm-5.1.3/build/files.c-	2008-04-06 12:47:47 +0400
+++ rpm-5.1.3/build/files.c 2008-06-16 07:16:45 +0400
@@ -1601,15 +1601,6 @@ if (!(_rpmbuildFlags & 4))
 /[EMAIL PROTECTED] =noeffectuncon @*/
 sxfn = _free(sxfn);
 
-ui32 = fl->totalFileSize;
-he->tag = RPMTAG_SIZE;
-he->t = RPM_UINT32_TYPE;
-he->p.ui32p = &ui32;
-he->c = 1;
-he->append = 1;
-xx = headerPut(h, he, 0);
-he->append = 0;
-
 if (_rpmbuildFlags & 4) {
 (void) rpmlibNeedsFeature(h, "PayloadFilesHavePrefix", "4.0-1");
 (void) rpmlibNeedsFeature(h, "CompressedFileNames", "3.0.4-1");
@@ -1713,7 +1704,44 @@ if (_rpmbuildFlags & 4) {
if (isSrc)
fi->fmapflags[i] |= IOSM_FOLLOW_SYMLINKS;
 
+   if (S_ISREG(flp->fl_mode) && flp->fl_nlink == 1)
+   fl->totalFileSize += flp->fl_size;
+   else if (S_ISREG(flp->fl_mode) && flp->fl_nlink > 1) {
+   /* Hard links need be counted only once. */
+   FileListRec jlp;
+   int j;
+   int found = 0;
+   for (j = i+1, jlp = flp+1; (unsigned)j < fi->fc; j++, jlp++) {
+   while (((jlp - fl->fileList) < (fl->fileListRecsUsed - 1)) &&
+   !strcmp(jlp->fileURL, jlp[1].fileURL))
+   jlp++;
+   if (!S_ISREG(jlp->fl_mode))
+   continue;
+   if (flp->fl_nlink != jlp->fl_nlink)
+   continue;
+   if (flp->fl_ino != jlp->fl_ino)
+   continue;
+   if (flp->fl_dev != jlp->fl_dev)
+   continue;
+   if (jlp->flags & (RPMFILE_EXCLUDE | RPMFILE_GHOST))
+   continue;
+   found = 1;
+   break;
+   }
+   if (!found) /* last entry in hardlink set */
+   fl->totalFileSize += flp->fl_size;
+   }
 }
+
+ui32 = fl->totalFileSize;
+he->tag = RPMTAG_SIZE;
+he->t = RPM_UINT32_TYPE;
+he->p.ui32p = &ui32;
+he->c = 1;
+he->append = 1;
+xx = headerPut(h, he, 0);
+he->append = 0;
+
 /[EMAIL PROTECTED]@*/
 if (fip)
*fip = fi;
@@ -1947,26 +1975,6 @@ static int addFile(FileList fl, const ch
flp->specdFlags = fl->currentSpecdFlags;
flp->verifyFlags = fl->currentVerifyFlags;
 
-   /* Hard links need be counted only once. */
-   if (S_ISREG(flp->fl_mode) && flp->fl_nlink > 1) {
-   FileListRec ilp;
-   for (i = 0;  i < fl->fileListRecsUsed; i++) {
-   ilp = fl->fileList + i;
-   if (!S_ISREG(ilp->fl_mode))
-   continue;
-   if (flp->fl_nlink != ilp->fl_nlink)
-   continue;
-   if (flp->fl_ino != ilp->fl_ino)
-   continue;
-   if (flp->fl_dev != ilp->fl_dev)
-   continue;
-   break;
-   }
-   } else
-   i = fl->fileListRecsUsed;
-
-   if (!(flp->flags & RPMFILE_EXCLUDE) && S_ISREG(flp->fl_mode) && i >= fl->fileListRecsUsed) 
-   fl->totalFileSize += flp->fl_size;
 }
 
 fl->fileListRecsUsed++;
@@ -2758,8 +2766,6 @@ int processSourceFiles(Spec spec)
 #endif
flp->langs = xstrdup("");

-   fl.totalFileSize += flp->fl_size;
-   
if (! (flp->uname && flp->gname)) {
rpmlog(RPMLOG_ERR, _("Bad owner/group: %s\n"), diskURL);
rc = fl.processingFailed = 1;




noarch sub-packages

2008-06-15 Thread Alexey Tourbin
Hello,
Do they really work?

rpm-5.1.3 $ ./rpmbuild -bb ~/test.spec
error: line 8: Only "noarch" sub-packages are supported: BuildArch: x86_64
error: Package has no %description: test-1.0-alt1.x86_64
rpm-5.1.3 $ cat ~/test.spec
Name: test
Version: 1.0
Release: alt1
Summary: test
License: GPL
Source: foo
Group: Development/Other
BuildArch: %_target_cpu
%description
rpm-5.1.3 $
__
RPM Package Manager                                    http://rpm5.org
Developer Communication List                        rpm-devel@rpm5.org