Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-23 Thread Jeffrey Johnson

On Jun 23, 2012, at 2:29 PM, Jeffrey Johnson wrote:

> 
> On Jun 23, 2012, at 2:10 PM, Jeffrey Johnson wrote:
> 
>> 
>> On Jun 23, 2012, at 1:49 PM, Alexey Tourbin  wrote:
>> 
> 
> <…>
> 
>> Perhaps making base61 encoding MANDATORY in rpm would displease
>> everyone equally: choosing a prime is as "pretty" as all other
>> encoding criteria. 
>> 
> 
> In the interest of getting off negative nerdy obscure discussions, let's
> try a positive alternative application for Golob-Rice subset operations.
> 
> All RPMv4 packages attach (a lightly filtered) file(1) magic string to
> every file.
> 
> The file(1) data is mostly usable as a "keyword" namespace exactly as is.
> Yes there are flaws: however magic strings are from file(1) is about
> as good as any other de facto keyword tagging of file content.
> 
> keywords are strings just like elf symbols are, and set:versions (or Bloom 
> filters)
> are a compact representation from which its rather easy to do subset 
> computations.
> 
> One extension that would be needed is a "closest" metric in order to "prefer"
> the largest subset overlap: with set:versions any contained subset will 
> satisfy the
> logical assertions, and there's no easy way to prefer the larger sub-set.
> 
> There's a similar application with dual/triple/... licensed software and 
> computing
> per-file, not per-package, license affinity precisely where set:versions (or 
> Bloom
> filters) will represent keywords (like "LGPLv2" or "BSD") easily. Licenses 
> unlike
> file(1) magic keywords will require name space administration. SUrely LSB and 
> LFF
> are seeking something useful to do for RPM packaging these days, and might
> be convinced to make some set of license tokens "standard" so that license
> affinity can be precisely computed in distributed software.
> 

The above suggestions were an attempt to generalize assertions to "feature 
vectors" (which
is the standard technique by which SAT solvers can start to handle 
"preference").

Meanwhile here's another application of existing set:versions "containers"
Query-by-ELF-symbol
that's perfectly obvious as well.

1) Generate tables for Provideversion/Requireversion by mentioning in the usual 
macro
that configures Berkeley DB. The table will be automagically generated lazily on
1st access (or do --rebuilddb if you insist on being a Luddite).

2) Generate a set:versions set with the symbols of interest. POPT already has a 
data type for
#define POPT_ARG_BITSET 16U+14U /*!< arg ==> bit set */
which results in a Bloom filter (set:versions can be added to popt just like 
rpmbf was by
swiping code from RPM: just as easy is to add some vectors that RPM fills in on 
startup
so that option processing can generate a set:versions set).

3) Retrieve the set:versions values from the PV/RV table; the retrieval will be 
blazingly
fast because there is no need to do secondary loading of headers, and there is
already means to retrieve by a *RE pattern like "^set:.*$" that essentially do 
a partial retrieve on
"set:"

4) Filter the retrieved set:versions keys down to a set of headers to load; do 
the usual --queryformat
display of NEVRA (or whatever other --queryformat is useful).

None of the above is very hard to do (with @rpm5.org code), and is sure to be 
lots faster
than the equivalent scripted operations because of Berkeley DB and btree 
indexing.

73 de Jeff



Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-23 Thread Jeffrey Johnson

On Jun 23, 2012, at 2:10 PM, Jeffrey Johnson wrote:

> 
> On Jun 23, 2012, at 1:49 PM, Alexey Tourbin  wrote:
> 

<…>

> Perhaps making base61 encoding MANDATORY in rpm would displease
> everyone equally: choosing a prime is as "pretty" as all other
> encoding criteria. 
> 

In the interest of getting off negative nerdy obscure discussions, let's
try a positive alternative application for Golob-Rice subset operations.

All RPMv4 packages attach (a lightly filtered) file(1) magic string to
every file.

The file(1) data is mostly usable as a "keyword" namespace exactly as is.
Yes there are flaws: however magic strings are from file(1) is about
as good as any other de facto keyword tagging of file content.

keywords are strings just like elf symbols are, and set:versions (or Bloom 
filters)
are a compact representation from which its rather easy to do subset 
computations.

One extension that would be needed is a "closest" metric in order to "prefer"
the largest subset overlap: with set:versions any contained subset will satisfy 
the
logical assertions, and there's no easy way to prefer the larger sub-set.

There's a similar application with dual/triple/... licensed software and 
computing
per-file, not per-package, license affinity precisely where set:versions (or 
Bloom
filters) will represent keywords (like "LGPLv2" or "BSD") easily. Licenses 
unlike
file(1) magic keywords will require name space administration. SUrely LSB and 
LFF
are seeking something useful to do for RPM packaging these days, and might
be convinced to make some set of license tokens "standard" so that license
affinity can be precisely computed in distributed software.

73 de Jeff



Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-23 Thread Jeffrey Johnson

On Jun 23, 2012, at 1:49 PM, Alexey Tourbin  wrote:

> On Sat, Jun 23, 2012 at 1:55 AM, Jeffrey Johnson  wrote:
>> There are lots of usage cases for efficient sub-set computations
>> in package management, not just as a de facto API/ABI check
>> using ELF symbols. Most of the other usage cases for efficient
>> sub-set computations are not subject to an ultimately optimal
>> string encoding.
> 
> The possibility of optimal string encoding should not be
> underestimated. If you can't "write it down with a pencil", which by
> the way refers to Alan Turing's style of reasoning, it becomes very
> problematic anyway. Your intuition is probably that raw bytes are
> cheap because you can bitwise-AND them in terms of direct CPU
> instructions, and any "encoding" is expensive because you have to
> crunch bits a lot. This is not very true in practice, though. But you
> should arm yourself with valgrind(1) and spend a few days with it
> before you understand how you waste you CPU powers (and bandwidth, for
> that matter). Set-string encoding has reasonable, and affordable,
> cost. It can be all put together and presented in a rather less
> pessimistic manner. (For example, there is a cache_decode_set routine
> that boosts things by a factor of about o 5. This is simply due to the
> fact that you do not always have to decode the same Provides version
> all over again. This is part of that "fancy-schmancy" stuff which you
> must ignore on the first reading.)
> 
> So I suggest that set-strings must cover all usages of set-subset
> computations, where static data structures are appropriate (so that
> you compute it, write into RPM header, and do not expect to modify
> it). it is still "efficient", despite the fact that we try hard to
> achieve a better string representation. The fact that it can be
> packaged nicely does not make it any worse in other respects, except
> that obviously we can stuff somewhat less bits into alnum characters.
> But I also wanted it to look "nice" (this is why base62 was devised
> over much simpler base64 encoding). But so, it looks nice, and it is
> efficient. Why are you unhappy? Do you necessarily require that you
> have to do bitmask tests yourself?

I'm not "unhappy". Rather I'm pointing out that -- if your primary
goal is bandwidth reduction in *.rpm package metadata -- that there are
equivalent savings that aren't from "pretty" encodings.

The other reason is this: missing values in the {E,V,R} triple
(and attaching some reasonable DWIM semantic) lead to huge
amounts of pointless discussion. Adding all of
twiddle-in-version
set:versions
dpkg-like
is all underway atm. Running digest/encoding strings through
"legacy compatible" (i.e. what most rpm tool chains will
continue to be using for years) rpmvercmp(3) is an accident
waiting to be reported again again again.

Its just as easy to attach a tag with binary data as any other
approach to "pretty" encodings, all of which involve some
aesthetic choice of permitted/excluded ASCII characters, and so
the encodings proliferate to meet all possible opinions.

Perhaps making base61 encoding MANDATORY in rpm would displease
everyone equally: choosing a prime is as "pretty" as all other
encoding criteria. 

73 de Jeff

__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-23 Thread Alexey Tourbin
On Sat, Jun 23, 2012 at 1:55 AM, Jeffrey Johnson  wrote:
> There are lots of usage cases for efficient sub-set computations
> in package management, not just as a de facto API/ABI check
> using ELF symbols. Most of the other usage cases for efficient
> sub-set computations are not subject to an ultimately optimal
> string encoding.

The possibility of optimal string encoding should not be
underestimated. If you can't "write it down with a pencil", which by
the way refers to Alan Turing's style of reasoning, it becomes very
problematic anyway. Your intuition is probably that raw bytes are
cheap because you can bitwise-AND them in terms of direct CPU
instructions, and any "encoding" is expensive because you have to
crunch bits a lot. This is not very true in practice, though. But you
should arm yourself with valgrind(1) and spend a few days with it
before you understand how you waste you CPU powers (and bandwidth, for
that matter). Set-string encoding has reasonable, and affordable,
cost. It can be all put together and presented in a rather less
pessimistic manner. (For example, there is a cache_decode_set routine
that boosts things by a factor of about o 5. This is simply due to the
fact that you do not always have to decode the same Provides version
all over again. This is part of that "fancy-schmancy" stuff which you
must ignore on the first reading.)

So I suggest that set-strings must cover all usages of set-subset
computations, where static data structures are appropriate (so that
you compute it, write into RPM header, and do not expect to modify
it). it is still "efficient", despite the fact that we try hard to
achieve a better string representation. The fact that it can be
packaged nicely does not make it any worse in other respects, except
that obviously we can stuff somewhat less bits into alnum characters.
But I also wanted it to look "nice" (this is why base62 was devised
over much simpler base64 encoding). But so, it looks nice, and it is
efficient. Why are you unhappy? Do you necessarily require that you
have to do bitmask tests yourself?
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org