Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jun 23, 2012, at 2:29 PM, Jeffrey Johnson wrote: > > On Jun 23, 2012, at 2:10 PM, Jeffrey Johnson wrote: > >> >> On Jun 23, 2012, at 1:49 PM, Alexey Tourbin wrote: >> > > <…> > >> Perhaps making base61 encoding MANDATORY in rpm would displease >> everyone equally: choosing a prime is as "pretty" as all other >> encoding criteria. >> > > In the interest of getting off negative nerdy obscure discussions, let's > try a positive alternative application for Golob-Rice subset operations. > > All RPMv4 packages attach (a lightly filtered) file(1) magic string to > every file. > > The file(1) data is mostly usable as a "keyword" namespace exactly as is. > Yes there are flaws: however magic strings are from file(1) is about > as good as any other de facto keyword tagging of file content. > > keywords are strings just like elf symbols are, and set:versions (or Bloom > filters) > are a compact representation from which its rather easy to do subset > computations. > > One extension that would be needed is a "closest" metric in order to "prefer" > the largest subset overlap: with set:versions any contained subset will > satisfy the > logical assertions, and there's no easy way to prefer the larger sub-set. > > There's a similar application with dual/triple/... licensed software and > computing > per-file, not per-package, license affinity precisely where set:versions (or > Bloom > filters) will represent keywords (like "LGPLv2" or "BSD") easily. Licenses > unlike > file(1) magic keywords will require name space administration. SUrely LSB and > LFF > are seeking something useful to do for RPM packaging these days, and might > be convinced to make some set of license tokens "standard" so that license > affinity can be precisely computed in distributed software. > The above suggestions were an attempt to generalize assertions to "feature vectors" (which is the standard technique by which SAT solvers can start to handle "preference"). Meanwhile here's another application of existing set:versions "containers" Query-by-ELF-symbol that's perfectly obvious as well. 1) Generate tables for Provideversion/Requireversion by mentioning in the usual macro that configures Berkeley DB. The table will be automagically generated lazily on 1st access (or do --rebuilddb if you insist on being a Luddite). 2) Generate a set:versions set with the symbols of interest. POPT already has a data type for #define POPT_ARG_BITSET 16U+14U /*!< arg ==> bit set */ which results in a Bloom filter (set:versions can be added to popt just like rpmbf was by swiping code from RPM: just as easy is to add some vectors that RPM fills in on startup so that option processing can generate a set:versions set). 3) Retrieve the set:versions values from the PV/RV table; the retrieval will be blazingly fast because there is no need to do secondary loading of headers, and there is already means to retrieve by a *RE pattern like "^set:.*$" that essentially do a partial retrieve on "set:" 4) Filter the retrieved set:versions keys down to a set of headers to load; do the usual --queryformat display of NEVRA (or whatever other --queryformat is useful). None of the above is very hard to do (with @rpm5.org code), and is sure to be lots faster than the equivalent scripted operations because of Berkeley DB and btree indexing. 73 de Jeff
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jun 23, 2012, at 2:10 PM, Jeffrey Johnson wrote: > > On Jun 23, 2012, at 1:49 PM, Alexey Tourbin wrote: > <…> > Perhaps making base61 encoding MANDATORY in rpm would displease > everyone equally: choosing a prime is as "pretty" as all other > encoding criteria. > In the interest of getting off negative nerdy obscure discussions, let's try a positive alternative application for Golob-Rice subset operations. All RPMv4 packages attach (a lightly filtered) file(1) magic string to every file. The file(1) data is mostly usable as a "keyword" namespace exactly as is. Yes there are flaws: however magic strings are from file(1) is about as good as any other de facto keyword tagging of file content. keywords are strings just like elf symbols are, and set:versions (or Bloom filters) are a compact representation from which its rather easy to do subset computations. One extension that would be needed is a "closest" metric in order to "prefer" the largest subset overlap: with set:versions any contained subset will satisfy the logical assertions, and there's no easy way to prefer the larger sub-set. There's a similar application with dual/triple/... licensed software and computing per-file, not per-package, license affinity precisely where set:versions (or Bloom filters) will represent keywords (like "LGPLv2" or "BSD") easily. Licenses unlike file(1) magic keywords will require name space administration. SUrely LSB and LFF are seeking something useful to do for RPM packaging these days, and might be convinced to make some set of license tokens "standard" so that license affinity can be precisely computed in distributed software. 73 de Jeff
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jun 23, 2012, at 1:49 PM, Alexey Tourbin wrote: > On Sat, Jun 23, 2012 at 1:55 AM, Jeffrey Johnson wrote: >> There are lots of usage cases for efficient sub-set computations >> in package management, not just as a de facto API/ABI check >> using ELF symbols. Most of the other usage cases for efficient >> sub-set computations are not subject to an ultimately optimal >> string encoding. > > The possibility of optimal string encoding should not be > underestimated. If you can't "write it down with a pencil", which by > the way refers to Alan Turing's style of reasoning, it becomes very > problematic anyway. Your intuition is probably that raw bytes are > cheap because you can bitwise-AND them in terms of direct CPU > instructions, and any "encoding" is expensive because you have to > crunch bits a lot. This is not very true in practice, though. But you > should arm yourself with valgrind(1) and spend a few days with it > before you understand how you waste you CPU powers (and bandwidth, for > that matter). Set-string encoding has reasonable, and affordable, > cost. It can be all put together and presented in a rather less > pessimistic manner. (For example, there is a cache_decode_set routine > that boosts things by a factor of about o 5. This is simply due to the > fact that you do not always have to decode the same Provides version > all over again. This is part of that "fancy-schmancy" stuff which you > must ignore on the first reading.) > > So I suggest that set-strings must cover all usages of set-subset > computations, where static data structures are appropriate (so that > you compute it, write into RPM header, and do not expect to modify > it). it is still "efficient", despite the fact that we try hard to > achieve a better string representation. The fact that it can be > packaged nicely does not make it any worse in other respects, except > that obviously we can stuff somewhat less bits into alnum characters. > But I also wanted it to look "nice" (this is why base62 was devised > over much simpler base64 encoding). But so, it looks nice, and it is > efficient. Why are you unhappy? Do you necessarily require that you > have to do bitmask tests yourself? I'm not "unhappy". Rather I'm pointing out that -- if your primary goal is bandwidth reduction in *.rpm package metadata -- that there are equivalent savings that aren't from "pretty" encodings. The other reason is this: missing values in the {E,V,R} triple (and attaching some reasonable DWIM semantic) lead to huge amounts of pointless discussion. Adding all of twiddle-in-version set:versions dpkg-like is all underway atm. Running digest/encoding strings through "legacy compatible" (i.e. what most rpm tool chains will continue to be using for years) rpmvercmp(3) is an accident waiting to be reported again again again. Its just as easy to attach a tag with binary data as any other approach to "pretty" encodings, all of which involve some aesthetic choice of permitted/excluded ASCII characters, and so the encodings proliferate to meet all possible opinions. Perhaps making base61 encoding MANDATORY in rpm would displease everyone equally: choosing a prime is as "pretty" as all other encoding criteria. 73 de Jeff __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Sat, Jun 23, 2012 at 1:55 AM, Jeffrey Johnson wrote: > There are lots of usage cases for efficient sub-set computations > in package management, not just as a de facto API/ABI check > using ELF symbols. Most of the other usage cases for efficient > sub-set computations are not subject to an ultimately optimal > string encoding. The possibility of optimal string encoding should not be underestimated. If you can't "write it down with a pencil", which by the way refers to Alan Turing's style of reasoning, it becomes very problematic anyway. Your intuition is probably that raw bytes are cheap because you can bitwise-AND them in terms of direct CPU instructions, and any "encoding" is expensive because you have to crunch bits a lot. This is not very true in practice, though. But you should arm yourself with valgrind(1) and spend a few days with it before you understand how you waste you CPU powers (and bandwidth, for that matter). Set-string encoding has reasonable, and affordable, cost. It can be all put together and presented in a rather less pessimistic manner. (For example, there is a cache_decode_set routine that boosts things by a factor of about o 5. This is simply due to the fact that you do not always have to decode the same Provides version all over again. This is part of that "fancy-schmancy" stuff which you must ignore on the first reading.) So I suggest that set-strings must cover all usages of set-subset computations, where static data structures are appropriate (so that you compute it, write into RPM header, and do not expect to modify it). it is still "efficient", despite the fact that we try hard to achieve a better string representation. The fact that it can be packaged nicely does not make it any worse in other respects, except that obviously we can stuff somewhat less bits into alnum characters. But I also wanted it to look "nice" (this is why base62 was devised over much simpler base64 encoding). But so, it looks nice, and it is efficient. Why are you unhappy? Do you necessarily require that you have to do bitmask tests yourself? __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org