Re: refined implementation of set-versions

Jeffrey Johnson Sat, 21 Apr 2012 07:26:42 -0700

On Apr 21, 2012, at 1:52 AM, Alexey Tourbin wrote:

> On Fri, Apr 20, 2012 at 4:12 PM, R P Herrold <herr...@owlriver.com> wrote:
>> On Fri, 20 Apr 2012, Alexey Tourbin wrote:
>> 
>>> I have just learnt that rpm5 project has borrowed set-string
>>> implementation recently from ALT Linux. At the very same time, I was
>>> working on on a new and improved encoding scheme which can make
>> 
>> ... This is exciting news -- This seems like it would be a useful library.
>>  What package in ALT are you doing this work in, or alternatiely, is there a
>> version control repository that I could check out to read your ongoing work?
> 
> The goal is not only to improve the implementation, but also to refine
> basic concepts and designs.  Actually, set.c is already usable as a
> library on its own. Why, it provides an API for creating set-versions,
> and it also implements rpmsetcmp() comparison routine. That's just
> enough to get the job done, and everything else then is the
> implementation details which it hides in a particularly perfect
> manner. On the other hand, the details are concealed perhaps a bit too
> much - up to the point where it is not clear what set-versions are and
> how exactly they are supposed to work. It is desirable then to expose
> a lower-level API which clarifies the concept of set-versions while
> still hiding less important implementation details.
>


Yes.

Here is (my) attempt to restate the "bit too much" issue in "plain speak":

        As soon as you attempt a transform (in this case Golomb-Rice
        encoding) on the original WYSIWYG data for any reason (for
        set:versions, one is replacing a collection of strings with a single
        string), then one loses the ability to easily debug by inspection.

(aside)
There are two existing implementations that I will point out that
attempt to solve the (if you will permit) "The WYSIWYG Problem" by
hybrid concatenation.

1) NixOS uses a hybrid identifier of (in RPM jargon)
        ID-NEVRA
The ID is a truncated SHA1 encoded in base32 (i.e. the "transform")
and the NEVRA is a human readable mnemonic identifier to make
it easier to find what one is looking for.

2) WindRiver 3.0 (aka LDAT) wished to capture dynamic build
configuration into an identifier that could be carried in *.rpm packages.
The implementation used flat files with (essentially)
        ID NEVRA
pairs, one per line. The ID was a SHA1 of some dynamic build configuration
and "NEVRA" is/was some convenient identifier (not limited to package naming).
Multiple lines in the flat file serve as a "container" abstraction, and there's
an ability to chain by including "containers" within "containers". Finding
the exact item that changed is nothing more than an efficient Merkle hash tree
search opening/verifying container items and re-verifying the ID. When the
ID doesn't match, the "NEVRA" (which isn't limited to package naming) is
useful (in the human WYSIWYG sense) for drilling down to the exact item
that changed inspire of complex transforms.

I'd like to see set:versions have the ability to "chain". Its little more than
choosing some convention on the actual symbols that need to be encoded
afaik.

> So what's a set-version?
> 
> Intuitively, a set-version represents a set of symbols. This actually
> suggests a new "de facto" approach to library versioning: the required
> version of a library, in addition to its soname, is simply the set of
> library symbols (i.e. functions and global variables) which we need to
> use. Likewise, the version provided by a library is simply the set of
> all symbols exported by the library. The key point is that Requires
> versions and Provides versions can be produced in a relatively
> independent manner, and meaningfully compared at later stages; that
> is, it is possible to check if R \subset P.  The check is
> probabilistic, which indicates a possibility of error.  The error rate
> is reasonably small, though, and can be further controlled by a
> parameter. What's more important is that only "false positive" kind of
> error is possible - that is, in the worst case, the check simply does
> not work, but at least we lose nothing. Another kind of error, a
> "false alarm", is not possible. In this respect, set-versions are
> similar to Bloom filters.
> 

So why not use Bloom filters (with some encoding like base62)?

I suspect that the answer has to do with doing R \subset P efficiently
(which is a very powerful feature even if probabilistic).

Is the probability of false result tunable or dictated by the algorithm?

(aside)
There's another means to compute set differences efficiently in the
two papers here:
        http://code.google.com/p/sks-keyserver/
The technique there is to generate polynomials with zeroes at
the "symbols" (in the SKS case GPG 64bit fingerprints).

The (P,Q} polynomials for the (possibly different) sets are exchanged.

The quotient = P/Q is computed: the zeroes are the set difference
(and the subset overlap is equally well known but usually a larger
polynomial)

None of the above is computationally infeasible even if the math is gory.

> Set-versions have other nice properties which you might suspect, and
> the one which is not so nice: the length of n-element set-version is
> O(n). This is a fundamental limit which cannot be overcome. However, a
> practical and feasible implementation is still possible. This outlines
> two implementation priorities: 1) set-versions should be as short in
> size as possible - actually their size should be close to the
> information-theoretical minimum; 2) however, this must not tamper with
> the possibility of fast decoding and comparison. With current
> implementation, when the error rate is fixed at about 0.1%, Provides
> version take about 2 alphanumeric characters per symbol, and Requires
> versions, since they are much more sparse, can take up to 3
> alphanumeric characters per symbol. For a repo of about 10,000
> packages, Requires and Provides set-versions can take only about 10M
> total (but this assumes that some dependency optimizations are
> performed and also that superfluous plugin-like Provides are
> excluded). The check of all set-versioned dependencies, such as
> performed by "apt-cache unmet", can be finished within a second. To
> sum up, this is a compromise; but it is a favorable compromise.
> 

Nice!

> So what a set-version really is?  What if we say that a set-version is
> just a set of numbers, such as hash values obtained after hashing each
> symbol individually? Simple as it is, this approach can be used to
> express everything else. More precisely, we need a scheme to encode n
> m-bit numbers. For some reason which will become apparent later, we
> need to supply m, which is actual bits per hash value, aka bpp,
> explicitly.
> 
> So I think we can define a lower-level "set-string" encoding API as follows:
> 
> /** \ingroup setstring
> * Estimate the size of a string buffer for encoding.
> * @param v             the values, sorted and unique
> * @param n             number of values
> * @param bpp           actual bits per value, 8..32
> * @return              buffer size for encoding, < 0 on error
> */
> int setstringEncodeSize(const unsigned *v, int n, int bpp);
> 
> /** \ingroup setstring
> * Encode a set of numeric values into alnum string.
> * @param v             the values, sorted and unique
> * @param n             number of values
> * @param bpp           actual bits per value, 8..32
> * @param s             alnum output, null-terminated on success
> * @return              alnum string length, < 0 on error
> */
> int setstringEncode(const unsigned *v, int n, int bpp, char *s);
> 
> The decoding API mirrors the encoding routines:
> 
> /** \ingroup setstring
> * Estimate the number of values obtained after decoding a set-string.
> * @param s             alnum string to decode, null-terminated
> * @param len           alnum string length
> * @return              number of values (upper size), < 0 on error
> */
> int setstringDecodeSize(const char *s, int len);
> 
> /** \ingroup setstring
> * Decode alnum string into a set of numeric values.
> * @param s             alnum string to decode, null-terminated
> * @param len           alnum string length
> * @v                   decoded values (sorted and unique)
> * @pbpp                original bits per value
> * @return              number of values, < 0 on error
> */
> int setstringDecode(const char *s, int len, unsigned *v, int *pbpp);
> 
> The first things to notice is that these routines only do bit
> crunching, and they don't not try to do something else which can be
> even remotely inefficient. For example, they do not allocate storage
> on their own; instead, they facilitate allocation by the caller. This
> is because many Requires versions are small, and can be decoded on the
> stack, thus saving a malloc() call. If malloc is cheap, this might
> turn out to be an unnecessary complication. The decoding routine won't
> even call strlen(). This is because average Provides version length is
> about 2K, and in certain scenarios the length can be reused.
> 
> The are also two important things to notice in this overall design: 1)
> the error rate can be controlled by choosing an appropriate bpp
> parameter; 2) it is even possible to compare two set-versions with
> different bpp parameter, provided that the same hash function, modulo
> power of two, was used: the set v[] with higher bpp value can be
> "downsampled" by stripping higher bits and sorting the values again.
> The R \subset P check is itself straightforward.

Good: (and answered my question above re is error rate tunable).

Could you add something like this routine to the set:versions API so
that the error rate can be adjusted to different usage cases? The routine
here chooses Bloom filter parameters {m,k} given
        n       (an expected population)
        e       (probability of failure)
The routine is simple so I'll just cut-n-paste (see rpmio/rpmbf.c, also in POPT 
code)

I forget where I found the heuristic: but its quite useful to be able to adjust 
the
error rate to need when necessary.

rpm uses Bloom filters a lot these days, and set:versions surely have many other
usage cases than as an efficient representation for a R \subset P operation.

One other question:

Golomb-Rice is a prefix code (afaik). What I can't quite tell is this
        Can set:version strings be compared only for equality?
What is confusing me is that sisyphus packages have dependencies written like
        Requires: libfoo.so.1 >= set:yaddayadda
I suspect that you have a different (than naive arithmetic) semantic for 
"greater than"
based on \subset where what is actually being tested is this
        Does the Provides: have the Requires: set:version as a subset?
mapped into "greater than". That semantic change is perfectly natural but
is quite different than what most people (like me ;-) will read into that 
dependency.

73 de Jeff
______________________________________________________________________
RPM Package Manager                                    http://rpm5.org
Developer Communication List                        rpm-devel@rpm5.org

Re: refined implementation of set-versions

Reply via email to