Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-15 Thread Ciaran McCreesh
On Sun, 15 Feb 2009 16:48:44 -0800
Zac Medico wrote:
> > It only comes into its own if you expect there to be a long time
> > between an EAPI being used in the tree and an EAPI being supported
> > by a package manager. And even then, it's probably easier to just
> > do a minor stable release straight away with rules for "don't know
> > how to use this EAPI, but do know how to read metadata cache
> > entries for it" whilst keeping new EAPI support for the next major
> > release.
> 
> But how will it know if it supports those cache entries? Wouldn't
> the easiest way to determine that be to have a DIGESTS version
> identifier? Otherwise, the only way for it to know would be to parse
> it and either throw a parse error if necessary or proceed all the
> way to the digest verification step (if it doesn't hit a parse error
> first).

You just need to give your package manager a way of dealing with EAPIs
where it can verify that DIGESTS is correct, but not make use of the
ebuild in question beyond that. Rather than having supported and
unsupported EAPIs, have supported, partially-understood and unsupported
EAPIs.
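
A minimal sketch of that three-way split, in Python since that is what
Portage is written in. The EAPI sets and every name below are
hypothetical, made up purely for illustration; no package manager is
claimed to work this way.

```python
from enum import Enum


class EapiSupport(Enum):
    SUPPORTED = "supported"
    PARTIAL = "partially-understood"
    UNSUPPORTED = "unsupported"


# Hypothetical sets; a real package manager would define its own.
FULLY_SUPPORTED_EAPIS = frozenset({"0", "1", "2"})
# EAPIs whose cache entries (and DIGESTS) can be read, but nothing more.
CACHE_ONLY_EAPIS = frozenset({"3"})


def classify_eapi(eapi: str) -> EapiSupport:
    """Place an EAPI into one of the three tiers."""
    if eapi in FULLY_SUPPORTED_EAPIS:
        return EapiSupport.SUPPORTED
    if eapi in CACHE_ONLY_EAPIS:
        return EapiSupport.PARTIAL
    return EapiSupport.UNSUPPORTED


def can_verify_digests(eapi: str) -> bool:
    """DIGESTS can be checked for anything that is at least partially understood."""
    return classify_eapi(eapi) is not EapiSupport.UNSUPPORTED


def can_use_ebuild(eapi: str) -> bool:
    """Only fully supported EAPIs may actually be used."""
    return classify_eapi(eapi) is EapiSupport.SUPPORTED
```

With these made-up sets, an EAPI 3 ebuild could have its DIGESTS
verified but would still be masked as unusable.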

-- 
Ciaran McCreesh




Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-15 Thread Zac Medico

Zac Medico wrote:
> Ciaran McCreesh wrote:
>> On Sun, 15 Feb 2009 15:56:18 -0800
>> It only comes into its own if you expect there to be a long time
>> between an EAPI being used in the tree and an EAPI being supported by a
>> package manager. And even then, it's probably easier to just do a minor
>> stable release straight away with rules for "don't know how to use this
>> EAPI, but do know how to read metadata cache entries for it" whilst
>> keeping new EAPI support for the next major release.
> 
> But how will it know if it supports those cache entries? Wouldn't
> the easiest way to determine that be to have a DIGESTS version
> identifier? Otherwise, the only way for it to know would be to parse
> it and either throw a parse error if necessary or proceed all the
> way to the digest verification step (if it doesn't hit a parse error
> first).

Well, I guess you were saying that it should just use the EAPI.
Given that we don't have much control over how often users upgrade,
I'd still prefer to have a DIGESTS version identifier.
-- 
Thanks,
Zac



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-15 Thread Zac Medico

Ciaran McCreesh wrote:
> On Sun, 15 Feb 2009 15:56:18 -0800
> Zac Medico wrote:
>> If the package manager is not able to validate a cache entry that
>> has been generated for an unsupported EAPI, then it will be forced
>> to regenerate the metadata in order to check whether or not the EAPI
>> has changed (example given 2 emails ago). Don't you agree that it
>> would be useful to be able to avoid metadata generation in cases
>> like this, if possible?
> 
> Well... The solution you give only *sometimes* avoids it, so it's only
> worth it if we expect that most EAPI changes won't mess around with
> inheriting at all. And given that we probably want per-cat/pkg
> eclasses...

Well, I think it's more like "the vast majority of the time" than
just "sometimes", and it's a lot better than "never".

> It only comes into its own if you expect there to be a long time
> between an EAPI being used in the tree and an EAPI being supported by a
> package manager. And even then, it's probably easier to just do a minor
> stable release straight away with rules for "don't know how to use this
> EAPI, but do know how to read metadata cache entries for it" whilst
> keeping new EAPI support for the next major release.

But how will it know if it supports those cache entries? Wouldn't
the easiest way to determine that be to have a DIGESTS version
identifier? Otherwise, the only way for it to know would be to parse
it and either throw a parse error if necessary or proceed all the
way to the digest verification step (if it doesn't hit a parse error
first).

> Honestly, I don't think it'll be useful often enough that it's worth
> the added ick.

Doesn't a simple version identifier seem less icky than checking for
both a parse error and digest verification failure?
-- 
Thanks,
Zac



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-15 Thread Ciaran McCreesh
On Sun, 15 Feb 2009 15:56:18 -0800
Zac Medico wrote:
> If the package manager is not able to validate a cache entry that
> has been generated for an unsupported EAPI, then it will be forced
> to regenerate the metadata in order to check whether or not the EAPI
> has changed (example given 2 emails ago). Don't you agree that it
> would be useful to be able to avoid metadata generation in cases
> like this, if possible?

Well... The solution you give only *sometimes* avoids it, so it's only
worth it if we expect that most EAPI changes won't mess around with
inheriting at all. And given that we probably want per-cat/pkg
eclasses...

It only comes into its own if you expect there to be a long time
between an EAPI being used in the tree and an EAPI being supported by a
package manager. And even then, it's probably easier to just do a minor
stable release straight away with rules for "don't know how to use this
EAPI, but do know how to read metadata cache entries for it" whilst
keeping new EAPI support for the next major release.

Honestly, I don't think it'll be useful often enough that it's worth
the added ick.

-- 
Ciaran McCreesh




Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-15 Thread Zac Medico

Ciaran McCreesh wrote:
> On Sun, 15 Feb 2009 15:26:44 -0800
> Zac Medico wrote:
>>>> Regardless of what the EAPI value happens to be, the package
>>>> manager should be able to trust that the version identifier is a
>>>> reliable indicator of the mechanism which should be used to
>>>> validate the integrity of the cache entry.
>>> Validate it against what? If EAPI is unsupported, the package
>>> manager can't make use of INHERITED to see what DIGESTS means.
>> In the example given, the DIGESTS version identifier would serve to
>> indicate that the INHERITED field behaves as required by the
>> validation mechanism (regardless of EAPI). If INHERITED can no
>> longer be used like that in a new EAPI, the DIGESTS format/version
>> will have to be bumped.
> 
> So in effect we're introducing a second level of versioned
> compatibility testing? Strikes me as excessive, especially since it
> only works for EAPIs where the scope of changes is small enough to
> keep the meaning of INHERITED and DIGESTS the same...

If the package manager is not able to validate a cache entry that
has been generated for an unsupported EAPI, then it will be forced
to regenerate the metadata in order to check whether or not the EAPI
has changed (example given 2 emails ago). Don't you agree that it
would be useful to be able to avoid metadata generation in cases
like this, if possible?
-- 
Thanks,
Zac



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-15 Thread Ciaran McCreesh
On Sun, 15 Feb 2009 15:26:44 -0800
Zac Medico wrote:
> >> Regardless of what the EAPI value happens to be, the package
> >> manager should be able to trust that the version identifier is a
> >> reliable indicator of the mechanism which should be used to
> >> validate the integrity of the cache entry.
> > 
> > Validate it against what? If EAPI is unsupported, the package
> > manager can't make use of INHERITED to see what DIGESTS means.
> 
> In the example given, the DIGESTS version identifier would serve to
> indicate that the INHERITED field behaves as required by the
> validation mechanism (regardless of EAPI). If INHERITED can no
> longer be used like that in a new EAPI, the DIGESTS format/version
> will have to be bumped.

So in effect we're introducing a second level of versioned
compatibility testing? Strikes me as excessive, especially since it
only works for EAPIs where the scope of changes is small enough to
keep the meaning of INHERITED and DIGESTS the same...

-- 
Ciaran McCreesh




Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-15 Thread Zac Medico

Ciaran McCreesh wrote:
> On Sun, 15 Feb 2009 14:51:10 -0800
> Zac Medico wrote:
>> Regardless of what the EAPI value happens to be, the package manager
>> should be able to trust that the version identifier is a reliable
>> indicator of the mechanism which should be used to validate the
>> integrity of the cache entry.
> 
> Validate it against what? If EAPI is unsupported, the package
> manager can't make use of INHERITED to see what DIGESTS means.

In the example given, the DIGESTS version identifier would serve to
indicate that the INHERITED field behaves as required by the
validation mechanism (regardless of EAPI). If INHERITED can no
longer be used like that in a new EAPI, the DIGESTS format/version
will have to be bumped.
-- 
Thanks,
Zac



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-15 Thread Ciaran McCreesh
On Sun, 15 Feb 2009 14:51:10 -0800
Zac Medico wrote:
> Regardless of what the EAPI value happens to be, the package manager
> should be able to trust that the version identifier is a reliable
> indicator of the mechanism which should be used to validate the
> integrity of the cache entry.

Validate it against what? If EAPI is unsupported, the package
manager can't make use of INHERITED to see what DIGESTS means.

-- 
Ciaran McCreesh




Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-15 Thread Zac Medico

Zac Medico wrote:
> Tiziano Müller wrote:
>> I'd recommend prefixing the digest with a "{TYPE}" (like for hashed
>> passwords) to be able to change the digest algorithm as needed
>> (especially with regard to the current SHA successor competition).
>> This allows a future package manager which might use SHA-3 for hashing
>> (once it's released) to still check old digests. Furthermore it would
>> allow for easier transition and only needs a definition of allowed
>> hashes instead of a specific one.
> 
> I like that idea. That way it's not necessary to bump the EAPI in
> order to change the hash function. So, a typical DIGESTS value might
> look like this:
> 
> SHA1 02021be38b a28b191904 3992945426 6ec21b29a3

While thinking about the implementation details, I realized that it
would be very useful to give the DIGESTS data a version identifier
that is independent of the EAPI. This will allow a package manager
to validate a cache entry that has been generated for an unsupported
EAPI, and allows it to trust that there's no point in regenerating
the cache entry (to see if the EAPI has changed since the last time
that it was generated). For example, suppose that we introduce EAPI
3 and a package manager that does not support EAPI 3 encounters a
cache entry for an EAPI 3 ebuild. If the package manager recognizes
the DIGESTS data version and it's able to validate the cache entry,
then it can avoid the cost of regenerating metadata for that ebuild.
If the user modifies the ebuild locally to change the EAPI to a
supported EAPI (from 3 to 2, for example), the DIGESTS data will
allow the package manager to recognize that the cache entry has been
invalidated and needs to be regenerated (and it will discover that
the EAPI has changed to a supported value).
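
To make the EAPI 3 scenario concrete, here is a rough sketch of the
decision, assuming a hypothetical set of DIGESTS data versions that the
package manager understands and a digest comparison that has already
been done elsewhere. None of these names come from Portage.

```python
from typing import Optional

# Hypothetical: the DIGESTS data versions this package manager can validate.
KNOWN_DIGESTS_VERSIONS = frozenset({"0"})


def must_regenerate(digests_version: Optional[str], digests_match: bool) -> bool:
    """Return True when the cache entry cannot be trusted as-is.

    digests_version: version identifier found in the cache entry, or None
                     if the entry carries no DIGESTS data at all.
    digests_match:   result of comparing the stored digests against the
                     files currently on disk.
    """
    if digests_version is None or digests_version not in KNOWN_DIGESTS_VERSIONS:
        # No way to validate the entry, so fall back to regenerating metadata.
        return True
    # A recognized format: regenerate only if something actually changed.
    return not digests_match


# Untouched EAPI 3 ebuild, recognized DIGESTS version: keep the cache entry.
print(must_regenerate("0", True))   # False
# User edited the ebuild locally (e.g. EAPI 3 -> 2): digests no longer match.
print(must_regenerate("0", False))  # True
```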

So, with a "0" version identifier at the beginning of the DIGESTS
data, a typical entry could look like this:

0 SHA1 02021be38b a28b191904 3992945426 6ec21b29a3

Regardless of what the EAPI value happens to be, the package manager
should be able to trust that the version identifier is a reliable
indicator of the mechanism which should be used to validate the
integrity of the cache entry.
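
As an illustration only, an entry in that shape could be split apart
and checked along these lines. The field layout (version, hash type,
then truncated hex digests) is read off the example above; what each
digest covers is not spelled out here, so the sketch makes no
assumption about it. The function and class names are hypothetical.

```python
import hashlib
from typing import List, NamedTuple


class Digests(NamedTuple):
    version: str       # DIGESTS data version identifier, e.g. "0"
    hash_name: str     # hash type prefix, e.g. "SHA1"
    values: List[str]  # truncated hex digests, in the order stored


def parse_digests(raw: str) -> Digests:
    """Split a whitespace-separated DIGESTS value into its parts."""
    fields = raw.split()
    if len(fields) < 3:
        raise ValueError("expected: VERSION HASH DIGEST [DIGEST ...]")
    version, hash_name, *values = fields
    return Digests(version, hash_name, values)


def truncated_digest(data: bytes, hash_name: str, length: int = 10) -> str:
    """Hash data with the named algorithm and keep the first `length` hex digits."""
    return hashlib.new(hash_name.lower(), data).hexdigest()[:length]


entry = parse_digests("0 SHA1 02021be38b a28b191904 3992945426 6ec21b29a3")
print(entry.version, entry.hash_name, len(entry.values))  # 0 SHA1 4
```

Because the hash type travels with the data, switching to a different
algorithm would only change the second field, not the parsing.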
-- 
Thanks,
Zac



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-14 Thread Zac Medico

Brian Harring wrote:
> On Wed, Feb 11, 2009 at 02:01:24AM -0800, Zac Medico wrote:
>> Brian Harring wrote:
>>> On Tue, Feb 10, 2009 at 12:55:51PM -0800, Zac Medico wrote:
>>>> Brian Harring wrote:
>>>>> Frankly, forget compatibility- the current format could stand to die.
>>>>> The repository format is an ever growing mess- leave it as is and
>>>>> work on cutting over to something sane.
>>>> Changing the repository layout is a pretty radical thing to do.
>>>> You're welcome to start a new subject for that if you'd like but I'd
>>>> prefer to keep the scope of this thread focussed on the cache format
>>>> for the existing repository layout.
>> I don't intend to repeal the cache mtime requirement, at least
>> (especially) not on gentoo's rsync tree. However, I wouldn't say
>> that it's something that necessarily needs to be a requirement for
>> other repositories or overlays, moving forward (assuming that an
>> alternative validation framework is in place).
> 
> So... you want a subset of repositories to have cache algo x, while 
> the rest have the old algo.  And since the repo w/ algo x isn't 
> marked in some fashion, all managers will have to use new algo x for 
> compatibility reasons.  Right...

Clients using either validation mechanism can consume the same
cache. If the client recognizes DIGESTS data and it's available in a
given cache entry, naturally the client should prefer the DIGESTS
validation mechanism because it's more reliable.
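
A small sketch of what that preference might look like on the client
side, assuming the cache entry has been loaded into a dict and that
"DIGESTS" is simply one more key in it. The function name and the
return values are hypothetical.

```python
from typing import Mapping


def choose_validation(entry: Mapping[str, str], understands_digests: bool) -> str:
    """Pick the validation mechanism for one cache entry.

    Clients that recognize DIGESTS use it when the entry provides it;
    everything else keeps relying on the existing mtime comparison.
    """
    if understands_digests and entry.get("DIGESTS"):
        return "digests"
    return "mtime"


# The same cache entry serves both kinds of client.
entry = {"EAPI": "2", "DIGESTS": "0 SHA1 02021be38b a28b191904"}
print(choose_validation(entry, understands_digests=True))   # digests
print(choose_validation(entry, understands_digests=False))  # mtime
```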

>>> I reiterate, this belongs in a separate repository format, along w/
>>> the rest of the unversioned repository changes you've been pushing in 
>>> (profile package.mask breaking all non portage PMs is a perfect 
>>> example).
>> The package.mask thing is a separate discussion. Let's do that in a
>> separate thread.
> 
> Package.mask is relevant purely as a demonstration of why unversioned 
> changes to the repository formats *need* to stop.  Generally speaking
> it's pretty shitty behaviour to embrace/extend a format when others 
> rely on it for interop.

I agree that it's a poor practice to change the format in ways that
are not inter-operable. However, as said above, introduction of the
DIGESTS data is inter-operable.

> The annoying thing about this thread is that *nowhere* am I saying
> you shouldn't be free to experiment.  All I'm stating is that the end
> result isn't a compatible repo- it *is* a new format (version even)
> thus mark it in some way so that the rest of us can start properly
> handling it rather than having to cut last minute releases since we're
> PMS compliant but portage treats PMS as a subset of its format rules.

As said, the end result of introducing the DIGESTS data _is_ a
compatible repo.

> Pretty simple request, and not something that should require argument
> as far as I'm concerned.

>>> The daft thing about this is that w/ effectively atomic sync (if the 
>>> sync fails then mark the repo as screwed up till a sync completes), 
>>> the current cache format can *still* do validation- no clue if 
>>> paludis has it, but at least pkgcore and portage can handle this via 
>>> awareness of the eclass stacking.
>> I want to have a more fault-tolerant solution than that.
> 
> I understand your reasoning, and frankly I used to view the rsync 
> issue in the same way- it's a naive view however since it implicitly 
> is assuming that the resultant repo is *usable*, iow that the actual 
> ebuild/eclass/profile data is valid, just that the updating bailed 
> during metadata transfer.  There is zero guarantee as to where the
> rsync bailed- meaning you can be missing patches, have trashed 
> manifests, etc.
> 
> Well aware it's not friendly to require people to force a completed 
> sync before being able to use the repo, but it really is the only 
> *safe* option- as such the fault tolerant counterarg is a non
> argument.

Problems aren't only triggered by sync issues. For example, suppose
that the user has locally modified an eclass in a way that results
in a metadata change. The DIGESTS data will provide enough
information to detect cases such as this. Without this data, the
user may be left scratching their head, wondering why their eclass
change hasn't been accounted for.

>>> Note that proper PM implementations *still* have to set the cache 
>>> entries mtime for backwards compatibility w/ older PMs that don't 
>>> support this new unversioned change thus muddying the implementation 
>>> even further.
>> As said above, I wasn't intending that, at least (especially) not
>> for gentoo's rsync tree. I guess you got that idea from the mention
>> of bug 139134, but you don't need to worry about it.
> 
> Implicitly it's required; if pkgcore is to generate cache entries for 
> repo x, it has to do exactly as I said so that any pre
> cache-modified-managers are still able to use the cache.  That's 
> assuming the $PM cares about compatibility...

As said, clients using either va

Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-14 Thread Brian Harring
On Wed, Feb 11, 2009 at 02:01:24AM -0800, Zac Medico wrote:
> Brian Harring wrote:
> > On Tue, Feb 10, 2009 at 12:55:51PM -0800, Zac Medico wrote:
> >> Brian Harring wrote:
> >>> Frankly, forget compatibility- the current format could stand to die.  
> >>> The repository format is an ever growing mess- leave it as is and 
> >>> work on cutting over to something sane.
> >> Changing the repository layout is a pretty radical thing to do.
> >> You're welcome to start a new subject for that if you'd like but I'd
> >> prefer to keep the scope of this thread focussed on the cache format
> >> for the existing repository layout.
> 
> I don't intend to repeal the cache mtime requirement, at least
> (especially) not on gentoo's rsync tree. However, I wouldn't say
> that it's something that necessarily needs to be a requirement for
> other repositories or overlays, moving forward (assuming that an
> alternative validation framework is in place).

So... you want a subset of repositories to have cache algo x, while 
the rest have the old algo.  And since the repo w/ algo x isn't 
marked in some fashion, all managers will have to use new algo x for 
compatibility reasons.  Right...


> > I reiterate, this belongs in a separate repository format, along w/
> > the rest of the unversioned repository changes you've been pushing in 
> > (profile package.mask breaking all non portage PMs is a perfect 
> > example).
> 
> The package.mask thing is a separate discussion. Let's do that in a
> separate thread.

Package.mask is relevant purely as a demonstration of why unversioned 
changes to the repository formats *need* to stop.  Generally speaking
it's pretty shitty behaviour to embrace/extend a format when others 
rely on it for interop.

The annoying thing about this thread is that *nowhere* am I saying
you shouldn't be free to experiment.  All I'm stating is that the end
result isn't a compatible repo- it *is* a new format (version even)
thus mark it in some way so that the rest of us can start properly
handling it rather than having to cut last minute releases since we're
PMS compliant but portage treats PMS as a subset of its format rules.

Pretty simple request, and not something that should require argument
as far as I'm concerned.


> > The daft thing about this is that w/ effectively atomic sync (if the 
> > sync fails then mark the repo as screwed up till a sync completes), 
> > the current cache format can *still* do validation- no clue if 
> > paludis has it, but at least pkgcore and portage can handle this via 
> > awareness of the eclass stacking.
> 
> I want to have a more fault-tolerant solution than that.

I understand your reasoning, and frankly I used to view the rsync 
issue in the same way- it's a naive view however since it implicitly 
is assuming that the resultant repo is *usable*, iow that the actual 
ebuild/eclass/profile data is valid, just that the updating bailed 
during metadata transfer.  There is zero guarantee as to where the
rsync bailed- meaning you can be missing patches, have trashed 
manifests, etc.

Well aware it's not friendly to require people to force a completed 
sync before being able to use the repo, but it really is the only 
*safe* option- as such the fault tolerant counterarg is a non
argument.


> > Note that proper PM implementations *still* have to set the cache 
> > entries mtime for backwards compatibility w/ older PMs that don't 
> > support this new unversioned change thus muddying the implementation 
> > even further.
> 
> As said above, I wasn't intending that, at least (especially) not
> for gentoo's rsync tree. I guess you got that idea from the mention
> of bug 139134, but you don't need to worry about it.

Implicitly it's required; if pkgcore is to generate cache entries for 
repo x, it has to do exactly as I said so that any pre
cache-modified-managers are still able to use the cache.  That's 
assuming the $PM cares about compatibility...

~harring




Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-11 Thread Zac Medico

Brian Harring wrote:
> On Tue, Feb 10, 2009 at 12:55:51PM -0800, Zac Medico wrote:
>> Brian Harring wrote:
>>> Frankly, forget compatibility- the current format could stand to die.  
>>> The repository format is an ever growing mess- leave it as is and 
>>> work on cutting over to something sane.
>> Changing the repository layout is a pretty radical thing to do.
>> You're welcome to start a new subject for that if you'd like but I'd
>> prefer to keep the scope of this thread focussed on the cache format
>> for the existing repository layout.

I don't intend to repeal the cache mtime requirement, at least
(especially) not on gentoo's rsync tree. However, I wouldn't say
that it's something that necessarily needs to be a requirement for
other repositories or overlays, moving forward (assuming that an
alternative validation framework is in place).

> Vacuous argument via focusing on the 'layout' part rather than the
> repository whole I implied; you're stating that one should not 
> discuss changing the repository standard/spec while arguing that 
> repealing the requirement that cache mtime entries match ebuild 
> mtime (part of the repository spec) should be the point of discussion.
> 
> The daft thing about this is that w/ effectively atomic sync (if the 
> sync fails then mark the repo as screwed up till a sync completes), 
> the current cache format can *still* do validation- no clue if 
> paludis has it, but at least pkgcore and portage can handle this via 
> awareness of the eclass stacking.

I want to have a more fault-tolerant solution than that.

> So for git vcses bundling metadata (a bad idea anyways to be storing 
> generated content in the mainline vcs), your proposal allows them to 
> use a cache.  For every other distribution mechanism that works fine, 
> they wind up paying the cost for that corner case.  The 80 pays for 
> the 20 isn't the normal form of the 80/20 rule ;)

What I'm concerned about is costs in terms of support and usability.
When something goes wrong and there's not enough data to detect it,
it triggers problems that confuse and annoy users. I want to have
the DIGESTS data available so that these sorts of problems are easy
to detect and handle appropriately. I think you're being too stingy
about disk space.

> Note that proper PM implementations *still* have to set the cache 
> entries mtime for backwards compatibility w/ older PMs that don't 
> support this new unversioned change thus muddying the implementation 
> even further.

As said above, I wasn't intending that, at least (especially) not
for gentoo's rsync tree. I guess you got that idea from the mention
of bug 139134, but you don't need to worry about it.

> I reiterate, this belongs in a separate repository format, along w/
> the rest of the unversioned repository changes you've been pushing in 
> (profile package.mask breaking all non portage PMs is a perfect 
> example).

The package.mask thing is a separate discussion. Let's do that in a
separate thread.

>>> Overlay maintainers who want the latest/greatest obviously can convert 
>>> over also; one would hope there would be enough cleanup to make it
>>> worth their time.
>>>
>>> As for the nasty gentoo-x86 compatibility, basically, do the 
>>> following:
>>>
>>> 1) maintain the existing cvs repo as is
>>> 2) iron out what cleanup/restructuring is desired.  glep55 being 
>>> jammed in here is a potential for example.  Nail down the new repo 
>>> format basically (with an eye for translating the cvs repo to it on 
>>> the fly).
>>> 3) use an eclass index holding the checksums, w/ the cache entries 
>>> referencing the index numbers rather (sorting the index by 
>>> consumption, meaning the more ebuilds using it the lower the index): 
>>> this brings the cache addition down to around 285KB (acceptable imo) 
>>> while giving full flexibility in the checksums available for eclasses.  
>>> This is assuming the current flat_list format is still in use in the 
>>> new repo...
>> As previously discussed [2], having shared integrity data (as you
>> suggest) has implications in terms of reduced simplicity and robustness.
> 
> The complexity argument is a white elephant.   Rsync is the sole
> transport that has atomicity issues; the rest don't (when you check 
> out from vcs, you get an exact rev effectively).  Rsync generation 
> ought to be preparing the new snapshot then swapping it in, and if I 
> recall correctly that's exactly what osprey does now (or whatever node 
> y'all are using for generating gentoo-x86 these days).
> 
> The point there is that there are specific steps taken preparing the 
> repo- those steps already ensure the snapshot/rev is complete prior to 
> being available so there isn't real potential of catching it mid 
> update.  Via that existing machinery, a shared index is *no issue*- 
> the one spot it rears its head is during a failed/partial sync (the
> repo should not be used in such a state

Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-11 Thread Brian Harring
On Tue, Feb 10, 2009 at 12:55:51PM -0800, Zac Medico wrote:
> Brian Harring wrote:
> > On Mon, Feb 09, 2009 at 11:55:41AM -0800, Zac Medico wrote:
> >> All that I can say right now is that I recall questions about it in
> >> the past from overlay maintainers (I don't have a list) and the
> >> funtoo project is the only one which I can name offhand.
> >>
> >> However, the ability to distribute cache via a vcs is only an
> >> ancillary feature which is made possible by the DIGESTS data. The
> >> DIGESTS data is useful regardless of the protocol that is used to
> >> distribute the cache, since it allows the cache to be properly
> >> validated for integrity. So, the real primary reason for introducing
> >> the DIGESTS data is to provide a proper solution for cases like bug
> >> #139134 [1] in which invalid metadata cache goes undetected.
> > 
> > I'm sorry, but this proposal smells something awful.  Because of the 
> > mtime requirement on cache entries you're proposing jamming another 
> > 1.4MB into the cache for validation purposes (which should be 4x that 
> > since a full checksum really should be in there) while trying to 
> > maintain compatibility.
> 
> As I've said before [1], 10 hex digits gives 1.1e12 possible
> combinations and that's probably sufficient for the given application.

And as I said before, I don't agree with you on it (repeating it over 
and over isn't going to convince the other side either).

The 1.4MB is more the concern than arguments over avalanche I might
add.


> > Frankly, forget compatibility- the current format could stand to die.  
> > The repository format is an ever growing mess- leave it as is and 
> > work on cutting over to something sane.
> 
> Changing the repository layout is a pretty radical thing to do.
> You're welcome to start a new subject for that if you'd like but I'd
> prefer to keep the scope of this thread focussed on the cache format
> for the existing repository layout.

Vacuous argument via focusing on the 'layout' part rather than the
repository whole I implied; you're stating that one should not 
discuss changing the repository standard/spec while arguing that 
repealing the requirement that cache mtime entries match ebuild 
mtime (part of the repository spec) should be the point of discussion.

The daft thing about this is that w/ effectively atomic sync (if the 
sync fails then mark the repo as screwed up till a sync completes), 
the current cache format can *still* do validation- no clue if 
paludis has it, but at least pkgcore and portage can handle this via 
awareness of the eclass stacking.

So for git vcses bundling metadata (a bad idea anyways to be storing 
generated content in the mainline vcs), your proposal allows them to 
use a cache.  For every other distribution mechanism that works fine, 
they wind up paying the cost for that corner case.  The 80 pays for 
the 20 isn't the normal form of the 80/20 rule ;)

Note that proper PM implementations *still* have to set the cache 
entries mtime for backwards compatibility w/ older PMs that don't 
support this new unversioned change thus muddying the implementation 
even further.

I reiterate, this belongs in a separate repository format, along w/
the rest of the unversioned repository changes you've been pushing in 
(profile package.mask breaking all non portage PMs is a perfect 
example).


> > Overlay maintainers who want the latest/greatest obviously can convert 
> > over also; one would hope there would be enough cleanup to make it
> > worth their time.
> > 
> > As for the nasty gentoo-x86 compatibility, basically, do the 
> > following:
> > 
> > 1) maintain the existing cvs repo as is
> > 2) iron out what cleanup/restructuring is desired.  glep55 being 
> > jammed in here is a potential for example.  Nail down the new repo 
> > format basically (with an eye for translating the cvs repo to it on 
> > the fly).
> > 3) use an eclass index holding the checksums, w/ the cache entries 
> > referencing the index numbers rather (sorting the index by 
> > consumption, meaning the more ebuilds using it the lower the index): 
> > this brings the cache addition down to around 285KB (acceptable imo) 
> > while giving full flexibility in the checksums available for eclasses.  
> > This is assuming the current flat_list format is still in use in the 
> > new repo...
> 
> As previously discussed [2], having shared integrity data (as you
> suggest) has implications in terms of reduced simplicity and robustness.
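
For reference, the eclass index described in step 3 of the quoted list
could be sketched roughly as follows. This is only one reading of that
proposal: the input shape (a mapping from ebuild to the eclasses it
inherits) and all names are assumptions, and the checksums themselves
are left out.

```python
from collections import Counter
from typing import Dict, List


def build_eclass_index(inherits: Dict[str, List[str]]) -> List[str]:
    """Order eclasses by how many ebuilds consume them, most used first.

    The most heavily used eclasses get the lowest index numbers; ties
    are broken by name so the result is deterministic.
    """
    usage = Counter(
        eclass for eclasses in inherits.values() for eclass in set(eclasses)
    )
    return sorted(usage, key=lambda name: (-usage[name], name))


def index_refs(index: List[str], eclasses: List[str]) -> List[int]:
    """Replace eclass names with their positions in the shared index."""
    position = {name: i for i, name in enumerate(index)}
    return [position[eclass] for eclass in eclasses]


# Hypothetical tree: which eclasses each ebuild inherits.
inherits = {
    "a-1": ["eutils", "toolchain-funcs"],
    "b-1": ["eutils"],
    "c-1": ["eutils", "python"],
    "d-1": ["python"],
}
index = build_eclass_index(inherits)
print(index)                              # ['eutils', 'python', 'toolchain-funcs']
print(index_refs(index, inherits["a-1"]))  # [0, 2]
```

Each cache entry then stores a few small integers instead of one
checksum per eclass, which is where the size saving would come from; the
cost is that every entry depends on the shared index being intact.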

The complexity argument is a white elephant.   Rsync is the sole 
transport that has atomicity issues; the rest don't (when you check 
out from vcs, you get an exact rev effectively).  Rsync generation 
ought to be preparing the new snapshot then swapping it in, and if I 
recall correctly that's exactly what osprey does now (or whatever node 
y'all are using for generating gentoo-x86 these days).
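
A minimal sketch of the prepare-then-swap approach described above, assuming the served tree is reached through a symlink. The function name and the ".tmp" suffix are hypothetical and are not taken from the actual Gentoo infrastructure scripts.

```python
import os


def publish_snapshot(snapshot_dir: str, live_link: str) -> None:
    """Atomically repoint live_link at a fully prepared snapshot_dir."""
    # Build the new symlink under a temporary name next to the live one.
    tmp_link = live_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(os.path.abspath(snapshot_dir), tmp_link)
    # Rename it over the live link. rename() is atomic on POSIX, so a
    # reader resolving the link sees either the old tree or the new one.
    os.replace(tmp_link, live_link)
```

The snapshot directory is expected to be complete (cache included) before the swap, which is what keeps the cache and the ebuilds consistent with each other.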

The point there is that there are specific steps taken preparing the 
repo- those steps already e

Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-10 Thread Zac Medico
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Brian Harring wrote:
> On Mon, Feb 09, 2009 at 11:55:41AM -0800, Zac Medico wrote:
>> All that I can say right now is that I recall questions about it in
>> the past from overlay maintainers (I don't have a list) and the
>> funtoo project is the only one which I can name offhand.
>>
>> However, the ability to distribute cache via a vcs is only an
>> ancillary feature which is made possible by the DIGESTS data. The
>> DIGESTS data is useful regardless of the protocol that is used to
>> distribute the cache, since it allows the cache to be properly
>> validated for integrity. So, the real primary reason for introducing
>> the DIGESTS data is to provide a proper solution for cases like bug
>> #139134 [1] in which invalid metadata cache goes undetected.
> 
> I'm sorry, but this proposal smells something awful.  Because of the 
> mtime requirement on cache entries you're proposing jamming another 
> 1.4MB into the cache for validation purposes (which should be 4x that 
> since a full checksum really should be in there) while trying to 
> maintain compatibility.

As I've said before [1], 10 hex digits gives 1.1e12 possible
combinations and that's probably sufficient for the given application.
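
The figure can be checked with a few lines of arithmetic. This is only a sketch, and it assumes the truncated SHA-1 output behaves like a uniformly distributed value; the function name is made up for illustration.

```python
# Number of distinct values representable by 10 hexadecimal digits.
HEX_DIGITS = 10
combinations = 16 ** HEX_DIGITS


def accidental_match_probability(digits: int = HEX_DIGITS) -> float:
    """Chance that a changed file still yields the same truncated digest,
    assuming uniformly distributed digests."""
    return 1.0 / 16 ** digits


print(combinations)  # 1099511627776, i.e. about 1.1e12
```

Under that assumption, a stale entry slipping through validation for a given changed file has a probability of 1 in 2**40.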

> Frankly, forget compatibility- the current format could stand to die.  
> The repository format is an ever growing mess- leave it as is and 
> work on cutting over to something sane.

Changing the repository layout is a pretty radical thing to do.
You're welcome to start a new subject for that if you'd like but I'd
prefer to keep the scope of this thread focussed on the cache format
for the existing repository layout.

> Overlay maintainers who want the latest/greatest obviously can convert 
> over also; one would hope their would be enough cleanup to make it 
> worth their time.
> 
> As for the nasty gentoo-x86 compatibility, basically, do the 
> following:
> 
> 1) maintain the existing cvs repo as is
> 2) iron out what cleanup/restructuring is desired.  glep55 being 
> jammed in here is a potential for example.  Nail down the new repo 
> format basically (with an eye for translating the cvs repo to it on 
> the fly).
> 3) use an eclass index holding the checksums, w/ the cache entries 
> referencing the index numbers rather (sorting the index by 
> consumption, meaning the more ebuilds using it the lower the index): 
> this brings the cache addition down to around 285KB (acceptable imo) 
> while giving full flexibility in the checksums available for eclasses.  
> This is assuming the current flat_list format is still in use in the 
> new repo...

As previously discussed [2], having shared integrity data (as you
suggest) has implications in terms of reduced simplicity and robustness.

My intention is for the cache format to be both simple and robust.
It may require some extra space in order to achieve these goals, but
I think it's well worth it. When accessing a given cache entry, it's
very important that the package manager be able to reliably validate
its integrity (given that the package manager has no control over
the implementation details of the cache generation infrastructure),
and I believe that the proposed DIGESTS data will solve this problem
in a simple and robust manner.
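
As a rough illustration of the validation step, the sketch below compares recorded truncated digests against the current contents of the ebuild and its eclasses. The ordering of the digests (ebuild first, then eclasses) and the function names are assumptions made for the example, not part of the proposal text.

```python
import hashlib
from typing import Sequence


def short_sha1(data: bytes, digits: int = 10) -> str:
    """Leftmost `digits` hex digits of the SHA-1 digest of data."""
    return hashlib.sha1(data).hexdigest()[:digits]


def entry_is_valid(recorded: Sequence[str], sources: Sequence[bytes]) -> bool:
    """True if every recorded digest matches the corresponding source.

    `sources` holds the current contents of the ebuild and its eclasses,
    in the same (assumed) order as the recorded digests.
    """
    if len(recorded) != len(sources):
        return False
    return all(
        short_sha1(src, len(rec)) == rec
        for rec, src in zip(recorded, sources)
    )
```

A package manager would regenerate the cache entry whenever this check fails, independent of any timestamps on the files.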

[1]
http://archives.gentoo.org/gentoo-dev/msg_d92eddd796dcc7b9272cc8b8a5a9ca18.xml
[2]
http://archives.gentoo.org/gentoo-dev/msg_94a65c9f395706a112ec903b611aad0e.xml
- --
Thanks,
Zac
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.9 (GNU/Linux)

iEYEARECAAYFAkmR6dYACgkQ/ejvha5XGaNlkwCeLA+roi+zg392R4HsWIuXIGrK
nw4AoNztwEEioDDqPkVTv3pFKRrYUXKv
=TRW8
-END PGP SIGNATURE-



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-10 Thread Nirbheek Chauhan
On Tue, Feb 10, 2009 at 5:50 PM, Brian Harring  wrote:

> So... flame away.

When I first read Zac's original email, I was sure that a change was
required, and I'm sure now as well (I personally find the cache stuff
pretty clunky). However, I think that a time has come for more
*radical* change to the repository format. We've learnt a lot of
lessons regarding flat-file repository management, and our experience
could result in a new much better system.

I'm not sure what kind of tree structure exherbo uses, but I'm sure
there are ideas we could take from there as well. This, along with our
plans for moving to Git, the tagging need, and various others,
means this is a good time for a "revolution" :)

As they say, "Keep one to throw away, you're going to anyway".


-- 
~Nirbheek Chauhan



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-10 Thread Brian Harring
On Mon, Feb 09, 2009 at 11:55:41AM -0800, Zac Medico wrote:
> All that I can say right now is that I recall questions about it in
> the past from overlay maintainers (I don't have a list) and the
> funtoo project is the only one which I can name offhand.
> 
> However, the ability to distribute cache via a vcs is only an
> ancillary feature which is made possible by the DIGESTS data. The
> DIGESTS data is useful regardless of the protocol that is used to
> distribute the cache, since it allows the cache to be properly
> validated for integrity. So, the real primary reason for introducing
> the DIGESTS data is to provide a proper solution for cases like bug
> #139134 [1] in which invalid metadata cache goes undetected.

I'm sorry, but this proposal smells something awful.  Because of the 
mtime requirement on cache entries you're proposing jamming another 
1.4MB into the cache for validation purposes (which should be 4x that 
since a full checksum really should be in there) while trying to 
maintain compatibility.

Frankly, forget compatibility- the current format could stand to die.  
The repository format is an ever growing mess- leave it as is and 
work on cutting over to something sane.

Overlay maintainers who want the latest/greatest obviously can convert 
over also; one would hope there would be enough cleanup to make it 
worth their time.

As for the nasty gentoo-x86 compatibility, basically, do the 
following:

1) maintain the existing cvs repo as is
2) iron out what cleanup/restructuring is desired.  glep55 being 
jammed in here is a potential for example.  Nail down the new repo 
format basically (with an eye for translating the cvs repo to it on 
the fly).
3) use an eclass index holding the checksums, w/ the cache entries 
referencing the index numbers rather (sorting the index by 
consumption, meaning the more ebuilds using it the lower the index): 
this brings the cache addition down to around 285KB (acceptable imo) 
while giving full flexibility in the checksums available for eclasses.  
This is assuming the current flat_list format is still in use in the 
new repo...
4) drop mtime on cache entries, bump it forward whenever it's updated 
(bug 139134 goes away) jamming in an ebuild checksum of some sort.
5) rsync nodes are required to have 10GB of storage available- so 
storage shouldn't be an issue, but ensuring all nodes have been 
updated to sync both the old and *new* format is required.
6) suffer through cvs for a year (or whatever time frame), converting 
folks over to the new url.
7) kill the old format after whatever period deemed best (potentially 
leaving a README telling folks how to update if they're seriously 
behind).
8) convert the cvs repo to the new format, tear down the 
transformation bits.
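
Step 3 above could look roughly like the following. This is a sketch under assumptions: the in-memory data layout, the function name, and the tie-breaking by eclass name are all hypothetical, and no on-disk format is implied.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_eclass_index(
    inherits: Dict[str, List[str]], checksums: Dict[str, str]
) -> Tuple[List[Tuple[str, str]], Dict[str, List[int]]]:
    """Return (index, refs).

    index: (eclass, checksum) pairs, most-consumed eclass first.
    refs:  for each ebuild, the index numbers of the eclasses it uses.
    """
    # Count how many ebuilds consume each eclass.
    usage = Counter(ec for eclasses in inherits.values() for ec in eclasses)
    # Most-used eclasses get the lowest index; ties are broken by name.
    ordered = sorted(usage, key=lambda ec: (-usage[ec], ec))
    position = {ec: i for i, ec in enumerate(ordered)}
    index = [(ec, checksums[ec]) for ec in ordered]
    refs = {
        ebuild: [position[ec] for ec in eclasses]
        for ebuild, eclasses in inherits.items()
    }
    return index, refs
```

Because heavily used eclasses end up with short index numbers, each cache entry only carries a few digits per eclass while the index itself can hold full checksums.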

Yes, the plan above is coarse- there aren't any glaring holes as far 
as I can see however.  It does place restrictions on the repo format 
chosen, but careful choices in the new format (heavy format 
versioning) should make it possible to make this sort of issue less 
of a pain down the line.


At the very least, doing a different repo format for repos/overlays 
stored in a vcs that doesn't track mtime would solve their issues- it 
also has the nice benefit of not making the repo more bloated for the 
99% of folk who didn't even hit the issues spawning this.

If gentoo-x86 is left as is, bug 139134 can be headed off w/out jamming 
a new metadata key in; to be clear, I'm likely going to "Special Hell" 
for suggesting this, but if mtime/size on the new cache entry is the 
same as on the old one, append a space to the value in the description 
field.

All sane managers ought to be doing basic clean up of that value 
anyways in their data layer (let alone at the UI level), but it's 
enough to make rsync behave.

So... flame away.

~brian


pgpWzHwIYn9If.pgp
Description: PGP signature


Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-09 Thread Zac Medico
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Petteri Räty wrote:
> Zac Medico wrote:
>> Ciaran McCreesh wrote:
>>> On Sun, 08 Feb 2009 15:27:54 -0800
>>> Zac Medico  wrote:
>>>>> Which is offset and more by the massive inconvenience of having to
>>>>> keep track of and store junk under version control.
>>>> I think you're making it out to be worse than it really is. Like I
>>>> said, I think we have a justifiable exception to the rule.
>>> If you start encouraging this approach, are you prepared to make
>>> Portage warn extremely noisily if a repository-provided (as opposed to
>>> user generated) cache entry is found to be stale?
>> Sure. Otherwise, it's confusing for the user when dependency
>> calculations take longer than usual for no apparent reason.
>>
> 
> It would probably be useful to provide a central rsync infra for
> overlays where overlay maintainers could subscribe their overlays to and
> the machine would pull in their VCS and generate the metadata for them.

That's fine if somebody wants to implement it. The introduction of
DIGESTS data in the metadata cache does not preclude it. Like I just
said in another reply [1], the ability to distribute cache via a vcs
is only an ancillary feature which is made possible by the DIGESTS data.

[1]
http://archives.gentoo.org/gentoo-dev/msg_760e199e74796fed7e56236f248efe9e.xml
- --
Thanks,
Zac
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.9 (GNU/Linux)

iEYEARECAAYFAkmQj+UACgkQ/ejvha5XGaOajACePIoV6STCE/bh7SB8X/ch4phk
bpAAnjsYR9UgBVP26wIldvCX2OFNe4yy
=kYc/
-END PGP SIGNATURE-



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-09 Thread Zac Medico
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Ciaran McCreesh wrote:
> On Sun, 08 Feb 2009 14:43:01 -0800
> Zac Medico  wrote:
>> Well, if you want to use timestamps, the alternative is to
>> distributors to use a protocol which preserves timestamps. This
>> creates an unnecessary burden. Allowing distribution of metadata
>> cache via version control systems is more flexible.
> 
> Ok, if we're going to encourage this, let's do it properly:
> 
> * Have a branch called 'master'. Commit to it. Don't stick any metadata
>   in it.
> 
> * Have a branch called 'master-with-metadata'. Don't commit to it
>   manually.
> 
> * Have a script that merges master to master-with-metadata, and as part
>   of the merge commit, generates all necessary metadata for the range
>   it's merging.

Yes, that's how I imagine it should be done.

> * Store either the partial hash or the owning repository and timestamp
>   of each eclass used by an ebuild in its metadata.

I think the partial hash is plenty of information since the package
manager still needs some other way to resolve the eclass paths in
order to generate the cache in the first place.
- --
Thanks,
Zac
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.9 (GNU/Linux)

iEYEARECAAYFAkmQi+oACgkQ/ejvha5XGaNrkACg2l+CndFMKHPEx3vtw0FhohRz
i5MAnA/usLTUHsSD5y0QZx8tY91sfdau
=ya5w
-END PGP SIGNATURE-



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-09 Thread Zac Medico
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Tiziano Müller wrote:
> On Saturday, 07.02.2009, at 15:23 -0800, Zac Medico wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA1
>>
>> Tiziano Müller wrote:
>>> On Monday, 02.02.2009, at 12:34 -0800, Zac Medico wrote:
>>>> For the digest format, I suggest that we use the leftmost 10
>>>> hexadecimal digits of the SHA-1 digest. The rationale for limiting
>>>> it to 10 digits (out of 40) is to save space. Due to the avalanche
>>>> effect [2], 10 digits should be sufficient to ensure that problems
>>>> resulting from hash collisions are extremely unlikely.
>>> I'd recommend to prefix the digest with a "{TYPE}" (like for hashed
>>> passwords) to be able to change the digest algorithm as needed
>>> (especially in regards to the current SHA successor competition).
>>> This allows a future package manager which might use SHA-3 for hashing
>>> (once it's released) to still check old digests. Furthermore it would
>>> allow for easier transition and only needs a definition of allowed
>>> hashes instead of a specific one.
>> I like that idea. That way it's not necessary to bump the EAPI in
>> order to change the hash function. So, a typical DIGESTS value might
>> look like this:
>>
>> SHA1 02021be38b a28b191904 3992945426 6ec21b29a3
>>
>>>> The primary reason to use a digest for cache validation instead of a
>>>> timestamp is that it allows the cache validation mechanism to work
>>>> even if the tree is distributed with a protocol that does not
>>>> preserve timestamps, such as git or subversion. This would make it
>>> Well, usually you don't keep intermediate or generated files in a VCS,
>>> so why the metadata?
>> People who distribute overlays commonly ask if it's possible to
>> distribute metadata cache with the overlay. Using a format that
>> doesn't rely on timestamps will allow them to distribute metadata
>> cache using their existing infrastructure, which is typically git or
>> subversion. In addition to overlays, it would also be useful for
>> forks of the entire gentoo tree, such as the funtoo tree [1].
>>
>> [1] http://github.com/funtoo/portage/tree/master
> 
> Ok, after having the technical details discussed, I'd like to know which
> overlays or trees could really make use of it.
> Because small overlays surely won't generate the metadata because it is
> cumbersome to generate the metadata and isn't really a speed issue.
> Most larger overlays/repositories will probably be able to setup rsync
> or implement a procedure using cron+tarball.
> So, who exactly is asking about being able to distribute the metadata
> cache via a VCS?

All that I can say right now is that I recall questions about it in
the past from overlay maintainers (I don't have a list) and the
funtoo project is the only one which I can name offhand.

However, the ability to distribute cache via a vcs is only an
ancillary feature which is made possible by the DIGESTS data. The
DIGESTS data is useful regardless of the protocol that is used to
distribute the cache, since it allows the cache to be properly
validated for integrity. So, the real primary reason for introducing
the DIGESTS data is to provide a proper solution for cases like bug
#139134 [1] in which invalid metadata cache goes undetected.
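
For reference, a DIGESTS value of the shape quoted earlier in this message (an algorithm name followed by truncated digests) could be produced and parsed roughly as follows. The function names are hypothetical, and mapping the prefix to a hashlib algorithm by lowercasing it is an assumption of this sketch.

```python
import hashlib
from typing import List, Sequence, Tuple


def format_digests(algo: str, sources: Sequence[bytes], digits: int = 10) -> str:
    """Build a value such as 'SHA1 02021be38b a28b191904 ...'."""
    parts = [algo]
    for data in sources:
        # hashlib.new() accepts names like 'sha1'; the prefix is lowercased.
        parts.append(hashlib.new(algo.lower(), data).hexdigest()[:digits])
    return " ".join(parts)


def parse_digests(value: str) -> Tuple[str, List[str]]:
    """Split a DIGESTS value into (algorithm, list of truncated digests)."""
    algo, *digests = value.split()
    return algo, digests
```

Keeping the algorithm name in the value is what lets a reader reject or accept an entry based on whether it knows the hash function, without tying that decision to the EAPI.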

[1] http://bugs.gentoo.org/show_bug.cgi?id=139134
- --
Thanks,
Zac
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.9 (GNU/Linux)

iEYEARECAAYFAkmQijwACgkQ/ejvha5XGaM2gQCguhueRSzVSr6GlFpTW6uutJ9p
mAQAoJ5LOuU9kl8wXEF3qzF5XFa2LdmH
=DTgz
-END PGP SIGNATURE-



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-09 Thread Ciaran McCreesh
On Sun, 08 Feb 2009 14:43:01 -0800
Zac Medico  wrote:
> Well, if you want to use timestamps, the alternative is to
> distributors to use a protocol which preserves timestamps. This
> creates an unnecessary burden. Allowing distribution of metadata
> cache via version control systems is more flexible.

Ok, if we're going to encourage this, let's do it properly:

* Have a branch called 'master'. Commit to it. Don't stick any metadata
  in it.

* Have a branch called 'master-with-metadata'. Don't commit to it
  manually.

* Have a script that merges master to master-with-metadata, and as part
  of the merge commit, generates all necessary metadata for the range
  it's merging.

* Store either the partial hash or the owning repository and timestamp
  of each eclass used by an ebuild in its metadata.
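
The branch workflow above might be scripted along these lines. This is a sketch only: the cache generator command is supplied by the caller and is not a specific tool, the "metadata/cache" path is an assumed location, and the case where there is nothing new to merge (so the commit has nothing to record) is not handled.

```python
import subprocess
from typing import List, Sequence


def merge_plan(
    generator_cmd: Sequence[str],
    source: str = "master",
    target: str = "master-with-metadata",
    cache_dir: str = "metadata/cache",
) -> List[List[str]]:
    """Commands that merge `source` into `target` and commit fresh metadata."""
    return [
        ["git", "checkout", target],
        # Stop before committing so the metadata lands in the merge commit.
        ["git", "merge", "--no-ff", "--no-commit", source],
        list(generator_cmd),  # caller-supplied cache generator (hypothetical)
        ["git", "add", "-A", cache_dir],
        ["git", "commit", "-m", f"Merge {source} and regenerate metadata cache"],
    ]


def run_plan(plan: Sequence[Sequence[str]], repo_dir: str) -> None:
    """Run each command in order inside repo_dir, aborting on the first failure."""
    for argv in plan:
        subprocess.run(list(argv), cwd=repo_dir, check=True)
```

Nobody commits to the target branch by hand in this scheme; the script is the only writer, so the metadata never shows up in ordinary development commits.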

-- 
Ciaran McCreesh


signature.asc
Description: PGP signature


Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-09 Thread Tiziano Müller
On Saturday, 07.02.2009, at 15:23 -0800, Zac Medico wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Tiziano Müller wrote:
> > On Monday, 02.02.2009, at 12:34 -0800, Zac Medico wrote:
> >> For the digest format, I suggest that we use the leftmost 10
> >> hexadecimal digits of the SHA-1 digest. The rationale for limiting
> >> it to 10 digits (out of 40) is to save space. Due to the avalanche
> >> effect [2], 10 digits should be sufficient to ensure that problems
> >> resulting from hash collisions are extremely unlikely.
> > I'd recommend to prefix the digest with a "{TYPE}" (like for hashed
> > passwords) to be able to change the digest algorithm as needed
> > (especially in regards to the current SHA successor competition).
> > This allows a future package manager which might use SHA-3 for hashing
> > (once it's released) to still check old digests. Furthermore it would
> > allow for easier transition and only needs a definition of allowed
> > hashes instead of a specific one.
> 
> I like that idea. That way it's not necessary to bump the EAPI in
> order to change the hash function. So, a typical DIGESTS value might
> look like this:
> 
> SHA1 02021be38b a28b191904 3992945426 6ec21b29a3
> 
> >> The primary reason to use a digest for cache validation instead of a
> >> timestamp is that it allows the cache validation mechanism to work
> >> even if the tree is distributed with a protocol that does not
> >> preserve timestamps, such as git or subversion. This would make it
> > Well, usually you don't keep intermediate or generated files in a VCS,
> > so why the metadata?
> 
> People who distribute overlays commonly ask if it's possible to
> distribute metadata cache with the overlay. Using a format that
> doesn't rely on timestamps will allow them to distribute metadata
> cache using their existing infrastructure, which is typically git or
> subversion. In addition to overlays, it would also be useful for
> forks of the entire gentoo tree, such as the funtoo tree [1].
> 
> [1] http://github.com/funtoo/portage/tree/master

Ok, after having the technical details discussed, I'd like to know which
overlays or trees could really make use of it.
Because small overlays surely won't generate the metadata because it is
cumbersome to generate the metadata and isn't really a speed issue.
Most larger overlays/repositories will probably be able to set up rsync
or implement a procedure using cron+tarball.
So, who exactly is asking about being able to distribute the metadata
cache via a VCS?


-- 
---
Tiziano Müller
Gentoo Linux Developer, Council Member
Areas of responsibility:
  Samba, PostgreSQL, CPP, Python, sysadmin
E-Mail : dev-z...@gentoo.org
GnuPG FP   : F327 283A E769 2E36 18D5  4DE2 1B05 6A63 AE9C 1E30


signature.asc
Description: This is a digitally signed message part


Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-09 Thread Rémi Cardona

Petteri Räty wrote:
> Ciaran McCreesh wrote:
>> On Mon, 09 Feb 2009 14:30:58 +0200
>> Petteri Räty  wrote:
>>> It would probably be useful to provide a central rsync infra for
>>> overlays where overlay maintainers could subscribe their overlays to
>>> and the machine would pull in their VCS and generate the metadata for
>>> them.
>>
>> How much do you trust overlay maintainers?
>
> It shouldn't be that hard to sandbox the overlays for cache generation.
> Trust should be much more of an issue to people actually installing
> stuff from overlays. Adding new overlays to the server would probably
> have to be manual.


I can't possibly be the *only* one to think that the ideas in this 
thread are getting out of hand as far as complexity is concerned.


Seriously, let's try to do simpler things that most developers can 
understand.


Cheers,

Rémi



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-09 Thread Ciaran McCreesh
On Mon, 09 Feb 2009 16:15:55 +0200
Petteri Räty  wrote:
> > How much do you trust overlay maintainers?
> 
> It shouldn't be that hard to sandbox the overlays for cache
> generation.

Uh. Really? I'd be interested to see how you plan to pull that one off.

-- 
Ciaran McCreesh


signature.asc
Description: PGP signature


Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-09 Thread Petteri Räty
Ciaran McCreesh wrote:
> On Mon, 09 Feb 2009 14:30:58 +0200
> Petteri Räty  wrote:
>> It would probably be useful to provide a central rsync infra for
>> overlays where overlay maintainers could subscribe their overlays to
>> and the machine would pull in their VCS and generate the metadata for
>> them.
> 
> How much do you trust overlay maintainers?
> 

It shouldn't be that hard to sandbox the overlays for cache generation.
Trust should be much more of an issue to people actually installing
stuff from overlays. Adding new overlays to the server would probably
have to be manual.

Regards,
Petteri



signature.asc
Description: OpenPGP digital signature


Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-09 Thread Ciaran McCreesh
On Mon, 09 Feb 2009 14:30:58 +0200
Petteri Räty  wrote:
> It would probably be useful to provide a central rsync infra for
> overlays where overlay maintainers could subscribe their overlays to
> and the machine would pull in their VCS and generate the metadata for
> them.

How much do you trust overlay maintainers?

-- 
Ciaran McCreesh


signature.asc
Description: PGP signature


Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-09 Thread Petteri Räty
Zac Medico wrote:
> Ciaran McCreesh wrote:
>> On Sun, 08 Feb 2009 15:27:54 -0800
>> Zac Medico  wrote:
>>>> Which is offset and more by the massive inconvenience of having to
>>>> keep track of and store junk under version control.
>>> I think you're making it out to be worse than it really is. Like I
>>> said, I think we have a justifiable exception to the rule.
>> If you start encouraging this approach, are you prepared to make
>> Portage warn extremely noisily if a repository-provided (as opposed to
>> user generated) cache entry is found to be stale?
> 
> Sure. Otherwise, it's confusing for the user when dependency
> calculations take longer than usual for no apparent reason.
>

It would probably be useful to provide a central rsync infra for
overlays where overlay maintainers could subscribe their overlays to and
the machine would pull in their VCS and generate the metadata for them.

Regards,
Petteri



signature.asc
Description: OpenPGP digital signature


Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-08 Thread Zac Medico
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Ciaran McCreesh wrote:
> On Sun, 08 Feb 2009 15:27:54 -0800
> Zac Medico  wrote:
>>> Which is offset and more by the massive inconvenience of having to
>>> keep track of and store junk under version control.
>> I think you're making it out to be worse than it really is. Like I
>> said, I think we have a justifiable exception to the rule.
> 
> If you start encouraging this approach, are you prepared to make
> Portage warn extremely noisily if a repository-provided (as opposed to
> user generated) cache entry is found to be stale?

Sure. Otherwise, it's confusing for the user when dependency
calculations take longer than usual for no apparent reason.
- --
Thanks,
Zac
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.9 (GNU/Linux)

iEYEARECAAYFAkmPbV8ACgkQ/ejvha5XGaN24ACg9LFy8dag9/riCwODjknQV/Ic
0koAn00PP5WJBo5UwMR6iATwfFOipTi6
=sOk8
-END PGP SIGNATURE-



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-08 Thread Ciaran McCreesh
On Sun, 08 Feb 2009 15:27:54 -0800
Zac Medico  wrote:
> > Which is offset and more by the massive inconvenience of having to
> > keep track of and store junk under version control.
> 
> I think you're making it out to be worse than it really is. Like I
> said, I think we have a justifiable exception to the rule.

If you start encouraging this approach, are you prepared to make
Portage warn extremely noisily if a repository-provided (as opposed to
user generated) cache entry is found to be stale?

-- 
Ciaran McCreesh


signature.asc
Description: PGP signature


Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-08 Thread Zac Medico
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Ciaran McCreesh wrote:
> On Sun, 08 Feb 2009 15:03:48 -0800
> Zac Medico  wrote:
>>> No, it's just encouraging bad development practices.
>> It seems like you're making a rather arbitrary judgment.
> 
> Not storing generated content under revision control is hardly an
> arbitrary judgement. It's a well accepted software development bad
> practice.

In general, it's a good rule of thumb. However, in this case I think
we have a justifiable exception to the rule.

>>> If you're concerned that setting up an rsync mirror is difficult,
>>> why not make a tool that generates a tarball, including metadata,
>>> for a repo, and have people run that on a cron and distribute it
>>> via http? That's just as easy to host, and anyone running an
>>> overlay big enough to make this impractical already has the
>>> resources to deal with rsync instead...
>> I'm not saying that it necessarily "difficult" or "beyond the
>> resources", but it does create an unnecessary burden. I think that
>> it adds a significant level of convenience to be able to use a
>> version control system as a single distribution channel.
> 
> Which is offset and more by the massive inconvenience of having to
> keep track of and store junk under version control.

I think you're making it out to be worse than it really is. Like I
said, I think we have a justifiable exception to the rule.
- --
Thanks,
Zac
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.9 (GNU/Linux)

iEYEARECAAYFAkmPankACgkQ/ejvha5XGaNpnQCfVHgDPmfzUbfH6mIgmpUxcWda
xkYAoJ1s+DEd873rpRpDQkck6ZP7pclr
=K88a
-END PGP SIGNATURE-



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-08 Thread Ciaran McCreesh
On Sun, 08 Feb 2009 15:03:48 -0800
Zac Medico  wrote:
> > No, it's just encouraging bad development practices.
> 
> It seems like you're making a rather arbitrary judgment.

Not storing generated content under revision control is hardly an
arbitrary judgement. Storing it is a well accepted software development
bad practice.

> > If you're concerned that setting up an rsync mirror is difficult,
> > why not make a tool that generates a tarball, including metadata,
> > for a repo, and have people run that on a cron and distribute it
> > via http? That's just as easy to host, and anyone running an
> > overlay big enough to make this impractical already has the
> > resources to deal with rsync instead...
> 
> I'm not saying that it necessarily "difficult" or "beyond the
> resources", but it does create an unnecessary burden. I think that
> it adds a significant level of convenience to be able to use a
> version control system as a single distribution channel.

Which is offset and more by the massive inconvenience of having to
keep track of and store junk under version control.

-- 
Ciaran McCreesh


signature.asc
Description: PGP signature


Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-08 Thread Zac Medico
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Ciaran McCreesh wrote:
> On Sun, 08 Feb 2009 14:43:01 -0800
> Zac Medico  wrote:
>>> Sticking metadata cache files under version control really is a
>>> perfect example of doing it wrong...
>> Well, if you want to use timestamps, the alternative is to
>> distributors to use a protocol which preserves timestamps. This
>> creates an unnecessary burden. Allowing distribution of metadata
>> cache via version control systems is more flexible.
> 
> No, it's just encouraging bad development practices.

It seems like you're making a rather arbitrary judgment.

> If you're concerned that setting up an rsync mirror is difficult, why
> not make a tool that generates a tarball, including metadata, for a
> repo, and have people run that on a cron and distribute it via http?
> That's just as easy to host, and anyone running an overlay big enough
> to make this impractical already has the resources to deal with rsync
> instead...

I'm not saying that it's necessarily "difficult" or "beyond the
resources", but it does create an unnecessary burden. I think that
it adds a significant level of convenience to be able to use a
version control system as a single distribution channel.
- --
Thanks,
Zac
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.9 (GNU/Linux)

iEYEARECAAYFAkmPZNMACgkQ/ejvha5XGaNqcgCg3GAWiklumvFhBtbWDYBPGz2+
u6IAoJ5eCaytti4FSmOHEtIrLSm10W4O
=n0eG
-END PGP SIGNATURE-



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-08 Thread Ciaran McCreesh
On Sun, 08 Feb 2009 14:43:01 -0800
Zac Medico  wrote:
> > Sticking metadata cache files under version control really is a
> > perfect example of doing it wrong...
> 
> Well, if you want to use timestamps, the alternative is to
> distributors to use a protocol which preserves timestamps. This
> creates an unnecessary burden. Allowing distribution of metadata
> cache via version control systems is more flexible.

No, it's just encouraging bad development practices.

If you're concerned that setting up an rsync mirror is difficult, why
not make a tool that generates a tarball, including metadata, for a
repo, and have people run that on a cron and distribute it via http?
That's just as easy to host, and anyone running an overlay big enough
to make this impractical already has the resources to deal with rsync
instead...
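
A minimal sketch of such a tarball tool, assuming the metadata cache has already been generated inside the repository directory. The function name, the bzip2 compression, and the list of excluded VCS directories are choices made for the example; scheduling it from cron and serving the result over http are left out.

```python
import os
import tarfile


def build_snapshot(repo_dir: str, out_path: str,
                   exclude=(".git", ".svn", "CVS")) -> str:
    """Pack repo_dir (ebuilds plus generated metadata) into out_path."""

    def skip_vcs(info: tarfile.TarInfo):
        # Returning None drops the entry; for a directory this also
        # drops everything beneath it.
        return None if os.path.basename(info.name) in exclude else info

    top = os.path.basename(os.path.abspath(repo_dir))
    tmp_path = out_path + ".tmp"
    # tarfile records each file's mtime, so timestamps survive the trip.
    with tarfile.open(tmp_path, "w:bz2") as tar:
        tar.add(repo_dir, arcname=top, filter=skip_vcs)
    # Swap the finished tarball into place so downloads never see a
    # half-written file.
    os.replace(tmp_path, out_path)
    return out_path
```

Since the archive preserves modification times, the existing timestamp-based cache validation keeps working for anyone who unpacks it.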

-- 
Ciaran McCreesh


signature.asc
Description: PGP signature


Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-08 Thread Zac Medico
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Ciaran McCreesh wrote:
> On Sat, 07 Feb 2009 15:23:18 -0800
> Zac Medico  wrote:
>>> Well, usually you don't keep intermediate or generated files in a
>>> VCS, so why the metadata?
>> People who distribute overlays commonly ask if it's possible to
>> distribute metadata cache with the overlay. Using a format that
>> doesn't rely on timestamps will allow them to distribute metadata
>> cache using their existing infrastructure, which is typically git or
>> subversion. In addition to overlays, it would also be useful for
>> forks of the entire gentoo tree, such as the funtoo tree [1].
> 
> Are these people really all going to remember to run some command at
> the top level of the repository before every commit, and to git add the
> relevant files for everything (thus making really messy commits)?

The cache can be incrementally updated by a tool such as repoman, or
it can be updated in periodic batches by a cron job. The periodic
batch approach may be more convenient for eclass changes that affect
large numbers of ebuilds.

> Sticking metadata cache files under version control really is a perfect
> example of doing it wrong...

Well, if you want to use timestamps, the alternative is for
distributors to use a protocol which preserves timestamps. This
creates an unnecessary burden. Allowing distribution of metadata
cache via version control systems is more flexible.
- --
Thanks,
Zac
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.9 (GNU/Linux)

iEYEARECAAYFAkmPX/QACgkQ/ejvha5XGaMPewCeOZYsNt2bv+CbOV58aV7isq4f
wCAAnA/10jcuad5NrP3BxyFZAYWH07ot
=iRw3
-END PGP SIGNATURE-



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-08 Thread Ciaran McCreesh
On Sat, 07 Feb 2009 15:23:18 -0800
Zac Medico  wrote:
> > Well, usually you don't keep intermediate or generated files in a
> > VCS, so why the metadata?
> 
> People who distribute overlays commonly ask if it's possible to
> distribute metadata cache with the overlay. Using a format that
> doesn't rely on timestamps will allow them to distribute metadata
> cache using their existing infrastructure, which is typically git or
> subversion. In addition to overlays, it would also be useful for
> forks of the entire gentoo tree, such as the funtoo tree [1].

Are these people really all going to remember to run some command at
the top level of the repository before every commit, and to git add the
relevant files for everything (thus making really messy commits)?

Sticking metadata cache files under version control really is a perfect
example of doing it wrong...

-- 
Ciaran McCreesh


signature.asc
Description: PGP signature


Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-08 Thread Zac Medico
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Tiziano Müller wrote:
> On Sunday, 08.02.2009, 12:36 -0800, Zac Medico wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA1
>>
>> Tiziano Müller wrote:
>>> But if your target is to reduce the size of the metadata cache, why
>>> store the hashes of the eclasses in the ebuild's metadata and not in a
>>> separate dir? They have to be the same for every ebuild, don't they?
>>> In case you have an average number of eclasses which is bigger than 4,
>>> you can even store the full hash with less space used than with
>>> truncated hashes for all eclasses.
>> The problem with having eclass integrity data shared in a separate
>> file is that it creates a requirement for all cache entries which
>> reference the same eclasses to be consistent with one another. This
>> means that a single cache entry can no longer be updated atomically.
>> For example, before updating the shared eclass integrity data, you'd
>> want to make sure that you first discard all of the cache entries
>> which reference it. Although it can be done this way, I think it's
>> much more convenient to have all of the integrity data encapsulated
>> within each individual cache entry.
> Ok, let me see if I get this: Since parts of the content of a
> metadata-entry (like the DEPEND/RDEPEND vars) depend on the contents of
> the eclass used by the time a cache entry got generated, you want to
> store the eclass' hash in the ebuild entry to make sure the entry gets
> invalidated once the eclass changes. Is that correct?

Right. By having each cache entry encapsulate its own integrity
data, the program updating the cache is never required to update
more than one file at a time. Having shared integrity data would
imply that the program would have the burden of maintaining
consistency across all cache entries.
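
The single-file update described above can be sketched in Python as
follows. This is only an illustration under stated assumptions: the
function name write_cache_entry and the one-value-per-line layout are
hypothetical, and the sketch relies on os.replace() swapping the file
in one step when the temporary file is in the same directory.

```python
import os
import tempfile


def write_cache_entry(path, lines):
    """Replace one cache entry without touching any other file.

    Hypothetical helper: the new content (including its own digest
    data) is written to a temporary file in the same directory and
    then renamed over the old entry.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write("\n".join(lines) + "\n")
        # Readers see either the old entry or the new one, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if writing or renaming failed.
        os.unlink(tmp_path)
        raise
```

With a shared eclass digest file, the same update would additionally
have to rewrite or discard every entry that references that file.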
- --
Thanks,
Zac
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.9 (GNU/Linux)

iEYEARECAAYFAkmPWTEACgkQ/ejvha5XGaOLHQCg0wGuRIkPCmQUQ2k14RjQlpv0
C54AoNqBaA6d3xyO6FuNz1GO7ZJ7y7E6
=D/ei
-END PGP SIGNATURE-



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-08 Thread Tiziano Müller
On Sunday, 08.02.2009, 12:36 -0800, Zac Medico wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Tiziano Müller wrote:
> > On Sunday, 08.02.2009, 00:59 -0800, Zac Medico wrote:
> >> -BEGIN PGP SIGNED MESSAGE-
> >> Hash: SHA1
> >>
> >> Tiziano Müller wrote:
> >>> On Saturday, 07.02.2009, 15:23 -0800, Zac Medico wrote:
> >>>> -BEGIN PGP SIGNED MESSAGE-
> >>>> Hash: SHA1
> >>>>
> >>>> Tiziano Müller wrote:
> >>>>> On Monday, 02.02.2009, 12:34 -0800, Zac Medico wrote:
> >>>> I like that idea. That way it's not necessary to bump the EAPI in
> >>>> order to change the hash function. So, a typical DIGESTS value might
> >>>> look like this:
> > You still have to bump the EAPI in case you want to use a new hash not
> > already available now (like SHA-3). The advantage of noting the used
> > hash is that new PMs can handle old metadata cache.
> 
> That's true.
> 
> >>>> SHA1 02021be38b a28b191904 3992945426 6ec21b29a3
> >>> Sleeping over it again I don't think that truncating a hash is a good
> >>> idea (truncating it from 40 to 10 digits makes the possibility of
> >>> collisions much much higher).
> >> The probability of collision is much higher, but it's still
> >> relatively small. Given the "avalanche effect" that is typical of
> >> cryptographic hash functions, it's extremely unlikely that collision
> >> will occur in such a way that it will cause a problem for cache
> >> validation.
> > The "avalanche effect" as I understood it is required for a hash
> > function to avoid simple calculations of collisions (what the diffusion
> > is for crypto algorithms). So, small changes should affect as many
> > numbers in the hash as possible. But you don't have only small changes
> > here in case somebody patches an eclass, so, the only thing which counts
> > is the probability of a collision.
> 
> Well, the avalanche effect helps in the sense that the leftmost 10
> digits would serve approximately as well as any other 10 digits out
> of all of them. But you're right about the probability of a
> collision being what really matters. With 10 hex digits, we've got a
> space of 16^10 = 1.1e12 possible combinations. Given a space that
> large, the probability of a collision is pretty small.
> 
> >>> But if you want to go this way, I'd say you should use something like
> >>> SHA1t (t for truncated) to make sure we can use full hashes once we feel
> >>> it's appropriate.
> >> We could, but I think SHA1 would also be fine since one can infer
> >> from the length of the string that it's been truncated.
> > No, guessing is a bad thing here because it could be truncated because
> > of faulty metadata. But the main motivation is that if you write SHA1
> > everyone reading it expects it to be a full SHA1 hash, which it isn't.
> 
> Well, if the metadata is faulty then the digests are unlikely to
> match and the cache will be discarded anyway as invalid. However, I
> think your point is still somewhat valid, so SHA1t is fine with me
> if that makes more people happy. Does anyone else have a preference
> here?
> 
> > But if your target is to reduce the size of the metadata cache, why
> > store the hashes of the eclasses in the ebuild's metadata and not in a
> > separate dir? They have to be the same for every ebuild, don't they?
> > In case you have an average number of eclasses which is bigger than 4,
> > you can even store the full hash with less space used than with
> > truncated hashes for all eclasses.
> 
> The problem with having eclass integrity data shared in a separate
> file is that it creates a requirement for all cache entries which
> reference the same eclasses to be consistent with one another. This
> means that a single cache entry can no longer be updated atomically.
> For example, before updating the shared eclass integrity data, you'd
> want to make sure that you first discard all of the cache entries
> which reference it. Although it can be done this way, I think it's
> much more convenient to have all of the integrity data encapsulated
> within each individual cache entry.
Ok, let me see if I get this: Since parts of the content of a
metadata-entry (like the DEPEND/RDEPEND vars) depend on the contents of
the eclass used by the time a cache entry got generated, you want to
store the eclass' hash in the ebuild entry to make sure the entry gets
invalidated once the eclass changes. Is that correct?


-- 
---
Tiziano Müller
Gentoo Linux Developer, Council Member
Areas of responsibility:
  Samba, PostgreSQL, CPP, Python, sysadmin
E-Mail : dev-z...@gentoo.org
GnuPG FP   : F327 283A E769 2E36 18D5  4DE2 1B05 6A63 AE9C 1E30


signature.asc
Description: This is a digitally signed message part


Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-08 Thread Zac Medico
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Tiziano Müller wrote:
> On Sunday, 08.02.2009, 00:59 -0800, Zac Medico wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA1
>>
>> Tiziano Müller wrote:
>>> On Saturday, 07.02.2009, 15:23 -0800, Zac Medico wrote:
>>>> -BEGIN PGP SIGNED MESSAGE-
>>>> Hash: SHA1
>>>>
>>>> Tiziano Müller wrote:
>>>>> On Monday, 02.02.2009, 12:34 -0800, Zac Medico wrote:
>>>> I like that idea. That way it's not necessary to bump the EAPI in
>>>> order to change the hash function. So, a typical DIGESTS value might
>>>> look like this:
> You still have to bump the EAPI in case you want to use a new hash not
> already available now (like SHA-3). The advantage of noting the used
> hash is that new PMs can handle old metadata cache.

That's true.

>>>> SHA1 02021be38b a28b191904 3992945426 6ec21b29a3
>>> Sleeping over it again I don't think that truncating a hash is a good
>>> idea (truncating it from 40 to 10 digits makes the possibility of
>>> collisions much much higher).
>> The probability of collision is much higher, but it's still
>> relatively small. Given the "avalanche effect" that is typical of
>> cryptographic hash functions, it's extremely unlikely that collision
>> will occur in such a way that it will cause a problem for cache
>> validation.
> The "avalanche effect" as I understood it is required for a hash
> function to avoid simple calculations of collisions (what the diffusion
> is for crypto algorithms). So, small changes should affect as many
> numbers in the hash as possible. But you don't have only small changes
> here in case somebody patches an eclass, so, the only thing which counts
> is the probability of a collision.

Well, the avalanche effect helps in the sense that the leftmost 10
digits would serve approximately as well as any other 10 digits out
of all of them. But you're right about the probability of a
collision being what really matters. With 10 hex digits, we've got a
space of 16^10 = 1.1e12 possible combinations. Given a space that
large, the probability of a collision is pretty small.
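
The arithmetic can be checked with a short Python sketch. Two numbers
are worth separating: the chance that one modified file happens to
keep the truncated digest of its previous version (which is what
would fool cache validation), and the birthday-style chance that any
two files among n share a truncated digest. The function names and
the file count used in the usage note are assumptions made for
illustration, and both formulas treat the truncated digest as
uniformly random.

```python
import math

HEX_DIGITS = 10
SPACE = 16 ** HEX_DIGITS  # 1099511627776, roughly 1.1e12


def stale_match_probability():
    """Chance that a changed file keeps the same truncated digest."""
    return 1.0 / SPACE


def birthday_collision_probability(n):
    """Approximate chance that any two of n files share a digest.

    Uses the usual approximation 1 - exp(-n(n-1) / (2 * SPACE)).
    """
    return -math.expm1(-n * (n - 1) / (2.0 * SPACE))
```

For example, birthday_collision_probability(30000) gives the figure
for a hypothetical tree of 30000 files, whereas
stale_match_probability() is the per-change figure relevant to
validating a single cache entry.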

>>> But if you want to go this way, I'd say you should use something like
>>> SHA1t (t for truncated) to make sure we can use full hashes once we feel
>>> it's appropriate.
>> We could, but I think SHA1 would also be fine since one can infer
>> from the length of the string that it's been truncated.
> No, guessing is a bad thing here because it could be truncated because
> of faulty metadata. But the main motivation is that if you write SHA1
> everyone reading it expects it to be a full SHA1 hash, which it isn't.

Well, if the metadata is faulty then the digests are unlikely to
match and the cache will be discarded anyway as invalid. However, I
think your point is still somewhat valid, so SHA1t is fine with me
if that makes more people happy. Does anyone else have a preference
here?

> But if your target is to reduce the size of the metadata cache, why
> store the hashes of the eclasses in the ebuild's metadata and not in a
> separate dir? They have to be the same for every ebuild, don't they?
> In case you have an average number of eclasses which is bigger than 4,
> you can even store the full hash with less space used than with
> truncated hashes for all eclasses.

The problem with having eclass integrity data shared in a separate
file is that it creates a requirement for all cache entries which
reference the same eclasses to be consistent with one another. This
means that a single cache entry can no longer be updated atomically.
For example, before updating the shared eclass integrity data, you'd
want to make sure that you first discard all of the cache entries
which reference it. Although it can be done this way, I think it's
much more convenient to have all of the integrity data encapsulated
within each individual cache entry.
- --
Thanks,
Zac
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.9 (GNU/Linux)

iEYEARECAAYFAkmPQjkACgkQ/ejvha5XGaNFUACfQvVYgNiZNK8PVReTZKN47wQU
9wkAniltb1ivZYGgmhn/eli2fpprkOlI
=2mbq
-END PGP SIGNATURE-



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-08 Thread Tiziano Müller
On Sunday, 08.02.2009, 00:59 -0800, Zac Medico wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Tiziano Müller wrote:
> > On Saturday, 07.02.2009, 15:23 -0800, Zac Medico wrote:
> >> -BEGIN PGP SIGNED MESSAGE-
> >> Hash: SHA1
> >>
> >> Tiziano Müller wrote:
> >>> On Monday, 02.02.2009, 12:34 -0800, Zac Medico wrote:
> >>>> For the digest format, I suggest that we use the leftmost 10
> >>>> hexadecimal digits of the SHA-1 digest. The rationale for limiting
> >>>> it to 10 digits (out of 40) is to save space. Due to the avalanche
> >>>> effect [2], 10 digits should be sufficient to ensure that problems
> >>>> resulting from hash collisions are extremely unlikely.
> >>> I'd recommend to prefix the digest with a "{TYPE}" (like for hashed
> >>> passwords) to be able to change the digest algorithm as needed
> >>> (especially in regards to the current SHA successor competition).
> >>> This allows a future package manager which might use SHA-3 for hashing
> >>> (once it's released) to still check old digests. Furthermore it would
> >>> allow for easier transition and only needs a definition of allowed
> >>> hashes instead of a specific one.
> >> I like that idea. That way it's not necessary to bump the EAPI in
> >> order to change the hash function. So, a typical DIGESTS value might
> >> look like this:
You still have to bump the EAPI in case you want to use a new hash not
already available now (like SHA-3). The advantage of noting the used
hash is that new PMs can handle old metadata cache.

> >>
> >> SHA1 02021be38b a28b191904 3992945426 6ec21b29a3
> > 
> > Sleeping over it again I don't think that truncating a hash is a good
> > idea (truncating it from 40 to 10 digits makes the possibility of
> > collisions much much higher).
> 
> The probability of collision is much higher, but it's still
> relatively small. Given the "avalanche effect" that is typical of
> cryptographic hash functions, it's extremely unlikely that collision
> will occur in such a way that it will cause a problem for cache
> validation.
The "avalanche effect" as I understood it is required for a hash
function to avoid simple calculations of collisions (what the diffusion
is for crypto algorithms). So, small changes should affect as many
numbers in the hash as possible. But you don't have only small changes
here in case somebody patches an eclass, so, the only thing which counts
is the probability of a collision.

> 
> > But if you want to go this way, I'd say you should use something like
> > SHA1t (t for truncated) to make sure we can use full hashes once we feel
> > it's appropriate.
> 
> We could, but I think SHA1 would also be fine since one can infer
> from the length of the string that it's been truncated.
No, guessing is a bad thing here because it could be truncated because
of faulty metadata. But the main motivation is that if you write SHA1
everyone reading it expects it to be a full SHA1 hash, which it isn't.

But if your target is to reduce the size of the metadata cache, why
store the hashes of the eclasses in the ebuild's metadata and not in a
separate dir? They have to be the same for every ebuild, don't they?
In case you have an average number of eclasses which is bigger than 4,
you can even store the full hash with less space used than with
truncated hashes for all eclasses.

-- 
---
Tiziano Müller
Gentoo Linux Developer, Council Member
Areas of responsibility:
  Samba, PostgreSQL, CPP, Python, sysadmin
E-Mail : dev-z...@gentoo.org
GnuPG FP   : F327 283A E769 2E36 18D5  4DE2 1B05 6A63 AE9C 1E30


signature.asc
Description: This is a digitally signed message part


Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-08 Thread Zac Medico
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Tiziano Müller wrote:
> On Saturday, 07.02.2009, 15:23 -0800, Zac Medico wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA1
>>
>> Tiziano Müller wrote:
>>> On Monday, 02.02.2009, 12:34 -0800, Zac Medico wrote:
>>>> For the digest format, I suggest that we use the leftmost 10
>>>> hexadecimal digits of the SHA-1 digest. The rationale for limiting
>>>> it to 10 digits (out of 40) is to save space. Due to the avalanche
>>>> effect [2], 10 digits should be sufficient to ensure that problems
>>>> resulting from hash collisions are extremely unlikely.
>>> I'd recommend to prefix the digest with a "{TYPE}" (like for hashed
>>> passwords) to be able to change the digest algorithm as needed
>>> (especially in regards to the current SHA successor competition).
>>> This allows a future package manager which might use SHA-3 for hashing
>>> (once it's released) to still check old digests. Furthermore it would
>>> allow for easier transition and only needs a definition of allowed
>>> hashes instead of a specific one.
>> I like that idea. That way it's not necessary to bump the EAPI in
>> order to change the hash function. So, a typical DIGESTS value might
>> look like this:
>>
>> SHA1 02021be38b a28b191904 3992945426 6ec21b29a3
> 
> Sleeping over it again I don't think that truncating a hash is a good
> idea (truncating it from 40 to 10 digits makes the possibility of
> collisions much much higher).

The probability of collision is much higher, but it's still
relatively small. Given the "avalanche effect" that is typical of
cryptographic hash functions, it's extremely unlikely that collision
will occur in such a way that it will cause a problem for cache
validation.

> But if you want to go this way, I'd say you should use something like
> SHA1t (t for truncated) to make sure we can use full hashes once we feel
> it's appropriate.

We could, but I think SHA1 would also be fine since one can infer
from the length of the string that it's been truncated.
- --
Thanks,
Zac
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.9 (GNU/Linux)

iEYEARECAAYFAkmOnvwACgkQ/ejvha5XGaPtSACeOS21UYlvkMQy5q86B+9aKHpH
DnUAoK1P83uKFEd2uzfc2t+QhArMHeEZ
=jPpV
-END PGP SIGNATURE-



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-08 Thread Tiziano Müller
On Saturday, 07.02.2009, 15:23 -0800, Zac Medico wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Tiziano Müller wrote:
> > On Monday, 02.02.2009, 12:34 -0800, Zac Medico wrote:
> >> For the digest format, I suggest that we use the leftmost 10
> >> hexadecimal digits of the SHA-1 digest. The rationale for limiting
> >> it to 10 digits (out of 40) is to save space. Due to the avalanche
> >> effect [2], 10 digits should be sufficient to ensure that problems
> >> resulting from hash collisions are extremely unlikely.
> > I'd recommend to prefix the digest with a "{TYPE}" (like for hashed
> > passwords) to be able to change the digest algorithm as needed
> > (especially in regards to the current SHA successor competition).
> > This allows a future package manager which might use SHA-3 for hashing
> > (once it's released) to still check old digests. Furthermore it would
> > allow for easier transition and only needs a definition of allowed
> > hashes instead of a specific one.
> 
> I like that idea. That way it's not necessary to bump the EAPI in
> order to change the hash function. So, a typical DIGESTS value might
> look like this:
> 
> SHA1 02021be38b a28b191904 3992945426 6ec21b29a3

Having slept on it, I don't think that truncating a hash is a good
idea (truncating it from 40 to 10 digits makes the probability of
collisions much, much higher).
But if you want to go this way, I'd say you should use something like
SHA1t (t for truncated) to make sure we can use full hashes once we feel
it's appropriate.

-- 
---
Tiziano Müller
Gentoo Linux Developer, Council Member
Areas of responsibility:
  Samba, PostgreSQL, CPP, Python, sysadmin
E-Mail : dev-z...@gentoo.org
GnuPG FP   : F327 283A E769 2E36 18D5  4DE2 1B05 6A63 AE9C 1E30


signature.asc
Description: This is a digitally signed message part


Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-07 Thread Zac Medico
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Tiziano Müller wrote:
> On Monday, 02.02.2009, 12:34 -0800, Zac Medico wrote:
>> For the digest format, I suggest that we use the leftmost 10
>> hexadecimal digits of the SHA-1 digest. The rationale for limiting
>> it to 10 digits (out of 40) is to save space. Due to the avalanche
>> effect [2], 10 digits should be sufficient to ensure that problems
>> resulting from hash collisions are extremely unlikely.
> I'd recommend to prefix the digest with a "{TYPE}" (like for hashed
> passwords) to be able to change the digest algorithm as needed
> (especially in regards to the current SHA successor competition).
> This allows a future package manager which might use SHA-3 for hashing
> (once it's released) to still check old digests. Furthermore it would
> allow for easier transition and only needs a definition of allowed
> hashes instead of a specific one.

I like that idea. That way it's not necessary to bump the EAPI in
order to change the hash function. So, a typical DIGESTS value might
look like this:

SHA1 02021be38b a28b191904 3992945426 6ec21b29a3
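
A reader for a value of that shape might look roughly like the
following Python sketch. The function name parse_digests and the
choice to raise ValueError on malformed input are assumptions; the
only thing taken from the example above is the layout: an algorithm
name, then the ebuild digest, then one digest per inherited eclass.

```python
def parse_digests(value):
    """Split a typed DIGESTS value into its parts.

    Returns (algorithm, ebuild_digest, eclass_digests). Hypothetical
    helper; a real package manager would also check that it knows the
    named algorithm before trusting the digests.
    """
    fields = value.split()
    if len(fields) < 2:
        raise ValueError("DIGESTS needs an algorithm and an ebuild digest")
    algorithm, ebuild_digest = fields[0], fields[1]
    eclass_digests = fields[2:]
    return algorithm, ebuild_digest, eclass_digests
```

Because the algorithm name travels with the digests, a newer package
manager can still recognise and check entries written with an older
hash.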

>> The primary reason to use a digest for cache validation instead of a
>> timestamp is that it allows the cache validation mechanism to work
>> even if the tree is distributed with a protocol that does not
>> preserve timestamps, such as git or subversion. This would make it
> Well, usually you don't keep intermediate or generated files in a VCS,
> so why the metadata?

People who distribute overlays commonly ask if it's possible to
distribute metadata cache with the overlay. Using a format that
doesn't rely on timestamps will allow them to distribute metadata
cache using their existing infrastructure, which is typically git or
subversion. In addition to overlays, it would also be useful for
forks of the entire gentoo tree, such as the funtoo tree [1].

[1] http://github.com/funtoo/portage/tree/master
- --
Thanks,
Zac
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.9 (GNU/Linux)

iEYEARECAAYFAkmOF+UACgkQ/ejvha5XGaPSyQCg7kVF3S1z4G+7pXOrLBB1Pu77
Y5cAnj60bGSww8SLfcqhHmk1voKwm20+
=PmlJ
-END PGP SIGNATURE-



Re: [gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-07 Thread Tiziano Müller
On Monday, 02.02.2009, 12:34 -0800, Zac Medico wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Hi,
> 
> I'd like to add a new metadata cache value called DIGESTS which will
> contain a space separated list of digests which can be
> used to validate the metadata cache. Like INHERITED and
> DEFINED_PHASES [1], it will be automatically generated. The first
> digest in the list will correspond to the ebuild. If there are any
> inherited eclasses, the digests of those eclasses will follow in a
> space separated list, in the same order that they occur in the
> INHERITED variable. The value of the DIGESTS variable will be on
> line 18 of the metadata cache (just after DEFINED_PHASES).
> 
> For the digest format, I suggest that we use the leftmost 10
> hexadecimal digits of the SHA-1 digest. The rationale for limiting
> it to 10 digits (out of 40) is to save space. Due to the avalanche
> effect [2], 10 digits should be sufficient to ensure that problems
> resulting from hash collisions are extremely unlikely.
I'd recommend prefixing the digest with a "{TYPE}" (as for hashed
passwords) so that the digest algorithm can be changed as needed
(especially in view of the current SHA successor competition).
This allows a future package manager which might use SHA-3 for hashing
(once it's released) to still check old digests. Furthermore it would
allow for easier transition and only needs a definition of allowed
hashes instead of a specific one.

> 
> The primary reason to use a digest for cache validation instead of a
> timestamp is that it allows the cache validation mechanism to work
> even if the tree is distributed with a protocol that does not
> preserve timestamps, such as git or subversion. This would make it
Well, usually you don't keep intermediate or generated files in a VCS,
so why the metadata?

> possible to distribute metadata cache directly from git and
> subversion repositories (among others). Since a digest is inherently
> more expensive to obtain than a timestamp, package managers may use
> the Manifest entries as a digest cache, in order to avoid the need
> to compute digests of ebuilds during dependency calculations.
> 
> Does the suggested approach seem reasonable? Would anybody like to
> suggest any changes?

Cheers,
Tiziano

-- 
---
Tiziano Müller
Gentoo Linux Developer, Council Member
Areas of responsibility:
  Samba, PostgreSQL, CPP, Python, sysadmin
E-Mail : dev-z...@gentoo.org
GnuPG FP   : F327 283A E769 2E36 18D5  4DE2 1B05 6A63 AE9C 1E30


signature.asc
Description: This is a digitally signed message part


[gentoo-dev] [RFC] DIGESTS metadata variable for cache validation

2009-02-02 Thread Zac Medico
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi,

I'd like to add a new metadata cache value called DIGESTS which will
contain a space separated list of digests which can be
used to validate the metadata cache. Like INHERITED and
DEFINED_PHASES [1], it will be automatically generated. The first
digest in the list will correspond to the ebuild. If there are any
inherited eclasses, the digests of those eclasses will follow in a
space separated list, in the same order that they occur in the
INHERITED variable. The value of the DIGESTS variable will be on
line 18 of the metadata cache (just after DEFINED_PHASES).

For the digest format, I suggest that we use the leftmost 10
hexadecimal digits of the SHA-1 digest. The rationale for limiting
it to 10 digits (out of 40) is to save space. Due to the avalanche
effect [2], 10 digits should be sufficient to ensure that problems
resulting from hash collisions are extremely unlikely.
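
To make the format concrete, here is a minimal Python sketch of how
such a DIGESTS value could be generated. The function names are
hypothetical and the inputs are simply the raw bytes of the ebuild
and of each inherited eclass, passed in INHERITED order; reading the
files and writing the cache line are left out.

```python
import hashlib
from typing import Sequence

DIGEST_LENGTH = 10  # leftmost hex digits kept, out of 40


def truncated_sha1(data: bytes, length: int = DIGEST_LENGTH) -> str:
    """Return the leftmost `length` hex digits of the SHA-1 of data."""
    return hashlib.sha1(data).hexdigest()[:length]


def build_digests(ebuild: bytes, eclasses: Sequence[bytes]) -> str:
    """Ebuild digest first, then eclass digests in INHERITED order."""
    return " ".join(truncated_sha1(blob) for blob in [ebuild, *eclasses])
```

An ebuild that inherits three eclasses would therefore get four
space-separated 10-digit values.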

The primary reason to use a digest for cache validation instead of a
timestamp is that it allows the cache validation mechanism to work
even if the tree is distributed with a protocol that does not
preserve timestamps, such as git or subversion. This would make it
possible to distribute metadata cache directly from git and
subversion repositories (among others). Since a digest is inherently
more expensive to obtain than a timestamp, package managers may use
the Manifest entries as a digest cache, in order to avoid the need
to compute digests of ebuilds during dependency calculations.
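
The validation step could be sketched in Python like this. It assumes
that full 40-digit SHA-1 hex digests for the ebuild and its eclasses
are already at hand, whether looked up from a digest cache such as
Manifest entries or computed on the spot; the function name and
argument layout are hypothetical.

```python
DIGEST_LENGTH = 10  # truncated digest length assumed from the proposal


def cache_entry_is_valid(stored_digests, full_sha1_hexes):
    """Compare stored truncated digests against current full digests.

    stored_digests: the DIGESTS string from the cache entry.
    full_sha1_hexes: current full SHA-1 hex digests, ebuild first,
    then eclasses in INHERITED order.
    """
    stored = stored_digests.split()
    if len(stored) != len(full_sha1_hexes):
        return False  # eclass list changed, or the entry is malformed
    return all(
        len(short) == DIGEST_LENGTH and full.startswith(short)
        for short, full in zip(stored, full_sha1_hexes)
    )
```

No timestamps are consulted, so the result is the same whether the
tree arrived via rsync, git or subversion.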

Does the suggested approach seem reasonable? Would anybody like to
suggest any changes?

[1]
http://archives.gentoo.org/gentoo-dev/msg_8c34d8efbc0d31ab28c517403dc83f62.xml
[2] http://en.wikipedia.org/wiki/Avalanche_effect
- --
Thanks,
Zac


-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.9 (GNU/Linux)

iEYEARECAAYFAkmHWOQACgkQ/ejvha5XGaOJeQCgouZGO+pbOgJYkzssRVhzMDwt
Cq4AoN6NG7SmJ6XjEked1WnZ+CJPXVWj
=JSDL
-END PGP SIGNATURE-