On 01/04/2012, at 2:07 AM, Daz DeBoer wrote:

> 
> On 31 March 2012 02:15, Luke Daley <[email protected]> wrote:
> On 30/03/2012, at 9:24 PM, Adam Murdoch <[email protected]> wrote:
>> We want to keep the transports and the caching as separate as possible, so 
>> we can reuse the caching across transports. This may not necessarily mean 
>> that every caching strategy will work with every transport, but it would be 
>> nice to have at least one strategy that can work across any transport. And 
>> it looks like the 'best' option we've come up with for http also happens to 
>> be a generic option, too (except for the etag check, but I'm sure we can 
>> deal with that), so perhaps we only need one strategy. At some point we 
>> might end up with some other transport-specific strategies, but ideally we 
>> can base these on optional abstract capabilities of the transport (e.g. 'can 
>> you provide content-length+last-modified-time efficiently?', 'can you do a 
>> get-content-if-sha1-does-not-match?' and so on) rather than on concrete 
>> transports.
> 
> This is more or less what we have now.
> 
> https://github.com/gradle/gradle/blob/master/subprojects/core-impl/src/main/groovy/org/gradle/api/internal/externalresource/transfer/DefaultCacheAwareExternalResourceAccessor.java
> 
> The ExternalResourceAccessor contract is kinda flexible so I think we could 
> make it work for most transports and still use this general caching algorithm.
> 
> At the moment this is built into ExternalResourceRepository, but we could allow 
> injection of a custom one easily enough.
> 
> There are still four things I want to do before wrapping this up:
> 
> * treat 403 and 405 responses to HEAD requests as "metadata unknown"
> * if the server is googlecode, treat 404 responses to HEAD requests as "metadata unknown"
> * when reusing a locally found resource, store the real metadata in the index.
> * where it's safe to, extract the sha1 from the etag (e.g. Artifactory).
> 
> All of these things are small.
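
On the last item: as I understand it, servers like Artifactory use the artefact's 
sha1 as the ETag value, so the 'safe' check can probably be as simple as this 
(just a sketch, not code from the codebase):

    // Sketch only. If the ETag value looks like a 40-char hex sha1, use it
    // directly instead of fetching a separate .sha1 resource.
    String sha1FromEtag(String etag) {
        if (etag == null) {
            return null
        }
        def value = etag.replaceAll('^(W/)?"|"$', '')   // strip weak marker and surrounding quotes
        return (value ==~ '[a-fA-F0-9]{40}') ? value.toLowerCase() : null
    }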
> 
> Cool. One more thing we should do soon (but post 1.0) is allow caching of 
> 'file' repositories. The recent change to FileSystemResolver means that we've 
> removed the previous workaround for people with slow filesystem access (eg 
> network shares).
> 
> I think the solution we discussed was to always cache local repository 
> artifacts (more consistent), but to always treat them as 'changing'.

Thinking about this a little more, it feels a bit awkward that the exact same 
artefact would be considered 'can never change' when it's up on a remote 
repository, and 'can change at any time' when it's on a local repository.

I think what we're trying to say here is that any artefact at all can change, and 
that it's 'expensive' to check for changes to remote artefacts and 'cheap' to 
check for changes to local artefacts. Different artefacts also have different 
likelihoods of change: changing artefacts are 'likely' to have changed, while 
non-changing artefacts are 'unlikely' to have changed. Whether or not we check 
is a function of (how-long-since-we-last-checked, likelihood-of-change, 
cost-of-checking-for-change).
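
In pseudo-code, something like this (names made up, just to show the shape of 
the decision):

    // Sketch only - not real Gradle types. Whether to check for changes is a
    // function of how long ago we last checked, how likely a change is, and
    // how expensive the check is for this transport.
    boolean shouldCheckForChanges(long millisSinceLastCheck,
                                  double likelihoodOfChange,   // e.g. changing ~ 0.9, release = 0.0
                                  long checkCostMillis) {      // cheap for file://, expensive for http://
        if (likelihoodOfChange == 0) {
            return false   // semantically 'never changes', e.g. a release
        }
        // Made-up heuristic: check more often when change is likely and checking is cheap.
        long expiry = (long) (checkCostMillis / likelihoodOfChange)
        return millisSinceLastCheck > expiry
    }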

There's also a semantic aspect here - some artefacts, such as releases, should 
never change, regardless of where they live. Which means we shouldn't ever 
check for changes to them (or check for changes and fail if they do happen to 
change).

I think we want to introduce a new 'cacheModulesFor' expiry period that applies 
to all modules, changing or otherwise. Default this to a 'long' period 
(possibly infinite) for remote repositories and a 'short' period (possibly 0) 
for local repositories. Default 'cacheChangingModulesFor' to 0 for local 
repositories. Such a setting would be useful, for example, for those who want 
to validate all dependencies during a CI build.
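
To make that concrete, in a build script it might end up looking something like 
this (cacheChangingModulesFor exists today; cacheModulesFor is only the proposed 
setting and doesn't exist yet):

    // cacheChangingModulesFor is real; cacheModulesFor is only the proposal above.
    configurations.all {
        resolutionStrategy {
            // existing setting: expiry for modules marked as 'changing'
            cacheChangingModulesFor 0, 'seconds'
            // proposed setting: expiry for all modules, changing or otherwise.
            // A CI build that wants to re-validate every dependency would use 0.
            cacheModulesFor 0, 'seconds'
        }
    }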

Over time, we will add a richer model for the lifecycle of a module, so we can 
distinguish between things like:
* Something I'm working on, which needs to be checked (and maybe even built) on 
each resolve.
* An active development stream, which needs to be checked frequently, possibly 
on every resolve.
* A nightly or periodic build, which only needs to be checked once a day or so.
* A release, which never needs to be checked. It is an error if such an 
artefact ever changes.

That is, these checks become location independent, and more a function of the 
state of the thing we're using. Of course, we might still consider the 
location, by checking 'cheap' locations more frequently than 'expensive' 
locations.
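
As a rough illustration of the sort of states I mean (purely a sketch, none of 
these types exist):

    // Purely illustrative - none of these types exist today.
    enum ModuleState {
        IN_DEVELOPMENT,   // something I'm working on: check (maybe even build) on each resolve
        ACTIVE_STREAM,    // active development stream: check frequently, possibly every resolve
        PERIODIC_BUILD,   // nightly or periodic build: once a day or so is enough
        RELEASE           // must never change: don't check, or check and fail on change
    }

    // Expiry is driven by the module's state rather than its location; a 'cheap'
    // location (e.g. a local file repository) might still shorten the effective expiry.
    def defaultExpiryMillis = [
        (ModuleState.IN_DEVELOPMENT): 0L,
        (ModuleState.ACTIVE_STREAM):  0L,
        (ModuleState.PERIODIC_BUILD): 24 * 60 * 60 * 1000L,
        (ModuleState.RELEASE):        Long.MAX_VALUE
    ]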

We can also make use of this information to improve various other things. The 
multi-repository handling might, for example, prefer a locally built copy of a 
module that I am working on, over one built by CI.


> That is, on every resolve we would check if the cached artifact was 
> up-to-date, by comparing modification-date+size, sha1, etc.

We should do the check once per build, rather than once per resolve. It's not a 
good thing to change an artefact halfway through a build, I think.
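
Something like this would capture that - a per-build memo of which artefacts 
we've already verified (again just a sketch, not the real classes):

    // Sketch only. Remember which cached artefacts have already been verified
    // (modification-date + size, sha1, ...) during this build, so the check runs
    // at most once per build and an artefact can't change halfway through.
    class OncePerBuildUpToDateCheck {
        private final Set<String> checkedThisBuild = [] as Set

        boolean needsCheck(String artefactId) {
            // Set.add() returns false if we already verified this artefact in this build
            return checkedThisBuild.add(artefactId)
        }

        void buildFinished() {
            checkedThisBuild.clear()
        }
    }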


--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com
