[gradle-dev] Strategy for minimising network traffic during dependency resolution.

Luke Daley Thu, 29 Mar 2012 10:03:02 -0700

Hi all,

As previously discussed, we are now leveraging last modified and content length 
values to avoid downloading changing artifacts (resources really) that have not 
changed. Currently, our strategy is the following…


Given an artifact id (group, name, version) and a repository:

1. See if we have resolved this artifact from this repository previously. If so, 
and if the cache entry has not expired, use the cached resource. Otherwise:
2. Search the local file system in a bunch of places (e.g. Maven local, old 
Gradle caches, the current filestore) for anything that was resolved with 
effectively the same artifact id.
3. Convert the request into a URL to hit.
4. Search the cache index for a record of the metadata for this URL.

So we may now have 0..n “locally available resource candidates” that we think 
may be the same as what's behind the URL, and possibly a “cached external 
resource” (a record of the metadata from the last time we hit the resource, and 
its location in the filestore).

The fetch process looks like this:

* If there are any locally available resource candidates, fetch the remote sha1 
for the resource if it's available.
* If any of the locally available resource candidates have the same checksum, 
use that instead of downloading the resource (at the cost of not obtaining 
metadata such as last modified, etag etc).
* If not, or if there was no remote checksum available:
* If we have a cached version of the resource, compare the cached metadata with 
the real metadata via a HEAD request (in practice, this implies that there was 
no remote checksum).
** If the metadata is unchanged (compare last modified date and content length 
for equality), use the cached version (including metadata).
** If the metadata has changed, issue a GET to download the resource (and then 
cache the resource, of course).
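
To make the steps above concrete, here is a rough sketch of the checksum-first flow. It is not the actual Gradle code: Remote, Metadata, CachedResource and fetch_checksum_first are hypothetical names made up for illustration, and the remote is faked in memory so that the requests issued can be recorded. It follows the bullets literally, so a HEAD is still attempted after a checksum miss when a cached version exists, and it assumes a plain GET when there is no cached version at all.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Metadata:
    last_modified: Optional[str]
    content_length: Optional[int]


@dataclass(frozen=True)
class CachedResource:
    metadata: Metadata  # metadata recorded the last time we hit the URL
    content: bytes      # stands in for the file in the filestore


@dataclass
class Remote:
    """Hypothetical in-memory repository that records each request."""
    content: bytes
    metadata: Metadata
    publishes_sha1: bool = True
    requests: List[str] = field(default_factory=list)

    def get_sha1(self) -> Optional[str]:
        self.requests.append("GET sha1")
        if not self.publishes_sha1:
            return None
        return hashlib.sha1(self.content).hexdigest()

    def head(self) -> Metadata:
        self.requests.append("HEAD resource")
        return self.metadata

    def get(self) -> bytes:
        self.requests.append("GET resource")
        return self.content


def fetch_checksum_first(remote: Remote,
                         candidates: Sequence[bytes],
                         cached: Optional[CachedResource]) -> bytes:
    # Only ask for the remote sha1 if there is something local to compare.
    if candidates:
        remote_sha1 = remote.get_sha1()
        if remote_sha1 is not None:
            for candidate in candidates:
                if hashlib.sha1(candidate).hexdigest() == remote_sha1:
                    # Reuse the local file; no fresh metadata is obtained.
                    return candidate
    # No checksum match, or no remote checksum at all.
    if cached is not None:
        # Equality on last modified and content length only.
        if remote.head() == cached.metadata:
            return cached.content
    # Assumption: with no usable local copy we simply download.
    return remote.get()
```

Inspecting remote.requests after a call shows which round trips a given scenario costs.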

I think this is the practical thing to do, but probably not theoretically 
correct.

The issue is that by using the checksum check to determine if something has 
changed or not, we lose any cached metadata about the resource. If we find 
something on the filesystem with the same checksum, all we can really assume is 
that that file has the same binary content. We cannot assume that it came from 
the same URL, which should invalidate any cached metadata we had for that URL. 
However, since the only metadata items that we care about are content length, 
last modified and etag, if the checksum hasn't changed we could probably assume 
that these values haven't changed either.

Furthermore, it probably doesn't matter because if there are remote checksums 
for a resource available then we aren't really going to use the metadata for 
anything.

Further furthermore, our current strategy is optimised for the case where 
checksums are available which is considered best practice. If we flipped it 
around and compared metadata first…

Pros:
* If the item is unchanged, we only have one HEAD request as opposed to the GET 
on the checksum (faster)
* We maintain cached metadata “integrity”

Cons:
* If the item has changed, we have one HEAD for the metadata (to determine it 
was changed) then another GET for the sha1 (to look for locally available 
resources)

Keep in mind, the con there is the rare case. It means that the external 
resource has changed since the last time we saw it, but something else (e.g. 
Maven, an older Gradle version) has downloaded it in the meantime.

Under this (metadata first) strategy, the requests for a 
seen-before-but-changed resource would look like this:

* HEAD to resource (get metadata) - determine changed
* GET to checksum - most likely outcome is that we don't find a local version 
of this
* GET to resource

Under the current (checksum first) strategy it looks like this:

* GET to checksum - no local version found with checksum
* GET to resource 

Under this (metadata first) strategy, the requests for a 
seen-before-but-UNchanged resource would look like this:

* HEAD to resource (get metadata) - determine unchanged

Under the current (checksum first) strategy it looks like this:

* GET to checksum - local version found with checksum (can't guarantee it came 
from the same URL)
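
The four request lists above can be written down as two small functions, which makes the trade-off easy to eyeball. This is only a transcription of the lists in this email, not real resolver code; the function names and the local_match flag (something else, such as Maven, has already downloaded the changed resource) are hypothetical.

```python
from typing import List


def metadata_first(changed: bool, local_match: bool = False) -> List[str]:
    """Requests for a seen-before resource when metadata is compared first."""
    requests = ["HEAD resource"]
    if not changed:
        return requests  # cached copy and cached metadata are reused
    requests.append("GET checksum")
    if not local_match:
        requests.append("GET resource")
    return requests


def checksum_first(changed: bool, local_match: bool = False) -> List[str]:
    """Requests for a seen-before resource under the current strategy."""
    requests = ["GET checksum"]
    # An unchanged resource matches the copy already in the filestore.
    if changed and not local_match:
        requests.append("GET resource")
    return requests


if __name__ == "__main__":
    for changed in (True, False):
        label = "changed" if changed else "unchanged"
        print(label, "metadata first:", metadata_first(changed))
        print(label, "checksum first:", checksum_first(changed))
```

The only place metadata first comes out ahead is the unchanged case, and there it swaps one small GET for one HEAD.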


Still following? :)

For me this comes down to:

* Is there a noticeable benefit to one HEAD request over one GET (for a sha1 
text file)? If not, then we don't change. If so:
* Do we optimise for the case where the resource is unchanged?


There's another interesting option. Some servers send an “X-checksum-SHA1” 
header (e.g. Artifactory). In this case, we could use this when performing the 
initial HEAD and get the best of both worlds. Other servers advertise that 
their etags are SHA1s (e.g. Nexus). We could use this metadata, and keep the 
extra sha1 request as a fallback.
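
As a sketch of what reading the checksum off the HEAD response might look like: the helper below is hypothetical, and the exact header formats are assumptions on my part. It treats the header name case-insensitively (HTTP header names are), expects a 40-character hex value in the checksum header, and assumes a Nexus-style etag of the form {SHA1{...}}, optionally wrapped in quotes. Anything else returns None, in which case we would fall back to the separate sha1 request.

```python
import re
from typing import Mapping, Optional

_HEX_SHA1 = re.compile(r"[0-9a-fA-F]{40}")
# Assumed etag shape: {SHA1{<40 hex chars>}}, with or without surrounding quotes.
_SHA1_ETAG = re.compile(r'^"?\{SHA1\{([0-9a-fA-F]{40})\}\}"?$')


def sha1_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Return a sha1 advertised in response headers, or None if there is none."""
    lowered = {name.lower(): value.strip() for name, value in headers.items()}

    # Explicit checksum header (e.g. as sent by Artifactory).
    explicit = lowered.get("x-checksum-sha1")
    if explicit and _HEX_SHA1.fullmatch(explicit):
        return explicit.lower()

    # Etag that embeds the sha1 (e.g. as advertised by Nexus).
    match = _SHA1_ETAG.match(lowered.get("etag", ""))
    if match:
        return match.group(1).lower()

    return None
```

With something like this, the initial HEAD gives us both the metadata and (when the server cooperates) the checksum to match local candidates against.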

-- 
Luke Daley
Principal Engineer, Gradleware 
http://gradleware.com


