Hi all,

As previously discussed, we are now leveraging last modified and content length values to avoid re-downloading changing artifacts (resources, really) that have not actually changed. Currently, our strategy is the following…
Given an artifact id (group, name, version) and a repository:

1. See if we have resolved this artifact from this repository previously. If so, and if the cache entry has not expired, use the cached resource. Otherwise:
2. Convert the request into a URL to hit
3. Search the local file system in a bunch of places (e.g. maven local, old gradle caches, the current filestore) for anything that was resolved with (effectively) the same artifact id
4. Search the cache index for a record of the metadata for this URL

So we now may have 0..n “locally available resource candidates” that we think may be the same as what's behind the URL, and possibly a “cached external resource” (a record of the metadata from the last time we hit the resource, and its location in the filestore).

The fetch process looks like this:

* If there are any locally available resource candidates, fetch the remote sha1 for the resource if it's available.
* If any of the locally available resource candidates have the same checksum, use that instead of downloading the resource (at the cost of not obtaining metadata such as last modified, etag, etc).
* If not, or if there was no remote checksum available:
* If we have a cached version of the resource, compare the cached metadata with the real metadata via a HEAD request (in practice this implies that there was no remote checksum).
** If the metadata is unchanged (compare last modified date and content length for equality), use the cached version (including metadata).
** If the metadata has changed, issue a GET to download the resource (then cache the resource, of course).

I think this is the practical thing to do, but probably not theoretically correct. The issue is that by using the checksum check to determine whether or not something has changed, we lose any cached metadata about the resource. If we find something on the filesystem with the same checksum, all we can really assume is that that file has the same binary content. We cannot assume that it came from the same URL, which should invalidate any cached metadata we had for that URL. However, since the only metadata items that we care about are content length, last modified and etag, if the checksum hasn't changed we could probably assume that these values haven't changed either. Furthermore, it probably doesn't matter, because if there are remote checksums available for a resource then we aren't really going to use the metadata for anything. Further furthermore, our current strategy is optimised for the case where checksums are available, which is considered best practice.

If we flipped it around and compared metadata first…

Pros:
* If the item is unchanged, we only have one HEAD request as opposed to the GET on the checksum (faster)
* We maintain cached metadata “integrity”

Cons:
* If the item has changed, we have one HEAD for the metadata (to determine that it changed) and then another GET for the sha1 (to look for locally available resources)

Keep in mind, the con there is the rare case. It means that the external resource has changed since the last time we saw it, but something else (i.e. maven, an older gradle version) has downloaded it in the meantime.
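To make the trade-off concrete, here's a rough sketch of the two flows in code. All of the types and names here (RemoteAccessor, LocalCandidate, CachedResource and friends) are made up for illustration - they aren't our actual classes - and the sketch only encodes the request sequences described above.

import java.util.List;
import java.util.Optional;

// Hypothetical collaborators - not the real types.
interface RemoteAccessor {
    Optional<String> getSha1(String url);            // GET <url>.sha1 (may not exist)
    Optional<ResourceMetaData> head(String url);     // HEAD <url>
    DownloadedResource get(String url);              // GET <url>
}

record ResourceMetaData(long lastModified, long contentLength, String etag) {}
record DownloadedResource(byte[] content, ResourceMetaData metaData) {}
record LocalCandidate(String sha1, byte[] content) {}
record CachedResource(ResourceMetaData metaData, byte[] content) {}

class FetchStrategies {

    // Current strategy: try the remote sha1 against local candidates first,
    // falling back to metadata comparison / full download if that fails.
    DownloadedResource checksumFirst(String url, List<LocalCandidate> candidates,
                                     Optional<CachedResource> cached, RemoteAccessor remote) {
        if (!candidates.isEmpty()) {
            Optional<String> remoteSha1 = remote.getSha1(url);            // GET on the checksum
            if (remoteSha1.isPresent()) {
                for (LocalCandidate candidate : candidates) {
                    if (candidate.sha1().equals(remoteSha1.get())) {
                        // Reuse the local bytes; note that we get no metadata for them.
                        return new DownloadedResource(candidate.content(), null);
                    }
                }
            }
        }
        if (cached.isPresent()) {
            Optional<ResourceMetaData> remoteMeta = remote.head(url);     // HEAD on the resource
            if (remoteMeta.isPresent() && unchanged(cached.get().metaData(), remoteMeta.get())) {
                return new DownloadedResource(cached.get().content(), cached.get().metaData());
            }
        }
        return remote.get(url);                                           // GET on the resource
    }

    // Flipped strategy: compare metadata first, keep the checksum check as the
    // fallback for the (rare) changed case.
    DownloadedResource metadataFirst(String url, List<LocalCandidate> candidates,
                                     Optional<CachedResource> cached, RemoteAccessor remote) {
        Optional<ResourceMetaData> remoteMeta = remote.head(url);         // HEAD on the resource
        if (cached.isPresent() && remoteMeta.isPresent()
                && unchanged(cached.get().metaData(), remoteMeta.get())) {
            return new DownloadedResource(cached.get().content(), cached.get().metaData());
        }
        if (!candidates.isEmpty()) {
            Optional<String> remoteSha1 = remote.getSha1(url);            // extra GET in the changed case
            if (remoteSha1.isPresent()) {
                for (LocalCandidate candidate : candidates) {
                    if (candidate.sha1().equals(remoteSha1.get())) {
                        return new DownloadedResource(candidate.content(), remoteMeta.orElse(null));
                    }
                }
            }
        }
        return remote.get(url);                                           // GET on the resource
    }

    private static boolean unchanged(ResourceMetaData cached, ResourceMetaData remote) {
        return cached.lastModified() == remote.lastModified()
            && cached.contentLength() == remote.contentLength();
    }
}

The only real difference between the two methods is which remote call we make first and which one becomes the fallback.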
Under this (metadata first) strategy, the requests for a seen-before-but-changed resource would look like this:

* HEAD to resource (get metadata) - determine changed
* GET to checksum - most likely outcome is that we don't find a local version of this
* GET to resource

Under the current (checksum first) strategy it looks like this:

* GET to checksum - no local version found with checksum
* GET to resource

Under this (metadata first) strategy, the requests for a seen-before-but-UNchanged resource would look like this:

* HEAD to resource (get metadata) - determine unchanged

Under the current (checksum first) strategy it looks like this:

* GET to checksum - local version found with checksum (can't guarantee it came from the same URL)

Still following? :)

For me this comes down to:

* Is there a noticeable benefit of one HEAD request over one GET (for a sha1 text file)? If not, then we don't change. If so,
* Do we optimise for the case where the resource is unchanged?

There's another interesting option. Some servers send an “X-checksum-SHA1” header (e.g. Artifactory). In this case, we could use it when performing the initial HEAD and get the best of both worlds. Other servers advertise that their etags are SHA1s (e.g. Nexus). We could use this metadata, and keep the extra sha1 request as a fallback (rough sketch at the end of this mail).

-- 
Luke Daley
Principal Engineer, Gradleware
http://gradleware.com
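For what it's worth, pulling a sha1 out of the HEAD response could look something like this. The header name and the etag format are assumptions based on what's said above about Artifactory and Nexus, and would need verifying against real responses; the surrounding types and helper name are purely illustrative.

import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RemoteSha1FromHead {

    private static final Pattern SHA1_HEX = Pattern.compile("[a-fA-F0-9]{40}");

    // headers: the response headers of the HEAD request, with lower-cased names.
    static Optional<String> sha1FromHeadResponse(Map<String, String> headers) {
        // Artifactory-style checksum header (name/format assumed, would need checking).
        String checksumHeader = headers.get("x-checksum-sha1");
        if (checksumHeader != null && SHA1_HEX.matcher(checksumHeader).matches()) {
            return Optional.of(checksumHeader.toLowerCase());
        }
        // Nexus-style etag that embeds a sha1 (format assumed, would need checking).
        String etag = headers.get("etag");
        if (etag != null) {
            Matcher matcher = SHA1_HEX.matcher(etag);
            if (matcher.find()) {
                return Optional.of(matcher.group().toLowerCase());
            }
        }
        return Optional.empty(); // no luck - fall back to the separate GET on <url>.sha1
    }
}

If this returns empty we'd just do what we do today and GET the sha1 file separately.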
