After a little pondering, I'd favour an approach that is simple to describe
and doesn't result in unexpected behaviour; I think an extra HEAD request
here or there is ok.

How about we perform a HEAD request whenever we have any cache candidates, be
they local files or previous accesses to this URL?
So the logic would be:

   - Do we have any cache candidates?
      - If not, just HTTP GET the resource and we're done.
   - HTTP HEAD to get the resource metadata (and possibly the SHA1)
      - If we got a 404, the resource is missing; we're done.
      - If we match a cached URL resource, just use it and we're done.
   - If we have a local file candidate, HTTP GET the published SHA1.
      - If the published SHA1 was found and matches, we can cache the URL
      resource and we're done.
   - HTTP GET the actual resource
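To make the flow concrete, here's a minimal sketch in Python with the HTTP
layer stubbed out. All the names here (`resolve`, the `http` object, the dict
shapes) are hypothetical, chosen for illustration, not Gradle's actual API:

```python
def resolve(url, cached_resources, local_candidates, http):
    """HEAD-first resolution, following the steps above.

    `http` is a stub with head/get/get_sha1 methods, each returning
    None on 404. Resources are modelled as plain dicts.
    """
    # No cache candidates of any kind: a single GET is all we need.
    if not cached_resources and not local_candidates:
        return http.get(url)

    # HEAD to fetch metadata (and, if the server publishes it, the SHA1).
    head = http.head(url)
    if head is None:
        return None  # 404: the resource is missing, we're done

    # A previously cached URL resource whose metadata still matches wins.
    for cached in cached_resources:
        if cached["metadata"] == head["metadata"]:
            return cached

    # Try to match a local file candidate by SHA1; prefer a SHA1 taken
    # from the HEAD response headers, falling back to a GET of the .sha1.
    sha1 = head.get("sha1") or http.get_sha1(url)
    if sha1 is not None:
        for local in local_candidates:
            if local["sha1"] == sha1:
                # Cache with the full headers, as if we had done a GET,
                # so the resource always has an origin.
                return {"metadata": head["metadata"],
                        "content": local["content"]}

    # Fall through: GET the actual resource.
    return http.get(url)
```

Note that in the local-file-match case only the HEAD request is ever issued,
which is the "pure optimisation" property described below.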

Pros:
- We can get the SHA1 from the headers if available, and avoid the GET-SHA1
call.
- If a local file matches, we can cache the URL resolution as if we did an
HTTP GET, since we have the full HTTP headers + the content. We never have
a cached resource without an origin.
- After initially using a file from, say, .m2/repo to satisfy a request, from
then on it will be just as if we had actually downloaded it from the URL. So
there are no residual effects of using a local file in place of a
downloaded one. Use of local files is a pure optimisation.
- If the artifact is missing altogether, we get a single 404 for the HEAD,
rather than a 404 for the SHA1 + a 404 for the GET.
- It's simpler to understand, I think.

Cons:
- If we have a candidate local file but the SHA1 isn't published, we'll do an
extra HEAD request, i.e. HEAD URL + GET SHA1 + GET URL rather than just GET
SHA1 + GET URL.

Thoughts?
Daz

On 29 March 2012 11:00, Luke Daley <[email protected]> wrote:

> Hi all,
>
> As previously discussed, we are now leveraging last modified and content
> length values to avoid downloading changing artifacts (resources really)
> that have not changed. Currently, our strategy is the following…
>
> Given an artifact id (group, name, version) and a repository:
>
> 1. See if we have resolved this artifact from this repository previously;
> if so and if the cache entry has not expired, use the cached resource.
> Otherwise:
> 2. Search the local file system in a bunch of places (e.g. maven local,
> old gradle caches, the current filestore) for anything that was effectively
> resolved with the same artifact id
> 3. Convert the request into a url to hit
> 4. Search the cache index for a record of the metadata for this url
>
> So we now may have 0..n “locally available resource candidates” that we
> think may be the same as what's behind the URL, and possibly a “cached
> external resource” (a record of the metadata from the last time we hit the
> resource and its location in the filestore).
>
> The fetch process looks like this:
>
> * If there are any locally available resource candidates, fetch the remote
> sha1 for the resource if it's available.
> * If any of the locally available resource candidates have the same
> checksum, use that instead of downloading the resource (at the cost of not
> obtaining metadata such as last modified, etag etc).
> * If not, or if there was no remote checksum available:
> * If we have a cached version of the resource, compare the cached metadata
> with the real metadata via a HEAD request (implies that there was no remote
> checksum in practice).
> ** If the metadata is unchanged (compare last modified date and content
> length for equality), use the cached version (including metadata).
> ** If the metadata is changed, issue a GET to download the resource (then
> cache the resource of course)
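A rough sketch of this checksum-first flow, again with hypothetical names and
a stubbed HTTP layer rather than the real resolver code:

```python
def fetch(url, local_candidates, cached_resource, http):
    """Current (checksum-first) strategy, following the steps above.

    `http` is a stub with get_sha1/head/get methods, each returning
    None on 404. Resources are modelled as plain dicts.
    """
    # With local candidates, try to match on the published checksum first.
    if local_candidates:
        sha1 = http.get_sha1(url)
        if sha1 is not None:
            for local in local_candidates:
                if local["sha1"] == sha1:
                    # Same binary content; note we obtain no fresh
                    # metadata (last modified, etag etc.) on this path.
                    return local

    # Otherwise compare cached metadata with the live metadata via HEAD.
    if cached_resource is not None:
        head = http.head(url)
        if head is not None and head["metadata"] == cached_resource["metadata"]:
            # Unchanged (same last modified + content length): reuse cache.
            return cached_resource

    # Changed, or never seen before: GET and (re)cache the resource.
    return http.get(url)
```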
>
> I think this is the practical thing to do, but probably not theoretically
> correct.
>
> The issue is that by using the checksum check to determine if something
> has changed or not, we lose any cached metadata about the resource. If we
> find something on the filesystem with the same checksum, all we can really
> assume is that that file has the same binary content. We cannot assume that
> it came from the same URL, which strictly should invalidate any cached
> metadata we had for that URL. However, since the only metadata items we
> care about
> are content length, last modified and etag, if the checksum hasn't changed
> we could probably assume that these values haven't changed either.
>
> Furthermore, it probably doesn't matter because if there are remote
> checksums for a resource available then we aren't really going to use the
> metadata for anything.
>
> Further furthermore, our current strategy is optimised for the case where
> checksums are available which is considered best practice. If we flipped it
> around and compared metadata first…
>
> Pros:
> * If the item is unchanged, we only have one HEAD request as opposed to
> the GET on the checksum (faster)
> * We maintain cached metadata “integrity”
>
> Cons:
> * If the item has changed, we have one HEAD for the metadata (to determine
> it was changed) then another GET for the sha1 (to look for locally
> available resources)
>
> Keep in mind, the con there is the rare case. This means that the external
> resource has changed since the last time we saw it, but something else
> (i.e. maven, older gradle version) has downloaded it in the meantime.
>
> Under this (metadata first) strategy, the requests for a
> seen-before-but-changed resource would look like this:
>
> * HEAD to resource (get metadata) - determine changed
> * GET to checksum - most likely outcome is that we don't find a local
> version of this
> * GET to resource
>
> Under the current (checksum first) strategy it looks like this:
>
> * GET to checksum - no local version found with checksum
> * GET to resource
>
> Under this (metadata first) strategy, the requests for a
> seen-before-but-UNchanged resource would look like this:
>
> * HEAD to resource (get metadata) - determine unchanged
>
> Under the current (checksum first) strategy it looks like this:
>
> * GET to checksum - local version found with checksum (can't guarantee it
> came from the same URL)
>
>
> Still following? :)
>
> For me this comes down to:
>
> * Is there a noticeable benefit of one HEAD request over one GET (for a
> sha1 text file)? If not, then we don't change. If so:
> * Do we optimise for the case where the resource is unchanged?
>
>
> There's another interesting option. Some servers send an “X-checksum-SHA1”
> header (e.g. Artifactory). In this case, we could use this when performing
> the initial HEAD and get the best of both worlds. Other servers advertise
> that their etags are SHA1s (e.g. Nexus). We could use this metadata, and
> keep the extra sha1 request as a fallback.
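A sketch of how that header-based extraction might look. `X-Checksum-Sha1` is
Artifactory's real header name, and Nexus etags are reported to embed the
SHA1 in a `{SHA1{...}}` wrapper, but both formats should be verified against
real servers before relying on this:

```python
import re

# A bare 40-char hex SHA1, as published via X-Checksum-Sha1.
SHA1_RE = re.compile(r"^[0-9a-f]{40}$")
# The SHA1 embedded in a Nexus-style etag, e.g. '"{SHA1{0123...}}"'.
ETAG_SHA1_RE = re.compile(r"\{SHA1\{([0-9a-f]{40})\}\}", re.IGNORECASE)

def sha1_from_headers(headers):
    """Try to extract a SHA1 from response headers, so the separate
    GET of the .sha1 file is only needed as a fallback."""
    value = headers.get("X-Checksum-Sha1", "")
    if SHA1_RE.match(value.lower()):
        return value.lower()
    match = ETAG_SHA1_RE.search(headers.get("ETag", ""))
    if match:
        return match.group(1).lower()
    return None  # caller falls back to GET <url>.sha1
```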
>
> --
> Luke Daley
> Principal Engineer, Gradleware
> http://gradleware.com
>
>


-- 
Darrell (Daz) DeBoer
Principal Engineer, Gradleware
http://www.gradleware.com
