On 30/03/2012, at 9:24 PM, Adam Murdoch <[email protected]> wrote:
>
> On 30/03/2012, at 8:44 PM, Luke Daley wrote:
>
>> On 30/03/2012, at 5:53 AM, Adam Murdoch wrote:
>>
>>> On 30/03/2012, at 8:35 AM, Daz DeBoer wrote:
>>>
>>>> After a little pondering, I'd favour an approach that is simple to
>>>> describe and doesn't result in unexpected behaviour; I think an extra
>>>> HEAD request here or there is OK.
>>>>
>>>> How about we perform a HEAD request if we have any cache candidates, be
>>>> they local files or previous accesses to this URL. So the logic would be:
>>>>
>>>> Do we have any cache candidates? If not, just HTTP GET the resource and
>>>> we're done.
>>>> HTTP HEAD to get the resource meta-data (and possibly the SHA1).
>>>> If we got a 404, the resource is missing; we're done.
>>>> If we match a cached URL resource, just use it and we're done.
>>>> If we have a local file candidate, HTTP GET the SHA1. If a published
>>>> SHA1 was found and matches, then we can cache the URL resource and
>>>> we're done.
>>>> HTTP GET the actual resource.
>>>>
>>>> Pros:
>>>> - We can get the SHA1 from the headers if available, and avoid the
>>>>   GET-SHA1 call.
>>>> - If a local file matches, we can cache the URL resolution as if we did
>>>>   an HTTP GET, since we have the full HTTP headers + the content. We
>>>>   never have a cached resource without an origin.
>>>> - After initially using a file from, say, .m2/repo to satisfy a request,
>>>>   from then on it will be just as if we had actually downloaded it from
>>>>   the URL. So there are no residual effects of using a local file in
>>>>   place of a downloaded one; use of local files is a pure optimisation.
>>>> - If the artifact is missing altogether, we get a single 404 for the
>>>>   HEAD, rather than a 404 for the SHA1 + a 404 for the GET.
>>>> - It's simpler to understand, I think.
>>>
>>> - This approach works nicely as a decoration over all the transports
>>>   we're interested in (http, sftp, webdav, local file, network file).
>>>   These all offer a way to get at least (content-length +
>>>   last-modified-time) without fetching the entire content. So, we could
>>>   have a number of Resource implementations that sit directly on top of
>>>   the transport and which don't care about caching, and a single
>>>   Resource implementation that sits on top of this to apply this caching
>>>   algorithm. This would allow us, for example, to start efficiently
>>>   caching file resources, regardless of whether they are sitting on a
>>>   local or network file system.
>>
>> I've actually made this kind of thing NOT the responsibility of the
>> ExternalResource (named so to differentiate it from Ivy's Resource type)
>> object.
>>
>> https://github.com/gradle/gradle/blob/master/subprojects/core-impl/src/main/groovy/org/gradle/api/internal/externalresource/transfer/ExternalResourceAccessor.java
>>
>> My thinking was that we are likely to have different cache optimisation
>> strategies here for different transports. It's starting to look like
>> that's not going to be the case.
>
> I don't think it matters too much at this stage.
>
> We want to keep the transports and the caching as separate as possible, so
> we can reuse the caching across transports. This may not necessarily mean
> that every caching strategy will work with every transport, but it would
> be nice to have at least one strategy that can work across any transport.
> And it looks like the 'best' option we've come up with for http also
> happens to be a generic option (except for the etag check, but I'm sure we
> can deal with that), so perhaps we only need one strategy. At some point
> we might end up with some other transport-specific strategies, but ideally
> we can base these on optional abstract capabilities of the transport (e.g.
> 'can you provide content-length + last-modified-time efficiently?', 'can
> you do a get-content-if-sha1-does-not-match?' and so on) rather than on
> concrete transports.

This is more or less what we have now.
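For reference, the candidate-then-HEAD sequence Daz describes above could be sketched roughly like this. All of the types here are illustrative toys (in-memory maps standing in for the remote server, the url cache and the local candidate files), not the real Gradle classes, and the "cached URL resource matches" check elides the actual meta-data comparison:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Toy model of the proposed lookup order. The request counters make it easy
// to see how many round trips each scenario costs.
public class CacheAwareLookup {
    final Map<String, String> remote = new HashMap<>();      // url -> content on the server
    final Map<String, String> remoteSha1 = new HashMap<>();  // url -> published sha1
    final Map<String, String> urlCache = new HashMap<>();    // previously resolved urls
    final Map<String, String> localFiles = new HashMap<>();  // e.g. .m2 candidates, keyed by sha1
    int headRequests, getRequests;

    Optional<String> fetch(String url) {
        boolean haveCandidates = urlCache.containsKey(url) || !localFiles.isEmpty();
        if (!haveCandidates) {
            getRequests++;                                   // no candidates: plain GET, done
            String content = remote.get(url);
            if (content != null) urlCache.put(url, content);
            return Optional.ofNullable(content);
        }
        headRequests++;                                      // HEAD for the meta-data
        if (!remote.containsKey(url)) {
            return Optional.empty();                         // single 404: missing, done
        }
        if (urlCache.containsKey(url)) {
            return Optional.of(urlCache.get(url));           // cached URL resource matches
        }
        String sha1 = remoteSha1.get(url);                   // from the headers, or GET-SHA1
        if (sha1 != null && localFiles.containsKey(sha1)) {
            String content = localFiles.get(sha1);
            urlCache.put(url, content);                      // cache local file as URL resource
            return Optional.of(content);
        }
        getRequests++;                                       // fall back to a real GET
        String content = remote.get(url);
        urlCache.put(url, content);
        return Optional.of(content);
    }
}
```

Note how the local-file branch writes through to the url cache, which is the "no residual effects" property: after the first hit, subsequent requests behave exactly as if the content had been downloaded.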
https://github.com/gradle/gradle/blob/master/subprojects/core-impl/src/main/groovy/org/gradle/api/internal/externalresource/transfer/DefaultCacheAwareExternalResourceAccessor.java

The ExternalResourceAccessor contract is kinda flexible, so I think we could
make it work for most transports and still use this general caching
algorithm. At the moment this is wired up in ExternalResourceRepository, but
we could allow injection of a custom one easily enough.

There are four things I still want to do before wrapping this up:

* Treat 403 and 405 responses to HEAD requests as "metadata unknown".
* If the server is googlecode, treat a 404 response to a HEAD request as
  "metadata unknown".
* When reusing a locally found resource, store the real metadata in the index.
* Where it's safe to, extract the sha1 from the etag (e.g. Artifactory).

All of these things are small.
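On the last point, one conservative way to do the extraction (a sketch, not committed code) is to only trust the etag as a sha1 when it looks exactly like one: 40 hex characters, optionally quoted, and not a weak validator, since a weak etag doesn't guarantee byte-identical content:

```java
import java.util.Optional;
import java.util.regex.Pattern;

// Sketch: accept an ETag as a sha1 only when it is unambiguously one.
public class EtagSha1 {
    private static final Pattern SHA1 = Pattern.compile("[0-9a-f]{40}");

    public static Optional<String> sha1FromEtag(String etag) {
        if (etag == null) return Optional.empty();
        String value = etag.trim();
        if (value.startsWith("W/")) return Optional.empty();   // weak: not byte-exact
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);    // strip the quotes
        }
        value = value.toLowerCase();
        return SHA1.matcher(value).matches() ? Optional.of(value) : Optional.empty();
    }
}
```

A server-specific etag that merely contains hex (or has some other shape) falls through to empty, so the worst case is just that we fall back to the GET-SHA1 call we'd have made anyway.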
