On 22 March 2012 05:02, Luke Daley <[email protected]> wrote:
> > On 22/03/2012, at 7:15 AM, Daz DeBoer wrote: > > I guess one question is whether the lastModified date can actually be > considered some sort of meta-data about the artifact itself, or if it's > actually more like meta-data about the artifact *retrieval*. I kind of > think the lastModified date should be only used when checking if a > particular URL retrieval is up-to-date: and that this value should not leak > into the model further. > > > I've got no problem conceptualising it as meta data about the resolution. > Say if we provided some kind of report on the contents of the cache > relevant to the particular build you are working on. I can see providing > the url and last modified of the artifacts when they were resolved in a > particular repository being useful information. > > At the moment, if we get the same artifact from 2 different URLs in the > same repository, it looks like we will overwrite the first retrieval > (lastModified + url) when we cache the second. This doesn't feel great to > me. > > > For the “by-repository” cache, yes. But that in itself is interesting > information (i.e. given the same repository and module vectors we got > something at a different URL). It becomes more interesting if the > repository is actually serving back redirects too. We could potentially > then do some short circuiting here by using this info (need to think about > that more). > We could be missing a concept here; the separation of artifact meta-data (artifactId/file/sha1) from resolution meta-data (url/lastModified/timestamp). But looking deeper I don't think this was exacerbated by your recent changes at all, and I don't think this it's too big a deal. Also looking deeper I see a potential issue when the artifact is not downloaded but we are using a previously downloaded artifact based on SHA1. In that case we will be (incorrectly) using the last modified date of the the cached file as the URL last modified date. This is because CachedHttpResource uses the file modification date as 'lastModified'. The only ways I can think to get around this are to 1) issue an extra HTTP HEAD request to get the correct lastModified date, 2) Assume a zero lastModified date for these artifacts, or 3) Store the last modified date in 'external' artifact caches in a way that it is available cross-version. As it stands, I think there are scenarios where this will produce incorrect artifact resolution. This has made me think of another thing: should --refresh-dependencies bypass the url resolution cache? We don't bypass the SHA1 checking when using '--refresh-dependencies', but that process is not temporally sensitive: it can only be screwed up if the SHA1 is published incorrectly. The caching of URL lastModified timestamps could be problematic if a new artifact was published on a URL, and the user wanted to switch back to an older one. If we think of '--refresh-dependencies' a bit like force-reload in a browser, then we probably want to bypass the url resolution cache with this switch. Implementation-wise this could be tricky, however. > I guess my opinion is that the shared abstraction is nice, and coherent in > my opinion, but we are storing data that we are not using right now. > > So our options are: > > #1 - leave it the way it is and live with the cost of storing this extra > metadata (i.e. cache file size and extra serialisation cost) > #2 - optimise to only store what is strictly needed and introduce more > concepts and types > > My vote is for 1, but given the criticality of this section of code I > don't feel comfortable making that decision on my own. > My vote is for 1 as well. I'm not concerned about cache size or serialization cost. Daz
