On Fri, 2009-09-18 at 11:30 +0100, John-Mark Bell wrote:
> On Fri, 2009-09-18 at 13:51 +0800, Bo Yang wrote:
> > How about use the same internment context only for the documents in
> > the same host?
>
> There are a number of cases that this doesn't handle well:
>
> 1) Page resources being fetched off a different host.
> 2) Hosts (e.g. MySpace) that provide multiple sites.
>
> The first of these is much more problematic than the second, I think.

Ok, so I've now been to the pub enough times since I sent the above to
potentially have a workable solution. This ties in heavily with caching
so, before you read what follows, make sure you comprehend my new cache
scheme.

We introduce a map that contains internment contexts for hosts. There's
a 1:1 mapping here. The map is populated on the fly by the cache.

A host is defined in the standard way: either an IP address or a FQDN.
There's no magical domain matching happening, so a.b.com is treated as
distinct from b.com.

Content retrieval from the cache is keyed upon the tuple:

    <URL, internment context>

The internment context may be NULL (see later for why). The URL to
compare will be modified to reflect any fetch-layer redirects.

[Aside: Additionally, when retrieving a content from the cache, the
parent content handle (if any) is provided. The URL for this content
(and its parents) will be the post-redirect URL. This permits trivial
detection of cycles in the include graph, thus solving another
long-standing problem for free.]

Once the cache has the post-redirect URL[1], it searches for an existing
content that matches the tuple. If none does, a new content is created,
using the specified internment context. If there is one, it is used, as
per normal.

A NULL internment context defines a special case. This case permits the
cache to either return any pre-existing content that matches the
post-redirect URL (and satisfies all the other constraints upon
returning a pre-existing content from the cache), regardless of the
internment context that it uses, or create a new content with the
internment context for the URL's host (or an entirely new internment
context if there's none for the host -- in this case, the internment
context is also inserted into the map for later use).

CSS contents are permitted to be shared, providing that the internment
context matches.
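
The lookup rules above could be sketched roughly as follows. This is
only an illustration, written in Python for brevity: the names Cache,
Content, InternmentContext and retrieve are hypothetical, None stands in
for a NULL internment context, the URL passed in is assumed to already
be the post-redirect URL, and the other constraints on returning a
pre-existing content (and the parent handle used for cycle detection)
are left out.

```python
from urllib.parse import urlsplit


class InternmentContext:
    """Opaque stand-in for an internment context (hypothetical)."""

    def __init__(self, host):
        self.host = host


class Content:
    """A cached content: its URL plus the context it was created with."""

    def __init__(self, url, context):
        self.url = url
        self.context = context


class Cache:
    def __init__(self):
        self.contents = []       # every content created so far
        self.host_contexts = {}  # host -> InternmentContext (1:1)

    def _context_for_host(self, url):
        # Exact host match only: a.b.com is distinct from b.com.
        host = urlsplit(url).hostname
        if host not in self.host_contexts:
            self.host_contexts[host] = InternmentContext(host)
        return self.host_contexts[host]

    def retrieve(self, url, context=None):
        """Look up the tuple <url, context>, creating a content on a miss."""
        for content in self.contents:
            if content.url != url:
                continue
            # A None (NULL) context accepts a content with any context.
            if context is None or content.context is context:
                return content
        if context is None:
            # No parent context supplied: fall back to the host's context,
            # creating it and inserting it into the map if necessary.
            context = self._context_for_host(url)
        content = Content(url, context)
        self.contents.append(content)
        return content
```

Note that a child fetched from a different host with its parent's
context does not touch the map in this sketch; only requests made with
no context populate it.
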
Copy-on-write does not affect the sharing of CSS contents (the copied
content will use the same internment context, resulting in the minimum
possible duplication).

HTML contents from the same host may use the same internment context.

I suspect an example would help:

There's no fetch-layer redirection occurring here, for simplicity. The
cache is initially empty, as is the internment context map.

1) Request http://example.com/, with a NULL internment context. A new
   content is created, with a new internment context which is inserted
   into the map.
2) This requires /foo.css and /bar.css. Request these, with the parent
   content's internment context. Two new contents are created, using
   the parent's internment context.
3) User clicks link to /page2.html. Request this, providing a NULL
   internment context (as we'd like to reuse a cached content,
   regardless of its internment context). A new content is created,
   using the internment context for example.com, which is retrieved
   from the map.
4) /page2.html requires /foo.css. Request this, specifying page 2's
   internment context (which is the same as that for /). The existing
   content for foo.css is reused, as it's shareable.

This approach should, assuming I've not missed anything, improve the
likelihood of sharing child contents between documents on the same host.
It avoids the problems of one internment context per host by permitting
child contents to use their parent's internment context. Reinstating the
shareability of CSS contents (with the added internment context
restriction) should reduce the necessity to create duplicate CSS
contents.

It's worth noting that the only time when an internment context can be
NULL in a cache request is when there's no parent content. Perhaps we
can merge these concepts back together to avoid the need for spurious
parameters.

Comments welcome, as per the original cache proposal. I'd really like
some feedback before I start writing code, as the effectiveness of the
caching architecture has serious implications for the rest of the code.

J.
