Re: String internment contexts and content reuse

John-Mark Bell Tue, 22 Sep 2009 18:34:54 -0700

On Fri, 2009-09-18 at 11:30 +0100, John-Mark Bell wrote:
> On Fri, 2009-09-18 at 13:51 +0800, Bo Yang wrote:
> > How about use the same internment context only for the documents in
> > the same host?
> 
> There are a number of cases that this doesn't handle well:
> 
> 1) Page resources being fetched off a different host.
> 2) Hosts (e.g. MySpace) that provide multiple sites.
> 
> The first of these is much more problematic than the second, I think.


Ok, so I've now been to the pub enough times since I sent the above to
potentially have a workable solution.

This ties in heavily with caching so, before you read what follows, make
sure you comprehend my new cache scheme.

We introduce a map that contains internment contexts for hosts. There's
a 1:1 mapping here. The map is populated on the fly by the cache.

A host is defined in the standard way: either an IP address or a FQDN.
There's no magical domain matching happening, so a.b.com is treated as
distinct from b.com.
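
As a rough illustration of the map and the host rule, here is a minimal
Python sketch. The names InternContext, ContextMap and host_of are made
up for this mail and are not existing code; the host is simply whatever
the URL parser reports, with no domain matching.

```python
from urllib.parse import urlsplit


class InternContext:
    """Stand-in for a string internment context (hypothetical type)."""


def host_of(url):
    # The host is taken as-is (lower-cased by the parser): an IP address
    # or an FQDN. No domain matching, so a.b.com and b.com stay distinct.
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError("URL has no host: " + url)
    return host


class ContextMap:
    """1:1 map from host to internment context, filled on demand."""

    def __init__(self):
        self._by_host = {}

    def lookup(self, host):
        # Returns None when no context has been created for this host yet.
        return self._by_host.get(host)

    def get_or_create(self, host):
        ctx = self._by_host.get(host)
        if ctx is None:
            ctx = InternContext()
            self._by_host[host] = ctx
        return ctx
```

With this, get_or_create(host_of("http://a.b.com/")) and
get_or_create(host_of("http://b.com/")) hand back two different
contexts, while repeated requests for the same host hand back the same
one.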

Content retrieval from the cache is keyed upon the tuple:

   <URL, internment context>

The internment context may be NULL (see later for why).
The URL used for the comparison is the post-redirect one, i.e. it
reflects any fetch-layer redirects.

[Aside: Additionally, when retrieving a content from the cache, the
parent content handle (if any) is provided. The URL for this content
(and its parents) will be the post-redirect URL. This permits trivial
detection of cycles in the include graph, thus solving another
long-standing problem for free.]
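
The cycle check mentioned in the aside could look something like the
sketch below. It assumes each content handle records its post-redirect
URL and its parent; Content and would_cycle are illustrative names
only.

```python
class Content:
    """Toy content handle: a post-redirect URL plus an optional parent."""

    def __init__(self, url, parent=None):
        self.url = url
        self.parent = parent


def would_cycle(url, parent):
    # Walk up the chain of parents. If the requested post-redirect URL
    # already appears there, including it again would close a cycle.
    node = parent
    while node is not None:
        if node.url == url:
            return True
        node = node.parent
    return False
```

The walk is linear in the depth of the include chain, which should be
small in practice.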

Once the cache has the post-redirect URL[1], it searches for an existing
content that matches the tuple. If none does, a new content is created,
using the specified internment context. If there is one, it is used, as
per normal.

A NULL internment context defines a special case. In this case the
cache may do one of two things:

1) Return any pre-existing content that matches the post-redirect URL
   (and satisfies all the other constraints upon returning a
   pre-existing content from the cache), regardless of the internment
   context that it uses.
2) Create a new content with the internment context for the URL's host.
   If the map holds no context for that host, an entirely new
   internment context is created and also inserted into the map for
   later use.
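
To make the lookup rules concrete, here is a small Python sketch of the
fetch path. It is a toy under stated assumptions, not the real cache:
the URL passed in is taken to be post-redirect already, the "other
constraints" on reusing a content are ignored, reuse is tried before
creation in the NULL case, and Cache, Content and InternContext are
names invented for this mail. None plays the role of NULL.

```python
from urllib.parse import urlsplit


class InternContext:
    """Stand-in for a string internment context (hypothetical type)."""


class Content:
    def __init__(self, url, context):
        self.url = url
        self.context = context


class Cache:
    def __init__(self):
        self.contents = []       # every content the cache holds
        self.host_contexts = {}  # host -> InternContext, filled on demand

    def fetch(self, url, context=None):
        """Return a content for the <URL, internment context> tuple."""
        for content in self.contents:
            if content.url != url:
                continue
            # None (NULL) accepts any context; otherwise the context
            # must be the very same object.
            if context is None or content.context is context:
                return content

        if context is None:
            # No match and no context given: use the host's context,
            # creating it and recording it in the map if needed.
            host = urlsplit(url).hostname
            context = self.host_contexts.get(host)
            if context is None:
                context = InternContext()
                self.host_contexts[host] = context

        content = Content(url, context)
        self.contents.append(content)
        return content
```

A linear scan keeps the sketch short; a real implementation would
presumably index by URL. Note that a request with an explicit context
never touches the host map here, since only the NULL case is described
as populating it.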

CSS contents are permitted to be shared, providing that the internment
context matches. Copy-on-write does not affect this behaviour (the
copied content will use the same internment context, resulting in the
minimum possible duplication).

HTML contents from the same host may use the same internment context.


I suspect an example would help:

There's no fetch-layer redirection occurring here, for simplicity.
The cache is initially empty, as is the internment context map.

Request http://example.com/, with a NULL internment context.
A new content is created, with a new internment context which is
inserted into the map.

This page requires /foo.css and /bar.css. Request these with the parent
content's internment context. Two new contents are created, both using
the parent's internment context.

User clicks link to /page2.html. Request this, providing a NULL
internment context (as we'd like to reuse a cached content, regardless
of its internment context). New content is created, using the internment
context for example.com, which is retrieved from the map.

/page2.html requires /foo.css. Request this, specifying page 2's
internment context (which is the same as that for /). Existing content
for foo.css is reused, as it's shareable.


This approach should, assuming I've not missed anything, improve the
likelihood of sharing child contents between documents on the same host.
It avoids the problems of one internment context per host by permitting
child contents to use their parent's internment context. Reinstating the
shareability of CSS contents (with the added internment context
restriction) should reduce the necessity to create duplicate CSS
contents.

It's worth noting that the only time when an internment context can be
NULL in a cache request is when there's no parent content. Perhaps we
can merge these concepts back together to avoid the need for spurious
parameters.

Comments welcome, as per the original cache proposal. I'd really like
some feedback before I start writing code as the effectiveness of the
caching architecture has serious implications for the rest of the code.


J.
