I spent the better part of yesterday mucking around in the dregs of Django's
cache middleware and related modules, and in doing so I've come to the
conclusion that, due to an accumulation of hindrances and minor bugs, the
per-site and per-view caching mechanisms are effectively broken for many
fairly typical usage patterns.
Let me demonstrate by fictional example, with what I would consider to be a
pretty typical configuration and use case for the per-site cache:
Let's pretend I'm developing a blog powered by Django. I'm using memcached,
and I would like to cache pages on that blog for anonymous users, who are
going to make up the vast majority of my site's visitors. Ideally, I will
serve the exact same cached version of a blog post to every single anonymous
visitor to my site, which will help keep server load under control,
particularly when I get slashdotted/reddited/what-have-you.
Like any blog, a typical page view primarily features the content (e.g. a
blog post). It also has some "auth" stuff at the top right, which will say
"Log in / Register" for users who aren't logged in but show a username and
welcome message for logged-in users. Each blog post also has an empty comment form
at the bottom of it where users can leave comments on the post. Like 99% of
the websites out there, I will be using Google Analytics to track my
visitors etc.
Pretty straightforward, right?
Let me count the ways that Django's cache middleware will muck up my goals
in the above scenario.
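
For reference, the per-site setup in this scenario would look roughly like the settings sketch below. It assumes a Django release from around the time of this post (later releases renamed or removed some of these settings). The memcached address, timeout, and key prefix are placeholder values, and the original scenario does not say whether CACHE_MIDDLEWARE_ANONYMOUS_ONLY was set.

```python
# settings.py (sketch): per-site cache backed by memcached.
# The address, timeout, and prefix below are placeholders, not recommendations.

CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": "127.0.0.1:11211",
    }
}

# UpdateCacheMiddleware is listed first so that it runs last on the response,
# after the other middleware has had a chance to set Vary headers.
# FetchFromCacheMiddleware is listed last so that it runs last on the request.
MIDDLEWARE_CLASSES = (
    "django.middleware.cache.UpdateCacheMiddleware",
    "django.middleware.common.CommonMiddleware",
    "django.contrib.sessions.middleware.SessionMiddleware",
    "django.middleware.csrf.CsrfViewMiddleware",
    "django.contrib.auth.middleware.AuthenticationMiddleware",
    "django.middleware.cache.FetchFromCacheMiddleware",
)

CACHE_MIDDLEWARE_SECONDS = 60 * 15      # placeholder: 15 minutes
CACHE_MIDDLEWARE_KEY_PREFIX = "blog"    # placeholder prefix
CACHE_MIDDLEWARE_ANONYMOUS_ONLY = True  # assumption: only cache anonymous requests
```
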
First, I'm going to try to use the per-site cache. Here's what's going to go
wrong for me:
* It's going to be virtually impossible for me to avoid my cache varying by
cookie and thus by visitor. Because in my templates I am checking to see if
the current user is logged in, I'm touching the session, which in turn sets
the "Vary: Cookie" header on the response. That means if there is any
difference in the cookies users are requesting my pages with, I'm going to
be caching and serving a separate page for each user, keyed off of the
session cookie (named by SESSION_COOKIE_NAME), whose value is unique for
every visitor.
* Even if I avoid touching the request user somehow, the CSRF middleware
presents the same issue. Because I have a comment form on every page, I have
a unique CSRF token for each visitor. Thankfully Django doesn't let me
completely shoot myself in the foot by caching the page with one user's
token and serving it to everybody else. At least it helpfully sets a CSRF
token cookie and varies on it to prevent this. However, that cookie is
different for every unique user. That triggers the same problem as
above. I again cannot avoid caching a unique page for each unique visitor.
* Unfortunately, my troubles are not over, even if I resign myself to having
a cache that varies per visitor. You see, Google Analytics actually sets a
handful of other cookies with each page request. And guess what? The values
for those cookies are unique *for each request*. This means... I'm actually
not caching at all. Cookies are unique for each and every page request
thanks to Google Analytics. My per-site cache configuration is totally and
completely inoperable, all because I'm using a tracking service that pretty
much *everybody* uses.
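
To make the three problems above concrete, here is a simplified model of a cache key that honors the Vary header. It is a sketch only, not Django's actual key-building code: the vary_aware_key function, the cookie values, and the "tracker" cookie standing in for an analytics cookie are all made up for illustration.

```python
import hashlib


def vary_aware_key(path, request_headers, vary_on):
    """Build a key from the path plus the value of every header named in Vary."""
    parts = [path]
    for name in vary_on:
        parts.append(request_headers.get(name, ""))
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


path = "/blog/post-1/"
vary_on = ["Cookie"]  # the effect of a "Vary: Cookie" response header

# Two anonymous visitors with different (made-up) session cookies.
alice = {"Cookie": "sessionid=aaa111"}
bob = {"Cookie": "sessionid=bbb222"}
print(vary_aware_key(path, alice, vary_on) == vary_aware_key(path, bob, vary_on))
# False: each visitor gets a separate cache entry

# One visitor whose hypothetical tracking cookie changes on every request.
first = {"Cookie": "sessionid=aaa111; tracker=1001"}
second = {"Cookie": "sessionid=aaa111; tracker=1002"}
print(vary_aware_key(path, first, vary_on) == vary_aware_key(path, second, vary_on))
# False: even the same visitor never hits the cache

# Without "Vary: Cookie", everyone shares one entry.
print(vary_aware_key(path, alice, []) == vary_aware_key(path, bob, []))
# True
```
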
Since that didn't work, I wonder if it'll work if I do per-view caching? It
shouldn't work at all, should it, since it's not like any of the factors I
outlined above are different if I'm using the @cache_page decorator to do my
caching vs the per-site cache.
Well, the sad news is caching does "work" when I use cache_page, and that's
not a good thing:
* @cache_page caches the direct output of the view/render function. It skips
over the middleware that might have very good reason to introduce Vary
headers and doesn't introduce any Vary headers of its own. So now, with
this applied, I *am* serving a cached version of this page even though I
absolutely should not be. Some poor user's token is now being sent to
everybody. My only chance of redemption is if I happen to have read the docs
and discovered that this incantation is required to prevent having
cache_page improperly cache the page:
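    from django.views.decorators.cache import cache_page
    from django.views.decorators.csrf import csrf_protect

    @cache_page(60 * 15)
    @csrf_protect
    def my_view(request):
        ...  # view body goes here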
Of course, the above just puts me right back where I started at the per-site
level. There was never any chance of making cache_page work any differently
from the per-site cache, but it certainly is a temptation for a hurried
developer who is frustrated that the per-site cache isn't working and
"thankful" that the cache starts "working" with the cache_page
decorator.
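
Here is a toy model of that failure mode. It is not Django's cache_page: the naive_cache_page decorator, the dictionary-based requests, and the token values are all hypothetical, and the only point is to show what happens when a page with an embedded per-visitor token is cached on the path alone.

```python
import functools


def naive_cache_page(view):
    """Toy per-view cache: keyed on the path only, ignoring any Vary header."""
    store = {}

    @functools.wraps(view)
    def wrapper(request):
        path = request["path"]
        if path not in store:
            store[path] = view(request)  # first visitor's output gets stored
        return store[path]

    return wrapper


@naive_cache_page
def blog_post(request):
    # The comment form embeds the requesting visitor's token in the page.
    return '<input type="hidden" name="csrf_token" value="{}">'.format(
        request["csrf_token"]
    )


alice_page = blog_post({"path": "/blog/post-1/", "csrf_token": "token-for-alice"})
bob_page = blog_post({"path": "/blog/post-1/", "csrf_token": "token-for-bob"})

print("token-for-alice" in bob_page)  # True: Bob was served Alice's token
```
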
Hopefully the above example really makes it clear to you guys how all of the
seemingly minor bugs and imperfections really do add up to a broken
situation for someone coming to this with a pretty standard set of
expectations and requirements.
Anyhow, the good news is that a good portion of what I have written about
already has open tickets which in some cases are close to being ready for
checkin:
* Google Analytics is a known issue with a proposed patch:
https://code.djangoproject.com/ticket/9249
* CSRF is known to not play nicely with caching; it's documented, at least:
https://docs.djangoproject.com/en/dev/ref/contrib/csrf/#caching
* The actual underlying cache_page issue is ticketed:
https://code.djangoproject.com/ticket/15855
Still, I can't help but feel that, to an extent, these are band-aids. There
is still an exceptionally narrow set of circumstances that would allow me to
serve a single cached page to all anonymous visitors to my site: namely, I
can't touch request.user and I can't use CSRF. Quite honestly, I'm not even
sure you should be using a framework like Django if most of your pages don't
have logic pertaining to a logged in vs. anonymous user, or have some kind
of form on them which requires CSRF protection. Even if all of the above
tickets got fixed, it seems like we're still in kind of a bad place.
I don't know that I have good solutions to any of this (though I am very
much willing to contribute work toward such a solution). I do have a few
ideas/questions to pose to conclude with here:
* Is it reasonable to set as a goal that Django should attempt to support
per-site caching for the scenario I described above? I mean, am I wrong in
thinking that in an ideal world, it should be possible to serve the same
cached page to all anonymous users most of the time, even if there are forms
or anonymous vs. logged in user logic on it?
* Is an embedded token the only form that CSRF protection can take?
Why can't the token be set as a cookie and the value of that cookie serve as
the CSRF verification (without varying on it in the cache, obviously)? Or
perhaps there's a way to dynamically generate a CSRF token via ajax after
the page load? I'm certain someone much smarter and more knowledgeable than I
will point out why these are dreadfully horrible, unworkable ideas, but the
embedded token is sort of a deal breaker for effective caching, and these
days many, many sites have forms on almost every page (e.g. a hidden login
form that's revealed when you press login, comment form, etc.).
* Why does the response have to vary on the cookie if the request user object
is touched in the template even though the user isn't authenticated? If the
sessionid isn't even
in the request cookie (i.e. for a first time visitor), then it doesn't
require a real "check" of the session. And correct me if I'm wrong, but
doesn't the session key get cycled when a user logs in anyway? In other
words, a session key that represents an anonymous user will *always*
represent an anonymous user. Perhaps there's a way to keep track of those
anonymous session ids so the same anonymous cached view can be served to
them all. What a waste to generate the entire page dynamically for each
individual anonymous user all because of one simple key lookup. Again, this
is probably a hopelessly naive idea with a sensible, obvious rebuttal, but
perhaps there is some merit in coming up with a creative solution?
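
As a rough illustration of the last idea in the list above, the sketch below puts every anonymous visitor into one shared cache bucket and gives logged-in visitors a bucket of their own. Everything here is hypothetical: anonymous_aware_key, visitor_bucket, and the authenticated_sessions set are made-up names, and the sketch relies on the premise above that a session key belonging to an anonymous user never silently becomes an authenticated one. It also ignores all other cookies, which is a simplification.

```python
import hashlib


def visitor_bucket(cookies, authenticated_sessions):
    """Return one shared bucket for anonymous visitors, else the session id."""
    session_id = cookies.get("sessionid")
    if session_id is None or session_id not in authenticated_sessions:
        return "anonymous"
    return session_id


def anonymous_aware_key(path, cookies, authenticated_sessions):
    """Cache key that varies by visitor only when the visitor is logged in."""
    bucket = visitor_bucket(cookies, authenticated_sessions)
    return hashlib.sha256("{}|{}".format(path, bucket).encode("utf-8")).hexdigest()


# Hypothetical record of which session ids belong to logged-in users.
authenticated_sessions = {"user456"}
path = "/blog/post-1/"

first_time = anonymous_aware_key(path, {}, authenticated_sessions)
anon = anonymous_aware_key(path, {"sessionid": "anon123"}, authenticated_sessions)
member = anonymous_aware_key(path, {"sessionid": "user456"}, authenticated_sessions)

print(first_time == anon)  # True: all anonymous visitors share one entry
print(anon == member)      # False: the logged-in visitor gets a separate entry
```

Note that this still needs one lookup per request to decide which bucket a session id falls into, so it only helps if that lookup is much cheaper than rendering the page.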
I have to guess some of you have already spent some brain cycles thinking
about the above issues I've raised, in whole or in part, and I apologize if
I'm re-hashing an old debate or am so totally off-base that I've wasted your
time if you made it this far. My intent, again, is not to complain, but to
see if others agree that the current state of the per-site cache is not so
great, and if so, to elicit some ideas on how to best address it. It also
seems to me that there is more than just one problem standing in the way of
things, so "success" might require something of a coordinated effort.
Please do let me know if my concerns make sense, if my goal is a legitimate
one, if I'm wrong in part or in whole, etc. etc. As I said earlier, if
there's a path forward on any of the above I am happy to contribute to the
effort.
Thanks for listening.
--
You received this message because you are subscribed to the Google Groups
"Django developers" group.
To view this discussion on the web visit
https://groups.google.com/d/msg/django-developers/-/G7iNJsARF4IJ.