Re: Clearing cache between tests (#11505)

2011-10-31 Thread Jim Dalton
On Oct 30, 2011, at 4:19 PM, Jeremy Dunck wrote:

Hi Jeremy,

> I locally have a test-runner class that calls .clear() on setUp.
> Works for me.  But we run CI, so test against prod.  I needed to point
> to different CACHES under test to avoid flushing prod cache 20 times
> per day.

Right. You've got a great real world use case where cleaning the cache is 
valuable, yet it's important for you to ensure it's not getting cleared 
frequently.


> Another approach that can work is to have the test runner generate a
> unique key prefix for each test run.  Then the keys will all be misses
> on later runs.  If invading the available key length isn't acceptable,
> then the test runner could also have a custom KEY_FUNCTION to hash the
> resulting key.  A downside is that this can create significant
> pressure on the cache.  I think that is less bad than clearing cache,
> though, since that is effectively infinite cache pressure.

I think this is indeed a workable alternative and I recall it from the previous 
discussion. I too had the same concern about the downside, namely that pressure 
it puts on the cache. If you're someone like you who's running integration 
tests 20 times a day on production, I do think it would add up. The end result 
might be cache data falling out of production, though certainly it would be 
preferable to clearing out the cache.

BTW also in the last conversation, Jacob proposed something similar to what you 
do now, i.e. a "test" cache(s) that the user could specify, and if one is not 
defined, fall back to locmem.

So I'm curious what your opinion is about the alternatives we've bandied about 
here? If you had your personal wish, which implementation would best suit you?

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.



Re: Clearing cache between tests (#11505)

2011-10-29 Thread Jim Dalton
On Oct 29, 2011, at 11:14 AM, Aymeric Augustin wrote:

Thank Aymeric,

> 
>> As an alternative, I propose we:
>> 
>> 1. Add a new optional setting under CACHES, CLEAR_BETWEEN_TESTS. The default 
>> for this setting is False. To be clear, the value is set independently for 
>> each cache defined in CACHES.
>> 
>> 2. Between each test, cycle through each cache in CACHES. If 
>> CLEAR_BETWEEN_TESTS is True, then clear() that cache. Otherwise, move on.
> 
> Here's what cache.clear() does on each cache backend:
> - db: it runs a SQL query "DELETE FROM "
> - dummy: it's a no-op
> - filebased: it calls shutil.rmtree on the cache directory
> - locmem: it clears two Python dicts
> - memcached: it calls the "flush_all" API method
> 
> It's dirt cheap with locmem, who is probably the most popular cache backend 
> for testing. It's a little bit expensive with db, but if you're using the 
> database cache backend, you're clearly not a performance junkie.
> 
> The fact that Django doesn't reset cache state between tests is a bug, which 
> means we don't need to be 100% backwards compatible. I'd be comfortable with 
> calling clear() unconditionally on each cache, after each test. I don't feel 
> strongly against the CLEAR_BETWEEN_TESTS flag, but if you add it, I suggest 
> making it True by default.
> 
> To sum up: +1 on .clear(), -0 on CLEAR_BETWEEN_TESTS.
> 

I *believe* that when this conversation has been had in the past, the simple 
solution of cache.clear() has been sort of a nonstarter. Certainly it's by far 
the simplest solution, and it's tolerable for most of the caches. The main 
issue is that, using the memcached backend, clear() flushes out the entire 
cache….i.e. not just the cache for the tests, not just the cache you might have 
set with this application, but literally the whole of the cache.

So the challenge has been to find a way to clear the cache between tests 
without potentially destructive side effects.

If I personally were given the choice between cache.clear() and nothing at all, 
i'd take cache.clear() and be careful with it. But I'm not sure if the rest of 
the community is willing to accept such a solution. Hence the 
CLEAR_BETWEEN_TESTS flag as a compromise.

Cheers,

Jim

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.



Clearing cache between tests (#11505)

2011-10-29 Thread Jim Dalton
I noticed that Aymeric Augustin committed a patch last week [17042] that 
fixed some issues in the cache backends. Apparently the fact that the cache 
carries over test to test remains a problem.

I've progressed pretty far with an aggressive "fix" for this problem with 
the patch in #11505. The current patch modifies all the keys set during a 
test run (so they don't overwrite any existing values), it tracks any 
values set during each test and it deletes those and only those that were 
set during that test. In essence, the cache is "restored" to its original 
state after each test. This is achieved by monkeypatching each cache 
backend in use and adding some "bookkeeping" code so to speak to track the 
keys set. The patch includes thorough tests, and as of last week includes a 
fix for the cache middleware problem as well (where views cached by 
middleware get carried over test to test so that the view function does not 
run.). The full Django test suite passes with it.

That all said, for the most part reception has been lukewarm for this 
patch. I think that's pretty understandable, since there's a fair amount of 
fiddly stuff taking place that makes people uncomfortable.

Anyhow, this morning I thought of a vastly simpler alternative proposal, 
which would fix the cache between tests and ensure we did not step on 
anyone's toes by unknowingly clearing a cache that the developer did not 
want to be cleared. As an alternative, I propose we:

1. Add a new optional setting under CACHES, CLEAR_BETWEEN_TESTS. The 
default for this setting is False. To be clear, the value is set 
independently for each cache defined in CACHES.

2. Between each test, cycle through each cache in CACHES. If 
CLEAR_BETWEEN_TESTS is True, then clear() that cache. Otherwise, move on.

Since test problems caused by cache values set in the tests are fairly 
infrequent, the default use case is just to leave this setting alone. 
However, if your test(s) need it, the option is there. It's also then 
trivial to e.g. set it for True in your local environment but have it on 
False in production, if e.g. you wanted to run your tests there but didn't 
want to clear your entire production cache.

I'd love to move #11505 through the logjam here and get it taken care of. 
Are any core devs willing to buy in to this alternative proposal? If so, I 
can create a patch for it with relative ease. Part of me still likes my 
original patch and I think there are good reasons to take the "thorough" 
approach. But this works as well and I could really go either way.

Let me know and thanks for listening.

Here are some references for previous conversations that have taken place 
about this:

   - 
   https://groups.google.com/d/topic/django-developers/zlaPsP13dUY/discussion
   - https://code.djangoproject.com/ticket/11505
   

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/django-developers/-/TTma3dJU1ccJ.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.



Re: The state of per-site/per-view middleware caching in Django

2011-10-21 Thread Jim Dalton
On Oct 21, 2011, at 8:04 AM, Kääriäinen Anssi wrote:

> I do not know nearly enough about caching to participate fully in this 
> discussion. But it strikes me that the attempt to have CSRF protected 
> anonymous page cached is not that smart. If you have an anonymous submittable 
> form, why bother with CSRF protection? I mean, what is it protecting against? 
> Making complex arrangements in the caching layer for this use case seems like 
> wasted effort. Or am I missing something obvious?

First issue is that CSRF can matter for anonymous users. From here 
http://www.squarefree.com/securitytips/web-developers.html#CSRF:

Attacks can also be based on the victim's IP address rather than cookies:

• Post an anonymous comment that is shown as coming from the victim's 
IP address.
...
• Perform a distributed password-guessing attack without a botnet. 
(This assumes they have a way to tell whether the login succeeded, perhaps by 
submitting a second form that isn't protected against CSRF.)

So two very common uses cases for anonymous forms are log in forms and 
anonymous comment forms, both of which are potentially vulnerable. I guess I 
feel like it's quite common to have forms on a page these days even for 
anonymous users.

Second is -- and I don't know about this -- but I don't know how well CSRF 
handles authentication conditionally. Like if I have a page and let's say that 
page has forms in it for logged in users but nothing for anonymous user, can I 
conditionally exempt the formless page from CSRF? I have no idea, but buy 
default I presume it's on and I presume the cache is varying on it.

So, yes, you could probably optimize a lot of this to sort of skip around the 
CSRF issue and it's not a deal breaker. But my main argument has been the 
ubiquity of CSRF + user authentication in Django projects to me means a 
solution to both of these is a requirement for page caching to become easy and 
applicable in most scenarios.

> 
> The following is from the stupid ideas department: Maybe there could be a 
> "reverse cache" template tag, such that you would mark the places where you 
> want changing content as non-cacheable. You would need two views for this, 
> one which would construct the "base content" and then another which would 
> construct the dynamic parts. Something like:
> 

Your idea sounds a lot like the "server side include" or "two phased template 
rendering" approach that I know some people are doing. Here's an excellent 
example of this approach being used in EveryBlock (from two years ago):

http://www.holovaty.com/writing/django-two-phased-rendering/

And looks like some core devs have been involved at some point in this 
implementation of that concept:

https://github.com/codysoyland/django-phased

That looks almost exactly like your idea: "django-phased contains a 
templatetag, phased, which defines blocks that are to be parsed during the 
second phase. A middleware class, PhasedRenderMiddleware, processes the 
response to render the parts that were skipped during the first rendering."

I guess that was sort of what I was hinting at in my previous discussion about 
how to handle CSRF. In the link he is taking it to the next level (where even 
logged in users get the page and that stuff is done after.

Anyhow it's obviously a sensible conceptual approach. It would be a stretch to 
fit that into the existing page caching approach of Django obviously.

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.



Re: The state of per-site/per-view middleware caching in Django

2011-10-21 Thread Jim Dalton
On Oct 20, 2011, at 6:02 PM, Carl Meyer wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Hi Jim,
> 
> This is a really useful summary of the current state of things, thanks
> for putting it together.
> 
> Re the anonymous/authenticated issue, CSRF token, and Google Analytics
> cookies, it all boils down to the same root issue. And Niran is right,
> what we currently do re setting Vary: Cookie is what we have to do in
> order to be correct with respect to HTTP and upstream caches. For
> instance, we can't just remove Vary: Cookie from unauthenticated
> responses, because then upstream caches could serve that unauthenticated
> response to anyone, even if they are actually authenticated.
> 
> Currently the Django page caching middleware behaves pretty much just
> like an upstream cache in terms of the Vary header. Apart from the
> CACHE_MIDDLEWARE_ANONYMOUS_ONLY setting, it just looks at the response,
> it doesn't make use of any additional "inside information" about what
> your Django site did to generate that response in order to decide what
> to cache and how to cache it.
> 
> This approach is pretty attractive, because it's conceptually simple,
> consistent with upstream HTTP caching, and conservative (quite unlikely
> to serve the wrong cached content).
> 
> It might be possible to make it "smarter" in certain cases, and allow it
> to cache more aggressively than an upstream cache can. #9249 is one
> proposal to do this for cookies that aren't used on the server, either
> via explicit setting or (in a recently-added proposal) via tracking
> which cookie values are accessed. If we did that, plus special-cased the
> session cookie if the user is unauthenticated and the session isn't used
> outside of contrib.auth, I think that could possibly solve the
> unauthenticated-users and GA issues.
> 
> However, this (especially the latter) would come with the cost of making
> the cache middleware implementation more fragile and coupled to other
> parts of the framework. And it still doesn't help with CSRF, which is a
> much tougher nut to crack, because every response for pages using CSRF
> come with a Set-Cookie header and probably with a CSRF token embedded in
> the response content; and those both mean that response really can't be
> re-used for anyone else. (Getting rid of the token embedded in the HTML
> means forms couldn't ever POST without JS help, which is not an option
> as the documented default approach). You can mark some form-using views
> that are available to anonymous users as csrf-exempt, which exposes you
> potentially to CSRF-based spam, but isn't a security issue if you aren't
> treating authenticated submissions any differently from
> non-authenticated ones.
> 
> Generally, I come down on the side of skepticism that introducing these
> special cases into the cache middleware really buys enough to be worth
> the added complexity (though I could be convinced that #9249 is worth it).

Thanks Carl. This is definitely a good, clarifying response to what I was 
mulling around about.

A few thoughts of my own to add here:

* You and Nihan are certainly right about upstream caches. Regardless of what 
we do here, we'll have to vary by cookie in the response header. This makes 
sense for a site that offers authentication: Django needs to check on every 
page view to see if the user is authenticated, so we can't have the upstream 
cache holding on to a page for us.

* Agreed about how the "smartness" comes at the cost of brittleness if the 
implementations are too tightly coupled. That said, I can squint and sort of 
see an implementation that could thread the needle here. It would require 
something like:

- An API in the cache middleware instructing it to ignore certain cookies for 
the purposes of caching (i.e. something along the lines of #9249).

- Some kind of "pre-fetch" hook in the cache middleware. Whether it's a flag in 
the request object, a signal or something else, give other systems the ability 
to look at a request before it hits the FetchFromCacheMiddleware and either 
allow or prevent the response from being pulled from the cache. E.g if there 
was a flag request.invalidate_cache that defaults to False, the contrib.auth 
app could, in combination with the above, pull the session id from 
consideration in the cache key and do an authentication check on its own, 
invalidating the cache on its own if the user is authenticated. The core idea 
is what you already suggested, I'm more illustrating here that this can 
conceivably be implemented as an API, making it less brittle.

- Some kind of "post-fetch" hook in the cache middleware, combined with a 
retooling of the CSRF middleware. This is getting in the clouds here a bit, but 
a hook on the opposite end of the fetch operation could allow the CSRF app to 
add its token after the response was pulled from the cache. I say we're in the 
clouds here because for something like this to work the CSRF would have to do a 
little 

Re: The state of per-site/per-view middleware caching in Django

2011-10-20 Thread Jim Dalton
On Oct 20, 2011, at 10:26 AM, Niran Babalola wrote:

> This problem is inherent to page caching. Workarounds to avoid varying
> by cookie for anonymous users are conceptually incorrect. If a single
> URL can give different responses depending on who's viewing it, then
> it varies by cookie. Preventing CSRF is inherently session-variable as
> well. Loading the token via a separate AJAX call is possible, but
> there are simpler solutions.

You may in fact be correct, but I'm not convinced by what you're saying here 
(not that there is any onus on you to convince me of anything of course). 

I"m suggesting that all anonymous users *could* receive an identical page from 
the server, theoretically, since the same URL does *not* need to return a 
different response depending on which (anonymous) user is viewing it. CSRF is 
obviously a trickier problem, and it's not really worth solving the anonymous 
user problem if CSRF isn't solved as well. But if both problems were somehow 
solvable, then we're in a position where per-site cache would be viable for 
many common scenarios such as the one I described in my original post.

If these two problems are in fact unsolvable or not worth solving because 
simpler alternatives exist, that's fine and understandable. Perhaps 
per-site/per-view caching are indeed exceptionally limited tools that are 
beneficial in a very limited number of use cases, and perhaps the "solution" 
here is tidying up the outstanding bugs and perhaps clarifying the 
documentation as needed to make the limitations more explicit.


> 
> If you want to cache pages with small portions that vary by user, then
> you want edge site includes and something like Varnish to process
> them. If you want a much slower, pure-python solution that doesn't
> require a separate service running somewhere, then you want
> armstrong.esi[1].


Thanks. This post wasn't really about what *I* need btw; I can definitely sort 
out my caching strategies in other areas as I need to. The post only relates to 
"me" because I sat down yesterday and said, "Gee, I wonder if I could make use 
of Django's per-site caching feature for this project I'm working on." I turned 
it "on" to test it out and then spent the next 6 hours delving into the source 
code, IRC, ticket tracker, Google etc. to figure out why it wasn't working at 
all and why @cache_page was, and then after finally sorting it out and grokking 
all of the moving parts etc, realizing that there was extraordinarily limited 
value in a per-site/view caching strategy that caches per unique visitor, which 
is pretty much unavoidable for most common usage patterns.

So yeah, maybe it's me and I'm looking at things the wrong way, but needless to 
say it wasn't a particularly pleasant or worthwhile experience. Not looking for 
pity btw, but just wondering what I/we can or should do to make it better.

Jim

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.



The state of per-site/per-view middleware caching in Django

2011-10-20 Thread Jim Dalton
I spent the better part of yesterday mucking around in the dregs of Django's 
cache middleware and related modules, and in doing so I've come to the 
conclusion that, due to an accumulation of hinderances and minor bugs, the 
per-site and per-view caching mechanism are effectively broken for many 
fairly typical usage patterns.

Let me demonstrate by fictional example, with what I would consider to be a 
pretty typical configuration and use case for the per-site cache:

Let's pretend I'm developing a blog powered by Django. I'm using memcached, 
and I would like to cache pages on that blog for anonymous users, who are 
going to make up the vast majority of my site's visitors. Ideally, I will 
serve the exact same cached version of a blog post to every single anonymous 
visitor to my site, which will help keep server load under control, 
particularly when I get slashdotted/reddited/what-have-you.

Like any blog, a typical page view features the content primarily (e.g a 
blog post). It also has some "auth" stuff at the top right, which will say 
"Log in / Register" for non logged in users but show a username and welcome 
message for logged in users. Each blog post also has an empty comment form 
at the bottom of it where users can leave comments on the post. Like 99% of 
the websites out there, I will be using Google Analytics to track my 
visitors etc.

Pretty straightforward, right?

Let me count the ways that Django's cache middleware will muck up my goals 
in the above scenario.

First, I'm going to try use the per site cache. Here's what's going to go 
wrong for me:

* It's going to be virtually impossible for me to avoid my cache varying by 
cookie and thus by visitor. Because in my templates I am checking to see if 
the current user is logged in, I'm touching the session, which is going to 
now set the vary cookie header. That means if there is any difference in the 
cookies users are requesting my pages with, I'm going to be sending each 
user a separate cached page, keyed off of SESSION_COOKIE_NAME, which is 
unique for every visitor.

* Even if I avoid touching the request user somehow, the CSRF middleware 
presents the same issue. Because I have a comment form on every page, I have 
a unique CSRF token for each visitor. Thankfully Django doesn't let me 
completely shoot myself in the foot by caching the page with one user's 
token and serving it to everybody else. At least it helpfully sets a CSRF 
token cookie and varies on it to prevent this. However, that cookie is 
different for every unique user. That triggers the the same problem as 
above. I again cannot avoid caching a unique page for each unique visitor.

* Unfortunately, my troubles are not over, even if I resign myself to having 
a cache that varies per visitor. You see, Google Analytics actually sets a 
handful of other cookies with each page request. And guess what? The values 
for those cookies are unique *for each request*. This mean...I'm actually 
not caching at all. Cookies are unique for each and every page request 
thanks to Google Analytics. My per-site cache configuration is totally and 
completely inoperable, all because I'm using a tracking service that pretty 
much *everybody* uses.

Since that didn't work, I wonder if it'll work if I do per-view caching? It 
shouldn't work at all, should it, since it's not like any of the factors I 
outlined above are different if I'm using the @cache_page decorator to do my 
caching vs the per-site cache.

Well, the sad news is caching does "work" when I use cache_page, and that's 
not a good thing:

* @cache_page caches the direct output of the view/render function. It skips 
over the middleware that might have very good reason to introduce vary 
headers and doesn't introduce any vary headers of it's own. So now, with 
this applied, I *am* serving a cached version of this page even though I 
absolutely should not be. Some poor user's token is now being sent to 
everybody. My only chance of redemption is if I happen to have read the docs 
and discovered that this incantation is required to prevent having 
cache_page improperly cache the page:

   @cache_page(60 * 15)
   @csrf_protect
   def my_view(request):
   # ...
   
Of course, the above just puts me right back where I started at the per-site 
level. There was never any chance of making cache_page work any different 
from the per-site cache, but it certainly proved to be a temptation if I'm a 
hurried developer, frustrated by why my per site cache wasn't working and 
"thankful" for the fact that I could get the cache to start "working" with 
the cache_page decorator.

Hopefully the above example really makes it clear to you guys how all of the 
seemingly minor bugs and imperfections really do add up to a broken 
situation for someone coming to this with a pretty standard set of 
expectations and requirements.

Anyhow, the good news is that a good portion of what I have written about 
already has open tickets which in some 

Re: Deferred constraint checking causing problems during test runs

2011-07-14 Thread Jim Dalton
On Jul 14, 2011, at 1:31 AM, Simon Riggs wrote:

> On Sun, Jul 10, 2011 at 3:27 PM, Jim Dalton <jim.dal...@gmail.com> wrote:
>> On Jul 10, 2011, at 3:13 AM, Simon Riggs wrote:
>> 
>>> Maintaining the property of deferrable constraints seems important
>>> here, so changing the deferrability of constraints, or overriding it
>>> using the SET CONSTRAINTS command at the top of the transaction might
>>> not be what we want.
>> 
>> Well, that's just it. We want tests to behave as much like production code 
>> as possible, so we actually *don't* want constraint checks to be deferred.
>> 
>> When you're working with DB normally in production, i.e. outside of a 
>> transaction, constraints are checked immediately. It's only when you get 
>> into a transaction that they are deferred. Each of our tests run inside of a 
>> transaction, which is an artificial construct. So to emulate production, I'm 
>> arguing that this property is not something we want (*unless* we are testing 
>> a transaction or transaction like behavior, in which case it does make sense 
>> to temporarily suspend constraint checks).
> 
> My role here is to help with Postgres details, so any comments I make
> are just attempts at providing useful information.

Thank you, btw, for offering your input in that capacity. I found it quite 
reassuring when I read your first reply that someone with your credentials was 
giving feedback on this issue. :)
> 
> It sounds like there is a slight confusion about the way PostgreSQL
> works. Everything is a transaction, so there is no difference between
> inside/outside a transaction.

> You can choose whether you wish immediate or deferred constraint
> checking. It's completely user selectable, though the default is
> immediate. If you're seeing deferred checks, they can be changed.
> 
> ISTM that the tests should work the same way as things do in
> production, and it looks possible to make that happen.

Okay, this is interesting -- and thank you for clarifying this point, which I 
had grossly oversimplified. I think you actually shed a huge amount of light on 
part of the problem here (at least for me).

So in normal, everyday use, Django opens a connection to Postgresql with an 
open transaction. Per the docs, it commits this transaction immediately 
whenever you do an INSERT or UPDATE etc. At that point Postgresql would run its 
constraint checks and rollback the transaction on error or else commit.

During a test, we open up a big transaction during setUp and then *disable* the 
transaction commit and rollback operations, i.e. a call to either does nothing. 
Since we can't nest transactions in Postgresql (right?) this has been the only 
sensible way to allow tests to proceed. The problem has been -- I am positing 
-- that we're getting away with a lot of stuff as a result, because constraint 
checks are *never* checked during testing. The small slew of bugs I found is a 
demonstration of that, to me.

My solution up to now, however, is also "unlifelike", which your comment has 
helped me to realize. It works kind of similar to the way Django works normally 
because in some ways Django sort of emulates immediate checks in its default 
behavior. But I think my present solution is a kludge in the other direction 
and I agree is not close enough to reality.

I'm thinking this might not be too hard to solve though. I'm thinking that 
rather than *only* enabling constraint checks immediately as I was suggesting 
before, we could instead focus on the transaction-related methods that we are 
presently just blanking out (by this I mean when tests are set up, 
transaction.commit() transaction.rollback() etc. are being monkey patched with 
a function that does nothing). Even though we can't nest a transaction 
explicitly, we can better emulate the behavior of normal Django operation. We 
can keep things in deferred mode as they are in production, but then do a 
temporary flip to IMMEDIATE at commit() which will fire the constraint checks. 
Presumably, we could even do a SAVEPOINT, as you suggested, at the start so 
that the test could be recovered and continue on if e.g. the IntegrityError 
needed to be caught during testing.

I imagine that this will still flush out most if not all of the little errors I 
had discovered (and prevent us from writing test cases going forward with 
similar problems) but at the same time will very closely emulate production. 
The challenge will be wrangling all this together with the other DBs (i.e. 
MySQL and SQLite) which don't share this execution model. Should be workable 
though.

Anyhow, thank you once again Simon for shedding light on this for me. 

Cheers
Jim

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send e

Re: Deferred constraint checking causing problems during test runs

2011-07-10 Thread Jim Dalton
On Jul 10, 2011, at 3:13 AM, Simon Riggs wrote:

> Maintaining the property of deferrable constraints seems important
> here, so changing the deferrability of constraints, or overriding it
> using the SET CONSTRAINTS command at the top of the transaction might
> not be what we want.

Well, that's just it. We want tests to behave as much like production code as 
possible, so we actually *don't* want constraint checks to be deferred.

When you're working with DB normally in production, i.e. outside of a 
transaction, constraints are checked immediately. It's only when you get into a 
transaction that they are deferred. Each of our tests run inside of a 
transaction, which is an artificial construct. So to emulate production, I'm 
arguing that this property is not something we want (*unless* we are testing a 
transaction or transaction like behavior, in which case it does make sense to 
temporarily suspend constraint checks).

> What I would recommend is that we issue an explicit SET CONSTRAINTS
> ALL IMMEDIATE command immediately before the ROLLBACK at *end* of
> test. This will fire any outstanding checks. That way all constraint
> checks will occur in the same place they would during a commit, yet we
> can maintain the situation that the test ends with a rollback.

This would conceivably work I think. I'm pretty sure Ramiro was exploring this 
approach actually. My feeling, however, is that this still allows you to get 
away with stuff you might not otherwise get away with in production. I also 
think it's more helpful seeing exactly where an integrity issue came up so you 
can address it. This is, for example, what allowed me to understand the handful 
of bugs that were hiding, i.e. because I could trace the Intergrity Error to 
the exact line of code that was triggering it.

Your questions raise one issue that had not occurred to me though. One possible 
"problem" with putting constraint checks at the beginning is that there is now 
way for a test to recover from them. If you try to put bad data into the DB 
with immediate constraint checks on, you will raise and error *and* if I'm not 
mistaken the transaction will be rolled back at that very instant. So if for 
some reason you knew you were putting bad data in and wanted to recover from it 
in your test and keep going, that would not be possible. I'm not sure that's 
actually a problem, but it's certainly something to consider. It's another 
dimension of behavior that is changed.

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.



Re: Resetting cache gracefully at the end of every test?

2011-07-04 Thread Jim Dalton
On Jul 4, 2011, at 8:01 AM, Jacob Kaplan-Moss wrote:

> On Mon, Jul 4, 2011 at 8:54 AM, Jim Dalton <jim.dal...@gmail.com> wrote:
>> I've created a ticket for this and uploaded a patch: 
>> https://code.djangoproject.com/ticket/16401
>> Please feel free to review and let me know if this is a good idea in the 
>> first place, as well as if the patch makes sense and is the right approach.
> 
> Hm.
> 
> While I think the feature's a good idea (I've been stung by tests
> running against a live cache) I can't say I'm particularly thrilled by
> the approach here. Monkeypatching is acceptable in situations like
> these, but there's a limit... and I think patching five methods is
> starting to reach it. Remember: as much as possible, tests should
> operate just like production. Any difference, no matter how small,
> introduces the worst kind of bugs: ones that only show up in
> production.

Yeah, I agree with you in principle there. I'm definitely an advocate of 
avoiding monkey patching like the plague in production code.

While I also agree that this code looks a bit hairy when you sort of squint 
your eyes at it, if you actually look closely at what's going on, we're not 
modifying things in any way that's tricky or complicated. Really, the only two 
things we are doing are adding a bit of code that that tracks keys and then 
swapping a few methods out. Pretty similar to what we do already with 
Template.render  -- just x5. The monkey patching stuff is not modifying any 
arguments or results, nor is it changing the observable behavior of the 
modified methods. The tests demonstrate this pretty thoroughly.

The only place I've managed to think of where this is different from production 
is in the length of the key. Because we are appending "_test_" to the start of 
KEY_PREFIX, you're going to get 6 less characters in your key than you would 
otherwise. It *is* a difference, so I don't want to downplay it, but at the 
same time it's a difference where you'll get an error in testing that you 
wouldn't get in production -- which is better than the opposite.

The only other place where someone might run into trouble is if they themselves 
were monkey patching the cache and their changes collided somehow. I understand 
that this is the kind of thing that happens in Ruby a lot, and it strikes me as 
unpleasant.

Anyhow, that's not to say that there isn't a better solution or that your 
proposal below isn't preferable (more on that in a sec), but I'd argue that my 
patch is at least a plausible solution, i.e. it works in lieu of a better 
alternative.



> I have an alternate proposal:
> 
> Instead of using the default cache defined in CACHES, use a "test
> cache." This can safely be flushed, meaning that no monkeypatching is
> needed. The added benefit here is that this works similarly to the
> database engine, so it'll be fairly easy to understand.
> 
> Now, unlike DB test we can't just prefix "test_" to the database name
> (because we haven't got one), nor can we prefix keys (because not all
> cache backends can do something like "delete test:*"). So I'd suggest
> we look for a "test" backend defined in CACHES and fall back to a
> local memory cache if not defined.
> 
> I think this simplifies things immensely, and also gives a fair bit of
> control -- if I want to test against a real cache backend, I just
> define CACHES['test'] and roll.


So, in many ways I like this idea. I agree that the implementation is going to 
be more straightforward. Here's just a quick list of concerns or possible 
drawbacks I was able to think of:

* Not sure how to handle multiple cache backends. I imagine if you're someone 
who has implemented multiple cache backends, then you are using them in 
specific ways for specific reasons. If your tests end up sharing a test 
backend, that could result in unpredictable behavior. One possible resolution 
would be to amend your proposal a bit: For each cache backend defined, we swap 
it out with the locmem backend unless you have yourself defined a 
test_CACHE_NAME version of that cache. So e.g. fi I have three backends named 
"default", "file" and "db", I then need to define test_default, test_file and 
test_db if I want to specify how the test framework is going to set up the test 
cache. Otherwise, it'll fall back to replacing it with a locmem cache.
* At least when it comes to memcached, this doesn't help you much with the 
cache.clear() issue, since cache.clear() will blank your entire memcached 
cache. So if we're calling cache.clear() at the end of each test and it's 
blanking the whole cache, I start to wonder why we bother at all? In other 
words, why not just call cache.clear() at the end of every test and call it a 
day, and make a note in the documentation that running the test will clear a

Re: Resetting cache gracefully at the end of every test?

2011-07-04 Thread Jim Dalton
I've created a ticket for this and uploaded a patch: 
https://code.djangoproject.com/ticket/16401

Please feel free to review and let me know if this is a good idea in the first 
place, as well as if the patch makes sense and is the right approach.

Thanks,
Jim

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.



Re: Resetting cache gracefully at the end of every test?

2011-07-01 Thread Jim Dalton
Awesome feedback, thanks.

On Jul 1, 2011, at 10:01 AM, Jeremy Dunck wrote:

> Well, I think you forgot all the other low-level cache options, like
> .incr, .add, etc.?

Yep, forgot those so they would conceivably need to be tracked.

> 
> If I were to do this, I wouldn't have cache shared between tests or
> environments, so I'd be tempted to use flush.

I'm not totally sure if I'm following you on this point?

> 
> Rather than "test_" and bookkeeping on key names, the key prefix was
> the hostname + timestamp from the beginning of the run, so that test
> keys would never collide?

What I like about this idea is it avoids the bookkeeping and monkey-patching 
nonsense. What I don't like about it is that it feels like you could 
potentially be adding a lot of junk to your cache. A test run might end up 
hitting it with many values, each unique. A requirement I'm sort of working 
with in my mind here is that the cache returns to the "pristine" state it was 
in prior to the test run.

> 
> Or, what if we just had hooks called from setUp and tearDown, either
> NotImplemented methods, or signals?
> 
> Book-keeping with monkey-patched methods sounds fiddly to me; consider
> the case where celery pokes keys in, but our bookkeeping doesn't know
> about it?

I know what you mean about it being fiddly, but I'm kind of envisioning 
something not all that different from what goes on with Template.render() 
settings.EMAIL_BACKEND, or the database backends during test setup and 
teardown. The bookkeeping required would in truth be nothing more than a set() 
where you add keys. The monkey patching would entail nothing more than a line 
to add the passed keys to that set and then a call to the original function. 
I'm just not sure it's much more fiddly than the fiddling we already do for 
those examples is what I'm saying.

I"m not quite sure what you mean about celery poking keys in. Maybe you can 
elaborate a bit? To be clear I'm not sure we could keep the actual underling 
cache entirely pristine if you were using it in different ways from different 
systems; the goal would really be to ensure calls using the cache backend API 
are rolled back with every test.

Anyhow thanks once again for the feedback and the critical thoughts.

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.



Re: Thoughts on solution to forward references in MySQL (#3615)

2011-06-28 Thread Jim Dalton
On Jun 28, 2011, at 6:29 AM, Cal Leeming [Simplicity Media Ltd] wrote:

> Sorry for the noobish question but, could someone explain the definition of 
> "forward references"?? Is this a MySQL or a django term?? Google wasn't very 
> forthcoming :X

Jacob actually requested that I add a note in the database docs on the topic 
this patch addresses. So as a way of answering your question I'll paste that 
note verbatim here:

In previous versions of Django, fixtures with forward references (i.e.
relations to rows that have not yet been inserted into the database) would fail
to load when using the InnoDB storage engine. This was due to the fact that 
InnoDB
deviates from the SQL standard by checking foreign key constraints immediately
instead of deferring the check until the transaction is committed. This
problem has been resolved in Django 1.4. Fixture data is now loaded with 
foreign key
checks turned off; foreign key checks are then re-enabled when the data has
finished loading, at which point the entire table is checked for invalid foreign
key references and an `IntegrityError` is raised if any are found.

Is that note illuminating to you? If not, then presumably it needs work :)

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.



Re: Thoughts on solution to forward references in MySQL (#3615)

2011-06-28 Thread Jim Dalton
On Jun 28, 2011, at 6:25 AM, Karen Tracey wrote:

> It actually doesn't *need* to return False; pass is the same as not returning 
> anything or returning None. The boolean check just treats it the same way as 
> False. "Should it?" is another question. On the one hand it's a bit more 
> clear, this value is called and always returns False, unless a backend has 
> overridden it. On the other hand, pass is in keeping with other methods in 
> that class that are meant to be overridden in backends, so I went with pass 
> to emphasize that aspect of the code.
> 
> Hmm, well, I did not know that falling off the end of a method with pass 
> guaranteed a return value of False or None (and can't find that noted in the 
> doc here:http://docs.python.org/tutorial/controlflow.html#pass-statements) so 
> in my mind explicitly returning False would be clearer/better...

Ah, okay. To be clear, the return value actually has nothing to do specifically 
with the use of pass or not. The behavior of Python is to return None from any 
function that does not explicitly return a value. This is from the exact same 
page in the Python docs you linked to:

"In fact, even functions without a return statement do return a value, albeit a 
rather boring one. This value is called None (it’s a built-in name). Writing 
the value None is normally suppressed by the interpreter if it would be the 
only value written."

Obviously it's a fairly minor point but we want things to be as clear as 
possible. Anyone else wants to weigh in on "pass" vs. an explicit return value?

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.



Re: Thoughts on solution to forward references in MySQL (#3615)

2011-06-28 Thread Jim Dalton
On Jun 28, 2011, at 5:27 AM, Karen Tracey wrote:

> Also, though I don't have MySQL 4 handy to test on, I'd be astonished if 
> there were any issue there compared to MySQL 5. The set foreign_key_check 
> command is certainly supported in MySLQ 4 and the select being issued to do 
> the check is nothing special at all. 

Agreed. I did find some circumstantial evidence to support the idea that ti 
would work on MySQL 4 here: 
http://dev.mysql.com/doc/refman/5.1/en/innodb-foreign-key-constraints.html#c3009
 This comment from 2003 (when MySQL 4 was still in beta) described the 
technique. If it worked in 2003 I have to imagine we're in good shape.

Slightly higher concern is the new get_key_relations() method. Since the 
information_schema table is not available before MySQL 5 there is a workaround 
in there for version 4. The contents of this method were originally part of 
get_indexes() and I just moved them to their own method (that's why you don't 
see it in the commit). So assuming that function worked fine previously in 
Django, I don't anticipate this will have a problem either.

> 
> I have not had time to try out the patch, but did look at it. Doesn't the 
> base implementation of disable_foreign_key_checks need to return False 
> instead of just passing? The return value is used in loaddata processing to 
> decide whether it's necessary to re-enable/check.

It actually doesn't *need* to return False; pass is the same as not returning 
anything or returning None. The boolean check just treats it the same way as 
False. "Should it?" is another question. On the one hand it's a bit more clear, 
this value is called and always returns False, unless a backend has overridden 
it. On the other hand, pass is in keeping with other methods in that class that 
are meant to be overridden in backends, so I went with pass to emphasize that 
aspect of the code.

> 
> Thanks for working on this -- wish I'd thought of that idea two years ago!

You're welcome. It was actually fun to solve a cold case :)

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.



Re: Thoughts on solution to forward references in MySQL (#3615)

2011-06-27 Thread Jim Dalton
On Jun 27, 2011, at 5:44 PM, Michael Blume wrote:

> I see a variable saved_objects being written, but I don't see it being 
> accessed -- is this to ease future features, or am I missing a code path?

Thanks good catch. This was a remnant from an earlier iteration of this patch, 
in which I tried (unsuccessfully) to call save() on any objects created during 
the fixture load after foreign key checks had been re-enabled, in hopes that 
the UPDATE call this generated would trigger a constraint check. It did not, 
because MySQL ignores UPDATE if the data being set matches what's already in 
the DB. I have some other changes I'm making in response to some of Jacob's 
comments on the ticket, so this will be gone in the next version of the patch.

> If I'm reading correctly, check_for_invalid_foreign_keys extends over all the 
> rows in a table. loaddata is called by syncdb and South's migrate, not just 
> when a db is set up, so this could easily wind up run over lots and lots of 
> non-fixture data. I don't know MySQL's performance characteristics that well 
> -- is this likely to be expensive?

This is a good question.

First, though it's true that loaddata is called in several places other than 
during tests, I would imagine it is rarely if ever called during a normal 
request/response cycle in production. I'm operating under the assumption that 
the performance of this patch is important but not as critical as it would be 
if this was a command used in production.

So that caveat aside, I will try later to find a large data set I can run the 
query on to get an idea of what kind of performance hit the check entails. If 
anyone else has a large data set with two related tables, you can try it out as 
well and report your results. The basic structure of the SQL statement being 
run is as follows:

SELECT REFERRING.`id`, REFERRING.`id_pointing_to_some_related_table ` FROM 
`some_table` as REFERRING
LEFT JOIN `some_related_table` as REFERRED
ON (REFERRING.`id_pointing_to_some_related_table` = REFERRED.`id`)
WHERE REFERRING.`id_pointing_to_some_related_table ` IS NOT NULL
AND REFERRED.`id` IS NULL

And keep in mind this is being called for *each* relation in the table. So 
that'll run through all the rows once for every relation. Of course, it won't 
return anything unless you have bad data, so the i/o aspect should be minimal.

If we notice an "unacceptable" performance hit the likely solution would be to 
scope that select statement to only rows that were added during the load data 
routine. Ideally I'm hoping to steer clear of that, because it entails more 
work and complexity in loaddata, since you'd have to track all the IDs that 
were added and then pass them along here. Wouldn't be a big deal but that's why 
I only wanted to do it if necessary.

Thanks for your input and help testing this btw!

Jim

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.



Re: Thoughts on solution to forward references in MySQL (#3615)

2011-06-27 Thread Jim Dalton
On Jun 27, 2011, at 4:52 PM, Russell Keith-Magee wrote:

> Unfortunately, not much. Your test has validated that the extra code
> doesn't break anything under MyISAM, and this is certainly useful.
> However, the root problem only exists with InnoDB because of its...
> eclectic... implementation of row level constraints. MyISAM doesn't
> have constraints, so your test hasn't exercised the part of the code
> that needs to be exercised heavily.

One unintended benefit of the current patch is that it actually will (if I'm 
not mistaken) check for invalid foreign key references in MyISAM as well, since 
the check is run for all MySQL implementations (though not for any other 
backends). So MyISAM would normally silently allow bad foreign keys to be 
loaded from your fixture data, but with the patch you'll now get an error.

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.



Should test failures/errors in the Django test suite when executed under MySQL be considered blockers?

2011-06-23 Thread Jim Dalton
I've been trying to do a little work recently on some tickets that related 
to MySQL-specific issues that come up using Django (#2495 and #3615 
specifically). As part of trying to get my head around resolving these, I 
executed the test suite in the current Django trunk using a MySQL InnoDB 
backend.

The results weren't pretty. I had to make a few changes to get the tests to 
even execute (related to a missing "pk" field in 
regression_tests/admin_views), and once I did I got FAILED (failures=91, 
errors=219, skipped=21, expected failures=2)*.* It's a bit hard to know 
where a lot of these are stemming from, but presumably a big chunk relate to 
the fixture loading problems identified in #3615.

Regardless, it appears that many of these errors have existed across 
multiple version releases. Also, some of the tickets that relate to these 
issues are marked as severity normal instead of blocker.

I guess I'd like to understand a bit better what to make of this. Are test 
failures with the MySQL backend considered acceptable? Do we consider it 
okay to drop new releases while these issues go unresolved? Is MySQL not 
considered a fully supported backend?

As I mentioned in a comment recently on #3615, I'd like to get involved in 
helping move forward on some of these issues (my intent is not to gripe, if 
that's what it sounds like I'm doing). At the same time, I guess I'd like to 
have a more clear understanding of the Django developers' commitment to 
supporting MySQL in the test suite. If this is not something that's 
considered to be a big deal, or if other decisions have been made on this 
topic already that I missed, then that's cool. Hopefully someone can shed 
some light on that if this is the case.

For what it's worth, I haven't come across any serious issues running Django 
under MySQL in production -- it's all basically related to running the test 
suite, etc. Still, I'd love to get the test suite for the MySQL backend up 
to speed because it would make developing and fixing bugs for that backend 
that much easier.

Thanks for any thoughts on this.

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/django-developers/-/20XgIVa-4AIJ.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.