Re: ~Q broken?

Steve Holden Wed, 16 Jul 2008 17:45:13 -0700

Malcolm Tredinnick wrote:
> 
> On Wed, 2008-07-16 at 10:12 -0700, serbaut wrote:
>> Consider the following equivalent queries based on the weblog model
>> from the documentation:
>>
>> Blog.objects.filter(name="myblog").exclude(entry__body_text__contains="blah
>> blah")
>> Blog.objects.filter(Q(name="myblog") &
>> QNot(Q(entry__body_text__contains="blah blah"))) # 0.96
>> Blog.objects.filter(Q(name="myblog") &
>> ~Q(entry__body_text__contains="blah blah")) # development  version
>>
>> In 0.96 the evaluated SQL is (edited for readability):
>>
>> SELECT blog_blog.id,blog_blog.name,blog_blog.tagline
>> FROM blog_blog
>> INNER JOIN blog_entry AS blog_blog__entry ON blog_blog.id =
>> blog_blog__entry.blog_id
>> WHERE blog_blog.name = "myblog"
>> AND (NOT (blog_blog__entry.body_text LIKE "%blah blah%"))
>>
>> while in trunk (r7982) it is
>>
>> SELECT blog_blog.id, blog_blog.name, blog_blog.tagline
>> FROM blog_blog
>> WHERE blog_blog.name = "myblog"
>> AND (NOT (blog_blog.id IN (SELECT blog_entry.blog_id FROM blog_entry
>> WHERE blog_entry.body_text LIKE "%blah blah%")))
>>
>> The trunk version will perform a subquery over all blog entries which
>> will have a very negative performance impact.
>>
>> Lets say "myblog" has 10 entries in a database with 10,000,000
>> entries, the first SQL will only examine the 10 entries (assuming
>> decent query planner) while the latter will scan the whole database. A
>> join with blog_blog.id is missing from the subquery or it needs to be
>> rewritten to the 0.96 form. I understand that the code tries to handle
>> the generic case but this practical case has to work too.
> 
> The important difference, that trumps everything else, is that the 0.96
> version gives the wrong answer! It isn't a case of less or more
> efficient -- it's the difference between correct and incorrect.
> 
> The query you described asks to exclude all blog entries containing a
> particular tag. Now consider a blog entry that has two entries. One is
> the entry you're interested in excluding and the second entry is
> something else. Because that blog has an row in the m2m join table that
> does not match the entry you are excluding, that blog will be included
> (incorrectly) in the result set. This was a very big bug in 0.96 and
> impossible to work around in that code.
> 
> There is no way to write that particular exclusion correctly without
> using nested subqueries unless you have some arbitrary constraint like
> only one tag per blog entry (in which case a many-to-many field is the
> wrong choice).
> 
> Remember that the query you are writing here is something that returns
> (and filters) Blog objects. It's not for excluding individual Entry
> objects. It uses the presence or absence of an Entry to filter the
> Blogs.
> 
> Your case is particularly pessimal, since it isn't going to be helped by
> index matches in most cases. For more natural uses, such as excluding by
> pk values, or whole-field matches, adding appropriate index comparisons
> makes the inner query very efficient if it becomes a performance issue
> in production environments.
> 
Note that there's nothing wring in the general case with nested 
subqueries. The problem arises with what are called "correlated 
subqueries", where an element of the rows from the outer query is used 
in the sub-query.


In other words, given an employee table where each employee has a 
manager, it's quite fast to work out who earns more than a specific 
employee:

SELECT * FROM employee WHERE salary > (
     SELECT salary FROM employee WHERE ID=1234)

The subquery returns a result that's invariant over the rows of the 
outer query, and most SQL implementations will only perform it the once. 
It's quite a different story if you want to find out who earns more than 
their manager, however:

SELECT * FROM employee AS sub WHERE salary > (
     SELECT salary FROM employee AS mgr WHERE sub.salary > mgr.salary)

In such cases the SQL optimizer usually has little choice but to repeat 
the subquery for each row in the outer query, which can get a little 
expensive.

Then later said:

> The problem is that the filter you specified was not returning the
> correct answer in 0.96. You're asking that Django returns an incorrect
> result to make your code faster. It's possible to give incorrect results
> very fast, but it's not something we want to support.

Which made me laugh :-)

Sorry if this is preaching to the choir: sometimes people assume that 
*all* nested subqueries are bad.

regards
  Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Re: ~Q broken?

Reply via email to