WHERE EXISTS / WHERE NOT EXISTS clause without using QuerySet.extra

john . parton Wed, 03 Feb 2016 08:04:21 -0800

Greetings!

I'm trying to refactor a query to avoid using QuerySet.extra (as per the 
recommendation here: 
https://docs.djangoproject.com/en/1.9/ref/models/querysets/#extra)


Here's a simplified version of my code:

# testapp.models

class Parent(models.Model):
    pass

class Child(models.Model):
    parent = models.ForeignKey('testapp.Parent')

This is the query I am attempting to replace:

Parent.objects.extra(where=["""
    NOT EXISTS (SELECT 1 FROM testapp_child WHERE testapp_child.parent_id = 
testapp_parent.id)
"""])

This is extremely similar to, but not the same as

Parent.objects.exclude(id__in=Child.objects.all().values('parent_id'))

Both queries should return the same rows, but the biggest difference is 
that PostgreSQL's query planner produces completely different plans. The 
performance difference can be huge, even for a relatively modest data set. 
(I believe my data set has something like 270k "parent" instances and 80k 
"child" instances.)

I tried searching this group as well as the ticket system, and couldn't 
find a solution to this exact problem (other than using the alternative 
query).

Thanks for taking the time to read through all of this! I am more than 
happy to open a ticket, but I thought I should post here first. 


More Technical Stuff

Here's the output of EXPLAIN ANALYZE on my actual models/data. I had to 
apply a LIMIT 100 because if I try to return the entire result, the second 
one completely hangs my computer. I bolded some relevant bits

Limit  (cost=0.71..9.70 rows=100 width=111) (actual time=0.074..0.483 
rows=100 loops=1)
  ->  *Merge Anti Join*  (cost=0.71..22435.69 rows=249613 width=111) 
(actual time=0.072..0.465 rows=100 loops=1)
        Merge Cond: (catalogue_product.id = 
catalogue_productcategory.product_id)"
        ->  *Index Scan* using catalogue_product_pkey on catalogue_product  
(cost=0.42..16986.70 rows=292339 width=111) (actual time=0.020..0.285 
rows=150 loops=1)
        ->  *Index Only Scan* using catalogue_productcategory_9bea82de on 
catalogue_productcategory  (cost=0.29..3861.27 rows=81671 width=4) (actual 
time=0.037..0.070 rows=101 loops=1)
              Heap Fetches: 0
Total runtime: 

*0.568 ms*Here's the plan for the second query:
Limit  (cost=0.00..234229.95 rows=100 width=111) (actual 
time=31.087..1165.022 rows=100 loops=1)
  ->  *Seq Scan* on catalogue_product  (cost=0.00..342373919.34 rows=146170 
width=111) (actual time=31.087..1164.992 rows=100 loops=1)
        Filter: (NOT (SubPlan 1))
        Rows Removed by Filter: 5
        SubPlan 1
          ->  *Materialize*  (cost=0.00..2138.07 rows=81671 width=4) 
(actual time=0.001..5.539 rows=80818 loops=105)
                ->  *Seq Scan* on catalogue_productcategory u0  
(cost=0.00..1409.71 rows=81671 width=4) (actual time=0.005..10.574 
rows=81671 loops=1)
Total runtime: 
*1165.255 ms*
As you can see the former is about ~2000 times faster than the latter.

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/django-users.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-users/efc7a911-1ec7-4dc6-9b5a-c31832c071f4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

WHERE EXISTS / WHERE NOT EXISTS clause without using QuerySet.extra

Reply via email to