Greetings!
I'm trying to refactor a query to avoid using QuerySet.extra (as per the
recommendation here:
https://docs.djangoproject.com/en/1.9/ref/models/querysets/#extra)
Here's a simplified version of my code:
# testapp.models
class Parent(models.Model):
pass
class Child(models.Model):
parent = models.ForeignKey('testapp.Parent')
This is the query I am attempting to replace:
Parent.objects.extra(where=["""
NOT EXISTS (SELECT 1 FROM testapp_child WHERE testapp_child.parent_id =
testapp_parent.id)
"""])
This is extremely similar to, but not the same as
Parent.objects.exclude(id__in=Child.objects.all().values('parent_id'))
Both queries should return the same rows, but the biggest difference is
that PostgreSQL's query planner produces completely different plans. The
performance difference can be huge, even for a relatively modest data set.
(I believe my data set has something like 270k "parent" instances and 80k
"child" instances.)
I tried searching this group as well as the ticket system, and couldn't
find a solution to this exact problem (other than using the alternative
query).
Thanks for taking the time to read through all of this! I am more than
happy to open a ticket, but I thought I should post here first.
More Technical Stuff
Here's the output of EXPLAIN ANALYZE on my actual models/data. I had to
apply a LIMIT 100 because if I try to return the entire result, the second
one completely hangs my computer. I bolded some relevant bits
Limit (cost=0.71..9.70 rows=100 width=111) (actual time=0.074..0.483
rows=100 loops=1)
-> *Merge Anti Join* (cost=0.71..22435.69 rows=249613 width=111)
(actual time=0.072..0.465 rows=100 loops=1)
Merge Cond: (catalogue_product.id =
catalogue_productcategory.product_id)"
-> *Index Scan* using catalogue_product_pkey on catalogue_product
(cost=0.42..16986.70 rows=292339 width=111) (actual time=0.020..0.285
rows=150 loops=1)
-> *Index Only Scan* using catalogue_productcategory_9bea82de on
catalogue_productcategory (cost=0.29..3861.27 rows=81671 width=4) (actual
time=0.037..0.070 rows=101 loops=1)
Heap Fetches: 0
Total runtime:
*0.568 ms*Here's the plan for the second query:
Limit (cost=0.00..234229.95 rows=100 width=111) (actual
time=31.087..1165.022 rows=100 loops=1)
-> *Seq Scan* on catalogue_product (cost=0.00..342373919.34 rows=146170
width=111) (actual time=31.087..1164.992 rows=100 loops=1)
Filter: (NOT (SubPlan 1))
Rows Removed by Filter: 5
SubPlan 1
-> *Materialize* (cost=0.00..2138.07 rows=81671 width=4)
(actual time=0.001..5.539 rows=80818 loops=105)
-> *Seq Scan* on catalogue_productcategory u0
(cost=0.00..1409.71 rows=81671 width=4) (actual time=0.005..10.574
rows=81671 loops=1)
Total runtime:
*1165.255 ms*
As you can see the former is about ~2000 times faster than the latter.
--
You received this message because you are subscribed to the Google Groups
"Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/django-users.
To view this discussion on the web visit
https://groups.google.com/d/msgid/django-users/efc7a911-1ec7-4dc6-9b5a-c31832c071f4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.