I'd like to discuss automatic prefetching in querysets. Specifically 
automatically doing prefetch_related where needed without the user having 
to request it.

For context consider these three snippets using the Question & Choice 
models from the tutorial 
<https://docs.djangoproject.com/en/1.11/intro/tutorial02/#creating-models> when 
there are 100 questions each with 5 choices for a total of 500 choices.

Default
for choice in Choice.objects.all():
    print(choice.question.question_text, ':', choice.choice_text)
501 db queries, fetches 500 choice rows and 500 question rows from the DB

Prefetch_related
for choice in Choice.objects.prefetch_related('question'):
    print(choice.question.question_text, ':', choice.choice_text)
2 db queries, fetches 500 choice rows and 100 question rows from the DB

Select_related
for choice in Choice.objects.select_related('question'):
    print(choice.question.question_text, ':', choice.choice_text)
1 db query, fetches 500 choice rows and 500 question rows from the DB

I've included select_related for completeness, I'm not going to propose 
changing anything about it's use. There are places where it is the best 
choice and in those places it will still be up to the user to request it. I 
will note that anywhere select_related is optimal prefetch_related is still 
better than the default and leave it at that.

The 'Default' example above is a classic example of the N+1 query problem, 
a problem that is widespread in Django apps.
This pattern of queries is what new users produce because they don't know 
enough about the database and / or ORM to do otherwise.
Experieced users will also often produce this because it's not always 
obvious what fields will and won't be used and subsequently what should be 
prefetched.
Additionally that list will change over time. A small change to a template 
to display an extra field can result in a denial of service on your DB due 
to a missing prefetch.
Identifying missing prefetches is fiddly, time consuming and error prone. 
Tools like django-perf-rec <https://github.com/YPlan/django-perf-rec> 
(which I was involved in creating) and nplusone 
<https://github.com/jmcarp/nplusone> exist in part to flag missing 
prefetches introduced by changed code.
Finally libraries like Django Rest Framework and the Admin will also 
produce queries like this because it's very difficult for them to know what 
needs prefetching without being explicitly told by an experienced user.

As hinted at the top I'd like to propose changing Django so the default 
code behaves like the prefetch_related code.
Longer term I think this should be the default behaviour but obviously it 
needs to be proved first so for now I'd suggest a new queryset function 
that enables this behaviour.

I have a proof of concept of this mechanism that I've used successfully in 
production. I'm not posting it yet because I'd like to focus on desired 
behavior rather than implementation details. But in summary, what it does 
is when accessing a missing field on a model, rather than fetching it just 
for that instance, it runs a prefetch_related query to fetch it for all 
peer instances that were fetched in the same queryset. So in the example 
above it prefetches all Questions in one query.

This might seem like a risky thing to do but I'd argue that it really isn't.
The only time this isn't superior to the default case is when you are post 
filtering the queryset results in Python.
Even in that case it's only inferior if you started with a large number of 
results, filtered basically all of them and the code is structured so that 
the filtered ones aren't garbage collected.
To cover this rare case the automatic prefetching can easily be disabled on 
a per queryset or per object basis. Leaving us with a rare downside that 
can easily be manually resolved in exchange for a significant general 
improvement.

In practice this thing is almost magical to work with. Unless you already 
have extensive and tightly maintained prefetches everywhere you get an 
immediate boost to virtually everything that touches the database, often 
knocking orders of magnitude off page load times.

If an agreement can be reached on pursuing this then I'm happy to put in 
the work to productize the proof of concept.

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To post to this group, send email to django-developers@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/d402bf30-a5af-4072-8b50-85e921f7f9af%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to