#16937: Prefetch related objects
-------------------------------------+-------------------------------------
               Reporter:  lukeplant  |          Owner:  nobody
                   Type:  New feature|         Status:  new
              Component:  Database   |        Version:  1.3
  layer (models, ORM)                |
             Resolution:             |       Severity:  Normal
           Triage Stage:  Accepted   |       Keywords:
    Needs documentation:  0          |      Has patch:  0
Patch needs improvement:  1          |    Needs tests:  0
                  UI/UX:  0          |  Easy pickings:  0
-------------------------------------+-------------------------------------
Comment (by akaariai):

Replying to [comment:14 lukeplant]:

> It is basically following the existing code in `get_query_set()`, which
> may be incorrect itself, but it is also sending, as hints, instances
> that are not of the same type as the model given. The docs suggest this
> might be OK:
>
> https://docs.djangoproject.com/en/dev/topics/db/multi-db/#topics-db-multi-db-hints
>
> However, since there are many instances here, it is probably better not
> to send the 'instance' hint at all. Thank you for forcing the issue.
>
> I don't think we should add multi-db tests unless there are specific
> multi-db issues we need to address.

I also don't use multi-db at all, so I would be stabbing in the dark
trying to find issues there. I think somebody who uses the multi-db
feature should comment on this. Two people stabbing in the dark is not
better than one person doing that. Come to think of it, it might be
worse... :)

The first instance as a hint is probably somewhat useful, as you can get
the db used in the query from it. I guess you want to get that from
somewhere.

>> [About too-many-parameters in SQL.]

> I think we can punt on this one. There are many cases where you can end
> up with more than 1000 parameters in a single query. If your DB can't
> do that, and you need to, don't use prefetch_related or get a better
> DB. If there is the demand, it should be possible to add a fix for this
> as an enhancement.

Agreed. I checked the other core supported databases, and there don't
seem to be any practical limits.

>> - I don't like the usage of queryset len() to force the evaluation of
>> the queryset. Better to have an explicit method (`.evaluate()` or
>> something) and call it than rely on the implicit evaluation when
>> len() is called. Maybe another ticket for this?

> The internals of `QuerySet` at this point are very delicate, and
> optimized.
> `__iter__`, `__nonzero__`, `__len__` etc. all depend heavily on the
> internals of each other in order to achieve the optimization required,
> and we have tests that ensure things are as we expect. I am happy to
> use the same method here, and I don't think that an `evaluate()` method
> would actually add clarity while preserving the optimisations we've
> done, because it is not as simple as 'evaluate' - what I mean is 'fill
> the internal cache', and I know that `len()` has exactly that
> behaviour.

OK, this is then surely food for another ticket, if even that. I have to
add this to my list of queryset annoyances; it already contains
`.clone()`, which does a lot more than cloning...

>> - The current method of select_matching_instances can be slow. The
>> method used here is effectively a nested loop, while using a dict
>> would be considerably faster in many cases. The current method is
>> O(nm), while a dict would give O(n + m), at a higher memory cost.
>> Currently, fetching 1000 objects and prefetching 5 related objects
>> for each takes 3 seconds on my laptop. Just fetching the objects in
>> two different querysets takes 0.2 seconds, so the joining takes about
>> 2.8 seconds. At the end of this post is a profile of this run.

> Nice catch, and thanks for the figures. I'll see what I can do about
> this, unless you want to have a go.

I can take a stab at this; I have done the hash-join-in-Python too many
times already...

>> - In models/query.py around line 1646 there is this snippet:
>>
>> {{{
>> for obj in instances:
>>     qs = getattr(obj, attname).all()
>> }}}
>>
>> I am afraid this is somewhat expensive, as it will result in the
>> following code being executed for each obj:
>>
>> {{{
>> return super(RelatedManager, self).get_query_set().using(db).filter(**(self.core_filters))
>> }}}
>>
>> That is one QuerySet construction and two clones for each object.
>> This is one good reason not to use relatedmanager.all() for the
>> prefetch cache... :)
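[Aside on the `len()` point above: the 'fill the internal cache' behaviour can be illustrated with a minimal pure-Python stand-in. This is only a sketch of the caching pattern being described, not Django's actual `QuerySet` code; the `LazySeq` class is hypothetical.]

```python
# Minimal stand-in showing why len() is a workable way to force
# evaluation: it fills an internal result cache that later iteration
# reuses. Hypothetical sketch, not Django's QuerySet implementation.
class LazySeq(object):
    def __init__(self, fetch):
        self._fetch = fetch          # callable standing in for the DB query
        self._result_cache = None    # filled on first evaluation

    def _fill_cache(self):
        if self._result_cache is None:
            self._result_cache = list(self._fetch())

    def __len__(self):
        self._fill_cache()
        return len(self._result_cache)

    def __iter__(self):
        self._fill_cache()
        return iter(self._result_cache)

queries = []
qs = LazySeq(lambda: queries.append("SELECT ...") or [1, 2, 3])
len(qs)          # forces evaluation; exactly one "query" runs
rows = list(qs)  # served from the cache, no second query
```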
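[For reference, the O(n + m) dict-based join discussed above could look roughly like the following sketch. The `hash_join` helper and the namedtuple stand-ins are hypothetical illustrations, not the actual select_matching_instances code in the patch.]

```python
from collections import namedtuple

# Hypothetical stand-ins for model instances; real code would use models.
Author = namedtuple("Author", "pk")
Book = namedtuple("Book", "title author_id")

def hash_join(instances, related, key_attr, fk_attr):
    # Bucket the related objects by foreign-key value in one pass (O(m)),
    # then look each instance up in the dict (O(n)), instead of the
    # O(n * m) nested loop.
    buckets = {}
    for rel in related:
        buckets.setdefault(getattr(rel, fk_attr), []).append(rel)
    return dict((getattr(obj, key_attr),
                 buckets.get(getattr(obj, key_attr), []))
                for obj in instances)

authors = [Author(pk=1), Author(pk=2)]
books = [Book("First", 1), Book("Second", 1), Book("Third", 2)]
prefetched = hash_join(authors, books, "pk", "author_id")
```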
> The code is exactly the same as what you would otherwise execute: you
> will be doing `[foo.bars.all() for foo in foos]` at some point
> (otherwise why use prefetch_related?), and here we are just doing that
> up front. The next time you call `all()` you get the pre-created
> version. So I don't think trying to optimise here is necessary. If we
> find a shortcut which doesn't significantly reduce the clarity, fine,
> but it can wait.

True. But this whole ticket is about optimizing the mentioned code path.
On the other hand, queryset creation takes about as much time as model
initialization in the attached profile. Making queryset clone faster
would be the key here, and I might have an idea or two about making that
faster... In any case, this can be handled later if the need arises.

I will try to get the hash-join patch done in an hour or two.

IMHO it would be good to send a summary of the API extensibility
discussion to django-developers and leave it there for a couple of days
so that others can have a say. I think the current resolution is good
enough, but in this case more eyes mean better results.

Otherwise, close to commit ready :)

--
Ticket URL: <https://code.djangoproject.com/ticket/16937#comment:15>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

--
You received this message because you are subscribed to the Google
Groups "Django updates" group.
To post to this group, send email to django-updates@googlegroups.com.
To unsubscribe from this group, send email to
django-updates+unsubscr...@googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/django-updates?hl=en.