On Wed, Nov 2, 2011 at 10:46 AM, Tom Evans <tevans...@googlemail.com> wrote:

> On Wed, Nov 2, 2011 at 5:30 PM, Ian Clelland <clell...@gmail.com> wrote:
> > On Wed, Nov 2, 2011 at 8:25 AM, Thomas Guettler <h...@tbz-pariv.de> wrote:
> >>
> >> # This is my current solution
> >> if get_special_objects().filter(pk=obj.pk).count():
> >>    # yes, it is special
> >>
> >
> > I can't speak to the "why" of this situation; it seems to me that this
> > could always be converted into a more efficient database query without
> > any unexpected side-effects (and if I really wanted the side effects, I
> > would just write "if obj in list(qs)" instead). In this case, though, I
> > would usually write something like this:
> >
> > if get_special_objects().filter(pk=obj.pk).exists():
> >    # yes, it is special
> >
> > I believe that in some cases, the exists() query can be optimized to
> > return faster than a count() aggregation, and I think that the intent
> > of the code appears more clearly.
> >
> > Ian
>
> OK, take this example. I have a Django model table with 70 million
> rows in it. Doing any kind of query on this table is slow, and
> typically the query is date constrained - which MySQL will use as the
> optimum key, meaning any further filtering is a table scan over the
> rows selected by date.
>
> Pulling a large query (say, all logins in a month, ~1 million rows)
> takes only a few seconds longer than counting the number of rows the
> query would find - after all, the database still has to do precisely
> the same amount of work; it just doesn't have to deliver the data.
>
> Say I have n entries I want to test for membership in that resultset, and I
> also want to iterate through the list, calculating some data and
> printing out the row, I can do the existence tests either in python or
> in the database. If I do it in the database, I have n+1 expensive
> queries to perform. If I do it in python, I have 1 expensive query to
> perform, and (worst case) n+1 full scans of the data retrieved (and I
> avoid locking the table for n+1 expensive queries).
>
> Depending on the size of the data set, as the developer I have the
> choice of whichever approach is more appropriate for my needs. Sometimes
> I need "if qs.filter(pk=obj.pk).exists()", and sometimes I need "if obj
> in qs".
>

I agree that there are situations where you want, or need, to pull the data
into Python for processing, to avoid a lot of database overhead. That's why
we have select_related as well: sometimes you really do need to just grab
as much as possible all at once.
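
For instance (a minimal sketch - Entry, Author, and pub_date are
hypothetical names for illustration, not anything from this thread):

# One JOINed query up front, instead of one extra query per
# entry.author access inside the loop.
entries = Entry.objects.select_related('author').filter(pub_date__year=2011)
for entry in entries:
    print(entry.author.name)  # no additional query per row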

The trouble is that querysets are *supposed* to be lazy, evaluating only as
much as necessary, as late as possible, to do the job. I think that this
behaviour violates the principle of least surprise by instantiating a
(potentially very large) queryset as a side-effect of a simple inclusion
test.
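
You can see this concretely (a sketch, assuming DEBUG = True so that
connection.queries records executed SQL, and reusing the
get_special_objects() helper from earlier in the thread):

from django.db import connection, reset_queries

qs = get_special_objects()  # lazy: no SQL has been executed yet
reset_queries()
if obj in qs:               # the membership test evaluates the queryset
    pass
# The recorded statement is a full SELECT of the result set, not a
# one-row existence check like SELECT ... LIMIT 1.
print(connection.queries[-1]['sql'])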

Any other time that you want a queryset instantiated, the idiomatic way to
do it is to construct a Python list based on it:

# Get all objects at once from database
objs = list(qs)
# Now use that list multiple times in a method

or

for obj in list(qs):
  # qs is evaluated once; list members may be manipulated as needed in Python

or, by extension,

if obj in list(qs):
  # stuff

I wouldn't rely on the behaviour of the "in" operator to evaluate the
queryset for me; it doesn't look right, it's not obvious to anyone else
reading the code, and I don't think it's documented behaviour.

I would prefer that "in" did an exists() query, but since there are
explicit ways to force either behaviour, in practice I use one of them
rather than leave the code looking ambiguous.
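
For what it's worth, this is how I would write the two explicit versions
side by side (a sketch; handle_special() and candidates are hypothetical
stand-ins):

# One-off test: let the database answer with a cheap existence query.
if get_special_objects().filter(pk=obj.pk).exists():
    handle_special(obj)

# Repeated tests: pay for one query up front, then test membership
# against a plain Python set.
special_pks = set(get_special_objects().values_list('pk', flat=True))
for obj in candidates:
    if obj.pk in special_pks:
        handle_special(obj)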

-- 
Regards,
Ian Clelland
<clell...@gmail.com>
