Re: QuerySet: "if obj in queryset" can be very slow

2011-11-03 Thread Tom Evans
On Wed, Nov 2, 2011 at 6:26 PM, Ian Clelland  wrote:
> Just looking at the source to QuerySet (finally), and it looks like the
> __contains__ method actually does something different than this: It
> evaluates the whole QuerySet in bulk at the database level, and starts
> creating model instances based on that, but only until it finds a matching
> one. So, after running "if obj in qs", you might end up with one object
> created, or you might end up with 70M objects, or anywhere in between.
> Again: odd, undocumented, and potentially surprising behaviour, and I'd
> recommend explicit over implicit, especially in this case.
>

I disagree. All of this behaviour is documented in my mind:

It is documented here when querysets are evaluated:

https://docs.djangoproject.com/en/1.3/ref/models/querysets/#when-querysets-are-evaluated

It's documented here that iterating through a queryset will cache the
objects created:

https://docs.djangoproject.com/en/1.3/ref/models/querysets/#iterator

It's documented here that testing for results is slightly quicker than
evaluating and testing the queryset:

https://docs.djangoproject.com/en/1.3/ref/models/querysets/#exists

It's documented here that caching is optimistic - it caches the
results that have been requested:

https://docs.djangoproject.com/en/1.3/topics/db/queries/#caching-and-querysets

Do we really need to specifically point out that the __contains__
method iterates through the queryset and stops when it finds an
answer?

Cheers

Tom

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.



Re: QuerySet: "if obj in queryset" can be very slow

2011-11-02 Thread Thomas Guettler
Hi,

I created a ticket:

https://code.djangoproject.com/ticket/17156

{{{
The documentation should explain what happens in this code:

if obj in queryset:


If queryset is huge, the above code can be very slow.

If you want to run the code in the database, you should use this:

if queryset.filter(pk=obj.pk).exists():


Related discussion:  
http://groups.google.com/group/django-users/browse_thread/thread/5a8196ff1086f4ae/530548a127a6e3ca

Can someone please provide a patch for the documentation. I am not a native 
speaker, and won't find the right words.
}}}


-- 
Thomas Guettler, http://www.thomas-guettler.de/
E-Mail: guettli (*) thomas-guettler + de

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.



Re: QuerySet: "if obj in queryset" can be very slow

2011-11-02 Thread Ian Clelland
On Wed, Nov 2, 2011 at 10:46 AM, Tom Evans  wrote:

> OK, take this example. I have a django model table with 70 million
> rows in it. Doing any kind of query on this table is slow, and
> typically the query is date restrained - which mysql will use as the
> optimum key, meaning any further filtering is a table scan on the
> filtered rows.
>
> Pulling a large query (say, all logins in a month, ~1 million rows)
> takes only a few seconds longer than counting the number of rows the
> query would find - after all, the database still has to do precisely
> the same amount of work, it just doesn't have to deliver the data.
>
> Say I have a n entries I want to test are in that resultset, and I
> also want to iterate through the list, calculating some data and
> printing out the row, I can do the existence tests either in python or
> in the database. If I do it in the database, I have n+1 expensive
> queries to perform. If I do it in python, I have 1 expensive query to
> perform, and (worst case) n+1 full scans of the data retrieved (and I
> avoid locking the table for n+1 expensive queries).
>
> Depending on the size of the data set, as the developer I have the
> choice of which will be more appropriate for my needs. Sometimes I
> need "if qs.filter(pk=obj.pk).exists()", sometimes I need "if obj in
> qs".
>
>
Just looking at the source to QuerySet (finally), and it looks like the
__contains__ method actually does something different than this: It
evaluates the whole QuerySet in bulk at the database level, and starts
creating model instances based on that, but only until it finds a matching
one. So, after running "if obj in qs", you might end up with one object
created, or you might end up with 70M objects, or anywhere in between.

Again: odd, undocumented, and potentially surprising behaviour, and I'd
recommend explicit over implicit, especially in this case.



-- 
Regards,
Ian Clelland


-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.



Re: QuerySet: "if obj in queryset" can be very slow

2011-11-02 Thread Ian Clelland
On Wed, Nov 2, 2011 at 10:46 AM, Tom Evans  wrote:

> On Wed, Nov 2, 2011 at 5:30 PM, Ian Clelland  wrote:
> > On Wed, Nov 2, 2011 at 8:25 AM, Thomas Guettler  wrote:
> >>
> >> # This is my current solution
> >> if get_special_objects().filter(pk=obj.pk).count():
> >># yes, it is special
> >>
> >
> > I can't speak to the "why" of this situation; it seems to me that this
> could
> > always be converted into a more efficient database query without any
> > unexpected side-effects (and if I really wanted the side effects, I would
> > just write "if obj in list(qs)" instead). In this case, though, I would
> > usually write something like this:
> > if get_special_objects().filter(pk=obj.pk).exists():
> ># yes, it is special
> > I believe that in some cases, the exists() query can be optimized to
> return
> > faster than a count() aggregation, and I think that the intent of the
> code
> > appears more clearly.
> > Ian
>
> OK, take this example. I have a django model table with 70 million
> rows in it. Doing any kind of query on this table is slow, and
> typically the query is date restrained - which mysql will use as the
> optimum key, meaning any further filtering is a table scan on the
> filtered rows.
>
> Pulling a large query (say, all logins in a month, ~1 million rows)
> takes only a few seconds longer than counting the number of rows the
> query would find - after all, the database still has to do precisely
> the same amount of work, it just doesn't have to deliver the data.
>
> Say I have a n entries I want to test are in that resultset, and I
> also want to iterate through the list, calculating some data and
> printing out the row, I can do the existence tests either in python or
> in the database. If I do it in the database, I have n+1 expensive
> queries to perform. If I do it in python, I have 1 expensive query to
> perform, and (worst case) n+1 full scans of the data retrieved (and I
> avoid locking the table for n+1 expensive queries).
>
> Depending on the size of the data set, as the developer I have the
> choice of which will be more appropriate for my needs. Sometimes I
> need "if qs.filter(pk=obj.pk).exists()", sometimes I need "if obj in
> qs".
>

I agree that there are situations where you want, or need, to pull the data
in to Python for processing, to avoid a lot of database overhead. That's
why we have select_related, as well: sometimes you really do need to just
grab as much as possible all at once.

The trouble is that querysets are *supposed* to be lazy; just evaluating as
much as necessary, as late as possible, to do the job. I think that this
behaviour violates the principle of least surprise, by instantiating a
(potentially very large) queryset as a side-effect of a simple inclusion
test.

Any other time that you want a queryset instantiated, the idiomatic way to
do it is to construct a Python list based on it:

# Get all objects at once from database
objs = list(qs)
# Now use that list multiple times in a method

or

for obj in list(qs):
  # qs is evaluated once, list members may be manipulated as needed in
Python

or, by extension,

if obj in list(qs):
  # stuff

I wouldn't rely on the behaviour of the in operator to evaluate the
queryset for me; it doesn't look right to me, it's not obvious to anyone
else looking at the code, and I don't think it's documented behaviour.

I would prefer that in did an exists query, but since there are explicit
ways to force either behavior, in practise I use one of those explicit
ways, rather than leave the code looking ambiguous.

-- 
Regards,
Ian Clelland


-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.



Re: QuerySet: "if obj in queryset" can be very slow

2011-11-02 Thread Furbee
Thanks Tom, that's a great explanation!

Furbeenator

On Wed, Nov 2, 2011 at 10:46 AM, Tom Evans  wrote:

> On Wed, Nov 2, 2011 at 5:30 PM, Ian Clelland  wrote:
> > On Wed, Nov 2, 2011 at 8:25 AM, Thomas Guettler  wrote:
> >>
> >> # This is my current solution
> >> if get_special_objects().filter(pk=obj.pk).count():
> >># yes, it is special
> >>
> >
> > I can't speak to the "why" of this situation; it seems to me that this
> could
> > always be converted into a more efficient database query without any
> > unexpected side-effects (and if I really wanted the side effects, I would
> > just write "if obj in list(qs)" instead). In this case, though, I would
> > usually write something like this:
> > if get_special_objects().filter(pk=obj.pk).exists():
> ># yes, it is special
> > I believe that in some cases, the exists() query can be optimized to
> return
> > faster than a count() aggregation, and I think that the intent of the
> code
> > appears more clearly.
> > Ian
>
> OK, take this example. I have a django model table with 70 million
> rows in it. Doing any kind of query on this table is slow, and
> typically the query is date restrained - which mysql will use as the
> optimum key, meaning any further filtering is a table scan on the
> filtered rows.
>
> Pulling a large query (say, all logins in a month, ~1 million rows)
> takes only a few seconds longer than counting the number of rows the
> query would find - after all, the database still has to do precisely
> the same amount of work, it just doesn't have to deliver the data.
>
> Say I have a n entries I want to test are in that resultset, and I
> also want to iterate through the list, calculating some data and
> printing out the row, I can do the existence tests either in python or
> in the database. If I do it in the database, I have n+1 expensive
> queries to perform. If I do it in python, I have 1 expensive query to
> perform, and (worst case) n+1 full scans of the data retrieved (and I
> avoid locking the table for n+1 expensive queries).
>
> Depending on the size of the data set, as the developer I have the
> choice of which will be more appropriate for my needs. Sometimes I
> need "if qs.filter(pk=obj.pk).exists()", sometimes I need "if obj in
> qs".
>
> Cheers
>
> Tom
>
> --
> You received this message because you are subscribed to the Google Groups
> "Django users" group.
> To post to this group, send email to django-users@googlegroups.com.
> To unsubscribe from this group, send email to
> django-users+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/django-users?hl=en.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.



Re: QuerySet: "if obj in queryset" can be very slow

2011-11-02 Thread Tom Evans
On Wed, Nov 2, 2011 at 5:30 PM, Ian Clelland  wrote:
> On Wed, Nov 2, 2011 at 8:25 AM, Thomas Guettler  wrote:
>>
>> # This is my current solution
>> if get_special_objects().filter(pk=obj.pk).count():
>>    # yes, it is special
>>
>
> I can't speak to the "why" of this situation; it seems to me that this could
> always be converted into a more efficient database query without any
> unexpected side-effects (and if I really wanted the side effects, I would
> just write "if obj in list(qs)" instead). In this case, though, I would
> usually write something like this:
> if get_special_objects().filter(pk=obj.pk).exists():
>    # yes, it is special
> I believe that in some cases, the exists() query can be optimized to return
> faster than a count() aggregation, and I think that the intent of the code
> appears more clearly.
> Ian

OK, take this example. I have a django model table with 70 million
rows in it. Doing any kind of query on this table is slow, and
typically the query is date restrained - which mysql will use as the
optimum key, meaning any further filtering is a table scan on the
filtered rows.

Pulling a large query (say, all logins in a month, ~1 million rows)
takes only a few seconds longer than counting the number of rows the
query would find - after all, the database still has to do precisely
the same amount of work, it just doesn't have to deliver the data.

Say I have a n entries I want to test are in that resultset, and I
also want to iterate through the list, calculating some data and
printing out the row, I can do the existence tests either in python or
in the database. If I do it in the database, I have n+1 expensive
queries to perform. If I do it in python, I have 1 expensive query to
perform, and (worst case) n+1 full scans of the data retrieved (and I
avoid locking the table for n+1 expensive queries).

Depending on the size of the data set, as the developer I have the
choice of which will be more appropriate for my needs. Sometimes I
need "if qs.filter(pk=obj.pk).exists()", sometimes I need "if obj in
qs".

Cheers

Tom

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.



Re: QuerySet: "if obj in queryset" can be very slow

2011-11-02 Thread Ian Clelland
On Wed, Nov 2, 2011 at 8:25 AM, Thomas Guettler  wrote:

> # This is my current solution
> if get_special_objects().filter(pk=obj.pk).count():
># yes, it is special
>
>
I can't speak to the "why" of this situation; it seems to me that this
could always be converted into a more efficient database query without any
unexpected side-effects (and if I really wanted the side effects, I would
just write "if obj in list(qs)" instead). In this case, though, I would
usually write something like this:

if get_special_objects().filter(pk=obj.pk).*exists*():
   # yes, it is special

I believe that in some cases, the exists() query can be optimized to return
faster than a count() aggregation, and I think that the intent of the code
appears more clearly.

Ian

-- 
Regards,
Ian Clelland


-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.



Re: QuerySet: "if obj in queryset" can be very slow

2011-11-02 Thread Tom Evans
On Wed, Nov 2, 2011 at 3:25 PM, Thomas Guettler  wrote:
> Here is a better example:
>
> def get_special_objects():
>    # Some are special. But the result can still be huge!
>    return MyModel.objects.filter()
>
> obj=MyModel.objects.get()
>
> # is this object "special?"
> if obj in get_special_objects(): # This is very slow if there are many rows 
> in the result.
>    # yes, it is special
>
> # This is my current solution
> if get_special_objects().filter(pk=obj.pk).count():
>    # yes, it is special
>
>
> Up to now I sometimes used "if obj in queryset" but never realized, that
> this evaluates the whole queryset. Up to now I thought this is lazy.
>
> I have no big problem with this, since I found a solution.
>
> Is there a reason why "if obj in queryset" is executed in python code, and not
> in the DB?
>
>  Thomas
>
>

Choice and consistency.

You've picked an edge case where it does make more sense to query the
database, however I could contrive any number of examples where it
would be wrong to filter in the database, for example if I had a long
list of objects I wanted to test are in the queryset AND I wanted the
items in the queryset constructed anyway, I'd use the 'in' operator.
This is the choice.

For consistency, querying the database is performed by methods on a
model's manager class or related queryset, where as python functions
operating on a queryset do not perform database queries. Eg, len(qs)
evaluates the queryset and counts the number of instances in the
queryset using python, where as qs.count() performs a DB query to
obtain the result. Neither is wrong, they are appropriate in different
cases.

Cheers

Tom

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.



Re: QuerySet: "if obj in queryset" can be very slow

2011-11-02 Thread Thomas Guettler
Here is a better example:

def get_special_objects():
# Some are special. But the result can still be huge!
return MyModel.objects.filter()

obj=MyModel.objects.get()

# is this object "special?"
if obj in get_special_objects(): # This is very slow if there are many rows in 
the result.
# yes, it is special

# This is my current solution
if get_special_objects().filter(pk=obj.pk).count():
# yes, it is special


Up to now I sometimes used "if obj in queryset" but never realized, that
this evaluates the whole queryset. Up to now I thought this is lazy.

I have no big problem with this, since I found a solution.

Is there a reason why "if obj in queryset" is executed in python code, and not
in the DB?

  Thomas



Am 02.11.2011 12:42, schrieb Thomas Guettler:
> Hi,
> 
> I just discovered, that "if obj in queryset" can be very slow:
> 
> queryset=MyModel.objects.all()
> obj=MyModel.objects.get(id=319526)
> 
> #if obj in queryset: # This is very slow, if queryset is big: All lines from 
> queryset get created to Python objects
> #print 'in'
> 
> if queryset.filter(id=obj.id): # Fast: Check is done inside DB.
> print 'in'
> 
> What is the best way to do "if obj in queryset"?
> 
>   Thomas
> 

-- 
Thomas Guettler, http://www.thomas-guettler.de/
E-Mail: guettli (*) thomas-guettler + de

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.



Re: QuerySet: "if obj in queryset" can be very slow

2011-11-02 Thread Jirka Vejrazka
> queryset=MyModel.objects.all()
> obj=MyModel.objects.get(id=319526)

Thomas,

  if the second line works for you (i.e. does not raise
MyModel.DoesNotExist exception), then obj is in queryset by
definition.

  What are you trying to achieve?

   Jirka

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.



Re: QuerySet: "if obj in queryset" can be very slow

2011-11-02 Thread Flavia Missi
That's because your queryset is being evaluated when you compare, maybe if
you explain your problem, we can give you a better solution than the use of
`in`.

Take a look at the docs about when querysets are evaluated:

https://docs.djangoproject.com/en/dev/ref/models/querysets/#when-querysets-are-evaluated

[]'s

On Wed, Nov 2, 2011 at 9:42 AM, Thomas Guettler  wrote:

> Hi,
>
> I just discovered, that "if obj in queryset" can be very slow:
>
> queryset=MyModel.objects.all()
> obj=MyModel.objects.get(id=319526)
>
> #if obj in queryset: # This is very slow, if queryset is big: All lines
> from queryset get created to Python objects
> #print 'in'
>
> if queryset.filter(id=obj.id): # Fast: Check is done inside DB.
>print 'in'
>
> What is the best way to do "if obj in queryset"?
>
>  Thomas
>
> --
> Thomas Guettler, http://www.thomas-guettler.de/
> E-Mail: guettli (*) thomas-guettler + de
>
> --
> You received this message because you are subscribed to the Google Groups
> "Django users" group.
> To post to this group, send email to django-users@googlegroups.com.
> To unsubscribe from this group, send email to
> django-users+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/django-users?hl=en.
>
>


-- 
Flávia Missi
@flaviamissi 
flaviamissi.com.br
https://github.com/flaviamissi

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.



QuerySet: "if obj in queryset" can be very slow

2011-11-02 Thread Thomas Guettler
Hi,

I just discovered, that "if obj in queryset" can be very slow:

queryset=MyModel.objects.all()
obj=MyModel.objects.get(id=319526)

#if obj in queryset: # This is very slow, if queryset is big: All lines from 
queryset get created to Python objects
#print 'in'

if queryset.filter(id=obj.id): # Fast: Check is done inside DB.
print 'in'

What is the best way to do "if obj in queryset"?

  Thomas

-- 
Thomas Guettler, http://www.thomas-guettler.de/
E-Mail: guettli (*) thomas-guettler + de

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.