Re: Complex query reduction
On Fri, Nov 8, 2013 at 1:44 AM, Robin St.Clairwrote: > The last time I checked the use of IN, all the records from the database in > the query were brought back to the workstation, rather than being processed > on the backend and only the results returned to the workstation. Django ORM's __in operator tries to use SQL facilities, but depending on the type of object passed, it could read the list of options and pass it explicitly, or it could merge two SQL commands. always check the type of SQL commands generated! (debug toolbar is your friend) i haven't seen it doing a post-retrieve check in python, thought. -- Javier -- You received this message because you are subscribed to the Google Groups "Django users" group. To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscr...@googlegroups.com. To post to this group, send email to django-users@googlegroups.com. Visit this group at http://groups.google.com/group/django-users. To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/CAFkDaoRLfeLi%3DWVu5f%3DcyCXRH0x5mr1fbDiV0OzW%2Bn-fpk6uHw%40mail.gmail.com. For more options, visit https://groups.google.com/groups/opt_out.
Re: Complex query reduction
I am pretty sure the IN() performance issue in MySQL was fixed 5.5, and 5.6/5.7 certainly don't have an issue with IN() whether you use a bunch of values or a subquery. Cheers François On Nov 8, 2013, at 2:15 AM, akaariaiwrote: > On Friday, November 8, 2013 8:44:09 AM UTC+2, Robin St.Clair wrote: > Anssi > > The last time I checked the use of IN, all the records from the database in > the query were brought back to the workstation, rather than being processed > on the backend and only the results returned to the workstation. > > Have there been changes that carry out the entire query on the backend? What > has changed to cause you to prefer the use of the IN statement? > > Django has had the ability to execute __in=qs in single query for a somewhat > long time (from Django 1.2 maybe?). > > It is true that __in lookup against a large list of values is often a bad > choice even if you have the values at hand. Unfortunately this doesn't apply > to all databases, for example older versions of MySQL do not handle > subqueries well. > > - Anssi > > -- > You received this message because you are subscribed to the Google Groups > "Django users" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to django-users+unsubscr...@googlegroups.com. > To post to this group, send email to django-users@googlegroups.com. > Visit this group at http://groups.google.com/group/django-users. > To view this discussion on the web visit > https://groups.google.com/d/msgid/django-users/f3428fda-e4be-4d75-8626-a95bd20f66c8%40googlegroups.com. > For more options, visit https://groups.google.com/groups/opt_out. signature.asc Description: Message signed with OpenPGP using GPGMail
Re: Complex query reduction
On Friday, November 8, 2013 8:44:09 AM UTC+2, Robin St.Clair wrote: > > Anssi > > The last time I checked the use of IN, all the records from the database > in the query were brought back to the workstation, rather than being > processed on the backend and only the results returned to the workstation. > > Have there been changes that carry out the entire query on the backend? > What has changed to cause you to prefer the use of the IN statement? > Django has had the ability to execute __in=qs in single query for a somewhat long time (from Django 1.2 maybe?). It is true that __in lookup against a large list of values is often a bad choice even if you have the values at hand. Unfortunately this doesn't apply to all databases, for example older versions of MySQL do not handle subqueries well. - Anssi -- You received this message because you are subscribed to the Google Groups "Django users" group. To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscr...@googlegroups.com. To post to this group, send email to django-users@googlegroups.com. Visit this group at http://groups.google.com/group/django-users. To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/f3428fda-e4be-4d75-8626-a95bd20f66c8%40googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.
Re: Complex query reduction
Anssi The last time I checked the use of IN, all the records from the database in the query were brought back to the workstation, rather than being processed on the backend and only the results returned to the workstation. Have there been changes that carry out the entire query on the backend? What has changed to cause you to prefer the use of the IN statement? R+C On 08/11/2013 05:55, akaariai wrote: On Sunday, November 3, 2013 1:48:07 PM UTC+2, Robin St.Clair wrote: *IN* * if using Django avoid the IN operation at all costs If there are potentially more than 15 items in the list, rework the IN as a JOIN against whatever the source of the keys is I don't necessarily agree with everything else said in the post, but this one is just plain wrong. It is completely OK to use __in=queryset. In fact, it is recommended in cases where the alternate is using join + distinct. - Anssi -- You received this message because you are subscribed to the Google Groups "Django users" group. To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscr...@googlegroups.com. To post to this group, send email to django-users@googlegroups.com. Visit this group at http://groups.google.com/group/django-users. To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/6b5f9e3a-2e41-47b5-b0d6-94473fe323b8%40googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out. -- You received this message because you are subscribed to the Google Groups "Django users" group. To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscr...@googlegroups.com. To post to this group, send email to django-users@googlegroups.com. Visit this group at http://groups.google.com/group/django-users. To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/BLU0-SMTP357C2324CC6BEC3F975EB4AE2F20%40phx.gbl. For more options, visit https://groups.google.com/groups/opt_out.
Re: Complex query reduction
On Sunday, November 3, 2013 1:48:07 PM UTC+2, Robin St.Clair wrote: > > *IN* > >- if using Django avoid the IN operation at all costs > > > If there are potentially more than 15 items in the list, rework the IN as > a JOIN against whatever the source of the keys is > I don't necessarily agree with everything else said in the post, but this one is just plain wrong. It is completely OK to use __in=queryset. In fact, it is recommended in cases where the alternate is using join + distinct. - Anssi -- You received this message because you are subscribed to the Google Groups "Django users" group. To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscr...@googlegroups.com. To post to this group, send email to django-users@googlegroups.com. Visit this group at http://groups.google.com/group/django-users. To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/6b5f9e3a-2e41-47b5-b0d6-94473fe323b8%40googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.
Re: Complex query reduction
On Sat, Nov 2, 2013 at 4:50 PM, Daniele Procidawrote: > > But, the real killer is the combination of ordering (in the queryset or on > the model, it doesn't matter) with the distinct() - as soon as one is removed > from the equation, the execution time drops to around 250ms. > > That's for 55000 BibliographicRecords created by that last operation (before > distinct() is applied; distinct() reduces them to 28000). Do you happen to use PostgreSQL? This could be a case of a "non-optimal" configuration, that makes PostgreSQL use the disk to do the sorting. Take a look at http://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-WORK-MEM. Apostolis -- You received this message because you are subscribed to the Google Groups "Django users" group. To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscr...@googlegroups.com. To post to this group, send email to django-users@googlegroups.com. Visit this group at http://groups.google.com/group/django-users. To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/CAEa3b%2BoCFp3oPwtcR-uyoERW3YwHR4aXLubp7WUpa%2BHtYNhvmw%40mail.gmail.com. For more options, visit https://groups.google.com/groups/opt_out.
RE: Complex query reduction
Hi You are running into typical production Django problems - works fine in testing, terrible in production. Firstly If you use attributed many-to-manys, create the table outside of Django and use raw SQL.Create a primary key on the two things being related (usually larger table first). Multi-column indicesTwo choicesA single index that includes both (do this outside Django)Two indices, one for each columnThe size will be roughly equivalent between the two Single composite indexsmaller than two indexes.faster to update.accelerates a more restricted range of queries.An index on (A, B) does nothing for queries just using BBut on the queries it does accelerate, it will be faster. Two indiceslargerslower to updatewill help on queries on both columns, but not as much as a single composite indexwill help accelerate queries on either of the columnscan be expressed directly in Django Helpful Toolpg_stat_activity shows a wealth of useful information - including which indexes are actually being used If an index is not being used (or not being used very often) drop itif you are surprised that it is not being used, find out why Basic Performance Rules1. Do not iterate over QuerySets2. If you think you have to iterate over a QuerySet, see rule #13. If you are absolutely, certainly, 100% positive that the only possible solution to your problem is iterating over a QuerySet, see rule #3 Iteration - A Health Warning• Ignores Django’s lazy-evaluation mechanism and copies everything into local memory• Copies data from the database just to process it locally• Does filtration or summation in the application that is processed more efficiently in the database.• Databases are good at this sort of stuff let the DB do it Alternatives to Iteration• QuerySet.update()• cursor.execute(“UPDATE reader_hours ...”)• Stored procedures Take in the code below How much many objects are in memory at point A? qs = Orders.objects.all() # There are about 2,500,000 rows in “orders” for order in qs:order.age_in_days += 1 # POINT A order.save() Answer - 2,500,000 Why Is This? Django does lazy evaluation… (everyone tells me so!)The Django code carefully asks for a slice of 100 objectswhich trickles down through lots of really convoluted Python to psycopg2which dutifully asks for 100 rows from Postgreswhich sends all 2,500,000 over the wire Solution - Named CursorsThe protocol between the Postgres client and server only does partial sends when using named cursorspsycopg2 fully supports named cursorsDjango doesn’t use named cursorsTherefore, the first time you ask for any object in a QuerySet, you get all of themThis is a very good reason not to ask for large result sets INif using Django avoid the IN operation at all costs If there are potentially more than 15 items in the list, rework the IN as a JOIN against whatever the source of the keys is At some stage you are going to have to get to grips with SQL, might as well be sooner than later R+C PS some time ago one of my clients was only able to process 4 transactons a second against their very complicated DB. They needed to process 70+ transactions per second. After 3 days of code analysis we had the transaction rate up to 240 per second and so could get rid of the 6 week backlog that had developed. Front end coders didn't consider the DBMS to be importantDate: Sun, 3 Nov 2013 02:37:09 -0800 From: akaar...@gmail.com To: django-users@googlegroups.com Subject: Re: Complex query reduction You should rewrite the query into a form that doesn't require distinct. In general, when you see a query that has joins and DISTINCT, that should be an alarm bell that something isn't written correctly in the query. Unfortunately Django's ORM generates such queries, and that isn't easy to fix as there are databases that like DISTINCT more than rewriting the query to use subqueries. In any case, you should check how to write the query without the need of distinct. Something like this should work: self.items = BibliographicRecord.objects.listable_objects().filter( authored__researcher__in=researchers, ) self.items = BibliographicRecord.objects.listable_objects().filter( pk__in=self.items.values_list('pk') ) But maybe you can push the __in to deeper into the authored__researches lookup... - Anssi On Saturday, November 2, 2013 4:50:42 PM UTC+2, Daniele Procida wrote:On Fri, Nov 1, 2013, Javier Guerra Giraldez <jav...@guerrag.com> wrote: >have you tried eliminating the second IN relationship? something like > >entities = entity.get_descendants() > >items = BibliographicRecord.objects.filter >(authored__researcher__person__member_of__entity__in=entities).distinct() Indeed I have, but in that form it takes around 1770ms, compared to around 1540ms in the original form. What I actually do is: # breaking apart the queries allows the use of values_lists entities = self.entity.get_descendants(
Re: Complex query reduction
You should rewrite the query into a form that doesn't require distinct. In general, when you see a query that has joins and DISTINCT, that should be an alarm bell that something isn't written correctly in the query. Unfortunately Django's ORM generates such queries, and that isn't easy to fix as there are databases that like DISTINCT more than rewriting the query to use subqueries. In any case, you should check how to write the query without the need of distinct. Something like this should work: self.items = BibliographicRecord.objects. listable_objects().filter( authored__researcher__in=researchers, ) self.items = BibliographicRecord.objects. listable_objects().filter( pk__in=self.items.values_list('pk') ) But maybe you can push the __in to deeper into the authored__researches lookup... - Anssi On Saturday, November 2, 2013 4:50:42 PM UTC+2, Daniele Procida wrote: > > On Fri, Nov 1, 2013, Javier Guerra Giraldez> wrote: > > >have you tried eliminating the second IN relationship? something like > > > >entities = entity.get_descendants() > > > >items = BibliographicRecord.objects.filter > >(authored__researcher__person__member_of__entity__in=entities).distinct() > > Indeed I have, but in that form it takes around 1770ms, compared to around > 1540ms in the original form. What I actually do is: > > # breaking apart the queries allows the use of values_lists > entities = self.entity.get_descendants( > include_self=True > ).values_list('id', flat=True) > > # and the set() here is about 230ms faster than putting a distinct() > on > # the first query > researchers = set(Researcher.objects.filter( > person__entities__in=entities > ).values_list('person', flat=True)) > > self.items = BibliographicRecord.objects.listable_objects().filter( > authored__researcher__in=researchers, > ).distinct() > > I think that's partly because this way the SELECT doesn't have to grab all > the fields of publications_bibliographicrecord. > > But, the real killer is the combination of ordering (in the queryset or on > the model, it doesn't matter) with the distinct() - as soon as one is > removed from the equation, the execution time drops to around 250ms. > > That's for 55000 BibliographicRecords created by that last operation > (before distinct() is applied; distinct() reduces them to 28000). > > That seems excessive to me. > > BibliographicRecord has a custom primary key, and its id fields look like > "d9ce7e2f-663e-4fc6-8448-b214c6915aed:web-of-science". Could that be > implicated in performance? > > Daniele > > -- You received this message because you are subscribed to the Google Groups "Django users" group. To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscr...@googlegroups.com. To post to this group, send email to django-users@googlegroups.com. Visit this group at http://groups.google.com/group/django-users. To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/8208cc22-cc94-4fe6-9245-709bdc42647f%40googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.
Re: Complex query reduction
On Fri, Nov 1, 2013, Javier Guerra Giraldezwrote: >have you tried eliminating the second IN relationship? something like > >entities = entity.get_descendants() > >items = BibliographicRecord.objects.filter >(authored__researcher__person__member_of__entity__in=entities).distinct() Indeed I have, but in that form it takes around 1770ms, compared to around 1540ms in the original form. What I actually do is: # breaking apart the queries allows the use of values_lists entities = self.entity.get_descendants( include_self=True ).values_list('id', flat=True) # and the set() here is about 230ms faster than putting a distinct() on # the first query researchers = set(Researcher.objects.filter( person__entities__in=entities ).values_list('person', flat=True)) self.items = BibliographicRecord.objects.listable_objects().filter( authored__researcher__in=researchers, ).distinct() I think that's partly because this way the SELECT doesn't have to grab all the fields of publications_bibliographicrecord. But, the real killer is the combination of ordering (in the queryset or on the model, it doesn't matter) with the distinct() - as soon as one is removed from the equation, the execution time drops to around 250ms. That's for 55000 BibliographicRecords created by that last operation (before distinct() is applied; distinct() reduces them to 28000). That seems excessive to me. BibliographicRecord has a custom primary key, and its id fields look like "d9ce7e2f-663e-4fc6-8448-b214c6915aed:web-of-science". Could that be implicated in performance? Daniele -- You received this message because you are subscribed to the Google Groups "Django users" group. To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscr...@googlegroups.com. To post to this group, send email to django-users@googlegroups.com. Visit this group at http://groups.google.com/group/django-users. To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/20131102145042.868061817%40smtp.modern-world.net. For more options, visit https://groups.google.com/groups/opt_out.
Re: Complex query reduction
On Fri, Nov 1, 2013 at 12:45 PM, Daniele Procidawrote: > In practice I use some tweaks (such as values_list) to speed this up, and > caching, but the fundamental pattern is the same. It's slow, because there > are 30 thousand BibliographicRecords. the total number of records shouldn't matter. more important is the number of selected records at some points in the query. i'd guess the number of chosen entities (those that _are_ descendant of the first one), and the number of chosen researches (those that _are_ members of those entities) should be the most significant quantities. the point is that most DB optimizers are somewhat shy to shuffle conditions around IN operators. but it might be easier to work on a small (a few tens of values?) IN condition than on a very long JOIN chain. > > I'm trying to rearrange the construction of the query in different ways to > find a speed improvement. I don't think that either prefetch_related or > select_related will help here, but perhaps there are some other tools that > would. have you tried eliminating the second IN relationship? something like entities = entity.get_descendants() items = BibliographicRecord.objects.filter(authored__researcher__person__member_of__entity__in=entities).distinct() that should turn most of the query in a long sequence of "INNER JOIN", giving the optimizer an easier job, avoiding the IN operation on research records. if the query chooses too many researches, this could be a big part of the slowness. -- Javier -- You received this message because you are subscribed to the Google Groups "Django users" group. To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscr...@googlegroups.com. To post to this group, send email to django-users@googlegroups.com. Visit this group at http://groups.google.com/group/django-users. To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/CAFkDaoSVg3K44wRUnOTrinqXZbVKm1ZZLHxFPB8LdgrzBtxViA%40mail.gmail.com. For more options, visit https://groups.google.com/groups/opt_out.