Re: Performance when retrieving data from related models
Thanks again! On 19 March 2013 13:09, Peter of the Norse wrote: > [snip] > > We might put the onus on the client to call: > qs = getTendersForContractor(contractor) > for tender in list(qs.prefetch_related('project', 'project__contract', > 'project__contract')): > print tender, tender.getTotalCost() > > > If this is really what you are doing, then I strongly recommend > https://docs.djangoproject.com/en/dev/topics/db/aggregation/. This case > is a single SQL statement. > > Not only is this complicated, but this requires the client know how the > Tender is implemented (the prefetched would become uneccessary if we > changed the implementation to store the total cost on the actual Tender > itself at the cost of denormalised data). Note that you wouldn't have these > problems if the data existed in memory. > > These points are making me rethink my data model and API design, and that > the OO paradigm doesn't always fit well with web apps. In particular the > slides at > http://www.slideshare.net/OReillyOSCON/unbreaking-your-django-application: > > 23: "Django doesn’t have a good primitive for returning an object graph in > one call. - multiple db trips required" Implies that object graphs like > what I've designed is difficult to implement effeciently..? > > 24: "So, what do to? If you almost always return the parent along with > the related object, consider object inheritance instead. That does do a > nice join operation. Don’t create an insane class graph, though. If you > frequently use the related object without reference to the parent object, > use a foreign key." Not sure what 'object inheritance' is but might be > useful, and concerned that I might be heading towards an 'insane class > graph' (the real data model has 6 or 7 more related objects). Am I heading > down the wrong path of embracing an OO style API and data model? > > I don’t know. But I think you have to acknowledge that there is such a > thing as too OOP. I’m not sure that creating an API on top of Django is a > good idea. Django is already super high level. I would have to ask myself > how I expect my users to use what I give them. On some level, you’ll have > to trust your user to ask for the data they need. > > Getting the total cost isn't exactly what I was doing, I was trying to come up with a simple example of a function on the class that can't easily/cleanly be calculated by simply aggregating data from the related objects. With this I think I've hit an impedance mismatch problem, from http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx "Unfortunately, fields within the object are only part of the problem--the other problem we face is that objects are frequently associated with other objects, in various cardinalities (one-to-one, one-to-many, many-to-one, many-to-many), and an O/R mapping has to make some up-front decisions about when to retrieve these associated objects" Which is exactly the issue I'm having. As you've stated there isn't a straight forward solution, and I'll have to consider tradeoffs in my design decisions. Taras -- You received this message because you are subscribed to the Google Groups "Django users" group. To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscr...@googlegroups.com. To post to this group, send email to django-users@googlegroups.com. Visit this group at http://groups.google.com/group/django-users?hl=en. For more options, visit https://groups.google.com/groups/opt_out.
Re: Performance when retrieving data from related models
More inline stuff. On Mar 18, 2013, at 4:14 AM, Taras D wrote: > Hi Peter > > On 18 March 2013 16:35, Peter of the Norse wrote: > You’re over thinking this by a large margin. getTendersForContractor should > return a QuerySet. That way you can add things like .select_related as > necessary in the view. > > Also, I’m a bit worried about the fact you have a function called > getTendersForContractor. It really should be Tender.objects.forContractor. Or > if the function is just a single filter call, it can be eliminated entirely. > > To confirm, this would require a new Manager that implements the > 'forContractor' method? Yes. >> * Prefetch_related selects related obects via the criteria: 'in (list of >> ids'), which arguably can be slower > > What database are you using where this is true? > > It was the result of a discussion with a colleague. Also slides 67 & 68 at > http://www.slideshare.net/OReillyOSCON/unbreaking-your-django-application > imply that large IN's may cause performance problems - > "68: Large INs are a mess. Very expensive for the database to parse. Very > expensive for the database to execute. If there are potentially more than > 10-15 items in the list, rework the IN as a JOIN against whatever the source > of the keys is." That sounds like a good candidate for a bug ticket. It’s not trivial though. The limit of IDs in the query set would depend on the database, the complexity of the query, and the data being pulled. I would have to think about this for a long time before I felt comfortable making any kind of suggestion. >> * Seems brittle - if you change the data model, you have to change all of >> the queries in the data retrieval layer. Denomralising the data would reduce >> this, but we'd still have the problem of retrieving more data than the >> client may possibly want > Yes, if you change code in one place, you might have to change it someplace > else to match it. Most of us just use the ORM as the data retrieval layer. > Again, I’m worried you’re over thinking this. > > I hadn't really considered passing around a QuerySet in place of a list of > Tenders. One thing I'm finding is that the cost in retrieving related objects > is making me rethink the API presented to the client. For example, consider > the following API, which I think is quite reasonable: > > class Tender: > def getTotalCost(self): > return self.project.getTotalCost() > > class Project: > > def getTotalCost(self): > cost = 0 > for contract in self.contract_set.all(): > cost += constract.getTotalCost() > return cost > > similarly implementations for Project & Contract > > class Contract: > def getTotalCost(self): > cost = 0 > for job in self.job_set.all(): > cost += job.getTotalCost() > return cost https://docs.djangoproject.com/en/dev/topics/db/aggregation/ > Having a Project know how much its total cost is, and presenting this to a > client, is in line with OO principles of abstraction and encapsulation. > Consider a page that outputs the total cost of each contract tendered for a > particular contractor. If we re-use the 'getTendersForContractor' function > (that returns a QuerySet), and then call 'getTotalCost' on each Project that > has a tender, then we still have a problem in that we'll get a multitude of > calls to the database. > > We might put the onus on the client to call: > qs = getTendersForContractor(contractor) > for tender in list(qs.prefetch_related('project', 'project__contract', > 'project__contract')): > print tender, tender.getTotalCost() If this is really what you are doing, then I strongly recommend https://docs.djangoproject.com/en/dev/topics/db/aggregation/. This case is a single SQL statement. > Not only is this complicated, but this requires the client know how the > Tender is implemented (the prefetched would become uneccessary if we changed > the implementation to store the total cost on the actual Tender itself at the > cost of denormalised data). Note that you wouldn't have these problems if the > data existed in memory. > > These points are making me rethink my data model and API design, and that the > OO paradigm doesn't always fit well with web apps. In particular the slides > at http://www.slideshare.net/OReillyOSCON/unbreaking-your-django-application: > > 23: "Django doesn’t have a good primitive for returning an object graph in > one call. - multiple db trips required" Implies that object graphs like what > I've designed is difficult to implement effeciently..? > > 24: "So, what do to? If you almost always return the parent along with the > related object, consider object inheritance instead. That does do a nice > join operation. Don’t create an insane class graph, though. If you > frequently use the related object without reference to the parent object, use > a foreign key.
Re: Performance when retrieving data from related models
Hi Peter On 18 March 2013 16:35, Peter of the Norse wrote: > You’re over thinking this by a large margin. getTendersForContractor > should return a QuerySet. That way you can add things like .select_related > as necessary in the view. > > Also, I’m a bit worried about the fact you have a function called > getTendersForContractor. It really should be Tender.objects.forContractor. > Or if the function is just a single filter call, it can be eliminated > entirely. > To confirm, this would require a new Manager that implements the 'forContractor' method? > * Each prefetch_related hits the database once - a lot better than 3k > hits, but not as minimal as a sql join which joins all of the tables > > Actually, no. If you were to try and do the prefetch_related as a single > SQL join, you would return an insane number of rows. A single > prefetch_related wouldn't be a problem, but if you were to try and do a > select with more than one many field, they would multiply. If you have an > example of a join statement that doesn’t do that, we’d love to see it. > You're correct > > * Prefetch_related selects related obects via the criteria: 'in (list of > ids'), which arguably can be slower > > What database are you using where this is true? > It was the result of a discussion with a colleague. Also slides 67 & 68 at http://www.slideshare.net/OReillyOSCON/unbreaking-your-django-applicationimply that large IN's may cause performance problems - "68: Large INs are a mess. Very expensive for the database to parse. Very expensive for the database to execute. If there are potentially more than 10-15 items in the list, rework the IN as a JOIN against whatever the source of the keys is." > > * Seems brittle - if you change the data model, you have to change all of > the queries in the data retrieval layer. Denomralising the data would > reduce this, but we'd still have the problem of retrieving more data than > the client may possibly want > > Yes, if you change code in one place, you might have to change it > someplace else to match it. Most of us just use the ORM as the data > retrieval layer. Again, I’m worried you’re over thinking this. > I hadn't really considered passing around a QuerySet in place of a list of Tenders. One thing I'm finding is that the cost in retrieving related objects is making me rethink the API presented to the client. For example, consider the following API, which I think is quite reasonable: class Tender: def getTotalCost(self): return self.project.getTotalCost() class Project: def getTotalCost(self): cost = 0 for contract in self.contract_set.all(): cost += constract.getTotalCost() return cost similarly implementations for Project & Contract class Contract: def getTotalCost(self): cost = 0 for job in self.job_set.all(): cost += job.getTotalCost() return cost Having a Project know how much its total cost is, and presenting this to a client, is in line with OO principles of abstraction and encapsulation. Consider a page that outputs the total cost of each contract tendered for a particular contractor. If we re-use the 'getTendersForContractor' function (that returns a QuerySet), and then call 'getTotalCost' on each Project that has a tender, then we still have a problem in that we'll get a multitude of calls to the database. We might put the onus on the client to call: qs = getTendersForContractor(contractor) for tender in list(qs.prefetch_related('project', 'project__contract', 'project__contract')): print tender, tender.getTotalCost() Not only is this complicated, but this requires the client know how the Tender is implemented (the prefetched would become uneccessary if we changed the implementation to store the total cost on the actual Tender itself at the cost of denormalised data). Note that you wouldn't have these problems if the data existed in memory. These points are making me rethink my data model and API design, and that the OO paradigm doesn't always fit well with web apps. In particular the slides at http://www.slideshare.net/OReillyOSCON/unbreaking-your-django-application: 23: "Django doesn’t have a good primitive for returning an object graph in one call. - multiple db trips required" Implies that object graphs like what I've designed is difficult to implement effeciently..? 24: "So, what do to? If you almost always return the parent along with the related object, consider object inheritance instead. That does do a nice join operation. Don’t create an insane class graph, though. If you frequently use the related object without reference to the parent object, use a foreign key." Not sure what 'object inheritance' is but might be useful, and concerned that I might be heading towards an 'insane class graph' (the real data model has 6 or 7 more related objects). Am I heading down the wrong path of embracing an OO style API and data model?
Re: Performance when retrieving data from related models
You’re over thinking this by a large margin. getTendersForContractor should return a QuerySet. That way you can add things like .select_related as necessary in the view. Also, I’m a bit worried about the fact you have a function called getTendersForContractor. It really should be Tender.objects.forContractor. Or if the function is just a single filter call, it can be eliminated entirely. On Mar 15, 2013, at 9:04 PM, Taras_96 wrote: > qs = Tender.objects.sql_related(/*names of keyed relationships > here*/).prefetch_related('project', 'project__job_set') (the actual example > has many more relations in the prefetch_related). This fixes the problem of > hitting the database in loops such as the above, and the complexity can be > hidden in a data retrieval layer that the client calls into. > > returnedTenders = getTendersForContractor(contractor) # << the actual query > with prefetches etc is done in this function > > However, there are a few drawbacks to this: [SNIP] > * Each prefetch_related hits the database once - a lot better than 3k hits, > but not as minimal as a sql join which joins all of the tables Actually, no. If you were to try and do the prefetch_related as a single SQL join, you would return an insane number of rows. A single prefetch_related wouldn't be a problem, but if you were to try and do a select with more than one many field, they would multiply. If you have an example of a join statement that doesn’t do that, we’d love to see it. > * Prefetch_related selects related obects via the criteria: 'in (list of > ids'), which arguably can be slower What database are you using where this is true? > * Seems brittle - if you change the data model, you have to change all of the > queries in the data retrieval layer. Denomralising the data would reduce > this, but we'd still have the problem of retrieving more data than the client > may possibly want Yes, if you change code in one place, you might have to change it someplace else to match it. Most of us just use the ORM as the data retrieval layer. Again, I’m worried you’re over thinking this. Peter of the Norse rahmc...@radio1190.org -- You received this message because you are subscribed to the Google Groups "Django users" group. To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscr...@googlegroups.com. To post to this group, send email to django-users@googlegroups.com. Visit this group at http://groups.google.com/group/django-users?hl=en. For more options, visit https://groups.google.com/groups/opt_out.
Performance when retrieving data from related models
My web app hits the database up to 3k per page because it has to retrieve data from lots of related models. It has a fairly normalised database with a structure similar to: - one project has many contracts - one project has many tenders (a tender is put forward by a contractor) - one contract has many jobs (there are a few extra relations not shown for simplicity). One of the pages, for example, lists out details for every contract a contractors has tendered). I've naively implemented this by treating the Django models as I would OO models, resulting in about 3k database hits for the one page. This is because something like: qs = Tender.objects.filter(filter_criteria) for tender in qs: print tender.getProject() where getProject() looks something like: def getProject(self): return self.project I think here that the extra level of indirection means that Django can't conclude that all 'project' objects for each tender from the query set should be retrieved, and instead it ends up hitting the database once per loop. At the moment I do this in quite a few places, contributing to the 3k database hits. One way of of solving this is for the query to fetch all related data, something like: qs = Tender.objects.sql_related(/*names of keyed relationships here*/).prefetch_related('project', 'project__job_set') (the actual example has many more relations in the prefetch_related). This fixes the problem of hitting the database in loops such as the above, and the complexity can be hidden in a data retrieval layer that the client calls into. returnedTenders = getTendersForContractor(contractor) # << the actual query with prefetches etc is done in this function However, there are a few drawbacks to this: * Major one is that data is fetched that you might not require. EG: if there was another page that just listed the ids of the Tenders, then a lot of data would have been fetched that is not required ** you could mitigate this by having the client somehow specify which data requires prefetching, but *** this exposes implementation detail to the client *** the function calling into the data retrieval layer may not know what data requires prefetching, and instead it is a function higher up in the call chain (the function that actually uses the data) that has this information * Each prefetch_related hits the database once - a lot better than 3k hits, but not as minimal as a sql join which joins all of the tables * Prefetch_related selects related obects via the criteria: 'in (list of ids'), which arguably can be slower * Seems brittle - if you change the data model, you have to change all of the queries in the data retrieval layer. Denomralising the data would reduce this, but we'd still have the problem of retrieving more data than the client may possibly want What is the group's suggestions/advice? -- You received this message because you are subscribed to the Google Groups "Django users" group. To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscr...@googlegroups.com. To post to this group, send email to django-users@googlegroups.com. Visit this group at http://groups.google.com/group/django-users?hl=en. For more options, visit https://groups.google.com/groups/opt_out.