Re: Performance when retrieving data from related models

2013-03-19 Thread Taras D
Thanks again!

On 19 March 2013 13:09, Peter of the Norse  wrote:

> [snip]
>
> We might put the onus on the client to call:
> qs = getTendersForContractor(contractor)
> for tender in list(qs.prefetch_related('project', 'project__contract',
> 'project__contract')):
> print tender, tender.getTotalCost()
>
>
> If this is really what you are doing, then I strongly recommend
> https://docs.djangoproject.com/en/dev/topics/db/aggregation/. This case
> is a single SQL statement.
>
> Not only is this complicated, but this requires the client know how the
> Tender is implemented (the prefetched would become uneccessary if we
> changed the implementation to store the total cost on the actual Tender
> itself at the cost of denormalised data). Note that you wouldn't have these
> problems if the data existed in memory.
>
> These points are making me rethink my data model and API design, and that
> the OO paradigm doesn't always fit well with web apps. In particular the
> slides at
> http://www.slideshare.net/OReillyOSCON/unbreaking-your-django-application:
>
> 23: "Django doesn’t have a good primitive for returning an object graph in
> one call. - multiple db trips required" Implies that object graphs like
> what I've designed is difficult to implement effeciently..?
>
> 24: "So, what do to?  If you almost always return the parent along with
> the related object, consider object inheritance instead.  That does do a
> nice join operation.  Don’t create an insane class graph, though.  If you
> frequently use the related object without reference to the parent object,
> use a foreign key." Not sure what 'object inheritance' is but might be
> useful, and concerned that I might be heading towards an 'insane class
> graph' (the real data model has 6 or 7 more related objects). Am I heading
> down the wrong path of embracing an OO style API and data model?
>
> I don’t know. But I think you have to acknowledge that there is such a
> thing as too OOP. I’m not sure that creating an API on top of Django is a
> good idea. Django is already super high level. I would have to ask myself
> how I expect my users to use what I give them. On some level, you’ll have
> to trust your user to ask for the data they need.
>
>
Getting the total cost isn't exactly what I was doing, I was trying to come
up with a simple example of a function on the class that can't
easily/cleanly be calculated by simply aggregating data from the related
objects. With this I think I've hit an impedance mismatch problem, from
http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx

"Unfortunately, fields within the object are only part of the problem--the
other problem we face is that objects are frequently associated with other
objects, in various cardinalities (one-to-one, one-to-many, many-to-one,
many-to-many), and an O/R mapping has to make some up-front decisions about
when to retrieve these associated objects"

Which is exactly the issue I'm having. As you've stated there isn't a
straight forward solution, and I'll have to consider tradeoffs in my design
decisions.

Taras

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-users+unsubscr...@googlegroups.com.
To post to this group, send email to django-users@googlegroups.com.
Visit this group at http://groups.google.com/group/django-users?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.




Re: Performance when retrieving data from related models

2013-03-18 Thread Taras D
Hi Peter

On 18 March 2013 16:35, Peter of the Norse  wrote:

> You’re over thinking this by a large margin. getTendersForContractor
> should return a QuerySet. That way you can add things like .select_related
> as necessary in the view.
>
> Also, I’m a bit worried about the fact you have a function called
> getTendersForContractor. It really should be Tender.objects.forContractor.
> Or if the function is just a single filter call, it can be eliminated
> entirely.
>

To confirm, this would require a new Manager that implements the
'forContractor' method?


> * Each prefetch_related hits the database once - a lot better than 3k
> hits, but not as minimal as a sql join which joins all of the tables
>
> Actually, no. If you were to try and do the prefetch_related as a single
> SQL join, you would return an insane number of rows. A single
> prefetch_related wouldn't be a problem, but if you were to try and do a
> select with more than one many field, they would multiply. If you have an
> example of a join statement that doesn’t do that, we’d love to see it.
>

You're correct


>
> * Prefetch_related selects related obects via the criteria: 'in (list of
> ids'), which arguably can be slower
>
> What database are you using where this is true?
>

It was the result of a discussion with a colleague. Also slides 67 & 68 at
http://www.slideshare.net/OReillyOSCON/unbreaking-your-django-applicationimply
that large IN's may cause performance problems -
"68:  Large INs are a mess.  Very expensive for the database to parse.
 Very expensive for the database to execute.  If there are potentially more
than 10-15 items in the list, rework the IN as a JOIN against whatever the
source of the keys is."


>
> * Seems brittle - if you change the data model, you have to change all of
> the queries in the data retrieval layer. Denomralising the data would
> reduce this, but we'd still have the problem of retrieving more data than
> the client may possibly want
>
> Yes, if you change code in one place, you might have to change it
> someplace else to match it. Most of us just use the ORM as the data
> retrieval layer. Again, I’m worried you’re over thinking this.
>

I hadn't really considered passing around a QuerySet in place of a list of
Tenders. One thing I'm finding is that the cost in retrieving related
objects is making me rethink the API presented to the client. For example,
consider the following API, which I think is quite reasonable:

class Tender:
  def getTotalCost(self):
  return self.project.getTotalCost()

class Project:

  def getTotalCost(self):
  cost = 0
  for contract in self.contract_set.all():
  cost += constract.getTotalCost()
  return cost

similarly implementations for Project & Contract

class Contract:
  def getTotalCost(self):
  cost = 0
  for job in self.job_set.all():
  cost += job.getTotalCost()
  return cost

Having a Project know how much its total cost is, and presenting this to a
client, is in line with OO principles of abstraction and encapsulation.
Consider a page that outputs the total cost of each contract tendered for a
particular contractor. If we re-use the 'getTendersForContractor' function
(that returns a QuerySet), and then call 'getTotalCost' on each Project
that has a tender, then we still have a problem in that we'll get a
multitude of calls to the database.

We might put the onus on the client to call:
qs = getTendersForContractor(contractor)
for tender in list(qs.prefetch_related('project', 'project__contract',
'project__contract')):
print tender, tender.getTotalCost()

Not only is this complicated, but this requires the client know how the
Tender is implemented (the prefetched would become uneccessary if we
changed the implementation to store the total cost on the actual Tender
itself at the cost of denormalised data). Note that you wouldn't have these
problems if the data existed in memory.

These points are making me rethink my data model and API design, and that
the OO paradigm doesn't always fit well with web apps. In particular the
slides at
http://www.slideshare.net/OReillyOSCON/unbreaking-your-django-application:

23: "Django doesn’t have a good primitive for returning an object graph in
one call. - multiple db trips required" Implies that object graphs like
what I've designed is difficult to implement effeciently..?

24: "So, what do to?  If you almost always return the parent along with the
related object, consider object inheritance instead.  That does do a nice
join operation.  Don’t create an insane class graph, though.  If you
frequently use the related object without reference to the parent object,
use a foreign key." Not sure what 'object inheritance' is but might be
useful, and concerned that I might be heading towards an 'insane class
graph' (the real data model has 6 or 7 more related objects). Am I heading
down the wrong path of embracing an OO