Re: Performance when retrieving data from related models

2013-03-19 Thread Taras D
Thanks again!

On 19 March 2013 13:09, Peter of the Norse  wrote:

> [snip]
>
> We might put the onus on the client to call:
> qs = getTendersForContractor(contractor)
> for tender in list(qs.prefetch_related('project', 'project__contract',
> 'project__contract')):
> print tender, tender.getTotalCost()
>
>
> If this is really what you are doing, then I strongly recommend
> https://docs.djangoproject.com/en/dev/topics/db/aggregation/. This case
> is a single SQL statement.
>
> Not only is this complicated, but this requires the client know how the
> Tender is implemented (the prefetched would become uneccessary if we
> changed the implementation to store the total cost on the actual Tender
> itself at the cost of denormalised data). Note that you wouldn't have these
> problems if the data existed in memory.
>
> These points are making me rethink my data model and API design, and that
> the OO paradigm doesn't always fit well with web apps. In particular the
> slides at
> http://www.slideshare.net/OReillyOSCON/unbreaking-your-django-application:
>
> 23: "Django doesn’t have a good primitive for returning an object graph in
> one call. - multiple db trips required" Implies that object graphs like
> what I've designed is difficult to implement effeciently..?
>
> 24: "So, what do to?  If you almost always return the parent along with
> the related object, consider object inheritance instead.  That does do a
> nice join operation.  Don’t create an insane class graph, though.  If you
> frequently use the related object without reference to the parent object,
> use a foreign key." Not sure what 'object inheritance' is but might be
> useful, and concerned that I might be heading towards an 'insane class
> graph' (the real data model has 6 or 7 more related objects). Am I heading
> down the wrong path of embracing an OO style API and data model?
>
> I don’t know. But I think you have to acknowledge that there is such a
> thing as too OOP. I’m not sure that creating an API on top of Django is a
> good idea. Django is already super high level. I would have to ask myself
> how I expect my users to use what I give them. On some level, you’ll have
> to trust your user to ask for the data they need.
>
>
Getting the total cost isn't exactly what I was doing, I was trying to come
up with a simple example of a function on the class that can't
easily/cleanly be calculated by simply aggregating data from the related
objects. With this I think I've hit an impedance mismatch problem, from
http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx

"Unfortunately, fields within the object are only part of the problem--the
other problem we face is that objects are frequently associated with other
objects, in various cardinalities (one-to-one, one-to-many, many-to-one,
many-to-many), and an O/R mapping has to make some up-front decisions about
when to retrieve these associated objects"

Which is exactly the issue I'm having. As you've stated there isn't a
straight forward solution, and I'll have to consider tradeoffs in my design
decisions.

Taras

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-users+unsubscr...@googlegroups.com.
To post to this group, send email to django-users@googlegroups.com.
Visit this group at http://groups.google.com/group/django-users?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.




Re: Performance when retrieving data from related models

2013-03-18 Thread Peter of the Norse
More inline stuff. 

On Mar 18, 2013, at 4:14 AM, Taras D wrote:

> Hi Peter
> 
> On 18 March 2013 16:35, Peter of the Norse  wrote:
> You’re over thinking this by a large margin. getTendersForContractor should 
> return a QuerySet. That way you can add things like .select_related as 
> necessary in the view. 
> 
> Also, I’m a bit worried about the fact you have a function called 
> getTendersForContractor. It really should be Tender.objects.forContractor. Or 
> if the function is just a single filter call, it can be eliminated entirely.
> 
> To confirm, this would require a new Manager that implements the 
> 'forContractor' method?

Yes. 

>> * Prefetch_related selects related obects via the criteria: 'in (list of 
>> ids'), which arguably can be slower
> 
> What database are you using where this is true?
> 
> It was the result of a discussion with a colleague. Also slides 67 & 68 at 
> http://www.slideshare.net/OReillyOSCON/unbreaking-your-django-application 
> imply that large IN's may cause performance problems - 
> "68:  Large INs are a mess.  Very expensive for the database to parse.  Very 
> expensive for the database to execute.  If there are potentially more than 
> 10-15 items in the list, rework the IN as a JOIN against whatever the source 
> of the keys is." 

That sounds like a good candidate for a bug ticket. It’s not trivial though. 
The limit of IDs in the query set would depend on the database, the complexity 
of the query, and the data being pulled. I would have to think about this for a 
long time before I felt comfortable making any kind of suggestion.

>> * Seems brittle - if you change the data model, you have to change all of 
>> the queries in the data retrieval layer. Denomralising the data would reduce 
>> this, but we'd still have the problem of retrieving more data than the 
>> client may possibly want
> Yes, if you change code in one place, you might have to change it someplace 
> else to match it. Most of us just use the ORM as the data retrieval layer. 
> Again, I’m worried you’re over thinking this.
> 
> I hadn't really considered passing around a QuerySet in place of a list of 
> Tenders. One thing I'm finding is that the cost in retrieving related objects 
> is making me rethink the API presented to the client. For example, consider 
> the following API, which I think is quite reasonable:
> 
> class Tender:
>   def getTotalCost(self):
>   return self.project.getTotalCost()
> 
> class Project:
>   
>   def getTotalCost(self):
>   cost = 0
>   for contract in self.contract_set.all():
>   cost += constract.getTotalCost()
>   return cost
> 
> similarly implementations for Project & Contract
> 
> class Contract:
>   def getTotalCost(self):
>   cost = 0
>   for job in self.job_set.all():
>   cost += job.getTotalCost()
>   return cost

https://docs.djangoproject.com/en/dev/topics/db/aggregation/

> Having a Project know how much its total cost is, and presenting this to a 
> client, is in line with OO principles of abstraction and encapsulation. 
> Consider a page that outputs the total cost of each contract tendered for a 
> particular contractor. If we re-use the 'getTendersForContractor' function 
> (that returns a QuerySet), and then call 'getTotalCost' on each Project that 
> has a tender, then we still have a problem in that we'll get a multitude of 
> calls to the database. 
> 
> We might put the onus on the client to call:
> qs = getTendersForContractor(contractor)
> for tender in list(qs.prefetch_related('project', 'project__contract', 
> 'project__contract')):
> print tender, tender.getTotalCost()

If this is really what you are doing, then I strongly recommend 
https://docs.djangoproject.com/en/dev/topics/db/aggregation/. This case is a 
single SQL statement.

> Not only is this complicated, but this requires the client know how the 
> Tender is implemented (the prefetched would become uneccessary if we changed 
> the implementation to store the total cost on the actual Tender itself at the 
> cost of denormalised data). Note that you wouldn't have these problems if the 
> data existed in memory.
> 
> These points are making me rethink my data model and API design, and that the 
> OO paradigm doesn't always fit well with web apps. In particular the slides 
> at http://www.slideshare.net/OReillyOSCON/unbreaking-your-django-application:
> 
> 23: "Django doesn’t have a good primitive for returning an object graph in 
> one call. - multiple db trips required" Implies that object graphs like what 
> I've designed is difficult to implement effeciently..?
> 
> 24: "So, what do to?  If you almost always return the parent along with the 
> related object, consider object inheritance instead.  That does do a nice 
> join operation.  Don’t create an insane class graph, though.  If you 
> frequently use the related object without reference to the parent object, use 
> a foreign key.

Re: Performance when retrieving data from related models

2013-03-18 Thread Taras D
Hi Peter

On 18 March 2013 16:35, Peter of the Norse  wrote:

> You’re over thinking this by a large margin. getTendersForContractor
> should return a QuerySet. That way you can add things like .select_related
> as necessary in the view.
>
> Also, I’m a bit worried about the fact you have a function called
> getTendersForContractor. It really should be Tender.objects.forContractor.
> Or if the function is just a single filter call, it can be eliminated
> entirely.
>

To confirm, this would require a new Manager that implements the
'forContractor' method?


> * Each prefetch_related hits the database once - a lot better than 3k
> hits, but not as minimal as a sql join which joins all of the tables
>
> Actually, no. If you were to try and do the prefetch_related as a single
> SQL join, you would return an insane number of rows. A single
> prefetch_related wouldn't be a problem, but if you were to try and do a
> select with more than one many field, they would multiply. If you have an
> example of a join statement that doesn’t do that, we’d love to see it.
>

You're correct


>
> * Prefetch_related selects related obects via the criteria: 'in (list of
> ids'), which arguably can be slower
>
> What database are you using where this is true?
>

It was the result of a discussion with a colleague. Also slides 67 & 68 at
http://www.slideshare.net/OReillyOSCON/unbreaking-your-django-applicationimply
that large IN's may cause performance problems -
"68:  Large INs are a mess.  Very expensive for the database to parse.
 Very expensive for the database to execute.  If there are potentially more
than 10-15 items in the list, rework the IN as a JOIN against whatever the
source of the keys is."


>
> * Seems brittle - if you change the data model, you have to change all of
> the queries in the data retrieval layer. Denomralising the data would
> reduce this, but we'd still have the problem of retrieving more data than
> the client may possibly want
>
> Yes, if you change code in one place, you might have to change it
> someplace else to match it. Most of us just use the ORM as the data
> retrieval layer. Again, I’m worried you’re over thinking this.
>

I hadn't really considered passing around a QuerySet in place of a list of
Tenders. One thing I'm finding is that the cost in retrieving related
objects is making me rethink the API presented to the client. For example,
consider the following API, which I think is quite reasonable:

class Tender:
  def getTotalCost(self):
  return self.project.getTotalCost()

class Project:

  def getTotalCost(self):
  cost = 0
  for contract in self.contract_set.all():
  cost += constract.getTotalCost()
  return cost

similarly implementations for Project & Contract

class Contract:
  def getTotalCost(self):
  cost = 0
  for job in self.job_set.all():
  cost += job.getTotalCost()
  return cost

Having a Project know how much its total cost is, and presenting this to a
client, is in line with OO principles of abstraction and encapsulation.
Consider a page that outputs the total cost of each contract tendered for a
particular contractor. If we re-use the 'getTendersForContractor' function
(that returns a QuerySet), and then call 'getTotalCost' on each Project
that has a tender, then we still have a problem in that we'll get a
multitude of calls to the database.

We might put the onus on the client to call:
qs = getTendersForContractor(contractor)
for tender in list(qs.prefetch_related('project', 'project__contract',
'project__contract')):
print tender, tender.getTotalCost()

Not only is this complicated, but this requires the client know how the
Tender is implemented (the prefetched would become uneccessary if we
changed the implementation to store the total cost on the actual Tender
itself at the cost of denormalised data). Note that you wouldn't have these
problems if the data existed in memory.

These points are making me rethink my data model and API design, and that
the OO paradigm doesn't always fit well with web apps. In particular the
slides at
http://www.slideshare.net/OReillyOSCON/unbreaking-your-django-application:

23: "Django doesn’t have a good primitive for returning an object graph in
one call. - multiple db trips required" Implies that object graphs like
what I've designed is difficult to implement effeciently..?

24: "So, what do to?  If you almost always return the parent along with the
related object, consider object inheritance instead.  That does do a nice
join operation.  Don’t create an insane class graph, though.  If you
frequently use the related object without reference to the parent object,
use a foreign key." Not sure what 'object inheritance' is but might be
useful, and concerned that I might be heading towards an 'insane class
graph' (the real data model has 6 or 7 more related objects). Am I heading
down the wrong path of embracing an OO style API and data model?


Re: Performance when retrieving data from related models

2013-03-17 Thread Peter of the Norse
You’re over thinking this by a large margin. getTendersForContractor should 
return a QuerySet. That way you can add things like .select_related as 
necessary in the view. 

Also, I’m a bit worried about the fact you have a function called 
getTendersForContractor. It really should be Tender.objects.forContractor. Or 
if the function is just a single filter call, it can be eliminated entirely.

On Mar 15, 2013, at 9:04 PM, Taras_96 wrote:

> qs = Tender.objects.sql_related(/*names of keyed relationships 
> here*/).prefetch_related('project', 'project__job_set') (the actual example 
> has many more relations in the prefetch_related). This fixes the problem of 
> hitting the database in loops such as the above, and the complexity can be 
> hidden in a data retrieval layer that the client calls into.
> 
> returnedTenders = getTendersForContractor(contractor) # << the actual query 
> with prefetches etc is done in this function
> 
> However, there are a few drawbacks to this:
[SNIP]
> * Each prefetch_related hits the database once - a lot better than 3k hits, 
> but not as minimal as a sql join which joins all of the tables
Actually, no. If you were to try and do the prefetch_related as a single SQL 
join, you would return an insane number of rows. A single prefetch_related 
wouldn't be a problem, but if you were to try and do a select with more than 
one many field, they would multiply. If you have an example of a join statement 
that doesn’t do that, we’d love to see it.

> * Prefetch_related selects related obects via the criteria: 'in (list of 
> ids'), which arguably can be slower
What database are you using where this is true?

> * Seems brittle - if you change the data model, you have to change all of the 
> queries in the data retrieval layer. Denomralising the data would reduce 
> this, but we'd still have the problem of retrieving more data than the client 
> may possibly want
Yes, if you change code in one place, you might have to change it someplace 
else to match it. Most of us just use the ORM as the data retrieval layer. 
Again, I’m worried you’re over thinking this.


Peter of the Norse
rahmc...@radio1190.org



-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-users+unsubscr...@googlegroups.com.
To post to this group, send email to django-users@googlegroups.com.
Visit this group at http://groups.google.com/group/django-users?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.




Performance when retrieving data from related models

2013-03-15 Thread Taras_96
My web app hits the database up to 3k per page because it has to retrieve 
data from lots of related models. It has a fairly normalised database  with 
a structure similar to:

- one project has many contracts
- one project has many tenders (a tender is put forward by a contractor)
- one contract has many jobs

(there are a few extra relations not shown for simplicity). One of the 
pages, for example, lists out details for every contract a contractors has 
tendered). I've naively implemented this by treating the Django models as I 
would OO models, resulting in about 3k database hits for the one page. This 
is because something like:

qs = Tender.objects.filter(filter_criteria)
for tender in qs:
  print tender.getProject()

where getProject() looks something like:

def getProject(self):
  return self.project

I think here that the extra level of indirection means that Django can't 
conclude that all 'project' objects for each tender from the query set 
should be retrieved, and instead it ends up hitting the database once per 
loop. At the moment I do this in quite a few places, contributing to the 3k 
database hits.

One way of of solving this is for the query to fetch all related data, 
something like:

qs = Tender.objects.sql_related(/*names of keyed relationships 
here*/).prefetch_related('project', 'project__job_set') (the actual example 
has many more relations in the prefetch_related). This fixes the problem of 
hitting the database in loops such as the above, and the complexity can be 
hidden in a data retrieval layer that the client calls into.

returnedTenders = getTendersForContractor(contractor) # << the actual query 
with prefetches etc is done in this function

However, there are a few drawbacks to this:

* Major one is that data is fetched that you might not require. EG: if 
there was another page that just listed the ids of the Tenders, then a lot 
of data would have been fetched that is not required
** you could mitigate this by having the client somehow specify which data 
requires prefetching, but
*** this exposes implementation detail to the client
*** the function calling into the data retrieval layer may not know what 
data requires prefetching, and instead it is a function higher up in the 
call chain (the function that actually uses the data) that has this 
information
* Each prefetch_related hits the database once - a lot better than 3k hits, 
but not as minimal as a sql join which joins all of the tables
* Prefetch_related selects related obects via the criteria: 'in (list of 
ids'), which arguably can be slower
* Seems brittle - if you change the data model, you have to change all of 
the queries in the data retrieval layer. Denomralising the data would 
reduce this, but we'd still have the problem of retrieving more data than 
the client may possibly want

What is the group's suggestions/advice?


-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-users+unsubscr...@googlegroups.com.
To post to this group, send email to django-users@googlegroups.com.
Visit this group at http://groups.google.com/group/django-users?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.