Re: Django ORM performance patch. Fixes #5420, #5768

2008-02-16 Thread David Cramer

Per my patches and what I believe Malcom has done with qsrf, the
syntax for related fields was field__related__name, and yes, you would
repeat this over and over, but it keeps it consistant.

I would like a values-like method to return partial objects, and then
requesting any field thats not "filled" would hit the db in some
fashion, but that's another story.

On Feb 15, 3:42 am, Dima Dogadaylo <[EMAIL PROTECTED]> wrote:
> On 15 Feb, 12:30, Malcolm Tredinnick <[EMAIL PROTECTED]> wrote:
>
> > When Adrian
> > proposed that API, he realised that almost always you're going to be
> > pulling back all of the fields or almost all of them. Once a database
> > has read a row to access some of the data, accessing all of the data in
> > the row is close to zero overhead. It's already had to read the block
> > off disk, for example, which is many kilobytes in size. Only for either
> > extremely large fields or for fields that need a large amount of
> > post-processing to be useful (such as polygon fields) is it really a win
> > to micro-manage what is pulled that.
>
> > For most non-trivial SQL queries with modern databases, constructing
> > (extracting) the output columns, unless they are complex computations,
> > is a very small fraction of the timeslice devote to the query execution.
>
> It's true but not all true.
>
> Often it's have sense to exclude even small non-used fields, because
> when you do joins with GROUP BY or similar for final 100 rows you may
> have temporary in-memory or even disc-table from 100*100 rows and size
> of each rows really matters.
>
> Then not only db-stage consume resources, but also network bandwidth
> is consumed, python database adapter consumes a lot on converting raw
> char stream to never used python objects, memory is wasted. It's
> extremely important on limited VPS-hostings.
>
> However, IMHO such technical details isn't important because ORM
> should not assume that things are "good enough". With ideal ORM I
> should be able to do _all_ that I can do with raw SQL. Everyone needs
> are special, same ORM will be good for small project but unacceptable
> for big project if ORM assumes that "this" and "those" limitations are
> acceptable for all.
>
> >> That's why the API is to specify what to exclude, rather than what to
> > include. I tend to agree with the proposal, too. From a performance
> > management perspective, it's going to be easier to use that way.
>
> I agree that in many cases exclude() may be useful and probably it's
> worth to add it also, but as I already wrote in my blog post, this
> patch come from real experience when you don't have "tutorial-like"
> models and when there is conscious data redundancy in many tables.
> It's typical task to show object_list and object_detail for any model
> and for each object_list I know exactly what fields I need and usually
> it's 3-5 fields from model itself and 2-3 fields from related models.
> I strongly believe that API should give me ability to receive only
> those fields even if all database severs will work with speed of
> light.
>
> > This isn't "standard Django lookup syntax", though. It's some entirely
> > new lookup syntax you created. There's nowhere in the current lookup
> > hierarchy that uses this method for referring to fields in another
> > model. Reusing the existing syntax makes more sense. It's less for
> > people to learn and it's less restrictive: with the double-underscore
> > syntax, I can list the fields in any order, since they're all positional
> > arguments. Your syntax has normal fields as positional arguments (so
> > they must come first) and related fields as keyword arguments that take
> > a sequence.
>
> I sure you know Django lookup syntax much better than me, but IMHO
> your proposition violates DRY, because  if I need to select several
> fields from related model I need to repeat and repeat
> "user__address__fieldname". Your proposition is close to original SQL
> syntax of SELECT statement and due to this issue your proposition
> should be accepted probably even if it violates DRY but keeps
> conceptual integrity with underlaying SQL level.
>
> I think it's only several lines task to change syntax from
> fields(*fields, **related_fields) to fields(*args). So if sometimes
> core developers change own mind about fields() method I can add to my
> patch plain lookup syntax support.
--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~--~~~~--~~--~--~---



Re: Django ORM performance patch. Fixes #5420, #5768

2008-02-15 Thread Dima Dogadaylo

On 15 Feb, 12:30, Malcolm Tredinnick <[EMAIL PROTECTED]> wrote:

> When Adrian
> proposed that API, he realised that almost always you're going to be
> pulling back all of the fields or almost all of them. Once a database
> has read a row to access some of the data, accessing all of the data in
> the row is close to zero overhead. It's already had to read the block
> off disk, for example, which is many kilobytes in size. Only for either
> extremely large fields or for fields that need a large amount of
> post-processing to be useful (such as polygon fields) is it really a win
> to micro-manage what is pulled that.
>
> For most non-trivial SQL queries with modern databases, constructing
> (extracting) the output columns, unless they are complex computations,
> is a very small fraction of the timeslice devote to the query execution.

It's true but not all true.

Often it's have sense to exclude even small non-used fields, because
when you do joins with GROUP BY or similar for final 100 rows you may
have temporary in-memory or even disc-table from 100*100 rows and size
of each rows really matters.

Then not only db-stage consume resources, but also network bandwidth
is consumed, python database adapter consumes a lot on converting raw
char stream to never used python objects, memory is wasted. It's
extremely important on limited VPS-hostings.

However, IMHO such technical details isn't important because ORM
should not assume that things are "good enough". With ideal ORM I
should be able to do _all_ that I can do with raw SQL. Everyone needs
are special, same ORM will be good for small project but unacceptable
for big project if ORM assumes that "this" and "those" limitations are
acceptable for all.

>> That's why the API is to specify what to exclude, rather than what to
> include. I tend to agree with the proposal, too. From a performance
> management perspective, it's going to be easier to use that way.

I agree that in many cases exclude() may be useful and probably it's
worth to add it also, but as I already wrote in my blog post, this
patch come from real experience when you don't have "tutorial-like"
models and when there is conscious data redundancy in many tables.
It's typical task to show object_list and object_detail for any model
and for each object_list I know exactly what fields I need and usually
it's 3-5 fields from model itself and 2-3 fields from related models.
I strongly believe that API should give me ability to receive only
those fields even if all database severs will work with speed of
light.


> This isn't "standard Django lookup syntax", though. It's some entirely
> new lookup syntax you created. There's nowhere in the current lookup
> hierarchy that uses this method for referring to fields in another
> model. Reusing the existing syntax makes more sense. It's less for
> people to learn and it's less restrictive: with the double-underscore
> syntax, I can list the fields in any order, since they're all positional
> arguments. Your syntax has normal fields as positional arguments (so
> they must come first) and related fields as keyword arguments that take
> a sequence.

I sure you know Django lookup syntax much better than me, but IMHO
your proposition violates DRY, because  if I need to select several
fields from related model I need to repeat and repeat
"user__address__fieldname". Your proposition is close to original SQL
syntax of SELECT statement and due to this issue your proposition
should be accepted probably even if it violates DRY but keeps
conceptual integrity with underlaying SQL level.

I think it's only several lines task to change syntax from
fields(*fields, **related_fields) to fields(*args). So if sometimes
core developers change own mind about fields() method I can add to my
patch plain lookup syntax support.
--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~--~~~~--~~--~--~---



Re: Django ORM performance patch. Fixes #5420, #5768

2008-02-15 Thread Sebastian Noack

On Fri, 15 Feb 2008 21:30:23 +1100
Malcolm Tredinnick <[EMAIL PROTECTED]> wrote:

> There's also the fairly pragmatic side-effect that if you are only
> pulling back a very few fields, there isn't really a lot your model
> methods are going to be able to do without loading more data, unless
> they're very simple. You might as well just write those methods as
> functions which take a dictionary and use values() for the cases when
> only two or three items are needed. 

I agree completely, as it is nearly the same I said in my last comment
on this thread.

> > > That QuerySet.values can not select values from related fields,
> > > annoys myself, too. But your implementation is  a bit restricted.
> > > Here the example from your blog:
> > >
> > > entries = Entry.objects.values('headline',
> > > user=('username',))[:10]
> > >
> > > A syntax as following would be more straight forward:
> > >
> > > entries = Entry.objects.values('headline', 'user__username')[:10]
> > 
> > I used standard Django lookup syntax, but because in trunk Djnago
> > don't support selecting of related fields, I thought it's better to
> > respect DRY principe and write:
> > entries = Entry.objects.values('headline', user__address=('country',
> > 'city', 'zip'))
> 
> This isn't "standard Django lookup syntax", though. It's some entirely
> new lookup syntax you created. There's nowhere in the current lookup
> hierarchy that uses this method for referring to fields in another
> model. Reusing the existing syntax makes more sense. It's less for
> people to learn and it's less restrictive: with the double-underscore
> syntax, I can list the fields in any order, since they're all
> positional arguments. Your syntax has normal fields as positional
> arguments (so they must come first) and related fields as keyword
> arguments that take a sequence.
> 
> > Also please note, currently lookups with 2 underscores are used only
> > in **kwargs, because it don't have many sense as quoted string (even
> > current QuerySet.order_by() uses '.' instead of '__' as separator).
> 
> Well, that information's a little old (if you're going to be writing
> code in the query construction are of Django, you also need to keep up
> with the branch work that's rewriting large portions of that area;
> that was already mentioned in the related-values ticket before you
> started working on it). Let's not use the current cross-model
> ordering syntax as an example. It's a bit of a wart in the design
> (Database table name + model field attribute name... it's mixing
> apples and oranges). In queryset-refactor, we've changed that
> (order-by) use double-underscores and only field names, just as in
> filters. One of the reasons for this was consistency.
> 
> I think, as Sebastian suggested (and I mentioned in the comment to the
> ticket on related values lookup a few months bcak), we should also use
> double underscores for selecting related values. That makes everything
> nice and consistent across the board.

In my previous message, I said that I agree with Dima's comment, but
that was just because I trusted that it is true that this syntax is
already used in django, without looking at the source. I don't care
much, how the syntax should look like, but it have to enable selecting
related fields from related fields and it should be the same syntax used
in the rest of django's API.


Regards
Sebastian Noack

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~--~~~~--~~--~--~---



Re: Django ORM performance patch. Fixes #5420, #5768

2008-02-15 Thread Sebastian Noack

On Fri, 15 Feb 2008 01:54:24 -0800 (PST)
Dima Dogadaylo <[EMAIL PROTECTED]> wrote:
> > Of course, this decreases database load, but IMHO it is better to
> > use QuerySet.values if you want select just certain values from a
> > model.
> 
> Often I need methods defined in models and isn't available at
> dictionaries.
> For example get_avatar_url().

These methods might change and don't will work anymore, if the now
required fields are not selected by QuerySet.fields. So you have to
look for all occurrences in your code and change the parameter list of
QuerySet.fields, after figuring out which fields are required. This is
why I think it would be more straight forward to write your functions
independent from the model and pass the required values (as retrieved
by QuerySet.values) as separate arguments if you are going to avoid
selecting more data than required.

But on the other hand it is ugly to extract logic which relates to a
model from the model. But an ORM always has performance drawbacks in
favor of strict design. So I would rathe omit this performance feature
in favor of strict desgings, since incomplete models as created by
QuerySet.fields might introduce as many problems as the performance
increase.

> I used standard Django lookup syntax, but because in trunk Djnago
> don't support selecting of related fields, I thought it's better to
> respect DRY principe and write:
> entries = Entry.objects.values('headline', user__address=('country',
> 'city', 'zip'))
> 
> than
> entries = Entry.objects.values('headline', 'user__address__country',
> 'user__address__city', 'user__address__zip').
> 
> Also please note, currently lookups with 2 underscores are used only
> in **kwargs, because it don't have many sense as quoted string (even
> current QuerySet.order_by() uses '.' instead of '__' as separator). So
> from design and usability point of view my form of selecting related
> fields is better, I think.

Ok, I agree. I would like to getting this feature merged in trunk, too.

Regards
Sebastian Noack

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~--~~~~--~~--~--~---



Re: Django ORM performance patch. Fixes #5420, #5768

2008-02-15 Thread Malcolm Tredinnick


On Fri, 2008-02-15 at 01:54 -0800, Dima Dogadaylo wrote:
> On 14 Feb, 17:26, Sebastian Noack <[EMAIL PROTECTED]>
> wrote:
> >
> > Of course, this decreases database load, but IMHO it is better to use
> > QuerySet.values if you want select just certain values from a model.
> 
> Often I need methods defined in models and isn't available at
> dictionaries.

> For example get_avatar_url().

This is all very reasonable, but one of the problems with your patch is
that you've approached it from the reverse end of what was suggested in
#5420 and #5420 is written the way it is for a reason. When Adrian
proposed that API, he realised that almost always you're going to be
pulling back all of the fields or almost all of them. Once a database
has read a row to access some of the data, accessing all of the data in
the row is close to zero overhead. It's already had to read the block
off disk, for example, which is many kilobytes in size. Only for either
extremely large fields or for fields that need a large amount of
post-processing to be useful (such as polygon fields) is it really a win
to micro-manage what is pulled that.

For most non-trivial SQL queries with modern databases, constructing
(extracting) the output columns, unless they are complex computations,
is a very small fraction of the timeslice devote to the query execution.

That's why the API is to specify what to exclude, rather than what to
include. I tend to agree with the proposal, too. From a performance
management perspective, it's going to be easier to use that way.

There's also the fairly pragmatic side-effect that if you are only
pulling back a very few fields, there isn't really a lot your model
methods are going to be able to do without loading more data, unless
they're very simple. You might as well just write those methods as
functions which take a dictionary and use values() for the cases when
only two or three items are needed. If you have a case that doesn't fit
this, the standard data modelling technique is you move your very
heavyweight fields into another model/table, so the lightweight stuff
doesn't get dragged down by also pulling back the heavyweight stuff.
That's always going to be more effective than trying to only select a
small fraction of the available columns in a row.

> > That QuerySet.values can not select values from related fields, annoys
> > myself, too. But your implementation is  a bit restricted. Here
> > the example from your blog:
> >
> > entries = Entry.objects.values('headline', user=('username',))[:10]
> >
> > A syntax as following would be more straight forward:
> >
> > entries = Entry.objects.values('headline', 'user__username')[:10]
> 
> I used standard Django lookup syntax, but because in trunk Djnago
> don't support selecting of related fields, I thought it's better to
> respect DRY principe and write:
> entries = Entry.objects.values('headline', user__address=('country',
> 'city', 'zip'))

This isn't "standard Django lookup syntax", though. It's some entirely
new lookup syntax you created. There's nowhere in the current lookup
hierarchy that uses this method for referring to fields in another
model. Reusing the existing syntax makes more sense. It's less for
people to learn and it's less restrictive: with the double-underscore
syntax, I can list the fields in any order, since they're all positional
arguments. Your syntax has normal fields as positional arguments (so
they must come first) and related fields as keyword arguments that take
a sequence.

> Also please note, currently lookups with 2 underscores are used only
> in **kwargs, because it don't have many sense as quoted string (even
> current QuerySet.order_by() uses '.' instead of '__' as separator).

Well, that information's a little old (if you're going to be writing
code in the query construction are of Django, you also need to keep up
with the branch work that's rewriting large portions of that area; that
was already mentioned in the related-values ticket before you started
working on it). Let's not use the current cross-model ordering syntax as
an example. It's a bit of a wart in the design (Database table name +
model field attribute name... it's mixing apples and oranges). In
queryset-refactor, we've changed that (order-by) use double-underscores
and only field names, just as in filters. One of the reasons for this
was consistency.

I think, as Sebastian suggested (and I mentioned in the comment to the
ticket on related values lookup a few months bcak), we should also use
double underscores for selecting related values. That makes everything
nice and consistent across the board.

Regards,
Malcolm

-- 
Many are called, few volunteer. 
http://www.pointy-stick.com/blog/


--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL

Re: Django ORM performance patch. Fixes #5420, #5768

2008-02-15 Thread Dima Dogadaylo

On 14 Feb, 17:26, Sebastian Noack <[EMAIL PROTECTED]>
wrote:
>
> Of course, this decreases database load, but IMHO it is better to use
> QuerySet.values if you want select just certain values from a model.

Often I need methods defined in models and isn't available at
dictionaries.
For example get_avatar_url().

> That QuerySet.values can not select values from related fields, annoys
> myself, too. But your implementation is  a bit restricted. Here
> the example from your blog:
>
> entries = Entry.objects.values('headline', user=('username',))[:10]
>
> A syntax as following would be more straight forward:
>
> entries = Entry.objects.values('headline', 'user__username')[:10]

I used standard Django lookup syntax, but because in trunk Djnago
don't support selecting of related fields, I thought it's better to
respect DRY principe and write:
entries = Entry.objects.values('headline', user__address=('country',
'city', 'zip'))

than
entries = Entry.objects.values('headline', 'user__address__country',
'user__address__city', 'user__address__zip').

Also please note, currently lookups with 2 underscores are used only
in **kwargs, because it don't have many sense as quoted string (even
current QuerySet.order_by() uses '.' instead of '__' as separator). So
from design and usability point of view my form of selecting related
fields is better, I think.
--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~--~~~~--~~--~--~---



Re: Django ORM performance patch. Fixes #5420, #5768

2008-02-14 Thread Sebastian Noack

> I made a patch for Django to add QuerySet.fields(*fields,
> **related_fields) and make possible to load only some from master and
> related models fields. It allows to tune various object list queries
> when we need only limited subset of all fields, improve general
> performance and decrease database load.

Of course, this decreases database load, but IMHO it is better to use
QuerySet.values if you want select just certain values from a model.

> As side effect of this patch
> support of selecting fields from related models in QuerySet.values()
> is implemented too. It was changed signature of this method from
> values(*fields) to values(*fields, **related_fields) but the change is
> backward compatible.

That QuerySet.values can not select values from related fields, annoys
myself, too. But your implementation is  a bit restricted. Here
the example from your blog:

entries = Entry.objects.values('headline', user=('username',))[:10]

A syntax as following would be more straight forward:

entries = Entry.objects.values('headline', 'user__username')[:10]

At first this is the way other methods of the QuerySet as filter for
example selects related fields and furthermore this would enable to
select the related field of a related field, as QuerySet.filter can do.

Regards
Sebastian Noack

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~--~~~~--~~--~--~---



Re: Django ORM performance patch. Fixes #5420, #5768

2008-02-14 Thread Malcolm Tredinnick


On Thu, 2008-02-14 at 03:50 -0800, Entropy Hacker wrote:
> I made a patch for Django to add QuerySet.fields(*fields,
> **related_fields) and make possible to load only some from master and
> related models fields. It allows to tune various object list queries
> when we need only limited subset of all fields, improve general
> performance and decrease database load. As side effect of this patch
> support of selecting fields from related models in QuerySet.values()
> is implemented too. It was changed signature of this method from
> values(*fields) to values(*fields, **related_fields) but the change is
> backward compatible.

Please don't post to both django-users and django-developers (and Trac x
2). We notice when patches are put on tickets. Otherwise it gets very
noisy for those of us monitoring all the sources.

Regards,
Malcolm

-- 
Honk if you love peace and quiet. 
http://www.pointy-stick.com/blog/


--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~--~~~~--~~--~--~---