Refactoring QuerySet

Malcolm Tredinnick Sun, 25 Feb 2007 04:13:25 -0800

I noticed today that a new GIS branch has been created, which includes a
goal to develop a custom QuerySet-derivative. There has also been a
discussion on trying to work out an API for aggregates, which seems to
be moving along, although I suspect the design has a bit more work
required yet. Michael Radziej has inquired about trying to integrate
custom SQL as a natural QuerySet result. That's possibly a bit
pie-in-the-sky for all cases, but we might be able to get close.


All of these (plus no doubt some other things I don't remember or know
about) are going to impact QuerySets. So I thought I just lob out my
plans/ideas for refactoring that part of the code prior to 1.0.

What I would like here is for anybody planning to play in this space to
have a think about whether their particular use-case is going to get
clearer harder to implement under this scheme, so that changes can be
made now.

No really urgent action required on this. I keep bashing away at the
code for this (and tweaking the design) whenever I have time. And I
realise everybody at PyCon already has their head full of other stuff. I
just want to inject the thoughts into peoples' brains since there are a
bunch of you sitting in Texas talking about "the future".

Apologies for the length. This makes Jacob's admin refactoring post look
short. Life sucks like that sometimes.

How QuerySets work at the moment
---------------------------------
By magic!

Seriously, the QuerySet class performs a few jobs: it constructs the
database query (SQL), retrieves the data by calling the appropriate
backend, and then provides an API for accessing the data in a
Python-friendly format.

There are really two portions that I want to separate here: the
mechanics of getting the data from the database (constructing the query,
particularly) and providing the Python interface via model instances,
etc.

When we construct an SQL query at the moment, it is mostly done by
pushing string fragments around inside a couple of functions in the
QuerySet class. This makes it very difficult to fiddle with query
construction because you need to get at the right string fragment at
just the right moment. You cannot go back later and tweak any portion of
the query. The obvious example of where this is painful is if we
implement slicing functionality for MS SQL Server: that database does
not have the non-SQL-standard LIMIT/OFFSET keywords. Instead, you
construct a sub-select in the FROM clause to get the equivalent
behaviour. I've no doubt screwed up the details slightly there, but the
important point here is that it's basically impossible without really
special-case handling ("if using_ms_sql: ...") to hijack the FROM
construction and not put in LIMIT/OFFSET bits. Oracle has similar, but
not quite as difficult to accommodate requirements for extracting an
ordered portion of the results, too.

What would the new (QuerySet) world order look like
----------------------------------------------------
What I propose to do is to split out the SQL construction from the body
of QuerySet and make it into a separate class. This class (let's call it
Query, for simplicity) would end up being a "query" attribute on
QuerySets and it's __str__ function would return the appropriate SQL
statement.

The Query class would not push string fragments around. Instead, it
would keep the parts of the SQL statement separated until asked to
combine them (in __str__, for example). So there would be attributes
called, say, select, tables, where, having, grouping, limit and offset.
Initially, "having" and "grouping" wouldn't be used by our current
functionality, but they're there for extending the Query class (and for
use by things like aggregates and custom tweaks).

The exact contents of these attributes isn't really important for the
purposes of this discussion, but they will be Python objects that you
can poke at and shuffle as much as you like. For example, the "select"
attribute would be a list of (column name, alias) pairs (where alias
could be None) and the "tables" attribute would be a (possibly nested)
list of (table name, alias, join type) triples. Since the table names
just need to be convertible to strings, it would be possible to use a
Query class for the table name, giving us a way to inject
SQL-Server-required views into the FROM clause.

The Query class is primarily responsible for generating the SQL string.
However, when I started writing code for this, it rapidly becomes
apparent that it will need to know the database engine it is coding for.
Tasks like constructing the right SQL for slicing (LIMIT/OFFSET vs
SQL-Server requirements), for example, need that knowledge to construct
the right string.

So, at some level, it might make sense to pass off *all* of the database
interaction to this class. The QuerySet class would create a Query
instance that contains the right information, tell it the database
engine, pass it the values for the parameters (maybe not all in the same
place) and then call "run_the_query_please()" and get back an iterator
over the results.

This helps on another, slightly crazier level, too: if you want to
replace the database-backed ORM with another type of storage, but were
happy to keep using the Django API for querying, all the SQL-specific
stuff is now in one place and you would only need to write your own
Query replacement or subclass. Not a driving requirement, but very much
nice to have, since not everything runs out of databases.

How does this help for custom SQL and extra QuerySet functionality?
-------------------------------------------------------------------
Writing custom SQL fragments now becomes possible, because you have
direct access to the individual pieces of the query before they are
combined into a string. I'm not saying it's trivial, but that's not the
goal. We try to make the common things easy and the hard things
possible. This is in the "hard things are possible" category.

Adding functionality to QuerySets as far as interfacing with the
databases via extra SQL is another motivation here. You would subclass
QuerySet and tell your subclass to use your subclass of Query for its
default query construction (probably overriding QuerySet.__init__ --
details to be determined, but they aren't important at this point
anyway, since there are about 6 ways to do it). Then you can make
whatever changes you want to a subclass (or even straight replacement)
of the Query class. For example, you could add a "sum()" method to the
derived QuerySet -- if we don't have it in the default version -- and
this sum() method would know what method to call in the Query subclass
to add the right things to the query.

To clarify that a little bit, suppose you create MyQuerySet and MyQuery
as subclasses of QuerySet and Query, respectively:

        MyQuerySet.sum() knows what method in MyQuery to call and with
        what information. It also knows how to present the result. Okay,
        sum() is a trivial example, since it's going to return a number,
        but, still, it's a number and not a model instance or a sequence
        and you could imagine cases where the return type is something
        more complex.
        
        MyQuery has a method (let's call it sum()) that knows to add a
        sum(*) to the select attribute and possibly something to the
        grouping attribute as well. That's all it does: modified the
        query based on the controls passed in from the caller.
        
Addressing Michael Radjiez's (amongst others) request for somehow
returning the results of custom queries and making it act like a real
QuerySet: I think this can be done, to some extent, by subclassing
QuerySet. Your subclass handles your custom Query modifications and also
supports the normal QuerySet interface, so a call to, say, filter() on
your custom QuerySet will work.

The hard part about integrating custom queries with QuerySets is because
we haven't evaluated your custom portion of the query at the time you
start calling other QuerySet methods on it. So the new methods need to
be able to extent the existing query that is being built up. At the
moment, we cannot do that, because the query construction is fairly dumb
about building up the query fragments and kind of stomps all over any
custom fragments you might want to inject. By keeping all the fragments
in the Query class and not combining them until the final __str__
conversion, it makes it easier to augment an existing query.

Is this really necessary?
-------------------------
I (and others, especially Russell) have spent quite a while living in
the current QuerySet code fixing bugs with query construction and
scratching our heads wondering how on earth we are going to fix some
other cases. For example, we often don't use the most efficient (or the
correct -- in some cases) type of table join when constructing queries.
Joining QuerySets with particular combinations of disjunctions and
conjunctions when common fields are involved leads to incorrect results.
Thinking about how to join in support for aggregates just gives me a
nose bleed, because constructing the GROUP BY and HAVING clauses is
fiddly. All of these problems are much easier to fix by restructuring
how queries are created -- they are, in fact, the motivation for this
design.

Based on periodic IRC visits and reading django-users, people often
try/want to insert custom output columns in the select portion of a
query, which are then linked to something custom in the where clause,
but we don't make that possible. You cannot currently do things like
insert a custom "tableA.column1 = tableB.column2" fragment in a where
clause. For anything custom like this, the (not totally unacceptable)
solution at the moment is to write out the whole thing as a custom SQL
statement. I think the above should make some of this a bit easier and
the rest no harder than it is now. This is another case where I'm not
sure it's a driving requirement for the changes, since we aren't trying
to be a completely isomorphic Python interface over SQL, so asking
people to drop down to SQL if they want to use SQL functionality isn't
evil. But it's a nice side-effect.

Things I think become easier this way
-------------------------------------
        - aggregates and other computed output columns
        - custom fragments in portions of a query (as opposed to writing
        a whole query for the database)
        - adding new query types (GIS?)
        - different database syntax (LIMIT/OFFSET in both MS SQL Server
        and Oracle)
        
I don't know of any announced plans/wishes that affect QuerySets that
would become more difficult under this change, so please sing out if you
know of any.

Regards,
Malcolm


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Refactoring QuerySet

Reply via email to