[GSOC] Multiple Database API proposal

Alex Gaynor Fri, 20 Mar 2009 06:46:02 -0700

Hello all,

To those who don't know me, I'm a freshman computer science student at
Rensselaer Polytechnic Institute in Troy, New York.  I'm on the mailing lists
quite a bit, so you may have seen me around.


A Multiple Database API For Django
==================================

Django currently has the low-level hooks necessary for multiple database
support, but it doesn't have a high-level API for using them, nor any
supporting infrastructure, documentation, or tests.  The purpose of this
project would be to implement the high-level API necessary for the use of
multiple databases in Django, along with the requisite documentation and tests.

There have been several previous proposals and implementations of
multiple-database support in Django, none of which has been complete or has
gained sufficient traction in the community to be included in Django itself.
As such, this proposal will specifically address some of the reasons for past
failures, and their remedies.

The API
-------

First there is the API for defining multiple connections.  A new setting,
``DATABASES`` (or something similar), will be created.  It is a dictionary
mapping each database alias (an internal name) to a dictionary containing the
current ``DATABASE_*`` settings:

.. sourcecode:: python

    DATABASES = {
        'default': {
            'DATABASE_ENGINE': 'postgresql_psycopg2',
            'DATABASE_NAME': 'my_data_base',
            'DATABASE_USER': 'django',
            'DATABASE_PASSWORD': 'super_secret',
        },
        'user': {
            'DATABASE_ENGINE': 'sqlite3',
            'DATABASE_NAME': '/home/django_projects/universal/users.db',
        }
    }

A database with the alias ``default`` will be the default connection (it will
be used if no other one is specified for a query) and will be the direct
replacement for the ``DATABASE_*`` settings.  In compliance with Django's
deprecation policy, the ``DATABASE_*`` settings will automatically be handled
as if they were defined in the ``DATABASES`` dict for at least 2 releases.
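
As a rough sketch of how that compatibility shim might behave: if ``DATABASES``
is absent, it is built from whatever ``DATABASE_*`` settings are present.  The
helper name ``databases_from_legacy_settings`` and the plain dict standing in
for the settings module are assumptions made for illustration, not existing
Django APIs.

```python
# The legacy per-project database settings that would be folded into 'default'.
LEGACY_KEYS = (
    'DATABASE_ENGINE', 'DATABASE_NAME', 'DATABASE_USER',
    'DATABASE_PASSWORD', 'DATABASE_HOST', 'DATABASE_PORT',
)


def databases_from_legacy_settings(settings):
    """Return a DATABASES-style dict for a dict of settings (hypothetical helper)."""
    if 'DATABASES' in settings:
        # New-style configuration wins if the project already defines it.
        return settings['DATABASES']
    legacy = {key: settings[key] for key in LEGACY_KEYS if key in settings}
    return {'default': legacy}


old_style = {'DATABASE_ENGINE': 'sqlite3', 'DATABASE_NAME': 'dev.db'}
print(databases_from_legacy_settings(old_style))
```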

Next, a ``connections`` object will be implemented in ``django.db``, analogous
to the ``django.db.connection`` object.  ``connections`` will be a
dictionary-like object that is subscripted by database alias and lazily
returns a connection to the database.  ``django.db.connection`` will remain
(at least for the present; its ultimate state will be decided by community
consensus) and merely proxy to ``django.db.connections['default']``.  Using
the previously defined database setting, this might be used as:

.. sourcecode:: python

    from django.db import connections

    conn = connections['user']
    c = conn.cursor()
    c.execute("""SELECT 1""")
    results = c.fetchall()
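
The "lazily returns a connection" behaviour could look roughly like the
following.  This is only a sketch: the class name ``LazyConnections`` and its
``connect`` factory argument are hypothetical, and ``sqlite3`` in-memory
databases stand in for real backends.

```python
import sqlite3


class LazyConnections:
    """Dict-like object that opens each connection on first access (sketch)."""

    def __init__(self, databases, connect):
        self._databases = databases   # alias -> settings dict
        self._connect = connect       # hypothetical factory: settings dict -> connection
        self._open = {}               # alias -> connection created so far

    def __getitem__(self, alias):
        if alias not in self._databases:
            raise KeyError("Unknown database alias: %r" % alias)
        if alias not in self._open:
            # Connect only when the alias is first asked for.
            self._open[alias] = self._connect(self._databases[alias])
        return self._open[alias]

    def __iter__(self):
        return iter(self._databases)


connections = LazyConnections(
    {'default': {'DATABASE_NAME': ':memory:'},
     'user': {'DATABASE_NAME': ':memory:'}},
    connect=lambda conf: sqlite3.connect(conf['DATABASE_NAME']),
)
cursor = connections['user'].cursor()
cursor.execute("SELECT 1")
rows = cursor.fetchall()
```

Only the ``user`` connection is opened here; ``default`` stays untouched until
something subscripts it.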

Now that there is the necessary infrastructure to accompany the very low-level
plumbing, we need our actual API.  The high-level API will have 2 components.
First, there will be a ``using()`` method on ``QuerySet`` and ``Manager``
objects.  This method simply takes an alias to a connection (and possibly a
connection object itself, to allow for dynamic database usage) and makes that
the connection that will be used for that query.  Second, a new option will
be created in the inner ``Meta`` class of models.  This option will be named
``using`` and will specify the default connection to use for all queries
against this model, overriding the default specified in the settings:

.. sourcecode:: python

    class MyUser(models.Model):
        ...
        class Meta:
            using = 'user'

    # this queries the 'user' database
    MyUser.objects.all()
    # this queries the 'default' database
    MyUser.objects.using('default')
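
To make the lookup order concrete, here is a toy model of it that runs without
Django.  ``ToyQuerySet`` and its ``db`` property are invented for this sketch;
the real ``QuerySet`` would of course carry far more state.

```python
class ToyQuerySet:
    """Stand-in for QuerySet that only tracks which alias it would query."""

    def __init__(self, model, db=None):
        self.model = model
        self._db = db  # None means "fall back to the model's Meta.using"

    def using(self, alias):
        # Return a new queryset rather than mutating this one, in the same
        # spirit as filter() and the other chaining methods.
        return ToyQuerySet(self.model, db=alias)

    @property
    def db(self):
        if self._db is not None:
            return self._db
        return getattr(self.model.Meta, 'using', 'default')


class MyUser:
    class Meta:
        using = 'user'


class Article:
    class Meta:
        pass


print(ToyQuerySet(MyUser).db)                    # user
print(ToyQuerySet(MyUser).using('default').db)   # default
print(ToyQuerySet(Article).db)                   # default
```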

Lastly, various plumbing will need to be updated to reflect the new multidb
API, such as transactions, savepoints, management commands, etc.

More Advanced Usage
-------------------

While the above two methods are, strictly speaking, sufficient, they require
the user to write lots of boilerplate code in order to implement advanced
multi-database strategies such as replication and sharding.  Therefore we also
introduce the concept of ``DatabaseManagers``, not to be confused with
Django's current managers.  ``DatabaseManagers`` are classes that define which
connection should be used for a given query.  There are 2 levels at which to
specify what ``DatabaseManager`` to use: as a setting, and at the class level.
For example, in one's settings.py one might have:

.. sourcecode:: python

    DEFAULT_DB_MANAGER = 'django.db.multidb.round_robin.Random'

This tells Django that for each query it should use the ``DatabaseManager``
specified at that location, unless it is overridden by the ``using`` Meta
option or the ``using()`` method.
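
Turning a dotted-path string like that into a class could be done along these
lines.  ``load_class`` is a hypothetical helper name, and because the proposed
``django.db.multidb`` module does not exist yet, the usage line resolves a
standard library class instead.

```python
from importlib import import_module


def load_class(dotted_path):
    """Import and return the object named by 'package.module.ClassName'."""
    module_path, _, class_name = dotted_path.rpartition('.')
    module = import_module(module_path)
    return getattr(module, class_name)


# Stand-in path; a project would pass settings.DEFAULT_DB_MANAGER here.
manager_class = load_class('collections.OrderedDict')
# Instantiated with no arguments, as a project-wide default would be.
default_manager = manager_class()
```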

The more granular way to use ``DatabaseManagers`` is to provide them, in place
of a string, as the ``using`` Meta option.  Here we pass an instance of the
class we want to use:

.. sourcecode:: python

    class MyModel(models.Model):
        class Meta:
            using = Random(['my_db1', 'my_db2', 'my_db3'])

At this level it can still be overridden by the explicit usage of the
``using()`` method.

But how exactly do ``DatabaseManagers`` work?  Let's start with an example:

.. sourcecode:: python

    import random

    from django.conf import settings

    class Random(DatabaseManager):
        def __init__(self, dbs=None):
            # Default to every configured alias when none are given.
            self.dbs = list(dbs) if dbs is not None else list(settings.DATABASES)

        def select(self, cls, **params):
            # Pick one of the configured databases at random for each read.
            return random.choice(self.dbs)

        def create(self, cls, **params):
            raise TypeError("Random database manager is intended only for reads")

        def update(self, cls, **params):
            raise TypeError("Random database manager is intended only for reads")

Basically we have 3 methods on a ``DatabaseManager``, plus the ``__init__``
method.  ``__init__`` should be callable with no parameters if you want to
make the class the default for your project.  ``select()``, ``create()``, and
``update()`` each take the class of the model that the query is for, plus
``**params``.  It has yet to be determined what params should be passed; ideas
include:

 * The ``Query`` object for the ``QuerySet`` in question.
 * The ``WhereNode`` for the ``Query`` object.
 * others...
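
As a second illustration of the interface, here is a sketch of a manager that
supports writes as well as reads.  The ``DatabaseManager`` base class body, the
``PrimaryReplica`` class, and the ``resolve_alias`` helper are all hypothetical;
they only assume that each method returns a database alias, as ``Random`` does
above.

```python
import random


class DatabaseManager:
    """Hypothetical base class: each method returns a database alias."""

    def select(self, cls, **params):
        raise NotImplementedError

    def create(self, cls, **params):
        raise NotImplementedError

    def update(self, cls, **params):
        raise NotImplementedError


class PrimaryReplica(DatabaseManager):
    """Send writes to one primary and spread reads over the replicas."""

    def __init__(self, primary='default', replicas=None):
        self.primary = primary
        # With no replicas configured, reads go to the primary as well.
        self.replicas = list(replicas) if replicas else [primary]

    def select(self, cls, **params):
        return random.choice(self.replicas)

    def create(self, cls, **params):
        return self.primary

    def update(self, cls, **params):
        return self.primary


def resolve_alias(using, operation, cls, **params):
    """Accept either a plain alias or a DatabaseManager as ``using``."""
    if isinstance(using, str):
        return using
    return getattr(using, operation)(cls, **params)


manager = PrimaryReplica('default', ['replica1', 'replica2'])
print(resolve_alias(manager, 'create', object))
print(resolve_alias('user', 'select', object))
```

Keeping ``resolve_alias`` separate means the ``using`` Meta option can hold
either a string or a manager instance without the calling code caring which.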


Plan of Action
--------------
1) Implement the ``connections`` object. -- 1 day
2) Alter the relevant management commands and anything else to use all
   connections or ``django.db.connections['default']``, depending on which is
   appropriate. -- 1 week
3) Implement the method tracking (command pattern). -- 1 week
4) Implement the ``using()`` method and the ``using`` inner ``Meta`` option.
   -- 1 week
5) Write initial tests and docs.  The rest will be written as features are
   implemented, but a large initial set needs to be written. -- 3 weeks
6) Fix up transaction support, the close database signal, and anything else in
   transactions.py. -- 2 weeks
7) Add support for the ``DatabaseManager`` for more complex setups. -- 2 weeks
8) Time permitting, implement a few common replication patterns.

All of these times are fairly aggressive, and there are about 2 weeks to
spare, so those can be used as necessary, or for item 8.

Hurdles
-------

The following is a list of possible technical issues:

 * In ``django.db.models.sql.query.Query``, are any tests done on what the
   connection is before the actual SQL construction phase?  If so, these need
   to be changed not to do this, since the connection might change at some
   point after that test.  If this can't be done, then ``using()`` needs to be
   the first method called on a ``QuerySet``, or at a minimum called before
   any methods that do such testing.  Further, if these tests can't be put
   off, then the only option is a callback that's called right when the first
   ``Query`` object is constructed.  This means Django won't know what type of
   query it would be, rendering the ``DatabaseManager`` impossible.
 * Will models need to know which database they came from so that they can be
   saved back correctly?
 * Does ``Model.save()`` need to take a ``using`` parameter so new objects can
   be created on a specific database or saved to a new database?
 * Backends that use custom query classes: will we need a ``from_query``
   classmethod to transform them?  This would require all backends to store
   and use information that is basically less than or equal to what the
   ``Query`` object stores.  Also, there needs to be the reverse, a way to go
   from a custom ``Query`` object back to either the Django default or some
   other custom ``Query`` object.
 * Foreign keys will basically be handled en passant because of how they are
   implemented, but many-to-many fields will require more thought, especially
   since that SQL isn't in the ``Query`` class.

Solutions
---------

The greatest hurdle is changing the connection after we already have our
``Query`` partly created.  The issues here include that we might have done
tests against ``connection.features`` already, and that we might need to
switch either to or from a custom ``Query`` object.  One possible solution
that is very powerful (though quite inelegant) is to have the ``QuerySet``
keep track of all public API method calls against it and what parameters they
took.  Then, when the ``connection`` is changed, it will recreate the
``Query`` object by creating a "blank" one with the new connection and
reapplying all the methods it has stored.  This is basically a simple
implementation of the command pattern.
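
A minimal sketch of that record-and-replay idea, using toy classes rather than
Django's real ``QuerySet`` and ``Query`` (``ToyQuery`` and
``RecordingQuerySet`` are invented names, and only ``filter()`` is tracked):

```python
class ToyQuery:
    """Stand-in for Query: remembers its connection and applied filters."""

    def __init__(self, connection):
        self.connection = connection
        self.filters = []

    def add_filter(self, **kwargs):
        self.filters.append(kwargs)


class RecordingQuerySet:
    """Records public method calls so they can be replayed on a new Query.

    For brevity this toy mutates itself in place; a real QuerySet would
    return a clone from each method.
    """

    def __init__(self, connection='default'):
        self._calls = []                  # recorded (method name, kwargs) pairs
        self.query = ToyQuery(connection)

    def filter(self, **kwargs):
        self._calls.append(('filter', kwargs))
        self.query.add_filter(**kwargs)
        return self

    def using(self, connection):
        # Start from a blank Query on the new connection, then replay every
        # recorded call; replaying re-records the calls as it goes.
        recorded, self._calls = self._calls, []
        self.query = ToyQuery(connection)
        for name, kwargs in recorded:
            getattr(self, name)(**kwargs)
        return self


qs = RecordingQuerySet().filter(name='alex').filter(active=True).using('user')
print(qs.query.connection, qs.query.filters)
```

Because the replay goes through the same public methods, a second ``using()``
call later in the chain would rebuild the query again in the same way.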


I'm here soliciting feedback on both the API and any potential hurdles I may
have missed.

Thanks,
Alex

-- 
"I disapprove of what you say, but I will defend to the death your right to
say it." --Voltaire
"The people's good is the highest law."--Cicero
