Re: Proposal: user-friendly API for multi-database support

Sean Stoops Wed, 10 Sep 2008 22:11:25 -0700

I just wanted to toss out one more use case that I know I would find
especially useful.  Legacy databases.


I'd like to see the ability to setup a model (using rough
introspection and then user tweaking) based on a legacy database,
maybe with the options of read-write or read-only access.

Like Robert, I'll have to give this another read through tomorrow
after a little sleep.

-Sean

On Sep 10, 5:27 pm, Robert Lofthouse <[EMAIL PROTECTED]> wrote:
> Wow, i'm with Malcolm on the "steady on, let's take this a bit at a
> time" :) Someone has been working hard, but that's a huge amount to
> think about.
>
> Reading through some of the comments: Most of the functionality
> described here would be an explicit mapping and choice made by the
> developer, it would be pretty hard to use any form of magic. I'm all
> for the setting up of bindings in a settings.py, which you later
> access and pass through whenever you do a select/update/delete. I'm
> also for per-model connections/connection overrides, with an option to
> override on a more general level at the (overseeing) application
> level.
>
> I'm against the idea of having functions in settings.py, but I don't
> see why they can't be discovered based on definitions in the settings
> file.
>
> Will have a proper read through tomorrow, as this thread is
> interesting.
>
> On Sep 10, 6:53 pm, Simon Willison <[EMAIL PROTECTED]> wrote:
>
> > For those who weren't at DjangoCon, here's the state of play with
> > regards to multi-db support: Django actually supports multiple
> > database connections right now: the Query class (in django/db/models/
> > sql/query.py) accepts a connection argument to its constructor, and
> > the QuerySet class (django/db/models/query.py) can be passed an
> > optional Query instance - if you don't pass it one, it will create a
> > Query that uses the default django.db.connection.
>
> > As a result, if you want to talk to a different connection you can do
> > it right now using a custom manager:
>
> > class MyManager(Manager):
>
> >     def get_query_set(self):
> >         query = sql.Query(self.model, my_custom_db_connection)
> >         return QuerySet(self.model, query)
>
> > As Malcolm described it, he's provided the plumbing, now we need to
> > provide the porcelain in the form of a powerful, user-friendly API to
> > this stuff. Here's my first attempt at an API design.
>
> > Requirements
> > ============
>
> > There are a number of important use-cases for multi-db support:
>
> > * Simple master-slave replication: SELECT queries are distributed
> >   between slaves, while UPDATE and DELETE statements are sent to
> >   the master.
> > * Different Django applications live on different databases, e.g.
> >   a forum on one database while blog lives on another.
> > * Moving data between different databases - it would be useful if
> >   you could do this using the ORM to help paper over the
> >   differences in SQL syntax.
> > * Sharding: data in a single Django model is distributed across
> >   multiple databases depending on some kind of heuristic (e.g. by
> >   user ID, or with archived content moved to a different server)
> > * Replication tricks, for example if you use MySQL's InnoDB for
> >   your data but replicate to a MyISAM server somewhere so you can
> >   use MySQL full-text indexing to search (Flickr used to do this).
>
> > I've probably missed some; please feel free to fill in the gaps.
>
> > We don't need to solve every problem, but we do need to provide
> > obvious hooks for how those problems should be solved. Sharding, for
> > example, is extremely application specific. I don't think Django
> > should automatically shard your data for you if you specify 'sharding
> > = True' on a model class, but it should provide documented hooks for
> > making a custom decision on which database connection should be used
> > for a query that allow sharding to be implemented without too much
> > pain.
>
> > Different applications on different databases on the other hand is
> > something Django should support out of the box. Likewise, master-slave
> > replication is common enough that it would be good to solve it with as
> > few lines of user-written code as possible (it's the first step most
> > people take to scale their database after adding caching - and it's a
> > sweet spot for the kind of read-heavy content sites that Django is
> > particularly suited for).
>
> > Proposed API
> > ============
>
> > Here's my first attempt at describing a user-facing API.
>
> > First, we need a way of specifying multiple database connections.
> > Adrian has already expressed an interest in moving to DSNs rather than
> > individual settings, so I suggest something like this:
>
> > DATABASES = {
> >     'default': 'mysql://foo:[EMAIL PROTECTED]/baz',
>
> > }
>
> > With multiple databases configured this could be:
>
> > DATABASES = {
> >     'master': 'mysql://foo:[EMAIL PROTECTED]/mydb',
> >     'slave1': 'mysql://foo:[EMAIL PROTECTED]/mydb',
> >     'slave2': 'mysql://foo:[EMAIL PROTECTED]/mydb',
> >     'archive': 'mysql://foo:[EMAIL PROTECTED]/mydb',
> >     'default': 'master',
>
> > }
>
> > There are two types of connection string - DSNs and aliases. A DSN
> > contains '://' while an alias does not. Aliases can be used even
> > within the DATABASES setting itself, as with 'default' in the above
> > example.
>
> > It should be possible to use a DSN that has not been defined in the
> > DATABASES setting. As a result, I propose that anywhere in Django that
> > accepts a connection alias should also accept a DSN or even a raw DB-
> > API compliant connection object.
>
> > The QuerySet.using() method
> > ---------------------------
>
> > Next, we need a way of telling Django which connection to use. I
> > propose a new queryset method as the lowest level way of doing this,
> > called 'using':
>
> > qs = Article.objects.filter(published__lt = ...).using('archive')
>
> > "using(alias_or_connection_or_dsn)" simply tells the QuerySet to
> > execute against a different connection, by updating its
> > internal .connection attribute.
>
> > Other options for this method name include:
>
> > with_db()
> > with_connection()
>
> > I preferred "using()" as it reads nicely and doesn't contain an
> > underscore.
>
> > using() represents the lowest level user-facing API. We can cover a
> > common case (different applications on different databases) with the
> > following:
>
> > class Article(models.Model):
> >     ...
> >     class Meta:
> >         using = 'articles'
>
> > This means "use the articles connection for all queries originating
> > with this model". I'm repurposing the term 'using' here for API
> > consistency.
>
> > Advanced connection selection
> > -----------------------------
>
> > All of the other above use-cases boil down to one key decision: given
> > a particular database query, which database connection should I
> > execute the query against?
>
> > I propose adding a manager method which is called every time that
> > decision is made, and which is designed to be over-ridden by advanced
> > users. Here's the default implementation:
>
> > class Manager:
> >     ...
> >     def get_connection(self, query):
> >         from django.db import connection
> >         return connection # Use the default connection for everything
>
> > Here's an implementation which implements very simple master-slave
> > replication:
>
> > class Manager:
> >     ...
> >     def get_connection(self, query):
> >         if isinstance(query, (InsertQuery, DeleteQuery, UpdateQuery)):
> >             return 'master'
> >         else:
> >             return 'slave'
> >             # Or if we have more than one slave:
> >             return random.choice(['slave1', 'slave2']) # Footnote [1]
>
> > The above would be even easier if InsertQuery, DeleteQuery and
> > UpdateQuery were all subclasses of a ModificationQuery class (they are
> > currently all direct subclasses of Query) - then the check could
> > simply be:
>
> >     if isinstance(query, ModificationQuery)
>
> > We could even ship a MasterSlaveManager that implements a variant of
> > the above logic in django.contrib.masterslave (more for educational
> > and marketing purposes than because it's something that's hard to
> > implement).
>
> > Note that in my above example get_connection() methods one returns an
> > actual connection object while the other returns a connection alias.
> > This makes for a more convenient API, and is consistent with my above
> > suggestion that DSNs, aliases and connection objects should be
> > interchangeable.
>
> > Since the get_connection method has access to the full query object,
> > even complex sharding schemes based on criteria such as the individual
> > fields being looked up in the query could be supported reasonably
> > well.
>
> > Dealing with single queries that span multiple databases
> > --------------------------------------------------------
>
> > Once you have different tables living in different databases there's
> > always the chance that someone will try to write a query that attempts
> > to join tables that live on two different database servers. I don't
> > think we should address this problem at all (aside from maybe
> > attempting to throw a descriptive error message should it happen) - if
> > you're scaling across different servers you need to be aware of the
> > limitations of that approach.
>
> > That said, databases like MySQL actually do allow cross-database joins
> > provided both databases live on the same physical server. Is this
> > something we should support? I'd like to say "no" and assume that
> > people who need to do that will be happy rolling their own SQL using a
> > raw cursor, but maybe I'm wrong and it's actually a common use case.
>
> > Connection pooling
> > ------------------
>
> > This is where I get completely out of my depth, but it seems like we
> > might need to implement connection pooling at some point since we are
> > now maintaining multiple connections to multiple databases. We could
> > roll our own solution here, but to my knowledge SQLAlchemy has a solid
> > connection pool implementation which is entirely separate from the
> > rest of the SQLAlchemy ORM. We could just ensure that if someone needs
> > connection pooling there's a documented way of integrating the
> > SQLAlchemy connection pool with Django - that way we don't have an
> > external dependency on SQL Alchemy for the common case but people who
> > need connection pools can still have them.
>
> > Backwards compatibility
> > -----------------------
>
> > I think we can do all of
>
> ...
>
> read more »

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Re: Proposal: user-friendly API for multi-database support

Reply via email to