Re: [GSOC] Multiple Database API proposal

2009-03-21 Thread Ivan Sagalaev

Alex Gaynor wrote:
> 8) Time permitting implement a few common replication patterns.

I'm not very happy with this point.

To me replication is a major use-case. I suspect most people who move
beyond a single-server setup and beyond 10'000 - 20'000 visitors realize
that replication should just be in place, ensuring performance and
redundancy. In my experience other multi-DB patterns (those covered
by `using()` and Meta attributes on models) are just *less* common in
practice. So I consider leaving replication to "time permitting" a mistake.

On the other hand, maybe all this work won't break mysql_replicated and
I'll just have to update it to the new db backend interface. There may
be non-trivial things to work out, though, such as having separate
master-slave pairs for each data shard.
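
To make the per-shard master-slave point concrete, here is a minimal Python sketch. The `SHARDS` table, the alias names and `pick_alias()` are all hypothetical; none of this comes from Django or from mysql_replicated, and sharding by `user_id` modulo the shard count is just an assumption for the example.

```python
import random

# Hypothetical layout: every shard has its own master and its own slaves.
SHARDS = {
    "shard0": {"master": "db0-master", "slaves": ["db0-slave1", "db0-slave2"]},
    "shard1": {"master": "db1-master", "slaves": ["db1-slave1"]},
}


def pick_alias(user_id, for_write, rng=random):
    """Pick a connection alias for one query (illustrative only)."""
    shard = SHARDS["shard%d" % (user_id % len(SHARDS))]
    # Writes always go to the shard's master; reads may use any of its slaves.
    if for_write or not shard["slaves"]:
        return shard["master"]
    return rng.choice(shard["slaves"])
```

The point of the sketch is that the routing decision needs two inputs at once (which shard, and read or write), which is more than a single global master-slave pair has to handle.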

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~--~~~~--~~--~--~---



Re: [GSOC] Multiple Database API proposal

2009-03-20 Thread Alex Gaynor
On Sat, Mar 21, 2009 at 1:25 AM, Malcolm Tredinnick <
malc...@pointy-stick.com> wrote:

>
> On Sat, 2009-03-21 at 00:41 -0400, Alex Gaynor wrote:
> >
> >
> > > One suggestion Eric Florenzano had was that we go above and
> > beyond
> > > just storing the methods and parameters, we don't even
> > excecute them
> > > at all until absolutely necessary.
> >
> >
> > Excuse me for a moment whilst I add Eric to a special list
> > I've been
> > keeping. He's trying to make trouble.
> >
> > Ok, back now... There are at least two problems with this.
> >
> > (a) Backwards incompatible in that some querysets would return
> > noticeably different results before and after that change. It
> > would be
> > subtle, quiet and very difficult to detect without auditing
> > every line
> > of code that contributes to a queryset. The worst kind of
> > change for us
> > to make from the perspective of the users.
> >
> > What scenario does it return different results, the one place I can
> > think of is:
> >
> > query = queryset.order_by('I AM NOT A REAL FIELD, HAHA')
> > render_to_response('template.html', {'q': query})
> >
> > which would raise an exception in the template instead of in the view.
>
> It's related to eager/deferred argument evaluation (which is done for
> the same reasons): any "smart" object like Q objects would require
> changing to handle deferring things correctly. They can currently be
> designed to evaluate only once and will work correctly.
>

I don't see this as an issue, simply because whatever happens in the
instantiation of these objects would be the same for whatever connection was
in use.


>
> >
> >
> > (b) Intentionally not done right now and not because I'm
> > whimsical and
> > arbitrary (although I am). The problem is it requires storing
> > all sorts
> > of arbitrarily complex Python objects. Which breaks pickling,
> > which
> > breaks caching. People tend to complain, a lot, about that
> > last bit.
> >
> > That's why the Where.add() converts things to more basic types
> > when they
> > are added (via a filter() command).  If somebody really needs
> > lazily
> > evaluated parameters, it's easy enough via a custom Q-like
> > object, but
> > so far nobody has asked for that if they've gotten stuck doing
> > it. It's
> > even something we could consider adding to Django, although
> > it's not a
> > no-brainer given the potential to break caching.
> >
> > I vaguely recall there being a ticket about this that you wontfixed,
> > although that may have been about defering calling callables :).  In
> > any event the caching issue was one I hadn't considered, although one
> > solution would be not to pickle it with the ability to switch to a
> > different query type, it's a bit of a strange restriction, but I don't
> > think it's one that would practically affect people, and it's less
> > restricitive.
>
> You wrote a really long sentence there that didn't make a lot of sense
> (too many prepositions and commas, not enough nouns and full stops).
> Unclear which restriction you're arguing against, but the picklability
> of querysets is pretty much a requirement. It's something people really
> use.
>
> However, before we go too far down this path: this is a very minor
> thing. It's unlikely to be required. Adding it "because we can" is an
> argument Eric can propose at some much later date if it's not absolutely
> *required* for multi-db stuff. I think we won't need to worry about this
> at all.
>

Just to clear that up, what I was saying was:

When you pickle a QuerySet, we build up the entire Query as we would right
before SQL execution and then just pickle that.  The restriction is then
that you can't change the database type used by an unpickled query.
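
A toy sketch of that idea in Python, under loud assumptions: `RecordedQuery`, its `ops` replay log and its `using()` method are invented for illustration and are not Django's Query or QuerySet. The built state (plain strings and values) is what gets pickled; the replay log is dropped, so the backend can no longer be switched afterwards.

```python
class RecordedQuery:
    """Toy query: built clauses plus a replay log of the calls that made them."""

    def __init__(self, backend):
        self.backend = backend
        self.where = []    # built clauses, plain strings
        self.params = []   # plain parameter values
        self.ops = []      # replay log; None once the query has been pickled

    def filter(self, **conditions):
        clone = RecordedQuery(self.backend)
        clone.where = list(self.where)
        clone.params = list(self.params)
        clone.ops = None if self.ops is None else list(self.ops)
        for name, value in sorted(conditions.items()):
            clone.where.append("%s = %%s" % name)
            clone.params.append(value)
        if clone.ops is not None:
            clone.ops.append(("filter", conditions))
        return clone

    def using(self, backend):
        # Switching backends means replaying every recorded call on a blank query.
        if self.ops is None:
            raise ValueError("replay log was dropped by pickling; backend is fixed")
        fresh = RecordedQuery(backend)
        for method, kwargs in self.ops:
            fresh = getattr(fresh, method)(**kwargs)
        return fresh

    def __getstate__(self):
        # Pickle only the built, basic-typed state; drop the replay log.
        state = self.__dict__.copy()
        state["ops"] = None
        return state
```

With this, `pickle.loads(pickle.dumps(q))` gives back a query that can still be filtered further but raises `ValueError` on `using()`, which is the restriction described above.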


>
> >
> >
> > [...]
> > >
> > > Thanks for all the review Malcolm.
> >
> >
> > No problems.
> >
> > > One question that I didn't really ask in the initial post is
> > what
> > > parameters should a "DatabaseManager" receieve on it's
> > methods, one
> > > suggestion is the Query object, since that gives the use the
> > maximal
> > > amount of information,, however my concerns there are that
> > it's not a
> > > public API, and having a private API as a part of the public
> > API feels
> > > klunky.
> >
> >
> > At first glance, I believe the word you're looking for is
> > "wrong". :-)
> >
> > Yes, that's the one.
> >
> >
> > Definitely a valid concern.
> >
> > >   OTOH there isn't really another data structure that
> > carries around
> > > the information someone writing their sharding logic(or
> > 

Re: [GSOC] Multiple Database API proposal

2009-03-20 Thread Malcolm Tredinnick

On Sat, 2009-03-21 at 00:41 -0400, Alex Gaynor wrote:
> 
> 
> > One suggestion Eric Florenzano had was that we go above and
> beyond
> > just storing the methods and parameters, we don't even
> excecute them
> > at all until absolutely necessary.
> 
> 
> Excuse me for a moment whilst I add Eric to a special list
> I've been
> keeping. He's trying to make trouble.
> 
> Ok, back now... There are at least two problems with this.
> 
> (a) Backwards incompatible in that some querysets would return
> noticeably different results before and after that change. It
> would be
> subtle, quiet and very difficult to detect without auditing
> every line
> of code that contributes to a queryset. The worst kind of
> change for us
> to make from the perspective of the users.
> 
> What scenario does it return different results, the one place I can
> think of is:
> 
> query = queryset.order_by('I AM NOT A REAL FIELD, HAHA')
> render_to_response('template.html', {'q': query})
> 
> which would raise an exception in the template instead of in the view.

It's related to eager/deferred argument evaluation (which is done for
the same reasons): any "smart" object like Q objects would require
changing to handle deferring things correctly. They can currently be
designed to evaluate only once and will work correctly.

>  
> 
> (b) Intentionally not done right now and not because I'm
> whimsical and
> arbitrary (although I am). The problem is it requires storing
> all sorts
> of arbitrarily complex Python objects. Which breaks pickling,
> which
> breaks caching. People tend to complain, a lot, about that
> last bit.
> 
> That's why the Where.add() converts things to more basic types
> when they
> are added (via a filter() command).  If somebody really needs
> lazily
> evaluated parameters, it's easy enough via a custom Q-like
> object, but
> so far nobody has asked for that if they've gotten stuck doing
> it. It's
> even something we could consider adding to Django, although
> it's not a
> no-brainer given the potential to break caching.
> 
> I vaguely recall there being a ticket about this that you wontfixed,
> although that may have been about defering calling callables :).  In
> any event the caching issue was one I hadn't considered, although one
> solution would be not to pickle it with the ability to switch to a
> different query type, it's a bit of a strange restriction, but I don't
> think it's one that would practically affect people, and it's less
> restricitive.

You wrote a really long sentence there that didn't make a lot of sense
(too many prepositions and commas, not enough nouns and full stops).
Unclear which restriction you're arguing against, but the picklability
of querysets is pretty much a requirement. It's something people really
use.

However, before we go too far down this path: this is a very minor
thing. It's unlikely to be required. Adding it "because we can" is an
argument Eric can propose at some much later date if it's not absolutely
*required* for multi-db stuff. I think we won't need to worry about this
at all.

>  
> 
> [...]
> >
> > Thanks for all the review Malcolm.
> 
> 
> No problems.
> 
> > One question that I didn't really ask in the initial post is
> what
> > parameters should a "DatabaseManager" receieve on it's
> methods, one
> > suggestion is the Query object, since that gives the use the
> maximal
> > amount of information,, however my concerns there are that
> it's not a
> > public API, and having a private API as a part of the public
> API feels
> > klunky.
> 
> 
> At first glance, I believe the word you're looking for is
> "wrong". :-)
> 
> Yes, that's the one.
>  
> 
> Definitely a valid concern.
> 
> >   OTOH there isn't really another data structure that
> carries around
> > the information someone writing their sharding logic(or
> whatever other
> > scheme they want to implement) who inevitably want to have.
> 
> 
> Two solutions spring to mind, although I haven't thought this
> through a
> lot: it's not particularly germane to the proposal since it's
> something
> we can work out a bit later on. I've got limited time
> today(something
> about a beta release coming up), so I wanted to just get out
> responses
> to the two people who posted items for discussion. I suspect
> there's a
> lot of thinking n

Re: [GSOC] Multiple Database API proposal

2009-03-20 Thread Alex Gaynor
>
> > One suggestion Eric Florenzano had was that we go above and beyond
> > just storing the methods and parameters, we don't even excecute them
> > at all until absolutely necessary.
>
> Excuse me for a moment whilst I add Eric to a special list I've been
> keeping. He's trying to make trouble.
>
> Ok, back now... There are at least two problems with this.
>
> (a) Backwards incompatible in that some querysets would return
> noticeably different results before and after that change. It would be
> subtle, quiet and very difficult to detect without auditing every line
> of code that contributes to a queryset. The worst kind of change for us
> to make from the perspective of the users.
>

In what scenario does it return different results? The one place I can think
of is:

query = queryset.order_by('I AM NOT A REAL FIELD, HAHA')
render_to_response('template.html', {'q': query})

which would raise an exception in the template instead of in the view.
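
The difference can be sketched outside Django. In this minimal Python example `EagerQuery`, `DeferredQuery` and `KNOWN_FIELDS` are hypothetical stand-ins, not Django classes: the eager version validates when `order_by()` is called (the view), the deferred version only when the query is finally evaluated (the template).

```python
KNOWN_FIELDS = {"name", "joined"}


class EagerQuery:
    """Validates each call immediately, at the point where it is made."""

    def __init__(self):
        self.ordering = []

    def order_by(self, field):
        if field not in KNOWN_FIELDS:
            raise ValueError("unknown field: %r" % field)
        self.ordering.append(field)
        return self


class DeferredQuery:
    """Only records calls; validation waits until evaluate() runs."""

    def __init__(self):
        self.pending = []

    def order_by(self, field):
        self.pending.append(field)
        return self

    def evaluate(self):
        # Replay the recorded calls; a bad field name surfaces only here.
        query = EagerQuery()
        for field in self.pending:
            query.order_by(field)
        return query.ordering
```

Here `DeferredQuery().order_by('bogus')` succeeds silently and the `ValueError` appears only from `evaluate()`, wherever that happens to be called.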


>
> (b) Intentionally not done right now and not because I'm whimsical and
> arbitrary (although I am). The problem is it requires storing all sorts
> of arbitrarily complex Python objects. Which breaks pickling, which
> breaks caching. People tend to complain, a lot, about that last bit.
>
> That's why the Where.add() converts things to more basic types when they
> are added (via a filter() command).  If somebody really needs lazily
> evaluated parameters, it's easy enough via a custom Q-like object, but
> so far nobody has asked for that if they've gotten stuck doing it. It's
> even something we could consider adding to Django, although it's not a
> no-brainer given the potential to break caching.
>

I vaguely recall there being a ticket about this that you wontfixed,
although that may have been about deferring calling callables :).  In any
event the caching issue was one I hadn't considered, although one solution
would be not to pickle it with the ability to switch to a different query
type, it's a bit of a strange restriction, but I don't think it's one that
would practically affect people, and it's less restrictive.


>
> [...]
> >
> > Thanks for all the review Malcolm.
>
> No problems.
>
> > One question that I didn't really ask in the initial post is what
> > parameters should a "DatabaseManager" receieve on it's methods, one
> > suggestion is the Query object, since that gives the use the maximal
> > amount of information,, however my concerns there are that it's not a
> > public API, and having a private API as a part of the public API feels
> > klunky.
>
> At first glance, I believe the word you're looking for is "wrong". :-)
>

Yes, that's the one.


>
> Definitely a valid concern.
>
> >   OTOH there isn't really another data structure that carries around
> > the information someone writing their sharding logic(or whatever other
> > scheme they want to implement) who inevitably want to have.
>
> Two solutions spring to mind, although I haven't thought this through a
> lot: it's not particularly germane to the proposal since it's something
> we can work out a bit later on. I've got limited time today(something
> about a beta release coming up), so I wanted to just get out responses
> to the two people who posted items for discussion. I suspect there's a
> lot of thinking needed here about the concept as a whole and I want to
> do that. Anyway...
>
> One option is to use the piece of public API that is available which
> will always be carrying around a Query object: the QuerySet. Query
> objects don't exist in isolation. However, this sounds problematic
> because the implementation is going to be working at a very low-level --
> database managers are only really interesting to Query.as_sql() and it's
> dependencies. But that leads to the next idea, ...
>
> The other is to work out a better place for this database manager in the
> hierarchy. It might be something that lives as an attribute on a
> QuerySet. Something like the user provides a function that picks the
> database based "some information" that is available to it and the base
> method selects the right database to use. Since it lives in the QuerySet
> namespace, it can happily access the "query" attribute there without any
> encapsulation violations. The database manager then becomes two pieces,
> an algorithm on QuerySet (that might just dispatch to the real algorithm
> on Query), plus some user-supplied code to make the right selections.
> That latter thing could be a callable object if you need the full class
> structure. But the stuff QuerySet/Query needs to know about is probably
> a much smaller interface than *requiring* a full class. (Did any of that
> make sense?)
>
> I think this -- the database manager concept -- is the part of your
> proposal that is most up in the air with respect to what the API looks
> like. Which is fine. The fact that it's something to consider is good
> enough to know. Certainly put some thought into the problem, but don't
> sweat the details too much just yet (in

Re: [GSOC] Multiple Database API proposal

2009-03-20 Thread Malcolm Tredinnick

Trimming unused portions of the response to make it readable (which I
should have done the first time around, too)...

On Fri, 2009-03-20 at 23:41 -0400, Alex Gaynor wrote:
> 
> 
> On Fri, Mar 20, 2009 at 11:21 PM, Malcolm Tredinnick
>  wrote:
> 
> 
> On Fri, 2009-03-20 at 09:45 -0400, Alex Gaynor wrote:
> > Hello all,

[...]

> > The greatest hurdle is changing the connection after we
> already have
> > our
> > ``Query`` partly created.  The issues here are that: we
> might have
> > done tests
> > against ``connection.features`` already, we might need to
> switch
> > either to or
> > from a custom ``Query`` object, amongst other issues.

[...]

> >  One possible solution
> > that is very powerful(though quite inellegant) is to have
> the
> > ``QuerySet`` keep
> > track of all public API method calls against it and what
> parameters
> > they took,
> > then when the ``connection`` is changed it will recreate the
> ``Query``
> > object
> > by creating a "blank" one with the new connection and
> reapplying all
> > the
> > methods it has stored.  This is basically a simple
> implementation of
> > the
> > command pattern.
> 
> 
> 
> 
> It's pretty yukky. There's a lot of Python level junk that we
> intentionally avoid storing in querysets so that they behave
> properly as
> persistent data structures (clones are independent copies) and
> can be
> pickled without trouble, etc. It would be really bad for
> performance to
> reintroduce those (I did a lot of profiling when developing
> that stuff
> and tried to throw out as much as possible). I think this
> fortunately
> isn't going to be a real issue. I was pretty careful
> originally to keep
> the leakage from django.db.connection into the Query class to
> as few
> places as possible and mostly when we're creating the SQL.
> 
> Some cases that might eb unavoidable could be replaced with
> delayed
> evaluation objects (essentially encapsulating the command
> pattern just
> for that fragment), which is a bit cleaner.
> 
> 
> One suggestion Eric Florenzano had was that we go above and beyond
> just storing the methods and parameters, we don't even excecute them
> at all until absolutely necessary.  

Excuse me for a moment whilst I add Eric to a special list I've been
keeping. He's trying to make trouble.

Ok, back now... There are at least two problems with this.

(a) Backwards incompatible in that some querysets would return
noticeably different results before and after that change. It would be
subtle, quiet and very difficult to detect without auditing every line
of code that contributes to a queryset. The worst kind of change for us
to make from the perspective of the users.

(b) Intentionally not done right now and not because I'm whimsical and
arbitrary (although I am). The problem is it requires storing all sorts
of arbitrarily complex Python objects. Which breaks pickling, which
breaks caching. People tend to complain, a lot, about that last bit.

That's why the Where.add() converts things to more basic types when they
are added (via a filter() command).  If somebody really needs lazily
evaluated parameters, it's easy enough via a custom Q-like object, but
so far nobody has asked for that if they've gotten stuck doing it. It's
even something we could consider adding to Django, although it's not a
no-brainer given the potential to break caching.

[...]
> 
> Thanks for all the review Malcolm.

No problems.

> One question that I didn't really ask in the initial post is what
> parameters should a "DatabaseManager" receieve on it's methods, one
> suggestion is the Query object, since that gives the use the maximal
> amount of information,, however my concerns there are that it's not a
> public API, and having a private API as a part of the public API feels
> klunky.

At first glance, I believe the word you're looking for is "wrong". :-)

Definitely a valid concern.

>   OTOH there isn't really another data structure that carries around
> the information someone writing their sharding logic(or whatever other
> scheme they want to implement) who inevitably want to have.

Two solutions spring to mind, although I haven't thought this through a
lot: it's not particularly germane to the proposal since it's something
we can work out a bit later on. I've got limited time today(something
about a beta release coming up), so I wanted to just get out responses
to the two people who posted items for discussion. I suspect there's a
lot of thinking needed here about the concept as a whole and I want to
do that. Anyway...

Re: [GSOC] Multiple Database API proposal

2009-03-20 Thread Alex Gaynor
On Fri, Mar 20, 2009 at 11:21 PM, Malcolm Tredinnick <
malc...@pointy-stick.com> wrote:

>
> On Fri, 2009-03-20 at 09:45 -0400, Alex Gaynor wrote:
> > Hello all,
> >
> > To those who don't me I'm a freshman computer science student at
> > Rensselaer
> > Polytechnic Institute in Troy, New York.  I'm on the mailing lists
> > quite a bit
> > so you may have seen me around.
> >
> > A Multiple Database API For Django
> > ==================================
> >
> > Django current has the low level hooks necessary for multiple database
> > support,
> > but it doesn't have the high level API for using, nor any support
> > infrastructure, documentation, or tests.  The purpose of this project
> > would be
> > to implement the high level API necessary for the use of multiple
> > databases in
> > Django, along with requisit documentation and tests.
> >
> > There have been several previous proposals and implementation of
> > multiple-database support in Django, non of which has been complete,
> > or gained
> > sufficient traction in the community in order to be included in Django
> > itself.
> > As such this proposal will specifically address some of the reasons
> > for past
> > failures, and their remedies.
> >
> > The API
> > -------
> >
> > First there is the API for defining multiple connections.  A new
> > setting will
> > be created ``DATABASES`` (or something similar), which is a dictionary
> > mapping
> > database alias(internal name) to a dictionary containing the current
> > ``DATABASE_*`` settings:
> >
> > .. sourcecode:: python
> >
> >     DATABASES = {
> >         'default': {
> >             'DATABASE_ENGINE': 'postgresql_psycopg2',
> >             'DATABASE_NAME': 'my_data_base',
> >             'DATABASE_USER': 'django',
> >             'DATABASE_PASSWORD': 'super_secret',
> >         },
> >         'user': {
> >             'DATABASE_ENGINE': 'sqlite3',
> >             'DATABASE_NAME': '/home/django_projects/universal/users.db',
> >         },
> >     }
> >
> > A database with the alias ``default`` will be the default
> > connection(it will be
> > used if no other one is specified for a query) and will be the direct
> > replacement for the ``DATABASE_*`` settings.  In compliance with
> > Django's
> > deprecation policy the ``DATABASE_*`` will automatically be handled as
> > if they
> > were defined in the ``DATABASES`` dict for at least 2 releases.
> >
> > Next a ``connections`` object will be implemented in ``django.db``,
> > analgous
> > to the ``django.db.connection`` object, the ``connections`` one will
> > be a
> > dictionary like object, that is subscripted by database alias, and
> > lazily
> > returns a connection to the database.  ``django.db.connection`` will
> > remain(at
> > least for the present, it's ultimate state will be by community
> > consensus) and
> > merely proxy to ``django.db.connections['default']``.  Using the
> > previously
> > defined database setting this might be used as:
> >
> > .. sourcecode:: python
> >
> >     from django.db import connections
> >
> >     conn = connections['user']
> >     c = conn.cursor()
> >     # DB-API leaves the return value of execute() undefined,
> >     # so fetch from the cursor itself.
> >     c.execute("""SELECT 1""")
> >     results = c.fetchall()
> >
> > Now that there is the necessary infastructure to accompany the very
> > low level
> > plumbing we need our actual API.  The high level API will have 2
> > components.
> > First here will be a ``using()`` method on ``QuerySet`` and
> > ``Manager``
> > objects.  This method simply takes an alias to a connection(and
> > possibly a
> > connection object itself to allow for dynamic database usage) and
> > makes that
> > the connection that will be used for that query.  Secondly, a new
> > options will
> > be created in the inner Meta class of models.  This option will be
> > named
> > ``using`` and specify the default connection to use for all queries
> > against
> > this model, overiding the default specified in the settings:
> >
> > .. sourcecode:: python
> >
> >     class MyUser(models.Model):
> >         ...
> >         class Meta:
> >             using = 'user'
> >
> >     # this queries the 'user' database
> >     MyUser.objects.all()
> >     # this queries the 'default' database
> >     MyUser.objects.using('default')
> >
> > Lastly, various plumbing will need to be updated to reflect the new
> > multidb
> > API, such as transactions, breakpoints, management commands, etc.
> >
> > More Advanced Usage
> > -------------------
> >
> > While the above two methods are strictly speaking sufficient they
> > require the
> > user to write lots of boilerplate code in order to implement advanced
> > multi
> > database strategies such as replication and sharding.  Therefore we
> > also
> > introduce the concept of ``DatabaseManagers``, not to be confused with
> > Django's
> > current managers.  DatabaseManagers are classes that define how what
> > connection
> > should be used for a given query.  There are 2 levels at which to
> > specify what
> > ``DatabaseManager`` to use, as a setting, and at th

Re: [GSOC] Multiple Database API proposal

2009-03-20 Thread Malcolm Tredinnick

On Fri, 2009-03-20 at 09:45 -0400, Alex Gaynor wrote:
> Hello all,
> 
> To those who don't me I'm a freshman computer science student at
> Rensselaer 
> Polytechnic Institute in Troy, New York.  I'm on the mailing lists
> quite a bit 
> so you may have seen me around.
> 
> A Multiple Database API For Django
> ==================================
> 
> Django current has the low level hooks necessary for multiple database
> support, 
> but it doesn't have the high level API for using, nor any support 
> infrastructure, documentation, or tests.  The purpose of this project
> would be 
> to implement the high level API necessary for the use of multiple
> databases in 
> Django, along with requisit documentation and tests.
> 
> There have been several previous proposals and implementation of 
> multiple-database support in Django, non of which has been complete,
> or gained 
> sufficient traction in the community in order to be included in Django
> itself.  
> As such this proposal will specifically address some of the reasons
> for past 
> failures, and their remedies.
> 
> The API
> -------
> 
> First there is the API for defining multiple connections.  A new
> setting will 
> be created ``DATABASES`` (or something similar), which is a dictionary
> mapping 
> database alias(internal name) to a dictionary containing the current 
> ``DATABASE_*`` settings:
> 
> .. sourcecode:: python
> 
>     DATABASES = {
>         'default': {
>             'DATABASE_ENGINE': 'postgresql_psycopg2',
>             'DATABASE_NAME': 'my_data_base',
>             'DATABASE_USER': 'django',
>             'DATABASE_PASSWORD': 'super_secret',
>         },
>         'user': {
>             'DATABASE_ENGINE': 'sqlite3',
>             'DATABASE_NAME': '/home/django_projects/universal/users.db',
>         },
>     }
> 
> A database with the alias ``default`` will be the default
> connection(it will be 
> used if no other one is specified for a query) and will be the direct 
> replacement for the ``DATABASE_*`` settings.  In compliance with
> Django's 
> deprecation policy the ``DATABASE_*`` will automatically be handled as
> if they 
> were defined in the ``DATABASES`` dict for at least 2 releases.
> 
> Next a ``connections`` object will be implemented in ``django.db``,
> analgous 
> to the ``django.db.connection`` object, the ``connections`` one will
> be a 
> dictionary like object, that is subscripted by database alias, and
> lazily 
> returns a connection to the database.  ``django.db.connection`` will
> remain(at 
> least for the present, it's ultimate state will be by community
> consensus) and 
> merely proxy to ``django.db.connections['default']``.  Using the
> previously 
> defined database setting this might be used as:
> 
> .. sourcecode:: python
> 
>     from django.db import connections
>
>     conn = connections['user']
>     c = conn.cursor()
>     # DB-API leaves the return value of execute() undefined,
>     # so fetch from the cursor itself.
>     c.execute("""SELECT 1""")
>     results = c.fetchall()
> 
> Now that there is the necessary infastructure to accompany the very
> low level 
> plumbing we need our actual API.  The high level API will have 2
> components.  
> First here will be a ``using()`` method on ``QuerySet`` and
> ``Manager`` 
> objects.  This method simply takes an alias to a connection(and
> possibly a 
> connection object itself to allow for dynamic database usage) and
> makes that 
> the connection that will be used for that query.  Secondly, a new
> options will 
> be created in the inner Meta class of models.  This option will be
> named 
> ``using`` and specify the default connection to use for all queries
> against 
> this model, overiding the default specified in the settings:
> 
> .. sourcecode:: python
> 
>     class MyUser(models.Model):
>         ...
>         class Meta:
>             using = 'user'
>
>     # this queries the 'user' database
>     MyUser.objects.all()
>     # this queries the 'default' database
>     MyUser.objects.using('default')
> 
> Lastly, various plumbing will need to be updated to reflect the new
> multidb 
> API, such as transactions, breakpoints, management commands, etc.
> 
> More Advanced Usage
> -------------------
> 
> While the above two methods are strictly speaking sufficient they
> require the 
> user to write lots of boilerplate code in order to implement advanced
> multi 
> database strategies such as replication and sharding.  Therefore we
> also 
> introduce the concept of ``DatabaseManagers``, not to be confused with
> Django's 
> current managers.  DatabaseManagers are classes that define how what
> connection 
> should be used for a given query.  There are 2 levels at which to
> specify what 
> ``DatabaseManager`` to use, as a setting, and at the class level.  For
> example 
> in one's settings.py one might have:
> 
> .. sourcecode:: python
> 
>     DEFAULT_DB_MANAGER = 'django.db.multidb.round_robin.Random'
> 
> This tells Django that for each query it should use the
> ``DatabaseManager`` 
> specified at that location, unless it is 

Re: [GSOC] Multiple Database API proposal

2009-03-20 Thread Tim Chase

> I'm here soliciting feedback on both the API, and any potential hurdles I
> may have missed.

While my vote may mean little, Alex has certainly been active and 
had quality code on the mailing list.  MultiDB has also been a 
frequent issue on the mailing-list, so Alex gets my +1

I'd hope to see "multiple databases" defined a little more
clearly as discussed in this thread[1].  Whether the SoC project
addresses *all* of the facets (wow, lots of work!) or just selects
certain issues, I'd like to see them addressed in the proposal
("addressing federation and load-balancing, but not sharding") to
show that they're being considered during the implementation.
From what I gather in the description, Alex is only proposing
load-balancing.

Depending on which definitions of multidb you plan to address, it 
also impacts areas such as aggregation (performing 
count/summation over shards requires extra consideration) and 
cross-database joining.  In the above thread, Malcolm also raises 
the issue of read/write consistency when doing load-balancing.

-tim

[1]
http://groups.google.com/group/django-users/browse_thread/thread/663046559fd0f9c1/







[GSOC] Multiple Database API proposal

2009-03-20 Thread Alex Gaynor
Hello all,

For those who don't know me, I'm a freshman computer science student at
Rensselaer Polytechnic Institute in Troy, New York.  I'm on the mailing
lists quite a bit, so you may have seen me around.

A Multiple Database API For Django
==================================

Django currently has the low-level hooks necessary for multiple database
support, but it doesn't have the high-level API for using them, nor any
support infrastructure, documentation, or tests.  The purpose of this
project would be to implement the high-level API necessary for the use of
multiple databases in Django, along with the requisite documentation and
tests.

There have been several previous proposals and implementations of
multiple-database support in Django, none of which has been complete or
gained sufficient traction in the community to be included in Django
itself.  As such, this proposal will specifically address some of the
reasons for past failures, and their remedies.

The API
-------

First, there is the API for defining multiple connections.  A new setting,
``DATABASES`` (or something similar), will be created.  It is a dictionary
mapping each database alias (an internal name) to a dictionary containing the
current ``DATABASE_*`` settings:

.. sourcecode:: python

    DATABASES = {
        'default': {
            'DATABASE_ENGINE': 'postgresql_psycopg2',
            'DATABASE_NAME': 'my_data_base',
            'DATABASE_USER': 'django',
            'DATABASE_PASSWORD': 'super_secret',
        },
        'user': {
            'DATABASE_ENGINE': 'sqlite3',
            'DATABASE_NAME': '/home/django_projects/universal/users.db',
        },
    }

A database with the alias ``default`` will be the default connection (it will
be used if no other one is specified for a query) and will be the direct
replacement for the ``DATABASE_*`` settings.  In compliance with Django's
deprecation policy, the ``DATABASE_*`` settings will automatically be handled
as if they were defined in the ``DATABASES`` dict for at least 2 releases.
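
One way the backwards-compatibility handling could work is sketched below.
This is only an illustration under stated assumptions: the helper name
``build_databases`` is hypothetical, a plain dict stands in for the settings
module, and only the four ``DATABASE_*`` keys shown above are considered.

```python
# Legacy setting names taken from the example above; a real shim would
# presumably cover every DATABASE_* setting.
LEGACY_KEYS = (
    "DATABASE_ENGINE",
    "DATABASE_NAME",
    "DATABASE_USER",
    "DATABASE_PASSWORD",
)


def build_databases(settings):
    """Return a DATABASES-style dict from a dict of settings.

    If no 'default' alias is defined, the old top-level DATABASE_*
    settings are folded into one (hypothetical helper, not Django API).
    """
    databases = dict(settings.get("DATABASES", {}))
    if "default" not in databases:
        legacy = {key: settings[key] for key in LEGACY_KEYS if key in settings}
        if legacy:
            databases["default"] = legacy
    return databases


# Old-style settings are treated as the 'default' connection.
old_style = {"DATABASE_ENGINE": "sqlite3", "DATABASE_NAME": "old.db"}
converted = build_databases(old_style)
```

Settings that already define ``DATABASES['default']`` pass through unchanged
in this sketch, so the new setting always wins over the legacy ones.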

Next, a ``connections`` object will be implemented in ``django.db``,
analogous to the ``django.db.connection`` object.  The ``connections`` object
will be a dictionary-like object that is subscripted by database alias and
lazily returns a connection to the database.  ``django.db.connection`` will
remain (at least for the present; its ultimate state will be decided by
community consensus) and will merely proxy to
``django.db.connections['default']``.  Using the previously defined database
settings, this might be used as:

.. sourcecode:: python

    from django.db import connections

    conn = connections['user']
    c = conn.cursor()
    # DB-API leaves the return value of execute() undefined, so fetch the
    # rows from the cursor itself.
    c.execute("""SELECT 1""")
    results = c.fetchall()
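
To make "lazily returns a connection" concrete, here is a rough standalone
sketch of a dictionary-like object that opens a connection the first time an
alias is subscripted and reuses it afterwards.  The class name
``LazyConnections`` and the ``sqlite_connect`` helper are assumptions made for
illustration; the standard-library ``sqlite3`` module stands in for Django's
backends.

```python
import sqlite3


class LazyConnections:
    """Dict-like object: connects on first access per alias, then caches."""

    def __init__(self, databases, connect):
        self._databases = databases  # alias -> settings dict
        self._connect = connect      # callable: settings dict -> connection
        self._open = {}              # alias -> already opened connection

    def __getitem__(self, alias):
        if alias not in self._databases:
            raise KeyError(alias)
        if alias not in self._open:
            self._open[alias] = self._connect(self._databases[alias])
        return self._open[alias]


def sqlite_connect(db_settings):
    # Hypothetical backend hook: only DATABASE_NAME is used here.
    return sqlite3.connect(db_settings["DATABASE_NAME"])


connections = LazyConnections(
    {
        "default": {"DATABASE_NAME": ":memory:"},
        "user": {"DATABASE_NAME": ":memory:"},
    },
    sqlite_connect,
)

cursor = connections["user"].cursor()
cursor.execute("SELECT 1")
rows = cursor.fetchall()
```

In this sketch no connection to ``default`` is opened until something asks
for it, which is the property the proposal describes.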

Now that there is the necessary infrastructure to accompany the very
low-level plumbing, we need our actual API.  The high-level API will have 2
components.  First, there will be a ``using()`` method on ``QuerySet`` and
``Manager`` objects.  This method simply takes an alias to a connection (and
possibly a connection object itself, to allow for dynamic database usage) and
makes that the connection that will be used for that query.  Second, a new
option will be created in the inner Meta class of models.  This option will
be named ``using`` and will specify the default connection to use for all
queries against this model, overriding the default specified in the settings:

.. sourcecode:: python

    class MyUser(models.Model):
        ...
        class Meta:
            using = 'user'

    # this queries the 'user' database
    MyUser.objects.all()
    # this queries the 'default' database
    MyUser.objects.using('default')
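
The interaction between the two components can be sketched with a toy
stand-in for ``QuerySet``.  Everything here is an assumption for
illustration (the ``Options`` class, the ``db`` property and the cloning
behaviour are not taken from the proposal): ``using()`` returns a new
queryset bound to the given alias, and the ``using`` Meta option is the
fallback when no alias was chosen explicitly.

```python
class Options:
    """Toy stand-in for a model's Meta options."""

    def __init__(self, using=None):
        self.using = using


class QuerySet:
    """Toy queryset that only tracks which alias it would query."""

    def __init__(self, opts, using=None):
        self._opts = opts
        self._using = using

    def using(self, alias):
        # Return a clone so the original queryset keeps its own alias.
        return QuerySet(self._opts, using=alias)

    @property
    def db(self):
        # Explicit using() first, then Meta.using, then 'default'.
        return self._using or self._opts.using or "default"


user_qs = QuerySet(Options(using="user"))
override_qs = user_qs.using("default")
```

Here ``user_qs`` resolves to the ``user`` alias through the Meta option,
while ``override_qs`` resolves to ``default`` because of the explicit call.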

Lastly, various plumbing will need to be updated to reflect the new multidb
API, such as transactions, savepoints, management commands, etc.

More Advanced Usage
-------------------

While the above two methods are, strictly speaking, sufficient, they require
the user to write lots of boilerplate code in order to implement advanced
multi-database strategies such as replication and sharding.  Therefore we
also introduce the concept of ``DatabaseManagers``, not to be confused with
Django's current managers.  ``DatabaseManagers`` are classes that define
which connection should be used for a given query.  There are 2 levels at
which to specify what ``DatabaseManager`` to use: as a setting, and at the
class level.  For example, in one's settings.py one might have:

.. sourcecode:: python

    DEFAULT_DB_MANAGER = 'django.db.multidb.round_robin.Random'

This tells Django that for each query it should use the ``DatabaseManager``
specified at that location, unless it is overridden by the ``using`` Meta
option or the ``using()`` method.
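
The override order described here could look roughly like the sketch below.
It is not the proposed implementation: ``RandomManager``,
``alias_for_query()`` and ``resolve_alias()`` are hypothetical names, and the
real ``DatabaseManager`` interface is not specified at this point in the
proposal.

```python
import random


class RandomManager:
    """Hypothetical DatabaseManager: picks one alias at random per query."""

    def __init__(self, aliases, rng=None):
        self.aliases = list(aliases)
        self._rng = rng or random.Random()

    def alias_for_query(self):
        return self._rng.choice(self.aliases)


def resolve_alias(explicit=None, meta_using=None, default_manager=None):
    """Pick an alias: using() beats Meta.using, which beats the setting."""
    for choice in (explicit, meta_using, default_manager):
        if choice is None:
            continue
        if isinstance(choice, str):
            return choice                # a plain alias
        return choice.alias_for_query()  # a DatabaseManager instance
    return "default"


manager = RandomManager(["my_db1", "my_db2"])
picked = resolve_alias(default_manager=manager)
```

With only the project-wide manager set, ``picked`` is one of the manager's
aliases; passing ``meta_using`` or ``explicit`` takes precedence over it.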

The more granular way to use ``DatabaseManagers`` is to provide them, in
place of a string, as the ``using`` Meta option.  Here we pass an instance of
the class we want to use:

.. sourcecode:: python

    class MyModel(models.Model):
        class Meta:
            using = Random(['my_db1', 'my_db2', 'my_db2'])

At this level it