As hinted at earlier on the ml, I've started doing some work on 
refactoring the actual db backend; ticket 5106 holds the current 
version (http://code.djangoproject.com/ticket/5106).

Best to start with the perceived cons of the current backend design-

1) redundancy of code; each backend implementation has to implement 
the exact same functions repeatedly- _(commit|rollback|close) are 
simple examples; better examples are the debug cursor, the dictfetch* 
assignments in each base module, and the repeated get_* func 
definitions in the base module.  From the looks of it, each backend 
was roughly developed by copying an existing one over and modifying 
it for the target backend- this obviously isn't grand (why fix a bug 
once when you can fix it in 7 duplicate spots? ;)
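To make the duplication concrete, here's a minimal sketch (class and 
method names are hypothetical, not lifted from the patch) of the kind 
of shared base class that would let _commit/_rollback/close live in 
exactly one place:

```python
# Hypothetical sketch: a shared base so _commit/_rollback/close are
# written once instead of copy/pasted into every backend module.
class BaseDatabaseWrapper(object):
    def __init__(self, settings_dict):
        self.settings_dict = settings_dict
        self.connection = None  # the underlying DB-API connection

    def _commit(self):
        if self.connection is not None:
            self.connection.commit()

    def _rollback(self):
        if self.connection is not None:
            self.connection.rollback()

    def close(self):
        if self.connection is not None:
            self.connection.close()
            self.connection = None
```

Each concrete backend would then only override what actually differs 
(sqlite's close semantics, for instance), rather than restating the 
whole lot.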

2) due to the lack of any real base class/interface, devs are 
basically stuck grepping each backend to identify what functionality 
is available; track the usage of get_autoinc_sql in core/management 
for example- some spots guard against the function being missing, 
some spots assume it always exists (it always exists, best I can 
figure).  The lack of real OOP in the backend code also means that 
django is slightly screwed in terms of trying to make changes to the 
backend- instead of adding a compatibility hack in one spot, you have 
to add it to each and every backend.  Not fun.
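Illustrating the grep problem (the backend/method scaffolding here is 
hypothetical, not lifted from core/management): callers today have to 
guess whether an attribute exists, whereas a base class gives one 
authoritative default-

```python
# Illustrative only- today's guarded-vs-unguarded access pattern.
class SomeBackendModule(object):
    pass  # stand-in for a backend that never defined get_autoinc_sql

backend = SomeBackendModule()

# Defensive call site- guards against the function being missing:
if hasattr(backend, "get_autoinc_sql"):
    sql = backend.get_autoinc_sql("myapp_entry")
else:
    sql = None

# With a real base class, the default lives in exactly one spot:
class BaseDatabaseOperations(object):
    def get_autoinc_sql(self, table):
        return None  # backends that need autoinc setup sql override this

class SomeBackendOperations(BaseDatabaseOperations):
    pass  # inherits the safe default; callers never need hasattr()

assert SomeBackendOperations().get_autoinc_sql("myapp_entry") is None
```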

3) reliance on globals; this one requires some explanation and a 
minor backstory; mod_python spawns a new interpreter per vhost, so if 
you have lots of vhosts in a worker/prefork setup, you bleed memory 
like a sieve- not fun.  The solution (at least my approach) is to 
mangle the misc globals django relies on so that they can swap their 
settings on the fly per request (literally swapping the 
$DJANGO_SETTINGS_MODULE/django.conf.settings._target in use), and to 
force mod_python to reuse the same interpreter.  The upshot, for our 
usage at curse-gaming, is that growing >400 mb/process limited to 100 
requests becomes ~40mb/process with unlimited requests (we have a 
veritable buttload of vhosts).  Assume a minimum of ~20 idle workers, 
and you get an idea of why globals are more than a wee bit 
anti-scaling for a setup with a large # of vhosts.

Getting back to db refactoring, the reliance on globals throughout 
django code means that tricks like that are far harder, and it adds 
more work for multidb code/attempts; that codebase requires a 
reduction of global reliance (quote_name is a simple example- the 
quoting rules for mysql aren't the same as oracle/pgsql/sqlite, thus 
you need to get the quoter for the specific backend).  The old mantra 
about globals sucking basically is true; access to misc backend 
functionality really needs to be grabbed via the actual backend 
object itself if there is ever an intention of supporting N backends 
w/in a single interpreter.

4) minor, but annoying; the forced module layout means writing 
generated/new backends is tricky; further, you have to shove the 
backend into django.db.backends (the hardcoded location is addressable 
w/out the refactoring, although the layout issue would remain).
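A settings-driven dotted path would address the hardcoded location; a 
throwaway illustration (the helper name is hypothetical, and os.path 
just stands in for an arbitrary module):

```python
def load_backend(path):
    """Import a backend from any dotted path, rather than assuming it
    lives under django.db.backends (hypothetical helper)."""
    # a non-empty fromlist makes __import__ return the leaf module itself
    return __import__(path, {}, {}, [''])

# any module implementing the backend interface would do:
mod = load_backend("os.path")
assert hasattr(mod, "join")
```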


What I'm implementing/proposing;

1) shift access to introspection/creation/client module 
functionality to;

connection.introspection # literal attr based namespace
connection.creation # literal attr based namespace; realistically 
  # could shift DATA_TYPES to connection.DATA_TYPES and drop creation.
connection.runshell # func to execute the shell

2) shift access of base.* misc. bits into 5 attrs;

connection.DatabaseError  # should realistically be there anyways, and 
  # potentially accessible on the cursor object
connection.IntegrityError # same
connection.orm_map # base.OPERATOR_MAPPING
connection.ops # basically the misc get_*, quote_name, *_transaction, 
  # dict* methods floating in base; 
connection.capabilities # the misc bools django relies on to discern 
  # what sql to generate for the backend; allows_group_by_ordinals, 
  # allows_unique_and_pk, autoindexes_primary_key, etc
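Roughly, the resulting layout would hang together like this (the 
implementations below are dummy stand-ins; only the attribute names 
follow the proposal):

```python
# Dummy sketch of the proposed connection layout.
class DatabaseOperations(object):
    def quote_name(self, name):
        return '"%s"' % name   # pgsql-style quoting, for example

class DatabaseCapabilities(object):
    allows_group_by_ordinals = True
    autoindexes_primary_key = False

class DatabaseWrapper(object):
    DatabaseError = Exception   # would be the driver's real exception class

    def __init__(self):
        self.ops = DatabaseOperations()
        self.capabilities = DatabaseCapabilities()

connection = DatabaseWrapper()
assert connection.ops.quote_name("entry") == '"entry"'
assert connection.capabilities.allows_group_by_ordinals
```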

3) convert code over to accessing connection instead of backend.  
Kind of a given this breaks the hell out of current consumers doing 
sql generation (moving quote_name, for example), but the api breakage 
can be limited by adding a temporary __getattr__ to the base 
connection class that searches the new compartmentalized locations 
and returns from there.  Not a good long term solution, but it 
should be an effective intermediate band-aid.
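That band-aid could look roughly like this (a sketch, not the actual 
patch):

```python
# Sketch of the temporary compat shim: old flat attribute access
# falls through to the new compartmentalized namespaces.
class DatabaseWrapper(object):
    def __init__(self, ops, capabilities):
        self.ops = ops
        self.capabilities = capabilities

    def __getattr__(self, name):
        # only reached when normal lookup fails, i.e. for old-style
        # flat names like connection.quote_name
        for namespace in (self.ops, self.capabilities):
            if hasattr(namespace, name):
                return getattr(namespace, name)
        raise AttributeError(name)
```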


Basically, the pros of the approach are:

1) fixes, or enables the next step in fixing, the cons listed above 
in mainline django (instead of folks just forking off django with 
their needed changes).
2) an actual base interface is present for backend bits, making 
things less of a crapshoot when trying to write backend agnostic 
code.
3) connection reuse/pooling can be inlined (or wrapped- the 
implementation will be similar enough) for backends that lack it; a 
fun example that came to mind was writing a simple wrapper to collect 
the backtraces for where queries are getting forced to evaluate- it's 
not a complex example, but it ought to give y'all an idea of some of 
the stuff that's doable with cleanup.  Right now, you would have to 
inline it every time- tweaking the settings is far saner, and 
promotes reuse (plus it would be a useful tool :)
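For the backtrace-collecting example, here's a sketch (names 
hypothetical) of the kind of cursor wrapper a pluggable connection 
class would make trivial to install:

```python
import traceback

# Sketch: record the stack at each execute() call so you can see where
# queries are being forced to evaluate, without touching call sites.
class TracingCursor(object):
    def __init__(self, cursor, log):
        self._cursor = cursor
        self._log = log

    def execute(self, sql, params=()):
        self._log.append((sql, traceback.format_stack()))
        return self._cursor.execute(sql, params)

    def __getattr__(self, name):
        # delegate everything else (fetchone, rowcount, ...) untouched
        return getattr(self._cursor, name)
```

With the refactoring in place, installing this is a settings tweak; 
today you'd have to hand-edit each backend's cursor() to get the same 
effect.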

So... thoughts?  The second round of refactoring posted on 5106 
currently lacks some of the new features mentioned above (mainly need 
to port them over), but v3 will address that and add in at least a 
SteadyDB persistent connection (pooling would be based off that).  The 
current backend implementations in v2 are also still a fair bit ugly; 
they currently just map the old layout into the new api (no point 
converting till folks agree with the new api, mainly).

Comments would be appreciated; personally, I'm not after a mass 
gutting of what's there, just after refactoring it so that it's in a 
semi-sane state for extension/heavier refactoring.

~harring
