On Thu, 2007-03-29 at 05:29 +0000, John Penix wrote:
> I think I saw a get_or_create race condition today from concurrent
> runs of our data uploader that uses the model API.  Ouch.  The docs
> have several references to the api calls being atomic - now I'm
> thinking get_or_create is an exception.  And I'm guessing lots of
> other people already know this.

You're right. The get_or_create() call isn't atomic, because it isn't a
single database call and we cannot assume that the database layer has
transactions (not every backend supports them). The window of
opportunity for a problem is about five Python instructions.
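
To make that window concrete, here is a minimal sketch of the
lookup-then-insert pattern. It uses Python's sqlite3 module rather than
Django's model API, and the table and column names (upload, name, batch)
are invented for illustration. The two "uploaders" are interleaved by
hand to mimic what concurrent processes could do.

```python
import sqlite3

# Hypothetical table; note there is no uniqueness constraint on (name, batch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE upload (id INTEGER PRIMARY KEY, name TEXT, batch INTEGER)"
)


def exists(name, batch):
    # Step 1 of a naive get-or-create: look the row up.
    row = conn.execute(
        "SELECT id FROM upload WHERE name = ? AND batch = ?", (name, batch)
    ).fetchone()
    return row is not None


def create(name, batch):
    # Step 2: insert the row because step 1 found nothing.
    conn.execute("INSERT INTO upload (name, batch) VALUES (?, ?)", (name, batch))


# Two uploaders, A and B, interleaved by hand to mimic the race.
a_found = exists("report", 1)  # A looks: nothing there
b_found = exists("report", 1)  # B looks before A has inserted: nothing there
if not a_found:
    create("report", 1)        # A inserts
if not b_found:
    create("report", 1)        # B inserts a duplicate

duplicates = conn.execute(
    "SELECT COUNT(*) FROM upload WHERE name = 'report' AND batch = 1"
).fetchone()[0]
print(duplicates)
```

Both lookups run before either insert, so both callers decide to create
and the table ends up with two rows for what was meant to be one object.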

> So, assuming it's not atomic (by default) is there a way to make it
> safe other than using the django middleware layer to get
> transactions?  Like a flag... or a db schema tweak....

You could put a unique_together attribute in your model (part of the
Meta class). Include in it the columns that are involved in determining
when one instance differs from the next. Django's manage.py translates
unique_together into database table constraints, so the database will
refuse to store two rows with the same values in those columns.

I suspect if you do this, you will see IntegrityError raised out of
get_or_create() when a conflict occurs, so be prepared to handle that.
There is a ticket waiting to be fixed to make IntegrityError
database-neutral because at the moment you have to catch
MySQLdb.IntegrityError or whatever. I'm going to do a run through those
sorts of tickets on the weekend, so that little item will be smoothed
over then.
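
A rough sketch of what "be prepared to handle that" could look like,
again using sqlite3 instead of Django so it runs on its own. The
UNIQUE (name, batch) clause stands in for the constraint that
unique_together would generate. The table, the get_or_create() helper
and its "between" callback are all hypothetical; the callback exists
only to simulate another process inserting inside the race window.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNIQUE (name, batch) plays the role of the unique_together constraint.
conn.execute(
    "CREATE TABLE upload ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " batch INTEGER NOT NULL,"
    " UNIQUE (name, batch))"
)


def get_id(name, batch):
    row = conn.execute(
        "SELECT id FROM upload WHERE name = ? AND batch = ?", (name, batch)
    ).fetchone()
    return row[0] if row else None


def get_or_create(name, batch, between=None):
    """Return (row_id, created). Illustrative only, not Django's code."""
    row_id = get_id(name, batch)
    if row_id is not None:
        return row_id, False
    if between is not None:
        between()  # simulate a concurrent writer inside the race window
    try:
        cur = conn.execute(
            "INSERT INTO upload (name, batch) VALUES (?, ?)", (name, batch)
        )
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # Somebody else created the row first; the constraint stopped the
        # duplicate, so fetch the row that won.
        return get_id(name, batch), False


def rival():
    # The "other uploader" sneaking in between our lookup and our insert.
    conn.execute("INSERT INTO upload (name, batch) VALUES ('report', 1)")


row_id, created = get_or_create("report", 1, between=rival)
count = conn.execute("SELECT COUNT(*) FROM upload").fetchone()[0]
print(row_id, created, count)
```

With the constraint in place the losing insert fails instead of creating
a duplicate, and the caller falls back to fetching the existing row. In
Django the exception to catch would be the backend's own IntegrityError,
as described above.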

So my answer to your question ends here.

However, if you really care about the gory details of why this isn't
trivial, a little bit of data modelling theory...

The root issue here is that this is actually a difficult problem at
the database level, too. By default, Django uses a surrogate primary key
for models (the automatically generated id value), so there's no
constraint present about what constitutes a "unique" item. The fact that
you present the same fields to be saved more than once doesn't really
carry any information about whether they are the same or different. The
get_or_create() utility method makes the assumption that "same fields
implies same object", but that's not enforced by the database table
constraints. If it were truly the correct assumption, the model should
technically have a Meta.unique_together attribute specifying every field
in the model. That would translate into a database constraint as well
and attempts to create multiple objects with the same fields would raise
an IntegrityError in get_or_create(). The problem is that it's a bit
heavy-handed -- a potentially large constraint for the database to check
each time -- and it's overkill in the sense that often a much smaller
set of fields determines uniqueness.

The way to solve this as taught in Database Theory 101 is to have a
genuine primary key in your data model: something that you can point to
and say "this is what makes it unique" and then have that constraint
enforced at the database level.

There are two problems that make this a little tricky in Django as it is
today. The first one is that we don't have proper validation for custom
primary keys available -- you should be able to say "check that this
primary key field is unique" and not only should we give you back a
True or False answer, but in the True case, it should *remain* true
until you save the model. So Django needs to actually make a temporary
save of the model. That's a little tricky to implement, but not
impossible. I've been putting a lot of thought into that recently,
because it crops up in a number of different disguises. We'll have that
one solved before 1.0 and hopefully long before then.

The other problem we would have to solve for truly generically correct
support at the database level would be multi-column primary keys. That
is not too difficult to do. The real stumbling block is that there's no
good way that anybody has come up with to use such models in the admin
interface. We use the primary key as part of the URL in the admin
interface and primary keys can contain arbitrary characters. So you
can't just concatenate the two keys -- no way to tell where one ends and
the other starts -- and you can't use a special marker, because that
marker could occur in either or both keys, so you'd have to escape every
possible occurrence of it, reducing the readability of the URLs quite
dramatically in the normal case (unless the marker was a truly weird
character). If somebody can come up with a URL addressing scheme for
multi-column (more than one and not just two) primary keys that is
backwards compatible with our current scheme, the rest is not too
painful. It seems like a small item, but it's trickier than it looks.
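
To illustrate the escaping trade-off, here is one possible encoding,
sketched with Python's urllib.parse. It is not anything Django does: the
comma separator and the encode_key()/decode_key() helpers are
assumptions made up for this example.

```python
from urllib.parse import quote, unquote

SEP = ","  # hypothetical separator between key components


def encode_key(parts):
    # quote(..., safe="") percent-encodes everything except letters,
    # digits and "_.-~", so a literal separator inside a component is
    # always escaped and cannot be mistaken for a boundary.
    return SEP.join(quote(str(p), safe="") for p in parts)


def decode_key(segment):
    # Split on the unescaped separators, then undo the percent-encoding.
    return [unquote(p) for p in segment.split(SEP)]


plain = encode_key(["widgets", "42"])
awkward = encode_key(["a,b", "c/d"])
print(plain)
print(awkward)
print(decode_key(awkward))
```

Simple keys stay readable, but any key containing the separator or other
reserved characters turns into percent-escapes. An existing single-column
key that happens to contain a comma would also get a different URL than
it has today, which is the backwards-compatibility problem.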

Regards,
Malcolm

