Hi folks, I just ran into a serious but quite subtle bug in db.multiple_insert() when you're inserting into the database from multiple sessions concurrently. multiple_insert() calculates the inserted IDs by taking the most recent currval() of the sequence (see _process_insert_query) and then subtracting the number of rows inserted to get the (presumed) range of IDs:
    out = db_cursor.fetchone()[0]  # last ID inserted
    out = range(out - len(values) + 1, out + 1)

currval(seqname) returns exactly what it's supposed to -- the last value of the sequence from this session. However, you *cannot* use this subtraction logic: the sequence guarantees the IDs are increasing and unique, but they won't necessarily be consecutive for a multiple-row insert if another database session inserts at around the same time. Sequence locking is done quite separately from insert/commit locking; the sequence is only locked long enough to hand out one value at a time. This is true at least for PostgreSQL, and probably for other databases too, though I haven't checked.

So what happened to me is that two threads called multiple_insert() to insert (for example) 200 rows each, and they happened to run at almost exactly the same time. The first insert started by allocating IDs (say) 1000-1049, then the second insert came in halfway through and allocated IDs 1050-1249, and finally the first insert finished by allocating IDs 1250-1399 for its last 150 rows. Those were the actual IDs inserted into the database, so you can see why the range() approach above gives the wrong values: the first thread's currval() is 1399, so range() yields 1200-1399, even though that thread actually inserted 1000-1049 and 1250-1399.

Thankfully, in my case, when the first thread tried to store its IDs in another table, a foreign key constraint caused the transaction to fail, because some of those IDs (the ones from the second multiple_insert) hadn't been committed yet. The second thread then bombed out for the same reason. So while there was no data corruption in my case, this is a *serious race condition* that can produce bogus data.

In short, using currval() this way after a multiple-row insert is bogus, and this functionality should be removed from multiple_insert() ASAP.

What I've done instead in my local copy of web/db.py is add a "returning" keyword arg to multiple_insert() that uses the RETURNING clause of PostgreSQL's INSERT statement. RETURNING lets you fetch values from the rows you just inserted, including default values and sequence-generated IDs. Docs here:

http://www.postgresql.org/docs/9.0/static/sql-insert.html

One thing to note: like a SELECT with no ORDER BY, the rows from RETURNING are *not guaranteed* to come back in the order given in the INSERT, which makes them slightly harder to work with, but at least you can get the data. For bonus points, you can read more on all that here:

http://postgresql.1045698.n5.nabble.com/Getting-sequence-generated-IDs-from-multiple-row-insert-td5798092.html
http://postgresql.1045698.n5.nabble.com/PATCH-Enforce-that-INSERT-RETURNING-preserves-the-order-of-multi-rows-td5728579.html

Here's how my "returning" argument looks (though of course in non-test mode it returns an IterBetter of rows from the RETURNING clause):

>>> db.multiple_insert('person', values=values, returning='id, email', _test=True)
<sql: "INSERT INTO person (name, email) VALUES ('foo', '[email protected]'), ('bar', '[email protected]') RETURNING id, email">

If you'd like a pull request with my changes, I can provide one. In any case, multiple_insert()'s seqname argument and its current return value should be removed -- they're a time bomb waiting to explode.

Side note: there's apparently also a separate issue with the IDs multiple_insert() returns under MySQL, though I haven't verified it:

https://github.com/webpy/webpy/pull/263

-Ben
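
P.S. Because the RETURNING rows aren't guaranteed to come back in insert order, here's a rough sketch of how a caller could match them back up to the input rows via a unique column. This assumes my proposed "returning" kwarg (not stock web.py), a unique "email" column, made-up connection details, and that the returned rows are the usual web.py Storage objects:

    import web

    db = web.database(dbn='postgres', db='mydb', user='me')  # hypothetical

    values = [
        {'name': 'foo', 'email': '[email protected]'},
        {'name': 'bar', 'email': '[email protected]'},
    ]

    # RETURNING rows may arrive in any order, so key them by a column
    # we know is unique rather than relying on their position.
    rows = db.multiple_insert('person', values=values, returning='id, email')
    id_by_email = dict((row.email, row.id) for row in rows)
    ids = [id_by_email[v['email']] for v in values]  # IDs in input order

Matching on a unique key like this sidesteps the ordering caveat entirely, at the cost of needing such a key to exist on the table.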
