Hi folks, I just ran into a serious but quite subtle bug in db.multiple_insert() when you're inserting into the database from multiple sessions concurrently. multiple_insert() calculates the inserted IDs by taking the most recent currval() of the sequence (see _process_insert_query) and then subtracting the number of rows inserted to get the (presumed) range of IDs:
    out = db_cursor.fetchone()[0]  # last ID inserted
    out = range(out - len(values) + 1, out + 1)

currval(seqname) returns exactly what it's supposed to -- the last value of the sequence from this session. However, you *cannot* use this subtraction logic: the sequence guarantees the IDs are increasing and unique, but they won't necessarily be consecutive for a multiple-row insert if another database session inserts at around the same time. Sequence locking is done quite separately from insert/commit locking; the sequence is only locked long enough to hand out one value at a time. This is true at least for PostgreSQL, and probably for other databases too, though I haven't checked.

So what happened to me is that two threads called multiple_insert() to insert (for example) 200 rows each, and they happened to run at almost exactly the same time. The first insert started by allocating IDs (say) 1000-1049, then the second insert came in halfway through and allocated IDs 1050-1249, and finally the first insert finished by allocating IDs 1250-1399 for its last 150 rows. Those were the actual IDs inserted into the database, so you can see why the range() approach above gives the wrong values: the first thread's currval() is 1399, so range() yields 1200-1399, even though that thread actually inserted 1000-1049 and 1250-1399.

Thankfully, in my case, when the first thread tried to store its IDs in another table, a foreign key constraint caused the transaction to fail, because some of those IDs (the ones from the second multiple_insert) hadn't been committed yet. The second thread then bombed out for the same reason. So while there was no data corruption in my case, this is a *serious race condition* that can produce bogus data.

In short, using currval() this way after a multiple-row insert is bogus, and this functionality should be removed from multiple_insert() ASAP.

What I've done instead in my local copy of web/db.py is add a "returning" keyword arg to multiple_insert() that uses the RETURNING clause of PostgreSQL's INSERT statement. RETURNING lets you fetch values from the rows you just inserted, including default values and sequence-generated IDs. Docs here:

http://www.postgresql.org/docs/9.0/static/sql-insert.html

One thing to note: like a SELECT with no ORDER BY, the rows from RETURNING are *not guaranteed* to come back in the order given in the INSERT, which makes them slightly harder to work with, but at least you can get the data. For bonus points, you can read more on all that here:

http://postgresql.1045698.n5.nabble.com/Getting-sequence-generated-IDs-from-multiple-row-insert-td5798092.html
http://postgresql.1045698.n5.nabble.com/PATCH-Enforce-that-INSERT-RETURNING-preserves-the-order-of-multi-rows-td5728579.html

Here's how my "returning" argument looks (though of course in non-test mode it returns an IterBetter of rows from the RETURNING clause):

>>> db.multiple_insert('person', values=values, returning='id, email', _test=True)
<sql: "INSERT INTO person (name, email) VALUES ('foo', '[email protected]'), ('bar', '[email protected]') RETURNING id, email">

If you'd like a pull request with my changes, I can provide one. In any case, multiple_insert()'s seqname argument and its current return value should be removed -- they're a time bomb waiting to explode.

Side note: there's apparently also a separate issue with the IDs multiple_insert() returns under MySQL, though I haven't verified it:

https://github.com/webpy/webpy/pull/263

-Ben
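
P.S. Because the RETURNING rows aren't guaranteed to come back in insert order, here's a rough sketch of how a caller could match them back up to the input rows via a unique column. This assumes my proposed "returning" kwarg (not stock web.py), a unique "email" column, made-up connection details, and that the returned rows are the usual web.py Storage objects:

    import web

    db = web.database(dbn='postgres', db='mydb', user='me')  # hypothetical

    values = [
        {'name': 'foo', 'email': '[email protected]'},
        {'name': 'bar', 'email': '[email protected]'},
    ]

    # RETURNING rows may arrive in any order, so key them by a column
    # we know is unique rather than relying on their position.
    rows = db.multiple_insert('person', values=values, returning='id, email')
    id_by_email = dict((row.email, row.id) for row in rows)
    ids = [id_by_email[v['email']] for v in values]  # IDs in input order

Matching on a unique key like this sidesteps the ordering caveat entirely, at the cost of needing such a key to exist on the table.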
