On Aug 4, 2011, at 9:22 AM, Michael Bayer wrote:

> On Aug 3, 2011, at 8:38 PM, Mark Erbaugh wrote:
>
>> I'm using SA (with SQLite) with a schema like:
>>
>> A -< B -< C -< D
>>
>> where -< means that the tables have a one-to-many relationship.
>>
>> I'm populating a sample data set where there are 25 rows in A, 25 rows
>> in B for each row in A, 25 rows in C for each row in B and 25 rows in D
>> for each row in C. This results in about 390k rows in D. The database
>> itself is only about 12 MB, but it takes a long time (several minutes)
>> to write the data to the file.
>>
>> I'm taking the approach of appending items to the table's relationship
>> collection:
>>
>> for i in range(25):
>>     x = A()
>>     session.add(x)
>>     for j in range(25):
>>         y = B()
>>         x.b.append(y)
>>         for k in range(25):
>>             z = C()
>>             y.c.append(z)
>>             for l in range(25):
>>                 xx = D()
>>                 z.d.append(xx)
>> session.flush()
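[For context, a runnable sketch of what the quoted setup might look like. The model names A/B/C/D and the `b`/`c`/`d` collection names come from the thread; the table names, columns, and foreign keys are assumptions, and the loop is reduced to 2 rows per level so it runs quickly.]

```python
# Hypothetical declarative models for the A -< B -< C -< D schema described
# above; column and table names are assumptions, not the poster's actual code.
from sqlalchemy import Column, ForeignKey, Integer, create_engine
try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4+
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base  # older versions
from sqlalchemy.orm import relationship, sessionmaker

Base = declarative_base()

class A(Base):
    __tablename__ = "a"
    id = Column(Integer, primary_key=True)
    b = relationship("B")

class B(Base):
    __tablename__ = "b"
    id = Column(Integer, primary_key=True)
    a_id = Column(Integer, ForeignKey("a.id"))
    c = relationship("C")

class C(Base):
    __tablename__ = "c"
    id = Column(Integer, primary_key=True)
    b_id = Column(Integer, ForeignKey("b.id"))
    d = relationship("D")

class D(Base):
    __tablename__ = "d"
    id = Column(Integer, primary_key=True)
    c_id = Column(Integer, ForeignKey("c.id"))

engine = create_engine("sqlite://")  # in-memory for the sketch
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Same append-based pattern as the quoted loop, scaled down to 2 per level.
for _ in range(2):
    x = A()
    session.add(x)  # the instance is added; children cascade via the append()s
    for _ in range(2):
        y = B()
        x.b.append(y)
        for _ in range(2):
            z = C()
            y.c.append(z)
            for _ in range(2):
                z.d.append(D())
session.flush()
session.commit()
```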
Thanks again for the help. I decided to time the various approaches. My original approach took 4:23 (minutes:seconds). Note: all my times included data generation and insertion into an on-disk SQLite database.

> The biggest speed variable in a mass INSERT operation is whether or not
> individual cursor.execute() calls, or a small handful of
> cursor.executemany() calls each with thousands of rows, are used.
>
> With the ORM, a mass executemany() is used for INSERT in the case that
> primary key values are already present in the given objects. If not, the
> ORM has to INSERT each A() row one at a time, get the new primary key
> value, and then later populate 25*25 B() objects' foreign key with the
> A.id value; then this goes down to the B()->C() area, etc.
>
> So if A(), B(), C(), D() can be created with "id=5", "id=6", etc.,
> assuming "id" is the primary key, the ORM should be able to bunch lots
> of rows together into one cursor.executemany() call and you'll see an
> immediate, dramatic speedup.

This took 3:36.

> The next level would be if you populated the "a_id", "b_id",
> foreign-key-to-parent columns directly instead of using append(). If you
> did a profile on your script you'd see lots of time taken in many
> places, but all those append() operations would be one of them, as well
> as lots of event firing and bookkeeping that SQLAlchemy has to do when
> they occur, both at append() time as well as within the flush()
> (populating the primary key values to foreign key attributes).

This took 2:28.

> By far the fastest way to do this would be to use
> session.execute(a_table.insert(), [<rows>]),
> session.execute(b_table.insert(), [<rows>]), etc. That is, assemble the
> whole set of A, B, C, D, directly in terms of the mapped table, or
> better yet do it in chunks, perhaps drill down through B, C, D for a
> single A then insert everything, etc.
> That way you optimize how these rows are constructed in Python exactly
> to the pattern that corresponds directly to the database structure,
> instead of having SQLAlchemy decode the database structure from an
> object hierarchy. An insertmany is documented at
> http://www.sqlalchemy.org/docs/core/tutorial.html#executing-multiple-statements.
> The ORM Session has an execute() method just like Connection does.

I did this with one session.execute() for each table, rather than chunks. It took 0:46.

> The range of speedups here would be between 30% and 80%, with direct
> usage of connection/session .execute() with Table metadata giving you
> the 80%.

Mark
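[A sketch of the two middle optimizations quoted above, pre-assigned primary keys and directly populated foreign keys, shown with just two levels (A -< B) for brevity. The model, table, and column names are assumptions consistent with the thread, not the poster's actual code.]

```python
# Pre-assigning "id" lets the ORM batch rows into cursor.executemany();
# setting "a_id" directly avoids the append() event/bookkeeping overhead.
from sqlalchemy import Column, ForeignKey, Integer, create_engine
try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4+
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base  # older versions
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class A(Base):
    __tablename__ = "a"
    id = Column(Integer, primary_key=True)

class B(Base):
    __tablename__ = "b"
    id = Column(Integer, primary_key=True)
    a_id = Column(Integer, ForeignKey("a.id"))

engine = create_engine("sqlite://")  # in-memory for the sketch
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

next_b_id = 1
for a_id in range(1, 26):
    # primary key supplied up front -> no per-row round trip to fetch it
    session.add(A(id=a_id))
    for _ in range(25):
        # foreign key populated directly instead of parent.b.append(child)
        session.add(B(id=next_b_id, a_id=a_id))
        next_b_id += 1
session.commit()
```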
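[And a sketch of the fastest approach: assembling plain dicts and issuing one executemany-style insert per table against Core Table metadata, as in the session.execute(a_table.insert(), [rows]) pattern quoted above. Again two levels only, with assumed table/column names; the same calls work via a Session's execute() as via a Connection's.]

```python
# One multi-row insert per table: SQLAlchemy passes the list of dicts to
# cursor.executemany(), skipping ORM unit-of-work bookkeeping entirely.
from sqlalchemy import Column, ForeignKey, Integer, MetaData, Table, create_engine

metadata = MetaData()
a_table = Table("a", metadata, Column("id", Integer, primary_key=True))
b_table = Table(
    "b", metadata,
    Column("id", Integer, primary_key=True),
    Column("a_id", Integer, ForeignKey("a.id")),
)

engine = create_engine("sqlite://")  # in-memory for the sketch
metadata.create_all(engine)

# Build the rows as plain dicts, mirroring the table structure directly.
a_rows = [{"id": i} for i in range(1, 26)]
b_rows = [
    {"id": (i - 1) * 25 + j, "a_id": i}
    for i in range(1, 26)
    for j in range(1, 26)
]

with engine.begin() as conn:
    conn.execute(a_table.insert(), a_rows)  # one executemany for all of A
    conn.execute(b_table.insert(), b_rows)  # one executemany for all of B

with engine.connect() as conn:
    inserted_b = len(conn.execute(b_table.select()).fetchall())
```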