Re: [sqlalchemy] Populate sample data
Hi Mark,

I don't know much about SQLite performance, but 390k inserts in 4 minutes is about 1625 inserts/second, which I think is pretty impressive :D

One of the things that most affects insert and update performance is the set of indexes on the fields. For each indexed field inserted, the db must write the data record AND update the index. Removing unneeded indexes would certainly speed things up.

Cheers,

2011/8/4 Mark Erbaugh <m...@microenh.com>

> I'm using SA (with SQLite) with a schema like A - B - C - D, where "-" means the tables have a one-to-many relationship. I'm populating a sample data set with 25 rows in A, 25 rows in B for each row in A, 25 rows in C for each row in B, and 25 rows in D for each row in C. This results in about 390k rows in D. The database itself is only about 12 MB, but it takes a long time (several minutes) to write the data to the file.
>
> I'm taking the approach of appending items to each table's relationship collection:
>
>     for i in range(25):
>         x = A()
>         session.add(x)  # note: the original snippet had session.add(A); the instance, not the class, should be added
>         for j in range(25):
>             y = B()
>             x.b.append(y)
>             for k in range(25):
>                 z = C()
>                 y.c.append(z)
>                 for l in range(25):
>                     xx = D()
>                     z.d.append(xx)
>     session.flush()
>
> The bulk of the delay seems to be the session.flush() call. I'm using the Pyramid framework, which uses Python's transaction module; I call transaction.begin() prior to adding the rows, and according to the SQLite FAQ this should speed things up.
>
> Are there any suggestions on how to speed things up?
>
> Thanks,
> Mark

--
Pau Tallada Crespí
Dep. d'Astrofísica i Cosmologia
Port d'Informació Científica (PIC)
Tel: +34 93 586 8233

--
You received this message because you are subscribed to the Google Groups sqlalchemy group.
To post to this group, send email to sqlalchemy@googlegroups.com. To unsubscribe from this group, send email to sqlalchemy+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/sqlalchemy?hl=en.
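As a rough sketch of the index cost Pau describes, using the stdlib sqlite3 module with a hypothetical table: each insert into an indexed table must also update the index, so for a one-off bulk load it can be cheaper to drop the index first and recreate it once afterwards (whether this wins in practice depends on the data; measure before relying on it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (id INTEGER PRIMARY KEY, c_id INTEGER, val TEXT)")
conn.execute("CREATE INDEX ix_d_c_id ON d (c_id)")

rows = [(i % 25, "x") for i in range(10000)]

# Every insert below would also have to update ix_d_c_id row by row;
# dropping the index and rebuilding it once after the load avoids that.
conn.execute("DROP INDEX ix_d_c_id")
conn.executemany("INSERT INTO d (c_id, val) VALUES (?, ?)", rows)
conn.execute("CREATE INDEX ix_d_c_id ON d (c_id)")
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM d").fetchone()[0]
print(count)  # 10000
```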
Re: [sqlalchemy] Populate sample data
On Aug 3, 2011, at 8:38 PM, Mark Erbaugh wrote:

> I'm using SA (with SQLite) with a schema like A - B - C - D, where "-" means the tables have a one-to-many relationship. I'm populating a sample data set with 25 rows in A, 25 rows in B for each row in A, 25 rows in C for each row in B, and 25 rows in D for each row in C. This results in about 390k rows in D. The database itself is only about 12 MB, but it takes a long time (several minutes) to write the data to the file. I'm taking the approach of appending items to each table's relationship collection:
>
>     for i in range(25):
>         x = A()
>         session.add(x)
>         for j in range(25):
>             y = B()
>             x.b.append(y)
>             for k in range(25):
>                 z = C()
>                 y.c.append(z)
>                 for l in range(25):
>                     xx = D()
>                     z.d.append(xx)
>     session.flush()

The biggest speed variable in a mass INSERT operation is whether individual cursor.execute() calls are used, or a small handful of cursor.executemany() calls, each with thousands of rows. With the ORM, a mass executemany() is used for INSERT when primary key values are already present on the given objects. If not, the ORM has to INSERT each A() row one at a time, get the new primary key value, and then later populate the 25*25 B() objects' foreign keys with the A.id value; the same then happens down at the B()-C() level, and so on.

So if A(), B(), C(), D() can be created with id=5, id=6, etc., assuming id is the primary key, the ORM should be able to bunch lots of rows together into one cursor.executemany() call and you'll see an immediate, dramatic speedup.

The next level would be to populate the a_id, b_id foreign-key-to-parent columns directly instead of using append(). If you profiled your script you'd see time taken in many places, but all those append() operations would be one of them, along with the event firing and bookkeeping SQLAlchemy has to do when they occur, both at append() time and within the flush() (populating primary key values onto foreign key attributes).
By far the fastest way would be to use session.execute(a_table.insert(), [rows]), session.execute(b_table.insert(), [rows]), etc. That is, assemble the whole set of A, B, C, and D rows directly in terms of the mapped tables, or better yet do it in chunks: drill down through B, C, and D for a single A, then insert everything, and so on. That way you construct the rows in Python in exactly the pattern that corresponds to the database structure, instead of having SQLAlchemy decode the database structure from an object hierarchy. An "insert many" is documented at http://www.sqlalchemy.org/docs/core/tutorial.html#executing-multiple-statements. The ORM Session has an execute() method just like Connection does.

The range of speedups here would be between 30% and 80%, with direct usage of connection/session execute() with Table metadata giving you the 80%.

I'm not sure what "transaction" is in transaction.begin(); if you're using a regular SQLAlchemy Session, it is always in a transaction, in that it uses a single connection until rollback() or commit() is called.
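The execute()-vs-executemany() distinction Michael describes can be seen at the DB-API level with the stdlib sqlite3 module (the layer SQLAlchemy ultimately drives). The table and row shapes here are hypothetical, a minimal sketch rather than the thread's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (id INTEGER PRIMARY KEY, c_id INTEGER, val TEXT)")

# One round trip into the driver per row: this is what the ORM falls
# back to when it must fetch each generated primary key.
for i in range(1000):
    conn.execute("INSERT INTO d (c_id, val) VALUES (?, ?)", (i % 25, "x"))

# One executemany() call carrying all the rows: what the ORM can batch
# into when primary key values are known ahead of time.
rows = [(i % 25, "x") for i in range(1000)]
conn.executemany("INSERT INTO d (c_id, val) VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM d").fetchone()[0]
print(count)  # 2000
```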
Re: [sqlalchemy] Populate sample data
On Aug 4, 2011, at 9:22 AM, Michael Bayer wrote:

> The range of speedups here would be between 30% and 80%, with direct usage of connection/session execute() with Table metadata giving you the 80%.

Thanks. I'll look into your suggestions.

> I'm not sure what "transaction" is in transaction.begin(); if you're using a regular SQLAlchemy Session, it is always in a transaction, in that it uses a single connection until rollback() or commit() is called.

Originally, I thought transaction was from the standard Python library, but upon research it looks like it's from the transaction package that is part of Zope. It's included in the Pyramid installation.

Mark
Re: [sqlalchemy] Populate sample data
On Aug 4, 2011, at 1:20 PM, Mark Erbaugh wrote:

> Originally, I thought transaction was from the standard Python library, but upon research it looks like it's from the transaction package that is part of Zope. It's included in the Pyramid installation.

Pyramid installs the Zope transaction manager by default for SQLAlchemy usage? Why do they keep saying Pyramid isn't zopey? The TM is the zopiest Zope there is.
Re: [sqlalchemy] Populate sample data
On Aug 4, 2011, at 9:22 AM, Michael Bayer wrote:

> On Aug 3, 2011, at 8:38 PM, Mark Erbaugh wrote:
>> I'm using SA (with SQLite) with a schema like A - B - C - D [...]

Thanks again for the help. I decided to time the various approaches. My original approach took 4:23 (minutes:seconds). Note: all my times include data generation and insertion into an on-disk SQLite database.

> The biggest speed variable in a mass INSERT operation is whether individual cursor.execute() calls are used, or a small handful of cursor.executemany() calls, each with thousands of rows. [...] So if A(), B(), C(), D() can be created with id=5, id=6, etc., assuming id is the primary key, the ORM should be able to bunch lots of rows together into one cursor.executemany() call and you'll see an immediate, dramatic speedup.

This took 3:36.

> The next level would be to populate the a_id, b_id foreign-key-to-parent columns directly instead of using append(). [...]

This took 2:28.

> By far the fastest way would be to use session.execute(a_table.insert(), [rows]), session.execute(b_table.insert(), [rows]), etc. [...] The ORM Session has an execute() method just like Connection does.

I did this with one session.execute() for each table, rather than chunks. It took 0:46.

> The range of speedups here would be between 30% and 80%, with direct usage of connection/session execute() with Table metadata giving you the 80%.

Mark
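For comparison, the fastest variant can be sketched without SQLAlchemy at all, using the stdlib sqlite3 module: precompute every id and foreign key in Python, then issue one bulk insert per table. The a/b/c/d table names and column shapes are hypothetical stand-ins for the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER REFERENCES a(id));
    CREATE TABLE c (id INTEGER PRIMARY KEY, b_id INTEGER REFERENCES b(id));
    CREATE TABLE d (id INTEGER PRIMARY KEY, c_id INTEGER REFERENCES c(id));
""")

N = 25  # 25 children per parent at each level

# ids are assigned sequentially in Python, so each row's parent id is
# simply its own id divided by N; no generated keys need fetching.
a_rows = [(i,) for i in range(N)]
b_rows = [(i, i // N) for i in range(N * N)]
c_rows = [(i, i // N) for i in range(N * N * N)]
d_rows = [(i, i // N) for i in range(N * N * N * N)]

# One bulk insert per table, mirroring one session.execute() per table.
conn.executemany("INSERT INTO a (id) VALUES (?)", a_rows)
conn.executemany("INSERT INTO b (id, a_id) VALUES (?, ?)", b_rows)
conn.executemany("INSERT INTO c (id, b_id) VALUES (?, ?)", c_rows)
conn.executemany("INSERT INTO d (id, c_id) VALUES (?, ?)", d_rows)
conn.commit()

d_count = conn.execute("SELECT COUNT(*) FROM d").fetchone()[0]
print(d_count)  # 390625, i.e. 25**4, the "about 390k rows in D"
```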
Re: [sqlalchemy] Populate sample data
On Aug 4, 2011, at 2:48 PM, Mark Erbaugh wrote:

> Thanks again for the help. I decided to time the various approaches. My original approach took 4:23 (minutes:seconds). Note: all my times include data generation and insertion into an on-disk SQLite database.
>
> This took 3:36.

17%...

> This took 2:28.

43%...

> I did this with one session.execute() for each table, rather than chunks. It took 0:46.

82%! Fairly nailed the last one at least:

> The range of speedups here would be between 30% and 80%, with direct usage of connection/session execute() with Table metadata giving you the 80%.
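The percentages above follow from converting the timings to seconds; a few lines verify them (the figures in this message appear to truncate rather than round):

```python
def seconds(m, s):
    """Convert a m:ss timing to total seconds."""
    return 60 * m + s

baseline = seconds(4, 23)  # original ORM append() approach: 263 s
timings = {
    "pre-assigned primary keys": seconds(3, 36),
    "direct foreign-key columns": seconds(2, 28),
    "core insert-many per table": seconds(0, 46),
}

# Truncated percentage reduction relative to the baseline.
pcts = {name: int(100 * (baseline - t) / baseline) for name, t in timings.items()}
print(pcts)
# {'pre-assigned primary keys': 17, 'direct foreign-key columns': 43,
#  'core insert-many per table': 82}
```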
[sqlalchemy] Populate sample data
I'm using SA (with SQLite) with a schema like A - B - C - D, where "-" means the tables have a one-to-many relationship.

I'm populating a sample data set with 25 rows in A, 25 rows in B for each row in A, 25 rows in C for each row in B, and 25 rows in D for each row in C. This results in about 390k rows in D. The database itself is only about 12 MB, but it takes a long time (several minutes) to write the data to the file.

I'm taking the approach of appending items to each table's relationship collection:

    for i in range(25):
        x = A()
        session.add(x)  # note: the original snippet had session.add(A); the instance, not the class, should be added
        for j in range(25):
            y = B()
            x.b.append(y)
            for k in range(25):
                z = C()
                y.c.append(z)
                for l in range(25):
                    xx = D()
                    z.d.append(xx)
    session.flush()

The bulk of the delay seems to be the session.flush() call. I'm using the Pyramid framework, which uses Python's transaction module; I call transaction.begin() prior to adding the rows, and according to the SQLite FAQ this should speed things up.

Are there any suggestions on how to speed things up?

Thanks,
Mark