Re: [sqlalchemy] Populate sample data

2011-08-04 Thread Pau Tallada
Hi Mark,

I don't know much about SQLite performance, but 390k inserts in 4 minutes is
about 1625 inserts/second, which I think is pretty impressive :D
One of the things that most affects insert and update performance is the set
of indexes on the fields. For each indexed field inserted, the db must write
the data record AND update the index.
Removing unneeded indexes would certainly speed things up.
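A minimal sketch of that point (table and index names are made up, and the API is current SQLAlchemy rather than the 2011 version): drop a non-essential index before the bulk load and recreate it afterwards, so the index is built once instead of being maintained per row.

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE d (id INTEGER PRIMARY KEY, value INTEGER)"))
    conn.execute(text("CREATE INDEX ix_d_value ON d (value)"))  # hypothetical index

with engine.begin() as conn:
    conn.execute(text("DROP INDEX ix_d_value"))  # skip per-row index updates
    conn.execute(
        text("INSERT INTO d (value) VALUES (:v)"),
        [{"v": i} for i in range(1000)],  # executemany-style bulk insert
    )
    conn.execute(text("CREATE INDEX ix_d_value ON d (value)"))  # rebuild once
```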

Cheers,

2011/8/4 Mark Erbaugh m...@microenh.com

 I'm using SA (with SQLite) with a schema like:

 A -> B -> C -> D

 where -> means that the tables have a one-to-many relationship

 I'm populating a sample data set where there are 25 rows in A, 25 rows in
 B for each row in A, 25 rows in C for each row in B and 25 rows in D for
 each row in C.  This results in about 390k rows in D.  The database itself
 is only about 12 MB, but it takes a long time (several minutes) to write the
 data to the file.

 I'm taking the approach of appending items to the table's relationship
 collection.

 for i in range(25):
     x = A()
     session.add(x)
     for j in range(25):
         y = B()
         x.b.append(y)
         for k in range(25):
             z = C()
             y.c.append(z)
             for l in range(25):
                 xx = D()
                 z.d.append(xx)
 session.flush()


 The bulk of the delay seems to be the session.flush call.

 I'm using the Pyramid framework, which uses Python's transaction module.  I
 call transaction.begin() prior to adding the rows. According to the SQLite
 FAQ, this should speed things up.

 Are there any suggestions on how to speed things up?

 Thanks,
 Mark

 --
 You received this message because you are subscribed to the Google Groups
 sqlalchemy group.
 To post to this group, send email to sqlalchemy@googlegroups.com.
 To unsubscribe from this group, send email to
 sqlalchemy+unsubscr...@googlegroups.com.
 For more options, visit this group at
 http://groups.google.com/group/sqlalchemy?hl=en.




-- 
--
Pau Tallada Crespí
Dep. d'Astrofísica i Cosmologia
Port d'Informació Científica (PIC)
Tel: +34 93 586 8233
--




Re: [sqlalchemy] Populate sample data

2011-08-04 Thread Michael Bayer

On Aug 3, 2011, at 8:38 PM, Mark Erbaugh wrote:

 I'm using SA (with SQLite) with a schema like:

 A -> B -> C -> D

 where -> means that the tables have a one-to-many relationship

 I'm populating a sample data set where there are 25 rows in A, 25 rows in B
 for each row in A, 25 rows in C for each row in B and 25 rows in D for each
 row in C.  This results in about 390k rows in D.  The database itself is only
 about 12 MB, but it takes a long time (several minutes) to write the data to
 the file.

 I'm taking the approach of appending items to the table's relationship collection.

 for i in range(25):
     x = A()
     session.add(x)
     for j in range(25):
         y = B()
         x.b.append(y)
         for k in range(25):
             z = C()
             y.c.append(z)
             for l in range(25):
                 xx = D()
                 z.d.append(xx)
 session.flush()

The biggest speed variable in a mass INSERT operation is whether or not 
individual cursor.execute() calls, or a small handful of cursor.executemany() 
calls each with thousands of rows, are used.

With the ORM, a mass executemany() is used for INSERT in the case that primary 
key values are already present in the given objects.   If not, the ORM has to 
INSERT each A() row one at a time, get the new primary key value, and then 
later populate 25*25 B() objects' foreign keys with the A.id value; then this 
goes down to the B()->C() area, etc.

So if A(), B(), C(), D() can be created with id=5, id=6, etc., assuming 
id is the primary key, the ORM should be able to bunch lots of rows together 
into one cursor.executemany() call and you'll see an immediate, dramatic 
speedup.
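A sketch of that approach, under assumed minimal mappings (two levels instead of four, and today's declarative API rather than the 2011 one): with primary keys assigned up front, the ORM never has to round-trip for newly generated ids.

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class A(Base):
    __tablename__ = "a"
    id = Column(Integer, primary_key=True)
    b = relationship("B")

class B(Base):
    __tablename__ = "b"
    id = Column(Integer, primary_key=True)
    a_id = Column(Integer, ForeignKey("a.id"))

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

session = Session(engine)
b_id = 0
for a_id in range(1, 26):
    x = A(id=a_id)              # primary key assigned up front
    session.add(x)
    for _ in range(25):
        b_id += 1
        x.b.append(B(id=b_id))  # pre-assigned id: batchable INSERT
session.commit()
```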

The next level would be if you populated the a_id, b_id, 
foreign-key-to-parent columns directly instead of using append().   If you did 
a profile on your script you'd see lots of time taken in many places, but all 
those append() operations would be one of them, as well as lots of event firing 
and bookkeeping that SQLAlchemy has to do when they occur, both at append() 
time as well as within the flush() (populating the primary key values to 
foreign key attributes).
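A sketch of that variant, again with assumed two-level mappings: the foreign key column is set by hand and the relationship collection is never touched, so none of the collection-event bookkeeping fires.

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class A(Base):
    __tablename__ = "a"
    id = Column(Integer, primary_key=True)

class B(Base):
    __tablename__ = "b"
    id = Column(Integer, primary_key=True)
    a_id = Column(Integer, ForeignKey("a.id"))

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

session = Session(engine)
b_id = 0
for a_id in range(1, 26):
    session.add(A(id=a_id))
    for _ in range(25):
        b_id += 1
        session.add(B(id=b_id, a_id=a_id))  # FK set directly, no append()
session.commit()
```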

By far the fastest way to do this would be to use 
session.execute(a_table.insert(), [rows]), session.execute(b_table.insert(), 
[rows]), etc.   That is, assemble the whole set of A, B, C, D, directly in 
terms of the mapped table,  or better yet do it in chunks, perhaps drill down 
through B, C, D for a single A then insert everything, etc.   That way you 
optimize how these rows are constructed in Python exactly to the pattern that 
corresponds directly to the database structure, instead of having SQLAlchemy 
decode the database structure from an object hierarchy.  An insertmany is 
documented at 
http://www.sqlalchemy.org/docs/core/tutorial.html#executing-multiple-statements.
   The ORM Session has an execute() method just like Connection does.
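In current SQLAlchemy that looks roughly like the following (the table definitions are assumptions standing in for the poster's actual schema, and only two levels are shown): plain dicts shaped exactly like the rows, one executemany per table.

```python
from sqlalchemy import (Column, ForeignKey, Integer, MetaData, Table,
                        create_engine)

metadata = MetaData()
a_table = Table("a", metadata, Column("id", Integer, primary_key=True))
b_table = Table(
    "b", metadata,
    Column("id", Integer, primary_key=True),
    Column("a_id", Integer, ForeignKey("a.id")),
)

engine = create_engine("sqlite:///:memory:")
metadata.create_all(engine)

# Rows are built directly in the shape of the tables; no ORM bookkeeping.
a_rows = [{"id": i} for i in range(1, 26)]
b_rows = [{"id": (i - 1) * 25 + j, "a_id": i}
          for i in range(1, 26) for j in range(1, 26)]

with engine.begin() as conn:
    conn.execute(a_table.insert(), a_rows)  # one executemany per table
    conn.execute(b_table.insert(), b_rows)
```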

The range of speedups here would be between 30% and 80%, with direct usage of 
connection/session .execute() with Table metadata giving you the 80%.

I'm not sure what transaction is in transaction.begin(); if you're using a 
regular SQLAlchemy Session, it is always in a transaction in the sense that it 
uses a single connection until rollback() or commit() is called.
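To illustrate that point, a small sketch in current SQLAlchemy (table name is made up): the Session accumulates work on one connection inside a single transaction, with no separate transaction manager needed.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session

engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE t (x INTEGER)"))

with Session(engine) as session:
    # Both statements run on the same connection, in one transaction.
    session.execute(text("INSERT INTO t (x) VALUES (1)"))
    session.execute(text("INSERT INTO t (x) VALUES (2)"))
    session.commit()  # the transaction ends only here
```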





Re: [sqlalchemy] Populate sample data

2011-08-04 Thread Mark Erbaugh

On Aug 4, 2011, at 9:22 AM, Michael Bayer wrote:

 The range of speedups here would be between 30% and 80%, with direct usage of 
 connection/session .execute() with Table metadata giving you the 80%.

Thanks. I'll look into your suggestions.

 
 I'm not sure what transaction is in transaction.begin(); if you're using a 
 regular SQLAlchemy Session, it is always in a transaction in the sense that 
 it uses a single connection until rollback() or commit() is called.

Originally, I thought transaction was from the standard Python library, but 
upon research, it looks like it's from the transaction package that is part of 
Zope. It's included in the Pyramid installation.

Mark




Re: [sqlalchemy] Populate sample data

2011-08-04 Thread Michael Bayer

On Aug 4, 2011, at 1:20 PM, Mark Erbaugh wrote:

 
 
 Originally, I thought transaction was from the standard Python library, but 
 upon research, it looks like it's from the transaction package that is part 
 of Zope. It's included in the Pyramid installation.

Pyramid installs the zope transaction manager by default for SQLAlchemy usage?
Why do they keep saying Pyramid isn't zopey?   The TM is the zopiest zope
there is.






Re: [sqlalchemy] Populate sample data

2011-08-04 Thread Mark Erbaugh

On Aug 4, 2011, at 9:22 AM, Michael Bayer wrote:

 
 On Aug 3, 2011, at 8:38 PM, Mark Erbaugh wrote:
 
 I'm using SA (with SQLite) with a schema like:

 A -> B -> C -> D

 where -> means that the tables have a one-to-many relationship

 I'm populating a sample data set where there are 25 rows in A, 25 rows in B
 for each row in A, 25 rows in C for each row in B and 25 rows in D for each
 row in C.  This results in about 390k rows in D.  The database itself is
 only about 12 MB, but it takes a long time (several minutes) to write the
 data to the file.

 I'm taking the approach of appending items to the table's relationship
 collection.

 for i in range(25):
     x = A()
     session.add(x)
     for j in range(25):
         y = B()
         x.b.append(y)
         for k in range(25):
             z = C()
             y.c.append(z)
             for l in range(25):
                 xx = D()
                 z.d.append(xx)
 session.flush()


Thanks again for the help.  I decided to time the various approaches.  My 
original approach took 4:23 (minutes:seconds). Note: all my times included 
data generation and insertion into an on-disk SQLite database.

 The biggest speed variable in a mass INSERT operation is whether or not 
 individual cursor.execute() calls, or a small handful of cursor.executemany() 
 calls each with thousands of rows, are used.
 
 With the ORM, a mass executemany() is used for INSERT in the case that 
 primary key values are already present in the given objects.   If not, the 
 ORM has to INSERT each A() row one at a time, get the new primary key value, 
 and then later populate 25*25 B() objects' foreign keys with the A.id value; 
 then this goes down to the B()->C() area, etc.
 
 So if A(), B(), C(), D() can be created with id=5, id=6, etc., assuming 
 id is the primary key, the ORM should be able to bunch lots of rows 
 together into one cursor.executemany() call and you'll see an immediate, 
 dramatic speedup.


This took 3:36

 
 The next level would be if you populated the a_id, b_id, 
 foreign-key-to-parent columns directly instead of using append().   If you 
 did a profile on your script you'd see lots of time taken in many places, but 
 all those append() operations would be one of them, as well as lots of event 
 firing and bookkeeping that SQLAlchemy has to do when they occur, both at 
 append() time as well as within the flush() (populating the primary key 
 values to foreign key attributes).


This took 2:28

 By far the fastest way to do this would be to use 
 session.execute(a_table.insert(), [rows]), 
 session.execute(b_table.insert(), [rows]), etc.   That is, assemble the 
 whole set of A, B, C, D, directly in terms of the mapped table,  or better 
 yet do it in chunks, perhaps drill down through B, C, D for a single A then 
 insert everything, etc.   That way you optimize how these rows are 
 constructed in Python exactly to the pattern that corresponds directly to the 
 database structure, instead of having SQLAlchemy decode the database 
 structure from an object hierarchy.  An insertmany is documented at 
 http://www.sqlalchemy.org/docs/core/tutorial.html#executing-multiple-statements.
The ORM Session has an execute() method just like Connection does.

I did this with one session.execute for each table, rather than chunks. It took 
0:46

 The range of speedups here would be between 30% and 80%, with direct usage of 
 connection/session .execute() with Table metadata giving you the 80%.

Mark





Re: [sqlalchemy] Populate sample data

2011-08-04 Thread Michael Bayer

On Aug 4, 2011, at 2:48 PM, Mark Erbaugh wrote:
 Thanks again for the help.  I decided to time the various approaches.  My 
 original approach took 4:23 (minutes: seconds). Note: all my times included 
 data generation and insertion into a SQLite on-disk database.
 
 
 
 This took 3:36

17%...

 This took 2:28

43%...


 I did this with one session.execute for each table, rather than chunks. It 
 took 0:46

82%!  Fairly nailed the last one, at least.

 The range of speedups here would be between 30% and 80%, with direct usage 
 of connection/session .execute() with Table metadata giving you the 80%.




[sqlalchemy] Populate sample data

2011-08-03 Thread Mark Erbaugh
I'm using SA (with SQLite) with a schema like:

A -> B -> C -> D

where -> means that the tables have a one-to-many relationship

I'm populating a sample data set where there are 25 rows in A, 25 rows in B 
for each row in A, 25 rows in C for each row in B and 25 rows in D for each row 
in C.  This results in about 390k rows in D.  The database itself is only about 
12 MB, but it takes a long time (several minutes) to write the data to the file.

I'm taking the approach of appending items to the table's relationship collection.

for i in range(25):
    x = A()
    session.add(x)
    for j in range(25):
        y = B()
        x.b.append(y)
        for k in range(25):
            z = C()
            y.c.append(z)
            for l in range(25):
                xx = D()
                z.d.append(xx)
session.flush()


The bulk of the delay seems to be the session.flush call.

I'm using the Pyramid framework, which uses Python's transaction module.  I call 
transaction.begin() prior to adding the rows. According to the SQLite FAQ, this 
should speed things up.

Are there any suggestions on how to speed things up?

Thanks,
Mark
