Re: [sqlalchemy] Speed up bulk inserts
Sure, that feature is here: http://docs.sqlalchemy.org/en/rel_0_9/core/dml.html?highlight=insert%20values#sqlalchemy.sql.expression.Insert.values.params.*args

On Aug 18, 2014, at 7:32 PM, Anton anschat...@gmail.com wrote:

Hi Michael, I checked your post on Stack Overflow; could you help me figure something out regarding the usage of Core? In the example you provided, the test_sqlite3 version uses a custom user-supplied query, so it issues only one INSERT statement with all rows in the VALUES clause. I wonder if this can be achieved using Core. I tried this approach:

    session.connection().execute(mytable.insert(), raw_list)

where raw_list is a list of dictionaries to insert, and this issues a separate statement for every row, which is probably expected. But then I came across this docs page: http://docs.sqlalchemy.org/en/latest/core/tutorial.html#executing-multiple-statements. In the example there with a list of dictionaries, a single INSERT statement is issued with a multi-row VALUES clause. I wonder if I can achieve the same behavior. I am using PostgreSQL 9.3.4 and SQLAlchemy==0.9.4. Thanks! Best, Anton.
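To make the distinction concrete: passing the list of dictionaries as a separate argument to execute() uses the DBAPI's executemany(), one statement executed per parameter set, while passing the list to Insert.values() renders a single INSERT with a multi-row VALUES clause. A minimal sketch with a made-up table (echo=True shows the difference in emitted SQL):

    from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

    # Hypothetical table, for illustration only.
    engine = create_engine("sqlite://", echo=True)
    metadata = MetaData()
    mytable = Table(
        "mytable", metadata,
        Column("id", Integer, primary_key=True),
        Column("name", String(50)),
    )
    metadata.create_all(engine)

    raw_list = [{"name": "a"}, {"name": "b"}, {"name": "c"}]

    with engine.connect() as conn:
        # executemany(): the INSERT is compiled once and executed
        # once per dictionary in raw_list.
        conn.execute(mytable.insert(), raw_list)

        # Multi-row VALUES: one INSERT ... VALUES (...), (...), (...)
        # statement, as described at the docs link above.
        conn.execute(mytable.insert().values(raw_list))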
Re: [sqlalchemy] Speed up bulk inserts
Hi Michael, I checked your post on Stack Overflow; could you help me figure something out regarding the usage of Core? In the example you provided, the test_sqlite3 version uses a custom user-supplied query, so it issues only one INSERT statement with all rows in the VALUES clause. I wonder if this can be achieved using Core. I tried this approach:

    session.connection().execute(mytable.insert(), raw_list)

where raw_list is a list of dictionaries to insert, and this issues a separate statement for every row, which is probably expected. But then I came across this docs page: http://docs.sqlalchemy.org/en/latest/core/tutorial.html#executing-multiple-statements. In the example there with a list of dictionaries, a single INSERT statement is issued with a multi-row VALUES clause. I wonder if I can achieve the same behavior. I am using PostgreSQL 9.3.4 and SQLAlchemy==0.9.4. Thanks! Best, Anton.

On Wednesday, November 6, 2013 12:58:53 PM UTC-8, Michael Bayer wrote:

I wrote a full post regarding this topic on Stack Overflow at http://stackoverflow.com/questions/11769366/why-is-sqlalchemy-insert-with-sqlite-25-times-slower-than-using-sqlite3-directly/11769768#11769768 . If you start with this, I can answer more specific questions.
Re: [sqlalchemy] Speed up bulk inserts
On Wednesday, November 6, 2013 at 9:58:53 PM UTC+1, Michael Bayer wrote:

I wrote a full post regarding this topic on Stack Overflow at http://stackoverflow.com/questions/11769366/why-is-sqlalchemy-insert-with-sqlite-25-times-slower-than-using-sqlite3-directly/11769768#11769768 . If you start with this, I can answer more specific questions.

The article was very helpful, thanks. I still want to figure out the best balance between convenience and speed for my use case. Does the following make sense, and is it possible? I work only with PostgreSQL, and I'm sure that all involved objects have a unique id column called 'id'. So before doing a session.commit(), I could check how many objects are in my session. As I'm just bulk inserting, I know that all of them are new and don't have their ids set yet. Now I ask the database for that number of new ids, iterate over the objects in my session, and set the ids. Internally all ids would come from a single sequence, so I don't have to care about object types and so on. Afterwards SQLAlchemy should be aware that the ids have already been set, so no generated ids have to be returned, and the session.commit() should be much simpler and faster. It sounds like a still quite simple, but hopefully much faster, solution. Do you agree?

kind regards,
Achim
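A rough sketch of the bookkeeping Achim describes, with all names invented (the shared 'id' column is his stated convention, not a SQLAlchemy requirement); it assumes the block of fresh sequence values has already been fetched in a single round trip, as sketched after Claudio's message further down:

    def preassign_ids(session, id_block):
        # Assign pre-fetched sequence values to every object pending
        # INSERT, so the flush does not need to fetch generated ids
        # back from the database.
        ids = iter(id_block)
        for obj in session.new:          # new objects awaiting flush
            if obj.id is None:
                obj.id = next(ids)

    # Usage sketch: fetch len(session.new) fresh values, assign, commit.
    # preassign_ids(session, fetch_id_block(conn, len(session.new)))
    # session.commit()

With primary keys pre-assigned, the unit of work should be able to skip fetching each new id back, which is the saving being proposed; whether that is actually faster for a given workload is something to measure.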
Re: [sqlalchemy] Speed up bulk inserts
On Nov 13, 2013, at 4:57 AM, Achim do...@procoders.net wrote:

Afterwards SQLAlchemy should be aware that the ids have already been set, so no generated ids have to be returned, and the session.commit() should be much simpler and faster. It sounds like a still quite simple, but hopefully much faster, solution. Do you agree?

Sure, that should be fine, if you can pre-calc your PKs. It just won't work under any kind of concurrency, as in such a situation there could be interleaved INSERTs from different processes.
Re: [sqlalchemy] Speed up bulk inserts
On Nov 13, 2013, at 11:52 AM, Claudio Freire klaussfre...@gmail.com wrote:

PostgreSQL sequences already handle that kind of concurrency scenario.

How exactly? If two transactions T1 and T2 each pull a number from a sequence, T1 gets 40 and T2 gets 41, and then we pre-calc 10 inserts that have not yet occurred for each, T1 has 40-50 and T2 has 41-51; the numbering is totally wrong for both. In reality it would be some random distribution of 40-60 between T1 and T2, no?
Re: [sqlalchemy] Speed up bulk inserts
According to http://www.postgresql.org/docs/9.2/static/functions-sequence.html, sequences are non-transactional:

Advance the sequence object to its next value and return that value. This is done atomically: even if multiple sessions execute nextval concurrently, each will safely receive a distinct sequence value. ... To avoid blocking concurrent transactions that obtain numbers from the same sequence, a nextval operation is never rolled back; that is, once a value has been fetched it is considered used, even if the transaction that did the nextval later aborts. This means that aborted transactions might leave unused holes in the sequence of assigned values.
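A small hypothetical demonstration of the quoted semantics (sequence name and connection URL are invented; the same can be shown in plain psql):

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://scott:tiger@localhost/test")

    with engine.begin() as conn:
        conn.execute(text("CREATE SEQUENCE demo_seq"))

    conn = engine.connect()
    trans = conn.begin()
    conn.execute(text("SELECT nextval('demo_seq')")).scalar()   # -> 1
    trans.rollback()   # aborts the transaction, but not the nextval()

    # The value 1 is never reissued; a hole is left behind.
    print(conn.execute(text("SELECT nextval('demo_seq')")).scalar())  # -> 2
    conn.close()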
Re: [sqlalchemy] Speed up bulk inserts
On Wed, Nov 13, 2013 at 2:04 PM, Michael Bayer mike...@zzzcomputing.com wrote:

How exactly? If two transactions T1 and T2 each pull a number from a sequence ... in reality it would be some random distribution of 40-60 between T1 and T2, no?

No: you ask the same sequence for 10 ids up front, and the sequence allocates 40-49 to T1 and 50-59 to T2.
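A hypothetical sketch of that batch request (sequence name invented). One hedge worth adding: the n values returned by a single such statement are guaranteed distinct across concurrent sessions, but not necessarily one contiguous run, since each nextval() call is individually atomic; for pre-assigning primary keys, distinctness is what matters.

    from sqlalchemy import text

    def fetch_id_block(conn, n, seq_name="my_id_seq"):
        # One round trip: n nextval() calls in a single SELECT.
        result = conn.execute(
            text("SELECT nextval(:seq) FROM generate_series(1, :n)"),
            {"seq": seq_name, "n": n},
        )
        return [row[0] for row in result]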
Re: [sqlalchemy] Speed up bulk inserts
On Nov 13, 2013, at 12:22 PM, Claudio Freire klaussfre...@gmail.com wrote:

No: you ask the same sequence for 10 ids up front, and the sequence allocates 40-49 to T1 and 50-59 to T2.

Oh right, I was thinking of the trick as applied to MySQL. Sure, if you actually run the sequences, you're fine.
[sqlalchemy] Speed up bulk inserts
Hi,

I want to speed up my SQLAlchemy bulk-inserting code, and yes, I'm aware that this is not the main purpose of SQLAlchemy and that all databases have faster low-level import tools. The background is the following: we import data from various sources and apply various mappings. Currently I'm willing to trade raw runtime for much simpler code, which is much easier to maintain. But I still want my code to run as fast as possible under those assumptions. There are two scenarios I want to optimize:

1. Flat inserts without relations, but with unique ids generated inside the database: in that case, SQLAlchemy retrieves the unique ids from the database, but those ids are never used in my import process. I thought about generating an insert statement out of an object; obviously SQLAlchemy has to do that too, so there might be some existing tool for that? The other option would be to tell SQLAlchemy to ignore the generated id and to not retrieve it from the database. Is that possible?

2. Inserts using relations and unique ids generated in the database: I think SQLAlchemy is already quite good at optimizing this and sending objects as batches. If there are any tweaks and tricks to speed up inserts having relations, I would be happy to hear them.

cheers,
Achim
Re: [sqlalchemy] Speed up bulk inserts
I wrote a full post regarding this topic on Stack Overflow at http://stackoverflow.com/questions/11769366/why-is-sqlalchemy-insert-with-sqlite-25-times-slower-than-using-sqlite3-directly/11769768#11769768 . If you start with this, I can answer more specific questions.

On Nov 6, 2013, at 10:28 AM, Achim Domma do...@procoders.net wrote:

Hi, I want to speed up my SQLAlchemy bulk-inserting code ...
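The heart of that post is that the ORM's unit of work pays per-object overhead, while a Core insert hands the entire parameter list to the DBAPI in a single executemany() call and never fetches the generated ids back. A minimal sketch in the spirit of the post, using 0.9-era idioms and an invented model:

    import time
    from sqlalchemy import create_engine, Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Customer(Base):
        __tablename__ = "customer"
        id = Column(Integer, primary_key=True)
        name = Column(String(255))

    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)

    # Core bulk insert: one executemany() over the whole list.
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
        [{"name": "customer %d" % i} for i in range(10000)],
    )
    print("Core insert of 10000 rows: %.3fs" % (time.time() - t0))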