Re: [sqlalchemy] Batching INSERT statements

Mike Bayer Fri, 12 Feb 2021 14:16:39 -0800


On Fri, Feb 12, 2021, at 2:06 PM, Cristian Bulgaru wrote:
> Hi Vineet, Mike,
> 
> @Vineet, thank you for the interesting blog post on bulk insert with 
> SQLAlchemy ORM: 
> https://benchling.engineering/sqlalchemy-batch-inserts-a-module-for-when-youre-inserting-thousands-of-rows-and-it-s-slow-16ece0ef5bf7
> 
> A few questions:
> 
> 1. Do we need to get the incremented IDs from Postgresql itself, or can we 
> just fetch the current max ID on a table and increment IDs in Python without 
> querying the DB for the incremented values?
> 
> 2. I was intrigued by the following phrase:
> 
> > * P.S. execute_values in psycopg2 v2.8 supports getting returned values 
> > back, so it’s possible that SQLAlchemy may support batch inserting these 
> > models (with an auto-incrementing primary key) in the future.


SQLAlchemy now supports this in version 1.4 including that's it's used by the 
ORM:

https://docs.sqlalchemy.org/en/14/changelog/migration_14.html#psycopg2-dialect-features-execute-values-with-returning-for-insert-statements-by-default


basically it's now the most efficient INSERT + autoincrement PK we can get with 
the psycopg2 driver.







> 
> @Mike @Vineet, do you know if this is the case, ie if bulk insert now works 
> out of the box (without pre-assigning incremented IDs)?
> 
> 3. Does this imply any change in case of bulk insert of multiple models with 
> foreign keys referring to each other? This answer 
> <https://stackoverflow.com/a/36387887/11750716> seems to suggest 
> pre-assigning IDs for it to work.
> On Friday, February 21, 2020 at 3:49:54 PM UTC+1 Mike Bayer wrote:
>> __
>> Hi Vineet -
>> 
>> glad that worked!   I'll have to find some time to recall what we worked out 
>> here and how it came out for you, I wonder where on the site this kind of 
>> thing could be mentioned.    we have 3rd party dialects listed out in the 
>> docs but not yet a place for extensions.
>> 
>> On Wed, Feb 19, 2020, at 9:28 PM, Vineet Gopal wrote:
>>> Hi Mike,
>>> 
>>> Thanks for all of your help getting this working again. We've used this 
>>> solution in production for two years now, and it's helped our performance 
>>> significantly.
>>> 
>>> We just open-sourced the solution that we built so others can use it, and 
>>> are also writing a blog post to cover some of the details. I'm attaching a 
>>> copy of the blog post here. Obviously not expected, but if you are 
>>> interested in taking a look, we are happy to incorporate any comments that 
>>> you may have before publishing.
>>> 
>>> Here's a link to the repo as well: 
>>> https://github.com/benchling/sqlalchemy_batch_inserts
>>> 
>>> Best,
>>> Vineet
>>> 
>>> On Mon, Oct 9, 2017 at 10:27 PM <vin...@benchling.com> wrote:
>>>>> if you're using Postgresql, there's a vastly easier technique to use 
>>>>> which is just to pre-fetch from the sequence: 
>>>>> identities = [ 
>>>>>     val for val, in session.execute( 
>>>>>              "select nextval('mytable_seq') from " 
>>>>>              "generate_series(1,%s)" % len(my_objects)) 
>>>>>     ) 
>>>>> ] 
>>>>> for ident, obj in zip(identities, my_objects): 
>>>>>     obj.pk = ident 
>>>> Wow, that's a great idea! I got it working for most of our models. I have 
>>>> some questions about how inserts for joined-table inheritance tables are 
>>>> batched together, but I'll ask them in a separate post since they're 
>>>> somewhat unrelated to this.
>>>> 
>>>>> So the complexity of adding multi-values insert with sequences would 
>>>>> benefit an extremely narrow set of use cases, would be very labor 
>>>>> intensive to implement and maintain, and is unnecessary for the single 
>>>>> target platform in which this case would work. 
>>>> That makes sense, thanks for the explanation!
>>>> 
>>>>  
>>>> 
>>>> On Monday, October 9, 2017 at 8:44:51 AM UTC-7, Mike Bayer wrote:
>>>>> On Mon, Oct 9, 2017 at 4:15 AM,  <vin...@benchling.com> wrote: 
>>>>> > Hello! I've spent some time looking at SQLAlchemy's ability to batch 
>>>>> > inserts, and have a few questions about bulk_save_objects (and flushing 
>>>>> > in 
>>>>> > general). 
>>>>> > 
>>>>> > Two statements that I think are true: 
>>>>> > 
>>>>> > Right now, bulk_save_objects does not fetch primary keys for inserted 
>>>>> > rows 
>>>>> > (unless return_defaults is passed in, which mitigates the performance 
>>>>> > gains 
>>>>> > by performing one insert statement per model). 
>>>>> 
>>>>> the point of bulk save is that the objects passed to it are considered 
>>>>> as throwaway after the operation, to avoid the overhead of maintaining 
>>>>> the objects' persistence state in memory.   Using it with 
>>>>> return_defaults is nearly pointless as this will blow away the 
>>>>> efficiency gains you might get from the bulk operation, and is there 
>>>>> mostly to allow the joined-table inheritance use case to be usable. 
>>>>> 
>>>>> 
>>>>> > When running db.session.flush(), it looks like SQLAlchemy can batch 
>>>>> > model 
>>>>> > creation into a single multi-value insert statement *only if all of the 
>>>>> > primary keys are already defined on the model*. (Verified by looking at 
>>>>> > sqlalchemy/lib/sqlalchemy/orm/persistence.py:_emit_insert_statements) 
>>>>> > 
>>>>> > 
>>>>> > Questions: 
>>>>> > 
>>>>> > Would it be possible for bulk_save_objects to fetch primary keys while 
>>>>> > still 
>>>>> > batching INSERTs into a single query? 
>>>>> 
>>>>> What do you mean "single query", do you mean, executemany()?  the 
>>>>> answer is no because DBAPI drivers don't return result sets with 
>>>>> executemany().  Do you mean, a single INSERT..VALUES with all the 
>>>>> parameters in one statement ?  The answer is no because SQLAlchemy 
>>>>> doesn't include multi-values as a transparent option; the DBAPI 
>>>>> drivers instead make use of the multi-values syntax within their 
>>>>> executemany() implementations, where again, they don't return result 
>>>>> sets. 
>>>>> 
>>>>> Right now, only the MySQL drivers do this by default, the psycopg2 
>>>>> driver does it with a recently added option that SQLAlchemy does not 
>>>>> directly support, however you can set up via a connection event.   For 
>>>>> psycopg2 it can vastly speed up inserts as psycopg2's normal 
>>>>> executemany() implementation has some performance issues. 
>>>>> 
>>>>> 
>>>>> (This 
>>>>> > would help with e.g. inserting a bunch of rows into a table with an 
>>>>> > auto-incrementing primary key). 
>>>>> 
>>>>> if you're using Postgresql, there's a vastly easier technique to use 
>>>>> which is just to pre-fetch from the sequence: 
>>>>> 
>>>>> identities = [ 
>>>>>     val for val, in session.execute( 
>>>>>              "select nextval('mytable_seq') from " 
>>>>>              "generate_series(1,%s)" % len(my_objects)) 
>>>>>     ) 
>>>>> ] 
>>>>> for ident, obj in zip(identities, my_objects): 
>>>>>     obj.pk = ident 
>>>>> 
>>>>> 
>>>>> Now you don't need to do RETURNING or anything and the inserts can be 
>>>>> at their top efficiency. 
>>>>> 
>>>>> 
>>>>> 
>>>>> > 
>>>>> > At least in Postgres (haven't verified for other databases), it looks 
>>>>> > like 
>>>>> > one can use RETURNING for inserts with multi-value VALUES clauses. At 
>>>>> > the 
>>>>> > surface, it seems like one could extend the current behavior for a 
>>>>> > single 
>>>>> > row INSERT: 
>>>>> > 
>>>>> > INSERT INTO table ... VALUES (1) RETURNING id 
>>>>> > 
>>>>> > to multi-value INSERTS: 
>>>>> > 
>>>>> > INSERT INTO table ... VALUES (1), (2), ... RETURNING id 
>>>>> > 
>>>>> > and get all the benefits of the ORM while still batching inserts. 
>>>>> 
>>>>> 
>>>>> This would be an enormous undertaking to implement, test, and release. 
>>>>>   It would have to exist as an entirely additional series of codepaths 
>>>>> within persistence.py as multi-valued INSERT is not available on most 
>>>>> databases as well as on earlier versions of the databases that do 
>>>>> support it.     For all of that complexity, the approach would work on 
>>>>> exactly:  Postgresql only - SQLite and MySQL don't support RETURNING, 
>>>>> Oracle, SQL Server and others don't support INSERT..VALUES with 
>>>>> multiple sets. 
>>>>> 
>>>>> If you are using Postgresql, you can instead pre-fetch the sequence up 
>>>>> front.  This can even be integrated into a before_flush() event. 
>>>>> 
>>>>> So the complexity of adding multi-values insert with sequences would 
>>>>> benefit an extremely narrow set of use cases, would be very labor 
>>>>> intensive to implement and maintain, and is unnecessary for the single 
>>>>> target platform in which this case would work. 
>>>>> 
>>>>> 
>>>>> > 
>>>>> > I'm sure this is something that was thought about already, so I'd 
>>>>> > mostly 
>>>>> > love to hear any thoughts about what makes this hard. We have a lot of 
>>>>> > complex relationships and joined-table inheritance, so working with 
>>>>> > bulk_save_objects has been a bit challenging. From the comments in 
>>>>> > sqlalchemy/lib/sqlalchemy/sql/dml.py:return_defaults(), it seems like 
>>>>> > SQLAlchemy is already aware that RETURNING for insert statements with 
>>>>> > multi-values VALUES clauses is supported, so it's possible there is a 
>>>>> > reason 
>>>>> > that this is hard or can't be done. 
>>>>> > 
>>>>> > This is also something that, if it just a “missing feature”, I would 
>>>>> > love to 
>>>>> > learn more and see if I could contribute. 
>>>>> > 
>>>>> > -- 
>>>>> > SQLAlchemy - 
>>>>> > The Python SQL Toolkit and Object Relational Mapper 
>>>>> > 
>>>>> > http://www.sqlalchemy.org/ 
>>>>> > 
>>>>> > To post example code, please provide an MCVE: Minimal, Complete, and 
>>>>> > Verifiable Example. See http://stackoverflow.com/help/mcve for a full 
>>>>> > description. 
>>>>> > --- 
>>>>> > You received this message because you are subscribed to the Google 
>>>>> > Groups 
>>>>> > "sqlalchemy" group. 
>>>>> > To unsubscribe from this group and stop receiving emails from it, send 
>>>>> > an 
>>>>> > email to sqlalchemy+...@googlegroups.com. 
>>>>> > To post to this group, send email to sqlal...@googlegroups.com. 
>>>>> > Visit this group at https://groups.google.com/group/sqlalchemy. 
>>>>> > For more options, visit https://groups.google.com/d/optout. 
>>>> 

>>>> --
>>>> SQLAlchemy - 
>>>> The Python SQL Toolkit and Object Relational Mapper
>>>>  
>>>> http://www.sqlalchemy.org/
>>>>  
>>>> To post example code, please provide an MCVE: Minimal, Complete, and 
>>>> Verifiable Example. See http://stackoverflow.com/help/mcve for a full 
>>>> description.
>>>> --- 
>>>> You received this message because you are subscribed to a topic in the 
>>>> Google Groups "sqlalchemy" group.
>>>> To unsubscribe from this topic, visit 
>>>> https://groups.google.com/d/topic/sqlalchemy/GyAZTThJi2I/unsubscribe.
>>>> To unsubscribe from this group and all its topics, send an email to 
>>>> sqlalchemy+...@googlegroups.com.
>>>> To post to this group, send email to sqlal...@googlegroups.com.
>>>> Visit this group at https://groups.google.com/group/sqlalchemy.
>>>> For more options, visit https://groups.google.com/d/optout.
>>> 

>>> --
>>> SQLAlchemy - 
>>> The Python SQL Toolkit and Object Relational Mapper
>>>  
>>> http://www.sqlalchemy.org/
>>>  
>>> To post example code, please provide an MCVE: Minimal, Complete, and 
>>> Verifiable Example. See http://stackoverflow.com/help/mcve for a full 
>>> description.
>>> --- 
>>> You received this message because you are subscribed to the Google Groups 
>>> "sqlalchemy" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an 
>>> email to sqlalchemy+...@googlegroups.com.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/sqlalchemy/CAJYnotsAJ39FuiVYS_UD6338AEktOZQmw1ChuQK2HOXPkMCwLw%40mail.gmail.com
>>>  
>>> <https://groups.google.com/d/msgid/sqlalchemy/CAJYnotsAJ39FuiVYS_UD6338AEktOZQmw1ChuQK2HOXPkMCwLw%40mail.gmail.com?utm_medium=email&utm_source=footer>.
>>> 
>>> 
>>> *Attachments:*
>>>  * BatchInsertsBlogPost.pdf
>> 
> 

> -- 
> SQLAlchemy - 
> The Python SQL Toolkit and Object Relational Mapper
>  
> http://www.sqlalchemy.org/
>  
> To post example code, please provide an MCVE: Minimal, Complete, and 
> Verifiable Example. See http://stackoverflow.com/help/mcve for a full 
> description.
> --- 
> You received this message because you are subscribed to the Google Groups 
> "sqlalchemy" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to sqlalchemy+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/sqlalchemy/8a216e1d-b421-48df-9739-9f261a745571n%40googlegroups.com
>  
> <https://groups.google.com/d/msgid/sqlalchemy/8a216e1d-b421-48df-9739-9f261a745571n%40googlegroups.com?utm_medium=email&utm_source=footer>.

-- 
SQLAlchemy - 
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable 
Example.  See  http://stackoverflow.com/help/mcve for a full description.
--- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sqlalchemy+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/sqlalchemy/4fc6b69f-8f17-491b-ab2a-5250e8329ce6%40www.fastmail.com.

Re: [sqlalchemy] Batching INSERT statements

Reply via email to