I'm not familiar with this exactly, but have a bit of experience in this 
area.

I just took a look at this module (nice work!).  It's VERY well documented 
in the docstrings (even nicer work!).

I think the core bit of this technique looks to be in 
`_get_next_sequence_values` -  
https://github.com/benchling/sqlalchemy_batch_inserts/blob/master/sqlalchemy_batch_inserts/__init__.py#L51-L83

Vineet obtains the IDs by selecting the SQL nextval function across a 
generate_series call, so a single query draws one new sequence value per 
pending row.
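
For concreteness, a minimal sketch of what that query boils down to, with a 
hypothetical sequence name (widget_id_seq) and an assumed `session` and 
`my_objects` list:

# A minimal sketch of the query in question; sequence name is hypothetical,
# `session` and `my_objects` are assumed to exist already.
from sqlalchemy import text

n = len(my_objects)
new_ids = [
    val
    for (val,) in session.execute(
        text("SELECT nextval('widget_id_seq') FROM generate_series(1, :n)"),
        {"n": n},
    )
]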

When I've done large batches and migrations like this, I've used a somewhat 
dirty cheat/trick.  Assuming a window of 1000 inserts, I would just bump the 
serial's sequence forward by 1000 and use "new number - 1000" through "new 
number" as the range of IDs.  That is somewhat closer to the "max id" 
concept.  Vineet's approach is better.
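
A minimal sketch of that trick, with hypothetical names (widget_id_seq, 
session, my_objects):

# nextval() grabs the start of the window and setval() jumps the sequence
# past it in the same statement.  Note this is not safe against other
# writers hitting the sequence at the same moment, which is part of what
# makes it a dirty trick.
from sqlalchemy import text

BATCH = 1000
new_top = session.execute(
    text("SELECT setval('widget_id_seq', nextval('widget_id_seq') + :n)"),
    {"n": BATCH - 1},
).scalar()

reserved = range(new_top - (BATCH - 1), new_top + 1)  # the 1000 IDs for this batch
for obj, ident in zip(my_objects, reserved):
    obj.id = ident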

In terms of dealing with multiple foreign key constraints, pre-assigning 
IDs may or may not work depending on how your database constraints exist.

As a habit, I always create (or re-create) Postgres foreign key constraints 
as deferrable. When dealing with batches, I (i) defer all the involved 
constraints [which can be on other tables!], (ii) process the batch, and 
(iii) set the constraints back to immediate.  If the migration is LARGE, 
I'll drop all the indexes on the tables too, possibly drop the constraints 
as well, and run multiple workers. This avoids the overhead of every insert 
updating rows + indexes, and of the FKEY integrity check firing on every row. 
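
A rough sketch of that pattern, with hypothetical table and constraint 
names, run through a SQLAlchemy session against Postgres:

# One-time DDL: recreate the FK as deferrable (SET CONSTRAINTS only applies
# to constraints declared DEFERRABLE).
from sqlalchemy import text

session.execute(text("""
    ALTER TABLE child
        DROP CONSTRAINT child_parent_id_fkey,
        ADD CONSTRAINT child_parent_id_fkey
            FOREIGN KEY (parent_id) REFERENCES parent (id)
            DEFERRABLE INITIALLY IMMEDIATE
"""))

# Per batch, inside one transaction:
session.execute(text("SET CONSTRAINTS child_parent_id_fkey DEFERRED"))
# ... insert the batch of parent and child rows, in any order ...
session.execute(text("SET CONSTRAINTS ALL IMMEDIATE"))  # FK checks fire here
session.commit()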




On Friday, February 12, 2021 at 2:06:55 PM UTC-5 christia...@gmail.com 
wrote:

> Hi Vineet, Mike,
>
> @Vineet, thank you for the interesting blog post on bulk insert with 
> SQLAlchemy ORM: 
> https://benchling.engineering/sqlalchemy-batch-inserts-a-module-for-when-youre-inserting-thousands-of-rows-and-it-s-slow-16ece0ef5bf7
>
> A few questions:
>
> 1. Do we need to get the incremented IDs from Postgresql itself, or can we 
> just fetch the current max ID on a table and increment IDs in Python 
> without querying the DB for the incremented values?
>
> 2. I was intrigued by the following phrase:
>
> > P.S. execute_values in psycopg2 v2.8 supports getting returned values 
> > back, so it's possible that SQLAlchemy may support batch inserting these 
> > models (with an auto-incrementing primary key) in the future. 
>
> @Mike @Vineet, do you know if this is the case, ie if bulk insert now 
> works out of the box (without pre-assigning incremented IDs)?
>
> 3. Does this change anything for bulk inserts of multiple models with 
> foreign keys referring to each other? This answer 
> <https://stackoverflow.com/a/36387887/11750716> seems to suggest 
> pre-assigning IDs for it to work.
>
> On Friday, February 21, 2020 at 3:49:54 PM UTC+1 Mike Bayer wrote:
>
>> Hi Vineet -
>>
>> glad that worked!   I'll have to find some time to recall what we worked 
>> out here and how it came out for you, I wonder where on the site this kind 
>> of thing could be mentioned.    we have 3rd party dialects listed out in 
>> the docs but not yet a place for extensions.
>>
>> On Wed, Feb 19, 2020, at 9:28 PM, Vineet Gopal wrote:
>>
>> Hi Mike,
>>
>> Thanks for all of your help getting this working again. We've used this 
>> solution in production for two years now, and it's helped our performance 
>> significantly.
>>
>> We just open-sourced the solution that we built so others can use it, and 
>> are also writing a blog post to cover some of the details. I'm attaching a 
>> copy of the blog post here. Obviously not expected, but if you are 
>> interested in taking a look, we are happy to incorporate any comments that 
>> you may have before publishing.
>>
>> Here's a link to the repo as well: 
>> https://github.com/benchling/sqlalchemy_batch_inserts
>>
>> Best,
>> Vineet
>>
>> On Mon, Oct 9, 2017 at 10:27 PM <vin...@benchling.com> wrote:
>>
>> if you're using Postgresql, there's a vastly easier technique to use 
>> which is just to pre-fetch from the sequence: 
>> identities = [ 
>>     val for val, in session.execute( 
>>              "select nextval('mytable_seq') from " 
>>              "generate_series(1,%s)" % len(my_objects)) 
>> ] 
>> for ident, obj in zip(identities, my_objects): 
>>     obj.pk = ident 
>>
>> Wow, that's a great idea! I got it working for most of our models. I have 
>> some questions about how inserts for joined-table inheritance tables are 
>> batched together, but I'll ask them in a separate post since they're 
>> somewhat unrelated to this.
>>
>> So the complexity of adding multi-values insert with sequences would 
>> benefit an extremely narrow set of use cases, would be very labor 
>> intensive to implement and maintain, and is unnecessary for the single 
>> target platform in which this case would work. 
>>
>> That makes sense, thanks for the explanation!
>>
>>  
>>
>> On Monday, October 9, 2017 at 8:44:51 AM UTC-7, Mike Bayer wrote:
>>
>> On Mon, Oct 9, 2017 at 4:15 AM,  <vin...@benchling.com> wrote: 
>> > Hello! I've spent some time looking at SQLAlchemy's ability to batch 
>> > inserts, and have a few questions about bulk_save_objects (and flushing 
>> > in general). 
>> > 
>> > Two statements that I think are true: 
>> > 
>> > Right now, bulk_save_objects does not fetch primary keys for inserted 
>> > rows (unless return_defaults is passed in, which mitigates the 
>> > performance gains by performing one insert statement per model). 
>>
>> the point of bulk save is that the objects passed to it are considered 
>> as throwaway after the operation, to avoid the overhead of maintaining 
>> the objects' persistence state in memory.   Using it with 
>> return_defaults is nearly pointless as this will blow away the 
>> efficiency gains you might get from the bulk operation, and is there 
>> mostly to allow the joined-table inheritance use case to be usable. 
>>
>>
>> > When running db.session.flush(), it looks like SQLAlchemy can batch 
>> > model creation into a single multi-value insert statement *only if all 
>> > of the primary keys are already defined on the model*. (Verified by 
>> > looking at sqlalchemy/lib/sqlalchemy/orm/persistence.py:_emit_insert_statements) 
>> > 
>> > 
>> > Questions: 
>> > 
>> > Would it be possible for bulk_save_objects to fetch primary keys while 
>> > still batching INSERTs into a single query? 
>>
>> What do you mean "single query", do you mean, executemany()?  the 
>> answer is no because DBAPI drivers don't return result sets with 
>> executemany().  Do you mean, a single INSERT..VALUES with all the 
>> parameters in one statement ?  The answer is no because SQLAlchemy 
>> doesn't include multi-values as a transparent option; the DBAPI 
>> drivers instead make use of the multi-values syntax within their 
>> executemany() implementations, where again, they don't return result 
>> sets. 
>>
>> Right now, only the MySQL drivers do this by default, the psycopg2 
>> driver does it with a recently added option that SQLAlchemy does not 
>> directly support, however you can set up via a connection event.   For 
>> psycopg2 it can vastly speed up inserts as psycopg2's normal 
>> executemany() implementation has some performance issues. 
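
(A side note for later readers: this eventually surfaced in SQLAlchemy 
itself rather than needing the connection-event hookup described above; a 
minimal sketch assuming the 1.3-era psycopg2 dialect and a hypothetical DSN:)

# executemany_mode="values" routes executemany() through psycopg2's
# execute_values() helper; DSN below is hypothetical.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://user:pass@localhost/mydb",
    executemany_mode="values",
)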
>>
>>
>> > (This would help with e.g. inserting a bunch of rows into a table with 
>> > an auto-incrementing primary key). 
>>
>> if you're using Postgresql, there's a vastly easier technique to use 
>> which is just to pre-fetch from the sequence: 
>>
>> identities = [ 
>>     val for val, in session.execute( 
>>              "select nextval('mytable_seq') from " 
>>              "generate_series(1,%s)" % len(my_objects)) 
>> ] 
>> for ident, obj in zip(identities, my_objects): 
>>     obj.pk = ident 
>>
>>
>> Now you don't need to do RETURNING or anything and the inserts can be 
>> at their top efficiency. 
>>
>>
>>
>> > 
>> > At least in Postgres (haven't verified for other databases), it looks 
>> > like one can use RETURNING for inserts with multi-value VALUES clauses. 
>> > At the surface, it seems like one could extend the current behavior for 
>> > a single row INSERT: 
>> > 
>> > INSERT INTO table ... VALUES (1) RETURNING id 
>> > 
>> > to multi-value INSERTS: 
>> > 
>> > INSERT INTO table ... VALUES (1), (2), ... RETURNING id 
>> > 
>> > and get all the benefits of the ORM while still batching inserts. 
>>
>>
>> This would be an enormous undertaking to implement, test, and release. 
>>   It would have to exist as an entirely additional series of codepaths 
>> within persistence.py as multi-valued INSERT is not available on most 
>> databases as well as on earlier versions of the databases that do 
>> support it.     For all of that complexity, the approach would work on 
>> exactly:  Postgresql only - SQLite and MySQL don't support RETURNING, 
>> Oracle, SQL Server and others don't support INSERT..VALUES with 
>> multiple sets. 
>>
>> If you are using Postgresql, you can instead pre-fetch the sequence up 
>> front.  This can even be integrated into a before_flush() event. 
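
(A minimal sketch of that before_flush() wiring, reusing the same 
generate_series query shown above; Widget and widget_id_seq are 
hypothetical, and the published sqlalchemy_batch_inserts module does 
something along these lines, generalized across models:)

# A sketch, not the module's actual code: assign sequence values to pending
# instances of a hypothetical Widget model before the flush, so the flush
# can batch the INSERTs.  Widget, widget_id_seq and the session setup are
# assumed.
from sqlalchemy import event, text
from sqlalchemy.orm import Session

@event.listens_for(Session, "before_flush")
def prefetch_widget_ids(session, flush_context, instances):
    pending = [o for o in session.new if isinstance(o, Widget) and o.id is None]
    if not pending:
        return
    ids = [
        val
        for (val,) in session.execute(
            text("SELECT nextval('widget_id_seq') FROM generate_series(1, :n)"),
            {"n": len(pending)},
        )
    ]
    for obj, ident in zip(pending, ids):
        obj.id = ident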
>>
>> So the complexity of adding multi-values insert with sequences would 
>> benefit an extremely narrow set of use cases, would be very labor 
>> intensive to implement and maintain, and is unnecessary for the single 
>> target platform in which this case would work. 
>>
>>
>> > 
>> > I'm sure this is something that was thought about already, so I'd 
>> > mostly love to hear any thoughts about what makes this hard. We have a 
>> > lot of complex relationships and joined-table inheritance, so working 
>> > with bulk_save_objects has been a bit challenging. From the comments in 
>> > sqlalchemy/lib/sqlalchemy/sql/dml.py:return_defaults(), it seems like 
>> > SQLAlchemy is already aware that RETURNING for insert statements with 
>> > multi-values VALUES clauses is supported, so it's possible there is a 
>> > reason that this is hard or can't be done. 
>> > 
>> > This is also something that, if it is just a “missing feature”, I would 
>> > love to learn more and see if I could contribute. 
>> > 
>>
>> *Attachments:*
>>
>>    - BatchInsertsBlogPost.pdf
>>    
>>
>>
