Re: [sqlalchemy] Batching INSERT statements

2021-02-12 Thread Mike Bayer


On Fri, Feb 12, 2021, at 2:06 PM, Cristian Bulgaru wrote:
> Hi Vineet, Mike,
> 
> @Vineet, thank you for the interesting blog post on bulk insert with 
> SQLAlchemy ORM: 
> https://benchling.engineering/sqlalchemy-batch-inserts-a-module-for-when-youre-inserting-thousands-of-rows-and-it-s-slow-16ece0ef5bf7
> 
> A few questions:
> 
> 1. Do we need to get the incremented IDs from Postgresql itself, or can we 
> just fetch the current max ID on a table and increment IDs in Python without 
> querying the DB for the incremented values?
> 
> 2. I was intrigued by the following phrase:
> 
> > * P.S. execute_values in psycopg2 v2.8 supports getting returned values 
> > back, so it’s possible that SQLAlchemy may support batch inserting these 
> > models (with an auto-incrementing primary key) in the future. 

SQLAlchemy now supports this in version 1.4, including that it's used by the 
ORM:

https://docs.sqlalchemy.org/en/14/changelog/migration_14.html#psycopg2-dialect-features-execute-values-with-returning-for-insert-statements-by-default


basically it's now the most efficient INSERT + autoincrement PK we can get with 
the psycopg2 driver.
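
As a rough sketch of what that looks like from the ORM side under 1.4 (the 
model, DSN, and row count below are made up for illustration, not taken from 
this thread): a plain add_all() plus flush should batch the INSERTs through 
execute_values and still get the generated primary keys back.

from sqlalchemy import Column, Integer, Text, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Widget(Base):
    __tablename__ = "widget"
    id = Column(Integer, primary_key=True)   # serial/identity PK generated by the DB
    name = Column(Text)

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")  # placeholder DSN
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Widget(name="w%d" % i) for i in range(10000)])
    # under 1.4 + psycopg2 this flush should batch via execute_values with RETURNING
    session.flush()
    assert all(w.id is not None for w in session)   # generated ids came back
    session.commit()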







> 
> @Mike @Vineet, do you know if this is the case, ie if bulk insert now works 
> out of the box (without pre-assigning incremented IDs)?
> 
> 3. Does this imply any change in case of bulk insert of multiple models with 
> foreign keys referring to each other? This answer 
>  seems to suggest 
> pre-assigning IDs for it to work.
Re: [sqlalchemy] Batching INSERT statements

2021-02-12 Thread 'Jonathan Vanasco' via sqlalchemy
I'm not familiar with this exactly, but have a bit of experience in this 
area.

I just took a look at this module (nice work!).  It's VERY well documented 
in the docstrings (even nicer work!).

I think the core bit of this technique looks to be in 
`_get_next_sequence_values` -  
https://github.com/benchling/sqlalchemy_batch_inserts/blob/master/sqlalchemy_batch_inserts/__init__.py#L51-L83

Vineet is obtaining the ids by running the SQL generate_series function 
over the nextval function.

When I've done large batches and migrations like this, I've used a somewhat 
dirty cheat/trick.  Assuming a window of 1000 inserts, I would just 
increment the serial by 1000 and use "new number - 1000" as the range for 
IDs.  That is somewhat closer to the "max id" concept.  Vineet's approach 
is better.
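
For concreteness, here is a rough sketch of that block-reservation trick (not 
my actual code; the sequence name, the block size, and an already-open 
session are assumptions):

BLOCK = 1000
# reserve a window of BLOCK sequence values for this worker
end = session.execute(
    "SELECT setval('mytable_id_seq', nextval('mytable_id_seq') + %d)" % (BLOCK - 1)
).scalar()
start = end - (BLOCK - 1)
reserved_ids = list(range(start, end + 1))   # hand these out to the batch
# note: as said above this is a cheat; a concurrent nextval() landing between
# the inner nextval() and the setval() can collide, which is one reason the
# generate_series approach is preferable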

In terms of dealing with multiple foreign key constraints, pre-assigning 
IDs may or may not work depending on how your database constraints are defined.

As a habit, I always create (or re-create) Postgres foreign key checks as 
deferrable. When dealing with batches, I (i) defer all the involved 
constraints [which can be on other tables!], (ii) process the batch, (iii) 
set constraints to immediate.  If the migration is LARGE, I'll drop all the 
indexes on the tables too, and possibly drop the constraints as well and run 
multiple workers. This gets around the overhead of every insert 
populating rows+indexes, and the FKEY integrity checks on every row. 
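
A sketch of that defer / process / set-immediate dance with SQLAlchemy Core 
(the tables, DSN, and pre-assigned IDs are placeholders):

from sqlalchemy import Column, ForeignKey, Integer, MetaData, Table, create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")  # placeholder DSN
metadata = MetaData()
parent = Table("parent", metadata, Column("id", Integer, primary_key=True))
child = Table(
    "child", metadata,
    Column("id", Integer, primary_key=True),
    # the FK check is created DEFERRABLE up front, per the habit described above
    Column("parent_id", Integer, ForeignKey("parent.id", deferrable=True, initially="IMMEDIATE")),
)
metadata.create_all(engine)

parent_rows = [{"id": i} for i in range(1, 1001)]
child_rows = [{"id": i, "parent_id": i} for i in range(1, 1001)]

with engine.begin() as conn:
    conn.execute(text("SET CONSTRAINTS ALL DEFERRED"))    # (i) defer the involved checks
    conn.execute(child.insert(), child_rows)              # (ii) process the batch; order no longer matters
    conn.execute(parent.insert(), parent_rows)
    conn.execute(text("SET CONSTRAINTS ALL IMMEDIATE"))   # (iii) all FK checks run here, still inside the txn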




On Friday, February 12, 2021 at 2:06:55 PM UTC-5 christia...@gmail.com 
wrote:

> Hi Vineet, Mike,
>
> @Vineet, thank you for the interesting blog post on bulk insert with 
> SQLAlchemy ORM: 
> https://benchling.engineering/sqlalchemy-batch-inserts-a-module-for-when-youre-inserting-thousands-of-rows-and-it-s-slow-16ece0ef5bf7
>
> A few questions:
>
> 1. Do we need to get the incremented IDs from Postgresql itself, or can we 
> just fetch the current max ID on a table and increment IDs in Python 
> without querying the DB for the incremented values?
>
> 2. I was intrigued by the following phrase:
>
> > * P.S. execute_values in psycopg2 v2.8 supports getting returned values 
> back, so it’s possible that SQLAlchemy may support batch inserting these 
> models (with an auto-incrementing primary key) in the future. 
>
> @Mike @Vineet, do you know if this is the case, ie if bulk insert now 
> works out of the box (without pre-assigning incremented IDs)?
>
> 3. Does this imply any change in case of bulk insert of multiple models 
> with foreign keys referring to each other? This answer 
>  seems to suggest 
> pre-assigning IDs for it to work.
Re: [sqlalchemy] Batching INSERT statements

2021-02-12 Thread Cristian Bulgaru
Hi Vineet, Mike,

@Vineet, thank you for the interesting blog post on bulk insert with 
SQLAlchemy ORM: 
https://benchling.engineering/sqlalchemy-batch-inserts-a-module-for-when-youre-inserting-thousands-of-rows-and-it-s-slow-16ece0ef5bf7

A few questions:

1. Do we need to get the incremented IDs from Postgresql itself, or can we 
just fetch the current max ID on a table and increment IDs in Python 
without querying the DB for the incremented values?

2. I was intrigued by the following phrase:

> * P.S. execute_values in psycopg2 v2.8 supports getting returned values 
back, so it’s possible that SQLAlchemy may support batch inserting these 
models (with an auto-incrementing primary key) in the future. 

@Mike @Vineet, do you know if this is the case, i.e. if bulk insert now works 
out of the box (without pre-assigning incremented IDs)?

3. Does this imply any change in case of bulk insert of multiple models 
with foreign keys referring to each other? This answer 
 seems to suggest 
pre-assigning IDs for it to work.
Re: [sqlalchemy] Batching INSERT statements

2020-02-21 Thread Mike Bayer
Hi Vineet -

Glad that worked! I'll have to find some time to recall what we worked out here 
and how it came out for you. I wonder where on the site this kind of thing 
could be mentioned; we have 3rd party dialects listed out in the docs but not 
yet a place for extensions.

On Wed, Feb 19, 2020, at 9:28 PM, Vineet Gopal wrote:
> Hi Mike,
> 
> Thanks for all of your help getting this working again. We've used this 
> solution in production for two years now, and it's helped our performance 
> significantly.
> 
> We just open-sourced the solution that we built so others can use it, and are 
> also writing a blog post to cover some of the details. I'm attaching a copy 
> of the blog post here. Obviously not expected, but if you are interested in 
> taking a look, we are happy to incorporate any comments that you may have 
> before publishing.
> 
> Here's a link to the repo as well: 
> https://github.com/benchling/sqlalchemy_batch_inserts
> 
> Best,
> Vineet
> 
Re: [sqlalchemy] Batching INSERT statements

2017-10-09 Thread vineet

>
> if you're using Postgresql, there's a vastly easier technique to use 
> which is just to pre-fetch from the sequence: 
> identities = [
>     val for val, in session.execute(
>         "select nextval('mytable_seq') from "
>         "generate_series(1,%s)" % len(my_objects)
>     )
> ]
> for ident, obj in zip(identities, my_objects):
>     obj.pk = ident

Wow, that's a great idea! I got it working for most of our models. I have 
some questions about how inserts for joined-table inheritance tables are 
batched together, but I'll ask them in a separate post since they're 
somewhat unrelated to this.

> So the complexity of adding multi-values insert with sequences would 
> benefit an extremely narrow set of use cases, would be very labor 
> intensive to implement and maintain, and is unnecessary for the single 
> target platform in which this case would work. 
>
That makes sense, thanks for the explanation!

 

Re: [sqlalchemy] Batching INSERT statements

2017-10-09 Thread Mike Bayer
On Mon, Oct 9, 2017 at 4:15 AM,   wrote:
> Hello! I've spent some time looking at SQLAlchemy's ability to batch
> inserts, and have a few questions about bulk_save_objects (and flushing in
> general).
>
> Two statements that I think are true:
>
> Right now, bulk_save_objects does not fetch primary keys for inserted rows
> (unless return_defaults is passed in, which mitigates the performance gains
> by performing one insert statement per model).

the point of bulk save is that the objects passed to it are considered
as throwaway after the operation, to avoid the overhead of maintaining
the objects' persistence state in memory.   Using it with
return_defaults is nearly pointless as this will blow away the
efficiency gains you might get from the bulk operation, and is there
mostly to allow the joined-table inheritance use case to be usable.
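
To make the trade-off concrete, a small sketch of the two modes (the model 
and DSN are placeholders, not from this thread):

from sqlalchemy import Column, Integer, Text, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Widget(Base):
    __tablename__ = "widget"
    id = Column(Integer, primary_key=True)
    name = Column(Text)

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")  # placeholder DSN
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

objs = [Widget(name="w%d" % i) for i in range(10000)]

# fire-and-forget: fast executemany(), but the objects are throwaway and obj.id stays None
session.bulk_save_objects(objs)

# return_defaults=True would populate obj.id, at the cost of one INSERT per
# object, which is the "nearly pointless" case described above:
# session.bulk_save_objects(objs, return_defaults=True)

session.commit()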


> When running db.session.flush(), it looks like SQLAlchemy can batch model
> creation into a single multi-value insert statement *only if all of the
> primary keys are already defined on the model*. (Verified by looking at
> sqlalchemy/lib/sqlalchemy/orm/persistence.py:_emit_insert_statements)
>
>
> Questions:
>
> Would it be possible for bulk_save_objects to fetch primary keys while still
> batching INSERTs into a single query?

What do you mean "single query", do you mean, executemany()?  the
answer is no because DBAPI drivers don't return result sets with
executemany().  Do you mean, a single INSERT..VALUES with all the
parameters in one statement ?  The answer is no because SQLAlchemy
doesn't include multi-values as a transparent option; the DBAPI
drivers instead make use of the multi-values syntax within their
executemany() implementations, where again, they don't return result
sets.

Right now, only the MySQL drivers do this by default; the psycopg2
driver does it with a recently added option that SQLAlchemy does not
directly support, however you can set it up via a connection event.   For
psycopg2 it can vastly speed up inserts, as psycopg2's normal
executemany() implementation has some performance issues.
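
The connection-event recipe isn't spelled out here, but for reference, a 
bare-psycopg2 sketch of the kind of helper being referred to, execute_values() 
from psycopg2.extras (available since psycopg2 2.7; the table and DSN are 
made up):

import psycopg2
import psycopg2.extras

conn = psycopg2.connect("dbname=mydb user=me")   # placeholder DSN
with conn:
    with conn.cursor() as cur:
        # expands the parameter list into one INSERT .. VALUES (...), (...), ...
        psycopg2.extras.execute_values(
            cur,
            "INSERT INTO widget (name) VALUES %s",
            [("w%d" % i,) for i in range(10000)],
            page_size=1000,
        )
conn.close()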


> (This
> would help with e.g. inserting a bunch of rows into a table with an
> auto-incrementing primary key).

if you're using Postgresql, there's a vastly easier technique to use
which is just to pre-fetch from the sequence:

identities = [
    val for val, in session.execute(
        "select nextval('mytable_seq') from "
        "generate_series(1,%s)" % len(my_objects)
    )
]
for ident, obj in zip(identities, my_objects):
    obj.pk = ident


Now you don't need to do RETURNING or anything and the inserts can be
at their top efficiency.



>
> At least in Postgres (haven't verified for other databases), it looks like
> one can use RETURNING for inserts with multi-value VALUES clauses. At the
> surface, it seems like one could extend the current behavior for a single
> row INSERT:
>
> INSERT INTO table ... VALUES (1) RETURNING id
>
> to multi-value INSERTS:
>
> INSERT INTO table ... VALUES (1), (2), ... RETURNING id
>
> and get all the benefits of the ORM while still batching inserts.


This would be an enormous undertaking to implement, test, and release.
  It would have to exist as an entirely additional series of codepaths
within persistence.py as multi-valued INSERT is not available on most
databases as well as on earlier versions of the databases that do
support it. For all of that complexity, the approach would work on
exactly:  Postgresql only - SQLite and MySQL don't support RETURNING,
Oracle, SQL Server and others don't support INSERT..VALUES with
multiple sets.

If you are using Postgresql, you can instead pre-fetch the sequence up
front.  This can even be integrated into a before_flush() event.
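
A rough sketch of that integration (not the actual extension code; Widget is a 
hypothetical mapped class whose integer PK is backed by a widget_id_seq 
sequence):

from sqlalchemy import event
from sqlalchemy.orm import Session

@event.listens_for(Session, "before_flush")
def prefetch_identities(session, flush_context, instances):
    # pre-assign PKs for pending Widget rows so the flush can batch their INSERTs
    pending = [obj for obj in session.new if isinstance(obj, Widget) and obj.id is None]
    if not pending:
        return
    identities = [
        val for val, in session.execute(
            "select nextval('widget_id_seq') from generate_series(1, %d)" % len(pending)
        )
    ]
    for ident, obj in zip(identities, pending):
        obj.id = ident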

So the complexity of adding multi-values insert with sequences would
benefit an extremely narrow set of use cases, would be very labor
intensive to implement and maintain, and is unnecessary for the single
target platform in which this case would work.


>
> I'm sure this is something that was thought about already, so I'd mostly
> love to hear any thoughts about what makes this hard. We have a lot of
> complex relationships and joined-table inheritance, so working with
> bulk_save_objects has been a bit challenging. From the comments in
> sqlalchemy/lib/sqlalchemy/sql/dml.py:return_defaults(), it seems like
> SQLAlchemy is already aware that RETURNING for insert statements with
> multi-values VALUES clauses is supported, so it's possible there is a reason
> that this is hard or can't be done.
>
> This is also something that, if it just a “missing feature”, I would love to
> learn more and see if I could contribute.
>

[sqlalchemy] Batching INSERT statements

2017-10-09 Thread vineet
Hello! I've spent some time looking at SQLAlchemy's ability to batch 
inserts, and have a few questions about bulk_save_objects (and flushing in 
general).

Two statements that I think are true:

   1. Right now, bulk_save_objects does not fetch primary keys for inserted 
   rows (unless return_defaults is passed in, which mitigates the performance 
   gains by performing one insert statement per model).
   2. When running db.session.flush(), it looks like SQLAlchemy can batch 
   model creation into a single multi-value insert statement *only if all of 
   the primary keys are already defined on the model*. (Verified by looking at 
   sqlalchemy/lib/sqlalchemy/orm/persistence.py:_emit_insert_statements)


Questions:

   1. Would it be possible for bulk_save_objects to fetch primary keys 
   while still batching INSERTs into a single query?
   2. Or even better, would it be possible to batch INSERT statements 
   during a flush even when models don't have their primary keys pre-defined? 
   (This would help with e.g. inserting a bunch of rows into a table with an 
   auto-incrementing primary key).

At least in Postgres (haven't verified for other databases), it looks like 
one can use RETURNING for inserts with multi-value VALUES clauses. At the 
surface, it seems like one could extend the current behavior for a single 
row INSERT:

INSERT INTO table ... VALUES (1) RETURNING id

to multi-value INSERTS:

INSERT INTO table ... VALUES (1), (2), ... RETURNING id

and get all the benefits of the ORM while still batching inserts.
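
At least with Core, it looks like this exact statement shape can already be 
produced by combining a multi-row values() with returning(); it's the ORM 
flush that doesn't do it automatically. A sketch with a made-up table:

from sqlalchemy import Column, Integer, MetaData, Table, Text, create_engine

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")   # placeholder DSN
metadata = MetaData()
widget = Table(
    "widget", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", Text),
)
metadata.create_all(engine)

stmt = (
    widget.insert()
    .values([{"name": "a"}, {"name": "b"}, {"name": "c"}])   # multi-row VALUES clause
    .returning(widget.c.id)                                  # ... RETURNING id
)
with engine.begin() as conn:
    new_ids = [row[0] for row in conn.execute(stmt)]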

I'm sure this is something that was thought about already, so I'd mostly 
love to hear any thoughts about what makes this hard. We have a lot of 
complex relationships and joined-table inheritance, so working with 
bulk_save_objects has been a bit challenging. From the comments in 
sqlalchemy/lib/sqlalchemy/sql/dml.py:return_defaults(), it seems like 
SQLAlchemy is already aware that RETURNING for insert statements with 
multi-values VALUES clauses is supported, so it's possible there is a 
reason that this is hard or can't be done.

This is also something that, if it is just a “missing feature”, I would love 
to learn more and see if I could contribute.
