Re: [sqlalchemy] Batching INSERT statements

2021-02-12 Thread Mike Bayer


On Fri, Feb 12, 2021, at 2:06 PM, Cristian Bulgaru wrote:
> Hi Vineet, Mike,
> 
> @Vineet, thank you for the interesting blog post on bulk insert with 
> SQLAlchemy ORM: 
> https://benchling.engineering/sqlalchemy-batch-inserts-a-module-for-when-youre-inserting-thousands-of-rows-and-it-s-slow-16ece0ef5bf7
> 
> A few questions:
> 
> 1. Do we need to get the incremented IDs from Postgresql itself, or can we 
> just fetch the current max ID on a table and increment IDs in Python without 
> querying the DB for the incremented values?
> 
> 2. I was intrigued by the following phrase:
> 
> > * P.S. execute_values in psycopg2 v2.8 supports getting returned values 
> > back, so it’s possible that SQLAlchemy may support batch inserting these 
> > models (with an auto-incrementing primary key) in the future. 

SQLAlchemy now supports this in version 1.4, including that it's used by the 
ORM:

https://docs.sqlalchemy.org/en/14/changelog/migration_14.html#psycopg2-dialect-features-execute-values-with-returning-for-insert-statements-by-default


basically it's now the most efficient INSERT + autoincrement PK we can get with 
the psycopg2 driver.
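
For anyone curious what the driver-level feature looks like on its own, here's 
a rough sketch (not from the thread; it assumes psycopg2 >= 2.8 and a made-up 
"mytable" with a serial "id" column):

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=test")
with conn, conn.cursor() as cur:
    rows = [("a",), ("b",), ("c",)]
    # execute_values() renders all rows into one multi-VALUES INSERT;
    # fetch=True collects the RETURNING rows, one per inserted row.
    ids = execute_values(
        cur,
        "INSERT INTO mytable (name) VALUES %s RETURNING id",
        rows,
        fetch=True,
    )
print(ids)  # e.g. [(1,), (2,), (3,)]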

> 
> @Mike @Vineet, do you know if this is the case, i.e. if bulk insert now works 
> out of the box (without pre-assigning incremented IDs)?
> 
> 3. Does this imply any change in the case of bulk inserting multiple models with 
> foreign keys referring to each other? This answer seems to suggest 
> pre-assigning IDs for it to work.
> On Friday, February 21, 2020 at 3:49:54 PM UTC+1 Mike Bayer wrote:
>> 
>> Hi Vineet -
>> 
>> glad that worked!   I'll have to find some time to recall what we worked out 
>> here and how it came out for you. I wonder where on the site this kind of 
>> thing could be mentioned. We have 3rd party dialects listed out in the 
>> docs but not yet a place for extensions.
>> 
>> On Wed, Feb 19, 2020, at 9:28 PM, Vineet Gopal wrote:
>>> Hi Mike,
>>> 
>>> Thanks for all of your help getting this working again. We've used this 
>>> solution in production for two years now, and it's helped our performance 
>>> significantly.
>>> 
>>> We just open-sourced the solution that we built so others can use it, and 
>>> are also writing a blog post to cover some of the details. I'm attaching a 
>>> copy of the blog post here. Obviously not expected, but if you are 
>>> interested in taking a look, we are happy to incorporate any comments that 
>>> you may have before publishing.
>>> 
>>> Here's a link to the repo as well: 
>>> https://github.com/benchling/sqlalchemy_batch_inserts
>>> 
>>> Best,
>>> Vineet
>>> 
>>> On Mon, Oct 9, 2017 at 10:27 PM  wrote:
> if you're using Postgresql, there's a vastly easier technique to use 
> which is just to pre-fetch from the sequence: 
> identities = [
>     val for val, in session.execute(
>         "select nextval('mytable_seq') from "
>         "generate_series(1,%s)" % len(my_objects))
> ]
> for ident, obj in zip(identities, my_objects):
>     obj.pk = ident
 Wow, that's a great idea! I got it working for most of our models. I have 
 some questions about how inserts for joined-table inheritance tables are 
 batched together, but I'll ask them in a separate post since they're 
 somewhat unrelated to this.
 
> So the complexity of adding multi-values insert with sequences would 
> benefit an extremely narrow set of use cases, would be very labor 
> intensive to implement and maintain, and is unnecessary for the single 
> target platform in which this case would work. 
 That makes sense, thanks for the explanation!
 
  
 
 On Monday, October 9, 2017 at 8:44:51 AM UTC-7, Mike Bayer wrote:
> On Mon, Oct 9, 2017 at 4:15 AM,   wrote: 
> > Hello! I've spent some time looking at SQLAlchemy's ability to batch 
> > inserts, and have a few questions about bulk_save_objects (and flushing 
> > in 
> > general). 
> > 
> > Two statements that I think are true: 
> > 
> > Right now, bulk_save_objects does not fetch primary keys for inserted 
> > rows 
> > (unless return_defaults is passed in, which mitigates the performance 
> > gains 
> > by performing one insert statement per model). 
> 
> the point of bulk save is that the objects passed to it are considered 
> as throwaway after the operation, to avoid the overhead of maintaining 
> the objects' persistence state in memory.   Using it with 
> return_defaults is nearly pointless as this will blow away the 
> efficiency gains you might get from the bulk operation, and is there 
> mostly to allow the joined-table inheritance use case to be usable. 
> 
> 
> > When running db.session.flush(), it looks like SQLAlchemy can batch 
> > model 
> > creation into a single multi-value insert statement *only if all of the 

Re: [sqlalchemy] Batching INSERT statements

2021-02-12 Thread 'Jonathan Vanasco' via sqlalchemy
I'm not familiar with this exactly, but have a bit of experience in this 
area.

I just took a look at this module (nice work!).  It's VERY well documented 
in the docstrings (even nicer work!).

I think the core bit of this technique looks to be in 
`_get_next_sequence_values` -  
https://github.com/benchling/sqlalchemy_batch_inserts/blob/master/sqlalchemy_batch_inserts/__init__.py#L51-L83

Vineet is obtaining the ids by running the SQL generate_series function 
over the nextval function.

When I've done large batches and migrations like this, I've used a somewhat 
dirty cheat/trick.  Assuming a window of 1000 inserts, I would just 
increment the sequence by 1000 and use "new number - 1000" as the start of 
the ID range.  That is somewhat closer to the "max id" concept.  Vineet's 
approach is better.
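
Roughly, the trick looks like this (a sketch only; the sequence name and 
batch size are made up, and this assumes a plain Postgres sequence behind 
the PK):

from sqlalchemy import text

batch_size = 1000
# nextval() grabs one value, then setval() jumps the sequence past the
# whole block, reserving ids first_id..last_id for this batch.
last_id = session.execute(
    text("SELECT setval('mytable_id_seq', "
         "nextval('mytable_id_seq') + :n - 1)"),
    {"n": batch_size},
).scalar()
first_id = last_id - batch_size + 1
for offset, obj in enumerate(my_objects):
    obj.pk = first_id + offset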

In terms of dealing with multiple foreign key constraints, pre-assigning 
IDs may or may not work, depending on how your database constraints are defined.

As a habit, I always create (or re-create) Postgres foreign key checks as 
deferrable. When dealing with batches, I (i) defer all the involved 
constraints [which can be on other tables!], (ii) process the batch, (iii) 
set constraints to immediate.  If the migration is LARGE, I'll drop all the 
indexes on the tables too, possibly drop the constraints as well, and run 
multiple workers. This gets around the overhead of every insert having to 
populate indexes and run FKEY integrity checks on every row. 
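
Sketched out, the defer/validate flow is something like this (assuming the 
foreign keys were created DEFERRABLE; the names here are hypothetical):

from sqlalchemy import text
from sqlalchemy.orm import Session

with Session(engine) as session, session.begin():
    # only affects constraints that were declared DEFERRABLE
    session.execute(text("SET CONSTRAINTS ALL DEFERRED"))
    session.add_all(batch_of_objects)  # FK checks are postponed
    session.flush()
    # run the postponed checks here rather than at COMMIT
    session.execute(text("SET CONSTRAINTS ALL IMMEDIATE"))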




On Friday, February 12, 2021 at 2:06:55 PM UTC-5 christia...@gmail.com 
wrote:

> Hi Vineet, Mike,
>
> @Vineet, thank you for the interesting blog post on bulk insert with 
> SQLAlchemy ORM: 
> https://benchling.engineering/sqlalchemy-batch-inserts-a-module-for-when-youre-inserting-thousands-of-rows-and-it-s-slow-16ece0ef5bf7
>
> A few questions:
>
> 1. Do we need to get the incremented IDs from Postgresql itself, or can we 
> just fetch the current max ID on a table and increment IDs in Python 
> without querying the DB for the incremented values?
>
> 2. I was intrigued by the following phrase:
>
> > * P.S. execute_values in psycopg2 v2.8 supports getting returned values 
> back, so it’s possible that SQLAlchemy may support batch inserting these 
> models (with an auto-incrementing primary key) in the future. 
>
> @Mike @Vineet, do you know if this is the case, i.e. if bulk insert now 
> works out of the box (without pre-assigning incremented IDs)?
>
> 3. Does this imply any change in the case of bulk inserting multiple models 
> with foreign keys referring to each other? This answer seems to suggest 
> pre-assigning IDs for it to work.
> On Friday, February 21, 2020 at 3:49:54 PM UTC+1 Mike Bayer wrote:
>
>> Hi Vineet -
>>
>> glad that worked!   I'll have to find some time to recall what we worked 
>> out here and how it came out for you. I wonder where on the site this kind 
>> of thing could be mentioned. We have 3rd party dialects listed out in 
>> the docs but not yet a place for extensions.
>>
>> On Wed, Feb 19, 2020, at 9:28 PM, Vineet Gopal wrote:
>>
>> Hi Mike,
>>
>> Thanks for all of your help getting this working again. We've used this 
>> solution in production for two years now, and it's helped our performance 
>> significantly.
>>
>> We just open-sourced the solution that we built so others can use it, and 
>> are also writing a blog post to cover some of the details. I'm attaching a 
>> copy of the blog post here. Obviously not expected, but if you are 
>> interested in taking a look, we are happy to incorporate any comments that 
>> you may have before publishing.
>>
>> Here's a link to the repo as well: 
>> https://github.com/benchling/sqlalchemy_batch_inserts
>>
>> Best,
>> Vineet
>>
>> On Mon, Oct 9, 2017 at 10:27 PM  wrote:
>>
>> if you're using Postgresql, there's a vastly easier technique to use 
>> which is just to pre-fetch from the sequence: 
>> identities = [
>>     val for val, in session.execute(
>>         "select nextval('mytable_seq') from "
>>         "generate_series(1,%s)" % len(my_objects))
>> ]
>> for ident, obj in zip(identities, my_objects):
>>     obj.pk = ident
>>
>> Wow, that's a great idea! I got it working for most of our models. I have 
>> some questions about how inserts for joined-table inheritance tables are 
>> batched together, but I'll ask them in a separate post since they're 
>> somewhat unrelated to this.
>>
>> So the complexity of adding multi-values insert with sequences would 
>> benefit an extremely narrow set of use cases, would be very labor 
>> intensive to implement and maintain, and is unnecessary for the single 
>> target platform in which this case would work. 
>>
>> That makes sense, thanks for the explanation!
>>
>>  
>>
>> On Monday, October 9, 2017 at 8:44:51 AM UTC-7, Mike Bayer wrote:
>>
>> On Mon, Oct 9, 2017 at 4:15 AM,   wrote: 
>> > Hello! I've spent some time looking at SQLAlchemy's ability to batch 
>> > inserts, and have 

Re: [sqlalchemy] Batching INSERT statements

2021-02-12 Thread Cristian Bulgaru
Hi Vineet, Mike,

@Vineet, thank you for the interesting blog post on bulk insert with 
SQLAlchemy ORM: 
https://benchling.engineering/sqlalchemy-batch-inserts-a-module-for-when-youre-inserting-thousands-of-rows-and-it-s-slow-16ece0ef5bf7

A few questions:

1. Do we need to get the incremented IDs from Postgresql itself, or can we 
just fetch the current max ID on a table and increment IDs in Python 
without querying the DB for the incremented values?

2. I was intrigued by the following phrase:

> * P.S. execute_values in psycopg2 v2.8 supports getting returned values 
back, so it’s possible that SQLAlchemy may support batch inserting these 
models (with an auto-incrementing primary key) in the future. 

@Mike @Vineet, do you know if this is the case, i.e. if bulk insert now works 
out of the box (without pre-assigning incremented IDs)?

3. Does this imply any change in the case of bulk inserting multiple models 
with foreign keys referring to each other? This answer seems to suggest 
pre-assigning IDs for it to work.
On Friday, February 21, 2020 at 3:49:54 PM UTC+1 Mike Bayer wrote:

> Hi Vineet -
>
> glad that worked!   I'll have to find some time to recall what we worked 
> out here and how it came out for you. I wonder where on the site this kind 
> of thing could be mentioned. We have 3rd party dialects listed out in 
> the docs but not yet a place for extensions.
>
> On Wed, Feb 19, 2020, at 9:28 PM, Vineet Gopal wrote:
>
> Hi Mike,
>
> Thanks for all of your help getting this working again. We've used this 
> solution in production for two years now, and it's helped our performance 
> significantly.
>
> We just open-sourced the solution that we built so others can use it, and 
> are also writing a blog post to cover some of the details. I'm attaching a 
> copy of the blog post here. Obviously not expected, but if you are 
> interested in taking a look, we are happy to incorporate any comments that 
> you may have before publishing.
>
> Here's a link to the repo as well: 
> https://github.com/benchling/sqlalchemy_batch_inserts
>
> Best,
> Vineet
>
> On Mon, Oct 9, 2017 at 10:27 PM  wrote:
>
> if you're using Postgresql, there's a vastly easier technique to use 
> which is just to pre-fetch from the sequence: 
> identities = [
>     val for val, in session.execute(
>         "select nextval('mytable_seq') from "
>         "generate_series(1,%s)" % len(my_objects))
> ]
> for ident, obj in zip(identities, my_objects):
>     obj.pk = ident
>
> Wow, that's a great idea! I got it working for most of our models. I have 
> some questions about how inserts for joined-table inheritance tables are 
> batched together, but I'll ask them in a separate post since they're 
> somewhat unrelated to this.
>
> So the complexity of adding multi-values insert with sequences would 
> benefit an extremely narrow set of use cases, would be very labor 
> intensive to implement and maintain, and is unnecessary for the single 
> target platform in which this case would work. 
>
> That makes sense, thanks for the explanation!
>
>  
>
> On Monday, October 9, 2017 at 8:44:51 AM UTC-7, Mike Bayer wrote:
>
> On Mon, Oct 9, 2017 at 4:15 AM,   wrote: 
> > Hello! I've spent some time looking at SQLAlchemy's ability to batch 
> > inserts, and have a few questions about bulk_save_objects (and flushing 
> > in general). 
> > 
> > Two statements that I think are true: 
> > 
> > Right now, bulk_save_objects does not fetch primary keys for inserted 
> > rows (unless return_defaults is passed in, which mitigates the 
> > performance gains by performing one insert statement per model). 
>
> the point of bulk save is that the objects passed to it are considered 
> as throwaway after the operation, to avoid the overhead of maintaining 
> the objects' persistence state in memory.   Using it with 
> return_defaults is nearly pointless as this will blow away the 
> efficiency gains you might get from the bulk operation, and is there 
> mostly to allow the joined-table inheritance use case to be usable. 
>
>
> > When running db.session.flush(), it looks like SQLAlchemy can batch 
> > model creation into a single multi-value insert statement *only if all 
> > of the primary keys are already defined on the model*. (Verified by 
> > looking at 
> > sqlalchemy/lib/sqlalchemy/orm/persistence.py:_emit_insert_statements) 
> > 
> > 
> > Questions: 
> > 
> > Would it be possible for bulk_save_objects to fetch primary keys while 
> > still batching INSERTs into a single query? 
>
> What do you mean "single query", do you mean, executemany()?  the 
> answer is no because DBAPI drivers don't return result sets with 
> executemany().  Do you mean, a single INSERT..VALUES with all the 
> parameters in one statement ?  The answer is no because SQLAlchemy 
> doesn't include multi-values as a transparent option; the DBAPI 
> drivers instead make use of the multi-values syntax within 

[sqlalchemy] Re: Relationship with 2 intermediary tables

2021-02-12 Thread 'Jonathan Vanasco' via sqlalchemy
This is, IMHO, one of the most complex parts of SQLAlchemy.

In this public project, i have a handful of secondary/secondaryjoin 
examples that may help you

https://github.com/aptise/peter_sslers/blob/main/peter_sslers/model/objects.py#L3778-L4714

There is a section in the docs that should help a bit:

https://docs.sqlalchemy.org/en/14/orm/join_conditions.html#composite-secondary-joins

I think you want something like:

Person.photos = relationship(
    Photo,
    primaryjoin="Person.id == PersonInstance.person_id",
    secondary="""join(PersonInstance, PhotoInstance,
            PersonInstance.id == PhotoInstance.person_instance_id
        ).join(Photo, PhotoInstance.photo_id == Photo.id)""",
)

I don't think the secondaryjoin is needed in this case.  I could be wrong.
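
Untested, but usage would be something like this (note the docs' 
composite-secondary examples also mark these relationships viewonly=True, 
since the ORM can't persist through them):

person = session.query(Person).first()
for photo in person.photos:  # emits the PersonInstance/PhotoInstance join
    print(photo.id)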

The way I like to structure these complex joins is something like this...

A.Zs = relationship(
    Z,  # the destination
    primaryjoin="A.id == B.id",  # only the first association table
    secondary="join(B.id == C.id).join(C.id == D.id)...(X.id == Y.id)",  # bring the rest of the tables in
    secondaryjoin="and_(Y.id == Z.id, Z.id.in_(subselect))",  # custom filtering/join conditions
)

Does that make sense?  Mike has another way of explaining it in the docs, 
but this is how I best remember and implement it.
On Sunday, February 7, 2021 at 3:25:35 PM UTC-5 daneb...@gmail.com wrote:

> I am trying to create a relationship from one table to another, which 
> involves two intermediary tables. I *think* I need to use the secondaryjoin 
> + secondary arguments to relationship(). But after studying the 
> documentation for a long time, I can't get my head around how these 
> arguments are supposed to work.
>
> Here is my schema:
>
> class Person(Base):
>     __tablename__ = "person"
>     id = Column(Integer, primary_key=True)
>
> class PersonInstance(Base):
>     __tablename__ = "person_instance"
>     id = Column(Integer, primary_key=True)
>     person_id = Column(Integer, ForeignKey("person.id"))
>
> class Photo(Base):
>     __tablename__ = "photo"
>     id = Column(Integer, primary_key=True)
>
> class PhotoInstance(Base):
>     __tablename__ = "photo_instance"
>     id = Column(Integer, primary_key=True)
>     photo_id = Column(Integer, ForeignKey("photo.id"))
>     person_instance_id = Column(Integer, ForeignKey("person_instance.id"))
>
> I want to create a one-to-many relationship *Person.photos* which goes 
> from Person -> Photo. A Person is one-to-many with PersonInstance, and a 
> Photo is one-to-many with PhotoInstance objects. The connection from a 
> Person to a Photo exists between PersonInstance and PhotoInstance, via the 
> PhotoInstance.person_instance_id foreign key.
>
> First I tried using only primaryjoin:
>
> photos = relationship(
>     "Photo",
>     primaryjoin=(
>         "and_(Person.id == PersonInstance.person_id, "
>         "PersonInstance.id == PhotoInstance.person_instance_id, "
>         "PhotoInstance.photo_id == Photo.id)"
>     ),
> )
>
> I got an error saying it couldn't find the necessary foreign keys to 
> compute the join.
>
> So now I'm messing with secondary + secondaryjoin, but it's really trial & 
> error as I don't know how these arguments are supposed to work in my case.
>
>
