Re: [sqlalchemy] Bulk Inserts and Unique Constraints
Thanks for the quick reply! This seems to work pretty well. I took out the batching (as it's already batched at a higher level), modified it to suit the insertion of children as well (and reduced the unique constraint to a single field), and it appears to work.

    with db_session() as db:
        existing_entries = dict(
            (entry.subject, entry)
            for entry in db.query(Entry)
                           .filter(Entry.subject.in_(entries.keys()))
                           .options(subqueryload('segments'))
                           .all()
        )

        segment_inserts = []
        for subject, entry in entries.items():
            existing_entry = existing_entries.get(subject, None)
            if existing_entry:
                segments = dict((s.segment, s) for s in existing_entry.segments)
                for segment_number, segment in entry['segments'].items():
                    if int(segment_number) not in segments:
                        segment['entry_id'] = existing_entry.id
                        segment_inserts.append(segment)
            else:
                entry_id = engine.execute(Entry.__table__.insert(), entry).inserted_primary_key
                for segment in entry['segments'].values():
                    segment['entry_id'] = entry_id[0]
                    segment_inserts.append(segment)

        engine.execute(Segment.__table__.insert(), segment_inserts)

For 20,000 segments, this ends up being about 45 seconds and 1,650 queries: 2 to select all the entries and segments, 1 to insert the segments, and the rest to insert parts.
From here, however, I rewrote it a bit:

    with db_session() as db:
        existing_entries = dict(
            (entry.subject, entry)
            for entry in db.query(Entry)
                           .filter(Entry.subject.in_(entries.keys()))
                           .all()
        )

        entry_inserts = []
        for subject, entry in entries.items():
            existing_entry = existing_entries.get(subject, None)
            if not existing_entry:
                entry_inserts.append(entry)

        engine.execute(Entry.__table__.insert(), entry_inserts)

        existing_entries = dict(
            (entry.subject, entry)
            for entry in db.query(Entry)
                           .filter(Entry.subject.in_(entries.keys()))
                           .options(subqueryload('segments'))
                           .all()
        )

        segment_inserts = []
        for subject, entry in entries.items():
            existing_entry = existing_entries.get(subject, None)
            if existing_entry:
                segments = dict((s.segment, s) for s in existing_entry.segments)
                for segment_number, segment in entry['segments'].items():
                    if int(segment_number) not in segments:
                        segment['entry_id'] = existing_entry.id
                        segment_inserts.append(segment)
            else:
                log.error("i've made a huge mistake")

        engine.execute(Segment.__table__.insert(), segment_inserts)

This ends up being about 19 seconds and 6 queries for a clean dump, and a bit less if the table is already populated. Removing the unique indexes on both the entries and segments tables and replacing them with standard indexes saves about a second in a full dump, and about 6 seconds for an update. I'm pretty happy with where it is now, and I suspect most of the time (aside from the two insert calls) is being spent in Python. That said, if you have any tips for improvements I'd be all ears. Thanks for the help!

On Monday, 24 March 2014 09:19:25 UTC+8, Michael Bayer wrote: On Mar 23, 2014, at 11:33 AM, James Meneghello muro...@gmail.com wrote: I'm having a few issues with unique constraints and bulk inserts. The software I'm writing takes data from an external source (a lot of it, anywhere from 1,000 rows per minute to 100-200k+), crunches it down into its hierarchy and saves it to the DB, to be aggregated in the background.
The function handling the initial DB save is designed to work with about 20-50k rows at a time - very little modification takes place, it's pretty much just grabbed and thrown into the table. Obviously the amount of data being saved somewhat excludes the use of the ORM in this particular table, but there are a number of other tables that benefit from using the ORM. Hence, the small stuff uses the ORM and the big stuff uses the Core. The main problem I'm having is with the initial save. The data comes in unordered and sometimes contains duplicates, so there's a UniqueConstraint on Entry on sub, division, created. Unfortunately, this hampers the bulk insert - if there's a duplicate, it rolls back the entire insert and hence the entries aren't available to be referenced by the segments later. Obviously, capturing it in a try/catch would skip the whole block as well. Both Entry and Segment have the same problem - there are often duplicate segments. Since there's a large amount of data being pushed through it, I assume it's impractical to insert the elements individually - while there's only 100-200
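The two-pass pattern described above (bulk-insert only the missing parents, re-select to learn their generated keys, then bulk-insert the children) can be sketched with SQLAlchemy Core against an in-memory SQLite database. This is a minimal sketch, not the poster's actual schema: the table names, the `incoming` batch shape, and the single-field unique `subject` key are illustrative, and it assumes a SQLAlchemy 1.4+ style API.

```python
from sqlalchemy import (create_engine, MetaData, Table, Column, Integer,
                        String, ForeignKey, select)

engine = create_engine("sqlite://")
metadata = MetaData()

entry = Table("entry", metadata,
              Column("id", Integer, primary_key=True),
              Column("subject", String, unique=True))
segment = Table("segment", metadata,
                Column("id", Integer, primary_key=True),
                Column("entry_id", Integer, ForeignKey("entry.id")),
                Column("segment", Integer))
metadata.create_all(engine)

# Incoming batch keyed by the unique "subject" field (shape is made up).
incoming = {
    "subj-a": {"segments": {1: {"segment": 1}, 2: {"segment": 2}}},
    "subj-b": {"segments": {1: {"segment": 1}}},
}

with engine.begin() as conn:
    # Pass 1: find which parents already exist; bulk-insert only the missing ones.
    existing = {row.subject for row in conn.execute(
        select(entry.c.subject).where(entry.c.subject.in_(list(incoming))))}
    missing = [{"subject": s} for s in incoming if s not in existing]
    if missing:
        conn.execute(entry.insert(), missing)   # executemany-style bulk insert

    # Pass 2: re-select to learn the generated primary keys,
    # then bulk-insert all children in one executemany call.
    ids = {row.subject: row.id for row in conn.execute(
        select(entry.c.subject, entry.c.id)
        .where(entry.c.subject.in_(list(incoming))))}
    seg_rows = [{"entry_id": ids[s], "segment": seg["segment"]}
                for s, e in incoming.items()
                for seg in e["segments"].values()]
    conn.execute(segment.insert(), seg_rows)

with engine.connect() as conn:
    inserted_segments = conn.execute(select(segment)).fetchall()
```

The query count stays fixed regardless of batch size, which matches the 6-query figure reported above.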
Re: [sqlalchemy] Bulk Inserts and Unique Constraints
Oops, I should add - the reason I can't use an itertools counter to pre-assign IDs is because the table is potentially being dumped to by multiple scripts, which is why I have to commit the parts prior to the segments (since engine.execute can't return multiple insert_ids).

On Monday, 24 March 2014 14:40:52 UTC+8, James Meneghello wrote: Thanks for the quick reply! This seems to work pretty well. [...]
[sqlalchemy] What is the rationale of having to manually set up a relationship between two tables?
Suppose we have two tables in an existing database, user and address. There is a one-to-many relationship between these two tables with a foreign key user.id == address.user_id. Now we *reflect* this schema directly from the database:

    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base(engine)

    class User(Base):
        __tablename__ = 'user'
        __table_args__ = {'autoload': True}
        addresses = sqlalchemy.orm.relationship('Address', backref='user')

    class Address(Base):
        __tablename__ = 'address'
        __table_args__ = {'autoload': True}

To me, it is a mystery why the almighty SA can autoload everything, but not an obvious relationship (the addresses line above). What benefit is there for the user in manually defining such a simple and unambiguous relationship? Can't it be done automatically, i.e., leaving out that line? Will this feature likely be included in future versions, say, SQLAlchemy 1.0?

-- You received this message because you are subscribed to the Google Groups sqlalchemy group. To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+unsubscr...@googlegroups.com. To post to this group, send email to sqlalchemy@googlegroups.com. Visit this group at http://groups.google.com/group/sqlalchemy. For more options, visit https://groups.google.com/d/optout.
Re: [sqlalchemy] Bulk Inserts and Unique Constraints
The INSERT statement of PostgreSQL supports RETURNING; read this: http://docs.sqlalchemy.org/en/rel_0_8/core/dml.html#sqlalchemy.sql.expression.Insert.returning

On Monday, March 24, 2014 2:43:46 PM UTC+8, James Meneghello wrote: Oops, I should add - the reason I can't use an itertools counter to pre-assign IDs is because the table is potentially being dumped to by multiple scripts, which is why I have to commit the parts prior to the segments (since engine.execute can't return multiple insert_ids). [...]
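For a single-row INSERT, the RETURNING clause can be attached as below. This sketch only compiles the statement against the PostgreSQL dialect (no live connection; the table and column names are made up), and note that, as Michael points out later in the thread, RETURNING does not help for executemany-style bulk inserts.

```python
from sqlalchemy import Table, Column, Integer, String, MetaData
from sqlalchemy.dialects import postgresql

metadata = MetaData()
entry = Table("entry", metadata,
              Column("id", Integer, primary_key=True),
              Column("subject", String))

# Attach RETURNING so a single-row execute() hands back the generated id
# without a second round trip to the database.
stmt = entry.insert().returning(entry.c.id)
sql = str(stmt.compile(dialect=postgresql.dialect()))
```

With a real PostgreSQL connection, `result.scalar()` on that statement would yield the new primary key.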
Re: [sqlalchemy] single table hierarchy: disable polymorphism and load all as base class
No, polymorphic_on=None does not have any effect :( I'll keep trying :P Thanks! Pau.

2014-03-17 18:56 GMT+01:00 Michael Bayer mike...@zzzcomputing.com: er, guessing, polymorphic_on=None also for that class? not sure of the mapper's tolerance.

On Mar 17, 2014, at 12:10 PM, Pau Tallada tall...@pic.es wrote: Hi, I tried this, but then it selects WHERE table.type IN (NULL) :(

    class NonPolymorphicClass(MyBaseClass):
        __mapper_args__ = {
            'concrete': True
        }

I'll try some more things, like overriding the __new__ method on the base class, to create subclasses if they are available, and instances of the base class if not. Thank you very much anyway :) Pau.

2014-03-17 15:44 GMT+01:00 Michael Bayer mike...@zzzcomputing.com: maybe make a subclass of the polymorphic base, and just put 'concrete': True in the __mapper_args__. Skip the non-primary part.

On Mar 17, 2014, at 8:45 AM, Pau Tallada tall...@pic.es wrote: Hi, Can it be done in declarative? I've tried several ways, but I cannot find a working one :/ Basically, I think I need to redefine some self-referential relationships, as they link against the polymorphic class.

    Meta = declarative_meta()

    class NonPolymorphicClass(MyBaseClass):
        __mapper_args__ = {
            'non_primary': True
        }
    ==> ArgumentError: Inheritance of non-primary mapper for class 'NonPolymorphicClass' is only allowed from a non-primary mapper

    class NonPolymorphicClass(MyBaseClass):
        __table__ = MyBaseClass.__table__
        __mapper_args__ = {
            'non_primary': True
        }
    ==> ArgumentError: Inheritance of non-primary mapper for class 'NonPolymorphicClass' is only allowed from a non-primary mapper

    class NonPolymorphicClass(Meta):
        __mapper_args__ = {
            'non_primary': True
        }
    ==> InvalidRequestError: Class <class '__main__.NonPolymorphicClass'> does not have a __table__ or __tablename__ specified and does not inherit from an existing table-mapped class.
    class NonPolymorphicClass(Meta):
        __mapper_args__ = {
            'non_primary': True
        }
    ==> InvalidRequestError: Class <class '__main__.NonPolymorphicClass'> has no primary mapper configured. Configure a primary mapper first before setting up a non primary Mapper.

2014-03-17 13:09 GMT+01:00 Pau Tallada tall...@pic.es: Great! I think it works for my needs :D Thank you very much! Pau.

2014-03-17 12:28 GMT+01:00 Michael Bayer mike...@zzzcomputing.com: On Mar 17, 2014, at 6:48 AM, Pau Tallada tall...@pic.es wrote: Hi everyone :) I have an application with some kind of dynamic model. Several plugins define subclasses of a common base class and override their behaviour, but they do not change any column. In some scenarios, not all the plugins are loaded, so not all the subclasses are present/defined in the Metadata, but I still need to query those instances and access their columns. Is there any option/feature/trick to temporarily disable polymorphism in a single query (or globally for a session or engine) and load all classes as instances of the base class?

you can use a non primary mapper:

    m = mapper(MyBase, my_base_table, non_primary=True)
    s.query(m).all()

The other option is to hack some kind of on-demand polymorphism myself into the __new__ method of the Base class. Thanks! Pau.

--
Pau Tallada Crespí
Dep. d'Astrofísica i Cosmologia
Port d'Informació Científica (PIC)
Tel: +34 93 586 8233
Re: [sqlalchemy] What is the rationale of having to manually set up a relationship between two tables?
On Mon, Mar 24, 2014 at 6:48 AM, Bao Niu niuba...@gmail.com wrote: Suppose we have two tables in an existing database, user and address. [...] To me, it is a mystery why the almighty SA can autoload everything, but not an obvious relationship. [...]

You might be interested in the experimental automap extension:

http://docs.sqlalchemy.org/en/rel_0_9/orm/extensions/automap.html

Hope that helps,

Simon
[sqlalchemy] after_delete callback doesn't being called with query(...).delete() syntax
Hey, I'm facing an issue with the after_delete callback. It works when I do

    y = DBSession.query(Yarns).filter_by(user_id=user.id, slug=yarn_id).first()
    DBSession.delete(y)
    DBSession.flush()

but isn't called on

    DBSession.query(Yarns).filter_by(user_id=user.id, slug=yarn_id).delete()

where

    DBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))

The callback itself is set up like this:

    sqlalchemy.event.listen(Yarns, 'after_delete', yarn_index_remove)

and the model class looks like this:

    class Yarns(sqlalchemy.ext.declarative.declarative_base()):
        ...

I'd really like to use the more succinct Query.delete() syntax instead of the current one. Thanks, Igor

-- You received this message because you are subscribed to the Google Groups sqlalchemy group.
Re: [sqlalchemy] after_delete callback doesn't being called with query(...).delete() syntax
On Mon, Mar 24, 2014 at 1:05 PM, Igor Loskutoff igor.loskut...@gmail.com wrote: Hey, I'm faced issue with after_delete callback call. [...]

I think the problem is that after_delete is only used for instances which have actually been loaded from the DB. Query.delete() doesn't load the matching rows from the DB, so the event doesn't fire. There is also a (session, rather than mapper) event called after_bulk_delete which you might be able to use:

http://docs.sqlalchemy.org/en/rel_0_9/orm/events.html#sqlalchemy.orm.events.SessionEvents.after_bulk_delete

You'd probably have to examine the delete_context.query object to see which class is being deleted, and figuring out exactly which rows have been deleted might also be a problem (since the filter conditions passed to the query could be arbitrarily complicated).

Simon
Re: [sqlalchemy] after_delete callback doesn't being called with query(...).delete() syntax
On Mar 24, 2014, at 9:05 AM, Igor Loskutoff igor.loskut...@gmail.com wrote: Hey, I'm faced issue with after_delete callback call. [...]

after_delete is only called for unit-of-work deletes. The aggregate delete() doesn't have the same access to object instances that the UOW does, as it may match any number of objects (or none), none of which may be locally present. To catch those events use after_bulk_delete():

http://docs.sqlalchemy.org/en/rel_0_9/orm/events.html?highlight=after_bulk_delete#sqlalchemy.orm.events.SessionEvents.after_bulk_delete
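A sketch of that session-level hook, wired to an in-memory SQLite database. The model and listener names are made up, and the single-argument `delete_context` signature is the one modern SQLAlchemy (1.4+) passes; older releases used a four-argument form.

```python
from sqlalchemy import Column, Integer, String, create_engine, event
from sqlalchemy.orm import Session

try:
    from sqlalchemy.orm import declarative_base            # 1.4+
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Yarn(Base):
    __tablename__ = "yarns"
    id = Column(Integer, primary_key=True)
    slug = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

bulk_deletes = []

# Session-level event: fires for Query.delete(), which never loads instances.
@event.listens_for(Session, "after_bulk_delete")
def on_bulk_delete(delete_context):
    # delete_context.query tells you which entity the bulk delete targeted.
    bulk_deletes.append(delete_context)

session = Session(bind=engine)
session.add(Yarn(slug="a"))
session.commit()

session.query(Yarn).filter_by(slug="a").delete()   # fires after_bulk_delete
session.commit()
```

As Simon notes, the listener sees the query, not the deleted rows themselves.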
Re: [sqlalchemy] Bulk Inserts and Unique Constraints
RETURNING doesn't work with the DBAPI's executemany style of execution, however, which is what conn.execute(stmt, [list of parameter sets]) calls.

On Mar 24, 2014, at 5:33 AM, Cosmia Luna cosm...@gmail.com wrote: INSERT statement of postgresql supports RETURNING, read this http://docs.sqlalchemy.org/en/rel_0_8/core/dml.html#sqlalchemy.sql.expression.Insert.returning [...]
[sqlalchemy] Re: Bulk Inserts and Unique Constraints
That's effectively what I'm doing now. I'm not sure there's much I can speed up at this point - the SELECTs take about 0.05s; it's just the INSERTs taking the bulk of the time, 11-15s depending on the number of rows. That said, I'm still running on a development machine and there'll be a significant boost once it's on proper hardware.

On Monday, 24 March 2014 22:44:09 UTC+8, Jonathan Vanasco wrote: [...]
[sqlalchemy] Re: Bulk Inserts and Unique Constraints
"The data comes in unordered and sometimes contains duplicates, so there's a UniqueConstraint on Entry on sub, division, created."

Have you tried pre-processing the list first? I've had similar situations when dealing with browser, user, and app analytics. I normally do a first pass to restructure the raw log file and note any 'selects' I might need to associate the records to; then I lock tables, precache the selects, and do all the inserts. The speed pickups have been great.
[sqlalchemy] Re: What is the rationale of having to manually set up a relationship between two tables?
That's a really simple use case -- you want the primary key to be a 1:1 relationship, and you don't have potentially competing fields. But this gets complicated when:

- You want User.addresses to be a single item (User.address)
- You don't want User.addresses to just be all the addresses, but to be filtered on is_current_address = True
- User has a column named 'address' or 'addresses'; the join would want to be named email_addresses or postal_address

It's not that SQLAlchemy can't do this stuff -- it's that it chooses not to, so you have full control of your schema. That's why it's in the extensions, not the core.
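[Editor's note: a hedged sketch of the "filtered relationship" case from the second bullet above -- a single current address alongside the full collection. Model and column names are illustrative, not from the thread; this is exactly the kind of schema decision SQLAlchemy leaves to the user rather than guessing.]

```python
from sqlalchemy import (Boolean, Column, ForeignKey, Integer, String,
                        create_engine)
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class User(Base):
    __tablename__ = "user"
    id = Column(Integer, primary_key=True)
    addresses = relationship("Address")
    # uselist=False plus a filtering primaryjoin: one Address, and only
    # the one flagged current. viewonly=True because this relationship
    # overlaps `addresses` and should never be used for persistence.
    current_address = relationship(
        "Address",
        primaryjoin="and_(User.id == Address.user_id, "
                    "Address.is_current_address == True)",
        uselist=False,
        viewonly=True,
    )

class Address(Base):
    __tablename__ = "address"
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("user.id"))
    email = Column(String)
    is_current_address = Column(Boolean, default=False)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
sess = Session(engine)
u = User(addresses=[Address(email="old@example.com", is_current_address=False),
                    Address(email="new@example.com", is_current_address=True)])
sess.add(u)
sess.commit()
print(u.current_address.email)  # new@example.com
```

Note there is no single "right" answer SQLAlchemy could have autogenerated here -- the filter, the scalar-vs-list choice, and the attribute name are all schema policy.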
[sqlalchemy] Re: Bulk Inserts and Unique Constraints
Since you're using Postgres... have you considered using Python to generate a COPY file? SQLAlchemy doesn't seem to support it natively (maybe via 'text'), but your underlying psycopg2 driver does. It's way, way, way faster - I've found it significantly faster than dropping fkeys and using prepared statements.
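[Editor's note: a sketch of the COPY approach suggested above, using psycopg2's `cursor.copy_from()` with an in-memory buffer rather than an actual file on disk. The table and column names are illustrative.]

```python
import io

def build_copy_buffer(rows):
    """Serialize (subject, body) tuples into COPY's text format:
    tab-separated fields, newline-terminated rows. Real code must also
    escape tabs, newlines, and backslashes embedded in the data."""
    buf = io.StringIO()
    for subject, body in rows:
        buf.write(f"{subject}\t{body}\n")
    buf.seek(0)
    return buf

def bulk_copy(conn, rows):
    """conn is an open psycopg2 connection. copy_from streams the buffer
    through COPY ... FROM STDIN, bypassing per-row INSERT overhead."""
    buf = build_copy_buffer(rows)
    with conn.cursor() as cur:
        cur.copy_from(buf, "entries", columns=("subject", "body"))
    conn.commit()
```

The trade-off is that COPY is all-or-nothing per batch: a single constraint violation aborts the whole command, so the pre-deduplication step matters even more here.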
[sqlalchemy] joinedload()
Hello all, I am trying to optimize a query I have, and have noticed some strange behavior. I have a table called 'Trade'. A trade may optionally have a commission, which itself is a trade. This is the core of it:

    class Trade(Base, CCPTableMixin):
        __tablename__ = "trade"

        id = Column(Integer, primary_key=True)

        trade_commission = Table("trade_commission", Base.metadata,
            Column("id_trade_victim", Integer, ForeignKey(id), primary_key=True),
            Column("id_trade_cash", Integer, ForeignKey(id), primary_key=True))

        commission = relationship(trade_class_name,
            secondary=trade_commission,
            primaryjoin=(id == trade_commission.c.id_trade_victim),
            secondaryjoin=(id == trade_commission.c.id_trade_cash),
            uselist=False,
            backref=backref("commission_for", uselist=False))

Now, in a particular query, if a trade happens to be a commission trade, I wish to eagerly load the original 'victim' trade, so I have something like this:

    query = session().query(Trade)
    query = query.options(joinedload("commission_for"))

I noticed things were going a bit slowly, and to my surprise, when I turned on query tracing I could see many individual queries. It turns out that the joined loading was working in general; however, if 'commission_for' was None (e.g. the trade is not a commission trade), SQLAlchemy was issuing a query to re-populate the (empty) collection. Interestingly, when I switched to 'subqueryload' it didn't do this and correctly eagerly loaded the empty collection. However, I have quite an expensive query to select the list of trades in the first place, which I would rather not run twice, as a subquery eager load requires. (It would be really cool if you could select the primary keys into a temporary table or something when doing a subquery load, but I guess that would be quite difficult to do in a DB-independent way.) Is this a bug, or perhaps some expected side effect of the joined load? I am using SQLAlchemy 0.9.3. If it helps I could knock up a working example in short order.
Cheers, and keep up the fine work! - Philip
Re: [sqlalchemy] joinedload()
On Mar 24, 2014, at 2:09 PM, Philip Scott safetyfirstp...@gmail.com wrote:

> Is this a bug, or perhaps some expected side effect of the joined load?

seemed like something that might be possible, but the scalar loader is initializing the attribute to None to start with. here's a simple test that doesn't show your behavior, so see if you can just modify this one to show what you are seeing. note we only need to see that 'bs' is in a1.__dict__ to prevent a lazy load.

    from sqlalchemy import *
    from sqlalchemy.orm import *
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class A(Base):
        __tablename__ = 'a'
        id = Column(Integer, primary_key=True)
        bs = relationship("B",
                secondary=Table('atob', Base.metadata,
                    Column('aid', ForeignKey('a.id')),
                    Column('bid', ForeignKey('b.id'))
                ),
                uselist=False)

    class B(Base):
        __tablename__ = 'b'
        id = Column(Integer, primary_key=True)

    e = create_engine("sqlite://", echo=True)
    Base.metadata.create_all(e)

    sess = Session(e)
    sess.add(A())
    sess.commit()
    sess.close()

    a1 = sess.query(A).options(joinedload("bs")).first()
    assert 'bs' in a1.__dict__
    assert a1.__dict__['bs'] is None
    assert a1.bs is None
Re: [sqlalchemy] What is the rationale of having to manually set up a relationship between two tables?
Compared to SQLSoup, is this extension more recommended?

On Mon, Mar 24, 2014 at 3:32 AM, Simon King si...@simonking.org.uk wrote:

> On Mon, Mar 24, 2014 at 6:48 AM, Bao Niu niuba...@gmail.com wrote:
>
> > Suppose we have two tables in an existing database, user and address. There is a one-to-many relationship between these two tables with a foreign key user.id == address.user_id. Now we *reflect* this schema directly from the database:
> >
> >     from sqlalchemy.ext.declarative import declarative_base
> >     Base = declarative_base(E)
> >
> >     class User(Base):
> >         __tablename__ = 'user'
> >         __table_args__ = {'autoload': True}
> >         addresses = sqlalchemy.orm.relationship("Address", backref="user")
> >
> >     class Address(Base):
> >         __tablename__ = 'address'
> >         __table_args__ = {'autoload': True}
> >
> > To me, it is a mystery why the almighty SA can autoload everything but not such an obvious relationship (in red ink). What benefit is it for the user to manually define such a simple and unambiguous relationship? Can't it be done automatically, i.e., leaving out the red part? Will this feature likely be included in future versions, say, SQLAlchemy 1.0?
>
> You might be interested in the experimental automap extension:
>
> http://docs.sqlalchemy.org/en/rel_0_9/orm/extensions/automap.html
>
> Hope that helps,
>
> Simon
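[Editor's note: a minimal, self-contained sketch of the automap extension Simon points to -- it reflects the tables and also generates the relationships from the foreign keys, which is precisely what the question asks for. The toy schema below stands in for a pre-existing database; `autoload_with` is the modern spelling (the 0.9-era form was `Base.prepare(engine, reflect=True)`).]

```python
from sqlalchemy import (Column, ForeignKey, Integer, MetaData, String,
                        Table, create_engine)
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session

engine = create_engine("sqlite://")

# Create a toy schema to reflect, standing in for an existing database.
metadata = MetaData()
Table("user", metadata,
      Column("id", Integer, primary_key=True),
      Column("name", String))
Table("address", metadata,
      Column("id", Integer, primary_key=True),
      Column("user_id", Integer, ForeignKey("user.id")),
      Column("email", String))
metadata.create_all(engine)

Base = automap_base()
Base.prepare(autoload_with=engine)  # reflect AND generate relationships

User = Base.classes.user
Address = Base.classes.address

sess = Session(engine)
u = User(name="bao")
# automap generated User.address_collection from the foreign key,
# with no relationship() declared by hand.
u.address_collection.append(Address(email="x@example.com"))
sess.add(u)
sess.commit()
print(u.address_collection[0].email)  # x@example.com
```

(As for SQLSoup: it predates automap and is maintained separately; automap is the extension the SQLAlchemy docs point to for this reflect-everything use case.)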