[sqlalchemy] Re: Bulk Inserts and Unique Constraints
I wasn't going to bother, but I had a look at doing this just out of curiosity, and these were the results:

executemany():
  Inserting 424 entries: 0.3362s
  Inserting 20,000 segments: 14.01s

COPY:
  Inserting 425 entries: 0.04s
  Inserting 20,000 segments: 0.3s

So a pretty massive boost. Thanks :)

On Monday, 24 March 2014 23:30:32 UTC+8, Jonathan Vanasco wrote:
> Since you're using Postgres... have you considered using Python to generate a COPY file? SQLAlchemy doesn't seem to support it natively... maybe via 'text', but your underlying psycopg2 driver does. It's way, way, way faster. I've found it significantly faster than dropping fkeys and using prepared statements.

--
You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at http://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.
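[Editor's note] For readers landing on this thread later, here is a rough sketch of the COPY approach benchmarked above. It builds the tab-separated text that Postgres's COPY ... FROM STDIN expects in an in-memory buffer; the column names (sub, division, created) are taken from the original post, and the psycopg2 calls are shown only as comments since they need a live connection, so treat them as an assumption rather than tested usage.

```python
import io

def entries_to_copy_buffer(rows, columns):
    """Build an in-memory buffer in Postgres's text COPY format:
    tab-separated fields, \\N for NULL, with embedded backslashes,
    tabs and newlines escaped."""
    buf = io.StringIO()
    for row in rows:
        fields = []
        for col in columns:
            value = row.get(col)
            if value is None:
                fields.append(r"\N")
            else:
                fields.append(str(value)
                              .replace("\\", "\\\\")
                              .replace("\t", "\\t")
                              .replace("\n", "\\n"))
        buf.write("\t".join(fields) + "\n")
    buf.seek(0)
    return buf

# Hypothetical usage with psycopg2 (table/column names illustrative):
# raw_conn = engine.raw_connection()
# cur = raw_conn.cursor()
# buf = entries_to_copy_buffer(entries, ["sub", "division", "created"])
# cur.copy_expert("COPY entry (sub, division, created) FROM STDIN", buf)
# raw_conn.commit()
```

The buffer-building pass is pure Python, so it can be unit-tested without a database; only the final copy_expert() call touches Postgres.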
[sqlalchemy] Re: Bulk Inserts and Unique Constraints
That's effectively what I'm doing now. I'm not sure there's much I can speed up at this point - the SELECTs take about 0.05s; it's just the INSERTs taking the bulk of the time - 11-15s depending on the number of rows. That said, I'm still running on development hardware, and there'll be a significant boost once it's on proper hardware.

On Monday, 24 March 2014 22:44:09 UTC+8, Jonathan Vanasco wrote:
>> The data comes in unordered and sometimes contains duplicates, so there's a UniqueConstraint on Entry on sub, division, created.
>
> Have you tried pre-processing the list first? I've had similar situations when dealing with browser, user and app analytics. I normally do a first pass to restructure the raw log file and note any SELECTs I might need to associate the records to; then I lock tables, precache the SELECTs, and do all the inserts. The speedups have been great.
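[Editor's note] The pre-processing pass described above can be sketched in a few lines. This assumes the batch arrives as a list of dicts and dedupes on the (sub, division, created) unique key from the original post, keeping the first occurrence; the function name is made up for illustration.

```python
def dedupe_entries(entries):
    """First pass over a raw batch: drop rows that collide on the
    (sub, division, created) unique key, keeping the first seen."""
    seen = set()
    unique = []
    for entry in entries:
        key = (entry["sub"], entry["division"], entry["created"])
        if key in seen:
            continue  # duplicate within this batch - skip it
        seen.add(key)
        unique.append(entry)
    return unique
```

Doing this in Python before the bulk INSERT means in-batch duplicates can never trip the UniqueConstraint and roll back the whole statement.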
[sqlalchemy] Re: Bulk Inserts and Unique Constraints
> The data comes in unordered and sometimes contains duplicates, so there's a UniqueConstraint on Entry on sub, division, created.

Have you tried pre-processing the list first? I've had similar situations when dealing with browser, user and app analytics. I normally do a first pass to restructure the raw log file and note any SELECTs I might need to associate the records to; then I lock tables, precache the SELECTs, and do all the inserts. The speedups have been great.
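[Editor's note] Besides pre-processing in Python, the "skip duplicates inside the database" option the thread is circling around was later added to Postgres as INSERT ... ON CONFLICT DO NOTHING (9.5+), which did not exist when this thread was written. The shape of the idea can be demonstrated with the stdlib sqlite3 module, whose INSERT OR IGNORE behaves the same way; the schema below is a toy stand-in for the poster's Entry table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entry (
        id INTEGER PRIMARY KEY,
        sub TEXT, division TEXT, created TEXT,
        UNIQUE (sub, division, created)
    )
""")

rows = [
    ("a", "x", "2014-03-23"),
    ("a", "x", "2014-03-23"),  # duplicate of the first row
    ("b", "y", "2014-03-23"),
]

# INSERT OR IGNORE skips rows that violate the unique constraint instead
# of aborting the statement; Postgres 9.5+ spells the same thing as
# INSERT ... ON CONFLICT (sub, division, created) DO NOTHING.
conn.executemany(
    "INSERT OR IGNORE INTO entry (sub, division, created) VALUES (?, ?, ?)",
    rows,
)

count = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
print(count)  # 2 - the duplicate was skipped, nothing rolled back
```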
[sqlalchemy] Re: Bulk Inserts and Unique Constraints
Since you're using Postgres... have you considered using Python to generate a COPY file? SQLAlchemy doesn't seem to support it natively... maybe via 'text', but your underlying psycopg2 driver does. It's way, way, way faster. I've found it significantly faster than dropping fkeys and using prepared statements.
[sqlalchemy] Re: Bulk Inserts and Unique Constraints
Anyway, I don't think you should issue so many SELECTs - that's what performance suffers from most, I think. An INSERT statement will return the inserted ids for you; maybe you want to read this: http://docs.sqlalchemy.org/en/latest/core/tutorial.html#executing

Skipping duplicates... well, if you turn off the constraint check and that's what you want, it works, but... I think the best strategy for the database is to ROLLBACK. It's YOUR duty to filter the duplicates before sending them to the database. A set in Python works well in this situation. You may want to learn more about stored procedures if you want the database to do the filtering.

And if you really care that much about performance and you know the weaknesses of the ORM, why do you use the ORM?

Cosmia

On Sunday, March 23, 2014 11:33:32 PM UTC+8, James Meneghello wrote:
> I'm having a few issues with unique constraints and bulk inserts. The software I'm writing takes data from an external source (a lot of it, anywhere from 1,000 rows per minute to 100-200k+), crunches it down into its hierarchy and saves it to the DB, to be aggregated in the background. The function handling the initial DB save is designed to work with about 20-50k rows at a time - very little modification takes place, it's pretty much just grabbed and thrown into the table. Obviously the amount of data being saved somewhat excludes the use of the ORM in this particular table, but there are a number of other tables that benefit from using the ORM. Hence, the small stuff uses the ORM and the big stuff uses the Core.
>
> The main problem I'm having is with the initial save. The data comes in unordered and sometimes contains duplicates, so there's a UniqueConstraint on Entry on sub, division, created. Unfortunately, this hampers the bulk insert - if there's a duplicate, it rolls back the entire insert, and hence the entries aren't available to be referenced by the segments later. Obviously, capturing it in a try/catch would skip the whole block as well.
> Both Entry and Segment have the same problem - there are often duplicate segments. Since there's a large amount of data being pushed through it, I assume it's impractical to insert the elements individually - while there's only 100-200 entries per block, there's usually 20-50k segments. Is there any way of forcing the engine to skip over duplicates and not roll back the transaction on exception? Code's below. Using Postgres, with psycopg2 as the driver.
>
> engine.execute(Entry.__table__.insert(), entries)
>
> segment_list = []
> for sub, entry in entries.items():
>     segments = entry.pop('segments')
>
>     e = db.query(Entry)\
>         .filter(Entry.sub == entry['sub'])\
>         .filter(Entry.division == entry['division'])\
>         .filter(Entry.created == entry['created']).first()
>
>     for segment in segments:
>         segment['entry_id'] = e.id
>         segment_list.append(segment)
>
> engine.execute(Segment.__table__.insert(), segment_list)
>
> In addition, is there some way to pre-fetch data? Rather than query for each Entry, it'd be nice to pre-load all entries and save a couple hundred queries. Thanks!
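[Editor's note] On the pre-fetch question at the end of the quoted post: instead of one SELECT per entry, all inserted entries can be fetched in a single query and held in a dict keyed by the unique triple, so attaching entry ids to segments becomes a lookup. The helper names below are invented for illustration, and the single-query part is sketched only as a comment since it needs a live session.

```python
def build_entry_lookup(fetched_rows):
    """Map the unique key (sub, division, created) -> entry id, so
    segments can be wired up without one SELECT per entry."""
    return {
        (row["sub"], row["division"], row["created"]): row["id"]
        for row in fetched_rows
    }

def attach_entry_ids(entries_by_sub, lookup):
    """Flatten segments, stamping each with its parent entry's id."""
    segment_list = []
    for entry in entries_by_sub.values():
        entry_id = lookup[(entry["sub"], entry["division"], entry["created"])]
        for segment in entry["segments"]:
            segment["entry_id"] = entry_id
            segment_list.append(segment)
    return segment_list

# Hypothetical single query replacing the per-entry SELECTs, e.g. with
# SQLAlchemy's tuple_ IN (Postgres supports composite IN):
# from sqlalchemy import tuple_
# rows = db.query(Entry.id, Entry.sub, Entry.division, Entry.created).filter(
#     tuple_(Entry.sub, Entry.division, Entry.created).in_(keys)).all()
```

This turns a couple hundred round trips into one query plus O(1) dictionary lookups per segment.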