Simon,

The MySQLdb module performs this operation with very little difference from native (mysql < data.sql) speed, so I'm not concerned about that; the delta was <0.5 seconds.
In my original example (I didn't know I could pass a dict to execute() to have it fill in the parameters), I built the complete raw SQL by calling MySQLdb's escape_string on each value; formatting that SQL took slightly over 1 second, and I didn't even try to optimize that code. My bigger concern/curiosity is the overhead that SQLAlchemy Core is adding to these inserts, and figuring out why... One might say I'm giving this too much attention because I have a working, acceptable and optimal solution, but that solution bypasses the issue, and really bypasses sqlalchemy as a whole. In the end, I'm pretty sure it's related to the size of my dataset, but the time delta between doing things manually and letting sqlalchemy do its job is just too great to go unmentioned.

On Friday, October 5, 2018 at 4:27:11 AM UTC-5, Simon King wrote:
>
> MySQLdb builds a new SQL string by escaping each of the parameters and
> then substituting them into the original SQL:
>
> https://github.com/farcepest/MySQLdb1/blob/master/MySQLdb/cursors.py#L182
>
> For 89 columns * 10,550 rows, that's nearly 1 million parameters. Each
> one involves a call to "db.literal(item)", which in turn calls
> "self.escape(item, self.encoders)". It looks like self.escape is
> implemented in C.
>
> This means that you've got at least 2 million function calls (which
> are relatively slow in Python) just to build the SQL string to send to
> the server.
>
> Simon
>
> On Fri, Oct 5, 2018 at 12:01 AM <adam....@gmail.com> wrote:
> >
> > Mike,
> >
> > Er, yeah, not TSQL, my old MSSQL days are showing :D I meant a str() of SQL.
> >
> > I'm describing bindparams as the required input format of the params used to re-bind into the SQL in the optimal way; for example, in execute(__table__.insert(), bindparams) it would represent a list of dicts.
> >
> > I did a packet capture of execute(__table__.insert(), list(dict())) and it was doing the executemany-style calls (insert ...
(a, b) values (c, d), (e, f)), so there was no hangup there.
> >
> > I also just re-coded this to call MySQLdb and run executemany(str, [tuple()]); it takes about the same amount of time (~8s) as execute(__table__.insert(), list(dict())), so there must be something going on in the DBAPI that's unoptimal, as I can generate the proper SQL and insert it in ~3 seconds with a normal .execute(str).
> >
> > I tried the test case with PyMySQL with just the execute(str) case and it took about 2 seconds, so I didn't give it much more attention; I'll try it again with an executemany though.
> >
> > 2018-10-04 17:47:02,373 - INFO - Executing for 10,550 inserts, 89 columns, approx 6,252,460 bytes of raw data...
> > 2018-10-04 17:47:02,373 - INFO - Generating datasets
> > 2018-10-04 17:47:02,550 - INFO - Executing with raw_connection().executemany(str, [tuple(a, b, ...)]
> > 2018-10-04 17:47:10,061 - INFO - Committing
> >
> > log.info('Generating datasets')
> > sql_str = 'INSERT IGNORE INTO table ({}) VALUES ({})'.format(
> >     ','.join(data[0].keys()), ','.join(['%s'] * len(data[0])))
> > sql_data = [tuple(part.values()) for part in data]
> > log.info('Executing with raw_connection().executemany(str, [tuple(a, b, ...)]')
> > conn = sql.engine.raw_connection()
> > cur = conn.cursor()
> > cur.executemany(sql_str, sql_data)
> > log.info('Committing')
> > conn.commit()
> >
> > I still don't get the difference between sqlalchemy's execute(__table__.insert().values(dict())) though, and why it would be so much slower than execute(__table__.insert(), dict()), nor why execute(str, dict()) is that slow as well.
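For reference, the same executemany() pattern can be sketched against the stdlib sqlite3 driver, which follows the same PEP 249 interface as MySQLdb (placeholders are ? instead of %s; the table and data here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE target (a INTEGER, b TEXT)")

# one executemany() call with a list of parameter tuples;
# the driver decides how to batch the rows on the wire
data = [(i, "row%d" % i) for i in range(10000)]
cur.executemany("INSERT INTO target (a, b) VALUES (?, ?)", data)
conn.commit()

cur.execute("SELECT COUNT(*) FROM target")
print(cur.fetchone()[0])  # 10000
```

This single executemany() call is the same DBAPI entry point SQLAlchemy Core uses when a list of parameter dicts is passed to execute().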
> >
> > On Thursday, October 4, 2018 at 4:47:31 PM UTC-5, Mike Bayer wrote:
> >>
> >> On Thu, Oct 4, 2018 at 4:04 PM <adam....@gmail.com> wrote:
> >> >
> >> > Mike,
> >> >
> >> > No, I'm not calling execute 10k times, but I am doing what they do in your example with test_core_insert.
> >> >
> >> > My code for that is:
> >> > log.info('Executing sqlalchemy insert() with bindparams as execute param - session.execute(__table__.insert(), bindparams)')
> >> > sql.session.execute(models.target.__table__.insert().prefix_with('IGNORE'), data)
> >> > log.info('Committing')
> >> > sql.session.commit()
> >> >
> >> > where data = [OrderedDict(), OrderedDict(), ...] (I tried converting to a regular [dict(), dict(), ...] too, with no gains); this method is the fastest outside of direct or near-direct DBAPI calls, at about 8 seconds.
> >> >
> >> > However, I read somewhere (can't find the source again to cite, but I thought it was the sqlalchemy docs; I may just be remembering incorrectly) that this query chains together individual INSERT statements instead of doing a multi-insert,
> >> > i.e. insert into target (a, b, ...) values (c, d, ...); insert into target (a, b, ...) values (e, f, ...); ... instead of insert into target (a, b, ...) values (c, d, ...), (e, f, ...)
> >> > where the latter is the most optimal TSQL
> >>
> >> I'm not sure what "TSQL" means in this context.
> >>
> >> SQLAlchemy uses cursor.executemany() to invoke the statement just
> >> once; the DBAPI decides how it wants to batch it. Regardless, it
> >> should move very fast. On the Postgresql side, there's more of a big
> >> speed difference between the way psycopg2 executes a statement
> >> normally and some special modes it provides that allow it to chain
> >> together multiple statements in one batch. But that's still behind
> >> the executemany() call.
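A minimal sketch of the executemany path Mike describes, using SQLAlchemy Core against an in-memory SQLite engine rather than the mysql+mysqldb engine from the thread (table and rows are invented for illustration):

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

engine = create_engine("sqlite://")
metadata = MetaData()
target = Table(
    "target", metadata,
    Column("a", Integer),
    Column("b", String),
)
metadata.create_all(engine)

rows = [{"a": i, "b": "row%d" % i} for i in range(10000)]

# a single execute() call with a list of dicts takes the DBAPI
# executemany() path; SQLAlchemy does not emit 10,000 statements
with engine.begin() as conn:
    conn.execute(target.insert(), rows)

with engine.connect() as conn:
    print(len(conn.execute(target.select()).fetchall()))  # 10000
```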
> >>
> >> >
> >> > If I'm wrong there and it does do a multi-insert, what is the difference between execute(__table__.insert().values(bindparams)) and execute(__table__.insert(), bindparams), and why would the former be ~26 seconds slower in this use case?
> >>
> >> What is "bindparams" here, that is a list of many tuples? If so
> >> then yes, the DBAPI is not batching efficiently in the latter case.
> >> Try PyMySQL or mysqlclient.
> >>
> >> > On Thursday, October 4, 2018 at 2:05:59 PM UTC-5, Mike Bayer wrote:
> >> >>
> >> >> On Thu, Oct 4, 2018 at 2:45 PM <adam....@gmail.com> wrote:
> >> >> >
> >> >> > Hi all,
> >> >> >
> >> >> > Having some very major performance issues using sqlalchemy to insert data; I've been trying many different ways, but none seem to hold even a close candle to raw MySQLdb connections.
> >> >> >
> >> >> > Versions:
> >> >> > sqlalchemy: 1.2.9
> >> >> > MySQLdb: 1.3.13 (bound with mysql+mysqldb connector)
> >> >> > MySQL: 14.14
> >> >> > Python: 3.6.5
> >> >> > Ubuntu 18.04 x86_64
> >> >> >
> >> >> > Server:
> >> >> > 8GB ram
> >> >> > 2 x Intel(R) Xeon(R) CPU X5650 @ 2.67GHz threads running in VMWare ESXi
> >> >> >
> >> >> > Connected to localhost
> >> >> >
> >> >> > Base data info:
> >> >> > 2018-10-04 13:19:56,519 - INFO - Executing for 10,550 inserts, 89 columns, approx 6,252,460 bytes of raw data...
> >> >> >
> >> >> > Pre-formatted session execute, ~2 seconds:
> >> >> > 2018-10-04 13:19:56,520 - INFO - Building raw TSQL including all escaped value()s - insert into x (a,b, ...) values (escape_string(c), escape_string(d), ..) (escape_string(e), escape_string(f), ...)
> >> >> > 2018-10-04 13:19:57,640 - INFO - Executing raw TSQL (9,218,872 bytes) - session.execute(tsql)
> >> >> > 2018-10-04 13:19:59,435 - INFO - Committing
> >> >> >
> >> >> > Raw MySQLdb API, ~1 second:
> >> >> > 2018-10-04 13:19:59,436 - INFO - Executing same raw TSQL directly via MySQLdb API - engine.raw_connection(); cursor.execute(tsql)
> >> >> > 2018-10-04 13:20:00,326 - INFO - Committing
> >> >> >
> >> >> > Bindparam-formatted TSQL converted to text(), executed with bindparams as a dict, ~12 seconds to build text() and ~19 seconds to execute, ~31 seconds total:
> >> >> > 2018-10-04 13:20:00,326 - INFO - Building raw TSQL - insert into x (a, b, ...) values (:a1, :b1, ..), (:a2, :b2, ...)
> >> >> > 2018-10-04 13:20:01,017 - INFO - Rebuilding input data for TSQL bindparams - {a1: d, b1: e, a2: f, b2: g}
> >> >> > 2018-10-04 13:20:02,234 - INFO - Building text() object - sql.text(tsql)
> >> >> > 2018-10-04 13:20:14,259 - INFO - Executing text() with bindparams - session.execute(text, bindparams)
> >> >> > 2018-10-04 13:20:33,552 - INFO - Committing
> >> >> >
> >> >> > Bindparam-formatted TSQL NOT converted to text(), executed with bindparams as a dict, ~33 seconds:
> >> >> > 2018-10-04 13:20:33,552 - INFO - Executing raw TSQL with bindparams - session.execute(tsql, bindparams)
> >> >> > 2018-10-04 13:21:06,307 - INFO - Committing
> >> >> >
> >> >> > Using a sqlalchemy class table insert() using .values(), ~34 seconds:
> >> >> > 2018-10-04 13:21:06,311 - INFO - Executing sqlalchemy insert w/ bindparams as values - session.execute(__table__.insert().values(bindparams))
> >> >> > 2018-10-04 13:21:40,808 - INFO - Committing
> >> >> >
> >> >> > Using a sqlalchemy class table insert() passing bindparams to session.execute(), ~8 seconds:
> >> >> > 2018-10-04 13:21:40,809 - INFO - Executing sqlalchemy insert() with bindparams as execute param - session.execute(__table__.insert(), bindparams)
> >> >> > 2018-10-04 13:21:48,084 - INFO -
Committing
> >> >> >
> >> >> > Obviously the raw DB API will be the fastest as there's little to no processing or overhead; I would think session.execute(str) would come closer to matching that, but it's not too far off.
> >> >>
> >> >> A Core insert should be only slightly slower than the raw DBAPI;
> >> >> however, if you're inserting lots of rows, you need to be using
> >> >> executemany style, that is, a single execute() call with a list of
> >> >> parameters. There are no code examples here so I don't know what
> >> >> you're actually doing; if you are running session.execute() itself 10K
> >> >> times, that's the wrong way to do it. Check out
> >> >> https://docs.sqlalchemy.org/en/latest/_modules/examples/performance/bulk_inserts.html
> >> >> for usage examples of how to bulk-insert with the different APIs, including
> >> >> Core (test_core_insert specifically). Additionally, I don't know
> >> >> if you are using bound parameters with your raw MySQL code or not, and
> >> >> if so, whether there is some unicode processing going on in the SQLAlchemy
> >> >> version that might be slowing things down.
> >> >>
> >> >> A complete code example like those you see in
> >> >> https://docs.sqlalchemy.org/en/latest/_modules/examples/performance/bulk_inserts.html
> >> >> will allow me to see what you are doing.
> >> >>
> >> >> >
> >> >> > Having sqlalchemy.text(str) take 12 seconds to build seems rather excessive, but I'm assuming that's the hang with the other slow cases like execute(__table__.insert().values(bindparams)) too; however, why would execute(str, bindparams) do this? Does it shadow text() as well if arg0 is a str?
> >> >> >
> >> >> > Now the oddest part of all: doing 10,550 individual inserts in the last example is roughly 26 seconds faster than a single transaction with 10,550 sets of values. This just does not make transactional sense.
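To make the two forms from the timings above concrete, here is a sketch with a throwaway Table showing what each one compiles to (exact bound-parameter names may vary by SQLAlchemy version):

```python
from sqlalchemy import Column, Integer, MetaData, Table

metadata = MetaData()
target = Table("target", metadata, Column("a", Integer), Column("b", Integer))

# insert().values([...]) embeds every row as distinct bound parameters
# in one multi-VALUES statement; building and compiling this string is
# what gets expensive as the row count grows
multi = target.insert().values([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
print(str(multi))  # INSERT INTO target (a, b) VALUES (...), (...)

# a plain insert() compiles once; the rows passed to execute() are
# handed to the DBAPI's executemany() as separate parameter sets
plain = target.insert()
print(str(plain))  # INSERT INTO target (a, b) VALUES (:a, :b)
```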
> >> >> >
> >> >> > Is there something I can be doing better while utilizing Core, or am I over-expecting how sqlalchemy should perform with datasets like this?
> >> >> >
> >> >> > Hopefully this is enough detail, but I can provide more upon request.
> >> >> >
> >> >> > Thank you!
--
SQLAlchemy -
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable Example. See http://stackoverflow.com/help/mcve for a full description.
---
You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.