On Fri, Oct 5, 2018 at 9:15 AM <adam.bot...@gmail.com> wrote:
>
> Simon,
>
> The MySQLdb module performs this operation with very little difference
> from native (mysql < data.sql) speed, so I'm not concerned about that;
> the delta was <0.5 seconds.
>
> In my original example (I didn't know I could pass a dict to execute()
> to have it fill in the parameters) I built the complete raw SQL by
> calling MySQLdb's escape string on each value, and formatting that SQL
> took slightly over 1 second, and I didn't even try to optimize that
> code.
>
> My bigger concern/curiosity is the overhead that SQLAlchemy Core is
> adding to these inserts, and figuring out why...

SQLAlchemy Core adds a very low amount of overhead when used
correctly.   The one exception to this is when some processing of data
is occurring, which historically has involved coercing Python Unicode
objects to strings, but there are other examples of this as well.
These can be spotted by using Python profiling.
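A minimal sketch of that kind of profiling (the FAQ link below has the
full recipe; names like MyTable and rows are placeholders) - wrap just
the block you want to measure:

    import cProfile, io, pstats

    pr = cProfile.Profile()
    pr.enable()
    # ... run the insert being measured here, e.g.:
    # session.execute(MyTable.__table__.insert(), rows)
    pr.disable()

    s = io.StringIO()
    pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(30)
    print(s.getvalue())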

We know that Core's overhead is low because I profile it like crazy,
and here is what typical results look like for inserts using
executemany:

https://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow

SQLAlchemy Core: Total time for 100000 records 0.21024107933 secs
sqlite3: Total time for 100000 records 0.137335062027 sec

Above, for a test that eliminates network and file overhead by using an
in-memory SQLite database, we can see that INSERTing 100000 rows through
Core takes less than 0.1 seconds longer than using sqlite3 directly.
When using the MySQL driver instead, Core doesn't have any additional
work to do beyond type-specific handling.
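For reference, the Core pattern being measured is a single execute()
call given a list of parameter dictionaries, which becomes one DBAPI
executemany() - a minimal sketch (the schema here is illustrative):

    from sqlalchemy import (
        create_engine, MetaData, Table, Column, Integer, String)

    engine = create_engine("sqlite://")  # in-memory, as in the FAQ test
    metadata = MetaData()
    customer = Table(
        "customer", metadata,
        Column("id", Integer, primary_key=True),
        Column("name", String(255)),
    )
    metadata.create_all(engine)

    # one execute() with a list of dicts -> one cursor.executemany()
    engine.execute(
        customer.insert(),
        [{"name": "customer %d" % i} for i in range(100000)],
    )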

The test suite I linked can be run for any database, so here's a run
using a local MariaDB database with mysqlclient, for 10000 rows like
you are doing (you can do the same thing yourself by running from the
SQLAlchemy source checkout):

$ python -m examples.performance --dburl mysql://scott:tiger@localhost/test bulk_inserts --num 10000
Tests to run: test_flush_no_pk, test_bulk_save_return_pks,
test_flush_pk_given, test_bulk_save, test_bulk_insert_mappings,
test_core_insert, test_dbapi_raw
test_flush_no_pk : Individual INSERT statements via the ORM, calling
upon last row id (10000 iterations); total time 3.738233 sec
test_bulk_save_return_pks : Individual INSERT statements in "bulk",
but calling upon last row id (10000 iterations); total time 3.571442
sec
test_flush_pk_given : Batched INSERT statements via the ORM, PKs
already defined (10000 iterations); total time 0.966315 sec
test_bulk_save : Batched INSERT statements via the ORM in "bulk",
discarding PKs. (10000 iterations); total time 0.496674 sec
test_bulk_insert_mappings : Batched INSERT statements via the ORM
"bulk", using dictionaries. (10000 iterations); total time 0.283908
sec
test_core_insert : A single Core INSERT construct inserting mappings
in bulk. (10000 iterations); total time 0.220689 sec
test_dbapi_raw : The DBAPI's API inserting rows in bulk. (10000
iterations); total time 0.189292 sec

Pretty much the same result: the Core insert is 0.22 seconds for 10000
rows vs. 0.19 seconds with the straight DBAPI.   Here it is over a
network connection to a MySQL 8.0 database:

$ python -m examples.performance --dburl mysql://scott:tiger@mysql80/test bulk_inserts --num 10000
Tests to run: test_flush_no_pk, test_bulk_save_return_pks,
test_flush_pk_given, test_bulk_save, test_bulk_insert_mappings,
test_core_insert, test_dbapi_raw
test_flush_no_pk : Individual INSERT statements via the ORM, calling
upon last row id (10000 iterations); total time 30.921060 sec
test_bulk_save_return_pks : Individual INSERT statements in "bulk",
but calling upon last row id (10000 iterations); total time 30.468841
sec
test_flush_pk_given : Batched INSERT statements via the ORM, PKs
already defined (10000 iterations); total time 1.618901 sec
test_bulk_save : Batched INSERT statements via the ORM in "bulk",
discarding PKs. (10000 iterations); total time 1.136148 sec
test_bulk_insert_mappings : Batched INSERT statements via the ORM
"bulk", using dictionaries. (10000 iterations); total time 0.878011
sec
test_core_insert : A single Core INSERT construct inserting mappings
in bulk. (10000 iterations); total time 0.871134 sec
test_dbapi_raw : The DBAPI's API inserting rows in bulk. (10000
iterations); total time 0.870091 sec

So we can note that two of those runs took 30 seconds - those are both
ORM runs where the primary key of the row was not provided, which means
the ORM has to run each row as an individual execute() and fetch the
generated primary key back.   The rest of the runs use executemany(),
where the ORM times drop into the 0.9-1.6 second range, and back at the
Core level there is virtually no difference in speed between Core and
the straight DBAPI.
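To make that concrete at the DBAPI level, here are the two patterns in
a self-contained sketch (sqlite3 just so it runs anywhere; the table is
made up):

    import sqlite3, time

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, a TEXT)")
    rows = [("row %d" % i,) for i in range(10000)]
    cur = conn.cursor()

    # the 30-second pattern: one execute() per row, fetching the new PK
    start = time.time()
    for row in rows:
        cur.execute("INSERT INTO target (a) VALUES (?)", row)
        pk = cur.lastrowid   # the per-row round trip the ORM needs
    print("per-row execute: %.3f sec" % (time.time() - start))

    # the fast pattern: a single executemany(), no per-row PK retrieval
    start = time.time()
    cur.executemany("INSERT INTO target (a) VALUES (?)", rows)
    print("executemany:     %.3f sec" % (time.time() - start))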


So if you want to figure out what the issue is on your end, here are
things you can do:

1. make sure you are using the C extensions (I don't recall if you
said you were on linux, or windows, or what) - see the quick check
after this list

2. run the above performance suite on your database

3. use the profiling recipe against your specific code:
https://docs.sqlalchemy.org/en/latest/faq/performance.html#code-profiling

4. provide *runnable* examples here, e.g. an MCVE, which illustrates
*exactly* what is being called, and how, and with what data, datatypes,
schema, etc.   Specifics matter.   If the rows here have 5 columns vs.
100, that matters.  If the rows have large binary objects, unicode
strings, dates, or precision decimals, that all matters.   But really,
the profiling you get from #3 will show where the time is being spent.
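For #1, a quick check - to my knowledge the C extensions build as the
cprocessors / cresultproxy modules, so if this import fails you are on
the pure-Python code paths:

    try:
        from sqlalchemy import cprocessors, cresultproxy  # noqa
        print("C extensions are in use")
    except ImportError:
        print("C extensions not compiled - pure-Python fallbacks in use")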

> One might say I'm giving this too much attention because I have a
> working, acceptable and optimal solution, but that solution sidesteps
> the issue and really bypasses sqlalchemy as a whole.
>
> In the end, I'm pretty sure it's probably related to the size of my
> dataset; but the time deltas between doing things manually vs. letting
> sqlalchemy do its job are just too great to go unspoken about.
>
>
> On Friday, October 5, 2018 at 4:27:11 AM UTC-5, Simon King wrote:
>>
>> MySQLdb builds a new SQL string by escaping each of the parameters and
>> then substituting them into the original SQL:
>>
>> https://github.com/farcepest/MySQLdb1/blob/master/MySQLdb/cursors.py#L182
>>
>> For 89 columns * 10,550 rows, that's nearly 1 million parameters. Each
>> one involves a call to "db.literal(item)", which in turn calls
>> "self.escape(item, self.encoders)". It looks like self.escape is
>> implemented in C.
>>
>> This means that you've got at least 2 million function calls (which
>> are relatively slow in Python) just to build the SQL string to send to
>> the server.
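>>
>> As a rough back-of-envelope - a stand-alone sketch, not MySQLdb's
>> actual escape (which is in C); the point is just the per-parameter
>> Python call overhead:
>>
>>     import time
>>
>>     def literal(item):   # stand-in for db.literal() / self.escape()
>>         return "'%s'" % str(item).replace("'", "''")
>>
>>     params = list(range(89 * 10550))       # ~939,000 parameters
>>     start = time.time()
>>     chunks = [literal(p) for p in params]  # one Python-level call each
>>     print("%d calls in %.2f sec" % (len(params), time.time() - start))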
>>
>> Simon
>> On Fri, Oct 5, 2018 at 12:01 AM <adam....@gmail.com> wrote:
>> >
>> > Mike,
>> >
>> > Er, yeah, not TSQL, my old MSSQL days are showing :D I meant a str() of SQL
>> >
>> > I'm using "bindparams" to mean the params in the input format
>> > required to bind into the SQL in the optimal way; for example, in
>> > execute(__table__.insert(), bindparams), bindparams would be a list
>> > of dicts.
>> >
>> > I did a packet capture of execute(__table__.insert(), list(dict()))
>> > and it was doing the executemany-style calls (insert ... (a, b)
>> > values (c, d), (e, f)), so there was no hangup there.
>> > I also re-coded this to call MySQLdb and run executemany(str,
>> > [tuple()]); it takes about the same amount of time (~8s) as
>> > execute(__table__.insert(), list(dict())), so there must be
>> > something suboptimal going on in the DBAPI, as I can generate the
>> > proper SQL and insert it in ~3 seconds with a normal execute(str).
>> >
>> > I tried the test case with PyMySQL with just the execute(str) case
>> > and it took about 2 seconds, so I didn't give it much more
>> > attention; I'll try it again with an executemany though.
>> > 2018-10-04 17:47:02,373 - INFO - Executing for 10,550 inserts, 89 columns, 
>> > approx 6,252,460 bytes of raw data...
>> > 2018-10-04 17:47:02,373 - INFO - Generating datasets
>> > 2018-10-04 17:47:02,550 - INFO - Executing with 
>> > raw_connection().executemany(str, [tuple(a, b, ...)]
>> > 2018-10-04 17:47:10,061 - INFO - Committing
>> >
>> >             log.info('Generating datasets')
>> >             sql_str = 'INSERT IGNORE INTO table ({}) VALUES({})'.format(
>> >                 ','.join(data[0].keys()),
>> >                 ','.join(['%s'] * len(data[0].keys())))
>> >             sql_data = [tuple(part.values()) for part in data]
>> >             log.info('Executing with raw_connection().executemany(str, [tuple(a, b, ...)]')
>> >             conn = sql.engine.raw_connection()
>> >             cur = conn.cursor()
>> >             cur.executemany(sql_str, sql_data)
>> >             log.info('Committing')
>> >             conn.commit()
>> >
>> >
>> > I still don't get the difference with sqlalchemy's
>> > execute(__table__.insert().values(dict())) though, and why it would
>> > be so much slower than execute(__table__.insert(), dict()), nor why
>> > execute(str, dict()) is that slow as well.
>> >
>> >
>> >
>> > On Thursday, October 4, 2018 at 4:47:31 PM UTC-5, Mike Bayer wrote:
>> >>
>> >> On Thu, Oct 4, 2018 at 4:04 PM <adam....@gmail.com> wrote:
>> >> >
>> >> > Mike,
>> >> >
>> >> > No, I'm not calling execute 10k times, but I am doing what they do in 
>> >> > your example with test_core_insert
>> >> >
>> >> > My code for that is:
>> >> >             log.info('Executing sqlalchemy insert() with bindparams as execute param - session.execute(__table__.insert(), bindparams)')
>> >> >             sql.session.execute(models.target.__table__.insert().prefix_with('IGNORE'), data)
>> >> >             log.info('Committing')
>> >> >             sql.session.commit()
>> >> >
>> >> > where data = [OrderedDict(), OrderedDict(), ...] (I tried
>> >> > converting to a regular [dict(), dict(), ...] too, with no
>> >> > gains); this method is the fastest outside of direct or
>> >> > near-direct DBAPI calls, at about 8 seconds.
>> >> >
>> >> > However, I read somewhere (can't find the source again to cite,
>> >> > but I thought it was the sqlalchemy docs; I may just not be
>> >> > remembering correctly) that this query chains together individual
>> >> > INSERT statements instead of doing a multi-insert,
>> >> > i.e. insert into target (a, b, ...) values (c, d, ...); insert
>> >> > into target (a, b, ...) values (e, f, ...); ... instead of
>> >> > insert into target (a, b, ...) values (c, d, ...), (e, f, ...),
>> >> > where the latter is the most optimal TSQL
>> >>
>> >> I'm not sure what "TSQL" means in this context.
>> >>
>> >> SQLAlchemy uses cursor.executemany() to invoke the statement just
>> >> once.  The DBAPI decides how it wants to batch it.    Regardless, it
>> >> should move very fast.   On the Postgresql side, there's more of a
>> >> big speed difference between the way psycopg2 executes a statement
>> >> normally and some special modes it provides that allow it to chain
>> >> together multiple statements in one batch.   But that's still behind
>> >> the executemany() call.
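>> >>
>> >> For what it's worth, one of those psycopg2 modes is
>> >> psycopg2.extras.execute_values(), which pages the rows into
>> >> multi-row VALUES statements - a sketch, assuming an already-open
>> >> cursor "cur" and a list of tuples "rows":
>> >>
>> >>     import psycopg2.extras
>> >>
>> >>     psycopg2.extras.execute_values(
>> >>         cur,
>> >>         "INSERT INTO target (a, b) VALUES %s",
>> >>         rows,              # list of tuples
>> >>         page_size=1000,    # rows per generated statement
>> >>     )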
>> >>
>> >>
>> >> >
>> >> > If I'm wrong there and it does do a multi-insert, what is the 
>> >> > difference between execute(__table__.insert().values(bindparams)) and 
>> >> > execute(__table__.insert(), bindparams), and why would the former be 
>> >> > ~26 seconds slower in this use case?
>> >>
>> >> What is "bindparams" here, that is a list of many tuples ?    If so
>> >> then yes, the DBAPI is not batching efficiently in the latter case.
>> >> Try PyMySQL or mysqlclient.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Thursday, October 4, 2018 at 2:05:59 PM UTC-5, Mike Bayer wrote:
>> >> >>
>> >> >> On Thu, Oct 4, 2018 at 2:45 PM <adam....@gmail.com> wrote:
>> >> >> >
>> >> >> > Hi all,
>> >> >> >
>> >> >> > Having some very major performance issues using sqlalchemy to
>> >> >> > insert data; I've been trying many different ways but none seem
>> >> >> > to hold even a close candle to raw MySQLdb connections
>> >> >> >
>> >> >> > Versions:
>> >> >> > sqlalchemy: 1.2.9
>> >> >> > MySQLdb: 1.3.13 (bound with mysql+mysqldb connector)
>> >> >> > MySQL: 14.14
>> >> >> > Python: 3.6.5
>> >> >> > Ubuntu 18.04 x86_64
>> >> >> >
>> >> >> > Server:
>> >> >> > 8GB ram
>> >> >> > 2 x Intel(R) Xeon(R) CPU X5650  @ 2.67GHz threads running in VMWare 
>> >> >> > ESXi
>> >> >> >
>> >> >> > Connected to localhost
>> >> >> >
>> >> >> > Base data info:
>> >> >> > 2018-10-04 13:19:56,519 - INFO - Executing for 10,550 inserts, 89 
>> >> >> > columns, approx 6,252,460 bytes of raw data...
>> >> >> >
>> >> >> > Pre-formatted session execute ~2 seconds:
>> >> >> > 2018-10-04 13:19:56,520 - INFO - Building raw TSQL including all 
>> >> >> > escaped value()s - insert into x (a,b, ...) values 
>> >> >> > (escape_string(c), escape_string(d), ..) (escape_string(e), 
>> >> >> > escape_string(f), ...)
>> >> >> > 2018-10-04 13:19:57,640 - INFO - Executing raw TSQL (9,218,872 
>> >> >> > bytes)  - session.execute(tsql)
>> >> >> > 2018-10-04 13:19:59,435 - INFO - Committing
>> >> >> >
>> >> >> > Raw MySQLdb API ~1 second:
>> >> >> > 2018-10-04 13:19:59,436 - INFO - Executing same raw TSQL directly 
>> >> >> > via MySQLdb API - engine.raw_connection(); cursor.execute(tsql)
>> >> >> > 2018-10-04 13:20:00,326 - INFO - Committing
>> >> >> >
>> >> >> > Bindparam formatted TSQL converted to text() executed with 
>> >> >> > bindparams as a dict ~12 seconds to build text() and ~19 seconds to 
>> >> >> > execute, ~31 seconds total:
>> >> >> > 2018-10-04 13:20:00,326 - INFO - Building raw TSQL - insert into x 
>> >> >> > (a, b, ...) values (:a1, :b1, ..), (:a2, :b2, ...)
>> >> >> > 2018-10-04 13:20:01,017 - INFO - Rebuilding Input data for TSQL 
>> >> >> > bindparams - {a1: d, b1: e, a2: f, b2: g}
>> >> >> > 2018-10-04 13:20:02,234 - INFO - Building text() object - 
>> >> >> > sql.text(tsql)
>> >> >> > 2018-10-04 13:20:14,259 - INFO - Executing text() with bindparams - 
>> >> >> > session.execute(text, bindparams)
>> >> >> > 2018-10-04 13:20:33,552 - INFO - Committing
>> >> >> >
>> >> >> > Bindparam formatted TSQL NOT converted to text, executed with 
>> >> >> > bindparams as a dict: ~33 seconds
>> >> >> > 2018-10-04 13:20:33,552 - INFO - Executing raw TSQL with bindparams 
>> >> >> > - session.execute(tsql, bindparams)
>> >> >> > 2018-10-04 13:21:06,307 - INFO - Committing
>> >> >> >
>> >> >> > Using a sqlalchemy class table insert() with using .values() ~34 
>> >> >> > seconds:
>> >> >> > 2018-10-04 13:21:06,311 - INFO - Executing sqlalchemy insert w/ 
>> >> >> > bindparams as values - 
>> >> >> > session.execute(__table__.insert().values(bindparams))
>> >> >> > 2018-10-04 13:21:40,808 - INFO - Committing
>> >> >> >
>> >> >> > Using a sqlalchemy class table insert() with passing bindparams to 
>> >> >> > session.execute() ~8 seconds:
>> >> >> > 2018-10-04 13:21:40,809 - INFO - Executing sqlalchemy insert() with 
>> >> >> > bindparams as execute param - session.execute(__table__.insert(), 
>> >> >> > bindparams)
>> >> >> > 2018-10-04 13:21:48,084 - INFO - Committing
>> >> >> >
>> >> >> > Obviously the raw DB API will be the fastest as there's little
>> >> >> > to no processing or overhead; I would think session.execute(str)
>> >> >> > would come closer to matching that, but it's not too far off.
>> >> >>
>> >> >> A Core insert should be only slightly slower than the raw DBAPI;
>> >> >> however, if you're inserting lots of rows, you need to be using
>> >> >> executemany style, that is, a single execute() call with a list of
>> >> >> parameters.   There are no code examples here, so I don't know what
>> >> >> you're actually doing; if you are running session.execute() itself 10K
>> >> >> times, that's the wrong way to do it.  Check out
>> >> >> https://docs.sqlalchemy.org/en/latest/_modules/examples/performance/bulk_inserts.html
>> >> >> for usage examples of how to bulk-insert with different APIs, including
>> >> >> Core (test_core_insert specifically).     Additionally, I don't know
>> >> >> whether you are using bound parameters with your raw MySQL code, and
>> >> >> if so, whether there is some unicode processing going on in the
>> >> >> SQLAlchemy version that might be slowing things down.
>> >> >>
>> >> >> A complete code example like those you see in
>> >> >> https://docs.sqlalchemy.org/en/latest/_modules/examples/performance/bulk_inserts.html
>> >> >> will allow me to see what you are doing.
>> >> >>
>> >> >>
>> >> >> >
>> >> >> > Having sqlalchemy.text(str) take 12 seconds seems rather
>> >> >> > excessive, but I'm assuming that's also the hang with the
>> >> >> > execute(__table__.insert().values(bindparams)) statement.
>> >> >> > However, why would execute(str, bindparams) do this as well?
>> >> >> > Does it go through text() too if arg0 is a str?
>> >> >> >
>> >> >> > Now the oddest part of all, doing 10,550 individual inserts in the 
>> >> >> > last example is roughly 26 seconds faster than a single transaction 
>> >> >> > with 10,550 sets of values. This just does not make transactional 
>> >> >> > sense.
>> >> >> >
>> >> >> > Is there something I can do better while utilizing Core, or am
>> >> >> > I expecting too much of how sqlalchemy performs with datasets
>> >> >> > like this?
>> >> >> >
>> >> >> >
>> >> >> > Hopefully this is enough detail, but I can provide more upon request.
>> >> >> >
>> >> >> > Thank you!