On 10/21/15 4:52 PM, Uri Okrent wrote:
> Hello, I'm trying to understand the various performance implications of
> working with the ORM layer, vs the SQL language layer, vs a raw dbapi
> connection.  Ideally, I would like to stick with the ORM layer
> 
> First, the set up:
> CentOS 6
> SQLAlchemy 0.9.4
> Psycopg2 2.5.3
> Postgresql 9.3.9
> 
> My test is a simple query including a join on three columns, guid, name,
> and index -- guid and name are Text and index is an Integer.  Note, in
> all cases, I'm querying for individual columns, not ORM objects.
> 
> connection setup:
> >>> engine = sqlalchemy.engine.create_engine("postgresql:///xmsdb", echo=True)
> >>> Session = sqlalchemy.orm.sessionmaker(bind=engine)
> >>> session = Session()
> 
> 
> the query:
> 
> >>> q = session.query(Volume.guid, Volume.name, Volume.index)
> 
> 
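[The `profiled()` helper used in the transcripts below isn't shown anywhere in the thread; a minimal stdlib sketch along these lines, written for modern Python, produces similar cProfile output:]

```python
# Hypothetical reconstruction of the profiled() context manager used in
# the transcripts (the original is not shown in the thread): profile the
# enclosed block and print stats ordered by cumulative time.
import cProfile
import io
import pstats
from contextlib import contextmanager

@contextmanager
def profiled():
    pr = cProfile.Profile()
    pr.enable()
    try:
        yield
    finally:
        pr.disable()
        buf = io.StringIO()
        pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats()
        print(buf.getvalue())

# usage
with profiled():
    sum(range(1000))
```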
> First, query using the ORM layer:
> 
> >>> with profiled():
> ...     r = q.all()
> ...
> 2015-10-21 20:24:20,539 INFO sqlalchemy.engine.base.Engine SELECT
> volumes.guid AS volumes_guid, mom_objects.name AS mom_objects_name,
> mom_objects.index AS mom_objects_index
> FROM mom_objects JOIN volumes ON mom_objects.guid = volumes.guid
> 2015-10-21 20:24:20,539 INFO sqlalchemy.engine.base.Engine {}
> 
> 147901 function calls (147890 primitive calls) in 0.396 CPU seconds
> 
> Ordered by: cumulative time
> 
> ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>      1    0.007    0.007    0.396    0.396 /usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py:2286(all)
>   8191    0.055    0.000    0.357    0.000 /usr/lib64/python2.6/site-packages/sqlalchemy/orm/loading.py:26(instances)
>  24570    0.060    0.000    0.123    0.000 /usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py:3468(proc)
>      1    0.000    0.000    0.093    0.093 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/result.py:778(fetchall)
>      1    0.000    0.000    0.089    0.089 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/result.py:747(_fetchall_impl)
>      1    0.028    0.028    0.089    0.089 {method 'fetchall' of 'psycopg2._psycopg.cursor' objects}
>   8190    0.043    0.000    0.085    0.000 /usr/lib64/python2.6/site-packages/sqlalchemy/util/_collections.py:56(__new__)
>  24579    0.045    0.000    0.063    0.000 /usr/lib64/python2.6/site-packages/sqlalchemy/sql/annotation.py:85(__hash__)
>  16380    0.033    0.000    0.061    0.000 /usr/lib64/python2.6/encodings/utf_8.py:15(decode)
>      1    0.000    0.000    0.032    0.032 /usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py:2399(__iter__)
>      1    0.000    0.000    0.031    0.031 /usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py:2413(_execute_and_instances)
>      1    0.000    0.000    0.031    0.031 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py:652(execute)
>      1    0.000    0.000    0.031    0.031 /usr/lib64/python2.6/site-packages/sqlalchemy/sql/elements.py:316(_execute_on_connection)
>      1    0.000    0.000    0.031    0.031 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py:781(_execute_clauseelement)
>      1    0.000    0.000    0.030    0.030 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py:868(_execute_context)
>      1    0.000    0.000    0.030    0.030 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/default.py:434(do_execute)
>      1    0.030    0.030    0.030    0.030 {method 'execute' of 'psycopg2._psycopg.cursor' objects}
>  16380    0.028    0.000    0.028    0.000 {_codecs.utf_8_decode}
>  24579    0.018    0.000    0.018    0.000 {hash}
>   8195    0.016    0.000    0.016    0.000 {method 'update' of 'dict' objects}
>   8191    0.015    0.000    0.015    0.000 {zip}
> 
> 
> 
> The same query executed directly on the engine:
> 
> >>> with profiled():
> ...     r = engine.execute(str(q)).fetchall()
> ...
> 2015-10-21 20:29:32,336 INFO sqlalchemy.engine.base.Engine SELECT
> volumes.guid AS volumes_guid, mom_objects.name AS mom_objects_name,
> mom_objects.index AS mom_objects_index
> FROM mom_objects JOIN volumes ON mom_objects.guid = volumes.guid
> 2015-10-21 20:29:32,337 INFO sqlalchemy.engine.base.Engine {}
> 
> 33314 function calls (33303 primitive calls) in 0.116 CPU seconds
> 
> Ordered by: cumulative time
> 
> ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>      1    0.000    0.000    0.092    0.092 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/result.py:778(fetchall)
>      1    0.000    0.000    0.087    0.087 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/result.py:747(_fetchall_impl)
>      1    0.028    0.028    0.087    0.087 {method 'fetchall' of 'psycopg2._psycopg.cursor' objects}
>  16380    0.032    0.000    0.059    0.000 /usr/lib64/python2.6/encodings/utf_8.py:15(decode)
>  16380    0.027    0.000    0.027    0.000 {_codecs.utf_8_decode}
>      1    0.000    0.000    0.023    0.023 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py:1665(execute)
>      1    0.000    0.000    0.023    0.023 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py:652(execute)
>      1    0.000    0.000    0.023    0.023 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py:846(_execute_text)
>      1    0.000    0.000    0.023    0.023 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py:868(_execute_context)
>      1    0.000    0.000    0.022    0.022 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/default.py:434(do_execute)
>      1    0.022    0.022    0.022    0.022 {method 'execute' of 'psycopg2._psycopg.cursor' objects}
>      1    0.004    0.004    0.004    0.004 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/result.py:762(process_rows)
>      1    0.000    0.000    0.001    0.001 /usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py:2923(__str__)
>      1    0.000    0.000    0.001    0.001 /usr/lib64/python2.6/site-packages/sqlalchemy/sql/elements.py:476(__str__)
>      1    0.000    0.000    0.001    0.001 <string>:1(<lambda>)
> 
> 
> 
> So, question #1:
> 
> Given that in both cases I'm querying columns and essentially just
> getting rows as a result, why is going through the ORM layer almost 4x
> slower?

Start with the explanation at
http://docs.sqlalchemy.org/en/rel_1_0/faq/performance.html#result-fetching-slowness-orm
and then work into the performance example suite which should give more
of an idea.  Also 1.1 should improve upon these numbers a slight bit, as
does virtually every major SQLAlchemy version (see
http://techspot.zzzeek.org/2010/12/12/a-tale-of-three-profiles/ for much
older but more in-depth history on this ongoing subject).
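One middle ground worth knowing about (not from the thread itself): keep building the query with the ORM, but execute its underlying Core statement directly, so rows come back as plain tuples and skip the ORM's per-row `instances()` processing seen in the profile above. A self-contained sketch, with a hypothetical `Volume` mapping and SQLite in place of Postgres for portability, assuming a modern SQLAlchemy:

```python
# Sketch: build the query via the ORM, fetch via Core.  The Volume model
# here is a stand-in for the thread's schema, using SQLite in memory.
from sqlalchemy import Column, Integer, Text, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Volume(Base):
    __tablename__ = "volumes"
    guid = Column(Text, primary_key=True)
    name = Column(Text)
    index = Column(Integer)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

session = Session(engine)
session.add(Volume(guid="g1", name="vol1", index=1))
session.commit()

q = session.query(Volume.guid, Volume.name, Volume.index)

# ORM path: rows pass through the ORM's per-row processing
orm_rows = q.all()

# Core path: same statement, plain row fetch, no ORM row handling
core_rows = session.connection().execute(q.statement).fetchall()

print(orm_rows, core_rows)
```

For a columns-only query like this one the two paths return equivalent rows, so the choice is purely about overhead.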

> 
> 
> 
> Next, just for sanity's sake I compared the query done on the raw dbapi
> connection as obtained via sqlalchemy's engine, and on a new psycopg2
> connection (which should theoretically be exactly the same):

Except, as you noticed, they are not; there are lots of options that can
be enabled or disabled on DBAPI connections.

> 
> >>> conn = engine.connect()
> >>> raw_conn = conn.connection
> >>> curs = raw_conn.cursor()
> >>> with profiled():
> ...     curs.execute(str(q))
> ...     r = curs.fetchall()
> ...
> 33071 function calls (33060 primitive calls) in 0.113 CPU seconds
> 
> Ordered by: cumulative time
> 
> ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>      1    0.029    0.029    0.091    0.091 {method 'fetchall' of 'psycopg2._psycopg.cursor' objects}
>  16380    0.034    0.000    0.062    0.000 /usr/lib64/python2.6/encodings/utf_8.py:15(decode)
>  16380    0.028    0.000    0.028    0.000 {_codecs.utf_8_decode}
>      1    0.020    0.020    0.020    0.020 {method 'execute' of 'psycopg2._psycopg.cursor' objects}
>      1    0.000    0.000    0.002    0.002 /usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py:2923(__str__)
>      1    0.000    0.000    0.001    0.001 /usr/lib64/python2.6/site-packages/sqlalchemy/sql/elements.py:476(__str__)
>      1    0.000    0.000    0.001    0.001 <string>:1(<lambda>)
> 
> 
> >>> import psycopg2
> >>> conn = psycopg2.connect("dbname=xmsdb")
> >>> curs = conn.cursor()
> >>> with profiled():
> ...     curs.execute(str(q))
> ...     r = curs.fetchall()
> ...
> 311 function calls (300 primitive calls) in 0.029 CPU seconds
> 
> Ordered by: cumulative time
> 
> ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>      1    0.024    0.024    0.024    0.024 {method 'execute' of 'psycopg2._psycopg.cursor' objects}
>      1    0.003    0.003    0.003    0.003 {method 'fetchall' of 'psycopg2._psycopg.cursor' objects}
>      1    0.000    0.000    0.001    0.001 /usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py:2923(__str__)
>      1    0.000    0.000    0.001    0.001 /usr/lib64/python2.6/site-packages/sqlalchemy/sql/elements.py:476(__str__)
>      1    0.000    0.000    0.001    0.001 <string>:1(<lambda>)
> 
> 
> These should theoretically be exactly the same, but it looks like the
> sqlalchemy version is spending some time doing utf-8 decoding, which the
> psycopg2 connection doesn't. 

SQLAlchemy uses this subsystem of psycopg2 by default:

http://initd.org/psycopg/docs/usage.html#unicode-handling


> It looks like using the default connection
> parameters, the psycopg2 connection just returns byte strings while the
> sqlalchemy connection decodes them too.
> 
> 
> I was able to make the call via sqlalchemy match the psycopg2 call
> time-wise, by modifying the connection parameters:
> 
> >>> engine = sqlalchemy.engine.create_engine("postgresql:///xmsdb", echo=True,
> ...     use_native_unicode=False, client_encoding='utf8')
> 
> 
> That caused the query via sqlalchemy to return byte strings as well (and
> also reduced the query time to ~30ms).
> 
> 
> So, question #2, given the performance impact of decoding all strings,
> what are the possible implications of simply leaving all strings as byte
> strings?  I do want my software to support the full utf-8 character set.
> 
> 
> I'm also curious about the apparent difference in the default behavior
> (to decode vs. not decode).  The database itself does hold unicode
> strings -- for example I changed the name field of one of the items to
> u'volßßßß'.  The response via the two different connections was:

You must consider the question of scale.  If a large SELECT with the ORM
takes 5 seconds, a SELECT with Core takes 1.8 seconds, a SELECT with raw
psycopg2 with unicode handling takes .9 seconds, and a SELECT with raw
psycopg2 *without* unicode handling takes .75598 seconds, that strongly
suggests that turning off the unicode handling, especially given that
your app *wants* unicode, is not where the primary performance gain
lies.  If your app spends lots of time doing various things with
database result sets, the .15 sec difference is drowned out.

This is the reason the performance suite is there.  I ran a few tests
with large_resultsets and it seems that at a Core / native level,
psycopg2's native unicode is a little pricey.  Making a few
modifications to the suite to add an engine with use_native_unicode=False
and running with 100000 rows:


python -m examples.performance --dburl
postgresql://scott:tiger@localhost/test --num 100000 large_resultsets

test_core_fetchall : Load Core result rows using fetchall. (100000
iterations); total time 0.377719 sec
test_core_fetchall_no_unicode : Load Core result rows using fetchall,
don't use unicode. (100000 iterations); total time 0.171641 sec

test_dbapi_fetchall_no_object : Load rows using DBAPI fetchall(), don't
make any objects. (100000 iterations); total time 0.322190 sec
test_dbapi_fetchall_no_object_no_unicode : Load rows using DBAPI
fetchall(), don't make any objects, don't use unicode. (100000
iterations); total time 0.102264 sec


If, on the other hand, we use use_native_unicode=False together with the
Unicode type for the columns we care about, things get a lot faster,
e.g. changing the mapping to:

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(Unicode(255))
    description = Column(Unicode(255))

This will only take effect with ORM and Core loads, not raw psycopg2.
The core load of 100K rows now takes:

test_core_fetchall : Load Core result rows using fetchall. (100000
iterations); total time 0.182480 sec

this is better than the 0.377719 sec that psycopg2's native unicode
handling took.

When we use the Unicode type and set use_native_unicode=False,
SQLAlchemy's own use of Python's unicode codecs takes over; this runs
within the C extensions, and the way the codec is invoked has been
highly optimized for speed.  We've already observed that cx_Oracle's
native unicode handling is much slower than SQLAlchemy's, which was
unexpected since cx_Oracle is pure C code; psycopg2 is also pure C
code, so it is again surprising that SQLAlchemy's approach is a little
faster.


Then again, this is Python.  If you're looking to save milliseconds on
enormous amounts of data, you'd at the very least be using PyPy, or a
compiled language like Go.

> 
> (u'0984e1b061b248b3b4705b198cadc1eb', u'vol\xdf\xdf\xdf\xdf', 487)
> 
> vs
> 
> ('0984e1b061b248b3b4705b198cadc1eb', 'vol\xc3\x9f\xc3\x9f\xc3\x9f\xc3\x9f', 487)
> 
> I.e., one a unicode string and the other a utf-8 encoded bytestring.
> The one major implication is that I can't compare the two without first
> bringing them into a common representation.  Is it worthwhile to defer
> decoding until I actually need to do a comparison, or to just pass the
> bytestring on to the client and let it deal with decoding?

I would never deal with bytestrings that represent text within
application logic.  They should be turned into Unicode as early as
possible and back into bytestrings as late as possible.
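A tiny illustration using the two values from the thread: the bytestring and the unicode string are the same text in different forms, and decoding once at the boundary makes comparisons work everywhere:

```python
# The two name values returned via the two connections above:
raw = b"vol\xc3\x9f\xc3\x9f\xc3\x9f\xc3\x9f"  # raw UTF-8 bytes from the DBAPI
decoded = u"vol\xdf\xdf\xdf\xdf"              # unicode, as SQLAlchemy returns it

# Comparing them directly fails even though they represent the same text;
# decode at the boundary and the problem disappears.
assert raw != decoded
assert raw.decode("utf-8") == decoded
```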


> 
> 
> 
> That's a lot to digest...  Thanks in advance for the help, and for a
> great library!
> 
> 
> Uri
> 
> -- 
> You received this message because you are subscribed to the Google
> Groups "sqlalchemy" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to sqlalchemy+unsubscr...@googlegroups.com
> <mailto:sqlalchemy+unsubscr...@googlegroups.com>.
> To post to this group, send email to sqlalchemy@googlegroups.com
> <mailto:sqlalchemy@googlegroups.com>.
> Visit this group at http://groups.google.com/group/sqlalchemy.
> For more options, visit https://groups.google.com/d/optout.
