[sqlalchemy] Re: executemany + postgresql
On Fri, Nov 6, 2009 at 9:57 AM, Michael Bayer mike...@zzzcomputing.com wrote:

>> Before I even posted I resorted to strace. strace immediately confirmed my suspicion: when using psycopg2 I don't see one big fat INSERT with lots of binds, I see one INSERT per bind, and it's this that is ultimately killing the performance. You can easily observe this via strace: as I'm sure you know, the communication between the test program and postgresql takes place across a socket (unix domain or tcp/ip). For every single set of bind params, the result is essentially one sendto (INSERT INTO ...) and rt_sigprocmask, a poll, and then a recvfrom and rt_sigprocmask pair. Profiling at the C level shows that sendto accounts for *35%* of the total runtime and recvfrom a healthy 15%. It's this enormous overhead for every single set of bind params that's killing the performance.
>
> have you asked about this on the psycopg2 mailing list? it's at http://mail.python.org/mailman/listinfo/python-list . Let me know if you do, because I'll get out the popcorn... :)

That's the python list. Anyway, I did some more testing. executemany performance is not any better than looping over execute, because that's all that executemany appears to do in any case. However, I manually built a big fat set of bind params (bypassing sqlalchemy directly) and got a SUBSTANTIAL performance improvement. PostgreSQL as of 8.2 supports /sets/ of bind params; it'd be nice if pg8000 or psycopg2 (or both) supported that. Building 25000 bind params by hand is not fun, but it got me to just shy of 50K inserts/second.

> We also support the pg8000 DBAPI in 0.6. I doubt it's doing something differently here but feel free to connect with postgresql+pg8000:// and see what you get.

I tried pg8000 but I got an error:

...
  return self.dbapi.connect(*cargs, **cparams)
sqlalchemy.exc.DBAPIError: (TypeError) connect() takes at least 1 non-keyword argument (0 given) None None

-- 
Jon

You received this message because you are subscribed to the Google Groups "sqlalchemy" group. To post to this group, send email to sqlalchemy@googlegroups.com. To unsubscribe from this group, send email to sqlalchemy+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/sqlalchemy?hl=en
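The "one big fat set of bind params" approach described above can be sketched as a small helper that renders a single multi-row VALUES statement with psycopg2-style %s placeholders. The table and column names here are made up for illustration; an actual run would pass the result to `cursor.execute(sql, params)` against PostgreSQL 8.2+.

```python
def build_multirow_insert(table, columns, rows):
    """Render one INSERT ... VALUES (%s, ...), (%s, ...) statement plus a
    flat parameter list, instead of issuing one INSERT per row."""
    placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO %s (%s) VALUES %s" % (
        table,
        ", ".join(columns),
        ", ".join([placeholder] * len(rows)),
    )
    # Flatten the rows into the single parameter list the statement expects.
    params = [value for row in rows for value in row]
    return sql, params

sql, params = build_multirow_insert(
    "items", ("name", "qty"),
    [("apple", 1), ("pear", 2), ("plum", 3)],
)
# One statement, one round trip, three rows:
# INSERT INTO items (name, qty) VALUES (%s, %s), (%s, %s), (%s, %s)
```

This is only string assembly; the server-side benefit is that the whole batch travels in one protocol exchange instead of one per row.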
[sqlalchemy] Re: executemany + postgresql
On Nov 7, 2009, at 12:53 PM, Jon Nelson wrote:

>> have you asked about this on the psycopg2 mailing list? it's at http://mail.python.org/mailman/listinfo/python-list . Let me know if you do, because I'll get out the popcorn... :)
>
> That's the python list.

oops: http://lists.initd.org/mailman/listinfo/psycopg

> I tried pg8000 but I got an error:
>
> ...
>   return self.dbapi.connect(*cargs, **cparams)
> sqlalchemy.exc.DBAPIError: (TypeError) connect() takes at least 1 non-keyword argument (0 given) None None

i can't reproduce that. this is with the latest trunk:

from sqlalchemy import *
e = create_engine('postgresql+pg8000://scott:ti...@localhost/test')
print e.execute("select 1").fetchall()

produces:

[(1,)]
[sqlalchemy] Re: executemany + postgresql
On Sat, Nov 7, 2009 at 11:58 AM, Michael Bayer mike...@zzzcomputing.com wrote:

> On Nov 7, 2009, at 12:53 PM, Jon Nelson wrote:
>
>> I tried pg8000 but I got an error:
>>
>> ...
>>   return self.dbapi.connect(*cargs, **cparams)
>> sqlalchemy.exc.DBAPIError: (TypeError) connect() takes at least 1 non-keyword argument (0 given) None None
>
> i can't reproduce that. this is with the latest trunk:
>
> from sqlalchemy import *
> e = create_engine('postgresql+pg8000://scott:ti...@localhost/test')
> print e.execute("select 1").fetchall()
>
> produces:
>
> [(1,)]

Apparently, pg8000 requires host, user and pass (or at least one of those). Of course, then once I am connected, I get a traceback:

...
  metadata.drop_all()
File "/usr/lib64/python2.6/site-packages/sqlalchemy/schema.py", line 1871, in drop_all
  bind.drop(self, checkfirst=checkfirst, tables=tables)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1336, in drop
  self._run_visitor(ddl.SchemaDropper, entity, connection=connection, **kwargs)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1360, in _run_visitor
  visitorcallable(self.dialect, conn, **kwargs).traverse(element)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/sql/visitors.py", line 86, in traverse
  return traverse(obj, self.__traverse_options__, self._visitor_dict)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/sql/visitors.py", line 197, in traverse
  return traverse_using(iterate(obj, opts), obj, visitors)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/sql/visitors.py", line 191, in traverse_using
  meth(target)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/ddl.py", line 89, in visit_metadata
  collection = [t for t in reversed(sql_util.sort_tables(tables)) if self._can_drop(t)]
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/ddl.py", line 104, in _can_drop
  return not self.checkfirst or self.dialect.has_table(self.connection, table.name, schema=table.schema)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/dialects/postgresql/base.py", line 611, in has_table
  type_=sqltypes.Unicode)]
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 991, in execute
  return Connection.executors[c](self, object, multiparams, params)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
  return self.__execute_context(context)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1076, in __execute_context
  self._cursor_execute(context.cursor, context.statement, context.parameters[0], context=context)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1136, in _cursor_execute
  self.dialect.do_execute(cursor, statement, parameters, context=context)
File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/default.py", line 207, in do_execute
  cursor.execute(statement, parameters)
File "pg8000/dbapi.py", line 243, in _fn
  return fn(self, *args, **kwargs)
File "pg8000/dbapi.py", line 312, in execute
  self._execute(operation, args)
File "pg8000/dbapi.py", line 317, in _execute
  self.cursor.execute(new_query, *new_args)
File "pg8000/interface.py", line 303, in execute
  self._stmt = PreparedStatement(self.connection, query, statement_name="", *[{"type": type(x), "value": x} for x in args])
File "pg8000/interface.py", line 108, in __init__
  self._parse_row_desc = self.c.parse(self._statement_name, statement, types)
File "pg8000/protocol.py", line 918, in _fn
  return fn(self, *args, **kwargs)
File "pg8000/protocol.py", line 1069, in parse
  self._send(Parse(statement, qs, param_types))
File "pg8000/protocol.py", line 975, in _send
  data = msg.serialize()
File "pg8000/protocol.py", line 121, in serialize
  val = struct.pack("!i", len(val) + 4) + val
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8d in position 3: ordinal not in range(128)

-- 
Jon
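Since the initial pg8000 failure came down to the connect arguments, a quick stdlib sanity check can confirm the database URL actually carries a host and user before the DBAPI ever sees it. The credentials below are placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical URL; substitute your real credentials and host.
url = "postgresql+pg8000://scott:tiger@localhost/test"
parts = urlsplit(url)

# Each component should be non-empty if the driver demands it.
print(parts.hostname, parts.username, parts.path.lstrip("/"))
```

If `hostname` or `username` comes back `None`, the driver will be handed no positional connect arguments, matching the TypeError seen above.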
[sqlalchemy] Re: executemany + postgresql
On Nov 7, 2009, at 1:30 PM, Jon Nelson wrote:

> File "pg8000/protocol.py", line 121, in serialize
>   val = struct.pack("!i", len(val) + 4) + val
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x8d in position 3: ordinal not in range(128)

make sure you're on the latest tip of pg8000, which these days seems to be at http://github.com/mfenniak/pg8000/tree/trunk . It also adheres to the client encoding of your PG database, which you should make sure is on utf-8. But it's not going to render an INSERT...VALUES with multiple parameters in one big string, so if that's your goal you need to generate that string yourself. I'm surprised that sqlite, per your observation, parses an INSERT statement and re-renders it with multiple VALUES clauses? Very surprising behavior.
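The failing serialize frame blows up when a byte string containing non-ASCII data gets concatenated with unicode. A minimal sketch of the safe ordering, mirroring (but not reproducing) pg8000's length-prefixed framing: encode to the client encoding first, then pack.

```python
import struct

# Hypothetical statement text containing a non-ASCII character.
text = u"select 'na\u00efve'"

# Encode to bytes *before* any concatenation, so no implicit
# ascii decode can ever be triggered.
payload = text.encode("utf-8")

# 4-byte big-endian length prefix (length includes itself), then body.
msg = struct.pack("!i", len(payload) + 4) + payload
```

Mixing `str` and `unicode` in the other order is exactly what produces the `'ascii' codec can't decode byte 0x8d` error on Python 2.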
[sqlalchemy] Re: executemany + postgresql
On Sat, Nov 7, 2009 at 3:02 PM, Michael Bayer mike...@zzzcomputing.com wrote:

> make sure you're on the latest tip of pg8000, which these days seems to be at http://github.com/mfenniak/pg8000/tree/trunk . It also adheres to the client encoding of your PG database, which you should make sure is on utf-8.

Ah. I was running the latest /released/ version - I generally avoid running 'tip/HEAD/whatever' except during testing. Since I don't expect pg8000 to have any substantially different behavior, it's probably not even worth the effort.

<snip/>

> I'm surprised that sqlite, per your observation, parses an INSERT statement and re-renders it with multiple VALUES clauses? Very surprising behavior.

I'm not sure I said that - I certainly didn't intend that. Ultimately, the IPC costs associated with each set of bind params (one per row) just murder psycopg2 when compared to sqlite. There isn't any sqlite RPC per se, since it's always local. I can only assume that sqlite defers locking the database until the start of a transaction, and since sqlite isn't multi-writer aware the overhead of doing so is minimal. I wasn't comparing sqlite and postgresql per se - there isn't much of a comparison in my mind once you start needing all of the features, stability, and power that postgresql brings. However, I was disappointed to see that psycopg2 is not making use of the multi-bind-param INSERT support (postgresql 8.2 and newer), as this ultimately reduces the IPC overhead to a very small amount. The cost of a single call to postgresql might be small, but when you multiply it by hundreds of thousands or millions it suddenly becomes a deciding factor in some situations.

-- 
Jon
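A back-of-envelope illustration of that multiplication. The round-trip figure is an assumption for the sketch, not a measurement from the thread:

```python
# Assumed cost of one client/server round trip (0.1 ms); real numbers
# depend on socket type, kernel, and load.
round_trip_s = 0.0001
rows = 1000000

# One round trip per row (what psycopg2's executemany does):
per_row_overhead = rows * round_trip_s          # ~100 seconds of pure IPC

# One round trip per 1000-row multi-bind statement:
batched_overhead = (rows // 1000) * round_trip_s  # ~0.1 seconds

print(per_row_overhead, batched_overhead)
```

Even with a generously small per-call cost, the per-row scheme spends orders of magnitude more wall time on IPC alone.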
[sqlalchemy] Re: executemany + postgresql
Heyho!

On Friday 06 November 2009 02.46:11 Jon Nelson wrote:

> ... was performing an individual INSERT for every single row.

I don't know sqlalchemy well enough, but for big bulk imports on the SQL side, shouldn't COPY be used? Which is, as far as I know, pg-specific / non-SQL-standard.

cheers
-- vbi
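For reference, psycopg2 does expose COPY via `cursor.copy_from`, which streams tab-separated rows to the server in a single operation. A hedged sketch follows: the table name is illustrative, and the `copy_from` call itself needs a live connection, so only the buffer construction runs standalone.

```python
import io

# Rows to bulk-load; names/values are made up for the example.
rows = [("apple", 1), ("pear", 2)]

# COPY's text format: tab-separated columns, newline-terminated rows.
buf = io.StringIO("".join("%s\t%d\n" % r for r in rows))

# With a real psycopg2 connection this would be:
# with conn.cursor() as cur:
#     cur.copy_from(buf, "items", columns=("name", "qty"))
# conn.commit()
```

COPY avoids per-row statement round trips entirely, which is why it typically beats even a multi-row INSERT for large imports.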
[sqlalchemy] Re: executemany + postgresql
On Nov 5, 8:40 pm, Michael Bayer mike...@zzzcomputing.com wrote:

> On Nov 5, 2009, at 8:46 PM, Jon Nelson wrote:
>
>> I recently ran into an issue where, batched inside a transaction, I was able to achieve no more than about 9000 inserts/second to a postgresql database (same machine running the test). With everything exactly the same, I was able to achieve over 50,000 inserts/s to sqlite. Now, I won't argue the relative merits of each database, but this is a big problem for postgresql. I believe I have determined that the psycopg2 module is to blame, and that a substantial portion of the time was being spent in IPC/RPC. Basically, every single insert in this test is identical except for the values (same table and columns), but psycopg2 (or possibly SQLAlchemy) was performing an individual INSERT for every single row. I was *not* using the ORM. The code was something like this:
>>
>> row_values = build_a_bunch_of_dictionaries()
>> ins = table.insert()
>> t = conn.begin()
>> conn.execute(ins, row_values)
>> t.commit()
>>
>> where row_values is (of course) a list of dictionaries. What can be done here to improve the speed of bulk inserts? For postgresql to get walloped by a factor of 5 in this area is a big bummer.
>
> it depends on the source of the speed problem. if your table has types which do utf-8 encoding on each value, for example, that takes up a lot of time. the sqlite backend doesn't have this requirement but the PG one in 0.5 currently does. we've done some work on this in 0.6 to reduce this - we now use psycopg2's UNICODE extension, so that we expect result rows to come back as unicode objects already. In response to this question I just made the same change for bind parameters so that they won't be encoded into utf-8 on the way in, so feel free to try r6484 of trunk.

I gave that a try and did receive a mild speed boost - from ~9000 inserts/s to 9500 +/- 200. However, 9500 is still substantially lower than 50,000. In this case (pathological), *all* of the values are strings, and in fact the table doesn't even have a primary key.

> Also psycopg2 is a very fast, native DBAPI so I doubt there's any bottleneck there.

Granted, I'm using SA on /top/ of sqlite3 and psycopg2 (2.0.12), but when the only thing that changes is the dburi...

Before I even posted I resorted to strace. strace immediately confirmed my suspicion: when using psycopg2 I don't see one big fat INSERT with lots of binds, I see one INSERT per bind, and it's this that is ultimately killing the performance. You can easily observe this via strace: as I'm sure you know, the communication between the test program and postgresql takes place across a socket (unix domain or tcp/ip). For every single set of bind params, the result is essentially one sendto (INSERT INTO ...) and rt_sigprocmask, a poll, and then a recvfrom and rt_sigprocmask pair. Profiling at the C level shows that sendto accounts for *35%* of the total runtime and recvfrom a healthy 15%. It's this enormous overhead for every single set of bind params that's killing the performance.

-- 
Jon
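The test pattern above, reduced to the stdlib sqlite3 DBAPI for a self-contained demonstration: one transaction wrapping a single executemany. Against sqlite the whole batch stays in-process; with psycopg2, the same DBAPI call degrades to one client/server round trip per row, which is exactly what the strace output showed. Table and row contents are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, qty INTEGER)")

# A batch of parameter sets, analogous to row_values in the thread.
rows = [("row-%d" % i, i) for i in range(1000)]

# One transaction around the whole batch; the context manager commits.
with conn:
    conn.executemany("INSERT INTO items VALUES (?, ?)", rows)

count = conn.execute("SELECT count(*) FROM items").fetchone()[0]
print(count)  # 1000
```

Wrapping the batch in one transaction avoids a commit per row, but as the thread observes, it cannot avoid the per-statement round trips a networked DBAPI incurs.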