[sqlalchemy] Working with large IN lists
Hi all,

I have to read thousands of rows from a table and compile some data, but under certain conditions update those rows, all with the same value. The ratio of reads to writes here is as wide as possible: sometimes no rows, sometimes a few, and sometimes all of the rows that are read have to be updated. That last scenario is the one that concerns me.

    for row in query.yield_per(100):
        # Do something with data
        if some_condition:
            row.some_column = 123
            session.flush()

I am thinking about just adding the row's ID to a list:

    list_of_ids = []
    for row in query.yield_per(100):
        # Do something with data
        if some_condition:
            list_of_ids.append(row.primary_key)

and near the end of the transaction doing:

    session.query(Model)\
        .filter(Model.primary_key.in_(list_of_ids))\
        .update({"some_column": 123}, synchronize_session=False)

Yes, I'm aware of the increased memory required to store the ID list on the application side, and no, I don't need to lock the rows for update; the logic of an atomic update at the end is sufficient for my case. But I think (and real-use benchmarks will probably show; I haven't tested yet) that a single update query will be faster. I need the lowest transaction processing time on the application side for the entire call, even if it takes more memory and more database iron.

What I'm concerned with here is whether there are any limits or significant overheads with large .in_() lists. The backend is PostgreSQL via psycopg2.

Thanks

-- 
.oO V Oo.

-- 
You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
To post to this group, send email to sqlalchemy@googlegroups.com.
To unsubscribe from this group, send email to sqlalchemy+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/sqlalchemy?hl=en.
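[Editor's note] For reference, here is a minimal, self-contained sketch of the "collect IDs, then update once" pattern described above, assuming SQLAlchemy is installed. The `Item` model, the condition, and all values are illustrative stand-ins, not the poster's actual schema:

```python
# Minimal sketch of the "collect the IDs, then issue one UPDATE ... WHERE
# id IN (...)" pattern. Toy Item model on in-memory SQLite; the model,
# condition, and values are illustrative, not the original schema.
from sqlalchemy import create_engine, Column, Integer
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Item(Base):
    __tablename__ = "items"
    id = Column(Integer, primary_key=True)
    some_column = Column(Integer, default=0)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Item(id=i, some_column=0) for i in range(1, 6)])
    session.commit()

    # Pass 1: read and collect IDs instead of flushing row by row.
    list_of_ids = []
    for row in session.query(Item).yield_per(100):
        if row.id % 2 == 1:  # stand-in for "some_condition"
            list_of_ids.append(row.id)

    # Pass 2: a single bulk UPDATE near the end of the transaction.
    session.query(Item).filter(Item.id.in_(list_of_ids)).update(
        {"some_column": 123}, synchronize_session=False
    )
    session.commit()

    updated = session.query(Item).filter(Item.some_column == 123).count()
```

`synchronize_session=False` skips reconciling the in-memory objects with the bulk UPDATE, which matches the poster's stated tolerance for an unsynchronized atomic update at the end.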
[sqlalchemy] Re: Working with large IN lists
This depends on the execution plan of the query and is really more of a PostgreSQL question. Google "postgresql IN performance" and you will get a good idea of it.

By the look of your code, the second option would obviously be faster, as it hits the database once, whereas the first one flushes after every change.

Regards,

On Feb 21, 6:07 pm, Vlad K. v...@haronmedia.com wrote:
> Hi all, I have to read thousands of rows from a table and compile some
> data, but in certain conditions update those rows, all with same value. [...]
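[Editor's note] On inspecting what actually gets sent to the database: an `.in_()` list compiles to one bind parameter per element by default, which you can see by compiling a statement with literal binds. A small Core-only sketch, assuming SQLAlchemy is installed (the table name is made up):

```python
# Inspect what an .in_() list compiles to. Compiling with literal_binds
# inlines the values so the IN clause is visible; without it you would
# see one bind parameter per element. Toy table; names are illustrative.
from sqlalchemy import Integer, column, table

t = table("items", column("id", Integer))
stmt = t.select().where(t.c.id.in_([1, 2, 3]))

sql = str(stmt.compile(compile_kwargs={"literal_binds": True}))
```

Printing `stmt` without `literal_binds` instead shows the parameterized form, which is what PostgreSQL's planner actually receives.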
Re: [sqlalchemy] Re: Working with large IN lists
On Tue, Feb 21, 2012 at 3:24 PM, Manav Goel manav.goe...@gmail.com wrote:
> This depends upon the execution plan of the query and is more really a
> postgresql question. Google postgresql IN performance and you will get a
> good idea of it. By the look of your code, Second option would obviously
> be faster as it hits database once whereas first one flush after every
> change. [...]
A *long* time ago (SQLAlchemy 0.3), I had some performance problems with large IN clauses, and the reason turned out to be SQLAlchemy taking a lot of time to build long lists of bindparam objects. I've no idea if this is still the case these days. The best thing you can do is just try it.

Simon
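[Editor's note] If building one huge list of bindparam objects does turn out to be a bottleneck, one common workaround is to issue the statement in fixed-size chunks. A pure-Python sketch of the chunking itself; the helper name and the chunk size of 4 are arbitrary choices for illustration:

```python
# Split a large ID list into fixed-size batches, so that each UPDATE
# carries a bounded number of bind parameters. The chunk size is arbitrary.
def chunks(seq, size):
    """Yield successive slices of `seq` with at most `size` elements."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

ids = list(range(10))
batches = list(chunks(ids, 4))
# Each batch would then be passed to its own .in_() / .update() call.
```

Each batch would go into its own `filter(Model.primary_key.in_(batch)).update(...)` call; the rows still update atomically if all batches run inside one transaction.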
[sqlalchemy] alembic questions/comments
Hi Mike,

I've been reading the Alembic docs, and the following popped out:

- The partial-guid approach is nice for machines, but a nightmare for humans; looking at a folder full of these can't be fun. Since it's the link in the file that's important, could the files be given friendly names by humans?

- Curious about this phrase: "This means a migration script that pulls some rows into memory via a SELECT statement will not work in --sql mode." Why not? The SQL will still eventually be executed in the context of a database connection, right?

- Is there any support for just adding raw lumps of SQL as migration steps?

cheers,

Chris

-- 
Simplistix - Content Management, Batch Processing and Python Consulting
           - http://www.simplistix.co.uk
Re: [sqlalchemy] alembic questions/comments
On Feb 21, 2012, at 12:57 PM, Chris Withers wrote:

> - The partial guid approach is nice for machines, but a nightmare for
> humans... could the files be given friendly names by humans?

The current release does this:

http://alembic.readthedocs.org/en/latest/tutorial.html#the-migration-environment
http://alembic.readthedocs.org/en/latest/tutorial.html#editing-the-ini-file
(see file_template)

> - Curious about this phrase: "This means a migration script that pulls
> some rows into memory via a SELECT statement will not work in --sql
> mode." Why not? The sql will still eventually be executed in the context
> of a database connection, right?

Er, well, in the context of a connection, yes, but not one where there is any application code prepared to receive result rows. Think "cat myscript.sql | psql". There are ways to declare variables in SQL and assign results to those, so those methods are still doable. If you built a system whereby you were going to read the results of a generated SQL script, split the lines back into individual statements, and then executed them in the context of an application, then you could get results back... but still, how is that useful? You'd need to execute some subsequent line from the script in terms of those results, and to do that you'd need to invent some kind of template system within the migration system. When doing really big table migrations, you sometimes need to create temp tables and do big INSERT ... SELECT types of operations. SQLAlchemy has a greater need for the "insert from select" construct now.

> - Is there any support for just adding raw lumps of SQL as migration
> steps?
Sure, just use execute():

    op.execute("my raw lump of SQL")

http://alembic.readthedocs.org/en/latest/ops.html#alembic.operations.Operations.execute
Re: [sqlalchemy] Re: Working with large IN lists
Thanks for your replies. Using the IN list definitely speeds up the process, but I hate the resulting query, which uses a bound variable for each and every element of the list.

But I have another problem with this: there's a massive memory leak somewhere. Take a look at this model:

    class GatewayTransaction(Base):
        __tablename__ = "gateway_transactions"

        realestate_id = Column(Integer,
                               ForeignKey("realestate.realestate_id",
                                          ondelete="set null", onupdate="cascade"),
                               primary_key=True)
        portal_id = Column(Text, primary_key=True)
        realestate_portal_id = Column(Unicode)
        operation = Column(Text, nullable=False)
        agency_id = Column(Integer,
                           ForeignKey("agencies.agency_id",
                                      ondelete="set null", onupdate="cascade"),
                           nullable=False)
        agency_portal_id = Column(Unicode, nullable=False)
        agency_export_token = Column(Unicode, nullable=False)
        user_id = Column(Integer,
                         ForeignKey("users.user_id",
                                    ondelete="set null", onupdate="cascade"),
                         nullable=False)
        mod_images = Column(Boolean)

        agency = relationship("Agency", lazy="joined")
        realestate = relationship("Realestate", lazy="joined")
        user = relationship("User", lazy="joined")

Now, when I do this:

    for row in some_query.all():
        gt = session.query(GatewayTransaction)\
                .filter(GatewayTransaction.realestate_id == row.realestate_id)\
                .filter(GatewayTransaction.portal_id == k)\
                .first() or GatewayTransaction()

        # Do some data processing
        # ...

        # Update existing or insert as new
        gt = session.merge(gt)
        session.flush()

it is very, very slow: it takes minutes to process 2000 rows, and memory usage skyrockets into the multiple-GB range, so I have to terminate it before it starts swapping like hell. With lazy="select" it flies, done in a couple of seconds with very little memory consumed, because at this point there are no rows in the table, so nothing is additionally selected and everything is inserted. Still, why would a join slow things down so drastically and shoot Python's memory usage (not the DB's) sky-high?
Also, even if I try session.expunge(gt) or expunge_all() (after preparing the rows to be loaded one by one from a list of IDs), the memory always keeps growing, as if the instances never die and never get garbage collected...

-- 
.oO V Oo.

On 02/21/2012 04:35 PM, Simon King wrote:
> A *long* time ago (SQLAlchemy 0.3), I had some performance problems with
> large IN clauses, and the reason turned out to be SQLAlchemy taking a lot
> of time to build long lists of bindparam objects. [...]