[sqlalchemy] Working with large IN lists

2012-02-21 Thread Vlad K.


Hi all,

I have to read thousands of rows from a table and compile some data, but 
under certain conditions I also have to update those rows, all with the same 
value. The ratio of reads to writes here can vary as widely as possible: 
sometimes no rows need updating, sometimes a few, and sometimes every row 
that is read has to be updated. That last scenario is what concerns me.



for row in query.yield_per(100):
    # Do something with data

    if some_condition:
        row.some_column = 123
    session.flush()


I am thinking about just adding the row's ID to a list:

list_of_ids = []
for row in query.yield_per(100):
    # Do something with data

    if some_condition:
        list_of_ids.append(row.primary_key)

and near the end of the transaction do:

session.query(Model).filter(Model.primary_key.in_(list_of_ids))\
    .update({Model.some_column: 123}, synchronize_session=False)


Yes, I'm aware of the increased memory required to store the ID list on 
the application side, and no, I don't need to lock the rows for update; 
a single atomic update at the end is sufficient for my case.


But I think (real-use benchmarks will probably show it, I just haven't 
tested yet) that the single update query will be faster. I need the lowest 
possible transaction processing time on the application side for the entire 
call, even if it takes more memory and more database iron.


What I'm wondering is whether there are any limits or significant 
overheads with large .in_() lists.


The backend is PostgreSQL via psycopg2.
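
For reference, this is how I can render the statement SQLAlchemy would emit 
for such a filter without executing it (Model and list_of_ids are the 
placeholders from the snippets above):

from sqlalchemy.dialects import postgresql

# Render the filter without executing it; every element of list_of_ids
# becomes its own bound parameter in the emitted SQL.
stmt = session.query(Model).filter(Model.primary_key.in_(list_of_ids)).statement
print(stmt.compile(dialect=postgresql.dialect()))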



Thanks

--

.oO V Oo.




[sqlalchemy] Re: Working with large IN lists

2012-02-21 Thread Manav Goel
This depends on the execution plan of the query and is really more of a 
PostgreSQL question. Google "postgresql IN performance" and you will get 
a good idea of it.
By the look of your code, the second option would obviously be faster, as 
it hits the database once, whereas the first one flushes after every change.
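
If you want to see which plan you actually get, something like this would 
show it (a rough sketch; the table and column names are placeholders, and 
list_of_ids/session are as in your code):

from sqlalchemy import text

# Building the literal ID list here is only for hand-running EXPLAIN
# (integers only; placeholder table/column names).
id_csv = ", ".join(str(int(i)) for i in list_of_ids)
plan = session.execute(text(
    "EXPLAIN UPDATE some_table SET some_column = 123 WHERE id IN (%s)" % id_csv))
for (line,) in plan:
    print(line)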

Regards,




Re: [sqlalchemy] Re: Working with large IN lists

2012-02-21 Thread Simon King

A *long* time ago (SQLAlchemy 0.3), I had some performance problems
with large IN clauses, and the reason turned out to be SQLAlchemy
taking a lot of time to build long lists of bindparam objects. I've no
idea if this is still the case these days. The best thing you can do
is just try it.
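
For example, you can time just the statement-construction side without 
touching the database at all (a sketch with made-up table and column names):

import time
from sqlalchemy import Column, Integer, MetaData, Table

metadata = MetaData()
t = Table("some_table", metadata,
          Column("id", Integer, primary_key=True),
          Column("some_column", Integer))

ids = list(range(50000))
start = time.time()
stmt = t.update().where(t.c.id.in_(ids)).values(some_column=123)
compiled = stmt.compile()
print("built %d bind parameters in %.3f sec" % (len(compiled.params), time.time() - start))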

Simon




[sqlalchemy] alembic questions/comments

2012-02-21 Thread Chris Withers

Hi Mike,

I've been reading the Alembic docs, and the following popped out:

- The partial GUID approach is nice for machines, but a nightmare for 
humans; looking at a folder full of these can't be fun. Since it's the 
link in the file that's important, could the files be given friendly 
names by humans?


- Curious about this phrase:

"This means a migration script that pulls some rows into memory via a 
SELECT statement will not work in --sql mode."


Why not? The SQL will still eventually be executed in the context of a 
database connection, right?


- Is there any support for just adding raw lumps of SQL as migration steps?

cheers,

Chris

--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk




Re: [sqlalchemy] alembic questions/comments

2012-02-21 Thread Michael Bayer

On Feb 21, 2012, at 12:57 PM, Chris Withers wrote:

 Hi Mike,
 
 I've been reading the Alembic docs, and the following popped out:
 
 - The partial guid approach is nice for machines, but a nightmare for 
 humans, looking at a folder full of these can't be fun. Since it's the link 
 in the file that's important, could the files be given friendly names by 
 humans?

The current release does this:

http://alembic.readthedocs.org/en/latest/tutorial.html#the-migration-environment
http://alembic.readthedocs.org/en/latest/tutorial.html#editing-the-ini-file 
(see file_template)
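
In alembic.ini that means setting something like the following (illustrative 
only; check the file_template section linked above for the exact tokens 
available):

[alembic]
# %%(rev)s is the revision id and %%(slug)s is derived from the -m message,
# so files come out as <revision>_<slug>.py
file_template = %%(rev)s_%%(slug)s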

 
 - Curious about this phrase:
 
 This means a migration script that pulls some rows into memory via a SELECT 
 statement will not work in --sql mode.
 
 Why not? The sql will still eventually be executed in the context of a 
 database connection, right?

er well in the context of a connection, but not one where there is any 
application code prepared to receive result rows. Think "cat myscript.sql | 
psql".

There are ways to declare variables in SQL and assign results to those, so 
those methods are still doable.

If you built a system whereby you were going to read a generated SQL 
script, split it back into individual statements, and then execute them in 
the context of an application, then you could get results back... but still, 
how is that useful?  You'd need to execute some subsequent line from the 
script in terms of those results, and to do that you'd need to invent some 
kind of template system within the migration system.

When doing really big table migrations, you sometimes need to create temp 
tables and do big INSERT ... SELECT types of operations.  SQLAlchemy has a 
greater need for an insert-from-select construct now.
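
As a hedged sketch of that pattern (table names invented for illustration), 
a migration that keeps all the work in SQL runs the same whether it's 
executed live or emitted with --sql:

from alembic import op

def upgrade():
    # Everything happens inside the database, so no rows ever need to be
    # pulled back into the migration script.
    op.execute("""
        CREATE TEMPORARY TABLE listing_scratch AS
        SELECT id, price FROM listings WHERE active
    """)
    op.execute("""
        INSERT INTO listings_archive (id, price)
        SELECT id, price FROM listing_scratch
    """)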


 
 - Is there any support for just adding raw lumps of sql as migration steps?

sure just use execute():

op.execute("my raw lump of SQL")

http://alembic.readthedocs.org/en/latest/ops.html#alembic.operations.Operations.execute




Re: [sqlalchemy] Re: Working with large IN lists

2012-02-21 Thread Vlad K.


Thanks for your replies.

Using the IN list definitely speeds up the process, but I hate the 
resulting query, which uses a bound variable for each and every element of 
the list.
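
One thing I might try, since this is PostgreSQL/psycopg2, is sending the 
whole list as a single array parameter instead (untested sketch, placeholder 
table and column names):

from sqlalchemy import text

# psycopg2 adapts a Python list to a PostgreSQL array, so the whole ID list
# travels as one bound parameter instead of one parameter per element.
session.execute(
    text("UPDATE some_table SET some_column = 123 WHERE id = ANY(:ids)"),
    {"ids": list_of_ids})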


But I have another problem with this: there's a massive memory leak 
somewhere. Take a look at this model:




class GatewayTransaction(Base):
    __tablename__ = "gateway_transactions"

    realestate_id = Column(Integer, ForeignKey("realestate.realestate_id",
                                               ondelete="set null",
                                               onupdate="cascade"),
                           primary_key=True)

    portal_id = Column(Text, primary_key=True)
    realestate_portal_id = Column(Unicode)
    operation = Column(Text, nullable=False)
    agency_id = Column(Integer, ForeignKey("agencies.agency_id",
                                           ondelete="set null",
                                           onupdate="cascade"),
                       nullable=False)

    agency_portal_id = Column(Unicode, nullable=False)
    agency_export_token = Column(Unicode, nullable=False)
    user_id = Column(Integer, ForeignKey("users.user_id",
                                         ondelete="set null",
                                         onupdate="cascade"),
                     nullable=False)

    mod_images = Column(Boolean)

    agency = relationship("Agency", lazy="joined")
    realestate = relationship("Realestate", lazy="joined")
    user = relationship("User", lazy="joined")




Now, when I do this:


for row in some_query.all():
    gt = session.query(GatewayTransaction)\
        .filter(GatewayTransaction.realestate_id == row.realestate_id)\
        .filter(GatewayTransaction.portal_id == k)\
        .first() or GatewayTransaction()

    # Do some data processing
    #
    #

    # Update existing or insert as new
    gt = session.merge(gt)
    session.flush()


It is very, very slow: it takes minutes to process 2000 rows, and memory 
usage skyrockets into the multi-GB range, so I have to terminate it 
before it starts swapping like hell. With lazy="select" it flies, done in 
a couple of seconds with very little memory consumed, because at this 
point there are no rows in the table, so nothing is additionally 
selected, only inserted. Still, why would a join slow things down so 
drastically and shoot Python's memory usage (not the DB's) sky-high?


Also, even if I try session.expunge(gt) or expunge_all() (after first 
arranging for the rows to be loaded one by one from a list of IDs), the 
memory keeps growing, as if the instances never die and never get 
garbage collected...
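
One thing worth trying, to see whether the joined loads themselves are the 
culprit, is switching them off for just this lookup; a sketch using the 
attribute names from the model above:

from sqlalchemy.orm import lazyload

# Same lookup as in the loop above, but with the eager joins disabled for
# this query only.
gt = session.query(GatewayTransaction)\
    .options(lazyload("agency"), lazyload("realestate"), lazyload("user"))\
    .filter(GatewayTransaction.realestate_id == row.realestate_id)\
    .filter(GatewayTransaction.portal_id == k)\
    .first() or GatewayTransaction()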




.oO V Oo.

