Re: [sqlalchemy] Re: Working with large IN lists

2012-02-23 Thread Michael Bayer
I'd note that those subqueryloads() render the effectiveness of yield_per() to 
be almost nil.

This is how subqueryload works:


query(Parent).options(subqueryload(Parent.child)).all()

will give you:


SELECT * FROM parent

then, the moment the first child collection is referenced on a fetched parent row 
and needs to be loaded, the query emitted is:

SELECT * FROM child JOIN (SELECT * FROM parent) AS a ON a.id = child.id_a

that is, the second query loads all child rows for all parents in the entire 
result.   

So let's say we do yield_per(), so that SQLAlchemy only processes the first 100 
rows before handing them out.   As soon as you hit either of those two 
subqueryloaded collections, the yield_per() is mostly thrown out the window - 
they will each load the entire list of child objects for the whole result, 
which right there will grow memory to be as big as your entire result.
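
To make the contrast concrete, here's a minimal sketch (hypothetical Parent/Child 
models, a session, and a do_something() placeholder - not the code from this 
thread) of yield_per() with the collection left to lazy loading:

from sqlalchemy.orm import lazyload

# override any mapper-level subqueryload for this query only
q = session.query(Parent).options(lazyload(Parent.child))
for parent in q.yield_per(100):
    # only ~100 Parent rows are buffered by the ORM at a time; touching
    # parent.child emits a small per-parent SELECT instead of one query
    # that loads every child for the whole result
    do_something(parent)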

The next thing I'd look at is that second query for ModelC.   You can also load 
those upfront so that you don't need to do a query each time:

modelcs = dict(
    sess.query(ModelC.id_a, ModelC)
        .join(ModelC.a)
        .filter(ModelA.parent_id == parent_id)  # same integer you're using against ModelA
)

then you have a dictionary of id_a → ModelC as you loop through your ModelA 
records.   All the ModelC's are already in the session, so you don't need to 
use merge(), which is also not a super quick operation.  If an id_a is not in 
the dictionary then you know to create a new ModelC and use Session.add() to 
put it in.   If you're limiting the ModelA rows using a LIMIT or window recipe 
like I mentioned earlier, you'd apply that same criteria to the loading of the 
modelcs.
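
For illustration, a hypothetical sketch of consuming that dictionary in the main 
loop (ModelA, ModelC, id_a and parent_id follow the naming used in this thread; 
ModelA.id as the key column is an assumption):

for a in sess.query(ModelA).filter(ModelA.parent_id == parent_id):
    c = modelcs.get(a.id)
    if c is None:
        c = ModelC(id_a=a.id)   # not preloaded, so it must be a new row
        sess.add(c)             # no merge() needed; the object is now pending
    # ... update c's columns from a here ...
sess.flush()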

Later on you're doing something with query(ModelC).first() in a loop, which is 
also something I hope isn't in the real application - looking there it seems 
like you'd only need to say query(ModelC).delete().

Watching the SQL emitted with echo=True, and in some cases also seeing how 
large the results coming in are using echo='debug', is something I strongly 
recommend when first profiling an application.





On Feb 23, 2012, at 8:18 AM, Vlad K. wrote:

 
 And yet again the problem is not in SQLAlchemy but in Pyramid. Sorry, from 
 now on I'll first try writing test scripts without Pyramid to see where the 
 problem is; I just don't have time for that kind of debugging, so I'm wasting 
 yours. :)
 
 
 Thanks. I'll write to Pylons Discuss list, but the test script is here 
 nevertheless. Comment out line 266 (and remove imports) to disable Pyramid 
 and see it work okay. Without Pyramid (but even with Transaction and 
 ZopeTransactionExtension), the gcdelta is 0 or negative per batch of 200 
 iterations. With Pyramid bootstrapped, gcdelta is in thousands per batch of 
 200 iterations.
 
 https://gist.github.com/d669e958c54869c69831
 
 
 .oO V Oo.
 
 
 On 02/23/2012 02:41 AM, Michael Bayer wrote:
 On Feb 22, 2012, at 6:36 PM, Vlad K. wrote:
 
 Okay, thanks to this article:
 
 http://neverfear.org/blog/view/155/Investigating_memory_leaks_in_Python
 
 
 I made similar plot of object counts in time, showing top 50 types. The 
 resulting PDF is here (you might wish to download it first, Google messes 
 it up for me):
 
 https://docs.google.com/open?id=0ByLiBlA59qDwYTY1MGIzYWEtYjMxZi00ZDVlLTk0OTEtOGI2ZjA3NDgyM2Y3
 
 
 Everything seems to grow linearly in count. Something is keeping all those 
 objects referenced somewhere. What could possibly be the cause?
 
 can you provide a self-contained, single file test case that illustrates the 
 memory growth ?
 
 
 




Re: [sqlalchemy] Re: Working with large IN lists

2012-02-23 Thread Vlad K.


.oO V Oo.


On 02/23/2012 02:50 PM, Michael Bayer wrote:

I'd note that those subqueryloads() render the effectiveness of yield_per() to 
be almost nil.


I know. I've replicated the real use case from the application, which has 
yet to see proper optimization, including better query planning and 
reduction of unnecessary joins and relationships. I'd rather investigate 
implementing prepared statements and basically do lazy='select' 
instead of subqueries.




The next thing I'd look at is that second query for ModelC.   You can also load 
those upfront so that you don't need to do a query each time:

modelcs = dict(
    sess.query(ModelC.id_a, ModelC)
        .join(ModelC.a)
        .filter(ModelA.parent_id == parent_id)  # same integer you're using against ModelA
)

then you have a dictionary of id_a → ModelC as you loop through your ModelA records.   All the 
ModelC's are already in the session, so you don't need to use merge(), which is also not a super quick 
operation.  If an id_a is not in the dictionary then you know to create a new ModelC and use 
Session.add() to put it in.   If you're limiting the ModelA rows using a LIMIT or window recipe like I mentioned 
earlier, you'd apply that same criteria to the loading of the modelcs.

Later on you're doing something with query(ModelC).first() in a loop, which is 
also something I hope isn't in the real application - looking there it seems 
like you'd only need to say query(ModelC).delete().


Might not be visible from this test script, but the scenario is this. 
ModelA represents certain data that has to be exported to external 
services (XMLRPC, REST and similar). In an ideal situation I'd just select 
all ModelA rows that have to be exported (by comparing the timestamp of last 
modification with the timestamp of the process run), but I can't do that, 
because if such a transaction fails it has to be remembered for the next 
batch run. So I use a ModelC table which logs these pending transactions.


So the first phase selects rows from ModelA that are up for export and 
creates transaction logs in ModelC.


The second phase then loads and exports rows from ModelC one by one 
(joined with ModelA and everything else required for the export). 
However, if a single transaction fails, the entire script exits and 
continues when called next time. This I have to do for other reasons 
(preventing overload on a possibly downed external service, etc.), so I 
can't skip that row and fetch the next.


It may happen, and does regularly, that on subsequent runs of the 
process there are no new ModelA rows to load, but there are ModelC rows 
that failed on the last call (since they're still in the table, it means 
they were not processed).


And that's the logic in query(ModelC).first(), processing, and then 
delete(). Also note that each row has to be an individual transaction 
(load, send to external service, remove from ModelC table), which means 
I can't rely on session/identity caching by pre-loading data instead 
of joins and subqueries.
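
Roughly, the phase-two loop looks like this sketch (export_to_service() and 
ExportFailure are hypothetical stand-ins for the real XMLRPC/REST call and its 
error, not actual code from the app):

while True:
    pending = session.query(ModelC).first()
    if pending is None:
        break
    try:
        export_to_service(pending)       # uses the joined ModelA data
    except ExportFailure:
        session.rollback()
        raise SystemExit(1)              # stop; the row stays for the next run
    session.delete(pending)
    session.commit()                     # one transaction per processed row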




Watching the SQL emitted with echo=True, and in some cases also seeing how 
large the results coming in are using echo='debug', is something I strongly 
recommend when first profiling an application.


Yes, I use logging and see all the SQL emitted.



Thanks for your input, I appreciate all the help and advice I can get. 
Still a ton of stuff to learn about SQLA.




V




Re: [sqlalchemy] Re: Working with large IN lists

2012-02-22 Thread Michael Bayer

When we want to test if a Python program has a leak, we do that by seeing 
how many uncollected objects are present.   This is done via gc:

import gc
print "total number of objects:", len(gc.get_objects())

That's the only real way to measure if the memory used by Python objects is 
growing unbounded.  Looking at the memory usage in top shows what the 
interpreter takes up - the CPython interpreter in more modern releases does 
release memory back, but only occasionally.   Older versions don't.
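
For example, a sketch of comparing counts across iterations, forcing a 
collection each time (do_one_batch() is a hypothetical stand-in for one unit 
of work):

import gc

baseline = len(gc.get_objects())
for i in range(50):
    do_one_batch()
    gc.collect()
    # if this delta keeps climbing and never drops, something is holding references
    print i, len(gc.get_objects()) - baseline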

If you're doing an operation that loads thousands of rows, those rows are 
virtually always loaded entirely into memory by the DBAPI, before your program 
or SQLAlchemy is ever given the chance to fetch a single row.   I haven't yet 
looked closely at your case here, but that's often at the core of scripts that 
use much more memory than expected.

There are ways to get *some* DBAPIs to not do this (particularly psycopg2, if 
you're using PostgreSQL, see 
http://docs.sqlalchemy.org/en/latest/orm/query.html?highlight=yield_per#sqlalchemy.orm.query.Query.yield_per
 and 
http://docs.sqlalchemy.org/en/latest/core/connections.html?highlight=stream_results#sqlalchemy.engine.base.Connection.execution_options),
  though the better solution is usually to try loading chunks of records at 
a time (one such recipe that I use for this is here: 
http://www.sqlalchemy.org/trac/wiki/UsageRecipes/WindowedRangeQuery).  Or 
better yet, consider whether the problem can be solved entirely on the SQL side (this 
entirely depends on exactly what you're trying to do with the data in question).
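
As a rough sketch (assuming PostgreSQL via psycopg2 and a ModelA mapping as 
discussed in this thread; handle() is a placeholder), streaming with a 
server-side cursor plus yield_per() looks like:

q = (session.query(ModelA)
            .execution_options(stream_results=True)   # server-side cursor on psycopg2
            .yield_per(100))                           # hand rows out in batches of 100
for a in q:
    handle(a)   # hypothetical per-row processing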


On Feb 22, 2012, at 9:46 AM, Vlad K. wrote:

 
 Okay, after several test cases, various join combinations with or without 
 relationships, with or without cherrypicking columns that are really used 
 from the joined models, I've come to the conclusion that the only problem I'm 
 having here is that there is no garbage collection. Python memory use just 
 keeps growing at a rate that, of course, depends on the size of the models used 
 and the data queried, but it just keeps growing, regardless of releasing/deleting 
 instances or isolating each row's processing in its own committed transaction.
 
 I also found this:
 
 http://permalink.gmane.org/gmane.comp.python.sqlalchemy.user/30087
 
 
 So it appears I'm having the same problem.
 
 
 Am I understanding correctly that because of this, SQLAlchemy ORM is in my 
 case useless if I have to process thousands of rows, because the memory used 
 to process each row (along with corresponding joined models etc...) will not 
 be released? So basically I'd have to use SQLA without the ORM, for this 
 particular use case?
 
 Or is this some memory leak bug?
 
 If so, any suggestions or examples on how to switch from ORM to non-ORM use 
 if I want to retain the named tuples returned by queries and avoid rewriting 
 half the app?
 
 
 Thanks.
 
 
 .oO V Oo.
 
 




Re: [sqlalchemy] Re: Working with large IN lists

2012-02-22 Thread Vlad K.


Hi,

thanks for your reply. I haven't yet tested this with a profiler to see 
exactly what is happening, but the bottom line is that the 
overall memory use grows with each iteration (or transaction processed), 
to the point of grinding the server to a halt, and top shows only the 
Python process involved consuming all the memory.


I've already modified the code to read one row at a time, by first creating 
a list of IDs to be affected, then going through that list and selecting 
+ updating/inserting one transaction at a time.
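
Roughly like this sketch (ModelA, the filter criteria and process() are 
placeholders, not the real code):

ids = [id_ for (id_,) in session.query(ModelA.id)
                                .filter(ModelA.needs_export == True)]
for id_ in ids:
    obj = session.query(ModelA).get(id_)
    process(obj)        # hypothetical per-row work (update/insert related rows)
    session.commit()    # each row commits in its own transaction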


I suppose I can solve the problem entirely on the SQL side with a stored 
function but that's a maintenance overhead I'd like to avoid if possible.


Meanwhile I've gotten rid of convenience relationships and in some 
places decided on lazy='select' instead of 'subquery' or 'joined', which has 
brought down total memory use; now the entire process can finish with 
the amount of RAM available on the server, but it still shows linear 
growth from the start to the end of the process.


.oO V Oo.


On 02/22/2012 07:23 PM, Michael Bayer wrote:

When we want to test if a Python program has a leak, we do that via seeing 
how many uncollected objects are present.   This is done via gc:

import gc
print "total number of objects:", len(gc.get_objects())

That's the only real way to measure if the memory used by Python objects is growing 
unbounded.  Looking at the memory usage in top shows what the interpreter 
takes up - the CPython interpreter in more modern releases does release memory back, but 
only occasionally.   Older versions don't.

If you're doing an operation that loads thousands of rows, those rows are 
virtually always loaded entirely into memory by the DBAPI, before your program 
or SQLAlchemy is ever given the chance to fetch a single row.   I haven't yet 
looked closely at your case here, but that's often at the core of scripts that 
use much more memory than expected.

There are ways to get *some* DBAPIs to not do this (particularly psycopg2, if 
you're using PostgreSQL, see 
http://docs.sqlalchemy.org/en/latest/orm/query.html?highlight=yield_per#sqlalchemy.orm.query.Query.yield_per
 and 
http://docs.sqlalchemy.org/en/latest/core/connections.html?highlight=stream_results#sqlalchemy.engine.base.Connection.execution_options),
  though the better solution is usually to try loading chunks of records at 
a time (one such recipe that I use for this is here: 
http://www.sqlalchemy.org/trac/wiki/UsageRecipes/WindowedRangeQuery).  Or 
better yet, consider whether the problem can be solved entirely on the SQL side (this 
entirely depends on exactly what you're trying to do with the data in question).





Re: [sqlalchemy] Re: Working with large IN lists

2012-02-22 Thread Vlad K.


Yes, definitely growing at a rate of 700-800 per iteration.

.oO V Oo.


On 02/22/2012 07:23 PM, Michael Bayer wrote:

When we want to test if a Python program has a leak, we do that via seeing 
how many uncollected objects are present.   This is done via gc:

import gc
print "total number of objects:", len(gc.get_objects())





Re: [sqlalchemy] Re: Working with large IN lists

2012-02-22 Thread Claudio Freire
On Wed, Feb 22, 2012 at 4:29 PM, Michael Bayer mike...@zzzcomputing.com wrote:
 thanks for your reply. I haven't yet tested this with a profiler to see 
 exactly what is happening, but the bottom line is that the overall 
 memory use grows with each iteration (or transaction processed), to the 
 point of grinding the server to a halt, and top shows only the Python 
 process involved consuming all the memory.

 yeah like I said that tells you almost nothing until you start looking at 
 gc.get_objects().  If the size of gc.get_objects() grows continuously for 50 
 iterations or more, never decreasing even when gc.collect() is called, then 
 it's a leak.  Otherwise it's just too much data being loaded at once.

I've noticed compiling queries (either explicitly or implicitly) tends
to *fragment* memory. There seem to be long-lived caches in the PG
compiler at least. I can't remember exactly where, but I could take
another look.

I'm talking of rather old versions of SQLA, 0.3 and 0.5.




Re: [sqlalchemy] Re: Working with large IN lists

2012-02-22 Thread Michael Bayer

On Feb 22, 2012, at 2:46 PM, Claudio Freire wrote:

 On Wed, Feb 22, 2012 at 4:29 PM, Michael Bayer mike...@zzzcomputing.com 
 wrote:
 thanks for your reply. I haven't yet tested this with a profiler to see 
 exactly what is happening, but the bottom line is that the overall 
 memory use grows with each iteration (or transaction processed), to the 
 point of grinding the server to a halt, and top shows only the Python 
 process involved consuming all the memory.
 
 yeah like I said that tells you almost nothing until you start looking at 
 gc.get_objects().  If the size of gc.get_objects() grows continuously for 50 
 iterations or more, never decreasing even when gc.collect() is called, then 
 it's a leak.  Otherwise it's just too much data being loaded at once.
 
 I've noticed compiling queries (either explicitly or implicitly) tends
 to *fragment* memory. There seem to be long-lived caches in the PG
 compiler at least. I can't remember exactly where, but I could take
 another look.
 
 I'm talking of rather old versions of SQLA, 0.3 and 0.5.


0.3's code is entirely gone, years ago.  I wouldn't even know what silly things 
it was doing.

In 0.5 and beyond, there's a cache of identifiers for quoting purposes.   If you 
are creating perhaps thousands of tables with hundreds of columns, all names 
being unique, then this cache might start to become a blip on the radar.   For 
the expected use case of a schema with at most several hundred tables this 
should not be a significant size.

I don't know exactly what it means for a Python script to fragment memory, and I 
don't really think there's some kind of set of Python programming practices 
that deterministically links to whether or not a script fragments a lot.  Alex 
Martelli talks about it here: 
http://stackoverflow.com/questions/1316767/how-can-i-explicitly-free-memory-in-python
 .  The suggestion there is that if you truly need to load tons of data into 
memory, doing it in a subprocess is the only way to guarantee that memory is 
freed back to the OS.

As it stands, there are no known memory leaks in SQLAlchemy itself and if you 
look at our tests under aaa_profiling/test_memusage.py you can see we are 
exhaustively ensuring that the size of gc.get_objects() does not grow unbounded 
for all sorts of awkward situations.   To illustrate potential new memory 
leaks we need succinct test cases that illustrate a simple ascending growth in 
memory usage.







Re: [sqlalchemy] Re: Working with large IN lists

2012-02-22 Thread Michael Bayer

On Feb 22, 2012, at 3:28 PM, Claudio Freire wrote:

 
 Like I said, it's not a leak situation so much as a fragmentation
 situation, where long-lived objects in high memory positions can
 prevent the process' heap from shrinking.
 
 [0] http://revista.python.org.ar/2/en/html/memory-fragmentation.html

Saw that a bit, but looking at the tips at the bottom, concrete 
implementation changes are not coming to mind.   An eternal structure is 
ubiquitous in any programming language.  sys.modules is a big list of all the 
Python modules that have been imported, each one full of functions, classes, 
other data, these are all eternal structures - sys.modules is normally never 
cleaned out.   I'm not seeing at what point you move beyond things that are in 
these modules into things that are so-called eternal structures that lead to 
inappropriate memory fragmentation.





Re: [sqlalchemy] Re: Working with large IN lists

2012-02-22 Thread Claudio Freire
On Wed, Feb 22, 2012 at 5:40 PM, Michael Bayer mike...@zzzcomputing.com wrote:
 Saw that a bit, but looking at the tips at the bottom, concrete 
 implementation changes are not coming to mind.   An eternal structure is 
 ubiquitous in any programming language.  sys.modules is a big list of all the 
 Python modules that have been imported, each one full of functions, classes, 
 other data, these are all eternal structures - sys.modules is normally 
 never cleaned out.    I'm not seeing at what point you move beyond things 
 that are in these modules into things that are so-called eternal structures 
 that lead to inappropriate memory fragmentation.

The thing to be careful about is when those eternal structures are created.

If they're created at the beginning (as sys.modules, which is
populated with imports, which most of the time happen in the preamble
of .py files), then the resulting objects will have lower memory
locations and thus not get in the way.

But if those structures are created after the program had time to fill
its address space with transient objects (say, lazy imports, caches),
then when the transient objects are deleted, the eternal structures
(with their high addresses) prevent the heap from shrinking.

Such caches, for instance, are better made limited in lifespan (say,
giving them a finite lifetime, making them expire, actively cleaning
them from time to time). Structures that are truly required to be
eternal are better populated at load time, early in the program's
lifecycle. In my backend, for instance, queries are precompiled at
startup, to make sure they have lower memory addresses. This has
mostly solved SQLA-related memory fragmentation issues for me.
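
As a hypothetical illustration (made-up table and names, not my actual 
backend), "precompile at startup" can look like this: build and compile a 
statement at module import time so its long-lived structures are allocated 
early, rather than lazily inside a request.

from sqlalchemy import MetaData, Table, Column, Integer, select, bindparam
from sqlalchemy.dialects import postgresql

metadata = MetaData()
items = Table('items', metadata, Column('id', Integer, primary_key=True))

# compiled once at load time; reused (with bound parameters) per request
ITEMS_BY_ID = select([items]).where(items.c.id == bindparam('item_id')).compile(
    dialect=postgresql.dialect())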




Re: [sqlalchemy] Re: Working with large IN lists

2012-02-22 Thread Claudio Freire
On Wed, Feb 22, 2012 at 5:51 PM, Claudio Freire klaussfre...@gmail.com wrote:
 Such caches, for instance, are better made limited in lifespan (say,
 giving them a finite lifetime, making them expire, actively cleaning
 them from time to time). Structures that are truly required to be
 eternal are better populated at load time, early in the program's
 lifecycle. In my backend, for instance, queries are precompiled at
 startup, to make sure they have lower memory addresses. This has
 mostly solved SQLA-related memory fragmentation issues for me.

One source of trouble I've had here is the inability to use bind
parameters inside .in_(...).

Queries that accept variable lists I thus had to precompile to
strings, and replace the inside of the IN condition by string
interpolation.

Ugly hack, but it served me well.
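
Roughly like this sketch (an application-level workaround, not a SQLAlchemy 
API; the table and names are made up, and it's only safe here because the ids 
are forced to integers):

IN_TEMPLATE = "SELECT id, name FROM items WHERE id IN (%s)"

def fetch_items(conn, ids):
    # splice the values in by string interpolation instead of bind parameters
    in_list = ", ".join(str(int(i)) for i in ids)
    return conn.execute(IN_TEMPLATE % in_list).fetchall()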




Re: [sqlalchemy] Re: Working with large IN lists

2012-02-22 Thread Michael Bayer

On Feb 22, 2012, at 3:51 PM, Claudio Freire wrote:

 On Wed, Feb 22, 2012 at 5:40 PM, Michael Bayer mike...@zzzcomputing.com 
 wrote:
 Saw that a bit, but looking at the tips at the bottom, concrete 
 implementation changes are not coming to mind.   An eternal structure is 
 ubiquitous in any programming language.  sys.modules is a big list of all 
 the Python modules that have been imported, each one full of functions, 
 classes, other data, these are all eternal structures - sys.modules is 
 normally never cleaned out.I'm not seeing at what point you move beyond 
 things that are in these modules into things that are so-called eternal 
 structures that lead to inappropriate memory fragmentation.
 
 The thing to be careful about is when those eternal structures are created.
 
 If they're created at the beginning (as sys.modules, which is
 populated with imports, which most of the time happen in the preamble
 of .py files), then the resulting objects will have lower memory
 locations and thus not get in the way.
 
 But if those structures are created after the program had time to fill
 its address space with transient objects (say, lazy imports, caches),
 then when the transient objects are deleted, the eternal structures
 (with their high addresses) prevent the heap from shrinking.
 
 Such caches, for instance, are better made limited in lifespan (say,
 giving them a finite lifetime, making them expire, actively cleaning
 them from time to time). Structures that are truly required to be
 eternal are better populated at load time, early in the program's
 lifecycle. In my backend, for instance, queries are precompiled at
 startup, to make sure they have lower memory addresses. This has
 mostly solved SQLA-related memory fragmentation issues for me.


IMHO the whole point of using a high-level, interpreted language like Python is 
that we don't have to be bogged down thinking like C programmers.   How come 
I've never had a memory fragmentation issue before?   I've made 
precompilation an option for folks who really wanted it, but I've never had a 
need for such a thing.   And you can be sure I work on some very large and 
sprawling SQLAlchemy models these days.

There are some caches here and there like the identifier cache as well as 
caches inside of TypeEngine objects, but these caches are all intended to be of 
limited size.




Re: [sqlalchemy] Re: Working with large IN lists

2012-02-22 Thread Claudio Freire
On Wed, Feb 22, 2012 at 6:21 PM, Michael Bayer mike...@zzzcomputing.com wrote:
 IMHO the whole point of using a high level, interpreted language like Python 
 is that we don't have to be bogged down thinking like C programmers.   How 
 come I've never had a memory fragmentation issue before ?      I've made 
 precompilation an option for folks who really wanted it but I've never had 
 a need for such a thing.   And you can be sure I work on some very large and 
 sprawling SQLAlchemy models these days.

Maybe you never used big objects.

Memory fragmentation arises only when the application handles a
mixture of big and small objects, such that holes created by small
objects being freed don't serve big memory requirements.

If your application handles a homogeneous workload (i.e. every request
is pretty much the same), as is usual, then you probably won't
experience fragmentation.

My application does the usual small-object work, interspersed with
intense computation on big objects, hence my troubles.

Python's garbage collector has been a pending issue for a long time,
but, as I noticed in the linked page, past architectural decisions
prevent some widely desired improvements.




Re: [sqlalchemy] Re: Working with large IN lists

2012-02-22 Thread Vlad K.


Okay, thanks to this article:

http://neverfear.org/blog/view/155/Investigating_memory_leaks_in_Python


I made similar plot of object counts in time, showing top 50 types. The 
resulting PDF is here (you might wish to download it first, Google 
messes it up for me):


https://docs.google.com/open?id=0ByLiBlA59qDwYTY1MGIzYWEtYjMxZi00ZDVlLTk0OTEtOGI2ZjA3NDgyM2Y3


Everything seems to grow linearly in count. Something is keeping all 
those objects referenced somewhere. What could possibly be the cause?



.oO V Oo.





Re: [sqlalchemy] Re: Working with large IN lists

2012-02-22 Thread Michael Bayer

On Feb 22, 2012, at 6:36 PM, Vlad K. wrote:

 
 Okay, thanks to this article:
 
 http://neverfear.org/blog/view/155/Investigating_memory_leaks_in_Python
 
 
 I made similar plot of object counts in time, showing top 50 types. The 
 resulting PDF is here (you might wish to download it first, Google messes it 
 up for me):
 
 https://docs.google.com/open?id=0ByLiBlA59qDwYTY1MGIzYWEtYjMxZi00ZDVlLTk0OTEtOGI2ZjA3NDgyM2Y3
 
 
 Everything seems to grow linearly in count. Something is keeping all those 
 objects referenced somewhere. What could possibly be the cause?


can you provide a self-contained, single file test case that illustrates the 
memory growth ?





[sqlalchemy] Re: Working with large IN lists

2012-02-21 Thread Manav Goel
This depends upon the execution plan of the query and is really more a
PostgreSQL question. Google "postgresql IN performance" and you will get
a good idea of it.
By the look of your code, the second option would obviously be faster, as
it hits the database once, whereas the first one flushes after every change.

Regards,

On Feb 21, 6:07 pm, Vlad K. v...@haronmedia.com wrote:
 Hi all,

 I have to read thousands of rows from a table and compile some data, but
 in certain conditions update those rows, all with the same value. The ratio
 of reads to writes here is the widest possible: sometimes no rows,
 sometimes a few, and sometimes all rows that are read have to be updated.
 That last scenario is what concerns me.

 for row in query.yield_per(100):
      # Do something with data

      if some_condition:
          row.some_column = 123
      session.flush()

 I am thinking about just adding the row's ID to a list:

 list_of_ids = []
 for row in query.yield_per(100):
      # Do something with data

      if some_condition:
          list_of_ids.append(row.primary_key)

 and near the end of transaction do:

 session.query(Model).filter(Model.primary_key.in_(list_of_ids)).update(
     {"some_column": 123}, False)

 Yes I'm aware of increased memory requirements to store the ID list on
 the application side, and no I don't need to lock the rows for update,
 the logic of atomic update at the end is sufficient for my case.

 But I think, and real-use benchmarks will probably show (I haven't
 tested yet), that a single update query will work faster. I need the lowest
 transaction processing time on the application side for the entire call,
 even if it takes more memory and more database iron.

 What I'm concerned with here is if there are any limits or significant
 overheads with large .in_ lists?

 The backend is PostgreSQL via psycopg2.

 Thanks

 --

 .oO V Oo.




Re: [sqlalchemy] Re: Working with large IN lists

2012-02-21 Thread Simon King
On Tue, Feb 21, 2012 at 3:24 PM, Manav Goel manav.goe...@gmail.com wrote:
 This depends upon the execution plan of the query and is more really a
 postgresql question. Google postgresql IN performance and you will get
 a good idea of it.
 By the look of your code, Second option would obviously be faster as
 it hits database once whereas first one flush after every change.

 Regards,

 On Feb 21, 6:07 pm, Vlad K. v...@haronmedia.com wrote:
 Hi all,

 I have to read thousands of rows from a table and compile some data, but
 in certain conditions update those rows, all with the same value. The ratio
 of reads to writes here is the widest possible: sometimes no rows,
 sometimes a few, and sometimes all rows that are read have to be updated.
 That last scenario is what concerns me.

 for row in query.yield_per(100):
      # Do something with data

      if some_condition:
          row.some_column = 123
      session.flush()

 I am thinking about just adding the row's ID to a list:

 list_of_ids = []
 for row in query.yield_per(100):
      # Do something with data

      if some_condition:
          list_of_ids.append(row.primary_key)

 and near the end of transaction do:

 session.query(Model).filter(Model.primary_key.in_(list_of_ids)).update(
     {"some_column": 123}, False)

 Yes I'm aware of increased memory requirements to store the ID list on
 the application side, and no I don't need to lock the rows for update,
 the logic of atomic update at the end is sufficient for my case.

 But I think, and real-use benchmarks will probably show (I haven't
 tested yet), that a single update query will work faster. I need the lowest
 transaction processing time on the application side for the entire call,
 even if it takes more memory and more database iron.

 What I'm concerned with here is if there are any limits or significant
 overheads with large .in_ lists?

 The backend is PostgreSQL via psycopg2.

 Thanks

 --

 .oO V Oo.

A *long* time ago (SQLAlchemy 0.3), I had some performance problems
with large IN clauses, and the reason turned out to be SQLAlchemy
taking a lot of time to build long lists of bindparam objects. I've no
idea if this is still the case these days. The best thing you can do
is just try it.

Simon




Re: [sqlalchemy] Re: Working with large IN lists

2012-02-21 Thread Vlad K.


Thanks for your replies.

Using the IN list definitely speeds up the process, but I hate the 
resulting query which uses bound variables for each and every element of 
the list.


But I have another problem with this: there's a massive memory leak 
somewhere. Take a look at this model:




class GatewayTransaction(Base):
    __tablename__ = "gateway_transactions"

    realestate_id = Column(Integer, ForeignKey("realestate.realestate_id",
                                               ondelete="set null",
                                               onupdate="cascade"),
                           primary_key=True)
    portal_id = Column(Text, primary_key=True)
    realestate_portal_id = Column(Unicode)
    operation = Column(Text, nullable=False)
    agency_id = Column(Integer, ForeignKey("agencies.agency_id",
                                           ondelete="set null",
                                           onupdate="cascade"),
                       nullable=False)
    agency_portal_id = Column(Unicode, nullable=False)
    agency_export_token = Column(Unicode, nullable=False)
    user_id = Column(Integer, ForeignKey("users.user_id",
                                         ondelete="set null",
                                         onupdate="cascade"),
                     nullable=False)
    mod_images = Column(Boolean)

    agency = relationship("Agency", lazy="joined")
    realestate = relationship("Realestate", lazy="joined")
    user = relationship("User", lazy="joined")




Now, when I do this:


for row in some_query.all():
    gt = session.query(GatewayTransaction)\
        .filter(GatewayTransaction.realestate_id == row.realestate_id)\
        .filter(GatewayTransaction.portal_id == k)\
        .first() or GatewayTransaction()

    # Do some data processing
    #
    #

    # Update existing or insert as new
    gt = session.merge(gt)
    session.flush()


It is very, very slow: it takes minutes to process 2000 rows, and memory 
usage skyrockets into the multiple-GB range so that I have to terminate it 
before it starts swapping like hell. With lazy='select' it flies, 
done in a couple of seconds with very little memory consumed, because at 
this point there are no rows in the table so nothing is additionally 
selected, only inserted. Still, why would a join slow things down so 
drastically and shoot Python memory usage (not the DB's) sky-high?
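
One quick way to test whether the joined eager loads are the culprit is to 
turn them off for just this lookup with the standard lazyload() option (a 
sketch mirroring the loop above, not code from the app):

from sqlalchemy.orm import lazyload

gt = (session.query(GatewayTransaction)
             .options(lazyload(GatewayTransaction.agency),
                      lazyload(GatewayTransaction.realestate),
                      lazyload(GatewayTransaction.user))
             .filter(GatewayTransaction.realestate_id == row.realestate_id)
             .filter(GatewayTransaction.portal_id == k)
             .first()) or GatewayTransaction()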


Also, even if I try session.expunge(gt) or expunge_all() (after first 
preparing the rows to be loaded one by one from a list of IDs), the 
memory always keeps growing, as if the instances never die and never get 
garbage collected...




.oO V Oo.


On 02/21/2012 04:35 PM, Simon King wrote:


A *long* time ago (SQLAlchemy 0.3), I had some performance problems
with large IN clauses, and the reason turned out to be SQLAlchemy
taking a lot of time to build long lists of bindparam objects. I've no
idea if this is still the case these days. The best thing you can do
is just try it.

Simon


