Re: [sqlalchemy] Understanding memory usage under SQLA

2012-06-21 Thread Vlad K.


On 05/17/2012 05:09 PM, Claudio Freire wrote:
Precompiling queries in SQLA to populate the various SQLA compiler 
caches, and doing some queries that exercise libpq and psycopg2 (and 
thus make them allocate whatever permanent data structures they need), 
all at load time, will help keep fragmentation to a minimum. 
Fragmentation is a complex issue, and both Python and SQLA are quite 
prone to it. But it can be worked around. 



Hi, I'm coming back to this email because I'm interested in this idea 
of precompiling queries. What exactly did you mean? Obtaining the 
finalized query string?
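
For instance, is it something along these lines? A rough sketch of what 
I understand by it (MainRow here is a hypothetical stand-in for one of 
my mapped classes, and session is the usual Session from my script):

from sqlalchemy.dialects import postgresql

# Compile a representative query against the PostgreSQL dialect at load
# time, which I assume is what would populate the compiler caches.
query = session.query(MainRow).filter(MainRow.id == 1)
compiled = query.statement.compile(dialect=postgresql.dialect())
print str(compiled)   # the finalized SQL string with bind placeholders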



--

.oO V Oo.




[sqlalchemy] Understanding memory usage under SQLA

2012-05-17 Thread Vlad K.

  
  

Hello.

I have a problem when processing a relatively large number of rows.
For example, when selecting 5000 main rows, each having a number of
many-to-one relationships, memory usage shown by top skyrockets into
the 200+ MB range (RES), while heapy shows roughly 20 MB of Python
heap. The backend is PostgreSQL via psycopg2.

I've made a minimal example case based on the problem I'm noticing
in my Pyramid app, so the session.commit() at line 130 is there to
simulate the commit done by the transaction package used in Pyramid
at the end of each request. If I'm understanding things correctly,
committing should expire all objects involved in the session, and I
even tried a manual session.expunge(row), but there is no difference
in memory usage.

The following is the source of the example case. It requires
SQLAlchemy (tested with 0.7.5 and 0.7.7), guppy, and psycopg2 (tested
with 2.4.2 and 2.4.4). It happens both on Fedora 15 64-bit and CentOS
6.2 32-bit, though of course the 32-bit machine shows some 30% lower
RES in top.

http://pastebin.com/UFgduWVw


Usage: set up a test database and update the config at line 25.
Prepopulate the database with the -p flag, then run again without any
flags.

I don't see where or how any objects would remain in memory, and
heapy showing much lower memory use suggests something is retained
in the involved C extensions? I also tried with pympler; a diff taken
before and after selecting the rows shows nothing near what top
reports. I guess there is no "leak" in the traditional sense of the
word, because repeating the task does not yield growing memory
consumption. It stabilizes at a certain value and stays there.

Heapy before selecting rows:

Partition of a set of 102014 objects. Total size = 13160672 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  45901  45  4395296  33    4395296  33 str
     1  26041  26  2186184  17    6581480  50 tuple
     2   7039   7   900992   7    7482472  57 types.CodeType
     3   6836   7   820320   6    8302792  63 function
     4    235   0   761608   6    9064400  69 dict of module
     5    608   1   689792   5    9754192  74 dict (no owner)
     6    676   1   648544   5   10402736  79 dict of type
     7    676   1   608344   5   11011080  84 type
     8    199   0   206248   2   11217328  85 dict of class
     9    185   0   167320   1   11384648  87 sqlalchemy.sql.visitors.VisitableType
<334 more rows. Type e.g. '_.more' to view.>

Heapy after 5000 rows have been selected:

Partition of a set of 102587 objects. Total size = 16455168 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  45923  45  4397632  27    4397632  27 str
     1      1   0  3146024  19    7543656  46 sqlalchemy.orm.identity.WeakInstanceDict
     2  26090  25  2189480  13    9733136  59 tuple
     3   7039   7   900992   5   10634128  65 types.CodeType
     4   6859   7   823080   5   11457208  70 function
     5    235   0   761608   5   12218816  74 dict of module
     6    657   1   705048   4   12923864  79 dict (no owner)
     7    676   1   650464   4   13574328  82 dict of type
     8    676   1   608344   4   14182672  86 type
     9    199   0   206248   1   14388920  87 dict of class
<372 more rows. Type e.g. '_.more' to view.>




What am I doing wrong? I'm hoping it's something trivial and
blatantly obvious that I'm oblivious to. :)


Thanks.

-- 

.oO V Oo.
  








Re: [sqlalchemy] Understanding memory usage under SQLA

2012-05-17 Thread Michael Bayer
There are a few different parts to what you're asking.

The first is that you're comparing Python's use of OS memory (I'm assuming this 
is the 200+ MB) to the actual number of Python objects present.   This is a 
common mistake.  Python up through version 2.6 does not release memory back to 
the OS once taken - this was improved in 2.7.   There's an old article about 
this here: 
http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm
as well as Alex Martelli's answer: http://stackoverflow.com/a/1316799/34549 .

Second is, what exactly is the large object you're creating here?   Answer - 
first, psycopg2 by default buffers the result set fully before returning it to 
SQLAlchemy - so it is first a list of 5000 tuples.  Second, the ORM itself also 
by default buffers the full set of rows from the result set in the form of 
mapped objects, so 5000 objects plus their related objects.   A way to modify 
this behavior is to use the yield_per() option of Query, which in the case of 
psycopg2 will also tell psycopg2 to use its server-side cursor feature, which 
does not buffer.
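
For example, a minimal sketch of the yield_per() approach, assuming the MainRow 
mapped class from the test script (hypothetical name) and no eager loaders 
configured:

# Fetch and build mapped objects 100 at a time instead of buffering all
# 5000 up front; with psycopg2 this also switches to a server-side cursor.
for row in session.query(MainRow).yield_per(100):
    process(row)   # process() is a placeholder for the per-row work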

However, yield_per() is not compatible with eager loading, as eager loading 
involves being able to load collections across the full set of original 
objects.   Typically the better way to deal with large numbers of rows is to 
paginate, using either LIMIT/OFFSET or window functions (see 
http://www.sqlalchemy.org/trac/wiki/UsageRecipes/WindowedRangeQuery ).
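
A simplified sketch of the pagination idea, walking the primary key in windows 
rather than using the row_number() recipe at that link (MainRow again standing 
in for a hypothetical mapped class with an integer id):

window = 500
last_id = 0
while True:
    rows = (session.query(MainRow)
                   .filter(MainRow.id > last_id)
                   .order_by(MainRow.id)
                   .limit(window)
                   .all())
    if not rows:
        break
    for row in rows:
        process(row)         # placeholder for the per-row work
    last_id = rows[-1].id
    session.expunge_all()    # release the window so it can be collected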

Thirdly, there is a modest growth in memory when a series of mappings are used 
for the first time, including the configuration of mappers, initialization of 
TypeEngine value processors, and such.   But the initial large resultset is the 
main thing causing the higher initial memory footprint.   You'll notice this 
isn't a leak at all, as it doesn't grow.


On May 17, 2012, at 7:33 AM, Vlad K. wrote:

 
 Hello.
 
 I have a problem when processing relatively large number of rows. For 
 example, when selecting 5000 main rows, each having a number of many-to-one 
 relationships, memory usage shown by top skyrockets into 200+ MB range (RES), 
 while heapy shows cca 20MB of Python heap. PostgreSQL backend via psycopg2.
 
 I've made a minimum example case based on the problem I'm noticing in my 
 Pyramid app, so the session.commit() at line 130 is there to simulate commit 
 done by Transaction used in Pyramid at the end of each request. If I'm 
 understanding things correctly, committing would expire all objects involved 
 in the session, and I even tried manual session.expunge(row), but there is no 
 difference in memory usage.
 
 The following is source of an example case. Requires SQLAlchemy (tested with 
 0.7.5 and 0.7.7), guppy, psycopg2 (tested with 2.4.2 and 2.4.4). Happens both 
 on Fedora 15 64-bit and CentOS 6.2 32-bit, though of course the 32-bit shows 
 some 30% lower RES in top.
 
 http://pastebin.com/UFgduWVw
 
 
 Usage: setup a test database, update line 25 config. Prepopulate database 
 with -p flag, then run again without any flags.
 
 I don't see where and how would any objects remain in memory, and heapy 
 showing much lower memory use suggests something is retained in the involved 
 C extensions?  I also tried with pympler, diff before and after selecting 
 rows, shows nothing near reported by top. I guess there is no leak in 
 traditional sense of the word because repeating the task does not yield 
 growing memory consumption. It stabilizes at certain value and stays there.
 
 Heapy before selecting rows:
 
 Partition of a set of 102014 objects. Total size = 13160672 bytes.
  Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
      0  45901  45  4395296  33    4395296  33 str
      1  26041  26  2186184  17    6581480  50 tuple
      2   7039   7   900992   7    7482472  57 types.CodeType
      3   6836   7   820320   6    8302792  63 function
      4    235   0   761608   6    9064400  69 dict of module
      5    608   1   689792   5    9754192  74 dict (no owner)
      6    676   1   648544   5   10402736  79 dict of type
      7    676   1   608344   5   11011080  84 type
      8    199   0   206248   2   11217328  85 dict of class
      9    185   0   167320   1   11384648  87 sqlalchemy.sql.visitors.VisitableType
 <334 more rows. Type e.g. '_.more' to view.>
 
 Heapy after 5000 rows have been selected:
 
 Partition of a set of 102587 objects. Total size = 16455168 bytes.
  Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
      0  45923  45  4397632  27    4397632  27 str
      1      1   0  3146024  19    7543656  46 sqlalchemy.orm.identity.WeakInstanceDict
      2  26090  25  2189480  13    9733136  59 tuple
      3   7039   7   900992   5   10634128  65 types.CodeType
      4   6859   7   823080   5   11457208  70 function
      5    235   0   761608   5   12218816  74 dict of module
      6    657   1   705048   4   12923864  79 dict (no owner)
      7    676   1   650464   4 

Re: [sqlalchemy] Understanding memory usage under SQLA

2012-05-17 Thread Claudio Freire
On Thu, May 17, 2012 at 10:21 AM, Michael Bayer
mike...@zzzcomputing.com wrote:
 There's a few different parts to what you're asking.

 The first is that you're comparing Python's use of OS memory (I'm assuming
 this is the 200+ MB) to Python's actual amount of objects present.   This is
 a common mistake.  Python up through version 2.6 does not release memory
 back to the OS once taken - this was improved in 2.7.   There's an old
 article about this here:
  http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm
 as well as Alex Martelli's answer: http://stackoverflow.com/a/1316799/34549

I think you're mistaken about the version number in which that was changed.
The article you pointed to (and my memory) seems to say it was in 2.5, not 2.7.

 Second is, what exactly is the large object you're creating here ?   Answer
 - first, psycopg2 by default buffers the result set fully before returning
 it to SQLAlchemy - so it is first a list of 5000 tuples.  Second, the ORM
 itself also by default buffers the full set of rows from the result set in
 the form of mapped objects, so 5000 objects plus their related objects.    A
 way to modify this behavior is to use the yield_per() option of Query, which
 will also in the case of psycopg2 tell psycopg2 to use its server side
 cursors feature which does not buffer.

It's important to note that everything you mention on the Python side
would show up in heapy. But psycopg2's buffers would not, nor would
libpq's.

 On May 17, 2012, at 7:33 AM, Vlad K. wrote:
 The following is source of an example case. Requires SQLAlchemy (tested with
 0.7.5 and 0.7.7), guppy, psycopg2 (tested with 2.4.2 and 2.4.4). Happens
 both on Fedora 15 64-bit and CentOS 6.2 32-bit, though of course the 32-bit
 shows some 30% lower RES in top.

Update to psycopg2 2.4.5; it fixes some memory leaks (they call them
reference counting problems):
http://initd.org/psycopg/articles/2012/03/29/psycopg-245-released/

Also... which version of libpq did you build psycopg2 with?

 I don't see where and how would any objects remain in memory, and heapy
 showing much lower memory use suggests something is retained in the involved
 C extensions?

C extensions, or memory fragmentation.

If you read Spanish, check this out:
http://python.org.ar/pyar/Charlas#Depuraci.2BAPM-n_y_defragmentaci.2BAPM-n_de_memoria_en_Python

If not... ehm... try Google Translate ;-)

 Heapy before selecting rows:
 Heapy after 5000 rows have been selected:

Try the difference for more insight:

from guppy import hpy

hp = hpy()
h1 = hp.heap()
# ... run the query / do some work ...
h2 = hp.heap()
print repr(h2 - h1)




Re: [sqlalchemy] Understanding memory usage under SQLA

2012-05-17 Thread Vlad K.


Hi, thanks for your reply.

Yes, I know the Python memory management model and that while it may 
free memory internally, it does not return it to the OS. I've read 
somewhere that this has been fixed in 2.7 as well, but my testing on 
Fedora's 2.7.1 still shows the issue. That's why I thought perhaps there 
are dangling references in the C extension parts that are not visible to 
heapy (is that even possible?).


I tried yield_per(), and you once told me that it can't work if I use 
subqueryload, so I tried without subqueries. The problem is that the 
script then goes from 5 seconds to over a minute to process the same 
dataset, which is unacceptable to me for other reasons (and which is 
expected, as there are two additional queries for each of the 5000 rows, 
making the app do 10001 queries plus ORM overhead on each). However, with 
yield_per() the memory consumption stays as low as before the querying 
begins.


I've got three possible solutions here. One is repeated querying with a 
limited result set AND subqueryloading, which works like yield_per except 
that it requires additional sorting and an offset. I just tried that, and 
it indeed consumes much less memory. With batches of 500 rows at a time 
(and with full subqueryloads) the memory consumption is 1/10 of loading 
all rows at once, which figures: 500 is 1/10 of 5000. This is acceptable.
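
A minimal sketch of that batching loop (MainRow and its children 
relationship are hypothetical stand-ins for the classes in the pastebin 
script):

from sqlalchemy.orm import subqueryload

batch = 500
offset = 0
while True:
    rows = (session.query(MainRow)
                   .options(subqueryload(MainRow.children))  # eager-load collections
                   .order_by(MainRow.id)
                   .limit(batch)
                   .offset(offset)
                   .all())
    if not rows:
        break
    for row in rows:
        process(row)         # placeholder for the real per-row work
    session.expunge_all()    # drop the batch so it can be garbage collected
    offset += batch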


Another is (materialized) views on the DB end, maintained with triggers, 
and an entirely new model to select the data from.


And yet another solution is to drop the ORM and construct queries manually, 
returning the relational data as subselects in arrays, and add a thin 
ORM-like layer that just converts row columns to named tuples so that 
the consumers of this data can use the same model interface. But I'm 
guessing this is no different from the (materialized) views approach, 
except that the combining is done in the DB and not in the Python app. I 
still need a separate model class or named tuples either way.
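
A minimal sketch of what I mean by that thin layer, just to illustrate the 
idea (the table, column names, and query are hypothetical):

from collections import namedtuple

MainRowView = namedtuple('MainRowView', ['id', 'title', 'child_ids'])

def iter_main_rows(connection):
    # Raw SQL with the related ids pulled in as an array subselect, so a
    # single round trip returns everything the consumer needs.
    result = connection.execute("""
        SELECT m.id,
               m.title,
               ARRAY(SELECT c.id FROM child c WHERE c.main_id = m.id) AS child_ids
        FROM main m
        ORDER BY m.id
    """)
    for row in result:
        yield MainRowView(*row)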





.oO V Oo.


On 05/17/2012 03:21 PM, Michael Bayer wrote:

There's a few different parts to what you're asking.

The first is that you're comparing Python's use of OS memory (I'm 
assuming this is the 200+ MB) to Python's actual amount of objects 
present.   This is a common mistake.  Python up through version 2.6 
does not release memory back to the OS once taken - this was improved 
in 2.7.   There's an old article about this here: 
http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm 
as well as Alex Martelli's answer: 
http://stackoverflow.com/a/1316799/34549 .


Second is, what exactly is the large object you're creating here ?   
Answer - first, psycopg2 by default buffers the result set fully 
before returning it to SQLAlchemy - so it is first a list of 5000 
tuples.  Second, the ORM itself also by default buffers the full set 
of rows from the result set in the form of mapped objects, so 5000 
objects plus their related objects.   A way to modify this behavior 
is to use the yield_per() option of Query, which will also in the case 
of psycopg2 tell psycopg2 to use its server side cursors feature 
which does not buffer.


However, yield_per() is not compatible with eager loading as eager 
loading involves being able to load collections across the full set of 
original objects.   Typically the better way to deal with large 
numbers of rows is to paginate, using either LIMIT/OFFSET or using 
window functions (see 
http://www.sqlalchemy.org/trac/wiki/UsageRecipes/WindowedRangeQuery ).


Thirdly, there is a modest growth in memory when a series of mappings 
are used for the first time, including the configuration of mappers, 
initialization of TypeEngine value processors, and such.   But the 
initial large resultset is the main thing causing the higher initial 
memory footprint.   You'll notice this isn't a leak at all, as it 
doesn't grow.



On May 17, 2012, at 7:33 AM, Vlad K. wrote:



Hello.

I have a problem when processing relatively large number of rows. For 
example, when selecting 5000 main rows, each having a number of 
many-to-one relationships, memory usage shown by top skyrockets into 
200+ MB range (RES), while heapy shows cca 20MB of Python heap. 
PostgreSQL backend via psycopg2.


I've made a minimum example case based on the problem I'm noticing in 
my Pyramid app, so the session.commit() at line 130 is there to 
simulate commit done by Transaction used in Pyramid at the end of 
each request. If I'm understanding things correctly, committing would 
expire all objects involved in the session, and I even tried manual 
session.expunge(row), but there is no difference in memory usage.


The following is source of an example case. Requires SQLAlchemy 
(tested with 0.7.5 and 0.7.7), guppy, psycopg2 (tested with 2.4.2 and 
2.4.4). Happens both on Fedora 15 64-bit and CentOS 6.2 32-bit, 
though of course the 32-bit shows some 30% lower RES in top.


http://pastebin.com/UFgduWVw


Usage: setup a test database, update line 25 config. Prepopulate 
database with -p flag, then run again 

Re: [sqlalchemy] Understanding memory usage under SQLA

2012-05-17 Thread Claudio Freire
On Thu, May 17, 2012 at 11:57 AM, Vlad K. v...@haronmedia.com wrote:
 PostgreSQL 9.0.7, libpq is part of the same version -devel package
 PostgreSQL 9.1.3 (via pgsql yum repo), libpq is part of the same version -devel package

9.1 (the one you have in production) also has leak-related fixes. They
don't necessarily apply to your problem, since they're somewhat esoteric,
but it's worth noting.

Even the tiniest memory leak will fragment your process' heap, and
cause it not to release those 200M.

In those cases, especially when the leaks are one-time (like in the
libpq case), it's convenient to force the leak to happen at load time.

Precompiling queries in SQLA to populate the various SQLA compiler
caches, and doing some queries that exercise libpq and psycopg2 (and
thus make them allocate whatever permanent data structures they need),
all at load time, will help keep fragmentation to a minimum.

Fragmentation is a complex issue, and both Python and SQLA are quite
prone to it. But it can be worked around.
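
For instance, a minimal warm-up sketch along those lines, run once at 
application startup (MainRow is a hypothetical mapped class; the actual 
queries would mirror whatever the app does at runtime):

from sqlalchemy.dialects import postgresql

def warm_up(session):
    # Force SQLA to build and cache the compiled form of a representative
    # query, and make psycopg2/libpq allocate their long-lived structures,
    # before the process heap has had a chance to fragment.
    query = session.query(MainRow).filter(MainRow.id == 1)
    str(query.statement.compile(dialect=postgresql.dialect()))
    query.first()       # one real round trip through psycopg2 and libpq
    session.rollback()  # leave the session clean for normal operation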
