Hi, thanks for your reply.
Yes, I know the Python memory management model and that while it may
free internally, it does not to OS. I've read somewhere that it has been
fixed in 2.7 as well, but my testing on Fedora's 2.7.1 still shows the
issue. That's why I thought perhaps there are dangling references in the
C extension parts that are not visible to heapy (is that even possible?).
I tried with yield_per() and you once told me how that can't work if I
use subqueryload, so I tried without subqueries. The problem is that the
script then shoots from 5 seconds to over a minute to process same
dataset which is unacceptable to me for other reasons (which is expected
as there are two additional queries per each of the 5000 rows, making
the app do 10001 queries + ORM overhead on each). However, with
yield_per() the memory consumption stays as low as before the querying
begins.
I've got three possible solutions here. One is repeated querying with
limited result set AND subqueryloading which works like yield_per,
except it requires additional sorting and offset. I just tried that and
it indeed consumes much less memory. With sets 500 rows at once (and
with full subqueryloads) the memory consumption is 1/10 of loading all
rows at once which figures, 500 is 1/10 of 5000. This is acceptable.
Another is (materialized) views on the DB end with triggers and entire
new model to select data from.
And yet another solution is to drop ORM and construct queries manually,
returning relational data as subselects in arrays, and add a thin
"ORM"-like layer that just converts row columns to named tuples so that
the consumers of this data can use same model interface. But I'm
guessing this is no different than the (materialized) views approach
except the combining is done in the DB and not in the Python app. I
still need separate model class or named tuples.
.oO V Oo.
On 05/17/2012 03:21 PM, Michael Bayer wrote:
There's a few different parts to what you're asking.
The first is that you're comparing Python's use of OS memory (I'm
assuming this is the 200+ MB) to Python's actual amount of objects
present. This is a common mistake. Python up through version 2.6
does not release memory back to the OS once taken - this was improved
in 2.7. There's an old article about this here:
http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm
as well as Alex Martelli's answer:
http://stackoverflow.com/a/1316799/34549 .
Second is, what exactly is the large object you're creating here ?
Answer - first, psycopg2 by default buffers the result set fully
before returning it to SQLAlchemy - so it is first a list of 5000
tuples. Second, the ORM itself also by default buffers the full set
of rows from the result set in the form of mapped objects, so 5000
objects plus their related objects. A way to modify this behavior
is to use the yield_per() option of Query, which will also in the case
of psycopg2 tell psycopg2 to use its "server side cursors" feature
which does not buffer.
However, "yield_per()" is not compatible with eager loading as eager
loading involves being able to load collections across the full set of
original objects. Typically the better way to deal with large
numbers of rows is to paginate, using either LIMIT/OFFSET or using
window functions (see
http://www.sqlalchemy.org/trac/wiki/UsageRecipes/WindowedRangeQuery ).
Thirdly, there is a modest growth in memory when a series of mappings
are used for the first time, including the configuration of mappers,
initialization of TypeEngine value processors, and such. But the
initial large resultset is the main thing causing the higher initial
memory footprint. You'll notice this isn't a "leak" at all, as it
doesn't grow.
On May 17, 2012, at 7:33 AM, Vlad K. wrote:
Hello.
I have a problem when processing relatively large number of rows. For
example, when selecting 5000 main rows, each having a number of
many-to-one relationships, memory usage shown by top skyrockets into
200+ MB range (RES), while heapy shows cca 20MB of Python heap.
PostgreSQL backend via psycopg2.
I've made a minimum example case based on the problem I'm noticing in
my Pyramid app, so the session.commit() at line 130 is there to
simulate commit done by Transaction used in Pyramid at the end of
each request. If I'm understanding things correctly, committing would
expire all objects involved in the session, and I even tried manual
session.expunge(row), but there is no difference in memory usage.
The following is source of an example case. Requires SQLAlchemy
(tested with 0.7.5 and 0.7.7), guppy, psycopg2 (tested with 2.4.2 and
2.4.4). Happens both on Fedora 15 64-bit and CentOS 6.2 32-bit,
though of course the 32-bit shows some 30% lower RES in top.
http://pastebin.com/UFgduWVw
Usage: setup a test database, update line 25 config. Prepopulate
database with -p flag, then run again without any flags.
I don't see where and how would any objects remain in memory, and
heapy showing much lower memory use suggests something is retained in
the involved C extensions? I also tried with pympler, diff before
and after selecting rows, shows nothing near reported by top. I guess
there is no "leak" in traditional sense of the word because repeating
the task does not yield growing memory consumption. It stabilizes at
certain value and stays there.
Heapy before selecting rows:
Partition of a set of 102014 objects. Total size = 13160672 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 45901 45 4395296 33 4395296 33 str
1 26041 26 2186184 17 6581480 50 tuple
2 7039 7 900992 7 7482472 57 types.CodeType
3 6836 7 820320 6 8302792 63 function
4 235 0 761608 6 9064400 69 dict of module
5 608 1 689792 5 9754192 74 dict (no owner)
6 676 1 648544 5 10402736 79 dict of type
7 676 1 608344 5 11011080 84 type
8 199 0 206248 2 11217328 85 dict of class
9 185 0 167320 1 11384648 87
sqlalchemy.sql.visitors.VisitableType
<334 more rows. Type e.g. '_.more' to view.>
Heapy after 5000 rows have been selected:
Partition of a set of 102587 objects. Total size = 16455168 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 45923 45 4397632 27 4397632 27 str
1 1 0 3146024 19 7543656 46
sqlalchemy.orm.identity.WeakInstanceDict
2 26090 25 2189480 13 9733136 59 tuple
3 7039 7 900992 5 10634128 65 types.CodeType
4 6859 7 823080 5 11457208 70 function
5 235 0 761608 5 12218816 74 dict of module
6 657 1 705048 4 12923864 79 dict (no owner)
7 676 1 650464 4 13574328 82 dict of type
8 676 1 608344 4 14182672 86 type
9 199 0 206248 1 14388920 87 dict of class
<372 more rows. Type e.g. '_.more' to view.>
What am I doing wrong? I'm hoping something trivial and blatantly
obvious that I'm oblivious to. :)
Thanks.
--
.oO V Oo.
--
You received this message because you are subscribed to the Google
Groups "sqlalchemy" group.
To post to this group, send email to sqlalchemy@googlegroups.com
<mailto:sqlalchemy@googlegroups.com>.
To unsubscribe from this group, send email to
sqlalchemy+unsubscr...@googlegroups.com
<mailto:sqlalchemy+unsubscr...@googlegroups.com>.
For more options, visit this group at
http://groups.google.com/group/sqlalchemy?hl=en.
--
You received this message because you are subscribed to the Google
Groups "sqlalchemy" group.
To post to this group, send email to sqlalchemy@googlegroups.com.
To unsubscribe from this group, send email to
sqlalchemy+unsubscr...@googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/sqlalchemy?hl=en.
--
You received this message because you are subscribed to the Google Groups
"sqlalchemy" group.
To post to this group, send email to sqlalchemy@googlegroups.com.
To unsubscribe from this group, send email to
sqlalchemy+unsubscr...@googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/sqlalchemy?hl=en.