On Wed, Jan 2, 2019 at 10:49 AM <yoav.dani...@otorio.co> wrote:
>
> But how does storing all the data in one list without using fetchall not 
> cause a leak? The code where I use the cursor doesn't leak unless I use 
> fetchall there, and both with and without fetchall it creates a 25k-row list.

The classic definition of "leak" is unbounded memory growth. I use
psycopg2 and do:

while True:
    cursor.execute(<query>)
    cursor.fetchall()


A "leak" would mean the process gets bigger and bigger on each
iteration until the whole server runs out of memory, even though the
loop is not saving the results of each operation. Is that what you're
observing?

If, OTOH, the "leak" is simply "fetchall() of 25K rows takes up a lot
of memory and makes the process grow very large, but once at that
large size it just stays there," that's not a leak; that's just the
Python interpreter's way of allocating memory, which is different when
using native C code like psycopg2. "Leaks," OTOH, are unbounded.
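To make that distinction concrete, here is a small stand-alone sketch (plain Python, no database involved; the row shape and sizes are made up purely for illustration) contrasting unbounded growth with a large-but-stable working set:

```python
import tracemalloc

def run(leaky, iterations=5):
    """Simulate repeated 25K-row fetches; return traced memory after each pass."""
    retained = []   # only the "leaky" variant keeps appending to this
    samples = []
    tracemalloc.start()
    for _ in range(iterations):
        rows = [{"id": i, "payload": "x" * 100} for i in range(25_000)]
        if leaky:
            retained.append(rows)   # unbounded: every result set stays alive
        # otherwise `rows` is dropped and its memory is reused on the next pass
        samples.append(tracemalloc.get_traced_memory()[0])
    tracemalloc.stop()
    return samples

steady = run(leaky=False)   # large but flat across iterations: not a leak
leaked = run(leaky=True)    # grows on every pass: an actual leak
```

In the steady case the process just sits at a high-water mark; only the leaky case grows without bound.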

If you are in fact observing the former, then that's a driver issue in
any case and you should report it to the psycopg2 mailing list; and if
you think the latter case is still unreasonable, again that's a
psycopg2 question and I'm sure they'd have a lot to say about it.

As for the difference between:

[dict(row) for row in cursor]

and

[dict(row) for row in cursor.fetchall()]

that's again local to psycopg2, and if you are observing different
memory behavior between those two variants, ask them about it and feel
free to CC me so I can follow along.
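For illustration, the shape of that difference can be sketched with the stdlib sqlite3 driver standing in for psycopg2 (both are PEP 249 drivers, though their buffering differs, so this shows only the two allocation patterns, not psycopg2's actual behavior):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(i, "row-%d" % i) for i in range(25_000)],
)

# Variant 1: iterate the cursor; rows come through the driver one at a
# time, so only the final list of dicts is built in Python.
cur = conn.execute("SELECT id, val FROM t ORDER BY id")
streamed = [dict(zip(("id", "val"), row)) for row in cur]

# Variant 2: fetchall() first materializes one full list of raw driver
# rows, then the comprehension builds a second full list of dicts, so
# two complete copies of the result exist at the same moment.
cur = conn.execute("SELECT id, val FROM t ORDER BY id")
raw = cur.fetchall()
materialized = [dict(zip(("id", "val"), row)) for row in raw]

assert streamed == materialized  # same data either way
```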

For the SQLAlchemy ORM, we do not use fetchall() when you use
query.all() with yield_per(), because, as stated above, fetchall()
pulls the whole result into memory and would defeat the purpose of
yield_per():
https://github.com/sqlalchemy/sqlalchemy/blob/master/lib/sqlalchemy/orm/loading.py#L70
If you are observing that you're calling yield_per() and the above
code is not working and is somehow still calling fetchall(), let me
know.
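As an aside, the "small batches" consumption pattern that yield_per() is meant to enable can be sketched generically over any row iterator using only the stdlib; the dict rows below are just a stand-in for a streamed result:

```python
from itertools import islice

def in_batches(rows, size):
    """Consume any row iterator (e.g. a yield_per() query) in fixed-size
    batches, so at most `size` rows are held in Python at once."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# stand-in for a streamed 25,000-row result
stream = ({"id": i} for i in range(25_000))
batch_sizes = [len(b) for b in in_batches(stream, 1000)]
```

Each batch can then be handed to whatever does the real work, without any single structure ever spanning the full result.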



>
>
> On Wednesday, January 2, 2019 at 5:13:48 PM UTC+2, Mike Bayer wrote:
>>
>> On Wed, Jan 2, 2019 at 2:53 AM <yoav.d...@otorio.co> wrote:
>> >
>> > I am running sqlalchemy 1.2.2 on a postgres DB with psycopg 2.7.4 on a 
>> > docker container. Every time I'm querying a large number of rows (25000+) 
>> > the amount of memory used by my container increases, and remains higher 
>> > even after the operation has finished, the session has been closed and all 
>> > the objects have been released. This eventually makes the container take 
>> > all its allocated memory and stop functioning since it is working from 
>> > swap.
>> >
>> > Example code:
>>
>> This code makes use of the yield_per option which will use psycopg2
>> named cursors and therefore not store 25000 psycopg2 rows in memory at
>> once.
>>
>> However, on the very next line it goes on to defeat this strategy by
>> calling query.all(), which will create ORM objects, each of which is
>> an order of magnitude larger in memory size than an actual row, for
>> all 25000 rows returned from the database, and store them into a list,
>> requiring an enormous allocation of memory at once.  Then this list is
>> returned to a list comprehension where these large ORM objects are
>> converted into smaller dictionaries, but this is still a list of 25000
>> Python dictionaries, which again is just as large as if named cursors
>> had not been used.
>>
>> The same notion applies to using fetchall() as you mention below.
>> This method returns a Python sequence as per PEP 249
>> (https://www.python.org/dev/peps/pep-0249/#fetchall), which by
>> definition contains all the contents in memory at once.
>>
>> To use yield_per() properly, you can't use query.all() and you can't
>> create a single list of objects.  You need to instead deal with each
>> object individually, or in small batches, where "deal" means whatever
>> it is you wanted to do with these objects, again being careful to
>> never create any Python structure of any kind (string, JSON document,
>> dictionary, list, etc.) which encompasses all 25000 objects at once.
>>
>> for example:
>>
>> for obj in query.yield_per(100):
>>     my_tool.send_json(obj)
>>
>>
>> or in batches:
>>
>> result = iter(query)
>> while True:
>>     batch = [obj for obj, _ in zip(result, range(100))]
>>     if not batch:
>>         break
>>     my_tool.send_json(batch)
>>
>>
>> Another way to go, which I often do, is to paginate the query instead;
>> that is, instead of using yield_per(), run multiple queries, each of
>> which returns fewer rows.  This generalizes better to other databases
>> and also allows eager loading, such as joined eager loading and
>> subquery eager loading, to work.  A recipe for windowing that's
>> pretty popular is at
>> https://github.com/sqlalchemy/sqlalchemy/wiki/WindowedRangeQuery
>>
>>
>> hope this helps!
>>
>>
>> >
>> > # This queries 25000+ rows
>> > query = 
>> > session.query(MyTable).execution_options(stream_results=True).yield_per(100)
>> >
>> > a = [object_as_dict(obj) for obj in query.all()]
>> >
>> > del a
>> > del query
>> >
>> > session.commit()
>> > session.rollback()
>> > session.flush()
>> > session.expunge_all()
>> > session.prune()
>> > session.close()
>> >
>> > del session
>> >
>> >
>> > Going a bit deeper into this, I have read that psycopg advises streaming 
>> > large queries using server side cursors, and I have eventually managed to 
>> > create a code sample that functions the same without leaking memory, by 
>> > sort of accessing psycopg directly:
>> >
>> > conn = session.connection().connection
>> > cursor = conn.cursor(name='123123')
>> > cursor.itersize = 1000
>> > cursor.execute("select * from mytable")
>> >
>> > a = [list(obj) for obj in cursor]
>> >
>> > cursor.close()
>> >
>> > # Session cleanup etc
>> >
>> >
>> > What seemed to make the difference, though, is the usage of fetchall(). If 
>> > in the loop I write instead:
>> >
>> > a = [list(obj) for obj in cursor.fetchall()]
>> >
>> >
>> > Then the functionality remains the same yet the memory leaks again.
>> >
>> > Looking into the implementation of ResultProxy and BufferedRowResultProxy 
>> > in sqlalchemy, it looks as if they are opening the server side cursor but 
>> > using fetchall(), which leads to the memory issue and may explain why my 
>> > problem occurs.
>> > Although I have found a way that doesn't leak, I would like to continue 
>> > using sqlalchemy through the ORM, since otherwise it would require 
>> > reworking large segments of code, and also I don't think the right way to 
>> > go is avoiding DB abstraction layers...
>> >
>> > I'm asking here because I feel like I missed something, and there is 
>> > something wrong with my configuration or I'm calling the wrong function 
>> > which leads to my issue. Would love some assistance :)
>> >
>> > --
>> > SQLAlchemy -
>> > The Python SQL Toolkit and Object Relational Mapper
>> >
>> > http://www.sqlalchemy.org/
>> >
>> > To post example code, please provide an MCVE: Minimal, Complete, and 
>> > Verifiable Example. See http://stackoverflow.com/help/mcve for a full 
>> > description.
>> > ---
>> > You received this message because you are subscribed to the Google Groups 
>> > "sqlalchemy" group.
>> > To unsubscribe from this group and stop receiving emails from it, send an 
>> > email to sqlalchemy+...@googlegroups.com.
>> > To post to this group, send email to sqlal...@googlegroups.com.
>> > Visit this group at https://groups.google.com/group/sqlalchemy.
>> > For more options, visit https://groups.google.com/d/optout.
>
