On 02/25/2016 07:56 PM, Randy Syring wrote:
Mike,

Just to be clear, I'm not doing any kind of selects.  I'm only inserting
records.  And these aren't large binary blobs of any kind, they are
rather small strings and ints.

I apologize in advance if I misunderstood your answer.

ah, well then that is odd, in your code example I assumed that whole yield_merchants() thing was doing a SELECT.

Running tens of thousands or even millions of small INSERT statements shouldn't budge your memory use on the client side at all. If you can try running just the SQL part against a set of mock data, maybe that would help isolate what's happening. You're just doing cursor.execute() on a single connection; there's no per-statement state that SQLAlchemy or psycopg2 keeps around that would grow memory. Perhaps run through half of the loop and then use a tool like heapy (http://smira.ru/wp-content/uploads/2011/08/heapy.html) to see what kinds of objects are prevalent.
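
As a minimal sketch of what that heapy check might look like, assuming the guppy package is installed (where exactly you place the calls in your loop is up to you):

    from guppy import hpy

    hp = hpy()
    hp.setrelheap()    # only count objects allocated after this point

    # ... run roughly half of the insert loop here ...

    print hp.heap()    # shows which object types dominate the growth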




*Randy Syring*
Chief Executive Developer
Direct: 502.276.0459
Office: 812.285.8766
Level 12 <https://www.level12.io/>

On 02/25/2016 07:46 PM, Mike Bayer wrote:


On 02/25/2016 06:31 PM, Randy Syring wrote:
I'm working on a project to parse through a large text file (1GB) of
records.  Once parsed, each record gets sent to the DB.  Due to the size
of the file, I've been working on a streaming/functional approach that
will keep my memory usage constant.

I've been able to simply take the DB out of the equation and parse
through all of the records and memory usage stays constant. But, as
soon as I bring SA into the picture, memory usage continues to climb
through the lifetime of the program.

Well as soon as you bring psycopg2, or really any DBAPI, into the
picture.   DBAPI drivers fully buffer both rows and columns. However,
this wouldn't lead to unbounded growth, only that memory would grow as
big as the biggest single result set you've fetched.


I originally started using the ORM, and thought the Session would be the
culprit, but have now drilled down deep enough into the problem that it
appears to be an issue even when using simple connections.

*using psycopg:*

    connection = db.engine.connect().connection
    with connection.cursor() as cursor:
        for count, statement in enumerate(MEXChunker(mex_file).yield_merchants()):
            for pc_data in statement.program_charges:
                insert_sql = "INSERT INTO stage.tsys_program_charges (reporting_month, " \
                    "reporting_year, charge_amount, merchant_number, officer_code) " \
                    "VALUES (%s, %s, %s, %s, %s)"
                cursor.execute(insert_sql, pc_data)

I see you aren't using server-side cursors, which are the minimum requirement for keeping psycopg2 from buffering rows as they are sent. But even then, server-side cursors have no impact on individual column values being buffered. I'm not sure whether these text fields are large binary objects themselves, but the only DBAPI right now that supports streaming of BLOB objects is cx_Oracle, and SQLAlchemy hides that API as well. For streaming of large objects in psycopg2, you have to use an obscure PostgreSQL feature nobody uses called "large objects", which requires a special table; it's described at
http://initd.org/psycopg/docs/usage.html#access-to-postgresql-large-objects.
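
For a rough illustration of the server-side cursor idea (this is the fetching side, not your INSERT case), a named cursor in psycopg2 looks roughly like this; the DSN and the processing step are placeholders:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")          # placeholder connection details
    cursor = conn.cursor(name="stream_charges")     # giving it a name makes it server-side
    cursor.itersize = 2000                          # rows fetched per network round trip
    cursor.execute("SELECT * FROM stage.tsys_program_charges")
    for row in cursor:
        pass                                        # process each row here
    cursor.close()
    conn.close()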



The above, when run, shows memory ("RES" in `top`) quickly climb and then hold at around 183K. The resource module reports "max rss" at 182268 at the end of the script. Those memory numbers are just about the same if I simply run the loop and keep the DB out of it.
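
(A sketch of how such a figure can be read with the standard-library resource module; on Linux, ru_maxrss is reported in kilobytes:)

    import resource

    max_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print "max rss: %d KB" % max_rss_kb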



*using SA:*

    with db.engine.begin() as connection:
        for count, statement in enumerate(MEXChunker(mex_file).yield_merchants()):
            for pc_data in statement.program_charges:
                insert_sql = "INSERT INTO stage.tsys_program_charges (reporting_month, " \
                    "reporting_year, charge_amount, merchant_number, officer_code) " \
                    "VALUES (%s, %s, %s, %s, %s)"
                connection.execute(insert_sql, pc_data)

The above, when run, shows memory usage climbing throughout the life of the script; "max rss" tops out at 323984.

I'd like to ultimately be able to use the ORM for this project, but if even simple inserts through SA don't run in constant memory, I can't really move forward with that plan.

Thanks in advance for any help you can provide.

Basically, people use server-side cursors for this, but I find those to be troubling since they're temperamental and not platform-independent. If the size of your data comes from there being a lot of rows, I'd fetch it a window at a time, e.g. with an approach similar to the one described at
https://bitbucket.org/zzzeek/sqlalchemy/wiki/UsageRecipes/WindowedRangeQuery.
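
A simplified sketch of that windowed idea, assuming a reasonably dense integer primary key (the recipe itself uses row_number() windows and copes with less regular keys); the table and column names here are placeholders:

    import sqlalchemy as sa

    def windowed_rows(connection, table, window_size=1000):
        # find the key range, then fetch one bounded window at a time
        min_id, max_id = connection.execute(
            sa.select([sa.func.min(table.c.id), sa.func.max(table.c.id)])
        ).first()
        if min_id is None:
            return
        for start in xrange(min_id, max_id + 1, window_size):
            result = connection.execute(
                sa.select([table])
                .where(table.c.id.between(start, start + window_size - 1))
                .order_by(table.c.id)
            )
            for row in result:
                yield row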


I've written routines that generalize this so that I can even pass different chunks of data, as I fetch them, into individual worker processes using a multiprocessing.Pool, and I've written routines that write data out to files and use heapq to sort them back together (for one job in particular we had to read and write a 4G XML file, where a DOM tree of it would have run out of memory immediately, so we used SAX parsing/streaming and heapq throughout). IMO, once you have things chunked, you can do anything with them.
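
If it helps, the heapq part of that is roughly like the following; the one-record-per-line chunk file format is just an assumption for illustration:

    import heapq

    def merge_sorted_chunks(chunk_paths, out_path):
        # each chunk file is assumed to already be sorted, one record per line
        files = [open(p) for p in chunk_paths]
        try:
            with open(out_path, "w") as out:
                # heapq.merge lazily merges the sorted inputs, holding only
                # one pending line per input file in memory at a time
                for line in heapq.merge(*files):
                    out.write(line)
        finally:
            for f in files:
                f.close()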





*system info:*
Python 2.7.6
SA 1.0.10 & SA 1.0.12
Ubuntu Linux 14.04
