Re: [sqlalchemy] Memory leak? with connection & insert

2016-02-25 Thread Mike Bayer



On 02/25/2016 07:56 PM, Randy Syring wrote:

Mike,

Just to be clear, I'm not doing any kind of selects.  I'm only inserting
records.  And these aren't large binary blobs of any kind, they are
rather small strings and ints.

I apologize in advance if I misunderstood your answer.


Ah, well then that is odd; in your code example I assumed that whole 
yield_merchants() thing was doing a SELECT.


Running tens of thousands, or even millions, of small INSERT statements 
shouldn't budge your memory use on the client side at all.  If you can try 
running just the SQL part against a set of mock data, maybe that would help 
to isolate what might be happening.  You're just doing cursor.execute() on a 
single connection; there's no per-statement state that SQLAlchemy or 
psycopg2 keeps around that would grow memory.  Perhaps run through half of 
the loop and then use a tool like heapy 
(http://smira.ru/wp-content/uploads/2011/08/heapy.html) to see what kinds of 
objects are prevalent.
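
For illustration, a minimal sketch of that kind of check using guppy/heapy 
(assuming guppy is installed; the loop and the halfway checkpoint below are 
placeholders, not the actual script):

from guppy import hpy

hp = hpy()
hp.setrelheap()   # only measure growth from this point forward

for count, record in enumerate(mock_records):   # stand-in for the real loop
    cursor.execute(insert_sql, record)
    if count == 500000:                          # roughly halfway through
        print(hp.heap())                         # object types ranked by size
        break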







Re: [sqlalchemy] Memory leak? with connection & insert

2016-02-25 Thread Randy Syring

Mike,

Just to be clear, I'm not doing any kind of selects.  I'm only inserting 
records.  And these aren't large binary blobs of any kind, they are 
rather small strings and ints.


I apologize in advance if I misunderstood your answer.

*Randy Syring*
Chief Executive Developer
Direct: 502.276.0459
Office: 812.285.8766
Level 12 


Re: [sqlalchemy] Memory leak? with connection & insert

2016-02-25 Thread Mike Bayer



On 02/25/2016 06:31 PM, Randy Syring wrote:

I'm working on a project to parse through a large text file (1GB) of
records.  Once parsed, each record gets sent to the DB.  Due to the size
of the file, I've been working on a streaming/functional approach that
will keep my memory usage constant.

I've been able to simply take the DB out of the equation and parse
through all of the records and memory usage stays constant.  But, as
soon as I bring SA into the picture, memory usage continues to climb
through the lifetime of the program.


Well, that happens as soon as you bring psycopg2, or really any DBAPI, into 
the picture.  DBAPI drivers fully buffer both rows and columns.  However, 
this wouldn't lead to unbounded growth; it only means memory grows as big as 
the biggest single result set you've fetched.




I originally started using the ORM, and thought the Session would be the
culprit, but have now drilled down deep enough into the problem that it
appears to be an issue even when using simple connections.

*using psycopg:*

connection = db.engine.connect().connection
with connection.cursor() as cursor:
    for count, statement in enumerate(MEXChunker(mex_file).yield_merchants()):
        for pc_data in statement.program_charges:
            insert_sql = "INSERT INTO stage.tsys_program_charges (reporting_month," \
                         "reporting_year, charge_amount, merchant_number, officer_code) " \
                         "VALUES (%s, %s, %s, %s, %s)"
            cursor.execute(insert_sql, pc_data)



I see you aren't using server side cursors, which are the minimum required 
to keep psycopg2 from buffering rows as they come back from the server.  But 
even then, server side cursors don't have any impact on individual column 
values being buffered - I'm not sure if these text fields are large binary 
objects themselves, but the only DBAPI right now that supports streaming of 
BLOB objects is cx_Oracle, and SQLAlchemy hides this API as well.  For 
streaming of large objects in psycopg2, you have to use an obscure 
PostgreSQL feature nobody uses called "large objects", which requires a 
special table; it's described at 
http://initd.org/psycopg/docs/usage.html#access-to-postgresql-large-objects.
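
For reference, a "server side cursor" in psycopg2 is just a named cursor; a 
minimal sketch, with an arbitrary cursor name and itersize (and it applies 
to SELECTs rather than the INSERT loop quoted above):

import psycopg2

conn = psycopg2.connect("dbname=mydb")       # hypothetical DSN
cur = conn.cursor(name="stream_rows")        # naming the cursor makes it server side
cur.itersize = 2000                          # rows pulled per round trip while iterating
cur.execute("SELECT * FROM stage.tsys_program_charges")
for row in cur:                              # rows stream instead of being fully buffered
    handle(row)                              # placeholder for real processing
cur.close()
conn.close()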





The above, when run, shows memory ("RES" in `top`) quickly climb and
then hold around 183K.  The resource module reports "max rss" at 182268
at the end of running the script.  Those memory numbers are just about
the same if I simply run the loop and keep the DB out of it.





*using SA*

with db.engine.begin() as connection:
    for count, statement in enumerate(MEXChunker(mex_file).yield_merchants()):
        for pc_data in statement.program_charges:
            insert_sql = "INSERT INTO stage.tsys_program_charges (reporting_month," \
                         "reporting_year, charge_amount, merchant_number, officer_code) " \
                         "VALUES (%s, %s, %s, %s, %s)"
            connection.execute(insert_sql, pc_data)

The above, when run, shows memory usage climbing through the life of the
script.  "max rss" tops out at 323984.

I'd like to ultimately be able to use the ORM for this project, but if
even the simple inserts using SA don't result in constant memory, I
can't really move forward with that plan.

Thanks in advance for any help you can provide.


Basically, people use server side cursors for this, but I find those to be 
troubling since they are temperamental and aren't platform-independent.  If 
your data is large because there are a lot of rows, I'd fetch it a window at 
a time, e.g. an approach similar to that described at 
https://bitbucket.org/zzzeek/sqlalchemy/wiki/UsageRecipes/WindowedRangeQuery.

I've written routines that generalize this such that I can even pass 
different chunks of data, as I fetch them, into individual worker processes 
using a multiprocessing.Pool, and I've written routines that write data out 
to files and use heapq to sort them back together (for one job in particular 
we had to read and write a 4GB XML file, where a DOM tree would run out of 
memory immediately, so we used SAX parsing/streaming and heapq throughout). 
IMO once you have things chunked, you can do anything with them.
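
A rough sketch of that chunked-fetch idea (keyset pagination rather than the 
row_number() windows used in the linked recipe; the table and its "id" 
primary key column are placeholders):

import sqlalchemy as sa

def windowed_rows(engine, table, chunk_size=10000):
    """Yield rows in primary-key order, keeping one chunk resident at a time."""
    with engine.connect() as conn:
        last_id = None
        while True:
            query = sa.select([table]).order_by(table.c.id).limit(chunk_size)
            if last_id is not None:
                query = query.where(table.c.id > last_id)
            rows = conn.execute(query).fetchall()
            if not rows:
                break
            for row in rows:
                yield row
            last_id = rows[-1].id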







*system info*
Python 2.7.6
SA 1.0.10 & SA 1.0.12
Ubuntu Linux 14.04


[sqlalchemy] Memory leak? with connection & insert

2016-02-25 Thread Randy Syring
I'm working on a project to parse through a large text file (1GB) of 
records.  Once parsed, each record gets sent to the DB.  Due to the size of 
the file, I've been working on a streaming/functional approach that will 
keep my memory usage constant.  

I've been able to simply take the DB out of the equation and parse through 
all of the records and memory usage stays constant.  But, as soon as I 
bring SA into the picture, memory usage continues to climb through the 
lifetime of the program.

I originally started using the ORM, and thought the Session would be the 
culprit, but have now drilled down deep enough into the problem that it 
appears to be an issue even when using simple connections.



*using psycopg:*
connection = db.engine.connect().connection
with connection.cursor() as cursor:
    for count, statement in enumerate(MEXChunker(mex_file).yield_merchants()):
        for pc_data in statement.program_charges:
            insert_sql = "INSERT INTO stage.tsys_program_charges (reporting_month," \
                         "reporting_year, charge_amount, merchant_number, officer_code) " \
                         "VALUES (%s, %s, %s, %s, %s)"
            cursor.execute(insert_sql, pc_data)

The above, when run, shows memory ("RES" in `top`) quickly climb and then 
hold around 183K.  The resource module reports "max rss" at 182268 at the 
end of running the script.  Those memory numbers are just about the same if 
I simply run the loop and keep the DB out of it.
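
For context, that "max rss" figure comes from the standard library's 
resource module; a minimal sketch of the measurement (on Linux, ru_maxrss is 
reported in kilobytes):

import resource

max_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("max rss: %d KB" % max_rss_kb)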



*using SA*
with db.engine.begin() as connection:
    for count, statement in enumerate(MEXChunker(mex_file).yield_merchants()):
        for pc_data in statement.program_charges:
            insert_sql = "INSERT INTO stage.tsys_program_charges (reporting_month," \
                         "reporting_year, charge_amount, merchant_number, officer_code) " \
                         "VALUES (%s, %s, %s, %s, %s)"
            connection.execute(insert_sql, pc_data)

The above, when run, shows memory usage climbing through the life of the 
script.  "max rss" tops out at 323984.

I'd like to ultimately be able to use the ORM for this project, but if even 
the simple inserts using SA don't result in constant memory, I can't really 
move forward with that plan.
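
For comparison, the same loop can be phrased with a text() construct built 
once and executemany-style batches of parameters; a rough sketch only (the 
batch size and bind-parameter names here are arbitrary):

import sqlalchemy as sa

insert_sql = sa.text(
    "INSERT INTO stage.tsys_program_charges "
    "(reporting_month, reporting_year, charge_amount, merchant_number, officer_code) "
    "VALUES (:month, :year, :amount, :merchant, :officer)"
)

with db.engine.begin() as connection:
    batch = []
    for statement in MEXChunker(mex_file).yield_merchants():
        for pc_data in statement.program_charges:
            month, year, amount, merchant, officer = pc_data
            batch.append(dict(month=month, year=year, amount=amount,
                              merchant=merchant, officer=officer))
            if len(batch) >= 1000:                     # flush fixed-size batches
                connection.execute(insert_sql, batch)  # executemany-style call
                batch = []
    if batch:
        connection.execute(insert_sql, batch)

Whether batching changes the memory profile here is something to measure 
rather than assume.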

Thanks in advance for any help you can provide.



*system info*
Python 2.7.6
SA 1.0.10 & SA 1.0.12
Ubuntu Linux 14.04
