Re: [sqlalchemy] Re: Working with large IN lists
I'd note that those subqueryload()s render the effectiveness of yield_per() almost nil. This is how subqueryload works: query(Parent).options(subqueryload(Parent.child)).all() will give you:

SELECT * FROM parent

then, the moment a parent row is fetched and the first child collection is referenced for loading, the query emitted is:

SELECT * FROM child JOIN (SELECT * FROM parent) AS a ON a.id = child.id_a

that is, the second query loads all child rows for all parents in the entire result. So let's say we use yield_per() so that SQLAlchemy only processes the first 100 rows before handing them out. As soon as you hit either of those two subqueryloaded collections, the yield_per() is mostly thrown out the window - each one will load the entire list of child objects for the whole result, which right there will grow memory to be as big as your entire result.

The next thing I'd look at is that second query for ModelC. You can load those upfront so that you don't need to do a query each time:

modelcs = dict(
    sess.query(ModelC.id_a, ModelC).join(ModelC.a).filter(ModelA.parent_id == <same integer you're using against ModelA>)
)

then you have a dictionary of id_a -> ModelC as you loop through your ModelA records. All the ModelC's are already in the session, so you don't need to use merge(), which is also not a super quick operation. If an id_a is not in the dictionary, you know to create a new ModelC and use Session.add() to put it in. If you're limiting the ModelA rows using a LIMIT or the window recipe I mentioned earlier, you'd apply that same criteria to the loading of the modelcs.

Later on you're doing something with query(ModelC).first() in a loop, which is also something I hope isn't in the real application - looking there, it seems you'd only need to say query(ModelC).delete().
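Michael's dictionary-loading suggestion can be sketched as a runnable toy: the models are trimmed to one column each and backed by in-memory SQLite, and the join/filter criteria against ModelA are omitted, so everything beyond the ModelC/id_a names from the thread is illustrative:

```python
from sqlalchemy import create_engine, Column, Integer
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class ModelC(Base):
    __tablename__ = "model_c"
    id = Column(Integer, primary_key=True)
    id_a = Column(Integer, index=True)  # FK to ModelA in the real schema

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
sess = sessionmaker(engine)()
sess.add_all([ModelC(id=1, id_a=10), ModelC(id=2, id_a=20)])
sess.commit()

# One SELECT up front instead of one query per ModelA row
# (the real version would also join and filter on ModelA.parent_id):
modelcs = dict(sess.query(ModelC.id_a, ModelC))

for id_a in (10, 20, 30):             # stand-in for the ModelA loop
    c = modelcs.get(id_a)
    if c is None:
        sess.add(ModelC(id_a=id_a))   # new pending row; no merge() needed
sess.commit()
```

The dict() call works because each query row is an (id_a, ModelC) pair; lookups in the loop then cost nothing but a hash probe.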
Watching the SQL emitted with echo=True, and in some cases also seeing how large the incoming results are using echo='debug', is something I strongly recommend when first profiling an application.

On Feb 23, 2012, at 8:18 AM, Vlad K. wrote:

And yet again the problem is not in SQLAlchemy but in Pyramid. Sorry; from now on I'll first try writing test scripts without Pyramid to see where the problem is. I just don't have time for that kind of debugging, so I'm wasting yours. :) Thanks. I'll write to the Pylons Discuss list, but the test script is here nevertheless. Comment out line 266 (and remove imports) to disable Pyramid and see it work okay. Without Pyramid (but even with Transaction and ZopeTransactionExtension), the gcdelta is 0 or negative per batch of 200 iterations. With Pyramid bootstrapped, gcdelta is in the thousands per batch of 200 iterations. https://gist.github.com/d669e958c54869c69831

.oO V Oo.

On 02/23/2012 02:41 AM, Michael Bayer wrote:

On Feb 22, 2012, at 6:36 PM, Vlad K. wrote:

Okay, thanks to this article: http://neverfear.org/blog/view/155/Investigating_memory_leaks_in_Python I made a similar plot of object counts over time, showing the top 50 types. The resulting PDF is here (you might wish to download it first, Google messes it up for me): https://docs.google.com/open?id=0ByLiBlA59qDwYTY1MGIzYWEtYjMxZi00ZDVlLTk0OTEtOGI2ZjA3NDgyM2Y3 Everything seems to grow linearly in count. Something is keeping all those objects referenced somewhere. What could possibly be the cause?

can you provide a self-contained, single file test case that illustrates the memory growth ?

-- You received this message because you are subscribed to the Google Groups sqlalchemy group. To post to this group, send email to sqlalchemy@googlegroups.com. To unsubscribe from this group, send email to sqlalchemy+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/sqlalchemy?hl=en.
Re: [sqlalchemy] Re: Working with large IN lists
.oO V Oo.

On 02/23/2012 02:50 PM, Michael Bayer wrote:

I'd note that those subqueryload()s render the effectiveness of yield_per() almost nil.

I know. I've replicated the real use case in the application, which has yet to see proper optimization, including better query planning and reduction of unnecessary joins and relationships. I'd rather investigate implementing prepared statements and basically do lazy='select' instead of subqueries.

The next thing I'd look at is that second query for ModelC. You can load those upfront so that you don't need to do a query each time [...] Later on you're doing something with query(ModelC).first() in a loop, which is also something I hope isn't in the real application - looking there, it seems you'd only need to say query(ModelC).delete().

Might not be visible from this test script, but the scenario is this. ModelA represents certain data that has to be exported to external services (XMLRPC, REST and similar). In an ideal situation I'd just select all ModelA rows that are up for export (by comparing the timestamp of last modification with the timestamp of the process run), but I can't do that, because if such a transaction fails it has to be remembered for the next batch run. So I use the ModelC table, which logs these pending transactions. The first phase selects rows from ModelA that are up for export and creates transaction logs in ModelC.
The second phase then loads and exports rows one by one from ModelC (joined with ModelA and everything else required for the export). However, if a single transaction fails, the entire script exits and continues when called next time. This I have to do for other reasons (preventing overload of a possibly downed external service, etc.), so I can't skip that row and fetch the next. It may happen, and regularly does, that on subsequent runs of the process there are no new ModelA rows to load, but there are ModelC rows that failed on the last call (since they're still in the table, they were not processed). And that's the logic in query(ModelC).first(), processing, and then delete(). Also note that each row has to be an individual transaction (load, send to external service, remove from the ModelC table), which means I can't rely on session/identity caching by pre-loading data instead of joins and subqueries.

Watching the SQL emitted with echo=True, and in some cases also seeing how large the incoming results are using echo='debug', is something I strongly recommend when first profiling an application.

Yes, I use logging and see all the SQL emitted. Thanks for your input, I appreciate all the help and advice I can get. Still a ton of stuff to learn about SQLA.

V
Re: [sqlalchemy] Re: Working with large IN lists
When we want to test whether a Python program has a leak, we do that by seeing how many uncollected objects are present. This is done via gc:

import gc
print "total number of objects:", len(gc.get_objects())

That's the only real way to measure if the memory used by Python objects is growing unbounded. Looking at the memory usage in top shows what the interpreter takes up - the CPython interpreter in more modern releases does release memory back, but only occasionally. Older versions don't.

If you're doing an operation that loads thousands of rows, those rows are virtually always loaded entirely into memory by the DBAPI before your program or SQLAlchemy is ever given the chance to fetch a single row. I haven't yet looked closely at your case here, but that's often at the core of scripts that use much more memory than expected. There are ways to get *some* DBAPIs to not do this (particularly psycopg2, if you're using Postgresql; see http://docs.sqlalchemy.org/en/latest/orm/query.html?highlight=yield_per#sqlalchemy.orm.query.Query.yield_per and http://docs.sqlalchemy.org/en/latest/core/connections.html?highlight=stream_results#sqlalchemy.engine.base.Connection.execution_options), though the better solution is usually to try loading chunks of records at a time (one such recipe that I use for this is here: http://www.sqlalchemy.org/trac/wiki/UsageRecipes/WindowedRangeQuery). Or better yet, consider whether the problem can be solved entirely on the SQL side (this depends entirely on exactly what you're trying to do with the data in question).

On Feb 22, 2012, at 9:46 AM, Vlad K. wrote:

Okay, after several test cases - various join combinations with or without relationships, with or without cherry-picking the columns that are really used from the joined models - I've come to the conclusion that the only problem I'm having here is that there is no garbage collection.
Python memory use just keeps growing at a rate that, of course, depends on the size of the models used and the data queried, but it just keeps growing, regardless of release/deletion of instances or isolating each row's processing in its own committed transaction. I also found this: http://permalink.gmane.org/gmane.comp.python.sqlalchemy.user/30087 So it appears I'm having the same problem.

Am I understanding correctly that, because of this, SQLAlchemy ORM is in my case useless if I have to process thousands of rows, because the memory used to process each row (along with corresponding joined models etc.) will not be released? So basically I'd have to use SQLA without the ORM for this particular use case? Or is this some memory leak bug? If so, any suggestions or examples on how to switch from ORM use to non-ORM if I want to retain the named tuples returned by queries and avoid rewriting half the app?

Thanks.

.oO V Oo.
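Michael's gc-based measurement can be wrapped in a small harness; this sketch is not from the thread - the leak_check helper and the deliberately leaky module-level cache are illustrative. Counts that keep rising even after gc.collect() point to a true leak; a plateau points to bulk loading instead:

```python
import gc

def leak_check(work, iterations=10):
    """Run `work` repeatedly, forcing a collection and sampling the
    number of gc-tracked objects after each iteration."""
    counts = []
    for _ in range(iterations):
        work()
        gc.collect()
        counts.append(len(gc.get_objects()))
    return counts

# A deliberately leaky workload: a cache that only ever grows.
cache = []
counts = leak_check(lambda: cache.append([0] * 100))
```

Plotting (or just printing) the returned counts across 50+ iterations is the check Michael describes: monotonic growth despite collection means something is holding references.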
Re: [sqlalchemy] Re: Working with large IN lists
Hi, thanks for your reply. I haven't yet tested this with a profiler to see exactly what is happening, but the bottom line is that the overall memory use grows with each iteration (or transaction processed), to the point of grinding the server to a halt, and top shows only the Python process involved consuming all the memory.

I've already modified the code to read one row at a time, by first creating a list of IDs to be affected, then going through that list and selecting + updating/inserting one transaction at a time. I suppose I could solve the problem entirely on the SQL side with a stored function, but that's a maintenance overhead I'd like to avoid if possible. Meanwhile I've gotten rid of convenience relationships and in some places decided on lazy='select' instead of 'subquery' or 'joined', which has brought down total memory use; now the entire process can finish with the amount of RAM available on the server, but it still shows linear growth from the start to the end of the process.

.oO V Oo.

On 02/22/2012 07:23 PM, Michael Bayer wrote:

When we want to test whether a Python program has a leak, we do that by seeing how many uncollected objects are present. This is done via gc. [...] There are ways to get *some* DBAPIs to not do this (particularly psycopg2, if you're using Postgresql; see http://docs.sqlalchemy.org/en/latest/orm/query.html?highlight=yield_per#sqlalchemy.orm.query.Query.yield_per and http://docs.sqlalchemy.org/en/latest/core/connections.html?highlight=stream_results#sqlalchemy.engine.base.Connection.execution_options), though the better solution is usually to try loading chunks of records at a time (one such recipe that I use for this is here: http://www.sqlalchemy.org/trac/wiki/UsageRecipes/WindowedRangeQuery). Or better yet, consider whether the problem can be solved entirely on the SQL side.
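The chunked-loading idea behind the WindowedRangeQuery recipe Michael links can be approximated with a keyset-pagination sketch: walk the table in primary-key order, one chunk of rows in memory at a time. This is a simplification, not the recipe itself, and the Item model and names are illustrative:

```python
from sqlalchemy import create_engine, Column, Integer
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Item(Base):
    __tablename__ = "items"
    id = Column(Integer, primary_key=True)

def windowed(session, Model, chunk=100):
    """Yield all Model rows in primary-key order, `chunk` rows per SELECT."""
    last_id = None
    while True:
        q = session.query(Model)
        if last_id is not None:
            q = q.filter(Model.id > last_id)
        rows = q.order_by(Model.id).limit(chunk).all()
        if not rows:
            break
        for row in rows:
            yield row
        last_id = rows[-1].id
        session.expire_all()  # drop loaded state before the next chunk

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
sess = sessionmaker(engine)()
sess.add_all(Item() for _ in range(250))
sess.commit()

seen = [item.id for item in windowed(sess, Item, chunk=100)]
```

The actual recipe uses a row_number() window function to precompute the ranges, which also works for non-integer keys; the memory behavior is the same.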
Re: [sqlalchemy] Re: Working with large IN lists
Yes, definitely growing at a rate of 700-800 per iteration.

.oO V Oo.

On 02/22/2012 07:23 PM, Michael Bayer wrote:

When we want to test whether a Python program has a leak, we do that by seeing how many uncollected objects are present. This is done via gc:

import gc
print "total number of objects:", len(gc.get_objects())
Re: [sqlalchemy] Re: Working with large IN lists
On Wed, Feb 22, 2012 at 4:29 PM, Michael Bayer mike...@zzzcomputing.com wrote:

thanks for your reply. I haven't yet tested this with a profiler to see exactly what is happening, but the bottom line is that the overall memory use grows with each iteration (or transaction processed), to the point of grinding the server to a halt, and top shows only the Python process involved consuming all the memory.

yeah, like I said, that tells you almost nothing until you start looking at gc.get_objects(). If the size of gc.get_objects() grows continuously for 50 iterations or more, never decreasing even when gc.collect() is called, then it's a leak. Otherwise it's just too much data being loaded at once.

I've noticed that compiling queries (either explicitly or implicitly) tends to *fragment* memory. There seem to be long-lived caches in the PG compiler at least. I can't remember exactly where, but I could take another look. I'm talking of rather old versions of SQLA, 0.3 and 0.5.
Re: [sqlalchemy] Re: Working with large IN lists
On Feb 22, 2012, at 2:46 PM, Claudio Freire wrote:

I've noticed that compiling queries (either explicitly or implicitly) tends to *fragment* memory. There seem to be long-lived caches in the PG compiler at least. I can't remember exactly where, but I could take another look. I'm talking of rather old versions of SQLA, 0.3 and 0.5.

0.3's code is entirely gone, years ago. I wouldn't even know what silly things it was doing. In 0.5 and beyond, there's a cache of identifiers for quoting purposes. If you are creating perhaps thousands of tables with hundreds of columns, all names being unique, then this cache might start to become a blip on the radar. For the expected use case of a schema with at most several hundred tables, it should not be a significant size.

I don't quite know what it means for a Python script to "fragment memory", and I don't really think there's some set of Python programming practices that deterministically links to whether or not a script fragments a lot. Alex Martelli talks about it here: http://stackoverflow.com/questions/1316767/how-can-i-explicitly-free-memory-in-python . The suggestion there is that if you truly need to load tons of data into memory, doing it in a subprocess is the only way to guarantee the memory is freed back to the OS.
As it stands, there are no known memory leaks in SQLAlchemy itself, and if you look at our tests under aaa_profiling/test_memusage.py you can see we exhaustively ensure that the size of gc.get_objects() does not grow unbounded for all sorts of awkward situations. To illustrate potential new memory leaks we need succinct test cases that demonstrate a simple ascending growth in memory usage.
Re: [sqlalchemy] Re: Working with large IN lists
On Feb 22, 2012, at 3:28 PM, Claudio Freire wrote:

Like I said, it's not a leak situation so much as a fragmentation situation, where long-lived objects at high memory positions can prevent the process' heap from shrinking. [0] http://revista.python.org.ar/2/en/html/memory-fragmentation.html

Saw that a bit, but looking at the tips at the bottom, concrete implementation changes are not coming to mind. An "eternal structure" is ubiquitous in any programming language. sys.modules is a big list of all the Python modules that have been imported, each one full of functions, classes, and other data; these are all eternal structures - sys.modules is normally never cleaned out. I'm not seeing at what point you move beyond the things that are in these modules into things that are so-called eternal structures leading to inappropriate memory fragmentation.
Re: [sqlalchemy] Re: Working with large IN lists
On Wed, Feb 22, 2012 at 5:40 PM, Michael Bayer mike...@zzzcomputing.com wrote:

Saw that a bit, but looking at the tips at the bottom, concrete implementation changes are not coming to mind. [...] I'm not seeing at what point you move beyond things that are in these modules into things that are so-called eternal structures that lead to inappropriate memory fragmentation.

The thing to be careful about is when those eternal structures are created. If they're created at the beginning (like sys.modules, which is populated by imports, which most of the time happen in the preamble of .py files), then the resulting objects will have lower memory locations and thus not get in the way. But if those structures are created after the program has had time to fill its address space with transient objects (say, lazy imports or caches), then when the transient objects are deleted, the eternal structures (with their high addresses) prevent the heap from shrinking.

Such caches, for instance, are better made limited in lifespan (say, by giving them a finite lifetime, making entries expire, or actively cleaning them from time to time). Structures that truly need to be eternal are better populated at load time, early in the program's lifecycle. In my backend, for instance, queries are precompiled at startup to make sure they have lower memory addresses. This has mostly solved SQLA-related memory fragmentation issues for me.
Re: [sqlalchemy] Re: Working with large IN lists
On Wed, Feb 22, 2012 at 5:51 PM, Claudio Freire klaussfre...@gmail.com wrote:

Such caches, for instance, are better made limited in lifespan (say, by giving them a finite lifetime, making entries expire, or actively cleaning them from time to time). Structures that truly need to be eternal are better populated at load time, early in the program's lifecycle. In my backend, for instance, queries are precompiled at startup to make sure they have lower memory addresses. This has mostly solved SQLA-related memory fragmentation issues for me.

One source of trouble I've had here is the inability to use bind parameters inside .in_(...). Queries that accept variable lists I therefore had to precompile to a string, replacing the inside of the condition by string interpolation. Ugly hack, but it served me well.
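For what it's worth, SQLAlchemy releases well after this thread (1.2+) added an "expanding" bindparam for exactly this case: one precompiled statement that accepts variable-length IN lists without string interpolation. A sketch, not from the thread - SQLite and the table/parameter names are illustrative:

```python
from sqlalchemy import create_engine, text, bindparam

engine = create_engine("sqlite://")
with engine.connect() as conn:
    conn.execute(text("CREATE TABLE t (id INTEGER)"))
    conn.execute(text("INSERT INTO t VALUES (1), (2), (3), (4)"))

    # :ids expands to the right number of bound parameters at execution time,
    # so the same statement object serves lists of any length.
    stmt = text("SELECT id FROM t WHERE id IN :ids ORDER BY id").bindparams(
        bindparam("ids", expanding=True)
    )
    rows = conn.execute(stmt, {"ids": [2, 4]}).fetchall()
```

On the thread's SQLAlchemy versions, Claudio's string-interpolation workaround (or simply rebuilding the .in_() clause each time, accepting the compile cost) were the available options.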
Re: [sqlalchemy] Re: Working with large IN lists
On Feb 22, 2012, at 3:51 PM, Claudio Freire wrote:

The thing to be careful about is when those eternal structures are created. [...] In my backend, for instance, queries are precompiled at startup to make sure they have lower memory addresses. This has mostly solved SQLA-related memory fragmentation issues for me.

IMHO the whole point of using a high-level, interpreted language like Python is that we don't have to be bogged down thinking like C programmers. How come I've never had a memory fragmentation issue before?
I've made precompilation an option for folks who really want it, but I've never had a need for such a thing. And you can be sure I work on some very large and sprawling SQLAlchemy models these days. There are some caches here and there, like the identifier cache as well as caches inside of TypeEngine objects, but these caches are all intended to be of limited size.
Re: [sqlalchemy] Re: Working with large IN lists
On Wed, Feb 22, 2012 at 6:21 PM, Michael Bayer mike...@zzzcomputing.com wrote:

IMHO the whole point of using a high-level, interpreted language like Python is that we don't have to be bogged down thinking like C programmers. How come I've never had a memory fragmentation issue before? I've made precompilation an option for folks who really want it, but I've never had a need for such a thing. And you can be sure I work on some very large and sprawling SQLAlchemy models these days.

Maybe you've never used big objects. Memory fragmentation arises only when the application handles a mixture of big and small objects, such that holes created by freed small objects can't serve big memory requests. If your application handles a homogeneous workload (i.e. every request is pretty much the same), as is usual, then you probably won't experience fragmentation. My application does the usual small-object work interspersed with intense computation on big objects, hence my troubles. Python's garbage collector has been a pending issue for a long time, but, as noted in the linked page, past architectural decisions prevent some widely desired improvements.
Re: [sqlalchemy] Re: Working with large IN lists
Okay, thanks to this article: http://neverfear.org/blog/view/155/Investigating_memory_leaks_in_Python I made a similar plot of object counts over time, showing the top 50 types. The resulting PDF is here (you might wish to download it first, Google messes it up for me): https://docs.google.com/open?id=0ByLiBlA59qDwYTY1MGIzYWEtYjMxZi00ZDVlLTk0OTEtOGI2ZjA3NDgyM2Y3

Everything seems to grow linearly in count. Something is keeping all those objects referenced somewhere. What could possibly be the cause?

.oO V Oo.
Re: [sqlalchemy] Re: Working with large IN lists
On Feb 22, 2012, at 6:36 PM, Vlad K. wrote:

Everything seems to grow linearly in count. Something is keeping all those objects referenced somewhere. What could possibly be the cause?

can you provide a self-contained, single-file test case that illustrates the memory growth?
[sqlalchemy] Re: Working with large IN lists
This depends upon the execution plan of the query and is really more of a PostgreSQL question. Google "postgresql IN performance" and you will get a good idea of it. By the look of your code, the second option would obviously be faster, as it hits the database once, whereas the first one flushes after every change.

Regards,

On Feb 21, 6:07 pm, Vlad K. v...@haronmedia.com wrote:

Hi all, I have to read thousands of rows from a table and compile some data, but in certain conditions update those rows, all with the same value. The ratio of reads and writes here is the widest possible: sometimes no rows, sometimes a few, and sometimes all rows that are read have to be updated. The last scenario is making me concerned:

for row in query.yield_per(100):
    # Do something with data
    if some_condition:
        row.some_column = 123
        session.flush()

I am thinking about just adding the row's ID to a list:

list_of_ids = []
for row in query.yield_per(100):
    # Do something with data
    if some_condition:
        list_of_ids.append(row.primary_key)

and near the end of the transaction doing:

session.query(Model).filter(Model.primary_key.in_(list_of_ids)).update({"some_column": 123}, False)

Yes, I'm aware of the increased memory requirements to store the ID list on the application side, and no, I don't need to lock the rows for update; the logic of an atomic update at the end is sufficient for my case. But I think, and real-use benchmarks will probably show (I haven't tested yet), that a single update query will work faster. I need the lowest transaction processing time on the application side for the entire call, even if it takes more memory and more database iron. What I'm concerned with here is whether there are any limits or significant overheads with large .in_ lists. The backend is PostgreSQL via psycopg2.

Thanks

-- .oO V Oo.
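Vlad's second option - collect the IDs during the read pass, then issue one bulk UPDATE with .in_() at the end - can be sketched end-to-end as a toy (one-table model on in-memory SQLite; the Model/primary_key/some_column names follow his pseudocode, the rest is illustrative):

```python
from sqlalchemy import create_engine, Column, Integer
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Model(Base):
    __tablename__ = "model"
    primary_key = Column(Integer, primary_key=True)
    some_column = Column(Integer, default=0)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(engine)()
session.add_all(Model(primary_key=i) for i in range(1, 6))
session.commit()

# Pass 1: read the rows, only remembering which IDs need the update.
list_of_ids = []
for row in session.query(Model).yield_per(100):
    if row.primary_key % 2 == 0:          # stands in for some_condition
        list_of_ids.append(row.primary_key)

# Pass 2: one bulk UPDATE instead of a flush per modified row.
session.query(Model).filter(Model.primary_key.in_(list_of_ids)).update(
    {"some_column": 123}, synchronize_session=False
)
session.commit()
```

The second positional False in Vlad's snippet is the synchronize_session argument spelled out above: it skips reconciling the UPDATE against objects already in the session, which is what makes the bulk statement cheap.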
Re: [sqlalchemy] Re: Working with large IN lists
On Tue, Feb 21, 2012 at 3:24 PM, Manav Goel manav.goe...@gmail.com wrote:

This depends upon the execution plan of the query and is really more of a PostgreSQL question. Google "postgresql IN performance" and you will get a good idea of it. By the look of your code, the second option would obviously be faster, as it hits the database once, whereas the first one flushes after every change. [...]
A *long* time ago (SQLAlchemy 0.3), I had some performance problems with large IN clauses, and the reason turned out to be SQLAlchemy taking a lot of time to build long lists of bindparam objects. I've no idea if this is still the case these days. The best thing you can do is just try it.

Simon
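[Editorial note: if the per-element bindparam cost Simon mentions turns out to matter, one common mitigation is to cap the size of any single IN clause by chunking the ID list. A minimal sketch, again with stdlib sqlite3 instead of the ORM; the chunk size of 500 and the table name are arbitrary assumptions.]

```python
import sqlite3

def chunked(ids, size):
    """Split a list of ids into slices so no single IN clause gets huge."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

ids = list(range(1, 2501))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE model (id INTEGER PRIMARY KEY, some_column INTEGER)")
conn.executemany("INSERT INTO model (id) VALUES (?)", [(i,) for i in ids])

# One UPDATE per 500-id chunk keeps each statement's bind list bounded
for chunk in chunked(ids, 500):
    marks = ", ".join("?" * len(chunk))
    conn.execute(f"UPDATE model SET some_column = 123 WHERE id IN ({marks})", chunk)
conn.commit()

updated = conn.execute(
    "SELECT COUNT(*) FROM model WHERE some_column = 123"
).fetchone()[0]
print(updated)  # 2500
```

Each chunk is still atomic within the enclosing transaction, so the "single atomic update at the end" property from the original question is preserved as long as all chunks run before the commit.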
Re: [sqlalchemy] Re: Working with large IN lists
Thanks for your replies. Using the IN list definitely speeds up the process, but I hate the resulting query, which uses a bound variable for each and every element of the list.

But I have another problem with this: there's a massive memory leak somewhere. Take a look at this model:

    class GatewayTransaction(Base):
        __tablename__ = "gateway_transactions"

        realestate_id = Column(Integer,
                               ForeignKey("realestate.realestate_id",
                                          ondelete="set null", onupdate="cascade"),
                               primary_key=True)
        portal_id = Column(Text, primary_key=True)
        realestate_portal_id = Column(Unicode)
        operation = Column(Text, nullable=False)
        agency_id = Column(Integer,
                           ForeignKey("agencies.agency_id",
                                      ondelete="set null", onupdate="cascade"),
                           nullable=False)
        agency_portal_id = Column(Unicode, nullable=False)
        agency_export_token = Column(Unicode, nullable=False)
        user_id = Column(Integer,
                         ForeignKey("users.user_id",
                                    ondelete="set null", onupdate="cascade"),
                         nullable=False)
        mod_images = Column(Boolean)

        agency = relationship("Agency", lazy="joined")
        realestate = relationship("Realestate", lazy="joined")
        user = relationship("User", lazy="joined")

Now, when I do this:

    for row in some_query.all():
        gt = session.query(GatewayTransaction)\
            .filter(GatewayTransaction.realestate_id == row.realestate_id)\
            .filter(GatewayTransaction.portal_id == k)\
            .first() or GatewayTransaction()

        # Do some data processing
        # ...

        # Update existing or insert as new
        gt = session.merge(gt)
        session.flush()

it is very, very slow: it takes minutes to process 2000 rows, and memory usage skyrockets into the multiple-GB range, so I have to terminate it before it starts swapping like hell. With lazy="select" it flies, done in a couple of seconds with very little memory consumed, because at this point there are no rows in the table, so nothing is additionally selected and everything is inserted instead. Still, why would a join slow things down so drastically and shoot Python memory usage (not the DB's) sky-high?
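[Editorial note: the advice given later in this thread is to replace the query-per-row-plus-merge() loop with one upfront SELECT loaded into a dict keyed by id, then plain add() for misses. A minimal sketch of that lookup pattern using stdlib sqlite3; the table, column, and id values are hypothetical stand-ins for GatewayTransaction.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gateway_transactions "
    "(realestate_id INTEGER, portal_id TEXT, "
    "PRIMARY KEY (realestate_id, portal_id))"
)
conn.executemany(
    "INSERT INTO gateway_transactions VALUES (?, ?)",
    [(1, "portal-x"), (2, "portal-x")],
)

k = "portal-x"               # the portal being processed
incoming_ids = [1, 2, 3, 4]  # stand-in for the ids yielded by some_query.all()

# One upfront SELECT builds a lookup dict; no per-row query, no merge()
existing = {
    rid: pid
    for rid, pid in conn.execute(
        "SELECT realestate_id, portal_id FROM gateway_transactions "
        "WHERE portal_id = ?",
        (k,),
    )
}

updated, inserted = 0, 0
for rid in incoming_ids:
    if rid in existing:
        updated += 1         # row already exists: mutate it in place here
    else:                    # miss: create a new row (Session.add() in ORM terms)
        conn.execute("INSERT INTO gateway_transactions VALUES (?, ?)", (rid, k))
        inserted += 1
conn.commit()
```

This turns N+1 queries into 2 and avoids merge(), which the thread notes is not a cheap operation.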
Also, even if I try session.expunge(gt) or expunge_all() (after first arranging for the rows to be loaded one by one from a list of IDs), the memory always keeps growing, as if the instances never die and never get garbage collected...

.oO V Oo.

On 02/21/2012 04:35 PM, Simon King wrote:

A *long* time ago (SQLAlchemy 0.3), I had some performance problems with large IN clauses, and the reason turned out to be SQLAlchemy taking a lot of time to build long lists of bindparam objects. I've no idea if this is still the case these days. The best thing you can do is just try it. Simon
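[Editorial note: the "instances never get garbage collected" suspicion can be tested directly with the stdlib weakref and gc modules. A minimal sketch with a plain stand-in class; for a real investigation you would take a weak reference to an actual ORM instance after expunging it.]

```python
import gc
import weakref

class Row:
    """Stand-in for an ORM instance suspected of leaking."""
    pass

obj = Row()
probe = weakref.ref(obj)   # a weak reference does not keep obj alive

del obj                    # drop our own (only) strong reference
gc.collect()               # force a collection pass to clear any cycles

# If probe() is None, nothing else (session, identity map, app-level
# cache) still holds a strong reference; otherwise something does.
leaked = probe() is not None
print(leaked)  # False
```

Applied to the situation above: expunge the instance, drop local references, collect, and check the probe. If it stays alive, walking gc.get_referrers() on probe() will show what is holding it.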