On Aug 6, 2007, at 10:42 PM, Boris Duek wrote:
> Hi,
>
> I am using sqlalchemy like this:
>
>     entries = session.query(User)
>     for entry in entries:
>         entry.check_it_is_all_right()  # includes changing values of columns if necessary
>     session.flush()
>
> It would, however, suffice if sqlalchemy accessed the result
> sequentially, without loading more than is really needed at the
> moment. I thought the Python iterator protocol (__iter__ and next
> methods) is exactly tailored for this kind of efficient access; e.g.
> iterating over the lines of a file object (for line in
> open("file.txt", "r")) works exactly as efficiently as I would want
> sqlalchemy to in my case :-)

I will show you some ways to do it; however, yes, I think you should consider using SQL-level rather than ORM-level results for what you want. The 30 seconds of overhead you're experiencing is probably due less to memory consumption than to the overhead ORM loads require in order to instantiate objects, initialize and populate their attributes, and perform various decision-making with regard to extension hooks and the like. Additionally, SQLAlchemy does place some overhead on SQL-level results as well, since we do things like decode utf-8 into unicode and similar result processing, but this overhead is much smaller (though still not as fast as raw DBAPI).

So first, my thoughts on a "streaming" Query object. When looking to "stream" results from a Query (which ultimately come from its instances() method), you need to consider the ORM's behavior and design regarding sessions and units of work: when you load objects, the full result is stored within a session, on the assumption that you're going to manipulate and work with those objects. To iterate through pieces of data without holding onto them, you'd want to expunge as you load. There is no functionality built directly into query.instances() to achieve this right now. A major reason it's difficult is that it's not very easy to tell when an individual instance is fully "loaded"; many subsequent rows may apply to a single result instance as it loads related collections in due to a join. It's also problematic to ensure the uniqueness of instances for some kinds of queries, particularly those which eagerly load related items (object #1 references object A, then is returned and expunged; object #2 also references object A, but now you get a *different* instance of A, since the previous copy of it was expunged; surprises ensue). So I fear that embedding this feature directly into Query would lead to many confused users trying to get a "quick" way to be more efficient without really understanding the consequences. And as I mentioned, I think the overhead is primarily in populating the objects themselves, not in the memory allocation, so this feature would probably not solve too many problems (not to mention that the increased complexity would slow things down even more).
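To make "expunge as you load" concrete, here's a minimal sketch of doing it by hand against your original loop (using your User class and check_it_is_all_right() method). Keep in mind the caveats above: the Query still fetches the full result up front, so this only bounds what the session holds onto, and per-instance expunging only behaves sanely for simple, non-eager queries:

    # a sketch only -- hand-rolled "expunge as you load", subject to
    # the caveats above.  User and check_it_is_all_right() are from
    # the original question.
    for entry in session.query(User):
        entry.check_it_is_all_right()
        session.flush()          # persist any column changes first
        session.expunge(entry)   # then release the instance from the session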
Externally, the most straightforward way to achieve this with Query is to use LIMIT and OFFSET (typically by applying array slices to a Query), so that you query only some of the results at a time:

    query = session.query(MyObject).filter(<whatever>)
    start = 0
    while True:
        result = query[start:start + 100]
        <process result>
        session.clear()  # or session.expunge() each member of 'result'
        if len(result) < 100:
            break
        else:
            start += 100

The above approach is also compatible with queries which use eager loading, since the LIMIT/OFFSET is applied inside a subquery (in the case of eager loads being present), so that the eager LEFT OUTER JOINs are tacked onto the correct core rowset. But the above approach issues many queries, namely (number of rows / "clump" size) of them. To work around issuing clumped queries with LIMIT/OFFSET, here is a variant on that idea (not tested, may need adjusting) which issues just one SQL statement, but is not compatible with eager loading (unless you really tweaked it):

    class FakeResult(object):
        def __init__(self, result):
            self.result = result
        def fetchall(self):
            """Query.instances() calls fetchall() to retrieve results;
            return only a 'slice' of results."""
            return self.result.fetchmany(100)

    q = session.query(MyObject).filter(<whatever>)
    result = FakeResult(engine.execute(q.compile()))
    while True:
        results = q.instances(result)
        <process results>
        session.clear()  # or session.expunge() each member of 'results'
        if len(results) < 100:
            break

> Is there a way for efficient "on-demand" iteration over all rows when
> using orm object mapping, or do I have to use a more low-level
> protocol (like sqlalchemy without orm (fetchone()), or even Python DB
> API itself?)

So, on to the raw SQL idea. Yes, if you can adjust your processing functions in this case to deal with rows rather than ORM objects, you will be *way* better off, since there is a lot of memory and processing overhead associated with ORM row loading whether or not it's caching everything. ResultProxy objects are much faster and don't cache anything (by default). Even with the approaches I've outlined above, the ORM adds a lot of overhead. If you are just loading simple objects, you can still get your objects back using a simple paradigm such as:

    for row in result:
        instance = MyClass.__new__(MyClass)
        for key in row.keys():
            instance.__dict__[key] = row[key]

*However*: from a memory/buffering standpoint, be aware that **many DBAPIs pre-buffer all the rows anyway, before you ever get them**. So from a memory standpoint, you still might have difficulty scaling this in an unlimited fashion unless you break the request into multiple queries; even though raw rows take up far less memory than mapped instances, there's still a limit. If you are using cx_Oracle, I believe you get the rows unbuffered, but with psycopg2 the result sets are fully buffered by default; you have to use SQLAlchemy's "server_side_cursors=True" flag with the PGDialect, which uses some less-documented features of psycopg2 to work around it, and jumps through some extra hoops to deal with PG idiosyncrasies. I haven't tested other DBAPIs recently to see whether they're buffering or not.
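For example, a sketch of enabling that flag when creating the engine (the connection URL here is just a placeholder for your own):

    from sqlalchemy import create_engine

    # placeholder URL; server_side_cursors=True tells the postgres
    # dialect to use psycopg2's server-side (named) cursors, so rows
    # are fetched incrementally rather than fully buffered
    engine = create_engine('postgres://user:pass@localhost/mydb',
                           server_side_cursors=True)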
So that's the deal with buffering; hope it helps.

- mike