Hi Mike, thank you so much for such an extensive answer. It has given me much better insight into the topic, so that I can now make a qualified decision on how to proceed.
Best regards,
Boris

On Aug 6, 8:39 pm, Michael Bayer <[EMAIL PROTECTED]> wrote:
> On Aug 6, 2007, at 10:42 PM, Boris Duek wrote:
>
> > Hi,
> >
> > I am using sqlalchemy like this:
> >
> > entries = session.query(User)
> > for entry in entries:
> >     entry.check_it_is_all_right()  # includes changing values of
> >                                    # columns if necessary
> > session.flush()
> >
> > It would however suffice if sqlalchemy accessed the result
> > sequentially, without loading more than is really needed at the
> > moment. I thought the Python iterator protocol (__iter__ and next
> > methods) was tailored exactly for this kind of efficient access -
> > e.g. iterating over the lines of a file object (for line in
> > open("file.txt", "r")) works exactly as efficiently as I would
> > want sqlalchemy to in my case :-)
>
> I will show you some ways to do it; however, yes, I think you should
> consider using SQL-level instead of ORM-level results to do what you
> want. The 30 seconds of overhead you're experiencing is probably not
> due to memory consumption so much as to the overhead ORM loads
> require in order to instantiate objects, initialize and populate
> their attributes, and perform various decision-making with regard to
> extension hooks and the like. Additionally, SQLAlchemy does place
> some overhead on SQL-level results as well, since we do things like
> decode utf-8 into unicode and similar result processing, but this
> overhead is much smaller (though still not as fast as raw DBAPI).
>
> So first, my thoughts on a "streaming" Query object. When looking to
> "stream" results from a Query (which ultimately come from its
> instances() method), you need to consider the ORM's behavior and
> design regarding sessions and units of work; when you load objects,
> the full result is stored within the session, with the assumption
> that you're going to manipulate and work with these objects. To
> iterate through pieces of data without holding onto them, you'd want
> to expunge as you load. There is no functionality built directly
> into query.instances() to achieve this right now. A major reason
> it's difficult is that it's not easy to tell when an individual
> instance is fully "loaded"; many subsequent rows may apply to a
> single result instance as it loads related collections in via a
> join. It's also problematic to ensure the uniqueness of instances
> for some kinds of queries, particularly those which eagerly load
> related items (object #1 references object A, then is returned and
> expunged; object #2 also references object A... but now you get a
> *different* instance of A, since the previous copy of it was
> expunged. Surprises ensue).
>
> So I fear that embedding this feature directly into Query would lead
> to many confused users trying to get a "quick" way to be more
> efficient without really understanding the consequences... and as I
> mentioned, I think the overhead is primarily in populating the
> objects themselves, not the memory allocation part of it, so this
> feature would probably not solve too many problems (not to mention
> the increased complexity would slow it down even more).
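Before the external workarounds Mike describes below, here is a minimal, untested sketch of the expunge-as-you-load idea just discussed, reusing the User mapping and check_it_is_all_right() method from the original question and assuming no eager loads:

    # Sketch only: process each instance, flush pending changes, then
    # expunge so the session stops referencing it. Note the query still
    # fetches the full result up front, so this only trims the session's
    # memory footprint, not the DBAPI buffering Mike mentions later.
    for entry in session.query(User):
        entry.check_it_is_all_right()  # may modify column values
        session.flush()                # persist changes before expunging,
                                       # or they would be lost
        session.expunge(entry)         # release instance from the session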
> Externally, the most straightforward way to achieve this with Query
> would be by using LIMIT and OFFSET (typically by applying array
> slices to a Query) such that you query only some results at a time:
>
> query = session.query(MyObject).filter(<whatever>)
>
> start = 0
> while True:
>     result = query[start:start + 100]
>     <process result>
>     session.clear()  # or session.expunge() each member in 'result'
>     if len(result) < 100:
>         break
>     else:
>         start += 100
>
> The above approach is also compatible with queries which use eager
> loading, since the LIMIT/OFFSET is applied inside of a subquery (in
> the case of eager loads being present) so that the eager LEFT OUTER
> JOINs are tacked onto the correct core rowset.
>
> But the above approach issues many queries, namely (number of rows /
> "clump" size) of them. To avoid issuing clumped queries with LIMIT/
> OFFSET, here is a variant on that idea (not tested, may need
> adjusting) which issues just one SQL statement but is not compatible
> with eager loading (unless you really tweaked it):
>
> class FakeResult(object):
>     def __init__(self, result):
>         self.result = result
>
>     def fetchall(self):
>         """Query.instances() calls fetchall() to retrieve
>         results. Return only a 'slice' of results."""
>         return self.result.fetchmany(100)
>
> q = session.query(MyObject).filter(<whatever>)
>
> result = FakeResult(engine.execute(q.compile()))
>
> while True:
>     results = q.instances(result)
>     <process results>
>     session.clear()  # or session.expunge() each member in 'results'
>     if len(results) < 100:
>         break
>
> > Is there a way for efficient "on-demand" iteration over all rows
> > when using orm object mapping, or do I have to use a more
> > low-level protocol (like sqlalchemy without orm (fetchone()), or
> > even the Python DB API itself)?
>
> So, on to the raw SQL idea. Yes, if you can adjust your processing
> functions in this case to deal with rows and not ORM objects, you
> will be *way* better off, since there is a lot of memory and
> processing overhead associated with ORM row loading whether or not
> it's caching everything. ResultProxy objects are much faster and
> don't cache anything (by default). Even with the approaches I've
> outlined above, the ORM adds a lot of overhead. If you are just
> loading simple objects, you can still get your objects back using a
> simple paradigm such as:
>
> for row in result:
>     instance = MyClass.__new__(MyClass)
>     for key in row.keys():
>         instance.__dict__[key] = row[key]
>
> *However*. From a memory/buffering standpoint, be aware that **many
> DBAPIs are pre-buffering all the rows anyway, before you ever get
> them**. So from a memory standpoint, you still might have
> difficulties scaling this in an unlimited fashion unless you break
> the request into multiple queries, since even though raw rows take
> up far less memory than mapped instances, there's still a limit. If
> you are using cx_Oracle, I believe you get the rows unbuffered, but
> with psycopg2, the result sets are fully buffered by default; you
> have to use SQLAlchemy's "server_side_cursors=True" flag with the
> PGDialect, which uses some less-documented features of psycopg2 to
> work around it and jumps through some extra hoops to deal with PG
> idiosyncrasies. Other DBAPIs I haven't tested recently to see
> whether they're buffering or not.
>
> So that's the deal with buffering; hope it helps.
>
> - mike
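The "server_side_cursors=True" flag Mike mentions is passed at engine-creation time. A minimal sketch, assuming a PostgreSQL database reachable via psycopg2 (the connection URL and table name here are hypothetical):

    from sqlalchemy import create_engine

    # server_side_cursors=True makes the Postgres dialect use psycopg2
    # named (server-side) cursors, so rows stream from the server
    # instead of being fully buffered by the DBAPI up front.
    engine = create_engine(
        "postgres://scott:tiger@localhost/test",  # hypothetical DSN
        server_side_cursors=True,
    )

    result = engine.execute("select * from users")
    for row in result:  # rows are now fetched incrementally
        process(row)    # hypothetical per-row handler
    result.close()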