On Aug 6, 2007, at 10:42 PM, Boris Dušek wrote:

>
> Hi,
>
> I am using sqlalchemy like this:
>
> entries = session.query(User)
> for entry in entries:
>     entry.check_it_is_all_right() # includes changing values of
> columns if necessary
>     session.flush()
>
> It would, however, suffice if sqlalchemy accessed the result
> sequentially, without loading more than is really needed at the
> moment. I thought the Python iterator protocol (__iter__ and next
> methods) is exactly tailored for this kind of efficient access (e.g.
> iterating over the lines of a file object (for line in open("file.txt",
> "r")) works exactly as efficiently as I would want sqlalchemy to in my
> case :-) -

I will show you some ways to do it; however, yes, I think you should
consider using SQL-level rather than ORM-level results for what you
want.  The 30 seconds of overhead you're experiencing is probably due
not so much to memory consumption as to the overhead ORM loads require
in order to instantiate objects, initialize and populate their
attributes, and perform various decision-making with regard to
extension hooks and the like.  Additionally, SQLAlchemy does place
some overhead on SQL-level results as well, since we do things like
decode utf-8 into unicode and similar result processing, but this
overhead is much smaller (though still not as fast as raw DBAPI).

So first, my thoughts on a "streaming" Query object.  When looking to
"stream" results from a Query (which ultimately comes from its
instances() method), you need to consider the ORM's behavior and
design regarding sessions and units of work; when you load objects,
the full result is stored within a session, with the assumption that
you're going to manipulate and work with these objects.  To iterate
through pieces of data and not hold onto them means you'd want to
expunge as you load.  There is no functionality built directly into
query.instances() to achieve this right now - a major reason it's
difficult is that it's not very easy to tell when an individual
instance is fully "loaded"; many subsequent rows may apply to a
single result instance as it loads in related collections due to a
join.  It's also problematic to ensure the uniqueness of instances for
some kinds of queries, particularly those which eagerly load related
items (object #1 references object A, then is returned and expunged;
object #2 also references object A...but now you get a *different*
instance of A, since the previous copy of it was expunged.  Surprises
ensue).

So I fear that embedding this feature directly into Query would lead
to many confused users trying to get a "quick" way to be more
efficient without really understanding the consequences...and as I
mentioned, I think the overhead is primarily in populating the
objects themselves, not in the memory allocation, so this feature
would probably not solve too many problems (not to mention the
increased complexity would slow it down even more).

Externally, the most straightforward way to achieve this with Query  
would be by using LIMIT and OFFSET (typically by applying array  
slices to a Query) such that you query only some results at a time:

query = session.query(MyObject).filter(<whatever>)

start = 0
while True:
        result = query[start:start + 100]  # slice emits LIMIT 100 OFFSET <start>
        <process result>
        session.clear()  # or session.expunge() each member in 'result'
        if len(result) < 100:
                break
        else:
                start += 100

The above approach is also compatible with queries which use eager
loading, since the LIMIT/OFFSET is applied inside of a subquery (when
eager loads are present) so that the eager LEFT OUTER JOINs are
tacked onto the correct core rowset.
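
For instance, a sliced query against a hypothetical User class with an
eagerly-loaded 'addresses' relation (the names here are illustrative,
not from your mapping) might look like:

from sqlalchemy.orm import eagerload

# hypothetical mapping: User has a relation named 'addresses'
query = session.query(User).options(eagerload('addresses'))

# the slice compiles to LIMIT/OFFSET; with the eager load present it is
# applied in a subquery, so the LEFT OUTER JOIN doesn't skew the paging
page = query[0:100]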

But the above approach issues many queries - namely, (number of rows /
"clump" size) of them.  To work around issuing clumped queries with
LIMIT/OFFSET, here is a variant on that idea (not tested, may need
adjusting) which issues just one SQL statement but is not compatible
with eager loading (unless you really tweaked it):

class FakeResult(object):
        def __init__(self, result):
                self.result = result

        def fetchall(self):
                """Query.instances() calls fetchall() to retrieve
                results.  Return only a 'slice' of results."""

                return self.result.fetchmany(100)

q = session.query(MyObject).filter(<whatever>)

result = FakeResult(engine.execute(q.compile()))

while True:
        results = q.instances(result)
        <process results>
        session.clear()  # or session.expunge() each member in 'results'
        if len(results) < 100:
                break
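
If there are other objects in the session you'd rather not throw away,
the session.clear() inside that loop could be swapped for a
per-instance expunge, roughly like this:

        # inside the while loop, instead of session.clear():
        for obj in results:
                session.expunge(obj)  # detach only the instances just loaded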
        
>
> Is there a way for efficient "on-demand" iteration over all rows when
> using orm object mapping, or do I have to use a more low-level
> protocol (like sqlalchemy without orm (fetchone()), or even Python DB
> API itself?)

So, onto the raw SQL idea.  Yes, if you can adjust your processing
functions in this case to deal with rows and not ORM objects, you
will be *way* better off, since there is a lot of memory and
processing overhead associated with ORM row loading whether or not
it's caching everything.  ResultProxy objects are much faster and
don't cache anything (by default).  Even with the approaches I've
outlined above, the ORM adds a lot of overhead.  If you are just
loading simple objects, you can still get your objects back using a
simple paradigm such as:

        for row in result:
                instance = MyClass.__new__(MyClass)  # bypass __init__
                for key in row.keys():
                        instance.__dict__[key] = row[key]
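
Here is a minimal, self-contained sketch of that pattern (the table,
class, and column names are hypothetical; it assumes the row's column
names match the attribute names you want on the object):

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

engine = create_engine('sqlite://')  # any engine; in-memory sqlite for illustration
metadata = MetaData()
users = Table('users', metadata,
        Column('id', Integer, primary_key=True),
        Column('name', String(50)))
metadata.create_all(engine)

class User(object):
        pass

result = engine.execute(users.select())  # ResultProxy; rows come back as you iterate
for row in result:
        instance = User.__new__(User)    # bypass __init__, fill __dict__ directly
        for key in row.keys():
                instance.__dict__[key] = row[key]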

*However*.  From a memory/buffering standpoint, be aware that **many
DBAPIs are pre-buffering all the rows anyway, before you ever get
them**.  So from a memory standpoint, you still might have
difficulties scaling this in an unlimited fashion unless you break
the request into multiple queries, since even though raw rows take up
far less memory than a mapped instance, there's still a limit.  If
you are using cx_Oracle, I believe you get the rows unbuffered, but
with psycopg2, the result sets are fully buffered by default; you
have to use SQLAlchemy's "server_side_cursors=True" flag with the
PGDialect, which uses some less-documented features of psycopg2 to
work around it and jumps through some extra hoops to deal with PG
idiosyncrasies.  I haven't tested other DBAPIs recently to see
whether they're buffering or not.
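
For what it's worth, that flag is passed to create_engine(); a minimal
sketch, with a placeholder connection URL:

from sqlalchemy import create_engine

# server_side_cursors=True has the PG dialect use psycopg2's named
# (server-side) cursors, so rows stream instead of being fully buffered
engine = create_engine('postgres://scott:tiger@localhost/test',
        server_side_cursors=True)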

So that's the deal with buffering - hope it helps.

- mike




