Hi Mike, thank you so much for such an extensive answer. It has given me much better insight into the topic, so that I can now make a qualified decision on how to proceed.
Best regards,
Boris

On Aug 6, 8:39 pm, Michael Bayer <[EMAIL PROTECTED]> wrote:
> On Aug 6, 2007, at 10:42 PM, Boris Duek wrote:
>
> > Hi,
> >
> > I am using sqlalchemy like this:
> >
> > entries = session.query(User)
> > for entry in entries:
> >     entry.check_it_is_all_right()  # includes changing values of
> >                                    # columns if necessary
> > session.flush()
> >
> > It would however suffice if sqlalchemy accessed the result
> > sequentially, without loading more than is really needed at the
> > moment. I thought the Python iterator protocol (__iter__ and next
> > methods) was tailored exactly for this kind of efficient access -
> > e.g. iterating over the lines of a file object (for line in
> > open("file.txt", "r")) works exactly as efficiently as I would
> > want sqlalchemy to in my case :-)
>
> I will show you some ways to do it; however, yes, I think you should
> consider using SQL-level instead of ORM-level results to do what you
> want. The 30 seconds of overhead you're experiencing is probably not
> due to memory consumption so much as to the overhead ORM loads
> require in order to instantiate objects, initialize and populate
> their attributes, and perform various decision-making with regard to
> extension hooks and the like. Additionally, SQLAlchemy does place
> some overhead on SQL-level results as well, since we do things like
> decode utf-8 into unicode and similar result processing, but this
> overhead is much smaller (though still not as fast as raw DBAPI).
>
> So first, my thoughts on a "streaming" Query object. When looking to
> "stream" results from a Query (which ultimately come from its
> instances() method), you need to consider the ORM's behavior and
> design regarding sessions and units of work; when you load objects,
> the full result is stored within the session, with the assumption
> that you're going to manipulate and work with these objects. To
> iterate through pieces of data without holding onto them, you'd want
> to expunge as you load. There is no functionality built directly
> into query.instances() to achieve this right now. A major reason
> it's difficult is that it's not easy to tell when an individual
> instance is fully "loaded"; many subsequent rows may apply to a
> single result instance as it loads related collections in via a
> join. It's also problematic to ensure the uniqueness of instances
> for some kinds of queries, particularly those which eagerly load
> related items (object #1 references object A, then is returned and
> expunged; object #2 also references object A... but now you get a
> *different* instance of A, since the previous copy of it was
> expunged. Surprises ensue).
>
> So I fear that embedding this feature directly into Query would lead
> to many confused users trying to get a "quick" way to be more
> efficient without really understanding the consequences... and as I
> mentioned, I think the overhead is primarily in populating the
> objects themselves, not the memory allocation part of it, so this
> feature would probably not solve too many problems (not to mention
> the increased complexity would slow it down even more).
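Before the external workarounds Mike describes below, here is a minimal, untested sketch of the expunge-as-you-load idea just discussed, reusing the User mapping and check_it_is_all_right() method from the original question and assuming no eager loads:

    # Sketch only: process each instance, flush pending changes, then
    # expunge so the session stops referencing it. Note the query still
    # fetches the full result up front, so this only trims the session's
    # memory footprint, not the DBAPI buffering Mike mentions later.
    for entry in session.query(User):
        entry.check_it_is_all_right()  # may modify column values
        session.flush()                # persist changes before expunging,
                                       # or they would be lost
        session.expunge(entry)         # release instance from the session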
> Externally, the most straightforward way to achieve this with Query
> would be by using LIMIT and OFFSET (typically by applying array
> slices to a Query) such that you query only some results at a time:
>
> query = session.query(MyObject).filter(<whatever>)
>
> start = 0
> while True:
>     result = query[start:start + 100]
>     <process result>
>     session.clear()  # or session.expunge() each member in 'result'
>     if len(result) < 100:
>         break
>     else:
>         start += 100
>
> The above approach is also compatible with queries which use eager
> loading, since the LIMIT/OFFSET is applied inside of a subquery (in
> the case of eager loads being present) so that the eager LEFT OUTER
> JOINs are tacked onto the correct core rowset.
>
> But the above approach issues many queries, namely (number of rows /
> "clump" size) of them. To avoid issuing clumped queries with LIMIT/
> OFFSET, here is a variant on that idea (not tested, may need
> adjusting) which issues just one SQL statement but is not compatible
> with eager loading (unless you really tweaked it):
>
> class FakeResult(object):
>     def __init__(self, result):
>         self.result = result
>
>     def fetchall(self):
>         """Query.instances() calls fetchall() to retrieve
>         results. Return only a 'slice' of results."""
>         return self.result.fetchmany(100)
>
> q = session.query(MyObject).filter(<whatever>)
>
> result = FakeResult(engine.execute(q.compile()))
>
> while True:
>     results = q.instances(result)
>     <process results>
>     session.clear()  # or session.expunge() each member in 'results'
>     if len(results) < 100:
>         break
>
> > Is there a way for efficient "on-demand" iteration over all rows
> > when using orm object mapping, or do I have to use a more
> > low-level protocol (like sqlalchemy without orm (fetchone()), or
> > even the Python DB API itself)?
>
> So, on to the raw SQL idea. Yes, if you can adjust your processing
> functions in this case to deal with rows and not ORM objects, you
> will be *way* better off, since there is a lot of memory and
> processing overhead associated with ORM row loading whether or not
> it's caching everything. ResultProxy objects are much faster and
> don't cache anything (by default). Even with the approaches I've
> outlined above, the ORM adds a lot of overhead. If you are just
> loading simple objects, you can still get your objects back using a
> simple paradigm such as:
>
> for row in result:
>     instance = MyClass.__new__(MyClass)
>     for key in row.keys():
>         instance.__dict__[key] = row[key]
>
> *However*. From a memory/buffering standpoint, be aware that **many
> DBAPIs are pre-buffering all the rows anyway, before you ever get
> them**. So from a memory standpoint, you still might have
> difficulties scaling this in an unlimited fashion unless you break
> the request into multiple queries, since even though raw rows take
> up far less memory than mapped instances, there's still a limit. If
> you are using cx_Oracle, I believe you get the rows unbuffered, but
> with psycopg2, the result sets are fully buffered by default; you
> have to use SQLAlchemy's "server_side_cursors=True" flag with the
> PGDialect, which uses some less-documented features of psycopg2 to
> work around it and jumps through some extra hoops to deal with PG
> idiosyncrasies. Other DBAPIs I haven't tested recently to see
> whether they're buffering or not.
>
> So that's the deal with buffering; hope it helps.
>
> - mike
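The "server_side_cursors=True" flag Mike mentions is passed at engine-creation time. A minimal sketch, assuming a PostgreSQL database reachable via psycopg2 (the connection URL and table name here are hypothetical):

    from sqlalchemy import create_engine

    # server_side_cursors=True makes the Postgres dialect use psycopg2
    # named (server-side) cursors, so rows stream from the server
    # instead of being fully buffered by the DBAPI up front.
    engine = create_engine(
        "postgres://scott:tiger@localhost/test",  # hypothetical DSN
        server_side_cursors=True,
    )

    result = engine.execute("select * from users")
    for row in result:  # rows are now fetched incrementally
        process(row)    # hypothetical per-row handler
    result.close()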