On Feb 21, 2013, at 3:52 PM, Victor Ng <vicng...@gmail.com> wrote:

> I do a lot of processing on large amount of data. 
> 
> The common pattern we follow is: 
> 
> 1. Iterate through a large data set 
> 2. Do some sort of processing (i.e. NLP processing like tokenization, 
> capitalization, regex parsing, ... ) 
> 3. Insert the new result in another table. 
> 
> Right now we are doing something like this: 
> 
> for x in session.query(Foo).yield_per(10000): 
>   bar = Bar()
>   bar.hello = x.world.lower()
>   session.add(bar)
>   session.flush()
> session.commit()
> 
> This works, not great though. Typically, we will have to wait 30mins - 1hr to 
> see `bar`s being committed. 
> 
> My question is: Is there a way that we can commit as we are iterating without 
> breaking yield_per? 
> 
> If not, what is the recommended way of doing this? 

you can't really commit across a yield_per() that way since the yield_per() is 
holding onto an open cursor.   Committing the transaction assumes cursors and 
everything else are disposed.  This limitation is at least on the SQLAlchemy 
end and is likely a limiting factor with most DBAPIs as well (though I haven't 
tested each one).

The recommended pattern is to read the records in chunks using row limiting.   
My preferred method is that shown here: 
http://www.sqlalchemy.org/trac/wiki/UsageRecipes/WindowedRangeQuery.  This 
approach has the caveat that you're on a decent database like Postgresql, SQL 
Server, or Oracle which supports window functions.  If not, you need to find 
some other way to define the "chunks", such as if the table has 1M records and 
incrementing integer ids, chunk it in ranges of "1-10000", "10001-20000", etc.

The other way to "chunk" is to use LIMIT/OFFSET but this has the drawback that 
the OFFSET scans through all the preceding rows, and you need to ORDER BY on 
every query.  The window function approach only does one query with ORDER BY.


-- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at http://groups.google.com/group/sqlalchemy?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.


Reply via email to