[sqlalchemy] Proper way to do processing across entire db?
I do a lot of processing on large amounts of data. The common pattern we follow is:

1. Iterate through a large data set.
2. Do some sort of processing (i.e. NLP processing like tokenization, capitalization, regex parsing, ...).
3. Insert the new result into another table.

Right now we are doing something like this:

    for x in session.query(Foo).yield_per(1):
        bar = Bar()
        bar.hello = x.world.lower()
        session.add(bar)
        session.flush()
    session.commit()

This works, though not great: typically we have to wait 30 minutes to an hour before we see any `bar`s being committed.

My question is: is there a way that we can commit as we are iterating, without breaking yield_per? If not, what is the recommended way of doing this?

NOTE: There are a lot of answers on Stack Overflow that involve writing custom pagination functions for session.query. Their efficiency and effectiveness have not been benchmarked.

-- 
You received this message because you are subscribed to the Google Groups sqlalchemy group. To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+unsubscr...@googlegroups.com. To post to this group, send email to sqlalchemy@googlegroups.com. Visit this group at http://groups.google.com/group/sqlalchemy?hl=en. For more options, visit https://groups.google.com/groups/opt_out.
Re: [sqlalchemy] Proper way to do processing across entire db?
On Thu, 21 Feb 2013 12:52:42 -0800 (PST), Victor Ng vicng...@gmail.com wrote:

> Right now we are doing something like this:
>
>     for x in session.query(Foo).yield_per(1):
>         bar = Bar()
>         bar.hello = x.world.lower()
>         session.add(bar)
>         session.flush()
>     session.commit()

Do you really need to flush after making each new Bar? That implies a database round-trip and a state sync with SQLAlchemy.

In any case, you should gather a profile to see where and how the time is being spent. SQLAlchemy is a complex framework, so whatever performance assumptions are implied in the code may be wrong.

Cheers,
M
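For reference, one way to gather such a profile is with the standard library's cProfile and pstats modules. This is only a sketch: `process_all()` below is a stand-in for the real yield_per loop, not code from the thread.

```python
# Minimal profiling sketch using the stdlib's cProfile, as suggested above.
# process_all() is a placeholder for the real query/flush/commit loop.
import cProfile
import io
import pstats

def process_all():
    # stand-in work for: iterate the query, build Bar objects, flush, commit
    total = 0
    for i in range(100000):
        total += i * i
    return total

pr = cProfile.Profile()
pr.enable()
process_all()
pr.disable()

# Sort entries by cumulative time to see which calls dominate the run.
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(10)
print("process_all" in buf.getvalue())  # True
```

With the real loop in place of the stand-in, the top entries show whether the time goes to the database round-trips, the ORM bookkeeping, or the NLP processing itself.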
Re: [sqlalchemy] Proper way to do processing across entire db?
Um, sure. That still doesn't answer my question, though. I am interested in persisting changes to my db as I am iterating through yield_per.

On Thursday, February 21, 2013 1:03:49 PM UTC-8, A.M. wrote:

> Do you really need to flush after making each new Bar? That implies a
> database round-trip and a state sync with SQLAlchemy. In any case, you
> should gather a profile to see where and how the time is being spent.
Re: [sqlalchemy] Proper way to do processing across entire db?
On 21 Feb 2013, at 22:44, Victor Ng vicng...@gmail.com wrote:

> Um, sure. That still doesn't answer my question, though. I am interested
> in persisting changes to my db as I am iterating through yield_per.

Do your Foo objects have an ordering that you can use, such as a numeric ID? If so, you could query for the first few hundred objects, process them, then do a new query for the next few hundred, and so on. This should at least keep the memory usage of the process under control.

Hope that helps,

Simon
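To make that concrete, here is a minimal runnable sketch of the batched approach against an in-memory SQLite database. The Foo/Bar model names follow the original post; the column definitions, seed data, and batch size are illustrative assumptions.

```python
# Batched processing by ordered primary key: fetch a fully materialized
# batch, process it, commit, then query for the next batch past the last id.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Foo(Base):
    __tablename__ = "foo"
    id = Column(Integer, primary_key=True)
    world = Column(String)

class Bar(Base):
    __tablename__ = "bar"
    id = Column(Integer, primary_key=True)
    hello = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# seed some sample rows
session.add_all([Foo(world="HELLO %d" % i) for i in range(30)])
session.commit()

BATCH = 7
last_id = 0
while True:
    batch = (session.query(Foo)
             .filter(Foo.id > last_id)
             .order_by(Foo.id)
             .limit(BATCH)
             .all())
    if not batch:
        break
    for x in batch:
        session.add(Bar(hello=x.world.lower()))
    last_id = batch[-1].id
    session.commit()  # safe: .all() consumed the cursor before committing

print(session.query(Bar).count())  # 30
```

Because each batch is fully loaded with .all() before the commit, there is no open cursor held across the transaction boundary, which is the problem with committing inside a yield_per loop.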
Re: [sqlalchemy] Proper way to do processing across entire db?
On Feb 21, 2013, at 3:52 PM, Victor Ng vicng...@gmail.com wrote:

> Is there a way that we can commit as we are iterating, without breaking
> yield_per? If not, what is the recommended way of doing this?

You can't really commit across a yield_per() that way, since yield_per() is holding onto an open cursor, and committing the transaction assumes cursors and everything else are disposed. This limitation is at least on the SQLAlchemy end, and is likely a limiting factor with most DBAPIs as well (though I haven't tested each one).

The recommended pattern is to read the records in chunks using row limiting. My preferred method is the one shown here:

http://www.sqlalchemy.org/trac/wiki/UsageRecipes/WindowedRangeQuery

This approach has the caveat that you're on a decent database like Postgresql, SQL Server, or Oracle which supports window functions. If not, you need to find some other way to define the chunks; for example, if the table has 1M records and incrementing integer ids, chunk it in ranges of 1-10000, 10001-20000, etc.

The other way to chunk is to use LIMIT/OFFSET, but this has the drawback that OFFSET scans through all the preceding rows, and you need an ORDER BY on every query. The window function approach only does one query with ORDER BY.
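For the no-window-functions case, the integer-range chunking described above might look like the following sketch, runnable against an in-memory SQLite database. Foo/Bar follow the original post; the column definitions, seed data, and chunk size are illustrative assumptions.

```python
# Integer-range chunking: process rows in fixed id ranges (1-10000,
# 10001-20000, ...), committing once per chunk. Each range's result set
# is fully consumed before the commit, so no open cursor is broken.
from sqlalchemy import create_engine, func, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Foo(Base):
    __tablename__ = "foo"
    id = Column(Integer, primary_key=True)
    world = Column(String)

class Bar(Base):
    __tablename__ = "bar"
    id = Column(Integer, primary_key=True)
    hello = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# seed some sample rows
session.add_all([Foo(world="WORLD %d" % i) for i in range(25)])
session.commit()

CHUNK = 10  # e.g. 10000 for the 1M-row case described above
max_id = session.query(func.max(Foo.id)).scalar() or 0

for start in range(1, max_id + 1, CHUNK):
    end = start + CHUNK - 1
    for x in session.query(Foo).filter(Foo.id.between(start, end)).all():
        session.add(Bar(hello=x.world.lower()))
    session.commit()  # commit per chunk; no result set is held open

print(session.query(Bar).count())  # 25
```

Note that gaps in the id sequence are harmless here (a range may simply match fewer rows), which is why this works without an ORDER BY on each chunk query.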