[sqlalchemy] Proper way to do processing across entire db?

2013-02-21 Thread Victor Ng
I do a lot of processing on large amounts of data. 

The common pattern we follow is: 

1. Iterate through a large data set 
2. Do some sort of processing (e.g. NLP processing like tokenization, 
capitalization, regex parsing, ...) 
3. Insert the new result in another table. 

Right now we are doing something like this: 

for x in session.query(Foo).yield_per(1): 
  bar = Bar()
  bar.hello = x.world.lower()
  session.add(bar)
  session.flush()
session.commit()

This works, but not well: typically we have to wait 30 minutes to an hour 
before any `bar`s are committed. 

My question is: Is there a way that we can commit as we are iterating 
without breaking yield_per? 

If not, what is the recommended way of doing this? 

NOTE: There are a lot of answers on Stack Overflow that involve writing 
custom pagination functions for session.query; their efficiency and 
effectiveness have not been benchmarked. 






Re: [sqlalchemy] Proper way to do processing across entire db?

2013-02-21 Thread A.M.
On Thu, 21 Feb 2013 12:52:42 -0800 (PST), Victor Ng vicng...@gmail.com
wrote:
> I do a lot of processing on large amounts of data. 
> 
> The common pattern we follow is: 
> 
> 1. Iterate through a large data set 
> 2. Do some sort of processing (e.g. NLP processing like tokenization, 
> capitalization, regex parsing, ...) 
> 3. Insert the new result in another table. 
> 
> Right now we are doing something like this: 
> 
> for x in session.query(Foo).yield_per(1): 
>   bar = Bar()
>   bar.hello = x.world.lower()
>   session.add(bar)
>   session.flush()
> session.commit()

Do you really need to flush after making each new Bar? That implies a
database round-trip and state sync with SQLAlchemy.

In any case, you should profile the code to see where and how the time is
being spent. SQLAlchemy is a complex framework, so whatever performance
assumptions are implied in the code may be wrong.
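
For example, something along these lines might already behave better (an
untested sketch; the batch size is arbitrary, and Foo/Bar/session are assumed
to be the same objects as in your snippet):

# Flush in batches instead of once per row.
BATCH_SIZE = 1000  # arbitrary; tune for your data

for i, x in enumerate(session.query(Foo).yield_per(1000), start=1):
    bar = Bar()
    bar.hello = x.world.lower()
    session.add(bar)
    if i % BATCH_SIZE == 0:
        session.flush()  # one round trip per batch rather than per row
session.commit()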

Cheers,
M





Re: [sqlalchemy] Proper way to do processing across entire db?

2013-02-21 Thread Victor Ng
Um, sure. 

That still doesn't answer my question, though. 

I am interested in persisting changes to my db as I am iterating through 
yield_per. 

On Thursday, February 21, 2013 1:03:49 PM UTC-8, A.M. wrote:

> On Thu, 21 Feb 2013 12:52:42 -0800 (PST), Victor Ng vicn...@gmail.com 
> wrote: 
> > I do a lot of processing on large amounts of data. 
> > 
> > The common pattern we follow is: 
> > 
> > 1. Iterate through a large data set 
> > 2. Do some sort of processing (e.g. NLP processing like tokenization, 
> > capitalization, regex parsing, ...) 
> > 3. Insert the new result in another table. 
> > 
> > Right now we are doing something like this: 
> > 
> > for x in session.query(Foo).yield_per(1): 
> >   bar = Bar() 
> >   bar.hello = x.world.lower() 
> >   session.add(bar) 
> >   session.flush() 
> > session.commit() 
> 
> Do you really need to flush after making each new Bar? That implies a 
> database round-trip and state sync with SQLAlchemy. 
> 
> In any case, you should profile the code to see where and how the time is 
> being spent. SQLAlchemy is a complex framework, so whatever performance 
> assumptions are implied in the code may be wrong. 
> 
> Cheers, 
> M 






Re: [sqlalchemy] Proper way to do processing across entire db?

2013-02-21 Thread Simon King
On 21 Feb 2013, at 22:44, Victor Ng vicng...@gmail.com wrote:

> On Thursday, February 21, 2013 1:03:49 PM UTC-8, A.M. wrote:
> > On Thu, 21 Feb 2013 12:52:42 -0800 (PST), Victor Ng vicn...@gmail.com 
> > wrote: 
> > > I do a lot of processing on large amounts of data. 
> > > 
> > > The common pattern we follow is: 
> > > 
> > > 1. Iterate through a large data set 
> > > 2. Do some sort of processing (e.g. NLP processing like tokenization, 
> > > capitalization, regex parsing, ...) 
> > > 3. Insert the new result in another table. 
> > > 
> > > Right now we are doing something like this: 
> > > 
> > > for x in session.query(Foo).yield_per(1): 
> > >   bar = Bar() 
> > >   bar.hello = x.world.lower() 
> > >   session.add(bar) 
> > >   session.flush() 
> > > session.commit() 
> > 
> > Do you really need to flush after making each new Bar? That implies a 
> > database round-trip and state sync with SQLAlchemy. 
> > 
> > In any case, you should profile the code to see where and how the time is 
> > being spent. SQLAlchemy is a complex framework, so whatever performance 
> > assumptions are implied in the code may be wrong. 
> > 
> > Cheers, 
> > M 
> 
> Um, sure. 
> 
> That still doesn't answer my question, though. 
> 
> I am interested in persisting changes to my db as I am iterating through 
> yield_per. 


Do your Foo objects have an ordering that you can use, such as a numeric ID? If 
so, you could query for the first few hundred objects, process them, then do a 
new query for the next few hundred, and so on. That should at least keep the 
memory usage of the process under control.
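
Roughly something like this (an untested sketch; it assumes Foo has an indexed, 
monotonically increasing integer primary key Foo.id, starting above 0):

# Walk the table in id order, a chunk at a time, committing per chunk.
CHUNK = 500
last_id = 0

while True:
    rows = (session.query(Foo)
                   .filter(Foo.id > last_id)
                   .order_by(Foo.id)
                   .limit(CHUNK)
                   .all())
    if not rows:
        break
    for x in rows:
        bar = Bar()
        bar.hello = x.world.lower()
        session.add(bar)
    last_id = rows[-1].id
    session.commit()  # no cursor is held open here, so committing is fine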

Hope that helps,

Simon






Re: [sqlalchemy] Proper way to do processing across entire db?

2013-02-21 Thread Michael Bayer

On Feb 21, 2013, at 3:52 PM, Victor Ng vicng...@gmail.com wrote:

> I do a lot of processing on large amounts of data. 
> 
> The common pattern we follow is: 
> 
> 1. Iterate through a large data set 
> 2. Do some sort of processing (e.g. NLP processing like tokenization, 
> capitalization, regex parsing, ...) 
> 3. Insert the new result in another table. 
> 
> Right now we are doing something like this: 
> 
> for x in session.query(Foo).yield_per(1): 
>   bar = Bar()
>   bar.hello = x.world.lower()
>   session.add(bar)
>   session.flush()
> session.commit()
> 
> This works, but not well: typically we have to wait 30 minutes to an hour 
> before any `bar`s are committed. 
> 
> My question is: Is there a way that we can commit as we are iterating without 
> breaking yield_per? 
> 
> If not, what is the recommended way of doing this? 

You can't really commit across a yield_per() that way, since yield_per() is 
holding onto an open cursor and committing the transaction assumes that cursors 
and everything else have been disposed. This limitation exists at least on the 
SQLAlchemy side, and it is likely a limiting factor with most DBAPIs as well 
(though I haven't tested each one).

The recommended pattern is to read the records in chunks using row limiting. 
My preferred method is the one shown here: 
http://www.sqlalchemy.org/trac/wiki/UsageRecipes/WindowedRangeQuery. This 
approach has the caveat that you need to be on a decent database like 
Postgresql, SQL Server, or Oracle that supports window functions. If not, you 
need to find some other way to define the chunks; for example, if the table has 
1M records and incrementing integer ids, chunk it in ranges of 1-10000, 
10001-20000, etc.
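
Condensed, the idea in that recipe looks roughly like this (an untested sketch; 
it assumes Foo.id is an indexed integer primary key and a database with window 
function support -- see the link above for the full version):

from sqlalchemy import func

def window_ranges(session, column, window_size):
    # One ordered pass using row_number() to pick the id that starts each
    # window; yields half-open (start, end) ranges covering every row.
    rownum = func.row_number().over(order_by=column).label("rownum")
    subq = session.query(column.label("id"), rownum).subquery()
    starts = [row.id for row in
              session.query(subq.c.id)
                     .filter((subq.c.rownum - 1) % window_size == 0)
                     .order_by(subq.c.id)]
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else None
        yield start, end

for start, end in window_ranges(session, Foo.id, 10000):
    q = session.query(Foo).filter(Foo.id >= start)
    if end is not None:
        q = q.filter(Foo.id < end)
    for x in q:
        bar = Bar()
        bar.hello = x.world.lower()
        session.add(bar)
    session.commit()  # each window's query is fully consumed before this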

The other way to chunk is to use LIMIT/OFFSET, but this has the drawback that 
OFFSET scans through all the preceding rows and you need an ORDER BY on every 
query. The window function approach only does one query with ORDER BY.
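
For comparison, the LIMIT/OFFSET version would look something like this (again 
an untested sketch; note that each page re-sorts and skips all the preceding 
rows, so it gets slower as the offset grows):

CHUNK = 10000
offset = 0
while True:
    rows = (session.query(Foo)
                   .order_by(Foo.id)   # ORDER BY required on every query
                   .limit(CHUNK)
                   .offset(offset)
                   .all())
    if not rows:
        break
    for x in rows:
        bar = Bar()
        bar.hello = x.world.lower()
        session.add(bar)
    session.commit()
    offset += CHUNK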

