[sqlalchemy] Re: Best design for commits?

2011-04-25 Thread Andrey Petrov
Hi Aviv, Since your bottleneck is fetching the URLs, I suggest you look at using workerpool with urllib3. It helps you do exactly what Michael describes. (Disclaimer: I wrote both workerpool and urllib3; they were built to complement each other.) There a…

Re: [sqlalchemy] Re: Best design for commits?

2011-04-20 Thread Michael Bayer
If the URL fetch is an IO-bound operation (i.e. the time is spent waiting for IO), it might work if you did a standard consumer/producer model using Queue.Queue. One thread retrieves data from each URL and places the datasets into the Queue. The other thread pulls off items and loads them int…
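Michael's producer/consumer idea can be sketched with the stdlib `queue` and `threading` modules. This is a minimal illustration, not his code: `fetch()` is a stand-in for the real urllib3 request, and the function names are mine.

```python
import queue
import threading

def fetch(url):
    # Stand-in for the real IO-bound HTTP fetch (e.g. via urllib3);
    # here it just returns a fake payload so the sketch is runnable.
    return "data for %s" % url

def producer(urls, q):
    # One producer thread per batch of URLs: fetch each and enqueue the result.
    for url in urls:
        q.put(fetch(url))
    q.put(None)  # sentinel: tells the consumer this producer is done

def consume(q, n_producers):
    # Single consumer: pull datasets off the queue and "load" them.
    # In the thread's scenario, this is where Session.add()/flush() would go,
    # keeping all database work on one thread.
    loaded, done = [], 0
    while done < n_producers:
        item = q.get()
        if item is None:
            done += 1
        else:
            loaded.append(item)
    return loaded

def run(urls, n_threads=2):
    q = queue.Queue()
    # Split the URL list round-robin across the producer threads.
    batches = [urls[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=producer, args=(b, q)) for b in batches]
    for t in threads:
        t.start()
    loaded = consume(q, len(threads))
    for t in threads:
        t.join()
    return loaded
```

Because the fetches are IO-bound, plain threads overlap the waiting time even under the GIL, which is the whole point of this model.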

[sqlalchemy] Re: Best design for commits?

2011-04-20 Thread Aviv Giladi
Thank you for your responses, everyone. I have one more question: the really time-heavy task here is retrieving the URLs over HTTP (it takes almost a second per URL). I am using urllib3, which has connection pooling, but other than that, is there any other way to speed this up? Perhaps multi-threadin…

Re: [sqlalchemy] Re: Best design for commits?

2011-04-20 Thread Michael Bayer
my practices with this kind of situation are:
1. There's just one commit() at the end; I'd like the whole operation in one transaction.
2. There are flush() calls every 100-1000 or so. 10 is very low.
3. I frequently will disable autoflush, if there are many flushes occurring due to queries for…
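The batching pattern behind points 1 and 2 can be sketched with the stdlib `sqlite3` module standing in for a SQLAlchemy Session: the periodic `executemany()` plays the role of `Session.flush()` (push pending rows to the DB without ending the transaction), and the single `conn.commit()` at the end is the one `commit()`. The `pages(url, body)` schema is made up for the example.

```python
import sqlite3

FLUSH_EVERY = 500  # Michael suggests flushing every 100-1000 rows; 10 is very low

def load_rows(rows):
    # One transaction for the whole operation, with periodic "flushes"
    # (batched inserts) instead of a round trip per row.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, body TEXT)")
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= FLUSH_EVERY:
            conn.executemany("INSERT INTO pages VALUES (?, ?)", batch)
            batch = []
    if batch:
        conn.executemany("INSERT INTO pages VALUES (?, ?)", batch)
    conn.commit()  # the single commit at the end
    return conn
```

In actual SQLAlchemy code, point 3 would correspond to wrapping the query-heavy parts in `with session.no_autoflush:` so that intermediate queries don't trigger extra flushes.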

[sqlalchemy] Re: Best design for commits?

2011-04-20 Thread Aviv Giladi
Thanks again Rick. The issue is that I have a LOT of duplicates (around 20-30%); that's just how that tree is structured. Therefore, I think I am going to go with catching DB exceptions regardless, but also use an indexed collection to prevent duplicates. Cheers!
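Aviv's combined approach can be sketched as follows: an in-memory set (his "indexed collection") screens out duplicates seen during this run without touching the database, while catching the driver's integrity error remains the backstop for keys already in the table. Again the `pages` table and `sqlite3` are stand-ins for his real schema and SQLAlchemy session.

```python
import sqlite3

def insert_unique(conn, rows):
    # seen: keys already added in this run -- a cheap membership test
    # that avoids a DB round trip for the 20-30% in-run duplicates.
    seen = set()
    inserted = 0
    for key, body in rows:
        if key in seen:
            continue  # duplicate within this run: skip without touching the DB
        seen.add(key)
        try:
            conn.execute("INSERT INTO pages VALUES (?, ?)", (key, body))
            inserted += 1
        except sqlite3.IntegrityError:
            pass  # key was already in the table from an earlier run
    conn.commit()
    return inserted
```

With SQLAlchemy the except clause would catch `sqlalchemy.exc.IntegrityError` instead, and the failed row would need to be expunged (or the transaction rolled back to a savepoint) before continuing.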

[sqlalchemy] Re: Best design for commits?

2011-04-20 Thread Richard Harding
I'm not sure, but I'd check the exception and see if you can get the info about which of your 50 was the dupe. I don't recall if it's in the traceback or the exception error. If you can identify it, then you could store it aside, remove it from the session, and retry the other 49 again. Otherwis…
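A simpler variant of Richard's "retry the other 49" idea, which doesn't require parsing the exception to find the offender: try the whole chunk in one statement, and if a duplicate-key error aborts it, fall back to row-by-row inserts so only the dupe is skipped. Sketched with stdlib `sqlite3` (where `with conn:` commits on success and rolls back on error); the table name is hypothetical.

```python
import sqlite3

def commit_chunk(conn, rows):
    # Fast path: insert the whole chunk in one transaction.
    try:
        with conn:
            conn.executemany("INSERT INTO pages VALUES (?, ?)", rows)
        return len(rows)
    except sqlite3.IntegrityError:
        # Slow path: the chunk contained at least one duplicate key.
        # Retry each row individually so the good rows still land.
        ok = 0
        for row in rows:
            try:
                with conn:
                    conn.execute("INSERT INTO pages VALUES (?, ?)", row)
                ok += 1
            except sqlite3.IntegrityError:
                pass  # this row was the dupe: skip it
        return ok
```

The fallback costs one commit per row, but it only runs for the occasional chunk that actually contains a duplicate.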

[sqlalchemy] Re: Best design for commits?

2011-04-20 Thread Aviv Giladi
I agree, but the thing is that committing every 10 entries is a little low for me; I was thinking of around 50, in which case having 50 individual commits is quite costly. In case I choose to implement your method, how would you go about it? How do you keep objects of the last 50 or whatever reco…

[sqlalchemy] Re: Best design for commits?

2011-04-20 Thread Richard Harding
Not that I'm aware of. When you do a Session.add() it's not touching the database yet; it's part of the performance tradeoff. There's not a good way for it to *know* there's a record in the db with that pk id until it does chat with the db on it. Sure, you can keep a list of ids on the side if y…

[sqlalchemy] Re: Best design for commits?

2011-04-20 Thread Aviv Giladi
Dear Rick, Thank you for your reply. I understand, but is there not a better way than doing a lot of single commits in case of a commit exception? In other words, is there a way to tell SQLAlchemy to throw an exception on Session.add() if there's a duplicate, as opposed to on Session.commit()?

[sqlalchemy] Re: Best design for commits?

2011-04-20 Thread Richard Harding
What I tend to do in cases like this is to break things into commit chunks. For instance, I've got an import script that goes through and processes 10 at a time and performs a commit every 10. This is tweakable via a config setting, but so far 10 works for my needs. As for the duplicates, if t…
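The chunking itself is a few lines of plain Python: split the incoming records into commit-sized batches, with the batch size driven by the config setting Richard mentions. A minimal sketch (the helper name is mine):

```python
def chunks(items, size=10):
    # Yield commit-sized batches of records.
    # `size` is the tweakable config setting: commit once per batch
    # instead of once per record.
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

Each yielded batch would then be added to the session and followed by one `session.commit()`, so a failure only rolls back (and needs retrying for) that batch.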