Re: Transactions and paged_each
On Friday, September 2, 2016 at 2:31:29 PM UTC-5, Jeremy Evans wrote:
>
> I'm guessing streaming would perform better, but you would still need to
> use a separate thread or sharding when using streaming.
>

That's great. I'll give streaming a try, but I believe the bottleneck isn't
going to be retrieving the rows, but the work I'm doing before updating them.

> I have to say, if you are really just updating updated_at and are
> concerned about performance, I would not use your approach unless you
> absolutely had to have a model hook executed. The only time you really
> need to do a query per row is if you need to execute ruby code on a
> per-row basis to determine what the new column values should be.
>

This is fantastic advice, and I'll bookmark this for future use. I'm sorry I
didn't make it clear, but I meant that updated_at example to be a simplified
version of what I'll be doing. I'll be calculating values on a per-record
basis and then issuing an update_all call per record, which will include
updated_at so each processed record drops out of the set of records to be
processed:

    Person.where(updated_at: nil).paged_each { |p|
      Thread.new {
        p.update_all(some_calculated_value: CalculateValue(p), updated_at: Time.now.utc)
      }.join
    }

Apologies, I should have made the example a bit clearer. In any case, your
suggestion will be useful to future Googlers, I'm sure.

Thank you again for all of your help. I'm off to the races using paged_each
and threads. I'll report back if I run into anything of interest relevant to
this thread, and I'll try sequel_pg and streaming if time permits in the
next few weeks. It sounds like we'd get a number of performance benefits
beyond streaming vs cursors, so it's worth looking into for sure.

- Trevor

--
You received this message because you are subscribed to the Google Groups "sequel-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sequel-talk+unsubscr...@googlegroups.com.
To post to this group, send email to sequel-talk@googlegroups.com. Visit this group at https://groups.google.com/group/sequel-talk. For more options, visit https://groups.google.com/d/optout.
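[Editor's note: the resumability idea above (selecting rows where updated_at is nil, and writing updated_at so a processed row drops out of the working set) can be modeled without a database. A toy sketch; the hash "rows" and their fields are stand-ins, not Sequel API:]

```ruby
# Toy model of the resumable backfill: rows whose updated_at is nil
# are the remaining work, and writing updated_at removes a row from
# the set, so a crashed job can simply be restarted without redoing
# rows. Plain hashes stand in for database rows here.
rows = [
  { id: 1, updated_at: nil },
  { id: 2, updated_at: Time.now.utc },  # already processed
  { id: 3, updated_at: nil },
]

# One "pass" of the backfill: pick up pending rows, process them.
pending = rows.select { |r| r[:updated_at].nil? }
pending.each { |r| r[:updated_at] = Time.now.utc }

# After the pass, a restart finds nothing left to process.
remaining = rows.count { |r| r[:updated_at].nil? }
# remaining is 0
```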
Re: Transactions and paged_each
On Friday, September 2, 2016 at 11:48:46 AM UTC-7, Trevor Turk wrote:
> On Friday, September 2, 2016 at 1:19:02 PM UTC-5, Jeremy Evans wrote:
>>
>> To work around this issue, you can use sharding (Dataset option like
>> :servers=>{:foo=>{}}), and do dataset.server(:foo).paged_each.
>> Alternatively, you can issue queries in a separate thread inside the
>> Dataset#each block. Even something simple like
>> dataset.each{Thread.new{...}.join} should work. Please try one of those
>> options and report back.
>>
>
> Ah, your suggestion to use threads seems nice. I tested running this
> simultaneously and it seems to work without a hitch:
>
>     Person.dataset.limit(15).paged_each { |p|
>       Thread.new { sleep(1) ; p.update_all(updated_at: Time.now.utc) ; print '.' }.join
>     }
>
> Is there any reason to prefer streaming support via sequel_pg if I can get
> that working, or would you think this is a reasonable way to proceed? My
> backfill process would look something like this, and would be iterating
> over ~100 million rows:
>
>     Person.where(updated_at: nil).paged_each { |p|
>       Thread.new { p.update_all(updated_at: Time.now.utc) }.join
>     }
>

I'm guessing streaming would perform better, but you would still need to use
a separate thread or sharding when using streaming.

I have to say, if you are really just updating updated_at and are concerned
about performance, I would not use your approach unless you absolutely had
to have a model hook executed. The only time you really need to do a query
per row is if you need to execute ruby code on a per-row basis to determine
what the new column values should be.
An approach like this should perform much better, while still not locking
the whole table at once:

    n = 1000
    ds = Person.where(updated_at: nil)
    ps = ds.where{id >= :$id}.where{id < Sequel.+(:$id, n)}.
      prepare(:update, :update_person_updated_at, :updated_at=>Sequel::CURRENT_TIMESTAMP)
    ((ds.max(:id)/n) + 1).times{|id| ps.call(:id=>id*n)}
    ds.update(:updated_at=>Sequel::CURRENT_TIMESTAMP)

This approach uses a prepared statement, so the query only needs to be
planned once, and it needs just rows/n queries to update the whole table.
After going through all values in the existing table, one final query is run
at the end to pick up any new values (you should really have an index on
updated_at for this query to not do a sequential scan). This approach
assumes an integer primary key, but a similar approach can work for
composite or non-integer keys, as long as there is a way to go through the
set with an indexed lookup for a given range.

Thanks,
Jeremy
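[Editor's note: the driver loop in the approach above walks the id space in fixed-size windows. The windowing arithmetic can be seen in isolation by abstracting the prepared-statement call into a block; the `each_id_window` helper below is hypothetical, not Sequel API, and in the real version the block body would be `ps.call(:id => start_id)` against a live database:]

```ruby
# Sketch of the windowed-update driver loop: one yield per window of
# n ids, covering 0 up past max_id, matching the original
# ((ds.max(:id)/n) + 1).times{|id| ps.call(:id=>id*n)} expression.
def each_id_window(max_id, n)
  ((max_id / n) + 1).times { |i| yield(i * n) }
end

starts = []
each_id_window(2500, 1000) { |start_id| starts << start_id }
# starts is [0, 1000, 2000]: three ranged UPDATEs cover ids 0..2500,
# instead of one query per row.
```

Each window is a single indexed range scan on the primary key, which is why this stays cheap even on a table with ~100 million rows.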
Re: Transactions and paged_each
On Friday, September 2, 2016 at 1:19:02 PM UTC-5, Jeremy Evans wrote:
>
> To work around this issue, you can use sharding (Dataset option like
> :servers=>{:foo=>{}}), and do dataset.server(:foo).paged_each.
> Alternatively, you can issue queries in a separate thread inside the
> Dataset#each block. Even something simple like
> dataset.each{Thread.new{...}.join} should work. Please try one of those
> options and report back.
>

Ah, your suggestion to use threads seems nice. I tested running this
simultaneously and it seems to work without a hitch:

    Person.dataset.limit(15).paged_each { |p|
      Thread.new { sleep(1) ; p.update_all(updated_at: Time.now.utc) ; print '.' }.join
    }

Is there any reason to prefer streaming support via sequel_pg if I can get
that working, or would you think this is a reasonable way to proceed? My
backfill process would look something like this, and would be iterating over
~100 million rows:

    Person.where(updated_at: nil).paged_each { |p|
      Thread.new { p.update_all(updated_at: Time.now.utc) }.join
    }

Thanks again,
- Trevor
Re: Transactions and paged_each
On Friday, September 2, 2016 at 10:43:50 AM UTC-7, Trevor Turk wrote:
>
> Hello -- just reporting back on my progress so far.
>
> It appears we can use paged_each safely, however I was surprised to find
> that we can't run two simultaneously if they issue updates in the block.
>

In general it's not safe to issue queries inside a Dataset#each (and
#paged_each) block, unless you force the use of a different connection.
Whether it works depends on the adapter, but it's always unsupported.

To work around this issue, you can use sharding (Dataset option like
:servers=>{:foo=>{}}), and do dataset.server(:foo).paged_each.
Alternatively, you can issue queries in a separate thread inside the
Dataset#each block. Even something simple like
dataset.each{Thread.new{...}.join} should work. Please try one of those
options and report back.

Thanks,
Jeremy
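[Editor's note: the `Thread.new{...}.join` pattern suggested above can be seen in isolation. A minimal sketch with stand-in work; in real use the thread body issues the UPDATE, which makes Sequel check out a different pooled connection than the one the iterating cursor holds:]

```ruby
# Minimal sketch of the Thread.new { ... }.join pattern: each row's
# work runs in a short-lived thread that is joined immediately, so
# iteration order is preserved and at most one worker runs at a time.
# With Sequel, a query issued inside the thread uses a connection
# other than the one the cursor occupies -- the point of the trick.
processed = []
[:a, :b, :c].each do |row|
  Thread.new { processed << row.to_s.upcase }.join
end
# processed is ["A", "B", "C"]: same order, work done off-thread.
```

The immediate `.join` means there is no extra concurrency, only connection separation; the cost is one thread spawn per row.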
Re: Transactions and paged_each
On Thursday, September 1, 2016 at 12:51:58 PM UTC-5, Jeremy Evans wrote:
>
> On Thursday, September 1, 2016 at 9:42:11 AM UTC-7, Trevor Turk wrote:
>>
>> On Thursday, September 1, 2016 at 11:34:44 AM UTC-5, Trevor Turk wrote:
>>>
>>> In my case, I'm on AWS RDS PostgreSQL version 9.3.10 and I'm using the
>>> "sequel" and "pg" gems. So, I'd be using cursors with paged_each as it
>>> stands. Do you know if that ought to be safe? I'm considering switching
>>> to the "sequel_pg" gem for the streaming support, but I'm reluctant to
>>> change things unnecessarily, unless streaming might work where cursors
>>> would not.
>>>
>>> Thanks again,
>>> - Trevor
>>
>> Apologies, I just noticed that "pg" is required for "sequel_pg" so I'll
>> give that a try now :)
>
> Streaming should be faster, so that's one reason to use sequel_pg.
> Streaming also only executes a single query on the database instead of
> multiple queries. I would recommend using sequel_pg in general due to the
> other optimizations it uses.
>
> I think either using cursors or streaming should be safe in terms of not
> locking the whole table, unless you are using SELECT FOR UPDATE, but
> again, that's something you should test to make sure.
>
> Thanks,
> Jeremy

Hello -- just reporting back on my progress so far.

It appears we can use paged_each safely, however I was surprised to find
that we can't run two simultaneously if they issue updates in the block.
It seems we can work around this using the :hold option added to
Dataset#use_cursor in the 4.9.0 release, which appears to remove the
transaction and (considering the caveats) seems to work well for our case.
The relevant commit is:
https://github.com/jeremyevans/sequel/commit/6083b73af616c13583f6f360ee028e3baa83c1e4

At first I thought the problem with simultaneous use might be due to
duplicate cursor names, but it seems `hold: true` is all we need.
I'm a little concerned about this section of the docs
https://www.postgresql.org/docs/9.3/static/sql-declare.html: "A cursor
created with WITH HOLD... [i]n the current implementation, the rows
represented by a held cursor are copied into a temporary file or memory
area so that they remain available for subsequent transactions." I'm hoping
they don't mean that all rows in a table with millions of rows will be held
in memory, but rather each page of them. Even if it's all of the rows, I'm
not sure that's a cause for concern. (I'll be iterating over a table with
tens of millions of rows and performing updates to backfill newly created
columns that we set to null so we could run a migration without downtime.)

I'm not concerned about paged_each returning duplicate rows or
recalculating values etc. as the docs warn, because I'll be selecting rows
with a null value for a specific column which my update will set to a
non-null value.

So, I think I can safely proceed with paged_each(hold: true), but I'm
tempted to experiment with streaming support via sequel_pg in any case.
I'll do some more research and testing around that and report back again.

Here's an example of my testing so far:

ok simultaneously:

    Person.dataset.limit(15).paged_each { |p| sleep(1) ; print '.' }
    Person.dataset.limit(15).paged_each { |p| sleep(1) ; print '.' }

not ok simultaneously:

    Person.dataset.limit(15).paged_each { |p| sleep(1) ; p.update_all(updated_at: Time.now.utc) ; print '.' }
    Person.dataset.limit(15).paged_each { |p| sleep(1) ; p.update_all(updated_at: Time.now.utc) ; print '.' }

not ok simultaneously:

    Person.dataset.limit(15).paged_each(cursor_name: 'cursor_one') { |p| sleep(1) ; p.update_all(updated_at: Time.now.utc) ; print '.' }
    Person.dataset.limit(15).paged_each(cursor_name: 'cursor_two') { |p| sleep(1) ; p.update_all(updated_at: Time.now.utc) ; print '.' }

ok simultaneously:

    Person.dataset.limit(15).paged_each(cursor_name: 'cursor_one', hold: true) { |p| sleep(1) ; p.update_all(updated_at: Time.now.utc) ; print '.' }
    Person.dataset.limit(15).paged_each(cursor_name: 'cursor_two', hold: true) { |p| sleep(1) ; p.update_all(updated_at: Time.now.utc) ; print '.' }

ok simultaneously:

    Person.dataset.limit(15).paged_each(hold: true) { |p| sleep(1) ; p.update_all(updated_at: Time.now.utc) ; print '.' }
    Person.dataset.limit(15).paged_each(hold: true) { |p| sleep(1) ; p.update_all(updated_at: Time.now.utc) ; print '.' }

Thanks again,
- Trevor
Re: Transactions and paged_each
On Thursday, September 1, 2016 at 9:42:11 AM UTC-7, Trevor Turk wrote:
>
> On Thursday, September 1, 2016 at 11:34:44 AM UTC-5, Trevor Turk wrote:
>>
>> In my case, I'm on AWS RDS PostgreSQL version 9.3.10 and I'm using the
>> "sequel" and "pg" gems. So, I'd be using cursors with paged_each as it
>> stands. Do you know if that ought to be safe? I'm considering switching
>> to the "sequel_pg" gem for the streaming support, but I'm reluctant to
>> change things unnecessarily, unless streaming might work where cursors
>> would not.
>>
>> Thanks again,
>> - Trevor
>
> Apologies, I just noticed that "pg" is required for "sequel_pg" so I'll
> give that a try now :)

Streaming should be faster, so that's one reason to use sequel_pg.
Streaming also only executes a single query on the database instead of
multiple queries. I would recommend using sequel_pg in general due to the
other optimizations it uses.

I think either using cursors or streaming should be safe in terms of not
locking the whole table, unless you are using SELECT FOR UPDATE, but again,
that's something you should test to make sure.

Thanks,
Jeremy
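[Editor's note: for readers trying the streaming route, a hedged sketch of what the switch looks like. The connection URL, environment variable, and model are placeholders; `DB.extension(:pg_streaming)` and `Dataset#stream` are described in the sequel_pg README, but verify the details there before relying on them:]

```ruby
# Hypothetical setup for streaming via sequel_pg. With the
# pg_streaming extension loaded, rows from a single query are
# yielded as they arrive from PostgreSQL, instead of being fetched
# in cursor-sized pages. Requires the sequel and sequel_pg gems and
# a live PostgreSQL database, so this is guarded and not run as-is.
require 'sequel'

if ENV['DATABASE_URL']
  DB = Sequel.connect(ENV['DATABASE_URL'])   # e.g. postgres://...
  DB.extension(:pg_streaming)                # sequel_pg streaming support

  class Person < Sequel::Model; end

  # Stream this dataset's rows from one query.
  Person.where(updated_at: nil).stream.each do |p|
    # per-row work here
  end
end
```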
Re: Transactions and paged_each
On Thursday, September 1, 2016 at 11:34:44 AM UTC-5, Trevor Turk wrote:
>
> In my case, I'm on AWS RDS PostgreSQL version 9.3.10 and I'm using the
> "sequel" and "pg" gems. So, I'd be using cursors with paged_each as it
> stands. Do you know if that ought to be safe? I'm considering switching to
> the "sequel_pg" gem for the streaming support, but I'm reluctant to change
> things unnecessarily, unless streaming might work where cursors would not.
>
> Thanks again,
> - Trevor

Apologies, I just noticed that "pg" is required for "sequel_pg" so I'll give
that a try now :)
Re: Transactions and paged_each
On Thursday, September 1, 2016 at 1:20:21 AM UTC-5, Jeremy Evans wrote:
>
> Dataset#paged_each shouldn't lock the entire table, as it just selects
> rows, but the behavior in regards to locking depends on the database. It
> would probably be best for you to try it in a test environment to be sure
> it functions as you expect, before using it in production.
>
> Thanks,
> Jeremy

Thanks, Jeremy! I'll do some thorough testing before running in production,
and I'll report back about the eventual results for future readers.

In my case, I'm on AWS RDS PostgreSQL version 9.3.10 and I'm using the
"sequel" and "pg" gems. So, I'd be using cursors with paged_each as it
stands. Do you know if that ought to be safe? I'm considering switching to
the "sequel_pg" gem for the streaming support, but I'm reluctant to change
things unnecessarily, unless streaming might work where cursors would not.

Thanks again,
- Trevor
Re: Transactions and paged_each
On Wednesday, August 31, 2016 at 9:47:52 PM UTC-7, Trevor Turk wrote:
>
> Hello,
>
> I've been reading about paged_each and it seems to work well in my
> testing, but I'm concerned about using it in production because the
> documentation says it uses a transaction internally.
>
> I plan to make a "backfill" script that iterates over all rows in a
> production database and enqueues background jobs for processing them etc.
>
> Would using paged_each lock the table and prevent my (Sinatra) app from
> functioning regularly?
>
> Thank you!
> - Trevor

Dataset#paged_each shouldn't lock the entire table, as it just selects rows,
but the behavior in regards to locking depends on the database. It would
probably be best for you to try it in a test environment to be sure it
functions as you expect, before using it in production.

Thanks,
Jeremy
Transactions and paged_each
Hello,

I've been reading about paged_each and it seems to work well in my testing,
but I'm concerned about using it in production because the documentation
says it uses a transaction internally.

I plan to make a "backfill" script that iterates over all rows in a
production database and enqueues background jobs for processing them etc.

Would using paged_each lock the table and prevent my (Sinatra) app from
functioning regularly?

Thank you!
- Trevor