On Friday, September 2, 2016 at 11:48:46 AM UTC-7, Trevor Turk wrote:

> On Friday, September 2, 2016 at 1:19:02 PM UTC-5, Jeremy Evans wrote:
>>
>> To work around this issue, you can use sharding (Dataset option like 
>> :servers=>{:foo=>{}}), and do dataset.server(:foo).paged_each. 
>>  Alternatively, you can issue queries in a separate thread inside the 
>> Dataset#each block.  Even something simple like 
>> dataset.each{Thread.new{...}.join} should work.  Please try one of those 
>> options and report back.
>>
>
> Ah, your suggestion to use threads seems nice. I tested running this 
> simultaneously and it seems to work without a hitch:
>
> Person.dataset.limit(15).paged_each { |p| Thread.new { sleep(1) ; 
> p.update_all(updated_at: Time.now.utc) ; print '.' }.join }
>
> Is there any reason to prefer streaming support via sequel_pg if I can get 
> that working, or would you think this is a reasonable way to proceed? My 
> backfill process would look something like this, and would be iterating 
> over ~100 million rows:
>
> Person.where(updated_at: nil).paged_each { |p| Thread.new { 
> p.update_all(updated_at: Time.now.utc) }.join }
>

I'm guessing streaming would perform better, but you would still need to 
use a separate thread or sharding when using streaming.
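
For reference, the streaming version would look roughly like this (a minimal 
sketch, assuming the pg_streaming extension that ships with sequel_pg; the 
per-row update still runs in a separate thread so it gets its own connection 
instead of the one that is busy streaming):

  DB.extension :pg_streaming
  Person.where(updated_at: nil).stream.each do |p|
    Thread.new{p.this.update(updated_at: Time.now.utc)}.join
  end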

I have to say, if you are really just updating updated_at and are concerned 
about performance, I would not use your approach unless you absolutely had 
to have a model hook executed.  The only time you really need to do a query 
per row is if you need to execute ruby code on a per-row basis to determine 
what the new column values should be.
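
For example, something like this hypothetical case (the slug column and the 
downcasing logic are just stand-ins) is where a per-row query is actually 
warranted, because ruby has to compute each new value:

  Person.where(slug: nil).paged_each do |p|
    # p.this updates the row directly, skipping model hooks
    Thread.new{p.this.update(slug: p.name.downcase.gsub(/\W+/, '-'))}.join
  end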

An approach like this should perform much better, while still not locking 
the whole table at once:

  n = 1000
  ds = Person.where(updated_at: nil)
  # Prepared statement that updates one range of n ids per call
  ps = ds.where{id >= :$id}.where{id < Sequel.+(:$id, n)}.
    prepare(:update, :update_person_updated_at,
      :updated_at=>Sequel::CURRENT_TIMESTAMP)
  # One call per range of n ids, from 0 up to the current maximum id
  ((ds.max(:id)/n) + 1).times{|i| ps.call(:id=>i*n)}
  # Final query to pick up rows added while the batches were running
  ds.update(:updated_at=>Sequel::CURRENT_TIMESTAMP)

This approach uses a prepared statement so the query only needs to be 
planned once, and needs only rows/n queries to update the whole table. 
After going through all ids in the existing table, one final query is 
run at the end to pick up any new rows (you should really have an index 
on updated_at so that this final query doesn't do a sequential scan). 
This approach assumes an integer primary key, but a similar approach can 
work for composite or non-integer keys, as long as there is a way to go 
through the set with an indexed lookup for a given range.
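
For example, with a uuid or string primary key, a keyset-style loop like this 
rough sketch would do the same kind of batching (it skips the prepared 
statement for simplicity):

  n = 1000
  last_id = nil
  loop do
    batch = Person.where(updated_at: nil)
    # Resume the indexed range lookup from the last id already handled
    batch = batch.where{id > last_id} if last_id
    ids = batch.order(:id).limit(n).select_map(:id)
    break if ids.empty?
    Person.where(id: ids).update(:updated_at=>Sequel::CURRENT_TIMESTAMP)
    last_id = ids.last
  end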

Thanks,
Jeremy
