Re: Transactions and paged_each

2016-09-02 Thread Trevor Turk
On Friday, September 2, 2016 at 2:31:29 PM UTC-5, Jeremy Evans wrote: 
>
> I'm guessing streaming would perform better, but you would still need to 
> use a separate thread or sharding when using streaming.
>

That's great. I'll give streaming a try, but I believe the bottleneck isn't 
going to be retrieving the rows but rather the work I'm doing before 
updating them.
 

> I have to say, if you are really just updating updated_at and are 
> concerned about performance, I would not use your approach unless you 
> absolutely had to have a model hook executed.  The only time you really 
> need to do a query per row is if you need to execute Ruby code on a 
> per-row basis to determine what the new column values should be.
>

This is fantastic advice, and I'll bookmark it for future use. I'm sorry 
I didn't make it clear, but I meant that updated_at example as a 
simplified version of what I'll be doing. I'll be calculating values on a 
per-record basis and then issuing an update_all call per record, which will 
include updated_at so the record drops out of the set still to be processed.

Person.where(updated_at: nil).paged_each do |p|
  Thread.new do
    p.update_all(some_calculated_value: CalculateValue(p), updated_at: Time.now.utc)
  end.join
end

Apologies, I should have made the example a bit more clear. In any case, 
your suggestion will be useful to future Googlers, I'm sure. 

Thank you again for all of your help. I'm off to the races using paged_each 
and threads. I'll report back if I run into anything of interest relevant 
to this thread, and I'll try sequel_pg and streaming if time permits in the 
next few weeks. It sounds like we'd get a number of performance benefits 
beyond streaming vs cursors, so it's worth looking into for sure. 

- Trevor

-- 
You received this message because you are subscribed to the Google Groups 
"sequel-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sequel-talk+unsubscr...@googlegroups.com.
To post to this group, send email to sequel-talk@googlegroups.com.
Visit this group at https://groups.google.com/group/sequel-talk.
For more options, visit https://groups.google.com/d/optout.


Re: Transactions and paged_each

2016-09-02 Thread Jeremy Evans
On Friday, September 2, 2016 at 11:48:46 AM UTC-7, Trevor Turk wrote:

> On Friday, September 2, 2016 at 1:19:02 PM UTC-5, Jeremy Evans wrote:
>>
>> To work around this issue, you can use sharding (Dataset option like 
>> :servers=>{:foo=>{}}), and do dataset.server(:foo).paged_each. 
>>  Alternatively, you can issue queries in a separate thread inside the 
>> Dataset#each block.  Even something simple like 
>> dataset.each{Thread.new{...}.join} should work.  Please try one of those 
>> options and report back.
>>
>
> Ah, your suggestion to use threads seems nice. I tested running this 
> simultaneously and it seems to work without a hitch:
>
> Person.dataset.limit(15).paged_each { |p| Thread.new { sleep(1) ; 
> p.update_all(updated_at: Time.now.utc) ; print '.' }.join }
>
> Is there any reason to prefer streaming support via sequel_pg if I can get 
> that working, or would you think this is a reasonable way to proceed? My 
> backfill process would look something like this, and would be iterating 
> over ~100 million rows:
>
> Person.where(updated_at: nil).paged_each { |p| Thread.new { 
> p.update_all(updated_at: Time.now.utc) }.join }
>

I'm guessing streaming would perform better, but you would still need to 
use a separate thread or sharding when using streaming.

I have to say, if you are really just updating updated_at and are concerned 
about performance, I would not use your approach unless you absolutely had 
to have a model hook executed.  The only time you really need to do a query 
per row is if you need to execute Ruby code on a per-row basis to determine 
what the new column values should be.

An approach like this should perform much better, while still not locking 
the whole table at once:

  n = 1000
  ds = Person.where(updated_at: nil)
  ps = ds.where{id >= :$id}.
       where{id < Sequel.+(:$id, n)}.
       prepare(:update, :update_person_updated_at, :updated_at=>Sequel::CURRENT_TIMESTAMP)
  ((ds.max(:id)/n) + 1).times{|i| ps.call(:id=>i*n)}
  ds.update(:updated_at=>Sequel::CURRENT_TIMESTAMP)

This approach uses a prepared statement, so the query only needs to be 
planned once, and it needs only rows/n queries to update the whole table. 
After going through all values in the existing table, one final query is 
run at the end to pick up any new values (you should really have an index 
on updated_at for this query to avoid a sequential scan).  This approach 
assumes an integer primary key, but a similar approach can work for 
composite or non-integer keys, as long as there is a way to go through the 
set with an indexed lookup for a given range.
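
The windowing arithmetic can be sketched in plain Ruby (no database 
involved; n, max_id, and the ranges below are illustrative stand-ins, not 
real table values): window i covers ids in [i*n, i*n + n), so 
(max_id / n) + 1 windows cover every id from 0 through max_id.

```ruby
n = 1000
max_id = 100_500 # assumed maximum id in the table
# One range per prepared-statement call: window i covers [i*n, i*n + n)
windows = ((max_id / n) + 1).times.map { |i| (i * n)...(i * n + n) }

windows.size                #=> 101
windows.first.cover?(0)     #=> true
windows.last.cover?(max_id) #=> true
```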

Thanks,
Jeremy



Re: Transactions and paged_each

2016-09-02 Thread Trevor Turk
On Friday, September 2, 2016 at 1:19:02 PM UTC-5, Jeremy Evans wrote:
>
> To work around this issue, you can use sharding (Dataset option like 
> :servers=>{:foo=>{}}), and do dataset.server(:foo).paged_each. 
>  Alternatively, you can issue queries in a separate thread inside the 
> Dataset#each block.  Even something simple like 
> dataset.each{Thread.new{...}.join} should work.  Please try one of those 
> options and report back.
>

Ah, your suggestion to use threads seems nice. I tested running this 
simultaneously and it seems to work without a hitch:

Person.dataset.limit(15).paged_each { |p| Thread.new { sleep(1); p.update_all(updated_at: Time.now.utc); print '.' }.join }

Is there any reason to prefer streaming support via sequel_pg if I can get 
that working, or would you think this is a reasonable way to proceed? My 
backfill process would look something like this, and would be iterating 
over ~100 million rows:

Person.where(updated_at: nil).paged_each { |p| Thread.new { p.update_all(updated_at: Time.now.utc) }.join }

Thanks again,
- Trevor



Re: Transactions and paged_each

2016-09-02 Thread Jeremy Evans
On Friday, September 2, 2016 at 10:43:50 AM UTC-7, Trevor Turk wrote:
>
> Hello -- just reporting back on my progress so far. 
>
> It appears we can use paged_each safely, however I was surprised to find 
> that we can't run two simultaneously if they issue updates in the block. 
>

In general it's not safe to issue queries inside a Dataset#each (and 
#paged_each) block, unless you force the use of a different connection. 
 Whether it works depends on the adapter, but it's always unsupported.

To work around this issue, you can use sharding (Dataset option like 
:servers=>{:foo=>{}}), and do dataset.server(:foo).paged_each. 
 Alternatively, you can issue queries in a separate thread inside the 
Dataset#each block.  Even something simple like 
dataset.each{Thread.new{...}.join} should work.  Please try one of those 
options and report back.
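
For concreteness, the thread-per-row shape can be sketched in plain Ruby 
(no database involved; the rows array and the recorded ids are stand-ins 
for the dataset and the UPDATE). Because each thread is joined before the 
next row is processed, only one worker is alive at a time, and in Sequel 
that thread would check out its own connection from the pool rather than 
reusing the one the cursor is holding:

```ruby
rows = [{ id: 1 }, { id: 2 }, { id: 3 }] # stand-in for paged_each rows
updated = []

rows.each do |row|
  Thread.new do
    # In the real version this block would issue the per-row UPDATE.
    updated << row[:id]
  end.join # join immediately, as in dataset.each{Thread.new{...}.join}
end

updated #=> [1, 2, 3]
```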

Thanks,
Jeremy



Re: Transactions and paged_each

2016-09-02 Thread Trevor Turk
On Thursday, September 1, 2016 at 12:51:58 PM UTC-5, Jeremy Evans wrote:
>
> On Thursday, September 1, 2016 at 9:42:11 AM UTC-7, Trevor Turk wrote:
>>
>> On Thursday, September 1, 2016 at 11:34:44 AM UTC-5, Trevor Turk wrote:
>>>
>>> In my case, I'm on AWS RDS PostgreSQL version 9.3.10 and I'm using the 
>>> "sequel" and "pg" gems. So, I'd be using cursors with paged_each as it 
>>> stands. Do you know if that ought to be safe? I'm considering switching to 
>>> the "sequel_pg" gem for the streaming support, but I'm reluctant to change 
>>> things unnecessarily, unless streaming might work where cursors would not. 
>>>
>>> Thanks again,
>>> - Trevor
>>>
>>  
>> Apologies, I just noticed that "pg" is required for "sequel_pg" so I'll 
>> give that a try now :)
>>
>
> Streaming should be faster, so that's one reason to use sequel_pg. 
>  Streaming also only executes a single query on the database instead of 
> multiple queries.  I would recommend using sequel_pg in general due to the 
> other optimizations it uses.
>
> I think using either cursors or streaming should be safe in terms of not 
> locking the whole table, unless you are using SELECT FOR UPDATE, but 
> again, that's something you should test to make sure.
>
> Thanks,
> Jeremy
>

Hello -- just reporting back on my progress so far. 

It appears we can use paged_each safely; however, I was surprised to find 
that we can't run two simultaneously if they issue updates in the block. 

It seems we can work around this using the :hold option added to 
Dataset#use_cursor in the 4.9.0 release, which appears to remove the 
transaction and (considering the caveats) seems to work well for our case.

The relevant commit is: 
https://github.com/jeremyevans/sequel/commit/6083b73af616c13583f6f360ee028e3baa83c1e4

At first I thought the problem with simultaneous use might be due to 
duplicate cursor names, but it seems `hold: true` is all we need. 

I'm a little concerned about this section of the docs 
https://www.postgresql.org/docs/9.3/static/sql-declare.html:

"A cursor created with WITH HOLD... [i]n the current implementation, the 
rows represented by a held cursor are copied into a temporary file or 
memory area so that they remain available for subsequent transactions."

I'm hoping they don't mean that all rows in a table with millions of rows 
will be held in memory, but rather each page of them. Even if it's all of 
the rows, I'm not sure if that's a cause for concern. (I'll be iterating 
over a table with tens of millions of rows and performing updates to 
backfill newly created columns that we set to null so we could run a 
migration without downtime.)

I'm not concerned about paged_each returning duplicate rows or 
recalculating values etc. as the docs warn, because I'll be selecting rows 
with a null value for a specific column which my update will set to a 
non-null value. 

So, I think I can safely proceed with paged_each(hold: true) but I'm 
tempted to experiment with streaming support via sequel_pg in any case. 
I'll do some more research and testing around that and report back again. 

Here's an example of my testing so far:

ok simultaneously:
Person.dataset.limit(15).paged_each { |p| sleep(1); print '.' }
Person.dataset.limit(15).paged_each { |p| sleep(1); print '.' }

not ok simultaneously:
Person.dataset.limit(15).paged_each { |p| sleep(1); p.update_all(updated_at: Time.now.utc); print '.' }
Person.dataset.limit(15).paged_each { |p| sleep(1); p.update_all(updated_at: Time.now.utc); print '.' }

not ok simultaneously:
Person.dataset.limit(15).paged_each(cursor_name: 'cursor_one') { |p| sleep(1); p.update_all(updated_at: Time.now.utc); print '.' }
Person.dataset.limit(15).paged_each(cursor_name: 'cursor_two') { |p| sleep(1); p.update_all(updated_at: Time.now.utc); print '.' }

ok simultaneously:
Person.dataset.limit(15).paged_each(cursor_name: 'cursor_one', hold: true) { |p| sleep(1); p.update_all(updated_at: Time.now.utc); print '.' }
Person.dataset.limit(15).paged_each(cursor_name: 'cursor_two', hold: true) { |p| sleep(1); p.update_all(updated_at: Time.now.utc); print '.' }

ok simultaneously:
Person.dataset.limit(15).paged_each(hold: true) { |p| sleep(1); p.update_all(updated_at: Time.now.utc); print '.' }
Person.dataset.limit(15).paged_each(hold: true) { |p| sleep(1); p.update_all(updated_at: Time.now.utc); print '.' }

Thanks again,
- Trevor 



Re: Transactions and paged_each

2016-09-01 Thread Jeremy Evans
On Thursday, September 1, 2016 at 9:42:11 AM UTC-7, Trevor Turk wrote:
>
> On Thursday, September 1, 2016 at 11:34:44 AM UTC-5, Trevor Turk wrote:
>>
>> In my case, I'm on AWS RDS PostgreSQL version 9.3.10 and I'm using the 
>> "sequel" and "pg" gems. So, I'd be using cursors with paged_each as it 
>> stands. Do you know if that ought to be safe? I'm considering switching to 
>> the "sequel_pg" gem for the streaming support, but I'm reluctant to change 
>> things unnecessarily, unless streaming might work where cursors would not. 
>>
>> Thanks again,
>> - Trevor
>>
>  
> Apologies, I just noticed that "pg" is required for "sequel_pg" so I'll 
> give that a try now :)
>

Streaming should be faster, so that's one reason to use sequel_pg. 
 Streaming also only executes a single query on the database instead of 
multiple queries.  I would recommend using sequel_pg in general due to the 
other optimizations it uses.

I think using either cursors or streaming should be safe in terms of not 
locking the whole table, unless you are using SELECT FOR UPDATE, but again, 
that's something you should test to make sure.

Thanks,
Jeremy



Re: Transactions and paged_each

2016-09-01 Thread Trevor Turk
On Thursday, September 1, 2016 at 11:34:44 AM UTC-5, Trevor Turk wrote:
>
> In my case, I'm on AWS RDS PostgreSQL version 9.3.10 and I'm using the 
> "sequel" and "pg" gems. So, I'd be using cursors with paged_each as it 
> stands. Do you know if that ought to be safe? I'm considering switching to 
> the "sequel_pg" gem for the streaming support, but I'm reluctant to change 
> things unnecessarily, unless streaming might work where cursors would not. 
>
> Thanks again,
> - Trevor
>
 
Apologies, I just noticed that "pg" is required for "sequel_pg" so I'll 
give that a try now :)



Re: Transactions and paged_each

2016-09-01 Thread Trevor Turk
On Thursday, September 1, 2016 at 1:20:21 AM UTC-5, Jeremy Evans wrote:
>
> Dataset#paged_each shouldn't lock the entire table, as it just selects 
> rows, but the behavior in regards to locking depends on the database.  It 
> would probably be best for you to try it in a test environment to be sure 
> it functions as you expect, before using it in production.
>
> Thanks,
> Jeremy
>

Thanks, Jeremy!

I'll do some thorough testing before running in production, and I'll report 
back about the eventual results for future readers. 

In my case, I'm on AWS RDS PostgreSQL version 9.3.10 and I'm using the 
"sequel" and "pg" gems. So, I'd be using cursors with paged_each as it 
stands. Do you know if that ought to be safe? I'm considering switching to 
the "sequel_pg" gem for the streaming support, but I'm reluctant to change 
things unnecessarily, unless streaming might work where cursors would not. 

Thanks again,
- Trevor



Re: Transactions and paged_each

2016-08-31 Thread Jeremy Evans
On Wednesday, August 31, 2016 at 9:47:52 PM UTC-7, Trevor Turk wrote:
>
> Hello,
>
> I've been reading about paged_each and it seems to work well in my 
> testing, but I'm concerned about using it in production because the 
> documentation says it uses a transaction internally. 
>
> I plan to make a "backfill" script that iterates over all rows in a 
> production database and enqueues background jobs for processing them etc.
>
> Would using paged_each lock the table and prevent my (Sinatra) app from 
> functioning regularly?
>
> Thank you!
> - Trevor
>

Dataset#paged_each shouldn't lock the entire table, as it just selects 
rows, but the behavior in regards to locking depends on the database.  It 
would probably be best for you to try it in a test environment to be sure 
it functions as you expect, before using it in production.

Thanks,
Jeremy



Transactions and paged_each

2016-08-31 Thread Trevor Turk
Hello,

I've been reading about paged_each and it seems to work well in my testing, 
but I'm concerned about using it in production because the documentation 
says it uses a transaction internally. 

I plan to make a "backfill" script that iterates over all rows in a 
production database and enqueues background jobs for processing them etc.

Would using paged_each lock the table and prevent my (Sinatra) app from 
functioning regularly?

Thank you!
- Trevor
