[sqlalchemy] running parallel migrations using sharded/partitioned/spaced queries?

2015-10-14 Thread Jonathan Vanasco
I have to run a script on 2MM objects to update the database.  It's not
really a schema migration; it's more like changing the internal data
representation in the fields.

There's a bit of post-processing and a few bottlenecks involved, so doing
everything one-at-a-time will take a few days.

I'd like to split this out into 5-10 'task runners' that are each
responsible for a section of the database (i.e., every 5th record).  That
should considerably drop the runtime.

I thought I had seen a recipe for this somewhere, but I checked and couldn't
find anything, which leads me to question whether this is a good idea.
Anyone have thoughts/pointers?
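
[For reference, a minimal sketch of the "every 5th record" split described
above, assuming integer PKs; Thing, transform(), and the database URL are
hypothetical stand-ins:

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    from myapp.model import Thing  # hypothetical mapped class

    engine = create_engine('postgresql://localhost/mydb')  # assumption
    session = sessionmaker(bind=engine)()

    num_workers = 5
    worker_index = 0  # 0..4; each runner is launched with its own value

    # each runner claims only the rows whose PK falls in its modulo class
    query = (session.query(Thing)
                    .filter(Thing.id % num_workers == worker_index))
    for row in query.yield_per(1000):
        transform(row)  # hypothetical per-row rewrite of the fields
    session.commit()
]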



Re: [sqlalchemy] running parallel migrations using sharded/partitioned/spaced queries?

2015-10-14 Thread Mike Bayer


On 10/14/15 12:55 PM, jason kirtland wrote:
> If you can partition the rows numerically, this is trivially easy to
> implement using redis as the orchestrator.
> [snip]

What am I missing that one wouldn't use, say, multiprocessing.Pool() to do
this kind of thing in the general case?  If we're only talking about
5-10 runners, they could just as well be local forked processes.
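
[A minimal sketch of that approach, assuming integer PKs; the Thing model,
the migrate() helper, and the database URL are hypothetical stand-ins:

    from multiprocessing import Pool

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    from myapp.model import Thing, migrate  # hypothetical names

    DB_URL = 'postgresql://localhost/mydb'  # assumption
    BATCH_SIZE = 1000
    TOTAL_ROWS = 2000000  # ~2MM rows, per the original post

    def work(batch_start):
        # each forked worker opens its own engine; database connections
        # must not be shared across a fork
        engine = create_engine(DB_URL)
        session = sessionmaker(bind=engine)()
        rows = (session.query(Thing)
                       .order_by(Thing.id)
                       .slice(batch_start, batch_start + BATCH_SIZE)
                       .all())
        for row in rows:
            migrate(row)
        session.commit()
        session.close()

    if __name__ == '__main__':
        with Pool(processes=8) as pool:
            pool.map(work, range(0, TOTAL_ROWS, BATCH_SIZE))
]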






Re: [sqlalchemy] running parallel migrations using sharded/partitioned/spaced queries?

2015-10-14 Thread jason kirtland
If you can partition the rows numerically, this is trivially easy to
implement using redis as the orchestrator.

For example, if you have integer PKs, you might have a loop like:

    offset = 0
    while offset < tablesize:
        # query[a:b] applies OFFSET/LIMIT, slicing out one batch at a time
        for row in query[offset:offset + batchsize]:
            migrate(row)
        commit()
        offset += batchsize

With redis orchestrating, you use a key in redis and INCRBY to reliably
distribute batches to an arbitrary number of workers on an arbitrary number
of hosts.

    while True:
        # atomically claim the next batch; no two workers get the same range
        offset = redis.incrby('migration-offset', batchsize)
        rows = query[offset:offset + batchsize]
        if not rows:
            break
        for row in rows:
            migrate(row)
        commit()

INCRBY is atomic and returns the adjusted value, so every invocation of
this script that calls into redis and INCRBYs by, say, 1000, has its own
chunk of 1000 to work on. For a starting value of -1000 and four
invocations, you'd see 0, 1000, 2000 and 3000.
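
[A short sketch of that seeding, assuming the redis-py client:

    import redis

    batchsize = 1000
    r = redis.StrictRedis()

    # seed the counter one batch below zero so the first claim returns 0
    r.set('migration-offset', -batchsize)

    r.incrby('migration-offset', batchsize)  # -> 0    (first worker)
    r.incrby('migration-offset', batchsize)  # -> 1000 (second worker)
]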

I'll typically start with a single invocation, confirm that it's running
well and that I chose a performant batch size, and then spin up additional
workers on more cores until the migration hits the overall throughput
required.
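
[Pulled together, one such worker might look like the sketch below; Thing,
migrate(), and the database URL are hypothetical stand-ins, and each
additional worker is simply another invocation of the same script:

    import redis
    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    from myapp.model import Thing, migrate  # hypothetical names

    batchsize = 1000
    r = redis.StrictRedis()
    engine = create_engine('postgresql://localhost/mydb')  # assumption
    session = sessionmaker(bind=engine)()
    query = session.query(Thing).order_by(Thing.id)

    while True:
        # INCRBY atomically hands this worker its own offset
        offset = r.incrby('migration-offset', batchsize)
        rows = query[offset:offset + batchsize]
        if not rows:
            break
        for row in rows:
            migrate(row)
        session.commit()
]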



