[Wikitech-l] Best practices for read/write vs read-only requests, and our multi-DC future

2016-04-20 Thread Brion Vibber
Over in the TimedMediaHandler extension, we've had a number of cases where
old code conveniently squished read-write operations into data getters. Those
hacks got removed because of problems with long-running transactions, or to
support future-facing multi-DC work, where we want requests to distinguish
more reliably between read-only and read-write. And we sometimes want to put
some of those clever hacks back and add more. ;)

For instance, in https://gerrit.wikimedia.org/r/284368 we'd like to
automatically remove transcode derivative files of types/resolutions that
have been disabled, whenever we come across them. But I'm a bit unsure
whether it's safe to do so.

Note that we could fire off a job queue background task to do the actual
removal... But is it also safe to do that on a read-only request?
https://www.mediawiki.org/wiki/Requests_for_comment/Master_%26_slave_datacenter_strategy_for_MediaWiki
seems to indicate job queueing will be safe, but I'd like to confirm that. :)
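
Something like this is what I have in mind (rough sketch only; the
TranscodeRemovalJob class and the $file/$key inputs are hypothetical, but
JobQueueGroup::singleton()->push() is the normal core way to enqueue, as far
as I know):

    // Sketch: called from TMH when a view notices a disabled derivative
    // still exists ($file is a File object, $key identifies the derivative).
    // TranscodeRemovalJob is hypothetical; it would extend core's Job class.
    function queueDerivativeRemoval( File $file, $key ) {
        $job = new TranscodeRemovalJob(
            $file->getTitle(),
            [ 'transcodekey' => $key ]
        );
        // Enqueueing is the only "write" the web request does; the job
        // itself runs later, in the master datacenter.
        JobQueueGroup::singleton()->push( $job );
    }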

Similarly, in https://gerrit.wikimedia.org/r/#/c/284269/ we may wish to
trigger missing transcodes to run on demand. The actual re-encoding happens
in a background job, but we have to fire it off, and we have to record that
we fired it off so we don't duplicate it...

(This would require a second queue to do the high-priority state table
update and queue the actual transcoding job; we can't put them in one queue
because a backup of transcode jobs would prevent the high-priority job from
running in a timely fashion.)

A best-practices document on future-proofing for multi-DC would be pretty
awesome! Maybe factor out some of the stuff from the RfC into a nice dev
doc page...

-- brion
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Best practices for read/write vs read-only requests, and our multi-DC future

2016-04-21 Thread bawolff
On Thu, Apr 21, 2016 at 1:45 AM, Brion Vibber  wrote:
> Over in TimedMediaHandler extension, we've had a number of cases where old
> code did things that were convenient in terms of squishing read-write
> operations into data getters, that got removed due to problems with long
> running transactions or needing to refactor things to support
> future-facing multi-DC work where we want requests to be able to more
> reliably distinguish between read-only and read-write. And we sometimes
> want to put some of those clever hacks back and add more. ;)
>
> For instance in https://gerrit.wikimedia.org/r/284368 we'd like to remove
> transcode derivative files of types/resolutions that have been disabled
> automatically when we come across them. But I'm a bit unsure it's safe to
> do so.
>
> Note that we could fire off a job queue background task to do the actual
> removal... But is it also safe to do that on a read-only request?
> https://www.mediawiki.org/wiki/Requests_for_comment/Master_%26_slave_datacenter_strategy_for_MediaWiki
> seems to indicate job queueing will be safe, but would like to confirm
> that. :)
>
> Similarly in https://gerrit.wikimedia.org/r/#/c/284269/ we may wish to
> trigger missing transcodes to run on demand, similarly. The actual re
> encoding happens in a background job, but we have to fire it off, and we
> have to record that we fired it off so we don't duplicate it...
>
> (This would require a second queue to do the high-priority state table
> update and queue the actual transcoding job; we can't put them in one queue
> because a backup of transcode jobs would prevent the high priority job from
> running in a timely fashion.)
>
> A best practices document on future-proofing for multi DC would be pretty
> awesome! Maybe factor out some of the stuff from the RfC into a nice dev
> doc page...
>
> -- brion

When doing something like that from a read request, there's also the problem
that a popular page might get lots of views (maybe thousands, if the queue is
a little backed up) before the job is processed. So if the view triggers the
job, and views only stop inserting the job once it has actually been
executed, this can cause a large number of useless jobs to be enqueued until
one of them finally runs.
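
i.e. the pattern to worry about is roughly this (sketch; the helper name and
the job parameters are made up):

    // Sketch of the problematic pattern (helper and param names made up).
    function maybeQueueTranscode( File $file, $key ) {
        // State comes from a replica, so it stays "missing" until the job
        // has run *and* the update has replicated back...
        if ( getTranscodeStateFromReplica( $file, $key ) === 'missing' ) {
            // ...which means every view of a popular page in that window
            // pushes yet another copy of the same job.
            JobQueueGroup::singleton()->push(
                new WebVideoTranscodeJob(
                    $file->getTitle(),
                    [ 'transcodekey' => $key ]
                )
            );
        }
    }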

--
-bawolff

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Best practices for read/write vs read-only requests, and our multi-DC future

2016-04-21 Thread Brion Vibber
On Thursday, April 21, 2016, bawolff  wrote:
>
>
> When doing something like that from a read request, there's also the
> problem for a popular page that there might be lots of views (maybe
> thousands if the queue is a little backed up) before the job is
> processed. So if the view triggers the job, and it will only stop
> triggering inserting the job after the job has been executed, this
> might cause a large number of useless jobs to be en-queued until one
> of them is finally executed.


Can PoolCounter help with that? Docs are a little sparse, but it was put in
place to prevent exactly that sort of backup, where lots of requests all try
to run the same expensive work simultaneously.
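
Roughly what I'm imagining, going from memory of the PoolCounterWorkViaCallback
API ('TMHTranscodeEnqueue' is a made-up pool type that would need an entry in
$wgPoolCounterConf, and $file/$key are assumed context):

    // Only one request per key gets to run doWork() at a time; concurrent
    // requests for the same key wait or bail out instead of piling on.
    // Note this limits *simultaneous* attempts; it doesn't by itself stop
    // re-queueing over time once the lock is released.
    $work = new PoolCounterWorkViaCallback(
        'TMHTranscodeEnqueue', // made-up pool type, needs $wgPoolCounterConf
        'transcode:' . $file->getName() . ':' . $key,
        [
            'doWork' => function () use ( $file, $key ) {
                // Record state and push the job exactly once.
                JobQueueGroup::singleton()->push(
                    new WebVideoTranscodeJob(
                        $file->getTitle(),
                        [ 'transcodekey' => $key ]
                    )
                );
                return true;
            },
            'error' => function ( $status ) {
                // Couldn't get a slot; fine, another request is on it.
                return false;
            },
        ]
    );
    $work->execute();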

-- brion


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Best practices for read/write vs read-only requests, and our multi-DC future

2016-04-21 Thread Erik Bernhardson
On Apr 20, 2016 10:45 PM, "Brion Vibber"  wrote:
>
> Over in TimedMediaHandler extension, we've had a number of cases where old
> code did things that were convenient in terms of squishing read-write
> operations into data getters, that got removed due to problems with long
> running transactions or needing to refactor things to support
> future-facing multi-DC work where we want requests to be able to more
> reliably distinguish between read-only and read-write. And we sometimes
> want to put some of those clever hacks back and add more. ;)
>
> For instance in https://gerrit.wikimedia.org/r/284368 we'd like to remove
> transcode derivative files of types/resolutions that have been disabled
> automatically when we come across them. But I'm a bit unsure it's safe to
> do so.
>
> Note that we could fire off a job queue background task to do the actual
> removal... But is it also safe to do that on a read-only request?
>
> https://www.mediawiki.org/wiki/Requests_for_comment/Master_%26_slave_datacenter_strategy_for_MediaWiki
> seems to indicate job queueing will be safe, but would like to confirm
> that. :)
>

I think this is the preferred method. My understanding is that the jobs
will get shipped to the primary DC job queue.

> Similarly in https://gerrit.wikimedia.org/r/#/c/284269/ we may wish to
> trigger missing transcodes to run on demand, similarly. The actual re
> encoding happens in a background job, but we have to fire it off, and we
> have to record that we fired it off so we don't duplicate it...
>
> (This would require a second queue to do the high-priority state table
> update and queue the actual transcoding job; we can't put them in one queue
> because a backup of transcode jobs would prevent the high priority job from
> running in a timely fashion.)
>
The job queue can do deduplication, although you would have to check whether
that applies while the job is running, and not only while it's queued. Might
help?
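
Roughly like this (sketch; the class and param names are illustrative, but
removeDuplicates and getDeduplicationInfo are the existing hooks in core's
Job class, if I remember right):

    // Opting a job into deduplication (class/param names illustrative).
    class TranscodeStateJob extends Job {
        public function __construct( Title $title, array $params ) {
            parent::__construct( 'transcodeState', $title, $params );
            // Ask the queue to drop duplicate unclaimed copies of this job.
            // Caveat from above: this helps while duplicates are *queued*;
            // check whether it also covers the window while one is running.
            $this->removeDuplicates = true;
        }

        public function getDeduplicationInfo() {
            $info = parent::getDeduplicationInfo();
            // Strip volatile params so logically-identical jobs compare equal.
            unset( $info['params']['requestId'] );
            return $info;
        }

        public function run() {
            // ... update the state table, queue the heavy job, etc. ...
            return true;
        }
    }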

> A best practices document on future-proofing for multi DC would be pretty
> awesome! Maybe factor out some of the stuff from the RfC into a nice dev
> doc page...
>
> -- brion
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Best practices for read/write vs read-only requests, and our multi-DC future

2016-04-21 Thread Brion Vibber
On Thu, Apr 21, 2016 at 4:59 PM, Erik Bernhardson <
ebernhard...@wikimedia.org> wrote:

> On Apr 20, 2016 10:45 PM, "Brion Vibber"  wrote:
> > Note that we could fire off a job queue background task to do the actual
> > removal... But is it also safe to do that on a read-only request?
> >
>
> https://www.mediawiki.org/wiki/Requests_for_comment/Master_%26_slave_datacenter_strategy_for_MediaWiki
> > seems to indicate job queueing will be safe, but would like to confirm
> > that. :)
> >
>
> I think this is the preferred method. My understanding is that the jobs
> will get shipped to the primary DC job queue.
>

*nod* Looks like, per the spec, that should work with few surprises.


>
> > Similarly in https://gerrit.wikimedia.org/r/#/c/284269/ we may wish to
> > trigger missing transcodes to run on demand, similarly. The actual re
> > encoding happens in a background job, but we have to fire it off, and we
> > have to record that we fired it off so we don't duplicate it...
> [snip]
> >
> The job queue can do deduplication, although you would have to check if
> that is active while the job is running and not only while queued. Might
> help?
>

Part of the trick is that we want to let the user know the job has been
queued; and if the job errors out, we want the user to know that too.

Currently this means we have to update a row in the 'transcode' table
(TimedMediaHandler-specific info about the transcoded derivative files)
when we fire off the job, then update its state again when the job actually
runs.

If that's split into two queues, one lightweight and one heavyweight, then
this might make sense (rough code sketch after the list):

* N web requests hit something using File:Foobar.webm, which has a missing
transcode
* they each try to queue up a job to the lightweight queue that says "start
queueing this to actually transcode!"
* when the job queue runner on the lightweight queue sees the first such
job, it records the status update to the database and queues up a
heavyweight job to run the actual transcoding. The N-1 remaining jobs duped
on the same title/params either get removed, or never got stored in the
first place; I forget how it works. :)
* ... time passes, during which further web requests don't yet see the
updated database table state, and keep queueing in the lightweight queue.
* lightweight queue runners see some of those jobs, but they have the
updated master database state and know they don't need to act.
* database replication of the updated state hits the remote DC
* ... time passes, during which further web requests see the updated database
table state and don't bother queueing the lightweight job
* eventually, the heavyweight job runs and completes, updating the state at
its start and at its end.
* eventually, the database replicates the transcode state completion to the
remote DC.
* web requests start seeing the completed state, and their output includes
the updated transcode information.
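
Roughly, the lightweight job might look like this (sketch; the class name and
the exact transcode-table columns are from memory, so treat them as
illustrative):

    // Sketch of the lightweight "record state and hand off" job.
    class TranscodeEnqueueJob extends Job {
        public function __construct( Title $title, array $params ) {
            parent::__construct( 'transcodeEnqueue', $title, $params );
            $this->removeDuplicates = true;
        }

        public function run() {
            // Job runners live in the primary DC, so master writes are fine.
            $dbw = wfGetDB( DB_MASTER );
            $key = $this->params['transcodekey'];

            // Only the first job for this title/key flips the "queued" state;
            // the condition on a NULL add-time makes later duplicates no-ops.
            $dbw->update(
                'transcode',
                [ 'transcode_time_addjob' => $dbw->timestamp() ],
                [
                    'transcode_image_name' => $this->title->getDBkey(),
                    'transcode_key' => $key,
                    'transcode_time_addjob' => null,
                ],
                __METHOD__
            );

            if ( $dbw->affectedRows() ) {
                // First one in: queue the actual heavyweight transcode job.
                JobQueueGroup::singleton()->push(
                    new WebVideoTranscodeJob(
                        $this->title,
                        [ 'transcodekey' => $key ]
                    )
                );
            }
            return true;
        }
    }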

It all feels a bit complex, and I wonder if we could build some common
classes to help with this transaction model. I'm pretty sure we can be
making more use of background jobs outside of TimedMediaHandler's slow
video format conversions. :D

-- brion
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Best practices for read/write vs read-only requests, and our multi-DC future

2016-04-23 Thread Brion Vibber
I've opened a phab task https://phabricator.wikimedia.org/T133448 about
writing up good intro docs and updating other docs to match it.

Feel free, y'all, to add to that or to hang additional tasks onto it, like
better utility classes to help folks transition code to background jobs...
and maybe infrastructure to make sure we're handling those jobs reliably on
small sites without dedicated job runners.

-- brion
On Apr 21, 2016 5:26 PM, "Brion Vibber"  wrote:

> [full quote of the previous message snipped]
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l