This is a follow-up to Rob's mail "Wikidata change propagation". I feel that the
question of running periodic jobs on a large number of wikis is a more generic
one, and deserves a separate thread.

Here's what I think we need:

1) Only one process should be performing a given update job on a given wiki.
This avoids conflicts and duplicates during updates.

2) No single server should be responsible for running updates on a given wiki.
This avoids a single point of failure.

3) The number of processes running update jobs (let's call them workers) should
be independent of the number of wikis to update. For better scalability, we
should not need one worker per wiki.

Such a system could be used in many scenarios where a scalable periodic update
mechanism is needed. For Wikidata, we need it to let the Wikipedias know when
data they are using from Wikidata has been changed.

Here is what we have come up with so far for that use case:

Currently:
* there is a maintenance script that has to run for each wiki
* the script is run periodically from cron on a single box
* the script uses a pid file to make sure only one instance is running (roughly
as sketched below).
* the script saves its last state (continuation info) in a local state file.
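
To illustrate, the single-instance check based on the pid file works roughly
like this (a sketch only, not the actual maintenance script; the path and names
are made up):

    import os
    import sys

    PID_FILE = "/var/run/dispatch_changes.pid"   # made-up path

    def already_running():
        """True if the pid file names a process that is still alive."""
        try:
            with open(PID_FILE) as f:
                pid = int(f.read().strip())
        except (IOError, ValueError):
            return False
        try:
            os.kill(pid, 0)          # signal 0 just checks for existence
        except OSError:
            return False             # stale pid file; previous run died
        return True                  # only meaningful for processes on THIS box

    if already_running():
        sys.exit(0)                  # another instance is working, bail out
    with open(PID_FILE, "w") as f:
        f.write(str(os.getpid()))
    # ... do the update run, save continuation info to a local state file ...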

This isn't good: It will require one process for each wiki (soon, all 280 or so
Wikipedias), and one cron entry for each wiki to fire up that process.

Also, the update process for a given wiki can only be configured on a single
box, creating a single point of failure. If we had a cron entry for wiki X on
two boxes, both processes could end up running concurrently, because they won't
see each other's pid file (and even if they did, via NFS or so, they wouldn't be
able to detect whether the process with the id in the file is alive or not).

And, if the state file or pid file gets lost or inaccessible, hilarity ensues.


Soon:
* We will implement a DB-based locking/coordination mechanism that ensures that
only one worker will update any given wiki, starting where the previous job
left off (a rough sketch follows below). The details are described in
<https://meta.wikimedia.org/wiki/Wikidata/Notes/Change_propagation#Dispatching_Changes>.
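
Very roughly, the claim-a-wiki step could look like the sketch below (table,
column and function names are invented here; the actual schema and locking
scheme are described on the page linked above). It uses SQLite just to keep the
sketch self-contained:

    import sqlite3

    db = sqlite3.connect("dispatch.db", isolation_level=None)   # autocommit
    db.execute("""CREATE TABLE IF NOT EXISTS dispatch_state (
                      wiki         TEXT PRIMARY KEY,
                      last_change  INTEGER NOT NULL DEFAULT 0,   -- continuation info
                      locked_until INTEGER NOT NULL DEFAULT 0    -- lock lease expiry
                  )""")

    def claim_wiki(now, lease=600):
        """Pick an unlocked wiki (e.g. the most lagged one) and lock it, or return None."""
        row = db.execute(
            "SELECT wiki, last_change FROM dispatch_state "
            "WHERE locked_until < ? ORDER BY last_change LIMIT 1", (now,)).fetchone()
        if row is None:
            return None
        wiki, last_change = row
        # Conditional UPDATE: only succeeds if no other worker grabbed it meanwhile.
        cur = db.execute(
            "UPDATE dispatch_state SET locked_until = ? "
            "WHERE wiki = ? AND locked_until < ?", (now + lease, wiki, now))
        return (wiki, last_change) if cur.rowcount == 1 else None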

* We will still be running these jobs from cron, but we can now configure a
generic "run ubdate jobs" call on any number of servers. Each one will create
one worker, that will then pick a wiki to update and lock it against other
workers until it is done.

There is, however, no mechanism to keep worker processes from piling up if an
update run takes longer than the interval between cron invocations. So the cron
frequency has to be kept fairly low, which increases update latency.

Note that each worker decides at runtime which wiki to update. That means it
cannot be a maintenance script running with the target wiki's configuration.
Tasks that need wiki-specific knowledge thus have to be deferred to jobs that the
update worker posts to the target wiki's job queue.


Later:
* Let the workers run persistently, each running its own poll-work-sleep loop
with configurable batch size and sleep time (sketched below).
* Monitor the workers and re-launch them if they die.

This way, we can easily scale by tuning the expected number of workers (or the
number of servers running workers). We can further adjust the update latency by
tuning the batch size and sleep time for each worker.
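
A persistent worker's main loop could then be as simple as the following sketch
(claim_wiki() is the sketch above; dispatch_changes() and release_wiki() are
placeholders for the real update and unlock code, and the batch size and sleep
time are arbitrary):

    import time

    BATCH_SIZE = 1000    # changes to process per pass (tunable)
    SLEEP_TIME = 10      # seconds to sleep when there is nothing to do (tunable)

    def run_worker():
        while True:
            target = claim_wiki(now=int(time.time()))
            if target is None:
                time.sleep(SLEEP_TIME)      # nothing unlocked right now
                continue
            wiki, last_change = target
            dispatch_changes(wiki, since=last_change, limit=BATCH_SIZE)
            release_wiki(wiki)              # store new continuation info, drop the lock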

One way to implement this would be via Puppet: Puppet would be configured to
ensure that a given number of update workers is running on each node. For
starters, two or three boxes running one worker each, for redundancy, would be
sufficient.

Is there a better way to do this? Using start-stop-daemon or something like
that? A grid scheduler?

Any input would be great!

-- daniel


