https://bugzilla.wikimedia.org/show_bug.cgi?id=49757

--- Comment #5 from Ori Livneh <o...@wikimedia.org> ---
Some notes about how things are currently configured:

MediaWiki can report errors to a remote host via UDP. The MediaWiki instances
on the production cluster are configured to log to a host named 'fluorine'.
This is done by specifying its address as the value of $wmfUdp2logDest in
CommonSettings.php (in operations/mediawiki-config.git).

The MediaWiki instances that power the beta cluster set $wmfUdp2logDest to
'deployment-bastion'
(<https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000390>), a Labs
instance that plays the role of fluorine. It writes log data to files in
/home/wikipedia/logs; exceptions and fatals are logged to exception.log and
fatal.log, respectively.
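
The receiving side is conceptually just a UDP listener that demultiplexes
datagrams by channel into per-channel files. A toy sketch (the real udp2log
daemon is a separate program with its own filter configuration, and the port
here is an assumption):

    import socket

    # Toy udp2log-style receiver: demultiplex datagrams by channel name and
    # append them to per-channel files, roughly what happens for
    # exception.log and fatal.log on deployment-bastion.
    LOG_DIR = "/home/wikipedia/logs"
    CHANNELS = {"exception": "exception.log", "fatal": "fatal.log"}

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 8420))  # port is an assumption

    while True:
        data, addr = sock.recvfrom(65535)
        channel, sep, payload = data.decode("utf-8", "replace").partition(" ")
        if channel in CHANNELS:
            with open("%s/%s" % (LOG_DIR, CHANNELS[channel]), "a") as f:
                f.write(payload)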

When I first started looking at these logs, I didn't want to mess with the
file-based logging, since it's an important service that developers rely on. So
I submitted a patch to have fluorine stream the log data, as it receives it, to
another host (vanadium), in addition to writing it to disk. On vanadium I
have a script that generates the Ganglia graphs at <http://ur1.ca/edq1f>.
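
The duplication itself is nothing fancier than re-emitting each datagram to a
second destination while continuing to write it to disk. A hedged sketch, with
placeholder hostnames, port, and file name (the real setup is configured
elsewhere, not written in Python):

    import socket

    # Sketch of the fan-out: receive a datagram, append it to a local log
    # file, and forward an unmodified copy to a second host (vanadium in
    # production, deployment-fluoride on the beta cluster).
    RELAY_DEST = ("vanadium.example", 8420)

    recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    recv_sock.bind(("0.0.0.0", 8420))
    send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    while True:
        data, addr = recv_sock.recvfrom(65535)
        with open("/home/wikipedia/logs/raw.log", "ab") as f:
            f.write(data)                   # keep the file-based logging intact
        send_sock.sendto(data, RELAY_DEST)  # ...and stream a copy downstream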

Yesterday I submitted change Ia0cc8de43 and Ryan merged it. That change
reproduces the state of affairs described above (i.e. the duplication of the
log stream to two destinations, fluorine and vanadium) on the beta cluster. It
does so by having deployment-bastion forward a copy of the log data to a new
instance, deployment-fluoride
(<https://wikitech.wikimedia.org/wiki/Nova_Resource:I-0000084c>).

So the TL;DR is that there is an instance on the beta cluster
(deployment-fluoride) that receives a live stream of errors and fatals being
generated on the beta cluster MediaWikis, and we're free to use it as a sandbox
for trying out different ways of capturing and representing this data.

I've only taken an initial step, which is to take the stream of exceptions
and fatals (which follow an idiosyncratic format that is not easy to analyze)
and transform each error report into a JSON document. This is the work done in
Ia0cc8de43 (<https://gerrit.wikimedia.org/r/#/c/75560/>). Or "half-done", I
should say, since I've discovered a couple of bugs that I haven't yet had a
chance to fix.
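
To give a flavor of the transformation, here is a hypothetical example; the
actual log format is messier and the parser in Ia0cc8de43 works differently,
so the regex and field names below are made up purely for illustration:

    import json
    import re

    # Hypothetical example: turn one exception.log line of the form
    # "<timestamp> <wiki>: <message>" into a JSON document. The real log
    # format is messier and the real parser handles more cases.
    LINE_RE = re.compile(r"^(?P<timestamp>\S+ \S+)\s+(?P<wiki>\S+):\s+(?P<message>.*)$")

    def to_json(line):
        m = LINE_RE.match(line)
        if m is None:
            return None
        return json.dumps(m.groupdict())

    print(to_json("2013-07-24 18:00:00 enwiki: Exception from line 42 of foo.php"))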

The nice thing about JSON is that most modern languages have built-in modules
in their standard library for handling it. So the upshot is that, pending a
couple of bugfixes, there will shortly be a streaming JSON service on
deployment-fluoride that publishes MediaWiki error and exception reports as
machine-readable objects.
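
Consuming such a stream then takes only a few lines in any language. For
example, in Python, assuming (hypothetically) that deployment-fluoride
republished the JSON documents one per line over TCP on some port:

    import json
    import socket

    # Hypothetical consumer: connect to the (assumed) line-oriented JSON
    # stream and do something with each error report.
    HOST, PORT = "deployment-fluoride.example", 8421  # both placeholders

    sock = socket.create_connection((HOST, PORT))
    for line in sock.makefile("r", encoding="utf-8"):
        report = json.loads(line)
        print(report.get("wiki"), report.get("message"))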

In this state, the logs are quite easy to pipe into a data store or a
visualization framework. We still have to figure out exactly what we want to
do, though, and then spec out a solution, ideally building on solid
off-the-shelf tools where they exist.
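
As a throwaway illustration of how low that barrier is, the JSON documents
could be loaded straight into SQLite for ad-hoc querying; the schema and field
names below are again just made up:

    import json
    import sqlite3

    # Illustrative only: dump JSON error reports into a SQLite table so they
    # can be queried with plain SQL. Field names match the toy parser above.
    db = sqlite3.connect("errors.db")
    db.execute("CREATE TABLE IF NOT EXISTS errors (timestamp TEXT, wiki TEXT, message TEXT)")

    def store(report_json):
        report = json.loads(report_json)
        db.execute("INSERT INTO errors VALUES (?, ?, ?)",
                   (report.get("timestamp"), report.get("wiki"), report.get("message")))
        db.commit()

    store('{"timestamp": "2013-07-24 18:00:00", "wiki": "enwiki", "message": "example"}')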

Some ideas to get the ball rolling:
https://getsentry.com/welcome/ (packages itself as a paid service, but the
software is open-source).
http://logstash.net/

We could also build our own custom UI for spelunking the data.
