On Tuesday, April 23, 2013 at 1:06 PM, Greg Grossmeier wrote:
> Hello all,
> 
> This message is for those of you who do deployments to the WMF cluster.
> 
> 
> On the [[How to deploy code]] wikitech page, there is a section on
> Testing your live code:
> https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Test_your_code_live
> 
> That's a pretty basic overview of it and it could be greatly improved
> with information like:
> * How to monitor specific parts of the cluster that are relevant to what
> you deployed
> * What general monitoring should be looked at after you deploy


MediaWiki exceptions / fatals are now plotted in Ganglia, though somewhat
awkwardly under the node vanadium.eqiad.wmnet (where they are tallied)
rather than under the node on which each error originated. I think the
current setup deserves a second look (maybe this ought to go in Graphite
instead?), but it is intelligible enough to be of _some_ use.

The most useful view is the last two hours' worth of exceptions and misc.
fatals (evergreen link):

http://ganglia.wikimedia.org/latest/graph.php?r=2hr&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&x=0.5&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=fatal%7Cexception&gtype=stack&glegend=show&aggregate=1&embed=1
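
For anyone who wants to tweak that view, here is a rough sketch of how the
link is put together, using only the query parameters visible in the URL
above. The function name and the time_range argument are made up for
illustration, and any range value other than "2hr" is an assumption about
what Ganglia accepts, not something I have checked.

```python
from urllib.parse import urlencode

# Base endpoint copied from the link above.
GANGLIA_GRAPH = "http://ganglia.wikimedia.org/latest/graph.php"


def mediawiki_errors_url(time_range="2hr"):
    """Rebuild the aggregate MediaWiki errors graph URL.

    Hypothetical helper: every parameter below is taken from the link
    above; only time_range is made variable.
    """
    params = [
        ("r", time_range),                   # time window shown
        ("z", "xlarge"),                     # graph size
        ("title", "MediaWiki errors"),
        ("vl", "errors / sec"),              # vertical axis label
        ("x", "0.5"),
        ("n", ""),
        ("hreg[]", "vanadium.eqiad.wmnet"),  # host where errors are tallied
        ("mreg[]", "fatal|exception"),       # metric name regex
        ("gtype", "stack"),
        ("glegend", "show"),
        ("aggregate", "1"),
        ("embed", "1"),
    ]
    # urlencode percent-encodes the brackets in "hreg[]" and "mreg[]";
    # that decodes to the same parameter names as the literal form.
    return GANGLIA_GRAPH + "?" + urlencode(params)


if __name__ == "__main__":
    print(mediawiki_errors_url())
```

Narrowing the mreg[] pattern to just "fatal" or just "exception" should
split the two series apart, but that is a guess based on the parameter
looking like a regex.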

(The 'm' is 'milli', so the current peaks correspond to one exception / fatal
every 6-10 seconds.)
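
The conversion behind that parenthetical, as a small sketch. It assumes the
graph values are in thousandths of an error per second; the function names
are just for illustration.

```python
def seconds_between_errors(rate_milli_per_sec):
    """Convert a rate in milli-errors per second to seconds per error."""
    if rate_milli_per_sec <= 0:
        raise ValueError("rate must be positive")
    # 1 error/sec == 1000 milli-errors/sec
    return 1000.0 / rate_milli_per_sec


def milli_rate(seconds_between):
    """Inverse: seconds per error to milli-errors per second."""
    if seconds_between <= 0:
        raise ValueError("interval must be positive")
    return 1000.0 / seconds_between


if __name__ == "__main__":
    # One error every 6 or 10 seconds, expressed as graph values.
    for interval in (6, 10):
        print(interval, "s between errors ->", round(milli_rate(interval)), "m")
```

So one error every 10 seconds is 100m, and one every 6 seconds is about
167m.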

I'll add it to the post-deployment instructions if people find it useful.

--
Ori Livneh



_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
