Re: [Wikitech-l] [Engineering] RFC: Introducing two new HTTP headers to track mobile pageviews

2013-02-07 Thread Mark Bergsma

On Feb 6, 2013, at 9:32 PM, David Schoonover  wrote:

> Just want to summarize and make sure I've got the right conclusions, as
> this thread has wandered a bit.
> 
> *1. X-MF-Mode: Alpha/Beta Site Usage*
> We'll roll this into the X-CS header, which will now be KV-pairs (using
> normal URL encoding), and set by Varnish. This will avoid an explosion of
> cryptic headers for analytic purposes.
> 
> Questions:
> - It seems there's some confusion around "bypassing Varnish". If I
> understand correctly, it's not that Varnish is ever bypassed, just that the
> upstream response is not cached if cookies are present. Is that right?

Yes

> - Since we're repurposing X-CS, should we perhaps rename it to something
> more apt to address concerns about cryptic non-standard headers flying
> about?

I'd like to propose defining *one* request header to be used for all analytics 
purposes. It can hold key/value pairs, set client-side where applicable. 
Varnish can append to it where needed, with later keys overriding earlier ones. 
Then we can log that one header across all HTTP/caching clusters without having 
to change the log stream all the time and without wasting much space, and 
caching edge configuration changes are kept to a minimum as well.

And we might as well be transparent in its naming. How about the header name 
"Log-Parameters:"?

> *2. X-MF-Req: Primary vs Secondary API Requests*
> 
> This header will be replaced with a query parameter set by the client-side
> JS code making the request. Analytics will parse it out at processing time
> and Do The Right Thing.


I think the choice between a URL param and a request header should mainly 
depend on whether the response varies on the value of the parameter. If the 
responses are otherwise identical, and the value is only used for analytics 
purposes, I would prefer to put it into the above header instead, as a URL 
param would impair cacheability / cache size (even if those requests are 
currently not cacheable for other reasons). If the responses actually differ 
based on this parameter, I would prefer to have it in the URL where 
possible.

-- 
Mark Bergsma 
Lead Operations Architect
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Engineering] RFC: Introducing two new HTTP headers to track mobile pageviews

2013-02-11 Thread Mark Bergsma

On Feb 9, 2013, at 11:21 PM, Asher Feldman  wrote:
> For this particular case, the API requests are for getting specific
> sections of an article, as opposed to either the whole thing or the first
> section as part of an initial pageview.  I might not have grokked the
> original RFC email well, but I don't understand why this was being
> discussed as a logging challenge or necessitating a request header.  A
> mobile api request to just get section 3 of the article on otters should
> already utilize a query param denoting that section 3 is being fetched, and
> is already clearly not a "primary" request.

Yes, that part remains a bit unclear to me as well - some more details would be 
welcome.

> Whether or not it makes sense for mobile to move in the direction of
> splitting up article views into many api requests is something I'd love to
> see backed up by data.  I'm skeptical for multiple reasons.

What is the main motivation here? Reducing article sizes/transfers at the 
expense of more latency?

-- 
Mark Bergsma 
Lead Operations Architect
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Labs-l] Maria DB

2013-02-14 Thread Mark Bergsma

On Feb 14, 2013, at 5:02 PM, Petr Bena  wrote:

> Keeping debug symbols in binaries will result in poor performance, or it 
> should

That's bollocks. It results in a larger binary _on disk_. The symbol table 
isn't even loaded into memory and doesn't affect performance.

Debug information is *highly useful* in a production setup, and we try to run 
all our core applications with it so we have a chance to debug issues when they 
occur.

I think the only reason distributions omit debug information is to save disk 
space.

-- 
Mark Bergsma 
Lead Operations Architect
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Mobile caching improvements are coming

2013-04-02 Thread Mark Bergsma
Hi Max,

On Mar 29, 2013, at 10:45 AM, Max Semenik  wrote:

> Hi, we at the mobile team are currently working on improving our
> current hit rate, publishing the half-implemented plan here for review:

> == Proposed strategy ==
> * We don't vary pages on X-Device anymore.
> * Because we still need to give really ancient WAP phones WML output, we
> create a new header, X-WAP, with just two values, yes or no[1]
> * And we vary our output on X-WAP instead of X-Device[2]
> * Because we still need to serve device-specific CSS but can't use device
> name in page HTML, we create a single ResourceLoader module,
> mobile.device.detect, which outputs styles depending on X-Device.[2] This
> does not affect bits cache fragmentation because it simply changes the way
> the same data is varied, but does not add new fragmentation factors. Bits
> hit rate is currently very high, by the way.

Yes. It does add Vary processing on the bits caches for these requests though. 
But we can change that by including the X-Device header into the hash for cache 
lookups, if we want to.

> * And because we need X-Device, we will need to direct mobile load.php
> requests to the mobile site itself instead of bits. Not a problem because
> mobile domains are served by Varnish just like bits.
> * Since now we will be serving ResourceLoader to all devices, we will
> blacklist all the incompatible devices in the startup module to prevent
> them from choking on the loads of JS they can't handle (and even if they
> degrade gracefully, still no need to force them to download tens of
> kilobytes needlessly)[3]

Good work! This should help a great deal.

> Your comments are highly appreciated! :)


I've been pondering the two options for serving mobile ResourceLoader requests 
with Varnish: on the bits caches or on the mobile caches. I don't fully like 
either option, to be honest. On the one hand I'd like to keep mobile device 
detection off our currently very efficient bits caches; on the other hand I 
don't like the idea of mixing RL requests into the high-churn LRU eviction of 
the mobile frontend caches. Unfortunately Varnish currently doesn't give us a 
good way to separate/specify cache backends per object.

So let's go with Asher's suggestion indeed, and add the device detection to the 
bits servers. Let's keep it such that it'll always be easy to distinguish these 
requests, so we can easily decide to move these to another Varnish cluster at 
any point.

-- 
Mark Bergsma 
Lead Operations Architect
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] ZERO architecture

2013-05-31 Thread Mark Bergsma
Hi Yuri,

Thanks for writing this up. I'll put some comments and questions inline.

On May 30, 2013, at 7:16 PM, Yuri Astrakhan  wrote:

> *== Technical Requirements ==*
> * increase Varnish cache hits / minimize cache fragmentation
> * Set up and configure new partners without code changes
> * Use partner-supplied IP ranges as a preferred alternative to the geo-ip
> database for fundraising & analytic teams

Note that a Varnish VMOD to support the latter is being written at the moment 
by Brandon Black.

> *== Current state ==*
> Zero domain requests set X-Subdomain="ZERO", and treat the request as
> mobile. The backend uses X-Subdomain and X-CS headers to customize result.
> The cache is heavily fragmented due to its variance on both of these
> headers in addition to the variance set by MobileFrontend extension and
> MediaWiki core.

...and also, variance due to the different hostname (and thus URL).

> *== Proposals ==*
> In order to reduce Zero-caused fragmentation, we propose to shrink from one
> bucket per carrier (X-CS) to three general buckets:
> * smart phones bucket -- banner and site modifications are done on the
> client in javascript
> * feature phones -- HTML only, the banner is inserted by the ESI
> ** for carriers with free images
> ** for carriers without free images
> 
> *=== Varnish logic ===*
> * Parse User-Agent to distinguish between desktop / mobile / feature phone:
> X-Device-Type=desktop|mobile|legacy

Using the OpenDDR library?

> * Use IP -> X-CS lookup (under development by OPs) to convert client's IP
> into X-CS header
> * If X-CS && X-Device-Type == 'legacy': Use IP -> X-Images lookup (same
> lookup plugin, different database file) to determine if carrier allows
> images

Hopefully we can set the X-Images header straight from the IP database.

> Since each carrier has its own list of free languages, language links on
> feature phones will point to origin, which will either silently redirect
> or ask for confirmation.

Perhaps we can store the list of supported languages for the carrier in the IP 
database as well?
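
To make the idea concrete, here is a minimal sketch of the kind of 
IP-to-carrier lookup being discussed, with the per-carrier metadata (X-CS code, 
free images, free languages) kept in one database. All networks, codes and 
languages below are made up, and the real lookup would be a Varnish plugin 
working from a database file:

import ipaddress

# Hypothetical carrier database: network -> per-carrier metadata.
CARRIER_DB = [
    (ipaddress.ip_network("198.51.100.0/24"),
     {"cs": "250-99", "images": True, "languages": ["en", "hi"]}),
    (ipaddress.ip_network("203.0.113.0/24"),
     {"cs": "297-01", "images": False, "languages": ["fr"]}),
]

def lookup_carrier(client_ip):
    """Return carrier metadata for an IP, or None for non-Zero traffic."""
    addr = ipaddress.ip_address(client_ip)
    for network, meta in CARRIER_DB:
        if addr in network:
            return meta
    return None

print(lookup_carrier("203.0.113.7"))
# -> {'cs': '297-01', 'images': False, 'languages': ['fr']}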

> *=== ZERO vs M ===*
> Even though I think zero. and m. subdomains should both go the way of the
> dodo to make each article have just one canonical location (no more
> linking & Google issues), this won't happen until we are fully migrated
> to Varnish and make some mobile code changes (and possibly other changes
> that I am not aware of).

What do you mean by "until we are fully migrated to Varnish"? MobileFrontend 
has always exclusively been on Varnish.

> At the same time, we should try to get rid of ZERO wherever possible.
> There are two technical differences between m & zero: zero shows a link to
> the image instead of the actual image, and a big red zero warning is shown if
> the carrier is not detected. There is also an organizational difference --
> some carriers only whitelist zero, some only m, and some both zero
> & m subdomains.

I'm still a little confused about "m" vs "ZERO" and "images" vs "no images". 
That probably means others are too. :) Can you elaborate a little on that? I 
thought those were pretty much the same, but according to your spreadsheet that 
doesn't seem to be the case?

Overall this sounds reasonable, I think; we'll just need to work out the details.

As Arthur also said in this thread, I'd like to keep zero & m completely 
aligned, ideally sharing the Varnish cache objects and the mobile device 
detection at the Varnish level as much as possible. I don't think we disagree 
here.

-- 
Mark Bergsma 
Lead Operations Architect
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] ZERO architecture

2013-05-31 Thread Mark Bergsma
>> * feature phones -- HTML only, the banner is inserted by the ESI
>> ** for carriers with free images
>> ** for carriers without free images
>> 
> 
> What about including ESI tags for banners for smart devices as well as
> feature phones, then either use ESI to insert the banner for both device
> types or, alternatively, for smart devices don't let Varnish populate the
> ESI chunk and instead use JS to replace the ESI tags with the banner? That
> way we can still serve the same HTML for smart phones and feature phones
> with images (one less thing for which to vary the cache).

I think the jury is still out on whether it's better to use ESI for banners 
in Varnish or use JS for that client-side. I guess we'll have to test and see.

> Are there carrier-specific things that would result in different HTML for
> devices that do not support JS, or can you get away with providing the same
> non-js experience for Zero as MobileFrontend (aside from the
> banner, presumably handled by ESI)? If not currently, do you think its
> feasible to do that (eg make carrier-variable links get handled via special
> pages so we can always rely on the same URIs)? Again, it would be nice if
> we could just rely on the same HTML to further reduce cache variance. It
> would be cool if MobileFrontend and Zero shared buckets and they were
> limited to:
> 
> * HTML + images
> * HTML - images
> * WAP

That would be nice.

> Since we improved MobileFrontend to no longer vary the cache on X-Device,
> I've been surprised to not see a significant increase in our cache hit
> ratio (which warrants further investigation but that's another email). Are
> there ways we can do a deeper analysis of the state of the varnish cache to
> determine just how fragmented it is, why, and how much of a problem it
> actually is? I believe I've asked this before and was met with a response
> of 'not really' - but maybe things have changed now, or others on this list
> have different insight. I think we've mostly approached the issue with a
> lot more assumption than informed analysis, and if possible I think it
> would be good to change that.

Yeah, we should look into that. We've already flagged a few possible culprits, 
and we're also working on the migration of the desktop wiki cluster from Squid 
to Varnish, which has some of the same issues with variance (sessions, XVO, 
cookies, Accept-Language...) as MobileFrontend does. After we've finished 
migrating that and confirmed that it's working well, we want to unify those 
clusters' configurations a bit more, and that by itself should give us 
additional opportunity to compare some strategies there.

We've since also figured out that the way we calculate cache efficiency with 
Varnish is not exactly ideal; unlike with Squid, cache purges are done as HTTP 
requests to Varnish. Therefore in Varnish, those cache lookups are counted 
towards the cache hit rate, which isn't very helpful. To make things worse, the 
few hundred purges a second vs actual client traffic matter a lot more on the 
mobile cluster (with much less traffic but a big content set) than they do for 
our other clusters. So until we can factor that out in the Varnish counters 
(might be possible in Varnish 4.0), we'll have to look at other metrics.
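
To make the distortion concrete, here is a back-of-the-envelope sketch; the 
counter names and numbers are made up, not actual varnishstat fields:

def client_hit_rate(cache_hits, cache_lookups, purge_hits, purge_lookups):
    """Estimate the hit rate for real client traffic by excluding purge lookups."""
    return (cache_hits - purge_hits) / (cache_lookups - purge_lookups)

# A few hundred purges/sec against relatively modest mobile client traffic
# noticeably skews the raw ratio.
raw_rate = 9000 / 10000                              # 90%, purges included
adjusted = client_hit_rate(9000, 10000, 2900, 3000)  # ~87% for actual clients
print(raw_rate, adjusted)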

More useful therefore is to check the actual backend fetches ("backend_req"), 
and these appear to have gone down some. Annoyingly, every time we restart a 
Varnish instance we get a spike in the Ganglia graphs, making the long-term 
graphs pretty much unusable. To fix that we'll either need to patch Ganglia 
itself or move to some other stats engine (statsd?). So we have a bit of work 
to do there on the Ops front.

Note that we're about to replace all Varnish caches in eqiad with (fewer) newer, 
much bigger boxes, and we've decided to also upgrade the 4 mobile boxes with 
those same specs. And we're also doing that in our new west coast caching data 
center as well as esams. This will increase the mobile cache size a lot, and 
will hopefully help by throwing resources at the problem.

-- 
Mark Bergsma 
Lead Operations Architect
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] The summary of new zero architecture proposal

2013-06-18 Thread Mark Bergsma
Hi Yuri,

On Jun 14, 2013, at 7:16 PM, Yuri Astrakhan  wrote:

> Based on many ideas that were put forth, I would like to seek comments on
> this ZERO design. This HTML will be rendered for both M and ZERO subdomains
> if varnish detects that request is coming from a zero partner. M and ZERO
> will be identical except for the images - ZERO substitutes images with
> links to File:xxx namespace through a redirector.
> 
> * All non-local links always point to a redirector. On javascript capable
> devices, it will load carrier configuration and replace the link with local
> confirmation dialog box or direct link. Without javascript, redirector will
> either silently 301-redirect or show confirmation HTML. Links to images on
> ZERO.wiki and all external links are done in similar way.

For M, you only want to do this when it's a zero carrier I guess? If not, just 
a straight link?

> * The banner is an ESI link to */w/api.php?action=zero&banner=250-99* -
> returns HTML  blob of the banner. (Not sure if banner ID should be
> part of the URL)
> 
> Expected cache fragmentation for each wiki page:
> * per subdomain (M|ZERO)
> * if M - per "isZeroCarrier" (TRUE|FALSE). if ZERO - always TRUE.
> 3 variants is much better than one per carrier ID * 2 per subdomain.

I'm wondering, is there any HTML difference between "M & isZeroCarrier == TRUE" 
and "ZERO"? Links maybe? Can we make those protocol relative perhaps? We might 
be able to kill the cache differences for the domain completely, while still 
supporting both URLs externally.

-- 
Mark Bergsma 
Lead Operations Architect
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] wikivoyage (and wikidata) served by Varnish in eqiad

2013-08-05 Thread Mark Bergsma
Last week, we moved Wikidata traffic in eqiad (so in practice, all non-European 
traffic) from Squid to the new text Varnish cluster. A few issues were found 
and fixed, and we haven't seen any new issues for several days.

Today I've done the same for Wikivoyage. Non-European Wikivoyage traffic, 
served by our eqiad cluster, is now served by Varnish. Wikivoyage has a bigger 
proportion of normal users vs. API/bot traffic, so some new issues could surface.

Please let us know if you see any problems on Wikivoyage that might be related 
to the Varnish migration; file a Bugzilla ticket or mail me directly.

Thanks!

-- 
Mark Bergsma 
Lead Operations Architect
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] wikivoyage (and wikidata) served by Varnish in eqiad

2013-08-05 Thread Mark Bergsma

On Aug 5, 2013, at 5:24 PM, David Gerard  wrote:

> On 5 August 2013 16:17, Mark Bergsma  wrote:
> 
>> Last week, we moved wikidata traffic in eqiad (so in practice, all 
>> non-European traffic) from Squid to the new text Varnish cluster. A few 
>> issues were found and fixed, and we haven't seen any new issues for several 
>> days.
>> Today I've done the same for Wikivoyage. Non-European Wikivoyage traffic, 
>> served by our eqiad cluster, is now served by Varnish. Wikivoyage has a 
>> bigger portion of normal users vs. API/bot traffic, so some new issues could 
>> surface.
>> Please let us know if you see any problems on Wikivoyage that might be 
>> related to the Varnish migration; file a Bugzilla ticket or mail me directly.
> 
> 
> Somewhat ignorant question: once we go all-Varnish, will logs be
> generated in a similar format to eventually end up at stats.grok.se?


Yes, that's generated from our UDP log data, which we have for Squid, Varnish 
and nginx alike.

-- 
Mark Bergsma 
Lead Operations Architect
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [outages] www.wikipedia.com from Level3 via IPv6 not working

2013-08-07 Thread Mark Bergsma
Hi George,

On Aug 7, 2013, at 8:31 PM, George Herbert  wrote:

> Not sure if this is real or not, but report that some IPv6 ingress to WMF
> not working at the moment from Level3 networks.


We had the same result in the Level3 looking glass, but while we were debugging 
it and trying to gather more info on the hosts/networks affected, it started 
working again in the L3 LG as well. So it appears that the problem was resolved.

If anyone is still seeing issues reaching us over IPv6 via Level 3, then please 
let us know.

-- 
Mark Bergsma 
Lead Operations Architect
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [outages] www.wikipedia.com from Level3 via IPv6 not working

2013-08-08 Thread Mark Bergsma

On Aug 7, 2013, at 10:07 PM, George Herbert  wrote:

> 
> The original reporter saw the same restoration of service (I assume)...
> 
> Question - Are the WMF ops folks on the NANOG and outages lists?  Was this 
> redundant reporting?  8-)

On... yes. Actually reading them? Not really. :)

-- 
Mark Bergsma 
Lead Operations Architect
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Data center move in Amsterdam: expect some downtime

2008-12-26 Thread Mark Bergsma
In the upcoming days until New Year's we will be moving our servers and
other equipment in the Amsterdam data center location to a new data
center. Unfortunately this might result in some downtime and hiccups for
certain websites & services, although we will try to keep this to a
minimum.

On Sunday the 28th, between 09:00 and 11:00 UTC we will migrate our
network in Amsterdam to new equipment. All services located there will
be unreachable for a brief period. Traffic for the main wikis will be
rerouted to the Florida cluster however, and should remain unaffected.

In the days after we will be moving the servers themselves. Some
services, such as the mailing lists server, the subversion server and
the toolserver cluster, will be down for a number of hours while the
equipment is being moved. Traffic for the wikis should again remain
largely unaffected.

We hope to have the entire migration finished before we enter the last
few hours of 2008... and start 2009 with a clean sheet. Happy Holidays
everyone!

-- 
Mark Bergsma 
System & Network Administrator, Wikimedia Foundation

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Downtime due to network maintenance, Friday July 31st 12:00 UTC

2009-07-30 Thread Mark Bergsma
Hello,

Due to a problem in one of our core routers in our Tampa cluster we need
to perform some network maintenance tomorrow, Friday July 31st around
12:00 UTC. We will be performing a software upgrade and reboot of the
router. This should not take more than a few minutes if everything goes
well. Unfortunately this means that practically all sites and services
will be down during that time.

For those interested: one of the line cards in the router failed earlier
this week. A replacement has arrived, but does not boot up correctly
after hot plugging. Because we want to upgrade the firmware anyway, we
will reboot the entire box.

Cheers,

-- 
Mark Bergsma 
System & Network Administrator, Wikimedia Foundation

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] scaled media (thumbs) as *temporary* files, not stored forever

2012-10-24 Thread Mark Bergsma
To revive this old thread...

On Sep 5, 2012, at 9:35 PM, Asher Feldman  wrote:

> On Tue, Sep 4, 2012 at 3:11 PM, Platonides  wrote:
> 
>> On 03/09/12 02:59, Tim Starling wrote:
>>> I'll go for option 4. You can't delete the images from the backend
>>> while they are still in Squid, because then they would not be purged
>>> when the image is updated or action=purge is requested. In fact, that
>>> is one of only two reasons for the existence of the backend thumbnail
>>> store on Wikimedia. The thumbnail backend could be replaced by a text
>>> file that stores a list of thumbnail filenames which were sent to
>>> Squid within a window equivalent to the expiry time sent in the
>>> Cache-Control header.
>>> -- Tim Starling
>> 
>> The second one seems easy to fix. The first one should IMHO be fixed in
>> squid/varnish by allowing wildcard purges (ie. PURGE
>> /wikipedia/commons/thumb/5/5c/Tim_starling.jpg/* HTTP/1.0)

> fast.ly implements group purge for Varnish like this via a proxy daemon
> that watches backend responses for a "tag" response header (i.e. all
> resolutions of Tim_starling.jpg would be tagged that) and builds an
> in-memory hash of tags->objects which can be purged on.  I've been told
> they'd probably open source the code for us if we want it, and it is
> interesting (especially to deal with the fact that we don't purge articles
> at all of their possible url's) albeit with its own challenges.  If we
> implemented a backend system to track thumbnails that exist for a given
> orig, we may be able to remove our dependency on swift container listings
> to purge images, paving the way for a second class of thumbnails that are
> only cached.


How about this idea:

Just "purge all images with this prefix" doesn't really work in Squid or 
Varnish, because they don't store their cache database in a format that makes 
it cheap to determine which objects would match that. Varnish could do it with 
its "bans", but each ban is kept around for a long time, and with the tens, 
sometimes hundreds of purges a second we do, this would quickly add up to a 
massive ban list.

But... Varnish allows you to customize how it hashes objects into its object 
hash table (vcl_hash). What we could do is hash thumbnails to the same hash 
key as their original. Because of our current URL structure, that's pretty much 
a matter of stripping off the thumbnail postfix. Then the original and all its 
associated thumbnails end up at the same hash key in the hash table, and only a 
single purge for the original would nuke them all out of the cache.
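
As a minimal model of that key normalization (the real thing would live in 
vcl_hash; the URL layout here is taken from the purge example quoted above):

import re

def thumbnail_hash_key(url):
    """Collapse a thumbnail URL onto the hash key of its original.

    For thumbnail URLs of the form .../thumb/<x>/<xy>/<Name.ext>/<size>px-<Name.ext>,
    strip the trailing size component so the original and all of its thumbnails
    end up under one hash key; anything else is returned unchanged.
    """
    return re.sub(r"^(.+/thumb/[0-9a-f]/[0-9a-f]{2}/[^/]+)/[^/]+$", r"\1", url)

assert thumbnail_hash_key(
    "/wikipedia/commons/thumb/5/5c/Tim_starling.jpg/120px-Tim_starling.jpg"
) == "/wikipedia/commons/thumb/5/5c/Tim_starling.jpg"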

This relies on Varnish having an efficient implementation for multiple objects 
at a single hash key. It probably does, since it implements Vary processing 
this way. We would essentially be doing the same, Vary-ing on the thumbnail 
size. But I'll check the implementation to be sure.

Of course this won't work for Squid, but I'm pretty close to being able to 
replace Squid with Varnish entirely for upload.

-- 
Mark Bergsma 
Lead Operations Architect
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] scaled media (thumbs) as *temporary* files, not stored forever

2012-10-24 Thread Mark Bergsma

On Oct 24, 2012, at 11:36 AM, Mark Bergsma  wrote:
> How about this idea:
> 
> Just "purge all images with this prefix" doesn't really work in Squid or 
> Varnish, because they don't store their cache database in a format that makes 
> it cheap to determine which objects would match that. Varnish could do it 
> with their "bans", but each ban is kept around for a long time, and with the 
> tens, sometimes hundreds of purges a second we do, this would quickly add up 
> to a massive ban list.
> 
> But... Varnish allows you to customize how it hashes objects into its object 
> hash table (vcl_hash). What we could do, is hash thumbnails to the same hash 
> key as their original. Because of our current URL structure, that's pretty 
> much a matter of stripping off the thumbnail postfix. Then the original and 
> all its associated thumbnails end up at the same hash key in the hash table, 
> and only a single purge for the original would nuke them all out of the cache.
> 
> This relies on Varnish having an efficient implementation for multiple 
> objects at a single hash key. It probably does, since it implements Vary 
> processing this way. We would essentially be doing the same, Vary-ing on the 
> thumbnail size. But I'll check the implementation to be sure.


I checked, and Varnish stores all variant objects in a linked list per hash 
table entry. So once it looks up the hash entry for the URL of the original, 
it'll have to do a linear search for the right thumbnail size, matching each 
against a Vary header string. If we do this, we'll need to restrict the number 
of variants (thumb sizes) so we don't get hundreds/thousands on a single hash 
key.

Here's a little proof of concept to demonstrate how it could work:

https://gerrit.wikimedia.org/r/#/c/29805/2
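
For intuition only, here is a toy Python model of that structure: one hash key, 
a list of variants found by a linear scan, and a single purge clearing them 
all. It is not Varnish's actual implementation beyond the linked-list idea 
described above.

cache = {}  # hash key -> list of (variant id, object)

def store(key, variant_id, obj):
    cache.setdefault(key, []).append((variant_id, obj))

def lookup(key, variant_id):
    for vid, obj in cache.get(key, []):  # linear search over the variants
        if vid == variant_id:
            return obj
    return None  # miss

def purge(key):
    cache.pop(key, None)  # one purge removes the original and every thumbnail

store("/thumb/5/5c/Tim_starling.jpg", "120px", b"small thumb")
store("/thumb/5/5c/Tim_starling.jpg", "800px", b"large thumb")
purge("/thumb/5/5c/Tim_starling.jpg")  # both variants are gone in one go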

-- 
Mark Bergsma 
Lead Operations Architect
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Wikimedia logging infrastructure

2010-08-10 Thread Mark Bergsma
On 10-08-10 07:16, Rob Lanphier wrote:
> At any rate, there are a couple of problems with the way that it works:
> 1.  Once we saturate the NIC on the logging machine, the quality of
> our sampling degrades pretty rapidly.  We've generally had a problem
> with that over the past few months.
>   

As already stated elsewhere, we didn't really saturate any NICs, just
some socket buffers. Because of the large number of configured log
pipes, the software (udp2log) could not empty the socket buffers fast
enough.

> If this were your typical commercial operation, the answer would be
> "why aren't you just logging into Streambase?" (or some other data
> warehousing storage solution).  I'm not suggesting that we do that (or
> even look at any of the solutions that bill themselves as open source
> alternatives), since, while our needs are increasing, we still aren't
> planning to be anywhere near as sophisticated as a lot of data
> tracking orgs.  Still, it's worth asking questions about our existing
> setup.  Should we be looking optimize our existing single-box setup,
> extending our software to have multi-node collection, or looking at a
> whole new collection strategy?
>
>   

Besides the ideas that are currently being kicked around for improving or
rewriting the udp2log collection software, there's also always the
short-term, easy option of sending a multicast UDP stream and having
multiple collectors set up with distinct log pipes. E.g. one machine for
the sampled logging, and another, independent machine for all the
special-purpose log streams. I do prefer more efficient software solutions
over throwing more iron at the problem, though. :)
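
For what it's worth, the receiving side of such a setup is simple; a minimal 
sketch of a collector joining a multicast log stream (the group address, port 
and handling below are hypothetical, and the real collectors would be udp2log 
instances with their own pipe configuration):

import socket
import struct

MCAST_GROUP = "239.192.0.1"  # hypothetical multicast group
MCAST_PORT = 8420            # hypothetical port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", MCAST_PORT))
membership = struct.pack("4sl", socket.inet_aton(MCAST_GROUP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)

while True:
    datagram, _ = sock.recvfrom(65535)
    for line in datagram.decode("utf-8", "replace").splitlines():
        pass  # each collector feeds lines to its own (smaller) set of log pipes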

-- 
Mark Bergsma 
Operations Engineer, Wikimedia Foundation


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Thumbnail issues being resolved

2011-04-03 Thread Mark Bergsma
(I just posted the following to the tech blog, http://techblog.wikimedia.org)


Last Monday, our Solaris server that contains all image thumbnails developed 
problems. It ran out of memory, became too slow and eventually even started to 
crash. (For the technically inclined: we think the kernel is leaking some file 
system structure in kernel memory.) This caused missing thumbnails across 
Wikimedia projects.

We addressed these problems in the following ways:
* We decreased the load on this server by adapting the Squid configuration, so 
it would have to handle fewer requests. 
* We ordered more memory, in order to double the total physical memory in the 
relevant systems.
* We set up two new Linux servers that will eventually replace the Solaris 
server. 

At first, the addition of these Linux servers in a partially caching setup 
seemed enough to fix the immediate problem while we gradually copied all 
thumbnail files, which would eventually allow us to replace the Solaris server 
completely.

However, on Saturday night the Solaris server started crashing repeatedly, 
making it necessary to engage the image scalers to regenerate a large part of 
the missing thumbnails. This is causing some slowness in loading and generating 
new (uncached) thumbnails.

Fortunately, most users have not experienced serious problems while using the 
site, since most thumbnails are cached by our HTTP caching layer. It is 
impossible to determine exactly how long it will take to recover completely 
from the slower service, but we expect that this will take no more than a few 
days.

Over the past months we have been developing a new and more scalable 
architecture for media storage, which will solve these problems once and for 
all. We hope to deploy this new architecture within a few months, also 
utilizing the new data center. Please watch the Tech Blog for updates on this 
project.


-- 
Mark Bergsma 
Operations Engineering Program Manager
Wikimedia Foundation




___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Mailing lists server migration today

2012-01-13 Thread Mark Bergsma
Hi,

Today I will be migrating the mailing lists from a very old server (lily) in 
Amsterdam, to a new server (sodium) in our new Ashburn data center. Mailman 
will be upgraded to version 2.1.13 along the way.

During the migration, mail will be delayed as all data will need to be 
transferred to the new host. No mail should be lost, but no new mails will be 
sent out until the process is done, and the web interface will be 
unavailable. This shouldn't take more than one hour, if all goes well.

I will report here when things should be back up and running. Afterwards, 
please let us know of any new issues, in Bugzilla or on IRC (#wikimedia-tech). 
We don't expect any problems, but as with any software upgrade or migration, 
this can't be guaranteed...

Thanks,

-- 
Mark Bergsma 
Lead Operations Architect
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] CANCELED: Mailing lists server migration today

2012-01-13 Thread Mark Bergsma

On Jan 13, 2012, at 2:54 PM, Mark Bergsma wrote:

> Hi,
> 
> Today I will be migrating the mailing lists from a very old server (lily) in 
> Amsterdam, to a new server (sodium) in our new Ashburn data center. Mailman 
> will be upgraded to version 2.1.13 along the way.


...and right after I sent this mail, I rebooted the new server once more before 
starting the maintenance. But suddenly it refused to come back up, or even 
reinstall. Likely the server has hardware issues.

Therefore the maintenance is canceled for today, until we've figured out what 
the problem is. The migration will probably happen next week, possibly using 
different hardware.

-- 
Mark Bergsma 
Lead Operations Architect
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] COMPLETED: Mailing lists server migration today

2012-01-18 Thread Mark Bergsma

On Jan 18, 2012, at 2:54 PM, Mark Bergsma wrote:

> Today I will be migrating the mailing lists from a very old server (lily) in 
> Amsterdam, to a new server (sodium) in our new Ashburn data center. Mailman 
> will be upgraded to version 2.1.13 along the way.
> 
> During the migration, mail will be delayed as all data will need to be 
> transferred to the new host. No mail should go lost, but no new mails will be 
> sent out during the process until done, and the web interface will be 
> unavailable. This should take about one hour, if all goes well. 
> 
> I will report here when things should be back up and running. Afterwards, 
> please let us know of any new issues, in bugzilla or on IRC 
> (#wikimedia-tech). We don't expect any problems, but as with any software 
> upgrade or migration, this can't be guaranteed...


The mailing lists server migration is now complete - Mailman is now running on 
server sodium.

As some people pointed out, my message earlier today was indeed sent out with 
the wrong Date header. I simply redirected my old mail and edited it a bit, 
forgetting that the Date header would not be adjusted by my mail client. Sorry 
for that. :)

The Mailman migration went smoothly, and I'm not aware of any problems. Please 
let us know in Bugzilla or on IRC (#wikimedia-tech) if you're experiencing any 
new issues.

Unfortunately we needed to change the IP address of lists.wikimedia.org for 
this migration. Some large e-mail providers (e.g. Google) are rate limiting 
reception of mail messages from the new IP (208.80.154.4) because it's not 
known and whitelisted yet. To prevent further mail delivery delays today, I've 
configured the new lists server to route mails that would otherwise be delayed 
through the old mail server (and thus the old source IP address) for the time 
being.

Thanks,

-- 
Mark Bergsma 
Lead Operations Architect
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Faidon Liambotis promoted to Principal Operations Engineer

2014-02-10 Thread Mark Bergsma
I'm pleased to announce that Faidon Liambotis has been promoted to Principal 
Operations Engineer.

From the very first week he was hired, Faidon has expressed great interest in 
understanding and improving the complete infrastructure stack of the Wikimedia 
Foundation, covering not only the domain of the Operations team, but far 
beyond. I distinctly remember how, a few days after he was hired (which at the 
time, I didn't take any part in), he approached me for the first time on IRC 
and said:

   "Hi Mark! Nice to meet you. I see you just wrote this nice new director for 
consistent URL hashing to backends in Varnish. Let me help you get that 
upstreamed!"

I believe in that same week he fixed some bugs in our nginx setup and solved 
our scalability issues with Puppet's external (Nagios) resources, amongst other 
things.

Ever since, Faidon has taken on many projects, large and small, and completed 
them in ways going far beyond his duties. He has spent enormous amounts of time 
reviewing other people's patch sets, discussing their ideas, and mentoring them 
in their work. He's instrumental in coordinating efforts across multiple groups 
and making sure everyone arrives at the best possible solution. In discussions 
he's known for being analytical and methodical, and for calmly working towards a 
common goal. This is reflected in his architecting work too, where he 
contributes with sensible ideas and a great knowledge of the open source 
software and solutions landscape.

Outside of Wikimedia, Faidon has been active in Debian and other open source 
projects since 2004. He cares deeply about our use of open source solutions and 
helping our software extensions get upstreamed and made available to others.

I think it's only appropriate that we recognize his role with this promotion.

The biggest problem we may have with him is that he works too much and is 
involved with almost everything. Fortunately that is a good fit for his new 
role. :)

Please join me in congratulating Faidon.

— 
Mark Bergsma 
Lead Operations Architect
Acting Director of Technical Operations
Wikimedia Foundation




___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Welcome Chase Pettet to the Wikimedia Operations Team

2014-03-11 Thread Mark Bergsma
I'm very pleased to announce that Chase Pettet is joining the Wikimedia 
Foundation as Operations Engineer. Chase comes to us from DeviantArt, where he 
was responsible for their general server management infrastructure, monitoring 
and networking, as well as supporting the development team(s). Within Wikimedia 
Operations he will have similar responsibilities, working on Operations 
infrastructure projects and supporting other Engineering teams with their 
Operations needs. Chase will be working remotely from his home in Missouri. He 
started with us yesterday.

Please join me in welcoming Chase!

— 
Mark Bergsma 
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Welcome Giuseppe Lavagetto to Wikimedia Operations

2014-04-01 Thread Mark Bergsma
I'm pleased to announce that today, Giuseppe Lavagetto will be joining the 
Operations Team as an Operations Engineer. Giuseppe is based in Rome, Italy and 
will be working with us remotely. He comes from Venere, a subsidiary of 
Expedia, where he has greatly helped streamline operations and improve service 
reliability.

Giuseppe is very passionate about free and open source, free content and user 
privacy, and these aspects are strong motivations for him to join the Wikimedia 
Foundation. In his free time, he's an active volunteer with Autistici[1], a 
project that provides users with private communications and helps them avoid 
censorship. He also likes to contribute to various small FLOSS projects and 
loves music, blues, soul and hip-hop in particular. He's happily living with 
his wife and 11-year-old stepdaughter in Rome.

Giuseppe will be joining us next week in our off-site team meeting in Athens, 
which should be a short trip for him. :)

Please welcome Giuseppe!


[1] http://www.autistici.org

— 
Mark Bergsma 
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Please welcome Filippo Giunchedi to Wikimedia TechOps

2014-05-05 Thread Mark Bergsma
I'm very happy to announce that Filippo Giunchedi is joining us as an 
Operations Engineer in the Technical Operations team. Filippo is Italian, but 
he lives in Dublin where he interned at Google and worked at Amazon before 
coming to Wikimedia. He's gained a lot of experience working with large-scale 
distributed systems and infrastructure there.

Filippo will be working with us remotely. Today is his start day, but we were 
lucky to have him join us at our Ops off-site meeting in Athens a few weeks 
ago, where he helped improve our monitoring of system metrics with Graphite.

Fiddling with machines has always been his passion - it led to a fascination 
with computers in the late 90s. He got involved in free software projects (e.g. 
Debian, as a Debian Developer) in the mid-2000s. System level technologies, 
infrastructure, distributed systems and networking are his main interests. On a 
different level, he's also interested in online privacy and secure/anonymous 
communications (e.g. Tor).

You can find Filippo on IRC (Freenode), using the nickname "godog".

Please join me in welcoming Filippo!

— 
Mark Bergsma 
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] SSL 3.0 disabled on Wikimedia sites

2014-10-17 Thread Mark Bergsma
Hi all,

Due to the POODLE vulnerability in SSL 3.0 that was announced this
week and has made its rounds through the media, we decided that we
needed to disable SSL 3.0 on all our HTTPS services today, to protect
the security of all our users. The bulk of that change was
deployed today at 15:00 UTC for the wikis, and the remaining HTTPS
services are getting the same treatment throughout the day. Please see
our blog post on this topic for details:


http://blog.wikimedia.org/2014/10/17/protecting-users-against-poodle-by-removing-ssl-3-0-support/
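
(Purely as an illustration of the change, and not how our HTTPS terminators 
are actually configured, the equivalent in a Python TLS server would look 
something like this:)

import ssl

# Illustrative only: accept TLS connections but refuse any SSL 3.0 handshake.
context = ssl.SSLContext(ssl.PROTOCOL_SSLv23)          # negotiate the best protocol...
context.options |= ssl.OP_NO_SSLv2 | ssl.OP_NO_SSLv3   # ...but never SSL 2.0 / 3.0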

If you see or hear about anyone having issues connecting to our sites
over HTTPS or logging in, please direct them to the link above, and
urge them to upgrade their software. Unfortunately, due to the nature
of HTTPS we're not able to provide a fallback when users get an error
message due to this. We're still looking into the possibility of
providing affected users with an informative error message upon login,
however, before they get redirected from HTTP to HTTPS.

As a side note, we also deployed Google's SCSV SSL extension[1] on
our servers yesterday, so that the attack surface for such
vulnerabilities will be reduced in the future for clients that
support this extension.

[1] 
http://googleonlinesecurity.blogspot.nl/2014/10/this-poodle-bites-exploiting-ssl-30.html

Thanks,

-- 
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Announcement: Yuvi Panda joins Ops

2014-11-06 Thread Mark Bergsma
Hi all,

I'm very pleased to announce that as of this week, Yuvi Panda is part of
the Wikimedia Technical Operations team, to work on our Wikimedia Labs
infrastructure. Yuvi originally joined the Wikimedia Foundation Mobile team
in December 2011, where he has been the lead developer for the original
Wikipedia App and its rewrite, amongst many other projects.

Besides his work in Mobile, Yuvi has been volunteering for Ops work in
Wikimedia Labs for a long time now. One of the notable examples of his work
is a seamlessly integrated Web proxy system that allows public web requests
from the Internet to be proxied to Labs instances on private IPs without
requiring public IP addresses for each instance. This very user friendly
system, which he built on top of NGINX, Lua, Redis, SQLite and the
OpenStack API, sees a lot of usage and has dramatically reduced the need
for Labs users to request (scarce) public IP address resources via a manual
approval process.

Another example of his work that has made a big difference is the
initiation of the Labs-Vagrant project: bringing the virtues of the
MediaWiki-Vagrant project to Wikimedia Labs, and allowing anyone to bring a
MediaWiki development environment up in Labs with great ease. More recently
Yuvi has been working on our much needed infrastructure in Labs for
monitoring metrics (Graphite) and service availability (Shinken). We expect
this will give us a lot more insight into the internals and availability of
software and services running in Wikimedia Labs and its many projects, and
we should be able to deploy it in Production as well.

Of course all of this work didn't go unnoticed, and about half a year ago
we asked Yuvi if he was interested in moving to Ops. With his extensive
development experience and his demonstrated ability to combine this with solid
Ops work to create stable and highly useful solutions, we think he's a
great fit for this role.

Yuvi recently had his visa application accepted, and is planning to move to
San Francisco in March 2015. Until then he will be working with us remotely
from India.

Please join me in congratulating Yuvi!

-- 
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Tweet of site outage

2015-02-05 Thread Mark Bergsma
Hi all,

We've indeed had a total site outage for roughly 30 minutes. We're still
collecting all data, but we've tracked down the cause to multiple cascading
issues including loss of power to a critical SPOF network switch and HHVM
MediaWiki application servers getting blocked due to multiple suboptimal
timeout settings. We'll post a full incident report soon, and work to
correct the underlying issues as soon as possible.

Our apologies,

On Thu, Feb 5, 2015 at 7:03 PM, Guillaume Paumier 
wrote:

> Hi,
>
> On Thursday, February 5, 2015 at 09:58:01, George Herbert wrote:
> > I saw a WMF tweet of a site outage (network?) around 9:30am Pacific
> time, by
> > the time I could check now things seem ok on en
>
> Sites are mostly back up but there are still issues with login, so the Ops
> team hasn't had time to write a postmortem yet.
>
> --
> Guillaume Paumier
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>



-- 
Mark Bergsma 
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Tweet of site outage

2015-02-05 Thread Mark Bergsma
The incident report is now posted on wikitech:

https://wikitech.wikimedia.org/wiki/Incident_documentation/20150205-SiteOutage

On Thu, Feb 5, 2015 at 7:57 PM, Mark Bergsma  wrote:

> Hi all,
>
> We've indeed had a total site outage for roughly 30 minutes. We're still
> collecting all data, but we've tracked down the cause to multiple cascading
> issues including loss of power to a critical SPOF network switch and HHVM
> MediaWiki application servers getting blocked due to multiple unoptimal
> timeout settings. We'll post a full incident report soon, and work to
> correct the underlying issues as soon as possible.
>
> Our apologies,
>
> On Thu, Feb 5, 2015 at 7:03 PM, Guillaume Paumier 
> wrote:
>
>> Hi,
>>
>> On Thursday, February 5, 2015 at 09:58:01, George Herbert wrote:
>> > I saw a WMF tweet of a site outage (network?) around 9:30am Pacific
>> time, by
>> > the time I could check now things seem ok on en
>>
>> Sites are mostly back up but there are still issues with login, so the Ops
>> team hasn't had time to write a postmortem yet.
>>
>> --
>> Guillaume Paumier
>>
>> _______
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
>
>
>
> --
> Mark Bergsma 
> Lead Operations Architect
> Director of Technical Operations
> Wikimedia Foundation
>



-- 
Mark Bergsma 
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Moritz Muehlenhoff joins as Ops Security Engineer

2015-04-02 Thread Mark Bergsma
Hi all,

I'm very pleased to announce that as of yesterday, Moritz
Mühlenhoff has joined the Ops team in the role of Operations
Security Engineer. We're excited, as for the first time we'll have an
engineer on our team able to focus on enhancing the security of our
infrastructure.

Some of you Debian users may recognize his name; in his spare time
he's very active in the Debian Security Team and sends out a large
portion of their security advisory mails. ;)

Moritz lives in Bremen, North Germany (internationally perhaps best
known for being the home of Beck's beer) with his spouse Silvia and
their 16 m/o son Tjark. Besides being a Debian Developer, he also very
much enjoys Rugby Union and plays tighthead prop in his local club
"Union 60 Bremen" in the third divison of Germany. He used to be a
frequent visitor of film festivals such as the San Sebastian festival,
but with the baby around home theatre has become more prevalent. :-)

Moritz is working with us remotely, and can usually be found using his
nick "jmm" on Freenode.

Please join me in welcoming Moritz to the team!

-- 
Mark Bergsma 
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Please welcome Jaime Crespo

2015-05-14 Thread Mark Bergsma
Hi all,

I'm very pleased to announce that we've recently hired Jaime Crespo as
Sr. Database Administrator. Jaime has joined the Technical Operations
team to strengthen our DBA capacity. He will be working closely with
Sean, and will join responsibility for our production database
infrastructure, the Wikimedia Labs replicas and the Analytics/research
databases. His addition to the team will also allow us to support our
developers better with code review and advice about database queries
and schema tuning.

Before he joined us, Jaime was a MySQL/MariaDB DBA consultant,
first at Percona and later as an independent contractor. In that role
he supported many database environments, large and small. Being a
fan of the free software and open data movements, Jaime is excited to
be employing his experience in such an environment.

Jaime lives in the Zaragoza area in Spain, and will be working with us
remotely from home. Outside of work, he is an active contributor to
the Spanish Wikipedia and the OpenStreetMap projects as well. His
other hobbies include photography, cycling, astronomy, reading and
acting in theater.

Jaime can be found on IRC under the nickname 'jynus'.

Please join me in welcoming him!

-- 
Mark Bergsma 
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Data center switch-over moving ahead next week: please stay available :)

2016-04-21 Thread Mark Bergsma
Hi everyone,

After we've been successfully serving our sites from our backup data-center
codfw (Dallas) for the past two days, we're now starting our switch back to
eqiad (Ashburn) as planned[1].

We've already moved cache traffic back to eqiad, and within the next few
minutes, we'll disable editing by going read-only for approximately 30
minutes - hopefully a bit faster than 2 days ago.

[1] http://blog.wikimedia.org/2016/04/11/wikimedia-failover-test/

On Tue, Apr 19, 2016 at 6:00 PM, Mark Bergsma  wrote:

> Hi all,
>
> Today the data center switch-over commenced as planned, and has just fully
> completed successfully. We are now serving our sites from codfw (Dallas,
> Texas) for the next 2 days if all stays well.
>
> We switched the wikis to read-only (editing disabled) at 14:02 UTC, and
> went back read-write at 14:48 UTC - a little longer than planned. While
> edits were possible then, unfortunately at that time Special:Recent Changes
> (and related change feeds) were not yet working due to an unexpected
> configuration problem with our Redis servers until 15:10 UTC, when we found
> and fixed the issue. The site has stayed up and available for readers
> throughout the entire migration.
>
> Overall the procedure was a success with few problems along the way.
> However we've also carefully kept track of any issues and delays we
> encountered for evaluation to improve and speed up the procedure, and
> reducing impact to our users - some of which will already be implemented
> for our switch back on Thursday.
>
> We're still expecting to find (possibly subtle) issues today, and would
> like everyone who notices anything to use the following channels to report
> them:
>
> 1. File a Phabricator issue with project #codfw-rollout
> 2. Report issues on IRC: Freenode channel #wikimedia-tech (if urgent)
> 3. Send an e-mail to the Operations list: o...@lists.wikimedia.org
>
> We're not done yet, but thanks to all who have helped so far. :-)
>
> Mark
>

-- 
Mark Bergsma 
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Data center switch-over moving ahead next week: please stay available :)

2016-04-21 Thread Mark Bergsma
We've just completed the switch back, and all services are running from our
main data center eqiad (Ashburn) again.

The process went very smoothly this time around. In the past two days leading
up to this, we've been able to either fix or work around the most important
issues we encountered on Tuesday. This meant that we had no real setbacks
or unanticipated delays today, and therefore were able to complete the most
time-pressing and user-impacting part (during which MediaWiki is read-only)
in 20 minutes, down from ~45 minutes two days ago.

However, we'll be doing this again in the future, and until then we'll work
on improving and further automating this process to get it down to
hopefully much lower levels of impact and duration.

Please let us know if you see any issues which may be caused by the
switch-over(s).

Thanks much to everyone involved!

Mark


On Thu, Apr 21, 2016 at 3:53 PM, Mark Bergsma  wrote:

> Hi everyone,
>
> After we've been successfully serving our sites from our backup
> data-center codfw (Dallas) for the past two days, we're now starting our
> switch back to eqiad (Ashburn) as planned[1].
>
> We've already moved cache traffic back to eqiad, and within the next
> minutes, we'll disable editing by going read-only for approximately 30
> minutes - hopefully a bit faster than 2 days ago.
>
> [1] http://blog.wikimedia.org/2016/04/11/wikimedia-failover-test/
>
> On Tue, Apr 19, 2016 at 6:00 PM, Mark Bergsma  wrote:
>
>> Hi all,
>>
>> Today the data center switch-over commenced as planned, and has just
>> fully completed successfully. We are now serving our sites from codfw
>> (Dallas, Texas) for the next 2 days if all stays well.
>>
>> We switched the wikis to read-only (editing disabled) at 14:02 UTC, and
>> went back read-write at 14:48 UTC - a little longer than planned. While
>> edits were possible then, unfortunately at that time Special:Recent Changes
>> (and related change feeds) were not yet working due to an unexpected
>> configuration problem with our Redis servers until 15:10 UTC, when we found
>> and fixed the issue. The site has stayed up and available for readers
>> throughout the entire migration.
>>
>> Overall the procedure was a success with few problems along the way.
>> However we've also carefully kept track of any issues and delays we
>> encountered for evaluation to improve and speed up the procedure, and
>> reducing impact to our users - some of which will already be implemented
>> for our switch back on Thursday.
>>
>> We're still expecting to find (possibly subtle) issues today, and would
>> like everyone who notices anything to use the following channels to report
>> them:
>>
>> 1. File a Phabricator issue with project #codfw-rollout
>> 2. Report issues on IRC: Freenode channel #wikimedia-tech (if urgent)
>> 3. Send an e-mail to the Operations list: o...@lists.wikimedia.org
>>
>> We're not done yet, but thanks to all who have helped so far. :-)
>>
>> Mark
>>
>
> --
> Mark Bergsma 
> Lead Operations Architect
> Director of Technical Operations
> Wikimedia Foundation
>



-- 
Mark Bergsma 
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] 2017-10-25 Scrum of Scrums meeting notes

2017-11-02 Thread Mark Bergsma
On Wed, Nov 1, 2017 at 9:31 AM, Federico Leva (Nemo) 
wrote:

> Thanks. "Procurement for Asia datacenter has started" is big news! Is
> energy efficiency a criterion for the product/supplier selection?
>
> (This needs not be something very complicated. For instance Dell asks a
> small surcharge if you want a more efficient PSU, IIRC. <
> http://www.dell.com/learn/uk/en/ukbsdt1/help-me-choose/hmc-
> power-supply-unit-12g?ref=CFG>)
>

We buy servers from two standard vendors (Dell and HP), and select the most
energy-efficient components (including PSUs) available for the task - not
only because it's better for the environment, but also because it allows us
to achieve higher rack density (more servers within the same space) and
therefore also saves costs over time.

These server configurations have been carefully sized, selected and
tested/measured in practice with actual workloads to provide optimal usage of
resources. For example, through consolidation and optimization of cache
clusters onto fewer, but somewhat higher-capacity new servers, we've reduced
the amount of equipment and power required to approximately 60%, also
shrinking data center space, supporting infrastructure and management needs in
the process.

-- 
Mark Bergsma 
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l