Re: [Wikitech-l] Optimizing the deployment train schedule
On 07/11/13 18:27, C. Scott Ananian wrote:
> It seems to me that having a gerrit (or other) page somewhere which lists exactly what is currently deployed where (and when the next scheduled deploy is) is a prerequisite for all of the more aggressive "let's mix up the set of wikis in early deploy" suggestions.

Whenever someone comes up with a script, I will be happy to integrate it so that its output is regenerated whenever a change is merged in the wmf branches. The resulting output can be hosted at https://integration.wikimedia.org/dashboard/

--
Antoine "hashar" Musso

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Optimizing the deployment train schedule
On Mon, Oct 21, 2013 at 4:36 PM, Jon Robson jdlrob...@gmail.com wrote:
> With mobile having just joined, the only feedback I can give so far is that it is confusing to know what is where, but I'm not quite sure how to improve that yet, other than having a gerrit page which tells me what is deployed everywhere, so I can check out the state of mediawiki.org or en.wiki when debugging issues.

It seems to me that having a gerrit (or other) page somewhere which lists exactly what is currently deployed where (and when the next scheduled deploy is) is a prerequisite for all of the more aggressive "let's mix up the set of wikis in early deploy" suggestions.

--scott

--
(http://cscott.net)
Re: [Wikitech-l] Optimizing the deployment train schedule
That would indeed be useful, C. Scott.

Actually, in my experience, the people who seem to care most about what is currently deployed where are product owners and designers, who are not usually technical. It would be good to give them an easy way to look this up, as I spend a lot of time debugging why something is not live...
Re: [Wikitech-l] Optimizing the deployment train schedule
Even us technical folks are often ignorant of the deeper ways of ops. When I last fixed a long-standing bug in the PHP parser with the potential to cause regressions in existing wikitext, it was not exactly trivial to keep track of where the code was currently live (and exactly when it went live) -- complicated by the fact that I was convinced that the code *wasn't* actually in production, despite all evidence to the contrary, because HTML Tidy was turned on in production and hid the beneficial effects of my patch.

That's just anecdotal evidence of the fact that making deployment/version info as obvious as possible can be useful even for ordinary bug fixers.

--scott

--
(http://cscott.net)
Re: [Wikitech-l] Optimizing the deployment train schedule
C. Scott Ananian wrote on 2013-11-07 at 12:27:06 -0500:
> It seems to me that having a gerrit (or other) page somewhere which lists exactly what is currently deployed where (and when the next scheduled deploy is) is a prerequisite for all of the more aggressive "let's mix up the set of wikis in early deploy" suggestions.

Best we have right now is a combination of:

* The wmfXX release notes pages (autogenerated with love by Reedy), e.g. https://www.mediawiki.org/wiki/MediaWiki_1.23/wmf2
* The "Included in" dropdown in Gerrit: e.g., go to https://gerrit.wikimedia.org/r/#/c/93980/ and click "Included in" to see a list of branches (master and wmf/1.23wmf3).

Now, you need to correlate 1.23wmf3 with https://www.mediawiki.org/wiki/MediaWiki_1.23/Roadmap#Schedule_for_the_deployments (which is updated by hand, mostly by Reedy and sometimes by me).

Yeah, not elegant at all.

Who wants to devote some time to making a nice purty dashboard for this info? :)

Greg

--
| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
| identi.ca: @greg                A18D 1138 8E47 FAC8 1C7D |
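As a starting point for the script discussed above, here is a minimal sketch of a "what is deployed where" report. Everything in it is an assumption made for illustration: the `DEPLOYED` mapping is a hypothetical hand-written snapshot (a real script would have to read the actual deployment configuration, whose format is not described in this thread), and the function names are invented here.

```python
from collections import defaultdict

# Hypothetical snapshot for illustration only: wiki name -> deployed branch.
# A real script would load this from the deployment configuration instead.
DEPLOYED = {
    "testwiki": "1.23wmf3",
    "mediawikiwiki": "1.23wmf3",
    "commonswiki": "1.23wmf2",
    "enwiki": "1.23wmf2",
}


def version_key(version):
    """Sort key for names like '1.23wmf3', so that wmf10 sorts after wmf2."""
    base, _, wmf_number = version.partition("wmf")
    return tuple(int(part) for part in base.split(".")), int(wmf_number)


def group_by_version(deployed):
    """Return {version: sorted wiki names}, newest version first."""
    groups = defaultdict(list)
    for wiki, version in deployed.items():
        groups[version].append(wiki)
    newest_first = sorted(groups, key=version_key, reverse=True)
    return {version: sorted(groups[version]) for version in newest_first}


def render_report(deployed):
    """Render one plain-text line per deployed version."""
    return "\n".join(
        f"{version}: {', '.join(wikis)}"
        for version, wikis in group_by_version(deployed).items()
    )


if __name__ == "__main__":
    print(render_report(DEPLOYED))
```

Grouping by version rather than by wiki keeps the report short, which may suit the non-technical readers mentioned earlier in the thread; the next scheduled deploy date would still have to come from the hand-maintained roadmap page.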
Re: [Wikitech-l] Optimizing the deployment train schedule
On 11/07/2013 10:07 AM, Greg Grossmeier wrote:
> Who wants to devote some time to making a nice purty dashboard for this info? :)

It's probably a bit late for this northernHemisphere(Winter), but... You can trust that someone will take it from here in the next 6 months, or you can write one paragraph and a related enhancement request in Bugzilla, and post it at https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects

Yes, I know you know but, you know... :)

--
Quim Gil
Technical Contributor Coordinator @ Wikimedia Foundation
http://www.mediawiki.org/wiki/User:Qgil
Re: [Wikitech-l] Optimizing the deployment train schedule
Greg Grossmeier wrote on 2013-10-19 at 14:43:38 -0700:
> tldr; I like a modified Option C, but also propose a very different Option D that I think would also be good, either now or as the next next step.

This Monday is a US Holiday, so no deploys that day. Seems like a reasonable week to start on the Option C modification (ie: move Monday's deploy to Tuesday). Let's do that.

I'd still like to move around the wikis in the various groups/phases, but that can wait (and will need to, as we need to see which ones want to move where).

Greg

--
| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
| identi.ca: @greg                A18D 1138 8E47 FAC8 1C7D |
Re: [Wikitech-l] Optimizing the deployment train schedule
On Sat, Oct 19, 2013 at 1:35 PM, Antoine Musso hashar+...@free.fr wrote:
> On 19/10/13 00:26, Erik Moeller wrote:
>> Are there other ways to optimize / issues I'm missing or misrepresenting above?
>
> Evil plan: deploy automatically on merge. But we are not ready yet :-]

We're not ready -- except in the beta cluster, where we are. The earlier that changes are merged to the master branch, the more time we have for scrutiny of those changes in beta labs, and the deployment there is in fact all automated and hands-off.

I still occasionally see code being merged to master very shortly before being deployed, which means that beta gets updated at about the same time as the test wikis, which occasionally causes surprises.

-Chris
Re: [Wikitech-l] Optimizing the deployment train schedule
On Fri, Oct 18, 2013 at 6:41 PM, MZMcBride z...@mzmcbride.com wrote:
> Though everyone who has commented here so far (myself included) doesn't deploy code or help fix the bugs that arise after (not directly, anyway). I'd be most interested to hear from Sam, Arthur, Max, Greg, et al. (the people on https://wikitech.wikimedia.org/wiki/Deployments) about how the deployments process(es) are working.

Personally, I do not have a strong opinion about this yet. The mobile web team just got on the deployment train two weeks ago (previously we managed our own weekly deployments that went out cluster-wide), so it feels too early for me to have a sense of what works well and what doesn't, as we're still working out some internal kinks and getting used to the new rhythm.

That said, 'option c' seems really sensible to me: it would be nice to have the extra working day to address issues that cropped up on the testwikis before pushing changes out to the non-wikipedia wikis.

--
Arthur Richards
Software Engineer, Mobile
[[User:Awjrichards]]
IRC: awjr
+1-415-839-6885 x6687
Re: [Wikitech-l] Optimizing the deployment train schedule
With mobile having just joined, the only feedback I can give so far is that it is confusing to know what is where, but I'm not quite sure how to improve that yet, other than having a gerrit page which tells me what is deployed everywhere, so I can check out the state of mediawiki.org or en.wiki when debugging issues.

In case it is useful, I drew a tube map after a quick chat with Greg to describe the deployment train process: https://commons.wikimedia.org/wiki/File:Deployment_train_tube_map_for_MediaWiki_2013-10-21_13-30.jpg

--
Jon Robson
http://jonrobson.me.uk
@rakugojon
Re: [Wikitech-l] Optimizing the deployment train schedule
On 2013-10-18 9:40 PM, Rob Lanphier ro...@wikimedia.org wrote:
> Hi Erik,
>
> I'm not a fan of removing one of the stages of our current deployments. More inline:
>
> On Fri, Oct 18, 2013 at 3:26 PM, Erik Moeller e...@wikimedia.org wrote:
>> Option B: No Monday deploy. This would mean we'd have to improve our testing process to catch issues affecting the non-Wikipedia wikis before they hit production. I personally think getting rid of the Monday deploy could create some _desirable_ pain that would act as a forcing function to improve pre-release test practices, rather than using production wikis to test. At the same time, we'd have a full week to work out the kinks we find in testing before they hit any production wiki, and could have a more systematic process of backing out changes if needed prior to deployment.
>
> The Monday deploy is where we catch load-based issues in a way that's not absolutely crushing. The cumulative traffic of the wikis is approximately 10% of our overall traffic, which is large enough to notice load-based problems, but small enough to make the difference between "hmm, we seem to have a load issue" and "oh crap, we just brought down the site".
>
> We also generally discover many more issues through getting it in front of more people, but not foisting it on everyone. It's not great that there are bugs that some people have to suffer through, but it's better than making all people suffer through them. We can change the mix of wikis so that it's not always the same set that's part of the pilot group (and maybe one day in the glorious future be able to do mixed versioning on a per-wiki basis so that people could opt in), but I'd rather not foist everything on everyone at once.
>
> Finally, another advantage of staging things this way is that we get some time to focus on non-Wikipedia sister project bugs before we deploy to Wikipedia. There are often project-specific bugs, and our test infrastructure isn't *nearly* built out enough to catch even the majority of them. If we deploy to all projects at once, we get hit with all of the bugs at once.
>
>> Option C: Shift Monday deploys to Tuesday. This would at least give us an additional work day to fix issues that have occurred in testing before they hit prod. I personally don't think this goes far enough, but might be a useful tweak to make if option B seems too problematic.
>
> I like this option. U.S. Holidays (and holidays observed by a significant chunk of key WMF employees) often fall on Monday, which means we often have to reschedule these for Tuesday anyway.
>
> Rob

Tuesdays are also nice, as that gives a day for bugs filed by a user on a weekend to be found/triaged by someone, and the correct person notified, before the next stage of deploy.

As a user I have vague memories of the site going down much more often in the past due to performance issues, which doesn't seem to happen anymore with the split-up deploy. Our ability to do effective load testing prior to a deploy is essentially zero other than reading code afaik, and I have yet to hear any proposals to change that. I don't think the pain points caused would actually get fixed. (Ok, I guess carefully comparing profiling data of the testwikis before and after deploy can reveal performance issues, but I still think one has to actually test with high load to see the high-load issues.)

-bawolff
Re: [Wikitech-l] Optimizing the deployment train schedule
On 19/10/13 00:26, Erik Moeller wrote:
> Are there other ways to optimize / issues I'm missing or misrepresenting above?

Hello,

As a summary, we deploy a new release in three stages spread over a one-week window, with the last stage of the previous window occurring on the same day as the first stage of the next window.

The three stages are:

1) test wikis (i.e. mediawiki)
2) non-wikipedias
3) wikipedias

The stages are scheduled as:

Thursday      window 1 stage 1
Monday        window 1 stage 2
Thursday+7    window 1 stage 3, window 2 stage 1
Monday        window 2 stage 2
...

What about doing all three stages the same day? We could take advantage of our 18 hours of presence from Europe to San Francisco. Hence we could go with something like:

 8:00 UTC (1am PDT): deploy on test wikis (Europe folks)
16:00 UTC (9am PDT): deploy on non-wikipedias (Europe, East Coast + SF)
20:00 UTC (1pm PDT): deploy on wikipedias (East Coast + SF)

European folks would catch issues appearing on test wikis, the non-wikipedias could be done with Europe+SF and the wikipedias by SF. We also have ops coverage over that whole time frame.

With such a system, we could keep deploying on Thursdays and Mondays, though we would then deploy two releases per week.

Evil plan: deploy automatically on merge. But we are not ready yet :-]

--
Antoine "hashar" Musso
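The overlapping weekly windows summarized above can be sketched with plain date arithmetic. This is only an illustration of the current three-stage schedule as described in this message; the starting date (Thursday 2013-10-17) and all names in the code are assumptions chosen for the example.

```python
from datetime import date, timedelta

# Stage names as listed in the summary above.
STAGES = {1: "test wikis", 2: "non-wikipedias", 3: "wikipedias"}
# Days after the window's first Thursday: Thursday, Monday, Thursday+7.
STAGE_OFFSET_DAYS = {1: 0, 2: 4, 3: 7}
THURSDAY = 3  # date.weekday() counts Monday as 0


def train_schedule(first_thursday, windows):
    """Return sorted (day, window, stage) tuples for consecutive weekly windows."""
    if first_thursday.weekday() != THURSDAY:
        raise ValueError("the first stage is expected to fall on a Thursday")
    events = []
    for window in range(1, windows + 1):
        start = first_thursday + timedelta(weeks=window - 1)
        for stage, offset in STAGE_OFFSET_DAYS.items():
            events.append((start + timedelta(days=offset), window, stage))
    return sorted(events)


if __name__ == "__main__":
    # Illustrative start date only; any Thursday works.
    for day, window, stage in train_schedule(date(2013, 10, 17), windows=2):
        print(f"{day:%a %Y-%m-%d}  window {window} stage {stage} ({STAGES[stage]})")
```

With two windows, the second Thursday carries both stage 3 of window 1 and stage 1 of window 2, which is the overlap the summary describes.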
Re: [Wikitech-l] Optimizing the deployment train schedule
Hi there,

tldr; I like a modified Option C, but also propose a very different Option D that I think would also be good, either now or as the next next step.

Erik Moeller wrote on 2013-10-18 at 15:26:16 -0700:

[snip overview of problem, combined with Robla's and you get a good picture of the issues.]

> == Some options ==
>
> Option A: Change nothing. I've not heard from enough folks to see if the problems above are widely perceived to _be_ problems. If the consensus is that current practice, for now, is the best possible approach, obviously we should stick with it.

I think this is a non-option, honestly. The current schedule has issues that can be resolved; let's try to resolve them.

> Option B: No Monday deploy. This would mean we'd have to improve our testing process to catch issues affecting the non-Wikipedia wikis before they hit production. I personally think getting rid of the Monday deploy could create some _desirable_ pain that would act as a forcing function to improve pre-release test practices, rather than using production wikis to test. At the same time, we'd have a full week to work out the kinks we find in testing before they hit any production wiki, and could have a more systematic process of backing out changes if needed prior to deployment.

Due to the concerns raised by Robla (and by me, in person), I'm not sure this is the right way to go next. It might be an option later when our cycle is a matter of a day or two, but not now with the week-long cycle.

> Option C: Shift Monday deploys to Tuesday. This would at least give us an additional work day to fix issues that have occurred in testing before they hit prod. I personally don't think this goes far enough, but might be a useful tweak to make if option B seems too problematic.

I like this option as a next step, but with a caveat/suggestion: we mix up the wikis in stage 0, 1, and 2.
And we should be open to changing the mix more frequently and based on community feedback (I know some are actually willing/wanting to join the fun of being earlier in the cycle...).

Until we have a way to gradually increase the % of users who are using the new wmf *cross wiki*, our only option is doing things per wiki, which gives you two conceptual options: a test/production split, and that's it, or a tiered system like the 3-tier one we have now.

I have two suggestions; a safe one and a less safe one (where 'safe' means 'easy to sell to people'):

1) the safe one: We move Monday's deploy to Tuesday. Let some wikis move into phase 1 from phase 2, and some move from phase 1 to phase 2 (but probably keep phase 0 the same unless some community is as crazy as mw.org's ;) ). This will give more agency to communities on their placement in the cycle while still giving us a more thorough load test on Tuesday after blatant issues are found on Thur/Fri.

2) the less safe one (Option D): We have a four-tiered system. tier0 on Mon, tier1 on Tue, tier2 on Wed, tier3 on Thurs, on Friday we rest (er, merge into master for Monday).

Ideal breakdown of user load (of total cross cluster) would be something like:

tier0:  5% (5% total)
tier1: 20% (25% total)
tier2: 30% (55% total)
tier3: 45% (100%)

This gives us: increasing load, with more measurable moments in time. What I mean by that is: With Ori's awesome new work (and planned work), we'll be able to make more sense of performance/load pre/post a deploy. We already look at 500s and similar logs, but those are lumped in with the 'apparent bugs' that are found right after a deploy (along with obvious "this button went missing" things).

With only a 3-tier system, where the first tier is basically so small that it is hard to tell signal from noise in pre/post deploy performance data, we still only get one chance to test load (tier1, non-wikipedias now) before going everywhere and potentially having downtime.
I argue/theorize that, with 3 deploys before we get to everywhere, we would be better able to spot performance issues.

Now, we probably can't do that idealized load distribution I lay out above. See: http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjectsOriginal.htm for the breakdown per project type. Also (for the Wikipedias' breakdown): http://stats.wikimedia.org/EN/TablesPageViewsMonthlyOriginalCombined.htm

[insert time where Greg goes off to sift through data]

Ok, I'm going to have to sit down with this data on Monday (this current naptime session won't be long enough) and come back with a proposed distribution. Simply: I'll try to hit the above idealized breakdown, but with these restrictions:

A) ENWP in tier3 (which is 44% by itself, using Sept'13 data);
B) for tiers 1 and 2, get a mix of project types (ie: include WPs, wikibooks, wiktionaries, etc in both); and
C) tier0 being only testwikis (and mw.org). But leave this open for others to join, if desired.

Other benefits of Option D:
* gets us accustomed to more frequent deploys.
* will provide some of that beneficial pain
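The idealized Option D breakdown above is simple cumulative arithmetic, and a short sketch makes the consequence of restriction A visible. The tier percentages and the 44% figure for ENWP are taken from this message; the names in the code are invented for illustration.

```python
from itertools import accumulate

# Idealized share of total user load per tier, from the Option D proposal.
TIER_SHARES = {"tier0": 5, "tier1": 20, "tier2": 30, "tier3": 45}
# Share of load attributed to ENWP alone (Sept '13 figure quoted above).
ENWP_SHARE = 44


def cumulative_load(shares):
    """Return {tier: % of total load running the new code once that tier is deployed}."""
    return dict(zip(shares, accumulate(shares.values())))


def headroom(shares, tier, fixed_share):
    """Return the % of load left in a tier after placing a fixed share in it."""
    return shares[tier] - fixed_share


if __name__ == "__main__":
    for tier, total in cumulative_load(TIER_SHARES).items():
        print(f"{tier}: {TIER_SHARES[tier]:>2}% of load, {total:>3}% cumulative")
    print(f"tier3 headroom besides ENWP: {headroom(TIER_SHARES, 'tier3', ENWP_SHARE)}%")
```

Under these numbers, placing ENWP in tier3 leaves only about 1% of load for any other wiki in that tier, so nearly all remaining projects would have to be spread across tiers 0 to 2.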
Re: [Wikitech-l] Optimizing the deployment train schedule
On 18 October 2013 15:26, Erik Moeller e...@wikimedia.org wrote:
> Hi folks, after speaking to a few folks, I'd like to check in on the WMF deployment train schedule overall, and see if there are ways to optimize it.

[Snip]

I think Option B is a good option, and agree that it's a good thing that it forces us to have more discipline in the code that goes out in terms of testing/scaling/happiness, rather than spotting issues in production and using the sister projects as guinea pigs.

I'd note that this is effectively what we've had with VisualEditor since the beginning of deployment train releases in May last year (before the switch to weekly releases): because VE isn't deployed to any of the sister projects, we go live each Thursday to phase 0, and with the previous version to phase 2; no wikis that get new code on Monday currently have VE enabled.

J.

--
James D. Forrester
Product Manager, VisualEditor
Wikimedia Foundation, Inc.
jforres...@wikimedia.org | @jdforrester
Re: [Wikitech-l] Optimizing the deployment train schedule
Hi Erik,

I'm not a fan of removing one of the stages of our current deployments. More inline:

On Fri, Oct 18, 2013 at 3:26 PM, Erik Moeller e...@wikimedia.org wrote:
> Option B: No Monday deploy. This would mean we'd have to improve our testing process to catch issues affecting the non-Wikipedia wikis before they hit production. I personally think getting rid of the Monday deploy could create some _desirable_ pain that would act as a forcing function to improve pre-release test practices, rather than using production wikis to test. At the same time, we'd have a full week to work out the kinks we find in testing before they hit any production wiki, and could have a more systematic process of backing out changes if needed prior to deployment.

The Monday deploy is where we catch load-based issues in a way that's not absolutely crushing. The cumulative traffic of the wikis is approximately 10% of our overall traffic, which is large enough to notice load-based problems, but small enough to make the difference between "hmm, we seem to have a load issue" and "oh crap, we just brought down the site".

We also generally discover many more issues through getting it in front of more people, but not foisting it on everyone. It's not great that there are bugs that some people have to suffer through, but it's better than making all people suffer through them. We can change the mix of wikis so that it's not always the same set that's part of the pilot group (and maybe one day in the glorious future be able to do mixed versioning on a per-wiki basis so that people could opt in), but I'd rather not foist everything on everyone at once.

Finally, another advantage of staging things this way is that we get some time to focus on non-Wikipedia sister project bugs before we deploy to Wikipedia. There are often project-specific bugs, and our test infrastructure isn't *nearly* built out enough to catch even the majority of them. If we deploy to all projects at once, we get hit with all of the bugs at once.

> Option C: Shift Monday deploys to Tuesday. This would at least give us an additional work day to fix issues that have occurred in testing before they hit prod. I personally don't think this goes far enough, but might be a useful tweak to make if option B seems too problematic.

I like this option. U.S. Holidays (and holidays observed by a significant chunk of key WMF employees) often fall on Monday, which means we often have to reschedule these for Tuesday anyway.

Rob
Re: [Wikitech-l] Optimizing the deployment train schedule
Rob Lanphier wrote:
>> Option C: Shift Monday deploys to Tuesday. This would at least give us an additional work day to fix issues that have occurred in testing before they hit prod. I personally don't think this goes far enough, but might be a useful tweak to make if option B seems too problematic.
>
> I like this option. U.S. Holidays (and holidays observed by a significant chunk of key WMF employees) often fall on Monday, which means we often have to reschedule these for Tuesday anyway.

I agree.

Though everyone who has commented here so far (myself included) doesn't deploy code or help fix the bugs that arise after (not directly, anyway). I'd be most interested to hear from Sam, Arthur, Max, Greg, et al. (the people on https://wikitech.wikimedia.org/wiki/Deployments) about how the deployments process(es) are working.

MZMcBride
Re: [Wikitech-l] Optimizing the deployment train schedule
On Fri, Oct 18, 2013 at 6:41 PM, MZMcBride z...@mzmcbride.com wrote:
> Though everyone who has commented here so far (myself included) doesn't deploy code or help fix the bugs that arise after (not directly, anyway). I'd be most interested to hear from Sam, Arthur, Max, Greg, et al. (the people on https://wikitech.wikimedia.org/wiki/Deployments) about how the deployments process(es) are working.

I deploy, both extensions and, in rare cases, the train. I would definitely vote for A or C. Although I'd like to think that option B would force better pre-cluster testing, I think a lot of that desirable pain would be entirely focused on the person doing the deploy (or the rest of the platform team, who get pulled in when things go really bad), and not on the developer/team who caused the issue.