Antoine: yes :)

All: please email the QA mailing list with requests like this so we don't keep split-braining the conversation. I've done so now. (Also, please subscribe!)

(I've BCC'd Andrew O, assuming he's just the messenger here.)

Short version: Beta stability is always an ongoing goal of RelEng.

With that said: can you give specific examples of Beta Cluster being unstable *today*? Asking for "more reliable" is fine, but we need targets, goals, and filed bugs to work against. I fear much of the worry about Beta Cluster is due to the rocky transition to HHVM (which was less than ideal). We are now better equipped to deal with such changes in the future (and we are no longer experiencing HHVM-related issues, afaict).

We are also not planning to create yet another cluster at this time. Instead, we plan to use multiversion (hetdeploy) to deploy two versions on the current cluster: master, updating every 10 minutes as it does now, and a nightly, updating once per day. That isn't on the immediate roadmap (iow: more than a month out) and will wait until after the hardware procurement for WMF Labs infrastructure.

(NB: There is also the work to convert the load of if-statements that currently make the prod puppet code work on Beta Cluster to use Hiera instead. That is slightly orthogonal, though it will help improve stability as well.)

I've also updated the referenced wiki page accordingly:
https://www.mediawiki.org/w/index.php?title=Wikimedia_Release_Engineering_Team%2FStaging_Cluster&diff=1245809&oldid=1202735
(Sorry, that page should have had {{draft}} on it from the beginning.)

Relatedly, Jenkins/Zuul are suffering from some performance issues, and that is one of the reasons Antoine is taking a sabbatical from IRC (to address those issues; see that thread for more details). When those issues happen (eg: failed browser tests due to timeouts, which shouldn't be happening very often at all anymore), a common response is to incorrectly blame Beta Cluster. Instead, let's all file bugs when those issues occur so that A) we have a record of them and B) we can identify the root cause instead of assuming which piece is failing.

Thanks,
Greg

On Wed, Oct 29, 2014 at 11:14 AM, Antoine Musso <[email protected]> wrote:
> Hello Greg,
>
> You are probably in a better seat to reply to that SOS card related to beta
> stability.
>
> Should we bring it up on QA list to reach a wider audience?
>
> ---------- Forwarded message ----------
> From: "Andrew Otto" <[email protected]>
> Date: 29 Oct 2014 18:51
> Subject: # 147 more reliable Beta Labs for QA
> To: "Marc A. Pelletier" <[email protected]>, "Antoine Musso"
> <[email protected]>, <[email protected]>
> Cc:
>
>> Hi yalls,
>>
>> Ryan brought up this card at Scrum of Scrums today, and I have an action
>> item to ask you about it.
>>
>> https://wikimedia.mingle.thoughtworks.com/projects/scrum_of_scrums/cards/147
>>
>> https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Staging_Cluster
>>
>> Thoughts?
>>
>> Thanks!
>> -Ao

-- 
Greg Grossmeier
Release Team Manager

_______________________________________________
QA mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/qa
