(TL;DR? Skip down three paragraphs to the possible workaround....) Last month, I reported on the progress of SHA-1 updates from the WMF servers, and noted that s1 replag was likely to continue to be a problem for a number of weeks. As I said then, the WMF was using (at least) three processes to populate the SHA-1 field on three separate blocks of revision records. All of these changes were then being replicated to the Toolserver's copies of the databases, and this flood of updates was causing the replag.
The three blocks were being populated at different rates (for reasons beyond my knowledge). On July 23 at about 15:00 UTC, rosemary (sql-s1-rr) completed updating the first of the three blocks. The other blocks continued to be populated (and at some point the WMF started another process to help finish off the slowest block), but the rate of updates dropped somewhat, and rosemary actually caught up on its backlog and reached zero replag within about a day of this milestone.

The situation on thyme (sql-s1-user) is less favorable, as we all know. The replag on that server climbed much higher to start with, and thyme didn't even reach the end of the first block until Sunday, August 5 at about 12:00 UTC. Unlike on rosemary, the reduced load after this event made no noticeable difference to the replag, which has continued to increase for the past three days at much the same rate as before. The next milestone will be completion of the second major block, which looks likely to occur either late on Friday, August 9 or early on Saturday, August 10 UTC, barring any other major problems (like the WMF server outage on Monday, which stopped replication at the TS end for several hours). At that point, the load from SHA-1 updates should be roughly 30% of what it was during July. One would think that would allow the replag to drop, but after the events of this week, I can't be confident of that.

There is a possible workaround. The TS could treat this like a server outage: copy the user databases from thyme to rosemary and then point sql-s1-user at rosemary, which currently has no replag. Rosemary would then have to handle twice the load, but thyme should start to recover very quickly with no user-generated queries hitting it. Once thyme has recovered, point sql-s1-rr at it.
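To make the "reduced load should let the replag drop" reasoning concrete, here is a toy back-of-the-envelope model. All figures in it are made-up illustrations, not measurements from thyme or rosemary: lag shrinks only if the replica can apply events faster than the master generates them, and the catch-up time depends entirely on that margin.

```python
def catchup_days(lag_days: float, write_rate: float, apply_rate: float) -> float:
    """Days needed for a replica to clear `lag_days` of backlog.

    write_rate: rate at which the master generates events;
    apply_rate: rate at which the replica can apply them
    (same units for both, e.g. events/sec; the units cancel out).
    Returns float('inf') if the replica can never catch up.
    """
    if apply_rate <= write_rate:
        return float("inf")          # backlog only grows
    backlog = lag_days * write_rate  # queued events, expressed in day-equivalents
    return backlog / (apply_rate - write_rate)

# Hypothetical: while SHA-1 updates saturate the replica, apply rate
# equals write rate and the 14-day lag never shrinks.
print(catchup_days(14, 100, 100))   # inf

# Hypothetical: if load drops to ~30% of the July level and the replica
# can then apply events ~1.4x faster than they arrive, the 14-day
# backlog still takes about a month to clear.
print(round(catchup_days(14, 30, 42), 1))   # 35.0
```

This is why a drop in update load doesn't immediately show up as falling replag: even with a healthy margin, a two-week backlog can take weeks more to drain, which matches what we've seen on thyme so far.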
Downsides:

(1) This would require several hours of downtime for sql-s1-user while the user databases are copied; all tools that require access to user databases would be offline entirely for this period.
(2) It would have to wait until our volunteer TS admins have time to do it.
(3) The added load on rosemary could cause replag to grow there, although I doubt it would come anywhere near the 14+ days of replag we are dealing with now on thyme.
(4) This could all be unnecessary, since thyme might recover on its own once the SHA-1 update load is reduced, although I don't know any way of forecasting that, and experience so far has not been encouraging.

Question for those of you who operate and/or use tools that access s1 (enwiki): would you be willing to accept several hours of service outage, and the other downsides, in exchange for getting rid of the 14-day replag?

-- 
Russell Blau
russb...@imapmail.org

_______________________________________________
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette