[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

--- Comment #23 from MZMcBride b...@mzmcbride.com ---
(In reply to comment #22)
> Change 77975 merged by BBlack: Add ganglia monitoring for vhtcpd. https://gerrit.wikimedia.org/r/77975

With this changeset now merged, I'm a little unclear what's still needed to mark this bug as resolved/fixed. Brian W.: can you clarify?

--
You are receiving this mail because: You are on the CC list for the bug.
___
Wikibugs-l mailing list Wikibugs-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

Bawolff (Brian Wolff) bawolff...@gmail.com changed:

What       | Removed         | Added
Status     | PATCH_TO_REVIEW | RESOLVED
Resolution | ---             | FIXED
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

--- Comment #24 from Bawolff (Brian Wolff) bawolff...@gmail.com ---
(In reply to comment #23)
> (In reply to comment #22)
> > Change 77975 merged by BBlack: Add ganglia monitoring for vhtcpd. https://gerrit.wikimedia.org/r/77975
>
> With this changeset now merged, I'm a little unclear what's still needed to mark this bug as resolved/fixed. Brian W.: can you clarify?

I'm going to call this bug closed for now. There are pretty graphs at http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=vhtcpd_inpkts_dequeued&s=by+name&c=Upload+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4

One thing we were talking about earlier was doing actual tests, where a script either looks at recent re-uploads and checks that the purge succeeded, or purges specific things and checks that the purge worked. We could do that later if this monitoring turns out not to be enough.
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449 --- Comment #22 from Gerrit Notification Bot gerritad...@wikimedia.org --- Change 77975 merged by BBlack: Add ganglia monitoring for vhtcpd. https://gerrit.wikimedia.org/r/77975
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449 --- Comment #20 from Bryan Davis bda...@wikimedia.org --- There is a patch to set up ganglia monitoring for vhtcpd in gerrit 77975.
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449 --- Comment #13 from Rob Lanphier ro...@wikimedia.org --- Brandon, are you actually building proper monitoring into this daemon, or do we need to start separate work? I remember Mark making the case that this could be done within Varnish, but I'm still kinda confused as to how we can actually do effective monitoring of Varnish purging from within Varnish. Bug 49362 is an example of a bug that would be great to have proper monitoring for.
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

--- Comment #14 from Brandon Black bbl...@wikimedia.org ---
The daemon logs some stats to a file, which we could pick up and graph (but currently do not, yet). These would basically give you the rate of multicast purge requests the daemon's receiving and whether it's failing to process any of them due to some large-impact bug that's overflowing the queue.

The larger issue that makes that relatively ineffective is that the requests arrive over multicast, which is an unreliable protocol by design. They could be lost in the sender's output buffers, anywhere in the network, or discarded at the receiving cache (local buffering issues) and we'd have no indication that was happening.

Upgrading from multicast is also an expensive proposition in terms of complexity (after all, the reason we're using it is that it's simple and efficient). We've thrown around some ideas about replacing multicast with http://en.wikipedia.org/wiki/Pragmatic_General_Multicast , likely using http://zeromq.org/ as the communications abstraction layer, as a solution to the unreliability of multicast. This would basically give us a reliable sequence-number system with retransmission that's handled at that layer. That means adding zeromq support to the PHP that sends the purge requests, adding it to vhtcpd, and most likely also building out a redundant, co-operating set of middleboxes as publish/subscribe multiplexers.

I'm not fond of going down this path unless we really see a strong need to upgrade from multicast, though. It smells of too much complexity for the problem we're trying to solve, and/or that there may be a better mechanism for this if we re-think how purging is being accomplished in general. In any case, I think that would all be outside the scope of this ticket.
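To make concrete what a sequence-number scheme would add over plain multicast, here is a minimal receiver-side sketch of loss detection. It assumes a single sender that numbers its purges 0, 1, 2, ...; the class name `GapTracker` is hypothetical, and nothing here reflects how vhtcpd, PGM or ZeroMQ actually behave.

```python
class GapTracker:
    """Counts purge messages that never arrived, based on sequence numbers.

    Hypothetical illustration: assumes one sender that numbers its purges
    0, 1, 2, ... Plain HTCP over multicast gives the receiver no such number.
    """

    def __init__(self):
        self.next_expected = None  # sequence number we expect to see next
        self.received = 0          # distinct messages seen so far
        self.lost = 0              # messages skipped and still missing
        self.missing = set()       # the skipped sequence numbers themselves

    def observe(self, seq):
        if self.next_expected is None or seq == self.next_expected:
            # First message, or exactly the one we were waiting for.
            self.next_expected = seq + 1
        elif seq > self.next_expected:
            # Everything between next_expected and seq was skipped.
            skipped = range(self.next_expected, seq)
            self.missing.update(skipped)
            self.lost += len(skipped)
            self.next_expected = seq + 1
        elif seq in self.missing:
            # A late (reordered or retransmitted) message fills a gap.
            self.missing.discard(seq)
            self.lost -= 1
        else:
            # Duplicate of something already counted; ignore it.
            return
        self.received += 1
```

With counters like `lost` exported alongside the daemon's existing stats, silent loss in the network would at least show up on a graph, which is exactly what plain multicast cannot offer.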
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

--- Comment #15 from Bawolff (Brian Wolff) bawolff...@gmail.com ---
I think monitoring should happen outside of the daemon (since, as you say, there's a limit to what we can do with an unreliable protocol). What I would suggest is some script (perhaps even living on the tools lab) that does the following:

* Find the 10 most recent overwritten files. Get the thumbnail sizes that would be on the image description page, along with the original file asset (from both the Europe and North America varnishes). Look at the Age header. If the Age header is longer than the time between the last re-upload and now, yell.
* Pick a test file at random. Request the file at some random size. Do ?action=purge. Sleep for about 10 seconds. Request the file again. Check that the Age header is either not present or no more than about 10 seconds.
* For good measure, pick a popular page like [[Wikipedia:Village pump (Technical)]] (also some redirect page like [[WP:VPT]]). Request the page. Check that the Age header is less than the time between now and the last edit. (Or at least for the redirect case, make sure that the difference isn't super big, to give some leeway for the job queue.)
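The comparison at the heart of the first check can be sketched as below. This is only a rough illustration: the function names and the `slack` parameter are hypothetical, and fetching the headers and the re-upload timestamp is left to the caller.

```python
def parse_age(header_value):
    """Return the Age header as whole seconds, or None if absent/unparseable."""
    if header_value is None:
        return None
    try:
        age = int(header_value.strip())
    except ValueError:
        return None
    return age if age >= 0 else None


def cached_copy_is_stale(age_header, seconds_since_reupload, slack=0):
    """True if the cache has held the object since before the re-upload.

    An Age larger than the time since the overwrite means the cached copy
    predates it, so the purge apparently never reached this cache.
    `slack` (seconds) is a hypothetical allowance for clock skew.
    """
    age = parse_age(age_header)
    if age is None:
        return False  # no usable Age header: nothing to conclude
    return age > seconds_since_reupload + slack
```

For example, a thumbnail served with `Age: 7200` for a file overwritten ten minutes ago would be flagged, while `Age: 30` would not.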
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

--- Comment #16 from Brandon Black bbl...@wikimedia.org ---
How would one "Find the 10 most recent overwritten files" reliably/efficiently?

Most of these solutions you're suggesting seem to give us some probabilistic idea that things are working, but really solve the problem if a random small percentage of purges are being lost in the pipe somewhere. They'd have to run at pretty high rates to even catch singular failed elements (one varnish not receiving purges, which may or may not have already cached the test file, which you may or may not hit with your check).
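The point about check rates can be put in rough numbers. The sketch below rests on simplifying assumptions that are not from the thread: each check lands on a uniformly random cache, and a broken cache always gives itself away when hit. Real detection odds would be lower, since the broken cache may not have the test file cached at all.

```python
import math


def detection_probability(num_caches, num_checks, broken=1):
    """Chance that at least one of `num_checks` requests lands on a broken cache.

    Simplifying assumptions (not from the thread): every check hits a
    uniformly random cache, and a broken cache always reveals itself.
    """
    miss = 1 - broken / num_caches
    return 1 - miss ** num_checks


def checks_needed(num_caches, confidence, broken=1):
    """Smallest number of checks whose detection probability reaches
    `confidence`, where 0 < confidence < 1."""
    miss = 1 - broken / num_caches
    if miss == 0:
        return 1  # every cache is broken, so a single check suffices
    return math.ceil(math.log(1 - confidence) / math.log(miss))
```

Under those assumptions, with a hypothetical pool of 8 caches, four checks catch a single broken cache only about 41% of the time, and it takes 23 checks to reach 95%.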
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449 --- Comment #17 from Brandon Black bbl...@wikimedia.org --- Sorry, I meant to say "..., but really DON'T solve the problem ..."
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

--- Comment #18 from Bawolff (Brian Wolff) bawolff...@gmail.com ---
(In reply to comment #16)
> How would one "Find the 10 most recent overwritten files" reliably/efficiently?
>
> Most of these solutions you're suggesting seem to give us some probabilistic idea that things are working, but really solve the problem if a random small percentage of purges are being lost in the pipe somewhere. They'd have to run at pretty high rates to even catch singular failed elements (one varnish not receiving purges, which may or may not have already cached the test file, which you may or may not hit with your check)

Very true. However, I'm more concerned with mass failures (the type of thing where doing this once every 6 hours would be sufficient). Massive failures of the purging system have happened several times in the past. Monitoring for this type of failure is, I think, important. (Fine-grained monitoring would be cool too, but seems more difficult.)
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

--- Comment #19 from Bawolff (Brian Wolff) bawolff...@gmail.com ---
> How would one "Find the 10 most recent overwritten files" reliably/efficiently?

Via db query (or api):

    select log_title from logging
    where log_type = 'upload' and log_action = 'overwrite'
    order by log_timestamp DESC limit 10;
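For the API route, a request along these lines should return the same information. This is a sketch that assumes the standard `list=logevents` module with its `leaction`, `leprop` and `lelimit` parameters as I understand them; the helper name and the base URL passed in are placeholders.

```python
from urllib.parse import urlencode


def recent_overwrites_url(api_base, limit=10):
    """Build a MediaWiki API URL listing the most recent file overwrites.

    Mirrors the SQL above: log_type = 'upload', log_action = 'overwrite',
    newest first (assumed to be the default order for list=logevents).
    """
    params = {
        "action": "query",
        "list": "logevents",
        "leaction": "upload/overwrite",  # log type and action in one filter
        "leprop": "title|timestamp",
        "lelimit": str(limit),
        "format": "json",
    }
    return api_base + "?" + urlencode(params)
```

Going through the API rather than the database means the script could run from anywhere, including the tools lab, without needing replica access.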
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449 --- Comment #11 from Andre Klapper aklap...@wikimedia.org --- Brandon: Were the last 5 weeks enough time to judge whether it's stable enough? (Is this bug report fixed?)
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449 --- Comment #12 from Brandon Black bbl...@wikimedia.org --- Yes, I think so, although I just fixed a bug in the software yesterday. Still, it's a significant improvement and we've un-deployed the previous software. May as well close this bug and then open further ones as warranted for further changes to our purging architecture? The title of the bug doesn't precisely correlate with what ended up happening anyways (monitoring the success rate, which we still can't really do, and won't ever be able to do with any real accuracy so long as it's plain multicast).
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

--- Comment #10 from Andre Klapper aklap...@wikimedia.org ---
(In reply to comment #9 by Brandon Black)
> The replacement daemon was deployed to production today. The initial deployment is just a minimum-change swap of the two pieces of software. Further enhancements (to performance, and logging of stats to spot multicast loss) will come once this has had a little time to stabilize without any loud complaints of being worse than before.

Great! Is it already possible to judge whether it's stable enough?
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

Ryan Kaldari rkald...@wikimedia.org changed:

What     | Removed | Added
See Also |         | https://bugzilla.wikimedia.org/show_bug.cgi?id=48927
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449 --- Comment #9 from Brandon Black bbl...@wikimedia.org --- The replacement daemon was deployed to production today. The initial deployment is just a minimum-change swap of the two pieces of software. Further enhancements (to performance, and logging of stats to spot multicast loss) will come once this has had a little time to stabilize without any loud complaints of being worse than before.
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449 --- Comment #7 from Andre Klapper aklap...@wikimedia.org --- Brandon: Is there any news / progress to share yet?
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

Brandon Black bbl...@wikimedia.org changed:

What   | Removed | Added
Status | NEW     | ASSIGNED

--- Comment #8 from Brandon Black bbl...@wikimedia.org ---
Yes, I've been implementing a replacement for varnishhtcpd. You can see the evolving initial version at the changeset here: https://gerrit.wikimedia.org/r/#/c/60390/ . I hope to be able to test it in prod in the next few days, and it shouldn't suffer from the perf/loss bugs of the previous implementation. Stats output still needs implementation there as well (for monitoring the daemon's own reliability as well as other issues like loss of multicast delivery), but we'd rather put the stats work in the fresh new code than attach to the known-failing code.
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

Greg Grossmeier g...@wikimedia.org changed:

What     | Removed           | Added
Assignee | m...@nedworks.org | bbl...@wikimedia.org

--- Comment #6 from Greg Grossmeier g...@wikimedia.org ---
Assigning to Brandon per Roadmap Updates meeting and email thread.
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

Rob Lanphier ro...@wikimedia.org changed:

What     | Removed | Added
Priority | Normal  | High
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449 --- Comment #5 from Andre Klapper aklap...@wikimedia.org --- RT #4607
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

Andre Klapper aklap...@wikimedia.org changed:

What     | Removed             | Added
Assignee | ct...@wikimedia.org | m...@nedworks.org

--- Comment #4 from Andre Klapper aklap...@wikimedia.org ---
Assigning to Mark as just discussed in the Ops/Platform meeting here in SF.
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

Rob Lanphier ro...@wikimedia.org changed:

What     | Removed                        | Added
Assignee | wikibugs-l@lists.wikimedia.org | ct...@wikimedia.org

--- Comment #3 from Rob Lanphier ro...@wikimedia.org ---
I spoke with CT about this, and he's going to talk to his team about what can be done here.
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

Richard Guk richardg...@yahoo.com changed:

What | Removed | Added
CC   |         | richardg...@yahoo.com
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

--- Comment #2 from Andre Klapper aklap...@wikimedia.org ---
For the record, in case somebody considers working on this:

<TimStarling> andre__: just purge a URL, request it, and check its Age header
<TimStarling> it should be less than some threshold
<TimStarling> http://tools.ietf.org/rfcmarkup?doc=2616#section-5.1.2
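The purge, request, check sequence suggested above might look roughly like this. It is a sketch only: `purge` and `fetch_age` are hypothetical caller-supplied callables (how the purge is triggered and how the header is fetched are deliberately left open), and the 10 and 30 second defaults are illustrative values, not anything agreed in this bug.

```python
import time


def purge_took_effect(url, purge, fetch_age,
                      settle_seconds=10, threshold=30, wait=time.sleep):
    """Purge `url`, wait, re-request it, and check that Age is below `threshold`.

    `purge(url)` and `fetch_age(url)` are caller-supplied callables
    (hypothetical names): the first triggers the purge, the second returns
    the Age response header in seconds, or None if the header is absent.
    """
    purge(url)
    wait(settle_seconds)  # give the purge time to reach the caches
    age = fetch_age(url)
    # A missing Age header is treated as a fresh (uncached) response.
    return age is None or age < threshold
```

Keeping the network calls behind injected callables means the decision logic can be exercised with fakes, and the same function works whether the purge goes through ?action=purge or some other mechanism.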
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

Andre Klapper aklap...@wikimedia.org changed:

What | Removed | Added
CC   |         | aklap...@wikimedia.org

--- Comment #1 from Andre Klapper aklap...@wikimedia.org ---
FYI, posted on ops@ by Tim Starling six hours ago:

"There is a nagios check to make sure varnishhtcpd is working, but it only checks to see if the process is still running, it doesn't check to see if it is actually working."
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

MZMcBride b...@mzmcbride.com changed:

What | Removed | Added
CC   |         | b...@mzmcbride.com
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

Nemo federicol...@tiscali.it changed:

What     | Removed | Added
See Also |         | https://bugzilla.wikimedia.org/show_bug.cgi?id=41130
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

Nemo federicol...@tiscali.it changed:

What | Removed | Added
CC   |         | afeld...@wikimedia.org, federicol...@tiscali.it
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

Andre Klapper aklap...@wikimedia.org changed:

What     | Removed       | Added
Priority | Unprioritized | Normal
Severity | normal        | enhancement
[Bug 43449] Monitor effectiveness of HTCP purging
https://bugzilla.wikimedia.org/show_bug.cgi?id=43449

Marco maic...@yahoo.com changed:

What | Removed | Added
CC   |         | maic...@yahoo.com