RKemper added a comment.

  Alright, I had an initial meeting with Traffic team (Brandon & Valentin).
  
  Traffic team meeting summary
  ----------------------------
  
  Their primary concern was the potential impact on the ATS side of things: in 
a scenario where Blazegraph consistently takes a long time to respond, ATS would 
accumulate large numbers of dangling sockets, which could theoretically impact 
the rest of production infrastructure (MediaWiki, etc).
  
  This isn't necessarily a new problem; it sounds like it's a concern they've 
had with WDQS in general for some time now. One of the possibilities we 
discussed was to bypass the caching layer entirely and just use LVS: each of 
these net-new services, backed by a single backend host, would sit directly 
behind LVS, avoiding the ATS/caching layer entirely. That eliminates the 
concern around ATS but does introduce a few drawbacks:
  
  - (primary drawback) **We lose `requestctl`**, which is a tremendously useful 
tool when managing WDQS outages. We'd presumably be going back to the old way 
of doing things, manually banning at the nginx level when necessary.
  - Some extra latency would be introduced since we wouldn't be terminating TLS 
as close to the user. This probably isn't a huge deal; adding up to 100ms of 
latency on the user end likely wouldn't break existing use cases.
  - There are some changes to puppet automation, etc. that we'd have to make. 
It sounds like the main one is that TLS certs would have to go through 
acmechief rather than relying on the CDN. This creates some work on our (Search 
team) end in writing the corresponding puppet patch(es), but it wouldn't be a 
showstopper.
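
  To illustrate what "manually banning at the nginx level" could look like, 
here's a minimal sketch of an nginx fragment that rejects a misbehaving client 
by user agent. The `map` variable name, the UA pattern, and the backend port 
are all placeholders of my own, not taken from the actual WDQS nginx config:

  ```nginx
  # Hypothetical fragment (http context). Pattern and port are illustrative only.
  map $http_user_agent $wdqs_banned {
      default            0;
      "~*ExampleBadBot"  1;  # placeholder pattern for the offending client
  }

  server {
      listen 80;

      location /sparql {
          # Manual ban: reject matched clients before they reach Blazegraph.
          if ($wdqs_banned) {
              return 403;
          }
          proxy_pass http://127.0.0.1:9999;  # illustrative Blazegraph backend
      }
  }
  ```

  This is the kind of per-host, hand-edited rule that `requestctl` lets us 
avoid, which is why losing it is the primary drawback.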
  
  Of the above 3 drawbacks, the most painful is losing requestctl; it's a 
really great tool. But it might be a worthwhile tradeoff to entirely avoid the 
possibility of a misbehaving query service impacting non-WDQS production 
infrastructure like MediaWiki itself. I'd note that I'm not aware of us having 
specifically encountered that problem (ATS backing up and impacting the rest of 
prod) in previous WDQS outages, but it's also not something we were going out 
of our way to look for.
  
  So, we'll need to discuss amongst the Search team, see what the consensus is, 
and then bring the discussion back to the Traffic team for further feedback.
  
  Other context
  -------------
  
  The existing request flow for WDQS is `haproxy [traffic team manages certs] 
-> varnish -> ats -> envoy -> nginx -> blazegraph`. (This is from hastily 
transcribed notes, and I filled in the missing gaps on the right-hand side 
[nginx -> blazegraph], so I'll want to follow up and validate that the above 
flow is correct.)
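
  For comparison, a rough sketch of the two options side by side. The "bypass" 
line is my own interpretation of the LVS idea discussed above, not a flow the 
Traffic team has confirmed:

  ```
  current: user -> haproxy (TLS, Traffic-managed certs) -> varnish -> ats
                -> envoy -> nginx -> blazegraph
  bypass:  user -> LVS -> single backend host (TLS via acmechief)
                -> nginx -> blazegraph
  ```

  The bypass path never touches varnish/ATS, which is what removes the 
dangling-socket concern, at the cost of the drawbacks listed earlier.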
  
  As for how things would look after spinning up the new endpoints 
(sidestepping the question of whether to bypass the caching layer or not), 
`query.wikidata.org` would still get the vast majority of the traffic, with 
`wdqs-scholarly-articles` getting a few percent of total traffic at most. We 
expect actual usage of the new endpoints to be quite low - basically only the 
WDQS power users will try them out, at least initially - but since they'd still 
be production services exposed to the outside world, there is of course always 
the potential for a malicious attacker, hence the concerns about ATS getting 
backed up.

TASK DETAIL
  https://phabricator.wikimedia.org/T351650
