Hi,

On 11/18/24 11:38 AM, MOESSBAUER, Felix wrote:
> On Mon, 2024-11-18 at 10:35 +0100, Linus Nordberg wrote:
>> Hi all,
>>
>> Snapshot is behind Fastly since Sunday Nov 17 2024. I think that's bad
>> and would like to change that. It's bad in the short term since we
>> expose user data to a third party. It's bad in the long term since the
>> short term bad won't go away until we learn how to deal with web
>> traffic.
>
> That's a trade off between the advantages of a CDN and privacy.
> For me as snapshot user that needs it to build reproducible things in
> CI systems, the most important aspect is reliability and performance.
That's also how I see it. We need a way to ban entire ASes from Debian
infrastructure as long as they keep sending abusive requests from a very
large number of IP addresses. While I think we should make sure that we
can keep up with a high volume of requests (which probably requires
pgbouncer and some other fixes), serving the ridiculous amount of
scraping sent by Tencent without coordination or backoff is not helpful.
I hacked together something to collect data from BGP and could have put
the results into an ipset to block on - but Fastly made that
ridiculously easy. Tencent had also taken aim at snapshot-master
(sallinen) the day before, which was easy to shield off.

Note that a lot of traffic to snapshot is plain HTTP - and it traverses
the world to reach the target host - so the privacy guarantees are
already very weak. We are also not serving user data here, only known
public bits.

>> I have not been able to solve the problem with more incoming HTTP
>> traffic than what the snapshot setup comfortably can deal with.
>> Partly because I'm not very knowledgeable in this field and partly
>> because I have not been given enough access to the cache layer(s).

My hope is that with Fastly in the path it's easier to open up that log.
Technically we still have a mix of Fastly and whatever goes to
snapshot-mlm-01 directly, but maybe that is fine.

> I also had a look at this topic (mostly based on code-review) and
> identified a couple of problems:
>
> 1. apt behaves badly on 429 TooManyRequests. Addressed in [1]

I think investments into apt's retry logic are the most important part.
Individual failures should be retried sensibly, as we cannot guarantee a
100% success rate.

> 2. Expensive redirects to farm (DB lookup!) are cached too short.
> Addressed in [2], also affected by [3]

Varnish has been behaving really badly with the "file" backend. We kept
shooting down objects continuously.
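To put a number on that eviction churn - a rough sketch only, assuming
the JSON layout that `varnishstat -j` emits (Varnish 6.5+ nests the
counters under a top-level "counters" object; `MAIN.n_lru_nuked` is the
counter of objects forcefully evicted to make room) - one can sample the
counter twice and compute the nuke rate:

```python
import json

def lru_nuke_rate(sample_a: str, sample_b: str, interval_s: float) -> float:
    """Objects nuked per second between two `varnishstat -j` samples.

    A persistently high rate suggests the storage is undersized for
    the working set, i.e. Varnish keeps shooting down objects.
    """
    def nuked(raw: str) -> int:
        doc = json.loads(raw)
        # Varnish 6.5+ nests counters under "counters"; older
        # versions put them at the top level.
        counters = doc.get("counters", doc)
        return int(counters["MAIN.n_lru_nuked"]["value"])

    return (nuked(sample_b) - nuked(sample_a)) / interval_s

# Hypothetical samples, as if taken from `varnishstat -j` 60 s apart:
a = '{"counters": {"MAIN.n_lru_nuked": {"value": 120000}}}'
b = '{"counters": {"MAIN.n_lru_nuked": {"value": 126000}}}'
print(lru_nuke_rate(a, b, 60.0))  # 100 nukes/s: the cache is churning
```

If that rate sits near zero while the hit rate is still low, the problem
is more likely low reuse of objects than undersized storage.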
It looks like the backend is effectively unsupported; everyone is
supposed to use Varnish Enterprise with the corresponding storage engine
if they want any notion of persistence. Varnish Open Source implements
cache cleanup on the hot path, so you need to do a lot of manual sizing
of the cache to ensure that you have free slots for your objects. This
became impossible with too many active elements in the cache, compounded
by us caching both small and large objects in the same store. I'm not
necessarily convinced that we have a high cache hit rate here, beyond a
few "repositories" that people are using for hermetic builds (e.g. there
are a lot of requests with a bazel User-Agent fetching pinned, versioned
repositories).

> 3. Varnish internal redirect to farm not working [4], unfortunately
> reverted due to not working properly in prod setup

I hope to get to retrying the change today, on !mlm-01. I made the
mistake of testing the change on the production machine, which left me
unable to debug much after that. I was personally annoyed by the browser
not downloading the correct filenames. :)

> [1] https://salsa.debian.org/apt-team/apt/-/merge_requests/383
> [2] https://salsa.debian.org/snapshot-team/snapshot/-/merge_requests/23
> [3] https://salsa.debian.org/dsa-team/mirror/dsa-puppet/-/commit/63f16e08199040871752135df533f0001fe537fb
> [4] https://lists.debian.org/debian-snapshot/2024/11/msg00008.html

>> DSA have legitimate concerns about exposing user data to people who
>> do not need access to it. Would it help if my relation to Debian was
>> formalised further than the current status of Debian Contributor?

> I'm just a DM, but I definitely want to help improving the situation.

>> More generally, I sometimes find it hard to understand the roles and
>> responsibilities wrt the snapshot service.
>> This results in me on the one hand being overly cautious with asking
>> for some things and on the other hand sometimes pestering the wrong
>> people, most probably also in the wrong way. It would be good to
>> minimise unnecessary frustration and lost calendar time.

> Same! It took me quite some time to get an understanding of the
> overall architecture of s.d.o with all its layers. Also I don't know
> who is responsible for the intermediate infrastructure (basically
> everything between the s.d.o flask app and the DNS entry s.d.o).

It should be simpler. It's a bit of a Rube Goldberg machine when you
have multiple caching/proxying/rate limiting layers. I jumped in because
it looked like some more attention/bandwidth was needed temporarily.
(And I temporarily had some more time on my hands to help out.)

I cannot speak for DSA just yet - but in general the delineation is that
DSA wires up the web setup to serve things and the remainder is on the
service owner. Of course here we have a ton of components in Puppet
(haproxy/varnish/iptables/apache) that carry snapshot-specific
configuration bits. That means that any change to the outer
infrastructure requires time from a DSA member to test and deploy the
change.

IMO the most important concern around granting more access is privilege
escalation - i.e. whether a configuration change can influence the
machine's config. We have delegated the power to change apache2 configs
to service owners, for instance - the config changes are versioned and
then apache2 is reloaded. apache2 also does not crash on an invalid
config - although it will no longer start properly. varnish as handled
by Puppet is restarted, not reloaded - and thus will fail to start when
the VCL is broken.

> I further can only guess where exactly the bottlenecks are. These
> obviously depend on the usage patterns which I (for good reasons) do
> not have insights into.

Munin is also not super helpful to visualize this data.
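As an illustration of the kind of summary I mean - a sketch only, with a
made-up per-request record format of (status code, latency in
milliseconds), not any log format we actually have - this is roughly
what I'd want to see plotted over time:

```python
from collections import Counter
from statistics import quantiles

def summarize(requests: list[tuple[int, float]]) -> dict:
    """Reduce (status_code, latency_ms) records to dashboard numbers:
    latency percentiles and the status-code mix."""
    latencies = sorted(ms for _, ms in requests)
    # quantiles() with n=100 yields the 1st..99th percentile cut points.
    cuts = quantiles(latencies, n=100)
    return {
        "p50_ms": cuts[49],
        "p99_ms": cuts[98],
        "status": Counter(code for code, _ in requests),
    }

# Hypothetical sample: mostly fast 200s, a few slow 429s and a 503.
sample = [(200, 30.0)] * 97 + [(429, 250.0), (429, 300.0), (503, 900.0)]
print(summarize(sample))
```

The point being: a rising p99 or a growing 429/503 share is exactly the
internal service state that neither Munin nor raw access logs surface
well.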
I'd be open to a setup that allows for more custom introspection
(specifically latency, error codes, and internal service state), e.g. a
Grafana instance.

Kind regards
Philipp Kern
