Thank you for your feedback and for sustainably caching build dependencies.

Presumably a caching proxy for all users of a CI service (whether built
on Apache, Squid, Nginx, or something else) would need a valid SSL/TLS
certificate of its own, and would be mediating all requests to other
servers.
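
For the sake of discussion, a hand-wavy sketch of what I imagine such an
Apache setup looks like (placeholder hostnames and paths, not anyone's
actual config; assumes mod_ssl, mod_proxy/mod_proxy_http, mod_cache and
mod_cache_disk are enabled). Clients speak TLS to the proxy's own
certificate, and the proxy fetches from and caches the upstream index:

    <VirtualHost *:443>
        ServerName mirror.example.org
        SSLEngine on
        SSLCertificateFile    /etc/ssl/certs/mirror.example.org.pem
        SSLCertificateKeyFile /etc/ssl/private/mirror.example.org.key

        # Proxy the PyPI simple index over HTTPS, caching responses on disk
        SSLProxyEngine on
        ProxyPass        /pypi/simple/ https://pypi.org/simple/
        ProxyPassReverse /pypi/simple/ https://pypi.org/simple/
        CacheEnable disk /pypi/simple/
        CacheRoot /var/cache/apache2/proxy
    </VirtualHost>

(A real deployment would presumably also proxy files.pythonhosted.org,
since that is where the actual package files are downloaded from.)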

Is it also possible to configure clients to use a caching proxy with
just environment variables?
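
(For pip at least, I believe so; any of pip's options can be set via
PIP_* environment variables, and it also honors the standard proxy
variables. Hostnames below are placeholders:)

    # point at a local mirror/cache instead of pypi.org
    export PIP_INDEX_URL=https://mirror.example.org/pypi/simple/
    # or route requests through a caching proxy
    export PIP_PROXY=http://proxy.example.org:3128
    export https_proxy=http://proxy.example.org:3128
    # only needed if the mirror lacks a certificate the client trusts
    export PIP_TRUSTED_HOST=mirror.example.org

Though that assumes clients are pointed at the proxy explicitly; a
transparently intercepting HTTPS proxy would instead require clients to
trust its CA, which circles back to the certificate question above.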

On Tue, Mar 10, 2020, 2:49 PM Jeremy Stanley <fu...@yuggoth.org> wrote:

> [Apologies if you receive multiple copies of this, it seems that
> Google Groups may silently discard posts with PGP+MIME signatures.]
>
> On 2020-03-10 06:50:56 -0700 (-0700), Wes Turner wrote:
> [...]
> > Reference Implementation
> > ========================
> >
> > - [ ] Does anyone have examples of CI services that are doing this well
> >   / correctly? E.g. with proxy-caching on by default
> [...]
>
> The CI system in OpenDev does this. In fact, we tried a number of
> the aforementioned approaches over the years:
>
> 1. Building a limited mirror based on package dependencies; this was
> inconvenient as projects needed to wait for a new package to be pulled
> into the mirror set before it could be used by jobs (because we also
> wanted to prevent jobs from accidentally bypassing the mirror and
> hitting PyPI directly), and curating the growing list of package
> names/versions became cumbersome.
>
> 2. Maintaining a full mirror of PyPI via bandersnatch; we did this
> for years, but it was unstable (especially early on; serial handling
> in PyPI's API got better over time) and needed a fair amount of
> attention. The real reason we stopped, though, was that some AI/ML
> projects (I'm not pointing fingers, but you know who you are) started
> dumping giant nightly snapshots of their datasets into PyPI, and we
> didn't want to have to deal with multi-terabyte filesystem coherency
> issues or month-long rebootstrapping periods. Bandersnatch
> eventually grew an option for filtering specific projects, but it
> required a full rebuild to remove all the previously fetched
> files, which we didn't want to deal with (and this would have become
> an ongoing game of Whack-a-Mole with any new projects following
> similar processes).
>
> 3. Using a caching proxy; this has turned out to be the
> lowest-effort solution for us, occasional changes in pip and related
> toolchain aside.
>
> OpenDev's Zuul (project gating CI) service utilizes resources across
> roughly a dozen different cloud providers, so we've found the best
> way to reduce nondeterministic network failures is to cache as much
> as possible locally within every provider/region. We configure Apache
> on a persistent virtual machine in each of these via Ansible, and
> this is what the relevant configuration currently looks like for
> PyPI caching:
>
> <URL:
> https://opendev.org/opendev/system-config/src/commit/b2b0cc1c834856afa5511ca9a489d0dfbc6ba948/playbooks/roles/mirror/templates/mirror.vhost.j2#L36-L88
> >
>
> Early in the setup phase, before jobs might want to start pulling
> anything from PyPI, we install an /etc/pip.conf file onto the job
> nodes from this template, with the local mirror hostname substituted
> appropriately:
>
> <URL:
> https://opendev.org/zuul/zuul-jobs/src/commit/de04f76d57ffd5737dea6c6eb3af4c26f2fe08a6/roles/configure-mirrors/templates/etc/pip.conf.j2
> >
>
> You'll notice that extra-index-url is set to a wheel_mirror URL;
> that's a separate cache we build to accelerate jobs which rely on
> packages that don't publish wheels to PyPI for the various platforms
> we offer (a variety of Linux distributions). We collect common
> Python package dependencies for projects running jobs, perform test
> installations of them in a separate periodic job, check to see if
> they or their transitive dependency set require building a wheel
> from sdist rather than downloading a prebuilt one from PyPI, and
> then add all of those to a central cache. We do this for each
> available Python version across all the distros/releases for which
> we maintain node images. The wheels are stored globally in AFS (the
> infamous Andrew Filesystem) and then local OpenAFS caches are served
> from Apache in every configured cloud provider (the configuration
> for it appears immediately below the PyPI proxy cache in the vhost
> template linked earlier).
>
> Of course we don't just cache PyPI; we also mirror and/or cache
> Linux distribution package repositories, Dockerhub/Quay, NPM
> packages, Git repositories and whatever else is of interest to our
> users. Every time a job has to touch the greater Internet to
> retrieve build resources, that's one more opportunity for unexpected
> failure and further waste of our generously donated build capacity,
> so it's in our best interests and those of our users to implement
> and take advantage of local caches anywhere we can safely do so
> without undue compromise to the integrity of build results.
> --
> Jeremy Stanley
