Re: Draft PEP: PyPI cost solutions: CI, mirrors, containers, and caching to scale

2020-03-10 Thread Wes Turner
Thank you for your feedback and for sustainably caching build dependencies.

Presumably a caching proxy for all users of a CI service (with Apache,
Squid, Nginx, ?) would need to have a current SSL cert, and would be
mediating any requests to other servers.

Is it also possible to configure clients to use a caching proxy using just
environment variables?
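
pip, at least, maps its command-line options onto PIP_* environment
variables, and most HTTP clients also honor the conventional proxy
variables; a sketch, with hypothetical hostnames:

    # Route pip through a caching proxy/mirror using only the environment:
    export PIP_INDEX_URL=https://mirror.example.org/pypi/simple/
    export PIP_PROXY=http://cache.example.org:3128
    # Generic proxy variables honored by many other HTTP clients:
    export https_proxy=http://cache.example.org:3128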

On Tue, Mar 10, 2020, 2:49 PM Jeremy Stanley  wrote:

> [Apologies if you receive multiple copies of this, it seems that
> Google Groups may silently discard posts with PGP+MIME signatures.]
>
> On 2020-03-10 06:50:56 -0700 (-0700), Wes Turner wrote:
> [...]
> > Reference Implementation
> > ========================
> >
> > - [ ] Does anyone have examples of CI services that are doing this well
> >   / correctly? E.g. with proxy-caching on by default
> [...]
>
> The CI system in OpenDev does this. In fact, we tried a number of
> the aforementioned approaches over the years:
>
> 1. Building a limited mirror based on package dependencies; this was
> inconvenient as projects needed to wait for the package to be pulled
> into the mirror set before it could be used by jobs (because we also
> wanted to prevent jobs from accidentally bypassing the mirror and
> hitting PyPI directly), and curating the growing list of package
> names/versions became cumbersome.
>
> 2. Maintaining a full mirror of PyPI via bandersnatch; we did this
> for years, but it was unstable (especially early on; serial handling
> in PyPI's API got better over time), so it needed a fair amount of
> attention, but the real reason we stopped was that some AI/ML
> projects (I'm not pointing fingers but you know who you are) started
> dumping giant nightly snapshots of their datasets into PyPI and we
> didn't want to have to deal with multi-terabyte filesystem coherency
> issues or month-long rebootstrapping periods; bandersnatch
> eventually grew an option for filtering specific projects, but
> required a full rebuild to filter out all the previously fetched
> files, which we didn't want to deal with (and this would have become
> an ongoing game of Whack-a-Mole with any new projects following
> similar processes).
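>
> For reference, project filtering in a recent bandersnatch.conf looks
> roughly like this (section and plugin names have varied between
> releases, and the package name is a placeholder):
>
>     [mirror]
>     directory = /srv/pypi
>     master = https://pypi.org
>
>     [plugins]
>     enabled =
>         blocklist_project
>
>     [blocklist]
>     packages =
>         some-giant-nightly-project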
>
> 3. Using a caching proxy; this has turned out to be the
> lowest-effort solution for us, occasional changes in pip and related
> toolchain aside.
>
> OpenDev's Zuul (project gating CI) service utilizes resources across
> roughly a dozen different cloud providers, so we've found the best
> way to reduce nondeterministic network failures is to cache as much
> as possible locally within every provider/region. We configure Apache
> on a persistent virtual machine in each of these via Ansible, and
> this is what the relevant configuration currently looks like for
> PyPI caching:
>
> <https://opendev.org/opendev/system-config/src/commit/b2b0cc1c834856afa5511ca9a489d0dfbc6ba948/playbooks/roles/mirror/templates/mirror.vhost.j2#L36-L88>
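>
> Stripped to its essentials, it's an Apache mod_proxy + mod_cache
> arrangement along these lines (hostname, port, and size are
> illustrative; the linked template is authoritative):
>
>     <VirtualHost *:8080>
>         ServerName mirror.example.org
>         # TLS server configuration omitted; SSLProxyEngine enables
>         # the https connection to the pypi.org backend:
>         SSLProxyEngine on
>         ProxyPass        "/pypi/" "https://pypi.org/"
>         ProxyPassReverse "/pypi/" "https://pypi.org/"
>         CacheEnable disk "/pypi/"
>         CacheRoot "/var/cache/apache2/proxy"
>         CacheMaxFileSize 1073741824
>     </VirtualHost>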
>
> Early in the setup phase before jobs might want to start pulling
> anything from PyPI we install an /etc/pip.conf file onto the job
> nodes from this template, with the local mirror hostname substituted
> appropriately:
>
> <https://opendev.org/zuul/zuul-jobs/src/commit/de04f76d57ffd5737dea6c6eb3af4c26f2fe08a6/roles/configure-mirrors/templates/etc/pip.conf.j2>
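>
> Substituted, the resulting pip.conf is along these lines (the
> hostname and wheel-mirror path are illustrative):
>
>     [global]
>     index-url = https://mirror.example.org/pypi/simple/
>     extra-index-url = https://mirror.example.org/wheel/ubuntu-bionic/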
>
> You'll notice that extra-index-url is set to a wheel_mirror URL;
> that's a separate cache we build to accelerate jobs which rely on
> packages that don't publish wheels to PyPI for the various platforms
> we offer (a variety of Linux distributions). We collect common
> Python package dependencies for projects running jobs, perform test
> installations of them in a separate periodic job, check to see if
> they or their transitive dependency set require building a wheel
> from sdist rather than downloading a prebuilt one from PyPI, and
> then add all of those to a central cache. We do this for each
> available Python version across all the distros/releases for which
> we maintain node images. The wheels are stored globally in AFS (the
> infamous Andrew Filesystem) and then local OpenAFS caches are served
> from Apache in every configured cloud provider (the configuration
> for it appears immediately below the PyPI proxy cache in the vhost
> template linked earlier).
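>
> Conceptually, the core of that periodic job is just the following
> (paths are illustrative; the real jobs do considerably more
> bookkeeping):
>
>     # "pip wheel" reuses prebuilt wheels from PyPI and builds the
>     # rest from sdist, so the wheelhouse covers the full set:
>     python3 -m pip wheel -r common-requirements.txt -w /srv/wheelhouse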
>
> Of course we don't just cache PyPI, we also mirror and/or cache
> Linux distribution package repositories, Dockerhub/Quay, NPM
> packages, Git repositories and whatever else is of interest to our
> users. Every time a job has to touch the greater Internet to
> retrieve build resources, that's one more opportunity for unexpected
> failure and further waste of our generously donated build capacity,
> so it's

Draft PEP: PyPI cost solutions: CI, mirrors, containers, and caching to scale

2020-03-10 Thread Wes Turner

This informational PEP is intended to be a reference
for CI services and CI implementors,
and a request for guidelines, tools, and best practices.

Working titles; seeking feedback:

- Guide for PyPI CI Service Providers
- Request from and advisory for CI Services and CI implementors
- PyPI cost solutions: CI, mirrors, containers, and caching to scale
- PyPI-dependent CI Service Provider and implementor Guide

See "Open Issues":

> - Does this need to be a PEP?
>  - No: It's merely an informational advisory and a request
>for consideration of sustainable resource utilization practices.
>  - Yes: It might as well be maintained as the document to be
>sent to CI services which are unnecessarily using significant
>amounts of bandwidth.


PEP: 
Title: PyPI-dependent CI Service Provider and Implementor Guide
Author: Wes Turner
Sponsor: *[Full Name ]*
BDFL-Delegate:
Discussions-To: https://groups.google.com/forum/#!forum/pypa-dev
Status: Draft
Type: Informational
Content-Type: text/x-rst
Requires: *[NNN]*
Created: 2020-03-07
Resolution:


Abstract
========

Continuous Integration (CI) services, which automatically build and
test software, can help reduce the costs of hosting PyPI by running
local mirrors and by advising clients on how to efficiently rebuild
software hundreds or thousands of times a month
without re-downloading everything from PyPI every time.

This informational PEP is intended to be a reference
for CI services and CI implementors,
and a request for guidelines, tools, and best practices.

Motivation
==========

- The costs of maintaining PyPI are increasing exponentially.
- CI builds impose significant load upon PyPI.
- Frequently re-downloading the exact same packages
  is wasting PyPI and CI services' time, money, and bandwidth.
- Perhaps the primary issue is lack of awareness
  of solutions for reducing resource requirements
  and thereby costs for all involved.
- Many thousands of projects are overutilizing donated resources
  when CI services could solve the problem centrally and more
  efficiently.


Request from and advisory for CI Services and CI Implementors
=============================================================

Dear CI Service,

1. Please consider running local package mirrors and enabling use of local
   package mirrors by default for clients' CI builds.
2. Please advise clients regarding more efficient containerized
   software build and test strategies.

Running local package mirrors will save PyPI (the Python Package Index,
a service maintained by the PyPA, a working group within the non-profit
Python Software Foundation) generously donated resources.
(As of March 2020, PyPI costs roughly $800,000 USD a month to operate,
even with generously donated resources.)

If you would prefer to donate to the PSF instead (or as well),
[earmarked] donations are very welcome and will be publicly acknowledged.

Data locality through caching is the solution
to efficient software distribution. There are a number of opportunities
to cache package downloads and thereby (1) reduce bandwidth
requirements, and (2) reduce build times:

- ~/.cache/pip -- this does not persist across hermetically isolated
  container invocations unless explicitly mounted in
  (see the sketch after this list)
- Network-local package repository mirror
- Container image
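
For example, a bind mount can carry pip's cache across otherwise
isolated container runs; a minimal sketch, assuming Docker (the image,
paths, and file names are illustrative)::

    # Persist pip's cache on the host so repeated containerized builds
    # do not re-download the same distributions from PyPI:
    docker run --rm \
      -v "$PWD:/src" -w /src \
      -v "$HOME/.cache/pip:/root/.cache/pip" \
      python:3.8 \
      pip install -r requirements.txt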

There are many package mirroring solutions for Python packages
and other packages and containers:

- A full mirror
  - bandersnatch: https://pypi.org/project/bandersnatch/
- A partial mirror:
  - pulp: https://pulpproject.org/
    - Pulp also handles RPM, Debian, Puppet, Docker, and OSTree
- A transparent proxy cache mirror:
  - devpi: https://pypi.org/project/devpi/
    (see the sketch after this list)
  - Dumb HTTPS cache with a maximum file size:
    - squid?
- IPFS
  - IPFS for software package repository mirroring is an active area of
    research.
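
A transparent proxy cache such as devpi needs no curation: packages are
fetched from PyPI on first request and served from the local cache
thereafter. A minimal sketch (host, port, and the package installed are
illustrative; ``root/pypi`` is devpi's built-in PyPI-mirroring index)::

    # Serve a caching PyPI proxy on port 3141:
    pip install devpi-server
    devpi-server --init          # one-time initialization of server state
    devpi-server --host 0.0.0.0 --port 3141

    # Point pip at the proxy; misses are fetched from PyPI and cached:
    pip install --index-url http://localhost:3141/root/pypi/+simple/ requests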

Containers:

- OCI Container Registry
  - Notary (TUF): https://github.com/theupdateframework/notary
  - Amazon Elastic Container Registry: https://aws.amazon.com/ecr/
  - Azure Container Registry:
    https://azure.microsoft.com/en-us/services/container-registry/
  - Docker registry: https://docs.docker.com/registry/deploying/
  - DockerHub:  https://hub.docker.com/
  - GitLab Container Registry:
    https://docs.gitlab.com/ce/user/packages/container_registry/
  - Google Container Registry: https://gcr.io
  - RedHat Quay Container Registry: https://quay.io
- Container Build Services
  - Any CI Service can be used to build and upload a container
    (see the sketch below)
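
For example, building an image in one job and uploading it for reuse is
the same two commands against any of the registries above; a sketch with
hypothetical image and registry names::

    # Build once, tag with a registry-qualified name, and upload;
    # later builds can pull this image instead of rebuilding it:
    docker build -t registry.example.org/myorg/ci-base:2020-03 .
    docker push registry.example.org/myorg/ci-base:2020-03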

There are approaches to making individual (containerized) (Python)
software package builds more efficient:

A. Build a named container image containing the necessary dependencies,
   upload the container image to a container registry, and
   reuse the container image for subsequent builds of your
   package(s) (see the sketch after this list)
B. Automate updates of pinned dependency versions using a
   free or paid service that regularly audits dependency specifications
   stored in source code repositories and sends pull requests
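
A sketch of approach A, assuming Docker (image names, the registry, and
file names are all illustrative): dependencies are baked into a tagged
image once, and subsequent builds start from it instead of reinstalling
everything from PyPI::

    # Bake pinned dependencies into a reusable base image:
    cat > Dockerfile.deps <<'EOF'
    FROM python:3.8-slim
    COPY requirements.txt /tmp/requirements.txt
    RUN pip install --no-cache-dir -r /tmp/requirements.txt
    EOF
    docker build -f Dockerfile.deps -t registry.example.org/myorg/deps:2020-03 .
    docker push registry.example.org/myorg/deps:2020-03

    # Subsequent package builds then start from the prebuilt image,
    # e.g. a Dockerfile beginning with:
    #   FROM registry.example.org/myorg/deps:2020-03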
  

Re: [Distutils] Summary of PyPI overhaul in new LWN article

2018-04-12 Thread Wes Turner
From "TUF, Warehouse, Pip, PyPA, ld-signatures, ed25519"

https://mail.python.org/pipermail/distutils-sig/2018-March/032081.html :

> Are there pypa/warehouse github issues for implementing the TUF trust
root support in warehouse; and client support in pip (or a module that pip
and other tools can use)?

Read and review these PEPs:

"PEP 458 -- Surviving a Compromise of PyPI"
https://www.python.org/dev/peps/pep-0458/

"PEP 480 -- Surviving a Compromise of PyPI: The Maximum Security Model"
https://www.python.org/dev/peps/pep-0480/

On Thursday, April 12, 2018, Trishank Kuppusamy <
trishank.kuppus...@datadoghq.com> wrote:

> On Wed, Apr 11, 2018 at 10:30 PM, Sumana Harihareswara wrote:
>
>> Today, LWN published my new article "A new package index for Python".
>> https://lwn.net/Articles/751458/ In it, I discuss security, policy, UX
>> and developer experience changes in the 15+ years since PyPI's founding,
>> new features (and deprecated old features) in Warehouse, and future
>> plans. Plus: screenshots!
>>
>> If you aren't already an LWN subscriber, you can use this subscriber
>> link for the next week to read the article despite the LWN paywall.
>> https://lwn.net/SubscriberLink/751458/81b2759e7025d6b9/
>
>
> Thanks for the summary, and all your hard work, Sumana :)
>
> Happy to see this bit about TUF in future horizons:
>
> Warehouse's signature handling demonstrates a shift in Python's thinking
>> regarding key management and package signatures. Ideally, package users,
>> software distributors, and package distribution tools would regularly use
>> signatures to verify Python package integrity. For the most part, however,
>> they don't, and there are major infrastructural barriers to them
>> effectively doing so. Therefore, GPG/PGP signatures for packages are no
>> longer visible in PyPI's web interface. Project maintainers can still
>> attach signatures to their release uploads, and those signatures still
>> appear in the Simple Project API as described in PEP 503. Stufft has made
>> no secret of his opinion that "package signing is not the Holy Grail";
>> current discussion among packaging-tools developers leans toward removing
>> signing features from another part of the Python packaging ecology (the
>> wheel library) and working toward implementing The Update Framework
>> instead. Relatedly, Warehouse, unlike legacy PyPI, does not provide an
>> interface for users to manage GPG or SSH public keys.
>
>
>  We would love to help with these efforts any way we can.
>
> --
> curl https://keybase.io/trishankdatadog/pgp_keys.asc | gpg --import
>


Re: Disallow packages with the same name as stdlib modules

2018-01-16 Thread Wes Turner
There was an ANN for this issue:

[Python-Dev] SK-CSIRT identified malicious software libraries in the official 
Python package repository, PyPI
https://mail.python.org/pipermail/python-dev/2017-September/149569.html

[Security-announce] Typo squatting and malicious packages on PyPI
https://mail.python.org/pipermail/security-announce/2017-September/00.html

And GitHub issues:

Can register packages that match system packages
https://github.com/pypa/pypi-legacy/issues/585

Block package names that conflict with core libraries
https://github.com/pypa/warehouse/issues/2151