Bug#1020217: S3-backed snapshot implementation

Lucas Nussbaum Sun, 05 May 2024 12:33:13 -0700

Hi,

I made some minor progress on this, and I thought I'd report back (I'll
try to attend the meeting tomorrow, but I'm not sure I'll manage).



# What I did:

- I got the OK to host an S3-backed snapshot mirror using the Debian AWS
  account (see thread in #1020217).
- I got access to the account, and set up a VM with a Debian mirror.
- I was able to run the file-backed snapshot importer on it.
- I modified the snapshot importer code to make it import to S3
  (basically it means creating an S3Backend class that inherits from
  StockageBackend), and tested it by importing the debian-security
  archive.
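
For illustration, the backend split described in the last item could look
roughly like the sketch below. It is not the importer's actual code: the
class and method names (`StorageBackend`, `has`, `put`, `store`) and the
choice of the SHA-1 digest as the object key are assumptions. The S3
client is injected, and only `put_object` and `list_objects_v2` are
called on it, with the keyword arguments they take in boto3.

```python
import hashlib
from abc import ABC, abstractmethod
from pathlib import Path


class StorageBackend(ABC):
    """Hypothetical base class: blobs are addressed by their SHA-1 digest."""

    @abstractmethod
    def has(self, digest):
        """Return True if a blob with this digest is already stored."""

    @abstractmethod
    def put(self, digest, data):
        """Store the bytes under the given digest."""

    def store(self, data):
        """Hash the data, store it unless already present, return the digest."""
        digest = hashlib.sha1(data).hexdigest()
        if not self.has(digest):
            self.put(digest, data)
        return digest


class FileBackend(StorageBackend):
    """Stores blobs as files under root/ab/cd/abcd... (assumed layout)."""

    def __init__(self, root):
        self.root = Path(root)

    def _path(self, digest):
        return self.root / digest[:2] / digest[2:4] / digest

    def has(self, digest):
        return self._path(digest).exists()

    def put(self, digest, data):
        path = self._path(digest)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)


class S3Backend(StorageBackend):
    """Stores blobs as objects in one bucket, keyed by digest (assumed)."""

    def __init__(self, client, bucket):
        self.client = client  # e.g. boto3.client("s3"), supplied by the caller
        self.bucket = bucket

    def has(self, digest):
        # Keys are full fixed-length digests, so a prefix listing limited
        # to one key acts as an existence check.
        resp = self.client.list_objects_v2(
            Bucket=self.bucket, Prefix=digest, MaxKeys=1
        )
        return resp.get("KeyCount", 0) > 0

    def put(self, digest, data):
        self.client.put_object(Bucket=self.bucket, Key=digest, Body=data)
```

With this shape, the importer only ever calls `store()`, so switching
from local files to S3 is a matter of constructing a different backend.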


# What I plan to work on:

- Set up a real development environment. I plan to use Vagrant, which is
  not a perfect solution for many reasons, but anyway the provisioning
  scripts will likely be re-usable with something else.
- Change the web frontend to allow using S3.
- Improve (parallelize) the importer code, specifically the sha1-hashing
  (to process multiple files in parallel, one per core) and the file
  copying/uploading-to-S3 (this is especially important for S3 because,
  to achieve good throughput, you need many transfers in parallel).
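
The parallelization in the last item could be sketched as follows. This
is only an outline under assumptions: the function names are made up,
`upload` stands for whatever function ends up performing a single S3
transfer, and the default of 32 concurrent uploads is an arbitrary
starting point, not a measured optimum. Threads are used for hashing as
well, since `hashlib` releases the GIL while hashing large buffers.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def sha1_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hash_files(paths, workers=None):
    """Hash files concurrently, one worker per core by default.

    Returns a dict mapping each path to its hex digest.
    """
    paths = list(paths)
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(sha1_file, paths)))


def upload_files(digests, upload, workers=32):
    """Call upload(path, digest) for every entry, many transfers at once.

    `upload` is a caller-supplied (hypothetical) function that performs
    one transfer; the first failure is re-raised to the caller.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(upload, p, d) for p, d in digests.items()]
        for fut in futures:
            fut.result()
```

The two pools are sized independently on purpose: hashing is bounded by
the number of cores, while uploads are mostly waiting on the network and
benefit from far more concurrency.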


# Open questions

## What to do with this?

Assuming all this works and we can have an S3-backed snapshot
service, there's the question of what to do with it.
I think we have several options:

### A. s3-snapshot as a mirror of snapshot.debian.org

The imports would continue to be done on snapshot.debian.org, but
everything would be mirrored on a regular basis to S3.
That would allow faster access to the data, but would not help with the
performance of imports.

### B. dual-stack snapshot.debian.org

The importer on snapshot.debian.org would import both to local storage
and to S3. The web app could proxy requests to both.
That would allow more resilience, but does not help with the performance
of imports (on the contrary).

### C. s3-snapshot as a fork of snapshot.debian.org

After an initial import of snapshot.debian.org data, s3-snapshot would
live its own independent life.
The main downside is that the two databases would drift out of sync:
their mirror runs would not coincide, so each might miss some packages,
and not the same ones.

### D. do A and C at the same time

Do C, but also make sure that every file that ever gets stored in
snapshot.debian.org gets imported into the bucket used for s3-snapshot,
to be able to expose a full read-only mirror of the snapshot.debian.org
DB.

### E. Nice experiment, but let's forget about it

(That should be mentioned as well)



In any case, it probably makes sense to keep at least two different
instances of the snapshot service (and data), preferably on different
implementations, to make sure that we don't lose everything in case of
a catastrophic incident.

I plan to aim for C as a first step.


## How to do an initial import of snapshot.debian.org data?

That's more of a technical question. The PostgreSQL DB should not be a
problem, as it's quite small (~ 20 GB). For the data itself, it could
probably be uploaded directly from local storage on the snapshot.d.o
hosts to an S3 bucket. I was able to upload from an EC2 VM to S3 at
about 10 Gbps (limited by the bandwidth of local storage). I don't know
about the performance (storage, network) of snapshot.debian.org, but
that probably means that an import into S3 is doable in a couple of
weeks in the worst case. In any case, that's a question to keep in mind,
but it does not need to be resolved now.
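
As a rough way to sanity-check such estimates, transfer time is simply
size divided by sustained bandwidth. The helper names below are made up
for illustration, and the sizes passed in are placeholders rather than
the real size of the snapshot.debian.org data, which is not given here.

```python
def transfer_seconds(size_bytes, gbps):
    """Seconds needed to move size_bytes at a sustained rate of gbps Gbit/s."""
    bytes_per_second = gbps * 1e9 / 8
    return size_bytes / bytes_per_second


def bytes_in(days, gbps):
    """Bytes that can be moved in the given number of days at gbps Gbit/s."""
    return days * 86400 * gbps * 1e9 / 8


# Placeholder figures, not measurements:
one_tb = transfer_seconds(1e12, 10)   # time for 1 TB at 10 Gbit/s
two_weeks = bytes_in(14, 10)          # volume movable in 14 days at 10 Gbit/s
```

At a sustained 10 Gbps, 1 TB takes 800 seconds, and two weeks of
continuous transfer covers about 1.5 PB. The rate that the
snapshot.debian.org hosts can actually sustain remains the unknown.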

Lucas
