Hi, I made some minor progress on this, and I thought I'd report back (I'll try to attend the meeting tomorrow, but I'm not sure I'll manage).

# What I did:

- I got the OK to host an S3-backed snapshot mirror using the Debian AWS account (see thread in #1020217).
- I got access to the account, and set up a VM with a Debian mirror.
- I was able to run the file-backed snapshot importer on it.
- I modified the snapshot importer code to make it import to S3 (basically, this means creating an S3Backend class that inherits from StockageBackend), and tested it by importing the debian-security archive.

# What I plan to work on:

- Set up a real development environment. I plan to use Vagrant, which is not a perfect solution for many reasons, but the provisioning scripts will likely be reusable with something else anyway.
- Change the web frontend to allow using S3.
- Improve (parallelize) the importer code, specifically the sha1-hashing (to process multiple files in parallel, one per core) and the file copying/uploading-to-S3 (this is especially important for S3 because good throughput requires many transfers in parallel).

# Open questions

## What to do with this?

Assuming all this works and we can have an S3-backed snapshot service, there's the question of what to do with it. I think we have several options:

### A. s3-snapshot as a mirror of snapshot.debian.org

The imports would continue to be done on snapshot.debian.org, but everything would be mirrored on a regular basis to S3. That would allow faster access to the data, but would not help with the performance of imports.

### B. dual-stack snapshot.debian.org

The importer on snapshot.debian.org would import both to local storage and to S3. The web app could proxy requests to both. That would allow more resilience, but does not help with the performance of imports (on the contrary).

### C. s3-snapshot as a fork of snapshot.debian.org

After an initial import of snapshot.debian.org data, s3-snapshot would live its own independent life.
The main downside is that the two databases will drift out of sync: their mirror runs will not coincide, so each might miss some packages, but not the same ones.

### D. do both at the same time

Do C, but also make sure that every file that ever gets stored in snapshot.debian.org gets imported into the bucket used for s3-snapshot, so that we can expose a full read-only mirror of the snapshot.debian.org DB.

### E. Nice experiment, but let's forget about it

(That should be mentioned as well)

In any case, it probably makes sense to keep at least two different instances of the snapshot service (and data), preferably on different implementations, to make sure that we don't lose everything in case of a catastrophic incident.

I plan to aim for C as a first step.

## How to do an initial import of snapshot.debian.org data?

That's more of a technical question. The PostgreSQL DB should not be a problem, as it's quite small (~ 20 GB). The data itself could probably be uploaded directly from local storage on the snapshot.d.o hosts to an S3 bucket. I was able to upload from an EC2 VM to S3 at about 10 Gbps (limited by the bandwidth of local storage). I don't know about the performance (storage, network) of snapshot.debian.org, but that probably means that an import into S3 is doable in a couple of weeks in the worst case.

In any case, that's a question to keep in mind, but it does not need to be resolved now.

Lucas
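
P.S. To illustrate the parallelization item from the plan above, here is a minimal Python sketch. It is not the actual importer code: `DirectoryBackend`, `store()`, `sha1_of_file()` and `import_files()` are hypothetical names made up for this example, and the backend just copies files into a local directory so that the sketch runs without AWS credentials. An S3 variant would presumably keep the same `store()` signature and upload to a bucket instead. Using threads rather than one process per core is also an assumption: as far as I know, `hashlib` releases the GIL while hashing large buffers, and uploads are I/O-bound, so a single thread pool might cover both steps.

```python
import hashlib
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class DirectoryBackend:
    """Hypothetical stand-in backend: stores each file under its SHA-1.

    An S3 variant would keep the same store() signature and upload the
    file to a bucket instead of copying it locally.
    """

    def __init__(self, root):
        self.root = root

    def store(self, sha1, path):
        shutil.copyfile(path, os.path.join(self.root, sha1))


def import_files(paths, backend, workers=None):
    """Hash and store files concurrently; return a {path: sha1} dict."""
    # Default to one worker per core (assumption; uploads may want more).
    workers = workers or os.cpu_count() or 1

    def process(path):
        sha1 = sha1_of_file(path)
        backend.store(sha1, path)
        return path, sha1

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process, paths))
```

For S3, the number of workers would probably need to be much higher than the number of cores, since the transfers spend most of their time waiting on the network.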