Hi Seref,

For the benchmarks, I used Hetzner's cloud service with the following setup:

- A Hetzner S3 bucket in the FSN1 region
- A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
- 3 ZeroFS NBD devices (backed by the same S3 bucket)
- A ZFS striped pool across the 3 devices
- 200 GB of ZFS L2ARC
- PostgreSQL configured accordingly memory-wise, as well as with synchronous_commit = off, wal_init_zero = off and wal_recycle = off
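
For concreteness, the bring-up looked roughly like this (device paths, ports, pool name and cache device below are illustrative, not the exact commands I ran):

```
# Attach the three ZeroFS NBD exports as block devices
# (ZeroFS serves NBD over TCP; one port per device here)
nbd-client 127.0.0.1 10809 /dev/nbd0
nbd-client 127.0.0.1 10810 /dev/nbd1
nbd-client 127.0.0.1 10811 /dev/nbd2

# Striped pool: listing bare devices with no vdev keyword stripes them
zpool create tank /dev/nbd0 /dev/nbd1 /dev/nbd2

# 200 GB local partition as L2ARC
zpool add tank cache /dev/nvme0n1p4

# postgresql.conf highlights (memory values sized to the 192 GB box)
# shared_buffers = 48GB          # illustrative
# synchronous_commit = off
# wal_init_zero = off
# wal_recycle = off
```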

Best,
Pierre

On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
> Sorry, this was meant to go to the whole group:
>
> Very interesting! Great work. Can you clarify how exactly you're running
> postgres in your tests? A specific AWS service? What's the test
> infrastructure that sits above the file system?
>
> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pie...@barre.sh> wrote:
>> Hi everyone,
>>
>> I wanted to share a project I've been working on that enables PostgreSQL to
>> run on S3 storage while maintaining performance comparable to local NVMe.
>> The approach uses block-level access rather than trying to map filesystem
>> operations to S3 objects.
>>
>> ZeroFS: https://github.com/Barre/ZeroFS
>>
>> # The Architecture
>>
>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as
>> raw block devices. PostgreSQL runs unmodified on ZFS pools built on these
>> block devices:
>>
>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>
>> By providing block-level access and leveraging ZFS's caching capabilities
>> (L2ARC), we can achieve microsecond latencies despite the underlying storage
>> being in S3.
>>
>> ## Performance Results
>>
>> Here are pgbench results from PostgreSQL running on this setup:
>>
>> ### Read/Write Workload
>>
>> ```
>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> starting vacuum...end.
>> transaction type: <builtin: TPC-B (sort of)>
>> scaling factor: 50
>> query mode: simple
>> number of clients: 50
>> number of threads: 15
>> maximum number of tries: 1
>> number of transactions per client: 100000
>> number of transactions actually processed: 5000000/5000000
>> number of failed transactions: 0 (0.000%)
>> latency average = 0.943 ms
>> initial connection time = 48.043 ms
>> tps = 53041.006947 (without initial connection time)
>> ```
>>
>> ### Read-Only Workload
>>
>> ```
>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> starting vacuum...end.
>> transaction type: <builtin: select only>
>> scaling factor: 50
>> query mode: simple
>> number of clients: 50
>> number of threads: 15
>> maximum number of tries: 1
>> number of transactions per client: 100000
>> number of transactions actually processed: 5000000/5000000
>> number of failed transactions: 0 (0.000%)
>> latency average = 0.121 ms
>> initial connection time = 53.358 ms
>> tps = 413436.248089 (without initial connection time)
>> ```
>>
>> These numbers are with 50 concurrent clients and the actual data stored in
>> S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold
>> data comes from S3.
>>
>> ## How It Works
>>
>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use
>>    like any other block device
>> 2. Multiple cache layers hide S3 latency:
>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>    b. ZeroFS memory cache for metadata and hot data
>>    c. Optional local disk cache
>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
>>
>> ## Geo-Distributed PostgreSQL
>>
>> Since each region can run its own ZeroFS instance, you can create
>> geographically distributed PostgreSQL setups.
>>
>> Example architectures:
>>
>> Architecture 1:
>>
>>                         PostgreSQL Client
>>                                |
>>                                | SQL queries
>>                                |
>>                         +--------------+
>>                         |   PG Proxy   |
>>                         |  (HAProxy/   |
>>                         |  PgBouncer)  |
>>                         +--------------+
>>                           /          \
>>                          /            \
>>                 Synchronous        Synchronous
>>                 Replication        Replication
>>                        /                \
>>                       /                  \
>>        +---------------+        +---------------+
>>        | PostgreSQL 1  |        | PostgreSQL 2  |
>>        |   (Primary)   |◄------►|   (Standby)   |
>>        +---------------+        +---------------+
>>                |                        |
>>                |  POSIX filesystem ops  |
>>                |                        |
>>        +---------------+        +---------------+
>>        |  ZFS Pool 1   |        |  ZFS Pool 2   |
>>        | (3-way mirror)|        | (3-way mirror)|
>>        +---------------+        +---------------+
>>           /    |    \              /    |    \
>>          /     |     \            /     |     \
>>   NBD:10809 NBD:10810 NBD:10811 NBD:10812 NBD:10813 NBD:10814
>>       |         |         |         |         |         |
>>   +--------++--------++--------++--------++--------++--------+
>>   |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>>   +--------++--------++--------++--------++--------++--------+
>>       |         |         |         |         |         |
>>  S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>>  (us-east)  (eu-west)  (ap-south) (us-west)  (eu-north) (ap-east)
>>
>> Architecture 2:
>>
>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>>               \                         /
>>                \                       /
>>                 Same ZFS Pool (NBD)
>>                         |
>>                  6 Global ZeroFS
>>                         |
>>                     S3 Regions
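>>
>> In Architecture 1, the replication layer is plain PostgreSQL synchronous
>> replication. A minimal sketch of the relevant settings (host name, user and
>> application_name here are illustrative):
>>
>> ```
>> # postgresql.conf on PostgreSQL 1 (primary): don't acknowledge a commit
>> # until the named standby has confirmed the WAL
>> synchronous_standby_names = 'pg2'
>> synchronous_commit = on
>>
>> # postgresql.conf on PostgreSQL 2 (standby): stream from the primary,
>> # announcing itself under the name the primary waits for
>> primary_conninfo = 'host=pg1.example.com user=replicator application_name=pg2'
>> ```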
>>
>> The main advantages I see are:
>> 1. Dramatic cost reduction for large datasets
>> 2. Simplified geo-distribution
>> 3. Effectively unlimited storage capacity
>> 4. Built-in encryption and compression
>>
>> Looking forward to your feedback and questions!
>>
>> Best,
>> Pierre
>>
>> P.S. The full project includes a custom NFS filesystem too.
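>>
>> The NFS side mounts like any NFS export; a hypothetical example, assuming
>> a ZeroFS NFS server on localhost (host, ports and export path are
>> assumptions, the exact invocation depends on your configuration):
>>
>> ```
>> # hypothetical NFSv3 mount of a local ZeroFS instance
>> mount -t nfs -o vers=3,tcp,port=2049,mountport=2049,nolock \
>>     127.0.0.1:/ /mnt/zerofs
>> ```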