Hi Seref,

For the benchmarks, I used Hetzner's cloud service with the following setup:

- A Hetzner S3 bucket in the FSN1 region
- A ccx63 virtual machine (48 vCPUs, 192 GB memory)
- 3 ZeroFS NBD devices backed by the same S3 bucket
- A ZFS striped pool across the 3 devices
- A 200 GB ZFS L2ARC
- Postgres with memory settings sized for the machine, plus 
synchronous_commit = off, wal_init_zero = off and wal_recycle = off 
(rough commands below).
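
The assembly looks roughly like this. It's a minimal sketch rather than the 
exact commands I ran: the device names, pool name, cache partition and 
PGDATA path are assumptions.

```
# Striped pool across the three ZeroFS-backed NBD devices
# (device and pool names are assumptions)
zpool create tank /dev/nbd0 /dev/nbd1 /dev/nbd2

# A ~200 GB local partition used as the L2ARC cache device (name is an assumption)
zpool add tank cache /dev/nvme0n1p2

# PostgreSQL settings used for the benchmark (memory settings omitted)
cat >> "$PGDATA/postgresql.conf" <<'EOF'
synchronous_commit = off
wal_init_zero = off
wal_recycle = off
EOF
```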

Best,
Pierre

On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
> Sorry, this was meant to go to the whole group:
> 
> Very interesting! Great work. Can you clarify how exactly you're running 
> postgres in your tests? A specific AWS service? What's the test 
> infrastructure that sits above the file system?
> 
> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pie...@barre.sh> wrote:
>> Hi everyone,
>> 
>> I wanted to share a project I've been working on that enables PostgreSQL to 
>> run on S3 storage while maintaining performance comparable to local NVMe. 
>> The approach uses block-level access rather than trying to map filesystem 
>> operations to S3 objects.
>> 
>> ZeroFS: https://github.com/Barre/ZeroFS
>> 
>> # The Architecture
>> 
>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as 
>> raw block devices. PostgreSQL runs unmodified on ZFS pools built on these 
>> block devices:
>> 
>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>> 
>> By providing block-level access and leveraging ZFS's caching capabilities 
>> (L2ARC), we can achieve microsecond latencies despite the underlying storage 
>> being in S3.
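>> 
>> For a concrete picture of the NBD layer, attaching the exports looks 
>> roughly like this (the ports follow the diagram further down; the exact 
>> nbd-client invocation is an assumption and varies by version):
>> 
>> ```
>> # Attach the ZeroFS NBD exports as local block devices
>> # (host/port invocation style is an assumption; adjust to your nbd-client)
>> nbd-client 127.0.0.1 10809 /dev/nbd0
>> nbd-client 127.0.0.1 10810 /dev/nbd1
>> nbd-client 127.0.0.1 10811 /dev/nbd2
>> 
>> # From here on they behave like ordinary disks: ZFS pools them,
>> # and PostgreSQL runs on the resulting filesystem unmodified
>> lsblk /dev/nbd0 /dev/nbd1 /dev/nbd2
>> ```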
>> 
>> ## Performance Results
>> 
>> Here are pgbench results from PostgreSQL running on this setup:
>> 
>> ### Read/Write Workload
>> 
>> ```
>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> starting vacuum...end.
>> transaction type: <builtin: TPC-B (sort of)>
>> scaling factor: 50
>> query mode: simple
>> number of clients: 50
>> number of threads: 15
>> maximum number of tries: 1
>> number of transactions per client: 100000
>> number of transactions actually processed: 5000000/5000000
>> number of failed transactions: 0 (0.000%)
>> latency average = 0.943 ms
>> initial connection time = 48.043 ms
>> tps = 53041.006947 (without initial connection time)
>> ```
>> 
>> ### Read-Only Workload
>> 
>> ```
>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> starting vacuum...end.
>> transaction type: <builtin: select only>
>> scaling factor: 50
>> query mode: simple
>> number of clients: 50
>> number of threads: 15
>> maximum number of tries: 1
>> number of transactions per client: 100000
>> number of transactions actually processed: 5000000/5000000
>> number of failed transactions: 0 (0.000%)
>> latency average = 0.121 ms
>> initial connection time = 53.358 ms
>> tps = 413436.248089 (without initial connection time)
>> ```
>> 
>> These numbers are with 50 concurrent clients and the actual data stored in 
>> S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold 
>> data comes from S3.
>> 
>> ## How It Works
>> 
>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use 
>> like any other block device
>> 2. Multiple cache layers hide S3 latency (see the sketch after this list):
>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>    b. ZeroFS memory cache for metadata and hot data
>>    c. Optional local disk cache
>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
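>> 
>> The ZFS side of that cache hierarchy can be observed directly; a hedged 
>> sketch (the pool name is an assumption; the ZeroFS-internal caching, 
>> encryption and chunking happen inside the ZeroFS process and aren't 
>> visible here):
>> 
>> ```
>> # Per-device and cache-device activity for the pool
>> zpool iostat -v tank 5
>> 
>> # ARC / L2ARC sizes and hit rates
>> arc_summary | head -40
>> ```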
>> 
>> ## Geo-Distributed PostgreSQL
>> 
>> Since each region can run its own ZeroFS instance, you can create 
>> geographically distributed PostgreSQL setups.
>> 
>> Example architectures:
>> 
>> Architecture 1
>> 
>> 
>>                          PostgreSQL Client
>>                                    |
>>                                    | SQL queries
>>                                    |
>>                             +--------------+
>>                             |  PG Proxy    |
>>                             | (HAProxy/    |
>>                             |  PgBouncer)  |
>>                             +--------------+
>>                                /        \
>>                               /          \
>>                    Synchronous            Synchronous
>>                    Replication            Replication
>>                             /              \
>>                            /                \
>>               +---------------+        +---------------+
>>               | PostgreSQL 1  |        | PostgreSQL 2  |
>>               | (Primary)     |◄------►| (Standby)     |
>>               +---------------+        +---------------+
>>                       |                        |
>>                       |  POSIX filesystem ops  |
>>                       |                        |
>>               +---------------+        +---------------+
>>               |   ZFS Pool 1  |        |   ZFS Pool 2  |
>>               | (3-way mirror)|        | (3-way mirror)|
>>               +---------------+        +---------------+
>>                /      |      \          /      |      \
>>               /       |       \        /       |       \
>>         NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
>>              |        |        |           |        |        |
>>         +--------++--------++--------++--------++--------++--------+
>>         |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>>         +--------++--------++--------++--------++--------++--------+
>>              |         |         |         |         |         |
>>              |         |         |         |         |         |
>>         S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>>         (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)
>> 
>> Architecture 2:
>> 
>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>>                 \                    /
>>                  \                  /
>>                   Same ZFS Pool (NBD)
>>                          |
>>                   6 Global ZeroFS
>>                          |
>>                       S3 Regions
>> 
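>> For the replication arrows in Architecture 1, the PostgreSQL side is just 
>> ordinary synchronous streaming replication; a minimal sketch of the 
>> relevant settings (host names and the application_name are placeholders):
>> 
>> ```
>> # Primary: postgresql.conf
>> synchronous_standby_names = 'pg2'
>> synchronous_commit = on   # commits wait for the standby
>> 
>> # Standby: postgresql.conf (after pg_basebackup from the primary)
>> primary_conninfo = 'host=pg1.internal user=replicator application_name=pg2'
>> ```
>> 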
>> 
>> The main advantages I see are:
>> 1. Dramatic cost reduction for large datasets
>> 2. Simplified geo-distribution 
>> 3. Effectively unlimited storage capacity
>> 4. Built-in encryption and compression
>> 
>> Looking forward to your feedback and questions!
>> 
>> Best,
>> Pierre
>> 
>> P.S. The full project includes a custom NFS filesystem too.
>> 
