Ah, by "shared storage" I mean that each node can acquire exclusivity, not that they can both R/W to it at the same time.
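To make that concrete, here is a minimal sketch of "each node can acquire exclusivity" using a POSIX advisory lock on a lease file. The lease-file path and function names are illustrative only, not how ZeroFS/ZFS actually arbitrate ownership:

```python
import fcntl
import os
import tempfile

def try_acquire_exclusive(lease_path: str):
    """Try to take an exclusive, non-blocking advisory lock on a lease file.

    Returns the open file object on success (the lock is held as long as
    the file stays open), or None if another node already holds it.
    """
    f = open(lease_path, "w")
    try:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f
    except BlockingIOError:
        f.close()
        return None

# Demo: the first "node" wins; the second is refused until the first releases.
lease = os.path.join(tempfile.gettempdir(), "zerofs-pool.lease")
node_a = try_acquire_exclusive(lease)
node_b = try_acquire_exclusive(lease)
assert node_a is not None   # node A owns the pool
assert node_b is None       # node B cannot take it concurrently
node_a.close()              # releasing lets another node take over
```

The point is just that exclusivity is an arbitration problem (one writer at a time), not a shared-R/W problem.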
> Some pretty well-known cases of storage / compute separation (Aurora, Neon)
> also share the storage between instances,

That model is cool, but I think it's more of a solution for outliers, as I was suggesting, not something that most would or should want.

Best,
Pierre

On Sat, Jul 26, 2025, at 09:42, Vladimir Churyukin wrote:
> Sorry, I was referring to this:
>
> > But when PostgreSQL instances share storage rather than replicate:
> > - Consistency seems maintained (same data)
> > - Availability seems maintained (client can always promote an accessible node)
> > - Partitions between PostgreSQL nodes don't prevent the system from functioning
>
> Some pretty well-known cases of storage / compute separation (Aurora, Neon)
> also share the storage between instances,
> that's why I'm a bit confused by your reply. I thought you were thinking about
> this approach too, which is why I mentioned the kind of challenges one may
> run into on that path.
>
> On Sat, Jul 26, 2025 at 12:36 AM Pierre Barre <pie...@barre.sh> wrote:
>> What you describe doesn't look like something very useful for the vast
>> majority of projects that need a database. Why would you even want that if
>> you can avoid it?
>>
>> If your "single node" can handle tens / hundreds of thousands of requests per
>> second, still has very durable and highly available storage, as well as
>> fast recovery mechanisms, what's the point?
>>
>> I am not trying to cater to extreme outliers that may want something very
>> weird like this; that's just not the set of use-cases I want to address,
>> because I believe they are few and far between.
>>
>> Best,
>> Pierre
>>
>> On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
>>> A shared storage would require a lot of extra work. That's essentially what
>>> AWS Aurora does.
>>> You will have to build functionality to sync in-memory state between nodes,
>>> because all the instances will have cached data that can easily become
>>> stale on any write operation.
>>> That alone is not that simple. You will have to modify some locking logic,
>>> and most likely make a lot of other changes in a lot of places; Postgres was
>>> just not built with the assumption that the storage can be shared.
>>>
>>> -Vladimir
>>>
>>> On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pie...@barre.sh> wrote:
>>>> Now, I'm trying to understand how the CAP theorem applies here. Traditional
>>>> PostgreSQL replication has clear CAP trade-offs - you choose between
>>>> consistency and availability during partitions.
>>>>
>>>> But when PostgreSQL instances share storage rather than replicate:
>>>> - Consistency seems maintained (same data)
>>>> - Availability seems maintained (client can always promote an accessible node)
>>>> - Partitions between PostgreSQL nodes don't prevent the system from functioning
>>>>
>>>> It seems that CAP assumes specific implementation details (like nodes
>>>> maintaining independent state) without explicitly stating them.
>>>>
>>>> How should we think about the CAP theorem when distributed nodes share storage
>>>> rather than coordinate state? Are the trade-offs simply moved to a
>>>> different layer, or does shared storage fundamentally change the analysis?
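[Editor's note: the shared-storage question above can be pictured as a toy model. All names here are hypothetical, not ZeroFS code: nodes are stateless front-ends over one shared store (caching ignored), and the client fails over to whichever node it can still reach. The CAP trade-off doesn't disappear; it moves down into the shared store's own availability.]

```python
class SharedStore:
    """Toy stand-in for the shared ZFS/ZeroFS pool: single source of truth."""
    def __init__(self):
        self.data = {}

class PgNode:
    """Stateless front-end over the shared store (in-memory caching ignored)."""
    def __init__(self, store):
        self.store = store
        self.reachable = True  # can the client reach this node?

    def write(self, key, value):
        if not self.reachable:
            raise ConnectionError("node unreachable")
        self.store.data[key] = value

    def read(self, key):
        if not self.reachable:
            raise ConnectionError("node unreachable")
        return self.store.data[key]

def client_write(nodes, key, value):
    """Client-side failover: use the first node it can reach."""
    for node in nodes:
        try:
            node.write(key, value)
            return node
        except ConnectionError:
            continue
    raise RuntimeError("no node reachable")

store = SharedStore()
primary, standby = PgNode(store), PgNode(store)

primary.reachable = False                 # partition between client and primary
used = client_write([primary, standby], "k", 1)
assert used is standby                    # client "promoted" the standby
assert standby.read("k") == 1             # consistent: same underlying data
```

In this model a partition between the PostgreSQL nodes is harmless, but if the shared store itself becomes unreachable, nothing works - which is one way to read "the trade-offs moved to a different layer."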
>>>>
>>>>      Client with awareness of both PostgreSQL nodes
>>>>           |                        |
>>>>           ↓ (partition here)       ↓
>>>>  PostgreSQL Primary         PostgreSQL Standby
>>>>           |                        |
>>>>           └────────────┬───────────┘
>>>>                        ↓
>>>>                 Shared ZFS Pool
>>>>                        |
>>>>           6 Global ZeroFS instances
>>>>
>>>> Best,
>>>> Pierre
>>>>
>>>> On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
>>>> > Hi Seref,
>>>> >
>>>> > For the benchmarks, I used Hetzner's cloud service with the following setup:
>>>> >
>>>> > - A Hetzner S3 bucket in the FSN1 region
>>>> > - A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
>>>> > - 3 ZeroFS NBD devices (same S3 bucket)
>>>> > - A ZFS striped pool with the 3 devices
>>>> > - 200 GB ZFS L2ARC
>>>> > - Postgres configured accordingly memory-wise, as well as with
>>>> >   synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>>>> >
>>>> > Best,
>>>> > Pierre
>>>> >
>>>> > On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>>>> >> Sorry, this was meant to go to the whole group:
>>>> >>
>>>> >> Very interesting! Great work. Can you clarify how exactly you're
>>>> >> running Postgres in your tests? A specific AWS service? What's the test
>>>> >> infrastructure that sits above the file system?
>>>> >>
>>>> >> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pie...@barre.sh> wrote:
>>>> >>> Hi everyone,
>>>> >>>
>>>> >>> I wanted to share a project I've been working on that enables
>>>> >>> PostgreSQL to run on S3 storage while maintaining performance
>>>> >>> comparable to local NVMe. The approach uses block-level access rather
>>>> >>> than trying to map filesystem operations to S3 objects.
>>>> >>>
>>>> >>> ZeroFS: https://github.com/Barre/ZeroFS
>>>> >>>
>>>> >>> # The Architecture
>>>> >>>
>>>> >>> ZeroFS provides NBD (Network Block Device) servers that expose S3
>>>> >>> storage as raw block devices.
>>>> >>> PostgreSQL runs unmodified on ZFS pools built on these block devices:
>>>> >>>
>>>> >>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>> >>>
>>>> >>> By providing block-level access and leveraging ZFS's caching
>>>> >>> capabilities (L2ARC), we can achieve microsecond latencies despite the
>>>> >>> underlying storage being in S3.
>>>> >>>
>>>> >>> ## Performance Results
>>>> >>>
>>>> >>> Here are pgbench results from PostgreSQL running on this setup:
>>>> >>>
>>>> >>> ### Read/Write Workload
>>>> >>>
>>>> >>> ```
>>>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>>>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>> >>> starting vacuum...end.
>>>> >>> transaction type: <builtin: TPC-B (sort of)>
>>>> >>> scaling factor: 50
>>>> >>> query mode: simple
>>>> >>> number of clients: 50
>>>> >>> number of threads: 15
>>>> >>> maximum number of tries: 1
>>>> >>> number of transactions per client: 100000
>>>> >>> number of transactions actually processed: 5000000/5000000
>>>> >>> number of failed transactions: 0 (0.000%)
>>>> >>> latency average = 0.943 ms
>>>> >>> initial connection time = 48.043 ms
>>>> >>> tps = 53041.006947 (without initial connection time)
>>>> >>> ```
>>>> >>>
>>>> >>> ### Read-Only Workload
>>>> >>>
>>>> >>> ```
>>>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>>>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>> >>> starting vacuum...end.
>>>> >>> transaction type: <builtin: select only>
>>>> >>> scaling factor: 50
>>>> >>> query mode: simple
>>>> >>> number of clients: 50
>>>> >>> number of threads: 15
>>>> >>> maximum number of tries: 1
>>>> >>> number of transactions per client: 100000
>>>> >>> number of transactions actually processed: 5000000/5000000
>>>> >>> number of failed transactions: 0 (0.000%)
>>>> >>> latency average = 0.121 ms
>>>> >>> initial connection time = 53.358 ms
>>>> >>> tps = 413436.248089 (without initial connection time)
>>>> >>> ```
>>>> >>>
>>>> >>> These numbers are with 50 concurrent clients and the actual data
>>>> >>> stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory
>>>> >>> caches, while cold data comes from S3.
>>>> >>>
>>>> >>> ## How It Works
>>>> >>>
>>>> >>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS
>>>> >>>    can use like any other block device
>>>> >>> 2. Multiple cache layers hide S3 latency:
>>>> >>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>>> >>>    b. ZeroFS memory cache for metadata and hot data
>>>> >>>    c. Optional local disk cache
>>>> >>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>>>> >>> 4. Files are split into 128KB chunks for insertion into ZeroFS'
>>>> >>>    LSM-tree
>>>> >>>
>>>> >>> ## Geo-Distributed PostgreSQL
>>>> >>>
>>>> >>> Since each region can run its own ZeroFS instance, you can create
>>>> >>> geographically distributed PostgreSQL setups.
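[Editor's note: the 128KB chunking step above can be sketched as follows. The key layout (file_id, chunk_index) and the function name are assumptions for illustration, not ZeroFS's actual on-disk format.]

```python
CHUNK_SIZE = 128 * 1024  # 128 KB, as described in step 4 above

def chunk_file(file_id: int, data: bytes):
    """Split a file's bytes into (key, chunk) pairs for LSM-tree insertion.

    Each chunk is keyed by (file_id, chunk_index) so the tree can locate
    any 128 KB range of the file independently -- illustrative only.
    """
    return [
        ((file_id, i // CHUNK_SIZE), data[i:i + CHUNK_SIZE])
        for i in range(0, len(data), CHUNK_SIZE)
    ]

# A 300 KB file becomes three entries: 128 KB + 128 KB + 44 KB.
entries = chunk_file(file_id=7, data=b"x" * (300 * 1024))
print([len(chunk) for _, chunk in entries])  # [131072, 131072, 45056]
print([key for key, _ in entries])           # [(7, 0), (7, 1), (7, 2)]
```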
>>>> >>>
>>>> >>> Example architectures:
>>>> >>>
>>>> >>> Architecture 1
>>>> >>>
>>>> >>>                PostgreSQL Client
>>>> >>>                       |
>>>> >>>                       | SQL queries
>>>> >>>                       |
>>>> >>>                +--------------+
>>>> >>>                |   PG Proxy   |
>>>> >>>                |  (HAProxy/   |
>>>> >>>                |  PgBouncer)  |
>>>> >>>                +--------------+
>>>> >>>                 /            \
>>>> >>>                /              \
>>>> >>>       Synchronous         Synchronous
>>>> >>>       Replication         Replication
>>>> >>>              /                \
>>>> >>>             /                  \
>>>> >>> +---------------+        +---------------+
>>>> >>> | PostgreSQL 1  |        | PostgreSQL 2  |
>>>> >>> |   (Primary)   |◄------►|   (Standby)   |
>>>> >>> +---------------+        +---------------+
>>>> >>>         |                        |
>>>> >>>         | POSIX filesystem ops   |
>>>> >>>         |                        |
>>>> >>> +---------------+        +---------------+
>>>> >>> |  ZFS Pool 1   |        |  ZFS Pool 2   |
>>>> >>> | (3-way mirror)|        | (3-way mirror)|
>>>> >>> +---------------+        +---------------+
>>>> >>>    /    |    \              /    |    \
>>>> >>>   /     |     \            /     |     \
>>>> >>> NBD:10809 NBD:10810 NBD:10811 NBD:10812 NBD:10813 NBD:10814
>>>> >>>    |        |        |        |        |        |
>>>> >>> +--------++--------++--------++--------++--------++--------+
>>>> >>> |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>>>> >>> +--------++--------++--------++--------++--------++--------+
>>>> >>>    |        |        |        |        |        |
>>>> >>> S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>>>> >>> (us-east)  (eu-west)  (ap-south) (us-west)  (eu-north) (ap-east)
>>>> >>>
>>>> >>> Architecture 2:
>>>> >>>
>>>> >>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>>>> >>>               \                       /
>>>> >>>                \                     /
>>>> >>>              Same ZFS Pool (NBD)
>>>> >>>                       |
>>>> >>>               6 Global ZeroFS
>>>> >>>                       |
>>>> >>>                  S3 Regions
>>>> >>>
>>>> >>>
>>>> >>> The main advantages I see are:
>>>> >>> 1. Dramatic cost reduction for large datasets
>>>> >>> 2. Simplified geo-distribution
>>>> >>> 3. Infinite storage capacity
>>>> >>> 4. Built-in encryption and compression
>>>> >>>
>>>> >>> Looking forward to your feedback and questions!
>>>> >>>
>>>> >>> Best,
>>>> >>> Pierre
>>>> >>>
>>>> >>> P.S.
>>>> >>> The full project includes a custom NFS filesystem too.