Sorry, I was referring to this:

> But when PostgreSQL instances share storage rather than replicate:
> - Consistency seems maintained (same data)
> - Availability seems maintained (client can always promote an accessible node)
> - Partitions between PostgreSQL nodes don't prevent the system from functioning
Some pretty well-known cases of storage / compute separation (Aurora, Neon) also share the storage between instances, which is why I'm a bit confused by your reply. I thought you were considering this approach too, which is why I mentioned the kind of challenges one may face on that path.

On Sat, Jul 26, 2025 at 12:36 AM Pierre Barre <pie...@barre.sh> wrote:
> What you describe doesn't look like something very useful for the vast majority of projects that need a database. Why would you even want that if you can avoid it?
>
> If your “single node” can handle tens or hundreds of thousands of requests per second, while still having very durable and highly available storage as well as fast recovery mechanisms, what's the point?
>
> I am not trying to cater to extreme outliers that may want something very weird like this; those are just not the use cases I want to address, because I believe they are few and far between.
>
> Best,
> Pierre
>
> On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
> Shared storage would require a lot of extra work. That's essentially what AWS Aurora does.
> You would have to build functionality to sync in-memory state between nodes, because all the instances will have cached data that can easily become stale on any write operation.
> That alone is not simple. You would also have to modify some locking logic, and most likely make a lot of other changes in a lot of places; Postgres just wasn't built with the assumption that storage can be shared.
>
> -Vladimir
>
> On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pie...@barre.sh> wrote:
> Now, I'm trying to understand how the CAP theorem applies here. Traditional PostgreSQL replication has clear CAP trade-offs - you choose between consistency and availability during partitions.
>
> But when PostgreSQL instances share storage rather than replicate:
> - Consistency seems maintained (same data)
> - Availability seems maintained (client can always promote an accessible node)
> - Partitions between PostgreSQL nodes don't prevent the system from functioning
>
> It seems that CAP assumes specific implementation details (like nodes maintaining independent state) without explicitly stating them.
>
> How should we think about the CAP theorem when distributed nodes share storage rather than coordinate state? Are the trade-offs simply moved to a different layer, or does shared storage fundamentally change the analysis?
>
> Client with awareness of both PostgreSQL nodes
>        |                         |
>        ↓ (partition here)        ↓
> PostgreSQL Primary        PostgreSQL Standby
>        |                         |
>        └────────────┬────────────┘
>                     ↓
>              Shared ZFS Pool
>                     |
>        6 Global ZeroFS instances
>
> Best,
> Pierre
>
> On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
> > Hi Seref,
> >
> > For the benchmarks, I used Hetzner's cloud service with the following setup:
> >
> > - A Hetzner s3 bucket in the FSN1 region
> > - A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
> > - 3 ZeroFS NBD devices (same s3 bucket)
> > - A ZFS striped pool with the 3 devices
> > - 200GB ZFS L2ARC
> > - Postgres configured accordingly memory-wise, as well as with synchronous_commit = off, wal_init_zero = off and wal_recycle = off
> >
> > Best,
> > Pierre
> >
> > On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
> >> Sorry, this was meant to go to the whole group:
> >>
> >> Very interesting! Great work. Can you clarify how exactly you're running Postgres in your tests? A specific AWS service? What's the test infrastructure that sits above the file system?
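[For anyone trying to reproduce the benchmark layout described above, here is a minimal sketch, assuming three ZeroFS NBD servers are already running (per the ZeroFS README) on local ports 10809-10811. The device names, pool name, L2ARC partition, and $PGDATA path are illustrative assumptions, not details from the thread.]

```
# Attach the three NBD devices exposed by ZeroFS (exact nbd-client syntax
# varies between versions; the ports here are assumptions).
nbd-client 127.0.0.1 10809 /dev/nbd0
nbd-client 127.0.0.1 10810 /dev/nbd1
nbd-client 127.0.0.1 10811 /dev/nbd2

# Striped pool across the three devices, plus a local partition as L2ARC.
zpool create tank /dev/nbd0 /dev/nbd1 /dev/nbd2
zpool add tank cache /dev/nvme0n1p1

# The PostgreSQL settings mentioned in the benchmark description.
cat >> "$PGDATA/postgresql.conf" <<'EOF'
synchronous_commit = off
wal_init_zero = off
wal_recycle = off
EOF
```

[Note that synchronous_commit = off means the most recent commits can be lost on a crash, though data is not corrupted; that is worth keeping in mind when reading the pgbench numbers quoted later in the thread.]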
> >>
> >> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pie...@barre.sh> wrote:
> >>> Hi everyone,
> >>>
> >>> I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.
> >>>
> >>> ZeroFS: https://github.com/Barre/ZeroFS
> >>>
> >>> # The Architecture
> >>>
> >>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:
> >>>
> >>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
> >>>
> >>> By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.
> >>>
> >>> ## Performance Results
> >>>
> >>> Here are pgbench results from PostgreSQL running on this setup:
> >>>
> >>> ### Read/Write Workload
> >>>
> >>> ```
> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
> >>> starting vacuum...end.
> >>> transaction type: <builtin: TPC-B (sort of)>
> >>> scaling factor: 50
> >>> query mode: simple
> >>> number of clients: 50
> >>> number of threads: 15
> >>> maximum number of tries: 1
> >>> number of transactions per client: 100000
> >>> number of transactions actually processed: 5000000/5000000
> >>> number of failed transactions: 0 (0.000%)
> >>> latency average = 0.943 ms
> >>> initial connection time = 48.043 ms
> >>> tps = 53041.006947 (without initial connection time)
> >>> ```
> >>>
> >>> ### Read-Only Workload
> >>>
> >>> ```
> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
> >>> starting vacuum...end.
> >>> transaction type: <builtin: select only>
> >>> scaling factor: 50
> >>> query mode: simple
> >>> number of clients: 50
> >>> number of threads: 15
> >>> maximum number of tries: 1
> >>> number of transactions per client: 100000
> >>> number of transactions actually processed: 5000000/5000000
> >>> number of failed transactions: 0 (0.000%)
> >>> latency average = 0.121 ms
> >>> initial connection time = 53.358 ms
> >>> tps = 413436.248089 (without initial connection time)
> >>> ```
> >>>
> >>> These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.
> >>>
> >>> ## How It Works
> >>>
> >>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
> >>> 2. Multiple cache layers hide S3 latency:
> >>>    a. ZFS ARC/L2ARC for frequently accessed blocks
> >>>    b. ZeroFS memory cache for metadata and hot data
> >>>    c. Optional local disk cache
> >>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
> >>> 4. Files are split into 128KB chunks for insertion into ZeroFS's LSM-tree
> >>>
> >>> ## Geo-Distributed PostgreSQL
> >>>
> >>> Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.
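[A quick, hypothetical way to watch those cache layers at work while a benchmark runs; the pool name "tank" is an assumption, and arcstat ships with OpenZFS on most platforms.]

```
# Confirm the raw NBD block device that ZeroFS exposes
lsblk /dev/nbd0

# Per-vdev I/O for the pool, including any cache (L2ARC) device, every 5s
zpool iostat -v tank 5

# ARC / L2ARC hit rates while pgbench is running
arcstat 5
```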
> >>>
> >>> Example architectures:
> >>>
> >>> Architecture 1:
> >>>
> >>>                  PostgreSQL Client
> >>>                         |
> >>>                         | SQL queries
> >>>                         |
> >>>                  +--------------+
> >>>                  |   PG Proxy   |
> >>>                  |  (HAProxy/   |
> >>>                  |  PgBouncer)  |
> >>>                  +--------------+
> >>>                    /          \
> >>>                   /            \
> >>>          Synchronous        Synchronous
> >>>          Replication        Replication
> >>>                 /                \
> >>>                /                  \
> >>>    +---------------+        +---------------+
> >>>    | PostgreSQL 1  |        | PostgreSQL 2  |
> >>>    |   (Primary)   |◄------►|   (Standby)   |
> >>>    +---------------+        +---------------+
> >>>            |                        |
> >>>            |  POSIX filesystem ops  |
> >>>            |                        |
> >>>    +---------------+        +---------------+
> >>>    |  ZFS Pool 1   |        |  ZFS Pool 2   |
> >>>    | (3-way mirror)|        | (3-way mirror)|
> >>>    +---------------+        +---------------+
> >>>       /    |    \              /    |    \
> >>>      /     |     \            /     |     \
> >>> NBD:10809 NBD:10810 NBD:10811 NBD:10812 NBD:10813 NBD:10814
> >>>     |         |         |         |         |         |
> >>> +--------++--------++--------++--------++--------++--------+
> >>> |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
> >>> +--------++--------++--------++--------++--------++--------+
> >>>     |         |         |         |         |         |
> >>> S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
> >>>  (us-east)  (eu-west) (ap-south)  (us-west) (eu-north)  (ap-east)
> >>>
> >>> Architecture 2:
> >>>
> >>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
> >>>               \                        /
> >>>                \                      /
> >>>                 Same ZFS Pool (NBD)
> >>>                         |
> >>>                  6 Global ZeroFS
> >>>                         |
> >>>                     S3 Regions
> >>>
> >>> The main advantages I see are:
> >>> 1. Dramatic cost reduction for large datasets
> >>> 2. Simplified geo-distribution
> >>> 3. Infinite storage capacity
> >>> 4. Built-in encryption and compression
> >>>
> >>> Looking forward to your feedback and questions!
> >>>
> >>> Best,
> >>> Pierre
> >>>
> >>> P.S. The full project includes a custom NFS filesystem too.
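[As a companion to Architecture 1, a hypothetical sketch of the synchronous-replication leg between the two PostgreSQL nodes. The host name, replication role, application_name, and $PGDATA placeholder are assumptions, and none of this is specific to ZeroFS; it is plain PostgreSQL streaming replication.]

```
# On the primary (PostgreSQL 1): wait for the standby before acknowledging commits.
cat >> "$PGDATA/postgresql.conf" <<'EOF'
synchronous_standby_names = 'pg2'   # must match the standby's application_name
synchronous_commit = on             # stricter than the single-node benchmark setting
EOF

# On the standby (PostgreSQL 2): stream from the primary, on top of its own ZFS/ZeroFS pool.
cat >> "$PGDATA/postgresql.conf" <<'EOF'
primary_conninfo = 'host=pg1.example.com user=replicator application_name=pg2'
EOF
touch "$PGDATA/standby.signal"
```

[Architecture 2 skips this replication layer entirely because both nodes mount the same ZFS pool, which is what the CAP discussion earlier in the thread is about.]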