Sorry, I was referring to this:

> But when PostgreSQL instances share storage rather than replicate:
> - Consistency seems maintained (same data)
> - Availability seems maintained (client can always promote an accessible node)
> - Partitions between PostgreSQL nodes don't prevent the system from functioning
Some pretty well-known cases of storage / compute separation (Aurora, Neon) also share the storage between instances, which is why I'm a bit confused by your reply. I thought you were considering this approach too, which is why I mentioned the kind of challenges one may face on that path.

On Sat, Jul 26, 2025 at 12:36 AM Pierre Barre <pie...@barre.sh> wrote:
> What you describe doesn't look like something very useful for the vast majority of projects that need a database. Why would you even want that if you can avoid it?
>
> If your “single node” can handle tens or hundreds of thousands of requests per second, while still having very durable and highly available storage as well as fast recovery mechanisms, what's the point?
>
> I am not trying to cater to extreme outliers that may want something very weird like this; those are just not the use cases I want to address, because I believe they are few and far between.
>
> Best,
> Pierre
>
> On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
> Shared storage would require a lot of extra work. That's essentially what AWS Aurora does.
> You would have to build functionality to sync in-memory state between nodes, because all the instances will have cached data that can easily become stale on any write operation.
> That alone is not simple. You would also have to modify some locking logic, and most likely make a lot of other changes in a lot of places; Postgres just wasn't built with the assumption that storage can be shared.
>
> -Vladimir
>
> On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pie...@barre.sh> wrote:
> Now, I'm trying to understand how the CAP theorem applies here. Traditional PostgreSQL replication has clear CAP trade-offs - you choose between consistency and availability during partitions.
>
> But when PostgreSQL instances share storage rather than replicate:
> - Consistency seems maintained (same data)
> - Availability seems maintained (client can always promote an accessible node)
> - Partitions between PostgreSQL nodes don't prevent the system from functioning
>
> It seems that CAP assumes specific implementation details (like nodes maintaining independent state) without explicitly stating them.
>
> How should we think about the CAP theorem when distributed nodes share storage rather than coordinate state? Are the trade-offs simply moved to a different layer, or does shared storage fundamentally change the analysis?
>
> Client with awareness of both PostgreSQL nodes
>        |                         |
>        ↓ (partition here)        ↓
> PostgreSQL Primary        PostgreSQL Standby
>        |                         |
>        └────────────┬────────────┘
>                     ↓
>              Shared ZFS Pool
>                     |
>        6 Global ZeroFS instances
>
> Best,
> Pierre
>
> On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
> > Hi Seref,
> >
> > For the benchmarks, I used Hetzner's cloud service with the following setup:
> >
> > - A Hetzner s3 bucket in the FSN1 region
> > - A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
> > - 3 ZeroFS NBD devices (same s3 bucket)
> > - A ZFS striped pool with the 3 devices
> > - 200GB ZFS L2ARC
> > - Postgres configured accordingly memory-wise, as well as with synchronous_commit = off, wal_init_zero = off and wal_recycle = off
> >
> > Best,
> > Pierre
> >
> > On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
> >> Sorry, this was meant to go to the whole group:
> >>
> >> Very interesting! Great work. Can you clarify how exactly you're running Postgres in your tests? A specific AWS service? What's the test infrastructure that sits above the file system?
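[For anyone trying to reproduce the benchmark layout described above, here is a minimal sketch, assuming three ZeroFS NBD servers are already running (per the ZeroFS README) on local ports 10809-10811. The device names, pool name, L2ARC partition, and $PGDATA path are illustrative assumptions, not details from the thread.]

```
# Attach the three NBD devices exposed by ZeroFS (exact nbd-client syntax
# varies between versions; the ports here are assumptions).
nbd-client 127.0.0.1 10809 /dev/nbd0
nbd-client 127.0.0.1 10810 /dev/nbd1
nbd-client 127.0.0.1 10811 /dev/nbd2

# Striped pool across the three devices, plus a local partition as L2ARC.
zpool create tank /dev/nbd0 /dev/nbd1 /dev/nbd2
zpool add tank cache /dev/nvme0n1p1

# The PostgreSQL settings mentioned in the benchmark description.
cat >> "$PGDATA/postgresql.conf" <<'EOF'
synchronous_commit = off
wal_init_zero = off
wal_recycle = off
EOF
```

[Note that synchronous_commit = off means the most recent commits can be lost on a crash, though data is not corrupted; that is worth keeping in mind when reading the pgbench numbers quoted later in the thread.]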
> >>
> >> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pie...@barre.sh> wrote:
> >>> Hi everyone,
> >>>
> >>> I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.
> >>>
> >>> ZeroFS: https://github.com/Barre/ZeroFS
> >>>
> >>> # The Architecture
> >>>
> >>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:
> >>>
> >>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
> >>>
> >>> By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.
> >>>
> >>> ## Performance Results
> >>>
> >>> Here are pgbench results from PostgreSQL running on this setup:
> >>>
> >>> ### Read/Write Workload
> >>>
> >>> ```
> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
> >>> starting vacuum...end.
> >>> transaction type: <builtin: TPC-B (sort of)>
> >>> scaling factor: 50
> >>> query mode: simple
> >>> number of clients: 50
> >>> number of threads: 15
> >>> maximum number of tries: 1
> >>> number of transactions per client: 100000
> >>> number of transactions actually processed: 5000000/5000000
> >>> number of failed transactions: 0 (0.000%)
> >>> latency average = 0.943 ms
> >>> initial connection time = 48.043 ms
> >>> tps = 53041.006947 (without initial connection time)
> >>> ```
> >>>
> >>> ### Read-Only Workload
> >>>
> >>> ```
> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
> >>> starting vacuum...end.
> >>> transaction type: <builtin: select only>
> >>> scaling factor: 50
> >>> query mode: simple
> >>> number of clients: 50
> >>> number of threads: 15
> >>> maximum number of tries: 1
> >>> number of transactions per client: 100000
> >>> number of transactions actually processed: 5000000/5000000
> >>> number of failed transactions: 0 (0.000%)
> >>> latency average = 0.121 ms
> >>> initial connection time = 53.358 ms
> >>> tps = 413436.248089 (without initial connection time)
> >>> ```
> >>>
> >>> These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.
> >>>
> >>> ## How It Works
> >>>
> >>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
> >>> 2. Multiple cache layers hide S3 latency:
> >>>    a. ZFS ARC/L2ARC for frequently accessed blocks
> >>>    b. ZeroFS memory cache for metadata and hot data
> >>>    c. Optional local disk cache
> >>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
> >>> 4. Files are split into 128KB chunks for insertion into ZeroFS's LSM-tree
> >>>
> >>> ## Geo-Distributed PostgreSQL
> >>>
> >>> Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.
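[A quick, hypothetical way to watch those cache layers at work while a benchmark runs; the pool name "tank" is an assumption, and arcstat ships with OpenZFS on most platforms.]

```
# Confirm the raw NBD block device that ZeroFS exposes
lsblk /dev/nbd0

# Per-vdev I/O for the pool, including any cache (L2ARC) device, every 5s
zpool iostat -v tank 5

# ARC / L2ARC hit rates while pgbench is running
arcstat 5
```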
> >>>
> >>> Example architectures:
> >>>
> >>> Architecture 1:
> >>>
> >>>                  PostgreSQL Client
> >>>                         |
> >>>                         | SQL queries
> >>>                         |
> >>>                  +--------------+
> >>>                  |   PG Proxy   |
> >>>                  |  (HAProxy/   |
> >>>                  |  PgBouncer)  |
> >>>                  +--------------+
> >>>                    /          \
> >>>                   /            \
> >>>          Synchronous        Synchronous
> >>>          Replication        Replication
> >>>                 /                \
> >>>                /                  \
> >>>    +---------------+        +---------------+
> >>>    | PostgreSQL 1  |        | PostgreSQL 2  |
> >>>    |   (Primary)   |◄------►|   (Standby)   |
> >>>    +---------------+        +---------------+
> >>>            |                        |
> >>>            |  POSIX filesystem ops  |
> >>>            |                        |
> >>>    +---------------+        +---------------+
> >>>    |  ZFS Pool 1   |        |  ZFS Pool 2   |
> >>>    | (3-way mirror)|        | (3-way mirror)|
> >>>    +---------------+        +---------------+
> >>>       /    |    \              /    |    \
> >>>      /     |     \            /     |     \
> >>> NBD:10809 NBD:10810 NBD:10811 NBD:10812 NBD:10813 NBD:10814
> >>>     |         |         |         |         |         |
> >>> +--------++--------++--------++--------++--------++--------+
> >>> |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
> >>> +--------++--------++--------++--------++--------++--------+
> >>>     |         |         |         |         |         |
> >>> S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
> >>>  (us-east)  (eu-west) (ap-south)  (us-west) (eu-north)  (ap-east)
> >>>
> >>> Architecture 2:
> >>>
> >>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
> >>>               \                        /
> >>>                \                      /
> >>>                 Same ZFS Pool (NBD)
> >>>                         |
> >>>                  6 Global ZeroFS
> >>>                         |
> >>>                     S3 Regions
> >>>
> >>> The main advantages I see are:
> >>> 1. Dramatic cost reduction for large datasets
> >>> 2. Simplified geo-distribution
> >>> 3. Infinite storage capacity
> >>> 4. Built-in encryption and compression
> >>>
> >>> Looking forward to your feedback and questions!
> >>>
> >>> Best,
> >>> Pierre
> >>>
> >>> P.S. The full project includes a custom NFS filesystem too.
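[As a companion to Architecture 1, a hypothetical sketch of the synchronous-replication leg between the two PostgreSQL nodes. The host name, replication role, application_name, and $PGDATA placeholder are assumptions, and none of this is specific to ZeroFS; it is plain PostgreSQL streaming replication.]

```
# On the primary (PostgreSQL 1): wait for the standby before acknowledging commits.
cat >> "$PGDATA/postgresql.conf" <<'EOF'
synchronous_standby_names = 'pg2'   # must match the standby's application_name
synchronous_commit = on             # stricter than the single-node benchmark setting
EOF

# On the standby (PostgreSQL 2): stream from the primary, on top of its own ZFS/ZeroFS pool.
cat >> "$PGDATA/postgresql.conf" <<'EOF'
primary_conninfo = 'host=pg1.example.com user=replicator application_name=pg2'
EOF
touch "$PGDATA/standby.signal"
```

[Architecture 2 skips this replication layer entirely because both nodes mount the same ZFS pool, which is what the CAP discussion earlier in the thread is about.]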