Ah, by "shared storage" I mean that each node can acquire exclusivity, not that they can both R/W to it at the same time.
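To make that concrete, here is a minimal sketch of "each node can acquire exclusivity" using a POSIX advisory lock on a lease file. The lease-file path and function names are illustrative only, not how ZeroFS/ZFS actually arbitrate ownership:

```python
import fcntl
import os
import tempfile

def try_acquire_exclusive(lease_path: str):
    """Try to take an exclusive, non-blocking advisory lock on a lease file.

    Returns the open file object on success (the lock is held as long as
    the file stays open), or None if another node already holds it.
    """
    f = open(lease_path, "w")
    try:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f
    except BlockingIOError:
        f.close()
        return None

# Demo: the first "node" wins; the second is refused until the first releases.
lease = os.path.join(tempfile.gettempdir(), "zerofs-pool.lease")
node_a = try_acquire_exclusive(lease)
node_b = try_acquire_exclusive(lease)
assert node_a is not None   # node A owns the pool
assert node_b is None       # node B cannot take it concurrently
node_a.close()              # releasing lets another node take over
```

The point is just that exclusivity is an arbitration problem (one writer at a time), not a shared-R/W problem.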
> Some pretty well-known cases of storage / compute separation (Aurora, Neon)
> also share the storage between instances,

That model is cool, but I think it's more of a solution for outliers, as I was suggesting, not something that most would or should want.

Best,
Pierre

On Sat, Jul 26, 2025, at 09:42, Vladimir Churyukin wrote:
> Sorry, I was referring to this:
>
> > But when PostgreSQL instances share storage rather than replicate:
> > - Consistency seems maintained (same data)
> > - Availability seems maintained (client can always promote an accessible node)
> > - Partitions between PostgreSQL nodes don't prevent the system from functioning
>
> Some pretty well-known cases of storage / compute separation (Aurora, Neon)
> also share the storage between instances,
> that's why I'm a bit confused by your reply. I thought you were thinking about
> this approach too, which is why I mentioned the kind of challenges one may
> run into on that path.
>
> On Sat, Jul 26, 2025 at 12:36 AM Pierre Barre <pie...@barre.sh> wrote:
>> What you describe doesn't look like something very useful for the vast
>> majority of projects that need a database. Why would you even want that if
>> you can avoid it?
>>
>> If your "single node" can handle tens / hundreds of thousands of requests per
>> second, still has very durable and highly available storage, as well as
>> fast recovery mechanisms, what's the point?
>>
>> I am not trying to cater to extreme outliers that may want something very
>> weird like this; that's just not the set of use-cases I want to address,
>> because I believe they are few and far between.
>>
>> Best,
>> Pierre
>>
>> On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
>>> A shared storage would require a lot of extra work. That's essentially what
>>> AWS Aurora does.
>>> You will have to build functionality to sync in-memory state between nodes,
>>> because all the instances will have cached data that can easily become
>>> stale on any write operation.
>>> That alone is not that simple. You will have to modify some locking logic,
>>> and most likely make a lot of other changes in a lot of places; Postgres was
>>> just not built with the assumption that the storage can be shared.
>>>
>>> -Vladimir
>>>
>>> On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pie...@barre.sh> wrote:
>>>> Now, I'm trying to understand how the CAP theorem applies here. Traditional
>>>> PostgreSQL replication has clear CAP trade-offs - you choose between
>>>> consistency and availability during partitions.
>>>>
>>>> But when PostgreSQL instances share storage rather than replicate:
>>>> - Consistency seems maintained (same data)
>>>> - Availability seems maintained (client can always promote an accessible node)
>>>> - Partitions between PostgreSQL nodes don't prevent the system from functioning
>>>>
>>>> It seems that CAP assumes specific implementation details (like nodes
>>>> maintaining independent state) without explicitly stating them.
>>>>
>>>> How should we think about the CAP theorem when distributed nodes share storage
>>>> rather than coordinate state? Are the trade-offs simply moved to a
>>>> different layer, or does shared storage fundamentally change the analysis?
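[Editor's note: the shared-storage question above can be pictured as a toy model. All names here are hypothetical, not ZeroFS code: nodes are stateless front-ends over one shared store (caching ignored), and the client fails over to whichever node it can still reach. The CAP trade-off doesn't disappear; it moves down into the shared store's own availability.]

```python
class SharedStore:
    """Toy stand-in for the shared ZFS/ZeroFS pool: single source of truth."""
    def __init__(self):
        self.data = {}

class PgNode:
    """Stateless front-end over the shared store (in-memory caching ignored)."""
    def __init__(self, store):
        self.store = store
        self.reachable = True  # can the client reach this node?

    def write(self, key, value):
        if not self.reachable:
            raise ConnectionError("node unreachable")
        self.store.data[key] = value

    def read(self, key):
        if not self.reachable:
            raise ConnectionError("node unreachable")
        return self.store.data[key]

def client_write(nodes, key, value):
    """Client-side failover: use the first node it can reach."""
    for node in nodes:
        try:
            node.write(key, value)
            return node
        except ConnectionError:
            continue
    raise RuntimeError("no node reachable")

store = SharedStore()
primary, standby = PgNode(store), PgNode(store)

primary.reachable = False                 # partition between client and primary
used = client_write([primary, standby], "k", 1)
assert used is standby                    # client "promoted" the standby
assert standby.read("k") == 1             # consistent: same underlying data
```

In this model a partition between the PostgreSQL nodes is harmless, but if the shared store itself becomes unreachable, nothing works - which is one way to read "the trade-offs moved to a different layer."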
>>>>
>>>>      Client with awareness of both PostgreSQL nodes
>>>>           |                        |
>>>>           ↓ (partition here)       ↓
>>>>  PostgreSQL Primary         PostgreSQL Standby
>>>>           |                        |
>>>>           └────────────┬───────────┘
>>>>                        ↓
>>>>                 Shared ZFS Pool
>>>>                        |
>>>>           6 Global ZeroFS instances
>>>>
>>>> Best,
>>>> Pierre
>>>>
>>>> On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
>>>> > Hi Seref,
>>>> >
>>>> > For the benchmarks, I used Hetzner's cloud service with the following setup:
>>>> >
>>>> > - A Hetzner S3 bucket in the FSN1 region
>>>> > - A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
>>>> > - 3 ZeroFS NBD devices (same S3 bucket)
>>>> > - A ZFS striped pool with the 3 devices
>>>> > - 200 GB ZFS L2ARC
>>>> > - Postgres configured accordingly memory-wise, as well as with
>>>> >   synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>>>> >
>>>> > Best,
>>>> > Pierre
>>>> >
>>>> > On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>>>> >> Sorry, this was meant to go to the whole group:
>>>> >>
>>>> >> Very interesting! Great work. Can you clarify how exactly you're
>>>> >> running Postgres in your tests? A specific AWS service? What's the test
>>>> >> infrastructure that sits above the file system?
>>>> >>
>>>> >> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pie...@barre.sh> wrote:
>>>> >>> Hi everyone,
>>>> >>>
>>>> >>> I wanted to share a project I've been working on that enables
>>>> >>> PostgreSQL to run on S3 storage while maintaining performance
>>>> >>> comparable to local NVMe. The approach uses block-level access rather
>>>> >>> than trying to map filesystem operations to S3 objects.
>>>> >>>
>>>> >>> ZeroFS: https://github.com/Barre/ZeroFS
>>>> >>>
>>>> >>> # The Architecture
>>>> >>>
>>>> >>> ZeroFS provides NBD (Network Block Device) servers that expose S3
>>>> >>> storage as raw block devices.
>>>> >>> PostgreSQL runs unmodified on ZFS pools built on these block devices:
>>>> >>>
>>>> >>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>> >>>
>>>> >>> By providing block-level access and leveraging ZFS's caching
>>>> >>> capabilities (L2ARC), we can achieve microsecond latencies despite the
>>>> >>> underlying storage being in S3.
>>>> >>>
>>>> >>> ## Performance Results
>>>> >>>
>>>> >>> Here are pgbench results from PostgreSQL running on this setup:
>>>> >>>
>>>> >>> ### Read/Write Workload
>>>> >>>
>>>> >>> ```
>>>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>>>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>> >>> starting vacuum...end.
>>>> >>> transaction type: <builtin: TPC-B (sort of)>
>>>> >>> scaling factor: 50
>>>> >>> query mode: simple
>>>> >>> number of clients: 50
>>>> >>> number of threads: 15
>>>> >>> maximum number of tries: 1
>>>> >>> number of transactions per client: 100000
>>>> >>> number of transactions actually processed: 5000000/5000000
>>>> >>> number of failed transactions: 0 (0.000%)
>>>> >>> latency average = 0.943 ms
>>>> >>> initial connection time = 48.043 ms
>>>> >>> tps = 53041.006947 (without initial connection time)
>>>> >>> ```
>>>> >>>
>>>> >>> ### Read-Only Workload
>>>> >>>
>>>> >>> ```
>>>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>>>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>> >>> starting vacuum...end.
>>>> >>> transaction type: <builtin: select only>
>>>> >>> scaling factor: 50
>>>> >>> query mode: simple
>>>> >>> number of clients: 50
>>>> >>> number of threads: 15
>>>> >>> maximum number of tries: 1
>>>> >>> number of transactions per client: 100000
>>>> >>> number of transactions actually processed: 5000000/5000000
>>>> >>> number of failed transactions: 0 (0.000%)
>>>> >>> latency average = 0.121 ms
>>>> >>> initial connection time = 53.358 ms
>>>> >>> tps = 413436.248089 (without initial connection time)
>>>> >>> ```
>>>> >>>
>>>> >>> These numbers are with 50 concurrent clients and the actual data
>>>> >>> stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory
>>>> >>> caches, while cold data comes from S3.
>>>> >>>
>>>> >>> ## How It Works
>>>> >>>
>>>> >>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS
>>>> >>>    can use like any other block device
>>>> >>> 2. Multiple cache layers hide S3 latency:
>>>> >>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>>> >>>    b. ZeroFS memory cache for metadata and hot data
>>>> >>>    c. Optional local disk cache
>>>> >>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>>>> >>> 4. Files are split into 128KB chunks for insertion into ZeroFS'
>>>> >>>    LSM-tree
>>>> >>>
>>>> >>> ## Geo-Distributed PostgreSQL
>>>> >>>
>>>> >>> Since each region can run its own ZeroFS instance, you can create
>>>> >>> geographically distributed PostgreSQL setups.
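[Editor's note: the 128KB chunking step above can be sketched as follows. The key layout (file_id, chunk_index) and the function name are assumptions for illustration, not ZeroFS's actual on-disk format.]

```python
CHUNK_SIZE = 128 * 1024  # 128 KB, as described in step 4 above

def chunk_file(file_id: int, data: bytes):
    """Split a file's bytes into (key, chunk) pairs for LSM-tree insertion.

    Each chunk is keyed by (file_id, chunk_index) so the tree can locate
    any 128 KB range of the file independently -- illustrative only.
    """
    return [
        ((file_id, i // CHUNK_SIZE), data[i:i + CHUNK_SIZE])
        for i in range(0, len(data), CHUNK_SIZE)
    ]

# A 300 KB file becomes three entries: 128 KB + 128 KB + 44 KB.
entries = chunk_file(file_id=7, data=b"x" * (300 * 1024))
print([len(chunk) for _, chunk in entries])  # [131072, 131072, 45056]
print([key for key, _ in entries])           # [(7, 0), (7, 1), (7, 2)]
```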
>>>> >>>
>>>> >>> Example architectures:
>>>> >>>
>>>> >>> Architecture 1
>>>> >>>
>>>> >>>                PostgreSQL Client
>>>> >>>                       |
>>>> >>>                       | SQL queries
>>>> >>>                       |
>>>> >>>                +--------------+
>>>> >>>                |   PG Proxy   |
>>>> >>>                |  (HAProxy/   |
>>>> >>>                |  PgBouncer)  |
>>>> >>>                +--------------+
>>>> >>>                 /            \
>>>> >>>                /              \
>>>> >>>       Synchronous         Synchronous
>>>> >>>       Replication         Replication
>>>> >>>              /                \
>>>> >>>             /                  \
>>>> >>> +---------------+        +---------------+
>>>> >>> | PostgreSQL 1  |        | PostgreSQL 2  |
>>>> >>> |   (Primary)   |◄------►|   (Standby)   |
>>>> >>> +---------------+        +---------------+
>>>> >>>         |                        |
>>>> >>>         | POSIX filesystem ops   |
>>>> >>>         |                        |
>>>> >>> +---------------+        +---------------+
>>>> >>> |  ZFS Pool 1   |        |  ZFS Pool 2   |
>>>> >>> | (3-way mirror)|        | (3-way mirror)|
>>>> >>> +---------------+        +---------------+
>>>> >>>    /    |    \              /    |    \
>>>> >>>   /     |     \            /     |     \
>>>> >>> NBD:10809 NBD:10810 NBD:10811 NBD:10812 NBD:10813 NBD:10814
>>>> >>>    |        |        |        |        |        |
>>>> >>> +--------++--------++--------++--------++--------++--------+
>>>> >>> |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>>>> >>> +--------++--------++--------++--------++--------++--------+
>>>> >>>    |        |        |        |        |        |
>>>> >>> S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>>>> >>> (us-east)  (eu-west)  (ap-south) (us-west)  (eu-north) (ap-east)
>>>> >>>
>>>> >>> Architecture 2:
>>>> >>>
>>>> >>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>>>> >>>               \                       /
>>>> >>>                \                     /
>>>> >>>              Same ZFS Pool (NBD)
>>>> >>>                       |
>>>> >>>               6 Global ZeroFS
>>>> >>>                       |
>>>> >>>                  S3 Regions
>>>> >>>
>>>> >>>
>>>> >>> The main advantages I see are:
>>>> >>> 1. Dramatic cost reduction for large datasets
>>>> >>> 2. Simplified geo-distribution
>>>> >>> 3. Infinite storage capacity
>>>> >>> 4. Built-in encryption and compression
>>>> >>>
>>>> >>> Looking forward to your feedback and questions!
>>>> >>>
>>>> >>> Best,
>>>> >>> Pierre
>>>> >>>
>>>> >>> P.S.
>>>> >>> The full project includes a custom NFS filesystem too.