Repeating what was mentioned on Twitter, because I have some experience with this topic. With fewer files per table there will be more contention on the per-inode mutex (which might now be a per-inode rwsem). I haven't read filesystem source in a long time, but back in the day, and perhaps still today, the lock was held within the kernel for the duration of a write to storage and was held only briefly while setting up a read.
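To make the contention pattern concrete, here is a minimal sketch (hypothetical helper names, not a benchmark): several threads pwrite() to disjoint offsets of a single file. With buffered I/O, many filesystems serialize those writes on the per-inode lock even though the byte ranges don't overlap, whereas one-file-per-segment spreads the same writes across inodes. Actual locking behavior depends on the filesystem and kernel version.

```python
import os
import tempfile
import threading

def write_region(fd, offset, data):
    # With buffered I/O, each write may hold the per-inode lock for
    # the duration of the write, serializing otherwise-independent
    # writers to the same file.
    os.pwrite(fd, data, offset)

def concurrent_writes_one_inode(nthreads=4, chunk=4096):
    # All threads target ONE inode at disjoint offsets -- the pattern
    # that a single large file per relation would produce.
    fd, path = tempfile.mkstemp()
    try:
        threads = [
            threading.Thread(target=write_region,
                             args=(fd, i * chunk, bytes([65 + i]) * chunk))
            for i in range(nthreads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return os.fstat(fd).st_size  # nthreads * chunk bytes written
    finally:
        os.close(fd)
        os.unlink(path)
```

The writes are logically independent, so whether they actually run concurrently comes down to the kernel's per-inode locking, which is exactly the concern above.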
The workaround for writes was one of:
1) enable the disk write cache or use battery-backed HW RAID to make writes faster (yes, disks -- I encountered this prior to 2010)
2) use XFS and O_DIRECT, in which case the per-inode mutex (rwsem) wasn't locked for the duration of a write

I have a vague memory that filesystems have improved in this regard.

On Thu, May 11, 2023 at 4:38 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski <jimm...@gmail.com> wrote:
> > On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> >> I am not aware of any modern/non-historic filesystem[2] that can't do
> >> large files with ease. Anyone know of anything to worry about on that
> >> front?
> >
> > There is some trouble in the ambiguity of what we mean by "modern" and
> > "large files". There are still a large number of users of ext4 where the
> > max file size is 16TB. Switching to a single large file per relation would
> > effectively cut the max table size in half for those users. How would a
> > user with, say, a 20TB table running on ext4 be impacted by this change?
>
> Hrmph. Yeah, that might be a bit of a problem. I see it discussed in
> various places that MySQL/InnoDB can't have tables bigger than 16TB on
> ext4 because of this, when it's in its default one-file-per-object
> mode (as opposed to its big-tablespace-files-to-hold-all-the-objects
> mode like DB2, Oracle etc, in which case I think you can have multiple
> 16TB segment files and get past that ext4 limit). It's frustrating
> because 16TB is still really, really big and you probably should be
> using partitions, or more partitions, to avoid all kinds of other
> scalability problems at that size. But however hypothetical the
> scenario might be, it should work, and this is certainly a plausible
> argument against the "aggressive" plan described above with the hard
> cut-off where we get to drop the segmented mode.
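A quick sanity check on those numbers, as a sketch. It assumes PostgreSQL's default 1GB segment size and ext4's 16TB max file size (the limit for 4K blocks); both are configurable in principle.

```python
TB = 1024 ** 4
GB = 1024 ** 3

SEGMENT_SIZE = 1 * GB      # PostgreSQL default segment size
EXT4_MAX_FILE = 16 * TB    # ext4 file size limit with 4K blocks

def segments_needed(table_bytes):
    # Number of 1GB segment files for a relation of this size
    # (ceiling division).
    return -(-table_bytes // SEGMENT_SIZE)

def fits_as_single_file(table_bytes):
    # Could this relation be stored as one file on ext4?
    return table_bytes <= EXT4_MAX_FILE

print(segments_needed(16 * TB))       # 16384
print(fits_as_single_file(20 * TB))   # False
```

So a 16TB relation is on the order of 16,000 segment files today, yet 20TB simply cannot be one file on ext4, which is the tension described above: the users with the most segment files are the ones the single-file format would lock out.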
>
> Concretely, a 20TB pg_upgrade in copy mode would fail while trying to
> concatenate with the above patches, so you'd have to use link or
> reflink mode (you'd probably want to use that anyway due to the sheer
> volume of data to copy otherwise, since ext4 is also not capable of
> block-range sharing), but then you'd be out of luck after N future
> major releases, according to that plan where we start deleting the
> code, so you'd need to organise some smaller partitions before that
> time comes. Or pg_upgrade to a target on xfs etc. I wonder if a
> future version of extN will increase its max file size.
>
> A less aggressive version of the plan would be that we just keep the
> segment code for the foreseeable future with no planned cut-off, and
> we make all of those "piggy back" transformations that I showed in the
> patch set optional. For example, I had it so that CLUSTER would
> quietly convert your relation to large format if it was still in
> segmented format (might as well if you're writing all the data out
> anyway, right?), but perhaps that could depend on a GUC. Likewise for
> base backup. Etc. Then someone concerned about hitting the 16TB
> limit on ext4 could opt out. Or something like that. It seems funny
> though, that's exactly the user who should want this feature (they
> have 16,000 relation segment files).

--
Mark Callaghan
mdcal...@gmail.com