Repeating what was mentioned on Twitter, because I have some experience with this topic. With fewer files per table there will be more contention on the per-inode mutex (which might now be a per-inode rwsem). I haven't read filesystem source in a long time, but back in the day, and perhaps still today, the lock was held within the kernel for the duration of a write to storage and was held only briefly while setting up a read.
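To make the contention pattern concrete, here is a minimal sketch (hypothetical helper names, not a benchmark): several threads pwrite() to disjoint offsets of a single file. With buffered I/O, many filesystems serialize those writes on the per-inode lock even though the byte ranges don't overlap, whereas one-file-per-segment spreads the same writes across inodes. Actual locking behavior depends on the filesystem and kernel version.

```python
import os
import tempfile
import threading

def write_region(fd, offset, data):
    # With buffered I/O, each write may hold the per-inode lock for
    # the duration of the write, serializing otherwise-independent
    # writers to the same file.
    os.pwrite(fd, data, offset)

def concurrent_writes_one_inode(nthreads=4, chunk=4096):
    # All threads target ONE inode at disjoint offsets -- the pattern
    # that a single large file per relation would produce.
    fd, path = tempfile.mkstemp()
    try:
        threads = [
            threading.Thread(target=write_region,
                             args=(fd, i * chunk, bytes([65 + i]) * chunk))
            for i in range(nthreads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return os.fstat(fd).st_size  # nthreads * chunk bytes written
    finally:
        os.close(fd)
        os.unlink(path)
```

The writes are logically independent, so whether they actually run concurrently comes down to the kernel's per-inode locking, which is exactly the concern above.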
The workaround for writes was one of:
1) enable the disk write cache or use battery-backed HW RAID to make writes faster (yes, disks -- I encountered this prior to 2010)
2) use XFS and O_DIRECT, in which case the per-inode mutex (rwsem) wasn't locked for the duration of a write

I have a vague memory that filesystems have improved in this regard.

On Thu, May 11, 2023 at 4:38 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski <jimm...@gmail.com> wrote:
> > On Mon, May 1, 2023 at 9:29 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> >> I am not aware of any modern/non-historic filesystem[2] that can't do
> >> large files with ease. Anyone know of anything to worry about on that
> >> front?
> >
> > There is some trouble in the ambiguity of what we mean by "modern" and
> > "large files". There are still a large number of users of ext4 where the
> > max file size is 16TB. Switching to a single large file per relation would
> > effectively cut the max table size in half for those users. How would a
> > user with, say, a 20TB table running on ext4 be impacted by this change?
>
> Hrmph. Yeah, that might be a bit of a problem. I see it discussed in
> various places that MySQL/InnoDB can't have tables bigger than 16TB on
> ext4 because of this, when it's in its default one-file-per-object
> mode (as opposed to its big-tablespace-files-to-hold-all-the-objects
> mode like DB2, Oracle etc, in which case I think you can have multiple
> 16TB segment files and get past that ext4 limit). It's frustrating
> because 16TB is still really, really big and you probably should be
> using partitions, or more partitions, to avoid all kinds of other
> scalability problems at that size. But however hypothetical the
> scenario might be, it should work, and this is certainly a plausible
> argument against the "aggressive" plan described above with the hard
> cut-off where we get to drop the segmented mode.
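A quick sanity check on those numbers, as a sketch. It assumes PostgreSQL's default 1GB segment size and ext4's 16TB max file size (the limit for 4K blocks); both are configurable in principle.

```python
TB = 1024 ** 4
GB = 1024 ** 3

SEGMENT_SIZE = 1 * GB      # PostgreSQL default segment size
EXT4_MAX_FILE = 16 * TB    # ext4 file size limit with 4K blocks

def segments_needed(table_bytes):
    # Number of 1GB segment files for a relation of this size
    # (ceiling division).
    return -(-table_bytes // SEGMENT_SIZE)

def fits_as_single_file(table_bytes):
    # Could this relation be stored as one file on ext4?
    return table_bytes <= EXT4_MAX_FILE

print(segments_needed(16 * TB))       # 16384
print(fits_as_single_file(20 * TB))   # False
```

So a 16TB relation is on the order of 16,000 segment files today, yet 20TB simply cannot be one file on ext4, which is the tension described above: the users with the most segment files are the ones the single-file format would lock out.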
>
> Concretely, a 20TB pg_upgrade in copy mode would fail while trying to
> concatenate with the above patches, so you'd have to use link or
> reflink mode (you'd probably want to use that anyway due to the sheer
> volume of data to copy otherwise, since ext4 is also not capable of
> block-range sharing), but then you'd be out of luck after N future
> major releases, according to that plan where we start deleting the
> code, so you'd need to organise some smaller partitions before that
> time comes. Or pg_upgrade to a target on xfs etc. I wonder if a
> future version of extN will increase its max file size.
>
> A less aggressive version of the plan would be that we just keep the
> segment code for the foreseeable future with no planned cut-off, and
> we make all of those "piggy back" transformations that I showed in the
> patch set optional. For example, I had it so that CLUSTER would
> quietly convert your relation to large format if it was still in
> segmented format (might as well if you're writing all the data out
> anyway, right?), but perhaps that could depend on a GUC. Likewise for
> base backup. Etc. Then someone concerned about hitting the 16TB
> limit on ext4 could opt out. Or something like that. It seems funny
> though, that's exactly the user who should want this feature (they
> have 16,000 relation segment files).

--
Mark Callaghan
mdcal...@gmail.com