> How do you figure that it's not really needed? Working copies are
> typically used for a long time, with changes coming in from updates and
> commits, not just checkouts. Pristine files are also not automagically
> deleted when they're not referenced, you need 'svn cleanup
> --vacuum-pristines' for that. (Except when pristines are explicitly
> omitted, then they're created only for the scope of the operation that
> needs them, IIRC.)

Operations that don't probe can still benefit from reflinked
pristines if they use the workqueue install path; the only difference
is that every file install will attempt to reflink the pristine. When
probing is used and detects a lack of reflink support, no reflink
attempt occurs; a checkout instead performs a "streamy checkout".
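
To illustrate the unprobed per-file attempt, here is a minimal sketch of
"try to reflink, otherwise byte copy" on Linux, along the lines of the
fallback described in the quoted mail below. clone_or_copy() is a
hypothetical helper written for this mail, not a function from the patch
or from Subversion; it uses the FICLONE ioctl and assumes plain POSIX
file descriptors rather than APR.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#ifdef __linux__
#include <linux/fs.h>   /* FICLONE */
#endif

/* Hypothetical helper, not a Subversion API.
   Returns 1 if DST was created as a reflink of SRC, 0 if a regular
   byte copy was made instead, -1 on error. */
static int
clone_or_copy(const char *src, const char *dst)
{
  char buf[65536];
  ssize_t n = 0;
  int result = 0;
  int in, out;

  assert(src != NULL && dst != NULL);

  in = open(src, O_RDONLY);
  if (in < 0)
    return -1;
  out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (out < 0)
    {
      close(in);
      return -1;
    }

#ifdef FICLONE
  /* One extra system call; it simply fails when the filesystem
     has no reflink support. */
  if (ioctl(out, FICLONE, in) == 0)
    result = 1;
#endif

  if (result == 0)
    {
      /* Fallback: reuse the same two descriptors for a regular copy. */
      while ((n = read(in, buf, sizeof(buf))) > 0)
        {
          char *p = buf;
          while (n > 0)
            {
              ssize_t w = write(out, p, (size_t)n);
              if (w < 0)
                {
                  n = -1;
                  break;
                }
              p += w;
              n -= w;
            }
          if (n < 0)
            break;
        }
      if (n < 0)
        result = -1;
    }

  close(in);
  if (close(out) != 0)
    result = -1;
  return result;
}
```

The return value lets a caller tell a reflink (1) from a fallback copy (0),
which is roughly the information a one-time probe would cache up front.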

However, not all operations go through the workqueue install path:
- Checkout/update/switch go through the update_editor. These all do the
reflink probing, then conditionally reflink new pristines.
- copy/move (on a WC file) installs via the workqueue, so these will
also reflink opportunistically without probing. A directory copy goes
through copy_dir_recursively and won't reflink.
- revert does go through the workqueue but uses a temp pristine file,
so the WC file will potentially get reflinked to the temp pristine.
That won't save any disk space, since the temp pristine gets
deleted. Revert could probably be tweaked to avoid the temp pristine
and become reflink friendly.
- add/commit don't go through the workqueue and so won't
reflink any newly versioned files. This could probably also be
resolved.
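
The list above can also be written down as a small lookup table. This is
only a summary sketch: the enum labels, struct and function names are
invented for illustration and do not exist in Subversion; the rows just
restate the bullets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Labels are illustrative, not Subversion identifiers. */
enum reflink_outcome
{
  REFLINK_AFTER_PROBE,  /* probes once, then reflinks conditionally */
  REFLINK_UNPROBED,     /* attempts a reflink on every file install */
  REFLINK_TO_TEMP,      /* reflinks, but to a temp pristine that is deleted */
  REFLINK_NEVER         /* never reaches a reflinking install */
};

struct op_route
{
  const char *operation;
  const char *route;
  enum reflink_outcome outcome;
};

static const struct op_route routes[] = {
  { "checkout",    "update_editor",        REFLINK_AFTER_PROBE },
  { "update",      "update_editor",        REFLINK_AFTER_PROBE },
  { "switch",      "update_editor",        REFLINK_AFTER_PROBE },
  { "copy (file)", "workqueue install",    REFLINK_UNPROBED },
  { "move (file)", "workqueue install",    REFLINK_UNPROBED },
  { "copy (dir)",  "copy_dir_recursively", REFLINK_NEVER },
  { "revert",      "workqueue install",    REFLINK_TO_TEMP },
  { "add",         "not the workqueue",    REFLINK_NEVER },
  { "commit",      "not the workqueue",    REFLINK_NEVER }
};

/* Can OPERATION currently leave a space-saving reflink behind,
   assuming the filesystem supports reflinks at all? */
static int
can_save_space(const char *operation)
{
  size_t i;

  assert(operation != NULL);
  for (i = 0; i < sizeof(routes) / sizeof(routes[0]); i++)
    if (strcmp(routes[i].operation, operation) == 0)
      return routes[i].outcome == REFLINK_AFTER_PROBE
             || routes[i].outcome == REFLINK_UNPROBED;
  return 0;  /* unknown operations are assumed not to reflink */
}
```

Revert is deliberately its own category: it does reflink, but against a
file that is deleted afterwards, so nothing is saved.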

It would be good to get revert/add reflinking as well, since these are
likely common operations, but there will always be cases where the
user can break the reflinks themselves. If they modify a file and then
restore its contents without using SVN, the reflink no longer exists.

A more general solution for this, as I mentioned in the op, is to have a
"svn cleanup --reflink-pristines" flag which goes through the working
copy and recreates reflinks for all unmodified candidate files.
This is also useful for an existing working copy that was checked out
with an older SVN version that didn't do any reflinking.
I actually have a local version of this cleanup flag working and have
been using it to reflink all my existing SVN working copies. I'm still
using SVN 1.14 for daily tasks, but since the reflinking doesn't change
the working copy format I can run it occasionally to gain the reflink
disk space savings.
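
To make the idea concrete, here is a rough sketch of what such a pass
might do for one file on Linux: check that the working file is
byte-identical to its pristine, and if so replace it with a clone. This
is not the code from my local build; same_bytes() and
relink_if_unmodified() are hypothetical names, and the ".relink-tmp"
suffix is made up. A real implementation would presumably also need to
preserve timestamps, take the working copy lock, and guard against the
file changing between the compare and the rename.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#ifdef __linux__
#include <linux/fs.h>   /* FICLONE */
#endif

/* Return 1 if both files hold identical bytes, 0 if not, -1 on error. */
static int
same_bytes(const char *a, const char *b)
{
  char ba[8192], bb[8192];
  FILE *fa = fopen(a, "rb");
  FILE *fb = fopen(b, "rb");
  int same = (fa && fb) ? 1 : -1;

  while (same == 1)
    {
      size_t na = fread(ba, 1, sizeof(ba), fa);
      size_t nb = fread(bb, 1, sizeof(bb), fb);

      if (ferror(fa) || ferror(fb))
        same = -1;
      else if (na != nb || memcmp(ba, bb, na) != 0)
        same = 0;
      else if (na == 0)
        break;                      /* both at EOF, no difference found */
    }
  if (fa)
    fclose(fa);
  if (fb)
    fclose(fb);
  return same;
}

/* Hypothetical helper. Returns 1 if WORKING was replaced by a clone of
   PRISTINE, 0 if it was left untouched (modified, or no reflink
   support), -1 on error. */
static int
relink_if_unmodified(const char *pristine, const char *working)
{
  char tmp[4096];
  struct stat st;
  int in, out, cloned = 0;
  int same = same_bytes(pristine, working);

  if (same != 1)
    return same;                    /* 0: modified, -1: error */
  if (stat(working, &st) != 0)
    return -1;
  if (snprintf(tmp, sizeof(tmp), "%s.relink-tmp", working)
      >= (int)sizeof(tmp))
    return -1;

  in = open(pristine, O_RDONLY);
  if (in < 0)
    return -1;
  out = open(tmp, O_WRONLY | O_CREAT | O_EXCL, st.st_mode & 0777);
  if (out < 0)
    {
      close(in);
      return -1;
    }
#ifdef FICLONE
  cloned = (ioctl(out, FICLONE, in) == 0);
#endif
  close(in);
  close(out);

  /* Only swap the file in when the clone actually succeeded. */
  if (cloned && rename(tmp, working) == 0)
    return 1;
  unlink(tmp);
  return cloned ? -1 : 0;
}
```

Because only byte-identical files qualify, files whose working form
differs from the pristine (for example through eol-style translation)
are skipped automatically.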

> I'm well aware that we can #ifdef our way around any new code, but this
> change as it stands now is much too big and intrusive for that. I would
> not support committing it directly to trunk in its current state, even
> if disabled by default. This isn't a comment about your patch
> specifically, I'd say the same about any other large, intrusive change,
> no matter the source. We do develop some significant new features on
> trunk (e.g., FSX, svnxx, svnbrowse), but they're well isolated from the
> rest of the code – and also never finished, eh. What can go directly to
> trunk and what should be developed on a branch is a bit of a judgement
> call. I think my opinion on the matter is consistent.

I'm happy for it to be on a branch if that's the standard for new
features; I was just highlighting that it's easy to gate it with a
single #ifdef. I presumably can't create a branch, or write to one if
it were created, since I don't have write access? What is the
best way forward to get a branch up and running?

Thanks,
Jordan


On Mon, 15 Jun 2026 at 14:31, Jordan Peck <[email protected]> wrote:
>
> Hi Evgeny, Brane
> Thank you for taking the time to go through the patch in depth, you
> bring up a lot of good points about the feature.
>
> > Also, the --store-pristine=no option already provides a consistent size
> > reduction on all file systems and OSes, without disabling the streaminess,
> > regressing other characteristics or requiring any kind of in-advance 
> > probing.
>
> So "--store-pristine=no" provides the largest disk savings, but it
> comes with a few downsides:
> - The biggest downside is that it's opt-in, so the average user won't
> know about it or use it
> - You lose the ability to diff files without pulling data from the SVN server
> - It requires a new working copy format which older tools won't support
>
> > The optimization is far from universally applicable.  For example, it
> > doesn't cover the default file system on Windows (NTFS).
>
> Across the 3 OSes, Windows users are probably least likely to have a
> reflink supporting filesystem since it was only made available on
> standard versions of Windows about 3 years ago. Although they are
> pushing its usage via the Dev Drive feature which you can see here:
> https://learn.microsoft.com/en-us/windows/dev-drive/
>
> Linux is similar to Windows in that the majority of distributions
> still default to ext4 (no reflink), but a higher % of users likely
> have reflink-supporting filesystems because those filesystems
> have been around longer and there is a greater presence of power
> users. Some distributions, like Fedora, use Btrfs (reflink supported)
> as the default filesystem.
>
> macOS is the best case by far: the default filesystem is APFS,
> which has reflink support. It wouldn't apply if the working copy is
> on a non-APFS volume, e.g. an external drive formatted as FAT32.
>
> > The required probing is too expensive to run in every scope where it
> > would be needed.
> >
> > For example, probing on Windows calls GetVolumeInformation().  Probing
> > on other OSes has to create a file to check whether copy-on-write is
> > supported.
> >
> > The patch limits this probing only for larger scopes, such as attaching it
> > to an update editor instance, but smaller-scoped operations like workqueue
> > installs or file reverts are left without probing.  If not all code paths
> > use the copy-on-write path, the size reduction may not be sustainable or
> > predictable in the long run.
>
> I only added the probing so that checkouts can still benefit from
> "streamy checkouts" when reflink support is absent. The probing isn't
> that slow, but it's also not really needed for the smaller operations
> that still use the workqueue installs. The implementation differs
> slightly by platform:
>
> Linux has the best API support for this, ficlone_fd can be used with
> an existing open file handle. If that fails due to lack of reflink
> support the fallback uses the same open handle to perform a regular
> byte copy. This is only 1 extra system call per file in the case of no
> reflink support.
>
> macOS is similar to Linux, but the API requires file paths instead of
> handles, making the fallback path less smooth. That said, the fallback
> is unlikely to be needed often, given macOS's good reflink support.
>
> On Windows, CopyFile2 is a generic file copy API, so even on NTFS this
> performs a regular file copy; it is not a failed call like on Linux.
> There is a fallback path if it fails, but I'm not sure if there are
> any cases where CopyFile2 fails where a byte copy loop would succeed.
> The reason to avoid using CopyFile2 via probing is only to get the
> "streamy checkout" benefits which don't apply to all SVN operations
> anyway.
>
> > Skipping the streamy checkout path can potentially regress performance and
> > cause spurious HTTP timeouts and failures. To some extent, this is a step
> > back from what we currently have on trunk.
>
> From what I've gathered, the HTTP timeouts were caused when large files
> were being written to disk. Setting up the reflink is only a small
> metadata operation, so it will always be very fast even on large
> files. Since the file data isn't being written to disk, it should
> avoid the HTTP timeout issue.
>
> > The new copy_file_to_temp_copyfile_windows() appears to make significantly
> > more syscalls than the current code on trunk.  It opens and closes a temp
> > file, performs the copy, and changes the file attributes and mtime, all
> > with separate path-based calls, which are very expensive on Windows.
>
> So in my testing a checkout onto a ReFS drive on Windows is about 10%
> slower than a streamy checkout on an NTFS drive, likely due to these
> extra system calls. I wasn't 100% sure about SVN's file attribute
> requirements, so I was overly strict on matching the existing path.
> Perhaps some of those steps could be trimmed back to aid performance.
> For me a 10% slowdown is well worth the disk space saving, but
> obviously this is a personal opinion.
>
> > Thoughts on creating a branch for this? It may be easier to work through
> > actual working code, as well as test interaction with other working copy
> > changes.
> >
> > Downside is that the code may just bitrot on a branch that's not
> > actively maintained.
>
> I don't think it needs its own branch, it would be easy to add a
> global toggle via a pre-processor define since it's already gated
> around the reflink support probing.
>
> Thanks,
> Jordan
>
>
>
> On Wed, 10 Jun 2026 at 15:41, Jordan Peck <[email protected]> wrote:
> >
> > Hi all,
> >
> > I've spent time cutting the patch down to the minimal changes needed to get 
> > the feature working while maintaining a sensible integration (not hacking 
> > it in). The majority of the patch size now comes from the new 
> > io_copy_temp.c file which holds all the native platform functions for 
> > probing/setting up file reflinks.
> >
> > I've compiled and run the tests on Windows, Linux and MacOS this time.
> >
> > Hopefully this is a reviewable size now.
> > Let me know what you think and any questions.
> >
> > Thanks,
> > Jordan
> >
> >
> > On Fri, 5 Jun 2026 at 21:02, Jordan Peck <[email protected]> wrote:
> >>
> >> Regarding the disk space savings, I haven't tested against pristineless, 
> >> but the savings are also close to 50% on a repo with lots of large binary 
> >> files:
> >>>
> >>> The large SVN repo we use at work is a 1.6TB fresh checkout on disk, with 
> >>> this change that drops to 865GB (on Windows+ReFS). A huge saving due to 
> >>> the large number of art assets in our repository!
> >>> I ran some tests with and without the patch, the ~10% slowdown is from 
> >>> losing the streamy checkout benefits and instead having to do extra 
> >>> system calls.
> >>>
> >>> 1.16:      1588.22 GB  169 mins
> >>> 1.16-CoW:  864.76 GB   202 mins
> >>
> >>
> >>>  volatile svn_atomic_t svn_wc__test_writer_copy_source_count = 0;
> >>
> >> I agree this extern debug atomic for a test is awkward, but I didn't
> >> really see any way around it if we want a test that checks the reflink
> >> path is actually being taken. Adding platform-native methods for
> >> checking whether a file is reflinked, purely for testing, is not
> >> practical, given it's also not simple to work out via the filesystem
> >> anyway. However, we could discuss reworking or removing the test to
> >> avoid this weird variable. I'm not sure about the warning you are
> >> seeing; I don't think I saw that when I built for Windows/Linux.
> >>
> >>>
> >>> This means that, for example, a Subversion working copy, where most
> >>> files have svn:eol-style=native, would see no improvement, correct?
> >>
> >> No, they would see no benefit, which is why this saves slightly less space 
> >> than pristineless. But in our large repo 95% of the disk usage comes from 
> >> large binary blob files which can benefit. I realise this isn't the case 
> >> for everyone, but if you have a repo with only code files with 
> >> svn:eol-style=native it's probably not large enough to be a disk space 
> >> concern in the first place.
> >>
> >>> How does this interact with the pristines-on-demand feature? How is this
> >>> tested?
> >>
> >> There isn't a test for this but the "pristines-on-demand" takes precedence 
> >> over reflinking, they shouldn't interfere with each other.
> >>
> >>> If clients that do not support CoW use the same working copy as a client
> >>> that does, how do they interact?
> >>
> >> Yes! That's the best part, there's no working copy format change. The 
> >> filesystem handles the reflinked files transparently once they've been set 
> >> up. If you did a checkout on a CoW supporting drive (ReFS) and got the 
> >> disk usage benefits of CoW you would still be able to copy your svn 
> >> checkout folder to another drive with no CoW support (NTFS), it would 
> >> transparently expand to its full size on the new drive, but otherwise 
> >> would be fully functional.
> >>
> >>> Does this happen on every invocation? Concretely: if I'm on Windows with
> >>> NTFS, I'll get two file open attempts instead of one, every time?
> >>
> >> There are details on this in the op, but no: there is a one-time check
> >> at the start of a checkout/update operation to see if the current
> >> filesystem supports CoW. If not, all files will use the current
> >> "streamy-checkout" path, avoiding any extra system calls.
> >>
> >>> Not that this is indicative of patch quality in general. But I'd have
> >>> preferred to see some discussion on dev@ before being presented with
> >>> 160k of patch. It's basically unreviewable, I don't even know where to
> >>> start.
> >>
> >> I was originally working on this feature for myself and work colleagues. 
> >> It was initially a hacky, Windows-only implementation. Then I added Linux, 
> >> polished it a bit and ended up fleshing it out a lot more. It got to the 
> >> point where I figured I might as well bring it all together into a patch 
> >> and share it here to get people's thoughts on it. I do realise it's a huge 
> >> patch, and in the "Notes" section of the op I laid out how it could be 
> >> split into several parts. But I didn't want to do the work to split it 
> >> before gauging general interest in the feature.
> >>
> >> On Fri, 5 Jun 2026 at 19:50, Sean McBride <[email protected]> wrote:
> >>>
> >>> On 4 Jun 2026, at 7:31, Jordan Peck via dev wrote:
> >>>
> >>> > This patch saves disk space on supported filesystems for byte-identical 
> >>> > file installation by utilising filesystem clone APIs.
> >>>
> >>> FWIW, just the other week, I've used this tool:
> >>>
> >>> https://github.com/ttkb-oss/dedup
> >>>
> >>> on my Mac, and several others in our office, to deduplicate files in our 
> >>> svn (1.14) working copies. It detected identical files not only in our 
> >>> /branches vs /trunk but also among the pristine copies, and it 
> >>> deduplicated them using APFS' cloning feature.
> >>>
> >>> It didn't cause svn to freak out, so that bodes well.
> >>>
> >>> Sean
