[
https://issues.apache.org/jira/browse/SVN-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256056#comment-17256056
]
Karl Fogel edited comment on SVN-525 at 12/30/20, 1:48 AM:
-----------------------------------------------------------
Hi, Aditi Maurya. Thanks for your interest in this issue. It's been a long
time since I was a core developer, but I think I still understand enough of the
internals of Subversion to point you in the right direction.
Let's start with some background. Right now, Subversion stores a pristine copy
of the BASE revision (that is, the currently checked out revision) of each file
locally. These pristine copies are stored under the .svn/pristine/ directory
in the top level of each checked-out working copy. Inside .svn/pristine/,
you'll see a bunch of subdirectories with two-character names, and then inside
each subdirectory there are some ".svn-base" files, where each file's name is
an SHA1 hash. What's going on is exactly the kind of content-addressed
arrangement you suspect :-).
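To make that layout concrete, here is a minimal sketch of how a pristine file's location follows from its contents. The function name {{pristine_path}} is my own invention for illustration, not anything in the Subversion sources, and I'm assuming the two-character subdirectory name is simply the first two hex digits of the SHA1.

```python
import hashlib
from pathlib import Path


def pristine_path(wc_root: Path, contents: bytes) -> Path:
    """Return where a pristine copy of `contents` would live.

    Hypothetical helper: assumes the subdirectory is named after the
    first two hex characters of the SHA1 of the file contents.
    """
    sha1 = hashlib.sha1(contents).hexdigest()
    return wc_root / ".svn" / "pristine" / sha1[:2] / f"{sha1}.svn-base"


if __name__ == "__main__":
    # Two files with identical contents map to the same pristine path.
    print(pristine_path(Path("my-working-copy"), b"hello\n"))
```

Because the name depends only on the contents, identical files in the working copy can share a single pristine copy.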
The purpose of these pristine BASE copies is threefold:
# To make commits use less network bandwidth, because the commit only needs to
send to the repository the differences between the local BASE version and the
locally modified working file. (Remember that an SVN "commit" is like a
"commit + push" would be in Git.)
# To make 'svn diff' and 'svn revert' be purely local operations, that don't
talk to the upstream repository over the network.
# To enable the occasional three-way merge. This kind of merge is less common
in Subversion than in distributed version control systems (DVCS), but it still
is done sometimes.
Obviously, (1) is just an optimization. One *could* always just send the full
new file contents in a commit. Furthermore, the optimization only happens for
text files anyway, like program source code or plaintext documentation.
Binary-format files (such as LibreOffice files, different versions of a video,
PDFs, compilation output, etc) are not diffable/mergeable, at least not in the
practical sense needed by a version control system, so committing them always
ends up transmitting the entire new contents anyway.
Regarding (2): for binary (non-mergeable, non-diffable) files, no one is doing
'svn diff' against the BASE revision anyway. And while it's nice if 'svn
revert' can be purely local, that's not a "must have" behavior. It's okay if
it's local when the pristine BASE copy is present but uses the network when the
BASE copy is not present.
And regarding (3): just as with (2), one could get the BASE copy from the
upstream repository if necessary, but, again, no one is doing three-way merges
on binary blobs anyway.
(By the way, note that Subversion doesn't store full history on the client
side, the way DVCSs like Git and Mercurial do. That's why in Subversion we
call the local side a "working copy" or "working tree", not a "repository".
For textual/mergeable materials, the DVCS way is superior -- having full
history locally is great, and you can afford it when the history can be stored
in an efficient internal-diff way locally. However, when you're keeping
successive versions of big binary blobs under version control, the DVCS way
doesn't work well: it requires too much storage on every client machine. For
this situation, SVN's way is better, because only the central repository server
needs that kind of storage.)
Okay, the above is all background. Now here's what this issue is about:
For those who do use Subversion to version binary blobs, it's already workable,
but the problem is it is still using *double* the amount of client-side disk
space it needs to. When it comes to trees with really large objects, this is a
problem! Those pristine BASE versions are not helping anyone: they don't make
commits more efficient in this use case, and since these files are almost never
mergeable no one is doing 'svn diff' nor three-way merges on them either. At
the most, someone might want to do an 'svn revert', but it's okay if that's not
a purely local operation.
If the BASE version of a file weren't present, most operations would work just
fine. Even Subversion's network protocol for transmitting changes wouldn't
need to be updated, because that protocol already has an "insert the
following N bytes" command. Therefore one can *always* construct a
commit-transmission diff as a series of inserts, without reference to any BASE
contents.
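As a rough illustration of the insert-only idea, the sketch below turns new file contents into a sequence of "insert these bytes" operations and shows that a receiver can rebuild the file from them with no BASE at all. The operation tuples and the names {{insert_only_delta}} and {{apply_delta}} are assumptions made up for this example; this is not Subversion's actual wire format.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical op representation: ("insert", data). Not the real protocol.
Insert = Tuple[str, bytes]


def insert_only_delta(new_contents: bytes, window: int = 4) -> Iterator[Insert]:
    """Yield insert ops covering new_contents, one per window of bytes."""
    for start in range(0, len(new_contents), window):
        yield ("insert", new_contents[start:start + window])


def apply_delta(ops: Iterable[Insert]) -> bytes:
    """Rebuild the target from insert ops alone; no BASE text is consulted."""
    out = bytearray()
    for kind, data in ops:
        if kind != "insert":
            raise ValueError(f"unsupported op in this sketch: {kind}")
        out += data
    return bytes(out)


if __name__ == "__main__":
    blob = b"hello binary blob"
    assert apply_delta(insert_only_delta(blob, window=5)) == blob
```

The cost is that the whole file goes over the wire, which is what already happens in practice for the large binary files this issue is about.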
(Now, of course, if both the client and server were updated to support some
kind of 'sendfile' functionality for that circumstance, that might be even more
efficient, but that's an optimization. New clients will still be able to work
with old servers, guaranteed.)
So the modification needed here is purely client-side. To make this change,
all one has to do is find the parts of the working copy code that currently
consult the pristine BASE version and make them still work when the pristine
BASE is not present.
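The general shape of that change might look like the following sketch: use the local pristine copy when it exists, and otherwise fall back to fetching the BASE contents from the repository. Both {{read_base}} and the {{fetch_from_repo}} callback are hypothetical names for illustration; the real code in the working copy library is structured quite differently.

```python
from pathlib import Path
from typing import Callable


def read_base(pristine_file: Path, fetch_from_repo: Callable[[], bytes]) -> bytes:
    """Return the BASE contents, preferring the local pristine copy.

    fetch_from_repo is a hypothetical callback standing in for a network
    round trip; it is only invoked when the pristine copy is absent.
    """
    if pristine_file.is_file():
        return pristine_file.read_bytes()
    return fetch_from_repo()


if __name__ == "__main__":
    # With no local pristine copy, the (slower) fallback supplies the contents.
    missing = Path("no-such-dir") / "no-such-file.svn-base"
    print(read_base(missing, lambda: b"contents fetched from the server"))
```

An operation like 'svn revert' would then stay purely local whenever the pristine copy is present and only touch the network when it is not.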
While Subversion is already a decent system for keeping track of large binary
blobs, with this change, it would be a really *good* system for doing that,
especially because of its optional file-locking feature. (None of the DVCSs
are really suitable for this use case, by design, as far as I'm aware.)
I think the mechanics of the change would involve code under
{{subversion/libsvn_wc/}} and maybe {{subversion/libsvn_client/}} in the
[Subversion sources|http://subversion.apache.org/source-code.html]. The
decision about when to omit a pristine BASE copy should be made purely by the
client side, as different people may configure it differently depending on
their local storage capacity. This would mean some kind of new specifier in
the [client-side run-time configuration
area|http://svnbook.red-bean.com/nightly/en/svn.advanced.confarea.html], like
maybe a {{no-pristine}} option that can be set based on any of file size, file
name pattern, or file mime-type. (There may be some discussion of user-facing
design questions earlier in this issue, too.)
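To give a feel for what such a client-side rule could look like, here is a small sketch of the decision logic. Everything in it is hypothetical: the {{NoPristineRule}} fields, the {{skip_pristine}} function, and the choice that matching any one criterion is enough. No such option exists in Subversion's configuration today.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass
class NoPristineRule:
    """Hypothetical client-side rule; none of these fields exist in Subversion."""
    min_size: Optional[int] = None                 # bytes; None disables the size test
    name_patterns: List[str] = field(default_factory=list)
    mime_types: List[str] = field(default_factory=list)


def skip_pristine(rule: NoPristineRule, name: str, size: int,
                  mime_type: Optional[str]) -> bool:
    """Return True if any one criterion says to omit the pristine copy."""
    if rule.min_size is not None and size >= rule.min_size:
        return True
    if any(fnmatchcase(name, pattern) for pattern in rule.name_patterns):
        return True
    return mime_type in rule.mime_types


if __name__ == "__main__":
    rule = NoPristineRule(min_size=100 * 1024 * 1024, name_patterns=["*.mp4"])
    print(skip_pristine(rule, "intro.mp4", 5_000, None))
    print(skip_pristine(rule, "main.c", 5_000, "text/plain"))
```

Since the rule lives entirely on the client, two people working against the same repository could choose different thresholds to suit their own disks.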
I'm very happy to answer more questions here, and I'd also suggest that you
post to the Subversion Development mailing list (see the [mailing
lists|http://subversion.apache.org/mailing-lists.html] page) with
questions/ideas. If you post there, please CC me (kfogel {_AT_} red-bean.com);
I'm not subscribed to the list these days, but I'd like to follow any progress
on this issue and help where I can. There are many far more experienced
developers on the list, too, and they'll be able to save you a lot of time.
> Allow working copies without .svn/pristine/ cache (a.k.a. "text-base/" files).
> ------------------------------------------------------------------------------
>
> Key: SVN-525
> URL: https://issues.apache.org/jira/browse/SVN-525
> Project: Subversion
> Issue Type: New Feature
> Affects Versions: all
> Environment: other
> Reporter: Ben Collins-Sussman
> Priority: Minor
> Fix For: unscheduled
>
>
> It's possible to make the cached pristine files in .svn/pristine/ optional.
> Doing so would be a huge storage savings on the client side, and would make
> Subversion even more compelling as a system for managing medium-large binary
> files.
> A much more technically thorough explanation of this issue and its background
> is available in [this 2020-12-29
> comment|https://issues.apache.org/jira/browse/SVN-525?focusedCommentId=17256056&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17256056]
> below.
> (Note that the cached pristine base versions used to be stored in
> .svn/text-base/, so you'll probably see references to that old location
> throughout this ticket. Also, there used to be one .svn/ directory per
> working tree directory; later that was changed to one .svn/ directory at the
> top of the working tree. Knowing that might also help clarify some of the
> older comments in this ticket.)