On 1/24/2012 23:04, Ryan Schmidt wrote:
On Jan 24, 2012, at 15:18, The Grey Wolf wrote:

Hello, I'm not quite sure how to properly phrase the subject as a query
term, so if this has been answered, please forgive the redundancy and
quietly point me to where this gets addressed.

We are using svn at work to hold customer 'vault' data [various bits of
information for each customer].  It has been a huge success -- to the
point where we have over 1,000 customers using vaults.  The checkins are
automated, and we have amassed over 100,000 revisions thus far.

User directories are created as /Ab/username [where Ab is a 2-character
hash via a known (balanced) algorithm to make location of username files
more machine-efficient].  So we have about 1,200 of these guys, with some
hashes obviously being re-used, no big deal.
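[For readers following along: the actual "known (balanced) algorithm" isn't
named in the thread, so here is a rough sketch of what such a 2-character
bucketing scheme could look like, with an MD5 prefix standing in for the
real hash:]

```python
import hashlib

def vault_path(username: str) -> str:
    """Return an /Ab/username style path, bucketing users under a
    2-character hash prefix so no one directory grows too large.
    Illustrative only -- the real balanced hash is site-specific."""
    digest = hashlib.md5(username.encode("utf-8")).hexdigest()
    return "/%s/%s" % (digest[:2], username)

print(vault_path("greywolf"))
```

The same username always hashes to the same bucket, so lookups stay
machine-efficient, which matches the rationale described above.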

The problem is that, even on minuscule changes, we are finding the
db/rev/<shard>/<revno>  files to be disproportionately large; for an
addition or change of a file that is about 1k-4k, the rev files are at
100K each.  At lower revisions, we noticed that the rev files are 4k but
have been increasing in size with each shard that gets added, usually to
the tune of 1k/shard.  With so many revisions being checked in at a rapid
rate, we found ourselves having to take production offline for a couple
of minutes while we migrated the repository in question to a larger
filesystem due to the threat of the filesystem filling up.

The upshot of this is:  Why does a minimal delta create such a large
delta file?  100k for a small change?  What's going on and how can we
mitigate this?

It probably has to do with the size of the directory entries, not the
changes you're making to the files.

If you add a file, that's recorded as a change to the directory. When you
change a file, Subversion stores only the changes you made, not the
complete new file, and it stores them compressed. However, when you change
a directory (e.g. by adding or removing a file or directory), Subversion
records a complete new copy of the directory, and I don't know if it's
compressed or not. If the directory has hundreds or thousands of items,
that will take some space.
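[The effect can be modelled crudely: if each revision that touches a
directory writes out that directory's complete entry list, the per-revision
cost grows with the number of entries, no matter how small the file change
itself is. A toy model with made-up per-entry numbers, not the real FSFS
on-disk format:]

```python
def dir_record_size(num_entries: int, avg_name_len: int = 12,
                    overhead_per_entry: int = 40) -> int:
    """Rough size of a full directory listing: each entry costs its
    name plus some fixed bookkeeping (node-rev id, kind, etc.).
    Purely illustrative numbers, not the FSFS byte layout."""
    return num_entries * (avg_name_len + overhead_per_entry)

# A 1 KB file change committed into a ~2,000-entry directory is
# dwarfed by the directory record itself:
print(dir_record_size(2000))  # 104000 -- same order as the 100K
                              # rev files reported above
```

Under this model the reported growth of roughly 1K per added shard also
falls out naturally: each new top-level entry adds a fixed increment to
every later full directory record.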

I don't remember if modifying a file counts as a change to the directory,
but adding or deleting a file certainly does.

Based on this I would assume you could mitigate the problem by having fewer
items in each directory. Create a deeper directory structure from your
hash: /A/Ab/username, or even /A/Ab/Abc/username. You should try this out
in a testing environment. Either create some test data, or dump your
current repository, and then a) load it into a fresh empty repository
as-is, and b) transform it into a deeper directory structure using a tool
like svndumptool, then load that into a second fresh empty repository. Then
see if there is an appreciable size difference.
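[If that theory holds, the transform svndumptool would apply is just a path
rewrite inserting an extra hash level. A hypothetical helper matching the
/A/Ab/username layout suggested above:]

```python
def deepen(path: str) -> str:
    """Rewrite /Ab/username into /A/Ab/username, spreading the top
    level across single-character directories so each directory holds
    far fewer entries.  Hypothetical sketch of the dump transform."""
    bucket, name = path.strip("/").split("/", 1)
    return "/%s/%s/%s" % (bucket[:1], bucket, name)

print(deepen("/Ab/greywolf"))  # -> /A/Ab/greywolf
```

Comparing the on-disk size of the two freshly loaded repositories would
then show directly whether smaller directories shrink the rev files.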

Interesting, to be sure.  Here's some stats.

top level = 2817 entries
second level = 1..22 entries [depending on which one]
Some have a third level, most don't; ranges 1..27 entries.

So are you saying that if I add a file /ab/username/file, it's going to copy
the ENTIRE top level directory in as a delta?

--
                                --*greywolf;
