----- Original Message -----

> From: Philip Martin <philip.mar...@wandisco.com>
> To: Erik Huelsmann <ehu...@gmail.com>
> Cc: Ashod Nakashian <ashodnakash...@yahoo.com>; Daniel Shahaf 
> <danie...@elego.de>; "dev@subversion.apache.org" <dev@subversion.apache.org>; 
> Ivan Zhakov <i...@visualsvn.com>
> Sent: Thursday, March 22, 2012 4:00 PM
> Subject: Re: Compressed Pristines (Design Doc)
> 
> Erik Huelsmann <ehu...@gmail.com> writes:
> 
>>  As the others, I'm surprised we seem to be going with a custom file 
> format.
>>  You claim source files are generally small in size and hence only small
>>  benefits can be had from compressing them, if at all, due to the fact that
>>  they would be of sub-block size already.
> 
> I was surprised too, so I looked at GCC where a trunk checkout has
> 75,000 files of various types:
> 
[snip]

The major concern at this point seems to be that developing a new custom 
file format isn't warranted. That's a fair concern, and one we should 
reach consensus on before moving forward. I've tried to address all the issues 
raised on the topic so far in this (admittedly long) mail.

The design's fundamental assumption is that source files are typically smaller 
than a typical FS block (after compression). Erik and Philip have run tests on 
SVN and GCC respectively, with differing results. I don't have hard figures 
because it's nearly impossible to define "a typical project". However, let me 
lay out the rationale behind this argument:

1. We care less about file sizes before compression than *after* it: with 
better compression, more files fall below the sub-block size (which depends on 
the FS configuration).
2. Compressed file size is highly dependent on the compression algorithm (we 
should use the best compression that meets our requirements).
3. Combining files *before* compression in many cases yields better compression, 
especially if multiple tags/branches are involved.[1]
4. Projects with "small" files suffer more from wasted sub-block space, 
especially when multiple tags/branches are checked out (typical for active 
maintainers); see the sketch after this list.
5. Reducing the number of files on disk can improve overall disk performance 
(for very large projects).[2]
6. Flexibility, extensibility and opaqueness.
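
To make point 4 concrete, here is a minimal sketch (Python, purely for 
illustration) of how sub-block waste adds up. The 4 KiB block size is an 
assumption (a common default), and the file sizes are made up, not measured:

    BLOCK = 4096  # assumed FS block size; the real value depends on FS config

    def on_disk(size, block=BLOCK):
        # bytes actually consumed: size rounded up to a whole block
        return -(-size // block) * block

    # illustrative compressed pristine sizes in bytes (not real data)
    sizes = [700, 1500, 3000, 4200, 9000]
    logical = sum(sizes)
    physical = sum(on_disk(s) for s in sizes)
    print("logical:", logical, "on disk:", physical,
          "wasted:", physical - logical)

Each sub-block file wastes, on average, half a block; multiply that by tens of 
thousands of pristines and several checked-out branches and the waste becomes 
substantial.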

The gist of the above is that if we choose a better-than-gz compression 
algorithm and combine the files both *before* and *after* compression, we'll 
see much more significant gains than we get now with gz on individual files. 
This can be observed with tar.bz2, for example, where the result is not unlike 
what we could achieve with the custom file format (although bz2 is probably too 
slow for our needs).
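
The before-compression effect is easy to demonstrate. A minimal sketch, using 
Python's zlib as a stand-in for whatever codec we pick, with file contents made 
up to mimic near-identical files across branches:

    import zlib

    # hypothetical stand-ins for the same source file on three branches
    base = b"int main(void) { return do_work(42); }\n" * 50
    files = [base, base.replace(b"42", b"43"), base.replace(b"work", b"job")]

    individual = sum(len(zlib.compress(f, 9)) for f in files)
    combined = len(zlib.compress(b"".join(files), 9))
    print("individual:", individual, "combined:", combined)

Because the compressor's window spans all three files in the combined case, the 
second and third copies compress to almost nothing.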

Now, does this justify the cost of a new file format? That's a reasonable 
question. My take is that the proposed file format is simple enough, and the 
gains (especially on large projects with many branches/tags checked out) should 
justify the overhead.

I also like the fact that the pristine files are opaque and don't encourage the 
user to mess with them. Markus raised this point as "debuggability". I don't 
see "debuggability" as a user requirement (it is justifiably an SVN 
dev/maintainer requirement) and see no reason to add it as one. On the 
contrary, there is every reason to suspect that a user who messes with the 
pristine files is doing something gravely wrong.

Another point raised by Markus is storing "common pristine" files and reusing 
them to reduce network traffic. This is not part of this feature; we can't 
determine what's "common", and we shouldn't optimize the repository protocol in 
the WC client. However, if the user checks out multiple branches/tags in the 
same WC tree, they will get savings, and if we combine the pristine files 
*before* compression, the savings should be significant, as most files change 
little between branches (if a file is unchanged, it has the same hash and only 
one copy exists in the pristine store anyway; see the sketch below). This 
latter advantage is lost with per-file compression (point #3 above).
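
For reference, that deduplication falls out of content addressing for free. A 
toy sketch (a dict in place of the real pristine store; SVN keys pristines by 
their SHA-1 checksum):

    import hashlib

    store = {}  # checksum -> content; toy stand-in for the pristine store

    def add_pristine(content):
        key = hashlib.sha1(content).hexdigest()
        store.setdefault(key, content)  # an unchanged file adds no copy
        return key

    k1 = add_pristine(b"file as it appears on trunk\n")
    k2 = add_pristine(b"file as it appears on trunk\n")  # other branch
    assert k1 == k2 and len(store) == 1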

Sqlite may be used, as Branko has suggested. I'm not opposed to this. It has 
its shortcomings (not exploiting inter-file similarities, per point #3 above, 
for one), but it can be considered a compromise between individual gz files 
and the custom pack file. The basic idea would be to store "small" files (after 
compression) in wc.db and keep links to compressed files on disk for "large" 
files. My main concern is that frequent updates to small files will leave the 
sqlite file with heavy internal fragmentation (holes within the file that are 
unused but still consume disk space). The solution is to vacuum wc.db, but 
vacuuming locks the database and its cost grows with the database's size (among 
other factors), so we can't do it routinely. The custom pack file would take 
care of this by splitting the files such that avoiding fragmentation is 
feasible and routine (see the numbers in the doc for full defragmentation of 
typical pack store files). Of course, we could use multiple sqlite database 
files instead of the custom format and achieve the same goal, but my suspicion 
is that using sqlite for small files will, in the long run, give results 
similar to individual gz files (due to overhead, fragmentation, etc.), so 
personally I feel it's probably not worth it.
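
For clarity, here is a rough sketch of that hybrid scheme (the table layout, 
threshold and names are all hypothetical; this is not wc.db's actual schema):

    import hashlib, sqlite3, zlib

    SMALL = 4096  # hypothetical cut-off between in-db blobs and disk files

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE pristine ("
               "checksum TEXT PRIMARY KEY, "
               "blob BLOB, "   # compressed content (small files only)
               "path TEXT)")   # link to an on-disk file (large files only)

    def store_pristine(content, disk_path):
        key = hashlib.sha1(content).hexdigest()
        packed = zlib.compress(content, 9)
        if len(packed) <= SMALL:
            db.execute("INSERT OR IGNORE INTO pristine VALUES (?, ?, NULL)",
                       (key, packed))
        else:
            # write `packed` to disk_path; the db keeps only the link
            db.execute("INSERT OR IGNORE INTO pristine VALUES (?, NULL, ?)",
                       (key, disk_path))

Note that deleting and re-inserting such blobs is exactly the pattern that 
leaves free pages behind until a vacuum reclaims them.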

A point regarding concurrency in the custom format was raised (apologies for 
missing the name). Short answer: yes! Files will be grouped, and each group may 
be streamed to a compressor instance on a separate thread and written to disk 
in parallel (thanks to the split pack files). The index file or wc.db may be a 
bottleneck, but the slow part is compressing, not updating entries.
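
In sketch form (Python threads and zlib purely as stand-ins; the real 
implementation would use whatever threading facilities and codec we settle on):

    import zlib
    from concurrent.futures import ThreadPoolExecutor

    def compress_group(files):
        # one compressor instance per group of small files
        return zlib.compress(b"".join(files), 9)

    # made-up file contents, grouped as they would be packed
    groups = [[b"aaa" * 500, b"bbb" * 500],
              [b"ccc" * 500],
              [b"ddd" * 500, b"eee" * 500]]

    with ThreadPoolExecutor() as pool:
        packs = list(pool.map(compress_group, groups))
    # each entry in `packs` would become its own pack file on disk,
    # so groups can be compressed and written independently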

So, again, the justification for the custom format is the set of points above. 
With the right compression algorithm, the custom format should give us a lot of 
flexibility and result in disk savings that are significant for small and large 
WCs alike.

If I missed some point, please bring it up again; I don't mean to ignore 
anything. Thanks for reading this far.

[1] Compare the total disk space of individual gz files with that of a tar.gz. 
Try the same with tar.bz2, which has a much larger window and yields 
significantly better compression. Boost 1.49 shows a ~22% gain between tar.gz 
and tar.bz2 (http://sourceforge.net/projects/boost/files/boost/1.49.0/).
[2] My painful experience is with WebKit: a full checkout of 7 branches on NTFS 
(2+ million total files on the partition). Prior to SVN 1.7, the number of 
files/folders was even worse.

Cheers,
Ash
