All,

Comments and simulation results in-line.


>________________________________
> From: Hyrum K Wright <hyrum.wri...@wandisco.com>
[snip]
>
>So, I've read through the design document, and the various threads,
>and have a couple of comments / questions which I don't think have
>been addressed.  My first impression, though is to give you major
>kudos for going through the effort to research and think about this
>complex and subtle problem.  Now my thoughts...
>
>As mentioned elsewhere, I too was surprised by the choice of a custom
>container, though I think you make a good argument for it.  One
>simplification I was thinking about is this: what if the container
>only needed to support add and batch-delete operations?  These are the
>current constraints of the existing pristine store; would they
>introduce additional simplicity into your design?

These are indeed the two main operations. Splitting and repacking (aka 
defragmenting) aren't basic operations per se; they're housekeeping, but 
necessary if we care about disk-space savings and performance (the principal 
requirements for this feature).

>
>In some respects, it looks like you're solving *two* problems:
>compression and the internal fragmentation due to large FS block
>sizes.  How orthogonal are the problems?  Could they be solved
>independently of each other in some way?  I know that compression
>exposes the internal fragmentation issue, but used alone it certainly
>doesn't make things *worse* does it?

Compression exposes internal fragmentation and, yes, it makes it *worse* (see 
hard numbers below).
The two problems are therefore orthogonal only if all we care about is absolute 
savings. (In other words, compressed files suffer proportionally more internal 
fragmentation, but the overall footprint is still reduced, just not as 
efficiently as is ultimately possible.)
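
A toy example makes this concrete. This is a small sh sketch with made-up 
sizes (a hypothetical 5000-byte file that gzips to roughly 1800 bytes, on an 
FS with 4096-byte blocks); the numbers are illustrative only:

# Hypothetical numbers: compression shrinks the payload, but the on-disk size
# is still rounded up to whole blocks, so the waste *relative to the payload*
# grows even though the absolute footprint shrinks.
BLOCK=4096
for SIZE in 5000 1800; do            # 1800 ~ the same 5000-byte file gzipped
    ON_DISK=$(( (SIZE + BLOCK - 1) / BLOCK * BLOCK ))
    WASTE=$(( ON_DISK - SIZE ))
    echo "payload ${SIZE}B -> ${ON_DISK}B on disk, waste ${WASTE}B ($(( WASTE * 100 / SIZE ))% of payload)"
done
# -> payload 5000B -> 8192B on disk, waste 3192B (63% of payload)
# -> payload 1800B -> 4096B on disk, waste 2296B (127% of payload)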

>
>Finally, in all the above let's not let "the perfect be the enemy of
>the good."  If something *simple* will give us demonstrable
>performance improvements now, can we do so without limiting our
>ability to do a more complex and complete solution later?

That is very possible, although it comes with a price tag: maintenance. 
Unless it's an internal, experimental change, releasing a simple solution to the 
public may well hinder replacing it with an improved version later. But 
nothing prevents us from doing things incrementally. Again, see below for 
more info.

>
>Anyway, good work, and here's hoping it yields fruit.

Thanks! The pleasure is all mine :)

>
>-Hyrum
>

Greg voiced a similar suggestion of a simple solution that can be improved 
later:

----- Original Message -----
> From: Greg Stein <gst...@gmail.com>
[snip]
> For simplicity's sake, I would recommend an initial solution using
> wc.db for the metadata. You could then write a test to see whether a
> custom format would significantly outperform sqlite, and discuss a
> change at that point.
> 
> IOW, do not "prematurely optimize". Go with the straightforward
> sqlite-based solution and test/validate the speed in a future review.
> 
> Cheers,
> -g 

Since both of the debated techniques for implementing Compressed Pristines 
(individual gz files vs. packing+gz) are within reach with existing tools, and 
some tests have already produced hard figures (see older messages in this 
thread), it's reasonable to run simulations that show, with hard numbers, to 
what extent the speculation and estimation (mostly by yours truly) about the 
advantages of a custom pack file are justified.

To bring everyone onto the same page: the proposed custom pack file promises a 
significant reduction in disk space by sorting the pristine files, packing them 
into large files (sized with performance in mind) and compressing them 
together, which reduces internal fragmentation (aka sub-block waste) to a 
minimum. The competing alternative is to simply compress the individual 
pristine files in place (basically gzip them), at least as a first step. A 
compromise solution has been suggested to reduce internal fragmentation (which, 
percentage-wise, affects small files more) by placing "small" files in a/the 
sqlite database file (for some smallness threshold). The argument against the 
latter suggestion is that we can't exploit inter-file similarities in sqlite 
blobs, so we'd only save the sub-block waste (which, if the simulation is of 
any value, is significant).
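
For concreteness, the two competing approaches boil down to something like the 
following sh sketch (paths and the sort key are illustrative only, not the 
proposed on-disk format):

# In-place compression: every pristine file gets its own .gz, so every file
# still pays its own sub-block rounding, and the compressor never sees past a
# single file's boundaries.
find .svn/pristine -type f -name '*.svn-base' |
while read -r f; do
    gzip -c "$f" > "$f.gz"
done

# Packing: order the files so similar content sits together (extension, then
# name), concatenate, and compress the single stream; the block rounding is
# paid once per pack and inter-file similarities become compressible.
find . -type f ! -path './.svn/*' |
    awk -F/ '{ e = $NF; sub(/^.*\./, "", e); print e "\t" $NF "\t" $0 }' |
    sort | cut -f3 |
    tr '\n' '\0' | xargs -0 cat | gzip > pristines.pack.gz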

The simulation script (attached) does the following (a condensed sketch of the 
measurements follows the list):
Check out a given WC.
Collect file stats (count, total size, total size on disk) for the Pristine 
Store.
gzip individual files.
bzip2 individual files.
cat all files, then gzip.
cat all files, then bzip2.
cat all files sorted by extension then filename, then gzip.
cat all files sorted by extension then filename, then bzip2.
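
The size and waste figures reported below can be collected along these lines. 
This is my sketch of the mechanics, assuming GNU stat/xargs and a working copy 
checked out under "wc"; the attached script is the authoritative version:

# Sum apparent size vs. size on disk for a set of files and derive the
# sub-block waste as reported below (waste expressed relative to the payload,
# which is why it can exceed 100% for small compressed files).
# GNU stat: %s = size in bytes, %b = number of 512-byte blocks allocated.
measure() {
    find "$1" -type f -name "$2" -print0 |
    xargs -0 stat -c '%s %b' |
    awk -v label="$3" '
        { bytes += $1; disk += $2 * 512; files++ }
        END {
            waste = disk - bytes
            printf "%s: %d bytes in %d files, %d bytes on disk, sub-block waste %d (%.2f%%)\n",
                   label, bytes, files, disk, waste, 100 * waste / bytes
        }'
}

measure wc/.svn/pristine '*.svn-base'    '.svn-base'
measure wc/.svn/pristine '*.svn-base.gz' '.svn-base.GZ'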

The total sizes are collected and compared in absolute bytes and in 
percentages. The simulation attempts to demonstrate the advantages of packing 
files before compression (and of sorting them before packing) over simply 
compressing individual files in place. In addition, the simulation hopes to 
show that gzip is outdated and that superior compressors will yield significant 
gains, especially when combined with packing and sorting. Bzip2 isn't the last 
word in general-purpose compression and there are superior 
algorithms/implementations, but it's used here to give us a feel for how much 
packing and sorting can affect the results given a compressor that can exploit 
them better. Ideally, we'd want a compressor that is as fast as gzip yet more 
efficient than bzip2.

NOTE: In all cases except sorting, the actual pristine files were used, because 
the Pristine Store is duplicate-free (duplicates would otherwise give packing 
an extra advantage, since the compressor can exploit inter-file similarities). 
Sorting, however, requires the original file names, which the PS lacks, so the 
checked-out files were used instead. In addition, I *did* want the sorted 
packing to benefit from duplicates, because in a WC that has multiple 
tags/branches, packing sorted files will yield superior results thanks to 
inter-file similarities.

Here is a sample output:

::Subversion Compressed Pristines Simulation::
Disk block-size is 4096 bytes.

Checking out http://svn.apache.org/repos/asf/subversion/trunk/
svn: .svn-base is 44730415 bytes in 1882 files, 23767 average file size. 
Sub-block waste 3942353 (8.81%)
svn: .svn-base.GZ is 16044735 bytes in 1882 files, 8525 average gz file size. 
Sub-block waste 4644161 (28.94%)
svn: .svn-base.BZ2 is 15146221 bytes in 1882 files, 8047 average bzip2 file 
size. Sub-block waste 4563731 (30.13%)
svn: .svn-base.GZ saves 27983872 bytes, .svn-base.BZ2 saves 28962816 bytes 
(57.49% and 59.50%) respectively.
svn: .svn-base is 48672768 bytes on disk, .svn-base.GZ is 20688896 bytes on 
disk, .svn-text.BZ2 is 19709952 bytes on disk.
svn: GZ Pack is 14295040 bytes on disk, BZ2 Pack is 13062144 bytes on disk. 
Saved 34377728 and 35610624 bytes (70.63% and 73.16%) respectively.
svn: GZ pack saves 6393856 bytes (69.09%) compared to .svn-base.GZ.
svn: BZ2 pack saves 6647808 bytes (66.27%) compared to .svn-base.BZ2.
svn: GZ Sorted Pack is 13832192 bytes on disk, BZ2 Sorted Pack is 12128256 
bytes on disk. Saved 34840576 and 36544512 bytes (71.58% and 75.08%) 
respectively.
svn: pristine files: 48672768 bytes on disk.
svn: in-place gz: 20688896 bytes (42.50% of original)
svn: in-place bz2: 19709952 bytes (40.49% of original)
svn: packed gz: 14295040 bytes (29.36% of original)
svn: packed bz2: 13062144 bytes (26.83% of original)
svn: sorted packed gz: 13832192 bytes (28.41% of original)
svn: sorted packed bz2: 12128256 bytes (24.91% of original)

Checking out svn://gcc.gnu.org/svn/gcc/trunk
gcc: .svn-base is 444058945 bytes in 75223 files, 5903 average file size. 
Sub-block waste 217023167 (48.87%)
gcc: .svn-base.GZ is 137009307 bytes in 75223 files, 1821 average gz file size. 
Sub-block waste 247084901 (180.34%)
gcc: .svn-base.BZ2 is 127560017 bytes in 75223 files, 1695 average bzip2 file 
size. Sub-block waste 246343343 (193.11%)
gcc: .svn-base.GZ saves 276987904 bytes, .svn-base.BZ2 saves 287178752 bytes 
(41.89% and 43.44%) respectively.
gcc: .svn-base is 661082112 bytes on disk, .svn-base.GZ is 384094208 bytes on 
disk, .svn-text.BZ2 is 373903360 bytes on disk.
gcc: GZ Pack is 109268992 bytes on disk, BZ2 Pack is 91897856 bytes on disk. 
Saved 551813120 and 569184256 bytes (83.47% and 86.09%) respectively.
gcc: GZ pack saves 274825216 bytes (28.44%) compared to .svn-base.GZ.
gcc: BZ2 pack saves 282005504 bytes (24.57%) compared to .svn-base.BZ2.
gcc: GZ Sorted Pack is 94666752 bytes on disk, BZ2 Sorted Pack is 73170944 
bytes on disk. Saved 566415360 and 587911168 bytes (85.68% and 88.93%) 
respectively.
gcc: pristine files: 661082112 bytes on disk.
gcc: in-place gz: 384094208 bytes (58.10% of original)
gcc: in-place bz2: 373903360 bytes (56.55% of original)
gcc: packed gz: 109268992 bytes (16.52% of original)
gcc: packed bz2: 91897856 bytes (13.90% of original)
gcc: sorted packed gz: 94666752 bytes (14.31% of original)
gcc: sorted packed bz2: 73170944 bytes (11.06% of original)

From this output it's clear how much waste there is due to internal 
fragmentation and how much packing helps. Note that SVN is indeed smallish, 
with only 1882 files averaging 23767 bytes per file. Although in-place 
compression gives almost 60% reduction in disk space (thanks to largish files), 
sorting+packing+gzip crosses the 70% mark and bzip2 crosses the 75% point. Put 
differently, the in-place gzip footprint is about 1.5 times the size of the 
sorted+packed gzip footprint (42.50% vs. 28.41% of the original).

With GCC the picture is quite different. With 75000+ files averaging 5900 
bytes, the sub-block waste already amounts to almost 49% of the actual data. 
When compressed individually, the waste jumps to 180% and 193% of the (now 
compressed) payload for gzip and bzip2 respectively, which is why those 
percentages can exceed 100%. It is therefore no surprise that while in-place 
gzip saves over 40% of the disk space, sorting+packing+gzip crosses the 85% 
threshold and bzip2 almost hits 89%. That is, the in-place gzip footprint is 
roughly 4 times the sorted+packed gzip footprint (58.10% vs. 14.31% of the 
original), and over 5 times the sorted+packed bzip2 footprint.

Another important point, besides the advantages of having custom pack files, is 
the reduction in the number of files on disk. Even if these numbers turn out to 
be too good to be true in practice, reducing the number of files in the 
pristine store will alleviate pressure on the FS and should, in most cases, 
improve the overall performance of pristine store operations (fewer IO 
operations, fewer kernel-mode transitions, and better file-cache utilization). 
We may not achieve the numbers above, due to the overheads of our solution, but 
we should be able to approach them asymptotically with improved compression and 
by balancing the overheads.

With multiple branches in the same WC, the advantage of sorting+packing should 
grow even more, thanks to inter-file similarity exploitation, which in-place 
compression has no way of competing with. Unfortunately, checking out tens of 
GBs and running millions of files through the compressors takes the better part 
of a day... at least with my resources. Everybody is welcome to run the 
simulation on their favorite repository and see for themselves. You're of 
course encouraged to share your findings with us, especially to confirm or 
contradict the results of my two case studies. Also, please do let me know if 
you think I goofed somewhere!
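
To try it, something along these lines should do (the exact invocation may 
differ; check the comments at the top of the script):

# Point the attached script at a repository of your choice; expect it to take
# a while on large repositories.
sh ./svn_cp_sim.sh http://svn.apache.org/repos/asf/subversion/trunk/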

The bottom line of this simulation is to give us more information so we can 
make better, more educated decisions. What I hope to have achieved, at the 
expense of a few good hours of my weekend, is to save many more good hours for 
the members of the SVN community. I'm not here to make decisions or 
suggestions, so I'll leave you with the hard numbers and my opinion that there 
must be a sweet spot between the naive and the extreme approaches that will 
prove best for the community at large.

Cheers,
Ash

Attachment: svn_cp_sim.sh
Description: Bourne shell script
