On 06/02/2011 14:17, Ulrik Mikaelsson wrote:
2011/2/4 Bruno Medeiros<brunodomedeiros+spam@com.gmail>:

Well, like I said, my concern about size is not so much disk space, but the
time to make local copies of the repository, or cloning it from the internet
(and the associated transfer times), both of which are not negligible yet.
My project at work could easily have grown to 1 GB of repo size if, in the
last year or so, it had been stored in a DVCS! :S

I hope this gets addressed at some point. But I fear that the main
developers of both Git and Mercurial may be too "biased" by experience with
projects which are typically somewhat small in size, in terms of bytes
(projects that consist almost entirely of source code).
For example, in UI applications it would be common to store binary data
(images, sounds, etc.) in the source control. The other case is what I
mentioned before, wanting to store dependencies together with the project
(in my case including the javadoc and source code of the dependencies - and
there are very good reasons to want to do that).

I think the storage/bandwidth requirements of DVCSs are very often
exaggerated, especially for text, but also somewhat for blobs.
  * For text-content, the compression of archives reduces them to,
perhaps, 1/5 of their original size?
    - That means that unless you completely rewrite a file 5 times
during the course of a project, simple per-revision compression of the
file will turn out smaller than the single uncompressed base file
that subversion transfers and stores.
    - The delta-compression applied ensures small changes do not
count as a "rewrite".
  * For blobs, the archive-compression may not do as much, and they
certainly pose a larger challenge for storing history, but:
    - AFAIU, at least git delta-compresses even binaries, so even
changes in them might be slightly reduced (dunno about the others)
    - I think more and more graphics today are written in SVG?
    - I believe, for most projects, audio files are usually not changed
very often once they have entered a project? Usually existing samples
are simply copied in?
  * For both binaries and text, and for most projects, the latest
revision is usually the largest. (Projects usually grow over time;
they don't consistently shrink.) I.e., older revisions are, compared to
current, much much smaller, making the size of old history smaller
compared to the size of current history.
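
As a rough sketch of the break-even arithmetic in the first bullet: the
1/5 ratio is the estimate given above, while the function name and the
1000-byte file size are made up for illustration. It models the worst
case, where every revision is a complete rewrite and delta-compression
does not help at all.

```python
from fractions import Fraction


def compressed_history_size(file_size, revisions, ratio=Fraction(1, 5)):
    """Total stored size when each revision is compressed on its own.

    Worst case: every revision is a full rewrite, so nothing is shared
    between revisions and each one costs file_size * ratio.
    """
    return file_size * revisions * ratio


uncompressed = 1000  # bytes; hypothetical size of the single base file

for revisions in range(1, 7):
    total = compressed_history_size(uncompressed, revisions)
    verdict = "smaller" if total < uncompressed else "not smaller"
    print(f"{revisions} full rewrites: {total} bytes ({verdict} than one raw copy)")
```

Under these assumptions the compressed history only catches up with one
uncompressed copy at the fifth full rewrite; any delta-compression pushes
that point further out.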

Finally, as a test, I tried checking out the latest version of druntime
from SVN and compared it to git (AFAICT, history was preserved in the
git migration); the results were about what I expected. Checking out
trunk from SVN, and the whole history from git:
   SVN: 7.06 seconds, 5.3 MB on disk
   Git: 2.88 seconds, 3.5 MB on disk
   Improvement Git/SVN: time reduced by 59%, space reduced by 34%.
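
For reference, the percentages can be reproduced from the measurements
above with a few lines of Python. The helper name is just for this
example; the numbers are the ones from the transcript.

```python
def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


svn_seconds, git_seconds = 7.06, 2.88  # elapsed time from `time`
svn_mb, git_mb = 5.3, 3.5              # on-disk size from `du -sh`

print(f"time reduced by {reduction_percent(svn_seconds, git_seconds):.0f}%")
print(f"space reduced by {reduction_percent(svn_mb, git_mb):.0f}%")
```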

I did not measure bandwidth, but my guess is it is somewhere between
the disk- and time- reductions. Also, if someone has an example of a
recently converted repository including some blobs it would make an
interesting experiment to repeat.

Regards
/ Ulrik

-----

ulrik@ulrik ~/p/test> time svn co http://svn.dsource.org/projects/druntime/trunk druntime_svn
...
0.26user 0.21system 0:07.06elapsed 6%CPU (0avgtext+0avgdata 47808maxresident)k
544inputs+11736outputs (3major+3275minor)pagefaults 0swaps
ulrik@ulrik ~/p/test>  du -sh druntime_svn
5,3M    druntime_svn

ulrik@ulrik ~/p/test> time git clone git://github.com/D-Programming-Language/druntime.git druntime_git
...
0.26user 0.06system 0:02.88elapsed 11%CPU (0avgtext+0avgdata 14320maxresident)k
3704inputs+7168outputs (18major+1822minor)pagefaults 0swaps
ulrik@ulrik ~/p/test>  du -sh druntime_git/
3,5M    druntime_git/


Yes, Brad had posted some statistics on the size of the Git repositories for dmd, druntime, and phobos, and yes, they are pretty small. Projects which contain practically only source code, and little to no binary data, are unlikely to grow much, and repo size is unlikely to ever be a problem for them. But that might not be the case for other projects (also considering that binary data is usually already well compressed, like .zip, .jpg, .mp3, .ogg, etc., so VCS compression won't help much).
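
A small sketch of why compression helps text but not already-compressed
data. It uses Python's zlib as a stand-in for the VCS (Git's object
compression is also zlib-based), and both inputs are synthetic stand-ins
chosen for this example: highly repetitive text, which compresses better
than real source code would, and pseudo-random bytes, which behave like a
.zip or .jpg in that they have no redundancy left to remove.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)


# Stand-in for source code: one line repeated many times (optimistic case).
text_like = b"int add(int a, int b) { return a + b; }\n" * 2000

# Stand-in for an already-compressed file: seeded pseudo-random bytes.
rng = random.Random(0)
blob_like = bytes(rng.getrandbits(8) for _ in range(len(text_like)))

print(f"text-like ratio: {compression_ratio(text_like):.3f}")
print(f"blob-like ratio: {compression_ratio(blob_like):.3f}")
```

The text-like input shrinks to a small fraction of its size, while the
blob-like input stays at roughly its original size, so every new version
of such a file costs close to its full size in the history.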

It's unlikely you will see converted repositories with a lot of changing blob data. DVCSs, at least in the way they work currently, simply kill this workflow/organization-pattern. I very much suspect this issue will become more important as time goes on: a lot of people are still new to DVCS and they still don't realize the full implications of that architecture with regard to repo size. Any file you commit will add to the repository size *FOREVER*. I'm pretty sure we haven't heard the last word in the VCS battle, and that in a few years' time people will *again* be talking about and switching to another VCS :( . Mark these words. (The only way this is not going to happen is if Git or Mercurial are able to address this issue in a satisfactory way, which I'm not sure is possible or easy.)


--
Bruno Medeiros - Software Engineer
