Re: Git and GCC

2007-12-14 Thread Nix
On 8 Dec 2007, Johannes Schindelin said:

 Hi,

 On Sat, 8 Dec 2007, J.C. Pizarro wrote:

 On 2007/12/07, Linus Torvalds [EMAIL PROTECTED] wrote:

  SHA1 is almost totally insignificant on x86. It hardly shows up. But 
  we have a good optimized version there.
 
 If SHA1 is slow, then why doesn't he contribute Haval160 (3 rounds), 
 which is faster than SHA1? And optimize it still more with SIMD 
 instructions in kernel space and userland.

 He said SHA-1 is insignificant.

Actually davem also said it *is* significant on SPARC. But of course
J. C. Pizarro's suggested solution won't work, because you can't just go
around replacing SHA-1 in git with something else :) You could *add* new
hashing methods, but you couldn't avoid SHA-1, and adding a new hashing
method would bloat every object and every hash in objects like commits
with an indication of which hashing method was in use.

(But you know this.)

 1.   Don't compress this repo but compact this uncompressed repo
   using minimal spanning forest and deltas

... and then you do a git-gc. Oops, now what?

... or perhaps you want to look something up in the pack. Now you have to
unpack a large hunk of the whole damn thing.

 2.   After, compress this whole repo with LZMA (e.g. 48MiB) from 7zip before
   burning it to DVD for backup reasons or before replicating it to
  internet.

 Patches? ;-)

Replicating a pack to the internet is almost invariably replicating
*parts* of a pack anyway, which reduces to the problem with option 1
above...

-- 
`The rest is a tale of post and counter-post.' --- Ian Rawlings
   describes USENET


Re: Git and GCC

2007-12-13 Thread Harvey Harrison
On Thu, 2007-12-13 at 14:40 +, Rafael Espindola wrote:
  Yes, everything, by default you only get the more modern branches/tags,
  but it's all in there.  If there is interest I can work with Bernardo
  and get the rest publicly exposed.
 
 I decided to give it a try, but could not find the tuples branch. Is
 it too hard to make gimple-tuples-branch and lto visible?
 

Here's a suggestion I sent to the git list; it's a bit longer than it
needs to be, but I think you'll understand a lot better what's going
on this way:

After the discussions lately regarding the gcc svn mirror, I'm coming
up with a recipe to set up your own git-svn mirror.  Suggestions on the
following are welcome.

// Create directory and initialize git
mkdir gcc
cd gcc
git init
// add the remote site that currently mirrors gcc
// I have chosen the name gcc.gnu.org *1* as my local name to refer to
// this mirror; choose something else if you like
git remote add gcc.gnu.org git://git.infradead.org/gcc.git
// fetching someone else's remote branches is not a standard thing to do
// so we'll need to edit our .git/config file
// you should have a section that looks like:
[remote gcc.gnu.org]
url = git://git.infradead.org/gcc.git
fetch = +refs/heads/*:refs/remotes/gcc.gnu.org/*
// infradead's mirror puts the gcc svn branches in its own namespace
// refs/remotes/gcc.gnu.org/*
// change our fetch line accordingly
[remote gcc.gnu.org]
url = git://git.infradead.org/gcc.git
fetch = +refs/remotes/gcc.gnu.org/*:refs/remotes/gcc.gnu.org/*
// fetch the remote data from the mirror site
git remote update
// set up git-svn
// gcc has the standard trunk/branches/tags naming so use -s
// add a prefix so git-svn uses the metadata we just got from the
// mirror so we don't have to get everything from the svn server
// the --prefix must match whatever you chose in *1*, the trailing
// slash is important.
git svn init -s --prefix=gcc.gnu.org/ svn://gcc.gnu.org/svn/gcc
// your config should look like this now:
[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
[remote gcc.gnu.org]
url = git://git.infradead.org/gcc.git
fetch = +refs/remotes/gcc.gnu.org/*:refs/remotes/gcc.gnu.org/*
[svn-remote svn]
url = svn://gcc.gnu.org/svn/gcc
fetch = trunk:refs/remotes/gcc.gnu.org/trunk
branches = branches/*:refs/remotes/gcc.gnu.org/*
tags = tags/*:refs/remotes/gcc.gnu.org/tags/*
// Try and get more revisions from the svn server
// this may take a little while the first time as git-svn builds
// metadata to allow bi-directional operation
// Note: git-svn has a patch in testing to use a _vastly_ more
// space efficient mapping from svn rev to git sha, I'd
// suggest you get it.
//
// This will rebuild the mapping for every svn branch
git svn fetch
// If you only care about one branch
// Check out a local copy of the tuples branch and switch
// to it
git checkout -b tuples remotes/gcc.gnu.org/tuples
// Update the git-svn metadata
git svn rebase
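
A hedged sketch of the upkeep routine afterwards, assuming the remote and
--prefix names chosen above (gcc.gnu.org) and a local branch called tuples;
nothing here goes beyond what the steps above already set up:

# refresh the mirror's refs so git-svn can reuse them
git remote update
# import any svn revisions the mirror doesn't cover yet
git svn fetch
# then bring a working branch up to date
git checkout tuples
git svn rebase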

Hope that helps to get you started.

Harvey




Re: Git and GCC

2007-12-10 Thread Gabriel Paubert
On Fri, Dec 07, 2007 at 04:47:19PM -0800, Harvey Harrison wrote:
 Some interesting stats from the highly packed gcc repo.  The long chain
 lengths very quickly tail off.  Over 60% of the objects have a chain
 length of 20 or less.  If anyone wants the full list let me know.  I
 also have included a few other interesting points, the git default
 depth of 50, my initial guess of 100 and every 10% in the cumulative
 distribution from 60-100%.
 
 This shows the git default of 50 really isn't that bad, and after
 about 100 it really starts to get sparse.  

Do you have a way to know which files have the longest chains?

I have a suspicion that the ChangeLog* files are among them,
not only because they are, almost without exception, only modified
by prepending text to the previous version (and by a fairly small amount
compared to the size of the file), so the diff is simple
(a single hunk) and the limit on chain depth is probably what
causes a new copy to be created.

Besides that, these files grow quite large and become some of the
largest files in the tree, and at least one of them is changed
for every commit. This leads again to many versions of fairly
large files.

If this guess is right, this implies that most of the size gains
from longer chains come from having fewer copies of the ChangeLog*
files. From a performance point of view, it is rather favourable
since the differences are simple. This would also explain why
the window parameter has little effect.

Regards,
Gabriel


Re: Git and GCC

2007-12-10 Thread David Miller
From: David Miller [EMAIL PROTECTED]
Date: Fri, 07 Dec 2007 04:53:29 -0800 (PST)

 I should run oprofile...

While doing the initial object counting, most of the time is spent in
lookup_object(), memcmp() (via hashcmp()), and inflate().  I tried to
see if I could do some tricks on sparc with the hashcmp() but the sha1
pointers are very often not even 4 byte aligned.

I suspect lookup_object() could be improved if it didn't use a hash
table without chaining, but I can see why 'struct object' size is a
concern and thus why things are done the way they are.

samples  %        app name         symbol name
504      13.7517  libc-2.6.1.so    memcmp
386      10.5321  libz.so.1.2.3.3  inflate
288       7.8581  git              lookup_object
248       6.7667  libz.so.1.2.3.3  inflate_fast
201       5.4843  libz.so.1.2.3.3  inflate_table
175       4.7749  git              decode_tree_entry
 ...

Deltifying is 94% consumed by create_delta(); the rest is completely
in the noise.

samples  %        app name         symbol name
10581    94.8373  git              create_delta
181       1.6223  git              create_delta_index
72        0.6453  git              prepare_pack
55        0.4930  libc-2.6.1.so    loop
34        0.3047  libz.so.1.2.3.3  inflate_fast
33        0.2958  libc-2.6.1.so    _int_malloc
22        0.1972  libshadow.so     shadowUpdatePacked
21        0.1882  libc-2.6.1.so    _int_free
19        0.1703  libc-2.6.1.so    malloc
 ...


Re: Git and GCC

2007-12-10 Thread Nicolas Pitre
On Mon, 10 Dec 2007, Gabriel Paubert wrote:

 On Fri, Dec 07, 2007 at 04:47:19PM -0800, Harvey Harrison wrote:
  Some interesting stats from the highly packed gcc repo.  The long chain
  lengths very quickly tail off.  Over 60% of the objects have a chain
  length of 20 or less.  If anyone wants the full list let me know.  I
  also have included a few other interesting points, the git default
  depth of 50, my initial guess of 100 and every 10% in the cumulative
  distribution from 60-100%.
  
  This shows the git default of 50 really isn't that bad, and after
  about 100 it really starts to get sparse.  
 
 Do you have a way to know which files have the longest chains?

With 'git verify-pack -v' you get the delta depth for each object.
Then you can use 'git show' with the object SHA1 to see its content.
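
For instance, a rough sketch of combining the two (hedged: the exact column
layout of 'git verify-pack -v' varies between git versions, but on deltified
entries the base object's SHA1 is the last column and the delta depth is the
one just before it; this also assumes a single pack, e.g. after 'git repack -a -d'):

# list the 20 deepest delta objects, deepest first
git verify-pack -v .git/objects/pack/pack-*.idx |
  awk 'length($NF) == 40 { print $(NF-1), $1 }' |
  sort -rn | head -20
# then look at one of them (replace <sha1> with an id printed above)
git show <sha1>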

 I have a suspicion that the ChangeLog* files are among them,
 not only because they are, almost without exception, only modified
 by prepending text to the previous version (and by a fairly small amount
 compared to the size of the file), so the diff is simple
 (a single hunk) and the limit on chain depth is probably what
 causes a new copy to be created. 

My gcc repo is currently repacked with a max delta depth of 50, and 
a quick sample of those objects at the depth limit does indeed show the 
content of the ChangeLog file.  But I have occurrences of the root 
directory tree object too, and the GCC machine description for IA-32 
content as well.

But yes, the really deep delta chains are most certainly going to 
contain those ChangeLog files.

 Besides that, these files grow quite large and become some of the 
 largest files in the tree, and at least one of them is changed 
 for every commit. This leads again to many versions of fairly 
 large files.
 
 If this guess is right, this implies that most of the size gains
 from longer chains come from having fewer copies of the ChangeLog*
 files. From a performance point of view, it is rather favourable
 since the differences are simple. This would also explain why
 the window parameter has little effect.

Well, actually the window parameter does have big effects.  For instance, 
the default of 10 is completely inadequate for the gcc repo: changing 
the window size from 10 to 100 made the corresponding pack 
shrink from 2.1GB down to 400MB, with the same max delta depth.
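
(For reference, a hedged sketch of the kind of repack behind those numbers,
using the window/depth values mentioned in this thread; run it inside the gcc
clone and expect it to take a long time:)

# -f throws away existing deltas and recomputes them with the larger window
git repack -a -d -f --window=100 --depth=50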


Nicolas


Re: Git and GCC

2007-12-08 Thread Johannes Schindelin
Hi,

On Sat, 8 Dec 2007, J.C. Pizarro wrote:

 On 2007/12/07, Linus Torvalds [EMAIL PROTECTED] wrote:

  SHA1 is almost totally insignificant on x86. It hardly shows up. But 
  we have a good optimized version there.
 
 If SHA1 is slow, then why doesn't he contribute Haval160 (3 rounds), 
 which is faster than SHA1? And optimize it still more with SIMD 
 instructions in kernel space and userland.

He said SHA-1 is insignificant.

  zlib tends to be a lot more noticeable (especially the uncompression: 
  it may be faster than compression, but it's done _so_ much more that 
  it totally dominates).
 
 It's better
 
 1.   Don't compress this repo but compact this uncompressed repo
   using minimal spanning forest and deltas
 2.   After, compress this whole repo with LZMA (e.g. 48MiB) from 7zip before
   burning it to DVD for backup reasons or before replicating it to
   internet.

Patches? ;-)

Ciao,
Dscho



Re: Git and GCC

2007-12-08 Thread Joe Buck
On Sat, 8 Dec 2007, J.C. Pizarro wrote:
  1.   Don't compress this repo but compact this uncompressed repo
       using minimal spanning forest and deltas
  2.   After, compress this whole repo with LZMA (e.g. 48MiB) from 7zip before
       burning it to DVD for backup reasons or before replicating it to
       internet.

On Sat, Dec 08, 2007 at 12:24:00PM +, Johannes Schindelin wrote:
 Patches? ;-)

git list, meet J.C. Pizarro.  Care to take him off of our hands for
a while?  He's been hanging on the gcc list for some time, and perhaps
seeks new horizons.

Mr. Pizarro has endless ideas, and he'll give you some new ones every day.
He thinks that no one else knows any computer science, and he will attempt
to teach you what he knows, and tell you to rewrite all of your code based
on something he read and half-understood.  But he's not interested in
actually DOING the work, mind you; that's up to you.  When you object
that he's wasting your time, he'll start talking about freedom of speech.





Re: Git and GCC

2007-12-08 Thread Marco Costalba
On Dec 8, 2007 8:53 PM, Joe Buck [EMAIL PROTECTED] wrote:

 Mr. Pizarro has endless ideas, and he'll give you some new ones every day.

That's true.

 He thinks that no one else knows any computer science, and he will attempt
 to teach you what he knows,

He's not the only one ;-) he's in good and numerous company.

  But he's not interested in
 actually DOING the work, mind you; that's up to you.

Where did you read this?  I missed that part.

  When you object
 that he's wasting your time, he'll start talking about freedom of speech.


Actually he never spoke like that (probably I missed that part too).


Thanks
Marco


Re: Git and GCC

2007-12-08 Thread Daniel Berlin

 Where did you read this?  I missed that part.

   When you object
  that he's wasting your time, he'll start talking about freedom of speech.
 

 Actually he never spoke like that (probably I missed that part too).



Read the gcc mailing list archives, if you have a lot of time on your hands.


Re: Git and GCC

2007-12-07 Thread David Miller
From: Jon Smirl [EMAIL PROTECTED]
Date: Fri, 7 Dec 2007 02:10:49 -0500

 On 12/7/07, Jeff King [EMAIL PROTECTED] wrote:
  On Thu, Dec 06, 2007 at 07:31:21PM -0800, David Miller wrote:
 
  # and test multithreaded large depth/window repacking
  cd test
  git config pack.threads 4
 
 64 threads with 64 CPUs, if they are multicore you want even more.
 you need to adjust chunk_size as mentioned in the other mail.

It's an 8 core system with 64 cpu threads.

  time git repack -a -d -f --window=250 --depth=250

Didn't work very well; even with the one-liner patch for
chunk_size it died.  I think I need to build 64-bit
binaries.

[EMAIL PROTECTED]:~/src/GCC/git/test$ time git repack -a -d -f --window=250 
--depth=250
Counting objects: 1190671, done.
fatal: Out of memory? mmap failed: Cannot allocate memory

real    58m36.447s
user    289m8.270s
sys     4m40.680s
[EMAIL PROTECTED]:~/src/GCC/git/test$ 

While it did run, the load was anywhere between 5 and 9, although it
did create 64 threads, and the size of the process was about 3.2GB.
This may be in part why it wasn't able to use all 64 threads
effectively.  Like I said, it seemed to have 9 active at best at any
one time; most of the time only 4 or 5 were busy doing anything.

Also I could end up being performance limited by SHA, it's not very
well tuned on Sparc.  It's been on my TODO list to code up the crypto
unit support for Niagara-2 in the kernel, then work with Herbert Xu on
the userland interfaces to take advantage of that in things like
libssl.  Even a better C/asm version would probably improve GIT
performance a bit.

Is SHA a significant portion of the compute during these repacks?
I should run oprofile...


Re: Git and GCC

2007-12-07 Thread Linus Torvalds


On Thu, 6 Dec 2007, Harvey Harrison wrote:
 
 I've updated the public mirror repo with the very-packed version.

Side note: it might be interesting to compare timings for 
history-intensive stuff with and without this kind of very-packed 
situation.

The very density of a smaller pack-file might be enough to overcome the 
downsides (more CPU time to apply longer delta-chains), but regardless, 
real numbers talks, bullshit walks. So wouldn't it be nice to have real 
numbers?

One easy way to get real numbers for history would be to just time some 
reasonably costly operation that uses lots of history. Ie just do a 

time git blame -C gcc/regclass.c > /dev/null

and see if the deeper delta chains are very expensive.
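
(A hedged sketch of such a comparison; the two clone directories here are
hypothetical names for an aggressively packed copy and a default-packed copy
of the same repository:)

cd gcc-aggressive && time git blame -C gcc/regclass.c > /dev/null
cd ../gcc-default && time git blame -C gcc/regclass.c > /dev/null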

(Yeah, the above is pretty much designed to be the worst possible case for 
this kind of aggressive history packing, but I don't know if that choice 
of file to try to annotate is a good choice or not. I suspect that git 
blame -C with a CVS import is just horrid, because CVS commits tend to be 
pretty big and nasty and not as localized as we've tried to make things in 
the kernel, so doing the code copy detection is probably horrendously 
expensive)

Linus



Re: Git and GCC

2007-12-07 Thread J.C. Pizarro
On 2007/12/07, Linus Torvalds [EMAIL PROTECTED] wrote:
 On Fri, 7 Dec 2007, David Miller wrote:
 
  Also I could end up being performance limited by SHA, it's not very
  well tuned on Sparc.  It's been on my TODO list to code up the crypto
  unit support for Niagara-2 in the kernel, then work with Herbert Xu on
  the userland interfaces to take advantage of that in things like
  libssl.  Even a better C/asm version would probably improve GIT
  performance a bit.

 I doubt you can use the hardware support. Kernel-only hw support is
 inherently broken for any sane user-space usage, the setup costs are just
 way way too high. To be useful, crypto engines need to support direct user
 space access (ie a regular instruction, with all state being held in
 normal registers that get saved/restored by the kernel).

  Is SHA a significant portion of the compute during these repacks?
  I should run oprofile...

 SHA1 is almost totally insignificant on x86. It hardly shows up. But we
 have a good optimized version there.

If SHA1 is slow, then why doesn't he contribute Haval160 (3 rounds),
which is faster than SHA1? And optimize it still more with SIMD instructions
in kernel space and userland.


 zlib tends to be a lot more noticeable (especially the uncompression: it
 may be faster than compression, but it's done _so_ much more that it
 totally dominates).

   Linus

It would be better to:

1.   Not compress this repo, but compact this uncompressed repo
     using a minimal spanning forest and deltas.
2.   Afterwards, compress this whole repo with LZMA (e.g. 48MiB) from 7zip before
     burning it to DVD for backup reasons or before replicating it to the
     internet.

   J.C.Pizarro the noiser


Re: Git and GCC

2007-12-07 Thread Harvey Harrison
Some interesting stats from the highly packed gcc repo.  The long chain
lengths very quickly tail off.  Over 60% of the objects have a chain
length of 20 or less.  If anyone wants the full list let me know.  I
also have included a few other interesting points, the git default
depth of 50, my initial guess of 100 and every 10% in the cumulative
distribution from 60-100%.

This shows the git default of 50 really isn't that bad, and after
about 100 it really starts to get sparse.  

Harvey

depth:  count   cumulative  cum.%
1:      103817  103817      10.20%  (1017922 objects total)
2:      67332   171149      16.81%
3:      57520   228669      22.46%
4:      52570   281239      27.63%
5:      43910   325149      31.94%
6:      37520   362669      35.63%
7:      35248   397917      39.09%
8:      29819   427736      42.02%
9:      27619   455355      44.73%
10:     22656   478011      46.96%
11:     21073   499084      49.03%
12:     18738   517822      50.87%
13:     16674   534496      52.51%
14:     14882   549378      53.97%
15:     14424   563802      55.39%
16:     12765   576567      56.64%
17:     11662   588229      57.79%
18:     11845   600074      58.95%
19:     11694   611768      60.10%
20:     9625    621393      61.05%
34:     5354    719356      70.67%
50:     3395    785342      77.15%
60:     2547    815072      80.07%
100:    1644    898284      88.25%
113:    1292    917046      90.09%
158:    959     967429      95.04%
200:    652     997653      98.01%
219:    491     1008132     99.04%
245:    179     1017717     99.98%
246:    111     1017828     99.99%
247:    61      1017889     100.00%
248:    27      1017916     100.00%
249:    6       1017922     100.00%



Re: Git and GCC

2007-12-07 Thread Giovanni Bajo
On Fri, 2007-12-07 at 14:14 -0800, Jakub Narebski wrote:

   Is SHA a significant portion of the compute during these repacks?
   I should run oprofile...
   SHA1 is almost totally insignificant on x86. It hardly shows up. But
   we have a good optimized version there.
   zlib tends to be a lot more noticeable (especially the
   *uncompression*: it may be faster than compression, but it's done _so_
   much more that it totally dominates).
  
  Have you considered alternatives, like:
  http://www.oberhumer.com/opensource/ucl/
 
 quote
   As compared to LZO, the UCL algorithms achieve a better compression
   ratio but *decompression* is a little bit slower. See below for some
   rough timings.
 /quote
 
 It is uncompression speed that is more important, because it is used
 much more often.

I know, but the point is not what is the fastest, but whether it's fast
enough to get off the profiles. I think UCL is fast enough since it's
still much faster than zlib. Anyway, LZO is GPL too, so why not
consider it as well. They are good libraries.
-- 
Giovanni Bajo



Re: Git and GCC

2007-12-07 Thread Giovanni Bajo

On 12/7/2007 6:23 PM, Linus Torvalds wrote:


Is SHA a significant portion of the compute during these repacks?
I should run oprofile...


SHA1 is almost totally insignificant on x86. It hardly shows up. But we 
have a good optimized version there.


zlib tends to be a lot more noticeable (especially the uncompression: it 
may be faster than compression, but it's done _so_ much more that it 
totally dominates).


Have you considered alternatives, like:
http://www.oberhumer.com/opensource/ucl/
--
Giovanni Bajo



Re: Git and GCC

2007-12-07 Thread Linus Torvalds


On Fri, 7 Dec 2007, David Miller wrote:
 
 Also I could end up being performance limited by SHA, it's not very
 well tuned on Sparc.  It's been on my TODO list to code up the crypto
 unit support for Niagara-2 in the kernel, then work with Herbert Xu on
 the userland interfaces to take advantage of that in things like
 libssl.  Even a better C/asm version would probably improve GIT
 performance a bit.

I doubt you can use the hardware support. Kernel-only hw support is 
inherently broken for any sane user-space usage, the setup costs are just 
way way too high. To be useful, crypto engines need to support direct user 
space access (ie a regular instruction, with all state being held in 
normal registers that get saved/restored by the kernel).

 Is SHA a significant portion of the compute during these repacks?
 I should run oprofile...

SHA1 is almost totally insignificant on x86. It hardly shows up. But we 
have a good optimized version there.

zlib tends to be a lot more noticeable (especially the uncompression: it 
may be faster than compression, but it's done _so_ much more that it 
totally dominates).

Linus


Re: Git and GCC

2007-12-07 Thread Nicolas Pitre
On Fri, 7 Dec 2007, Jon Smirl wrote:

 On 12/7/07, Linus Torvalds [EMAIL PROTECTED] wrote:
 
 
  On Thu, 6 Dec 2007, Jon Smirl wrote:
   
    time git blame -C gcc/regclass.c > /dev/null
  
    [EMAIL PROTECTED]:/video/gcc$ time git blame -C gcc/regclass.c > /dev/null
  
    real    1m21.967s
    user    1m21.329s
 
  Well, I was also hoping for a "compared to not-so-aggressive packing"
  number on the same machine.. IOW, what I was wondering is whether there is
  a visible performance downside to the deeper delta chains in the 300MB
  pack vs the (less aggressive) 500MB pack.
 
 Same machine with a default pack
 
 [EMAIL PROTECTED]:/video/gcc/.git/objects/pack$ ls -l
 total 2145716
 -r--r--r-- 1 jonsmirl jonsmirl   23667932 2007-12-07 02:03
 pack-bd163555ea9240a7fdd07d2708a293872665f48b.idx
 -r--r--r-- 1 jonsmirl jonsmirl 2171385413 2007-12-07 02:03
 pack-bd163555ea9240a7fdd07d2708a293872665f48b.pack
 [EMAIL PROTECTED]:/video/gcc/.git/objects/pack$
 
 Delta lengths have virtually no impact. 

I can confirm this.

I just did a repack keeping the default depth of 50 but with window=100 
instead of the default of 10, and the pack shrunk from 2171385413 bytes 
down to 410607140 bytes.

So our default window size is definitely not adequate for the gcc repo.

OTOH, I recall tytso mentioning something about not having much return 
on  a bigger window size in his tests when he proposed to increase the 
default delta depth to 50.  So there is definitely some kind of threshold 
at which point the increased window size stops being advantageous wrt 
the number of cycles involved, and we should find a way to correlate it 
to the data set to have a better default window size than the current 
fixed default.


Nicolas


Re: Git and GCC

2007-12-07 Thread Jakub Narebski
Giovanni Bajo [EMAIL PROTECTED] writes:

 On 12/7/2007 6:23 PM, Linus Torvalds wrote:
 
  Is SHA a significant portion of the compute during these repacks?
  I should run oprofile...
  SHA1 is almost totally insignificant on x86. It hardly shows up. But
  we have a good optimized version there.
  zlib tends to be a lot more noticeable (especially the
  *uncompression*: it may be faster than compression, but it's done _so_
  much more that it totally dominates).
 
 Have you considered alternatives, like:
 http://www.oberhumer.com/opensource/ucl/

quote
  As compared to LZO, the UCL algorithms achieve a better compression
  ratio but *decompression* is a little bit slower. See below for some
  rough timings.
/quote

It is uncompression speed that is more important, because it is used
much more often.

-- 
Jakub Narebski
ShadeHawk on #git



Re: Git and GCC

2007-12-07 Thread Luke Lu

On Dec 7, 2007, at 2:14 PM, Jakub Narebski wrote:

Giovanni Bajo [EMAIL PROTECTED] writes:

On 12/7/2007 6:23 PM, Linus Torvalds wrote:

Is SHA a significant portion of the compute during these repacks?
I should run oprofile...

SHA1 is almost totally insignificant on x86. It hardly shows up. But
we have a good optimized version there.
zlib tends to be a lot more noticeable (especially the
*uncompression*: it may be faster than compression, but it's done  
_so_

much more that it totally dominates).


Have you considered alternatives, like:
http://www.oberhumer.com/opensource/ucl/


quote
  As compared to LZO, the UCL algorithms achieve a better compression
  ratio but *decompression* is a little bit slower. See below for some
  rough timings.
/quote

It is uncompression speed that is more important, because it is used
much more often.


So why didn't we consider lzo then? It's much faster than zlib.

__Luke

 


Re: Git and GCC

2007-12-07 Thread Daniel Berlin
On 12/7/07, Giovanni Bajo [EMAIL PROTECTED] wrote:
 On Fri, 2007-12-07 at 14:14 -0800, Jakub Narebski wrote:

Is SHA a significant portion of the compute during these repacks?
I should run oprofile...
SHA1 is almost totally insignificant on x86. It hardly shows up. But
we have a good optimized version there.
zlib tends to be a lot more noticeable (especially the
*uncompression*: it may be faster than compression, but it's done _so_
much more that it totally dominates).
  
   Have you considered alternatives, like:
   http://www.oberhumer.com/opensource/ucl/
 
  quote
As compared to LZO, the UCL algorithms achieve a better compression
ratio but *decompression* is a little bit slower. See below for some
rough timings.
  /quote
 
  It is uncompression speed that is more important, because it is used
  much more often.

 I know, but the point is not what is the fastest, but whether it's fast
 enough to get off the profiles. I think UCL is fast enough since it's
 still much faster than zlib. Anyway, LZO is GPL too, so why not
 consider it as well. They are good libraries.


At worst, you could also use fastlz (www.fastlz.org), which is faster
than all of these by a factor of 4 (and compression wise, is actually
sometimes better, sometimes worse, than LZO).


Re: Git and GCC

2007-12-07 Thread David Miller
From: Linus Torvalds [EMAIL PROTECTED]
Date: Fri, 7 Dec 2007 09:23:47 -0800 (PST)

 
 
 On Fri, 7 Dec 2007, David Miller wrote:
  
  Also I could end up being performance limited by SHA, it's not very
  well tuned on Sparc.  It's been on my TODO list to code up the crypto
  unit support for Niagara-2 in the kernel, then work with Herbert Xu on
  the userland interfaces to take advantage of that in things like
  libssl.  Even a better C/asm version would probably improve GIT
  performance a bit.
 
 I doubt you can use the hardware support. Kernel-only hw support is 
 inherently broken for any sane user-space usage, the setup costs are just 
 way way too high. To be useful, crypto engines need to support direct user 
 space access (ie a regular instruction, with all state being held in 
 normal registers that get saved/restored by the kernel).

Unfortunately they are hypervisor calls, and you have to give
the thing physical addresses for the buffer to work on, so
letting userland get at it directly isn't currently doable.

I still believe that there are cases where userland can take
advantage of in-kernel crypto devices, such as when we are
streaming the data into the kernel anyways (for a write()
or sendmsg()) and the user just wants the transformation to
be done on that stream.

As a specific case, hardware crypto SSL support works quite
well for sendmsg() user packet data.  And this the kind of API
Solaris provides to get good SSL performance with Niagara.

  Is SHA a significant portion of the compute during these repacks?
  I should run oprofile...
 
 SHA1 is almost totally insignificant on x86. It hardly shows up. But we 
 have a good optimized version there.

Ok.

 zlib tends to be a lot more noticeable (especially the uncompression: it 
 may be faster than compression, but it's done _so_ much more that it 
 totally dominates).

zlib is really hard to optimize on Sparc; I've tried numerous times.
Actually compress is the real cycle killer, and in that case the inner
loop wants to dereference 2-byte shorts at a time, but they are
unaligned half of the time, and the check for alignment nullifies
the gains of avoiding the two-byte loads.

Uncompress I don't think is optimized at all on any platform with
asm stuff like the compress side is.  It's a pretty straightforward
transformation and the memory accesses dominate the overhead.

I'll do some profiling to see what might be worth looking into.


Re: Git and GCC

2007-12-06 Thread David Brown

On Wed, Dec 05, 2007 at 11:49:21PM -0800, Harvey Harrison wrote:



git repack -a -d --depth=250 --window=250



Since I have the whole gcc repo locally I'll give this a shot overnight
just to see what can be done at the extreme end of things.


When I tried this on a very large repo, at least one with some large files
in it, git quickly exceeded my physical memory and started thrashing the
machine.  I had good results with

 git config pack.deltaCacheSize 512m
 git config pack.windowMemory 512m

of course adjusting based on your physical memory.  I think changing the
windowMemory will affect the resulting compression, so changing these
ratios might get better compression out of the result.

If you're really patient, though, you could leave the unbounded window,
hope you have enough swap, and just let it run.
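
(Putting that together, a hedged sketch; the 512m figures are just the ones
quoted above and should be scaled to the machine's actual RAM:)

git config pack.deltaCacheSize 512m
git config pack.windowMemory 512m
git repack -a -d -f --window=250 --depth=250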

Dave


Re: Git and GCC

2007-12-06 Thread Andreas Schwab
Harvey Harrison [EMAIL PROTECTED] writes:

 git svn does accept a mailmap at import time with the same format as the
 cvs importer I think.  But for someone that just wants a repo to check
 out this was easiest.  I'd be willing to spend the time to do a nicer
 job if there was any interest from the gcc side, but I'm not that
 invested (other than owing them for an often-used tool).

I have a complete list of the uid-mail mapping for the gcc repository.

Andreas.

-- 
Andreas Schwab, SuSE Labs, [EMAIL PROTECTED]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
And now for something completely different.


Re: Git and GCC

2007-12-06 Thread Johannes Schindelin
Hi,

On Wed, 5 Dec 2007, David Miller wrote:

 From: Daniel Berlin [EMAIL PROTECTED]
 Date: Wed, 5 Dec 2007 21:41:19 -0500
 
  It is true I gave up quickly, but this is mainly because i don't like 
  to fight with my tools.
 
  I am quite fine with a distributed workflow, I now use 8 or so gcc 
  branches in mercurial (auto synced from svn) and merge a lot between 
  them. I wanted to see if git would sanely let me manage the commits 
  back to svn.  After fighting with it, i gave up and just wrote a 
  python extension to hg that lets me commit non-svn changesets back to 
  svn directly from hg.
 
 I find it ironic that you were even willing to write tools to facilitate 
 your hg based gcc workflow.  That really shows what your thinking is on 
 this matter, in that you're willing to put effort towards making hg work 
 better for you but you're not willing to expend that level of effort to 
 see if git can do so as well.

While this is true...

 This is what really eats me from the inside about your dissatisfaction 
 with git.  Your analysis seems to be a self-fulfilling prophecy, and 
 that's totally unfair to both hg and git.

... I actually appreciate people complaining -- in the meantime.  It shows 
right away what group you belong to: "Those who can do, do; those 
who can't, complain."

You can see that very easily on the git list, or on the #git channel on 
irc.freenode.net.  There is enough data for a study which yearns to be 
written, that shows how quickly we resolve issues with people that are 
sincerely interested in a solution.

(Of course, on the other hand, there are also quite a few cases which show 
how frustrating (for both sides) and unfruitful discussions started by a 
complaint are.)

So I fully expect an issue like Daniel's to be resolved in a matter of 
minutes on the git list, if the OP gives us a chance.  If we are not even 
Cc'ed, you are completely right, she or he probably does not want the 
issue to be resolved.

Ciao,
Dscho



Re: Git and GCC

2007-12-06 Thread Ismail Dönmez
Thursday 06 December 2007 13:57:06, Johannes Schindelin wrote:
[...]
 So I fully expect an issue like Daniel's to be resolved in a matter of
 minutes on the git list, if the OP gives us a chance.  If we are not even
 Cc'ed, you are completely right, she or he probably does not want the
 issue to be resolved.

Let's be fair about this: Ollie Wild already sent a mail about git-svn disk 
usage and there is no concrete solution yet, though it seems the bottleneck 
is known.

Regards,
ismail


-- 
Never learn by your mistakes, if you do you may never dare to try again.


Re: Git and GCC

2007-12-06 Thread Nicolas Pitre
On Wed, 5 Dec 2007, Harvey Harrison wrote:

 
  git repack -a -d --depth=250 --window=250
  
 
 Since I have the whole gcc repo locally I'll give this a shot overnight
 just to see what can be done at the extreme end of things.

Don't forget to add -f as well.
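
That is, something like:

git repack -a -d -f --depth=250 --window=250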


Nicolas


Re: Git and GCC

2007-12-06 Thread Nicolas Pitre
On Thu, 6 Dec 2007, Jeff King wrote:

 On Thu, Dec 06, 2007 at 01:47:54AM -0500, Jon Smirl wrote:
 
  The key to converting repositories of this size is RAM. 4GB minimum,
  more would be better. git-repack is not multi-threaded. There were a
  few attempts at making it multi-threaded but none were too successful.
  If I remember right, with loads of RAM, a repack on a 450MB repository
  was taking about five hours on a 2.8Ghz Core2. But this is something
  you only have to do once for the import. Later repacks will reuse the
  original deltas.
 
 Actually, Nicolas put quite a bit of work into multi-threading the
 repack process; the results have been in master for some time, and will
 be in the soon-to-be-released v1.5.4.
 
 The downside is that the threading partitions the object space, so the
 resulting size is not necessarily as small (but I don't know that
 anybody has done testing on large repos to find out how large the
 difference is).

Quick guesstimate is in the 1% ballpark.


Nicolas


Re: Git and GCC

2007-12-06 Thread Nicolas Pitre
On Thu, 6 Dec 2007, Jeff King wrote:

 On Thu, Dec 06, 2007 at 09:18:39AM -0500, Nicolas Pitre wrote:
 
   The downside is that the threading partitions the object space, so the
   resulting size is not necessarily as small (but I don't know that
   anybody has done testing on large repos to find out how large the
   difference is).
  
  Quick guesstimate is in the 1% ballpark.
 
 Fortunately, we now have numbers. Harvey Harrison reported repacking the
 gcc repo and getting these results:
 
  /usr/bin/time git repack -a -d -f --window=250 --depth=250
 
  23266.37user 581.04system 7:41:25elapsed 86%CPU (0avgtext+0avgdata 
  0maxresident)k
  0inputs+0outputs (419835major+123275804minor)pagefaults 0swaps
 
  -r--r--r-- 1 hharrison hharrison  29091872 2007-12-06 07:26 
  pack-1d46ca030c3d6d6b95ad316deb922be06b167a3d.idx
  -r--r--r-- 1 hharrison hharrison 324094684 2007-12-06 07:26 
  pack-1d46ca030c3d6d6b95ad316deb922be06b167a3d.pack
 
 I tried the threaded repack with pack.threads = 3 on a dual-processor
 machine, and got:
 
   time git repack -a -d -f --window=250 --depth=250
 
    real    309m59.849s
    user    377m43.948s
    sys     8m23.319s
 
   -r--r--r-- 1 peff peff  28570088 2007-12-06 10:11 
 pack-1fa336f33126d762988ed6fc3f44ecbe0209da3c.idx
   -r--r--r-- 1 peff peff 339922573 2007-12-06 10:11 
 pack-1fa336f33126d762988ed6fc3f44ecbe0209da3c.pack
 
 So it is about 5% bigger.

Right.  I should probably revisit that idea of finding deltas across 
partition boundaries to mitigate that loss.  And those partitions could 
be made coarser as well to reduce the number of such partition gaps 
(just increase the value of chunk_size on line 1648 in 
builtin-pack-objects.c).

 What is really disappointing is that we saved
 only about 20% of the time. I didn't sit around watching the stages, but
 my guess is that we spent a long time in the single threaded writing
 objects stage with a thrashing delta cache.

Maybe you should run the non-threaded repack on the same machine to have 
a good comparison.  And if you have only 2 CPUs, you will have better 
performance with pack.threads = 2; otherwise there'll be wasteful task 
switching going on.
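
(A minimal sketch of that, assuming a 2-CPU box and the same repack settings
used in this thread:)

git config pack.threads 2
time git repack -a -d -f --window=250 --depth=250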

And of course, if the delta cache is being thrashed, that might be due to 
the way the existing pack was previously packed.  Hence the current pack 
might impact object _access_ when repacking them.  So for a really 
really fair performance comparison, you'd have to preserve the original 
pack and swap it back before each repack attempt.


Nicolas


Re: Git and GCC

2007-12-06 Thread Daniel Berlin
On 12/6/07, Linus Torvalds [EMAIL PROTECTED] wrote:


 On Thu, 6 Dec 2007, Daniel Berlin wrote:
 
  Actually, it turns out that git-gc --aggressive does this dumb thing
  to pack files sometimes regardless of whether you converted from an
  SVN repo or not.

 Absolutely. git --aggressive is mostly dumb. It's really only useful for
 the case of I know I have a *really* bad pack, and I want to throw away
 all the bad packing decisions I have done.

 To explain this, it's worth explaining (you are probably aware of it, but
 let me go through the basics anyway) how git delta-chains work, and how
 they are so different from most other systems.

I worked on Monotone and other systems that use object stores for a
little while :)
In particular, I believe GIT's original object store was based on
Monotone, IIRC.

 In other SCM's, a delta-chain is generally fixed. It might be forwards
 or backwards, and it might evolve a bit as you work with the repository,
 but generally it's a chain of changes to a single file represented as some
 kind of single SCM entity. In CVS, it's obviously the *,v file, and a lot
 of other systems do rather similar things.


 Git also does delta-chains, but it does them a lot more loosely. There
 is no fixed entity. Deltas are generated against any random other version
 that git deems to be a good delta candidate (with various fairly
 successful heuristics), and there are absolutely no hard grouping rules.

Sure. SVN actually supports this (surprisingly); it just never happens
to choose delta bases that aren't related by ancestry.  (IE it would
have absolutely no problem with you using random other parts of the
repository as delta bases, and I've played with it before.)

I actually advocated we move towards an object store model, as
ancestry can be a crappy way of approximating similarity when you
have a lot of branches.

 So the equivalent of git gc --aggressive - but done *properly* - is to
 do (overnight) something like

 git repack -a -d --depth=250 --window=250

I gave this a try overnight, and it definitely helps a lot.
Thanks!

 And then it's going to take forever and a day (ie a do it overnight
 thing). But the end result is that everybody downstream from that
 repository will get much better packs, without having to spend any effort
 on it themselves.


If your forever and a day is spent figuring out which deltas to use,
you can reduce this significantly.
If it is spent writing out the data, it's much harder. :)


Re: Git and GCC

2007-12-06 Thread Ian Lance Taylor
NightStrike [EMAIL PROTECTED] writes:

 On 12/5/07, Daniel Berlin [EMAIL PROTECTED] wrote:
  As I said, maybe I'll look at git in another year or so.
  But I'm certainly going to ignore all the "git is so great, we should
  move gcc to it" people until it works better, while I am much more
  inclined to believe the "hg is so great, we should move gcc to it"
  people.
 
 Just out of curiosity, is there something wrong with the current
 choice of svn?  As I recall, it wasn't too long ago that gcc converted
 from cvs to svn.  What's the motivation to change again?  (I'm not
 trying to oppose anything.. I'm just curious, as I don't know much
 about this kind of thing).

Distributed version systems like git or Mercurial have some advantages
over Subversion.  For example, it is easy for developers to produce
patches which can be reliably committed or exchanged with other
developers.  With Subversion, we send around patch files generated by
diff and applied with patch.  This works, but is inconvenient, and
there is no way to track them.

With regard to git, I think it's worth noting that it was initially
designed to solve the problems faced by one man, Linus Torvalds.  The
problems he faces are not the problems which gcc developers face.  Our
development process is not the Linux kernel development process.  Of
course, many people have worked on git, and I expect that git can do
what we need.


For any git proponents, I'm curious to hear what advantages it offers
over Mercurial.  From this thread, one advantage of Mercurial seems
clear: it is easier to understand how to use it correctly.

Ian


Re: Git and GCC

2007-12-06 Thread Linus Torvalds


On Thu, 6 Dec 2007, Jeff King wrote:
 
 What is really disappointing is that we saved only about 20% of the 
 time. I didn't sit around watching the stages, but my guess is that we 
 spent a long time in the single threaded writing objects stage with a 
 thrashing delta cache.

I don't think you spent all that much time writing the objects. That part 
isn't very intensive, it's mostly about the IO.

I suspect you may simply be dominated by memory-throughput issues. The 
delta matching doesn't cache all that well, and using two or more cores 
isn't going to help all that much if they are largely waiting for memory 
(and quite possibly also perhaps fighting each other for a shared cache? 
Is this a Core 2 with the shared L2?)

Linus


Re: Git and GCC

2007-12-06 Thread Linus Torvalds


On Thu, 6 Dec 2007, Daniel Berlin wrote:

 I worked on Monotone and other systems that use object stores for a 
 little while :) In particular, I believe GIT's original object store was 
 based on Monotone, IIRC.

Yes and no. 

Monotone does what git does for the blobs. But there is a big difference 
in how git then does it for everything else too, ie trees and history. 
Trees being in that object store in particular are very important, and one 
of the biggest deals for deltas (actually, for two reasons: most of the 
time they don't change AT ALL if some subdirectory gets no changes and you 
don't need any delta, and even when they do change, it's usually going to 
delta very well, since it's usually just a small part that changes).

  And then it's going to take forever and a day (ie a do it overnight
  thing). But the end result is that everybody downstream from that
  repository will get much better packs, without having to spend any effort
  on it themselves.
 
 If your forever and a day is spent figuring out which deltas to use,
 you can reduce this significantly.

It's almost all about figuring out the delta. Which is why *not* using 
-f (or --aggressive) is such a big deal for normal operation, because 
then you just skip it all.

Linus


Re: Git and GCC

2007-12-06 Thread Jon Smirl
On 12/6/07, Linus Torvalds [EMAIL PROTECTED] wrote:


 On Thu, 6 Dec 2007, Jeff King wrote:
 
  What is really disappointing is that we saved only about 20% of the
  time. I didn't sit around watching the stages, but my guess is that we
  spent a long time in the single threaded writing objects stage with a
  thrashing delta cache.

 I don't think you spent all that much time writing the objects. That part
 isn't very intensive, it's mostly about the IO.

 I suspect you may simply be dominated by memory-throughput issues. The
 delta matching doesn't cache all that well, and using two or more cores
 isn't going to help all that much if they are largely waiting for memory
 (and quite possibly also perhaps fighting each other for a shared cache?
 Is this a Core 2 with the shared L2?)

When I last looked at the code, the problem was in evenly dividing
the work. I was using a four-core machine and most of the time one
core would end up with 3-5x the work of the lightest loaded core.
Setting pack.threads up to 20 fixed the problem. With a high number of
threads I was able to get a 4hr pack to finish in something like
1:15.

A scheme where each core could work a minute without communicating to
the other cores would be best. It would also be more efficient if the
cores could avoid having sync points between them.

-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Git and GCC

2007-12-06 Thread Jeff King
On Thu, Dec 06, 2007 at 09:18:39AM -0500, Nicolas Pitre wrote:

  The downside is that the threading partitions the object space, so the
  resulting size is not necessarily as small (but I don't know that
  anybody has done testing on large repos to find out how large the
  difference is).
 
 Quick guesstimate is in the 1% ballpark.

Fortunately, we now have numbers. Harvey Harrison reported repacking the
gcc repo and getting these results:

 /usr/bin/time git repack -a -d -f --window=250 --depth=250

 23266.37user 581.04system 7:41:25elapsed 86%CPU (0avgtext+0avgdata 
 0maxresident)k
 0inputs+0outputs (419835major+123275804minor)pagefaults 0swaps

 -r--r--r-- 1 hharrison hharrison  29091872 2007-12-06 07:26 
 pack-1d46ca030c3d6d6b95ad316deb922be06b167a3d.idx
 -r--r--r-- 1 hharrison hharrison 324094684 2007-12-06 07:26 
 pack-1d46ca030c3d6d6b95ad316deb922be06b167a3d.pack

I tried the threaded repack with pack.threads = 3 on a dual-processor
machine, and got:

  time git repack -a -d -f --window=250 --depth=250

  real    309m59.849s
  user    377m43.948s
  sys     8m23.319s

  -r--r--r-- 1 peff peff  28570088 2007-12-06 10:11 
pack-1fa336f33126d762988ed6fc3f44ecbe0209da3c.idx
  -r--r--r-- 1 peff peff 339922573 2007-12-06 10:11 
pack-1fa336f33126d762988ed6fc3f44ecbe0209da3c.pack

So it is about 5% bigger. What is really disappointing is that we saved
only about 20% of the time. I didn't sit around watching the stages, but
my guess is that we spent a long time in the single threaded writing
objects stage with a thrashing delta cache.

-Peff


Re: Git and GCC

2007-12-06 Thread Nicolas Pitre
On Thu, 6 Dec 2007, Jon Smirl wrote:

 On 12/6/07, Linus Torvalds [EMAIL PROTECTED] wrote:
 
 
  On Thu, 6 Dec 2007, Jeff King wrote:
  
   What is really disappointing is that we saved only about 20% of the
   time. I didn't sit around watching the stages, but my guess is that we
   spent a long time in the single threaded writing objects stage with a
   thrashing delta cache.
 
  I don't think you spent all that much time writing the objects. That part
  isn't very intensive, it's mostly about the IO.
 
  I suspect you may simply be dominated by memory-throughput issues. The
  delta matching doesn't cache all that well, and using two or more cores
  isn't going to help all that much if they are largely waiting for memory
  (and quite possibly also perhaps fighting each other for a shared cache?
  Is this a Core 2 with the shared L2?)
 
 When I last looked at the code, the problem was in evenly dividing
 the work. I was using a four-core machine and most of the time one
 core would end up with 3-5x the work of the lightest loaded core.
 Setting pack.threads up to 20 fixed the problem. With a high number of
 threads I was able to get a 4hr pack to finish in something like
 1:15.

But as far as I know you didn't try my latest incarnation which has been
available in Git's master branch for a few months already.


Nicolas


Re: Git and GCC

2007-12-06 Thread Linus Torvalds


On Thu, 6 Dec 2007, NightStrike wrote:
 
 No disrespect is meant by this reply.  I am just curious (and I am
 probably misunderstanding something)..  Why remove all of the
 documentation entirely?  Wouldn't it be better to just document it
 more thoroughly?

Well, part of it is that I don't think --aggressive as it is implemented 
right now is really almost *ever* the right answer. We could change the 
implementation, of course, but generally the right thing to do is to not 
use it (tweaking the --window and --depth manually for the repacking 
is likely the more natural thing to do).

The other part of the answer is that, when you *do* want to do what that 
--aggressive tries to achieve, it's such a special case event that while 
it should probably be documented, I don't think it should necessarily be 
documented where it is now (as part of git gc), but as part of a much 
more technical manual for deep and subtle tricks you can play.

 I thought you did a fine job in this post in explaining its purpose, 
 when to use it, when not to, etc.  Removing the documention seems 
 counter-intuitive when you've already gone to the trouble of creating 
 good documentation here in this post.

I'm so used to writing emails, and I *like* trying to explain what is 
going on, so I have no problems at all doing that kind of thing. However, 
trying to write a manual or man-page or other technical documentation is 
something rather different.

IOW, I like explaining git within the _context_ of a discussion or a 
particular problem/issue. But documentation should work regardless of 
context (or at least set it up), and that's the part I am not so good at.

In other words, if somebody (hint hint) thinks my explanation was good and 
readable, I'd love for them to try to turn it into real documentation by 
editing it up and creating enough context for it! But I'm not personally 
very likely to do that. I'd just send Junio the patch to remove a 
misleading part of the documentation we have.

Linus


Re: Git and GCC

2007-12-06 Thread Jon Loeliger
On Thu, 2007-12-06 at 00:09, Linus Torvalds wrote:

 Git also does delta-chains, but it does them a lot more loosely. There 
 is no fixed entity. Deltas are generated against any random other version 
 that git deems to be a good delta candidate (with various fairly 
 successful heuristics), and there are absolutely no hard grouping rules.

I'd like to learn more about that.  Can someone point me to
more documentation on it?  In the absence of that,
perhaps a pointer to the source code that implements it?

I guess one question I posit is, would it be more accurate
to think of this as a delta net in a weighted graph rather
than a delta chain?

Thanks,
jdl




Re: Git and GCC

2007-12-06 Thread NightStrike
On 12/6/07, Linus Torvalds [EMAIL PROTECTED] wrote:


 On Thu, 6 Dec 2007, Daniel Berlin wrote:
 
  Actually, it turns out that git-gc --aggressive does this dumb thing
  to pack files sometimes regardless of whether you converted from an
  SVN repo or not.
 I'll send a patch to Junio to just remove the git gc --aggressive
 documentation. It can be useful, but it generally is useful only when you
 really understand at a very deep level what it's doing, and that
 documentation doesn't help you do that.

No disrespect is meant by this reply.  I am just curious (and I am
probably misunderstanding something)..  Why remove all of the
documentation entirely?  Wouldn't it be better to just document it
more thoroughly?  I thought you did a fine job in this post in
explaining its purpose, when to use it, when not to, etc.  Removing
the documention seems counter-intuitive when you've already gone to
the trouble of creating good documentation here in this post.


Re: Git and GCC. Why not with fork, exec and pipes like in linux?

2007-12-06 Thread J.C. Pizarro
On 2007/12/06, Jon Smirl [EMAIL PROTECTED] wrote:
 On 12/6/07, Linus Torvalds [EMAIL PROTECTED] wrote:
  On Thu, 6 Dec 2007, Jeff King wrote:
  
   What is really disappointing is that we saved only about 20% of the
   time. I didn't sit around watching the stages, but my guess is that we
   spent a long time in the single threaded writing objects stage with a
   thrashing delta cache.
 
  I don't think you spent all that much time writing the objects. That part
  isn't very intensive, it's mostly about the IO.
 
  I suspect you may simply be dominated by memory-throughput issues. The
  delta matching doesn't cache all that well, and using two or more cores
  isn't going to help all that much if they are largely waiting for memory
  (and quite possibly also perhaps fighting each other for a shared cache?
  Is this a Core 2 with the shared L2?)

 When I last looked at the code, the problem was in evenly dividing
 the work. I was using a four-core machine and most of the time one
 core would end up with 3-5x the work of the lightest loaded core.
 Setting pack.threads up to 20 fixed the problem. With a high number of
 threads I was able to get a 4hr pack to finish in something like
 1:15.

 A scheme where each core could work a minute without communicating to
 the other cores would be best. It would also be more efficient if the
 cores could avoid having sync points between them.

 --
 Jon Smirl
 [EMAIL PROTECTED]

For multicore CPUs, don't divide the work into threads.
Divide the work into processes!

Tips, tricks and hacks: use fork, exec, pipes and other IPC mechanisms like
mutexes, shared-memory IPC, file locks, semaphores, RPCs, sockets, etc.
to access the file-locked database concurrently and in parallel.

For an Intel Quad Core, e.g. 4 cores, it needs a parent process and 4
child processes
linked to the parent with pipes.

The parent process can be
* non-threaded, using select/epoll/libevent
* threaded, using Pth (GNU Portable Threads), NPTL (from RedHat) or whatever.

   J.C.Pizarro


Re: Git and GCC

2007-12-06 Thread Vincent Lefevre
On 2007-12-06 10:15:17 -0800, Ian Lance Taylor wrote:
 Distributed version systems like git or Mercurial have some advantages
 over Subversion.

It's surprising that you don't mention svk, which is based on top
of Subversion[*]. Has anyone tried? Is there any problem with it?

[*] You have currently an obvious advantage here.

-- 
Vincent Lefèvre [EMAIL PROTECTED] - Web: http://www.vinc17.org/
100% accessible validated (X)HTML - Blog: http://www.vinc17.org/blog/
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)


Re: Git and GCC

2007-12-06 Thread Ismail Dönmez
Thursday 06 December 2007 21:28:59, Vincent Lefevre wrote:
 On 2007-12-06 10:15:17 -0800, Ian Lance Taylor wrote:
  Distributed version systems like git or Mercurial have some advantages
  over Subversion.

 It's surprising that you don't mention svk, which is based on top
 of Subversion[*]. Has anyone tried? Is there any problem with it?

 [*] You have currently an obvious advantage here.

Last time I tried SVK it was slow and buggy. I wouldn't recommend it.

/ismail

-- 
Never learn by your mistakes, if you do you may never dare to try again.


Re: Git and GCC

2007-12-06 Thread Linus Torvalds


On Thu, 6 Dec 2007, Jon Loeliger wrote:

 On Thu, 2007-12-06 at 00:09, Linus Torvalds wrote:
   Git also does delta-chains, but it does them a lot more loosely. There 
   is no fixed entity. Deltas are generated against any random other version 
   that git deems to be a good delta candidate (with various fairly 
   successful heuristics), and there are absolutely no hard grouping rules.
 
 I'd like to learn more about that.  Can someone point me to
 either more documentation on it?  In the absence of that,
 perhaps a pointer to the source code that implements it?

Well, in a very real sense, what the delta code does is:
 - just list every single object in the whole repository
 - walk over each object, trying to find another object that it can be 
   written as a delta against
 - write out the result as a pack-file

That's simplified: we may not walk _all_ objects, for example: only a 
global repack does that (and most pack creations are actually for pushing 
and pulling between two repositories, so we only walk the objects that are 
in the source but not the destination repository).

The interesting phase is the "walk each object, try to find a delta" part. 
In particular, you don't want to try to find a delta by comparing each 
object to every other object out there (that would be O(n^2) in objects, 
and with a fairly high constant cost too!). So what it does is to sort the 
objects by a few heuristics (type of object, base name that object was 
found as when traversing a tree, size, and how recently it was found in 
the history).

And then over that sorted list, it tries to find deltas between entries 
that are close to each other (and that's where the --window=xyz thing 
comes in - it says how big the window is for objects being close. A 
smaller window generates somewhat less good deltas, but takes a lot less 
effort to generate).
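
As a rough sketch of that sliding-window idea (the struct, the names and the
stand-in cost function are all illustrative; the real logic is try_delta()
and find_deltas() in builtin-pack-objects.c): each object is compared only
against the previous window-many entries of the sorted list, and the cheapest
delta found wins.

#include <stdio.h>
#include <stddef.h>

struct obj {
    size_t size;        /* undeltified object size                   */
    int    delta_base;  /* index of the chosen base, or -1 for none  */
    size_t delta_size;  /* cost of storing it (delta or full object) */
};

/* Stand-in cost: pretend the delta size is the absolute size difference.
 * A real implementation diffs the object contents. */
static size_t delta_cost(const struct obj *base, const struct obj *target)
{
    return base->size > target->size ? base->size - target->size
                                     : target->size - base->size;
}

static void find_deltas_sketch(struct obj *list, int n, int window)
{
    for (int i = 0; i < n; i++) {
        list[i].delta_base = -1;
        list[i].delta_size = list[i].size;
        /* Only compare against the previous window-many entries. */
        for (int j = i - 1; j >= 0 && j >= i - window; j--) {
            size_t cost = delta_cost(&list[j], &list[i]);
            if (cost < list[i].delta_size) {
                list[i].delta_base = j;
                list[i].delta_size = cost;
            }
        }
    }
}

int main(void)
{
    struct obj objs[] = {
        { 1000, -1, 0 }, { 990, -1, 0 }, { 400, -1, 0 }, { 985, -1, 0 },
    };
    find_deltas_sketch(objs, 4, 2);
    for (int i = 0; i < 4; i++)
        printf("obj %d: base=%d cost=%zu\n",
               i, objs[i].delta_base, objs[i].delta_size);
    return 0;
}

A larger window widens that inner loop, which is exactly why the --window=250
runs elsewhere in this thread cost so much more CPU.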

The source is in git/builtin-pack-objects.c, with the core of it being

 - try_delta() - try to generate a *single* delta when given an object 
   pair.

 - find_deltas() - do the actual list traversal

 - prepare_pack() and type_size_sort() - create the delta sort list from 
   the list of objects.

but that whole file is probably some of the more opaque parts of git.

 I guess one question I posit is, would it be more accurate
 to think of this as a delta net in a weighted graph rather
 than a delta chain?

It's certainly not a simple chain, it's more of a set of acyclic directed 
graphs in the object list. And yes, it's weighted by the size of the delta 
between objects, and the optimization problem is kind of akin to finding 
the smallest spanning tree (well, forest - since you do *not* want to 
create one large graph, you also want to make the individual trees shallow 
enough that you don't have excessive delta depth).

There are good algorithms for finding minimum spanning trees, but this one 
is complicated by the fact that the biggest cost (by far!) is the 
calculation of the weights itself. So rather than really worry about 
finding the minimal tree/forest, the code needs to worry about not having 
to even calculate all the weights!

(That, btw, is a common theme. A lot of git is about traversing graphs, 
like the revision graph. And most of the trivial graph problems all assume 
that you have the whole graph, but since the whole graph is the whole 
history of the repository, those algorithms are totally worthless, since 
they are fundamentally much too expensive - if we have to generate the 
whole history, we're already screwed for a big project. So things like 
revision graph calculation, the main performance issue is to avoid having 
to even *look* at parts of the graph that we don't need to see!)

Linus


Re: Git and GCC

2007-12-06 Thread Junio C Hamano
Jon Loeliger [EMAIL PROTECTED] writes:

 On Thu, 2007-12-06 at 00:09, Linus Torvalds wrote:

 Git also does delta-chains, but it does them a lot more loosely. There 
 is no fixed entity. Delta's are generated against any random other version 
 that git deems to be a good delta candidate (with various fairly 
  successful heuristics), and there are absolutely no hard grouping rules.

 I'd like to learn more about that.  Can someone point me to
 either more documentation on it?  In the absence of that,
 perhaps a pointer to the source code that implements it?

See Documentation/technical/pack-heuristics.txt,
but the document predates and does not talk about delta
reusing, which was covered here:

http://thread.gmane.org/gmane.comp.version-control.git/16223/focus=16267

 I guess one question I posit is, would it be more accurate
 to think of this as a delta net in a weighted graph rather
 than a delta chain?

Yes.


Re: Git and GCC

2007-12-06 Thread Andrey Belevantsev

Vincent Lefevre wrote:

It's surprising that you don't mention svk, which is based on top
of Subversion[*]. Has anyone tried? Is there any problem with it?

I must agree with Ismail's reply here.  We have used svk for our 
internal development for about two years, for the reason of easy 
mirroring of gcc trunk and branching from it locally.  I would not 
complain about its speed, but sometimes we had problems with merge from 
trunk, ending up with e.g. zero-sized files in our branch which were 
removed from trunk, or we even couldn't merge at all, and I had to 
resort to underlying subversion repository for merging.  As a result, 
we're currently migrating to mercurial.


Andrey


Re: Git and GCC. Why not with fork, exec and pipes like in linux?

2007-12-06 Thread J.C. Pizarro
On 2007/12/6, J.C. Pizarro [EMAIL PROTECTED], I wrote:
 For multicore CPUs, don't divide the work into threads.
 Divide the work into processes!

 Tips, tricks and hacks: use fork, exec, pipes and other IPC mechanisms like
 mutexes, shared-memory IPC, file locks, semaphores, RPCs, sockets, etc.
 to access the file-locked database concurrently and in parallel.

I'm sorry, we don't need exec. We need fork, pipes and the other IPC
mechanisms, because that way the C code for the parallelism is easy to share.

Thanks to Linus, git is implemented in the C language and interacts with the
system calls of the kernel, which is also written in C.

 For an Intel Quad Core, e.g. 4 cores, you need a parent process and 4
 child processes linked to the parent with pipes.

For peak performance (e.g. 99.9% usage), the minimum number of child
processes should be more than 4, normally between 6 and 10 processes,
depending on the statistics of the cores' idle stalls.

 The parent process can be
 * non-threaded, using select/epoll/libevent
 * threaded, using Pth (GNU Portable Threads), NPTL (from RedHat) or whatever.

Note: there is a small design problem: I/O bandwidth slows down when the
parent is multithreaded and the children must be multithreaded too; we can't
keep them single-threaded, which would give the maximum I/O bandwidth.

Finding the smallest spanning forest of deltas consumes a lot of CPU, so if
it scales well on a 4-core CPU it could reduce 4 hours to 1 hour.

   J.C.Pizarro :)


Re: Git and GCC

2007-12-06 Thread Daniel Berlin
On 12/6/07, Andrey Belevantsev [EMAIL PROTECTED] wrote:
 Vincent Lefevre wrote:
  It's surprising that you don't mention svk, which is based on top
  of Subversion[*]. Has anyone tried? Is there any problem with it?
 I must agree with Ismail's reply here.  We have used svk for our
 internal development for about two years, for the reason of easy
 mirroring of gcc trunk and branching from it locally.  I would not
 complain about its speed, but sometimes we had problems with merge from
 trunk, ending up with e.g. zero-sized files in our branch which were
 removed from trunk, or we even couldn't merge at all, and I had to
 resort to underlying subversion repository for merging.  As a result,
 we're currently migrating to mercurial.

I would not recommend SVK either (even being an SVN committer). While
i love the SVK guys to death, it's just not the way to go if you want
a distributed system.


 Andrey



Re: Git and GCC

2007-12-06 Thread Junio C Hamano
Junio C Hamano [EMAIL PROTECTED] writes:

 Jon Loeliger [EMAIL PROTECTED] writes:

 I'd like to learn more about that.  Can someone point me to
 either more documentation on it?  In the absence of that,
 perhaps a pointer to the source code that implements it?

 See Documentation/technical/pack-heuristics.txt,

A somewhat funny thing about this is ...

$ git show --stat --summary b116b297
commit b116b297a80b54632256eb89dd22ea2b140de622
Author: Jon Loeliger [EMAIL PROTECTED]
Date:   Thu Mar 2 19:19:29 2006 -0600

Added Packing Heursitics IRC writeup.

Signed-off-by: Jon Loeliger [EMAIL PROTECTED]
Signed-off-by: Junio C Hamano [EMAIL PROTECTED]

 Documentation/technical/pack-heuristics.txt |  466 +++
 1 files changed, 466 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/technical/pack-heuristics.txt


Re: Git and GCC

2007-12-06 Thread Jon Smirl
On 12/6/07, Nicolas Pitre [EMAIL PROTECTED] wrote:
  When I lasted looked at the code, the problem was in evenly dividing
  the work. I was using a four core machine and most of the time one
  core would end up with 3-5x the work of the lightest loaded core.
  Setting pack.threads up to 20 fixed the problem. With a high number of
  threads I was able to get a 4hr pack to finished in something like
  1:15.

 But as far as I know you didn't try my latest incarnation which has been
 available in Git's master branch for a few months already.

I've deleted all my giant packs. Using the kernel pack:
4GB Q6600

Using the current thread pack code I get these results.

The interesting case is the last one. I set it to 15 threads and
monitored with 'top'.
For 0-60% compression I was at 300% CPU, 60-74% was 200% CPU and
74-100% was 100% CPU. It never used all four cores. The only other
things running were top and my desktop. This is the same load
balancing problem I observed earlier. Much more clock time was spent
in the 2/1 core phases than the 3 core one.

Threaded, threads = 5

[EMAIL PROTECTED]:/home/linux$ time git repack -a -d -f
Counting objects: 648366, done.
Compressing objects: 100% (647457/647457), done.
Writing objects: 100% (648366/648366), done.
Total 648366 (delta 528994), reused 0 (delta 0)

real    1m31.395s
user    2m59.239s
sys     0m3.048s
[EMAIL PROTECTED]:/home/linux$

12 seconds counting
53 seconds compressing
38 seconds writing

Without threads,

[EMAIL PROTECTED]:/home/linux$ time git repack -a -d -f
warning: no threads support, ignoring pack.threads
Counting objects: 648366, done.
Compressing objects: 100% (647457/647457), done.
Writing objects: 100% (648366/648366), done.
Total 648366 (delta 528999), reused 0 (delta 0)

real    2m54.849s
user    2m51.267s
sys     0m1.412s
[EMAIL PROTECTED]:/home/linux$

Threaded, threads = 5

[EMAIL PROTECTED]:/home/linux$ time git repack -a -d -f --depth=250 --window=250
Counting objects: 648366, done.
Compressing objects: 100% (647457/647457), done.
Writing objects: 100% (648366/648366), done.
Total 648366 (delta 539080), reused 0 (delta 0)

real    9m18.032s
user    19m7.484s
sys     0m3.880s
[EMAIL PROTECTED]:/home/linux$

[EMAIL PROTECTED]:/home/linux/.git/objects/pack$ ls -l
total 182156
-r--r--r-- 1 jonsmirl jonsmirl  15561848 2007-12-06 16:15
pack-f1f8637d2c68eb1c964ec7c1877196c0c7513412.idx
-r--r--r-- 1 jonsmirl jonsmirl 170768761 2007-12-06 16:15
pack-f1f8637d2c68eb1c964ec7c1877196c0c7513412.pack
[EMAIL PROTECTED]:/home/linux/.git/objects/pack$

Non-threaded:

[EMAIL PROTECTED]:/home/linux$ time git repack -a -d -f --depth=250 --window=250
warning: no threads support, ignoring pack.threads
Counting objects: 648366, done.
Compressing objects: 100% (647457/647457), done.
Writing objects: 100% (648366/648366), done.
Total 648366 (delta 539080), reused 0 (delta 0)

real    18m51.183s
user    18m46.538s
sys     0m1.604s
[EMAIL PROTECTED]:/home/linux$


[EMAIL PROTECTED]:/home/linux/.git/objects/pack$ ls -l
total 182156
-r--r--r-- 1 jonsmirl jonsmirl  15561848 2007-12-06 15:33
pack-f1f8637d2c68eb1c964ec7c1877196c0c7513412.idx
-r--r--r-- 1 jonsmirl jonsmirl 170768761 2007-12-06 15:33
pack-f1f8637d2c68eb1c964ec7c1877196c0c7513412.pack
[EMAIL PROTECTED]:/home/linux/.git/objects/pack$

Threaded, threads = 15

[EMAIL PROTECTED]:/home/linux$ time git repack -a -d -f --depth=250 --window=250
Counting objects: 648366, done.
Compressing objects: 100% (647457/647457), done.
Writing objects: 100% (648366/648366), done.
Total 648366 (delta 539080), reused 0 (delta 0)

real    9m18.325s
user    19m14.340s
sys     0m3.996s
[EMAIL PROTECTED]:/home/linux$

-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Git and GCC

2007-12-06 Thread Nicolas Pitre
On Thu, 6 Dec 2007, Jon Smirl wrote:

 On 12/6/07, Nicolas Pitre [EMAIL PROTECTED] wrote:
   When I lasted looked at the code, the problem was in evenly dividing
   the work. I was using a four core machine and most of the time one
   core would end up with 3-5x the work of the lightest loaded core.
   Setting pack.threads up to 20 fixed the problem. With a high number of
   threads I was able to get a 4hr pack to finished in something like
   1:15.
 
  But as far as I know you didn't try my latest incarnation which has been
  available in Git's master branch for a few months already.
 
 I've deleted all my giant packs. Using the kernel pack:
 4GB Q6600
 
 Using the current thread pack code I get these results.
 
 The interesting case is the last one. I set it to 15 threads and
 monitored with 'top'.
 For 0-60% compression I was at 300% CPU, 60-74% was 200% CPU and
 74-100% was 100% CPU. It never used all for cores. The only other
 things running were top and my desktop. This is the same load
 balancing problem I observed earlier.

Well, that's possible with a window 25 times larger than the default.

The load balancing is solved with a master thread serving relatively 
small object list segments to any work thread that finished with its 
previous segment.  But the size for those segments is currently fixed to 
window * 1000 which is way too large when window == 250.

I have to find a way to auto-tune that segment size somehow.

But with the default window size there should not be any such noticeable 
load balancing problem.

Note that threading only happens in the compression phase.  The counting 
and writing phases are hardly parallelized.
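
To make the segment size concrete, here is a small illustration (not the
actual ll_find_deltas() code; the object count and window are the ones from
this thread) of how the delta-eligible list is carved into window * 1000
sized chunks:

#include <stdio.h>

int main(void)
{
    unsigned long nr_objects = 647457;   /* delta-eligible objects above  */
    unsigned long window     = 250;
    unsigned long chunk_size = window * 1000;
    unsigned long start = 0;
    int segment = 0;

    while (start < nr_objects) {
        unsigned long len = nr_objects - start;
        if (len > chunk_size)
            len = chunk_size;
        /* each segment goes to whichever worker thread is idle next */
        printf("segment %d: objects %lu..%lu\n",
               segment, start, start + len - 1);
        segment++;
        start += len;
    }
    printf("%d segments total\n", segment);
    return 0;
}

With window = 250 that yields only 3 segments, which is the load-balancing
problem described above; with the default window of 10 the same list would
split into about 65 segments.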


Nicolas


Re: Git and GCC

2007-12-06 Thread Jon Smirl
On 12/6/07, Nicolas Pitre [EMAIL PROTECTED] wrote:
 On Thu, 6 Dec 2007, Jon Smirl wrote:

  On 12/6/07, Nicolas Pitre [EMAIL PROTECTED] wrote:
When I lasted looked at the code, the problem was in evenly dividing
the work. I was using a four core machine and most of the time one
core would end up with 3-5x the work of the lightest loaded core.
Setting pack.threads up to 20 fixed the problem. With a high number of
threads I was able to get a 4hr pack to finished in something like
1:15.
  
   But as far as I know you didn't try my latest incarnation which has been
   available in Git's master branch for a few months already.
 
  I've deleted all my giant packs. Using the kernel pack:
  4GB Q6600
 
  Using the current thread pack code I get these results.
 
  The interesting case is the last one. I set it to 15 threads and
  monitored with 'top'.
  For 0-60% compression I was at 300% CPU, 60-74% was 200% CPU and
  74-100% was 100% CPU. It never used all for cores. The only other
  things running were top and my desktop. This is the same load
  balancing problem I observed earlier.

 Well, that's possible with a window 25 times larger than the default.

Why did it never use more than three cores?


 The load balancing is solved with a master thread serving relatively
 small object list segments to any work thread that finished with its
 previous segment.  But the size for those segments is currently fixed to
 window * 1000 which is way too large when window == 250.

 I have to find a way to auto-tune that segment size somehow.

 But with the default window size there should not be any such noticeable
 load balancing problem.

 Note that threading only happens in the compression phase.  The count
 and write phase are hardly paralleled.


 Nicolas



-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Git and GCC

2007-12-06 Thread Nicolas Pitre
On Thu, 6 Dec 2007, Jon Smirl wrote:

 On 12/6/07, Nicolas Pitre [EMAIL PROTECTED] wrote:
  On Thu, 6 Dec 2007, Jon Smirl wrote:
 
   On 12/6/07, Nicolas Pitre [EMAIL PROTECTED] wrote:
 When I lasted looked at the code, the problem was in evenly dividing
 the work. I was using a four core machine and most of the time one
 core would end up with 3-5x the work of the lightest loaded core.
 Setting pack.threads up to 20 fixed the problem. With a high number of
 threads I was able to get a 4hr pack to finished in something like
 1:15.
   
But as far as I know you didn't try my latest incarnation which has been
available in Git's master branch for a few months already.
  
   I've deleted all my giant packs. Using the kernel pack:
   4GB Q6600
  
   Using the current thread pack code I get these results.
  
   The interesting case is the last one. I set it to 15 threads and
   monitored with 'top'.
   For 0-60% compression I was at 300% CPU, 60-74% was 200% CPU and
   74-100% was 100% CPU. It never used all for cores. The only other
   things running were top and my desktop. This is the same load
   balancing problem I observed earlier.
 
  Well, that's possible with a window 25 times larger than the default.
 
 Why did it never use more than three cores?

You have 648366 objects total, and only 647457 of them are subject to 
delta compression.

With a window size of 250 and a default thread segment of window * 1000 
that means only 3 segments will be distributed to threads, hence only 3 
threads with work to do.
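
(In numbers: 647457 delta-eligible objects divided by 250 * 1000 = 250000
objects per segment gives 2.59, i.e. 3 segments.)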


Nicolas


Re: Git and GCC

2007-12-06 Thread David Kastrup
Junio C Hamano [EMAIL PROTECTED] writes:

 Junio C Hamano [EMAIL PROTECTED] writes:

 Jon Loeliger [EMAIL PROTECTED] writes:

 I'd like to learn more about that.  Can someone point me to
 either more documentation on it?  In the absence of that,
 perhaps a pointer to the source code that implements it?

 See Documentation/technical/pack-heuristics.txt,

 A somewhat funny thing about this is ...

 $ git show --stat --summary b116b297
 commit b116b297a80b54632256eb89dd22ea2b140de622
 Author: Jon Loeliger [EMAIL PROTECTED]
 Date:   Thu Mar 2 19:19:29 2006 -0600

 Added Packing Heursitics IRC writeup.

Ah, fishing for compliments.  The cookie baking season...

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


Re: Git and GCC

2007-12-06 Thread Jon Smirl
On 12/6/07, Nicolas Pitre [EMAIL PROTECTED] wrote:
 On Thu, 6 Dec 2007, Jon Smirl wrote:

  On 12/6/07, Nicolas Pitre [EMAIL PROTECTED] wrote:
When I lasted looked at the code, the problem was in evenly dividing
the work. I was using a four core machine and most of the time one
core would end up with 3-5x the work of the lightest loaded core.
Setting pack.threads up to 20 fixed the problem. With a high number of
threads I was able to get a 4hr pack to finished in something like
1:15.
  
   But as far as I know you didn't try my latest incarnation which has been
   available in Git's master branch for a few months already.
 
  I've deleted all my giant packs. Using the kernel pack:
  4GB Q6600
 
  Using the current thread pack code I get these results.
 
  The interesting case is the last one. I set it to 15 threads and
  monitored with 'top'.
  For 0-60% compression I was at 300% CPU, 60-74% was 200% CPU and
  74-100% was 100% CPU. It never used all for cores. The only other
  things running were top and my desktop. This is the same load
  balancing problem I observed earlier.

 Well, that's possible with a window 25 times larger than the default.

 The load balancing is solved with a master thread serving relatively
 small object list segments to any work thread that finished with its
 previous segment.  But the size for those segments is currently fixed to
 window * 1000 which is way too large when window == 250.

 I have to find a way to auto-tune that segment size somehow.

That would be nice. Threading is most important on the giant
pack/window combinations. The normal case is fast enough that I don't
really notice it. These giant pack/window combos can run 8-10 hours.


 But with the default window size there should not be any such noticeable
 load balancing problem.

I only spend 30 seconds in the compression phase without making the
window larger. It's not long enough to really see what is going on.


 Note that threading only happens in the compression phase.  The count
 and write phase are hardly paralleled.


 Nicolas



-- 
Jon Smirl
[EMAIL PROTECTED]


[OT] Re: Git and GCC

2007-12-06 Thread Randy Dunlap
On Thu, 06 Dec 2007 23:26:07 +0100 David Kastrup wrote:

 Junio C Hamano [EMAIL PROTECTED] writes:
 
  Junio C Hamano [EMAIL PROTECTED] writes:
 
  Jon Loeliger [EMAIL PROTECTED] writes:
 
  I'd like to learn more about that.  Can someone point me to
  either more documentation on it?  In the absence of that,
  perhaps a pointer to the source code that implements it?
 
  See Documentation/technical/pack-heuristics.txt,
 
  A somewhat funny thing about this is ...
 
  $ git show --stat --summary b116b297
  commit b116b297a80b54632256eb89dd22ea2b140de622
  Author: Jon Loeliger [EMAIL PROTECTED]
  Date:   Thu Mar 2 19:19:29 2006 -0600
 
  Added Packing Heursitics IRC writeup.
 
 Ah, fishing for compliments.  The cookie baking season...

Indeed.  Here are some really good & sweet recipes (IMHO).

http://www.xenotime.net/linux/recipes/


---
~Randy
Features and documentation: http://lwn.net/Articles/260136/


Re: Git and GCC

2007-12-06 Thread Jon Smirl
On 12/6/07, Nicolas Pitre [EMAIL PROTECTED] wrote:
   Well, that's possible with a window 25 times larger than the default.
 
  Why did it never use more than three cores?

 You have 648366 objects total, and only 647457 of them are subject to
 delta compression.

 With a window size of 250 and a default thread segment of window * 1000
 that means only 3 segments will be distributed to threads, hence only 3
 threads with work to do.

One little tweak and the clock time drops from 9.5 to 6 minutes. The
tweak makes all four cores work.

[EMAIL PROTECTED]:/home/apps/git$ git diff
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 4f44658..e0dd12e 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -1645,7 +1645,7 @@ static void ll_find_deltas(struct object_entry
**list, unsigned list_size,
}

/* this should be auto-tuned somehow */
-   chunk_size = window * 1000;
+   chunk_size = window * 50;

do {
unsigned sublist_size = chunk_size;


[EMAIL PROTECTED]:/home/linux/.git$ time git repack -a -d -f --depth=250 --window=250
Counting objects: 648366, done.
Compressing objects: 100% (647457/647457), done.
Writing objects: 100% (648366/648366), done.
Total 648366 (delta 539043), reused 0 (delta 0)

real    6m2.109s
user    20m0.491s
sys     0m4.608s
[EMAIL PROTECTED]:/home/linux/.git$
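
(For the arithmetic: the tweak makes chunk_size = 250 * 50 = 12500, so the
647457 delta-eligible objects now split into roughly 52 segments instead of
3, enough to keep all four cores fed.)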





 Nicolas



-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Git and GCC

2007-12-06 Thread Jakub Narebski
Linus Torvalds [EMAIL PROTECTED] writes:

 On Thu, 6 Dec 2007, Jon Loeliger wrote:

 I guess one question I posit is, would it be more accurate
 to think of this as a delta net in a weighted graph rather
 than a delta chain?
 
 It's certainly not a simple chain, it's more of a set of acyclic directed 
 graphs in the object list. And yes, it's weighted by the size of the delta 
 between objects, and the optimization problem is kind of akin to finding 
 the smallest spanning tree (well, forest - since you do *not* want to 
 create one large graph, you also want to make the individual trees shallow 
 enough that you don't have excessive delta depth).
 
 There are good algorithms for finding minimum spanning trees, but this one 
 is complicated by the fact that the biggest cost (by far!) is the 
 calculation of the weights itself. So rather than really worry about 
 finding the minimal tree/forest, the code needs to worry about not having 
 to even calculate all the weights!
 
 (That, btw, is a common theme. A lot of git is about traversing graphs, 
 like the revision graph. And most of the trivial graph problems all assume 
 that you have the whole graph, but since the whole graph is the whole 
 history of the repository, those algorithms are totally worthless, since 
 they are fundamentally much too expensive - if we have to generate the 
 whole history, we're already screwed for a big project. So things like 
 revision graph calculation, the main performance issue is to avoid having 
 to even *look* at parts of the graph that we don't need to see!)

Hmmm...

I think that these two problems (find minimal spanning forest with
limited depth and traverse graph) with the additional constraint to
avoid calculating weights / avoid calculating whole graph would be
a good problem to present at CompSci course.

Just a thought...
-- 
Jakub Narebski
Poland
ShadeHawk on #git


Re: Git and GCC

2007-12-06 Thread Harvey Harrison
On Thu, 2007-12-06 at 13:04 -0500, Daniel Berlin wrote:
 On 12/6/07, Linus Torvalds [EMAIL PROTECTED] wrote:
 
  So the equivalent of git gc --aggressive - but done *properly* - is to
  do (overnight) something like
 
  git repack -a -d --depth=250 --window=250
 
 I gave this a try overnight, and it definitely helps a lot.
 Thanks!

I've updated the public mirror repo with the very-packed version.

People cloning it now should get the just over 300MB repo now.

git.infradead.org/gcc.git


Cheers,

Harvey



Re: Git and GCC

2007-12-06 Thread David Miller
From: Jeff King [EMAIL PROTECTED]
Date: Thu, 6 Dec 2007 12:39:47 -0500

 I tried the threaded repack with pack.threads = 3 on a dual-processor
 machine, and got:
 
   time git repack -a -d -f --window=250 --depth=250
 
  real    309m59.849s
  user    377m43.948s
  sys     8m23.319s
 
   -r--r--r-- 1 peff peff  28570088 2007-12-06 10:11 
 pack-1fa336f33126d762988ed6fc3f44ecbe0209da3c.idx
   -r--r--r-- 1 peff peff 339922573 2007-12-06 10:11 
 pack-1fa336f33126d762988ed6fc3f44ecbe0209da3c.pack
 
 So it is about 5% bigger. What is really disappointing is that we saved
 only about 20% of the time. I didn't sit around watching the stages, but
 my guess is that we spent a long time in the single threaded writing
 objects stage with a thrashing delta cache.

If someone can give me a good way to run this test case I can
have my 64-cpu Niagara-2 box crunch on this and see how fast
it goes and how much larger the resulting pack file is.


Re: Git and GCC

2007-12-06 Thread Nicolas Pitre
On Thu, 6 Dec 2007, Jon Smirl wrote:

 I have a 4.8GB git process with 4GB of physical memory. Everything
 started slowing down a lot when the process got that big. Does git
 really need 4.8GB to repack? I could only keep 3.4GB resident. Luckily
 this happen at 95% completion. With 8GB of memory you should be able
 to do this repack in under 20 minutes.

Probably you have too many cached delta results.  By default, every 
delta smaller than 1000 bytes is kept in memory until the write phase.  
Try using pack.deltacachesize = 256M or lower, or try disabling this 
caching entirely with pack.deltacachelimit = 0.
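
For example, assuming you set these per repository (git config integer values
accept the k/m/g suffixes), something like:

  git config pack.deltacachesize 256m
  git config pack.deltacachelimit 0

should cap the delta cache at 256MB or disable the caching, respectively.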


Nicolas


Re: Git and GCC

2007-12-06 Thread NightStrike
On 12/6/07, Linus Torvalds [EMAIL PROTECTED] wrote:


 On Thu, 6 Dec 2007, NightStrike wrote:
 
  No disrespect is meant by this reply.  I am just curious (and I am
  probably misunderstanding something)..  Why remove all of the
  documentation entirely?  Wouldn't it be better to just document it
  more thoroughly?

 Well, part of it is that I don't think --aggressive as it is implemented
 right now is really almost *ever* the right answer. We could change the
 implementation, of course, but generally the right thing to do is to not
 use it (tweaking the --window and --depth manually for the repacking
 is likely the more natural thing to do).

 The other part of the answer is that, when you *do* want to do what that
 --aggressive tries to achieve, it's such a special case event that while
 it should probably be documented, I don't think it should necessarily be
 documented where it is now (as part of git gc), but as part of a much
 more technical manual for deep and subtle tricks you can play.

  I thought you did a fine job in this post in explaining its purpose,
  when to use it, when not to, etc.  Removing the documention seems
  counter-intuitive when you've already gone to the trouble of creating
  good documentation here in this post.

 I'm so used to writing emails, and I *like* trying to explain what is
 going on, so I have no problems at all doing that kind of thing. However,
 trying to write a manual or man-page or other technical documentation is
 something rather different.

 IOW, I like explaining git within the _context_ of a discussion or a
 particular problem/issue. But documentation should work regardless of
 context (or at least set it up), and that's the part I am not so good at.

 In other words, if somebody (hint hint) thinks my explanation was good and
 readable, I'd love for them to try to turn it into real documentation by
 editing it up and creating enough context for it! But I'm not personally
 very likely to do that. I'd just send Junio the patch to remove a
 misleading part of the documentation we have.

hehe.. I'd love to, actually.  I can work on it next week.


Re: Git and GCC

2007-12-06 Thread Linus Torvalds


On Thu, 6 Dec 2007, Jon Smirl wrote:
 
   time git blame -C gcc/regclass.c > /dev/null
 
 [EMAIL PROTECTED]:/video/gcc$ time git blame -C gcc/regclass.c > /dev/null
 
 real    1m21.967s
 user    1m21.329s

Well, I was also hoping for a "compared to not-so-aggressive packing" 
number on the same machine. IOW, what I was wondering is whether there is 
a visible performance downside to the deeper delta chains in the 300MB 
pack vs the (less aggressive) 500MB pack.

Linus


Re: Git and GCC

2007-12-06 Thread Jeff King
On Thu, Dec 06, 2007 at 01:02:58PM -0500, Nicolas Pitre wrote:

  What is really disappointing is that we saved
  only about 20% of the time. I didn't sit around watching the stages, but
  my guess is that we spent a long time in the single threaded writing
  objects stage with a thrashing delta cache.
 
 Maybe you should run the non threaded repack on the same machine to have 
 a good comparison.

Sorry, I should have been more clear. By "saved" I meant we needed N
minutes of CPU time, but took only M minutes of real time to use it.
IOW, if we assume that the threading had zero overhead and that we were
completely CPU bound, then the task would have taken N minutes of real
time. And obviously those assumptions aren't true, but I was attempting
to say it would have been at most N minutes of real time to do it
single-threaded.

 And if you have only 2 CPUs, you will have better performances with
 pack.threads = 2, otherwise there'll be wasteful task switching going
 on.

Yes, but balanced by one thread running out of data way earlier than the
other, and completing the task with only one CPU. I am doing a 4-thread
test on a quad-CPU right now, and I will also try it with threads=1 and
threads=6 for comparison.

 And of course, if the delta cache is being trashed, that might be due to 
 the way the existing pack was previously packed.  Hence the current pack 
 might impact object _access_ when repacking them.  So for a really 
 really fair performance comparison, you'd have to preserve the original 
 pack and swap it back before each repack attempt.

I am working each time from the pack generated by fetching from
git://git.infradead.org/gcc.git.

-Peff


Re: Git and GCC

2007-12-06 Thread Jeff King
On Thu, Dec 06, 2007 at 07:31:21PM -0800, David Miller wrote:

  So it is about 5% bigger. What is really disappointing is that we saved
  only about 20% of the time. I didn't sit around watching the stages, but
  my guess is that we spent a long time in the single threaded writing
  objects stage with a thrashing delta cache.
 
 If someone can give me a good way to run this test case I can
 have my 64-cpu Niagara-2 box crunch on this and see how fast
 it goes and how much larger the resulting pack file is.

That would be fun to see. The procedure I am using is this:

# compile recent git master with threaded delta
cd git
echo THREADED_DELTA_SEARCH = 1 >> config.mak
make install

# get the gcc pack
mkdir gcc && cd gcc
git --bare init
git config remote.gcc.url git://git.infradead.org/gcc.git
git config remote.gcc.fetch \
  '+refs/remotes/gcc.gnu.org/*:refs/remotes/gcc.gnu.org/*'
git remote update

# make a copy, so we can run further tests from a known point
cd ..
cp -a gcc test

# and test multithreaded large depth/window repacking
cd test
git config pack.threads 4
time git repack -a -d -f --window=250 --depth=250

-Peff


Re: Git and GCC

2007-12-06 Thread Jon Smirl
On 12/7/07, Linus Torvalds [EMAIL PROTECTED] wrote:


 On Thu, 6 Dec 2007, Jon Smirl wrote:
  
    time git blame -C gcc/regclass.c > /dev/null
  
   [EMAIL PROTECTED]:/video/gcc$ time git blame -C gcc/regclass.c > /dev/null
  
   real    1m21.967s
   user    1m21.329s

 Well, I was also hoping for a compared to not-so-aggressive packing
 number on the same machine.. IOW, what I was wondering is whether there is
 a visible performance downside to the deeper delta chains in the 300MB
 pack vs the (less aggressive) 500MB pack.

Same machine with a default pack

[EMAIL PROTECTED]:/video/gcc/.git/objects/pack$ ls -l
total 2145716
-r--r--r-- 1 jonsmirl jonsmirl   23667932 2007-12-07 02:03
pack-bd163555ea9240a7fdd07d2708a293872665f48b.idx
-r--r--r-- 1 jonsmirl jonsmirl 2171385413 2007-12-07 02:03
pack-bd163555ea9240a7fdd07d2708a293872665f48b.pack
[EMAIL PROTECTED]:/video/gcc/.git/objects/pack$

Delta lengths have virtually no impact. The bigger pack file causes
more IO which offsets the increased delta processing time.

One of my rules is that smaller is almost always better. Smaller eliminates
IO and helps with the CPU cache. It's like the kernel being optimized
for size instead of speed and ending up being faster.

time git blame -C gcc/regclass.c > /dev/null
real    1m19.289s
user    1m17.853s
sys     0m0.952s




 Linus



-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Git and GCC

2007-12-06 Thread Jeff King
On Thu, Dec 06, 2007 at 10:35:22AM -0800, Linus Torvalds wrote:

  What is really disappointing is that we saved only about 20% of the 
  time. I didn't sit around watching the stages, but my guess is that we 
  spent a long time in the single threaded writing objects stage with a 
  thrashing delta cache.
 
 I don't think you spent all that much time writing the objects. That part 
 isn't very intensive, it's mostly about the IO.

It can get nasty with super-long deltas thrashing the cache, I think.
But in this case, I think it ended up being just a poor division of
labor caused by the chunk_size parameter using the quite large window
size (see elsewhere in the thread for discussion).

 I suspect you may simply be dominated by memory-throughput issues. The 
 delta matching doesn't cache all that well, and using two or more cores 
 isn't going to help all that much if they are largely waiting for memory 
 (and quite possibly also perhaps fighting each other for a shared cache? 
 Is this a Core 2 with the shared L2?)

I think the chunk_size more or less explains it. I have had reasonable
success keeping both CPUs busy on similar tasks in the past (but with
smaller window sizes).

For reference, it was a Core 2 Duo; do they all share L2, or is there
something I can look for in /proc/cpuinfo?

-Peff


Re: Git and GCC

2007-12-06 Thread Jeff King
On Fri, Dec 07, 2007 at 01:50:47AM -0500, Jeff King wrote:

 Yes, but balanced by one thread running out of data way earlier than the
 other, and completing the task with only one CPU. I am doing a 4-thread
 test on a quad-CPU right now, and I will also try it with threads=1 and
 threads=6 for comparison.

Hmm. As this has been running, I read the rest of the thread, and it
looks like Jon Smirl has already posted the interesting numbers. So
nevermind, unless there is something particular you would like to see.

-Peff


Re: Git and GCC

2007-12-06 Thread Jon Smirl
On 12/7/07, Jeff King [EMAIL PROTECTED] wrote:
 On Thu, Dec 06, 2007 at 07:31:21PM -0800, David Miller wrote:

   So it is about 5% bigger. What is really disappointing is that we saved
   only about 20% of the time. I didn't sit around watching the stages, but
   my guess is that we spent a long time in the single threaded writing
   objects stage with a thrashing delta cache.
 
  If someone can give me a good way to run this test case I can
  have my 64-cpu Niagara-2 box crunch on this and see how fast
  it goes and how much larger the resulting pack file is.

 That would be fun to see. The procedure I am using is this:

 # compile recent git master with threaded delta
 cd git
 echo THREADED_DELTA_SEARCH = 1 >> config.mak
 make install

 # get the gcc pack
 mkdir gcc && cd gcc
 git --bare init
 git config remote.gcc.url git://git.infradead.org/gcc.git
 git config remote.gcc.fetch \
   '+refs/remotes/gcc.gnu.org/*:refs/remotes/gcc.gnu.org/*'
 git remote update

 # make a copy, so we can run further tests from a known point
 cd ..
 cp -a gcc test

 # and test multithreaded large depth/window repacking
 cd test
 git config pack.threads 4

64 threads with 64 CPUs; if they are multi-core you want even more.
You need to adjust chunk_size as mentioned in the other mail.


 time git repack -a -d -f --window=250 --depth=250

 -Peff



-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Git and GCC

2007-12-05 Thread Ismail Dönmez
Wednesday 05 December 2007 21:08:41 Daniel Berlin wrote:
 So I tried a full history conversion using git-svn of the gcc
 repository (IE every trunk revision from 1-HEAD as of yesterday)
 The git-svn import was done using repacks every 1000 revisions.
 After it finished, I used git-gc --aggressive --prune.  Two hours
 later, it finished.
 The final size after this is 1.5 gig for all of the history of gcc for
 just trunk.

 [EMAIL PROTECTED]:/compilerstuff/gitgcc/gccrepo/.git/objects/pack$ ls -trl
 total 1568899
 -r--r--r-- 1 dberlin dberlin 1585972834 2007-12-05 14:01
 pack-cd328fcf0bd673d8f2f72c42fbe67da64cbcd218.pack
 -r--r--r-- 1 dberlin dberlin   19008488 2007-12-05 14:01
 pack-cd328fcf0bd673d8f2f72c42fbe67da64cbcd218.idx

 This is 3x bigger than hg *and* hg doesn't require me to waste my life
 repacking every so often.
 The hg operations run roughly as fast as the git ones

I think this (the gcc hg repo) is very good, but the only problem is that it's
not always in sync with SVN; it would really rock if a post-commit SVN hook
synced the hg repo.

Thanks for doing this anyhow.

Regards,
ismail

-- 
Never learn by your mistakes, if you do you may never dare to try again.


Re: Git and GCC

2007-12-05 Thread NightStrike
On 12/5/07, Daniel Berlin [EMAIL PROTECTED] wrote:
 I already have two way sync with hg.
 Maybe someday when git is more usable than hg to a normal developer,
 or it at least is significantly smaller than hg, i'll look at it
 again.

Sorry, what is hg?


Re: Git and GCC

2007-12-05 Thread Ollie Wild
On Dec 5, 2007 11:08 AM, Daniel Berlin [EMAIL PROTECTED] wrote:
 So I tried a full history conversion using git-svn of the gcc
 repository (IE every trunk revision from 1-HEAD as of yesterday)
 The git-svn import was done using repacks every 1000 revisions.
 After it finished, I used git-gc --aggressive --prune.  Two hours
 later, it finished.
 The final size after this is 1.5 gig for all of the history of gcc for
 just trunk.

Out of curiosity, how much of that is the .git/svn directory?  This is
where git-svn-specific data is stored.  It is *very* inefficient, at
least for the 1.5.2.5 version I'm using.

Ollie


Re: Git and GCC

2007-12-05 Thread Daniel Berlin
On 12/5/07, NightStrike [EMAIL PROTECTED] wrote:
 On 12/5/07, Daniel Berlin [EMAIL PROTECTED] wrote:
  I already have two way sync with hg.
  Maybe someday when git is more usable than hg to a normal developer,
  or it at least is significantly smaller than hg, i'll look at it
  again.

 Sorry, what is hg?

http://www.selenic.com/mercurial/


Re: Git and GCC

2007-12-05 Thread Samuel Tardieu
 Daniel == Daniel Berlin [EMAIL PROTECTED] writes:

Daniel So I tried a full history conversion using git-svn of the gcc
Daniel repository (IE every trunk revision from 1-HEAD as of
Daniel yesterday) The git-svn import was done using repacks every
Daniel 1000 revisions.  After it finished, I used git-gc --aggressive
Daniel --prune.  Two hours later, it finished.  The final size after
Daniel this is 1.5 gig for all of the history of gcc for just trunk.

Most of the space is probably taken by the SVN specific data. To get
an idea of how GIT would handle GCC data, you should clone the GIT
directory or checkout one from infradead.org:

  % git clone git://git.infradead.org/gcc.git

On my machine, it takes 856M with a checkout copy of trunk and
contains the trunk, autovect, fixed-point, 4.1 and 4.2 branches. In
comparison, my checked out copy of trunk using SVN requires 1.2G, and
I don't have any history around...

  Sam
-- 
Samuel Tardieu -- [EMAIL PROTECTED] -- http://www.rfc1149.net/



Re: Git and GCC

2007-12-05 Thread Daniel Berlin
For the record:

 [EMAIL PROTECTED]:/compilerstuff/gitgcc/gccrepo$ git --version
git version 1.5.3.7

(I downloaded it yesterday when i started the import)

On 12/5/07, Daniel Berlin [EMAIL PROTECTED] wrote:
 So I tried a full history conversion using git-svn of the gcc
 repository (IE every trunk revision from 1-HEAD as of yesterday)
 The git-svn import was done using repacks every 1000 revisions.
 After it finished, I used git-gc --aggressive --prune.  Two hours
 later, it finished.
 The final size after this is 1.5 gig for all of the history of gcc for
 just trunk.

 [EMAIL PROTECTED]:/compilerstuff/gitgcc/gccrepo/.git/objects/pack$ ls -trl
 total 1568899
 -r--r--r-- 1 dberlin dberlin 1585972834 2007-12-05 14:01
 pack-cd328fcf0bd673d8f2f72c42fbe67da64cbcd218.pack
 -r--r--r-- 1 dberlin dberlin   19008488 2007-12-05 14:01
 pack-cd328fcf0bd673d8f2f72c42fbe67da64cbcd218.idx

 This is 3x bigger than hg *and* hg doesn't require me to waste my life
 repacking every so often.
 The hg operations run roughly as fast as the git ones

 I'm sure there are magic options, magic command lines, etc, i could
 use to make it smaller.

 I'm sure if i spent the next few weeks fucking around with git, it may
 even be usable!

 But given that git is harder to use, requires manual repacking to get
 any kind of sane space usage, and is 3x bigger anyway, i don't see any
 advantage to continuing to experiment with git and gcc.

 I already have two way sync with hg.
 Maybe someday when git is more usable than hg to a normal developer,
 or it at least is significantly smaller than hg, i'll look at it
 again.
 For now, it seems a net loss.

 --Dan
 
  git clone --depth 100 git://git.infradead.org/gcc.git
 
  should give around ~50mb repository with usable trunk. This is all thanks to
  Bernardo Innocenti for setting up an up-to-date gcc git repo.
 
  P.S:Please cut down on the usage of exclamation mark.
 
  Regards,
  ismail
 
  --
  Never learn by your mistakes, if you do you may never dare to try again.
 



Re: Git and GCC

2007-12-05 Thread Daniel Berlin
On 12/5/07, Ollie Wild [EMAIL PROTECTED] wrote:
 On Dec 5, 2007 11:08 AM, Daniel Berlin [EMAIL PROTECTED] wrote:
  So I tried a full history conversion using git-svn of the gcc
  repository (IE every trunk revision from 1-HEAD as of yesterday)
  The git-svn import was done using repacks every 1000 revisions.
  After it finished, I used git-gc --aggressive --prune.  Two hours
  later, it finished.
  The final size after this is 1.5 gig for all of the history of gcc for
  just trunk.

 Out of curiosity, how much of that is the .git/svn directory?  This is
 where git-svn-specific data is stored.  It is *very* inefficient, at
 least for the 1.5.2.5 version I'm using.


I was only counting the space in the packs dir.

 Ollie



Re: Git and GCC

2007-12-05 Thread J.C. Pizarro
On 12/5/07, Daniel Berlin [EMAIL PROTECTED] wrote:
 So I tried a full history conversion using git-svn of the gcc
 repository (IE every trunk revision from 1-HEAD as of yesterday)
 The git-svn import was done using repacks every 1000 revisions.
 After it finished, I used git-gc --aggressive --prune.  Two hours
 later, it finished.
 The final size after this is 1.5 gig for all of the history of gcc for
 just trunk.

 [EMAIL PROTECTED]:/compilerstuff/gitgcc/gccrepo/.git/objects/pack$ ls -trl
 total 1568899
 -r--r--r-- 1 dberlin dberlin 1585972834 2007-12-05 14:01
 pack-cd328fcf0bd673d8f2f72c42fbe67da64cbcd218.pack
 -r--r--r-- 1 dberlin dberlin   19008488 2007-12-05 14:01
 pack-cd328fcf0bd673d8f2f72c42fbe67da64cbcd218.idx

 This is 3x bigger than hg *and* hg doesn't require me to waste my life
 repacking every so often.
 The hg operations run roughly as fast as the git ones

 I'm sure there are magic options, magic command lines, etc, i could
 use to make it smaller.

 I'm sure if i spent the next few weeks fucking around with git, it may
 even be usable!

 But given that git is harder to use, requires manual repacking to get
 any kind of sane space usage, and is 3x bigger anyway, i don't see any
 advantage to continuing to experiment with git and gcc.

 I already have two way sync with hg.
 Maybe someday when git is more usable than hg to a normal developer,
 or it at least is significantly smaller than hg, i'll look at it
 again.
 For now, it seems a net loss.

 --Dan
 
  git clone --depth 100 git://git.infradead.org/gcc.git
 
  should give around ~50mb repository with usable trunk. This is all thanks to
  Bernardo Innocenti for setting up an up-to-date gcc git repo.
 
  P.S:Please cut down on the usage of exclamation mark.
 
  Regards,
  ismail
 
  --
  Never learn by your mistakes, if you do you may never dare to try again.
 

See the thread "Re: svn trunk reaches nearly 1 GiB!!! That massive!!!":

http://gcc.gnu.org/ml/gcc/2007-11/msg00805.html
http://gcc.gnu.org/ml/gcc/2007-11/msg00770.html
http://gcc.gnu.org/ml/gcc/2007-11/msg00769.html
http://gcc.gnu.org/ml/gcc/2007-11/msg00768.html
http://gcc.gnu.org/ml/gcc/2007-11/msg00767.html

* In http://gcc.gnu.org/ml/gcc/2007-11/msg00675.html , I wrote:

The files generated from flex/bison are a lot of throwaway hexadecimal noise
that must not be committed to any cvs/svn/git/hg repository, because they
consume a lot of disk space for what is only a modification of a few lines of
flex/bison source.

* In http://gcc.gnu.org/ml/gcc/2007-11/msg00683.html , I wrote:

I hate considering temporary files as sources of the tree. They aren't sources.

It's a good idea to remove ALL generated files from the sources:

A) the *.c and *.h files generated from the lex/bison sources *.l/*.y
B) the generated (not hand-written) configure, Makefile, aclocal.m4,
 config.h.in and Makefile.in files from configure.ac and 

Re: Git and GCC

2007-12-05 Thread Daniel Berlin
On 12/5/07, Samuel Tardieu [EMAIL PROTECTED] wrote:
  Daniel == Daniel Berlin [EMAIL PROTECTED] writes:

 Daniel So I tried a full history conversion using git-svn of the gcc
 Daniel repository (IE every trunk revision from 1-HEAD as of
 Daniel yesterday) The git-svn import was done using repacks every
 Daniel 1000 revisions.  After it finished, I used git-gc --aggressive
 Daniel --prune.  Two hours later, it finished.  The final size after
 Daniel this is 1.5 gig for all of the history of gcc for just trunk.

 Most of the space is probably taken by the SVN specific data.

I showed a du of the pack directory.
Everyone tells me that svn-specific data is in .svn, so I am
disinclined to believe this.

Also, given that hg can store the svn data without this kind of
penalty, it's just another strike against git.


  To get
 an idea of how GIT would handle GCC data, you should clone the GIT
 directory or checkout one from infradead.org:
Does infradead have the entire history?

   % git clone git://git.infradead.org/gcc.git

 On my machine, it takes 856M with a checkout copy of trunk and
 contains the trunk, autovect, fixed-point, 4.1 and 4.2 branches. In
 comparaison, my checked out copy of trunk using SVN requires 1.2G, and
 I don't have any history around...

This is about git's usability and space usage, not SVN.
People say we should consider GIT. I have been considering GIT and hg,
and right now, GIT looks like a massive loser in every respect.
It's harder to use.
It takes up more space than hg to store the same data.
It requires manual repacking.
Its diff/etc. commands are not any faster.

Humorously, I tried to verify whether infradead has full history or
not, but of course git log git://git.infradead.org/gcc.git says
"fatal: not a git repository"
(though git clone is happy to clone it, because it is a git repository).
I'm sure there is some magic option or command line I need to use to
view remote log history without cloning the repository.
But all the other systems we look at don't require this kind of
bullshit to actually get things done.

As I said, maybe I'll look at git in another year or so.
But I'm certainly going to ignore all the "git is so great, we should
move gcc to it" people until it works better, while I am much more
inclined to believe the "hg is so great, we should move gcc to it"
people.


Re: Git and GCC

2007-12-05 Thread Andreas Schwab
Harvey Harrison [EMAIL PROTECTED] writes:

 On Wed, 2007-12-05 at 21:23 +0100, Samuel Tardieu wrote:
  Daniel == Daniel Berlin [EMAIL PROTECTED] writes:
 
 Daniel So I tried a full history conversion using git-svn of the gcc
 Daniel repository (IE every trunk revision from 1-HEAD as of
 Daniel yesterday) The git-svn import was done using repacks every
 Daniel 1000 revisions.  After it finished, I used git-gc --aggressive
 Daniel --prune.  Two hours later, it finished.  The final size after
 Daniel this is 1.5 gig for all of the history of gcc for just trunk.
 
 Most of the space is probably taken by the SVN specific data. To get
 an idea of how GIT would handle GCC data, you should clone the GIT
 directory or checkout one from infradead.org:
 
   % git clone git://git.infradead.org/gcc.git
 

 Actually I went through and created the basis for that repo.  It
 contains all branches and tags in the gcc svn repo and the final
 pack comes to about 600M.  This has _everything_, not just trunk.

Not everything.  Only trunk and a few selected branches, and no tags.

Andreas.

-- 
Andreas Schwab, SuSE Labs, [EMAIL PROTECTED]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
And now for something completely different.


Re: Git and GCC

2007-12-05 Thread Harvey Harrison
On Thu, 2007-12-06 at 00:34 +0100, Andreas Schwab wrote:
 Harvey Harrison [EMAIL PROTECTED] writes:
 
  On Wed, 2007-12-05 at 21:23 +0100, Samuel Tardieu wrote:
   Daniel == Daniel Berlin [EMAIL PROTECTED] writes:
  
  Daniel So I tried a full history conversion using git-svn of the gcc
  Daniel repository (IE every trunk revision from 1-HEAD as of
  Daniel yesterday) The git-svn import was done using repacks every
  Daniel 1000 revisions.  After it finished, I used git-gc --aggressive
  Daniel --prune.  Two hours later, it finished.  The final size after
  Daniel this is 1.5 gig for all of the history of gcc for just trunk.
  
  Most of the space is probably taken by the SVN specific data. To get
  an idea of how GIT would handle GCC data, you should clone the GIT
  directory or checkout one from infradead.org:
  
% git clone git://git.infradead.org/gcc.git
  
 
  Actually I went through and created the basis for that repo.  It
  contains all branches and tags in the gcc svn repo and the final
  pack comes to about 600M.  This has _everything_, not just trunk.
 
 Not everything.  Only trunk and a few selected branches, and no tags.
 

Yes, everything, by default you only get the more modern branches/tags,
but it's all in there.  If there is interest I can work with Bernardo
and get the rest publically exposed.

Harvey



Re: Git and GCC

2007-12-05 Thread Harvey Harrison
On Wed, 2007-12-05 at 21:23 +0100, Samuel Tardieu wrote:
  Daniel == Daniel Berlin [EMAIL PROTECTED] writes:
 
 Daniel So I tried a full history conversion using git-svn of the gcc
 Daniel repository (IE every trunk revision from 1-HEAD as of
 Daniel yesterday) The git-svn import was done using repacks every
 Daniel 1000 revisions.  After it finished, I used git-gc --aggressive
 Daniel --prune.  Two hours later, it finished.  The final size after
 Daniel this is 1.5 gig for all of the history of gcc for just trunk.
 
 Most of the space is probably taken by the SVN specific data. To get
 an idea of how GIT would handle GCC data, you should clone the GIT
 directory or checkout one from infradead.org:
 
   % git clone git://git.infradead.org/gcc.git
 

Actually I went through and created the basis for that repo.  It
contains all branches and tags in the gcc svn repo and the final
pack comes to about 600M.  This has _everything_, not just trunk.

For the first time after doing such an import, I found it much better
to do git repack -a -f --depth=100 --window=100.  After that initial
repack a plain git-gc occasionally will be just fine.

If you want any more information about this, let me know.

Cheers,

Harvey



Re: Git and GCC

2007-12-05 Thread NightStrike
On 12/5/07, Daniel Berlin [EMAIL PROTECTED] wrote:
 As I said, maybe i'll look at git in another year or so.
 But  i'm certainly going to ignore all the git is so great, we should
 move gcc to it people until it works better, while i am much more
 inclined to believe the hg is so great, we should move gc to it
 people.

Just out of curiosity, is there something wrong with the current
choice of svn?  As I recall, it wasn't too long ago that gcc converted
from cvs to svn.  What's the motivation to change again?  (I'm not
trying to oppose anything.. I'm just curious, as I don't know much
about this kind of thing).


Re: Git and GCC

2007-12-05 Thread Ollie Wild
On Dec 5, 2007 1:40 PM, Daniel Berlin [EMAIL PROTECTED] wrote:

  Out of curiosity, how much of that is the .git/svn directory?  This is
  where git-svn-specific data is stored.  It is *very* inefficient, at
  least for the 1.5.2.5 version I'm using.
 

 I was only counting the space in the packs dir.

In my personal client, which includes the entire history of GCC, the
packs dir is only 652MB.

Obviously, you're not a big fan of Git, and you're entitled to your
opinion.  I, however, find it very useful.  Given a choice between Git
and Mercurial, I choose git, but only because I have prior experience
working with the Linux kernel.  From what I've heard, both do the job
reasonably well.

Thanks to git-svn, using Git to develop GCC is practical with or
without explicit support from the GCC maintainers.  As I see it, the
main barrier is the inordinate amount of time it takes to bring up a
repository from scratch.  As has already been noted, Harvey has
provided a read-only copy, but it (a) only allows access to a subset
of GCC's branches and (b) doesn't provide a mechanism for developers
to push changes directly via git-svn.

This sounds like a homework project.  I'll do some investigation and
see if I can come up with a good bootstrap process.

Ollie


Re: Git and GCC

2007-12-05 Thread David Miller
From: Daniel Berlin [EMAIL PROTECTED]
Date: Wed, 5 Dec 2007 14:08:41 -0500

 So I tried a full history conversion using git-svn of the gcc
 repository (IE every trunk revision from 1-HEAD as of yesterday)
 The git-svn import was done using repacks every 1000 revisions.
 After it finished, I used git-gc --aggressive --prune.  Two hours
 later, it finished.
 The final size after this is 1.5 gig for all of the history of gcc for
 just trunk.
 
 [EMAIL PROTECTED]:/compilerstuff/gitgcc/gccrepo/.git/objects/pack$ ls -trl
 total 1568899
 -r--r--r-- 1 dberlin dberlin 1585972834 2007-12-05 14:01
 pack-cd328fcf0bd673d8f2f72c42fbe67da64cbcd218.pack
 -r--r--r-- 1 dberlin dberlin   19008488 2007-12-05 14:01
 pack-cd328fcf0bd673d8f2f72c42fbe67da64cbcd218.idx
 
 This is 3x bigger than hg *and* hg doesn't require me to waste my life
 repacking every so often.
 The hg operations run roughly as fast as the git ones
 
 I'm sure there are magic options, magic command lines, etc, i could
 use to make it smaller.
 
 I'm sure if i spent the next few weeks fucking around with git, it may
 even be usable!
 
 But given that git is harder to use, requires manual repacking to get
 any kind of sane space usage, and is 3x bigger anyway, i don't see any
 advantage to continuing to experiment with git and gcc.

I would really appreciate it if you would share experiences
like this with the GIT community, who have now been CC:'d.

That's the only way this situation is going to improve.

When you don't CC: the people who can fix the problem, I can only
speculate that perhaps at least subconsciously you don't care if
the situation improves or not.

The OpenSolaris folks behaved similarly, and that really ticked me
off.


Re: Git and GCC

2007-12-05 Thread Daniel Berlin
On 12/5/07, David Miller [EMAIL PROTECTED] wrote:
 From: Daniel Berlin [EMAIL PROTECTED]
 Date: Wed, 5 Dec 2007 14:08:41 -0500

  So I tried a full history conversion using git-svn of the gcc
  repository (IE every trunk revision from 1-HEAD as of yesterday)
  The git-svn import was done using repacks every 1000 revisions.
  After it finished, I used git-gc --aggressive --prune.  Two hours
  later, it finished.
  The final size after this is 1.5 gig for all of the history of gcc for
  just trunk.
 
  [EMAIL PROTECTED]:/compilerstuff/gitgcc/gccrepo/.git/objects/pack$ ls -trl
  total 1568899
  -r--r--r-- 1 dberlin dberlin 1585972834 2007-12-05 14:01
  pack-cd328fcf0bd673d8f2f72c42fbe67da64cbcd218.pack
  -r--r--r-- 1 dberlin dberlin   19008488 2007-12-05 14:01
  pack-cd328fcf0bd673d8f2f72c42fbe67da64cbcd218.idx
 
  This is 3x bigger than hg *and* hg doesn't require me to waste my life
  repacking every so often.
  The hg operations run roughly as fast as the git ones
 
  I'm sure there are magic options, magic command lines, etc, i could
  use to make it smaller.
 
  I'm sure if i spent the next few weeks fucking around with git, it may
  even be usable!
 
  But given that git is harder to use, requires manual repacking to get
  any kind of sane space usage, and is 3x bigger anyway, i don't see any
  advantage to continuing to experiment with git and gcc.

 I would really appreciate it if you would share experiences
 like this with the GIT community, who have been now CC:'d.

 That's the only way this situation is going to improve.

 When you don't CC: the people who can fix the problem, I can only
 speculate that perhaps at least subconsciously you don't care if
 the situation improves or not.

I didn't cc the git community for three reasons:

1. It's not the nicest message in the world, and thus, more likely to
get bad responses than constructive ones.

2. Based on the level of usability, I simply assume it is too young
for regular developers to use.  At least, I hope this is the case.

3. People I know have had bad experiences discussing usability issues
with the git community in the past.  I am not likely to fare any
better, so I would rather have someone who is involved with both our
community and theirs raise these issues, rather than a complete
newcomer.

But hey, whatever floats your boat :)

It is true I gave up quickly, but this is mainly because I don't like
to fight with my tools.
I am quite fine with a distributed workflow; I now use 8 or so gcc
branches in mercurial (auto-synced from svn) and merge a lot between
them. I wanted to see if git would sanely let me manage the commits
back to svn.  After fighting with it, I gave up and just wrote a
Python extension to hg that lets me commit non-svn changesets back to
svn directly from hg.

--Dan


Re: Git and GCC

2007-12-05 Thread David Miller
From: Daniel Berlin [EMAIL PROTECTED]
Date: Wed, 5 Dec 2007 21:41:19 -0500

 It is true I gave up quickly, but this is mainly because i don't like
 to fight with my tools.
 I am quite fine with a distributed workflow, I now use 8 or so gcc
 branches in mercurial (auto synced from svn) and merge a lot between
 them. I wanted to see if git would sanely let me manage the commits
 back to svn.  After fighting with it, i gave up and just wrote a
 python extension to hg that lets me commit non-svn changesets back to
 svn directly from hg.

I find it ironic that you were even willing to write tools to
facilitate your hg based gcc workflow.  That really shows what your
thinking is on this matter, in that you're willing to put effort
towards making hg work better for you but you're not willing to expend
that level of effort to see if git can do so as well.

This is what really eats me from the inside about your dissatisfaction
with git.  Your analysis seems to be a self-fulfilling prophecy, and
that's totally unfair to both hg and git.


Re: Git and GCC

2007-12-05 Thread Daniel Berlin
On 12/5/07, David Miller [EMAIL PROTECTED] wrote:
 From: Daniel Berlin [EMAIL PROTECTED]
 Date: Wed, 5 Dec 2007 21:41:19 -0500

  It is true I gave up quickly, but this is mainly because i don't like
  to fight with my tools.
  I am quite fine with a distributed workflow, I now use 8 or so gcc
  branches in mercurial (auto synced from svn) and merge a lot between
  them. I wanted to see if git would sanely let me manage the commits
  back to svn.  After fighting with it, i gave up and just wrote a
  python extension to hg that lets me commit non-svn changesets back to
  svn directly from hg.

 I find it ironic that you were even willing to write tools to
 facilitate your hg based gcc workflow.
Why?

 That really shows what your
 thinking is on this matter, in that you're willing to put effort
 towards making hg work better for you but you're not willing to expend
 that level of effort to see if git can do so as well.
See, now you claim to know my thinking.
I went back to hg because GIT's space usage wasn't even in the
ballpark, and I couldn't get git-svn rebase to update the revs after the
initial import (even though I had properly used a rewriteRoot).
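
(For reference, the rewriteRoot in question is just the per-svn-remote
setting; something along these lines, with the remote name and URL being
whatever your setup uses:)

  # tell git-svn which canonical URL the git-svn-id lines refer to,
  # even though the objects were fetched from a mirror
  git config svn-remote.svn.rewriteRoot svn://gcc.gnu.org/svn/gcc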

The size is clearly not just svn data; it's in the git pack itself.

I spent a long time working on SVN to reduce its space usage (on the repo
side, cleaning up the client side, and giving the svn devs a path to
reduce it further), as well as on UI issues, and I really don't feel like
having to do the same for GIT.

I'm tired of having to spend a large amount of effort to get my tools
to work.  If the community wants to find and fix the problem, I've
already said repeatedly I'll happily hand over my repo, data,
whatever.  You are correct that I am not going to spend even more effort
when I can be productive with something else much quicker.  The devil
I know (committing to svn) is better than the devil I don't (diving
into git source code and finding/fixing what is causing this space
blowup).
The python extension took me a few hours (< 4).
In git, I spent these hours waiting for git-gc to finish.

 This is what really eats me from the inside about your dissatisfaction
 with git.  Your analysis seems to be a self-fulfilling prophecy, and
 that's totally unfair to both hg and git.
Oh?
You seem to be taking this awfully personally.
I came into this completely open-minded. Really, I did (I'm sure
you'll claim otherwise).
GIT people told me it would work great and I'd have a really small git
repo and be able to commit back to svn.
I tried it.
It didn't work out.
It doesn't seem to be usable for whatever reason.
I'm happy to give details, data, whatever.

I made the engineering decision that my effort would be better spent
doing something I knew I could do quickly (make hg commit back to svn
for my purposes) than trying to improve larger issues in GIT (UI and
space usage).  That took me a few hours, and I was happy again.

I would have been incredibly happy to have git just come up with
a 400 meg gcc repository, and to be happily committing away from
git-svn to gcc's repository ...
But it didn't happen.
So far, you have yet to actually do anything but incorrectly tell me
what I am thinking.

I'll probably try again in 6 months, and maybe it will be better.


Re: Git and GCC

2007-12-05 Thread David Miller
From: Daniel Berlin [EMAIL PROTECTED]
Date: Wed, 5 Dec 2007 22:47:01 -0500

 The size is clearly not just svn data, it's in the git pack itself.

And other users have shown much smaller metadata from a GIT import,
and yes, those include all of the repository history and branches,
not just the trunk.


Re: Git and GCC

2007-12-05 Thread Harvey Harrison
I fought with this a few months ago when I did my own clone of gcc svn.
My bad for only discussing this on #git at the time.  Should have put
this to the list as well.

If anyone recalls, my report was something along the lines of
"git gc --aggressive explodes pack size".

git repack -a -d --depth=100 --window=100 produced a ~550MB packfile;
immediately afterwards, a git gc --aggressive produced a 1.5G packfile.

This was for all branches/tags, not just trunk as in Daniel's repo.

The best theory I had at the time was that the gc doesn't find as good
deltas, or doesn't allow the same delta chain depth, and so generates a
new object in the pack rather than reusing a good delta it already has
in the well-packed pack.
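
(The sizes above are just what the pack directory reports; to reproduce the
comparison, something like this before and after each step is enough:)

  # rough measurement of pack size after each repack/gc step
  git count-objects -v
  ls -lh .git/objects/pack/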

Cheers,

Harvey



Re: Git and GCC

2007-12-05 Thread Harvey Harrison

On Wed, 2007-12-05 at 20:20 -0800, David Miller wrote:
 From: Daniel Berlin [EMAIL PROTECTED]
 Date: Wed, 5 Dec 2007 22:47:01 -0500
 
  The size is clearly not just svn data, it's in the git pack itself.
 
 And other users have shown much smaller metadata from a GIT import,
 and yes those are including all of the repository history and branches
 not just the trunk.

David, I think it is actually a bug in git gc with the --aggressive
option... mind you, even if he solves that, the format git svn uses
for its bi-directional metadata is so space-inefficient that Daniel will
be crying for other reasons immediately afterwards: 4MB for every
branch and tag in gcc svn (more than a few thousand of them).

You only need that metadata around for the branches you are planning on
committing to, but it is all created during the default git svn import.
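
For example, if you only ever intend to commit to trunk, something like this
keeps the git svn metadata down to a single branch (a sketch; URL and layout
as for gcc svn, adjust to taste):

  # build git-svn metadata for trunk only, not every branch and tag
  git svn init --trunk=trunk svn://gcc.gnu.org/svn/gcc
  git svn fetch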

FYI

Harvey



Re: Git and GCC

2007-12-05 Thread Daniel Berlin
On 12/5/07, David Miller [EMAIL PROTECTED] wrote:
 From: Daniel Berlin [EMAIL PROTECTED]
 Date: Wed, 5 Dec 2007 22:47:01 -0500

  The size is clearly not just svn data, it's in the git pack itself.

 And other users have shown much smaller metadata from a GIT import,
 and yes those are including all of the repository history and branches
 not just the trunk.
I followed the instructions in the tutorials.
I followed the instructions given to me by the people who created these.
I came up with a 1.5 gig pack file.
Do you want to help, or do you want to argue with me?
Right now it sounds like you are trying to blame me or make it look
like I did something wrong.

You are, of course, welcome to try it yourself.
I can give you the exact commands I ran, and with git
1.5.3.7, they will give you a 1.5 gig pack file.


Re: Git and GCC

2007-12-05 Thread David Miller
From: Daniel Berlin [EMAIL PROTECTED]
Date: Wed, 5 Dec 2007 23:32:52 -0500

 On 12/5/07, David Miller [EMAIL PROTECTED] wrote:
  From: Daniel Berlin [EMAIL PROTECTED]
  Date: Wed, 5 Dec 2007 22:47:01 -0500
 
   The size is clearly not just svn data, it's in the git pack itself.
 
  And other users have shown much smaller metadata from a GIT import,
  and yes those are including all of the repository history and branches
  not just the trunk.
 I followed the instructions in the tutorials.
 I followed the instructions given to by people who created these.
 I came up with a 1.5 gig pack file.
 You want to help, or you want to argue with me.

Several people replied in this thread showing what options can lead to
smaller pack files.

They also listed what the GIT limitations are that would affect the
kind of work you are doing, which seemed to mostly deal with the high
space cost of branching and tags when converting to/from SVN repos.


Re: Git and GCC

2007-12-05 Thread Linus Torvalds


On Wed, 5 Dec 2007, Harvey Harrison wrote:
 
 If anyone recalls my report was something along the lines of
 git gc --aggressive explodes pack size.

Yes, --aggressive is generally a bad idea. I think we should remove it or 
at least fix it. It doesn't do what the name implies, because it actually 
throws away potentially good packing, and re-does it all from a clean 
slate.

That said, it's totally pointless for a person who isn't a git proponent 
to do an initial import, and in that sense I agree with Daniel: he 
shouldn't waste his time with tools that he doesn't know or care about, 
since there are people who *can* do a better job, and who know what they 
are doing, and understand and like the tool.

While you can do a half-assed job with just mindlessly running "git
svnimport" (which is deprecated these days) or "git svn clone" (better),
the fact is, doing a *good* import likely means spending some effort
on it.  Trying to make the user names / emails better with a mailmap,
for example.

[ By default, for example, git svn clone/fetch seems to create those 
  horrible fake email addresses that contain the ID of the SVN repo in 
  each commit - I'm not talking about the git-svn-id, I'm talking about 
  the [EMAIL PROTECTED] thing for the author. Maybe people don't 
  really care, but isn't that ugly as hell? I'd think it's worth doing 
  a really nice import, spending some effort on it.

  But maybe those things come from the older CVS-to-SVN import, I don't 
  really know. I've done a few SVN imports, but I've done them just for 
  stuff where I didn't want to touch SVN, but just wanted to track some 
  project like libgpod. For things like *that*, a totally mindless git 
  svn thing is fine ]

Of course, that does require there to be git people in the gcc crowd who 
are motivated enough to do the proper import and then make sure it's 
up-to-date and hosted somewhere. If those people don't exist, I'm not sure 
there's much point to it.

The point being, you cannot ask a non-git person to do a major git import 
for an actual switch-over. Yes, it *can* be as simple as just doing a

git svn clone --stdlayout svn://gcc.gnu.org/svn/gcc gcc

but the fact remains, you want to spend more effort and expertise on it if 
you actually want the result to be used as a basis for future work (as 
opposed to just tracking somebody else's SVN tree).

That includes:

 - do the historic import with good packing (and no, --aggressive 
   is not it, never mind the misleading name and man-page)

 - probably mailmap entries, certainly spending some time validating the 
   results.

 - hosting it

and perhaps most importantly

 - helping people who are *not* git users get up to speed.

because doing a good job at it is like asking a CVS newbie to set up a 
branch in CVS. I'm sure you can do it from man-pages, but I'm also sure 
you sure as hell won't like the end result.

Linus


Re: Git and GCC

2007-12-05 Thread Harvey Harrison
On Wed, 2007-12-05 at 20:54 -0800, Linus Torvalds wrote:
 
 On Wed, 5 Dec 2007, Harvey Harrison wrote:
  
  If anyone recalls my report was something along the lines of
  git gc --aggressive explodes pack size.

 [ By default, for example, git svn clone/fetch seems to create those 
   horrible fake email addresses that contain the ID of the SVN repo in 
   each commit - I'm not talking about the git-svn-id, I'm talking about 
   the [EMAIL PROTECTED] thing for the author. Maybe people don't 
   really care, but isn't that ugly as hell? I'd think it's worth it doing 
   a really nice import, spending some effort on it.
 
   But maybe those things come from the older CVS-SVN import, I don't 
   really know. I've done a few SVN imports, but I've done them just for 
   stuff where I didn't want to touch SVN, but just wanted to track some 
   project like libgpod. For things like *that*, a totally mindless git 
   svn thing is fine ]
 

git svn does accept a mailmap at import time, with the same format as the
cvs importer, I think.  But for someone who just wants a repo to check
out, this was easiest.  I'd be willing to spend the time to do a nicer
job if there were any interest from the gcc side, but I'm not that
invested (other than owing them for an often-used tool).
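
Roughly, assuming an authors file in the usual one-mapping-per-line format
(the entry below is made up):

  # svn username = Git author, one per line
  cat > authors.txt <<EOF
  jdoe = J. Random Hacker <jdoe@example.org>
  EOF
  git svn clone -A authors.txt --stdlayout svn://gcc.gnu.org/svn/gcc gcc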

Harvey



Re: Git and GCC

2007-12-05 Thread Daniel Berlin
On 12/5/07, David Miller [EMAIL PROTECTED] wrote:
 From: Daniel Berlin [EMAIL PROTECTED]
 Date: Wed, 5 Dec 2007 23:32:52 -0500

  On 12/5/07, David Miller [EMAIL PROTECTED] wrote:
   From: Daniel Berlin [EMAIL PROTECTED]
   Date: Wed, 5 Dec 2007 22:47:01 -0500
  
The size is clearly not just svn data, it's in the git pack itself.
  
   And other users have shown much smaller metadata from a GIT import,
   and yes those are including all of the repository history and branches
   not just the trunk.
  I followed the instructions in the tutorials.
  I followed the instructions given to by people who created these.
  I came up with a 1.5 gig pack file.
  You want to help, or you want to argue with me.

 Several people replied in this thread showing what options can lead to
 smaller pack files.

Actually, one person did, but that's okay; let's assume it was several.
I am currently trying Harvey's options.

I asked about using the pre-existing repos so I didn't have to do
this, but they were all either
1. done using read-only imports, or
2. missing full history.
(i.e. the one that contains full history that is often posted here was
done as a read-only import and thus doesn't have the metadata).

 They also listed what the GIT limitations are that would affect the
 kind of work you are doing, which seemed to mostly deal with the high
 space cost of branching and tags when converting to/from SVN repos.

Actually, it turns out that git-gc --aggressive does this dumb thing
to pack files sometimes regardless of whether you converted from an
SVN repo or not.


Re: Git and GCC

2007-12-05 Thread Harvey Harrison
On Thu, 2007-12-06 at 00:11 -0500, Daniel Berlin wrote:
 On 12/5/07, David Miller [EMAIL PROTECTED] wrote:
  From: Daniel Berlin [EMAIL PROTECTED]
  Date: Wed, 5 Dec 2007 23:32:52 -0500
 
   On 12/5/07, David Miller [EMAIL PROTECTED] wrote:
From: Daniel Berlin [EMAIL PROTECTED]
Date: Wed, 5 Dec 2007 22:47:01 -0500
   
 The size is clearly not just svn data, it's in the git pack itself.
   
And other users have shown much smaller metadata from a GIT import,
and yes those are including all of the repository history and branches
not just the trunk.
   I followed the instructions in the tutorials.
   I followed the instructions given to by people who created these.
   I came up with a 1.5 gig pack file.
   You want to help, or you want to argue with me.
 
  Several people replied in this thread showing what options can lead to
  smaller pack files.
 
 Actually, one person did, but that's okay, let's assume it was several.
 I am currently trying Harvey's options.
 
 I asked about using the pre-existing repos so i didn't have to do
 this, but they were all
 1. Done using read-only imports or
 2. Don't contain full history
 (IE the one that contains full history that is often posted here was
 done as a read only import and thus doesn't have the metadata).

While you won't get the git svn metadata if you clone the infradead
repo, it can be recreated on the fly by git svn if you want to start
committing directly to gcc svn.
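
Something like this, though getting the ref names to line up with what
git svn expects is the fiddly part (a sketch; URL and layout as for gcc svn):

  # inside a clone of the mirror
  git svn init --stdlayout svn://gcc.gnu.org/svn/gcc
  # rebuilds the rev_map from the git-svn-id lines already present in the
  # imported commits, provided the refs are where git svn expects them
  git svn fetch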

Harvey



Re: Git and GCC

2007-12-05 Thread Daniel Berlin
 While you won't get the git svn metadata if you clone the infradead
 repo, it can be recreated on the fly by git svn if you want to start
 committing directly to gcc svn.

I will give this a try :)


Re: Git and GCC

2007-12-05 Thread Linus Torvalds


On Thu, 6 Dec 2007, Daniel Berlin wrote:
 
 Actually, it turns out that git-gc --aggressive does this dumb thing
 to pack files sometimes regardless of whether you converted from an
 SVN repo or not.

Absolutely. git gc --aggressive is mostly dumb. It's really only useful for
the case of "I know I have a *really* bad pack, and I want to throw away
all the bad packing decisions I have done".

To explain this, it's worth explaining (you are probably aware of it, but 
let me go through the basics anyway) how git delta-chains work, and how 
they are so different from most other systems.

In other SCMs, a delta-chain is generally fixed. It might be forwards 
or backwards, and it might evolve a bit as you work with the repository, 
but generally it's a chain of changes to a single file represented as some 
kind of single SCM entity. In CVS, it's obviously the *,v file, and a lot 
of other systems do rather similar things.

Git also does delta-chains, but it does them a lot more loosely. There 
is no fixed entity. Deltas are generated against any random other version 
that git deems to be a good delta candidate (with various fairly 
successful heuristics), and there are absolutely no hard grouping rules.

This is generally a very good thing. It's good for various conceptual 
reasons (i.e. git internally never really even needs to care about the whole 
revision chain - it doesn't really think in terms of deltas at all), but 
it's also great because getting rid of the inflexible delta rules means 
that git doesn't have any problems at all with merging two files together, 
for example - there simply are no arbitrary *,v revision files that have 
some hidden meaning.

It also means that the choice of deltas is a much more open-ended 
question. If you limit the delta chain to just one file, you really don't 
have a lot of choices on what to do about deltas, but in git, it really 
can be a totally different issue.

And this is where the really badly named --aggressive comes in. While 
git generally tries to re-use delta information (because it's a good idea, 
and it doesn't waste CPU time re-finding all the good deltas we found 
earlier), sometimes you want to say "let's start all over, with a blank 
slate, and ignore all the previous delta information, and try to generate 
a new set of deltas".

So --aggressive is not really about being aggressive, but about wasting 
CPU time re-doing a decision we already did earlier!

*Sometimes* that is a good thing. Some import tools in particular could 
generate really horribly bad deltas. Anything that uses git fast-import, 
for example, likely doesn't have much of a great delta layout, so it might 
be worth saying "I want to start from a clean slate".

But almost always, in other cases, it's actually a really bad thing to do. 
It's going to waste CPU time, and especially if you had actually done a 
good job at deltaing earlier, the end result isn't going to re-use all 
those *good* deltas you already found, so you'll actually end up with a 
much worse end result too!

I'll send a patch to Junio to just remove the git gc --aggressive 
documentation. It can be useful, but it generally is useful only when you 
really understand at a very deep level what it's doing, and that 
documentation doesn't help you do that.

Generally, doing incremental git gc is the right approach, and better 
than doing git gc --aggressive. It's going to re-use old deltas, and 
when those old deltas can't be found (the reason for doing incremental GC 
in the first place!) it's going to create new ones.

On the other hand, it's definitely true that an initial import of a long 
and involved history is a point where it can be worth spending a lot of 
time finding the *really* good deltas. Then, every user ever after (as 
long as they don't use git gc --aggressive to undo it!) will get the 
advantage of that one-time event. So especially for big projects with a 
long history, it's probably worth doing some extra work, telling the delta 
finding code to go wild.

So the equivalent of git gc --aggressive - but done *properly* - is to 
do (overnight) something like

git repack -a -d --depth=250 --window=250

where that depth thing is just about how deep the delta chains can be 
(make them longer for old history - it's worth the space overhead), and 
the window thing is about how big an object window we want each delta 
candidate to scan.

And here, you might well want to add the -f flag (which is the "drop all 
old deltas" flag), since you now are actually trying to make sure that this 
one actually finds good candidates.
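
In other words, roughly:

  # one-time, overnight, on the machine hosting the freshly imported repo
  git repack -a -d -f --depth=250 --window=250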

And then it's going to take forever and a day (i.e. a "do it overnight" 
thing). But the end result is that everybody downstream from that 
repository will get much better packs, without having to spend any effort 
on it themselves.

Linus


Re: Git and GCC

2007-12-05 Thread Jon Smirl
On 12/6/07, Daniel Berlin [EMAIL PROTECTED] wrote:
  While you won't get the git svn metadata if you clone the infradead
  repo, it can be recreated on the fly by git svn if you want to start
  commiting directly to gcc svn.
 
 I will give this a try :)

Back when I was working on the Mozilla repository, we were able to
convert the full 4GB CVS repository, complete with all history, into a
450MB pack file. That work is where the git fast-import tool came from.
But it took a month of messing with the import tools to achieve this,
and Mozilla still chose another VCS (mainly because of poor Windows
support in git).

Like Linus says, this type of command will yield the smallest pack file:
 git repack -a -d --depth=250 --window=250

I do agree that importing multi-gigabyte repositories is not a daily
occurrence nor a turn-key operation. There are significant issues when
translating from one VCS to another. The lack of global branch
tracking in CVS causes extreme problems on import. Hand editing of CVS
files also caused endless trouble.

The key to converting repositories of this size is RAM: 4GB minimum;
more would be better. git-repack is not multi-threaded. There were a
few attempts at making it multi-threaded, but none were too successful.
If I remember right, with loads of RAM, a repack on a 450MB repository
was taking about five hours on a 2.8GHz Core2. But this is something
you only have to do once for the import. Later repacks will reuse the
original deltas.

-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Git and GCC

2007-12-05 Thread Jeff King
On Thu, Dec 06, 2007 at 01:47:54AM -0500, Jon Smirl wrote:

 The key to converting repositories of this size is RAM. 4GB minimum,
 more would be better. git-repack is not multi-threaded. There were a
 few attempts at making it multi-threaded but none were too successful.
 If I remember right, with loads of RAM, a repack on a 450MB repository
 was taking about five hours on a 2.8Ghz Core2. But this is something
 you only have to do once for the import. Later repacks will reuse the
 original deltas.

Actually, Nicolas put quite a bit of work into multi-threading the
repack process; the results have been in master for some time, and will
be in the soon-to-be-released v1.5.4.

The downside is that the threading partitions the object space, so the
resulting size is not necessarily as small (but I don't know that
anybody has done testing on large repos to find out how large the
difference is).
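
(For reference, the thread count is controlled by the pack.threads
configuration; a rough sketch, with the number being whatever suits your
machine:)

  # 0 means auto-detect the number of CPUs; 1 forces the old single-threaded
  # (and potentially slightly tighter) packing
  git config pack.threads 4
  git repack -a -d -f --depth=250 --window=250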

-Peff


Re: Git and GCC

2007-12-05 Thread Harvey Harrison

   git repack -a -d --depth=250 --window=250
 

Since I have the whole gcc repo locally, I'll give this a shot overnight
just to see what can be done at the extreme end of things.

Harvey