subject:"more git updates.."

Re: more git updates..

2005-04-20 Thread Kai Henningsen

[EMAIL PROTECTED] (H. Peter Anvin)  wrote on 11.04.05 in <[EMAIL PROTECTED]>:

> Followup to:  <[EMAIL PROTECTED]>
> By author:Christopher Li <[EMAIL PROTECTED]>
> In newsgroup: linux.dev.kernel
> >
> > There is one problem though. What about SHA1 hash collisions?
> > Even though the chance is very remote, you don't want to lose data due
> > to a "software" error. I think it is OK not to handle that
> > case right now. On the other hand, it would be nice to detect it
> > and give out a big error message if it really happens.
> >
>
> If you're actually worried about it, it'd be better to just use a
> different hash, like one of the SHA-2's (probably a better choice
> anyway), instead of SHA-1.

How could that help? *Every* hash has hash collisions. It's an unavoidable
result of using fewer bits than the original data has.
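
A minimal sketch of that pigeonhole argument, in Python. The helper names
tiny_hash and find_collision are made up for illustration, and truncating
SHA-1 to 16 bits is only an assumption that makes a collision cheap to find;
a full-width hash behaves the same way in principle, just with a vastly
larger search.

```python
import hashlib
from itertools import count


def tiny_hash(data: bytes, bits: int = 16) -> int:
    """Keep only the top `bits` bits of SHA-1 (illustrative truncation)."""
    digest = hashlib.sha1(data).digest()
    return int.from_bytes(digest, "big") >> (160 - bits)


def find_collision(bits: int = 16):
    """Return two distinct inputs with the same truncated hash.

    With 2**bits possible outputs, at most 2**bits + 1 distinct inputs
    are needed before two of them must share an output.
    """
    seen = {}
    for i in count():
        msg = str(i).encode()
        h = tiny_hash(msg, bits)
        if h in seen:
            return seen[h], msg
        seen[h] = msg
```

A wider hash only pushes the expected search further out; it cannot remove
collisions.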

Regards, Kai
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: more git updates..

2005-04-13 Thread Matt Mackall

On Thu, Apr 14, 2005 at 01:42:11AM +0200, Krzysztof Halasa wrote:
> Matt Mackall <[EMAIL PROTECTED]> writes:
> 
> > Now if you can assume that blobs never change and are never deleted,
> > you can simply append them all onto a log, and then index them with a
> > separate file containing an htree of (sha1, offset, length) or the
> > like.
> 
> That means a problem with rsync, though.

I believe 200k inodes is a problem for rsync too. But we can simply
grab the remote htree, do a tree compare, find the ranges of the
remote file we need, sort and merge the ranges, and then pull them.
That will surely trounce rsync.
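
A rough sketch of the compare, sort and merge steps. It assumes each side's
index is available as a plain dict mapping a sha1 hex string to an
(offset, length) pair in the remote log; the flat dict stands in for the
htree, and ranges_to_fetch is a hypothetical name.

```python
def ranges_to_fetch(local_index, remote_index, gap=0):
    """Return merged (offset, length) byte ranges of the remote log we lack.

    `gap` lets nearby ranges be fetched as one read even when a few
    unneeded bytes sit between them.
    """
    # Blobs present remotely but not locally.
    missing = [remote_index[key] for key in remote_index.keys() - local_index.keys()]

    merged = []
    for offset, length in sorted(missing):
        end = offset + length
        if merged and offset <= merged[-1][1] + gap:
            # Touches or overlaps the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]
```

Each returned pair could then be pulled with a single ranged read of the
remote log.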

-- 
Mathematics is the supreme nostalgia of our time.

Re: more git updates..

2005-04-13 Thread Krzysztof Halasa

Matt Mackall <[EMAIL PROTECTED]> writes:

> Now if you can assume that blobs never change and are never deleted,
> you can simply append them all onto a log, and then index them with a
> separate file containing an htree of (sha1, offset, length) or the
> like.

That means a problem with rsync, though.

BTW: I think the bandwidth increase compared to bkcvs isn't that obvious.
After a file is modified with git, it has to be transmitted (plus
small additional things).
If a file is modified with bkcvs, it has to be transmitted (the whole
RCS file) as well.

Only the initial rsync would be much smaller with bkcvs.
-- 
Krzysztof Halasa

Re: Re: more git updates..

2005-04-13 Thread Matt Mackall

On Tue, Apr 12, 2005 at 06:10:27PM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> > 
> > I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
> > the CVS/SCCS format as storage may be more appealing than the current
> > git format.
> 
> Go wild. I did mine in six days, and you've been whining about other 
> people's SCMs for three years.

I wrote a hack to do efficient delta storage with O(1) seeks for
lookup and append last week, I believe it's been integrated into the
latest Bazaar-NG. I expect it'll give better compression and
performance than BK. Of course it ends up being O(revisions) for
modifications or insertions (but that is probably a non-issue for the
SCM models we're looking at).

The git model is obviously very different, but I worry about the slop
space implied. With 200k file revisions and an average of 2k slop per
file, that's 400MB of slop, or almost the size of an equivalent delta
compressed kernel repo.
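
The arithmetic behind that estimate, as a quick check. Treating "2k" as
2000 bytes is an assumption made to keep the numbers round.

```python
file_revisions = 200_000   # "200k file revisions"
slop_per_file = 2_000      # bytes of wasted block tail per file (assumed "2k" = 2000)

total_slop_mb = file_revisions * slop_per_file / 1_000_000
print(f"{total_slop_mb:.0f} MB of slop")
```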

Now if you can assume that blobs never change and are never deleted,
you can simply append them all onto a log, and then index them with a
separate file containing an htree of (sha1, offset, length) or the
like. Since the key is already a strong hash, this is an excellent
match and avoids rehashing in the kernel's directory lookup. And it'll
save an inode, a directory entry, and about half a data block per
entry. "Open" will also be cheaper as there's no per-revision inode to
grab.
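
A minimal sketch of that layout, assuming an in-memory dict in place of the
on-disk htree and any seekable binary file object as the log. The class name
BlobLog and its methods are hypothetical, not part of git.

```python
import hashlib
import io


class BlobLog:
    """Append-only blob store: one data log plus a sha1 -> (offset, length) index."""

    def __init__(self, log):
        self.log = log    # seekable binary file object, opened for read and write
        self.index = {}   # stands in for the separate htree index file

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        if key not in self.index:          # blobs never change, so store each once
            self.log.seek(0, io.SEEK_END)
            offset = self.log.tell()
            self.log.write(data)
            self.index[key] = (offset, len(data))
        return key

    def get(self, key: str) -> bytes:
        offset, length = self.index[key]
        self.log.seek(offset)
        return self.log.read(length)
```

Usage would look like store = BlobLog(open("blobs.log", "a+b")), where the
file name is only an example. Lookups cost one index probe and one seek, with
no per-blob inode.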

I could hack on this if you think it fits with the git model,
otherwise I'll go back to my other experiments..

-- 
Mathematics is the supreme nostalgia of our time.

Re: Re: more git updates..

2005-04-13 Thread Linus Torvalds

On Wed, 13 Apr 2005, Russell King wrote:
> 
> And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which
> is more dense than CVS.
> 
> BK is also a lot better than CVS.  So _your_ point is?

Hey, anybody who wants to argue that BK is better than GIT won't be 
getting any counter-arguments from me.

The fact is, I have constraints. Like needing something to work within a
few days. If somebody comes up with an ultra-fast, replicable,
space-efficient SCM in three days, I'm all over it. 

In the meantime, I'd suggest people who worry about network bandwidth try 
to work out a synchronization protocol that allows you to send "diff 
updates" between git repositories. The git model doesn't preclude looking 
at the objects and sending diffs instead (and re-creating the objects on 
the other side). But my time-constraints _do_.

Linus

Re: Re: more git updates..

2005-04-13 Thread Andrea Arcangeli

On Tue, Apr 12, 2005 at 06:10:27PM -0700, Linus Torvalds wrote:
> Go wild. I did mine in six days, and you've been whining about other 
> people's SCMs for three years.

Even if I spend 6 days doing git, you'd never have thrown away BK in
exchange for git.

> In other words - go and _do_ something instead of whining. I'm not 
> interested.

CVS and SVN are already an order of magnitude more efficient than git at
storing and exporting the data and they shouldn't annoy you during the
checkins either, they have a backend much more efficient than git too,
and yet you seem not to care about them.

My suggestion was simply to at least change git to coalesce the diffs
like CVS/SCCS, I'm only making a suggestion to give git a chance to have
a backend at least as efficient as the one that CVS uses and to avoid
running rsync on a 2.8G uncompressible blob. I don't have enough spare
time to do something myself, my spare time would be too short anyway to
make a difference in SCM space, so I'd rather spend it all in more
innovative space where it might have a slight chance to make a
difference.

Re: Re: more git updates..

2005-04-13 Thread Andrea Arcangeli

On Wed, Apr 13, 2005 at 10:30:52AM +0100, Russell King wrote:
> And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which
> is more dense than CVS.

Yep, this is why I mentioned SCCS format too, I didn't know it was even
smaller, but I expected a similar density from SCCS.

> Note: I'm _not_ arguing with your sentiments towards CVS.  However, I
> think the space usage point still stands.

If it wasn't for network synchronization it almost wouldn't matter, but
fetching 2.8G uncompressible when I could simply fetch 220MB
compressible (that will compress with zlib at little cost during rsync
to less than 78M), sounds a bit overkill.

> What is the space usage behaviour when you have multiple git trees?

Multiple trees in the sense of pulls from multiple developers aren't
more costly than a normal checkin, due to the "soft hardlink" property of
the hashes. It's just that every checkin takes lots of space, generating
a new uncompressible blob every time a changeset touches a file.

Re: Re: more git updates..

2005-04-13 Thread Russell King

On Tue, Apr 12, 2005 at 04:45:07PM -0700, Linus Torvalds wrote:
> On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> > At the rate of 9M for every 198 changeset checkins, that means I'll have
> > to download 2.7G _uncompressible_ (i.e. already compressed with a bad
> > per-file ratio due to the too-small files) for a whole pack including all
> > changesets without accounting the original 111MB of the original tree,
> > with rsync -z of git.  That compares with 514M _compressible_ with CVS
> > format on-disk, and with ~79M of the CVS-network download with rsync -z of
> > the CVS repository (assuming default gzip compression level).
> 
> Yes. CVS is much denser.
> 
> CVS is also total crap. So your point is?

And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which
is more dense than CVS.

BK is also a lot better than CVS.  So _your_ point is?

8)

Note: I'm _not_ arguing with your sentiments towards CVS.  However, I
think the space usage point still stands.

What is the space usage behaviour when you have multiple git trees?
Do we need a git relink command in git-pasky? 8)

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core

Re: more git updates..

2005-04-12 Thread Matthias Urlichs

Hi, Linus Torvalds wrote on Tue, 12 Apr 2005 15:49:07 -0700:

>> Have you tried to import it?
> 
> It would take days.

You can always import it later and then graft it into the commit tree.

That would of course change *every* commit node, but so what? They're
small, and you can delete the old ones when you're done.

-- 
Matthias Urlichs   |   {M:U} IT Design @ m-u-it.de   |  [EMAIL PROTECTED]



Re: Re: more git updates..

2005-04-12 Thread Linus Torvalds



On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> 
> I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
> the CVS/SCCS format as storage may be more appealing than the current
> git format.

Go wild. I did mine in six days, and you've been whining about other 
people's SCMs for three years.

In other words - go and _do_ something instead of whining. I'm not 
interested.

Linus

Re: Re: more git updates..

2005-04-12 Thread Andrea Arcangeli

On Tue, Apr 12, 2005 at 04:45:07PM -0700, Linus Torvalds wrote:
> Yes. CVS is much denser.
>
> CVS is also total crap. So your point is?

I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
the CVS/SCCS format as storage may be more appealing than the current
git format. I guess I should have said RCS instead of CVS, sorry if that
created any confusion. The arch/darcs approach of practically storing
patches would also be much denser but it has no efficient way of doing
"rcs up -p 1.x" on a file, that doesn't involve potentially unpacking
tons of unrelated changesets.

Re: Re: more git updates..

2005-04-12 Thread Andrea Arcangeli

On Tue, Apr 12, 2005 at 02:21:58PM -0700, Linus Torvalds wrote:
> The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one
> and a test-run of 198 patches from Andrew) is 111MB. In other words,
> adding 198 "full" new kernels only grew the archive by 9MB (that's all
> "actual disk usage" btw - the files themselves are smaller, but since they
> all end up taking up a full disk block..)

reiserfs can do tail packing, plus the disk block is meaningless when
fetching the data from the network which is the real cost to worry about
when synchronizing and downloading (disk cost isn't a big deal).

The pagecache cost sounds a very minor one too, since you don't need
the whole data in ram, not even all dentries need to be in cache.  This
is one of the reasons why you don't need to run readdir, and why you can
discard the old trees anytime.

At the rate of 9M for every 198 changeset checkins, that means I'll have
to download 2.7G _uncompressible_ (i.e. already compressed with a bad
per-file ratio due to the too-small files) for a whole pack including all
changesets without accounting the original 111MB of the original tree,
with rsync -z of git.  That compares with 514M _compressible_ with CVS
format on-disk, and with ~79M of the CVS-network download with rsync -z of
the CVS repository (assuming default gzip compression level).

What BKCVS provided with 79M of rsync -z, now is provided with 2.8G of
rsync -z, with a network-bound slowdown of -97.2%. Similar slowdowns
should be expected for synchronizations over time while fetching new
blobs etc...
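
The extrapolation above can be reproduced as follows. The total changeset
count is not stated in this message; 60,000 is an assumption chosen because
it yields the quoted 2.7G, and the other figures are the ones given above.

```python
growth_mb = 9               # archive growth over the test run
changesets_in_run = 198     # patches in that run
base_tree_mb = 111          # "the original 111MB of the original tree"
total_changesets = 60_000   # ASSUMPTION: not given in the message
cvs_rsync_mb = 79           # rsync -z download of the CVS repository

history_mb = growth_mb * total_changesets / changesets_in_run
git_total_mb = history_mb + base_tree_mb
reduction_pct = (1 - cvs_rsync_mb / git_total_mb) * 100

print(f"git history: {history_mb:.0f} MB, total: {git_total_mb:.0f} MB")
print(f"CVS download is {reduction_pct:.1f}% smaller")
```

Read this way, the 97.2% figure is the share of the git download that the
CVS download avoids.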

Ok, BKCVS has less than 6 checkins due to the linearization and
coalescing of pulls that couldn't be represented losslessly in CVS, so
the network-bound slowdown is less than -97.2%; my math is
approximate, but the order of magnitude should remain the same.

Clearly one can write an ad-hoc network protocol instead of using
rsync/wget, but the server will need quite a bit of cpu and ram to do a
checkout/update/sync efficiently to unpack all data and create all
changesets to gzip and transfer.

Anyway, git's simplicity and the robustness of its immutable hashes
certainly make it an ideal interim format (and it may even be a very
practical local live format on-disk, except for the backups). I'm only
unsure if it's a wise idea to build an SCM on top of the current git
format, or if it's better to use something like SCCS or CVS to coalesce
all diffs of a single file together, to save space and make rsync -z
very efficient too (or an approach like arch and darcs that stores
changesets per file, i.e. patches).

Re: Re: more git updates..

2005-04-12 Thread Panagiotis Issaris

Hi David,

On Tue, Apr 12, 2005 at 06:36:23PM -0400, David Eger wrote:
> > No. A tree is not the full data. A tree contains enough information to
> > _recreate_ the full data, but the tree itself just tells you _how_ to do
> > that. It doesn't contain very much of the data itself at all.
> 
> Perhaps I'd understand this if you tell me what "recreate" means.
> If I have a SHA1 hash of a file, and I have the file, I can verify that said
> file has the SHA1 hash it's supposed to have, but I can't generate the file
> from its hash...

But if you have that hexified SHA1 hash of a particular file you
want to access, there would be a file with a filename equal to that
hexified SHA1 hash which contained the compressed contents of the file
you're looking for.
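
A small sketch of that idea, assuming a flat object directory and zlib
compression. The function names are hypothetical, and this does not claim to
reproduce git's exact on-disk header or which bytes git feeds to SHA1.

```python
import hashlib
import os
import tempfile
import zlib


def write_object(objdir: str, data: bytes) -> str:
    """Store `data` compressed in a file named after its hex SHA1."""
    name = hashlib.sha1(data).hexdigest()
    path = os.path.join(objdir, name)
    if not os.path.exists(path):        # same content, same name: nothing to do
        with open(path, "wb") as f:
            f.write(zlib.compress(data))
    return name


def read_object(objdir: str, name: str) -> bytes:
    """Load the object called `name` and verify it against its own name."""
    with open(os.path.join(objdir, name), "rb") as f:
        data = zlib.decompress(f.read())
    if hashlib.sha1(data).hexdigest() != name:
        raise ValueError(f"object {name} is corrupt")
    return data


# Round trip in a throwaway directory.
with tempfile.TemporaryDirectory() as objdir:
    name = write_object(objdir, b"hello\n")
    restored = read_object(objdir, name)
```

So the hash is not "reversed"; it is only used as the key under which the
real contents were filed.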

At least, that's how I understood it...

With friendly regards,
Takis

-- 
OpenPGP key: http://lumumba.luc.ac.be/takis/takis_public_key.txt
fingerprint: 6571 13A3 33D9 3726 F728  AA98 F643 B12E ECF3 E029

Re: Re: more git updates..

2005-04-12 Thread Linus Torvalds



On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> 
> At the rate of 9M for every 198 changeset checkins, that means I'll have
> to download 2.7G _uncompressible_ (i.e. already compressed with a bad
> per-file ratio due to the too-small files) for a whole pack including all
> changesets without accounting the original 111MB of the original tree,
> with rsync -z of git.  That compares with 514M _compressible_ with CVS
> format on-disk, and with ~79M of the CVS-network download with rsync -z of
> the CVS repository (assuming default gzip compression level).

Yes. CVS is much denser.

CVS is also total crap. So your point is?

Linus

Re: more git updates..

2005-04-12 Thread Linus Torvalds

On Wed, 13 Apr 2005, Krzysztof Halasa wrote:
> 
> Does that mean that the 64 K changes imported from bk would take ~ 3 GB?
> Is that real?

That's a _guess_. 

> Have you tried to import it?

It would take days.

> I'm going to import the CVS data (with cvsps) - as the CVS "misses" half
> the changes, the resulting archive should be half in size too?

No. The CVS archive is going to be almost the same size. BKCVS gets about 
98% of all the data. It just doesn't show the complex merge graphs, but 
those are "small" in comparison.

> I don't know how much space bk used, but 3 GB for the full history
> is reasonable for most people, isn't it? Especially since one can purge
> older data.

I think it's entirely reasonable, yes. But I may be off by an order of
magnitude. I based the 3GB on estimating from the sparse tree, but I
wasn't being too careful. Andrew estimated 2GB per year (at our current
historical rate of changes) based on my merge with him. So it's in that 
general range of 3-6GB, I think.

Linus

Re: more git updates..

2005-04-12 Thread Krzysztof Halasa

Linus Torvalds <[EMAIL PROTECTED]> writes:

> The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one
> and a test-run of 198 patches from Andrew) is 111MB. In other words,
> adding 198 "full" new kernels only grew the archive by 9MB (that's all
> "actual disk usage" btw - the files themselves are smaller, but since they
> all end up taking up a full disk block..)

Does that mean that the 64 K changes imported from bk would take ~ 3 GB?
Is that real?

Have you tried to import it?
I'm going to import the CVS data (with cvsps) - as the CVS "misses" half
the changes, the resulting archive should be half in size too?

I don't know how much space bk used, but 3 GB for the full history
is reasonable for most people, isn't it? Especially since one can purge
older data.
-- 
Krzysztof Halasa

Re: Re: more git updates..

2005-04-12 Thread David Eger

On Tue, Apr 12, 2005 at 02:21:58PM -0700, Linus Torvalds wrote:
> 
> Yes. A tree is defined by the blobs it references (and the subtrees) but 
> it doesn't _contain_ them. It just contains a pointer to them.

A pointer to them?  You mean a SHA1 hash of them? or what?
Where is the *real* data stored?  The real files, the real patches?
Are these somewhere completely outside of git?

> > Therefore, "TREE" must be the *full* data, and since we have the following
> > definition for CHANGESET:
> 
> No. A tree is not the full data. A tree contains enough information to 
> _recreate_ the full data, but the tree itself just tells you _how_ to do 
> that. It doesn't contain very much of the data itself at all.

Perhaps I'd understand this if you tell me what "recreate" means.
If I have a SHA1 hash of a file, and I have the file, I can verify that said
file has the SHA1 hash it's supposed to have, but I can't generate the file
from its hash...

Sorry for being stubbornly dumb, but you'll have a couple of us puzzling 
at the README ;-)

-dte

Re: Re: more git updates..

2005-04-12 Thread Linus Torvalds

On Tue, 12 Apr 2005, David Eger wrote:
> 
> The reason I am questioning this point is the GIT README file.
> 
> Linus makes explicit that a "blob" is just the "file contents," and that
> really, a "blob" is not just the SHA1 of the "blob":
> 
> > In particular, the "current directory cache" certainly does not need to
> > be consistent with the current directory contents, but it has two very
> > important attributes:
> > 
> > (a) it can re-generate the full state it caches (not just the directory
> > structure: through the "blob" object it can regenerate the data too)
> 
> And he defines "TREE" with the same name: blob

Yes. A tree is defined by the blobs it references (and the subtrees) but 
it doesn't _contain_ them. It just contains a pointer to them.

> Therefore, "TREE" must be the *full* data, and since we have the following
> definition for CHANGESET:

No. A tree is not the full data. A tree contains enough information to 
_recreate_ the full data, but the tree itself just tells you _how_ to do 
that. It doesn't contain very much of the data itself at all.

> That each changeset remembers *everything* for *each point in the tree*.

But only BY REFERENCE. A "commit" is usually very small. For example, the
top-of-tree commit-file for my current kernel test is literally 401
_bytes_ in size. Because it just references a tree (20 bytes of
_reference_).

> Linus, if you actually mean to differentiate between the full data
> and a SHA1 of the data

There is no differentiation. The sha1 _is_ the data as far as git is 
concerned. 

It's only confusing if you think they are different. 

> Also, the details of just what data constitutes a 'changeset' would be
> lovely... i.e. a precise spec of what Pat is describing below...

[EMAIL PROTECTED]:~/test-tools/linux-2.6.12-rc2> cat-file commit `cat .git/HEAD`
tree cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6
parent c7a1a189dd0fe2c6ecd0aa33f2bd2f414c7892a0
author NeilBrown <[EMAIL PROTECTED]> Tue Apr 12 08:27:08 2005
committer Linus Torvalds <[EMAIL PROTECTED]> Tue Apr 12 08:27:08 2005

[PATCH] md: remove a number of misleading calls to MD_BUG

The conditions that cause these calls to MD_BUG are not kernel bugs, just
oddities in what userspace is asking for.

Also convert analyze_sbs to return void, and the value it returned was
always 0.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>

That's it. In all its glory. Compressed and tagged it's 401 bytes. 

The tree it references is 677 bytes in size. That in turn references a 
number of subtrees, but almost all of the sub-trees are shared with 
_other_ tree commits, so their size is spread out over all the commits.

The full archive of the 2.6.12-rc2 kernel that I used for testing (only
_one_ version) is 102MB in size. That's about half of what the kernel is
uncompressed.

The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one
and a test-run of 198 patches from Andrew) is 111MB. In other words,
adding 198 "full" new kernels only grew the archive by 9MB (that's all
"actual disk usage" btw - the files themselves are smaller, but since they
all end up taking up a full disk block..)

Basically, the whole point of git is that objects are equated with their 
sha1 name, and that you can thus "include" an object by just referring to 
its name. The two are equivalent. 
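
To make the "by reference" point concrete, here is a sketch that splits a
commit like the one shown above into its references and its message. The
parser name parse_commit is hypothetical, the sample addresses are
placeholders, and the layout is only what the cat-file output above suggests
(header lines, a blank line, then the comment).

```python
def parse_commit(text: str):
    """Split a commit into header fields and the free-form message."""
    header, _, message = text.partition("\n\n")
    fields = {"parent": []}             # zero, one or more parents
    for line in header.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parent"].append(value)
        else:
            fields[key] = value
    return fields, message


sample = (
    "tree cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6\n"
    "parent c7a1a189dd0fe2c6ecd0aa33f2bd2f414c7892a0\n"
    "author NeilBrown <author@example.invalid> Tue Apr 12 08:27:08 2005\n"
    "committer Linus Torvalds <committer@example.invalid> Tue Apr 12 08:27:08 2005\n"
    "\n"
    "[PATCH] md: remove a number of misleading calls to MD_BUG\n"
)
fields, message = parse_commit(sample)
```

The only link from the commit to the source files is the 40-character tree
name in fields["tree"]; everything else is metadata and the comment.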

Linus

Re: Re: more git updates..

2005-04-12 Thread David Eger


The reason I am questioning this point is the GIT README file.

Linus makes explicit that a "blob" is just the "file contents," and that
really, a "blob" is not just the SHA1 of the "blob":

> In particular, the "current directory cache" certainly does not need to
> be consistent with the current directory contents, but it has two very
> important attributes:
> 
> (a) it can re-generate the full state it caches (not just the directory
> structure: through the "blob" object it can regenerate the data too)

And he defines "TREE" with the same name: blob

> TREE: The next hierarchical object type is the "tree" object.  A tree
> object is a list of permission/name/blob data, sorted by name.

Therefore, "TREE" must be the *full* data, and since we have the following
definition for CHANGESET:

> A "changeset" is defined by the tree-object that it results in, the
> parent changesets (zero, one or more) that led up to that point, and a
> comment on what happened.

That each changeset remembers *everything* for *each point in the tree*.

Linus, if you actually mean to differentiate between the full data
and a SHA1 of the data, *please please please* say "blob" in one place
and "SHA1 of the blob" elsewhere.  It's quite confusing, to me at least.

Also, the details of just what data constitutes a 'changeset' would be
lovely... i.e. a precise spec of what Pat is describing below...

-dte 

> where David Eger <[EMAIL PROTECTED]> told me that...
> > So with git, *every* changeset is an entire (compressed) copy of the
> > kernel.  Really?  Every patch you accept adds 37 MB to your hard disk?
> > 
> > Am I missing something here?
> 
> Yes. Only changed files re-appear. The unchanged files keep the same
> SHA1 hash, therefore they don't re-appear in the repository.
> 
> So, if Linus gets a patch which sanitizes drivers/char/selection.c,
> only these new objects appear in the repository:
> 
>   drivers/char/selection.c
>   drivers/char
>   drivers
>   . (project root)
>   commit message
> 

Re: more git updates..

2005-04-12 Thread Helge Hafting

On Sun, Apr 10, 2005 at 09:01:22AM -0700, Linus Torvalds wrote:
> 
> So I was for a while debating having a totally flat directory space, but 
> since there are _some_ downsides (linear lookup for cold-cache, and just 
> that "ls -l" ends up being O(n**2) and things), I decided that a single 
> fan-out is probably a good idea.
> 
Isn't that fixed even in ext2/ext3 these days?

man mke2fs:
   dir_index
  Use  hashed  b-trees  to  speed  up lookups in large
  directories.

Also, the popular reiserfs was designed with this in mind from the start.

> > Or maybe the files should be named objects/xx/yy/?
> 
> Hey, I may end up being wrong, and yes, maybe I should have done a 
> two-level one. 

Unless there are still performance issues, please don't.  A directory
structure with extra levels is necessarily harder to use if one
ever has to use it manually somehow.
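
For comparison, a sketch of how the flat, single fan-out and two-level
layouts would map an object name to a path. The function object_path is
hypothetical; the objects/xx/yy/ naming follows the suggestion quoted above.

```python
def object_path(sha1_hex: str, levels: int = 1) -> str:
    """Build a path with `levels` two-character fan-out directories."""
    parts = [sha1_hex[2 * i:2 * i + 2] for i in range(levels)]
    parts.append(sha1_hex[2 * levels:])      # remainder is the file name
    return "/".join(["objects", *parts])


name = "cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6"
flat = object_path(name, levels=0)
single = object_path(name, levels=1)
double = object_path(name, levels=2)
```

Every extra level shortens the directory listings but adds one more path
component to type when poking around by hand.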

Helge Hafting 


Re: Re: more git updates..

2005-04-12 Thread Petr Baudis

Dear diary, on Tue, Apr 12, 2005 at 06:05:19AM CEST, I got a letter
where David Eger <[EMAIL PROTECTED]> told me that...
> So with git, *every* changeset is an entire (compressed) copy of the
> kernel.  Really?  Every patch you accept adds 37 MB to your hard disk?
> 
> Am I missing something here?

Yes. Only changed files re-appear. The unchanged files keep the same
SHA1 hash, therefore they don't re-appear in the repository.

So, if Linus gets a patch which sanitizes drivers/char/selection.c,
only these new objects appear in the repository:

drivers/char/selection.c
drivers/char
drivers
. (project root)
commit message
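
A toy model of that behaviour, assuming trees are nested dicts and every
object is keyed by the SHA1 of a simple text encoding. This is not git's
real object format, the file names other than selection.c are invented, and
the commit object is left out, so four new objects are expected rather than
five.

```python
import copy
import hashlib


def store_tree(tree: dict, objects: dict) -> str:
    """Hash a nested {name: bytes | dict} tree bottom-up; return the root id."""
    entries = []
    for name in sorted(tree):
        node = tree[name]
        if isinstance(node, dict):
            entries.append(f"tree {store_tree(node, objects)} {name}")
        else:
            blob_id = hashlib.sha1(node).hexdigest()
            objects[blob_id] = node
            entries.append(f"blob {blob_id} {name}")
    body = "\n".join(entries).encode()
    tree_id = hashlib.sha1(body).hexdigest()
    objects[tree_id] = body
    return tree_id


v1 = {
    "Makefile": b"all:\n",
    "drivers": {
        "char": {"selection.c": b"old\n", "tty.c": b"tty\n"},
        "net": {"eth.c": b"eth\n"},
    },
}
objects = {}
store_tree(v1, objects)
before = set(objects)

v2 = copy.deepcopy(v1)
v2["drivers"]["char"]["selection.c"] = b"sanitized\n"
store_tree(v2, objects)
new_objects = set(objects) - before   # blob, drivers/char, drivers, root
```

Makefile, tty.c and the drivers/net tree hash to the same names as before,
so they add nothing to the store.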

Kind regards,

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Re: more git updates..

2005-04-12 Thread Barry K. Nathan

On Mon, Apr 11, 2005 at 10:14:13PM -0700, David Lang wrote:
> I've been reading this and have another thought for you guys to keep in 
> mind for this tool.
> 
> version control of system config files on linux systems.

I've been thinking about this too. (I won't have time to implement this
however. If I do have time in the near future to do anything involving
git, it probably won't have anything to do with version control of
config files.)

> it sounds like you could put the / filesystem under the control of git 
> (after teaching it to not cross filesystem boundaries so you can have 
> another filesystem to work with) and version control your entire system. 
> if this was done it would be nice to add an item type that would reference 
> a file in a distro package to save space. it sounds like you could run a 
> git checkin daily (as part of the updatedb run for example) at very little 
> cost.

I was thinking that the GIT checkin should actually be done by the
distro configuration tools, and not as a cronjob. And maybe the config
tools could do two checkins if there were any manual changes since the
last checkin, or something. (That is, one checkin to check in the manual
changes since the last checkin, and another to check in whatever the
config tool just did.)

Now that I think about it, it would be really good to have a simple tool
for doing a manual checkin after manual editing of config files, but I
think something like the dual-checkin scheme would be needed as a safety
net in case root forgets to do the checkin.

-Barry K. Nathan <[EMAIL PROTECTED]>


Re: more git updates..

2005-04-11 Thread David Eger

So with git, *every* changeset is an entire (compressed) copy of the
kernel.  Really?  Every patch you accept adds 37 MB to your hard disk?

Am I missing something here?

-dte

Re: more git updates..

2005-04-11 Thread Paul Jackson

David wrote:
> and version control your entire system

Yeah - that works.  That's how I back up my system.  Not
git actually, but a similar sort of store (no compression,
a line oriented ascii 'index' file).

See my post on "Kernel SCM saga..", Sat, 9 Apr 2005 08:15:53 -0700,
Message-Id: <[EMAIL PROTECTED]>

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401

Re: more git updates..

2005-04-11 Thread David Lang

I've been reading this and have another thought for you guys to keep in 
mind for this tool.

version control of system config files on linux systems.
it sounds like you could put the / filesystem under the control of git 
(after teaching it to not cross filesystem boundaries so you can have 
another filesystem to work with) and version control your entire system. 
if this was done it would be nice to add an item type that would reference 
a file in a distro package to save space. it sounds like you could run a 
git checkin daily (as part of the updatedb run for example) at very little 
cost.

for that matter by comparing the git data between servers (or between a 
server and an archive) you could easily use it to detect tampering.
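
A minimal sketch of that comparison, assuming each side can be reduced to a
map from path to content hash. The paths, the file contents, and the helper
names `manifest` and `differences` are made up for illustration; git itself
would compare tree objects rather than a flat map.

```python
import hashlib


def manifest(files):
    """Map each path to the SHA-1 hex digest of its contents."""
    return {path: hashlib.sha1(data).hexdigest() for path, data in files.items()}


def differences(reference, suspect):
    """Paths that are missing on one side or whose hashes differ."""
    paths = set(reference) | set(suspect)
    return sorted(p for p in paths if reference.get(p) != suspect.get(p))


# Hypothetical snapshots of the same config files on an archive and a server.
archive = manifest({"/etc/passwd": b"root:x:0:0\n", "/etc/hosts": b"127.0.0.1 localhost\n"})
server = manifest({"/etc/passwd": b"root:x:0:0\nevil:x:0:0\n", "/etc/hosts": b"127.0.0.1 localhost\n"})

print(differences(archive, server))  # ['/etc/passwd']
```

Only the hashes have to travel between machines, so the check stays cheap
even for a whole root filesystem.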

sounds very interesting, but I'm going to let things settle down a bit 
before I try to tackle this (but you guys who are working on it should feel 
free to add the couple of options necessary to implement this if you want ;-)

David Lang
Re: Re: more git updates..

2005-04-11 Thread Petr Baudis

Dear diary, on Mon, Apr 11, 2005 at 05:49:31PM CEST, I got a letter
where "Randy.Dunlap" <[EMAIL PROTECTED]> told me that...
> On Sun, 10 Apr 2005 16:38:00 -0700 (PDT) Linus Torvalds wrote:
..snip..
> | Yes. Crappy old tree, but it can still read my git.git directory, so you 
> | can use it to update to my current source base.
> 
> Please go into a little more detail about how to do this step...
> that seems to be the most basic concept that I am missing.
> i.e., how to find the "latest/current" tree (version/commit)
> and check it out (read-tree, checkout-cache, etc.).

Well, its ID is by convention kept in .dircache/HEAD. But that is really
only a convention, no "core git" tool reads it directly, and you need to
update it manually after you do commit-tree.

First, you need to get the accompanying tree's id. git-pasky's shortcut
is $(tree-id), but manually you can do it by

cat-file commit $(cat .dircache/HEAD) | egrep '^tree'

Note that if you ever forgot to update HEAD or if you have multiple
branches in your repository, you can list all "head commits" (that is,
commits which have no other commits referencing them as parents) by
doing fsck-cache.
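
The definition of a "head commit" given here can be sketched in a few lines.
This is only an illustration of the idea, not how fsck-cache is implemented;
the `history` map and the function name are hypothetical.

```python
def head_commits(parents):
    """Return the commits that no other commit lists as a parent.

    `parents` maps a commit id to the list of its parent ids.
    """
    referenced = {p for ps in parents.values() for p in ps}
    return sorted(c for c in parents if c not in referenced)


# Hypothetical history: two branches ("c" and "d") forked from commit "b".
history = {
    "a": [],
    "b": ["a"],
    "c": ["b"],
    "d": ["b"],
}

print(head_commits(history))  # ['c', 'd']
```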

Now, you need to populate the directory cache by the tree (see Paul
Jackson's diagram):

read-tree $tree_id

And now you want to update your working tree from the cache:

checkout-cache -a -f

This will bring your tree in sync with the cache (it won't remove any
stale files, though). That means it will overwrite your local changes
too - turn that off by omitting the "-f". If you want to update only
some files, omit the "-a" and list them.

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Re: more git updates..

2005-04-11 Thread ross

On Sat, Apr 09, 2005 at 12:45:52PM -0700, Linus Torvalds wrote:
> Can you guys re-send the scripts you wrote? They probably need some 
> updating for the new semantics. Sorry about that ;(

I've been off email this weekend, so have fallen a bit behind here.
I'll forgo updating my stuff, since it looks like there's superior
work.  Looks cool!

I must say, the git as a filesystem thing is really neat.  This has
been one of the more fun projects I've toyed around with.

-- 
Ross Vandegrift
[EMAIL PROTECTED]

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37

Re: more git updates..

2005-04-11 Thread Randy.Dunlap

On Sun, 10 Apr 2005 16:38:00 -0700 (PDT) Linus Torvalds wrote:

| 
| 
| On Sun, 10 Apr 2005, Paul Jackson wrote:
| >
| > Useful explanation - thanks, Linus.
| 
| Hey. You're welcome. Especially when you create good documentation for 
| this thing.
| 
| Because:
| 
| > Is this picture and description accurate:
| 
| [ deleted, but I'll probably try to put it in an explanation file 
|   somewhere ]
| 
| Yes. Excellent.
| 
| > Minor question:
| > 
| >   I must have an old version - I got 'git-0.03', but
| >   it doesn't have 'checkout-cache', and its 'read-tree'
| >   directly writes my working files.
| 
| Yes. Crappy old tree, but it can still read my git.git directory, so you 
| can use it to update to my current source base.

Please go into a little more detail about how to do this step...
that seems to be the most basic concept that I am missing.
i.e., how to find the "latest/current" tree (version/commit)
and check it out (read-tree, checkout-cache, etc.).

Even if I use Pasky's tools, I'd like to understand this step.

| However, from a usability angle, my source-base really has been 
| concentrating _entirely_ on just the plumbing, and if you actually want a 
| faucet or a toilet _connected_ to the plumbing, you're better off with 
| Pasky's tree, methinks:
| 
| >   How do I get a current version?  Well, one way I see,
| >   and that's to pick up Pasky's:
| > 
| > http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2
| >  
| >   Perhaps that's the best way?
| 
| Indeed. He's got a number of shell scripts etc to automate the boring 
| parts.


---
~Randy

Re: more git updates..

2005-04-11 Thread H. Peter Anvin

Followup to:  <[EMAIL PROTECTED]>
By author:    Christopher Li <[EMAIL PROTECTED]>
In newsgroup: linux.dev.kernel
> 
> There is one problem though. How about the SHA1 hash collision?
> Even if the chance is very remote, you don't want to lose some data due
> to a "software" error. I think it is OK not to handle that
> case right now. On the other hand, it will be nice to detect that
> and give out a big error message if it really happens.
> 

If you're actually worried about it, it'd be better to just use a
different hash, like one of the SHA-2's (probably a better choice
anyway), instead of SHA-1.

-hpa


Re: more git updates..

2005-04-11 Thread Anton Altaparmakov

On Mon, 2005-04-11 at 01:04 +0200, Bernd Eckenfels wrote:
> In article <[EMAIL PROTECTED]> you wrote:
> > (I repeat the xxx in the leaf name - easier to code.)
> 
> It is a bit OT, but just a note: there are file systems (hash functions) out
> there that don't like a lot of files named the same way. For example NTFS with
> the 8.3 short names.

Since you mention NTFS, there is no need to worry about that for Linux.
Certainly the Linux kernel NTFS driver is never going to create 8.3
short names.  (It doesn't create names at all at the moment but my grand
plan is that it will only ever create file names in the Win32 and/or
POSIX name spaces.  The DOS name space is a thing of the past IMO.)

Best regards,

Anton
-- 
Anton Altaparmakov  (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


Re: more git updates..

2005-04-11 Thread Christer Weinigel

bert hubert <[EMAIL PROTECTED]> writes:

> On Sun, Apr 10, 2005 at 03:38:39PM -0700, Linus Torvalds wrote:
> 
> > compressed with zlib, they are all named by the sha1 file, and they all 
> 
> Now I know this is a conscious decision, but recent zlib allows you to write
> out gzip content, at a cost of 14 bytes I think per file, by adding 32 to
> the window size. This in turn would allow users to zcat your objects at
> ease.
> 
> You get confirmation of completeness of the file for free, as gzip encodes
> the length of the file at the end.

I would very much like it if git used normal gzip files with a .gz
extension.  Doing it this way means that the compression methods can
be extended in the future.  I.e:

ab/1234567890.gz    gzip compressed
ab/1234567890.xd    xdelta compressed

I find the xdelta encoding very attractive since it can probably
reduce the size of the repository drastically.  A compression script
could run nightly and xdelta compress everything that's older than
a few months (to figure out what files to create the delta from, just
look at the commit files and compare the parent tree to the current
tree).

Of course, this means that a dumb wget won't work all that well to
synchronize two trees, but it might be worthwhile anyway.

  /Christer

-- 
"Just how much can I get away with and still go to heaven?"

Freelance consultant specializing in device driver programming for Linux 
Christer Weinigel <[EMAIL PROTECTED]>  http://www.weinigel.se

Re: more git updates..

2005-04-10 Thread bert hubert

On Sun, Apr 10, 2005 at 03:38:39PM -0700, Linus Torvalds wrote:

> compressed with zlib, they are all named by the sha1 file, and they all 

Now I know this is a conscious decision, but recent zlib allows you to write
out gzip content, at a cost of 14 bytes I think per file, by adding 32 to
the window size. This in turn would allow users to zcat your objects at
ease.

You get confirmation of completeness of the file for free, as gzip encodes
the length of the file at the end.

Perhaps something to consider.
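
For reference, a small sketch of the two wrappers using Python's zlib
binding. It rests on the current zlib manual, where the gzip wrapper is
requested by adding 16 to windowBits when compressing, and adding 32 applies
to decompression, where it enables automatic zlib/gzip detection. Which of
the two the "adding 32" above refers to is an assumption here, and the
payload is made up.

```python
import gzip
import zlib

payload = b"blob contents\n" * 100

# zlib wrapper: the default framing, 2-byte header plus Adler-32 trailer.
zlib_stream = zlib.compress(payload)

# gzip wrapper: same deflate data, requested with windowBits + 16.
compressor = zlib.compressobj(wbits=zlib.MAX_WBITS + 16)
gzip_stream = compressor.compress(payload) + compressor.flush()


def inflate_any(data):
    """Decompress either framing; windowBits + 32 auto-detects the wrapper."""
    return zlib.decompress(data, wbits=zlib.MAX_WBITS + 32)


# The gzip stream is readable by ordinary gzip tooling (zcat, gzip.decompress).
print(gzip.decompress(gzip_stream) == payload)
print("extra bytes for the gzip wrapper:", len(gzip_stream) - len(zlib_stream))
```

A reader built with the auto-detecting setting can accept both old
zlib-framed objects and new gzip-framed ones, which would soften any
migration.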

-- 
http://www.PowerDNS.com  Open source, database driven DNS Software 
http://netherlabs.nl  Open and Closed source services

Re: more git updates..

2005-04-10 Thread Christopher Li

I see. It just needs some basic set operations (+, -, and)
and some way to select a set:


  sha5--->
 / 
/ 
sha1-->sha2-->sha3--
   \/
\  /
 >sha4


list sha1           # the file list of changeset sha1
                    # {sha1}
list sha1,sha1      # same as above
list sha1,sha2      # all the file lists between changeset sha1
                    # and changeset sha2
                    # {sha1, sha2} in the example
list sha1,sha3      # {sha1, sha2, sha3, sha4}

list sha1,any       # all the changesets reachable from sha1
                    # {sha1, ... sha5, ...}

new      sha1,sha2  # all the files added between sha1 and sha2    (+)
changed  sha1,sha2  # all the files changed between sha1 and sha2  (>) (<)
deleted  sha1,sha2  # all the files deleted between sha1 and sha2  (-)

before   time       # all the files before time
after    time       # all the files after time


So in my example, the files I want to delete are:

{list hack1, base} + {list hack2, base} ... {list hack6, base} \
- {list official_merge, base}
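
A rough sketch of how such range sets could be computed and combined. The
graph shape is an assumption read off the example results above (sha4 as a
side branch merged into sha3, sha5 as a later branch), and `reachable` and
`between` are hypothetical helper names, not git commands.

```python
def reachable(start, edges):
    """All nodes reachable from `start` (inclusive) by following `edges`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return seen


def between(first, last, children):
    """Changesets that descend from `first` and are ancestors of `last`."""
    parents = {}
    for node, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(node)
    return reachable(first, children) & reachable(last, parents)


# Assumed example graph, stored as parent -> children.
children = {
    "sha1": ["sha2"],
    "sha2": ["sha3", "sha4", "sha5"],
    "sha4": ["sha3"],
}

print(sorted(between("sha1", "sha2", children)))  # ['sha1', 'sha2']
print(sorted(between("sha1", "sha3", children)))  # ['sha1', 'sha2', 'sha3', 'sha4']
print(sorted(reachable("sha1", children)))        # the "list sha1,any" case

# Union and difference then work like the "+" and "-" in the expression above.
leftover = (between("sha1", "sha5", children) | between("sha1", "sha4", children)) \
    - between("sha1", "sha3", children)
print(sorted(leftover))  # ['sha5']
```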



On Sun, Apr 10, 2005 at 04:21:08PM -0700, Linus Torvalds wrote:
> 
> 
> > the official tree. It is more for my local version control.
> 
> I have a plan. Namely to have a "list-needed" command, which you give one
> commit, and a flag implying how much "history" you want (*), and then it
> spits out all the sha1 files it needs for that history.
> 
> Then you delete all the other ones from your SHA1 archive (easy enough to
> do efficiently by just sorting the two lists: the list of "needed" files
> and the list of "available" files).
> 
> Script that, and call the command "prune-tree" or something like that, and 
> you're all done.
> 
> (*) The amount of history you want might be "none", which is to say that 
> you don't want to go back in time, so you want _just_ the list of tree and 
> blob objects associated with that commit.

That will be {list head}

> 
> Or you might want a "linear"  history, which would be the longest path
> through the parent changesets to the root.

That will be {list head,root}

> 
> Or you might want "all", which would follow all parents and all trees.

That will be {list any, root}

> 
> Or you might want to prune the history tree by date - "give me all
> history, but cut it off when you hit a parent that was done more than 6
> months ago".

That is {after -6month }

> 
> This "list-needed" thing is not just for pruning history either. If you
> have a local tree "x", and you want to figure out how much of it you need
> to send to somebody else who has an older tree "y", then what you'd do is
> basically "list-needed x" and remove the set of "list-needed y". That
> gives you the answer to the question "what's the minimum set of sha1 files
> I need to send to the other guy so that he can re-create my top-of-tree".
>

That is {list x, any} - {list y, any}


> My second plan is to make somebody else so fired up about the problem that 
> I can just sit back and take patches. That's really what I'm best at. 
> Sitting here, in the (rain) on the patio, drinking a foofy tropical drink, 
> and pressing the "apply" button. Then I take all the credit for my 
> incredible work. 

Sounds like a good plan.

Chris


Re: more git updates..

2005-04-10 Thread Paul Jackson

Linus writes:
> Hey. You're welcome. Especially when you create good documentation for 
> this thing.

Glad to be of service.  Sounds like the umbrella in your foofy
drink will come in handy - keeping off the rain.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401

Re: Re: more git updates..

2005-04-10 Thread Petr Baudis

Dear diary, on Mon, Apr 11, 2005 at 01:14:57AM CEST, I got a letter
where Paul Jackson <[EMAIL PROTECTED]> told me that...
> Useful explanation - thanks, Linus.
> 
> Is this picture and description accurate:
> 
> ==
> 
> 
>          < working directory files (foo.c) >
>                           ^
>       ^                   |
>       |  upward ops       |  downward ops    |
>       |  ----------       |  ------------    |
>       | checkout-cache    |  update-cache    |
>       | show-diff         |                  v
>                           v
>      < current directory cache (".dircache/index") >
>                           ^
>       ^                   |
>       |  upward ops       |  downward ops    |
>       |  ----------       |  ------------    |
>       |   read-tree       |   write-tree     |
>       |                   |   commit-tree    |
>                           |                  v
>                           v
> < git filesystem (blobs, trees, commits: .dircache/{HEAD,objects}) >

Well, except that from a purely technical standpoint commit-tree has
no place in this picture - it creates a new object in the git
filesystem based on its input data, regardless of the directory
cache or the current tree. It probably still belongs where it is from the
workflow standpoint, though.

..snip..
> Minor question:
> 
>   I must have an old version - I got 'git-0.03', but
>   it doesn't have 'checkout-cache', and its 'read-tree'
>   directly writes my working files.
>  
>   How do I get a current version?  Well, one way I see,
>   and that's to pick up Pasky's:
> 
> http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2
>  
>   Perhaps that's the best way?

You can take mine, and do:

git pull pasky
git pull linus
cp .dircache/HEAD .dircache/HEAD.local

Now, your tree and git filesystem are up to date.

git track local

Now, when you do git pull pasky, your working tree will not be updated
automatically anymore.

git track linus

Now, you start tracking Linus' tree instead. Note that the initial
update will blow away the scripts in your current tree, so before you do
the last two steps you will probably want to clone the tree and set PATH
to the one still tracking me, so you get all the comfort. ;-)

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Re: more git updates..

2005-04-10 Thread Linus Torvalds



On Sun, 10 Apr 2005, Paul Jackson wrote:
>
> Useful explanation - thanks, Linus.

Hey. You're welcome. Especially when you create good documentation for 
this thing.

Because:

> Is this picture and description accurate:

[ deleted, but I'll probably try to put it in an explanation file 
  somewhere ]

Yes. Excellent.

> Minor question:
> 
>   I must have an old version - I got 'git-0.03', but
>   it doesn't have 'checkout-cache', and its 'read-tree'
>   directly writes my working files.

Yes. Crappy old tree, but it can still read my git.git directory, so you 
can use it to update to my current source base.

However, from a usability angle, my source-base really has been 
concentrating _entirely_ on just the plumbing, and if you actually want a 
faucet or a toilet _connected_ to the plumbing, you're better off with 
Pasky's tree, methinks:

>   How do I get a current version?  Well, one way I see,
>   and that's to pick up Pasky's:
> 
> http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2
>  
>   Perhaps that's the best way?

Indeed. He's got a number of shell scripts etc to automate the boring 
parts.

Linus

Re: more git updates..

2005-04-10 Thread Linus Torvalds

On Sun, 10 Apr 2005, Christopher Li wrote:
> 
> How about deleting trees from the caches? I don't need to delete stuff from
> the official tree. It is more for my local version control.

I have a plan. Namely to have a "list-needed" command, which you give one
commit, and a flag implying how much "history" you want (*), and then it
spits out all the sha1 files it needs for that history.

Then you delete all the other ones from your SHA1 archive (easy enough to
do efficiently by just sorting the two lists: the list of "needed" files
and the list of "available" files).

Script that, and call the command "prune-tree" or something like that, and 
you're all done.

(*) The amount of history you want might be "none", which is to say that 
you don't want to go back in time, so you want _just_ the list of tree and 
blob objects associated with that commit.

Or you might want a "linear"  history, which would be the longest path
through the parent changesets to the root.

Or you might want "all", which would follow all parents and all trees.

Or you might want to prune the history tree by date - "give me all
history, but cut it off when you hit a parent that was done more than 6
months ago".

This "list-needed" thing is not just for pruning history either. If you
have a local tree "x", and you want to figure out how much of it you need
to send to somebody else who has an older tree "y", then what you'd do is
basically "list-needed x" and remove the set of "list-needed y". That
gives you the answer to the question "what's the minimum set of sha1 files
I need to send to the other guy so that he can re-create my top-of-tree".
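
A sketch of the idea under simplifying assumptions: objects are plain ids in
two dictionaries, and only the "none" and "all" history modes are modelled
(the "linear" and date-limited modes are left out). The function names, the
data layout, and the ids are hypothetical, not an existing git interface.

```python
def tree_objects(tree_id, trees):
    """The tree itself plus every blob and subtree below it."""
    found, stack = set(), [tree_id]
    while stack:
        oid = stack.pop()
        if oid not in found:
            found.add(oid)
            stack.extend(trees.get(oid, ()))  # blobs have no entry in `trees`
    return found


def list_needed(commit, commits, trees, history="all"):
    """Ids of all objects needed for `commit` with the requested history."""
    needed, stack = set(), [commit]
    while stack:
        cid = stack.pop()
        if cid in needed:
            continue
        needed.add(cid)
        needed |= tree_objects(commits[cid]["tree"], trees)
        if history == "all":
            stack.extend(commits[cid]["parents"])
    return needed


# Hypothetical archive: two commits sharing blob b1, plus one stray object.
commits = {
    "c1": {"tree": "t1", "parents": []},
    "c2": {"tree": "t2", "parents": ["c1"]},
}
trees = {"t1": ["b1", "b2"], "t2": ["b1", "b3"]}
available = {"c1", "c2", "t1", "t2", "b1", "b2", "b3", "stray"}

# "prune-tree": everything available that the chosen history does not need.
prunable = available - list_needed("c2", commits, trees, history="none")
print(sorted(prunable))  # ['b2', 'c1', 'stray', 't1']

# Minimum set to send to somebody who already has the older tree c1.
to_send = list_needed("c2", commits, trees) - list_needed("c1", commits, trees)
print(sorted(to_send))  # ['b3', 'c2', 't2']
```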

My second plan is to make somebody else so fired up about the problem that 
I can just sit back and take patches. That's really what I'm best at. 
Sitting here, in the (rain) on the patio, drinking a foofy tropical drink, 
and pressing the "apply" button. Then I take all the credit for my 
incredible work. 

Hint, hint.

Linus

Re: more git updates..

2005-04-10 Thread Paul Jackson

Useful explanation - thanks, Linus.

Is this picture and description accurate:

==


         < working directory files (foo.c) >
                          ^
      ^                   |
      |  upward ops       |  downward ops    |
      |  ----------       |  ------------    |
      | checkout-cache    |  update-cache    |
      | show-diff         |                  v
                          v
     < current directory cache (".dircache/index") >
                          ^
      ^                   |
      |  upward ops       |  downward ops    |
      |  ----------       |  ------------    |
      |   read-tree       |   write-tree     |
      |                   |   commit-tree    |
                          |                  v
                          v
< git filesystem (blobs, trees, commits: .dircache/{HEAD,objects}) >


==


The checkout-cache and show-diff ops read their meta-data from
the cache, and the actual file contents from the git filesystem.
Similarly, the update-cache op writes meta-data into the cache,
and may create new files in the git filesystem.

The cache (but not the git filesystem) stores transient
information (ctime, mtime, dev, ino, uid, gid, and size)
about each working file update-cache has copied into the git
filesystem so that checkout-cache and show-diff can detect
changes in the contents of working files just from a stat,
without actually rereading the file.
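
The stat-based check might look roughly like this. It is a sketch of the
idea only: the field list follows the paragraph above, while `CacheEntry`,
`snapshot` and `looks_modified` are hypothetical names and not the layout of
the real index file.

```python
import os
from collections import namedtuple

# One cache entry per working file, holding only what stat() reports.
CacheEntry = namedtuple("CacheEntry", "ctime mtime dev ino uid gid size")


def snapshot(path):
    """Record the stat fields for `path` without reading its contents."""
    st = os.stat(path)
    return CacheEntry(st.st_ctime_ns, st.st_mtime_ns, st.st_dev, st.st_ino,
                      st.st_uid, st.st_gid, st.st_size)


def looks_modified(path, entry):
    """True if any recorded stat field differs from the file's current state."""
    return snapshot(path) != entry
```

A file whose contents were rewritten in place with the same size and
timestamps would slip past this test, which is why it is a fast filter and
not a proof that the contents are unchanged.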

In some sense, the cache holds the git filesystem inodes,
and the git filesystem holds the data blocks.  Except that:
  (1) the cache just holds the current "view" into the git
  filesystem,
  (2) objects in the filesystem have an "inode" number (their
      SHA1 value) that is persistent whether in view or not,
  (3) objects in the filesystem are not removed just because
  nothing in the cache references them,
  (4) objects in the filesystem can reference other objects,
  that are typically also in the filesystem, but that can
  still be reliably self-identified even if found in the
  wild of say one's email inbox, and
  (5) the view in the directory cache can itself be made into
  a filesystem object - using commit-tree.


==

Minor question:

  I must have an old version - I got 'git-0.03', but
  it doesn't have 'checkout-cache', and its 'read-tree'
  directly writes my working files.
 
  How do I get a current version?  Well, one way I see,
  and that's to pick up Pasky's:

http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2
 
  Perhaps that's the best way?

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401

Re: more git updates..

2005-04-10 Thread Bernd Eckenfels

In article <[EMAIL PROTECTED]> you wrote:
> (I repeat the xxx in the leaf name - easier to code.)

It is a bit OT, but just a note: there are file systems (hash functions) out
there that don't like a lot of files named the same way. For example NTFS with
the 8.3 short names.

Greetings
Bernd

Re: more git updates..

2005-04-10 Thread Christopher Li

On Sun, Apr 10, 2005 at 03:38:39PM -0700, Linus Torvalds wrote:
> 
> 
> On Sun, 10 Apr 2005, Christopher Li wrote:
> > 
> > BTW, one thing I learn from ext3 is that it is very useful to have some
> > compatible flag for future development. I think if we want to reserve some
> > room in the file format for further development of git
> 
> Way ahead of you.
> 
> This is (one reason) why all git objects have the type embedded inside of 
> them. The format of all objects is totally regular: they are all 
> compressed with zlib, they are all named by the sha1 file, and they all 
> start out with a magic header of " ".
> 
> So if I want to create a new kind of tree object that does the same thing 
> as the old one but has some other layout, I'd just call it something else. 
> Like "dir". That was what I initially planned to do about the change to 
> recursive tree objects, but it turned out to actually be a lot easier to 
> just encode it in the old type (that way the routines that read it don't 
> even have to care about old/new types - it's all the same to them).

Ha, that is right. Putting the new layout into the same object type tricked me
into thinking I had to do it the same way. I totally forgot that I can introduce
new types of objects. That is even cleaner. Cool.

How about deleting trees from the caches? I don't need to delete stuff from
the official tree. It is more for my local version control.
Here is the use case:
- I check out git.git.
- I use quilt to build a series of patches, git-hack1, git-hack2 .. git-hack6.
  Let's say those are stored in the git cache as well.
- I pick some of them and come up with a clean one, "submit.patch".
- submit.patch gets merged into the official git tree.
- Now I want to get rid of hack1 to hack6, but how?

One way to do it is to never commit hack1 to hack6 into git or the cache, so
they stay as quilt patches only. But it is very tempting to let quilt use git
instead of the .pc/ directory; quilt could then be reduced to one use case of
patch and git.

Chris


Re: more git updates..

2005-04-10 Thread Linus Torvalds

On Sun, 10 Apr 2005, Christopher Li wrote:
> 
> BTW, one thing I learn from ext3 is that it is very useful to have some
> compatible flag for future development. I think if we want to reserve some
> room in the file format for further development of git

Way ahead of you.

This is (one reason) why all git objects have the type embedded inside of 
them. The format of all objects is totally regular: they are all 
compressed with zlib, they are all named by the sha1 file, and they all 
start out with a magic header of " ".

So if I want to create a new kind of tree object that does the same thing 
as the old one but has some other layout, I'd just call it something else. 
Like "dir". That was what I initially planned to do about the change to 
recursive tree objects, but it turned out to actually be a lot easier to 
just encode it in the old type (that way the routines that read it don't 
even have to care about old/new types - it's all the same to them).
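
To make the "type embedded inside" point concrete, here is a sketch of such
an object container. The header template did not survive in this archive
copy, so the layout used here (type name, space, decimal size, NUL byte) and
the choice to hash the uncompressed bytes are assumptions borrowed from
later git versions, not something this message states.

```python
import hashlib
import zlib


def encode_object(obj_type, payload):
    """Return (name, stored_bytes) for a typed, zlib-compressed object."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\0"  # assumed layout
    raw = header + payload
    name = hashlib.sha1(raw).hexdigest()  # assumption: hash taken before compression
    return name, zlib.compress(raw)


def decode_object(stored):
    """Return (obj_type, payload), checking the size recorded in the header."""
    raw = zlib.decompress(stored)
    header, _, payload = raw.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(payload):
        raise ValueError("size in header does not match payload")
    return obj_type, payload


name, stored = encode_object("blob", b"hello\n")
print(name, decode_object(stored))

# A new object kind such as "dir" needs no format change, only a new type name.
dir_name, _ = encode_object("dir", b"hello\n")
print(dir_name != name)  # True: the type is part of what gets hashed
```

Readers that do not know a type can reject it by name instead of misparsing
the payload, which is the compatibility property being discussed.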

Linus

Re: Re: more git updates..

2005-04-10 Thread Petr Baudis

Dear diary, on Sun, Apr 10, 2005 at 08:42:53PM CEST, I got a letter
where Christopher Li <[EMAIL PROTECTED]> told me that...
> I totally agree that the odds are really, really small.
> That is why it is not worth handling the case. People who hit it
> can just add a newline or something to avoid it, if
> it happens at all.
> 
> It is a little peace of mind to know for sure that it did
> not happen. I am just paranoid. 

BTW, I've merged the check to git-pasky some time ago, you can disable
it in the Makefile. It is by default on now, until someone convinces me
it actually affects performance measurably.

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Re: more git updates..

2005-04-10 Thread Christopher Li

On Sun, Apr 10, 2005 at 01:57:33PM -0700, Linus Torvalds wrote:
> 
> > That way of thinking really doesn't work well here.
> > 
> > I will have to look more closely at pasky's GIT toolkit
> > if I want to see an SCM style interface.
> 
> Yes. You really should think of GIT as a filesystem, and of me as a 
> _systems_ person, not an SCM person. In fact, I tend to detest SCM's. I 
> think the reason I worked so well with BitKeeper is that Larry used to do 
> operating systems. He's also a systems person, not really an SCM person. 
> Or at least he's in between the two.
> 

Yes, I was puzzled for a while about how to use git until I realized that it is
a versioned file system.

BTW, one thing I learned from ext3 is that it is very useful to have some
compatibility flags for future development. I think if we want to reserve some
room in the file format for further development of git, now is the right time
to do it, before it gets big: e.g. an optional variable-size header in "tree"
including format version, capabilities etc. I can see the counterargument
that it is not as important as for a real file system, because it is a lot
easier to bring it offline and upgrade the whole tree.

On the other hand, it costs almost nothing in terms of space and CPU time:
most directories do not reach a file system block boundary, so a few extra
bytes are almost free. If carefully planned, it will make future upgrades of
git a lot smoother.

What do you think?

Chris


Re: RE: more git updates..

2005-04-10 Thread Petr Baudis

Dear diary, on Mon, Apr 11, 2005 at 12:07:37AM CEST, I got a letter
where "Luck, Tony" <[EMAIL PROTECTED]> told me that...
..snip..
> >Hey, I may end up being wrong, and yes, maybe I should have done a 
> >two-level one. The good news is that we can trivially fix it later (even 
> >dynamically - we can make the "sha1 object tree layout" be a per-tree 
> >config option, and there would be no real issue, so you could make small 
> >projects use a flat version and big projects use a very deep structure 
> >etc). You'd just have to script some renames to move the files around.
> 
> It depends on how many eco-system shell scripts get built that need to
> know about the layout ... if some shell/perl "libraries" encode this
> filename layout (and people use them) ... then switching later would
> indeed be painless.

FWIW, my short-term plans include support for monotone-like hash ID
shortening - it's enough to use the shortest leading unique part of the
ID to identify the revision. I will poke to the object repository for
that. I also already have Randy Dunlap's git lsobj, which will list all
objects of a specified type (very useful especially when looking for
orphaned commits and such rather lowlevel work).
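
A rough sketch of what such ID shortening could look like, assuming the set of object IDs is already in memory as hex strings. The function names are hypothetical and are not taken from git-pasky or monotone.

```python
def resolve_prefix(prefix, object_ids):
    """Expand a leading part of an ID to the one full ID it identifies."""
    matches = [oid for oid in object_ids if oid.startswith(prefix)]
    if not matches:
        raise KeyError(f"no object starts with {prefix!r}")
    if len(matches) > 1:
        raise ValueError(f"{prefix!r} is ambiguous ({len(matches)} matches)")
    return matches[0]


def shortest_unique_prefix(oid, object_ids, minimum=4):
    """Return the shortest leading part of oid that no other ID shares."""
    others = [other for other in object_ids if other != oid]
    for length in range(minimum, len(oid) + 1):
        candidate = oid[:length]
        if not any(other.startswith(candidate) for other in others):
            return candidate
    return oid


ids = [
    "4870bcf91f8666fc788b07578fb7473eda795587",
    "5493a649bb33b9264e8ed26cc1f832989a307d3b",
    "4870b" + "0" * 35,  # made-up ID that shares a five-digit prefix
]
```

A linear scan is enough for a sketch; with the objects/xx layout a real tool would only need to list the one directory that matches the first two digits.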

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

RE: more git updates..

2005-04-10 Thread Luck, Tony

>Also, I did actually debate that issue with myself, and decided that even
>if we do have tons of files per directory, git doesn't much care. The
>reason? Git never _searches_ for them. Assuming you have enough memory to
>cache the tree, you just end up doing a "lookup", and inside the kernel
>that's done using an efficient hash, which doesn't actually care _at_all_
>about how many files there are per directory.

So long as the hash *is* efficient when the directory is packed full of
38 character filenames made only of [0-9a-f] ... which might not match
the test cases under which the hash was picked :-)  When there are some
full-sized kernel git images, someone should do a sanity check.

>Hey, I may end up being wrong, and yes, maybe I should have done a 
>two-level one. The good news is that we can trivially fix it later (even 
>dynamically - we can make the "sha1 object tree layout" be a per-tree 
>config option, and there would be no real issue, so you could make small 
>projects use a flat version and big projects use a very deep structure 
>etc). You'd just have to script some renames to move the files around.

It depends on how many eco-system shell scripts get built that need to
know about the layout ... if some shell/perl "libraries" encode this
filename layout (and people use them) ... then switching later would
indeed be painless.

-Tony

Re: more git updates..

2005-04-10 Thread Christopher Li

I totally agree that the odds are really, really small.
That is why it is not worth handling the case. People who hit it
can just add a newline or something to avoid it, if
it happens at all.

It is a little peace of mind to know for sure that it did
not happen. I am just paranoid. 

Chris

On Sun, Apr 10, 2005 at 12:23:52PM -0700, Paul Jackson wrote:
> > Something like the following patch, maybe with a way to turn it off.
> 
> Take out an old envelope and compute on it the odds of this
> happening.
> 
> Say we have 10,000 kernel hackers, each producing one
> new file every minute, for 100 hours a week.  And we've
> cloned a small army of Andrew Mortons to integrate
> the resulting tsunami of patches.  And Linus is well
> cared for in the state funny farm.
> 
> What is the probability that this check will fire even
> once, between now and 10 billion years from now, when
> the Sun has become a red giant destroying all life on
> planet Earth?
> 
> -- 
>   I won't rest till it's the best ...
>   Programmer, Linux Scalability
>   Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
> 1.925.600.0401

Re: more git updates..

2005-04-10 Thread Linus Torvalds

On Sun, 10 Apr 2005, Paul Jackson wrote:
> 
> Ah ha - that explains the read-tree and write-tree names.
> 
> The read-tree pulls stuff out of this file system into
> your working files, clobbering local edits.  This is like
> the read(2) system call, which clobbers stuff in your
> read buffer.

Yes. Except it's a two-stage thing, where the staging area is always the 
"current directory cache".

So a "read-tree" always reads the tree information into the directory 
cache, but does not actually _update_ any of the files it "caches". To do 
that, you need to do a "checkout-cache" phase.

Similarly, "write-tree" writes the current directory cache contents into a
set of tree files. But in order to have that match what is actually in
your directory right now, you need to have done a "update-cache"  phase
before you did the "write-tree".

So there is always a staging area between the "real contents" and the 
"written tree". 

> That way of thinking really doesn't work well here.
> 
> I will have to look more closely at pasky's GIT toolkit
> if I want to see an SCM style interface.

Yes. You really should think of GIT as a filesystem, and of me as a 
_systems_ person, not an SCM person. In fact, I tend to detest SCM's. I 
think the reason I worked so well with BitKeeper is that Larry used to do 
operating systems. He's also a systems person, not really an SCM person. 
Or at least he's in between the two.

My operations are like the "system calls". Useless on their own: they're
not real applications, they're just how you read and write files in this
really strange filesystem. You need to wrap them up to make them do
anything sane.

For example, take "commit-tree" - it really just says that "this is the 
new tree, and these other trees were its parents". It doesn't do any of 
the actual work to _get_ those trees written.

So to actually do the high-level operation of a real commit, you need to
first update the current directory cache to match what you want to commit
(the "update-cache" phase).

Then, when your directory cache matches what you want to commit (which is
NOT necessarily the same thing as your actual current working area - if
you don't want to commit some of the changes you have in your tree, you
should avoid updating the cache with those changes), you do stage 2, ie
"write-tree". That writes a tree node that describes what you want to
commit.

Only THEN, as phase three, do you do the "commit-tree". Now you give it 
the tree you want to commit (remember - that may not even match your 
current directory contents), and the history of how you got here (ie you 
tell commit what the previous commit(s) were), and the changelog. 

So a "commit" in SCM-speak is actually three totally separate phases in my
filesystem thing, and each of the phases (except for the last
"commit-tree" which is the thing that brings it all together) is actually
in turn many smaller parts (ie "update-cache"  may have been called
hundreds of times, and "write-tree" will write several tree objects that
point to each other).
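
The three phases can be sketched with a toy in-memory model. This is only an illustration under loud assumptions: ToyRepo and its methods are invented names, the object naming is simplified, and the tree is written as one flat listing rather than the several tree objects per directory that write-tree really produces.

```python
import hashlib


class ToyRepo:
    """In-memory stand-in for the object store plus the directory cache."""

    def __init__(self):
        self.objects = {}  # object name -> (type, payload bytes)
        self.cache = {}    # path -> blob name; the staging area

    def _store(self, obj_type, payload):
        name = hashlib.sha1(obj_type.encode() + b"\0" + payload).hexdigest()
        self.objects[name] = (obj_type, payload)
        return name

    def update_cache(self, path, content):
        """Phase 1: stage one file's content in the directory cache."""
        self.cache[path] = self._store("blob", content)

    def write_tree(self):
        """Phase 2: write the cache (not the working directory) as a tree."""
        listing = "".join(
            f"{blob} {path}\n" for path, blob in sorted(self.cache.items())
        )
        return self._store("tree", listing.encode())

    def commit_tree(self, tree, parents, changelog):
        """Phase 3: tie a tree to its parent commits and a changelog."""
        lines = [f"tree {tree}"] + [f"parent {p}" for p in parents]
        lines += ["", changelog]
        return self._store("commit", "\n".join(lines).encode())


repo = ToyRepo()
working_dir = {"Makefile": b"all:\n", "notes.txt": b"not ready yet\n"}
repo.update_cache("Makefile", working_dir["Makefile"])  # notes.txt stays unstaged
first = repo.commit_tree(repo.write_tree(), parents=[], changelog="Initial commit")

working_dir["Makefile"] = b"all: git\n"
repo.update_cache("Makefile", working_dir["Makefile"])
second = repo.commit_tree(repo.write_tree(), parents=[first], changelog="Build git")
```

Note that notes.txt exists in the working directory but never reaches either commit, because only what was put into the cache gets written as a tree.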

Similarly, a "checkout" really is about first finding the tree ID you want
to check out, and then bringing it into the "directory cache" by doing a
"read-tree" on it. You can then actually update the directory cache 
further: you might "read-tree" _another_ project, or you could decide that 
you want to keep one of the files you already had.

So in that scenario, after doing the read-tree you'd do an "update-cache"
on the file you want to keep in your current directory structure, which
updates your directory cache to be a _mix_ of the original tree you now
want to check out _and_ of the file you want to use from your current
directory. Then doing a "checkout-cache -a" will actually do the actual
checkout, and only at that point does your working directory really get
changed.

Btw, you don't even have to have any working directory files at all. Let's
say that you have two independent trees, and you want to create a new
commit that is the join of those two trees (where one of the trees takes
precedence). You'd do a "read-tree <a> <b>", which will create a directory
cache (but not check out) that is the union of the <a> and <b> trees (<b>
will override). And then you can do a "write-tree" and commit the
resulting tree - without ever having _any_ of those files checked out. 
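
The override rule for joining two trees can be pictured with plain dictionaries. This is a toy sketch only: trees are modelled as flat path-to-blob mappings, the blob names are placeholders rather than real SHA-1s, and read_tree here is a hypothetical Python function, not the actual command.

```python
def read_tree(*trees):
    """Build a directory cache from trees; later trees override earlier ones."""
    cache = {}
    for tree in trees:
        cache.update(tree)
    return cache


# Hypothetical trees: path -> blob name (placeholders, not real SHA-1s).
tree_a = {"Makefile": "blob-a1", "README": "blob-a2"}
tree_b = {"Makefile": "blob-b1", "cache.h": "blob-b2"}

cache = read_tree(tree_a, tree_b)
```

No file content is read or written at any point; the join happens purely on names, which is why no checkout is needed.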

Linus

Re: more git updates..

2005-04-10 Thread Paul Jackson

> Something like the following patch, maybe with a way to turn it off.

Take out an old envelope and compute on it the odds of this
happening.

Say we have 10,000 kernel hackers, each producing one
new file every minute, for 100 hours a week.  And we've
cloned a small army of Andrew Mortons to integrate
the resulting tsunami of patches.  And Linus is well
cared for in the state funny farm.

What is the probability that this check will fire even
once, between now and 10 billion years from now, when
the Sun has become a red giant destroying all life on
planet Earth?
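
For anyone without an envelope to hand, the estimate can be sketched in Python with the usual birthday approximation. The assumptions are mine, not Paul's: SHA-1 is treated as a uniformly random 160-bit function, nobody is deliberately constructing collisions, and a year is taken as 52 working weeks.

```python
import math


def collision_probability(num_objects, hash_bits=160):
    """Birthday approximation: 1 - exp(-n(n-1)/2 / 2**bits)."""
    pairs = num_objects * (num_objects - 1) // 2
    return -math.expm1(-pairs / 2 ** hash_bits)


# 10,000 hackers, one file a minute, 100 hours a week, for 10 billion years.
files_per_week = 10_000 * 60 * 100
total_files = files_per_week * 52 * 10_000_000_000

probability = collision_probability(total_files)
```

Under those assumptions the total is about 3.12e19 files and the chance of even a single collision comes out at roughly 3 in 10 billion.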

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: more git updates..

2005-04-10 Thread Paul Jackson

Linus wrote:
>  It's a filesystem - although a
> fairly strange one.

Ah ha - that explains the read-tree and write-tree names.

The read-tree pulls stuff out of this file system into
your working files, clobbering local edits.  This is like
the read(2) system call, which clobbers stuff in your
read buffer.

The write-tree pushes stuff down into the file system,
just like write(2) pushes data into the kernel.

I was getting all kind of frustrated yesterday trying
to use Linus's git commands, coming at these names with my
SCM hat on.

That way of thinking really doesn't work well here.

I will have to look more closely at pasky's GIT toolkit
if I want to see an SCM style interface.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: more git updates..

2005-04-10 Thread Paul Jackson

Tony wrote:
> Or maybe the files should be named objects/xx/yy/?

I tend to size these things with the square root of the number of
leaf nodes.  If I have 2,560,000 leaves (your 10,000 files in each
of 16*16 directories), then I will aim for 1600 directories of
1600 leaves each.

My backup is sized for about this number of leaves, and it uses:

xxx/xxx

(I repeat the xxx in the leaf name - easier to code.)

I don't think there is any need for two levels.  There are 4096
different values of three digit hex numbers.  That's ok in one
directory.

The only question would be 'xx' or 'xxx' - two or three digits.

This one is on the cusp in my view - either works.
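
The square-root rule of thumb is easy to check with a few lines of Python. The function names are hypothetical; the numbers are the ones quoted in this message.

```python
import math


def directories_for(leaves):
    """Square-root sizing: about as many directories as leaves per directory."""
    root = math.isqrt(leaves)
    return root if root * root == leaves else root + 1


def hex_digits_for(directories):
    """Smallest number of hex digits whose value range covers the directories."""
    digits = 1
    while 16 ** digits < directories:
        digits += 1
    return digits


leaves = 16 * 16 * 10_000          # 10,000 files in each of 16*16 directories
wanted = directories_for(leaves)   # directories under the square-root rule
digits = hex_digits_for(wanted)    # hex digits needed to name that many
```

Two hex digits give 256 directories and three give 4096, so a target of 1600 falls between the two, which fits the "on the cusp" verdict.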

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: more git updates..

2005-04-10 Thread Ingo Molnar


* Rik van Riel <[EMAIL PROTECTED]> wrote:

> GCC 4 isn't very happy.  Mostly sign changes, but also something that 
> looks like a real error:
> 
> gcc -g -O3 -Wall   -c -o fsck-cache.o fsck-cache.c
> fsck-cache.c: In function 'main':
> fsck-cache.c:59: warning: control may reach end of non-void function 'fsck_tree' being inlined
> fsck-cache.c:62: warning: control may reach end of non-void function 'fsck_commit' being inlined
> 
> I assume that fsck_tree and fsck_commit should complain loudly if they 
> ever get to that point - but since I'm not quite sure there's no 
> patch, sorry.

i sent a patch for most of the sign errors, but the above is a case of gcc 
not noticing that the function can never ever exit the loop, and thus 
cannot get to the 'return' point.

Ingo

Re: more git updates..

2005-04-10 Thread Paul Jackson

Ralph wrote:
> but good enough for
> most uses that people will get caught out when it fails.

Exactly.

If Linus persists in this diff-tree output format, using two lines for
changed files, then I will have to add the following sed script to my
arsenal:

  sed '/^</ { N; s/\n>/ / }'

It collapses pairs of lines:

<100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile
>100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile

to the single line:

<100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile 100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile
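
For readers who would rather not reach for sed, the same pairing can be sketched in Python. This assumes the two-line format shown here, where a changed file appears as a line starting with "<" followed directly by one starting with ">"; collapse_pairs is a made-up name.

```python
def collapse_pairs(lines):
    """Join each '<' line with the '>' line right after it, if there is one."""
    out = []
    it = iter(lines)
    for line in it:
        if not line.startswith("<"):
            out.append(line)
            continue
        following = next(it, None)
        if following is not None and following.startswith(">"):
            # Drop the '>' marker and put both entries on one line.
            out.append(f"{line} {following[1:]}")
        else:
            out.append(line)
            if following is not None:
                out.append(following)
    return out


sample = [
    "<100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile",
    ">100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile",
]
collapsed = collapse_pairs(sample)
```

Lines that do not form a "<" then ">" pair pass through unchanged.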

However, more people will get bit by this git glitch than know sed.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: more git updates..

2005-04-10 Thread Rik van Riel

On Sat, 9 Apr 2005, Linus Torvalds wrote:

> I've rsync'ed the new git repository to kernel.org, it should all be there
> in /pub/linux/kernel/people/torvalds/git.git/ (and it looks like the
> mirror scripts already picked it up on the public side too).

GCC 4 isn't very happy.  Mostly sign changes, but also something
that looks like a real error:

gcc -g -O3 -Wall   -c -o fsck-cache.o fsck-cache.c
fsck-cache.c: In function 'main':
fsck-cache.c:59: warning: control may reach end of non-void function 'fsck_tree' being inlined
fsck-cache.c:62: warning: control may reach end of non-void function 'fsck_commit' being inlined

I assume that fsck_tree and fsck_commit should complain loudly
if they ever get to that point - but since I'm not quite sure
there's no patch, sorry.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

Re: more git updates..

2005-04-10 Thread Rutger Nijlunsing

On Sun, Apr 10, 2005 at 08:44:56AM -0700, Linus Torvalds wrote:
> 
> 
> On Sun, 10 Apr 2005, Junio C Hamano wrote:
> > 
> > But I am wondering what your plans are to handle renames---or
> > does git already represent them?
> 
> You can represent renames on top of git - git itself really doesn't care.  
> In many ways you can just see git as a filesystem - it's content-
> addressable, and it has a notion of versioning, but I really really
> designed it coming at the problem from the viewpoint of a _filesystem_
> person (hey, kernels is what I do), and I actually have absolutely _zero_
> interest in creating a traditional SCM system.
> 
> So to take renaming a file as an example - why do you actually want to 
> track renames? In traditional SCM's, you do it for two reasons:
> 
>  - space efficiency. Most SCM's are based on describing changes to a file, 
[snip]
>  - annotate/blame. This is a valid concern, but the fact is, I never use 
[snip]

- merging.
  When the parent tree renames a file, it's easier for an out-of-tree
  patch to get up-to-date.

- reviewing.
  A huge patch with 2000 added lines and 1990 removed lines is more
  difficult to review than a rename + 10-line patch.

> So consider me deficient, or consider me radical. It boils down to the 
> same thing. Renames don't matter. 

When you've got no out-of-tree patches since you've got the
parent-of-all-trees, then they don't matter, that's true :)

> So whether you agree with the things that _I_ consider important probably
> depends on how you work. The real downside of GIT may be that _my_ way of 
> doing things is quite possibly very rare.


-- 
Rutger Nijlunsing -- eludias ed dse.nl
never attribute to a conspiracy which can be explained by incompetence
--

Re: more git updates..

2005-04-10 Thread Linus Torvalds

On Sat, 9 Apr 2005 [EMAIL PROTECTED] wrote:
>
> With 60,000 changesets in the current tree, we will start out our git
> repository with about 600,000 files.  Assuming the first byte of the
> SHA1 hash is random, that means an average of 2343 files in each of the
> objects/xx directories.  Give it a few more years at the current pace,
> and we'll have over 10,000 files per directory.  This sounds like a lot
> to me ... but perhaps filesystems now handle large directories enough
> better than they used to for this to not be a problem?

The good news is that git itself doesn't really care. I think it's
literally _one_ function ("get_sha1_filename()") that you need to change,
and then you need to write a small script that moves files around, and
you're pretty much done.

Also, I did actually debate that issue with myself, and decided that even
if we do have tons of files per directory, git doesn't much care. The
reason? Git never _searches_ for them. Assuming you have enough memory to
cache the tree, you just end up doing a "lookup", and inside the kernel
that's done using an efficient hash, which doesn't actually care _at_all_
about how many files there are per directory.

So I was for a while debating having a totally flat directory space, but 
since there are _some_ downsides (linear lookup for cold-cache, and just 
that "ls -l" ends up being O(n**2) and things), I decided that a single 
fan-out is probably a good idea.

> Or maybe the files should be named objects/xx/yy/?

Hey, I may end up being wrong, and yes, maybe I should have done a 
two-level one. The good news is that we can trivially fix it later (even 
dynamically - we can make the "sha1 object tree layout" be a per-tree 
config option, and there would be no real issue, so you could make small 
projects use a flat version and big projects use a very deep structure 
etc). You'd just have to script some renames to move the files around..
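
To make the "one function plus a rename script" point concrete, here is a sketch in Python rather than C. It is not the real get_sha1_filename(); object_path, plan_renames and the levels parameter are assumptions chosen for illustration, with levels=1 matching the current objects/xx layout.

```python
def object_path(sha1_hex, levels=1, base="objects"):
    """Map a 40-digit hex name to a path with `levels` two-digit directories."""
    if len(sha1_hex) != 40:
        raise ValueError("expected a 40-digit hex SHA-1")
    parts = [sha1_hex[2 * i:2 * i + 2] for i in range(levels)]
    rest = sha1_hex[2 * levels:]
    return "/".join([base, *parts, rest])


def plan_renames(names, old_levels, new_levels):
    """List (old path, new path) pairs for moving to a different fan-out."""
    return [
        (object_path(name, old_levels), object_path(name, new_levels))
        for name in names
    ]


name = "4870bcf91f8666fc788b07578fb7473eda795587"
moves = plan_renames([name], old_levels=1, new_levels=2)
```

Since the layout lives in a single mapping function, flat, one-level and two-level stores differ only in the value of one parameter, and the migration is the list of renames.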

Linus

Re: more git updates..

2005-04-10 Thread Linus Torvalds

On Sun, 10 Apr 2005, Junio C Hamano wrote:
> 
> But I am wondering what your plans are to handle renames---or
> does git already represent them?

You can represent renames on top of git - git itself really doesn't care.  
In many ways you can just see git as a filesystem - it's content-
addressable, and it has a notion of versioning, but I really really
designed it coming at the problem from the viewpoint of a _filesystem_
person (hey, kernels is what I do), and I actually have absolutely _zero_
interest in creating a traditional SCM system.

So to take renaming a file as an example - why do you actually want to 
track renames? In traditional SCM's, you do it for two reasons:

 - space efficiency. Most SCM's are based on describing changes to a file, 
   and compress the data by doing revisions on the same file. In order to 
   continue that process past a rename, such an SCM _has_ to track 
   renames, or lose the delta-based approach.

   The most trivial example of this is "diff", ie a rename ends up 
   generating a _huge_ diff unless you track the rename explicitly.

   GIT doesn't care. There is _zero_ space efficiency in trying to track 
   renames. In fact, it would add overhead to the system, not lessen it. 
   That's because GIT fundamentally doesn't do the "delta-within-a-file"  
   model.

 - annotate/blame. This is a valid concern, but the fact is, I never use 
   it. It may be a deficiency of mine, but I simply don't do the per-line 
   thing when I debug or try to find who was responsible. I do "blame" on 
   a much bigger-picture level, and I personally believe (pretty strongly) 
   that per-line annotations are not actually a good thing - they come not 
   because people _want_ to do things at that low level, but because 
   historically, you didn't _have_ the bigger-picture thing.

   In other words, pretty much every SCM out there is based on SCCS 
   "mentally", even if not in any other model. That's why people think 
   per-line blame is important - you have that mental model. 

So consider me deficient, or consider me radical. It boils down to the 
same thing. Renames don't matter. 

That said, if somebody wants to create a _real_ SCM (rather than my notion
of a pure content tracker) on top of GIT, you probably could fairly easily
do so by imposing a few limitations on a higher level. For example, most
SCM's that track renames require that the user _tell_ them about the
renames: you do a "bk mv" or a "svn rename" or something.

If you want to do the same on top of GIT, then you should think of GIT as
what it is: GIT just tracks contents. It's a filesystem - although a
fairly strange one. How would you track renames on top of that? Easy: add
your own fields to the GIT revision messages: GIT enforces the header, but
you can add anything you want to the "free-form" part that follows it. 
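
As a sketch of what a higher-level tool could do with that free-form part, assume it records each rename as a line of the form "rename: old -> new". That convention is purely hypothetical, invented here for illustration; nothing in git defines it.

```python
def parse_renames(message):
    """Collect (old, new) pairs from hypothetical 'rename: old -> new' lines."""
    prefix = "rename: "
    renames = []
    for line in message.splitlines():
        if line.startswith(prefix) and " -> " in line:
            old, new = line[len(prefix):].split(" -> ", 1)
            renames.append((old.strip(), new.strip()))
    return renames


message = """Move the i386 setup code

rename: arch/i386/kernel/setup.c -> arch/x86/kernel/setup.c
"""
found = parse_renames(message)
```

The object store never looks at these lines; only the tool that wrote them gives them meaning.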

Same goes for any other information where you care about what happens 
"within" a file. GIT simply doesn't track it. You can build things on top 
of GIT if you want to, though. They may not be as efficient as they would 
be if they were built _into_ GIT, but on the other hand GIT does a lot of 
other things a hell of a lot faster thanks to its design.

So whether you agree with the things that _I_ consider important probably
depends on how you work. The real downside of GIT may be that _my_ way of 
doing things is quite possibly very rare.

But it clearly is the only right way. The fact that everybody else does it 
some other way only means that they are wrong.

Linus

Re: more git updates..

2005-04-10 Thread tony . luck

>In other words, each "commit" file is very small and cheap, but since 
>almost every commit will also imply a totally new tree-file, "git" is 
>going to have an overhead of half a megabyte per commit. Oops.
>
>Damn, that's painful. I suspect I will have to change the format somehow.

Having dodged that bullet with the change to make tree files point at
other tree files ... here's another (potential) issue.

A changeset that touches just one file a few levels down from the top
of the tree (say arch/i386/kernel/setup.c) will make six new files in
the git repository (one for the changeset, four tree files, and a new
blob for the new version of the file). More complex changes make more
files ... but say the average is ten new files per changeset since most
changes touch few files.  With 60,000 changesets in the current tree, we
will start out our git repository with about 600,000 files.  Assuming
the first byte of the SHA1 hash is random, that means an average of 2343
files in each of the objects/xx directories.  Give it a few more years at
the current pace, and we'll have over 10,000 files per directory.  This
sounds like a lot to me ... but perhaps filesystems now handle large
directories enough better than they used to for this to not be a problem?
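
The arithmetic behind those figures, as a small Python sketch. The inputs are the numbers from this message (60,000 changesets, an assumed average of ten objects each, 256 objects/xx directories); the function names are made up.

```python
def objects_for_single_file_change(path):
    """One changeset, one tree per directory level (root included), one blob."""
    tree_levels = path.count("/") + 1
    return 1 + tree_levels + 1


def files_per_directory(total_objects, directories=256):
    """Average directory size if the first hash byte is uniformly random."""
    return total_objects // directories


per_change = objects_for_single_file_change("arch/i386/kernel/setup.c")
total_objects = 60_000 * 10
average = files_per_directory(total_objects)
# Changesets needed before the average directory holds 10,000 files.
changesets_for_10k = 10_000 * 256 // 10
```

At ten objects per changeset, the 10,000-files-per-directory mark corresponds to 256,000 changesets, a bit more than four times the current history.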

Or maybe the files should be named objects/xx/yy/?

-Tony

Re: more git updates..

2005-04-10 Thread Ralph Corderoy

Hi,

Christopher Li wrote:
> On Sat, Apr 09, 2005 at 04:31:10PM -0700, Linus Torvalds wrote:
> > NOTE! This means that each "tree" file basically tracks just a
> > single directory. The old style of "every file in one tree file"
> > still works, but fsck-cache will warn about it. Happily, the git
> > archive itself doesn't have any subdirectories, so git itself is not
> > impacted by it.
> 
> That is really cool stuff. The way I read it, correct me if I am
> wrong, git is a user-space versioned file system. "tree" <--> directory
> and "blob" <--> file.  "commit" to describe the version history.

See the Venti filesystem in Bell Labs's Plan 9 OS.  It too uses SHA-1.

http://www.cs.bell-labs.com/sys/doc/venti/venti.pdf

Abstract

This paper describes a network storage system, called Venti,
intended for archival data. In this system, a unique hash of a
block's contents acts as the block identifier for read and write
operations. This approach enforces a write-once policy, preventing
accidental or malicious destruction of data. In addition, duplicate
copies of a block can be coalesced, reducing the consumption of
storage and simplifying the implementation of clients. Venti is a
building block for constructing a variety of storage applications
such as logical backup, physical backup, and snapshot file systems.

We have built a prototype of the system and present some preliminary
performance results. The system uses magnetic disks as the storage
technology, resulting in an access time for archival data that is
comparable to non-archival data. The feasibility of the write-once
model for storage is demonstrated using data from over a decade's
use of two Plan 9 file systems. 

Cheers,

Ralph.


Re: more git updates..

2005-04-10 Thread tony . luck

>handle by pure rename only plus the extra delta. The current git doesn't
>have per-file change history. From git's point of view, some file was deleted
>and another file appeared with the same content.
>
>It is up to the top-level SCM to handle that correctly.
>Renaming a directory will be even more fun.

But from a git perspective it will be very efficient.  Imagine that
Linus decides to rename arch/i386 as arch/x86 ... at the git repository
level this just requires a changeset, a new top level tree, and a new
tree for the arch directory showing that i386 changed to x86.  That's
all ... every file below that didn't change, so the blobs for the files
are all the same.
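
A toy model shows why so little gets written. In this sketch a directory is a nested dict, a file is a bytes value, and every object is named by a hash of its content; the naming scheme is simplified and store_tree is a hypothetical helper, not git code. The changeset object itself is left out.

```python
import hashlib


def store_tree(node, objects):
    """Store a nested dict (directory) or bytes (file); return its name."""
    if isinstance(node, bytes):
        kind, payload = "blob", node
    else:
        entries = [
            f"{store_tree(child, objects)} {name}"
            for name, child in sorted(node.items())
        ]
        kind, payload = "tree", "\n".join(entries).encode()
    name = hashlib.sha1(kind.encode() + b"\0" + payload).hexdigest()
    objects[name] = kind
    return name


i386 = {"kernel": {"setup.c": b"int x;\n"}}
ia64 = {"head.S": b"nop\n"}
before = {"Makefile": b"all:\n", "arch": {"i386": i386, "ia64": ia64}}
after = {"Makefile": b"all:\n", "arch": {"x86": i386, "ia64": ia64}}

objects = {}
store_tree(before, objects)
already_stored = set(objects)
store_tree(after, objects)
new_objects = [kind for name, kind in objects.items() if name not in already_stored]
```

Only the top-level tree and the arch tree are new; the renamed directory's own tree hashes to the same name as before because its contents are unchanged.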

-Tony

Proposal for shell-patch-format [was: Re: more git updates..]

2005-04-10 Thread Rutger Nijlunsing

On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> Listing the file paths and their sigs included in a tree to make
> a snapshot of a tree state sounds fine, and diffing two trees by
> looking at the sigs between two such files sounds fine as well.
> 
> But I am wondering what your plans are to handle renames---or
> does git already represent them?

git doesn't represent transitions (or deltas), but only state. So it's
not (much) more than a .tar file from version-management perspective;
the only difference being that a git-tree has a comment field and a
predecessor-reference, which are currently not used in determining the
'patch' between two trees.

Deltas are derived by comparing different versions and determining
the difference by reverse-engineering the differences which got us
from version A to version B.

Deltas are currently described as patch(1)es. Patches don't have the
concept of 'renaming', so even after determining that file X has been
renamed to Y, we have no container for this fact. A patch(1) only
contains local-file-edits: substitute lines by other lines.

Deltas are not needed to follow a tree; deltas are useful for merging
branches of versions, and for reviewing purposes. This is comparable
to using tar for version management: it is very common to tar up the
current version of your project weekly as a poor man's version
management for one-person, one-project work.

So what is needed is a way to represent deltas which can contain more
than only traditional patches. I would propose a simple format: 
the shell-script in a fixed-format.

Shell-patch format in EBNF:
  <shell-patch>  ::= ( <comment>? <command>* )*
  <comment>      ::= <comment-line>+
      The comment contains the text describing the function of the
      patch following it.
  <comment-line> ::= "# " <text>
  <command>      ::=
      "mv " <filename> " " <filename> "\n" |
      "cp " <filename> " " <filename> "\n" |
      "chmod " <mode> <filename> "\n" |
      "patch <<__UNIQUE_STRING__\n" <patch> "__UNIQUE_STRING__\n"
        (where UNIQUE_STRING must not be contained in patch)
  <filename>     ::= <pathname>
      (but pointing to a file)
  <pathname>     ::= a pathname relative to '.';
      escaping special characters the shell-way;
      may not contain '..'.

Example:
  # Rename file b to a1, and change a line.
  mv b a1
  patch <<__END__
  *** a1  Sun Apr 10 11:43:37 2005
  --- a2  Sun Apr 10 11:43:41 2005
  ***************
  *** 1,4 ****
    1
    2
  ! from
    3
  --- 1,4 ----
    1
    2
  ! to
    3
  __END__

Advantages:
  - ASCII!
  - a shell-patch is executable without extra tooling
  - a shell-patch is readable and therefore reviewable
  - a shell-patch is forward-compatible: a shell-patch acts
like a patch (since patch(1) ignores garbage around patch :),
but not backwards-compatible.
  - extensible
  - the heavy-lifting is done by 'patch'
Disadvantages:
  - no deltas for binary files

Open issues:
  - <comment> could be made more structured; maybe containing fields
    like Subject:, Author:, Signed-By:, certificates, ...
    (BitKeeper seems to be using "# " <field> ":" <value> "\n" lines)
  - patch(1) doesn't know any directories. Should shell-patch
    know directories? This implies commands working on directories too
    (like directory renaming, mode changing, ...). Otherwise directories
    are implicit (a file in a directory implies the existence of that
    directory). Also implies mkdir and rmdir as shell-patch commands.
  - extra commands might be useful to conserve more state (changes):
      ln -s  -- for symbolic links;
      ln     -- for hard links;
      chown  -- for permissions;
      chattr -- for storing extended attributes
      touch  -- for setting timestamps (probably creation time only,
                since mtime is something git relies on)
    ...and for the really adventurous:
      sed 's,,,' -- for substitutions
      (this is something darcs supports, but which I think is too
       bothersome to use since it is difficult to reverse-engineer
       from two random trees)
Why a fixed format at all?
  - This way, the executable shell-patch can be proven to be
    harmless to the machine: 'rm -rf /' is a valid shell-script,
    but not a valid shell-patch (since 'rm' is not a valid command,
    random flags like '-rf' are not supported, and '/' is an absolute
    pathname).
  - A fixed format enables tooling to support such a patch format;
for example creating the reverse-patch, merging patches (yeah,
'cat' also merges patches...).
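
As a rough illustration of the "can be proven harmless" point, here is a
small Python checker. It is only a sketch under assumptions: the function
names are invented, the command set is just the mv/cp/chmod/patch from the
grammar above, and it is stricter than the proposal in that it accepts
only plain pathnames rather than shell-escaped ones.

```python
import re

ARG_COUNT = {"mv": 2, "cp": 2, "chmod": 2}   # command -> number of arguments
PLAIN_PATH = re.compile(r"[A-Za-z0-9._+][A-Za-z0-9._+/-]*")
MODE = re.compile(r"[0-7]{3,4}")


def safe_path(path):
    """Relative, made of plain characters only, and without a '..' component."""
    return bool(PLAIN_PATH.fullmatch(path)) and ".." not in path.split("/")


def validate(script):
    """Return a list of (line number, message); an empty list means acceptable."""
    problems = []
    lines = script.splitlines()
    i = 0
    while i < len(lines):
        line, lineno = lines[i], i + 1
        i += 1
        if not line.strip() or line.startswith("#"):
            continue                      # blank line or comment
        if line.startswith("patch <<"):
            marker = line[len("patch <<"):]
            # Skip the embedded patch up to its terminating marker line.
            while i < len(lines) and lines[i] != marker:
                i += 1
            if i == len(lines):
                problems.append((lineno, "unterminated patch body"))
            i += 1
            continue
        cmd, *args = line.split()
        if cmd not in ARG_COUNT:
            problems.append((lineno, f"command not allowed: {cmd}"))
        elif len(args) != ARG_COUNT[cmd]:
            problems.append((lineno, f"{cmd} takes {ARG_COUNT[cmd]} arguments"))
        else:
            paths = args
            if cmd == "chmod":
                if not MODE.fullmatch(args[0]):
                    problems.append((lineno, f"bad mode: {args[0]}"))
                paths = args[1:]
            for path in paths:
                if not safe_path(path):
                    problems.append((lineno, f"unsafe pathname: {path}"))
    return problems


print(validate("rm -rf /"))
print(validate("# Rename file b to a1\nmv b a1\n"))
```

The same line-by-line walk would be the natural place to hang a
reverse-patch generator, since every accepted command has an obvious
inverse.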

...what has this to do with git?  Not much and everything, depending
on how you look at it. 'git' is 'tar', and 'shell-patch' is 'patch';
both are orthogonal concepts but very usable in combination. We'll look at
getting from two git trees to a shell-patch.

Diffing the trees would not only look at each file and its hash, but
also the other way around: which hash values are used more than once.
For files with the same hash value, compare the contents (and the rest
of the attributes); this is needed since the mapping from file contents
to sha1 is one-way. When the contents are the same, the
shell-patch command to generate is obviously a 'cp'.
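
The reverse lookup described above could be sketched like this in Python.
The function name and the dict-of-hashes representation of a tree are
assumptions for the example, and the sketch trusts equal hashes instead of
comparing contents. A path that is new in the second tree and whose hash
already existed becomes a 'cp', or an 'mv' when the source path has
disappeared. The cp commands are emitted first so their sources still exist.

```python
def structural_commands(old, new):
    """old and new map pathname -> content hash; return a list of cp/mv lines."""
    by_hash = {}
    for path in sorted(old):
        by_hash.setdefault(old[path], []).append(path)

    copies, moves = [], []
    consumed = set()                      # sources already used by an mv
    for path in sorted(new):
        if path in old:
            continue                      # same path: at most a content patch
        sources = by_hash.get(new[path])
        if not sources:
            continue                      # really new content: needs a patch
        gone = [s for s in sources if s not in new and s not in consumed]
        if gone:
            consumed.add(gone[0])
            moves.append(f"mv {gone[0]} {path}")
        else:
            copies.append(f"cp {sources[0]} {path}")
    return copies + moves


old_tree = {"b": "h1", "c": "h2"}
new_tree = {"a1": "h1", "c": "h2", "c.bak": "h2"}
print(structural_commands(old_tree, new_tree))
```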

For example, we have got two trees

Re: more git updates..

2005-04-10 Thread Ralph Corderoy


Hi Paul,

> Ralph wrote:
> > Watch out for when xargs invokes do_something more than once and the
> > `<' is parsed by a different one than the `>'.
> 
> It will take a pretty long list to do that.  It seems that GNU xargs
> on top of a Linux kernel has a 128 KByte ARG_MAX.

I didn't realise it was that long, but one pair of files to diff takes
128 bytes of that.

$ wc -c <<\E
> <100664 aff074c63ac827801a7d02ff92781365957f1430 update-cache.c
> >100664 3a672397164d5ff27a19a6888b578af96824ede7 update-cache.c
> E
128

So that's space for 1024 pairs.  (Doesn't envp take some up too?)  That
doesn't seem enough for diffs between revisions, but good enough for
most uses that people will get caught out when it fails.

$ bzip2 -dc patch-2.6.10.bz2 | grep -c '^diff '  
5384
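
The arithmetic above as a tiny Python sketch. The two constants are the
figures quoted in this thread; the function names are made up, and the
space taken by the environment is ignored, so the real limit is somewhat
lower.

```python
import math

ARG_MAX = 128 * 1024   # bytes, the figure quoted above
PAIR_BYTES = 128       # one '<' record plus one '>' record, as counted by wc -c


def pairs_per_batch(arg_max=ARG_MAX, pair_bytes=PAIR_BYTES):
    """How many '<'/'>' pairs fit into one xargs invocation."""
    return arg_max // pair_bytes


def batches_needed(changed_files, arg_max=ARG_MAX, pair_bytes=PAIR_BYTES):
    """How many times xargs would have to invoke the command."""
    return math.ceil(changed_files / pairs_per_batch(arg_max, pair_bytes))


print(pairs_per_batch())     # 1024
print(batches_needed(5384))  # 6
```

So a diff the size of patch-2.6.10 would be split over several
invocations, and any boundary that falls between a `<' record and its
`>' record separates the pair.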

Cheers,


Ralph.


Re: Re: more git updates..

2005-04-10 Thread Christopher Li

On Sun, Apr 10, 2005 at 11:41:53AM +0200, Petr Baudis wrote:
> Dear diary, on Sun, Apr 10, 2005 at 07:53:40AM CEST, I got a letter
> where Christopher Li <[EMAIL PROTECTED]> told me that...
> > On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> > > 
> > > But I am wondering what your plans are to handle renames---or
> > > does git already represent them?
> > >
> > 
> > Rename should just work.  It will create a new tree object and you
> > will notice that in the entry that changed, the hash for the blob
> > object is the same.
> 
> Which is of course wrong when you want to do proper merging, examine
> per-file history, etc. One solution which springs to my mind is to have
> a UUID accompany each blob and tree; that will take a relatively large
> amount of space though, and I'm not sure it is really worth it.

It should just use the two-step rename + change; then it is tractable
with git now.

Chris

Re: more git updates..

2005-04-10 Thread Christopher Li

On Sun, Apr 10, 2005 at 02:28:54AM -0700, Junio C Hamano wrote:
> > "CL" == Christopher Li <[EMAIL PROTECTED]> writes:
> 
> CL> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> >> 
> >> But I am wondering what your plans are to handle renames---or
> >> does git already represent them?
> >> 
> 
> CL> Rename should just work.  It will create a new tree object and you
> CL> will notice that in the entry that changed, the hash for the blob
> CL> object is the same.
> 
> Sorry, I was unclear.  But doesn't that imply that a SCM built
> on top of git storage needs to read all the commit and tree
> records up to the common ancestor to show tree diffs between two
> forked tree?
> 
> I suspect that another problem is that noticing the move of the
> same SHA1 hash from one pathname to another and recognizing that
> as a rename would not always work in the real world, because
> sometimes people move files *and* make small changes at the same
> time.  If git is meant to be an intermediate format to suck
> existing kernel history out of BK so that the history can be
> converted for the next SCM chosen for the kernel work, I would
> imagine that there needs to be a way to represent such a case.
> Maybe convert a file rename as two git trees (one tree for pure
> move which immediately followed by another tree for edit) if it
> is not a pure move?
> 

Git is not an SCM yet.  A rename + change set should be
handled internally as a pure rename plus the extra delta. The current git doesn't
have per-file change history. From git's point of view, one file was deleted
and another file appeared with the same content.

It is up to the top-level SCM to handle that correctly.
Renaming a directory will be even more fun.

Chris

Re: more git updates..

2005-04-10 Thread Christopher Li

On Sat, Apr 09, 2005 at 04:31:10PM -0700, Linus Torvalds wrote:
> 
> Done, and pushed out. The current git.git repository seems to do all of 
> this correctly.
> 
> NOTE! This means that each "tree" file basically tracks just a single
> directory. The old style of "every file in one tree file" still works, but 
> fsck-cache will warn about it. Happily, the git archive itself doesn't 
> have any subdirectories, so git itself is not impacted by it.

That is really cool stuff. My way to read it, correct me if I am wrong:
git is a user-space versioned file system. "tree" <--> directory and
"blob" <--> file.  "commit" describes the version history.

Git always writes out a full new version of a blob when there is any
update to it. At first I thought that wastes a lot of space, especially
when there is only a tiny change to it. But the more I think about it,
the more sense it makes. Kernel source is usually small objects, and files
are stored compressed anyway. A very useful thing to gain from it is that
we can truncate the older history, e.g. we can have an option not to sync
the pre-2.4 change sets and only grab them if we need to. Most of the time
we are only interested in the recent change sets.

There is one problem though. How about SHA1 hash collisions?
Even if the chance is very remote, you don't want to lose data due
to a "software" error. I think it is OK not to handle that
case right now. On the other hand, it would be nice to detect it
and give out a big error message if it really happens.

Something like the following patch, maybe with a way to turn it off.

Chris

Index: git-0.03/read-cache.c
===================================================================
--- git-0.03.orig/read-cache.c  2005-04-09 18:42:16.0 -0400
+++ git-0.03/read-cache.c   2005-04-10 02:48:36.0 -0400
@@ -210,8 +210,22 @@
 	int fd;

 	fd = open(filename, O_WRONLY | O_CREAT | O_EXCL, 0666);
-	if (fd < 0)
-		return (errno == EEXIST) ? 0 : -1;
+	if (fd < 0) {
+		void *map;
+		static int error(const char * string);
+
+		if (errno != EEXIST)
+			return -1;
+		fd = open(filename, O_RDONLY);
+		if (fd < 0)
+			return -1;
+		map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
+		if (map == MAP_FAILED)
+			return -1;
+		if (memcmp(buf, map, size))
+			return error("Ouch, Strike by lighting!\n");
+		return 0;
+	}
 	write(fd, buf, size);
 	close(fd);
 	return 0;
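
The same idea as a self-contained Python sketch, for readers who want to
try it outside the git tree. `write_object` and `CollisionError` are
made-up names, not anything in git. Unlike the patch, the sketch reads the
whole existing file, so a length mismatch also counts as a collision, and
it closes the file it opened.

```python
import os
import tempfile


class CollisionError(Exception):
    """An object with this name exists already but holds different bytes."""


def write_object(path, buf):
    """Create path with content buf; return False if it was already there."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o666)
    except FileExistsError:
        # Same name means same hash: the stored bytes must be identical.
        with open(path, "rb") as existing:
            if existing.read() != buf:
                raise CollisionError(path)
        return False
    with os.fdopen(fd, "wb") as out:
        out.write(buf)
    return True


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "aff074c6")
print(write_object(demo_path, b"hello"))   # True: newly written
print(write_object(demo_path, b"hello"))   # False: already present, same bytes
```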

Re: Re: more git updates..

2005-04-10 Thread Petr Baudis

Dear diary, on Sun, Apr 10, 2005 at 11:28:54AM CEST, I got a letter
where Junio C Hamano <[EMAIL PROTECTED]> told me that...
> > "CL" == Christopher Li <[EMAIL PROTECTED]> writes:
> 
> CL> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> >> 
> >> But I am wondering what your plans are to handle renames---or
> >> does git already represent them?
> >> 
> 
> CL> Rename should just work.  It will create a new tree object and you
> CL> will notice that in the entry that changed, the hash for the blob
> CL> object is the same.
> 
> Sorry, I was unclear.  But doesn't that imply that a SCM built
> on top of git storage needs to read all the commit and tree
> records up to the common ancestor to show tree diffs between two
> forked tree?

No. See diff-tree output and
http://pasky.or.cz/~pasky/dev/git/gitdiff-do for how it's done.
Basically, you just take the two trees and compare them linearly (do a
normal diff on them, essentially). The differences you spot this way
are everything that needs to appear in the patch.
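
A minimal Python sketch of such a linear compare, assuming each tree is
given as a list of (name, mode, sha1) tuples sorted by name. This
representation and the function name are assumptions for the example, and
the hashes are shortened placeholders. It is a plain merge walk over the
two lists and does not claim to mirror what diff-tree or gitdiff-do really
do internally.

```python
def diff_trees(old, new):
    """Walk two name-sorted entry lists once; return '<' and '>' records."""
    out = []
    i = j = 0
    while i < len(old) or j < len(new):
        if j == len(new) or (i < len(old) and old[i][0] < new[j][0]):
            out.append(("<",) + old[i])      # only in the old tree
            i += 1
        elif i == len(old) or new[j][0] < old[i][0]:
            out.append((">",) + new[j])      # only in the new tree
            j += 1
        else:
            if old[i] != new[j]:             # same name, different mode or hash
                out.append(("<",) + old[i])
                out.append((">",) + new[j])
            i += 1
            j += 1
    return out


old_entries = [("Makefile", "100644", "e1e7"),
               ("README", "100644", "aaaa"),
               ("exit.c", "100644", "1a50")]
new_entries = [("Makefile", "100644", "8ee2"),
               ("exit.c", "100644", "1a50"),
               ("fsck.c", "100644", "bbbb")]
for record in diff_trees(old_entries, new_entries):
    print(*record)
```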

> I suspect that another problem is that noticing the move of the
> same SHA1 hash from one pathname to another and recognizing that
> as a rename would not always work in the real world, because
> sometimes people move files *and* make small changes at the same
> time.  If git is meant to be an intermediate format to suck
> existing kernel history out of BK so that the history can be
> converted for the next SCM chosen for the kernel work, I would
> imagine that there needs to be a way to represent such a case.
> Maybe convert a file rename as two git trees (one tree for pure
> move which immediately followed by another tree for edit) if it
> is not a pure move?

Actually, this could be possible too I think. We will have to make
diff-tree two-pass, but it is already so blinding fast that I guess that
doesn't hurt too much. I might try to get my hands on that.

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Re: Re: more git updates..

2005-04-10 Thread Petr Baudis

Dear diary, on Sun, Apr 10, 2005 at 07:53:40AM CEST, I got a letter
where Christopher Li <[EMAIL PROTECTED]> told me that...
> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> > 
> > But I am wondering what your plans are to handle renames---or
> > does git already represent them?
> >
> 
> Rename should just work.  It will create a new tree object and you
> will notice that in the entry that changed, the hash for the blob
> object is the same.

Which is of course wrong when you want to do proper merging, examine
per-file history, etc. One solution which springs to my mind is to have
a UUID accompany each blob and tree; that will take a relatively large
amount of space though, and I'm not sure it is really worth it.

How many renames were there in the 64k commits so far anyway?

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Re: more git updates..

2005-04-10 Thread Wichert Akkerman

Previously Christopher Li wrote:
> Rename should just work.  It will create a new tree object and you
> will notice that in the entry that changed, the hash for the blob
> object is the same.

What if you rename and change a file within a changeset?

Wichert.

-- 
Wichert Akkerman <[EMAIL PROTECTED]>    It is simple to make things.
http://www.wiggy.net/                   It is hard to make things simple.

Re: more git updates..

2005-04-10 Thread Junio C Hamano

> "CL" == Christopher Li <[EMAIL PROTECTED]> writes:

CL> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
>> 
>> But I am wondering what your plans are to handle renames---or
>> does git already represent them?
>> 

CL> Rename should just work.  It will create a new tree object and you
CL> will notice that in the entry that changed, the hash for the blob
CL> object is the same.

Sorry, I was unclear.  But doesn't that imply that a SCM built
on top of git storage needs to read all the commit and tree
records up to the common ancestor to show tree diffs between two
forked tree?

I suspect that another problem is that noticing the move of the
same SHA1 hash from one pathname to another and recognizing that
as a rename would not always work in the real world, because
sometimes people move files *and* make small changes at the same
time.  If git is meant to be an intermediate format to suck
existing kernel history out of BK so that the history can be
converted for the next SCM chosen for the kernel work, I would
imagine that there needs to be a way to represent such a case.
Maybe convert a file rename as two git trees (one tree for pure
move which immediately followed by another tree for edit) if it
is not a pure move?


Re: more git updates..

2005-04-10 Thread Christopher Li

On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> 
> But I am wondering what your plans are to handle renames---or
> does git already represent them?
>

Rename should just work.  It will create a new tree object and you
will notice that in the entry that changed, the hash for the blob
object is the same.

Chris


Re: more git updates..

2005-04-10 Thread Junio C Hamano

Listing the file paths and their sigs included in a tree to make
a snapshot of a tree state sounds fine, and diffing two trees by
looking at the sigs between two such files sounds fine as well.

But I am wondering what your plans are to handle renames---or
does git already represent them?


Re: Re: more git updates..

2005-04-09 Thread Petr Baudis

Dear diary, on Sun, Apr 10, 2005 at 01:31:10AM CEST, I got a letter
where Linus Torvalds <[EMAIL PROTECTED]> told me that...
> On Sat, 9 Apr 2005, Linus Torvalds wrote:
> > 
> > Actually, I guess I wouldn't have to change the format. I could just 
> > extend the existing "tree" object to be able to point to other trees, and 
> > that's it.
> 
> Done, and pushed out. The current git.git repository seems to do all of 
> this correctly.
..snip..

Ok, so now I can dare announce it, I hope. I hacked my branch of git
somewhat, kept in sync with Linus, and now I have something to show.
Please see it at

http://pasky.or.cz/~pasky/dev/git/

It is basically a set of (still rather crude) shell scripts upon Linus'
git, which make it sanely usable by mere humans for actual version
tracking. Its usage _is_ going to change, so don't get too used to it
(that'd be hard anyway, I suspect), but it should be working nicely.

I have described most of the interesting parts and some basic usage in
the README at that page. It wraps commits, supports log retrieval and
comfortable diffing between any two trees. And on top of that, it can do
some basic remote repositories - it will pull (rsync) from them and it
can make the local copy track them - on pull, it will be updated
accordingly (and your local commits on the tracked branch will get
orphaned).

I didn't attach a patch against Linus since I think it's pretty much
useless now. It's available as against-linus.patch on the web, and
you can apply it to the latest git tree (NOT 0.03). But it's probably
a better idea to wget my tree. You can then watch us making progress by

gitpull.sh linus
gitpull.sh pasky

and see where we differ by:

gitdiff.sh linus pasky

(This is how the against-linus.patch was generated. I'd easily generate
even 0.03 patch this way, but I forgot to merge the fsck at that time,
so it would suck.)

(Note that the tree you wget is set up to track my branch. If you want
to stop tracking it (basically necessary now if you want to do local
commits), do:

cp .dircache/HEAD .dircache/HEAD.local
gittrack.sh

The cp says something like "I want to pick up where the tracked
branch left off". Otherwise, untracking would return you to your "local"
branch, which is just some ancient predecessor of the pasky branch here
anyway.)

Note that I didn't really test it on anything but git itself yet, so I'm
not sure how it will cope, especially with directories - I tried to make
it aware of them though. I will do some more practical testing tomorrow.

Otherwise, I will probably try to consolidate the usage and
documentation now, and beautify the scripts. I might start pondering
some merging too. Oh, and gitpatch.sh. :-)

Have fun and please share your opinions,

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Re: more git updates..

2005-04-09 Thread Paul Jackson

From before:

The sha1 (ascii) digests for 16817 files take:

689497 bytes before compression
397475 bytes after minigzip

New numbers:

The sha1 (binary) digests for 16817 files take:

336340 bytes before compression
334943 bytes after minigzip

So compressing binary digests isn't worth a darn, and compressing ascii
digests gets them down to within 18% of binary digests in size.
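
The raw sizes follow directly from the digest format (20 bytes binary,
40 hex characters plus a newline as ascii), and the compression behaviour
is easy to reproduce. A Python sketch, with two assumptions: the digests
here are SHA-1s of the numbers 0..16816 rather than of real kernel files,
and `zlib.compress` stands in for minigzip, so the compressed sizes will
not match the figures above byte for byte.

```python
import hashlib
import zlib

N_FILES = 16817

digests = [hashlib.sha1(str(i).encode()).digest() for i in range(N_FILES)]
binary = b"".join(digests)                                   # 20 bytes each
ascii_hex = "".join(d.hex() + "\n" for d in digests).encode()  # 41 bytes each

sizes = {
    "binary": len(binary),
    "binary compressed": len(zlib.compress(binary, 9)),
    "ascii": len(ascii_hex),
    "ascii compressed": len(zlib.compress(ascii_hex, 9)),
}
for label, size in sizes.items():
    print(f"{size:8d} bytes  {label}")
```

The uncompressed figures come out as exactly 336340 and 689497 bytes,
matching the numbers above. Hash output is close to random, so the binary
form hardly shrinks, while the hex form compresses because each character
carries only four bits of information.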

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: more git updates..

2005-04-09 Thread Paul Jackson

> Then a "tree" object would point to a "directory" object, 

Ah - light bulb flickers - in _separate_ files.

Yes, that obviously makes a difference.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: more git updates..

2005-04-09 Thread Paul Jackson

Linus wrote:
> Damn, that's painful. I suspect I will have to change the format somehow.

The sha1 (ascii) digests for 16817 files take:

689497 bytes before compression
397475 bytes after minigzip

The pathnames, relative to top of tree, for these 16817
files take:

503983 bytes before compression
 85786 bytes after minigzip compression

I doubt any fancifying up of the pathname storage will gain much.

However going from binary to ascii sha1 digest might help (compresses
better, I suspect - I'll have to write a few lines of code to see).

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: more git updates..

2005-04-09 Thread Paul Jackson

Bernd wrote:
> more parser friendly to have single records for diffs.

good point

[looks like you trimmed the cc list - folks around here don't like that ;)]

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: more git updates..

2005-04-09 Thread Bernd Eckenfels

In article <[EMAIL PROTECTED]> you wrote:
> Ralph wrote:
>> Watch out for when xargs invokes do_something more than once and the `<'
>> is parsed by a different one than the `>'.
> It will take a pretty long list to do that.  It seems that
> GNU xargs on top of a Linux kernel has a 128 KByte ARG_MAX.
> In the old days, with 4 KByte ARG_MAX limits, this would have
> bitten us pretty quickly.

Nevertheless I  think it is more parser friendly to have single records for
diffs.

Greetings
Bernd

Re: more git updates..

2005-04-09 Thread Paul Jackson

Ralph wrote:
> Watch out for when xargs invokes do_something more than once and the `<'
> is parsed by a different one than the `>'.

It will take a pretty long list to do that.  It seems that
GNU xargs on top of a Linux kernel has a 128 KByte ARG_MAX.

In the old days, with 4 KByte ARG_MAX limits, this would have
bitten us pretty quickly.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: more git updates..

2005-04-09 Thread Linus Torvalds

On Sat, 9 Apr 2005, Linus Torvalds wrote:
> 
> Actually, I guess I wouldn't have to change the format. I could just 
> extend the existing "tree" object to be able to point to other trees, and 
> that's it.

Done, and pushed out. The current git.git repository seems to do all of 
this correctly.

NOTE! This means that each "tree" file basically tracks just a single
directory. The old style of "every file in one tree file" still works, but 
fsck-cache will warn about it. Happily, the git archive itself doesn't 
have any subdirectories, so git itself is not impacted by it.

Now, this means that I should add a "recursive" option to "tree-diff", but 
I haven't done so yet. So right now if I change the top-level Makefile,
_and_ change kernel/exit.c, then the "tree diff" between the two commit 
trees ends up looking like:

[EMAIL PROTECTED]:~/lx-test/linux-2.6.12-rc2> diff-tree \
    7bec1223736d7e02c755e9a365984b3cbfa1e6e9 \
    d64817f809a60cd960d3078ae91b4d19cb649501 | tr '\0' '\n'
<100644 e1e7f7430c0297f22042cff58da5ca73ef121b95 Makefile
>100644 8ee21134577e98fb642dffc5b797a0121645c543 Makefile
<4 2239383d00ae746f5e79ceccf8ac3fbca62f949d kernel
>4 a8fad219cb78a6b6a05a10f8643d615fefc8160f kernel

ie it shows that the Makefile blob has changed, and the kernel directory 
has changed. You then need to recurse into the kernel tree to see what the 
changes were there:

[EMAIL PROTECTED]:~/lx-test/linux-2.6.12-rc2> diff-tree \
    2239383d00ae746f5e79ceccf8ac3fbca62f949d \
    a8fad219cb78a6b6a05a10f8643d615fefc8160f | tr '\0' '\n'
<100644 1a50b58453679b6fee8de4f744f4befc39397bb1 exit.c
>100644 e8df1325bf25816827a1a64404ad533a97bfdae2 exit.c

but it clearly all seems to work. And it means that a subdirectory that 
didn't change at all (the common case) will be able to re-use the old sha1 
file when you create a tree (this may in fact make "diff-tree" much less 
important, since now it tends to handle objects that are just a few kB in 
size, rather than almost a megabyte).
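
What a recursive walk over such per-directory trees might look like, as a
Python sketch. The store layout (tree id -> list of (kind, id, name)
entries), the short ids and the function name are all invented for the
example; this is not the "recursive" option itself. The point it shows is
that a subtree whose id is unchanged is never opened.

```python
def diff_recursive(store, old_id, new_id, prefix=""):
    """Return the paths of files that differ between two tree ids."""
    if old_id == new_id:
        return []                         # identical subtree: skip it entirely
    old = {name: (kind, oid) for kind, oid, name in store.get(old_id, [])}
    new = {name: (kind, oid) for kind, oid, name in store.get(new_id, [])}
    changed = []
    for name in sorted(old.keys() | new.keys()):
        a, b = old.get(name), new.get(name)
        if a == b:
            continue
        path = prefix + name
        a_tree = a is not None and a[0] == "tree"
        b_tree = b is not None and b[0] == "tree"
        if a_tree or b_tree:
            # Descend; a missing or non-tree side is treated as an empty tree.
            changed += diff_recursive(store,
                                      a[1] if a_tree else None,
                                      b[1] if b_tree else None,
                                      path + "/")
        if (a is not None and not a_tree) or (b is not None and not b_tree):
            changed.append(path)          # a file was added, removed or changed
    return changed


toy_store = {
    "root1": [("blob", "m1", "Makefile"), ("tree", "f1", "fs"), ("tree", "k1", "kernel")],
    "root2": [("blob", "m2", "Makefile"), ("tree", "f1", "fs"), ("tree", "k2", "kernel")],
    "k1": [("blob", "e1", "exit.c"), ("blob", "s1", "sched.c")],
    "k2": [("blob", "e2", "exit.c"), ("blob", "s1", "sched.c")],
    "f1": [("blob", "n1", "namei.c")],
}
print(diff_recursive(toy_store, "root1", "root2"))
```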

So in this case, the "commit cost" of changing two files was two small 
tree files (1468 and 679 bytes respectively for the kernel/ and top-level 
directory) and the commit file itself (251 bytes). In addition to the 
actual data files that were changed, of course.

Goodie. Big difference between that and the 460kB of the old monolithic
tree file.

Linus

Re: more git updates..

2005-04-09 Thread Ralph Corderoy


Hi Linus,

> Btw, the NUL-termination makes this really easy to use even in shell
> scripts, ie you can do
> 
>   diff-tree <sha1> <sha1> | xargs -0 do_something
> 
> and you'll get each line as one nice argument to your "do_something"
> script. So a do_diff could be based on something like
> 
>   #!/bin/sh

Watch out for when xargs invokes do_something more than once and the `<'
is parsed by a different one than the `>'.  A `while read ...; do ...
done' would avoid that, but wouldn't like the NULs instead of LFs.
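
One way around both problems is to read the whole NUL-separated stream in
a single process and pair the records up by name. A Python sketch, with
assumptions: the record layout is taken from the output shown in this
thread ("<mode sha1 name" with single spaces), and the function names are
made up.

```python
def parse_records(raw):
    """Split NUL-terminated records into (sign, mode, sha1, name) tuples."""
    records = []
    for chunk in raw.split(b"\0"):
        if not chunk:
            continue                       # trailing NUL leaves an empty chunk
        text = chunk.decode()
        sign, rest = text[0], text[1:]
        mode, sha1, name = rest.split(" ", 2)   # the name may contain spaces
        records.append((sign, mode, sha1, name))
    return records


def pair_changes(records):
    """Group records by name: {name: {'<': (mode, sha1), '>': (mode, sha1)}}."""
    changes = {}
    for sign, mode, sha1, name in records:
        changes.setdefault(name, {})[sign] = (mode, sha1)
    return changes


raw_output = (
    b"<100664 aff074c63ac827801a7d02ff92781365957f1430 update-cache.c\0"
    b">100664 3a672397164d5ff27a19a6888b578af96824ede7 update-cache.c\0"
)
for name, sides in pair_changes(parse_records(raw_output)).items():
    print(name, sides.get("<"), sides.get(">"))
```

Since everything is held in one dictionary, there is no batch boundary at
which a `<' record could be separated from its `>' record.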

Cheers,


Ralph.


Re: more git updates..

2005-04-09 Thread Paul Jackson

Linus wrote:
> the NUL-termination makes this really easy to use even in shell

grumble ...

> I still use the old tools I learnt to use fifteen years ago

newcomer ;)

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401

Re: more git updates..

2005-04-09 Thread Linus Torvalds

On Sat, 9 Apr 2005, Linus Torvalds wrote:
> 
> I suspect that I have to change the file format. Maybe make the "tree" 
> object a two-level thing, and have a "directory" object.
> 
> Then a "tree" object would point to a "directory" object, which would in
> turn point to the individual files (and other "directory" objects, of
> course). That way a commit that only changes a few files will only need to
> create a few new "directory" objects, instead of creating one huge "tree"
> object.

Actually, I guess I wouldn't have to change the format. I could just 
extend the existing "tree" object to be able to point to other trees, and 
that's it.

The downside of that is that then a tree wouldn't have a canonical format 
any more: you could have two trees that have the exact same content, but 
they'd have different names. They should obviously merge very easily (and 
thus you could create a new merge that _does_ have a common name), but 
it's ugly.

I'll have to think about it. It's good to notice these issues early, this 
was the first time I had actually tried to check in a kernel-sized tree 
for real.

Linus

Re: more git updates..

2005-04-09 Thread Linus Torvalds

On Sat, 9 Apr 2005, Petr Baudis wrote:
> 
> > Also, I wrote the "diff-tree" thing I talked about: 
> ..snip..
> 
> Hmm, I wonder, is this better done in C instead of a simple shell
> script, like my gitdiff.sh?

With 17,000 files in the kernel, and most commits just changing a small 
number of them, I actually think "diff-tree" matters. You use "join" 
(which is quite reasonable), but let's put it this way: just the list of 
files in the current kernel is about half a megabyte of data. Ie your 
temporary files that you use in the "ls-tree + ls-tree + join" is actually 
going to be quite sizeable.

My goal here is that the speed of "git" really should be almost totally
independent of the size of the project. You clearly cannot avoid _some_ 
size-dependency: my "diff-tree" clearly also has to work through the same 
1MB of data, but I think it's worth making the constant factor be as small 
as humanly possible.
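
A minimal sketch of why the comparison can stay cheap, assuming both tree listings are already sorted by path: one merge-style pass over the two lists is enough, with no temporary files. The function name `diff_tree` and the tuple layout are illustrative, not the actual C implementation; the sample entries are taken from the "diff-tree" output posted in this thread.

```python
def diff_tree(old, new):
    """Compare two listings of (path, mode, sha1) tuples sorted by path.

    Returns lines in the style of the diff-tree output: '-' removed,
    '+' added, '<' / '>' old and new state of a changed entry.
    """
    out = []
    i = j = 0
    while i < len(old) or j < len(new):
        if j == len(new) or (i < len(old) and old[i][0] < new[j][0]):
            path, mode, sha1 = old[i]          # only in the old tree
            out.append(f"-{mode} {sha1} {path}")
            i += 1
        elif i == len(old) or new[j][0] < old[i][0]:
            path, mode, sha1 = new[j]          # only in the new tree
            out.append(f"+{mode} {sha1} {path}")
            j += 1
        else:
            if old[i][1:] != new[j][1:]:       # same path, mode or sha1 differs
                out.append(f"<{old[i][1]} {old[i][2]} {old[i][0]}")
                out.append(f">{new[j][1]} {new[j][2]} {new[j][0]}")
            i += 1
            j += 1
    return out

old_tree = [
    ("Makefile", "100664", "4870bcf91f8666fc788b07578fb7473eda795587"),
    ("cache.h", "100664", "9e1bee21e17c134a2fb008db62679048fc819528"),
]
new_tree = [
    ("Makefile", "100664", "5493a649bb33b9264e8ed26cc1f832989a307d3b"),
    ("cache.h", "100664", "9e1bee21e17c134a2fb008db62679048fc819528"),
    ("checkout-cache.c", "100664", "fd00e5603dcc4a93acceda0b8cb914fabc8645d5"),
]
for line in diff_tree(old_tree, new_tree):
    print(line)
```

The pass is linear in the size of the two listings, which is the size-dependency that cannot be avoided with a flat tree; only the constant factor is under control.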

I just tried checking in a kernel tree tar-file, and the initial checkin 
(which is all the compression and the sha1 calculations for every single 
file) took about 1:35 (minutes, not hours ;).

Doing a commit (trivial change to the top-level Makefile) and then doing a 
"treediff" between those two things took 0.05 seconds using my C thing. Ie 
we're talking so fast that we really don't care.

Doing a "show-diff" takes 0.15 secs or so (that's all the "stat" calls), 
and now that I test it out I realize that the most expensive operation is 
actually _writing_ the "index" file out. These are the two most expensive 
steps:

[EMAIL PROTECTED]:~/lx-test/linux-2.6.12-rc2> time update-cache Makefile

real    0m0.283s
user    0m0.171s
sys     0m0.113s

[EMAIL PROTECTED]:~/lx-test/linux-2.6.12-rc2> time write-tree
5ca21c9d808fa4bee1eb6948a59dfb9c7d73f36a

real    0m0.441s
user    0m0.354s
sys     0m0.087s

ie with the current infrastructure it looks like I can do a "patch + 
commit" in less than one second on the kernel, and 0.75 secs of that is 
because the "tree" file actually grows pretty large:

cat-file tree 5ca21c9d808fa4bee1eb6948a59dfb9c7d73f36a | wc -c 

says that the uncompressed tree-file is 950,874 bytes. Compressing it 
means that the archival version of it is "just" 462,546 bytes, but this is 
really the part that is going to eat _tons_ of disk-space.

In other words, each "commit" file is very small and cheap, but since 
almost every commit will also imply a totally new tree-file, "git" is 
going to have an overhead of half a megabyte per commit. Oops.

Damn, that's painful. I suspect I will have to change the format somehow.
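
To put a number on that, here is a back-of-the-envelope calculation. The per-tree size is the compressed figure quoted above; the commit counts are hypothetical round numbers chosen only for illustration.

```python
TREE_BYTES_COMPRESSED = 462_546  # compressed tree size quoted in the message

def tree_overhead_bytes(commits, per_tree=TREE_BYTES_COMPRESSED):
    """Disk space used by tree objects if every commit writes a full new tree."""
    return commits * per_tree

# Hypothetical commit counts, not measurements.
for commits in (1, 1_000, 10_000):
    total = tree_overhead_bytes(commits)
    print(f"{commits:>6} commits: {total / 2**20:10.1f} MiB")
```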

One option (which I haven't tested yet) is that since the tree-file is 
already sorted, I could always write it out with the common subdirectory 
part "collapsed", ie instead of writing

...
include/asm-i386/mach-default/bios_ebda.h
include/asm-i386/mach-default/do_timer.h
...

I'd write just

...
///bios_ebda.h
///do_timer.h
...

since the directory names are implied by the predecessor.
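
A rough sketch of that collapsing, under the assumption that each leading "/" stands for one directory component shared with the previous entry (the message only shows the resulting lines, not the exact rule). The helper names `collapse` and `expand` are made up for this illustration.

```python
def collapse(paths):
    """Replace directory components shared with the previous path by '/'."""
    out = []
    prev_dirs = []
    for path in paths:
        parts = path.split("/")
        dirs = parts[:-1]
        shared = 0
        while (shared < len(dirs) and shared < len(prev_dirs)
               and dirs[shared] == prev_dirs[shared]):
            shared += 1
        out.append("/" * shared + "/".join(parts[shared:]))
        prev_dirs = dirs
    return out

def expand(lines):
    """Inverse of collapse(): rebuild full paths from the collapsed form."""
    out = []
    prev_dirs = []
    for line in lines:
        rest = line.lstrip("/")
        shared = len(line) - len(rest)      # number of leading slashes
        parts = prev_dirs[:shared] + rest.split("/")
        out.append("/".join(parts))
        prev_dirs = parts[:-1]
    return out

paths = [
    "include/asm-i386/mach-default/bios_ebda.h",
    "include/asm-i386/mach-default/do_timer.h",
]
collapsed = collapse(paths)
print(collapsed)
print(expand(collapsed) == paths)
```

In this two-entry sample the first path is written in full because it has no predecessor; in the middle of a real listing it would be collapsed as well.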

However, that doesn't help with the 20-byte sha1 associated with each
file, which is also obviously uncompressible, so with 17,000+ files, we
have a minimum overhead of about 350kB per tree-file.

So even if I did the pathname compression, it wouldn't help all that much.  
I'd only be removing the only part of the file that _is_ very
compressible, and I'd probably end up with something that isn't all that
far away from the 450kB+ it is now.
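
The floor can be checked with a quick experiment. This sketch generates 17,000 SHA-1 digests from arbitrary inputs (stand-ins, not real kernel blobs) and runs them through zlib; since hash output looks random, the compressed size should stay at roughly the raw 17,000 x 20 bytes.

```python
import hashlib
import zlib

N_FILES = 17_000  # approximate file count quoted in the message

# Stand-in digests: SHA-1 of the numbers 0..N_FILES-1, 20 raw bytes each.
digests = b"".join(hashlib.sha1(str(i).encode()).digest() for i in range(N_FILES))

raw_size = len(digests)
packed_size = len(zlib.compress(digests, 9))

print(f"raw:        {raw_size} bytes")
print(f"compressed: {packed_size} bytes")
```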

I suspect that I have to change the file format. Maybe make the "tree" 
object a two-level thing, and have a "directory" object.

Then a "tree" object would point to a "directory" object, which would in
turn point to the individual files (and other "directory" objects, of
course). That way a commit that only changes a few files will only need to
create a few new "directory" objects, instead of creating one huge "tree"
object.

Sadly, that will make "tree-diff" potentially more expensive. On the other
hand, maybe not: it will also speed it _up_, since directories that are
totally shared will be trivially seen as such and need no further
operation.
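
A minimal sketch of the two-level idea, with loudly stated assumptions: the entry format (`kind sha1 name` lines), the helper names `store`, `write_tree` and `blob`, and the sample paths are all invented for illustration and are not the format git uses. It writes one object per directory and counts how many new objects a single-file change creates.

```python
import hashlib

def store(objects, payload):
    """Store payload under its SHA-1 name; identical payloads are shared."""
    name = hashlib.sha1(payload).hexdigest()
    objects[name] = payload
    return name

def write_tree(objects, files):
    """files maps relative path -> blob sha1. Returns the top-level tree name."""
    entries = {}
    subdirs = {}
    for path, sha1 in files.items():
        head, _, tail = path.partition("/")
        if tail:
            subdirs.setdefault(head, {})[tail] = sha1
        else:
            entries[head] = ("blob", sha1)
    for name, sub in subdirs.items():
        entries[name] = ("tree", write_tree(objects, sub))
    payload = "".join(f"{kind} {sha1} {name}\n"
                      for name, (kind, sha1) in sorted(entries.items()))
    return store(objects, payload.encode())

def blob(content):
    return hashlib.sha1(content).hexdigest()

files = {
    "Makefile": blob(b"all:\n"),
    "fs/ext2/inode.c": blob(b"ext2 v1\n"),
    "fs/ext3/inode.c": blob(b"ext3 v1\n"),
    "kernel/fork.c": blob(b"fork v1\n"),
}
objects = {}
root_v1 = write_tree(objects, files)
before = len(objects)                    # one object per directory

files["fs/ext2/inode.c"] = blob(b"ext2 v2\n")
root_v2 = write_tree(objects, files)
created = len(objects) - before          # only the directories on the changed path

print(before, created)
```

Only the directories on the path to the changed file get new names. A tree comparison can skip any entry whose name is identical on both sides, which is the speed-up mentioned above.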

Thoughts? That would break the current repository formats, and I'd have to 
create a converter thing (which shouldn't be that bad, of course).

I don't have to do it right now. In fact, I'd almost prefer for the
current thing to become good enough that it's not painful to work with,
since right now I'm using it to develop itself. Then I can convert the
format with an automated script later, before I actually start working on
the kernel...

> BTW, do we care about changed modes? If so, they should probably have
> their place in the diff-tree output.

They're there. If you want to ignore them, you can just notice that the 
sha1 matches between two lines, and then you don't even have to diff them.
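
As a small illustration of that check: the sketch below scans "diff-tree" style lines and reports entries whose sha1 is the same on the "<" and ">" lines, so only the mode differs. The function name and the sample mode change are hypothetical; the sha1 values are borrowed from the output posted in this thread.

```python
def mode_only_changes(records):
    """Return (path, old_mode, new_mode) for '<'/'>' pairs with equal sha1."""
    found = []
    for old, new in zip(records, records[1:]):
        if old[0] != "<" or new[0] != ">":
            continue
        old_mode, old_sha1, old_path = old[1:].split(" ", 2)
        new_mode, new_sha1, new_path = new[1:].split(" ", 2)
        if old_path == new_path and old_sha1 == new_sha1 and old_mode != new_mode:
            found.append((old_path, old_mode, new_mode))
    return found

records = [
    # Hypothetical mode change: same sha1 on both lines.
    "<100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile",
    ">100775 4870bcf91f8666fc788b07578fb7473eda795587 Makefile",
    # Content change: sha1 differs, so the files need a real diff.
    "<100664 9e1bee21e17c134a2fb008db62679048fc819528 cache.h",
    ">100664 56ef561e590fd99e938bd47fd1f2c7ed46126ff0 cache.h",
]
print(mode_only_changes(records))
```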

Linus

Re: more git updates..

2005-04-09 Thread Petr Baudis

Hello,

Dear diary, on Sat, Apr 09, 2005 at 09:45:52PM CEST, I got a letter
where Linus Torvalds <[EMAIL PROTECTED]> told me that...
> The good news is, the data structures/indexes haven't changed, but many of 
> the tools to interface with them have new (and improved!) semantics:
> 
> In particular, I changed how "read-tree" works, so that it now mirrors
> "write-tree", in that instead of actually changing the working directory,
> it only updates the index file (aka "current directory cache" file from
> the tree).
> 
> To actually change the working directory, you'd first get the index file
> setup, and then you do a "checkout-cache -a" to update the files in your
> working directory with the files from the sha1 database.

that's great. I was planning to do something with this since currently
it really annoyed me. I think I will like this, even though I didn't
look at the code itself yet (just on my way).

> Also, I wrote the "diff-tree" thing I talked about: 
..snip..

Hmm, I wonder, is this better done in C instead of a simple shell
script, like my gitdiff.sh? I'd say it is more flexible and probably
hardly performance-critical to have this scripted, and not difficult at
all provided you have ls-tree. But maybe I'm just too fond of my
script... ;-) (Ok, there's some trouble when you want to have newlines
and spaces in file names, and join appears to be awfully ignorant about
this... :[ )

BTW, do we care about changed modes? If so, they should probably have
their place in the diff-tree output.

BTW#2, I hope you will merge my ls-tree anyway, even though there is no
user for it currently... I should quickly figure out some. :-)

> Can you guys re-send the scripts you wrote? They probably need some 
> updating for the new semantics. Sorry about that ;(

I'll try to merge ASAP.

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

Re: more git updates..

2005-04-09 Thread Linus Torvalds

On Sat, 9 Apr 2005, Linus Torvalds wrote:
> 
> To actually change the working directory, you'd first get the index file
> setup, and then you do a "checkout-cache -a" to update the files in your
> working directory with the files from the sha1 database.

Btw, this will not overwrite any old files, so if you have an old version 
of something, you'd need to do "checkout-cache -f -a" (and order matters: 
the "-f" must come first). This time I actually have a big comment at the 
top of the checkout-cache.c file trying to explain the logic.
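
One way such an ordering constraint can arise is when arguments are handled strictly left to right, so a flag only affects what comes after it. The sketch below models that with plain dictionaries; it is an assumption about the behaviour described here, not a transcription of checkout-cache.c, and the function and variable names are made up.

```python
def checkout_cache(args, index, worktree):
    """Process arguments in order; '-f' only affects later arguments."""
    force = False
    for arg in args:
        if arg == "-f":
            force = True
        elif arg == "-a":
            for path, content in index.items():
                if path in worktree and not force:
                    continue            # never overwrite without force
                worktree[path] = content
        else:
            raise ValueError(f"unknown argument: {arg}")
    return worktree

index = {"Makefile": "new version"}

late = checkout_cache(["-a", "-f"], index, {"Makefile": "old version"})
early = checkout_cache(["-f", "-a"], index, {"Makefile": "old version"})

print(late["Makefile"])
print(early["Makefile"])
```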

Linus

more git updates..

2005-04-09 Thread Linus Torvalds


Sorry guys,
 several of you have sent me small fixes and scripts to "git", but I've 
been busy on breaking/changing the core infrastructure, so I didn't get 
around to looking at the scripts yet.

The good news is, the data structures/indexes haven't changed, but many of 
the tools to interface with them have new (and improved!) semantics:

In particular, I changed how "read-tree" works, so that it now mirrors
"write-tree", in that instead of actually changing the working directory,
it only updates the index file (aka "current directory cache" file from
the tree).

To actually change the working directory, you'd first get the index file
setup, and then you do a "checkout-cache -a" to update the files in your
working directory with the files from the sha1 database.

Also, I wrote the "diff-tree" thing I talked about: 

[EMAIL PROTECTED]:~/git> ./diff-tree \
    8fd07d4b7778cd0233ea0a17acd3fe9d710af035 \
    8c6d29d6a496d12f1c224db945c0c56fd60ce941 | tr '\0' '\n'
<100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile
>100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile
<100664 9e1bee21e17c134a2fb008db62679048fc819528 cache.h
>100664 56ef561e590fd99e938bd47fd1f2c7ed46126ff0 cache.h
<100664 fd690acc02ef9c06d7c4c3541f69b10ca4b4f8c9 cat-file.c
>100664 6e6d89291ced17a406e64b97fe8bb96a22eefc9d cat-file.c
+100664 fd00e5603dcc4a93acceda0b8cb914fabc8645d5 checkout-cache.c
<100664 a4a8c3d9ef0c4cc6c82b96b5d1a91ac6d3bed466 commit-tree.c
>100664 236ceb7646e3f5d110fd83f815b82e94cc5b2927 commit-tree.c
+100664 01c92f2620a8e13e7cb7fd98ee644c6b65eeccb7 fsck-cache.c
<100664 0eaa053919e0cc400ab9bc40d9272360117e6978 init-db.c
>100664 815743e92dad7e451c65bab01448ee8ae9deeb56 init-db.c
<100664 e7bfaadd5d2331123663a8f14a26604a3cdcb678 read-cache.c
>100664 71d0cb6fe9b7ff79e3b2c5a61e288ac9f62b39dc read-cache.c
<100664 ec0f167a6a505659e5af6911c97f465506534c34 read-tree.c
>100664 f5c50ba79d02f002b9675fd8f129fa388e3282c6 read-tree.c
<100664 00a29c403e751c2a2a61eb24fa2249c8956d1c80 show-diff.c
>100664 b963dd738989bc92bf02352bbedad13a74e66a7d show-diff.c
<100664 aff074c63ac827801a7d02ff92781365957f1430 update-cache.c
>100664 3a672397164d5ff27a19a6888b578af96824ede7 update-cache.c
<100664 7abeeba116b2b251c12ae32c7b38cb048199b574 write-tree.c
>100664 9525c6fc975888a394477339db86216cd5bd5d7c write-tree.c

(ie the output of "diff-tree" has the same NUL-termination, but if you 
insist on getting ASCII output, you can just use "tr" to change the NUL 
into a NL).

The format of the "diff-tree" output is that the first character is "-"  
for "remove file", "+" for "add file" and "<"/">" for "change file" (where
the "<" shows the old state, and ">" shows the new state).
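
For readers who want to consume this format from a program, here is a minimal parsing sketch. It assumes only what is described above: NUL-terminated records of the form type character, mode, sha1, filename, separated by single spaces. The `Record` type and `parse_diff_tree` name are illustrative.

```python
from typing import NamedTuple

class Record(NamedTuple):
    kind: str   # one of '-', '+', '<', '>'
    mode: str
    sha1: str
    path: str

def parse_diff_tree(data):
    """Split NUL-terminated diff-tree output (bytes) into Record tuples."""
    records = []
    for chunk in data.split(b"\0"):
        if not chunk:
            continue                       # skip the empty tail after the last NUL
        text = chunk.decode("utf-8", errors="surrogateescape")
        kind = text[0]
        if kind not in "-+<>":
            raise ValueError(f"unexpected record type: {kind!r}")
        # Split at most twice so spaces inside the filename survive.
        mode, sha1, path = text[1:].split(" ", 2)
        records.append(Record(kind, mode, sha1, path))
    return records

sample = (
    b"<100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile\0"
    b">100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile\0"
    b"+100664 fd00e5603dcc4a93acceda0b8cb914fabc8645d5 checkout-cache.c\0"
)
for rec in parse_diff_tree(sample):
    print(rec.kind, rec.mode, rec.sha1[:8], rec.path)
```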

Btw, the NUL-termination makes this really easy to use even in shell
scripts, ie you can do

diff-tree <sha1> <sha1> | xargs -0 do_something

and you'll get each line as one nice argument to your "do_something"  
script. So a do_diff could be based on something like

#!/bin/sh
# Each argument is one diff-tree record: "<type><mode> <sha1> <filename>"
while [ "$1" != "" ]; do
    filename="$(echo "$1" | cut -d' ' -f3-)"
    first_sha="$(echo "$1" | cut -d' ' -f2)"
    second_sha="$(echo "$2" | cut -d' ' -f2)"
    c="$(echo "$1" | cut -c1)"
    case "$c" in
    "+")
        echo diff -u /dev/null "$filename($first_sha)";;
    "-")
        echo diff -u "$filename($first_sha)" /dev/null;;
    "<")
        # A "<" record is followed by its ">" partner, so consume both
        echo diff -u "$filename($first_sha)" "$filename($second_sha)"
        shift;;
    *)
        echo "WHAT?"
        exit 1;;
    esac
    shift
done

which really shows what a horrid shell-person I am (I still use the old 
tools I learnt to use fifteen years ago. I bet you can do it trivially in 
perl or something sane, and I'm just stuck in the stone age of UNIX).
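
For comparison, a rough Python equivalent of the shell loop above. It takes the records as a list of strings (as xargs would pass them) and builds the same "diff -u" command lines without running them; the function name `diff_commands` is made up, and the `name(sha1)` operands are placeholders exactly as in the shell version.

```python
def diff_commands(records):
    """Turn diff-tree records into 'diff -u' argument lists (not executed)."""
    cmds = []
    i = 0
    while i < len(records):
        rec = records[i]
        kind = rec[0]
        _mode, sha1, filename = rec[1:].split(" ", 2)
        if kind == "+":
            cmds.append(["diff", "-u", "/dev/null", f"{filename}({sha1})"])
        elif kind == "-":
            cmds.append(["diff", "-u", f"{filename}({sha1})", "/dev/null"])
        elif kind == "<":
            if i + 1 >= len(records) or records[i + 1][0] != ">":
                raise ValueError(f"'<' record without '>' partner: {rec}")
            new_sha1 = records[i + 1][1:].split(" ", 2)[1]
            cmds.append(["diff", "-u", f"{filename}({sha1})", f"{filename}({new_sha1})"])
            i += 1                      # the '>' record has been consumed too
        else:
            raise ValueError(f"unexpected record: {rec}")
        i += 1
    return cmds

records = [
    "<100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile",
    ">100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile",
    "+100664 fd00e5603dcc4a93acceda0b8cb914fabc8645d5 checkout-cache.c",
]
for cmd in diff_commands(records):
    print(" ".join(cmd))
```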

That makes it _very_ easy to parse. The example above is the diff between 
the initial commit and one of the more recent trees, so it has changes to 
everything, but a more normal thing would be

[EMAIL PROTECTED]:~/git> diff-tree \
    787763499dc4f8cc345bc6ed8ee1e0ae31adedd6 \
    5b0c2695634b5bab2f5d63c7bb30f7e5815af470 | tr '\0' '\n'
<100664 01c92f2620a8e13e7cb7fd98ee644c6b65eeccb7 fsck-cache.c
>100664 81aa7bee003264ea302db835158e725eefa4012d fsck-cache.c

which tells you that the last commit changed just one file (it's from this 
one:

[EMAIL PROTECTED]:~/git> cat-file commit `cat .dircache/HEAD`
tree 5b0c2695634b5bab2f5d63c7bb30f7e5815af470
parent 81c53a1d3551f358860731481bb2d87179d221e6
author Linus Torvalds <[EMAIL PROTECTED]> Sat Apr  9 12:02:30 2005
committer Linus Torvalds <[EMAIL PROTECTED]> Sat Apr  9 12:02:30 2005
