Re: more git updates..
[EMAIL PROTECTED] (H. Peter Anvin) wrote on 11.04.05 in <[EMAIL PROTECTED]>:

> Followup to:  <[EMAIL PROTECTED]>
> By author:    Christopher Li <[EMAIL PROTECTED]>
> In newsgroup: linux.dev.kernel
> >
> > There is one problem though. How about the SHA1 hash collision?
> > Even if the chance is very remote, you don't want to lose some data due
> > to "software" error. I think it is OK not to handle that
> > case right now. On the other hand, it will be nice to detect that
> > and give out a big error message if it really happens.
> >
>
> If you're actually worried about it, it'd be better to just use a
> different hash, like one of the SHA-2's (probably a better choice
> anyway), instead of SHA-1.

How could that help? *Every* hash has hash collisions. It's an
unavoidable result of using fewer bits than the original data has.

Kind regards, Kai
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
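Kai's pigeonhole argument can be made concrete with a deliberately tiny hash. The sketch below is only an illustration: `short_hash` and `find_collision` are hypothetical helpers, and truncating SHA-1 to two bytes is an assumption chosen so that a collision shows up quickly. With only 65,536 possible values, a collision is guaranteed within 65,537 distinct inputs; the same argument applies to any fixed-size hash, just with a vastly larger search.

```python
import hashlib


def short_hash(data: bytes, nbytes: int = 2) -> bytes:
    """Return the first nbytes of SHA-1: a tiny hash, for illustration only."""
    return hashlib.sha1(data).digest()[:nbytes]


def find_collision(nbytes: int = 2):
    """Hash the strings "0", "1", "2", ... until two share a short hash.

    Must terminate: there are only 256**nbytes possible hash values,
    but an unlimited supply of distinct inputs (pigeonhole principle).
    """
    seen = {}  # short hash -> first input that produced it
    i = 0
    while True:
        data = str(i).encode()
        h = short_hash(data, nbytes)
        if h in seen:
            return seen[h], data, h
        seen[h] = data
        i += 1
```

Calling `find_collision()` returns two different byte strings together with the short hash they share. Switching to a longer hash makes the search impractical, but it cannot make collisions impossible.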
Re: more git updates..
On Thu, Apr 14, 2005 at 01:42:11AM +0200, Krzysztof Halasa wrote:
> Matt Mackall <[EMAIL PROTECTED]> writes:
>
> > Now if you can assume that blobs never change and are never deleted,
> > you can simply append them all onto a log, and then index them with a
> > separate file containing an htree of (sha1, offset, length) or the
> > like.
>
> That means a problem with rsync, though.

I believe 200k inodes is a problem for rsync too. But we can simply grab
the remote htree, do a tree compare, find the ranges of the remote file
we need, sort and merge the ranges, and then pull them. That will surely
trounce rsync.

--
Mathematics is the supreme nostalgia of our time.
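The "find the ranges, sort and merge, then pull" step might look roughly like the sketch below. Everything here is an assumption for illustration: the remote index is modelled as a plain dict of key -> (offset, length) rather than an htree, and `ranges_to_fetch` and `max_gap` are hypothetical names. `max_gap` lets nearby ranges be fetched as one request when the gap between them is small.

```python
def ranges_to_fetch(local_keys, remote_index, max_gap=0):
    """Return merged (start, end) byte ranges of the remote log we lack.

    local_keys   -- set of object keys already present locally
    remote_index -- dict: key -> (offset, length) in the remote log
    max_gap      -- merge two ranges if at most this many bytes apart
    """
    # Byte ranges of every object we do not have, sorted by offset.
    missing = sorted(
        (offset, offset + length)
        for key, (offset, length) in remote_index.items()
        if key not in local_keys
    )
    merged = []
    for start, end in missing:
        if merged and start <= merged[-1][1] + max_gap:
            # Touches or nearly touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(start, end) for start, end in merged]
```

Since blobs are only ever appended in this model, objects that are new on the remote side tend to sit next to each other at the end of the log, so the merged list is usually short.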
Re: more git updates..
Matt Mackall <[EMAIL PROTECTED]> writes:

> Now if you can assume that blobs never change and are never deleted,
> you can simply append them all onto a log, and then index them with a
> separate file containing an htree of (sha1, offset, length) or the
> like.

That means a problem with rsync, though.

BTW: I think the bandwidth increase compared to bkcvs isn't that obvious.
After a file is modified with git, it has to be transmitted (plus small
additional things). If a file is modified with bkcvs, it has to be
transmitted (the whole RCS file) as well.

Only the initial rsync would be much smaller with bkcvs.
--
Krzysztof Halasa
Re: Re: more git updates..
On Tue, Apr 12, 2005 at 06:10:27PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> >
> > I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
> > the CVS/SCCS format as storage may be more appealing than the current
> > git format.
>
> Go wild. I did mine in six days, and you've been whining about other
> peoples SCM's for three years.

I wrote a hack to do efficient delta storage with O(1) seeks for
lookup and append last week. I believe it's been integrated into the
latest Bazaar-NG. I expect it'll give better compression and
performance than BK. Of course it ends up being O(revisions) for
modifications or insertions (but that is probably a non-issue for the
SCM models we're looking at).

The git model is obviously very different, but I worry about the slop
space implied. With 200k file revisions and an average of 2k slop per
file, that's 400MB of slop, or almost the size of an equivalent delta
compressed kernel repo.

Now if you can assume that blobs never change and are never deleted,
you can simply append them all onto a log, and then index them with a
separate file containing an htree of (sha1, offset, length) or the
like. Since the key is already a strong hash, this is an excellent
match and avoids rehashing in the kernel's directory lookup. And it'll
save an inode, a directory entry, and about half a data block per
entry. "Open" will also be cheaper as there's no per-revision inode to
grab.

I could hack on this if you think it fits with the git model,
otherwise I'll go back to my other experiments..

--
Mathematics is the supreme nostalgia of our time.
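A minimal sketch of the append-only log plus index idea, under stated assumptions: `BlobLog` is a hypothetical name, an in-memory dict stands in for the on-disk htree, and an `io.BytesIO` buffer stands in for the log file (any seekable binary file object should work the same way).

```python
import hashlib
import io


class BlobLog:
    """Append-only blob log with a (sha1 -> offset, length) index."""

    def __init__(self, log=None):
        # Seekable binary file object; BytesIO keeps the sketch self-contained.
        self.log = log if log is not None else io.BytesIO()
        # A dict stands in for the htree of (sha1, offset, length) entries.
        self.index = {}

    def append(self, data: bytes) -> str:
        """Store data once and return its hex SHA-1 key."""
        key = hashlib.sha1(data).hexdigest()
        if key in self.index:
            return key  # blobs never change, so nothing to rewrite
        self.log.seek(0, io.SEEK_END)
        offset = self.log.tell()
        self.log.write(data)
        self.index[key] = (offset, len(data))
        return key

    def read(self, key: str) -> bytes:
        """Fetch a blob with one index lookup and one seek."""
        offset, length = self.index[key]
        self.log.seek(offset)
        return self.log.read(length)
```

Because blobs are packed back to back, there is no per-file block rounding, which is where the "slop" in the one-file-per-object layout comes from: 200k revisions at about 2k of slop each is roughly 400MB.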
Re: Re: more git updates..
On Wed, 13 Apr 2005, Russell King wrote:
>
> And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which
> is more dense than CVS.
>
> BK is also a lot better than CVS. So _your_ point is?

Hey, anybody who wants to argue that BK is better than GIT won't be
getting any counter-arguments from me.

The fact is, I have constraints. Like needing something to work within a
few days. If somebody comes up with an ultra-fast, replicatable, space
efficient SCM in three days, I'm all over it.

In the meantime, I'd suggest people who worry about network bandwidth try
to work out a synchronization protocol that allows you to send "diff
updates" between git repositories. The git model doesn't preclude looking
at the objects and sending diffs instead (and re-creating the objects on
the other side).

But my time-constraints _do_.

		Linus
Re: Re: more git updates..
On Tue, Apr 12, 2005 at 06:10:27PM -0700, Linus Torvalds wrote:
> Go wild. I did mine in six days, and you've been whining about other
> peoples SCM's for three years.

Even if I spent 6 days doing git, you'd never have thrown away BK in
exchange for git.

> In other words - go and _do_ something instead of whining. I'm not
> interested.

CVS and SVN are already an order of magnitude more efficient than git at
storing and exporting the data, and they shouldn't annoy you during the
checkins either. They have a backend much more efficient than git too,
and yet you seem not to care about them.

My suggestion was simply to at least change git to coalesce the diffs
like CVS/SCCS. I'm only making a suggestion to give git a chance to have
a backend at least as efficient as the one that CVS uses, and to avoid
running rsync on a 2.8G uncompressible blob.

I don't have enough spare time to do something myself. My spare time
would be too short anyway to make a difference in SCM space, so I'd
rather spend it all in a more innovative space where it might have a
slight chance to make a difference.
Re: Re: more git updates..
On Wed, Apr 13, 2005 at 10:30:52AM +0100, Russell King wrote:
> And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which
> is more dense than CVS.

Yep, this is why I mentioned SCCS format too. I didn't know it was even
smaller, but I expected a similar density from SCCS.

> Note: I'm _not_ arguing with your sentiments towards CVS. However, I
> think the space usage point still stands.

If it wasn't for network synchronization it almost wouldn't matter, but
fetching 2.8G uncompressible when I could simply fetch 220MB
compressible (that will compress with zlib at little cost during rsync
to less than 78M) sounds a bit overkill.

> What is the space usage behaviour when you have multiple git trees?

Multiple trees in the sense of pulls from multiple developers aren't more
costly than a normal checkin, due to the "soft hardlink" property of the
hashes. It's just every checkin taking lots of space, and generating new
uncompressible blobs every time a changeset touches one file.
Re: Re: more git updates..
On Tue, Apr 12, 2005 at 04:45:07PM -0700, Linus Torvalds wrote:
> On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> > At the rate of 9M for every 198 changeset checkins, that means I'll have
> > to download 2.7G _uncompressible_ (i.e. already compressed with a bad
> > per-file ratio due the too-small files) for a whole pack including all
> > changesets without accounting the original 111MB of the original tree,
> > with rsync -z of git. That compares with 514M _compressible_ with CVS
> > format on-disk, and with ~79M of the CVS-network download with rsync -z of
> > the CVS repository (assuming default gzip compression level).
>
> Yes. CVS is much denser.
>
> CVS is also total crap. So your point is?

And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which
is more dense than CVS.

BK is also a lot better than CVS. So _your_ point is? 8)

Note: I'm _not_ arguing with your sentiments towards CVS. However, I
think the space usage point still stands.

What is the space usage behaviour when you have multiple git trees?
Do we need a git relink command in git-pasky? 8)

--
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core
Re: more git updates..
Hi,

Linus Torvalds wrote on Tue, 12 Apr 2005 15:49:07 -0700:

>> Have you tried to import it?
>
> It would take days.

You can always import it later and then graft it into the commit tree.

That would of course change *every* commit node, but so what? They're
small, and you can delete the old ones when you're done.

--
Matthias Urlichs | {M:U} IT Design @ m-u-it.de | [EMAIL PROTECTED]
Re: Re: more git updates..
On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
>
> I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
> the CVS/SCCS format as storage may be more appealing than the current
> git format.

Go wild. I did mine in six days, and you've been whining about other
peoples SCM's for three years.

In other words - go and _do_ something instead of whining. I'm not
interested.

		Linus
Re: Re: more git updates..
On Tue, Apr 12, 2005 at 04:45:07PM -0700, Linus Torvalds wrote:
> Yes. CVS is much denser.
>
> CVS is also total crap. So your point is?

I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
the CVS/SCCS format as storage may be more appealing than the current
git format.

I guess I should have said RCS instead of CVS, sorry if that created any
confusion.

The arch/darcs approach of practically storing patches would also be much
denser, but it has no efficient way of doing "rcs up -p 1.x" on a file
that doesn't involve potentially unpacking tons of unrelated changesets.
Re: Re: more git updates..
On Tue, Apr 12, 2005 at 02:21:58PM -0700, Linus Torvalds wrote:
> The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one
> and a test-run of 198 patches from Andrew) is 111MB. In other words,
> adding 198 "full" new kernels only grew the archive by 9MB (that's all
> "actual disk usage" btw - the files themselves are smaller, but since they
> all end up taking up a full disk block..)

reiserfs can do tail packing, plus the disk block is meaningless when
fetching the data from the network, which is the real cost to worry about
when synchronizing and downloading (disk cost isn't a big deal).

The pagecache cost sounds like a very minor one too, since you don't need
the whole data in ram, and not even all dentries need to be in cache.
This is one of the reasons why you don't need to run readdir, and why you
can discard the old trees anytime.

At the rate of 9M for every 198 changeset checkins, that means I'll have
to download 2.7G _uncompressible_ (i.e. already compressed with a bad
per-file ratio due to the too-small files) for a whole pack including all
changesets, without accounting for the original 111MB of the original
tree, with rsync -z of git. That compares with 514M _compressible_ with
CVS format on-disk, and with ~79M of the CVS-network download with
rsync -z of the CVS repository (assuming default gzip compression level).

What BKCVS provided with 79M of rsync -z is now provided with 2.8G of
rsync -z, with a network-bound slowdown of -97.2%. Similar slowdowns
should be expected for synchronizations over time while fetching new
blobs etc...

Ok, BKCVS has less than 6 checkins due to the linearization and
coalescing of pulls that couldn't be represented losslessly in CVS, so
the network-bound slowdown is less than -97.2%. My math is approximate,
but the order of magnitude should remain the same.

Clearly one can write an ad-hoc network protocol instead of using
rsync/wget, but the server will need quite a bit of cpu and ram to do a
checkout/update/sync efficiently, to unpack all data and create all
changesets to gzip and transfer.

Anyway, git's simplicity and the robustness of immutable hashes certainly
make it an ideal interim format (and it may even be a very practical
local live format on-disk, except for the backups). I'm only unsure if
it's a wise idea to build an SCM on top of the current git format, or if
it's better to use something like SCCS or CVS to coalesce all diffs of a
single file together, to save space and make rsync -z very efficient too
(or an approach like arch and darcs that stores changesets per file, i.e.
patches).
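The back-of-the-envelope numbers above can be replayed as below. This is only a sketch: it assumes archive growth stays linear in the number of changesets, and it takes the "64 K changes" figure from elsewhere in the thread as 64,000. The function names are made up for illustration.

```python
# Figures quoted in the thread.
GROWTH_MB = 9          # archive growth measured in the test run
CHANGESETS = 198       # changesets in that test run
BKCVS_RSYNC_MB = 79    # "rsync -z" download of the CVS repository
GIT_RSYNC_MB = 2800    # the 2.8G estimate for the git archive


def projected_growth_mb(n_changesets: int) -> float:
    """Extrapolate archive growth linearly from the 198-changeset run."""
    return n_changesets * GROWTH_MB / CHANGESETS


def relative_saving(small_mb: float, large_mb: float) -> float:
    """Fraction of the larger download avoided by fetching the smaller."""
    return 1 - small_mb / large_mb
```

`projected_growth_mb(64_000)` comes to about 2909 MB, in line with the "~ 3 GB" guess, and `relative_saving(BKCVS_RSYNC_MB, GIT_RSYNC_MB)` is about 0.972, which is where the 97.2% figure comes from.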
Re: Re: more git updates..
Hi David,

On Tue, Apr 12, 2005 at 06:36:23PM -0400, David Eger wrote:
> > No. A tree is not the full data. A tree contains enough information to
> > _recreate_ the full data, but the tree itself just tells you _how_ to do
> > that. It doesn't contain very much of the data itself at all.
>
> Perhaps I'd understand this if you tell me what "recreate" means.
> If I have a SHA1 hash of a file, and I have the file, I can verify that
> said file has the SHA1 hash it's supposed to have, but I can't generate
> the file from its hash...

But if you have that hexified SHA1 hash of a particular file you want to
access, there would be a file with a filename equal to that hexified SHA1
hash which contained the compressed contents of the file you're looking
for.

At least, that's how I understood it...

With friendly regards,
Takis

--
OpenPGP key: http://lumumba.luc.ac.be/takis/takis_public_key.txt
fingerprint: 6571 13A3 33D9 3726 F728 AA98 F643 B12E ECF3 E029
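Takis's reading can be sketched as a tiny content-addressed store: the file name is the hex SHA-1 and the file body is the zlib-compressed contents. This is an illustration only, not git's actual on-disk format (which details such as object headers, and exactly which bytes get hashed, are not reproduced here); `write_object` and `read_object` are hypothetical helper names.

```python
import hashlib
import os
import zlib


def write_object(objdir: str, data: bytes) -> str:
    """Store data under its hex SHA-1 name, zlib-compressed."""
    name = hashlib.sha1(data).hexdigest()
    path = os.path.join(objdir, name)
    if not os.path.exists(path):  # same content always gets the same name
        with open(path, "wb") as f:
            f.write(zlib.compress(data))
    return name


def read_object(objdir: str, name: str) -> bytes:
    """'Recreate' the data from its name by reading the file so named."""
    with open(os.path.join(objdir, name), "rb") as f:
        data = zlib.decompress(f.read())
    # The name doubles as an integrity check on what was read back.
    if hashlib.sha1(data).hexdigest() != name:
        raise ValueError("object %s is corrupt" % name)
    return data
```

So the hash alone cannot regenerate a file, but it is a lookup key into a store that can, which is the sense of "recreate" being discussed.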
Re: Re: more git updates..
On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
>
> At the rate of 9M for every 198 changeset checkins, that means I'll have
> to download 2.7G _uncompressible_ (i.e. already compressed with a bad
> per-file ratio due the too-small files) for a whole pack including all
> changesets without accounting the original 111MB of the original tree,
> with rsync -z of git. That compares with 514M _compressible_ with CVS
> format on-disk, and with ~79M of the CVS-network download with rsync -z of
> the CVS repository (assuming default gzip compression level).

Yes. CVS is much denser.

CVS is also total crap. So your point is?

		Linus
Re: more git updates..
On Wed, 13 Apr 2005, Krzysztof Halasa wrote:
>
> Does that mean that the 64 K changes imported from bk would take ~ 3 GB?
> Is that real?

That's a _guess_.

> Have you tried to import it?

It would take days.

> I'm going to import the CVS data (with cvsps) - as the CVS "misses" half
> the changes, the resulting archive should be half in size too?

No. The CVS archive is going to be almost the same size. BKCVS gets about
98% of all the data. It just doesn't show the complex merge graphs, but
those are "small" in comparison.

> I don't know how much space did bk use, but 3 GB for the full history
> is reasonable for most people, isn't it? Especially that one can purge
> older data.

I think it's entirely reasonable, yes. But I may be off by an order of
magnitude. I based the 3GB on estimating from the sparse tree, but I
wasn't being too careful. Andrew estimated 2GB per year (at our current
historical rate of changes) based on my merge with him. So it's in that
general range of 3-6GB, I think.

		Linus
Re: more git updates..
Linus Torvalds <[EMAIL PROTECTED]> writes:

> The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one
> and a test-run of 198 patches from Andrew) is 111MB. In other words,
> adding 198 "full" new kernels only grew the archive by 9MB (that's all
> "actual disk usage" btw - the files themselves are smaller, but since they
> all end up taking up a full disk block..)

Does that mean that the 64 K changes imported from bk would take ~ 3 GB?
Is that real? Have you tried to import it?

I'm going to import the CVS data (with cvsps) - as the CVS "misses" half
the changes, the resulting archive should be half in size too?

I don't know how much space bk used, but 3 GB for the full history
is reasonable for most people, isn't it? Especially since one can purge
older data.
--
Krzysztof Halasa
Re: Re: more git updates..
On Tue, Apr 12, 2005 at 02:21:58PM -0700, Linus Torvalds wrote:
>
> Yes. A tree is defined by the blobs it references (and the subtrees) but
> it doesn't _contain_ them. It just contains a pointer to them.

A pointer to them? You mean a SHA1 hash of them? Or what?
Where is the *real* data stored? The real files, the real patches?
Are these somewhere completely outside of git?

> > Therefore, "TREE" must be the *full* data, and since we have the following
> > definition for CHANGESET:
>
> No. A tree is not the full data. A tree contains enough information to
> _recreate_ the full data, but the tree itself just tells you _how_ to do
> that. It doesn't contain very much of the data itself at all.

Perhaps I'd understand this if you tell me what "recreate" means.
If I have a SHA1 hash of a file, and I have the file, I can verify that
said file has the SHA1 hash it's supposed to have, but I can't generate
the file from its hash...

Sorry for being stubbornly dumb, but you'll have a couple of us puzzling
at the README ;-)

-dte
Re: Re: more git updates..
On Tue, 12 Apr 2005, David Eger wrote:
>
> The reason I am questioning this point is the GIT README file.
>
> Linus makes explicit that a "blob" is just the "file contents," and that
> really, a "blob" is not just the SHA1 of the "blob":
>
> > In particular, the "current directory cache" certainly does not need to
> > be consistent with the current directory contents, but it has two very
> > important attributes:
> >
> > (a) it can re-generate the full state it caches (not just the directory
> >     structure: through the "blob" object it can regenerate the data too)
>
> And he defines "TREE" with the same name: blob

Yes. A tree is defined by the blobs it references (and the subtrees) but
it doesn't _contain_ them. It just contains a pointer to them.

> Therefore, "TREE" must be the *full* data, and since we have the following
> definition for CHANGESET:

No. A tree is not the full data. A tree contains enough information to
_recreate_ the full data, but the tree itself just tells you _how_ to do
that. It doesn't contain very much of the data itself at all.

> That each changeset remembers *everything* for *each point in the tree*.

But only BY REFERENCE.

A "commit" is usually very small. For example, the top-of-tree commit-file
for my current kernel test is literally 401 _bytes_ in size. Because it
just references a tree (20 bytes of _reference_).

> Linus, if you actually mean to differentiate between the full data
> and a SHA1 of the data

There is no differentiation. The sha1 _is_ the data as far as git is
concerned. It's only confusing if you think they are different.

> Also, the details of just what data constitutes a 'changeset' would be
> lovely... i.e. a precise spec of what Pat is describing below...

	[EMAIL PROTECTED]:~/test-tools/linux-2.6.12-rc2> cat-file commit `cat .git/HEAD `
	tree cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6
	parent c7a1a189dd0fe2c6ecd0aa33f2bd2f414c7892a0
	author NeilBrown <[EMAIL PROTECTED]> Tue Apr 12 08:27:08 2005
	committer Linus Torvalds <[EMAIL PROTECTED]> Tue Apr 12 08:27:08 2005

	[PATCH] md: remove a number of misleading calls to MD_BUG

	The conditions that cause these calls to MD_BUG are not kernel bugs, just
	oddities in what userspace is asking for.

	Also convert analyze_sbs to return void, and the value it returned was
	always 0.

	Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
	Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
	Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>

That's it. In all its glory. Compressed and tagged it's 401 bytes.

The tree it references is 677 bytes in size. That in turn references a
number of subtrees, but almost all of the sub-trees are shared with
_other_ tree commits, so their size is spread out over all the commits.

The full archive of the 2.6.12-rc2 kernel that I used for testing (only
_one_ version) is 102MB in size. That's about half of what the kernel is
uncompressed.

The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one
and a test-run of 198 patches from Andrew) is 111MB. In other words,
adding 198 "full" new kernels only grew the archive by 9MB (that's all
"actual disk usage" btw - the files themselves are smaller, but since they
all end up taking up a full disk block..)

Basically, the whole point of git is that objects are equated with their
sha1 name, and that you can thus "include" an object by just referring to
its name. The two are equivalent.

		Linus
Re: Re: more git updates..
The reason I am questioning this point is the GIT README file.

Linus makes explicit that a "blob" is just the "file contents," and that
really, a "blob" is not just the SHA1 of the "blob":

> In particular, the "current directory cache" certainly does not need to
> be consistent with the current directory contents, but it has two very
> important attributes:
>
> (a) it can re-generate the full state it caches (not just the directory
>     structure: through the "blob" object it can regenerate the data too)

And he defines "TREE" with the same name: blob

> TREE: The next hierarchical object type is the "tree" object. A tree
> object is a list of permission/name/blob data, sorted by name.

Therefore, "TREE" must be the *full* data, and since we have the following
definition for CHANGESET:

> A "changeset" is defined by the tree-object that it results in, the
> parent changesets (zero, one or more) that led up to that point, and a
> comment on what happened.

it follows that each changeset remembers *everything* for *each point in
the tree*.

Linus, if you actually mean to differentiate between the full data
and a SHA1 of the data, *please please please* say "blob" in one place and
"SHA1 of the blob" elsewhere. It's quite confusing, to me at least.

Also, the details of just what data constitutes a 'changeset' would be
lovely... i.e. a precise spec of what Pat is describing below...

-dte

> where David Eger <[EMAIL PROTECTED]> told me that...
> > So with git, *every* changeset is an entire (compressed) copy of the
> > kernel. Really? Every patch you accept adds 37 MB to your hard disk?
> >
> > Am I missing something here?
>
> Yes. Only changed files re-appear. The unchanged files keep the same
> SHA1 hash, therefore they don't re-appear in the repository.
>
> So, if Linus gets a patch which sanitizes drivers/char/selection.c,
> only these new objects appear in the repository:
>
>	drivers/char/selection.c
>	drivers/char
>	drivers
>	. (project root)
>	commit message
>
Re: more git updates..
On Sun, Apr 10, 2005 at 09:01:22AM -0700, Linus Torvalds wrote:
>
> So I was for a while debating having a totally flat directory space, but
> since there are _some_ downsides (linear lookup for cold-cache, and just
> that "ls -l" ends up being O(n**2) and things), I decided that a single
> fan-out is probably a good idea.
>
Isn't that fixed even in ext2/ext3 these days?

man mke2fs:

	dir_index
		Use hashed b-trees to speed up lookups in large directories.

Also, the popular reiserfs was designed with this in mind from the start.

> > Or maybe the files should be named objects/xx/yy/?
>
> Hey, I may end up being wrong, and yes, maybe I should have done a
> two-level one.

Unless there are still performance issues, please don't. A directory
structure with extra levels is necessarily harder to use if one ever
has to use it manually somehow.

Helge Hafting
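For readers unfamiliar with the fan-out being debated, the mapping from object name to path could be sketched like this. The `objects` root and the two-hex-digit split follow the "objects/xx/yy/" naming used in the thread; `object_path` and its `levels` parameter are hypothetical, introduced only to contrast the flat, single-level and two-level layouts.

```python
import posixpath


def object_path(sha1_hex: str, levels: int = 1, root: str = "objects") -> str:
    """Map a hex SHA-1 name to its path with `levels` fan-out directories.

    levels=0 is a flat directory, levels=1 a single fan-out (256
    subdirectories), levels=2 the two-level "objects/xx/yy/" variant.
    """
    # Each fan-out level consumes two hex digits of the name.
    parts = [sha1_hex[2 * i:2 * i + 2] for i in range(levels)]
    return posixpath.join(root, *parts, sha1_hex[2 * levels:])
```

With the tree hash shown elsewhere in the thread, a single fan-out puts the object in directory `objects/cf`. Each extra level divides the entries per directory by 256, at the cost of one more directory to walk by hand, which is the trade-off Helge is pointing at.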
Re: Re: more git updates..
Dear diary, on Tue, Apr 12, 2005 at 06:05:19AM CEST, I got a letter
where David Eger <[EMAIL PROTECTED]> told me that...
> So with git, *every* changeset is an entire (compressed) copy of the
> kernel. Really? Every patch you accept adds 37 MB to your hard disk?
>
> Am I missing something here?

Yes. Only changed files re-appear. The unchanged files keep the same
SHA1 hash, therefore they don't re-appear in the repository.

So, if Linus gets a patch which sanitizes drivers/char/selection.c,
only these new objects appear in the repository:

	drivers/char/selection.c
	drivers/char
	drivers
	. (project root)
	commit message

Kind regards,

--
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
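Pasky's list of new objects can be checked with a toy model. The sketch below assumes a simplified encoding (files as bytes, directories as dicts, an invented "blob"/"tree" prefix) rather than git's real object format, and `store` is a hypothetical helper. It stores a small tree twice, before and after changing one file, and looks at which object names are new.

```python
import hashlib


def store(node, objects: dict) -> str:
    """Store a file (bytes) or directory (dict of name -> node).

    Returns the object's hex SHA-1 and records it in `objects`. A
    directory is hashed over the names and hashes of its entries, so
    its name changes only if something beneath it changed.
    """
    if isinstance(node, bytes):
        payload = b"blob\0" + node
    else:
        entries = sorted((name, store(child, objects))
                         for name, child in node.items())
        listing = "".join("%s %s\n" % (n, h) for n, h in entries)
        payload = b"tree\0" + listing.encode()
    oid = hashlib.sha1(payload).hexdigest()
    objects[oid] = payload
    return oid
```

Storing a tree, then the same tree with only `drivers/char/selection.c` modified, adds exactly four objects in this model: the new file contents plus the `drivers/char`, `drivers` and root directories. The commit message would be the fifth; untouched files and directories such as `drivers/net` are shared.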
Re: more git updates..
On Mon, Apr 11, 2005 at 10:14:13PM -0700, David Lang wrote:
> I've been reading this and have another thought for you guys to keep in
> mind for this tool.
>
> version control of system config files on linux systems.

I've been thinking about this too. (I won't have time to implement this
however. If I do have time in the near future to do anything involving
git, it probably won't have anything to do with version control of
config files.)

> it sounds like you could put the / filesystem under the control of git
> (after teaching it to not cross filesystem boundaries so you can have
> another filesystem to work with) and version control your entire system.
> if this was done it would be nice to add an item type that would reference
> a file in a distro package to save space. it sounds like you could run a
> git checkin daily (as part of the updatedb run for example) at very little
> cost.

I was thinking that the GIT checkin should actually be done by the
distro configuration tools, and not as a cronjob. And maybe the config
tools could do two checkins if there were any manual changes since the
last checkin, or something. (That is, one checkin to check in the manual
changes since the last checkin, and another to check in whatever the
config tool just did.)

Now that I think about it, it would be really good to have a simple tool
for doing a manual checkin after manual editing of config files, but I
think something like the dual-checkin scheme would be needed as a safety
net in case root forgets to do the checkin.

-Barry K. Nathan <[EMAIL PROTECTED]>
Re: more git updates..
So with git, *every* changeset is an entire (compressed) copy of the
kernel. Really? Every patch you accept adds 37 MB to your hard disk?

Am I missing something here?

-dte
Re: more git updates..
David wrote:
> and version control your entire system

Yeah - that works. That's how I back up my system. Not git actually,
but a similar sort of store (no compression, a line oriented ascii
'index' file).

See my post on "Kernel SCM saga..", Sat, 9 Apr 2005 08:15:53 -0700,
Message-Id: <[EMAIL PROTECTED]>

--
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: more git updates..
I've been reading this and have another thought for you guys to keep in mind for this tool: version control of system config files on Linux systems.

It sounds like you could put the / filesystem under the control of git (after teaching it not to cross filesystem boundaries, so you can have another filesystem to work with) and version control your entire system. If this were done, it would be nice to add an item type that would reference a file in a distro package to save space.

It sounds like you could run a git checkin daily (as part of the updatedb run, for example) at very little cost. For that matter, by comparing the git data between servers (or between a server and an archive) you could easily use it to detect tampering.

Sounds very interesting, but I'm going to let things settle down a bit before I try to tackle this (but you guys who are working on it should feel free to add the couple of options necessary to implement this if you want ;-)

David Lang

On Sun, 10 Apr 2005, Christopher Li wrote:

..snip..

-- 
There are two ways of constructing a software design. One way is to
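The tamper-detection idea above (compare the stored hashes of one machine against another machine or an archive) can be sketched in a few lines. This is only an illustration: `manifest` and `compare` are hypothetical helper names, not git commands, and the sketch hashes raw file contents with SHA-1 rather than reading a real git object store. It also does not stop at filesystem boundaries, which the message above says a real tool would need to do.

```python
import hashlib
import os
import tempfile


def manifest(root):
    """Map each file under root (path relative to root) to the SHA-1 of its contents."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as fh:
                digest = hashlib.sha1(fh.read()).hexdigest()
            result[os.path.relpath(full, root)] = digest
    return result


def compare(reference, current):
    """Return (added, removed, modified) path lists between two manifests."""
    added = sorted(set(current) - set(reference))
    removed = sorted(set(reference) - set(current))
    modified = sorted(path for path in set(reference) & set(current)
                      if reference[path] != current[path])
    return added, removed, modified


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "passwd")
        with open(target, "w") as fh:
            fh.write("root:x:0:0\n")
        before = manifest(root)          # the trusted snapshot
        with open(target, "a") as fh:
            fh.write("evil:x:0:0\n")     # simulated tampering
        print(compare(before, manifest(root)))
```

The comparison only needs the two manifests, so the trusted copy can live on a different server or on read-only media.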
Re: Re: more git updates..
Dear diary, on Mon, Apr 11, 2005 at 05:49:31PM CEST, I got a letter where "Randy.Dunlap" <[EMAIL PROTECTED]> told me that...
> On Sun, 10 Apr 2005 16:38:00 -0700 (PDT) Linus Torvalds wrote:
..snip..
> | Yes. Crappy old tree, but it can still read my git.git directory, so you
> | can use it to update to my current source base.
>
> Please go into a little more detail about how to do this step...
> that seems to be the most basic concept that I am missing.
> i.e., how to find the "latest/current" tree (version/commit)
> and check it out (read-tree, checkout-cache, etc.).

Well, its ID is by convention kept in .dircache/HEAD. But that is really only a convention; no "core git" tool reads it directly, and you need to update it manually after you do commit-tree.

First, you need to get the accompanying tree's ID. git-pasky's shortcut is $(tree-id), but manually you can do it by

	cat-file commit $(cat .dircache/HEAD) | egrep '^tree'

Note that if you ever forgot to update HEAD, or if you have multiple branches in your repository, you can list all "head commits" (that is, commits which have no other commits referencing them as parents) by doing fsck-cache.

Now, you need to populate the directory cache from the tree (see Paul Jackson's diagram):

	read-tree $tree_id

And now you want to update your working tree from the cache:

	checkout-cache -a -f

This will bring your tree in sync with the cache (it won't remove any stale files, though). That means it will overwrite your local changes too - turn that off by omitting the "-f". If you want to update only some files, omit the "-a" and list them.

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
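For readers who want to see what the egrep step in the message above is picking out, here is a small sketch that extracts the tree ID from the text of a commit object. It assumes the commit text consists of header lines (tree, parent, author, committer), a blank line, and then the log message; the function name `tree_id` and the sample commit with its made-up IDs are hypothetical.

```python
def tree_id(commit_text):
    """Return the ID found on the 'tree' header line of a commit object's text."""
    for line in commit_text.splitlines():
        if not line:
            break  # a blank line ends the headers; the rest is the log message
        if line.startswith("tree "):
            return line.split(" ", 1)[1].strip()
    raise ValueError("no tree line found in commit headers")


# Hypothetical commit text; the 40-digit IDs are placeholders, not real objects.
SAMPLE_COMMIT = "\n".join([
    "tree " + "1" * 40,
    "parent " + "2" * 40,
    "author A U Thor <author@example.com> Mon Apr 11 17:49:31 2005",
    "committer A U Thor <author@example.com> Mon Apr 11 17:49:31 2005",
    "",
    "tree handling: a log line that merely starts with the word tree",
    "",
])

if __name__ == "__main__":
    print(tree_id(SAMPLE_COMMIT))
```

Stopping at the first blank line matters: unlike a plain `egrep '^tree'`, it cannot be fooled by a log message line that happens to begin with "tree".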
Re: more git updates..
On Sat, Apr 09, 2005 at 12:45:52PM -0700, Linus Torvalds wrote: > Can you guys re-send the scripts you wrote? They probably need some > updating for the new semantics. Sorry about that ;( I've been off email this weekend, so have fallen a bit behind here. I'll forgo updating my stuff, since it looks like there's superior work. Looks cool! I must say, the git as a filesystem thing is really neat. This has been one of the more fun projects I've toyed around with. -- Ross Vandegrift [EMAIL PROTECTED] "The good Christian should beware of mathematicians, and all those who make empty prophecies. The danger already exists that the mathematicians have made a covenant with the devil to darken the spirit and to confine man in the bonds of Hell." --St. Augustine, De Genesi ad Litteram, Book II, xviii, 37 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: more git updates..
On Sun, 10 Apr 2005 16:38:00 -0700 (PDT) Linus Torvalds wrote: | | | On Sun, 10 Apr 2005, Paul Jackson wrote: | > | > Useful explanation - thanks, Linus. | | Hey. You're welcome. Especially when you create good documentation for | this thing. | | Because: | | > Is this picture and description accurate: | | [ deleted, but I'll probably try to put it in an explanation file | somewhere ] | | Yes. Excellent. | | > Minor question: | > | > I must have an old version - I got 'git-0.03', but | > it doesn't have 'checkout-cache', and its 'read-tree' | > directly writes my working files. | | Yes. Crappy old tree, but it can still read my git.git directory, so you | can use it to update to my current source base. Please go into a little more detail about how to do this step... that seems to be the most basic concept that I am missing. i.e., how to find the "latest/current" tree (version/commit) and check it out (read-tree, checkout-cache, etc.). Even if I use Pasky's tools, I'd like to understand this step. | However, from a usability angle, my source-base really has been | concentrating _entirely_ on just the plumbing, and if you actually want a | faucet or a toilet _conntected_ to the plumbing, you're better off with | Pasky's tree, methinks: | | > How do I get a current version? Well, one way I see, | > and that's to pick up Pasky's: | > | > http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2 | > | > Perhaps that's the best way? | | Indeed. He's got a number of shell scripts etc to automate the boring | parts. --- ~Randy - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: more git updates..
Followup to: <[EMAIL PROTECTED]> By author:Christopher Li <[EMAIL PROTECTED]> In newsgroup: linux.dev.kernel > > There is one problem though. How about the SHA1 hash collision? > Even the chance is very remote, you don't want to lose some data do due > to "software" error. I think it is OK that no handle that > case right now. On the other hand, it will be nice to detect that > and give out a big error message if it really happens. > If you're actually worried about it, it'd be better to just use a different hash, like one of the SHA-2's (probably a better choice anyway), instead of SHA-1. -hpa - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: more git updates..
On Mon, 2005-04-11 at 01:04 +0200, Bernd Eckenfels wrote: > In article <[EMAIL PROTECTED]> you wrote: > > (I repeat the xxx in the leaf name - easier to code.) > > It is a bit OT, but just a note: there are file systems (hash functions) out > there who dont like a lot of files named the same way. For example NTFS with > the 8.3 short names. Since you mention NTFS, there is no need to worry about that for Linux. Certainly the Linux kernel NTFS driver is never going to create 8.3 short names. (It doesn't create names at all at the moment but my grand plan is that it will only ever create file names in the Win32 and/or POSIX name spaces. The DOS name space is a thing of the past IMO.) Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: more git updates..
bert hubert <[EMAIL PROTECTED]> writes:
> On Sun, Apr 10, 2005 at 03:38:39PM -0700, Linus Torvalds wrote:
>
> > compressed with zlib, they are all named by the sha1 file, and they all
>
> Now I know this is a conscious decision, but recent zlib allows you to write
> out gzip content, at a cost of 14 bytes I think per file, by adding 32 to
> the window size. This in turn would allow users to zcat your objects at
> ease.
>
> You get confirmation of completeness of the file for free, as gzip encodes
> the length of the file at the end.

I would very much like it if git used normal gzip files with a .gz extension. Doing it this way means that the compression methods can be extended in the future. I.e.:

	ab/1234567890.gz	gzip compressed
	ab/1234567890.xd	xdelta compressed

I find the xdelta encoding very attractive since it can probably reduce the size of the repository drastically. A compression script could run nightly and xdelta compress everything that's older than a few months (to figure out what files to create the delta from, just look at the commit files and compare the parent tree to the current tree).

Of course, this means that a dumb wget won't work all that well to synchronize two trees, but it might be worthwhile anyway.

/Christer

-- 
"Just how much can I get away with and still go to heaven?"
Freelance consultant specializing in device driver programming for Linux
Christer Weinigel <[EMAIL PROTECTED]> http://www.weinigel.se
Re: more git updates..
On Sun, Apr 10, 2005 at 03:38:39PM -0700, Linus Torvalds wrote:
> compressed with zlib, they are all named by the sha1 file, and they all

Now I know this is a conscious decision, but recent zlib allows you to write out gzip content, at a cost of 14 bytes I think per file, by adding 32 to the window size. This in turn would allow users to zcat your objects at ease.

You get confirmation of completeness of the file for free, as gzip encodes the length of the file at the end.

Perhaps something to consider.

-- 
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services
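The gzip-framing suggestion above can be tried out with Python's zlib binding. A caveat on the numbers: as far as the zlib manual describes it, adding 16 to windowBits selects gzip framing when compressing, while adding 32 is the setting that makes the inflater auto-detect zlib or gzip input; a minimal gzip wrapper is 18 bytes against 6 for the zlib wrapper. The helper names `deflate` and `inflate_any` and the sample payload are illustrative assumptions, not anything git does.

```python
import zlib


def deflate(data, gzip_framing=False):
    """Compress data with a zlib wrapper (default) or a gzip wrapper."""
    wbits = zlib.MAX_WBITS + (16 if gzip_framing else 0)
    compressor = zlib.compressobj(level=zlib.Z_BEST_COMPRESSION,
                                  method=zlib.DEFLATED, wbits=wbits)
    return compressor.compress(data) + compressor.flush()


def inflate_any(data):
    """Decompress either framing; +32 asks zlib to detect the header itself."""
    return zlib.decompress(data, wbits=zlib.MAX_WBITS + 32)


if __name__ == "__main__":
    payload = b"some object contents\n" * 20
    as_zlib = deflate(payload)
    as_gzip = deflate(payload, gzip_framing=True)
    print("zlib bytes:", len(as_zlib), "gzip bytes:", len(as_gzip))
    # The last four bytes of a gzip stream hold the uncompressed length
    # (modulo 2**32), which is the completeness check mentioned above.
    print("length recorded in gzip trailer:", int.from_bytes(as_gzip[-4:], "little"))
    print("round trip ok:", inflate_any(as_zlib) == inflate_any(as_gzip) == payload)
```

A file written with the gzip framing can be read by zcat, at the price of the larger wrapper on every object.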
Re: more git updates..
I see. It just need some basic set operation (+, -, and) and some way to select a set: sha5---> / / sha1-->sha2-->sha3-- \/ \ / >sha4 list sha1 # all the file list in changeset sha1 # {sha1} list sha1,sha1 # same as above list sha1,sha2 # all the file list in between changeset sha1 # and changeset sha2 # {sha1, sha2} in example list sha1,sha3 # {sha1, sha2, sha3, sha4} list sha1,any # all the change set reachable from sha1. {sha1, ... sha5, ...} new sha1,sha2 # all the new file add between in sha1, sha2 (+) changed sha1,sha2 # add the changed file between sha1, sha2 (>) (<) deleted sha1,sha2 # add the deleted file between sha1, sha2(-) before time # all the file before time aftertime # all the file after time So in my example, the file I want to delete is : {list hack1, base}+ {list hack2, base} ... {list hack6, base} \ - [list official_merge, base ] On Sun, Apr 10, 2005 at 04:21:08PM -0700, Linus Torvalds wrote: > > > > the official tree. It is more for my local version control. > > I have a plan. Namely to have a "list-needed" command, which you give one > commit, and a flag implying how much "history" you want (*), and then it > spits out all the sha1 files it needs for that history. > > Then you delete all the other ones from your SHA1 archive (easy enough to > do efficiently by just sorting the two lists: the list of "needed" files > and the list of "available" files). > > Script that, and call the command "prune-tree" or something like that, and > you're all done. > > (*) The amount of history you want might be "none", which is to say that > you don't want to go back in time, so you want _just_ the list of tree and > blob objects associated with that commit. That will be {list head} > > Or you might want a "linear" history, which would be the longest path > through the parent changesets to the root. That will be {list head,root} > > Or you might want "all", which would follow all parents and all trees. 
That will be {list any, root} > > Or you might want to prune the history tree by date - "give me all > history, but cut it off when you hit a parent that was done more than 6 > months ago". That is {after -6month } > > This "list-needed" thing is not just for pruning history either. If you > have a local tree "x", and you want to figure out how much of it you need > to send to somebody else who has an older tree "y", then what you'd do is > basically "list-needed x" and remove the set of "list-needed y". That > gives you the answer to the question "what's the minimum set of sha1 files > I need to send to the other guy so that he can re-create my top-of-tree". > That is {list x, any} - {list y, any} > My second plan is to make somebody else so fired up about the problem that > I can just sit back and take patches. That's really what I'm best at. > Sitting here, in the (rain) on the patio, drinking a foofy tropical drink, > and pressing the "apply" button. Then I take all the credit for my > incredible work. Sounds like a good plan. Chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: more git updates..
Linus writes:
> Hey. You're welcome. Especially when you create good documentation for
> this thing.

Glad to be of service.

Sounds like the umbrella in your foofy drink will come in handy - keeping off the rain.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: Re: more git updates..
Dear diary, on Mon, Apr 11, 2005 at 01:14:57AM CEST, I got a letter where Paul Jackson <[EMAIL PROTECTED]> told me that...
> Useful explanation - thanks, Linus.
>
> Is this picture and description accurate:
>
> ==
>
>      < working directory files (foo.c) >
>            ^                     |
>            |  upward ops         |  downward ops
>            |  ----------         |  ------------
>            |  checkout-cache     |  update-cache
>            |  show-diff          v
>
>      < current directory cache (".dircache/index") >
>            ^                     |
>            |  upward ops         |  downward ops
>            |  ----------         |  ------------
>            |  read-tree          |  write-tree
>            |                     |  commit-tree
>            |                     v
>
>      < git filesystem (blobs, trees, commits: .dircache/{HEAD,objects}) >

Well, except that from a purely technical standpoint commit-tree has no place in this picture - it creates a new object in the git filesystem based on its input data, regardless of the directory cache or the current tree. It probably still belongs where it is from the workflow standpoint, though.

..snip..
> Minor question:
>
> I must have an old version - I got 'git-0.03', but
> it doesn't have 'checkout-cache', and its 'read-tree'
> directly writes my working files.
>
> How do I get a current version? Well, one way I see,
> and that's to pick up Pasky's:
>
> http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2
>
> Perhaps that's the best way?

You can take mine, and do:

	git pull pasky
	git pull linus
	cp .dircache/HEAD .dircache/HEAD.local

Now, your tree and git filesystem are up to date.

	git track local

Now, when you do git pull pasky, your working tree will not be updated automatically anymore.

	git track linus

Now, you start tracking Linus' tree instead.

Note that the initial update will blow away the scripts in your current tree, so before you do the last two steps you will probably want to clone the tree and set PATH to the one still tracking me, so you get all the comfort. ;-)

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
Re: more git updates..
On Sun, 10 Apr 2005, Paul Jackson wrote:
>
> Useful explanation - thanks, Linus.

Hey. You're welcome. Especially when you create good documentation for this thing.

Because:

> Is this picture and description accurate:

[ deleted, but I'll probably try to put it in an explanation file somewhere ]

Yes. Excellent.

> Minor question:
>
> I must have an old version - I got 'git-0.03', but
> it doesn't have 'checkout-cache', and its 'read-tree'
> directly writes my working files.

Yes. Crappy old tree, but it can still read my git.git directory, so you can use it to update to my current source base.

However, from a usability angle, my source-base really has been concentrating _entirely_ on just the plumbing, and if you actually want a faucet or a toilet _connected_ to the plumbing, you're better off with Pasky's tree, methinks:

> How do I get a current version? Well, one way I see,
> and that's to pick up Pasky's:
>
> http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2
>
> Perhaps that's the best way?

Indeed. He's got a number of shell scripts etc to automate the boring parts.

Linus
Re: more git updates..
On Sun, 10 Apr 2005, Christopher Li wrote: > > How about deleting trees from the caches? I don't need to delete stuff from > the official tree. It is more for my local version control. I have a plan. Namely to have a "list-needed" command, which you give one commit, and a flag implying how much "history" you want (*), and then it spits out all the sha1 files it needs for that history. Then you delete all the other ones from your SHA1 archive (easy enough to do efficiently by just sorting the two lists: the list of "needed" files and the list of "available" files). Script that, and call the command "prune-tree" or something like that, and you're all done. (*) The amount of history you want might be "none", which is to say that you don't want to go back in time, so you want _just_ the list of tree and blob objects associated with that commit. Or you might want a "linear" history, which would be the longest path through the parent changesets to the root. Or you might want "all", which would follow all parents and all trees. Or you might want to prune the history tree by date - "give me all history, but cut it off when you hit a parent that was done more than 6 months ago". This "list-needed" thing is not just for pruning history either. If you have a local tree "x", and you want to figure out how much of it you need to send to somebody else who has an older tree "y", then what you'd do is basically "list-needed x" and remove the set of "list-needed y". That gives you the answer to the question "what's the minimum set of sha1 files I need to send to the other guy so that he can re-create my top-of-tree". My second plan is to make somebody else so fired up about the problem that I can just sit back and take patches. That's really what I'm best at. Sitting here, in the (rain) on the patio, drinking a foofy tropical drink, and pressing the "apply" button. Then I take all the credit for my incredible work. Hint, hint. 
Linus
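The "list-needed" plan in the message above is, at heart, a reachability walk over the object graph followed by set arithmetic. The sketch below is only a toy model of that idea: the in-memory `OBJECTS` dictionary, its field names, and the short IDs are hypothetical stand-ins for the real object store, and only the "none" and "all" history modes are modelled (the "linear" and date-limited variants are left out).

```python
def list_needed(objects, commit_id, history="all"):
    """Return the set of object IDs needed for commit_id.

    history="none": only the commit plus its tree and blobs.
    history="all":  also follow every parent commit, recursively.
    """
    needed = set()
    stack = [commit_id]
    while stack:
        oid = stack.pop()
        if oid in needed:
            continue
        needed.add(oid)
        obj = objects[oid]
        if obj["type"] == "commit":
            stack.append(obj["tree"])
            if history == "all":
                stack.extend(obj["parents"])
        elif obj["type"] == "tree":
            stack.extend(obj["entries"])
        # blobs reference nothing further
    return needed


# Toy repository: c2 is a child of c1; "stale" is referenced by nothing.
OBJECTS = {
    "c1": {"type": "commit", "tree": "t1", "parents": []},
    "c2": {"type": "commit", "tree": "t2", "parents": ["c1"]},
    "t1": {"type": "tree", "entries": ["b1"]},
    "t2": {"type": "tree", "entries": ["b1", "b2"]},
    "b1": {"type": "blob"},
    "b2": {"type": "blob"},
    "stale": {"type": "blob"},
}

if __name__ == "__main__":
    keep = list_needed(OBJECTS, "c2", history="all")
    print("prunable:", sorted(set(OBJECTS) - keep))
    # What must be sent to someone who already has everything behind c1:
    to_send = list_needed(OBJECTS, "c2") - list_needed(OBJECTS, "c1")
    print("to send:", sorted(to_send))
```

Pruning is the complement of the needed set, and the "what do I send" question is the difference of two needed sets, which is why one command can serve both uses.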
Re: more git updates..
Useful explanation - thanks, Linus.

Is this picture and description accurate:

==

     < working directory files (foo.c) >
           ^                     |
           |  upward ops         |  downward ops
           |  ----------         |  ------------
           |  checkout-cache     |  update-cache
           |  show-diff          v

     < current directory cache (".dircache/index") >
           ^                     |
           |  upward ops         |  downward ops
           |  ----------         |  ------------
           |  read-tree          |  write-tree
           |                     |  commit-tree
           |                     v

     < git filesystem (blobs, trees, commits: .dircache/{HEAD,objects}) >

==

The checkout-cache and show-diff ops read their meta-data from the cache, and the actual file contents from the git filesystem. Similarly, the update-cache op writes meta-data into the cache, and may create new files in the git filesystem.

The cache (but not the git filesystem) stores transient information (ctime, mtime, dev, ino, uid, gid, and size) about each working file update-cache has copied into the git filesystem so that checkout-cache and show-diff can detect changes in the contents of working files just from a stat, without actually rereading the file.

In some sense, the cache holds the git filesystem inodes, and the git filesystem holds the data blocks. Except that:
 (1) the cache just holds the current "view" into the git filesystem,
 (2) objects in the filesystem have an "inode" number (their value) that is persistent whether in view or not,
 (3) objects in the filesystem are not removed just because nothing in the cache references them,
 (4) objects in the filesystem can reference other objects, that are typically also in the filesystem, but that can still be reliably self-identified even if found in the wild of say one's email inbox, and
 (5) the view in the directory cache can itself be made into a filesystem object - using commit-tree.

==

Minor question:

  I must have an old version - I got 'git-0.03', but it doesn't have 'checkout-cache', and its 'read-tree' directly writes my working files.

  How do I get a current version? Well, one way I see, and that's to pick up Pasky's:

	http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2

  Perhaps that's the best way?

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
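The stat-based change detection described above can be illustrated with a short sketch. It records the same seven fields the message lists (ctime, mtime, dev, ino, uid, gid, size) and reports a file as possibly changed when any of them differs. The names `StatInfo`, `snapshot` and `looks_changed` are hypothetical, and the real directory cache is a binary index file rather than a Python tuple.

```python
import os
import tempfile
from collections import namedtuple

# The transient fields the cache is said to keep for each working file.
StatInfo = namedtuple("StatInfo", "ctime mtime dev ino uid gid size")


def snapshot(path):
    """Record the stat fields for path, as update-cache is described as doing."""
    st = os.stat(path)
    return StatInfo(st.st_ctime_ns, st.st_mtime_ns, st.st_dev, st.st_ino,
                    st.st_uid, st.st_gid, st.st_size)


def looks_changed(path, cached):
    """True if any recorded stat field differs; the file itself is never read."""
    return snapshot(path) != cached


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "foo.c")
        with open(path, "w") as fh:
            fh.write("int main(void) { return 0; }\n")
        cached = snapshot(path)
        print("changed right after caching:", looks_changed(path, cached))
        with open(path, "a") as fh:
            fh.write("/* local edit */\n")
        print("changed after an edit:", looks_changed(path, cached))
```

The check is deliberately conservative: a differing stat means "go and compare the contents", while an identical stat lets the tool skip reading the file at all.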
Re: more git updates..
In article <[EMAIL PROTECTED]> you wrote:
> (I repeat the xxx in the leaf name - easier to code.)

It is a bit OT, but just a note: there are file systems (hash functions) out there that don't like a lot of files named the same way. For example, NTFS with the 8.3 short names.

Greetings
Bernd
Re: more git updates..
On Sun, Apr 10, 2005 at 03:38:39PM -0700, Linus Torvalds wrote:
>
> On Sun, 10 Apr 2005, Christopher Li wrote:
> >
> > BTW, one thing I learn from ext3 is that it is very useful to have some
> > compatible flag for future development. I think if we want to reserve some
> > room in the file format for further development of git
>
> Way ahead of you.
>
> This is (one reason) why all git objects have the type embedded inside of
> them. The format of all objects is totally regular: they are all
> compressed with zlib, they are all named by the sha1 file, and they all
> start out with a magic header of "<type> <size>".
>
> So if I want to create a new kind of tree object that does the same thing
> as the old one but has some other layout, I'd just call it something else.
> Like "dir". That was what I initially planned to do about the change to
> recursive tree objects, but it turned out to actually be a lot easier to
> just encode it in the old type (that way the routines that read it don't
> even have to care about old/new types - it's all the same to them).

Ha, that is right. Putting the new type into the same object tricked me into thinking I had to do it the same way. I totally forgot that I can introduce a new type of object. That is even cleaner. Cool.

How about deleting trees from the caches? I don't need to delete stuff from the official tree. It is more for my local version control.

Here is the use case:

- I check out git.git.
- Using quilt, I build my series of patches, git-hack1, git-hack2 .. git-hack6. Let's say those are stored in the git cache as well.
- I pick some of them and come up with a clean one, "submit.patch".
- submit.patch gets merged into the official git tree.
- Now I want to get rid of hack1 to hack6, but how?

One way to do it is to never commit hack1 to hack6 into git or the cache. They stay as quilt patches only. But it is very tempting to let quilt use git instead of the .pc/ directory; quilt could then be simplified into a particular use case of patch and git.

Chris
Re: more git updates..
On Sun, 10 Apr 2005, Christopher Li wrote:
>
> BTW, one thing I learn from ext3 is that it is very useful to have some
> compatible flag for future development. I think if we want to reserve some
> room in the file format for further development of git

Way ahead of you.

This is (one reason) why all git objects have the type embedded inside of them. The format of all objects is totally regular: they are all compressed with zlib, they are all named by the sha1 file, and they all start out with a magic header of "<type> <size>".

So if I want to create a new kind of tree object that does the same thing as the old one but has some other layout, I'd just call it something else. Like "dir". That was what I initially planned to do about the change to recursive tree objects, but it turned out to actually be a lot easier to just encode it in the old type (that way the routines that read it don't even have to care about old/new types - it's all the same to them).

Linus
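To make the "totally regular" object format above concrete, here is a small sketch of writing and reading one object. Two assumptions are baked in and should be read as such: the header is taken to be the type name, a space, the decimal payload size and a NUL byte, and the object name is taken to be the SHA-1 of the uncompressed header plus payload, which is how later git releases do it; the code being discussed in this thread may have computed the name differently. `encode_object` and `decode_object` are hypothetical names.

```python
import hashlib
import zlib


def encode_object(obj_type, payload):
    """Return (object_name, compressed_bytes) for a payload of the given type."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\0"
    store = header + payload
    # Assumption: the name is the SHA-1 of the uncompressed header + payload.
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def decode_object(compressed):
    """Inverse of encode_object: return (obj_type, payload) after checking the size."""
    store = zlib.decompress(compressed)
    header, _, payload = store.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(payload):
        raise ValueError("size in header does not match payload length")
    return obj_type, payload


if __name__ == "__main__":
    name, data = encode_object("blob", b"hello\n")
    print(name, len(data), decode_object(data))
    # A new object kind such as "dir" needs no format change, only a new type name.
    print(decode_object(encode_object("dir", b"entries...")[1]))
```

Because the type travels inside every object, a reader can dispatch on it after decompression, which is the extensibility argument made in the message above.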
Re: Re: more git updates..
Dear diary, on Sun, Apr 10, 2005 at 08:42:53PM CEST, I got a letter where Christopher Li <[EMAIL PROTECTED]> told me that... > I totally agree that odds is really really small. > That is why it is not worthy to handle the case. People hit that > can just add a new line or some thing to avoid it, if > it happen after all. > > It is the little peace of mind to know for sure that did > not happen. I am just paranoid. BTW, I've merged the check to git-pasky some time ago, you can disable it in the Makefile. It is by default on now, until someone convinces me it actually affects performance measurably. -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ 98% of the time I am right. Why worry about the other 3%. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: more git updates..
On Sun, Apr 10, 2005 at 01:57:33PM -0700, Linus Torvalds wrote:
> > > That way of thinking really doesn't work well here.
> >
> > I will have to look more closely at pasky's GIT toolkit
> > if I want to see an SCM style interface.
>
> Yes. You really should think of GIT as a filesystem, and of me as a
> _systems_ person, not an SCM person. In fact, I tend to detest SCM's. I
> think the reason I worked so well with BitKeeper is that Larry used to do
> operating systems. He's also a systems person, not really an SCM person.
> Or at least he's in between the two.
>

Yes, I was puzzled for a while about how to use git, until I realized that it is a versioned file system.

BTW, one thing I learned from ext3 is that it is very useful to have some compatibility flags for future development. I think if we want to reserve some room in the file format for further development of git, now is the right time to do it, before it gets big. E.g. an optional variable-size header in "tree" including format version, capabilities etc.

I can see the counter-argument that it is not as important as for a real file system, because it is a lot easier to bring it offline to upgrade the whole tree. On the other hand, it costs almost nothing in terms of space and CPU time; most directories do not reach a file system block boundary, so a few extra bytes are almost free. If carefully planned, it will make future upgrades of git a lot smoother.

What do you think?

Chris
Re: RE: more git updates..
Dear diary, on Mon, Apr 11, 2005 at 12:07:37AM CEST, I got a letter where "Luck, Tony" <[EMAIL PROTECTED]> told me that... ..snip.. > >Hey, I may end up being wrong, and yes, maybe I should have done a > >two-level one. The good news is that we can trivially fix it later (even > >dynamically - we can make the "sha1 object tree layout" be a per-tree > >config option, and there would be no real issue, so you could make small > >projects use a flat version and big projects use a very deep structure > >etc). You'd just have to script some renames to move the files around. > > It depends on how many eco-system shell scripts get built that need to > know about the layout ... if some shell/perl "libraries" encode this > filename layout (and people use them) ... then switching later would > indeed be painless. FWIW, my short-term plans include support for monotone-like hash ID shortening - it's enough to use the shortest leading unique part of the ID to identify the revision. I will poke to the object repository for that. I also already have Randy Dunlap's git lsobj, which will list all objects of a specified type (very useful especially when looking for orphaned commits and such rather lowlevel work). -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ 98% of the time I am right. Why worry about the other 3%. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
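The hash ID shortening mentioned above (accept the shortest leading part of an ID that is still unique) can be sketched as follows. This is an illustration of the idea only, not the git-pasky implementation; the function names and the short sample IDs are made up, and a real tool would look the prefix up in the object directory instead of scanning a list.

```python
def resolve_prefix(prefix, object_ids):
    """Return the single full ID starting with prefix, or raise LookupError."""
    matches = [oid for oid in object_ids if oid.startswith(prefix)]
    if not matches:
        raise LookupError(f"no object matches {prefix!r}")
    if len(matches) > 1:
        raise LookupError(f"{prefix!r} is ambiguous: {len(matches)} candidates")
    return matches[0]


def shortest_unique_prefix(full_id, object_ids, minimum=1):
    """Return the shortest leading part of full_id that no other ID shares."""
    others = [oid for oid in object_ids if oid != full_id]
    for length in range(minimum, len(full_id) + 1):
        prefix = full_id[:length]
        if not any(oid.startswith(prefix) for oid in others):
            return prefix
    return full_id


# Made-up, shortened IDs purely for illustration.
IDS = ["ab12cd34", "ab12ef56", "ff001122"]

if __name__ == "__main__":
    for oid in IDS:
        print(oid, "->", shortest_unique_prefix(oid, IDS))
    print(resolve_prefix("ab12e", IDS))
```

A `minimum` of more than one character is a sensible guard in practice, since a prefix that is unique today may become ambiguous as the repository grows.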
RE: more git updates..
>Also, I did actually debate that issue with myself, and decided that even >if we do have tons of files per directory, git doesn't much care. The >reason? Git never _searches_ for them. Assuming you have enough memory to >cache the tree, you just end up doing a "lookup", and inside the kernel >that's done using an efficient hash, which doesn't actually care _at_all_ >about how many files there are per directory. So long as the hash *is* efficient when the directory is packed full of 38 character filenames made only of [0-9a-f] ... which might not match the test cases under which the hash was picked :-) When there are some full-sized kernel git images, someone should do a sanity check. >Hey, I may end up being wrong, and yes, maybe I should have done a >two-level one. The good news is that we can trivially fix it later (even >dynamically - we can make the "sha1 object tree layout" be a per-tree >config option, and there would be no real issue, so you could make small >projects use a flat version and big projects use a very deep structure >etc). You'd just have to script some renames to move the files around. It depends on how many eco-system shell scripts get built that need to know about the layout ... if some shell/perl "libraries" encode this filename layout (and people use them) ... then switching later would indeed be painless. -Tony - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: more git updates..
I totally agree that the odds are really, really small. That is why it is not worth handling the case. People who hit it can just add a newline or something to avoid it, if it happens at all. It is just a little peace of mind to know for sure that it did not happen. I am just paranoid. Chris On Sun, Apr 10, 2005 at 12:23:52PM -0700, Paul Jackson wrote: > > Some thing like the following patch, may be turn off able. > > Take out an old envelope and compute on it the odds of this > happening. > > Say we have 10,000 kernel hackers, each producing one > new file every minute, for 100 hours a week. And we've > cloned a small army of Andrew Mortons to integrate > the resulting tsunami of patches. And Linus is well > cared for in the state funny farm. > > What is the probability that this check will fire even > once, between now and 10 billion years from now, when > the Sun has become a red giant destroying all life on > planet Earth? > > -- > I won't rest till it's the best ... > Programmer, Linux Scalability > Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, > 1.925.600.0401
Re: more git updates..
On Sun, 10 Apr 2005, Paul Jackson wrote: > > Ah ha - that explains the read-tree and write-tree names. > > The read-tree pulls stuff out of this file system into > your working files, clobbering local edits. This is like > the read(2) system call, which clobbers stuff in your > read buffer. Yes. Except it's a two-stage thing, where the staging area is always the "current directory cache". So a "read-tree" always reads the tree information into the directory cache, but does not actually _update_ any of the files it "caches". To do that, you need to do a "checkout-cache" phase. Similarly, "write-tree" writes the current directory cache contents into a set of tree files. But in order to have that match what is actually in your directory right now, you need to have done a "update-cache" phase before you did the "write-tree". So there is always a staging area between the "real contents" and the "written tree". > That way of thinking really doesn't work well here. > > I will have to look more closely at pasky's GIT toolkit > if I want to see an SCM style interface. Yes. You really should think of GIT as a filesystem, and of me as a _systems_ person, not an SCM person. In fact, I tend to detest SCM's. I think the reason I worked so well with BitKeeper is that Larry used to do operating systems. He's also a systems person, not really an SCM person. Or at least he's in between the two. My operations are like the "system calls". Useless on their own: they're not real applications, they're just how you read and write files in this really strange filesystem. You need to wrap them up to make them do anything sane. For example, take "commit-tree" - it really just says that "this is the new tree, and these other trees were its parents". It doesn't do any of the actual work to _get_ those trees written. So to actually do the high-level operation of a real commit, you need to first update the current directory cache to match what you want to commit (the "update-cache" phase). 
Then, when your directory cache matches what you want to commit (which is NOT necessarily the same thing as your actual current working area - if you don't want to commit some of the changes you have in your tree, you should avoid updating the cache with those changes), you do stage 2, ie "write-tree". That writes a tree node that describes what you want to commit. Only THEN, as phase three, do you do the "commit-tree". Now you give it the tree you want to commit (remember - that may not even match your current directory contents), and the history of how you got here (ie you tell commit what the previous commit(s) were), and the changelog. So a "commit" in SCM-speak is actually three totally separate phases in my filesystem thing, and each of the phases (except for the last "commit-tree" which is the thing that brings it all together) is actually in turn many smaller parts (ie "update-cache" may have been called hundreds of times, and "write-tree" will write several tree objects that point to each other). Similarly, a "checkout" really is about first finding the tree ID you want to check out, and then bringing it into the "directory cache" by doing a "read-tree" on it. You can then actually update the directory cache further: you might "read-tree" _another_ project, or you could decide that you want to keep one of the files you already had. So in that scneario, after doing the read-tree you'd do an "update-cache" on the file you want to keep in your current directory structure, which updates your directory cache to be a _mix_ of the original tree you now want to check out _and_ of the file you want to use from your current directory. Then doing a "checkout-cache -a" will actually do the actual checkout, and only at that point does your working directory really get changed. Btw, you don't even have to have any working directory files at all. 
Let's say that you have two independent trees, and you want to create a new commit that is the join of those two trees (where one of the trees takes precedence). You'd do a "read-tree <a> <b>", which will create a directory cache (but not check out) that is the union of the <a> and <b> trees (<b> will override). And then you can do a "write-tree" and commit the resulting tree - without ever having _any_ of those files checked out. Linus
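The three phases described in this message ("update-cache", "write-tree", "commit-tree", with the directory cache as staging area) can be modelled in a few lines. This is a toy sketch under loud assumptions: the class `ToyStore`, its object encoding, and the flat path handling are invented for illustration and are not git's on-disk format or its actual commands.

```python
import hashlib


class ToyStore:
    """A toy content-addressed store with a staging area (the 'directory cache')."""

    def __init__(self):
        self.objects = {}  # object ID (SHA-1 hex) -> stored bytes
        self.cache = {}    # path -> blob ID; this is the staging area

    def _write(self, kind, payload):
        # Made-up encoding: "<kind>\0<payload>", addressed by its SHA-1.
        data = kind.encode() + b"\0" + payload
        oid = hashlib.sha1(data).hexdigest()
        self.objects[oid] = data
        return oid

    def update_cache(self, path, content):
        """Phase 1: stage one file's content under `path`."""
        self.cache[path] = self._write("blob", content)

    def write_tree(self):
        """Phase 2: write a tree object describing the staged state."""
        lines = [f"{oid} {path}" for path, oid in sorted(self.cache.items())]
        return self._write("tree", "\n".join(lines).encode())

    def commit_tree(self, tree, parents, message):
        """Phase 3: tie a tree to its parent commits and a changelog."""
        header = [f"tree {tree}"] + [f"parent {p}" for p in parents]
        return self._write("commit", "\n".join(header + ["", message]).encode())

    def read_tree(self, tree):
        """Load a tree into the cache without touching any working files."""
        payload = self.objects[tree].split(b"\0", 1)[1].decode()
        for line in payload.splitlines():
            oid, path = line.split(" ", 1)
            self.cache[path] = oid  # later trees override earlier entries


store = ToyStore()
store.update_cache("Makefile", b"all:\n")
store.update_cache("README", b"hello\n")
first_tree = store.write_tree()
first_commit = store.commit_tree(first_tree, [], "first")

store.update_cache("README", b"hello again\n")  # only README is restaged
second_tree = store.write_tree()
second_commit = store.commit_tree(second_tree, [first_commit], "second")
print(first_commit, second_commit)
```

Because `read_tree` merges into the cache instead of replacing it, calling it on two trees in turn gives the union described above, with the second tree taking precedence, and nothing is ever checked out.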
Re: more git updates..
> Some thing like the following patch, may be turn off able. Take out an old envelope and compute on it the odds of this happening. Say we have 10,000 kernel hackers, each producing one new file every minute, for 100 hours a week. And we've cloned a small army of Andrew Mortons to integrate the resulting tsunami of patches. And Linus is well cared for in the state funny farm. What is the probability that this check will fire even once, between now and 10 billion years from now, when the Sun has become a red giant destroying all life on planet Earth? -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
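The back-of-the-envelope calculation asked for above can be done directly. The sketch below assumes SHA-1 output behaves like a uniformly random 160-bit value (so it says nothing about deliberately constructed collisions) and uses the standard birthday upper bound n(n-1)/(2N); the 52 weeks per year and the helper name `collision_bound` are assumptions of the example.

```python
def collision_bound(n_objects, hash_bits=160):
    """Birthday upper bound on the chance of any collision among n objects."""
    return n_objects * (n_objects - 1) / (2 * 2 ** hash_bits)


hackers = 10_000
files_per_hour = 60          # one new file every minute
hours_per_week = 100
weeks_per_year = 52          # assumption: work every week of the year
years = 10_000_000_000       # until the Sun becomes a red giant

n = hackers * files_per_hour * hours_per_week * weeks_per_year * years
p = collision_bound(n)

print(f"objects created:        {n:.3e}")
print(f"collision probability <= {p:.3e}")
```

Under these assumptions the bound stays far below one in a billion, which is the point of the envelope exercise: an accidental collision is not a practical concern, although a cheap check can still be kept for peace of mind.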
Re: more git updates..
Linus wrote: > It's a filesystem - although a > fairly strange one. Ah ha - that explains the read-tree and write-tree names. The read-tree pulls stuff out of this file system into your working files, clobbering local edits. This is like the read(2) system call, which clobbers stuff in your read buffer. The write-tree pushes stuff down into the file system, just like write(2) pushes data into the kernel. I was getting all kind of frustrated yesterday trying to use Linus's git commands, coming at these names with my SCM hat on. That way of thinking really doesn't work well here. I will have to look more closely at pasky's GIT toolkit if I want to see an SCM style interface. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: more git updates..
Tony wrote: > Or maybe the files should be named objects/xx/yy/? I tend to size these things with the square root of the number of leaf nodes. If I have 2,560,000 leaves (your 10,000 files in each of 16*16 directories), then I will aim for 1600 directories of 1600 leaves each. My backup is sized for about this number of leaves, and it uses: xxx/xxx (I repeat the xxx in the leaf name - easier to code.) I don't think there is any need for two levels. There are 4096 different values of three digit hex numbers. That's ok in one directory. The only question would be 'xx' or 'xxx' - two or three digits. This one is on the cusp in my view - either works. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
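The square-root sizing rule above is easy to check numerically. The helper names `fanout_for` and `hex_digits_needed` are made up for this sketch, and `leaves` is assumed to be a positive integer.

```python
import math


def fanout_for(leaves):
    """Square-root rule: about sqrt(n) directories of sqrt(n) leaves each."""
    dirs = math.isqrt(leaves)
    return dirs, math.ceil(leaves / dirs)


def hex_digits_needed(dirs):
    """Smallest number of leading hex digits giving at least `dirs` buckets."""
    digits = 1
    while 16 ** digits < dirs:
        digits += 1
    return digits


leaves = 2_560_000  # 10,000 files in each of 16*16 directories
dirs, per_dir = fanout_for(leaves)
print(dirs, per_dir)                      # directories, leaves per directory
print(hex_digits_needed(dirs), 16 ** 3)   # hex digits for that many buckets
```

With 'xx' the 256 buckets hold 10,000 leaves each at this size, while 'xxx' gives 4096 buckets of 625 each, which brackets the square-root target from both sides and is why the choice is described as being on the cusp.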
Re: more git updates..
* Rik van Riel <[EMAIL PROTECTED]> wrote: > GCC 4 isn't very happy. Mostly sign changes, but also something that > looks like a real error: > > gcc -g -O3 -Wall -c -o fsck-cache.o fsck-cache.c > fsck-cache.c: In function 'main': > fsck-cache.c:59: warning: control may reach end of non-void function > 'fsck_tree' being inlined > fsck-cache.c:62: warning: control may reach end of non-void function > 'fsck_commit' being inlined > > I assume that fsck_tree and fsck_commit should complain loudly if they > ever get to that point - but since I'm not quite sure there's no > patch, sorry. I sent a patch for most of the sign errors, but the above is a case of gcc not noticing that the function can never ever exit the loop, and thus cannot get to the 'return' point. Ingo
Re: more git updates..
Ralph wrote: > but good enough for > most uses that people will get caught out when it fails. Exactly. If Linus persists in this diff-tree output format, using two lines for changed files, then I will have to add the following sed script to my arsenal: sed '/^/ / }' It collapses pairs of lines: <100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile >100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile to the single line: <100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile 100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile However, more people will get bit by this git glitch than know sed. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
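The sed one-liner in this message appears to have lost characters in the archive, so rather than guess at the original script, here is a small Python sketch of the behaviour the message describes: a line starting with `<` that is immediately followed by a line starting with `>` is joined into one line. The function name `collapse_pairs` is an assumption of the example.

```python
def collapse_pairs(lines):
    """Join each '<' line with an immediately following '>' line."""
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        if line.startswith("<") and nxt.startswith(">"):
            out.append(line + " " + nxt[1:])  # drop the leading '>'
            i += 2
        else:
            out.append(line)                  # unpaired lines pass through
            i += 1
    return out


if __name__ == "__main__":
    sample = [
        "<100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile",
        ">100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile",
    ]
    for row in collapse_pairs(sample):
        print(row)
```

With one record per line, tools such as xargs can no longer split a changed-file pair across two invocations.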
Re: more git updates..
On Sat, 9 Apr 2005, Linus Torvalds wrote: > I've rsync'ed the new git repository to kernel.org, it should all be there > in /pub/linux/kernel/people/torvalds/git.git/ (and it looks like the > mirror scripts already picked it up on the public side too). GCC 4 isn't very happy. Mostly sign changes, but also something that looks like a real error: gcc -g -O3 -Wall -c -o fsck-cache.o fsck-cache.c fsck-cache.c: In function 'main': fsck-cache.c:59: warning: control may reach end of non-void function 'fsck_tree' being inlined fsck-cache.c:62: warning: control may reach end of non-void function 'fsck_commit' being inlined I assume that fsck_tree and fsck_commit should complain loudly if they ever get to that point - but since I'm not quite sure there's no patch, sorry. -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan
Re: more git updates..
On Sun, Apr 10, 2005 at 08:44:56AM -0700, Linus Torvalds wrote: > > > On Sun, 10 Apr 2005, Junio C Hamano wrote: > > > > But I am wondering what your plans are to handle renames---or > > does git already represent them? > > You can represent renames on top of git - git itself really doesn't care. > In many ways you can just see git as a filesystem - it's > content-addressable, and it has a notion of versioning, but I really really > designed it coming at the problem from the viewpoint of a _filesystem_ > person (hey, kernels is what I do), and I actually have absolutely _zero_ > interest in creating a traditional SCM system. > > So to take renaming a file as an example - why do you actually want to > track renames? In traditional SCM's, you do it for two reasons: > > - space efficiency. Most SCM's are based on describing changes to a file, [snip] > - annotate/blame. This is a valid concern, but the fact is, I never use [snip] - merging. When the parent tree renames a file, it's easier for an out-of-tree patch to get up-to-date. - reviewing. A huge patch with 2000 added lines and 1990 removed lines is more difficult to review than a rename + 10 lines patch. > So consider me deficient, or consider me radical. It boils down to the > same thing. Renames don't matter. When you've got no out-of-tree patches since you've got the parent-of-all-trees, then they don't matter, that's true :) > So whether you agree with the things that _I_ consider important probably > depends on how you work. The real downside of GIT may be that _my_ way of > doing things is quite possibly very rare. -- Rutger Nijlunsing -- eludias ed dse.nl never attribute to a conspiracy which can be explained by incompetence --
Re: more git updates..
On Sat, 9 Apr 2005 [EMAIL PROTECTED] wrote: > > With 60,000 changesets in the current tree, we will start out our git > repository with about 600,000 files. Assuming the first byte of the > SHA1 hash is random, that means an average of 2343 files in each of the > objects/xx directories. Give it a few more years at the current pace, > and we'll have over 10,000 files per directory. This sounds like a lot > to me ... but perhaps filesystems now handle large directories enough > better than they used to for this to not be a problem? The good news is that git itself doesn't really care. I think it's literally _one_ function ("get_sha1_filename()") that you need to change, and then you need to write a small script that moves files around, and you're really much done. Also, I did actually debate that issue with myself, and decided that even if we do have tons of files per directory, git doesn't much care. The reason? Git never _searches_ for them. Assuming you have enough memory to cache the tree, you just end up doing a "lookup", and inside the kernel that's done using an efficient hash, which doesn't actually care _at_all_ about how many files there are per directory. So I was for a while debating having a totally flat directory space, but since there are _some_ downsides (linear lookup for cold-cache, and just that "ls -l" ends up being O(n**2) and things), I decided that a single fan-out is probably a good idea. > Or maybe the files should be named objects/xx/yy/? Hey, I may end up being wrong, and yes, maybe I should have done a two-level one. The good news is that we can trivially fix it later (even dynamically - we can make the "sha1 object tree layout" be a per-tree config option, and there would be no real issue, so you could make small projects use a flat version and big projects use a very deep structure etc). You'd just have to script some renames to move the files around.. 
Linus
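The point that the object layout lives in one place and could become a per-tree option can be illustrated with a small path-mapping helper. This is a Python sketch, not the C function `get_sha1_filename()` named above: `object_path`, `plan_renames`, the `fanout` tuple, and the `objects` root are assumptions made for the example.

```python
def object_path(sha1_hex, fanout=(2,), root="objects"):
    """Map a hex object ID to a path; `fanout` lists hex digits per directory level."""
    parts, pos = [], 0
    for width in fanout:
        parts.append(sha1_hex[pos:pos + width])
        pos += width
    return "/".join([root, *parts, sha1_hex[pos:]])


def plan_renames(ids, old_fanout, new_fanout):
    """List (old_path, new_path) pairs needed to switch layouts."""
    return [(object_path(i, old_fanout), object_path(i, new_fanout)) for i in ids]


if __name__ == "__main__":
    sha = "4870bcf91f8666fc788b07578fb7473eda795587"
    print(object_path(sha, fanout=()))      # flat layout
    print(object_path(sha))                 # single fan-out, objects/xx/...
    print(object_path(sha, fanout=(2, 2)))  # two-level, objects/xx/yy/...
    for old, new in plan_renames([sha], (2,), (2, 2)):
        print("mv", old, new)
```

Since every lookup goes through the one mapping function, changing the layout comes down to changing the `fanout` setting and running the generated renames once.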
Re: more git updates..
On Sun, 10 Apr 2005, Junio C Hamano wrote: > > But I am wondering what your plans are to handle renames---or > does git already represent them? You can represent renames on top of git - git itself really doesn't care. In many ways you can just see git as a filesystem - it's content- addressable, and it has a notion of versioning, but I really really designed it coming at the problem from the viewpoint of a _filesystem_ person (hey, kernels is what I do), and I actually have absolutely _zero_ interest in creating a traditional SCM system. So to take renaming a file as an example - why do you actually want to track renames? In traditional SCM's, you do it for two reasons: - space efficiency. Most SCM's are based on describing changes to a file, and compress the data by doing revisions on the same file. In order to continue that process past a rename, such an SCM _has_ to track renames, or lose the delta-based approach. The most trivial example of this is "diff", ie a rename ends up generating a _huge_ diff unless you track the rename explicitly. GIT doesn't care. There is _zero_ space efficiency in trying to track renames. In fact, it would add overhead to the system, not lessen it. That's because GIT fundamentally doesn't do the "delta-within-a-file" model. - annotate/blame. This is a valid concern, but the fact is, I never use it. It may be a deficiency of mine, but I simply don't do the per-line thing when I debug or try to find who was responsible. I do "blame" on a much bigger-picture level, and I personally believe (pretty strongly) that per-line annotations are not actually a good thing - they come not because people _want_ to do things at that low level, but because historically, you didn't _have_ the bigger-picture thing. In other words, pretty much every SCM out there is based on SCCS "mentally", even if not in any other model. That's why people think per-line blame is important - you have that mental model. So consider me deficient, or consider me radical. 
It boils down to the same thing. Renames don't matter. That said, if somebody wants to create a _real_ SCM (rather than my notion of a pure content tracker) on top of GIT, you probably could fairly easily do so by imposing a few limitations on a higher level. For example, most SCM's that track renames require that the user _tell_ them about the renames: you do a "bk mv" or a "svn rename" or something. If you want to do the same on top of GIT, then you should think of GIT as what it is: GIT just tracks contents. It's a filesystem - although a fairly strange one. How would you track renames on top of that? Easy: add your own fields to the GIT revision messages: GIT enforces the header, but you can add anything you want to the "free-form" part that follows it. Same goes for any other information where you care about what happens "within" a file. GIT simply doesn't track it. You can build things on top of GIT if you want to, though. They may not be as efficient as they would be if they were built _into_ GIT, but on the other hand GIT does a lot of other things a hell of a lot faster thanks to it's design. So whether you agree with the things that _I_ consider important probably depends on how you work. The real downside of GIT may be that _my_ way of doing things is quite possibly very rare. But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: more git updates..
>In other words, each "commit" file is very small and cheap, but since >almost every commit will also imply a totally new tree-file, "git" is >going to have an overhead of half a megabyte per commit. Oops. > >Damn, that's painful. I suspect I will have to change the format somehow. Having dodged that bullet with the change to make tree files point at other tree files ... here's another (potential) issue. A changeset that touches just one file a few levels down from the top of the tree (say arch/i386/kernel/setup.c) will make six new files in the git repository (one for the changeset, four tree files, and a new blob for the new version of the file). More complex changes make more files ... but say the average is ten new files per changeset since most changes touch few files. With 60,000 changesets in the current tree, we will start out our git repository with about 600,000 files. Assuming the first byte of the SHA1 hash is random, that means an average of 2343 files in each of the objects/xx directories. Give it a few more years at the current pace, and we'll have over 10,000 files per directory. This sounds like a lot to me ... but perhaps filesystems now handle large directories enough better than they used to for this to not be a problem? Or maybe the files should be named objects/xx/yy/? -Tony
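The estimates in this message can be reproduced with a few lines of arithmetic. The function names below are made up for the sketch, and the ten-files-per-changeset average is Tony's estimate, not a measured value.

```python
def new_objects_for_change(path):
    """Objects written when one file changes: changeset + trees + blob."""
    trees = path.count("/") + 1  # one tree per directory level, including the top
    return 1 + trees + 1


def files_per_directory(total_files, hex_digits):
    """Average bucket size if the leading hex digits are uniformly spread."""
    return total_files / 16 ** hex_digits


changesets = 60_000
avg_files_per_changeset = 10  # Tony's estimated average
total = changesets * avg_files_per_changeset

print(new_objects_for_change("arch/i386/kernel/setup.c"))
print(total, files_per_directory(total, 2))  # objects/xx
print(files_per_directory(total, 4))         # objects/xx/yy
```

The same helper shows that the single-level layout reaches 10,000 files per directory at 2,560,000 objects, while a two-level layout would average fewer than ten files per directory at today's estimated size.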
Re: more git updates..
Hi, Christopher Li wrote: > On Sat, Apr 09, 2005 at 04:31:10PM -0700, Linus Torvalds wrote: > > NOTE! This means that each "tree" file basically tracks just a > > single directory. The old style of "every file in one tree file" > > still works, but fsck-cache will warn about it. Happily, the git > > archive itself doesn't have any subdirectories, so git itself is not > > impacted by it. > > That is really cool stuff. My way to read it, correct me if I am > wrong, git is a user space version file system. "tree" <--> directory > and "blob" <--> file. "commit" to describe the version history. See the Venti filesystem in Bell Labs's Plan 9 OS. It too uses SHA-1. http://www.cs.bell-labs.com/sys/doc/venti/venti.pdf Abstract This paper describes a network storage system, called Venti, intended for archival data. In this system, a unique hash of a block's contents acts as the block identifier for read and write operations. This approach enforces a write-once policy, preventing accidental or malicious destruction of data. In addition, duplicate copies of a block can be coalesced, reducing the consumption of storage and simplifying the implementation of clients. Venti is a building block for constructing a variety of storage applications such as logical backup, physical backup, and snapshot file systems. We have built a prototype of the system and present some preliminary performance results. The system uses magnetic disks as the storage technology, resulting in an access time for archival data that is comparable to non-archival data. The feasibility of the write-once model for storage is demonstrated using data from over a decade's use of two Plan 9 file systems. Cheers, Ralph. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: more git updates..
>handle by pure rename only plus the extra delta. The current git don't >have per file change history. From git's point of view some file deleted >and the other file appeared with same content. > >It is the top level SCM to handle that correctly. >Rename a directory will be even more fun. But from a git perspective it will be very efficient. Imagine that Linus decides to rename arch/i386 as arch/x86 ... at the git repository level this just requires a changeset, a new top level tree, and a new tree for the arch directory showing that i386 changed to x86. That's all ... every file below that is unchanged, so the blobs for the files are all the same. -Tony
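The claim that a directory rename only costs a new top-level tree and a new `arch` tree can be checked with a toy model of content-addressed trees. The nested-dict representation, the `store_tree` function, and the object encoding are all assumptions of this sketch, not git's real format.

```python
import hashlib


def store_tree(tree, objects):
    """Store a nested dict (name -> bytes or dict) and return its object ID."""
    entries = []
    for name in sorted(tree):
        child = tree[name]
        if isinstance(child, dict):
            entries.append(f"tree {store_tree(child, objects)} {name}")
        else:
            oid = hashlib.sha1(b"blob\0" + child).hexdigest()
            objects[oid] = child
            entries.append(f"blob {oid} {name}")
    data = "\n".join(entries).encode()
    oid = hashlib.sha1(b"tree\0" + data).hexdigest()
    objects[oid] = data
    return oid


i386 = {"kernel": {"setup.c": b"int setup(void) { return 0; }\n"}}
before = {"Makefile": b"all:\n", "arch": {"i386": i386}}
after = {"Makefile": b"all:\n", "arch": {"x86": i386}}  # pure directory rename

objects = {}
store_tree(before, objects)
count_before = len(objects)
store_tree(after, objects)
new_objects = len(objects) - count_before
print(count_before, new_objects)
```

Only the two trees whose entry lists changed get new IDs; the renamed directory's own tree hashes to the same ID as before because its contents are identical, so everything below it is shared.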
Proposal for shell-patch-format [was: Re: more git updates..]
On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote: > Listing the file paths and their sigs included in a tree to make > a snapshot of a tree state sounds fine, and diffing two trees by > looking at the sigs between two such files sounds fine as well. > > But I am wondering what your plans are to handle renames---or > does git already represent them? git doesn't represent transitions (or deltas), but only state. So it's not (much) more then a .tar file from version-management perspective; the only difference being that a git-tree has a comment field and a predecessor-reference, which are currently not used in determining the 'patch' between two trees. Deltas are derived by comparing different versions and determining the difference by reverse-engineering the differences which got us from version A to version B. Deltas are currently described as patch(1)es. Patches don't have the concept of 'renaming', so even after determining that file X has been renamed to Y, we have no container for this fact. A patch(1) only contains local-file-edits: substitute lines by other lines. Deltas are not needed to follow a tree; deltas are useful for merging branches of versions, and for reviewing purposes. This is comparable to using tar for version-management: it is very common to weekly tar your current version of your project as a poor-mans-version management for one-person one-project. So what is needed is a way to represent deltas which can contain more than only traditional patches. I would propose a simple format: the shell-script in a fixed-format. Shell-patch format in EBNF: ::= ( ? * )* ::= + The comments contains the text describing the function of the patch following it. ::= "# " ::= "mv " " " "\n" | "cp " " " "\n" | "chmod " "\n" | "patch <<__UNIQUE_STRING__\n" "__UNIQUE_STRING__\n" (where UNIQUE_STRING must not be contained in patch) ::= (but pointing to a file) ::= a pathname relative to '.'; escaping special characters the shell-way; may not contain '..'. 
Example: # Rename file b to a1, and change a line. mv b a1 patch <<__END__ *** a1 Sun Apr 10 11:43:37 2005 --- a2 Sun Apr 10 11:43:41 2005 *** *** 1,4 1 2 ! from 3 --- 1,4 1 2 ! to 3 __END__ Advantages: - ASCII! - a shell-patch is executable without extra tooling - a shell-patch is readable and therefore reviewable - a shell-patch is forward-compatible: a shell-patch acts like a patch (since patch(1) ignores garbage around patch :), but not backwards-compatible. - extensible - the heavy-lifting is done by 'patch' Disadvantages: - no deltas for binary files Open issues: - could be made more structured; maybe containing fields like Sujbect:, Author:, Signed-By:, certificates, ... (BitKeeper seems to be using "# " ":" "\n" lines) - patch(1) doesn't know any directories. Should shell-patch know directories? This implies commands working on directories to (like directory renaming, mode changing, ...). Otherwise directories are implicit (a file in a directories implies the existance of that directory). Also implies mkdir and rmdir as shell-patch commands. - extra commands might be useful to conserve more state(changes): ln -s -- for symbolic links; ln -- for hard links; chown -- for permissions; chattr -- for storing extended attributes touch -- for setting timestamps (probably creation time only, since mtime is something git relies on) ...and for the really adventurous: sed 's,,,' -- for substitutions (this is something darcs supports, but which I think is too bothersome to use since it is difficult to reverse engineere from two random trees) Why a fixed format at all? - This way, the executable shell-patch can be proven to be harmless to the machine: 'rm -rf /' is a valid shell-script, but not a valid shell-patch (since 'rm' is not valid command, random flags like '-rf' are not supported, and '/' is an absolute pathname. 
- A fixed format enables tooling to support such a patch format; for example creating the reverse-patch, merging patches (yeah, 'cat' also merges patches...). ...what has this to do with git? Not much and everything, depending on how you look onto it. 'git' is 'tar', and 'shell-patch' is 'patch'; both orthogonal concepts but very usable in combination. We'll look at getting from two git trees to a shell-patch. Diffing the trees would not only look at the file and per file at the hashes, but also the other way around: which hash values are used more than once. For files with the same hash value, compare the contents (and rest of attributes); this is needed since the mapping from file contents to sha1 is one-way. When the contents is the same, the shell-patch-command to generate is obviously a 'cp'. For example, we have got two trees
Re: more git updates..
Hi Paul,

> Ralph wrote:
> > Watch out for when xargs invokes do_something more than once and the
> > `<' is parsed by a different one than the `>'.
>
> It will take a pretty long list to do that. It seems that GNU xargs
> on top of a Linux kernel has a 128 KByte ARG_MAX.

I didn't realise it was that long, but one pair of files to diff takes 128 bytes of that.

$ wc -c <<\E
> <100664 aff074c63ac827801a7d02ff92781365957f1430 update-cache.c
> >100664 3a672397164d5ff27a19a6888b578af96824ede7 update-cache.c
> E
128

So that's space for 1024 pairs. (Doesn't envp take some up too?) That doesn't seem enough for diffs between revisions, but good enough for most uses that people will get caught out when it fails.

$ bzip2 -dc patch-2.6.10.bz2 | grep -c '^diff '
5384

Cheers, Ralph.
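The figures in this message are easy to re-derive. The sketch below takes the 128 KByte limit at face value, assumes a single separator character between the fields, and ignores the environment and per-argument overhead that the message itself asks about, so it is an upper estimate only.

```python
pair = (
    "<100664 aff074c63ac827801a7d02ff92781365957f1430 update-cache.c\n"
    ">100664 3a672397164d5ff27a19a6888b578af96824ede7 update-cache.c\n"
)

arg_max = 128 * 1024            # the 128 KByte figure quoted above
bytes_per_pair = len(pair.encode())
max_pairs = arg_max // bytes_per_pair
files_in_patch = 5384           # 'diff' count reported for patch-2.6.10

print(bytes_per_pair, max_pairs)
print("fits in one invocation:", files_in_patch <= max_pairs)
```

Longer path names shrink the number of pairs per invocation further, so a kernel-sized diff would be split across several invocations.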
Re: Re: more git updates..
On Sun, Apr 10, 2005 at 11:41:53AM +0200, Petr Baudis wrote: > Dear diary, on Sun, Apr 10, 2005 at 07:53:40AM CEST, I got a letter > where Christopher Li <[EMAIL PROTECTED]> told me that... > > On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote: > > > > > > But I am wondering what your plans are to handle renames---or > > > does git already represent them? > > > > > > > Rename should just work. It will create a new tree object and you > > will notice that in the entry that changed, the hash for the blob > > object is the same. > > Which is of course wrong when you want to do proper merging, examine > per-file history, etc. One solution which springs to my mind is to have > a UUID accompany each blob and tree; that will take relatively lot of > space though, and I'm not sure it is really worth it. It should just use the two-step approach (rename, then change); then it is tractable with git as it is now. Chris
Re: more git updates..
On Sun, Apr 10, 2005 at 02:28:54AM -0700, Junio C Hamano wrote:
> > "CL" == Christopher Li <[EMAIL PROTECTED]> writes:
>
> CL> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> >>
> >> But I am wondering what your plans are to handle renames---or
> >> does git already represent them?
> >>
>
> CL> Rename should just work.  It will create a new tree object and you
> CL> will notice that in the entry that changed, the hash for the blob
> CL> object is the same.
>
> Sorry, I was unclear.  But doesn't that imply that a SCM built
> on top of git storage needs to read all the commit and tree
> records up to the common ancestor to show tree diffs between two
> forked tree?
>
> I suspect that another problem is that noticing the move of the
> same SHA1 hash from one pathname to another and recognizing that
> as a rename would not always work in the real world, because
> sometimes people move files *and* make small changes at the same
> time.  If git is meant to be an intermediate format to suck
> existing kernel history out of BK so that the history can be
> converted for the next SCM chosen for the kernel work, I would
> imagine that there needs to be a way to represent such a case.
> Maybe convert a file rename as two git trees (one tree for pure
> move which immediately followed by another tree for edit) if it
> is not a pure move?
>

Git is not an SCM yet. A rename + change set should be handled internally as a pure rename plus the extra delta. The current git doesn't have per-file change history. From git's point of view, some file was deleted and another file appeared with the same content. It is up to the top-level SCM to handle that correctly.

Renaming a directory will be even more fun.

Chris
Re: more git updates..
On Sat, Apr 09, 2005 at 04:31:10PM -0700, Linus Torvalds wrote:
>
> Done, and pushed out. The current git.git repository seems to do all of
> this correctly.
>
> NOTE! This means that each "tree" file basically tracks just a single
> directory. The old style of "every file in one tree file" still works, but
> fsck-cache will warn about it. Happily, the git archive itself doesn't
> have any subdirectories, so git itself is not impacted by it.

That is really cool stuff. The way I read it (correct me if I am wrong), git is a user-space versioning file system: "tree" <--> directory and "blob" <--> file, with "commit" describing the version history.

Git always writes out a full new version of a blob when there is any update to it. At first I thought that wastes a lot of space, especially when there is only a tiny change. But the more I think about it, the more sense it makes. Kernel source files are usually small objects, and files are stored compressed anyway. A very useful thing to gain from it is that we can truncate the older history, e.g. we can have an option not to sync the pre-2.4 change sets and only grab them if we need to. Most of the time we are only interested in the recent change sets.

There is one problem though. How about SHA1 hash collisions? Even though the chance is very remote, you don't want to lose data due to a "software" error. I think it is OK not to handle that case right now. On the other hand, it would be nice to detect it and give out a big error message if it really happens.

Something like the following patch, which could perhaps be made optional.

Chris

Index: git-0.03/read-cache.c
===
--- git-0.03.orig/read-cache.c	2005-04-09 18:42:16.0 -0400
+++ git-0.03/read-cache.c	2005-04-10 02:48:36.0 -0400
@@ -210,8 +210,22 @@
 	int fd;
 	fd = open(filename, O_WRONLY | O_CREAT | O_EXCL, 0666);
-	if (fd < 0)
-		return (errno == EEXIST) ? 0 : -1;
+	if (fd < 0) {
+		void *map;
+		static int error(const char * string);
+
+		if (errno != EEXIST)
+			return -1;
+		fd = open(filename, O_RDONLY);
+		if (fd < 0)
+			return -1;
+		map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
+		if (map == MAP_FAILED)
+			return -1;
+		if (memcmp(buf, map, size))
+			return error("Ouch, Strike by lighting!\n");
+		return 0;
+	}
 	write(fd, buf, size);
 	close(fd);
 	return 0;
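The idea behind the patch, independent of read-cache.c, is: if the object file already exists, compare its bytes with what was about to be written, and complain loudly if they differ. A minimal shell sketch of that check follows. The name store_object and its return codes are hypothetical, it compares whole files with cmp rather than using mmap and memcmp as the patch does, and unlike open() with O_EXCL the exists-then-copy sequence is not atomic.

```shell
#!/bin/sh
# store_object SRC DEST  (hypothetical helper, not part of git)
# Copy SRC to DEST unless DEST already exists.  If it exists, verify
# that it holds the same bytes; a mismatch under the same name is
# reported as a collision.
store_object() {
    src=$1
    dest=$2
    if [ -e "$dest" ]; then
        if cmp -s "$src" "$dest"; then
            return 0        # identical content already stored
        fi
        echo "collision: $dest exists with different content" >&2
        return 2
    fi
    cp "$src" "$dest"
}
```

Storing the same content twice is then a silent no-op, while storing different content under an existing name fails with status 2 instead of silently keeping the old data.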
Re: Re: more git updates..
Dear diary, on Sun, Apr 10, 2005 at 11:28:54AM CEST, I got a letter
where Junio C Hamano <[EMAIL PROTECTED]> told me that...
> > "CL" == Christopher Li <[EMAIL PROTECTED]> writes:
>
> CL> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> >>
> >> But I am wondering what your plans are to handle renames---or
> >> does git already represent them?
> >>
>
> CL> Rename should just work.  It will create a new tree object and you
> CL> will notice that in the entry that changed, the hash for the blob
> CL> object is the same.
>
> Sorry, I was unclear.  But doesn't that imply that a SCM built
> on top of git storage needs to read all the commit and tree
> records up to the common ancestor to show tree diffs between two
> forked tree?

No. See diff-tree output and http://pasky.or.cz/~pasky/dev/git/gitdiff-do for how it's done. Basically, you just take the two trees and compare them linearly (do a normal diff on them, essentially). Then the differences you spot this way are everything that needs to appear in the patch.

> I suspect that another problem is that noticing the move of the
> same SHA1 hash from one pathname to another and recognizing that
> as a rename would not always work in the real world, because
> sometimes people move files *and* make small changes at the same
> time.  If git is meant to be an intermediate format to suck
> existing kernel history out of BK so that the history can be
> converted for the next SCM chosen for the kernel work, I would
> imagine that there needs to be a way to represent such a case.
> Maybe convert a file rename as two git trees (one tree for pure
> move which immediately followed by another tree for edit) if it
> is not a pure move?

Actually, this could be possible too I think. We will have to make diff-tree two-pass, but it is already so blindingly fast that I guess that doesn't hurt too much. I might try to get my hands on that.

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
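As a rough illustration of comparing two trees entry by entry, the sketch below takes two listings and prints records in the style of the diff-tree output shown elsewhere in this thread. The listing format (one "mode sha1 path" line per entry) and the name tree_diff are assumptions, and it uses an awk lookup table rather than a linear walk over two sorted lists; it is not the gitdiff-do script.

```shell
#!/bin/sh
# tree_diff OLD_LISTING NEW_LISTING  (hypothetical helper)
# Each listing line is assumed to look like "<mode> <sha1> <path>".
# Prints "<"/">" pairs for changed entries and "+" for added ones in
# the order of NEW_LISTING, then "-" for removed entries (in whatever
# order awk iterates, which is unspecified).
tree_diff() {
    awk '
        { path = $0; sub(/^[^ ]+ [^ ]+ /, "", path) }   # strip mode and sha1
        FILENAME == ARGV[1] { old[path] = $1 " " $2; next }
        {
            seen[path] = 1
            if (!(path in old)) {
                print "+" $1 " " $2 " " path
            } else if (old[path] != $1 " " $2) {
                print "<" old[path] " " path
                print ">" $1 " " $2 " " path
            }
        }
        END {
            for (p in old)
                if (!(p in seen))
                    print "-" old[p] " " p
        }
    ' "$1" "$2"
}
```

Entries whose mode and hash are identical in both listings produce no output, which is what keeps the result small when only a few files changed.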
Re: Re: more git updates..
Dear diary, on Sun, Apr 10, 2005 at 07:53:40AM CEST, I got a letter
where Christopher Li <[EMAIL PROTECTED]> told me that...
> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> >
> > But I am wondering what your plans are to handle renames---or
> > does git already represent them?
> >
>
> Rename should just work.  It will create a new tree object and you
> will notice that in the entry that changed, the hash for the blob
> object is the same.

Which is of course wrong when you want to do proper merging, examine per-file history, etc. One solution which springs to my mind is to have a UUID accompany each blob and tree; that will take a relatively large amount of space though, and I'm not sure it is really worth it. How many renames were there in the 64k commits so far anyway?

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
Re: more git updates..
Previously Christopher Li wrote:
> Rename should just work.  It will create a new tree object and you
> will notice that in the entry that changed, the hash for the blob
> object is the same.

What if you rename and change a file within a changeset?

Wichert.

-- 
Wichert Akkerman <[EMAIL PROTECTED]>    It is simple to make things.
http://www.wiggy.net/                   It is hard to make things simple.
Re: more git updates..
> "CL" == Christopher Li <[EMAIL PROTECTED]> writes:

CL> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
>>
>> But I am wondering what your plans are to handle renames---or
>> does git already represent them?
>>

CL> Rename should just work.  It will create a new tree object and you
CL> will notice that in the entry that changed, the hash for the blob
CL> object is the same.

Sorry, I was unclear. But doesn't that imply that a SCM built on top of git storage needs to read all the commit and tree records up to the common ancestor to show tree diffs between two forked tree?

I suspect that another problem is that noticing the move of the same SHA1 hash from one pathname to another and recognizing that as a rename would not always work in the real world, because sometimes people move files *and* make small changes at the same time. If git is meant to be an intermediate format to suck existing kernel history out of BK so that the history can be converted for the next SCM chosen for the kernel work, I would imagine that there needs to be a way to represent such a case. Maybe convert a file rename as two git trees (one tree for pure move which immediately followed by another tree for edit) if it is not a pure move?
Re: more git updates..
On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
>
> But I am wondering what your plans are to handle renames---or
> does git already represent them?
>

Rename should just work. It will create a new tree object and you will notice that in the entry that changed, the hash for the blob object is the same.

Chris
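Chris's point is that a pure rename leaves the blob hash unchanged, so it can be spotted by looking for a hash that disappears from one path and appears at another. A hedged sketch over diff-tree style "-" and "+" records follows. The name find_pure_renames is hypothetical, records are assumed to be newline-separated, and a file that is renamed and edited in the same commit will not match, because its hash changes too.

```shell
#!/bin/sh
# find_pure_renames  (hypothetical helper)
# Reads records of the form "-<mode> <sha1> <path>" (removed) and
# "+<mode> <sha1> <path>" (added) on stdin and prints "old -> new"
# for every sha1 that was removed at one path and added at another.
# If several paths share one sha1, only the last one seen is kept.
find_pure_renames() {
    awk '
        { path = $0; sub(/^[^ ]+ [^ ]+ /, "", path) }   # strip tag+mode and sha1
        /^-/  { removed[$2] = path; next }
        /^\+/ { added[$2] = path; next }
        END {
            for (sha in removed)
                if (sha in added)
                    print removed[sha] " -> " added[sha]
        }
    '
}
```

Anything more than a byte-for-byte move needs a similarity measure on the file contents, which this sketch does not attempt.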
Re: more git updates..
Listing the file paths and their sigs included in a tree to make a snapshot of a tree state sounds fine, and diffing two trees by looking at the sigs between two such files sounds fine as well.

But I am wondering what your plans are to handle renames---or does git already represent them?
Re: Re: more git updates..
Dear diary, on Sun, Apr 10, 2005 at 01:31:10AM CEST, I got a letter
where Linus Torvalds <[EMAIL PROTECTED]> told me that...
> On Sat, 9 Apr 2005, Linus Torvalds wrote:
> >
> > Actually, I guess I wouldn't have to change the format. I could just
> > extend the existing "tree" object to be able to point to other trees, and
> > that's it.
>
> Done, and pushed out. The current git.git repository seems to do all of
> this correctly.
..snip..

Ok, so now I can dare announce it, I hope. I hacked my branch of git somewhat, kept in sync with Linus, and now I have something to show. Please see it at

    http://pasky.or.cz/~pasky/dev/git/

It is basically a set of (still rather crude) shell scripts upon Linus' git, which make it sanely usable by mere humans for actual version tracking. Its usage _is_ going to change, so don't get too used to it (that'd be hard anyway, I suspect), but it should be working nicely.

I have described most of the interesting parts and some basic usage in the README at that page. It wraps commits, supports log retrieval and comfortable diffing between any two trees. And on top of that, it can do some basic remote repositories - it will pull (rsync) from them and it can make the local copy track them - on pull, it will be updated accordingly (and your local commits on the tracked branch will get orphaned).

I didn't attach a patch against Linus since I think it's pretty much useless now. It's available as against-linus.patch on the web, and you can apply it to the latest git tree (NOT 0.03). But it's probably a better idea to wget my tree. You can then watch us making progress by

    gitpull.sh linus
    gitpull.sh pasky

and see where we differ by:

    gitdiff.sh linus pasky

(This is how the against-linus.patch was generated. I'd easily generate even 0.03 patch this way, but I forgot to merge the fsck at that time, so it would suck.)

(Note that the tree you wget is set up to track my branch. If you want to stop tracking it (basically necessary now if you want to do local commits), do:

    cp .dircache/HEAD .dircache/HEAD.local
    gittrack.sh

The cp says something like "I want to pick up where the tracked branch left off". Otherwise, untracking would return you to your "local" branch, which is just some ancient predecessor of the pasky branch here anyway.)

Note that I didn't really test it on anything but git itself yet, so I'm not sure how it will cope especially with directories - I tried to make it aware of them though. I will do some more practical testing tomorrow.

Otherwise, I will probably try to consolidate the usage and documentation now, and beautify the scripts. I might start pondering some merging too. Oh, and gitpatch.sh. :-)

Have fun and please share your opinions,

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
Re: more git updates..
From before:

    The sha1 (ascii) digests for 16817 files take:

        689497 bytes before compression
        397475 bytes after minigzip

New numbers:

    The sha1 (binary) digests for 16817 files take:

        336340 bytes before compression
        334943 bytes after minigzip

So compressing binary digests isn't worth a darn, and compressing ascii digests gets them down to within 18% of binary digests in size.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
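Paul's uncompressed figures follow directly from the record sizes: 20 bytes per binary digest, and 41 bytes per ASCII digest if each is stored as 40 hex characters plus a newline (the newline is an inference from the totals, not something stated in the message). The helper names below are made up for this sketch.

```shell
#!/bin/sh
# Hypothetical helpers for checking the digest-size figures.

# digest_bytes FILES BYTES_PER_RECORD
# Total size of FILES digest records of the given width.
digest_bytes() {
    echo $(( $1 * $2 ))
}

# percent_over SIZE BASELINE
# How much larger SIZE is than BASELINE, as a truncated integer percentage.
percent_over() {
    echo $(( ($1 - $2) * 100 / $2 ))
}
```

`digest_bytes 16817 20` gives 336340 and `digest_bytes 16817 41` gives 689497, matching the reported sizes; `percent_over 397475 336340` gives 18, which is roughly the gap between the compressed ASCII digests and the binary ones.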
Re: more git updates..
> Then a "tree" object would point to a "directory" object,

Ah - light bulb flickers - in _separate_ files. Yes, that obviously makes a difference.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: more git updates..
Linus wrote:
> Damn, that's painful. I suspect I will have to change the format somehow.

The sha1 (ascii) digests for 16817 files take:

    689497 bytes before compression
    397475 bytes after minigzip

The pathnames, relative to top of tree, for these 16817 files take:

    503983 bytes before compression
    85786 bytes after minigzip compression

I doubt any fancifying up of the pathname storage will gain much.

However going from binary to ascii sha1 digest might help (compresses better, I suspect - I'll have to write a few lines of code to see).

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: more git updates..
Bernd wrote:
> more parser friendly to have single records for diffs.

good point

[looks like you trimmed the cc list - folks around here don't like that ;)]

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: more git updates..
In article <[EMAIL PROTECTED]> you wrote:
> Ralph wrote:
>> Watch out for when xargs invokes do_something more than once and the `<'
>> is parsed by a different one than the `>'.
> It will take a pretty long list to do that.  It seems that
> GNU xargs on top of a Linux kernel has a 128 KByte ARG_MAX.
> In the old days, with 4 KByte ARG_MAX limits, this would have
> bitten us pretty quickly.

Nevertheless I think it is more parser friendly to have single records for diffs.

Greetings
Bernd
Re: more git updates..
Ralph wrote:
> Watch out for when xargs invokes do_something more than once and the `<'
> is parsed by a different one than the `>'.

It will take a pretty long list to do that. It seems that GNU xargs on top of a Linux kernel has a 128 KByte ARG_MAX.

In the old days, with 4 KByte ARG_MAX limits, this would have bitten us pretty quickly.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: more git updates..
On Sat, 9 Apr 2005, Linus Torvalds wrote:
>
> Actually, I guess I wouldn't have to change the format. I could just
> extend the existing "tree" object to be able to point to other trees, and
> that's it.

Done, and pushed out. The current git.git repository seems to do all of this correctly.

NOTE! This means that each "tree" file basically tracks just a single directory. The old style of "every file in one tree file" still works, but fsck-cache will warn about it. Happily, the git archive itself doesn't have any subdirectories, so git itself is not impacted by it.

Now, this means that I should add a "recursive" option to "tree-diff", but I haven't done so yet. So right now if I change the top-level Makefile, _and_ change kernel/exit.c, then the "tree diff" between the two commit trees ends up looking like:

    [EMAIL PROTECTED]:~/lx-test/linux-2.6.12-rc2> diff-tree 7bec1223736d7e02c755e9a365984b3cbfa1e6e9 d64817f809a60cd960d3078ae91b4d19cb649501 | tr '\0' '\n'
    <100644 e1e7f7430c0297f22042cff58da5ca73ef121b95 Makefile
    >100644 8ee21134577e98fb642dffc5b797a0121645c543 Makefile
    <4 2239383d00ae746f5e79ceccf8ac3fbca62f949d kernel
    >4 a8fad219cb78a6b6a05a10f8643d615fefc8160f kernel

ie it shows that the Makefile blob has changed, and the kernel directory has changed. You then need to recurse into the kernel tree to see what the changes were there:

    [EMAIL PROTECTED]:~/lx-test/linux-2.6.12-rc2> diff-tree 2239383d00ae746f5e79ceccf8ac3fbca62f949d a8fad219cb78a6b6a05a10f8643d615fefc8160f | tr '\0' '\n'
    <100644 1a50b58453679b6fee8de4f744f4befc39397bb1 exit.c
    >100644 e8df1325bf25816827a1a64404ad533a97bfdae2 exit.c

but it clearly all seems to work. And it means that a subdirectory that didn't change at all (the common case) will be able to re-use the old sha1 file when you create a tree (this may in fact make "diff-tree" much less important, since now it tends to handle objects that are just a few kB in size, rather than almost a megabyte).

So in this case, the "commit cost" of changing two files was two small tree files (1468 and 679 bytes respectively for the kernel/ and top-level directory) and the commit file itself (251 bytes). In addition to the actual data files that were changed, of course.

Goodie. Big difference between that and the 460kB of the old monolithic tree file.

Linus
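The manual recursion shown above (run diff-tree on the top-level trees, then again on the two "kernel" hashes) could be automated along the following lines. This is only a sketch under stated assumptions: recursive_diff and the DIFF_TREE override are hypothetical, the command is assumed to print NUL-terminated "<"/">" records as in the message, and directory entries are assumed to be recognisable by a mode starting with "4". Added or removed directories are printed as they are, not expanded.

```shell
#!/bin/sh
# recursive_diff OLD_TREE NEW_TREE [PATH_PREFIX]  (hypothetical helper)
# Runs the two-tree diff command named by $DIFF_TREE (default:
# diff-tree) and descends into any directory whose hash changed,
# printing every record with its path relative to the top level.
recursive_diff() {
    ${DIFF_TREE:-diff-tree} "$1" "$2" </dev/null | tr '\0' '\n' |
    while IFS=' ' read -r tag sha name; do
        case $tag in
            '<4'*) old=$sha ;;          # old side of a changed directory
            '>4'*) recursive_diff "$old" "$sha" "${3:-}$name/" ;;
            *)     printf '%s\n' "$tag $sha ${3:-}$name" ;;
        esac
    done
}
```

Because an unchanged subdirectory keeps its hash, it never shows up in the output at all, so the recursion only visits directories that actually contain a change.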
Re: more git updates..
Hi Linus,

> Btw, the NUL-termination makes this really easy to use even in shell
> scripts, ie you can do
>
> diff-tree | xargs -0 do_something
>
> and you'll get each line as one nice argument to your "do_something"
> script. So a do_diff could be based on something like
>
> #!/bin/sh

Watch out for when xargs invokes do_something more than once and the `<' is parsed by a different one than the `>'. A `while read ...; do ... done' would avoid that, but wouldn't like the NULs instead of LFs.

Cheers,

Ralph.
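One way to get the `while read' loop Ralph describes is to turn the NULs into newlines first, as the `tr' in the original examples does, and pair each `<' record with the `>' that follows inside a single loop, so no pair can be split across invocations. A minimal sketch follows; pair_changes and its output format are made up for illustration, and it breaks on file names that contain newlines.

```shell
#!/bin/sh
# pair_changes  (hypothetical helper)
# Reads NUL-terminated diff-tree style records on stdin and prints one
# line per change:
#   M <path> <old-sha1> <new-sha1>    changed file
#   A <path> <sha1>                   added file
#   D <path> <sha1>                   removed file
pair_changes() {
    tr '\0' '\n' | (
        old=
        while IFS=' ' read -r tag sha path; do
            case $tag in
                '')   continue ;;               # skip blank lines
                '<'*) old=$sha ;;               # remember old side, wait for '>'
                '>'*) echo "M $path $old $sha" ;;
                '+'*) echo "A $path $sha" ;;
                '-'*) echo "D $path $sha" ;;
                *)    echo "unexpected record: $tag $sha $path" >&2
                      exit 1 ;;
            esac
        done
    )
}
```

It would be used as `diff-tree ... | pair_changes`, with each output line then carrying everything needed to produce one diff.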
Re: more git updates..
Linus wrote:
> the NUL-termination makes this really easy to use even in shell

grumble ...

> I still use the old tools I learnt to use fifteen years ago

new comer ;)

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 1.925.600.0401
Re: more git updates..
On Sat, 9 Apr 2005, Linus Torvalds wrote:
>
> I suspect that I have to change the file format. Maybe make the "tree"
> object a two-level thing, and have a "directory" object.
>
> Then a "tree" object would point to a "directory" object, which would in
> turn point to the individual files (and other "directory" objects, of
> course). That way a commit that only changes a few files will only need to
> create a few new "directory" objects, instead of creating one huge "tree"
> object.

Actually, I guess I wouldn't have to change the format. I could just extend the existing "tree" object to be able to point to other trees, and that's it.

The downside of that is that then a tree wouldn't have a canonical format any more: you could have two trees that have the exact same content, but they'd have different names. They should obviously merge very easily (and thus you could create a new merge that _does_ have a common name), but it's ugly.

I'll have to think about it. It's good to notice these issues early, this was the first time I had actually tried to check in a kernel-sized tree for real.

Linus
Re: more git updates..
On Sat, 9 Apr 2005, Petr Baudis wrote:
>
> > Also, I wrote the "diff-tree" thing I talked about:
> ..snip..
>
> Hmm, I wonder, is this better done in C instead of a simple shell
> script, like my gitdiff.sh?

With 17,000 files in the kernel, and most commits just changing a small number of them, I actually think "diff-tree" matters.

You use "join" (which is quite reasonable), but let's put it this way: just the list of files in the current kernel is about half a megabyte of data. Ie your temporary files that you use in the "ls-tree + ls-tree + join" is actually going to be quite sizeable.

My goal here is that the speed of "git" really should be almost totally independent of the size of the project. You clearly cannot avoid _some_ size-dependency: my "diff-tree" clearly also has to work through the same 1MB of data, but I think it's worth making the constant factor be as small as humanly possible.

I just tried checking in a kernel tree tar-file, and the initial checkin (which is all the compression and the sha1 calculations for every single file) took about 1:35 (minutes, not hours ;).

Doing a commit (trivial change to the top-level Makefile) and then doing a "treediff" between those two things took 0.05 seconds using my C thing. Ie we're talking so fast that we really don't care.

Doing a "show-diff" takes 0.15 secs or so (that's all the "stat" calls), and now that I test it out I realize that the most expensive operation is actually _writing_ the "index" file out. These are the two most expensive steps:

    [EMAIL PROTECTED]:~/lx-test/linux-2.6.12-rc2> time update-cache Makefile

    real    0m0.283s
    user    0m0.171s
    sys     0m0.113s

    [EMAIL PROTECTED]:~/lx-test/linux-2.6.12-rc2> time write-tree
    5ca21c9d808fa4bee1eb6948a59dfb9c7d73f36a

    real    0m0.441s
    user    0m0.354s
    sys     0m0.087s

ie with the current infrastructure it looks like I can do a "patch + commit" in less than one second on the kernel, and 0.75 secs of that is because the "tree" file actually grows pretty large:

    cat-file tree 5ca21c9d808fa4bee1eb6948a59dfb9c7d73f36a | wc -c

says that the uncompressed tree-file is 950,874 bytes. Compressing it means that the archival version of it is "just" 462,546 bytes, but this is really the part that is going to eat _tons_ of disk-space.

In other words, each "commit" file is very small and cheap, but since almost every commit will also imply a totally new tree-file, "git" is going to have an overhead of half a megabyte per commit. Oops.

Damn, that's painful. I suspect I will have to change the format somehow.

One option (which I haven't tested yet) is that since the tree-file is already sorted, I could always write it out with the common subdirectory part "collapsed", ie instead of writing

    ...
    include/asm-i386/mach-default/bios_ebda.h
    include/asm-i386/mach-default/do_timer.h
    ...

I'd write just

    ...
    ///bios_ebda.h
    ///do_timer.h
    ...

since the directory names are implied by the predecessor.

However, that doesn't help with the 20-byte sha1 associated with each file, which is also obviously uncompressible, so with 17,000+ files, we have a minimum overhead of about 350kB per tree-file. So even if I did the pathname compression, it wouldn't help all that much. I'd only be removing the only part of the file that _is_ very compressible, and I'd probably end up with something that isn't all that far away from the 450kB+ it is now.

I suspect that I have to change the file format. Maybe make the "tree" object a two-level thing, and have a "directory" object.

Then a "tree" object would point to a "directory" object, which would in turn point to the individual files (and other "directory" objects, of course). That way a commit that only changes a few files will only need to create a few new "directory" objects, instead of creating one huge "tree" object.

Sadly, that will make "tree-diff" potentially more expensive. On the other hand, maybe not: it will also speed it _up_, since directories that are totally shared will be trivially seen as such and need no further operation.

Thoughts? That would break the current repository formats, and I'd have to create a converter thing (which shouldn't be that bad, of course).

I don't have to do it right now. In fact, I'd almost prefer for the current thing to become good enough that it's not painful to work with, since right now I'm using it to develop itself. Then I can convert the format with an automated script later, before I actually start working on the kernel...

> BTW, do we care about changed modes? If so, they should probably have
> their place in the diff-tree output.

They're there. If you want to ignore them, you can just notice that the sha1 matches between two lines, and then you don't even have to diff them.

Linus
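The "collapsed" pathname idea from the message above (which Linus says he has not tested, and which the thread goes on to drop in favour of nested tree objects) can be sketched as a filter over a sorted path list. The name collapse_paths is hypothetical, and the encoding is one reading of the "///bios_ebda.h" example: each leading "/" stands for one directory component shared with the previous line.

```shell
#!/bin/sh
# collapse_paths  (hypothetical helper)
# Reads sorted, slash-separated paths on stdin.  For each path, every
# leading directory component that is identical to the one at the same
# position in the previous path is replaced by a bare "/".
collapse_paths() {
    awk -F/ '
    {
        n = NF - 1                      # number of directory components
        common = 0
        while (common < n && common < prev_n && $(common + 1) == prev[common + 1])
            common++
        out = ""
        for (i = 0; i < common; i++)
            out = out "/"
        for (i = common + 1; i <= NF; i++)
            out = out $i (i < NF ? "/" : "")
        print out
        for (i = 1; i <= n; i++)        # remember this path for the next line
            prev[i] = $i
        prev_n = n
    }'
}
```

As the message points out, this only shrinks the part of the tree file that already compresses well; the 20-byte hashes per entry are untouched.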
Re: more git updates..
Hello,

Dear diary, on Sat, Apr 09, 2005 at 09:45:52PM CEST, I got a letter
where Linus Torvalds <[EMAIL PROTECTED]> told me that...
> The good news is, the data structures/indexes haven't changed, but many of
> the tools to interface with them have new (and improved!) semantics:
>
> In particular, I changed how "read-tree" works, so that it now mirrors
> "write-tree", in that instead of actually changing the working directory,
> it only updates the index file (aka "current directory cache" file from
> the tree).
>
> To actually change the working directory, you'd first get the index file
> setup, and then you do a "checkout-cache -a" to update the files in your
> working directory with the files from the sha1 database.

That's great. I was planning to do something with this since currently it really annoyed me. I think I will like this, even though I didn't look at the code itself yet (just on my way).

> Also, I wrote the "diff-tree" thing I talked about:
..snip..

Hmm, I wonder, is this better done in C instead of a simple shell script, like my gitdiff.sh? I'd say it is more flexible and probably hardly performance-critical to have this scripted, and not difficult at all provided you have ls-tree. But maybe I'm just too fond of my script... ;-) (Ok, there's some trouble when you want to have newlines and spaces in file names, and join appears to be awfully ignorant about this... :[ )

BTW, do we care about changed modes? If so, they should probably have their place in the diff-tree output.

BTW#2, I hope you will merge my ls-tree anyway, even though there is no user for it currently... I should quickly figure out some. :-)

> Can you guys re-send the scripts you wrote? They probably need some
> updating for the new semantics. Sorry about that ;(

I'll try to merge ASAP.

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.
Re: more git updates..
On Sat, 9 Apr 2005, Linus Torvalds wrote:
>
> To actually change the working directory, you'd first get the index file
> setup, and then you do a "checkout-cache -a" to update the files in your
> working directory with the files from the sha1 database.

Btw, this will not overwrite any old files, so if you have an old version of something, you'd need to do "checkout-cache -f -a" (and order matters: the "-f" must come first). This time I actually have a big comment at the top of the checkout-cache.c file trying to explain the logic.

Linus
more git updates..
Sorry guys, several of you have sent me small fixes and scripts to "git", but I've been busy on breaking/changing the core infrastructure, so I didn't get around to looking at the scripts yet.

The good news is, the data structures/indexes haven't changed, but many of the tools to interface with them have new (and improved!) semantics:

In particular, I changed how "read-tree" works, so that it now mirrors "write-tree", in that instead of actually changing the working directory, it only updates the index file (aka "current directory cache" file) from the tree.

To actually change the working directory, you'd first get the index file set up, and then you do a "checkout-cache -a" to update the files in your working directory with the files from the sha1 database.

Also, I wrote the "diff-tree" thing I talked about:

    [EMAIL PROTECTED]:~/git> ./diff-tree 8fd07d4b7778cd0233ea0a17acd3fe9d710af035 8c6d29d6a496d12f1c224db945c0c56fd60ce941 | tr '\0' '\n'
    <100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile
    >100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile
    <100664 9e1bee21e17c134a2fb008db62679048fc819528 cache.h
    >100664 56ef561e590fd99e938bd47fd1f2c7ed46126ff0 cache.h
    <100664 fd690acc02ef9c06d7c4c3541f69b10ca4b4f8c9 cat-file.c
    >100664 6e6d89291ced17a406e64b97fe8bb96a22eefc9d cat-file.c
    +100664 fd00e5603dcc4a93acceda0b8cb914fabc8645d5 checkout-cache.c
    <100664 a4a8c3d9ef0c4cc6c82b96b5d1a91ac6d3bed466 commit-tree.c
    >100664 236ceb7646e3f5d110fd83f815b82e94cc5b2927 commit-tree.c
    +100664 01c92f2620a8e13e7cb7fd98ee644c6b65eeccb7 fsck-cache.c
    <100664 0eaa053919e0cc400ab9bc40d9272360117e6978 init-db.c
    >100664 815743e92dad7e451c65bab01448ee8ae9deeb56 init-db.c
    <100664 e7bfaadd5d2331123663a8f14a26604a3cdcb678 read-cache.c
    >100664 71d0cb6fe9b7ff79e3b2c5a61e288ac9f62b39dc read-cache.c
    <100664 ec0f167a6a505659e5af6911c97f465506534c34 read-tree.c
    >100664 f5c50ba79d02f002b9675fd8f129fa388e3282c6 read-tree.c
    <100664 00a29c403e751c2a2a61eb24fa2249c8956d1c80 show-diff.c
    >100664 b963dd738989bc92bf02352bbedad13a74e66a7d show-diff.c
    <100664 aff074c63ac827801a7d02ff92781365957f1430 update-cache.c
    >100664 3a672397164d5ff27a19a6888b578af96824ede7 update-cache.c
    <100664 7abeeba116b2b251c12ae32c7b38cb048199b574 write-tree.c
    >100664 9525c6fc975888a394477339db86216cd5bd5d7c write-tree.c

(ie the output of "diff-tree" has the same NUL-termination, but if you insist on getting ASCII output, you can just use "tr" to change the NUL into a NL).

The format of the "diff-tree" output is that the first character is "-" for "remove file", "+" for "add file" and "<"/">" for "change file" (where the "<" shows the old state, and ">" shows the new state).

Btw, the NUL-termination makes this really easy to use even in shell scripts, ie you can do

    diff-tree | xargs -0 do_something

and you'll get each line as one nice argument to your "do_something" script. So a do_diff could be based on something like

    #!/bin/sh
    # Each argument is one diff-tree record: "<status><mode> <sha1> <filename>"
    while [ "$1" != "" ]; do
        filename="$(echo "$1" | cut -d' ' -f3-)"
        first_sha="$(echo "$1" | cut -d' ' -f2)"
        second_sha="$(echo "$2" | cut -d' ' -f2)"
        c="$(echo "$1" | cut -c1)"
        case "$c" in
        "+")
            echo diff -u /dev/null "$filename($first_sha)";;
        "-")
            echo diff -u "$filename($first_sha)" /dev/null;;
        "<")
            # a change is two records ("<" then ">"), so consume both
            echo diff -u "$filename($first_sha)" "$filename($second_sha)"
            shift;;
        *)
            echo "WHAT?"
            exit 1;;
        esac
        shift
    done

which really shows what a horrid shell-person I am (I still use the old tools I learnt to use fifteen years ago. I bet you can do it trivially in perl or something sane, and I'm just stuck in the stone age of UNIX).

That makes it _very_ easy to parse.
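
As an illustration of how little parsing the record format needs, here is a minimal sketch that splits records with plain shell parameter expansion instead of "cut". It assumes the layout shown in the diff-tree output above (status character, mode, sha1 and filename separated by single spaces) and that each record arrives as one argument, the way "xargs -0" would deliver it. The names summarize, sha_of and name_of are hypothetical helpers made up for this example; they are not part of git.

```shell
#!/bin/sh
# Sketch only: summarize diff-tree records, one record per argument.
# Assumed record layout: "<status><mode> <sha1> <filename>".

# sha_of/name_of split on the first two spaces only, so a filename
# that itself contains spaces is kept whole.
sha_of()  { r=${1#* }; printf '%s\n' "${r%% *}"; }
name_of() { r=${1#* }; printf '%s\n' "${r#* }"; }

summarize() {
    while [ $# -gt 0 ]; do
        case "$1" in
        "+"*) printf 'added %s %s\n' "$(name_of "$1")" "$(sha_of "$1")" ;;
        "-"*) printf 'removed %s %s\n' "$(name_of "$1")" "$(sha_of "$1")" ;;
        "<"*)
            # a change is two records: "<" (old state) then ">" (new state)
            [ $# -ge 2 ] || { echo "missing '>' record for: $1" >&2; return 1; }
            printf 'changed %s %s -> %s\n' \
                "$(name_of "$1")" "$(sha_of "$1")" "$(sha_of "$2")"
            shift ;;
        *)
            echo "unexpected record: $1" >&2
            return 1 ;;
        esac
        shift
    done
}
```

Saved as a script with a final line of summarize "$@", it could be driven the same way as do_diff above, ie from "xargs -0". Bailing out on an unknown status character keeps a malformed record from being silently skipped.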
The example above is the diff between the initial commit and one of the more recent trees, so it has changes to everything, but a more normal thing would be

    [EMAIL PROTECTED]:~/git> diff-tree 787763499dc4f8cc345bc6ed8ee1e0ae31adedd6 5b0c2695634b5bab2f5d63c7bb30f7e5815af470 | tr '\0' '\n'
    <100664 01c92f2620a8e13e7cb7fd98ee644c6b65eeccb7 fsck-cache.c
    >100664 81aa7bee003264ea302db835158e725eefa4012d fsck-cache.c

which tells you that the last commit changed just one file (it's from this one:

    [EMAIL PROTECTED]:~/git> cat-file commit `cat .dircache/HEAD`
    tree 5b0c2695634b5bab2f5d63c7bb30f7e5815af470
    parent 81c53a1d3551f358860731481bb2d87179d221e6
    author Linus Torvalds <[EMAIL PROTECTED]> Sat Apr 9 12:02:30 2005
    committer Linus Torvalds <[EMAIL PROTECTED]> Sat Apr 9 12:02:30 2005