Re: full kernel history, in patchset format
David Mansfield <[EMAIL PROTECTED]> wrote:
> Catalin Marinas wrote:
>> AFAIK, cvsps uses the date/time to create the changesets. There is a
>> problem with the BKCVS export since some files in the same commit can
>> have a different time (by an hour). I posted a mail some time ago
>> about this -
>> http://marc.theaimsgroup.com/?l=linux-kernel&m=110026570201544&w=2
>> I read that the old history won't be merged into the new repository
>> but, if you are interested, I have a script that can do this based on
>> the "(Logical change ...)" string in the file commit logs and it is
>> quite fast at generating the patches.
>
> Hmmm. I read that message just now. Is it a matter of 'perfection'
> that is the issue here, or actual correctness when applying the
> patches in order?

I see it as a matter of correctness since in a given BKCVS changeset (i.e. revision in the ChangeSet,v file) you may miss files. You would eventually get them, with the same log, but in a different patch. If you don't care about this, you can call it 'perfection'.

At that time I thought about modifying cvsps to use the "(Logical change ...)" string instead of time/date for grouping the files but I realised it is easier with a shell script.

> (perhaps this has now been fixed).

There was no reply to this e-mail. It might have been fixed in the meantime but I don't think the history was fixed as well.

--
Catalin
-
To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
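[Editorial sketch] The grouping Catalin describes, keyed on the "(Logical change ...)" string rather than on timestamps, might look roughly like the following. This is not his shell script, which is not shown in the thread. The function name, the tuple layout and the sample data are invented for illustration, and whatever follows "Logical change" is treated as an opaque identifier because the thread does not spell out its format.

```python
import re
from collections import defaultdict

# Only the "(Logical change ...)" prefix comes from the thread; the part after
# it is treated as an opaque identifier (assumption).
LOGICAL_CHANGE = re.compile(r"\(Logical change ([^)]+)\)")


def group_by_logical_change(file_revisions):
    """Group (path, revision, log) tuples by the marker found in the log text."""
    changesets = defaultdict(list)
    for path, revision, log in file_revisions:
        match = LOGICAL_CHANGE.search(log)
        # Revisions without a marker are collected under None, not guessed at.
        key = match.group(1) if match else None
        changesets[key].append((path, revision))
    return dict(changesets)


# Invented sample data; the ids "A" and "B" are placeholders.
revisions = [
    ("fs/inode.c", "1.10", "fix leak\n\n(Logical change A)"),
    ("mm/slab.c", "1.7", "tune cache\n\n(Logical change B)"),
    ("include/linux/fs.h", "1.22", "fix leak\n\n(Logical change A)"),
]
groups = group_by_logical_change(revisions)
```

Because the key is the marker and not the commit time, two files from the same commit end up in the same group even when their timestamps differ by an hour.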
Re: full kernel history, in patchset format
Catalin Marinas wrote:
> Ingo Molnar <[EMAIL PROTECTED]> wrote:
>> i've converted the Linux kernel CVS tree into 'flat patchset' format,
>> which gave a series of 28237 separate patches. (Each patch represents a
>> changeset, in the order they were applied. I've used the cvsps
>> utility.)
>
> AFAIK, cvsps uses the date/time to create the changesets. There is a
> problem with the BKCVS export since some files in the same commit can
> have a different time (by an hour). I posted a mail some time ago
> about this -
> http://marc.theaimsgroup.com/?l=linux-kernel&m=110026570201544&w=2
> I read that the old history won't be merged into the new repository
> but, if you are interested, I have a script that can do this based on
> the "(Logical change ...)" string in the file commit logs and it is
> quite fast at generating the patches.

Hmmm. I read that message just now. Is it a matter of 'perfection' that is the issue here, or actual correctness when applying the patches in order?

(perhaps this has now been fixed).

David
Re: full kernel history, in patchset format
Ingo Molnar <[EMAIL PROTECTED]> wrote:
> i've converted the Linux kernel CVS tree into 'flat patchset' format,
> which gave a series of 28237 separate patches. (Each patch represents a
> changeset, in the order they were applied. I've used the cvsps
> utility.)

AFAIK, cvsps uses the date/time to create the changesets. There is a problem with the BKCVS export since some files in the same commit can have a different time (by an hour). I posted a mail some time ago about this - http://marc.theaimsgroup.com/?l=linux-kernel&m=110026570201544&w=2

I read that the old history won't be merged into the new repository but, if you are interested, I have a script that can do this based on the "(Logical change ...)" string in the file commit logs and it is quite fast at generating the patches.

--
Catalin
Re: full kernel history, in patchset format
On Sun, 2005-04-17 at 18:16 -0700, Linus Torvalds wrote: > Alternatively, you can have just the rev-tree cache of them. That's what > it was designed for (along with avoiding to have to read 60,000 commits). Purely from a conceptual POV I'd be a little happier with the history just ending with a parent pointer to a commit object which is absent, rather than having commit objects which point to _trees_ which are absent. But I suppose I can't really justify that, and I'm not overly bothered about it either. The important thing to get right at this point is that the tree we all work with should refer to the history, regardless of how we choose to prune it. The current linux-2.6.git tree has a parentless commit for the 2.6.12-rc2 import, which is bad. We should start with Thomas' git tree representing the real history, and work from that. You don't even need to see his tree; you only need the final sha1 hash of the commit in his tree which matches 2.6.12-rc2, so you can use that as the 'parent' of the first change you import yourself. -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: full kernel history, in patchset format
On Mon, 18 Apr 2005, Petr Baudis wrote: > Dear diary, on Mon, Apr 18, 2005 at 02:06:43AM CEST, I got a letter > where David Woodhouse <[EMAIL PROTECTED]> told me that... > > On Mon, 2005-04-18 at 01:39 +0200, Petr Baudis wrote: > > > Of course an entirely different thing are _trees_ associated with those > > > commits. As long as you stay with a simple three-way merge, you > > > basically never want to look at trees which aren't heads and which you > > > don't specifically request to look at. And the trees and what they carry > > > inside is the main bulk of data. > > > > If the trees are absent and you're trying to merge, what do you gain > > from having the commit objects? > > merge-base Alternatively, you can have just the rev-tree cache of them. That's what it was designed for (along with avoiding to have to read 60,000 commits). Linus - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
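[Editorial sketch] Petr's one-word answer, "merge-base", rests on the fact that finding a common ancestor only needs the parent links stored in commit objects, not the trees. A simplified illustration under that assumption follows. It is not git's actual merge-base algorithm, and the commit names are invented. A parent id that is missing from the local map (pruned history) is treated as having no known parents.

```python
from collections import deque


def ancestors(parents, start):
    """Return every commit reachable from start via parent links (inclusive)."""
    seen = {start}
    queue = deque([start])
    while queue:
        commit = queue.popleft()
        # A parent that is absent locally simply has no known parents.
        for parent in parents.get(commit, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def common_ancestor(parents, a, b):
    """Breadth-first walk from b; return the first commit also reachable from a."""
    reachable_from_a = ancestors(parents, a)
    seen = {b}
    queue = deque([b])
    while queue:
        commit = queue.popleft()
        if commit in reachable_from_a:
            return commit
        for parent in parents.get(commit, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None


# Invented history: two branches forked from "base". No tree data is needed.
history = {
    "root": [],
    "base": ["root"],
    "ours": ["base"],
    "theirs1": ["base"],
    "theirs2": ["theirs1"],
}
base = common_ancestor(history, "ours", "theirs2")
```

Only the commit graph is consulted here, which is why keeping commits while dropping old trees still leaves this step workable.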
Re: full kernel history, in patchset format
Dear diary, on Mon, Apr 18, 2005 at 02:51:59AM CEST, I got a letter where David Woodhouse <[EMAIL PROTECTED]> told me that... > On Mon, 2005-04-18 at 02:50 +0200, Petr Baudis wrote: > > I think I will make git-pasky's default behaviour (when we get > > http-pull, that is) to keep the complete commit history but only trees > > you need/want; togglable to both sides. > > I think the default behaviour should probably be to fetch everything. I think fetching gigs of data just won't work for many people, especially if they could do with a fraction of that. -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: full kernel history, in patchset format
On Mon, 2005-04-18 at 02:50 +0200, Petr Baudis wrote:
> I think I will make git-pasky's default behaviour (when we get
> http-pull, that is) to keep the complete commit history but only trees
> you need/want; togglable to both sides.

I think the default behaviour should probably be to fetch everything.

--
dwmw2
Re: full kernel history, in patchset format
Dear diary, on Mon, Apr 18, 2005 at 02:45:22AM CEST, I got a letter where David Woodhouse <[EMAIL PROTECTED]> told me that... > On Mon, 2005-04-18 at 02:35 +0200, Petr Baudis wrote: > > > For the special case of removing history before 2.6.12-rc2 from the > > > trees, I certainly think we can do it by leaving out all the commits, > > > not just the trees. We can do that easily, but there's no way we can > > > _add_ that history retrospectively if we omit it in the first place. > > > > I'm confused by this paragraph, but that might be my English skills > > failing somehow. > > "For the general case of people pruning their own trees, _maybe_ you're > right that it would be good to keep the commits even if we delete the > actual trees. But for history older than 2.6.12-rc2, that's a special > case -- I think we can happily delete the commits too. Ah _so_. Thanks for explanation. I think I will make git-pasky's default behaviour (when we get http-pull, that is) to keep the complete commit history but only trees you need/want; togglable to both sides. -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: full kernel history, in patchset format
On Mon, 2005-04-18 at 02:35 +0200, Petr Baudis wrote: > > For the special case of removing history before 2.6.12-rc2 from the > > trees, I certainly think we can do it by leaving out all the commits, > > not just the trees. We can do that easily, but there's no way we can > > _add_ that history retrospectively if we omit it in the first place. > > I'm confused by this paragraph, but that might be my English skills > failing somehow. "For the general case of people pruning their own trees, _maybe_ you're right that it would be good to keep the commits even if we delete the actual trees. But for history older than 2.6.12-rc2, that's a special case -- I think we can happily delete the commits too. "We can delete old trees/commits easily, but we can't _add_ them to the existing linux-2.6.git tree, because the oldest commit in that tree (b4ceb6e27e4cc3f37d26e04c4535c79b98a9f889) doesn't have a parent." -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: full kernel history, in patchset format
Dear diary, on Mon, Apr 18, 2005 at 02:06:43AM CEST, I got a letter where David Woodhouse <[EMAIL PROTECTED]> told me that... > On Mon, 2005-04-18 at 01:39 +0200, Petr Baudis wrote: > > Of course an entirely different thing are _trees_ associated with those > > commits. As long as you stay with a simple three-way merge, you > > basically never want to look at trees which aren't heads and which you > > don't specifically request to look at. And the trees and what they carry > > inside is the main bulk of data. > > If the trees are absent and you're trying to merge, what do you gain > from having the commit objects? merge-base > For the special case of removing history before 2.6.12-rc2 from the > trees, I certainly think we can do it by leaving out all the commits, > not just the trees. We can do that easily, but there's no way we can > _add_ that history retrospectively if we omit it in the first place. I'm confused by this paragraph, but that might be my English skills failing somehow. > For history older than 2.6.12-rc2 I'd suggest that it would be available > in a different place, and absent from the 'main' working tree that > everyone uses by default. The only difference we'd see in the working > tree is that the 2.6.12-rc2 commit -- the oldest commit in that tree -- > would actually have an absentee parent instead of appearing to be an > import. And all the sha1 hashes of all subsequent commits would be > different, of course. Yes, that's what I suggested too. > To allow pruning of older objects in the general case would be a little > bit harder than that, because as things stand you'd be re-fetching them > every time you rsync from elsewhere -- but that wouldn't really be hard > to fix if we care. I think http-pull is very promising. :-) It could be actually much faster than rsync, since you don't need to build directory listings etc, which actually takes non-trivial amount of time already with the kernel git repository. 
-- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
Re: full kernel history, in patchset format
On Mon, 2005-04-18 at 01:39 +0200, Petr Baudis wrote: > I think this is bad, bad, bad. If you don't keep around all the > _commits_, you get into all sorts of troubles - when merging, when doing > git log, etc. And the commits themselves are probably actually pretty > small portion of the thing. I didn't do any actual measurement but I > would be pretty surprised if it would be much more than few megabytes of > data for the kernel history. I'm not sure it's that bad -- and everyone already seems perfectly happy not to have history going back before 2.6.12-rc2. We're not talking about doing this by _default_ -- we're talking about allowing people to keep trees pruned if they _want_ to. So I might want to drop history before 2.6.0 on my laptop, for example. > Of course an entirely different thing are _trees_ associated with those > commits. As long as you stay with a simple three-way merge, you > basically never want to look at trees which aren't heads and which you > don't specifically request to look at. And the trees and what they carry > inside is the main bulk of data. If the trees are absent and you're trying to merge, what do you gain from having the commit objects? And for the case of 'git log', I certainly think it's acceptable that you lose out on those parts of prehistory which you've explicitly removed from your local tree -- that's a feature, not a bug. For the special case of removing history before 2.6.12-rc2 from the trees, I certainly think we can do it by leaving out all the commits, not just the trees. We can do that easily, but there's no way we can _add_ that history retrospectively if we omit it in the first place. For history older than 2.6.12-rc2 I'd suggest that it would be available in a different place, and absent from the 'main' working tree that everyone uses by default. 
The only difference we'd see in the working tree is that the 2.6.12-rc2 commit -- the oldest commit in that tree -- would actually have an absentee parent instead of appearing to be an import. And all the sha1 hashes of all subsequent commits would be different, of course. To allow pruning of older objects in the general case would be a little bit harder than that, because as things stand you'd be re-fetching them every time you rsync from elsewhere -- but that wouldn't really be hard to fix if we care. Either way, I think it can probably be done by omitting the commit objects as well as the trees -- but the important point is that we _should_ include a 'parent' pointer in the oldest commit of the tree we're working with, pointing back to the imported history. -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: full kernel history, in patchset format
Dear diary, on Mon, Apr 18, 2005 at 01:31:36AM CEST, I got a letter where David Woodhouse <[EMAIL PROTECTED]> told me that... > Note that any given copy of a tree doesn't _need_ to keep all the > history back the beginning of time. It's OK if the oldest commit object > in your tree actually refers back to a parent which doesn't exist > locally. I can well imagine that some people will want to keep their > trees pruned to keep only a few weeks of history, while other copies of > the tree will keep everything. I think this is bad, bad, bad. If you don't keep around all the _commits_, you get into all sorts of troubles - when merging, when doing git log, etc. And the commits themselves are probably actually pretty small portion of the thing. I didn't do any actual measurement but I would be pretty surprised if it would be much more than few megabytes of data for the kernel history. Of course an entirely different thing are _trees_ associated with those commits. As long as you stay with a simple three-way merge, you basically never want to look at trees which aren't heads and which you don't specifically request to look at. And the trees and what they carry inside is the main bulk of data. -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: full kernel history, in patchset format
On Sat, 2005-04-16 at 10:04 -0700, Linus Torvalds wrote: > So I'd _almost_ suggest just starting from a clean slate after all. > Keeping the old history around, of course, but not necessarily putting it > into git now. It would just force everybody who is getting used to git in > the first place to work with a 3GB archive from day one, rather than > getting into it a bit more gradually. > > What do people think? I'm not so much worried about the data itself: the > git architecture is _so_ damn simple that now that the size estimate has > been confirmed, that I don't think it would be a problem per se to put > 3.2GB into the archive. But it will bog down "rsync" horribly, so it will > actually hurt synchronization untill somebody writes the rev-tree-like > stuff to communicate changes more efficiently.. Note that any given copy of a tree doesn't _need_ to keep all the history back the beginning of time. It's OK if the oldest commit object in your tree actually refers back to a parent which doesn't exist locally. I can well imagine that some people will want to keep their trees pruned to keep only a few weeks of history, while other copies of the tree will keep everything. However, if we _don't_ base our current work on an existing import of the kernel, then we don't retain that option. We can't just change the 'parent' field of your 2.6.12-rc2 import, without changing the sha1 hash of _everything_ that happens thereafter. So I'd say we should take Thomas' import, and base new work on that -- but then possibly leave out the older objects from the 'working' repository which everyone is rsyncing from; just make them available in a 'linux-history.git' object database elsewhere. -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
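[Editorial sketch] David's point that the 'parent' field of the 2.6.12-rc2 import cannot be changed without changing the sha1 hash of everything after it can be illustrated with a toy hash chain. The commit layout below is a simplified stand-in and not git's real object encoding; the tree names and the "hypothetical-history-head" parent id are invented.

```python
import hashlib


def commit_id(tree, parents, message):
    """Hash a simplified commit record: tree id, parent ids, then the message."""
    lines = [f"tree {tree}"] + [f"parent {p}" for p in parents] + ["", message]
    return hashlib.sha1("\n".join(lines).encode()).hexdigest()


def build_chain(first_parents, messages):
    """Build a linear history; each commit names the previous one as its parent."""
    ids = []
    parents = list(first_parents)
    for n, message in enumerate(messages):
        cid = commit_id(f"tree-{n}", parents, message)
        ids.append(cid)
        parents = [cid]
    return ids


messages = ["import 2.6.12-rc2", "first fix", "second fix"]

# Same trees and messages; the only difference is the first commit's parent.
parentless = build_chain([], messages)
with_history = build_chain(["hypothetical-history-head"], messages)
```

Each id covers its parent's id, so a different first parent propagates to every later commit even though trees and messages are identical. That is why the choice has to be made before new work is stacked on top.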
Re: full kernel history, in patchset format
On Sat, 16 Apr 2005, Thomas Gleixner wrote:

> On Sat, 2005-04-16 at 10:04 -0700, Linus Torvalds wrote:
>> So I'd _almost_ suggest just starting from a clean slate after all.
>> Keeping the old history around, of course, but not necessarily putting it
>> into git now. It would just force everybody who is getting used to git in
>> the first place to work with a 3GB archive from day one, rather than
>> getting into it a bit more gradually.
>
> Sure. We can export the 2.6.12-rc2 version of the git'ed history tree
> and start from there. Then the first changeset has a parent, which just
> lives in a different place.
> Thats the only difference to your repository, but it would change the
> sha1 sums of all your changesets.

at least start with a full release, say 2.6.11.

the history won't be blank, but it's far more likely that people will care about the details between 2.6.11 and 2.6.12 and will want to go back before -rc2.

David Lang

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies. -- C.A.R. Hoare
Re: full kernel history, in patchset format
On Sat, 16 Apr 2005, Mike Taht wrote: > Junio C Hamano wrote: > >>"MT" == Mike Taht <[EMAIL PROTECTED]> writes: > > > > > > MT> alternatively, "git-archive-torrent" to create a list of files for a > > MT> bittorrent feed > > > > That is certainly good for establishing the baseline, but you > > still need to leverage the inherent delta-compressibility > > between related blobs/trees by also doing something like what I > > described as "diff package", don't you? > > Yes... yes you could have files and diffs generated statically... > > although something like a bittorrent server/client/frontend, call it > "gittorrent" (I hate being the first to make this pun) could walk the > hashes dynamically ( > Ihave: sha,sha,sha,sha... Sendme: shaxxx > Hereswhatyouneedfromgit: file,file,file,diff,diff,diff,...) I'm actually working on a trivial HTTP client to do this. The user says "get from ", and it gets that object, the associated trees, and the associated blobs, skipping any that it already has. This should save having a non-standard public-facing server process, and be essentially as effective, at least once I have it using a single connection for everything. -Daniel *This .sig left intentionally blank* - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
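[Editorial sketch] Daniel's client, as described, fetches a named object, then its trees and blobs, skipping anything already present locally. The walk itself might be sketched as below. To keep the example self-contained the HTTP request is replaced by an injected `fetch` callable, and the object names and the `(kind, references)` return shape are assumptions, not his actual code.

```python
def pull(object_id, fetch, have):
    """Fetch object_id and everything it references, skipping known objects.

    fetch(object_id) -> (kind, referenced_ids) stands in for an HTTP GET plus
    parsing of the object; have is the set of ids already stored locally.
    """
    fetched = []
    pending = [object_id]
    while pending:
        current = pending.pop()
        if current in have:
            continue  # already present locally, no request needed
        _kind, references = fetch(current)
        have.add(current)
        fetched.append(current)
        pending.extend(references)
    return fetched


# Invented remote repository: one commit, its tree, and two blobs.
remote = {
    "commit1": ("commit", ["tree1"]),
    "tree1": ("tree", ["blob1", "blob2"]),
    "blob1": ("blob", []),
    "blob2": ("blob", []),
}
local = {"blob1"}  # this blob was downloaded earlier
downloaded = pull("commit1", remote.__getitem__, local)
```

In this sketch a commit references only its tree, matching the "that object, the associated trees, and the associated blobs" description. Following parent commits would be a further step.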
Re: full kernel history, in patchset format
> "CL" == Christopher Li <[EMAIL PROTECTED]> writes: CL> I bet 90% of the time people sync to the repository head first CL> want to check out the last bits. And maybe reading some change CL> log to see what is changed. CL> So having all the commit object, the user will able to see CL> what is change and which version he we like to check out. Makes sense. - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: full kernel history, in patchset format
Junio C Hamano wrote:
>> "MT" == Mike Taht <[EMAIL PROTECTED]> writes:
>
> MT> alternatively, "git-archive-torrent" to create a list of files for a
> MT> bittorrent feed
>
> That is certainly good for establishing the baseline, but you
> still need to leverage the inherent delta-compressibility
> between related blobs/trees by also doing something like what I
> described as "diff package", don't you?

Yes... yes you could have files and diffs generated statically...

although something like a bittorrent server/client/frontend, call it "gittorrent" (I hate being the first to make this pun) could walk the hashes dynamically (
Ihave: sha,sha,sha,sha...
Sendme: shaxxx
Hereswhatyouneedfromgit: file,file,file,diff,diff,diff,...)

--
Mike Taht

"It looks like blind screaming hedonism won out."
Re: full kernel history, in patchset format
* Linus Torvalds <[EMAIL PROTECTED]> wrote: > > the history data starts at 2.4.0 and ends at 2.6.12-rc2. I've included a > > script that will apply all the patches in order and will create a > > pristine 2.6.12-rc2 tree. > > Hey, that's great. I got the CVS repo too, and I was looking at it, > but the more I looked at it, the more I felt that the main reason I > want to import it into git ends up being to validate that my size > estimates are at all realistic. > > I see that Thomas Gleixner seems to have done that already, and come > to a figure of 3.2GB for the last three years, which I'm very happy > with, mainly because it seems to match my estimates to a tee. [...] (yeah, we apparently worked in parallel - i only learned about his efforts after i sent my mail. He was using BK to extract info, i was using the CVS tree alone and no BK code whatsoever. (I dont think there will be any argument about who owns what, but i wanted to be on the safe side, and i also wanted to see how complete and usable the CVS metadata is - it's close to perfect i'd say, for the purposes i care about.)) > But I wonder if we actually want to actually populate the whole > history.. yeah, it definitely feels a bit brave to import 28,000 changesets into a source-code database project that will be a whopping 2 weeks old in 2 days ;) Even if we felt 100% confident about all the basics (which we do of course ;), it's just simply too young to tie things down via a 3.2GB database. It feels much more natural to grow it gradually, 28,000 changesets i'm afraid would just suffocate the 'project growth dynamics'. Not going too fast is just as important as not going too slow. 
I didnt generate the patchset to get it added into some central repository right now, i generated it to check that we _do_ have all the revision history in an easy to understand format which does generate today's kernel tree, so that we can lean back and worry about the full database once things get a bit more settled down (in a couple of months or so). It's also an easy testbed for GIT itself. but the revision history was one of the main reasons i used BK myself, so we'll need a merged database eventually. Occasionally i needed to check who was the one who touched a particular piece of code - was that fantastic new line of code written by me, or was that buggy piece of crap written by someone else? ;) Also, looking at a change and then going to the changeset that did it, and then looking at the full picture was pretty useful too. So that sort of annotation, and generally navigating around _quickly_ and looking at the 'flow' of changes going into a particular file was really useful (for me). Ingo - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: full kernel history, in patchset format
We can just have a baseline file contain all the commit objects. Then have git "download on demand".

The problem with the diff package is that it is harder to merge with more than one diff.

I bet 90% of the time people who sync to the repository head first want to check out the latest bits, and maybe read some change log to see what has changed.

So having all the commit objects, the user will be able to see what has changed and which version he would like to check out. Then he can issue a command "download me all the objects needed to check out this commit". Download on demand should be even better.

Chris

On Sat, Apr 16, 2005 at 12:19:22PM -0700, Junio C Hamano wrote:
> > "MT" == Mike Taht <[EMAIL PROTECTED]> writes:
>
> MT> alternatively, "git-archive-torrent" to create a list of files for a
> MT> bittorrent feed
>
> That is certainly good for establishing the baseline, but you
> still need to leverage the inherent delta-compressibility
> between related blobs/trees by also doing something like what I
> described as "diff package", don't you?
Re: full kernel history, in patchset format
On Sat, 2005-04-16 at 12:15 -0700, Linus Torvalds wrote: > > On Sat, 16 Apr 2005, Thomas Gleixner wrote: > > > > For the export stuff its terrible slow. :( > > What kind of _strange_ scripting architecture is so fast that there's a > difference between "cat-file" and "ls-tree" and can handle 17,000 files in > 60,000 revisions, yet so slow that you can't trivially convert 20 bytes of > data? Sorry I was neither talking about "cat-file ..." nor about the 20 byte conversion. I was talking about the bk export script, which writes the objects itself. Doing this with the git-tools would slow it down, as I have the retrieved data already in memory. It does not slow me down to create the binary ref, but its annoying. I just figured, that some revtools might have the need to use direct pointers into objects and face the same problem the other way round. tglx - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: full kernel history, in patchset format
> "PB" == Petr Baudis <[EMAIL PROTECTED]> writes:

PB> P.S.: It seems that Linus applied a patch to ls-tree which will make it
PB> read_sha1_file() on each item when ls-tree is recursive. Junio, why did
PB> you do it?

Sorry it was my misunderstanding, before I found out exactly how S_ISDIR is used. Thank you for pointing it out.

I was confused by this comment around the area I changed:

    /* XXX: We do some ugly mode heuristics here.
     * It seems not worth it to read each file just to get this
     * and the file size. -- [EMAIL PROTECTED]

I mistakenly inferred from that comment that S_ISDIR(mode) is not a guarantee. So I mistakenly optimized it for non-recursive case by keeping that "heuristics". The logic was: If recursive we will need to run read_sha1_file() to find out if it is really a tree anyway.

I'll fix it up, now I know S_ISDIR(mode) is a guarantee that it is a tree, I'll do the "heuristics" first, and do read_sha1_file only when it is a tree and I am recursive.
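[Editorial sketch] The fix Junio describes (trust S_ISDIR(mode) to identify trees, and read the object only when it is a tree and the listing is recursive) can be modelled as follows. This is a Python model of the control flow and not the ls-tree C source. `read_tree` is a hypothetical stand-in for read_sha1_file() plus tree parsing, and the sample entries are invented.

```python
import stat


def list_tree(tree_id, read_tree, recursive=False, prefix=""):
    """List entries of a tree, descending only into directories when recursive.

    read_tree(tree_id) -> list of (mode, name, object_id) tuples.
    """
    entries = []
    for mode, name, object_id in read_tree(tree_id):
        path = prefix + name
        is_tree = stat.S_ISDIR(mode)  # the mode alone decides tree vs blob
        entries.append(("tree" if is_tree else "blob", path))
        # Only a tree in recursive mode costs an extra object read.
        if is_tree and recursive:
            entries.extend(list_tree(object_id, read_tree, True, path + "/"))
    return entries


# Invented object store; reads records every simulated object read.
trees = {
    "root": [(0o100644, "Makefile", "b1"), (0o040000, "fs", "t1")],
    "t1": [(0o100644, "inode.c", "b2")],
}
reads = []


def read_tree(tree_id):
    reads.append(tree_id)
    return trees[tree_id]


flat = list_tree("root", read_tree)
flat_reads = list(reads)
reads.clear()
deep = list_tree("root", read_tree, recursive=True)
deep_reads = list(reads)
```

Blobs are never read in either mode. The non-recursive listing touches only the top-level tree.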
Re: full kernel history, in patchset format
> "MT" == Mike Taht <[EMAIL PROTECTED]> writes: MT> alternatively, "git-archive-torrent" to create a list of files for a MT> bittorrent feed That is certainly good for establishing the baseline, but you still need to leverage the inherent delta-compressibility between related blobs/trees by also doing something like what I described as "diff package", don't you? - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: full kernel history, in patchset format
* David Mansfield <[EMAIL PROTECTED]> wrote:

> Ingo Molnar wrote:
> >* Ingo Molnar <[EMAIL PROTECTED]> wrote:
> >
> >>the patches contain all the existing metadata, dates, log messages and
> >>revision history. (What i think is missing is the BK tree merge
> >>information, but i'm not sure we want/need to convert them to GIT.)
> >
> >author names are abbreviated, e.g. 'viro' instead of
> >[EMAIL PROTECTED], and no committer information is
> >included (albeit committer ought to be Linus in most cases). These are
> >limitations of the BK->CVS gateway i think.
>
> Glad to hear cvsps made it through! I'm curious what the manual
> fixups required were, except for the binary file issue (logo.gif).

--cvs-direct was needed to speed it up from 'several days to finish' to 'several hours to finish', but it crashed on a handful of patches [i used the latest devel snapshot so this isnt a complaint]. (one of the crashes was when generating 1860.patch.)

Also, 'cvs rdiff' apparently emits an empty patch for diffs that remove a file that ends without a newline character - but this isnt cvsps's problem. (grep for +++ in the patchset to find those cases.)

> As to the actual email addresses, for more recent patches, the
> Signed-off should help. For earlier ones, isn't there some script
> which 'knows' a bunch of canonical author->email mappings? (the
> shortlog script or something)?

yeah, that's not that much of a problem, most of the names are unique, and the rest can be fixed up too.

> Is the full committer email address actually in the changeset in BK?
> If so, given that we have the unique id (immutable I believe) of the
> changeset, could it be extracted directly from BK?

i think it's included in BK.

Ingo
Re: full kernel history, in patchset format
On Sat, 16 Apr 2005, Thomas Gleixner wrote: > > For the export stuff its terrible slow. :( I don't really see your point. If you already know what the tree is like you say, you don't care about the tree object. And if you don't know what the tree is, what _are_ you doing? In other words, show us what you're complaining about. If you're looking into the trees yourself, then the binary representation of the sha1 is already what you want. That _is_ the hash. So why do you want it in ASCII? And if you're not looking into the tree directly, but using "cat-file tree" and you were hoping to see ASCII data, then that's certainly not going to be any faster than just doing "ls-tree" instead. In other words, I don't see your point. Either you want ascii output for scripting, or you don't. First you claimed that you did, and that you would want the tree object to change in order to do so. Now you claim that you can't use "ls-tree" because it's too slow. That just isn't making any sense. You're mixing two totally different levels, and complaining about performance when scripting things. Yet you're talking about a 20-byte data structure that is trivial to convert to any format you want. What kind of _strange_ scripting architecture is so fast that there's a difference between "cat-file" and "ls-tree" and can handle 17,000 files in 60,000 revisions, yet so slow that you can't trivially convert 20 bytes of data? Linus - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
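To make the "20 bytes is trivial to convert" point concrete, here is a minimal Python sketch. It assumes the raw tree payload is laid out as in current git (ASCII octal mode, a space, the name, a NUL byte, then the 20 raw SHA1 bytes); the thread itself does not spell out the layout, and the function name `parse_tree` is made up for illustration.

```python
def parse_tree(raw: bytes):
    """Yield (mode, name, hex_sha1) for each entry of a raw tree payload.

    Assumed layout per entry (as in current git, not confirmed by the thread):
    b"<octal mode> <name>\\0" followed by 20 raw SHA1 bytes.
    """
    pos = 0
    while pos < len(raw):
        space = raw.index(b" ", pos)        # end of the mode field
        nul = raw.index(b"\0", space)       # end of the name field
        mode = raw[pos:space].decode("ascii")
        name = raw[space + 1:nul].decode("utf-8", "replace")
        sha1 = raw[nul + 1:nul + 21]        # the 20-byte binary hash
        if len(sha1) != 20:
            raise ValueError("truncated tree entry")
        # binary -> ASCII is a single call; bytes.fromhex() is the reverse
        yield mode, name, sha1.hex()
        pos = nul + 21
```

The conversion in either direction is one standard-library call (`bytes.hex()` and `bytes.fromhex()`), which is the heart of the argument that the binary representation should not be a scripting bottleneck.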
Re: full kernel history, in patchset format
On Sat, 2005-04-16 10:04:31 -0700, Linus Torvalds <[EMAIL PROTECTED]> wrote in message <[EMAIL PROTECTED]>:

> What do people think? I'm not so much worried about the data itself: the
> git architecture is _so_ damn simple that now that the size estimate has
> been confirmed, that I don't think it would be a problem per se to put
> 3.2GB into the archive. But it will bog down "rsync" horribly, so it will
> actually hurt synchronization until somebody writes the rev-tree-like
> stuff to communicate changes more efficiently..
>
> IOW, it smells to me like we don't have the infrastructure to really work
> with 3GB archives, and that if we start from scratch (2.6.12-rc2), we can
> build up the infrastructure in parallel with starting to really need it.

3GB is quite some data, but I'd accept and prefer to download it from somewhere. I think that it's worth it. I accept that there are people out there who would love to get a smaller archive, but at least most developers that would actually use it for day-to-day work *do* have the bandwidth to download it.

Maybe we'd also prepare (from time to time) bzip'ed tarballs, which I expect to be a tad smaller.

Regards, JBG

--
Jan-Benedict Glaw [EMAIL PROTECTED] +49-172-7608481
"A free opinion in a free head for a free state full of free citizens" | Against censorship on the Internet! | Against war in Iraq!
ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA));
Re: Re: full kernel history, in patchset format
Dear diary, on Sat, Apr 16, 2005 at 09:50:21PM CEST, I got a letter where Thomas Gleixner <[EMAIL PROTECTED]> told me that... > On Sat, 2005-04-16 at 11:44 -0700, Linus Torvalds wrote: > > > That level of abstraction ("we never look directly at the objects") is > > what allows us to change the object structure later. For example, we > > already changed the "commit" date thing once, and the tree object has > > obviously evolved a bit, and if we ever change the hash, the objects will > > change too, but if you always just script them using nice helper tools, > > you won't ever need to _care_. And that's how it should be. > > For the export stuff its terrible slow. :( It seems to me that you must be doing something wrong then. I can't see anything which would not make ls-tree blindingly fast (except for when being recursive, see below). BTW, what do you need ls-tree output for, when doing export _to_ git? P.S.: It seems that Linus applied a patch to ls-tree which will make it read_sha1_file() on each item when ls-tree is recursive. Junio, why did you do it? Is there any possible case when the item would not be marked as directory but it would be a tree object? I could imagine it bogging down ls-tree on big tree a lot. -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Re: full kernel history, in patchset format
On Sat, Apr 16, 2005 at 07:43:27PM +0200, Petr Baudis wrote:

> Dear diary, on Sat, Apr 16, 2005 at 07:04:31PM CEST, I got a letter
> where Linus Torvalds <[EMAIL PROTECTED]> told me that...
> > So I'd _almost_ suggest just starting from a clean slate after all.
> > Keeping the old history around, of course, but not necessarily putting it
> > into git now. It would just force everybody who is getting used to git in
> > the first place to work with a 3GB archive from day one, rather than
> > getting into it a bit more gradually.
> >
> > Comments?
>
> FWIW, it looks pretty reasonable to me. Perhaps we should have a
> separate GIT repository with the previous history though, and in the
> first new commit the parent could point to the last commit from the
> other repository.
>
> Just if it isn't too much work, though. :-)

I think we can make git use stackable repositories. When it fails to find an object, it will try to read it from the parent repository. This is useful for slicing the history. I can have a local repository where all the new objects I create are stored in my tree instead of the official one. Cleaning up objects in my local tree will be much easier, since it only needs to work on a much smaller repository. If all my changes are merged into the official tree, I simply empty my local repository.

About the kernel git repository: I think it is much easier to just put everything in one tree, so I don't need to worry about "if I need to see pre-2.6.12, I need to do this". And the full repository needs to be stored on a server somewhere anyway.

However, I totally agree that people should not have to deal with unnecessary history when they start using the git tools. We should just make the tools not download all the history by default, and only get it when the user specifically asks for it. Why 2.6.12-rc2? When the kernel grows to 2.6.15, a new user might not even need pre-2.6.13 most of the time.

If we make it very easy for people to get history when they need it, they will be less motivated to store unnecessary history locally ("just in case I need it"). I think we should not advise using rsync to sync the whole git tree as the way to get updates. We need to get used to having only a slice of the history and getting more if needed.

The server should provide some small metadata file, like the rev-tool cache, so the SCM tools can download it to figure out which files need to be downloaded to get to a certain revision, instead of downloading the whole repository to figure out what is new. We can even slice that metadata into smaller pieces based on major release points.

Chris
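The stackable-repository idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: in-memory dictionaries stand in for `objects/` directories, the key is a plain SHA1 of the bytes (not git's real object naming), and the class and method names are hypothetical.

```python
import hashlib


class ObjectStore:
    """A toy object store that falls back to a parent store on a miss."""

    def __init__(self, parent=None):
        self.objects = {}      # hex sha1 -> bytes; stands in for objects/
        self.parent = parent   # e.g. the official repository, or None

    def put(self, data: bytes) -> str:
        # New objects always land in this (local) store, never in the parent.
        key = hashlib.sha1(data).hexdigest()
        self.objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        # Walk up the stack until some store has the object.
        store = self
        while store is not None:
            if key in store.objects:
                return store.objects[key]
            store = store.parent
        raise KeyError(key)

    def clear(self):
        # "Simply empty my local repository" once the changes are merged upstream.
        self.objects.clear()
```

With `local = ObjectStore(parent=official)`, lookups see both stores while writes touch only the local one, so cleaning up never has to scan the large official store.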
Re: full kernel history, in patchset format
On Sat, 2005-04-16 at 11:44 -0700, Linus Torvalds wrote:

> That level of abstraction ("we never look directly at the objects") is
> what allows us to change the object structure later. For example, we
> already changed the "commit" date thing once, and the tree object has
> obviously evolved a bit, and if we ever change the hash, the objects will
> change too, but if you always just script them using nice helper tools,
> you won't ever need to _care_. And that's how it should be.

For the export stuff it's terribly slow. :(

I agree that using common tools is good. But we are also talking about an open format, so using a script to speed up certain tasks is not bad at all.

tglx
Re: full kernel history, in patchset format
> "JCH" == Junio C Hamano <[EMAIL PROTECTED]> writes: JCH> I have been cooking this idea before I dove into the merge stuff JCH> and did not have time to implement it myself (Hint Hint), but I JCH> think something along the following lines would work nicely: It should be fairly obvious from the context what I meant to say, but in case somebody gets confused by my inaccurate description of small details (or, before somebody nitpicks ;-), I'd add some clarifications and corrections. JCH> * Run diff-tree between neighboring commits [*1*] to find out JCH>the set of blobs that are "related". Extract those related JCH>blobs and run "diff" [*2*] between them to see if it produces JCH>a patch smaller than the whole thing when compressed. If JCH>diff+patch is a win, then we do not have to transmit the blob JCH>that we could reproduce by sending the diff. Note that fact. I talked only about blobs here, but I really mean all types: commits, trees and blobs here. Nothing prevents us from extracting the raw data for trees and commits and run diff between them. We can use cat-file to do that today. What we do not have is the reverse of "$ cat-file type >rawdata" (i.e. "$ write-file type Given the above, the operation of git-archive-patch is also JCH> quite obvious. Extract the "diff package" tarball into the JCH> objects/ directory that has (at least) the full Bn, uncompress JCH> the patch file part, and run patch on it. Of course after you ran patch to reproduce the raw data for the blob or tree, we need the reverse of cat-file to register such data under object/ hierarchy. - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: full kernel history, in patchset format
On Sat, 16 Apr 2005, Thomas Gleixner wrote:

> One remark on the tree blob storage format.
> The binary storage of the sha1sum of the referred object is a PITA for
> scripting.
> Converting the ASCII -> binary for the sha1sum comparison should not
> take much longer than the binary -> ASCII conversion for the file
> reference. Can this be changed ?

I'd really rather not. Why don't you just use "ls-tree" for scripting? That's why it exists in the first place.

It might make sense to have some simple selection capabilities built into ls-tree (ie "ls-tree --match drivers/char/ -z " to get just a subtree out), but that depends entirely on how you end up using it.

The fact is, there should _never_ be any reason to look at the objects themselves directly. "cat-file" is a debugging aid, it shouldn't be scripted (with the possible exception of "cat-file blob " to just extract the blob contents, since that object doesn't have any internal structure).

That level of abstraction ("we never look directly at the objects") is what allows us to change the object structure later. For example, we already changed the "commit" date thing once, and the tree object has obviously evolved a bit, and if we ever change the hash, the objects will change too, but if you always just script them using nice helper tools, you won't ever need to _care_. And that's how it should be.

If there's a tool missing, holler. THAT is the part I've been trying to write: all the plumbing so that you _can_ script the thing sanely, and not worry about how objects are created and worked with.

For example, that "index" file format likely _will_ change. I ended up doing the new "stage" flags in a way that kept the index file compatible with old ones, but I did that mainly because it also happened to be the easiest way to enforce the rule I wanted to enforce (ie the "stage" really _is_ a part of the filename from a "compare filenames" standpoint, in order to make sure that the stages are always ordered). So if the index file change hadn't had that property, I'd have just said "I'll change the format", and anybody who tried to parse the index file would have been _broken_.

		Linus
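As an illustration of scripting against the helper tool rather than the raw objects, here is a small Python sketch that parses NUL-terminated `ls-tree -z` records. It assumes the record layout documented for present-day git (`<mode> <type> <sha1>`, a TAB, then the path); the 2005 tool discussed in this thread may have printed its fields differently. The function names and the prefix filter (standing in for the hypothetical "--match" selection) are made up for this example.

```python
def parse_ls_tree_z(output: bytes):
    """Parse NUL-terminated ls-tree records into (mode, type, sha1, path) tuples.

    Assumed record layout (present-day git): b"<mode> <type> <sha1>\\t<path>".
    """
    entries = []
    for record in output.split(b"\0"):
        if not record:
            continue  # skip the empty piece after the final NUL
        meta, _, path = record.partition(b"\t")
        mode, objtype, sha1 = meta.split(b" ")
        entries.append((mode.decode("ascii"),
                        objtype.decode("ascii"),
                        sha1.decode("ascii"),
                        path.decode("utf-8", "replace")))
    return entries


def select_prefix(entries, prefix: str):
    """Keep only entries whose path starts with the given prefix."""
    return [entry for entry in entries if entry[3].startswith(prefix)]
```

Because the script only sees the tool's text output, a change to the on-disk tree format would not break it, which is the abstraction argued for above.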
Re: Re: full kernel history, in patchset format
On Sat, 2005-04-16 at 20:32 +0200, Petr Baudis wrote:

> Dear diary, on Sat, Apr 16, 2005 at 09:23:40PM CEST, I got a letter
> where Thomas Gleixner <[EMAIL PROTECTED]> told me that...
> > One remark on the tree blob storage format.
> > The binary storage of the sha1sum of the referred object is a PITA for
> > scripting.
> > Converting the ASCII -> binary for the sha1sum comparison should not
> > take much longer than the binary -> ASCII conversion for the file
> > reference. Can this be changed ?
>
> Huh, you aren't supposed to peek into trees directly. What's wrong with
> ls-tree?

Why am I not supposed to? Is this evil?

My export script has all the data available, so I write the tree refs directly. The full export runs ~1 hour. That's long enough :) I tried the git way and it slows me down by factor "BIG" (I don't remember the number).

Also, for reference tracking all the information might already be available, e.g. in a database. Why should the revtool then use some tool to retrieve information which is already there?

tglx
Re: Re: Re: full kernel history, in patchset format
Dear diary, on Sat, Apr 16, 2005 at 08:32:32PM CEST, I got a letter where Petr Baudis <[EMAIL PROTECTED]> told me that... > Dear diary, on Sat, Apr 16, 2005 at 09:23:40PM CEST, I got a letter > where Thomas Gleixner <[EMAIL PROTECTED]> told me that... > > One remark on the tree blob storage format. > > The binary storage of the sha1sum of the refered object is a PITA for > > scripting. > > Converting the ASCII -> binary for the sha1sum comparision should not > > take much longer than the binary -> ASCII conversion for the file > > reference. Can this be changed ? > > Huh, you aren't supposed to peek into trees directly. What's wrong with > ls-tree? (I meant, you aren't supposed to peek into trees from scripts. Or well, not "not supposed", but it does not make much sense when you have ls-tree.) -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: full kernel history, in patchset format
* A script git-archive-tar is used to create a "base tarball" that roughly corresponds to "linux-*.tar.gz". This works as follows: $ git-archive-tar C [B1 B2...] This reads the named commit C, grabs the associated tree (i.e. its sub-tree objects and the blob they refer to), and makes a tarball of ??/?? files. The tarball does not have to contain any extra information to reproduce any ancestor of the named commit. alternatively, "git-archive-torrent" to create a list of files for a bittorrent feed -- Mike Taht - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Re: full kernel history, in patchset format
Dear diary, on Sat, Apr 16, 2005 at 09:23:40PM CEST, I got a letter where Thomas Gleixner <[EMAIL PROTECTED]> told me that... > One remark on the tree blob storage format. > The binary storage of the sha1sum of the refered object is a PITA for > scripting. > Converting the ASCII -> binary for the sha1sum comparision should not > take much longer than the binary -> ASCII conversion for the file > reference. Can this be changed ? Huh, you aren't supposed to peek into trees directly. What's wrong with ls-tree? -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: full kernel history, in patchset format
> "LT" == Linus Torvalds <[EMAIL PROTECTED]> writes: LT> What do people think? I'm not so much worried about the data itself: the LT> git architecture is _so_ damn simple that now that the size estimate has LT> been confirmed, that I don't think it would be a problem per se to put LT> 3.2GB into the archive. But it will bog down "rsync" horribly, so it will LT> actually hurt synchronization untill somebody writes the rev-tree-like LT> stuff to communicate changes more efficiently.. LT> IOW, it smells to me like we don't have the infrastructure to really work LT> with 3GB archives, and that if we start from scratch (2.6.12-rc2), we can LT> build up the infrastructure in parallell with starting to really need it. LT> But it's _great_ to have the history in this format, especially since LT> looking at CVS just reminded me how much I hated it. LT> Comments? I have been cooking this idea before I dove into the merge stuff and did not have time to implement it myself (Hint Hint), but I think something along the following lines would work nicely: * A script git-archive-tar is used to create a "base tarball" that roughly corresponds to "linux-*.tar.gz". This works as follows: $ git-archive-tar C [B1 B2...] This reads the named commit C, grabs the associated tree (i.e. its sub-tree objects and the blob they refer to), and makes a tarball of ??/?? files. The tarball does not have to contain any extra information to reproduce any ancestor of the named commit. When extra parameters, B1 B2..., are given, it also creates "diff package" that roughly corresponds to "patch-*.gz" for each Bn given. They must be ancestors of commit. The intention is to store enough information to ensure that the recipient can recreate all the SHA1 files "base tarball" for commits between (Bn, C] would contain, provided if the recipient already has all the SHA1 files "base tarball" for Bn. * A script git-archive-patch is used to read such a "diff package". 
So a user needs to:

 * First pick some baseline B and download the base tarball for commit B. It is up to him to make trade-offs between how far back he wants to see the history and how much bandwidth he wants to waste. Untar it to get the baseline.

 * Then periodically pick up the "diff package" for (B, C] where C is the latest available. Run git-archive-patch to populate the rest.

 * In addition the user can run rsync with timestamp option to pick up SHA1 files created upstream since C after this happens.

What git-archive-tar needs to do to produce the "diff package" for (Bn, C] is fairly obvious.

 * From rev-tree output, find all the commits that are on the path from Bn to C.

 * Find all the SHA1 objects that appear on this commit chain; subtract what is in Bn since we assume the recipient has them already.

 * Run diff-tree between neighboring commits [*1*] to find out the set of blobs that are "related". Extract those related blobs and run "diff" [*2*] between them to see if it produces a patch smaller than the whole thing when compressed. If diff+patch is a win, then we do not have to transmit the blob that we could reproduce by sending the diff. Note that fact.

 * When you are all done, you have a single patch file that contains small edits on numerous blobs, and a set of SHA1 files that are cheaper to transmit than in the patch form. Compress the patch file and package them together to make a tar archive.

Given the above, the operation of git-archive-patch is also quite obvious. Extract the "diff package" tarball into the objects/ directory that has (at least) the full Bn, uncompress the patch file part, and run patch on it.

[Footnotes]

*1* Alternatively, this diff-tree can be run between Bn and each commit between (Bn, C]. It is like an incremental dump strategy. We should experiment and find a good balance.

*2* This does not have to be "diff -u" --- we are assuming the exact patch so diff -e or xdelta would do. We should experiment and find a good diff+patch pair.
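The "is diff+patch a win?" test in the proposal above can be sketched as follows. This is a rough Python illustration, not the proposed tool: it swaps in `difflib` unified diffs and `zlib` compression as stand-ins for whichever diff/patch pair and compressor would eventually be chosen, and all function names are hypothetical.

```python
import difflib
import zlib


def compressed_size(data: bytes) -> int:
    """Size of the data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def make_delta(old: bytes, new: bytes) -> bytes:
    """Produce a unified diff from old to new (a stand-in for diff -e or xdelta)."""
    # latin-1 maps every byte to one character, so the round trip is lossless
    old_lines = old.decode("latin-1").splitlines(keepends=True)
    new_lines = new.decode("latin-1").splitlines(keepends=True)
    diff = difflib.unified_diff(old_lines, new_lines, "a", "b")
    return "".join(diff).encode("latin-1")


def choose_transport(old: bytes, new: bytes):
    """Return ("diff", delta) if the compressed delta is smaller, else ("blob", new)."""
    delta = make_delta(old, new)
    if compressed_size(delta) < compressed_size(new):
        return "diff", delta
    return "blob", new
```

For a large file with a one-line edit the delta should win easily; for a brand-new or completely rewritten blob the whole object is cheaper, which is the per-blob bookkeeping ("note that fact") the proposal asks for.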
Re: full kernel history, in patchset format
On Sat, 2005-04-16 at 10:04 -0700, Linus Torvalds wrote:

> So I'd _almost_ suggest just starting from a clean slate after all.
> Keeping the old history around, of course, but not necessarily putting it
> into git now. It would just force everybody who is getting used to git in
> the first place to work with a 3GB archive from day one, rather than
> getting into it a bit more gradually.

Sure. We can export the 2.6.12-rc2 version of the git'ed history tree and start from there. Then the first changeset has a parent, which just lives in a different place. That's the only difference to your repository, but it would change the sha1 sums of all your changesets.

> What do people think? I'm not so much worried about the data itself: the
> git architecture is _so_ damn simple that now that the size estimate has
> been confirmed, that I don't think it would be a problem per se to put
> 3.2GB into the archive. But it will bog down "rsync" horribly, so it will
> actually hurt synchronization until somebody writes the rev-tree-like
> stuff to communicate changes more efficiently..

We have all the tracking information in SQL and we will post the database dump soon, so people interested in revision tracking can use this as an information base.

> But it's _great_ to have the history in this format, especially since
> looking at CVS just reminded me how much I hated it.

:)

One remark on the tree blob storage format. The binary storage of the sha1sum of the referred object is a PITA for scripting. Converting ASCII -> binary for the sha1sum comparison should not take much longer than the binary -> ASCII conversion for the file reference. Can this be changed?

tglx
Re: Re: full kernel history, in patchset format
Dear diary, on Sat, Apr 16, 2005 at 07:04:31PM CEST, I got a letter where Linus Torvalds <[EMAIL PROTECTED]> told me that... > So I'd _almost_ suggest just starting from a clean slate after all. > Keeping the old history around, of course, but not necessarily putting it > into git now. It would just force everybody who is getting used to git in > the first place to work with a 3GB archive from day one, rather than > getting into it a bit more gradually. > > Comments? FWIW, it looks pretty reasonable to me. Perhaps we should have a separate GIT repository with the previous history though, and in the first new commit the parent could point to the last commit from the other repository. Just if it isn't too much work, though. :-) -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: full kernel history, in patchset format
On Sat, 16 Apr 2005, Ingo Molnar wrote: > > i've converted the Linux kernel CVS tree into 'flat patchset' format, > which gave a series of 28237 separate patches. (Each patch represents a > changeset, in the order they were applied. I've used the cvsps utility.) > > the history data starts at 2.4.0 and ends at 2.6.12-rc2. I've included a > script that will apply all the patches in order and will create a > pristine 2.6.12-rc2 tree. Hey, that's great. I got the CVS repo too, and I was looking at it, but the more I looked at it, the more I felt that the main reason I want to import it into git ends up being to validate that my size estimates are at all realistic. I see that Thomas Gleixner seems to have done that already, and come to a figure of 3.2GB for the last three years, which I'm very happy with, mainly because it seems to match my estimates to a tee. Which means that I just feel that much more confident about git actually being able to handle the kernel long-term, and not just as a stop-gap measure. But I wonder if we actually want to actually populate the whole history.. Now that my size estimates have been verified, I have little actual real reason to put the history into git. There are no visualization tools done for git yet, and no helpers to actually find problems, and by the time there will be, we'll have new history. So I'd _almost_ suggest just starting from a clean slate after all. Keeping the old history around, of course, but not necessarily putting it into git now. It would just force everybody who is getting used to git in the first place to work with a 3GB archive from day one, rather than getting into it a bit more gradually. What do people think? I'm not so much worried about the data itself: the git architecture is _so_ damn simple that now that the size estimate has been confirmed, that I don't think it would be a problem per se to put 3.2GB into the archive. 
But it will bog down "rsync" horribly, so it will actually hurt synchronization until somebody writes the rev-tree-like stuff to communicate changes more efficiently..

IOW, it smells to me like we don't have the infrastructure to really work with 3GB archives, and that if we start from scratch (2.6.12-rc2), we can build up the infrastructure in parallel with starting to really need it.

But it's _great_ to have the history in this format, especially since looking at CVS just reminded me how much I hated it.

Comments?

		Linus
Re: full kernel history, in patchset format
Ingo Molnar <[EMAIL PROTECTED]> : [...] > the history data starts at 2.4.0 and ends at 2.6.12-rc2. I've included a > script that will apply all the patches in order and will create a > pristine 2.6.12-rc2 tree. 127 weeks of bk-commit mail for the 2.6 branch alone since october 2002 provides more than 44000 messages here. The figures are surprisingly different. > it needed many hours to finish, on a very fast server with tons of RAM, > and it also needed a fair amount of manual work to extract it and to > make it usable, so i guessed others might want to use the end result as > well, to try and generate large GIT repositories from them (or to run > analysis over the patches, etc.). Has anyone already compared the (split/digested) content of the ChangeLog file with the commit messages ? It raises the interesting question of inserting the merge messages/patches in the sequence at the right place but I'd like to know if someone met other issues. -- Ueimor - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: full kernel history, in patchset format
Ingo Molnar wrote: * Ingo Molnar <[EMAIL PROTECTED]> wrote: the patches contain all the existing metadata, dates, log messages and revision history. (What i think is missing is the BK tree merge information, but i'm not sure we want/need to convert them to GIT.) author names are abbreviated, e.g. 'viro' instead of [EMAIL PROTECTED], and no committer information is included (albeit commiter ought to be Linus in most cases). These are limitations of the BK->CVS gateway i think. Glad to hear cvsps made it through! I'm curious what the manual fixups required were, except for the binary file issue (logo.gif). As to the actual email addresses, for more recent patches, the Signed-off should help. For earlier ones, isn't their some script which 'knows' a bunch of canonical author->email mappings? (the shortlog script or something)? Is the full committer email address actually in the changeset in BK? If so, given that we have the unique id (immutable I believe) of the changset, could it be extracted directly from BK? David - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: full kernel history, in patchset format
* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> the patches contain all the existing metadata, dates, log messages and
> revision history. (What i think is missing is the BK tree merge
> information, but i'm not sure we want/need to convert them to GIT.)

author names are abbreviated, e.g. 'viro' instead of [EMAIL PROTECTED], and no committer information is included (albeit committer ought to be Linus in most cases). These are limitations of the BK->CVS gateway i think.

	Ingo
full kernel history, in patchset format
i've converted the Linux kernel CVS tree into 'flat patchset' format, which gave a series of 28237 separate patches. (Each patch represents a changeset, in the order they were applied. I've used the cvsps utility.) the history data starts at 2.4.0 and ends at 2.6.12-rc2. I've included a script that will apply all the patches in order and will create a pristine 2.6.12-rc2 tree. it needed many hours to finish, on a very fast server with tons of RAM, and it also needed a fair amount of manual work to extract it and to make it usable, so i guessed others might want to use the end result as well, to try and generate large GIT repositories from them (or to run analysis over the patches, etc.). the patches contain all the existing metadata, dates, log messages and revision history. (What i think is missing is the BK tree merge information, but i'm not sure we want/need to convert them to GIT.) it's a 136 MB tarball, which can be downloaded from: http://kernel.org/pub/linux/kernel/people/mingo/Linux-2.6-patchset/ the ./generate-2.6.12-rc2 script generates the 2.6.12-rc2 tree into linux/, from scratch. (No pre-existing kernel is needed, as 2.patch generates the full 2.4.0 kernel tree.) The patching takes a couple of minutes to finish, on a fast box. below i've attached a sample patch from the series. note: i kept the patches the cvsps utility generated as-is, to have a verifiable base to work on. There were a very small amount of deltas missed (about a dozen), probably resulting from CVS related errors, these are included in the diff-CVS-to-real patch. Also, the patch format cannot create the Documentation/logo.gif file, so the script does this too - just to be able to generate a complete 2.6.12-rc2 tree that is byte-for-byte identical to the real thing. 
	Ingo

-
PatchSet 1234
Date: 2002/04/11 18:29:07
Author: viro
Branch: HEAD
Tag: (none)
Log:
[PATCH] crapectomy in include/linux/nfsd/syscall.h

Removes an atavism in declaration of sys_nfsservctl() - sorry, I should've
remove that junk when cond_syscall() thing was done.

BKrev: 3cb5c7e3phTYgiz1YLsjQ_McTo9pOQ

Members:
	ChangeSet:1.1234->1.1235
	include/linux/nfsd/syscall.h:1.3->1.4

Index: linux/include/linux/nfsd/syscall.h
===
RCS file: /home/mingo/linux-CVS/linux/include/linux/nfsd/syscall.h,v
retrieving revision 1.3
retrieving revision 1.4
diff -u -r1.3 -r1.4
--- linux/include/linux/nfsd/syscall.h 15 Mar 2002 23:06:06 - 1.3
+++ linux/include/linux/nfsd/syscall.h 11 Apr 2002 17:29:07 - 1.4
@@ -132,11 +132,7 @@
 /*
  * Kernel syscall implementation.
  */
-#if defined(CONFIG_NFSD) || defined(CONFIG_NFSD_MODULE)
 extern asmlinkage long sys_nfsservctl(int, struct nfsctl_arg *, void *);
-#else
-#define sys_nfsservctl sys_ni_syscall
-#endif
 extern int exp_addclient(struct nfsctl_client *ncp);
 extern int exp_delclient(struct nfsctl_client *ncp);
 extern int exp_export(struct nfsctl_export *nxp);
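For readers who want to replay such a series themselves, the core of an "apply everything in order" script might look like the Python sketch below. It is not the ./generate-2.6.12-rc2 script from the tarball. It assumes the patches are named N.patch (as in "2.patch" and "1860.patch" mentioned in this thread), that they must be applied in numeric rather than alphabetical order, and that `patch -p0` run from the directory containing linux/ matches the paths in the sample above; the function names and the -p level are guesses for illustration.

```python
import os
import re
import subprocess


def ordered_patches(names):
    """Return the N.patch file names sorted by their numeric value N."""
    numbered = []
    for name in names:
        match = re.fullmatch(r"(\d+)\.patch", name)
        if match:  # ignore anything that is not a numbered patch
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


def apply_all(patch_dir, target_dir):
    """Apply every numbered patch in order; stop at the first failure."""
    for name in ordered_patches(os.listdir(patch_dir)):
        path = os.path.abspath(os.path.join(patch_dir, name))
        # -p0 is an assumption based on the "linux/..." paths in the sample patch
        subprocess.run(["patch", "-p0", "-s", "-d", target_dir, "-i", path],
                       check=True)
```

Sorting numerically matters because a plain string sort would place 10.patch before 2.patch and the series would fail almost immediately.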