Re: full kernel history, in patchset format

2005-04-19 Thread Catalin Marinas

David Mansfield <[EMAIL PROTECTED]> wrote:
> Catalin Marinas wrote:
>> AFAIK, cvsps uses the date/time to create the changesets. There is a
>> problem with the BKCVS export since some files in the same commit can
>> have a different time (by an hour). I posted a mail some time ago
>> about this -
>> http://marc.theaimsgroup.com/?l=linux-kernel&m=110026570201544&w=2
>> I read that the old history won't be merged into the new repository
>> but, if you are interested, I have a script that can do this based on
>> the "(Logical change ...)" string in the file commit logs and it is
>> quite fast at generating the patches.
>>
>
> Hmmm.  I read that message just now.  Is it a matter of 'perfection'
> that is the issue here, or actual correctness when applying the
> patches in order?

I see it as a matter of correctness since in a given BKCVS changeset
(i.e. revision in the ChangeSet,v file) you may miss files. You would
eventually get them, with the same log, but in a different patch. If
you don't care about this, you can call it 'perfection'.

At that time I thought about modifying cvsps to use the "(Logical
change ...)" string instead of time/date for grouping the files but I
realised it is easier with a shell script.
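A rough illustration of the grouping idea described above, written in Python rather than as the shell script mentioned here (which was not posted in this thread). The regular expression assumes the marker looks like "(Logical change <id>)"; the thread only quotes it as "(Logical change ...)", so the exact format, the sample paths and the revision numbers are all assumptions.

```python
import re
from collections import defaultdict

# Assumed marker format; the thread only quotes it as "(Logical change ...)".
LOGICAL_CHANGE = re.compile(r"\(Logical change ([^)]+)\)")


def group_by_logical_change(file_revisions):
    """Group (path, revision, log) tuples by the marker found in each log."""
    changesets = defaultdict(list)
    for path, revision, log in file_revisions:
        match = LOGICAL_CHANGE.search(log)
        # Revisions without a marker are collected under the key None.
        key = match.group(1) if match else None
        changesets[key].append((path, revision))
    return dict(changesets)


# Hypothetical sample: two files from one commit plus one unrelated change.
revisions = [
    ("kernel/sched.c", "1.10", "fix scheduler\n(Logical change 1.100)"),
    ("include/linux/sched.h", "1.7", "fix scheduler\n(Logical change 1.100)"),
    ("mm/slab.c", "1.3", "slab cleanup\n(Logical change 1.101)"),
]
groups = group_by_logical_change(revisions)
```

Because the key comes from the log text, two files from the same commit land in the same group even if their recorded times differ by an hour.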

> (perhaps this has now been fixed).

There was no reply to this e-mail. It might have been fixed in the
meantime but I don't think the history was fixed as well.

-- 
Catalin

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: full kernel history, in patchset format

2005-04-18 Thread David Mansfield

Catalin Marinas wrote:
> Ingo Molnar <[EMAIL PROTECTED]> wrote:
>> i've converted the Linux kernel CVS tree into 'flat patchset' format,
>> which gave a series of 28237 separate patches. (Each patch represents a
>> changeset, in the order they were applied. I've used the cvsps
>> utility.)
>
> AFAIK, cvsps uses the date/time to create the changesets. There is a
> problem with the BKCVS export since some files in the same commit can
> have a different time (by an hour). I posted a mail some time ago
> about this -
> http://marc.theaimsgroup.com/?l=linux-kernel&m=110026570201544&w=2
>
> I read that the old history won't be merged into the new repository
> but, if you are interested, I have a script that can do this based on
> the "(Logical change ...)" string in the file commit logs and it is
> quite fast at generating the patches.

Hmmm.  I read that message just now.  Is it a matter of 'perfection'
that is the issue here, or actual correctness when applying the patches
in order?

(perhaps this has now been fixed).

David

Re: full kernel history, in patchset format

2005-04-18 Thread Catalin Marinas

Ingo Molnar <[EMAIL PROTECTED]> wrote:
> i've converted the Linux kernel CVS tree into 'flat patchset' format, 
> which gave a series of 28237 separate patches. (Each patch represents a 
> changeset, in the order they were applied. I've used the cvsps
> utility.)

AFAIK, cvsps uses the date/time to create the changesets. There is a
problem with the BKCVS export since some files in the same commit can
have a different time (by an hour). I posted a mail some time ago
about this - 
http://marc.theaimsgroup.com/?l=linux-kernel&m=110026570201544&w=2

I read that the old history won't be merged into the new repository
but, if you are interested, I have a script that can do this based on
the "(Logical change ...)" string in the file commit logs and it is
quite fast at generating the patches.

-- 
Catalin


Re: full kernel history, in patchset format

2005-04-17 Thread David Woodhouse

On Sun, 2005-04-17 at 18:16 -0700, Linus Torvalds wrote:
> Alternatively, you can have just the rev-tree cache of them. That's what
> it was designed for (along with avoiding to have to read 60,000 commits).

Purely from a conceptual POV I'd be a little happier with the history
just ending with a parent pointer to a commit object which is absent,
rather than having commit objects which point to _trees_ which are
absent. But I suppose I can't really justify that, and I'm not overly
bothered about it either.

The important thing to get right at this point is that the tree we all
work with should refer to the history, regardless of how we choose to
prune it. The current linux-2.6.git tree has a parentless commit for the
2.6.12-rc2 import, which is bad. We should start with Thomas' git tree
representing the real history, and work from that. You don't even need
to see his tree; you only need the final sha1 hash of the commit in his
tree which matches 2.6.12-rc2, so you can use that as the 'parent' of
the first change you import yourself.

-- 
dwmw2


Re: full kernel history, in patchset format

2005-04-17 Thread Linus Torvalds



On Mon, 18 Apr 2005, Petr Baudis wrote:

> Dear diary, on Mon, Apr 18, 2005 at 02:06:43AM CEST, I got a letter
> where David Woodhouse <[EMAIL PROTECTED]> told me that...
> > On Mon, 2005-04-18 at 01:39 +0200, Petr Baudis wrote:
> > > Of course an entirely different thing are _trees_ associated with those
> > > commits. As long as you stay with a simple three-way merge, you
> > > basically never want to look at trees which aren't heads and which you
> > > don't specifically request to look at. And the trees and what they carry
> > > inside is the main bulk of data.
> > 
> > If the trees are absent and you're trying to merge, what do you gain
> > from having the commit objects?
> 
> merge-base

Alternatively, you can have just the rev-tree cache of them. That's what
it was designed for (along with avoiding to have to read 60,000 commits).

Linus

Re: full kernel history, in patchset format

2005-04-17 Thread Petr Baudis

Dear diary, on Mon, Apr 18, 2005 at 02:51:59AM CEST, I got a letter
where David Woodhouse <[EMAIL PROTECTED]> told me that...
> On Mon, 2005-04-18 at 02:50 +0200, Petr Baudis wrote:
> > I think I will make git-pasky's default behaviour (when we get
> > http-pull, that is) to keep the complete commit history but only trees
> > you need/want; togglable to both sides.
> 
> I think the default behaviour should probably be to fetch everything.

I think fetching gigs of data just won't work for many people,
especially if they could do with a fraction of that.

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

Re: full kernel history, in patchset format

2005-04-17 Thread David Woodhouse

On Mon, 2005-04-18 at 02:50 +0200, Petr Baudis wrote:
> I think I will make git-pasky's default behaviour (when we get
> http-pull, that is) to keep the complete commit history but only trees
> you need/want; togglable to both sides.

I think the default behaviour should probably be to fetch everything.

-- 
dwmw2


Re: full kernel history, in patchset format

2005-04-17 Thread Petr Baudis

Dear diary, on Mon, Apr 18, 2005 at 02:45:22AM CEST, I got a letter
where David Woodhouse <[EMAIL PROTECTED]> told me that...
> On Mon, 2005-04-18 at 02:35 +0200, Petr Baudis wrote:
> > > For the special case of removing history before 2.6.12-rc2 from the
> > > trees, I certainly think we can do it by leaving out all the commits,
> > > not just the trees. We can do that easily, but there's no way we can
> > > _add_ that history retrospectively if we omit it in the first place.
> > 
> > I'm confused by this paragraph, but that might be my English skills
> > failing somehow.
> 
> "For the general case of people pruning their own trees, _maybe_ you're
> right that it would be good to keep the commits even if we delete the
> actual trees. But for history older than 2.6.12-rc2, that's a special
> case -- I think we can happily delete the commits too.

Ah _so_. Thanks for explanation.

I think I will make git-pasky's default behaviour (when we get
http-pull, that is) to keep the complete commit history but only trees
you need/want; togglable to both sides.

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

Re: full kernel history, in patchset format

2005-04-17 Thread David Woodhouse

On Mon, 2005-04-18 at 02:35 +0200, Petr Baudis wrote:
> > For the special case of removing history before 2.6.12-rc2 from the
> > trees, I certainly think we can do it by leaving out all the commits,
> > not just the trees. We can do that easily, but there's no way we can
> > _add_ that history retrospectively if we omit it in the first place.
> 
> I'm confused by this paragraph, but that might be my English skills
> failing somehow.

"For the general case of people pruning their own trees, _maybe_ you're
right that it would be good to keep the commits even if we delete the
actual trees. But for history older than 2.6.12-rc2, that's a special
case -- I think we can happily delete the commits too.

"We can delete old trees/commits easily, but we can't _add_ them to the
existing linux-2.6.git tree, because the oldest commit in that tree
(b4ceb6e27e4cc3f37d26e04c4535c79b98a9f889) doesn't have a parent."

-- 
dwmw2


Re: full kernel history, in patchset format

2005-04-17 Thread Petr Baudis

Dear diary, on Mon, Apr 18, 2005 at 02:06:43AM CEST, I got a letter
where David Woodhouse <[EMAIL PROTECTED]> told me that...
> On Mon, 2005-04-18 at 01:39 +0200, Petr Baudis wrote:
> > Of course an entirely different thing are _trees_ associated with those
> > commits. As long as you stay with a simple three-way merge, you
> > basically never want to look at trees which aren't heads and which you
> > don't specifically request to look at. And the trees and what they carry
> > inside is the main bulk of data.
> 
> If the trees are absent and you're trying to merge, what do you gain
> from having the commit objects?

merge-base
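To illustrate the one-word answer: a merge base can be found from parent pointers alone, without any tree or blob data. The sketch below is a toy model in Python, not the merge-base tool itself; the commit names and the `parents` map are hypothetical.

```python
def ancestors(commit, parents):
    """Return every commit id reachable from `commit`, including itself."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        # A commit missing from the map is still recorded by id,
        # but its own parents are unknown, so the walk stops there.
        stack.extend(parents.get(current, []))
    return seen


def merge_bases(a, b, parents):
    """Common ancestors of a and b that no other common ancestor descends from."""
    common = ancestors(a, parents) & ancestors(b, parents)
    return {
        c for c in common
        if not any(c != other and c in ancestors(other, parents) for other in common)
    }


# Hypothetical history: "left" and "right" both branch off "base".
parents = {"root": [], "base": ["root"], "left": ["base"], "right": ["base"]}
result = merge_bases("left", "right", parents)
```

Only the commit graph is consulted here, which is why keeping commit objects while dropping old trees still leaves this step possible.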

> For the special case of removing history before 2.6.12-rc2 from the
> trees, I certainly think we can do it by leaving out all the commits,
> not just the trees. We can do that easily, but there's no way we can
> _add_ that history retrospectively if we omit it in the first place.

I'm confused by this paragraph, but that might be my English skills
failing somehow.

> For history older than 2.6.12-rc2 I'd suggest that it would be available
> in a different place, and absent from the 'main' working tree that
> everyone uses by default. The only difference we'd see in the working
> tree is that the 2.6.12-rc2 commit -- the oldest commit in that tree --
> would actually have an absentee parent instead of appearing to be an
> import. And all the sha1 hashes of all subsequent commits would be
> different, of course.

Yes, that's what I suggested too.

> To allow pruning of older objects in the general case would be a little
> bit harder than that, because as things stand you'd be re-fetching them
> every time you rsync from elsewhere -- but that wouldn't really be hard
> to fix if we care.

I think http-pull is very promising. :-)

It could be actually much faster than rsync, since you don't need to
build directory listings etc, which actually takes non-trivial amount of
time already with the kernel git repository.

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

Re: full kernel history, in patchset format

2005-04-17 Thread David Woodhouse

On Mon, 2005-04-18 at 01:39 +0200, Petr Baudis wrote:
> I think this is bad, bad, bad. If you don't keep around all the
> _commits_, you get into all sorts of troubles - when merging, when doing
> git log, etc. And the commits themselves are probably actually pretty
> small portion of the thing. I didn't do any actual measurement but I
> would be pretty surprised if it would be much more than few megabytes of
> data for the kernel history.

I'm not sure it's that bad -- and everyone already seems perfectly happy
not to have history going back before 2.6.12-rc2. We're not talking
about doing this by _default_ -- we're talking about allowing people to
keep trees pruned if they _want_ to. So I might want to drop history
before 2.6.0 on my laptop, for example.

> Of course an entirely different thing are _trees_ associated with those
> commits. As long as you stay with a simple three-way merge, you
> basically never want to look at trees which aren't heads and which you
> don't specifically request to look at. And the trees and what they carry
> inside is the main bulk of data.

If the trees are absent and you're trying to merge, what do you gain
from having the commit objects? And for the case of 'git log', I
certainly think it's acceptable that you lose out on those parts of
prehistory which you've explicitly removed from your local tree --
that's a feature, not a bug. 

For the special case of removing history before 2.6.12-rc2 from the
trees, I certainly think we can do it by leaving out all the commits,
not just the trees. We can do that easily, but there's no way we can
_add_ that history retrospectively if we omit it in the first place.

For history older than 2.6.12-rc2 I'd suggest that it would be available
in a different place, and absent from the 'main' working tree that
everyone uses by default. The only difference we'd see in the working
tree is that the 2.6.12-rc2 commit -- the oldest commit in that tree --
would actually have an absentee parent instead of appearing to be an
import. And all the sha1 hashes of all subsequent commits would be
different, of course.

To allow pruning of older objects in the general case would be a little
bit harder than that, because as things stand you'd be re-fetching them
every time you rsync from elsewhere -- but that wouldn't really be hard
to fix if we care.

Either way, I think it can probably be done by omitting the commit
objects as well as the trees -- but the important point is that we
_should_ include a 'parent' pointer in the oldest commit of the tree
we're working with, pointing back to the imported history.

-- 
dwmw2


Re: full kernel history, in patchset format

2005-04-17 Thread Petr Baudis

Dear diary, on Mon, Apr 18, 2005 at 01:31:36AM CEST, I got a letter
where David Woodhouse <[EMAIL PROTECTED]> told me that...
> Note that any given copy of a tree doesn't _need_ to keep all the
> history back the beginning of time. It's OK if the oldest commit object
> in your tree actually refers back to a parent which doesn't exist
> locally. I can well imagine that some people will want to keep their
> trees pruned to keep only a few weeks of history, while other copies of
> the tree will keep everything.

I think this is bad, bad, bad. If you don't keep around all the
_commits_, you get into all sorts of troubles - when merging, when doing
git log, etc. And the commits themselves are probably actually pretty
small portion of the thing. I didn't do any actual measurement but I
would be pretty surprised if it would be much more than few megabytes of
data for the kernel history.

Of course an entirely different thing are _trees_ associated with those
commits. As long as you stay with a simple three-way merge, you
basically never want to look at trees which aren't heads and which you
don't specifically request to look at. And the trees and what they carry
inside is the main bulk of data.

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

Re: full kernel history, in patchset format

2005-04-17 Thread David Woodhouse

On Sat, 2005-04-16 at 10:04 -0700, Linus Torvalds wrote:
> So I'd _almost_ suggest just starting from a clean slate after all.  
> Keeping the old history around, of course, but not necessarily putting it
> into git now. It would just force everybody who is getting used to git in 
> the first place to work with a 3GB archive from day one, rather than 
> getting into it a bit more gradually.
> 
> What do people think? I'm not so much worried about the data itself: the
> git architecture is _so_ damn simple that now that the size estimate has
> been confirmed, that I don't think it would be a problem per se to put
> 3.2GB into the archive. But it will bog down "rsync" horribly, so it will
> actually hurt synchronization until somebody writes the rev-tree-like
> stuff to communicate changes more efficiently..

Note that any given copy of a tree doesn't _need_ to keep all the
history back the beginning of time. It's OK if the oldest commit object
in your tree actually refers back to a parent which doesn't exist
locally. I can well imagine that some people will want to keep their
trees pruned to keep only a few weeks of history, while other copies of
the tree will keep everything.

However, if we _don't_ base our current work on an existing import of
the kernel, then we don't retain that option. We can't just change the
'parent' field of your 2.6.12-rc2 import, without changing the sha1 hash
of _everything_ that happens thereafter. 
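A small sketch of why the parent field cannot be changed after the fact. It uses a simplified commit record hashed with SHA-1; the record layout, the tree names and the parent id string are assumptions for illustration and do not reproduce git's real object encoding.

```python
import hashlib


def commit_id(tree, parent, message):
    """Hash a simplified commit record; the parent id is part of the hashed bytes."""
    body = f"tree {tree}\nparent {parent}\n\n{message}\n".encode()
    return hashlib.sha1(body).hexdigest()


def build_chain(root_parent, messages):
    """Create a linear history whose first commit points at `root_parent`."""
    ids = []
    parent = root_parent
    for n, message in enumerate(messages):
        parent = commit_id(f"tree-{n}", parent, message)
        ids.append(parent)
    return ids


messages = ["import 2.6.12-rc2", "first fix", "second fix"]
# Same trees and messages; only the parent of the first commit differs.
without_history = build_chain("", messages)
with_history = build_chain("hypothetical-id-of-imported-history", messages)
```

Each id depends on its parent's id, so giving the first commit a parent changes its hash and, through the chain, the hash of every later commit.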

So I'd say we should take Thomas' import, and base new work on that --
but then possibly leave out the older objects from the 'working'
repository which everyone is rsyncing from; just make them available in
a 'linux-history.git' object database elsewhere.

-- 
dwmw2


Re: full kernel history, in patchset format

2005-04-16 Thread David Lang

On Sat, 16 Apr 2005, Thomas Gleixner wrote:
> On Sat, 2005-04-16 at 10:04 -0700, Linus Torvalds wrote:
>> So I'd _almost_ suggest just starting from a clean slate after all.
>> Keeping the old history around, of course, but not necessarily putting it
>> into git now. It would just force everybody who is getting used to git in
>> the first place to work with a 3GB archive from day one, rather than
>> getting into it a bit more gradually.
>
> Sure. We can export the 2.6.12-rc2 version of the git'ed history tree
> and start from there. Then the first changeset has a parent, which just
> lives in a different place.
>
> Thats the only difference to your repository, but it would change the
> sha1 sums of all your changesets.

at least start with a full release. say 2.6.11

the history won't be blank, but it's far more likely that people will care
about the details between 2.6.11 and 2.6.12 and will want to go back
before -rc2

David Lang
--
There are two ways of constructing a software design. One way is to make it so 
simple that there are obviously no deficiencies. And the other way is to make 
it so complicated that there are no obvious deficiencies.
 -- C.A.R. Hoare

Re: full kernel history, in patchset format

2005-04-16 Thread Daniel Barkalow

On Sat, 16 Apr 2005, Mike Taht wrote:

> Junio C Hamano wrote:
> >>"MT" == Mike Taht <[EMAIL PROTECTED]> writes:
> > 
> > 
> > MT> alternatively, "git-archive-torrent" to create a list of files for a
> > MT> bittorrent feed
> > 
> > That is certainly good for establishing the baseline, but you
> > still need to leverage the inherent delta-compressibility
> > between related blobs/trees by also doing something like what I
> > described as "diff package", don't you?
> 
> Yes... yes you could have files and diffs generated statically...
> 
> although something like a bittorrent server/client/frontend, call it 
> "gittorrent" (I hate being the first to make this pun) could walk the 
> hashes dynamically (
> Ihave: sha,sha,sha,sha... Sendme: shaxxx
> Hereswhatyouneedfromgit: file,file,file,diff,diff,diff,...)

I'm actually working on a trivial HTTP client to do this. The user says
"get  from ", and it gets that object, the associated
trees, and the associated blobs, skipping any that it already has.

This should save having a non-standard public-facing server process, and
be essentially as effective, at least once I have it using a single
connection for everything.
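The walk described above could look roughly like the following Python sketch. It is not the actual client: the transport is passed in as a plain callable instead of performing HTTP requests, and the dictionary layout of commits, trees and blobs is a hypothetical stand-in for real objects.

```python
def pull(commit_id, fetch, local):
    """Copy a commit, its tree and everything below it into `local`."""
    fetched = []

    def get(object_id):
        # Skip anything that is already present locally.
        if object_id not in local:
            local[object_id] = fetch(object_id)
            fetched.append(object_id)
        return local[object_id]

    commit = get(commit_id)
    pending = [commit["tree"]]
    while pending:
        obj = get(pending.pop())
        if obj["type"] == "tree":
            # Descend even into trees we already had; children may be missing.
            pending.extend(obj["entries"])
    return fetched


# Hypothetical remote object store, keyed by object id.
remote = {
    "c1": {"type": "commit", "tree": "t1"},
    "t1": {"type": "tree", "entries": ["b1", "t2"]},
    "t2": {"type": "tree", "entries": ["b2"]},
    "b1": {"type": "blob"},
    "b2": {"type": "blob"},
}
local = {"b1": remote["b1"]}  # one blob is already present
fetched = pull("c1", remote.__getitem__, local)
```

In this model only the objects missing from `local` are requested, so a client that already holds most of a tree transfers very little.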

-Daniel
*This .sig left intentionally blank*


Re: full kernel history, in patchset format

2005-04-16 Thread Junio C Hamano

> "CL" == Christopher Li <[EMAIL PROTECTED]> writes:

CL> I bet 90% of the time people who sync to the repository head first
CL> want to check out the latest bits, and maybe read some of the change
CL> log to see what has changed.

CL> So with all the commit objects, the user will be able to see
CL> what has changed and which version he would like to check out.

Makes sense.


Re: full kernel history, in patchset format

2005-04-16 Thread Mike Taht

Junio C Hamano wrote:
>>"MT" == Mike Taht <[EMAIL PROTECTED]> writes:
>
> MT> alternatively, "git-archive-torrent" to create a list of files for a
> MT> bittorrent feed
>
> That is certainly good for establishing the baseline, but you
> still need to leverage the inherent delta-compressibility
> between related blobs/trees by also doing something like what I
> described as "diff package", don't you?

Yes... yes you could have files and diffs generated statically...

although something like a bittorrent server/client/frontend, call it
"gittorrent" (I hate being the first to make this pun) could walk the
hashes dynamically (
Ihave: sha,sha,sha,sha... Sendme: shaxxx
Hereswhatyouneedfromgit: file,file,file,diff,diff,diff,...)

--
Mike Taht
  "It looks like blind screaming hedonism won out."

Re: full kernel history, in patchset format

2005-04-16 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> > the history data starts at 2.4.0 and ends at 2.6.12-rc2. I've included a 
> > script that will apply all the patches in order and will create a 
> > pristine 2.6.12-rc2 tree.
> 
> Hey, that's great. I got the CVS repo too, and I was looking at it, 
> but the more I looked at it, the more I felt that the main reason I 
> want to import it into git ends up being to validate that my size 
> estimates are at all realistic.
> 
> I see that Thomas Gleixner seems to have done that already, and come 
> to a figure of 3.2GB for the last three years, which I'm very happy 
> with, mainly because it seems to match my estimates to a tee. [...]

(yeah, we apparently worked in parallel - i only learned about his 
efforts after i sent my mail. He was using BK to extract info, i was 
using the CVS tree alone and no BK code whatsoever. (I dont think there 
will be any argument about who owns what, but i wanted to be on the safe 
side, and i also wanted to see how complete and usable the CVS metadata 
is - it's close to perfect i'd say, for the purposes i care about.))

> But I wonder if we actually want to actually populate the whole 
> history..

yeah, it definitely feels a bit brave to import 28,000 changesets into a 
source-code database project that will be a whopping 2 weeks old in 2 
days ;) Even if we felt 100% confident about all the basics (which we do 
of course ;), it's just simply too young to tie things down via a 3.2GB 
database. It feels much more natural to grow it gradually, 28,000 
changesets i'm afraid would just suffocate the 'project growth 
dynamics'. Not going too fast is just as important as not going too 
slow.

I didnt generate the patchset to get it added into some central 
repository right now, i generated it to check that we _do_ have all the 
revision history in an easy to understand format which does generate 
today's kernel tree, so that we can lean back and worry about the full 
database once things get a bit more settled down (in a couple of months 
or so). It's also an easy testbed for GIT itself.

but the revision history was one of the main reasons i used BK myself, 
so we'll need a merged database eventually. Occasionally i needed to 
check who was the one who touched a particular piece of code - was that 
fantastic new line of code written by me, or was that buggy piece of 
crap written by someone else? ;) Also, looking at a change and then 
going to the changeset that did it, and then looking at the full picture 
was pretty useful too. So that sort of annotation, and generally 
navigating around _quickly_ and looking at the 'flow' of changes going 
into a particular file was really useful (for me).

Ingo

Re: full kernel history, in patchset format

2005-04-16 Thread Christopher Li

We can just have a baseline file containing all the commit objects,
then have git "download on demand". The problem with the diff
package is that it is harder to merge with more than one diff.

I bet 90% of the time people who sync to the repository head first
want to check out the latest bits, and maybe read some of the change
log to see what has changed.

So with all the commit objects, the user will be able to see
what has changed and which version he would like to check out.

Then he can issue a command "download me all the objects needed
to check out this commit". Download on demand should be
even better.

Chris

On Sat, Apr 16, 2005 at 12:19:22PM -0700, Junio C Hamano wrote:
> > "MT" == Mike Taht <[EMAIL PROTECTED]> writes:
> 
> MT> alternatively, "git-archive-torrent" to create a list of files for a
> MT> bittorrent feed
> 
> That is certainly good for establishing the baseline, but you
> still need to leverage the inherent delta-compressibility
> between related blobs/trees by also doing something like what I
> described as "diff package", don't you?

Re: full kernel history, in patchset format

2005-04-16 Thread Thomas Gleixner

On Sat, 2005-04-16 at 12:15 -0700, Linus Torvalds wrote:
> 
> On Sat, 16 Apr 2005, Thomas Gleixner wrote:
> > 
> > For the export stuff its terrible slow. :(
>
> What kind of _strange_ scripting architecture is so fast that there's a
> difference between "cat-file" and "ls-tree" and can handle 17,000 files in
> 60,000 revisions, yet so slow that you can't trivially convert 20 bytes of 
> data?

Sorry I was neither talking about "cat-file ..." nor about the 20 byte
conversion. I was talking about the bk export script, which writes the
objects itself. Doing this with the git-tools would slow it down, as I
have the retrieved data already in memory. It does not slow me down to
create the binary ref, but its annoying.

I just figured, that some revtools might have the need to use direct
pointers into objects and face the same problem the other way round.

tglx


Re: full kernel history, in patchset format

2005-04-16 Thread Junio C Hamano

> "PB" == Petr Baudis <[EMAIL PROTECTED]> writes:

PB> P.S.: It seems that Linus applied a patch to ls-tree which will make it
PB> read_sha1_file() on each item when ls-tree is recursive. Junio, why did
PB> you do it?

Sorry it was my misunderstanding, before I found out exactly how
S_ISDIR is used.  Thank you for pointing it out.

I was confused by this comment around the area I changed:

/* XXX: We do some ugly mode heuristics here.
 * It seems not worth it to read each file just to get this
 * and the file size. -- [EMAIL PROTECTED]

I mistakenly inferred from that comment that S_ISDIR(mode) is
not a guarantee.  So I mistakenly optimized it for non-recursive
case by keeping that "heuristics".  The logic was: If recursive
we will need to run read_sha1_file() to find out if it is really
a tree anyway.

I'll fix it up, now I know S_ISDIR(mode) is a guarantee that it
is a tree, I'll do the "heuristics" first, and do read_sha1_file
only when it is a tree and I am recursive.
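The ordering described above, mode check first and object read only for trees during a recursive listing, can be sketched in Python as follows. This is an illustration and not the ls-tree source: `read_tree`, the entry tuples and the object ids are hypothetical, and only `stat.S_ISDIR` is a real library call.

```python
import stat


def list_tree(tree_id, read_tree, recursive=False, prefix=""):
    """Yield (mode, path) per entry; read child objects only when required."""
    for mode, name, object_id in read_tree(tree_id):
        path = prefix + name
        yield mode, path
        # The mode alone identifies a tree, so blobs are never read.
        if recursive and stat.S_ISDIR(mode):
            yield from list_tree(object_id, read_tree, True, path + "/")


# Hypothetical object store: tree id -> list of (mode, name, object id).
trees = {
    "root": [(0o100644, "Makefile", "b1"), (0o040000, "kernel", "t1")],
    "t1": [(0o100644, "sched.c", "b2")],
}
reads = []


def read_tree(tree_id):
    """Record which objects get read, then return the tree's entries."""
    reads.append(tree_id)
    return trees[tree_id]


listing = list(list_tree("root", read_tree, recursive=True))
```

The `reads` list shows the effect: only the two tree objects are opened, while the blobs are listed from their parent entries without being read.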


Re: full kernel history, in patchset format

2005-04-16 Thread Junio C Hamano

> "MT" == Mike Taht <[EMAIL PROTECTED]> writes:

MT> alternatively, "git-archive-torrent" to create a list of files for a
MT> bittorrent feed

That is certainly good for establishing the baseline, but you
still need to leverage the inherent delta-compressibility
between related blobs/trees by also doing something like what I
described as "diff package", don't you?





Re: full kernel history, in patchset format

2005-04-16 Thread Ingo Molnar


* David Mansfield <[EMAIL PROTECTED]> wrote:

> Ingo Molnar wrote:
> >* Ingo Molnar <[EMAIL PROTECTED]> wrote:
> >
> >
> >>the patches contain all the existing metadata, dates, log messages and 
> >>revision history. (What i think is missing is the BK tree merge 
> >>information, but i'm not sure we want/need to convert them to GIT.)
> >
> >
> >author names are abbreviated, e.g. 'viro' instead of 
> >[EMAIL PROTECTED], and no committer information is 
> >included (albeit committer ought to be Linus in most cases). These are 
> >limitations of the BK->CVS gateway i think.
> >
> 
> Glad to hear cvsps made it through!  I'm curious what the manual 
> fixups required were, except for the binary file issue (logo.gif).

--cvs-direct was needed to speed it up from 'several days to finish' to 
'several hours to finish', but it crashed on a handful of patches [i 
used the latest devel snapshot so this isn't a complaint]. (one of the 
crashes was when generating 1860.patch.) Also, 'cvs rdiff' apparently 
emits an empty patch for diffs that remove a file that ends without 
a newline character - but this isn't cvsps's problem.  (grep for 
+++ in the patchset to find those cases.)

> As to the actual email addresses, for more recent patches, the 
> Signed-off should help.  For earlier ones, isn't their some script 
> which 'knows' a bunch of canonical author->email mappings? (the 
> shortlog script or something)?

yeah, that's not that much of a problem, most of the names are unique, 
and the rest can be fixed up too.

> Is the full committer email address actually in the changeset in BK?  
> If so, given that we have the unique id (immutable I believe) of the 
> changset, could it be extracted directly from BK?

i think it's included in BK.

Ingo

Re: full kernel history, in patchset format

2005-04-16 Thread Linus Torvalds

On Sat, 16 Apr 2005, Thomas Gleixner wrote:
> 
> For the export stuff its terrible slow. :(

I don't really see your point.

If you already know what the tree is like you say, you don't care about
the tree object. And if you don't know what the tree is, what _are_ you
doing?

In other words, show us what you're complaining about. If you're looking
into the trees yourself, then the binary representation of the sha1 is
already what you want. That _is_ the hash. So why do you want it in ASCII?  
And if you're not looking into the tree directly, but using "cat-file
tree" and you were hoping to see ASCII data, then that's certainly not
going to be any faster than just doing "ls-tree" instead.

In other words, I don't see your point. Either you want ascii output for 
scripting, or you don't. First you claimed that you did, and that you 
would want the tree object to change in order to do so. Now you claim that 
you can't use "ls-tree" because it's too slow. 

That just isn't making any sense. You're mixing two totally different
levels, and complaining about performance when scripting things. Yet
you're talking about a 20-byte data structure that is trivial to convert
to any format you want.

What kind of _strange_ scripting architecture is so fast that there's a
difference between "cat-file" and "ls-tree" and can handle 17,000 files in
60,000 revisions, yet so slow that you can't trivially convert 20 bytes of 
data?
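
For illustration, the 20-byte conversion in question can be sketched
in a few lines of Python (the helper names are made up for this
example; nothing here is git's own code):

```python
import hashlib

def sha1_to_hex(raw):
    """Convert a 20-byte binary SHA1 into its 40-character hex form."""
    if len(raw) != 20:
        raise ValueError("a binary SHA1 is exactly 20 bytes")
    return raw.hex()

def hex_to_sha1(text):
    """Convert a 40-character hex SHA1 back into 20 raw bytes."""
    raw = bytes.fromhex(text)
    if len(raw) != 20:
        raise ValueError("a hex SHA1 is exactly 40 characters")
    return raw

raw = hashlib.sha1(b"example").digest()  # any 20-byte hash will do
text = sha1_to_hex(raw)
```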

Linus

Re: full kernel history, in patchset format

2005-04-16 Thread Jan-Benedict Glaw

On Sat, 2005-04-16 10:04:31 -0700, Linus Torvalds <[EMAIL PROTECTED]>
wrote in message <[EMAIL PROTECTED]>:

> What do people think? I'm not so much worried about the data itself: the
> git architecture is _so_ damn simple that now that the size estimate has
> been confirmed, that I don't think it would be a problem per se to put
> 3.2GB into the archive. But it will bog down "rsync" horribly, so it will
> actually hurt synchronization untill somebody writes the rev-tree-like
> stuff to communicate changes more efficiently..
> 
> IOW, it smells to me like we don't have the infrastructure to really work 
> with 3GB archives, and that if we start from scratch (2.6.12-rc2), we can 
> build up the infrastructure in parallell with starting to really need it.

3GB is quite some data, but I'd accept and prefer to download it from
somewhere. I think that it's worth it.

I accept that there are people out there who would love to get a
smaller archive, but at least most developers that would actually use it
for day-to-day work *do* have the bandwidth to download it. Maybe we could
also prepare (from time to time) bzip'ed tarballs, which I expect to be
a tad smaller.

MfG, JBG

-- 
Jan-Benedict Glaw   [EMAIL PROTECTED]. +49-172-7608481 _ O _
"Eine Freie Meinung in  einem Freien Kopf| Gegen Zensur | Gegen Krieg  _ _ O
 fuer einen Freien Staat voll Freier Bürger" | im Internet! |   im Irak!   O O O
ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA));


Re: Re: full kernel history, in patchset format

2005-04-16 Thread Petr Baudis

Dear diary, on Sat, Apr 16, 2005 at 09:50:21PM CEST, I got a letter
where Thomas Gleixner <[EMAIL PROTECTED]> told me that...
> On Sat, 2005-04-16 at 11:44 -0700, Linus Torvalds wrote:
> 
> > That level of abstraction ("we never look directly at the objects") is 
> > what allows us to change the object structure later. For example, we 
> > already changed the "commit" date thing once, and the tree object has 
> > obviously evolved a bit, and if we ever change the hash, the objects will 
> > change too, but if you always just script them using nice helper tools, 
> > you won't ever need to _care_. And that's how it should be.
> 
> For the export stuff its terrible slow. :(

It seems to me that you must be doing something wrong then. I can't see
anything which would not make ls-tree blindingly fast (except for when
being recursive, see below).

BTW, what do you need ls-tree output for, when doing export _to_ git?

P.S.: It seems that Linus applied a patch to ls-tree which makes it
call read_sha1_file() on each item when ls-tree is recursive. Junio, why did
you do it? Is there any possible case where an item would not be marked
as a directory but would still be a tree object? I could imagine it bogging
down ls-tree on big trees a lot.

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

Re: Re: full kernel history, in patchset format

2005-04-16 Thread Christopher Li

On Sat, Apr 16, 2005 at 07:43:27PM +0200, Petr Baudis wrote:
> Dear diary, on Sat, Apr 16, 2005 at 07:04:31PM CEST, I got a letter
> where Linus Torvalds <[EMAIL PROTECTED]> told me that...
> > So I'd _almost_ suggest just starting from a clean slate after all.  
> > Keeping the old history around, of course, but not necessarily putting it
> > into git now. It would just force everybody who is getting used to git in 
> > the first place to work with a 3GB archive from day one, rather than 
> > getting into it a bit more gradually.
> > 
> > Comments?
> 
> FWIW, it looks pretty reasonable to me. Perhaps we should have a
> separate GIT repository with the previous history though, and in the
> first new commit the parent could point to the last commit from the
> other repository.
> 
> Just if it isn't too much work, though. :-)

I think we can make git use stackable repositories. When it fails
to find an object, it will try to read it from the parent repository.
That is useful for slicing the history.

I can have a local repository where all the new objects created by me are
stored in my tree instead of the official one. Cleaning up the objects in
my local tree will be much easier, since it only needs to work on a much
smaller repository. Once all my changes are merged into the official tree,
I simply empty my local repository.
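
A rough sketch of the fallback lookup, using in-memory dictionaries
instead of on-disk object directories.  The StackedStore class and
its method names are hypothetical, made up for this example; they are
not an existing git interface.

```python
class StackedStore:
    """Object store that falls back to a parent store on a miss."""

    def __init__(self, parent=None):
        self.objects = {}     # object id -> data, local to this layer
        self.parent = parent  # next store to consult, or None

    def write(self, object_id, data):
        # New objects always land in the local layer.
        self.objects[object_id] = data

    def has(self, object_id):
        store = self
        while store is not None:
            if object_id in store.objects:
                return True
            store = store.parent
        return False

    def read(self, object_id):
        # Try the local layer first, then walk up the parent chain.
        store = self
        while store is not None:
            if object_id in store.objects:
                return store.objects[object_id]
            store = store.parent
        raise KeyError(object_id)

    def prune_merged(self):
        """Drop local objects the parent chain already has; return the count."""
        if self.parent is None:
            return 0
        merged = [oid for oid in self.objects if self.parent.has(oid)]
        for oid in merged:
            del self.objects[oid]
        return len(merged)
```

Once the official store contains everything, prune_merged() empties
the local layer and reads keep working through the parent.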

About the kernel git repository: I think it is much easier to just put
everything in one tree, so I don't need to worry about "if I need to see
pre-2.6.12, I need to do this". And the full repository needs to be
stored on the server somewhere anyway.

However, I totally agree that people should not have to deal with
unnecessary history when they start using the git tools. We should just
make the tools not download all the history by default, and only get it
when the user specifically asks for it.

Why 2.6.12-rc2? When the kernel grows to 2.6.15, a new user might not even
need pre-2.6.13 most of the time. If we make it very easy for people to get
history when they need it, they will be less motivated to store unnecessary
history locally ("just in case I need it").

I think we should not advise using rsync to sync the whole git tree as
the way to get updates. We need to get used to having only a slice of the
history and getting more if we need it.
The server should provide some small metadata file, like the
rev-tool cache, so the SCM tools can download it to figure out which files
need to be downloaded to get to a certain revision, instead of downloading
the whole repository to figure out what is new.

We can even slice that metadata into smaller pieces based on major
release points.
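
As a sketch of how a client might use such a metadata file: the
index layout below (commit id mapped to its parents and to the set of
object ids it references) is purely an assumption for illustration,
as is the function name.

```python
def objects_to_fetch(index, want, have_commits, have_objects):
    """Return the object ids needed to reach commit `want`.

    index:        hypothetical metadata, commit -> (parents, object ids)
    have_commits: commits whose objects are already fully present locally
    have_objects: object ids already present locally
    """
    todo = [want]
    seen = set()
    needed = set()
    while todo:
        commit = todo.pop()
        if commit in seen or commit in have_commits:
            continue  # stop walking at history we already hold
        seen.add(commit)
        parents, objects = index[commit]
        needed.update(objects)
        todo.extend(parents)
    # Objects shared with older revisions need not be downloaded again.
    return needed - have_objects

# Toy metadata with made-up ids.
INDEX = {
    "c1": ([], {"c1", "t1", "b1"}),
    "c2": (["c1"], {"c2", "t2", "b1", "b2"}),
    "c3": (["c2"], {"c3", "t3", "b1", "b3"}),
}
```

Only the small index has to be downloaded to compute this set; the
objects themselves are then fetched one by one.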

Chris


Re: full kernel history, in patchset format

2005-04-16 Thread Thomas Gleixner

On Sat, 2005-04-16 at 11:44 -0700, Linus Torvalds wrote:

> That level of abstraction ("we never look directly at the objects") is 
> what allows us to change the object structure later. For example, we 
> already changed the "commit" date thing once, and the tree object has 
> obviously evolved a bit, and if we ever change the hash, the objects will 
> change too, but if you always just script them using nice helper tools, 
> you won't ever need to _care_. And that's how it should be.

For the export stuff it's terribly slow. :(

I agree that using common tools is good. But we talk also about an open
format, so using a script to speed up certain tasks is not bad at all.

tglx




Re: full kernel history, in patchset format

2005-04-16 Thread Junio C Hamano

> "JCH" == Junio C Hamano <[EMAIL PROTECTED]> writes:

JCH> I have been cooking this idea before I dove into the merge stuff
JCH> and did not have time to implement it myself (Hint Hint), but I
JCH> think something along the following lines would work nicely:

It should be fairly obvious from the context what I meant to
say, but in case somebody gets confused by my inaccurate
description of small details (or, before somebody nitpicks ;-),
I'd add some clarifications and corrections.

JCH>  * Run diff-tree between neighboring commits [*1*] to find out
JCH>the set of blobs that are "related".  Extract those related
JCH>blobs and run "diff" [*2*] between them to see if it produces
JCH>a patch smaller than the whole thing when compressed.  If
JCH>diff+patch is a win, then we do not have to transmit the blob
JCH>that we could reproduce by sending the diff.  Note that fact.

I talked only about blobs here, but I really mean all types:
commits, trees and blobs here.  Nothing prevents us from
extracting the raw data for trees and commits and run diff
between them.  We can use cat-file to do that today.

What we do not have is the reverse of "$ cat-file type >rawdata"
(i.e. "$ write-file type <rawdata").

JCH> Given the above, the operation of git-archive-patch is also
JCH> quite obvious.  Extract the "diff package" tarball into the
JCH> objects/ directory that has (at least) the full Bn, uncompress
JCH> the patch file part, and run patch on it. 

Of course, after you run patch to reproduce the raw data for the
blob or tree, we need the reverse of cat-file to register such
data under the objects/ hierarchy.
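
A sketch of what such a "write-file" could look like, in Python.  It
assumes the loose-object layout used by present-day Git (a
"type size\0" header, SHA1 over header plus data, zlib-compressed
file under objects/xx/...); the on-disk format at the time of this
thread may differ, and both function names are hypothetical.

```python
import hashlib
import os
import zlib

def write_file(objects_dir, obj_type, raw):
    """Register raw data as an object; return its 40-character hex id."""
    store = f"{obj_type} {len(raw)}".encode() + b"\0" + raw
    object_id = hashlib.sha1(store).hexdigest()
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as handle:
        handle.write(zlib.compress(store))
    return object_id

def cat_file(objects_dir, object_id):
    """Inverse operation: return (type, raw data) for a stored object."""
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    with open(path, "rb") as handle:
        store = zlib.decompress(handle.read())
    header, _, raw = store.partition(b"\0")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(raw):
        raise ValueError("object size does not match its header")
    return obj_type, raw
```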


Re: full kernel history, in patchset format

2005-04-16 Thread Linus Torvalds

On Sat, 16 Apr 2005, Thomas Gleixner wrote:
> 
> One remark on the tree blob storage format. 
> The binary storage of the sha1sum of the refered object is a PITA for
> scripting. 
> Converting the ASCII -> binary for the sha1sum comparision should not
> take much longer than the binary -> ASCII conversion for the file
> reference. Can this be changed ?

I'd really rather not. Why don't you just use "ls-tree" for scripting? 
That's why it exists in the first place. 

It might make sense to have some simple selection capabilities built into 
ls-tree (ie "ls-tree --match drivers/char/ -z " to get just a 
subtree out), but that depends entirely on how you end up using it.

The fact is, there should _never_ be any reason to look at the objects
themselves directly. "cat-file" is a debugging aid, it shouldn't be
scripted (with the possible exception of "cat-file blob " to just
extract the blob contents, since that object doesn't have any internal
structure).

That level of abstraction ("we never look directly at the objects") is 
what allows us to change the object structure later. For example, we 
already changed the "commit" date thing once, and the tree object has 
obviously evolved a bit, and if we ever change the hash, the objects will 
change too, but if you always just script them using nice helper tools, 
you won't ever need to _care_. And that's how it should be.

If there's a tool missing, holler. THAT is the part I've been trying to
write: all the plumbing so that you _can_ script the thing sanely, and not
worry about how objects are created and worked with. 

For example, that "index" file format likely _will_ change. I ended up
doing the new "stage" flags in a way that kept the index file compatible
with old ones, but I did that mainly because it also happened to be the
easiest way to enforce the rule I wanted to enforce (ie the "stage" really
_is_ a part of the filename from a "compare filenames" standpoint, in
order to make sure that the stages are always ordered).

So if the index file change hadn't had that property, I'd have just said
"I'll change the format", and anybody who tried to parse the index file
would have been _broken_.

Linus

Re: Re: full kernel history, in patchset format

2005-04-16 Thread Thomas Gleixner

On Sat, 2005-04-16 at 20:32 +0200, Petr Baudis wrote:
> Dear diary, on Sat, Apr 16, 2005 at 09:23:40PM CEST, I got a letter
> where Thomas Gleixner <[EMAIL PROTECTED]> told me that...
> > One remark on the tree blob storage format. 
> > The binary storage of the sha1sum of the refered object is a PITA for
> > scripting. 
> > Converting the ASCII -> binary for the sha1sum comparision should not
> > take much longer than the binary -> ASCII conversion for the file
> > reference. Can this be changed ?
> 
> Huh, you aren't supposed to peek into trees directly. What's wrong with
> ls-tree?

Why am I not supposed to? Is this evil?

My export script has all the data available, so I write the tree refs
directly. The full export runs ~1 hour. That's long enough :) I tried the
git way and it slows me down by factor "BIG" (I don't remember the
number).

Also for reference tracking all the information might be available e.g.
by a database. Why should the revtool then use some tool to retrieve
information which is already there ?

tglx


Re: Re: Re: full kernel history, in patchset format

2005-04-16 Thread Petr Baudis

Dear diary, on Sat, Apr 16, 2005 at 08:32:32PM CEST, I got a letter
where Petr Baudis <[EMAIL PROTECTED]> told me that...
> Dear diary, on Sat, Apr 16, 2005 at 09:23:40PM CEST, I got a letter
> where Thomas Gleixner <[EMAIL PROTECTED]> told me that...
> > One remark on the tree blob storage format. 
> > The binary storage of the sha1sum of the refered object is a PITA for
> > scripting. 
> > Converting the ASCII -> binary for the sha1sum comparision should not
> > take much longer than the binary -> ASCII conversion for the file
> > reference. Can this be changed ?
> 
> Huh, you aren't supposed to peek into trees directly. What's wrong with
> ls-tree?

(I meant, you aren't supposed to peek into trees from scripts. Or well,
not "not supposed", but it does not make much sense when you have
ls-tree.)

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

Re: full kernel history, in patchset format

2005-04-16 Thread Mike Taht


>  * A script git-archive-tar is used to create a "base tarball"
>    that roughly corresponds to "linux-*.tar.gz".  This works as
>    follows:
>
> $ git-archive-tar C [B1 B2...]
>
>    This reads the named commit C, grabs the associated tree
>    (i.e.  its sub-tree objects and the blob they refer to), and
>    makes a tarball of ??/??
>    files.  The tarball does not have to contain any extra
>    information to reproduce any ancestor of the named commit.

alternatively, "git-archive-torrent" to create a list of files for a 
bittorrent feed

--
Mike Taht

Re: Re: full kernel history, in patchset format

2005-04-16 Thread Petr Baudis

Dear diary, on Sat, Apr 16, 2005 at 09:23:40PM CEST, I got a letter
where Thomas Gleixner <[EMAIL PROTECTED]> told me that...
> One remark on the tree blob storage format. 
> The binary storage of the sha1sum of the refered object is a PITA for
> scripting. 
> Converting the ASCII -> binary for the sha1sum comparision should not
> take much longer than the binary -> ASCII conversion for the file
> reference. Can this be changed ?

Huh, you aren't supposed to peek into trees directly. What's wrong with
ls-tree?

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

Re: full kernel history, in patchset format

2005-04-16 Thread Junio C Hamano

> "LT" == Linus Torvalds <[EMAIL PROTECTED]> writes:

LT> What do people think? I'm not so much worried about the data itself: the
LT> git architecture is _so_ damn simple that now that the size estimate has
LT> been confirmed, that I don't think it would be a problem per se to put
LT> 3.2GB into the archive. But it will bog down "rsync" horribly, so it will
LT> actually hurt synchronization untill somebody writes the rev-tree-like
LT> stuff to communicate changes more efficiently..

LT> IOW, it smells to me like we don't have the infrastructure to really work 
LT> with 3GB archives, and that if we start from scratch (2.6.12-rc2), we can 
LT> build up the infrastructure in parallell with starting to really need it.

LT> But it's _great_ to have the history in this format, especially since 
LT> looking at CVS just reminded me how much I hated it.

LT> Comments?

I have been cooking this idea before I dove into the merge stuff
and did not have time to implement it myself (Hint Hint), but I
think something along the following lines would work nicely:

 * A script git-archive-tar is used to create a "base tarball"
   that roughly corresponds to "linux-*.tar.gz".  This works as
   follows:

$ git-archive-tar C [B1 B2...]

   This reads the named commit C, grabs the associated tree
   (i.e.  its sub-tree objects and the blob they refer to), and
   makes a tarball of ??/??
   files.  The tarball does not have to contain any extra
   information to reproduce any ancestor of the named commit.

   When extra parameters, B1 B2..., are given, it also creates a
   "diff package" that roughly corresponds to "patch-*.gz" for
   each Bn given.  They must be ancestors of commit C.  The
   intention is to store enough information to ensure that the
   recipient can recreate all the SHA1 files the "base tarball" for
   commits between (Bn, C] would contain, provided that the
   recipient already has all the SHA1 files in the "base tarball"
   for Bn.

 * A script git-archive-patch is used to read such a "diff
   package".

So a user needs to:

 * First pick some baseline B and download the base tarball for
   commit B.  It is up to him to make trade-offs between how far
   back he wants to see the history and how much bandwidth he
   wants to waste.  Untar it to get the baseline.

 * Then periodically pick up "diff package" for (B, C] where C
   is the latest available.  Run git-archive-patch to populate
   the rest.

 * In addition the user can run rsync with timestamp option to
   pick up SHA1 files created upstream since C after this
   happens.

What git-archive-tar needs to do to produce "diff package" for
(Bn, C] is fairly obvious.

 * From rev-tree output, find all the commits that are on path
   from Bn to C.

 * Find all the SHA1 objects that appear on this commit chain;
   subtract what is in Bn since we assume the recipient has them
   already.

 * Run diff-tree between neighboring commits [*1*] to find out
   the set of blobs that are "related".  Extract those related
   blobs and run "diff" [*2*] between them to see if it produces
   a patch smaller than the whole thing when compressed.  If
   diff+patch is a win, then we do not have to transmit the blob
   that we could reproduce by sending the diff.  Note that fact.

 * When you are all done, you have a single patch file that
   contains small edits on numerous blobs, and set of SHA1 files
   that are cheaper to transmit than in the patch form.
   Compress the patch file and package them together to make a
   tar archive.
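
The "is diff+patch a win" test in the steps above could be sketched
like this.  Python's difflib unified diff and zlib stand in for the
still-to-be-chosen diff and compression tools, so the sizes are only
indicative; the function names are made up for this example.

```python
import difflib
import zlib

def compressed_sizes(old, new):
    """Return (compressed diff size, compressed full size) in bytes."""
    diff = "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
        )
    )
    diff_size = len(zlib.compress(diff.encode()))
    full_size = len(zlib.compress(new.encode()))
    return diff_size, full_size

def cheaper_as_diff(old, new):
    """True if sending the compressed diff beats sending the whole blob."""
    diff_size, full_size = compressed_sizes(old, new)
    return diff_size < full_size
```

Blobs for which this returns False would be shipped as whole SHA1
files; the others contribute a hunk to the single patch file.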

Given the above, the operation of git-archive-patch is also
quite obvious.  Extract the "diff package" tarball into the
objects/ directory that has (at least) the full Bn, uncompress
the patch file part, and run patch on it. 


[Footnotes]

*1* Alternatively, this diff-tree can be run between Bn and each
commit between (Bn, C].  It is like incremental dump strategy.
We should experiment and find a good balance.

*2* This does not have to be "diff -u" --- we are assuming the
exact patch so diff -e or xdelta would do.  We should experiment
and find a good diff+patch pair.



Re: full kernel history, in patchset format

2005-04-16 Thread Thomas Gleixner

On Sat, 2005-04-16 at 10:04 -0700, Linus Torvalds wrote:

> So I'd _almost_ suggest just starting from a clean slate after all.  
> Keeping the old history around, of course, but not necessarily putting it
> into git now. It would just force everybody who is getting used to git in 
> the first place to work with a 3GB archive from day one, rather than 
> getting into it a bit more gradually.

Sure. We can export the 2.6.12-rc2 version of the git'ed history tree
and start from there. Then the first changeset has a parent, which just
lives in a different place. 
That's the only difference to your repository, but it would change the
sha1 sums of all your changesets.
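
Why every sha1 changes can be shown with a toy hash chain.  The
commit serialization below is deliberately simplified and is not
git's actual commit format; only the "the parent id is part of what
gets hashed" property matters.

```python
import hashlib

def commit_id(tree, parent, message):
    """Hash a simplified commit record; the parent id is part of the input."""
    body = f"tree {tree}\n"
    if parent is not None:
        body += f"parent {parent}\n"
    body += "\n" + message + "\n"
    return hashlib.sha1(body.encode()).hexdigest()

def chain(messages, root_parent=None):
    """Build a linear history and return the list of commit ids."""
    ids = []
    parent = root_parent
    for number, message in enumerate(messages):
        parent = commit_id(f"tree-{number}", parent, message)
        ids.append(parent)
    return ids

messages = ["first", "second", "third"]
clean_slate = chain(messages)                    # first commit has no parent
with_history = chain(messages, root_parent="0" * 40)  # first commit points at old history
```

The two chains describe the same trees and messages, yet no commit id
matches, because each id depends on its parent's id.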

> What do people think? I'm not so much worried about the data itself: the
> git architecture is _so_ damn simple that now that the size estimate has
> been confirmed, that I don't think it would be a problem per se to put
> 3.2GB into the archive. But it will bog down "rsync" horribly, so it will
> actually hurt synchronization untill somebody writes the rev-tree-like
> stuff to communicate changes more efficiently..

We have all the tracking information in SQL and we will post the data
base dump soon, so people interested in revision tracking can use this
as an information base.

> But it's _great_ to have the history in this format, especially since 
> looking at CVS just reminded me how much I hated it.

:)

One remark on the tree blob storage format. 
The binary storage of the sha1sum of the referred object is a PITA for
scripting. 
Converting ASCII -> binary for the sha1sum comparison should not
take much longer than the binary -> ASCII conversion for the file
reference. Can this be changed?
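
For reference, a sketch of reading and writing tree entries with the
binary sha1 field.  It assumes the entry layout of present-day Git
("<octal mode> <name>\0" followed by 20 raw bytes); the exact layout
at the time of writing may differ, and the function names are made up
for this example.

```python
def parse_tree(raw):
    """Decode raw tree data into (mode, name, hex sha1) tuples."""
    entries = []
    pos = 0
    while pos < len(raw):
        space = raw.index(b" ", pos)   # end of the octal mode
        nul = raw.index(b"\0", space)  # end of the file name
        mode = int(raw[pos:space].decode(), 8)
        name = raw[space + 1:nul].decode()
        sha = raw[nul + 1:nul + 21]    # 20 raw bytes, not ASCII
        if len(sha) != 20:
            raise ValueError("truncated tree entry")
        entries.append((mode, name, sha.hex()))
        pos = nul + 21
    return entries

def build_tree(entries):
    """Encode (mode, name, hex sha1) tuples back into raw tree data."""
    out = b""
    for mode, name, hex_id in entries:
        out += f"{mode:o} {name}".encode() + b"\0" + bytes.fromhex(hex_id)
    return out
```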

tglx


Re: Re: full kernel history, in patchset format

2005-04-16 Thread Petr Baudis

Dear diary, on Sat, Apr 16, 2005 at 07:04:31PM CEST, I got a letter
where Linus Torvalds <[EMAIL PROTECTED]> told me that...
> So I'd _almost_ suggest just starting from a clean slate after all.  
> Keeping the old history around, of course, but not necessarily putting it
> into git now. It would just force everybody who is getting used to git in 
> the first place to work with a 3GB archive from day one, rather than 
> getting into it a bit more gradually.
> 
> Comments?

FWIW, it looks pretty reasonable to me. Perhaps we should have a
separate GIT repository with the previous history though, and in the
first new commit the parent could point to the last commit from the
other repository.

Just if it isn't too much work, though. :-)

-- 
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

Re: full kernel history, in patchset format

2005-04-16 Thread Linus Torvalds

On Sat, 16 Apr 2005, Ingo Molnar wrote:
> 
> i've converted the Linux kernel CVS tree into 'flat patchset' format, 
> which gave a series of 28237 separate patches. (Each patch represents a 
> changeset, in the order they were applied. I've used the cvsps utility.)
> 
> the history data starts at 2.4.0 and ends at 2.6.12-rc2. I've included a 
> script that will apply all the patches in order and will create a 
> pristine 2.6.12-rc2 tree.

Hey, that's great. I got the CVS repo too, and I was looking at it, but 
the more I looked at it, the more I felt that the main reason I want to 
import it into git ends up being to validate that my size estimates are at 
all realistic.

I see that Thomas Gleixner seems to have done that already, and come to a 
figure of 3.2GB for the last three years, which I'm very happy with, 
mainly because it seems to match my estimates to a tee. Which means that I 
just feel that much more confident about git actually being able to handle 
the kernel long-term, and not just as a stop-gap measure.

But I wonder if we actually want to populate the whole history.. 
Now that my size estimates have been verified, I have little actual real 
reason to put the history into git. There are no visualization tools done 
for git yet, and no helpers to actually find problems, and by the time 
there will be, we'll have new history.

So I'd _almost_ suggest just starting from a clean slate after all.  
Keeping the old history around, of course, but not necessarily putting it
into git now. It would just force everybody who is getting used to git in 
the first place to work with a 3GB archive from day one, rather than 
getting into it a bit more gradually.

What do people think? I'm not so much worried about the data itself: the
git architecture is _so_ damn simple that now that the size estimate has
been confirmed, that I don't think it would be a problem per se to put
3.2GB into the archive. But it will bog down "rsync" horribly, so it will
actually hurt synchronization until somebody writes the rev-tree-like
stuff to communicate changes more efficiently..

IOW, it smells to me like we don't have the infrastructure to really work 
with 3GB archives, and that if we start from scratch (2.6.12-rc2), we can 
build up the infrastructure in parallel with starting to really need it.

But it's _great_ to have the history in this format, especially since 
looking at CVS just reminded me how much I hated it.

Comments?

Linus

Re: full kernel history, in patchset format

2005-04-16 Thread Francois Romieu

Ingo Molnar <[EMAIL PROTECTED]> :
[...]
> the history data starts at 2.4.0 and ends at 2.6.12-rc2. I've included a 
> script that will apply all the patches in order and will create a 
> pristine 2.6.12-rc2 tree.

127 weeks of bk-commit mail for the 2.6 branch alone since october 2002
provides more than 44000 messages here. The figures are surprisingly
different.

> it needed many hours to finish, on a very fast server with tons of RAM, 
> and it also needed a fair amount of manual work to extract it and to 
> make it usable, so i guessed others might want to use the end result as 
> well, to try and generate large GIT repositories from them (or to run 
> analysis over the patches, etc.).

Has anyone already compared the (split/digested) content of the ChangeLog
file with the commit messages ? It raises the interesting question of
inserting the merge messages/patches in the sequence at the right place
but I'd like to know if someone met other issues.

--
Ueimor

Re: full kernel history, in patchset format

2005-04-16 Thread David Mansfield

Ingo Molnar wrote:
> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
>> the patches contain all the existing metadata, dates, log messages and 
>> revision history. (What i think is missing is the BK tree merge 
>> information, but i'm not sure we want/need to convert them to GIT.)
>
> author names are abbreviated, e.g. 'viro' instead of 
> [EMAIL PROTECTED], and no committer information is 
> included (albeit commiter ought to be Linus in most cases). These are 
> limitations of the BK->CVS gateway i think.

Glad to hear cvsps made it through!  I'm curious what the manual fixups 
required were, except for the binary file issue (logo.gif).

As to the actual email addresses, for more recent patches, the 
Signed-off should help.  For earlier ones, isn't there some script which 
'knows' a bunch of canonical author->email mappings? (the shortlog 
script or something)?

Is the full committer email address actually in the changeset in BK?  If 
so, given that we have the unique id (immutable I believe) of the 
changeset, could it be extracted directly from BK?

David

Re: full kernel history, in patchset format

2005-04-16 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> the patches contain all the existing metadata, dates, log messages and 
> revision history. (What i think is missing is the BK tree merge 
> information, but i'm not sure we want/need to convert them to GIT.)

author names are abbreviated, e.g. 'viro' instead of 
[EMAIL PROTECTED], and no committer information is 
included (albeit committer ought to be Linus in most cases). These are 
limitations of the BK->CVS gateway i think.

Ingo

full kernel history, in patchset format

2005-04-16 Thread Ingo Molnar


i've converted the Linux kernel CVS tree into 'flat patchset' format, 
which gave a series of 28237 separate patches. (Each patch represents a 
changeset, in the order they were applied. I've used the cvsps utility.)

the history data starts at 2.4.0 and ends at 2.6.12-rc2. I've included a 
script that will apply all the patches in order and will create a 
pristine 2.6.12-rc2 tree.

it needed many hours to finish, on a very fast server with tons of RAM, 
and it also needed a fair amount of manual work to extract it and to 
make it usable, so i guessed others might want to use the end result as 
well, to try and generate large GIT repositories from them (or to run 
analysis over the patches, etc.).

the patches contain all the existing metadata, dates, log messages and 
revision history. (What i think is missing is the BK tree merge 
information, but i'm not sure we want/need to convert them to GIT.)

it's a 136 MB tarball, which can be downloaded from:

   http://kernel.org/pub/linux/kernel/people/mingo/Linux-2.6-patchset/

the ./generate-2.6.12-rc2 script generates the 2.6.12-rc2 tree into 
linux/, from scratch. (No pre-existing kernel is needed, as 2.patch 
generates the full 2.4.0 kernel tree.) The patching takes a couple of 
minutes to finish, on a fast box.
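
For readers who want to drive the patches themselves, here is a rough
Python sketch of the "apply everything in order" step. It is not the
generate-2.6.12-rc2 script. It assumes the patches are named N.patch (as
with 2.patch above) and that -p0 is the right strip level because the
sample patch uses paths starting with linux/. It also skips the logo.gif
and diff-CVS-to-real steps. ordered_patches and apply_all are hypothetical
names.

```python
import re
import subprocess
from pathlib import Path


def ordered_patches(patch_dir):
    """Return the N.patch files in patch_dir sorted by integer N.

    A plain string sort would put 10.patch before 2.patch, so the number
    is parsed out of the file name first.
    """
    numbered = []
    for path in Path(patch_dir).iterdir():
        match = re.fullmatch(r"(\d+)\.patch", path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered)]


def apply_all(patch_dir, target_dir="."):
    """Apply every patch in order with patch(1), stopping at the first failure."""
    for path in ordered_patches(patch_dir):
        with open(path, "rb") as patch_file:
            # check=True raises CalledProcessError if patch exits non-zero.
            subprocess.run(
                ["patch", "-p0", "-s"],
                stdin=patch_file,
                cwd=target_dir,
                check=True,
            )
```

Stopping at the first failed patch matters here, because every later
patch depends on the tree state left by the earlier ones.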

below i've attached a sample patch from the series.

note: i kept the patches the cvsps utility generated as-is, to have a 
verifiable base to work on. A very small number of deltas were missed 
(about a dozen), probably as a result of CVS-related errors; these are 
included in the diff-CVS-to-real patch. Also, the patch format cannot 
create the binary Documentation/logo.gif file, so the script does this 
too - just to be able to generate a complete 2.6.12-rc2 tree that is 
byte-for-byte identical to the real thing.

Ingo

-
PatchSet 1234 
Date: 2002/04/11 18:29:07
Author: viro
Branch: HEAD
Tag: (none) 
Log:
[PATCH] crapectomy in include/linux/nfsd/syscall.h

Removes an atavism in declaration of sys_nfsservctl() - sorry, I should've
remove that junk when cond_syscall() thing was done.

BKrev: 3cb5c7e3phTYgiz1YLsjQ_McTo9pOQ

Members: 
ChangeSet:1.1234->1.1235 
include/linux/nfsd/syscall.h:1.3->1.4 

Index: linux/include/linux/nfsd/syscall.h
===
RCS file: /home/mingo/linux-CVS/linux/include/linux/nfsd/syscall.h,v
retrieving revision 1.3
retrieving revision 1.4
diff -u -r1.3 -r1.4
--- linux/include/linux/nfsd/syscall.h  15 Mar 2002 23:06:06 -  1.3
+++ linux/include/linux/nfsd/syscall.h  11 Apr 2002 17:29:07 -  1.4
@@ -132,11 +132,7 @@
 /*
  * Kernel syscall implementation.
  */
-#if defined(CONFIG_NFSD) || defined(CONFIG_NFSD_MODULE)
 extern asmlinkage long sys_nfsservctl(int, struct nfsctl_arg *, void *);
-#else
-#define sys_nfsservctl sys_ni_syscall
-#endif
 extern int exp_addclient(struct nfsctl_client *ncp);
 extern int exp_delclient(struct nfsctl_client *ncp);
 extern int exp_export(struct nfsctl_export *nxp);
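
For anyone who wants to run analysis over the patches, or to recover the
BK revision id asked about earlier in the thread, the header of the sample
above can be parsed with a few lines of Python. This is a sketch that
assumes every patch follows the layout of this one sample ('Key: value'
fields, then 'Log:', then 'Members:' with path:old->new lines, then the
diffs). parse_patchset_header is a hypothetical helper, not part of cvsps.

```python
def parse_patchset_header(text):
    """Parse the header of one cvsps patchset into a dict.

    Returns the 'Key: value' fields under lower-cased keys, plus 'patchset'
    (int), 'log' (str), 'bkrev' (when a BKrev line is present) and
    'members' as a list of (path, old_revision, new_revision) tuples.
    """
    info = {"log": [], "members": []}
    section = "fields"
    for line in text.splitlines():
        line = line.rstrip()
        if line.startswith("Index: "):
            break  # the per-file diffs start here
        if section == "fields":
            if line.startswith("PatchSet "):
                info["patchset"] = int(line.split()[1])
            elif line == "Log:":
                section = "log"
            elif ": " in line:
                key, value = line.split(": ", 1)
                info[key.lower()] = value
        elif section == "log":
            if line == "Members:":
                section = "members"
            elif line.startswith("BKrev: "):
                info["bkrev"] = line[len("BKrev: "):]
            else:
                info["log"].append(line)
        elif section == "members" and line:
            # Split on the last colon so the path itself is left intact.
            path, revisions = line.rsplit(":", 1)
            old, new = revisions.split("->")
            info["members"].append((path, old, new))
    info["log"] = "\n".join(info["log"]).strip()
    return info
```

Called on the sample patch, this should give patchset 1234, author 'viro',
the BKrev string, and two members (ChangeSet and
include/linux/nfsd/syscall.h).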