Re: [PATCH 0/2] Support for packs in HTTP

2005-07-11 Thread Daniel Barkalow
On Mon, 11 Jul 2005, Linus Torvalds wrote:

> 
> 
> On Mon, 11 Jul 2005, Daniel Barkalow wrote:
> > On Sun, 10 Jul 2005, Linus Torvalds wrote:
> > 
> > > 
> > > You really _mustn't_ try to create the pack directly to the
> > > $GIT_DIR/objects/pack subdirectory - that would make git itself start
> > > possibly using that pack before the index is all done, and that would be
> > > just wrong and nasty.
> > >
> > > So you really should _always_ generate the pack somewhere else, and then 
> > > move it (pack file first, index file second).
> > 
> > It's currently fine ignoring index files without corresponding
> > pack files (sha1_file.c, line 470).
> 
> That doesn't help.

Well, it means that the order you move them doesn't matter, because it
will ignore the pair if either hasn't been moved.

> Redgardless of which order you write them (and you _will_ write the 
> pack-file first), you'll find that at some point you have both files, but 
> one or the other isn't fully written, ie they are unusable.

(Off topic: note that git-http-pull writes the _index_ first, because it
fetches it to determine if it should fetch the pack)

> And yes, you can handle that by always checking the SHA1 of the files when 
> you open them, but the fact is, you shouldn't need to, just to use it. 
> Checking the SHA1 of the pack-file in particular is very expensive (since 
> it's potentially a huge file, and you don't even want to read all of it).

IIRC, we check the size of the pack file and there are hashes around the
ends of the two files which have to match; but this is a die() check, not
an ignore check, so we just crash with a clear error message rather than
doing crazy stuff (like reading from beyond the end of the mmap).

> So that's what I decided the rule is: never ever have a partial file, and 
> thus you can by definition use them immediately when you see both files.
> 
> But that requires that you write them under another name than the final 
> one. And since you want that _anyway_ for other uses, you don't hide that 
> inside "git-pack-objects", but you make it an exported interface.

We should never write anything under the final name, anyway, for just this
reason; we already use open/write/close/rename for objects, refs, and
cache (maybe not working directory files, though). I think we're actually
agreeing on this.

My position is that the temporary location should be something like
{final-name}.part, such that it doesn't match *.idx or *.pack beforehand
(so it doesn't look like a complete file that you might want to send to
someone) and it doesn't have to worry about EXDEV on the rename. Also, I
would ideally like to be able to resume an interrupted download, which
means that it would have to find the partial file in a predictable
location, given what it's supposed to contain.

-Daniel
*This .sig left intentionally blank*

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Support for packs in HTTP

2005-07-11 Thread Linus Torvalds


On Mon, 11 Jul 2005, Daniel Barkalow wrote:
> On Sun, 10 Jul 2005, Linus Torvalds wrote:
> 
> > 
> > You really _mustn't_ try to create the pack directly to the
> > $GIT_DIR/objects/pack subdirectory - that would make git itself start
> > possibly using that pack before the index is all done, and that would be
> > just wrong and nasty.
> >
> > So you really should _always_ generate the pack somewhere else, and then 
> > move it (pack file first, index file second).
> 
> It's currently fine ignoring index files without corresponding
> pack files (sha1_file.c, line 470).

That doesn't help.

Redgardless of which order you write them (and you _will_ write the 
pack-file first), you'll find that at some point you have both files, but 
one or the other isn't fully written, ie they are unusable.

And yes, you can handle that by always checking the SHA1 of the files when 
you open them, but the fact is, you shouldn't need to, just to use it. 
Checking the SHA1 of the pack-file in particular is very expensive (since 
it's potentially a huge file, and you don't even want to read all of it).

So you could have rules like: pack-file must get populated first, and 
index file must be SHA1-checked by all users.

Or, you could just have a saner rule: create the pack-files somewhere 
else, and move them atomically to the right place. Problem solved.

So that's what I decided the rule is: never ever have a partial file, and 
thus you can by definition use them immediately when you see both files.

But that requires that you write them under another name than the final 
one. And since you want that _anyway_ for other uses, you don't hide that 
inside "git-pack-objects", but you make it an exported interface.

Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Support for packs in HTTP

2005-07-10 Thread Daniel Barkalow
On Sun, 10 Jul 2005, Linus Torvalds wrote:

> On Sun, 10 Jul 2005, Daniel Barkalow wrote:
> > 
> > Perhaps git-pack-objects should have the base as a optional argument,
> > with a default of the filename in $GIT_DIR/objects/pack and an option
> > for sending just the pack file to stdout?
> 
> You really _mustn't_ try to create the pack directly to the
> $GIT_DIR/objects/pack subdirectory - that would make git itself start
> possibly using that pack before the index is all done, and that would be
> just wrong and nasty.
>
> So you really should _always_ generate the pack somewhere else, and then 
> move it (pack file first, index file second).

It's currently fine ignoring index files without corresponding
pack files (sha1_file.c, line 470). Do you want to make the constraint
that the pack/ directory doesn't have index files for packs that aren't
also there? (I've been putting the index files for packs that might be
possibile to get there, and relying on the above check to make sure that
they don't affect anything if it hasn't fetched the pack.)

Of course, we should never write to files in locations that anything looks
at; we want everything to appear atomically, completely written and
verified. But there's nothing wrong with having the C code place the
objects, which is certainly going to be necessary in the case of
downloading them by HTTP, since the program will want to place them and
enable them while running.

-Daniel
*This .sig left intentionally blank*

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Support for packs in HTTP

2005-07-10 Thread Linus Torvalds


On Sun, 10 Jul 2005, Daniel Barkalow wrote:
> 
> Perhaps git-pack-objects should have the base as a optional argument,
> with a default of the filename in $GIT_DIR/objects/pack and an option
> for sending just the pack file to stdout?

You really _mustn't_ try to create the pack directly to the
$GIT_DIR/objects/pack subdirectory - that would make git itself start
possibly using that pack before the index is all done, and that would be
just wrong and nasty.

So you really should _always_ generate the pack somewhere else, and then 
move it (pack file first, index file second).

Which is, btw, exactly what "git repack" does, so the solution to the 
problem is to just never use git-pack-objects directly if you don't like 
the semantics..

Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Support for packs in HTTP

2005-07-10 Thread Daniel Barkalow
On Sun, 10 Jul 2005, Linus Torvalds wrote:

> 
> 
> On Sun, 10 Jul 2005, Daniel Barkalow wrote:
> 
> > On Sun, 10 Jul 2005, Linus Torvalds wrote:
> > > 
> > > Well, regardless, we want to be able to specify which directory to write 
> > > them to. We don't necessarily want to write them to the current working 
> > > directory, nor do we want to write them to their eventual destination in 
> > > .git/objects/pack.
> > > 
> > > In fact, the main current user ("git repack") really wants to write them 
> > > to a temporary file, and one that isn't even called "pack-xxx", since it 
> > > ends up doing cleanup with 
> > > 
> > >   rm -f .tmp-pack-*
> > > 
> > > in case a previous re-pack was interrupted (in which case it simply cannor
> > > know what the exact name was supposed to be).
> > > 
> > > So the "basename" ends up being necessary and meaningful regardless. We 
> > > do 
> > > _not_ want to remove that capability.
> > 
> > Shouldn't we do the same thing we do with object files? I don't see any
> > difference in desired behavior.
> 
> Well, the main difference is that pack-files can be used for many things.
> 
> For example, a web interface for getting a pack-file between two releases: 
> say you knew you had version xyzzy, and you want to get version xyzzy+1, 
> you could do that through webgit some way even with a "stupid" interface. 
> Kay already had some patch to generate pack-files for something.
> 
> The point being that pack-files are _not_ like objects. Pack-files are 
> meant for communication. Having them in .git/objects/pack is just one 
> special case.

Okay, I can see the use for them getting written to arbitrary paths; but I
think that it's worth having a canonical location for a pack that's being
used by the system (either not having been sent anywhere, or after having
been received). Perhaps git-pack-objects should have the base as a
optional argument, with a default of the filename in $GIT_DIR/objects/pack
and an option for sending just the pack file to stdout? I think that
covers everything in order of usefulness, and means that the program deals
with any filename that the user doesn't know in advance.

> > Why not checksum it in a predictable order, either that of the pack file
> > or the index? We do care that it's something verifiable, so that people
> > can't cause intentional collisions (for a DoS) just by naming their packs
> > after existing packs that users might not have downloaded yet.
> 
> We could sha1-sum the "sorted by SHA1" list, I guess.

That'd be good; then git-http-pull can validate the hash on the index and
be sure that a matching pack file from a different location still has the
same contents.

-Daniel
*This .sig left intentionally blank*

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Support for packs in HTTP

2005-07-10 Thread Linus Torvalds


On Sun, 10 Jul 2005, Daniel Barkalow wrote:

> On Sun, 10 Jul 2005, Linus Torvalds wrote:
> > 
> > Well, regardless, we want to be able to specify which directory to write 
> > them to. We don't necessarily want to write them to the current working 
> > directory, nor do we want to write them to their eventual destination in 
> > .git/objects/pack.
> > 
> > In fact, the main current user ("git repack") really wants to write them 
> > to a temporary file, and one that isn't even called "pack-xxx", since it 
> > ends up doing cleanup with 
> > 
> > rm -f .tmp-pack-*
> > 
> > in case a previous re-pack was interrupted (in which case it simply cannor
> > know what the exact name was supposed to be).
> > 
> > So the "basename" ends up being necessary and meaningful regardless. We do 
> > _not_ want to remove that capability.
> 
> Shouldn't we do the same thing we do with object files? I don't see any
> difference in desired behavior.

Well, the main difference is that pack-files can be used for many things.

For example, a web interface for getting a pack-file between two releases: 
say you knew you had version xyzzy, and you want to get version xyzzy+1, 
you could do that through webgit some way even with a "stupid" interface. 
Kay already had some patch to generate pack-files for something.

The point being that pack-files are _not_ like objects. Pack-files are 
meant for communication. Having them in .git/objects/pack is just one 
special case.

> Why not checksum it in a predictable order, either that of the pack file
> or the index? We do care that it's something verifiable, so that people
> can't cause intentional collisions (for a DoS) just by naming their packs
> after existing packs that users might not have downloaded yet.

We could sha1-sum the "sorted by SHA1" list, I guess.

Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Support for packs in HTTP

2005-07-10 Thread Daniel Barkalow
On Sun, 10 Jul 2005, Linus Torvalds wrote:

> On Sun, 10 Jul 2005, Junio C Hamano wrote:
> >
> > So I would suggest either:
> > 
> >   - droping the packname parameter from git-pack-objects.  Make
> > the packs always named pack-X{40}.pack (or just X{40}.pack);
> 
> Well, regardless, we want to be able to specify which directory to write 
> them to. We don't necessarily want to write them to the current working 
> directory, nor do we want to write them to their eventual destination in 
> .git/objects/pack.
> 
> In fact, the main current user ("git repack") really wants to write them 
> to a temporary file, and one that isn't even called "pack-xxx", since it 
> ends up doing cleanup with 
> 
>   rm -f .tmp-pack-*
> 
> in case a previous re-pack was interrupted (in which case it simply cannor
> know what the exact name was supposed to be).
> 
> So the "basename" ends up being necessary and meaningful regardless. We do 
> _not_ want to remove that capability.

Shouldn't we do the same thing we do with object files? I don't see any
difference in desired behavior.

> > also have verify-pack to verify the name of the packfile,
> > and make sure X{40} part of the name matches what it claims
> > to contain;
> 
> Now, that would be fine, but it can't be done. Not the way things are laid 
> out. A SHA1 checksum depends on the order the data was checksummed in, and 
> we don't even save that.

Why not checksum it in a predictable order, either that of the pack file
or the index? We do care that it's something verifiable, so that people
can't cause intentional collisions (for a DoS) just by naming their packs
after existing packs that users might not have downloaded yet.

-Daniel
*This .sig left intentionally blank*

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Support for packs in HTTP

2005-07-10 Thread Linus Torvalds


On Sun, 10 Jul 2005, Junio C Hamano wrote:
>
> So I would suggest either:
> 
>   - droping the packname parameter from git-pack-objects.  Make
> the packs always named pack-X{40}.pack (or just X{40}.pack);

Well, regardless, we want to be able to specify which directory to write 
them to. We don't necessarily want to write them to the current working 
directory, nor do we want to write them to their eventual destination in 
.git/objects/pack.

In fact, the main current user ("git repack") really wants to write them 
to a temporary file, and one that isn't even called "pack-xxx", since it 
ends up doing cleanup with 

rm -f .tmp-pack-*

in case a previous re-pack was interrupted (in which case it simply cannor
know what the exact name was supposed to be).

So the "basename" ends up being necessary and meaningful regardless. We do 
_not_ want to remove that capability.

> also have verify-pack to verify the name of the packfile,
> and make sure X{40} part of the name matches what it claims
> to contain;

Now, that would be fine, but it can't be done. Not the way things are laid 
out. A SHA1 checksum depends on the order the data was checksummed in, and 
we don't even save that.

>   - or drop sha1_pack_name() and let the user name the pack any
> way he wants.

No, I do want to use a SHA1, because I want to make sure that you can mix 
packs in a rsync/wget environment where if two files are named the same, 
they'll have the same contents.

So we can make verify-pack check that the pack-name matches the style
"pack-x." naming convention, but we can't match up the sha1 with
anything.

Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Support for packs in HTTP

2005-07-10 Thread Junio C Hamano
Daniel Barkalow <[EMAIL PROTECTED]> writes:

> This series has one patch which is ready to go in and one that's not
> (although it's a reasonable phony for the current state of the git world).

I like the general direction in which this patch is leading us.

But before going further, I'd like to see a consensus on the
pack naming convention.  The "sha1 of packed object names" was
originally introduced to easily avoid the pack name collisions,
but not enforced, so a user could do the following and still
expect things to work:

$ n=`git-pack-objects pk http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] Support for packs in HTTP

2005-07-10 Thread Daniel Barkalow
This series has one patch which is ready to go in and one that's not
(although it's a reasonable phony for the current state of the git world).

 1: Several additional functions are needed in the library to support
progressively getting pack data from some remote location and using it
to determine what else to get.

 2: git-http-pull can get packs as appropriate by getting all the index
files first, and then using them to figure out whether the object it's
looking for is in some pack it could get.

Currently, there's no sane way to figure out what pack/index files are
available from an HTTP server. But there only seems to be one pack file
available on an HTTP server at the moment, so this tries to get that
one.

-Daniel
*This .sig left intentionally blank*

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html