Re: git, monorepos, and access control

2018-12-05 Thread Jeff King
On Wed, Dec 05, 2018 at 11:42:09PM +0000, Coiner, John wrote:

> > For instance, Git is very eager to try to find delta-compression
> > opportunities between objects, even if they don't have any relationship
> > within the tree structure. So imagine I want to know the contents of
> > tree X. I push up a tree Y similar to X, then fetch it back, falsely
> > claiming to have X but not Y. If the server generates a delta, that may
> > reveal information about X (which you can then iterate to send Y', and
> > so on, treating the server as an oracle until you've guessed the content
> > of X).
> Another good point. I wouldn't have thought of either of these attacks. 
> You're scaring me (appropriately) about the risks of adding security to 
> a previously-unsecured interface. Let me push on the smudge/clean 
> approach and maybe that will bear fruit.

If you do look into that approach, check out how git-lfs works. In fact,
you might even be able to build around lfs itself. It's already putting
placeholder objects into the repository, and then faulting them in from
external storage. All you would need to do is lock down access to that
external storage, which is typically accessed via http.

(That all assumes you're OK with sharing the actual filenames with
everybody, and just restricting access to the blob contents. There's no
way to clean/smudge a whole subtree. For that you'd have to use
submodules).
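
For illustration, a minimal clean/smudge sketch along those lines,
assuming a GPG key named "repo-secrets" exists (note that real tools
such as git-crypt use deterministic encryption here, so that the
clean output stays stable across invocations):

  # encrypt blobs on the way into the repo, decrypt them on checkout
  git config filter.crypt.clean "gpg --encrypt --recipient repo-secrets --output - -"
  git config filter.crypt.smudge "gpg --decrypt --quiet --no-tty"
  git config filter.crypt.required true
  # route a protected subtree through the filter
  echo "secret-dir/** filter=crypt" >>.gitattributes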

-Peff


Re: git, monorepos, and access control

2018-12-05 Thread Jeff King
On Thu, Dec 06, 2018 at 10:08:57AM +0900, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > In my opinion this feature is so contrary to Git's general assumptions
> > that it's likely to create a ton of information leaks of the supposedly
> > protected data.
> > ...
> 
> Yup, with s/implemented/designed/, I agree with all you said here
> (snipped).

Heh, yeah, I actually scratched my head over what word to use. I think
Git _could_ be written in a way that is both compatible with existing
repositories (i.e., is still recognizably Git) and is careful about
object access control. But either way, what we have now is not close to
that.

> > Sorry I don't have a more positive response. What you want to do is
> > perfectly reasonable, but I just think it's a mismatch with how Git
> > works (and because of the security impact, one missed corner case
> > renders the whole thing useless).
> 
> Yup, again.
> 
> Storing source files encrypted and decrypting with a smudge filter
> upon checkout (those without access won't get keys and will likely
> use sparse checkout to exclude these privileged sources) is probably
> the only workaround that does not involve submodules.
> Viewing "diff" and "log -p" would still be a challenge, which
> could probably use the same filter as smudge for textconv.

I suspect there are going to be some funny corner cases there. I use:

  [diff "gpg"]
  textconv = gpg -qd --no-tty

which works pretty well, but it's for files which are _never_ decrypted
by Git. So they're encrypted in the working tree too, and I don't use
clean/smudge filters.

If the files are already decrypted in the working tree, then running
them through gpg again would be the wrong thing. I guess for a diff
against the working tree, we would always do a "clean" operation to
produce the encrypted text, and then decrypt the result using textconv.
Which would work, but is rather slow.
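
A sketch of wiring both halves together, reusing the hypothetical
"crypt" filter from the earlier example: textconv sees the in-repo
(encrypted) blob contents, so decrypting there works, while
working-tree files are already plaintext---which is exactly the
double-processing problem described above:

  git config diff.gpg.textconv "gpg -qd --no-tty"
  echo "secret-dir/** filter=crypt diff=gpg" >>.gitattributes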

> I wonder (and this is the primary reason why I am responding to you)
> if it is a common enough wish to use the same filter for smudge and
> textconv?  So far, our stance (which can be judged from the way the
> clean/smudge filters are named) has been that the in-repo
> representation is the canonical one, and the representation used in
> the checkout is ephemeral; that is why we run "diff", "grep",
> etc. over the in-repo representation.  But the "encrypted in repo,
> decrypted in checkout" abuse would be helped by an option to do the
> reverse---find changes and look for substrings in the representation
> used in the checkout.  I am not sure if there are other use cases
> that are helped by such an option.

Hmm. Yeah, I agree with your line of reasoning here. I'm not sure how
common it is. This is the first time I can recall it coming up. And personally, I have
never really used clean/smudge filters myself, beyond some toy
experiments.

The other major user of that feature I can think of is LFS. There Git
ends up diffing the LFS pointers, not the big files. Which arguably is
the wrong thing (you'd prefer to see the actual file contents diffed),
but I think nobody cares in practice because large files generally don't
have readable diffs anyway.

-Peff


Re: [RFC PATCH 3/3] test-lib: add the '--stress' option to run a test repeatedly under load

2018-12-05 Thread Junio C Hamano
Jeff King  writes:

> Each "wait" will try to collect all processes, but may be interrupted by
> a signal. So the correct number is actually "1 plus the number of times
> the user hits ^C".

Yeah, and that is not bounded.  It is OK not to catch multiple ^C
that race with what we do, so having an extra wait in the clean-up
procedure after receiving a signal, like you suggested, would be both
good enough and the cleanest solution, I think.

Thanks.


Re: [WIP RFC 4/5] upload-pack: refactor writing of "packfile" line

2018-12-05 Thread Junio C Hamano
Jonathan Tan  writes:

> @@ -126,6 +129,12 @@ static int read_pack_objects_stdout(int outfd, struct output_state *os)
>   }
>   os->used += readsz;
>  
> + if (!os->packfile_started) {
> + os->packfile_started = 1;
> + if (use_protocol_v2)
> + packet_write_fmt(1, "packfile\n");

If we fix this function so that the only byte in the buffer is held
back without being emitted when os->used == 1, as I alluded to, this
may have to be done a bit later; with such a change, it is no longer
guaranteed that send_client_data() will be called after this point.

> + }
> +
>   if (os->used > 1) {
>   send_client_data(1, os->buffer, os->used - 1);
>   os->buffer[0] = os->buffer[os->used - 1];

> +static void flush_progress_buf(struct strbuf *progress_buf)
> +{
> + if (!progress_buf->len)
> + return;
> + send_client_data(2, progress_buf->buf, progress_buf->len);
> + strbuf_reset(progress_buf);
> +}

Interesting.

>  static void create_pack_file(const struct object_array *have_obj,
> -  const struct object_array *want_obj)
> +  const struct object_array *want_obj,
> +  int use_protocol_v2)
>  {
>   struct child_process pack_objects = CHILD_PROCESS_INIT;
>   struct output_state output_state = {0};
> @@ -260,9 +278,13 @@ static void create_pack_file(const struct object_array *have_obj,
>*/
>   sz = xread(pack_objects.err, progress,
> sizeof(progress));
> - if (0 < sz)
> - send_client_data(2, progress, sz);
> - else if (sz == 0) {
> + if (0 < sz) {
> + if (output_state.packfile_started)
> + send_client_data(2, progress, sz);
> + else
> + strbuf_add(&output_state.progress_buf,
> +progress, sz);

Isn't the progress output that goes to channel #2 pretty much
independent from the payload stream that carries the pkt-line
command like "packfile" plus the raw pack stream?  It somehow
feels like an oxymoron to _buffer_ a progress indicator, as
buffering defeats the whole point of a progress report.


Re: [RFC PATCH 3/3] test-lib: add the '--stress' option to run a test repeatedly under load

2018-12-05 Thread Jeff King
On Thu, Dec 06, 2018 at 09:22:23AM +0900, Junio C Hamano wrote:

> > So the right number of waits is either "1" or "2". Looping means we do
> > too many (which is mostly a harmless noop) or too few (in the off chance
> > that you have only a single job ;) ). So it works out in practice.
> 
> Well, if you time your ^C perfectly, no number of waits is right, I
> am afraid.  You spawn N processes and while looping N times waiting
> for them, you can ^C out of wait before these N processes all die,
> no?

Each "wait" will try to collect all processes, but may be interrupted by
a signal. So the correct number is actually "1 plus the number of times
the user hits ^C". I had assumed the user was just hitting it once,
though putting the wait into the trap means we do that "1 plus" thing
anyway.

I could also see an argument that subsequent ^C's should exit
immediately, but I think we are getting well into the realm of
over-engineering.

-Peff


Re: [PATCH 1/2] commit-graph, fuzz: Add fuzzer for commit-graph

2018-12-05 Thread Junio C Hamano
Josh Steadmon  writes:

> diff --git a/fuzz-commit-graph.c b/fuzz-commit-graph.c
> new file mode 100644
> index 00..420851d0d2
> --- /dev/null
> +++ b/fuzz-commit-graph.c
> @@ -0,0 +1,18 @@
> +#include "object-store.h"
> +#include "commit-graph.h"
> +
> +struct commit_graph *parse_commit_graph(void *graph_map, int fd,
> + size_t graph_size);
> +
> +int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);
> +
> +int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
> +{
> + struct commit_graph *g;
> +
> + g = parse_commit_graph((void *) data, -1, size);
> + if (g)
> + free(g);

As it is perfectly OK to free(NULL), please lose "if (g)" and a
level of indentation; otherwise, "make coccicheck" would complain.

Thanks.

> + return 0;
> +}


Re: [PATCH 1/2] commit-graph, fuzz: Add fuzzer for commit-graph

2018-12-05 Thread Junio C Hamano
Junio C Hamano  writes:

>> +if (graph_size < GRAPH_MIN_SIZE)
>> +return NULL;
>> +
>
> The load_commit_graph_one() grabbed graph_map out of xmmap() so it
> is guaranteed to be non-NULL, but we need to check graph_map != NULL
> when we're calling this directly from the fuzz tests, exactly in the
> same spirit that we check graph_size above.  Besides, these are to
> make sure future callers won't misuse the API.

Insert "Please check graph_map != NULL here, too." before the above
paragraph.

>>  data = (const unsigned char *)graph_map;
>
> And the reset of the function is the same as the original modulo
> jumping to the cleanup_fail label has been replaced with returning
> NULL.
>
> Looks good.

Of course, s/reset/rest/ is what I meant.


Re: [PATCH 1/2] commit-graph, fuzz: Add fuzzer for commit-graph

2018-12-05 Thread Junio C Hamano
Josh Steadmon  writes:

> @@ -108,27 +106,61 @@ struct commit_graph *load_commit_graph_one(const char *graph_file)
>   die(_("graph file %s is too small"), graph_file);
>   }
>   graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
> + ret = parse_commit_graph(graph_map, fd, graph_size);

OK, assuming that the new helper returns NULL from all places in the
original that would have jumped to the cleanup-fail label, this would
do the same thing (it may not be the right thing to exit, but that
is OK for the purpose of this change).

> + if (ret == NULL) {
> + munmap(graph_map, graph_size);
> + close(fd);
> + exit(1);
> + }
> +
> + return ret;
> +}
> +
> +/*
> + * This function is intended to be used only from load_commit_graph_one() or in
> + * fuzz tests.
> + */

Hmph, is that necessary to say?  

"Right now, it has only these two callers" is sometimes handy for
those without good static analysis tools, like "git grep" ;-), but I
cannot think of a reason why a new caller we'll add in the future,
who happens to have the data of the graph file in-core, should not
be allowed to call this function.


> +struct commit_graph *parse_commit_graph(void *graph_map, int fd,
> + size_t graph_size)
> +{
> + const unsigned char *data, *chunk_lookup;
> + uint32_t i;
> + struct commit_graph *graph;
> + uint64_t last_chunk_offset;
> + uint32_t last_chunk_id;
> + uint32_t graph_signature;
> + unsigned char graph_version, hash_version;
> +
> + /*
> +  * This should already be checked in load_commit_graph_one, but we still
> +  * need a check here for when we're calling parse_commit_graph directly
> +  * from fuzz tests. We can omit the error message in that case.
> +  */

In the same spirit, I am dubious of the longer-term value of this
comment.  As an explanation of why this conversion is correct in the
proposed log message for this change, it makes perfect sense,
though.

> + if (graph_size < GRAPH_MIN_SIZE)
> + return NULL;
> +

The load_commit_graph_one() grabbed graph_map out of xmmap() so it
is guaranteed to be non-NULL, but we need to check graph_map != NULL
when we're calling this directly from the fuzz tests, exactly in the
same spirit that we check graph_size above.  Besides, these are to
make sure future callers won't misuse the API.

>   data = (const unsigned char *)graph_map;

And the reset of the function is the same as the original modulo
jumping to the cleanup_fail label has been replaced with returning
NULL.

Looks good.

> ...
> -
> -cleanup_fail:
> - munmap(graph_map, graph_size);
> - close(fd);
> - exit(1);
>  }
>  
>  static void prepare_commit_graph_one(struct repository *r, const char *obj_dir)
> diff --git a/fuzz-commit-graph.c b/fuzz-commit-graph.c
> new file mode 100644
> index 00..420851d0d2
> --- /dev/null
> +++ b/fuzz-commit-graph.c
> @@ -0,0 +1,18 @@
> +#include "object-store.h"
> +#include "commit-graph.h"
> +
> +struct commit_graph *parse_commit_graph(void *graph_map, int fd,
> + size_t graph_size);
> +
> +int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);
> +
> +int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
> +{
> + struct commit_graph *g;
> +
> + g = parse_commit_graph((void *) data, -1, size);
> + if (g)
> + free(g);
> +
> + return 0;
> +}


Re: [PATCH v3] list-objects.c: don't segfault for missing cmdline objects

2018-12-05 Thread Junio C Hamano
Matthew DeVore  writes:

> When a command is invoked with both --exclude-promisor-objects,
> --objects-edge-aggressive, and a missing object on the command line,
> the rev_info.cmdline array could get a NULL pointer for the value of
> an 'item' field. Prevent dereferencing of a NULL pointer in that
> situation.
>
> Properly handle --ignore-missing. If it is not passed, die when an
> object is missing. Otherwise, just silently ignore it.
>
> Signed-off-by: Matthew DeVore 

Thanks for an update.  Will replace.

> ---
>  revision.c   |  2 ++
>  t/t0410-partial-clone.sh | 16 ++--
>  2 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index 13e0519c02..293303b67d 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -1729,6 +1729,8 @@ int handle_revision_arg(const char *arg_, struct rev_info *revs, int flags, unsi
>   if (!cant_be_filename)
>   verify_non_filename(revs->prefix, arg);
>   object = get_reference(revs, arg, &oc, flags ^ local_flags);
> + if (!object)
> + return revs->ignore_missing ? 0 : -1;
>   add_rev_cmdline(revs, object, arg_, REV_CMD_REV, flags ^ local_flags);
>   add_pending_object_with_path(revs, object, arg, oc.mode, oc.path);
>   free(oc.path);
> diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
> index ba3887f178..169f7f10a7 100755
> --- a/t/t0410-partial-clone.sh
> +++ b/t/t0410-partial-clone.sh
> @@ -349,7 +349,7 @@ test_expect_success 'rev-list stops traversal at promisor commit, tree, and blob
>   grep $(git -C repo rev-parse bar) out  # sanity check that some walking was done
>  '
>  
> -test_expect_success 'rev-list accepts missing and promised objects on command line' '
> +test_expect_success 'rev-list dies for missing objects on cmd line' '
>   rm -rf repo &&
>   test_create_repo repo &&
>   test_commit -C repo foo &&
> @@ -366,7 +366,19 @@ test_expect_success 'rev-list accepts missing and promised objects on command li
>  
>   git -C repo config core.repositoryformatversion 1 &&
>   git -C repo config extensions.partialclone "arbitrary string" &&
> - git -C repo rev-list --exclude-promisor-objects --objects "$COMMIT" "$TREE" "$BLOB"
> +
> + for OBJ in "$COMMIT" "$TREE" "$BLOB"; do
> + test_must_fail git -C repo rev-list --objects \
> + --exclude-promisor-objects "$OBJ" &&
> + test_must_fail git -C repo rev-list --objects-edge-aggressive \
> + --exclude-promisor-objects "$OBJ" &&
> +
> + # Do not die or crash when --ignore-missing is passed.
> + git -C repo rev-list --ignore-missing --objects \
> + --exclude-promisor-objects "$OBJ" &&
> + git -C repo rev-list --ignore-missing --objects-edge-aggressive \
> + --exclude-promisor-objects "$OBJ"
> + done
>  '
>  
>  test_expect_success 'gc repacks promisor objects separately from non-promisor objects' '


Re: git, monorepos, and access control

2018-12-05 Thread Junio C Hamano
Jeff King  writes:

> In my opinion this feature is so contrary to Git's general assumptions
> that it's likely to create a ton of information leaks of the supposedly
> protected data.
> ...

Yup, with s/implemented/designed/, I agree with all you said here
(snipped).

> Sorry I don't have a more positive response. What you want to do is
> perfectly reasonable, but I just think it's a mismatch with how Git
> works (and because of the security impact, one missed corner case
> renders the whole thing useless).

Yup, again.

Storing source files encrypted and decrypting with a smudge filter
upon checkout (those without access won't get keys and will likely
use sparse checkout to exclude these privileged sources) is probably
the only workaround that does not involve submodules.
Viewing "diff" and "log -p" would still be a challenge, which
could probably use the same filter as smudge for textconv.

I wonder (and this is the primary reason why I am responding to you)
if it is a common enough wish to use the same filter for smudge and
textconv?  So far, our stance (which can be judged from the way the
clean/smudge filters are named) has been that the in-repo
representation is the canonical one, and the representation used in
the checkout is ephemeral; that is why we run "diff", "grep",
etc. over the in-repo representation.  But the "encrypted in repo,
decrypted in checkout" abuse would be helped by an option to do the
reverse---find changes and look for substrings in the representation
used in the checkout.  I am not sure if there are other use cases
that are helped by such an option.


Re: [PATCH 1/2] commit-graph, fuzz: Add fuzzer for commit-graph

2018-12-05 Thread Josh Steadmon
On 2018.12.05 23:48, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Dec 05 2018, Josh Steadmon wrote:
> 
> > Breaks load_commit_graph_one() into a new function,
> > parse_commit_graph(). The latter function operates on arbitrary buffers,
> > which makes it suitable as a fuzzing target.
> >
> > Adds fuzz-commit-graph.c, which provides a fuzzing entry point
> > compatible with libFuzzer (and possibly other fuzzing engines).
> >
> > Signed-off-by: Josh Steadmon 
> > ---
> >  .gitignore  |  1 +
> >  Makefile|  1 +
> >  commit-graph.c  | 63 +
> >  fuzz-commit-graph.c | 18 +
> >  4 files changed, 66 insertions(+), 17 deletions(-)
> >  create mode 100644 fuzz-commit-graph.c
> >
> > diff --git a/.gitignore b/.gitignore
> > index 0d77ea5894..8bcf153ed9 100644
> > --- a/.gitignore
> > +++ b/.gitignore
> > @@ -1,3 +1,4 @@
> > +/fuzz-commit-graph
> >  /fuzz_corpora
> >  /fuzz-pack-headers
> >  /fuzz-pack-idx
> > diff --git a/Makefile b/Makefile
> > index 1a44c811aa..6b72f37c29 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -684,6 +684,7 @@ SCRIPTS = $(SCRIPT_SH_INS) \
> >
> >  ETAGS_TARGET = TAGS
> >
> > +FUZZ_OBJS += fuzz-commit-graph.o
> >  FUZZ_OBJS += fuzz-pack-headers.o
> >  FUZZ_OBJS += fuzz-pack-idx.o
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 40c855f185..0755359b1a 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -46,6 +46,10 @@
> >  #define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
> > + GRAPH_FANOUT_SIZE + GRAPH_OID_LEN)
> >
> > +struct commit_graph *parse_commit_graph(void *graph_map, int fd,
> > +   size_t graph_size);
> > +
> > +
> >  char *get_commit_graph_filename(const char *obj_dir)
> >  {
> > return xstrfmt("%s/info/commit-graph", obj_dir);
> > @@ -84,16 +88,10 @@ static int commit_graph_compatible(struct repository *r)
> >  struct commit_graph *load_commit_graph_one(const char *graph_file)
> >  {
> > void *graph_map;
> > -   const unsigned char *data, *chunk_lookup;
> > size_t graph_size;
> > struct stat st;
> > -   uint32_t i;
> > -   struct commit_graph *graph;
> > +   struct commit_graph *ret;
> > int fd = git_open(graph_file);
> > -   uint64_t last_chunk_offset;
> > -   uint32_t last_chunk_id;
> > -   uint32_t graph_signature;
> > -   unsigned char graph_version, hash_version;
> >
> > if (fd < 0)
> > return NULL;
> > @@ -108,27 +106,61 @@ struct commit_graph *load_commit_graph_one(const char *graph_file)
> > die(_("graph file %s is too small"), graph_file);
> > }
> > graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
> > +   ret = parse_commit_graph(graph_map, fd, graph_size);
> > +
> > +   if (ret == NULL) {
> 
> Code in git usually uses just !ret.

Will fix in V2, thanks.


> > +   munmap(graph_map, graph_size);
> > +   close(fd);
> > +   exit(1);
> 
> Ouch, exit(1) from load_commit_graph_one()? Can't we return NULL here
> instead? Nasty to exit from a library routine, but then I see later...
> 
> > @@ -201,11 +235,6 @@ struct commit_graph *load_commit_graph_one(const char *graph_file)
> > }
> >
> > return graph;
> > -
> > -cleanup_fail:
> > -   munmap(graph_map, graph_size);
> > -   close(fd);
> > -   exit(1);
> >  }
> 
> ... ah, I see this is where you got the exit(1) from. So it was there
> already.
> 
> >  static void prepare_commit_graph_one(struct repository *r, const char *obj_dir)
> > diff --git a/fuzz-commit-graph.c b/fuzz-commit-graph.c
> > new file mode 100644
> > index 00..420851d0d2
> > --- /dev/null
> > +++ b/fuzz-commit-graph.c
> > @@ -0,0 +1,18 @@
> > +#include "object-store.h"
> > +#include "commit-graph.h"
> > +
> > +struct commit_graph *parse_commit_graph(void *graph_map, int fd,
> > +   size_t graph_size);
> > +
> > +int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);
> > +
> > +int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
> > +{
> > +   struct commit_graph *g;
> > +
> > +   g = parse_commit_graph((void *) data, -1, size);
> > +   if (g)
> > +   free(g);
> > +
> > +   return 0;
> > +}
> 
> I hadn't looked at this before, but see your 5e47215080 ("fuzz: add
> basic fuzz testing target.", 2018-10-12) for some prior art.
> 
> There's instructions there for a very long "make" invocation. Would be
> nice if this were friendlier and we could just do "make test-fuzz" or
> something...

Yeah, the problem is that there are too many combinations of fuzzing
engine, sanitizer, and compiler to make any reasonable default here.
Even if you just stick with libFuzzer, address sanitizer, and clang, the
flags change radically depending on which version of clang you're using.


Re: BUG: CR marker ^M doesn't show up in '-' lines of diffs when the ending of the removed line is CR+LF

2018-12-05 Thread Junio C Hamano
Frank Schäfer  writes:

> Just to be sure that I'm not missing anything here:
> What's your definition of "LF in repository, CRLF in working tree" in
> terms of config parameters ?

:::Documentation/config/core.txt:::

core.autocrlf::
Setting this variable to "true" is the same as setting
the `text` attribute to "auto" on all files and core.eol to "crlf".
Set to true if you want to have `CRLF` line endings in your
working directory and the repository has LF line endings.
This variable can be set to 'input',
in which case no output conversion is performed.
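
Concretely, either of these setups gives "LF in repository, CRLF in
working tree" (the per-path pattern is illustrative):

  # repository-wide:
  git config core.autocrlf true

  # or per-path, via the text attribute plus core.eol:
  git config core.eol crlf
  echo "*.txt text" >>.gitattributes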


Re: gitweb: local configuration not found

2018-12-05 Thread Junio C Hamano
Ævar Arnfjörð Bjarmason  writes:

>> Documentation says "If you are absolutely certain that you want your
>> script to load and execute a file from the current directory, then use
>> a ./ prefix".  We can do that, like so:
>>
>> diff --git i/gitweb/Makefile w/gitweb/Makefile
>> index cd194d057f..3160b6cc5d 100644
>> --- i/gitweb/Makefile
>> +++ w/gitweb/Makefile
>> @@ -18,7 +18,7 @@ RM ?= rm -f
>>  INSTALL ?= install
>>
>>  # default configuration for gitweb
>> -GITWEB_CONFIG = gitweb_config.perl
>> +GITWEB_CONFIG = ./gitweb_config.perl
>>  GITWEB_CONFIG_SYSTEM = /etc/gitweb.conf
>>  GITWEB_CONFIG_COMMON = /etc/gitweb-common.conf
>>  GITWEB_HOME_LINK_STR = projects
>>
>> but that does not help if someone overrides GITWEB_CONFIG, and besides,
>> it would be nicer to avoid the possibility of an @INC search altogether.
>> ...
> Just:
>
> local @INC = '.';
> do 'FILE.pl';
>
> Would do the same thing, but seems like a more indirect way to do it if
> all we want is ./ anyway.

Yeah, it does look indirect.  Despite what you said, it also would
support users giving an absolute path via GITWEB_CONFIG.

With "use File::Spec", perhaps something like this?

 gitweb/gitweb.perl | 4 
 1 file changed, 4 insertions(+)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 2594a4badb..239e7cbc25 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -719,6 +719,10 @@ sub filter_and_validate_refs {
 sub read_config_file {
my $filename = shift;
return unless defined $filename;
+
+   $filename = File::Spec->catfile(".", $filename)
+   unless File::Spec->file_name_is_absolute($filename);
+
# die if there are errors parsing config file
if (-e $filename) {
do $filename;


Re: Any way to make git-log to enumerate commits?

2018-12-05 Thread Junio C Hamano
Konstantin Khomoutov  writes:

> On Wed, Dec 05, 2018 at 05:22:14PM +0300, Konstantin Kharlamov wrote:
>
>> It would be great if git-log has a formatting option to insert an
>> index of the current commit since HEAD.
>> 
>> It would allow after quitting the git-log to immediately fire up "git
>> rebase -i HEAD~index" instead of "git rebase -i
>> go-copy-paste-this-long-number-id".
>
> This may have little sense in a general case as the history maintained
> by Git is a graph, not a single line. Hence your prospective approach
> would only work for cases like `git log` called with the
> "--first-parent" command-line option.

I do not see why the "name each rev relative to HEAD" formatting
option cannot produce HEAD^2~2 etc.

It would be similar to "git log | git name-rev --stdin" but I do not
offhand recall if we had a way to tell name-rev to use only HEAD as
the anchoring point.
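
As a workaround sketch in the meantime, the index can be computed
from the hash instead of copy-pasted ($commit is a placeholder for a
hash picked out of "git log", assumed to be on the first-parent
chain):

  n=$(git rev-list --count --first-parent "$commit..HEAD")
  git rebase -i HEAD~$((n + 1))  # base is $commit^, so $commit is on the todo list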


Re: git, monorepos, and access control

2018-12-05 Thread brian m. carlson
On Wed, Dec 05, 2018 at 04:01:05PM -0500, Jeff King wrote:
> You could work around that by teaching the server to refuse to use "X"
> in any way when the client does not have the right permissions. But:
> 
>   - that's just one example; there are probably a number of such leaks
> 
>   - you're fighting an uphill battle against the way Git is implemented;
> there's no single point to enforce this kind of segmentation
> 
> The model that fits more naturally with how Git is implemented would be
> to use submodules. There you leak the hash of the commit from the
> private submodule, but that's probably obscure enough (and if you're
> really worried, you can add a random nonce to the commit messages in the
> submodule to make their hashes unguessable).

I tend to agree that submodules are the way to go here over a monorepo.
Not only do you have much better access control opportunities, in
general, smaller repositories are going to perform better and be easier
to work with than monorepos. While VFS for Git is a great technology,
using a set of smaller repositories is going to outperform it every
time, both on developer machines, and on your source code hosting
platform.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204


Re: [RFC PATCH 3/3] test-lib: add the '--stress' option to run a test repeatedly under load

2018-12-05 Thread Junio C Hamano
Jeff King  writes:

> But the ^C case is interesting. Try running your script under "sh -x"
> and hitting ^C. The signal interrupts the first wait. In my script (or
> yours modified to use a single wait), we then proceed immediately to the
> "exit", and get output from the subshells after we've exited.
>
> But in your script with the loop, the first wait is interrupted, and
> then the _second_ wait (in the second loop iteration) picks up all of
> the exiting processes. The subsequent waits are all noops that return
> immediately, because there are no processes left.
>
> So the right number of waits is either "1" or "2". Looping means we do
> too many (which is mostly a harmless noop) or too few (in the off chance
> that you have only a single job ;) ). So it works out in practice.

Well, if you time your ^C perfectly, no number of waits is right, I
am afraid.  You spawn N processes and while looping N times waiting
for them, you can ^C out of wait before these N processes all die,
no?

> But I think a more clear way to do it is to simply wait in the signal
> trap, to cover the case when our original wait was aborted.

Sounds good.


Re: git, monorepos, and access control

2018-12-05 Thread Coiner, John
inline...

On 12/05/2018 04:12 PM, Ævar Arnfjörð Bjarmason wrote:
> On Wed, Dec 05 2018, Duy Nguyen wrote:
>
>>
>> Another option is builtin per-blob encryption (maybe with just
>> clean/smudge filter), then access control will be about obtaining the
>> decryption key (*) and we don't break object storage and
>> communication. Of course pack delta compression becomes absolutely
>> useless. But that is perhaps an acceptable trade off.
> Right, this is another option, but from what John described wouldn't
> work in this case. "Any hypothetical AMD monorepo should be able to
> securely deny read access in certain subtrees to users without required
> permissions".
>
> I.e. in this case there will be a
> secret-stuff-here/ryzen-microcode.code.encrypted or whatever,
> unauthorized users can't see the content, but they can see from the
> filename that it exists, and from "git log" who works on it.

Ah, clean/smudge has potential. It locates the security boundary 
entirely outside the git service and outside the repo. No big software 
engineering project is needed. It should work in VFSForGit just as it 
does in plain old git. Beautiful!

For our use case, it might be OK for commit logs, branch names, and 
filenames to be world-readable. Commit logs are probably the greatest 
concern; and perhaps we could address that through other means.

Maybe it's possible to design a text-to-text cipher that still permits 
git to store compact diffs of ciphered text, with only modest impact on 
security?

Thank you all for your thoughtful replies.

>
> This is a thing I know *way* less about so maybe I'm completely wrong,
> but even if we have all the rest of the things outlined in your post to
> support this, isn't this part going to be susceptible to timing attacks?

That's a good point. It's not easy to estimate exposure to something 
like this.

> For instance, Git is very eager to try to find delta-compression
> opportunities between objects, even if they don't have any relationship
> within the tree structure. So imagine I want to know the contents of
> tree X. I push up a tree Y similar to X, then fetch it back, falsely
> claiming to have X but not Y. If the server generates a delta, that may
> reveal information about X (which you can then iterate to send Y', and
> so on, treating the server as an oracle until you've guessed the content
> of X).
Another good point. I wouldn't have thought of either of these attacks. 
You're scaring me (appropriately) about the risks of adding security to 
a previously-unsecured interface. Let me push on the smudge/clean 
approach and maybe that will bear fruit.

Thanks again.

- John



Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()

2018-12-05 Thread Junio C Hamano
Jonathan Tan  writes:

> Looking at the bigger picture, the speed of the connectivity check
> during a fetch might be further improved by passing only the negotiation
> tips (obtained through --negotiation-tip) instead of "--all". This patch
> just handles the low-hanging fruit first.

That sounds like a good direction when having to list an excessive
number of refs is the primary problem.  When fetching their 'master'
into our 'remotes/origin/master' and doing nothing else, we may end
up showing only the latter.  That misses a lot of optimization
opportunity if the latest change made over there is to merge in the
change we asked them to pull earlier (which would be greatly helped
if we let them know about the tip of the topic they earlier pulled
from us), but it also avoids having to send irrelevant refs that
point at tags added to months-old states.  So there is a subtle
trade-off between sending more refs to reduce the resulting packfile,
and sending fewer refs to reduce the cost of the "have" exchange.
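
For reference, the user-facing half of that already exists; a fetch
that advertises only chosen tips during negotiation looks like this
(the ref name is illustrative):

  git fetch --negotiation-tip=refs/remotes/origin/master origin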

Changing the way to throw each object pointed at by a ref into the
queue to be emitted in the "have" exchange from regular object
parsing to peeking of precomputed data would reduce the local cost
of "have" exchange, but it does not reduce the network cost at all,
though.

As to the change being specific to get_reference() and not to
parse_object(), I think what we see here is probably better, simply
because parse_object() is not in the position to assume that it is
likely to be asked to parse commits, but I think get_reference() is,
after looking at its callsites in revision.c.

I do share the meta-comment concern with Peff, though.

> ---
>  revision.c | 15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/revision.c b/revision.c
> index b5108b75ab..e7da2c57ab 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -212,7 +212,20 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
>  {
>   struct object *object;
>  
> - object = parse_object(revs->repo, oid);
> + /*
> +  * If the repository has commit graphs, repo_parse_commit() avoids
> +  * reading the object buffer, so use it whenever possible.
> +  */
> + if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) {
> + struct commit *c = lookup_commit(revs->repo, oid);
> + if (!repo_parse_commit(revs->repo, c))
> + object = (struct object *) c;
> + else
> + object = NULL;
> + } else {
> + object = parse_object(revs->repo, oid);
> + }
> +
>   if (!object) {
>   if (revs->ignore_missing)
>   return object;


Re: [PATCH 1/2] commit-graph, fuzz: Add fuzzer for commit-graph

2018-12-05 Thread Ævar Arnfjörð Bjarmason


On Wed, Dec 05 2018, Josh Steadmon wrote:

> Breaks load_commit_graph_one() into a new function,
> parse_commit_graph(). The latter function operates on arbitrary buffers,
> which makes it suitable as a fuzzing target.
>
> Adds fuzz-commit-graph.c, which provides a fuzzing entry point
> compatible with libFuzzer (and possibly other fuzzing engines).
>
> Signed-off-by: Josh Steadmon 
> ---
>  .gitignore  |  1 +
>  Makefile|  1 +
>  commit-graph.c  | 63 +
>  fuzz-commit-graph.c | 18 +
>  4 files changed, 66 insertions(+), 17 deletions(-)
>  create mode 100644 fuzz-commit-graph.c
>
> diff --git a/.gitignore b/.gitignore
> index 0d77ea5894..8bcf153ed9 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -1,3 +1,4 @@
> +/fuzz-commit-graph
>  /fuzz_corpora
>  /fuzz-pack-headers
>  /fuzz-pack-idx
> diff --git a/Makefile b/Makefile
> index 1a44c811aa..6b72f37c29 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -684,6 +684,7 @@ SCRIPTS = $(SCRIPT_SH_INS) \
>
>  ETAGS_TARGET = TAGS
>
> +FUZZ_OBJS += fuzz-commit-graph.o
>  FUZZ_OBJS += fuzz-pack-headers.o
>  FUZZ_OBJS += fuzz-pack-idx.o
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 40c855f185..0755359b1a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -46,6 +46,10 @@
>  #define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
>   + GRAPH_FANOUT_SIZE + GRAPH_OID_LEN)
>
> +struct commit_graph *parse_commit_graph(void *graph_map, int fd,
> + size_t graph_size);
> +
> +
>  char *get_commit_graph_filename(const char *obj_dir)
>  {
>   return xstrfmt("%s/info/commit-graph", obj_dir);
> @@ -84,16 +88,10 @@ static int commit_graph_compatible(struct repository *r)
>  struct commit_graph *load_commit_graph_one(const char *graph_file)
>  {
>   void *graph_map;
> - const unsigned char *data, *chunk_lookup;
>   size_t graph_size;
>   struct stat st;
> - uint32_t i;
> - struct commit_graph *graph;
> + struct commit_graph *ret;
>   int fd = git_open(graph_file);
> - uint64_t last_chunk_offset;
> - uint32_t last_chunk_id;
> - uint32_t graph_signature;
> - unsigned char graph_version, hash_version;
>
>   if (fd < 0)
>   return NULL;
> @@ -108,27 +106,61 @@ struct commit_graph *load_commit_graph_one(const char *graph_file)
>   die(_("graph file %s is too small"), graph_file);
>   }
>   graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
> + ret = parse_commit_graph(graph_map, fd, graph_size);
> +
> + if (ret == NULL) {

Code in git usually uses just !ret.

> + munmap(graph_map, graph_size);
> + close(fd);
> + exit(1);

Ouch, exit(1) from load_commit_graph_one()? Can't we return NULL here
instead? Nasty to exit from a library routine, but then I see later...

> @@ -201,11 +235,6 @@ struct commit_graph *load_commit_graph_one(const char *graph_file)
>   }
>
>   return graph;
> -
> -cleanup_fail:
> - munmap(graph_map, graph_size);
> - close(fd);
> - exit(1);
>  }

... ah, I see this is where you got the exit(1) from. So it was there
already.

>  static void prepare_commit_graph_one(struct repository *r, const char *obj_dir)
> diff --git a/fuzz-commit-graph.c b/fuzz-commit-graph.c
> new file mode 100644
> index 00..420851d0d2
> --- /dev/null
> +++ b/fuzz-commit-graph.c
> @@ -0,0 +1,18 @@
> +#include "object-store.h"
> +#include "commit-graph.h"
> +
> +struct commit_graph *parse_commit_graph(void *graph_map, int fd,
> + size_t graph_size);
> +
> +int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);
> +
> +int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
> +{
> + struct commit_graph *g;
> +
> + g = parse_commit_graph((void *) data, -1, size);
> + if (g)
> + free(g);
> +
> + return 0;
> +}

I hadn't looked at this before, but see your 5e47215080 ("fuzz: add
basic fuzz testing target.", 2018-10-12) for some prior art.

There's instructions there for a very long "make" invocation. Would be
nice if this were friendlier and we could just do "make test-fuzz" or
something...


Re: [PATCH] rebase docs: fix incorrect format of the section Behavioral Differences

2018-12-05 Thread Junio C Hamano
Johannes Sixt  writes:

> On 05.12.18 16:37, Elijah Newren wrote:
>> On Tue, Dec 4, 2018 at 11:40 PM Junio C Hamano  wrote:
>>>
>>> Johannes Sixt  writes:
>>>
 Please let me register my objection. This paragraph is not the right
 place to explain what directory rename detection is and how it works
 under the hood. "works fine" in the original text is the right phrase
 here; if there is concern that this induces expectations that cannot
 be met, throw in the word "heuristics".

 Such as:
 Directory rename heuristics work fine in the merge and interactive
 backends. They do not in the am backend because...
>>>
>>> OK, good enough, I guess.  Or just s/work fine/is enabled/.
>>
>> So...
>>
>> Directory rename heuristics are enabled in the merge and interactive
>> backends. They are not in the am backend because it operates on
>> "fake ancestors" that involve trees containing only the files modified
>> in the patch.  Due to the lack of accurate tree information, directory
>> rename detection is disabled for the am backend.
>
> OK with me. And also if you drop sentence no.2 and keep just no.1 and no.3.

Yeah, that shorter version may be sufficient to explain why we do
not use the heuristics in the "am" backend.


Re: git, monorepos, and access control

2018-12-05 Thread Ævar Arnfjörð Bjarmason


On Wed, Dec 05 2018, Coiner, John wrote:

I forgot to mention this in my initial reply in
<878t13zp8y@evledraar.gmail.com>, but on a re-reading I re-spotted
this:

>   - hashes are secret. If the hashes from a protected tree leak, the
> data also leaks. No check on the server prevents it from handing out
> contents for correctly-guessed hashes.

This is a thing I know *way* less about so maybe I'm completely wrong,
but even if we have all the rest of the things outlined in your post to
support this, isn't this part going to be susceptible to timing attacks?

We'll do more work if you send a SHA-1 during negotiation that shares a
prefix with an existing SHA-1, since we need to binary search & compare
further. SHA-1 is 160 bits which gives you a huge space of potential
hashes, but not if I can try one bit at a time working from the start of
the hash to work my way to a valid existing hash stored on the server.

Of course that assumes a way to do this over the network; it'll be on
AMD's internal network, so much faster than average, but maybe this is
completely implausible.

NetSpectre was different and relied on executing code on the remote
computer in a sandbox, not waiting for network roundtrips for each try,
so maybe this would be a non-issue.


[PATCH 2/2] commit-graph: fix buffer read-overflow

2018-12-05 Thread Josh Steadmon
fuzz-commit-graph identified a case where Git will read past the end of
a buffer containing a commit graph if the graph's header has an
incorrect chunk count. A simple bounds check in parse_commit_graph()
prevents this.

Signed-off-by: Josh Steadmon 
Helped-by: Derrick Stolee 
---
 commit-graph.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 0755359b1a..fee171a5f3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -175,10 +175,19 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
last_chunk_offset = 8;
chunk_lookup = data + 8;
for (i = 0; i < graph->num_chunks; i++) {
-   uint32_t chunk_id = get_be32(chunk_lookup + 0);
-   uint64_t chunk_offset = get_be64(chunk_lookup + 4);
+   uint32_t chunk_id;
+   uint64_t chunk_offset;
int chunk_repeated = 0;
 
+   if (chunk_lookup + GRAPH_CHUNKLOOKUP_WIDTH > data + graph_size) {
+   error(_("chunk lookup table entry missing; graph file may be incomplete"));
+   free(graph);
+   return NULL;
+   }
+
+   chunk_id = get_be32(chunk_lookup + 0);
+   chunk_offset = get_be64(chunk_lookup + 4);
+
chunk_lookup += GRAPH_CHUNKLOOKUP_WIDTH;
 
if (chunk_offset > graph_size - GIT_MAX_RAWSZ) {
-- 
2.20.0.rc2.403.gdbc3b29805-goog



[PATCH 1/2] commit-graph, fuzz: Add fuzzer for commit-graph

2018-12-05 Thread Josh Steadmon
Breaks load_commit_graph_one() into a new function,
parse_commit_graph(). The latter function operates on arbitrary buffers,
which makes it suitable as a fuzzing target.

Adds fuzz-commit-graph.c, which provides a fuzzing entry point
compatible with libFuzzer (and possibly other fuzzing engines).

Signed-off-by: Josh Steadmon 
---
 .gitignore  |  1 +
 Makefile|  1 +
 commit-graph.c  | 63 +
 fuzz-commit-graph.c | 18 +
 4 files changed, 66 insertions(+), 17 deletions(-)
 create mode 100644 fuzz-commit-graph.c

diff --git a/.gitignore b/.gitignore
index 0d77ea5894..8bcf153ed9 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,4 @@
+/fuzz-commit-graph
 /fuzz_corpora
 /fuzz-pack-headers
 /fuzz-pack-idx
diff --git a/Makefile b/Makefile
index 1a44c811aa..6b72f37c29 100644
--- a/Makefile
+++ b/Makefile
@@ -684,6 +684,7 @@ SCRIPTS = $(SCRIPT_SH_INS) \
 
 ETAGS_TARGET = TAGS
 
+FUZZ_OBJS += fuzz-commit-graph.o
 FUZZ_OBJS += fuzz-pack-headers.o
 FUZZ_OBJS += fuzz-pack-idx.o
 
diff --git a/commit-graph.c b/commit-graph.c
index 40c855f185..0755359b1a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -46,6 +46,10 @@
 #define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
+ GRAPH_FANOUT_SIZE + GRAPH_OID_LEN)
 
+struct commit_graph *parse_commit_graph(void *graph_map, int fd,
+   size_t graph_size);
+
+
 char *get_commit_graph_filename(const char *obj_dir)
 {
return xstrfmt("%s/info/commit-graph", obj_dir);
@@ -84,16 +88,10 @@ static int commit_graph_compatible(struct repository *r)
 struct commit_graph *load_commit_graph_one(const char *graph_file)
 {
void *graph_map;
-   const unsigned char *data, *chunk_lookup;
size_t graph_size;
struct stat st;
-   uint32_t i;
-   struct commit_graph *graph;
+   struct commit_graph *ret;
int fd = git_open(graph_file);
-   uint64_t last_chunk_offset;
-   uint32_t last_chunk_id;
-   uint32_t graph_signature;
-   unsigned char graph_version, hash_version;
 
if (fd < 0)
return NULL;
@@ -108,27 +106,61 @@ struct commit_graph *load_commit_graph_one(const char *graph_file)
die(_("graph file %s is too small"), graph_file);
}
graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
+   ret = parse_commit_graph(graph_map, fd, graph_size);
+
+   if (ret == NULL) {
+   munmap(graph_map, graph_size);
+   close(fd);
+   exit(1);
+   }
+
+   return ret;
+}
+
+/*
+ * This function is intended to be used only from load_commit_graph_one() or in
+ * fuzz tests.
+ */
+struct commit_graph *parse_commit_graph(void *graph_map, int fd,
+   size_t graph_size)
+{
+   const unsigned char *data, *chunk_lookup;
+   uint32_t i;
+   struct commit_graph *graph;
+   uint64_t last_chunk_offset;
+   uint32_t last_chunk_id;
+   uint32_t graph_signature;
+   unsigned char graph_version, hash_version;
+
+   /*
+* This should already be checked in load_commit_graph_one, but we still
+* need a check here for when we're calling parse_commit_graph directly
+* from fuzz tests. We can omit the error message in that case.
+*/
+   if (graph_size < GRAPH_MIN_SIZE)
+   return NULL;
+
data = (const unsigned char *)graph_map;
 
graph_signature = get_be32(data);
if (graph_signature != GRAPH_SIGNATURE) {
error(_("graph signature %X does not match signature %X"),
  graph_signature, GRAPH_SIGNATURE);
-   goto cleanup_fail;
+   return NULL;
}
 
graph_version = *(unsigned char*)(data + 4);
if (graph_version != GRAPH_VERSION) {
error(_("graph version %X does not match version %X"),
  graph_version, GRAPH_VERSION);
-   goto cleanup_fail;
+   return NULL;
}
 
hash_version = *(unsigned char*)(data + 5);
if (hash_version != GRAPH_OID_VERSION) {
error(_("hash version %X does not match version %X"),
  hash_version, GRAPH_OID_VERSION);
-   goto cleanup_fail;
+   return NULL;
}
 
graph = alloc_commit_graph();
@@ -152,7 +184,8 @@ struct commit_graph *load_commit_graph_one(const char *graph_file)
if (chunk_offset > graph_size - GIT_MAX_RAWSZ) {
error(_("improper chunk offset %08x%08x"), (uint32_t)(chunk_offset >> 32),
  (uint32_t)chunk_offset);
-   goto cleanup_fail;
+   free(graph);
+   return NULL;
}
 
switch (chunk_id) {
@@ -187,7 +220,8 @@ struct commit_graph 

[PATCH 0/2] Add commit-graph fuzzer and fix buffer overflow

2018-12-05 Thread Josh Steadmon
Add a new fuzz test for the commit graph and fix a buffer read-overflow
that it discovered.

Josh Steadmon (2):
  commit-graph, fuzz: Add fuzzer for commit-graph
  commit-graph: fix buffer read-overflow

 .gitignore  |  1 +
 Makefile|  1 +
 commit-graph.c  | 76 +
 fuzz-commit-graph.c | 18 +++
 4 files changed, 77 insertions(+), 19 deletions(-)
 create mode 100644 fuzz-commit-graph.c

-- 
2.20.0.rc2.403.gdbc3b29805-goog



Re: What's cooking in git.git (Nov 2018, #06; Wed, 21)

2018-12-05 Thread Matthew DeVore




On 11/21/2018 01:00 AM, Junio C Hamano wrote:

* md/list-lazy-objects-fix (2018-10-29) 1 commit
  - list-objects.c: don't segfault for missing cmdline objects

  "git rev-list --exclude-promissor-objects" had to take an object
  that does not exist locally (and is lazily available) from the
  command line without barfing, but the code dereferenced NULL.

  That symptom may be worth addressing, but I think the "fix" is
  overly broad and is wrong.  Giving a missing object should be
  diagnosed as an error, not swept under the rug silently.


Thank you for taking a look. I just sent v3 of the patch: 
https://public-inbox.org/git/20181205214346.106217-1-matv...@google.com/T/#u


which is different from v2 in that it honors the presence or absence of 
--ignore-missing. It will only sweep things under the rug if that flag 
is passed.


Re: [PATCH 2/3] test-lib-functions: introduce the 'test_set_port' helper function

2018-12-05 Thread Jeff King
On Wed, Dec 05, 2018 at 01:20:35PM +0100, SZEDER Gábor wrote:

> On Wed, Dec 05, 2018 at 12:17:09AM -0500, Jeff King wrote:
> > On Tue, Dec 04, 2018 at 05:34:56PM +0100, SZEDER Gábor wrote:
> > 
> > > Several test scripts run daemons like 'git-daemon' or Apache, and
> > > communicate with them through TCP sockets.  To have unique ports where
> > > these daemons are accessible, the ports are usually the number of the
> > > corresponding test scripts, unless the user overrides them via
> > > environment variables, and thus all those tests and test libs contain
> > > more or less the same bit of one-liner boilerplate code to find out
> > > the port.  The last patch in this series will make this a bit more
> > > complicated.
> > > 
> > > Factor out finding the port for a daemon into the common helper
> > > function 'test_set_port' to avoid repeating ourselves.
> > 
> > OK. Looks like this should keep the same port numbers for all of the
> > existing tests, which I think is good.
> 
> Not for all existing tests, though: t0410 and the 'git p4' tests do
> get different ports.

Yeah, I should have said "most". The important thing is that they remain
static and predictable for normal non-stress runs, which they do.

-Peff


Re: [RFC PATCH 3/3] test-lib: add the '--stress' option to run a test repeatedly under load

2018-12-05 Thread Jeff King
On Wed, Dec 05, 2018 at 03:01:06PM +0100, SZEDER Gábor wrote:

> > >   - Make '--stress' imply '--verbose-log' and discard the test's
> > > standard output and error; dumping the output of several parallel
> > > tests to the terminal would create a big ugly mess.
> > 
> > Makes sense. My script just redirects the output to a file, but it
> > predates --verbose-log.
> > 
> > My script always runs with "-x". I guess it's not that hard to add it if
> > you want, but it is annoying to finally get a failure after several
> > minutes only to realize that your are missing some important
> > information. ;)
> > 
> > Ditto for "-i", which leaves more readable output (you know the
> > interesting part is at the end of the log), and leaves a trash directory
> > state that is more likely to be useful for inspecting.
> 
> I wanted to imply only the bare minimum of options that are necessary
> to prevent the tests from stomping on each other.  Other than that I
> agree and wouldn't run '--stress' without '-x -i' myself, and at one
> point considered to recommend '-x -i' in the description of
> '--stress'.

Yeah, it probably is the right thing to make the options as orthogonal
as possible, and let the user decide what they want. I was just
wondering if I could replace my "stress" script with this. But I think
I'd still want it to remember to use a sane set of options (including
"--root" in my case).

> > > 'wait' for all parallel jobs before exiting (either because a failure
> > > was found or because the user lost patience and aborted the stress
> > > test), allowing the still running tests to finish.  Otherwise the "OK
> > > X.Y" progress output from the last iteration would likely arrive after
> > > the user got back the shell prompt, interfering with typing in the
> > > next command.  OTOH, this waiting might induce a considerable delay
> > > between hitting ctrl-C and the test actually exiting; I'm not sure
> > > this is the right tradeoff.
> > 
> > If there is a delay, it's actually quite annoying to _not_ wait; you
> > start doing something in the terminal, and then a bunch of extra test
> > statuses print at a random time.
> 
> At least the INT/TERM/etc. handler should say something like "Waiting
> for background jobs to finish", to let the user know why it doesn't
> exit right away.

That seems reasonable. There's also no point in waiting for them to
finish if we know we're aborting anyway. Could we just kill them all?

I guess it's a little tricky, because $! is going to give us the pid of
each subshell. We actually want to kill the test sub-process. That takes
a few contortions, but the output is nice (every sub-job immediately
says "ABORTED ...", and the final process does not exit until the whole
tree is done):

diff --git a/t/test-lib.sh b/t/test-lib.sh
index 9b7f687396..357dead3ff 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -98,8 +98,9 @@ done,*)
mkdir -p "$(dirname "$TEST_RESULTS_BASE")"
stressfail="$TEST_RESULTS_BASE.stress-failed"
rm -f "$stressfail"
-   trap 'echo aborted >"$stressfail"' TERM INT HUP
+   trap 'echo aborted >"$stressfail"; kill $job_pids 2>/dev/null; wait' TERM INT HUP
 
+   job_pids=
job_nr=0
while test $job_nr -lt "$job_count"
do
@@ -108,10 +109,13 @@ done,*)
GIT_TEST_STRESS_JOB_NR=$job_nr
export GIT_TEST_STRESS_STARTED GIT_TEST_STRESS_JOB_NR
 
+   trap 'kill $test_pid 2>/dev/null' TERM
+
cnt=0
while ! test -e "$stressfail"
do
-   if $TEST_SHELL_PATH "$0" "$@" >/dev/null 2>&1
+   $TEST_SHELL_PATH "$0" "$@" >/dev/null 2>&1 & test_pid=$!
+   if wait
then
printf >&2 "OK   %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
elif test -f "$stressfail" &&
@@ -124,16 +128,11 @@ done,*)
fi
cnt=$(($cnt + 1))
done
-   ) &
+   ) & job_pids="$job_pids $!"
job_nr=$(($job_nr + 1))
done
 
-   job_nr=0
-   while test $job_nr -lt "$job_count"
-   do
-   wait
-   job_nr=$(($job_nr + 1))
-   done
+   wait
 
if test -f "$stressfail" && test "$(cat "$stressfail")" != "aborted"
then

> > > + elif test -f "$stressfail" &&
> > > +  test "$(cat "$stressfail")" = "aborted"
> > > + then
> > > + printf >&2 "ABORTED %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
> > > + else
> > > + printf >&2 "FAIL %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
> > > +  

[PATCH v3] list-objects.c: don't segfault for missing cmdline objects

2018-12-05 Thread Matthew DeVore
When a command is invoked with both --exclude-promisor-objects,
--objects-edge-aggressive, and a missing object on the command line,
the rev_info.cmdline array could get a NULL pointer for the value of
an 'item' field. Prevent dereferencing of a NULL pointer in that
situation.

Properly handle --ignore-missing. If it is not passed, die when an
object is missing. Otherwise, just silently ignore it.

Signed-off-by: Matthew DeVore 
---
 revision.c   |  2 ++
 t/t0410-partial-clone.sh | 16 ++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/revision.c b/revision.c
index 13e0519c02..293303b67d 100644
--- a/revision.c
+++ b/revision.c
@@ -1729,6 +1729,8 @@ int handle_revision_arg(const char *arg_, struct rev_info *revs, int flags, unsi
if (!cant_be_filename)
verify_non_filename(revs->prefix, arg);
object = get_reference(revs, arg, &oc, flags ^ local_flags);
+   if (!object)
+   return revs->ignore_missing ? 0 : -1;
add_rev_cmdline(revs, object, arg_, REV_CMD_REV, flags ^ local_flags);
add_pending_object_with_path(revs, object, arg, oc.mode, oc.path);
free(oc.path);
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index ba3887f178..169f7f10a7 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -349,7 +349,7 @@ test_expect_success 'rev-list stops traversal at promisor commit, tree, and blob
grep $(git -C repo rev-parse bar) out  # sanity check that some walking was done
 '
 
-test_expect_success 'rev-list accepts missing and promised objects on command line' '
+test_expect_success 'rev-list dies for missing objects on cmd line' '
rm -rf repo &&
test_create_repo repo &&
test_commit -C repo foo &&
@@ -366,7 +366,19 @@ test_expect_success 'rev-list accepts missing and promised objects on command li
 
git -C repo config core.repositoryformatversion 1 &&
git -C repo config extensions.partialclone "arbitrary string" &&
-   git -C repo rev-list --exclude-promisor-objects --objects "$COMMIT" "$TREE" "$BLOB"
+
+   for OBJ in "$COMMIT" "$TREE" "$BLOB"; do
+   test_must_fail git -C repo rev-list --objects \
+   --exclude-promisor-objects "$OBJ" &&
+   test_must_fail git -C repo rev-list --objects-edge-aggressive \
+   --exclude-promisor-objects "$OBJ" &&
+
+   # Do not die or crash when --ignore-missing is passed.
+   git -C repo rev-list --ignore-missing --objects \
+   --exclude-promisor-objects "$OBJ" &&
+   git -C repo rev-list --ignore-missing --objects-edge-aggressive \
+   --exclude-promisor-objects "$OBJ"
+   done
 '
 
 test_expect_success 'gc repacks promisor objects separately from non-promisor objects' '
-- 
2.20.0.rc1.387.gf8505762e3-goog



Re: [RFC PATCH 3/3] test-lib: add the '--stress' option to run a test repeatedly under load

2018-12-05 Thread Jeff King
On Wed, Dec 05, 2018 at 11:34:54AM +0100, SZEDER Gábor wrote:

> 
> Just a quick reply to this one point for now:
> 
> On Wed, Dec 05, 2018 at 12:44:09AM -0500, Jeff King wrote:
> > On Tue, Dec 04, 2018 at 05:34:57PM +0100, SZEDER Gábor wrote:
> > > + job_nr=0
> > > + while test $job_nr -lt "$job_count"
> > > + do
> > > + wait
> > > + job_nr=$(($job_nr + 1))
> > > + done
> > 
> > Do we need to loop? Calling "wait" with no arguments should wait for all
> > children.
> 
> It should, but in dash, ksh, ksh93, Bash v4.2 or older it doesn't seem
> to do so, at least not on my machine, and I get back my shell prompt
> right after hitting ctrl-C or the first failed test, and then get the
> progress output from the remaining jobs later.

That's weird. I run my stress script under dash with a single "wait",
and it handles the failing case just fine. And switching to a single
"wait" in your script works for me.

But the ^C case is interesting. Try running your script under "sh -x"
and hitting ^C. The signal interrupts the first wait. In my script (or
yours modified to use a single wait), we then proceed immediately to the
"exit", and get output from the subshells after we've exited.

But in your script with the loop, the first wait is interrupted, and
then the _second_ wait (in the second loop iteration) picks up all of
the exiting processes. The subsequent waits are all noops that return
immediately, because there are no processes left.

So the right number of waits is either "1" or "2". Looping means we do
too many (which is mostly a harmless noop) or too few (in the off chance
that you have only a single job ;) ). So it works out in practice.

But I think a more clear way to do it is to simply wait in the signal
trap, to cover the case when our original wait was aborted.

I.e., something like this:

diff --git a/t/test-lib.sh b/t/test-lib.sh
index 9b7f687396..5e0c41626f 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -98,7 +98,7 @@ done,*)
mkdir -p "$(dirname "$TEST_RESULTS_BASE")"
stressfail="$TEST_RESULTS_BASE.stress-failed"
rm -f "$stressfail"
-   trap 'echo aborted >"$stressfail"' TERM INT HUP
+   trap 'echo aborted >"$stressfail"; wait' TERM INT HUP
 
job_nr=0
while test $job_nr -lt "$job_count"
@@ -128,12 +128,7 @@ done,*)
job_nr=$(($job_nr + 1))
done
 
-   job_nr=0
-   while test $job_nr -lt "$job_count"
-   do
-   wait
-   job_nr=$(($job_nr + 1))
-   done
+   wait
 
if test -f "$stressfail" && test "$(cat "$stressfail")" != "aborted"
then

which behaves well for me in all cases.

> Bash 4.3 or later are strange: I get back the shell prompt immediately
> after ctrl-C as well, so it doesn't appear to be waiting for all
> remaining jobs to finish either, but! I don't get any of the progress
> output from those jobs to mess up my next command.

Interesting. My bash 4.4 seems to behave the same as dash. It almost
sounds like the SIGINT is getting passed to the subshells for you.
Probably not really worth digging into, though.

In case anybody is playing with this and needed to simulate a stress
failure, here's what I used:

diff --git a/t/t-basic.sh b/t/t-basic.sh
index b6566003dd..a370cd9977 100755
--- a/t/t-basic.sh
+++ b/t/t-basic.sh
@@ -1171,4 +1171,11 @@ test_expect_success 'very long name in the index handled sanely' '
test $len = 4098
 '
 
+test_expect_success 'roll those dice' '
+   case "$(openssl rand -base64 1)" in
+   z*)
+   return 1
+   esac
+'
+
 test_done


Re: BUG: CR marker ^M doesn't show up in '-' lines of diffs when the ending of the removed line is CR+LF

2018-12-05 Thread Johannes Sixt

On 05.12.18 at 20:29, Frank Schäfer wrote:


On 02.12.18 at 22:22, Johannes Sixt wrote:

On 02.12.18 at 20:31, Frank Schäfer wrote:

In other words:
"If CR comes immediately before a LF, do the following with *all* lines:
insert <RESET> after the CR if eol=lf but do not insert <RESET> after the CR if
eol=crlf."


Why? It is the pager's duty to highlight CR, IMO. If it doesn't, but
the user wants to see them, then they are using the wrong pager or the
wrong pager settings.

AFAIU Junio's explanation, it's not the pager's fault.


Then Junio and I are in disagreement. IMO, Git does not have to be more 
clever than the pager when it comes to presentation of text.



As far as I am concerned, I do not have any of my files marked as
eol=crlf, but I do *not* want to see ^M in the pager. I.e., having git
insert <RESET> between CR and LF would do the wrong thing for me.


But doing the same thing in added lines is doing the right thing for you?


Yes, I think so. As long as I'm not telling Git that my files are CRLF 
when they actually are, then the CR before LF is a whitespace error.
Nevertheless, I do *NOT* want Git to outwit my pager by inserting
<RESET> between CR and LF all the time so that it is forced to treat the
lone CR like a control character that is to be made visible.



Or are you suggesting to fix the behavior of added lines instead?
In any case, inconsistent behavior is not what we want.


I'm suggesting that users who knowingly store CRLF files in the 
database, but do not want to see ^M in added lines have to use 
whitespace=cr-at-eol and that's all. I do not see inconsistency.


-- Hannes


Re: [PATCH] rebase docs: fix incorrect format of the section Behavioral Differences

2018-12-05 Thread Johannes Sixt

On 05.12.18 at 16:37, Elijah Newren wrote:

On Tue, Dec 4, 2018 at 11:40 PM Junio C Hamano  wrote:


Johannes Sixt  writes:


Please let me register my objection. This paragraph is not the right
place to explain what directory rename detection is and how it works
under the hood. "works fine" in the original text is the right phrase
here; if there is concern that this induces expectations that cannot
be met, throw in the word "heuristics".

Such as:
Directory rename heuristics work fine in the merge and interactive
backends. They do not work in the am backend because...


OK, good enough, I guess.  Or just s/work fine/is enabled/.


So...

Directory rename heuristics are enabled in the merge and interactive
backends. They are not in the am backend because it operates on
"fake ancestors" that involve trees containing only the files modified
in the patch.  Due to the lack of accurate tree information, directory
rename detection is disabled for the am backend.


OK with me. Also fine if you drop sentence no. 2 and keep just no. 1 and no. 3.

Thanks,
-- Hannes


Re: git, monorepos, and access control

2018-12-05 Thread Ævar Arnfjörð Bjarmason


On Wed, Dec 05 2018, Duy Nguyen wrote:

> On Wed, Dec 5, 2018 at 9:46 PM Derrick Stolee  wrote:
>> This directory-level security is not a goal for VFS for Git, and I don't
>> see it becoming a priority as it breaks a number of design decisions we
>> made in our object storage and communication models.
>>
>> The best I can think about when considering Git as an approach would be
>> to use submodules for your security-related content, and then have server-
>> side security for access to those repos. Of course, submodules are not
>> supported in VFS for Git, either.
>
> Another option is builtin per-blob encryption (maybe with just
> clean/smudge filter), then access control will be about obtaining the
> decryption key (*) and we don't break object storage and
> communication. Of course pack delta compression becomes absolutely
> useless. But that is perhaps an acceptable trade off.

Right, this is another option, but from what John described it wouldn't
work in this case. "Any hypothetical AMD monorepo should be able to
securely deny read access in certain subtrees to users without required
permissions".

I.e. in this case there will be a
secret-stuff-here/ryzen-microcode.code.encrypted or whatever,
unauthorized users can't see the content, but they can see from the
filename that it exists, and from "git log" who works on it.

It'll also balloon in size on the server side, since we can't delta any of
these objects; they'll all be X-sized encrypted binaries.

> (*) Git will not cache the key in any shape or form. Whenever it needs
> to deflate an encrypted blob, it asks for the key from a separate
> daemon. This guy handles all the access control.
>
>> The Gerrit service has _branch_ level security, which is related to the
>> reachability questions that a directory security would need. However,
>> the problem is quite different. Gerrit does have a lot of experience in
>> dealing with submodules, though, so that's probably a good place to
>> start.
>>
>> Thanks,
>> -Stolee


Re: git, monorepos, and access control

2018-12-05 Thread Jeff King
On Wed, Dec 05, 2018 at 08:13:16PM +, Coiner, John wrote:

> One obstacle to moving AMD to git/VFSForGit is the lack of access 
> control support in git. AMD has a lot of data whose distribution must be 
> limited. Sometimes it's a legal requirement, eg. CPU core designs are 
> covered by US export control laws and not all employees may see them. 
> Sometimes it's a contractual obligation, as when a third party shares 
> data with us and we agree only to share this data with certain 
> employees. Any hypothetical AMD monorepo should be able to securely deny 
> read access in certain subtrees to users without required permissions.
> 
> Has anyone looked at adding access control to git, at a per-directory 
> granularity? Is this a feature that the git community would possibly 
> welcome?

In my opinion this feature is so contrary to Git's general assumptions
that it's likely to create a ton of information leaks of the supposedly
protected data.

For instance, Git is very eager to try to find delta-compression
opportunities between objects, even if they don't have any relationship
within the tree structure. So imagine I want to know the contents of
tree X. I push up a tree Y similar to X, then fetch it back, falsely
claiming to have X but not Y. If the server generates a delta, that may
reveal information about X (which you can then iterate to send Y', and
so on, treating the server as an oracle until you've guessed the content
of X).

You could work around that by teaching the server to refuse to use "X"
in any way when the client does not have the right permissions. But:

  - that's just one example; there are probably a number of such leaks

  - you're fighting an uphill battle against the way Git is implemented;
there's no single point to enforce this kind of segmentation

The model that fits more naturally with how Git is implemented would be
to use submodules. There you leak the hash of the commit from the
private submodule, but that's probably obscure enough (and if you're
really worried, you can add a random nonce to the commit messages in the
submodule to make their hashes unguessable).
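
(For illustration, a minimal sketch of that nonce trick; the submodule
name and the trailer are made up:)

  git -C secret-module commit \
        -m "update microcode tables" \
        -m "Nonce: $(openssl rand -hex 16)"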

> I would welcome your feedback on whether this idea makes technical 
> sense, and whether the feature could ever be a fit for git.

Sorry I don't have a more positive response. What you want to do is
perfectly reasonable, but I just think it's a mismatch with how Git
works (and because of the security impact, one missed corner case
renders the whole thing useless).

-Peff


Re: git, monorepos, and access control

2018-12-05 Thread Duy Nguyen
On Wed, Dec 5, 2018 at 9:46 PM Derrick Stolee  wrote:
> This directory-level security is not a goal for VFS for Git, and I don't
> see it becoming a priority as it breaks a number of design decisions we
> made in our object storage and communication models.
>
> The best I can think about when considering Git as an approach would be
> to use submodules for your security-related content, and then have server-
> side security for access to those repos. Of course, submodules are not
> supported in VFS for Git, either.

Another option is builtin per-blob encryption (maybe with just
clean/smudge filter), then access control will be about obtaining the
decryption key (*) and we don't break object storage and
communication. Of course pack delta compression becomes absolutely
useless. But that is perhaps an acceptable trade off.

(*) Git will not cache the key in any shape or form. Whenever it needs
to deflate an encrypted blob, it asks for the key from a separate
daemon. This guy handles all the access control.
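
(For illustration, a minimal sketch of the clean/smudge wiring;
"encrypt-tool" is a hypothetical helper that obtains the key from such
a daemon:)

  # .gitattributes
  secret/** filter=encrypt

  # filter driver configuration
  git config filter.encrypt.clean "encrypt-tool encrypt"
  git config filter.encrypt.smudge "encrypt-tool decrypt"
  git config filter.encrypt.required true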

> The Gerrit service has _branch_ level security, which is related to the
> reachability questions that a directory security would need. However,
> the problem is quite different. Gerrit does have a lot of experience in
> dealing with submodules, though, so that's probably a good place to
> start.
>
> Thanks,
> -Stolee
-- 
Duy


Re: git, monorepos, and access control

2018-12-05 Thread Derrick Stolee

On 12/5/2018 3:34 PM, Ævar Arnfjörð Bjarmason wrote:

On Wed, Dec 05 2018, Coiner, John wrote:


I'm an engineer with AMD. I'm looking at whether we could switch our
internal version control to a monorepo, possibly one based on git and
VFSForGit.

Has anyone looked at adding access control to git, at a per-directory
granularity? Is this a feature that the git community would possibly
welcome?

All of what you've described is possible to implement in git, but as far
as I know there's no existing implementation of it.

Microsoft's GVFS probably comes closest, and they're actively
upstreaming bits of that, but as far as I know that doesn't in any way
try to solve this "users XYZ can't even list such-and-such a tree"
problem.

(Ævar has a lot of good ideas in his message, so I'm just going to add
on a few here.)

This directory-level security is not a goal for VFS for Git, and I don't
see it becoming a priority as it breaks a number of design decisions we
made in our object storage and communication models.

The best I can think about when considering Git as an approach would be
to use submodules for your security-related content, and then have server-
side security for access to those repos. Of course, submodules are not
supported in VFS for Git, either.

The Gerrit service has _branch_ level security, which is related to the
reachability questions that a directory security would need. However,
the problem is quite different. Gerrit does have a lot of experience in
dealing with submodules, though, so that's probably a good place to
start.

Thanks,
-Stolee


Re: git, monorepos, and access control

2018-12-05 Thread Ævar Arnfjörð Bjarmason


On Wed, Dec 05 2018, Coiner, John wrote:

> I'm an engineer with AMD. I'm looking at whether we could switch our
> internal version control to a monorepo, possibly one based on git and
> VFSForGit.
>
> One obstacle to moving AMD to git/VFSForGit is the lack of access
> control support in git. AMD has a lot of data whose distribution must be
> limited. Sometimes it's a legal requirement, eg. CPU core designs are
> covered by US export control laws and not all employees may see them.
> Sometimes it's a contractual obligation, as when a third party shares
> data with us and we agree only to share this data with certain
> employees. Any hypothetical AMD monorepo should be able to securely deny
> read access in certain subtrees to users without required permissions.
>
> Has anyone looked at adding access control to git, at a per-directory
> granularity? Is this a feature that the git community would possibly
> welcome?
>
> Here's my rough thinking about how it might work:
>   - an administrator can designate that a tree object requires zero or
> more named privileges to read
>   - when a mortal user attempts to retrieve the tree object, a hook
> allows the server to check if the user has a given privilege. The hook
> can query an arbitrary user/group database, LDAP or whatever. The
> details of this check are mostly in the hook; git only knows about
> abstract named privileges.
>   - if the user has permission, everything goes as normal.
>   - if the user lacks permission, they get a DeniedTree object which
> might carry some metadata about what permissions would be needed to see
> more. The DeniedTree lacks the real tree's entries. (TBD, how do we
> render a denied tree in the workspace? An un-writable directory
> containing only a GITDENIED file with some user friendly error message?)
>   - hashes are secret. If the hashes from a protected tree leak, the
> data also leaks. No check on the server prevents it from handing out
> contents for correctly-guessed hashes.
>   - mortal users shouldn't be able to alter permissions. Of course,
> mortal users will often modify tree objects that carry permissions. So
> the server should enforce that a user isn't pushing updates that alter
> permissions on the same logical directory.
>
> I would welcome your feedback on whether this idea makes technical
> sense, and whether the feature could ever be a fit for git.
>
> You might ask what alternatives we are looking at. At our scale, we'd
> really want a version control system that implements a virtual
> filesystem. That already limits us to ClearCase, VFSForGit, and maybe
> Vesta among public ones.  Am I missing any? We would also want one that
> permits branching enormous numbers of files without creating enormous
> amounts of data in the repo -- git gets that right, and perforce (our
> status quo) does not. That's how I got onto the idea of adding read
> authorization to git.

All of what you've described is possible to implement in git, but as far
as I know there's no existing implementation of it.

Microsoft's GVFS probably comes closest, and they're actively
upstreaming bits of that, but as far as I know that doesn't in any way
try to solve this "users XYZ can't even list such-and-such a tree"
problem.

Google has much the same problem with Android, i.e. there's some parts
of the giant checkout of multiple repos the "repo" tool manages that
aren't available to some of Google's employees, or only to some partners
etc. They solve this by splitting those at repo boundaries, and are
working on moving that to submodules. Perhaps this would work in your
case? It's not as easy to work with, but could be made to work with
existing software.

In case you haven't, read "SECURITY" in the git-fetch(1) manpage,
although if that could be fixed to work for the case you're describing.

In a hypothetical implementation that does this, you are always going to
need to have something where you hand off the SHA-1 of the
"secret-stuff-here/" top-level tree object to clients that can't access
any of the sub-trees or blobs from that directory, since they're going
to need it to construct a commit object consisting of
"stuff-they-can-access/" plus the current state of "secret-stuff-here/"
to push upstream. But you can hide the blobs & trees under
"secret-stuff-here/" and their SHA-1s.

Note also that in the general case your "is this blob secret to user
xyz" or "is this tree secret to user xyz" check will need to deal with
there happening to be some blob or tree in "secret-stuff-here/" that
just so happens to share the exact same tree/blob as something they need
under "stuff-they-can-access/". Think of a directory that only contains
a ".gitignore" file, or a BSD "LICENSE" file.


Re: [PATCH 8/9] sha1-file: use loose object cache for quick existence check

2018-12-05 Thread Jeff King
On Wed, Dec 05, 2018 at 07:41:44PM +0100, René Scharfe wrote:

> > If we start to convert those, there's a
> > little bit of a rabbit hole, but it's actually not too bad.
> 
> You don't need to crawl in just for quick_has_loose(), but eventually
> everything has to be converted.  It seems a bit much for one patch, but
> perhaps that's just my ever-decreasing attention span speaking.

Yeah, my normal process here is to dig to the bottom of the rabbit hole,
and then break it into actual patches. I just shared the middle state
here. ;)

I suspect the http bits could be split off into their own thing. The
bits in sha1-file.c I'd plan to mostly do all together, as they are
a family of related functions.

Mostly I wasn't sure how to wrap this up with the other changes. It's
obviously pretty invasive, and I don't want it to get in the way of
actual functional changes we've been discussing.

> > @@ -879,15 +879,15 @@ int git_open_cloexec(const char *name, int flags)
> >   * Note that it may point to static storage and is only valid until another
> >   * call to stat_sha1_file().
> >   */
> > -static int stat_sha1_file(struct repository *r, const unsigned char *sha1,
> > - struct stat *st, const char **path)
> > +static int stat_loose_object(struct repository *r, const struct object_id *oid,
> > +struct stat *st, const char **path)
> 
> Hmm, read_sha1_file() was renamed to read_object_file() earlier this
> year, and I'd have expected this to become stat_object_file().  Names..

read_object_file() is about reading an object from _any_ source. These
are specifically about loose objects, and I think that distinction is
important (both here and for map_loose_object, etc).

I'd actually argue that read_object_file() should just be read_object(),
but that already exists. Sigh. (I think it's fixable, but obviously
orthogonal to this topic).

> Anyway, the comment above and one a few lines below should be updated
> as well.

Thanks, fixed.

> >  static int open_sha1_file(struct repository *r,
> > - const unsigned char *sha1, const char **path)
> > + const struct object_id *oid, const char **path)
> 
> That function should lose the "sha1" in its name as well.

Yep, fixed.

> > -static void *unpack_sha1_rest(git_zstream *stream, void *buffer, unsigned long size, const unsigned char *sha1)
> > +static void *unpack_loose_rest(git_zstream *stream,
> > +  void *buffer, unsigned long size,
> > +  const struct object_id *oid)
> 
> Hmm, both old and new name here look weird to me at this point.

It makes more sense in the pairing of unpack_sha1_header() and
unpack_sha1_rest(). Maybe "body" would be better than "rest".

At any rate, it probably makes sense to rename them together (but I
didn't touch the "header" one here). Maybe the name changes should come
as a separate patch. I was mostly changing them here because I was
changing the signatures anyway, and had to touch all of the callers.

-Peff


git, monorepos, and access control

2018-12-05 Thread Coiner, John
Hi,

I'm an engineer with AMD. I'm looking at whether we could switch our 
internal version control to a monorepo, possibly one based on git and 
VFSForGit.

One obstacle to moving AMD to git/VFSForGit is the lack of access 
control support in git. AMD has a lot of data whose distribution must be 
limited. Sometimes it's a legal requirement, eg. CPU core designs are 
covered by US export control laws and not all employees may see them. 
Sometimes it's a contractual obligation, as when a third party shares 
data with us and we agree only to share this data with certain 
employees. Any hypothetical AMD monorepo should be able to securely deny 
read access in certain subtrees to users without required permissions.

Has anyone looked at adding access control to git, at a per-directory 
granularity? Is this a feature that the git community would possibly 
welcome?

Here's my rough thinking about how it might work:
  - an administrator can designate that a tree object requires zero or 
more named privileges to read
  - when a mortal user attempts to retrieve the tree object, a hook 
allows the server to check if the user has a given privilege. The hook 
can query an arbitrary user/group data base, LDAP or whatever. The 
details of this check are mostly in the hook; git only knows about 
abstract named privileges.
  - if the user has permission, everything goes as normal.
  - if the user lacks permission, they get a DeniedTree object which 
might carry some metadata about what permissions would be needed to see 
more. The DeniedTree lacks the real tree's entries. (TBD, how do we 
render a denied tree in the workspace? An un-writable directory 
containing only a GITDENIED file with some user friendly error message?)
  - hashes are secret. If the hashes from a protected tree leak, the 
data also leaks. No check on the server prevents it from handing out 
contents for correctly-guessed hashes.
  - mortal users shouldn't be able to alter permissions. Of course, 
mortal users will often modify tree objects that carry permissions. So 
the server should enforce that a user isn't pushing updates that alter 
permissions on the same logical directory.

I would welcome your feedback on whether this idea makes technical 
sense, and whether the feature could ever be a fit for git.

You might ask what alternatives we are looking at. At our scale, we'd 
really want a version control system that implements a virtual 
filesystem. That already limits us to ClearCase, VFSForGit, and maybe 
Vesta among public ones.  Am I missing any? We would also want one that 
permits branching enormous numbers of files without creating enormous 
amounts of data in the repo -- git gets that right, and perforce (our 
status quo) does not. That's how I got onto the idea of adding read 
authorization to git.

Thanks,

John





Re: [RFC PATCH 3/3] test-lib: add the '--stress' option to run a test repeatedly under load

2018-12-05 Thread Ævar Arnfjörð Bjarmason


On Wed, Dec 05 2018, SZEDER Gábor wrote:

> On Wed, Dec 05, 2018 at 03:01:41PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> >> decide to stress test in advance, since we'd either flock() the trash
>> >> dir, or just mktemp(1)-it.
>> >
>> > While 'mktemp' seems to be more portable than 'flock', it doesn't seem
>> > to be portable enough; at least it's not in POSIX.>
>>
>> We are already relying on stuff like mktemp() being reliable for
>> e.g. the split index.
>
> That's the mktemp() function from libc; I meant the 'mktemp' command
> that we would use this early in 'test-lib.sh', where PATH has not been
> set up for testing yet.

Ah, if that ever becomes an issue it's a one-line test-tool wrapper.


Re: BUG: CR marker ^M doesn't show up in '-' lines of diffs when the ending of the removed line is CR+LF

2018-12-05 Thread Frank Schäfer


On 03.12.18 at 02:15, Junio C Hamano wrote:
> Frank Schäfer  writes:
>
>> Hi Junio,
>>
>> Am 29.11.18 um 03:11 schrieb Junio C Hamano:
>> [...]
>>> This was misspoken a bit.  Let's revise it to
>>>
>>> When producing a colored output (not limited to whitespace
>>> error coloring of diff output) for contents that are not
>>> marked as eol=crlf (and other historical means), insert
>>> <RESET> before a CR that comes immediately before a LF.
>> You mean
>>  ...
>>   *after* a CR that comes immediately before a LF."
>>
>> OK, AFAICS this produces the desired output in all cases if eol=lf.
> OK, yeah, I think I meant "after", i.e. ... CR  LF, in order
> to force CR to be separated from LF.
>
>> Now what about the case eol=crlf?
> I have no strong opinions, other than that "LF in repository, CRLF
> in working tree" would make the issue go away (when it is solved for
> EOL=LF like the above, that is).
>
>> Keeping the current behavior of '-' lines is correct.
>> But shouldn't ^M now be suppressed in '+' lines for consistent behavior?
> If "LF in repository, CRLF in working tree" is done, there would not
> be ^M at the end of the line, not just for removed lines, but also
> for added lines, no?
Just to be sure that I'm not missing anything here:
What's your definition of "LF in repository, CRLF in working tree" in
terms of config parameters?

Regards,
Frank



Re: BUG: CR marker ^M doesn't show up in '-' lines of diffs when the ending of the removed line is CR+LF

2018-12-05 Thread Frank Schäfer


On 02.12.18 at 22:22, Johannes Sixt wrote:
> On 02.12.18 at 20:31, Frank Schäfer wrote:
>> In other words:
>> "If CR comes immediately before a LF, do the following with *all* lines:
>> insert <RESET> after the CR if eol=lf but do not insert <RESET> after the CR if
>> eol=crlf."
>
> Why? It is the pager's duty to highlight CR, IMO. If it doesn't, but
> the user wants to see them, then they are using the wrong pager or the
> wrong pager settings.
AFAIU Junio's explanation, it's not the pager's fault.

>
> As far as I am concerned, I do not have any of my files marked as
> eol=crlf, but I do *not* want to see ^M in the pager. I.e., having git
> insert <RESET> between CR and LF would do the wrong thing for me.
>
But doing the same thing in added lines is doing the right thing for you?
Or are you suggesting to fix the behavior of added lines instead?
In any case, inconsistent behavior is not what we want.

Regards,
Frank



Re: gitweb: local configuration not found

2018-12-05 Thread Ævar Arnfjörð Bjarmason


On Wed, Dec 05 2018, Jonathan Nieder wrote:

> Martin Mares wrote[1]:
>
>> After upgrade to Stretch, gitweb no longer finds the configuration file
>> "gitweb_config.perl" in the current directory. However, "man gitweb" still
>> mentions this as one of the possible locations of the config file (and
>> indeed a useful one when using multiple instances of gitweb).
>>
>> It was probably broken by Perl dropping "." from the default search path
>> for security reasons.
>
> Indeed, perldelta(1) tells me that in 5.24.1 (and 5.26, etc),
>
>   Core modules and tools no longer search "." for optional modules
>
> gitweb.perl contains
>
>  sub read_config_file {
>  my $filename = shift;
>  return unless defined $filename;
>  # die if there are errors parsing config file
>  if (-e $filename) {
>  do $filename;
>
> which implies an @INC search but it is silly --- as the "-e" test
> illustrates, this was never intended to search @INC.
>
> Documentation says "If you are absolutely certain that you want your
> script to load and execute a file from the current directory, then use
> a ./ prefix".  We can do that, like so:
>
> diff --git i/gitweb/Makefile w/gitweb/Makefile
> index cd194d057f..3160b6cc5d 100644
> --- i/gitweb/Makefile
> +++ w/gitweb/Makefile
> @@ -18,7 +18,7 @@ RM ?= rm -f
>  INSTALL ?= install
>
>  # default configuration for gitweb
> -GITWEB_CONFIG = gitweb_config.perl
> +GITWEB_CONFIG = ./gitweb_config.perl
>  GITWEB_CONFIG_SYSTEM = /etc/gitweb.conf
>  GITWEB_CONFIG_COMMON = /etc/gitweb-common.conf
>  GITWEB_HOME_LINK_STR = projects
>
> but that does not help if someone overrides GITWEB_CONFIG, and besides,
> it would be nicer to avoid the possibility of an @INC search altogether.
> Another alternative would be to use
>
>   local @INC = ('.');
>
> Would that be better?
>
> Advice from someone more versed than I am in perl would be very welcome
> (hence the cc to Ævar).

It seems most sensible to follow the ./FILE.pl advice per
https://metacpan.org/pod/distribution/perl/pod/perl5260delta.pod#Removal-of-the-current-directory-(%22.%22)-from-@INC

Just:

local @INC = '.';
do 'FILE.pl';

Would do the same thing, but seems like a more indirect way to do it if
all we want is ./ anyway. FWIW to be pedantically bug-compatible with
the old version (we should not do this) it's:

local @INC = (@INC, ".");
do 'FILE.pl';

I.e. before our behavior was implicitly to check whether we had a local
FILE.pl, then loop through all of @INC to see if we found it there, and
finally come back to the file we did the -e check for.
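
(A minimal sketch of the difference, runnable from the shell:)

  # explicit ./ prefix: no @INC search at all
  perl -e 'do "./gitweb_config.perl" if -e "./gitweb_config.perl";'

  # restricting @INC instead: still an @INC search, but only in "."
  perl -e 'local @INC = ("."); do "gitweb_config.perl" if -e "gitweb_config.perl";'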


Re: [PATCH 8/9] sha1-file: use loose object cache for quick existence check

2018-12-05 Thread René Scharfe
On 05.12.2018 at 09:15, Jeff King wrote:
> On Wed, Dec 05, 2018 at 01:51:36AM -0500, Jeff King wrote:
> 
>>> This
>>> function is easily converted to struct object_id, though, as its single
>>> caller can pass one on -- this makes the copy unnecessary.
>>
>> If you mean modifying sha1_loose_object_info() to take an oid, then
>> sure, I agree that is a good step forward (and that is exactly the "punt
>> until" moment I meant).
> 
> So the simple thing is to do that, and then have it pass oid->hash to
> the other functions it uses.

Yes.

> If we start to convert those, there's a
> little bit of a rabbit hole, but it's actually not too bad.

You don't need to crawl in just for quick_has_loose(), but eventually
everything has to be converted.  It seems a bit much for one patch, but
perhaps that's just my ever-decreasing attention span speaking.

Converting one function prototype or struct member at a time seems
about the right amount of change per patch to me.  That's not always
possible due to entanglement, of course.

> Most of the spill-over is into the dumb-http code. Note that it actually
> uses sha1 itself! That probably needs to be the_hash_algo (though I'm
> not even sure how we'd negotiate the algorithm across a dumb fetch). At
> any rate, I don't think this patch makes anything _worse_ in that
> respect.

Right.

> diff --git a/sha1-file.c b/sha1-file.c
> index 3ddf4c9426..0705709036 100644
> --- a/sha1-file.c
> +++ b/sha1-file.c
> @@ -333,12 +333,12 @@ int raceproof_create_file(const char *path, create_file_fn fn, void *cb)
>   return ret;
>  }
>  
> -static void fill_sha1_path(struct strbuf *buf, const unsigned char *sha1)
> +static void fill_loose_path(struct strbuf *buf, const struct object_id *oid)

The new name fits.

> @@ -879,15 +879,15 @@ int git_open_cloexec(const char *name, int flags)
>   * Note that it may point to static storage and is only valid until another
>   * call to stat_sha1_file().
>   */
> -static int stat_sha1_file(struct repository *r, const unsigned char *sha1,
> -   struct stat *st, const char **path)
> +static int stat_loose_object(struct repository *r, const struct object_id *oid,
> +  struct stat *st, const char **path)

Hmm, read_sha1_file() was renamed to read_object_file() earlier this
year, and I'd have expected this to become stat_object_file().  Names..

Anyway, the comment above and one a few lines below should be updated
as well.

>  {
>   struct object_directory *odb;
>   static struct strbuf buf = STRBUF_INIT;
>  
>   prepare_alt_odb(r);
>   for (odb = r->objects->odb; odb; odb = odb->next) {
> - *path = odb_loose_path(odb, &buf, sha1);
> + *path = odb_loose_path(odb, &buf, oid);
>   if (!lstat(*path, st))
>   return 0;
>   }
> @@ -900,7 +900,7 @@ static int stat_sha1_file(struct repository *r, const unsigned char *sha1,
>   * descriptor. See the caveats on the "path" parameter above.
>   */
>  static int open_sha1_file(struct repository *r,
> -   const unsigned char *sha1, const char **path)
> +   const struct object_id *oid, const char **path)

That function should lose the "sha1" in its name as well.

> -static void *map_sha1_file_1(struct repository *r, const char *path,
> -  const unsigned char *sha1, unsigned long *size)
> +static void *map_loose_object_1(struct repository *r, const char *path,
> +  const struct object_id *oid, unsigned long *size)

Similarly, map_object_file_1()?

> -void *map_sha1_file(struct repository *r,
> - const unsigned char *sha1, unsigned long *size)
> +void *map_loose_object(struct repository *r,
> +const struct object_id *oid,
> +unsigned long *size)

Similar.

> @@ -1045,7 +1043,9 @@ static int unpack_sha1_header_to_strbuf(git_zstream *stream, unsigned char *map,
>   return -1;
>  }
>  
> -static void *unpack_sha1_rest(git_zstream *stream, void *buffer, unsigned long size, const unsigned char *sha1)
> +static void *unpack_loose_rest(git_zstream *stream,
> +void *buffer, unsigned long size,
> +const struct object_id *oid)

Hmm, both old and new name here look weird to me at this point.

> -static int sha1_loose_object_info(struct repository *r,
> -   const unsigned char *sha1,
> -   struct object_info *oi, int flags)
> +static int loose_object_info(struct repository *r,
> +  const struct object_id *oid,
> +  struct object_info *oi, int flags)

And nothing of value was lost. :)

René


gitweb: local configuration not found

2018-12-05 Thread Jonathan Nieder
Martin Mares wrote[1]:

> After upgrade to Stretch, gitweb no longer finds the configuration file
> "gitweb_config.perl" in the current directory. However, "man gitweb" still
> mentions this as one of the possible locations of the config file (and
> indeed a useful one when using multiple instances of gitweb).
>
> It was probably broken by Perl dropping "." from the default search path
> for security reasons.

Indeed, perldelta(1) tells me that in 5.24.1 (and 5.26, etc),

  Core modules and tools no longer search "." for optional modules

gitweb.perl contains

 sub read_config_file {
 my $filename = shift;
 return unless defined $filename;
 # die if there are errors parsing config file
 if (-e $filename) {
 do $filename;

which implies an @INC search but it is silly --- as the "-e" test
illustrates, this was never intended to search @INC.

Documentation says "If you are absolutely certain that you want your
script to load and execute a file from the current directory, then use
a ./ prefix".  We can do that, like so:

diff --git i/gitweb/Makefile w/gitweb/Makefile
index cd194d057f..3160b6cc5d 100644
--- i/gitweb/Makefile
+++ w/gitweb/Makefile
@@ -18,7 +18,7 @@ RM ?= rm -f
 INSTALL ?= install
 
 # default configuration for gitweb
-GITWEB_CONFIG = gitweb_config.perl
+GITWEB_CONFIG = ./gitweb_config.perl
 GITWEB_CONFIG_SYSTEM = /etc/gitweb.conf
 GITWEB_CONFIG_COMMON = /etc/gitweb-common.conf
 GITWEB_HOME_LINK_STR = projects

but that does not help if someone overrides GITWEB_CONFIG, and besides,
it would be nicer to avoid the possibility of an @INC search altogether.
Another alternative would be to use

local @INC = ('.');

Would that be better?

Advice from someone more versed than I am in perl would be very welcome
(hence the cc to Ævar).

Thanks for reporting and hope that helps,
Jonathan

[1] https://bugs.debian.org/915632


Re: [RFC] cherry-pick notes to find out cherry-picks from the origin

2018-12-05 Thread Tejun Heo
Hello, Jeff.

On Thu, Nov 15, 2018 at 09:40:44AM -0500, Jeff King wrote:
> Sorry for the slow reply. This was on my to-look-at pile, but for
> some reason I accidentally put it in my done pile.

No worries and sorry about my late reply too.  Things were a bit
hectic.

> > * A new built-in command note-cherry-picks, which walks the specified
> >   commits and if they're marked with the cherry-pick trailer, adds the
> >   backlink to the origin commit using Cherry-picked-to tag in a
> >   cherry-picks note.
> 
> That makes sense. I think this could also be an option to cherry-pick,
> to instruct it to create the note when the cherry-pick is made.
> 
> But you may still want a command to backfill older cherry-picks, or
> those done by other people who do not care themselves about maintaining
> the notes tree.

So, I wanted to do both with the same command.  git-cherry-pick knows
which commits are new, so it can just pass those commits to
note-cherry-picks.  When backfilling the whole tree or newly pulled
commits, the appropriate command can invoke note-cherry-picks with the
new commits which should be super cheap.
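
(A sketch of the local half of that, assuming the proposed
note-cherry-picks command takes the commits to scan as arguments:)

  #!/bin/sh
  # .git/hooks/post-commit: annotate fresh cherry-picks right away
  git note-cherry-picks HEAD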

> It _feels_ like this is something that should be do-able by plugging a
> few commands together, rather than writing a new C program. I.e.,
> something like:
> 
>   git rev-list --format='%(trailers)' HEAD |
>   perl -lne '
>   /^commit ([0-9]+)/ and $commit = $1;
>   /^\(cherry picked from commit ([0-9]+)/
>   and print "$commit $1";
>   ' |
>   while read from to; do
>   # One process per note isn't very efficient. Ideally there would
>   # be an "append --stdin" mode. Double points if it understands
>   # how to avoid adding existing lines.
>   git notes append -m "Cherry-picked-to: $to" $from
>   done
> 
> which is roughly what your program is doing.  Not that I'm entirely
> opposed to doing something in C (we've been moving away from shell
> scripts anyway). But mostly I am wondering if we can leverage existing

But the above wouldn't clean up stale commits, which could happen with
e.g. abandoned releases, and would be prone to creating duplicates.
We sure can add all those to shell / perl scripts but it's difficult
for me to see the upsides of doing it that way.

> tools, and fill in their gaps in a way that lets people easily do
> similar things.

I'll respond to this together below.

> And on that note...
> 
> > * When formatting a cherry-picks note for display, nested cherry-picks
> >   are followed from each Cherry-picked-to tag and printed out with
> >   matching indentations.
> 
> That makes sense to me, but does this have to be strictly related to
> cherry-picks? I.e., in the more generic form, could we have a way of
> marking a note as "transitive" for display, and the notes-display code
> would automatically recognize and walk hashes?
> 
> That would serve your purpose, but would also allow similar things to
> easily be done in the future.

Below.

> > Combined with name-rev --stdin, it can produce outputs like the following.
> > [...]
> 
> Yeah, that looks pretty good.
> 
> > Locally, the notes can be kept up-to-date with a trivial post-commit
> > hook which invokes note-cherry-picks on the new commit; however, I'm
> > having a bit of trouble figuring out a way to keep it up-to-date when
> > multiple trees are involved.  AFAICS, there are two options.
> > 
> > 1. Ensuring that the notes are always generated on local commits and
> >whenever new commits are received through fetch/pulls.
> > 
> > 2. Ensuring that the notes are always generated on local commits and
> >transported with push/pulls.
> > 
> > 3. A hybrid approach - also generate notes on the receiving end and
> >ensure that fetch/pulls receives the notes together (ie. similar to
> >--tags option to git-fetch).
> > 
> > #1 seems simpler and more robust to me.  Unfortunately, I can't see a
> > way to implement any of the three options with the existing hooks.
> > For #1, there's no post-fetch hook.  For #2 and #3, there doesn't seem
> > to be a fool-proof way to make sure that the notes are transported
> > together.  Any suggestions would be greatly appreciated.
> 
> Yeah, I think (1) is the simplest: it becomes a purely local thing that
> you've generated these annotations. Unfortunately, no, I don't think
> there's anything like a post-fetch hook. This might be a good reason to
> have one. One can always do "git fetch && update-notes" of course, but
> having fetch feed the script the set of updated ref tips would be very
> helpful (so you know you can traverse from $old..$new looking for
> cherry-picks).

This sounds great to me.

> I only looked briefly over your implementation, but didn't see anything
> obviously wrong. I do think it would be nice to make it more generic, as
> much as possible. I think the most generic form is really:
> 
>   traverse-and-show-trailers | invert-trailers | add-notes
> 
> In theory I should be able to do the same inversion step on any 

Re: Any way to make git-log to enumerate commits?

2018-12-05 Thread Andreas Schwab
On Dec 05 2018, Elijah Newren  wrote:

> Or, just use name-rev so it works with non-linear histories too:
>
> git log | git name-rev --refs=$(git symbolic-ref HEAD) --stdin | less

That wouldn't work for a detached HEAD, though, and you need to use
--no-abbrev.

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


Re: [PATCH] rebase docs: fix incorrect format of the section Behavioral Differences

2018-12-05 Thread Elijah Newren
On Tue, Dec 4, 2018 at 11:40 PM Junio C Hamano  wrote:
>
> Johannes Sixt  writes:
>
> > Please let me register my objection. This paragraph is not the right
> > place to explain what directory rename detection is and how it works
> > under the hood. "works fine" in the original text is the right phrase
> > here; if there is concern that this induces expectations that cannot
> > be met, throw in the word "heuristics".
> >
> > Such as:
> >Directory rename heuristics work fine in the merge and interactive
> >backends. They do not work in the am backend because...
>
> OK, good enough, I guess.  Or just s/work fine/is enabled/.

So...

Directory rename heuristics are enabled in the merge and interactive
backends. They are not in the am backend because it operates on
"fake ancestors" that involve trees containing only the files modified
in the patch.  Due to the lack of accurate tree information, directory
rename detection is disabled for the am backend.

?


Re: Any way to make git-log to enumerate commits?

2018-12-05 Thread Elijah Newren
On Wed, Dec 5, 2018 at 6:56 AM Konstantin Khomoutov  wrote:
>
> On Wed, Dec 05, 2018 at 05:22:14PM +0300, Konstantin Kharlamov wrote:
>
> > It would be great if git-log had a formatting option to insert the
> > index of the current commit relative to HEAD.
> >
> > It would allow one, after quitting git-log, to immediately fire up "git
> > rebase -i HEAD~index" instead of "git rebase -i
> > go-copy-paste-this-long-number-id".
>
> This may make little sense in the general case, as the history maintained
> by Git is a graph, not a single line. Hence your prospective approach
> would only work for cases like `git log` called with the
> "--first-parent" command-line option.
>
> Still, for a simple approach you may code it right away yourself.
> Say, let's create an alias:
>
>   $ git config alias.foo '!git log "$@" --pretty=oneline --source | {
>   n=0;
>   while read sha ref rest; do
> printf "%s\t%s~%s\t%s\n" "$sha" "$ref" $n "$rest"
> n=$((n+1))
>   done
> }'
>
> Now calling `git foo --abbrev-commit` would output something like
>
> 9be8e297d	HEAD~7	Frobnicated fizzle
>
> where "7" is what you're looking for.
>
> A more robust solution may need to use the `git rev-list` command.

Or, just use name-rev so it works with non-linear histories too:

git log | git name-rev --refs=$(git symbolic-ref HEAD) --stdin | less

That'll add things like "master~7" for stuff in the first parent
history and output like "master~7^2~3" for commits on side-branches,
either of which can be fed to other git commands.
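
(For instance, either name can be fed straight back:)

  git show master~7
  git rebase -i 'master~7^2~3'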


Re: Any way to make git-log to enumerate commits?

2018-12-05 Thread Konstantin Khomoutov
On Wed, Dec 05, 2018 at 05:22:14PM +0300, Konstantin Kharlamov wrote:

> It would be great if git-log had a formatting option to insert the
> index of the current commit relative to HEAD.
> 
> It would allow one, after quitting git-log, to immediately fire up "git
> rebase -i HEAD~index" instead of "git rebase -i
> go-copy-paste-this-long-number-id".

This may make little sense in the general case, as the history maintained
by Git is a graph, not a single line. Hence your prospective approach
would only work for cases like `git log` called with the
"--first-parent" command-line option.

Still, for a simple approach you may code it right away yourself.
Say, let's create an alias:

  $ git config alias.foo '!git log "$@" --pretty=oneline --source | {
  n=0;
  while read sha ref rest; do
printf "%s\t%s~%s\t%s\n" "$sha" "$ref" $n "$rest"
n=$((n+1))
  done
}'

Now calling `git foo --abbrev-commit` would output something like

9be8e297d	HEAD~7	Frobnicated fizzle

where "7" is what you're looking for.

A more robust solution may need to use the `git rev-list` command.
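
(A minimal sketch of such a rev-list-based variant, again limited to
first-parent history:)

  n=0
  git rev-list --first-parent HEAD |
  while read sha
  do
        printf '%s\tHEAD~%d\n' "$sha" "$n"
        n=$((n+1))
  done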



Re: [RFC PATCH 3/3] test-lib: add the '--stress' option to run a test repeatedly under load

2018-12-05 Thread SZEDER Gábor
On Wed, Dec 05, 2018 at 03:01:41PM +0100, Ævar Arnfjörð Bjarmason wrote:
> >> decide to stress test in advance, since we'd either flock() the trash
> >> dir, or just mktemp(1)-it.
> >
> > While 'mktemp' seems to be more portable than 'flock', it doesn't seem
> > to be portable enough; at least it's not in POSIX.
> 
> We are already relying on stuff like mktemp() being reliable for
> e.g. the split index.

That's the mktemp() function from libc; I meant the 'mktemp' command
that we would use this early in 'test-lib.sh', where PATH has not been
set up for testing yet.



Any way to make git-log to enumerate commits?

2018-12-05 Thread Konstantin Kharlamov

It would be great if git-log had a formatting option to insert the index of the
current commit relative to HEAD.

It would allow one, after quitting git-log, to immediately fire up "git rebase -i
HEAD~index" instead of "git rebase -i go-copy-paste-this-long-number-id".


Re: [RFC PATCH 3/3] test-lib: add the '--stress' option to run a test repeatedly under load

2018-12-05 Thread Ævar Arnfjörð Bjarmason


On Wed, Dec 05 2018, SZEDER Gábor wrote:

> On Tue, Dec 04, 2018 at 07:11:08PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> It's a frequent annoyance of mine in the test suite that I'm
>> e.g. running t*.sh with some parallel "prove" in one screen, and then I
>> run tABCD*.sh manually, and get unlucky because they use the same trash
>> dir, and both tests go boom.
>>
>> You can fix that with --root, which is much of what this patch does. My
>> one-liner for doing --stress has been something like:
>>
>> perl -E 'say ++$_ while 1' | parallel --jobs=100% --halt-on-error soon,fail=1 './t-basic.sh --root=trash-{} -v'
>>
>> But it would be great if I didn't have to worry about that and could
>> just run two concurrent:
>>
>> ./t-basic.sh
>>
>> So I think we could just set some env variable where instead of having
>> the predictable trash directory we have a $TRASHDIR.$N as this patch
>> does, except we pick $N by locking some test-runs/tABCD.Nth file with
>> flock() during the run.
>
> Is 'flock' portable?  It doesn't appear so.

Portable enough, and since it's just an alias for "grab lock atomically"
it can devolve into other more basic FS functions on systems that don't
have it.
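
(A sketch of such a fallback, relying on mkdir(2) being atomic; the
lock path is made up:)

  job=0
  while ! mkdir "test-runs/t1234.$job.lock" 2>/dev/null
  do
        job=$((job + 1))
  done
  # ... run with --root="trash-t1234.$job", rmdir the lock when done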

>> Then a stress mode like this would just be:
>>
>> GIT_TEST_FLOCKED_TRASH_DIRS=1 perl -E 'say ++$_ while 1' | parallel --jobs=100% --halt-on-error soon,fail=1 './t-basic.sh'
>
> This doesn't address the issue of TCP ports for various daemons.

Once we have a test ABCD and slot offset N (for a fixed size of N) we
can pick a port from that.
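
(A sketch, assuming at most 10 slots per test; the modulo only keeps
the result inside the valid port range, so distant test numbers could
still collide:)

  # test number ABCD, stress slot N in 0..9
  port=$(( 20000 + (10 * ABCD + N) % 45000 ))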

>> And sure, we could ship a --stress option too, but it wouldn't be
>> magical in any way, just another "spawn N in a loop" implementation, but
>> you could also e.g. use GNU parallel to drive it, and without needing to
>
> GNU 'parallel' may or may not be available on the platform, that's why
> I stuck with the barebones shell-loops-in-the-background approach.

Yes, my point is not that you should drop that part of your patch for
using GNU parallel, but just to demonstrate that we can get pretty close
to this for most tests with an external tool, and that it would make
this sort of thing work for concurrent tests without needing to decide
to concurrently test in advance.

>> decide to stress test in advance, since we'd either flock() the trash
>> dir, or just mktemp(1)-it.
>
> While 'mktemp' seems to be more portable than 'flock', it doesn't seem
> to be portable enough; at least it's not in POSIX.

We are already relying on stuff like mktemp() being reliable for
e.g. the split index.

> And in general I'd prefer deterministic trash directory names.  If I
> re-run a failed test after fixing the bug, then currently the trash
> directory will be overwritten, and that's good.  With 'mktemp's junk
> in the trash directory's name it won't be overwritten, and if my bugfix
> doesn't work, then I'll start accumulating trash directories and won't
> even know which one is from the last run.

In general you wouldn't need to set GIT_TEST_FLOCKED_TRASH_DIRS=1 and
would get the same monolithic trash names as now, and for something like
the flock() version it could use a job number as a suffix like your
patch does.


Re: [RFC PATCH 3/3] test-lib: add the '--stress' option to run a test repeatedly under load

2018-12-05 Thread SZEDER Gábor
On Wed, Dec 05, 2018 at 12:44:09AM -0500, Jeff King wrote:
> On Tue, Dec 04, 2018 at 05:34:57PM +0100, SZEDER Gábor wrote:
> 
> > To prevent the several parallel invocations of the same test from
> > interfering with each other:
> > 
> >   - Include the parallel job's number in the name of the trash
> > directory and the various output files under 't/test-results/' as
> > a '.stress-' suffix.
> 
> Makes sense.
> 
> >   - Add the parallel job's number to the port number specified by the
> > user or to the test number, so even tests involving daemons
> > listening on a TCP socket can be stressed.
> 
> In my script I just use an arbitrary sequence of ports. That means two
> stress runs may stomp on each other, but stress runs tend not to
> interfere with regular runs (whereas with your scheme, a stress run of
> t5561 is going to stomp on t5562). It probably doesn't matter much
> either way, as I tend not to do too much with the machine during a
> stress run.

A custom port can always be set via environment variables, though the
user has to remember to set it (I doubt that I would remember it until
reminded by test failures...)

> >   - Make '--stress' imply '--verbose-log' and discard the test's
> > standard output and error; dumping the output of several parallel
> > tests to the terminal would create a big ugly mess.
> 
> Makes sense. My script just redirects the output to a file, but it
> predates --verbose-log.
> 
> My script always runs with "-x". I guess it's not that hard to add it if
> you want, but it is annoying to finally get a failure after several
> minutes only to realize that your are missing some important
> information. ;)
> 
> Ditto for "-i", which leaves more readable output (you know the
> interesting part is at the end of the log), and leaves a trash directory
> state that is more likely to be useful for inspecting.

I wanted to imply only the bare minimum of options that are necessary
to prevent the tests from stomping on each other.  Other than that I
agree and wouldn't run '--stress' without '-x -i' myself, and at one
point considered to recommend '-x -i' in the description of
'--stress'.

> My script also implies "--root". That's not strictly necessary, though I
> suspect it is much more likely to catch races when it's run on a ram
> disk (simply because it leaves the CPU as the main bottleneck).
> 
> > 'wait' for all parallel jobs before exiting (either because a failure
> > was found or because the user lost patience and aborted the stress
> > test), allowing the still running tests to finish.  Otherwise the "OK
> > X.Y" progress output from the last iteration would likely arrive after
> > the user got back the shell prompt, interfering with typing in the
> > next command.  OTOH, this waiting might induce a considerable delay
> > between hitting ctrl-C and the test actually exiting; I'm not sure
> > this is the right tradeoff.
> 
> If there is a delay, it's actually quite annoying to _not_ wait; you
> start doing something in the terminal, and then a bunch of extra test
> statuses print at a random time.

At least the INT/TERM/etc. handler should say something like "Waiting
for background jobs to finish", to let the user know why it doesn't
exit right away.
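
(E.g. the handler could combine the message with the wait; a sketch:)

  trap 'echo aborted >"$stressfail"
        echo "Waiting for background jobs to finish..." >&2
        wait' TERM INT HUP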

An alternative would be to exit right away, and make the background
loops a tad bit smarter to not print these progress lines when
aborting.

> > +   job_nr=0
> > +   while test $job_nr -lt "$job_count"
> > +   do
> > +   (
> > +   GIT_TEST_STRESS_STARTED=done
> > +   GIT_TEST_STRESS_JOB_NR=$job_nr
> > +   export GIT_TEST_STRESS_STARTED GIT_TEST_STRESS_JOB_NR
> > +
> > +   cnt=0
> > +   while ! test -e "$stressfail"
> > +   do
> > +   if $TEST_SHELL_PATH "$0" "$@" >/dev/null 2>&1
> > +   then
> > +   printf >&2 "OK   %2d.%d\n" 
> > $GIT_TEST_STRESS_JOB_NR $cnt
> 
> OK, this adds a counter for each job number (compared to my script).
> Seems helpful.
> 
> > +   elif test -f "$stressfail" &&
> > +test "$(cat "$stressfail")" = "aborted"
> > +   then
> > +   printf >&2 "ABORTED %2d.%d\n" 
> > $GIT_TEST_STRESS_JOB_NR $cnt
> > +   else
> > +   printf >&2 "FAIL %2d.%d\n" 
> > $GIT_TEST_STRESS_JOB_NR $cnt
> > +   echo $GIT_TEST_STRESS_JOB_NR 
> > >>"$stressfail"
> > +   fi
> 
> Hmm. What happens if we see a failure _and_ there's an abort? That's
> actually pretty plausible if you see a failure whiz by, and you want to
> abort the remaining scripts because they take a long time to run.
> 
> If the abort happens first, then the goal is to assume any errors are
> due to the abort.

Yeah, that's why I added this 

Re: [PATCH 2/3] test-lib-functions: introduce the 'test_set_port' helper function

2018-12-05 Thread SZEDER Gábor
On Wed, Dec 05, 2018 at 12:17:09AM -0500, Jeff King wrote:
> On Tue, Dec 04, 2018 at 05:34:56PM +0100, SZEDER Gábor wrote:
> 
> > Several test scripts run daemons like 'git-daemon' or Apache, and
> > communicate with them through TCP sockets.  To have unique ports where
> > these daemons are accessible, the ports are usually the number of the
> > corresponding test scripts, unless the user overrides them via
> > environment variables, and thus all those tests and test libs contain
> > more or less the same bit of one-liner boilerplate code to find out
> > the port.  The last patch in this series will make this a bit more
> > complicated.
> > 
> > Factor out finding the port for a daemon into the common helper
> > function 'test_set_port' to avoid repeating ourselves.
> 
> OK. Looks like this should keep the same port numbers for all of the
> existing tests, which I think is good.

Not for all existing tests, though: t0410 and the 'git p4' tests do
get different ports.


Hrm, speaking of affecting 'git p4' tests...  I should have put Luke
on Cc, so doing it now.



Re: [RFC PATCH 3/3] test-lib: add the '--stress' option to run a test repeatedly under load

2018-12-05 Thread SZEDER Gábor
On Tue, Dec 04, 2018 at 07:11:08PM +0100, Ævar Arnfjörð Bjarmason wrote:
> It's a frequent annoyance of mine in the test suite that I'm
> e.g. running t*.sh with some parallel "prove" in one screen, and then I
> run tABCD*.sh manually, and get unlucky because they use the same trash
> dir, and both tests go boom.
> 
> You can fix that with --root, which is much of what this patch does. My
> one-liner for doing --stress has been something like:
> 
> perl -E 'say ++$_ while 1' | parallel --jobs=100% --halt-on-error soon,fail=1 './t-basic.sh --root=trash-{} -v'
> 
> But it would be great if I didn't have to worry about that and could
> just run two concurrent:
> 
> ./t-basic.sh
> 
> So I think we could just set some env variable where instead of having
> the predictable trash directory we have a $TRASHDIR.$N as this patch
> does, except we pick $N by locking some test-runs/tABCD.Nth file with
> flock() during the run.

Is 'flock' portable?  It doesn't appear so.
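
(For concreteness, the proposal would be something like the sketch
below, and that's exactly where the portability question comes in:
'flock' here is the util-linux utility, not anything POSIX, and the
lock-file layout is made up for the example.)

  n=0
  while :
  do
	# hold fd 9 open for the whole test run; the lock goes away
	# when the shell exits and the fd is closed
	exec 9>"test-runs/$this_test.$n.lock"
	flock --nonblock 9 && break
	n=$(($n + 1))
  done
  TRASH_DIRECTORY="$TRASH_DIRECTORY.$n"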

> Then a stress mode like this would just be:
> 
> GIT_TEST_FLOCKED_TRASH_DIRS=1 perl -E 'say ++$_ while 1' | parallel --jobs=100% --halt-on-error soon,fail=1 './t-basic.sh'

This doesn't address the issue of TCP ports for various daemons.

> And sure, we could ship a --stress option too, but it wouldn't be
> magical in any way, just another "spawn N in a loop" implementation, but
> you could also e.g. use GNU parallel to drive it, and without needing to

GNU 'parallel' may or may not be available on the platform; that's why
I stuck with the barebones shell-loops-in-the-background approach.

> decide to stress test in advance, since we'd either flock() the trash
> dir, or just mktemp(1)-it.

While 'mktemp' seems to be more portable than 'flock', it doesn't seem
to be portable enough; at least it's not in POSIX.

And in general I'd prefer deterministic trash directory names.  If I
re-run a failed test after fixing the bug, then currently the trash
directory will be overwritten, and that's good.  With 'mktemp's junk
in the trash directory's name it won't be overwritten, and if my bugfix
doesn't work, then I'll start accumulating trash directories and won't
even know which one is from the last run.



Re: [RFC PATCH 3/3] test-lib: add the '--stress' option to run a test repeatedly under load

2018-12-05 Thread SZEDER Gábor


Just a quick reply to this one point for now:

On Wed, Dec 05, 2018 at 12:44:09AM -0500, Jeff King wrote:
> On Tue, Dec 04, 2018 at 05:34:57PM +0100, SZEDER Gábor wrote:
> > +   job_nr=0
> > +   while test $job_nr -lt "$job_count"
> > +   do
> > +   wait
> > +   job_nr=$(($job_nr + 1))
> > +   done
> 
> Do we need to loop? Calling "wait" with no arguments should wait for all
> children.

It should, but in dash, ksh, ksh93, Bash v4.2 or older it doesn't seem
to do so, at least not on my machine, and I get back my shell prompt
right after hitting ctrl-C or the first failed test, and then get the
progress output from the remaining jobs later.

Bash 4.3 and later are strange: I get back the shell prompt immediately
after ctrl-C as well, so it doesn't appear to be waiting for all
remaining jobs to finish either, but! I don't get any of the progress
output from those jobs to mess up my next command.

And mksh and zsh can't run our tests, and I don't have any more shells
at hand to try.
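
For anyone who wants to check their own shell: a toy script along
these lines (not from the patch) shows the difference; hit ctrl-C
during the first 'wait':

  #!/bin/sh
  trap 'echo "caught SIGINT"' INT
  for i in 1 2 3
  do
	# background jobs start with SIGINT ignored, so they survive
	# the ctrl-C and keep running
	(sleep 5 && echo "job $i finished") &
  done
  wait
  echo "first wait returned"
  wait
  echo "second wait returned"

In the shells that give back the prompt early, the first 'wait'
returns as soon as the trap runs, and it's the second 'wait' that
actually picks up the leftover jobs -- which is what the loop above
relies on.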



Re: [PATCH 8/9] sha1-file: use loose object cache for quick existence check

2018-12-05 Thread Jeff King
On Wed, Dec 05, 2018 at 01:51:36AM -0500, Jeff King wrote:

> > This
> > function is easily converted to struct object_id, though, as its single
> > caller can pass one on -- this makes the copy unnecessary.
> 
> If you mean modifying sha1_loose_object_info() to take an oid, then
> sure, I agree that is a good step forward (and that is exactly the "punt
> until" moment I meant).

So the simple thing is to do that, and then have it pass oid->hash to
the other functions it uses. If we start to convert those, there's a
little bit of a rabbit hole, but it's actually not too bad.

Most of the spill-over is into the dumb-http code. Note that it actually
uses sha1 itself! That probably needs to be the_hash_algo (though I'm
not even sure how we'd negotiate the algorithm across a dumb fetch). At
any rate, I don't think this patch makes anything _worse_ in that
respect.

diff --git a/http-push.c b/http-push.c
index cd48590912..0141b0ad53 100644
--- a/http-push.c
+++ b/http-push.c
@@ -255,7 +255,7 @@ static void start_fetch_loose(struct transfer_request *request)
 	struct active_request_slot *slot;
 	struct http_object_request *obj_req;
 
-	obj_req = new_http_object_request(repo->url, request->obj->oid.hash);
+	obj_req = new_http_object_request(repo->url, &request->obj->oid);
 	if (obj_req == NULL) {
 		request->state = ABORTED;
 		return;
diff --git a/http-walker.c b/http-walker.c
index 0a392c85b6..29b59e2fe0 100644
--- a/http-walker.c
+++ b/http-walker.c
@@ -58,7 +58,7 @@ static void start_object_request(struct walker *walker,
 	struct active_request_slot *slot;
 	struct http_object_request *req;
 
-	req = new_http_object_request(obj_req->repo->base, obj_req->oid.hash);
+	req = new_http_object_request(obj_req->repo->base, &obj_req->oid);
 	if (req == NULL) {
 		obj_req->state = ABORTED;
 		return;
@@ -543,11 +543,11 @@ static int fetch_object(struct walker *walker, unsigned char *sha1)
 	} else if (req->zret != Z_STREAM_END) {
 		walker->corrupt_object_found++;
 		ret = error("File %s (%s) corrupt", hex, req->url);
-	} else if (!hasheq(obj_req->oid.hash, req->real_sha1)) {
+	} else if (!oideq(&obj_req->oid, &req->real_oid)) {
 		ret = error("File %s has bad hash", hex);
 	} else if (req->rename < 0) {
 		struct strbuf buf = STRBUF_INIT;
-		loose_object_path(the_repository, &buf, req->sha1);
+		loose_object_path(the_repository, &buf, &req->oid);
 		ret = error("unable to write sha1 filename %s", buf.buf);
 		strbuf_release(&buf);
 	}
diff --git a/http.c b/http.c
index 7cfa7a16e0..e95b5b9be0 100644
--- a/http.c
+++ b/http.c
@@ -2298,9 +2298,9 @@ static size_t fwrite_sha1_file(char *ptr, size_t eltsize, size_t nmemb,
 }
 
 struct http_object_request *new_http_object_request(const char *base_url,
-	unsigned char *sha1)
+	const struct object_id *oid)
 {
-	char *hex = sha1_to_hex(sha1);
+	char *hex = oid_to_hex(oid);
 	struct strbuf filename = STRBUF_INIT;
 	struct strbuf prevfile = STRBUF_INIT;
 	int prevlocal;
@@ -2311,10 +2311,10 @@ struct http_object_request *new_http_object_request(const char *base_url,
 
 	freq = xcalloc(1, sizeof(*freq));
 	strbuf_init(&freq->tmpfile, 0);
-	hashcpy(freq->sha1, sha1);
+	oidcpy(&freq->oid, oid);
 	freq->localfile = -1;
 
-	loose_object_path(the_repository, &filename, sha1);
+	loose_object_path(the_repository, &filename, oid);
 	strbuf_addf(&freq->tmpfile, "%s.temp", filename.buf);
 
 	strbuf_addf(&prevfile, "%s.prev", filename.buf);
@@ -2456,16 +2456,16 @@ int finish_http_object_request(struct http_object_request *freq)
 	}
 
 	git_inflate_end(&freq->stream);
-	git_SHA1_Final(freq->real_sha1, &freq->c);
+	git_SHA1_Final(freq->real_oid.hash, &freq->c);
 	if (freq->zret != Z_STREAM_END) {
 		unlink_or_warn(freq->tmpfile.buf);
 		return -1;
 	}
-	if (!hasheq(freq->sha1, freq->real_sha1)) {
+	if (!oideq(&freq->oid, &freq->real_oid)) {
 		unlink_or_warn(freq->tmpfile.buf);
 		return -1;
 	}
-	loose_object_path(the_repository, &filename, freq->sha1);
+	loose_object_path(the_repository, &filename, &freq->oid);
 	freq->rename = finalize_object_file(freq->tmpfile.buf, filename.buf);
 	strbuf_release(&filename);
 
diff --git a/http.h b/http.h
index d305ca1dc7..66c52b2e1e 100644
--- a/http.h
+++ b/http.h
@@ -224,8 +224,8 @@ struct http_object_request {
 	CURLcode curl_result;
 	char errorstr[CURL_ERROR_SIZE];
 	long http_code;
-	unsigned char sha1[20];
-	unsigned char real_sha1[20];
+	struct object_id oid;
+	struct object_id real_oid;
 	git_SHA_CTX c;
 	git_zstream stream;
 	int zret;
@@ -234,7 +234,7 @@ struct http_object_request {
 };
 
 extern struct