Re: [PATCH v10 13/14] convert: add filter..process option

2016-10-17 Thread Junio C Hamano
Torsten Bögershausen  writes:

>> +test_cmp_count () {
>> +expect=$1 actual=$2
>
> That could be 
> expect="$1"
> actual="$2"

Yes, but it does not have to ;-).


Re: [PATCH v10 13/14] convert: add filter..process option

2016-10-16 Thread Lars Schneider

> On 11 Oct 2016, at 03:09, Torsten Bögershausen  wrote:
> 
> On Tue, Oct 11, 2016 at 10:11:22AM +0200, Lars Schneider wrote:
>> 
>>> On 10 Oct 2016, at 21:58, Junio C Hamano  wrote:
>>> 
>>> larsxschnei...@gmail.com writes:
>>> 
>>> [...]
 
>> -test_cmp_count_except_clean () {
>> -for FILE in $@
> 
>> +test_cmp_count () {
>> +expect=$1 actual=$2
> 
> That could be 
> expect="$1"
> actual="$2"

Sure!


>> +for FILE in "$expect" "$actual"
>>  do
> 
>> +sort "$FILE" | uniq -c | sed "s/^[ ]*//" |
>> +sed "s/^\([0-9]\) IN: clean/x IN: clean/" |
>> +sed "s/^\([0-9]\) IN: smudge/x IN: smudge/" 
>> >"$FILE.tmp" &&
>> +cat "$FILE.tmp" >"$FILE"
> 
> How about 
>   cp "$FILE.tmp" "$FILE"

OK, I'll use "mv".

Thanks,
Lars

Re: [PATCH v10 13/14] convert: add filter..process option

2016-10-15 Thread Jakub Narębski
W dniu 15.10.2016 o 16:45, Lars Schneider pisze:
>> On 12 Oct 2016, at 03:54, Jakub Narębski  wrote:
>> W dniu 12.10.2016 o 00:26, Lars Schneider pisze: 
 On 09 Oct 2016, at 01:06, Jakub Narębski  wrote:
>
>>>
> After the filter started
> Git sends a welcome message ("git-filter-client"), a list of
> supported protocol version numbers, and a flush packet. Git expects
> +to read a welcome response message ("git-filter-server") and exactly
> +one protocol version number from the previously sent list. All further
> +communication will be based on the selected version. The remaining
> +protocol description below documents "version=2". Please note that
> +"version=42" in the example below does not exist and is only there
> +to illustrate how the protocol would look like with more than one
> +version.
> +
> +After the version negotiation Git sends a list of all capabilities that
> +it supports and a flush packet. Git expects to read a list of desired
> +capabilities, which must be a subset of the supported capabilities list,
> +and a flush packet as response:
> +
> +packet:  git> git-filter-client
> +packet:  git> version=2
> +packet:  git> version=42
> +packet:  git> 
> +packet:  git< git-filter-server
> +packet:  git< version=2
> +packet:  git> clean=true
> +packet:  git> smudge=true
> +packet:  git> not-yet-invented=true
> +packet:  git> 
> +packet:  git< clean=true
> +packet:  git< smudge=true
> +packet:  git< 

 WARNING: This example is different from description!!!
>>>
>>> Can you try to explain the difference more clearly? I read it multiple
>>> times and I think this is sound.
>>
>> I'm sorry it was *my mistake*.  I have read the example exchange wrong.
>>
>> On the other hand that means that I have other comment, which I though
>> was addressed already in v10, namely that not all exchanges ends with
>> flush packet (inconsistency, and I think a bit of lack of extendability).
> 
> Well, this part of the protocol is not supposed to be extensible because
> it is supposed to deal *only* with the version number. It needs to keep 
> the same structure to ensure forward and backward compatibility.
> 
> However, for consistency sake I will add a flush packet.

Thanks.  That is one thing I feel quite strongly about.

I can agree that extendability does not matter much here: we can always
change the version number.  But there might be some additional information
that filter process wants to send to Git in first exchange, and using
flush-terminated list here means that we don't need to change version
number, assuming that this additional information is advisory.

The consistency means in my opinion that it should be easier to implement
filter scripts.

 In description above the example you have 4-part handshake, not 3-part;
 the filter is described to send list of supported capabilities last
 (a subset of what Git command supports).
>>>
>>> Part 1: Git sends a welcome message...
>>> Part 2: Git expects to read a welcome response message...
>>> Part 3: After the version negotiation Git sends a list of all 
>>> capabilities...
>>> Part 4: Git expects to read a list of desired capabilities...
>>>
>>> I think example and text match, no?
>>
>> Yes, it does; as I have said already, I have misread the example. 
>>
>> Anyway, in some cases 4-way handshake, where Git sends list of
>> supported capabilities first, is better.  If the protocol has
>> to prepare something for each of capabilities, and perhaps check
>> those preparation status, it can do it after Git sends what it
>> could need, and before it sends what it does support.
>>
>> Though it looks a bit strange that client (as Git is client here)
>> sends its capabilities first...
> 
> Git tells the filter what it can do. Then the filter decides what
> features it supports. I would prefer to keep it that way as I don't
> see a strong advantage for the other way around.

I think the current order is good, no need to change it.
As I said it is better for Git to send capabilities first.
 

 By the way, now I look at it, the argument for using the
 "=true" format instead of "capability="
 (or "supported-command=") is weak.  The argument for
 using "=" to make it easier to implement parsing
 is sound, but the argument for "=true" is weak.

 The argument was that with "=true" one can simply
 parse metadata into hash / dictionary / hashmap, and choose
 response based on that.  Hash / hashmap / associative array
 needs different keys, so the reasoning went for "=true"
 over "capability="... but the filter process still
 needs to handle lines with repeating keys, namely "version="
 lines!

 So the 

Re: [PATCH v10 13/14] convert: add filter..process option

2016-10-15 Thread Jeff King
On Sat, Oct 15, 2016 at 07:45:48AM -0700, Lars Schneider wrote:

> >> I have to agree that "capability=" might read a
> >> little bit nicer. However, Peff suggested "=true" 
> >> as his preference and this is absolutely OK with me.
> > 
> > From what I remember it was Peff stating that he thinks "=true"
> > is easier for parsing (it is, but we still need to support the harder
> > way parsing anyway), and offered that "" is good enough (if less
> > consistent).

I don't mind that much if you want to do it the other way. You are the
one writing the parsing/use code.

> > Also, with "capability=" we can be more self descriptive,
> > for example "supported-command="; though "capability" is good
> > enough for me.
> > 
> > For example
> > 
> > packet:  git> wants=clean
> > packet:  git> wants=smudge
> > packet:  git> wants=size
> > packet:  git> 
> > packet:  git< supports=clean
> > packet:  git< supports=smudge
> > packet:  git< 
> > 
> > Though coming up with good names is hard; and as I said "capability"
> > is good enough; OTOH with "smudge=true" etc. we don't need to come
> > up with good name at all... though I wonder if it is a good thing `\_o,_/
> 
> How about this (I borrowed these terms from contract negotiation)?
> 
> packet:  git> offers=clean
> packet:  git> offers=smudge
> packet:  git> offers=size
> packet:  git> 
> packet:  git< accepts=clean
> packet:  git< accepts=smudge
> packet:  git< 
> 
> @Peff: Would that be OK for you?

Is it always an offers/accepts relationship? Can the response say "you
did not ask about , but just so you know I support it"?

I cannot think offhand of an example, but at the same time, if you leave
the terms as generic as possible, you do not end up later with words
that do not make sense. It is trading off one problem now (vagueness of
the protocol terms) for a potential one later (words that have a
specific meaning, but one that is not accurate).

I don't have a strong preference, though.

-Peff


Re: [PATCH v10 13/14] convert: add filter..process option

2016-10-15 Thread Lars Schneider
@Peff: If you have time, it would be great if you could comment on
one question below prefixed with "@Peff". Thanks!


> On 12 Oct 2016, at 03:54, Jakub Narębski  wrote:
> 
> W dniu 12.10.2016 o 00:26, Lars Schneider pisze: 
>>> On 09 Oct 2016, at 01:06, Jakub Narębski  wrote:
 
>> 
 After the filter started
 Git sends a welcome message ("git-filter-client"), a list of
 supported protocol version numbers, and a flush packet. Git expects
 +to read a welcome response message ("git-filter-server") and exactly
 +one protocol version number from the previously sent list. All further
 +communication will be based on the selected version. The remaining
 +protocol description below documents "version=2". Please note that
 +"version=42" in the example below does not exist and is only there
 +to illustrate how the protocol would look like with more than one
 +version.
 +
 +After the version negotiation Git sends a list of all capabilities that
 +it supports and a flush packet. Git expects to read a list of desired
 +capabilities, which must be a subset of the supported capabilities list,
 +and a flush packet as response:
 +
 +packet:  git> git-filter-client
 +packet:  git> version=2
 +packet:  git> version=42
 +packet:  git> 
 +packet:  git< git-filter-server
 +packet:  git< version=2
 +packet:  git> clean=true
 +packet:  git> smudge=true
 +packet:  git> not-yet-invented=true
 +packet:  git> 
 +packet:  git< clean=true
 +packet:  git< smudge=true
 +packet:  git< 
>>> 
>>> WARNING: This example is different from description!!!
>> 
>> Can you try to explain the difference more clearly? I read it multiple
>> times and I think this is sound.
> 
> I'm sorry it was *my mistake*.  I have read the example exchange wrong.
> 
> On the other hand that means that I have other comment, which I though
> was addressed already in v10, namely that not all exchanges ends with
> flush packet (inconsistency, and I think a bit of lack of extendability).

Well, this part of the protocol is not supposed to be extensible because
it is supposed to deal *only* with the version number. It needs to keep 
the same structure to ensure forward and backward compatibility.

However, for consistency sake I will add a flush packet.


>>> In description above the example you have 4-part handshake, not 3-part;
>>> the filter is described to send list of supported capabilities last
>>> (a subset of what Git command supports).
>> 
>> Part 1: Git sends a welcome message...
>> Part 2: Git expects to read a welcome response message...
>> Part 3: After the version negotiation Git sends a list of all capabilities...
>> Part 4: Git expects to read a list of desired capabilities...
>> 
>> I think example and text match, no?
> 
> Yes, it does; as I have said already, I have misread the example. 
> 
> Anyway, in some cases 4-way handshake, where Git sends list of
> supported capabilities first, is better.  If the protocol has
> to prepare something for each of capabilities, and perhaps check
> those preparation status, it can do it after Git sends what it
> could need, and before it sends what it does support.
> 
> Though it looks a bit strange that client (as Git is client here)
> sends its capabilities first...

Git tells the filter what it can do. Then the filter decides what
features it supports. I would prefer to keep it that way as I don't
see a strong advantage for the other way around.


>>> By the way, now I look at it, the argument for using the
>>> "=true" format instead of "capability="
>>> (or "supported-command=") is weak.  The argument for
>>> using "=" to make it easier to implement parsing
>>> is sound, but the argument for "=true" is weak.
>>> 
>>> The argument was that with "=true" one can simply
>>> parse metadata into hash / dictionary / hashmap, and choose
>>> response based on that.  Hash / hashmap / associative array
>>> needs different keys, so the reasoning went for "=true"
>>> over "capability="... but the filter process still
>>> needs to handle lines with repeating keys, namely "version="
>>> lines!
>>> 
>>> So the argument doesn't hold water IMVHO, and we can choose
>>> version which reads better / is more natural.
>> 
>> I have to agree that "capability=" might read a
>> little bit nicer. However, Peff suggested "=true" 
>> as his preference and this is absolutely OK with me.
> 
> From what I remember it was Peff stating that he thinks "=true"
> is easier for parsing (it is, but we still need to support the harder
> way parsing anyway), and offered that "" is good enough (if less
> consistent).
> 
>> I am happy to change that if a second reviewer shares your
>> opinion.
> 
> Also, with "capability=" we can be more self descriptive,
> for 

Re: [PATCH v10 13/14] convert: add filter..process option

2016-10-12 Thread Jakub Narębski
W dniu 12.10.2016 o 00:26, Lars Schneider pisze: 
>> On 09 Oct 2016, at 01:06, Jakub Narębski  wrote:
>>
>> Part 1 of review, starting with the protocol v2 itself.
>>
>> W dniu 08.10.2016 o 13:25, larsxschnei...@gmail.com pisze:
>>> From: Lars Schneider 
>>>
>>> +upon checkin. By default these commands process only a single
>>> +blob and terminate.  If a long running `process` filter is used
>>> +in place of `clean` and/or `smudge` filters, then Git can process
>>> +all blobs with a single filter command invocation for the entire
>>> +life of a single Git command, for example `git add --all`.  See
>>> +section below for the description of the protocol used to
>>> +communicate with a `process` filter.
>>
>> I don't remember how this part looked like in previous versions
>> of this patch series, but "... is used in place of `clean` ..."
>> does not tell explicitly about the precedence of those 
>> configuration variables.  I think it should be stated explicitly
>> that `process` takes precedence over any `clean` and/or `smudge`
>> settings for the same `filter.` (regardless of whether
>> the long running `process` filter support "clean" and/or "smudge"
>> operations or not).
> 
> This is stated explicitly later on. I moved it up here:
> 
> "If a long running `process` filter is used
> in place of `clean` and/or `smudge` filters, then Git can process
> all blobs with a single filter command invocation for the entire
> life of a single Git command, for example `git add --all`. If a 
> long running `process` filter is configured then it always takes 
> precedence over a configured single blob filter. "
> 
> OK?

Looks good to me.

I think this information about precedence between one-shot `clean`
and `smudge` filter driver configuration, and multi-file `process`
filter driver should be here for two reasons.

First, if one is interested in running filter, but do not want to
write one (he or she uses existing tool, for example one of
existing LFS solutions), one can skip the "Long Running Filter
Process" section.  But one still needs to know if to remove or
comment out old `clean` and `smudge` config, or how to provide
fallback for older Git (if one uses the same configuration with
pre-process Git and Git including support for this feature).

Second, the configuration belongs, in my opinion, here.  It is
not a part of long running filter protocol.

>>> +If the filter command (a string value) is defined via
>>> +`filter..process` then Git can process all blobs with a
>>> +single filter invocation for the entire life of a single Git
>>> +command. This is achieved by using a packet format (pkt-line,
>>> +see technical/protocol-common.txt) based protocol over standard
>>> +input and standard output as follows. All packets, except for the
>>> +"*CONTENT" packets and the "" flush packet, are considered
>>> +text and therefore are terminated by a LF.
>>
>> Maybe s/standard input and output/\& of filter process,/ (that is,
>> add "... of filter process," to the third sentence in the above
>> paragraph).
> 
> You mean "This is achieved by using a packet format (pkt-line,
> see technical/protocol-common.txt) based protocol over standard
> input and standard output of filter process as follows." ?

Yes.

> I think I like the original version better.

Well, I think it is better to err out on the side of being more
explicit.

> 
>>> After the filter started
>>> Git sends a welcome message ("git-filter-client"), a list of
>>> supported protocol version numbers, and a flush packet. Git expects
>>> +to read a welcome response message ("git-filter-server") and exactly
>>> +one protocol version number from the previously sent list. All further
>>> +communication will be based on the selected version. The remaining
>>> +protocol description below documents "version=2". Please note that
>>> +"version=42" in the example below does not exist and is only there
>>> +to illustrate how the protocol would look like with more than one
>>> +version.
>>> +
>>> +After the version negotiation Git sends a list of all capabilities that
>>> +it supports and a flush packet. Git expects to read a list of desired
>>> +capabilities, which must be a subset of the supported capabilities list,
>>> +and a flush packet as response:
>>> +
>>> +packet:  git> git-filter-client
>>> +packet:  git> version=2
>>> +packet:  git> version=42
>>> +packet:  git> 
>>> +packet:  git< git-filter-server
>>> +packet:  git< version=2
>>> +packet:  git> clean=true
>>> +packet:  git> smudge=true
>>> +packet:  git> not-yet-invented=true
>>> +packet:  git> 
>>> +packet:  git< clean=true
>>> +packet:  git< smudge=true
>>> +packet:  git< 
>>
>> WARNING: This example is different from description!!!
> 
> Can you try to explain the difference more clearly? I read it multiple
> times and I think 

Re: [PATCH v10 13/14] convert: add filter..process option

2016-10-11 Thread Lars Schneider

> On 09 Oct 2016, at 01:06, Jakub Narębski  wrote:
> 
> Part 1 of review, starting with the protocol v2 itself.
> 
> W dniu 08.10.2016 o 13:25, larsxschnei...@gmail.com pisze:
>> From: Lars Schneider 
>> 
>> +upon checkin. By default these commands process only a single
>> +blob and terminate.  If a long running `process` filter is used
>> +in place of `clean` and/or `smudge` filters, then Git can process
>> +all blobs with a single filter command invocation for the entire
>> +life of a single Git command, for example `git add --all`.  See
>> +section below for the description of the protocol used to
>> +communicate with a `process` filter.
> 
> I don't remember how this part looked like in previous versions
> of this patch series, but "... is used in place of `clean` ..."
> does not tell explicitly about the precedence of those 
> configuration variables.  I think it should be stated explicitly
> that `process` takes precedence over any `clean` and/or `smudge`
> settings for the same `filter.` (regardless of whether
> the long running `process` filter support "clean" and/or "smudge"
> operations or not).

This is stated explicitly later on. I moved it up here:

"If a long running `process` filter is used
in place of `clean` and/or `smudge` filters, then Git can process
all blobs with a single filter command invocation for the entire
life of a single Git command, for example `git add --all`. If a 
long running `process` filter is configured then it always takes 
precedence over a configured single blob filter. "

OK?


>> +If the filter command (a string value) is defined via
>> +`filter..process` then Git can process all blobs with a
>> +single filter invocation for the entire life of a single Git
>> +command. This is achieved by using a packet format (pkt-line,
>> +see technical/protocol-common.txt) based protocol over standard
>> +input and standard output as follows. All packets, except for the
>> +"*CONTENT" packets and the "" flush packet, are considered
>> +text and therefore are terminated by a LF.
> 
> Maybe s/standard input and output/\& of filter process,/ (that is,
> add "... of filter process," to the third sentence in the above
> paragraph).

You mean "This is achieved by using a packet format (pkt-line,
see technical/protocol-common.txt) based protocol over standard
input and standard output of filter process as follows." ?

I think I like the original version better.


>> After the filter started
> Git sends a welcome message ("git-filter-client"), a list of
>> supported protocol version numbers, and a flush packet. Git expects
>> +to read a welcome response message ("git-filter-server") and exactly
>> +one protocol version number from the previously sent list. All further
>> +communication will be based on the selected version. The remaining
>> +protocol description below documents "version=2". Please note that
>> +"version=42" in the example below does not exist and is only there
>> +to illustrate how the protocol would look like with more than one
>> +version.
>> +
>> +After the version negotiation Git sends a list of all capabilities that
>> +it supports and a flush packet. Git expects to read a list of desired
>> +capabilities, which must be a subset of the supported capabilities list,
>> +and a flush packet as response:
>> +
>> +packet:  git> git-filter-client
>> +packet:  git> version=2
>> +packet:  git> version=42
>> +packet:  git> 
>> +packet:  git< git-filter-server
>> +packet:  git< version=2
>> +packet:  git> clean=true
>> +packet:  git> smudge=true
>> +packet:  git> not-yet-invented=true
>> +packet:  git> 
>> +packet:  git< clean=true
>> +packet:  git< smudge=true
>> +packet:  git< 
> 
> WARNING: This example is different from description!!!

Can you try to explain the difference more clearly? I read it multiple
times and I think this is sound.


> In example you have Git sending "git-filter-client" and list of supported
> protocol versions, terminated with flush packet,

Correct.


> then filter driver
> process sends "git-filter-server", exactly one version, *AND* list of
> supported capabilities in "=true" format, terminated with
> flush packet.

Correct. That's what I read in the text and in the example.

> 
> In description above the example you have 4-part handshake, not 3-part;
> the filter is described to send list of supported capabilities last
> (a subset of what Git command supports).

Part 1: Git sends a welcome message...
Part 2: Git expects to read a welcome response message...
Part 3: After the version negotiation Git sends a list of all capabilities...
Part 4: Git expects to read a list of desired capabilities...

I think example and text match, no?


> Moreover in the example in
> previous version at least as far as v8 of this series, the response
> from filter driver was 

Re: [PATCH v10 13/14] convert: add filter..process option

2016-10-11 Thread Lars Schneider

> On 09 Oct 2016, at 07:32, Torsten Bögershausen  wrote:
> 
> On 09.10.16 01:06, Jakub Narębski wrote:
>>> +
 +packet:  git< status=abort
 +packet:  git< 
 +
 +
 +After the filter has processed a blob it is expected to wait for
 +the next "key=value" list containing a command. Git will close
 +the command pipe on exit. The filter is expected to detect EOF
 +and exit gracefully on its own.
>> Any "kill filter" solutions should probably be put here.  I guess
>> that filter exiting means EOF on its standard output when read
>> by Git command, isn't it?
>> 
> Isn't it that Git closes the command pipe, then filter sees EOF on it's stdin
> 
> and does a graceful exit.

Correct!

- Lars

Re: [PATCH v10 13/14] convert: add filter..process option

2016-10-11 Thread Torsten Bögershausen
On Tue, Oct 11, 2016 at 10:11:22AM +0200, Lars Schneider wrote:
> 
> > On 10 Oct 2016, at 21:58, Junio C Hamano  wrote:
> > 
> > larsxschnei...@gmail.com writes:
> > 
> > [...]
> >> +# Count unique lines except clean invocations in two files and compare
> >> +# them. Clean invocations are not counted because their number can vary.
> >> +# c.f. 
> >> http://public-inbox.org/git/xmqqshv18i8i@gitster.mtv.corp.google.com/
> >> +test_cmp_count_except_clean () {
> >> +  for FILE in $@
> >> +  do
> >> +  sort $FILE | uniq -c | sed "s/^[ ]*//" |
> >> +  sed "s/^\([0-9]\) IN: clean/x IN: clean/" >$FILE.tmp
> >> +  cat $FILE.tmp >$FILE
> >> +  done &&
> >> +  test_cmp $@
> >> +}
> > 
> > Why do you even _care_ about the number of invocations?  While I
> > told you why "clean" could be called multiple times under racy Git
> > as an example, that was not meant to be an exhaustive example.  I
> > wouldn't be surprised if we needed to run smudge twice, for example,
> > in some weirdly racy cases in the future.
> > 
> > Can we just have the correctness (i.e. "we expect that the working
> > tree file gets this as the result of checking it out, and we made
> > sure that is the case") test without getting into such an
> > implementation detail?
> 
> My goal is to check that clean/smudge is invoked at least once. I could
> just run `uniq` to achieve that but then all other filter commands could
> happen multiple times and the test would not detect that.
> 
> I also prefer to check the filter commands to ensure the filter is 
> working as expected (e.g. no multiple start ups etc) in addition to 
> checking the working tree.
> 
> Would the patch below work for you? If yes, then please squash it into
> "convert: add filter..process option".
> 
> Thank you,
> Lars
> 
> 
> 
> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
> index 9f892c0..714f706 100755
> --- a/t/t0021-conversion.sh
> +++ b/t/t0021-conversion.sh
> @@ -31,38 +31,33 @@ filter_git () {
>   rm -f git-stderr.log
>  }
>  
> -# Count unique lines in two files and compare them.
> -test_cmp_count () {
> - for FILE in $@
> - do
> - sort $FILE | uniq -c | sed "s/^[ ]*//" >$FILE.tmp
> - cat $FILE.tmp >$FILE
> - done &&
> - test_cmp $@
> -}
> -
> -# Count unique lines except clean invocations in two files and compare
> -# them. Clean invocations are not counted because their number can vary.
> +# Compare two files and ensure that `clean` and `smudge` respectively are
> +# called at least once if specified in the `expect` file. The actual
> +# invocation count is not relevant because their number can vary.
>  # c.f. 
> http://public-inbox.org/git/xmqqshv18i8i@gitster.mtv.corp.google.com/
> -test_cmp_count_except_clean () {
> - for FILE in $@

> +test_cmp_count () {
> + expect=$1 actual=$2

That could be 
expect="$1"
actual="$2"

> + for FILE in "$expect" "$actual"
>   do

> + sort "$FILE" | uniq -c | sed "s/^[ ]*//" |
> + sed "s/^\([0-9]\) IN: clean/x IN: clean/" |
> + sed "s/^\([0-9]\) IN: smudge/x IN: smudge/" 
> >"$FILE.tmp" &&
> + cat "$FILE.tmp" >"$FILE"

How about 
cp "$FILE.tmp" "$FILE"



Re: [PATCH v10 13/14] convert: add filter..process option

2016-10-11 Thread Lars Schneider

> On 10 Oct 2016, at 21:58, Junio C Hamano  wrote:
> 
> larsxschnei...@gmail.com writes:
> 
> [...]
>> +# Count unique lines except clean invocations in two files and compare
>> +# them. Clean invocations are not counted because their number can vary.
>> +# c.f. 
>> http://public-inbox.org/git/xmqqshv18i8i@gitster.mtv.corp.google.com/
>> +test_cmp_count_except_clean () {
>> +for FILE in $@
>> +do
>> +sort $FILE | uniq -c | sed "s/^[ ]*//" |
>> +sed "s/^\([0-9]\) IN: clean/x IN: clean/" >$FILE.tmp
>> +cat $FILE.tmp >$FILE
>> +done &&
>> +test_cmp $@
>> +}
> 
> Why do you even _care_ about the number of invocations?  While I
> told you why "clean" could be called multiple times under racy Git
> as an example, that was not meant to be an exhaustive example.  I
> wouldn't be surprised if we needed to run smudge twice, for example,
> in some weirdly racy cases in the future.
> 
> Can we just have the correctness (i.e. "we expect that the working
> tree file gets this as the result of checking it out, and we made
> sure that is the case") test without getting into such an
> implementation detail?

My goal is to check that clean/smudge is invoked at least once. I could
just run `uniq` to achieve that but then all other filter commands could
happen multiple times and the test would not detect that.

I also prefer to check the filter commands to ensure the filter is 
working as expected (e.g. no multiple start ups etc) in addition to 
checking the working tree.

Would the patch below work for you? If yes, then please squash it into
"convert: add filter..process option".

Thank you,
Lars



diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 9f892c0..714f706 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -31,38 +31,33 @@ filter_git () {
rm -f git-stderr.log
 }
 
-# Count unique lines in two files and compare them.
-test_cmp_count () {
-   for FILE in $@
-   do
-   sort $FILE | uniq -c | sed "s/^[ ]*//" >$FILE.tmp
-   cat $FILE.tmp >$FILE
-   done &&
-   test_cmp $@
-}
-
-# Count unique lines except clean invocations in two files and compare
-# them. Clean invocations are not counted because their number can vary.
+# Compare two files and ensure that `clean` and `smudge` respectively are
+# called at least once if specified in the `expect` file. The actual
+# invocation count is not relevant because their number can vary.
 # c.f. 
http://public-inbox.org/git/xmqqshv18i8i@gitster.mtv.corp.google.com/
-test_cmp_count_except_clean () {
-   for FILE in $@
+test_cmp_count () {
+   expect=$1 actual=$2
+   for FILE in "$expect" "$actual"
do
-   sort $FILE | uniq -c | sed "s/^[ ]*//" |
-   sed "s/^\([0-9]\) IN: clean/x IN: clean/" >$FILE.tmp
-   cat $FILE.tmp >$FILE
+   sort "$FILE" | uniq -c | sed "s/^[ ]*//" |
+   sed "s/^\([0-9]\) IN: clean/x IN: clean/" |
+   sed "s/^\([0-9]\) IN: smudge/x IN: smudge/" 
>"$FILE.tmp" &&
+   cat "$FILE.tmp" >"$FILE"
done &&
-   test_cmp $@
+   test_cmp "$expect" "$actual"
 }
 
-# Compare two files but exclude clean invocations because they can vary.
+# Compare two files but exclude all `clean` invocations because Git can
+# call `clean` zero or more times.
 # c.f. 
http://public-inbox.org/git/xmqqshv18i8i@gitster.mtv.corp.google.com/
 test_cmp_exclude_clean () {
-   for FILE in $@
+   expect=$1 actual=$2
+   for FILE in "$expect" "$actual"
do
-   grep -v "IN: clean" $FILE >$FILE.tmp
-   cat $FILE.tmp >$FILE
+   grep -v "IN: clean" "$FILE" >"$FILE.tmp" &&
+   cat "$FILE.tmp" >"$FILE"
done &&
-   test_cmp $@
+   test_cmp "$expect" "$actual"
 }
 
 # Check that the contents of two files are equal and that their rot13 version
@@ -395,7 +390,7 @@ test_expect_success PERL 'required process filter should 
filter data' '
IN: clean testsubdir/test3 '\''sq'\'',\$x.r $S3 [OK] -- 
OUT: $S3 . [OK]
STOP
EOF
-   test_cmp_count_except_clean expected.log rot13-filter.log &&
+   test_cmp_count expected.log rot13-filter.log &&
 
rm -f test2.r "testsubdir/test3 '\''sq'\'',\$x.r" &&



Re: [PATCH v10 13/14] convert: add filter..process option

2016-10-10 Thread Junio C Hamano
larsxschnei...@gmail.com writes:

> +# Count unique lines in two files and compare them.
> +test_cmp_count () {
> + for FILE in $@
> + do
> + sort $FILE | uniq -c | sed "s/^[ ]*//" >$FILE.tmp
> + cat $FILE.tmp >$FILE

Unquoted references to $FILE bothers me.  Are you relying on them
getting split at IFS boundaries?  Otherwise write this (and other
similar ones) like so:

for FILE in "$@"
do
do-this-to "$FILE" | ... >"$FILE.tmp" &&
cat "$FILE.tmp" >"$FILE" &&
rm -f "$FILE.tmp"

> + done &&
> + test_cmp $@

The use of "$@" here is quite pointless, as you _know_ all of them
are filenames, and you _know_ that test_cmp takes only two
filenames.  Be explicit and say

test_cmp "$1" "$2"

or even

test_cmp_count () {
expect=$1 actual=$2
for FILE in "$expect" "$actual"
do
...
done &&
test_cmp "$expect" "$actual"

> +# Count unique lines except clean invocations in two files and compare
> +# them. Clean invocations are not counted because their number can vary.
> +# c.f. 
> http://public-inbox.org/git/xmqqshv18i8i@gitster.mtv.corp.google.com/
> +test_cmp_count_except_clean () {
> + for FILE in $@
> + do
> + sort $FILE | uniq -c | sed "s/^[ ]*//" |
> + sed "s/^\([0-9]\) IN: clean/x IN: clean/" >$FILE.tmp
> + cat $FILE.tmp >$FILE
> + done &&
> + test_cmp $@
> +}

Why do you even _care_ about the number of invocations?  While I
told you why "clean" could be called multiple times under racy Git
as an example, that was not meant to be an exhaustive example.  I
wouldn't be surprised if we needed to run smudge twice, for example,
in some weirdly racy cases in the future.

Can we just have the correctness (i.e. "we expect that the working
tree file gets this as the result of checking it out, and we made
sure that is the case") test without getting into such an
implementation detail?

Thanks.


Re: [PATCH v10 13/14] convert: add filter..process option

2016-10-08 Thread Torsten Bögershausen
On 09.10.16 01:06, Jakub Narębski wrote:
>> +
>> > +packet:  git< status=abort
>> > +packet:  git< 
>> > +
>> > +
>> > +After the filter has processed a blob it is expected to wait for
>> > +the next "key=value" list containing a command. Git will close
>> > +the command pipe on exit. The filter is expected to detect EOF
>> > +and exit gracefully on its own.
> Any "kill filter" solutions should probably be put here.  I guess
> that filter exiting means EOF on its standard output when read
> by Git command, isn't it?
>
Isn't it that Git closes the command pipe, then filter sees EOF on it's stdin

and does a graceful exit.





Re: [PATCH v10 13/14] convert: add filter..process option

2016-10-08 Thread Jakub Narębski
Part 1 of review, starting with the protocol v2 itself.

W dniu 08.10.2016 o 13:25, larsxschnei...@gmail.com pisze:
> From: Lars Schneider 
> 
> Git's clean/smudge mechanism invokes an external filter process for
> every single blob that is affected by a filter. If Git filters a lot of
> blobs then the startup time of the external filter processes can become
> a significant part of the overall Git execution time.
> 
> In a preliminary performance test this developer used a clean/smudge
> filter written in golang to filter 12,000 files. This process took 364s
> with the existing filter mechanism and 5s with the new mechanism. See
> details here: https://github.com/github/git-lfs/pull/1382
> 
> This patch adds the `filter..process` string option which, if
> used, keeps the external filter process running and processes all blobs
> with the packet format (pkt-line) based protocol over standard input and
> standard output. The full protocol is explained in detail in
> `Documentation/gitattributes.txt`.
> 
> A few key decisions:
> 
> * The long running filter process is referred to as filter protocol
>   version 2 because the existing single shot filter invocation is
>   considered version 1.
> * Git sends a welcome message and expects a response right after the
>   external filter process has started. This ensures that Git will not
>   hang if a version 1 filter is incorrectly used with the
>   filter..process option for version 2 filters. In addition,
>   Git can detect this kind of error and warn the user.
> * The status of a filter operation (e.g. "success" or "error) is set
>   before the actual response and (if necessary!) re-set after the
>   response. The advantage of this two step status response is that if
>   the filter detects an error early, then the filter can communicate
>   this and Git does not even need to create structures to read the
>   response.
> * All status responses are pkt-line lists terminated with a flush
>   packet. This allows us to send other status fields with the same
>   protocol in the future.

Looks good to me.

> 
> Helped-by: Martin-Louis Bright 
> Reviewed-by: Jakub Narebski 
> Signed-off-by: Lars Schneider 
> Signed-off-by: Junio C Hamano 
> ---
>  Documentation/gitattributes.txt | 157 +-
>  convert.c   | 297 +-
>  t/t0021-conversion.sh   | 447 
> +++-
>  t/t0021/rot13-filter.pl | 191 +
>  4 files changed, 1082 insertions(+), 10 deletions(-)
>  create mode 100755 t/t0021/rot13-filter.pl
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 7aff940..5868f00 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -293,7 +293,13 @@ checkout, when the `smudge` command is specified, the 
> command is
>  fed the blob object from its standard input, and its standard
>  output is used to update the worktree file.  Similarly, the
>  `clean` command is used to convert the contents of worktree file
> -upon checkin.
> +upon checkin. By default these commands process only a single
> +blob and terminate.  If a long running `process` filter is used
> +in place of `clean` and/or `smudge` filters, then Git can process
> +all blobs with a single filter command invocation for the entire
> +life of a single Git command, for example `git add --all`.  See
> +section below for the description of the protocol used to
> +communicate with a `process` filter.

I don't remember how this part looked like in previous versions
of this patch series, but "... is used in place of `clean` ..."
does not tell explicitly about the precedence of those 
configuration variables.  I think it should be stated explicitly
that `process` takes precedence over any `clean` and/or `smudge`
settings for the same `filter.` (regardless of whether
the long running `process` filter support "clean" and/or "smudge"
operations or not).

>  
>  One use of the content filtering is to massage the content into a shape
>  that is more convenient for the platform, filesystem, and the user to use.
> @@ -373,6 +379,155 @@ not exist, or may have different contents. So, smudge 
> and clean commands
>  should not try to access the file on disk, but only act as filters on the
>  content provided to them on standard input.
>  
> +Long Running Filter Process
> +^^^
> +
> +If the filter command (a string value) is defined via
> +`filter..process` then Git can process all blobs with a
> +single filter invocation for the entire life of a single Git
> +command. This is achieved by using a packet format (pkt-line,
> +see technical/protocol-common.txt) based protocol over standard
> +input and standard output as follows. All packets, except for the
> +"*CONTENT" packets and the "" flush packet, are considered
> +text 

[PATCH v10 13/14] convert: add filter..process option

2016-10-08 Thread larsxschneider
From: Lars Schneider 

Git's clean/smudge mechanism invokes an external filter process for
every single blob that is affected by a filter. If Git filters a lot of
blobs then the startup time of the external filter processes can become
a significant part of the overall Git execution time.

In a preliminary performance test this developer used a clean/smudge
filter written in golang to filter 12,000 files. This process took 364s
with the existing filter mechanism and 5s with the new mechanism. See
details here: https://github.com/github/git-lfs/pull/1382

This patch adds the `filter..process` string option which, if
used, keeps the external filter process running and processes all blobs
with the packet format (pkt-line) based protocol over standard input and
standard output. The full protocol is explained in detail in
`Documentation/gitattributes.txt`.

A few key decisions:

* The long running filter process is referred to as filter protocol
  version 2 because the existing single shot filter invocation is
  considered version 1.
* Git sends a welcome message and expects a response right after the
  external filter process has started. This ensures that Git will not
  hang if a version 1 filter is incorrectly used with the
  filter..process option for version 2 filters. In addition,
  Git can detect this kind of error and warn the user.
* The status of a filter operation (e.g. "success" or "error) is set
  before the actual response and (if necessary!) re-set after the
  response. The advantage of this two step status response is that if
  the filter detects an error early, then the filter can communicate
  this and Git does not even need to create structures to read the
  response.
* All status responses are pkt-line lists terminated with a flush
  packet. This allows us to send other status fields with the same
  protocol in the future.

Helped-by: Martin-Louis Bright 
Reviewed-by: Jakub Narebski 
Signed-off-by: Lars Schneider 
Signed-off-by: Junio C Hamano 
---
 Documentation/gitattributes.txt | 157 +-
 convert.c   | 297 +-
 t/t0021-conversion.sh   | 447 +++-
 t/t0021/rot13-filter.pl | 191 +
 4 files changed, 1082 insertions(+), 10 deletions(-)
 create mode 100755 t/t0021/rot13-filter.pl

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 7aff940..5868f00 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -293,7 +293,13 @@ checkout, when the `smudge` command is specified, the 
command is
 fed the blob object from its standard input, and its standard
 output is used to update the worktree file.  Similarly, the
 `clean` command is used to convert the contents of worktree file
-upon checkin.
+upon checkin. By default these commands process only a single
+blob and terminate.  If a long running `process` filter is used
+in place of `clean` and/or `smudge` filters, then Git can process
+all blobs with a single filter command invocation for the entire
+life of a single Git command, for example `git add --all`.  See
+section below for the description of the protocol used to
+communicate with a `process` filter.
 
 One use of the content filtering is to massage the content into a shape
 that is more convenient for the platform, filesystem, and the user to use.
@@ -373,6 +379,155 @@ not exist, or may have different contents. So, smudge and 
clean commands
 should not try to access the file on disk, but only act as filters on the
 content provided to them on standard input.
 
+Long Running Filter Process
+^^^
+
+If the filter command (a string value) is defined via
+`filter..process` then Git can process all blobs with a
+single filter invocation for the entire life of a single Git
+command. This is achieved by using a packet format (pkt-line,
+see technical/protocol-common.txt) based protocol over standard
+input and standard output as follows. All packets, except for the
+"*CONTENT" packets and the "" flush packet, are considered
+text and therefore are terminated by a LF.
+
+Git starts the filter when it encounters the first file
+that needs to be cleaned or smudged. After the filter started
+Git sends a welcome message ("git-filter-client"), a list of
+supported protocol version numbers, and a flush packet. Git expects
+to read a welcome response message ("git-filter-server") and exactly
+one protocol version number from the previously sent list. All further
+communication will be based on the selected version. The remaining
+protocol description below documents "version=2". Please note that
+"version=42" in the example below does not exist and is only there
+to illustrate how the protocol would look like with more than one
+version.
+
+After the version negotiation Git sends a list of