Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-08-06 Thread Lars Schneider

> On 05 Aug 2016, at 20:55, Eric Wong  wrote:
> 
> Lars Schneider  wrote:
>>> On 27 Jul 2016, at 11:41, Eric Wong  wrote:
>>> larsxschnei...@gmail.com wrote:
 +static int apply_protocol_filter(const char *path, const char *src, size_t len,
 +  int fd, struct strbuf *dst, const char *cmd,
 +  const char *filter_type)
 +{
>>> 
>>> 
>>> 
 +  if (fd >= 0 && !src) {
 +  ret &= fstat(fd, &file_stat) != -1;
 +  len = file_stat.st_size;
>>> 
>>> Same truncation bug I noticed earlier; what I originally meant
>>> is the `len' arg probably ought to be off_t, here, not size_t.
>>> 32-bit x86 Linux systems have 32-bit size_t (unsigned), but
>>> large file support means off_t is 64-bits (signed).
>> 
>> OK. Would it be OK to keep size_t for this patch series?
> 
> I think there should at least be a truncation warning (or die)
> for larger-than-4GB files on 32-bit.  I don't know how common
> they are for git-lfs users.
> 
> Perhaps using xsize_t in git-compat-util.h works for now:
> 
>   len = xsize_t(file_stat.st_size);

Thanks for the hint! Should I add the same check to sha1_file's use
of fstat in line 1002 or is it not needed there?

https://github.com/git/git/blob/c6b0597e9ac7277e148e2fd4d7615ac6e0bfb661/sha1_file.c#L1002

Thanks,
Lars
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-08-06 Thread Eric Wong
> >>> larsxschnei...@gmail.com wrote:
>  +static int apply_protocol_filter(const char *path, const char *src, size_t len,

Lars Schneider  wrote:
> > On 05 Aug 2016, at 20:55, Eric Wong  wrote:
> > Perhaps using xsize_t in git-compat-util.h works for now:
> > 
> > len = xsize_t(file_stat.st_size);
> 
> Thanks for the hint! Should I add the same check to sha1_file's use
> of fstat in line 1002 or is it not needed there?
> 
> https://github.com/git/git/blob/c6b0597e9ac7277e148e2fd4d7615ac6e0bfb661/sha1_file.c#L1002

Not needed, if you look at the definition of "struct packed_git"
in cache.h, you'll see pack_size is already off_t, not size_t
like `len` is.


Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-08-05 Thread Eric Wong
Lars Schneider  wrote:
> > On 27 Jul 2016, at 11:41, Eric Wong  wrote:
> > larsxschnei...@gmail.com wrote:
> >> +static int apply_protocol_filter(const char *path, const char *src, size_t len,
> >> +  int fd, struct strbuf *dst, const char *cmd,
> >> +  const char *filter_type)
> >> +{
> > 
> > 
> > 
> >> +  if (fd >= 0 && !src) {
> >> +  ret &= fstat(fd, &file_stat) != -1;
> >> +  len = file_stat.st_size;
> > 
> > Same truncation bug I noticed earlier; what I originally meant
> > is the `len' arg probably ought to be off_t, here, not size_t.
> > 32-bit x86 Linux systems have 32-bit size_t (unsigned), but
> > large file support means off_t is 64-bits (signed).
> 
> OK. Would it be OK to keep size_t for this patch series?

I think there should at least be a truncation warning (or die)
for larger-than-4GB files on 32-bit.  I don't know how common
they are for git-lfs users.

Perhaps using xsize_t in git-compat-util.h works for now:

len = xsize_t(file_stat.st_size);

(sorry, I haven't had much time to look at your other updates)


Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-07-30 Thread Jakub Narębski
On 2016-07-30 at 01:44, Lars Schneider wrote:
> On 30 Jul 2016, at 01:11, Jakub Narębski  wrote:

>> I think the protocol should be either: <content> + <flush>, or
>> <size> + <content> + <flush>, that is do not use flush
>> packet if size is known upfront -- it would be a second point
>> of truth (SPOT principle).
>
> As I mentioned elsewhere a <flush> packet is always sent right now.
> I have no strong opinion if this is good or bad. The implementation
> was a little bit simpler and that's why I did it. I will implement
> whatever option the majority prefers :-)

Well, if we treat it as a size hint, then it is all right; as you
say, it makes for a simpler implementation: read till flush.  Git
should not error out if there is a mismatch between the specified
and actual size of the return from the filter; the filter can do
whatever it wants.

I see there is v3 series sent, so I'll move the discussion there.
One thing: we would probably want the size / size-hint packet
to be extensible, either

  size=<n> [(SPC <key>=<value>)...] "\n"

or

  <size> [(SPC <keyword>)...] "\n"

that is, a space-separated list starting with the size / size hint.

An upfront error could be signalled by putting, for example, "error"
in place of the size, e.g.

  error <message> "\n"

-- 
Jakub Narębski


Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-07-29 Thread Lars Schneider

> On 30 Jul 2016, at 01:11, Jakub Narębski  wrote:
> 
> On 2016-07-29 at 19:35, Junio C Hamano wrote:
>> Lars Schneider  writes:
>> 
>>> I think sending it upfront is nice for buffer allocations of big files
>>> and it doesn't cost us anything to do it.
>> 
>> While I do NOT think "total size upfront" MUST BE avoided at all costs,
>> I do not think the above statement to justify it makes ANY sense.
>> 
>> Big files are by definition something you cannot afford to hold in
>> their entirety in core, so you do not want to be told that you'd be
>> fed 40GB and ask xmalloc to allocate that much.
> 
> I don't know much about how filter drivers work internally, but in some
> cases Git reads or writes from a file (file descriptor), and in other
> cases from a str+len pair (which probably predates strbuf) - I think in
> those cases the file needs to fit in memory (in size_t).  So in some
> cases Git reads the file into memory.  Whether it uses xmalloc or mmap,
> I don't know.
> 
>> 
>> It allows the reader to be lazy for buffer allocations as long as
>> you know the file fits in-core, at the cost of forcing the writer to
>> somehow come up with the total number of bytes even before sending a
>> single byte (in other words, if the writer cannot produce and hold
>> the data in-core, it may even have to spool the data in a temporary
>> file only to count, and then play it back after showing the total
>> size).
> 
> For some types of filters you can know the size upfront:
> - for filters such as rot13, with 1-to-1 transformation, you know
>   that the output size is the same as the input size
> - for block encodings, and for constant-width to constant-width
>   encoding conversion, the filter can calculate the output size from
>   the input size (e.g. <output size> = 2 * <input size>)
> - the filter may get the size from somewhere else; for example an LFS
>   filter stub is constant size, and files are stored in artifactory
>   with their length
> 
>> 
>> It is good that you allow both modes of operation, and that the size
>> of the data can either be given upfront (which allows a single fixed
>> allocation upfront without realloc, as long as the data fits in
>> core), or be left "(atend)".
> 
> I think the protocol should be either: <content> + <flush>, or
> <size> + <content> + <flush>, that is do not use flush
> packet if size is known upfront -- it would be a second point
> of truth (SPOT principle).

As I mentioned elsewhere a <flush> packet is always sent right now.
I have no strong opinion if this is good or bad. The implementation
was a little bit simpler and that's why I did it. I will implement
whatever option the majority prefers :-)

Cheers,
Lars

> 
>> I just don't want to see it oversold as a "feature" that the size
>> has to come before data.  That is a limitation, not a feature.
>> 
>> Thanks.
>> 
> 



Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-07-29 Thread Jakub Narębski
On 2016-07-29 at 19:35, Junio C Hamano wrote:
> Lars Schneider  writes:
> 
>> I think sending it upfront is nice for buffer allocations of big files
>> and it doesn't cost us anything to do it.
> 
> While I do NOT think "total size upfront" MUST BE avoided at all costs,
> I do not think the above statement to justify it makes ANY sense.
> 
> Big files are by definition something you cannot afford to hold in
> their entirety in core, so you do not want to be told that you'd be
> fed 40GB and ask xmalloc to allocate that much.

I don't know much about how filter drivers work internally, but in some
cases Git reads or writes from a file (file descriptor), and in other
cases from a str+len pair (which probably predates strbuf) - I think in
those cases the file needs to fit in memory (in size_t).  So in some
cases Git reads the file into memory.  Whether it uses xmalloc or mmap,
I don't know.

> 
> It allows the reader to be lazy for buffer allocations as long as
> you know the file fits in-core, at the cost of forcing the writer to
> somehow come up with the total number of bytes even before sending a
> single byte (in other words, if the writer cannot produce and hold
> the data in-core, it may even have to spool the data in a temporary
> file only to count, and then play it back after showing the total
> size).

For some types of filters you can know the size upfront:
 - for filters such as rot13, with 1-to-1 transformation, you know
   that the output size is the same as the input size
 - for block encodings, and for constant-width to constant-width
   encoding conversion, the filter can calculate the output size from
   the input size (e.g. <output size> = 2 * <input size>)
 - the filter may get the size from somewhere else; for example an LFS
   filter stub is constant size, and files are stored in artifactory
   with their length

> 
> It is good that you allow both modes of operation, and that the size
> of the data can either be given upfront (which allows a single fixed
> allocation upfront without realloc, as long as the data fits in
> core), or be left "(atend)".

I think the protocol should be either: <content> + <flush>, or
<size> + <content> + <flush>, that is do not use flush
packet if size is known upfront -- it would be a second point
of truth (SPOT principle).
 
> I just don't want to see it oversold as a "feature" that the size
> has to come before data.  That is a limitation, not a feature.
> 
> Thanks.
> 




Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-07-29 Thread Junio C Hamano
Lars Schneider  writes:

> I think sending it upfront is nice for buffer allocations of big files
> and it doesn't cost us anything to do it.

While I do NOT think "total size upfront" MUST BE avoided at all costs,
I do not think the above statement to justify it makes ANY sense.

Big files are by definition something you cannot afford to hold in
their entirety in core, so you do not want to be told that you'd be
fed 40GB and ask xmalloc to allocate that much.

It allows the reader to be lazy for buffer allocations as long as
you know the file fits in-core, at the cost of forcing the writer to
somehow come up with the total number of bytes even before sending a
single byte (in other words, if the writer cannot produce and hold
the data in-core, it may even have to spool the data in a temporary
file only to count, and then play it back after showing the total
size).

It is good that you allow both modes of operation, and that the size
of the data can either be given upfront (which allows a single fixed
allocation upfront without realloc, as long as the data fits in
core), or be left "(atend)".

I just don't want to see it oversold as a "feature" that the size
has to come before data.  That is a limitation, not a feature.

Thanks.


Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-07-29 Thread Lars Schneider

> On 29 Jul 2016, at 13:24, Jakub Narębski  wrote:
> 
> On 2016-07-29 at 12:38, Lars Schneider wrote:
>> On 27 Jul 2016, at 11:41, Eric Wong  wrote:
>>> larsxschnei...@gmail.com wrote:
> 
 +static off_t multi_packet_read(struct strbuf *sb, const int fd, const size_t size)
>>> 
>>> I'm no expert in C, but this might be const-correctness taken
>>> too far.  I think basing this on the read(2) prototype is less
>>> surprising:
>>> 
>>>  static ssize_t multi_packet_read(int fd, struct strbuf *sb, size_t size)
>> 
>> Hm... ok. I like `const` because I think it is usually easier to
>> read/understand functions that do not change their input variables.
>> This way I can communicate my intention to future people modifying
>> this function!
> 
> Well, scalar types like `size_t` are always passed by value, so here `const`
> doesn't matter, and it makes the line longer.  I think library functions do
> not use `const` for `size_t` parameters.
> 
> You are reading from the file descriptor `fd`, so its state would change.
> Using `const` feels a bit like lying.  Also, it is a scalar type.

OK, since you are the second reviewer arguing against `const` I will remove it.


> 
> [...] 
>> I agree with your reordering of the parameters, though!
>> 
>> Speaking of coding style... convert.c is already big and gets only
>> bigger with this patch (1720 lines). Would it make sense to add a new
>> file "convert-pipe-protocol.c" or something for my additions?
> 
> I wonder if it would be possible to enhance existing functions, instead
> of redoing them (at least in part) for per-command filter driver protocol.

I think I reused as much as possible.

Thanks,
Lars


Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-07-29 Thread Jakub Narębski
On 2016-07-29 at 12:38, Lars Schneider wrote:
> On 27 Jul 2016, at 11:41, Eric Wong  wrote:
>> larsxschnei...@gmail.com wrote:

>>> +static off_t multi_packet_read(struct strbuf *sb, const int fd, const size_t size)
>> 
>> I'm no expert in C, but this might be const-correctness taken
>> too far.  I think basing this on the read(2) prototype is less
>> surprising:
>> 
>>   static ssize_t multi_packet_read(int fd, struct strbuf *sb, size_t size)
>
> Hm... ok. I like `const` because I think it is usually easier to
> read/understand functions that do not change their input variables.
> This way I can communicate my intention to future people modifying
> this function!

Well, scalar types like `size_t` are always passed by value, so here `const`
doesn't matter, and it makes the line longer.  I think library functions do
not use `const` for `size_t` parameters.

You are reading from the file descriptor `fd`, so its state would change.
Using `const` feels a bit like lying.  Also, it is a scalar type.
 
[...] 
> I agree with your reordering of the parameters, though!
> 
> Speaking of coding style... convert.c is already big and gets only
> bigger with this patch (1720 lines). Would it make sense to add a new
> file "convert-pipe-protocol.c" or something for my additions?

I wonder if it would be possible to enhance existing functions, instead
of redoing them (at least in part) for per-command filter driver protocol.

Best,
-- 
Jakub Narębski



Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-07-29 Thread Lars Schneider

> On 27 Jul 2016, at 11:41, Eric Wong  wrote:
> 
> larsxschnei...@gmail.com wrote:
>> +static off_t multi_packet_read(struct strbuf *sb, const int fd, const size_t size)
> 
> I'm no expert in C, but this might be const-correctness taken
> too far.  I think basing this on the read(2) prototype is less
> surprising:
> 
>   static ssize_t multi_packet_read(int fd, struct strbuf *sb, size_t size)

Hm... ok. I like `const` because I think it is usually easier to read/understand
functions that do not change their input variables. This way I can communicate
my intention to future people modifying this function!

If this is frowned upon in the Git community then I will add a comment to the
CodingGuidelines and remove the const :)

I agree with your reordering of the parameters, though!

Speaking of coding style... convert.c is already big and gets only
bigger with this patch (1720 lines). Would it make sense to add a new
file "convert-pipe-protocol.c" or something for my additions?


> Also what Jeff said about off_t vs size_t, but my previous
> emails may have confused you w.r.t. off_t usage...
> 
>> +static int multi_packet_write(const char *src, size_t len, const int in, const int out)
> 
> Same comment about over const ints above.
> len can probably be off_t based on what is below; but you need
> to process the loop in ssize_t-friendly chunks.

I think I would prefer to keep it a size_t because that is the
type we get from Git initially. The code will be more clear in v3.


> 
>> +{
>> +int ret = 1;
>> +char header[4];
>> +char buffer[8192];
>> +off_t bytes_to_write;
> 
> What Jeff said, this should be ssize_t to match read(2) and xread

Agreed.


> 
>> +while (ret) {
>> +if (in >= 0) {
>> +bytes_to_write = xread(in, buffer, sizeof(buffer));
>> +if (bytes_to_write < 0)
>> +ret &= 0;
>> +src = buffer;
>> +} else {
>> +bytes_to_write = len > LARGE_PACKET_MAX - 4 ? LARGE_PACKET_MAX - 4 : len;
>> +len -= bytes_to_write;
>> +}
>> +if (!bytes_to_write)
>> +break;
> 
> The whole ret &= .. style error handling is hard-to-follow and
> here, a source of bugs.  I think the expected convention on
> hitting errors is:
> 
>   1) stop whatever you're doing
>   2) cleanup
>   3) propagate the error to callers
> 
> "goto" is an acceptable way of accomplishing this.
> 
> For example, bytes_to_write may still be negative at this point
> (and interpreted as a really big number when cast to unsigned
> size_t) and src/buffer could be stack garbage.

I changed the implementation here so that the &= style
is not necessary anymore. However, I will look into "goto"
for the other areas!


>> +set_packet_header(header, bytes_to_write + 4);
>> +ret &= write_in_full(out, &header, sizeof(header)) == sizeof(header);
>> +ret &= write_in_full(out, src, bytes_to_write) == bytes_to_write;
>> +}
>> +ret &= write_in_full(out, "", 4) == 4;
>> +return ret;
>> +}
>> +
> 
>> +static int apply_protocol_filter(const char *path, const char *src, size_t len,
>> +int fd, struct strbuf *dst, const char *cmd,
>> +const char *filter_type)
>> +{
> 
> 
> 
>> +if (fd >= 0 && !src) {
>> +ret &= fstat(fd, &file_stat) != -1;
>> +len = file_stat.st_size;
> 
> Same truncation bug I noticed earlier; what I originally meant
> is the `len' arg probably ought to be off_t, here, not size_t.
> 32-bit x86 Linux systems have 32-bit size_t (unsigned), but
> large file support means off_t is 64-bits (signed).

OK. Would it be OK to keep size_t for this patch series?


> Also, is it worth continuing this function if fstat fails?

No :-)


>> +}
>> +
>> +sigchain_push(SIGPIPE, SIG_IGN);
>> +
>> +packet_write(process->in, "%s\n", filter_type);
>> +packet_write(process->in, "%s\n", path);
>> +packet_write(process->in, "%zu\n", len);
> 
> I'm not sure if "%zu" is portable since we don't do C99 (yet?)
> For 64-bit signed off_t, you can probably do:
> 
>   packet_write(process->in, "%"PRIuMAX"\n", (uintmax_t)len);
> 
> Since we don't have PRIiMAX or intmax_t, here, and a negative
> len would be a bug (probably from failed fstat) anyways.

OK. "%zu" is not used in the entire code base. I will go with
your suggestion!


>> +ret &= multi_packet_write(src, len, fd, process->in);
> 
> multi_packet_write will probably fail if fstat failed above...

Yes. The error handling is bogus... I thought bitwise "and" would
act the same way as logical "and" (a bit embarrassing to admit that...).


> 
>> +strbuf = packet_read_line(process->out, NULL);
> 
> And this may just block or timeout if multi_packet_write failed.

True, but unless there is anyt

Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-07-29 Thread Lars Schneider

> On 28 Jul 2016, at 01:31, Jakub Narębski  wrote:
> 
> On 2016-07-27 at 02:06, larsxschnei...@gmail.com wrote:
>> From: Lars Schneider 
>> 
>> Git's clean/smudge mechanism invokes an external filter process for every
>> single blob that is affected by a filter. If Git filters a lot of blobs
>> then the startup time of the external filter processes can become a
>> significant part of the overall Git execution time.
> 
> It is not strictly necessary... but do we have any benchmarks for this,
> or is it just a feeling?  That is, in what situations might Git filter
> a large number of files (initial checkout? initial add? switching to an
> unrelated branch? getting large files from an LFS solution?), and when
> might startup time become a significant part of execution time
> (MS Windows? fast filters?)?

All the operations you mentioned are slow because they all cause filter
process invocations. I benchmarked the new filter protocol and it is 
almost 70x faster when switching branches on my local machine (i7, SSD, 
OS X) with a test repository containing 12,000 files that need to be 
filtered. 
Details here: https://github.com/github/git-lfs/pull/1382

Based on my previous experience with Git filter invocations I expect even
more dramatic results on Windows (I will benchmark this, too, as soon as
the list agrees on this approach).


>> This patch adds the filter.<driver>.process string option which, if used,
> 
> String option... what are possible values?  What happens if you use
> value that is not recognized by Git (it is "if used", isn't it)?  That's
> not obvious from the commit message (though it might be from the docs).

Then the process invocation will fail in the same way current filter
process invocations fail. If "filter.<driver>.required" is set then
Git will fail, otherwise not. 


> What is missing is the description that it is set to a command, and
> how it interacts with `clean` and `smudge` options.

Right, I'll add that! I implemented it in a way that
"filter.<driver>.clean" and "filter.<driver>.smudge" take precedence
over "filter.<driver>.process".

This allows the use of a `single-shot` clean filter and a `long-running`
smudge as suggested by Junio in the v1 discussion (no ref, gmane down).


>> keeps the external filter process running and processes all blobs with
>> the following packet format (pkt-line) based protocol over standard input
>> and standard output.
>> 
>> Git starts the filter on first usage and expects a welcome
>> message, protocol version number, and filter capabilities
>> seperated by spaces:
> 
> s/seperated/separated/

Thanks!


> Is there any handling of misconfigured one-shot filters, or would
> they still hang the execution of a Git command?

They would still hang. Would it be sufficient to mention that in the
docs?


>> 
>> packet:  git< git-filter-protocol
>> packet:  git< version 2
>> packet:  git< clean smudge
> 
> Wouldn't "capabilities clean smudge" be better?  Or is it the
> "clean smudge" proposal easier to handle?

Good suggestion! This is easy to handle and I think it mimics
the pack-protocol a bit more closely.


>> 
>> Supported filter capabilities are "clean" and "smudge".
>> 
>> Afterwards Git sends a command (e.g. "smudge" or "clean" - based
>> on the supported capabilities), the filename, the content size as
>> ASCII number in bytes, and the content in packet format with a
>> flush packet at the end:
>> 
>> packet:  git> smudge
>> packet:  git> testfile.dat
> 
> And here we don't have any problems with files containing embedded
> newlines etc.  Also Git should not be sending invalid file names.
> The question remains: is it absolute file path, or basename?

It sends a path absolute in context of the Git repo (e.g. subdir/testfile.dat).
I'll add that to the docs and I'll add a test case to demonstrate it.

> 
>> packet:  git> 7
>> packet:  git> CONTENT
> 
> Can Git send file contents using more than one packet?  I think
> it should be stated upfront.

OK


>> packet:  git> 
>> 
> 
> Why we need to send content size upfront?  Well, it is not a problem
> for Git, but (as I wrote in reply to the cover letter for this
> series) might be a problem for filter scripts.

I think sending it upfront is nice for buffer allocations of big files
and it doesn't cost us anything to do it.


>> The filter is expected to respond with the result content size as
>> ASCII number in bytes and the result content in packet format with
>> a flush packet at the end:
>> 
>> packet:  git< 57
> 
> This is not necessary (and might be hard for scripts to do) if
> the pkt-line protocol is used.
> 
> In short: I think pkt-line is not worth the complication on
> the Git side and on the filter side, unless it is used for
> streaming, or at least the filter not having to calculate output
> size upfront.

As I stated in my other response 

Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-07-28 Thread Jeff King
On Thu, Jul 28, 2016 at 02:10:12PM +0200, Lars Schneider wrote:

> > I think that's orthogonal. I just mean that using zero for success puts
> > you in our usual style, and then accumulating errors can be done with
> > "|=".
> 
> Ah. I guess I was misguided by the way errors are currently handled
> in `apply_filter` (success = 1; failure = 0):
> https://github.com/git/git/blob/8c6d1f9807c67532e7fb545a944b064faff0f70b/convert.c#L437-L479
> 
> I wouldn't like if the different filter protocols would use different
> error exit codes. Would it be OK to adjust the existing `apply_filter`
> function in a cleanup patch?

Ah, I see.

I think those return codes are a little different. They are not "success
or error", but "did convert or did not convert" (or "would convert" when
no buffer is given). And unless the filter is required, we quietly turn
errors into "did not convert" (and if it is, we die()).

So I'm not sure if changing them is a good idea. I agree with you that
it's probably inviting confusion to have the two sets of filter
functions have opposite return codes. So I think I retract my
suggestion. :)

-Peff


Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-07-28 Thread Lars Schneider

> On 27 Jul 2016, at 20:11, Jeff King  wrote:
> 
> On Wed, Jul 27, 2016 at 07:31:26PM +0200, Lars Schneider wrote:
> 
 +  strbuf_grow(sb, size + 1);  // we need one extra byte for the packet flush
>>> 
>>> What happens if size is the maximum for size_t here (i.e., 4GB-1 on a
>>> 32-bit system)?
>> 
>> Would that be an acceptable solution?
>> 
>> if (size + 1 > SIZE_MAX)
>>  return die("unrepresentable length for filter buffer");
> 
> No, because by definition "size" will wrap to 0. :)
> 
> You have to do:
> 
>  if (size > SIZE_MAX - 1)
>   die("whoops");
> 
>> Can you point me to an example in the Git source how this kind of thing 
>> should
>> be handled?
> 
> The strbuf code itself checks for overflows. So you could do:
> 
>  strbuf_grow(sb, size);
>  ... fill up with size bytes ...
>  strbuf_addch(sb, ...); /* extra byte for whatever */
> 
> That does mean _possibly_ making a second allocation just to add the
> extra byte, but in practice it's not likely (unless the input exactly
> matches the strbuf's growth pattern).
> 
> If you want to do it yourself, I think:
> 
>  strbuf_grow(sb, st_add(size, 1));

I like that solution! Thanks!


> would work.
> 
 +  while (
 +  bytes_read > 0 &&   // the last packet was no flush
 +  sb->len - total_bytes_read - 1 > 0  // we still have space left in the buffer
 +  );
>>> 
>>> And I'm not sure if you need to distinguish between "0" and "-1" when
>>> checking byte_read here.
>> 
>> We want to finish reading in both cases, no?
> 
> If we get "-1", that's from an unexpected EOF during the packet_read(),
> because you set GENTLE_ON_EOF. So there's nothing left to read, and we
> should break and return an error.

Right.


> I guess "0" would come from a flush packet? Why would the filter send
> back a flush packet (unless you were using them to signal end-of-input,
> but then why does the filter have to send back the number of bytes ahead
> of time?).

Sending the bytes ahead of time (if available) might be nice for efficient
buffer allocation. I am changing the code so that both cases can be handled
(size ahead of time and no size ahead of time).


>>> Why 8K? The pkt-line format naturally restricts us to just under 64K, so
>>> why not take advantage of that and minimize the framing overhead for
>>> large data?
>> 
>> I took inspiration from here for 8K MAX_IO_SIZE:
>> https://github.com/git/git/blob/master/copy.c#L6
>> 
>> Is this read limit correct? Should I read 8 times to fill a pkt-line?
> 
> MAX_IO_SIZE is generally 8 _megabytes_, not 8K. The loop in copy.c just
> had to pick an arbitrary size for doing its read/write proxying.  I
> think in practice you are not likely to get much benefit from going
> beyond 8K or so there, just because operating systems tend to do things
> in page-sizes anyway, which are usually 4K.
> 
> But since you are formatting the result into a form that has framing
> overhead, anything up to LARGE_PACKET_MAX will see benefits (though
> admittedly even 4 bytes per 8K is not much).
> 
> I don't think it's worth the complexity of reading 8 times, but just
> using a buffer size of LARGE_PACKET_MAX-4 would be the most efficient.
> 
> I doubt it matters _that much_ in practice, but any time I see a magic
> number I have to wonder at the "why". At least basing it on
> LARGE_PACKET_MAX has some basis, whereas 8K is largely just made-up. :)

Sounds good. I will use LARGE_PACKET_MAX-4 !

> 
>>> We do sometimes do "ret |= something()" but that is in cases where
>>> "ret" is zero for success, and non-zero (usually -1) otherwise. Perhaps
>>> your function's error-reporting is inverted from our usual style?
>> 
>> I thought it makes the code easier to read and the filter doesn't care
>> at what point the error happens anyways. The filter either succeeds
>> or fails. What style would you suggest?
> 
> I think that's orthogonal. I just mean that using zero for success puts
> you in our usual style, and then accumulating errors can be done with
> "|=".

Ah. I guess I was misguided by the way errors are currently handled
in `apply_filter` (success = 1; failure = 0):
https://github.com/git/git/blob/8c6d1f9807c67532e7fb545a944b064faff0f70b/convert.c#L437-L479

I wouldn't like if the different filter protocols would use different
error exit codes. Would it be OK to adjust the existing `apply_filter`
function in a cleanup patch?


> I didn't look carefully at whether the accumulating style you're using
> makes sense or not. But something like:
> 
>  +  ret &= write_in_full(out, &header, sizeof(header)) == sizeof(header);
>  +  ret &= write_in_full(out, src, bytes_to_write) == bytes_to_write;
> 
> does mean that we call the second write() even if the first one failed.
> That's a waste of time (albeit a minor one), but it also means you could
> potentially cover up the value of "errno" from the first one (though in
> practice

Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-07-28 Thread Torsten Bögershausen


On 07/27/2016 02:06 AM, larsxschnei...@gmail.com wrote:
Some comments inline
[]
> The filter is expected to respond with the result content size as
> ASCII number in bytes and the result content in packet format with
> a flush packet at the end:
> 
> packet:  git< 57
> packet:  git< SMUDGED_CONTENT
> packet:  git< 
how does the filter report possible errors here?
Let's say I want to convert UTF-8 into ISO-8859-1,
but the conversion is impossible.

 > packet:  git< -1
 > packet:  git< Error message, (or empty), followed by a '\n'
 > packet:  git< 

Side note: a filter may return length 0.
Suggestion: add 2 test cases.


> 
> Please note: In a future version of Git the capability "stream"
> might be supported. In that case the content size must not be
> part of the filter response.
>
> Afterwards the filter is expected to wait for the next command.
>
> Signed-off-by: Lars Schneider 
> Helped-by: Martin-Louis Bright 
> Signed-off-by: Lars Schneider 
> ---
>  Documentation/gitattributes.txt |  54 +++-
>  convert.c   | 269 ++--

>  t/t0021-conversion.sh   | 175 ++
>  t/t0021/rot13-filter.pl | 146 ++
>  4 files changed, 631 insertions(+), 13 deletions(-)
>  create mode 100755 t/t0021/rot13-filter.pl
>
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt

> index 8882a3e..8fb40d2 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -300,7 +300,11 @@ checkout, when the `smudge` command is specified, the command is

>  fed the blob object from its standard input, and its standard
>  output is used to update the worktree file.  Similarly, the
>  `clean` command is used to convert the contents of worktree file
> -upon checkin.
> +upon checkin. By default these commands process only a single
> +blob and terminate. If a long running filter process (see section
> +below) is used then Git can process all blobs with a single filter
> +invocation for the entire life of a single Git command (e.g.
> +`git add .`).
>
>  One use of the content filtering is to massage the content into a shape
>  that is more convenient for the platform, filesystem, and the user 
to use.

> @@ -375,6 +379,54 @@ substitution.  For example:
>  
>
>
> +Long Running Filter Process
> +^^^
> +
> +If the filter command (string value) is defined via
> +filter.<driver>.process then Git can process all blobs with a
> +single filter invocation for the entire life of a single Git
> +command by talking with the following packet format (pkt-line)
> +based protocol over standard input and standard output.
> +
> +Git starts the filter on first usage and expects a welcome
> +message, protocol version number, and filter capabilities
> +seperated by spaces:
> +
> +packet:  git< git-filter-protocol
> +packet:  git< version 2
> +packet:  git< clean smudge
> +
> +Supported filter capabilities are "clean" and "smudge".
> +
> +Afterwards Git sends a command (e.g. "smudge" or "clean" - based
> +on the supported capabilities), the filename, the content size as
> +ASCII number in bytes, and the content in packet format with a
> +flush packet at the end:
> +
> +packet:  git> smudge
> +packet:  git> testfile.dat
> +packet:  git> 7
> +packet:  git> CONTENT
> +packet:  git> 
> +
> +
> +The filter is expected to respond with the result content size as
> +ASCII number in bytes and the result content in packet format with
> +a flush packet at the end:
> +
> +packet:  git< 57
> +packet:  git< SMUDGED_CONTENT
> +packet:  git< 
> +
> +Please note: In a future version of Git the capability "stream"
> +might be supported. In that case the content size must not be
> +part of the filter response.
> +
> +Afterwards the filter is expected to wait for the next command.
> +A demo implementation can be found in `t/t0021/rot13-filter.pl`
> +located in the Git core repository.
> +
> +
>  Interaction between checkin/checkout attributes
>  ^^^
>
> diff --git a/convert.c b/convert.c
> index 522e2c5..5ff200b 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -3,6 +3,7 @@
>  #include "run-command.h"
>  #include "quote.h"
>  #include "sigchain.h"
> +#include "pkt-line.h"
>
>  /*
>   * convert.c - convert a file when checking it out and checking it in.
> @@ -481,11 +482,232 @@ static int apply_filter(const char *path, const char *src, size_t len, int fd,

>return ret;
>  }
>
> +static off_t multi_packet_read(struct strbuf *sb, const int fd, const size_t size)

> +{
> +  off_t bytes_read;

Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-07-27 Thread Jakub Narębski
W dniu 2016-07-27 o 02:06, larsxschnei...@gmail.com pisze:
> From: Lars Schneider 
> 
> Git's clean/smudge mechanism invokes an external filter process for every
> single blob that is affected by a filter. If Git filters a lot of blobs
> then the startup time of the external filter processes can become a
> significant part of the overall Git execution time.

It is not strictly necessary... but do we have any benchmarks for this,
or is it just a feeling?  That is, in what situations may Git filter
a large number of files (initial checkout? initial add? switching
to an unrelated branch? getting large files from an LFS solution?),
and when might startup time become a significant part of execution
time (MS Windows? fast filters?)?

> 
> This patch adds the filter.<driver>.process string option which, if used,

String option... what are the possible values?  What happens if you use a
value that is not recognized by Git (it is "if used", isn't it)?  That's
not obvious from the commit message (though it might be from the docs).

What is missing is the description that it is set to a command, and
how it interacts with `clean` and `smudge` options.

> keeps the external filter process running and processes all blobs with
> the following packet format (pkt-line) based protocol over standard input
> and standard output.
> 
> Git starts the filter on first usage and expects a welcome
> message, protocol version number, and filter capabilities
> seperated by spaces:

s/seperated/separated/

Is there any handling of misconfigured one-shot filters, or would
they still hang the execution of a Git command?

> 
> packet:  git< git-filter-protocol
> packet:  git< version 2
> packet:  git< clean smudge

Wouldn't "capabilities clean smudge" be better?  Or is it the
"clean smudge" proposal easier to handle?

> 
> Supported filter capabilities are "clean" and "smudge".
> 
> Afterwards Git sends a command (e.g. "smudge" or "clean" - based
> on the supported capabilities), the filename, the content size as
> ASCII number in bytes, and the content in packet format with a
> flush packet at the end:
> 
> packet:  git> smudge
> packet:  git> testfile.dat

And here we don't have any problems with files containing embedded
newlines etc.  Also Git should not be sending invalid file names.
The question remains: is it absolute file path, or basename?

> packet:  git> 7
> packet:  git> CONTENT

Can Git send file contents using more than one packet?  I think
it should be stated upfront.

> packet:  git> 
> 

Why do we need to send the content size upfront?  Well, it is not a problem
for Git, but (as I wrote in reply to the cover letter for this
series) might be a problem for filter scripts.

> 
> The filter is expected to respond with the result content size as
> ASCII number in bytes and the result content in packet format with
> a flush packet at the end:
> 
> packet:  git< 57

This is not necessary (and might be hard for scripts to do) if
pkt-line protocol is used.

In short: I think pkt-line is not worth the complication on
the Git side and on the filter side, unless it is used for
streaming, or at least the filter not having to calculate output
size upfront.

> packet:  git< SMUDGED_CONTENT
> packet:  git< 
> 
> Please note: In a future version of Git the capability "stream"
> might be supported. In that case the content size must not be
> part of the filter response.
> 
> Afterwards the filter is expected to wait for the next command.

When is the filter supposed to exit, then?

> 
> Signed-off-by: Lars Schneider 
> Helped-by: Martin-Louis Bright 
> Signed-off-by: Lars Schneider 
> ---
>  Documentation/gitattributes.txt |  54 +++-
>  convert.c   | 269 ++--
>  t/t0021-conversion.sh   | 175 ++
>  t/t0021/rot13-filter.pl | 146 ++
>  4 files changed, 631 insertions(+), 13 deletions(-)
>  create mode 100755 t/t0021/rot13-filter.pl
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 8882a3e..8fb40d2 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -300,7 +300,11 @@ checkout, when the `smudge` command is specified, the 
> command is
>  fed the blob object from its standard input, and its standard
>  output is used to update the worktree file.  Similarly, the
>  `clean` command is used to convert the contents of worktree file
> -upon checkin.
> +upon checkin. By default these commands process only a single
> +blob and terminate. If a long running filter process (see section
> +below) is used then Git can process all blobs with a single filter
> +invocation for the entire life of a single Git command (e.g.
> +`git add .`).

Ah, all right, h

Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-07-27 Thread Jeff King
On Wed, Jul 27, 2016 at 07:31:26PM +0200, Lars Schneider wrote:

> >> +  strbuf_grow(sb, size + 1);  // we need one extra byte for the 
> >> packet flush
> > 
> > What happens if size is the maximum for size_t here (i.e., 4GB-1 on a
> > 32-bit system)?
> 
> Would that be an acceptable solution?
> 
> if (size + 1 > SIZE_MAX)
>   return die("unrepresentable length for filter buffer");

No, because by definition "size" will wrap to 0. :)

You have to do:

  if (size > SIZE_MAX - 1)
die("whoops");

> Can you point me to an example in the Git source how this kind of thing should
> be handled?

The strbuf code itself checks for overflows. So you could do:

  strbuf_grow(sb, size);
  ... fill up with size bytes ...
  strbuf_addch(sb, ...); /* extra byte for whatever */

That does mean _possibly_ making a second allocation just to add the
extra byte, but in practice it's not likely (unless the input exactly
matches the strbuf's growth pattern).

If you want to do it yourself, I think:

  strbuf_grow(sb, st_add(size, 1));

would work.

> >> +  while (
> >> +  bytes_read > 0 &&   // the 
> >> last packet was no flush
> >> +  sb->len - total_bytes_read - 1 > 0  // we still have space 
> >> left in the buffer
> >> +  );
> > 
> > And I'm not sure if you need to distinguish between "0" and "-1" when
> > checking bytes_read here.
> 
> We want to finish reading in both cases, no?

If we get "-1", that's from an unexpected EOF during the packet_read(),
because you set GENTLE_ON_EOF. So there's nothing left to read, and we
should break and return an error.

I guess "0" would come from a flush packet? Why would the filter send
back a flush packet (unless you were using them to signal end-of-input,
but then why does the filter have to send back the number of bytes ahead
of time?).

> > Why 8K? The pkt-line format naturally restricts us to just under 64K, so
> > why not take advantage of that and minimize the framing overhead for
> > large data?
> 
> I took inspiration from here for 8K MAX_IO_SIZE:
> https://github.com/git/git/blob/master/copy.c#L6
> 
> Is this read limit correct? Should I read 8 times to fill a pkt-line?

MAX_IO_SIZE is generally 8 _megabytes_, not 8K. The loop in copy.c just
had to pick an arbitrary size for doing its read/write proxying.  I
think in practice you are not likely to get much benefit from going
beyond 8K or so there, just because operating systems tend to do things
in page-sizes anyway, which are usually 4K.

But since you are formatting the result into a form that has framing
overhead, anything up to LARGE_PACKET_MAX will see benefits (though
admittedly even 4 bytes per 8K is not much).

I don't think it's worth the complexity of reading 8 times, but just
using a buffer size of LARGE_PACKET_MAX-4 would be the most efficient.

I doubt it matters _that much_ in practice, but any time I see a magic
number I have to wonder at the "why". At least basing it on
LARGE_PACKET_MAX has some basis, whereas 8K is largely just made-up. :)

> > We do sometimes do "ret |= something()" but that is in cases where
> > "ret" is zero for success, and non-zero (usually -1) otherwise. Perhaps
> > your function's error-reporting is inverted from our usual style?
> 
> I thought it makes the code easier to read and the filter doesn't care
> at what point the error happens anyways. The filter either succeeds
> or fails. What style would you suggest?

I think that's orthogonal. I just mean that using zero for success puts
you in our usual style, and then accumulating errors can be done with
"|=".

I didn't look carefully at whether the accumulating style you're using
makes sense or not. But something like:

> >> +  ret &= write_in_full(out, &header, sizeof(header)) == 
> >> sizeof(header);
> >> +  ret &= write_in_full(out, src, bytes_to_write) == 
> >> bytes_to_write;

does mean that we call the second write() even if the first one failed.
That's a waste of time (albeit a minor one), but it also means you could
potentially cover up the value of "errno" from the first one (though in
practice I'd expect the second one to fail for the same reason).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-07-27 Thread Lars Schneider

> On 27 Jul 2016, at 03:32, Jeff King  wrote:
> 
> On Wed, Jul 27, 2016 at 02:06:05AM +0200, larsxschnei...@gmail.com wrote:
> 
>> +static off_t multi_packet_read(struct strbuf *sb, const int fd, const 
>> size_t size)
>> +{
>> +off_t bytes_read;
>> +off_t total_bytes_read = 0;
> 
> I haven't looked carefully at the whole patch yet, but there seems to be
> some type issues here. off_t is a good type for storing the whole size
> of a file (which may be larger than the amount of memory we can
> allocate). But size_t is the right size for an in-memory object.
> 
> This function takes a size_t size, which makes sense if it is meant to
> read everything into a strbuf.
> 
> So I think our total_bytes_read would probably want to be a size_t here,
> too, because it cannot possibly grow larger than that (and that is
> enforced by the loop below). Otherwise you get weirdness like "sb->buf +
> total_bytes_ref" possibly overflowing memory.

OK


>> +strbuf_grow(sb, size + 1);  // we need one extra byte for the 
>> packet flush
> 
> What happens if size is the maximum for size_t here (i.e., 4GB-1 on a
> 32-bit system)?

Would that be an acceptable solution?

if (size + 1 > SIZE_MAX)
return die("unrepresentable length for filter buffer");

Can you point me to an example in the Git source how this kind of thing should
be handled?


>> +do {
>> +bytes_read = packet_read(
>> +fd, NULL, NULL,
>> +sb->buf + total_bytes_read, sb->len - total_bytes_read 
>> - 1,
>> +PACKET_READ_GENTLE_ON_EOF
>> +);
> 
> packet_read() actually returns an int, and may return "-1" on EOF (and
> int is fine because we know that we are constrained to 16-bit values
> by the pkt-line definition). You read it into an "off_t". I _think_ that
> is OK, because I believe POSIX says off_t must be signed. But probably
> "int" is the more correct type here.

OK


>> +total_bytes_read += bytes_read;
> 
> If you do get "-1", I think you need to detect it here before adjusting
> total_bytes_read.

Correct!


>> +while (
>> +bytes_read > 0 &&   // the 
>> last packet was no flush
>> +sb->len - total_bytes_read - 1 > 0  // we still have space 
>> left in the buffer
>> +);
> 
> And I'm not sure if you need to distinguish between "0" and "-1" when
> checking bytes_read here.

We want to finish reading in both cases, no?


> 
>> +strbuf_setlen(sb, total_bytes_read);
> 
> Passing an off_t to something expecting a size_t, which can involve
> truncation (though I think in practice you really are limited to
> size_t).

OK


>> +static int multi_packet_write(const char *src, size_t len, const int in, 
>> const int out)
>> +{
>> +int ret = 1;
>> +char header[4];
>> +char buffer[8192];
>> +off_t bytes_to_write;
>> +while (ret) {
>> +if (in >= 0) {
>> +bytes_to_write = xread(in, buffer, sizeof(buffer));
> 
> Likewise here, xread() is returning ssize_t. Again, OK if we can assume
> off_t is signed, but it probably makes sense to use the correct type (we
> also know it cannot be larger than 8K, of course).

OK


> Why 8K? The pkt-line format naturally restricts us to just under 64K, so
> why not take advantage of that and minimize the framing overhead for
> large data?

I took inspiration from here for 8K MAX_IO_SIZE:
https://github.com/git/git/blob/master/copy.c#L6

Is this read limit correct? Should I read 8 times to fill a pkt-line?


>> +if (bytes_to_write < 0)
>> +ret &= 0;
> 
> I think "&= 0" is unusual for our codebase? Would just writing "= 0" be
> more clear?

Yes!


> We do sometimes do "ret |= something()" but that is in cases where
> "ret" is zero for success, and non-zero (usually -1) otherwise. Perhaps
> your function's error-reporting is inverted from our usual style?

I thought it makes the code easier to read and the filter doesn't care
at what point the error happens anyways. The filter either succeeds
or fails. What style would you suggest?


>> +set_packet_header(header, bytes_to_write + 4);
>> +ret &= write_in_full(out, &header, sizeof(header)) == 
>> sizeof(header);
>> +ret &= write_in_full(out, src, bytes_to_write) == 
>> bytes_to_write;
>> +}
> 
> If you look at format_packet(), it pulls a slight trick: we have a
> buffer 4 bytes larger than we need, format into "buf + 4", and then
> write the final size at the beginning. That lets us write() it all in
> one go.
> 
> At first I thought this function was simply reinventing packet_write(),
> but I guess you are trying to avoid the extra copy of the data (once
> into the buffer from xread, and then again via format_packet just to add
> the extra bytes at the beginning).

Yes, that was my intention.


> I agree with what Junio said elsewhere, that there may be a way to make

Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-07-27 Thread Eric Wong
larsxschnei...@gmail.com wrote:
> +static off_t multi_packet_read(struct strbuf *sb, const int fd, const size_t 
> size)

I'm no expert in C, but this might be const-correctness taken
too far.  I think basing this on the read(2) prototype is less
surprising:

   static ssize_t multi_packet_read(int fd, struct strbuf *sb, size_t size)

Also what Jeff said about off_t vs size_t, but my previous
emails may have confused you w.r.t. off_t usage...

> +static int multi_packet_write(const char *src, size_t len, const int in, 
> const int out)

Same comment about over const ints above.
len can probably be off_t based on what is below; but you need
to process the loop in ssize_t-friendly chunks.

> +{
> + int ret = 1;
> + char header[4];
> + char buffer[8192];
> + off_t bytes_to_write;

What Jeff said, this should be ssize_t to match read(2) and xread

> + while (ret) {
> + if (in >= 0) {
> + bytes_to_write = xread(in, buffer, sizeof(buffer));
> + if (bytes_to_write < 0)
> + ret &= 0;
> + src = buffer;
> + } else {
> + bytes_to_write = len > LARGE_PACKET_MAX - 4 ? 
> LARGE_PACKET_MAX - 4 : len;
> + len -= bytes_to_write;
> + }
> + if (!bytes_to_write)
> + break;

The whole ret &= .. style error handling is hard-to-follow and
here, a source of bugs.  I think the expected convention on
hitting errors is:

1) stop whatever you're doing
2) cleanup
3) propagate the error to callers

"goto" is an acceptable way of accomplishing this.

For example, byte_to_write may still be negative at this point
(and interpreted as a really big number when cast to unsigned
size_t) and src/buffer could be stack garbage.

> + set_packet_header(header, bytes_to_write + 4);
> + ret &= write_in_full(out, &header, sizeof(header)) == 
> sizeof(header);
> + ret &= write_in_full(out, src, bytes_to_write) == 
> bytes_to_write;
> + }
> + ret &= write_in_full(out, "", 4) == 4;
> + return ret;
> +}
> +

> +static int apply_protocol_filter(const char *path, const char *src, size_t 
> len,
> + int fd, struct strbuf *dst, 
> const char *cmd,
> + const char *filter_type)
> +{



> + if (fd >= 0 && !src) {
> + ret &= fstat(fd, &file_stat) != -1;
> + len = file_stat.st_size;

Same truncation bug I noticed earlier; what I originally meant
is the `len' arg probably ought to be off_t, here, not size_t.
32-bit x86 Linux systems have 32-bit size_t (unsigned), but
large file support means off_t is 64-bits (signed).

Also, is it worth continuing this function if fstat fails?

> + }
> +
> + sigchain_push(SIGPIPE, SIG_IGN);
> +
> + packet_write(process->in, "%s\n", filter_type);
> + packet_write(process->in, "%s\n", path);
> + packet_write(process->in, "%zu\n", len);

I'm not sure if "%zu" is portable since we don't do C99 (yet?)
For 64-bit signed off_t, you can probably do:

packet_write(process->in, "%"PRIuMAX"\n", (uintmax_t)len);

Since we don't have PRIiMAX or intmax_t, here, and a negative
len would be a bug (probably from failed fstat) anyways.

> + ret &= multi_packet_write(src, len, fd, process->in);

multi_packet_write will probably fail if fstat failed above...

> + strbuf = packet_read_line(process->out, NULL);

And this may just block or timeout if multi_packet_write failed.


Naptime, I may look at the rest another day.


Re: [PATCH v2 5/5] convert: add filter.<driver>.process option

2016-07-26 Thread Jeff King
On Wed, Jul 27, 2016 at 02:06:05AM +0200, larsxschnei...@gmail.com wrote:

> +static off_t multi_packet_read(struct strbuf *sb, const int fd, const size_t 
> size)
> +{
> + off_t bytes_read;
> + off_t total_bytes_read = 0;

I haven't looked carefully at the whole patch yet, but there seems to be
some type issues here. off_t is a good type for storing the whole size
of a file (which may be larger than the amount of memory we can
allocate). But size_t is the right size for an in-memory object.

This function takes a size_t size, which makes sense if it is meant to
read everything into a strbuf.

So I think our total_bytes_read would probably want to be a size_t here,
too, because it cannot possibly grow larger than that (and that is
enforced by the loop below). Otherwise you get weirdness like "sb->buf +
total_bytes_ref" possibly overflowing memory.

> + strbuf_grow(sb, size + 1);  // we need one extra byte for the 
> packet flush

What happens if size is the maximum for size_t here (i.e., 4GB-1 on a
32-bit system)?

> + do {
> + bytes_read = packet_read(
> + fd, NULL, NULL,
> + sb->buf + total_bytes_read, sb->len - total_bytes_read 
> - 1,
> + PACKET_READ_GENTLE_ON_EOF
> + );

packet_read() actually returns an int, and may return "-1" on EOF (and
int is fine because we know that we are constrained to 16-bit values
by the pkt-line definition). You read it into an "off_t". I _think_ that
is OK, because I believe POSIX says off_t must be signed. But probably
"int" is the more correct type here.

> + total_bytes_read += bytes_read;

If you do get "-1", I think you need to detect it here before adjusting
total_bytes_read.

> + while (
> + bytes_read > 0 &&   // the 
> last packet was no flush
> + sb->len - total_bytes_read - 1 > 0  // we still have space 
> left in the buffer
> + );

And I'm not sure if you need to distinguish between "0" and "-1" when
checking bytes_read here.

> + strbuf_setlen(sb, total_bytes_read);

Passing an off_t to something expecting a size_t, which can involve
truncation (though I think in practice you really are limited to
size_t).

> +static int multi_packet_write(const char *src, size_t len, const int in, 
> const int out)
> +{
> + int ret = 1;
> + char header[4];
> + char buffer[8192];
> + off_t bytes_to_write;
> + while (ret) {
> + if (in >= 0) {
> + bytes_to_write = xread(in, buffer, sizeof(buffer));

Likewise here, xread() is returning ssize_t. Again, OK if we can assume
off_t is signed, but it probably makes sense to use the correct type (we
also know it cannot be larger than 8K, of course).

Why 8K? The pkt-line format naturally restricts us to just under 64K, so
why not take advantage of that and minimize the framing overhead for
large data?

> + if (bytes_to_write < 0)
> + ret &= 0;

I think "&= 0" is unusual for our codebase? Would just writing "= 0" be
more clear?

We do sometimes do "ret |= something()" but that is in cases where
"ret" is zero for success, and non-zero (usually -1) otherwise. Perhaps
your function's error-reporting is inverted from our usual style?

> + set_packet_header(header, bytes_to_write + 4);
> + ret &= write_in_full(out, &header, sizeof(header)) == 
> sizeof(header);
> + ret &= write_in_full(out, src, bytes_to_write) == 
> bytes_to_write;
> + }

If you look at format_packet(), it pulls a slight trick: we have a
buffer 4 bytes larger than we need, format into "buf + 4", and then
write the final size at the beginning. That lets us write() it all in
one go.

At first I thought this function was simply reinventing packet_write(),
but I guess you are trying to avoid the extra copy of the data (once
into the buffer from xread, and then again via format_packet just to add
the extra bytes at the beginning).

I agree with what Junio said elsewhere, that there may be a way to make
the pkt-line code handle this zero-copy situation better. Perhaps
something like:

  struct pktline {
/* first 4 bytes are reserved for length header */
char buf[LARGE_PACKET_MAX];
  };
  #define PKTLINE_DATA_START(pkt) ((pkt)->buf + 4)
  #define PKTLINE_DATA_LEN (LARGE_PACKET_MAX - 4)

  ...
  struct pktline pkt;
  ssize_t len = xread(fd, PKTLINE_DATA_START(&pkt), PKTLINE_DATA_LEN);
  packet_send(&pkt, len);

Then packet_send() knows that the first 4 bytes are reserved for it. I
suspect that the strbuf used by format_packet() could get away with
using such a "struct pktline" too, though in practice I doubt there's
any real efficiency to be gained (we generally reuse the same strbuf
over and over, so it will grow once to 64K and get reused).

> + ret &= write_in_full(out, "", 4) == 4;

packet_flush() ?

I know the packet functions a

[PATCH v2 5/5] convert: add filter.<driver>.process option

2016-07-26 Thread larsxschneider
From: Lars Schneider 

Git's clean/smudge mechanism invokes an external filter process for every
single blob that is affected by a filter. If Git filters a lot of blobs
then the startup time of the external filter processes can become a
significant part of the overall Git execution time.

This patch adds the filter.<driver>.process string option which, if used,
keeps the external filter process running and processes all blobs with
the following packet format (pkt-line) based protocol over standard input
and standard output.

Git starts the filter on first usage and expects a welcome
message, protocol version number, and filter capabilities
seperated by spaces:

packet:  git< git-filter-protocol
packet:  git< version 2
packet:  git< clean smudge

Supported filter capabilities are "clean" and "smudge".

Afterwards Git sends a command (e.g. "smudge" or "clean" - based
on the supported capabilities), the filename, the content size as
ASCII number in bytes, and the content in packet format with a
flush packet at the end:

packet:  git> smudge
packet:  git> testfile.dat
packet:  git> 7
packet:  git> CONTENT
packet:  git> 


The filter is expected to respond with the result content size as
ASCII number in bytes and the result content in packet format with
a flush packet at the end:

packet:  git< 57
packet:  git< SMUDGED_CONTENT
packet:  git< 

Please note: In a future version of Git the capability "stream"
might be supported. In that case the content size must not be
part of the filter response.

Afterwards the filter is expected to wait for the next command.

Signed-off-by: Lars Schneider 
Helped-by: Martin-Louis Bright 
Signed-off-by: Lars Schneider 
---
 Documentation/gitattributes.txt |  54 +++-
 convert.c   | 269 ++--
 t/t0021-conversion.sh   | 175 ++
 t/t0021/rot13-filter.pl | 146 ++
 4 files changed, 631 insertions(+), 13 deletions(-)
 create mode 100755 t/t0021/rot13-filter.pl

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 8882a3e..8fb40d2 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -300,7 +300,11 @@ checkout, when the `smudge` command is specified, the command is
 fed the blob object from its standard input, and its standard
 output is used to update the worktree file.  Similarly, the
 `clean` command is used to convert the contents of worktree file
-upon checkin.
+upon checkin. By default these commands process only a single
+blob and terminate. If a long running filter process (see section
+below) is used then Git can process all blobs with a single filter
+invocation for the entire life of a single Git command (e.g.
+`git add .`).
 
 One use of the content filtering is to massage the content into a shape
 that is more convenient for the platform, filesystem, and the user to use.
@@ -375,6 +379,54 @@ substitution.  For example:
 
 
 
+Long Running Filter Process
+^^^
+
+If the filter command (string value) is defined via
+filter.<driver>.process then Git can process all blobs with a
+single filter invocation for the entire life of a single Git
+command by talking with the following packet format (pkt-line)
+based protocol over standard input and standard output.
+
+Git starts the filter on first usage and expects a welcome
+message, protocol version number, and filter capabilities
+seperated by spaces:
+
+packet:  git< git-filter-protocol
+packet:  git< version 2
+packet:  git< clean smudge
+
+Supported filter capabilities are "clean" and "smudge".
+
+Afterwards Git sends a command (e.g. "smudge" or "clean" - based
+on the supported capabilities), the filename, the content size as
+ASCII number in bytes, and the content in packet format with a
+flush packet at the end:
+
+packet:  git> smudge
+packet:  git> testfile.dat
+packet:  git> 7
+packet:  git> CONTENT
+packet:  git> 
+
+
+The filter is expected to respond with the result content size as
+ASCII number in bytes and the result content in packet format with
+a flush packet at the end:
+
+packet:  git< 57
+packet:  git< SMUDGED_CONTENT
+packet:  git< 
+
+Please note: In a future version of Git the capability "stream"
+might be supported. In that case the content size must not be
+part of the filter response.
+
+Afterwards the filter is expected to wait for the next command.
+A demo implementation can be found in `t/t0021/rot13-filter.pl`
+located in the Git core repository.