> On 01 Aug 2016, at 00:19, Jakub Narębski <[email protected]> wrote:
>
> W dniu 30.07.2016 o 01:38, [email protected] pisze:
> [...]
>> +Please note that you cannot use an existing filter.<driver>.clean
>> +or filter.<driver>.smudge command as filter.<driver>.process
>> +command.
>
> I think it would be more readable and easier to understand to write:
>
> ... you cannot use an existing ... command with
> filter.<driver>.process
>
> About the style: wouldn't `filter.<driver>.process` be better?
OK, changed it!
>> As soon as Git would detect a file that needs to be
>> +processed by this filter, it would stop responding.
>
> This is quite convoluted, and hard to understand. I would say
> that because `clean` and `smudge` filters are expected to read
> first, while Git expects `process` filter to say first, using
> `clean` or `smudge` filter without changes as `process` filter
> would lead to git command deadlocking / hanging / stopping
> responding.
How about this:
Please note that you cannot use an existing `filter.<driver>.clean`
or `filter.<driver>.smudge` command with `filter.<driver>.process`
because the former two use a different inter process communication
protocol than the latter one. As soon as Git would detect a file
that needs to be processed by such an invalid "process" filter,
it would wait for a proper protocol handshake and appear "hanging".
>> +
>> +
>> Interaction between checkin/checkout attributes
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> diff --git a/convert.c b/convert.c
>> index 522e2c5..be6405c 100644
>> --- a/convert.c
>> +++ b/convert.c
>> @@ -3,6 +3,7 @@
>> #include "run-command.h"
>> #include "quote.h"
>> #include "sigchain.h"
>> +#include "pkt-line.h"
>>
>> /*
>> * convert.c - convert a file when checking it out and checking it in.
>> @@ -481,11 +482,355 @@ static int apply_filter(const char *path, const char
>> *src, size_t len, int fd,
>> return ret;
>> }
>>
>> +static int multi_packet_read(int fd_in, struct strbuf *sb, size_t
>> expected_bytes, int is_stream)
>
> About name of this function: `multi_packet_read` is fine, though I wonder
> if `packet_read_in_full` with nearly the same parameters as `packet_read`,
> or `packet_read_till_flush`, or `read_in_full_packetized` would be better.
I like `multi_packet_read` and will rename!
> Also, the problem is that while we know that what packet_read() stores
> would fit in memory (in size_t), it is not true for reading whole file,
> which might be very large - for example huge graphical assets like raw
> images or raw videos, or virtual machine images. Isn't that the goal
> of git-LFS solutions, which need this feature? Shouldn't we have then
> both `multi_packet_read_to_fd` and `multi_packet_read_to_buf`,
> or whatever?
Git LFS works well with the current clean/smudge mechanism that uses the
same on in memory buffers. I understand your concern but I think this
improvement is out of scope for this patch series.
> Also, if we have `fd_in`, then perhaps `sb_out`?
Agreed!
> I am also unsure if `expected_bytes` (or `expected_size`) should not be
> just a size hint, leaving handing mismatch between expected size and
> real size of output to the caller; then the `is_stream` would be not
> needed.
As mentioned in a previous email... I will drop the "size" support in
this patch series as it is not really needed.
>> +{
>> + int bytes_read;
>> + size_t total_bytes_read = 0;
>
> Why `bytes_read` is int, while `total_bytes_read` is size_t? Ah, I see
> that packet_read() returns an int. It should be ssize_t, just like
> read(), isn't it? But we know that packet size is limited, and would
> fit in an int (or would it?).
Yes, it is limited but I agree on ssize_t!
> Also, total_bytes_read could overflow size_t, but then we would have
> problems storing the result in strbuf.
Would that check be ok?
if (total_bytes_read > SIZE_MAX - bytes_read)
return 1; // `total_bytes_read` would overflow and is
not representable
>> + if (expected_bytes == 0 && !is_stream)
>> + return 0;
>
> So in all cases *except* size = 0 we expect flush packet after the
> contents, but size = 0 is a corner case without flush packet?
I agree that is inconsistent... I will change it!
>> +
>> + if (is_stream)
>> + strbuf_grow(sb, LARGE_PACKET_MAX); // allocate space
>> for at least one packet
>> + else
>> + strbuf_grow(sb, st_add(expected_bytes, 1)); // add one extra
>> byte for the packet flush
>> +
>> + do {
>> + bytes_read = packet_read(
>> + fd_in, NULL, NULL,
>> + sb->buf + total_bytes_read, sb->len - total_bytes_read
>> - 1,
>> + PACKET_READ_GENTLE_ON_EOF
>> + );
>> + if (bytes_read < 0)
>> + return 1; // unexpected EOF
>
> Don't we usually return negative numbers on error? Ah, I see that the
> return is a bool, which allows to use boolean expression with 'return'.
> But I am still unsure if it is good API, this return value.
According to Peff zero for success is the usual style:
http://public-inbox.org/git/20160728133523.GB21311%40sigill.intra.peff.net/
> If we move handling of size mismatch to the caller, then the function
> can simply return the size of data read (probably off_t or uint64_t).
> Then the caller can check if it is what it expected, and react accordingly.
True, but as discussed previously I will remove the size.
>> +
>> + if (is_stream &&
>> + bytes_read > 0 &&
>> + sb->len - total_bytes_read - 1 <= 0)
>> + strbuf_grow(sb, st_add(sb->len, LARGE_PACKET_MAX));
>> + total_bytes_read += bytes_read;
>> + }
>> + while (
>> + bytes_read > 0 && // the last packet was no
>> flush
>> + sb->len - total_bytes_read - 1 > 0 // we still have space left
>> in the buffer
>
> Ah, so buffer is resized only in the 'is_stream' case. Perhaps then
> use an "int options" instead of 'is_stream', and have one of flags
> tell if we should resize or not, that is if size parameter is hint
> or a strict limit.
Obsolete
>> + );
>> + strbuf_setlen(sb, total_bytes_read);
>> + return (is_stream ? 0 : expected_bytes != total_bytes_read);
>> +}
>> +
>> +static int multi_packet_write_from_fd(const int fd_in, const int fd_out)
>
> Is it equivalent of copy_fd() function, but where destination uses pkt-line
> and we need to pack data into pkt-lines?
Correct!
>> +{
>> + int did_fail = 0;
>> + ssize_t bytes_to_write;
>> + while (!did_fail) {
>> + bytes_to_write = xread(fd_in,
>> PKTLINE_DATA_START(packet_buffer), PKTLINE_DATA_LEN);
>
> Using global variable packet_buffer makes this code thread-unsafe, isn't it?
> But perhaps that is not a problem, because other functions are also
> using this global variable.
Correct!
> It is more of PKTLINE_DATA_MAXLEN, isn't it?
Agreed, will change!
>
>> + if (bytes_to_write < 0)
>> + return 1;
>> + if (bytes_to_write == 0)
>> + break;
>> + did_fail |= direct_packet_write(fd_out, packet_buffer,
>> PKTLINE_HEADER_LEN + bytes_to_write, 1);
>> + }
>> + if (!did_fail)
>> + did_fail = packet_flush_gentle(fd_out);
>
> Shouldn't we try to flush even if there was an error? Or is it
> that if there is an error writing, then there is some problem
> such that we know that flush would not work?
Right, that's what I though.
>> + return did_fail;
>
> Return true on fail? Shouldn't we follow example of copy_fd()
> from copy.c, and return COPY_READ_ERROR, or COPY_WRITE_ERROR,
> or PKTLINE_WRITE_ERROR?
OK. How about this?
static int multi_packet_write_from_fd(const int fd_in, const int fd_out)
{
int did_fail = 0;
ssize_t bytes_to_write;
while (!did_fail) {
bytes_to_write = xread(fd_in,
PKTLINE_DATA_START(packet_buffer), PKTLINE_DATA_MAXLEN);
if (bytes_to_write < 0)
return COPY_READ_ERROR;
if (bytes_to_write == 0)
break;
did_fail |= direct_packet_write(fd_out, packet_buffer,
PKTLINE_HEADER_LEN + bytes_to_write, 1);
}
if (!did_fail)
did_fail = packet_flush_gently(fd_out);
return (did_fail ? COPY_WRITE_ERROR : 0);
}
>> +}
>> +
>> +static int multi_packet_write_from_buf(const char *src, size_t len, int
>> fd_out)
>
> It is equivalent of write_in_full(), with different order of parameters,
> but where destination file descriptor expects pkt-line and we need to pack
> data into pkt-lines?
True. Do you suggest to reorder parameters? I also would like to rename `src`
to `src_in`, OK?
>
> NOTE: function description comments?
What do you mean here?
>> +{
>> + int did_fail = 0;
>> + size_t bytes_written = 0;
>> + size_t bytes_to_write;
>
> Note to self: bytes_to_write should fit in size_t, as it is limited to
> PKTLINE_DATA_LEN. bytes_written should fit in size_t, as it is at most
> len, which is of type size_t.
>
>> + while (!did_fail) {
>> + if ((len - bytes_written) > PKTLINE_DATA_LEN)
>> + bytes_to_write = PKTLINE_DATA_LEN;
>> + else
>> + bytes_to_write = len - bytes_written;
>> + if (bytes_to_write == 0)
>> + break;
>> + did_fail |= direct_packet_write_data(fd_out, src +
>> bytes_written, bytes_to_write, 1);
>> + bytes_written += bytes_to_write;
>
> Ah, I see now why we need both direct_packet_write() and
> direct_packet_write_data(). Nice abstraction, makes for
> clear code.
>
> The last parameter of '1' means 'gently', isn't it?
Correct. Thanks :)
>> + }
>> + if (!did_fail)
>> + did_fail = packet_flush_gentle(fd_out);
>> + return did_fail;
>> +}
>
> I think all three/four of those functions should be added in a separate
> commit, separate patch in patch series.
OK
> Namely:
>
> - for git -> filter:
> * read from fd, write pkt-line to fd (off_t)
> * read from str+len, write pkt-line to fd (size_t, ssize_t)
> - for filter -> git:
> * read pkt-line from fd, write to fd (off_t)
This one does not exist.
> * read pkt-line from fd, write to str+len (size_t, ssize_t)
>
> Perhaps some of those can be in one overloaded function, perhaps it would
> be easier to keep them separate.
I would like to keep them separate as it is easier to comprehend.
>
> Also, I do wonder how the fetch / push code spools pack file received
> over pkt-lines to disk. Can we reuse that code?
I haven't found any.
> Or maybe that code
> could use those new functions?
I think so, but this would be out of scope for this series :)
>> +
>> +#define FILTER_CAPABILITIES_STREAM 0x1
>> +#define FILTER_CAPABILITIES_CLEAN 0x2
>> +#define FILTER_CAPABILITIES_SMUDGE 0x4
>> +#define FILTER_CAPABILITIES_SHUTDOWN 0x8
>> +#define FILTER_SUPPORTS_STREAM(type) ((type) & FILTER_CAPABILITIES_STREAM)
>> +#define FILTER_SUPPORTS_CLEAN(type) ((type) & FILTER_CAPABILITIES_CLEAN)
>> +#define FILTER_SUPPORTS_SMUDGE(type) ((type) & FILTER_CAPABILITIES_SMUDGE)
>> +#define FILTER_SUPPORTS_SHUTDOWN(type) ((type) &
>> FILTER_CAPABILITIES_SHUTDOWN)
>> +
>> +struct cmd2process {
>> + struct hashmap_entry ent; /* must be the first member! */
>> + const char *cmd;
>> + int supported_capabilities;
>
> I wonder if switching from int (perhaps with field width of 1 to denote
> that it is boolean-like flag) to mask makes it more readable, or less.
> But I think it is.
>
>
> Reading Documentation/technical/api-hashmap.txt I found the following
> recommendation:
>
> `struct hashmap_entry`::
>
> An opaque structure representing an entry in the hash table, which must
> be used as first member of user data structures. Ideally it should be
> followed by an int-sized member to prevent unused memory on 64-bit
> systems due to alignment.
>
> Therefore it "int supported_capabilities" should precede
> "const char *cmd", I think. Though it is not strictly necessary; it
> is not as if this hash table were large (maximum size is limited by
> the number of filter drivers configured), so we don't waste much space
> due to internal padding / due to alignment.
Thanks! I will change it to your suggestion anyway!
>
>> + struct child_process process;
>> +};
>> +
>> +static int cmd_process_map_initialized = 0;
>> +static struct hashmap cmd_process_map;
>
> Reading Documentation/technical/api-hashmap.txt I see that:
>
> `tablesize` is the allocated size of the hash table. A non-0 value indicates
> that the hashmap is initialized.
>
> So cmd_process_map_initialized is not really needed, is it?
I copied that from config.c:
https://github.com/git/git/blob/f8f7adce9fc50a11a764d57815602dcb818d1816/config.c#L1425-L1428
`git grep "tablesize"` reveals that the check for `tablesize` is only used
in hashmap.c ... so what approach should we use?
>> +
>> +static int cmd2process_cmp(const struct cmd2process *e1,
>> + const struct
>> cmd2process *e2,
>> + const void *unused)
>> +{
>> + return strcmp(e1->cmd, e2->cmd);
>> +}
>
> Well, to be exact (which is decidely not needed!) two commands might
> be equivalent not being identical as strings (e.g. extra space between
> parameters). But it is something the user should care about, not Git.
>
>> +
>> +static struct cmd2process *find_protocol2_filter_entry(struct hashmap
>> *hashmap, const char *cmd)
>
> I'm not sure if *_protocol2_* is needed; those functions are static,
> local to convert.c.
I want to make sure that the reader understands that these functions are
related to the filter protocol version 2. Not OK?
>> +{
>> + struct cmd2process k;
>
> Does this name of variable 'k' follow established convention?
> 'key' would be more descriptive, but it's not as if this function
> was long; so 'k' is all right, I think.
I agree on "key".
>
>> + hashmap_entry_init(&k, strhash(cmd));
>> + k.cmd = cmd;
>> + return hashmap_get(hashmap, &k, NULL);
>> +}
>> +
>> +static void kill_protocol2_filter(struct hashmap *hashmap, struct
>> cmd2process *entry) {
>
> Programming style: the opening brace should be on separate line,
> that is:
>
> +static void kill_protocol2_filter(struct hashmap *hashmap, struct
> cmd2process *entry)
> +{
Agreed!
>> + if (!entry)
>> + return;
>> + sigchain_push(SIGPIPE, SIG_IGN);
>> + close(entry->process.in);
>> + close(entry->process.out);
>> + sigchain_pop(SIGPIPE);
>> + finish_command(&entry->process);
>> + child_process_clear(&entry->process);
>> + hashmap_remove(hashmap, entry, NULL);
>> + free(entry);
>> +}
>
> All those, from #define FILTER_CAPABILITIES_ to here could be put
> in a separate patch, to reduce size of this one. But I am less
> sure that it is worth it for this case.
>
>> +
>> +void shutdown_protocol2_filter(pid_t pid)
>> +{
> [...]
>
> In my opinion this should be postponed to a separate commit.
Agreed!
>
>> +}
>> +
>> +static struct cmd2process *start_protocol2_filter(struct hashmap *hashmap,
>> const char *cmd)
>
> This has some parts in common with existing filter_buffer_or_fd().
> I wonder if it would be worth to extract those common parts.
>
> But perhaps it would be better to leave such refactoring for later.
>
>> +{
>> + int did_fail;
>> + struct cmd2process *entry;
>> + struct child_process *process;
>> + const char *argv[] = { cmd, NULL };
>> + struct string_list capabilities = STRING_LIST_INIT_NODUP;
>> + char *capabilities_buffer;
>> + int i;
>> +
>> + entry = xmalloc(sizeof(*entry));
>> + hashmap_entry_init(entry, strhash(cmd));
>> + entry->cmd = cmd;
>> + entry->supported_capabilities = 0;
>> + process = &entry->process;
>> +
>> + child_process_init(process);
>
> filter_buffer_or_fd() uses instead
>
> struct child_process child_process = CHILD_PROCESS_INIT;
>
> But I see that you need to access &entry->process anyway, so you
> need to have it here, and in this case child_process_init() is
> equivalent.
>
> I wonder if it would be worth it to use strbuf for cmd.
What do you mean by "worth it to use strbuf for cmd"? Why would
we need a strbuf?
>> + process->argv = argv;
>> + process->use_shell = 1;
>> + process->in = -1;
>> + process->out = -1;
>> + process->clean_on_exit = 1;
>> + process->clean_on_exit_handler = shutdown_protocol2_filter;
>
> These two lines are new, and related to the "shutdown" capability, isn't it?
Yes.
>
>> +
>> + if (start_command(process)) {
>> + error("cannot fork to run external filter '%s'", cmd);
>> + kill_protocol2_filter(hashmap, entry);
>
> I guess the alternative solution of adding filter to the hashmap only
> after starting the process would be racy?
>
> Ah, disregard that. I see that this pattern is a common way to error
> out in this function (for process-related errors).
>
>> + return NULL;
>> + }
>> +
>> + sigchain_push(SIGPIPE, SIG_IGN);
>> + did_fail = strcmp(packet_read_line(process->out, NULL),
>> "git-filter-protocol");
>> + if (!did_fail)
>> + did_fail |= strcmp(packet_read_line(process->out, NULL),
>> "version 2");
>> + if (!did_fail)
>> + capabilities_buffer = packet_read_line(process->out, NULL);
>> + else
>> + capabilities_buffer = NULL;
>> + sigchain_pop(SIGPIPE);
>> +
>> + if (!did_fail && capabilities_buffer) {
>> + string_list_split_in_place(&capabilities, capabilities_buffer,
>> ' ', -1);
>> + if (capabilities.nr > 1 &&
>> + !strcmp(capabilities.items[0].string, "capabilities")) {
>> + for (i = 1; i < capabilities.nr; i++) {
>> + const char *requested =
>> capabilities.items[i].string;
>> + if (!strcmp(requested, "stream")) {
>> + entry->supported_capabilities |=
>> FILTER_CAPABILITIES_STREAM;
>> + } else if (!strcmp(requested, "clean")) {
>> + entry->supported_capabilities |=
>> FILTER_CAPABILITIES_CLEAN;
>> + } else if (!strcmp(requested, "smudge")) {
>> + entry->supported_capabilities |=
>> FILTER_CAPABILITIES_SMUDGE;
>> + } else if (!strcmp(requested, "shutdown")) {
>> + entry->supported_capabilities |=
>> FILTER_CAPABILITIES_SHUTDOWN;
>> + } else {
>> + warning(
>> + "external filter '%s' requested
>> unsupported filter capability '%s'",
>> + cmd, requested
>> + );
>> + }
>> + }
>> + } else {
>> + error("filter capabilities not found");
>> + did_fail = 1;
>> + }
>> + string_list_clear(&capabilities, 0);
>> + }
>
> I wonder if the above conditional wouldn't be better to be put in
> a separate function, parse_filter_capabilities(capabilities_buffer),
> returning a mask, or having mask as an out parameter, and returning
> an error condition.
Agreed.
>> +
>> + if (did_fail) {
>> + error("initialization for external filter '%s' failed", cmd);
>
> More detailed information not needed, because one can use GIT_PACKET_TRACE.
> Would it be worth add this information as a kind of advice, or put it
> in the documentation of the `process` option?
I will put it into the docs.
>
>> + kill_protocol2_filter(hashmap, entry);
>> + return NULL;
>> + }
>> +
>> + hashmap_add(hashmap, entry);
>> + return entry;
>> +}
>> +
>> +static int apply_protocol2_filter(const char *path, const char *src, size_t
>> len,
>> + int fd, struct strbuf *dst,
>> const char *cmd,
>> + const int wanted_capability)
>
> apply_protocol2_filter, or apply_process_filter? Or rather,
> s/_protocol2_/_process_/g ?
Mh. I wanted to convey that this functions is protocol V2 related...
>
> This is equivalent to
>
> static int apply_filter(const char *path, const char *src, size_t len, int
> fd,
> struct strbuf *dst, const char *cmd)
>
> Could we have extended that one instead?
Initially I had one function but that got kind of long ... I prefer two for now.
>> +{
>> + int ret = 1;
>> + struct cmd2process *entry;
>> + struct child_process *process;
>> + struct stat file_stat;
>> + struct strbuf nbuf = STRBUF_INIT;
>> + size_t expected_bytes = 0;
>> + char *strtol_end;
>> + char *strbuf;
>> + char *filter_type;
>> + char *filter_result = NULL;
>> +
>
>> + if (!cmd || !*cmd)
>> + return 0;
>> +
>> + if (!dst)
>> + return 1;
>
> This is the same as in apply_filter().
>
>> +
>> + if (!cmd_process_map_initialized) {
>> + cmd_process_map_initialized = 1;
>> + hashmap_init(&cmd_process_map, (hashmap_cmp_fn)
>> cmd2process_cmp, 0);
>> + entry = NULL;
>> + } else {
>> + entry = find_protocol2_filter_entry(&cmd_process_map, cmd);
>> + }
>
> Here we try to find existing process, rather than starting new
> as in apply_filter()
>
>> +
>> + fflush(NULL);
>
> This is the same as in apply_filter(), but I wonder what it is for.
"If the stream argument is NULL, fflush() flushes all
open output streams."
http://man7.org/linux/man-pages/man3/fflush.3.html
>
>> +
>> + if (!entry) {
>> + entry = start_protocol2_filter(&cmd_process_map, cmd);
>> + if (!entry) {
>> + return 0;
>> + }
>
> Style; we prefer:
>
> + if (!entry)
> + return 0;
Agreed.
> This is very similar to apply_filter(), but the latter uses start_async()
> from "run-command.h", with filter_buffer_or_fd() as asynchronous process,
> which gets passed command to run in struct filter_params. In this
> function start_protocol2_filter() runs start_command(), synchronous API.
>
> Why the difference?
The protocol V2 requires a sequential processing of the packets. See
discussion with Junio here:
http://public-inbox.org/git/xmqqbn1th5qn.fsf%40gitster.mtv.corp.google.com/
[LONG SNIP]
I will answer the second half in a separate email.
Thanks for the review,
Lars
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html