Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
Lars Schneiderwrites: >> Hmm, I would have expected that the basic flow would become >> >> for each paths to be processed: >> convert-to-worktree to buf >> if not delayed: >> do the caller's thing to use buf >> else: >> remember path >> >> for each delayed paths: >> ensure filter process finished processing for path >> fetch the thing to buf from the process >> do the caller's thing to use buf >> >> and that would make quite a lot of sense. However, what is actually >> implemented is a bit disappointing from that point of view. While >> its first part is the same as above, the latter part instead does: >> >> for each delayed paths: >> checkout the path >> ... > > I am not sure I can follow you here. > ... > I implemented the "checkout_delayed_entries" function in v1 because > it solved the problem with minimal changes in the existing code. Our previous > discussion made me think that this is the preferred way: > > I do not think we want to see such a rewrite all over the > codepaths. It might be OK to add such a "these entries are known > to be delayed" list in struct checkout so that the above becomes > more like this: > >for (i = 0; i < active_nr; i++) > checkout_entry(active_cache[i], state, NULL); > + checkout_entry_finish(state); > > That is, addition of a single "some of the checkout_entry() calls > done so far might have been lazy, and I'll give them a chance to > clean up" might be palatable. Anything more than that on the > caller side is not. But that is apples-and-oranges comparision, no? The old discussion assumes there is no "caller's thing to use buf" other than "checkout to the working tree", which is why the function its main loop calls is "checkout_entry()" and the caller does not see the contents of the filtered blob at all. In that context, checkout_entry_finish() that equally does not let the caller see the contents of the filtered blob is quite appropriate.
Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
Jakub Narębskiwrites: >> Yes, this problem happens every day with filters that perform network >> requests (e.g. GitLFS). > > Do I understand it correctly that the expected performance improvement > thanks to this feature is possible only if there is some amount of > parallelism and concurrency in the filter? That is, filter can be sending > one blob to Git while processing other one, or filter can be fetching blobs > in parallel. The first-object latency may not be helped, but by allowing "delayed", the latency to retrieve the second and subsequent objects can be hidden, I would think.
Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
On Wed, Jan 11, 2017 at 11:13:00AM +0100, Lars Schneider wrote: > > In v1 I implemented a) with the busy-loop problem in mind. > > My thinking was this: > > If the filter sees at least one filter request twice then the filter knows > that > Git has already requested all files that require filtering. At that point the > filter could just block the "delayed" answer to the latest filter request > until > at least one of the previously delayed requests can be fulfilled. Then the > filter > answers "delay" to Git until Git requests the blob that can be fulfilled. This > process cycles until all requests can be fulfilled. Wouldn't that work? > > I think a "done" message by the filter is not easy. Right now the protocol > works > in a mode were Git always asks and the filter always answers. I believe > changing > the filter to be able to initiate a "done" message would complicated the > protocol. > > > Additionally, the protocol should specify a sentinel "no more entries" value > > that could be sent from Git to the filter to signal that there are no more > > files > > to checkout. Some filters may implement mechanisms for converting files that > > require a signal to know when all files have been sent. Specifically, Git > > LFS > > (https://git-lfs.github.com) batches files to be transferred together, and > > needs > > to know when all files have been announced to truncate and send the last > > batch, > > if it is not yet full. I'm sure other filter implementations use a similar > > mechanism and would benefit from this as well. > > I agree. I think the filter already has this info implicitly as explained > above > but an explicit message would be better! This works, but forces some undesirable constraints: - The filter has to keep track of all items in the checkout to determine how many times Git has asked about them. This is probably fine, though annoying. Cases where this is not fine is when there are _many_ items in the checkout forcing filter implementations to keep track of large sets of data. - The protocol is now _must_ ask for the status of checkout items in a particular order. If the assumption is that "Git has sent me all items in the checkout by the point I see Git ask for the status of an item more than once", then Git _cannot_ ask for the status of a delayed item while there are still more items to checkout. This could be OK, so long as the protocol commits to it and documents this behavior. What are your thoughts? -- Thanks, Taylor Blau ttayl...@github.com
Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
W dniu 11.01.2017 o 11:20, Lars Schneider pisze: > On 10 Jan 2017, at 23:11, Jakub Narębskiwrote: >> W dniu 09.01.2017 o 00:42, Junio C Hamano pisze: >>> larsxschnei...@gmail.com writes: From: Lars Schneider Some `clean` / `smudge` filters might require a significant amount of time to process a single blob. During this process the Git checkout operation is blocked and Git needs to wait until the filter is done to continue with the checkout. >> >> Lars, what is expected use case for this feature; that is when do you >> think this problem may happen? Is it something that happened IRL? > > Yes, this problem happens every day with filters that perform network > requests (e.g. GitLFS). Do I understand it correctly that the expected performance improvement thanks to this feature is possible only if there is some amount of parallelism and concurrency in the filter? That is, filter can be sending one blob to Git while processing other one, or filter can be fetching blobs in parallel. This means that filter process works as a kind of (de)multiplexer for fetching and/or processing blob contents, I think. > [...] In GitLFS we even implemented Git wrapper > commands to address the problem: https://github.com/git-lfs/git-lfs/pull/988 > The ultimate goal of this patch is to be able to get rid of the wrapper > commands. I'm sorry, I don't see it how the wrapper helps here. > Teach the filter process protocol (introduced in edcc858) to accept the status "delayed" as response to a filter request. Upon this response Git continues with the checkout operation and asks the filter to process the blob again after all other blobs have been processed. >>> >>> Hmm, I would have expected that the basic flow would become >>> >>> for each paths to be processed: >>> convert-to-worktree to buf >>> if not delayed: >>> do the caller's thing to use buf >>> else: >>> remember path >>> >>> for each delayed paths: >>> ensure filter process finished processing for path >>> fetch the thing to buf from the process >>> do the caller's thing to use buf >> >> I would expect here to have a kind of event loop, namely >> >>while there are delayed paths: >>get path that is ready from filter >>fetch the thing to buf (supporting "delayed") >>if path done >>do the caller's thing to use buf >>(e.g. finish checkout path, eof convert, etc.) >> >> We can either trust filter process to tell us when it finished sending >> delayed paths, or keep list of paths that are being delayed in Git. > > I could implement "get path that is ready from filter" but wouldn't > that complicate the filter protocol? I think we can use the protocol pretty > much as if with the strategy outlined here: > http://public-inbox.org/git/f533857d-9b51-44c1-8889-aa0542ad8...@gmail.com/ You are talking about the "busy-loop" solution, isn't it? In the same notation, it would look like this: while there are delayed paths: for each delayed path: request path from filter [1] fetch the thing (supporting "delayed") [2] if path done do the caller's thing to use buf remove it from delayed paths list Footnotes: -- 1) We don't send the Git-side contents of blob again, isn't it? So we need some protocol extension / new understanding anyway. for example that we don't send contents if we request path again. 2) If path is not ready at all, filter protocol would send status=delayed with empty contents. This means that we would immediately go to the next path, if there is one. There are some cases where busy loop is preferable, but I don't think it is the case here. The event loop solution would require additional protocol extension, but I don't think those complicate protocol too much: A. Git would need to signal filter process that it has sent all paths, and that it should be sending delayed paths when they are ready. This could be done for example with "command=continue". packet: git> command=continue B. Filter driver, in the event-loop phase, when (de)multiplexing fetching or processing of data, it would need now to initialize transfer, instead of waiting for Git to ask. It could look like this: packet: git< status=resumed [3] packet: git< pathname=file/to/be/resumed [4] packet: git< packet: git< SMUDGED_CONTENT_CONTINUED packet: git< packet: git< # empty list, means "status=success" [5] Footnotes: -- 3.) It could be "status=success", "status=delayed",
Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
> On 10 Jan 2017, at 23:11, Jakub Narębskiwrote: > > W dniu 09.01.2017 o 00:42, Junio C Hamano pisze: >> larsxschnei...@gmail.com writes: >>> From: Lars Schneider >>> >>> Some `clean` / `smudge` filters might require a significant amount of >>> time to process a single blob. During this process the Git checkout >>> operation is blocked and Git needs to wait until the filter is done to >>> continue with the checkout. > > Lars, what is expected use case for this feature; that is when do you > think this problem may happen? Is it something that happened IRL? Yes, this problem happens every day with filters that perform network requests (e.g. GitLFS). In GitLFS we even implemented Git wrapper commands to address the problem: https://github.com/git-lfs/git-lfs/pull/988 The ultimate goal of this patch is to be able to get rid of the wrapper commands. >>> Teach the filter process protocol (introduced in edcc858) to accept the >>> status "delayed" as response to a filter request. Upon this response Git >>> continues with the checkout operation and asks the filter to process the >>> blob again after all other blobs have been processed. >> >> Hmm, I would have expected that the basic flow would become >> >> for each paths to be processed: >> convert-to-worktree to buf >> if not delayed: >> do the caller's thing to use buf >> else: >> remember path >> >> for each delayed paths: >> ensure filter process finished processing for path >> fetch the thing to buf from the process >> do the caller's thing to use buf > > I would expect here to have a kind of event loop, namely > >while there are delayed paths: >get path that is ready from filter >fetch the thing to buf (supporting "delayed") >if path done >do the caller's thing to use buf >(e.g. finish checkout path, eof convert, etc.) > > We can either trust filter process to tell us when it finished sending > delayed paths, or keep list of paths that are being delayed in Git. I could implement "get path that is ready from filter" but wouldn't that complicate the filter protocol? I think we can use the protocol pretty much as if with the strategy outlined here: http://public-inbox.org/git/f533857d-9b51-44c1-8889-aa0542ad8...@gmail.com/ Thanks, Lars
Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
> On 10 Jan 2017, at 00:38, Taylor Blauwrote: > > I've been considering some alternative approaches in order to make the > communication between Git and any extension that implements this protocol more > intuitive. > > In particular, I'm considering alternatives to: > >> for each delayed paths: >> ensure filter process finished processing for path >> fetch the thing to buf from the process >> do the caller's thing to use buf > > As I understand it, the above sequence of steps would force Git to either: > > a) loop over all delayed paths and ask the filter if it's done processing, > creating a busy-loop between the filter and Git, or... > b) loop over all delayed paths sequentially, checking out each path in > sequence > > I would like to avoid both of those situations, and instead opt for an > asynchronous approach. In (a), the protocol is far too chatty. In (b), the > protocol is much less chatty, but forces the checkout to be the very last > step, > which has negative performance implications on checkouts with many large > files. > > For instance, checking out several multi-gigabyte files one after the other > means that a significant amount of time is lost while the filter has some of > the > items ready. Instead of checking them out as they become available, Git waits > until the very end when they are all available. > > I think it would be preferable for the protocol to specify a sort of "done" > signal against each path such that Git could check out delayed paths as they > become available. If implemented this way, Git could checkout files > asynchronously, while the filter continues to do work on the other end. In v1 I implemented a) with the busy-loop problem in mind. My thinking was this: If the filter sees at least one filter request twice then the filter knows that Git has already requested all files that require filtering. At that point the filter could just block the "delayed" answer to the latest filter request until at least one of the previously delayed requests can be fulfilled. Then the filter answers "delay" to Git until Git requests the blob that can be fulfilled. This process cycles until all requests can be fulfilled. Wouldn't that work? I think a "done" message by the filter is not easy. Right now the protocol works in a mode were Git always asks and the filter always answers. I believe changing the filter to be able to initiate a "done" message would complicated the protocol. > Additionally, the protocol should specify a sentinel "no more entries" value > that could be sent from Git to the filter to signal that there are no more > files > to checkout. Some filters may implement mechanisms for converting files that > require a signal to know when all files have been sent. Specifically, Git LFS > (https://git-lfs.github.com) batches files to be transferred together, and > needs > to know when all files have been announced to truncate and send the last > batch, > if it is not yet full. I'm sure other filter implementations use a similar > mechanism and would benefit from this as well. I agree. I think the filter already has this info implicitly as explained above but an explicit message would be better! Thanks, Lars
Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
> On 08 Jan 2017, at 21:45, Eric Wongwrote: > > larsxschnei...@gmail.com wrote: >> +++ b/t/t0021/rot13-filter.pl > >> +$DELAY{'test-delay1.r'} = 1; >> +$DELAY{'test-delay3.r'} = 3; >> >> open my $debug, ">>", "rot13-filter.log" or die "cannot open log file: $!"; >> >> @@ -166,6 +176,15 @@ while (1) { >> packet_txt_write("status=abort"); >> packet_flush(); >> } >> +elsif ( $command eq "smudge" and >> +exists $DELAY{$pathname} and >> +$DELAY{$pathname} gt 0 ) { > > Use '>' for numeric comparisons. 'gt' is for strings (man perlop) Still learning Perl :-) > Sidenote, staying <= 80 columns for the rest of the changes is > strongly preferred, some of us need giant fonts. I think what > Torsten said about introducing a new *_internal function can > also help with that. OK! Thank you, Lars
Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
> On 08 Jan 2017, at 21:14, Torsten Bögershausenwrote: > > On Sun, Jan 08, 2017 at 08:17:36PM +0100, larsxschnei...@gmail.com wrote: >> From: Lars Schneider >> >> Some `clean` / `smudge` filters might require a significant amount of >> time to process a single blob. During this process the Git checkout >> operation is blocked and Git needs to wait until the filter is done to >> continue with the checkout. >> >> Teach the filter process protocol (introduced in edcc858) to accept the >> status "delayed" as response to a filter request. Upon this response Git >> continues with the checkout operation and asks the filter to process the >> blob again after all other blobs have been processed. >> >> Git has a multiple code paths that checkout a blob. Support delayed >> checkouts only in `clone` (in unpack-trees.c) and `checkout` operations. >> >> Signed-off-by: Lars Schneider >> --- >> > > Some feeling tells me that it may be better to leave > convert_to_working_tree() as it is. > And change convert_to_working_tree_internal as suggested: > > int convert_to_working_tree(const char *path, const char *src, size_t len, > struct strbuf *dst) > { > - return convert_to_working_tree_internal(path, src, len, dst, 0); > + return convert_to_working_tree_internal(path, src, len, dst, NULL, 0); > } If I do this then I would have no way to communicate to the caller that the processing is delayed. Consequently the caller would not know that an additional call is necessary to fetch the result. Thanks, Lars
Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
> On 09 Jan 2017, at 00:42, Junio C Hamanowrote: > > larsxschnei...@gmail.com writes: > >> From: Lars Schneider >> >> Some `clean` / `smudge` filters might require a significant amount of >> time to process a single blob. During this process the Git checkout >> operation is blocked and Git needs to wait until the filter is done to >> continue with the checkout. >> >> Teach the filter process protocol (introduced in edcc858) to accept the >> status "delayed" as response to a filter request. Upon this response Git >> continues with the checkout operation and asks the filter to process the >> blob again after all other blobs have been processed. > > Hmm, I would have expected that the basic flow would become > > for each paths to be processed: > convert-to-worktree to buf > if not delayed: > do the caller's thing to use buf > else: > remember path > > for each delayed paths: > ensure filter process finished processing for path > fetch the thing to buf from the process > do the caller's thing to use buf > > and that would make quite a lot of sense. However, what is actually > implemented is a bit disappointing from that point of view. While > its first part is the same as above, the latter part instead does: > > for each delayed paths: > checkout the path > > Presumably, checkout_entry() does the "ensure that the process is > done converting" (otherwise the result is simply buggy), but what > disappoints me is that this does not allow callers that call > "convert-to-working-tree", whose interface is obtain the bytestream > in-core in the working tree representation, given an object in the > object-db representation in an in-core buffer, to _use_ the result > of the conversion. The caller does not have a chance to even see > the result as it is written straight to the filesystem, once it > calls checkout_delayed_entries(). I am not sure I can follow you here. A caller of "convert_to_working_tree" would indeed see filtered result. Consider the following example. The filter delays the conversion twice and responds with the filtered results on the third call: CALL: int convert_to_working_tree(*src=='CONTENT', *dst, *delayed==0) RESPONSE: return == 1; *delayed == 1, *dst=='' CALL: int convert_to_working_tree(*src=='CONTENT', *dst, *delayed==0) RESPONSE: return == 1; *delayed == 1, *dst=='' CALL: int convert_to_working_tree(*src=='CONTENT', *dst, *delayed==0) RESPONSE: return == 1; *delayed == 0, *dst=='FILTERED_CONTENT' I implemented the "checkout_delayed_entries" function in v1 because it solved the problem with minimal changes in the existing code. Our previous discussion made me think that this is the preferred way: I do not think we want to see such a rewrite all over the codepaths. It might be OK to add such a "these entries are known to be delayed" list in struct checkout so that the above becomes more like this: for (i = 0; i < active_nr; i++) checkout_entry(active_cache[i], state, NULL); + checkout_entry_finish(state); That is, addition of a single "some of the checkout_entry() calls done so far might have been lazy, and I'll give them a chance to clean up" might be palatable. Anything more than that on the caller side is not. c.f. http://public-inbox.org/git/xmqqvavotych@gitster.mtv.corp.google.com/ Thanks, Lars
Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
On Tue, Jan 10, 2017 at 11:11:01PM +0100, Jakub Narębski wrote: > W dniu 09.01.2017 o 00:42, Junio C Hamano pisze: > > larsxschnei...@gmail.com writes: > >> From: Lars Schneider> >> > >> Some `clean` / `smudge` filters might require a significant amount of > >> time to process a single blob. During this process the Git checkout > >> operation is blocked and Git needs to wait until the filter is done to > >> continue with the checkout. > > Lars, what is expected use case for this feature; that is when do you > think this problem may happen? Is it something that happened IRL? > > >> > >> Teach the filter process protocol (introduced in edcc858) to accept the > >> status "delayed" as response to a filter request. Upon this response Git > >> continues with the checkout operation and asks the filter to process the > >> blob again after all other blobs have been processed. > > > > Hmm, I would have expected that the basic flow would become > > > > for each paths to be processed: > > convert-to-worktree to buf > > if not delayed: > > do the caller's thing to use buf > > else: > > remember path > > > > for each delayed paths: > > ensure filter process finished processing for path > > fetch the thing to buf from the process > > do the caller's thing to use buf > > I would expect here to have a kind of event loop, namely > > while there are delayed paths: > get path that is ready from filter > fetch the thing to buf (supporting "delayed") > if path done > do the caller's thing to use buf > (e.g. finish checkout path, eof convert, etc.) > > We can either trust filter process to tell us when it finished sending > delayed paths, or keep list of paths that are being delayed in Git. This makes a lot of sense to me. The "get path that is ready from filter" should block until the filter has data that it is ready to send. This way Git isn't wasting time in a busy-loop asking whether the filter has data ready to be sent. It also means that if the filter has one large chunk that it's ready to write, Git can work on that while the filter continues to process more data, theoretically improving the performance of checkouts with many large delayed objects. > > > > > and that would make quite a lot of sense. However, what is actually > > implemented is a bit disappointing from that point of view. While > > its first part is the same as above, the latter part instead does: > > > > for each delayed paths: > > checkout the path > > > > Presumably, checkout_entry() does the "ensure that the process is > > done converting" (otherwise the result is simply buggy), but what > > disappoints me is that this does not allow callers that call > > "convert-to-working-tree", whose interface is obtain the bytestream > > in-core in the working tree representation, given an object in the > > object-db representation in an in-core buffer, to _use_ the result > > of the conversion. The caller does not have a chance to even see > > the result as it is written straight to the filesystem, once it > > calls checkout_delayed_entries(). > > > In addition to the above, I'd also like to investigate adding a "no more items" message into the filter protocol. This would be useful for filters that batch delayed items into groups. In particular, if the batch size is `N`, and Git sends `2N-1` items, the second batch will be under-filled. The filter on the other end needs some mechanism to send the second batch, even though it hasn't hit max capacity. Specifically, this is how Git LFS implements object transfers for data it does not have locally, but I'm sure that this sort of functionality would be useful for other filter implementations as well. -- Thanks, Taylor Blau ttayl...@github.com
Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
W dniu 09.01.2017 o 00:42, Junio C Hamano pisze: > larsxschnei...@gmail.com writes: >> From: Lars Schneider>> >> Some `clean` / `smudge` filters might require a significant amount of >> time to process a single blob. During this process the Git checkout >> operation is blocked and Git needs to wait until the filter is done to >> continue with the checkout. Lars, what is expected use case for this feature; that is when do you think this problem may happen? Is it something that happened IRL? >> >> Teach the filter process protocol (introduced in edcc858) to accept the >> status "delayed" as response to a filter request. Upon this response Git >> continues with the checkout operation and asks the filter to process the >> blob again after all other blobs have been processed. > > Hmm, I would have expected that the basic flow would become > > for each paths to be processed: > convert-to-worktree to buf > if not delayed: > do the caller's thing to use buf > else: > remember path > > for each delayed paths: > ensure filter process finished processing for path > fetch the thing to buf from the process > do the caller's thing to use buf I would expect here to have a kind of event loop, namely while there are delayed paths: get path that is ready from filter fetch the thing to buf (supporting "delayed") if path done do the caller's thing to use buf (e.g. finish checkout path, eof convert, etc.) We can either trust filter process to tell us when it finished sending delayed paths, or keep list of paths that are being delayed in Git. > > and that would make quite a lot of sense. However, what is actually > implemented is a bit disappointing from that point of view. While > its first part is the same as above, the latter part instead does: > > for each delayed paths: > checkout the path > > Presumably, checkout_entry() does the "ensure that the process is > done converting" (otherwise the result is simply buggy), but what > disappoints me is that this does not allow callers that call > "convert-to-working-tree", whose interface is obtain the bytestream > in-core in the working tree representation, given an object in the > object-db representation in an in-core buffer, to _use_ the result > of the conversion. The caller does not have a chance to even see > the result as it is written straight to the filesystem, once it > calls checkout_delayed_entries(). >
Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
larsxschnei...@gmail.com writes: > From: Lars Schneider> > Some `clean` / `smudge` filters might require a significant amount of > time to process a single blob. During this process the Git checkout > operation is blocked and Git needs to wait until the filter is done to > continue with the checkout. > > Teach the filter process protocol (introduced in edcc858) to accept the > status "delayed" as response to a filter request. Upon this response Git > continues with the checkout operation and asks the filter to process the > blob again after all other blobs have been processed. Hmm, I would have expected that the basic flow would become for each paths to be processed: convert-to-worktree to buf if not delayed: do the caller's thing to use buf else: remember path for each delayed paths: ensure filter process finished processing for path fetch the thing to buf from the process do the caller's thing to use buf and that would make quite a lot of sense. However, what is actually implemented is a bit disappointing from that point of view. While its first part is the same as above, the latter part instead does: for each delayed paths: checkout the path Presumably, checkout_entry() does the "ensure that the process is done converting" (otherwise the result is simply buggy), but what disappoints me is that this does not allow callers that call "convert-to-working-tree", whose interface is obtain the bytestream in-core in the working tree representation, given an object in the object-db representation in an in-core buffer, to _use_ the result of the conversion. The caller does not have a chance to even see the result as it is written straight to the filesystem, once it calls checkout_delayed_entries().
Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
larsxschnei...@gmail.com wrote: > +++ b/t/t0021/rot13-filter.pl > +$DELAY{'test-delay1.r'} = 1; > +$DELAY{'test-delay3.r'} = 3; > > open my $debug, ">>", "rot13-filter.log" or die "cannot open log file: $!"; > > @@ -166,6 +176,15 @@ while (1) { > packet_txt_write("status=abort"); > packet_flush(); > } > + elsif ( $command eq "smudge" and > + exists $DELAY{$pathname} and > + $DELAY{$pathname} gt 0 ) { Use '>' for numeric comparisons. 'gt' is for strings (man perlop) Sidenote, staying <= 80 columns for the rest of the changes is strongly preferred, some of us need giant fonts. I think what Torsten said about introducing a new *_internal function can also help with that. Thanks.
Re: [PATCH v1] convert: add "status=delayed" to filter process protocol
On Sun, Jan 08, 2017 at 08:17:36PM +0100, larsxschnei...@gmail.com wrote: > From: Lars Schneider> > Some `clean` / `smudge` filters might require a significant amount of > time to process a single blob. During this process the Git checkout > operation is blocked and Git needs to wait until the filter is done to > continue with the checkout. > > Teach the filter process protocol (introduced in edcc858) to accept the > status "delayed" as response to a filter request. Upon this response Git > continues with the checkout operation and asks the filter to process the > blob again after all other blobs have been processed. > > Git has a multiple code paths that checkout a blob. Support delayed > checkouts only in `clone` (in unpack-trees.c) and `checkout` operations. > > Signed-off-by: Lars Schneider > --- > Some feeling tells me that it may be better to leave convert_to_working_tree() as it is. And change convert_to_working_tree_internal as suggested: int convert_to_working_tree(const char *path, const char *src, size_t len, struct strbuf *dst) { - return convert_to_working_tree_internal(path, src, len, dst, 0); + return convert_to_working_tree_internal(path, src, len, dst, NULL, 0); }
[PATCH v1] convert: add "status=delayed" to filter process protocol
From: Lars SchneiderSome `clean` / `smudge` filters might require a significant amount of time to process a single blob. During this process the Git checkout operation is blocked and Git needs to wait until the filter is done to continue with the checkout. Teach the filter process protocol (introduced in edcc858) to accept the status "delayed" as response to a filter request. Upon this response Git continues with the checkout operation and asks the filter to process the blob again after all other blobs have been processed. Git has a multiple code paths that checkout a blob. Support delayed checkouts only in `clone` (in unpack-trees.c) and `checkout` operations. Signed-off-by: Lars Schneider --- You can find the RFC thread for this topic here: http://public-inbox.org/git/d10f7c47-14e8-465b-8b7a-a09a1b28a...@gmail.com/ Notes: Base Commit: e05806da9e (master) Diff on Web: https://github.com/larsxschneider/git/commit/ea25a1834b Checkout:git fetch https://github.com/larsxschneider/git filter-process/delay-v1 && git checkout ea25a1834b Interdiff (filter-process/delay-v0..filter-process/delay-v1): Documentation/gitattributes.txt | 9 +++ apply.c | 2 +- archive.c | 2 +- builtin/cat-file.c | 2 +- builtin/checkout.c | 1 + cache.h | 1 + convert.c | 33 ++--- convert.h | 2 +- diff.c | 2 +- entry.c | 40 +++ merge-recursive.c | 2 +- t/t0021-conversion.sh | 53 + t/t0021/rot13-filter.pl | 19 +++ unpack-trees.c | 1 + 14 files changed, 145 insertions(+), 24 deletions(-) diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt index e0b66c1220..f6bad8db40 100644 --- a/Documentation/gitattributes.txt +++ b/Documentation/gitattributes.txt @@ -473,6 +473,15 @@ packet: git< # empty content! packet: git< # empty list, keep "status=success" unchanged! +If the request cannot be fulfilled within a reasonable amount of time +then the filter can respond with a "delayed" status and a flush packet. +Git will perform the same request at a later point in time, again. The +filter can delay a response multiple times for a single request. + +packet: git< status=delayed +packet: git< + + In case the filter cannot or does not want to process the content, it is expected to respond with an "error" status. diff --git a/apply.c b/apply.c index 2ed808d429..aa7e6e4359 100644 --- a/apply.c +++ b/apply.c @@ -4328,7 +4328,7 @@ static int try_create_file(const char *path, unsigned int mode, const char *buf, if (fd < 0) return 1; - if (convert_to_working_tree(path, buf, size, )) { + if (convert_to_working_tree(path, buf, size, , NULL)) { size = nbuf.len; buf = nbuf.buf; } diff --git a/archive.c b/archive.c index 01751e574b..90d02463ef 100644 --- a/archive.c +++ b/archive.c @@ -77,7 +77,7 @@ void *sha1_file_to_archive(const struct archiver_args *args, size_t size = 0; strbuf_attach(, buffer, *sizep, *sizep + 1); - convert_to_working_tree(path, buf.buf, buf.len, ); + convert_to_working_tree(path, buf.buf, buf.len, , NULL); if (commit) format_subst(commit, buf.buf, buf.len, ); buffer = strbuf_detach(, ); diff --git a/builtin/cat-file.c b/builtin/cat-file.c index 30383e9eb4..4e3e31a00b 100644 --- a/builtin/cat-file.c +++ b/builtin/cat-file.c @@ -35,7 +35,7 @@ static int filter_object(const char *path, unsigned mode, oid_to_hex(oid), path); if ((type == OBJ_BLOB) && S_ISREG(mode)) { struct strbuf strbuf = STRBUF_INIT; - if (convert_to_working_tree(path, *buf, *size, )) { + if (convert_to_working_tree(path, *buf, *size, , NULL)) { free(*buf); *size = strbuf.len; *buf = strbuf_detach(, NULL); diff --git a/builtin/checkout.c b/builtin/checkout.c index bfe685c198..42885cfe92 100644 --- a/builtin/checkout.c +++ b/builtin/checkout.c @@ -369,6 +369,7 @@ static int checkout_paths(const struct checkout_opts *opts, pos = skip_same_name(ce, pos) - 1; } } + errs |= checkout_delayed_entries(); if (write_locked_index(_index, lock_file, COMMIT_LOCK)) die(_("unable to write new index file")); diff --git a/cache.h b/cache.h index