Re: [PATCH v1] convert: add "status=delayed" to filter process protocol

2017-01-11 Thread Junio C Hamano
Lars Schneider  writes:

>> Hmm, I would have expected that the basic flow would become
>> 
>>  for each paths to be processed:
>>  convert-to-worktree to buf
>>  if not delayed:
>>  do the caller's thing to use buf
>>  else:
>>  remember path
>> 
>>  for each delayed paths:
>>  ensure filter process finished processing for path
>>  fetch the thing to buf from the process
>>  do the caller's thing to use buf
>> 
>> and that would make quite a lot of sense.  However, what is actually
>> implemented is a bit disappointing from that point of view.  While
>> its first part is the same as above, the latter part instead does:
>> 
>>  for each delayed paths:
>>  checkout the path
>> ...
>
> I am not sure I can follow you here.
> ...
> I implemented the "checkout_delayed_entries" function in v1 because
> it solved the problem with minimal changes in the existing code. Our previous 
> discussion made me think that this is the preferred way:
>
>  I do not think we want to see such a rewrite all over the
>  codepaths.  It might be OK to add such a "these entries are known
>  to be delayed" list in struct checkout so that the above becomes
>  more like this:
>
>for (i = 0; i < active_nr; i++)
>   checkout_entry(active_cache[i], state, NULL);
>  + checkout_entry_finish(state);
>
>  That is, addition of a single "some of the checkout_entry() calls
>  done so far might have been lazy, and I'll give them a chance to
>  clean up" might be palatable.  Anything more than that on the
>  caller side is not.

But that is apples-and-oranges comparision, no?  The old discussion
assumes there is no "caller's thing to use buf" other than "checkout
to the working tree", which is why the function its main loop calls
is "checkout_entry()" and the caller does not see the contents of
the filtered blob at all.  In that context, checkout_entry_finish()
that equally does not let the caller see the contents of the
filtered blob is quite appropriate.



Re: [PATCH v1] convert: add "status=delayed" to filter process protocol

2017-01-11 Thread Junio C Hamano
Jakub Narębski  writes:

>> Yes, this problem happens every day with filters that perform network
>> requests (e.g. GitLFS). 
>
> Do I understand it correctly that the expected performance improvement
> thanks to this feature is possible only if there is some amount of
> parallelism and concurrency in the filter?  That is, filter can be sending
> one blob to Git while processing other one, or filter can be fetching blobs
> in parallel.

The first-object latency may not be helped, but by allowing
"delayed", the latency to retrieve the second and subsequent objects
can be hidden, I would think.


Re: [PATCH v1] convert: add "status=delayed" to filter process protocol

2017-01-11 Thread Taylor Blau
On Wed, Jan 11, 2017 at 11:13:00AM +0100, Lars Schneider wrote:
>
> In v1 I implemented a) with the busy-loop problem in mind.
>
> My thinking was this:
>
> If the filter sees at least one filter request twice then the filter knows 
> that
> Git has already requested all files that require filtering. At that point the
> filter could just block the "delayed" answer to the latest filter request 
> until
> at least one of the previously delayed requests can be fulfilled. Then the 
> filter
> answers "delay" to Git until Git requests the blob that can be fulfilled. This
> process cycles until all requests can be fulfilled. Wouldn't that work?
>
> I think a "done" message by the filter is not easy. Right now the protocol 
> works
> in a mode were Git always asks and the filter always answers. I believe 
> changing
> the filter to be able to initiate a "done" message would complicated the 
> protocol.
>
> > Additionally, the protocol should specify a sentinel "no more entries" value
> > that could be sent from Git to the filter to signal that there are no more 
> > files
> > to checkout. Some filters may implement mechanisms for converting files that
> > require a signal to know when all files have been sent. Specifically, Git 
> > LFS
> > (https://git-lfs.github.com) batches files to be transferred together, and 
> > needs
> > to know when all files have been announced to truncate and send the last 
> > batch,
> > if it is not yet full. I'm sure other filter implementations use a similar
> > mechanism and would benefit from this as well.
>
> I agree. I think the filter already has this info implicitly as explained 
> above
> but an explicit message would be better!

This works, but forces some undesirable constraints:

- The filter has to keep track of all items in the checkout to determine how
  many times Git has asked about them. This is probably fine, though annoying.
  Cases where this is not fine is when there are _many_ items in the checkout
  forcing filter implementations to keep track of large sets of data.
- The protocol is now _must_ ask for the status of checkout items in a
  particular order. If the assumption is that "Git has sent me all items in the
  checkout by the point I see Git ask for the status of an item more than once",
  then Git _cannot_ ask for the status of a delayed item while there are still
  more items to checkout. This could be OK, so long as the protocol commits to
  it and documents this behavior.

What are your thoughts?

--
Thanks,
Taylor Blau

ttayl...@github.com


Re: [PATCH v1] convert: add "status=delayed" to filter process protocol

2017-01-11 Thread Jakub Narębski
W dniu 11.01.2017 o 11:20, Lars Schneider pisze: 
> On 10 Jan 2017, at 23:11, Jakub Narębski  wrote:
>> W dniu 09.01.2017 o 00:42, Junio C Hamano pisze:
>>> larsxschnei...@gmail.com writes:
 From: Lars Schneider 

 Some `clean` / `smudge` filters might require a significant amount of
 time to process a single blob. During this process the Git checkout
 operation is blocked and Git needs to wait until the filter is done to
 continue with the checkout.
>>
>> Lars, what is expected use case for this feature; that is when do you
>> think this problem may happen?  Is it something that happened IRL?
> 
> Yes, this problem happens every day with filters that perform network
> requests (e.g. GitLFS). 

Do I understand it correctly that the expected performance improvement
thanks to this feature is possible only if there is some amount of
parallelism and concurrency in the filter?  That is, filter can be sending
one blob to Git while processing other one, or filter can be fetching blobs
in parallel.

This means that filter process works as a kind of (de)multiplexer for
fetching and/or processing blob contents, I think.

> [...] In GitLFS we even implemented Git wrapper
> commands to address the problem: https://github.com/git-lfs/git-lfs/pull/988
> The ultimate goal of this patch is to be able to get rid of the wrapper 
> commands.

I'm sorry, I don't see it how the wrapper helps here.

> 
 Teach the filter process protocol (introduced in edcc858) to accept the
 status "delayed" as response to a filter request. Upon this response Git
 continues with the checkout operation and asks the filter to process the
 blob again after all other blobs have been processed.
>>>
>>> Hmm, I would have expected that the basic flow would become
>>>
>>> for each paths to be processed:
>>> convert-to-worktree to buf
>>> if not delayed:
>>> do the caller's thing to use buf
>>> else:
>>> remember path
>>>
>>> for each delayed paths:
>>> ensure filter process finished processing for path
>>> fetch the thing to buf from the process
>>> do the caller's thing to use buf
>>
>> I would expect here to have a kind of event loop, namely
>>
>>while there are delayed paths:
>>get path that is ready from filter
>>fetch the thing to buf (supporting "delayed")
>>if path done
>>do the caller's thing to use buf 
>>(e.g. finish checkout path, eof convert, etc.)
>>
>> We can either trust filter process to tell us when it finished sending
>> delayed paths, or keep list of paths that are being delayed in Git.
> 
> I could implement "get path that is ready from filter" but wouldn't
> that complicate the filter protocol? I think we can use the protocol pretty 
> much as if with the strategy outlined here:
> http://public-inbox.org/git/f533857d-9b51-44c1-8889-aa0542ad8...@gmail.com/

You are talking about the "busy-loop" solution, isn't it?  In the
same notation, it would look like this:

  while there are delayed paths:
  for each delayed path:
  request path from filter [1]
  fetch the thing (supporting "delayed") [2]
  if path done
  do the caller's thing to use buf
  remove it from delayed paths list


Footnotes:
--
1) We don't send the Git-side contents of blob again, isn't it?
   So we need some protocol extension / new understanding anyway.
   for example that we don't send contents if we request path again.
2) If path is not ready at all, filter protocol would send status=delayed
   with empty contents.  This means that we would immediately go to the
   next path, if there is one.

There are some cases where busy loop is preferable, but I don't think
it is the case here.


The event loop solution would require additional protocol extension,
but I don't think those complicate protocol too much:

A. Git would need to signal filter process that it has sent all paths,
and that it should be sending delayed paths when they are ready.  This
could be done for example with "command=continue".

packet:  git> command=continue


B. Filter driver, in the event-loop phase, when (de)multiplexing fetching
or processing of data, it would need now to initialize transfer, instead
of waiting for Git to ask.  It could look like this:

packet:  git< status=resumed [3]
packet:  git< pathname=file/to/be/resumed [4]
packet:  git< 
packet:  git< SMUDGED_CONTENT_CONTINUED
packet:  git< 
packet:  git<   # empty list, means "status=success" [5]

Footnotes:
--
3.) It could be "status=success", "status=delayed", 

Re: [PATCH v1] convert: add "status=delayed" to filter process protocol

2017-01-11 Thread Lars Schneider

> On 10 Jan 2017, at 23:11, Jakub Narębski  wrote:
> 
> W dniu 09.01.2017 o 00:42, Junio C Hamano pisze:
>> larsxschnei...@gmail.com writes:
>>> From: Lars Schneider 
>>> 
>>> Some `clean` / `smudge` filters might require a significant amount of
>>> time to process a single blob. During this process the Git checkout
>>> operation is blocked and Git needs to wait until the filter is done to
>>> continue with the checkout.
> 
> Lars, what is expected use case for this feature; that is when do you
> think this problem may happen?  Is it something that happened IRL?

Yes, this problem happens every day with filters that perform network
requests (e.g. GitLFS). In GitLFS we even implemented Git wrapper
commands to address the problem: https://github.com/git-lfs/git-lfs/pull/988
The ultimate goal of this patch is to be able to get rid of the wrapper 
commands.


>>> Teach the filter process protocol (introduced in edcc858) to accept the
>>> status "delayed" as response to a filter request. Upon this response Git
>>> continues with the checkout operation and asks the filter to process the
>>> blob again after all other blobs have been processed.
>> 
>> Hmm, I would have expected that the basic flow would become
>> 
>>  for each paths to be processed:
>>  convert-to-worktree to buf
>>  if not delayed:
>>  do the caller's thing to use buf
>>  else:
>>  remember path
>> 
>>  for each delayed paths:
>>  ensure filter process finished processing for path
>>  fetch the thing to buf from the process
>>  do the caller's thing to use buf
> 
> I would expect here to have a kind of event loop, namely
> 
>while there are delayed paths:
>get path that is ready from filter
>fetch the thing to buf (supporting "delayed")
>if path done
>do the caller's thing to use buf 
>(e.g. finish checkout path, eof convert, etc.)
> 
> We can either trust filter process to tell us when it finished sending
> delayed paths, or keep list of paths that are being delayed in Git.

I could implement "get path that is ready from filter" but wouldn't
that complicate the filter protocol? I think we can use the protocol pretty 
much as if with the strategy outlined here:
http://public-inbox.org/git/f533857d-9b51-44c1-8889-aa0542ad8...@gmail.com/


Thanks,
Lars

Re: [PATCH v1] convert: add "status=delayed" to filter process protocol

2017-01-11 Thread Lars Schneider

> On 10 Jan 2017, at 00:38, Taylor Blau  wrote:
> 
> I've been considering some alternative approaches in order to make the
> communication between Git and any extension that implements this protocol more
> intuitive.
> 
> In particular, I'm considering alternatives to:
> 
>>  for each delayed paths:
>>  ensure filter process finished processing for path
>>  fetch the thing to buf from the process
>>  do the caller's thing to use buf
> 
> As I understand it, the above sequence of steps would force Git to either:
> 
> a) loop over all delayed paths and ask the filter if it's done processing,
>   creating a busy-loop between the filter and Git, or...
> b) loop over all delayed paths sequentially, checking out each path in 
> sequence
> 
> I would like to avoid both of those situations, and instead opt for an
> asynchronous approach. In (a), the protocol is far too chatty. In (b), the
> protocol is much less chatty, but forces the checkout to be the very last 
> step,
> which has negative performance implications on checkouts with many large 
> files.
> 
> For instance, checking out several multi-gigabyte files one after the other
> means that a significant amount of time is lost while the filter has some of 
> the
> items ready. Instead of checking them out as they become available, Git waits
> until the very end when they are all available.
> 
> I think it would be preferable for the protocol to specify a sort of "done"
> signal against each path such that Git could check out delayed paths as they
> become available. If implemented this way, Git could checkout files
> asynchronously, while the filter continues to do work on the other end.

In v1 I implemented a) with the busy-loop problem in mind. 

My thinking was this:

If the filter sees at least one filter request twice then the filter knows that
Git has already requested all files that require filtering. At that point the
filter could just block the "delayed" answer to the latest filter request until
at least one of the previously delayed requests can be fulfilled. Then the 
filter
answers "delay" to Git until Git requests the blob that can be fulfilled. This
process cycles until all requests can be fulfilled. Wouldn't that work?

I think a "done" message by the filter is not easy. Right now the protocol 
works 
in a mode were Git always asks and the filter always answers. I believe changing
the filter to be able to initiate a "done" message would complicated the 
protocol.


> Additionally, the protocol should specify a sentinel "no more entries" value
> that could be sent from Git to the filter to signal that there are no more 
> files
> to checkout. Some filters may implement mechanisms for converting files that
> require a signal to know when all files have been sent. Specifically, Git LFS
> (https://git-lfs.github.com) batches files to be transferred together, and 
> needs
> to know when all files have been announced to truncate and send the last 
> batch,
> if it is not yet full. I'm sure other filter implementations use a similar
> mechanism and would benefit from this as well.

I agree. I think the filter already has this info implicitly as explained above
but an explicit message would be better!

Thanks,
Lars

Re: [PATCH v1] convert: add "status=delayed" to filter process protocol

2017-01-11 Thread Lars Schneider

> On 08 Jan 2017, at 21:45, Eric Wong  wrote:
> 
> larsxschnei...@gmail.com wrote:
>> +++ b/t/t0021/rot13-filter.pl
> 
>> +$DELAY{'test-delay1.r'} = 1;
>> +$DELAY{'test-delay3.r'} = 3;
>> 
>> open my $debug, ">>", "rot13-filter.log" or die "cannot open log file: $!";
>> 
>> @@ -166,6 +176,15 @@ while (1) {
>>  packet_txt_write("status=abort");
>>  packet_flush();
>>  }
>> +elsif ( $command eq "smudge" and
>> +exists $DELAY{$pathname} and
>> +$DELAY{$pathname} gt 0 ) {
> 
> Use '>' for numeric comparisons.  'gt' is for strings (man perlop)

Still learning Perl :-)


> Sidenote, staying <= 80 columns for the rest of the changes is
> strongly preferred, some of us need giant fonts.  I think what
> Torsten said about introducing a new *_internal function can
> also help with that.

OK! 

Thank you,
Lars


Re: [PATCH v1] convert: add "status=delayed" to filter process protocol

2017-01-11 Thread Lars Schneider

> On 08 Jan 2017, at 21:14, Torsten Bögershausen  wrote:
> 
> On Sun, Jan 08, 2017 at 08:17:36PM +0100, larsxschnei...@gmail.com wrote:
>> From: Lars Schneider 
>> 
>> Some `clean` / `smudge` filters might require a significant amount of
>> time to process a single blob. During this process the Git checkout
>> operation is blocked and Git needs to wait until the filter is done to
>> continue with the checkout.
>> 
>> Teach the filter process protocol (introduced in edcc858) to accept the
>> status "delayed" as response to a filter request. Upon this response Git
>> continues with the checkout operation and asks the filter to process the
>> blob again after all other blobs have been processed.
>> 
>> Git has a multiple code paths that checkout a blob. Support delayed
>> checkouts only in `clone` (in unpack-trees.c) and `checkout` operations.
>> 
>> Signed-off-by: Lars Schneider 
>> ---
>> 
> 
> Some feeling tells me that it may be better to leave 
> convert_to_working_tree() as it is.
> And change convert_to_working_tree_internal as suggested:
> 
> int convert_to_working_tree(const char *path, const char *src, size_t len, 
> struct strbuf *dst)
> {
> - return convert_to_working_tree_internal(path, src, len, dst, 0);
> + return convert_to_working_tree_internal(path, src, len, dst, NULL, 0);
> }

If I do this then I would have no way to communicate to the caller that the
processing is delayed. Consequently the caller would not know that an additional
call is necessary to fetch the result.

Thanks,
Lars

Re: [PATCH v1] convert: add "status=delayed" to filter process protocol

2017-01-11 Thread Lars Schneider

> On 09 Jan 2017, at 00:42, Junio C Hamano  wrote:
> 
> larsxschnei...@gmail.com writes:
> 
>> From: Lars Schneider 
>> 
>> Some `clean` / `smudge` filters might require a significant amount of
>> time to process a single blob. During this process the Git checkout
>> operation is blocked and Git needs to wait until the filter is done to
>> continue with the checkout.
>> 
>> Teach the filter process protocol (introduced in edcc858) to accept the
>> status "delayed" as response to a filter request. Upon this response Git
>> continues with the checkout operation and asks the filter to process the
>> blob again after all other blobs have been processed.
> 
> Hmm, I would have expected that the basic flow would become
> 
>   for each paths to be processed:
>   convert-to-worktree to buf
>   if not delayed:
>   do the caller's thing to use buf
>   else:
>   remember path
> 
>   for each delayed paths:
>   ensure filter process finished processing for path
>   fetch the thing to buf from the process
>   do the caller's thing to use buf
> 
> and that would make quite a lot of sense.  However, what is actually
> implemented is a bit disappointing from that point of view.  While
> its first part is the same as above, the latter part instead does:
> 
>   for each delayed paths:
>   checkout the path
> 
> Presumably, checkout_entry() does the "ensure that the process is
> done converting" (otherwise the result is simply buggy), but what
> disappoints me is that this does not allow callers that call
> "convert-to-working-tree", whose interface is obtain the bytestream 
> in-core in the working tree representation, given an object in the
> object-db representation in an in-core buffer, to _use_ the result
> of the conversion.  The caller does not have a chance to even see
> the result as it is written straight to the filesystem, once it
> calls checkout_delayed_entries().

I am not sure I can follow you here. A caller of "convert_to_working_tree"
would indeed see filtered result. Consider the following example. The 
filter delays the conversion twice and responds with the filtered results
on the third call:

CALL: int convert_to_working_tree(*src=='CONTENT', *dst, *delayed==0)
RESPONSE: return == 1; *delayed == 1, *dst==''

CALL: int convert_to_working_tree(*src=='CONTENT', *dst, *delayed==0)
RESPONSE: return == 1; *delayed == 1, *dst==''

CALL: int convert_to_working_tree(*src=='CONTENT', *dst, *delayed==0)
RESPONSE: return == 1; *delayed == 0, *dst=='FILTERED_CONTENT'

I implemented the "checkout_delayed_entries" function in v1 because
it solved the problem with minimal changes in the existing code. Our previous 
discussion made me think that this is the preferred way:

 I do not think we want to see such a rewrite all over the
 codepaths.  It might be OK to add such a "these entries are known
 to be delayed" list in struct checkout so that the above becomes
 more like this:

   for (i = 0; i < active_nr; i++)
  checkout_entry(active_cache[i], state, NULL);
 + checkout_entry_finish(state);

 That is, addition of a single "some of the checkout_entry() calls
 done so far might have been lazy, and I'll give them a chance to
 clean up" might be palatable.  Anything more than that on the
 caller side is not.

c.f. http://public-inbox.org/git/xmqqvavotych@gitster.mtv.corp.google.com/

Thanks,
Lars





Re: [PATCH v1] convert: add "status=delayed" to filter process protocol

2017-01-10 Thread Taylor Blau
On Tue, Jan 10, 2017 at 11:11:01PM +0100, Jakub Narębski wrote:
> W dniu 09.01.2017 o 00:42, Junio C Hamano pisze:
> > larsxschnei...@gmail.com writes:
> >> From: Lars Schneider 
> >>
> >> Some `clean` / `smudge` filters might require a significant amount of
> >> time to process a single blob. During this process the Git checkout
> >> operation is blocked and Git needs to wait until the filter is done to
> >> continue with the checkout.
>
> Lars, what is expected use case for this feature; that is when do you
> think this problem may happen?  Is it something that happened IRL?
>
> >>
> >> Teach the filter process protocol (introduced in edcc858) to accept the
> >> status "delayed" as response to a filter request. Upon this response Git
> >> continues with the checkout operation and asks the filter to process the
> >> blob again after all other blobs have been processed.
> >
> > Hmm, I would have expected that the basic flow would become
> >
> > for each paths to be processed:
> > convert-to-worktree to buf
> > if not delayed:
> > do the caller's thing to use buf
> > else:
> > remember path
> >
> > for each delayed paths:
> > ensure filter process finished processing for path
> > fetch the thing to buf from the process
> > do the caller's thing to use buf
>
> I would expect here to have a kind of event loop, namely
>
> while there are delayed paths:
> get path that is ready from filter
> fetch the thing to buf (supporting "delayed")
> if path done
> do the caller's thing to use buf
> (e.g. finish checkout path, eof convert, etc.)
>
> We can either trust filter process to tell us when it finished sending
> delayed paths, or keep list of paths that are being delayed in Git.

This makes a lot of sense to me. The "get path that is ready from filter" should
block until the filter has data that it is ready to send. This way Git isn't
wasting time in a busy-loop asking whether the filter has data ready to be sent.
It also means that if the filter has one large chunk that it's ready to write,
Git can work on that while the filter continues to process more data,
theoretically improving the performance of checkouts with many large delayed
objects.

>
> >
> > and that would make quite a lot of sense.  However, what is actually
> > implemented is a bit disappointing from that point of view.  While
> > its first part is the same as above, the latter part instead does:
> >
> > for each delayed paths:
> > checkout the path
> >
> > Presumably, checkout_entry() does the "ensure that the process is
> > done converting" (otherwise the result is simply buggy), but what
> > disappoints me is that this does not allow callers that call
> > "convert-to-working-tree", whose interface is obtain the bytestream
> > in-core in the working tree representation, given an object in the
> > object-db representation in an in-core buffer, to _use_ the result
> > of the conversion.  The caller does not have a chance to even see
> > the result as it is written straight to the filesystem, once it
> > calls checkout_delayed_entries().
> >
>

In addition to the above, I'd also like to investigate adding a "no more items"
message into the filter protocol. This would be useful for filters that
batch delayed items into groups. In particular, if the batch size is `N`, and 
Git
sends `2N-1` items, the second batch will be under-filled. The filter on the
other end needs some mechanism to send the second batch, even though it hasn't
hit max capacity.

Specifically, this is how Git LFS implements object transfers for data it does
not have locally, but I'm sure that this sort of functionality would be useful
for other filter implementations as well.

--
Thanks,
Taylor Blau

ttayl...@github.com


Re: [PATCH v1] convert: add "status=delayed" to filter process protocol

2017-01-10 Thread Jakub Narębski
W dniu 09.01.2017 o 00:42, Junio C Hamano pisze:
> larsxschnei...@gmail.com writes:
>> From: Lars Schneider 
>>
>> Some `clean` / `smudge` filters might require a significant amount of
>> time to process a single blob. During this process the Git checkout
>> operation is blocked and Git needs to wait until the filter is done to
>> continue with the checkout.

Lars, what is expected use case for this feature; that is when do you
think this problem may happen?  Is it something that happened IRL?

>>
>> Teach the filter process protocol (introduced in edcc858) to accept the
>> status "delayed" as response to a filter request. Upon this response Git
>> continues with the checkout operation and asks the filter to process the
>> blob again after all other blobs have been processed.
> 
> Hmm, I would have expected that the basic flow would become
> 
>   for each paths to be processed:
>   convert-to-worktree to buf
>   if not delayed:
>   do the caller's thing to use buf
>   else:
>   remember path
> 
>   for each delayed paths:
>   ensure filter process finished processing for path
>   fetch the thing to buf from the process
>   do the caller's thing to use buf

I would expect here to have a kind of event loop, namely

while there are delayed paths:
get path that is ready from filter
fetch the thing to buf (supporting "delayed")
if path done
do the caller's thing to use buf 
(e.g. finish checkout path, eof convert, etc.)

We can either trust filter process to tell us when it finished sending
delayed paths, or keep list of paths that are being delayed in Git.

> 
> and that would make quite a lot of sense.  However, what is actually
> implemented is a bit disappointing from that point of view.  While
> its first part is the same as above, the latter part instead does:
> 
>   for each delayed paths:
>   checkout the path
> 
> Presumably, checkout_entry() does the "ensure that the process is
> done converting" (otherwise the result is simply buggy), but what
> disappoints me is that this does not allow callers that call
> "convert-to-working-tree", whose interface is obtain the bytestream 
> in-core in the working tree representation, given an object in the
> object-db representation in an in-core buffer, to _use_ the result
> of the conversion.  The caller does not have a chance to even see
> the result as it is written straight to the filesystem, once it
> calls checkout_delayed_entries().
> 



Re: [PATCH v1] convert: add "status=delayed" to filter process protocol

2017-01-08 Thread Junio C Hamano
larsxschnei...@gmail.com writes:

> From: Lars Schneider 
>
> Some `clean` / `smudge` filters might require a significant amount of
> time to process a single blob. During this process the Git checkout
> operation is blocked and Git needs to wait until the filter is done to
> continue with the checkout.
>
> Teach the filter process protocol (introduced in edcc858) to accept the
> status "delayed" as response to a filter request. Upon this response Git
> continues with the checkout operation and asks the filter to process the
> blob again after all other blobs have been processed.

Hmm, I would have expected that the basic flow would become

for each paths to be processed:
convert-to-worktree to buf
if not delayed:
do the caller's thing to use buf
else:
remember path

for each delayed paths:
ensure filter process finished processing for path
fetch the thing to buf from the process
do the caller's thing to use buf

and that would make quite a lot of sense.  However, what is actually
implemented is a bit disappointing from that point of view.  While
its first part is the same as above, the latter part instead does:

for each delayed paths:
checkout the path

Presumably, checkout_entry() does the "ensure that the process is
done converting" (otherwise the result is simply buggy), but what
disappoints me is that this does not allow callers that call
"convert-to-working-tree", whose interface is obtain the bytestream 
in-core in the working tree representation, given an object in the
object-db representation in an in-core buffer, to _use_ the result
of the conversion.  The caller does not have a chance to even see
the result as it is written straight to the filesystem, once it
calls checkout_delayed_entries().


Re: [PATCH v1] convert: add "status=delayed" to filter process protocol

2017-01-08 Thread Eric Wong
larsxschnei...@gmail.com wrote:
> +++ b/t/t0021/rot13-filter.pl

> +$DELAY{'test-delay1.r'} = 1;
> +$DELAY{'test-delay3.r'} = 3;
> 
>  open my $debug, ">>", "rot13-filter.log" or die "cannot open log file: $!";
> 
> @@ -166,6 +176,15 @@ while (1) {
>   packet_txt_write("status=abort");
>   packet_flush();
>   }
> + elsif ( $command eq "smudge" and
> + exists $DELAY{$pathname} and
> + $DELAY{$pathname} gt 0 ) {

Use '>' for numeric comparisons.  'gt' is for strings (man perlop)

Sidenote, staying <= 80 columns for the rest of the changes is
strongly preferred, some of us need giant fonts.  I think what
Torsten said about introducing a new *_internal function can
also help with that.

Thanks.


Re: [PATCH v1] convert: add "status=delayed" to filter process protocol

2017-01-08 Thread Torsten Bögershausen
On Sun, Jan 08, 2017 at 08:17:36PM +0100, larsxschnei...@gmail.com wrote:
> From: Lars Schneider 
> 
> Some `clean` / `smudge` filters might require a significant amount of
> time to process a single blob. During this process the Git checkout
> operation is blocked and Git needs to wait until the filter is done to
> continue with the checkout.
> 
> Teach the filter process protocol (introduced in edcc858) to accept the
> status "delayed" as response to a filter request. Upon this response Git
> continues with the checkout operation and asks the filter to process the
> blob again after all other blobs have been processed.
> 
> Git has a multiple code paths that checkout a blob. Support delayed
> checkouts only in `clone` (in unpack-trees.c) and `checkout` operations.
> 
> Signed-off-by: Lars Schneider 
> ---
> 

Some feeling tells me that it may be better to leave convert_to_working_tree() 
as it is.
And change convert_to_working_tree_internal as suggested:

int convert_to_working_tree(const char *path, const char *src, size_t len, 
struct strbuf *dst)
 {
-   return convert_to_working_tree_internal(path, src, len, dst, 0);
+   return convert_to_working_tree_internal(path, src, len, dst, NULL, 0);
 }



[PATCH v1] convert: add "status=delayed" to filter process protocol

2017-01-08 Thread larsxschneider
From: Lars Schneider 

Some `clean` / `smudge` filters might require a significant amount of
time to process a single blob. During this process the Git checkout
operation is blocked and Git needs to wait until the filter is done to
continue with the checkout.

Teach the filter process protocol (introduced in edcc858) to accept the
status "delayed" as response to a filter request. Upon this response Git
continues with the checkout operation and asks the filter to process the
blob again after all other blobs have been processed.

Git has a multiple code paths that checkout a blob. Support delayed
checkouts only in `clone` (in unpack-trees.c) and `checkout` operations.

Signed-off-by: Lars Schneider 
---

You can find the RFC thread for this topic here:
http://public-inbox.org/git/d10f7c47-14e8-465b-8b7a-a09a1b28a...@gmail.com/

Notes:
Base Commit: e05806da9e (master)
Diff on Web: https://github.com/larsxschneider/git/commit/ea25a1834b
Checkout:git fetch https://github.com/larsxschneider/git 
filter-process/delay-v1 && git checkout ea25a1834b

Interdiff (filter-process/delay-v0..filter-process/delay-v1):

 Documentation/gitattributes.txt |  9 +++
 apply.c |  2 +-
 archive.c   |  2 +-
 builtin/cat-file.c  |  2 +-
 builtin/checkout.c  |  1 +
 cache.h |  1 +
 convert.c   | 33 ++---
 convert.h   |  2 +-
 diff.c  |  2 +-
 entry.c | 40 +++
 merge-recursive.c   |  2 +-
 t/t0021-conversion.sh   | 53 +
 t/t0021/rot13-filter.pl | 19 +++
 unpack-trees.c  |  1 +
 14 files changed, 145 insertions(+), 24 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index e0b66c1220..f6bad8db40 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -473,6 +473,15 @@ packet:  git<   # empty content!
 packet:  git<   # empty list, keep "status=success" unchanged!
 

+If the request cannot be fulfilled within a reasonable amount of time
+then the filter can respond with a "delayed" status and a flush packet.
+Git will perform the same request at a later point in time, again. The
+filter can delay a response multiple times for a single request.
+
+packet:  git< status=delayed
+packet:  git< 
+
+
 In case the filter cannot or does not want to process the content,
 it is expected to respond with an "error" status.
 
diff --git a/apply.c b/apply.c
index 2ed808d429..aa7e6e4359 100644
--- a/apply.c
+++ b/apply.c
@@ -4328,7 +4328,7 @@ static int try_create_file(const char *path, unsigned int 
mode, const char *buf,
if (fd < 0)
return 1;

-   if (convert_to_working_tree(path, buf, size, )) {
+   if (convert_to_working_tree(path, buf, size, , NULL)) {
size = nbuf.len;
buf  = nbuf.buf;
}
diff --git a/archive.c b/archive.c
index 01751e574b..90d02463ef 100644
--- a/archive.c
+++ b/archive.c
@@ -77,7 +77,7 @@ void *sha1_file_to_archive(const struct archiver_args *args,
size_t size = 0;

strbuf_attach(, buffer, *sizep, *sizep + 1);
-   convert_to_working_tree(path, buf.buf, buf.len, );
+   convert_to_working_tree(path, buf.buf, buf.len, , NULL);
if (commit)
format_subst(commit, buf.buf, buf.len, );
buffer = strbuf_detach(, );
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 30383e9eb4..4e3e31a00b 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -35,7 +35,7 @@ static int filter_object(const char *path, unsigned mode,
 oid_to_hex(oid), path);
if ((type == OBJ_BLOB) && S_ISREG(mode)) {
struct strbuf strbuf = STRBUF_INIT;
-   if (convert_to_working_tree(path, *buf, *size, )) {
+   if (convert_to_working_tree(path, *buf, *size, , NULL)) {
free(*buf);
*size = strbuf.len;
*buf = strbuf_detach(, NULL);
diff --git a/builtin/checkout.c b/builtin/checkout.c
index bfe685c198..42885cfe92 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -369,6 +369,7 @@ static int checkout_paths(const struct checkout_opts *opts,
pos = skip_same_name(ce, pos) - 1;
}
}
+   errs |= checkout_delayed_entries();

if (write_locked_index(_index, lock_file, COMMIT_LOCK))
die(_("unable to write new index file"));
diff --git a/cache.h b/cache.h
index