On Tue, Jan 10, 2017 at 11:11:01PM +0100, Jakub Narębski wrote:
> > On 09.01.2017 at 00:42, Junio C Hamano wrote:
> > larsxschnei...@gmail.com writes:
> >> From: Lars Schneider <larsxschnei...@gmail.com>
> >>
> >> Some `clean` / `smudge` filters might require a significant amount of
> >> time to process a single blob. During this process the Git checkout
> >> operation is blocked and Git needs to wait until the filter is done to
> >> continue with the checkout.
>
> Lars, what is the expected use case for this feature; that is, when do you
> think this problem may happen?  Is it something that happened IRL?
>
> >>
> >> Teach the filter process protocol (introduced in edcc858) to accept the
> >> status "delayed" as response to a filter request. Upon this response Git
> >> continues with the checkout operation and asks the filter to process the
> >> blob again after all other blobs have been processed.
> >
> > Hmm, I would have expected that the basic flow would become
> >
> >     for each path to be processed:
> >             convert-to-worktree to buf
> >             if not delayed:
> >                     do the caller's thing to use buf
> >             else:
> >                     remember path
> >
> >     for each delayed path:
> >             ensure filter process finished processing for path
> >             fetch the thing to buf from the process
> >             do the caller's thing to use buf
>
> I would expect here to have a kind of event loop, namely
>
>         while there are delayed paths:
>                 get path that is ready from filter
>                 fetch the thing to buf (supporting "delayed")
>                 if path done
>                         do the caller's thing to use buf
>                         (e.g. finish checkout path, eof convert, etc.)
>
> We can either trust filter process to tell us when it finished sending
> delayed paths, or keep list of paths that are being delayed in Git.

This makes a lot of sense to me. The "get path that is ready from filter" step
should block until the filter has data it is ready to send, so that Git isn't
wasting time in a busy-loop asking whether the filter has data ready to be sent.
It also means that if the filter has one large chunk that it's ready to write,
Git can work on that while the filter continues to process more data,
theoretically improving the performance of checkouts with many large delayed
objects.
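
To make that concrete, here is a rough, self-contained sketch (Python, with
made-up names; the actual protocol commands for this are still up for
discussion) of the blocking hand-off I have in mind. The "filter" finishes
delayed blobs at its own pace, and the checkout loop blocks on the next ready
blob instead of polling:

    import queue
    import threading
    import time

    def filter_process(ready):
        # Pretend each delayed blob takes a while to convert.
        for path in ("a.bin", "b.bin", "c.bin"):
            time.sleep(0.1)
            ready.put((path, b"converted contents of " + path.encode()))
        ready.put(None)  # the filter signals that nothing more is coming

    def delayed_checkout():
        ready = queue.Queue()
        threading.Thread(target=filter_process, args=(ready,), daemon=True).start()
        while True:
            item = ready.get()  # blocks; no busy-loop on the Git side
            if item is None:
                break
            path, buf = item
            print("checking out %s (%d bytes)" % (path, len(buf)))

    if __name__ == "__main__":
        delayed_checkout()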

>
> >
> > and that would make quite a lot of sense.  However, what is actually
> > implemented is a bit disappointing from that point of view.  While
> > its first part is the same as above, the latter part instead does:
> >
> >     for each delayed path:
> >             checkout the path
> >
> > Presumably, checkout_entry() does the "ensure that the process is
> > done converting" (otherwise the result is simply buggy), but what
> > disappoints me is that this does not allow callers that call
> > "convert-to-working-tree", whose interface is obtain the bytestream
> > in-core in the working tree representation, given an object in the
> > object-db representation in an in-core buffer, to _use_ the result
> > of the conversion.  The caller does not have a chance to even see
> > the result as it is written straight to the filesystem, once it
> > calls checkout_delayed_entries().
> >
>

In addition to the above, I'd also like to investigate adding a "no more items"
message into the filter protocol. This would be useful for filters that
batch delayed items into groups. In particular, if the batch size is `N`, and
Git sends `2N-1` items, the second batch will be under-filled. The filter on the
other end needs some mechanism to send the second batch, even though it hasn't
hit max capacity.

Specifically, this is how Git LFS implements object transfers for data it does
not have locally, but I'm sure that this sort of functionality would be useful
for other filter implementations as well.
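
For illustration, here is a self-contained sketch of that batching pattern
(the batch size, the names, and the shape of the "no more items" signal are
all made up; this isn't tied to any particular protocol wording). The filter
can only flush the under-filled second batch once it knows Git has nothing
more to send:

    BATCH_SIZE = 10  # the "N" above

    def transfer_batch(batch):
        # Stand-in for whatever the filter does with a full group of
        # objects, e.g. one request to a remote server.
        print("transferring %d objects" % len(batch))

    def run_filter(incoming_items):
        batch = []
        for item in incoming_items:
            batch.append(item)
            if len(batch) == BATCH_SIZE:
                transfer_batch(batch)
                batch = []
        # Only reachable once Git has said "no more items": flush the
        # under-filled final batch instead of waiting for it to fill up.
        if batch:
            transfer_batch(batch)

    if __name__ == "__main__":
        # 2N-1 items: one full batch plus one under-filled batch of N-1.
        run_filter("object-%d" % i for i in range(2 * BATCH_SIZE - 1))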

--
Thanks,
Taylor Blau

ttayl...@github.com
