Re: [gradle-dev] API/DSL for implementing incremental tasks

Luke Daley Thu, 11 Apr 2013 03:47:34 -0700

On 11/04/2013, at 8:29 AM, Adam Murdoch <[email protected]> wrote:


> 
> On 11/04/2013, at 2:55 PM, Daz DeBoer <[email protected]> wrote:
> 
>> 
>> 
>> 
>> On 10 April 2013 15:33, Adam Murdoch <[email protected]> wrote:
>> 
>> On 10/04/2013, at 8:53 AM, Daz DeBoer <[email protected]> wrote:
>> 
>>> On 9 April 2013 05:25, Luke Daley <[email protected]> wrote:
>>>> 
>>>> On 09/04/2013, at 11:55 AM, Luke Daley <[email protected]> wrote:
>>>> 
>>>>> 
>>>>> On 27/03/2013, at 3:43 PM, Daz DeBoer <[email protected]> 
>>>>> wrote:
>>>>> 
>>>>>> G'day
>>>>>> 
>>>>>> Now in master is a pretty cool new feature: you can now implement an
>>>>>> 'incremental' task that is informed about exactly which input files
>>>>>> have changed when the task is out of date.
>>>>>> This is very useful for something like a C++ compile task, as it means
>>>>>> that only the changed files need to be recompiled, rather than the
>>>>>> entire set of inputs.
>>>>>> 
>>>>>> I've got a 'draft' DSL functioning, and would appreciate any feedback
>>>>>> you guys have. Here's a sample:
>>>>>> 
>>>>>>      class IncrementalSync extends DefaultTask {
>>>>>>          @InputFiles
>>>>>>          def FileCollection src
>>>>>> 
>>>>>>          @OutputDirectory
>>>>>>          def File destination
>>>>>> 
>>>>>>          @TaskAction
>>>>>>          void execute(TaskInputChanges inputs) {
>>>>>>              if (inputs.allOutOfDate) {
>>>>>>                  FileUtils.forceDelete(destination)
>>>>>>              }
>>>>>> 
>>>>>>              inputs.outOfDate({
>>>>>>                  FileUtils.copyFile(change.file, targetFile(change.file))
>>>>>>              } as Action)
>>>>>>              .removed({
>>>>>>                  FileUtils.forceDelete(targetFile(change.file))
>>>>>>              } as Action)
>>>>>>              .process()
>>>>>>          }
>>>>>> 
>>>>>>          def targetFile(def inputFile) {
>>>>>>              new File(destination, change.file.name)
>>>>>>          }
>>>>>>      }
>>>>> 
>>>>> Thinking more about this, I don't think what we have is right. I think we 
>>>>> can come up with a simpler API.
>>>>> 
>>>>> interface TaskExecutionContext {
>>>>> 
>>>>> boolean isHasPriorOutputs(); // this is an incremental execution
>>>>> 
>>>>> void eachInputFileChange(Action<InputFileChange> action);
>>>>> 
>>>>>      interface InputFileChange {
>>>>>        boolean isAdded();
>>>>>        boolean isModified();
>>>>>        boolean isRemoved();
>>>>>        File getFile();
>>>>>      }
>>>>> 
>>>>> }
>>>>> 
>>>>> @TaskAction
>>>>> void execute(TaskExecutionContext executionContext) {
>>>>>      if (executionContext.hasPriorOutputs) {
>>>>>              executionContext.eachInputFileChange {
>>>>> 
>>>>>              }
>>>>>      } else {
>>>>>              // full rebuild
>>>>>      }
>>>>> }
>>>>> 
>>>>> Some issues I have with the existing design:
>>>>> 
>>>>> * I'm not sure "outOfDate" is the right term here. It doesn't quite 
>>>>> capture it and can imply more than what we actually mean. You could say 
>>>>> that an output is out of date, but it doesn't quite work for inputs.
>>>>> * why do we need the outOfDate() and removed() methods? After reading the 
>>>>> Javadoc I'm not sure how I would use this API. What's the difference 
>>>>> between out-of-date and removed? I could guess, but it's not clear.
>>>>> * I think we'll want to provide information that is not strictly an 
>>>>> aspect of the inputs or outputs, but of something emergent from the 
>>>>> input/output concept. That's why I think modelling an encompassing type 
>>>>> is better.
>>>> 
>>>> Sorry, accidentally hit send…
>>>> 
>>>>> What I think we are trying to do is make it easy for the user to decide 
>>>>> whether they should do a fine grained incremental execution or a coarse 
>>>>> grained full.
>>>> 
>>>> The existing isAllOutOfDate() method goes too far, based on its Javadoc:
>>>> 
>>>>    /**
>>>>     * Specifies if incremental build is not possible due to changed Input 
>>>> Properties, Output Files, etc.
>>>>     * In this case, every file will be considered to be 'out-of-date'.
>>>>     */
>>>>     boolean isAllOutOfDate();
>>>> 
>>>> It is too presumptuous of us to say that all outputs are invalid just 
>>>> because an input property changed. The task author may know better and can 
>>>> only rebuild what is effected by the input property change.
>>>> 
>>>> We can tell if a full build is needed with relative confidence based on 
>>>> the lack of task history, or by verifying that the declared outputs don't 
>>>> exist (if that's fast enough), hence the isHasPriorOutputs() method. It 
>>>> would be nice to be able to tell the task that all of the input files have 
>>>> changed in some way, but this seems like it would be very expensive to 
>>>> compute.
>>>> 
>>>>> --
>>>>> Luke Daley
>>>>> Principal Engineer, Gradleware
>>>>> http://gradleware.com
>>>>> 
>>>>> Join me at the Gradle Summit 2013, June 13th and 14th in Santa Clara, CA: 
>>>>> http://www.gradlesummit.com
>>>>> 
>>> 
>>> For our usual "up-to-date" check, these are the things that can cause
>>> a task to be out-of-date (in order of checking):
>>> 1. Task has no declared outputs (always out-of-date with no declared 
>>> outputs)
>>> 2. Task.upToDateWhen() returns false
>>> 3. Task has no record of previous execution
>>> 4. Task class has changed since previous execution
>>> 5. Task input property has changed
>>> 6. Task output files have changed
>>> 7. Task input files have changed
>>> 
>>> So the interesting thing is how to deal with all of these cases with
>>> the incremental API:
>>> 
>>> (We are currently not honouring 1. for incremental tasks: this rule is
>>> mostly to make lifecycle tasks (and similar) fast, by not
>>> persisting/checking state. By making your task 'incremental' you're
>>> also telling Gradle not to make this check.)
>>> 
>>> The initial approach was that the InputFileChanges parameter on
>>> TaskAction was an indication that the task could handle changes to
>>> input files: anything other sort of state change could not be handled
>>> by the task.  The "out-of-date" files would be the set of files that
>>> you needed to re-evaluate, based on _all_ changes, including those
>>> that you cannot handle directly (like, say, input property changes).
>>> We would later add other parameter types to declare that the task
>>> could handle other incremental changes: properties, output files, etc.
>>> 
>>> Having spent more time on this, I'm not really liking this approach so
>>> much: I think that a simple "tell me the changed inputs" method is
>>> more useful and less confusing than a clever "tell me what I should
>>> re-evaluate" method (this is what Luke is suggesting, too).
>>> 
>>> We then need a way to tell an incremental task that other
>>> (non-input-file) changes have occurred since the prior execution,
>>> leaving the task implementation to decide what to do in this case.
>>> This API needs to be able to adapt to the future when a task can
>>> handle input-file and input-property changes, but still needs to know
>>> if task.upToDate() is false or outputs have changed since previous
>>> execution. Right now, we need to be able to ask "has anything other
>>> than 'input-files' changed since the last execution"?
>>> 
>>> Perhaps something like:
>>> 
>>> IncrementalExecutionContext {
>>>    boolean isIncremental(); // false for cases 2,3,4 above
>>> 
>>>    // We can pretty easily implement these right now
>>>    boolean isHasInputPropertyChanges()
>>>    boolean isHasInputFileChanges()
>>>    boolean isHasOutputFileChanges()
>>> 
>>>    void eachInputFileChange(Action<InputFileChange> action)
>>> 
>>>    // And to add to the API in the future
>>>    void eachInputPropertyChange(Action<InputPropertyChange> action)
>>>    void eachOutputFileChange(Action<OutputFileChange> action)
>>> }
>> 
>> How does this actually make it easier to implement an incremental task, 
>> given that most tasks don't care about output file changes and approximately 
>> none care about property changes?
>> 
>> It seems like it's more understandable and obvious, and not that much less 
>> easy to use. People already know how to implement a non-incremental task, 
>> and it might be easier for them to choose to do it the "old way" when the 
>> task is not incremental, or there are changes that the task cannot handle. 
> 
> Why would it be easier? Do you have a concrete example? Why would the task 
> not just go: for each input file that needs work, do the work.
> 
>> 
>> There's certainly a trade-off here between elegance and obviousness. The 
>> fact the current DSL has some non-obvious logic involved is definitely a 
>> cost that should be considered.
>>  
>> 
>> What happens when we add new types of things that we detect changes for?
>> 
>> I guess the issue is that the task wants to ask: "are there any changes that 
>> are not included in [X,Y]" where X,Y are the set of changes that the task 
>> _is_ aware of. 
>> So a method like:
>>    hasChangesOtherThan([InputFileChange,OutputFileChange]) would give this 
>> functionality.
>> 
>> 
>> Stepping back, there are several things a task requires from this API.
>> 1) Task can declare the changeset(s) it can handle, and stream these changes
>> 2) Task needs to know if the changeset(s) are "incomplete": are there other 
>> changes outside of the ones I can handle. The case where there is no prior 
>> history is a type of "incomplete".
>> 3) Task needs to be able to handle the "incomplete" case in a sensible way
>> 4) Task needs to be able to handle the "complete" case in a sensible way
>> 
>> With the current DSL, this is done by:
>> 1) @TaskAction execute(InputFileChanges inputs, OutputFileChanges outputs) 
>> // I can handle changes to input files and output files
>> 2) inputs.allOutOfDate() == "incomplete". Not sure if there is an equivalent 
>> on outputs?
>> 3) inputs.outOfDate({})  // Every file is considered "out-of-date"
>>     outputs.changedSincePreviousExecution({}) // No outputs files considered 
>> changed?
>> 4) inputs.outOfDate({}) // Only the changed files
>>     inputs.removed({})
>>     outputs.outOfDate({}) // Only files that have changed
>>     outputs.removed({}) // Output files that have been removed
>> 
>> eg:
>> def execute(InputFileChanges inputs, OutputFileChanges outputs) {
>>     outputs.outOfDate {
>>         // This may be a set of actual changes: not sure what to send when 
>> no task history, changed input property, etc
>>     }
>>     inputs.outOfDate {
>>        // This may be a set of changes, or it may be every input (no task 
>> history, changed input property, etc)
>>     }
>> }
>> 
>> An alternative mechanism could be:
>> 1) Explicitly ask the context for the changesets you care about: 
>>             def changeset = context.getChanges([InputFileChange, 
>> OutputFileChange])
>> 2) Ask the context (or the changeset) if it is "complete"
>>             changeset.complete
>> 3) Task author iterates over all inputs (just like a non-incremental task): 
>> cannot use changeset
>> 4) Task author iterates over changes
>> 
>> eg:
>> def execute(IncrementalTaskContext context) {
>>   def changeset = context.getChanges([InputFileChange, OutputFileChange])
>>   if (changeset.complete) {
>>       changeset.eachChange(OutputFileChange) { 
>>           ...
>>       }
>>       changeset.eachChange(InputFileChange) {
>>           ...
>>       }
>>   } else {
>>      // Regular non-incremental task action
>>      task.getInputs().eachFile {
>>          ...
>>      }
>>   }
>> 
>> The latter is definitely more verbose, and probably a bit less convenient. 
>> But it's also more explicit and I think probably more understandable.
> 
> Maybe. It's better than your other proposal above, in that it doesn't force 
> the task implementation to know about every type of change when it doesn't 
> want to. This is important because very few task implementations are going to 
> care about anything other than input file changes.
> 
> However, the above also rules out certain performance optimisations as we 
> don't know until the task executes what it will actually ask for. Instead, we 
> have to be able to deal with the task asking for anything. Exactly what a 
> task can handle, however, isn't dynamic - it's baked into the code and very 
> much static. So, the API needs to expose this statically, so that we can do 
> useful things with this information.
> 
> This also make evolution easier, as we can inject various different views 
> instead of a single context that exposes everything.
> 
> As far as 'understandability' goes, I think the basic problem here is that 
> we're defining the API in terms of changes, whereas the task implementations 
> are interested in the work they have to do, and (very infrequently) the 
> reason why. And while a change is a type of reason, but not all reasons are 
> changes (eg --rerun-tasks and missing history are reasons, but they're not 
> changes).
> 
> So, I think the API needs to express things in terms of what work the task 
> needs to do, not in terms of what has changed. This has correctness 
> implications, as well, as the task doesn't need to (and possibly cannot) be 
> involved in converting changes to work. And for this API, correctness is the 
> primary goal, followed by performance, followed by understandability. It's 
> not a beginner API.
> 
> Here are the different kinds of work we're interested in:
> 
> - Process all input files for which there is no corresponding output file.
> - Process all input files which have changed since their corresponding output 
> files were generated.
> - Process all input files for which the corresponding output file has changed 
> since they were generated.
> - Process all input files for which we don't have any history.
> - Process all input files for which the corresponding output file is 
> incorrect due to a property change.
> - Remove outputs for all input files that have been removed.
> - Remove existing outputs before processing input files, for when we're 
> rebuilding the outputs (for whatever reason).
> 
> This says to me we want an API that
> 
> - Allows a task to invalidate output files given the set of property changes. 
> Default is to invalidate all output files for any property change.
> - Allows a task to invalidate input files given the set of output file 
> changes. Default is to invalidate all input files for any output file change.
> - Allows a task to ask for the input files it needs to do work on.

And you're saying that the current API, with just some renames, provides this?

-- 
Luke Daley
Principal Engineer, Gradleware 
http://gradleware.com

Join me at the Gradle Summit 2013, June 13th and 14th in Santa Clara, CA: 
http://www.gradlesummit.com


---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email

Re: [gradle-dev] API/DSL for implementing incremental tasks

Reply via email to