Re: [gradle-dev] API/DSL for implementing incremental tasks

Adam Murdoch Sat, 13 Apr 2013 14:22:39 -0700

On 12/04/2013, at 1:37 AM, Daz DeBoer <darrell.deb...@gradleware.com> wrote:


> On 11 April 2013 04:47, Luke Daley <luke.da...@gradleware.com> wrote:
>> 
>> On 11/04/2013, at 8:29 AM, Adam Murdoch <adam.murd...@gradleware.com> wrote:
>> 
>>> 
>>> On 11/04/2013, at 2:55 PM, Daz DeBoer <darrell.deb...@gradleware.com> wrote:
>>> 
>>>> 
>>>> 
>>>> 
>>>> On 10 April 2013 15:33, Adam Murdoch <adam.murd...@gradleware.com> wrote:
>>>> 
>>>> On 10/04/2013, at 8:53 AM, Daz DeBoer <darrell.deb...@gradleware.com> 
>>>> wrote:
>>>> 
>>>>> On 9 April 2013 05:25, Luke Daley <luke.da...@gradleware.com> wrote:
>>>>>> 
>>>>>> On 09/04/2013, at 11:55 AM, Luke Daley <luke.da...@gradleware.com> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> On 27/03/2013, at 3:43 PM, Daz DeBoer <darrell.deb...@gradleware.com> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> G'day
>>>>>>>> 
>>>>>>>> Now in master is a pretty cool new feature: you can now implement an
>>>>>>>> 'incremental' task that is informed about exactly which input files
>>>>>>>> have changed when the task is out of date.
>>>>>>>> This is very useful for something like a C++ compile task, as it means
>>>>>>>> that only the changed files need to be recompiled, rather than the
>>>>>>>> entire set of inputs.
>>>>>>>> 
>>>>>>>> I've got a 'draft' DSL functioning, and would appreciate any feedback
>>>>>>>> you guys have. Here's a sample:
>>>>>>>> 
>>>>>>>>     class IncrementalSync extends DefaultTask {
>>>>>>>>         @InputFiles
>>>>>>>>         def FileCollection src
>>>>>>>> 
>>>>>>>>         @OutputDirectory
>>>>>>>>         def File destination
>>>>>>>> 
>>>>>>>>         @TaskAction
>>>>>>>>         void execute(TaskInputChanges inputs) {
>>>>>>>>             if (inputs.allOutOfDate) {
>>>>>>>>                 FileUtils.forceDelete(destination)
>>>>>>>>             }
>>>>>>>> 
>>>>>>>>             inputs.outOfDate({
>>>>>>>>                 FileUtils.copyFile(change.file, 
>>>>>>>> targetFile(change.file))
>>>>>>>>             } as Action)
>>>>>>>>             .removed({
>>>>>>>>                 FileUtils.forceDelete(targetFile(change.file))
>>>>>>>>             } as Action)
>>>>>>>>             .process()
>>>>>>>>         }
>>>>>>>> 
>>>>>>>>         def targetFile(def inputFile) {
>>>>>>>>             new File(destination, change.file.name)
>>>>>>>>         }
>>>>>>>>     }
>>>>>>> 
>>>>>>> Thinking more about this, I don't think what we have is right. I think 
>>>>>>> we can come up with a simpler API.
>>>>>>> 
>>>>>>> interface TaskExecutionContext {
>>>>>>> 
>>>>>>> boolean isHasPriorOutputs(); // this is an incremental execution
>>>>>>> 
>>>>>>> void eachInputFileChange(Action<InputFileChange> action);
>>>>>>> 
>>>>>>>     interface InputFileChange {
>>>>>>>       boolean isAdded();
>>>>>>>       boolean isModified();
>>>>>>>       boolean isRemoved();
>>>>>>>       File getFile();
>>>>>>>     }
>>>>>>> 
>>>>>>> }
>>>>>>> 
>>>>>>> @TaskAction
>>>>>>> void execute(TaskExecutionContext executionContext) {
>>>>>>>     if (executionContext.hasPriorOutputs) {
>>>>>>>             executionContext.eachInputFileChange {
>>>>>>> 
>>>>>>>             }
>>>>>>>     } else {
>>>>>>>             // full rebuild
>>>>>>>     }
>>>>>>> }
>>>>>>> 
>>>>>>> Some issues I have with the existing design:
>>>>>>> 
>>>>>>> * I'm not sure "outOfDate" is the right term here. It doesn't quite 
>>>>>>> capture it and can imply more than what we actually mean. You could say 
>>>>>>> that an output is out of date, but it doesn't quite work for inputs.
>>>>>>> * why do we need the outOfDate() and removed() methods? After reading 
>>>>>>> the Javadoc I'm not sure how I would use this API. What's the 
>>>>>>> difference between out-of-date and removed? I could guess, but it's not 
>>>>>>> clear.
>>>>>>> * I think we'll want to provide information that is not strictly an 
>>>>>>> aspect of the inputs or outputs, but of something emergent from the 
>>>>>>> input/output concept. That's why I think modelling an encompassing type 
>>>>>>> is better.
>>>>>> 
>>>>>> Sorry, accidentally hit send…
>>>>>> 
>>>>>>> What I think we are trying to do is make it easy for the user to decide 
>>>>>>> whether they should do a fine grained incremental execution or a coarse 
>>>>>>> grained full.
>>>>>> 
>>>>>> The existing isAllOutOfDate() method goes too far, based on its Javadoc:
>>>>>> 
>>>>>>   /**
>>>>>>    * Specifies if incremental build is not possible due to changed Input 
>>>>>> Properties, Output Files, etc.
>>>>>>    * In this case, every file will be considered to be 'out-of-date'.
>>>>>>    */
>>>>>>    boolean isAllOutOfDate();
>>>>>> 
>>>>>> It is too presumptuous of us to say that all outputs are invalid just 
>>>>>> because an input property changed. The task author may know better and 
>>>>>> can only rebuild what is effected by the input property change.
>>>>>> 
>>>>>> We can tell if a full build is needed with relative confidence based on 
>>>>>> the lack of task history, or by verifying that the declared outputs 
>>>>>> don't exist (if that's fast enough), hence the isHasPriorOutputs() 
>>>>>> method. It would be nice to be able to tell the task that all of the 
>>>>>> input files have changed in some way, but this seems like it would be 
>>>>>> very expensive to compute.
>>>>>> 
>>>>>>> --
>>>>>>> Luke Daley
>>>>>>> Principal Engineer, Gradleware
>>>>>>> http://gradleware.com
>>>>>>> 
>>>>>>> Join me at the Gradle Summit 2013, June 13th and 14th in Santa Clara, 
>>>>>>> CA: http://www.gradlesummit.com
>>>>>>> 
>>>>> 
>>>>> For our usual "up-to-date" check, these are the things that can cause
>>>>> a task to be out-of-date (in order of checking):
>>>>> 1. Task has no declared outputs (always out-of-date with no declared 
>>>>> outputs)
>>>>> 2. Task.upToDateWhen() returns false
>>>>> 3. Task has no record of previous execution
>>>>> 4. Task class has changed since previous execution
>>>>> 5. Task input property has changed
>>>>> 6. Task output files have changed
>>>>> 7. Task input files have changed
>>>>> 
>>>>> So the interesting thing is how to deal with all of these cases with
>>>>> the incremental API:
>>>>> 
>>>>> (We are currently not honouring 1. for incremental tasks: this rule is
>>>>> mostly to make lifecycle tasks (and similar) fast, by not
>>>>> persisting/checking state. By making your task 'incremental' you're
>>>>> also telling Gradle not to make this check.)
>>>>> 
>>>>> The initial approach was that the InputFileChanges parameter on
>>>>> TaskAction was an indication that the task could handle changes to
>>>>> input files: anything other sort of state change could not be handled
>>>>> by the task.  The "out-of-date" files would be the set of files that
>>>>> you needed to re-evaluate, based on _all_ changes, including those
>>>>> that you cannot handle directly (like, say, input property changes).
>>>>> We would later add other parameter types to declare that the task
>>>>> could handle other incremental changes: properties, output files, etc.
>>>>> 
>>>>> Having spent more time on this, I'm not really liking this approach so
>>>>> much: I think that a simple "tell me the changed inputs" method is
>>>>> more useful and less confusing than a clever "tell me what I should
>>>>> re-evaluate" method (this is what Luke is suggesting, too).
>>>>> 
>>>>> We then need a way to tell an incremental task that other
>>>>> (non-input-file) changes have occurred since the prior execution,
>>>>> leaving the task implementation to decide what to do in this case.
>>>>> This API needs to be able to adapt to the future when a task can
>>>>> handle input-file and input-property changes, but still needs to know
>>>>> if task.upToDate() is false or outputs have changed since previous
>>>>> execution. Right now, we need to be able to ask "has anything other
>>>>> than 'input-files' changed since the last execution"?
>>>>> 
>>>>> Perhaps something like:
>>>>> 
>>>>> IncrementalExecutionContext {
>>>>>   boolean isIncremental(); // false for cases 2,3,4 above
>>>>> 
>>>>>   // We can pretty easily implement these right now
>>>>>   boolean isHasInputPropertyChanges()
>>>>>   boolean isHasInputFileChanges()
>>>>>   boolean isHasOutputFileChanges()
>>>>> 
>>>>>   void eachInputFileChange(Action<InputFileChange> action)
>>>>> 
>>>>>   // And to add to the API in the future
>>>>>   void eachInputPropertyChange(Action<InputPropertyChange> action)
>>>>>   void eachOutputFileChange(Action<OutputFileChange> action)
>>>>> }
>>>> 
>>>> How does this actually make it easier to implement an incremental task, 
>>>> given that most tasks don't care about output file changes and 
>>>> approximately none care about property changes?
>>>> 
>>>> It seems like it's more understandable and obvious, and not that much less 
>>>> easy to use. People already know how to implement a non-incremental task, 
>>>> and it might be easier for them to choose to do it the "old way" when the 
>>>> task is not incremental, or there are changes that the task cannot handle.
>>> 
>>> Why would it be easier? Do you have a concrete example? Why would the task 
>>> not just go: for each input file that needs work, do the work.
>>> 
>>>> 
>>>> There's certainly a trade-off here between elegance and obviousness. The 
>>>> fact the current DSL has some non-obvious logic involved is definitely a 
>>>> cost that should be considered.
>>>> 
>>>> 
>>>> What happens when we add new types of things that we detect changes for?
>>>> 
>>>> I guess the issue is that the task wants to ask: "are there any changes 
>>>> that are not included in [X,Y]" where X,Y are the set of changes that the 
>>>> task _is_ aware of.
>>>> So a method like:
>>>>   hasChangesOtherThan([InputFileChange,OutputFileChange]) would give this 
>>>> functionality.
>>>> 
>>>> 
>>>> Stepping back, there are several things a task requires from this API.
>>>> 1) Task can declare the changeset(s) it can handle, and stream these 
>>>> changes
>>>> 2) Task needs to know if the changeset(s) are "incomplete": are there 
>>>> other changes outside of the ones I can handle. The case where there is no 
>>>> prior history is a type of "incomplete".
>>>> 3) Task needs to be able to handle the "incomplete" case in a sensible way
>>>> 4) Task needs to be able to handle the "complete" case in a sensible way
>>>> 
>>>> With the current DSL, this is done by:
>>>> 1) @TaskAction execute(InputFileChanges inputs, OutputFileChanges outputs) 
>>>> // I can handle changes to input files and output files
>>>> 2) inputs.allOutOfDate() == "incomplete". Not sure if there is an 
>>>> equivalent on outputs?
>>>> 3) inputs.outOfDate({})  // Every file is considered "out-of-date"
>>>>    outputs.changedSincePreviousExecution({}) // No outputs files 
>>>> considered changed?
>>>> 4) inputs.outOfDate({}) // Only the changed files
>>>>    inputs.removed({})
>>>>    outputs.outOfDate({}) // Only files that have changed
>>>>    outputs.removed({}) // Output files that have been removed
>>>> 
>>>> eg:
>>>> def execute(InputFileChanges inputs, OutputFileChanges outputs) {
>>>>    outputs.outOfDate {
>>>>        // This may be a set of actual changes: not sure what to send when 
>>>> no task history, changed input property, etc
>>>>    }
>>>>    inputs.outOfDate {
>>>>       // This may be a set of changes, or it may be every input (no task 
>>>> history, changed input property, etc)
>>>>    }
>>>> }
>>>> 
>>>> An alternative mechanism could be:
>>>> 1) Explicitly ask the context for the changesets you care about:
>>>>            def changeset = context.getChanges([InputFileChange, 
>>>> OutputFileChange])
>>>> 2) Ask the context (or the changeset) if it is "complete"
>>>>            changeset.complete
>>>> 3) Task author iterates over all inputs (just like a non-incremental 
>>>> task): cannot use changeset
>>>> 4) Task author iterates over changes
>>>> 
>>>> eg:
>>>> def execute(IncrementalTaskContext context) {
>>>>  def changeset = context.getChanges([InputFileChange, OutputFileChange])
>>>>  if (changeset.complete) {
>>>>      changeset.eachChange(OutputFileChange) {
>>>>          ...
>>>>      }
>>>>      changeset.eachChange(InputFileChange) {
>>>>          ...
>>>>      }
>>>>  } else {
>>>>     // Regular non-incremental task action
>>>>     task.getInputs().eachFile {
>>>>         ...
>>>>     }
>>>>  }
>>>> 
>>>> The latter is definitely more verbose, and probably a bit less convenient. 
>>>> But it's also more explicit and I think probably more understandable.
>>> 
>>> Maybe. It's better than your other proposal above, in that it doesn't force 
>>> the task implementation to know about every type of change when it doesn't 
>>> want to. This is important because very few task implementations are going 
>>> to care about anything other than input file changes.
>>> 
>>> However, the above also rules out certain performance optimisations as we 
>>> don't know until the task executes what it will actually ask for. Instead, 
>>> we have to be able to deal with the task asking for anything. Exactly what 
>>> a task can handle, however, isn't dynamic - it's baked into the code and 
>>> very much static. So, the API needs to expose this statically, so that we 
>>> can do useful things with this information.
>>> 
>>> This also make evolution easier, as we can inject various different views 
>>> instead of a single context that exposes everything.
>>> 
>>> As far as 'understandability' goes, I think the basic problem here is that 
>>> we're defining the API in terms of changes, whereas the task 
>>> implementations are interested in the work they have to do, and (very 
>>> infrequently) the reason why. And while a change is a type of reason, but 
>>> not all reasons are changes (eg --rerun-tasks and missing history are 
>>> reasons, but they're not changes).
>>> 
>>> So, I think the API needs to express things in terms of what work the task 
>>> needs to do, not in terms of what has changed. This has correctness 
>>> implications, as well, as the task doesn't need to (and possibly cannot) be 
>>> involved in converting changes to work. And for this API, correctness is 
>>> the primary goal, followed by performance, followed by understandability. 
>>> It's not a beginner API.
>>> 
>>> Here are the different kinds of work we're interested in:
>>> 
>>> - Process all input files for which there is no corresponding output file.
>>> - Process all input files which have changed since their corresponding 
>>> output files were generated.
>>> - Process all input files for which the corresponding output file has 
>>> changed since they were generated.
>>> - Process all input files for which we don't have any history.
>>> - Process all input files for which the corresponding output file is 
>>> incorrect due to a property change.
>>> - Remove outputs for all input files that have been removed.
>>> - Remove existing outputs before processing input files, for when we're 
>>> rebuilding the outputs (for whatever reason).
>>> 
>>> This says to me we want an API that
>>> 
>>> - Allows a task to invalidate output files given the set of property 
>>> changes. Default is to invalidate all output files for any property change.
>>> - Allows a task to invalidate input files given the set of output file 
>>> changes. Default is to invalidate all input files for any output file 
>>> change.
>>> - Allows a task to ask for the input files it needs to do work on.
>> 
>> And you're saying that the current API, with just some renames, provides 
>> this?
>> 
> 
> Well the current API provides for the third option, with the default
> behaviour for changed properties and changed outputs.
> We could actually remove TaskInputChanges.allOutOfDate, as it doesn't
> serve any of the purposes above.

The list above wasn't complete. We still need to deal with the case where we 
don't know what old state you need to clean up. For example, say we make our 
dsl meta data task incremental. This task produces a single merged output file 
from a bunch of input files. If we don't have the history for the task or if 
this output file has been changed or if running with --rerun-tasks or if the 
task implementation has changed, then we need to make sure that before the task 
works on the input files, this output file has been cleaned of all existing 
state (eg deleted).

Right now we use the allOutOfDate flag to let the task know this. Instead we 
might:

- delete the output files of the task before calling the task action.
- let the task know which output files have been invalidated.
- keep the flag and rename it something like 'allOutputsInvalid' or 
'rebuildRequired'.


--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com

Join us at the Gradle Summit 2013, June 13th and 14th in Santa Clara, CA: 
http://www.gradlesummit.com

Re: [gradle-dev] API/DSL for implementing incremental tasks

Reply via email to