Re: [gradle-dev] API/DSL for implementing incremental tasks

Luke Daley Mon, 15 Apr 2013 02:09:28 -0700

On 13/04/2013, at 10:56 PM, Adam Murdoch <adam.murd...@gradleware.com> wrote:


> 
> On 14/04/2013, at 7:22 AM, Adam Murdoch <adam.murd...@gradleware.com> wrote:
> 
>> 
>> On 12/04/2013, at 1:37 AM, Daz DeBoer <darrell.deb...@gradleware.com> wrote:
>> 
>>> On 11 April 2013 04:47, Luke Daley <luke.da...@gradleware.com> wrote:
>>>> 
>>>> On 11/04/2013, at 8:29 AM, Adam Murdoch <adam.murd...@gradleware.com> 
>>>> wrote:
>>>> 
>>>>> 
>>>>> On 11/04/2013, at 2:55 PM, Daz DeBoer <darrell.deb...@gradleware.com> 
>>>>> wrote:
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 10 April 2013 15:33, Adam Murdoch <adam.murd...@gradleware.com> wrote:
>>>>>> 
>>>>>> On 10/04/2013, at 8:53 AM, Daz DeBoer <darrell.deb...@gradleware.com> 
>>>>>> wrote:
>>>>>> 
>>>>>>> On 9 April 2013 05:25, Luke Daley <luke.da...@gradleware.com> wrote:
>>>>>>>> 
>>>>>>>> On 09/04/2013, at 11:55 AM, Luke Daley <luke.da...@gradleware.com> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 27/03/2013, at 3:43 PM, Daz DeBoer <darrell.deb...@gradleware.com> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> G'day
>>>>>>>>>> 
>>>>>>>>>> Now in master is a pretty cool new feature: you can now implement an
>>>>>>>>>> 'incremental' task that is informed about exactly which input files
>>>>>>>>>> have changed when the task is out of date.
>>>>>>>>>> This is very useful for something like a C++ compile task, as it 
>>>>>>>>>> means
>>>>>>>>>> that only the changed files need to be recompiled, rather than the
>>>>>>>>>> entire set of inputs.
>>>>>>>>>> 
>>>>>>>>>> I've got a 'draft' DSL functioning, and would appreciate any feedback
>>>>>>>>>> you guys have. Here's a sample:
>>>>>>>>>> 
>>>>>>>>>>     class IncrementalSync extends DefaultTask {
>>>>>>>>>>         @InputFiles
>>>>>>>>>>         def FileCollection src
>>>>>>>>>> 
>>>>>>>>>>         @OutputDirectory
>>>>>>>>>>         def File destination
>>>>>>>>>> 
>>>>>>>>>>         @TaskAction
>>>>>>>>>>         void execute(TaskInputChanges inputs) {
>>>>>>>>>>             if (inputs.allOutOfDate) {
>>>>>>>>>>                 FileUtils.forceDelete(destination)
>>>>>>>>>>             }
>>>>>>>>>> 
>>>>>>>>>>             inputs.outOfDate({
>>>>>>>>>>                 FileUtils.copyFile(change.file, 
>>>>>>>>>> targetFile(change.file))
>>>>>>>>>>             } as Action)
>>>>>>>>>>             .removed({
>>>>>>>>>>                 FileUtils.forceDelete(targetFile(change.file))
>>>>>>>>>>             } as Action)
>>>>>>>>>>             .process()
>>>>>>>>>>         }
>>>>>>>>>> 
>>>>>>>>>>         def targetFile(def inputFile) {
>>>>>>>>>>             new File(destination, change.file.name)
>>>>>>>>>>         }
>>>>>>>>>>     }
>>>>>>>>> 
>>>>>>>>> Thinking more about this, I don't think what we have is right. I 
>>>>>>>>> think we can come up with a simpler API.
>>>>>>>>> 
>>>>>>>>> interface TaskExecutionContext {
>>>>>>>>> 
>>>>>>>>> boolean isHasPriorOutputs(); // this is an incremental execution
>>>>>>>>> 
>>>>>>>>> void eachInputFileChange(Action<InputFileChange> action);
>>>>>>>>> 
>>>>>>>>>     interface InputFileChange {
>>>>>>>>>       boolean isAdded();
>>>>>>>>>       boolean isModified();
>>>>>>>>>       boolean isRemoved();
>>>>>>>>>       File getFile();
>>>>>>>>>     }
>>>>>>>>> 
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> @TaskAction
>>>>>>>>> void execute(TaskExecutionContext executionContext) {
>>>>>>>>>     if (executionContext.hasPriorOutputs) {
>>>>>>>>>             executionContext.eachInputFileChange {
>>>>>>>>> 
>>>>>>>>>             }
>>>>>>>>>     } else {
>>>>>>>>>             // full rebuild
>>>>>>>>>     }
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> Some issues I have with the existing design:
>>>>>>>>> 
>>>>>>>>> * I'm not sure "outOfDate" is the right term here. It doesn't quite 
>>>>>>>>> capture it and can imply more than what we actually mean. You could 
>>>>>>>>> say that an output is out of date, but it doesn't quite work for 
>>>>>>>>> inputs.
>>>>>>>>> * why do we need the outOfDate() and removed() methods? After reading 
>>>>>>>>> the Javadoc I'm not sure how I would use this API. What's the 
>>>>>>>>> difference between out-of-date and removed? I could guess, but it's 
>>>>>>>>> not clear.
>>>>>>>>> * I think we'll want to provide information that is not strictly an 
>>>>>>>>> aspect of the inputs or outputs, but of something emergent from the 
>>>>>>>>> input/output concept. That's why I think modelling an encompassing 
>>>>>>>>> type is better.
>>>>>>>> 
>>>>>>>> Sorry, accidentally hit send…
>>>>>>>> 
>>>>>>>>> What I think we are trying to do is make it easy for the user to 
>>>>>>>>> decide whether they should do a fine grained incremental execution or 
>>>>>>>>> a coarse grained full.
>>>>>>>> 
>>>>>>>> The existing isAllOutOfDate() method goes too far, based on its 
>>>>>>>> Javadoc:
>>>>>>>> 
>>>>>>>>   /**
>>>>>>>>    * Specifies if incremental build is not possible due to changed 
>>>>>>>> Input Properties, Output Files, etc.
>>>>>>>>    * In this case, every file will be considered to be 'out-of-date'.
>>>>>>>>    */
>>>>>>>>    boolean isAllOutOfDate();
>>>>>>>> 
>>>>>>>> It is too presumptuous of us to say that all outputs are invalid just 
>>>>>>>> because an input property changed. The task author may know better and 
>>>>>>>> can only rebuild what is effected by the input property change.
>>>>>>>> 
>>>>>>>> We can tell if a full build is needed with relative confidence based 
>>>>>>>> on the lack of task history, or by verifying that the declared outputs 
>>>>>>>> don't exist (if that's fast enough), hence the isHasPriorOutputs() 
>>>>>>>> method. It would be nice to be able to tell the task that all of the 
>>>>>>>> input files have changed in some way, but this seems like it would be 
>>>>>>>> very expensive to compute.
>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Luke Daley
>>>>>>>>> Principal Engineer, Gradleware
>>>>>>>>> http://gradleware.com
>>>>>>>>> 
>>>>>>>>> Join me at the Gradle Summit 2013, June 13th and 14th in Santa Clara, 
>>>>>>>>> CA: http://www.gradlesummit.com
>>>>>>>>> 
>>>>>>> 
>>>>>>> For our usual "up-to-date" check, these are the things that can cause
>>>>>>> a task to be out-of-date (in order of checking):
>>>>>>> 1. Task has no declared outputs (always out-of-date with no declared 
>>>>>>> outputs)
>>>>>>> 2. Task.upToDateWhen() returns false
>>>>>>> 3. Task has no record of previous execution
>>>>>>> 4. Task class has changed since previous execution
>>>>>>> 5. Task input property has changed
>>>>>>> 6. Task output files have changed
>>>>>>> 7. Task input files have changed
>>>>>>> 
>>>>>>> So the interesting thing is how to deal with all of these cases with
>>>>>>> the incremental API:
>>>>>>> 
>>>>>>> (We are currently not honouring 1. for incremental tasks: this rule is
>>>>>>> mostly to make lifecycle tasks (and similar) fast, by not
>>>>>>> persisting/checking state. By making your task 'incremental' you're
>>>>>>> also telling Gradle not to make this check.)
>>>>>>> 
>>>>>>> The initial approach was that the InputFileChanges parameter on
>>>>>>> TaskAction was an indication that the task could handle changes to
>>>>>>> input files: anything other sort of state change could not be handled
>>>>>>> by the task.  The "out-of-date" files would be the set of files that
>>>>>>> you needed to re-evaluate, based on _all_ changes, including those
>>>>>>> that you cannot handle directly (like, say, input property changes).
>>>>>>> We would later add other parameter types to declare that the task
>>>>>>> could handle other incremental changes: properties, output files, etc.
>>>>>>> 
>>>>>>> Having spent more time on this, I'm not really liking this approach so
>>>>>>> much: I think that a simple "tell me the changed inputs" method is
>>>>>>> more useful and less confusing than a clever "tell me what I should
>>>>>>> re-evaluate" method (this is what Luke is suggesting, too).
>>>>>>> 
>>>>>>> We then need a way to tell an incremental task that other
>>>>>>> (non-input-file) changes have occurred since the prior execution,
>>>>>>> leaving the task implementation to decide what to do in this case.
>>>>>>> This API needs to be able to adapt to the future when a task can
>>>>>>> handle input-file and input-property changes, but still needs to know
>>>>>>> if task.upToDate() is false or outputs have changed since previous
>>>>>>> execution. Right now, we need to be able to ask "has anything other
>>>>>>> than 'input-files' changed since the last execution"?
>>>>>>> 
>>>>>>> Perhaps something like:
>>>>>>> 
>>>>>>> IncrementalExecutionContext {
>>>>>>>   boolean isIncremental(); // false for cases 2,3,4 above
>>>>>>> 
>>>>>>>   // We can pretty easily implement these right now
>>>>>>>   boolean isHasInputPropertyChanges()
>>>>>>>   boolean isHasInputFileChanges()
>>>>>>>   boolean isHasOutputFileChanges()
>>>>>>> 
>>>>>>>   void eachInputFileChange(Action<InputFileChange> action)
>>>>>>> 
>>>>>>>   // And to add to the API in the future
>>>>>>>   void eachInputPropertyChange(Action<InputPropertyChange> action)
>>>>>>>   void eachOutputFileChange(Action<OutputFileChange> action)
>>>>>>> }
>>>>>> 
>>>>>> How does this actually make it easier to implement an incremental task, 
>>>>>> given that most tasks don't care about output file changes and 
>>>>>> approximately none care about property changes?
>>>>>> 
>>>>>> It seems like it's more understandable and obvious, and not that much 
>>>>>> less easy to use. People already know how to implement a non-incremental 
>>>>>> task, and it might be easier for them to choose to do it the "old way" 
>>>>>> when the task is not incremental, or there are changes that the task 
>>>>>> cannot handle.
>>>>> 
>>>>> Why would it be easier? Do you have a concrete example? Why would the 
>>>>> task not just go: for each input file that needs work, do the work.
>>>>> 
>>>>>> 
>>>>>> There's certainly a trade-off here between elegance and obviousness. The 
>>>>>> fact the current DSL has some non-obvious logic involved is definitely a 
>>>>>> cost that should be considered.
>>>>>> 
>>>>>> 
>>>>>> What happens when we add new types of things that we detect changes for?
>>>>>> 
>>>>>> I guess the issue is that the task wants to ask: "are there any changes 
>>>>>> that are not included in [X,Y]" where X,Y are the set of changes that 
>>>>>> the task _is_ aware of.
>>>>>> So a method like:
>>>>>>   hasChangesOtherThan([InputFileChange,OutputFileChange]) would give 
>>>>>> this functionality.
>>>>>> 
>>>>>> 
>>>>>> Stepping back, there are several things a task requires from this API.
>>>>>> 1) Task can declare the changeset(s) it can handle, and stream these 
>>>>>> changes
>>>>>> 2) Task needs to know if the changeset(s) are "incomplete": are there 
>>>>>> other changes outside of the ones I can handle. The case where there is 
>>>>>> no prior history is a type of "incomplete".
>>>>>> 3) Task needs to be able to handle the "incomplete" case in a sensible 
>>>>>> way
>>>>>> 4) Task needs to be able to handle the "complete" case in a sensible way
>>>>>> 
>>>>>> With the current DSL, this is done by:
>>>>>> 1) @TaskAction execute(InputFileChanges inputs, OutputFileChanges 
>>>>>> outputs) // I can handle changes to input files and output files
>>>>>> 2) inputs.allOutOfDate() == "incomplete". Not sure if there is an 
>>>>>> equivalent on outputs?
>>>>>> 3) inputs.outOfDate({})  // Every file is considered "out-of-date"
>>>>>>    outputs.changedSincePreviousExecution({}) // No outputs files 
>>>>>> considered changed?
>>>>>> 4) inputs.outOfDate({}) // Only the changed files
>>>>>>    inputs.removed({})
>>>>>>    outputs.outOfDate({}) // Only files that have changed
>>>>>>    outputs.removed({}) // Output files that have been removed
>>>>>> 
>>>>>> eg:
>>>>>> def execute(InputFileChanges inputs, OutputFileChanges outputs) {
>>>>>>    outputs.outOfDate {
>>>>>>        // This may be a set of actual changes: not sure what to send 
>>>>>> when no task history, changed input property, etc
>>>>>>    }
>>>>>>    inputs.outOfDate {
>>>>>>       // This may be a set of changes, or it may be every input (no task 
>>>>>> history, changed input property, etc)
>>>>>>    }
>>>>>> }
>>>>>> 
>>>>>> An alternative mechanism could be:
>>>>>> 1) Explicitly ask the context for the changesets you care about:
>>>>>>            def changeset = context.getChanges([InputFileChange, 
>>>>>> OutputFileChange])
>>>>>> 2) Ask the context (or the changeset) if it is "complete"
>>>>>>            changeset.complete
>>>>>> 3) Task author iterates over all inputs (just like a non-incremental 
>>>>>> task): cannot use changeset
>>>>>> 4) Task author iterates over changes
>>>>>> 
>>>>>> eg:
>>>>>> def execute(IncrementalTaskContext context) {
>>>>>>  def changeset = context.getChanges([InputFileChange, OutputFileChange])
>>>>>>  if (changeset.complete) {
>>>>>>      changeset.eachChange(OutputFileChange) {
>>>>>>          ...
>>>>>>      }
>>>>>>      changeset.eachChange(InputFileChange) {
>>>>>>          ...
>>>>>>      }
>>>>>>  } else {
>>>>>>     // Regular non-incremental task action
>>>>>>     task.getInputs().eachFile {
>>>>>>         ...
>>>>>>     }
>>>>>>  }
>>>>>> 
>>>>>> The latter is definitely more verbose, and probably a bit less 
>>>>>> convenient. But it's also more explicit and I think probably more 
>>>>>> understandable.
>>>>> 
>>>>> Maybe. It's better than your other proposal above, in that it doesn't 
>>>>> force the task implementation to know about every type of change when it 
>>>>> doesn't want to. This is important because very few task implementations 
>>>>> are going to care about anything other than input file changes.
>>>>> 
>>>>> However, the above also rules out certain performance optimisations as we 
>>>>> don't know until the task executes what it will actually ask for. 
>>>>> Instead, we have to be able to deal with the task asking for anything. 
>>>>> Exactly what a task can handle, however, isn't dynamic - it's baked into 
>>>>> the code and very much static. So, the API needs to expose this 
>>>>> statically, so that we can do useful things with this information.
>>>>> 
>>>>> This also make evolution easier, as we can inject various different views 
>>>>> instead of a single context that exposes everything.
>>>>> 
>>>>> As far as 'understandability' goes, I think the basic problem here is 
>>>>> that we're defining the API in terms of changes, whereas the task 
>>>>> implementations are interested in the work they have to do, and (very 
>>>>> infrequently) the reason why. And while a change is a type of reason, but 
>>>>> not all reasons are changes (eg --rerun-tasks and missing history are 
>>>>> reasons, but they're not changes).
>>>>> 
>>>>> So, I think the API needs to express things in terms of what work the 
>>>>> task needs to do, not in terms of what has changed. This has correctness 
>>>>> implications, as well, as the task doesn't need to (and possibly cannot) 
>>>>> be involved in converting changes to work. And for this API, correctness 
>>>>> is the primary goal, followed by performance, followed by 
>>>>> understandability. It's not a beginner API.
>>>>> 
>>>>> Here are the different kinds of work we're interested in:
>>>>> 
>>>>> - Process all input files for which there is no corresponding output file.
>>>>> - Process all input files which have changed since their corresponding 
>>>>> output files were generated.
>>>>> - Process all input files for which the corresponding output file has 
>>>>> changed since they were generated.
>>>>> - Process all input files for which we don't have any history.
>>>>> - Process all input files for which the corresponding output file is 
>>>>> incorrect due to a property change.
>>>>> - Remove outputs for all input files that have been removed.
>>>>> - Remove existing outputs before processing input files, for when we're 
>>>>> rebuilding the outputs (for whatever reason).
>>>>> 
>>>>> This says to me we want an API that
>>>>> 
>>>>> - Allows a task to invalidate output files given the set of property 
>>>>> changes. Default is to invalidate all output files for any property 
>>>>> change.
>>>>> - Allows a task to invalidate input files given the set of output file 
>>>>> changes. Default is to invalidate all input files for any output file 
>>>>> change.
>>>>> - Allows a task to ask for the input files it needs to do work on.
>>>> 
>>>> And you're saying that the current API, with just some renames, provides 
>>>> this?
>>>> 
>>> 
>>> Well the current API provides for the third option, with the default
>>> behaviour for changed properties and changed outputs.
>>> We could actually remove TaskInputChanges.allOutOfDate, as it doesn't
>>> serve any of the purposes above.
>> 
>> The list above wasn't complete. We still need to deal with the case where we 
>> don't know what old state you need to clean up. For example, say we make our 
>> dsl meta data task incremental. This task produces a single merged output 
>> file from a bunch of input files. If we don't have the history for the task 
>> or if this output file has been changed or if running with --rerun-tasks or 
>> if the task implementation has changed, then we need to make sure that 
>> before the task works on the input files, this output file has been cleaned 
>> of all existing state (eg deleted).
>> 
>> Right now we use the allOutOfDate flag to let the task know this. Instead we 
>> might:
>> 
>> - delete the output files of the task before calling the task action.
>> - let the task know which output files have been invalidated.
>> - keep the flag and rename it something like 'allOutputsInvalid' or 
>> 'rebuildRequired'.
> 
> Another option would be to split the task action into 2 separate methods, one 
> which is called on rebuild and one which is called on incremental execution.

This doesn't feel right to me. For 1:1 transforming tasks this would be a 
nuisance.

> Or add an optional method that is called on rebuild before the task action 
> method is called.

I can't see what this buys us over telling the user via the info we give the 
task action.

-- 
Luke Daley
Principal Engineer, Gradleware 
http://gradleware.com

Join me at the Gradle Summit 2013, June 13th and 14th in Santa Clara, CA: 
http://www.gradlesummit.com


---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email

Re: [gradle-dev] API/DSL for implementing incremental tasks

Reply via email to