Re: [gradle-dev] API/DSL for implementing incremental tasks

Adam Murdoch Mon, 15 Apr 2013 21:09:26 -0700

On 15/04/2013, at 7:09 PM, Luke Daley <[email protected]> wrote:


> 
> On 13/04/2013, at 10:56 PM, Adam Murdoch <[email protected]> wrote:
> 
>> 
>> On 14/04/2013, at 7:22 AM, Adam Murdoch <[email protected]> wrote:
>> 
>>> 
>>> On 12/04/2013, at 1:37 AM, Daz DeBoer <[email protected]> wrote:
>>> 
>>>> On 11 April 2013 04:47, Luke Daley <[email protected]> wrote:
>>>>> 
>>>>> On 11/04/2013, at 8:29 AM, Adam Murdoch <[email protected]> 
>>>>> wrote:
>>>>> 
>>>>>> 
>>>>>> On 11/04/2013, at 2:55 PM, Daz DeBoer <[email protected]> 
>>>>>> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 10 April 2013 15:33, Adam Murdoch <[email protected]> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>> On 10/04/2013, at 8:53 AM, Daz DeBoer <[email protected]> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> On 9 April 2013 05:25, Luke Daley <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> On 09/04/2013, at 11:55 AM, Luke Daley <[email protected]> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 27/03/2013, at 3:43 PM, Daz DeBoer 
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>>> G'day
>>>>>>>>>>> 
>>>>>>>>>>> Now in master is a pretty cool new feature: you can now implement an
>>>>>>>>>>> 'incremental' task that is informed about exactly which input files
>>>>>>>>>>> have changed when the task is out of date.
>>>>>>>>>>> This is very useful for something like a C++ compile task, as it 
>>>>>>>>>>> means
>>>>>>>>>>> that only the changed files need to be recompiled, rather than the
>>>>>>>>>>> entire set of inputs.
>>>>>>>>>>> 
>>>>>>>>>>> I've got a 'draft' DSL functioning, and would appreciate any 
>>>>>>>>>>> feedback
>>>>>>>>>>> you guys have. Here's a sample:
>>>>>>>>>>> 
>>>>>>>>>>>    class IncrementalSync extends DefaultTask {
>>>>>>>>>>>        @InputFiles
>>>>>>>>>>>        def FileCollection src
>>>>>>>>>>> 
>>>>>>>>>>>        @OutputDirectory
>>>>>>>>>>>        def File destination
>>>>>>>>>>> 
>>>>>>>>>>>        @TaskAction
>>>>>>>>>>>        void execute(TaskInputChanges inputs) {
>>>>>>>>>>>            if (inputs.allOutOfDate) {
>>>>>>>>>>>                FileUtils.forceDelete(destination)
>>>>>>>>>>>            }
>>>>>>>>>>> 
>>>>>>>>>>>            inputs.outOfDate({
>>>>>>>>>>>                FileUtils.copyFile(change.file, 
>>>>>>>>>>> targetFile(change.file))
>>>>>>>>>>>            } as Action)
>>>>>>>>>>>            .removed({
>>>>>>>>>>>                FileUtils.forceDelete(targetFile(change.file))
>>>>>>>>>>>            } as Action)
>>>>>>>>>>>            .process()
>>>>>>>>>>>        }
>>>>>>>>>>> 
>>>>>>>>>>>        def targetFile(def inputFile) {
>>>>>>>>>>>            new File(destination, change.file.name)
>>>>>>>>>>>        }
>>>>>>>>>>>    }
>>>>>>>>>> 
>>>>>>>>>> Thinking more about this, I don't think what we have is right. I 
>>>>>>>>>> think we can come up with a simpler API.
>>>>>>>>>> 
>>>>>>>>>> interface TaskExecutionContext {
>>>>>>>>>> 
>>>>>>>>>> boolean isHasPriorOutputs(); // this is an incremental execution
>>>>>>>>>> 
>>>>>>>>>> void eachInputFileChange(Action<InputFileChange> action);
>>>>>>>>>> 
>>>>>>>>>>    interface InputFileChange {
>>>>>>>>>>      boolean isAdded();
>>>>>>>>>>      boolean isModified();
>>>>>>>>>>      boolean isRemoved();
>>>>>>>>>>      File getFile();
>>>>>>>>>>    }
>>>>>>>>>> 
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> @TaskAction
>>>>>>>>>> void execute(TaskExecutionContext executionContext) {
>>>>>>>>>>    if (executionContext.hasPriorOutputs) {
>>>>>>>>>>            executionContext.eachInputFileChange {
>>>>>>>>>> 
>>>>>>>>>>            }
>>>>>>>>>>    } else {
>>>>>>>>>>            // full rebuild
>>>>>>>>>>    }
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> Some issues I have with the existing design:
>>>>>>>>>> 
>>>>>>>>>> * I'm not sure "outOfDate" is the right term here. It doesn't quite 
>>>>>>>>>> capture it and can imply more than what we actually mean. You could 
>>>>>>>>>> say that an output is out of date, but it doesn't quite work for 
>>>>>>>>>> inputs.
>>>>>>>>>> * why do we need the outOfDate() and removed() methods? After 
>>>>>>>>>> reading the Javadoc I'm not sure how I would use this API. What's 
>>>>>>>>>> the difference between out-of-date and removed? I could guess, but 
>>>>>>>>>> it's not clear.
>>>>>>>>>> * I think we'll want to provide information that is not strictly an 
>>>>>>>>>> aspect of the inputs or outputs, but of something emergent from the 
>>>>>>>>>> input/output concept. That's why I think modelling an encompassing 
>>>>>>>>>> type is better.
>>>>>>>>> 
>>>>>>>>> Sorry, accidentally hit send…
>>>>>>>>> 
>>>>>>>>>> What I think we are trying to do is make it easy for the user to 
>>>>>>>>>> decide whether they should do a fine grained incremental execution 
>>>>>>>>>> or a coarse grained full.
>>>>>>>>> 
>>>>>>>>> The existing isAllOutOfDate() method goes too far, based on its 
>>>>>>>>> Javadoc:
>>>>>>>>> 
>>>>>>>>>  /**
>>>>>>>>>   * Specifies if incremental build is not possible due to changed 
>>>>>>>>> Input Properties, Output Files, etc.
>>>>>>>>>   * In this case, every file will be considered to be 'out-of-date'.
>>>>>>>>>   */
>>>>>>>>>   boolean isAllOutOfDate();
>>>>>>>>> 
>>>>>>>>> It is too presumptuous of us to say that all outputs are invalid just 
>>>>>>>>> because an input property changed. The task author may know better 
>>>>>>>>> and can only rebuild what is effected by the input property change.
>>>>>>>>> 
>>>>>>>>> We can tell if a full build is needed with relative confidence based 
>>>>>>>>> on the lack of task history, or by verifying that the declared 
>>>>>>>>> outputs don't exist (if that's fast enough), hence the 
>>>>>>>>> isHasPriorOutputs() method. It would be nice to be able to tell the 
>>>>>>>>> task that all of the input files have changed in some way, but this 
>>>>>>>>> seems like it would be very expensive to compute.
>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Luke Daley
>>>>>>>>>> Principal Engineer, Gradleware
>>>>>>>>>> http://gradleware.com
>>>>>>>>>> 
>>>>>>>>>> Join me at the Gradle Summit 2013, June 13th and 14th in Santa 
>>>>>>>>>> Clara, CA: http://www.gradlesummit.com
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> For our usual "up-to-date" check, these are the things that can cause
>>>>>>>> a task to be out-of-date (in order of checking):
>>>>>>>> 1. Task has no declared outputs (always out-of-date with no declared 
>>>>>>>> outputs)
>>>>>>>> 2. Task.upToDateWhen() returns false
>>>>>>>> 3. Task has no record of previous execution
>>>>>>>> 4. Task class has changed since previous execution
>>>>>>>> 5. Task input property has changed
>>>>>>>> 6. Task output files have changed
>>>>>>>> 7. Task input files have changed
>>>>>>>> 
>>>>>>>> So the interesting thing is how to deal with all of these cases with
>>>>>>>> the incremental API:
>>>>>>>> 
>>>>>>>> (We are currently not honouring 1. for incremental tasks: this rule is
>>>>>>>> mostly to make lifecycle tasks (and similar) fast, by not
>>>>>>>> persisting/checking state. By making your task 'incremental' you're
>>>>>>>> also telling Gradle not to make this check.)
>>>>>>>> 
>>>>>>>> The initial approach was that the InputFileChanges parameter on
>>>>>>>> TaskAction was an indication that the task could handle changes to
>>>>>>>> input files: anything other sort of state change could not be handled
>>>>>>>> by the task.  The "out-of-date" files would be the set of files that
>>>>>>>> you needed to re-evaluate, based on _all_ changes, including those
>>>>>>>> that you cannot handle directly (like, say, input property changes).
>>>>>>>> We would later add other parameter types to declare that the task
>>>>>>>> could handle other incremental changes: properties, output files, etc.
>>>>>>>> 
>>>>>>>> Having spent more time on this, I'm not really liking this approach so
>>>>>>>> much: I think that a simple "tell me the changed inputs" method is
>>>>>>>> more useful and less confusing than a clever "tell me what I should
>>>>>>>> re-evaluate" method (this is what Luke is suggesting, too).
>>>>>>>> 
>>>>>>>> We then need a way to tell an incremental task that other
>>>>>>>> (non-input-file) changes have occurred since the prior execution,
>>>>>>>> leaving the task implementation to decide what to do in this case.
>>>>>>>> This API needs to be able to adapt to the future when a task can
>>>>>>>> handle input-file and input-property changes, but still needs to know
>>>>>>>> if task.upToDate() is false or outputs have changed since previous
>>>>>>>> execution. Right now, we need to be able to ask "has anything other
>>>>>>>> than 'input-files' changed since the last execution"?
>>>>>>>> 
>>>>>>>> Perhaps something like:
>>>>>>>> 
>>>>>>>> IncrementalExecutionContext {
>>>>>>>>  boolean isIncremental(); // false for cases 2,3,4 above
>>>>>>>> 
>>>>>>>>  // We can pretty easily implement these right now
>>>>>>>>  boolean isHasInputPropertyChanges()
>>>>>>>>  boolean isHasInputFileChanges()
>>>>>>>>  boolean isHasOutputFileChanges()
>>>>>>>> 
>>>>>>>>  void eachInputFileChange(Action<InputFileChange> action)
>>>>>>>> 
>>>>>>>>  // And to add to the API in the future
>>>>>>>>  void eachInputPropertyChange(Action<InputPropertyChange> action)
>>>>>>>>  void eachOutputFileChange(Action<OutputFileChange> action)
>>>>>>>> }
>>>>>>> 
>>>>>>> How does this actually make it easier to implement an incremental task, 
>>>>>>> given that most tasks don't care about output file changes and 
>>>>>>> approximately none care about property changes?
>>>>>>> 
>>>>>>> It seems like it's more understandable and obvious, and not that much 
>>>>>>> less easy to use. People already know how to implement a 
>>>>>>> non-incremental task, and it might be easier for them to choose to do 
>>>>>>> it the "old way" when the task is not incremental, or there are changes 
>>>>>>> that the task cannot handle.
>>>>>> 
>>>>>> Why would it be easier? Do you have a concrete example? Why would the 
>>>>>> task not just go: for each input file that needs work, do the work.
>>>>>> 
>>>>>>> 
>>>>>>> There's certainly a trade-off here between elegance and obviousness. 
>>>>>>> The fact the current DSL has some non-obvious logic involved is 
>>>>>>> definitely a cost that should be considered.
>>>>>>> 
>>>>>>> 
>>>>>>> What happens when we add new types of things that we detect changes for?
>>>>>>> 
>>>>>>> I guess the issue is that the task wants to ask: "are there any changes 
>>>>>>> that are not included in [X,Y]" where X,Y are the set of changes that 
>>>>>>> the task _is_ aware of.
>>>>>>> So a method like:
>>>>>>>  hasChangesOtherThan([InputFileChange,OutputFileChange]) would give 
>>>>>>> this functionality.
>>>>>>> 
>>>>>>> 
>>>>>>> Stepping back, there are several things a task requires from this API.
>>>>>>> 1) Task can declare the changeset(s) it can handle, and stream these 
>>>>>>> changes
>>>>>>> 2) Task needs to know if the changeset(s) are "incomplete": are there 
>>>>>>> other changes outside of the ones I can handle. The case where there is 
>>>>>>> no prior history is a type of "incomplete".
>>>>>>> 3) Task needs to be able to handle the "incomplete" case in a sensible 
>>>>>>> way
>>>>>>> 4) Task needs to be able to handle the "complete" case in a sensible way
>>>>>>> 
>>>>>>> With the current DSL, this is done by:
>>>>>>> 1) @TaskAction execute(InputFileChanges inputs, OutputFileChanges 
>>>>>>> outputs) // I can handle changes to input files and output files
>>>>>>> 2) inputs.allOutOfDate() == "incomplete". Not sure if there is an 
>>>>>>> equivalent on outputs?
>>>>>>> 3) inputs.outOfDate({})  // Every file is considered "out-of-date"
>>>>>>>   outputs.changedSincePreviousExecution({}) // No outputs files 
>>>>>>> considered changed?
>>>>>>> 4) inputs.outOfDate({}) // Only the changed files
>>>>>>>   inputs.removed({})
>>>>>>>   outputs.outOfDate({}) // Only files that have changed
>>>>>>>   outputs.removed({}) // Output files that have been removed
>>>>>>> 
>>>>>>> eg:
>>>>>>> def execute(InputFileChanges inputs, OutputFileChanges outputs) {
>>>>>>>   outputs.outOfDate {
>>>>>>>       // This may be a set of actual changes: not sure what to send 
>>>>>>> when no task history, changed input property, etc
>>>>>>>   }
>>>>>>>   inputs.outOfDate {
>>>>>>>      // This may be a set of changes, or it may be every input (no task 
>>>>>>> history, changed input property, etc)
>>>>>>>   }
>>>>>>> }
>>>>>>> 
>>>>>>> An alternative mechanism could be:
>>>>>>> 1) Explicitly ask the context for the changesets you care about:
>>>>>>>           def changeset = context.getChanges([InputFileChange, 
>>>>>>> OutputFileChange])
>>>>>>> 2) Ask the context (or the changeset) if it is "complete"
>>>>>>>           changeset.complete
>>>>>>> 3) Task author iterates over all inputs (just like a non-incremental 
>>>>>>> task): cannot use changeset
>>>>>>> 4) Task author iterates over changes
>>>>>>> 
>>>>>>> eg:
>>>>>>> def execute(IncrementalTaskContext context) {
>>>>>>> def changeset = context.getChanges([InputFileChange, OutputFileChange])
>>>>>>> if (changeset.complete) {
>>>>>>>     changeset.eachChange(OutputFileChange) {
>>>>>>>         ...
>>>>>>>     }
>>>>>>>     changeset.eachChange(InputFileChange) {
>>>>>>>         ...
>>>>>>>     }
>>>>>>> } else {
>>>>>>>    // Regular non-incremental task action
>>>>>>>    task.getInputs().eachFile {
>>>>>>>        ...
>>>>>>>    }
>>>>>>> }
>>>>>>> 
>>>>>>> The latter is definitely more verbose, and probably a bit less 
>>>>>>> convenient. But it's also more explicit and I think probably more 
>>>>>>> understandable.
>>>>>> 
>>>>>> Maybe. It's better than your other proposal above, in that it doesn't 
>>>>>> force the task implementation to know about every type of change when it 
>>>>>> doesn't want to. This is important because very few task implementations 
>>>>>> are going to care about anything other than input file changes.
>>>>>> 
>>>>>> However, the above also rules out certain performance optimisations as 
>>>>>> we don't know until the task executes what it will actually ask for. 
>>>>>> Instead, we have to be able to deal with the task asking for anything. 
>>>>>> Exactly what a task can handle, however, isn't dynamic - it's baked into 
>>>>>> the code and very much static. So, the API needs to expose this 
>>>>>> statically, so that we can do useful things with this information.
>>>>>> 
>>>>>> This also make evolution easier, as we can inject various different 
>>>>>> views instead of a single context that exposes everything.
>>>>>> 
>>>>>> As far as 'understandability' goes, I think the basic problem here is 
>>>>>> that we're defining the API in terms of changes, whereas the task 
>>>>>> implementations are interested in the work they have to do, and (very 
>>>>>> infrequently) the reason why. And while a change is a type of reason, 
>>>>>> but not all reasons are changes (eg --rerun-tasks and missing history 
>>>>>> are reasons, but they're not changes).
>>>>>> 
>>>>>> So, I think the API needs to express things in terms of what work the 
>>>>>> task needs to do, not in terms of what has changed. This has correctness 
>>>>>> implications, as well, as the task doesn't need to (and possibly cannot) 
>>>>>> be involved in converting changes to work. And for this API, correctness 
>>>>>> is the primary goal, followed by performance, followed by 
>>>>>> understandability. It's not a beginner API.
>>>>>> 
>>>>>> Here are the different kinds of work we're interested in:
>>>>>> 
>>>>>> - Process all input files for which there is no corresponding output 
>>>>>> file.
>>>>>> - Process all input files which have changed since their corresponding 
>>>>>> output files were generated.
>>>>>> - Process all input files for which the corresponding output file has 
>>>>>> changed since they were generated.
>>>>>> - Process all input files for which we don't have any history.
>>>>>> - Process all input files for which the corresponding output file is 
>>>>>> incorrect due to a property change.
>>>>>> - Remove outputs for all input files that have been removed.
>>>>>> - Remove existing outputs before processing input files, for when we're 
>>>>>> rebuilding the outputs (for whatever reason).
>>>>>> 
>>>>>> This says to me we want an API that
>>>>>> 
>>>>>> - Allows a task to invalidate output files given the set of property 
>>>>>> changes. Default is to invalidate all output files for any property 
>>>>>> change.
>>>>>> - Allows a task to invalidate input files given the set of output file 
>>>>>> changes. Default is to invalidate all input files for any output file 
>>>>>> change.
>>>>>> - Allows a task to ask for the input files it needs to do work on.
>>>>> 
>>>>> And you're saying that the current API, with just some renames, provides 
>>>>> this?
>>>>> 
>>>> 
>>>> Well the current API provides for the third option, with the default
>>>> behaviour for changed properties and changed outputs.
>>>> We could actually remove TaskInputChanges.allOutOfDate, as it doesn't
>>>> serve any of the purposes above.
>>> 
>>> The list above wasn't complete. We still need to deal with the case where 
>>> we don't know what old state you need to clean up. For example, say we make 
>>> our dsl meta data task incremental. This task produces a single merged 
>>> output file from a bunch of input files. If we don't have the history for 
>>> the task or if this output file has been changed or if running with 
>>> --rerun-tasks or if the task implementation has changed, then we need to 
>>> make sure that before the task works on the input files, this output file 
>>> has been cleaned of all existing state (eg deleted).
>>> 
>>> Right now we use the allOutOfDate flag to let the task know this. Instead 
>>> we might:
>>> 
>>> - delete the output files of the task before calling the task action.
>>> - let the task know which output files have been invalidated.
>>> - keep the flag and rename it something like 'allOutputsInvalid' or 
>>> 'rebuildRequired'.
>> 
>> Another option would be to split the task action into 2 separate methods, 
>> one which is called on rebuild and one which is called on incremental 
>> execution.
> 
> This doesn't feel right to me. For 1:1 transforming tasks this would be a 
> nuisance.

That's kind of the idea, to remind you to deal with both cases. Both methods 
might be passed the same parameter types, so that the 'rebuild' method can 
simply call the 'incremental' method.

> 
>> Or add an optional method that is called on rebuild before the task action 
>> method is called.
> 
> I can't see what this buys us over telling the user via the info we give the 
> task action.

It offers a couple of things: It allows the task to declare if it cares about 
cleaning up on rebuild or not. This means we can clean up on its behalf if it 
doesn't. And because it's declarative we can do things like call the 'cleanup' 
but not the 'do work' method to provide a more accurate clean task. Or call the 
'cleanup' method on failure. Or to undeploy, stop or otherwise cleanup the work 
of the task once its no longer required.


--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com

Join us at the Gradle Summit 2013, June 13th and 14th in Santa Clara, CA: 
http://www.gradlesummit.com

Re: [gradle-dev] API/DSL for implementing incremental tasks

Reply via email to