Re: [gradle-dev] API/DSL for implementing incremental tasks

Adam Murdoch Sat, 13 Apr 2013 14:10:54 -0700

On 12/04/2013, at 7:33 PM, Luke Daley <luke.da...@gradleware.com> wrote:


> 
> On 11/04/2013, at 4:37 PM, Daz DeBoer <darrell.deb...@gradleware.com> wrote:
> 
>> On 11 April 2013 04:47, Luke Daley <luke.da...@gradleware.com> wrote:
>>> 
>>> On 11/04/2013, at 8:29 AM, Adam Murdoch <adam.murd...@gradleware.com> wrote:
>>> 
>>>> 
>>>> On 11/04/2013, at 2:55 PM, Daz DeBoer <darrell.deb...@gradleware.com> 
>>>> wrote:
>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 10 April 2013 15:33, Adam Murdoch <adam.murd...@gradleware.com> wrote:
>>>>> 
>>>>> On 10/04/2013, at 8:53 AM, Daz DeBoer <darrell.deb...@gradleware.com> 
>>>>> wrote:
>>>>> 
>>>>>> On 9 April 2013 05:25, Luke Daley <luke.da...@gradleware.com> wrote:
>>>>>>> 
>>>>>>> On 09/04/2013, at 11:55 AM, Luke Daley <luke.da...@gradleware.com> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> On 27/03/2013, at 3:43 PM, Daz DeBoer <darrell.deb...@gradleware.com> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> G'day
>>>>>>>>> 
>>>>>>>>> Now in master is a pretty cool new feature: you can now implement an
>>>>>>>>> 'incremental' task that is informed about exactly which input files
>>>>>>>>> have changed when the task is out of date.
>>>>>>>>> This is very useful for something like a C++ compile task, as it means
>>>>>>>>> that only the changed files need to be recompiled, rather than the
>>>>>>>>> entire set of inputs.
>>>>>>>>> 
>>>>>>>>> I've got a 'draft' DSL functioning, and would appreciate any feedback
>>>>>>>>> you guys have. Here's a sample:
>>>>>>>>> 
>>>>>>>>>    class IncrementalSync extends DefaultTask {
>>>>>>>>>        @InputFiles
>>>>>>>>>        def FileCollection src
>>>>>>>>> 
>>>>>>>>>        @OutputDirectory
>>>>>>>>>        def File destination
>>>>>>>>> 
>>>>>>>>>        @TaskAction
>>>>>>>>>        void execute(TaskInputChanges inputs) {
>>>>>>>>>            if (inputs.allOutOfDate) {
>>>>>>>>>                FileUtils.forceDelete(destination)
>>>>>>>>>            }
>>>>>>>>> 
>>>>>>>>>            inputs.outOfDate({
>>>>>>>>>                FileUtils.copyFile(change.file, 
>>>>>>>>> targetFile(change.file))
>>>>>>>>>            } as Action)
>>>>>>>>>            .removed({
>>>>>>>>>                FileUtils.forceDelete(targetFile(change.file))
>>>>>>>>>            } as Action)
>>>>>>>>>            .process()
>>>>>>>>>        }
>>>>>>>>> 
>>>>>>>>>        def targetFile(def inputFile) {
>>>>>>>>>            new File(destination, change.file.name)
>>>>>>>>>        }
>>>>>>>>>    }
>>>>>>>> 
>>>>>>>> Thinking more about this, I don't think what we have is right. I think 
>>>>>>>> we can come up with a simpler API.
>>>>>>>> 
>>>>>>>> interface TaskExecutionContext {
>>>>>>>> 
>>>>>>>> boolean isHasPriorOutputs(); // this is an incremental execution
>>>>>>>> 
>>>>>>>> void eachInputFileChange(Action<InputFileChange> action);
>>>>>>>> 
>>>>>>>>    interface InputFileChange {
>>>>>>>>      boolean isAdded();
>>>>>>>>      boolean isModified();
>>>>>>>>      boolean isRemoved();
>>>>>>>>      File getFile();
>>>>>>>>    }
>>>>>>>> 
>>>>>>>> }
>>>>>>>> 
>>>>>>>> @TaskAction
>>>>>>>> void execute(TaskExecutionContext executionContext) {
>>>>>>>>    if (executionContext.hasPriorOutputs) {
>>>>>>>>            executionContext.eachInputFileChange {
>>>>>>>> 
>>>>>>>>            }
>>>>>>>>    } else {
>>>>>>>>            // full rebuild
>>>>>>>>    }
>>>>>>>> }
>>>>>>>> 
>>>>>>>> Some issues I have with the existing design:
>>>>>>>> 
>>>>>>>> * I'm not sure "outOfDate" is the right term here. It doesn't quite 
>>>>>>>> capture it and can imply more than what we actually mean. You could 
>>>>>>>> say that an output is out of date, but it doesn't quite work for 
>>>>>>>> inputs.
>>>>>>>> * why do we need the outOfDate() and removed() methods? After reading 
>>>>>>>> the Javadoc I'm not sure how I would use this API. What's the 
>>>>>>>> difference between out-of-date and removed? I could guess, but it's 
>>>>>>>> not clear.
>>>>>>>> * I think we'll want to provide information that is not strictly an 
>>>>>>>> aspect of the inputs or outputs, but of something emergent from the 
>>>>>>>> input/output concept. That's why I think modelling an encompassing 
>>>>>>>> type is better.
>>>>>>> 
>>>>>>> Sorry, accidentally hit send…
>>>>>>> 
>>>>>>>> What I think we are trying to do is make it easy for the user to 
>>>>>>>> decide whether they should do a fine grained incremental execution or 
>>>>>>>> a coarse grained full.
>>>>>>> 
>>>>>>> The existing isAllOutOfDate() method goes too far, based on its Javadoc:
>>>>>>> 
>>>>>>>  /**
>>>>>>>   * Specifies if incremental build is not possible due to changed Input 
>>>>>>> Properties, Output Files, etc.
>>>>>>>   * In this case, every file will be considered to be 'out-of-date'.
>>>>>>>   */
>>>>>>>   boolean isAllOutOfDate();
>>>>>>> 
>>>>>>> It is too presumptuous of us to say that all outputs are invalid just 
>>>>>>> because an input property changed. The task author may know better and 
>>>>>>> can only rebuild what is effected by the input property change.
>>>>>>> 
>>>>>>> We can tell if a full build is needed with relative confidence based on 
>>>>>>> the lack of task history, or by verifying that the declared outputs 
>>>>>>> don't exist (if that's fast enough), hence the isHasPriorOutputs() 
>>>>>>> method. It would be nice to be able to tell the task that all of the 
>>>>>>> input files have changed in some way, but this seems like it would be 
>>>>>>> very expensive to compute.
>>>>>>> 
>>>>>>>> --
>>>>>>>> Luke Daley
>>>>>>>> Principal Engineer, Gradleware
>>>>>>>> http://gradleware.com
>>>>>>>> 
>>>>>>>> Join me at the Gradle Summit 2013, June 13th and 14th in Santa Clara, 
>>>>>>>> CA: http://www.gradlesummit.com
>>>>>>>> 
>>>>>> 
>>>>>> For our usual "up-to-date" check, these are the things that can cause
>>>>>> a task to be out-of-date (in order of checking):
>>>>>> 1. Task has no declared outputs (always out-of-date with no declared 
>>>>>> outputs)
>>>>>> 2. Task.upToDateWhen() returns false
>>>>>> 3. Task has no record of previous execution
>>>>>> 4. Task class has changed since previous execution
>>>>>> 5. Task input property has changed
>>>>>> 6. Task output files have changed
>>>>>> 7. Task input files have changed
>>>>>> 
>>>>>> So the interesting thing is how to deal with all of these cases with
>>>>>> the incremental API:
>>>>>> 
>>>>>> (We are currently not honouring 1. for incremental tasks: this rule is
>>>>>> mostly to make lifecycle tasks (and similar) fast, by not
>>>>>> persisting/checking state. By making your task 'incremental' you're
>>>>>> also telling Gradle not to make this check.)
>>>>>> 
>>>>>> The initial approach was that the InputFileChanges parameter on
>>>>>> TaskAction was an indication that the task could handle changes to
>>>>>> input files: anything other sort of state change could not be handled
>>>>>> by the task.  The "out-of-date" files would be the set of files that
>>>>>> you needed to re-evaluate, based on _all_ changes, including those
>>>>>> that you cannot handle directly (like, say, input property changes).
>>>>>> We would later add other parameter types to declare that the task
>>>>>> could handle other incremental changes: properties, output files, etc.
>>>>>> 
>>>>>> Having spent more time on this, I'm not really liking this approach so
>>>>>> much: I think that a simple "tell me the changed inputs" method is
>>>>>> more useful and less confusing than a clever "tell me what I should
>>>>>> re-evaluate" method (this is what Luke is suggesting, too).
>>>>>> 
>>>>>> We then need a way to tell an incremental task that other
>>>>>> (non-input-file) changes have occurred since the prior execution,
>>>>>> leaving the task implementation to decide what to do in this case.
>>>>>> This API needs to be able to adapt to the future when a task can
>>>>>> handle input-file and input-property changes, but still needs to know
>>>>>> if task.upToDate() is false or outputs have changed since previous
>>>>>> execution. Right now, we need to be able to ask "has anything other
>>>>>> than 'input-files' changed since the last execution"?
>>>>>> 
>>>>>> Perhaps something like:
>>>>>> 
>>>>>> IncrementalExecutionContext {
>>>>>>  boolean isIncremental(); // false for cases 2,3,4 above
>>>>>> 
>>>>>>  // We can pretty easily implement these right now
>>>>>>  boolean isHasInputPropertyChanges()
>>>>>>  boolean isHasInputFileChanges()
>>>>>>  boolean isHasOutputFileChanges()
>>>>>> 
>>>>>>  void eachInputFileChange(Action<InputFileChange> action)
>>>>>> 
>>>>>>  // And to add to the API in the future
>>>>>>  void eachInputPropertyChange(Action<InputPropertyChange> action)
>>>>>>  void eachOutputFileChange(Action<OutputFileChange> action)
>>>>>> }
>>>>> 
>>>>> How does this actually make it easier to implement an incremental task, 
>>>>> given that most tasks don't care about output file changes and 
>>>>> approximately none care about property changes?
>>>>> 
>>>>> It seems like it's more understandable and obvious, and not that much 
>>>>> less easy to use. People already know how to implement a non-incremental 
>>>>> task, and it might be easier for them to choose to do it the "old way" 
>>>>> when the task is not incremental, or there are changes that the task 
>>>>> cannot handle.
>>>> 
>>>> Why would it be easier? Do you have a concrete example? Why would the task 
>>>> not just go: for each input file that needs work, do the work.
>>>> 
>>>>> 
>>>>> There's certainly a trade-off here between elegance and obviousness. The 
>>>>> fact the current DSL has some non-obvious logic involved is definitely a 
>>>>> cost that should be considered.
>>>>> 
>>>>> 
>>>>> What happens when we add new types of things that we detect changes for?
>>>>> 
>>>>> I guess the issue is that the task wants to ask: "are there any changes 
>>>>> that are not included in [X,Y]" where X,Y are the set of changes that the 
>>>>> task _is_ aware of.
>>>>> So a method like:
>>>>>  hasChangesOtherThan([InputFileChange,OutputFileChange]) would give this 
>>>>> functionality.
>>>>> 
>>>>> 
>>>>> Stepping back, there are several things a task requires from this API.
>>>>> 1) Task can declare the changeset(s) it can handle, and stream these 
>>>>> changes
>>>>> 2) Task needs to know if the changeset(s) are "incomplete": are there 
>>>>> other changes outside of the ones I can handle. The case where there is 
>>>>> no prior history is a type of "incomplete".
>>>>> 3) Task needs to be able to handle the "incomplete" case in a sensible way
>>>>> 4) Task needs to be able to handle the "complete" case in a sensible way
>>>>> 
>>>>> With the current DSL, this is done by:
>>>>> 1) @TaskAction execute(InputFileChanges inputs, OutputFileChanges 
>>>>> outputs) // I can handle changes to input files and output files
>>>>> 2) inputs.allOutOfDate() == "incomplete". Not sure if there is an 
>>>>> equivalent on outputs?
>>>>> 3) inputs.outOfDate({})  // Every file is considered "out-of-date"
>>>>>   outputs.changedSincePreviousExecution({}) // No outputs files 
>>>>> considered changed?
>>>>> 4) inputs.outOfDate({}) // Only the changed files
>>>>>   inputs.removed({})
>>>>>   outputs.outOfDate({}) // Only files that have changed
>>>>>   outputs.removed({}) // Output files that have been removed
>>>>> 
>>>>> eg:
>>>>> def execute(InputFileChanges inputs, OutputFileChanges outputs) {
>>>>>   outputs.outOfDate {
>>>>>       // This may be a set of actual changes: not sure what to send when 
>>>>> no task history, changed input property, etc
>>>>>   }
>>>>>   inputs.outOfDate {
>>>>>      // This may be a set of changes, or it may be every input (no task 
>>>>> history, changed input property, etc)
>>>>>   }
>>>>> }
>>>>> 
>>>>> An alternative mechanism could be:
>>>>> 1) Explicitly ask the context for the changesets you care about:
>>>>>           def changeset = context.getChanges([InputFileChange, 
>>>>> OutputFileChange])
>>>>> 2) Ask the context (or the changeset) if it is "complete"
>>>>>           changeset.complete
>>>>> 3) Task author iterates over all inputs (just like a non-incremental 
>>>>> task): cannot use changeset
>>>>> 4) Task author iterates over changes
>>>>> 
>>>>> eg:
>>>>> def execute(IncrementalTaskContext context) {
>>>>> def changeset = context.getChanges([InputFileChange, OutputFileChange])
>>>>> if (changeset.complete) {
>>>>>     changeset.eachChange(OutputFileChange) {
>>>>>         ...
>>>>>     }
>>>>>     changeset.eachChange(InputFileChange) {
>>>>>         ...
>>>>>     }
>>>>> } else {
>>>>>    // Regular non-incremental task action
>>>>>    task.getInputs().eachFile {
>>>>>        ...
>>>>>    }
>>>>> }
>>>>> 
>>>>> The latter is definitely more verbose, and probably a bit less 
>>>>> convenient. But it's also more explicit and I think probably more 
>>>>> understandable.
>>>> 
>>>> Maybe. It's better than your other proposal above, in that it doesn't 
>>>> force the task implementation to know about every type of change when it 
>>>> doesn't want to. This is important because very few task implementations 
>>>> are going to care about anything other than input file changes.
>>>> 
>>>> However, the above also rules out certain performance optimisations as we 
>>>> don't know until the task executes what it will actually ask for. Instead, 
>>>> we have to be able to deal with the task asking for anything. Exactly what 
>>>> a task can handle, however, isn't dynamic - it's baked into the code and 
>>>> very much static. So, the API needs to expose this statically, so that we 
>>>> can do useful things with this information.
>>>> 
>>>> This also make evolution easier, as we can inject various different views 
>>>> instead of a single context that exposes everything.
>>>> 
>>>> As far as 'understandability' goes, I think the basic problem here is that 
>>>> we're defining the API in terms of changes, whereas the task 
>>>> implementations are interested in the work they have to do, and (very 
>>>> infrequently) the reason why. And while a change is a type of reason, but 
>>>> not all reasons are changes (eg --rerun-tasks and missing history are 
>>>> reasons, but they're not changes).
>>>> 
>>>> So, I think the API needs to express things in terms of what work the task 
>>>> needs to do, not in terms of what has changed. This has correctness 
>>>> implications, as well, as the task doesn't need to (and possibly cannot) 
>>>> be involved in converting changes to work. And for this API, correctness 
>>>> is the primary goal, followed by performance, followed by 
>>>> understandability. It's not a beginner API.
>>>> 
>>>> Here are the different kinds of work we're interested in:
>>>> 
>>>> - Process all input files for which there is no corresponding output file.
>>>> - Process all input files which have changed since their corresponding 
>>>> output files were generated.
>>>> - Process all input files for which the corresponding output file has 
>>>> changed since they were generated.
>>>> - Process all input files for which we don't have any history.
>>>> - Process all input files for which the corresponding output file is 
>>>> incorrect due to a property change.
>>>> - Remove outputs for all input files that have been removed.
>>>> - Remove existing outputs before processing input files, for when we're 
>>>> rebuilding the outputs (for whatever reason).
>>>> 
>>>> This says to me we want an API that
>>>> 
>>>> - Allows a task to invalidate output files given the set of property 
>>>> changes. Default is to invalidate all output files for any property change.
>>>> - Allows a task to invalidate input files given the set of output file 
>>>> changes. Default is to invalidate all input files for any output file 
>>>> change.
>>>> - Allows a task to ask for the input files it needs to do work on.
>>> 
>>> And you're saying that the current API, with just some renames, provides 
>>> this?
>>> 
>> 
>> Well the current API provides for the third option, with the default
>> behaviour for changed properties and changed outputs.
>> We could actually remove TaskInputChanges.allOutOfDate, as it doesn't
>> serve any of the purposes above.
> 
> I originally thought we'd need to support telling the task that it needs to 
> do everything explicitly, so that it could potentially do its work more 
> efficiently (e.g. batching).

I don't think so. There are certainly those sorts of tasks, such as an 
incremental Java compile task. For this kind of task, you'd just collect the 
files into a collection and then chop that up however you like at the end. So, 
the semantics of the action aren't 'do the work on this file', rather it means 
'here's a file you need to do work on'.


--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com

Join us at the Gradle Summit 2013, June 13th and 14th in Santa Clara, CA: 
http://www.gradlesummit.com

Re: [gradle-dev] API/DSL for implementing incremental tasks

Reply via email to