Hi team,

Workflow is in a pretty good state -- nearly done, mostly as per the proposal, with nearly all step types, retries, docs [1], and integration into the activities view in the inspector. My feeling is that it radically transforms how easy it is to write sensors and effectors.
Thanks to everyone for their reviews and feedback! The biggest change from the original proposal is the switch from the init.d syntax to the list syntax -- extra thanks to those who agitated for that. The other really nice aspect is how the shorthand step notation functions as a DSL in simple cases, so extra thanks also for the suggestion to make it DSL-like.

The two items remaining are nested workflows and controlling how long workflows are remembered. There is also one new feature which seems to be needed, which I wanted to raise. As the subject suggests, this is mutexes. I had hoped we wouldn't need them, but actually quite often you want to ensure no other workflows are conflicting. Consider the simple case where you want to atomically increment a sensor:

```
- let i = ${entity.sensor.count} ?? 0
- let i = ${i} + 1
- set-sensor count = ${i}
```

Running this twice we'd hope to get count=2. But if the runs are concurrent you might not: both runs could read 0, both compute 1, and both set count=1, losing an increment. So how can we ensure no other instances of the workflow are running concurrently? There are three options I see.

(1) set-sensor allows arithmetic

We could support arithmetic on set-sensor and require it to run atomically against that sensor, for instance `set-sensor count = (${entity.sensor.count} ?? 0) + 1`. We could fairly easily ensure the RHS is evaluated with the caller holding the lock on the sensor. However our arithmetic support is quite limited -- we don't support grouping at present, so either you'd have to write `${entity.sensor.count} + 1 ?? 1` or we'd have to beef that up -- and I think there is something nice about the current rule that arithmetic is only allowed on `let`, as it keeps steps more inspectable.

(2) set-sensor with mutex check

We could introduce a check which is done while the lock on the sensor is held. So you could check that the sensor is unset before setting it, and fail if it isn't, or check that the value is as expected. You can then set up whatever retry behaviour you want in the usual way.
For instance:

```
# pessimistic locking
- let i = ${entity.sensor.count} ?? 0
- let j = ${i} + 1
- step: set-sensor count = ${j}
  require: ${i}
  on-error:
  - goto start
```

```
# mutex lock acquisition
- step: set-sensor lock-for-count = ${workflow.id}
  require:
    when: absent
  on-error:
  - retry backoff 50ms increasing 2x up to 5s
- let i = ${entity.sensor.count} ?? 0
- let i = ${i} + 1
- set-sensor count = ${i}
- clear-sensor lock-for-count
```

(A couple of subtleties for those of you new to workflow conditions: they always have an implicit target depending on context, which for `require` we would make the old sensor value. `when: absent` is a special predicate DSL keyword saying that a sensor is unavailable; you could also use `when: absent_or_null`, `when: falsy`, or `not: { when: truthy }`. Finally, `require: ${i}` uses the fact that conditions default to an equality check; that call is equivalent to `require: { target: ${entity.sensor.count}, equals: ${i} }`.)

The retry with backoff is pretty handy here. But there is still one problem in the lock acquisition case: if the workflow fails after step 1, what will clear the lock? (Pessimistic locking doesn't have this problem.) Happily we have an easy solution, because workflows were designed with recovery in mind. If Brooklyn detects an interrupted workflow on startup, it will fail it with a "dangling workflow" exception, and you can attach recovery to it, specifying replay points and making steps idempotent.

```
# rock-solid mutex lock acquisition
steps:
- step: set-sensor lock-for-count = ${workflow.id}
  require:
    any:
    - when: absent
    - equals: ${workflow.id}
  on-error:
  - retry backoff 50ms increasing 2x up to 5s
- let i = ${entity.sensor.count} ?? 0
- let i = ${i} + 1
- step: set-sensor count = ${i}
  replayable: yes
- clear-sensor lock-for-count
on-error:
- condition:
    error-cause:
      glob: "*DanglingWorkflowException*"
  step: retry replay
replayable: yes
```

The `require` block now allows re-entrancy. We rely on several facts: Brooklyn gives workflow instances a unique ID, and on failover Dangling is thrown from an interrupted step preserving the workflow ID (but with a different task ID, so replays can be distinguished, with support for this in the UI); Brooklyn persistence handles election of a single primary, with any demoted instance interrupting its tasks; and the workflow is replayable from the start, so on Dangling it replays. Additionally we can replay from the `set-sensor` step, which uses local copies of the workflow variables, so if that step is interrupted and runs twice it is idempotent.

(3) explicit `lock` keyword on `workflow` step

My feeling is that (2) is really powerful, and the pessimistic locking case is easy enough, but the mutex implementation is hard. We could make it easy to opt in to by allowing sub-workflow blocks to specify a mutex:

```
- step: workflow lock incrementing-count
  steps:
  - let i = ${entity.sensor.count} ?? 0
  - let i = ${i} + 1
  - step: set-sensor count = ${i}
    replayable: yes
```

This would act exactly as the previous example: setting the workflow.id into a sensor ahead of time, allowing absent or the current workflow ID as the value for re-entrancy, retrying if locked, clearing it afterwards, and handling the Dangling exception. The only tiny differences are that the sensor it atomically sets and clears would be called something like `workflow-lock-incrementing-count`, and the steps would run in a sub-workflow.
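To make the equivalence concrete, here is a sketch of what the `workflow lock` step above would roughly expand to, based on the behaviour described; the exact generated form and retry parameters are my guesses, not a spec:

```
# hypothetical expansion of `workflow lock incrementing-count` (sketch)
- step: set-sensor workflow-lock-incrementing-count = ${workflow.id}
  require:
    any:
    - when: absent            # lock is free
    - equals: ${workflow.id}  # re-entrant: we already hold it
  on-error:
  - retry backoff 50ms increasing 2x up to 5s
- step: workflow              # user-supplied steps run as a sub-workflow
  steps:
  - let i = ${entity.sensor.count} ?? 0
  - let i = ${i} + 1
  - step: set-sensor count = ${i}
    replayable: yes
- clear-sensor workflow-lock-incrementing-count
```

(Plus the workflow-level `on-error` handling of the Dangling exception, as in the rock-solid example earlier.)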
In this example we still need to say that the final `set-sensor count` is a replay point; otherwise, if Brooklyn were interrupted in the split second after setting the sensor but before recording the fact that the step completed, it would retry from the start, giving a slight chance that the sensor increments by 2. This isn't because a mutex is wanted, however; it's because incrementing is fundamentally not an idempotent operation! Provided the steps are idempotent, there would be no need for it.

For instance, to make sure that `apt` (or `dpkg` or `terraform` or any command which requires a lock) isn't invoked concurrently, you could use it like this:

```
type: workflow-effector
name: apt-get
parameters:
  package:
    description: package(s) to install
steps:
- step: workflow lock apt
  steps:
  - ssh sudo apt-get install ${package}
```

Parallel invocations of the effector `apt-get { package: xxx }` will share a mutex on the sensor `workflow-lock-apt`, ensuring they don't conflict. You could even define a new workflow step type (assuming "lock xxx" in the shorthand is a key `lock` on the workflow step):

```
id: apt-get
type: workflow
lock: apt
shorthand: ${package}
parameters:
  package:
    description: package(s) to install
steps:
- ssh sudo apt-get install ${package}
```

With this in the catalog, workflows could use steps such as `apt-get xxx` which acquire a lock from Brooklyn before running, so they don't fail if invoked in parallel.

---

I lean towards providing both (2) and (3). It's actually fairly easy to implement given what we already have, and it allows low-level control for special cases via (2) and a fairly simple high-level mechanism via (3). Thoughts?

On a related note, I'm not 100% happy with `replayable` as a keyword and its options. The essential thing is to indicate which points in a workflow are safe to replay from if it is interrupted or fails.
We currently support "yes" and "no" to indicate whether something is a replayable point ("true" and "false" are synonyms), and "on" and "off" to indicate that a step and all following ones are or are not replayable (unless overridden with replayable yes/no, or until changed with another replayable on/off). It defaults to "off". I think it is logically a good solution, but it doesn't feel intuitive. Alternative suggestions welcome!

Cheers for your input!
Alex

[1] https://github.com/apache/brooklyn-docs/blob/master/guide/blueprints/workflow/index.md