Hi team,

Workflow is in a pretty good state -- nearly done, mostly as per the proposal, with nearly all step types, retries, docs [1], and integration into the activities view in the inspector. My feeling is that it radically transforms how easy it is to write sensors and effectors.
Thanks to everyone for their reviews and feedback! The biggest change from the original proposal is the switch from the init.d syntax to the list syntax -- extra thanks to those who agitated for that. The other really nice aspect is how the shorthand step notation functions as a DSL in simple cases, so extra thanks also for the suggestion to make it DSL-like.

The two items remaining are nested workflows and controlling how long workflows are remembered. There is also one new feature which seems to be needed, which I wanted to raise. As the subject suggests, this is mutexes. I had hoped we wouldn't need them, but actually quite often you want to ensure no other workflows are conflicting. Consider the simple case where you want to atomically increment a sensor:

```
- let i = ${entity.sensor.count} ?? 0
- let i = ${i} + 1
- set-sensor count = ${i}
```

Running this twice we'd hope to get count=2. But if the runs are concurrent you might not: both runs could read 0, both compute 1, and both set count=1, losing an increment. So how can we ensure no other instances of the workflow are running concurrently? There are three options I see.

(1) set-sensor allows arithmetic

We could support arithmetic on set-sensor and require it to run atomically against that sensor, for instance `set-sensor count = (${entity.sensor.count} ?? 0) + 1`. We could fairly easily ensure the RHS is evaluated with the caller holding the lock on the sensor. However our arithmetic support is quite limited -- we don't support grouping at present, so either you'd have to write `${entity.sensor.count} + 1 ?? 1` or we'd have to beef that up -- and I think there is something nice about the current rule that arithmetic is only allowed on `let`, as it keeps steps more inspectable.

(2) set-sensor with mutex check

We could introduce a check which is done while the lock on the sensor is held. So you could check that the sensor is unset before setting it, and fail if it isn't, or check that the value is as expected. You can then set up whatever retry behaviour you want in the usual way.
For instance:

```
# pessimistic locking
- let i = ${entity.sensor.count} ?? 0
- let j = ${i} + 1
- step: set-sensor count = ${j}
  require: ${i}
  on-error:
  - goto start
```

```
# mutex lock acquisition
- step: set-sensor lock-for-count = ${workflow.id}
  require:
    when: absent
  on-error:
  - retry backoff 50ms increasing 2x up to 5s
- let i = ${entity.sensor.count} ?? 0
- let i = ${i} + 1
- set-sensor count = ${i}
- clear-sensor lock-for-count
```

(A couple of subtleties for those of you new to workflow conditions: they always have an implicit target depending on context, which for `require` we would make the old sensor value. `when: absent` is a special predicate DSL keyword saying that a sensor is unavailable; you could also use `when: absent_or_null`, `when: falsy`, or `not: { when: truthy }`. Finally, `require: ${i}` uses the fact that conditions default to an equality check; that call is equivalent to `require: { target: ${entity.sensor.count}, equals: ${i} }`.)

The retry with backoff is pretty handy here. But there is still one problem in the lock acquisition case: if the workflow fails after step 1, what will clear the lock? (Pessimistic locking doesn't have this problem.) Happily we have an easy solution, because workflows were designed with recovery in mind. If Brooklyn detects an interrupted workflow on startup, it will fail it with a "dangling workflow" exception, and you can attach recovery to it, specifying replay points and making steps idempotent.

```
# rock-solid mutex lock acquisition
steps:
- step: set-sensor lock-for-count = ${workflow.id}
  require:
    any:
    - when: absent
    - equals: ${workflow.id}
  on-error:
  - retry backoff 50ms increasing 2x up to 5s
- let i = ${entity.sensor.count} ?? 0
- let i = ${i} + 1
- step: set-sensor count = ${i}
  replayable: yes
- clear-sensor lock-for-count
on-error:
- condition:
    error-cause:
      glob: "*DanglingWorkflowException*"
  step: retry replay
replayable: yes
```

The `require` block now allows re-entrancy. We rely on several facts: Brooklyn gives workflow instances a unique ID, and on failover Dangling is thrown from an interrupted step preserving the workflow ID (but with a different task ID, so replays can be distinguished, with support for this in the UI); Brooklyn persistence handles election of a single primary, with any demoted instance interrupting its tasks; and the workflow is replayable from the start, so on Dangling it replays. Additionally we can replay from the `set-sensor` step, which uses local copies of the workflow variables, so if that step is interrupted and runs twice it is idempotent.

(3) explicit `lock` keyword on `workflow` step

My feeling is that (2) is really powerful, and the pessimistic locking case is easy enough, but the mutex implementation is hard. We could make it easy to opt in to by allowing sub-workflow blocks to specify a mutex:

```
- step: workflow lock incrementing-count
  steps:
  - let i = ${entity.sensor.count} ?? 0
  - let i = ${i} + 1
  - step: set-sensor count = ${i}
    replayable: yes
```

This would act exactly as the previous example: setting the workflow.id into a sensor ahead of time, allowing absent or the current workflow ID as the value for re-entrancy, retrying if locked, clearing it afterwards, and handling the Dangling exception. The only tiny differences are that the sensor it atomically sets and clears would be called something like `workflow-lock-incrementing-count`, and the steps would run in a sub-workflow.
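To make the equivalence concrete, here is a sketch of what the `workflow lock` step above would roughly expand to, based on the behaviour described; the exact generated form and retry parameters are my guesses, not a spec:

```
# hypothetical expansion of `workflow lock incrementing-count` (sketch)
- step: set-sensor workflow-lock-incrementing-count = ${workflow.id}
  require:
    any:
    - when: absent            # lock is free
    - equals: ${workflow.id}  # re-entrant: we already hold it
  on-error:
  - retry backoff 50ms increasing 2x up to 5s
- step: workflow              # user-supplied steps run as a sub-workflow
  steps:
  - let i = ${entity.sensor.count} ?? 0
  - let i = ${i} + 1
  - step: set-sensor count = ${i}
    replayable: yes
- clear-sensor workflow-lock-incrementing-count
```

(Plus the workflow-level `on-error` handling of the Dangling exception, as in the rock-solid example earlier.)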
In this example we still need to say that the final `set-sensor count` is a replay point; otherwise, if Brooklyn were interrupted in the split second after setting the sensor but before recording the fact that the step completed, it would retry from the start, giving a slight chance that the sensor increments by 2. This isn't because a mutex is wanted, however; it's because incrementing is fundamentally not an idempotent operation! Provided the steps are idempotent, there would be no need for it.

For instance, to make sure that `apt` (or `dpkg` or `terraform` or any command which requires a lock) isn't invoked concurrently, you could use it like this:

```
type: workflow-effector
name: apt-get
parameters:
  package:
    description: package(s) to install
steps:
- step: workflow lock apt
  steps:
  - ssh sudo apt-get install ${package}
```

Parallel invocations of the effector `apt-get { package: xxx }` will share a mutex on the sensor `workflow-lock-apt`, ensuring they don't conflict. You could even define a new workflow step type (assuming "lock xxx" in the shorthand is a key `lock` on the workflow step):

```
id: apt-get
type: workflow
lock: apt
shorthand: ${package}
parameters:
  package:
    description: package(s) to install
steps:
- ssh sudo apt-get install ${package}
```

With this in the catalog, workflows could use steps such as `apt-get xxx` which acquire a lock from Brooklyn before running, so they don't fail if invoked in parallel.

---

I lean towards providing both (2) and (3). It's actually fairly easy to implement given what we already have, and it allows low-level control for special cases via (2) and a fairly simple high-level mechanism via (3). Thoughts?

On a related note, I'm not 100% happy with `replayable` as a keyword and its options. The essential thing is to indicate which points in a workflow are safe to replay from if it is interrupted or fails.
We currently support "yes" and "no" to indicate whether something is a replayable point ("true" and "false" are synonyms), and "on" and "off" to indicate that a step and all following ones are or are not replayable (unless overridden with replayable yes/no, or until changed with another replayable on/off). It defaults to "off". I think it is logically a good solution, but it doesn't feel intuitive. Alternative suggestions welcome!

Cheers for your input!
Alex

[1] https://github.com/apache/brooklyn-docs/blob/master/guide/blueprints/workflow/index.md