Hi team,

I've got most of the proposed "lock" implementation completed, as discussed in the previous mail (below); PR to come shortly. It works well, though there are a huge number of subtleties, so test cases were a challenge; which makes it all the better that we provide this, so workflow authors have a much easier time. The biggest challenge was making sure that if an author writes e.g. `{ type: workflow, lock: my-lock, on-error: [ retry ], steps: [ ... ] }` and Brooklyn does a failover, the retry (at the new server) preserves the lock ... but if there is no retry, or the retry fails, the lock is cleared.
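For illustration, here's that example written out in block form, using the sensor-increment steps from the quoted mail below as the body:

```
type: workflow
lock: my-lock
on-error: [ retry ]
steps:
  - let i = ${entity.sensor.count} ?? 0
  - let i = ${i} + 1
  - set-sensor count = ${i}
```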
As part of this I've been mulling over "replayable"; as I mentioned below, it was one aspect I'm not entirely sure of, and it's quite closely related to "expiration", which I think might be better described as "retention". I think I've got a better way to handle those, and a tweak to error handling. It's described in this section: https://docs.google.com/document/d/1u02Bi6sS8Fkf1s7UzRRMnvLhA477bqcyxGa0nJesqkI/edit#heading=h.63aibdplmze There are two questions towards the end that I'd especially value input on.

Thanks
Alex

On Fri, 11 Nov 2022 at 13:40, Alex Heneveld <a...@cloudsoft.io> wrote:

> Hi team,
>
> Workflow is in a pretty good state -- nearly done, mostly as per the proposal, with nearly all step types, retries, docs [1], and integration into the activities view in the inspector. My feeling is that it radically transforms how easy it is to write sensors and effectors.
>
> Thanks to everyone for their reviews and feedback!
>
> The biggest change from the original proposal is a switch to the list syntax from the init.d syntax. Extra thanks to those who agitated for that.
>
> The other really nice aspect is how the shorthand step notation functions as a DSL in simple cases, so extra thanks for the suggestion to make it DSL-like.
>
> The two items remaining are nested workflows and controlling how long workflows are remembered.
>
> There is one new feature which seems to be needed, which I wanted to raise. As the subject suggests, this is mutexes. I had hoped we wouldn't need them, but actually quite often you want to ensure no other workflows are conflicting. Consider the simple case where you want to atomically increment a sensor:
>
> ```
> - let i = ${entity.sensor.count} ?? 0
> - let i = ${i} + 1
> - set-sensor count = ${i}
> ```
>
> Running this twice we'd hope to get count=2, but if the runs are concurrent we may not.
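> To spell out the race, here is one possible interleaving of two concurrent runs (illustrative only):
>
> ```
> # run A: let i = ${entity.sensor.count} ?? 0    ->  i = 0
> # run B: let i = ${entity.sensor.count} ?? 0    ->  i = 0  (A has not yet set the sensor)
> # run A: let i = ${i} + 1                       ->  i = 1
> # run B: let i = ${i} + 1                       ->  i = 1
> # run A: set-sensor count = ${i}                ->  count = 1
> # run B: set-sensor count = ${i}                ->  count = 1, not 2
> ```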
> So how can we ensure no other instances of the workflow are running concurrently?
>
> There are three options I see.
>
> (1) set-sensor allows arithmetic
>
> We could support arithmetic on set-sensor and require it to run atomically against that sensor, for instance `set-sensor count = (${entity.sensor.count} ?? 0) + 1`. We could fairly easily ensure the RHS is evaluated with the caller holding the lock on the sensor count. However our arithmetic support is quite limited (we don't support grouping at present, so either you'd have to write `${entity.sensor.count} + 1 ?? 1` or we'd beef that up), and I think there is something nice about the current rule that arithmetic is only allowed on `let`, as it keeps steps more inspectable.
>
> (2) set-sensor with mutex check
>
> We could introduce a check which is done while the lock on the sensor is held. So you could check the sensor is unset before setting it, and fail if it isn't, or check the value is as expected. You can then set up whatever retry behaviour you want in the usual way. For instance:
>
> ```
> # pessimistic locking
> - let i = ${entity.sensor.count} ?? 0
> - let j = ${i} + 1
> - step: set-sensor count = ${j}
>   require: ${i}
>   on-error:
>   - goto start
> ```
>
> ```
> # mutex lock acquisition
> - step: set-sensor lock-for-count = ${workflow.id}
>   require:
>     when: absent
>   on-error:
>   - retry backoff 50ms increasing 2x up to 5s
> - let i = ${entity.sensor.count} ?? 0
> - let i = ${i} + 1
> - set-sensor count = ${i}
> - clear-sensor lock-for-count
> ```
>
> (A couple of subtleties for those of you new to workflow conditions: they always have an implicit target depending on context, which for `require` we would make the old sensor value. `when: absent` is a special predicate DSL keyword to say that a sensor is unavailable (you could also use `when: absent_or_null`, `when: falsy`, or `not: { when: truthy }`). Finally, `require: ${i}` uses the fact that conditions default to being an equality check; that call is equivalent to `require: { target: ${entity.sensor.count}, equals: ${i} }`.)
>
> The retry with backoff is pretty handy here. But there is still one problem in the lock acquisition case: if the workflow fails after step 1, what will clear the lock? (Pessimistic locking doesn't have this problem.) Happily we have an easy solution, because workflows were designed with recovery in mind. If Brooklyn detects an interrupted workflow on startup, it will fail it with a "dangling workflow" exception, and you can attach recovery to it, specifying replay points and making steps idempotent.
>
> ```
> # rock-solid mutex lock acquisition
> steps:
> - step: set-sensor lock-for-count = ${workflow.id}
>   require:
>     any:
>     - when: absent
>     - equals: ${workflow.id}
>   on-error:
>   - retry backoff 50ms increasing 2x up to 5s
> - let i = ${entity.sensor.count} ?? 0
> - let i = ${i} + 1
> - step: set-sensor count = ${i}
>   replayable: yes
> - clear-sensor lock-for-count
> on-error:
> - condition:
>     error-cause:
>       glob: "*DanglingWorkflowException*"
>   step: retry replay
> replayable: yes
> ```
>
> The `require` block now allows re-entrancy. We rely on the facts that Brooklyn gives workflow instances a unique ID; that on failover, Dangling is thrown from an interrupted step, preserving the workflow ID (but giving it a different task ID so replays can be distinguished, with support for this in the UI); and that Brooklyn persistence handles election of a single primary, with any demoted instance interrupting its tasks. The workflow is replayable from the start, and on Dangling it replays. Additionally we can replay from the `set-sensor` step, which will use local copies of the workflow variables, so if that step is interrupted and runs twice it will be idempotent.
>
> (3) explicit `lock` keyword on `workflow` step
>
> My feeling is that (2) is really powerful, and the pessimistic locking case is easy enough, but the mutex implementation is hard. We could make it easy to opt in to by allowing sub-workflow blocks to specify a mutex:
>
> ```
> - step: workflow lock incrementing-count
>   steps:
>   - let i = ${entity.sensor.count} ?? 0
>   - let i = ${i} + 1
>   - step: set-sensor count = ${i}
>     replayable: yes
> ```
>
> This would act exactly as the previous example: setting the workflow.id into a sensor ahead of time, allowing absent or the current workflow id as the value for re-entrancy, retrying if locked, clearing it afterwards, and handling the Dangling exception. The only tiny differences are that the sensor it atomically sets and clears would be called something like `workflow-lock-incrementing-count`, and the steps would run in a sub-workflow.
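> (Roughly, I'd expect that step to expand to the equivalent of the rock-solid example above -- a sketch, with the generated sensor name and the nested steps elided:)
>
> ```
> - step: set-sensor workflow-lock-incrementing-count = ${workflow.id}
>   require:
>     any:
>     - when: absent
>     - equals: ${workflow.id}
>   on-error:
>   - retry backoff 50ms increasing 2x up to 5s
> # ... the nested steps run here, as a sub-workflow ...
> - clear-sensor workflow-lock-incrementing-count
> ```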
> In the `incrementing-count` example we still need to say that the final `set-sensor count` is a replay point; otherwise, if Brooklyn were interrupted in the split second after setting the sensor but before recording the fact that the step completed, it would retry from the start, with a slight chance that the sensor increments by 2. This isn't to do with the mutex, however; it's because fundamentally adding one is not an idempotent operation! Provided the steps are idempotent, there would be no need for it.
>
> For instance, to make sure that `apt` (or `dpkg` or `terraform` or any command which requires a lock) isn't invoked concurrently, you could use it like this:
>
> ```
> type: workflow-effector
> name: apt-get
> parameters:
>   package:
>     description: package(s) to install
> steps:
> - step: workflow lock apt
>   steps:
>   - ssh sudo apt-get install ${package}
> ```
>
> Parallel invocations of effector `apt-get { package: xxx }` will share a mutex on sensor `workflow-lock-apt`, ensuring they don't conflict.
>
> You could even define a new workflow step type (assuming "lock xxx" in the shorthand is a key `lock` on the workflow step):
>
> ```
> id: apt-get
> type: workflow
> lock: apt
> shorthand: ${package}
> parameters:
>   package:
>     description: package(s) to install
> steps:
> - ssh sudo apt-get install ${package}
> ```
>
> With this in the catalog, you could in a workflow have steps such as `apt-get xxx` which acquire a lock from Brooklyn before running, so they don't fail if invoked in parallel.
>
> ---
>
> I lean towards providing both (2) and (3). It's actually fairly easy to implement given what we already have, and it allows low-level control for special cases via (2) and a fairly simple high-level mechanism via (3).
>
> Thoughts?
>
> On a related note, I'm not 100% happy with `replayable` as a keyword and its options. The essential thing is to indicate which points in a workflow are safe to replay from if it is interrupted or fails. We currently support "yes" and "no" to indicate whether something is a replayable point ("true" and "false" as synonyms), and "on" and "off" to indicate that a step and all following ones are or are not replayable (unless overridden with replayable yes/no, or until changed with another replayable on/off). It defaults to "off". I think it is logically a good solution, but it doesn't feel intuitive. Alternative suggestions welcome!
>
> Cheers for your input!
> Alex
>
> [1] https://github.com/apache/brooklyn-docs/blob/master/guide/blueprints/workflow/index.md
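PS: to make the `replayable` semantics from the quoted mail easier to see, a quick sketch (the `log` steps are just placeholders):

```
steps:
- log step A           # default is "off": not a replay point
- step: log step B
  replayable: on       # this step and all following become replay points
- log step C           # a replay point, inherited from "on"
- step: log step D
  replayable: no       # opts just this one step out
- log step E           # a replay point again
```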