Following some feedback re "replayable" I've rejigged that section [1]. It changes the concepts so that whether a step is "resumable" is considered separately from marking explicit "replayable" waypoints; in most cases either one allows a workflow to be replayed, e.g. if a resumable step is interrupted (such as a sleep, but not, for instance, an HTTP call), or if the workflow author indicated that a completed step is "replayable here".
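
Roughly, the shape is as below. This is only a sketch to convey the distinction; the step shorthands, sensor name, and the exact keyword for the author-declared waypoint are illustrative, and the doc section has the real syntax and semantics.

```
# illustrative sketch only -- see the doc section for the actual keywords
steps:
  - sleep 10m
  # a sleep is "resumable": if Brooklyn is interrupted mid-step it can simply run again
  - ssh ./submit-request.sh
  # an ssh (or http) call is not resumable: we cannot know whether it took effect
  - step: set-sensor request-submitted = true
    replayable: yes
    # an explicit waypoint declared by the author, i.e. "replayable here"
```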

Thanks to those who gave their input! I am much happier with this rejigged plan. More comments are welcome!

Best
Alex

[1] https://docs.google.com/document/d/1u02Bi6sS8Fkf1s7UzRRMnvLhA477bqcyxGa0nJesqkI/edit#heading=h.63aibdplmze

On Fri, 18 Nov 2022 at 13:44, Alex Heneveld <a...@cloudsoft.io> wrote:

> Hi team,
>
> I've got most of the proposed "lock" implementation completed, as discussed in the previous mail (below); PR to come shortly. It works well, though there are a huge number of subtleties, so test cases were a challenge; all the better that we provide this, however, so workflow authors have a much easier time. The biggest challenge was to make sure that if an author writes e.g. `{ type: workflow, lock: my-lock, on-error: [ retry ], steps: [ ... ] }`, then if Brooklyn does a failover the retry (at the new server) preserves the lock ... but if there is no retry, or the retry fails, the lock is cleared.
>
> As part of this I've been mulling over "replayable"; as I mentioned below, it was one aspect I'm not entirely sure of, and it's quite closely related to "expiration", which I think might be better described as "retention". I think I've got a better way to handle those, and a tweak to error handling. It's described in this section:
>
> https://docs.google.com/document/d/1u02Bi6sS8Fkf1s7UzRRMnvLhA477bqcyxGa0nJesqkI/edit#heading=h.63aibdplmze
>
> There are two questions towards the end that I'd especially value input on.
>
> Thanks
> Alex
>
> On Fri, 11 Nov 2022 at 13:40, Alex Heneveld <a...@cloudsoft.io> wrote:
>
>> Hi team,
>>
>> Workflow is in a pretty good state -- nearly done, mostly as per the proposal, with nearly all step types, retries, docs [1], and integration into the activities view in the inspector. My feeling is that it radically transforms how easy it is to write sensors and effectors.
>>
>> Thanks to everyone for their reviews and feedback!
>>
>> The biggest change from the original proposal is a switch to the list syntax from the init.d syntax. Extra thanks to those who agitated for that.
>>
>> The other really nice aspect is how the shorthand step notation functions as a DSL in simple cases, so extra thanks for the suggestion to make it DSL-like.
>>
>> The two items remaining are nested workflows and controlling how long workflows are remembered.
>>
>> There is one new feature which seems to be needed, which I wanted to raise. As the subject suggests, this is mutexes. I had hoped we wouldn't need this, but actually quite often you want to ensure no other workflows are conflicting. Consider the simple case where you want to atomically increment a sensor:
>>
>> ```
>> - let i = ${entity.sensor.count} ?? 0
>> - let i = ${i} + 1
>> - set-sensor count = ${i}
>> ```
>>
>> Running this twice we'd hope to get count=2. But if the runs are concurrent you won't. So how can we ensure no other instances of the workflow are running concurrently?
>>
>> There are three options I see.
>>
>> (1) set-sensor allows arithmetic
>>
>> We could support arithmetic on set-sensor and require it to run atomically against that sensor, for instance `set-sensor count = (${entity.sensor.count} ?? 0) + 1`. We could fairly easily ensure the RHS is evaluated with the caller holding the lock on the sensor count. However our arithmetic support is quite limited (we don't support grouping at present, so either you'd have to write `${entity.sensor.count} + 1 ?? 1` or we'd beef that up), and I think there is something nice about the current arrangement where arithmetic is only allowed on `let`, so it is more inspectable.
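>>
>> For illustration, if we did beef up the arithmetic, the whole increment would collapse to a single step which Brooklyn evaluates while holding the lock on the sensor. A sketch only, since grouping is not currently supported:
>>
>> ```
>> # hypothetical: RHS evaluated atomically while holding the lock on the "count" sensor
>> - set-sensor count = (${entity.sensor.count} ?? 0) + 1
>> ```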
>>
>> (2) set-sensor with mutex check
>>
>> We could introduce a check which is done while the lock on the sensor is held. So you could check that the sensor is unset before setting it, and fail if it isn't, or check that the value is as expected. You can then set up whatever retry behaviour you want in the usual way. For instance:
>>
>> ```
>> # pessimistic locking
>> - let i = ${entity.sensor.count} ?? 0
>> - let j = ${i} + 1
>> - step: set-sensor count = ${j}
>>   require: ${i}
>>   on-error:
>>     - goto start
>> ```
>>
>> ```
>> # mutex lock acquisition
>> - step: set-sensor lock-for-count = ${workflow.id}
>>   require:
>>     when: absent
>>   on-error:
>>     - retry backoff 50ms increasing 2x up to 5s
>> - let i = ${entity.sensor.count} ?? 0
>> - let i = ${i} + 1
>> - set-sensor count = ${i}
>> - clear-sensor lock-for-count
>> ```
>>
>> (A couple of subtleties for those of you new to the workflow conditions: they always have an implicit target depending on context, which for `require` we would make the old sensor value; `when: absent` is a special predicate DSL keyword to say that a sensor is unavailable (you could also use `when: absent_or_null`, `when: falsy`, or `not: { when: truthy }`). Finally, `require: ${i}` uses the fact that conditions default to being an equality check; that call is equivalent to `require: { target: ${entity.sensor.count}, equals: ${i} }`.)
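>>
>> To spell out that last equivalence, the `require: ${i}` shorthand used in the pessimistic-locking example above is just a compact way of writing roughly the following (written out here purely for illustration):
>>
>> ```
>> require:
>>   target: ${entity.sensor.count}   # the implicit target, i.e. the sensor's previous value
>>   equals: ${i}                     # conditions default to an equality check
>> ```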
>>
>> The retry with backoff is pretty handy here. But there is still one problem in the lock acquisition case: if the workflow fails after step 1, what will clear the lock? (Pessimistic locking doesn't have this problem.) Happily we have an easy solution, because workflows were designed with recovery in mind. If Brooklyn detects an interrupted workflow on startup, it will fail it with a "dangling workflow" exception, and you can attach recovery to that, specifying replay points and making steps idempotent.
>>
>> ```
>> # rock-solid mutex lock acquisition
>> steps:
>>   - step: set-sensor lock-for-count = ${workflow.id}
>>     require:
>>       any:
>>         - when: absent
>>         - equals: ${workflow.id}
>>     on-error:
>>       - retry backoff 50ms increasing 2x up to 5s
>>   - let i = ${entity.sensor.count} ?? 0
>>   - let i = ${i} + 1
>>   - step: set-sensor count = ${i}
>>     replayable: yes
>>   - clear-sensor lock-for-count
>> on-error:
>>   - condition:
>>       error-cause:
>>         glob: "*DanglingWorkflowException*"
>>     step: retry replay
>> replayable: yes
>> ```
>>
>> The `require` block now allows re-entrancy. We rely on the facts that Brooklyn gives workflow instances a unique ID; that on failover, Dangling is thrown from an interrupted step preserving the workflow ID (but giving it a different task ID so replays can be distinguished, with support for this in the UI); and that Brooklyn persistence handles election of a single primary, with any demoted instance interrupting its tasks. The workflow is replayable from the start, and on Dangling it replays. Additionally we can replay from the `set-sensor count` step, which will use local copies of the workflow variables, so if that step is interrupted and runs twice it will be idempotent.
>>
>> (3) explicit `lock` keyword on `workflow` step
>>
>> My feeling is that (2) is really powerful, and the pessimistic locking case is easy enough, but the mutex implementation is hard. We could make it easy to opt in to by allowing sub-workflow blocks to specify a mutex:
>>
>> ```
>> - step: workflow lock incrementing-count
>>   steps:
>>     - let i = ${entity.sensor.count} ?? 0
>>     - let i = ${i} + 1
>>     - step: set-sensor count = ${i}
>>       replayable: yes
>> ```
>>
>> This would act exactly as the previous example: setting the workflow.id into a sensor ahead of time, allowing absent or the current workflow id as the value for re-entrancy, retrying if locked, clearing it afterwards, and handling the Dangling exception. The only tiny differences are that the sensor it atomically sets and clears would be called something like `workflow-lock-incrementing-count`, and the steps would run in a sub-workflow.
>>
>> In this example we still need to say that the final `set-sensor count` is a replay point, otherwise if Brooklyn were interrupted in the split second after setting the sensor but before recording the fact that the step completed, it would retry from the start, giving a slight chance that the sensor increments by 2. This isn't to do with the fact that a mutex is wanted, however; it's because fundamentally adding one to a counter is not an idempotent operation! Provided the steps are idempotent there would be no need for it.
>>
>> For instance, to make sure that `apt` (or `dpkg` or `terraform` or any command which requires a lock) isn't invoked concurrently, you could use it like this:
>>
>> ```
>> type: workflow-effector
>> name: apt-get
>> parameters:
>>   package:
>>     description: package(s) to install
>> steps:
>>   - step: workflow lock apt
>>     steps:
>>       - ssh sudo apt-get install ${package}
>> ```
>>
>> Parallel invocations of effector `apt-get { package: xxx }` will share a mutex on sensor `workflow-lock-apt`, ensuring they don't conflict.
>>
>> You could even define a new workflow step type (assuming "lock xxx" in the shorthand is a key `lock` on the workflow step):
>>
>> ```
>> id: apt-get
>> type: workflow
>> lock: apt
>> shorthand: ${package}
>> parameters:
>>   package:
>>     description: package(s) to install
>> steps:
>>   - ssh sudo apt-get install ${package}
>> ```
>>
>> With this in the catalog, a workflow could have steps `apt-get xxx` which acquire a lock from Brooklyn before running, so they don't fail if invoked in parallel.
>>
>> ---
>>
>> I lean towards providing both (2) and (3). It's actually fairly easy to implement given what we already have, and it allows low-level control for special cases via (2) and a fairly simple high-level mechanism via (3).
>>
>> Thoughts?
>>
>> On a related note, I'm not 100% happy with `replayable` as a keyword and its options. The essential thing is to indicate which points in a workflow are safe to replay from if it is interrupted or fails. We currently support "yes" and "no" to indicate whether something is a replayable point, "true" and "false" as synonyms, and "on" and "off" to indicate that a step and all following ones are or are not replayable (unless overridden with replayable yes/no, or until changed with another replayable on/off). It defaults to "off". I think it is logically a good solution, but it doesn't feel intuitive. Alternative suggestions welcome!
>>
>> Cheers for your input!
>> Alex
>>
>> [1] https://github.com/apache/brooklyn-docs/blob/master/guide/blueprints/workflow/index.md