Following some feedback re "replayable" I've rejigged that section [1]. It changes the concepts so that whether a step is "resumable" is considered separately from marking explicit "replayable" waypoints; in most cases either one allows a workflow to be replayed, e.g. if a resumable step is interrupted (such as a sleep, but not, for instance, an HTTP call), or if the workflow author indicated that a completed step is "replayable here".
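
Roughly, the shape is as below. This is only a sketch to convey the distinction; the step shorthands, sensor name, and the exact keyword for the author-declared waypoint are illustrative, and the doc section has the real syntax and semantics.

```
# illustrative sketch only -- see the doc section for the actual keywords
steps:
  - sleep 10m
  # a sleep is "resumable": if Brooklyn is interrupted mid-step it can simply run again
  - ssh ./submit-request.sh
  # an ssh (or http) call is not resumable: we cannot know whether it took effect
  - step: set-sensor request-submitted = true
    replayable: yes
    # an explicit waypoint declared by the author, i.e. "replayable here"
```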

Thanks to those who gave their input! I am much happier with this rejigged plan. More comments are welcome!

Best
Alex

[1] https://docs.google.com/document/d/1u02Bi6sS8Fkf1s7UzRRMnvLhA477bqcyxGa0nJesqkI/edit#heading=h.63aibdplmze

On Fri, 18 Nov 2022 at 13:44, Alex Heneveld <a...@cloudsoft.io> wrote:

> Hi team,
>
> I've got most of the proposed "lock" implementation completed, as discussed in the previous mail (below); PR to come shortly. It works well, though there are a huge number of subtleties, so test cases were a challenge; all the better that we provide this, however, so workflow authors have a much easier time. The biggest challenge was to make sure that if an author writes e.g. `{ type: workflow, lock: my-lock, on-error: [ retry ], steps: [ ... ] }`, then if Brooklyn does a failover the retry (at the new server) preserves the lock ... but if there is no retry, or the retry fails, the lock is cleared.
>
> As part of this I've been mulling over "replayable"; as I mentioned below, it was one aspect I'm not entirely sure of, and it's quite closely related to "expiration", which I think might be better described as "retention". I think I've got a better way to handle those, and a tweak to error handling. It's described in this section:
>
> https://docs.google.com/document/d/1u02Bi6sS8Fkf1s7UzRRMnvLhA477bqcyxGa0nJesqkI/edit#heading=h.63aibdplmze
>
> There are two questions towards the end that I'd especially value input on.
>
> Thanks
> Alex
>
> On Fri, 11 Nov 2022 at 13:40, Alex Heneveld <a...@cloudsoft.io> wrote:
>
>> Hi team,
>>
>> Workflow is in a pretty good state -- nearly done, mostly as per the proposal, with nearly all step types, retries, docs [1], and integration into the activities view in the inspector. My feeling is that it radically transforms how easy it is to write sensors and effectors.
>>
>> Thanks to everyone for their reviews and feedback!
>>
>> The biggest change from the original proposal is a switch to the list syntax from the init.d syntax. Extra thanks to those who agitated for that.
>>
>> The other really nice aspect is how the shorthand step notation functions as a DSL in simple cases, so extra thanks for the suggestion to make it DSL-like.
>>
>> The two items remaining are nested workflows and controlling how long workflows are remembered.
>>
>> There is one new feature which seems to be needed, which I wanted to raise. As the subject suggests, this is mutexes. I had hoped we wouldn't need this, but actually quite often you want to ensure no other workflows are conflicting. Consider the simple case where you want to atomically increment a sensor:
>>
>> ```
>> - let i = ${entity.sensor.count} ?? 0
>> - let i = ${i} + 1
>> - set-sensor count = ${i}
>> ```
>>
>> Running this twice we'd hope to get count=2. But if the runs are concurrent you won't. So how can we ensure no other instances of the workflow are running concurrently?
>>
>> There are three options I see.
>>
>> (1) set-sensor allows arithmetic
>>
>> We could support arithmetic on set-sensor and require it to run atomically against that sensor, for instance `set-sensor count = (${entity.sensor.count} ?? 0) + 1`. We could fairly easily ensure the RHS is evaluated with the caller holding the lock on the sensor count. However our arithmetic support is quite limited (we don't support grouping at present, so either you'd have to write `${entity.sensor.count} + 1 ?? 1` or we'd beef that up), and I think there is something nice about the current arrangement where arithmetic is only allowed on `let`, so it is more inspectable.
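>>
>> For illustration, if we did beef up the arithmetic, the whole increment would collapse to a single step which Brooklyn evaluates while holding the lock on the sensor. A sketch only, since grouping is not currently supported:
>>
>> ```
>> # hypothetical: RHS evaluated atomically while holding the lock on the "count" sensor
>> - set-sensor count = (${entity.sensor.count} ?? 0) + 1
>> ```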
>>
>> (2) set-sensor with mutex check
>>
>> We could introduce a check which is done while the lock on the sensor is held. So you could check that the sensor is unset before setting it, and fail if it isn't, or check that the value is as expected. You can then set up whatever retry behaviour you want in the usual way. For instance:
>>
>> ```
>> # pessimistic locking
>> - let i = ${entity.sensor.count} ?? 0
>> - let j = ${i} + 1
>> - step: set-sensor count = ${j}
>>   require: ${i}
>>   on-error:
>>     - goto start
>> ```
>>
>> ```
>> # mutex lock acquisition
>> - step: set-sensor lock-for-count = ${workflow.id}
>>   require:
>>     when: absent
>>   on-error:
>>     - retry backoff 50ms increasing 2x up to 5s
>> - let i = ${entity.sensor.count} ?? 0
>> - let i = ${i} + 1
>> - set-sensor count = ${i}
>> - clear-sensor lock-for-count
>> ```
>>
>> (A couple of subtleties for those of you new to the workflow conditions: they always have an implicit target depending on context, which for `require` we would make the old sensor value; `when: absent` is a special predicate DSL keyword to say that a sensor is unavailable (you could also use `when: absent_or_null`, `when: falsy`, or `not: { when: truthy }`). Finally, `require: ${i}` uses the fact that conditions default to being an equality check; that call is equivalent to `require: { target: ${entity.sensor.count}, equals: ${i} }`.)
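>>
>> To spell out that last equivalence, the `require: ${i}` shorthand used in the pessimistic-locking example above is just a compact way of writing roughly the following (written out here purely for illustration):
>>
>> ```
>> require:
>>   target: ${entity.sensor.count}   # the implicit target, i.e. the sensor's previous value
>>   equals: ${i}                     # conditions default to an equality check
>> ```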
>>
>> The retry with backoff is pretty handy here. But there is still one problem in the lock acquisition case: if the workflow fails after step 1, what will clear the lock? (Pessimistic locking doesn't have this problem.) Happily we have an easy solution, because workflows were designed with recovery in mind. If Brooklyn detects an interrupted workflow on startup, it will fail it with a "dangling workflow" exception, and you can attach recovery to that, specifying replay points and making steps idempotent.
>>
>> ```
>> # rock-solid mutex lock acquisition
>> steps:
>>   - step: set-sensor lock-for-count = ${workflow.id}
>>     require:
>>       any:
>>         - when: absent
>>         - equals: ${workflow.id}
>>     on-error:
>>       - retry backoff 50ms increasing 2x up to 5s
>>   - let i = ${entity.sensor.count} ?? 0
>>   - let i = ${i} + 1
>>   - step: set-sensor count = ${i}
>>     replayable: yes
>>   - clear-sensor lock-for-count
>> on-error:
>>   - condition:
>>       error-cause:
>>         glob: "*DanglingWorkflowException*"
>>     step: retry replay
>> replayable: yes
>> ```
>>
>> The `require` block now allows re-entrancy. We rely on the facts that Brooklyn gives workflow instances a unique ID; that on failover, Dangling is thrown from an interrupted step preserving the workflow ID (but giving it a different task ID so replays can be distinguished, with support for this in the UI); and that Brooklyn persistence handles election of a single primary, with any demoted instance interrupting its tasks. The workflow is replayable from the start, and on Dangling it replays. Additionally we can replay from the `set-sensor count` step, which will use local copies of the workflow variables, so if that step is interrupted and runs twice it will be idempotent.
>>
>> (3) explicit `lock` keyword on `workflow` step
>>
>> My feeling is that (2) is really powerful, and the pessimistic locking case is easy enough, but the mutex implementation is hard. We could make it easy to opt in to by allowing sub-workflow blocks to specify a mutex:
>>
>> ```
>> - step: workflow lock incrementing-count
>>   steps:
>>     - let i = ${entity.sensor.count} ?? 0
>>     - let i = ${i} + 1
>>     - step: set-sensor count = ${i}
>>       replayable: yes
>> ```
>>
>> This would act exactly as the previous example: setting the workflow.id into a sensor ahead of time, allowing absent or the current workflow id as the value for re-entrancy, retrying if locked, clearing it afterwards, and handling the Dangling exception. The only tiny differences are that the sensor it atomically sets and clears would be called something like `workflow-lock-incrementing-count`, and the steps would run in a sub-workflow.
>>
>> In this example we still need to say that the final `set-sensor count` is a replay point, otherwise if Brooklyn were interrupted in the split second after setting the sensor but before recording the fact that the step completed, it would retry from the start, giving a slight chance that the sensor increments by 2. This isn't to do with the fact that a mutex is wanted, however; it's because fundamentally adding one to a counter is not an idempotent operation! Provided the steps are idempotent there would be no need for it.
>>
>> For instance, to make sure that `apt` (or `dpkg` or `terraform` or any command which requires a lock) isn't invoked concurrently, you could use it like this:
>>
>> ```
>> type: workflow-effector
>> name: apt-get
>> parameters:
>>   package:
>>     description: package(s) to install
>> steps:
>>   - step: workflow lock apt
>>     steps:
>>       - ssh sudo apt-get install ${package}
>> ```
>>
>> Parallel invocations of effector `apt-get { package: xxx }` will share a mutex on sensor `workflow-lock-apt`, ensuring they don't conflict.
>>
>> You could even define a new workflow step type (assuming "lock xxx" in the shorthand is a key `lock` on the workflow step):
>>
>> ```
>> id: apt-get
>> type: workflow
>> lock: apt
>> shorthand: ${package}
>> parameters:
>>   package:
>>     description: package(s) to install
>> steps:
>>   - ssh sudo apt-get install ${package}
>> ```
>>
>> With this in the catalog, a workflow could have steps `apt-get xxx` which acquire a lock from Brooklyn before running, so they don't fail if invoked in parallel.
>>
>> ---
>>
>> I lean towards providing both (2) and (3). It's actually fairly easy to implement given what we already have, and it allows low-level control for special cases via (2) and a fairly simple high-level mechanism via (3).
>>
>> Thoughts?
>>
>> On a related note, I'm not 100% happy with `replayable` as a keyword and its options. The essential thing is to indicate which points in a workflow are safe to replay from if it is interrupted or fails. We currently support "yes" and "no" to indicate whether something is a replayable point, "true" and "false" as synonyms, and "on" and "off" to indicate that a step and all following ones are or are not replayable (unless overridden with replayable yes/no, or until changed with another replayable on/off). It defaults to "off". I think it is logically a good solution, but it doesn't feel intuitive. Alternative suggestions welcome!
>>
>> Cheers for your input!
>> Alex
>>
>> [1] https://github.com/apache/brooklyn-docs/blob/master/guide/blueprints/workflow/index.md