On 11/05/2015 02:35 PM, Markus Armbruster wrote:
> John Snow <js...@redhat.com> writes:
> 
>> On 11/05/2015 05:47 AM, Stefan Hajnoczi wrote:
>>> On Tue, Nov 03, 2015 at 12:27:19PM -0500, John Snow wrote:
>>>>
>>>>
>>>> On 11/03/2015 10:17 AM, Stefan Hajnoczi wrote:
>>>>> On Fri, Oct 23, 2015 at 07:56:50PM -0400, John Snow wrote:
>>>>>> @@ -1732,6 +1757,10 @@ static void block_dirty_bitmap_add_prepare(BlkActionState *common,
>>>>>>      BlockDirtyBitmapState *state = DO_UPCAST(BlockDirtyBitmapState,
>>>>>>                                               common, common);
>>>>>>  
>>>>>> +    if (action_check_cancel_mode(common, errp) < 0) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>>      action = common->action->block_dirty_bitmap_add;
>>>>>>      /* AIO context taken and released within qmp_block_dirty_bitmap_add */
>>>>>>      qmp_block_dirty_bitmap_add(action->node, action->name,
>>>>>> @@ -1767,6 +1796,10 @@ static void block_dirty_bitmap_clear_prepare(BlkActionState *common,
>>>>>>                                               common, common);
>>>>>>      BlockDirtyBitmap *action;
>>>>>>  
>>>>>> +    if (action_check_cancel_mode(common, errp) < 0) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>>      action = common->action->block_dirty_bitmap_clear;
>>>>>>      state->bitmap = block_dirty_bitmap_lookup(action->node,
>>>>>>                                                action->name,
>>>>>
>>>>> Why do the bitmap add/clear actions not support err-cancel=all?
>>>>>
>>>>> I understand why other block jobs don't support it, but it's not clear
>>>>> why these non-block job actions cannot.
>>>>>
>>>>
>>>> Because they don't have a callback to invoke if the rest of the job fails.
>>>>
>>>> I could create a BlockJob for them complete with a callback to invoke,
>>>> but basically it's just because there's no interface to unwind them, or
>>>> an interface to join them with the transaction.
>>>>
>>>> They're small, synchronous non-job actions. Which makes them weird.
>>>
>>> Funny, we've been looking at the same picture while seeing different
>>> things:
>>> https://en.wikipedia.org/wiki/Rabbit%E2%80%93duck_illusion
>>>
>>> I think I understand your idea: the transaction should include both
>>> immediate actions as well as block jobs.
>>>
>>> My mental model was different: immediate actions commit/abort along with
>>> the 'transaction' command.  Block jobs are separate and complete/cancel
>>> together in a group.
>>>
>>> In practice I think the two end up being similar because we won't be
>>> able to implement immediate action commit/abort together with
>>> long-running block jobs because the immediate actions rely on
>>> quiescing/pausing the guest for atomic commit/abort.
>>>
>>> So with your mental model the QMP client has to submit 2 'transaction'
>>> commands: 1 for the immediate actions, 1 for the block jobs.
>>>
>>> In my mental model the QMP client submits 1 command but the immediate
>>> actions and block jobs are two separate transaction scopes.  This means
>>> if the block jobs fail, the client needs to be aware of the immediate
>>> actions that have committed.  Because of this, it becomes just as much
>>> client effort as submitting two separate 'transaction' commands in your
>>> model.
>>>
>>> Can anyone see a practical difference?  I think I'm happy with John's
>>> model.
>>>
>>> Stefan
>>>
>>
>> We discussed this off-list, but for the sake of the archive:
>>
>> == How it is now ==
>>
>> Currently, transactions have two implicit phases: the first is the
>> synchronous phase. If this phase completes successfully, we consider the
>> transaction a success. The second phase is the asynchronous phase where
>> jobs launched by the synchronous phase run to completion.
>>
>> All synchronous commands must complete for the transaction to "succeed."
>> There are currently (pre-patch) no guarantees about asynchronous command
>> completion. As long as all synchronous actions complete, asynchronous
>> actions are free to succeed or fail individually.
>>
>> == My Model ==
>>
>> The current behavior is my "err-cancel = none" scenario: we offer no
>> guarantee about the success or failure of the transaction as a whole
>> after the synchronous portion has completed.
>>
>> What I was proposing is "err-cancel = all," which to me means that _ALL_
>> commands in this transaction are to succeed (synchronous or not) before
>> _any_ actions are irrevocably committed. This means that for a
>> hypothetical mixed synchronous-asynchronous transaction, even after
>> the transaction succeeded (it passed the synchronous phase), if an
>> asynchronous action later fails, all actions, both synchronous and
>> asynchronous, are rolled back -- a kind of retroactive failure of the
>> transaction.
>> This is clearly not possible in all cases, so commands that cannot
>> support these semantics will refuse "err-cancel = all" during the
>> synchronous phase.
>>
>> In practice, only asynchronous actions can tolerate these semantics, but
>> from a user perspective, it's clear that any transaction successfully
>> launched with "err-cancel = all" applies to *all* actions, regardless.
>>
>> == Stefan's Model ==
>>
>> Stefan's model was to imply that the "err-cancel" parameter applied only
>> to the *asynchronous* phase, because the synchronous phase has already
>> reported back success to the user as a return from the qmp-transaction
>> command. This would mean that to Stefan, "err-cancel = all" was implying
>> that only the asynchronous actions had to participate in the "all or
>> none" behavior of the transaction -- synchronous portions were exempt.
>>
>> == Equivalence ==
>>
>> Both models wind up being equivalent:
>>
>> In Stefan's model, you need no foreknowledge of which actions are
>> synchronous or not. Upon failure during the asynchronous phase you will
>> need to understand which actions rolled back and which ones didn't, however.
>>
>> In my model, you need foreknowledge of which actions are synchronous and
>> which ones are not, because synchronous actions will refuse the
>> "err-cancel = all" parameter. There is no sifting through failure states
>> when the command fails.
>>
>> It's mostly a matter of when you need to know the difference between the
>> two classes of actions. In one model, it's before. In the other, it's
>> after a failure.
>>
>> My model also allows for an emulation of Stefan's model, using the
>> hypothetical "err-cancel = jobs-only" mode, which would only enforce the
>> transaction semantics in the asynchronous phase.
>>
>>
>> For this reason, I think the two approaches to thinking about the
>> problem wind up having the same effect. I would perhaps argue that my
>> model is more explicit -- but I'm biased. I wrote it :)
> 
> I haven't followed this topic, and my opinions are therefore quite
> uninformed.  Here goes anyway.
> 
> In John's model, you need to know whether an action is synchronous.  If
> you get it wrong, your attempt to err-cancel=all will fail.  Sounds like
> a nice early failure to me.
> 

Yes.

> In Stefan's model, you "need to understand which actions rolled back" to
> make sense of a failure.  How?  Are exactly the asynchronous ones rolled
> back?  What happens if you get it wrong?
> 

You need to understand which actions were synchronous (i.e. not jobs)
and which ones were jobs. Only jobs will be rolled back; the synchronous
actions will not be -- so to retry, you need to have a sense of /which/
actions to retry.
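
To make the retry bookkeeping concrete, here is a rough sketch of what a client would have to do in Stefan's model after a job failure. The `Action` type and the `actions_to_retry` helper are hypothetical names invented for illustration; they are not part of QEMU or of this patchset.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical client-side record of one action in a transaction. */
typedef struct {
    const char *name;   /* e.g. "drive-backup" */
    bool is_job;        /* true: block job; false: synchronous action */
} Action;

/*
 * After a failure in the asynchronous phase, only the jobs were rolled
 * back. Store pointers to those actions in `retry` (which must have room
 * for n entries) and return how many there are. Synchronous actions are
 * skipped because they already committed.
 */
static size_t actions_to_retry(const Action *actions, size_t n,
                               const Action **retry)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (actions[i].is_job) {
            retry[count++] = &actions[i];
        }
    }
    return count;
}
```

The point is only that the client must carry the `is_job` knowledge itself; nothing in the failure path hands it over.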

If you get it wrong, you get weird stuff you didn't expect, of course.

Either model implies you need to understand which actions are which
kind, but the patchset as it is right now will yell at you if you got
that wrong from the get-go.
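
A minimal sketch of that early refusal, loosely modeled on the `action_check_cancel_mode()` calls in the quoted diff. All types and names below (`ErrCancelMode`, `Action`, `check_cancel_mode`, `transaction_prepare`) are simplified stand-ins assumed for illustration, not the real QEMU definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the err-cancel parameter. */
typedef enum {
    ERR_CANCEL_NONE,    /* jobs succeed or fail individually */
    ERR_CANCEL_ALL,     /* every action must be able to unwind */
} ErrCancelMode;

/* Hypothetical stand-in for one action in a transaction. */
typedef struct {
    const char *name;
    bool is_job;        /* false: small synchronous action, cannot unwind */
} Action;

/* Return 0 if the action can honor the mode, -1 (and set err) if not. */
static int check_cancel_mode(const Action *action, ErrCancelMode mode,
                             char *err, size_t errlen)
{
    assert(action != NULL);

    if (mode == ERR_CANCEL_ALL && !action->is_job) {
        snprintf(err, errlen, "action '%s' does not support err-cancel=all",
                 action->name);
        return -1;
    }
    return 0;
}

/* Synchronous phase: refuse the whole transaction at the first mismatch. */
static int transaction_prepare(const Action *actions, size_t n,
                               ErrCancelMode mode, char *err, size_t errlen)
{
    for (size_t i = 0; i < n; i++) {
        if (check_cancel_mode(&actions[i], mode, err, errlen) < 0) {
            return -1;
        }
    }
    return 0;
}
```

With a mixed transaction (one job plus one bitmap action), this sketch rejects `ERR_CANCEL_ALL` before anything runs and accepts `ERR_CANCEL_NONE`, which is the "before" half of the before/after distinction.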

To be fair, in Stefan's model, the QMP events will hint to you which
actions actually failed, because you'll be able to count the block job
failure events. The qmp-transaction return of "{return: {}}" is the
implicit "All non-jobs finished OK."

--js
