Re: [Qemu-block] [PATCH v2 0/4] blockjobs: add explicit job reaping

2017-10-09 Thread Nikolay Shirokovskiy
Hi, John.

Here is the original message:
https://lists.nongnu.org/archive/html/qemu-devel/2016-11/msg04091.html.

In short, the problem is this: if I miss the completion event during a
full backup, I don't know whether the backup file is correct or not. If
I miss the event during an incremental backup, I additionally have to
make a full backup afterwards, because I cannot be sure what state the
dirty bitmap is in (that depends on whether the backup was successful
or not). This problem can be approached by the method Vladimir
suggested in the above thread, as an alternative to introducing a
zombie job status.

Well, maybe not that short...
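
To make the failure mode concrete, here is a minimal sketch of the race
(Python standard library only; the socket path, device, bitmap and file
names are all made up). Once BLOCK_JOB_COMPLETED fires while nobody is
listening, no later query can recover the return code, because the
finished job has already left the job list:

import json
import socket

def qmp_connect(path):
    """Connect to a QMP unix socket and negotiate capabilities."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    chan = sock.makefile('rw')
    json.loads(chan.readline())                    # server greeting
    chan.write(json.dumps({'execute': 'qmp_capabilities'}) + '\n')
    chan.flush()
    json.loads(chan.readline())                    # {"return": {}}
    return sock, chan

def qmp_cmd(chan, name, **args):
    chan.write(json.dumps({'execute': name, 'arguments': args}) + '\n')
    chan.flush()
    while True:                                    # skip async events
        msg = json.loads(chan.readline())
        if 'return' in msg or 'error' in msg:
            return msg

sock, chan = qmp_connect('/tmp/qmp.sock')          # hypothetical path
qmp_cmd(chan, 'drive-backup', device='drive0', sync='incremental',
        bitmap='bitmap0', target='/backup/inc.0.qcow2', format='qcow2')
sock.close()   # the "crash": BLOCK_JOB_COMPLETED is emitted into the void

# After reconnecting, a finished job is simply gone. An empty list cannot
# tell success (bitmap cleared) apart from failure (bitmap left intact),
# so the only safe recovery is a fresh full backup.
sock, chan = qmp_connect('/tmp/qmp.sock')
print(qmp_cmd(chan, 'query-block-jobs'))           # {"return": []} either way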


On 05.10.2017 21:17, John Snow wrote:
> Nikolay: You mentioned a while ago that you had issues with incremental
> backup's eventual return status being unknown. Can you please elaborate
> for me why this is a problem?
> 
> I assume that, because backup jobs run for a long time, it's entirely
> possible to lose the connection to QEMU and miss the event, depending
> on how long the interruption is.
> 
> Backup operations are expensive, so we need some definite way to catch
> this return status.
> 
> Please let me know if you have any feedback to this thread.
> 
> On 10/05/2017 07:38 AM, Kevin Wolf wrote:
>> Am 05.10.2017 um 03:46 hat John Snow geschrieben:
>>> On 10/04/2017 02:27 PM, Kevin Wolf wrote:
>>>> Am 04.10.2017 um 03:52 hat John Snow geschrieben:
>>>>> For jobs that complete when a monitor isn't looking, there's no way to
>>>>> tell what the job's final return code was. We need to allow jobs to
>>>>> remain in the list until queried for reliable management.
>>>>
>>>> Just a short summary of what I discussed with John on IRC:
>>>>
>>>> Another important reason why we want to have an explicit end of block
>>>> jobs is that job completion often makes changes to the graph. For a
>>>> management tool that manages the block graph on a node level, it is a
>>>> big problem if graph changes can happen at any point that can lead to
>>>> bad race conditions. Giving the management tool control over the end of
>>>> the block job makes it aware that graph changes happen.
>>>>
>>>> This means that compared to this RFC series, we need to move the waiting
>>>> earlier in the process:
>>>>
>>>> 1. Block job is done and calls block_job_completed()
>>>> 2. Wait for other block jobs in the same job transaction to complete
>>>> 3. Send a (new) QMP event to the management tool to notify it that the
>>>>    job is ready to be reaped
>>>
>>> Oh, I suppose that's to distinguish it from "COMPLETED" in that sense,
>>> because it isn't actually COMPLETED anymore under your vision, so it
>>> requires a new event in this proposal.
>>>
>>> This becomes a bit messy, bumping up against both "READY" and a
>>> transactional pre-completed state semantically. Uh, for lack of a
>>> better word in the timeframe I'd like to complete this email in, let's
>>> call this new theoretical state "PENDING"?
>>>
>>> So presently, a job goes through the following life cycle:
>>>
>>> 1. CREATED --> RUNNING
>>> 2. RUNNING <--> PAUSED
>>> 3. RUNNING --> (READY | COMPLETED | CANCELED)
>>> 4. READY --> (COMPLETED | CANCELED)
>>> 5. (COMPLETED | CANCELED) --> NULL
>>>
>>> Where we emit an event upon entering "READY", "COMPLETED" or "CANCELED".
>>
>> Roughly yes, but it's not quite true because you can still pause and
>> unpause ready jobs. So READY and PAUSED are kind of orthogonal.
>>
> 
> But you cannot block-job-complete a running job, so I included it here
> so we could keep the concept of the ready-to-complete state in mind.
> 
>>> My patchset here effectively adds a new optional terminal state:
>>>
>>> 5. (COMPLETED | CANCELED) --> (NULL | FINISHED)
>>> 6. FINISHED --> NULL
>>>
>>> Where the last transition from FINISHED to NULL is performed via
>>> block-job-reap, but notably we get to re-use the events for COMPLETED |
>>> CANCELED to indicate the availability of this operation to be performed.
>>>
>>> What happens in the case of transactionally managed jobs presently is
>>> that jobs get stuck as they enter the COMPLETED|CANCELED state. If you
>>> were to query them they behave as if they're RUNNING. There's no
>>> discrete state that exists for this presently.
>>>
>>> You can cancel these as normal, but I'm not sure if you can pause them,
>>> actually. (Note to self, test that.) I think they behave almost exactly
>>> like any RUNNING job would.
>>
>> Except that they don't do any work any more. This is an important
>> difference for a mirror job which would normally keep copying new writes
>> until it sends the COMPLETED event. So when libvirt restarts and it sees
>> a "RUNNING" mirror job, it can't decide whether it is still copying
>> things or has already completed.
>>
>> Looks like this is another reason why we want a separate state here.
> 
> Yes, I realized as I was writing it that we have no real way to tell
> that a job is simply pending completion.
> 
>>
>>> What you're proposing here is the formalization of the pre-completion
>>> 

Re: [Qemu-block] [PATCH v2 0/4] blockjobs: add explicit job reaping

2017-10-05 Thread John Snow
Nikolay: You mentioned a while ago that you had issues with incremental
backup's eventual return status being unknown. Can you please elaborate
for me why this is a problem?

I assume that, because backup jobs run for a long time, it's entirely
possible to lose the connection to QEMU and miss the event, depending
on how long the interruption is.

Backup operations are expensive, so we need some definite way to catch
this return status.

Please let me know if you have any feedback to this thread.

On 10/05/2017 07:38 AM, Kevin Wolf wrote:
> Am 05.10.2017 um 03:46 hat John Snow geschrieben:
>> On 10/04/2017 02:27 PM, Kevin Wolf wrote:
>>> Am 04.10.2017 um 03:52 hat John Snow geschrieben:
>>>> For jobs that complete when a monitor isn't looking, there's no way to
>>>> tell what the job's final return code was. We need to allow jobs to
>>>> remain in the list until queried for reliable management.
>>>
>>> Just a short summary of what I discussed with John on IRC:
>>>
>>> Another important reason why we want to have an explicit end of block
>>> jobs is that job completion often makes changes to the graph. For a
>>> management tool that manages the block graph on a node level, it is a
>>> big problem if graph changes can happen at any point that can lead to
>>> bad race conditions. Giving the management tool control over the end of
>>> the block job makes it aware that graph changes happen.
>>>
>>> This means that compared to this RFC series, we need to move the waiting
>>> earlier in the process:
>>>
>>> 1. Block job is done and calls block_job_completed()
>>> 2. Wait for other block jobs in the same job transaction to complete
>>> 3. Send a (new) QMP event to the management tool to notify it that the
>>>    job is ready to be reaped
>>
>> Oh, I suppose that's to distinguish it from "COMPLETED" in that sense,
>> because it isn't actually COMPLETED anymore under your vision, so it
>> requires a new event in this proposal.
>>
>> This becomes a bit messy, bumping up against both "READY" and a
>> transactional pre-completed state semantically. Uh, for lack of a
>> better word in the timeframe I'd like to complete this email in, let's
>> call this new theoretical state "PENDING"?
>>
>> So presently, a job goes through the following life cycle:
>>
>> 1. CREATED --> RUNNING
>> 2. RUNNING <--> PAUSED
>> 3. RUNNING --> (READY | COMPLETED | CANCELED)
>> 4. READY --> (COMPLETED | CANCELED)
>> 5. (COMPLETED | CANCELED) --> NULL
>>
>> Where we emit an event upon entering "READY", "COMPLETED" or "CANCELED".
> 
> Roughly yes, but it's not quite true because you can still pause and
> unpause ready jobs. So READY and PAUSED are kind of orthogonal.
> 

But you cannot block-job-complete a running job, so I included it here
so we could keep the concept of the ready-to-complete state in mind.

>> My patchset here effectively adds a new optional terminal state:
>>
>> 5. (COMPLETED | CANCELED) --> (NULL | FINISHED)
>> 6. FINISHED --> NULL
>>
>> Where the last transition from FINISHED to NULL is performed via
>> block-job-reap, but notably we get to re-use the events for COMPLETED |
>> CANCELED to indicate the availability of this operation to be performed.
>>
>> What happens in the case of transactionally managed jobs presently is
>> that jobs get stuck as they enter the COMPLETED|CANCELED state. If you
>> were to query them they behave as if they're RUNNING. There's no
>> discrete state that exists for this presently.
>>
>> You can cancel these as normal, but I'm not sure if you can pause them,
>> actually. (Note to self, test that.) I think they behave almost exactly
>> like any RUNNING job would.
> 
> Except that they don't do any work any more. This is an important
> difference for a mirror job which would normally keep copying new writes
> until it sends the COMPLETED event. So when libvirt restarts and it sees
> a "RUNNING" mirror job, it can't decide whether it is still copying
> things or has already completed.
> 
> Looks like this is another reason why we want a separate state here.

Yes, I realized as I was writing it that we have no real way to tell
that a job is simply pending completion.

> 
>> What you're proposing here is the formalization of the pre-completion
>> state ("PENDING") and that in this state, a job outside of a transaction
>> can exist until it is manually told to finally, once and for all,
>> actually finish its business. We can use this as a hook to perform any
>> last graph changes so they will not come as a surprise to the management
>> application. Maybe this operation should be called "Finalize". Again,
>> for lack of a better term in the timeframe, I'll refer to it as such for
>> now.
> 
> "finalize" doesn't sound too bad.
> 

Though taken altogether, the set of names we've accumulated is a little
ridiculous.

>> I think importantly this actually distinguishes it from "reap" in that
>> the commit phase can still fail, so we can't let the job follow that
>> auto transition back to the NULL state.
> 
> 

Re: [Qemu-block] [PATCH v2 0/4] blockjobs: add explicit job reaping

2017-10-05 Thread Kevin Wolf
Am 05.10.2017 um 03:46 hat John Snow geschrieben:
> On 10/04/2017 02:27 PM, Kevin Wolf wrote:
> > Am 04.10.2017 um 03:52 hat John Snow geschrieben:
> >> For jobs that complete when a monitor isn't looking, there's no way to
> >> tell what the job's final return code was. We need to allow jobs to
> >> remain in the list until queried for reliable management.
> > 
> > Just a short summary of what I discussed with John on IRC:
> > 
> > Another important reason why we want to have an explicit end of block
> > jobs is that job completion often makes changes to the graph. For a
> > management tool that manages the block graph on a node level, it is a
> > big problem if graph changes can happen at any point that can lead to
> > bad race conditions. Giving the management tool control over the end of
> > the block job makes it aware that graph changes happen.
> > 
> > This means that compared to this RFC series, we need to move the waiting
> > earlier in the process:
> > 
> > 1. Block job is done and calls block_job_completed()
> > 2. Wait for other block jobs in the same job transaction to complete
> > 3. Send a (new) QMP event to the management tool to notify it that the
> >    job is ready to be reaped
> 
> Oh, I suppose that's to distinguish it from "COMPLETED" in that sense,
> because it isn't actually COMPLETED anymore under your vision, so it
> requires a new event in this proposal.
> 
> This becomes a bit messy, bumping up against both "READY" and a
> transactional pre-completed state semantically. Uh, for lack of a
> better word in the timeframe I'd like to complete this email in, let's
> call this new theoretical state "PENDING"?
> 
> So presently, a job goes through the following life cycle:
> 
> 1. CREATED --> RUNNING
> 2. RUNNING <--> PAUSED
> 3. RUNNING --> (READY | COMPLETED | CANCELED)
> 4. READY --> (COMPLETED | CANCELED)
> 5. (COMPLETED | CANCELED) --> NULL
> 
> Where we emit an event upon entering "READY", "COMPLETED" or "CANCELED".

Roughly yes, but it's not quite true because you can still pause and
unpause ready jobs. So READY and PAUSED are kind of orthogonal.
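
To pin the shape down, here is a toy encoding of the life cycle above
(an editor's sketch in Python, not QEMU code), with the caveat folded
in: pausing is modelled as an orthogonal flag rather than a state.

ALLOWED = {
    'CREATED':   {'RUNNING'},
    'RUNNING':   {'READY', 'COMPLETED', 'CANCELED'},
    'READY':     {'COMPLETED', 'CANCELED'},
    'COMPLETED': {'NULL'},
    'CANCELED':  {'NULL'},
    'NULL':      set(),
}
EMITS_EVENT = {'READY', 'COMPLETED', 'CANCELED'}   # announced via QMP events

class BlockJob:
    def __init__(self):
        self.state = 'CREATED'
        self.paused = False        # orthogonal: legal in RUNNING and READY

    def transition(self, new):
        if new not in ALLOWED[self.state]:
            raise ValueError(f'{self.state} -> {new} is not allowed')
        self.state = new
        if new in EMITS_EVENT:
            print(f'would emit a BLOCK_JOB event for {new}')

job = BlockJob()
job.transition('RUNNING')
job.paused = True                  # RUNNING <--> PAUSED, no state change
job.paused = False
job.transition('READY')            # e.g. mirror caught up with writes
job.transition('COMPLETED')
job.transition('NULL')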

> My patchset here effectively adds a new optional terminal state:
> 
> 5. (COMPLETED | CANCELED) --> (NULL | FINISHED)
> 6. FINISHED --> NULL
> 
> Where the last transition from FINISHED to NULL is performed via
> block-job-reap, but notably we get to re-use the events for COMPLETED |
> CANCELED to indicate the availability of this operation to be performed.
> 
> What happens in the case of transactionally managed jobs presently is
> that jobs get stuck as they enter the COMPLETED|CANCELED state. If you
> were to query them they behave as if they're RUNNING. There's no
> discrete state that exists for this presently.
> 
> You can cancel these as normal, but I'm not sure if you can pause them,
> actually. (Note to self, test that.) I think they behave almost exactly
> like any RUNNING job would.

Except that they don't do any work any more. This is an important
difference for a mirror job which would normally keep copying new writes
until it sends the COMPLETED event. So when libvirt restarts and it sees
a "RUNNING" mirror job, it can't decide whether it is still copying
things or has already completed.

Looks like this is another reason why we want a separate state here.
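
As a sketch of what the separate state would buy a restarting client:
the 'status' field queried below is hypothetical at this point in the
thread (its absence is exactly the problem), as are the state names.

def recover_job(jobs):
    """Decide what to do with each job found after a reconnect."""
    for job in jobs:
        status = job.get('status')        # hypothetical field
        if status == 'RUNNING':
            pass                          # mirror is still copying writes
        elif status == 'PENDING':
            # Done working, graph changes not applied yet: safe to
            # refresh our node-level view, then finalize explicitly.
            print(f"finalize {job['device']}")
        elif status == 'FINISHED':
            print(f"reap {job['device']} and record its return code")

recover_job([{'device': 'drive0', 'status': 'PENDING'}])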

> What you're proposing here is the formalization of the pre-completion
> state ("PENDING") and that in this state, a job outside of a transaction
> can exist until it is manually told to finally, once and for all,
> actually finish its business. We can use this as a hook to perform any
> last graph changes so they will not come as a surprise to the management
> application. Maybe this operation should be called "Finalize". Again,
> for lack of a better term in the timeframe, I'll refer to it as such for
> now.

"finalize" doesn't sound too bad.

> I think importantly this actually distinguishes it from "reap" in that
> the commit phase can still fail, so we can't let the job follow that
> auto transition back to the NULL state.

Let me see if I understand correctly: We want to make sure that the
management tool sees the final return value for the job. We have already
decided that events aren't enough for this because the management tool
could be restarted while we send the event, so the information is lost.
Having it as a return value of block-job-reap isn't enough either
because it could be lost the same way. We need a separate phase where
libvirt can query the return value and from which we don't automatically
transition away.

I'm afraid that you are right.
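
Sketched as client code, that separate phase looks like a
query/acknowledge pair that is safe to retry after any crash; the
command and field names (block-job-reap, 'error') follow this RFC's
draft and are not final.

def collect_result(qmp, device):
    """qmp(name, **args) sends one QMP command and returns its reply."""
    # Step 1: query. Safe to repeat after a restart, because a FINISHED
    # job stays in the list instead of auto-transitioning to NULL.
    jobs = qmp('query-block-jobs')['return']
    job = next(j for j in jobs if j['device'] == device)
    result = job.get('error')      # absent/None would mean success
    # Step 2: only once the result is recorded do we allow
    # FINISHED -> NULL; a crash between steps 1 and 2 loses nothing.
    qmp('block-job-reap', device=device)
    return result

# Toy stand-in for a real QMP connection, to show the call shape:
def fake_qmp(name, **args):
    if name == 'query-block-jobs':
        return {'return': [{'device': 'drive0', 'error': None}]}
    return {'return': {}}

print(collect_result(fake_qmp, 'drive0'))   # -> None (success)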

> That means that we'd need both a block-job-finalize AND a
> block-job-reap to accomplish both of the following goals:
> 
> (1) Allow the management application to control graph changes [libvirt]
> (2) Prevent auto transitions to NULL state for asynchronous clients [A
> requirement mentioned by Virtuozzo]

Okay, your (2) is another case that forbids auto-transitions to NULL.
What I described above is why it's 
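
For illustration, the two-command split that goals (1) and (2) point
towards might look like this on the client side; block-job-finalize and
block-job-reap are the draft names from this thread, and record_result
and refresh_graph_view are hypothetical client-side steps.

def finish_job(qmp, device, record_result, refresh_graph_view):
    qmp('block-job-finalize', device=device)   # graph changes happen *now*
    refresh_graph_view()                       # update node-level view (goal 1)
    jobs = qmp('query-block-jobs')['return']
    job = next(j for j in jobs if j['device'] == device)
    record_result(job.get('error'))            # persist the outcome first...
    qmp('block-job-reap', device=device)       # ...then let it go to NULL (goal 2)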

Re: [Qemu-block] [PATCH v2 0/4] blockjobs: add explicit job reaping

2017-10-04 Thread Kevin Wolf
Am 04.10.2017 um 03:52 hat John Snow geschrieben:
> For jobs that complete when a monitor isn't looking, there's no way to
> tell what the job's final return code was. We need to allow jobs to
> remain in the list until queried for reliable management.

Just a short summary of what I discussed with John on IRC:

Another important reason why we want to have an explicit end of block
jobs is that job completion often makes changes to the graph. For a
management tool that manages the block graph on a node level, it is a
big problem if graph changes can happen at any point that can lead to
bad race conditions. Giving the management tool control over the end of
the block job makes it aware that graph changes happen.

This means that compared to this RFC series, we need to move the waiting
earlier in the process:

1. Block job is done and calls block_job_completed()
2. Wait for other block jobs in the same job transaction to complete
3. Send a (new) QMP event to the management tool to notify it that the
   job is ready to be reaped
4. On block-job-reap, call the .commit/.abort callbacks of the jobs in
   the transaction. They will do most of the work that is currently done
   in the completion callbacks, in particular any graph changes. If we
   need to allow error cases, we can add a .prepare_completion callback
   that can still let the whole transaction fail.
5. Send the final QMP completion event for each job in the transaction
   with the final error code. This is the existing BLOCK_JOB_COMPLETED
   or BLOCK_JOB_CANCELLED event.

The current RFC still executes .commit/.abort before block-job-reap, so
the graph changes happen too early to be under control of the management
tool.
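
Read from the management tool's side, the proposed ordering would drive
an event loop roughly like this sketch (BLOCK_JOB_PENDING is a
placeholder name for the new event of step 3; the step 5 events already
exist):

def drive_completion(events, qmp, device):
    """events: an iterator of decoded QMP event dicts for this job."""
    for ev in events:
        if ev['event'] == 'BLOCK_JOB_PENDING':          # step 3
            # Step 4: .commit/.abort (and thus all graph changes) run
            # only now, while we are deliberately looking.
            qmp('block-job-reap', device=device)
        elif ev['event'] in ('BLOCK_JOB_COMPLETED',
                             'BLOCK_JOB_CANCELLED'):    # step 5
            return ev['data'].get('error')              # final return code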

> RFC:
> The next version will add tests for transactions.
> Kevin, can you please take a look at bdrv_is_root_node and how it is
> used with respect to do_drive_backup? I suspect that in this case
> "is root" should actually be "true", but a node in use by a job has
> two roles; child_root and child_job, so it starts returning false here.
> 
> That's fine because we prevent a collision that way, but it makes the
> error messages pretty bad and misleading. Do you have a quick suggestion?
> (Should I just amend the loop to allow non-root nodes as long as they
> happen to be jobs so that the job creation code can check permissions?)

We kind of sidestepped this problem by deciding that there is no real
reason for the backup job to require the source to be a root node.

I'm not completely sure how we could easily get a better message while
still requiring a root node (and I suppose there are other places that
rightfully still require a root node). Ignoring children with child_job
feels ugly, but might be the best option.

Note that this will not make the conflicting command work suddenly,
because every node that has a child_job parent also has op blockers for
everything, but the error message should be less confusing than "is not
a root node".

Kevin