Re: [Qemu-devel] RFC: Operation Blockers in QEMU Block Nodes

2015-12-18 Thread Kevin Wolf
Am 16.12.2015 um 07:25 hat Jeff Cody geschrieben:
> Background:
> 
> Block jobs, and other QAPI operations, may modify and impact the
> BlockDriverState graph in QEMU.  In order to support multiple
> operations safely, we need a mechanism to block and gate operations.
> 
> We currently have op blockers, which are attached to each BDS.
> However, in practice we check this on the device level, rather than on
> the granularity of the individual BDS.  Also, due to limitations of
> the current system, we only allow a single block job at a time.

As I already said on IRC, I think a really important part in order to
make proper op blockers workable is that we actually enforce that you
have to request permission before you can operate on an image. That is,
if you were to write to an image, but you haven't requested write
permission from the op blocker system before, you'd get an assertion
failure.

The reason for this is that it's a long standing issue of the current
system that it's not only possible, but even easy to forget adding the
blocker code or updating it when the surrounding code changes. The
result is a mechanism that is too restrictive for many cases on the one
hand, but fails to actually provide the protection we need on the other
hand (for example, bs->file is still unprotected).

Blockers are subtle and at the same time pervasive enough that I'm
almost sure we wouldn't implement them correctly if we don't force
ourselves to do it right by letting qemu crash whenever we don't.

> Proposed New Design Features:
> --
> This design would supersede the current op blocker system.
> 
> Rather than have the op blockers attached directly to each BDS, Block
> Job access will be done through a separate Node Access Control system.
> 
> This new system will:
> 
> * Allow / Disallow operations to BDSs, (generally initiated by QAPI
>   actions; i.e. BDS node operations other than guest read/writes)
> 
> * Not be limited to "block jobs" (e.g. block-commit, block-stream,
>   etc..) - will also apply to other operations, such as
>   blockdev-add, change-backing-file, etc.
> 
> * Allow granularity in options, and provide the safety needed to
>   support multiple block jobs and operations.
> 
> * Reference each BDS by its node-name
> 
> * Be independent of the bdrv_states graph (i.e. does not reside in
>   the BDS structure)

I agree that the BDS structure is probably not the right place (except
possibly for caching the permission bitmasks), but I also wouldn't say
that it should be a completely independent structure. Otherwise keeping
things in sync when the graph changes might turn out hard.

At the moment I can see two fundamentally different kinds of operations
that are interesting for op blockers:

1. I/O requests

   We need a place to check I/O requests that knows both the user (and
   the permissions it has acquired) and the node that is accessed.

   Any I/O request from outside the block layer (which includes guest
   devices, NBD servers and block jobs) should come through a
   BlockBackend. Even though we don't have it yet, we want to have a
   single BB per user. These requests could be dealt with by calling
   into the NAC whenever a BDS is attached to or detached from a BB;
   assertions can be checked in the appropriate blk_* functions.

   But there are also references between BDSes in the middle of the
   graph, like a qcow2 image referencing (and sending I/O requests to)
   its bs->file and bs->backing. It's important that we represent the
   requirements of such nodes in the NAC. As permissions are always for
   a specific user for a specific node, neither the parent node nor the
   child node are the correct place to manage the permissions. It seems
   that instead BdrvChild might be where they belong. The NAC needs to
   be informed whenever a child node is attached or detached from a BDS.

   As it happens, I've already been considering for a while to convert
   BlockBackend to using BdrvChild as well, so we could possibly unify
   both cases.

   The problem we have with BdrvChild is that I/O requests don't go
   through BdrvChild functions, so there is no place to add the
   assertions we need. We could possibly still add them to blk_*, but
   that would leave internal I/O requests unchecked.
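To make the enforcement idea concrete, here is a minimal Python model of the assertion Kevin describes: a user must acquire a permission before issuing I/O, and the I/O wrapper crashes otherwise. QEMU's block layer is C and this is only an illustrative sketch; all names here are hypothetical, not the real API.

```python
# Illustrative model: forgetting to request permission becomes a loud
# assertion failure instead of silently unprotected access.

READ, WRITE = 0x1, 0x2   # hypothetical permission bits

class BlockBackend:
    def __init__(self):
        self.acquired = 0          # permissions granted so far

    def request_permission(self, perm):
        # Real code would consult the op blocker / NAC system here.
        self.acquired |= perm

    def pwrite(self, offset, data):
        # The assertion Kevin proposes: writing without having
        # requested write permission is a bug, so crash.
        assert self.acquired & WRITE, "write without permission"
        return len(data)

blk = BlockBackend()
blk.request_permission(WRITE)
blk.pwrite(0, b"data")             # ok: permission was acquired first
```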

2. Graph reconfiguration

   The problematic part here is that it's generally done by users who
   don't have a direct reference to all the nodes that are affected by
   the changes to the graph.

   For example, the commit job wants to drop all intermediate nodes
   after completing the copy - but even after properly converting it to
   BlockBackend, it would still only have a BB for the top and the base
   node, but not for any of the intermediate nodes that it drops.

   We don't seem to have convenient places like attaching/detaching from
   nodes where we could put the permission requests. Also we don't seem
   to have a 

Re: [Qemu-devel] RFC: Operation Blockers in QEMU Block Nodes

2015-12-18 Thread Jeff Cody
On Fri, Dec 18, 2015 at 03:19:25PM +0100, Kevin Wolf wrote:
> Am 16.12.2015 um 07:25 hat Jeff Cody geschrieben:
> > Background:
> > 
> > Block jobs, and other QAPI operations, may modify and impact the
> > BlockDriverState graph in QEMU.  In order to support multiple
> > operations safely, we need a mechanism to block and gate operations.
> > 
> > We currently have op blockers, which are attached to each BDS.
> > However, in practice we check this on the device level, rather than on
> > the granularity of the individual BDS.  Also, due to limitations of
> > the current system, we only allow a single block job at a time.
> 
> As I already said on IRC, I think a really important part in order to
> make proper op blockers workable is that we actually enforce that you
> have to request permission before you can operate on an image. That is,
> if you were to write to an image, but you haven't requested write
> permission from the op blocker system before, you'd get an assertion
> failure.
> 
> The reason for this is that it's a long standing issue of the current
> system that it's not only possible, but even easy to forget adding the
> blocker code or updating it when the surrounding code changes. The
> result is a mechanism that is too restrictive for many cases on the one
> hand, but fails to actually provide the protection we need on the other
> hand (for example, bs->file is still unprotected).
> 
> Blockers are subtle and at the same time pervasive enough that I'm
> almost sure we wouldn't implement them correctly if we don't force
> ourselves to do it right by letting qemu crash whenever we don't.
>

I agree with this.  The (too brief) blurb at the end of my email was
the kernel of my thoughts:

 "It may be possible to later expand the NAC system, to provide
 handles for use by low-level operations in block.c."

I think to do enforced permissions, we would need to introduce
something like handles (or tokens) for operations, so that we would
fail if the handle/token was not provided for a given operation (or
had insufficient permissions).

I mentioned "later" because I think that, if the code itself is kept
flexible, it may be better to grow the block layer into enforcement
gradually.  But you are right, doing it right away may be best.


> > Proposed New Design Features:
> > --
> > This design would supersede the current op blocker system.
> > 
> > Rather than have the op blockers attached directly to each BDS, Block
> > Job access will be done through a separate Node Access Control system.
> > 
> > This new system will:
> > 
> > * Allow / Disallow operations to BDSs, (generally initiated by QAPI
> >   actions; i.e. BDS node operations other than guest read/writes)
> > 
> > * Not be limited to "block jobs" (e.g. block-commit, block-stream,
> >   etc..) - will also apply to other operations, such as
> >   blockdev-add, change-backing-file, etc.
> > 
> > * Allow granularity in options, and provide the safety needed to
> >   support multiple block jobs and operations.
> > 
> > * Reference each BDS by its node-name
> > 
> > * Be independent of the bdrv_states graph (i.e. does not reside in
> >   the BDS structure)
> 
> I agree that the BDS structure is probably not the right place (except
> possibly for caching the permission bitmasks), but I also wouldn't say
> that it should be a completely independent structure. Otherwise keeping
> things in sync when the graph changes might turn out hard.
> 

And having some linking into the BDS will also make it more efficient,
and allow the NAC to probe the graph for insights (such as child /
parent node relationships, to check for adequate permissions, etc.).

I just want something that makes the NAC a distinct entity, that is
not necessarily constrained by the BDS graph (especially since we can
now unambiguously address all nodes independent of the graph).
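Jeff's picture of the NAC as a distinct entity, addressing nodes by node-name independently of the BDS graph, could look roughly like this minimal Python model (the structure and every name are hypothetical, not QEMU code):

```python
# Illustrative NAC registry: jobs keyed by id, each carrying per-node
# (require, allow) flags, looked up purely by node-name.

class NodeAccessControl:
    def __init__(self):
        self.jobs = {}        # job id -> {node-name: (require, allow)}
        self.next_id = 0

    def register(self, flags_by_node):
        """Record a new job's per-node Require/Allow flags; return its id."""
        job_id = self.next_id
        self.next_id += 1
        self.jobs[job_id] = dict(flags_by_node)
        return job_id

    def complete(self, job_id):
        """Job finished: drop its flags from the tracking list."""
        del self.jobs[job_id]

    def flags_for(self, node_name):
        """All (require, allow) pairs currently registered for a node."""
        return [flags[node_name] for flags in self.jobs.values()
                if node_name in flags]

nac = NodeAccessControl()
job = nac.register({"node0": (True, False)})
```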


> At the moment I can see two fundamentally different kinds of operations
> that are interesting for op blockers:
> 
> 1. I/O requests
> 
>    We need a place to check I/O requests that knows both the user (and
>    the permissions it has acquired) and the node that is accessed.
> 
>    Any I/O request from outside the block layer (which includes guest
>    devices, NBD servers and block jobs) should come through a
>    BlockBackend. Even though we don't have it yet, we want to have a
>    single BB per user. These requests could be dealt with by calling
>    into the NAC whenever a BDS is attached to or detached from a BB;
>    assertions can be checked in the appropriate blk_* functions.
> 

Can you elaborate more on "we want to have a single BB per user"?

I certainly agree that for most roles and operations, it is true that
the BB should be the representation.  However, most block jobs will
benefit from specifying node-names in addition to the BB, for easier
node identification.  I'd go as far as requiring it, if we are
implementing a new set of QAPI functions to go 

Re: [Qemu-devel] RFC: Operation Blockers in QEMU Block Nodes

2015-12-17 Thread Stefan Hajnoczi
On Wed, Dec 16, 2015 at 01:25:29AM -0500, Jeff Cody wrote:
> Background:
> 
> Block jobs, and other QAPI operations, may modify and impact the
> BlockDriverState graph in QEMU.  In order to support multiple
> operations safely, we need a mechanism to block and gate operations.
> 
> We currently have op blockers, which are attached to each BDS.
> However, in practice we check this on the device level, rather than on
> the granularity of the individual BDS.  Also, due to limitations of
> the current system, we only allow a single block job at a time.
> 
> 
> Proposed New Design Features:
> --
> This design would supersede the current op blocker system.
> 
> Rather than have the op blockers attached directly to each BDS, Block
> Job access will be done through a separate Node Access Control system.
> 
> This new system will:

Trying to understand what's new:

> * Allow / Disallow operations to BDSs, (generally initiated by QAPI
>   actions; i.e. BDS node operations other than guest read/writes)

The old system allows this.

> * Not be limited to "block jobs" (e.g. block-commit, block-stream,
>   etc..) - will also apply to other operations, such as
>   blockdev-add, change-backing-file, etc.

The old system allows this.

> * Allow granularity in options, and provide the safety needed to
>   support multiple block jobs and operations.

The old system allows this although it leads to a proliferation of op
blockers.

> * Reference each BDS by its node-name

The old system is also per-BDS.

> * Be independent of the bdrv_states graph (i.e. does not reside in
>   the BDS structure)

This is new, but what is the benefit?

I'm not advocating the old system, just trying to understand what the
new requirements are.

> Node Access Control: Jobs
> --
> Every QAPI/HMP initiated operation is considered a "job", regardless of
> whether it is implemented via coroutines (e.g. block jobs such as
> block-commit), or handled synchronously in a QAPI handler (e.g.
> change-backing-file, blockdev-add).

Perhaps "operation" is clearer since the term "job" is already
associated with block jobs.

> A job consists of:
> * Unique job ID

What is the purpose of the ID?  I guess you are planning a QMP
interface?

> * A list of all node-names affected by the operation
> * Action flags (see below) for each node in the above list
> 
> 
> Node Access Control: Action Flags
> ---
> Every operation must set action flag pairs, for every node affected by
> the operation.
> 
> Action flags are set as a Require/Allow pair.  If the Require
> flag is set, then the operation requires the ability to take the
> specific action.  If the Allow flag is set, then the operation will
> allow other operations to perform the same action in parallel.
> 
> The default is to prohibit, not allow, parallel actions.
> 
> The proposed actions are:
> 
> 1. Modify - Visible Data
> * This is modification that would be detected by an external
>   read (for instance, a hypothetical qemu-io read from the
>   specific node), inclusive of its backing files.  A classic
>   example would be writing data into another node, as part of
>   block-commit.
> 
> 2. Modify - Metadata
> * This is a write that is not necessarily visible to an external
>   read, but still modifies the underlying image format behind the
>   node.  I.e., change-backing-file to an identical backing file.
>   (n.b.: if changing the backing file to a non-identical
>   backing file, the "Write - Visible Data" action would also
>   be required).
> 
> 3. Graph Reconfiguration
> * Removing, adding, or reordering nodes in the bdrv_state
>   graph. (n.b.: Write - Visible Data may need to be set, if
>   appropriate)
> 
> 4. Read - Visible Data
> * Read data visible to an external read.
> 
> 5. Read - Metadata
> * Read node metadata

I think this is the main change from the old system.  Instead of relying
on operation-specific blockers (e.g. BLOCK_OP_TYPE_COMMIT_TARGET), the
new system uses just a few actions.

The risk is that this small list of actions won't be fine-grained enough
to express the relationships between existing operations.  I don't know
whether or not this is the case.

Have you enumerated existing operations and checked that these actions
can represent the necessary blockers?

> Node Access Control: Tracking Jobs
> --
> Each new NAC job will have a unique ID. Associated with that ID is
> a list of each BDS node affected by the operation, alongside its
> corresponding Action Flags.
> 
> When a new NAC job is created, a separate list of node-name flags is
> updated to provide easy go/no-go decisions.
> 
> An operation is prohibited if:

These make sense:

> * A "Require" Action Flag is set, 

Re: [Qemu-devel] RFC: Operation Blockers in QEMU Block Nodes

2015-12-16 Thread Jeff Cody
On Wed, Dec 16, 2015 at 09:31:26AM -0700, Eric Blake wrote:
> On 12/15/2015 11:25 PM, Jeff Cody wrote:
> > Background:
> > 
> > Block jobs, and other QAPI operations, may modify and impact the
> > BlockDriverState graph in QEMU.  In order to support multiple
> > operations safely, we need a mechanism to block and gate operations.
> > 
> > We currently have op blockers, which are attached to each BDS.
> > However, in practice we check this on the device level, rather than on
> > the granularity of the individual BDS.  Also, due to limitations of
> > the current system, we only allow a single block job at a time.
> > 
> > 
> > Proposed New Design Features:
> > --
> > This design would supersede the current op blocker system.
> > 
> > Rather than have the op blockers attached directly to each BDS, Block
> > Job access will be done through a separate Node Access Control system.
> > 
> > This new system will:
> > 
> > * Allow / Disallow operations to BDSs, (generally initiated by QAPI
> >   actions; i.e. BDS node operations other than guest read/writes)
> 
> Doesn't suspending/resuming the guest count as a case where guest
> read/writes affect allowed operations?  That is, a VM in state RUNNING
> counts as one that Requires read-data and write-data, and once it
> transitions to paused, the Requires can go away.
>

That's a good question.  I'm not sure, and I tend to think of the VM
guest as a special case, but I don't know that everyone agrees with
that.

My reasoning on why it is special can be summed up as: the
guest is concerned with the _device_ (i.e. BlockBackend), rather than
the BDS nodes (which are invisible to the guest).

For instance, if the guest had Require/Allow flags set, when we do a
live snapshot, those flags wouldn't necessarily apply to the BDS, but
we would want those to move to the active layer again (but not really,
because again we are concerned with the BlockBackend device more so
than the BDS).

So I think if we wanted to encapsulate the guest, rather than trying
to do it vis-a-vis the active layer, we should extend this proposed
NAC scheme to also have a BBAC (BlockBackend Access Control) layer.

> > 
> > * Not be limited to "block jobs" (e.g. block-commit, block-stream,
> >   etc..) - will also apply to other operations, such as
> >   blockdev-add, change-backing-file, etc.
> > 
> > * Allow granularity in options, and provide the safety needed to
> >   support multiple block jobs and operations.
> > 
> > * Reference each BDS by its node-name
> > 
> > * Be independent of the bdrv_states graph (i.e. does not reside in
> >   the BDS structure)
> > 
> > 
> > Node Access Control: Jobs
> > --
> > Every QAPI/HMP initiated operation is considered a "job", regardless of
> > whether it is implemented via coroutines (e.g. block jobs such as
> > block-commit), or handled synchronously in a QAPI handler (e.g.
> > change-backing-file, blockdev-add).
> > 
> > A job consists of:
> > * Unique job ID
> > * A list of all node-names affected by the operation
> 
> Do we need to hold some sort of lock when computing the list of
> node-names to be affected, since some of the very operations involved
> are those that would change the set of node-names impacted? That is, no
> operation can change the graph while another operation is collecting the
> list of nodes to be tied to a job.
> 

Yes, good point, I think you are right: a block-level global mutex of
sorts would be needed here.  I think we could get away with this being
around graph reconfiguration, and computing the list of node-names
affected by an operation.
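The block-level global mutex Jeff agrees is needed can be sketched as follows: graph reconfiguration and the computation of a job's affected-node list take the same lock, so the node set cannot change mid-computation. This is a Python model with hypothetical names, not QEMU code.

```python
import threading

# One global lock serializing graph changes against node-list walks.
graph_lock = threading.Lock()
graph = {"top": ["mid"], "mid": ["base"], "base": []}   # toy backing chain

def affected_nodes(start):
    """Collect start plus everything reachable below it, atomically."""
    with graph_lock:
        seen, stack = [], [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.append(node)
                stack.extend(graph[node])
        return seen

def reconfigure(parent, new_children):
    # Graph reconfiguration takes the same lock, excluding list walks.
    with graph_lock:
        graph[parent] = new_children
```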

> > * Action flags (see below) for each node in the above list
> > 
> > 
> > Node Access Control: Action Flags
> > ---
> > Every operation must set action flag pairs, for every node affected by
> > the operation.
> > 
> > Action flags are set as a Require/Allow pair.  If the Require
> > flag is set, then the operation requires the ability to take the
> > specific action.  If the Allow flag is set, then the operation will
> > allow other operations to perform the same action in parallel.
> > 
> > The default is to prohibit, not allow, parallel actions.
> > 
> > The proposed actions are:
> > 
> > 1. Modify - Visible Data
> > * This is modification that would be detected by an external
> >   read (for instance, a hypothetical qemu-io read from the
> >   specific node), inclusive of its backing files.  A classic
> >   example would be writing data into another node, as part of
> >   block-commit.
> > 
> > 2. Modify - Metadata
> > * This is a write that is not necessarily visible to an external
> >   read, but still modifies the underlying image format behind the
> >   node.  I.e., change-backing-file to an identical backing file.
> >   (n.b.: if changing the backing file to a non-identical
> >   backing file, the "Write - Visible Data" action would also
> >   be required).

Re: [Qemu-devel] RFC: Operation Blockers in QEMU Block Nodes

2015-12-16 Thread Eric Blake
On 12/15/2015 11:25 PM, Jeff Cody wrote:
> Background:
> 
> Block jobs, and other QAPI operations, may modify and impact the
> BlockDriverState graph in QEMU.  In order to support multiple
> operations safely, we need a mechanism to block and gate operations.
> 
> We currently have op blockers, which are attached to each BDS.
> However, in practice we check this on the device level, rather than on
> the granularity of the individual BDS.  Also, due to limitations of
> the current system, we only allow a single block job at a time.
> 
> 
> Proposed New Design Features:
> --
> This design would supersede the current op blocker system.
> 
> Rather than have the op blockers attached directly to each BDS, Block
> Job access will be done through a separate Node Access Control system.
> 
> This new system will:
> 
> * Allow / Disallow operations to BDSs, (generally initiated by QAPI
>   actions; i.e. BDS node operations other than guest read/writes)

Doesn't suspending/resuming the guest count as a case where guest
read/writes affect allowed operations?  That is, a VM in state RUNNING
counts as one that Requires read-data and write-data, and once it
transitions to paused, the Requires can go away.

> 
> * Not be limited to "block jobs" (e.g. block-commit, block-stream,
>   etc..) - will also apply to other operations, such as
>   blockdev-add, change-backing-file, etc.
> 
> * Allow granularity in options, and provide the safety needed to
>   support multiple block jobs and operations.
> 
> * Reference each BDS by its node-name
> 
> * Be independent of the bdrv_states graph (i.e. does not reside in
>   the BDS structure)
> 
> 
> Node Access Control: Jobs
> --
> Every QAPI/HMP initiated operation is considered a "job", regardless of
> whether it is implemented via coroutines (e.g. block jobs such as
> block-commit), or handled synchronously in a QAPI handler (e.g.
> change-backing-file, blockdev-add).
> 
> A job consists of:
> * Unique job ID
> * A list of all node-names affected by the operation

Do we need to hold some sort of lock when computing the list of
node-names to be affected, since some of the very operations involved
are those that would change the set of node-names impacted? That is, no
operation can change the graph while another operation is collecting the
list of nodes to be tied to a job.

> * Action flags (see below) for each node in the above list
> 
> 
> Node Access Control: Action Flags
> ---
> Every operation must set action flag pairs, for every node affected by
> the operation.
> 
> Action flags are set as a Require/Allow pair.  If the Require
> flag is set, then the operation requires the ability to take the
> specific action.  If the Allow flag is set, then the operation will
> allow other operations to perform the same action in parallel.
> 
> The default is to prohibit, not allow, parallel actions.
> 
> The proposed actions are:
> 
> 1. Modify - Visible Data
> * This is modification that would be detected by an external
>   read (for instance, a hypothetical qemu-io read from the
>   specific node), inclusive of its backing files.  A classic
>   example would be writing data into another node, as part of
>   block-commit.
> 
> 2. Modify - Metadata
> * This is a write that is not necessarily visible to an external
>   read, but still modifies the underlying image format behind the
>   node.  I.e., change-backing-file to an identical backing file.
>   (n.b.: if changing the backing file to a non-identical
>   backing file, the "Write - Visible Data" action would also
>   be required).

Will these tie in to Kevin's work on advisory qcow2 locking (where we
try to detect the case of two concurrent processes both opening the same
qcow2 file for writes)?

> 
> 3. Graph Reconfiguration
> * Removing, adding, or reordering nodes in the bdrv_state
>   graph. (n.b.: Write - Visible Data may need to be set, if
>   appropriate)
> 
> 4. Read - Visible Data
> * Read data visible to an external read.
> 
> 5. Read - Metadata
> * Read node metadata
> 
> 
> Node Access Control: Tracking Jobs
> --
> Each new NAC job will have a unique ID. Associated with that ID is
> a list of each BDS node affected by the operation, alongside its
> corresponding Action Flags.
> 
> When a new NAC job is created, a separate list of node-name flags is
> updated to provide easy go/no-go decisions.
> 
> An operation is prohibited if:
> 
> * A "Require" Action Flag is set, and
> * The logical AND of the "Allow" Action Flag of all NAC-controlled
>   operations for that node is 0.
> 
> - OR -
> 
> * The operation does not set the "Allow" Action Flag, and
> * The logical OR of the corresponding "Require" Action Flag of all
>   NAC-controlled operations for that node is 1.

[Qemu-devel] RFC: Operation Blockers in QEMU Block Nodes

2015-12-15 Thread Jeff Cody
Background:

Block jobs, and other QAPI operations, may modify and impact the
BlockDriverState graph in QEMU.  In order to support multiple
operations safely, we need a mechanism to block and gate operations.

We currently have op blockers, which are attached to each BDS.
However, in practice we check this on the device level, rather than on
the granularity of the individual BDS.  Also, due to limitations of
the current system, we only allow a single block job at a time.


Proposed New Design Features:
--
This design would supersede the current op blocker system.

Rather than have the op blockers attached directly to each BDS, Block
Job access will be done through a separate Node Access Control system.

This new system will:

* Allow / Disallow operations to BDSs, (generally initiated by QAPI
  actions; i.e. BDS node operations other than guest read/writes)

* Not be limited to "block jobs" (e.g. block-commit, block-stream,
  etc..) - will also apply to other operations, such as
  blockdev-add, change-backing-file, etc.

* Allow granularity in options, and provide the safety needed to
  support multiple block jobs and operations.

* Reference each BDS by its node-name

* Be independent of the bdrv_states graph (i.e. does not reside in
  the BDS structure)


Node Access Control: Jobs
--
Every QAPI/HMP initiated operation is considered a "job", regardless of
whether it is implemented via coroutines (e.g. block jobs such as
block-commit), or handled synchronously in a QAPI handler (e.g.
change-backing-file, blockdev-add).

A job consists of:
* Unique job ID
* A list of all node-names affected by the operation
* Action flags (see below) for each node in the above list


Node Access Control: Action Flags
---
Every operation must set action flag pairs, for every node affected by
the operation.

Action flags are set as a Require/Allow pair.  If the Require
flag is set, then the operation requires the ability to take the
specific action.  If the Allow flag is set, then the operation will
allow other operations to perform the same action in parallel.

The default is to prohibit, not allow, parallel actions.

The proposed actions are:

1. Modify - Visible Data
* This is modification that would be detected by an external
  read (for instance, a hypothetical qemu-io read from the
  specific node), inclusive of its backing files.  A classic
  example would be writing data into another node, as part of
  block-commit.

2. Modify - Metadata
* This is a write that is not necessarily visible to an external
  read, but still modifies the underlying image format behind the
  node.  I.e., change-backing-file to an identical backing file.
  (n.b.: if changing the backing file to a non-identical
  backing file, the "Write - Visible Data" action would also
  be required).

3. Graph Reconfiguration
* Removing, adding, or reordering nodes in the bdrv_state
  graph. (n.b.: Write - Visible Data may need to be set, if
  appropriate)

4. Read - Visible Data
* Read data visible to an external read.

5. Read - Metadata
* Read node metadata
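The five actions and the Require/Allow pairing above could be encoded as bit flags, roughly like this Python sketch. It is illustrative only (the real implementation would be C in QEMU, and all names are made up here):

```python
from enum import IntFlag

class Action(IntFlag):
    MODIFY_DATA     = 1 << 0   # 1. Modify - Visible Data
    MODIFY_METADATA = 1 << 1   # 2. Modify - Metadata
    GRAPH_RECONFIG  = 1 << 2   # 3. Graph Reconfiguration
    READ_DATA       = 1 << 3   # 4. Read - Visible Data
    READ_METADATA   = 1 << 4   # 5. Read - Metadata

class NodeFlags:
    """Require/Allow pair for one node in one job's node list."""
    def __init__(self, require=Action(0), allow=Action(0)):
        self.require = require   # actions this operation must take
        self.allow = allow       # actions it tolerates in parallel

# e.g. a job writing visible data into a node, tolerating concurrent
# readers but no parallel writers (the default is to prohibit).
commit_flags = NodeFlags(require=Action.MODIFY_DATA,
                         allow=Action.READ_DATA | Action.READ_METADATA)
```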


Node Access Control: Tracking Jobs
--
Each new NAC job will have a unique ID. Associated with that ID is
a list of each BDS node affected by the operation, alongside its
corresponding Action Flags.

When a new NAC job is created, a separate list of node-name flags is
updated to provide easy go/no-go decisions.

An operation is prohibited if:

* A "Require" Action Flag is set, and
* The logical AND of the "Allow" Action Flag of all NAC-controlled
  operations for that node is 0.

- OR -

* The operation does not set the "Allow" Action Flag, and
* The logical OR of the corresponding "Require" Action Flag of all
  NAC-controlled operations for that node is 1.
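These two rules transcribe directly into a go/no-go check. A Python sketch for a single action on a single node (names illustrative; `existing` stands for the (require, allow) pairs already registered for that node):

```python
def prohibited(new_require, new_allow, existing):
    """Apply the two prohibition rules for one action on one node."""
    allow_all = all(a for _, a in existing)    # logical AND of Allow flags
    require_any = any(r for r, _ in existing)  # logical OR of Require flags
    # Rule 1: we require the action, but not every existing job allows it.
    if new_require and not allow_all:
        return True
    # Rule 2: we don't allow parallelism, but some existing job requires it.
    if not new_allow and require_any:
        return True
    return False

# No existing jobs: all() over the empty list is True and any() is
# False, so a new operation is always admitted.
assert prohibited(True, False, []) is False
# One job requires the action without allowing parallelism: blocked.
assert prohibited(True, False, [(True, False)]) is True
```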

When a NAC controlled job completes, the node-name flag list is
updated, and the corresponding NAC job removed from the job list.


General Notes:
-
* This is still an "honor" system, in that each handler / job is
responsible for acquiring the NAC permission, and correctly identifying
all nodes affected by its operation.

This should be done before any action is taken by the handler - that
is, it should be clean to abort the operation if the NAC does not give
consent.

* It may be possible to later expand the NAC system, to provide
  handles for use by low-level operations in block.c.

* Thoughts?  There are still probably some holes in the scheme that need
to be plugged.


-Jeff