Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-05 Thread liu ping fan
[...]

 We could externalize the refcounting and push it into device code.  This
 means:

memory_region_init_io(s-mem, dev)

...

object_ref(dev)
memory_region_add_subregion(..., dev-mr)

...

memory_region_del_subregion(..., dev-mr)  // implied flush
object_unref(dev)

 I think object_ref(dev) just another style to push
 MemoryRegionOps::object() to device level.  And I think we can bypass
 it.   The caller (unplug, pci-reconfig ) of
 memory_region_del_subregion() ensure the @dev is valid.
 If the implied flush is implemented in synchronize,  _del_subregion()
 will guarantee no reader for @dev-mr and @dev exist any longer.

 The above code has a deadlock.  memory_region_del_subregion() may be
 called under the device lock (since it may be the result of mmio to the
 device), and if the flush uses synchronized_rcu(), it will wait forever
 for the read-side critical section to complete.

 But if _del_subregion() just wait for mr-X quiescent period, while
 calling in mr-Y's read side critical section, then we can avoid
 deadlock.  I saw in pci-mapping, we delete mr-X in mr-Y read side.

 So I
 think we can save both object_ref/unref(dev) for memory system.
 The only problem is that whether we can implement it as synchronous or
 not,  is it possible that we can launch a  _del_subregion(mr-X) in
 mr-X's dispatcher?

 Yes.  Real cases exist.

 Oh, I find the sample code, then, the deadlock is unavoidable in this method.

 What alternatives remain?

 I think a way out may be async+refcnt

 If we consider the relationship of MemoryRegionImpl and device as the
 one between file and file-private_data  in Linux. Then the creation
 of impl will object_ref(dev) and when impl-ref=0, it will
 object_unref(dev)
 But this is an async model, for those client which need to know the
 end of flush, we can adopt callback just like call_rcu().


During this thread, it seems that no matter we adopt refcnt on
MemoryRegionImpl or not, protecting MemoryRegion::opaque during
dispatch is still required. It is challenging to substitute
memory_region_init_io() 's 3rd parameter from void* to Object*, owning
to the  conflict that life-cycle management need the host of opaque,
while Object and mr need 1:1 map. So I think, I will move forward on
the way based on MemoryRegionOps::object(). Right?

Regards,
pingfan


 Regards,
 pingfan
 --
 error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-05 Thread Avi Kivity
On 09/05/2012 11:19 AM, liu ping fan wrote:
 [...]

 We could externalize the refcounting and push it into device code.  This
 means:

memory_region_init_io(s-mem, dev)

...

object_ref(dev)
memory_region_add_subregion(..., dev-mr)

...

memory_region_del_subregion(..., dev-mr)  // implied flush
object_unref(dev)

 I think object_ref(dev) just another style to push
 MemoryRegionOps::object() to device level.  And I think we can bypass
 it.   The caller (unplug, pci-reconfig ) of
 memory_region_del_subregion() ensure the @dev is valid.
 If the implied flush is implemented in synchronize,  _del_subregion()
 will guarantee no reader for @dev-mr and @dev exist any longer.

 The above code has a deadlock.  memory_region_del_subregion() may be
 called under the device lock (since it may be the result of mmio to the
 device), and if the flush uses synchronized_rcu(), it will wait forever
 for the read-side critical section to complete.

 But if _del_subregion() just wait for mr-X quiescent period, while
 calling in mr-Y's read side critical section, then we can avoid
 deadlock.  I saw in pci-mapping, we delete mr-X in mr-Y read side.

 So I
 think we can save both object_ref/unref(dev) for memory system.
 The only problem is that whether we can implement it as synchronous or
 not,  is it possible that we can launch a  _del_subregion(mr-X) in
 mr-X's dispatcher?

 Yes.  Real cases exist.

 Oh, I find the sample code, then, the deadlock is unavoidable in this 
 method.

 What alternatives remain?

 I think a way out may be async+refcnt

 If we consider the relationship of MemoryRegionImpl and device as the
 one between file and file-private_data  in Linux. Then the creation
 of impl will object_ref(dev) and when impl-ref=0, it will
 object_unref(dev)
 But this is an async model, for those client which need to know the
 end of flush, we can adopt callback just like call_rcu().


 During this thread, it seems that no matter we adopt refcnt on
 MemoryRegionImpl or not, protecting MemoryRegion::opaque during
 dispatch is still required. 

Right.  So MemoryRegionImpl is pointless.

My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a
corresponding unref), which has the following requirements:

- if the refcount is nonzero, MemoryRegion::opaque is safe to use
- if the refcount is nonzero, the MemoryRegion itself is stable.

Later we can change other callbacks to also use mr instead of opaque.
Usually the callback will be able to derive the device object from the
region, only special cases will require
memory_region_opaque(MemoryRegion *).  We can even drop the opaque from
memory_region_init_io() and replace it with memory_region_set_opaque(),
to be used in those special cases.

 It is challenging to substitute
 memory_region_init_io() 's 3rd parameter from void* to Object*, owning
 to the  conflict that life-cycle management need the host of opaque,
 while Object and mr need 1:1 map. So I think, I will move forward on
 the way based on MemoryRegionOps::object(). Right?

As outlined above, I now prefer MemoryRegionOps::ref/unref.

Advantages compared to MemoryRegionOps::object():
 - decoupled from the QOM framework

Disadvantages:
 - more boilerplate code in converted devices

Since we are likely to convert few devices to lockless dispatch, I
believe the tradeoff is justified.  But let's hear Jan and the others.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-05 Thread Jan Kiszka
On 2012-09-05 11:52, Avi Kivity wrote:
 On 09/05/2012 11:19 AM, liu ping fan wrote:
 [...]

 We could externalize the refcounting and push it into device code.  This
 means:

memory_region_init_io(s-mem, dev)

...

object_ref(dev)
memory_region_add_subregion(..., dev-mr)

...

memory_region_del_subregion(..., dev-mr)  // implied flush
object_unref(dev)

 I think object_ref(dev) just another style to push
 MemoryRegionOps::object() to device level.  And I think we can bypass
 it.   The caller (unplug, pci-reconfig ) of
 memory_region_del_subregion() ensure the @dev is valid.
 If the implied flush is implemented in synchronize,  _del_subregion()
 will guarantee no reader for @dev-mr and @dev exist any longer.

 The above code has a deadlock.  memory_region_del_subregion() may be
 called under the device lock (since it may be the result of mmio to the
 device), and if the flush uses synchronized_rcu(), it will wait forever
 for the read-side critical section to complete.

 But if _del_subregion() just wait for mr-X quiescent period, while
 calling in mr-Y's read side critical section, then we can avoid
 deadlock.  I saw in pci-mapping, we delete mr-X in mr-Y read side.

 So I
 think we can save both object_ref/unref(dev) for memory system.
 The only problem is that whether we can implement it as synchronous or
 not,  is it possible that we can launch a  _del_subregion(mr-X) in
 mr-X's dispatcher?

 Yes.  Real cases exist.

 Oh, I find the sample code, then, the deadlock is unavoidable in this 
 method.

 What alternatives remain?

 I think a way out may be async+refcnt

 If we consider the relationship of MemoryRegionImpl and device as the
 one between file and file-private_data  in Linux. Then the creation
 of impl will object_ref(dev) and when impl-ref=0, it will
 object_unref(dev)
 But this is an async model, for those client which need to know the
 end of flush, we can adopt callback just like call_rcu().


 During this thread, it seems that no matter we adopt refcnt on
 MemoryRegionImpl or not, protecting MemoryRegion::opaque during
 dispatch is still required. 
 
 Right.  So MemoryRegionImpl is pointless.
 
 My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a
 corresponding unref), which has the following requirements:
 
 - if the refcount is nonzero, MemoryRegion::opaque is safe to use
 - if the refcount is nonzero, the MemoryRegion itself is stable.

The second point means that the memory subsystem will cache the region
state and apply changes only after leaving a handler that performed them?

 
 Later we can change other callbacks to also use mr instead of opaque.
 Usually the callback will be able to derive the device object from the
 region, only special cases will require
 memory_region_opaque(MemoryRegion *).  We can even drop the opaque from
 memory_region_init_io() and replace it with memory_region_set_opaque(),
 to be used in those special cases.
 
 It is challenging to substitute
 memory_region_init_io() 's 3rd parameter from void* to Object*, owning
 to the  conflict that life-cycle management need the host of opaque,
 while Object and mr need 1:1 map. So I think, I will move forward on
 the way based on MemoryRegionOps::object(). Right?
 
 As outlined above, I now prefer MemoryRegionOps::ref/unref.
 
 Advantages compared to MemoryRegionOps::object():
  - decoupled from the QOM framework
 
 Disadvantages:
  - more boilerplate code in converted devices
 
 Since we are likely to convert few devices to lockless dispatch, I
 believe the tradeoff is justified.  But let's hear Jan and the others.

I still need to dig into related posts of the past days, lost track, but
the above sounds much better.

Besides the question of what to protect and how, one important thing
IMHO is that we are able to switch locking behaviour on a per region
basis. And that should also be feasible this way.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-05 Thread Avi Kivity
On 09/05/2012 01:36 PM, Jan Kiszka wrote:
 
 My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a
 corresponding unref), which has the following requirements:
 
 - if the refcount is nonzero, MemoryRegion::opaque is safe to use
 - if the refcount is nonzero, the MemoryRegion itself is stable.
 
 The second point means that the memory subsystem will cache the region
 state and apply changes only after leaving a handler that performed them?

No.  I/O callbacks may be called after a region has been disabled.

I guess we can restrict this to converted regions, so we don't introduce
regressions.

 As outlined above, I now prefer MemoryRegionOps::ref/unref.
 
 Advantages compared to MemoryRegionOps::object():
  - decoupled from the QOM framework
 
 Disadvantages:
  - more boilerplate code in converted devices
 
 Since we are likely to convert few devices to lockless dispatch, I
 believe the tradeoff is justified.  But let's hear Jan and the others.
 
 I still need to dig into related posts of the past days, lost track, but
 the above sounds much better.
 
 Besides the question of what to protect and how, one important thing
 IMHO is that we are able to switch locking behaviour on a per region
 basis. And that should also be feasible this way.

It was also possible with MemoryRegionOps::object() - no one said all
regions for a device have to refer to the same object (and there is a
difference between locking and reference counting, each callback could
take a different lock).  But it is more natural with MemoryRegionOps::ref().

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-05 Thread Jan Kiszka
On 2012-09-05 12:53, Avi Kivity wrote:
 On 09/05/2012 01:36 PM, Jan Kiszka wrote:

 My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a
 corresponding unref), which has the following requirements:

 - if the refcount is nonzero, MemoryRegion::opaque is safe to use
 - if the refcount is nonzero, the MemoryRegion itself is stable.

 The second point means that the memory subsystem will cache the region
 state and apply changes only after leaving a handler that performed them?
 
 No.  I/O callbacks may be called after a region has been disabled.
 
 I guess we can restrict this to converted regions, so we don't introduce
 regressions.

s/can/have to/. This property change will require some special care,
hopefully mostly at the memory layer. A simple scenario from recent patches:

if (enable) {
memory_region_set_address(s-pm_io, pm_io_base);
memory_region_set_enabled(s-pm_io, true);
} else {
memory_region_set_enabled(s-pm_io, false);
}

We will have to ensure that set_enabled(..., true) will never cause a
dispatch using an outdated base address.

I think discussing semantics and usage patterns of the new memory API -
outside of the BQL - will be the next big topic. ;)

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-05 Thread Avi Kivity
On 09/05/2012 02:11 PM, Jan Kiszka wrote:
 On 2012-09-05 12:53, Avi Kivity wrote:
 On 09/05/2012 01:36 PM, Jan Kiszka wrote:

 My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a
 corresponding unref), which has the following requirements:

 - if the refcount is nonzero, MemoryRegion::opaque is safe to use
 - if the refcount is nonzero, the MemoryRegion itself is stable.

 The second point means that the memory subsystem will cache the region
 state and apply changes only after leaving a handler that performed them?
 
 No.  I/O callbacks may be called after a region has been disabled.
 
 I guess we can restrict this to converted regions, so we don't introduce
 regressions.
 
 s/can/have to/. This property change will require some special care,
 hopefully mostly at the memory layer. A simple scenario from recent patches:
 
 if (enable) {
 memory_region_set_address(s-pm_io, pm_io_base);
 memory_region_set_enabled(s-pm_io, true);
 } else {
 memory_region_set_enabled(s-pm_io, false);
 }

I am unable to avoid pointing out that this can be collapsed to

  memory_region_set_address(s-pm_io, pm_io_base);
  memory_region_set_enabled(s-pm_io, enable);

as the address is meaningless when disabled. Sorry.

 
 We will have to ensure that set_enabled(..., true) will never cause a
 dispatch using an outdated base address.

No, this is entirely safe.  If the guest changes a region offset
concurrently with issuing mmio on it, then it must expect either the old
or new offset to be used during dispatch.  In either case, the correct
intra-region offset will be provided to the I/O callback (no volatile
MemoryRegion fields except -readable (IIRC) are used during dispatch -
the rest are all copied into data structures used during dispatch (this
is part of what makes the whole thing so rcu friendly).

 I think discussing semantics and usage patterns of the new memory API -
 outside of the BQL - will be the next big topic. ;)

I hope it won't prove to be that complicated.


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-05 Thread Jan Kiszka
On 2012-09-05 13:25, Avi Kivity wrote:
 On 09/05/2012 02:11 PM, Jan Kiszka wrote:
 On 2012-09-05 12:53, Avi Kivity wrote:
 On 09/05/2012 01:36 PM, Jan Kiszka wrote:

 My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a
 corresponding unref), which has the following requirements:

 - if the refcount is nonzero, MemoryRegion::opaque is safe to use
 - if the refcount is nonzero, the MemoryRegion itself is stable.

 The second point means that the memory subsystem will cache the region
 state and apply changes only after leaving a handler that performed them?

 No.  I/O callbacks may be called after a region has been disabled.

 I guess we can restrict this to converted regions, so we don't introduce
 regressions.

 s/can/have to/. This property change will require some special care,
 hopefully mostly at the memory layer. A simple scenario from recent patches:

 if (enable) {
 memory_region_set_address(s-pm_io, pm_io_base);
 memory_region_set_enabled(s-pm_io, true);
 } else {
 memory_region_set_enabled(s-pm_io, false);
 }
 
 I am unable to avoid pointing out that this can be collapsed to
 
   memory_region_set_address(s-pm_io, pm_io_base);
   memory_region_set_enabled(s-pm_io, enable);
 
 as the address is meaningless when disabled. Sorry.
 

 We will have to ensure that set_enabled(..., true) will never cause a
 dispatch using an outdated base address.
 
 No, this is entirely safe.  If the guest changes a region offset
 concurrently with issuing mmio on it, then it must expect either the old
 or new offset to be used during dispatch.

You assume disable, reprogram, enable, ignoring that there could be
multiple, invalid states between disable and enable. Real hardware takes
care of the ordering.

  In either case, the correct
 intra-region offset will be provided to the I/O callback (no volatile
 MemoryRegion fields except -readable (IIRC) are used during dispatch -
 the rest are all copied into data structures used during dispatch (this
 is part of what makes the whole thing so rcu friendly).
 
 I think discussing semantics and usage patterns of the new memory API -
 outside of the BQL - will be the next big topic. ;)
 
 I hope it won't prove to be that complicated.
 

Yeah, let's hope this...

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-05 Thread Avi Kivity
On 09/05/2012 03:02 PM, Jan Kiszka wrote:
 On 2012-09-05 13:25, Avi Kivity wrote:
 On 09/05/2012 02:11 PM, Jan Kiszka wrote:
 On 2012-09-05 12:53, Avi Kivity wrote:
 On 09/05/2012 01:36 PM, Jan Kiszka wrote:

 My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a
 corresponding unref), which has the following requirements:

 - if the refcount is nonzero, MemoryRegion::opaque is safe to use
 - if the refcount is nonzero, the MemoryRegion itself is stable.

 The second point means that the memory subsystem will cache the region
 state and apply changes only after leaving a handler that performed them?

 No.  I/O callbacks may be called after a region has been disabled.

 I guess we can restrict this to converted regions, so we don't introduce
 regressions.

 s/can/have to/. This property change will require some special care,
 hopefully mostly at the memory layer. A simple scenario from recent patches:

 if (enable) {
 memory_region_set_address(s-pm_io, pm_io_base);
 memory_region_set_enabled(s-pm_io, true);
 } else {
 memory_region_set_enabled(s-pm_io, false);
 }
 
 I am unable to avoid pointing out that this can be collapsed to
 
   memory_region_set_address(s-pm_io, pm_io_base);
   memory_region_set_enabled(s-pm_io, enable);
 
 as the address is meaningless when disabled. Sorry.
 

 We will have to ensure that set_enabled(..., true) will never cause a
 dispatch using an outdated base address.
 
 No, this is entirely safe.  If the guest changes a region offset
 concurrently with issuing mmio on it, then it must expect either the old
 or new offset to be used during dispatch.
 

I forgot to mention that my clever rewrite above actually breaks this -
it needs to be wrapped in a transaction, otherwise we have a
move-then-disable pattern.

 You assume disable, reprogram, enable, ignoring that there could be
 multiple, invalid states between disable and enable. Real hardware takes
 care of the ordering.

It's possible of course, but the snippet above isn't susceptible on its own.

I don't think it's solvable.  To serialize, you must hold the device
lock, but we don't want to take the device lock during dispatch.

Users can protect against them by checking for -enabled:

void foo_io_read(...)
{
   FooState *s = container_of(mr, ...);

   qemu_mutex_lock(s-lock);
   if (!memory_region_enabled(mr)) {
   *data = ~(uint64_t)0;
   goto out;
   }
   ...
out:
   qemu_mutex_unlock(s-lock);
}

Non-converted users will be naturally protected since we will take the
bql on their behalf.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-03 Thread liu ping fan
On Sat, Sep 1, 2012 at 4:46 PM, Avi Kivity a...@redhat.com wrote:
 On 08/30/2012 12:08 AM, Jan Kiszka wrote:
 
  We are dispatching according to the memory region (parameters, op
  handlers, opaques). If we end up in device object is not decided at this
  level. A memory region describes a dispatchable area - not to confuse
  with a device that may only partially be able to receive such requests.
 
  But I think the meaning of the memory region is for dispatching. If no
  dispatching associated with mr, why need it exist in the system?

 Where did I say that memory regions should no longer be used for
 dispatching? The point is to keep the clean layer separation between
 memory regions and device objects instead of merging them together.

 Believe me, that's exactly how I feel.  I guess no author of a module
 wants to see it swallowed by a larger framework and see its design
 completely changed.  Modules should be decoupled as much as possible.
 But I don't see a better way yet.


 Given

Object
^^
||
 Region 1Region 2

 you protect the object during dispatch, implicitly (and that is bad)
 requiring that no region must change in that period.

 We only protect the object against removal.  After object_ref() has run,
 the region may still be disabled.

  I say what rather
 needs protection are the regions so that Region 2 can pass away and
 maybe reappear independent of Region 1.

 Region 2 can go away until the device-supplied dispatch code takes the
 device lock.  If the device chooses to use finer-grained locking, it can
 allow region 2 (or even region 1) to change during dispatch.  The only
 requirement is that region 1 is not destroyed.

  And: I won't need to know the
 type of that object the regions are referring to in this model. That's
 the difference.

 Sorry, I lost the reference.  What model?  Are you referring to my
 broken MemoryRegionImpl split?

   And
  could you elaborate that who will be the ref holder of mr?

 The memory subsystem while running a memory region handler. If that will
 be a reference counter or a per-region lock like Avi suggested, we still
 need to find out.


 If we make the refcount/lock internal to the region, we must remove the
 opaque, since the region won't protect it.

 Replacing the opaque with container_of(mr) doesn't help, since we can't
 refcount mr, only mr-impl.

I think you mean if using MemoryRegionImpl, then at this level, we had
better not touch the refcnt of container_of(mr) and should face the
mr-impl-refcnt. Right?

 We could externalize the refcounting and push it into device code.  This
 means:

memory_region_init_io(s-mem, dev)

...

object_ref(dev)
memory_region_add_subregion(..., dev-mr)

...

memory_region_del_subregion(..., dev-mr)  // implied flush
object_unref(dev)

I think object_ref(dev) just another style to push
MemoryRegionOps::object() to device level.  And I think we can bypass
it.   The caller (unplug, pci-reconfig ) of
memory_region_del_subregion() ensure the @dev is valid.
If the implied flush is implemented in synchronize,  _del_subregion()
will guarantee no reader for @dev-mr and @dev exist any longer. So I
think we can save both object_ref/unref(dev) for memory system.
The only problem is that whether we can implement it as synchronous or
not,  is it possible that we can launch a  _del_subregion(mr-X) in
mr-X's dispatcher?

Regards,
pingfan

 er, this must be wrong, since it's so simple

 --
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-03 Thread Avi Kivity
On 09/03/2012 10:44 AM, liu ping fan wrote:


 If we make the refcount/lock internal to the region, we must remove the
 opaque, since the region won't protect it.

 Replacing the opaque with container_of(mr) doesn't help, since we can't
 refcount mr, only mr-impl.

 I think you mean if using MemoryRegionImpl, then at this level, we had
 better not touch the refcnt of container_of(mr) and should face the
 mr-impl-refcnt. Right?

I did not understand the second part, sorry.

 We could externalize the refcounting and push it into device code.  This
 means:

memory_region_init_io(s-mem, dev)

...

object_ref(dev)
memory_region_add_subregion(..., dev-mr)

...

memory_region_del_subregion(..., dev-mr)  // implied flush
object_unref(dev)

 I think object_ref(dev) just another style to push
 MemoryRegionOps::object() to device level.  And I think we can bypass
 it.   The caller (unplug, pci-reconfig ) of
 memory_region_del_subregion() ensure the @dev is valid.
 If the implied flush is implemented in synchronize,  _del_subregion()
 will guarantee no reader for @dev-mr and @dev exist any longer. 

The above code has a deadlock.  memory_region_del_subregion() may be
called under the device lock (since it may be the result of mmio to the
device), and if the flush uses synchronized_rcu(), it will wait forever
for the read-side critical section to complete.

 So I
 think we can save both object_ref/unref(dev) for memory system.
 The only problem is that whether we can implement it as synchronous or
 not,  is it possible that we can launch a  _del_subregion(mr-X) in
 mr-X's dispatcher?

Yes.  Real cases exist.

What alternatives remain?

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-03 Thread Avi Kivity
On 08/29/2012 08:41 PM, Jan Kiszka wrote:

 memory_region_transaction_commit (or calls that trigger this) will have
 to wait. That is required as the caller may start fiddling with the
 region right afterward.
 
 However, it may be running within an mmio dispatch, so it may wait forever.
 
 We won't be able to wait for devices that can acquire the same lock we
 hold while waiting, of course. That's why a lock ordering device-lock -
 BQL (the one we hold over memory region changes) won't be allowed. I was
 wrong on this before: We must not take course-grained locks while
 holding fine-grained ones, only the other way around. Pretty obvious,
 specifically for realtime / low-latency.

So how do we wait?

Detecting that we're in the same thread and allowing reentrancy in that
case in the option, but I can't say I like it much.

 
 Ignoring this, how does it wait? sleep()? or wait for a completion?
 
 Ideally via completion.

Then you have to manage the life cycle of the completion object.

 
 

 Usually a reference count controls the lifetime of the reference counted
 object, what are you suggesting here?

 To synchronize on reference count going to zero. 
 
 Quite unorthodox.  Do you have real life examples?
 
 synchronize_rcu.

This whole thing started because synchronize_rcu() will deadlock if
called from a read-side critical section.  The fix was taking the
reference, so that the read-side critical section is confined to the
lookup, not the call.  This way the lock ordering is always device-map.

We could schedule a bh or a thread and call synchronize_rcu() from
there, but then the commit is deferred to an unknown time, whereas we
need to guarantee that by the time the guest is reentered, the
transaction is committed.

 
 
 Or all readers leaving
 the read-side critical sections.
 
 This is rcu.  But again, we may be in an rcu read-side critical section
 while needing to wait.  In fact this was what I suggested in the first
 place, until Marcelo pointed out the deadlock, so we came up with the
 refcount.
 
 We must avoid that deadlock scenario. I bended my brain around it, and I
 think that is the only answer. We can if we apply clear rules regarding
 locking and BQL-protected services to those devices that will actually
 use fine-grained locking, and there only for those paths that take
 per-device locks. I'm pretty sure we won't get far with an
 all-or-nothing model where every device uses a private lock.

An all-or-nothing model is not proposed.  Unconverted devices will take
the BQL instead of the device lock (hopefully the BQL will be taken for
them during dispatch, so no code changes will be needed).

 
 I don't follow.  Synchronous transactions mean you can't
 synchronize_rcu() or upgrade the lock or wait for the refcount.  That's
 the source of the problem!
 
 Our current device models assume synchronous transactions on the memory
 layer, actually on all layers. The proposals I saw so far try to change
 this. But I'm very skeptical that this would succeed, due to the
 conversion effort and due to the fact that it would give us a tricky  to
 use API.

In what way do transactions become asynchronous?

It's true that io callbacks can be called concurrently, but as soon as
they take the device lock, they are serialized.  The only relaxation is
that a callback may be called after a region has been disabled or
removed from view.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-03 Thread liu ping fan
On Mon, Sep 3, 2012 at 4:52 PM, Avi Kivity a...@redhat.com wrote:
 On 09/03/2012 10:44 AM, liu ping fan wrote:


 If we make the refcount/lock internal to the region, we must remove the
 opaque, since the region won't protect it.

 Replacing the opaque with container_of(mr) doesn't help, since we can't
 refcount mr, only mr-impl.

 I think you mean if using MemoryRegionImpl, then at this level, we had
 better not touch the refcnt of container_of(mr) and should face the
 mr-impl-refcnt. Right?

 I did not understand the second part, sorry.

My understanding of Replacing the opaque with container_of(mr)
doesn't help, since we can't  refcount mr, only
mr-impl. is that although Object_ref(container_of(mr)) can help us
to protect it from disappearing. But apparently it is not right place
to do it it in memory core.   Do I catch you meaning?

 We could externalize the refcounting and push it into device code.  This
 means:

memory_region_init_io(s-mem, dev)

...

object_ref(dev)
memory_region_add_subregion(..., dev-mr)

...

memory_region_del_subregion(..., dev-mr)  // implied flush
object_unref(dev)

 I think object_ref(dev) just another style to push
 MemoryRegionOps::object() to device level.  And I think we can bypass
 it.   The caller (unplug, pci-reconfig ) of
 memory_region_del_subregion() ensure the @dev is valid.
 If the implied flush is implemented in synchronize,  _del_subregion()
 will guarantee no reader for @dev-mr and @dev exist any longer.

 The above code has a deadlock.  memory_region_del_subregion() may be
 called under the device lock (since it may be the result of mmio to the
 device), and if the flush uses synchronized_rcu(), it will wait forever
 for the read-side critical section to complete.

But if _del_subregion() just wait for mr-X quiescent period, while
calling in mr-Y's read side critical section, then we can avoid
deadlock.  I saw in pci-mapping, we delete mr-X in mr-Y read side.

 So I
 think we can save both object_ref/unref(dev) for memory system.
 The only problem is that whether we can implement it as synchronous or
 not,  is it possible that we can launch a  _del_subregion(mr-X) in
 mr-X's dispatcher?

 Yes.  Real cases exist.

Oh, I find the sample code, then, the deadlock is unavoidable in this method.

 What alternatives remain?

I think a way out may be async+refcnt

Regards,
pingfan
 --
 error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-03 Thread Avi Kivity
On 09/03/2012 01:06 PM, liu ping fan wrote:
 On Mon, Sep 3, 2012 at 4:52 PM, Avi Kivity a...@redhat.com wrote:
 On 09/03/2012 10:44 AM, liu ping fan wrote:


 If we make the refcount/lock internal to the region, we must remove the
 opaque, since the region won't protect it.

 Replacing the opaque with container_of(mr) doesn't help, since we can't
 refcount mr, only mr-impl.

 I think you mean if using MemoryRegionImpl, then at this level, we had
 better not touch the refcnt of container_of(mr) and should face the
 mr-impl-refcnt. Right?

 I did not understand the second part, sorry.

 My understanding of Replacing the opaque with container_of(mr)
 doesn't help, since we can't  refcount mr, only
 mr-impl. is that although Object_ref(container_of(mr)) can help us
 to protect it from disappearing. But apparently it is not right place
 to do it it in memory core.   Do I catch you meaning?

We cannot call container_of(mr) in the memory core, because the
parameters to container_of() are not known.  But that is not the main issue.

Something we can do is make MemoryRegionOps::object() take a mr
parameter instead of opaque.  MemoryRegionOps::object() then uses
container_of() to derive the object to ref.

(works with MemoryRegionOps::ref()/MemoryRegionOps::unref() too; Jan,
this decouples Object from MemoryRegion at the cost of extra boilerplate
in client code, but it may be worthwhile as a temporary measure while we
gain more experience with this)

 
 We could externalize the refcounting and push it into device code.  This
 means:

memory_region_init_io(s-mem, dev)

...

object_ref(dev)
memory_region_add_subregion(..., dev-mr)

...

memory_region_del_subregion(..., dev-mr)  // implied flush
object_unref(dev)

 I think object_ref(dev) just another style to push
 MemoryRegionOps::object() to device level.  And I think we can bypass
 it.   The caller (unplug, pci-reconfig ) of
 memory_region_del_subregion() ensure the @dev is valid.
 If the implied flush is implemented in synchronize,  _del_subregion()
 will guarantee no reader for @dev-mr and @dev exist any longer.

 The above code has a deadlock.  memory_region_del_subregion() may be
 called under the device lock (since it may be the result of mmio to the
 device), and if the flush uses synchronized_rcu(), it will wait forever
 for the read-side critical section to complete.

 But if _del_subregion() just wait for mr-X quiescent period, while
 calling in mr-Y's read side critical section, then we can avoid
 deadlock.  I saw in pci-mapping, we delete mr-X in mr-Y read side.

Look at cirrus-vga.c, there are many calls to the memory API there.
While I doubt that we have one region disabling itself (or a subregion
of itself), that's all code that would be run under the same device
lock, and so would deadlock.


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-03 Thread liu ping fan
On Mon, Sep 3, 2012 at 6:16 PM, Avi Kivity a...@redhat.com wrote:
 On 09/03/2012 01:06 PM, liu ping fan wrote:
 On Mon, Sep 3, 2012 at 4:52 PM, Avi Kivity a...@redhat.com wrote:
 On 09/03/2012 10:44 AM, liu ping fan wrote:


 If we make the refcount/lock internal to the region, we must remove the
 opaque, since the region won't protect it.

 Replacing the opaque with container_of(mr) doesn't help, since we can't
 refcount mr, only mr-impl.

 I think you mean if using MemoryRegionImpl, then at this level, we had
 better not touch the refcnt of container_of(mr) and should face the
 mr-impl-refcnt. Right?

 I did not understand the second part, sorry.

 My understanding of Replacing the opaque with container_of(mr)
 doesn't help, since we can't  refcount mr, only
 mr-impl. is that although Object_ref(container_of(mr)) can help us
 to protect it from disappearing. But apparently it is not right place
 to do it it in memory core.   Do I catch you meaning?

 We cannot call container_of(mr) in the memory core, because the
 parameters to container_of() are not known.  But that is not the main issue.

 Something we can do is make MemoryRegionOps::object() take a mr
 parameter instead of opaque.  MemoryRegionOps::object() then uses
 container_of() to derive the object to ref.

 (works with MemoryRegionOps::ref()/MemoryRegionOps::unref() too; Jan,
 this decouples Object from MemoryRegion at the cost of extra boilerplate
 in client code, but it may be worthwhile as a temporary measure while we
 gain more experience with this)

Exactly catch your meaning, thanks.

 We could externalize the refcounting and push it into device code.  This
 means:

memory_region_init_io(s-mem, dev)

...

object_ref(dev)
memory_region_add_subregion(..., dev-mr)

...

memory_region_del_subregion(..., dev-mr)  // implied flush
object_unref(dev)

 I think object_ref(dev) just another style to push
 MemoryRegionOps::object() to device level.  And I think we can bypass
 it.   The caller (unplug, pci-reconfig ) of
 memory_region_del_subregion() ensure the @dev is valid.
 If the implied flush is implemented in synchronize,  _del_subregion()
 will guarantee no reader for @dev-mr and @dev exist any longer.

 The above code has a deadlock.  memory_region_del_subregion() may be
 called under the device lock (since it may be the result of mmio to the
 device), and if the flush uses synchronized_rcu(), it will wait forever
 for the read-side critical section to complete.

 But if _del_subregion() just wait for mr-X quiescent period, while
 calling in mr-Y's read side critical section, then we can avoid
 deadlock.  I saw in pci-mapping, we delete mr-X in mr-Y read side.

 Look at cirrus-vga.c, there are many calls to the memory API there.
 While I doubt that we have one region disabling itself (or a subregion
 of itself), that's all code that would be run under the same device
 lock, and so would deadlock.

Oh, yes, quiescent time will never come since reader wait for the
lock! Got it, thanks.

pingfan

 --
 error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-03 Thread liu ping fan
On Mon, Sep 3, 2012 at 6:06 PM, liu ping fan qemul...@gmail.com wrote:
 On Mon, Sep 3, 2012 at 4:52 PM, Avi Kivity a...@redhat.com wrote:
 On 09/03/2012 10:44 AM, liu ping fan wrote:


 If we make the refcount/lock internal to the region, we must remove the
 opaque, since the region won't protect it.

 Replacing the opaque with container_of(mr) doesn't help, since we can't
 refcount mr, only mr-impl.

 I think you mean if using MemoryRegionImpl, then at this level, we had
 better not touch the refcnt of container_of(mr) and should face the
 mr-impl-refcnt. Right?

 I did not understand the second part, sorry.

 My understanding of Replacing the opaque with container_of(mr)
 doesn't help, since we can't  refcount mr, only
 mr-impl. is that although Object_ref(container_of(mr)) can help us
 to protect it from disappearing. But apparently it is not right place
 to do it it in memory core.   Do I catch you meaning?

 We could externalize the refcounting and push it into device code.  This
 means:

memory_region_init_io(s-mem, dev)

...

object_ref(dev)
memory_region_add_subregion(..., dev-mr)

...

memory_region_del_subregion(..., dev-mr)  // implied flush
object_unref(dev)

 I think object_ref(dev) just another style to push
 MemoryRegionOps::object() to device level.  And I think we can bypass
 it.   The caller (unplug, pci-reconfig ) of
 memory_region_del_subregion() ensure the @dev is valid.
 If the implied flush is implemented in synchronize,  _del_subregion()
 will guarantee no reader for @dev-mr and @dev exist any longer.

 The above code has a deadlock.  memory_region_del_subregion() may be
 called under the device lock (since it may be the result of mmio to the
 device), and if the flush uses synchronized_rcu(), it will wait forever
 for the read-side critical section to complete.

 But if _del_subregion() just wait for mr-X quiescent period, while
 calling in mr-Y's read side critical section, then we can avoid
 deadlock.  I saw in pci-mapping, we delete mr-X in mr-Y read side.

 So I
 think we can save both object_ref/unref(dev) for memory system.
 The only problem is that whether we can implement it as synchronous or
 not,  is it possible that we can launch a  _del_subregion(mr-X) in
 mr-X's dispatcher?

 Yes.  Real cases exist.

 Oh, I find the sample code, then, the deadlock is unavoidable in this method.

 What alternatives remain?

 I think a way out may be async+refcnt

If we consider the relationship of MemoryRegionImpl and device as the
one between file and file-private_data  in Linux. Then the creation
of impl will object_ref(dev) and when impl-ref=0, it will
object_unref(dev)
But this is an async model, for those client which need to know the
end of flush, we can adopt callback just like call_rcu().



 Regards,
 pingfan
 --
 error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-01 Thread Avi Kivity
On 08/29/2012 10:49 AM, Jan Kiszka wrote:
  
  Let's experiment with refcounting MemoryRegion.  We can move the entire
  contents of MemoryRegion to MemoryRegionImpl, add a reference count (to
  MemoryRegionImpl), and change MemoryRegion to contain a pointer to the
  refcounted MemoryRegionImpl:
  
  struct MemoryRegion {
  struct MemoryRegionImpl *impl;
  };
  
  struct MemoryRegionImpl {
  atomic int refs;
  ...
  };
  
  The memory core can then store MemoryRegion copies (with elevated
  refcounts) instead of pointers.  Devices can destroy MemoryRegions at
  any time, the implementation will not go away.  However, what of the
  opaque stored in MemoryRegionImpl?  It becomes a dangling pointer.
  
  One way out is to add a lock to MemoryRegionImpl.  Dispatch takes the
  lock, examines the -enabled member, and bails out if it is cleared. 
  The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock,
  clears -enabled,  releases the lock, and drops the reference.

 That means holding the MemoryRegionImpl lock across the handler call?

Blech.  As you said on the other side of this thread, we must not take a
coarse grained lock within a fine grained one, and
MemoryRegionImpl::lock is as fine as they get.

  
  The advantage to this scheme is that all changes are localized to the
  memory core, no need for a full sweep.  It is a little complicated, but
  we may be able to simplify it (or find another one).

 May work. We just need to detect if memory region tries to delete itself
 from its handler to prevent the deadlock.

Those types of hacks are fragile.  IMO it just demonstrates what I said
earlier (then tried to disprove with this): if we call an opaque's
method, we must refcount or otherwise lock that opaque.  No way around it.


  
 
 
  Besides
  MMIO and PIO dispatching, it will haunt us for file or event handlers,
  any kind of callbacks etc.
 
  Context A   Context B
  -   -
  object = lookup()
  deregister(object)
  modify(object) - invalid state
  ... use(object)
  modify(object) - valid state
  register(object)
 
  And with object I'm not talking about QOM but any data structure.
 
 
 
  Context B
  -
  rcu_read_lock()
  object = lookup()
  if (object) {
  ref(object)
  }
  rcu_read_unlock()
 
  use(object)
 
  unref(object)
 
  (substitute locking scheme to conform to taste and/or dietary
  restrictions)   
 
 
  + wait for refcount(object) == 0 in deregister(object). That's what I'm
  proposing.
  
  Consider timer_del() called from a timer callback.  It's often not doable.

 This is inherently synchronous already (when working against the same
 alarm timer backend). We can detect this in timer_del and skip waiting,
 as in the above scenario.

It can always be broken.  The timer callback takes the device lock to
update the device.  The device mmio handler, holding the device lock,
takes the timer lock to timer_mod.  Deadlock.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-01 Thread Avi Kivity
On 08/30/2012 12:08 AM, Jan Kiszka wrote:
 
  We are dispatching according to the memory region (parameters, op
  handlers, opaques). If we end up in device object is not decided at this
  level. A memory region describes a dispatchable area - not to confuse
  with a device that may only partially be able to receive such requests.
 
  But I think the meaning of the memory region is for dispatching. If no
  dispatching associated with mr, why need it exist in the system?

 Where did I say that memory regions should no longer be used for
 dispatching? The point is to keep the clean layer separation between
 memory regions and device objects instead of merging them together.

Believe me, that's exactly how I feel.  I guess no author of a module
wants to see it swallowed by a larger framework and see its design
completely changed.  Modules should be decoupled as much as possible. 
But I don't see a better way yet.


 Given

Object
^^
||
 Region 1Region 2

 you protect the object during dispatch, implicitly (and that is bad)
 requiring that no region must change in that period.

We only protect the object against removal.  After object_ref() has run,
the region may still be disabled.

  I say what rather
 needs protection are the regions so that Region 2 can pass away and
 maybe reappear independent of Region 1.

Region 2 can go away until the device-supplied dispatch code takes the
device lock.  If the device chooses to use finer-grained locking, it can
allow region 2 (or even region 1) to change during dispatch.  The only
requirement is that region 1 is not destroyed.

  And: I won't need to know the
 type of that object the regions are referring to in this model. That's
 the difference.

Sorry, I lost the reference.  What model?  Are you referring to my
broken MemoryRegionImpl split?

   And
  could you elaborate that who will be the ref holder of mr?

 The memory subsystem while running a memory region handler. If that will
 be a reference counter or a per-region lock like Avi suggested, we still
 need to find out.


If we make the refcount/lock internal to the region, we must remove the
opaque, since the region won't protect it.

Replacing the opaque with container_of(mr) doesn't help, since we can't
refcount mr, only mr-impl.

We could externalize the refcounting and push it into device code.  This
means:

   memory_region_init_io(s-mem, dev)

   ...

   object_ref(dev)
   memory_region_add_subregion(..., dev-mr)

   ...

   memory_region_del_subregion(..., dev-mr)  // implied flush
   object_unref(dev)

er, this must be wrong, since it's so simple

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-01 Thread Jan Kiszka
On 2012-09-01 10:31, Avi Kivity wrote:
 On 08/29/2012 10:49 AM, Jan Kiszka wrote:

 Let's experiment with refcounting MemoryRegion.  We can move the entire
 contents of MemoryRegion to MemoryRegionImpl, add a reference count (to
 MemoryRegionImpl), and change MemoryRegion to contain a pointer to the
 refcounted MemoryRegionImpl:

 struct MemoryRegion {
 struct MemoryRegionImpl *impl;
 };

 struct MemoryRegionImpl {
 atomic int refs;
 ...
 };

 The memory core can then store MemoryRegion copies (with elevated
 refcounts) instead of pointers.  Devices can destroy MemoryRegions at
 any time, the implementation will not go away.  However, what of the
 opaque stored in MemoryRegionImpl?  It becomes a dangling pointer.

 One way out is to add a lock to MemoryRegionImpl.  Dispatch takes the
 lock, examines the -enabled member, and bails out if it is cleared. 
 The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock,
 clears -enabled,  releases the lock, and drops the reference.

 That means holding the MemoryRegionImpl lock across the handler call?
 
 Blech.  As you said on the other side of this thread, we must not take a
 coarse grained lock within a fine grained one, and
 MemoryRegionImpl::lock is as fine as they get.

Not sure what you compare here. MemoryRegionImpl::lock would be per
memory region, so with finer scope than the BQL but with similar scope
like a per-device lock.

 

 The advantage to this scheme is that all changes are localized to the
 memory core, no need for a full sweep.  It is a little complicated, but
 we may be able to simplify it (or find another one).

 May work. We just need to detect if memory region tries to delete itself
 from its handler to prevent the deadlock.
 
 Those types of hacks are fragile.  IMO it just demonstrates what I said
 earlier (then tried to disprove with this): if we call an opaque's
 method, we must refcount or otherwise lock that opaque.  No way around it.

But that still doesn't solve the problem that we need to lock down the
*state* of the opaque during dispatch /wrt to memory region changes.
Just ensuring its existence is not enough unless we declare memory
region transactions to be asynchronous - and adapt all users.

 




 Besides
 MMIO and PIO dispatching, it will haunt us for file or event handlers,
 any kind of callbacks etc.

 Context A   Context B
 -   -
 object = lookup()
 deregister(object)
 modify(object) - invalid state
 ... use(object)
 modify(object) - valid state
 register(object)

 And with object I'm not talking about QOM but any data structure.



 Context B
 -
 rcu_read_lock()
 object = lookup()
 if (object) {
 ref(object)
 }
 rcu_read_unlock()

 use(object)

 unref(object)

 (substitute locking scheme to conform to taste and/or dietary
 restrictions)   


 + wait for refcount(object) == 0 in deregister(object). That's what I'm
 proposing.

 Consider timer_del() called from a timer callback.  It's often not doable.

 This is inherently synchronous already (when working against the same
 alarm timer backend). We can detect this in timer_del and skip waiting,
 as in the above scenario.
 
 It can always be broken.  The timer callback takes the device lock to
 update the device.  The device mmio handler, holding the device lock,
 takes the timer lock to timer_mod.  Deadlock.

Well, how is this solved in Linux? By waiting on the callback in
hrtimer_cancel. Not by wait_on_magic_opaque (well, there is even no
opaque in the hrtimer API).

Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-09-01 Thread Avi Kivity
On 09/01/2012 01:57 AM, Jan Kiszka wrote:
 On 2012-09-01 10:31, Avi Kivity wrote:
  On 08/29/2012 10:49 AM, Jan Kiszka wrote:
 
  Let's experiment with refcounting MemoryRegion.  We can move the entire
  contents of MemoryRegion to MemoryRegionImpl, add a reference count (to
  MemoryRegionImpl), and change MemoryRegion to contain a pointer to the
  refcounted MemoryRegionImpl:
 
  struct MemoryRegion {
  struct MemoryRegionImpl *impl;
  };
 
  struct MemoryRegionImpl {
  atomic int refs;
  ...
  };
 
  The memory core can then store MemoryRegion copies (with elevated
  refcounts) instead of pointers.  Devices can destroy MemoryRegions at
  any time, the implementation will not go away.  However, what of the
  opaque stored in MemoryRegionImpl?  It becomes a dangling pointer.
 
  One way out is to add a lock to MemoryRegionImpl.  Dispatch takes the
  lock, examines the -enabled member, and bails out if it is cleared. 
  The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock,
  clears -enabled,  releases the lock, and drops the reference.
 
  That means holding the MemoryRegionImpl lock across the handler call?
  
  Blech.  As you said on the other side of this thread, we must not take a
  coarse grained lock within a fine grained one, and
  MemoryRegionImpl::lock is as fine as they get.

 Not sure what you compare here. MemoryRegionImpl::lock would be per
 memory region, so with finer scope than the BQL but with similar scope
 like a per-device lock.

The dispatch may acquire the bql (in fact most will, unless we convert
everything).  A design that doesn't allow this is broken.  Even the
device lock is more coarse grained (since it covers all regions) and is
taken in a deadlock scenario as region manipulation will be done under
the device lock.  You say we can detect this, but I dislike it.

  
 
  The advantage to this scheme is that all changes are localized to the
  memory core, no need for a full sweep.  It is a little complicated, but
  we may be able to simplify it (or find another one).
 
  May work. We just need to detect if memory region tries to delete itself
  from its handler to prevent the deadlock.
  
  Those types of hacks are fragile.  IMO it just demonstrates what I said
  earlier (then tried to disprove with this): if we call an opaque's
  method, we must refcount or otherwise lock that opaque.  No way around it.

 But that still doesn't solve the problem that we need to lock down the
 *state* of the opaque during dispatch /wrt to memory region changes.
 Just ensuring its existence is not enough unless we declare memory
 region transactions to be asynchronous - and adapt all users.

I expect no users will need change.

Changing the region offset has no effect on dispatch (in fact the region
itself doesn't change).  Changing subregions is fine.  Disabling a
region concurrently with access is the only potential problem, but this
is rare, and I expect it to just work.  We will have to audit all users,
yes.

  + wait for refcount(object) == 0 in deregister(object). That's what I'm
  proposing.
 
  Consider timer_del() called from a timer callback.  It's often not doable.
 
  This is inherently synchronous already (when working against the same
  alarm timer backend). We can detect this in timer_del and skip waiting,
  as in the above scenario.
  
  It can always be broken.  The timer callback takes the device lock to
  update the device.  The device mmio handler, holding the device lock,
  takes the timer lock to timer_mod.  Deadlock.

 Well, how is this solved in Linux? By waiting on the callback in
 hrtimer_cancel. Not by wait_on_magic_opaque (well, there is even no
 opaque in the hrtimer API).

I don't consider that busy loop an elegant solution.

The original proposal mirrors dcache.  You can perform operations on a
dentry even after it is unlinked.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-30 Thread Jan Kiszka
On 2012-08-30 07:54, liu ping fan wrote:
 On Thu, Aug 30, 2012 at 1:40 AM, Avi Kivity a...@redhat.com wrote:
 On 08/29/2012 10:30 AM, Jan Kiszka wrote:
 On 2012-08-29 19:23, Avi Kivity wrote:
 On 08/28/2012 02:42 AM, Jan Kiszka wrote:

 Let's not talk about devices or MMIO dispatching. I think the problem is
 way more generic, and we will face it multiple times in QEMU.

 The problem exists outside qemu as well.  It is one of the reasons for
 the popularity of garbage collection IMO, and the use of reference
 counting when GC is not available.

 This pattern is even documented in
 Documentation/DocBook/kernel-locking.tmpl:

 @@ -104,12 +114,11 @@
  struct object *cache_find(int id)
  {
  struct object *obj;
 -unsigned long flags;

 -spin_lock_irqsave(amp;cache_lock, flags);
 +rcu_read_lock();
  obj = __cache_find(id);
  if (obj)
  object_get(obj);
 -spin_unlock_irqrestore(amp;cache_lock, flags);
 +rcu_read_unlock();


 Of course that doesn't mean we should use it, but at least it indicates
 that it is a standard pattern.  With MemoryRegion the pattern is
 changed, since MemoryRegion is a thunk, not the object we're really
 dispatching to.

 We are dispatching according to the memory region (parameters, op
 handlers, opaques). If we end up in device object is not decided at this
 level. A memory region describes a dispatchable area - not to confuse
 with a device that may only partially be able to receive such requests.

 But I think the meaning of the memory region is for dispatching. If no
 dispatching associated with mr, why need it exist in the system?

Where did I say that memory regions should no longer be used for
dispatching? The point is to keep the clean layer separation between
memory regions and device objects instead of merging them together.

Given

   Object
   ^^
   ||
Region 1Region 2

you protect the object during dispatch, implicitly (and that is bad)
requiring that no region must change in that period. I say what rather
needs protection are the regions so that Region 2 can pass away and
maybe reappear independent of Region 1. And: I won't need to know the
type of that object the regions are referring to in this model. That's
the difference.

  And
 could you elaborate that who will be the ref holder of mr?

The memory subsystem while running a memory region handler. If that will
be a reference counter or a per-region lock like Avi suggested, we still
need to find out.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-30 Thread liu ping fan
On Thu, Aug 30, 2012 at 3:08 PM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2012-08-30 07:54, liu ping fan wrote:
 On Thu, Aug 30, 2012 at 1:40 AM, Avi Kivity a...@redhat.com wrote:
 On 08/29/2012 10:30 AM, Jan Kiszka wrote:
 On 2012-08-29 19:23, Avi Kivity wrote:
 On 08/28/2012 02:42 AM, Jan Kiszka wrote:

 Let's not talk about devices or MMIO dispatching. I think the problem is
 way more generic, and we will face it multiple times in QEMU.

 The problem exists outside qemu as well.  It is one of the reasons for
 the popularity of garbage collection IMO, and the use of reference
 counting when GC is not available.

 This pattern is even documented in
 Documentation/DocBook/kernel-locking.tmpl:

 @@ -104,12 +114,11 @@
  struct object *cache_find(int id)
  {
  struct object *obj;
 -unsigned long flags;

 -spin_lock_irqsave(amp;cache_lock, flags);
 +rcu_read_lock();
  obj = __cache_find(id);
  if (obj)
  object_get(obj);
 -spin_unlock_irqrestore(amp;cache_lock, flags);
 +rcu_read_unlock();


 Of course that doesn't mean we should use it, but at least it indicates
 that it is a standard pattern.  With MemoryRegion the pattern is
 changed, since MemoryRegion is a thunk, not the object we're really
 dispatching to.

 We are dispatching according to the memory region (parameters, op
 handlers, opaques). If we end up in device object is not decided at this
 level. A memory region describes a dispatchable area - not to confuse
 with a device that may only partially be able to receive such requests.

 But I think the meaning of the memory region is for dispatching. If no
 dispatching associated with mr, why need it exist in the system?

 Where did I say that memory regions should no longer be used for
 dispatching? The point is to keep the clean layer separation between
 memory regions and device objects instead of merging them together.

 Given

Object
^^
||
 Region 1Region 2

 you protect the object during dispatch, implicitly (and that is bad)
 requiring that no region must change in that period. I say what rather
 needs protection are the regions so that Region 2 can pass away and
 maybe reappear independent of Region 1. And: I won't need to know the

OK, I see, this is a strong reason.

Regards,
pingfan

 type of that object the regions are referring to in this model. That's
 the difference.

  And
 could you elaborate that who will be the ref holder of mr?

 The memory subsystem while running a memory region handler. If that will
 be a reference counter or a per-region lock like Avi suggested, we still
 need to find out.

 Jan

 --
 Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
 Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-29 Thread Avi Kivity
On 08/27/2012 06:01 PM, Jan Kiszka wrote:
 On 2012-08-27 22:53, Avi Kivity wrote:
  On 08/27/2012 12:38 PM, Jan Kiszka wrote:
  Even worse, apply
  restrictions on how the dispatched objects, the regions, have to be
  treated because of this.
 
  Please elaborate.
 
  The fact that you can't manipulate a memory region object arbitrarily
  after removing it from the mapping because you track the references to
  the object that the region points to, not the region itself. The region
  remains in use by the dispatching layer and potentially the called
  device, even after deregistration.
  
  That object will be a container_of() the region, usually literally but
  sometimes only in spirit.  Reference counting the regions means they
  cannot be embedded into other objects any more.

 I cannot follow the logic of the last sentence. Reference counting just
 means that we track if there are users left, not necessarily that we
 perform asynchronous releases. We can simply wait for those users to
 complete.

I don't see how.  Suppose you add a reference count to MemoryRegion. 
How do you delay its containing object's destructor from running?  Do
you iterate over all member MemoryRegion and examine their reference counts?

Usually a reference count controls the lifetime of the reference counted
object, what are you suggesting here?

  
  We can probably figure out a way to flush out accesses.  After switching
  to rcu, for example, all we need is synchronize_rcu() in a
  non-deadlocking place.  But my bet is that it will not be needed.

 If you properly flush out accesses, you don't need to track at device
 level anymore - and mess with abstraction layers. That's my whole point.

To flush out an access you need either rwlock_write_lock() or
synchronize_rcu() (depending on the implementation).  But neither of
these can be run from an rcu read-side critical section or
rwlock_read_lock().

You could defer the change to a bottom half, but if the hardware demands
that the change be complete before returning, that doesn't work.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-29 Thread Jan Kiszka
On 2012-08-29 19:13, Avi Kivity wrote:
 On 08/27/2012 06:01 PM, Jan Kiszka wrote:
 On 2012-08-27 22:53, Avi Kivity wrote:
 On 08/27/2012 12:38 PM, Jan Kiszka wrote:
 Even worse, apply
 restrictions on how the dispatched objects, the regions, have to be
 treated because of this.

 Please elaborate.

 The fact that you can't manipulate a memory region object arbitrarily
 after removing it from the mapping because you track the references to
 the object that the region points to, not the region itself. The region
 remains in use by the dispatching layer and potentially the called
 device, even after deregistration.

 That object will be a container_of() the region, usually literally but
 sometimes only in spirit.  Reference counting the regions means they
 cannot be embedded into other objects any more.

 I cannot follow the logic of the last sentence. Reference counting just
 means that we track if there are users left, not necessarily that we
 perform asynchronous releases. We can simply wait for those users to
 complete.
 
 I don't see how.  Suppose you add a reference count to MemoryRegion. 
 How do you delay its containing object's destructor from running?  Do
 you iterate over all member MemoryRegion and examine their reference counts?

memory_region_transaction_commit (or calls that trigger this) will have
to wait. That is required as the caller may start fiddling with the
region right afterward.

 
 Usually a reference count controls the lifetime of the reference counted
 object, what are you suggesting here?

To synchronize on reference count going to zero. Or all readers leaving
the read-side critical sections.

 

 We can probably figure out a way to flush out accesses.  After switching
 to rcu, for example, all we need is synchronize_rcu() in a
 non-deadlocking place.  But my bet is that it will not be needed.

 If you properly flush out accesses, you don't need to track at device
 level anymore - and mess with abstraction layers. That's my whole point.
 
 To flush out an access you need either rwlock_write_lock() or
 synchronize_rcu() (depending on the implementation).  But neither of
 these can be run from an rcu read-side critical section or
 rwlock_read_lock().
 
 You could defer the change to a bottom half, but if the hardware demands
 that the change be complete before returning, that doesn't work.

Right, therefore synchronous transactions.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-29 Thread Avi Kivity
On 08/28/2012 02:42 AM, Jan Kiszka wrote:

 Let's not talk about devices or MMIO dispatching. I think the problem is
 way more generic, and we will face it multiple times in QEMU. 

The problem exists outside qemu as well.  It is one of the reasons for
the popularity of garbage collection IMO, and the use of reference
counting when GC is not available.

This pattern is even documented in
Documentation/DocBook/kernel-locking.tmpl:

@@ -104,12 +114,11 @@
 struct object *cache_find(int id)
 {
 struct object *obj;
-unsigned long flags;

-spin_lock_irqsave(amp;cache_lock, flags);
+rcu_read_lock();
 obj = __cache_find(id);
 if (obj)
 object_get(obj);
-spin_unlock_irqrestore(amp;cache_lock, flags);
+rcu_read_unlock();
  

Of course that doesn't mean we should use it, but at least it indicates
that it is a standard pattern.  With MemoryRegion the pattern is
changed, since MemoryRegion is a thunk, not the object we're really
dispatching to.

 Besides
 MMIO and PIO dispatching, it will haunt us for file or event handlers,
 any kind of callbacks etc.

 Context A   Context B
 -   -
 object = lookup()
 deregister(object)
 modify(object) - invalid state
 ... use(object)
 modify(object) - valid state
 register(object)

 And with object I'm not talking about QOM but any data structure.



Context B
-
rcu_read_lock()
object = lookup()
if (object) {
ref(object)
}
rcu_read_unlock()

use(object)

unref(object)

(substitute locking scheme to conform to taste and/or dietary
restrictions)   

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-29 Thread Jan Kiszka
On 2012-08-29 19:23, Avi Kivity wrote:
 On 08/28/2012 02:42 AM, Jan Kiszka wrote:

 Let's not talk about devices or MMIO dispatching. I think the problem is
 way more generic, and we will face it multiple times in QEMU. 
 
 The problem exists outside qemu as well.  It is one of the reasons for
 the popularity of garbage collection IMO, and the use of reference
 counting when GC is not available.
 
 This pattern is even documented in
 Documentation/DocBook/kernel-locking.tmpl:
 
 @@ -104,12 +114,11 @@
  struct object *cache_find(int id)
  {
  struct object *obj;
 -unsigned long flags;
 
 -spin_lock_irqsave(amp;cache_lock, flags);
 +rcu_read_lock();
  obj = __cache_find(id);
  if (obj)
  object_get(obj);
 -spin_unlock_irqrestore(amp;cache_lock, flags);
 +rcu_read_unlock();
   
 
 Of course that doesn't mean we should use it, but at least it indicates
 that it is a standard pattern.  With MemoryRegion the pattern is
 changed, since MemoryRegion is a thunk, not the object we're really
 dispatching to.

We are dispatching according to the memory region (parameters, op
handlers, opaques). If we end up in device object is not decided at this
level. A memory region describes a dispatchable area - not to confuse
with a device that may only partially be able to receive such requests.

 
 Besides
 MMIO and PIO dispatching, it will haunt us for file or event handlers,
 any kind of callbacks etc.

 Context A   Context B
 -   -
 object = lookup()
 deregister(object)
 modify(object) - invalid state
 ... use(object)
 modify(object) - valid state
 register(object)

 And with object I'm not talking about QOM but any data structure.

 
 
 Context B
 -
 rcu_read_lock()
 object = lookup()
 if (object) {
 ref(object)
 }
 rcu_read_unlock()
 
 use(object)
 
 unref(object)
 
 (substitute locking scheme to conform to taste and/or dietary
 restrictions)   
 

+ wait for refcount(object) == 0 in deregister(object). That's what I'm
proposing.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-29 Thread Avi Kivity
On 08/29/2012 10:21 AM, Jan Kiszka wrote:
 On 2012-08-29 19:13, Avi Kivity wrote:
  On 08/27/2012 06:01 PM, Jan Kiszka wrote:
  On 2012-08-27 22:53, Avi Kivity wrote:
  On 08/27/2012 12:38 PM, Jan Kiszka wrote:
  Even worse, apply
  restrictions on how the dispatched objects, the regions, have to be
  treated because of this.
 
  Please elaborate.
 
  The fact that you can't manipulate a memory region object arbitrarily
  after removing it from the mapping because you track the references to
  the object that the region points to, not the region itself. The region
  remains in use by the dispatching layer and potentially the called
  device, even after deregistration.
 
  That object will be a container_of() the region, usually literally but
  sometimes only in spirit.  Reference counting the regions means they
  cannot be embedded into other objects any more.
 
  I cannot follow the logic of the last sentence. Reference counting just
  means that we track if there are users left, not necessarily that we
  perform asynchronous releases. We can simply wait for those users to
  complete.
  
  I don't see how.  Suppose you add a reference count to MemoryRegion. 
  How do you delay its containing object's destructor from running?  Do
  you iterate over all member MemoryRegion and examine their reference counts?

 memory_region_transaction_commit (or calls that trigger this) will have
 to wait. That is required as the caller may start fiddling with the
 region right afterward.

However, it may be running within an mmio dispatch, so it may wait forever.

Ignoring this, how does it wait? sleep()? or wait for a completion?

  
  Usually a reference count controls the lifetime of the reference counted
  object, what are you suggesting here?

 To synchronize on reference count going to zero. 

Quite unorthodox.  Do you have real life examples?

 Or all readers leaving
 the read-side critical sections.

This is rcu.  But again, we may be in an rcu read-side critical section
while needing to wait.  In fact this was what I suggested in the first
place, until Marcelo pointed out the deadlock, so we came up with the
refcount.


  
 
  We can probably figure out a way to flush out accesses.  After switching
  to rcu, for example, all we need is synchronize_rcu() in a
  non-deadlocking place.  But my bet is that it will not be needed.
 
  If you properly flush out accesses, you don't need to track at device
  level anymore - and mess with abstraction layers. That's my whole point.
  
  To flush out an access you need either rwlock_write_lock() or
  synchronize_rcu() (depending on the implementation).  But neither of
  these can be run from an rcu read-side critical section or
  rwlock_read_lock().
  
  You could defer the change to a bottom half, but if the hardware demands
  that the change be complete before returning, that doesn't work.

 Right, therefore synchronous transactions.

I don't follow.  Synchronous transactions mean you can't
synchronize_rcu() or upgrade the lock or wait for the refcount.  That's
the source of the problem!

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-29 Thread Avi Kivity
On 08/29/2012 10:30 AM, Jan Kiszka wrote:
 On 2012-08-29 19:23, Avi Kivity wrote:
  On 08/28/2012 02:42 AM, Jan Kiszka wrote:
 
  Let's not talk about devices or MMIO dispatching. I think the problem is
  way more generic, and we will face it multiple times in QEMU. 
  
  The problem exists outside qemu as well.  It is one of the reasons for
  the popularity of garbage collection IMO, and the use of reference
  counting when GC is not available.
  
  This pattern is even documented in
  Documentation/DocBook/kernel-locking.tmpl:
  
  @@ -104,12 +114,11 @@
   struct object *cache_find(int id)
   {
   struct object *obj;
  -unsigned long flags;
  
  -spin_lock_irqsave(amp;cache_lock, flags);
  +rcu_read_lock();
   obj = __cache_find(id);
   if (obj)
   object_get(obj);
  -spin_unlock_irqrestore(amp;cache_lock, flags);
  +rcu_read_unlock();

  
  Of course that doesn't mean we should use it, but at least it indicates
  that it is a standard pattern.  With MemoryRegion the pattern is
  changed, since MemoryRegion is a thunk, not the object we're really
  dispatching to.

 We are dispatching according to the memory region (parameters, op
 handlers, opaques). If we end up in device object is not decided at this
 level. A memory region describes a dispatchable area - not to confuse
 with a device that may only partially be able to receive such requests.

Correct.

Let's experiment with refcounting MemoryRegion.  We can move the entire
contents of MemoryRegion to MemoryRegionImpl, add a reference count (to
MemoryRegionImpl), and change MemoryRegion to contain a pointer to the
refcounted MemoryRegionImpl:

struct MemoryRegion {
struct MemoryRegionImpl *impl;
};

struct MemoryRegionImpl {
atomic int refs;
...
};

The memory core can then store MemoryRegion copies (with elevated
refcounts) instead of pointers.  Devices can destroy MemoryRegions at
any time, the implementation will not go away.  However, what of the
opaque stored in MemoryRegionImpl?  It becomes a dangling pointer.

One way out is to add a lock to MemoryRegionImpl.  Dispatch takes the
lock, examines the -enabled member, and bails out if it is cleared. 
The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock,
clears -enabled,  releases the lock, and drops the reference.

The advantage to this scheme is that all changes are localized to the
memory core, no need for a full sweep.  It is a little complicated, but
we may be able to simplify it (or find another one).


  
  Besides
  MMIO and PIO dispatching, it will haunt us for file or event handlers,
  any kind of callbacks etc.
 
  Context A   Context B
  -   -
  object = lookup()
  deregister(object)
  modify(object) - invalid state
  ... use(object)
  modify(object) - valid state
  register(object)
 
  And with object I'm not talking about QOM but any data structure.
 
  
  
  Context B
  -
  rcu_read_lock()
  object = lookup()
  if (object) {
  ref(object)
  }
  rcu_read_unlock()
  
  use(object)
  
  unref(object)
  
  (substitute locking scheme to conform to taste and/or dietary
  restrictions)   
  

 + wait for refcount(object) == 0 in deregister(object). That's what I'm
 proposing.

Consider timer_del() called from a timer callback.  It's often not doable.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-29 Thread Jan Kiszka
On 2012-08-29 19:27, Avi Kivity wrote:
 On 08/29/2012 10:21 AM, Jan Kiszka wrote:
 On 2012-08-29 19:13, Avi Kivity wrote:
 On 08/27/2012 06:01 PM, Jan Kiszka wrote:
 On 2012-08-27 22:53, Avi Kivity wrote:
 On 08/27/2012 12:38 PM, Jan Kiszka wrote:
 Even worse, apply
 restrictions on how the dispatched objects, the regions, have to be
 treated because of this.

 Please elaborate.

 The fact that you can't manipulate a memory region object arbitrarily
 after removing it from the mapping because you track the references to
 the object that the region points to, not the region itself. The region
 remains in use by the dispatching layer and potentially the called
 device, even after deregistration.

 That object will be a container_of() the region, usually literally but
 sometimes only in spirit.  Reference counting the regions means they
 cannot be embedded into other objects any more.

 I cannot follow the logic of the last sentence. Reference counting just
 means that we track if there are users left, not necessarily that we
 perform asynchronous releases. We can simply wait for those users to
 complete.

 I don't see how.  Suppose you add a reference count to MemoryRegion. 
 How do you delay its containing object's destructor from running?  Do
 you iterate over all member MemoryRegion and examine their reference counts?

 memory_region_transaction_commit (or calls that trigger this) will have
 to wait. That is required as the caller may start fiddling with the
 region right afterward.
 
 However, it may be running within an mmio dispatch, so it may wait forever.

We won't be able to wait for devices that can acquire the same lock we
hold while waiting, of course. That's why a lock ordering device-lock -
BQL (the one we hold over memory region changes) won't be allowed. I was
wrong on this before: We must not take course-grained locks while
holding fine-grained ones, only the other way around. Pretty obvious,
specifically for realtime / low-latency.

 
 Ignoring this, how does it wait? sleep()? or wait for a completion?

Ideally via completion.

 

 Usually a reference count controls the lifetime of the reference counted
 object, what are you suggesting here?

 To synchronize on reference count going to zero. 
 
 Quite unorthodox.  Do you have real life examples?

synchronize_rcu.

 
 Or all readers leaving
 the read-side critical sections.
 
 This is rcu.  But again, we may be in an rcu read-side critical section
 while needing to wait.  In fact this was what I suggested in the first
 place, until Marcelo pointed out the deadlock, so we came up with the
 refcount.

We must avoid that deadlock scenario. I bended my brain around it, and I
think that is the only answer. We can if we apply clear rules regarding
locking and BQL-protected services to those devices that will actually
use fine-grained locking, and there only for those paths that take
per-device locks. I'm pretty sure we won't get far with an
all-or-nothing model where every device uses a private lock.

 



 We can probably figure out a way to flush out accesses.  After switching
 to rcu, for example, all we need is synchronize_rcu() in a
 non-deadlocking place.  But my bet is that it will not be needed.

 If you properly flush out accesses, you don't need to track at device
 level anymore - and mess with abstraction layers. That's my whole point.

 To flush out an access you need either rwlock_write_lock() or
 synchronize_rcu() (depending on the implementation).  But neither of
 these can be run from an rcu read-side critical section or
 rwlock_read_lock().

 You could defer the change to a bottom half, but if the hardware demands
 that the change be complete before returning, that doesn't work.

 Right, therefore synchronous transactions.
 
 I don't follow.  Synchronous transactions mean you can't
 synchronize_rcu() or upgrade the lock or wait for the refcount.  That's
 the source of the problem!

Our current device models assume synchronous transactions on the memory
layer, actually on all layers. The proposals I saw so far try to change
this. But I'm very skeptical that this would succeed, due to the
conversion effort and due to the fact that it would give us a tricky  to
use API.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-29 Thread Jan Kiszka
On 2012-08-29 19:40, Avi Kivity wrote:
 On 08/29/2012 10:30 AM, Jan Kiszka wrote:
 On 2012-08-29 19:23, Avi Kivity wrote:
 On 08/28/2012 02:42 AM, Jan Kiszka wrote:

 Let's not talk about devices or MMIO dispatching. I think the problem is
 way more generic, and we will face it multiple times in QEMU. 

 The problem exists outside qemu as well.  It is one of the reasons for
 the popularity of garbage collection IMO, and the use of reference
 counting when GC is not available.

 This pattern is even documented in
 Documentation/DocBook/kernel-locking.tmpl:

 @@ -104,12 +114,11 @@
  struct object *cache_find(int id)
  {
  struct object *obj;
 -unsigned long flags;

 -spin_lock_irqsave(amp;cache_lock, flags);
 +rcu_read_lock();
  obj = __cache_find(id);
  if (obj)
  object_get(obj);
 -spin_unlock_irqrestore(amp;cache_lock, flags);
 +rcu_read_unlock();
   

 Of course that doesn't mean we should use it, but at least it indicates
 that it is a standard pattern.  With MemoryRegion the pattern is
 changed, since MemoryRegion is a thunk, not the object we're really
 dispatching to.

 We are dispatching according to the memory region (parameters, op
 handlers, opaques). If we end up in device object is not decided at this
 level. A memory region describes a dispatchable area - not to confuse
 with a device that may only partially be able to receive such requests.
 
 Correct.
 
 Let's experiment with refcounting MemoryRegion.  We can move the entire
 contents of MemoryRegion to MemoryRegionImpl, add a reference count (to
 MemoryRegionImpl), and change MemoryRegion to contain a pointer to the
 refcounted MemoryRegionImpl:
 
 struct MemoryRegion {
 struct MemoryRegionImpl *impl;
 };
 
 struct MemoryRegionImpl {
 atomic int refs;
 ...
 };
 
 The memory core can then store MemoryRegion copies (with elevated
 refcounts) instead of pointers.  Devices can destroy MemoryRegions at
 any time, the implementation will not go away.  However, what of the
 opaque stored in MemoryRegionImpl?  It becomes a dangling pointer.
 
 One way out is to add a lock to MemoryRegionImpl.  Dispatch takes the
 lock, examines the -enabled member, and bails out if it is cleared. 
 The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock,
 clears -enabled,  releases the lock, and drops the reference.

That means holding the MemoryRegionImpl lock across the handler call?

 
 The advantage to this scheme is that all changes are localized to the
 memory core, no need for a full sweep.  It is a little complicated, but
 we may be able to simplify it (or find another one).

May work. We just need to detect if memory region tries to delete itself
from its handler to prevent the deadlock.

 


 Besides
 MMIO and PIO dispatching, it will haunt us for file or event handlers,
 any kind of callbacks etc.

 Context A   Context B
 -   -
 object = lookup()
 deregister(object)
 modify(object) - invalid state
 ... use(object)
 modify(object) - valid state
 register(object)

 And with object I'm not talking about QOM but any data structure.



 Context B
 -
 rcu_read_lock()
 object = lookup()
 if (object) {
 ref(object)
 }
 rcu_read_unlock()

 use(object)

 unref(object)

 (substitute locking scheme to conform to taste and/or dietary
 restrictions)   


 + wait for refcount(object) == 0 in deregister(object). That's what I'm
 proposing.
 
 Consider timer_del() called from a timer callback.  It's often not doable.

This is inherently synchronous already (when working against the same
alarm timer backend). We can detect this in timer_del and skip waiting,
as in the above scenario.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-29 Thread liu ping fan
On Thu, Aug 30, 2012 at 1:40 AM, Avi Kivity a...@redhat.com wrote:
 On 08/29/2012 10:30 AM, Jan Kiszka wrote:
 On 2012-08-29 19:23, Avi Kivity wrote:
  On 08/28/2012 02:42 AM, Jan Kiszka wrote:
 
  Let's not talk about devices or MMIO dispatching. I think the problem is
  way more generic, and we will face it multiple times in QEMU.
 
  The problem exists outside qemu as well.  It is one of the reasons for
  the popularity of garbage collection IMO, and the use of reference
  counting when GC is not available.
 
  This pattern is even documented in
  Documentation/DocBook/kernel-locking.tmpl:
 
  @@ -104,12 +114,11 @@
   struct object *cache_find(int id)
   {
   struct object *obj;
  -unsigned long flags;
 
  -spin_lock_irqsave(amp;cache_lock, flags);
  +rcu_read_lock();
   obj = __cache_find(id);
   if (obj)
   object_get(obj);
  -spin_unlock_irqrestore(amp;cache_lock, flags);
  +rcu_read_unlock();
 
 
  Of course that doesn't mean we should use it, but at least it indicates
  that it is a standard pattern.  With MemoryRegion the pattern is
  changed, since MemoryRegion is a thunk, not the object we're really
  dispatching to.

 We are dispatching according to the memory region (parameters, op
 handlers, opaques). If we end up in device object is not decided at this
 level. A memory region describes a dispatchable area - not to confuse
 with a device that may only partially be able to receive such requests.

But I think the meaning of the memory region is for dispatching. If no
dispatching associated with mr, why need it exist in the system?  And
could you elaborate that who will be the ref holder of mr?

 Correct.

 Let's experiment with refcounting MemoryRegion.  We can move the entire
 contents of MemoryRegion to MemoryRegionImpl, add a reference count (to
 MemoryRegionImpl), and change MemoryRegion to contain a pointer to the
 refcounted MemoryRegionImpl:

 struct MemoryRegion {
 struct MemoryRegionImpl *impl;
 };

 struct MemoryRegionImpl {
 atomic int refs;
 ...
 };

 The memory core can then store MemoryRegion copies (with elevated
 refcounts) instead of pointers.  Devices can destroy MemoryRegions at
 any time, the implementation will not go away.  However, what of the
 opaque stored in MemoryRegionImpl?  It becomes a dangling pointer.

 One way out is to add a lock to MemoryRegionImpl.  Dispatch takes the
 lock, examines the -enabled member, and bails out if it is cleared.
 The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock,
 clears -enabled,  releases the lock, and drops the reference.

Assume we can get the host objectX of opaque, then we can just
object_ref(objectX). So dispatching will not face the hard nut to
handle opaque.

Regards,
pingfan
 The advantage to this scheme is that all changes are localized to the
 memory core, no need for a full sweep.  It is a little complicated, but
 we may be able to simplify it (or find another one).


 
  Besides
  MMIO and PIO dispatching, it will haunt us for file or event handlers,
  any kind of callbacks etc.
 
  Context A   Context B
  -   -
  object = lookup()
  deregister(object)
  modify(object) - invalid state
  ... use(object)
  modify(object) - valid state
  register(object)
 
  And with object I'm not talking about QOM but any data structure.
 
 
 
  Context B
  -
  rcu_read_lock()
  object = lookup()
  if (object) {
  ref(object)
  }
  rcu_read_unlock()
 
  use(object)
 
  unref(object)
 
  (substitute locking scheme to conform to taste and/or dietary
  restrictions)
 

 + wait for refcount(object) == 0 in deregister(object). That's what I'm
 proposing.

 Consider timer_del() called from a timer callback.  It's often not doable.

 --
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-28 Thread Jan Kiszka
On 2012-08-28 05:38, liu ping fan wrote:
 On Tue, Aug 28, 2012 at 11:09 AM, liu ping fan qemul...@gmail.com wrote:
 On Tue, Aug 28, 2012 at 3:38 AM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2012-08-27 20:52, Avi Kivity wrote:
 On 08/27/2012 11:39 AM, Jan Kiszka wrote:
 On 2012-08-27 20:20, Avi Kivity wrote:
 On 08/27/2012 11:17 AM, Jan Kiszka wrote:
 On 2012-08-27 20:09, Avi Kivity wrote:
 On 08/27/2012 10:14 AM, Jan Kiszka wrote:

 Deregistration is fine, the problem is destruction.


 It isn't as you access memory region states that can change after
 deregistration. Devices can remove memory regions from the mapping,
 alter and then reinsert them. The last to steps must not happen while
 anyone is still using a reference to that region.


 Why not?  If the guest is accessing an mmio region while reconfiguring
 it in a way that changes its meaning, either the previous or the next
 meaning is valid.

 If the memory region owner sets the content to zero or even releases it
 (nothing states a memory region can only live inside a device
 structure), we will crash. Restricting how a memory region can be
 created and handled after it was once registered somewhere is an
 unnatural interface, waiting to cause subtle bugs.

 Using an Object * allows the simple case to be really simple (object ==
 device) and the hard cases to be doable.

 What would you suggest as a better interface?

 To protect the life cycle of the object we manage in the memory layer:
 regions. We don't manage devices there. If there is any implementation
 benefit in having a full QOM object, then make memory regions objects.

 I see and sympathise with this point of view.

 The idea to use opaque and convert it to Object * is that often a device
 has multiple regions but no interest in managing them separately.  So
 managing regions will cause the device model authors to create
 boilerplate code to push region management to device management.

 I had this experience with the first iterations of the region interface,
 where instead of opaque we had MemoryRegion *.  Device code would use
 container_of() in the simple case, but the non-simple case caused lots
 of pain.  I believe it is the natural interface, but in practice it
 turned out to cause too much churn.

 I simply don't like this indirection, having the memory layer pick up
 the opaque value of the region and interpret it.

 That's just an intermediate step.  The idea is that eventually opaque
 will be changed to Object *.

 Even worse, apply
 restrictions on how the dispatched objects, the regions, have to be
 treated because of this.

 Please elaborate.

 The fact that you can't manipulate a memory region object arbitrarily
 after removing it from the mapping because you track the references to
 the object that the region points to, not the region itself. The region
 remains in use by the dispatching layer and potentially the called
 device, even after deregistration.

 Why can it in use by dispatching layer? Memory region just servers as
 interface for mem lookup, when dispatching in mr's MemoryRegionOps,
 why do we need to access it?  Which means that read/write can cause
 register/deregister mr?  Example?
 
 I found example in pci_update_mappings(), but but my following point
 still stand.
 Also I think MemoryRegionOps will eventually operate on Object (mr is
 only interface). MRs is only represent of Object in the view of
 memory(so their life cycle can be managed by Object). And there could
 be some intermediate mr like those lie in subpage_t which are not
 based on Object. But we can walk down to the terminal point and
 extract the real Object,
 so when dispatching, we only need to ensure that Object and its direct
 mr are alive.

Let's not talk about devices or MMIO dispatching. I think the problem is
way more generic, and we will face it multiple times in QEMU. Besides
MMIO and PIO dispatching, it will haunt us for file or event handlers,
any kind of callbacks etc.

Context A   Context B
-   -
object = lookup()
deregister(object)
modify(object) - invalid state
... use(object)
modify(object) - valid state
register(object)

And with object I'm not talking about QOM but any data structure.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-28 Thread Paolo Bonzini
Il 28/08/2012 11:42, Jan Kiszka ha scritto:
 Context A   Context B
 -   -
 object = lookup()
 deregister(object)
 modify(object) - invalid state
 ... use(object)
 modify(object) - valid state
 register(object)
 
 And with object I'm not talking about QOM but any data structure.

If you want to avoid locks, the only way to do so is RCU.  Then you do
not modify object at all.

Paolo



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Paolo Bonzini
Il 25/08/2012 09:42, liu ping fan ha scritto:
 
  I don't see why MMIO dispatch should hold the IDEBus ref rather than the
  PCIIDEState.
 
 When transfer memory_region_init_io()  3rd para from void* opaque to
 Object* obj,  the obj : opaque is not neccessary 1:1 map. For such
 situation, in order to let MemoryRegionOps tell between them, we
 should pass PCIIDEState-bus[0], bus[1] separately.

The rule should be that the obj is the object that you want referenced,
and that should be the PCIIDEState.

But this is anyway moot because it only applies to objects that are
converted to use unlocked dispatch.  This likely will not be the case
for IDE.

Paolo

  In the case of the PIIX, the BARs are set up by the PCIIDEState in
  bmdma_setup_bar (called by bmdma_setup_bar).
 
 Supposing we have convert  PCIIDEState-bmdma[0]/[1] to Object. And in
 mmio-dispatch, object_ref will impose on bmdma[0/[1], but this can not
 prevent  PCIIDEState-refcnt=0, and then the whole object disappear!




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Jan Kiszka
On 2012-08-27 09:01, Paolo Bonzini wrote:
 Il 25/08/2012 09:42, liu ping fan ha scritto:

 I don't see why MMIO dispatch should hold the IDEBus ref rather than the
 PCIIDEState.

 When transfer memory_region_init_io()  3rd para from void* opaque to
 Object* obj,  the obj : opaque is not neccessary 1:1 map. For such
 situation, in order to let MemoryRegionOps tell between them, we
 should pass PCIIDEState-bus[0], bus[1] separately.
 
 The rule should be that the obj is the object that you want referenced,
 and that should be the PCIIDEState.
 
 But this is anyway moot because it only applies to objects that are
 converted to use unlocked dispatch.  This likely will not be the case
 for IDE.

BTW, I'm pretty sure - after implementing the basics for BQL-free PIO
dispatching - that device objects are the wrong target for reference
counting. We keep memory regions in our dispatching tables (PIO
dispatching needs some refactoring for this), and those regions need
protection for BQL-free use. Devices can't pass away as long as the have
referenced regions, memory region deregistration services will have to
take care of this.

I'm currently not using reference counting at all, I'm enforcing that
only BQL-protected regions can be deregistered.

Also note that there seems to be another misconception in the
discussions: deregistration is not only bound to device unplug. It also
happens on device reconfiguration, e.g. PCI BAR (re-)mapping. Another
strong indicator that we should worry about individual memory regions,
not devices.

Jan




signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread liu ping fan
On Mon, Aug 27, 2012 at 3:47 PM, Jan Kiszka jan.kis...@web.de wrote:
 On 2012-08-27 09:01, Paolo Bonzini wrote:
 Il 25/08/2012 09:42, liu ping fan ha scritto:

 I don't see why MMIO dispatch should hold the IDEBus ref rather than the
 PCIIDEState.

 When transfer memory_region_init_io()  3rd para from void* opaque to
 Object* obj,  the obj : opaque is not neccessary 1:1 map. For such
 situation, in order to let MemoryRegionOps tell between them, we
 should pass PCIIDEState-bus[0], bus[1] separately.

 The rule should be that the obj is the object that you want referenced,
 and that should be the PCIIDEState.

 But this is anyway moot because it only applies to objects that are
 converted to use unlocked dispatch.  This likely will not be the case
 for IDE.

 BTW, I'm pretty sure - after implementing the basics for BQL-free PIO
 dispatching - that device objects are the wrong target for reference

Hi Jan, thanks for reminder, but could you explain it more detail?
mmio dispatch table holds 1 ref for device, before releasing this
ref,( When unplugging, we detach all the device's mr from memory, then
drop the ref. So I think that no leak will be exposed by mr  and it is
safe to use device as target for reference.

 counting. We keep memory regions in our dispatching tables (PIO
 dispatching needs some refactoring for this), and those regions need
 protection for BQL-free use. Devices can't pass away as long as the have
Yes, it is right. Device can pass away only after mr removed from
dispatching tables

Thanx pingfan
 referenced regions, memory region deregistration services will have to
 take care of this.


 I'm currently not using reference counting at all, I'm enforcing that
 only BQL-protected regions can be deregistered.

 Also note that there seems to be another misconception in the
 discussions: deregistration is not only bound to device unplug. It also
 happens on device reconfiguration, e.g. PCI BAR (re-)mapping. Another
 strong indicator that we should worry about individual memory regions,
 not devices.

 Jan





Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Jan Kiszka
On 2012-08-27 10:17, liu ping fan wrote:
 On Mon, Aug 27, 2012 at 3:47 PM, Jan Kiszka jan.kis...@web.de wrote:
 On 2012-08-27 09:01, Paolo Bonzini wrote:
 Il 25/08/2012 09:42, liu ping fan ha scritto:

 I don't see why MMIO dispatch should hold the IDEBus ref rather than the
 PCIIDEState.

 When transfer memory_region_init_io()  3rd para from void* opaque to
 Object* obj,  the obj : opaque is not neccessary 1:1 map. For such
 situation, in order to let MemoryRegionOps tell between them, we
 should pass PCIIDEState-bus[0], bus[1] separately.

 The rule should be that the obj is the object that you want referenced,
 and that should be the PCIIDEState.

 But this is anyway moot because it only applies to objects that are
 converted to use unlocked dispatch.  This likely will not be the case
 for IDE.

 BTW, I'm pretty sure - after implementing the basics for BQL-free PIO
 dispatching - that device objects are the wrong target for reference
 
 Hi Jan, thanks for reminder, but could you explain it more detail?
 mmio dispatch table holds 1 ref for device, before releasing this
 ref,( When unplugging, we detach all the device's mr from memory, then
 drop the ref. So I think that no leak will be exposed by mr  and it is
 safe to use device as target for reference.

It would be a mistake to assume that memory regions can only be embedded
in device objects. Memory regions can be reconfigured or dynamically
added/removed (see e.g. portio lists) - there is no device in this
sentence. Regions are stored in the dispatching table, they will first
of all be touched without holding the BQL. So their content has to be
stable in that period, and it is the proper abstraction, IMHO, to focus
on their life cycle management and attach all the rest to them.

 
 counting. We keep memory regions in our dispatching tables (PIO
 dispatching needs some refactoring for this), and those regions need
 protection for BQL-free use. Devices can't pass away as long as the have
 Yes, it is right. Device can pass away only after mr removed from
 dispatching tables

Great, then you don't have to worry about device objects in the context
of dispatching.

Jan




signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Anthony Liguori
Liu Ping Fan qemul...@gmail.com writes:

 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Scene:
   obja lies in objA, when objA's ref-0, it will be freed,
 but at that time obja can still be in use.

 The real example is:
 typedef struct PCIIDEState {
 PCIDevice dev;
 IDEBus bus[2]; -- create in place
 .
 }

 When without big lock protection for mmio-dispatch, we will hold
 obj's refcnt. So memory_region_init_io() will replace the third para
 void *opaque with Object *obj.
 With this patch, we can protect PCIIDEState from disappearing during
 mmio-dispatch hold the IDEBus-ref.

 And the ref circle has been broken when calling qdev_delete_subtree().

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com

I think this is solving the wrong problem.  There are many, many
dependencies a device may have on other devices.  Memory allocation
isn't the only one.

The problem is that we want to make sure that a device doesn't go away
while an MMIO dispatch is happening.  This is easy to solve without
touching referencing counting.

The device will hold a lock while the MMIO is being dispatched.  The
delete path simply needs to acquire that same lock.  This will ensure
that a delete operation cannot finish while MMIO is still in flight.

Regarding deleting a device, not all devices are capable of being
deleted and specifically, devices that are composed within the memory of
another device cannot be directly deleted (they can only be deleted
as part of their parent's destruction).

Regards,

Anthony Liguori

 ---
  hw/qdev.c |2 ++
  hw/qdev.h |1 +
  2 files changed, 3 insertions(+), 0 deletions(-)

 diff --git a/hw/qdev.c b/hw/qdev.c
 index e2339a1..b09ebbf 100644
 --- a/hw/qdev.c
 +++ b/hw/qdev.c
 @@ -510,6 +510,8 @@ void qbus_create_inplace(BusState *bus, const char 
 *typename,
  {
  object_initialize(bus, typename);
  
 +bus-overlap = parent;
 +object_ref(OBJECT(bus-overlap));
  bus-parent = parent;
  bus-name = name ? g_strdup(name) : NULL;
  qbus_realize(bus);
 diff --git a/hw/qdev.h b/hw/qdev.h
 index 182cfa5..9bc5783 100644
 --- a/hw/qdev.h
 +++ b/hw/qdev.h
 @@ -117,6 +117,7 @@ struct BusState {
  int allow_hotplug;
  bool qom_allocated;
  bool glib_allocated;
 +DeviceState *overlap;
  int max_index;
  QTAILQ_HEAD(ChildrenHead, BusChild) children;
  QLIST_ENTRY(BusState) sibling;
 -- 
 1.7.4.4



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Jan Kiszka
On 2012-08-27 15:19, Anthony Liguori wrote:
 Liu Ping Fan qemul...@gmail.com writes:
 
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Scene:
   obja lies in objA, when objA's ref-0, it will be freed,
 but at that time obja can still be in use.

 The real example is:
 typedef struct PCIIDEState {
 PCIDevice dev;
 IDEBus bus[2]; -- create in place
 .
 }

 When without big lock protection for mmio-dispatch, we will hold
 obj's refcnt. So memory_region_init_io() will replace the third para
 void *opaque with Object *obj.
 With this patch, we can protect PCIIDEState from disappearing during
 mmio-dispatch hold the IDEBus-ref.

 And the ref circle has been broken when calling qdev_delete_subtree().

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
 I think this is solving the wrong problem.  There are many, many
 dependencies a device may have on other devices.  Memory allocation
 isn't the only one.
 
 The problem is that we want to make sure that a device doesn't go away
 while an MMIO dispatch is happening.  This is easy to solve without
 touching referencing counting.
 
 The device will hold a lock while the MMIO is being dispatched.  The
 delete path simply needs to acquire that same lock.  This will ensure
 that a delete operation cannot finish while MMIO is still in flight.

That's a bit too simple. Quite a few MMIO/PIO fast-paths will work
without any device-specific locking, e.g. just to read a simple register
value. So we will need reference counting (for devices using private
locks), but on the front-line object: the memory region. That region
will block its owner from disappearing by waiting on dispatch when
someone tries to unregister it.

Also note that holding a lock is easily said but will be more tricky
in practice. Quite a significant share of our code will continue to run
under BQL, even for devices with their own locks. Init/cleanup functions
will likely fall into this category, simply because the surrounding
logic is hard to convert into fine-grained locking and is also not
performance critical. At the same time, we can't take BQL - device-lock
as we have to support device-lock - BQL ordering for (slow-path) calls
into BQL-protected areas while holding a per-device lock (e.g. device
mapping changes).

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Anthony Liguori
Jan Kiszka jan.kis...@siemens.com writes:

 On 2012-08-27 15:19, Anthony Liguori wrote:
 Liu Ping Fan qemul...@gmail.com writes:
 
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Scene:
   obja lies in objA, when objA's ref-0, it will be freed,
 but at that time obja can still be in use.

 The real example is:
 typedef struct PCIIDEState {
 PCIDevice dev;
 IDEBus bus[2]; -- create in place
 .
 }

 When without big lock protection for mmio-dispatch, we will hold
 obj's refcnt. So memory_region_init_io() will replace the third para
 void *opaque with Object *obj.
 With this patch, we can protect PCIIDEState from disappearing during
 mmio-dispatch hold the IDEBus-ref.

 And the ref circle has been broken when calling qdev_delete_subtree().

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
 I think this is solving the wrong problem.  There are many, many
 dependencies a device may have on other devices.  Memory allocation
 isn't the only one.
 
 The problem is that we want to make sure that a device doesn't go away
 while an MMIO dispatch is happening.  This is easy to solve without
 touching referencing counting.
 
 The device will hold a lock while the MMIO is being dispatched.  The
 delete path simply needs to acquire that same lock.  This will ensure
 that a delete operation cannot finish while MMIO is still in flight.

 That's a bit too simple. Quite a few MMIO/PIO fast-paths will work
 without any device-specific locking, e.g. just to read a simple register
 value. So we will need reference counting

But then you'll need to acquire a lock to take the reference/remove the
reference which sort of defeats the purpose of trying to fast path.

 (for devices using private
 locks), but on the front-line object: the memory region. That region
 will block its owner from disappearing by waiting on dispatch when
 someone tries to unregister it.

 Also note that holding a lock is easily said but will be more tricky
 in practice. Quite a significant share of our code will continue to run
 under BQL, even for devices with their own locks. Init/cleanup functions
 will likely fall into this category,

I'm not sure I'm convinced of this--but it's hard to tell until we
really start converting.

BTW, I'm pretty sure we have to tackle main loop functions first before
we try to convert any devices off the BQL.

Regards,

Anthony Liguori

 simply because the surrounding
 logic is hard to convert into fine-grained locking and is also not
 performance critical. At the same time, we can't take BQL - device-lock
 as we have to support device-lock - BQL ordering for (slow-path) calls
 into BQL-protected areas while holding a per-device lock (e.g. device
 mapping changes).



 Jan

 -- 
 Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
 Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Jan Kiszka
On 2012-08-27 17:14, Anthony Liguori wrote:
 Jan Kiszka jan.kis...@siemens.com writes:
 
 On 2012-08-27 15:19, Anthony Liguori wrote:
 Liu Ping Fan qemul...@gmail.com writes:

 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Scene:
   obja lies in objA, when objA's ref-0, it will be freed,
 but at that time obja can still be in use.

 The real example is:
 typedef struct PCIIDEState {
 PCIDevice dev;
 IDEBus bus[2]; -- create in place
 .
 }

 When without big lock protection for mmio-dispatch, we will hold
 obj's refcnt. So memory_region_init_io() will replace the third para
 void *opaque with Object *obj.
 With this patch, we can protect PCIIDEState from disappearing during
 mmio-dispatch hold the IDEBus-ref.

 And the ref circle has been broken when calling qdev_delete_subtree().

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com

 I think this is solving the wrong problem.  There are many, many
 dependencies a device may have on other devices.  Memory allocation
 isn't the only one.

 The problem is that we want to make sure that a device doesn't go away
 while an MMIO dispatch is happening.  This is easy to solve without
 touching referencing counting.

 The device will hold a lock while the MMIO is being dispatched.  The
 delete path simply needs to acquire that same lock.  This will ensure
 that a delete operation cannot finish while MMIO is still in flight.

 That's a bit too simple. Quite a few MMIO/PIO fast-paths will work
 without any device-specific locking, e.g. just to read a simple register
 value. So we will need reference counting
 
 But then you'll need to acquire a lock to take the reference/remove the
 reference which sort of defeats the purpose of trying to fast path.

Atomic ops? RCU? This problem won't be solved for the first time.

 
 (for devices using private
 locks), but on the front-line object: the memory region. That region
 will block its owner from disappearing by waiting on dispatch when
 someone tries to unregister it.

 Also note that holding a lock is easily said but will be more tricky
 in practice. Quite a significant share of our code will continue to run
 under BQL, even for devices with their own locks. Init/cleanup functions
 will likely fall into this category,
 
 I'm not sure I'm convinced of this--but it's hard to tell until we
 really start converting.
 
 BTW, I'm pretty sure we have to tackle main loop functions first before
 we try to convert any devices off the BQL.

I'm sure we should leave existing code alone wherever possible, focusing
on providing alternative versions for those paths that matter. Example:
Most timers are fine under BQL. But some sensitive devices (RTC or HPET
as clock source) will want their own timers. So the approach is to
instantiate a separate, also prioritizeable instance of the timer
subsystem for them and be done.

We won't convert QEMU in a day, but we surely want results before the
last corner is refactored (which would take years, at best).

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Anthony Liguori
Jan Kiszka jan.kis...@siemens.com writes:

 On 2012-08-27 17:14, Anthony Liguori wrote:
 Jan Kiszka jan.kis...@siemens.com writes:
 
 On 2012-08-27 15:19, Anthony Liguori wrote:
 Liu Ping Fan qemul...@gmail.com writes:

 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Scene:
   obja lies in objA, when objA's ref-0, it will be freed,
 but at that time obja can still be in use.

 The real example is:
 typedef struct PCIIDEState {
 PCIDevice dev;
 IDEBus bus[2]; -- create in place
 .
 }

 When without big lock protection for mmio-dispatch, we will hold
 obj's refcnt. So memory_region_init_io() will replace the third para
 void *opaque with Object *obj.
 With this patch, we can protect PCIIDEState from disappearing during
 mmio-dispatch hold the IDEBus-ref.

 And the ref circle has been broken when calling qdev_delete_subtree().

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com

 I think this is solving the wrong problem.  There are many, many
 dependencies a device may have on other devices.  Memory allocation
 isn't the only one.

 The problem is that we want to make sure that a device doesn't go away
 while an MMIO dispatch is happening.  This is easy to solve without
 touching referencing counting.

 The device will hold a lock while the MMIO is being dispatched.  The
 delete path simply needs to acquire that same lock.  This will ensure
 that a delete operation cannot finish while MMIO is still in flight.

 That's a bit too simple. Quite a few MMIO/PIO fast-paths will work
 without any device-specific locking, e.g. just to read a simple register
 value. So we will need reference counting
 
 But then you'll need to acquire a lock to take the reference/remove the
 reference which sort of defeats the purpose of trying to fast path.

 Atomic ops? RCU? This problem won't be solved for the first time.

Yes, there are ways to do this, but you add a fair bit of complication.

It's much simplier to make objects not have any locks and enforce all
callers to protect access.  IOW, noone except the device gets to inc/dec
reference counts.

 (for devices using private
 locks), but on the front-line object: the memory region. That region
 will block its owner from disappearing by waiting on dispatch when
 someone tries to unregister it.

 Also note that holding a lock is easily said but will be more tricky
 in practice. Quite a significant share of our code will continue to run
 under BQL, even for devices with their own locks. Init/cleanup functions
 will likely fall into this category,
 
 I'm not sure I'm convinced of this--but it's hard to tell until we
 really start converting.
 
 BTW, I'm pretty sure we have to tackle main loop functions first before
 we try to convert any devices off the BQL.

 I'm sure we should leave existing code alone wherever possible, focusing
 on providing alternative versions for those paths that matter. Example:
 Most timers are fine under BQL. But some sensitive devices (RTC or HPET
 as clock source) will want their own timers. So the approach is to
 instantiate a separate, also prioritizeable instance of the timer
 subsystem for them and be done.

I disagree.  I think we conver the timer subsystem to be lockless and
then let some devices acquire the BQL during dispatch.

And we have a nice thread-aware main loop available to us--glib.  We
don't need to reinvent the wheel.

Regards,

Anthony Liguori


 We won't convert QEMU in a day, but we surely want results before the
 last corner is refactored (which would take years, at best).

 Jan

 -- 
 Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
 Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Jan Kiszka
On 2012-08-27 18:24, Anthony Liguori wrote:
 Jan Kiszka jan.kis...@siemens.com writes:
 
 On 2012-08-27 17:14, Anthony Liguori wrote:
 Jan Kiszka jan.kis...@siemens.com writes:

 On 2012-08-27 15:19, Anthony Liguori wrote:
 Liu Ping Fan qemul...@gmail.com writes:

 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Scene:
   obja lies in objA, when objA's ref-0, it will be freed,
 but at that time obja can still be in use.

 The real example is:
 typedef struct PCIIDEState {
 PCIDevice dev;
 IDEBus bus[2]; -- create in place
 .
 }

 When without big lock protection for mmio-dispatch, we will hold
 obj's refcnt. So memory_region_init_io() will replace the third para
 void *opaque with Object *obj.
 With this patch, we can protect PCIIDEState from disappearing during
 mmio-dispatch hold the IDEBus-ref.

 And the ref circle has been broken when calling qdev_delete_subtree().

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com

 I think this is solving the wrong problem.  There are many, many
 dependencies a device may have on other devices.  Memory allocation
 isn't the only one.

 The problem is that we want to make sure that a device doesn't go away
 while an MMIO dispatch is happening.  This is easy to solve without
 touching referencing counting.

 The device will hold a lock while the MMIO is being dispatched.  The
 delete path simply needs to acquire that same lock.  This will ensure
 that a delete operation cannot finish while MMIO is still in flight.

 That's a bit too simple. Quite a few MMIO/PIO fast-paths will work
 without any device-specific locking, e.g. just to read a simple register
 value. So we will need reference counting

 But then you'll need to acquire a lock to take the reference/remove the
 reference which sort of defeats the purpose of trying to fast path.

 Atomic ops? RCU? This problem won't be solved for the first time.
 
 Yes, there are ways to do this, but you add a fair bit of complication.
 
 It's much simplier to make objects not have any locks and enforce all
 callers to protect access.  IOW, noone except the device gets to inc/dec
 reference counts.

Pulling locks out of the devices may appear simpler on first sight but
won't be in practice. Objects can issue requests to other objects. Will
they hold its lock while taking the target's lock? Lock management is a
per-device thing, not only for performance reasons.

But I guess we will have to compare working prototypes. Many concepts
look nice on the paper, but when you implement them or even try to run
them, the interesting problems show up. Specifically those related to a
smooth transition phase.

 
 (for devices using private
 locks), but on the front-line object: the memory region. That region
 will block its owner from disappearing by waiting on dispatch when
 someone tries to unregister it.

 Also note that holding a lock is easily said but will be more tricky
 in practice. Quite a significant share of our code will continue to run
 under BQL, even for devices with their own locks. Init/cleanup functions
 will likely fall into this category,

 I'm not sure I'm convinced of this--but it's hard to tell until we
 really start converting.

 BTW, I'm pretty sure we have to tackle main loop functions first before
 we try to convert any devices off the BQL.

 I'm sure we should leave existing code alone wherever possible, focusing
 on providing alternative versions for those paths that matter. Example:
 Most timers are fine under BQL. But some sensitive devices (RTC or HPET
 as clock source) will want their own timers. So the approach is to
 instantiate a separate, also prioritizeable instance of the timer
 subsystem for them and be done.
 
 I disagree.  I think we conver the timer subsystem to be lockless and
 then let some devices acquire the BQL during dispatch.

The timer subsystem is lockless in this sense already. Its user holds
the necessary lock (BQL so far). I'm just making it multi-instance so
that the time-critical users don't need to worry about other, low-prio
timers. In that sense, I guess we are describing the same end result.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Avi Kivity
On 08/27/2012 12:47 AM, Jan Kiszka wrote:
 On 2012-08-27 09:01, Paolo Bonzini wrote:
  Il 25/08/2012 09:42, liu ping fan ha scritto:
 
  I don't see why MMIO dispatch should hold the IDEBus ref rather than the
  PCIIDEState.
 
  When transfer memory_region_init_io()  3rd para from void* opaque to
  Object* obj,  the obj : opaque is not neccessary 1:1 map. For such
  situation, in order to let MemoryRegionOps tell between them, we
  should pass PCIIDEState-bus[0], bus[1] separately.
  
  The rule should be that the obj is the object that you want referenced,
  and that should be the PCIIDEState.
  
  But this is anyway moot because it only applies to objects that are
  converted to use unlocked dispatch.  This likely will not be the case
  for IDE.

 BTW, I'm pretty sure - after implementing the basics for BQL-free PIO
 dispatching - that device objects are the wrong target for reference
 counting. We keep memory regions in our dispatching tables (PIO
 dispatching needs some refactoring for this), and those regions need
 protection for BQL-free use. Devices can't pass away as long as the have
 referenced regions, memory region deregistration services will have to
 take care of this.

 I'm currently not using reference counting at all, I'm enforcing that
 only BQL-protected regions can be deregistered.

That's a pretty harsh constraint.

 Also note that there seems to be another misconception in the
 discussions: deregistration is not only bound to device unplug. It also
 happens on device reconfiguration, e.g. PCI BAR (re-)mapping. Another
 strong indicator that we should worry about individual memory regions,
 not devices.



Deregistration is fine, the problem is destruction.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Jan Kiszka
On 2012-08-27 19:09, Avi Kivity wrote:
 On 08/27/2012 12:47 AM, Jan Kiszka wrote:
 On 2012-08-27 09:01, Paolo Bonzini wrote:
 Il 25/08/2012 09:42, liu ping fan ha scritto:

 I don't see why MMIO dispatch should hold the IDEBus ref rather than the
 PCIIDEState.

 When transfer memory_region_init_io()  3rd para from void* opaque to
 Object* obj,  the obj : opaque is not neccessary 1:1 map. For such
 situation, in order to let MemoryRegionOps tell between them, we
 should pass PCIIDEState-bus[0], bus[1] separately.

 The rule should be that the obj is the object that you want referenced,
 and that should be the PCIIDEState.

 But this is anyway moot because it only applies to objects that are
 converted to use unlocked dispatch.  This likely will not be the case
 for IDE.

 BTW, I'm pretty sure - after implementing the basics for BQL-free PIO
 dispatching - that device objects are the wrong target for reference
 counting. We keep memory regions in our dispatching tables (PIO
 dispatching needs some refactoring for this), and those regions need
 protection for BQL-free use. Devices can't pass away as long as the have
 referenced regions, memory region deregistration services will have to
 take care of this.

 I'm currently not using reference counting at all, I'm enforcing that
 only BQL-protected regions can be deregistered.
 
 That's a pretty harsh constraint.

Of course. It's a first step, not a long-term restriction. But we have
to do small steps.

 
 Also note that there seems to be another misconception in the
 discussions: deregistration is not only bound to device unplug. It also
 happens on device reconfiguration, e.g. PCI BAR (re-)mapping. Another
 strong indicator that we should worry about individual memory regions,
 not devices.


 
 Deregistration is fine, the problem is destruction.
 

It isn't as you access memory region states that can change after
deregistration. Devices can remove memory regions from the mapping,
alter and then reinsert them. The last to steps must not happen while
anyone is still using a reference to that region.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Avi Kivity
On 08/27/2012 10:14 AM, Jan Kiszka wrote:
  
  Deregistration is fine, the problem is destruction.
  

 It isn't as you access memory region states that can change after
 deregistration. Devices can remove memory regions from the mapping,
 alter and then reinsert them. The last to steps must not happen while
 anyone is still using a reference to that region.


Why not?  If the guest is accessing an mmio region while reconfiguring
it in a way that changes its meaning, either the previous or the next
meaning is valid.

It is true that memory_region_set_enabled(..., false) will become weaker
as a result.  Code will have to be prepared for that.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Jan Kiszka
On 2012-08-27 20:09, Avi Kivity wrote:
 On 08/27/2012 10:14 AM, Jan Kiszka wrote:

 Deregistration is fine, the problem is destruction.


 It isn't as you access memory region states that can change after
 deregistration. Devices can remove memory regions from the mapping,
 alter and then reinsert them. The last to steps must not happen while
 anyone is still using a reference to that region.

 
 Why not?  If the guest is accessing an mmio region while reconfiguring
 it in a way that changes its meaning, either the previous or the next
 meaning is valid.

If the memory region owner sets the content to zero or even releases it
(nothing states a memory region can only live inside a device
structure), we will crash. Restricting how a memory region can be
created and handled after it was once registered somewhere is an
unnatural interface, waiting to cause subtle bugs.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Avi Kivity
On 08/27/2012 11:17 AM, Jan Kiszka wrote:
 On 2012-08-27 20:09, Avi Kivity wrote:
  On 08/27/2012 10:14 AM, Jan Kiszka wrote:
 
  Deregistration is fine, the problem is destruction.
 
 
  It isn't as you access memory region states that can change after
  deregistration. Devices can remove memory regions from the mapping,
  alter and then reinsert them. The last to steps must not happen while
  anyone is still using a reference to that region.
 
  
  Why not?  If the guest is accessing an mmio region while reconfiguring
  it in a way that changes its meaning, either the previous or the next
  meaning is valid.

 If the memory region owner sets the content to zero or even releases it
 (nothing states a memory region can only live inside a device
 structure), we will crash. Restricting how a memory region can be
 created and handled after it was once registered somewhere is an
 unnatural interface, waiting to cause subtle bugs.

Using an Object * allows the simple case to be really simple (object ==
device) and the hard cases to be doable.

What would you suggest as a better interface?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Avi Kivity
On 08/27/2012 06:19 AM, Anthony Liguori wrote:
 Liu Ping Fan qemul...@gmail.com writes:

  From: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
  Scene:
obja lies in objA, when objA's ref-0, it will be freed,
  but at that time obja can still be in use.
 
  The real example is:
  typedef struct PCIIDEState {
  PCIDevice dev;
  IDEBus bus[2]; -- create in place
  .
  }
 
  When without big lock protection for mmio-dispatch, we will hold
  obj's refcnt. So memory_region_init_io() will replace the third para
  void *opaque with Object *obj.
  With this patch, we can protect PCIIDEState from disappearing during
  mmio-dispatch hold the IDEBus-ref.
 
  And the ref circle has been broken when calling qdev_delete_subtree().
 
  Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com

 I think this is solving the wrong problem.  There are many, many
 dependencies a device may have on other devices.  Memory allocation
 isn't the only one.

 The problem is that we want to make sure that a device doesn't go away
 while an MMIO dispatch is happening.  This is easy to solve without
 touching referencing counting.

 The device will hold a lock while the MMIO is being dispatched.  The
 delete path simply needs to acquire that same lock.  This will ensure
 that a delete operation cannot finish while MMIO is still in flight.

Where does the lock live?  If it lives in the device, then we can't
acquire it during mmio dispatch (the device may have gone away).  If it
lives outside the device, we need to to tell the memory core about it. 
I would really like to avoid adding additional state; I think the curent
opaque should serve both to provide state to the callbacks and for life
cycle management.

 Regarding deleting a device, not all devices are capable of being
 deleted and specifically, devices that are composed within the memory of
 another device cannot be directly deleted (they can only be deleted
 as part of their parent's destruction).

This doesn't help at all.  We have devices that can be deleted, and we
would like to make the devices use the unlocked path, so we have to
account for it.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Avi Kivity
On 08/27/2012 09:24 AM, Anthony Liguori wrote:
 
  I'm sure we should leave existing code alone wherever possible, focusing
  on providing alternative versions for those paths that matter. Example:
  Most timers are fine under BQL. But some sensitive devices (RTC or HPET
  as clock source) will want their own timers. So the approach is to
  instantiate a separate, also prioritizeable instance of the timer
  subsystem for them and be done.

 I disagree.  I think we conver the timer subsystem to be lockless and
 then let some devices acquire the BQL during dispatch.

I agree with your disagreement but disagree with the rest.  The timer
subsystem should have its own internal locking that callers will not be
aware of.  Requiring devices to acquire the bql will lead to deadlocks.

Note that fine-grained locking timers will also require reference
counting: you want to call the timer expiration callback after releasing
the timer subsystem lock, so you need to make sure the callback does not
go away.

Linux manages without it (hrtimer_interrupt), so maybe we can too.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Jan Kiszka
On 2012-08-27 20:20, Avi Kivity wrote:
 On 08/27/2012 11:17 AM, Jan Kiszka wrote:
 On 2012-08-27 20:09, Avi Kivity wrote:
 On 08/27/2012 10:14 AM, Jan Kiszka wrote:

 Deregistration is fine, the problem is destruction.


 It isn't as you access memory region states that can change after
 deregistration. Devices can remove memory regions from the mapping,
 alter and then reinsert them. The last to steps must not happen while
 anyone is still using a reference to that region.


 Why not?  If the guest is accessing an mmio region while reconfiguring
 it in a way that changes its meaning, either the previous or the next
 meaning is valid.

 If the memory region owner sets the content to zero or even releases it
 (nothing states a memory region can only live inside a device
 structure), we will crash. Restricting how a memory region can be
 created and handled after it was once registered somewhere is an
 unnatural interface, waiting to cause subtle bugs.
 
 Using an Object * allows the simple case to be really simple (object ==
 device) and the hard cases to be doable.
 
 What would you suggest as a better interface?

To protect the life cycle of the object we manage in the memory layer:
regions. We don't manage devices there. If there is any implementation
benefit in having a full QOM object, then make memory regions objects.

I simply don't like this indirection, having the memory layer pick up
the opaque value of the region and interpret it. Even worse, apply
restrictions on how the dispatched objects, the regions, have to be
treated because of this.

Also, using memory regions to control the locking behaviour allows for
more fine-grained control. A device may expose certain regions with
self-managed locking while others, less time critical ones, can still be
handled under BQL for simplicity reasons.

Example: regions that translate MMIO to PIO (alpha-pci.c, every user of
isa_mmio.c, ...). If PIO dispatching runs outside of the BQL, these
regions must not be protected by the BQL anymore. At the same time, we
may not want to convert the device exposing the region to its own
locking scheme (yet). And we surely don't want to take the per-device
lock for this state-less PIO dispatching.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Avi Kivity
On 08/27/2012 11:39 AM, Jan Kiszka wrote:
 On 2012-08-27 20:20, Avi Kivity wrote:
  On 08/27/2012 11:17 AM, Jan Kiszka wrote:
  On 2012-08-27 20:09, Avi Kivity wrote:
  On 08/27/2012 10:14 AM, Jan Kiszka wrote:
 
  Deregistration is fine, the problem is destruction.
 
 
  It isn't as you access memory region states that can change after
  deregistration. Devices can remove memory regions from the mapping,
  alter and then reinsert them. The last to steps must not happen while
  anyone is still using a reference to that region.
 
 
  Why not?  If the guest is accessing an mmio region while reconfiguring
  it in a way that changes its meaning, either the previous or the next
  meaning is valid.
 
  If the memory region owner sets the content to zero or even releases it
  (nothing states a memory region can only live inside a device
  structure), we will crash. Restricting how a memory region can be
  created and handled after it was once registered somewhere is an
  unnatural interface, waiting to cause subtle bugs.
  
  Using an Object * allows the simple case to be really simple (object ==
  device) and the hard cases to be doable.
  
  What would you suggest as a better interface?

 To protect the life cycle of the object we manage in the memory layer:
 regions. We don't manage devices there. If there is any implementation
 benefit in having a full QOM object, then make memory regions objects.

I see and sympathise with this point of view.

The idea to use opaque and convert it to Object * is that often a device
has multiple regions but no interest in managing them separately.  So
managing regions will cause the device model authors to create
boilerplate code to push region management to device management.

I had this experience with the first iterations of the region interface,
where instead of opaque we had MemoryRegion *.  Device code would use
container_of() in the simple case, but the non-simple case caused lots
of pain.  I believe it is the natural interface, but in practice it
turned out to cause too much churn.

 I simply don't like this indirection, having the memory layer pick up
 the opaque value of the region and interpret it. 

That's just an intermediate step.  The idea is that eventually opaque
will be changed to Object *.

 Even worse, apply
 restrictions on how the dispatched objects, the regions, have to be
 treated because of this.

Please elaborate.

 Also, using memory regions to control the locking behaviour allows for
 more fine-grained control. A device may expose certain regions with
 self-managed locking while others, less time critical ones, can still be
 handled under BQL for simplicity reasons.

You can still aquire the bql in one callback and your local lock in
another if you choose (but that would be most confusing IMO).

 Example: regions that translate MMIO to PIO (alpha-pci.c, every user of
 isa_mmio.c, ...). If PIO dispatching runs outside of the BQL, these
 regions must not be protected by the BQL anymore. At the same time, we
 may not want to convert the device exposing the region to its own
 locking scheme (yet). And we surely don't want to take the per-device
 lock for this state-less PIO dispatching.

We're diverging, but eventually PIO dispatch will share the same code as
mmio dispatch.  PIO-to-MMIO bridges will simply be containers or
aliases, they will not call cpu_inb() and relatives.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Anthony Liguori
Avi Kivity a...@redhat.com writes:

 On 08/27/2012 09:24 AM, Anthony Liguori wrote:
 
  I'm sure we should leave existing code alone wherever possible, focusing
  on providing alternative versions for those paths that matter. Example:
  Most timers are fine under BQL. But some sensitive devices (RTC or HPET
  as clock source) will want their own timers. So the approach is to
  instantiate a separate, also prioritizeable instance of the timer
  subsystem for them and be done.

 I disagree.  I think we conver the timer subsystem to be lockless and
 then let some devices acquire the BQL during dispatch.

 I agree with your disagreement but disagree with the rest.  The timer
 subsystem should have its own internal locking that callers will not be
 aware of.  Requiring devices to acquire the bql will lead to
 deadlocks.

Err, I completely mispoke there.

I meant, to say, we should convert the timer subsystem to be re-entrant
and then let some devices acquire the BQL during dispatch.

Existing users of the time API should get the BQL acquired automagically
for them.  The new API would not hold the BQL during dispatch but the
old API, implemented in terms of the new API, would acquire it on behalf
of the caller.

As a rule, if a device relies on the BQL, it must use the old APIs.  If
a device uses the new APIs, it must not acquire the BQL.

 Note that fine-grained locking timers will also require reference
 counting: you want to call the timer expiration callback after releasing
 the timer subsystem lock, so you need to make sure the callback does not
 go away.

glib simply uses a single main loop lock to deal with all of this.  I
think that's absolutely the right model.

It's best to start this conversion using very coarse locking.  There's
no need to start with ultra fine grain locking.

Regards,

Anthony Liguori

 Linux manages without it (hrtimer_interrupt), so maybe we can too.
 -- 
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Jan Kiszka
On 2012-08-27 21:17, Anthony Liguori wrote:
 Avi Kivity a...@redhat.com writes:
 
 On 08/27/2012 09:24 AM, Anthony Liguori wrote:

 I'm sure we should leave existing code alone wherever possible, focusing
 on providing alternative versions for those paths that matter. Example:
 Most timers are fine under BQL. But some sensitive devices (RTC or HPET
 as clock source) will want their own timers. So the approach is to
 instantiate a separate, also prioritizeable instance of the timer
 subsystem for them and be done.

 I disagree.  I think we conver the timer subsystem to be lockless and
 then let some devices acquire the BQL during dispatch.

 I agree with your disagreement but disagree with the rest.  The timer
 subsystem should have its own internal locking that callers will not be
 aware of.  Requiring devices to acquire the bql will lead to
 deadlocks.
 
 Err, I completely mispoke there.
 
 I meant, to say, we should convert the timer subsystem to be re-entrant
 and then let some devices acquire the BQL during dispatch.
 
 Existing users of the time API should get the BQL acquired automagically
 for them.  The new API would not hold the BQL during dispatch but the
 old API, implemented in terms of the new API, would acquire it on behalf
 of the caller.
 
 As a rule, if a device relies on the BQL, it must use the old APIs.  If
 a device uses the new APIs, it must not acquire the BQL.

This makes sense.

 
 Note that fine-grained locking timers will also require reference
 counting: you want to call the timer expiration callback after releasing
 the timer subsystem lock, so you need to make sure the callback does not
 go away.
 
 glib simply uses a single main loop lock to deal with all of this.  I
 think that's absolutely the right model.

That depends on what is under the lock. Also, relying on locking of
external libraries is a no-go for realtime as they usually do not care
about priorities and inversion avoidance.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Jan Kiszka
On 2012-08-27 20:52, Avi Kivity wrote:
 On 08/27/2012 11:39 AM, Jan Kiszka wrote:
 On 2012-08-27 20:20, Avi Kivity wrote:
 On 08/27/2012 11:17 AM, Jan Kiszka wrote:
 On 2012-08-27 20:09, Avi Kivity wrote:
 On 08/27/2012 10:14 AM, Jan Kiszka wrote:

 Deregistration is fine, the problem is destruction.


 It isn't as you access memory region states that can change after
 deregistration. Devices can remove memory regions from the mapping,
 alter and then reinsert them. The last to steps must not happen while
 anyone is still using a reference to that region.


 Why not?  If the guest is accessing an mmio region while reconfiguring
 it in a way that changes its meaning, either the previous or the next
 meaning is valid.

 If the memory region owner sets the content to zero or even releases it
 (nothing states a memory region can only live inside a device
 structure), we will crash. Restricting how a memory region can be
 created and handled after it was once registered somewhere is an
 unnatural interface, waiting to cause subtle bugs.

 Using an Object * allows the simple case to be really simple (object ==
 device) and the hard cases to be doable.

 What would you suggest as a better interface?

 To protect the life cycle of the object we manage in the memory layer:
 regions. We don't manage devices there. If there is any implementation
 benefit in having a full QOM object, then make memory regions objects.
 
 I see and sympathise with this point of view.
 
 The idea to use opaque and convert it to Object * is that often a device
 has multiple regions but no interest in managing them separately.  So
 managing regions will cause the device model authors to create
 boilerplate code to push region management to device management.
 
 I had this experience with the first iterations of the region interface,
 where instead of opaque we had MemoryRegion *.  Device code would use
 container_of() in the simple case, but the non-simple case caused lots
 of pain.  I believe it is the natural interface, but in practice it
 turned out to cause too much churn.
 
 I simply don't like this indirection, having the memory layer pick up
 the opaque value of the region and interpret it. 
 
 That's just an intermediate step.  The idea is that eventually opaque
 will be changed to Object *.
 
 Even worse, apply
 restrictions on how the dispatched objects, the regions, have to be
 treated because of this.
 
 Please elaborate.

The fact that you can't manipulate a memory region object arbitrarily
after removing it from the mapping because you track the references to
the object that the region points to, not the region itself. The region
remains in use by the dispatching layer and potentially the called
device, even after deregistration.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Avi Kivity
On 08/27/2012 12:38 PM, Jan Kiszka wrote:
  Even worse, apply
  restrictions on how the dispatched objects, the regions, have to be
  treated because of this.
  
  Please elaborate.

 The fact that you can't manipulate a memory region object arbitrarily
 after removing it from the mapping because you track the references to
 the object that the region points to, not the region itself. The region
 remains in use by the dispatching layer and potentially the called
 device, even after deregistration.

That object will be a container_of() the region, usually literally but
sometimes only in spirit.  Reference counting the regions means they
cannot be embedded into other objects any more.

We can probably figure out a way to flush out accesses.  After switching
to rcu, for example, all we need is synchronize_rcu() in a
non-deadlocking place.  But my bet is that it will not be needed.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Avi Kivity
On 08/27/2012 12:17 PM, Anthony Liguori wrote:
 Avi Kivity a...@redhat.com writes:

  On 08/27/2012 09:24 AM, Anthony Liguori wrote:
  
   I'm sure we should leave existing code alone wherever possible, focusing
   on providing alternative versions for those paths that matter. Example:
   Most timers are fine under BQL. But some sensitive devices (RTC or HPET
   as clock source) will want their own timers. So the approach is to
   instantiate a separate, also prioritizeable instance of the timer
   subsystem for them and be done.
 
  I disagree.  I think we conver the timer subsystem to be lockless and
  then let some devices acquire the BQL during dispatch.
 
  I agree with your disagreement but disagree with the rest.  The timer
  subsystem should have its own internal locking that callers will not be
  aware of.  Requiring devices to acquire the bql will lead to
  deadlocks.

 Err, I completely mispoke there.

 I meant, to say, we should convert the timer subsystem to be re-entrant
 and then let some devices acquire the BQL during dispatch.

That would be all devices, with converted devices not acquiring it as
they are converted.

 Existing users of the time API should get the BQL acquired automagically
 for them.  The new API would not hold the BQL during dispatch but the
 old API, implemented in terms of the new API, would acquire it on behalf
 of the caller.

 As a rule, if a device relies on the BQL, it must use the old APIs.  If
 a device uses the new APIs, it must not acquire the BQL.

This exactly mirrors the plan with memory region conversion.

  Note that fine-grained locking timers will also require reference
  counting: you want to call the timer expiration callback after releasing
  the timer subsystem lock, so you need to make sure the callback does not
  go away.

 glib simply uses a single main loop lock to deal with all of this.  I
 think that's absolutely the right model.

 It's best to start this conversion using very coarse locking.  There's
 no need to start with ultra fine grain locking.

How does it work?  You have to drop this main loop lock to dispatch the
callback, so the data structure you use for dispatch is no longer protected.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Paolo Bonzini
Il 27/08/2012 22:58, Avi Kivity ha scritto:
  It's best to start this conversion using very coarse locking.  There's
  no need to start with ultra fine grain locking.
 How does it work?  You have to drop this main loop lock to dispatch the
 callback, so the data structure you use for dispatch is no longer protected.

It is protected by a mixture of reference counting, thread-local storage and
careful coding to take care of re-entrancy.

I'm not too sure that we can use it, though.  Mostly because of the reasons Jan
mentioned (it may also not be precise enough, but we can fix that with our own
GSource wrapping POSIX real-time timers or Linux timerfds).

Paolo




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread Jan Kiszka
On 2012-08-27 22:53, Avi Kivity wrote:
 On 08/27/2012 12:38 PM, Jan Kiszka wrote:
 Even worse, apply
 restrictions on how the dispatched objects, the regions, have to be
 treated because of this.

 Please elaborate.

 The fact that you can't manipulate a memory region object arbitrarily
 after removing it from the mapping because you track the references to
 the object that the region points to, not the region itself. The region
 remains in use by the dispatching layer and potentially the called
 device, even after deregistration.
 
 That object will be a container_of() the region, usually literally but
 sometimes only in spirit.  Reference counting the regions means they
 cannot be embedded into other objects any more.

I cannot follow the logic of the last sentence. Reference counting just
means that we track if there are users left, not necessarily that we
perform asynchronous releases. We can simply wait for those users to
complete.

 
 We can probably figure out a way to flush out accesses.  After switching
 to rcu, for example, all we need is synchronize_rcu() in a
 non-deadlocking place.  But my bet is that it will not be needed.

If you properly flush out accesses, you don't need to track at device
level anymore - and mess with abstraction layers. That's my whole point.

Jan




signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread liu ping fan
On Tue, Aug 28, 2012 at 3:38 AM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2012-08-27 20:52, Avi Kivity wrote:
 On 08/27/2012 11:39 AM, Jan Kiszka wrote:
 On 2012-08-27 20:20, Avi Kivity wrote:
 On 08/27/2012 11:17 AM, Jan Kiszka wrote:
 On 2012-08-27 20:09, Avi Kivity wrote:
 On 08/27/2012 10:14 AM, Jan Kiszka wrote:

 Deregistration is fine, the problem is destruction.


 It isn't as you access memory region states that can change after
 deregistration. Devices can remove memory regions from the mapping,
 alter and then reinsert them. The last to steps must not happen while
 anyone is still using a reference to that region.


 Why not?  If the guest is accessing an mmio region while reconfiguring
 it in a way that changes its meaning, either the previous or the next
 meaning is valid.

 If the memory region owner sets the content to zero or even releases it
 (nothing states a memory region can only live inside a device
 structure), we will crash. Restricting how a memory region can be
 created and handled after it was once registered somewhere is an
 unnatural interface, waiting to cause subtle bugs.

 Using an Object * allows the simple case to be really simple (object ==
 device) and the hard cases to be doable.

 What would you suggest as a better interface?

 To protect the life cycle of the object we manage in the memory layer:
 regions. We don't manage devices there. If there is any implementation
 benefit in having a full QOM object, then make memory regions objects.

 I see and sympathise with this point of view.

 The idea to use opaque and convert it to Object * is that often a device
 has multiple regions but no interest in managing them separately.  So
 managing regions will cause the device model authors to create
 boilerplate code to push region management to device management.

 I had this experience with the first iterations of the region interface,
 where instead of opaque we had MemoryRegion *.  Device code would use
 container_of() in the simple case, but the non-simple case caused lots
 of pain.  I believe it is the natural interface, but in practice it
 turned out to cause too much churn.

 I simply don't like this indirection, having the memory layer pick up
 the opaque value of the region and interpret it.

 That's just an intermediate step.  The idea is that eventually opaque
 will be changed to Object *.

 Even worse, apply
 restrictions on how the dispatched objects, the regions, have to be
 treated because of this.

 Please elaborate.

 The fact that you can't manipulate a memory region object arbitrarily
 after removing it from the mapping because you track the references to
 the object that the region points to, not the region itself. The region
 remains in use by the dispatching layer and potentially the called
 device, even after deregistration.

Why can it in use by dispatching layer? Memory region just servers as
interface for mem lookup, when dispatching in mr's MemoryRegionOps,
why do we need to access it?  Which means that read/write can cause
register/deregister mr?  Example?
Also I think MemoryRegionOps will eventually operate on Object (mr is
only interface). MRs is only represent of Object in the view of
memory(so their life cycle can be managed by Object). And there could
be some intermediate mr like those lie in subpage_t which are not
based on Object. But we can walk down to the terminal point and
extract the real Object,
so when dispatching, we only need to ensure that Object and its direct
mr are alive.

Regards,
pingfan

 Jan

 --
 Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
 Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-27 Thread liu ping fan
On Tue, Aug 28, 2012 at 11:09 AM, liu ping fan qemul...@gmail.com wrote:
 On Tue, Aug 28, 2012 at 3:38 AM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2012-08-27 20:52, Avi Kivity wrote:
 On 08/27/2012 11:39 AM, Jan Kiszka wrote:
 On 2012-08-27 20:20, Avi Kivity wrote:
 On 08/27/2012 11:17 AM, Jan Kiszka wrote:
 On 2012-08-27 20:09, Avi Kivity wrote:
 On 08/27/2012 10:14 AM, Jan Kiszka wrote:

 Deregistration is fine, the problem is destruction.


 It isn't as you access memory region states that can change after
 deregistration. Devices can remove memory regions from the mapping,
 alter and then reinsert them. The last to steps must not happen while
 anyone is still using a reference to that region.


 Why not?  If the guest is accessing an mmio region while reconfiguring
 it in a way that changes its meaning, either the previous or the next
 meaning is valid.

 If the memory region owner sets the content to zero or even releases it
 (nothing states a memory region can only live inside a device
 structure), we will crash. Restricting how a memory region can be
 created and handled after it was once registered somewhere is an
 unnatural interface, waiting to cause subtle bugs.

 Using an Object * allows the simple case to be really simple (object ==
 device) and the hard cases to be doable.

 What would you suggest as a better interface?

 To protect the life cycle of the object we manage in the memory layer:
 regions. We don't manage devices there. If there is any implementation
 benefit in having a full QOM object, then make memory regions objects.

 I see and sympathise with this point of view.

 The idea to use opaque and convert it to Object * is that often a device
 has multiple regions but no interest in managing them separately.  So
 managing regions will cause the device model authors to create
 boilerplate code to push region management to device management.

 I had this experience with the first iterations of the region interface,
 where instead of opaque we had MemoryRegion *.  Device code would use
 container_of() in the simple case, but the non-simple case caused lots
 of pain.  I believe it is the natural interface, but in practice it
 turned out to cause too much churn.

 I simply don't like this indirection, having the memory layer pick up
 the opaque value of the region and interpret it.

 That's just an intermediate step.  The idea is that eventually opaque
 will be changed to Object *.

 Even worse, apply
 restrictions on how the dispatched objects, the regions, have to be
 treated because of this.

 Please elaborate.

 The fact that you can't manipulate a memory region object arbitrarily
 after removing it from the mapping because you track the references to
 the object that the region points to, not the region itself. The region
 remains in use by the dispatching layer and potentially the called
 device, even after deregistration.

 Why can it in use by dispatching layer? Memory region just servers as
 interface for mem lookup, when dispatching in mr's MemoryRegionOps,
 why do we need to access it?  Which means that read/write can cause
 register/deregister mr?  Example?

I found example in pci_update_mappings(), but but my following point
still stand.
 Also I think MemoryRegionOps will eventually operate on Object (mr is
 only interface). MRs is only represent of Object in the view of
 memory(so their life cycle can be managed by Object). And there could
 be some intermediate mr like those lie in subpage_t which are not
 based on Object. But we can walk down to the terminal point and
 extract the real Object,
 so when dispatching, we only need to ensure that Object and its direct
 mr are alive.

 Regards,
 pingfan

 Jan

 --
 Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
 Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-25 Thread liu ping fan
On Fri, Aug 24, 2012 at 10:42 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 24/08/2012 11:49, Liu Ping Fan ha scritto:
 With this patch, we can protect PCIIDEState from disappearing during
 mmio-dispatch hold the IDEBus-ref.

 I don't see why MMIO dispatch should hold the IDEBus ref rather than the
 PCIIDEState.

When transfer memory_region_init_io()  3rd para from void* opaque to
Object* obj,  the obj : opaque is not neccessary 1:1 map. For such
situation, in order to let MemoryRegionOps tell between them, we
should pass PCIIDEState-bus[0], bus[1] separately.

 In the case of the PIIX, the BARs are set up by the PCIIDEState in
 bmdma_setup_bar (called by bmdma_setup_bar).

Supposing we have convert  PCIIDEState-bmdma[0]/[1] to Object. And in
mmio-dispatch, object_ref will impose on bmdma[0/[1], but this can not
prevent  PCIIDEState-refcnt=0, and then the whole object disappear!

Thanks and regards,
pingfan


 Also, containment may happen just as well for devices, not buses.  Why
 isn't it a problem in that case?  It looks like you're papering over a
 different bug.

 Paolo

 And the ref circle has been broken when calling qdev_delete_subtree().

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  hw/qdev.c |2 ++
  hw/qdev.h |1 +
  2 files changed, 3 insertions(+), 0 deletions(-)

 diff --git a/hw/qdev.c b/hw/qdev.c
 index e2339a1..b09ebbf 100644
 --- a/hw/qdev.c
 +++ b/hw/qdev.c
 @@ -510,6 +510,8 @@ void qbus_create_inplace(BusState *bus, const char 
 *typename,
  {
  object_initialize(bus, typename);

 +bus-overlap = parent;
 +object_ref(OBJECT(bus-overlap));
  bus-parent = parent;
  bus-name = name ? g_strdup(name) : NULL;
  qbus_realize(bus);
 diff --git a/hw/qdev.h b/hw/qdev.h
 index 182cfa5..9bc5783 100644
 --- a/hw/qdev.h
 +++ b/hw/qdev.h
 @@ -117,6 +117,7 @@ struct BusState {
  int allow_hotplug;
  bool qom_allocated;
  bool glib_allocated;
 +DeviceState *overlap;
  int max_index;
  QTAILQ_HEAD(ChildrenHead, BusChild) children;
  QLIST_ENTRY(BusState) sibling;
 -- 1.7.4.4




[Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-24 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Scene:
  obja lies in objA, when objA's ref-0, it will be freed,
but at that time obja can still be in use.

The real example is:
typedef struct PCIIDEState {
PCIDevice dev;
IDEBus bus[2]; -- create in place
.
}

When without big lock protection for mmio-dispatch, we will hold
obj's refcnt. So memory_region_init_io() will replace the third para
void *opaque with Object *obj.
With this patch, we can protect PCIIDEState from disappearing during
mmio-dispatch hold the IDEBus-ref.

And the ref circle has been broken when calling qdev_delete_subtree().

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 hw/qdev.c |2 ++
 hw/qdev.h |1 +
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/hw/qdev.c b/hw/qdev.c
index e2339a1..b09ebbf 100644
--- a/hw/qdev.c
+++ b/hw/qdev.c
@@ -510,6 +510,8 @@ void qbus_create_inplace(BusState *bus, const char 
*typename,
 {
 object_initialize(bus, typename);
 
+bus-overlap = parent;
+object_ref(OBJECT(bus-overlap));
 bus-parent = parent;
 bus-name = name ? g_strdup(name) : NULL;
 qbus_realize(bus);
diff --git a/hw/qdev.h b/hw/qdev.h
index 182cfa5..9bc5783 100644
--- a/hw/qdev.h
+++ b/hw/qdev.h
@@ -117,6 +117,7 @@ struct BusState {
 int allow_hotplug;
 bool qom_allocated;
 bool glib_allocated;
+DeviceState *overlap;
 int max_index;
 QTAILQ_HEAD(ChildrenHead, BusChild) children;
 QLIST_ENTRY(BusState) sibling;
-- 
1.7.4.4




Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem

2012-08-24 Thread Paolo Bonzini
Il 24/08/2012 11:49, Liu Ping Fan ha scritto:
 With this patch, we can protect PCIIDEState from disappearing during
 mmio-dispatch hold the IDEBus-ref.

I don't see why MMIO dispatch should hold the IDEBus ref rather than the
PCIIDEState.

In the case of the PIIX, the BARs are set up by the PCIIDEState in
bmdma_setup_bar (called by bmdma_setup_bar).

Also, containment may happen just as well for devices, not buses.  Why
isn't it a problem in that case?  It looks like you're papering over a
different bug.

Paolo

 And the ref circle has been broken when calling qdev_delete_subtree().
 
 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  hw/qdev.c |2 ++
  hw/qdev.h |1 +
  2 files changed, 3 insertions(+), 0 deletions(-)
 
 diff --git a/hw/qdev.c b/hw/qdev.c
 index e2339a1..b09ebbf 100644
 --- a/hw/qdev.c
 +++ b/hw/qdev.c
 @@ -510,6 +510,8 @@ void qbus_create_inplace(BusState *bus, const char 
 *typename,
  {
  object_initialize(bus, typename);
  
 +bus-overlap = parent;
 +object_ref(OBJECT(bus-overlap));
  bus-parent = parent;
  bus-name = name ? g_strdup(name) : NULL;
  qbus_realize(bus);
 diff --git a/hw/qdev.h b/hw/qdev.h
 index 182cfa5..9bc5783 100644
 --- a/hw/qdev.h
 +++ b/hw/qdev.h
 @@ -117,6 +117,7 @@ struct BusState {
  int allow_hotplug;
  bool qom_allocated;
  bool glib_allocated;
 +DeviceState *overlap;
  int max_index;
  QTAILQ_HEAD(ChildrenHead, BusChild) children;
  QLIST_ENTRY(BusState) sibling;
 -- 1.7.4.4