Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 09/05/2012 03:02 PM, Jan Kiszka wrote: > On 2012-09-05 13:25, Avi Kivity wrote: >> On 09/05/2012 02:11 PM, Jan Kiszka wrote: >>> On 2012-09-05 12:53, Avi Kivity wrote: On 09/05/2012 01:36 PM, Jan Kiszka wrote: >> >> My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a >> corresponding unref), which has the following requirements: >> >> - if the refcount is nonzero, MemoryRegion::opaque is safe to use >> - if the refcount is nonzero, the MemoryRegion itself is stable. > > The second point means that the memory subsystem will cache the region > state and apply changes only after leaving a handler that performed them? No. I/O callbacks may be called after a region has been disabled. I guess we can restrict this to converted regions, so we don't introduce regressions. >>> >>> s/can/have to/. This property change will require some special care, >>> hopefully mostly at the memory layer. A simple scenario from recent patches: >>> >>> if () { >>> memory_region_set_address(&s->pm_io, pm_io_base); >>> memory_region_set_enabled(&s->pm_io, true); >>> } else { >>> memory_region_set_enabled(&s->pm_io, false); >>> } >> >> I am unable to avoid pointing out that this can be collapsed to >> >> memory_region_set_address(&s->pm_io, pm_io_base); >> memory_region_set_enabled(&s->pm_io, ); >> >> as the address is meaningless when disabled. Sorry. >> >>> >>> We will have to ensure that set_enabled(..., true) will never cause a >>> dispatch using an outdated base address. >> >> No, this is entirely safe. If the guest changes a region offset >> concurrently with issuing mmio on it, then it must expect either the old >> or new offset to be used during dispatch. > I forgot to mention that my clever rewrite above actually breaks this - it needs to be wrapped in a transaction, otherwise we have a move-then-disable pattern. > You assume disable, reprogram, enable, ignoring that there could be > multiple, invalid states between disable and enable. Real hardware takes > care of the ordering. It's possible of course, but the snippet above isn't susceptible on its own. I don't think it's solvable. To serialize, you must hold the device lock, but we don't want to take the device lock during dispatch. Users can protect against them by checking for ->enabled: void foo_io_read(...) { FooState *s = container_of(mr, ...); qemu_mutex_lock(&s->lock); if (!memory_region_enabled(mr)) { *data = ~(uint64_t)0; goto out; } ... out: qemu_mutex_unlock(&s->lock); } Non-converted users will be naturally protected since we will take the bql on their behalf. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-09-05 13:25, Avi Kivity wrote: > On 09/05/2012 02:11 PM, Jan Kiszka wrote: >> On 2012-09-05 12:53, Avi Kivity wrote: >>> On 09/05/2012 01:36 PM, Jan Kiszka wrote: > > My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a > corresponding unref), which has the following requirements: > > - if the refcount is nonzero, MemoryRegion::opaque is safe to use > - if the refcount is nonzero, the MemoryRegion itself is stable. The second point means that the memory subsystem will cache the region state and apply changes only after leaving a handler that performed them? >>> >>> No. I/O callbacks may be called after a region has been disabled. >>> >>> I guess we can restrict this to converted regions, so we don't introduce >>> regressions. >> >> s/can/have to/. This property change will require some special care, >> hopefully mostly at the memory layer. A simple scenario from recent patches: >> >> if () { >> memory_region_set_address(&s->pm_io, pm_io_base); >> memory_region_set_enabled(&s->pm_io, true); >> } else { >> memory_region_set_enabled(&s->pm_io, false); >> } > > I am unable to avoid pointing out that this can be collapsed to > > memory_region_set_address(&s->pm_io, pm_io_base); > memory_region_set_enabled(&s->pm_io, ); > > as the address is meaningless when disabled. Sorry. > >> >> We will have to ensure that set_enabled(..., true) will never cause a >> dispatch using an outdated base address. > > No, this is entirely safe. If the guest changes a region offset > concurrently with issuing mmio on it, then it must expect either the old > or new offset to be used during dispatch. You assume disable, reprogram, enable, ignoring that there could be multiple, invalid states between disable and enable. Real hardware takes care of the ordering. > In either case, the correct > intra-region offset will be provided to the I/O callback (no volatile > MemoryRegion fields except ->readable (IIRC) are used during dispatch - > the rest are all copied into data structures used during dispatch (this > is part of what makes the whole thing so rcu friendly). > >> I think discussing semantics and usage patterns of the new memory API - >> outside of the BQL - will be the next big topic. ;) > > I hope it won't prove to be that complicated. > Yeah, let's hope this... Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 09/05/2012 02:11 PM, Jan Kiszka wrote: > On 2012-09-05 12:53, Avi Kivity wrote: >> On 09/05/2012 01:36 PM, Jan Kiszka wrote: My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a corresponding unref), which has the following requirements: - if the refcount is nonzero, MemoryRegion::opaque is safe to use - if the refcount is nonzero, the MemoryRegion itself is stable. >>> >>> The second point means that the memory subsystem will cache the region >>> state and apply changes only after leaving a handler that performed them? >> >> No. I/O callbacks may be called after a region has been disabled. >> >> I guess we can restrict this to converted regions, so we don't introduce >> regressions. > > s/can/have to/. This property change will require some special care, > hopefully mostly at the memory layer. A simple scenario from recent patches: > > if () { > memory_region_set_address(&s->pm_io, pm_io_base); > memory_region_set_enabled(&s->pm_io, true); > } else { > memory_region_set_enabled(&s->pm_io, false); > } I am unable to avoid pointing out that this can be collapsed to memory_region_set_address(&s->pm_io, pm_io_base); memory_region_set_enabled(&s->pm_io, ); as the address is meaningless when disabled. Sorry. > > We will have to ensure that set_enabled(..., true) will never cause a > dispatch using an outdated base address. No, this is entirely safe. If the guest changes a region offset concurrently with issuing mmio on it, then it must expect either the old or new offset to be used during dispatch. In either case, the correct intra-region offset will be provided to the I/O callback (no volatile MemoryRegion fields except ->readable (IIRC) are used during dispatch - the rest are all copied into data structures used during dispatch (this is part of what makes the whole thing so rcu friendly). > I think discussing semantics and usage patterns of the new memory API - > outside of the BQL - will be the next big topic. ;) I hope it won't prove to be that complicated. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-09-05 12:53, Avi Kivity wrote: > On 09/05/2012 01:36 PM, Jan Kiszka wrote: >>> >>> My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a >>> corresponding unref), which has the following requirements: >>> >>> - if the refcount is nonzero, MemoryRegion::opaque is safe to use >>> - if the refcount is nonzero, the MemoryRegion itself is stable. >> >> The second point means that the memory subsystem will cache the region >> state and apply changes only after leaving a handler that performed them? > > No. I/O callbacks may be called after a region has been disabled. > > I guess we can restrict this to converted regions, so we don't introduce > regressions. s/can/have to/. This property change will require some special care, hopefully mostly at the memory layer. A simple scenario from recent patches: if () { memory_region_set_address(&s->pm_io, pm_io_base); memory_region_set_enabled(&s->pm_io, true); } else { memory_region_set_enabled(&s->pm_io, false); } We will have to ensure that set_enabled(..., true) will never cause a dispatch using an outdated base address. I think discussing semantics and usage patterns of the new memory API - outside of the BQL - will be the next big topic. ;) Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 09/05/2012 01:36 PM, Jan Kiszka wrote: >> >> My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a >> corresponding unref), which has the following requirements: >> >> - if the refcount is nonzero, MemoryRegion::opaque is safe to use >> - if the refcount is nonzero, the MemoryRegion itself is stable. > > The second point means that the memory subsystem will cache the region > state and apply changes only after leaving a handler that performed them? No. I/O callbacks may be called after a region has been disabled. I guess we can restrict this to converted regions, so we don't introduce regressions. >> As outlined above, I now prefer MemoryRegionOps::ref/unref. >> >> Advantages compared to MemoryRegionOps::object(): >> - decoupled from the QOM framework >> >> Disadvantages: >> - more boilerplate code in converted devices >> >> Since we are likely to convert few devices to lockless dispatch, I >> believe the tradeoff is justified. But let's hear Jan and the others. > > I still need to dig into related posts of the past days, lost track, but > the above sounds much better. > > Besides the question of what to protect and how, one important thing > IMHO is that we are able to switch locking behaviour on a per region > basis. And that should also be feasible this way. It was also possible with MemoryRegionOps::object() - no one said all regions for a device have to refer to the same object (and there is a difference between locking and reference counting, each callback could take a different lock). But it is more natural with MemoryRegionOps::ref(). -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-09-05 11:52, Avi Kivity wrote: > On 09/05/2012 11:19 AM, liu ping fan wrote: >> [...] >>> We could externalize the refcounting and push it into device code. This >>> means: >>> >>>memory_region_init_io(&s->mem, dev) >>> >>>... >>> >>>object_ref(dev) >>>memory_region_add_subregion(..., &dev->mr) >>> >>>... >>> >>>memory_region_del_subregion(..., &dev->mr) // implied flush >>>object_unref(dev) >>> >> I think "object_ref(dev)" just another style to push >> MemoryRegionOps::object() to device level. And I think we can bypass >> it. The caller (unplug, pci-reconfig ) of >> memory_region_del_subregion() ensure the @dev is valid. >> If the implied flush is implemented in synchronize, _del_subregion() >> will guarantee no reader for @dev->mr and @dev exist any longer. > > The above code has a deadlock. memory_region_del_subregion() may be > called under the device lock (since it may be the result of mmio to the > device), and if the flush uses synchronized_rcu(), it will wait forever > for the read-side critical section to complete. > But if _del_subregion() just wait for mr-X quiescent period, while calling in mr-Y's read side critical section, then we can avoid deadlock. I saw in pci-mapping, we delete mr-X in mr-Y read side. >> So I >> think we can save both object_ref/unref(dev) for memory system. >> The only problem is that whether we can implement it as synchronous or >> not, is it possible that we can launch a _del_subregion(mr-X) in >> mr-X's dispatcher? > > Yes. Real cases exist. Oh, I find the sample code, then, the deadlock is unavoidable in this method. > > What alternatives remain? > I think a way out may be async+refcnt >>> If we consider the relationship of MemoryRegionImpl and device as the >>> one between file and file->private_data in Linux. Then the creation >>> of impl will object_ref(dev) and when impl->ref=0, it will >>> object_unref(dev) >>> But this is an async model, for those client which need to know the >>> end of flush, we can adopt callback just like call_rcu(). >>> >>> >> During this thread, it seems that no matter we adopt refcnt on >> MemoryRegionImpl or not, protecting MemoryRegion::opaque during >> dispatch is still required. > > Right. So MemoryRegionImpl is pointless. > > My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a > corresponding unref), which has the following requirements: > > - if the refcount is nonzero, MemoryRegion::opaque is safe to use > - if the refcount is nonzero, the MemoryRegion itself is stable. The second point means that the memory subsystem will cache the region state and apply changes only after leaving a handler that performed them? > > Later we can change other callbacks to also use mr instead of opaque. > Usually the callback will be able to derive the device object from the > region, only special cases will require > memory_region_opaque(MemoryRegion *). We can even drop the opaque from > memory_region_init_io() and replace it with memory_region_set_opaque(), > to be used in those special cases. > >> It is challenging to substitute >> memory_region_init_io() 's 3rd parameter from void* to Object*, owning >> to the conflict that life-cycle management need the host of opaque, >> while Object and mr need 1:1 map. So I think, I will move forward on >> the way based on MemoryRegionOps::object(). Right? > > As outlined above, I now prefer MemoryRegionOps::ref/unref. > > Advantages compared to MemoryRegionOps::object(): > - decoupled from the QOM framework > > Disadvantages: > - more boilerplate code in converted devices > > Since we are likely to convert few devices to lockless dispatch, I > believe the tradeoff is justified. But let's hear Jan and the others. I still need to dig into related posts of the past days, lost track, but the above sounds much better. Besides the question of what to protect and how, one important thing IMHO is that we are able to switch locking behaviour on a per region basis. And that should also be feasible this way. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 09/05/2012 11:19 AM, liu ping fan wrote: > [...] >>> >> We could externalize the refcounting and push it into device code. This >> means: >> >>memory_region_init_io(&s->mem, dev) >> >>... >> >>object_ref(dev) >>memory_region_add_subregion(..., &dev->mr) >> >>... >> >>memory_region_del_subregion(..., &dev->mr) // implied flush >>object_unref(dev) >> > I think "object_ref(dev)" just another style to push > MemoryRegionOps::object() to device level. And I think we can bypass > it. The caller (unplug, pci-reconfig ) of > memory_region_del_subregion() ensure the @dev is valid. > If the implied flush is implemented in synchronize, _del_subregion() > will guarantee no reader for @dev->mr and @dev exist any longer. The above code has a deadlock. memory_region_del_subregion() may be called under the device lock (since it may be the result of mmio to the device), and if the flush uses synchronized_rcu(), it will wait forever for the read-side critical section to complete. >>> But if _del_subregion() just wait for mr-X quiescent period, while >>> calling in mr-Y's read side critical section, then we can avoid >>> deadlock. I saw in pci-mapping, we delete mr-X in mr-Y read side. >>> > So I > think we can save both object_ref/unref(dev) for memory system. > The only problem is that whether we can implement it as synchronous or > not, is it possible that we can launch a _del_subregion(mr-X) in > mr-X's dispatcher? Yes. Real cases exist. >>> >>> Oh, I find the sample code, then, the deadlock is unavoidable in this >>> method. What alternatives remain? >>> I think a way out may be async+refcnt >>> >> If we consider the relationship of MemoryRegionImpl and device as the >> one between file and file->private_data in Linux. Then the creation >> of impl will object_ref(dev) and when impl->ref=0, it will >> object_unref(dev) >> But this is an async model, for those client which need to know the >> end of flush, we can adopt callback just like call_rcu(). >> >> > During this thread, it seems that no matter we adopt refcnt on > MemoryRegionImpl or not, protecting MemoryRegion::opaque during > dispatch is still required. Right. So MemoryRegionImpl is pointless. My current preference is MemoryRegionOps::ref(MemoryRegion *mr) (and a corresponding unref), which has the following requirements: - if the refcount is nonzero, MemoryRegion::opaque is safe to use - if the refcount is nonzero, the MemoryRegion itself is stable. Later we can change other callbacks to also use mr instead of opaque. Usually the callback will be able to derive the device object from the region, only special cases will require memory_region_opaque(MemoryRegion *). We can even drop the opaque from memory_region_init_io() and replace it with memory_region_set_opaque(), to be used in those special cases. > It is challenging to substitute > memory_region_init_io() 's 3rd parameter from void* to Object*, owning > to the conflict that life-cycle management need the host of opaque, > while Object and mr need 1:1 map. So I think, I will move forward on > the way based on MemoryRegionOps::object(). Right? As outlined above, I now prefer MemoryRegionOps::ref/unref. Advantages compared to MemoryRegionOps::object(): - decoupled from the QOM framework Disadvantages: - more boilerplate code in converted devices Since we are likely to convert few devices to lockless dispatch, I believe the tradeoff is justified. But let's hear Jan and the others. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
[...] >> > We could externalize the refcounting and push it into device code. This > means: > >memory_region_init_io(&s->mem, dev) > >... > >object_ref(dev) >memory_region_add_subregion(..., &dev->mr) > >... > >memory_region_del_subregion(..., &dev->mr) // implied flush >object_unref(dev) > I think "object_ref(dev)" just another style to push MemoryRegionOps::object() to device level. And I think we can bypass it. The caller (unplug, pci-reconfig ) of memory_region_del_subregion() ensure the @dev is valid. If the implied flush is implemented in synchronize, _del_subregion() will guarantee no reader for @dev->mr and @dev exist any longer. >>> >>> The above code has a deadlock. memory_region_del_subregion() may be >>> called under the device lock (since it may be the result of mmio to the >>> device), and if the flush uses synchronized_rcu(), it will wait forever >>> for the read-side critical section to complete. >>> >> But if _del_subregion() just wait for mr-X quiescent period, while >> calling in mr-Y's read side critical section, then we can avoid >> deadlock. I saw in pci-mapping, we delete mr-X in mr-Y read side. >> So I think we can save both object_ref/unref(dev) for memory system. The only problem is that whether we can implement it as synchronous or not, is it possible that we can launch a _del_subregion(mr-X) in mr-X's dispatcher? >>> >>> Yes. Real cases exist. >> >> Oh, I find the sample code, then, the deadlock is unavoidable in this method. >>> >>> What alternatives remain? >>> >> I think a way out may be async+refcnt >> > If we consider the relationship of MemoryRegionImpl and device as the > one between file and file->private_data in Linux. Then the creation > of impl will object_ref(dev) and when impl->ref=0, it will > object_unref(dev) > But this is an async model, for those client which need to know the > end of flush, we can adopt callback just like call_rcu(). > > During this thread, it seems that no matter we adopt refcnt on MemoryRegionImpl or not, protecting MemoryRegion::opaque during dispatch is still required. It is challenging to substitute memory_region_init_io() 's 3rd parameter from void* to Object*, owning to the conflict that life-cycle management need the host of opaque, while Object and mr need 1:1 map. So I think, I will move forward on the way based on MemoryRegionOps::object(). Right? Regards, pingfan > >> Regards, >> pingfan >>> -- >>> error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On Mon, Sep 3, 2012 at 6:06 PM, liu ping fan wrote: > On Mon, Sep 3, 2012 at 4:52 PM, Avi Kivity wrote: >> On 09/03/2012 10:44 AM, liu ping fan wrote: > If we make the refcount/lock internal to the region, we must remove the opaque, since the region won't protect it. Replacing the opaque with container_of(mr) doesn't help, since we can't refcount mr, only mr->impl. >>> I think you mean if using MemoryRegionImpl, then at this level, we had >>> better not touch the refcnt of container_of(mr) and should face the >>> mr->impl->refcnt. Right? >> >> I did not understand the second part, sorry. >> > My understanding of "Replacing the opaque with container_of(mr) > doesn't help, since we can't refcount mr, only > mr->impl." is that although Object_ref(container_of(mr)) can help us > to protect it from disappearing. But apparently it is not right place > to do it it in memory core. Do I catch you meaning? > We could externalize the refcounting and push it into device code. This means: memory_region_init_io(&s->mem, dev) ... object_ref(dev) memory_region_add_subregion(..., &dev->mr) ... memory_region_del_subregion(..., &dev->mr) // implied flush object_unref(dev) >>> I think "object_ref(dev)" just another style to push >>> MemoryRegionOps::object() to device level. And I think we can bypass >>> it. The caller (unplug, pci-reconfig ) of >>> memory_region_del_subregion() ensure the @dev is valid. >>> If the implied flush is implemented in synchronize, _del_subregion() >>> will guarantee no reader for @dev->mr and @dev exist any longer. >> >> The above code has a deadlock. memory_region_del_subregion() may be >> called under the device lock (since it may be the result of mmio to the >> device), and if the flush uses synchronized_rcu(), it will wait forever >> for the read-side critical section to complete. >> > But if _del_subregion() just wait for mr-X quiescent period, while > calling in mr-Y's read side critical section, then we can avoid > deadlock. I saw in pci-mapping, we delete mr-X in mr-Y read side. > >>> So I >>> think we can save both object_ref/unref(dev) for memory system. >>> The only problem is that whether we can implement it as synchronous or >>> not, is it possible that we can launch a _del_subregion(mr-X) in >>> mr-X's dispatcher? >> >> Yes. Real cases exist. > > Oh, I find the sample code, then, the deadlock is unavoidable in this method. >> >> What alternatives remain? >> > I think a way out may be async+refcnt > If we consider the relationship of MemoryRegionImpl and device as the one between file and file->private_data in Linux. Then the creation of impl will object_ref(dev) and when impl->ref=0, it will object_unref(dev) But this is an async model, for those client which need to know the end of flush, we can adopt callback just like call_rcu(). > Regards, > pingfan >> -- >> error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On Mon, Sep 3, 2012 at 6:16 PM, Avi Kivity wrote: > On 09/03/2012 01:06 PM, liu ping fan wrote: >> On Mon, Sep 3, 2012 at 4:52 PM, Avi Kivity wrote: >>> On 09/03/2012 10:44 AM, liu ping fan wrote: >> > > If we make the refcount/lock internal to the region, we must remove the > opaque, since the region won't protect it. > > Replacing the opaque with container_of(mr) doesn't help, since we can't > refcount mr, only mr->impl. > I think you mean if using MemoryRegionImpl, then at this level, we had better not touch the refcnt of container_of(mr) and should face the mr->impl->refcnt. Right? >>> >>> I did not understand the second part, sorry. >>> >> My understanding of "Replacing the opaque with container_of(mr) >> doesn't help, since we can't refcount mr, only >> mr->impl." is that although Object_ref(container_of(mr)) can help us >> to protect it from disappearing. But apparently it is not right place >> to do it it in memory core. Do I catch you meaning? > > We cannot call container_of(mr) in the memory core, because the > parameters to container_of() are not known. But that is not the main issue. > > Something we can do is make MemoryRegionOps::object() take a mr > parameter instead of opaque. MemoryRegionOps::object() then uses > container_of() to derive the object to ref. > > (works with MemoryRegionOps::ref()/MemoryRegionOps::unref() too; Jan, > this decouples Object from MemoryRegion at the cost of extra boilerplate > in client code, but it may be worthwhile as a temporary measure while we > gain more experience with this) > Exactly catch your meaning, thanks. >> > We could externalize the refcounting and push it into device code. This > means: > >memory_region_init_io(&s->mem, dev) > >... > >object_ref(dev) >memory_region_add_subregion(..., &dev->mr) > >... > >memory_region_del_subregion(..., &dev->mr) // implied flush >object_unref(dev) > I think "object_ref(dev)" just another style to push MemoryRegionOps::object() to device level. And I think we can bypass it. The caller (unplug, pci-reconfig ) of memory_region_del_subregion() ensure the @dev is valid. If the implied flush is implemented in synchronize, _del_subregion() will guarantee no reader for @dev->mr and @dev exist any longer. >>> >>> The above code has a deadlock. memory_region_del_subregion() may be >>> called under the device lock (since it may be the result of mmio to the >>> device), and if the flush uses synchronized_rcu(), it will wait forever >>> for the read-side critical section to complete. >>> >> But if _del_subregion() just wait for mr-X quiescent period, while >> calling in mr-Y's read side critical section, then we can avoid >> deadlock. I saw in pci-mapping, we delete mr-X in mr-Y read side. > > Look at cirrus-vga.c, there are many calls to the memory API there. > While I doubt that we have one region disabling itself (or a subregion > of itself), that's all code that would be run under the same device > lock, and so would deadlock. > Oh, yes, quiescent time will never come since reader wait for the lock! Got it, thanks. pingfan > > -- > error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 09/03/2012 01:06 PM, liu ping fan wrote: > On Mon, Sep 3, 2012 at 4:52 PM, Avi Kivity wrote: >> On 09/03/2012 10:44 AM, liu ping fan wrote: > If we make the refcount/lock internal to the region, we must remove the opaque, since the region won't protect it. Replacing the opaque with container_of(mr) doesn't help, since we can't refcount mr, only mr->impl. >>> I think you mean if using MemoryRegionImpl, then at this level, we had >>> better not touch the refcnt of container_of(mr) and should face the >>> mr->impl->refcnt. Right? >> >> I did not understand the second part, sorry. >> > My understanding of "Replacing the opaque with container_of(mr) > doesn't help, since we can't refcount mr, only > mr->impl." is that although Object_ref(container_of(mr)) can help us > to protect it from disappearing. But apparently it is not right place > to do it it in memory core. Do I catch you meaning? We cannot call container_of(mr) in the memory core, because the parameters to container_of() are not known. But that is not the main issue. Something we can do is make MemoryRegionOps::object() take a mr parameter instead of opaque. MemoryRegionOps::object() then uses container_of() to derive the object to ref. (works with MemoryRegionOps::ref()/MemoryRegionOps::unref() too; Jan, this decouples Object from MemoryRegion at the cost of extra boilerplate in client code, but it may be worthwhile as a temporary measure while we gain more experience with this) > We could externalize the refcounting and push it into device code. This means: memory_region_init_io(&s->mem, dev) ... object_ref(dev) memory_region_add_subregion(..., &dev->mr) ... memory_region_del_subregion(..., &dev->mr) // implied flush object_unref(dev) >>> I think "object_ref(dev)" just another style to push >>> MemoryRegionOps::object() to device level. And I think we can bypass >>> it. The caller (unplug, pci-reconfig ) of >>> memory_region_del_subregion() ensure the @dev is valid. >>> If the implied flush is implemented in synchronize, _del_subregion() >>> will guarantee no reader for @dev->mr and @dev exist any longer. >> >> The above code has a deadlock. memory_region_del_subregion() may be >> called under the device lock (since it may be the result of mmio to the >> device), and if the flush uses synchronized_rcu(), it will wait forever >> for the read-side critical section to complete. >> > But if _del_subregion() just wait for mr-X quiescent period, while > calling in mr-Y's read side critical section, then we can avoid > deadlock. I saw in pci-mapping, we delete mr-X in mr-Y read side. Look at cirrus-vga.c, there are many calls to the memory API there. While I doubt that we have one region disabling itself (or a subregion of itself), that's all code that would be run under the same device lock, and so would deadlock. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On Mon, Sep 3, 2012 at 4:52 PM, Avi Kivity wrote: > On 09/03/2012 10:44 AM, liu ping fan wrote: >>> >>> If we make the refcount/lock internal to the region, we must remove the >>> opaque, since the region won't protect it. >>> >>> Replacing the opaque with container_of(mr) doesn't help, since we can't >>> refcount mr, only mr->impl. >>> >> I think you mean if using MemoryRegionImpl, then at this level, we had >> better not touch the refcnt of container_of(mr) and should face the >> mr->impl->refcnt. Right? > > I did not understand the second part, sorry. > My understanding of "Replacing the opaque with container_of(mr) doesn't help, since we can't refcount mr, only mr->impl." is that although Object_ref(container_of(mr)) can help us to protect it from disappearing. But apparently it is not right place to do it it in memory core. Do I catch you meaning? >>> We could externalize the refcounting and push it into device code. This >>> means: >>> >>>memory_region_init_io(&s->mem, dev) >>> >>>... >>> >>>object_ref(dev) >>>memory_region_add_subregion(..., &dev->mr) >>> >>>... >>> >>>memory_region_del_subregion(..., &dev->mr) // implied flush >>>object_unref(dev) >>> >> I think "object_ref(dev)" just another style to push >> MemoryRegionOps::object() to device level. And I think we can bypass >> it. The caller (unplug, pci-reconfig ) of >> memory_region_del_subregion() ensure the @dev is valid. >> If the implied flush is implemented in synchronize, _del_subregion() >> will guarantee no reader for @dev->mr and @dev exist any longer. > > The above code has a deadlock. memory_region_del_subregion() may be > called under the device lock (since it may be the result of mmio to the > device), and if the flush uses synchronized_rcu(), it will wait forever > for the read-side critical section to complete. > But if _del_subregion() just wait for mr-X quiescent period, while calling in mr-Y's read side critical section, then we can avoid deadlock. I saw in pci-mapping, we delete mr-X in mr-Y read side. >> So I >> think we can save both object_ref/unref(dev) for memory system. >> The only problem is that whether we can implement it as synchronous or >> not, is it possible that we can launch a _del_subregion(mr-X) in >> mr-X's dispatcher? > > Yes. Real cases exist. Oh, I find the sample code, then, the deadlock is unavoidable in this method. > > What alternatives remain? > I think a way out may be async+refcnt Regards, pingfan > -- > error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 08/29/2012 08:41 PM, Jan Kiszka wrote: >>> >>> memory_region_transaction_commit (or calls that trigger this) will have >>> to wait. That is required as the caller may start fiddling with the >>> region right afterward. >> >> However, it may be running within an mmio dispatch, so it may wait forever. > > We won't be able to wait for devices that can acquire the same lock we > hold while waiting, of course. That's why a lock ordering device-lock -> > BQL (the one we hold over memory region changes) won't be allowed. I was > wrong on this before: We must not take course-grained locks while > holding fine-grained ones, only the other way around. Pretty obvious, > specifically for realtime / low-latency. So how do we wait? Detecting that we're in the same thread and allowing reentrancy in that case in the option, but I can't say I like it much. >> >> Ignoring this, how does it wait? sleep()? or wait for a completion? > > Ideally via completion. Then you have to manage the life cycle of the completion object. > >> Usually a reference count controls the lifetime of the reference counted object, what are you suggesting here? >>> >>> To synchronize on reference count going to zero. >> >> Quite unorthodox. Do you have real life examples? > > synchronize_rcu. This whole thing started because synchronize_rcu() will deadlock if called from a read-side critical section. The fix was taking the reference, so that the read-side critical section is confined to the lookup, not the call. This way the lock ordering is always device->map. We could schedule a bh or a thread and call synchronize_rcu() from there, but then the commit is deferred to an unknown time, whereas we need to guarantee that by the time the guest is reentered, the transaction is committed. > >> >>> Or all readers leaving >>> the read-side critical sections. >> >> This is rcu. But again, we may be in an rcu read-side critical section >> while needing to wait. In fact this was what I suggested in the first >> place, until Marcelo pointed out the deadlock, so we came up with the >> refcount. > > We must avoid that deadlock scenario. I bended my brain around it, and I > think that is the only answer. We can if we apply clear rules regarding > locking and BQL-protected services to those devices that will actually > use fine-grained locking, and there only for those paths that take > per-device locks. I'm pretty sure we won't get far with an > "all-or-nothing" model where every device uses a private lock. An all-or-nothing model is not proposed. Unconverted devices will take the BQL instead of the device lock (hopefully the BQL will be taken for them during dispatch, so no code changes will be needed). >> >> I don't follow. Synchronous transactions mean you can't >> synchronize_rcu() or upgrade the lock or wait for the refcount. That's >> the source of the problem! > > Our current device models assume synchronous transactions on the memory > layer, actually on all layers. The proposals I saw so far try to change > this. But I'm very skeptical that this would succeed, due to the > conversion effort and due to the fact that it would give us a tricky to > use API. In what way do transactions become asynchronous? It's true that io callbacks can be called concurrently, but as soon as they take the device lock, they are serialized. The only relaxation is that a callback may be called after a region has been disabled or removed from view. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 09/03/2012 10:44 AM, liu ping fan wrote: >>> >> >> If we make the refcount/lock internal to the region, we must remove the >> opaque, since the region won't protect it. >> >> Replacing the opaque with container_of(mr) doesn't help, since we can't >> refcount mr, only mr->impl. >> > I think you mean if using MemoryRegionImpl, then at this level, we had > better not touch the refcnt of container_of(mr) and should face the > mr->impl->refcnt. Right? I did not understand the second part, sorry. >> We could externalize the refcounting and push it into device code. This >> means: >> >>memory_region_init_io(&s->mem, dev) >> >>... >> >>object_ref(dev) >>memory_region_add_subregion(..., &dev->mr) >> >>... >> >>memory_region_del_subregion(..., &dev->mr) // implied flush >>object_unref(dev) >> > I think "object_ref(dev)" just another style to push > MemoryRegionOps::object() to device level. And I think we can bypass > it. The caller (unplug, pci-reconfig ) of > memory_region_del_subregion() ensure the @dev is valid. > If the implied flush is implemented in synchronize, _del_subregion() > will guarantee no reader for @dev->mr and @dev exist any longer. The above code has a deadlock. memory_region_del_subregion() may be called under the device lock (since it may be the result of mmio to the device), and if the flush uses synchronized_rcu(), it will wait forever for the read-side critical section to complete. > So I > think we can save both object_ref/unref(dev) for memory system. > The only problem is that whether we can implement it as synchronous or > not, is it possible that we can launch a _del_subregion(mr-X) in > mr-X's dispatcher? Yes. Real cases exist. What alternatives remain? -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On Sat, Sep 1, 2012 at 4:46 PM, Avi Kivity wrote: > On 08/30/2012 12:08 AM, Jan Kiszka wrote: >> >>> >> >>> We are dispatching according to the memory region (parameters, op >> >>> handlers, opaques). If we end up in device object is not decided at this >> >>> level. A memory region describes a dispatchable area - not to confuse >> >>> with a device that may only partially be able to receive such requests. >> >> >> > But I think the meaning of the memory region is for dispatching. If no >> > dispatching associated with mr, why need it exist in the system? >> >> Where did I say that memory regions should no longer be used for >> dispatching? The point is to keep the clean layer separation between >> memory regions and device objects instead of merging them together. > > Believe me, that's exactly how I feel. I guess no author of a module > wants to see it swallowed by a larger framework and see its design > completely changed. Modules should be decoupled as much as possible. > But I don't see a better way yet. > >> >> Given >> >>Object >>^^ >>|| >> Region 1Region 2 >> >> you protect the object during dispatch, implicitly (and that is bad) >> requiring that no region must change in that period. > > We only protect the object against removal. After object_ref() has run, > the region may still be disabled. > >> I say what rather >> needs protection are the regions so that Region 2 can pass away and >> maybe reappear independent of Region 1. > > Region 2 can go away until the device-supplied dispatch code takes the > device lock. If the device chooses to use finer-grained locking, it can > allow region 2 (or even region 1) to change during dispatch. The only > requirement is that region 1 is not destroyed. > >> And: I won't need to know the >> type of that object the regions are referring to in this model. That's >> the difference. > > Sorry, I lost the reference. What model? Are you referring to my > broken MemoryRegionImpl split? > >> > And >> > could you elaborate that who will be the ref holder of mr? >> >> The memory subsystem while running a memory region handler. If that will >> be a reference counter or a per-region lock like Avi suggested, we still >> need to find out. >> > > If we make the refcount/lock internal to the region, we must remove the > opaque, since the region won't protect it. > > Replacing the opaque with container_of(mr) doesn't help, since we can't > refcount mr, only mr->impl. > I think you mean if using MemoryRegionImpl, then at this level, we had better not touch the refcnt of container_of(mr) and should face the mr->impl->refcnt. Right? > We could externalize the refcounting and push it into device code. This > means: > >memory_region_init_io(&s->mem, dev) > >... > >object_ref(dev) >memory_region_add_subregion(..., &dev->mr) > >... > >memory_region_del_subregion(..., &dev->mr) // implied flush >object_unref(dev) > I think "object_ref(dev)" just another style to push MemoryRegionOps::object() to device level. And I think we can bypass it. The caller (unplug, pci-reconfig ) of memory_region_del_subregion() ensure the @dev is valid. If the implied flush is implemented in synchronize, _del_subregion() will guarantee no reader for @dev->mr and @dev exist any longer. So I think we can save both object_ref/unref(dev) for memory system. The only problem is that whether we can implement it as synchronous or not, is it possible that we can launch a _del_subregion(mr-X) in mr-X's dispatcher? Regards, pingfan > er, this must be wrong, since it's so simple > > -- > I have a truly marvellous patch that fixes the bug which this > signature is too narrow to contain. >
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 09/01/2012 01:57 AM, Jan Kiszka wrote: > On 2012-09-01 10:31, Avi Kivity wrote: > > On 08/29/2012 10:49 AM, Jan Kiszka wrote: > >>> > >>> Let's experiment with refcounting MemoryRegion. We can move the entire > >>> contents of MemoryRegion to MemoryRegionImpl, add a reference count (to > >>> MemoryRegionImpl), and change MemoryRegion to contain a pointer to the > >>> refcounted MemoryRegionImpl: > >>> > >>> struct MemoryRegion { > >>> struct MemoryRegionImpl *impl; > >>> }; > >>> > >>> struct MemoryRegionImpl { > >>> atomic int refs; > >>> ... > >>> }; > >>> > >>> The memory core can then store MemoryRegion copies (with elevated > >>> refcounts) instead of pointers. Devices can destroy MemoryRegions at > >>> any time, the implementation will not go away. However, what of the > >>> opaque stored in MemoryRegionImpl? It becomes a dangling pointer. > >>> > >>> One way out is to add a lock to MemoryRegionImpl. Dispatch takes the > >>> lock, examines the ->enabled member, and bails out if it is cleared. > >>> The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock, > >>> clears ->enabled, releases the lock, and drops the reference. > >> > >> That means holding the MemoryRegionImpl lock across the handler call? > > > > Blech. As you said on the other side of this thread, we must not take a > > coarse grained lock within a fine grained one, and > > MemoryRegionImpl::lock is as fine as they get. > > Not sure what you compare here. MemoryRegionImpl::lock would be per > memory region, so with finer scope than the BQL but with similar scope > like a per-device lock. The dispatch may acquire the bql (in fact most will, unless we convert everything). A design that doesn't allow this is broken. Even the device lock is more coarse grained (since it covers all regions) and is taken in a deadlock scenario as region manipulation will be done under the device lock. You say we can detect this, but I dislike it. > > > >>> > >>> The advantage to this scheme is that all changes are localized to the > >>> memory core, no need for a full sweep. It is a little complicated, but > >>> we may be able to simplify it (or find another one). > >> > >> May work. We just need to detect if memory region tries to delete itself > >> from its handler to prevent the deadlock. > > > > Those types of hacks are fragile. IMO it just demonstrates what I said > > earlier (then tried to disprove with this): if we call an opaque's > > method, we must refcount or otherwise lock that opaque. No way around it. > > But that still doesn't solve the problem that we need to lock down the > *state* of the opaque during dispatch /wrt to memory region changes. > Just ensuring its existence is not enough unless we declare memory > region transactions to be asynchronous - and adapt all users. I expect no users will need change. Changing the region offset has no effect on dispatch (in fact the region itself doesn't change). Changing subregions is fine. Disabling a region concurrently with access is the only potential problem, but this is rare, and I expect it to just work. We will have to audit all users, yes. > + wait for refcount(object) == 0 in deregister(object). That's what I'm > proposing. > >>> > >>> Consider timer_del() called from a timer callback. It's often not doable. > >> > >> This is inherently synchronous already (when working against the same > >> alarm timer backend). We can detect this in timer_del and skip waiting, > >> as in the above scenario. > > > > It can always be broken. The timer callback takes the device lock to > > update the device. The device mmio handler, holding the device lock, > > takes the timer lock to timer_mod. Deadlock. > > Well, how is this solved in Linux? By waiting on the callback in > hrtimer_cancel. Not by wait_on_magic_opaque (well, there is even no > opaque in the hrtimer API). I don't consider that busy loop an elegant solution. The original proposal mirrors dcache. You can perform operations on a dentry even after it is unlinked. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-09-01 10:31, Avi Kivity wrote: > On 08/29/2012 10:49 AM, Jan Kiszka wrote: >>> >>> Let's experiment with refcounting MemoryRegion. We can move the entire >>> contents of MemoryRegion to MemoryRegionImpl, add a reference count (to >>> MemoryRegionImpl), and change MemoryRegion to contain a pointer to the >>> refcounted MemoryRegionImpl: >>> >>> struct MemoryRegion { >>> struct MemoryRegionImpl *impl; >>> }; >>> >>> struct MemoryRegionImpl { >>> atomic int refs; >>> ... >>> }; >>> >>> The memory core can then store MemoryRegion copies (with elevated >>> refcounts) instead of pointers. Devices can destroy MemoryRegions at >>> any time, the implementation will not go away. However, what of the >>> opaque stored in MemoryRegionImpl? It becomes a dangling pointer. >>> >>> One way out is to add a lock to MemoryRegionImpl. Dispatch takes the >>> lock, examines the ->enabled member, and bails out if it is cleared. >>> The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock, >>> clears ->enabled, releases the lock, and drops the reference. >> >> That means holding the MemoryRegionImpl lock across the handler call? > > Blech. As you said on the other side of this thread, we must not take a > coarse grained lock within a fine grained one, and > MemoryRegionImpl::lock is as fine as they get. Not sure what you compare here. MemoryRegionImpl::lock would be per memory region, so with finer scope than the BQL but with similar scope like a per-device lock. > >>> >>> The advantage to this scheme is that all changes are localized to the >>> memory core, no need for a full sweep. It is a little complicated, but >>> we may be able to simplify it (or find another one). >> >> May work. We just need to detect if memory region tries to delete itself >> from its handler to prevent the deadlock. > > Those types of hacks are fragile. IMO it just demonstrates what I said > earlier (then tried to disprove with this): if we call an opaque's > method, we must refcount or otherwise lock that opaque. No way around it. But that still doesn't solve the problem that we need to lock down the *state* of the opaque during dispatch /wrt to memory region changes. Just ensuring its existence is not enough unless we declare memory region transactions to be asynchronous - and adapt all users. > >> >>> > >> Besides >> MMIO and PIO dispatching, it will haunt us for file or event handlers, >> any kind of callbacks etc. >> >> Context A Context B >> - - >> object = lookup() >> deregister(object) >> modify(object) -> invalid state >> ... use(object) >> modify(object) -> valid state >> register(object) >> >> And with "object" I'm not talking about QOM but any data structure. >> > > > Context B > - > rcu_read_lock() > object = lookup() > if (object) { > ref(object) > } > rcu_read_unlock() > > use(object) > > unref(object) > > (substitute locking scheme to conform to taste and/or dietary > restrictions) > + wait for refcount(object) == 0 in deregister(object). That's what I'm proposing. >>> >>> Consider timer_del() called from a timer callback. It's often not doable. >> >> This is inherently synchronous already (when working against the same >> alarm timer backend). We can detect this in timer_del and skip waiting, >> as in the above scenario. > > It can always be broken. The timer callback takes the device lock to > update the device. The device mmio handler, holding the device lock, > takes the timer lock to timer_mod. Deadlock. Well, how is this solved in Linux? By waiting on the callback in hrtimer_cancel. Not by wait_on_magic_opaque (well, there is even no opaque in the hrtimer API). Jan signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 08/30/2012 12:08 AM, Jan Kiszka wrote: > >>> > >>> We are dispatching according to the memory region (parameters, op > >>> handlers, opaques). If we end up in device object is not decided at this > >>> level. A memory region describes a dispatchable area - not to confuse > >>> with a device that may only partially be able to receive such requests. > >> > > But I think the meaning of the memory region is for dispatching. If no > > dispatching associated with mr, why need it exist in the system? > > Where did I say that memory regions should no longer be used for > dispatching? The point is to keep the clean layer separation between > memory regions and device objects instead of merging them together. Believe me, that's exactly how I feel. I guess no author of a module wants to see it swallowed by a larger framework and see its design completely changed. Modules should be decoupled as much as possible. But I don't see a better way yet. > > Given > >Object >^^ >|| > Region 1Region 2 > > you protect the object during dispatch, implicitly (and that is bad) > requiring that no region must change in that period. We only protect the object against removal. After object_ref() has run, the region may still be disabled. > I say what rather > needs protection are the regions so that Region 2 can pass away and > maybe reappear independent of Region 1. Region 2 can go away until the device-supplied dispatch code takes the device lock. If the device chooses to use finer-grained locking, it can allow region 2 (or even region 1) to change during dispatch. The only requirement is that region 1 is not destroyed. > And: I won't need to know the > type of that object the regions are referring to in this model. That's > the difference. Sorry, I lost the reference. What model? Are you referring to my broken MemoryRegionImpl split? > > And > > could you elaborate that who will be the ref holder of mr? > > The memory subsystem while running a memory region handler. If that will > be a reference counter or a per-region lock like Avi suggested, we still > need to find out. > If we make the refcount/lock internal to the region, we must remove the opaque, since the region won't protect it. Replacing the opaque with container_of(mr) doesn't help, since we can't refcount mr, only mr->impl. We could externalize the refcounting and push it into device code. This means: memory_region_init_io(&s->mem, dev) ... object_ref(dev) memory_region_add_subregion(..., &dev->mr) ... memory_region_del_subregion(..., &dev->mr) // implied flush object_unref(dev) er, this must be wrong, since it's so simple -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 08/29/2012 10:49 AM, Jan Kiszka wrote: > > > > Let's experiment with refcounting MemoryRegion. We can move the entire > > contents of MemoryRegion to MemoryRegionImpl, add a reference count (to > > MemoryRegionImpl), and change MemoryRegion to contain a pointer to the > > refcounted MemoryRegionImpl: > > > > struct MemoryRegion { > > struct MemoryRegionImpl *impl; > > }; > > > > struct MemoryRegionImpl { > > atomic int refs; > > ... > > }; > > > > The memory core can then store MemoryRegion copies (with elevated > > refcounts) instead of pointers. Devices can destroy MemoryRegions at > > any time, the implementation will not go away. However, what of the > > opaque stored in MemoryRegionImpl? It becomes a dangling pointer. > > > > One way out is to add a lock to MemoryRegionImpl. Dispatch takes the > > lock, examines the ->enabled member, and bails out if it is cleared. > > The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock, > > clears ->enabled, releases the lock, and drops the reference. > > That means holding the MemoryRegionImpl lock across the handler call? Blech. As you said on the other side of this thread, we must not take a coarse grained lock within a fine grained one, and MemoryRegionImpl::lock is as fine as they get. > > > > The advantage to this scheme is that all changes are localized to the > > memory core, no need for a full sweep. It is a little complicated, but > > we may be able to simplify it (or find another one). > > May work. We just need to detect if memory region tries to delete itself > from its handler to prevent the deadlock. Those types of hacks are fragile. IMO it just demonstrates what I said earlier (then tried to disprove with this): if we call an opaque's method, we must refcount or otherwise lock that opaque. No way around it. > > > > >> > >>> > Besides > MMIO and PIO dispatching, it will haunt us for file or event handlers, > any kind of callbacks etc. > > Context A Context B > - - > object = lookup() > deregister(object) > modify(object) -> invalid state > ... use(object) > modify(object) -> valid state > register(object) > > And with "object" I'm not talking about QOM but any data structure. > > >>> > >>> > >>> Context B > >>> - > >>> rcu_read_lock() > >>> object = lookup() > >>> if (object) { > >>> ref(object) > >>> } > >>> rcu_read_unlock() > >>> > >>> use(object) > >>> > >>> unref(object) > >>> > >>> (substitute locking scheme to conform to taste and/or dietary > >>> restrictions) > >>> > >> > >> + wait for refcount(object) == 0 in deregister(object). That's what I'm > >> proposing. > > > > Consider timer_del() called from a timer callback. It's often not doable. > > This is inherently synchronous already (when working against the same > alarm timer backend). We can detect this in timer_del and skip waiting, > as in the above scenario. It can always be broken. The timer callback takes the device lock to update the device. The device mmio handler, holding the device lock, takes the timer lock to timer_mod. Deadlock. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On Thu, Aug 30, 2012 at 3:08 PM, Jan Kiszka wrote: > On 2012-08-30 07:54, liu ping fan wrote: >> On Thu, Aug 30, 2012 at 1:40 AM, Avi Kivity wrote: >>> On 08/29/2012 10:30 AM, Jan Kiszka wrote: On 2012-08-29 19:23, Avi Kivity wrote: > On 08/28/2012 02:42 AM, Jan Kiszka wrote: >> >> Let's not talk about devices or MMIO dispatching. I think the problem is >> way more generic, and we will face it multiple times in QEMU. > > The problem exists outside qemu as well. It is one of the reasons for > the popularity of garbage collection IMO, and the use of reference > counting when GC is not available. > > This pattern is even documented in > Documentation/DocBook/kernel-locking.tmpl: > > @@ -104,12 +114,11 @@ > struct object *cache_find(int id) > { > struct object *obj; > -unsigned long flags; > > -spin_lock_irqsave(&cache_lock, flags); > +rcu_read_lock(); > obj = __cache_find(id); > if (obj) > object_get(obj); > -spin_unlock_irqrestore(&cache_lock, flags); > +rcu_read_unlock(); > > > Of course that doesn't mean we should use it, but at least it indicates > that it is a standard pattern. With MemoryRegion the pattern is > changed, since MemoryRegion is a thunk, not the object we're really > dispatching to. We are dispatching according to the memory region (parameters, op handlers, opaques). If we end up in device object is not decided at this level. A memory region describes a dispatchable area - not to confuse with a device that may only partially be able to receive such requests. >>> >> But I think the meaning of the memory region is for dispatching. If no >> dispatching associated with mr, why need it exist in the system? > > Where did I say that memory regions should no longer be used for > dispatching? The point is to keep the clean layer separation between > memory regions and device objects instead of merging them together. > > Given > >Object >^^ >|| > Region 1Region 2 > > you protect the object during dispatch, implicitly (and that is bad) > requiring that no region must change in that period. I say what rather > needs protection are the regions so that Region 2 can pass away and > maybe reappear independent of Region 1. And: I won't need to know the OK, I see, this is a strong reason. Regards, pingfan > type of that object the regions are referring to in this model. That's > the difference. > >> And >> could you elaborate that who will be the ref holder of mr? > > The memory subsystem while running a memory region handler. If that will > be a reference counter or a per-region lock like Avi suggested, we still > need to find out. > > Jan > > -- > Siemens AG, Corporate Technology, CT RTC ITP SDP-DE > Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-08-30 07:54, liu ping fan wrote: > On Thu, Aug 30, 2012 at 1:40 AM, Avi Kivity wrote: >> On 08/29/2012 10:30 AM, Jan Kiszka wrote: >>> On 2012-08-29 19:23, Avi Kivity wrote: On 08/28/2012 02:42 AM, Jan Kiszka wrote: > > Let's not talk about devices or MMIO dispatching. I think the problem is > way more generic, and we will face it multiple times in QEMU. The problem exists outside qemu as well. It is one of the reasons for the popularity of garbage collection IMO, and the use of reference counting when GC is not available. This pattern is even documented in Documentation/DocBook/kernel-locking.tmpl: @@ -104,12 +114,11 @@ struct object *cache_find(int id) { struct object *obj; -unsigned long flags; -spin_lock_irqsave(&cache_lock, flags); +rcu_read_lock(); obj = __cache_find(id); if (obj) object_get(obj); -spin_unlock_irqrestore(&cache_lock, flags); +rcu_read_unlock(); Of course that doesn't mean we should use it, but at least it indicates that it is a standard pattern. With MemoryRegion the pattern is changed, since MemoryRegion is a thunk, not the object we're really dispatching to. >>> >>> We are dispatching according to the memory region (parameters, op >>> handlers, opaques). If we end up in device object is not decided at this >>> level. A memory region describes a dispatchable area - not to confuse >>> with a device that may only partially be able to receive such requests. >> > But I think the meaning of the memory region is for dispatching. If no > dispatching associated with mr, why need it exist in the system? Where did I say that memory regions should no longer be used for dispatching? The point is to keep the clean layer separation between memory regions and device objects instead of merging them together. Given Object ^^ || Region 1Region 2 you protect the object during dispatch, implicitly (and that is bad) requiring that no region must change in that period. I say what rather needs protection are the regions so that Region 2 can pass away and maybe reappear independent of Region 1. And: I won't need to know the type of that object the regions are referring to in this model. That's the difference. > And > could you elaborate that who will be the ref holder of mr? The memory subsystem while running a memory region handler. If that will be a reference counter or a per-region lock like Avi suggested, we still need to find out. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On Thu, Aug 30, 2012 at 1:40 AM, Avi Kivity wrote: > On 08/29/2012 10:30 AM, Jan Kiszka wrote: >> On 2012-08-29 19:23, Avi Kivity wrote: >> > On 08/28/2012 02:42 AM, Jan Kiszka wrote: >> >> >> >> Let's not talk about devices or MMIO dispatching. I think the problem is >> >> way more generic, and we will face it multiple times in QEMU. >> > >> > The problem exists outside qemu as well. It is one of the reasons for >> > the popularity of garbage collection IMO, and the use of reference >> > counting when GC is not available. >> > >> > This pattern is even documented in >> > Documentation/DocBook/kernel-locking.tmpl: >> > >> > @@ -104,12 +114,11 @@ >> > struct object *cache_find(int id) >> > { >> > struct object *obj; >> > -unsigned long flags; >> > >> > -spin_lock_irqsave(&cache_lock, flags); >> > +rcu_read_lock(); >> > obj = __cache_find(id); >> > if (obj) >> > object_get(obj); >> > -spin_unlock_irqrestore(&cache_lock, flags); >> > +rcu_read_unlock(); >> > >> > >> > Of course that doesn't mean we should use it, but at least it indicates >> > that it is a standard pattern. With MemoryRegion the pattern is >> > changed, since MemoryRegion is a thunk, not the object we're really >> > dispatching to. >> >> We are dispatching according to the memory region (parameters, op >> handlers, opaques). If we end up in device object is not decided at this >> level. A memory region describes a dispatchable area - not to confuse >> with a device that may only partially be able to receive such requests. > But I think the meaning of the memory region is for dispatching. If no dispatching associated with mr, why need it exist in the system? And could you elaborate that who will be the ref holder of mr? > Correct. > > Let's experiment with refcounting MemoryRegion. We can move the entire > contents of MemoryRegion to MemoryRegionImpl, add a reference count (to > MemoryRegionImpl), and change MemoryRegion to contain a pointer to the > refcounted MemoryRegionImpl: > > struct MemoryRegion { > struct MemoryRegionImpl *impl; > }; > > struct MemoryRegionImpl { > atomic int refs; > ... > }; > > The memory core can then store MemoryRegion copies (with elevated > refcounts) instead of pointers. Devices can destroy MemoryRegions at > any time, the implementation will not go away. However, what of the > opaque stored in MemoryRegionImpl? It becomes a dangling pointer. > > One way out is to add a lock to MemoryRegionImpl. Dispatch takes the > lock, examines the ->enabled member, and bails out if it is cleared. > The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock, > clears ->enabled, releases the lock, and drops the reference. > Assume we can get the host objectX of opaque, then we can just object_ref(objectX). So dispatching will not face the hard nut to handle opaque. Regards, pingfan > The advantage to this scheme is that all changes are localized to the > memory core, no need for a full sweep. It is a little complicated, but > we may be able to simplify it (or find another one). > >> >> > >> >> Besides >> >> MMIO and PIO dispatching, it will haunt us for file or event handlers, >> >> any kind of callbacks etc. >> >> >> >> Context A Context B >> >> - - >> >> object = lookup() >> >> deregister(object) >> >> modify(object) -> invalid state >> >> ... use(object) >> >> modify(object) -> valid state >> >> register(object) >> >> >> >> And with "object" I'm not talking about QOM but any data structure. >> >> >> > >> > >> > Context B >> > - >> > rcu_read_lock() >> > object = lookup() >> > if (object) { >> > ref(object) >> > } >> > rcu_read_unlock() >> > >> > use(object) >> > >> > unref(object) >> > >> > (substitute locking scheme to conform to taste and/or dietary >> > restrictions) >> > >> >> + wait for refcount(object) == 0 in deregister(object). That's what I'm >> proposing. > > Consider timer_del() called from a timer callback. It's often not doable. > > -- > I have a truly marvellous patch that fixes the bug which this > signature is too narrow to contain. >
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-08-29 19:40, Avi Kivity wrote: > On 08/29/2012 10:30 AM, Jan Kiszka wrote: >> On 2012-08-29 19:23, Avi Kivity wrote: >>> On 08/28/2012 02:42 AM, Jan Kiszka wrote: Let's not talk about devices or MMIO dispatching. I think the problem is way more generic, and we will face it multiple times in QEMU. >>> >>> The problem exists outside qemu as well. It is one of the reasons for >>> the popularity of garbage collection IMO, and the use of reference >>> counting when GC is not available. >>> >>> This pattern is even documented in >>> Documentation/DocBook/kernel-locking.tmpl: >>> >>> @@ -104,12 +114,11 @@ >>> struct object *cache_find(int id) >>> { >>> struct object *obj; >>> -unsigned long flags; >>> >>> -spin_lock_irqsave(&cache_lock, flags); >>> +rcu_read_lock(); >>> obj = __cache_find(id); >>> if (obj) >>> object_get(obj); >>> -spin_unlock_irqrestore(&cache_lock, flags); >>> +rcu_read_unlock(); >>> >>> >>> Of course that doesn't mean we should use it, but at least it indicates >>> that it is a standard pattern. With MemoryRegion the pattern is >>> changed, since MemoryRegion is a thunk, not the object we're really >>> dispatching to. >> >> We are dispatching according to the memory region (parameters, op >> handlers, opaques). If we end up in device object is not decided at this >> level. A memory region describes a dispatchable area - not to confuse >> with a device that may only partially be able to receive such requests. > > Correct. > > Let's experiment with refcounting MemoryRegion. We can move the entire > contents of MemoryRegion to MemoryRegionImpl, add a reference count (to > MemoryRegionImpl), and change MemoryRegion to contain a pointer to the > refcounted MemoryRegionImpl: > > struct MemoryRegion { > struct MemoryRegionImpl *impl; > }; > > struct MemoryRegionImpl { > atomic int refs; > ... > }; > > The memory core can then store MemoryRegion copies (with elevated > refcounts) instead of pointers. Devices can destroy MemoryRegions at > any time, the implementation will not go away. However, what of the > opaque stored in MemoryRegionImpl? It becomes a dangling pointer. > > One way out is to add a lock to MemoryRegionImpl. Dispatch takes the > lock, examines the ->enabled member, and bails out if it is cleared. > The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock, > clears ->enabled, releases the lock, and drops the reference. That means holding the MemoryRegionImpl lock across the handler call? > > The advantage to this scheme is that all changes are localized to the > memory core, no need for a full sweep. It is a little complicated, but > we may be able to simplify it (or find another one). May work. We just need to detect if memory region tries to delete itself from its handler to prevent the deadlock. > >> >>> Besides MMIO and PIO dispatching, it will haunt us for file or event handlers, any kind of callbacks etc. Context A Context B - - object = lookup() deregister(object) modify(object) -> invalid state ... use(object) modify(object) -> valid state register(object) And with "object" I'm not talking about QOM but any data structure. >>> >>> >>> Context B >>> - >>> rcu_read_lock() >>> object = lookup() >>> if (object) { >>> ref(object) >>> } >>> rcu_read_unlock() >>> >>> use(object) >>> >>> unref(object) >>> >>> (substitute locking scheme to conform to taste and/or dietary >>> restrictions) >>> >> >> + wait for refcount(object) == 0 in deregister(object). That's what I'm >> proposing. > > Consider timer_del() called from a timer callback. It's often not doable. This is inherently synchronous already (when working against the same alarm timer backend). We can detect this in timer_del and skip waiting, as in the above scenario. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-08-29 19:27, Avi Kivity wrote: > On 08/29/2012 10:21 AM, Jan Kiszka wrote: >> On 2012-08-29 19:13, Avi Kivity wrote: >>> On 08/27/2012 06:01 PM, Jan Kiszka wrote: On 2012-08-27 22:53, Avi Kivity wrote: > On 08/27/2012 12:38 PM, Jan Kiszka wrote: Even worse, apply restrictions on how the dispatched objects, the regions, have to be treated because of this. >>> >>> Please elaborate. >> >> The fact that you can't manipulate a memory region object arbitrarily >> after removing it from the mapping because you track the references to >> the object that the region points to, not the region itself. The region >> remains in use by the dispatching layer and potentially the called >> device, even after deregistration. > > That object will be a container_of() the region, usually literally but > sometimes only in spirit. Reference counting the regions means they > cannot be embedded into other objects any more. I cannot follow the logic of the last sentence. Reference counting just means that we track if there are users left, not necessarily that we perform asynchronous releases. We can simply wait for those users to complete. >>> >>> I don't see how. Suppose you add a reference count to MemoryRegion. >>> How do you delay its containing object's destructor from running? Do >>> you iterate over all member MemoryRegion and examine their reference counts? >> >> memory_region_transaction_commit (or calls that trigger this) will have >> to wait. That is required as the caller may start fiddling with the >> region right afterward. > > However, it may be running within an mmio dispatch, so it may wait forever. We won't be able to wait for devices that can acquire the same lock we hold while waiting, of course. That's why a lock ordering device-lock -> BQL (the one we hold over memory region changes) won't be allowed. I was wrong on this before: We must not take course-grained locks while holding fine-grained ones, only the other way around. Pretty obvious, specifically for realtime / low-latency. > > Ignoring this, how does it wait? sleep()? or wait for a completion? Ideally via completion. > >>> >>> Usually a reference count controls the lifetime of the reference counted >>> object, what are you suggesting here? >> >> To synchronize on reference count going to zero. > > Quite unorthodox. Do you have real life examples? synchronize_rcu. > >> Or all readers leaving >> the read-side critical sections. > > This is rcu. But again, we may be in an rcu read-side critical section > while needing to wait. In fact this was what I suggested in the first > place, until Marcelo pointed out the deadlock, so we came up with the > refcount. We must avoid that deadlock scenario. I bended my brain around it, and I think that is the only answer. We can if we apply clear rules regarding locking and BQL-protected services to those devices that will actually use fine-grained locking, and there only for those paths that take per-device locks. I'm pretty sure we won't get far with an "all-or-nothing" model where every device uses a private lock. > >> >>> > > We can probably figure out a way to flush out accesses. After switching > to rcu, for example, all we need is synchronize_rcu() in a > non-deadlocking place. But my bet is that it will not be needed. If you properly flush out accesses, you don't need to track at device level anymore - and mess with abstraction layers. That's my whole point. >>> >>> To flush out an access you need either rwlock_write_lock() or >>> synchronize_rcu() (depending on the implementation). But neither of >>> these can be run from an rcu read-side critical section or >>> rwlock_read_lock(). >>> >>> You could defer the change to a bottom half, but if the hardware demands >>> that the change be complete before returning, that doesn't work. >> >> Right, therefore synchronous transactions. > > I don't follow. Synchronous transactions mean you can't > synchronize_rcu() or upgrade the lock or wait for the refcount. That's > the source of the problem! Our current device models assume synchronous transactions on the memory layer, actually on all layers. The proposals I saw so far try to change this. But I'm very skeptical that this would succeed, due to the conversion effort and due to the fact that it would give us a tricky to use API. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 08/29/2012 10:30 AM, Jan Kiszka wrote: > On 2012-08-29 19:23, Avi Kivity wrote: > > On 08/28/2012 02:42 AM, Jan Kiszka wrote: > >> > >> Let's not talk about devices or MMIO dispatching. I think the problem is > >> way more generic, and we will face it multiple times in QEMU. > > > > The problem exists outside qemu as well. It is one of the reasons for > > the popularity of garbage collection IMO, and the use of reference > > counting when GC is not available. > > > > This pattern is even documented in > > Documentation/DocBook/kernel-locking.tmpl: > > > > @@ -104,12 +114,11 @@ > > struct object *cache_find(int id) > > { > > struct object *obj; > > -unsigned long flags; > > > > -spin_lock_irqsave(&cache_lock, flags); > > +rcu_read_lock(); > > obj = __cache_find(id); > > if (obj) > > object_get(obj); > > -spin_unlock_irqrestore(&cache_lock, flags); > > +rcu_read_unlock(); > > > > > > Of course that doesn't mean we should use it, but at least it indicates > > that it is a standard pattern. With MemoryRegion the pattern is > > changed, since MemoryRegion is a thunk, not the object we're really > > dispatching to. > > We are dispatching according to the memory region (parameters, op > handlers, opaques). If we end up in device object is not decided at this > level. A memory region describes a dispatchable area - not to confuse > with a device that may only partially be able to receive such requests. Correct. Let's experiment with refcounting MemoryRegion. We can move the entire contents of MemoryRegion to MemoryRegionImpl, add a reference count (to MemoryRegionImpl), and change MemoryRegion to contain a pointer to the refcounted MemoryRegionImpl: struct MemoryRegion { struct MemoryRegionImpl *impl; }; struct MemoryRegionImpl { atomic int refs; ... }; The memory core can then store MemoryRegion copies (with elevated refcounts) instead of pointers. Devices can destroy MemoryRegions at any time, the implementation will not go away. However, what of the opaque stored in MemoryRegionImpl? It becomes a dangling pointer. One way out is to add a lock to MemoryRegionImpl. Dispatch takes the lock, examines the ->enabled member, and bails out if it is cleared. The (MemoryRegion, not MemoryRegionImpl) destructor also takes the lock, clears ->enabled, releases the lock, and drops the reference. The advantage to this scheme is that all changes are localized to the memory core, no need for a full sweep. It is a little complicated, but we may be able to simplify it (or find another one). > > > > >> Besides > >> MMIO and PIO dispatching, it will haunt us for file or event handlers, > >> any kind of callbacks etc. > >> > >> Context A Context B > >> - - > >> object = lookup() > >> deregister(object) > >> modify(object) -> invalid state > >> ... use(object) > >> modify(object) -> valid state > >> register(object) > >> > >> And with "object" I'm not talking about QOM but any data structure. > >> > > > > > > Context B > > - > > rcu_read_lock() > > object = lookup() > > if (object) { > > ref(object) > > } > > rcu_read_unlock() > > > > use(object) > > > > unref(object) > > > > (substitute locking scheme to conform to taste and/or dietary > > restrictions) > > > > + wait for refcount(object) == 0 in deregister(object). That's what I'm > proposing. Consider timer_del() called from a timer callback. It's often not doable. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 08/29/2012 10:21 AM, Jan Kiszka wrote: > On 2012-08-29 19:13, Avi Kivity wrote: > > On 08/27/2012 06:01 PM, Jan Kiszka wrote: > >> On 2012-08-27 22:53, Avi Kivity wrote: > >>> On 08/27/2012 12:38 PM, Jan Kiszka wrote: > >> Even worse, apply > >> restrictions on how the dispatched objects, the regions, have to be > >> treated because of this. > > > > Please elaborate. > > The fact that you can't manipulate a memory region object arbitrarily > after removing it from the mapping because you track the references to > the object that the region points to, not the region itself. The region > remains in use by the dispatching layer and potentially the called > device, even after deregistration. > >>> > >>> That object will be a container_of() the region, usually literally but > >>> sometimes only in spirit. Reference counting the regions means they > >>> cannot be embedded into other objects any more. > >> > >> I cannot follow the logic of the last sentence. Reference counting just > >> means that we track if there are users left, not necessarily that we > >> perform asynchronous releases. We can simply wait for those users to > >> complete. > > > > I don't see how. Suppose you add a reference count to MemoryRegion. > > How do you delay its containing object's destructor from running? Do > > you iterate over all member MemoryRegion and examine their reference counts? > > memory_region_transaction_commit (or calls that trigger this) will have > to wait. That is required as the caller may start fiddling with the > region right afterward. However, it may be running within an mmio dispatch, so it may wait forever. Ignoring this, how does it wait? sleep()? or wait for a completion? > > > > Usually a reference count controls the lifetime of the reference counted > > object, what are you suggesting here? > > To synchronize on reference count going to zero. Quite unorthodox. Do you have real life examples? > Or all readers leaving > the read-side critical sections. This is rcu. But again, we may be in an rcu read-side critical section while needing to wait. In fact this was what I suggested in the first place, until Marcelo pointed out the deadlock, so we came up with the refcount. > > > > >>> > >>> We can probably figure out a way to flush out accesses. After switching > >>> to rcu, for example, all we need is synchronize_rcu() in a > >>> non-deadlocking place. But my bet is that it will not be needed. > >> > >> If you properly flush out accesses, you don't need to track at device > >> level anymore - and mess with abstraction layers. That's my whole point. > > > > To flush out an access you need either rwlock_write_lock() or > > synchronize_rcu() (depending on the implementation). But neither of > > these can be run from an rcu read-side critical section or > > rwlock_read_lock(). > > > > You could defer the change to a bottom half, but if the hardware demands > > that the change be complete before returning, that doesn't work. > > Right, therefore synchronous transactions. I don't follow. Synchronous transactions mean you can't synchronize_rcu() or upgrade the lock or wait for the refcount. That's the source of the problem! -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-08-29 19:23, Avi Kivity wrote: > On 08/28/2012 02:42 AM, Jan Kiszka wrote: >> >> Let's not talk about devices or MMIO dispatching. I think the problem is >> way more generic, and we will face it multiple times in QEMU. > > The problem exists outside qemu as well. It is one of the reasons for > the popularity of garbage collection IMO, and the use of reference > counting when GC is not available. > > This pattern is even documented in > Documentation/DocBook/kernel-locking.tmpl: > > @@ -104,12 +114,11 @@ > struct object *cache_find(int id) > { > struct object *obj; > -unsigned long flags; > > -spin_lock_irqsave(&cache_lock, flags); > +rcu_read_lock(); > obj = __cache_find(id); > if (obj) > object_get(obj); > -spin_unlock_irqrestore(&cache_lock, flags); > +rcu_read_unlock(); > > > Of course that doesn't mean we should use it, but at least it indicates > that it is a standard pattern. With MemoryRegion the pattern is > changed, since MemoryRegion is a thunk, not the object we're really > dispatching to. We are dispatching according to the memory region (parameters, op handlers, opaques). If we end up in device object is not decided at this level. A memory region describes a dispatchable area - not to confuse with a device that may only partially be able to receive such requests. > >> Besides >> MMIO and PIO dispatching, it will haunt us for file or event handlers, >> any kind of callbacks etc. >> >> Context A Context B >> - - >> object = lookup() >> deregister(object) >> modify(object) -> invalid state >> ... use(object) >> modify(object) -> valid state >> register(object) >> >> And with "object" I'm not talking about QOM but any data structure. >> > > > Context B > - > rcu_read_lock() > object = lookup() > if (object) { > ref(object) > } > rcu_read_unlock() > > use(object) > > unref(object) > > (substitute locking scheme to conform to taste and/or dietary > restrictions) > + wait for refcount(object) == 0 in deregister(object). That's what I'm proposing. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 08/28/2012 02:42 AM, Jan Kiszka wrote: > > Let's not talk about devices or MMIO dispatching. I think the problem is > way more generic, and we will face it multiple times in QEMU. The problem exists outside qemu as well. It is one of the reasons for the popularity of garbage collection IMO, and the use of reference counting when GC is not available. This pattern is even documented in Documentation/DocBook/kernel-locking.tmpl: @@ -104,12 +114,11 @@ struct object *cache_find(int id) { struct object *obj; -unsigned long flags; -spin_lock_irqsave(&cache_lock, flags); +rcu_read_lock(); obj = __cache_find(id); if (obj) object_get(obj); -spin_unlock_irqrestore(&cache_lock, flags); +rcu_read_unlock(); Of course that doesn't mean we should use it, but at least it indicates that it is a standard pattern. With MemoryRegion the pattern is changed, since MemoryRegion is a thunk, not the object we're really dispatching to. > Besides > MMIO and PIO dispatching, it will haunt us for file or event handlers, > any kind of callbacks etc. > > Context A Context B > - - > object = lookup() > deregister(object) > modify(object) -> invalid state > ... use(object) > modify(object) -> valid state > register(object) > > And with "object" I'm not talking about QOM but any data structure. > Context B - rcu_read_lock() object = lookup() if (object) { ref(object) } rcu_read_unlock() use(object) unref(object) (substitute locking scheme to conform to taste and/or dietary restrictions) -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-08-29 19:13, Avi Kivity wrote: > On 08/27/2012 06:01 PM, Jan Kiszka wrote: >> On 2012-08-27 22:53, Avi Kivity wrote: >>> On 08/27/2012 12:38 PM, Jan Kiszka wrote: >> Even worse, apply >> restrictions on how the dispatched objects, the regions, have to be >> treated because of this. > > Please elaborate. The fact that you can't manipulate a memory region object arbitrarily after removing it from the mapping because you track the references to the object that the region points to, not the region itself. The region remains in use by the dispatching layer and potentially the called device, even after deregistration. >>> >>> That object will be a container_of() the region, usually literally but >>> sometimes only in spirit. Reference counting the regions means they >>> cannot be embedded into other objects any more. >> >> I cannot follow the logic of the last sentence. Reference counting just >> means that we track if there are users left, not necessarily that we >> perform asynchronous releases. We can simply wait for those users to >> complete. > > I don't see how. Suppose you add a reference count to MemoryRegion. > How do you delay its containing object's destructor from running? Do > you iterate over all member MemoryRegion and examine their reference counts? memory_region_transaction_commit (or calls that trigger this) will have to wait. That is required as the caller may start fiddling with the region right afterward. > > Usually a reference count controls the lifetime of the reference counted > object, what are you suggesting here? To synchronize on reference count going to zero. Or all readers leaving the read-side critical sections. > >>> >>> We can probably figure out a way to flush out accesses. After switching >>> to rcu, for example, all we need is synchronize_rcu() in a >>> non-deadlocking place. But my bet is that it will not be needed. >> >> If you properly flush out accesses, you don't need to track at device >> level anymore - and mess with abstraction layers. That's my whole point. > > To flush out an access you need either rwlock_write_lock() or > synchronize_rcu() (depending on the implementation). But neither of > these can be run from an rcu read-side critical section or > rwlock_read_lock(). > > You could defer the change to a bottom half, but if the hardware demands > that the change be complete before returning, that doesn't work. Right, therefore synchronous transactions. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 08/27/2012 06:01 PM, Jan Kiszka wrote: > On 2012-08-27 22:53, Avi Kivity wrote: > > On 08/27/2012 12:38 PM, Jan Kiszka wrote: > Even worse, apply > restrictions on how the dispatched objects, the regions, have to be > treated because of this. > >>> > >>> Please elaborate. > >> > >> The fact that you can't manipulate a memory region object arbitrarily > >> after removing it from the mapping because you track the references to > >> the object that the region points to, not the region itself. The region > >> remains in use by the dispatching layer and potentially the called > >> device, even after deregistration. > > > > That object will be a container_of() the region, usually literally but > > sometimes only in spirit. Reference counting the regions means they > > cannot be embedded into other objects any more. > > I cannot follow the logic of the last sentence. Reference counting just > means that we track if there are users left, not necessarily that we > perform asynchronous releases. We can simply wait for those users to > complete. I don't see how. Suppose you add a reference count to MemoryRegion. How do you delay its containing object's destructor from running? Do you iterate over all member MemoryRegion and examine their reference counts? Usually a reference count controls the lifetime of the reference counted object, what are you suggesting here? > > > > We can probably figure out a way to flush out accesses. After switching > > to rcu, for example, all we need is synchronize_rcu() in a > > non-deadlocking place. But my bet is that it will not be needed. > > If you properly flush out accesses, you don't need to track at device > level anymore - and mess with abstraction layers. That's my whole point. To flush out an access you need either rwlock_write_lock() or synchronize_rcu() (depending on the implementation). But neither of these can be run from an rcu read-side critical section or rwlock_read_lock(). You could defer the change to a bottom half, but if the hardware demands that the change be complete before returning, that doesn't work. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
Il 28/08/2012 11:42, Jan Kiszka ha scritto: > Context A Context B > - - > object = lookup() > deregister(object) > modify(object) -> invalid state > ... use(object) > modify(object) -> valid state > register(object) > > And with "object" I'm not talking about QOM but any data structure. If you want to avoid locks, the only way to do so is RCU. Then you do not modify object at all. Paolo
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-08-28 05:38, liu ping fan wrote: > On Tue, Aug 28, 2012 at 11:09 AM, liu ping fan wrote: >> On Tue, Aug 28, 2012 at 3:38 AM, Jan Kiszka wrote: >>> On 2012-08-27 20:52, Avi Kivity wrote: On 08/27/2012 11:39 AM, Jan Kiszka wrote: > On 2012-08-27 20:20, Avi Kivity wrote: >> On 08/27/2012 11:17 AM, Jan Kiszka wrote: >>> On 2012-08-27 20:09, Avi Kivity wrote: On 08/27/2012 10:14 AM, Jan Kiszka wrote: >> >> Deregistration is fine, the problem is destruction. >> > > It isn't as you access memory region states that can change after > deregistration. Devices can remove memory regions from the mapping, > alter and then reinsert them. The last to steps must not happen while > anyone is still using a reference to that region. > Why not? If the guest is accessing an mmio region while reconfiguring it in a way that changes its meaning, either the previous or the next meaning is valid. >>> >>> If the memory region owner sets the content to zero or even releases it >>> (nothing states a memory region can only live inside a device >>> structure), we will crash. Restricting how a memory region can be >>> created and handled after it was once registered somewhere is an >>> unnatural interface, waiting to cause subtle bugs. >> >> Using an Object * allows the simple case to be really simple (object == >> device) and the hard cases to be doable. >> >> What would you suggest as a better interface? > > To protect the life cycle of the object we manage in the memory layer: > regions. We don't manage devices there. If there is any implementation > benefit in having a full QOM object, then make memory regions objects. I see and sympathise with this point of view. The idea to use opaque and convert it to Object * is that often a device has multiple regions but no interest in managing them separately. So managing regions will cause the device model authors to create boilerplate code to push region management to device management. I had this experience with the first iterations of the region interface, where instead of opaque we had MemoryRegion *. Device code would use container_of() in the simple case, but the non-simple case caused lots of pain. I believe it is the natural interface, but in practice it turned out to cause too much churn. > I simply don't like this indirection, having the memory layer pick up > the opaque value of the region and interpret it. That's just an intermediate step. The idea is that eventually opaque will be changed to Object *. > Even worse, apply > restrictions on how the dispatched objects, the regions, have to be > treated because of this. Please elaborate. >>> >>> The fact that you can't manipulate a memory region object arbitrarily >>> after removing it from the mapping because you track the references to >>> the object that the region points to, not the region itself. The region >>> remains in use by the dispatching layer and potentially the called >>> device, even after deregistration. >> >> Why can it in use by dispatching layer? Memory region just servers as >> interface for mem lookup, when dispatching in mr's MemoryRegionOps, >> why do we need to access it? Which means that read/write can cause >> register/deregister mr? Example? > > I found example in pci_update_mappings(), but but my following point > still stand. >> Also I think MemoryRegionOps will eventually operate on Object (mr is >> only interface). MRs is only represent of Object in the view of >> memory(so their life cycle can be managed by Object). And there could >> be some intermediate mr like those lie in subpage_t which are not >> based on Object. But we can walk down to the terminal point and >> extract the real Object, >> so when dispatching, we only need to ensure that Object and its direct >> mr are alive. Let's not talk about devices or MMIO dispatching. I think the problem is way more generic, and we will face it multiple times in QEMU. Besides MMIO and PIO dispatching, it will haunt us for file or event handlers, any kind of callbacks etc. Context A Context B - - object = lookup() deregister(object) modify(object) -> invalid state ... use(object) modify(object) -> valid state register(object) And with "object" I'm not talking about QOM but any data structure. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On Tue, Aug 28, 2012 at 11:09 AM, liu ping fan wrote: > On Tue, Aug 28, 2012 at 3:38 AM, Jan Kiszka wrote: >> On 2012-08-27 20:52, Avi Kivity wrote: >>> On 08/27/2012 11:39 AM, Jan Kiszka wrote: On 2012-08-27 20:20, Avi Kivity wrote: > On 08/27/2012 11:17 AM, Jan Kiszka wrote: >> On 2012-08-27 20:09, Avi Kivity wrote: >>> On 08/27/2012 10:14 AM, Jan Kiszka wrote: > > Deregistration is fine, the problem is destruction. > It isn't as you access memory region states that can change after deregistration. Devices can remove memory regions from the mapping, alter and then reinsert them. The last to steps must not happen while anyone is still using a reference to that region. >>> >>> Why not? If the guest is accessing an mmio region while reconfiguring >>> it in a way that changes its meaning, either the previous or the next >>> meaning is valid. >> >> If the memory region owner sets the content to zero or even releases it >> (nothing states a memory region can only live inside a device >> structure), we will crash. Restricting how a memory region can be >> created and handled after it was once registered somewhere is an >> unnatural interface, waiting to cause subtle bugs. > > Using an Object * allows the simple case to be really simple (object == > device) and the hard cases to be doable. > > What would you suggest as a better interface? To protect the life cycle of the object we manage in the memory layer: regions. We don't manage devices there. If there is any implementation benefit in having a full QOM object, then make memory regions objects. >>> >>> I see and sympathise with this point of view. >>> >>> The idea to use opaque and convert it to Object * is that often a device >>> has multiple regions but no interest in managing them separately. So >>> managing regions will cause the device model authors to create >>> boilerplate code to push region management to device management. >>> >>> I had this experience with the first iterations of the region interface, >>> where instead of opaque we had MemoryRegion *. Device code would use >>> container_of() in the simple case, but the non-simple case caused lots >>> of pain. I believe it is the natural interface, but in practice it >>> turned out to cause too much churn. >>> I simply don't like this indirection, having the memory layer pick up the opaque value of the region and interpret it. >>> >>> That's just an intermediate step. The idea is that eventually opaque >>> will be changed to Object *. >>> Even worse, apply restrictions on how the dispatched objects, the regions, have to be treated because of this. >>> >>> Please elaborate. >> >> The fact that you can't manipulate a memory region object arbitrarily >> after removing it from the mapping because you track the references to >> the object that the region points to, not the region itself. The region >> remains in use by the dispatching layer and potentially the called >> device, even after deregistration. > > Why can it in use by dispatching layer? Memory region just servers as > interface for mem lookup, when dispatching in mr's MemoryRegionOps, > why do we need to access it? Which means that read/write can cause > register/deregister mr? Example? I found example in pci_update_mappings(), but but my following point still stand. > Also I think MemoryRegionOps will eventually operate on Object (mr is > only interface). MRs is only represent of Object in the view of > memory(so their life cycle can be managed by Object). And there could > be some intermediate mr like those lie in subpage_t which are not > based on Object. But we can walk down to the terminal point and > extract the real Object, > so when dispatching, we only need to ensure that Object and its direct > mr are alive. > > Regards, > pingfan >> >> Jan >> >> -- >> Siemens AG, Corporate Technology, CT RTC ITP SDP-DE >> Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On Tue, Aug 28, 2012 at 3:38 AM, Jan Kiszka wrote: > On 2012-08-27 20:52, Avi Kivity wrote: >> On 08/27/2012 11:39 AM, Jan Kiszka wrote: >>> On 2012-08-27 20:20, Avi Kivity wrote: On 08/27/2012 11:17 AM, Jan Kiszka wrote: > On 2012-08-27 20:09, Avi Kivity wrote: >> On 08/27/2012 10:14 AM, Jan Kiszka wrote: Deregistration is fine, the problem is destruction. >>> >>> It isn't as you access memory region states that can change after >>> deregistration. Devices can remove memory regions from the mapping, >>> alter and then reinsert them. The last to steps must not happen while >>> anyone is still using a reference to that region. >>> >> >> Why not? If the guest is accessing an mmio region while reconfiguring >> it in a way that changes its meaning, either the previous or the next >> meaning is valid. > > If the memory region owner sets the content to zero or even releases it > (nothing states a memory region can only live inside a device > structure), we will crash. Restricting how a memory region can be > created and handled after it was once registered somewhere is an > unnatural interface, waiting to cause subtle bugs. Using an Object * allows the simple case to be really simple (object == device) and the hard cases to be doable. What would you suggest as a better interface? >>> >>> To protect the life cycle of the object we manage in the memory layer: >>> regions. We don't manage devices there. If there is any implementation >>> benefit in having a full QOM object, then make memory regions objects. >> >> I see and sympathise with this point of view. >> >> The idea to use opaque and convert it to Object * is that often a device >> has multiple regions but no interest in managing them separately. So >> managing regions will cause the device model authors to create >> boilerplate code to push region management to device management. >> >> I had this experience with the first iterations of the region interface, >> where instead of opaque we had MemoryRegion *. Device code would use >> container_of() in the simple case, but the non-simple case caused lots >> of pain. I believe it is the natural interface, but in practice it >> turned out to cause too much churn. >> >>> I simply don't like this indirection, having the memory layer pick up >>> the opaque value of the region and interpret it. >> >> That's just an intermediate step. The idea is that eventually opaque >> will be changed to Object *. >> >>> Even worse, apply >>> restrictions on how the dispatched objects, the regions, have to be >>> treated because of this. >> >> Please elaborate. > > The fact that you can't manipulate a memory region object arbitrarily > after removing it from the mapping because you track the references to > the object that the region points to, not the region itself. The region > remains in use by the dispatching layer and potentially the called > device, even after deregistration. Why can it in use by dispatching layer? Memory region just servers as interface for mem lookup, when dispatching in mr's MemoryRegionOps, why do we need to access it? Which means that read/write can cause register/deregister mr? Example? Also I think MemoryRegionOps will eventually operate on Object (mr is only interface). MRs is only represent of Object in the view of memory(so their life cycle can be managed by Object). And there could be some intermediate mr like those lie in subpage_t which are not based on Object. But we can walk down to the terminal point and extract the real Object, so when dispatching, we only need to ensure that Object and its direct mr are alive. Regards, pingfan > > Jan > > -- > Siemens AG, Corporate Technology, CT RTC ITP SDP-DE > Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-08-27 22:53, Avi Kivity wrote: > On 08/27/2012 12:38 PM, Jan Kiszka wrote: Even worse, apply restrictions on how the dispatched objects, the regions, have to be treated because of this. >>> >>> Please elaborate. >> >> The fact that you can't manipulate a memory region object arbitrarily >> after removing it from the mapping because you track the references to >> the object that the region points to, not the region itself. The region >> remains in use by the dispatching layer and potentially the called >> device, even after deregistration. > > That object will be a container_of() the region, usually literally but > sometimes only in spirit. Reference counting the regions means they > cannot be embedded into other objects any more. I cannot follow the logic of the last sentence. Reference counting just means that we track if there are users left, not necessarily that we perform asynchronous releases. We can simply wait for those users to complete. > > We can probably figure out a way to flush out accesses. After switching > to rcu, for example, all we need is synchronize_rcu() in a > non-deadlocking place. But my bet is that it will not be needed. If you properly flush out accesses, you don't need to track at device level anymore - and mess with abstraction layers. That's my whole point. Jan signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
Il 27/08/2012 22:58, Avi Kivity ha scritto: >> > It's best to start this conversion using very coarse locking. There's >> > no need to start with ultra fine grain locking. > How does it work? You have to drop this main loop lock to dispatch the > callback, so the data structure you use for dispatch is no longer protected. It is protected by a mixture of reference counting, thread-local storage and careful coding to take care of re-entrancy. I'm not too sure that we can use it, though. Mostly because of the reasons Jan mentioned (it may also not be precise enough, but we can fix that with our own GSource wrapping POSIX real-time timers or Linux timerfds). Paolo
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 08/27/2012 12:17 PM, Anthony Liguori wrote: > Avi Kivity writes: > > > On 08/27/2012 09:24 AM, Anthony Liguori wrote: > >> > > >> > I'm sure we should leave existing code alone wherever possible, focusing > >> > on providing alternative versions for those paths that matter. Example: > >> > Most timers are fine under BQL. But some sensitive devices (RTC or HPET > >> > as clock source) will want their own timers. So the approach is to > >> > instantiate a separate, also prioritizeable instance of the timer > >> > subsystem for them and be done. > >> > >> I disagree. I think we conver the timer subsystem to be lockless and > >> then let some devices acquire the BQL during dispatch. > > > > I agree with your disagreement but disagree with the rest. The timer > > subsystem should have its own internal locking that callers will not be > > aware of. Requiring devices to acquire the bql will lead to > > deadlocks. > > Err, I completely mispoke there. > > I meant, to say, we should convert the timer subsystem to be re-entrant > and then let some devices acquire the BQL during dispatch. That would be all devices, with converted devices not acquiring it as they are converted. > Existing users of the time API should get the BQL acquired automagically > for them. The new API would not hold the BQL during dispatch but the > old API, implemented in terms of the new API, would acquire it on behalf > of the caller. > > As a rule, if a device relies on the BQL, it must use the old APIs. If > a device uses the new APIs, it must not acquire the BQL. This exactly mirrors the plan with memory region conversion. > > Note that fine-grained locking timers will also require reference > > counting: you want to call the timer expiration callback after releasing > > the timer subsystem lock, so you need to make sure the callback does not > > go away. > > glib simply uses a single main loop lock to deal with all of this. I > think that's absolutely the right model. > > It's best to start this conversion using very coarse locking. There's > no need to start with ultra fine grain locking. How does it work? You have to drop this main loop lock to dispatch the callback, so the data structure you use for dispatch is no longer protected. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 08/27/2012 12:38 PM, Jan Kiszka wrote: > >> Even worse, apply > >> restrictions on how the dispatched objects, the regions, have to be > >> treated because of this. > > > > Please elaborate. > > The fact that you can't manipulate a memory region object arbitrarily > after removing it from the mapping because you track the references to > the object that the region points to, not the region itself. The region > remains in use by the dispatching layer and potentially the called > device, even after deregistration. That object will be a container_of() the region, usually literally but sometimes only in spirit. Reference counting the regions means they cannot be embedded into other objects any more. We can probably figure out a way to flush out accesses. After switching to rcu, for example, all we need is synchronize_rcu() in a non-deadlocking place. But my bet is that it will not be needed. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-08-27 20:52, Avi Kivity wrote: > On 08/27/2012 11:39 AM, Jan Kiszka wrote: >> On 2012-08-27 20:20, Avi Kivity wrote: >>> On 08/27/2012 11:17 AM, Jan Kiszka wrote: On 2012-08-27 20:09, Avi Kivity wrote: > On 08/27/2012 10:14 AM, Jan Kiszka wrote: >>> >>> Deregistration is fine, the problem is destruction. >>> >> >> It isn't as you access memory region states that can change after >> deregistration. Devices can remove memory regions from the mapping, >> alter and then reinsert them. The last to steps must not happen while >> anyone is still using a reference to that region. >> > > Why not? If the guest is accessing an mmio region while reconfiguring > it in a way that changes its meaning, either the previous or the next > meaning is valid. If the memory region owner sets the content to zero or even releases it (nothing states a memory region can only live inside a device structure), we will crash. Restricting how a memory region can be created and handled after it was once registered somewhere is an unnatural interface, waiting to cause subtle bugs. >>> >>> Using an Object * allows the simple case to be really simple (object == >>> device) and the hard cases to be doable. >>> >>> What would you suggest as a better interface? >> >> To protect the life cycle of the object we manage in the memory layer: >> regions. We don't manage devices there. If there is any implementation >> benefit in having a full QOM object, then make memory regions objects. > > I see and sympathise with this point of view. > > The idea to use opaque and convert it to Object * is that often a device > has multiple regions but no interest in managing them separately. So > managing regions will cause the device model authors to create > boilerplate code to push region management to device management. > > I had this experience with the first iterations of the region interface, > where instead of opaque we had MemoryRegion *. Device code would use > container_of() in the simple case, but the non-simple case caused lots > of pain. I believe it is the natural interface, but in practice it > turned out to cause too much churn. > >> I simply don't like this indirection, having the memory layer pick up >> the opaque value of the region and interpret it. > > That's just an intermediate step. The idea is that eventually opaque > will be changed to Object *. > >> Even worse, apply >> restrictions on how the dispatched objects, the regions, have to be >> treated because of this. > > Please elaborate. The fact that you can't manipulate a memory region object arbitrarily after removing it from the mapping because you track the references to the object that the region points to, not the region itself. The region remains in use by the dispatching layer and potentially the called device, even after deregistration. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-08-27 21:17, Anthony Liguori wrote: > Avi Kivity writes: > >> On 08/27/2012 09:24 AM, Anthony Liguori wrote: I'm sure we should leave existing code alone wherever possible, focusing on providing alternative versions for those paths that matter. Example: Most timers are fine under BQL. But some sensitive devices (RTC or HPET as clock source) will want their own timers. So the approach is to instantiate a separate, also prioritizeable instance of the timer subsystem for them and be done. >>> >>> I disagree. I think we conver the timer subsystem to be lockless and >>> then let some devices acquire the BQL during dispatch. >> >> I agree with your disagreement but disagree with the rest. The timer >> subsystem should have its own internal locking that callers will not be >> aware of. Requiring devices to acquire the bql will lead to >> deadlocks. > > Err, I completely mispoke there. > > I meant, to say, we should convert the timer subsystem to be re-entrant > and then let some devices acquire the BQL during dispatch. > > Existing users of the time API should get the BQL acquired automagically > for them. The new API would not hold the BQL during dispatch but the > old API, implemented in terms of the new API, would acquire it on behalf > of the caller. > > As a rule, if a device relies on the BQL, it must use the old APIs. If > a device uses the new APIs, it must not acquire the BQL. This makes sense. > >> Note that fine-grained locking timers will also require reference >> counting: you want to call the timer expiration callback after releasing >> the timer subsystem lock, so you need to make sure the callback does not >> go away. > > glib simply uses a single main loop lock to deal with all of this. I > think that's absolutely the right model. That depends on what is under the lock. Also, relying on locking of external libraries is a no-go for realtime as they usually do not care about priorities and inversion avoidance. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
Avi Kivity writes: > On 08/27/2012 09:24 AM, Anthony Liguori wrote: >> > >> > I'm sure we should leave existing code alone wherever possible, focusing >> > on providing alternative versions for those paths that matter. Example: >> > Most timers are fine under BQL. But some sensitive devices (RTC or HPET >> > as clock source) will want their own timers. So the approach is to >> > instantiate a separate, also prioritizeable instance of the timer >> > subsystem for them and be done. >> >> I disagree. I think we conver the timer subsystem to be lockless and >> then let some devices acquire the BQL during dispatch. > > I agree with your disagreement but disagree with the rest. The timer > subsystem should have its own internal locking that callers will not be > aware of. Requiring devices to acquire the bql will lead to > deadlocks. Err, I completely mispoke there. I meant, to say, we should convert the timer subsystem to be re-entrant and then let some devices acquire the BQL during dispatch. Existing users of the time API should get the BQL acquired automagically for them. The new API would not hold the BQL during dispatch but the old API, implemented in terms of the new API, would acquire it on behalf of the caller. As a rule, if a device relies on the BQL, it must use the old APIs. If a device uses the new APIs, it must not acquire the BQL. > Note that fine-grained locking timers will also require reference > counting: you want to call the timer expiration callback after releasing > the timer subsystem lock, so you need to make sure the callback does not > go away. glib simply uses a single main loop lock to deal with all of this. I think that's absolutely the right model. It's best to start this conversion using very coarse locking. There's no need to start with ultra fine grain locking. Regards, Anthony Liguori > Linux manages without it (hrtimer_interrupt), so maybe we can too. > -- > I have a truly marvellous patch that fixes the bug which this > signature is too narrow to contain.
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 08/27/2012 11:39 AM, Jan Kiszka wrote: > On 2012-08-27 20:20, Avi Kivity wrote: > > On 08/27/2012 11:17 AM, Jan Kiszka wrote: > >> On 2012-08-27 20:09, Avi Kivity wrote: > >>> On 08/27/2012 10:14 AM, Jan Kiszka wrote: > > > > Deregistration is fine, the problem is destruction. > > > > It isn't as you access memory region states that can change after > deregistration. Devices can remove memory regions from the mapping, > alter and then reinsert them. The last to steps must not happen while > anyone is still using a reference to that region. > > >>> > >>> Why not? If the guest is accessing an mmio region while reconfiguring > >>> it in a way that changes its meaning, either the previous or the next > >>> meaning is valid. > >> > >> If the memory region owner sets the content to zero or even releases it > >> (nothing states a memory region can only live inside a device > >> structure), we will crash. Restricting how a memory region can be > >> created and handled after it was once registered somewhere is an > >> unnatural interface, waiting to cause subtle bugs. > > > > Using an Object * allows the simple case to be really simple (object == > > device) and the hard cases to be doable. > > > > What would you suggest as a better interface? > > To protect the life cycle of the object we manage in the memory layer: > regions. We don't manage devices there. If there is any implementation > benefit in having a full QOM object, then make memory regions objects. I see and sympathise with this point of view. The idea to use opaque and convert it to Object * is that often a device has multiple regions but no interest in managing them separately. So managing regions will cause the device model authors to create boilerplate code to push region management to device management. I had this experience with the first iterations of the region interface, where instead of opaque we had MemoryRegion *. Device code would use container_of() in the simple case, but the non-simple case caused lots of pain. I believe it is the natural interface, but in practice it turned out to cause too much churn. > I simply don't like this indirection, having the memory layer pick up > the opaque value of the region and interpret it. That's just an intermediate step. The idea is that eventually opaque will be changed to Object *. > Even worse, apply > restrictions on how the dispatched objects, the regions, have to be > treated because of this. Please elaborate. > Also, using memory regions to control the locking behaviour allows for > more fine-grained control. A device may expose certain regions with > self-managed locking while others, less time critical ones, can still be > handled under BQL for simplicity reasons. You can still aquire the bql in one callback and your local lock in another if you choose (but that would be most confusing IMO). > Example: regions that translate MMIO to PIO (alpha-pci.c, every user of > isa_mmio.c, ...). If PIO dispatching runs outside of the BQL, these > regions must not be protected by the BQL anymore. At the same time, we > may not want to convert the device exposing the region to its own > locking scheme (yet). And we surely don't want to take the per-device > lock for this state-less PIO dispatching. We're diverging, but eventually PIO dispatch will share the same code as mmio dispatch. PIO-to-MMIO bridges will simply be containers or aliases, they will not call cpu_inb() and relatives. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-08-27 20:20, Avi Kivity wrote: > On 08/27/2012 11:17 AM, Jan Kiszka wrote: >> On 2012-08-27 20:09, Avi Kivity wrote: >>> On 08/27/2012 10:14 AM, Jan Kiszka wrote: > > Deregistration is fine, the problem is destruction. > It isn't as you access memory region states that can change after deregistration. Devices can remove memory regions from the mapping, alter and then reinsert them. The last to steps must not happen while anyone is still using a reference to that region. >>> >>> Why not? If the guest is accessing an mmio region while reconfiguring >>> it in a way that changes its meaning, either the previous or the next >>> meaning is valid. >> >> If the memory region owner sets the content to zero or even releases it >> (nothing states a memory region can only live inside a device >> structure), we will crash. Restricting how a memory region can be >> created and handled after it was once registered somewhere is an >> unnatural interface, waiting to cause subtle bugs. > > Using an Object * allows the simple case to be really simple (object == > device) and the hard cases to be doable. > > What would you suggest as a better interface? To protect the life cycle of the object we manage in the memory layer: regions. We don't manage devices there. If there is any implementation benefit in having a full QOM object, then make memory regions objects. I simply don't like this indirection, having the memory layer pick up the opaque value of the region and interpret it. Even worse, apply restrictions on how the dispatched objects, the regions, have to be treated because of this. Also, using memory regions to control the locking behaviour allows for more fine-grained control. A device may expose certain regions with self-managed locking while others, less time critical ones, can still be handled under BQL for simplicity reasons. Example: regions that translate MMIO to PIO (alpha-pci.c, every user of isa_mmio.c, ...). If PIO dispatching runs outside of the BQL, these regions must not be protected by the BQL anymore. At the same time, we may not want to convert the device exposing the region to its own locking scheme (yet). And we surely don't want to take the per-device lock for this state-less PIO dispatching. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 08/27/2012 09:24 AM, Anthony Liguori wrote: > > > > I'm sure we should leave existing code alone wherever possible, focusing > > on providing alternative versions for those paths that matter. Example: > > Most timers are fine under BQL. But some sensitive devices (RTC or HPET > > as clock source) will want their own timers. So the approach is to > > instantiate a separate, also prioritizeable instance of the timer > > subsystem for them and be done. > > I disagree. I think we conver the timer subsystem to be lockless and > then let some devices acquire the BQL during dispatch. I agree with your disagreement but disagree with the rest. The timer subsystem should have its own internal locking that callers will not be aware of. Requiring devices to acquire the bql will lead to deadlocks. Note that fine-grained locking timers will also require reference counting: you want to call the timer expiration callback after releasing the timer subsystem lock, so you need to make sure the callback does not go away. Linux manages without it (hrtimer_interrupt), so maybe we can too. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 08/27/2012 06:19 AM, Anthony Liguori wrote: > Liu Ping Fan writes: > > > From: Liu Ping Fan > > > > Scene: > > obja lies in objA, when objA's ref->0, it will be freed, > > but at that time obja can still be in use. > > > > The real example is: > > typedef struct PCIIDEState { > > PCIDevice dev; > > IDEBus bus[2]; --> create in place > > . > > } > > > > When without big lock protection for mmio-dispatch, we will hold > > obj's refcnt. So memory_region_init_io() will replace the third para > > "void *opaque" with "Object *obj". > > With this patch, we can protect PCIIDEState from disappearing during > > mmio-dispatch hold the IDEBus->ref. > > > > And the ref circle has been broken when calling qdev_delete_subtree(). > > > > Signed-off-by: Liu Ping Fan > > I think this is solving the wrong problem. There are many, many > dependencies a device may have on other devices. Memory allocation > isn't the only one. > > The problem is that we want to make sure that a device doesn't "go away" > while an MMIO dispatch is happening. This is easy to solve without > touching referencing counting. > > The device will hold a lock while the MMIO is being dispatched. The > delete path simply needs to acquire that same lock. This will ensure > that a delete operation cannot finish while MMIO is still in flight. Where does the lock live? If it lives in the device, then we can't acquire it during mmio dispatch (the device may have gone away). If it lives outside the device, we need to to tell the memory core about it. I would really like to avoid adding additional state; I think the curent opaque should serve both to provide state to the callbacks and for life cycle management. > Regarding deleting a device, not all devices are capable of being > deleted and specifically, devices that are composed within the memory of > another device cannot be directly deleted (they can only be deleted > as part of their parent's destruction). This doesn't help at all. We have devices that can be deleted, and we would like to make the devices use the unlocked path, so we have to account for it. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 08/27/2012 11:17 AM, Jan Kiszka wrote: > On 2012-08-27 20:09, Avi Kivity wrote: > > On 08/27/2012 10:14 AM, Jan Kiszka wrote: > >>> > >>> Deregistration is fine, the problem is destruction. > >>> > >> > >> It isn't as you access memory region states that can change after > >> deregistration. Devices can remove memory regions from the mapping, > >> alter and then reinsert them. The last to steps must not happen while > >> anyone is still using a reference to that region. > >> > > > > Why not? If the guest is accessing an mmio region while reconfiguring > > it in a way that changes its meaning, either the previous or the next > > meaning is valid. > > If the memory region owner sets the content to zero or even releases it > (nothing states a memory region can only live inside a device > structure), we will crash. Restricting how a memory region can be > created and handled after it was once registered somewhere is an > unnatural interface, waiting to cause subtle bugs. Using an Object * allows the simple case to be really simple (object == device) and the hard cases to be doable. What would you suggest as a better interface? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-08-27 20:09, Avi Kivity wrote: > On 08/27/2012 10:14 AM, Jan Kiszka wrote: >>> >>> Deregistration is fine, the problem is destruction. >>> >> >> It isn't as you access memory region states that can change after >> deregistration. Devices can remove memory regions from the mapping, >> alter and then reinsert them. The last to steps must not happen while >> anyone is still using a reference to that region. >> > > Why not? If the guest is accessing an mmio region while reconfiguring > it in a way that changes its meaning, either the previous or the next > meaning is valid. If the memory region owner sets the content to zero or even releases it (nothing states a memory region can only live inside a device structure), we will crash. Restricting how a memory region can be created and handled after it was once registered somewhere is an unnatural interface, waiting to cause subtle bugs. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 08/27/2012 10:14 AM, Jan Kiszka wrote: > > > > Deregistration is fine, the problem is destruction. > > > > It isn't as you access memory region states that can change after > deregistration. Devices can remove memory regions from the mapping, > alter and then reinsert them. The last to steps must not happen while > anyone is still using a reference to that region. > Why not? If the guest is accessing an mmio region while reconfiguring it in a way that changes its meaning, either the previous or the next meaning is valid. It is true that memory_region_set_enabled(..., false) will become weaker as a result. Code will have to be prepared for that. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-08-27 19:09, Avi Kivity wrote: > On 08/27/2012 12:47 AM, Jan Kiszka wrote: >> On 2012-08-27 09:01, Paolo Bonzini wrote: >>> Il 25/08/2012 09:42, liu ping fan ha scritto: >> >> I don't see why MMIO dispatch should hold the IDEBus ref rather than the >> PCIIDEState. >> When transfer memory_region_init_io() 3rd para from void* opaque to Object* obj, the obj : opaque is not neccessary 1:1 map. For such situation, in order to let MemoryRegionOps tell between them, we should pass PCIIDEState->bus[0], bus[1] separately. >>> >>> The rule should be that the obj is the object that you want referenced, >>> and that should be the PCIIDEState. >>> >>> But this is anyway moot because it only applies to objects that are >>> converted to use unlocked dispatch. This likely will not be the case >>> for IDE. >> >> BTW, I'm pretty sure - after implementing the basics for BQL-free PIO >> dispatching - that device objects are the wrong target for reference >> counting. We keep memory regions in our dispatching tables (PIO >> dispatching needs some refactoring for this), and those regions need >> protection for BQL-free use. Devices can't pass away as long as the have >> referenced regions, memory region deregistration services will have to >> take care of this. >> >> I'm currently not using reference counting at all, I'm enforcing that >> only BQL-protected regions can be deregistered. > > That's a pretty harsh constraint. Of course. It's a first step, not a long-term restriction. But we have to do small steps. > >> Also note that there seems to be another misconception in the >> discussions: deregistration is not only bound to device unplug. It also >> happens on device reconfiguration, e.g. PCI BAR (re-)mapping. Another >> strong indicator that we should worry about individual memory regions, >> not devices. >> >> > > Deregistration is fine, the problem is destruction. > It isn't as you access memory region states that can change after deregistration. Devices can remove memory regions from the mapping, alter and then reinsert them. The last to steps must not happen while anyone is still using a reference to that region. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 08/27/2012 12:47 AM, Jan Kiszka wrote: > On 2012-08-27 09:01, Paolo Bonzini wrote: > > Il 25/08/2012 09:42, liu ping fan ha scritto: > > I don't see why MMIO dispatch should hold the IDEBus ref rather than the > PCIIDEState. > > >> When transfer memory_region_init_io() 3rd para from void* opaque to > >> Object* obj, the obj : opaque is not neccessary 1:1 map. For such > >> situation, in order to let MemoryRegionOps tell between them, we > >> should pass PCIIDEState->bus[0], bus[1] separately. > > > > The rule should be that the obj is the object that you want referenced, > > and that should be the PCIIDEState. > > > > But this is anyway moot because it only applies to objects that are > > converted to use unlocked dispatch. This likely will not be the case > > for IDE. > > BTW, I'm pretty sure - after implementing the basics for BQL-free PIO > dispatching - that device objects are the wrong target for reference > counting. We keep memory regions in our dispatching tables (PIO > dispatching needs some refactoring for this), and those regions need > protection for BQL-free use. Devices can't pass away as long as the have > referenced regions, memory region deregistration services will have to > take care of this. > > I'm currently not using reference counting at all, I'm enforcing that > only BQL-protected regions can be deregistered. That's a pretty harsh constraint. > Also note that there seems to be another misconception in the > discussions: deregistration is not only bound to device unplug. It also > happens on device reconfiguration, e.g. PCI BAR (re-)mapping. Another > strong indicator that we should worry about individual memory regions, > not devices. > > Deregistration is fine, the problem is destruction. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-08-27 18:24, Anthony Liguori wrote: > Jan Kiszka writes: > >> On 2012-08-27 17:14, Anthony Liguori wrote: >>> Jan Kiszka writes: >>> On 2012-08-27 15:19, Anthony Liguori wrote: > Liu Ping Fan writes: > >> From: Liu Ping Fan >> >> Scene: >> obja lies in objA, when objA's ref->0, it will be freed, >> but at that time obja can still be in use. >> >> The real example is: >> typedef struct PCIIDEState { >> PCIDevice dev; >> IDEBus bus[2]; --> create in place >> . >> } >> >> When without big lock protection for mmio-dispatch, we will hold >> obj's refcnt. So memory_region_init_io() will replace the third para >> "void *opaque" with "Object *obj". >> With this patch, we can protect PCIIDEState from disappearing during >> mmio-dispatch hold the IDEBus->ref. >> >> And the ref circle has been broken when calling qdev_delete_subtree(). >> >> Signed-off-by: Liu Ping Fan > > I think this is solving the wrong problem. There are many, many > dependencies a device may have on other devices. Memory allocation > isn't the only one. > > The problem is that we want to make sure that a device doesn't "go away" > while an MMIO dispatch is happening. This is easy to solve without > touching referencing counting. > > The device will hold a lock while the MMIO is being dispatched. The > delete path simply needs to acquire that same lock. This will ensure > that a delete operation cannot finish while MMIO is still in flight. That's a bit too simple. Quite a few MMIO/PIO fast-paths will work without any device-specific locking, e.g. just to read a simple register value. So we will need reference counting >>> >>> But then you'll need to acquire a lock to take the reference/remove the >>> reference which sort of defeats the purpose of trying to fast path. >> >> Atomic ops? RCU? This problem won't be solved for the first time. > > Yes, there are ways to do this, but you add a fair bit of complication. > > It's much simplier to make objects not have any locks and enforce all > callers to protect access. IOW, noone except the device gets to inc/dec > reference counts. Pulling locks out of the devices may appear simpler on first sight but won't be in practice. Objects can issue requests to other objects. Will they hold its lock while taking the target's lock? Lock management is a per-device thing, not only for performance reasons. But I guess we will have to compare working prototypes. Many concepts look nice on the paper, but when you implement them or even try to run them, the interesting problems show up. Specifically those related to a smooth transition phase. > (for devices using private locks), but on the "front-line" object: the memory region. That region will block its owner from disappearing by waiting on dispatch when someone tries to unregister it. Also note that "holding a lock" is easily said but will be more tricky in practice. Quite a significant share of our code will continue to run under BQL, even for devices with their own locks. Init/cleanup functions will likely fall into this category, >>> >>> I'm not sure I'm convinced of this--but it's hard to tell until we >>> really start converting. >>> >>> BTW, I'm pretty sure we have to tackle main loop functions first before >>> we try to convert any devices off the BQL. >> >> I'm sure we should leave existing code alone wherever possible, focusing >> on providing alternative versions for those paths that matter. Example: >> Most timers are fine under BQL. But some sensitive devices (RTC or HPET >> as clock source) will want their own timers. So the approach is to >> instantiate a separate, also prioritizeable instance of the timer >> subsystem for them and be done. > > I disagree. I think we conver the timer subsystem to be lockless and > then let some devices acquire the BQL during dispatch. The timer subsystem is lockless in this sense already. Its user holds the necessary lock (BQL so far). I'm just making it multi-instance so that the time-critical users don't need to worry about other, low-prio timers. In that sense, I guess we are describing the same end result. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
Jan Kiszka writes: > On 2012-08-27 17:14, Anthony Liguori wrote: >> Jan Kiszka writes: >> >>> On 2012-08-27 15:19, Anthony Liguori wrote: Liu Ping Fan writes: > From: Liu Ping Fan > > Scene: > obja lies in objA, when objA's ref->0, it will be freed, > but at that time obja can still be in use. > > The real example is: > typedef struct PCIIDEState { > PCIDevice dev; > IDEBus bus[2]; --> create in place > . > } > > When without big lock protection for mmio-dispatch, we will hold > obj's refcnt. So memory_region_init_io() will replace the third para > "void *opaque" with "Object *obj". > With this patch, we can protect PCIIDEState from disappearing during > mmio-dispatch hold the IDEBus->ref. > > And the ref circle has been broken when calling qdev_delete_subtree(). > > Signed-off-by: Liu Ping Fan I think this is solving the wrong problem. There are many, many dependencies a device may have on other devices. Memory allocation isn't the only one. The problem is that we want to make sure that a device doesn't "go away" while an MMIO dispatch is happening. This is easy to solve without touching referencing counting. The device will hold a lock while the MMIO is being dispatched. The delete path simply needs to acquire that same lock. This will ensure that a delete operation cannot finish while MMIO is still in flight. >>> >>> That's a bit too simple. Quite a few MMIO/PIO fast-paths will work >>> without any device-specific locking, e.g. just to read a simple register >>> value. So we will need reference counting >> >> But then you'll need to acquire a lock to take the reference/remove the >> reference which sort of defeats the purpose of trying to fast path. > > Atomic ops? RCU? This problem won't be solved for the first time. Yes, there are ways to do this, but you add a fair bit of complication. It's much simplier to make objects not have any locks and enforce all callers to protect access. IOW, noone except the device gets to inc/dec reference counts. >>> (for devices using private >>> locks), but on the "front-line" object: the memory region. That region >>> will block its owner from disappearing by waiting on dispatch when >>> someone tries to unregister it. >>> >>> Also note that "holding a lock" is easily said but will be more tricky >>> in practice. Quite a significant share of our code will continue to run >>> under BQL, even for devices with their own locks. Init/cleanup functions >>> will likely fall into this category, >> >> I'm not sure I'm convinced of this--but it's hard to tell until we >> really start converting. >> >> BTW, I'm pretty sure we have to tackle main loop functions first before >> we try to convert any devices off the BQL. > > I'm sure we should leave existing code alone wherever possible, focusing > on providing alternative versions for those paths that matter. Example: > Most timers are fine under BQL. But some sensitive devices (RTC or HPET > as clock source) will want their own timers. So the approach is to > instantiate a separate, also prioritizeable instance of the timer > subsystem for them and be done. I disagree. I think we conver the timer subsystem to be lockless and then let some devices acquire the BQL during dispatch. And we have a nice thread-aware main loop available to us--glib. We don't need to reinvent the wheel. Regards, Anthony Liguori > > We won't convert QEMU in a day, but we surely want results before the > last corner is refactored (which would take years, at best). > > Jan > > -- > Siemens AG, Corporate Technology, CT RTC ITP SDP-DE > Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-08-27 17:14, Anthony Liguori wrote: > Jan Kiszka writes: > >> On 2012-08-27 15:19, Anthony Liguori wrote: >>> Liu Ping Fan writes: >>> From: Liu Ping Fan Scene: obja lies in objA, when objA's ref->0, it will be freed, but at that time obja can still be in use. The real example is: typedef struct PCIIDEState { PCIDevice dev; IDEBus bus[2]; --> create in place . } When without big lock protection for mmio-dispatch, we will hold obj's refcnt. So memory_region_init_io() will replace the third para "void *opaque" with "Object *obj". With this patch, we can protect PCIIDEState from disappearing during mmio-dispatch hold the IDEBus->ref. And the ref circle has been broken when calling qdev_delete_subtree(). Signed-off-by: Liu Ping Fan >>> >>> I think this is solving the wrong problem. There are many, many >>> dependencies a device may have on other devices. Memory allocation >>> isn't the only one. >>> >>> The problem is that we want to make sure that a device doesn't "go away" >>> while an MMIO dispatch is happening. This is easy to solve without >>> touching referencing counting. >>> >>> The device will hold a lock while the MMIO is being dispatched. The >>> delete path simply needs to acquire that same lock. This will ensure >>> that a delete operation cannot finish while MMIO is still in flight. >> >> That's a bit too simple. Quite a few MMIO/PIO fast-paths will work >> without any device-specific locking, e.g. just to read a simple register >> value. So we will need reference counting > > But then you'll need to acquire a lock to take the reference/remove the > reference which sort of defeats the purpose of trying to fast path. Atomic ops? RCU? This problem won't be solved for the first time. > >> (for devices using private >> locks), but on the "front-line" object: the memory region. That region >> will block its owner from disappearing by waiting on dispatch when >> someone tries to unregister it. >> >> Also note that "holding a lock" is easily said but will be more tricky >> in practice. Quite a significant share of our code will continue to run >> under BQL, even for devices with their own locks. Init/cleanup functions >> will likely fall into this category, > > I'm not sure I'm convinced of this--but it's hard to tell until we > really start converting. > > BTW, I'm pretty sure we have to tackle main loop functions first before > we try to convert any devices off the BQL. I'm sure we should leave existing code alone wherever possible, focusing on providing alternative versions for those paths that matter. Example: Most timers are fine under BQL. But some sensitive devices (RTC or HPET as clock source) will want their own timers. So the approach is to instantiate a separate, also prioritizeable instance of the timer subsystem for them and be done. We won't convert QEMU in a day, but we surely want results before the last corner is refactored (which would take years, at best). Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
Jan Kiszka writes: > On 2012-08-27 15:19, Anthony Liguori wrote: >> Liu Ping Fan writes: >> >>> From: Liu Ping Fan >>> >>> Scene: >>> obja lies in objA, when objA's ref->0, it will be freed, >>> but at that time obja can still be in use. >>> >>> The real example is: >>> typedef struct PCIIDEState { >>> PCIDevice dev; >>> IDEBus bus[2]; --> create in place >>> . >>> } >>> >>> When without big lock protection for mmio-dispatch, we will hold >>> obj's refcnt. So memory_region_init_io() will replace the third para >>> "void *opaque" with "Object *obj". >>> With this patch, we can protect PCIIDEState from disappearing during >>> mmio-dispatch hold the IDEBus->ref. >>> >>> And the ref circle has been broken when calling qdev_delete_subtree(). >>> >>> Signed-off-by: Liu Ping Fan >> >> I think this is solving the wrong problem. There are many, many >> dependencies a device may have on other devices. Memory allocation >> isn't the only one. >> >> The problem is that we want to make sure that a device doesn't "go away" >> while an MMIO dispatch is happening. This is easy to solve without >> touching referencing counting. >> >> The device will hold a lock while the MMIO is being dispatched. The >> delete path simply needs to acquire that same lock. This will ensure >> that a delete operation cannot finish while MMIO is still in flight. > > That's a bit too simple. Quite a few MMIO/PIO fast-paths will work > without any device-specific locking, e.g. just to read a simple register > value. So we will need reference counting But then you'll need to acquire a lock to take the reference/remove the reference which sort of defeats the purpose of trying to fast path. > (for devices using private > locks), but on the "front-line" object: the memory region. That region > will block its owner from disappearing by waiting on dispatch when > someone tries to unregister it. > > Also note that "holding a lock" is easily said but will be more tricky > in practice. Quite a significant share of our code will continue to run > under BQL, even for devices with their own locks. Init/cleanup functions > will likely fall into this category, I'm not sure I'm convinced of this--but it's hard to tell until we really start converting. BTW, I'm pretty sure we have to tackle main loop functions first before we try to convert any devices off the BQL. Regards, Anthony Liguori > simply because the surrounding > logic is hard to convert into fine-grained locking and is also not > performance critical. At the same time, we can't take BQL -> device-lock > as we have to support device-lock -> BQL ordering for (slow-path) calls > into BQL-protected areas while holding a per-device lock (e.g. device > mapping changes). > > Jan > > -- > Siemens AG, Corporate Technology, CT RTC ITP SDP-DE > Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-08-27 15:19, Anthony Liguori wrote: > Liu Ping Fan writes: > >> From: Liu Ping Fan >> >> Scene: >> obja lies in objA, when objA's ref->0, it will be freed, >> but at that time obja can still be in use. >> >> The real example is: >> typedef struct PCIIDEState { >> PCIDevice dev; >> IDEBus bus[2]; --> create in place >> . >> } >> >> When without big lock protection for mmio-dispatch, we will hold >> obj's refcnt. So memory_region_init_io() will replace the third para >> "void *opaque" with "Object *obj". >> With this patch, we can protect PCIIDEState from disappearing during >> mmio-dispatch hold the IDEBus->ref. >> >> And the ref circle has been broken when calling qdev_delete_subtree(). >> >> Signed-off-by: Liu Ping Fan > > I think this is solving the wrong problem. There are many, many > dependencies a device may have on other devices. Memory allocation > isn't the only one. > > The problem is that we want to make sure that a device doesn't "go away" > while an MMIO dispatch is happening. This is easy to solve without > touching referencing counting. > > The device will hold a lock while the MMIO is being dispatched. The > delete path simply needs to acquire that same lock. This will ensure > that a delete operation cannot finish while MMIO is still in flight. That's a bit too simple. Quite a few MMIO/PIO fast-paths will work without any device-specific locking, e.g. just to read a simple register value. So we will need reference counting (for devices using private locks), but on the "front-line" object: the memory region. That region will block its owner from disappearing by waiting on dispatch when someone tries to unregister it. Also note that "holding a lock" is easily said but will be more tricky in practice. Quite a significant share of our code will continue to run under BQL, even for devices with their own locks. Init/cleanup functions will likely fall into this category, simply because the surrounding logic is hard to convert into fine-grained locking and is also not performance critical. At the same time, we can't take BQL -> device-lock as we have to support device-lock -> BQL ordering for (slow-path) calls into BQL-protected areas while holding a per-device lock (e.g. device mapping changes). Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
Liu Ping Fan writes: > From: Liu Ping Fan > > Scene: > obja lies in objA, when objA's ref->0, it will be freed, > but at that time obja can still be in use. > > The real example is: > typedef struct PCIIDEState { > PCIDevice dev; > IDEBus bus[2]; --> create in place > . > } > > When without big lock protection for mmio-dispatch, we will hold > obj's refcnt. So memory_region_init_io() will replace the third para > "void *opaque" with "Object *obj". > With this patch, we can protect PCIIDEState from disappearing during > mmio-dispatch hold the IDEBus->ref. > > And the ref circle has been broken when calling qdev_delete_subtree(). > > Signed-off-by: Liu Ping Fan I think this is solving the wrong problem. There are many, many dependencies a device may have on other devices. Memory allocation isn't the only one. The problem is that we want to make sure that a device doesn't "go away" while an MMIO dispatch is happening. This is easy to solve without touching referencing counting. The device will hold a lock while the MMIO is being dispatched. The delete path simply needs to acquire that same lock. This will ensure that a delete operation cannot finish while MMIO is still in flight. Regarding deleting a device, not all devices are capable of being deleted and specifically, devices that are composed within the memory of another device cannot be directly deleted (they can only be deleted as part of their parent's destruction). Regards, Anthony Liguori > --- > hw/qdev.c |2 ++ > hw/qdev.h |1 + > 2 files changed, 3 insertions(+), 0 deletions(-) > > diff --git a/hw/qdev.c b/hw/qdev.c > index e2339a1..b09ebbf 100644 > --- a/hw/qdev.c > +++ b/hw/qdev.c > @@ -510,6 +510,8 @@ void qbus_create_inplace(BusState *bus, const char > *typename, > { > object_initialize(bus, typename); > > +bus->overlap = parent; > +object_ref(OBJECT(bus->overlap)); > bus->parent = parent; > bus->name = name ? g_strdup(name) : NULL; > qbus_realize(bus); > diff --git a/hw/qdev.h b/hw/qdev.h > index 182cfa5..9bc5783 100644 > --- a/hw/qdev.h > +++ b/hw/qdev.h > @@ -117,6 +117,7 @@ struct BusState { > int allow_hotplug; > bool qom_allocated; > bool glib_allocated; > +DeviceState *overlap; > int max_index; > QTAILQ_HEAD(ChildrenHead, BusChild) children; > QLIST_ENTRY(BusState) sibling; > -- > 1.7.4.4
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-08-27 10:17, liu ping fan wrote: > On Mon, Aug 27, 2012 at 3:47 PM, Jan Kiszka wrote: >> On 2012-08-27 09:01, Paolo Bonzini wrote: >>> Il 25/08/2012 09:42, liu ping fan ha scritto: >> >> I don't see why MMIO dispatch should hold the IDEBus ref rather than the >> PCIIDEState. >> When transfer memory_region_init_io() 3rd para from void* opaque to Object* obj, the obj : opaque is not neccessary 1:1 map. For such situation, in order to let MemoryRegionOps tell between them, we should pass PCIIDEState->bus[0], bus[1] separately. >>> >>> The rule should be that the obj is the object that you want referenced, >>> and that should be the PCIIDEState. >>> >>> But this is anyway moot because it only applies to objects that are >>> converted to use unlocked dispatch. This likely will not be the case >>> for IDE. >> >> BTW, I'm pretty sure - after implementing the basics for BQL-free PIO >> dispatching - that device objects are the wrong target for reference > > Hi Jan, thanks for reminder, but could you explain it more detail? > mmio dispatch table holds 1 ref for device, before releasing this > ref,( When unplugging, we detach all the device's mr from memory, then > drop the ref. So I think that no leak will be exposed by mr and it is > safe to use device as target for reference. It would be a mistake to assume that memory regions can only be embedded in device objects. Memory regions can be reconfigured or dynamically added/removed (see e.g. portio lists) - there is no "device" in this sentence. Regions are stored in the dispatching table, they will first of all be touched without holding the BQL. So their content has to be stable in that period, and it is the proper abstraction, IMHO, to focus on their life cycle management and attach all the rest to them. > >> counting. We keep memory regions in our dispatching tables (PIO >> dispatching needs some refactoring for this), and those regions need >> protection for BQL-free use. Devices can't pass away as long as the have > Yes, it is right. Device can pass away only after mr removed from > dispatching tables Great, then you don't have to worry about device objects in the context of dispatching. Jan signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On Mon, Aug 27, 2012 at 3:47 PM, Jan Kiszka wrote: > On 2012-08-27 09:01, Paolo Bonzini wrote: >> Il 25/08/2012 09:42, liu ping fan ha scritto: > > I don't see why MMIO dispatch should hold the IDEBus ref rather than the > PCIIDEState. > >>> When transfer memory_region_init_io() 3rd para from void* opaque to >>> Object* obj, the obj : opaque is not neccessary 1:1 map. For such >>> situation, in order to let MemoryRegionOps tell between them, we >>> should pass PCIIDEState->bus[0], bus[1] separately. >> >> The rule should be that the obj is the object that you want referenced, >> and that should be the PCIIDEState. >> >> But this is anyway moot because it only applies to objects that are >> converted to use unlocked dispatch. This likely will not be the case >> for IDE. > > BTW, I'm pretty sure - after implementing the basics for BQL-free PIO > dispatching - that device objects are the wrong target for reference Hi Jan, thanks for reminder, but could you explain it more detail? mmio dispatch table holds 1 ref for device, before releasing this ref,( When unplugging, we detach all the device's mr from memory, then drop the ref. So I think that no leak will be exposed by mr and it is safe to use device as target for reference. > counting. We keep memory regions in our dispatching tables (PIO > dispatching needs some refactoring for this), and those regions need > protection for BQL-free use. Devices can't pass away as long as the have Yes, it is right. Device can pass away only after mr removed from dispatching tables Thanx pingfan > referenced regions, memory region deregistration services will have to > take care of this. > > I'm currently not using reference counting at all, I'm enforcing that > only BQL-protected regions can be deregistered. > > Also note that there seems to be another misconception in the > discussions: deregistration is not only bound to device unplug. It also > happens on device reconfiguration, e.g. PCI BAR (re-)mapping. Another > strong indicator that we should worry about individual memory regions, > not devices. > > Jan > >
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On 2012-08-27 09:01, Paolo Bonzini wrote: > Il 25/08/2012 09:42, liu ping fan ha scritto: I don't see why MMIO dispatch should hold the IDEBus ref rather than the PCIIDEState. >> When transfer memory_region_init_io() 3rd para from void* opaque to >> Object* obj, the obj : opaque is not neccessary 1:1 map. For such >> situation, in order to let MemoryRegionOps tell between them, we >> should pass PCIIDEState->bus[0], bus[1] separately. > > The rule should be that the obj is the object that you want referenced, > and that should be the PCIIDEState. > > But this is anyway moot because it only applies to objects that are > converted to use unlocked dispatch. This likely will not be the case > for IDE. BTW, I'm pretty sure - after implementing the basics for BQL-free PIO dispatching - that device objects are the wrong target for reference counting. We keep memory regions in our dispatching tables (PIO dispatching needs some refactoring for this), and those regions need protection for BQL-free use. Devices can't pass away as long as the have referenced regions, memory region deregistration services will have to take care of this. I'm currently not using reference counting at all, I'm enforcing that only BQL-protected regions can be deregistered. Also note that there seems to be another misconception in the discussions: deregistration is not only bound to device unplug. It also happens on device reconfiguration, e.g. PCI BAR (re-)mapping. Another strong indicator that we should worry about individual memory regions, not devices. Jan signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
Il 25/08/2012 09:42, liu ping fan ha scritto: >> > >> > I don't see why MMIO dispatch should hold the IDEBus ref rather than the >> > PCIIDEState. >> > > When transfer memory_region_init_io() 3rd para from void* opaque to > Object* obj, the obj : opaque is not neccessary 1:1 map. For such > situation, in order to let MemoryRegionOps tell between them, we > should pass PCIIDEState->bus[0], bus[1] separately. The rule should be that the obj is the object that you want referenced, and that should be the PCIIDEState. But this is anyway moot because it only applies to objects that are converted to use unlocked dispatch. This likely will not be the case for IDE. Paolo >> > In the case of the PIIX, the BARs are set up by the PCIIDEState in >> > bmdma_setup_bar (called by bmdma_setup_bar). >> > > Supposing we have convert PCIIDEState->bmdma[0]/[1] to Object. And in > mmio-dispatch, object_ref will impose on bmdma[0/[1], but this can not > prevent PCIIDEState->refcnt=0, and then the whole object disappear!
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
On Fri, Aug 24, 2012 at 10:42 PM, Paolo Bonzini wrote: > Il 24/08/2012 11:49, Liu Ping Fan ha scritto: >> With this patch, we can protect PCIIDEState from disappearing during >> mmio-dispatch hold the IDEBus->ref. > > I don't see why MMIO dispatch should hold the IDEBus ref rather than the > PCIIDEState. > When transfer memory_region_init_io() 3rd para from void* opaque to Object* obj, the obj : opaque is not neccessary 1:1 map. For such situation, in order to let MemoryRegionOps tell between them, we should pass PCIIDEState->bus[0], bus[1] separately. > In the case of the PIIX, the BARs are set up by the PCIIDEState in > bmdma_setup_bar (called by bmdma_setup_bar). > Supposing we have convert PCIIDEState->bmdma[0]/[1] to Object. And in mmio-dispatch, object_ref will impose on bmdma[0/[1], but this can not prevent PCIIDEState->refcnt=0, and then the whole object disappear! Thanks and regards, pingfan > Also, containment may happen just as well for devices, not buses. Why > isn't it a problem in that case? It looks like you're papering over a > different bug. > > Paolo > >> And the ref circle has been broken when calling qdev_delete_subtree(). >> >> Signed-off-by: Liu Ping Fan >> --- >> hw/qdev.c |2 ++ >> hw/qdev.h |1 + >> 2 files changed, 3 insertions(+), 0 deletions(-) >> >> diff --git a/hw/qdev.c b/hw/qdev.c >> index e2339a1..b09ebbf 100644 >> --- a/hw/qdev.c >> +++ b/hw/qdev.c >> @@ -510,6 +510,8 @@ void qbus_create_inplace(BusState *bus, const char >> *typename, >> { >> object_initialize(bus, typename); >> >> +bus->overlap = parent; >> +object_ref(OBJECT(bus->overlap)); >> bus->parent = parent; >> bus->name = name ? g_strdup(name) : NULL; >> qbus_realize(bus); >> diff --git a/hw/qdev.h b/hw/qdev.h >> index 182cfa5..9bc5783 100644 >> --- a/hw/qdev.h >> +++ b/hw/qdev.h >> @@ -117,6 +117,7 @@ struct BusState { >> int allow_hotplug; >> bool qom_allocated; >> bool glib_allocated; >> +DeviceState *overlap; >> int max_index; >> QTAILQ_HEAD(ChildrenHead, BusChild) children; >> QLIST_ENTRY(BusState) sibling; >> -- 1.7.4.4 >
Re: [Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
Il 24/08/2012 11:49, Liu Ping Fan ha scritto: > With this patch, we can protect PCIIDEState from disappearing during > mmio-dispatch hold the IDEBus->ref. I don't see why MMIO dispatch should hold the IDEBus ref rather than the PCIIDEState. In the case of the PIIX, the BARs are set up by the PCIIDEState in bmdma_setup_bar (called by bmdma_setup_bar). Also, containment may happen just as well for devices, not buses. Why isn't it a problem in that case? It looks like you're papering over a different bug. Paolo > And the ref circle has been broken when calling qdev_delete_subtree(). > > Signed-off-by: Liu Ping Fan > --- > hw/qdev.c |2 ++ > hw/qdev.h |1 + > 2 files changed, 3 insertions(+), 0 deletions(-) > > diff --git a/hw/qdev.c b/hw/qdev.c > index e2339a1..b09ebbf 100644 > --- a/hw/qdev.c > +++ b/hw/qdev.c > @@ -510,6 +510,8 @@ void qbus_create_inplace(BusState *bus, const char > *typename, > { > object_initialize(bus, typename); > > +bus->overlap = parent; > +object_ref(OBJECT(bus->overlap)); > bus->parent = parent; > bus->name = name ? g_strdup(name) : NULL; > qbus_realize(bus); > diff --git a/hw/qdev.h b/hw/qdev.h > index 182cfa5..9bc5783 100644 > --- a/hw/qdev.h > +++ b/hw/qdev.h > @@ -117,6 +117,7 @@ struct BusState { > int allow_hotplug; > bool qom_allocated; > bool glib_allocated; > +DeviceState *overlap; > int max_index; > QTAILQ_HEAD(ChildrenHead, BusChild) children; > QLIST_ENTRY(BusState) sibling; > -- 1.7.4.4
[Qemu-devel] [PATCH 10/10] qdev: fix create in place obj's life cycle problem
From: Liu Ping Fan Scene: obja lies in objA, when objA's ref->0, it will be freed, but at that time obja can still be in use. The real example is: typedef struct PCIIDEState { PCIDevice dev; IDEBus bus[2]; --> create in place . } When without big lock protection for mmio-dispatch, we will hold obj's refcnt. So memory_region_init_io() will replace the third para "void *opaque" with "Object *obj". With this patch, we can protect PCIIDEState from disappearing during mmio-dispatch hold the IDEBus->ref. And the ref circle has been broken when calling qdev_delete_subtree(). Signed-off-by: Liu Ping Fan --- hw/qdev.c |2 ++ hw/qdev.h |1 + 2 files changed, 3 insertions(+), 0 deletions(-) diff --git a/hw/qdev.c b/hw/qdev.c index e2339a1..b09ebbf 100644 --- a/hw/qdev.c +++ b/hw/qdev.c @@ -510,6 +510,8 @@ void qbus_create_inplace(BusState *bus, const char *typename, { object_initialize(bus, typename); +bus->overlap = parent; +object_ref(OBJECT(bus->overlap)); bus->parent = parent; bus->name = name ? g_strdup(name) : NULL; qbus_realize(bus); diff --git a/hw/qdev.h b/hw/qdev.h index 182cfa5..9bc5783 100644 --- a/hw/qdev.h +++ b/hw/qdev.h @@ -117,6 +117,7 @@ struct BusState { int allow_hotplug; bool qom_allocated; bool glib_allocated; +DeviceState *overlap; int max_index; QTAILQ_HEAD(ChildrenHead, BusChild) children; QLIST_ENTRY(BusState) sibling; -- 1.7.4.4