Re: [RFC PATCH v3 00/19] ACPI memory hotplug
On Wed, Oct 31, 2012 at 01:16:56PM +0200, Avi Kivity wrote:
> On 10/31/2012 12:58 PM, Stefan Hajnoczi wrote:
> > On Fri, Sep 21, 2012 at 1:17 PM, Vasilis Liaskovitis
> > vasilis.liaskovi...@profitbricks.com wrote:
> > > This is v3 of the ACPI memory hotplug functionality. Only x86_64
> > > target is supported for now.
> >
> > Hi Vasilis,
> > Regarding the hot unplug issue we've been discussing, it's possible to
> > progress this patch series without fully solving that problem upfront.
> >
> > Karen Noel suggested that the series could be rolled without the hot
> > unplug command, so that it's not possible to hit the unsafe case. This
> > would allow users to hot plug additional memory. They would have to use
> > virtio-balloon to reduce the memory footprint again.
> >
> > Later, when the memory region referencing issue has been solved the hot
> > unplug command can be added.
> >
> > Just wanted to mention Karen's idea in case you feel stuck right now.
>
> We could introduce hotunplug as an experimental feature so people can
> test and play with it, and later graduate it to a fully supported
> feature.

ok, I'll separate hotplug and hotunplug patches for next version of the
patchseries (maybe even offer hotunplug in a separate series)

thanks,

- Vasilis
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC PATCH v3 05/19] Implement dimm device abstraction
Hi,

On Wed, Oct 24, 2012 at 12:15:17PM +0200, Stefan Hajnoczi wrote:
> On Wed, Oct 24, 2012 at 10:06 AM, liu ping fan qemul...@gmail.com wrote:
> > On Tue, Oct 23, 2012 at 8:25 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
> > > On Fri, Sep 21, 2012 at 01:17:21PM +0200, Vasilis Liaskovitis wrote:
> > > > +static void dimm_populate(DimmDevice *s)
> > > > +{
> > > > +    DeviceState *dev= (DeviceState*)s;
> > > > +    MemoryRegion *new = NULL;
> > > > +
> > > > +    new = g_malloc(sizeof(MemoryRegion));
> > > > +    memory_region_init_ram(new, dev->id, s->size);
> > > > +    vmstate_register_ram_global(new);
> > > > +    memory_region_add_subregion(get_system_memory(), s->start, new);
> > > > +    s->mr = new;
> > > > +}
> > > > +
> > > > +static void dimm_depopulate(DimmDevice *s)
> > > > +{
> > > > +    assert(s);
> > > > +    vmstate_unregister_ram(s->mr, NULL);
> > > > +    memory_region_del_subregion(get_system_memory(), s->mr);
> > > > +    memory_region_destroy(s->mr);
> > > > +    s->mr = NULL;
> > > > +}
> > >
> > > How is dimm hot unplug protected against callers who currently have
> > > RAM mapped (from cpu_physical_memory_map())?
> > >
> > > Emulated devices call cpu_physical_memory_map() directly or indirectly
> > > through DMA emulation code. The RAM pointer may be held for arbitrary
> > > lengths of time, across main loop iterations, etc.
> > >
> > > It's not clear to me that it is safe to unplug a DIMM that has network
> > > or disk I/O buffers, for example. We also need to be robust against
> > > malicious guests who abuse the hotplug lifecycle. QEMU should never be
> > > left with dangling pointers.
> >
> > Not sure about the block layer. But I think those threads are already
> > out of big lock, so there should be a MemoryListener to catch the
> > RAM-unplug event, and if needed, bdrv_flush.

do we want bdrv_flush, or some kind of cancel request e.g. bdrv_aio_cancel?

> Here is the detailed scenario:
>
> 1. Emulated device does cpu_physical_memory_map() and gets a pointer
>    to guest RAM.
> 2. Return to vcpu or iothread, continue processing...
> 3. Hot unplug of RAM causes the guest RAM to disappear.
> 4. Pending I/O completes and overwrites memory from dangling guest RAM
>    pointer.
>
> Any I/O device that does zero-copy I/O in QEMU faces this problem:
>  * The block layer is affected.
>  * The net layer is unaffected because it doesn't do zero-copy tx/rx
>    across returns to the main loop (#2 above).
>  * Not sure about other devices classes (e.g. USB).
>
> How should the MemoryListener callback work? For block I/O it may not
> be possible to cancel pending I/O asynchronously - if you try to cancel
> then your thread may block until the I/O completes.

e.g. paio_cancel does this?
is there already an API to asynchronously cancel all in flight operations in a
BlockDriverState? Afaict block_job_cancel refers to streaming jobs only and
doesn't help here.

Can we make the RAM unplug initiate async I/O cancellations, prevent further
I/Os, and only free the memory in a callback, after all DMA I/O to the
associated memory region has been cancelled or completed?

Also iiuc the MemoryListener should be registered from users of
cpu_physical_memory_map e.g. hw/virtio.c

By the way dimm_depopulate only frees the qemu memory on an ACPI _EJ request,
which means that a well-behaved guest will have already offlined the memory and
is not using it anymore. If the guest still uses the memory e.g. for a DMA
buffer, the logical memory offlining will fail and the _EJ/qemu memory freeing
will never happen.

But in theory a malicious acpi guest driver could trigger _EJ requests to do
step 3 above.

Or perhaps the backing block driver can finish an I/O request for a zero-copy
block device that the guest doesn't care for anymore? I'll think about this a
bit more.

> Synchronous cancel behavior is not workable since it can lead to poor
> latency or hangs in the guest.

ok

thanks,

- Vasilis
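
The deferred-free idea raised above (stop handing out new mappings on unplug, and free the RAM only once every outstanding mapping has been released) can be sketched roughly as follows. This is a toy Python model, not QEMU code: GuestRamRegion and its methods are hypothetical names invented for illustration, and map()/unmap() merely stand in for cpu_physical_memory_map()-style users.

```python
class GuestRamRegion:
    """Toy model of a hot-pluggable RAM region (hypothetical, not a QEMU type)."""

    def __init__(self, name, on_free):
        self.name = name
        self.mappings = 0              # outstanding map() users (e.g. in-flight DMA)
        self.unplug_requested = False  # set once the guest has asked for ejection
        self.freed = False
        self._on_free = on_free        # callback invoked when memory is really freed

    def map(self):
        """Hand out a mapping; refused once unplug has been requested."""
        if self.unplug_requested or self.freed:
            raise RuntimeError("region %s is going away" % self.name)
        self.mappings += 1

    def unmap(self):
        """Release a mapping; may trigger the deferred free."""
        assert self.mappings > 0
        self.mappings -= 1
        self._maybe_free()

    def request_unplug(self):
        """Start the unplug: block new mappings, free only when idle."""
        self.unplug_requested = True
        self._maybe_free()

    def _maybe_free(self):
        if self.unplug_requested and self.mappings == 0 and not self.freed:
            self.freed = True
            self._on_free(self.name)


if __name__ == "__main__":
    freed = []
    region = GuestRamRegion("dimm0", freed.append)
    region.map()             # step 1: device maps guest RAM
    region.request_unplug()  # step 3: unplug arrives while I/O is pending
    print(region.freed)      # still mapped, so nothing has been freed
    region.unmap()           # step 4: I/O completes and drops its mapping
    print(region.freed, freed)
```

The point of the sketch is only the ordering: the free happens in the last unmap() rather than in the unplug request, so step 4 of the scenario never writes through a dangling pointer. It does not address how pending I/O would be cancelled.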
Re: [RFC PATCH v3 06/19] Implement -dimm command line option
Hi,

On Thu, Oct 18, 2012 at 02:33:02PM +0200, Avi Kivity wrote:
> On 10/18/2012 11:27 AM, Vasilis Liaskovitis wrote:
> > On Wed, Oct 17, 2012 at 12:03:51PM +0200, Avi Kivity wrote:
> > > On 10/17/2012 11:19 AM, Vasilis Liaskovitis wrote:
> > > > > I don't think so, but probably there's a limit of DIMMs that real
> > > > > controllers have, something like 8 max.
> > > >
> > > > In the case of i440fx specifically, do you mean that we should model
> > > > the DRB (Dram row boundary registers in section 3.2.19 of the i440fx
> > > > spec) ?
> > > >
> > > > The i440fx DRB registers only support up to 8 DRAM rows (let's say 1
> > > > row maps 1-1 to a DimmDevice for this discussion) and only support
> > > > up to 2GB of memory afaict (bit 31 and above is ignored).
> > > >
> > > > I'd rather not model this part of the i440fx - having only 8 DIMMs
> > > > seems too restrictive. The rest of the patchset supports up to 255
> > > > DIMMs so it would be a waste imho to model an old pc memory
> > > > controller that only supports 8 DIMMs.
> > > >
> > > > There was also an old discussion about i440fx modeling here:
> > > > https://lists.nongnu.org/archive/html/qemu-devel/2011-07/msg02705.html
> > > > the general direction was that i440fx is too old and we don't want to
> > > > precisely emulate the DRB registers, since they lack flexibility.
> > > >
> > > > Possible solutions:
> > > >
> > > > 1) is there a newer and more flexible chipset that we could model?
> > >
> > > Look for q35 on this list.
> >
> > thanks, I'll take a look. It sounds like the other options below are
> > more straightforward now, but let me know if you prefer q35 integration
> > as a priority.
>
> At least validate that what you're doing fits with how q35 works.

In terms of pmc modeling, the q35 page http://wiki.qemu.org/Features/Q35
mentions:

  Refactor i440fx to create i440fx-pmc class
  ich9: model ICH9 Super I/O chip
  ich9: make i440fx-pmc a generic PCNorthBridge class and add support for
        ich9 northbridge

is this still the plan? There was an old patchset creating i440fx-pmc here:
http://lists.gnu.org/archive/html/qemu-devel/2012-01/msg03501.html
but I am not sure if it has been dropped or worked on. v3 of the q35 patchset
doesn't include a pmc I think.

It would be good to know what the current plan regarding pmc modeling (for both
q35 and i440fx) is.

thanks,

- Vasilis
Re: [RFC PATCH v3 06/19] Implement -dimm command line option
On Wed, Oct 17, 2012 at 12:03:51PM +0200, Avi Kivity wrote:
> On 10/17/2012 11:19 AM, Vasilis Liaskovitis wrote:
> > > I don't think so, but probably there's a limit of DIMMs that real
> > > controllers have, something like 8 max.
> >
> > In the case of i440fx specifically, do you mean that we should model the
> > DRB (Dram row boundary registers in section 3.2.19 of the i440fx spec) ?
> >
> > The i440fx DRB registers only support up to 8 DRAM rows (let's say 1 row
> > maps 1-1 to a DimmDevice for this discussion) and only support up to 2GB
> > of memory afaict (bit 31 and above is ignored).
> >
> > I'd rather not model this part of the i440fx - having only 8 DIMMs seems
> > too restrictive. The rest of the patchset supports up to 255 DIMMs so it
> > would be a waste imho to model an old pc memory controller that only
> > supports 8 DIMMs.
> >
> > There was also an old discussion about i440fx modeling here:
> > https://lists.nongnu.org/archive/html/qemu-devel/2011-07/msg02705.html
> > the general direction was that i440fx is too old and we don't want to
> > precisely emulate the DRB registers, since they lack flexibility.
> >
> > Possible solutions:
> >
> > 1) is there a newer and more flexible chipset that we could model?
>
> Look for q35 on this list.

thanks, I'll take a look. It sounds like the other options below are more
straightforward now, but let me know if you prefer q35 integration as a
priority.

> > 2) model and document
>                ^--- the critical bit
>
> > a generic (non-existent) i440fx that would support more and larger
> > DIMMs. E.g. support 255 DIMMs. If we want to use a description similar to
> > the i440fx DRB registers, the registers would take up a lot of space. In
> > i440fx there is one 8-bit DRB register per DIMM, and DRB[i] describes how
> > many 8MB chunks are contained in DIMMs 0...i. So, the register values are
> > cumulative (and total described memory cannot exceed 256x8MB = 2GB)
>
> Our i440fx has already been extended by support for pci and cpu hotplug,
> and I see no reason not to extend it for memory. We can allocate extra
> mmio space for registers if needed. Usually I'm against this sort of
> thing, but in this case we don't have much choice.

ok

> > We could for example model:
> > - an 8-bit non-cumulative register for each DIMM, denoting how many 128MB
> >   chunks it contains. This allows 32GB for each DIMM, and with 255 DIMMs
> >   we describe a bit less than 8TB. These registers require 255 bytes.
> > - a 16-bit cumulative register for each DIMM again for 128MB chunks. This
> >   allows us to describe 8TB of memory (but the registers take up double
> >   the space, because they describe cumulative memory amounts)
>
> There is no reason to save space. Why not have two 64-bit registers per
> DIMM, one describing the size and the other the base address, both in
> bytes? Use a few low order bits for control.

Do we want this generic scheme above to be tied into the i440fx/pc machine?
Or have it as a separate generic memory bus / pmc usable by others (e.g. in
hw/dimm.c)?
The 64-bit values you describe are already part of DimmDevice properties, but
they are not hardware registers described as part of a chipset.

In terms of control bits, did you want to mimic some other chipset registers? -
any examples would be useful.

> > 3) let everything be handled/abstracted by dimmbus - the chipset DRB
> > modelling is not done (at least for i440fx, other machines could). This
> > is the least precise in terms of emulation. On the other hand, if we are
> > not really trying to emulate the real (too restrictive) hardware, does it
> > matter?
>
> We could emulate base memory using the chipset, and extra memory using
> the scheme above. This allows guests that are tied to the chipset to
> work, and guests that have more awareness (seabios) to use the extra
> features.

But if we use the real i440fx pmc DRBs for base memory, this means base memory
would be <= 2GB, right?

Sounds like we'd need to change the DRBs anyway to describe useful amounts of
base memory (e.g. 512MB chunks and check against address lines [36:29] can
describe base memory up to 64GB, though that's still limiting for very large
VMs). But we'd be diverting from the real hardware again.

Then we can model base memory with tweaked i440fx pmc's DRB registers - we
could only use DRB[0] (one DIMM describing all of base memory) or more.

DIMMs would be allowed to be hotplugged in the generic mem-controller scheme
only (unless it makes sense to allow hotplug in the remaining pmc DRBs and
start using the generic scheme once we run out of emulated DRBs)

thanks,

- Vasilis
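
The register arithmetic in the schemes discussed above can be checked with a few lines of Python. This is only a worked calculation under the assumptions stated in the thread (255 DIMMs, 8MB or 128MB chunks, cumulative versus per-DIMM registers); the helper names are invented for illustration. It uses the largest value a register can actually hold (2^bits - 1), so the totals come out slightly below the rounded figures quoted in the mails.

```python
MB = 1 << 20
GB = 1 << 30
TB = 1 << 40
NUM_DIMMS = 255  # DIMM count supported by the rest of the patchset


def max_register_value(bits):
    """Largest chunk count an unsigned register of `bits` bits can hold."""
    return (1 << bits) - 1


def cumulative_drbs(dimm_sizes, chunk):
    """DRB-style values: entry i counts `chunk`-sized units in DIMMs 0..i."""
    values, total = [], 0
    for size in dimm_sizes:
        if size % chunk:
            raise ValueError("DIMM size must be a multiple of the chunk size")
        total += size // chunk
        values.append(total)
    return values


# Real i440fx layout: 8-bit cumulative registers counting 8MB chunks.
i440fx_total = max_register_value(8) * 8 * MB

# Scheme with an 8-bit non-cumulative register per DIMM, 128MB chunks.
per_dimm_a = max_register_value(8) * 128 * MB
total_a = NUM_DIMMS * per_dimm_a
reg_bytes_a = NUM_DIMMS * 1

# Scheme with a 16-bit cumulative register per DIMM, 128MB chunks.
total_b = max_register_value(16) * 128 * MB
reg_bytes_b = NUM_DIMMS * 2

if __name__ == "__main__":
    print(cumulative_drbs([64 * MB, 128 * MB, 64 * MB], 8 * MB))
    print("i440fx total: %d MB" % (i440fx_total // MB))
    print("8-bit per-DIMM: %.2f GB each, %.2f TB total, %d register bytes"
          % (per_dimm_a / GB, total_a / TB, reg_bytes_a))
    print("16-bit cumulative: %.2f TB total, %d register bytes"
          % (total_b / TB, reg_bytes_b))
```

Both 128MB schemes top out just under 8TB; the cumulative one needs twice the register space because every register must be wide enough to hold the running total.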
Re: [RFC PATCH v3 06/19] Implement -dimm command line option
On Sat, Oct 13, 2012 at 08:57:19AM +0000, Blue Swirl wrote:
> On Tue, Oct 9, 2012 at 5:04 PM, Vasilis Liaskovitis
> vasilis.liaskovi...@profitbricks.com wrote:
> <snip>
> > > Maybe even the dimmbus device shouldn't exist by itself after all, or
> > > it should be pretty much invisible to users. On real HW, the memory
> > > controller or south bridge handles the memory. For i440fx, it's part
> > > of the same chipset. So I think we should just add qdev properties to
> > > i440fx to specify the sizes, nodes etc. Then i440fx should create the
> > > dimmbus device unconditionally using the properties. The default
> > > properties should create a sane configuration, otherwise -global
> > > i440fx.dimm_size=512M etc. could be used. Then the bus would be
> > > populated as before or with device_add.
> >
> > hmm the problem with using only i440fx properties, is that size/nodes
> > look dimm specific to me, not chipset-memcontroller specific. Unless we
> > only allow uniform size dimms. Is it possible to have a dynamic list of
> > sizes/nodes pairs as properties of a qdev device?
>
> I don't think so, but probably there's a limit of DIMMs that real
> controllers have, something like 8 max.

In the case of i440fx specifically, do you mean that we should model the DRB
(Dram row boundary registers in section 3.2.19 of the i440fx spec) ?

The i440fx DRB registers only support up to 8 DRAM rows (let's say 1 row
maps 1-1 to a DimmDevice for this discussion) and only support up to 2GB of
memory afaict (bit 31 and above is ignored).

I'd rather not model this part of the i440fx - having only 8 DIMMs seems too
restrictive. The rest of the patchset supports up to 255 DIMMs so it would be a
waste imho to model an old pc memory controller that only supports 8 DIMMs.

There was also an old discussion about i440fx modeling here:
https://lists.nongnu.org/archive/html/qemu-devel/2011-07/msg02705.html
the general direction was that i440fx is too old and we don't want to precisely
emulate the DRB registers, since they lack flexibility.

Possible solutions:

1) is there a newer and more flexible chipset that we could model?

2) model and document a generic (non-existent) i440fx that would support more
and larger DIMMs. E.g. support 255 DIMMs. If we want to use a description
similar to the i440fx DRB registers, the registers would take up a lot of space.
In i440fx there is one 8-bit DRB register per DIMM, and DRB[i] describes how
many 8MB chunks are contained in DIMMs 0...i. So, the register values are
cumulative (and total described memory cannot exceed 256x8MB = 2GB)

We could for example model:
- an 8-bit non-cumulative register for each DIMM, denoting how many 128MB
  chunks it contains. This allows 32GB for each DIMM, and with 255 DIMMs we
  describe a bit less than 8TB. These registers require 255 bytes.
- a 16-bit cumulative register for each DIMM again for 128MB chunks. This
  allows us to describe 8TB of memory (but the registers take up double the
  space, because they describe cumulative memory amounts)

3) let everything be handled/abstracted by dimmbus - the chipset DRB modelling
is not done (at least for i440fx, other machines could). This is the least
precise in terms of emulation. On the other hand, if we are not really trying to
emulate the real (too restrictive) hardware, does it matter?

thanks,

- Vasilis
Re: [RFC PATCH v3 06/19] Implement -dimm command line option
Hi,

sorry for the delayed answer.

On Sat, Sep 29, 2012 at 11:13:04AM +0000, Blue Swirl wrote:
> > The -dimm option is supposed to specify the dimm/memory layout, and not
> > create any devices.
> >
> > If we don't want this new option, I have a question:
> >
> > A -device/device_add means we create a new qdev device at startup or as
> > a hotplug operation respectively. So, the semantics of
> > -device dimm,id=dimm0,size=512M,node=0,populated=on are clear to me.
> >
> > What does -device dimm,populated=off mean from a qdev perspective?
> >
> > There are 2 alternatives:
> >
> > - The device is created on the dimmbus, but is not used/populated yet.
> > Then the activation/acpi-hotplug of the dimm may require a separate
> > command (we used to have dimm_add in versions < 3). device_add handling
> > always hotplugs a new qdev device, so this wouldn't fit this usecase,
> > because the device already exists. In this case, the actual acpi hotplug
> > operation is decoupled from qdev device creation.
>
> The bus exists but the devices do not, device_add would add DIMMs to
> the bus. This matches PCI bus created by the host bridge and PCI
> device hotplug.
>
> A more complex setup would be dimm bus, dimm slot devices and DIMM
> devices. The intermediate slot device would contain one DIMM device if
> plugged.

interesting, I haven't thought about this alternative. It does sound overly
complex, but a dimmslot / dimmdevice splitup could consolidate hotplug semantic
differences between populated=on/off. Something similar to the dimmslot device
is already present in v3 (dimmcfg structure), but it's not a qdev visible
device. I'd rather avoid the complication, but I might revisit this idea.

> > - The dimmdevice is not created when -device dimm,populated=off (this
> > would require some ugly checking in normal -device argument handling).
> > Only the dimm layout is saved. The hotplug is triggered from a normal
> > device_add later. So in this case, the acpi hotplug happens at the same
> > time as the qdev hotplug.
> >
> > Do you see a simpler alternative without introducing a new option?
> >
> > Using the -dimm option follows the second semantic and avoids changing
> > the -device semantics. Dimm layout description is decoupled from
> > dimmdevice creation, and qdev hotplug coincides with acpi hotplug.
>
> Maybe even the dimmbus device shouldn't exist by itself after all, or
> it should be pretty much invisible to users. On real HW, the memory
> controller or south bridge handles the memory. For i440fx, it's part
> of the same chipset. So I think we should just add qdev properties to
> i440fx to specify the sizes, nodes etc. Then i440fx should create the
> dimmbus device unconditionally using the properties. The default
> properties should create a sane configuration, otherwise -global
> i440fx.dimm_size=512M etc. could be used. Then the bus would be
> populated as before or with device_add.

hmm the problem with using only i440fx properties, is that size/nodes look
dimm specific to me, not chipset-memcontroller specific. Unless we only allow
uniform size dimms. Is it possible to have a dynamic list of sizes/nodes pairs
as properties of a qdev device?

Also if there is no dimmbus, and instead we have only links from i440fx to
dimm-devices, would the current qdev hotplug API be enough?

I am currently leaning towards this: i440fx unconditionally creates the
dimmbus. Users don't have to specify the bus (I assume this is what you mean by
dimmbus should be invisible to the users).

We only use -device dimm to describe dimms. With -device dimm,populated=off,
only the dimm config layout will be saved in the dimmbus. The hotplug is
triggered from a normal device_add later (same as pci hotplug).

thanks,

- Vasilis
Re: [RFC PATCH v3 06/19] Implement -dimm command line option
On Sat, Sep 22, 2012 at 01:46:57PM +0000, Blue Swirl wrote:
> On Fri, Sep 21, 2012 at 11:17 AM, Vasilis Liaskovitis
> vasilis.liaskovi...@profitbricks.com wrote:
> > Example:
> > -dimm id=dimm0,size=512M,node=0,populated=off
>
> There should not be a need to introduce a new top level option,
> instead you should just use -device, like
> -device dimm,base=0,id=dimm0,size=512M,node=0,populated=off
>
> That would also specify the start address.

What is base? the start address? I think the start address should be calculated
by the chipset / board, not by the user.

The -dimm option is supposed to specify the dimm/memory layout, and not create
any devices.

If we don't want this new option, I have a question:

A -device/device_add means we create a new qdev device at startup or as a
hotplug operation respectively. So, the semantics of
-device dimm,id=dimm0,size=512M,node=0,populated=on are clear to me.

What does -device dimm,populated=off mean from a qdev perspective?

There are 2 alternatives:

- The device is created on the dimmbus, but is not used/populated yet. Then the
activation/acpi-hotplug of the dimm may require a separate command (we used to
have dimm_add in versions < 3). device_add handling always hotplugs a new qdev
device, so this wouldn't fit this usecase, because the device already exists. In
this case, the actual acpi hotplug operation is decoupled from qdev device
creation.

- The dimmdevice is not created when -device dimm,populated=off (this would
require some ugly checking in normal -device argument handling). Only the dimm
layout is saved. The hotplug is triggered from a normal device_add later. So in
this case, the acpi hotplug happens at the same time as the qdev hotplug.

Do you see a simpler alternative without introducing a new option?

Using the -dimm option follows the second semantic and avoids changing the
-device semantics. Dimm layout description is decoupled from dimmdevice
creation, and qdev hotplug coincides with acpi hotplug.

thanks,

- Vasilis
Re: [RFC PATCH v3 20/19][SeaBIOS] alternative: Use paravirt interface for pci windows
On Mon, Sep 24, 2012 at 02:35:30PM +0800, Wen Congyang wrote:
> At 09/21/2012 07:20 PM, Vasilis Liaskovitis Wrote:
> > Initialize the 32-bit and 64-bit pci starting offsets from values passed
> > in by the qemu paravirt interface QEMU_CFG_PCI_WINDOW. Qemu calculates
> > the starting offsets based on initial memory and hotplug-able dimms.
>
> This patch can't be applied if I apply the other patches for seabios.
> And I don't find this patch in your tree.

to test these alternative patches, please try these trees:

https://github.com/vliaskov/seabios/commits/memhp-v3-alt
https://github.com/vliaskov/qemu-kvm/commits/memhp-v3-alt

thanks,

- Vasilis
Re: [RFC PATCH v3 11/19] Implement qmp and hmp commands for notification lists
Hi,

On Fri, Sep 21, 2012 at 04:03:26PM -0600, Eric Blake wrote:
> On 09/21/2012 05:17 AM, Vasilis Liaskovitis wrote:
> > Guest can respond to ACPI hotplug events e.g. with _EJ or _OST method.
> > This patch implements a tail queue to store guest notifications for
> > memory hot-add and hot-remove requests.
> >
> > Guest responses for memory hotplug command on a per-dimm basis can be
> > detected with the new hmp command info memhp or the new qmp command
> > query-memhp
>
> Naming doesn't match the QMP code.

will fix.

> > Examples:
> >
> > (qemu) device_add dimm,id=ram0
>
> > These notification items should probably be part of migration state (not
> > yet implemented).
>
> In the case of libvirt driving migration, you already said in 10/19 that
> libvirt has to start the destination with the populated=on|off fields
> correct for each dimm according to the state it was in at the time the

That patch actually alleviates this restriction for the off->on direction, i.e.
it allows for the target-VM to not have its args updated for dimm hot-add.
(e.g. let's say the source was started with a dimm, initially off. The dimm is
hot-plugged, and then migrated. With patch 10/19, the populated arg doesn't
have to be updated on the target)
The other direction (on->off) still needs correct arg change.

If libvirt/management layers guarantee the dimm arguments are correctly
changed, I don't see that we need 10/19 patch eventually.

What I think is needed is another hmp/qmp command, that will report
which dimms are on/off at any given time e.g.

(monitor) info memory-hotplug

dimm0: off
dimm1: on
...
dimmN: off

This can be used on the source by libvirt / other layers to find out the
populated dimms, and construct the correct command line on the destination.
Does this make sense to you?

The current patch only deals with success/failure event notifications (not
on-off state of dimms) and should probably be renamed to
query-memory-hotplug-events.

> host started the update. Can the host hot unplug memory after migration
> has started?

Good testcase. I would rather not allow any hotplug operations while the
migration is happening.

What do we do with pci hotplug during migration currently? I found a discussion
dating from a year ago, suggesting the same as the simplest solution, but I
don't know what's currently implemented.
http://lists.nongnu.org/archive/html/qemu-devel/2011-07/msg01204.html

> > +
> > +##
> > +# @MemHpInfo:
> > +#
> > +# Information about status of a memory hotplug command
> > +#
> > +# @dimm: the Dimm associated with the result
> > +#
> > +# @result: the result of the hotplug command
> > +#
> > +# Since: 1.3
> > +#
> > +##
> > +{ 'type': 'MemHpInfo',
> > +  'data': {'dimm': 'str', 'request': 'str', 'result': 'str'} }
>
> Should 'result' be a bool (true for success, false for still pending) or
> an enum, instead of a free-form string? Likewise, isn't 'request' going
> to be exactly one of two values (plug or unplug)?

agreed with 'request'.

For 'result' it is also a boolean, but with 'success' and 'failure' (rather than
'pending'). Items are only queued when the guest has given us a definite _OST
or _EJ result which is either success or fail. If an operation is pending,
nothing is queued here.

Perhaps queueing pending operations also has a usecase, but this isn't
addressed in this patch.

thanks,

- Vasilis
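
The queue semantics described above (an item is appended only once the guest has reported a definite outcome, and both 'request' and 'result' take one of two values) could look roughly like this. It is an illustrative Python model, not the QEMU implementation: NotificationQueue, guest_replied() and the enum member names are assumptions; only the MemHpInfo field names come from the quoted schema.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum


class Request(Enum):
    HOT_ADD = "hot-add"
    HOT_REMOVE = "hot-remove"


class Result(Enum):
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass(frozen=True)
class MemHpInfo:
    """One queued notification, mirroring the fields of the quoted QAPI type."""
    dimm: str
    request: Request
    result: Result


class NotificationQueue:
    """FIFO of definite guest answers; pending operations are never queued."""

    def __init__(self):
        self._items = deque()

    def guest_replied(self, dimm, request, ok):
        # Called only when the guest reports a definite outcome
        # (the thread names _OST and _EJ as the sources of such outcomes).
        result = Result.SUCCESS if ok else Result.FAILURE
        self._items.append(MemHpInfo(dimm, request, result))

    def query(self):
        """Return the queued items as plain dicts, oldest first."""
        return [
            {"dimm": i.dimm, "request": i.request.value, "result": i.result.value}
            for i in self._items
        ]


if __name__ == "__main__":
    q = NotificationQueue()
    q.guest_replied("ram0", Request.HOT_ADD, ok=True)
    q.guest_replied("ram1", Request.HOT_REMOVE, ok=False)
    for item in q.query():
        print(item)
```

Using enums instead of free-form strings is the change suggested in the review; the on/off state of each dimm, which the reply proposes to report through a separate command, is deliberately not part of this queue.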
Re: [RFC PATCH v3 08/19] pc: calculate dimm physical addresses and adjust memory map
On Sat, Sep 22, 2012 at 02:15:28PM +0000, Blue Swirl wrote:
> > +
> > +/* Function to configure memory offsets of hotpluggable dimms */
> > +
> > +target_phys_addr_t pc_set_hp_memory_offset(uint64_t size)
> > +{
> > +    target_phys_addr_t ret;
> > +
> > +    /* on first call, initialize ram_hp_offset */
> > +    if (!ram_hp_offset) {
> > +        if (ram_size >= PCI_HOLE_START ) {
> > +            ram_hp_offset = 0x100000000LL + (ram_size - PCI_HOLE_START);
> > +        } else {
> > +            ram_hp_offset = ram_size;
> > +        }
> > +    }
> > +
> > +    if (ram_hp_offset >= 0x100000000LL) {
> > +        ret = ram_hp_offset;
> > +        above_4g_hp_mem_size += size;
> > +        ram_hp_offset += size;
> > +    }
> > +    /* if dimm fits before pci hole, append it normally */
> > +    else if (ram_hp_offset + size <= PCI_HOLE_START) {
>
> } else if ...
>
> > +        ret = ram_hp_offset;
> > +        below_4g_hp_mem_size += size;
> > +        ram_hp_offset += size;
> > +    }
> > +    /* otherwise place it above 4GB */
> > +    else {
>
> } else {
>
> > +        ret = 0x100000000LL;
> > +        above_4g_hp_mem_size += size;
> > +        ram_hp_offset = 0x100000000LL + size;
> > +    }
> > +
> > +    return ret;
> > +}
>
> But the function and use of lots of global variables is ugly. The dimm
> devices should be just created in piix_pci.c (i440fx) directly with
> correct offsets and sizes, so all below_4g_mem_size etc. calculations
> should be moved there. That would implement the PMC part of i440fx.
>
> For ISA PC, probably the board should create the DIMMs since there may
> not be a memory controller. The 4G logic does not make sense there
> anyway.

What about moving the implementation to pc_piix.c? Initial RAM and pci windows
are already calculated in pc_init1, and then passed to i440fx_init. The memory
bus could be attached to i440fx for pci-enabled pc and to isabus-bridge for
isa-pc (isa-pc not tested yet).

Something like the following:

---
 hw/pc.h      |    1 +
 hw/pc_piix.c |   57 +++--
 2 files changed, 52 insertions(+), 6 deletions(-)

diff --git a/hw/pc.h b/hw/pc.h
index e4db071..d6cc43b 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -10,6 +10,7 @@
 #include "memory.h"
 #include "ioapic.h"
 
+#define PCI_HOLE_START 0xe0000000
 /* PC-style peripherals (also used by other machines).  */
 
 /* serial.c */
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index 88ff041..17db95a 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -43,6 +43,7 @@
 #include "xen.h"
 #include "memory.h"
 #include "exec-memory.h"
+#include "dimm.h"
 #ifdef CONFIG_XEN
 #  include <xen/hvm/hvm_info_table.h>
 #endif
@@ -52,6 +53,8 @@ static const int ide_iobase[MAX_IDE_BUS] = { 0x1f0, 0x170 };
 static const int ide_iobase2[MAX_IDE_BUS] = { 0x3f6, 0x376 };
 static const int ide_irq[MAX_IDE_BUS] = { 14, 15 };
 
+static ram_addr_t below_4g_hp_mem_size = 0;
+static ram_addr_t above_4g_hp_mem_size = 0;
 static void kvm_piix3_setup_irq_routing(bool pci_enabled)
 {
@@ -117,6 +120,41 @@ static void ioapic_init(GSIState *gsi_state)
     }
 }
 
+static target_phys_addr_t pc_set_hp_memory_offset(uint64_t size)
+{
+    target_phys_addr_t ret;
+    static ram_addr_t ram_hp_offset = 0;
+
+    /* on first call, initialize ram_hp_offset */
+    if (!ram_hp_offset) {
+        if (ram_size >= PCI_HOLE_START ) {
+            ram_hp_offset = 0x100000000LL + (ram_size - PCI_HOLE_START);
+        } else {
+            ram_hp_offset = ram_size;
+        }
+    }
+
+    if (ram_hp_offset >= 0x100000000LL) {
+        ret = ram_hp_offset;
+        above_4g_hp_mem_size += size;
+        ram_hp_offset += size;
+    }
+    /* if dimm fits before pci hole, append it normally */
+    else if (ram_hp_offset + size <= PCI_HOLE_START) {
+        ret = ram_hp_offset;
+        below_4g_hp_mem_size += size;
+        ram_hp_offset += size;
+    }
+    /* otherwise place it above 4GB */
+    else {
+        ret = 0x100000000LL;
+        above_4g_hp_mem_size += size;
+        ram_hp_offset = 0x100000000LL + size;
+    }
+
+    return ret;
+}
+
 /* PC hardware initialisation */
 static void pc_init1(MemoryRegion *system_memory,
                      MemoryRegion *system_io,
@@ -155,9 +193,9 @@ static void pc_init1(MemoryRegion *system_memory,
         kvmclock_create();
     }
 
-    if (ram_size >= 0xe0000000 ) {
-        above_4g_mem_size = ram_size - 0xe0000000;
-        below_4g_mem_size = 0xe0000000;
+    if (ram_size >= PCI_HOLE_START ) {
+        above_4g_mem_size = ram_size - PCI_HOLE_START;
+        below_4g_mem_size = PCI_HOLE_START;
     } else {
         above_4g_mem_size = 0;
         below_4g_mem_size = ram_size;
@@ -172,6 +210,9 @@ static void pc_init1(MemoryRegion *system_memory,
         rom_memory = system_memory;
     }
 
+    /* adjust memory map for hotplug dimms */
+    dimm_calc_offsets(pc_set_hp_memory_offset);
+
     /* allocate ram and load rom/bios */
     if (!xen_enabled()) {
         fw_cfg = pc_memory_init(system_memory,
@@ -192,18 +233,22 @@ static void pc_init1(MemoryRegion *system_memory,
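
The placement policy of pc_set_hp_memory_offset() quoted above can be modelled in a few lines of Python to see where successive DIMMs land. This is a sketch under assumptions: the comparison operators and the 0xe0000000 / 0x100000000 constants were lost in the archived text and are inferred from the comments ("fits before pci hole", "place it above 4GB"), and the class keeps its state per instance and initialises it in the constructor, whereas the C version initialises lazily on the first call. HotplugLayout and place() are hypothetical names.

```python
MB = 1 << 20
GB = 1 << 30
PCI_HOLE_START = 0xE0000000  # start of the 32-bit PCI hole (3.5GB)
FOUR_GB = 0x100000000


class HotplugLayout:
    """Toy model of the DIMM placement policy; not QEMU code."""

    def __init__(self, ram_size):
        self.below_4g_hp = 0
        self.above_4g_hp = 0
        # Hotplug space starts right after initial RAM; any initial RAM
        # beyond the PCI hole has already been relocated above 4GB.
        if ram_size >= PCI_HOLE_START:
            self.offset = FOUR_GB + (ram_size - PCI_HOLE_START)
        else:
            self.offset = ram_size

    def place(self, size):
        """Return the base address chosen for a DIMM of `size` bytes."""
        if self.offset >= FOUR_GB:
            # Already above 4GB: append.
            base = self.offset
            self.above_4g_hp += size
            self.offset += size
        elif self.offset + size <= PCI_HOLE_START:
            # Fits below the PCI hole: append.
            base = self.offset
            self.below_4g_hp += size
            self.offset += size
        else:
            # Would overlap the PCI hole: jump to 4GB.
            base = FOUR_GB
            self.above_4g_hp += size
            self.offset = FOUR_GB + size
        return base


if __name__ == "__main__":
    layout = HotplugLayout(3 * GB)
    for size in (256 * MB, 512 * MB, 256 * MB):
        print(hex(layout.place(size)))
```

Note one consequence visible in the model: once a DIMM has been pushed above 4GB, any space left below the PCI hole is never reused, because the running offset has moved past it.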
[RFC PATCH v3 00/19] ACPI memory hotplug
/seabios/commits/memhp-v3

Vasilis Liaskovitis (12):
  Implement dimm device abstraction
  Implement -dimm command line option
  acpi_piix4: Implement memory device hotplug registers
  pc: calculate dimm physical addresses and adjust memory map
  pc: Add dimm paravirt SRAT info
  fix live-migration when populated=on is missing
  Implement qmp and hmp commands for notification lists
  Implement info memory-total and query-memory-total
  balloon: update with hotplugged memory
  Add _OST dimm support
  Update dimm state on reset
  Implement _PS3 for dimm

 arch_init.c                 |   24 ++-
 docs/specs/acpi_hotplug.txt |   54 ++
 docs/specs/fwcfg.txt        |   28 +++
 hmp-commands.hx             |    4 +
 hmp.c                       |   24 +++
 hmp.h                       |    2 +
 hw/Makefile.objs            |    2 +-
 hw/acpi_piix4.c             |  114 +++-
 hw/dimm.c                   |  435 +++
 hw/dimm.h                   |  101 ++
 hw/pc.c                     |   55 ++-
 hw/pc.h                     |    6 +
 hw/pc_piix.c                |   20 ++-
 hw/virtio-balloon.c         |   13 +-
 monitor.c                   |   14 ++
 qapi-schema.json            |   37
 qemu-config.c               |   25 +++
 qemu-options.hx             |    5 +
 qmp-commands.hx             |   57 ++
 sysemu.h                    |    1 +
 vl.c                        |   51 +
 21 files changed, 1051 insertions(+), 21 deletions(-)
 create mode 100644 docs/specs/acpi_hotplug.txt
 create mode 100644 docs/specs/fwcfg.txt
 create mode 100644 hw/dimm.c
 create mode 100644 hw/dimm.h

Vasilis Liaskovitis (7):
  Add ACPI_EXTRACT_DEVICE* macros
  Subject: [PATCH 02/18] Add SSDT memory device support
  acpi-dsdt: Implement functions for memory hotplug
  acpi: generate hotplug memory devices
  Add _OST dimm method
  Implement _PS3 method for memory device
  Calculate pcimem_start and pcimem64_start from SRAT entries

 Makefile              |    2 +-
 src/acpi-dsdt.dsl     |  135 ++-
 src/acpi.c            |  216
 src/acpi.h            |    3 +
 src/pciinit.c         |    6 +-
 src/post.c            |    3 +
 src/smp.c             |    4 +
 src/ssdt-mem.dsl      |   73 +
 tools/acpi_extract.py |   28 +++
 9 files changed, 447 insertions(+), 23 deletions(-)
 create mode 100644 src/ssdt-mem.dsl

--
1.7.9
[RFC PATCH v3 01/19][SeaBIOS] Add ACPI_EXTRACT_DEVICE* macros
This allows extracting the beginning, end and name of a Device object.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 tools/acpi_extract.py |   28
 1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/tools/acpi_extract.py b/tools/acpi_extract.py
index 167a322..cb2540e 100755
--- a/tools/acpi_extract.py
+++ b/tools/acpi_extract.py
@@ -195,6 +195,28 @@ def aml_package_start(offset):
     offset += 1
     return offset + aml_pkglen_bytes(offset) + 1

+def aml_device_start(offset):
+    #0x5B 0x82 DeviceOp PkgLength NameString
+    if ((aml[offset] != 0x5B) or (aml[offset + 1] != 0x82)):
+        die( "Name offset 0x%x: expected 0x5B 0x82 actual 0x%x 0x%x" %
+             (offset, aml[offset], aml[offset + 1]));
+    return offset
+
+def aml_device_string(offset):
+    #0x5B 0x82 DeviceOp PkgLength NameString
+    start = aml_device_start(offset)
+    offset += 2
+    pkglenbytes = aml_pkglen_bytes(offset)
+    offset += pkglenbytes
+    return offset
+
+def aml_device_end(offset):
+    start = aml_device_start(offset)
+    offset += 2
+    pkglenbytes = aml_pkglen_bytes(offset)
+    pkglen = aml_pkglen(offset)
+    return offset + pkglen
+
 lineno = 0
 for line in fileinput.input():
     # Strip trailing newline
@@ -279,6 +301,12 @@ for i in range(len(asl)):
         offset = aml_processor_end(offset)
     elif (directive == "ACPI_EXTRACT_PKG_START"):
         offset = aml_package_start(offset)
+    elif (directive == "ACPI_EXTRACT_DEVICE_START"):
+        offset = aml_device_start(offset)
+    elif (directive == "ACPI_EXTRACT_DEVICE_STRING"):
+        offset = aml_device_string(offset)
+    elif (directive == "ACPI_EXTRACT_DEVICE_END"):
+        offset = aml_device_end(offset)
     else:
         die("Unsupported directive %s" % directive)

--
1.7.9
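For readers who want to check the offset arithmetic in aml_device_string() and aml_device_end(), here is a minimal C sketch of the same computation. It assumes the standard AML PkgLength encoding (top two bits of the lead byte give the number of following bytes) and that the caller passes a buffer long enough to read from. The function names are made up for this illustration and are not part of acpi_extract.py or SeaBIOS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes occupied by the PkgLength field itself (1..4): the top two bits
 * of the lead byte give the number of extra bytes that follow it. */
static size_t pkglen_bytes(const uint8_t *aml, size_t off)
{
    return (size_t)(aml[off] >> 6) + 1;
}

/* Decoded package length. It counts the PkgLength bytes themselves,
 * but not the opcode bytes in front of them. */
static uint32_t pkglen_value(const uint8_t *aml, size_t off)
{
    uint8_t lead = aml[off];
    size_t extra = lead >> 6;
    uint32_t len;
    size_t i;

    if (extra == 0)
        return lead & 0x3F;           /* single byte: 6-bit length */

    len = lead & 0x0F;                /* low nibble comes from the lead byte */
    for (i = 0; i < extra; i++)
        len |= (uint32_t)aml[off + 1 + i] << (4 + 8 * i);
    return len;
}

/* DeviceOp is the two-byte sequence 0x5B 0x82. */
static int is_device_op(const uint8_t *aml, size_t off)
{
    return aml[off] == 0x5B && aml[off + 1] == 0x82;
}

/* Offset of the device NameString, or -1 if off is not a DeviceOp. */
static long device_name_offset(const uint8_t *aml, size_t off)
{
    if (!is_device_op(aml, off))
        return -1;
    return (long)(off + 2 + pkglen_bytes(aml, off + 2));
}

/* Offset one past the end of the Device object, or -1 if not a DeviceOp. */
static long device_end_offset(const uint8_t *aml, size_t off)
{
    if (!is_device_op(aml, off))
        return -1;
    return (long)(off + 2 + pkglen_value(aml, off + 2));
}
```

The end offset is measured from the PkgLength field, not from the opcode, which is why both the patch and this sketch skip two bytes before adding the decoded length.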
[RFC PATCH v3 03/19][SeaBIOS] acpi-dsdt: Implement functions for memory hotplug
Extend the DSDT to include methods for handling memory hot-add and hot-remove
notifications and memory device status requests. These functions are called
from the memory device SSDT methods.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 src/acpi-dsdt.dsl |   70 +++-
 1 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/src/acpi-dsdt.dsl b/src/acpi-dsdt.dsl
index 2060686..5d3e92b 100644
--- a/src/acpi-dsdt.dsl
+++ b/src/acpi-dsdt.dsl
@@ -737,6 +737,71 @@ DefinitionBlock (
             }
             Return(One)
         }
+        /* Objects filled in by run-time generated SSDT */
+        External(MTFY, MethodObj)
+        External(MEON, PkgObj)
+
+        Method (CMST, 1, NotSerialized) {
+            // _STA method - return ON status of memdevice
+            // Local0 = MEON flag for this memory device
+            Store(DerefOf(Index(MEON, Arg0)), Local0)
+            If (Local0) { Return(0xF) } Else { Return(0x0) }
+        }
+
+        /* Memory hotplug notify array */
+        OperationRegion(MEST, SystemIO, 0xaf80, 32)
+        Field (MEST, ByteAcc, NoLock, Preserve)
+        {
+            MES, 256
+        }
+
+        /* Memory eject byte */
+        OperationRegion(MEMJ, SystemIO, 0xafa0, 1)
+        Field (MEMJ, ByteAcc, NoLock, Preserve)
+        {
+            MPE, 8
+        }
+
+        Method(MESC, 0) {
+            // Local5 = active memdevice bitmap
+            Store (MES, Local5)
+            // Local2 = last read byte from bitmap
+            Store (Zero, Local2)
+            // Local0 = memory device iterator
+            Store (Zero, Local0)
+            While (LLess(Local0, SizeOf(MEON))) {
+                // Local1 = MEON flag for this memory device
+                Store(DerefOf(Index(MEON, Local0)), Local1)
+                If (And(Local0, 0x07)) {
+                    // Shift down previously read bitmap byte
+                    ShiftRight(Local2, 1, Local2)
+                } Else {
+                    // Read next byte from memdevice bitmap
+                    Store(DerefOf(Index(Local5, ShiftRight(Local0, 3))), Local2)
+                }
+                // Local3 = active state for this memory device
+                Store(And(Local2, 1), Local3)
+
+                If (LNotEqual(Local1, Local3)) {
+                    // State change - update MEON with new state
+                    Store(Local3, Index(MEON, Local0))
+                    // Do MEM notify
+                    If (LEqual(Local3, 1)) {
+                        MTFY(Local0, 1)
+                    } Else {
+                        MTFY(Local0, 3)
+                    }
+                }
+                Increment(Local0)
+            }
+            Return(One)
+        }
+
+        Method (MPEJ, 2, NotSerialized) {
+            // _EJ0 method - eject callback
+            Store(Arg0, MPE)
+            Sleep(200)
+        }
     }

@@ -759,8 +824,9 @@ DefinitionBlock (
             // CPU hotplug event
             Return(\_SB.PRSC())
         }
-        Method(_L03) {
-            Return(0x01)
+        Method(_E03) {
+            // Memory hotplug event
+            Return(\_SB.MESC())
         }
         Method(_L04) {
             Return(0x01)
--
1.7.9
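The MESC scan in this patch can be hard to follow in ASL, so here is a rough C model of what it appears to do: walk the device-state bitmap, compare each bit against the cached MEON flag, update the cache on a change, and record which Notify value MTFY would receive (1 on hot-add, 3 on hot-remove, as in the patch). The struct and function names are hypothetical, and the event list replaces the actual Notify() calls purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NOTIFY_DEVICE_CHECK  1  /* value MESC passes to MTFY on hot-add */
#define NOTIFY_EJECT_REQUEST 3  /* value MESC passes to MTFY on hot-remove */

/* One pending notification: which memory device, and which Notify value. */
struct memdev_event {
    unsigned slot;
    unsigned notify;
};

/*
 * Compare the hardware bitmap against the cached per-device flags (the
 * role MEON plays in the DSDT). For every device whose state changed,
 * update the cache and append an event. Returns the number of events.
 * 'out' must have room for nslots entries.
 */
static size_t scan_memdev_bitmap(const uint8_t *bitmap, uint8_t *meon,
                                 size_t nslots, struct memdev_event *out)
{
    size_t slot, nevents = 0;

    for (slot = 0; slot < nslots; slot++) {
        /* bit (slot % 8) of byte (slot / 8) is this device's state */
        uint8_t active = (bitmap[slot >> 3] >> (slot & 7)) & 1;

        if (meon[slot] != active) {
            meon[slot] = active;
            out[nevents].slot = (unsigned)slot;
            out[nevents].notify = active ? NOTIFY_DEVICE_CHECK
                                         : NOTIFY_EJECT_REQUEST;
            nevents++;
        }
    }
    return nevents;
}
```

Because the cache is updated as part of the scan, running the scan twice against an unchanged bitmap produces no events the second time, which matches the edge-triggered _E03 handler.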
[RFC PATCH v3 04/19][SeaBIOS] acpi: generate hotplug memory devices
The memory device generation is guided by qemu paravirt info. Seabios first uses the info to setup SRAT entries for the hotplug-able memory slots. Afterwards, build_memssdt uses the created SRAT entries to generate appropriate memory device objects. One memory device (and corresponding SRAT entry) is generated for each hotplug-able qemu memslot. Currently no SSDT memory device is created for initial system memory. We only support up to 255 DIMMs for now (PackageOp used for the MEON array can only describe an array of at most 255 elements. VarPackageOp would be needed to support more than 255 devices) v1-v2: Seabios reads mems_sts from qemu to build e820_map SSDT size and some offsets are calculated with extraction macros. v2-v3: Minor name changes Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- src/acpi.c | 158 +-- 1 files changed, 152 insertions(+), 6 deletions(-) diff --git a/src/acpi.c b/src/acpi.c index 6d239fa..1223b52 100644 --- a/src/acpi.c +++ b/src/acpi.c @@ -13,6 +13,7 @@ #include pci_regs.h // PCI_INTERRUPT_LINE #include ioport.h // inl #include paravirt.h // qemu_cfg_irq0_override +#include memmap.h // /* ACPI tables init */ @@ -416,11 +417,26 @@ encodeLen(u8 *ssdt_ptr, int length, int bytes) #define PCIHP_AML (ssdp_pcihp_aml + *ssdt_pcihp_start) #define PCI_SLOTS 32 +/* 0x5B 0x82 DeviceOp PkgLength NameString DimmID */ +#define MEM_BASE 0xaf80 +#define MEM_AML (ssdm_mem_aml + *ssdt_mem_start) +#define MEM_SIZEOF (*ssdt_mem_end - *ssdt_mem_start) +#define MEM_OFFSET_HEX (*ssdt_mem_name - *ssdt_mem_start + 2) +#define MEM_OFFSET_ID (*ssdt_mem_id - *ssdt_mem_start) +#define MEM_OFFSET_PXM 31 +#define MEM_OFFSET_START 55 +#define MEM_OFFSET_END 63 +#define MEM_OFFSET_SIZE 79 + +u64 nb_hp_memslots = 0; +struct srat_memory_affinity *mem; + #define SSDT_SIGNATURE 0x54445353 // SSDT #define SSDT_HEADER_LENGTH 36 #include ssdt-susp.hex #include ssdt-pcihp.hex +#include ssdt-mem.hex #define PCI_RMV_BASE 0xae0c @@ -472,6 +488,111 @@ 
static void patch_pcihp(int slot, u8 *ssdt_ptr, u32 eject) } } +static void build_memdev(u8 *ssdt_ptr, int i, u64 mem_base, u64 mem_len, u8 node) +{ +memcpy(ssdt_ptr, MEM_AML, MEM_SIZEOF); +ssdt_ptr[MEM_OFFSET_HEX] = getHex(i 4); +ssdt_ptr[MEM_OFFSET_HEX+1] = getHex(i); +ssdt_ptr[MEM_OFFSET_ID] = i; +ssdt_ptr[MEM_OFFSET_PXM] = node; +*(u64*)(ssdt_ptr + MEM_OFFSET_START) = mem_base; +*(u64*)(ssdt_ptr + MEM_OFFSET_END) = mem_base + mem_len; +*(u64*)(ssdt_ptr + MEM_OFFSET_SIZE) = mem_len; +} + +static void* +build_memssdt(void) +{ +u64 mem_base; +u64 mem_len; +u8 node; +int i; +struct srat_memory_affinity *entry = mem; +u64 nb_memdevs = nb_hp_memslots; +u8 memslot_status, enabled; + +int length = ((1+3+4) + + (nb_memdevs * MEM_SIZEOF) + + (1+2+5+(12*nb_memdevs)) + + (6+2+1+(1*nb_memdevs))); +u8 *ssdt = malloc_high(sizeof(struct acpi_table_header) + length); +if (! ssdt) { +warn_noalloc(); +return NULL; +} +u8 *ssdt_ptr = ssdt + sizeof(struct acpi_table_header); + +// build Scope(_SB_) header +*(ssdt_ptr++) = 0x10; // ScopeOp +ssdt_ptr = encodeLen(ssdt_ptr, length-1, 3); +*(ssdt_ptr++) = '_'; +*(ssdt_ptr++) = 'S'; +*(ssdt_ptr++) = 'B'; +*(ssdt_ptr++) = '_'; + +for (i = 0; i nb_memdevs; i++) { +mem_base = (((u64)(entry-base_addr_high) 32 )| entry-base_addr_low); +mem_len = (((u64)(entry-length_high) 32 )| entry-length_low); +node = entry-proximity[0]; +build_memdev(ssdt_ptr, i, mem_base, mem_len, node); +ssdt_ptr += MEM_SIZEOF; +entry++; +} + +// build Method(MTFY, 2) {If (LEqual(Arg0, 0x00)) {Notify(CM00, Arg1)} ...} +*(ssdt_ptr++) = 0x14; // MethodOp +ssdt_ptr = encodeLen(ssdt_ptr, 2+5+(12*nb_memdevs), 2); +*(ssdt_ptr++) = 'M'; +*(ssdt_ptr++) = 'T'; +*(ssdt_ptr++) = 'F'; +*(ssdt_ptr++) = 'Y'; +*(ssdt_ptr++) = 0x02; +for (i=0; inb_memdevs; i++) { +*(ssdt_ptr++) = 0xA0; // IfOp + ssdt_ptr = encodeLen(ssdt_ptr, 11, 1); +*(ssdt_ptr++) = 0x93; // LEqualOp +*(ssdt_ptr++) = 0x68; // Arg0Op +*(ssdt_ptr++) = 0x0A; // BytePrefix +*(ssdt_ptr++) = i; +*(ssdt_ptr++) = 0x86; // 
NotifyOp +*(ssdt_ptr++) = 'M'; +*(ssdt_ptr++) = 'P'; +*(ssdt_ptr++) = getHex(i 4); +*(ssdt_ptr++) = getHex(i); +*(ssdt_ptr++) = 0x69; // Arg1Op +} + +// build Name(MEON, Package() { One, One, ..., Zero, Zero, ... }) +*(ssdt_ptr++) = 0x08; // NameOp +*(ssdt_ptr++) = 'M'; +*(ssdt_ptr++) = 'E'; +*(ssdt_ptr++) = 'O'; +*(ssdt_ptr++) = 'N'; +*(ssdt_ptr++) = 0x12; // PackageOp +ssdt_ptr = encodeLen(ssdt_ptr, 2+1+(1*nb_memdevs), 2); +*(ssdt_ptr
[RFC PATCH v3 10/19] fix live-migration when populated=on is missing
Live migration works after memory hot-add events, as long as the qemu
command line -dimm arguments are changed on the destination host to specify
populated=on for the dimms that have been hot-added.

If a command-line change has not occurred, the destination host does not yet
have the corresponding ramblock in its ram_list. Activate the dimm on the
destination during ram_load.

Perhaps several fields of the DimmDevice should be part of a
VMStateDescription to handle migration in a cleaner way. But the problem is
that ramblocks are checked before qdev vmstates.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 arch_init.c |   24 +---
 1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 5a1173e..b63caa7 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -45,6 +45,7 @@
 #include "hw/pcspk.h"
 #include "qemu/page_cache.h"
 #include "qmp-commands.h"
+#include "hw/dimm.h"

 #ifdef DEBUG_ARCH_INIT
 #define DPRINTF(fmt, ...) \
@@ -740,10 +741,27 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
                 }

                 if (!block) {
-                    fprintf(stderr, "Unknown ramblock \"%s\", cannot "
+                    /* this can happen if a dimm was hot-added at source host */
+                    bool ramblock_found = false;
+                    if (dimm_add(id)) {
+                        fprintf(stderr, "Cannot add unknown ramblock \"%s\", "
+                                "cannot accept migration\n", id);
+                        ret = -EINVAL;
+                        goto done;
+                    }
+                    /* rescan ram_list, verify ramblock is there now */
+                    QLIST_FOREACH(block, &ram_list.blocks, next) {
+                        if (!strncmp(id, block->idstr, sizeof(id))) {
+                            ramblock_found = true;
+                            break;
+                        }
+                    }
+                    if (!ramblock_found) {
+                        fprintf(stderr, "Unknown ramblock \"%s\", cannot "
                             "accept migration\n", id);
-                    ret = -EINVAL;
-                    goto done;
+                        ret = -EINVAL;
+                        goto done;
+                    }
                 }

                 total_ram_bytes -= length;
--
1.7.9
[RFC PATCH v3 09/19] pc: Add dimm paravirt SRAT info
The numa_fw_cfg paravirt interface is extended to include SRAT information for
all hotplug-able dimms. There are 3 words for each hotplug-able memory slot,
denoting start address, size and node proximity. The new info is appended after
existing numa info, so that the fw_cfg layout does not break. This information
is used by Seabios to build hotplug memory device objects at runtime.
nb_numa_nodes is set to 1 by default (not 0), so that we always pass srat info
to SeaBIOS.

v1-v2: Dimm SRAT info (#dimms) is appended at end of existing numa fw_cfg in
order not to break existing layout
Documentation of the new fwcfg layout is included in docs/specs/fwcfg.txt

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 docs/specs/fwcfg.txt |   28
 hw/pc.c              |   14 --
 2 files changed, 40 insertions(+), 2 deletions(-)
 create mode 100644 docs/specs/fwcfg.txt

diff --git a/docs/specs/fwcfg.txt b/docs/specs/fwcfg.txt
new file mode 100644
index 0000000..55f96d9
--- /dev/null
+++ b/docs/specs/fwcfg.txt
@@ -0,0 +1,28 @@
+QEMU-BIOS Paravirt Documentation
+--------------------------------
+
+This document describes paravirt data structures passed from QEMU to BIOS.
+
+FW_CFG_NUMA paravirt info
+
+The SRAT info passed from QEMU to BIOS has the following layout:
+
+--------------------------------------------------------------------------------
+#nodes | cpu0_pxm | cpu1_pxm | ... | cpulast_pxm | node0_mem | node1_mem | ... | nodelast_mem
+
+--------------------------------------------------------------------------------
+#dimms | dimm0_start | dimm0_sz | dimm0_pxm | ... | dimmlast_start | dimmlast_sz | dimmlast_pxm
+
+Entry 0 contains the number of numa nodes (nb_numa_nodes).
+
+Entries 1..max_cpus: The next max_cpus entries describe node proximity for each
+one of the vCPUs in the system.
+
+Entries max_cpus+1..max_cpus+nb_numa_nodes: The next nb_numa_nodes entries
+describe the memory size for each one of the NUMA nodes in the system.
+
+Entry max_cpus+nb_numa_nodes+1 contains the number of memory dimms (nb_hp_dimms)
+
+The last 3 * nb_hp_dimms entries are organized in triplets: Each triplet contains
+the physical address offset, size (in bytes), and node proximity for the
+respective dimm.
diff --git a/hw/pc.c b/hw/pc.c
index 2c9664d..f2604ae 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -598,6 +598,7 @@ static void *bochs_bios_init(void)
     uint8_t *smbios_table;
     size_t smbios_len;
     uint64_t *numa_fw_cfg;
+    uint64_t *hp_dimms_fw_cfg;
     int i, j;

     register_ioport_write(0x400, 1, 2, bochs_bios_write, NULL);
@@ -632,8 +633,10 @@ static void *bochs_bios_init(void)
     /* allocate memory for the NUMA channel: one (64bit) word for the number
      * of nodes, one word for each VCPU->node and one word for each node to
      * hold the amount of memory.
+     * Finally one word for the number of hotplug memory slots and three words
+     * for each hotplug memory slot (start address, size and node proximity).
      */
-    numa_fw_cfg = g_malloc0((1 + max_cpus + nb_numa_nodes) * 8);
+    numa_fw_cfg = g_malloc0((2 + max_cpus + nb_numa_nodes + 3 * nb_hp_dimms) * 8);
     numa_fw_cfg[0] = cpu_to_le64(nb_numa_nodes);
     for (i = 0; i < max_cpus; i++) {
         for (j = 0; j < nb_numa_nodes; j++) {
@@ -646,8 +649,15 @@ static void *bochs_bios_init(void)
     for (i = 0; i < nb_numa_nodes; i++) {
         numa_fw_cfg[max_cpus + 1 + i] = cpu_to_le64(node_mem[i]);
     }
+
+    numa_fw_cfg[1 + max_cpus + nb_numa_nodes] = cpu_to_le64(nb_hp_dimms);
+
+    hp_dimms_fw_cfg = numa_fw_cfg + 2 + max_cpus + nb_numa_nodes;
+    if (nb_hp_dimms)
+        setup_fwcfg_hp_dimms(hp_dimms_fw_cfg);
+
     fw_cfg_add_bytes(fw_cfg, FW_CFG_NUMA, (uint8_t *)numa_fw_cfg,
-                     (1 + max_cpus + nb_numa_nodes) * 8);
+                     (2 + max_cpus + nb_numa_nodes + 3 * nb_hp_dimms) * 8);

     return fw_cfg;
 }
--
1.7.9
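As a cross-check of the FW_CFG_NUMA layout described in this patch, the word indices can be computed as in the sketch below. It follows the index arithmetic in the hw/pc.c hunk (count word at 1 + max_cpus + nb_numa_nodes, triplets right after it). The struct and helper names are invented for this example and do not exist in QEMU or SeaBIOS.

```c
#include <assert.h>
#include <stddef.h>

/* Word (uint64_t) indices into the FW_CFG_NUMA blob. */
struct numa_fwcfg_layout {
    size_t cpu_pxm;     /* first per-vCPU proximity word (entry 1) */
    size_t node_mem;    /* first per-node memory size word */
    size_t dimm_count;  /* word holding nb_hp_dimms */
    size_t dimm_info;   /* first word of the (start, size, pxm) triplets */
    size_t total_words; /* total blob length in 64-bit words */
};

static struct numa_fwcfg_layout
numa_fwcfg_describe(size_t max_cpus, size_t nb_numa_nodes, size_t nb_hp_dimms)
{
    struct numa_fwcfg_layout l;

    l.cpu_pxm = 1;                      /* entry 0 is nb_numa_nodes */
    l.node_mem = 1 + max_cpus;
    l.dimm_count = 1 + max_cpus + nb_numa_nodes;
    l.dimm_info = l.dimm_count + 1;
    l.total_words = l.dimm_info + 3 * nb_hp_dimms;
    return l;
}

/* field: 0 = start address, 1 = size in bytes, 2 = node proximity */
static size_t numa_fwcfg_dimm_word(const struct numa_fwcfg_layout *l,
                                   size_t dimm, size_t field)
{
    return l->dimm_info + 3 * dimm + field;
}
```

With 4 vCPUs, 2 nodes and 3 dimms this gives 17 words, i.e. the same (2 + max_cpus + nb_numa_nodes + 3 * nb_hp_dimms) * 8 = 136 bytes that the patch allocates.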
[RFC PATCH v3 11/19] Implement qmp and hmp commands for notification lists
Guest can respond to ACPI hotplug events e.g. with _EJ or _OST method. This patch implements a tail queue to store guest notifications for memory hot-add and hot-remove requests. Guest responses for memory hotplug command on a per-dimm basis can be detected with the new hmp command info memhp or the new qmp command query-memhp Examples: (qemu) device_add dimm,id=ram0 (qemu) info memory-hotplug dimm: ram0 hot-add success or dimm: ram0 hot-add failure (qemu) device_del ram3 (qemu) info memory-hotplug dimm: ram3 hot-remove success or dimm: ram3 hot-remove failure Results are removed from the queue once read. This patch only queues _EJ events that signal hot-remove success. For _OST event queuing, which cover the hot-remove failure and hot-add success/failure cases, the _OST patches in this series are are also needed. These notification items should probably be part of migration state (not yet implemented). Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- hmp-commands.hx |2 + hmp.c| 17 ++ hmp.h|1 + hw/dimm.c| 62 +- hw/dimm.h|2 +- monitor.c|7 ++ qapi-schema.json | 26 ++ qmp-commands.hx | 37 8 files changed, 152 insertions(+), 2 deletions(-) diff --git a/hmp-commands.hx b/hmp-commands.hx index ed67e99..cfb1b67 100644 --- a/hmp-commands.hx +++ b/hmp-commands.hx @@ -1462,6 +1462,8 @@ show device tree show qdev device model list @item info roms show roms +@item info memory-hotplug +show memory-hotplug @end table ETEXI diff --git a/hmp.c b/hmp.c index ba6fbd3..4b3d63d 100644 --- a/hmp.c +++ b/hmp.c @@ -1168,3 +1168,20 @@ void hmp_screen_dump(Monitor *mon, const QDict *qdict) qmp_screendump(filename, err); hmp_handle_error(mon, err); } + +void hmp_info_memory_hotplug(Monitor *mon) +{ +MemHpInfoList *info; +MemHpInfoList *item; +MemHpInfo *dimm; + +info = qmp_query_memory_hotplug(NULL); +for (item = info; item; item = item-next) { +dimm = item-value; +monitor_printf(mon, dimm: %s %s %s\n, dimm-dimm, +dimm-request, dimm-result); +dimm-dimm = NULL; +} 
+ +qapi_free_MemHpInfoList(info); +} diff --git a/hmp.h b/hmp.h index 48b9c59..986705a 100644 --- a/hmp.h +++ b/hmp.h @@ -73,5 +73,6 @@ void hmp_getfd(Monitor *mon, const QDict *qdict); void hmp_closefd(Monitor *mon, const QDict *qdict); void hmp_send_key(Monitor *mon, const QDict *qdict); void hmp_screen_dump(Monitor *mon, const QDict *qdict); +void hmp_info_memory_hotplug(Monitor *mon); #endif diff --git a/hw/dimm.c b/hw/dimm.c index 288b997..fbd93a8 100644 --- a/hw/dimm.c +++ b/hw/dimm.c @@ -65,6 +65,7 @@ static void dimm_bus_initfn(Object *obj) DimmBus *bus = DIMM_BUS(obj); QTAILQ_INIT(bus-dimmconfig_list); QTAILQ_INIT(bus-dimmlist); +QTAILQ_INIT(bus-dimm_hp_result_queue); QTAILQ_FOREACH_SAFE(dimm_cfg, dimmconfig_list, nextdimmcfg, next_dimm_cfg) { QTAILQ_REMOVE(dimmconfig_list, dimm_cfg, nextdimmcfg); @@ -236,20 +237,78 @@ void dimm_notify(uint32_t idx, uint32_t event) { DimmBus *bus = main_memory_bus; DimmDevice *s; +DimmConfig *slotcfg; +struct dimm_hp_result *result; + s = dimm_find_from_idx(idx); assert(s != NULL); +result = g_malloc0(sizeof(*result)); +slotcfg = dimmcfg_find_from_name(DEVICE(s)-id); +result-dimmname = slotcfg-name; switch(event) { case DIMM_REMOVE_SUCCESS: dimm_depopulate(s); -qdev_simple_unplug_cb((DeviceState*)s); QTAILQ_REMOVE(bus-dimmlist, s, nextdimm); +qdev_simple_unplug_cb((DeviceState*)s); +QTAILQ_INSERT_TAIL(bus-dimm_hp_result_queue, result, next); break; default: +g_free(result); break; } } +MemHpInfoList *qmp_query_memory_hotplug(Error **errp) +{ +DimmBus *bus = main_memory_bus; +MemHpInfoList *head = NULL, *cur_item = NULL, *info; +struct dimm_hp_result *item, *nextitem; + +QTAILQ_FOREACH_SAFE(item, bus-dimm_hp_result_queue, next, nextitem) { + +info = g_malloc0(sizeof(*info)); +info-value = g_malloc0(sizeof(*info-value)); +info-value-dimm = g_malloc0(sizeof(char) * 32); +info-value-request = g_malloc0(sizeof(char) * 16); +info-value-result = g_malloc0(sizeof(char) * 16); +switch (item-ret) { +case DIMM_REMOVE_SUCCESS: 
+strcpy(info-value-request, hot-remove); +strcpy(info-value-result, success); +break; +case DIMM_REMOVE_FAIL: +strcpy(info-value-request, hot-remove); +strcpy(info-value-result, failure); +break; +case DIMM_ADD_SUCCESS: +strcpy(info-value-request, hot-add
[RFC PATCH v3 17/19][SeaBIOS] Implement _PS3 method for memory device
Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 src/acpi-dsdt.dsl |   15 +++
 src/ssdt-mem.dsl  |    4
 2 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/src/acpi-dsdt.dsl b/src/acpi-dsdt.dsl
index 0d37bbc..8a18770 100644
--- a/src/acpi-dsdt.dsl
+++ b/src/acpi-dsdt.dsl
@@ -784,6 +784,13 @@ DefinitionBlock (
             MIF, 8
         }

+        /* Memory _PS3 byte */
+        OperationRegion(MPSB, SystemIO, 0xafa4, 1)
+        Field (MPSB, ByteAcc, NoLock, Preserve)
+        {
+            MPS, 8
+        }
+
         Method(MESC, 0) {
             // Local5 = active memdevice bitmap
             Store (MES, Local5)
@@ -824,6 +831,14 @@ DefinitionBlock (
             Store(Arg0, MPE)
             Sleep(200)
         }
+
+        Method (MPS3, 1, NotSerialized) {
+            // _PS3 method - power-off method
+            Store(Arg0, MPS)
+            Store(Zero, Index(MEON, Arg0))
+            Sleep(200)
+        }
+
         Method (MOST, 3, Serialized) {
             // _OST method - OS status indication
             Switch (And(Arg0, 0xFF)) {
diff --git a/src/ssdt-mem.dsl b/src/ssdt-mem.dsl
index 041d301..7423fc6 100644
--- a/src/ssdt-mem.dsl
+++ b/src/ssdt-mem.dsl
@@ -39,6 +39,7 @@ DefinitionBlock ("ssdt-mem.aml", "SSDT", 0x02, "BXPC", "CSSDT", 0x1)
         External(CMST, MethodObj)
         External(MPEJ, MethodObj)
         External(MOST, MethodObj)
+        External(MPS3, MethodObj)

         Name(_CRS, ResourceTemplate() {
             QwordMemory(
@@ -64,6 +65,9 @@ DefinitionBlock ("ssdt-mem.aml", "SSDT", 0x02, "BXPC", "CSSDT", 0x1)
         Method (_OST, 3) {
             MOST(Arg0, Arg1, ID)
         }
+        Method (_PS3, 0) {
+            MPS3(ID)
+        }
     }
 }
--
1.7.9
[RFC PATCH v3 18/19] Implement _PS3 for dimm
This will allow us to update dimm state on OSPM-initiated eject operations e.g.
with "echo 1 > /sys/bus/acpi/devices/PNP0C80\:00/eject"

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 docs/specs/acpi_hotplug.txt |    7 +++
 hw/acpi_piix4.c             |    5 +
 hw/dimm.c                   |    3 +++
 hw/dimm.h                   |    3 ++-
 4 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/docs/specs/acpi_hotplug.txt b/docs/specs/acpi_hotplug.txt
index 536da16..69868fe 100644
--- a/docs/specs/acpi_hotplug.txt
+++ b/docs/specs/acpi_hotplug.txt
@@ -45,3 +45,10 @@ insertion failed.

 Written by ACPI memory device _OST method to notify qemu of failed
 hot-add. Write-only.
+
+Memory Dimm _PS3 power-off initiated by OSPM (IO port 0xafa4, 1-byte access):
+---------------------------------------------------------------
+Dimm _PS3 power-off notification, initiated by OSPM. Byte value indicates Dimm
+slot which entered D3 state.
+
+Written by ACPI memory device _PS3 method to notify qemu of power-off state for
+the dimm. Write-only.
diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index 8bf58a6..aad78ca 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -52,6 +52,7 @@
 #define MEM_OST_REMOVE_FAIL 0xafa1
 #define MEM_OST_ADD_SUCCESS 0xafa2
 #define MEM_OST_ADD_FAIL 0xafa3
+#define MEM_PS3 0xafa4

 #define PIIX4_MEM_HOTPLUG_STATUS 8
 #define PIIX4_PCI_HOTPLUG_STATUS 2
@@ -545,6 +546,9 @@ static void gpe_writeb(void *opaque, uint32_t addr, uint32_t val)
     case MEM_OST_ADD_FAIL:
         dimm_notify(val, DIMM_ADD_FAIL);
         break;
+    case MEM_PS3:
+        dimm_notify(val, DIMM_OSPM_POWEROFF);
+        break;
     default:
         acpi_gpe_ioport_writeb(&s->ar, addr, val);
     }
@@ -621,6 +625,7 @@ static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s)
     register_ioport_write(MEM_OST_REMOVE_FAIL, 1, 1, gpe_writeb, s);
     register_ioport_write(MEM_OST_ADD_SUCCESS, 1, 1, gpe_writeb, s);
     register_ioport_write(MEM_OST_ADD_FAIL, 1, 1, gpe_writeb, s);
+    register_ioport_write(MEM_PS3, 1, 1, gpe_writeb, s);

     for(i = 0; i < DIMM_BITMAP_BYTES; i++) {
         s->gperegs.mems_sts[i] = 0;
diff --git a/hw/dimm.c b/hw/dimm.c
index b993668..08f66d5 100644
--- a/hw/dimm.c
+++ b/hw/dimm.c
@@ -319,6 +319,9 @@ void dimm_notify(uint32_t idx, uint32_t event)
         qdev_simple_unplug_cb((DeviceState*)s);
         QTAILQ_INSERT_TAIL(&bus->dimm_hp_result_queue, result, next);
         break;
+    case DIMM_OSPM_POWEROFF:
+        if (bus->dimm_revert)
+            bus->dimm_revert(bus->dimm_hotplug_qdev, s, 1);
     default:
         g_free(result);
         break;
diff --git a/hw/dimm.h b/hw/dimm.h
index ce091fe..8d73b8f 100644
--- a/hw/dimm.h
+++ b/hw/dimm.h
@@ -15,7 +15,8 @@ typedef enum {
     DIMM_REMOVE_SUCCESS = 0,
     DIMM_REMOVE_FAIL = 1,
     DIMM_ADD_SUCCESS = 2,
-    DIMM_ADD_FAIL = 3
+    DIMM_ADD_FAIL = 3,
+    DIMM_OSPM_POWEROFF = 4
 } dimm_hp_result_code;

 typedef enum {
--
1.7.9
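With MEM_PS3 added, the notification ports 0xafa1 to 0xafa4 line up one-to-one with the dimm_hp_result_code values 1 to 4. The sketch below spells that mapping out. It assumes that 0xafa0 (the eject byte written by _EJ0 in the DSDT patch) corresponds to DIMM_REMOVE_SUCCESS; that case is not shown in this patch, so treat it as an assumption. The helper name and the port-base macro are hypothetical and only restate the switch in gpe_writeb() in table form.

```c
#include <assert.h>
#include <stdint.h>

/* First memory hotplug notification port; assumed from the DSDT's
 * "Memory eject byte" region at 0xafa0. */
#define MEM_EVENT_PORT_BASE 0xafa0u

/* Values as defined in hw/dimm.h by this series. */
enum dimm_event {
    DIMM_REMOVE_SUCCESS = 0,
    DIMM_REMOVE_FAIL    = 1,
    DIMM_ADD_SUCCESS    = 2,
    DIMM_ADD_FAIL       = 3,
    DIMM_OSPM_POWEROFF  = 4
};

/* Map an I/O port write to the event it signals, or -1 if the port is
 * not one of the memory hotplug notification ports. */
static int dimm_event_from_port(uint32_t port)
{
    if (port < MEM_EVENT_PORT_BASE ||
        port > MEM_EVENT_PORT_BASE + DIMM_OSPM_POWEROFF)
        return -1;
    return (int)(port - MEM_EVENT_PORT_BASE);
}
```

In every case the byte written to the port is the dimm slot index, so a handler built on this mapping would pass (value, event) straight to dimm_notify().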
[RFC PATCH v3 08/19] pc: calculate dimm physical addresses and adjust memory map
Dimm physical address offsets are calculated automatically and memory map is
adjusted accordingly. If a DIMM can fit before the PCI_HOLE_START (currently
0xe0000000), it will be added normally, otherwise its physical address will be
above 4GB.

Also create memory bus on i440fx-pcihost device.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 hw/pc.c      |   41 +
 hw/pc.h      |    6 ++
 hw/pc_piix.c |   20 ++--
 vl.c         |    1 +
 4 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index 112739a..2c9664d 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -52,6 +52,7 @@
 #include "arch_init.h"
 #include "bitmap.h"
 #include "vga-pci.h"
+#include "dimm.h"

 /* output Bochs bios info messages */
 //#define DEBUG_BIOS
@@ -93,6 +94,9 @@ struct e820_table {
 static struct e820_table e820_table;

 struct hpet_fw_config hpet_cfg = {.count = UINT8_MAX};
+ram_addr_t below_4g_hp_mem_size = 0;
+ram_addr_t above_4g_hp_mem_size = 0;
+extern target_phys_addr_t ram_hp_offset;

 void gsi_handler(void *opaque, int n, int level)
 {
     GSIState *s = opaque;
@@ -1160,3 +1164,40 @@ void pc_pci_device_init(PCIBus *pci_bus)
         pci_create_simple(pci_bus, -1, "lsi53c895a");
     }
 }
+
+
+/* Function to configure memory offsets of hotpluggable dimms */
+
+target_phys_addr_t pc_set_hp_memory_offset(uint64_t size)
+{
+    target_phys_addr_t ret;
+
+    /* on first call, initialize ram_hp_offset */
+    if (!ram_hp_offset) {
+        if (ram_size >= PCI_HOLE_START ) {
+            ram_hp_offset = 0x100000000LL + (ram_size - PCI_HOLE_START);
+        } else {
+            ram_hp_offset = ram_size;
+        }
+    }
+
+    if (ram_hp_offset >= 0x100000000LL) {
+        ret = ram_hp_offset;
+        above_4g_hp_mem_size += size;
+        ram_hp_offset += size;
+    }
+    /* if dimm fits before pci hole, append it normally */
+    else if (ram_hp_offset + size <= PCI_HOLE_START) {
+        ret = ram_hp_offset;
+        below_4g_hp_mem_size += size;
+        ram_hp_offset += size;
+    }
+    /* otherwise place it above 4GB */
+    else {
+        ret = 0x100000000LL;
+        above_4g_hp_mem_size += size;
+        ram_hp_offset = 0x100000000LL + size;
+    }
+
+    return ret;
+}
diff --git a/hw/pc.h b/hw/pc.h
index e4db071..f3304fc 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -10,6 +10,7 @@
 #include "memory.h"
 #include "ioapic.h"

+#define PCI_HOLE_START 0xe0000000
 /* PC-style peripherals (also used by other machines). */

 /* serial.c */
@@ -214,6 +215,11 @@ static inline bool isa_ne2000_init(ISABus *bus, int base, int irq, NICInfo *nd)
 /* pc_sysfw.c */
 void pc_system_firmware_init(MemoryRegion *rom_memory);

+/* memory hotplug */
+target_phys_addr_t pc_set_hp_memory_offset(uint64_t size);
+extern ram_addr_t below_4g_hp_mem_size;
+extern ram_addr_t above_4g_hp_mem_size;
+
 /* e820 types */
 #define E820_RAM        1
 #define E820_RESERVED   2
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index 88ff041..d1fd276 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -43,6 +43,7 @@
 #include "xen.h"
 #include "memory.h"
 #include "exec-memory.h"
+#include "dimm.h"
 #ifdef CONFIG_XEN
 #  include <xen/hvm/hvm_info_table.h>
 #endif
@@ -155,9 +156,9 @@ static void pc_init1(MemoryRegion *system_memory,
         kvmclock_create();
     }

-    if (ram_size >= 0xe0000000 ) {
-        above_4g_mem_size = ram_size - 0xe0000000;
-        below_4g_mem_size = 0xe0000000;
+    if (ram_size >= PCI_HOLE_START ) {
+        above_4g_mem_size = ram_size - PCI_HOLE_START;
+        below_4g_mem_size = PCI_HOLE_START;
     } else {
         above_4g_mem_size = 0;
         below_4g_mem_size = ram_size;
@@ -172,6 +173,9 @@ static void pc_init1(MemoryRegion *system_memory,
         rom_memory = system_memory;
     }

+    /* adjust memory map for hotplug dimms */
+    dimm_calc_offsets(pc_set_hp_memory_offset);
+
     /* allocate ram and load rom/bios */
     if (!xen_enabled()) {
         fw_cfg = pc_memory_init(system_memory,
@@ -192,9 +196,11 @@ static void pc_init1(MemoryRegion *system_memory,
     if (pci_enabled) {
         pci_bus = i440fx_init(&i440fx_state, &piix3_devfn, &isa_bus, gsi,
                               system_memory, system_io, ram_size,
-                              below_4g_mem_size,
-                              0x100000000ULL - below_4g_mem_size,
-                              0x100000000ULL + above_4g_mem_size,
+                              below_4g_mem_size + below_4g_hp_mem_size,
+                              0x100000000ULL - below_4g_mem_size
+                                - below_4g_hp_mem_size,
+                              0x100000000ULL + above_4g_mem_size
+                                + above_4g_hp_mem_size,
                               (sizeof(target_phys_addr_t) == 4
                                ? 0
                                : ((uint64_t)1 << 62)),
@@ -223,6 +229,8 @@ static void pc_init1(MemoryRegion *system_memory
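To make the placement policy of pc_set_hp_memory_offset() easier to reason about, here is a standalone C sketch of the same three-way decision, with the globals replaced by a small state struct. It assumes PCI_HOLE_START is 0xe0000000 and the 4GB boundary is 0x100000000, as in the commit message. The struct and function names are hypothetical, and as in the patch a zero next_offset means "not yet initialized".

```c
#include <assert.h>
#include <stdint.h>

#define PCI_HOLE_START 0xe0000000ULL   /* start of the 32-bit PCI hole */
#define FOUR_GB        0x100000000ULL

/* Running state; the patch keeps these in globals. */
struct hp_mem_map {
    uint64_t next_offset;  /* next free guest-physical address, 0 = unset */
    uint64_t below_4g;     /* hotplug bytes placed below the PCI hole */
    uint64_t above_4g;     /* hotplug bytes placed above 4GB */
};

/* Return the guest-physical start address chosen for a dimm of 'size'
 * bytes on a machine with 'ram_size' bytes of initial memory. */
static uint64_t hp_place_dimm(struct hp_mem_map *m, uint64_t ram_size,
                              uint64_t size)
{
    uint64_t ret;

    /* on first call, start right after the initial RAM */
    if (!m->next_offset) {
        m->next_offset = ram_size >= PCI_HOLE_START
                       ? FOUR_GB + (ram_size - PCI_HOLE_START)
                       : ram_size;
    }

    if (m->next_offset >= FOUR_GB) {
        /* already above 4GB: keep appending there */
        ret = m->next_offset;
        m->above_4g += size;
        m->next_offset += size;
    } else if (m->next_offset + size <= PCI_HOLE_START) {
        /* dimm fits before the PCI hole: append it normally */
        ret = m->next_offset;
        m->below_4g += size;
        m->next_offset += size;
    } else {
        /* does not fit below the hole: move to 4GB */
        ret = FOUR_GB;
        m->above_4g += size;
        m->next_offset = FOUR_GB + size;
    }
    return ret;
}
```

One consequence visible in this form: once a dimm has been pushed above 4GB, any remaining gap below the PCI hole is never used, even if a later, smaller dimm would fit there.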
[RFC PATCH v3 19/19][SeaBIOS] Calculate pcimem_start and pcimem64_start from SRAT entries
pcimem_start and pcimem64_start are adjusted from srat entries. For this reason, paravirt info (NUMA SRAT entries and number of cpus) need to be read before pci_setup. Imho, this is an ugly code change since SRAT bios tables and number of cpus have to be read earlier. But the advantage is that no new paravirt interface is introduced. Suggestions to make the code change cleaner are welcome. The alternative patch (will be sent as a reply to this patch) implements a paravirt interface to read the starting values of pcimem_start and pcimem64_start from QEMU. Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- src/acpi.c| 82 src/acpi.h|3 ++ src/pciinit.c |6 +++- src/post.c|3 ++ src/smp.c |4 +++ 5 files changed, 72 insertions(+), 26 deletions(-) diff --git a/src/acpi.c b/src/acpi.c index 1223b52..9e99aa7 100644 --- a/src/acpi.c +++ b/src/acpi.c @@ -428,7 +428,10 @@ encodeLen(u8 *ssdt_ptr, int length, int bytes) #define MEM_OFFSET_END 63 #define MEM_OFFSET_SIZE 79 -u64 nb_hp_memslots = 0; +u64 nb_hp_memslots = 0, nb_numanodes; +u64 *numa_data, *hp_memdata; +u64 below_4g_hp_mem_size = 0; +u64 above_4g_hp_mem_size = 0; struct srat_memory_affinity *mem; #define SSDT_SIGNATURE 0x54445353 // SSDT @@ -763,17 +766,7 @@ acpi_build_srat_memory(struct srat_memory_affinity *numamem, static void * build_srat(void) { -int nb_numa_nodes = qemu_cfg_get_numa_nodes(); - -u64 *numadata = malloc_tmphigh(sizeof(u64) * (MaxCountCPUs + nb_numa_nodes)); -if (!numadata) { -warn_noalloc(); -return NULL; -} - -qemu_cfg_get_numa_data(numadata, MaxCountCPUs + nb_numa_nodes); - -qemu_cfg_get_numa_data(nb_hp_memslots, 1); +int nb_numa_nodes = nb_numanodes; struct system_resource_affinity_table *srat; int srat_size = sizeof(*srat) + sizeof(struct srat_processor_affinity) * MaxCountCPUs + @@ -782,7 +775,7 @@ build_srat(void) srat = malloc_high(srat_size); if (!srat) { warn_noalloc(); -free(numadata); +free(numa_data); return NULL; } @@ -791,6 +784,7 @@ build_srat(void) struct 
srat_processor_affinity *core = (void*)(srat + 1); int i; u64 curnode; +u64 *numadata = numa_data; for (i = 0; i MaxCountCPUs; ++i) { core-type = SRAT_PROCESSOR; @@ -847,15 +841,7 @@ build_srat(void) mem = (void*)numamem; if (nb_hp_memslots) { -u64 *hpmemdata = malloc_tmphigh(sizeof(u64) * (3 * nb_hp_memslots)); -if (!hpmemdata) { -warn_noalloc(); -free(hpmemdata); -free(numadata); -return NULL; -} - -qemu_cfg_get_numa_data(hpmemdata, 3 * nb_hp_memslots); +u64 *hpmemdata = hp_memdata; for (i = 1; i nb_hp_memslots + 1; ++i) { mem_base = *hpmemdata++; @@ -865,7 +851,7 @@ build_srat(void) numamem++; slots++; } -free(hpmemdata); +free(hp_memdata); } for (; slots nb_numa_nodes + nb_hp_memslots + 2; slots++) { @@ -875,10 +861,58 @@ build_srat(void) build_header((void*)srat, SRAT_SIGNATURE, srat_size, 1); -free(numadata); +free(numa_data); return srat; } +/* QEMU paravirt SRAT entries need to be read in before pci initilization */ +void read_srat_early(void) +{ +int i; + +nb_numanodes = qemu_cfg_get_numa_nodes(); +u64 *hpmemdata; +u64 mem_len, mem_base; + +numa_data = malloc_tmphigh(sizeof(u64) * (MaxCountCPUs + nb_numanodes)); +if (!numa_data) { +warn_noalloc(); +} + +qemu_cfg_get_numa_data(numa_data, MaxCountCPUs + nb_numanodes); +qemu_cfg_get_numa_data(nb_hp_memslots, 1); + +if (nb_hp_memslots) { +hp_memdata = malloc_tmphigh(sizeof(u64) * (3 * nb_hp_memslots)); +if (!hp_memdata) { +warn_noalloc(); +free(hp_memdata); +free(numa_data); +} + +qemu_cfg_get_numa_data(hp_memdata, 3 * nb_hp_memslots); +hpmemdata = hp_memdata; + +for (i = 1; i nb_hp_memslots + 1; ++i) { +mem_base = *hpmemdata++; +mem_len = *hpmemdata++; +hpmemdata++; +if (mem_base = 0x1LL) { +above_4g_hp_mem_size += mem_len; +} +/* if dimm fits before pci hole, append it normally */ +else if (mem_base + mem_len = BUILD_PCIMEM_START) { +below_4g_hp_mem_size += mem_len; +} +/* otherwise place it above 4GB */ +else { +above_4g_hp_mem_size += mem_len; +} +} + +} +} + static const struct pci_device_id 
acpi_find_tbl[] = { /* PIIX4 Power Management device. */ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82371AB_3, NULL), diff --git a/src/acpi.h b/src/acpi.h index cb21561..d29837f
[RFC PATCH v3 16/19] Update dimm state on reset
In case of hot-remove failure on a guest that does not implement _OST, the
dimm bitmaps in qemu and SeaBIOS show the dimm as unplugged, but the dimm is
still present on the qdev/memory bus. To avoid this inconsistency, we set the
dimm state to active/hot-plugged on a reset of the associated acpi_pm device.
This way the dimm is still active after a VM reboot, and dimm visibility
always has the same behaviour, regardless of _OST support in the guest.

Signed-off-by: Vasilis Liaskovitis <vasilis.liaskovi...@profitbricks.com>
---
 hw/acpi_piix4.c |    1 +
 hw/dimm.c       |   20 ++++++++++++++++++++
 hw/dimm.h       |    1 +
 3 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index f7220d4..8bf58a6 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -373,6 +373,7 @@ static void piix4_reset(void *opaque)
         pci_conf[0x5B] = 0x02;
     }
     piix4_update_hotplug(s);
+    dimm_state_sync();
 }
 
 static void piix4_powerdown(void *opaque, int irq, int power_failing)
diff --git a/hw/dimm.c b/hw/dimm.c
index 1521462..b993668 100644
--- a/hw/dimm.c
+++ b/hw/dimm.c
@@ -182,6 +182,26 @@ static DimmDevice *dimm_find_from_idx(uint32_t idx)
     return NULL;
 }
 
+void dimm_state_sync(void)
+{
+    DimmBus *bus = main_memory_bus;
+    DimmDevice *slot;
+
+    /* if a hot-remove operation is pending on reset, it means the hot-remove
+     * operation has failed, but the guest hasn't notified us e.g. because the
+     * guest does not provide _OST notifications. The device is still present on
+     * the dimmbus, but the qemu and Seabios dimm bitmaps show this device as
+     * unplugged. To avoid this inconsistency, we set the dimm bits to active
+     * i.e. hot-plugged for each dimm present on the dimmbus.
+     */
+    QTAILQ_FOREACH(slot, &bus->dimmlist, nextdimm) {
+        if (slot->pending == DIMM_REMOVE_PENDING) {
+            if (bus->dimm_revert)
+                bus->dimm_revert(bus->dimm_hotplug_qdev, slot, 0);
+        }
+    }
+}
+
 /* used to create a dimm device, only on incoming migration of a hotplugged
  * RAMBlock
  */
diff --git a/hw/dimm.h b/hw/dimm.h
index a6c6e6f..ce091fe 100644
--- a/hw/dimm.h
+++ b/hw/dimm.h
@@ -95,5 +95,6 @@ void main_memory_bus_create(Object *parent);
 void dimm_config_create(char *id, uint64_t size, uint64_t node,
         uint32_t dimm_idx, uint32_t populated);
 uint64_t get_hp_memory_total(void);
+void dimm_state_sync(void);
 
 #endif
--
1.7.9
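As a rough illustration of the reset behaviour this patch describes, the sketch below re-marks every dimm with a pending hot-remove as present in a one-bit-per-slot bitmap. It is a standalone model, not the QEMU code: the sketch_ types, the 32-byte bitmap size and the helper names are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BITMAP_BYTES 32  /* assumed: 256 slots, one bit per slot */

enum sketch_pending { SKETCH_NO_PENDING, SKETCH_REMOVE_PENDING };

struct sketch_dimm {
    uint32_t idx;                 /* slot index, 0..255 */
    enum sketch_pending pending;  /* outstanding hot-remove request, if any */
};

/* Mark one slot as present (hot-plugged) in the status bitmap. */
static void sketch_set_present(uint8_t *bitmap, uint32_t idx)
{
    assert(idx < SKETCH_BITMAP_BYTES * 8);
    bitmap[idx / 8] |= (uint8_t)(1u << (idx % 8));
}

/* On reset, treat every dimm that still has a hot-remove pending as
 * present again. Returns the number of slots that were reverted. */
static size_t sketch_state_sync(uint8_t *bitmap,
                                const struct sketch_dimm *dimms, size_t n)
{
    size_t reverted = 0;
    for (size_t i = 0; i < n; i++) {
        if (dimms[i].pending == SKETCH_REMOVE_PENDING) {
            sketch_set_present(bitmap, dimms[i].idx);
            reverted++;
        }
    }
    return reverted;
}
```

Like the posted dimm_state_sync(), the sketch only touches the bitmap and leaves the pending flag alone; whether the flag should also be cleared on reset is not settled by the patch text.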
[RFC PATCH v3 15/19] Add _OST dimm support
This allows qemu to receive notifications from the guest OS on success or
failure of a memory hotplug request. The guest OS needs to implement the _OST
functionality for this to work (linux-next:
http://lkml.org/lkml/2012/6/25/321).

This patch also updates the dimm bitmap state and the hot-remove pending flag
when a hot-remove fails. This allows failed hot operations to be retried at
any time. This only works for guests that use _OST notification.

Also adds the new _OST registers to docs/specs/acpi_hotplug.txt.

Signed-off-by: Vasilis Liaskovitis <vasilis.liaskovi...@profitbricks.com>
---
 docs/specs/acpi_hotplug.txt |   25 +++++++++++++++++++++++++
 hw/acpi_piix4.c             |   35 ++++++++++++++++++++++++++++++++++-
 hw/dimm.c                   |   28 +++++++++++++++++++++++++++-
 hw/dimm.h                   |   10 +++++++++-
 4 files changed, 95 insertions(+), 3 deletions(-)

diff --git a/docs/specs/acpi_hotplug.txt b/docs/specs/acpi_hotplug.txt
index cf86242..536da16 100644
--- a/docs/specs/acpi_hotplug.txt
+++ b/docs/specs/acpi_hotplug.txt
@@ -20,3 +20,28 @@ ejected.
 
 Written by ACPI memory device _EJ0 method to notify qemu of successfull
 hot-removal. Write-only.
+
+Memory Dimm ejection failure notification (IO port 0xafa1, 1-byte access):
+---
+Dimm hot-remove _OST notification. Byte value indicates Dimm slot for which
+ejection failed.
+
+Written by ACPI memory device _OST method to notify qemu of failed
+hot-removal. Write-only.
+
+Memory Dimm insertion success notification (IO port 0xafa2, 1-byte access):
+---
+Dimm hot-remove _OST notification. Byte value indicates Dimm slot for which
+insertion succeeded.
+
+Written by ACPI memory device _OST method to notify qemu of failed
+hot-add. Write-only.
+
+Memory Dimm insertion failure notification (IO port 0xafa3, 1-byte access):
+---
+Dimm hot-remove _OST notification. Byte value indicates Dimm slot for which
+insertion failed.
+
+Written by ACPI memory device _OST method to notify qemu of failed
+hot-add. Write-only.
+
diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index 8776669..f7220d4 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -49,6 +49,9 @@
 #define PCI_RMV_BASE 0xae0c
 #define MEM_BASE 0xaf80
 #define MEM_EJ_BASE 0xafa0
+#define MEM_OST_REMOVE_FAIL 0xafa1
+#define MEM_OST_ADD_SUCCESS 0xafa2
+#define MEM_OST_ADD_FAIL 0xafa3
 
 #define PIIX4_MEM_HOTPLUG_STATUS 8
 #define PIIX4_PCI_HOTPLUG_STATUS 2
@@ -87,6 +90,7 @@ typedef struct PIIX4PMState {
     uint8_t s4_val;
 } PIIX4PMState;
 
+static int piix4_dimm_revert(DeviceState *qdev, DimmDevice *dev, int add);
 static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s);
 
 #define ACPI_ENABLE 0xf1
@@ -531,6 +535,15 @@ static void gpe_writeb(void *opaque, uint32_t addr, uint32_t val)
     case MEM_EJ_BASE:
         dimm_notify(val, DIMM_REMOVE_SUCCESS);
         break;
+    case MEM_OST_REMOVE_FAIL:
+        dimm_notify(val, DIMM_REMOVE_FAIL);
+        break;
+    case MEM_OST_ADD_SUCCESS:
+        dimm_notify(val, DIMM_ADD_SUCCESS);
+        break;
+    case MEM_OST_ADD_FAIL:
+        dimm_notify(val, DIMM_ADD_FAIL);
+        break;
     default:
         acpi_gpe_ioport_writeb(&s->ar, addr, val);
     }
@@ -604,13 +617,16 @@ static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s)
 
     register_ioport_read(MEM_BASE, DIMM_BITMAP_BYTES, 1, gpe_readb, s);
     register_ioport_write(MEM_EJ_BASE, 1, 1, gpe_writeb, s);
+    register_ioport_write(MEM_OST_REMOVE_FAIL, 1, 1, gpe_writeb, s);
+    register_ioport_write(MEM_OST_ADD_SUCCESS, 1, 1, gpe_writeb, s);
+    register_ioport_write(MEM_OST_ADD_FAIL, 1, 1, gpe_writeb, s);
 
     for(i = 0; i < DIMM_BITMAP_BYTES; i++) {
         s->gperegs.mems_sts[i] = 0;
     }
 
     pci_bus_hotplug(bus, piix4_device_hotplug, &s->dev.qdev);
-    dimm_bus_hotplug(piix4_dimm_hotplug, &s->dev.qdev);
+    dimm_bus_hotplug(piix4_dimm_hotplug, piix4_dimm_revert, &s->dev.qdev);
 }
 
 static void enable_device(PIIX4PMState *s, int slot)
@@ -656,6 +672,23 @@ static int piix4_dimm_hotplug(DeviceState *qdev, DimmDevice *dev, int
     return 0;
 }
 
+static int piix4_dimm_revert(DeviceState *qdev, DimmDevice *dev, int add)
+{
+    PCIDevice *pci_dev = DO_UPCAST(PCIDevice, qdev, qdev);
+    PIIX4PMState *s = DO_UPCAST(PIIX4PMState, dev, pci_dev);
+    struct gpe_regs *g = &s->gperegs;
+    DimmDevice *slot = DIMM(dev);
+    int idx = slot->idx;
+
+    if (add) {
+        g->mems_sts[idx/8] &= ~(1 << (idx%8));
+    }
+    else {
+        g->mems_sts[idx/8] |= (1 << (idx%8));
+    }
+    return 0;
+}
+
 static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
                                 PCIHotplugState state)
 {
diff --git a/hw/dimm.c b/hw/dimm.c
index 21626f6..1521462 100644
--- a/hw/dimm.c
+++ b/hw/dimm.c
[RFC PATCH v3 14/19][SeaBIOS] Add _OST dimm method
Add support for the _OST method. The _OST method writes into the appropriate
I/O byte to signal success or failure of a hot-add or hot-remove to qemu.

Signed-off-by: Vasilis Liaskovitis <vasilis.liaskovi...@profitbricks.com>
---
 src/acpi-dsdt.dsl |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 src/ssdt-mem.dsl  |    4 ++++
 2 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/src/acpi-dsdt.dsl b/src/acpi-dsdt.dsl
index 5d3e92b..0d37bbc 100644
--- a/src/acpi-dsdt.dsl
+++ b/src/acpi-dsdt.dsl
@@ -762,6 +762,28 @@ DefinitionBlock (
                 MPE, 8
             }
 
+
+            /* Memory hot-remove notify failure byte */
+            OperationRegion(MEEF, SystemIO, 0xafa1, 1)
+            Field (MEEF, ByteAcc, NoLock, Preserve)
+            {
+                MEF, 8
+            }
+
+            /* Memory hot-add notify success byte */
+            OperationRegion(MPIS, SystemIO, 0xafa2, 1)
+            Field (MPIS, ByteAcc, NoLock, Preserve)
+            {
+                MIS, 8
+            }
+
+            /* Memory hot-add notify failure byte */
+            OperationRegion(MPIF, SystemIO, 0xafa3, 1)
+            Field (MPIF, ByteAcc, NoLock, Preserve)
+            {
+                MIF, 8
+            }
+
             Method(MESC, 0) {
                 // Local5 = active memdevice bitmap
                 Store (MES, Local5)
@@ -802,6 +824,34 @@ DefinitionBlock (
             Store(Arg0, MPE)
             Sleep(200)
         }
+        Method (MOST, 3, Serialized) {
+            // _OST method - OS status indication
+            Switch (And(Arg0, 0xFF)) {
+                Case(0x3)
+                {
+                    Switch(And(Arg1, 0xFF)) {
+                        Case(0x1) {
+                            Store(Arg2, MEF)
+                            // Revert MEON flag for this memory device to one
+                            Store(One, Index(MEON, Arg2))
+                        }
+                    }
+                }
+                Case(0x1)
+                {
+                    Switch(And(Arg1, 0xFF)) {
+                        Case(0x0) {
+                            Store(Arg2, MIS)
+                        }
+                        Case(0x1) {
+                            Store(Arg2, MIF)
+                            // Revert MEON flag for this memory device to zero
+                            Store(Zero, Index(MEON, Arg2))
+                        }
+                    }
+                }
+            }
+        }
     }
diff --git a/src/ssdt-mem.dsl b/src/ssdt-mem.dsl
index ee322f0..041d301 100644
--- a/src/ssdt-mem.dsl
+++ b/src/ssdt-mem.dsl
@@ -38,6 +38,7 @@ DefinitionBlock ("ssdt-mem.aml", "SSDT", 0x02, "BXPC", "CSSDT", 0x1)
 
         External(CMST, MethodObj)
         External(MPEJ, MethodObj)
+        External(MOST, MethodObj)
 
         Name(_CRS, ResourceTemplate() {
             QwordMemory(
@@ -60,6 +61,9 @@ DefinitionBlock ("ssdt-mem.aml", "SSDT", 0x02, "BXPC", "CSSDT", 0x1)
         Method (_EJ0, 1, NotSerialized) {
             MPEJ(ID, Arg0)
         }
+        Method (_OST, 3) {
+            MOST(Arg0, Arg1, ID)
+        }
     }
 }
 
--
1.7.9
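The dispatch done by the MOST method can be restated in C for readers less used to ASL. This is only a sketch of the decision table in the patch, with hypothetical sketch_ names; it returns the I/O port the slot id would be written to, or 0 when the (event, status) pair is not handled.

```c
#include <stdint.h>

#define SKETCH_PORT_REMOVE_FAIL 0xafa1
#define SKETCH_PORT_ADD_SUCCESS 0xafa2
#define SKETCH_PORT_ADD_FAIL    0xafa3

/* Map an _OST (event, status) pair to the notification port used by the
 * patch. Only the low byte of each argument is examined, as in MOST. */
static uint16_t sketch_ost_port(uint32_t event, uint32_t status)
{
    switch (event & 0xff) {
    case 0x3:                       /* eject request */
        if ((status & 0xff) == 0x1) {
            return SKETCH_PORT_REMOVE_FAIL;
        }
        break;
    case 0x1:                       /* device check, i.e. insertion */
        switch (status & 0xff) {
        case 0x0:
            return SKETCH_PORT_ADD_SUCCESS;
        case 0x1:
            return SKETCH_PORT_ADD_FAIL;
        }
        break;
    }
    return 0;                       /* not handled by MOST */
}
```

Note that a successful eject does not appear here: in this series it is reported through _EJ0 and port 0xafa0, not through _OST.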
[RFC PATCH v3 13/19] balloon: update with hotplugged memory
query-balloon and info balloon should report the total memory available to
the guest.

Balloon inflate/deflate can also use all memory available to the guest
(initial + hotplugged memory).

The balloon driver has been minimally tested with the patch; please review
and test.

Caveat: if the guest does not online hotplugged memory, it's easy for a
balloon inflate command to OOM a guest.

Signed-off-by: Vasilis Liaskovitis <vasilis.liaskovi...@profitbricks.com>
---
 hw/virtio-balloon.c |   13 +++++++++----
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c
index dd1a650..bca21bc 100644
--- a/hw/virtio-balloon.c
+++ b/hw/virtio-balloon.c
@@ -22,6 +22,7 @@
 #include "virtio-balloon.h"
 #include "kvm.h"
 #include "exec-memory.h"
+#include "dimm.h"
 
 #if defined(__linux__)
 #include <sys/mman.h>
@@ -147,10 +148,11 @@ static void virtio_balloon_set_config(VirtIODevice *vdev,
     VirtIOBalloon *dev = to_virtio_balloon(vdev);
     struct virtio_balloon_config config;
     uint32_t oldactual = dev->actual;
+    uint64_t hotplugged_ram_size = get_hp_memory_total();
     memcpy(&config, config_data, 8);
     dev->actual = le32_to_cpu(config.actual);
     if (dev->actual != oldactual) {
-        qemu_balloon_changed(ram_size -
+        qemu_balloon_changed(ram_size + hotplugged_ram_size -
                      (dev->actual << VIRTIO_BALLOON_PFN_SHIFT));
     }
 }
@@ -188,17 +190,20 @@ static void virtio_balloon_stat(void *opaque, BalloonInfo *info)
 
     info->actual = ram_size - ((uint64_t) dev->actual <<
                                VIRTIO_BALLOON_PFN_SHIFT);
+    info->actual += get_hp_memory_total();
 }
 
 static void virtio_balloon_to_target(void *opaque, ram_addr_t target)
 {
     VirtIOBalloon *dev = opaque;
+    uint64_t hotplugged_ram_size = get_hp_memory_total();
 
-    if (target > ram_size) {
-        target = ram_size;
+    if (target > ram_size + hotplugged_ram_size) {
+        target = ram_size + hotplugged_ram_size;
     }
     if (target) {
-        dev->num_pages = (ram_size - target) >> VIRTIO_BALLOON_PFN_SHIFT;
+        dev->num_pages = (ram_size + hotplugged_ram_size - target) >>
+                         VIRTIO_BALLOON_PFN_SHIFT;
         virtio_notify_config(dev->vdev);
     }
 }
--
1.7.9
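To make the new balloon arithmetic concrete, here is a small standalone sketch of the target clamp and page count from virtio_balloon_to_target(). The function name and the fixed shift of 12 (4 KiB balloon pages) are assumptions for the example; unlike the patch, the sketch also returns a value for a target of 0 instead of leaving num_pages unchanged.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PFN_SHIFT 12  /* assumed: balloon pages are 4 KiB */

/* Number of pages the balloon must hold so that the guest is left with
 * `target` bytes out of initial RAM plus hotplugged dimms. */
static uint32_t sketch_balloon_pages(uint64_t ram_size, uint64_t hp_size,
                                     uint64_t target)
{
    uint64_t total = ram_size + hp_size;

    assert(total >= ram_size);      /* no overflow in the sum */
    if (target > total) {
        target = total;             /* cannot give more than exists */
    }
    return (uint32_t)((total - target) >> SKETCH_PFN_SHIFT);
}
```

With 1 GiB of initial RAM, a 512 MiB hotplugged dimm and a 1 GiB target, the balloon has to take back the 512 MiB difference, i.e. 131072 pages of 4 KiB.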
[RFC PATCH v3 12/19] Implement info memory-total and query-memory-total
Returns the total physical memory available to the guest in bytes, including
hotplugged memory. Note that the number reported here may be different from
what the guest sees, e.g. if the guest has not logically onlined hotplugged
memory.

This functionality is provided independently of a balloon device, since a
guest can be using ACPI memory hotplug without using a balloon device.

Signed-off-by: Vasilis Liaskovitis <vasilis.liaskovi...@profitbricks.com>
---
 hmp-commands.hx  |    2 ++
 hmp.c            |    7 +++++++
 hmp.h            |    1 +
 hw/dimm.c        |   21 +++++++++++++++++++++
 hw/dimm.h        |    1 +
 monitor.c        |    7 +++++++
 qapi-schema.json |   11 +++++++++++
 qmp-commands.hx  |   20 ++++++++++++++++++++
 8 files changed, 70 insertions(+), 0 deletions(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index cfb1b67..988d207 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1464,6 +1464,8 @@ show qdev device model list
 show roms
 @item info memory-hotplug
 show memory-hotplug
+@item info memory-total
+show memory-total
 @end table
 ETEXI
 
diff --git a/hmp.c b/hmp.c
index 4b3d63d..cc31ddc 100644
--- a/hmp.c
+++ b/hmp.c
@@ -1185,3 +1185,10 @@ void hmp_info_memory_hotplug(Monitor *mon)
 
     qapi_free_MemHpInfoList(info);
 }
+
+void hmp_info_memory_total(Monitor *mon)
+{
+    uint64_t ram_total;
+    ram_total = (uint64_t)qmp_query_memory_total(NULL);
+    monitor_printf(mon, "MemTotal: %lu \n", ram_total);
+}
diff --git a/hmp.h b/hmp.h
index 986705a..ab96dba 100644
--- a/hmp.h
+++ b/hmp.h
@@ -74,5 +74,6 @@ void hmp_closefd(Monitor *mon, const QDict *qdict);
 void hmp_send_key(Monitor *mon, const QDict *qdict);
 void hmp_screen_dump(Monitor *mon, const QDict *qdict);
 void hmp_info_memory_hotplug(Monitor *mon);
+void hmp_info_memory_total(Monitor *mon);
 
 #endif
diff --git a/hw/dimm.c b/hw/dimm.c
index fbd93a8..21626f6 100644
--- a/hw/dimm.c
+++ b/hw/dimm.c
@@ -28,6 +28,7 @@ static DimmBus *main_memory_bus;
 /* the following list is used to hold dimm config info before machine
  * initialization. After machine init, the list is emptied and not used anymore.*/
 static DimmConfiglist dimmconfig_list = QTAILQ_HEAD_INITIALIZER(dimmconfig_list);
+extern ram_addr_t ram_size;
 
 static void dimmbus_dev_print(Monitor *mon, DeviceState *dev, int indent);
 static char *dimmbus_get_fw_dev_path(DeviceState *dev);
@@ -233,6 +234,26 @@ void setup_fwcfg_hp_dimms(uint64_t *fw_cfg_slots)
     }
 }
 
+uint64_t get_hp_memory_total(void)
+{
+    DimmBus *bus = main_memory_bus;
+    DimmDevice *slot;
+    uint64_t info = 0;
+
+    QTAILQ_FOREACH(slot, &bus->dimmlist, nextdimm) {
+        info += slot->size;
+    }
+    return info;
+}
+
+int64_t qmp_query_memory_total(Error **errp)
+{
+    uint64_t info;
+    info = ram_size + get_hp_memory_total();
+
+    return (int64_t)info;
+}
+
 void dimm_notify(uint32_t idx, uint32_t event)
 {
     DimmBus *bus = main_memory_bus;
diff --git a/hw/dimm.h b/hw/dimm.h
index 95251ba..21225be 100644
--- a/hw/dimm.h
+++ b/hw/dimm.h
@@ -86,5 +86,6 @@ int dimm_add(char *id);
 void main_memory_bus_create(Object *parent);
 void dimm_config_create(char *id, uint64_t size, uint64_t node,
         uint32_t dimm_idx, uint32_t populated);
+uint64_t get_hp_memory_total(void);
 
 #endif
diff --git a/monitor.c b/monitor.c
index be9a1d9..4f5ea60 100644
--- a/monitor.c
+++ b/monitor.c
@@ -2747,6 +2747,13 @@ static mon_cmd_t info_cmds[] = {
         .mhandler.info = hmp_info_memory_hotplug,
     },
     {
+        .name       = "memory-total",
+        .args_type  = "",
+        .params     = "",
+        .help       = "show total memory size",
+        .mhandler.info = hmp_info_memory_total,
+    },
+    {
         .name       = NULL,
     },
 };
diff --git a/qapi-schema.json b/qapi-schema.json
index 3706a2a..c1d2571 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -2581,3 +2581,14 @@
 # Since: 1.3
 ##
 { 'command': 'query-memory-hotplug', 'returns': ['MemHpInfo'] }
+
+##
+# @query-memory-total:
+#
+# Returns total memory in bytes, including hotplugged dimms
+#
+# Returns: int
+#
+# Since: 1.3
+##
+{ 'command': 'query-memory-total', 'returns': 'int' }
diff --git a/qmp-commands.hx b/qmp-commands.hx
index e50dcc2..20b7eea 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -2576,3 +2576,23 @@ Example:
    }
 
 EQMP
+
+    {
+        .name       = "query-memory-total",
+        .args_type  = "",
+        .mhandler.cmd_new = qmp_marshal_input_query_memory_total
+    },
+SQMP
+query-memory-total
+--
+
+Return total memory in bytes, including hotplugged dimms
+
+Example:
+
+-> { "execute": "query-memory-total" }
+<- {
+      "return": 1073741824
+   }
+
+EQMP
--
1.7.9
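The value returned by query-memory-total is simply initial RAM plus the sizes of all dimms currently on the DimmBus. A minimal standalone sketch of that sum follows; it uses a plain array in place of the bus list, and the sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum the sizes of all plugged dimms (stand-in for walking the DimmBus). */
static uint64_t sketch_hp_memory_total(const uint64_t *dimm_sizes, size_t n)
{
    uint64_t total = 0;

    assert(dimm_sizes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        total += dimm_sizes[i];
    }
    return total;
}

/* Total memory as the monitor command would report it: initial RAM plus
 * hotplugged dimms, whether or not the guest has onlined them. */
static int64_t sketch_memory_total(uint64_t ram_size,
                                   const uint64_t *dimm_sizes, size_t n)
{
    return (int64_t)(ram_size + sketch_hp_memory_total(dimm_sizes, n));
}
```

For a guest started with 1 GiB and two plugged 512 MiB dimms this gives 2147483648, regardless of how much of it the guest has actually onlined.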
[RFC PATCH v3 07/19] acpi_piix4: Implement memory device hotplug registers
A 32-byte register block is used to present up to 256 hotplug-able memory
devices to BIOS and OSPM. Hot-add and hot-remove functions trigger an ACPI
hotplug event through these registers. Only reads are allowed from these
registers.

A hot-remove triggers an ACPI hot-remove event, but then needs to wait for
OSPM to eject the device. We use a single-byte register to know when OSPM has
called the _EJ function for a particular dimm. A write to this byte will
depopulate the respective dimm. Only writes are allowed to this byte.

v1->v2: mems_sts address moved from 0xaf20 to 0xaf80 (to accommodate more
space for cpu-hotplugging in the future). _EJ array is reduced to a single
byte. Add documentation in docs/specs/acpi_hotplug.txt
v2->v3: minor name changes

Signed-off-by: Vasilis Liaskovitis <vasilis.liaskovi...@profitbricks.com>
---
 docs/specs/acpi_hotplug.txt |   22
 hw/acpi_piix4.c             |   73
 2 files changed, 91 insertions(+), 4 deletions(-)
 create mode 100644 docs/specs/acpi_hotplug.txt

diff --git a/docs/specs/acpi_hotplug.txt b/docs/specs/acpi_hotplug.txt
new file mode 100644
index 0000000..cf86242
--- /dev/null
+++ b/docs/specs/acpi_hotplug.txt
@@ -0,0 +1,22 @@
+QEMU-ACPI BIOS hotplug interface
+--
+This document describes the interface between QEMU and the ACPI BIOS for non-PCI
+space. For the PCI interface please look at docs/specs/acpi_pci_hotplug.txt
+
+QEMU-ACPI BIOS memory hotplug interface
+--
+
+Memory Dimm status array (IO port 0xaf80-0xaf9f, 1-byte access):
+---
+Dimm hot-plug notification pending. One bit per slot.
+
+Read by ACPI BIOS GPE.3 handler to notify OS of memory hot-add or hot-remove
+events. Read-only.
+
+Memory Dimm ejection success notification (IO port 0xafa0, 1-byte access):
+---
+Dimm hot-remove _EJ0 notification. Byte value indicates Dimm slot that was
+ejected.
+
+Written by ACPI memory device _EJ0 method to notify qemu of successfull
+hot-removal. Write-only.
diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index c56220b..8776669 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -28,6 +28,8 @@
 #include "range.h"
 #include "ioport.h"
 #include "fw_cfg.h"
+#include "sysbus.h"
+#include "dimm.h"
 
 //#define DEBUG
 
@@ -45,9 +47,15 @@
 #define PCI_DOWN_BASE 0xae04
 #define PCI_EJ_BASE 0xae08
 #define PCI_RMV_BASE 0xae0c
+#define MEM_BASE 0xaf80
+#define MEM_EJ_BASE 0xafa0
 
+#define PIIX4_MEM_HOTPLUG_STATUS 8
 #define PIIX4_PCI_HOTPLUG_STATUS 2
 
+struct gpe_regs {
+    uint8_t mems_sts[DIMM_BITMAP_BYTES];
+};
 struct pci_status {
     uint32_t up; /* deprecated, maintained for migration compatibility */
     uint32_t down;
@@ -69,6 +77,7 @@ typedef struct PIIX4PMState {
     Notifier machine_ready;
 
     /* for pci hotplug */
+    struct gpe_regs gperegs;
     struct pci_status pci0_status;
     uint32_t pci0_hotplug_enable;
     uint32_t pci0_slot_device_present;
@@ -93,8 +102,8 @@ static void pm_update_sci(PIIX4PMState *s)
                    ACPI_BITMASK_POWER_BUTTON_ENABLE |
                    ACPI_BITMASK_GLOBAL_LOCK_ENABLE |
                    ACPI_BITMASK_TIMER_ENABLE)) != 0) ||
-        (((s->ar.gpe.sts[0] & s->ar.gpe.en[0])
-          & PIIX4_PCI_HOTPLUG_STATUS) != 0);
+        (((s->ar.gpe.sts[0] & s->ar.gpe.en[0])
+          & (PIIX4_PCI_HOTPLUG_STATUS | PIIX4_MEM_HOTPLUG_STATUS)) != 0);
 
     qemu_set_irq(s->irq, sci_level);
     /* schedule a timer interruption if needed */
@@ -499,7 +508,16 @@ type_init(piix4_pm_register_types)
 static uint32_t gpe_readb(void *opaque, uint32_t addr)
 {
     PIIX4PMState *s = opaque;
-    uint32_t val = acpi_gpe_ioport_readb(&s->ar, addr);
+    uint32_t val = 0;
+    struct gpe_regs *g = &s->gperegs;
+
+    switch (addr) {
+    case MEM_BASE ... MEM_BASE+DIMM_BITMAP_BYTES:
+        val = g->mems_sts[addr - MEM_BASE];
+        break;
+    default:
+        val = acpi_gpe_ioport_readb(&s->ar, addr);
+    }
 
     PIIX4_DPRINTF("gpe read %x == %x\n", addr, val);
     return val;
@@ -509,7 +527,13 @@ static void gpe_writeb(void *opaque, uint32_t addr, uint32_t val)
 {
     PIIX4PMState *s = opaque;
 
-    acpi_gpe_ioport_writeb(&s->ar, addr, val);
+    switch (addr) {
+    case MEM_EJ_BASE:
+        dimm_notify(val, DIMM_REMOVE_SUCCESS);
+        break;
+    default:
+        acpi_gpe_ioport_writeb(&s->ar, addr, val);
+    }
     pm_update_sci(s);
 
     PIIX4_DPRINTF("gpe write %x == %d\n", addr, val);
@@ -560,9 +584,11 @@ static uint32_t pcirmv_read(void *opaque, uint32_t addr)
 
 static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
                                 PCIHotplugState state);
+static int piix4_dimm_hotplug(DeviceState *qdev, DimmDevice *dev, int add);
 
 static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s)
 {
+    int i = 0;
 
     register_ioport_write
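A small standalone sketch of the status-array layout may help when reviewing the read path: one bit per slot, 32 bytes starting at 0xaf80. The sketch_ names are hypothetical and this is not the QEMU code. It deliberately uses a half-open address check; GNU C case ranges are inclusive at both ends, so the range in gpe_readb() as posted also matches MEM_BASE+DIMM_BITMAP_BYTES (0xafa0), which may be worth double-checking.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MEM_BASE     0xaf80u
#define SKETCH_BITMAP_BYTES 32u      /* 256 slots, one bit per slot */

/* Read one byte of the dimm status array. Returns 1 and stores the byte
 * in *out when addr lies in [MEM_BASE, MEM_BASE + 32), 0 otherwise. */
static int sketch_read_status(const uint8_t *bitmap, uint32_t addr,
                              uint8_t *out)
{
    assert(bitmap && out);
    if (addr < SKETCH_MEM_BASE ||
        addr >= SKETCH_MEM_BASE + SKETCH_BITMAP_BYTES) {
        return 0;
    }
    *out = bitmap[addr - SKETCH_MEM_BASE];
    return 1;
}

/* Test the status bit of a single slot: byte idx/8, bit idx%8. */
static int sketch_slot_bit(const uint8_t *bitmap, uint32_t idx)
{
    assert(idx < SKETCH_BITMAP_BYTES * 8);
    return (bitmap[idx / 8] >> (idx % 8)) & 1;
}
```

Slot 10, for example, lives in the byte at port 0xaf81, bit 2.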
[RFC PATCH v3 05/19] Implement dimm device abstraction
Each hotplug-able memory slot is a DimmDevice. All DimmDevices are attached
to a new bus called DimmBus. This bus is introduced so that we no longer
depend on the hotplug capability of the main system bus (the main bus does
not allow hotplugging). The DimmBus should be attached to a chipset Device
(i440fx in case of the pc).

A hot-add operation for a particular dimm:
- creates a new DimmDevice and attaches it to the DimmBus
- creates a new MemoryRegion of the given physical address offset, size and
  node proximity, and attaches it to main system memory as a sub_region.

A successful hot-remove operation detaches and frees the MemoryRegion from
system memory, and removes the DimmDevice from the DimmBus.

Hotplug operations are done through the normal device_add / device_del
commands. Also add properties to DimmDevice.

Signed-off-by: Vasilis Liaskovitis <vasilis.liaskovi...@profitbricks.com>
---
 hw/dimm.c |  305
 hw/dimm.h |   90
 2 files changed, 395 insertions(+), 0 deletions(-)
 create mode 100644 hw/dimm.c
 create mode 100644 hw/dimm.h

diff --git a/hw/dimm.c b/hw/dimm.c
new file mode 100644
index 0000000..288b997
--- /dev/null
+++ b/hw/dimm.c
@@ -0,0 +1,305 @@
+/*
+ * Dimm device for Memory Hotplug
+ *
+ * Copyright ProfitBricks GmbH 2012
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>
+ */
+
+#include "trace.h"
+#include "qdev.h"
+#include "dimm.h"
+#include <time.h>
+#include "../exec-memory.h"
+#include "qmp-commands.h"
+
+/* the system-wide memory bus. */
+static DimmBus *main_memory_bus;
+/* the following list is used to hold dimm config info before machine
+ * initialization. After machine init, the list is emptied and not used anymore.*/
+static DimmConfiglist dimmconfig_list = QTAILQ_HEAD_INITIALIZER(dimmconfig_list);
+
+static void dimmbus_dev_print(Monitor *mon, DeviceState *dev, int indent);
+static char *dimmbus_get_fw_dev_path(DeviceState *dev);
+
+static Property dimm_properties[] = {
+    DEFINE_PROP_UINT64("start", DimmDevice, start, 0),
+    DEFINE_PROP_UINT64("size", DimmDevice, size, DEFAULT_DIMMSIZE),
+    DEFINE_PROP_UINT32("node", DimmDevice, node, 0),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void dimmbus_dev_print(Monitor *mon, DeviceState *dev, int indent)
+{
+}
+
+static char *dimmbus_get_fw_dev_path(DeviceState *dev)
+{
+    char path[40];
+
+    snprintf(path, sizeof(path), "%s", qdev_fw_name(dev));
+    return strdup(path);
+}
+
+static void dimm_bus_class_init(ObjectClass *klass, void *data)
+{
+    BusClass *k = BUS_CLASS(klass);
+
+    k->print_dev = dimmbus_dev_print;
+    k->get_fw_dev_path = dimmbus_get_fw_dev_path;
+}
+
+static void dimm_bus_initfn(Object *obj)
+{
+    DimmConfig *dimm_cfg, *next_dimm_cfg;
+    DimmBus *bus = DIMM_BUS(obj);
+    QTAILQ_INIT(&bus->dimmconfig_list);
+    QTAILQ_INIT(&bus->dimmlist);
+
+    QTAILQ_FOREACH_SAFE(dimm_cfg, &dimmconfig_list, nextdimmcfg, next_dimm_cfg) {
+        QTAILQ_REMOVE(&dimmconfig_list, dimm_cfg, nextdimmcfg);
+        QTAILQ_INSERT_TAIL(&bus->dimmconfig_list, dimm_cfg, nextdimmcfg);
+    }
+}
+
+static const TypeInfo dimm_bus_info = {
+    .name = TYPE_DIMM_BUS,
+    .parent = TYPE_BUS,
+    .instance_size = sizeof(DimmBus),
+    .instance_init = dimm_bus_initfn,
+    .class_init = dimm_bus_class_init,
+};
+
+void main_memory_bus_create(Object *parent)
+{
+    main_memory_bus = g_malloc0(dimm_bus_info.instance_size);
+    main_memory_bus->qbus.glib_allocated = true;
+    qbus_create_inplace(&main_memory_bus->qbus, TYPE_DIMM_BUS, DEVICE(parent),
+                        "membus");
+}
+
+static void dimm_populate(DimmDevice *s)
+{
+    DeviceState *dev= (DeviceState*)s;
+    MemoryRegion *new = NULL;
+
+    new = g_malloc(sizeof(MemoryRegion));
+    memory_region_init_ram(new, dev->id, s->size);
+    vmstate_register_ram_global(new);
+    memory_region_add_subregion(get_system_memory(), s->start, new);
+    s->mr = new;
+}
+
+static void dimm_depopulate(DimmDevice *s)
+{
+    assert(s);
+    vmstate_unregister_ram(s->mr, NULL);
+    memory_region_del_subregion(get_system_memory(), s->mr);
+    memory_region_destroy(s->mr);
+    s->mr = NULL;
+}
+
+void dimm_config_create(char *id, uint64_t size, uint64_t node, uint32_t
+        dimm_idx, uint32_t populated)
+{
+    DimmConfig *dimm_cfg;
+    dimm_cfg = (DimmConfig*) g_malloc0(sizeof(DimmConfig));
+    dimm_cfg->name = id
[RFC PATCH v3 06/19] Implement -dimm command line option
Example:
-dimm id=dimm0,size=512M,node=0,populated=off
will define a 512M memory slot belonging to numa node 0.

When populated=on, a DimmDevice is created and hot-plugged at system startup.

Signed-off-by: Vasilis Liaskovitis <vasilis.liaskovi...@profitbricks.com>
---
 hw/Makefile.objs |    2 +-
 qemu-config.c    |   25 +++++++++++++++++++++++++
 qemu-options.hx  |    5 +++++
 sysemu.h         |    1 +
 vl.c             |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 82 insertions(+), 1 deletions(-)

diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index 6dfebd2..8c5c39a 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -26,7 +26,7 @@ hw-obj-$(CONFIG_I8254) += i8254_common.o i8254.o
 hw-obj-$(CONFIG_PCSPK) += pcspk.o
 hw-obj-$(CONFIG_PCKBD) += pckbd.o
 hw-obj-$(CONFIG_FDC) += fdc.o
-hw-obj-$(CONFIG_ACPI) += acpi.o acpi_piix4.o
+hw-obj-$(CONFIG_ACPI) += acpi.o acpi_piix4.o dimm.o
 hw-obj-$(CONFIG_APM) += pm_smbus.o apm.o
 hw-obj-$(CONFIG_DMA) += dma.o
 hw-obj-$(CONFIG_I82374) += i82374.o
diff --git a/qemu-config.c b/qemu-config.c
index eba977e..4022d64 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -646,6 +646,30 @@ QemuOptsList qemu_boot_opts = {
     },
 };
 
+static QemuOptsList qemu_dimm_opts = {
+    .name = "dimm",
+    .head = QTAILQ_HEAD_INITIALIZER(qemu_dimm_opts.head),
+    .desc = {
+        {
+            .name = "id",
+            .type = QEMU_OPT_STRING,
+            .help = "id of this dimm device",
+        },{
+            .name = "size",
+            .type = QEMU_OPT_SIZE,
+            .help = "memory size for this dimm",
+        },{
+            .name = "populated",
+            .type = QEMU_OPT_BOOL,
+            .help = "populated for this dimm",
+        },{
+            .name = "node",
+            .type = QEMU_OPT_NUMBER,
+            .help = "NUMA node number (i.e. proximity) for this dimm",
+        },
+        { /* end of list */ }
+    },
+};
 static QemuOptsList *vm_config_groups[32] = {
     &qemu_drive_opts,
     &qemu_chardev_opts,
@@ -662,6 +686,7 @@ static QemuOptsList *vm_config_groups[32] = {
     &qemu_boot_opts,
     &qemu_iscsi_opts,
     &qemu_sandbox_opts,
+    &qemu_dimm_opts,
     NULL,
 };
 
diff --git a/qemu-options.hx b/qemu-options.hx
index 804a2d1..3687722 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -2842,3 +2842,8 @@ HXCOMM This is the last statement. Insert new options before this line!
 STEXI
 @end table
 ETEXI
+
+DEF("dimm", HAS_ARG, QEMU_OPTION_dimm,
+    "-dimm id=dimmid,size=sz,node=nd,populated=on|off\n"
+    "specify memory dimm device with name dimmid, size sz on node nd",
+    QEMU_ARCH_ALL)
diff --git a/sysemu.h b/sysemu.h
index 65552ac..7baf9c9 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -139,6 +139,7 @@ extern QEMUClock *rtc_clock;
 extern int nb_numa_nodes;
 extern uint64_t node_mem[MAX_NODES];
 extern unsigned long *node_cpumask[MAX_NODES];
+extern int nb_hp_dimms;
 
 #define MAX_OPTION_ROMS 16
 typedef struct QEMUOptionRom {
diff --git a/vl.c b/vl.c
index 7c577fa..af1745c 100644
--- a/vl.c
+++ b/vl.c
@@ -126,6 +126,7 @@ int main(int argc, char **argv)
 #include "hw/xen.h"
 #include "hw/qdev.h"
 #include "hw/loader.h"
+#include "hw/dimm.h"
 #include "bt-host.h"
 #include "net.h"
 #include "net/slirp.h"
@@ -248,6 +249,7 @@ QTAILQ_HEAD(, FWBootEntry) fw_boot_order = QTAILQ_HEAD_INITIALIZER(fw_boot_order
 int nb_numa_nodes;
 uint64_t node_mem[MAX_NODES];
 unsigned long *node_cpumask[MAX_NODES];
+int nb_hp_dimms;
 
 uint8_t qemu_uuid[16];
 
@@ -530,6 +532,37 @@ static void configure_rtc_date_offset(const char *startdate, int legacy)
     }
 }
 
+static void configure_dimm(QemuOpts *opts)
+{
+    const char *id;
+    uint64_t size, node;
+    bool populated;
+    QemuOpts *devopts;
+    char buf[256];
+    if (nb_hp_dimms == MAX_DIMMS) {
+        fprintf(stderr, "qemu: maximum number of DIMMs (%d) exceeded\n",
+                MAX_DIMMS);
+        exit(1);
+    }
+    id = qemu_opts_id(opts);
+    size = qemu_opt_get_size(opts, "size", DEFAULT_DIMMSIZE);
+    populated = qemu_opt_get_bool(opts, "populated", 0);
+    node = qemu_opt_get_number(opts, "node", 0);
+
+    dimm_config_create((char*)id, size, node, nb_hp_dimms, 0);
+
+    if (populated) {
+        devopts = qemu_opts_create(qemu_find_opts("device"), id, 0, NULL);
+        qemu_opt_set(devopts, "driver", "dimm");
+        snprintf(buf, sizeof(buf), "%lu", size);
+        qemu_opt_set(devopts, "size", buf);
+        snprintf(buf, sizeof(buf), "%lu", node);
+        qemu_opt_set(devopts, "node", buf);
+        qemu_opt_set(devopts, "bus", "membus");
+    }
+    nb_hp_dimms++;
+}
+
 static void configure_rtc(QemuOpts *opts)
 {
     const char *value;
@@ -2354,6 +2387,8 @@ int main(int argc, char **argv, char **envp)
     DisplayChangeListener *dcl;
     int cyls, heads, secs, translation;
     QemuOpts *hda_opts = NULL, *opts, *machine_opts;
+    QemuOpts *dimm_opts[MAX_DIMMS];
+    int nb_dimm_opts = 0;
     QemuOptsList *olist;
     int optind;
     const char *optarg;
@@ -3288,6 +3323,18 @@ int main
[RFC PATCH v3 02/19][SeaBIOS] Add SSDT memory device support
Define SSDT hotplug-able memory devices in the _SB namespace. The dynamically
generated SSDT includes per memory device hotplug methods. These methods just
call methods defined in the DSDT. Also dynamically generate a MTFY method and
a MEON array of the online/available memory devices. ACPI extraction macros
are used to place the AML code in variables later used by src/acpi. The
design is taken from SSDT cpu generation.

Signed-off-by: Vasilis Liaskovitis <vasilis.liaskovi...@profitbricks.com>
---
 Makefile         |    2 +-
 src/ssdt-mem.dsl |   65
 2 files changed, 66 insertions(+), 1 deletions(-)
 create mode 100644 src/ssdt-mem.dsl

diff --git a/Makefile b/Makefile
index 5486f88..e82cfc9 100644
--- a/Makefile
+++ b/Makefile
@@ -233,7 +233,7 @@ $(OUT)%.hex: src/%.dsl ./tools/acpi_extract_preprocess.py ./tools/acpi_extract.p
 	$(Q)$(PYTHON) ./tools/acpi_extract.py $(OUT)$*.lst > $(OUT)$*.off
 	$(Q)cat $(OUT)$*.off > $@
 
-$(OUT)ccode32flat.o: $(OUT)acpi-dsdt.hex $(OUT)ssdt-proc.hex $(OUT)ssdt-pcihp.hex $(OUT)ssdt-susp.hex
+$(OUT)ccode32flat.o: $(OUT)acpi-dsdt.hex $(OUT)ssdt-proc.hex $(OUT)ssdt-pcihp.hex $(OUT)ssdt-susp.hex $(OUT)ssdt-mem.hex
 
 ################ Kconfig rules
 
diff --git a/src/ssdt-mem.dsl b/src/ssdt-mem.dsl
new file mode 100644
index 0000000..ee322f0
--- /dev/null
+++ b/src/ssdt-mem.dsl
@@ -0,0 +1,65 @@
+/* This file is the basis for the ssdt_mem[] variable in src/acpi.c.
+ * It is similar in design to the ssdt_proc variable.
+ * It defines the contents of the per-cpu Processor() object. At
+ * runtime, a dynamically generated SSDT will contain one copy of this
+ * AML snippet for every possible memory device in the system. The
+ * objects will * be placed in the \_SB_ namespace.
+ *
+ * In addition to the aml code generated from this file, the
+ * src/acpi.c file creates a MEMNTFY method with an entry for each memdevice:
+ *     Method(MTFY, 2) {
+ *         If (LEqual(Arg0, 0x00)) { Notify(MP00, Arg1) }
+ *         If (LEqual(Arg0, 0x01)) { Notify(MP01, Arg1) }
+ *         ...
+ *     }
+ * and a MEON array with the list of active and inactive memory devices:
+ *     Name(MEON, Package() { One, One, ..., Zero, Zero, ... })
+ */
+ACPI_EXTRACT_ALL_CODE ssdm_mem_aml
+
+DefinitionBlock ("ssdt-mem.aml", "SSDT", 0x02, "BXPC", "CSSDT", 0x1)
+/* v-- DO NOT EDIT --v */
+{
+    ACPI_EXTRACT_DEVICE_START ssdt_mem_start
+    ACPI_EXTRACT_DEVICE_END ssdt_mem_end
+    ACPI_EXTRACT_DEVICE_STRING ssdt_mem_name
+    Device(MPAA) {
+        ACPI_EXTRACT_NAME_BYTE_CONST ssdt_mem_id
+        Name(ID, 0xAA)
+/* ^-- DO NOT EDIT --^
+ *
+ * The src/acpi.c code requires the above layout so that it can update
+ * MPAA and 0xAA with the appropriate MEMDEVICE id (see
+ * SD_OFFSET_MEMHEX/MEMID1/MEMID2). Don't change the above without
+ * also updating the C code.
+ */
+        Name(_HID, EISAID("PNP0C80"))
+        Name(_PXM, 0xAA)
+
+        External(CMST, MethodObj)
+        External(MPEJ, MethodObj)
+
+        Name(_CRS, ResourceTemplate() {
+            QwordMemory(
+               ResourceConsumer,
+               ,
+               MinFixed,
+               MaxFixed,
+               Cacheable,
+               ReadWrite,
+               0x0,
+               0xDEADBEEF,
+               0xE6ADBEEE,
+               0x00000000,
+               0x08000000,
+               )
+        })
+        Method (_STA, 0) {
+            Return(CMST(ID))
+        }
+        Method (_EJ0, 1, NotSerialized) {
+            MPEJ(ID, Arg0)
+        }
+    }
+}
+
--
1.7.9
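The comment in ssdt-mem.dsl says src/acpi.c rewrites the MPAA name and the 0xAA id for each memory device. As a hedged illustration of the naming half of that step, the sketch below turns a slot id into the two uppercase hex digits of an MPxx name. The helper names are hypothetical and this is not the SeaBIOS code, which works on offsets into the compiled AML rather than on a C string.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Convert the low nibble of v to an uppercase hex digit. */
static char sketch_hex_digit(unsigned v)
{
    v &= 0xf;
    return (char)(v < 10 ? '0' + v : 'A' + (v - 10));
}

/* Overwrite the "AA" placeholder of a 4-character "MPAA" name with the
 * hex form of the memory device id, e.g. id 0x1f -> "MP1F". */
static void sketch_patch_mem_name(char *name, uint8_t id)
{
    assert(memcmp(name, "MP", 2) == 0);
    name[2] = sketch_hex_digit(id >> 4);
    name[3] = sketch_hex_digit(id);
}
```

This matches the MP00, MP01, ... names used in the MTFY method shown in the file header comment.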
[RFC PATCH v3 19/19] alternative: Introduce paravirt interface QEMU_CFG_PCI_WINDOW
Qemu already calculates the 32-bit and 64-bit PCI starting offsets based on initial memory and hotplug-able dimms. This info needs to be passed to Seabios for PCI initialization. Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- docs/specs/fwcfg.txt |9 + hw/fw_cfg.h |1 + hw/pc_piix.c | 10 ++ 3 files changed, 20 insertions(+), 0 deletions(-) diff --git a/docs/specs/fwcfg.txt b/docs/specs/fwcfg.txt index 55f96d9..d9fa215 100644 --- a/docs/specs/fwcfg.txt +++ b/docs/specs/fwcfg.txt @@ -26,3 +26,12 @@ Entry max_cpus+nb_numa_nodes+1 contains the number of memory dimms (nb_hp_dimms) The last 3 * nb_hp_dimms entries are organized in triplets: Each triplet contains the physical address offset, size (in bytes), and node proximity for the respective dimm. + +FW_CFG_PCI_WINDOW paravirt info + +QEMU passes the starting address for the 32-bit and 64-bit PCI windows to BIOS. +The following layouts are followed: + + +pcimem32_start | pcimem64_start | + diff --git a/hw/fw_cfg.h b/hw/fw_cfg.h index 856bf91..6c8c151 100644 --- a/hw/fw_cfg.h +++ b/hw/fw_cfg.h @@ -27,6 +27,7 @@ #define FW_CFG_SETUP_SIZE 0x17 #define FW_CFG_SETUP_DATA 0x18 #define FW_CFG_FILE_DIR 0x19 +#define FW_CFG_PCI_WINDOW 0x1a #define FW_CFG_FILE_FIRST 0x20 #define FW_CFG_FILE_SLOTS 0x10 diff --git a/hw/pc_piix.c b/hw/pc_piix.c index d1fd276..034761f 100644 --- a/hw/pc_piix.c +++ b/hw/pc_piix.c @@ -44,6 +44,7 @@ #include memory.h #include exec-memory.h #include dimm.h +#include fw_cfg.h #ifdef CONFIG_XEN # include xen/hvm/hvm_info_table.h #endif @@ -149,6 +150,7 @@ static void pc_init1(MemoryRegion *system_memory, MemoryRegion *pci_memory; MemoryRegion *rom_memory; void *fw_cfg = NULL; +uint64_t *pci_window_fw_cfg; pc_cpus_init(cpu_model); @@ -205,6 +207,14 @@ static void pc_init1(MemoryRegion *system_memory, ? 
0 : ((uint64_t)1 << 62)),
                              pci_memory, ram_memory);
+
+        pci_window_fw_cfg = g_malloc0(2 * 8);
+        pci_window_fw_cfg[0] = cpu_to_le64(below_4g_mem_size +
+                                           below_4g_hp_mem_size);
+        pci_window_fw_cfg[1] = cpu_to_le64(0x1ULL + above_4g_mem_size
+                                           + above_4g_hp_mem_size);
+        fw_cfg_add_bytes(fw_cfg, FW_CFG_PCI_WINDOW,
+                         (uint8_t *)pci_window_fw_cfg, 2 * 8);
     } else {
         pci_bus = NULL;
         i440fx_state = NULL;
--
1.7.9
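For readers who want to see the shape of the FW_CFG_PCI_WINDOW payload, here is a minimal standalone sketch, not the patch's code. It assumes the entry is exactly two little-endian 64-bit values (pcimem32_start, then pcimem64_start) and that the 64-bit window begins at 4 GiB plus all memory above 4 GiB. The names pci_window, pci_window_encode and pci_window_decode and the FOUR_GIB constant are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: RAM above the PCI hole starts at 4 GiB. */
#define FOUR_GIB 0x100000000ULL

/* Hypothetical container for the two values carried by the entry. */
struct pci_window {
    uint64_t pcimem32_start;
    uint64_t pcimem64_start;
};

/* Store v in little-endian byte order, whatever the host endianness. */
static void put_le64(uint8_t *out, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        out[i] = (uint8_t)(v >> (8 * i));
    }
}

/* Read a little-endian 64-bit value back. */
static uint64_t get_le64(const uint8_t *in)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v |= (uint64_t)in[i] << (8 * i);
    }
    return v;
}

/* Producer side: pack the two window starts into a 16-byte blob. */
static void pci_window_encode(uint8_t blob[16],
                              uint64_t below_4g_mem, uint64_t below_4g_hp_mem,
                              uint64_t above_4g_mem, uint64_t above_4g_hp_mem)
{
    /* Initial plus hotplug-able memory below 4G must fit below 4G. */
    assert(below_4g_mem + below_4g_hp_mem <= FOUR_GIB);
    put_le64(blob, below_4g_mem + below_4g_hp_mem);
    put_le64(blob + 8, FOUR_GIB + above_4g_mem + above_4g_hp_mem);
}

/* Consumer side: what the firmware would read back from the entry. */
static struct pci_window pci_window_decode(const uint8_t blob[16])
{
    struct pci_window w;
    w.pcimem32_start = get_le64(blob);
    w.pcimem64_start = get_le64(blob + 8);
    return w;
}
```

Encoding the bytes explicitly keeps the layout independent of host endianness, which is the job cpu_to_le64() does in the patch.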
[RFC PATCH v3 20/19][SeaBIOS] alternative: Use paravirt interface for pci windows
Initialize the 32-bit and 64-bit pci starting offsets from values passed in by the qemu paravirt interface QEMU_CFG_PCI_WINDOW. Qemu calculates the starting offsets based on initial memory and hotplug-able dimms. Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- src/paravirt.c |6 ++ src/paravirt.h |2 ++ src/pciinit.c |5 ++--- 3 files changed, 10 insertions(+), 3 deletions(-) diff --git a/src/paravirt.c b/src/paravirt.c index 2a98d53..390ef30 100644 --- a/src/paravirt.c +++ b/src/paravirt.c @@ -346,3 +346,9 @@ void qemu_cfg_romfile_setup(void) dprintf(3, Found fw_cfg file: %s (size=%d)\n, file-name, file-size); } } + +void qemu_cfg_get_pci_offsets(u64 *pcimem_start, u64 *pcimem64_start) +{ +qemu_cfg_read_entry(pcimem_start, QEMU_CFG_PCI_WINDOW, sizeof(u64)); +qemu_cfg_read((u8*)(pcimem64_start), sizeof(u64)); +} diff --git a/src/paravirt.h b/src/paravirt.h index a284c41..b53ff88 100644 --- a/src/paravirt.h +++ b/src/paravirt.h @@ -35,6 +35,7 @@ static inline int kvm_para_available(void) #define QEMU_CFG_BOOT_MENU 0x0e #define QEMU_CFG_MAX_CPUS 0x0f #define QEMU_CFG_FILE_DIR 0x19 +#define QEMU_CFG_PCI_WINDOW 0x1a #define QEMU_CFG_ARCH_LOCAL 0x8000 #define QEMU_CFG_ACPI_TABLES(QEMU_CFG_ARCH_LOCAL + 0) #define QEMU_CFG_SMBIOS_ENTRIES (QEMU_CFG_ARCH_LOCAL + 1) @@ -65,5 +66,6 @@ struct e820_reservation { u32 qemu_cfg_e820_entries(void); void* qemu_cfg_e820_load_next(void *addr); void qemu_cfg_romfile_setup(void); +void qemu_cfg_get_pci_offsets(u64 *pcimem_start, u64 *pcimem64_start); #endif diff --git a/src/pciinit.c b/src/pciinit.c index 68f302a..64468a0 100644 --- a/src/pciinit.c +++ b/src/pciinit.c @@ -592,8 +592,7 @@ static void pci_region_map_entries(struct pci_bus *busses, struct pci_region *r) static void pci_bios_map_devices(struct pci_bus *busses) { -pcimem_start = RamSize; - +qemu_cfg_get_pci_offsets(pcimem_start, pcimem64_start); if (pci_bios_init_root_regions(busses)) { struct pci_region r64_mem, r64_pref; r64_mem.list = NULL; @@ 
-611,7 +610,7 @@ static void pci_bios_map_devices(struct pci_bus *busses) u64 align_mem = pci_region_align(r64_mem); u64 align_pref = pci_region_align(r64_pref); -r64_mem.base = ALIGN(0x1LL + RamSizeOver4G, align_mem); +r64_mem.base = ALIGN(pcimem64_start, align_mem); r64_pref.base = ALIGN(r64_mem.base + sum_mem, align_pref); pcimem64_start = r64_mem.base; pcimem64_end = r64_pref.base + sum_pref; -- 1.7.9 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 03/21][SeaBIOS] acpi-dsdt: Implement functions for memory hotplug
On Tue, Jul 17, 2012 at 03:23:00PM +0800, Wen Congyang wrote:
> > +Method(MESC, 0) {
> > +    // Local5 = active memdevice bitmap
> > +    Store (MES, Local5)
> > +    // Local2 = last read byte from bitmap
> > +    Store (Zero, Local2)
> > +    // Local0 = memory device iterator
> > +    Store (Zero, Local0)
> > +    While (LLess(Local0, SizeOf(MEON))) {
> > +        // Local1 = MEON flag for this memory device
> > +        Store(DerefOf(Index(MEON, Local0)), Local1)
> > +        If (And(Local0, 0x07)) {
> > +            // Shift down previously read bitmap byte
> > +            ShiftRight(Local2, 1, Local2)
> > +        } Else {
> > +            // Read next byte from memdevice bitmap
> > +            Store(DerefOf(Index(Local5, ShiftRight(Local0, 3))), Local2)
> > +        }
> > +        // Local3 = active state for this memory device
> > +        Store(And(Local2, 1), Local3)
> > +
> > +        If (LNotEqual(Local1, Local3)) {
>
> There are two ways to hot remove a memory device:
> 1. dimm_del
> 2. echo 1 > /sys/bus/acpi/devices/PNP0C80:XX/eject
>
> In the 2nd case, we cannot hotplug this memory device again,
> because both Local1 and Local3 are 1.
>
> So, I think the MEON flag for this memory device should be set to 0 in
> method _EJ0, or method _PS3 should be implemented for the memory device.

Good catch. Both the internal seabios state (MEON) and the qemu machine
bitmap (mems_sts in hw/acpi_piix4.c) have to be updated when the ejection
comes from an OSPM action. I will implement a _PS3 method that updates the
MEON flag and also signals qemu to change the mems_sts bitmap.

thanks,

- Vasilis
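The problem discussed above is easier to see outside ASL. What follows is a rough C model of the MESC scan, written only for illustration. It assumes the hardware bitmap holds one bit per slot and that MEON holds one flag per slot. The helper memdev_eject is hypothetical and stands in for the proposed fix of clearing both pieces of state when the guest itself ejects a device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Notify codes used by the scan: 1 = device check (insert), 3 = eject. */
enum { NOTIFY_NONE = 0, NOTIFY_INSERT = 1, NOTIFY_EJECT = 3 };

/*
 * Compare the hardware bitmap (one bit per slot, like MES) with the
 * firmware's per-slot flags (like MEON). For each slot whose state
 * differs, update the flag and record which notification to send.
 * Returns the number of slots that changed.
 */
static size_t memdev_scan(const uint8_t *bitmap, uint8_t *meon,
                          size_t nslots, uint8_t *notify)
{
    size_t changed = 0;

    assert(bitmap != NULL && meon != NULL && notify != NULL);
    for (size_t i = 0; i < nslots; i++) {
        uint8_t active = (uint8_t)((bitmap[i >> 3] >> (i & 7)) & 1);

        notify[i] = NOTIFY_NONE;
        if (meon[i] != active) {
            meon[i] = active;
            notify[i] = active ? NOTIFY_INSERT : NOTIFY_EJECT;
            changed++;
        }
    }
    return changed;
}

/*
 * Hypothetical fix for a guest-initiated eject: clear both the hardware
 * bit and the firmware flag so that a later hot-add is seen as a change.
 */
static void memdev_eject(uint8_t *bitmap, uint8_t *meon, size_t slot)
{
    bitmap[slot >> 3] &= (uint8_t)~(1u << (slot & 7));
    meon[slot] = 0;
}
```

If neither piece of state is cleared after a guest-initiated eject, a second hot-add of the same slot leaves the bit and the flag both at 1, so the scan reports no change and the guest is never notified.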
Re: [Qemu-devel] [RFC PATCH v2 06/21] dimm: Implement memory device abstraction
Hi, On Thu, Jul 12, 2012 at 07:55:42PM +, Blue Swirl wrote: On Wed, Jul 11, 2012 at 10:31 AM, Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com wrote: Each hotplug-able memory slot is a SysBusDevice. A hot-add operation for a particular dimm creates a new MemoryRegion of the given physical address offset, size and node proximity, and attaches it to main system memory as a sub_region. A hot-remove operation detaches and frees the MemoryRegion from system memory. This prototype still lacks proper qdev integration: a separate hotplug side-channel is used and main system bus hotplug capability is ignored. Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- hw/Makefile.objs |2 +- hw/dimm.c| 234 ++ hw/dimm.h| 58 + 3 files changed, 293 insertions(+), 1 deletions(-) create mode 100644 hw/dimm.c create mode 100644 hw/dimm.h diff --git a/hw/Makefile.objs b/hw/Makefile.objs index 3d77259..e2184bf 100644 --- a/hw/Makefile.objs +++ b/hw/Makefile.objs @@ -26,7 +26,7 @@ hw-obj-$(CONFIG_I8254) += i8254_common.o i8254.o hw-obj-$(CONFIG_PCSPK) += pcspk.o hw-obj-$(CONFIG_PCKBD) += pckbd.o hw-obj-$(CONFIG_FDC) += fdc.o -hw-obj-$(CONFIG_ACPI) += acpi.o acpi_piix4.o +hw-obj-$(CONFIG_ACPI) += acpi.o acpi_piix4.o dimm.o hw-obj-$(CONFIG_APM) += pm_smbus.o apm.o hw-obj-$(CONFIG_DMA) += dma.o hw-obj-$(CONFIG_I82374) += i82374.o diff --git a/hw/dimm.c b/hw/dimm.c new file mode 100644 index 000..00c4623 --- /dev/null +++ b/hw/dimm.c @@ -0,0 +1,234 @@ +/* + * Dimm device for Memory Hotplug + * + * Copyright ProfitBricks GmbH 2012 + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. 
+ * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library; if not, see http://www.gnu.org/licenses/ + */ + +#include trace.h +#include qdev.h +#include dimm.h +#include time.h +#include ../exec-memory.h +#include qmp-commands.h + +static DeviceState *dimm_hotplug_qdev; +static dimm_hotplug_fn dimm_hotplug; +static QTAILQ_HEAD(Dimmlist, DimmState) dimmlist; Using global state does not look right. It should always be possible to pass around structures to avoid it. ok, I 'll try to remove the global state. + +static Property dimm_properties[] = { +DEFINE_PROP_END_OF_LIST() +}; + +void dimm_populate(DimmState *s) All functions are global and exported but there does not seem to be users. Please make all static which you can. 
will do +{ +DeviceState *dev= (DeviceState*)s; +MemoryRegion *new = NULL; + +new = g_malloc(sizeof(MemoryRegion)); +memory_region_init_ram(new, dev-id, s-size); +vmstate_register_ram_global(new); +memory_region_add_subregion(get_system_memory(), s-start, new); +s-mr = new; +s-populated = true; +} + + +void dimm_depopulate(DimmState *s) +{ +assert(s); +if (s-populated) { +vmstate_unregister_ram(s-mr, NULL); +memory_region_del_subregion(get_system_memory(), s-mr); +memory_region_destroy(s-mr); +s-populated = false; +s-mr = NULL; +} +} + +DimmState *dimm_create(char *id, uint64_t size, uint64_t node, uint32_t +dimm_idx, bool populated) +{ +DeviceState *dev; +DimmState *mdev; + +dev = sysbus_create_simple(dimm, -1, NULL); +dev-id = id; + +mdev = DIMM(dev); +mdev-idx = dimm_idx; +mdev-start = 0; +mdev-size = size; +mdev-node = node; +mdev-populated = populated; +QTAILQ_INSERT_TAIL(dimmlist, mdev, nextdimm); +return mdev; +} + +void dimm_register_hotplug(dimm_hotplug_fn hotplug, DeviceState *qdev) +{ +dimm_hotplug_qdev = qdev; +dimm_hotplug = hotplug; +dimm_scan_populated(); +} + +void dimm_activate(DimmState *slot) +{ +dimm_populate(slot); +if (dimm_hotplug) +dimm_hotplug(dimm_hotplug_qdev, (SysBusDevice*)slot, 1); Why the cast? dimm_hotplug accepts SysBusDevice, not DimmState, though that can be changed. Also braces, please check your patches with checkpatch.pl. ok, I 'll do checks with checkpatch.pl. +} + +void dimm_deactivate(DimmState *slot) +{ +if (dimm_hotplug) +dimm_hotplug(dimm_hotplug_qdev, (SysBusDevice*)slot
Re: [Qemu-devel] [RFC PATCH v2 00/21] ACPI memory hotplug
On Thu, Jul 12, 2012 at 08:04:56PM +, Blue Swirl wrote: On Wed, Jul 11, 2012 at 10:31 AM, Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com wrote: This is v2 of the ACPI memory hotplug prototype for x86_64 target. I think the concept of DIMMs (what about SIMMs? SODIMMs? I liked memslot) would be useful for most targets, but hotplugging may be limited to x86 only. It would be nice to keep these two separate or as loosely coupled as possible. agreed. what specific usecases besides hotplugging are you thinking about? Also are there non-acpi hotplug platforms? I am trying to keep generic dimm manipulation functions (e.g. population / depopulation and searching) in hw/dimm[.ch]. Currently the x86-acpi_piix4 backend registers a callback for hot-add / hot-remove. In theory other hotplug backends can hook in. btw I don't mind using -memslot (I think someone during v1 mentioned -dimm), we just need some consensus on the naming. Changes v1-v2 - memory map is automatically calculated for hotplug dimms. Dimms are added from top-of-memory skipping the pci hole at [PCI_HOLE_START, 4G). - Renamed from -memslot to -dimm. Commands changed to dimm_add, dimm_del. - Seabios ejection array reduced to a byte. Use extraction macros for dimm ssdt. - additional SRAT paravirt info does not break previous SRAT fw_cfg layout. - Documentation of new acpi_piix4 registers and paravirt data. - add ACPI _OST support for _OST enabled guests. This allows qemu to receive notification for success / failure of memory hot-add and hot-remove operations. 
Guest needs to support _OST (https://lkml.org/lkml/2012/6/25/321) - add monitor info command to report total guest memory (initial + hot-added) - add command line options and monitor commands for batch dimm creation/population Overview: Dimm devices are modeled with a new qemu command line -dimm id=name,size=sz,node=pxm,populated=on|off As already mentioned, the starting physical address for all dimms is calculated automatically from top of memory, skipping the pci hole at [PCI_HOLE_START, 4G). Node is defining numa proximity for this dimm. When not defined it defaults to zero. -dimm id=dimm0,size=512M,node=0,populated=off will define a 512M memory slot belonging to numa node 0. Dimms are added or removed with a new hmp command dimm_add/dimm_del: Hot-add syntax: dimm_add id Hot-remove syntax: dimm_del id Issues: - Live migration works as long as populated field is changed to on for hotplugged dimms at the destination qemu command line (patch 12/21 lifts this requirement). The DimmState structure does not yet define a VMStateDescription, but i assume this is the preferred way to pass state for migration. - Dimms are abstracted as qdevices attached to the main system bus. However, memory hotplugging has its own side channel ignoring main_system_bus's hotplug incapability. A cleaner integration is still needed, probably attaching memory devices as children-links of an acpi-capable device (in the pc case acpi_piix4) instead of the system bus (TBD). Then device_add/device_del instead of new commands can hopefully be used. Comments/review welcome. series is based on uq/master for qemu-kvm, and master for seabios. 
Can be found also at: http://github.com/vliaskov/qemu-kvm/commits/memhp-v2 http://github.com/vliaskov/seabios/commits/memhp-v2 Vasilis Liaskovitis (14): dimm: Implement memory device abstraction acpi_piix4: Implement memory device hotplug registers pc: calculate dimm physical addresses and adjust memory map pc: Add dimm paravirt SRAT info Implement -dimm command line option Implement dimm_add and dimm_del commands for hmp and qmp fix live-migration when populated=on is missing Implement memory hotplug notification lists acpi_piix4: _OST dimm support acpi_piix4: Update dimm state on VM reboot acpi_piix4: Update dimm bitmap state on hot-remove fail Implement info memtotal and query-memtotal Implement -dimms, -dimmspop command line options Implement mem_increase, mem_decrease hmp/qmp commands arch_init.c | 23 ++- docs/specs/acpi_hotplug.txt | 46 + docs/specs/fwcfg.txt| 28 +++ hmp-commands.hx | 67 +++ hmp.c | 24 +++ hmp.h |2 + hw/Makefile.objs|2 +- hw/acpi_piix4.c | 131 - hw/dimm.c | 449 +++ hw/dimm.h | 72 +++ hw/pc.c | 94 +- hw/pc.h |6 + hw/pc_piix.c| 18 ++- monitor.c | 35 monitor.h |5 + qapi-schema.json| 38 qemu-config.c | 70
Re: [RFC PATCH v2 05/21][SeaBIOS] pciinit: Fix pcimem_start value
On Thu, Jul 12, 2012 at 09:22:14AM +0200, Gerd Hoffmann wrote:
> On 07/11/12 18:45, Vasilis Liaskovitis wrote:
> > Hi,
> >
> > On Wed, Jul 11, 2012 at 01:56:19PM +0200, Gerd Hoffmann wrote:
> > > On 07/11/12 12:31, Vasilis Liaskovitis wrote:
> > > > In order to hotplug memory between RamSize and BUILD_PCIMEM_START, the
> > > > pci window needs to start at BUILD_PCIMEM_START (0xe000). Otherwise,
> > > > the guest cannot online new dimms at those ranges due to pci_root
> > > > window conflicts. (workaround for linux guest is booting with pci=nocrs)
> > > >
> > > >  static void pci_bios_map_devices(struct pci_bus *busses)
> > > >  {
> > > > -    pcimem_start = RamSize;
> > > > +    pcimem_start = BUILD_PCIMEM_START;
> > >
> > > It isn't that simple. For the 32bit pci window it will work, but it
> > > will leave address space unused instead of assigning it to the 32bit
> > > pci window. For the 64bit pci window it will not work.
> > >
> > > You have to walk the dimms and figure what the highest used address is,
> > > for both below-4g and above-4g. Then fill two variables with it and make
> > > the pci init code use that instead of RamSize and RamSizeOver4G.
> >
> > I see. I already have these values computed in qemu-kvm, so I can pass
> > them in a paravirt struct, or infer them from the dimm/srat paravirt info
> > that I already pass to seabios.
>
> I'd suggest to infer from the dimm info, to limit the amount of
> information which needs to be passed from qemu to seabios.

ok. Currently dimm info is processed in bios_init_tables(), which is called
after pci_setup(). I'll see if I can do the processing earlier.

> > If I understand correctly, we would like the pcimem windows to use the
> > maximum possible address space (constrained by the exact dimms/ranges
> > which are defined) instead of leaving unused space.
>
> Yes, for the 32bit pci window. The 64bit pci window is mapped above all
> memory, and it must likewise consider defined+unfilled dimms so the start
> address doesn't collide with memory hot-plugged above 4G later on.

yes, understood.
thanks, - Vasilis
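The suggestion in this thread (walk the dimms, take the highest used address below 4G and above 4G, and start the two PCI windows there) can be sketched roughly as follows. This is an illustration under stated assumptions, not SeaBIOS code: struct dimm_range, struct pci_starts and pci_starts_from_dimms are hypothetical names, every defined dimm is counted whether or not it is populated, and no dimm is assumed to straddle the 4 GiB boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FOUR_GIB 0x100000000ULL

/* Hypothetical description of one defined dimm (populated or not). */
struct dimm_range {
    uint64_t start;
    uint64_t size;
};

/* Start addresses for the 32-bit and 64-bit PCI windows. */
struct pci_starts {
    uint64_t pcimem32_start;
    uint64_t pcimem64_start;
};

/*
 * Start from the end of initial RAM (the roles played by RamSize and
 * RamSizeOver4G) and push each window start up to the end of the highest
 * dimm on its side of the 4 GiB boundary.
 */
static struct pci_starts pci_starts_from_dimms(uint64_t ram_below_4g,
                                               uint64_t ram_above_4g,
                                               const struct dimm_range *dimms,
                                               size_t n)
{
    struct pci_starts s = { ram_below_4g, FOUR_GIB + ram_above_4g };

    assert(ram_below_4g <= FOUR_GIB);
    for (size_t i = 0; i < n; i++) {
        uint64_t end = dimms[i].start + dimms[i].size;

        if (dimms[i].start < FOUR_GIB) {
            if (end > s.pcimem32_start) {
                s.pcimem32_start = end;
            }
        } else if (end > s.pcimem64_start) {
            s.pcimem64_start = end;
        }
    }
    return s;
}
```

Counting unpopulated dimms as well is what keeps a later hot-add from landing inside a window that the firmware has already handed to PCI devices.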
[RFC PATCH v2 00/21] ACPI memory hotplug
This is v2 of the ACPI memory hotplug prototype for x86_64 target.

Changes v1->v2

- memory map is automatically calculated for hotplug dimms. Dimms are added
  from top-of-memory skipping the pci hole at [PCI_HOLE_START, 4G).
- Renamed from -memslot to -dimm. Commands changed to dimm_add, dimm_del.
- Seabios ejection array reduced to a byte. Use extraction macros for dimm ssdt.
- additional SRAT paravirt info does not break previous SRAT fw_cfg layout.
- Documentation of new acpi_piix4 registers and paravirt data.
- add ACPI _OST support for _OST enabled guests. This allows qemu to receive
  notification for success / failure of memory hot-add and hot-remove
  operations. Guest needs to support _OST (https://lkml.org/lkml/2012/6/25/321)
- add monitor info command to report total guest memory (initial + hot-added)
- add command line options and monitor commands for batch dimm
  creation/population

Overview:

Dimm devices are modeled with a new qemu command line

-dimm id=name,size=sz,node=pxm,populated=on|off

As already mentioned, the starting physical address for all dimms is
calculated automatically from top of memory, skipping the pci hole at
[PCI_HOLE_START, 4G). Node is defining numa proximity for this dimm. When not
defined it defaults to zero.

-dimm id=dimm0,size=512M,node=0,populated=off

will define a 512M memory slot belonging to numa node 0.

Dimms are added or removed with a new hmp command dimm_add/dimm_del:

Hot-add syntax: dimm_add id
Hot-remove syntax: dimm_del id

Issues:

- Live migration works as long as populated field is changed to on for
  hotplugged dimms at the destination qemu command line (patch 12/21 lifts
  this requirement). The DimmState structure does not yet define a
  VMStateDescription, but i assume this is the preferred way to pass state for
  migration.
- Dimms are abstracted as qdevices attached to the main system bus. However,
  memory hotplugging has its own side channel ignoring main_system_bus's
  hotplug incapability.
A cleaner integration is still needed, probably attaching memory devices as children-links of an acpi-capable device (in the pc case acpi_piix4) instead of the system bus (TBD). Then device_add/device_del instead of new commands can hopefully be used. Comments/review welcome. series is based on uq/master for qemu-kvm, and master for seabios. Can be found also at: http://github.com/vliaskov/qemu-kvm/commits/memhp-v2 http://github.com/vliaskov/seabios/commits/memhp-v2 Vasilis Liaskovitis (14): dimm: Implement memory device abstraction acpi_piix4: Implement memory device hotplug registers pc: calculate dimm physical addresses and adjust memory map pc: Add dimm paravirt SRAT info Implement -dimm command line option Implement dimm_add and dimm_del commands for hmp and qmp fix live-migration when populated=on is missing Implement memory hotplug notification lists acpi_piix4: _OST dimm support acpi_piix4: Update dimm state on VM reboot acpi_piix4: Update dimm bitmap state on hot-remove fail Implement info memtotal and query-memtotal Implement -dimms, -dimmspop command line options Implement mem_increase, mem_decrease hmp/qmp commands arch_init.c | 23 ++- docs/specs/acpi_hotplug.txt | 46 + docs/specs/fwcfg.txt| 28 +++ hmp-commands.hx | 67 +++ hmp.c | 24 +++ hmp.h |2 + hw/Makefile.objs|2 +- hw/acpi_piix4.c | 131 - hw/dimm.c | 449 +++ hw/dimm.h | 72 +++ hw/pc.c | 94 +- hw/pc.h |6 + hw/pc_piix.c| 18 ++- monitor.c | 35 monitor.h |5 + qapi-schema.json| 38 qemu-config.c | 70 +++ qemu-options.hx | 15 ++ qmp-commands.hx | 137 + sysemu.h|1 + vl.c| 122 - 21 files changed, 1368 insertions(+), 17 deletions(-) create mode 100644 docs/specs/acpi_hotplug.txt create mode 100644 docs/specs/fwcfg.txt create mode 100644 hw/dimm.c create mode 100644 hw/dimm.h Vasilis Liaskovitis (7): Add ACPI_EXTRACT_DEVICE* macros Add SSDT memory device support acpi-dsdt: Implement functions for memory hotplug. acpi: generate hotplug memory devices. 
pciinit: Fix pcimem_start value acpi_dsdt: Support _OST dimm method acpi_dsdt: Revert internal dimm state on _OST failure Makefile |2 +- src/acpi-dsdt.dsl | 120 - src/acpi.c| 158 +++-- src/pciinit.c |2 +- src/ssdt-mem.dsl | 69 + tools/acpi_extract.py | 28 + 6 files changed, 369 insertions(+), 10 deletions(-) create mode 100644 src/ssdt-mem.dsl
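The cover letter says that dimm start addresses are calculated automatically from top of memory, skipping the pci hole at [PCI_HOLE_START, 4G). One way such a placement could work is sketched here. The policy is an assumption made for illustration (a dimm goes below the hole if it still fits there, otherwise above 4G) and may differ from what the series implements. struct dimm_slot and dimm_assign_addresses are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FOUR_GIB 0x100000000ULL

/* Hypothetical dimm descriptor: size is given, start is computed. */
struct dimm_slot {
    uint64_t size;
    uint64_t start;
};

/*
 * Assign start addresses in command-line order.
 *   below_top  - end of initial RAM below the PCI hole
 *   hole_start - first address of the PCI hole (PCI_HOLE_START)
 *   above_top  - end of initial RAM above 4 GiB (at least 4 GiB)
 * A dimm that still fits between below_top and hole_start is placed
 * there; anything else goes above 4 GiB.
 */
static void dimm_assign_addresses(struct dimm_slot *dimms, size_t n,
                                  uint64_t below_top, uint64_t hole_start,
                                  uint64_t above_top)
{
    assert(below_top <= hole_start && hole_start <= FOUR_GIB);
    assert(above_top >= FOUR_GIB);

    for (size_t i = 0; i < n; i++) {
        if (dimms[i].size <= hole_start - below_top) {
            dimms[i].start = below_top;
            below_top += dimms[i].size;
        } else {
            dimms[i].start = above_top;
            above_top += dimms[i].size;
        }
    }
}
```

With 2G of initial RAM, a hole starting at 0xE0000000 and three dimms of 1G, 1G and 512M, this policy places the first and third dimms below the hole and the second one at 4G.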
[RFC PATCH v2 03/21][SeaBIOS] acpi-dsdt: Implement functions for memory hotplug
Extend the DSDT to include methods for handling memory hot-add and hot-remove
notifications and memory device status requests. These functions are called
from the memory device SSDT methods.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 src/acpi-dsdt.dsl |   70 +++-
 1 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/src/acpi-dsdt.dsl b/src/acpi-dsdt.dsl
index 2060686..5d3e92b 100644
--- a/src/acpi-dsdt.dsl
+++ b/src/acpi-dsdt.dsl
@@ -737,6 +737,71 @@ DefinitionBlock (
             }
             Return(One)
         }
+        /* Objects filled in by run-time generated SSDT */
+        External(MTFY, MethodObj)
+        External(MEON, PkgObj)
+
+        Method (CMST, 1, NotSerialized) {
+            // _STA method - return ON status of memdevice
+            // Local0 = MEON flag for this cpu
+            Store(DerefOf(Index(MEON, Arg0)), Local0)
+            If (Local0) { Return(0xF) } Else { Return(0x0) }
+        }
+
+        /* Memory hotplug notify array */
+        OperationRegion(MEST, SystemIO, 0xaf80, 32)
+        Field (MEST, ByteAcc, NoLock, Preserve)
+        {
+            MES, 256
+        }
+
+        /* Memory eject byte */
+        OperationRegion(MEMJ, SystemIO, 0xafa0, 1)
+        Field (MEMJ, ByteAcc, NoLock, Preserve)
+        {
+            MPE, 8
+        }
+
+        Method(MESC, 0) {
+            // Local5 = active memdevice bitmap
+            Store (MES, Local5)
+            // Local2 = last read byte from bitmap
+            Store (Zero, Local2)
+            // Local0 = memory device iterator
+            Store (Zero, Local0)
+            While (LLess(Local0, SizeOf(MEON))) {
+                // Local1 = MEON flag for this memory device
+                Store(DerefOf(Index(MEON, Local0)), Local1)
+                If (And(Local0, 0x07)) {
+                    // Shift down previously read bitmap byte
+                    ShiftRight(Local2, 1, Local2)
+                } Else {
+                    // Read next byte from memdevice bitmap
+                    Store(DerefOf(Index(Local5, ShiftRight(Local0, 3))), Local2)
+                }
+                // Local3 = active state for this memory device
+                Store(And(Local2, 1), Local3)
+
+                If (LNotEqual(Local1, Local3)) {
+                    // State change - update MEON with new state
+                    Store(Local3, Index(MEON, Local0))
+                    // Do MEM notify
+                    If (LEqual(Local3, 1)) {
+                        MTFY(Local0, 1)
+                    } Else {
+                        MTFY(Local0, 3)
+                    }
+                }
+                Increment(Local0)
+            }
+            Return(One)
+        }
+
+        Method (MPEJ, 2, NotSerialized) {
+            // _EJ0 method - eject callback
+            Store(Arg0, MPE)
+            Sleep(200)
+        }
     }
@@ -759,8 +824,9 @@ DefinitionBlock (
             // CPU hotplug event
             Return(\_SB.PRSC())
         }
-        Method(_L03) {
-            Return(0x01)
+        Method(_E03) {
+            // Memory hotplug event
+            Return(\_SB.MESC())
         }
         Method(_L04) {
             Return(0x01)
--
1.7.9
[RFC PATCH v2 07/21] acpi_piix4: Implement memory device hotplug registers
A 32-byte register is used to present up to 256 hotplug-able memory devices
to BIOS and OSPM. Hot-add and hot-remove functions trigger an ACPI hotplug
event through these. Only reads are allowed from these registers.

An ACPI hot-remove event, however, needs to wait for OSPM to eject the
device. We use a single-byte register to know when OSPM has called the _EJ
function for a particular dimm. A write to this byte will depopulate the
respective dimm. Only writes are allowed to this byte.

v1->v2: mems_sts address moved from 0xaf20 to 0xaf80 (to accommodate more
space for cpu-hotplugging in the future). _EJ array is reduced to a single
byte. Add documentation in docs/specs/acpi_hotplug.txt

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 docs/specs/acpi_hotplug.txt |   22 +
 hw/acpi_piix4.c             |   73 --
 2 files changed, 91 insertions(+), 4 deletions(-)
 create mode 100644 docs/specs/acpi_hotplug.txt

diff --git a/docs/specs/acpi_hotplug.txt b/docs/specs/acpi_hotplug.txt
new file mode 100644
index 000..cf86242
--- /dev/null
+++ b/docs/specs/acpi_hotplug.txt
@@ -0,0 +1,22 @@
+QEMU-ACPI BIOS hotplug interface
+--
+This document describes the interface between QEMU and the ACPI BIOS for non-PCI
+space. For the PCI interface please look at docs/specs/acpi_pci_hotplug.txt
+
+QEMU-ACPI BIOS memory hotplug interface
+--
+
+Memory Dimm status array (IO port 0xaf80-0xaf9f, 1-byte access):
+---
+Dimm hot-plug notification pending. One bit per slot.
+
+Read by ACPI BIOS GPE.3 handler to notify OS of memory hot-add or hot-remove
+events. Read-only.
+
+Memory Dimm ejection success notification (IO port 0xafa0, 1-byte access):
+---
+Dimm hot-remove _EJ0 notification. Byte value indicates Dimm slot that was
+ejected.
+
+Written by ACPI memory device _EJ0 method to notify qemu of successfull
+hot-removal. Write-only.
diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c index 0aace60..b988597 100644 --- a/hw/acpi_piix4.c +++ b/hw/acpi_piix4.c @@ -28,6 +28,8 @@ #include range.h #include ioport.h #include fw_cfg.h +#include sysbus.h +#include dimm.h //#define DEBUG @@ -45,9 +47,15 @@ #define PCI_DOWN_BASE 0xae04 #define PCI_EJ_BASE 0xae08 #define PCI_RMV_BASE 0xae0c +#define MEM_BASE 0xaf80 +#define MEM_EJ_BASE 0xafa0 +#define PIIX4_MEM_HOTPLUG_STATUS 8 #define PIIX4_PCI_HOTPLUG_STATUS 2 +struct gpe_regs { +uint8_t mems_sts[DIMM_BITMAP_BYTES]; +}; struct pci_status { uint32_t up; /* deprecated, maintained for migration compatibility */ uint32_t down; @@ -69,6 +77,7 @@ typedef struct PIIX4PMState { Notifier machine_ready; /* for pci hotplug */ +struct gpe_regs gperegs; struct pci_status pci0_status; uint32_t pci0_hotplug_enable; uint32_t pci0_slot_device_present; @@ -93,8 +102,8 @@ static void pm_update_sci(PIIX4PMState *s) ACPI_BITMASK_POWER_BUTTON_ENABLE | ACPI_BITMASK_GLOBAL_LOCK_ENABLE | ACPI_BITMASK_TIMER_ENABLE)) != 0) || -(((s-ar.gpe.sts[0] s-ar.gpe.en[0]) - PIIX4_PCI_HOTPLUG_STATUS) != 0); +(((s-ar.gpe.sts[0] s-ar.gpe.en[0]) + (PIIX4_PCI_HOTPLUG_STATUS | PIIX4_MEM_HOTPLUG_STATUS)) != 0); qemu_set_irq(s-irq, sci_level); /* schedule a timer interruption if needed */ @@ -499,7 +508,16 @@ type_init(piix4_pm_register_types) static uint32_t gpe_readb(void *opaque, uint32_t addr) { PIIX4PMState *s = opaque; -uint32_t val = acpi_gpe_ioport_readb(s-ar, addr); +uint32_t val = 0; +struct gpe_regs *g = s-gperegs; + +switch (addr) { +case MEM_BASE ... 
MEM_BASE+DIMM_BITMAP_BYTES: +val = g-mems_sts[addr - MEM_BASE]; +break; +default: +val = acpi_gpe_ioport_readb(s-ar, addr); +} PIIX4_DPRINTF(gpe read %x == %x\n, addr, val); return val; @@ -509,7 +527,13 @@ static void gpe_writeb(void *opaque, uint32_t addr, uint32_t val) { PIIX4PMState *s = opaque; -acpi_gpe_ioport_writeb(s-ar, addr, val); +switch (addr) { +case MEM_EJ_BASE: +dimm_notify(val, DIMM_REMOVE_SUCCESS); +break; +default: +acpi_gpe_ioport_writeb(s-ar, addr, val); +} pm_update_sci(s); PIIX4_DPRINTF(gpe write %x == %d\n, addr, val); @@ -560,9 +584,11 @@ static uint32_t pcirmv_read(void *opaque, uint32_t addr) static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev, PCIHotplugState state); +static int piix4_dimm_hotplug(DeviceState *qdev, SysBusDevice *dev, int add); static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s) { +int i = 0; register_ioport_write(GPE_BASE, GPE_LEN, 1, gpe_writeb
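The register layout described in this patch (a 32-byte read-only status bitmap at 0xaf80 and a single write-only eject byte at 0xafa0) can be modelled in a few lines of C. This is a standalone sketch, not the acpi_piix4 code; struct mem_regs and the mem_regs_* helpers are hypothetical. The read handler here checks a half-open port range, so the byte at MEM_BASE + DIMM_BITMAP_BYTES is never treated as part of the bitmap. A GNU case range written as "MEM_BASE ... MEM_BASE+DIMM_BITMAP_BYTES" is inclusive at both ends and would cover that extra port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIMM_SLOTS        256
#define DIMM_BITMAP_BYTES (DIMM_SLOTS / 8)
#define MEM_BASE          0xaf80u  /* status bitmap, read-only */
#define MEM_EJ_BASE       0xafa0u  /* eject byte, write-only */

/* Hypothetical model of the memory hotplug status registers. */
struct mem_regs {
    uint8_t mems_sts[DIMM_BITMAP_BYTES];
};

/* Set or clear the status bit of slot idx (one bit per slot). */
static void mem_regs_set(struct mem_regs *g, unsigned idx, int present)
{
    assert(idx < DIMM_SLOTS);
    if (present) {
        g->mems_sts[idx / 8] |= (uint8_t)(1u << (idx % 8));
    } else {
        g->mems_sts[idx / 8] &= (uint8_t)~(1u << (idx % 8));
    }
}

/* Byte read: ports inside the bitmap return status bits, others read 0. */
static uint8_t mem_regs_readb(const struct mem_regs *g, uint32_t port)
{
    if (port >= MEM_BASE && port < MEM_BASE + DIMM_BITMAP_BYTES) {
        return g->mems_sts[port - MEM_BASE];
    }
    return 0;
}

/*
 * Byte write: only the eject port is handled. The written value is the
 * slot the guest has ejected; it is returned through ejected_slot so the
 * caller can depopulate that dimm. Returns 1 if the write was handled.
 */
static int mem_regs_writeb(uint32_t port, uint8_t val, int *ejected_slot)
{
    assert(ejected_slot != NULL);
    if (port != MEM_EJ_BASE) {
        return 0;
    }
    *ejected_slot = val;
    return 1;
}
```

A single eject byte is enough for 256 slots because the guest writes the slot number itself rather than a bit mask.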
[RFC PATCH v2 13/21] Implement memory hotplug notification lists
Guest can respond to ACPI hotplug events e.g. with _EJ or _OST method. This patch implements a tail queue to store guest notifications for memory hot-add and hot-remove requests. Guest responses for memory hotplug command on a per-dimm basis can be detected with the new hmp command info memhp or the new qmp command query-memhp Examples: (qemu) dimm_add dimm0 (qemu) info memhp Dimm: dimm0 hot-add success or Dimm: dimm0 hot-add failure (qemu) dimm_del dimm0 (qemu) info memhp Dimm: dimm0 hot-remove success or Dimm: dimm0 hot-remove failure Results are removed from the queue once read. This patch only queues _EJ events that signal hot-remove success. For _OST event queuing, which cover the hot-remove failure and hot-add success/failure cases, the next 2 patches are also needed. These notification items should probably be part of migration state (not yet implemented) Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- hmp-commands.hx |2 + hmp.c| 17 hmp.h|1 + hw/dimm.c| 55 ++ hw/dimm.h|6 + monitor.c|7 ++ qapi-schema.json | 26 + qmp-commands.hx | 38 + 8 files changed, 152 insertions(+), 0 deletions(-) diff --git a/hmp-commands.hx b/hmp-commands.hx index 012c150..3172cde 100644 --- a/hmp-commands.hx +++ b/hmp-commands.hx @@ -1459,6 +1459,8 @@ show device tree show qdev device model list @item info roms show roms +@item info memhp +show memhp @end table ETEXI diff --git a/hmp.c b/hmp.c index b9cec1d..ec25d9a 100644 --- a/hmp.c +++ b/hmp.c @@ -1000,3 +1000,20 @@ void hmp_netdev_del(Monitor *mon, const QDict *qdict) qmp_netdev_del(id, err); hmp_handle_error(mon, err); } + +void hmp_info_memhp(Monitor *mon) +{ +MemHpInfoList *info; +MemHpInfoList *item; +MemHpInfo *dimm; + +info = qmp_query_memhp(NULL); +for (item = info; item; item = item-next) { +dimm = item-value; +monitor_printf(mon, Dimm: %s %s %s\n, dimm-Dimm, +dimm-request, dimm-result); +dimm-Dimm = NULL; +} + +qapi_free_MemHpInfoList(info); +} diff --git a/hmp.h b/hmp.h index 79d138d..971e7c4 
100644 --- a/hmp.h +++ b/hmp.h @@ -64,5 +64,6 @@ void hmp_device_del(Monitor *mon, const QDict *qdict); void hmp_dump_guest_memory(Monitor *mon, const QDict *qdict); void hmp_netdev_add(Monitor *mon, const QDict *qdict); void hmp_netdev_del(Monitor *mon, const QDict *qdict); +void hmp_info_memhp(Monitor *mon); #endif diff --git a/hw/dimm.c b/hw/dimm.c index 00c4623..9b32386 100644 --- a/hw/dimm.c +++ b/hw/dimm.c @@ -26,6 +26,7 @@ static DeviceState *dimm_hotplug_qdev; static dimm_hotplug_fn dimm_hotplug; static QTAILQ_HEAD(Dimmlist, DimmState) dimmlist; +static QTAILQ_HEAD(dimm_hp_result_head, dimm_hp_result) dimm_hp_result_queue; static Property dimm_properties[] = { DEFINE_PROP_END_OF_LIST() @@ -189,16 +190,69 @@ void dimm_notify(uint32_t idx, uint32_t event) DimmState *s; s = dimm_find_from_idx(idx); assert(s != NULL); +struct dimm_hp_result *result = g_malloc0(sizeof(*result)); +result-s = s; +result-ret = event; switch(event) { case DIMM_REMOVE_SUCCESS: dimm_depopulate(s); +QTAILQ_INSERT_TAIL(dimm_hp_result_queue, result, next); break; default: +g_free(result); break; } } +MemHpInfoList *qmp_query_memhp(Error **errp) +{ +MemHpInfoList *head = NULL, *cur_item = NULL, *info; +struct dimm_hp_result *item, *nextitem; + +QTAILQ_FOREACH_SAFE(item, dimm_hp_result_queue, next, nextitem) { + +info = g_malloc0(sizeof(*info)); +info-value = g_malloc0(sizeof(*info-value)); +info-value-Dimm = g_malloc0(sizeof(char) * 32); +info-value-request = g_malloc0(sizeof(char) * 16); +info-value-result = g_malloc0(sizeof(char) * 16); +switch (item-ret) { +case DIMM_REMOVE_SUCCESS: +strcpy(info-value-request, hot-remove); +strcpy(info-value-result, success); +break; +case DIMM_REMOVE_FAIL: +strcpy(info-value-request, hot-remove); +strcpy(info-value-result, failure); +break; +case DIMM_ADD_SUCCESS: +strcpy(info-value-request, hot-add); +strcpy(info-value-result, success); +break; +case DIMM_ADD_FAIL: +strcpy(info-value-request, hot-add); +strcpy(info-value-result, failure); +break; 
+default: +break; +} +strcpy(info-value-Dimm, item-s-busdev.qdev.id); +/* XXX: waiting for the qapi to support GSList */ +if (!cur_item) { +head = cur_item = info
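The behaviour described in this patch (guest responses are queued per dimm and removed from the queue once read) can be illustrated with a small standalone queue. This sketch does not use QEMU's QTAILQ or QAPI types. The struct and function names are hypothetical; only the "Dimm: name request result" output format follows the commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum hp_event { HP_REMOVE_SUCCESS, HP_REMOVE_FAIL, HP_ADD_SUCCESS, HP_ADD_FAIL };

/* One queued guest response for one dimm. */
struct hp_result {
    char dimm[32];
    enum hp_event event;
    struct hp_result *next;
};

/* Singly linked tail queue: tail points at the last next pointer. */
struct hp_queue {
    struct hp_result *head;
    struct hp_result **tail;
};

static void hp_queue_init(struct hp_queue *q)
{
    q->head = NULL;
    q->tail = &q->head;
}

/* Append a result; called when the guest reports an outcome. */
static void hp_queue_push(struct hp_queue *q, const char *dimm, enum hp_event ev)
{
    struct hp_result *r = calloc(1, sizeof(*r));

    assert(r != NULL);
    snprintf(r->dimm, sizeof(r->dimm), "%s", dimm);
    r->event = ev;
    *q->tail = r;
    q->tail = &r->next;
}

static const char *hp_event_str(enum hp_event ev)
{
    switch (ev) {
    case HP_REMOVE_SUCCESS: return "hot-remove success";
    case HP_REMOVE_FAIL:    return "hot-remove failure";
    case HP_ADD_SUCCESS:    return "hot-add success";
    case HP_ADD_FAIL:       return "hot-add failure";
    }
    return "unknown";
}

/*
 * Format every queued result into out, one per line, and free each entry
 * as it is read, so a second call reports nothing. Lines that do not fit
 * in out are dropped from the text but still dequeued and counted.
 */
static size_t hp_queue_drain(struct hp_queue *q, char *out, size_t outlen)
{
    size_t n = 0, used = 0;

    if (outlen > 0) {
        out[0] = '\0';
    }
    while (q->head != NULL) {
        struct hp_result *r = q->head;
        int w;

        q->head = r->next;
        w = snprintf(out + used, outlen - used, "Dimm: %s %s\n",
                     r->dimm, hp_event_str(r->event));
        if (w > 0 && (size_t)w < outlen - used) {
            used += (size_t)w;
        } else if (outlen > used) {
            out[used] = '\0';   /* discard a truncated line */
        }
        free(r);
        n++;
    }
    q->tail = &q->head;
    return n;
}
```

Because entries are consumed on read, a management tool polling this interface has to record each result itself; asking twice returns an empty list the second time.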
[RFC PATCH v2 14/21][SeaBIOS] acpi_dsdt: Support _OST dimm method
Add support for _OST method. _OST method will write into the correct I/O byte to signal success / failure of hot-add or hot-remove to qemu.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 src/acpi-dsdt.dsl |   46 ++
 src/ssdt-mem.dsl  |    4
 2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/src/acpi-dsdt.dsl b/src/acpi-dsdt.dsl
index 5d3e92b..1c253ca 100644
--- a/src/acpi-dsdt.dsl
+++ b/src/acpi-dsdt.dsl
@@ -762,6 +762,28 @@ DefinitionBlock (
         MPE, 8
     }

+
+    /* Memory hot-remove notify failure byte */
+    OperationRegion(MEEF, SystemIO, 0xafa1, 1)
+    Field (MEEF, ByteAcc, NoLock, Preserve)
+    {
+        MEF, 8
+    }
+
+    /* Memory hot-add notify success byte */
+    OperationRegion(MPIS, SystemIO, 0xafa2, 1)
+    Field (MPIS, ByteAcc, NoLock, Preserve)
+    {
+        MIS, 8
+    }
+
+    /* Memory hot-add notify failure byte */
+    OperationRegion(MPIF, SystemIO, 0xafa3, 1)
+    Field (MPIF, ByteAcc, NoLock, Preserve)
+    {
+        MIF, 8
+    }
+
     Method(MESC, 0) {
         // Local5 = active memdevice bitmap
         Store (MES, Local5)
@@ -802,6 +824,30 @@ DefinitionBlock (
         Store(Arg0, MPE)
         Sleep(200)
     }
+    Method (MOST, 3, Serialized) {
+        // _OST method - OS status indication
+        Switch (And(Arg0, 0xFF)) {
+            Case(0x3)
+            {
+                Switch(And(Arg1, 0xFF)) {
+                    Case(0x1) {
+                        Store(Arg2, MEF)
+                    }
+                }
+            }
+            Case(0x1)
+            {
+                Switch(And(Arg1, 0xFF)) {
+                    Case(0x0) {
+                        Store(Arg2, MIS)
+                    }
+                    Case(0x1) {
+                        Store(Arg2, MIF)
+                    }
+                }
+            }
+        }
+    }
 }

diff --git a/src/ssdt-mem.dsl b/src/ssdt-mem.dsl
index ee322f0..041d301 100644
--- a/src/ssdt-mem.dsl
+++ b/src/ssdt-mem.dsl
@@ -38,6 +38,7 @@ DefinitionBlock ("ssdt-mem.aml", "SSDT", 0x02, "BXPC", "CSSDT", 0x1)

     External(CMST, MethodObj)
     External(MPEJ, MethodObj)
+    External(MOST, MethodObj)

     Name(_CRS, ResourceTemplate() {
         QwordMemory(
@@ -60,6 +61,9 @@ DefinitionBlock ("ssdt-mem.aml", "SSDT", 0x02, "BXPC", "CSSDT", 0x1)
     Method (_EJ0, 1, NotSerialized) {
         MPEJ(ID, Arg0)
     }
+    Method (_OST, 3) {
+        MOST(Arg0, Arg1, ID)
+    }
 }
 }
--
1.7.9
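The dispatch that MOST performs can be sketched in C as a lookup from the two _OST arguments to the notification port. This is only an illustration and not code from the series: the helper name ost_port is hypothetical, and reading a low event byte of 0x3 as "eject" and 0x1 as "insertion" is inferred from the register comments in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* I/O ports as defined by the OperationRegions in the patch. */
enum {
    MEM_OST_REMOVE_FAIL = 0xafa1,
    MEM_OST_ADD_SUCCESS = 0xafa2,
    MEM_OST_ADD_FAIL    = 0xafa3,
};

/*
 * Hypothetical helper mirroring Method(MOST): given the _OST source event
 * (Arg0) and status code (Arg1), return the port that receives the slot
 * number, or 0 when MOST stores nothing.
 */
static uint16_t ost_port(uint32_t event, uint32_t status)
{
    switch (event & 0xFF) {
    case 0x3:                               /* eject (hot-remove) */
        return (status & 0xFF) == 0x1 ? MEM_OST_REMOVE_FAIL : 0;
    case 0x1:                               /* insertion (hot-add) */
        switch (status & 0xFF) {
        case 0x0:
            return MEM_OST_ADD_SUCCESS;
        case 0x1:
            return MEM_OST_ADD_FAIL;
        default:
            return 0;
        }
    default:
        return 0;
    }
}
```

Note that a successful eject produces no write here; in this series that case is reported through _EJ0 and port 0xafa0 instead.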
[RFC PATCH v2 18/21] acpi_piix4: Update dimm bitmap state on hot-remove fail
This allows failed hot operations to be retried at any time. This only works for guests that use _OST notification. Other guests cannot retry failed hot operations on the same devices until after a reboot.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 hw/acpi_piix4.c |   20 +++-
 hw/dimm.c       |   16 +++-
 hw/dimm.h       |    2 +-
 3 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index ebc5de7..db631cc 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -599,6 +599,7 @@ static uint32_t pcirmv_read(void *opaque, uint32_t addr)
 static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
                                 PCIHotplugState state);
 static int piix4_dimm_hotplug(DeviceState *qdev, SysBusDevice *dev, int add);
+static int piix4_dimm_revert(DeviceState *qdev, SysBusDevice *dev, int add);

 static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s)
 {
@@ -627,7 +628,7 @@ static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s)
     }

     pci_bus_hotplug(bus, piix4_device_hotplug, &s->dev.qdev);
-    dimm_register_hotplug(piix4_dimm_hotplug, &s->dev.qdev);
+    dimm_register_hotplug(piix4_dimm_hotplug, piix4_dimm_revert, &s->dev.qdev);
 }

 static void enable_device(PIIX4PMState *s, int slot)
@@ -696,6 +697,23 @@ void piix4_dimm_state_sync(PIIX4PMState *s)
     }
 }

+static int piix4_dimm_revert(DeviceState *qdev, SysBusDevice *dev, int add)
+{
+    PCIDevice *pci_dev = DO_UPCAST(PCIDevice, qdev, qdev);
+    PIIX4PMState *s = DO_UPCAST(PIIX4PMState, dev, pci_dev);
+    struct gpe_regs *g = &s->gperegs;
+    DimmState *slot = DIMM(dev);
+    int idx = slot->idx;
+
+    if (add) {
+        g->mems_sts[idx/8] &= ~(1 << (idx%8));
+    }
+    else {
+        g->mems_sts[idx/8] |= (1 << (idx%8));
+    }
+    return 0;
+}
+
 static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
                                 PCIHotplugState state)
 {
diff --git a/hw/dimm.c b/hw/dimm.c
index ba104cc..2115567 100644
--- a/hw/dimm.c
+++ b/hw/dimm.c
@@ -25,6 +25,7 @@

 static DeviceState *dimm_hotplug_qdev;
 static dimm_hotplug_fn dimm_hotplug;
+static dimm_hotplug_fn dimm_revert;
 static QTAILQ_HEAD(Dimmlist, DimmState) dimmlist;
 static QTAILQ_HEAD(dimm_hp_result_head, dimm_hp_result) dimm_hp_result_queue;

@@ -77,10 +78,12 @@ DimmState *dimm_create(char *id, uint64_t size, uint64_t node, uint32_t
     return mdev;
 }

-void dimm_register_hotplug(dimm_hotplug_fn hotplug, DeviceState *qdev)
+void dimm_register_hotplug(dimm_hotplug_fn hotplug, dimm_hotplug_fn revert,
+        DeviceState *qdev)
 {
     dimm_hotplug_qdev = qdev;
     dimm_hotplug = hotplug;
+    dimm_revert = revert;
     dimm_scan_populated();
 }

@@ -211,10 +214,20 @@ void dimm_notify(uint32_t idx, uint32_t event)
             s->pending = false;
             break;
         case DIMM_REMOVE_FAIL:
+            QTAILQ_INSERT_TAIL(&dimm_hp_result_queue, result, next);
+            s->pending = false;
+            if (dimm_revert)
+                dimm_revert(dimm_hotplug_qdev, (SysBusDevice*)s, 0);
+            break;
         case DIMM_ADD_SUCCESS:
+            QTAILQ_INSERT_TAIL(&dimm_hp_result_queue, result, next);
+            s->pending = false;
+            break;
         case DIMM_ADD_FAIL:
             QTAILQ_INSERT_TAIL(&dimm_hp_result_queue, result, next);
             s->pending = false;
+            if (dimm_revert)
+                dimm_revert(dimm_hotplug_qdev, (SysBusDevice*)s, 1);
             break;
         default:
             g_free(result);
@@ -288,6 +301,7 @@ static void dimm_class_init(ObjectClass *klass, void *data)
     dc->props = dimm_properties;
     sc->init = dimm_init;
     dimm_hotplug = NULL;
+    dimm_revert = NULL;
     QTAILQ_INIT(&dimmlist);
     QTAILQ_INIT(&dimm_hp_result_queue);
 }
diff --git a/hw/dimm.h b/hw/dimm.h
index 0fa6137..b563e3f 100644
--- a/hw/dimm.h
+++ b/hw/dimm.h
@@ -54,7 +54,7 @@ void dimm_depopulate(DimmState *s);
 int dimm_do(Monitor *mon, const QDict *qdict, bool add);
 DimmState *dimm_find_from_idx(uint32_t idx);
 DimmState *dimm_find_from_name(char *id);
-void dimm_register_hotplug(dimm_hotplug_fn hotplug, DeviceState *qdev);
+void dimm_register_hotplug(dimm_hotplug_fn hotplug, dimm_hotplug_fn revert, DeviceState *qdev);
 void dimm_calc_offsets(dimm_calcoffset_fn calcfn);
 void dimm_activate(DimmState *slot);
 void dimm_deactivate(DimmState *slot);
--
1.7.9
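The bitmap update done by piix4_dimm_revert can be tried in isolation with a sketch like the one below. The helper name dimm_bitmap_revert is hypothetical, and the MAX_DIMMS value is an assumption made only so the example is self-contained; neither is taken from this patch.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DIMMS 255                           /* assumed value for the sketch */
#define DIMM_BITMAP_BYTES ((MAX_DIMMS + 7) / 8)

/*
 * Undo the bitmap change of a failed hot operation on slot idx.
 * add != 0: a hot-add failed, so the slot is marked empty again.
 * add == 0: a hot-remove failed, so the slot is marked populated again.
 */
static void dimm_bitmap_revert(uint8_t *mems_sts, unsigned idx, int add)
{
    assert(idx < MAX_DIMMS);
    if (add) {
        mems_sts[idx / 8] &= (uint8_t)~(1u << (idx % 8));
    } else {
        mems_sts[idx / 8] |= (uint8_t)(1u << (idx % 8));
    }
}
```

Each slot owns one bit, eight slots per byte, so slot 10 lives in bit 2 of byte 1.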
[RFC PATCH v2 20/21] Implement -dimms, -dimmspop command line options
Implement batch dimm creation command line options. These could be useful for not bloating the command line with a large number of dimms.

syntax: -dimms pfx=poolid,size=sz,num=n

Will create n dimms with ids poolid0, ..., poolidn-1. Each dimm has a size of sz.

Implement -dimmspop option to populate dimms at bootup

syntax: -dimmspop pfx=poolid,num=n

This will populate n dimms with ids poolid0, ..., poolidn-1.

(live-migration could break here without patch 12/21: -dimmspop needs to be reworked to support populating of individual dimms with the same prefix, and not only a range of dimms starting from 0)

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 hw/dimm.c       |    9 ++
 hw/dimm.h       |    2 +-
 qemu-config.c   |   45
 qemu-options.hx |   10 ++
 vl.c            |   86 ++-
 5 files changed, 150 insertions(+), 2 deletions(-)

diff --git a/hw/dimm.c b/hw/dimm.c
index 2115567..6e324d3 100644
--- a/hw/dimm.c
+++ b/hw/dimm.c
@@ -187,6 +187,15 @@ void dimm_calc_offsets(dimm_calcoffset_fn calcfn)
     }
 }

+int dimm_set_populated(DimmState *s)
+{
+    if (s) {
+        s->populated = true;
+        return 0;
+    }
+    else return -1;
+}
+
 /* used to populate and activate dimms at boot time */
 void dimm_scan_populated(void)
 {
diff --git a/hw/dimm.h b/hw/dimm.h
index b563e3f..0fdf59b 100644
--- a/hw/dimm.h
+++ b/hw/dimm.h
@@ -60,6 +60,6 @@ void dimm_activate(DimmState *slot);
 void dimm_deactivate(DimmState *slot);
 void dimm_scan_populated(void);
 void dimm_notify(uint32_t idx, uint32_t event);
-
+int dimm_set_populated(DimmState *s);
 #endif
diff --git a/qemu-config.c b/qemu-config.c
index 4abc31b..7f63186 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -650,6 +650,49 @@ static QemuOptsList qemu_dimm_opts = {
         { /* end of list */ }
     },
 };
+
+static QemuOptsList qemu_dimms_opts = {
+    .name = "dimms",
+    .head = QTAILQ_HEAD_INITIALIZER(qemu_dimms_opts.head),
+    .desc = {
+        {
+            .name = "pfx",
+            .type = QEMU_OPT_STRING,
+            .help = "prefix of ids for these dimm devices",
+        },{
+            .name = "size",
+            .type = QEMU_OPT_SIZE,
+            .help = "memory size for these dimm",
+        },{
+            .name = "num",
+            .type = QEMU_OPT_NUMBER,
+            .help = "number of dimm devices in this pool",
+        },{
+            .name = "node",
+            .type = QEMU_OPT_NUMBER,
+            .help = "NUMA node number (i.e. proximity) for these dimms",
+        },
+        { /* end of list */ }
+    },
+};
+
+static QemuOptsList qemu_dimmspop_opts = {
+    .name = "dimmspop",
+    .head = QTAILQ_HEAD_INITIALIZER(qemu_dimmspop_opts.head),
+    .desc = {
+        {
+            .name = "pfx",
+            .type = QEMU_OPT_STRING,
+            .help = "pool prefix for this dimm device",
+        },{
+            .name = "num",
+            .type = QEMU_OPT_SIZE,
+            .help = "number of dimm devices to populate",
+        },
+        { /* end of list */ }
+    },
+};
+
 static QemuOptsList *vm_config_groups[32] = {
     &qemu_drive_opts,
     &qemu_chardev_opts,
@@ -666,6 +709,8 @@ static QemuOptsList *vm_config_groups[32] = {
     &qemu_boot_opts,
     &qemu_iscsi_opts,
     &qemu_dimm_opts,
+    &qemu_dimms_opts,
+    &qemu_dimmspop_opts,
     NULL,
 };

diff --git a/qemu-options.hx b/qemu-options.hx
index 61909f7..0a9326e 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -2752,3 +2752,13 @@ DEF("dimm", HAS_ARG, QEMU_OPTION_dimm,
     "-dimm id=dimmid,size=sz,node=nd,populated=on|off\n"
     "specify memory dimm device with name dimmid, size sz on node nd",
     QEMU_ARCH_ALL)
+
+DEF("dimms", HAS_ARG, QEMU_OPTION_dimms,
+    "-dimms pfx=id,size=sz,node=nd\n"
+    "specify pool of num memory dimm devices of size sz each on node nd",
+    QEMU_ARCH_ALL)
+
+DEF("dimmspop", HAS_ARG, QEMU_OPTION_dimmspop,
+    "-dimmspop pfx=id,num=n\n"
+    "populate n dimms of pool id (dimms with ids id0,...,idn-1) at system startup",
+    QEMU_ARCH_ALL)
diff --git a/vl.c b/vl.c
index efe915e..37752be 100644
--- a/vl.c
+++ b/vl.c
@@ -538,6 +538,65 @@ static void configure_dimm(QemuOpts *opts)
     nb_hp_dimms++;
 }

+static void configure_dimms(QemuOpts *opts)
+{
+    const char *value, *pfx, *id;
+    uint64_t size, node;
+    int num, dimm;
+    char buf[32];
+
+    id = qemu_opts_id(opts);
+    value = qemu_opt_get(opts, "pfx");
+    if (!value) {
+        fprintf(stderr, "qemu: invalid prefix for dimm pool '%s'\n", id);
+        exit(1);
+    }
+    pfx = value;
+
+    size = qemu_opt_get_size(opts, "size", DEFAULT_DIMMSIZE);
+    num = qemu_opt_get_number(opts, "num", 1);
+    node = qemu_opt_get_number(opts, "node", 0);
+
+    for (dimm = 0; dimm < num; dimm++) {
+        if (nb_hp_dimms == MAX_DIMMS) {
+            fprintf(stderr, "qemu: maximum number of DIMMs (%d) exceeded\n",
+                    MAX_DIMMS
[RFC PATCH v2 06/21] dimm: Implement memory device abstraction
Each hotplug-able memory slot is a SysBusDevice. A hot-add operation for a particular dimm creates a new MemoryRegion of the given physical address offset, size and node proximity, and attaches it to main system memory as a sub_region. A hot-remove operation detaches and frees the MemoryRegion from system memory.

This prototype still lacks proper qdev integration: a separate hotplug side-channel is used and main system bus hotplug capability is ignored.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 hw/Makefile.objs |    2 +-
 hw/dimm.c        |  234 ++
 hw/dimm.h        |   58 +
 3 files changed, 293 insertions(+), 1 deletions(-)
 create mode 100644 hw/dimm.c
 create mode 100644 hw/dimm.h

diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index 3d77259..e2184bf 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -26,7 +26,7 @@ hw-obj-$(CONFIG_I8254) += i8254_common.o i8254.o
 hw-obj-$(CONFIG_PCSPK) += pcspk.o
 hw-obj-$(CONFIG_PCKBD) += pckbd.o
 hw-obj-$(CONFIG_FDC) += fdc.o
-hw-obj-$(CONFIG_ACPI) += acpi.o acpi_piix4.o
+hw-obj-$(CONFIG_ACPI) += acpi.o acpi_piix4.o dimm.o
 hw-obj-$(CONFIG_APM) += pm_smbus.o apm.o
 hw-obj-$(CONFIG_DMA) += dma.o
 hw-obj-$(CONFIG_I82374) += i82374.o
diff --git a/hw/dimm.c b/hw/dimm.c
new file mode 100644
index 000..00c4623
--- /dev/null
+++ b/hw/dimm.c
@@ -0,0 +1,234 @@
+/*
+ * Dimm device for Memory Hotplug
+ *
+ * Copyright ProfitBricks GmbH 2012
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>
+ */
+
+#include "trace.h"
+#include "qdev.h"
+#include "dimm.h"
+#include <time.h>
+#include "../exec-memory.h"
+#include "qmp-commands.h"
+
+static DeviceState *dimm_hotplug_qdev;
+static dimm_hotplug_fn dimm_hotplug;
+static QTAILQ_HEAD(Dimmlist, DimmState) dimmlist;
+
+static Property dimm_properties[] = {
+    DEFINE_PROP_END_OF_LIST()
+};
+
+void dimm_populate(DimmState *s)
+{
+    DeviceState *dev= (DeviceState*)s;
+    MemoryRegion *new = NULL;
+
+    new = g_malloc(sizeof(MemoryRegion));
+    memory_region_init_ram(new, dev->id, s->size);
+    vmstate_register_ram_global(new);
+    memory_region_add_subregion(get_system_memory(), s->start, new);
+    s->mr = new;
+    s->populated = true;
+}
+
+
+void dimm_depopulate(DimmState *s)
+{
+    assert(s);
+    if (s->populated) {
+        vmstate_unregister_ram(s->mr, NULL);
+        memory_region_del_subregion(get_system_memory(), s->mr);
+        memory_region_destroy(s->mr);
+        s->populated = false;
+        s->mr = NULL;
+    }
+}
+
+DimmState *dimm_create(char *id, uint64_t size, uint64_t node, uint32_t
+        dimm_idx, bool populated)
+{
+    DeviceState *dev;
+    DimmState *mdev;
+
+    dev = sysbus_create_simple("dimm", -1, NULL);
+    dev->id = id;
+
+    mdev = DIMM(dev);
+    mdev->idx = dimm_idx;
+    mdev->start = 0;
+    mdev->size = size;
+    mdev->node = node;
+    mdev->populated = populated;
+    QTAILQ_INSERT_TAIL(&dimmlist, mdev, nextdimm);
+    return mdev;
+}
+
+void dimm_register_hotplug(dimm_hotplug_fn hotplug, DeviceState *qdev)
+{
+    dimm_hotplug_qdev = qdev;
+    dimm_hotplug = hotplug;
+    dimm_scan_populated();
+}
+
+void dimm_activate(DimmState *slot)
+{
+    dimm_populate(slot);
+    if (dimm_hotplug)
+        dimm_hotplug(dimm_hotplug_qdev, (SysBusDevice*)slot, 1);
+}
+
+void dimm_deactivate(DimmState *slot)
+{
+    if (dimm_hotplug)
+        dimm_hotplug(dimm_hotplug_qdev, (SysBusDevice*)slot, 0);
+}
+
+DimmState *dimm_find_from_name(char *id)
+{
+    Error *err = NULL;
+    DeviceState *qdev;
+    const char *type;
+    qdev = qdev_find_recursive(sysbus_get_default(), id);
+    if (qdev) {
+        type = object_property_get_str(OBJECT(qdev), "type", &err);
+        if (!type) {
+            return NULL;
+        }
+        if (!strcmp(type, "dimm")) {
+            return DIMM(qdev);
+        }
+    }
+    return NULL;
+}
+
+int dimm_do(Monitor *mon, const QDict *qdict, bool add)
+{
+    DimmState *slot = NULL;
+
+    char *id = (char*) qdict_get_try_str(qdict, "id");
+    if (!id) {
+        fprintf(stderr, "ERROR %s invalid id\n",__FUNCTION__);
+        return 1;
+    }
+
+    slot = dimm_find_from_name(id);
+
+    if (!slot) {
+        fprintf(stderr, "%s no slot %s found\n", __FUNCTION__, id);
+        return 1;
+    }
+
+    if (add) {
+        if (slot->populated) {
+            fprintf(stderr, "ERROR
[RFC PATCH v2 12/21] fix live-migration when populated=on is missing
Live migration works after memory hot-add events, as long as the qemu command line -dimm arguments are changed on the destination host to specify populated=on for the dimms that have been hot-added.

If the command line has not been changed, the destination host does not yet have the corresponding ramblock in its ram_list. Activate the memslot on the destination during ram_load.

Perhaps several fields of the DimmState struct should be part of a VMStateDescription to handle migration in a cleaner way.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 arch_init.c |   23 ---
 1 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index a9e8b74..5f46b98 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -43,6 +43,7 @@
 #include "hw/smbios.h"
 #include "exec-memory.h"
 #include "hw/pcspk.h"
+#include "hw/dimm.h"

 #ifdef TARGET_SPARC
 int graphic_width = 1024;
@@ -452,9 +453,25 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
                 }

                 if (!block) {
-                    fprintf(stderr, "Unknown ramblock \"%s\", cannot "
-                            "accept migration\n", id);
-                    return -EINVAL;
+                    /* this can happen if a dimm was hot-added at source host */
+                    DimmState *slot = dimm_find_from_name(id);
+                    if (slot) {
+                        dimm_activate(slot);
+                        /* rescan ram_list, verify ramblock is there now */
+                        QLIST_FOREACH(block, &ram_list.blocks, next) {
+                            if (!strncmp(id, block->idstr, sizeof(id))) {
+                                if (block->length != length)
+                                    return -EINVAL;
+                                break;
+                            }
+                        }
+                        assert(block);
+                    }
+                    else {
+                        fprintf(stderr, "Unknown ramblock \"%s\", cannot "
+                                "accept migration\n", id);
+                        return -EINVAL;
+                    }
                 }

                 total_ram_bytes -= length;
--
1.7.9
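The fallback this patch adds to ram_load can be modelled outside QEMU roughly as follows. Everything here is a simplified stand-in written for illustration: struct ram_block, struct dimm_slot, activate_slot and resolve_block are hypothetical names, and the real code works on QEMU's ram_list and DimmState instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for RAMBlock and DimmState. */
struct ram_block { const char *idstr; uint64_t length; };
struct dimm_slot { const char *id; uint64_t size; bool populated; };

#define MAX_BLOCKS 16
static struct ram_block ram_list[MAX_BLOCKS];
static size_t nb_blocks;

static struct ram_block *find_block(const char *id)
{
    for (size_t i = 0; i < nb_blocks; i++) {
        if (strcmp(ram_list[i].idstr, id) == 0) {
            return &ram_list[i];
        }
    }
    return NULL;
}

/* Stands in for dimm_activate(): populating a slot registers its RAM block. */
static void activate_slot(struct dimm_slot *s)
{
    assert(nb_blocks < MAX_BLOCKS);
    ram_list[nb_blocks++] = (struct ram_block){ s->id, s->size };
    s->populated = true;
}

/* Returns the block for an incoming section, or NULL if migration must fail. */
static struct ram_block *resolve_block(const char *id, uint64_t length,
                                       struct dimm_slot *slots, size_t nb_slots)
{
    struct ram_block *block = find_block(id);

    if (!block) {
        /* Unknown block: it may belong to a dimm hot-added on the source. */
        for (size_t i = 0; i < nb_slots; i++) {
            if (strcmp(slots[i].id, id) == 0) {
                activate_slot(&slots[i]);
                block = find_block(id);     /* rescan after activation */
                break;
            }
        }
    }
    if (!block || block->length != length) {
        return NULL;
    }
    return block;
}
```

An id that matches neither an existing block nor a configured slot still fails, which corresponds to the "Unknown ramblock" error path kept by the patch.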
[RFC PATCH v2 16/21] acpi_piix4: Update dimm state on VM reboot
In case of a hot-remove or hot-add failure, the dimm bitmaps in qemu and SeaBIOS are inconsistent with the true state of the DIMM devices. The populated field of the DimmState reflects the true state of the device. This inconsistency means that a failed operation cannot be retried.

This patch updates the bit array to the true state of the dimms on VM reboot. This allows retry of failed hot-add or hot-remove operations after a reboot. Retrying a failed hot operation is not yet possible before reboot (the following patch removes this limitation for guests with _OST ACPI support).

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 hw/acpi_piix4.c |   25 +
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index d8e2c22..ebc5de7 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -91,6 +91,7 @@ typedef struct PIIX4PMState {
 } PIIX4PMState;

 static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s);
+static void piix4_dimm_state_sync(PIIX4PMState *s);

 #define ACPI_ENABLE 0xf1
 #define ACPI_DISABLE 0xf0
@@ -369,6 +370,7 @@ static void piix4_reset(void *opaque)
         /* Mark SMM as already inited (until KVM supports SMM). */
         pci_conf[0x5B] = 0x02;
     }
+    piix4_dimm_state_sync(s);
     piix4_update_hotplug(s);
 }

@@ -671,6 +673,29 @@ static int piix4_dimm_hotplug(DeviceState *qdev, SysBusDevice *dev, int
     return 0;
 }

+void piix4_dimm_state_sync(PIIX4PMState *s)
+{
+    struct gpe_regs *g = &s->gperegs;
+    DimmState *slot = NULL;
+    uint32_t i, temp = 1;
+
+    for(i = 0; i < MAX_DIMMS; i++) {
+        slot = dimm_find_from_idx(i);
+        if (!slot)
+            break;
+        if (i % 8 == 0) {
+            temp = 1;
+            g->mems_sts[i / 8] = 0;
+        }
+        else
+            temp = temp << 1;
+        if (slot->populated) {
+            g->mems_sts[i / 8] |= temp;
+        }
+        slot->pending = false;
+    }
+}
+
 static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
                                 PCIHotplugState state)
 {
--
1.7.9
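Rebuilding the status bitmap from the per-slot populated flags, as piix4_dimm_state_sync does on reset, might look like this when reduced to plain arrays. The helper name dimm_bitmap_sync and its array-based interface are assumptions made for the sketch. Unlike the patch, which only zeroes the bytes that contain existing slots, this version clears the whole bitmap first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Recompute the bitmap so that bit (i % 8) of byte (i / 8) is set exactly
 * when slot i is populated.
 */
static void dimm_bitmap_sync(uint8_t *mems_sts, size_t nbytes,
                             const bool *populated, size_t nb_slots)
{
    assert(nb_slots <= nbytes * 8);
    memset(mems_sts, 0, nbytes);
    for (size_t i = 0; i < nb_slots; i++) {
        if (populated[i]) {
            mems_sts[i / 8] |= (uint8_t)(1u << (i % 8));
        }
    }
}
```

Because the bitmap is derived only from the populated flags, any stale bit left behind by a failed hot operation disappears on the next sync.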
[RFC PATCH v2 17/21][SeaBIOS] acpi_dsdt: Revert internal dimm state on _OST failure
This reverts bitmap state in the case of a failed hot operation, in order to allow retry of failed hot operations

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 src/acpi-dsdt.dsl |    4
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/src/acpi-dsdt.dsl b/src/acpi-dsdt.dsl
index 1c253ca..0d37bbc 100644
--- a/src/acpi-dsdt.dsl
+++ b/src/acpi-dsdt.dsl
@@ -832,6 +832,8 @@ DefinitionBlock (
                 Switch(And(Arg1, 0xFF)) {
                     Case(0x1) {
                         Store(Arg2, MEF)
+                        // Revert MEON flag for this memory device to one
+                        Store(One, Index(MEON, Arg2))
                     }
                 }
             }
@@ -843,6 +845,8 @@ DefinitionBlock (
                     }
                     Case(0x1) {
                         Store(Arg2, MIF)
+                        // Revert MEON flag for this memory device to zero
+                        Store(Zero, Index(MEON, Arg2))
                     }
                 }
             }
--
1.7.9
[RFC PATCH v2 19/21] Implement info memtotal and query-memtotal
Returns total memory of guest in bytes, including hotplugged memory.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 hmp-commands.hx  |    2 ++
 hmp.c            |    7 +++
 hmp.h            |    1 +
 hw/dimm.c        |   15 +++
 monitor.c        |    7 +++
 qapi-schema.json |   12
 qmp-commands.hx  |   20
 7 files changed, 64 insertions(+), 0 deletions(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 3172cde..016062e 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1461,6 +1461,8 @@ show qdev device model list
 show roms
 @item info memhp
 show memhp
+@item info memtotal
+show memtotal
 @end table
 ETEXI

diff --git a/hmp.c b/hmp.c
index ec25d9a..8f89c7d 100644
--- a/hmp.c
+++ b/hmp.c
@@ -1017,3 +1017,10 @@ void hmp_info_memhp(Monitor *mon)

     qapi_free_MemHpInfoList(info);
 }
+
+void hmp_info_memtotal(Monitor *mon)
+{
+    uint64_t ram_total;
+    ram_total = (uint64_t)qmp_query_memtotal(NULL);
+    monitor_printf(mon, "MemTotal: %lu \n", ram_total);
+}
diff --git a/hmp.h b/hmp.h
index 971e7c4..d6e715e 100644
--- a/hmp.h
+++ b/hmp.h
@@ -65,5 +65,6 @@ void hmp_dump_guest_memory(Monitor *mon, const QDict *qdict);
 void hmp_netdev_add(Monitor *mon, const QDict *qdict);
 void hmp_netdev_del(Monitor *mon, const QDict *qdict);
 void hmp_info_memhp(Monitor *mon);
+void hmp_info_memtotal(Monitor *mon);

 #endif
diff --git a/hw/dimm.c b/hw/dimm.c
index 6e324d3..b544173 100644
--- a/hw/dimm.c
+++ b/hw/dimm.c
@@ -28,6 +28,7 @@ static dimm_hotplug_fn dimm_hotplug;
 static dimm_hotplug_fn dimm_revert;
 static QTAILQ_HEAD(Dimmlist, DimmState) dimmlist;
 static QTAILQ_HEAD(dimm_hp_result_head, dimm_hp_result) dimm_hp_result_queue;
+extern ram_addr_t ram_size;

 static Property dimm_properties[] = {
     DEFINE_PROP_END_OF_LIST()
@@ -292,6 +293,20 @@ MemHpInfoList *qmp_query_memhp(Error **errp)

     return head;
 }
+
+int64_t qmp_query_memtotal(Error **errp)
+{
+    DimmState *slot;
+    uint64_t info = ram_size;
+
+    QTAILQ_FOREACH(slot, &dimmlist, nextdimm) {
+        if (slot->populated) {
+            info += slot->size;
+        }
+    }
+    return (int64_t)info;
+}
+
 static int dimm_init(SysBusDevice *s)
 {
     DimmState *slot;
diff --git a/monitor.c b/monitor.c
index 4a14e26..1dd646c 100644
--- a/monitor.c
+++ b/monitor.c
@@ -2739,6 +2739,13 @@ static mon_cmd_t info_cmds[] = {
         .mhandler.info = hmp_info_memhp,
     },
     {
+        .name = "memtotal",
+        .args_type = "",
+        .params = "",
+        .help = "show total memory size",
+        .mhandler.info = hmp_info_memtotal,
+    },
+    {
         .name = NULL,
     },
 };
diff --git a/qapi-schema.json b/qapi-schema.json
index 049f6f9..5bbf2c0 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -1888,3 +1888,15 @@
 # Since: 1.1.3
 ##
 { 'command': 'query-memhp', 'returns': ['MemHpInfo'] }
+
+##
+# @query-memtotal:
+#
+# Returns total memory in bytes, including hotplugged dimms
+#
+# Returns: a l
+#
+# Since: 1.2
+##
+{ 'command': 'query-memtotal', 'returns': 'int' }
+
diff --git a/qmp-commands.hx b/qmp-commands.hx
index cd1d5f0..6c71696 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -2286,3 +2286,23 @@ Example:
    }

 EQMP
+
+    {
+        .name = "query-memtotal",
+        .args_type = "",
+        .mhandler.cmd_new = qmp_marshal_input_query_memtotal
+    },
+SQMP
+query-memtotal
+--
+
+Return total memory in bytes, including hotplugged dimms
+
+Example:
+
+-> { "execute": "query-memtotal" }
+<- {
+      "return": 1073741824
+   }
+
+EQMP
--
1.7.9
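The accounting in qmp_query_memtotal reduces to "boot RAM plus the size of every populated dimm". A self-contained sketch of that sum, with a hypothetical struct dimm holding only the two fields the calculation needs and a hypothetical helper name mem_total:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of DimmState: only what the sum needs. */
struct dimm {
    uint64_t size;
    bool populated;
};

/* Total guest memory in bytes: boot RAM plus all populated dimms. */
static uint64_t mem_total(uint64_t ram_size, const struct dimm *dimms, size_t n)
{
    uint64_t total = ram_size;

    for (size_t i = 0; i < n; i++) {
        if (dimms[i].populated) {
            total += dimms[i].size;
        }
    }
    return total;
}
```

For example, a guest booted with 1 GiB and two 512 MiB dimms, only one of them populated, reports 1610612736 bytes. When printing such a value, PRIu64 from inttypes.h is the portable format; the %lu used in hmp_info_memtotal relies on long being 64 bits wide.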
[RFC PATCH v2 15/21] acpi_piix4: _OST dimm support
This allows qemu to receive notifications from the guest OS on success or failure of a memory hotplug request. The guest OS needs to implement the _OST functionality for this to work (linux-next: http://lkml.org/lkml/2012/6/25/321)

Also add new _OST registers in docs/specs/acpi_hotplug.txt

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 docs/specs/acpi_hotplug.txt |   24
 hw/acpi_piix4.c             |   15 +++
 hw/dimm.c                   |   18 ++
 hw/dimm.h                   |    1 +
 4 files changed, 58 insertions(+), 0 deletions(-)

diff --git a/docs/specs/acpi_hotplug.txt b/docs/specs/acpi_hotplug.txt
index cf86242..2f6fd5f 100644
--- a/docs/specs/acpi_hotplug.txt
+++ b/docs/specs/acpi_hotplug.txt
@@ -20,3 +20,27 @@ ejected.

 Written by ACPI memory device _EJ0 method to notify qemu of successfull
 hot-removal. Write-only.
+
+Memory Dimm ejection failure notification (IO port 0xafa1, 1-byte access):
+---
+Dimm hot-remove _OST failure notification. Byte value indicates Dimm slot for
+which ejection failed.
+
+Written by ACPI memory device _OST method to notify qemu of failed
+hot-removal. Write-only.
+
+Memory Dimm insertion success notification (IO port 0xafa2, 1-byte access):
+---
+Dimm hot-add _OST success notification. Byte value indicates Dimm slot for which
+insertion succeeded.
+
+Written by ACPI memory device _OST method to notify qemu of failed
+hot-add. Write-only.
+
+Memory Dimm insertion failure notification (IO port 0xafa3, 1-byte access):
+---
+Dimm hot-add _OST failure notification. Byte value indicates Dimm slot for which
+insertion failed.
+
+Written by ACPI memory device _OST method to notify qemu of failed
+hot-add. Write-only.
diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index b988597..d8e2c22 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -49,6 +49,9 @@
 #define PCI_RMV_BASE 0xae0c
 #define MEM_BASE 0xaf80
 #define MEM_EJ_BASE 0xafa0
+#define MEM_OST_REMOVE_FAIL 0xafa1
+#define MEM_OST_ADD_SUCCESS 0xafa2
+#define MEM_OST_ADD_FAIL 0xafa3

 #define PIIX4_MEM_HOTPLUG_STATUS 8
 #define PIIX4_PCI_HOTPLUG_STATUS 2
@@ -531,6 +534,15 @@ static void gpe_writeb(void *opaque, uint32_t addr, uint32_t val)
         case MEM_EJ_BASE:
             dimm_notify(val, DIMM_REMOVE_SUCCESS);
             break;
+        case MEM_OST_REMOVE_FAIL:
+            dimm_notify(val, DIMM_REMOVE_FAIL);
+            break;
+        case MEM_OST_ADD_SUCCESS:
+            dimm_notify(val, DIMM_ADD_SUCCESS);
+            break;
+        case MEM_OST_ADD_FAIL:
+            dimm_notify(val, DIMM_ADD_FAIL);
+            break;
         default:
             acpi_gpe_ioport_writeb(&s->ar, addr, val);
     }
@@ -604,6 +616,9 @@ static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s)

     register_ioport_read(MEM_BASE, DIMM_BITMAP_BYTES, 1, gpe_readb, s);
     register_ioport_write(MEM_EJ_BASE, 1, 1, gpe_writeb, s);
+    register_ioport_write(MEM_OST_REMOVE_FAIL, 1, 1, gpe_writeb, s);
+    register_ioport_write(MEM_OST_ADD_SUCCESS, 1, 1, gpe_writeb, s);
+    register_ioport_write(MEM_OST_ADD_FAIL, 1, 1, gpe_writeb, s);

     for(i = 0; i < DIMM_BITMAP_BYTES; i++) {
         s->gperegs.mems_sts[i] = 0;
diff --git a/hw/dimm.c b/hw/dimm.c
index 9b32386..ba104cc 100644
--- a/hw/dimm.c
+++ b/hw/dimm.c
@@ -89,12 +89,14 @@ void dimm_activate(DimmState *slot)
     dimm_populate(slot);
     if (dimm_hotplug)
         dimm_hotplug(dimm_hotplug_qdev, (SysBusDevice*)slot, 1);
+    slot->pending = true;
 }

 void dimm_deactivate(DimmState *slot)
 {
     if (dimm_hotplug)
         dimm_hotplug(dimm_hotplug_qdev, (SysBusDevice*)slot, 0);
+    slot->pending = true;
 }

 DimmState *dimm_find_from_name(char *id)
@@ -138,6 +140,10 @@ int dimm_do(Monitor *mon, const QDict *qdict, bool add)
                     __FUNCTION__, id);
             return 1;
         }
+        if (slot->pending) {
+            fprintf(stderr, "warning: %s slot %s hot-operation pending\n",
+                    __FUNCTION__, id);
+        }
         dimm_activate(slot);
     }
     else {
@@ -146,6 +152,10 @@ int dimm_do(Monitor *mon, const QDict *qdict, bool add)
                     __FUNCTION__, id);
             return 1;
         }
+        if (slot->pending) {
+            fprintf(stderr, "warning: %s slot %s hot-operation pending\n",
+                    __FUNCTION__, id);
+        }
         dimm_deactivate(slot);
     }

@@ -198,6 +208,13 @@ void dimm_notify(uint32_t idx, uint32_t event)
         case DIMM_REMOVE_SUCCESS:
             dimm_depopulate(s);
             QTAILQ_INSERT_TAIL(&dimm_hp_result_queue, result, next);
+            s->pending = false;
+            break;
+        case DIMM_REMOVE_FAIL:
+        case DIMM_ADD_SUCCESS:
+        case DIMM_ADD_FAIL
[RFC PATCH v2 11/21] Implement dimm_add and dimm_del hmp/qmp commands
Hot-add hmp syntax: dimm_add dimmid
Hot-remove hmp syntax: dimm_del dimmid

Respective qmp commands are dimm-add, dimm-del.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 hmp-commands.hx |   32
 monitor.c       |   11 +++
 monitor.h       |    3 +++
 qmp-commands.hx |   39 +++
 4 files changed, 85 insertions(+), 0 deletions(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index f5d9d91..012c150 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -618,6 +618,38 @@ Add device.
 ETEXI

     {
+        .name = "dimm_del",
+        .args_type = "id:s",
+        .params = "id",
+        .help = "hot-remove memory (dimm device)",
+        .user_print = monitor_user_noop,
+        .mhandler.cmd_new = do_dimm_del,
+    },
+
+STEXI
+@item dimm_del @var{config}
+@findex dimm_del
+
+Hot-remove dimm.
+ETEXI
+
+    {
+        .name = "dimm_add",
+        .args_type = "id:s",
+        .params = "id",
+        .help = "hot-add memory (dimm device)",
+        .user_print = monitor_user_noop,
+        .mhandler.cmd_new = do_dimm_add,
+    },
+
+STEXI
+@item dimm_add @var{config}
+@findex dimm_add
+
+Hot-add dimm.
+ETEXI
+
+    {
         .name = "device_del",
         .args_type = "id:s",
         .params = "device",
diff --git a/monitor.c b/monitor.c
index f6107ba..d3d95a6 100644
--- a/monitor.c
+++ b/monitor.c
@@ -67,6 +67,7 @@
 #include "qmp-commands.h"
 #include "hmp.h"
 #include "qemu-thread.h"
+#include "hw/dimm.h"

 /* for pic/irq_info */
 #if defined(TARGET_SPARC)
@@ -4813,3 +4814,13 @@ int monitor_read_block_device_key(Monitor *mon, const char *device,

     return monitor_read_bdrv_key_start(mon, bs, completion_cb, opaque);
 }
+
+int do_dimm_add(Monitor *mon, const QDict *qdict, QObject **ret_data)
+{
+    return dimm_do(mon, qdict, true);
+}
+
+int do_dimm_del(Monitor *mon, const QDict *qdict, QObject **ret_data)
+{
+    return dimm_do(mon, qdict, false);
+}
diff --git a/monitor.h b/monitor.h
index 5f4de1b..afdd721 100644
--- a/monitor.h
+++ b/monitor.h
@@ -86,4 +86,7 @@ int qmp_qom_set(Monitor *mon, const QDict *qdict, QObject **ret);

 int qmp_qom_get(Monitor *mon, const QDict *qdict, QObject **ret);

+int do_dimm_add(Monitor *mon, const QDict *qdict, QObject **ret_data);
+int do_dimm_del(Monitor *mon, const QDict *qdict, QObject **ret_data);
+
 #endif /* !MONITOR_H */
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 2e1a38e..7efd628 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -2209,3 +2209,42 @@ EQMP
         .args_type = "implements:s?,abstract:b?",
         .mhandler.cmd_new = qmp_marshal_input_qom_list_types,
     },
+    {
+        .name = "dimm-add",
+        .args_type = "id:s",
+        .mhandler.cmd_new = do_dimm_add,
+    },
+SQMP
+dimm-add
+-
+
+Hot-add memory DIMM
+
+Will hotplug memory DIMMs with given id.
+
+Example:
+
+-> { "execute": "dimm-add", "arguments": { "id": "dimm0" } }
+<- { "return": {} }
+
+EQMP
+
+    {
+        .name = "dimm-del",
+        .args_type = "id:s",
+        .mhandler.cmd_new = do_dimm_del,
+    },
+SQMP
+dimm-del
+-
+
+Hot-remove memory DIMM
+
+Will hot-unplug memory DIMMs with given id.
+
+Example:
+
+-> { "execute": "dimm-del", "arguments": { "id": "dimm0" } }
+<- { "return": {} }
+
+EQMP
--
1.7.9
[RFC PATCH v2 10/21] Implement -dimm command line option
Syntax: -dimm id=name,size=sz,node=pxm,populated=on|off The starting physical address for all dimms is calculated automatically from top of memory, skipping the pci hole at [PCI_HOLE_START, 4G). populated=on means the dimm is populated at machine startup. Default is off. node is defining numa proximity for this dimm. Default is node zero. Example: -dimm id=dimm0,size=512M,node=0,populated=off will define a 512M memory slot belonging to numa node 0. Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- qemu-config.c | 25 + qemu-options.hx |5 + sysemu.h|1 + vl.c| 35 +++ 4 files changed, 66 insertions(+), 0 deletions(-) diff --git a/qemu-config.c b/qemu-config.c index 5c3296b..4abc31b 100644 --- a/qemu-config.c +++ b/qemu-config.c @@ -626,6 +626,30 @@ QemuOptsList qemu_boot_opts = { }, }; +static QemuOptsList qemu_dimm_opts = { +.name = dimm, +.head = QTAILQ_HEAD_INITIALIZER(qemu_dimm_opts.head), +.desc = { +{ +.name = id, +.type = QEMU_OPT_STRING, +.help = id of this dimm device, +},{ +.name = size, +.type = QEMU_OPT_SIZE, +.help = memory size for this dimm, +},{ +.name = populated, +.type = QEMU_OPT_BOOL, +.help = populated for this dimm, +},{ +.name = node, +.type = QEMU_OPT_NUMBER, +.help = NUMA node number (i.e. proximity) for this dimm, +}, +{ /* end of list */ } +}, +}; static QemuOptsList *vm_config_groups[32] = { qemu_drive_opts, qemu_chardev_opts, @@ -641,6 +665,7 @@ static QemuOptsList *vm_config_groups[32] = { qemu_machine_opts, qemu_boot_opts, qemu_iscsi_opts, +qemu_dimm_opts, NULL, }; diff --git a/qemu-options.hx b/qemu-options.hx index 8b66264..61909f7 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -2747,3 +2747,8 @@ HXCOMM This is the last statement. Insert new options before this line! 
STEXI @end table ETEXI + +DEF(dimm, HAS_ARG, QEMU_OPTION_dimm, +-dimm id=dimmid,size=sz,node=nd,populated=on|off\n +specify memory dimm device with name dimmid, size sz on node nd, +QEMU_ARCH_ALL) diff --git a/sysemu.h b/sysemu.h index bc2c788..3e21a22 100644 --- a/sysemu.h +++ b/sysemu.h @@ -136,6 +136,7 @@ extern QEMUClock *rtc_clock; extern int nb_numa_nodes; extern uint64_t node_mem[MAX_NODES]; extern uint64_t node_cpumask[MAX_NODES]; +extern int nb_hp_dimms; #define MAX_OPTION_ROMS 16 typedef struct QEMUOptionRom { diff --git a/vl.c b/vl.c index 0ff8818..efe915e 100644 --- a/vl.c +++ b/vl.c @@ -120,6 +120,7 @@ int main(int argc, char **argv) #include hw/xen.h #include hw/qdev.h #include hw/loader.h +#include hw/dimm.h #include bt-host.h #include net.h #include net/slirp.h @@ -242,6 +243,7 @@ QTAILQ_HEAD(, FWBootEntry) fw_boot_order = QTAILQ_HEAD_INITIALIZER(fw_boot_order int nb_numa_nodes; uint64_t node_mem[MAX_NODES]; uint64_t node_cpumask[MAX_NODES]; +int nb_hp_dimms; uint8_t qemu_uuid[16]; @@ -518,6 +520,23 @@ static void configure_rtc_date_offset(const char *startdate, int legacy) rtc_date_offset = time(NULL) - rtc_start_date; } } +static void configure_dimm(QemuOpts *opts) +{ +const char *id; +uint64_t size, node; +bool populated; +if (nb_hp_dimms == MAX_DIMMS) { +fprintf(stderr, qemu: maximum number of DIMMs (%d) exceeded\n, +MAX_DIMMS); +exit(1); +} +id = qemu_opts_id(opts); +size = qemu_opt_get_size(opts, size, DEFAULT_DIMMSIZE); +populated = qemu_opt_get_bool(opts, populated, 0); +node = qemu_opt_get_number(opts, node, 0); +dimm_create((char*)id, size, node, nb_hp_dimms, populated); +nb_hp_dimms++; +} static void configure_rtc(QemuOpts *opts) { @@ -2273,6 +2292,8 @@ int main(int argc, char **argv, char **envp) DisplayChangeListener *dcl; int cyls, heads, secs, translation; QemuOpts *hda_opts = NULL, *opts, *machine_opts; +QemuOpts *dimm_opts[MAX_DIMMS]; +int nb_dimm_opts = 0; QemuOptsList *olist; int optind; const char *optarg; @@ -3200,6 +3221,18 @@ 
int main(int argc, char **argv, char **envp)
             case QEMU_OPTION_qtest_log:
                 qtest_log = optarg;
                 break;
+            case QEMU_OPTION_dimm:
+                if (nb_dimm_opts == MAX_DIMMS) {
+                    fprintf(stderr, "qemu: maximum number of DIMMs (%d) exceeded\n",
+                            MAX_DIMMS);
+                }
+                dimm_opts[nb_dimm_opts] =
+                    qemu_opts_parse(qemu_find_opts("dimm"), optarg, 0);
+                if (!dimm_opts[nb_dimm_opts]) {
+                    exit(1);
+                }
+                nb_dimm_opts++;
+                break;
             default:
                 os_parse_cmd_args
[RFC PATCH v2 09/21] pc: Add dimm paravirt SRAT info
The numa_fw_cfg paravirt interface is extended to include SRAT information for all hotplug-able dimms. There are 3 words for each hotplug-able memory slot, denoting start address, size and node proximity. The new info is appended after existing numa info, so that the fw_cfg layout does not break. This information is used by Seabios to build hotplug memory device objects at runtime. nb_numa_nodes is set to 1 by default (not 0), so that we always pass srat info to SeaBIOS. v1-v2: Dimm SRAT info (#dimms) is appended at end of existing numa fw_cfg in order not to break existing layout Documentation of the new fwcfg layout is included in docs/specs/fwcfg.txt Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- docs/specs/fwcfg.txt | 28 ++ hw/pc.c | 53 - vl.c |2 +- 3 files changed, 80 insertions(+), 3 deletions(-) create mode 100644 docs/specs/fwcfg.txt diff --git a/docs/specs/fwcfg.txt b/docs/specs/fwcfg.txt new file mode 100644 index 000..e6fcd8f --- /dev/null +++ b/docs/specs/fwcfg.txt @@ -0,0 +1,28 @@ +QEMU-BIOS Paravirt Documentation +-- + +This document describes paravirt data structures passed from QEMU to BIOS. + +fw_cfg SRAT paravirt info + +The SRAT info passed from QEMU to BIOS has the following layout: + +--- +#nodes | cpu0_pxm | cpu1_pxm | ... | cpulast_pxm | node0_mem | node1_mem | ... | nodelast_mem + +--- +#dimms | dimm0_start | dimm0_sz | dimm0_pxm | ... | dimmlast_start | dimmlast_sz | dimmlast_pxm + +Entry 0 contains the number of numa nodes (nb_numa_nodes). + +Entries 1..max_cpus: The next max_cpus entries describe node proximity for each +one of the vCPUs in the system. + +Entries max_cpus+1..max_cpus+nb_numa_nodes+1: The next nb_numa_nodes entries +describe the memory size for each one of the NUMA nodes in the system. 
+ +Entry max_cpus+nb_numa_nodes+1 contains the number of memory dimms (nb_hp_dimms) + +The last 3 * nb_hp_dimms entries are organized in triplets: Each triplet contains +the physical address offset, size (in bytes), and node proximity for the +respective dimm. diff --git a/hw/pc.c b/hw/pc.c index ef9901a..cf651d0 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -598,12 +598,15 @@ int e820_add_entry(uint64_t address, uint64_t length, uint32_t type) return index; } +static void setup_hp_dimms(uint64_t *fw_cfg_slots); + static void *bochs_bios_init(void) { void *fw_cfg; uint8_t *smbios_table; size_t smbios_len; uint64_t *numa_fw_cfg; +uint64_t *hp_dimms_fw_cfg; int i, j; register_ioport_write(0x400, 1, 2, bochs_bios_write, NULL); @@ -638,8 +641,10 @@ static void *bochs_bios_init(void) /* allocate memory for the NUMA channel: one (64bit) word for the number * of nodes, one word for each VCPU-node and one word for each node to * hold the amount of memory. + * Finally one word for the number of hotplug memory slots and three words + * for each hotplug memory slot (start address, size and node proximity). 
*/ -numa_fw_cfg = g_malloc0((1 + max_cpus + nb_numa_nodes) * 8); +numa_fw_cfg = g_malloc0((2 + max_cpus + nb_numa_nodes + 3 * nb_hp_dimms) * 8); numa_fw_cfg[0] = cpu_to_le64(nb_numa_nodes); for (i = 0; i max_cpus; i++) { for (j = 0; j nb_numa_nodes; j++) { @@ -652,8 +657,15 @@ static void *bochs_bios_init(void) for (i = 0; i nb_numa_nodes; i++) { numa_fw_cfg[max_cpus + 1 + i] = cpu_to_le64(node_mem[i]); } + +numa_fw_cfg[1 + max_cpus + nb_numa_nodes] = cpu_to_le64(nb_hp_dimms); + +hp_dimms_fw_cfg = numa_fw_cfg + 2 + max_cpus + nb_numa_nodes; +if (nb_hp_dimms) +setup_hp_dimms(hp_dimms_fw_cfg); + fw_cfg_add_bytes(fw_cfg, FW_CFG_NUMA, (uint8_t *)numa_fw_cfg, - (1 + max_cpus + nb_numa_nodes) * 8); + (2 + max_cpus + nb_numa_nodes + 3 * nb_hp_dimms) * 8); return fw_cfg; } @@ -1223,3 +1235,40 @@ target_phys_addr_t pc_set_hp_memory_offset(uint64_t size) return ret; } + +static void setup_hp_dimms(uint64_t *fw_cfg_slots) +{ +int i = 0; +Error *err = NULL; +DeviceState *dev; +DimmState *slot; +const char *type; +BusChild *kid; +BusState *bus = sysbus_get_default(); + +QTAILQ_FOREACH(kid, bus-children, sibling) { +dev = kid-child; +type = object_property_get_str(OBJECT(dev), type, err); +if (err) { +error_free(err); +fprintf(stderr, error getting device type\n); +exit(1); +} + +if (!strcmp(type, dimm)) { +if (!dev-id) { +fprintf(stderr, error getting dimm device id\n
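The fw_cfg layout documented in this patch can be restated as index arithmetic. The following is only a sketch and not part of the patch: the struct and helper names are hypothetical. It follows the indices that bochs_bios_init() actually writes, where the node memory sizes occupy entries max_cpus+1 through max_cpus+nb_numa_nodes and the DIMM count sits in entry max_cpus+nb_numa_nodes+1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: 64-bit word offsets into the FW_CFG_NUMA blob. */
struct numa_fw_cfg_layout {
    size_t nb_nodes_idx;  /* entry 0: number of NUMA nodes */
    size_t cpu_pxm_idx;   /* first of max_cpus per-vCPU proximity entries */
    size_t node_mem_idx;  /* first of nb_numa_nodes memory-size entries */
    size_t nb_dimms_idx;  /* single entry: number of hotplug DIMMs */
    size_t dimm_idx;      /* first (start, size, proximity) triplet */
    size_t total_words;   /* number of 64-bit words in the whole blob */
};

static struct numa_fw_cfg_layout
numa_fw_cfg_layout(size_t max_cpus, size_t nb_numa_nodes, size_t nb_hp_dimms)
{
    struct numa_fw_cfg_layout l;

    l.nb_nodes_idx = 0;
    l.cpu_pxm_idx  = 1;
    l.node_mem_idx = 1 + max_cpus;
    l.nb_dimms_idx = 1 + max_cpus + nb_numa_nodes;
    l.dimm_idx     = 2 + max_cpus + nb_numa_nodes;
    l.total_words  = 2 + max_cpus + nb_numa_nodes + 3 * nb_hp_dimms;
    return l;
}

/* Word offset of one field of DIMM n: 0 = start, 1 = size, 2 = proximity. */
static size_t dimm_field_idx(const struct numa_fw_cfg_layout *l,
                             size_t n, size_t field)
{
    assert(l != NULL && field < 3);
    return l->dimm_idx + 3 * n + field;
}
```

With max_cpus=4, nb_numa_nodes=2 and nb_hp_dimms=3 the blob is 17 words long, which matches the (2 + max_cpus + nb_numa_nodes + 3 * nb_hp_dimms) * 8 byte allocation in the patch.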
[RFC PATCH v2 08/21] pc: calculate dimm physical addresses and adjust memory map
Dimm physical address offsets are calculated automatically and memory map is adjusted accordingly. If a DIMM can fit before the PCI_HOLE_START (currently 0xe000), it will be added normally, otherwise its physical address will be above 4GB. Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- hw/pc.c | 41 + hw/pc.h |6 ++ hw/pc_piix.c | 18 -- vl.c |1 + 4 files changed, 60 insertions(+), 6 deletions(-) diff --git a/hw/pc.c b/hw/pc.c index c7e9ab3..ef9901a 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -48,6 +48,7 @@ #include memory.h #include exec-memory.h #include arch_init.h +#include dimm.h /* output Bochs bios info messages */ //#define DEBUG_BIOS @@ -89,6 +90,9 @@ struct e820_table { static struct e820_table e820_table; struct hpet_fw_config hpet_cfg = {.count = UINT8_MAX}; +ram_addr_t below_4g_hp_mem_size = 0; +ram_addr_t above_4g_hp_mem_size = 0; +extern target_phys_addr_t ram_hp_offset; void gsi_handler(void *opaque, int n, int level) { GSIState *s = opaque; @@ -1182,3 +1186,40 @@ void pc_pci_device_init(PCIBus *pci_bus) pci_create_simple(pci_bus, -1, lsi53c895a); } } + + +/* Function to configure memory offsets of hotpluggable dimms */ + +target_phys_addr_t pc_set_hp_memory_offset(uint64_t size) +{ +target_phys_addr_t ret; + +/* on first call, initialize ram_hp_offset */ +if (!ram_hp_offset) { +if (ram_size = PCI_HOLE_START ) { +ram_hp_offset = 0x1LL + (ram_size - PCI_HOLE_START); +} else { +ram_hp_offset = ram_size; +} +} + +if (ram_hp_offset = 0x1LL) { +ret = ram_hp_offset; +above_4g_hp_mem_size += size; +ram_hp_offset += size; +} +/* if dimm fits before pci hole, append it normally */ +else if (ram_hp_offset + size = PCI_HOLE_START) { +ret = ram_hp_offset; +below_4g_hp_mem_size += size; +ram_hp_offset += size; +} +/* otherwise place it above 4GB */ +else { +ret = 0x1LL; +above_4g_hp_mem_size += size; +ram_hp_offset = 0x1LL + size; +} + +return ret; +} diff --git a/hw/pc.h b/hw/pc.h index 31ccb6f..15bdd7d 100644 --- a/hw/pc.h +++ b/hw/pc.h @@ 
-10,6 +10,7 @@ #include memory.h #include ioapic.h +#define PCI_HOLE_START 0xe000 /* PC-style peripherals (also used by other machines). */ /* serial.c */ @@ -218,6 +219,11 @@ static inline bool isa_ne2000_init(ISABus *bus, int base, int irq, NICInfo *nd) /* pc_sysfw.c */ void pc_system_firmware_init(MemoryRegion *rom_memory); +/* memory hotplug */ +target_phys_addr_t pc_set_hp_memory_offset(uint64_t size); +extern ram_addr_t below_4g_hp_mem_size; +extern ram_addr_t above_4g_hp_mem_size; + /* e820 types */ #define E820_RAM1 #define E820_RESERVED 2 diff --git a/hw/pc_piix.c b/hw/pc_piix.c index 0c0096f..f3f1651 100644 --- a/hw/pc_piix.c +++ b/hw/pc_piix.c @@ -43,6 +43,7 @@ #include xen.h #include memory.h #include exec-memory.h +#include dimm.h #ifdef CONFIG_XEN # include xen/hvm/hvm_info_table.h #endif @@ -155,9 +156,9 @@ static void pc_init1(MemoryRegion *system_memory, kvmclock_create(); } -if (ram_size = 0xe000 ) { -above_4g_mem_size = ram_size - 0xe000; -below_4g_mem_size = 0xe000; +if (ram_size = PCI_HOLE_START ) { +above_4g_mem_size = ram_size - PCI_HOLE_START; +below_4g_mem_size = PCI_HOLE_START; } else { above_4g_mem_size = 0; below_4g_mem_size = ram_size; @@ -172,6 +173,9 @@ static void pc_init1(MemoryRegion *system_memory, rom_memory = system_memory; } +/* adjust memory map for hotplug dimms */ +dimm_calc_offsets(pc_set_hp_memory_offset); + /* allocate ram and load rom/bios */ if (!xen_enabled()) { fw_cfg = pc_memory_init(system_memory, @@ -192,9 +196,11 @@ static void pc_init1(MemoryRegion *system_memory, if (pci_enabled) { pci_bus = i440fx_init(i440fx_state, piix3_devfn, isa_bus, gsi, system_memory, system_io, ram_size, - below_4g_mem_size, - 0x1ULL - below_4g_mem_size, - 0x1ULL + above_4g_mem_size, + below_4g_mem_size + below_4g_hp_mem_size, + 0x1ULL - below_4g_mem_size +- below_4g_hp_mem_size, + 0x1ULL + above_4g_mem_size ++ above_4g_hp_mem_size, (sizeof(target_phys_addr_t) == 4 ? 
0 : ((uint64_t)1 62)), diff --git a/vl.c b/vl.c index 1329c30..0ff8818 100644 --- a/vl.c +++ b/vl.c @@ -176,6 +176,7 @@ DisplayType display_type
[RFC PATCH v2 04/21][SeaBIOS] acpi: generate hotplug memory devices
The memory device generation is guided by qemu paravirt info. Seabios first uses the info to setup SRAT entries for the hotplug-able memory slots. Afterwards, build_memssdt uses the created SRAT entries to generate appropriate memory device objects. One memory device (and corresponding SRAT entry) is generated for each hotplug-able qemu memslot. Currently no SSDT memory device is created for initial system memory. We only support up to 255 DIMMs for now (PackageOp used for the MEON array can only describe an array of at most 255 elements. VarPackageOp would be needed to support more than 255 devices) v1-v2: Seabios reads mems_sts from qemu to build e820_map SSDT size and some offsets are calculated with extraction macros. Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- src/acpi.c | 158 +-- 1 files changed, 152 insertions(+), 6 deletions(-) diff --git a/src/acpi.c b/src/acpi.c index 55e4607..c83e8c7 100644 --- a/src/acpi.c +++ b/src/acpi.c @@ -510,6 +510,127 @@ build_ssdt(void) return ssdt; } +#include ssdt-mem.hex + +/* 0x5B 0x82 DeviceOp PkgLength NameString DimmID */ +#define MEM_BASE 0xaf80 +#define SD_MEM (ssdm_mem_aml + *ssdt_mem_start) +#define SD_MEMSIZEOF (*ssdt_mem_end - *ssdt_mem_start) +#define SD_OFFSET_MEMHEX (*ssdt_mem_name - *ssdt_mem_start + 2) +#define SD_OFFSET_MEMID (*ssdt_mem_id - *ssdt_mem_start) +#define SD_OFFSET_PXMID 31 +#define SD_OFFSET_MEMSTART 55 +#define SD_OFFSET_MEMEND 63 +#define SD_OFFSET_MEMSIZE 79 + +u64 nb_hp_memslots = 0; +struct srat_memory_affinity *mem; + +static void build_memdev(u8 *ssdt_ptr, int i, u64 mem_base, u64 mem_len, u8 node) +{ +memcpy(ssdt_ptr, SD_MEM, SD_MEMSIZEOF); +ssdt_ptr[SD_OFFSET_MEMHEX] = getHex(i 4); +ssdt_ptr[SD_OFFSET_MEMHEX+1] = getHex(i); +ssdt_ptr[SD_OFFSET_MEMID] = i; +ssdt_ptr[SD_OFFSET_PXMID] = node; +*(u64*)(ssdt_ptr + SD_OFFSET_MEMSTART) = mem_base; +*(u64*)(ssdt_ptr + SD_OFFSET_MEMEND) = mem_base + mem_len; +*(u64*)(ssdt_ptr + SD_OFFSET_MEMSIZE) = mem_len; +} + 
+static void* +build_memssdt(void) +{ +u64 mem_base; +u64 mem_len; +u8 node; +int i; +struct srat_memory_affinity *entry = mem; +u64 nb_memdevs = nb_hp_memslots; +u8 memslot_status, enabled; + +int length = ((1+3+4) + + (nb_memdevs * SD_MEMSIZEOF) + + (1+2+5+(12*nb_memdevs)) + + (6+2+1+(1*nb_memdevs))); +u8 *ssdt = malloc_high(sizeof(struct acpi_table_header) + length); +if (! ssdt) { +warn_noalloc(); +return NULL; +} +u8 *ssdt_ptr = ssdt + sizeof(struct acpi_table_header); + +// build Scope(_SB_) header +*(ssdt_ptr++) = 0x10; // ScopeOp +ssdt_ptr = encodeLen(ssdt_ptr, length-1, 3); +*(ssdt_ptr++) = '_'; + *(ssdt_ptr++) = 'S'; +*(ssdt_ptr++) = 'B'; +*(ssdt_ptr++) = '_'; + +for (i = 0; i nb_memdevs; i++) { +mem_base = (((u64)(entry-base_addr_high) 32 )| entry-base_addr_low); +mem_len = (((u64)(entry-length_high) 32 )| entry-length_low); +node = entry-proximity[0]; +build_memdev(ssdt_ptr, i, mem_base, mem_len, node); +ssdt_ptr += SD_MEMSIZEOF; +entry++; +} + +// build Method(MTFY, 2) {If (LEqual(Arg0, 0x00)) {Notify(CM00, Arg1)} ...} +*(ssdt_ptr++) = 0x14; // MethodOp +ssdt_ptr = encodeLen(ssdt_ptr, 2+5+(12*nb_memdevs), 2); +*(ssdt_ptr++) = 'M'; +*(ssdt_ptr++) = 'T'; +*(ssdt_ptr++) = 'F'; +*(ssdt_ptr++) = 'Y'; +*(ssdt_ptr++) = 0x02; +for (i=0; inb_memdevs; i++) { +*(ssdt_ptr++) = 0xA0; // IfOp + ssdt_ptr = encodeLen(ssdt_ptr, 11, 1); +*(ssdt_ptr++) = 0x93; // LEqualOp +*(ssdt_ptr++) = 0x68; // Arg0Op +*(ssdt_ptr++) = 0x0A; // BytePrefix +*(ssdt_ptr++) = i; +*(ssdt_ptr++) = 0x86; // NotifyOp +*(ssdt_ptr++) = 'M'; +*(ssdt_ptr++) = 'P'; +*(ssdt_ptr++) = getHex(i 4); +*(ssdt_ptr++) = getHex(i); +*(ssdt_ptr++) = 0x69; // Arg1Op +} + +// build Name(MEON, Package() { One, One, ..., Zero, Zero, ... 
}) +*(ssdt_ptr++) = 0x08; // NameOp +*(ssdt_ptr++) = 'M'; +*(ssdt_ptr++) = 'E'; +*(ssdt_ptr++) = 'O'; +*(ssdt_ptr++) = 'N'; +*(ssdt_ptr++) = 0x12; // PackageOp +ssdt_ptr = encodeLen(ssdt_ptr, 2+1+(1*nb_memdevs), 2); +*(ssdt_ptr++) = nb_memdevs; + +entry = mem; +memslot_status = 0; + +for (i = 0; i nb_memdevs; i++) { +enabled = 0; +if (i % 8 == 0) +memslot_status = inb(MEM_BASE + i/8); +enabled = memslot_status 1; +mem_base = (((u64)(entry-base_addr_high) 32 )| entry-base_addr_low); +mem_len = (((u64)(entry-length_high) 32 )| entry-length_low); +*(ssdt_ptr++) = enabled ? 0x01 : 0x00; +if (enabled) +add_e820(mem_base, mem_len, E820_RAM); +memslot_status = memslot_status 1; +entry
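The MEON loop above reads one status byte per eight slots and consumes it bit by bit. A minimal sketch of that decoding, assuming the archived "memslot_status 1" lines were originally "& 1" and ">> 1": the port reads (inb(MEM_BASE + i/8)) are replaced here by a plain byte array, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode a per-slot status bitmap: bit (i % 8) of byte (i / 8) is 1 when
 * memory slot i is populated. `status` stands in for the bytes the BIOS
 * would read from the I/O ports (an assumption made for this sketch).
 * Writes 0 or 1 into enabled[i] and returns the number of enabled slots.
 */
static size_t decode_memslot_status(const uint8_t *status, size_t nb_slots,
                                    uint8_t *enabled)
{
    size_t i, count = 0;
    uint8_t cur = 0;

    assert(status != NULL && enabled != NULL);

    for (i = 0; i < nb_slots; i++) {
        if (i % 8 == 0) {
            cur = status[i / 8];    /* fetch the next status byte */
        }
        enabled[i] = cur & 1;       /* lowest bit describes slot i */
        count += enabled[i];
        cur >>= 1;                  /* move on to the next slot's bit */
    }
    return count;
}
```

In the patch, each enabled slot additionally emits 0x01 into the MEON package and registers its range with add_e820(); disabled slots emit 0x00.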
[RFC PATCH v2 02/21][SeaBIOS] Add SSDT memory device support
Define SSDT hotplug-able memory devices in _SB namespace. The dynamically generated SSDT includes per memory device hotplug methods. These methods just call methods defined in the DSDT. Also dynamically generate a MTFY method and a MEON array of the online/available memory devices. ACPI extraction macros are used to place the AML code in variables later used by src/acpi. The design is taken from SSDT cpu generation. Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- Makefile |2 +- src/ssdt-mem.dsl | 65 ++ 2 files changed, 66 insertions(+), 1 deletions(-) create mode 100644 src/ssdt-mem.dsl diff --git a/Makefile b/Makefile index fe974f7..299069e 100644 --- a/Makefile +++ b/Makefile @@ -228,7 +228,7 @@ $(OUT)%.hex: src/%.dsl ./tools/acpi_extract_preprocess.py ./tools/acpi_extract.p $(Q)$(PYTHON) ./tools/acpi_extract.py $(OUT)$*.lst $(OUT)$*.off $(Q)cat $(OUT)$*.off $@ -$(OUT)ccode32flat.o: $(OUT)acpi-dsdt.hex $(OUT)ssdt-proc.hex $(OUT)ssdt-pcihp.hex +$(OUT)ccode32flat.o: $(OUT)acpi-dsdt.hex $(OUT)ssdt-proc.hex $(OUT)ssdt-pcihp.hex $(OUT)ssdt-mem.hex Kconfig rules diff --git a/src/ssdt-mem.dsl b/src/ssdt-mem.dsl new file mode 100644 index 000..ee322f0 --- /dev/null +++ b/src/ssdt-mem.dsl @@ -0,0 +1,65 @@ +/* This file is the basis for the ssdt_mem[] variable in src/acpi.c. + * It is similar in design to the ssdt_proc variable. + * It defines the contents of the per-cpu Processor() object. At + * runtime, a dynamically generated SSDT will contain one copy of this + * AML snippet for every possible memory device in the system. The + * objects will * be placed in the \_SB_ namespace. + * + * In addition to the aml code generated from this file, the + * src/acpi.c file creates a MEMNTFY method with an entry for each memdevice: + * Method(MTFY, 2) { + * If (LEqual(Arg0, 0x00)) { Notify(MP00, Arg1) } + * If (LEqual(Arg0, 0x01)) { Notify(MP01, Arg1) } + * ... 
+ * } + * and a MEON array with the list of active and inactive memory devices: + * Name(MEON, Package() { One, One, ..., Zero, Zero, ... }) + */ +ACPI_EXTRACT_ALL_CODE ssdm_mem_aml + +DefinitionBlock (ssdt-mem.aml, SSDT, 0x02, BXPC, CSSDT, 0x1) +/* v-- DO NOT EDIT --v */ +{ +ACPI_EXTRACT_DEVICE_START ssdt_mem_start +ACPI_EXTRACT_DEVICE_END ssdt_mem_end +ACPI_EXTRACT_DEVICE_STRING ssdt_mem_name +Device(MPAA) { +ACPI_EXTRACT_NAME_BYTE_CONST ssdt_mem_id +Name(ID, 0xAA) +/* ^-- DO NOT EDIT --^ + * + * The src/acpi.c code requires the above layout so that it can update + * MPAA and 0xAA with the appropriate MEMDEVICE id (see + * SD_OFFSET_MEMHEX/MEMID1/MEMID2). Don't change the above without + * also updating the C code. + */ +Name(_HID, EISAID(PNP0C80)) +Name(_PXM, 0xAA) + +External(CMST, MethodObj) +External(MPEJ, MethodObj) + +Name(_CRS, ResourceTemplate() { +QwordMemory( + ResourceConsumer, + , + MinFixed, + MaxFixed, + Cacheable, + ReadWrite, + 0x0, + 0xDEADBEEF, + 0xE6ADBEEE, + 0x, + 0x0800, + ) +}) +Method (_STA, 0) { +Return(CMST(ID)) +} +Method (_EJ0, 1, NotSerialized) { +MPEJ(ID, Arg0) +} +} +} + -- 1.7.9 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
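The template device MPAA is renamed per slot at runtime by patching two hex digits into the name. A small sketch of that naming scheme, assuming upper-case hex digits as ACPI names require: hex_digit() is a hypothetical stand-in for the getHex() helper used in src/acpi.c, and memdev_name() does not exist in SeaBIOS.

```c
#include <assert.h>

/* Hypothetical stand-in for getHex(): low nibble to an ASCII hex digit. */
static char hex_digit(unsigned val)
{
    val &= 0x0f;
    return (char)(val <= 9 ? '0' + val : 'A' + (val - 10));
}

/* Fill name with "MPxx", where xx is the slot id as two hex digits. */
static void memdev_name(unsigned id, char name[5])
{
    assert(id <= 0xff);     /* only two hex digits are available */
    name[0] = 'M';
    name[1] = 'P';
    name[2] = hex_digit(id >> 4);
    name[3] = hex_digit(id);
    name[4] = '\0';
}
```

Slot 0 therefore becomes MP00 and slot 31 becomes MP1F, which are the names the generated MTFY method passes to Notify().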
Re: [RFC PATCH v2 04/21][SeaBIOS] acpi: generate hotplug memory devices
Hi, On Wed, Jul 11, 2012 at 06:48:38PM +0800, Wen Congyang wrote: +if (enabled) +add_e820(mem_base, mem_len, E820_RAM); add_e820() is declared in memmap.h. You should include this header file, otherwise, seabios cannot be built. thanks. you had the same comment on v1 but I forgot to address it. I will update. - Vasilis Thanks Wen Congyang +memslot_status = memslot_status 1; +entry++; +} +build_header((void*)ssdt, SSDT_SIGNATURE, ssdt_ptr - ssdt, 1); + +return ssdt; +} + #include ssdt-pcihp.hex #define PCI_RMV_BASE 0xae0c @@ -618,9 +739,6 @@ build_srat(void) { int nb_numa_nodes = qemu_cfg_get_numa_nodes(); -if (nb_numa_nodes == 0) -return NULL; - u64 *numadata = malloc_tmphigh(sizeof(u64) * (MaxCountCPUs + nb_numa_nodes)); if (!numadata) { warn_noalloc(); @@ -629,10 +747,11 @@ build_srat(void) qemu_cfg_get_numa_data(numadata, MaxCountCPUs + nb_numa_nodes); +qemu_cfg_get_numa_data(nb_hp_memslots, 1); struct system_resource_affinity_table *srat; int srat_size = sizeof(*srat) + sizeof(struct srat_processor_affinity) * MaxCountCPUs + -sizeof(struct srat_memory_affinity) * (nb_numa_nodes + 2); +sizeof(struct srat_memory_affinity) * (nb_numa_nodes + nb_hp_memslots + 2); srat = malloc_high(srat_size); if (!srat) { @@ -667,7 +786,7 @@ build_srat(void) * from 640k-1M and possibly another one from 3.5G-4G. 
*/ struct srat_memory_affinity *numamem = (void*)core; -int slots = 0; +int slots = 0, node; u64 mem_len, mem_base, next_base = 0; acpi_build_srat_memory(numamem, 0, 640*1024, 0, 1); @@ -694,10 +813,36 @@ build_srat(void) next_base += (1ULL 32) - RamSize; } acpi_build_srat_memory(numamem, mem_base, mem_len, i-1, 1); + numamem++; slots++; + +} +mem = (void*)numamem; + +if (nb_hp_memslots) { +u64 *hpmemdata = malloc_tmphigh(sizeof(u64) * (3 * nb_hp_memslots)); +if (!hpmemdata) { +warn_noalloc(); +free(hpmemdata); +free(numadata); +return NULL; +} + +qemu_cfg_get_numa_data(hpmemdata, 3 * nb_hp_memslots); + +for (i = 1; i nb_hp_memslots + 1; ++i) { +mem_base = *hpmemdata++; +mem_len = *hpmemdata++; +node = *hpmemdata++; +acpi_build_srat_memory(numamem, mem_base, mem_len, node, 1); +numamem++; +slots++; +} +free(hpmemdata); } -for (; slots nb_numa_nodes + 2; slots++) { + +for (; slots nb_numa_nodes + nb_hp_memslots + 2; slots++) { acpi_build_srat_memory(numamem, 0, 0, 0, 0); numamem++; } @@ -748,6 +893,7 @@ acpi_bios_init(void) ACPI_INIT_TABLE(build_madt()); ACPI_INIT_TABLE(build_hpet()); ACPI_INIT_TABLE(build_srat()); +ACPI_INIT_TABLE(build_memssdt()); ACPI_INIT_TABLE(build_pcihp()); u16 i, external_tables = qemu_cfg_acpi_additional_tables(); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
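The hotplug part of build_srat() consumes (start, size, proximity) triplets and stores each 64-bit value as two 32-bit halves, which build_memssdt() later recombines. The sketch below illustrates only that round trip. struct mem_affinity is a simplified, hypothetical stand-in for struct srat_memory_affinity, and the words are assumed to be in host byte order already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for struct srat_memory_affinity. */
struct mem_affinity {
    uint32_t base_addr_low, base_addr_high;
    uint32_t length_low, length_high;
    uint8_t  proximity;
    uint8_t  enabled;
};

/* Convert nb_dimms triplets of 64-bit words into affinity entries. */
static size_t parse_dimm_triplets(const uint64_t *words, size_t nb_dimms,
                                  struct mem_affinity *out)
{
    size_t i;

    assert(words != NULL && out != NULL);

    for (i = 0; i < nb_dimms; i++) {
        uint64_t base = words[3 * i];
        uint64_t len  = words[3 * i + 1];
        uint64_t node = words[3 * i + 2];

        out[i].base_addr_low  = (uint32_t)base;
        out[i].base_addr_high = (uint32_t)(base >> 32);
        out[i].length_low     = (uint32_t)len;
        out[i].length_high    = (uint32_t)(len >> 32);
        out[i].proximity      = (uint8_t)node;
        out[i].enabled        = 1;
    }
    return nb_dimms;
}

/* Recombine the two halves, as the memory device generation does. */
static uint64_t affinity_base(const struct mem_affinity *e)
{
    return ((uint64_t)e->base_addr_high << 32) | e->base_addr_low;
}
```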
Re: [RFC PATCH v2 05/21][SeaBIOS] pciinit: Fix pcimem_start value
Hi,

On Wed, Jul 11, 2012 at 01:56:19PM +0200, Gerd Hoffmann wrote:
> On 07/11/12 12:31, Vasilis Liaskovitis wrote:
> > In order to hotplug memory between RamSize and BUILD_PCIMEM_START, the
> > pci window needs to start at BUILD_PCIMEM_START (0xe0000000).
> > Otherwise, the guest cannot online new dimms at those ranges due to
> > pci_root window conflicts. (workaround for linux guest is booting with
> > pci=nocrs)
> >
> >  static void pci_bios_map_devices(struct pci_bus *busses)
> >  {
> > -    pcimem_start = RamSize;
> > +    pcimem_start = BUILD_PCIMEM_START;
>
> It isn't that simple. For the 32bit pci window it will work, but it will
> leave address space unused instead of assigning it to the 32bit pci
> window. For the 64bit pci window it will not work.
>
> You have to walk the dimms and figure out what the highest used address
> is, for both below-4g and above-4g. Then fill two variables with it and
> make the pci init code use that instead of RamSize and RamSizeOver4G.

I see. I already have these values computed in qemu-kvm, so I can pass
them in a paravirt struct, or infer them from the dimm/srat paravirt info
that I already pass to seabios.

If I understand correctly, we would like the pcimem windows to use the
maximum possible address space (constrained by the exact dimms/ranges
which are defined) instead of leaving unused space.

thanks,

- Vasilis
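Gerd's suggestion to walk the dimms and record the highest used address on each side of 4 GB could look roughly like the sketch below. None of these names exist in SeaBIOS. It assumes that no DIMM straddles the 4 GB boundary and that the "above" result is an absolute address rather than a size like RamSizeOver4G.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FOUR_GB 0x100000000ULL

/* Hypothetical description of one DIMM's physical range. */
struct dimm_range {
    uint64_t start;
    uint64_t size;
};

/* Highest used addresses (exclusive) below and above 4 GB. */
struct ram_tops {
    uint64_t below_4g;
    uint64_t above_4g;
};

static struct ram_tops find_ram_tops(uint64_t ram_size,
                                     uint64_t ram_size_over_4g,
                                     const struct dimm_range *dimms,
                                     size_t nb_dimms)
{
    struct ram_tops t;
    size_t i;

    assert(dimms != NULL || nb_dimms == 0);

    /* Start from the initial RAM, then extend with every DIMM. */
    t.below_4g = ram_size;
    t.above_4g = FOUR_GB + ram_size_over_4g;

    for (i = 0; i < nb_dimms; i++) {
        uint64_t end = dimms[i].start + dimms[i].size;

        if (dimms[i].start < FOUR_GB) {
            if (end > t.below_4g) {
                t.below_4g = end;
            }
        } else if (end > t.above_4g) {
            t.above_4g = end;
        }
    }
    return t;
}
```

The 32bit pci window could then start at below_4g and the 64bit window at above_4g, so that populated and unpopulated DIMM ranges are both kept out of the PCI windows.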
Re: [Qemu-devel] [RFC PATCH v2 13/21] Implement memory hotplug notification lists
Hi,

On Wed, Jul 11, 2012 at 08:59:03AM -0600, Eric Blake wrote:
> On 07/11/2012 04:31 AM, Vasilis Liaskovitis wrote:
> > Guest can respond to ACPI hotplug events e.g. with _EJ or _OST method.
> > This patch implements a tail queue to store guest notifications for
> > memory hot-add and hot-remove requests.
> >
> > Guest responses for memory hotplug command on a per-dimm basis can be
> > detected with the new hmp command info memhp or the new qmp command
> > query-memhp
> > Examples:
>
> > +++ b/qapi-schema.json
> > @@ -1862,3 +1862,29 @@
> >  # Since: 0.14.0
> >  ##
> >  { 'command': 'netdev_del', 'data': {'id': 'str'} }
> > +
> > +##
> > +# @MemHpInfo:
> > +#
> > +# Information about status of a memory hotplug command
> > +#
> > +# @Dimm: the Dimm associated with the result
> > +#
> > +# @result: the result of the hotplug command
> > +#
> > +# Since: 1.1.3
>
> Should probably be 1.2, not 1.1.3.

right

> > +#
> > +##
> > +{ 'type': 'MemHpInfo',
> > +  'data': {'Dimm': 'str', 'request': 'str', 'result': 'str'} }
>
> Why the upper case? Wouldn't 'dimm' be more consistent?

I will change to dimm

> > +
> > +##
> > +# @query-memhp:
>
> Why are we abbreviating? It might be better to name the QMP command
> query-memory-hotplug

agreed, memhp is a bit cryptic. I will change to your suggestion

> > +#
> > +# Returns a list of information about pending hotplug commands
> > +#
> > +# Returns: a list of @MemhpInfo
> > +#
> > +# Since: 1.1.3
>
> Likewise for 1.2.

right

> > +
> > +- Dimm: Dimm name (json-str)
> > +- request: type of hot request: hot-add or hot-remove (json-str)
> > +- result: result of the hotplug request for this Dimm success or failure (json-str)
>
> This may need tweaks (such as s/Dimm/dimm/) based on resolution of above
> comments.

ok, it will be dimm

thanks,

- Vasilis
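The notification list described in this patch is a tail queue of per-dimm results. The sketch below shows the general shape of such a queue using the BSD TAILQ macros from sys/queue.h; QEMU has its own QTAILQ macros, so this is a swapped-in illustration rather than the patch's code. All type and function names are hypothetical, and the field names mirror the dimm/request/result members discussed above.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

enum memhp_request { MEMHP_HOT_ADD, MEMHP_HOT_REMOVE };
enum memhp_result  { MEMHP_SUCCESS, MEMHP_FAILURE };

/* One guest response to a hot-add or hot-remove request. */
struct memhp_event {
    const char *dimm;               /* dimm id, owned by the caller */
    enum memhp_request request;
    enum memhp_result result;
    TAILQ_ENTRY(memhp_event) next;  /* tail queue linkage */
};

TAILQ_HEAD(memhp_queue, memhp_event);

/* Append a notification; returns NULL if allocation fails. */
static struct memhp_event *memhp_record(struct memhp_queue *q,
                                        const char *dimm,
                                        enum memhp_request request,
                                        enum memhp_result result)
{
    struct memhp_event *ev;

    assert(q != NULL && dimm != NULL);

    ev = malloc(sizeof(*ev));
    if (!ev) {
        return NULL;
    }
    ev->dimm = dimm;
    ev->request = request;
    ev->result = result;
    TAILQ_INSERT_TAIL(q, ev, next);
    return ev;
}

/* Remove and free every queued notification; returns how many there were. */
static size_t memhp_drain(struct memhp_queue *q)
{
    struct memhp_event *ev;
    size_t n = 0;

    assert(q != NULL);

    while ((ev = TAILQ_FIRST(q)) != NULL) {
        TAILQ_REMOVE(q, ev, next);
        free(ev);
        n++;
    }
    return n;
}
```

A query command would walk the queue with TAILQ_FOREACH and report each entry in arrival order, which is why a tail queue fits better than a stack here.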
Re: [Qemu-devel] [RFC PATCH v2 19/21] Implement info memtotal and query-memtotal
Hi,

On Wed, Jul 11, 2012 at 09:14:29AM -0600, Eric Blake wrote:
> On 07/11/2012 04:32 AM, Vasilis Liaskovitis wrote:
> > Returns total memory of guest in bytes, including hotplugged memory.
> >
> > Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
>
> Should this instead be merged with query-balloon output, so that we have
> a single command that shows all aspects of memory usage (both balloon
> and hotplug at once)?
>
> > @@ -1888,3 +1888,15 @@
> >  # Since: 1.1.3
> >  ##
> >  { 'command': 'query-memhp', 'returns': ['MemHpInfo'] }
> > +
> > +##
> > +# @query-memtotal:
>
> A more generic name might be 'query-memory', especially if we merge
> balloon and hotplug information into one command.

query-memory sounds reasonable to me. query-balloon should also be updated
to show the correct memory. Do you foresee any issues with merging them?
The query-memory command should work independently of the balloon driver.

> > +#
> > +# Returns total memory in bytes, including hotplugged dimms
> > +#
> > +# Returns: a l
>
> truncated

sorry about that.

thanks,

- Vasilis

> --
> Eric Blake   ebl...@redhat.com   +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
Re: [RFC PATCH v2 20/21] Implement -dimms, -dimmspop command line options
Hi,

On Wed, Jul 11, 2012 at 05:55:25PM +0300, Avi Kivity wrote:
> On 07/11/2012 01:32 PM, Vasilis Liaskovitis wrote:
> > Implement batch dimm creation command line options. These could be
> > useful for not bloating the command line with a large number of dimms.
>
> IMO this is unneeded. With a management tool there is no problem
> generating a long command line; from the command line -dimm will be a
> rarely used option.

ok, I thought so. I guess this patch and the next are unwanted, unless
there is a strong opinion for using them coming from others.

thanks,

- Vasilis
Re: [RFC PATCH 0/9] ACPI memory hotplug
Hi,

On Tue, Apr 24, 2012 at 10:52:24AM +0300, Gleb Natapov wrote:
> On Mon, Apr 23, 2012 at 02:31:15PM +0200, Vasilis Liaskovitis wrote:
> > The 440fx spec mentions: The address range from the top of main DRAM
> > to 4 Gbytes (top of physical memory space supported by the 440FX
> > PCIset) is normally mapped to PCI. The PMC forwards all accesses
> > within this address range to PCI.
> >
> > What we probably want is that the initial memory map creation takes
> > into account all dimms specified (both populated/unpopulated)
>
> Yes.
>
> > So -m 1G, -device dimm,size=1G,populated=true
> > -device dimm,size=1G,populated=false would create a system map with
> > top of memory and start of PCI-hole at 2G.
>
> What does -m 1G mean on this command line? Isn't it redundant?

yes, this was redundant with the original concept.

> Maybe we should make -m create a non-unpluggable, populated slot
> starting at address 0. Then your config above will specify 3G memory
> with 2G populated (the first of which is not removable) and 1G
> unpopulated. PCI hole starts above 3G.

I agree -m should mean one big unpluggable slot.

So in the new proposal, -device dimm populated=true means a hot-removable
dimm that has already been hotplugged.

A question here is when exactly the initial hot-add event for this dimm
should be played. If the relevant OSPM has not yet been initialized (e.g.
the acpi_memhotplug module in a linux guest needs to be loaded), the guest
may not see the event. This is a general issue of course, but with
initially populated hot-removable dimms it may be a bigger issue. Can ospm
acpi initialization be detected?

Or maybe you are suggesting populated=true is part of initial memory (i.e.
not hot-added, but still hot-removable). Though in that case the guest OS
may use it for bootmem allocations, making hot-remove more likely to fail
at the memory offlining stage.

> > This may require some shifting of physical address offsets around
> > 3.5GB-4GB - is this the minimum PCI hole allowed?
>
> Currently it is 1G in QEMU code.

ok

> > E.g. if we specify 4x1GB DIMMs (only the first initially populated)
> > -m 1G, -device dimm,size=1G,populated=true
> > -device dimm,size=1G,populated=false
> > -device dimm,size=1G,populated=false
> > -device dimm,size=1G,populated=false
> >
> > we create the following memory map:
> > dimm0: [0,1G)
> > dimm1: [1G, 2G)
> > dimm2: [2G, 3G)
> > dimm3: [4G, 5G) or dimm3 is split into [3G, 3.5G) and [4G, 4.5G)
> >
> > does either of these options sound reasonable?
>
> We shouldn't split dimms IMO. Just unnecessary complication. Better make
> bigger PCI hole.

ok

thanks,

- Vasilis
Re: [RFC PATCH 3/9][SeaBIOS] acpi: generate hotplug memory devices.
Hi,

On Mon, Apr 23, 2012 at 07:37:51PM -0400, Kevin O'Connor wrote:
> On Thu, Apr 19, 2012 at 04:08:41PM +0200, Vasilis Liaskovitis wrote:
> > The memory device generation is guided by qemu paravirt info. Seabios
> > first uses the info to setup SRAT entries for the hotplug-able memory
> > slots. Afterwards, build_memssdt uses the created SRAT entries to
> > generate appropriate memory device objects. One memory device (and
> > corresponding SRAT entry) is generated for each hotplug-able qemu
> > memslot. Currently no SSDT memory device is created for initial system
> > memory (the method can be generalized to all memory though).
> >
> > Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
> > ---
> >  src/acpi.c | 151 ++--
> >  1 files changed, 147 insertions(+), 4 deletions(-)
> >
> > diff --git a/src/acpi.c b/src/acpi.c
> > index 30888b9..5580099 100644
> > --- a/src/acpi.c
> > +++ b/src/acpi.c
> > @@ -484,6 +484,131 @@ build_ssdt(void)
> >      return ssdt;
> >  }
> >
> > +static unsigned char ssdt_mem[] = {
> > +    0x5b,0x82,0x47,0x07,0x4d,0x50,0x41,0x41,
>
> This patch looks like it uses the SSDT generation mechanism that was
> present in SeaBIOS v1.6.3. Since then, however, the runtime AML code
> generation has been improved to be more dynamic. Any runtime generated
> AML code should be updated to use the newer mechanisms.

thanks, I will look into the new mechanism and rewrite.

- Vasilis
Re: [RFC PATCH 8/9] pc: adjust e820 map on hot-add and hot-remove
On Sun, Apr 22, 2012 at 04:58:47PM +0300, Gleb Natapov wrote:
> On Thu, Apr 19, 2012 at 04:08:46PM +0200, Vasilis Liaskovitis wrote:
> > Hotplugged memory is not persistent in the e820 memory maps. After
> > hotplugging a memslot and rebooting the VM, the hotplugged device is
> > not present.
> >
> > A possible solution is to add an e820 entry for the new memslot in the
> > acpi_piix4 hot-add handler. On a reset, Seabios (see next patch in
> > series) will enable all memory devices for which it finds an e820
> > entry that covers the device's address range.
> >
> > On hot-remove, the acpi_piix4 handler will try to remove the e820
> > entry corresponding to the device. This will work when no VM reboots
> > happen between hot-add and hot-remove, but it is not a sufficient
> > solution in general: Seabios and GuestOS merge adjacent e820 entries
> > on machine reboot, so the sequence hot-add / rebootVM / hot-remove
> > will fail to remove a corresponding e820 entry at the hot-remove
> > phase.
>
> Why do you need this patch and the next one? Bios can restore the state
> of memslots and build e820 map by reading mems_sts.

I see, that is a simpler solution. Since qemu currently creates most ram
e820map entries and passes them to seabios, I tried to follow the same
approach. But your suggestion makes things easier and we don't have to
worry about merged e820 entries on hot-remove. I'll rework it.

thanks,

Vasilis
Re: [RFC PATCH 0/9] ACPI memory hotplug
Hi, On Sun, Apr 22, 2012 at 05:20:59PM +0300, Gleb Natapov wrote: On Sun, Apr 22, 2012 at 05:13:27PM +0300, Avi Kivity wrote: On 04/22/2012 05:09 PM, Gleb Natapov wrote: On Sun, Apr 22, 2012 at 05:06:43PM +0300, Avi Kivity wrote: On 04/22/2012 04:56 PM, Gleb Natapov wrote: start. We will need it for migration anyway. hotplug-able memory slots i.e. initial system memory is not modeled with memslots. The concept could be generalized to include all memory though, or it could more closely follow kvm-memory slots. OK, I hope final version will allow for memory 4G to be hot-pluggable. Why is that important? Because my feeling is that people that want to use this kind of feature what to start using it with VMs smaller than 4G. Of course not all memory have to be hot unpluggable. Making first 1M or event first 128M not unpluggable make perfect sense. Can't you achieve this with -m 1G, -device dimm,size=1G,populated=true -device dimm,size=1G,populated=false? From this: (for hw/pc.c PCI hole is currently [below_4g_mem_size, 4G), so hotplugged memory should start from max(4G, above_4g_mem_size). I understand that hotpluggable memory can start from above 4G only. With the config above we will have memory hole from 1G to PCI memory hole. May be not a big problem, but I do not see technical reason for the constrain. The 440fx spec mentions: The address range from the top of main DRAM to 4 Gbytes (top of physical memory space supported by the 440FX PCIset) is normally mapped to PCI. The PMC forwards all accesses within this address range to PCI. What we probably want is that the initial memory map creation takes into account all dimms specified (both populated/unpopulated) So -m 1G, -device dimm,size=1G,populated=true -device dimm,size=1G,populated=false would create a system map with top of memory and start of PCI-hole at 2G. This may require some shifting of physical address offsets around 3.5GB-4GB - is this the minimum PCI hole allowed? E.g. 
if we specify 4x1GB DIMMs (only the first initially populated)

  -m 1G, -device dimm,size=1G,populated=true -device dimm,size=1G,populated=false -device dimm,size=1G,populated=false -device dimm,size=1G,populated=false

we create the following memory map:

  dimm0: [0, 1G)
  dimm1: [1G, 2G)
  dimm2: [2G, 3G)
  dimm3: [4G, 5G)

or dimm3 is split into [3G, 3.5G) and [4G, 4.5G). Does either of these options sound reasonable?

thanks,

- Vasilis
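The two placements proposed for dimm3 can be sketched in C. This is only an illustration under stated assumptions, not code from the series: `PCI_HOLE_START`, `PCI_HOLE_END`, `struct range` and `place_dimm()` are hypothetical names, and the hole is assumed to be [3.5G, 4G) as in the question above.

```c
#include <stdint.h>

#define GiB (1ULL << 30)
/* Assumed PCI hole [3.5G, 4G); the real bounds depend on the machine model. */
#define PCI_HOLE_START (3 * GiB + GiB / 2)
#define PCI_HOLE_END   (4 * GiB)

struct range {
    uint64_t start;
    uint64_t end;   /* exclusive */
};

/*
 * Place one DIMM of `size` bytes at the cursor *next and advance the cursor.
 * Returns the number of ranges written to out[] (1 or 2).
 * A DIMM that would overlap the hole is either moved above 4G
 * (allow_split == 0) or split around the hole (allow_split != 0).
 */
static int place_dimm(uint64_t *next, uint64_t size, int allow_split,
                      struct range out[2])
{
    uint64_t start = *next;

    /* Cursor at or above the start of the hole: place above the hole. */
    if (start >= PCI_HOLE_START) {
        if (start < PCI_HOLE_END) {
            start = PCI_HOLE_END;
        }
        out[0].start = start;
        out[0].end = start + size;
        *next = out[0].end;
        return 1;
    }

    /* Fits entirely below the hole. */
    if (start + size <= PCI_HOLE_START) {
        out[0].start = start;
        out[0].end = start + size;
        *next = out[0].end;
        return 1;
    }

    /* Overlaps the hole: move the whole DIMM above 4G ... */
    if (!allow_split) {
        out[0].start = PCI_HOLE_END;
        out[0].end = PCI_HOLE_END + size;
        *next = out[0].end;
        return 1;
    }

    /* ... or split it into a piece below and a piece above the hole. */
    uint64_t below = PCI_HOLE_START - start;
    out[0].start = start;
    out[0].end = PCI_HOLE_START;
    out[1].start = PCI_HOLE_END;
    out[1].end = PCI_HOLE_END + (size - below);
    *next = out[1].end;
    return 2;
}
```

With the cursor at 3G and a 1G DIMM, `allow_split == 0` gives the first option ([4G, 5G), leaving [3G, 3.5G) unused) and `allow_split != 0` gives the second ([3G, 3.5G) plus [4G, 4.5G)).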
Re: [Qemu-devel] [RFC PATCH 2/9][SeaBIOS] Implement acpi-dsdt functions for memory hotplug.
Hi,

On Fri, Apr 20, 2012 at 12:55:24PM +0200, Igor Mammedov wrote:

+/* Memory eject notify method */
+OperationRegion(MEMJ, SystemIO, 0xaf40, 32)
+Field (MEMJ, ByteAcc, NoLock, Preserve)
+{
+    MPE, 256
+}
+
+Method (MPEJ, 2, NotSerialized) {
+    // _EJ0 method - eject callback
+    Store(ShiftLeft(1,Arg0), MPE)
+    Sleep(200)
+}

MPE is write only and only one memslot is ejected at a time. Why is a 256-bit field here then? Could we use just 1 byte and write a slot number into it and save some io address space this way?

good point. This was implemented similarly to the hot-add/status register only for symmetry, but you are right, since only one slot is ejected at a time, this can be reduced to one byte and save space. I will update for the next version.

+
+/* Memory hotplug notify method */
+OperationRegion(MEST, SystemIO, 0xaf20, 32)

It's more a suggestion: move it a bit farther to allow maybe 1024 cpus in the future. That will prevent a compatibility headache if we decide to expand support to more than 256 cpus.

ok, I will move it to 0xaf80 or higher (so cpu-hotplug could be extended to at least 1024 cpus)

Or even better, make this address configurable at run-time and build this var along with SSDT (converting along the way all other hard-coded io ports to the same generic run-time interface). This wish is out of scope of this patch-set, but what do you think about the idea?

yes, that would give more flexibility and avoid more compatibility headaches. As you say it's not a main issue for the series, but I can work on it as we start converting hardcoded i/o ports to configurable properties.

thanks,

- Vasilis
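The trade-off discussed above (a 256-bit write-only bitmap versus a single byte holding the slot number) can be sketched from the device side. This is an illustrative sketch only: `eject_slot_from_bitmap()` and `eject_slot_from_byte()` are hypothetical helpers, not functions from QEMU or SeaBIOS.

```c
#include <stdint.h>

#define MEM_EJ_LEN 32   /* 32 bytes = 256 one-bit slots, as in the MEMJ region */

/*
 * Bitmap scheme: the guest writes (1 << slot) into a 256-bit field.
 * Returns the index of the lowest set bit, or -1 if no bit is set.
 */
static int eject_slot_from_bitmap(const uint8_t regs[MEM_EJ_LEN])
{
    for (int byte = 0; byte < MEM_EJ_LEN; byte++) {
        for (int bit = 0; bit < 8; bit++) {
            if (regs[byte] & (1u << bit)) {
                return byte * 8 + bit;
            }
        }
    }
    return -1;
}

/*
 * Slot-number scheme: the guest writes the slot number itself into one
 * byte, so a single I/O port is enough to address slots 0..255.
 */
static int eject_slot_from_byte(uint8_t val)
{
    return val;
}
```

Both schemes identify the same slot; the second needs 1 byte of I/O space instead of 32, at the cost of having no value left to mean "no slot".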
Re: [RFC PATCH 0/9] ACPI memory hotplug
On Thu, Apr 19, 2012 at 04:08:38PM +0200, Vasilis Liaskovitis wrote: series is based on uq/master for qemu-kvm, and master for seabios. Can be found also at:

forgot to paste the repo links in the original cover letter, here they are if someone wants them:

https://github.com/vliaskov/qemu-kvm/commits/memory-hotplug
https://github.com/vliaskov/seabios/commits/memory-hotplug

thanks,

- Vasilis
Re: [Qemu-devel] [RFC PATCH 6/9] pc: pass paravirt info for hotplug memory slots to BIOS
On Fri, Apr 20, 2012 at 12:33:57PM +0200, Igor Mammedov wrote: On 04/19/2012 04:08 PM, Vasilis Liaskovitis wrote:

-numa_fw_cfg = g_malloc0((1 + max_cpus + nb_numa_nodes) * 8);
+numa_fw_cfg = g_malloc0((2 + max_cpus + nb_numa_nodes + 3 * nb_hp_memslots) * 8);
 numa_fw_cfg[0] = cpu_to_le64(nb_numa_nodes);
+numa_fw_cfg[1] = cpu_to_le64(nb_hp_memslots);

this will break compatibility: if a guest is migrated from an old to a new qemu, then on reboot it will use the old bios, which expects numa_fw_cfg[1] to be something else. Could memslots info be moved to the end of an existing interface?

right. The number of memslots can be placed at 1 + max_cpus + nb_numa_nodes, instead of right after the number of nodes. This way the old layout is preserved, and all memslot info comes at the end. I will rewrite.

thanks,

- Vasilis
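The backward-compatible layout agreed on above can be sketched as follows. This is a sketch under assumptions, not the patch itself: `build_numa_fw_cfg()` and `struct hp_memslot` are hypothetical, plain `calloc` stands in for `g_malloc0`, and the `cpu_to_le64` conversion is omitted (values are stored in host byte order). The order of the three per-slot words (start, size, node) follows the description in patch 6/9.

```c
#include <stdint.h>
#include <stdlib.h>

struct hp_memslot {
    uint64_t start;
    uint64_t size;
    uint64_t node;
};

/*
 * Layout, in 64-bit words:
 *   [0]                          number of NUMA nodes
 *   [1 .. max_cpus]              node of each VCPU
 *   [1 + max_cpus .. old_end)    memory of each node
 *   [old_end]                    number of hotplug memslots (new)
 *   [old_end + 1 ...]            start, size, node for each memslot (new)
 * Everything before old_end is unchanged, so an old BIOS still finds
 * the fields it expects.
 */
static uint64_t *build_numa_fw_cfg(uint64_t max_cpus, uint64_t nb_numa_nodes,
                                   const uint64_t *cpu_node,
                                   const uint64_t *node_mem,
                                   const struct hp_memslot *slots,
                                   uint64_t nb_hp_memslots,
                                   size_t *nwords)
{
    size_t old_end = 1 + max_cpus + nb_numa_nodes;
    size_t total = old_end + 1 + 3 * nb_hp_memslots;
    uint64_t *cfg = calloc(total, sizeof(*cfg));

    if (!cfg) {
        return NULL;
    }
    cfg[0] = nb_numa_nodes;
    for (uint64_t i = 0; i < max_cpus; i++) {
        cfg[1 + i] = cpu_node[i];
    }
    for (uint64_t i = 0; i < nb_numa_nodes; i++) {
        cfg[1 + max_cpus + i] = node_mem[i];
    }
    cfg[old_end] = nb_hp_memslots;
    for (uint64_t i = 0; i < nb_hp_memslots; i++) {
        cfg[old_end + 1 + 3 * i + 0] = slots[i].start;
        cfg[old_end + 1 + 3 * i + 1] = slots[i].size;
        cfg[old_end + 1 + 3 * i + 2] = slots[i].node;
    }
    *nwords = total;
    return cfg;
}
```

The caller owns the returned buffer and releases it with `free()`.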
[RFC PATCH 0/9] ACPI memory hotplug
This is a prototype for ACPI memory hotplug on x86_64 target. Based on some earlier work and comments from Gleb.

Memslot devices are modeled with a new qemu command line

  -memslot id=name,start=start_addr,size=sz,node=pxm

user is responsible for defining memslots with meaningful start/size values, e.g. not defining a memory slot over a PCI-hole. Alternatively, the start size could also be handled/assigned automatically from the specific emulated hardware (for hw/pc.c PCI hole is currently [below_4g_mem_size, 4G), so hotplugged memory should start from max(4G, above_4g_mem_size)).

Node is defining numa proximity for this memslot. When not defined it defaults to zero. e.g.

  -memslot id=hot1,start=4294967296,size=536870912,node=0

will define a 512M memory slot starting at physical address 4G, belonging to numa node 0.

Memory slots are added or removed with a new hmp command memslot:

  Hot-add syntax: memslot id add
  Hot-remove syntax: memslot id delete

- All memslots are initially unpopulated. Memslots are currently modeling only hotplug-able memory slots i.e. initial system memory is not modeled with memslots. The concept could be generalized to include all memory though, or it could more closely follow kvm-memory slots.

- Memslots are abstracted as qdevices attached to the main system bus. However, memory hotplugging has its own side channel ignoring main_system_bus's hotplug incapability. A cleaner integration would be needed. What's the preferred way of modeling memory devices in qom? Would it be better to attach memory devices as children-links of an acpi-capable device (in the pc case acpi_piix4) instead of the system bus?

- Refcounting memory slots has been discussed (but is not included in this series yet). Depopulating a memory region happens on a guestOS _EJ callback, which means the guestOS will not be using the region anymore.
However, guest addresses from the depopulated region need to also be unmapped from the qemu address space using cpu_physical_memory_unmap(). Does memory_region_del_subregion() or some other memory API call guarantee that a memoryregion has been unmapped from qemu's address space?

- What is the expected behaviour of hotplugged memory after a reboot? Is it supposed to be persistent after reboots? The last 2 patches in the series try to make hotplugged memslots persistent after reboot by creating and consulting e820 map entries. A better solution is needed for hot-remove after a reboot, because e820 entries can be merged.

series is based on uq/master for qemu-kvm, and master for seabios. Can be found also at:

Vasilis Liaskovitis (9):
  Seabios: Add SSDT memory device support
  Seabios, acpi: Implement acpi-dsdt functions for memory hotplug.
  Seabios, acpi: generate hotplug memory devices.
  Implement memslot device abstraction
  acpi_piix4: Implement memory device hotplug registers and handlers.
  pc: pass paravirt info for hotplug memory slots to BIOS
  Implement memslot command-line option and memslot hmp monitor command
  pc: adjust e820 map on hot-add and hot-remove
  Seabios, acpi: enable memory devices if e820 entry is present

 Makefile.objs   |    2 +-
 hmp-commands.hx |   15
 hw/acpi_piix4.c |  103 +++-
 hw/memslot.c    |  201 +++
 hw/memslot.h    |   44
 hw/pc.c         |   87 ++--
 hw/pc.h         |    1 +
 monitor.c       |    8 ++
 monitor.h       |    1 +
 qemu-config.c   |   25 +++
 qemu-options.hx |    8 ++
 sysemu.h        |    1 +
 vl.c            |   44 -
 13 files changed, 528 insertions(+), 12 deletions(-)
 create mode 100644 hw/memslot.c
 create mode 100644 hw/memslot.h
 create mode 100644 src/ssdt-mem.dsl

 src/acpi-dsdt.dsl |   68 ++-
 src/acpi.c        |  155 +++--
 src/memmap.c      |   15 +
 src/ssdt-mem.dsl  |   66 ++
 4 files changed, 298 insertions(+), 6 deletions(-)
 create mode 100644 src/ssdt-mem.dsl

-- 
1.7.9
[RFC PATCH 3/9][SeaBIOS] acpi: generate hotplug memory devices.
The memory device generation is guided by qemu paravirt info. Seabios first uses the info to setup SRAT entries for the hotplug-able memory slots. Afterwards, build_memssdt uses the created SRAT entries to generate appropriate memory device objects. One memory device (and corresponding SRAT entry) is generated for each hotplug-able qemu memslot. Currently no SSDT memory device is created for initial system memory (the method can be generalized to all memory though). Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- src/acpi.c | 151 ++-- 1 files changed, 147 insertions(+), 4 deletions(-) diff --git a/src/acpi.c b/src/acpi.c index 30888b9..5580099 100644 --- a/src/acpi.c +++ b/src/acpi.c @@ -484,6 +484,131 @@ build_ssdt(void) return ssdt; } +static unsigned char ssdt_mem[] = { +0x5b,0x82,0x47,0x07,0x4d,0x50,0x41,0x41, +0x08,0x49,0x44,0x5f,0x5f,0x0a,0xaa,0x08, +0x5f,0x48,0x49,0x44,0x0c,0x41,0xd0,0x0c, +0x80,0x08,0x5f,0x50,0x58,0x4d,0x0a,0xaa, +0x08,0x5f,0x43,0x52,0x53,0x11,0x33,0x0a, +0x30,0x8a,0x2b,0x00,0x00,0x0d,0x03,0x00, +0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xef, +0xbe,0xad,0xde,0x00,0x00,0x00,0x00,0xee, +0xbe,0xad,0xe6,0x00,0x00,0x00,0x00,0x00, +0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, +0x00,0x00,0x08,0x00,0x00,0x00,0x00,0x79, +0x00,0x14,0x0f,0x5f,0x53,0x54,0x41,0x00, +0xa4,0x43,0x4d,0x53,0x54,0x49,0x44,0x5f, +0x5f,0x14,0x0f,0x5f,0x45,0x4a,0x30,0x01, +0x4d,0x50,0x45,0x4a,0x49,0x44,0x5f,0x5f, +0x68 +}; + +#define SD_OFFSET_MEMHEX 6 +#define SD_OFFSET_MEMID 14 +#define SD_OFFSET_PXMID 31 +#define SD_OFFSET_MEMSTART 55 +#define SD_OFFSET_MEMEND 63 +#define SD_OFFSET_MEMSIZE 79 + +u64 nb_hp_memslots = 0; +struct srat_memory_affinity *mem; + +static void build_memdev(u8 *ssdt_ptr, int i, u64 mem_base, u64 mem_len, u8 node) +{ +memcpy(ssdt_ptr, ssdt_mem, sizeof(ssdt_mem)); +ssdt_ptr[SD_OFFSET_MEMHEX] = getHex(i 4); +ssdt_ptr[SD_OFFSET_MEMHEX+1] = getHex(i); +ssdt_ptr[SD_OFFSET_MEMID] = i; +ssdt_ptr[SD_OFFSET_PXMID] = node; +*(u64*)(ssdt_ptr + 
SD_OFFSET_MEMSTART) = mem_base; +*(u64*)(ssdt_ptr + SD_OFFSET_MEMEND) = mem_base + mem_len; +*(u64*)(ssdt_ptr + SD_OFFSET_MEMSIZE) = mem_len; +} + +static void* +build_memssdt(void) +{ +u64 mem_base; +u64 mem_len; +u8 node; +int i; +struct srat_memory_affinity *entry = mem; +u64 nb_memdevs = nb_hp_memslots; + +int length = ((1+3+4) + + (nb_memdevs * sizeof(ssdt_mem)) + + (1+2+5+(12*nb_memdevs)) + + (6+2+1+(1*nb_memdevs))); +u8 *ssdt = malloc_high(sizeof(struct acpi_table_header) + length); +if (! ssdt) { +warn_noalloc(); +return NULL; +} +u8 *ssdt_ptr = ssdt + sizeof(struct acpi_table_header); + +// build Scope(_SB_) header +*(ssdt_ptr++) = 0x10; // ScopeOp +ssdt_ptr = encodeLen(ssdt_ptr, length-1, 3); +*(ssdt_ptr++) = '_'; +*(ssdt_ptr++) = 'S'; +*(ssdt_ptr++) = 'B'; +*(ssdt_ptr++) = '_'; + +for (i = 0; i nb_memdevs; i++) { +mem_base = (((u64)(entry-base_addr_high) 32 )| entry-base_addr_low); +mem_len = (((u64)(entry-length_high) 32 )| entry-length_low); +node = entry-proximity[0]; +build_memdev(ssdt_ptr, i, mem_base, mem_len, node); +ssdt_ptr += sizeof(ssdt_mem); +entry++; +} + +// build Method(MTFY, 2) {If (LEqual(Arg0, 0x00)) {Notify(CM00, Arg1)} ...} +*(ssdt_ptr++) = 0x14; // MethodOp +ssdt_ptr = encodeLen(ssdt_ptr, 2+5+(12*nb_memdevs), 2); +*(ssdt_ptr++) = 'M'; +*(ssdt_ptr++) = 'T'; +*(ssdt_ptr++) = 'F'; +*(ssdt_ptr++) = 'Y'; +*(ssdt_ptr++) = 0x02; +for (i=0; inb_memdevs; i++) { +*(ssdt_ptr++) = 0xA0; // IfOp +ssdt_ptr = encodeLen(ssdt_ptr, 11, 1); +*(ssdt_ptr++) = 0x93; // LEqualOp +*(ssdt_ptr++) = 0x68; // Arg0Op +*(ssdt_ptr++) = 0x0A; // BytePrefix +*(ssdt_ptr++) = i; +*(ssdt_ptr++) = 0x86; // NotifyOp +*(ssdt_ptr++) = 'M'; +*(ssdt_ptr++) = 'P'; +*(ssdt_ptr++) = getHex(i 4); +*(ssdt_ptr++) = getHex(i); +*(ssdt_ptr++) = 0x69; // Arg1Op +} + +// build Name(MEON, Package() { One, One, ..., Zero, Zero, ... 
}) +*(ssdt_ptr++) = 0x08; // NameOp +*(ssdt_ptr++) = 'M'; +*(ssdt_ptr++) = 'E'; +*(ssdt_ptr++) = 'O'; +*(ssdt_ptr++) = 'N'; +*(ssdt_ptr++) = 0x12; // PackageOp +ssdt_ptr = encodeLen(ssdt_ptr, 2+1+(1*nb_memdevs), 2); +*(ssdt_ptr++) = nb_memdevs; + +entry = mem; + +for (i = 0; i nb_memdevs; i++) { +mem_base = (((u64)(entry-base_addr_high) 32 )| entry-base_addr_low); +mem_len = (((u64)(entry-length_high) 32 )| entry-length_low); +*(ssdt_ptr++) = 0x00; +entry++; +} +build_header((void*)ssdt, SSDT_SIGNATURE, ssdt_ptr - ssdt, 1); + +return ssdt
[RFC PATCH 1/9][SeaBIOS] Add SSDT memory device support
Define SSDT hotplug-able memory devices in _SB namespace. The dynamically generated SSDT includes per memory device hotplug methods. These methods just call methods defined in the DSDT. Also dynamically generate a MTFY method and a MEON array of the online/available memory devices. Add file src/ssdt-mem.dsl with directions for generating the per-memory device processor object AML code. The design is taken from SSDT cpu generation. Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- src/ssdt-mem.dsl | 66 ++ 1 files changed, 66 insertions(+), 0 deletions(-) create mode 100644 src/ssdt-mem.dsl diff --git a/src/ssdt-mem.dsl b/src/ssdt-mem.dsl new file mode 100644 index 000..9586643 --- /dev/null +++ b/src/ssdt-mem.dsl @@ -0,0 +1,66 @@ +/* This file is the basis for the ssdt_mem[] variable in src/acpi.c. + * It is similar in design to the ssdt_proc variable. + * It defines the contents of the per-cpu Processor() object. At + * runtime, a dynamically generated SSDT will contain one copy of this + * AML snippet for every possible memory device in the system. The + * objects will * be placed in the \_SB_ namespace. + * + * To generate a new ssdt_memc[], run the commands: + * cpp -P src/ssdt-mem.dsl out/ssdt-mem.dsl.i + * iasl -ta -p out/ssdt-mem out/ssdt-mem.dsl.i + * tail -c +37 out/ssdt-mem.aml | hexdump -e ' 8/1 0x%02x, \n' + * and then cut-and-paste the output into the src/acpi.c ssdt_mem[] + * array. + * + * In addition to the aml code generated from this file, the + * src/acpi.c file creates a MEMNTFY method with an entry for each memdevice: + * Method(MTFY, 2) { + * If (LEqual(Arg0, 0x00)) { Notify(MP00, Arg1) } + * If (LEqual(Arg0, 0x01)) { Notify(MP01, Arg1) } + * ... + * } + * and a MEON array with the list of active and inactive memory devices: + * Name(MEON, Package() { One, One, ..., Zero, Zero, ... 
}) + */ +DefinitionBlock (ssdt-mem.aml, SSDT, 0x02, BXPC, CSSDT, 0x1) +/* v-- DO NOT EDIT --v */ +{ +Device(MPAA) { +Name(ID, 0xAA) +/* ^-- DO NOT EDIT --^ + * + * The src/acpi.c code requires the above layout so that it can update + * MPAA and 0xAA with the appropriate MEMDEVICE id (see + * SD_OFFSET_MEMHEX/MEMID1/MEMID2). Don't change the above without + * also updating the C code. + */ +Name(_HID, EISAID(PNP0C80)) +Name(_PXM, 0xAA) + +External(CMST, MethodObj) +External(MPEJ, MethodObj) + +Name(_CRS, ResourceTemplate() { +QwordMemory( + ResourceConsumer, + , + MinFixed, + MaxFixed, + Cacheable, + ReadWrite, + 0x0, + 0xDEADBEEF, + 0xE6ADBEEE, + 0x, + 0x0800, + ) +}) +Method (_STA, 0) { +Return(CMST(ID)) +} +Method (_EJ0, 1, NotSerialized) { +MPEJ(ID, Arg0) +} +} +} + -- 1.7.9 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
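The template patching that the DO NOT EDIT comment refers to works roughly as sketched here: a fixed byte template is copied once per device, and the placeholder bytes at known offsets are overwritten. The template and offsets below are made up for illustration (they are not the real AML from ssdt_mem[]), and `hex_digit()` and `patch_memdev()` are hypothetical helpers standing in for the SeaBIOS code.

```c
#include <stdint.h>
#include <string.h>

/* Toy template: the name "MPAA" followed by a byte prefix and an ID
 * placeholder 0xAA. Not real AML. */
static const uint8_t memdev_template[] = { 'M', 'P', 'A', 'A', 0x0a, 0xaa };

#define OFFSET_HEX 2   /* position of the two hex characters "AA" */
#define OFFSET_ID  5   /* position of the 0xAA ID byte */

/* Convert the low nibble of v to an uppercase ASCII hex digit. */
static uint8_t hex_digit(unsigned v)
{
    v &= 0x0f;
    return (uint8_t)(v <= 9 ? '0' + v : 'A' + v - 10);
}

/* Copy the template into dst and patch in the device id (0..255). */
static void patch_memdev(uint8_t *dst, unsigned id)
{
    memcpy(dst, memdev_template, sizeof(memdev_template));
    dst[OFFSET_HEX]     = hex_digit(id >> 4);
    dst[OFFSET_HEX + 1] = hex_digit(id);
    dst[OFFSET_ID]      = (uint8_t)id;
}
```

Because the offsets are hard-coded, any change to the start of the template shifts the placeholders, which is why the layout above the marker must stay in sync with the C code.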
[RFC PATCH 4/9] Implement memslot device abstraction
Each hotplug-able memory slot is a SysBusDevice. All memslots are initially unpopulated. A hot-add operation for a particular memory slot creates a new MemoryRegion of the given physical address offset, size and node proximity, and attaches it to main system memory as a sub_region. A hot-remove operation detaches and frees the MemoryRegion from system memory. This is an early prototype and lacks proper qdev integration: a separate hotplug mechanism/side-channel is used and main system bus hotplug capability is ignored. Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- hw/memslot.c | 195 ++ hw/memslot.h | 44 + 2 files changed, 239 insertions(+), 0 deletions(-) create mode 100644 hw/memslot.c create mode 100644 hw/memslot.h diff --git a/hw/memslot.c b/hw/memslot.c new file mode 100644 index 000..b100824 --- /dev/null +++ b/hw/memslot.c @@ -0,0 +1,195 @@ +/* + * MemorySlot device for Memory Hotplug + * + * Copyright ProfitBricks GmbH 2012 + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with this library; if not, see http://www.gnu.org/licenses/ + */ + +#include trace.h +#include qdev.h +#include memslot.h +#include ../exec-memory.h + +static DeviceState *memslot_hotplug_qdev; +static memslot_hotplug_fn memslot_hotplug; + +static Property memslot_properties[] = { +DEFINE_PROP_END_OF_LIST() +}; + +void memslot_populate(MemSlotState *s) +{ +char buf[32]; +MemoryRegion *new = NULL; + +sprintf(buf, memslot%u, s-idx); +new = g_malloc(sizeof(MemoryRegion)); +memory_region_init_ram(new, buf, s-size); +vmstate_register_ram_global(new); +memory_region_add_subregion(get_system_memory(), s-start, new); +s-mr = new; +s-populated = 1; +} + +void memslot_depopulate(MemSlotState *s) +{ +assert(s); +if (s-populated) { +vmstate_unregister_ram(s-mr, NULL); +memory_region_del_subregion(get_system_memory(), s-mr); +memory_region_destroy(s-mr); +s-populated = 0; +s-mr = NULL; +} +} + +MemSlotState *memslot_create(char *id, target_phys_addr_t start, uint64_t size, +uint64_t node, uint32_t memslot_idx) +{ +DeviceState *dev; +MemSlotState *mdev; + +dev = sysbus_create_simple(memslot, -1, NULL); +dev-id = id; + +mdev = MEMSLOT(dev); +mdev-idx = memslot_idx; +mdev-start = start; +mdev-size = size; +mdev-node = node; + +return mdev; +} + +void memslot_register_hotplug(memslot_hotplug_fn hotplug, DeviceState *qdev) +{ +memslot_hotplug_qdev = qdev; +memslot_hotplug = hotplug; +} + +static MemSlotState *memslot_find(char *id) +{ +DeviceState *qdev; +qdev = qdev_find_recursive(sysbus_get_default(), id); +if (qdev) +return MEMSLOT(qdev); +return NULL; +} + +int memslot_do(Monitor *mon, const QDict *qdict) +{ +MemSlotState *slot = NULL; + +char *id = (char*) qdict_get_try_str(qdict, id); +if (!id) { +fprintf(stderr, ERROR %s invalid id\n,__FUNCTION__); +return 1; +} + +slot = memslot_find(id); + +if (!slot) { +fprintf(stderr, %s no slot %s found\n, __FUNCTION__, id); +return 1; +} + +char 
*action = (char*) qdict_get_try_str(qdict, action); +if (!action || (strcmp(action, add) strcmp(action, delete))) { +fprintf(stderr, ERROR %s invalid action\n, __FUNCTION__); +return 1; +} + +if (!strcmp(action, add)) { +if (slot-populated) { +fprintf(stderr, ERROR %s slot %s already populated\n, +__FUNCTION__, id); +return 1; +} +memslot_populate(slot); +if (memslot_hotplug) +memslot_hotplug(memslot_hotplug_qdev, (SysBusDevice*)slot, 1); +} +else { +if (!slot-populated) { +fprintf(stderr, ERROR %s slot %s is not populated\n, +__FUNCTION__, id); +return 1; +} +if (memslot_hotplug) +memslot_hotplug(memslot_hotplug_qdev, (SysBusDevice*)slot, 0); +} +return 0; +} + +MemSlotState *memslot_find_from_idx(uint32_t idx) +{ +Error *err = NULL; +DeviceState *dev; +MemSlotState *slot; +char *type; +BusState *bus = sysbus_get_default(); +QTAILQ_FOREACH(dev, bus-children, sibling) { +type = object_property_get_str(OBJECT(dev
[RFC PATCH 5/9] acpi_piix4: Implement memory device hotplug registers
A 32-byte register is used to present up to 256 hotplug-able memory devices to BIOS and OSPM. Hot-add and hot-remove functions trigger an ACPI hotplug event through these. Only reads are allowed from these registers (from BIOS/OSPM perspective). memslot id add will immediately populate the new memslot (a new MemoryRegion is created and attached to system memory), and then trigger the ACPI hot-add event. memslot id delete triggers the ACPI hot-remove event but needs to wait for OSPM to eject the device. We use a second set of eject registers to know when OSPM has called the _EJ function for a particular memslot. A write to these will depopulate the corresponding memslot i.e. detach and free the MemoryRegion. Only writes to the eject registers are allowed. A new property mem_acpi_hotplug should enable these memory hotplug registers for future machine types (not yet implemented in this version). Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- hw/acpi_piix4.c | 93 -- 1 files changed, 89 insertions(+), 4 deletions(-) diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c index 797ed24..a14dd3c 100644 --- a/hw/acpi_piix4.c +++ b/hw/acpi_piix4.c @@ -27,6 +27,8 @@ #include sysemu.h #include range.h #include ioport.h +#include sysbus.h +#include memslot.h //#define DEBUG @@ -43,9 +45,16 @@ #define PCI_BASE 0xae00 #define PCI_EJ_BASE 0xae08 #define PCI_RMV_BASE 0xae0c +#define MEM_BASE 0xaf20 +#define MEM_EJ_BASE 0xaf40 +#define PIIX4_MEM_HOTPLUG_STATUS 8 #define PIIX4_PCI_HOTPLUG_STATUS 2 +struct gpe_regs { +uint8_t mems_sts[32]; +}; + struct pci_status { uint32_t up; uint32_t down; @@ -66,6 +75,7 @@ typedef struct PIIX4PMState { int kvm_enabled; Notifier machine_ready; +struct gpe_regsgpe; /* for pci hotplug */ struct pci_status pci0_status; uint32_t pci0_hotplug_enable; @@ -86,8 +96,8 @@ static void pm_update_sci(PIIX4PMState *s) ACPI_BITMASK_POWER_BUTTON_ENABLE | ACPI_BITMASK_GLOBAL_LOCK_ENABLE | ACPI_BITMASK_TIMER_ENABLE)) != 0) || 
-(((s-ar.gpe.sts[0] s-ar.gpe.en[0]) - PIIX4_PCI_HOTPLUG_STATUS) != 0); +(((s-ar.gpe.sts[0] s-ar.gpe.en[0]) + (PIIX4_PCI_HOTPLUG_STATUS | PIIX4_MEM_HOTPLUG_STATUS)) != 0); qemu_set_irq(s-irq, sci_level); /* schedule a timer interruption if needed */ @@ -432,17 +442,34 @@ type_init(piix4_pm_register_types) static uint32_t gpe_readb(void *opaque, uint32_t addr) { PIIX4PMState *s = opaque; -uint32_t val = acpi_gpe_ioport_readb(s-ar, addr); +uint32_t val = 0; +struct gpe_regs *g = s-gpe; + +switch (addr) { +case MEM_BASE ... MEM_BASE+31: +val = g-mems_sts[addr - MEM_BASE]; +break; +default: +val = acpi_gpe_ioport_readb(s-ar, addr); +} PIIX4_DPRINTF(gpe read %x == %x\n, addr, val); return val; } +static void piix4_memslot_eject(uint32_t addr, uint32_t val); + static void gpe_writeb(void *opaque, uint32_t addr, uint32_t val) { PIIX4PMState *s = opaque; -acpi_gpe_ioport_writeb(s-ar, addr, val); +switch (addr) { +case MEM_EJ_BASE ... MEM_EJ_BASE+31: +piix4_memslot_eject(addr, val); +break; +default: +acpi_gpe_ioport_writeb(s-ar, addr, val); +} pm_update_sci(s); PIIX4_DPRINTF(gpe write %x == %d\n, addr, val); @@ -521,9 +548,12 @@ static void pcirmv_write(void *opaque, uint32_t addr, uint32_t val) static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev, PCIHotplugState state); +static int piix4_memslot_hotplug(DeviceState *qdev, SysBusDevice *dev, int add); + static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s) { struct pci_status *pci0_status = s-pci0_status; +int i = 0; register_ioport_write(GPE_BASE, GPE_LEN, 1, gpe_writeb, s); register_ioport_read(GPE_BASE, GPE_LEN, 1, gpe_readb, s); @@ -538,6 +568,13 @@ static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s) register_ioport_write(PCI_RMV_BASE, 4, 4, pcirmv_write, s); register_ioport_read(PCI_RMV_BASE, 4, 4, pcirmv_read, s); +register_ioport_read(MEM_BASE, 32, 1, gpe_readb, s); +register_ioport_write(MEM_EJ_BASE, 32, 1, gpe_writeb, s); +for(i = 0; i 32; i++) { 
+s-gpe.mems_sts[i] = 0; +} +memslot_register_hotplug(piix4_memslot_hotplug, s-dev.qdev); + pci_bus_hotplug(bus, piix4_device_hotplug, s-dev.qdev); } @@ -553,6 +590,54 @@ static void disable_device(PIIX4PMState *s, int slot) s-pci0_status.down |= (1 slot); } +static void enable_mem_device(PIIX4PMState *s, int memdevice) +{ +struct gpe_regs *g = s-gpe; +s-ar.gpe.sts[0] |= PIIX4_MEM_HOTPLUG_STATUS; +g-mems_sts[memdevice/8] |= (1 (memdevice%8)); +} + +static void
[RFC PATCH 8/9] pc: adjust e820 map on hot-add and hot-remove
Hotplugged memory is not persistent in the e820 memory maps. After hotplugging a memslot and rebooting the VM, the hotplugged device is not present.

A possible solution is to add an e820 entry for the new memslot in the acpi_piix4 hot-add handler. On a reset, Seabios (see next patch in series) will enable all memory devices for which it finds an e820 entry that covers the device's address range.

On hot-remove, the acpi_piix4 handler will try to remove the e820 entry corresponding to the device. This will work when no VM reboots happen between hot-add and hot-remove, but it is not a sufficient solution in general: Seabios and GuestOS merge adjacent e820 entries on machine reboot, so the sequence hot-add / rebootVM / hot-remove will fail to remove a corresponding e820 entry at the hot-remove phase.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 hw/acpi_piix4.c |    6 ++
 hw/pc.c         |   28
 hw/pc.h         |    1 +
 3 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index 2921d18..2b5fd04 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -619,6 +619,9 @@ static void piix4_memslot_eject(uint32_t addr, uint32_t val)
             s = memslot_find_from_idx(start + idx);
             assert(s != NULL);
             memslot_depopulate(s);
+            if (e820_del_entry(s->start, s->size, E820_RAM) == -EBUSY)
+                PIIX4_DPRINTF("failed to remove e820 entry for memslot %u\n",
+                              s->idx);
         }
         val = val >> 1;
         idx++;
@@ -634,6 +637,9 @@ static int piix4_memslot_hotplug(DeviceState *qdev, SysBusDevice *dev, int
     if (add) {
         enable_mem_device(s, slot->idx);
+        if (e820_add_entry(slot->start, slot->size, E820_RAM) == -EBUSY)
+            PIIX4_DPRINTF("failed to add e820 entry for memslot %u\n",
+                          slot->idx);
     }
     else {
         disable_mem_device(s, slot->idx);
diff --git a/hw/pc.c b/hw/pc.c
index f1f550a..04d243f 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -593,6 +593,34 @@ int e820_add_entry(uint64_t address, uint64_t length, uint32_t type)
     return index;
 }
 
+int e820_del_entry(uint64_t address, uint64_t length, uint32_t type)
+{
+    int index = le32_to_cpu(e820_table.count);
+    int search;
+    struct e820_entry *entry;
+
+    if (index == 0)
+        return -EBUSY;
+    search = index - 1;
+    entry = &e820_table.entry[search];
+    while (search >= 0) {
+        if ((entry->address == cpu_to_le64(address)) &&
+            (entry->length == cpu_to_le64(length)) &&
+            (entry->type == cpu_to_le32(type))){
+            if (search != index - 1) {
+                memcpy(&e820_table.entry[search], &e820_table.entry[search + 1],
+                       sizeof(struct e820_entry) * (index - search));
+            }
+            index--;
+            e820_table.count = cpu_to_le32(index);
+            return 1;
+        }
+        search--;
+        entry = &e820_table.entry[search];
+    }
+    return -EBUSY;
+}
+
 static void bochs_bios_setup_hp_memslots(uint64_t *fw_cfg_slots);
 
 static void *bochs_bios_init(void)
diff --git a/hw/pc.h b/hw/pc.h
index 74d3369..4925e8c 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -226,5 +226,6 @@ void pc_system_firmware_init(MemoryRegion *rom_memory);
 #define E820_UNUSABLE 5
 
 int e820_add_entry(uint64_t, uint64_t, uint32_t);
+int e820_del_entry(uint64_t, uint64_t, uint32_t);
 
 #endif
-- 
1.7.9
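Removing an exactly matching entry from a packed table can be tried in isolation with the sketch below. It is not the patch's code: `struct table` is a simplified stand-in for QEMU's e820_table, endianness conversion is omitted, `table_del_entry()` is a hypothetical name, and the return convention (0 on success, -1 if not found) is chosen for the example. It uses `memmove` because the source and destination ranges overlap, and it shifts only the `count - i - 1` entries that follow the removed one.

```c
#include <stdint.h>
#include <string.h>

#define MAX_ENTRIES 128

struct entry {
    uint64_t address;
    uint64_t length;
    uint32_t type;
};

struct table {
    uint32_t count;
    struct entry entry[MAX_ENTRIES];
};

/*
 * Remove the entry that matches address, length and type exactly.
 * Searches from the last entry backwards. Returns 0 on success and
 * -1 if no entry matches.
 */
static int table_del_entry(struct table *t, uint64_t address,
                           uint64_t length, uint32_t type)
{
    for (uint32_t i = t->count; i-- > 0; ) {
        struct entry *e = &t->entry[i];

        if (e->address == address && e->length == length && e->type == type) {
            /* Close the gap by shifting the following entries down by one. */
            memmove(e, e + 1, sizeof(*e) * (t->count - i - 1));
            t->count--;
            return 0;
        }
    }
    return -1;
}
```

An exact-match lookup like this is also what fails after adjacent entries have been merged, which is the limitation the commit message describes.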
[RFC PATCH 9/9][SeaBIOS] enable memory devices if e820 entry is present
On a reboot, seabios regenerates srat/ssdt objects. If a valid e820 entry is found spanning the whole address range of a hotplug memory device, the device will be enabled. This ensures persistency of hotplugged memory slots across VM reboots.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 src/acpi.c   |    6 +-
 src/memmap.c |   15 +++
 2 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/src/acpi.c b/src/acpi.c
index 5580099..2ebed2e 100644
--- a/src/acpi.c
+++ b/src/acpi.c
@@ -601,7 +601,11 @@ build_memssdt(void)
     for (i = 0; i < nb_memdevs; i++) {
         mem_base = (((u64)(entry->base_addr_high) << 32 )| entry->base_addr_low);
         mem_len = (((u64)(entry->length_high) << 32 )| entry->length_low);
-        *(ssdt_ptr++) = 0x00;
+        if (find_e820(mem_base, mem_len, E820_RAM)) {
+            *(ssdt_ptr++) = 0x01;
+        }
+        else
+            *(ssdt_ptr++) = 0x00;
         entry++;
     }
     build_header((void*)ssdt, SSDT_SIGNATURE, ssdt_ptr - ssdt, 1);
diff --git a/src/memmap.c b/src/memmap.c
index 56865b4..9790da1 100644
--- a/src/memmap.c
+++ b/src/memmap.c
@@ -131,6 +131,21 @@ add_e820(u64 start, u64 size, u32 type)
     //dump_map();
 }
 
+// Check if an e820 entry exists that covers the memory range
+// [start, start+size) with same type as type.
+int
+find_e820(u64 start, u64 size, u32 type)
+{
+    int i;
+    for (i=0; i<e820_count; i++) {
+        struct e820entry *e = &e820_list[i];
+        if ((e->start <= start) && (e->size >= (size + start - e->start)) &&
+            (e->type == type))
+            return 1;
+    }
+    return 0;
+}
+
 // Report on final memory locations.
 void
 memmap_finalize(void)
-- 
1.7.9
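The coverage test used by find_e820() can be exercised on its own. The sketch below is a simplified stand-in, not SeaBIOS code: `struct region`, `region_covers()` and `find_covering()` are hypothetical names, and arithmetic overflow of `start + size` is assumed not to occur. An entry covers [start, start+size) when it begins at or before `start` and ends at or after `start + size`.

```c
#include <stddef.h>
#include <stdint.h>

struct region {
    uint64_t start;
    uint64_t size;
    uint32_t type;
};

/* Return 1 if region e fully contains [start, start + size), else 0. */
static int region_covers(const struct region *e, uint64_t start, uint64_t size)
{
    return e->start <= start && e->size >= size + (start - e->start);
}

/* Return 1 if any region of the given type covers [start, start + size). */
static int find_covering(const struct region *list, size_t n,
                         uint64_t start, uint64_t size, uint32_t type)
{
    for (size_t i = 0; i < n; i++) {
        if (list[i].type == type && region_covers(&list[i], start, size)) {
            return 1;
        }
    }
    return 0;
}
```

A containment test still succeeds when the covering entry is larger than the device, for example after adjacent entries have been merged, so the lookup side of the scheme tolerates merging.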
[RFC PATCH 7/9] Implement memslot command-line option and memslot hmp command
Implement -memslot qemu-kvm command line option to define hotplug-able memory slots. Syntax: -memslot id=name,start=addr,size=sz,node=nodeid e.g. -memslot id=hot1,start=4294967296,size=1073741824,node=0 will define a 1G memory slot starting at physical address 4G, belonging to numa node 0. Defining no node will automatically add a memslot to node 0. Also implement a new hmp monitor command for hot-add and hot-remove of memory slots Syntax: memslot slotname action where action is add/delete and slotname is the qdev-id of the memory slot. Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- Makefile.objs |2 +- hmp-commands.hx | 15 +++ monitor.c |8 monitor.h |1 + qemu-config.c | 25 + qemu-options.hx |8 sysemu.h|1 + vl.c| 40 8 files changed, 99 insertions(+), 1 deletions(-) diff --git a/Makefile.objs b/Makefile.objs index 5c3bcda..98ce865 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -240,7 +240,7 @@ hw-obj-$(CONFIG_USB_OHCI) += usb/hcd-ohci.o hw-obj-$(CONFIG_USB_EHCI) += usb/hcd-ehci.o hw-obj-$(CONFIG_USB_XHCI) += usb/hcd-xhci.o hw-obj-$(CONFIG_FDC) += fdc.o -hw-obj-$(CONFIG_ACPI) += acpi.o acpi_piix4.o +hw-obj-$(CONFIG_ACPI) += acpi.o acpi_piix4.o memslot.o hw-obj-$(CONFIG_APM) += pm_smbus.o apm.o hw-obj-$(CONFIG_DMA) += dma.o hw-obj-$(CONFIG_I82374) += i82374.o diff --git a/hmp-commands.hx b/hmp-commands.hx index a6f5a84..cadf4ca 100644 --- a/hmp-commands.hx +++ b/hmp-commands.hx @@ -618,6 +618,21 @@ Add device. ETEXI { +.name = memslot, +.args_type = id:s,action:s, +.params = id,action, +.help = add memslot device, +.user_print = monitor_user_noop, +.mhandler.cmd_new = do_memslot_add, +}, + +STEXI +@item memslot_add @var{config} +@findex memslot_add + +Add memslot. 
+ETEXI +{ .name = device_del, .args_type = id:s, .params = device, diff --git a/monitor.c b/monitor.c index 8946a10..f672186 100644 --- a/monitor.c +++ b/monitor.c @@ -30,6 +30,7 @@ #include hw/pci.h #include hw/watchdog.h #include hw/loader.h +#include hw/memslot.h #include gdbstub.h #include net.h #include net/slirp.h @@ -4675,3 +4676,10 @@ int monitor_read_block_device_key(Monitor *mon, const char *device, return monitor_read_bdrv_key_start(mon, bs, completion_cb, opaque); } + +int do_memslot_add(Monitor *mon, const QDict *qdict, QObject **ret_data) +{ +#if defined(TARGET_I386) || defined(TARGET_X86_64) +return memslot_do(mon, qdict); +#endif +} diff --git a/monitor.h b/monitor.h index 0d49800..1e14a63 100644 --- a/monitor.h +++ b/monitor.h @@ -80,5 +80,6 @@ int monitor_read_password(Monitor *mon, ReadLineFunc *readline_func, int qmp_qom_set(Monitor *mon, const QDict *qdict, QObject **ret); int qmp_qom_get(Monitor *mon, const QDict *qdict, QObject **ret); +int do_memslot_add(Monitor *mon, const QDict *qdict, QObject **ret_data); #endif /* !MONITOR_H */ diff --git a/qemu-config.c b/qemu-config.c index be84a03..1f26187 100644 --- a/qemu-config.c +++ b/qemu-config.c @@ -613,6 +613,30 @@ QemuOptsList qemu_boot_opts = { }, }; +static QemuOptsList qemu_memslot_opts = { +.name = memslot, +.head = QTAILQ_HEAD_INITIALIZER(qemu_memslot_opts.head), +.desc = { +{ +.name = id, +.type = QEMU_OPT_STRING, +},{ +.name = start, +.type = QEMU_OPT_SIZE, +.help = physical address start for this memslot, +},{ +.name = size, +.type = QEMU_OPT_SIZE, +.help = memory size for this memslot, +},{ +.name = node, +.type = QEMU_OPT_NUMBER, +.help = NUMA node number (i.e. 
proximity) for this memslot, +}, +{ /* end of list */ } +}, +}; + static QemuOptsList *vm_config_groups[32] = { qemu_drive_opts, qemu_chardev_opts, @@ -628,6 +652,7 @@ static QemuOptsList *vm_config_groups[32] = { qemu_machine_opts, qemu_boot_opts, qemu_iscsi_opts, +qemu_memslot_opts, NULL, }; diff --git a/qemu-options.hx b/qemu-options.hx index a169792..aff0546 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -2728,3 +2728,11 @@ HXCOMM This is the last statement. Insert new options before this line! STEXI @end table ETEXI + +DEF(memslot, HAS_ARG, QEMU_OPTION_memslot, +-memslot start=num,size=num,id=name\n +specify unpopulated memory slot, +QEMU_ARCH_ALL) + + + diff --git a/sysemu.h b/sysemu.h index bc2c788..7247099 100644 --- a/sysemu.h +++ b/sysemu.h @@ -136,6 +136,7 @@ extern QEMUClock *rtc_clock; extern int nb_numa_nodes; extern uint64_t node_mem[MAX_NODES]; extern uint64_t node_cpumask[MAX_NODES]; +extern int nb_hp_memslots; #define MAX_OPTION_ROMS 16 typedef struct
[RFC PATCH 6/9] pc: pass paravirt info for hotplug memory slots to BIOS
The numa_fw_cfg paravirt interface is extended to include SRAT information for all hotplug-able memslots. There are 3 words for each hotplug-able memory slot, denoting start address, size and node proximity. nb_numa_nodes is set to 1 by default (not 0), so that we always pass srat info to SeaBIOS. This information is used by Seabios to build hotplug memory device objects at runtime. Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- hw/pc.c | 59 +-- vl.c|4 +++- 2 files changed, 56 insertions(+), 7 deletions(-) diff --git a/hw/pc.c b/hw/pc.c index 67f0479..f1f550a 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -46,6 +46,7 @@ #include ui/qemu-spice.h #include memory.h #include exec-memory.h +#include memslot.h /* output Bochs bios info messages */ //#define DEBUG_BIOS @@ -592,12 +593,15 @@ int e820_add_entry(uint64_t address, uint64_t length, uint32_t type) return index; } +static void bochs_bios_setup_hp_memslots(uint64_t *fw_cfg_slots); + static void *bochs_bios_init(void) { void *fw_cfg; uint8_t *smbios_table; size_t smbios_len; uint64_t *numa_fw_cfg; +uint64_t *hp_memslots_fw_cfg; int i, j; register_ioport_write(0x400, 1, 2, bochs_bios_write, NULL); @@ -630,28 +634,71 @@ static void *bochs_bios_init(void) fw_cfg_add_bytes(fw_cfg, FW_CFG_HPET, (uint8_t *)hpet_cfg, sizeof(struct hpet_fw_config)); /* allocate memory for the NUMA channel: one (64bit) word for the number - * of nodes, one word for each VCPU-node and one word for each node to - * hold the amount of memory. + * of nodes, one word for the number of hotplug memory slots, one word + * for each VCPU-node, one word for each node to hold the amount of memory. + * Finally three words for each hotplug memory slot, denoting start address, + * size and node proximity. 
*/ -numa_fw_cfg = g_malloc0((1 + max_cpus + nb_numa_nodes) * 8); +numa_fw_cfg = g_malloc0((2 + max_cpus + nb_numa_nodes + 3 * nb_hp_memslots) * 8); numa_fw_cfg[0] = cpu_to_le64(nb_numa_nodes); +numa_fw_cfg[1] = cpu_to_le64(nb_hp_memslots); + for (i = 0; i max_cpus; i++) { for (j = 0; j nb_numa_nodes; j++) { if (node_cpumask[j] (1 i)) { -numa_fw_cfg[i + 1] = cpu_to_le64(j); +numa_fw_cfg[i + 2] = cpu_to_le64(j); break; } } } for (i = 0; i nb_numa_nodes; i++) { -numa_fw_cfg[max_cpus + 1 + i] = cpu_to_le64(node_mem[i]); +numa_fw_cfg[max_cpus + 2 + i] = cpu_to_le64(node_mem[i]); } + +hp_memslots_fw_cfg = numa_fw_cfg + 2 + max_cpus + nb_numa_nodes; +if (nb_hp_memslots) +bochs_bios_setup_hp_memslots(hp_memslots_fw_cfg); + fw_cfg_add_bytes(fw_cfg, FW_CFG_NUMA, (uint8_t *)numa_fw_cfg, - (1 + max_cpus + nb_numa_nodes) * 8); + (2 + max_cpus + nb_numa_nodes + 3 * nb_hp_memslots) * 8); return fw_cfg; } +static void bochs_bios_setup_hp_memslots(uint64_t *fw_cfg_slots) +{ +int i = 0; +Error *err = NULL; +DeviceState *dev; +MemSlotState *slot; +char *type; +BusState *bus = sysbus_get_default(); + +QTAILQ_FOREACH(dev, bus-children, sibling) { +type = object_property_get_str(OBJECT(dev), type, err); +if (err) { +error_free(err); +fprintf(stderr, error getting device type\n); +exit(1); +} + +if (!strcmp(type, memslot)) { +if (!dev-id) { +error_free(err); +fprintf(stderr, error getting memslot device id\n); +exit(1); +} +if (!strcmp(dev-id, initialslot)) continue; +slot = MEMSLOT(dev); +fw_cfg_slots[3 * slot-idx] = cpu_to_le64(slot-start); +fw_cfg_slots[3 * slot-idx + 1] = cpu_to_le64(slot-size); +fw_cfg_slots[3 * slot-idx + 2] = cpu_to_le64(slot-node); +i++; +} +} +assert(i == nb_hp_memslots); +} + static long get_file_size(FILE *f) { long where, size; diff --git a/vl.c b/vl.c index ae91a8a..50df453 100644 --- a/vl.c +++ b/vl.c @@ -3428,8 +3428,10 @@ int main(int argc, char **argv, char **envp) register_savevm_live(NULL, ram, 0, 4, NULL, ram_save_live, NULL, ram_load, NULL); +if 
(!nb_numa_nodes) +nb_numa_nodes = 1; -if (nb_numa_nodes 0) { +{ int i; if (nb_numa_nodes MAX_NODES) { -- 1.7.9 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
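The word layout of the extended NUMA fw_cfg blob described in this patch can be sketched as a pair of index helpers. This is only an illustration of the arithmetic in the commit message and the g_malloc0() call; the helper names (numa_cfg_words, numa_cfg_slot_index, numa_cfg_set_slot) are made up here and do not exist in QEMU, and the sketch stores values in host byte order, whereas the patch converts each word with cpu_to_le64().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout taken from the patch description (all entries are 64-bit words):
 *   word 0                      number of NUMA nodes
 *   word 1                      number of hotplug memory slots
 *   next max_cpus words         node of each VCPU
 *   next nb_nodes words         memory size of each node
 *   next 3 * nb_slots words     (start, size, node) for each hotplug slot
 * The function names below are hypothetical. */
static size_t numa_cfg_words(size_t max_cpus, size_t nb_nodes, size_t nb_slots)
{
    return 2 + max_cpus + nb_nodes + 3 * nb_slots;
}

/* Index of the first of the three words describing hotplug slot idx. */
static size_t numa_cfg_slot_index(size_t max_cpus, size_t nb_nodes,
                                  size_t nb_slots, size_t idx)
{
    assert(idx < nb_slots);
    return 2 + max_cpus + nb_nodes + 3 * idx;
}

/* Store one (start, size, node) triple into a caller-provided table. */
static void numa_cfg_set_slot(uint64_t *cfg, size_t max_cpus, size_t nb_nodes,
                              size_t nb_slots, size_t idx,
                              uint64_t start, uint64_t size, uint64_t node)
{
    size_t base = numa_cfg_slot_index(max_cpus, nb_nodes, nb_slots, idx);

    cfg[base] = start;
    cfg[base + 1] = size;
    cfg[base + 2] = node;
}
```

With 4 VCPUs, 1 node and 2 hotplug slots the table would hold 13 words, and the first slot triple would start at word 7.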
[RFC PATCH 2/9][SeaBIOS] Implement acpi-dsdt functions for memory hotplug.
Extend the DSDT to include methods for handling memory hot-add and hot-remove notifications and memory device status requests. These functions are called from the memory device SSDT methods. Eject has only been tested with level gpe event, but will be changed to edge gpe event soon, according to recent master patch for other ACPI hotplug events. Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- src/acpi-dsdt.dsl | 68 +++- 1 files changed, 66 insertions(+), 2 deletions(-) diff --git a/src/acpi-dsdt.dsl b/src/acpi-dsdt.dsl index 4bdc268..184daf0 100644 --- a/src/acpi-dsdt.dsl +++ b/src/acpi-dsdt.dsl @@ -709,9 +709,72 @@ DefinitionBlock ( } Return(One) } -} +/* Objects filled in by run-time generated SSDT */ +External(MTFY, MethodObj) +External(MEON, PkgObj) + +Method (CMST, 1, NotSerialized) { +// _STA method - return ON status of memdevice +// Local0 = MEON flag for this cpu +Store(DerefOf(Index(MEON, Arg0)), Local0) +If (Local0) { Return(0xF) } Else { Return(0x0) } +} +/* Memory eject notify method */ +OperationRegion(MEMJ, SystemIO, 0xaf40, 32) +Field (MEMJ, ByteAcc, NoLock, Preserve) +{ +MPE, 256 +} + +Method (MPEJ, 2, NotSerialized) { +// _EJ0 method - eject callback +Store(ShiftLeft(1,Arg0), MPE) +Sleep(200) +} + +/* Memory hotplug notify method */ +OperationRegion(MEST, SystemIO, 0xaf20, 32) +Field (MEST, ByteAcc, NoLock, Preserve) +{ +MES, 256 +} + +Method(MESC, 0) { +// Local5 = active memdevice bitmap +Store (MES, Local5) +// Local2 = last read byte from bitmap +Store (Zero, Local2) +// Local0 = memory device iterator +Store (Zero, Local0) +While (LLess(Local0, SizeOf(MEON))) { +// Local1 = MEON flag for this memory device +Store(DerefOf(Index(MEON, Local0)), Local1) +If (And(Local0, 0x07)) { +// Shift down previously read bitmap byte +ShiftRight(Local2, 1, Local2) +} Else { +// Read next byte from memdevice bitmap +Store(DerefOf(Index(Local5, ShiftRight(Local0, 3))), Local2) +} +// Local3 = active state for this memory device 
+Store(And(Local2, 1), Local3) +If (LNotEqual(Local1, Local3)) { +// State change - update MEON with new state +Store(Local3, Index(MEON, Local0)) +// Do MEM notify +If (LEqual(Local3, 1)) { +MTFY(Local0, 1) +} Else { +MTFY(Local0, 3) +} +} +Increment(Local0) +} +Return(One) +} +} / * General purpose events / @@ -732,7 +795,8 @@ DefinitionBlock ( Return(\_SB.PRSC()) } Method(_L03) { -Return(0x01) +// Memory hotplug event +Return(\_SB.MESC()) } Method(_L04) { Return(0x01) -- 1.7.9 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
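For readers less familiar with ASL, the scan performed by the MESC method can be paraphrased in C roughly as follows. This is a sketch of the logic only, not code from SeaBIOS or QEMU: mem_scan and struct mem_event are hypothetical names, and the events array stands in for the MTFY calls. The codes 1 and 3 are the values the patch passes to MTFY; in ACPI, Notify value 1 means Device Check and 3 means Eject Request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NOTIFY_DEVICE_CHECK = 1, NOTIFY_EJECT_REQUEST = 3 };

/* Hypothetical record of one notification that MESC would send via MTFY. */
struct mem_event {
    size_t device;
    int code;
};

/* Compare the cached per-device flags (the MEON package) with the bitmap
 * read from the MES register. Every device whose state differs gets its
 * cached flag updated and one event recorded. The events array must have
 * room for ndev entries. Returns the number of events recorded. */
static size_t mem_scan(uint8_t *meon, size_t ndev, const uint8_t *bitmap,
                       struct mem_event *events)
{
    size_t n = 0;

    assert(meon && bitmap && events);
    for (size_t i = 0; i < ndev; i++) {
        /* Bit i of the bitmap: byte i / 8, bit i % 8. */
        uint8_t active = (bitmap[i >> 3] >> (i & 7)) & 1;

        if (meon[i] != active) {
            meon[i] = active;
            events[n].device = i;
            events[n].code = active ? NOTIFY_DEVICE_CHECK
                                    : NOTIFY_EJECT_REQUEST;
            n++;
        }
    }
    return n;
}
```

The ASL version walks the bitmap one byte at a time and shifts the current byte right on each iteration; indexing the bit directly, as done here, gives the same per-device result.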
Re: [Qemu-devel] [RFC PATCH 0/9] ACPI memory hotplug
Hi,

On Thu, Apr 19, 2012 at 09:49:31AM -0500, Anthony Liguori wrote:
> On 04/19/2012 09:08 AM, Vasilis Liaskovitis wrote:
> > This is a prototype for ACPI memory hotplug on x86_64 target. Based on some
> > earlier work and comments from Gleb.
> >
> > Memslot devices are modeled with a new qemu command line
> >
> > -memslot id=name,start=start_addr,size=sz,node=pxm
>
> Hi,
>
> For 1.2, I'd really like to focus on refactoring the PC machine as described in
> this series:
>
> https://github.com/aliguori/qemu/commits/qom-rebase.12
>
> I'd like to represent the guest memory as a DIMM device.
>
> In terms of this proposal, I would then expect that the i440fx device would have
> a num_dimms property that controlled how many link<DIMM>'s it had. Hotplug
> would consist of creating a DIMM at run time and connecting it to the
> appropriate link.

ok, makes sense.

> One thing that's not clear to me is how the start/size fits in. On bare metal,
> is this something that's calculated by the firmware during start up and then
> populated in ACPI? Does it do something like take the largest possible DIMM
> size that it supports and fill out the table?

The current series works as follows: For each DIMM/memslot option, firmware constructs a QWordMemory ACPI object (see ACPI spec, ASL 18.5.95). This object has AddressMinimum, AddressMaximum and RangeLength fields. The first of these corresponds directly to the start attribute, the third corresponds to size, and the second is derived from both.

On bare metal, I believe the firmware detects the actual DIMM devices and their size and calculates the physical offset (AddressMinimum) for each, taking into account a possible PCI hole. I doubt the largest possible DIMM size is used, since a hotplug entity/event should correspond to a physical device. (Kevin or Gleb may have a better idea of what real metal firmware usually does.)

Perhaps you are suggesting having a predefined number of equally sized DIMMs as being hotplug-able? This way no memslot/DIMM config would have to be passed by the user at the command line for each DIMM.

In this series, the user-defined memslot options pass the desired DIMM descriptions to SeaBIOS, which then builds the aforementioned objects. (I assume it would be possible to pass this info with normal -device commands, after proper qom-ification.)

> At any rate, I think we should focus on modeling this in QOM versus adding a new
> option and hacking at the existing memory init code.

agreed. I will take a look into qom-rebase.

thanks,

- Vasilis
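The relationship between the three QWordMemory fields mentioned in the reply above can be written down as a short C sketch. It assumes the usual convention that AddressMaximum is the last valid address of the range (inclusive); struct mem_range and mem_range_from_slot are illustrative names, not SeaBIOS or QEMU identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three QWordMemory fields discussed. */
struct mem_range {
    uint64_t addr_min; /* AddressMinimum: the memslot start attribute */
    uint64_t addr_max; /* AddressMaximum: last address of the range    */
    uint64_t length;   /* RangeLength: the memslot size attribute      */
};

/* Derive the fields from a memslot's start and size. */
static struct mem_range mem_range_from_slot(uint64_t start, uint64_t size)
{
    struct mem_range r;

    assert(size > 0);
    assert(start <= UINT64_MAX - (size - 1)); /* range must not wrap */
    r.addr_min = start;
    r.addr_max = start + size - 1;
    r.length = size;
    return r;
}
```

For the example slot from the cover letter (start at 4G, size 1G) this gives AddressMinimum 0x100000000, AddressMaximum 0x13FFFFFFF and RangeLength 0x40000000.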
Re: [RFC PATCH 7/9] Implement memslot command-line option and memslot hmp command
Hi,

On Thu, Apr 19, 2012 at 05:22:52PM +0300, Avi Kivity wrote:
> On 04/19/2012 05:08 PM, Vasilis Liaskovitis wrote:
> > Implement -memslot qemu-kvm command line option to define hotplug-able memory
> > slots.
> > Syntax: -memslot id=name,start=addr,size=sz,node=nodeid
> >
> > e.g. -memslot id=hot1,start=4294967296,size=1073741824,node=0
> > will define a 1G memory slot starting at physical address 4G, belonging to numa
> > node 0. Defining no node will automatically add a memslot to node 0.
>
> start=4G,size=1G ought to work too, no?

it should, but it didn't when I tried. Probably some silliness on my part, I will retry.

thanks,

- Vasilis
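To make the equivalence in this exchange concrete (start=4G versus start=4294967296), here is a minimal, standalone size parser with binary suffixes. It is not QEMU's option parser and makes no claim about how QEMU_OPT_SIZE is actually implemented; parse_size is a hypothetical helper that only shows the arithmetic a K/M/G/T suffix would imply.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse strings such as "1073741824", "1024M" or "4G" into a byte count.
 * Suffixes are binary multiples (K = 2^10, M = 2^20, G = 2^30, T = 2^40).
 * Returns 0 on success, -1 on malformed input or overflow. */
static int parse_size(const char *s, uint64_t *out)
{
    char *end;
    unsigned long long v;
    unsigned shift = 0;

    assert(s && out);
    if (*s < '0' || *s > '9') {
        return -1; /* reject empty strings, signs and leading spaces */
    }
    errno = 0;
    v = strtoull(s, &end, 10);
    if (errno) {
        return -1;
    }
    switch (*end) {
    case '\0':
        break;
    case 'K': case 'k': shift = 10; end++; break;
    case 'M': case 'm': shift = 20; end++; break;
    case 'G': case 'g': shift = 30; end++; break;
    case 'T': case 't': shift = 40; end++; break;
    default:
        return -1;
    }
    if (*end != '\0') {
        return -1; /* trailing characters after the suffix */
    }
    if (v > (UINT64_MAX >> shift)) {
        return -1; /* the shifted value would not fit in 64 bits */
    }
    *out = (uint64_t)v << shift;
    return 0;
}
```

Under these assumptions "4G" and "4294967296" parse to the same value, as do "1G" and "1073741824".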
live migration between qemu-kvm 1.0 and 0.15
Hi,

is live migration between qemu-kvm stable-0.15 and stable-1.0 trees possible?

When I live migrate a VM from 1.0 to 0.15, the destination side 0.15 qemu-kvm exits with:

(qemu) Unknown savevm section or instance 'i8259' 0

That's expected, since commit "i8259: convert to qdev" (747c70af78f7088f182c87e683bdf847beead1e4) introduces the i8259 device in the qdev tree.

The other direction (live migrate from 0.15.1 to 1.0.0) seems to work fine. Are any other issues expected in this direction? The vmstate for i8259 has not changed between these trees afaict.

If the qdev-ified i8259 is reverted from stable-1.0 tree (to restore live-migration compatibility between the versions), would you expect problems?

thanks,

- Vasilis
Re: [PATCH][SeaBIOS] memory hotplug
Hi,

On Thu, Mar 15, 2012 at 02:01:38PM +0200, Gleb Natapov wrote:
> Commenting a little bit late, but since you've said that you are working on
> a new version of the patch... better late than never.
>
> On Thu, Aug 11, 2011 at 04:39:38PM +0200, Vasilis Liaskovitis wrote:
> > Hi,
> >
> > I am testing a set of experimental patches for memory-hotplug on x86_64 host /
> > guest combinations. I have implemented this in a similar way to cpu-hotplug.
> >
> > A dynamic SSDT table with all memory devices is created at boot time. This
> > table calls static methods from the DSDT. A byte array indicates which memory
> > device is online or not. This array is kept in sync with a qemu-kvm bitmap array
> > through ioport 0xaf20. Qemu-kvm updates this table on a mem_set command and
> > an ACPI event is triggered.
> >
> > Memory devices are 128MB in size (to match /sys/devices/memory/block_size_bytes
> > in x86_64). They are constructed dynamically in src/ssdt-mem.asl, similarly to
> > hotpluggable-CPUs. The _CRS memstart-memend attribute for each memory device is
> > defined accordingly, skipping the hole at 0xe000 - 0x1. Hotpluggable memory
> > is always located above 4GB.
>
> What is the reason for this limitation?

We currently model a PCI hole from below_4g_mem_size to 4GB, see i440fx_init call in pc_init1. The decision was discussed here: http://patchwork.ozlabs.org/patch/105892/ afaict because there was no clear resolution on using a top-of-memory register. So, hotplugging will start at 4GB + above_4g_mem_size, unless we can model the PCI hole more accurately hardware-wise.

> > Qemu-kvm sets the upper bound of hotpluggable memory with maxmem = [totalmemory in
> > MB] on the command line. Maxmem is an argument for -m similar to maxcpus for smp.
> > E.g. -m 1024,maxmem=2048 on the qemu command line will create memory devices
> > for 2GB of RAM, enabling only 1GB initially.
> >
> > Qemu_monitor triggers a memory hotplug with:
> > (qemu) mem_set [memory range in MBs] online
>
> As far as I see mem_set does not get memory range as a parameter. The
> parameter is amount of memory to add/remove and memory is added/removed
> to/from the top.
>
> This is not flexible enough. Fine-grained control for memory slots is needed.
> What about exposing memory slot configuration to command line like this:
>
> -memslot mem=size,populated=yes|no
>
> adding one of those for each slot.

yes, I agree we need this. Is the idea to model all physical DIMMs? For initial system RAM, does it make sense to explicitly specify slots at the command line, or infer them?

I think we can allocate a new qemu ram MemoryRegion for each new hotplugged slot/DIMM, so there will be a 1-1 mapping between new populated slots and qemu memory ram regions. Perhaps we want initial memory allocation to also comply with physical slot/DIMM modeling. Initial (cold) RAM is created as a single MemoryRegion pc.ram.

Also, in kvm we can easily run out of kvm_memory_slots (10 slots per VM and 32 system-wide I think).

> mem_set will get slot id to populate/depopulate just like cpu_set gets
> cpu slot number to remove and not just yanks cpus with highest slot id.

right, but I think for upstream qemu, people would like to eventually use device_add instead of a new mem_set command. Pretty much the same way as cpu hotplug?

For this to happen, memory devices should be modeled in QOM/qdev. Are we planning on keeping CPUSocket structures for CPUs? Or perhaps modelling a memory controller is the right way. What type should the memory controller/devices be a child of?

I'll try to resubmit in a few weeks' time, though depending on feedback, qom/qdev of memory devices will probably take longer.

thanks,

- Vasilis
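The "-m 1024,maxmem=2048" example from the original August 2011 post implies a simple device count, which the following C sketch spells out. It assumes only what that post states (fixed 128MB memory devices, maxmem as the upper bound); struct memdev_plan and plan_memdevs are invented names, and the sketch ignores the PCI hole and the above-4GB placement discussed in the reply.

```c
#include <assert.h>
#include <stdint.h>

#define MEMDEV_MB 128u /* memory device size stated in the original post */

/* Hypothetical summary of how many memory devices a configuration needs. */
struct memdev_plan {
    uint32_t total;  /* devices described to the guest (up to maxmem) */
    uint32_t online; /* devices enabled at boot (initial -m value)    */
};

static struct memdev_plan plan_memdevs(uint32_t mem_mb, uint32_t maxmem_mb)
{
    struct memdev_plan p;

    assert(mem_mb <= maxmem_mb);
    assert(mem_mb % MEMDEV_MB == 0 && maxmem_mb % MEMDEV_MB == 0);
    p.total = maxmem_mb / MEMDEV_MB;
    p.online = mem_mb / MEMDEV_MB;
    return p;
}
```

For -m 1024,maxmem=2048 this yields 16 devices in total, 8 of them online at boot, leaving 8 available for hot-add.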
live migration between amd fam15h-fam10h
Hi,

I am getting a frozen guest when migrating from an Opteron 6274 host (amd fam15h) to an Opteron 6174 host (amd fam10h). The live migration completes successfully, but the guest is frozen: the VNC screen is still there, but no input is possible and no kernel output is seen. Trying c on the qemu-monitor does not help. I am using -cpu Opteron_G3, which I assumed would be ok for both host cpus.

In the opposite direction (migrating from an amd fam10h host to an amd fam15h host) the guest continues to run on the destination. However, on most of these successful live migrations, I notice a clocksource unstable message on the guest kernel (using the default kvm-clock clocksource) e.g.

Clocksource tsc unstable (delta = -1500533439 ns)

The same situation (guest runs on destination with clocksource unstable message) happens when migrating between fam15h hosts (I have not tried between fam10h hosts). Changing the clocksource (tsc, acpi_pm, hpet) does not solve the issue. I also tried -cpu kvm64, with the same result.

qemu-kvm version: 0.15.1, 1.0 or qemu-kvm/master
Host kernel: 3.0.15 (on both hosts)
Guest kernel: 3.0.6 or 3.2

This is the qemu-kvm command line used on the source host:

kvm -enable-kvm -m 1024 -smp 1 -cpu Opteron_G3,check -drive \
file=/opt/test.img,if=none,id=drive-virtio-disk1,format=raw,cache=writethrough,boot=on -device virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -monitor stdio -vnc 0.0.0.0:6 -vga std -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb -device usb-tablet,id=input0

The destination host has the same command line with an added -incoming tcp:. I have mainly tested this with non-shared storage (but shared storage gives the same result). Migration is triggered with migrate -b tcp:destip:

Do the TSC microarchitecture changes in amd fam15h (see AMD SW optimization guide for fam15h, 47414 Rev 3.02 Appendix E) affect pvclock stability on migration within the same family or across families?
cpuid information follows in case it's helpful. 6274 host: eax ineax ebx ecx edx 000d 68747541 444d4163 69746e65 0001 00600f12 02100800 1e98220b 178bfbff 0002 0003 0004 0005 0040 0040 0003 0006 0001 0007 0008 0009 000a 000b 000c 000d 8000 801e 68747541 444d4163 69746e65 8001 00600f12 3000 01c9bfff 2fd3fbff 8002 20444d41 6574704f 286e6f72 20294d54 8003 636f7250 6f737365 32362072 20203437 8004 20202020 20202020 20202020 00202020 8005 ff20ff18 ff20ff30 10040140 40020140 8006 6400 64004200 08008140 0060e140 8007 03d9 8008 3030 500f 8009 800a 0001 0001 14ff 800b 800c 800d 800e 800f 8010 8011 8012 8013 8014 8015 8016 8017 8018 8019 f020f018 6400 801a 0003 801b 00ff 801c 80032013 00010200 800f 801d 801e 0022 0101 0100 Vendor ID: AuthenticAMD; CPUID level 13 AMD-specific functions Version 00600f12: Family: 15 Model: 1 [] Standard feature flags 178bfbff: Floating Point Unit Virtual Mode Extensions Debugging Extensions Page Size Extensions Time Stamp Counter (with RDTSC and CR4 disable bit) Model Specific Registers with RDMSR WRMSR PAE - Page Address Extensions Machine Check Exception COMPXCHG8B Instruction APIC SYSCALL/SYSRET or SYSENTER/SYSEXIT instructions MTRR - Memory Type Range Registers Global paging extension Machine Check Architecture Conditional Move Instruction PAT - Page Attribute Table PSE-36 - Page Size Extensions 19 - reserved MMX instructions FXSAVE/FXRSTOR 25 - reserved 26 - reserved 28 - reserved Generation: 15 Model: 1 Extended feature flags 2fd3fbff: Floating Point Unit Virtual Mode Extensions Debugging Extensions Page Size Extensions Time
Re: [PATCH v2 3/4] uq/master: Add CPU eject handling for acpi_piix4
On Thu, Jan 26, 2012 at 12:46:18PM +0200, Avi Kivity wrote:
> On 01/24/2012 04:56 PM, Vasilis Liaskovitis wrote:
> > On Tue, Jan 24, 2012 at 11:28:41AM +0100, Jan Kiszka wrote:
> > > On 2012-01-24 11:10, Vasilis Liaskovitis wrote:
> > > > Add stub functions for CPU eject callback. Define cpu_acpi_eject property and
> > > > enable eject callback only for pc-1.1 machine model.
> > >
> > > Just to get the idea: What is the plan and advantage of introducing a
> > > stub first? How much more is required to have some usable feature, even
> > > if it's just a fraction of the full support?
> >
> > There's not really an advantage to adding stubs first. The plan depends on the
> > lifecycle patches getting accepted in some form at some point. The code is all out
> > there, and some of it has been reviewed/commented on, but not accepted.
> >
> > kvm needs the following patches:
> > https://lkml.org/lkml/2012/1/6/355 (v7, still in work)
> > http://patchwork.ozlabs.org/patch/127828/
> >
> > This second patch introduces ioctl KVM_SETSTATE_VCPU (qemu uses it to signal vcpu
> > destruction to the host), but the review mentions there should be a simpler way. It's
> > unclear to me whether this ioctl is desired or not.
>
> Those patches are not strictly needed. On a kernel that doesn't have
> them, you can simply park the vcpu thread in userspace until it is
> re-added. I suggest writing the qemu patches without the assumption
> that you're running on a 3.4+ kernel.

ok, I will try to handle CPU ejection without relying on the lifecycle patches.

thanks,

- Vasilis
[PATCH v2 0/4] acpi_piix4: Add CPU eject infrastructure for pc-1.1
This patch series adds support for CPU ejection callbacks in SeaBIOS and qemu. This will be needed for proper ACPI vcpu destruction/unplug in conjunction with the vcpu lifecycle patches.

v1->v2: Add pc-1.1 model with cpu acpi ejection property. Add documentation.

v1 of the series also defined the eject method to handle the CPU_DEAD event in the cpu lifecycle/destruction series. That patch has been dropped from the patchset and will be sent separately as the lifecycle/unplug series matures.

v2 patches are based on uq/master, plus a patch from the first version of the vcpu-hotplug qemu upstream series, specifically: http://patchwork.ozlabs.org/patch/136463/

Vasilis Liaskovitis (3):
  uq/master: Add machine model pc-1.1
  uq/master: Add CPU eject handling for acpi_piix4
  uq/master: Add acpi cpu interface documentation

 docs/specs/acpi_hotplug.txt     |   49 +++
 docs/specs/acpi_pci_hotplug.txt |   37 -
 hw/acpi_piix4.c                 |   20
 hw/pc_piix.c                    |   16
 4 files changed, 85 insertions(+), 37 deletions(-)
 create mode 100644 docs/specs/acpi_hotplug.txt
 delete mode 100644 docs/specs/acpi_pci_hotplug.txt

--
1.7.7.3
[PATCH v2 1/4][SeaBios] Add bitmap for CPU EJ0 callback
Add bitmap for CPU EJ0 callback and write to it on a cpu _EJ0 callback. Remove Sleep() call.

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 src/acpi-dsdt.dsl |    8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/src/acpi-dsdt.dsl b/src/acpi-dsdt.dsl
index 7082b65..5138c2a 100644
--- a/src/acpi-dsdt.dsl
+++ b/src/acpi-dsdt.dsl
@@ -650,9 +650,15 @@ DefinitionBlock (
             Store(DerefOf(Index(CPON, Arg0)), Local0)
             If (Local0) { Return(0xF) } Else { Return(0x0) }
         }
+        /* CPU eject notify method */
+        OperationRegion(PREJ, SystemIO, 0xaf20, 32)
+        Field (PREJ, ByteAcc, NoLock, Preserve)
+        {
+            PRE, 256
+        }
         Method (CPEJ, 2, NotSerialized) {
             // _EJ0 method - eject callback
-            Sleep(200)
+            Store(ShiftLeft(1, Arg0), PRE)
         }
         /* CPU hotplug notify method */
--
1.7.7.3
[PATCH 2/4] uq/master: Add machine model pc-1.1
Add machine model pc-1.1

Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
---
 hw/pc_piix.c |    8
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index 744b0dc..ac251c6 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -375,6 +375,13 @@ static void pc_xen_hvm_init(ram_addr_t ram_size,
 }
 #endif

+static QEMUMachine pc_machine_v1_1 = {
+    .name = "pc-1.1",
+    .desc = "Standard PC",
+    .init = pc_init_pci,
+    .max_cpus = 255,
+};
+
 static QEMUMachine pc_machine_v1_0 = {
     .name = "pc-1.0",
     .alias = "pc",
@@ -674,6 +681,7 @@ static QEMUMachine xenfv_machine = {
 static void pc_machine_init(void)
 {
+    qemu_register_machine(&pc_machine_v1_1);
     qemu_register_machine(&pc_machine_v1_0);
     qemu_register_machine(&pc_machine_v0_15);
     qemu_register_machine(&pc_machine_v0_14);
--
1.7.7.3
[PATCH 4/4] uq/master: Add acpi cpu interface documentation
Add CPU acpi interface documentation. Move all ACPI documentation (CPU and PCI) to one file. Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- docs/specs/acpi_hotplug.txt | 49 +++ docs/specs/acpi_pci_hotplug.txt | 37 - 2 files changed, 49 insertions(+), 37 deletions(-) create mode 100644 docs/specs/acpi_hotplug.txt delete mode 100644 docs/specs/acpi_pci_hotplug.txt diff --git a/docs/specs/acpi_hotplug.txt b/docs/specs/acpi_hotplug.txt new file mode 100644 index 000..2026bed --- /dev/null +++ b/docs/specs/acpi_hotplug.txt @@ -0,0 +1,49 @@ +QEMU-ACPI BIOS PCI hotplug interface +-- + +QEMU supports PCI hotplug via ACPI, for PCI bus 0. This document +describes the interface between QEMU and the ACPI BIOS. + +ACPI GPE block (IO ports 0xafe0-0xafe3, byte access): +- + +Generic ACPI GPE block. Bit 1 (GPE.1) used to notify PCI hotplug/eject +event to ACPI BIOS, via SCI interrupt. + +PCI slot injection notification pending (IO port 0xae00-0xae03, 4-byte access): +--- +Slot injection notification pending. One bit per slot. + +Read by ACPI BIOS GPE.1 handler to notify OS of injection +events. + +PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access): +- +Slot removal notification pending. One bit per slot. + +Read by ACPI BIOS GPE.1 handler to notify OS of removal +events. + +PCI device eject (IO port 0xae08-0xae0b, 4-byte access): + + +Used by ACPI BIOS _EJ0 method to request device removal. One bit per slot. +Reads return 0. + +PCI removability status (IO port 0xae0c-0xae0f, 4-byte access): +--- + +Used by ACPI BIOS _RMV method to indicate removability status to OS. One +bit per slot. + +CPU hotplug notification pending (IO port 0xaf00-0xaf1f, 32-byte access): +--- +CPU hotplug notification pending. One bit per cpu. + +Read by ACPI BIOS GPE.2 handler to notify OS of injection +events. + +CPU eject (IO port 0xaf20-0xaf3f, 32-byte access): + + +Used by ACPI BIOS _EJ0 method to request cpu removal. One bit per cpu. 
diff --git a/docs/specs/acpi_pci_hotplug.txt b/docs/specs/acpi_pci_hotplug.txt deleted file mode 100644 index f0f74a7..000 --- a/docs/specs/acpi_pci_hotplug.txt +++ /dev/null @@ -1,37 +0,0 @@ -QEMU-ACPI BIOS PCI hotplug interface --- - -QEMU supports PCI hotplug via ACPI, for PCI bus 0. This document -describes the interface between QEMU and the ACPI BIOS. - -ACPI GPE block (IO ports 0xafe0-0xafe3, byte access): -- - -Generic ACPI GPE block. Bit 1 (GPE.1) used to notify PCI hotplug/eject -event to ACPI BIOS, via SCI interrupt. - -PCI slot injection notification pending (IO port 0xae00-0xae03, 4-byte access): -Slot injection notification pending. One bit per slot. - -Read by ACPI BIOS GPE.1 handler to notify OS of injection -events. - -PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access): -- -Slot removal notification pending. One bit per slot. - -Read by ACPI BIOS GPE.1 handler to notify OS of removal -events. - -PCI device eject (IO port 0xae08-0xae0b, 4-byte access): - - -Used by ACPI BIOS _EJ0 method to request device removal. One bit per slot. -Reads return 0. - -PCI removability status (IO port 0xae0c-0xae0f, 4-byte access): - -Used by ACPI BIOS _RMV method to indicate removability status to OS. One -bit per slot. -- 1.7.7.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 3/4] uq/master: Add CPU eject handling for acpi_piix4
Add stub functions for CPU eject callback. Define cpu_acpi_eject property and enable eject callback only for pc-1.1 machine model. Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com --- hw/acpi_piix4.c | 20 hw/pc_piix.c|8 2 files changed, 28 insertions(+), 0 deletions(-) diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c index 96e1ce8..8475aa6 100644 --- a/hw/acpi_piix4.c +++ b/hw/acpi_piix4.c @@ -40,6 +40,7 @@ #define GPE_BASE 0xafe0 #define PROC_BASE 0xaf00 +#define PROC_EJ_BASE 0xaf20 #define GPE_LEN 4 #define PCI_BASE 0xae00 #define PCI_EJ_BASE 0xae08 @@ -80,6 +81,8 @@ typedef struct PIIX4PMState { struct gpe_regs gpe_cpu; struct pci_status pci0_status; uint32_t pci0_hotplug_enable; +/* for cpu hotplug */ +uint32_t cpu_acpi_eject; } PIIX4PMState; static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s); @@ -424,6 +427,7 @@ static PCIDeviceInfo piix4_pm_info = { .class_id = PCI_CLASS_BRIDGE_OTHER, .qdev.props = (Property[]) { DEFINE_PROP_UINT32(smb_io_base, PIIX4PMState, smb_io_base, 0), +DEFINE_PROP_UINT32(cpu_acpi_eject, PIIX4PMState, cpu_acpi_eject, 0), DEFINE_PROP_END_OF_LIST(), } }; @@ -497,6 +501,17 @@ static void pcihotplug_write(void *opaque, uint32_t addr, uint32_t val) PIIX4_DPRINTF(pcihotplug write %x == %d\n, addr, val); } +static uint32_t cpuej_read(void *opaque, uint32_t addr) +{ +PIIX4_DPRINTF(cpuej read %x\n, addr); +return 0; +} + +static void cpuej_write(void *opaque, uint32_t addr, uint32_t val) +{ +PIIX4_DPRINTF(cpuej write %x == %d\n, addr, val); +} + static uint32_t pciej_read(void *opaque, uint32_t addr) { PIIX4_DPRINTF(pciej read %x\n, addr); @@ -555,6 +570,11 @@ static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s) register_ioport_write(PROC_BASE, 32, 1, gpe_writeb, s); register_ioport_read(PROC_BASE, 32, 1, gpe_readb, s); +if (s-cpu_acpi_eject) { +register_ioport_write(PROC_EJ_BASE, 32, 1, cpuej_write, s); +register_ioport_read(PROC_EJ_BASE, 32, 1, cpuej_read, s); +} + 
register_ioport_write(PCI_BASE, 8, 4, pcihotplug_write, pci0_status); register_ioport_read(PCI_BASE, 8, 4, pcihotplug_read, pci0_status); diff --git a/hw/pc_piix.c b/hw/pc_piix.c index ac251c6..6d61567 100644 --- a/hw/pc_piix.c +++ b/hw/pc_piix.c @@ -380,6 +380,14 @@ static QEMUMachine pc_machine_v1_1 = { .desc = Standard PC, .init = pc_init_pci, .max_cpus = 255, +.compat_props = (GlobalProperty[]) { +{ +.driver = PIIX4_PM, +.property = cpu_acpi_eject, +.value= stringify(1), +}, +{ /* end of list */ } +}, }; static QEMUMachine pc_machine_v1_0 = { -- 1.7.7.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 3/4] uq/master: Add CPU eject handling for acpi_piix4
On Tue, Jan 24, 2012 at 11:28:41AM +0100, Jan Kiszka wrote:
> On 2012-01-24 11:10, Vasilis Liaskovitis wrote:
> > Add stub functions for CPU eject callback. Define cpu_acpi_eject property and
> > enable eject callback only for pc-1.1 machine model.
>
> Just to get the idea: What is the plan and advantage of introducing a
> stub first? How much more is required to have some usable feature, even
> if it's just a fraction of the full support?

There's not really an advantage to adding stubs first. The plan depends on the lifecycle patches getting accepted in some form at some point. The code is all out there, and some of it has been reviewed/commented on, but not accepted.

kvm needs the following patches:
https://lkml.org/lkml/2012/1/6/355 (v7, still in work)
http://patchwork.ozlabs.org/patch/127828/

This second patch introduces ioctl KVM_SETSTATE_VCPU (qemu uses it to signal vcpu destruction to the host), but the review mentions there should be a simpler way. It's unclear to me whether this ioctl is desired or not.

userspace qemu/qemu-kvm need some form of these patches:
http://patchwork.ozlabs.org/patch/127831/
http://patchwork.ozlabs.org/patch/127830/
http://patchwork.ozlabs.org/patch/127833/
http://patchwork.ozlabs.org/patch/127834/

Assuming that the above is further reviewed and accepted, the extra code needed to actually make something useful in the stub functions would be something like the following (with the above ioctl), comments welcome. This code calls a kvm function from hw/acpi_piix4.c, so it's probably not well abstracted enough for upstream.

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index 8475aa6..b5fcb4a 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -509,6 +509,20 @@ static uint32_t cpuej_read(void *opaque, uint32_t addr)
 static void cpuej_write(void *opaque, uint32_t addr, uint32_t val)
 {
+    PIIX4PMState *s = opaque;
+    CPUState *env;
+    int cpu;
+    int ret;
+
+    cpu = ffs(val);
+    /* zero means no bit was set, i.e. no CPU ejection happened */
+    if (!cpu)
+       return;
+    cpu--;
+    env = cpu_phyid_to_cpu((uint64_t)cpu);
+    if (s->kvm_enabled && env != NULL) {
+        kvm_eject_vcpu(env);
+    }
     PIIX4_DPRINTF("cpuej write %x == %d\n", addr, val);
 }

diff --git a/kvm-all.c b/kvm-all.c
index 88f1156..d3e53f5 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -193,6 +193,13 @@ static void kvm_reset_vcpu(void *opaque)
     kvm_arch_reset_vcpu(env);
 }

+static void kvm_eject_vcpu(void *opaque)
+{
+    CPUState *env = opaque;
+
+    kvm_arch_eject_vcpu(env);
+}
+
 int kvm_irqchip_in_kernel(void)
 {
     return kvm_state->irqchip_in_kernel;
diff --git a/kvm.h b/kvm.h
index 40b5ffc..ace28a8 100644
--- a/kvm.h
+++ b/kvm.h
@@ -125,6 +125,8 @@ int kvm_arch_init_vcpu(CPUState *env);
 void kvm_arch_reset_vcpu(CPUState *env);

+void kvm_arch_eject_vcpu(CPUState *env);
+
 int kvm_arch_on_sigbus_vcpu(CPUState *env, int code, void *addr);
 int kvm_arch_on_sigbus(int code, void *addr);
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index e41de39..f8239c0 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -589,6 +589,21 @@ void kvm_arch_reset_vcpu(CPUState *env)
     }
 }

+void kvm_arch_eject_vcpu(CPUState *env)
+{
+    struct kvm_vcpu_state state;
+    int ret = 0;
+
+    if (env->state == CPU_STATE_ZAPREQ) {
+        state.vcpu_id = env->cpu_index;
+        state.state = 1;
+        ret = kvm_vm_ioctl(env->kvm_state, KVM_SETSTATE_VCPU, &state);
+        if (ret)
+            fprintf(stderr, "KVM_SETSTATE_VCPU failed: %s\n",
+                    strerror(ret));
+    }
+}
+
 static int kvm_get_supported_msrs(KVMState *s)
 {
     static int kvm_supported_msrs;
Re: [PATCH 1/3][Seabios] Add bitmap for cpu _EJ0 callback
On Fri, Jan 13, 2012 at 07:27:01PM -0500, Kevin O'Connor wrote:
> [...]
>  Method (CPEJ, 2, NotSerialized) {
>      // _EJ0 method - eject callback
> +    Store(ShiftLeft(1, Arg0), PRE)
>      Sleep(200)
>  }

I have another question here: the PCI _EJ0 callback seems to return 0x0, but the CPU _EJ0 doesn't return anything. The ACPI spec 4.0a draft, section 6.3.3, mentions that _EJx methods have no return value. Is the above difference intentional?

thanks,

- Vasilis
Re: [PATCH 2/3] acpi_piix4: Add stub functions for CPU eject callback
On Sun, Jan 15, 2012 at 02:38:52PM +0200, Avi Kivity wrote:
> On 01/13/2012 01:11 PM, Vasilis Liaskovitis wrote:
> > Signed-off-by: Vasilis Liaskovitis <vasilis.liaskovi...@profitbricks.com>
> > ---
> >  hw/acpi_piix4.c | 15 +++
> >  1 files changed, 15 insertions(+), 0 deletions(-)
> >
> > diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
> > index d5743b6..8bf30dd 100644
> > --- a/hw/acpi_piix4.c
> > +++ b/hw/acpi_piix4.c
> > @@ -37,6 +37,7 @@
> >  #define GPE_BASE 0xafe0
> >  #define PROC_BASE 0xaf00
> > +#define PROC_EJ_BASE 0xaf20
>
> We're adding stuff to piix4 which was never there. At a minimum this
> needs to be documented. Also needs to be -M pc-1.1 and later only.

Where should this be documented? PCI/ACPI hotplug addresses are documented in docs/specs/acpi_pci_hotplug.txt, but I don't see any corresponding documentation for CPU hotplug (i.e. for the existing PROC_BASE). I will create a docs/specs/acpi_cpu_hotplug.txt if that sounds reasonable.

For pc-1.1, I assume a new QEMUMachine type will be needed. Should a check be made against the machine version in the piix4 code? Are there any relevant examples?

thanks,

- Vasilis
Re: [PATCH 1/3][Seabios] Add bitmap for cpu _EJ0 callback
On Fri, Jan 13, 2012 at 07:27:01PM -0500, Kevin O'Connor wrote:
> On Fri, Jan 13, 2012 at 12:11:30PM +0100, Vasilis Liaskovitis wrote:
> > Signed-off-by: Vasilis Liaskovitis <vasilis.liaskovi...@profitbricks.com>
>
> The SeaBIOS change is okay with me, but the qemu/kvm change needs to
> be accepted first.
>
> [...]
> >  Method (CPEJ, 2, NotSerialized) {
> >      // _EJ0 method - eject callback
> > +    Store(ShiftLeft(1, Arg0), PRE)
> >      Sleep(200)
> >  }
>
> Is the Sleep() still needed?

I believe it's unnecessary. I'll test without it and resend.

thanks,

- Vasilis
[PATCH 0/3] acpi_piix4: Add CPU eject handling
This patch series adds support for the CPU _EJ0 callback in SeaBIOS and qemu-kvm. The first patch defines the CPU eject bitmap in SeaBIOS and writes to it during the callback. The second patch adds empty stub functions to qemu-kvm to handle the bitmap writes. The third patch defines the eject method to handle the CPU_DEAD event in Liu Ping Fan's cpu lifecycle/destruction patchseries, see:
http://patchwork.ozlabs.org/patch/127832/

This ACPI implementation can be used instead of the cpustate virtio/pci device in the original series.

Vasilis Liaskovitis (2):
  acpi_piix4: Add CPU ejection handling
  acpi_piix4: Call KVM_SETSTATE_VCPU ioctl on cpu ejection

 hw/acpi_piix4.c | 36
 1 files changed, 36 insertions(+), 0 deletions(-)

--
1.7.7.3
[PATCH 1/3][Seabios] Add bitmap for cpu _EJ0 callback
Signed-off-by: Vasilis Liaskovitis <vasilis.liaskovi...@profitbricks.com>
---
 src/acpi-dsdt.dsl | 7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/src/acpi-dsdt.dsl b/src/acpi-dsdt.dsl
index 7082b65..71d8ac4 100644
--- a/src/acpi-dsdt.dsl
+++ b/src/acpi-dsdt.dsl
@@ -650,8 +650,15 @@ DefinitionBlock (
             Store(DerefOf(Index(CPON, Arg0)), Local0)
             If (Local0) { Return(0xF) } Else { Return(0x0) }
         }
+        /* CPU eject notify method */
+        OperationRegion(PREJ, SystemIO, 0xaf20, 32)
+        Field (PREJ, ByteAcc, NoLock, Preserve)
+        {
+            PRE, 256
+        }
         Method (CPEJ, 2, NotSerialized) {
             // _EJ0 method - eject callback
+            Store(ShiftLeft(1, Arg0), PRE)
             Sleep(200)
         }
--
1.7.7.3
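For readers following the layout of the eject bitmap: the patch above declares a 32-byte SystemIO region at 0xaf20 holding a single 256-bit field, PRE, accessed a byte at a time. The following is only a sketch of how a CPU index would map onto that region, assuming the usual little-endian bit numbering of ACPI fields (bit 0 of PRE is bit 0 of the byte at 0xaf20). The names `ej_encode`, `struct ej_write` and `PROC_EJ_LEN` are hypothetical and appear in none of the patches. It also ignores the AML integer width, which may limit how far `ShiftLeft(1, Arg0)` can actually reach.

```c
#include <assert.h>
#include <stdint.h>

#define PROC_EJ_BASE 0xaf20u /* I/O port base, taken from the patch */
#define PROC_EJ_LEN  32u     /* region length in bytes, i.e. 256 bits */

/* Hypothetical helper type: one byte-wide port write. */
struct ej_write {
    uint16_t port; /* I/O port that receives the non-zero byte */
    uint8_t  mask; /* value written to that port */
};

/* Map a CPU index to the port and bit mask that flag its ejection. */
static struct ej_write ej_encode(unsigned cpu)
{
    struct ej_write w;

    assert(cpu < PROC_EJ_LEN * 8);
    w.port = (uint16_t)(PROC_EJ_BASE + cpu / 8);
    w.mask = (uint8_t)(1u << (cpu % 8));
    return w;
}
```

Under these assumptions, CPU 0 maps to mask 0x01 at port 0xaf20, and CPU 9 to mask 0x02 at port 0xaf21.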
[PATCH 2/3] acpi_piix4: Add stub functions for CPU eject callback
Signed-off-by: Vasilis Liaskovitis <vasilis.liaskovi...@profitbricks.com>
---
 hw/acpi_piix4.c | 15 +++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index d5743b6..8bf30dd 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -37,6 +37,7 @@

 #define GPE_BASE 0xafe0
 #define PROC_BASE 0xaf00
+#define PROC_EJ_BASE 0xaf20
 #define GPE_LEN 4
 #define PCI_BASE 0xae00
 #define PCI_EJ_BASE 0xae08
@@ -493,6 +494,17 @@ static void pcihotplug_write(void *opaque, uint32_t addr, uint32_t val)
     PIIX4_DPRINTF("pcihotplug write %x <== %d\n", addr, val);
 }

+static uint32_t cpuej_read(void *opaque, uint32_t addr)
+{
+    PIIX4_DPRINTF("cpuej read %x\n", addr);
+    return 0;
+}
+
+static void cpuej_write(void *opaque, uint32_t addr, uint32_t val)
+{
+    PIIX4_DPRINTF("cpuej write %x <== %d\n", addr, val);
+}
+
 static uint32_t pciej_read(void *opaque, uint32_t addr)
 {
     PIIX4_DPRINTF("pciej read %x\n", addr);
@@ -553,6 +565,9 @@ static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s)
     register_ioport_write(PROC_BASE, 32, 1, gpe_writeb, s);
     register_ioport_read(PROC_BASE, 32, 1, gpe_readb, s);

+    register_ioport_write(PROC_EJ_BASE, 32, 1, cpuej_write, s);
+    register_ioport_read(PROC_EJ_BASE, 32, 1, cpuej_read, s);
+
     register_ioport_write(PCI_BASE, 8, 4, pcihotplug_write, pci0_status);
     register_ioport_read(PCI_BASE, 8, 4, pcihotplug_read, pci0_status);

--
1.7.7.3
[PATCH 3/3] acpi_piix4: Call KVM_SETSTATE_VCPU ioctl on cpu ejection
Signed-off-by: Vasilis Liaskovitis <vasilis.liaskovi...@profitbricks.com>
---
 hw/acpi_piix4.c | 21 +
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index 8bf30dd..12eef55 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -502,6 +502,27 @@ static uint32_t cpuej_read(void *opaque, uint32_t addr)

 static void cpuej_write(void *opaque, uint32_t addr, uint32_t val)
 {
+    struct kvm_vcpu_state state;
+    CPUState *env;
+    int cpu;
+    int ret;
+
+    cpu = ffs(val);
+    /* zero means no bit was set, i.e. no CPU ejection happened */
+    if (!cpu)
+        return;
+    cpu--;
+    env = cpu_phyid_to_cpu((uint64_t)cpu);
+    if (env != NULL) {
+        if (env->state == CPU_STATE_ZAPREQ) {
+            state.vcpu_id = env->cpu_index;
+            state.state = 1;
+            ret = kvm_vm_ioctl(env->kvm_state, KVM_SETSTATE_VCPU, &state);
+            if (ret)
+                fprintf(stderr, "KVM_SETSTATE_VCPU failed: %s\n",
+                        strerror(ret));
+        }
+    }
     PIIX4_DPRINTF("cpuej write %x <== %d\n", addr, val);
 }

--
1.7.7.3
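A note on the decoding step in the handler above: `ffs()` returns a 1-based position of the lowest set bit, or 0 when no bit is set, which is why the handler returns early on 0 and then decrements. The sketch below is not part of the series. It illustrates the same decoding under the assumption that writes arrive one byte at a time and that the handler sees the absolute port number in `addr`, as the width-1 registration of 32 ports in patch 2/3 suggests. In that case the byte offset from PROC_EJ_BASE would also have to be taken into account for CPU indexes of 8 and above. `ej_decode` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <strings.h> /* ffs() */

#define PROC_EJ_BASE 0xaf20u /* I/O port base, taken from the patch */
#define PROC_EJ_LEN  32u     /* number of byte-wide ports */

/*
 * Decode one byte-wide write into a CPU index.
 * Returns -1 when no bit is set. Only the lowest set bit is
 * considered, matching the ffs()-based logic in the patch.
 */
static int ej_decode(uint32_t addr, uint32_t val)
{
    int bit;

    assert(addr >= PROC_EJ_BASE && addr < PROC_EJ_BASE + PROC_EJ_LEN);
    bit = ffs((int)(val & 0xff)); /* 1-based, 0 if no bit is set */
    if (bit == 0)
        return -1;
    return (int)(addr - PROC_EJ_BASE) * 8 + (bit - 1);
}
```

With this mapping, a write of 0x02 to port 0xaf21 decodes to CPU 9, whereas `ffs(val) - 1` alone would yield 1. Whether that distinction matters in practice depends on how the AML interpreter splits the 256-bit store, which the thread does not discuss.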
Re: [PATCH 3/3] acpi_piix4: Call KVM_SETSTATE_VCPU ioctl on cpu ejection
On Fri, Jan 13, 2012 at 12:58:53PM +0100, Jan Kiszka wrote:
> On 2012-01-13 12:11, Vasilis Liaskovitis wrote:
> > Signed-off-by: Vasilis Liaskovitis <vasilis.liaskovi...@profitbricks.com>
> > ---
> >  hw/acpi_piix4.c | 21 +
> >  1 files changed, 21 insertions(+), 0 deletions(-)
> >
> > diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
> > index 8bf30dd..12eef55 100644
> > --- a/hw/acpi_piix4.c
> > +++ b/hw/acpi_piix4.c
> > @@ -502,6 +502,27 @@ static uint32_t cpuej_read(void *opaque, uint32_t addr)
> >
> >  static void cpuej_write(void *opaque, uint32_t addr, uint32_t val)
> >  {
> > +    struct kvm_vcpu_state state;
> > +    CPUState *env;
> > +    int cpu;
> > +    int ret;
> > +
> > +    cpu = ffs(val);
> > +    /* zero means no bit was set, i.e. no CPU ejection happened */
> > +    if (!cpu)
> > +        return;
> > +    cpu--;
> > +    env = cpu_phyid_to_cpu((uint64_t)cpu);
> > +    if (env != NULL) {
> > +        if (env->state == CPU_STATE_ZAPREQ) {
> > +            state.vcpu_id = env->cpu_index;
> > +            state.state = 1;
> > +            ret = kvm_vm_ioctl(env->kvm_state, KVM_SETSTATE_VCPU, &state);
>
> That breaks in the absence of KVM or if it is not enabled.

Right, I will rework.

Do we expect icc-bus related changes on a CPU unplug? This patch does not handle this yet.

> Also, where was this IOCTL introduced? Where are the linux header changes?

The headers are here:
http://patchwork.ozlabs.org/patch/127834/

And the ioctl is introduced here:
http://patchwork.ozlabs.org/patch/127828/

However, the actual ioctl code seems to have fallen through the cracks in the above patch. A sample implementation against 3.1.0 is below, but I have not included it in the patch series. I expect the ioctl implementation to be part of Liu's kernel kvm-related series.

In any case, this third patch depends on the cpu zap/lifecycle patchseries and perhaps should be reviewed separately from the first 2.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6d3a724..8dd9ebd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2095,6 +2095,22 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_ioeventfd(kvm, &data);
 		break;
 	}
+	case KVM_SETSTATE_VCPU: {
+		struct kvm_vcpu_state vcpu_state;
+		struct kvm_vcpu *vcpu;
+		int idx;
+		r = -EFAULT;
+		if (copy_from_user(&vcpu_state, argp,
+				   sizeof(struct kvm_vcpu_state)))
+			goto out;
+		idx = srcu_read_lock(&kvm->srcu);
+		kvm_for_each_vcpu(vcpu, kvm)
+			if (vcpu_state.vcpu_id == vcpu->vcpu_id)
+				vcpu->state = vcpu_state.state;
+		srcu_read_unlock(&kvm->srcu, idx);
+		r = 0;
+		break;
+	}
 #ifdef CONFIG_KVM_APIC_ARCHITECTURE
 	case KVM_SET_BOOT_CPU_ID:
 		r = 0;
Re: [PATCH 0/3] acpi_piix4: Add CPU eject handling
On Fri, Jan 13, 2012 at 12:58:10PM +0100, Jan Kiszka wrote:
> Please work against upstream (uq/master for kvm-related patches), not
> qemu-kvm. It possibly makes no technical difference here, but we do not
> want to let the code bases needlessly diverge again. If it does make a
> difference and upstream lacks further bits, push them first.

Apologies, I will from now on.

thanks,

- Vasilis
[PATCH] KVM, CPU hotplug: Avoid wraparound in pvclock_get_nsec_offset
Hotplugging a vCPU with kvmclock enabled can cause a guest stall/hang. When the stall happens, pvclock_clocksource_read() is called for the new vCPU and pvclock_get_nsec_offset() calculates native_read_tsc() - shadow->tsc_timestamp. shadow->tsc_timestamp contains a value larger than native_read_tsc(), so the result is a very large 64-bit unsigned value. The global tsc variable last_value gets updated with this, causing a system stall/freeze:

rcu_sched_state detected stalls on CPUs/tasks ...

The large shadow->tsc_timestamp value observed in the hung cases is the tsc written into the boot clock on VM startup. Is the boot clock persistent in the guest? Can it get accessed by a vCPU other than vCPU 0, if its own hv_clock struct has not yet been registered or if the host has not yet updated the new hv_clock with a valid tsc_timestamp in kvm_guest_time_update()?

Fix temporarily by returning a zero offset if the delta in pvclock_get_nsec_offset() would be negative. Tested on a 3.0.6 guest kernel. Testing this patch requires qemu-kvm from:
git://git.kiszka.org/qemu-kvm.git queues/cpu-hotplug
---
 arch/x86/kernel/pvclock.c | 11 ---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 42eb330..9d31144 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -43,9 +43,14 @@ void pvclock_set_flags(u8 flags)

 static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
 {
-	u64 delta = native_read_tsc() - shadow->tsc_timestamp;
-	return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul,
-				   shadow->tsc_shift);
+    u64 current_read_tsc = native_read_tsc();
+    if (current_read_tsc > shadow->tsc_timestamp) {
+        u64 delta = current_read_tsc - shadow->tsc_timestamp;
+        return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul,
+                shadow->tsc_shift);
+    }
+    /* tsc value can be smaller than tsc_timestamp on a vCPU hotplug */
+    else return 0;
 }

 /*
--
1.7.7.3
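To make the wraparound described in this patch concrete, here is a small standalone sketch of the arithmetic only. It does not use the kernel's pvclock structures. `tsc_delta_unguarded` and `tsc_delta_guarded` are hypothetical names, and the TSC readings are passed in as plain integers rather than read from hardware.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Unsigned subtraction as in the original code: if the timestamp is
 * ahead of the current reading, the result wraps to a huge value.
 */
static uint64_t tsc_delta_unguarded(uint64_t now, uint64_t timestamp)
{
    return now - timestamp;
}

/* Same computation with the guard proposed in the patch. */
static uint64_t tsc_delta_guarded(uint64_t now, uint64_t timestamp)
{
    assert(sizeof(now) == 8); /* the argument relies on 64-bit wraparound */
    return now > timestamp ? now - timestamp : 0;
}
```

With a current reading of 100 and a timestamp of 150, the unguarded version yields 2^64 - 50, while the guarded version yields 0. As the patch description notes, clamping to zero is a temporary workaround and does not explain why the new vCPU sees a timestamp ahead of its own TSC.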
Re: [PATCH] KVM, CPU hotplug: Avoid wraparound in pvclock_get_nsec_offset
On Mon, Dec 12, 2011 at 02:53:29PM +0100, Jan Kiszka wrote:
> Can't comment on the semantics, but your patch is whitespace damaged and
> doesn't follow kernel coding style. But I assume it's not for
> application yet, right?

Right. It fixes the hang for me, but I am not sure it's the best solution. If it is, I'll resend properly.

thanks,

- Vasilis