On 05/28/2014 10:41 AM, Alexander Graf wrote: > > On 28.05.14 02:34, Alexey Kardashevskiy wrote: >> On 05/28/2014 09:55 AM, Alexander Graf wrote: >>> On 27.05.14 06:51, Alexey Kardashevskiy wrote: >>>> On 05/23/2014 12:25 AM, Alexey Kardashevskiy wrote: >>>>> On 05/22/2014 08:57 PM, Alexander Graf wrote: >>>>>> On 22.05.14 12:53, Alexey Kardashevskiy wrote: >>>>>>> On 05/22/2014 05:16 PM, Alexander Graf wrote:> >>>>>>>>> Am 22.05.2014 um 08:53 schrieb Alexey Kardashevskiy <a...@ozlabs.ru>: >>>>>>>>> >>>>>>>>>> On 05/21/2014 10:42 PM, Alexey Kardashevskiy wrote: >>>>>>>>>>> On 05/21/2014 08:35 PM, Alexander Graf wrote: >>>>>>>>>>> >>>>>>>>>>>> On 21.05.14 12:13, Alexey Kardashevskiy wrote: >>>>>>>>>>>>> On 05/21/2014 07:50 PM, Alexander Graf wrote: >>>>>>>>>>>>>> On 21.05.14 11:33, Alexey Kardashevskiy wrote: >>>>>>>>>>>>>>> On 05/21/2014 07:13 PM, Alexander Graf wrote: >>>>>>>>>>>>>>>> On 21.05.14 11:11, Michael S. Tsirkin wrote: >>>>>>>>>>>>>>>>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf >>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote: >>>>>>>>>>>>>>>>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote: >>>>>>>>>>>>>>>>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote: >>>>>>>>>>>>>>>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX >>>>>>>>>>>>>>>>>>>> interrupt as >>>>>>>>>>>>>>>>>>>> XICS used to be unable to reuse interrupts which becomes a >>>>>>>>>>>>>>>>>>>> problem for >>>>>>>>>>>>>>>>>>>> dynamic MSI reconfiguration which is happening on guest >>>>>>>>>>>>>>>>>>>> driver >>>>>>>>>>>>>>>>>>>> reload or >>>>>>>>>>>>>>>>>>>> PCI hot (un)plug. Another problem is that PHB has a >>>>>>>>>>>>>>>>>>>> limit of >>>>>>>>>>>>>>>>>>>> devices >>>>>>>>>>>>>>>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there >>>>>>>>>>>>>>>>>>>> is no >>>>>>>>>>>>>>>>>>>> good >>>>>>>>>>>>>>>>>>>> reason >>>>>>>>>>>>>>>>>>>> for that. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> This makes use of new XICS ability to reuse interrupts. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> This removes cached MSI configuration from SPAPR PHB so >>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>> first >>>>>>>>>>>>>>>>>>>> IRQ >>>>>>>>>>>>>>>>>>>> number >>>>>>>>>>>>>>>>>>>> of a device is stored in MSI/MSIX config space so there >>>>>>>>>>>>>>>>>>>> is no >>>>>>>>>>>>>>>>>>>> need to >>>>>>>>>>>>>>>>>>>> store >>>>>>>>>>>>>>>>>>>> this anywhere else. From now on, SPAPR PHB only keeps >>>>>>>>>>>>>>>>>>>> flags >>>>>>>>>>>>>>>>>>>> telling >>>>>>>>>>>>>>>>>>>> what >>>>>>>>>>>>>>>>>>>> type >>>>>>>>>>>>>>>>>>>> of interrupt for which device it has configured in >>>>>>>>>>>>>>>>>>>> order to >>>>>>>>>>>>>>>>>>>> return >>>>>>>>>>>>>>>>>>>> error if >>>>>>>>>>>>>>>>>>>> (for example) MSIX was enabled and the guest is trying to >>>>>>>>>>>>>>>>>>>> disable >>>>>>>>>>>>>>>>>>>> MSI >>>>>>>>>>>>>>>>>>>> which >>>>>>>>>>>>>>>>>>>> it has not enabled. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> This removes a limit for the maximum number of >>>>>>>>>>>>>>>>>>>> MSIX-enabled >>>>>>>>>>>>>>>>>>>> devices >>>>>>>>>>>>>>>>>>>> per PHB, >>>>>>>>>>>>>>>>>>>> now XICS and PCI bus capacity are the only limitation. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> This changes migration stream as it fixes >>>>>>>>>>>>>>>>>>>> vmstate_spapr_pci_msi::name >>>>>>>>>>>>>>>>>>>> which was >>>>>>>>>>>>>>>>>>>> wrong since the beginning. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> This fixed traces to be more informative. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> >>>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> In reality either MSIX or MSI is enabled, never both. So I >>>>>>>>>>>>>>>>>>>> could >>>>>>>>>>>>>>>>>>>> remove >>>>>>>>>>>>>>>>>>>> msi/msix >>>>>>>>>>>>>>>>>>>> bitmaps from this patch, would it make sense? >>>>>>>>>>>>>>>>>>> Is this a hard requirement? Does a device have to choose >>>>>>>>>>>>>>>>>>> between >>>>>>>>>>>>>>>>>>> MSIX and >>>>>>>>>>>>>>>>>>> MSI or could it theoretically have both enabled? Is this a >>>>>>>>>>>>>>>>>>> PCI >>>>>>>>>>>>>>>>>>> limitation, >>>>>>>>>>>>>>>>>>> a PAPR/XICS limitation or just a limitation of your >>>>>>>>>>>>>>>>>>> implementation? >>>>>>>>>>>>>>>>>> My implementation does not have this limitation, I asked if >>>>>>>>>>>>>>>>>> I can >>>>>>>>>>>>>>>>>> simplify >>>>>>>>>>>>>>>>>> code by introducing one :) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I cannot see any reason why PCI cannot have both MSI and >>>>>>>>>>>>>>>>>> MSIX >>>>>>>>>>>>>>>>>> enabled >>>>>>>>>>>>>>>>>> but >>>>>>>>>>>>>>>>>> it does not seem to be used by anyone => cannot debug and >>>>>>>>>>>>>>>>>> confirm. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> PAPR spec assumes that if the guest tries enabling MSIX when >>>>>>>>>>>>>>>>>> MSI is >>>>>>>>>>>>>>>>>> already >>>>>>>>>>>>>>>>>> enabled, this is a "change", not enabling both types. But >>>>>>>>>>>>>>>>>> it also >>>>>>>>>>>>>>>>>> says MSI >>>>>>>>>>>>>>>>>> and MSIX vector numbers are not shared. Hm. >>>>>>>>>>>>>>>>> Yeah, I'm not aware of any limitation on hardware here and >>>>>>>>>>>>>>>>> I'd >>>>>>>>>>>>>>>>> rather not impose one. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Michael, do you know of any hardware that uses MSI *and* >>>>>>>>>>>>>>>>> MSI-X at >>>>>>>>>>>>>>>>> the same time? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Alex >>>>>>>>>>>>>>>> No, and the PCI spec says: >>>>>>>>>>>>>>>> A function is permitted to implement both MSI and >>>>>>>>>>>>>>>> MSI-X, but >>>>>>>>>>>>>>>> system >>>>>>>>>>>>>>>> software is >>>>>>>>>>>>>>>> prohibited from enabling both at the same time. If >>>>>>>>>>>>>>>> system >>>>>>>>>>>>>>>> software >>>>>>>>>>>>>>>> enables both at the same time, the result is >>>>>>>>>>>>>>>> undefined. >>>>>>>>>>>>>>> Ah, cool. So yes Alexey, feel free to impose it :). >>>>>>>>>>>>>> Heh. This solves just half of the problem - I still have to keep >>>>>>>>>>>>>> track of >>>>>>>>>>>>>> what device got MSI/MSIX configured via that ibm,change-msi >>>>>>>>>>>>>> interface. I >>>>>>>>>>>>>> was hoping I can store such flag somewhere in a device PCI >>>>>>>>>>>>>> config >>>>>>>>>>>>>> space >>>>>>>>>>>>>> but >>>>>>>>>>>>>> MSI/MSIX enable bit is not good as it is not set when those >>>>>>>>>>>>>> calls are >>>>>>>>>>>>>> made. >>>>>>>>>>>>>> And I cannot rely on address/data fields much as the guest can >>>>>>>>>>>>>> change them >>>>>>>>>>>>>> (I already use them to store IRQ numbers and btw it is missing >>>>>>>>>>>>>> checks when >>>>>>>>>>>>>> I read them back for disposal, I'll fix in next round). >>>>>>>>>>>>>> >>>>>>>>>>>>>> Or on "enable" event I could put IRQ numbers to .data of MSI >>>>>>>>>>>>>> config space >>>>>>>>>>>>>> and on "disable" check if it is not zero, then configuration >>>>>>>>>>>>>> took >>>>>>>>>>>>>> place, >>>>>>>>>>>>>> then I can remove my msi[]/msix[] flag arrays. If the guest did >>>>>>>>>>>>>> any change >>>>>>>>>>>>>> to MSI/MSIX config space (it does not on SPAPR except weird >>>>>>>>>>>>>> selftest >>>>>>>>>>>>>> cases), I compare .data with what ICS can possibly have and >>>>>>>>>>>>>> either >>>>>>>>>>>>>> reject >>>>>>>>>>>>>> "disable" or handle it and if it breaks XICS - that's too bad >>>>>>>>>>>>>> for the >>>>>>>>>>>>>> stupid guest. Would that be acceptable? >>>>>>>>>>>>> Can't you prohibit the guest from writing to the MSI >>>>>>>>>>>>> configuration >>>>>>>>>>>>> registers itself? Then you don't need to do sanity checks. >>>>>>>>>>>> I could for emulated devices but VFIO uses the same code. For >>>>>>>>>>>> example, >>>>>>>>>>>> there is an IBM SCSI IPR card which does a "self test". For >>>>>>>>>>>> that, it >>>>>>>>>>>> saves >>>>>>>>>>>> MSIX BAR content, does reboot via some backdoor interface and >>>>>>>>>>>> restores MSIX >>>>>>>>>>>> BAR. It has been solved for VFIO in the host kernel by restoring >>>>>>>>>>>> MSIX data >>>>>>>>>>>> from cached values when guest is trying to restore it with what it >>>>>>>>>>>> thinks >>>>>>>>>>>> is actual MSIX data (it is virtualized because of x86). But >>>>>>>>>>>> there is >>>>>>>>>>>> cache >>>>>>>>>>> We already have a cache because we don't access the real PCI >>>>>>>>>>> registers with >>>>>>>>>>> msi_set_message(), no? >>>>>>>>>> For emulated devices there is no cache. And in any case the guest is >>>>>>>>>> allowed to write to it... Who knows what AIX does? I do not. >>>>>>>>> Tried GHashTable for keeping bus:dev.fn <-> (irq, num), more or >>>>>>>>> less ok >>>>>>>>> but >>>>>>>>> how to migrate such thing? Temporary cache somewhere and then unpack >>>>>>>>> it? Or >>>>>>>>> use old style migration callbacks? >>>>>>>> Could you try to introduce a new vmstate type that just serializes and >>>>>>>> deserializes hash tables? Maybe there is already a serialization >>>>>>>> function for it in glib? >>>>>>> I have not found any (most likely I do not know how to search there), >>>>>>> I added mine, here are VMSTATE_HASH + its use for SPAPR. >>>>>>> >>>>>>> >>>>>>> Is this a movement to right direction? I need to put key/value sizes >>>>>>> into VMSTATE definition somehow but do not really want to touch >>>>>>> VMStateField. >>>>>> I'm not sure. Juan? >>>>> I also tried to simplify this particular thing more by assuming that the >>>>> key is always "int" and put size of value to VMStateField::size. But >>>>> there >>>>> is no way to get the size in VMStateInfo::get(). Or I could do a >>>>> "subsection" and try implementing has as an array (and have get/put >>>>> defined >>>>> for items, should work) but in this case I'll need to cache number of >>>>> elements of the hash table somewhere so I'll need a wrapper around >>>>> GHashTable. >>>>> >>>>> Making generalized version is not obvious for me :( >>>> Juan? >>>> Alex? >>>> Anybody? >>>> Ping. >>>> >>>> How do I migrate GHashTable? If I am allowed to use custom and bit more >>>> polished get/put from "[PATCH 1/2] vmstate: Add helper to enable >>>> GHashTable >>>> migration", I'll be fine. >>> Yeah, I think it's ok to be custom in this case. Or another crazy idea - >>> could you flatten the hash table into an array of structs that you can >>> describe using VMState? You could then convert from the flat array to/from >>> the GHashTable with pre_load/post_load hooks. >> >> Array is exactly what I am trying to get rid of. Ok, I'll remove hashmap at >> all and implement dynamic flat array (yay, yet another bicycle!). > > Huh? The array would only live during the migration. It would be size=0 > during normal execution, but in a pre_save hook we could make size = > hash.length() and reuse the existing, working VMState infrastructure.
When would I free that array? What if I continue the source guest and then migrate again? I mean I can solve all of this for sure but duplicating data just to make existing migration happy is bit weird. But - I'll do what you say here, it is no big deal :) >>> The benefit of that would be that the data becomes more easily >>> introspectible with tools like my live migration parser. >> Wow. What is that parser? > > https://lists.gnu.org/archive/html/qemu-devel/2013-10/msg02949.html > > I still need to rewrite that thing to always send the JSON description of > the stream after the migration's EOF marker. And implement support for the > HTAB stream ;). But overall it works reasonably well. Uh. Cool. -- Alexey