Re: [Qemu-devel] [PATCH v2 8/8] spapr_pci: Use XICS interrupt allocator and do not cache interrupts in PHB

Alexey Kardashevskiy Tue, 27 May 2014 18:20:07 -0700

On 05/28/2014 10:41 AM, Alexander Graf wrote:
> 
> On 28.05.14 02:34, Alexey Kardashevskiy wrote:
>> On 05/28/2014 09:55 AM, Alexander Graf wrote:
>>> On 27.05.14 06:51, Alexey Kardashevskiy wrote:
>>>> On 05/23/2014 12:25 AM, Alexey Kardashevskiy wrote:
>>>>> On 05/22/2014 08:57 PM, Alexander Graf wrote:
>>>>>> On 22.05.14 12:53, Alexey Kardashevskiy wrote:
>>>>>>> On 05/22/2014 05:16 PM, Alexander Graf wrote:>
>>>>>>>>> Am 22.05.2014 um 08:53 schrieb Alexey Kardashevskiy <a...@ozlabs.ru>:
>>>>>>>>>
>>>>>>>>>> On 05/21/2014 10:42 PM, Alexey Kardashevskiy wrote:
>>>>>>>>>>> On 05/21/2014 08:35 PM, Alexander Graf wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 21.05.14 12:13, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>> On 05/21/2014 07:50 PM, Alexander Graf wrote:
>>>>>>>>>>>>>> On 21.05.14 11:33, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>>> On 05/21/2014 07:13 PM, Alexander Graf wrote:
>>>>>>>>>>>>>>>> On 21.05.14 11:11, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>>>> On Wed, May 21, 2014 at 11:06:09AM +0200, Alexander Graf
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> On 21.05.14 10:52, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>>>>>>> On 05/21/2014 06:40 PM, Alexander Graf wrote:
>>>>>>>>>>>>>>>>>>>> On 15.05.14 11:59, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>>>>>>>> Currently SPAPR PHB keeps track of all allocated MSI/MISX
>>>>>>>>>>>>>>>>>>>> interrupt as
>>>>>>>>>>>>>>>>>>>> XICS used to be unable to reuse interrupts which becomes a
>>>>>>>>>>>>>>>>>>>> problem for
>>>>>>>>>>>>>>>>>>>> dynamic MSI reconfiguration which is happening on guest
>>>>>>>>>>>>>>>>>>>> driver
>>>>>>>>>>>>>>>>>>>> reload or
>>>>>>>>>>>>>>>>>>>> PCI hot (un)plug. Another problem is that PHB has a
>>>>>>>>>>>>>>>>>>>> limit of
>>>>>>>>>>>>>>>>>>>> devices
>>>>>>>>>>>>>>>>>>>> supporting MSI/MSIX (SPAPR_MSIX_MAX_DEVS=32) and there
>>>>>>>>>>>>>>>>>>>> is no
>>>>>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>>>>> reason
>>>>>>>>>>>>>>>>>>>> for that.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This makes use of new XICS ability to reuse interrupts.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This removes cached MSI configuration from SPAPR PHB so
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> first
>>>>>>>>>>>>>>>>>>>> IRQ
>>>>>>>>>>>>>>>>>>>> number
>>>>>>>>>>>>>>>>>>>> of a device is stored in MSI/MSIX config space so there
>>>>>>>>>>>>>>>>>>>> is no
>>>>>>>>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>>>>>> store
>>>>>>>>>>>>>>>>>>>> this anywhere else. From now on, SPAPR PHB only keeps
>>>>>>>>>>>>>>>>>>>> flags
>>>>>>>>>>>>>>>>>>>> telling
>>>>>>>>>>>>>>>>>>>> what
>>>>>>>>>>>>>>>>>>>> type
>>>>>>>>>>>>>>>>>>>> of interrupt for which device it has configured in
>>>>>>>>>>>>>>>>>>>> order to
>>>>>>>>>>>>>>>>>>>> return
>>>>>>>>>>>>>>>>>>>> error if
>>>>>>>>>>>>>>>>>>>> (for example) MSIX was enabled and the guest is trying to
>>>>>>>>>>>>>>>>>>>> disable
>>>>>>>>>>>>>>>>>>>> MSI
>>>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>>>> it has not enabled.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This removes a limit for the maximum number of
>>>>>>>>>>>>>>>>>>>> MSIX-enabled
>>>>>>>>>>>>>>>>>>>> devices
>>>>>>>>>>>>>>>>>>>> per PHB,
>>>>>>>>>>>>>>>>>>>> now XICS and PCI bus capacity are the only limitation.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This changes migration stream as it fixes
>>>>>>>>>>>>>>>>>>>> vmstate_spapr_pci_msi::name
>>>>>>>>>>>>>>>>>>>> which was
>>>>>>>>>>>>>>>>>>>> wrong since the beginning.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This fixed traces to be more informative.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In reality either MSIX or MSI is enabled, never both. So I
>>>>>>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>>>>> remove
>>>>>>>>>>>>>>>>>>>> msi/msix
>>>>>>>>>>>>>>>>>>>> bitmaps from this patch, would it make sense?
>>>>>>>>>>>>>>>>>>> Is this a hard requirement? Does a device have to choose
>>>>>>>>>>>>>>>>>>> between
>>>>>>>>>>>>>>>>>>> MSIX and
>>>>>>>>>>>>>>>>>>> MSI or could it theoretically have both enabled? Is this a
>>>>>>>>>>>>>>>>>>> PCI
>>>>>>>>>>>>>>>>>>> limitation,
>>>>>>>>>>>>>>>>>>> a PAPR/XICS limitation or just a limitation of your
>>>>>>>>>>>>>>>>>>> implementation?
>>>>>>>>>>>>>>>>>> My implementation does not have this limitation, I asked if
>>>>>>>>>>>>>>>>>> I can
>>>>>>>>>>>>>>>>>> simplify
>>>>>>>>>>>>>>>>>> code by introducing one :)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I cannot see any reason why PCI cannot have both MSI and
>>>>>>>>>>>>>>>>>> MSIX
>>>>>>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>> it does not seem to be used by anyone => cannot debug and
>>>>>>>>>>>>>>>>>> confirm.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> PAPR spec assumes that if the guest tries enabling MSIX when
>>>>>>>>>>>>>>>>>> MSI is
>>>>>>>>>>>>>>>>>> already
>>>>>>>>>>>>>>>>>> enabled, this is a "change", not enabling both types. But
>>>>>>>>>>>>>>>>>> it also
>>>>>>>>>>>>>>>>>> says MSI
>>>>>>>>>>>>>>>>>> and MSIX vector numbers are not shared. Hm.
>>>>>>>>>>>>>>>>> Yeah, I'm not aware of any limitation on hardware here and
>>>>>>>>>>>>>>>>> I'd
>>>>>>>>>>>>>>>>> rather not impose one.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Michael, do you know of any hardware that uses MSI *and*
>>>>>>>>>>>>>>>>> MSI-X at
>>>>>>>>>>>>>>>>> the same time?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>>>> No, and the PCI spec says:
>>>>>>>>>>>>>>>>         A function is permitted to implement both MSI and
>>>>>>>>>>>>>>>> MSI-X, but
>>>>>>>>>>>>>>>> system
>>>>>>>>>>>>>>>>         software is
>>>>>>>>>>>>>>>>         prohibited from enabling both at the same time. If
>>>>>>>>>>>>>>>> system
>>>>>>>>>>>>>>>> software
>>>>>>>>>>>>>>>>         enables both at the same time, the result is
>>>>>>>>>>>>>>>> undefined.
>>>>>>>>>>>>>>> Ah, cool. So yes Alexey, feel free to impose it :).
>>>>>>>>>>>>>> Heh. This solves just half of the problem - I still have to keep
>>>>>>>>>>>>>> track of
>>>>>>>>>>>>>> what device got MSI/MSIX configured via that ibm,change-msi
>>>>>>>>>>>>>> interface. I
>>>>>>>>>>>>>> was hoping I can store such flag somewhere in a device PCI
>>>>>>>>>>>>>> config
>>>>>>>>>>>>>> space
>>>>>>>>>>>>>> but
>>>>>>>>>>>>>> MSI/MSIX enable bit is not good as it is not set when those
>>>>>>>>>>>>>> calls are
>>>>>>>>>>>>>> made.
>>>>>>>>>>>>>> And I cannot rely on address/data fields much as the guest can
>>>>>>>>>>>>>> change them
>>>>>>>>>>>>>> (I already use them to store IRQ numbers and btw it is missing
>>>>>>>>>>>>>> checks when
>>>>>>>>>>>>>> I read them back for disposal, I'll fix in next round).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Or on "enable" event I could put IRQ numbers to .data of MSI
>>>>>>>>>>>>>> config space
>>>>>>>>>>>>>> and on "disable" check if it is not zero, then configuration
>>>>>>>>>>>>>> took
>>>>>>>>>>>>>> place,
>>>>>>>>>>>>>> then I can remove my msi[]/msix[] flag arrays. If the guest did
>>>>>>>>>>>>>> any change
>>>>>>>>>>>>>> to MSI/MSIX config space (it does not on SPAPR except weird
>>>>>>>>>>>>>> selftest
>>>>>>>>>>>>>> cases), I compare .data with what ICS can possibly have and
>>>>>>>>>>>>>> either
>>>>>>>>>>>>>> reject
>>>>>>>>>>>>>> "disable" or handle it and if it breaks XICS - that's too bad
>>>>>>>>>>>>>> for the
>>>>>>>>>>>>>> stupid guest. Would that be acceptable?
>>>>>>>>>>>>> Can't you prohibit the guest from writing to the MSI
>>>>>>>>>>>>> configuration
>>>>>>>>>>>>> registers itself? Then you don't need to do sanity checks.
>>>>>>>>>>>> I could for emulated devices but VFIO uses the same code. For
>>>>>>>>>>>> example,
>>>>>>>>>>>> there is an IBM SCSI IPR card which does a "self test". For
>>>>>>>>>>>> that, it
>>>>>>>>>>>> saves
>>>>>>>>>>>> MSIX BAR content, does reboot via some backdoor interface and
>>>>>>>>>>>> restores MSIX
>>>>>>>>>>>> BAR. It has been solved for VFIO in the host kernel by restoring
>>>>>>>>>>>> MSIX data
>>>>>>>>>>>> from cached values when guest is trying to restore it with what it
>>>>>>>>>>>> thinks
>>>>>>>>>>>> is actual MSIX data (it is virtualized because of x86). But
>>>>>>>>>>>> there is
>>>>>>>>>>>> cache
>>>>>>>>>>> We already have a cache because we don't access the real PCI
>>>>>>>>>>> registers with
>>>>>>>>>>> msi_set_message(), no?
>>>>>>>>>> For emulated devices there is no cache. And in any case the guest is
>>>>>>>>>> allowed to write to it... Who knows what AIX does? I do not.
>>>>>>>>> Tried GHashTable for keeping bus:dev.fn <-> (irq, num), more or
>>>>>>>>> less ok
>>>>>>>>> but
>>>>>>>>> how to migrate such thing? Temporary cache somewhere and then unpack
>>>>>>>>> it? Or
>>>>>>>>> use old style migration callbacks?
>>>>>>>> Could you try to introduce a new vmstate type that just serializes and
>>>>>>>> deserializes hash tables? Maybe there is already a serialization
>>>>>>>> function for it in glib?
>>>>>>> I have not found any (most likely I do not know how to search there),
>>>>>>> I added mine, here are VMSTATE_HASH + its use for SPAPR.
>>>>>>>
>>>>>>>
>>>>>>> Is this a movement to right direction? I need to put key/value sizes
>>>>>>> into VMSTATE definition somehow but do not really want to touch
>>>>>>> VMStateField.
>>>>>> I'm not sure. Juan?
>>>>> I also tried to simplify this particular thing more by assuming that the
>>>>> key is always "int" and put size of value to VMStateField::size. But
>>>>> there
>>>>> is no way to get the size in VMStateInfo::get(). Or I could do a
>>>>> "subsection" and try implementing has as an array (and have get/put
>>>>> defined
>>>>> for items, should work) but in this case I'll need to cache number of
>>>>> elements of the hash table somewhere so I'll need a wrapper around
>>>>> GHashTable.
>>>>>
>>>>> Making generalized version is not obvious for me :(
>>>> Juan?
>>>> Alex?
>>>> Anybody?
>>>> Ping.
>>>>
>>>> How do I migrate GHashTable? If I am allowed to use custom and bit more
>>>> polished get/put from "[PATCH 1/2] vmstate: Add helper to enable
>>>> GHashTable
>>>> migration", I'll be fine.
>>> Yeah, I think it's ok to be custom in this case. Or another crazy idea -
>>> could you flatten the hash table into an array of structs that you can
>>> describe using VMState? You could then convert from the flat array to/from
>>> the GHashTable with pre_load/post_load hooks.
>>
>> Array is exactly what I am trying to get rid of. Ok, I'll remove hashmap at
>> all and implement dynamic flat array (yay, yet another bicycle!).
> 
> Huh? The array would only live during the migration. It would be size=0
> during normal execution, but in a pre_save hook we could make size =
> hash.length() and reuse the existing, working VMState infrastructure.


When would I free that array? What if I continue the source guest and then
migrate again? I mean I can solve all of this for sure but duplicating data
just to make existing migration happy is bit weird. But - I'll do what you
say here, it is no big deal :)


>>> The benefit of that would be that the data becomes more easily
>>> introspectible with tools like my live migration parser.
>> Wow. What is that parser?
> 
> https://lists.gnu.org/archive/html/qemu-devel/2013-10/msg02949.html
> 
> I still need to rewrite that thing to always send the JSON description of
> the stream after the migration's EOF marker. And implement support for the
> HTAB stream ;). But overall it works reasonably well.

Uh. Cool.


-- 
Alexey

Re: [Qemu-devel] [PATCH v2 8/8] spapr_pci: Use XICS interrupt allocator and do not cache interrupts in PHB

Reply via email to