On 08.02.22 12:50, Roger Pau Monné wrote:
> On Tue, Feb 08, 2022 at 07:35:34AM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 07.02.22 18:44, Oleksandr Andrushchenko wrote:
>>> On 07.02.22 18:37, Jan Beulich wrote:
>>>> On 07.02.2022 17:21, Oleksandr Andrushchenko wrote:
>>>>> On 07.02.22 18:15, Jan Beulich wrote:
>>>>>> On 07.02.2022 17:07, Oleksandr Andrushchenko wrote:
>>>>>>> On 07.02.22 17:26, Jan Beulich wrote:
>>>>>>>> 1b. Make vpci_write use the write lock for writes to the command
>>>>>>>> register and BARs only; keep using the read lock for all other writes.
>>>>>>> I am not quite sure how to do that. Do you mean something like:
>>>>>>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>>>>                      uint32_t data)
>>>>>>> [snip]
>>>>>>>          list_for_each_entry ( r, &pdev->vpci->handlers, node )
>>>>>>>          {
>>>>>>>              [snip]
>>>>>>>              if ( r->needs_write_lock )
>>>>>>>                  write_lock(d->vpci_lock);
>>>>>>>              else
>>>>>>>                  read_lock(d->vpci_lock);
>>>>>>>              ...
>>>>>>>
>>>>>>> And provide rw as an argument to:
>>>>>>>
>>>>>>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>>>>>>                       vpci_write_t *write_handler, unsigned int offset,
>>>>>>>                       unsigned int size, void *data,
>>>>>>>                       --->>> bool write_path <<<-----)
>>>>>>>
>>>>>>> Is this what you mean?
>>>>>> This sounds overly complicated. You can derive locally in vpci_write(),
>>>>>> from just its "reg" and "size" parameters, whether the lock needs taking
>>>>>> in write mode.
>>>>> Yes, I started writing a reply with that. So, the summary (ROM
>>>>> position depends on header type):
>>>>> if ( (reg == PCI_COMMAND) || (reg == ROM) )
>>>>> {
>>>>>         read PCI_COMMAND and see if memory or IO decoding are enabled.
>>>>>         if ( enabled )
>>>>>             write_lock(d->vpci_lock)
>>>>>         else
>>>>>             read_lock(d->vpci_lock)
>>>>> }
>>>> Hmm, yes, you can actually get away without using "size", since both
>>>> command register and ROM BAR are 32-bit aligned registers, and 64-bit
>>>> accesses get split in vpci_ecam_write().
>>> But an OS may want to read a single byte of the ROM BAR, so I think
>>> I'll need to check whether reg + size falls into the PCI_COMMAND and
>>> ROM BAR ranges.
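As a standalone illustration of that reg + size check: it is a plain interval-overlap test. The register offsets below mirror the standard PCI configuration space layout (type 0 header; the ROM BAR sits at a different offset for type 1 headers, which a real check would have to account for):

```c
#include <stdbool.h>

/* Standard PCI config space offsets (values as in Xen's pci_regs.h). */
#define PCI_COMMAND      0x04  /* 16-bit command register */
#define PCI_ROM_ADDRESS  0x30  /* 32-bit ROM BAR, header type 0 only */

/* True if the access [reg, reg + size) overlaps [start, start + len). */
static bool overlaps(unsigned int reg, unsigned int size,
                     unsigned int start, unsigned int len)
{
    return reg < start + len && reg + size > start;
}
```

With this, a single-byte access inside the ROM BAR (e.g. `overlaps(PCI_ROM_ADDRESS + 1, 1, PCI_ROM_ADDRESS, 4)`) is still detected, which a plain `reg == PCI_ROM_ADDRESS` comparison would miss.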
>>>> For the command register the memory- / IO-decoding-enabled check may
>>>> end up a little more complicated, as the value to be written also
>>>> matters. Maybe read the command register only for the ROM BAR write,
>>>> using the write lock uniformly for all command register writes?
>>> Sounds good for the start.
>>> Another concern: if we take the read lock and then the underlying
>>> code disables memory decoding and ends up calling the cmd_write
>>> handler for whatever reason, then....
>>>
>>> I mean the check in vpci_write is something we can tolerate, but it
>>> must then be ensured that no code on the read path performs write-path
>>> operations. Which brings up a pretty valid use-case: say in read mode
>>> we detect an unrecoverable error and need to remove the device:
>>> vpci_process_pending -> ERROR -> vpci_remove_device or similar.
>>>
>>> What do we do then? It is all going to be fragile...
>> I have tried to summarize the options we have wrt locking
>> and would love to hear from @Roger and @Jan.
>>
>> In every variant there is a task of dealing with the overlap
>> detection in modify_bars, so this is the only place as of now
>> which needs special treatment.
>>
>> Existing limitations: there is no way to upgrade a read lock to a write
>> lock, so paths which may require write lock protection need to use
>> write lock from the very beginning. Workarounds can be applied.
>>
>> 1. Per-domain rw lock, aka d->vpci_lock
>> ==============================================================
>> Note: with the per-domain rw lock it is possible to do without
>> introducing per-device locks, so pdev->vpci->lock can be removed and
>> no pdev->vpci_lock should be required.
> Er, no, I think you still need a per-device lock unless you intent to
> take the per-domain rwlock in write mode every time you modify data
> in vpci.
This is exactly the assumption stated below. I am trying to discuss
all the possible options, so this one is listed as well.
>   I still think you need pdev->vpci->lock. It's possible this
> approach doesn't require moving the lock outside of the vpci struct.
>
>> This is only going to work if vpci_write always takes the write lock,
>> vpci_read takes a read lock, and no path in vpci_read is allowed to
>> perform write-path operations.
> I think that's likely too strong?
>
> You could get away with both vpci_{read,write} only taking the read
> lock and use a per-device vpci lock?
But as discussed before:
- if pdev->vpci_lock is used this still leads to ABBA
- we would need to know beforehand whether to take the write lock
>
> Otherwise you are likely to introduce contention in msix_write if a
> guest makes heavy use of the MSI-X entry mask bit.
>
>> vpci_process_pending uses the write lock as it has vpci_remove_device
>> on its error path.
>>
>> Pros:
>> - no per-device vpci lock is needed?
>> - solves overlap code ABBA in modify_bars
>>
>> Cons:
>> - all writes are serialized
>> - need to carefully select read paths, so they are guaranteed not to lead
>>     to lock upgrade use-cases
>>
>> 1.1. Semi read lock upgrade in modify_bars
>> --------------------------------------------------------------
>> In this case both vpci_read and vpci_write take a read lock and when it comes
>> to modify_bars:
>>
>> 1. read_unlock(d->vpci_lock)
>> 2. write_lock(d->vpci_lock)
>> 3. Check that pdev->vpci is still available and is the same object:
>> if ( pdev->vpci && (pdev->vpci == old_vpci) )
>> {
>>       /* vpci structure is valid and can be used. */
>> }
>> else
>> {
>>       /* vpci has gone, return an error. */
>> }
>>
>> Pros:
>> - no per-device vpci lock is needed?
>> - solves overlap code ABBA in modify_bars
>> - readers and writers are NOT serialized
>> - NO need to carefully select read paths, so they are guaranteed not to lead
>>     to lock upgrade use-cases
>>
>> Cons:
>> - ???
>>
>> 2. per-device lock (pdev->vpci_lock) + d->overlap_chk_lock
>> ==============================================================
>> In order to solve overlap ABBA, we introduce a per-domain helper
>> lock to protect the overlapping code in modify_bars:
>>
>>       old_vpci = pdev->vpci;
>>       spin_unlock(pdev->vpci_lock);
>>       spin_lock(pdev->domain->overlap_chk_lock);
> Since you drop the pdev lock you get a window here where either vpci
> or even pdev itself could be removed under your feet, so using
> pdev->vpci_lock like you do below could dereference a stale pdev.
pdev is not protected by the pcidevs lock here anyway, so even
now it is possible for pdev to disappear in between.
We do not take pcidevs_lock in the MMIO handlers...
>
>>       spin_lock(pdev->vpci_lock);
>>       if ( pdev->vpci && (pdev->vpci == old_vpci) )
>>           for_each_pdev ( pdev->domain, tmp )
>>           {
>>               if ( tmp != pdev )
>>               {
>>                   spin_lock(tmp->vpci_lock);
>>                   if ( tmp->vpci )
>>                       ...
>>               }
>>           }
>>
>> Pros:
>> - all accesses are independent; only accesses to the same device are serialized
>> - no need to care about readers and writers wrt read lock upgrade issues
>>
>> Cons:
>> - helper spin lock
>>
>> 3. Move overlap detection into process pending
>> ==============================================================
>> There is Roger's patch [1] which adds the possibility for vpci_process_pending
>> to perform different tasks rather than just map/unmap. With this patch extended
>> so that it can hold a request queue, it is possible to delay execution of the
>> overlap code until no pdev->vpci_lock is held, but before returning to the
>> guest after vpci_{read|write} or similar.
>>
>> Pros:
>> - no need to emulate read lock upgrade
>> - fully parallel read/write
>> - queue in the vpci_process_pending will later on be used by SR-IOV,
>>     so this is going to help the future code
>> Cons:
>> - ???
> Maybe? It's hard to envision what that would end up looking like, and
> whether it won't still require this kind of double locking. We would
> still need to prevent doing a rangeset_remove_range for the device we
> are trying to setup the mapping for, at which point we still need to
> lock the current device plus the device we are iterating against?
>
> Since the code in vpci_process_pending is always executed in guest
> vCPU context requiring all guest vCPUs to be paused when doing a
> device addition or removal would prevent devices from going away, but
> we could still have issues with concurrent accesses from other vCPUs.
Yes, I understand that this may not be easily done, but it is still
an option.
>
>> 4. Re-write overlap detection code
>> ==============================================================
>> It is possible to re-write the overlap detection code so that the
>> information about the mapped/unmapped regions is not read from
>> vpci->header->bars[i] of each device, but is instead kept in a per-domain
>> structure which holds the regions and implements reference counting.
>>
>> Pros:
>> - solves ABBA
>>
>> Cons:
>> - very complex code is expected
>>
>> 5. You name it
>> ==============================================================
>>
>>   From all the above I would recommend we go with option 2, which seems
>> to reliably solve the ABBA problem and does not bring the cons of the
>> other approaches.
> 6. per-domain rwlock + per-device vpci lock
>
> Introduce vpci_header_write_lock(start, {end, size}) helper: return
> whether a range requires the per-domain lock in write mode. This will
> only return true if the range overlaps the ROM BAR or the command
> register.
>
> In vpci_{read,write}:
>
> if ( vpci_header_write_lock(...) )
>      /* Gain exclusive access to all of the domain pdevs vpci. */
>      write_lock(d->vpci);
> else
> {
>      read_lock(d->vpci);
>      spin_lock(vpci->lock);
> }
> ...
>
> The vpci assign/deassign functions would need to be modified to take
> the per-domain rwlock in write mode. The MSI-X table MMIO handler will
> also need to read-lock the per-domain vpci lock.
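A minimal sketch of what that helper could look like. The name follows Roger's suggestion; the offsets are the standard type 0 header values, and a real implementation would also have to handle the type 1 (bridge) ROM BAR offset, as noted earlier in the thread:

```c
#include <stdbool.h>

/* Standard PCI config space offsets (values as in Xen's pci_regs.h). */
#define PCI_COMMAND      0x04  /* 16-bit command register */
#define PCI_ROM_ADDRESS  0x30  /* 32-bit ROM BAR, header type 0 only */

/*
 * Return true if the access [start, start + size) overlaps the command
 * register or the ROM BAR, i.e. the per-domain lock must be taken in
 * write mode; all other header accesses can use the read lock.
 */
static bool vpci_header_write_lock(unsigned int start, unsigned int size)
{
    if ( start < PCI_COMMAND + 2 && start + size > PCI_COMMAND )
        return true;

    if ( start < PCI_ROM_ADDRESS + 4 && start + size > PCI_ROM_ADDRESS )
        return true;

    return false;
}
```

Note the overlap tests (rather than equality checks) so that sub-word accesses, e.g. a single-byte write into the middle of the ROM BAR, still trigger the write lock, while accesses to the regular BARs fall through to the read-lock path.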
Ok, so it seems you are in favor of this implementation and I have
no objection either. The only limitation we should be aware of is
that once a path has acquired the read lock it is not possible to
perform any write-path operations in there.
vpci_process_pending will acquire the write lock though, as it can
lead to vpci_remove_device on its error path.

So, I am going to implement pdev->vpci->lock + d->vpci_lock.
>
> I think it's either something along the lines of my suggestion above,
> or maybe option 3, albeit you would have to investigate how to
> implement option 3.
>
> Thanks, Roger.

@Roger, @Jan!
Thank you!!
