On Tue, Feb 08, 2022 at 07:35:34AM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 07.02.22 18:44, Oleksandr Andrushchenko wrote:
> >
> > On 07.02.22 18:37, Jan Beulich wrote:
> >> On 07.02.2022 17:21, Oleksandr Andrushchenko wrote:
> >>> On 07.02.22 18:15, Jan Beulich wrote:
> >>>> On 07.02.2022 17:07, Oleksandr Andrushchenko wrote:
> >>>>> On 07.02.22 17:26, Jan Beulich wrote:
> >>>>>> 1b. Make vpci_write use write lock for writes to command register and 
> >>>>>> BARs
> >>>>>> only; keep using the read lock for all other writes.
> >>>>> I am not quite sure how to do that. Do you mean something like:
> >>>>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >>>>>                     uint32_t data)
> >>>>> [snip]
> >>>>>         list_for_each_entry ( r, &pdev->vpci->handlers, node )
> >>>>> {
> >>>>> [snip]
> >>>>>         if ( r->needs_write_lock)
> >>>>>             write_lock(d->vpci_lock)
> >>>>>         else
> >>>>>             read_lock(d->vpci_lock)
> >>>>> ....
> >>>>>
> >>>>> And provide rw as an argument to:
> >>>>>
> >>>>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
> >>>>>                           vpci_write_t *write_handler, unsigned int 
> >>>>> offset,
> >>>>>                           unsigned int size, void *data, --->>> bool 
> >>>>> write_path <<<-----)
> >>>>>
> >>>>> Is this what you mean?
> >>>> This sounds overly complicated. You can derive locally in vpci_write(),
> >>>> from just its "reg" and "size" parameters, whether the lock needs taking
> >>>> in write mode.
> >>> Yes, I started writing a reply with that. So, the summary (ROM
> >>> position depends on header type):
> >>> if ( (reg == PCI_COMMAND) || (reg == ROM) )
> >>> {
> >>>        read PCI_COMMAND and see if memory or IO decoding are enabled.
> >>>        if ( enabled )
> >>>            write_lock(d->vpci_lock)
> >>>        else
> >>>            read_lock(d->vpci_lock)
> >>> }
> >> Hmm, yes, you can actually get away without using "size", since both
> >> command register and ROM BAR are 32-bit aligned registers, and 64-bit
> >> accesses get split in vpci_ecam_write().
> > But, OS may want reading a single byte of ROM BAR, so I think
> > I'll need to check if reg+size fall into PCI_COMAND and ROM BAR
> > ranges
> >> For the command register the memory- / IO-decoding-enabled check may
> >> end up a little more complicated, as the value to be written also
> >> matters. Maybe read the command register only for the ROM BAR write,
> >> using the write lock uniformly for all command register writes?
> > Sounds good for the start.
> > Another concern is that if we go with a read_lock and then in the
> > underlying code we disable memory decoding and try doing
> > something and calling cmd_write handler for any reason then....
> >
> > I mean that the check in the vpci_write is somewhat we can tolerate,
> > but then it is must be considered that no code in the read path
> > is allowed to perform write path functions. Which brings a pretty
> > valid use-case: say in read mode we detect an unrecoverable error
> > and need to remove the device:
> > vpci_process_pending -> ERROR -> vpci_remove_device or similar.
> >
> > What do we do then? It is all going to be fragile...
> I have tried to summarize the options we have wrt locking
> and would love to hear from @Roger and @Jan.
> 
> In every variant there is a task of dealing with the overlap
> detection in modify_bars, so this is the only place as of now
> which needs special treatment.
> 
> Existing limitations: there is no way to upgrade a read lock to a write
> lock, so paths which may require write lock protection need to use
> write lock from the very beginning. Workarounds can be applied.
> 
> 1. Per-domain rw lock, aka d->vpci_lock
> ==============================================================
> Note: with per-domain rw lock it is possible to do without introducing
> per-device locks, so pdev->vpci->lock can be removed and no pdev->vpci_lock
> should be required.

Er, no, I think you still need a per-device lock unless you intent to
take the per-domain rwlock in write mode every time you modify data
in vpci. I still think you need pdev->vpci->lock. It's possible this
approach doesn't require moving the lock outside of the vpci struct.

> This is only going to work in case if vpci_write always takes the write lock
> and vpci_read takes a read lock and no path in vpci_read is allowed to
> perform write path operations.

I think that's likely too strong?

You could get away with both vpci_{read,write} only taking the read
lock and use a per-device vpci lock?

Otherwise you are likely to introduce contention in msix_write if a
guest makes heavy use of the MSI-X entry mask bit.

> vpci_process_pending uses write lock as it have vpci_remove_device in its
> error path.
> 
> Pros:
> - no per-device vpci lock is needed?
> - solves overlap code ABBA in modify_bars
> 
> Cons:
> - all writes are serialized
> - need to carefully select read paths, so they are guaranteed not to lead
>    to lock upgrade use-cases
> 
> 1.1. Semi read lock upgrade in modify bars
> --------------------------------------------------------------
> In this case both vpci_read and vpci_write take a read lock and when it comes
> to modify_bars:
> 
> 1. read_unlock(d->vpci_lock)
> 2. write_lock(d->vpci_lock)
> 3. Check that pdev->vpci is still available and is the same object:
> if (pdev->vpci && (pdev->vpci == old_vpci) )
> {
>      /* vpci structure is valid and can be used. */
> }
> else
> {
>      /* vpci has gone, return an error. */
> }
> 
> Pros:
> - no per-device vpci lock is needed?
> - solves overlap code ABBA in modify_bars
> - readers and writers are NOT serialized
> - NO need to carefully select read paths, so they are guaranteed not to lead
>    to lock upgrade use-cases
> 
> Cons:
> - ???
> 
> 2. per-device lock (pdev->vpci_lock) + d->overlap_chk_lock
> ==============================================================
> In order to solve overlap ABBA, we introduce a per-domain helper
> lock to protect the overlapping code in modify_bars:
> 
>      old_vpci = pdev->vpci;
>      spin_unlock(pdev->vpci_lock);
>      spin_lock(pdev->domain->overlap_chk_lock);

Since you drop the pdev lock you get a window here where either vpci
or even pdev itself could be removed under your feet, so using
pdev->vpci_lock like you do below could dereference a stale pdev.

>      spin_lock(pdev->vpci_lock);
>      if ( pdev->vpci && (pdev->vpci == old_vpci) )
>          for_each_pdev ( pdev->domain, tmp )
>          {
>              if ( tmp != pdev )
>              {
>                  spin_lock(tmp->vpci_lock);
>                  if ( tmp->vpci )
>                      ...
>              }
>          }
> 
> Pros:
> - all accesses are independent, only the same device access is serialized
> - no need to care about readers and writers wrt read lock upgrade issues
> 
> Cons:
> - helper spin lock
> 
> 3. Move overlap detection into process pending
> ==============================================================
> There is a Roger's patch [1] which adds a possibility for vpci_process_pending
> to perform different tasks rather than just map/unmap. With this patch 
> extended
> in a way that it can hold a request queue it is possible to delay execution
> of the overlap code until no pdev->vpci_lock is held, but before returning to
> a guest after vpci_{read|write} or similar.
> 
> Pros:
> - no need to emulate read lock upgrade
> - fully parallel read/write
> - queue in the vpci_process_pending will later on be used by SR-IOV,
>    so this is going to help the future code
> Cons:
> - ???

Maybe? It's hard to devise how that would end up looking like, and
whether it won't still require such kind of double locking. We would
still need to prevent doing a rangeset_remove_range for the device we
are trying to setup the mapping for, at which point we still need to
lock the current device plus the device we are iterating against?

Since the code in vpci_process_pending is always executed in guest
vCPU context requiring all guest vCPUs to be paused when doing a
device addition or removal would prevent devices from going away, but
we could still have issues with concurrent accesses from other vCPUs.

> 
> 4. Re-write overlap detection code
> ==============================================================
> It is possible to re-write overlap detection code, so the information about 
> the
> mapped/unmapped regions is not read from vpci->header->bars[i] of each device,
> but instead there is a per-domain structure which holds the regions and
> implements reference counting.
> 
> Pros:
> - solves ABBA
> 
> Cons:
> - very complex code is expected
> 
> 5. You name it
> ==============================================================
> 
>  From all the above I would recommend we go with option 2 which seems to 
> reliably
> solve ABBA and does not bring cons of the other approaches.

6. per-domain rwlock + per-device vpci lock

Introduce vpci_header_write_lock(start, {end, size}) helper: return
whether a range requires the per-domain lock in write mode. This will
only return true if the range overlaps with the BAR ROM or the command
register.

In vpci_{read,write}:

if ( vpci_header_write_lock(...) )
    /* Gain exclusive access to all of the domain pdevs vpci. */
    write_lock(d->vpci);
else
{
    read_lock(d->vpci);
    spin_lock(vpci->lock);
}
...

The vpci assign/deassign functions would need to be modified to write
lock the per-domain rwlock. The MSI-X table MMIO handler will also
need to read lock the per domain vpci lock.

I think it's either something along the lines of my suggestion above,
or maybe option 3, albeit you would have to investigate how to
implement option 3.

Thanks, Roger.

Reply via email to