from:"Raj, Ashok"

Re: [PATCH 1/1] x86/microcode: Check for offline CPUs before checking for microcode update

2021-03-20 Thread Raj, Ashok

On Sat, Mar 20, 2021 at 03:55:46PM +0100, Borislav Petkov wrote:
[snip]
> > microcode : 0x30
> > microcode : 0xde
> > microcode : 0x30
> > microcode : 0xde
> 
> Yeah, I'm looking at that check_online_cpus() thing and wondering why we
> even need that:
> 
> 0. So you have CPUs 1 and 3 offline.
> 1. We can update on the subset of cores which are online
> 2. If a core is offline and comes online, we have the hotplug notifier:
> 
> cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "x86/microcode:online",
>   mc_cpu_online, mc_cpu_down_prep);
> 
> which takes care of updating the microcode when that CPU comes online.
> 
> So unless your microcode folks don't come back with a real requirement
> why all CPUs must absolutely be online for a late update, then the
> proper fix is to get rid of check_online_cpus() altogether and update
> what's online and the rest will get updated when they come online.

Its true we update them during the online flow, but the core is still
behind compared to other cores. It still participates when it enters SMM,
or when running MCE for instance. Unless its in WAIT_FOR_SIPI state its
best to not leave a core behind when updating microcode.

-- 
Cheers,
Ashok

[Forgiveness is the attribute of the STRONG - Gandhi]

Re: [PATCH v2 1/1] PCI: pciehp: Skip DLLSC handling if DPC is triggered

2021-03-17 Thread Raj, Ashok

On Wed, Mar 17, 2021 at 08:09:52PM +0100, Lukas Wunner wrote:
> On Wed, Mar 17, 2021 at 10:45:21AM -0700, Dan Williams wrote:
> > Ah, ok, we're missing a flush of the hotplug event handler after the
> > link is up to make sure the hotplug handler does not see the Link Up.
> > I'm not immediately seeing how the new proposal ensures that there is
> > no Link Up event still in flight after DPC completes its work.
> > Wouldn't it be required to throw away Link Up to Link Up transitions?
> 
> If you look at the new code added to pciehp_ist() by my patch...
> 
>   atomic_and(~PCI_EXP_SLTSTA_DLLSC, &ctrl->pending_events);
>   if (pciehp_check_link_active(ctrl) > 0)
>   events &= ~PCI_EXP_SLTSTA_DLLSC;

When you have a Surprise Link Down and without any DPC, the link trains
back up. Aren't we expected to take the slot down and add it as if a remove
and add happens?

without this change if slot-status == ON_STATE, DLLSC means we would power
the slot off. Then we check link_active and bring the slot back on isn't
it?

Re: [PATCH 2/5] iommu/vt-d: Remove WO permissions on second-level paging entries

2021-03-08 Thread Raj, Ashok

Hi Joerg

On Mon, Mar 08, 2021 at 09:58:26AM +0800, Lu Baolu wrote:
> Hi Joerg,
> 
> On 3/4/21 8:26 PM, Joerg Roedel wrote:
> >On Thu, Feb 25, 2021 at 02:26:51PM +0800, Lu Baolu wrote:
> >>When the first level page table is used for IOVA translation, it only
> >>supports Read-Only and Read-Write permissions. The Write-Only permission
> >>is not supported as the PRESENT bit (implying Read permission) should
> >>always set. When using second level, we still give separate permissions
> >>that allows WriteOnly which seems inconsistent and awkward. There is no
> >>use case we can think off, hence remove that configuration to make it
> >>consistent.
> >
> >No use-case for WriteOnly mappings? How about DMA_FROM_DEVICE mappings?
> >
> 
> The statement of no use case is not correct. Sorry about it.
> 
> As we have moved to use first level for IOVA translation, the first
> level page table entry only provides RO and RW permissions. So if any
> device driver specifies DMA_FROM_DEVICE attribution, it will get RW
> permission in the page table. This patch aims to make the permissions
> of second level and first level consistent. No impact on the use of
> DMA_FROM_DEVICE attribution.
> 

That is the primary motivation, given that we have moved to 1st level for
general IOVA, first level doesn't have a WO mapping. I didn't know enough
about the history to determine if a WO without a READ is very useful. I
guess the ZLR was invented to support those cases without a READ in PCIe. I

Early Intel IOMMU's didn't handle ZLR properly, until we fixed it in the
next generation. It just seemed opposite to the CPU page-tables, and we
wanted to have consistent behavior. After moving to 1st level, we don't
want things to work sometimes, and break if we use 2nd level for the same
mappings.

Hope this helps

Cheers,
Ashok

Re: [PATCH 2/3] iommu/vt-d: Parse SATC reporting structure

2021-02-02 Thread Raj, Ashok

On Tue, Feb 02, 2021 at 12:40:56PM +0800, Lu Baolu wrote:
> From: Yian Chen 
> 
> Software should parse every SATC table and all devices in the tables
> reported by the BIOS and keep the information in kernel list for further
> SATC policy deployment.
> 
The last part seems bit vague? Are you trying to imply, 

if kernel is booted with noats for instance, a device listed in SATC table
as "requires ATS" will fail to load?

Otherwise its not clear what the policy enforcement means.

Re: [PATCH 1/3] iommu/vt-d: Add new enum value and structure for SATC

2021-02-02 Thread Raj, Ashok

On Tue, Feb 02, 2021 at 12:40:55PM +0800, Lu Baolu wrote:
> From: Yian Chen 
> 
> Starting from Intel Platform VT-d v3.2, BIOS may provide new remapping
> structure SATC for SOC integrated devices, according to section 8.8 of
> Intel VT-d architecture specification v3.2. The SATC structure reports
> a list of the devices that require SATC enabling via ATS capacity.

nit: s/require SATC/require ATS for normal device operation. This is a
functional requirement that these devices will not work without OS enabling
ATS capability.

> 
> This patch introduces the new enum value and structure to represent the
> remapping information. Kernel should parse the information from the
> reporting structure and enable ATC for the devices as needed.
> 
> Signed-off-by: Yian Chen 
> ---
>  include/acpi/actbl1.h | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h
> index 43549547ed3e..b7ca802b66d2 100644
> --- a/include/acpi/actbl1.h
> +++ b/include/acpi/actbl1.h
> @@ -514,7 +514,8 @@ enum acpi_dmar_type {
>   ACPI_DMAR_TYPE_ROOT_ATS = 2,
>   ACPI_DMAR_TYPE_HARDWARE_AFFINITY = 3,
>   ACPI_DMAR_TYPE_NAMESPACE = 4,
> - ACPI_DMAR_TYPE_RESERVED = 5 /* 5 and greater are reserved */
> + ACPI_DMAR_TYPE_SATC = 5,
> + ACPI_DMAR_TYPE_RESERVED = 6 /* 5 and greater are reserved */
>  };
>  

Think Joerg spotted the comment update.

Re: [Patch v2 1/1] PCI: pciehp: Add support for handling MRL events

2020-12-03 Thread Raj, Ashok

Hi Lukas and Bjorn

On Sun, Nov 22, 2020 at 10:08:52AM +0100, Lukas Wunner wrote:
> > @@ -275,6 +302,13 @@ void pciehp_handle_presence_or_link_change(struct 
> > controller *ctrl, u32 events)
> > if (link_active)
> > ctrl_info(ctrl, "Slot(%s): Link Up\n",
> >   slot_name(ctrl));
> > +   /*
> > +* If slot is closed && ATTN button exists
> > +* don't continue, let the ATTN button
> > +* drive the hot-plug
> > +*/
> > +   if (((events & PCI_EXP_SLTSTA_MRLSC) && ATTN_BUTTN(ctrl)))
> > +   return;
> > ctrl->request_result = pciehp_enable_slot(ctrl);
> > break;
> 
> Hm, if the Attention Button is pressed with MRL still open, the slot is
> not brought up.  If the MRL is subsequently closed, it is still not
> brought up.  I guess the slot keeps blinking and one has to push the
> button to abort the operation, then press it once more to attempt
> another slot bringup.  The spec doesn't seem to say how such a situation
> should be handled. Oh well.
> 
> I'm wondering if this is the right place to bail out:  Immediately
> before the above hunk, the button_work is canceled, so it can't later
> trigger bringup of the slot.  Shouldn't the above check be in the
> code block with the "Turn the slot on if it's occupied or link is up"
> comment?
> 

I have a fix tested on the platform, but I'm wondering if that's exactly
what you had in mind. 

Currently we don't subscribe for PDC events when ATTN exists. So the
behavior is almost similar to this MRL case after ATTN, but the slot is not
ready for hot-add.

- Press ATTN, 
- Slot is empty
- After 5 seconds synthetic PDC arrives.
  but since no presence and no link active, we leave slot in 
  BLINKINGON_STATE, and led keeps blinking

if someone were to add a card after the 5 seconds, no hot-add is processed
since we don't get notifications for PDC events when ATTN exists.

Can we automatically cancel the blinking and return slot back to OFF_STATE?

This way we don't need another button press to first cancel, and restart
add via another button press? 

According to section 6.7.1.5 Attention Button.
Once the power indicator begins blinking, a 5 second abort interval exists 
during which a second depression of the attention button cancels the operation.

If the operation initiated by the attention button fails for any reason, it
is recommended that system software present an error message explaining
failure via a software user interface, or add the error message to system
log.

Seems like we can cancel the blinking and return back to power off state.
Since the attention button press wasn't successful to add anything.?

Alternately we can also choose to subscribe to PDC, but ignore if slot is
in OFF_STATE. So we let ATTN drive the add. But if PDC happens and we are
in BLINKINGON_STATE, then we can process the hot-add? Spec says a software
recommendation, but i think the cancel after 5 seconds seems better?

Cheers,
Ashok

Re: [Patch v2 1/1] PCI: pciehp: Add support for handling MRL events

2020-11-25 Thread Raj, Ashok

Hi Lukas

On Sun, Nov 22, 2020 at 10:08:52AM +0100, Lukas Wunner wrote:
> On Sat, Nov 21, 2020 at 05:42:03PM -0800, Ashok Raj wrote:
> > --- a/drivers/pci/hotplug/pciehp_ctrl.c
> > +++ b/drivers/pci/hotplug/pciehp_ctrl.c
> >  void pciehp_handle_presence_or_link_change(struct controller *ctrl, u32 
> > events)
> >  {
> > int present, link_active;
> > +   u8 getstatus = 0;
> >  
> > /*
> >  * If the slot is on and presence or link has changed, turn it off.
> > @@ -246,6 +259,20 @@ void pciehp_handle_presence_or_link_change(struct 
> > controller *ctrl, u32 events)
> > if (events & PCI_EXP_SLTSTA_PDC)
> > ctrl_info(ctrl, "Slot(%s): Card not present\n",
> >   slot_name(ctrl));
> > +   if (events & PCI_EXP_SLTSTA_MRLSC)
> > +   ctrl_info(ctrl, "Slot(%s): Latch %s\n",
> > + slot_name(ctrl), getstatus ? "Open" : 
> > "Closed");
> 
> This message will currently always be "Latch closed".  It should be
> "Latch open" instead because if the slot was up, the latch must have
> been closed.  So an MRLSC event can only mean that the latch is now open.
> The "getstatus" variable can be removed.

We only report if there was an MRLSC event. What if there is a link event,
but MRL is closed? This just reflects current state rather than hardcoding
a value which could be wrong in cases where you try to remove due to DLLSC
event?

> 
> 
> > +   /*
> > +* PCIe Base Spec 5.0 Chapter 6.7.1.3 states.
> > +*
> > +* If an MRL Sensor is implemented without a corresponding MRL 
> > Sensor input
> > +* on the Hot-Plug Controller, it is recommended that the MRL 
> > Sensor be
> > +* routed to power fault input of the Hot-Plug Controller.
> > +* This allows an active adapter to be powered off when the MRL 
> > is opened."
> > +*
> > +* This seems to suggest that the slot should be brought down 
> > as soon as MRL
> > +* is opened.
> > +*/
> > pciehp_disable_slot(ctrl, SURPRISE_REMOVAL);
> > break;
> 
> The code comment is not wrapped at 80 chars and a bit long.
> I'd move it to the commit message and keep only a shortened version here.

Make sense. I'll clean this up.
> 
> The "SURPRISE_REMOVAL" may now be problematic because the card may still
> be in the slot (both presence and link still up) with only the MRL open.
> My suggestion would be to add a local variable "bool safe_removal"
> which is initialized to "SAFE_REMOVAL".  In the two if-clauses for
> DLLSC and PDC, it is set to SURPRISE_REMOVAL.

I see, so for MRL we want to treat it as safe-removal, for other two its
surprise. Got it.

> 
> 
> > @@ -275,6 +302,13 @@ void pciehp_handle_presence_or_link_change(struct 
> > controller *ctrl, u32 events)
> > if (link_active)
> > ctrl_info(ctrl, "Slot(%s): Link Up\n",
> >   slot_name(ctrl));
> > +   /*
> > +* If slot is closed && ATTN button exists
> > +* don't continue, let the ATTN button
> > +* drive the hot-plug
> > +*/
> > +   if (((events & PCI_EXP_SLTSTA_MRLSC) && ATTN_BUTTN(ctrl)))
> > +   return;
> > ctrl->request_result = pciehp_enable_slot(ctrl);
> > break;
> 
> Hm, if the Attention Button is pressed with MRL still open, the slot is
> not brought up.  If the MRL is subsequently closed, it is still not
> brought up.  I guess the slot keeps blinking and one has to push the
> button to abort the operation, then press it once more to attempt
> another slot bringup.  The spec doesn't seem to say how such a situation
> should be handled. Oh well.

Looks like we are in the same boat today even without MRL. If no card in
slot and someone presses ATTN, after 5 sec blink, we call the synthetic PDC
event. But the check for present || link_active would fail and return
immediately. So the light would keep blinking until someone presses ATTN to
cancel?

Maybe in that we should simply cancel if it was blinking before we return?

> 
> I'm wondering if this is the right place to bail out:  Immediately
> before the above hunk, the button_work is canceled, so it can't later
> trigger bringup of the slot.  Shouldn't the above check be in the
> code block with the "Turn the slot on if it's occupied or link is up"
> comment?

Or maybe remove the check !latch_closed(ctrl), and let if fall through
anyway. 
> 
> You're also not unlocking the state_lock here before bailing out of
> the function.
> 
> 
> > @@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
> > down_read(&ctrl->reset_lock);
> > if (events & DISABLE_SLOT)
> > pciehp_handle_disable_request(ctrl);
> > -   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
> > +   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_

Re: [PATCH 1/1] pci: pciehp: Handle MRL interrupts to enable slot for hotplug.

2020-11-21 Thread Raj, Ashok

On Sat, Nov 21, 2020 at 12:10:50PM +0100, Lukas Wunner wrote:
> 
> > > > +   /*
> > > > +* If ATTN is present and MRL is triggered
> > > > +* ignore the Presence Change Event.
> > > > +*/
> > > > +   if (ATTN_BUTTN(ctrl) && (events & PCI_EXP_SLTSTA_MRLSC))
> > > > +   events &= ~PCI_EXP_SLTSTA_PDC;
> > > 
> > > An Attention Button press results in a synthesized PDC event after
> > > 5 seconds, which may get lost due to the above if-statement.
> > 
> > When its synthesized you don't get the MRLSC? So we won't nuke the PDC then
> > correct?
> 
> I just meant to say, pciehp_queue_pushbutton_work() will synthesize
> a PDC event after 5 seconds and with the above code snippet, if an
> MRL event happens simultaneously, that synthesized PDC event would
> be lost.  So I'd just drop the above code snippet.  I think you
> just need to subscribe to MRL events and propagate them to
> pciehp_handle_presence_or_link_change().  There, you'd bring down
> the slot if an MRL event has occurred (same as DLLSC or PDC).
> Then, check whether MRL is closed.  If so, and if presence or link
> is up, try to bring up the slot.
> 
> If the MRL is open when slot or presence has gone up, the slot is not
> brought up.  But once MRL is closed, there'll be another MRL event and
> *then* the slot is brought up.
> 

Sounds good.. I'll send the update patch.

Re: [PATCH 1/1] pci: pciehp: Handle MRL interrupts to enable slot for hotplug.

2020-11-19 Thread Raj, Ashok

On Thu, Nov 19, 2020 at 04:27:41PM -0600, Bjorn Helgaas wrote:
> Hi Ashok,
> 
> When you update this patch, can you run "git log --oneline
> drivers/pci/hotplug/pciehp*" and make yours match?  It is a severe
> character defect of mine, but seeing subject lines that are obviously

:-)

> different than the convention is a disincentive to look at the patch.
> 

Thanks for the reminder. forgot the cap the PCI, and some inadvertant '.'
at the end of the line...

Will fix when i resend.

Re: [PATCH 1/1] pci: pciehp: Handle MRL interrupts to enable slot for hotplug.

2020-11-19 Thread Raj, Ashok

On Thu, Nov 19, 2020 at 08:51:20AM +0100, Lukas Wunner wrote:
> Hi Ashok,
> 
> my sincere apologies for the delay.

No worries.. 

> 
> On Fri, Sep 25, 2020 at 04:01:38PM -0700, Ashok Raj wrote:
> > When Mechanical Retention Lock (MRL) is present, Linux doesn't process
> > those change events.
> > 
> > The following changes need to be enabled when MRL is present.
> > 
> > 1. Subscribe to MRL change events in SlotControl.
> > 2. When MRL is closed,
> >- If there is no ATTN button, then POWER on the slot.
> >- If there is ATTN button, and an MRL event pending, ignore
> >  Presence Detect. Since we want ATTN button to drive the
> >  hotplug event.
> 
> So I understand you have a hardware platform which has both MRL and
> Attention Button on its hotplug slots?  It may be useful to name the
> hardware platform in the commit message.

I should, let me find out the exact intercept and include it next.

> 
> If an Attention Button is present, the current behavior is to bring up
> the hotplug slot as soon as presence or link is detected.  We don't wait
> for a button press.  This is intended as a convience feature to bring up
> slots as quickly as possible, but the behavior can be changed if the
> button press and 5 second delay is preferred.

No personal preference, I thought that is how the code in Linux was
earlier.

Looks like we still don't subscribe to PDC if ATTN is present? So you don't
get an event until someone presses the button to process hotplug right?

pciehp_hpc.c:pcie_enable_notification()
{


 * Always enable link events: thus link-up and link-down shall
 * always be treated as hotplug and unplug respectively. Enable
 * presence detect only if Attention Button is not present.
 */ 
cmd = PCI_EXP_SLTCTL_DLLSCE;
if (ATTN_BUTTN(ctrl))
cmd |= PCI_EXP_SLTCTL_ABPE;
else
cmd |= PCI_EXP_SLTCTL_PDCE;

}


Looks like we still wait for button press to process. When you don't have a
power control yes the DLLSC would kick in and we would process them. but if
you have power control, you won't turn on until the button press?

> 
> In any case the behavior in response to an Attention Button press should
> be the same regardless whether an MRL is present or not.  (The spec
> doesn't seem to say otherwise.)

True, Looks like that is a Linux behavior, certainly not a spec
recommendation. Our validation teams tell Windows also behaves such. i.e
when ATTN exists button press drivers the hot-add. Linux doesn't need to
follow suite though. 

In that case do we need to subscribe to PDC even when ATTN is present?

> 
> 
> > +   if (MRL_SENS(ctrl)) {
> > +   pciehp_get_latch_status(ctrl, &getstatus);
> > +   /*
> > +* If slot is closed && ATTN button exists
> > +* don't continue, let the ATTN button
> > +* drive the hot-plug
> > +*/
> > +   if (!getstatus && ATTN_BUTTN(ctrl))
> > +   return;
> > +   }
> 
> For the sake of readability I'd suggest adding a pciehp_latch_closed()
> helper similar to pciehp_card_present_or_link_active() which returns
> true if no MRL is present (i.e. !MRL_SENS(ctrl)), otherwise retrieves
> and returns the status with pciehp_get_latch_status().

Makes sense.. I can add that.

> 
> 
> > +void pciehp_handle_mrl_change(struct controller *ctrl)
> > +{
> > +   u8 getstatus = 0;
> > +   int present, link_active;
> > +
> > +   pciehp_get_latch_status(ctrl, &getstatus);
> > +
> > +   present = pciehp_card_present(ctrl);
> > +   link_active = pciehp_check_link_active(ctrl);
> > +
> > +   ctrl_info(ctrl, "Slot(%s): Card %spresent\n",
> > + slot_name(ctrl), present ? "" : "not ");
> > +
> > +   ctrl_info(ctrl, "Slot(%s): Link %s\n",
> > + slot_name(ctrl), link_active ? "Up" : "Down");
> 
> This duplicates a lot of code from pciehp_handle_presence_or_link_change(),
> which I think suggests handling MRL events in that function.  After all,
> an MRL event may lead to bringup or bringdown of a slot similar to
> a link or presence event.
> 
> Basically pciehp_handle_presence_or_link_change() just needs to be
> amended to bring down the slot if an MRL event occurred, and
> afterwards only bring up the slot if MRL is closed.  Like this:
> 
> - if (present <= 0 && link_active <= 0) {
> + if ((present <= 0 && link_active <= 0) || !latch_closed) {

Certainly looks like good simplication. I'll change and have a test run.

>   mutex_unlock(&ctrl->state_lock);
>   return;
>   }
> 
> 
> > +   /*
> > +* Need to handle only MRL Open. When MRL is closed with
> > +* a Card Present, either the ATTN button, or the PDC indication
> > +* should power the slot and add the card in the slot
> > +*/
> 
> I agree with the second sentence.  You may want to refer to PCIe Base Spec
> r4.0

Re: [PATCH][v2] x86/microcode/intel: check cpu stepping and processor flag before saving microcode

2020-11-16 Thread Raj, Ashok

Hi Boris

On Mon, Nov 16, 2020 at 01:27:35PM +0100, Borislav Petkov wrote:
> ( drop stable@ from Cc because this is not how fixes get added to stable@ )

Stable is still left below. with #v4.10+

Do you want to keep this? Also do you want him to resend or you have that
covered?

> 
> On Fri, Nov 13, 2020 at 09:59:23AM +0800, Chen Yu wrote:
> > Currently scan_microcode() leverages microcode_matches() to check if the
> > microcode matches the CPU by comparing the family and model. However before
> > saving the microcode in scan_microcode(), the processor stepping and flag
> > of the microcode signature should also be considered in order to avoid
> > incompatible update and caused the failure of microcode update.
> 
> This is going in the right direction but needs to take care of one
> more angle. I've extended your fix to the version below. Lemme know if
> something's not clear or still missing.
> 

Seems clear to me, and the commit log cleanup also makes sense.
I don't have a system myself,. Will wait for Chen Yu to confirm if it works
for him as well. 

> Thx.
> 
> ---
> From: Chen Yu 
> Date: Fri, 13 Nov 2020 09:59:23 +0800
> Subject: [PATCH] x86/microcode/intel: Check patch signature before saving 
> microcode for early loading
> 
> Currently, scan_microcode() leverages microcode_matches() to check
> if the microcode matches the CPU by comparing the family and model.
> However, the processor stepping and flags of the microcode signature
> should also be considered when saving a microcode patch for early
> update.
> 
> Use find_matching_signature() in scan_microcode() and get rid of the
> now-unused microcode_matches() which is a good cleanup in itself.
> 
> Complete the verification of the patch being saved for early loading in
> save_microcode_patch() directly. This needs to be done there too because
> save_mc_for_early() will call save_microcode_patch() too.
> 
> The second reason why this needs to be done is because the loader still
> tries to support, at least hypothetically, mixed-steppings systems and
> thus adds all patches to the cache that belong to the same CPU model
> albeit with different steppings.
> 
> For example:
> 
>   microcode: CPU: sig=0x906ec, pf=0x2, rev=0xd6
>   microcode: mc_saved[0]: sig=0x906e9, pf=0x2a, rev=0xd6, total size=0x19400, 
> date = 2020-04-23
>   microcode: mc_saved[1]: sig=0x906ea, pf=0x22, rev=0xd6, total size=0x19000, 
> date = 2020-04-27
>   microcode: mc_saved[2]: sig=0x906eb, pf=0x2, rev=0xd6, total size=0x19400, 
> date = 2020-04-23
>   microcode: mc_saved[3]: sig=0x906ec, pf=0x22, rev=0xd6, total size=0x19000, 
> date = 2020-04-27
>   microcode: mc_saved[4]: sig=0x906ed, pf=0x22, rev=0xd6, total size=0x19400, 
> date = 2020-04-23
> 
> The patch which is being saved for early loading, however, can only be
> the one which fits the CPU this runs on so do the signature verification
> before saving.
> 
>  [ bp: Do signature verification in save_microcode_patch()
>and rewrite commit message. ]
> 
> Fixes: 06b8534cb728 ("x86/microcode: Rework microcode loading")
> Signed-off-by: Chen Yu 
> Signed-off-by: Borislav Petkov 
> Cc: sta...@vger.kernel.org #v4.10+
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=208535
> Link: https://lkml.kernel.org/r/20201113015923.13960-1-yu.c.c...@intel.com
> ---
>  arch/x86/kernel/cpu/microcode/intel.c | 63 +--
>  1 file changed, 10 insertions(+), 53 deletions(-)

Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-15 Thread Raj, Ashok

On Sun, Nov 15, 2020 at 11:11:27PM +0100, Thomas Gleixner wrote:
> On Sun, Nov 15 2020 at 11:31, Ashok Raj wrote:
> > On Sun, Nov 15, 2020 at 12:26:22PM +0100, Thomas Gleixner wrote:
> >> > opt-in by device or kernel? The way we are planning to support this is:
> >> >
> >> > Device support for IMS - Can discover in device specific means
> >> > Kernel support for IMS. - Supported by IOMMU driver.
> >> 
> >> And why exactly do we have to enforce IOMMU support? Please stop looking
> >> at IMS purely from the IDXD perspective. We are talking about the
> >> general concept here and not about the restricted Intel universe.
> >
> > I think you have mentioned it almost every reply :-)..Got that! Point taken
> > several emails ago!! :-)
> 
> You sure? I _try_ to not mention it again then. No promise though. :)

Hey.. anything that's entertaining go for it :-)

> 
> > I didn't mean just for idxd, I said for *ANY* device driver that wants to
> > use IMS.
> 
> Which is wrong. Again:
> 
> A) For PF/VF on bare metal there is absolutely no IOMMU dependency
>because it does not have a PASID requirement. It's just an
>alternative solution to MSI[X], which allows optimizations like
>storing the message in driver manages queue memory or lifting the
>restriction of 2048 interrupts per device. Nothing else.

You are right.. my eyes were clouded by virtualization.. no dependency for
native absolutely.

> 
> B) For PF/VF in a guest the IOMMU dependency of IMS is a red herring.
>There is no direct dependency on the IOMMU.
> 
>The problem is the inability of the VMM to trap the message write to
>the IMS storage if the storage is in guest driver managed memory.
>This can be solved with either
> 
>- a hypercall which translates the guest MSI message
>or
>- a vIOMMU which uses a hypercall or whatever to translate the guest
>  MSI message
> 
> C) Subdevices ala mdev are a different story. They require PASID which
>enforces IOMMU and the IMS part is not managed by the users anyway.

You are right again :)

The subdevices require PASID & IOMMU in native, but inside the guest there is no
need for IOMMU unless you want to build SVM on top. subdevices work without
any vIOMMU or hypercall in the guest. Only because they look like normal
PCI devices we could map interrupts to legacy MSIx.

> 
> So we have a couple of problems to solve:
> 
>   1) Figure out whether the OS runs on bare metal
> 
>  There is no reliable answer to that, so we either:
> 
>   - Use heuristics and assume that failure is unlikely and in case
> of failure blame the incompetence of VMM authors and/or
> sysadmins
> 
>  or
>  
>   - Default to IMS disabled and let the sysadmin enable it via
> command line option.
> 
> If the kernel detects to run in a VM it yells and disables it
> unless the OS and the hypervisor agree to provide support for
> that scenario (see #2).
> 
> That's fails as well if the sysadmin does so when the OS runs on
> a VMM which is not identifiable, but at least we can rightfully
> blame the sysadmin in that case.

cmdline isn't nice, best to have this functional out of box.

> 
>  or
> 
>   - Declare that IMS always depends on IOMMU

As you had mentioned IMS has no real dependency on IOMMU in native.

we just need to make sure if running in guest we have support for it
plumbed.

> 
> I personaly don't care, but people working on these kind of
> device already said, that they want to avoid it when possible.
> 
> If you want to go that route, then please talk to those folks
> and ask them to agree in public.
> 
>  You also need to take into account that this must work on all
>  architectures which support virtualization because IMS is
>  architecture independent.

What you suggest makes perfect sense. We can certainly get buy in from
iommu list and have this co-ordinated between all existing iommu varients.

> 
>   2) Guest support for PF/VF
> 
>  Again we have several scenarios depending on the IMS storage
>  type.
> 
>   - If the storage type is device memory then it's pretty much the
> same as MSI[X] just a different location.

True, but still need to have some special handling for trapping those mmio
access. Unlike for MSIx VFIO already traps them and everything is
pre-plummbed. It isn't seamless as its for MSIx.

> 
>   - If the storage is in driver managed memory then this needs
> #1 plus guest OS and hypervisor support (hypercall/vIOMMU)

Violent agreement here :-)

> 
>   3) Guest support for PF/VF and guest managed subdevice (mdev)
> 
>  Depends on #1 and #2 and is an orthogonal problem if I'm not
>  missing something.
> 
> To move forward we need to make a decision about #1 and #2 now.

Mostly in agreement. Except for mdev (current considered use case) have no
need for IMS in the guest. (Don't

Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-15 Thread Raj, Ashok

On Sun, Nov 15, 2020 at 12:26:22PM +0100, Thomas Gleixner wrote:
> On Sat, Nov 14 2020 at 13:18, Ashok Raj wrote:
> > On Sat, Nov 14, 2020 at 10:34:30AM +, Christoph Hellwig wrote:
> >> On Thu, Nov 12, 2020 at 11:42:46PM +0100, Thomas Gleixner wrote:
> >> Which is why I really think we need explicit opt-ins for "native"
> >> SIOV handling and for paravirtualized SIOV handling, with the kernel
> >> not offering support at all without either or a manual override on
> >> the command line.
> >
> > opt-in by device or kernel? The way we are planning to support this is:
> >
> > Device support for IMS - Can discover in device specific means
> > Kernel support for IMS. - Supported by IOMMU driver.
> 
> And why exactly do we have to enforce IOMMU support? Please stop looking
> at IMS purely from the IDXD perspective. We are talking about the
> general concept here and not about the restricted Intel universe.

I think you have mentioned it almost every reply :-)..Got that! Point taken
several emails ago!! :-)

I didn't mean just for idxd, I said for *ANY* device driver that wants to
use IMS.

> 
> > each driver can check 
> >
> > if (dev_supports_ims() && iommu_supports_ims()) {
> > /* Then IMS is supported in the platform.*/
> > }
> 
> Please forget this 'each driver can check'. That's just wrong.

Ok.

> 
> The only thing the driver has to check is whether the device supports
> IMS or not. Everything else has to be handled by the underlying
> infrastructure.

That's pretty much the same thing.. I guess you wanted to add 
"Does infrastructure support IMS" to be someplace else, instead
of device driver checking it. That's perfectly fine.

Until we support this natively via hypercall or vIOMMU we can use your
varient of finding if you are not on bare_metal to decide support for IMS.

How you highligted below:

https://lore.kernel.org/lkml/877dqrnzr3@nanos.tec.linutronix.de/

probably_on_bare_metal()
{
if (CPUID(FEATURE_HYPERVISOR))
return false;
if (dmi_match_hypervisor_vendor())
return false;

return PROBABLY_RUNNING_ON_BARE_METAL;
}

The above is all we need for now and will work in almost all cases. 
We will move forward with just the above in the next series.

Below is for future consideration.

Even the above isn't fool proof if both HYPERVISOR feature flag isn't set,
and the dmi_string doesn't match, say some new hypervisor. The only way 
we can figure that is

- If no iommu support, or iommu can tell if this is a virtualized iommu.
The presence of caching_mode is one such indication for Intel. 

PS: Other IOMMU's must have something like this to support virtualization.
I'm not saying this is an Intel only feature just in case you interpret
it that way! I'm only saying if there is a mechanism to distinguish
native vs emulated platform.

When vIOMMU supports getting native interrupt handle via a virtual command
interface for Intel IOMMU's. OR some equivalent when other vedors provide
such capability. Even without a hypercall virtualizing IOMMU can provide
the same solution.

If we support hypercall then its more generic so it would fall into the
native all platforms/vendors. Certainly the most scalable long term
solution.

Cheers,
Ashok

Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-14 Thread Raj, Ashok

On Sat, Nov 14, 2020 at 10:34:30AM +, Christoph Hellwig wrote:
> On Thu, Nov 12, 2020 at 11:42:46PM +0100, Thomas Gleixner wrote:
> > DMI vendor name is pretty good final check when the bit is 0. The
> > strings I'm aware of are:
> > 
> > QEMU, Bochs, KVM, Xen, VMware, VMW, VMware Inc., innotek GmbH, Oracle
> > Corporation, Parallels, BHYVE, Microsoft Corporation
> > 
> > which is not complete but better than nothing ;)
> 
> Which is why I really think we need explicit opt-ins for "native"
> SIOV handling and for paravirtualized SIOV handling, with the kernel
> not offering support at all without either or a manual override on
> the command line.

opt-in by device or kernel? The way we are planning to support this is:

Device support for IMS - Can discover in device specific means
Kernel support for IMS. - Supported by IOMMU driver.

each driver can check 

if (dev_supports_ims() && iommu_supports_ims()) {
/* Then IMS is supported in the platform.*/
}

until we have vIOMMU support or a hypercall, iommu_supports_ims() will
check if X86_FEATURE_HYPERVISOR in addition to the platform id's Thomas
mentioned. or on intel platform check for cap.caching_mode=1 and return false.

When we add support for getting a native interrupt handle then we will plumb 
that
appropriately.

Does this match what you wanted?

Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-13 Thread Raj, Ashok

On Fri, Nov 13, 2020 at 08:12:39AM -0800, Luck, Tony wrote:
> > Of course is this not only an x86 problem. Every architecture which
> > supports virtualization has the same issue. ARM(64) has no way to tell
> > for sure whether the machine runs bare metal either. No idea about the
> > other architectures.
> 
> Sounds like a hypervisor problem. If the VMM provides perfect emulation
> of every weird quirk of h/w, then it is OK to let the guest believe that it is
> running on bare metal.

That's true, which is why there isn't an immutable bit in cpuid or
otherwise telling you are running under a hypervisor. Providing something
like that would make certain features not virtualizable. Apparently before we
had faulting cpuid, what you had in guest was the real raw cpuid. 

Waiver: I'm not saying this is perfect, I'm just replaying the reason
behind it. Not trying to defend it... flames > /dev/null
> 
> If it isn't perfect, then it should make sure the guest knows *for sure*, so 
> that
> the guest can take appropriate actions to avoid the sharp edges.
> 

There are indeed 2 problems to solve.

1. How does device driver know if device is IMS capable.

   IMS is a device attribute. Each vendor can provide its own method to
   provide that indication. One such mechanism is the DVSEC.SIOV.IMS
   property. Some might believe this is for use only by Intel. For DVSEC I
   don't believe there is such a connection as in device vendor id in
   standard header. TBH, there are other device vendors using the exact
   same method to indicate SIOV and IMS propeties. What a DVSEC vendor ID
   states is "As defined by Vendor X". 

   Why we choose a config vs something in device specific mmio is because
   today VFIO being that one common mechanism, it only exposes known
   standard and some extended headers to guest. When we expose a full PF,
   the guest doens't see the DVSEC, so drivers know this isn't available.

   This is our mechanism to stop drivers from calling
   pci_ims_array_create_msi_irq_domain(). It may not be perfect for all
   devices, it is a device specific mechanism. For devices under
   consideration following the SIOV spec it meets the sprit of the
   requirement even without #2 below. When devices have no way to detect
   this, #2 is required as a second way to block IMS.

2. How does platform component (IOMMU) inform if they can support all forms
   of IMS. (On device, or in memory). 

   On device would require some form trap/emulate. Legacy MSIx already has
   that solved, but for device specific store you need some additional
   work.

   When its system memory (say IMS is in GPA space), you need some form of
   hypercall. There is no way around it since we can't intercept. Yes, you
   can maybe map those as RO and trap, but its not pretty.

   To solve this rather than a generic platform capability, maybe we should
   flip this to IOMMU instead, because that's the one that offers this
   capability today.

   iommu_ims_supported() 
When platform has no IOMMU or no hypervisor calls, it returns
false. So device driver can tell, even if it supports IMS
capability deduction, does the platform support IMS.

On platforms where iommu supports capability.

Either there is a vIOMMU with a Virtual Command Register that can
provide a way to get the interrupt handle similar to what you would
get from an hypercall for instance. Or there is a real hypercall
that supports giving the guest OS the physical IRTE handle. 

-- 
Cheers,
Ashok

[Forgiveness is the attribute of the STRONG - Gandhi]

Re: [PATCH][RFC] x86/microcode/intel: check cpu stepping and processor flag before saving microcode

2020-11-12 Thread Raj, Ashok

Hi ChenYu

I think you can drop the RFC tag.

I suppose you can add Cc stable as well. Boris should return next week to
take a look.


On Tue, Nov 10, 2020 at 09:52:47PM +0800, Chen Yu wrote:
> Currently scan_microcode() leverages microcode_matches() to check if the
> microcode matches the CPU by comparing the family and model. However before
> saving the microcode in scan_microcode(), the processor stepping and flag
> of the microcode signature should also be considered in order to avoid
> incompatible update and caused the failure of microcode update.
> 
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208535
> Fixes: 06b8534cb728 ("x86/microcode: Rework microcode loading")
> Suggested-by: "Raj, Ashok" 
> Cc: Borislav Petkov 
> Cc: Len Brown 
> Cc: "Rafael J. Wysocki" 
> Cc: "Raj, Ashok" 
> Cc: Tony Luck 
> Signed-off-by: Chen Yu 
> --

Reviewed-by: Ashok Raj 

Cheers,
Ashok

Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-11 Thread Raj, Ashok

On Wed, Nov 11, 2020 at 11:27:28PM +0100, Thomas Gleixner wrote:
> On Wed, Nov 11 2020 at 08:09, Ashok Raj wrote:
> >> > We'd also need a way for an OS running on bare metal to *know* that
> >> > it's on bare metal and can just compose MSI messages for itself. Since
> >> > we do expect bare metal to have an IOMMU, perhaps that is just a
> >> > feature flag on the IOMMU?
> >> 
> >> Have the platform firmware advertise if it needs native or virtualized
> >> IMS handling.  If it advertises neither don't support IMS?
> >
> > The platform hint can be easily accomplished via DMAR table flags. We could
> > have an IMS_OPTOUT(similart to x2apic optout flag) flag, when 0 its native 
> > and IMS is supported.
> >
> > When vIOMMU is presented to guest, virtual DMAR table will have this flag
> > set to 1. Indicates to GuestOS, native IMS isn't supported.
> 
> These opt-out bits suck by definition. It comes all back to the fact
> that the whole virt thing didn't have a hardware defined way to tell
> that the OS runs in a VM and not on bare metal. It wouldn't have been
> rocket science to do so.

I'm sure everybody dislikes (hate being a strong word :-)). 
DVSEC capability. Real hardware always sets it to 1 for the IMS capability.

By default the DVSEC is not presented to guest even when the full PF is
presented to guest. I believe VFIO only builds and presents known standard
capabilities and specific extended capabilities. I'm a bit weak but maybe
@AlexWilliamson can confirm if I'm off track.

This tells the driver in guest that IMS is not available and will not
create those new dev_msi calls. 

Only if the VMM has build support to expose IMS for this device, guest SW
can even see DVSEC.SIOV.IMS=1. This also means the required plumbing, say
vIOMMU, or a hypercall has been provisioned, and adminstrator knows the
guest is compatible for these options. 

There maybe better ways to do this. If this has to be done differently
we certainly can and will do. 

> 
> And because that does not exist, we need magic opt-out bits for every
> other piece of functionality which gets added. Can we please stop this
> and provide a well defined way to tell the OS whether it runs on bare
> metal or not?
> 
> The point is that you really want opt-in bits so that decisions come
> down to

How would we opt-in when the feature is not available? You need someway to
tell the capability is available in the guest?, but then there is no reason
to opt-in though.. its ready for use isn't it?

> 
>  if (!virt || virt->supports_X)

The only closest thing that comes to mind is the CPUID bits, you had
mentioned they aren't reliable if the VMM didn't set those in an earlier
mail. If you want a platform level generic support.

- DMAR table optout's you had mentioned that's ugly
- We could use caching mode, but its not a platform level thing, and vendor
  specific. I'm not sure if other vendors have a similar feature. If there
  is a generic capabilty, we could expose via the iommu api's if we are in
  virt or real platform.
> 
> which is the obvious sane and safe logic. But sure, why am I asking for
> sane and safe in the context of virtualization?

We can pick how to solve this, and just waiting for you to tell, what
mechanism you prefer that's less painful and architecturally acceptible for
virtualization and linux. We are all ears!

Cheers,
Ashok

Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-11 Thread Raj, Ashok

On Wed, Nov 11, 2020 at 03:41:59PM +, Christoph Hellwig wrote:
> On Sun, Nov 08, 2020 at 07:36:34PM +, David Woodhouse wrote:
> > So it does look like we're going to need a hypercall interface to
> > compose an MSI message on behalf of the guest, for IMS to use. In fact
> > PCI devices assigned to a guest could use that too, and then we'd only
> > need to trap-and-remap any attempt to write a Compatibility Format MSI
> > to the device's MSI table, while letting Remappable Format messages get
> > written directly.
> > 
> > We'd also need a way for an OS running on bare metal to *know* that
> > it's on bare metal and can just compose MSI messages for itself. Since
> > we do expect bare metal to have an IOMMU, perhaps that is just a
> > feature flag on the IOMMU?
> 
> Have the platform firmware advertise if it needs native or virtualized
> IMS handling.  If it advertises neither don't support IMS?

The platform hint can be easily accomplished via DMAR table flags. We could
have an IMS_OPTOUT(similart to x2apic optout flag) flag, when 0 its native 
and IMS is supported.

When vIOMMU is presented to guest, virtual DMAR table will have this flag
set to 1. Indicates to GuestOS, native IMS isn't supported.

Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-10 Thread Raj, Ashok

Hi David

I did't follow the support for 32768 CPUs in guest without IR support.

Can you tell me how that is done?

On Sun, Nov 08, 2020 at 03:25:57PM -0800, Ashok Raj wrote:
> On Sun, Nov 08, 2020 at 06:34:55PM +, David Woodhouse wrote:
> > > 
> > > When we do interrupt remapping support in guest which would be required 
> > > if we support x2apic in guest, I think this is something we should look 
> > > into more 
> > > carefully to make this work.
> > 
> > No, interrupt remapping is not required for X2APIC in guests
> > 
> > They can have X2APIC and up to 32768 CPUs without needing interrupt
> 
> How is this made available today without interrupt remapping? 
> 
> I thought without IR, the destination ID is still limited to only 8 bits?
> 
> On native, even if you have less than 255 cpu's but the APICID are sparsly 
> distributed due to platform rules, the x2apic id could be more than 8 bits. 
> Which is why the spec requires IR when x2apic is enabled.
> 
> > remapping at all. Only if they want more than 32768 vCPUs, or to do
> > nested virtualisation and actually remap for the benefit of *their*
> > (L2+) guests would they need IR.

Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-10 Thread Raj, Ashok

Thomas,

With all these interrupt message storms ;-), I'm missing how to move towards
an end goal.

On Tue, Nov 10, 2020 at 11:27:29AM +0100, Thomas Gleixner wrote:
> Ashok,
> 
> On Mon, Nov 09 2020 at 21:14, Ashok Raj wrote:
> > On Mon, Nov 09, 2020 at 11:42:29PM +0100, Thomas Gleixner wrote:
> >> On Mon, Nov 09 2020 at 13:30, Jason Gunthorpe wrote:
> > Approach to IMS is more of a phased approach. 
> >
> > #1 Allow physical device to scale beyond limits of PCIe MSIx
> >Follows current methodology for guest interrupt programming and
> >evolutionary changes rather than drastic.
> 
> Trapping MSI[X] writes is there because it allows to hand a device to an
> unmodified guest OS and to handle the case where the MSI[X] entries
> storage cannot be mapped exclusively to the guest.
> 
> But aside of this, it's not required if the storage can be mapped
> exclusively, the guest is hypervisor aware and can get a host composed
> message via a hypercall. That works for physical functions and SRIOV,
> but not for SIOV.

It would greatly help if you can put down what you see is blocking 
to move forward in the following areas.

Address Gaps in Spec: 

Specs can accomodate change after review, as the number of ECN's that go on
with PCIe ;-). Please add what you like to see in the spec if you beleive
is a gap today.

Hardware Gaps?
- PASID tagged Interrupts.
- IOMMU Support for PASID based IR.

As i had called out, there are a lot of moving parts, and requires more
attention.

OS Gaps?
- Lack of ability to identify if platform can use IMS.
- Lack of hypercall.

We will always have devices that have more interrupts but their use doesn't
need IMS to be directly manipulated by the guest, or the fact those usages
require more than what is allowed by PCIe in a guest. These devices can 
scale by adding another sub-device and you get another block of 2048 if needed.

This isn't just for idxd, as I mentioned earlier, there are vendors other
than Intel already working on this. In all cases the need for guest direct
manipulation of interrupt store hasn't come up. From the discussion, it
seems like there are devices today or in future that will require direct
manipulation of interrupt store in the guest. This needs additional work
in both the device hardware providing the right plumbing and OS work to
comprehend those.

Cheers,
Ashok

Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-09 Thread Raj, Ashok

Hi Thomas,

On Mon, Nov 09, 2020 at 11:42:29PM +0100, Thomas Gleixner wrote:
> On Mon, Nov 09 2020 at 13:30, Jason Gunthorpe wrote:
> >
> > The relavance of PASID is this:
> >
> >> Again, trap emulate does not work for IMS when the IMS store is software
> >> managed guest memory and not part of the device. And that's the whole
> >> reason why we are discussing this.
> >
> > With PASID tagged interrupts and a IOMMU interrupt remapping
> > capability that can trigger on PASID, then the platform can provide
> > the same level of security as SRIOV - the above is no problem.
> >
> > The device ensures that all DMAs and all interrupts program by the
> > guest are PASID tagged and the platform provides security by checking
> > the PASID when delivering the interrupt.
> 
> Correct.
> 
> > Intel IOMMU doesn't work this way today, but it makes alot of design
> > sense.

Approach to IMS is more of a phased approach. 

#1 Allow physical device to scale beyond limits of PCIe MSIx
   Follows current methodology for guest interrupt programming and
   evolutionary changes rather than drastic.
#2 Long term we should work together on enabling IMS in guest which
   requires changes in both HW and SW eco-system.

For #1, the immediate need is to find a way to limit guest from using IMS
due to current limitations. We have couple options.

a) CPUID based method to disallow IMS when running in a guest OS. Limiting
   use to existing virtual MSIx to guest devices. (Both you/Jason alluded)
b) We can extend DMAR table to have a flag for opt-out. So in real platform
   this flag is clear and in guest VMM will ensure vDMAR will have this flag
   set. Along the lines as Jason alluded, platform level and via ACPI
   methods. We have similar use for x2apic_optout today.

Think a) is probably more generic.

For #2 Long term goal of allowing IMS in guest for devices that require
them. This requires some extensive eco-system enabling. 

- Extending HW to understand PASID-tagged interrupt messages.
- Appropriate extensions to IOMMU to enforce such PASID based isolation.

>From SW improvements:

- Hypercall to retrieve addr/data from host
- Ensure SW can provide guarantee that the interrupt address range will not
  be mapped in process space when SVM is in play. Otherwise its hard to
  distinguish between DMA and Interrupt. OS needs to opt-in to this
  behavior. Today we ensure IOVA space has this 0xFEEx range carve out
  of the IOVA space.

Devices such as idxd that do not have these entries on page-boundaries for
isolation to permit direct programming from GuestOS will continue to use
trap-emulate as used today.

In the end, virtualizing IMS requires eco-system collaboration, and we are
very open to change hw when all the relevant pieces are in place.

Until then, IMS will be restricted to host VMM only, and we can use the
methods above to prevent IMS in guest and continue to use the legacy
virtual MSIx.

> 
> Right.
> 
> > Otherwise the interrupt is effectively delivered to the hypervisor. A
> > secure device can *never* allow a guest to specify an addr/data pair
> > for a non-PASID tagged TLP, so the device cannot offer IMS to the
> > guest.
> 
> Ok. Let me summarize the current state of supported scenarios:
> 
>  1) SRIOV works with any form of IMS storage because it does not require
> PASID and the VF devices have unique requester ids, which allows the
> remap unit to sanity check the message.
> 
>  2) SIOV with IMS when the hypervisor can manage the IMS store
> exclusively.

Today this is true for all interrupt types, MSI/MSIx/IMS.

> 
> So #2 prevents a device which handles IMS storage in queue memory to
> utilize IMS for SIOV in a guest because the hypervisor cannot manage the
> IMS message store and the guest can write arbitrary crap to it which
> violates the isolation principle.
> 
> And here is the relevant part of the SIOV spec:
> 
>  "IMS is managed by host driver software and is not accessible directly
>   from guest or user-mode drivers.
> 
>   Within the device, IMS storage is not accessible from the ADIs. ADIs
>   can request interrupt generation only through the device’s ‘Interrupt
>   Message Generation Logic’, which allows an ADI to only generate
>   interrupt messages that are associated with that specific ADI. These
>   restrictions ensure that the host driver has complete control over
>   which interrupt messages can be generated by each ADI.
> 
>   On Intel 64 architecture platforms, message signaled interrupts are
>   issued as DWORD size untranslated memory writes without a PASID TLP
>   Prefix, to address range 0xFEEx. Since all memory requests
>   generated by ADIs include a PASID TLP Prefix, it is not possible for
>   an ADI to generate a DMA write that would be interpreted by the
>   platform as an interrupt message."
> 
> That's the reductio ad absurdum for this sentence in the first paragraph
> of the preceding chapter describing the concept of IMS:
> 
>   "IMS enables devices to store t

Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-09 Thread Raj, Ashok

On Mon, Nov 09, 2020 at 01:30:34PM -0400, Jason Gunthorpe wrote:
> 
> > Again, trap emulate does not work for IMS when the IMS store is software
> > managed guest memory and not part of the device. And that's the whole
> > reason why we are discussing this.
> 
> With PASID tagged interrupts and a IOMMU interrupt remapping
> capability that can trigger on PASID, then the platform can provide
> the same level of security as SRIOV - the above is no problem.

You mean even if its stored in memory, as long as the MemWr comes with
PASID, and the hypercall has provisioned the IRTE properly?

that seems like a possiblity.

> 
> The device ensures that all DMAs and all interrupts program by the
> guest are PASID tagged and the platform provides security by checking
> the PASID when delivering the interrupt. Intel IOMMU doesn't work this
> way today, but it makes alot of design sense.
> 
> Otherwise the interrupt is effectively delivered to the hypervisor. A
> secure device can *never* allow a guest to specify an addr/data pair
> for a non-PASID tagged TLP, so the device cannot offer IMS to the
> guest.

Right, it seems like that's a limitation today.

Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-09 Thread Raj, Ashok

On Mon, Nov 09, 2020 at 03:08:17PM +0100, Thomas Gleixner wrote:
> On Mon, Nov 09 2020 at 12:14, Thomas Gleixner wrote:
> > On Sun, Nov 08 2020 at 15:58, Ashok Raj wrote:
> >> On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
> >> But for SIOV devices there is no PASID filtering at the remap level since
> >> interrupt messages don't carry PASID in the TLP.
> >
> > Why do we need PASID for VMM integrity?
> >
> > If the device sends a message then the remap unit will see the requester
> > ID of the device and if the message it sends is not 
> 
> That made me look at patch 4/17 which adds DEVMSI support to the
> remap code:
> 
> > +   case X86_IRQ_ALLOC_TYPE_DEV_MSI:
> > +  irte_prepare_msg(msg, index, sub_handle);
> >break;
> 
> It does not setup any requester-id filter in IRTE. How is that supposed
> to be correct?
> 

Its missing a set_msi_sid() equivalent for the DEV_MSI type.

Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-08 Thread Raj, Ashok

Hi Jason

On Sun, Nov 08, 2020 at 07:41:42PM -0400, Jason Gunthorpe wrote:
> On Sun, Nov 08, 2020 at 10:11:24AM -0800, Raj, Ashok wrote:
> 
> > > On (kvm) virtualization the addr/data pair the IRQ domain hands out
> > > doesn't work. It is some fake thing.
> > 
> > Is it really some fake thing? I thought the vCPU and vector are real
> > for a guest, and VMM ensures when interrupts are delivered they are either.
> 
> It is fake in the sense it is programmed into no hardware.
>  
> It is real in the sense it is an ABI contract with the VMM.

Ah.. its clear now. That clears up my question below as well.

> 
> Yes, no matter what the VMM has to know the guest wants an interrupt
> routed in and setup the VMM part of the equation. With SRIOV this is
> all done with the MSI trapping.
> 
> > What if the guest creates some addr in the 0xfee... range how do we
> > take care of interrupt remapping and such without any VMM assist?
> 
> Not sure I understand this?
> 

My question was based on mis-conception that interrupt entries are directly
written by guest OS for mlx*. My concern was about security isolation if guest 
OS
has full control of device interrupt store. 

I think you clarified it, that interrupts still are marshalled by the VMM
and not in direct control of guest OS. That makes my question moot.

Cheers,
Ashok

Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-08 Thread Raj, Ashok

Hi Thomas,

[-] Jing, She isn't working at Intel anymore.

Now this is getting compiled as a book :-).. Thanks a ton!

One question on the hypercall case that isn't immediately
clear to me.

On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
> 
> 
> Now if we look at the virtualization scenario and device hand through
> then the structure in the guest view is not any different from the basic
> case. This works with PCI-MSI[X] and the IDXD IMS variant because the
> hypervisor can trap the access to the storage and translate the message:
> 
>|
>|
>   [CPU]-- [Bri | dge] -- Bus -- [Device]
>|
>   Alloc +
>   Compose   Store Use
>  |
>  | Trap
>  v
>  Hypervisor translates and stores
> 

The above case, VMM is responsible for writing to the message
store. In both cases if its IMS or Legacy MSI/MSIx. VMM handles
the writes to the device interrupt region and to the IRTE tables.

> But obviously with an IMS storage location which is software controlled
> by the guest side driver (the case Jason is interested in) the above
> cannot work for obvious reasons.
> 
> That means the guest needs a way to ask the hypervisor for a proper
> translation, i.e. a hypercall. Now where to do that? Looking at the
> above remapping case it's pretty obvious:
> 
> 
>  |
>  |
>   [CPU]   -- [VI | RT]  -- [Bridge] --Bus-- [Device]
>  |
>   Alloc  "Compose"   Store Use
> 
>   Vectordomain   HCALLdomainBusdomain
>  |^
>  ||
>  v| 
> Hypervisor
>Alloc + Compose
> 
> Why? Because it reflects the boundaries and leaves the busdomain part
> agnostic as it should be. And it works for _all_ variants of Busdomains.
> 
> Now the question which I can't answer is whether this can work correctly
> in terms of isolation. If the IMS storage is in guest memory (queue
> storage) then the guest driver can obviously write random crap into it
> which the device will happily send. (For MSI and IDXD style IMS it
> still can trap the store).

The isolation problem is not just the guest memory being used as interrrupt
store right? If the Store to device region is not trapped and controlled by 
VMM, there is no gaurantee the guest OS has done the right thing?

Thinking about it, guest memory might be more problematic since its not
trappable and VMM can't enforce what is written. This is something that
needs more attension. But for now the devices supporting memory on device
the trap and store by VMM seems to satisfy the security properties you
highlight here.

> 
> Is the IOMMU/Interrupt remapping unit able to catch such messages which
> go outside the space to which the guest is allowed to signal to? If yes,
> problem solved. If no, then IMS storage in guest memory can't ever work.

This can probably work for SRIOV devices where guest owns the entire device.
interrupt remap does have RID checks if interrupt arrives at an Interrupt handle
not allocated for that BDF.

But for SIOV devices there is no PASID filtering at the remap level since
interrupt messages don't carry PASID in the TLP.

> 
> Coming back to this:
> 
> > In the end pci_subdevice_msi_create_irq_domain() is a platform
> > function. Either it should work completely on every device with no
> > device-specific emulation required in the VMM, or it should not work
> > at all and return -EOPNOTSUPP.
> 
> The subdevice domain is a 'Busdomain' according to the structure
> above. It does not and should never have any clue about the underlying
> system. It's in the agnostic part and always works. It simply does not
> care what's underneath. So it won't return -EOPNOTSUPP.
> 
> What it has to do is to transport the IMS in queue memory requirement to
> the underlying parent domain.
> 
> So in case that the HCALL domain is missing, the Vector domain needs
> return an error code on domain creation. If the HCALL domain is there
> then the domain creation works and in case of actual interrupt
> allocation the hypercall either returns a valid composed message or an
> appropriate error code.
> 
> But there's a catch:
> 
> This only works when the guest OS actually knows that it runs in a
> VM. If the guest can't figure that out, i.e. via CPUID, this cannot be

Precicely!. It might work if the OS is new, but for legacy the trap-emulate
seems both safe and works for legacy as well?

Cheers,
Ashok

Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-08 Thread Raj, Ashok

Hi Jason,

On Sun, Nov 08, 2020 at 07:23:41PM -0400, Jason Gunthorpe wrote:
> 
> IDXD is worring about case #4, I think, but I didn't follow in that
> whole discussion about the IMS table layout if they PASID tag the IMS
> MemWr or not?? Ashok can you clarify?
> 

The PASID in the interrupt store is for the IDXD to verify the interrupt handle
that came with the ENQCMD. User applications can obtain an interrupt handle and
ask for interrupt to be generated for transactions submitted via ENQCMD.

IDXD will compare the PASID that came with ENQCMD and verify if the PASID 
matches
the one stored in the Interrupt Table before generating the MemWr.

So MemWr for interrupts remains unchanged for IDXD on the wire. PASID is 
present in interrupt
store because the value was programmed by user space, and needs OS/hardware to 
ensure 
the entity asking for interrupts has ownership for the interrupt handle.

Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-08 Thread Raj, Ashok

On Sun, Nov 08, 2020 at 06:34:55PM +, David Woodhouse wrote:
> > 
> > When we do interrupt remapping support in guest which would be required 
> > if we support x2apic in guest, I think this is something we should look 
> > into more 
> > carefully to make this work.
> 
> No, interrupt remapping is not required for X2APIC in guests
> 
> They can have X2APIC and up to 32768 CPUs without needing interrupt

How is this made available today without interrupt remapping? 

I thought without IR, the destination ID is still limited to only 8 bits?

On native, even if you have less than 255 cpu's but the APICID are sparsly 
distributed due to platform rules, the x2apic id could be more than 8 bits. 
Which is why the spec requires IR when x2apic is enabled.

> remapping at all. Only if they want more than 32768 vCPUs, or to do
> nested virtualisation and actually remap for the benefit of *their*
> (L2+) guests would they need IR.

Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-08 Thread Raj, Ashok

Hi Jason

Thanks, its now clear what you had mentioned earlier.

I had couple questions/clarifications below. Thanks for working 
through this.

On Fri, Nov 06, 2020 at 08:12:07PM -0400, Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 03:47:00PM -0800, Dan Williams wrote:
> 
> > Also feel free to straighten me out (Jason or Ashok) if I've botched
> > the understanding of this.
> 
> It is pretty simple when you get down to it.
> 
> We have a new kernel API that Thomas added:
> 
>   pci_subdevice_msi_create_irq_domain()
> 
> This creates an IRQ domain that hands out addr/data pairs that
> trigger interrupts.
> 
> On bare metal the addr/data pairs from the IRQ domain are programmed
> into the HW in some HW specific way by the device driver that calls
> the above function.
> 
> On (kvm) virtualization the addr/data pair the IRQ domain hands out
> doesn't work. It is some fake thing.

Is it really some fake thing? I thought the vCPU and vector are real
for a guest, and VMM ensures when interrupts are delivered they are either.

1. Handled by VMM first and then injected to guest
2. Handled in a Posted Interrupt manner, and injected to guest
   when it resumes. It can be delivered directly if guest was running
   when the interrupt arrived.

> 
> To make this work on normal MSI/MSI-X the VMM implements emulation of
> the standard MSI/MSI-X programming and swaps the fake addr/data pair
> for a real one obtained from the hypervisor IRQ domain.
> 
> To "deal" with this issue the SIOV spec suggests to add a per-device
> PCI Capability that says "IMS works". Which means either:
>  - This is bare metal, so of course it works
>  - The VMM is trapping and emulating whatever the device specific IMS
>programming is.
> 
> The idea being that a VMM can never advertise the IMS cap flag to the
> guest unles the VMM provides a device specific driver that does device
> specific emulation to capture the addr/data pair. Remeber IMS doesn't
> say how to program the addr/data pair! Every device is unique!
> 
> On something like IDXD this emulation is not so hard, on something
> like mlx5 this is completely unworkable. Further we never do
> emulation on our devices, they always pass native hardware through,
> even for SIOV-like cases.

So is that true for interrupts too? Possibly you have the interrupt
entries sitting in memory resident on the device? Don't we need the 
VMM to ensure they are brokered by VMM in either one of the two ways 
above? What if the guest creates some addr in the 0xfee... range
how do we take care of interrupt remapping and such without any VMM 
assist?

Its probably a gap in my understanding. 

> 
> In the end pci_subdevice_msi_create_irq_domain() is a platform
> function. Either it should work completely on every device with no
> device-specific emulation required in the VMM, or it should not work
> at all and return -EOPNOTSUPP.
> 
> The only sane way to implement this generically is for the VMM to
> provide a hypercall to obtain a real *working* addr/data pair(s) and
> then have the platform hand those out from
> pci_subdevice_msi_create_irq_domain(). 
> 
> All IMS device drivers will work correctly. No VMM device emulation is
> ever needed to translate addr/data pairs.
> 

That's true. Probably this can work the same even for MSIx types too then?

When we do interrupt remapping support in guest which would be required 
if we support x2apic in guest, I think this is something we should look into 
more 
carefully to make this work.

One criteria that we generally tried to follow is driver that runs in host
and guest are the same, and if needed they need some functionality make it
work around some capability  detection so the alternate path can be plummed in
a generic way. 

I agree with the overall idea and we should certainly take that into 
consideration
when we need IMS in guest support and in context of interrupt remapping.

Hopefully I understood the overall concept. If I mis-understood any of this
please let me know.

Cheers,
Ashok

Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-06 Thread Raj, Ashok

Hi Jason

On Fri, Nov 06, 2020 at 09:14:15AM -0400, Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 09:48:34AM +, Tian, Kevin wrote:
> > > The interrupt controller is responsible to create an addr/data pair
> > > for an interrupt message. It sets the message format and ensures it
> > > routes to the proper CPU interrupt handler. Everything about the
> > > addr/data pair is owned by the platform interrupt controller.
> > > 
> > > Devices do not create interrupts. They only trigger the addr/data pair
> > > the platform gives them.
> > 
> > I guess that we may just view it from different angles. On x86 platform,
> > a MSI/IMS capable device directly composes interrupt messages, with 
> > addr/data pair filled by OS.
> 
> Yes, all platforms work like that. The addr/data pair is *opaque* to
> the device. Only the platform interrupt controller component
> understands how to form those values.

True, the addr/data pair is opaque. IMS doesn't dictate what the contents
of addr/data pair is made of. That is still a platform attribute. IMS simply 
controls where the pair is physically stored. Which only the device dictates.

> 
> > If there is no IOMMU remapping enabled in the middle, the message
> > just hits the CPU. Your description possibly is from software side,
> > e.g. describing the hierarchical IRQ domain concept?
> 
> I suppose you could say that. Technically the APIC doesn't form any
> addr/data pairs, but the configuration of the APIC, IOMMU and other
> platform components define what addr/data pairs are acceptable.
> 
> The IRQ domain stuff broadly puts responsibilty to form these values
> in the IRQ layer which abstracts all the platform detatils. In Linux
> we expect the platform to provide the IRQ Domain tha can specify
> working addr/data pairs.
> 
> > I agree with this point, just as how pci-hyperv.c works. In concept Linux
> > guest driver should be able to use IMS when running on Hyper-v. There
> > is no such thing for KVM, but possibly one day we will need similar stuff.
> > Before that happens the guest could choose to simply disallow devmsi
> > by default in the platform code (inventing a hypercall just for 'disable' 
> > doesn't make sense) and ignore the IMS cap. One small open is whether
> > this can be done in one central-place. The detection of running as guest
> > is done in arch-specific code. Do we need disabling devmsi for every arch?
> >
> > But when talking about virtualization it's not good to assume the guest
> > behavior. It's perfectly sane to run a guest OS which doesn't implement 
> > any PV stuff (thus don't know running in a VM) but do support IMS. In 
> > such scenario the IMS cap allows the hypervisor to educate the guest 
> > driver to use MSI instead of IMS, as long as the driver follows the device 
> > spec. In this regard I don't think that the IMS cap will be a short-term 
> > thing, although Linux may choose to not use it.
> 
> The IMS flag belongs in the platform not in the devices.

This support is mostly a SW thing right? we don't need to muck with
platform/ACPI for that matter. 

> 
> For instance you could put a "disable IMS" flag in the ACPI tables, in
> the config space of the emuulated root port, or any other areas that
> clearly belong to the platform.

Maybe there is a different interpretation for IMS that I'm missing. Devices
that need more interrupt support than supported by PCIe standards, and how
device has grouped the storage needs for the addr/data pair is a device
attribute.

I missed why ACPI tables should carry such information. If kernel doesn't
want to support those devices its within kernel control. Which means kernel
will only use the available MSIx interfaces. This is legacy support.

> 
> The OS logic would be
>  - If no IMS information found then use IMS (Bare metal)
>  - If the IMS disable flag is found then
>- If (future) hypercall available and the OS knows how to use it
>  then use IMS
>- If no hypercall found, or no OS knowledge, fail IMS
> 
> Our devices can use IMS even in a pure no-emulation

This is true for IMS as well. But probably not implemented in the kernel as
such. From a HW point of view (take idxd for instance) the facility is
available to native OS as well. The early RFC supported this for native.

Native devices can have both MSIx and IMS capability. But as I understand this
isn't how we have partitioned things in SW today. We left IMS only for
mdev's. And I agree this would be very useful.

In cases where we want to support interrupt handles for user space
notification (when application specifies that in the descriptor). Those
could be IMS. The device HW has support for it.

Remember the "Why PASID in IMS entry" discussion?

https://lore.kernel.org/lkml/20201008233210.gh4...@nvidia.com/

Cheers,
Ashok

Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-11-02 Thread Raj, Ashok

Hi Jason

On Mon, Nov 02, 2020 at 09:20:36AM -0400, Jason Gunthorpe wrote:

> > of IDXD for guest drivers. These look and feel like IDXD, not another 
> > device 
> > interface. In that sense if we move PF/VF mailboxes as
> > separate drivers i thought it feels a bit odd.
> 
> You need this split anyhow, putting VFIO calls into the main idxd
> module is not OK.
> 
> Plugging in a PCI device should not auto-load VFIO modules.

Yes, I agree that would be a good reason to separate them completely and
glue functionality with private APIs between the 2 modules.

- Separate mdev code from base idxd.
- Separate maintainers, so its easy to review and include. (But remember
  they are heavily inter-dependent. They have to move to-gether)

Almost all SRIOV drivers today are just configured with some form of Kconfig
and those relevant files are compiled into the same module.

I think in *most* applications idxd would be operating in that mode, where
you have the base driver and mdev parts (like VF) compiled in if configured
such.

Creating these private interfaces for intra-module are just 1-1 and not
general purpose and every accelerator needs to create these instances.

I wasn't sure focibly creating this firewall between the PF/VF interfaces
is actually worth the work every driver is going to require. I can see
where this is required when they offer separate functional interfaces
when we talk about multi-function in a more confined definition today.

idxd mdev's are purely a VF extension. It doesn't provide any different
function. For e.g. like an RDMA device that can provide iWarp, ipoib or
even multiplexing storage over IB. IDXD is a fixed function interface.

Sure having separate modules helps with that isolation. But I'm not
convinced if this simplifies, or complicates things more than what is
required for these device types.

Cheers,
Ashok

Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-10-31 Thread Raj, Ashok

Hi Thomas,

On Sat, Oct 31, 2020 at 03:50:43AM +0100, Thomas Gleixner wrote:
> Ashok,
> 
> < skip a lot of non-sensical arguments>

Ouch!.. Didn't mean to awaken you like this :-).. apologies.. profusely! 

> 
> Just because there is historical precendence which does not care about
> the differentiation of subsystems is not an argument at all to make the
> same mistakes which have been made years ago.
> 
> IDXD is just infrastructure which provides the base for a variety of
> different functionalities. Very similar to what multi function devices
> provide. In fact IDXD is pretty much a MFD facility.

I'm only asking this to better understand the thought process. 
I don't intend to be defensive,  I have my hands tied back.. so we will do
what you say best fits per your recommendation.

Not my intend to dig a deeper hole than I have already dug! :-(

IDXD is just a glorified DMA engine, data mover. It also does a few other
things. In that sense its a multi-function facility. But doesn't do  different 
functional pieces like PCIe multi-function device in that sense. i.e
it doesn't do other storage and network in that sense. 

> 
> Sticking all of it into dmaengine is sloppy at best. The dma engine
> related part of IDXD is only a part of the overall functionality.

dmaengine is the basic non-transformational data-mover. Doing other operations
or transformations are just the glorified data-mover part. But fundamentally
not different.

> 
> I'm well aware that it is conveniant to just throw everything into
> drivers/myturf/ but that does neither make it reviewable nor
> maintainable.

That's true, when we add lot of functionality in one place. IDXD doing
mdev support is not offering new functioanlity. SRIOV PF drivers that support
PF/VF mailboxes are part of PF drivers today. IDXD mdev is preciely playing that
exact role. 

If we are doing this just to improve review effectiveness, Now we would need
some parent driver, and these sub-drivers registering seemed like a bit of
over-engineering when these sub-drivers actually are an extension of the
base driver and offer nothing more than extending sub-device partitions 
of IDXD for guest drivers. These look and feel like IDXD, not another device 
interface. In that sense if we move PF/VF mailboxes as
separate drivers i thought it feels a bit odd.

Please don't take it the wrong way. 

Cheers,
Ashok

Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-10-30 Thread Raj, Ashok

On Fri, Oct 30, 2020 at 04:30:45PM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 30, 2020 at 12:23:25PM -0700, Raj, Ashok wrote:
> > On Fri, Oct 30, 2020 at 04:17:06PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Oct 30, 2020 at 12:13:48PM -0700, Dave Jiang wrote:
> > > > 
> > > > 
> > > > On 10/30/2020 11:58 AM, Jason Gunthorpe wrote:
> > > > > On Fri, Oct 30, 2020 at 11:50:47AM -0700, Dave Jiang wrote:
> > > > > >   .../ABI/stable/sysfs-driver-dma-idxd  |6 +
> > > > > >   Documentation/driver-api/vfio/mdev-idxd.rst   |  404 ++
> > > > > >   MAINTAINERS   |1 +
> > > > > >   drivers/dma/Kconfig   |9 +
> > > > > >   drivers/dma/idxd/Makefile |2 +
> > > > > >   drivers/dma/idxd/cdev.c   |6 +-
> > > > > >   drivers/dma/idxd/device.c |  294 -
> > > > > >   drivers/dma/idxd/idxd.h   |   67 +-
> > > > > >   drivers/dma/idxd/init.c   |   86 ++
> > > > > >   drivers/dma/idxd/irq.c|6 +-
> > > > > >   drivers/dma/idxd/mdev.c   | 1121 
> > > > > > +
> > > > > >   drivers/dma/idxd/mdev.h   |  116 ++
> > > > > 
> > > > > Again, a subsytem driver belongs in the directory hierarchy of the
> > > > > subsystem, not in other random places. All this mdev stuff belongs
> > > > > under drivers/vfio
> > > > 
> > > > Alex seems to have disagreed last time
> > > > https://lore.kernel.org/dmaengine/20200917113016.425dc...@x1.home/
> > > 
> > > Nobody else in the kernel is splitting subsystems up anymore
> > >  
> > > > And I do agree with his perspective. The mdev is an extension of the PF
> > > > driver. It's a bit awkward to be a stand alone mdev driver under 
> > > > vfio/mdev/.
> > > 
> > > By this logic we'd have giagantic drivers under drivers/ethernet
> > > touching netdev, rdma, scsi, vdpa, etc just because that is where the
> > > PF driver came from.
> > 
> > What makes you think this is providing services like scsi/rdma/vdpa etc.. ?
> > 
> > for DSA this playes the exact same role, not a different function 
> > as you highlight above. these mdev's are creating DSA for virtualization
> > use. They aren't providing a completely different role or subsystem per-se.
> 
> It is a different subsystem, different maintainer, and different
> reviewers.
> 
> It is a development process problem, it doesn't matter what it is
> doing.

So drawing that parallel, do you expect all drivers that call
pci_register_driver() to be located in drivers/pci? Aren't they scattered
all over the place ata,scsi, platform drivers and such?

As Alex pointed out, i915 and handful of s390 drivers that are mdev users
are not in drivers/vfio. Are you sayint those drivers don't get reviewed? 

This is no different than PF driver offering VF services. Its a logical
extension. 

Reviews happen for mdev users today. What you suggest seems like cutting 
the feet to fit the shoe. Unless the maintainers are asking things 
to be split just because its calling mdev_register_device() that practice 
doesn't exist and would be totally weird if you want to move all callers of
pci_register_driver(). 

Your argument seems interesting even entertaining :-). But honestly i'm not 
finding it
practical :-). So every caller of mmu_register_notifier() needs to be in
mm? 

What you mention for different functions make absolute sense, not arguing
against that.  but this ain't that. 

And we just follow the asks of the maintainer. 

I know you aren't going to give up, but there is little we can do. I want
the maintainers to make that call and I'm not add more noise to this.

Cheers,
Ashok

Re: [PATCH v4 00/17] Add VFIO mediated device support and DEV-MSI support for the idxd driver

2020-10-30 Thread Raj, Ashok

On Fri, Oct 30, 2020 at 04:17:06PM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 30, 2020 at 12:13:48PM -0700, Dave Jiang wrote:
> > 
> > 
> > On 10/30/2020 11:58 AM, Jason Gunthorpe wrote:
> > > On Fri, Oct 30, 2020 at 11:50:47AM -0700, Dave Jiang wrote:
> > > >   .../ABI/stable/sysfs-driver-dma-idxd  |6 +
> > > >   Documentation/driver-api/vfio/mdev-idxd.rst   |  404 ++
> > > >   MAINTAINERS   |1 +
> > > >   drivers/dma/Kconfig   |9 +
> > > >   drivers/dma/idxd/Makefile |2 +
> > > >   drivers/dma/idxd/cdev.c   |6 +-
> > > >   drivers/dma/idxd/device.c |  294 -
> > > >   drivers/dma/idxd/idxd.h   |   67 +-
> > > >   drivers/dma/idxd/init.c   |   86 ++
> > > >   drivers/dma/idxd/irq.c|6 +-
> > > >   drivers/dma/idxd/mdev.c   | 1121 +
> > > >   drivers/dma/idxd/mdev.h   |  116 ++
> > > 
> > > Again, a subsytem driver belongs in the directory hierarchy of the
> > > subsystem, not in other random places. All this mdev stuff belongs
> > > under drivers/vfio
> > 
> > Alex seems to have disagreed last time
> > https://lore.kernel.org/dmaengine/20200917113016.425dc...@x1.home/
> 
> Nobody else in the kernel is splitting subsystems up anymore
>  
> > And I do agree with his perspective. The mdev is an extension of the PF
> > driver. It's a bit awkward to be a stand alone mdev driver under vfio/mdev/.
> 
> By this logic we'd have giagantic drivers under drivers/ethernet
> touching netdev, rdma, scsi, vdpa, etc just because that is where the
> PF driver came from.

What makes you think this is providing services like scsi/rdma/vdpa etc.. ?

for DSA this playes the exact same role, not a different function 
as you highlight above. these mdev's are creating DSA for virtualization
use. They aren't providing a completely different role or subsystem per-se.

Cheers,
Ashok

Re: [RFC PATCH 0/2] iommu: Avoid unnecessary PRI queue flushes

2020-10-19 Thread Raj, Ashok

Hi Jean

On Mon, Oct 19, 2020 at 04:08:24PM +0200, Jean-Philippe Brucker wrote:
> On Sat, Oct 17, 2020 at 04:25:25AM -0700, Raj, Ashok wrote:
> > > For devices that *don't* use a stop marker, the PCIe spec says (10.4.1.2):
> > > 
> > >   To stop [using a PASID] without using a Stop Marker Message, the
> > >   function shall:
> > >   1. Stop queueing new Page Request Messages for this PASID.
> > 
> > The device driver would need to tell stop sending any new PR's.
> > 
> > >   2. Finish transmitting any multi-page Page Request Messages for this
> > >  PASID (i.e. send the Page Request Message with the L bit Set).
> > >   3. Wait for PRG Response Messages associated any outstanding Page
> > >  Request Messages for the PASID.
> > > 
> > > So they have to flush their PR themselves. And since the device driver
> > > completes this sequence before calling unbind(), then there shouldn't be
> > > any oustanding PR for the PASID, and unbind() doesn't need to flush,
> > > right?
> > 
> > I can see how the device can complete #2,3 above. But the device driver
> > isn't the one managing page-responses right. So in order for the device to
> > know the above sequence is complete, it would need to get some assist from
> > IOMMU driver?
> 
> No the device driver just waits for the device to indicate that it has
> completed the sequence. That's what the magic stop-PASID mechanism
> described by PCIe does. In 6.20.1 "Managing PASID TLP Prefix Usage" it
> says:

The goal is we do this when the device is in a messup up state. So we can't
take for granted the device is properly operating which is why we are going
to wack the device with a flr().

The only thing that's supposed to work without a brain freeze is the
invalidation logic. Spec requires devices to respond to invalidations even when
they are in the process of flr().

So when IOMMU does an invalidation wait with a Page-Drain, IOMMU waits till
the response for that arrives before completing the descriptor. Due to 
the posted semantics it will ensure any PR's issued and in the fabric are 
flushed 
out to memory. 

I suppose you can wait for the device to vouch for all responses, but that
is assuming the device is still functioning properly. Given that we use it
in two places,

* Reclaiming a PASID - only during a tear down sequence, skipping it
  doesn't really buy us much.
* During FLR this can't be skipped anyway due to the above sequence
  requirement. 

> 
> "A Function must have a mechanism to request that it gracefully stop using
>  a specific PASID. This mechanism is device specific but must satisfy the
>  following rules:
>  [...]
>  * When the stop request mechanism indicates completion, the Function has:
>[...]
>* Complied with additional rules described in Address Translation
>  Services (Chapter 10 [10.4.1.2 quoted above]) if Address Translations
>  or Page Requests were issued on the behalf of this PASID."
> 
> So after the device driver initiates this mechanism in the device, the
> device must be able to indicate completion of the mechanism, which
> includes completing all in-flight Page Requests. At that point the device
> driver can call unbind() knowing there is no pending PR for this PASID.
> 

Cheers,
Ashok

Re: [RFC PATCH 0/2] iommu: Avoid unnecessary PRI queue flushes

2020-10-17 Thread Raj, Ashok

Hi Jean

On Fri, Oct 16, 2020 at 09:59:23AM +0200, Jean-Philippe Brucker wrote:
> On Thu, Oct 15, 2020 at 11:22:11AM -0700, Raj, Ashok wrote:
> > Hi Jean
> > 
> > + Baolu who is looking into this.
> > 
> > 
> > On Thu, Oct 15, 2020 at 11:00:27AM +0200, Jean-Philippe Brucker wrote:
> > > Add a parameter to iommu_sva_unbind_device() that tells the IOMMU driver
> > > whether the PRI queue needs flushing. When looking at the PCIe spec
> > > again I noticed that most of the time the SMMUv3 driver doesn't actually
> > > need to flush the PRI queue. Does this make sense for Intel VT-d as well
> > > or did I overlook something?
> > > 
> > > Before calling iommu_sva_unbind_device(), device drivers must stop the
> > > device from using the PASID. For PCIe devices, that consists of
> > > completing any pending DMA, and completing any pending page request
> > > unless the device uses Stop Markers. So unless the device uses Stop
> > > Markers, we don't need to flush the PRI queue. For SMMUv3, stopping DMA
> > > means completing all stall events, so we never need to flush the event
> > > queue.
> > 
> > I don't think this is true. Baolu is working on an enhancement to this,
> > I'll quickly summarize this below:
> > 
> > Stop markers are weird, I'm not certain there is any device today that
> > sends STOP markers. Even if they did, markers don't have a required
> > response, they are fire and forget from the device pov.
> 
> Definitely agree with this, and I hope none will implement stop marker.
> For devices that *don't* use a stop marker, the PCIe spec says (10.4.1.2):
> 
>   To stop [using a PASID] without using a Stop Marker Message, the
>   function shall:
>   1. Stop queueing new Page Request Messages for this PASID.

The device driver would need to tell stop sending any new PR's.

>   2. Finish transmitting any multi-page Page Request Messages for this
>  PASID (i.e. send the Page Request Message with the L bit Set).
>   3. Wait for PRG Response Messages associated any outstanding Page
>  Request Messages for the PASID.
> 
> So they have to flush their PR themselves. And since the device driver
> completes this sequence before calling unbind(), then there shouldn't be
> any oustanding PR for the PASID, and unbind() doesn't need to flush,
> right?

I can see how the device can complete #2,3 above. But the device driver
isn't the one managing page-responses right. So in order for the device to
know the above sequence is complete, it would need to get some assist from
IOMMU driver?

How does the driver know that everything host received has been responded
back to device?

> 
> > I'm not sure about other IOMMU's how they behave, When there is no space in
> > the PRQ, IOMMU auto-responds to the device. This puts the device in a
> > while (1) loop. The fake successful response will let the device do a ATS
> > lookup, and that would fail forcing the device to do another PRQ.
> 
> But in the sequence above, step 1 should ensure that the device will not
> send another PR for any successful response coming back at step 3.

True, but there could be some page-request in flight on its way to the
IOMMU. By draining and getting that round trip back to IOMMU we gaurantee
things in flight are flushed to PRQ after that Drain completes.
> 
> So I agree with the below if we suspect there could be pending PR, but
> given that pending PR are a stop marker thing and we don't know any device
> using stop markers, I wondered why I bothered implementing PRIq flush at
> all for SMMUv3, hence this RFC.
> 

Cheers,
Ashok

Re: [PATCH 1/1] pci: pciehp: Handle MRL interrupts to enable slot for hotplug.

2020-10-16 Thread Raj, Ashok

Hi Bjorn


On Fri, Sep 25, 2020 at 04:01:38PM -0700, Ashok Raj wrote:
> When Mechanical Retention Lock (MRL) is present, Linux doesn't process
> those change events.
> 
> The following changes need to be enabled when MRL is present.
> 
> 1. Subscribe to MRL change events in SlotControl.
> 2. When MRL is closed,
>- If there is no ATTN button, then POWER on the slot.
>- If there is ATTN button, and an MRL event pending, ignore
>  Presence Detect. Since we want ATTN button to drive the
>  hotplug event.
> 

Did you get a chance to review this? Thought i'll just check with you. 

Seems like there was a lot happening in the error handling and hotplug
side, I thought you might have wanted to wait for the dust to settle :-)

> 
> Signed-off-by: Ashok Raj 
> Co-developed-by: Kuppuswamy Sathyanarayanan 
> 
> ---
>  drivers/pci/hotplug/pciehp.h  |  1 +
>  drivers/pci/hotplug/pciehp_ctrl.c | 69 
> +++
>  drivers/pci/hotplug/pciehp_hpc.c  | 27 ++-
>  3 files changed, 96 insertions(+), 1 deletion(-)
>

Re: [RFC PATCH 0/2] iommu: Avoid unnecessary PRI queue flushes

2020-10-15 Thread Raj, Ashok

Hi Jean

+ Baolu who is looking into this.

On Thu, Oct 15, 2020 at 11:00:27AM +0200, Jean-Philippe Brucker wrote:
> Add a parameter to iommu_sva_unbind_device() that tells the IOMMU driver
> whether the PRI queue needs flushing. When looking at the PCIe spec
> again I noticed that most of the time the SMMUv3 driver doesn't actually
> need to flush the PRI queue. Does this make sense for Intel VT-d as well
> or did I overlook something?
> 
> Before calling iommu_sva_unbind_device(), device drivers must stop the
> device from using the PASID. For PCIe devices, that consists of
> completing any pending DMA, and completing any pending page request
> unless the device uses Stop Markers. So unless the device uses Stop
> Markers, we don't need to flush the PRI queue. For SMMUv3, stopping DMA
> means completing all stall events, so we never need to flush the event
> queue.

I don't think this is true. Baolu is working on an enhancement to this,
I'll quickly summarize this below:

Stop markers are weird, I'm not certain there is any device today that
sends STOP markers. Even if they did, markers don't have a required
response, they are fire and forget from the device pov.

I'm not sure about other IOMMU's how they behave, When there is no space in
the PRQ, IOMMU auto-responds to the device. This puts the device in a
while (1) loop. The fake successful response will let the device do a ATS
lookup, and that would fail forcing the device to do another PRQ. The idea
is somewhere there the OS has repeated the others and this will find a way
in the PRQ. The point is this is less reliable and can't be the only
indication. PRQ draining has a specific sequence. 

The detailed steps are outlined in 
"Chapter 7.10 "Software Steps to Drain Page Requests & Responses"

- Submit invalidation wait with fence flag to ensure all prior
  invalidations are processed.
- submit iotlb followed by devtlb invalidation
- Submit invalidation wait with page-drain to make sure any page-requests
  issued by the device are flushed when this invalidation wait completes.
- If during the above process there was a queue overflow SW can assume no
  outstanding page-requests are there. If we had a queue full condition,
  then sw must repeat step 2,3 above.

To that extent the proposal is as follows: names are suggestive :-) I'm
making this up as I go!

- iommu_stop_page_req() - Kernel needs to make sure we respond to all
  outstanding requests (since we can't drop responses) 
  - Ensure we respond immediatly for anything that comes before the drain
sequence completes
- iommu_drain_page_req() - Which does the above invalidation with
  Page_drain set.

Once the driver has performed a reset and before issuing any new request,
it does iommu_resume_page_req() this will ensure we start processing
incoming page-req after this point.

> 
> First patch adds flags to unbind(), and the second one lets device
> drivers tell whether the PRI queue needs to be flushed.
> 
> Other remarks:
> 
> * The PCIe spec (see quote on patch 2), says that the device signals
>   whether it has sent a Stop Marker or not during Stop PASID. In reality
>   it's unlikely that a given device will sometimes use one stop method
>   and sometimes the other, so it could be a device-wide flag rather than
>   passing it at each unbind(). I don't want to speculate too much about
>   future implementation so I prefer having the flag in unbind().
> 
> * In patch 1, uacce passes 0 to unbind(). To pass the right flag I'm
>   thinking that uacce->ops->stop_queue(), which tells the device driver
>   to stop DMA, should return whether faults are pending. This can be
>   added later once uacce has an actual PCIe user, but we need to
>   remember to do it.

I think intel iommmu driver does this today for SVA when pasid is being
freed. Its still important to go through the drain before that PASID can be
re-purposed.

Cheers,
Ashok

Re: [PATCH v5 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback

2020-10-13 Thread Raj, Ashok

On Tue, Oct 13, 2020 at 04:45:01PM -0700, Kuppuswamy Sathyanarayanan wrote:
> Currently if report_error_detected() or report_mmio_enabled()
> functions requests PCI_ERS_RESULT_NEED_RESET, current
> pcie_do_recovery() implementation does not do the requested
> explicit device reset, but instead just calls the
> report_slot_reset() on all affected devices. Notifying about the
> reset via report_slot_reset() without doing the actual device
> reset is incorrect. So call pci_bus_reset() before triggering
> ->slot_reset() callback.
> 
> Signed-off-by: Kuppuswamy Sathyanarayanan 
> 
> Reviewed-by: Sinan Kaya 
> ---
>  Changes since v4:
>   * Added check for pci_reset_bus() return value.

Looks good!

Reviewed-by: Ashok Raj

Re: [PATCH v4 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback

2020-10-12 Thread Raj, Ashok

On Sun, Oct 11, 2020 at 10:03:40PM -0700, sathyanarayanan.nkuppusw...@gmail.com 
wrote:
> From: Kuppuswamy Sathyanarayanan 
> 
> Currently if report_error_detected() or report_mmio_enabled()
> functions requests PCI_ERS_RESULT_NEED_RESET, current
> pcie_do_recovery() implementation does not do the requested
> explicit device reset, but instead just calls the
> report_slot_reset() on all affected devices. Notifying about the
> reset via report_slot_reset() without doing the actual device
> reset is incorrect. So call pci_bus_reset() before triggering
> ->slot_reset() callback.
> 
> Signed-off-by: Kuppuswamy Sathyanarayanan 
> 
> ---
>  drivers/pci/pcie/err.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index c543f419d8f9..067c58728b88 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -181,11 +181,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>   }
>  
>   if (status == PCI_ERS_RESULT_NEED_RESET) {
> - /*
> -  * TODO: Should call platform-specific
> -  * functions to reset slot before calling
> -  * drivers' slot_reset callbacks?
> -  */
> + pci_reset_bus(dev);

pci_reset_bus() returns an error, do you need to consult that before
unconditionally setting PCI_ERS_RESULT_RECOVERED?

>   status = PCI_ERS_RESULT_RECOVERED;
>   pci_dbg(dev, "broadcast slot_reset message\n");
>   pci_walk_bus(bus, report_slot_reset, &status);
> -- 
> 2.17.1
>

Re: [PATCH v3 11/18] dmaengine: idxd: ims setup for the vdcm

2020-10-09 Thread Raj, Ashok

On Fri, Oct 09, 2020 at 10:12:18AM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 09, 2020 at 06:02:09AM -0700, Raj, Ashok wrote:
> > On Fri, Oct 09, 2020 at 09:49:45AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Oct 09, 2020 at 05:43:07AM -0700, Raj, Ashok wrote:
> > > > On Fri, Oct 09, 2020 at 08:57:37AM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Oct 08, 2020 at 06:22:31PM -0700, Raj, Ashok wrote:
> > > > > 
> > > > > > Not randomly put there Jason :-).. There is a good reason for it. 
> > > > > 
> > > > > Sure the PASID value being associated with the IRQ make sense, but
> > > > > combining that register with the interrupt mask is just a compltely
> > > > > random thing to do.
> > > > 
> > > > Hummm... Not sure what you are complaining.. but in any case giving
> > > > hardware a more efficient way to store interrupt entries breaking any
> > > > boundaries that maybe implied by the spec is why IMS was defined.
> > > 
> > > I'm saying this PASID stuff is just some HW detail of IDXD and nothing
> > > that the core irqchip code should concern itself with
> > 
> > Ok, so you are saying this is device specific why is generic framework
> > having to worry about the PASID stuff? 
> > 
> > I thought we are consolidating code that otherwise similar drivers would
> > require anyway. I thought that's what Thomas was accomplishing with the new
> > framework.
> 
> My point is why would another driver combine PASID and the IRQ mask in
> one register? There is no spec saying to do this, no common design

IMS is a concept. How a device organizes its interrupt data is completely
hardware specific. Some vendor could keep them organized like how MSIx is
done today, and put PASID's in a separate offset. Or put all interrupt
related entries all in one place like how idxd handles it today.

Cheers,
Ashok

Re: [PATCH v3 11/18] dmaengine: idxd: ims setup for the vdcm

2020-10-09 Thread Raj, Ashok

On Fri, Oct 09, 2020 at 09:49:45AM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 09, 2020 at 05:43:07AM -0700, Raj, Ashok wrote:
> > On Fri, Oct 09, 2020 at 08:57:37AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Oct 08, 2020 at 06:22:31PM -0700, Raj, Ashok wrote:
> > > 
> > > > Not randomly put there Jason :-).. There is a good reason for it. 
> > > 
> > > Sure the PASID value being associated with the IRQ make sense, but
> > > combining that register with the interrupt mask is just a compltely
> > > random thing to do.
> > 
> > Hummm... Not sure what you are complaining.. but in any case giving
> > hardware a more efficient way to store interrupt entries breaking any
> > boundaries that maybe implied by the spec is why IMS was defined.
> 
> I'm saying this PASID stuff is just some HW detail of IDXD and nothing
> that the core irqchip code should concern itself with

Ok, so you are saying this is device specific why is generic framework
having to worry about the PASID stuff? 

I thought we are consolidating code that otherwise similar drivers would
require anyway. I thought that's what Thomas was accomplishing with the new
framework.

He is the architect in chief for this... having PASID in the framework
seems like everybody could benefit instead of just being idxd specific.

This isn't the only game in town.

Re: [PATCH v3 11/18] dmaengine: idxd: ims setup for the vdcm

2020-10-09 Thread Raj, Ashok

On Fri, Oct 09, 2020 at 08:57:37AM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 08, 2020 at 06:22:31PM -0700, Raj, Ashok wrote:
> 
> > Not randomly put there Jason :-).. There is a good reason for it. 
> 
> Sure the PASID value being associated with the IRQ make sense, but
> combining that register with the interrupt mask is just a compltely
> random thing to do.

Hummm... Not sure what you are complaining.. but in any case giving
hardware a more efficient way to store interrupt entries breaking any
boundaries that maybe implied by the spec is why IMS was defined.

> 
> If this HW was using MSI-X PASID would have been given its own
> register.

Well there is no MSI-X PASID is there?

Re: [PATCH v1 1/1] PCI/ERR: don't clobber status after reset_link()

2020-10-08 Thread Raj, Ashok

On Fri, Oct 09, 2020 at 03:52:51AM +0100, Hedi Berriche wrote:
> Commit 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
> changed pcie_do_recovery() so that status is updated with the return
> value from reset_link(); this was to fix the problem where we would
> wrongly report recovery failure, despite a successful reset_link(),
> whenever the initial error status is PCI_ERS_RESULT_DISCONNECT or
> PCI_ERS_RESULT_NO_AER_DRIVER.
> 
> Unfortunately this breaks the flow of pcie_do_recovery() as it prevents

What is the reference to "this breaks" above?  

> the actions needed when the initial error is PCI_ERS_RESULT_CAN_RECOVER
> or PCI_ERS_RESULT_NEED_RESET from taking place which causes error
> recovery to fail.
> 
> Don't clobber status after reset_link() to restore the intended flow in
> pcie_do_recovery().
> 
> Fix the original problem by saving the return value from reset_link()
> and use it later on to decide whether error recovery should be deemed
> successful in the scenarios where the initial error status is
> PCI_ERS_RESULT_{DISCONNECT,NO_AER_DRIVER}.

I would rather rephrase the above to make it clear what is being proposed.
Since the description seems to talk about the old problem and new solution
all mixed up.

> 
> Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
> Signed-off-by: Hedi Berriche 
> Cc: Russ Anderson 
> Cc: Kuppuswamy Sathyanarayanan 
> Cc: Bjorn Helgaas 
> Cc: Ashok Raj 
> Cc: Keith Busch 
> Cc: Joerg Roedel 
> 
> Cc: sta...@kernel.org # v5.7+
> ---
>  drivers/pci/pcie/err.c | 13 ++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index c543f419d8f9..dbd0b56bd6c1 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -150,7 +150,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>   pci_channel_state_t state,
>   pci_ers_result_t (*reset_link)(struct pci_dev *pdev))
>  {
> - pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> + pci_ers_result_t post_reset_status, status = PCI_ERS_RESULT_CAN_RECOVER;

why call it post_reset_status? 

>   struct pci_bus *bus;
>  
>   /*
> @@ -165,8 +165,8 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>   pci_dbg(dev, "broadcast error_detected message\n");
>   if (state == pci_channel_io_frozen) {
>   pci_walk_bus(bus, report_frozen_detected, &status);
> - status = reset_link(dev);
> - if (status != PCI_ERS_RESULT_RECOVERED) {
> + post_reset_status = reset_link(dev);
> + if (post_reset_status != PCI_ERS_RESULT_RECOVERED) {
>   pci_warn(dev, "link reset failed\n");
>   goto failed;
>   }
> @@ -174,6 +174,13 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>   pci_walk_bus(bus, report_normal_detected, &status);
>   }
>  
> + if ((status == PCI_ERS_RESULT_DISCONNECT ||
> +  status == PCI_ERS_RESULT_NO_AER_DRIVER) &&
> +  post_reset_status == PCI_ERS_RESULT_RECOVERED) {
> + /* error recovery succeeded thanks to reset_link() */
> + status = PCI_ERS_RESULT_RECOVERED;
> + }
> +
>   if (status == PCI_ERS_RESULT_CAN_RECOVER) {
>   status = PCI_ERS_RESULT_RECOVERED;
>   pci_dbg(dev, "broadcast mmio_enabled message\n");
> -- 
> 2.28.0
>

Re: [PATCH v3 11/18] dmaengine: idxd: ims setup for the vdcm

2020-10-08 Thread Raj, Ashok

Hi Jason

On Thu, Oct 08, 2020 at 08:32:10PM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 09, 2020 at 01:17:38AM +0200, Thomas Gleixner wrote:
> > Dave,
> > 
> > On Thu, Oct 08 2020 at 09:51, Dave Jiang wrote:
> > > On 10/8/2020 12:39 AM, Thomas Gleixner wrote:
> > >> On Wed, Oct 07 2020 at 14:54, Dave Jiang wrote:
> > >>> On 9/30/2020 12:57 PM, Thomas Gleixner wrote:
> >  Aside of that this is fiddling in the IMS storage array behind the irq
> >  chips back without any comment here and a big fat comment about the
> >  shared usage of ims_slot::ctrl in the irq chip driver.
> > 
> > >>> This is to program the pasid fields in the IMS table entry. Was
> > >>> thinking the pasid fields may be considered device specific so didn't
> > >>> attempt to add the support to the core code.
> > >> 
> > >> Well, the problem is that this is not really irq chip functionality.
> > >> 
> > >> But the PASID programming needs to touch the IMS storage which is also
> > >> touched by the irq chip.
> > >> 
> > >> This might be correct as is, but without a big fat comment explaining
> > >> WHY it is safe to do so without any form of serialization this is just
> > >> voodoo and unreviewable.
> > >> 
> > >> Can you please explain when the PASID is programmed and what the state
> > >> of the interrupt is at that point? Is this a one off setup operation or
> > >> does this happen dynamically at random points during runtime?
> > >
> > > I will put in comments for the function to explain why and when we modify 
> > > the 
> > > pasid field for the IMS entry. Programming of the pasid is done right 
> > > before we 
> > > request irq. And the clearing is done after we free the irq. We will not 
> > > be 
> > > touching the IMS field at runtime. So the touching of the entry should be 
> > > safe.
> > 
> > Thanks for clarifying that.
> > 
> > Thinking more about it, that very same thing will be needed for any
> > other IMS device and of course this is not going to end well because
> > some driver will fiddle with the PASID at the wrong time.
> 
> Why? This looks like some quirk of the IDXD HW where it just randomly
> put PASID along with the IRQ mask register. Probably because PASID is
> not the full 32 bits.

Not randomly put there Jason :-).. There is a good reason for it. I'm sure
Dave must have responded already. ENQCMD for DSA has the interrupt handle
on which the notification should be sent. Since the data from from user
space HW will verify if the PASID for IMS entry matches what is there in
the descriptor. 

Check description in section 9.2.2.1 of the DSA specification, when PASID
enable is 1, this field is checked against the PASID field of the
descriptor. Also check Section 5.4 and Interrupt Virtualization 7.3.3 for
more info.

> 
> AFAIK the PASID is not tagged on the MemWr TLP triggering the
> interrupt, so it really is unrelated to the irq.

Correct, the purpose is not to send PASID prefix for interrupt tranactions.

> 
> I think the ioread to get the PASID is rather ugly, it should pluck

Where do you see the ioread? I suppose idxd driver will fill in from the
aux_domain default PASID. Not reading from the device IMS entry.

> the PASID out of some driver specific data structure with proper
> locking, and thus use the sleepable version of the irqchip?
> 
> This is really not that different from what I was describing for queue
> contexts - the queue context needs to be assigned to the irq # before
> it can be used in the irq chip other wise there is no idea where to
> write the msg to. Just like pasid here.

Sorry, I don't follow you on this.. you mean context in hardware or user
context that holds interrupt addr/data values?

Re: [PATCH v7 0/5] Fix DPC hotplug race and enhance error handling

2020-10-03 Thread Raj, Ashok

Hi Ethan

On Sat, Oct 03, 2020 at 03:55:09AM -0400, Ethan Zhao wrote:
> Hi,folks,
> 
> This simple patch set fixed some serious security issues found when DPC
> error injection and NVMe SSD hotplug brute force test were doing -- race
> condition between DPC handler and pciehp, AER interrupt handlers, caused
> system hang and system with DPC feature couldn't recover to normal
> working state as expected (NVMe instance lost, mount operation hang,
> race PCIe access caused uncorrectable errors reported alternatively etc).

I think maybe picking from other commit messages to make this description in 
cover letter bit clear. The fundamental premise is that when due to error
conditions when events are processed by both DPC handler and hotplug handling 
of 
DLLSC both operating on the same device object ends up with crashes.


> 
> With this patch set applied, stable 5.9-rc6 on ICS (Ice Lake SP platform,
> see
> https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))
> 
> could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time
> interval between hot-remove and plug-in operation tens of times without
> any errors occur and system works normal.

> 
> With this patch set applied, system with DPC feature could recover from
> NON-FATAL and FATAL errors injection test and works as expected.
> 
> System works smoothly when errors happen while hotplug is doing, no
> uncorrectable errors found.
> 
> Brute DPC error injection script:
> 
> for i in {0..100}
> do
> setpci -s 64:02.0 0x196.w=000a
> setpci -s 65:00.0 0x04.w=0544
> mount /dev/nvme0n1p1 /root/nvme
> sleep 1
> done
> 
> Other details see every commits description part.
> 
> This patch set could be applied to stable 5.9-rc6/rc7 directly.
> 
> Help to review and test.
> 
> v2: changed according to review by Andy Shevchenko.
> v3: changed patch 4/5 to simpler coding.
> v4: move function pci_wait_port_outdpc() to DPC driver and its
>declaration to pci.h. (tip from Christoph Hellwig ).
> v5: fix building issue reported by l...@intel.com with some config.
> v6: move patch[3/5] as the first patch according to Lukas's suggestion.
> and rewrite the comment part of patch[3/5].
> v7: change the patch[4/5], based on Bjorn's code and truth table.
> change the patch[5/5] about the debug output information.
> 
> Thanks,
> Ethan 
> 
> 
> Ethan Zhao (5):
>   PCI/ERR: get device before call device driver to avoid NULL pointer
> dereference
>   PCI/DPC: define a function to check and wait till port finish DPC
> handling
>   PCI: pciehp: check and wait port status out of DPC before handling
> DLLSC and PDC
>   PCI: only return true when dev io state is really changed
>   PCI/ERR: don't mix io state not changed and no driver together
> 
>  drivers/pci/hotplug/pciehp_hpc.c |  4 ++-
>  drivers/pci/pci.h| 55 +---
>  drivers/pci/pcie/dpc.c   | 27 
>  drivers/pci/pcie/err.c   | 18 +--
>  4 files changed, 68 insertions(+), 36 deletions(-)
> 
> 
> base-commit: a1b8638ba1320e6684aa98233c15255eb803fac7
> -- 
> 2.18.4
>

Re: [PATCH v3 05/18] dmaengine: idxd: add IMS support in base driver

2020-09-30 Thread Raj, Ashok

Hi Thomas,

On Wed, Sep 30, 2020 at 11:57:22PM +0200, Thomas Gleixner wrote:
> On Wed, Sep 30 2020 at 14:49, Ashok Raj wrote:
> >> It is the weirdest thing, IMHO. Intel defined a dvsec cap in their
> >> SIOV cookbook, but as far as I can see it serves no purpose at
> >> all.
> >> 
> >> Last time I asked I got some unclear mumbling about "OEMs".
> >> 
> > One of the parameters it has is the "supported system page-sizes" which is
> > usually there in the SRIOV properties. So it needed a place holder for
> > that. 
> >
> > IMS is a device specific capability, and I almost forgot why we needed
> > until I had to checking internally. Remember when a device is given to a
> > guest, MSIX routing via Interrupt Remapping is automatic via the VFIO/IRQFD
> > and such.
> 
> -ENOPARSE

Let me try again to see if this will turn into ESUCCESS :-)

Devices exposed to guest need host OS support for programming interrupt
entries in the IOMMU interrupt remapping table. VFIO provides those 
services for standard interrupt schemes like MSI/MSIx for instance. 
Since IMS is device specific VFIO can't provide an intercept when 
IMS entries are programmed by the guest OS. 

If the virtualisation software doesn't expose vIOMMU with virtual
capabilities to allocate IRTE entries and support for vIRTE in guest
then its expected to turn off the IMS capability. Hence driver running 
in guest will not enable IMS.

> 
> > When we provision an entire PCI device that is IMS capable. The guest
> > driver does know it can update the IMS entries directly without going to
> > the host. But in order to do remapping we need something like how we manage
> > PASID allocation from guest, so an IRTE entry can be allocated and the host
> > driver can write the proper values for IMS.
> 
> And how is that related to that capbility thing?
> 
> Also this stuff is host side and not guest side. I seriously doubt that
> you want to hand in the whole PCI device which contains the IMS

You are right, but nothing prevents a user from simply taking a full PCI
device and assign to guest. 

> thing. The whole point of IMS was as far as I was told that you can
> create gazillions of subdevices and have seperate MSI interrupts for
> each of them.


Cheers,
Ashok

Re: [PATCH v3 05/18] dmaengine: idxd: add IMS support in base driver

2020-09-30 Thread Raj, Ashok

Hi Jason

On Wed, Sep 30, 2020 at 03:51:03PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 30, 2020 at 08:47:00PM +0200, Thomas Gleixner wrote:
> 
> > > + pci_read_config_dword(pdev, SIOVCAP(dvsec), &val32);
> > > + if ((val32 & 0x1) && idxd->hw.gen_cap.max_ims_mult) {
> > > + idxd->ims_size = idxd->hw.gen_cap.max_ims_mult * 256ULL;
> > > + dev_dbg(dev, "IMS size: %u\n", idxd->ims_size);
> > > + set_bit(IDXD_FLAG_SIOV_SUPPORTED, &idxd->flags);
> > > + dev_dbg(&pdev->dev, "IMS supported for device\n");
> > > + return;
> > > + }
> > > +
> > > + dev_dbg(&pdev->dev, "SIOV unsupported for device\n");
> > 
> > It's really hard to find the code inside all of this dev_dbg()
> > noise. But why is this capability check done in this driver? Is this
> > capability stuff really IDXD specific or is the next device which
> > supports this going to copy and pasta the above?
> 
> It is the weirdest thing, IMHO. Intel defined a dvsec cap in their
> SIOV cookbook, but as far as I can see it serves no purpose at
> all.
> 
> Last time I asked I got some unclear mumbling about "OEMs".
> 
One of the parameters it has is the "supported system page-sizes" which is
usually there in the SRIOV properties. So it needed a place holder for
that. 

IMS is a device specific capability, and I almost forgot why we needed
until I had to checking internally. Remember when a device is given to a
guest, MSIX routing via Interrupt Remapping is automatic via the VFIO/IRQFD
and such.

When we provision an entire PCI device that is IMS capable. The guest
driver does know it can update the IMS entries directly without going to
the host. But in order to do remapping we need something like how we manage
PASID allocation from guest, so an IRTE entry can be allocated and the host
driver can write the proper values for IMS.

Hope this helps clear things up.

Cheers,
Ashok

Re: [PATCH v8 3/9] Documentation/x86: Add documentation for SVA (Shared Virtual Addressing)

2020-09-17 Thread Raj, Ashok

Hi Boris,

On Thu, Sep 17, 2020 at 09:53:38AM +0200, Borislav Petkov wrote:
> On Tue, Sep 15, 2020 at 09:30:07AM -0700, Fenghua Yu wrote:
> > +Background
> > +==
> > +
> > +Shared Virtual Addressing (SVA) allows the processor and device to use the
> > +same virtual addresses avoiding the need for software to translate virtual
> > +addresses to physical addresses. SVA is what PCIe calls Shared Virtual
> > +Memory (SVM).
> > +
> > +In addition to the convenience of using application virtual addresses
> > +by the device, it also doesn't require pinning pages for DMA.
> > +PCIe Address Translation Services (ATS) along with Page Request Interface
> > +(PRI) allow devices to function much the same way as the CPU handling
> > +application page-faults. For more information please refer to the PCIe
> > +specification Chapter 10: ATS Specification.
> > +
> > +Use of SVA requires IOMMU support in the platform. IOMMU also is required
> > +to support PCIe features ATS and PRI. ATS allows devices to cache
> > +translations for virtual addresses. The IOMMU driver uses the 
> > mmu_notifier()
> > +support to keep the device TLB cache and the CPU cache in sync. PRI allows
> > +the device to request paging the virtual address by using the CPU page 
> > tables
> > +before accessing the address.
> 
> That still reads funny, the "the device to request paging the virtual
> address" part. Do you mean that per chance here:
> 
> "Before the device can access that address, the device uses the PRI in
> order to request the virtual address to be paged in into the CPU page
> tables."
> 
Agree, this reads a bit funny.

Just tweaked it a bit: 

"When ATS lookup fails for a virtual address, device should use PRI in
order to request the virtual address to be paged into the CPU page tables.
The device must use ATS again in order the fetch the translation again
before use"

Cheers,
Ashok

Re: [PATCH v8 3/9] Documentation/x86: Add documentation for SVA (Shared Virtual Addressing)

2020-09-17 Thread Raj, Ashok

Thanks Boris.

multiple "again" makes it funny again :-)


On Thu, Sep 17, 2020 at 07:18:49PM +0200, Borislav Petkov wrote:
> On Thu, Sep 17, 2020 at 07:56:09AM -0700, Raj, Ashok wrote:
> > Just tweaked it a bit: 
> > 
> > "When ATS lookup fails for a virtual address, device should use PRI in
> > order to request the virtual address to be paged into the CPU page tables.
> > The device must use ATS again in order the fetch the translation again

s/translation again/translation

> > before use"
> 
> Thanks, amended.
>

Re: [PATCH] intel-iommu: don't disable ATS for device without page aligned request

2020-09-09 Thread Raj, Ashok

Hi Jason

On Wed, Sep 09, 2020 at 04:34:32PM +0800, Jason Wang wrote:
> Commit 61363c1474b1 ("iommu/vt-d: Enable ATS only if the device uses
> page aligned address.") disables ATS for device that can do unaligned
> page request.

Did you take a look at the PCI specification?
Page Aligned Request is in the ATS capability Register.

ATS Capability Register (Offset 0x04h)

bit (5):
Page Aligned Request - If Set, indicates the Untranslated address is always
aligned to 4096 byte boundary. Setting this field is recommended. This
field permits software to distinguish between implemntations compatible
with this specification and those compatible with an earlier version of
this specification in which a Requester was permitted to supply anything in
bits [11:2].

> 
> This looks wrong, since the commit log said it's because the page
> request descriptor doesn't support reporting unaligned request.

I don't think you can change the definition from ATS to PRI. Both are
orthogonal feature.

> 
> A victim is Qemu's virtio-pci which doesn't advertise the page aligned
> address. Fixing by disable PRI instead of ATS if device doesn't have
> page aligned request.

This is a requirement for the Intel IOMMU's.

You say virtio, so is it all emulated device or you talking about some
hardware that implemented virtio-pci compliant hw? If you are sure the
device actually does comply with the requirement, but just not enumerating
the capability, you can maybe work a quirk to overcome that?

Now PRI also has an alignment requirement, and Intel IOMMU's requires that
as well. If your device supports SRIOV as well, PASID and PRI are
enumerated just on the PF and not the VF. You might want to pay attension
to that. We are still working on a solution for that problem.

I don't think this is the right fix for your problem. 

> 
> Cc: sta...@vger.kernel.org
> Cc: Ashok Raj 
> Cc: Jacob Pan 
> Cc: Keith Busch 
> Cc: Kuppuswamy Sathyanarayanan 
> Signed-off-by: Jason Wang 
> ---
>  drivers/iommu/intel/iommu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>

Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

2020-09-03 Thread Raj, Ashok

Hi Thomas,

Thanks a ton for jumping in helping on straightening it for IMS!!!


On Wed, Aug 26, 2020 at 01:16:28PM +0200, Thomas Gleixner wrote:
> This is the second version of providing a base to support device MSI (non
> PCI based) and on top of that support for IMS (Interrupt Message Storm)

s/Storm/Store

maybe pun intended :-)

> based devices in a halfways architecture independent way.

You mean "halfways" because the message addr and data follow guidelines
per arch (x86 or such), but the location of the storage isn't dictated
by architecture? or did you have something else in mind? 

> 
> The first version can be found here:
> 
> https://lore.kernel.org/r/20200821002424.119492...@linutronix.de
> 

[snip]

> 
> Changes vs. V1:
> 
>- Addressed various review comments and addressed the 0day fallout.
>  - Corrected the XEN logic (Jürgen)
>  - Make the arch fallback in PCI/MSI opt-in not opt-out (Bjorn)
> 
>- Fixed the compose MSI message inconsistency
> 
>- Ensure that the necessary flags are set for device SMI

is that supposed to be MSI? 

Cheers,
Ashok

Re: [PATCH v2] x86/hotplug: Silence APIC only after all irq's are migrated

2020-08-25 Thread Raj, Ashok

Hi Thomas,

On Wed, Aug 26, 2020 at 02:40:45AM +0200, Thomas Gleixner wrote:
> Ashok,
> 
> On Thu, Aug 20 2020 at 17:42, Ashok Raj wrote:
> > When offlining CPUs, fixup_irqs() migrates all interrupts away from the
> > outgoing CPU to an online CPU. It's always possible the device sent an
> > interrupt to the previous CPU destination. Pending interrupt bit in IRR in
> > LAPIC identifies such interrupts. apic_soft_disable() will not capture any
> > new interrupts in IRR. This causes interrupts from device to be lost during
> > CPU offline. The issue was found when explicitly setting MSI affinity to a
> > CPU and immediately offlining it. It was simple to recreate with a USB
> > ethernet device and doing I/O to it while the CPU is offlined. Lost
> > interrupts happen even when Interrupt Remapping is enabled.
> 
> New lines exist for a reason. They help to structure information. For
> the content, please see below.

Will work on that :-)

> 
> > Current code does apic_soft_disable() before migrating interrupts.
> >
> > native_cpu_disable()
> > {
> > ...
> > apic_soft_disable();
> > cpu_disable_common();
> >   --> fixup_irqs(); // Too late to capture anything in IRR.
> > }
> >
> > Just flipping the above call sequence seems to hit the IRR checks
> > and the lost interrupt is fixed for both legacy MSI and when
> > interrupt remapping is enabled.
> 
> Seems to hit? Come on, we really want changelogs which are based on
> facts and not on assumptions.

What I intended to convay was by placing a debug trace_printk() at
fixup_irqs(), it was *indeed* observed. Before the change I never noticed
that path being covered.

Just my Inglish (Indian English) tricking you :-).
Will make them sensible in the next update.

> 
> Aside of that, yes that's a really subtle one and thanks for tracking it
> down! For some reason I never looked at that ordering, but now that you
> stick it in front of me, it's pretty clear that this is the root cause.
> 
> > /*
> >  * Disable the local APIC. Otherwise IPI broadcasts will reach
> >  * it. It still responds normally to INIT, NMI, SMI, and SIPI
> > -* messages.
> > +* messages. It's important to do apic_soft_disable() after
> > +* fixup_irqs(), because fixup_irqs() called from cpu_disable_common()
> > +* depends on IRR being set.
> 
> That sentence does not make sense to me.

Right, I was just stating the obvious. Since fixup_irqs() isn't called
right in that function, it was suggested to make that connection explicit.

Your writeup below is crystal.. so will replace with what you have below.


> 
> > +    After apic_soft_disable() CPU preserves
> > +* currently set IRR/ISR but new interrupts will not set IRR.
> 
> I agree with the IRR part, but ISR is simply impossible to be set in
> this situation. 

You are correct. I was trying to convey what the SDM said, but its probably
irrelavant for this discussion. 

> 
> > +* This causes interrupts sent to outgoing CPU before completion
> > +* of IRQ migration to be lost. Check SDM Vol 3 "10.4.7.2 Local
> > +* APIC State after It Has been Software Disabled" section for more
> > +* details.
> 
> Please do not use the SDM chapter number of today. It's going to be a
> different one with the next version.
> 
> Something like this perhaps?
> 
>   /*
>* Disable the local APIC. Otherwise IPI broadcasts will reach
>* it. It still responds normally to INIT, NMI, SMI, and SIPI
>* messages.
>  *
>  * Disabling the APIC must happen after cpu_disable_common()
>* which invokes fixup_irqs().
>  *
>  * Disabling the APIC preserves already set bits in IRR, but
>  * an interrupt arriving after disabling the local APIC does not
>  * set the corresponding IRR bit.
>  *
>  * fixup_irqs() scans IRR for set bits so it can raise a not
>* yet handled interrupt on the new destination CPU via an IPI
>  * but obviously it can't do so for IRR bits which are not set.
>  * IOW, interrupts arriving after disabling the local APIC will
>  * be lost.
>  */
> 
> Hmm?
> 
> The changelog wants to have a corresponding update.

Will do ...

Cheers,
Ashok

Re: [PATCH v2] x86/hotplug: Silence APIC only after all irq's are migrated

2020-08-23 Thread Raj, Ashok

Hi Thomas,

I was wondering if you got a chance to take a look at this fix?

I had some mail issues recently and they showed up at lore after 2
days. I wasn't sure if you got the original mail, or maybe it didn't
make it. 

If you had a different way to fix it, we can try those out. 


On Thu, Aug 20, 2020 at 05:42:03PM -0700, Ashok Raj wrote:
> When offlining CPUs, fixup_irqs() migrates all interrupts away from the
> outgoing CPU to an online CPU. It's always possible the device sent an
> interrupt to the previous CPU destination. Pending interrupt bit in IRR in
> LAPIC identifies such interrupts. apic_soft_disable() will not capture any
> new interrupts in IRR. This causes interrupts from device to be lost during
> CPU offline. The issue was found when explicitly setting MSI affinity to a
> CPU and immediately offlining it. It was simple to recreate with a USB
> ethernet device and doing I/O to it while the CPU is offlined. Lost
> interrupts happen even when Interrupt Remapping is enabled.
> 
> Current code does apic_soft_disable() before migrating interrupts.
> 
> native_cpu_disable()
> {
>   ...
>   apic_soft_disable();
>   cpu_disable_common();
> --> fixup_irqs(); // Too late to capture anything in IRR.
> }
> 
> Just flipping the above call sequence seems to hit the IRR checks
> and the lost interrupt is fixed for both legacy MSI and when
> interrupt remapping is enabled.

On another note, we have tested both with and without the read
after write when programming MSI addr/data on the device. It didn't
seem to change the results. But I think its a useful one to add
for correctness.

https://lore.kernel.org/lkml/878si6rx7f@nanos.tec.linutronix.de/

This bug been eluding for a while. Looking for your feedback.

> 
> Fixes: 60dcaad5736f ("x86/hotplug: Silence APIC and NMI when CPU is dead")
> Link: https://lore.kernel.org/lkml/875zdarr4h@nanos.tec.linutronix.de/
> Reported-by: Evan Green 
> Tested-by: Mathias Nyman 
> Tested-by: Evan Green 
> Reviewed-by: Evan Green 
> Signed-off-by: Ashok Raj 
> ---
> v2:
> - Typos and fixes suggested by Randy Dunlap
> 
> To: linux-kernel@vger.kernel.org
> To: Thomas Gleixner 
> Cc: Sukumar Ghorai 
> Cc: Srikanth Nandamuri 
> Cc: Evan Green 
> Cc: Mathias Nyman 
> Cc: Bjorn Helgaas 
> Cc: sta...@vger.kernel.org
> ---
>  arch/x86/kernel/smpboot.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 27aa04a95702..3016c3b627ce 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1594,13 +1594,20 @@ int native_cpu_disable(void)
>   if (ret)
>   return ret;
>  
> + cpu_disable_common();
>   /*
>* Disable the local APIC. Otherwise IPI broadcasts will reach
>* it. It still responds normally to INIT, NMI, SMI, and SIPI
> -  * messages.
> +  * messages. It's important to do apic_soft_disable() after
> +  * fixup_irqs(), because fixup_irqs() called from cpu_disable_common()
> +  * depends on IRR being set. After apic_soft_disable() CPU preserves
> +  * currently set IRR/ISR but new interrupts will not set IRR.
> +  * This causes interrupts sent to outgoing CPU before completion
> +  * of IRQ migration to be lost. Check SDM Vol 3 "10.4.7.2 Local
> +  * APIC State after It Has been Software Disabled" section for more
> +  * details.
>*/
>   apic_soft_disable();
> - cpu_disable_common();
>  
>   return 0;
>  }
> -- 
> 2.7.4
>

Re: [PATCH] x86/hotplug: Silence APIC only after all irq's are migrated

2020-08-20 Thread Raj, Ashok

On Thu, Aug 20, 2020 at 11:21:24AM -0700, Evan Green wrote:
> > >
> > > I'm slightly unclear about whether interrupts are disabled at the core
> > > by this point or not. I followed native_cpu_disable() up to
> > > __cpu_disable(), up to take_cpu_down(). This is passed into a call to
> > > stop_machine_cpuslocked(), where interrupts get disabled at the core.
> > > So unless there's another path, it seems like interrupts are always
> > > disabled at the core by this point.
> >
> > local_irq_disable() just does cli() which allows interrupts to trickle
> > in to the IRR bits, and once you do sti() things would flow back for
> > normal interrupt processing.
> >
> >
> > >
> > > If interrupts are always disabled, then the comment above is a little
> >
> > Disable interrupts is different from disabling LAPIC. Once you do the
> > apic_soft_disable(), there is nothing flowing into the LAPIC except
> > for INIT, NMI, SMI and SIPI messages.
> >
> > This turns off the pipe for all other interrupts to enter LAPIC. Which
> > is different from doing a cli().
> 
> I understand the distinction. I was mostly musing on the difference in
> behavior your change causes if this function is entered with
> interrupts enabled at the core (ie sti()). But I think it never is, so
> that thought is moot.
> 
> I could never repro the issue reliably on comet lake after Thomas'
> original fix. But I can still repro it easily on jasper lake. And this
> patch fixes the issue for me on that platform. Thanks for the fix.
> 
> Reviewed-by: Evan Green 
> Tested-by: Evan Green 

Thanks Evan for testing. I'll wait for thomas if he finds anything else
that needs to be fixed and send a final v2 after fixing the typos and
others identified by Randy. 

Cheers,
Ashok

Re: [PATCH] x86/hotplug: Silence APIC only after all irq's are migrated

2020-08-17 Thread Raj, Ashok

Hi Evan

Some details below, 

On Mon, Aug 17, 2020 at 11:12:17AM -0700, Evan Green wrote:
> Hi Ashok,
> Thank you, Srikanth, and Sukumar for some very impressive debugging here.
> 
> On Fri, Aug 14, 2020 at 2:38 PM Ashok Raj  wrote:
> >
> > When offlining CPU's, fixup_irqs() migrates all interrupts away from the
> > outgoing CPU to an online CPU. Its always possible the device sent an
> > interrupt to the previous CPU destination. Pending interrupt bit in IRR in
> > lapic identifies such interrupts. apic_soft_disable() will not capture any
> > new interrupts in IRR. This causes interrupts from device to be lost during
> > cpu offline. The issue was found when explicitly setting MSI affinity to a
> > CPU and immediately offlining it. It was simple to recreate with a USB
> > ethernet device and doing I/O to it while the CPU is offlined. Lost
> > interrupts happen even when Interrupt Remapping is enabled.
> >
> > Current code does apic_soft_disable() before migrating interrupts.
> >
> > native_cpu_disable()
> > {
> > ...
> > apic_soft_disable();
> > cpu_disable_common();
> >   --> fixup_irqs(); // Too late to capture anything in IRR.
> > }
> >
> > Just fliping the above call sequence seems to hit the IRR checks
> > and the lost interrupt is fixed for both legacy MSI and when
> > interrupt remapping is enabled.
> >
> >
> > Fixes: 60dcaad5736f ("x86/hotplug: Silence APIC and NMI when CPU is dead")
> > Link: https://lore.kernel.org/lkml/875zdarr4h@nanos.tec.linutronix.de/
> > Signed-off-by: Ashok Raj 
> >
> > To: linux-kernel@vger.kernel.org
> > To: Thomas Gleixner 
> > Cc: Sukumar Ghorai 
> > Cc: Srikanth Nandamuri 
> > Cc: Evan Green 
> > Cc: Mathias Nyman 
> > Cc: Bjorn Helgaas 
> > Cc: sta...@vger.kernel.org
> > ---
> >  arch/x86/kernel/smpboot.c | 11 +--
> >  1 file changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index ffbd9a3d78d8..278cc9f92f2f 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -1603,13 +1603,20 @@ int native_cpu_disable(void)
> > if (ret)
> > return ret;
> >
> > +   cpu_disable_common();
> > /*
> >  * Disable the local APIC. Otherwise IPI broadcasts will reach
> >  * it. It still responds normally to INIT, NMI, SMI, and SIPI
> > -* messages.
> 
> I'm slightly unclear about whether interrupts are disabled at the core
> by this point or not. I followed native_cpu_disable() up to
> __cpu_disable(), up to take_cpu_down(). This is passed into a call to
> stop_machine_cpuslocked(), where interrupts get disabled at the core.
> So unless there's another path, it seems like interrupts are always
> disabled at the core by this point.

local_irq_disable() just does cli() which allows interrupts to trickle
in to the IRR bits, and once you do sti() things would flow back for
normal interrupt processing. 


> 
> If interrupts are always disabled, then the comment above is a little

Disable interrupts is different from disabling LAPIC. Once you do the
apic_soft_disable(), there is nothing flowing into the LAPIC except
for INIT, NMI, SMI and SIPI messages. 

This turns off the pipe for all other interrupts to enter LAPIC. Which
is different from doing a cli().


> obsolete, since we're not expecting to receive broadcast IPIs from
> here on out anyway. We could clean up that comment in this change.
> 
> If there is a path to here where interrupts are still enabled at the
> core, then we'd need to watch out, because this change now allows
> broadcast IPIs to come in during cpu_disable_common(). That could be
> bad. But I think that's not this case, so this should be ok.

Section SDM Vol3.b 10.4.7.2 says.

* The reception of any interrupt or transmission of any IPIs that are in 
  progress when the local APIC is disabled are completed before the local 
  APIC enters the software-disabled state.

It doesn't actually say much about broadcast IPI's, except broadcast 
NMI for instance, which is still permitted when cli() is set.

Hope this helps.

Cheers,
Ashok

Re: [PATCH] x86/hotplug: Silence APIC only after all irq's are migrated

2020-08-15 Thread Raj, Ashok

Hi Randy,


On Fri, Aug 14, 2020 at 04:25:32PM -0700, Randy Dunlap wrote:
> On 8/14/20 2:38 PM, Ashok Raj wrote:
> > When offlining CPU's, fixup_irqs() migrates all interrupts away from the
> 
>  CPUs,

Thanks for catching these. I'll fix all these suggested changes in my next rev
Once i get additional feedback from Thomas.


> 
> > outgoing CPU to an online CPU. Its always possible the device sent an
> 
>  It's
> 
> > interrupt to the previous CPU destination. Pending interrupt bit in IRR in
> > lapic identifies such interrupts. apic_soft_disable() will not capture any
> 
>   LAPIC
> 
> > new interrupts in IRR. This causes interrupts from device to be lost during
> > cpu offline. The issue was found when explicitly setting MSI affinity to a
> 
>   CPU
> 
> > CPU and immediately offlining it. It was simple to recreate with a USB
> > ethernet device and doing I/O to it while the CPU is offlined. Lost
> > interrupts happen even when Interrupt Remapping is enabled.
> > 
> > Current code does apic_soft_disable() before migrating interrupts.
> > 
> > native_cpu_disable()
> > {
> > ...
> > apic_soft_disable();
> > cpu_disable_common();
> >   --> fixup_irqs(); // Too late to capture anything in IRR.
> > }
> > 
> > Just fliping the above call sequence seems to hit the IRR checks
> 
>flipping
> 
> > and the lost interrupt is fixed for both legacy MSI and when
> > interrupt remapping is enabled.
> > 
> > 
> > Fixes: 60dcaad5736f ("x86/hotplug: Silence APIC and NMI when CPU is dead")
> > Link: https://lore.kernel.org/lkml/875zdarr4h@nanos.tec.linutronix.de/
> > Signed-off-by: Ashok Raj 
> > 
> > To: linux-kernel@vger.kernel.org
> > To: Thomas Gleixner 
> > Cc: Sukumar Ghorai 
> > Cc: Srikanth Nandamuri 
> > Cc: Evan Green 
> > Cc: Mathias Nyman 
> > Cc: Bjorn Helgaas 
> > Cc: sta...@vger.kernel.org
> > ---
> >  arch/x86/kernel/smpboot.c | 11 +--
> >  1 file changed, 9 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index ffbd9a3d78d8..278cc9f92f2f 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -1603,13 +1603,20 @@ int native_cpu_disable(void)
> > if (ret)
> > return ret;
> >  
> > +   cpu_disable_common();
> > /*
> >  * Disable the local APIC. Otherwise IPI broadcasts will reach
> >  * it. It still responds normally to INIT, NMI, SMI, and SIPI
> > -* messages.
> > +* messages. Its important to do apic_soft_disable() after
> 
>It's
> 
> > +* fixup_irqs(), because fixup_irqs() called from cpu_disable_common()
> > +* depends on IRR being set. After apic_soft_disable() CPU preserves
> > +* currently set IRR/ISR but new interrupts will not set IRR.
> > +* This causes interrupts sent to outgoing cpu before completion
> 
>  CPU
> 
> > +* of irq migration to be lost. Check SDM Vol 3 "10.4.7.2 Local
> 
> IRQ
> 
> > +* APIC State after It Has been Software Disabled" section for more
> > +* details.
> >  */
> > apic_soft_disable();
> > -   cpu_disable_common();
> >  
> > return 0;
> >  }
> > 
> 
> thanks.
> -- 
> ~Randy
>

Re: [PATCH] x86/hotplug: Silence APIC only after all irq's are migrated

2020-08-15 Thread Raj, Ashok

Hi Randy

For some unknown reason my previous response said its taiting to be 
delivered. 

On Fri, Aug 14, 2020 at 04:25:32PM -0700, Randy Dunlap wrote:
> On 8/14/20 2:38 PM, Ashok Raj wrote:
> > When offlining CPU's, fixup_irqs() migrates all interrupts away from the
> 
>  CPUs,

I'll fix all these in the next rev. 

Just waiting to hear back from Thomas if he has additional ones I can fix
and resend v2.

Cheers,
Ashok
> 
> > outgoing CPU to an online CPU. Its always possible the device sent an
> 
>  It's
> 
> > interrupt to the previous CPU destination. Pending interrupt bit in IRR in
> > lapic identifies such interrupts. apic_soft_disable() will not capture any
> 
>   LAPIC
> 
> > new interrupts in IRR. This causes interrupts from device to be lost during
> > cpu offline. The issue was found when explicitly setting MSI affinity to a
> 
>   CPU
> 
> > CPU and immediately offlining it. It was simple to recreate with a USB
> > ethernet device and doing I/O to it while the CPU is offlined. Lost
> > interrupts happen even when Interrupt Remapping is enabled.
> > 
> > Current code does apic_soft_disable() before migrating interrupts.
> > 
> > native_cpu_disable()
> > {
> > ...
> > apic_soft_disable();
> > cpu_disable_common();
> >   --> fixup_irqs(); // Too late to capture anything in IRR.
> > }
> > 
> > Just fliping the above call sequence seems to hit the IRR checks
> 
>flipping
> 
> > and the lost interrupt is fixed for both legacy MSI and when
> > interrupt remapping is enabled.
> > 
> > 
> > Fixes: 60dcaad5736f ("x86/hotplug: Silence APIC and NMI when CPU is dead")
> > Link: https://lore.kernel.org/lkml/875zdarr4h@nanos.tec.linutronix.de/
> > Signed-off-by: Ashok Raj 
> > 
> > To: linux-kernel@vger.kernel.org
> > To: Thomas Gleixner 
> > Cc: Sukumar Ghorai 
> > Cc: Srikanth Nandamuri 
> > Cc: Evan Green 
> > Cc: Mathias Nyman 
> > Cc: Bjorn Helgaas 
> > Cc: sta...@vger.kernel.org
> > ---
> >  arch/x86/kernel/smpboot.c | 11 +--
> >  1 file changed, 9 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index ffbd9a3d78d8..278cc9f92f2f 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -1603,13 +1603,20 @@ int native_cpu_disable(void)
> > if (ret)
> > return ret;
> >  
> > +   cpu_disable_common();
> > /*
> >  * Disable the local APIC. Otherwise IPI broadcasts will reach
> >  * it. It still responds normally to INIT, NMI, SMI, and SIPI
> > -* messages.
> > +* messages. Its important to do apic_soft_disable() after
> 
>It's
> 
> > +* fixup_irqs(), because fixup_irqs() called from cpu_disable_common()
> > +* depends on IRR being set. After apic_soft_disable() CPU preserves
> > +* currently set IRR/ISR but new interrupts will not set IRR.
> > +* This causes interrupts sent to outgoing cpu before completion
> 
>  CPU
> 
> > +* of irq migration to be lost. Check SDM Vol 3 "10.4.7.2 Local
> 
> IRQ
> 
> > +* APIC State after It Has been Software Disabled" section for more
> > +* details.
> >  */
> > apic_soft_disable();
> > -   cpu_disable_common();
> >  
> > return 0;
> >  }
> > 
> 
> thanks.
> -- 
> ~Randy
>

Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-08-03 Thread Raj, Ashok

On Mon, Aug 03, 2020 at 08:12:18AM -0700, Andy Lutomirski wrote:
> On Mon, Aug 3, 2020 at 8:03 AM Dave Hansen  wrote:
> >
> > On 7/31/20 4:34 PM, Andy Lutomirski wrote:
> > >> Thomas suggested to provide a reason for the #GP caused by executing 
> > >> ENQCMD
> > >> without a valid PASID value programmed. #GP error codes are 16 bits and 
> > >> all
> > >> 16 bits are taken. Refer to SDM Vol 3, Chapter 16.13 for details. The 
> > >> other
> > >> choice was to reflect the error code in an MSR. ENQCMD can also cause #GP
> > >> when loading from the source operand, so its not fully comprehending all
> > >> the reasons. Rather than special case the ENQCMD, in future Intel may
> > >> choose a different fault mechanism for such cases if recovery is needed 
> > >> on
> > >> #GP.
> > > Decoding the user instruction is ugly and sets a bad architecture
> > > precedent, but we already do it in #GP for UMIP.  So I'm unconvinced.
> >
> > I'll try to do one more bit of convincing. :)
> >
> > In the end, we need a way to figure out if the #GP was from a known "OK"
> > source that we can fix up.  You're right that we could fire up the
> > instruction decoder to help answer that question.  But, it (also)
> > doesn't easily yield a perfect answer as to the source of the #GP, it
> > always involves a user copy, and it's a larger code impact than what
> > we've got.
> >
> > I think I went and looked at fixup_umip_exception(), and compared it to
> > the alternative which is essentially just these three lines of code:
> >
> > > + /*
> > > +  * If the current task already has a valid PASID in the MSR,
> > > +  * the #GP must be for some other reason.
> > > +  */
> > > + if (current->has_valid_pasid)
> > > + return false;
> > ...> +  /* Now the current task has a valid PASID in the MSR. */
> > > + current->has_valid_pasid = 1;
> >
> > and *I* was convinced that instruction decoding wasn't worth it.
> >
> > There's a lot of stuff that fixup_umip_exception() does which we don't
> > have to duplicate, but it's going to be really hard to get it anywhere
> > near as compact as what we've got.
> >
> 
> I could easily be convinced that the PASID fixup is so trivial and so
> obviously free of misfiring in a way that causes an infinite loop that
> this code is fine.  But I think we first need to answer the bigger
> question of why we're doing a lazy fixup in the first place.

We choose lazy fixup for 2 reasons. 

- If some threads were already created before the MSR is programmed, then
  we need to fixup those in a race free way. scheduling some task-work etc.
  We did do that early on, but decided it was ugly. 
- Not all threads need to submit ENQCMD, force feeding the MSR probably
  isn't even required for all. Yes the overhead isn't probably big, but
  might not even be required for all threads.

We needed to fixup MSR in two different way. To keep the code simple, the
choice was to only fixup on #GP, that eliminated the extra code we need to
support case1.

Cheers,
Ashok

Re: [PATCH v3 1/1] PCI/ATS: Check PRI supported on the PF device when SRIOV is enabled

2020-07-28 Thread Raj, Ashok

Hi Sasha

On Mon, Jul 27, 2020 at 09:24:35PM +, Sasha Levin wrote:
> Hi
> 
> [This is an automated email]
> 
> This commit has been processed because it contains a "Fixes:" tag
> fixing commit: b16d0cb9e2fc ("iommu/vt-d: Always enable PASID/PRI PCI 
> capabilities before ATS").
> 
> The bot has tested the following trees: v5.7.10, v5.4.53, v4.19.134, 
> v4.14.189, v4.9.231, v4.4.231.

Looks like the dependency is making this more involved with the backport. I

We could pursue a simpler fix for these older versions where there is a
conflict, but I'm not sure if that's recommended. 

In addition from our perspective 5.7 and above if there are other products
that require PASID/PRI on prior versions for SRIOV devices  we can drop the
backports. I see the same issue with other IOMMU's, for e.g. AMD as
well, I'm not sure if there are real regressions. 

> 
> v5.7.10: Build OK!
> v5.4.53: Failed to apply! Possible dependencies:
> 2b0ae7cc3bfc ("PCI/ATS: Handle sharing of PF PASID Capability with all 
> VFs")
> 751035b8dc06 ("PCI/ATS: Cache PASID Capability offset")
> 8cbb8a9374a2 ("PCI/ATS: Move pci_prg_resp_pasid_required() to 
> CONFIG_PCI_PRI")
> 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability with all VFs")
> c065190bbcd4 ("PCI/ATS: Cache PRI Capability offset")
> e5adf79a1d80 ("PCI/ATS: Cache PRI PRG Response PASID Required bit")
> 
> v4.19.134: Failed to apply! Possible dependencies:
> 2b0ae7cc3bfc ("PCI/ATS: Handle sharing of PF PASID Capability with all 
> VFs")
> 4f802170a861 ("PCI/DPC: Save and restore config state")
> 6e1ffbb7c2ab ("PCI: Move ATS declarations outside of CONFIG_PCI")
> 751035b8dc06 ("PCI/ATS: Cache PASID Capability offset")
> 8c938ddc6df3 ("PCI/ATS: Add pci_ats_page_aligned() interface")
> 8cbb8a9374a2 ("PCI/ATS: Move pci_prg_resp_pasid_required() to 
> CONFIG_PCI_PRI")
> 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability with all VFs")
> 9c2120090586 ("PCI: Provide pci_match_id() with CONFIG_PCI=n")
> b92b512a435d ("PCI: Make pci_ats_init() private")
> c065190bbcd4 ("PCI/ATS: Cache PRI Capability offset")
> e5567f5f6762 ("PCI/ATS: Add pci_prg_resp_pasid_required() interface.")
> e5adf79a1d80 ("PCI/ATS: Cache PRI PRG Response PASID Required bit")
> fff42928ade5 ("PCI/ATS: Add inline to pci_prg_resp_pasid_required()")
> 
> v4.14.189: Failed to apply! Possible dependencies:
> 1b79c5284439 ("PCI: cadence: Add host driver for Cadence PCIe controller")
> 1e4511604dfa ("PCI/AER: Expose internal API for obtaining AER 
> information")
> 3133e6dd07ed ("PCI: Tidy Makefiles")
> 37dddf14f1ae ("PCI: cadence: Add EndPoint Controller driver for Cadence 
> PCIe controller")
> 4696b828ca37 ("PCI/AER: Hoist aerdrv.c, aer_inject.c up to 
> drivers/pci/pcie/")
> 4f802170a861 ("PCI/DPC: Save and restore config state")
> 8c938ddc6df3 ("PCI/ATS: Add pci_ats_page_aligned() interface")
> 8cbb8a9374a2 ("PCI/ATS: Move pci_prg_resp_pasid_required() to 
> CONFIG_PCI_PRI")
> 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability with all VFs")
> 9de0eec29c07 ("PCI: Regroup all PCI related entries into 
> drivers/pci/Makefile")
> b92b512a435d ("PCI: Make pci_ats_init() private")
> c065190bbcd4 ("PCI/ATS: Cache PRI Capability offset")
> d3252ace0bc6 ("PCI: Restore resized BAR state on resume")
> e5567f5f6762 ("PCI/ATS: Add pci_prg_resp_pasid_required() interface.")
> e5adf79a1d80 ("PCI/ATS: Cache PRI PRG Response PASID Required bit")
> fff42928ade5 ("PCI/ATS: Add inline to pci_prg_resp_pasid_required()")
> 
> v4.9.231: Failed to apply! Possible dependencies:
> 4ebeb1ec56d4 ("PCI: Restore PRI and PASID state after Function-Level 
> Reset")
> 8c938ddc6df3 ("PCI/ATS: Add pci_ats_page_aligned() interface")
> 8cbb8a9374a2 ("PCI/ATS: Move pci_prg_resp_pasid_required() to 
> CONFIG_PCI_PRI")
> 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability with all VFs")
> a4f4fa681add ("PCI: Cache PRI and PASID bits in pci_dev")
> c065190bbcd4 ("PCI/ATS: Cache PRI Capability offset")
> e5567f5f6762 ("PCI/ATS: Add pci_prg_resp_pasid_required() interface.")
> e5adf79a1d80 ("PCI/ATS: Cache PRI PRG Response PASID Required bit")
> fff42928ade5 ("PCI/ATS: Add inline to pci_prg_resp_pasid_required()")
> 
> v4.4.231: Failed to apply! Possible dependencies:
> 2a2aca316aed ("PCI: Include  for isa_dma_bridge_buggy")
> 4d3f13845957 ("PCI: Add pci_unmap_iospace() to unmap I/O resources")
> 4ebeb1ec56d4 ("PCI: Restore PRI and PASID state after Function-Level 
> Reset")
> 8cbb8a9374a2 ("PCI/ATS: Move pci_prg_resp_pasid_required() to 
> CONFIG_PCI_PRI")
> 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability with all VFs")
> a4f4fa681add ("PCI: Cache PRI and PASID bits in pci_dev")
> c5076cfe7689 ("PCI, of: Move PCI I/O space management to PCI core code")
> e5567f5f6762 ("PCI/ATS: Add pci_prg_resp_pasid_

Re: [PATCH] PCI/ATS: PASID and PRI are only enumerated in PF devices.

2020-07-23 Thread Raj, Ashok

Hi Bjorn

On Tue, Jul 21, 2020 at 09:54:01AM -0500, Bjorn Helgaas wrote:
> On Mon, Jul 20, 2020 at 09:43:00AM -0700, Ashok Raj wrote:
> > PASID and PRI capabilities are only enumerated in PF devices. VF devices
> > do not enumerate these capabilites. IOMMU drivers also need to enumerate
> > them before enabling features in the IOMMU. Extending the same support as
> > PASID feature discovery (pci_pasid_features) for PRI.
> > 
> > Signed-off-by: Ashok Raj 
> 
> Hi Ashok,
> 
> When you update this for the 0-day implicit declaration thing, can you
> update the subject to say what the patch *does*, as opposed to what it
> is solving?  Also, no need for a period at the end.

Yes, will update and resend. Goofed up a couple things, i'll update those
as well.


> 
> Does this fix a regression?  Is it associated with a commit that we
> could add as a "Fixes:" tag so we know how far back to try to apply
> to stable kernels?

Yes, but the iommu files moved location and git fixes tags only generates
for a few handful of commits and doesn't show the old ones. 

Cheers,
Ashok

Re: [PATCH v4 4/4] PCI/ACS: Enable PCI_ACS_TB for untrusted/external-facing devices

2020-07-10 Thread Raj, Ashok

Hi Bjorn


On Fri, Jul 10, 2020 at 03:29:22PM -0500, Bjorn Helgaas wrote:
> On Tue, Jul 07, 2020 at 03:46:04PM -0700, Rajat Jain wrote:
> > When enabling ACS, enable translation blocking for external facing ports
> > and untrusted devices.
> > 
> > Signed-off-by: Rajat Jain 
> > ---
> > v4: Add braces to avoid warning from kernel robot
> > print warning for only external-facing devices.
> > v3: print warning if ACS_TB not supported on external-facing/untrusted 
> > ports.
> > Minor code comments fixes.
> > v2: Commit log change
> > 
> >  drivers/pci/pci.c|  8 
> >  drivers/pci/quirks.c | 15 +++
> >  2 files changed, 23 insertions(+)
> > 
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 73a8627822140..a5a6bea7af7ce 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -876,6 +876,14 @@ static void pci_std_enable_acs(struct pci_dev *dev)
> > /* Upstream Forwarding */
> > ctrl |= (cap & PCI_ACS_UF);
> >  
> > +   /* Enable Translation Blocking for external devices */
> > +   if (dev->external_facing || dev->untrusted) {
> > +   if (cap & PCI_ACS_TB)
> > +   ctrl |= PCI_ACS_TB;
> > +   else if (dev->external_facing)
> > +   pci_warn(dev, "ACS: No Translation Blocking on 
> > external-facing dev\n");
> > +   }
> 
> IIUC, this means that external devices can *never* use ATS and can
> never cache translations.  And (I guess, I'm not an expert) it can
> also never use the Page Request Services?

Yep, sounds like it.

> 
> Is this what we want?  Do we have any idea how many external devices
> this will affect or how much of a performance impact they will see?
> 
> Do we need some kind of override or mechanism to authenticate certain
> devices so they can use ATS and PRI?

Sounds like we would need some form of an allow-list to start with so we
can have something in the interim. 

I suppose a future platform might have a facilty to ensure ATS is secure and
authenticated we could enable for all of devices in the system, in addition
to PCI CMA/IDE. 

I think having a global override to enable all devices so platform can
switch to current behavior, or maybe via a cmdline switch.. as much as we
have a billion of those, it still gives an option in case someone needs it.



> 
> If we do decide this is the right thing to do, I think we need to
> expand the commit log a bit, because this is potentially a significant
> user-visible change.
> 
> > pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl);
> >  }
> >  
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index b341628e47527..bb22b46c1d719 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -4934,6 +4934,13 @@ static void pci_quirk_enable_intel_rp_mpc_acs(struct 
> > pci_dev *dev)
> > }
> >  }
> >  
> > +/*
> > + * Currently this quirk does the equivalent of
> > + * PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF
> > + *
> > + * TODO: This quirk also needs to do equivalent of PCI_ACS_TB,
> > + * if dev->external_facing || dev->untrusted
> > + */
> >  static int pci_quirk_enable_intel_pch_acs(struct pci_dev *dev)
> >  {
> > if (!pci_quirk_intel_pch_acs_match(dev))
> > @@ -4973,6 +4980,14 @@ static int pci_quirk_enable_intel_spt_pch_acs(struct 
> > pci_dev *dev)
> > ctrl |= (cap & PCI_ACS_CR);
> > ctrl |= (cap & PCI_ACS_UF);
> >  
> > +   /* Enable Translation Blocking for external devices */
> > +   if (dev->external_facing || dev->untrusted) {
> > +   if (cap & PCI_ACS_TB)
> > +   ctrl |= PCI_ACS_TB;
> > +   else if (dev->external_facing)
> > +   pci_warn(dev, "ACS: No Translation Blocking on 
> > external-facing dev\n");
> > +   }
> > +
> > pci_write_config_dword(dev, pos + INTEL_SPT_ACS_CTRL, ctrl);
> >  
> > pci_info(dev, "Intel SPT PCH root port ACS workaround enabled\n");
> > -- 
> > 2.27.0.212.ge8ba1cc988-goog
> >

Re: [PATCH 3/4] pci: acs: Enable PCI_ACS_TB for untrusted/external-facing devices

2020-06-19 Thread Raj, Ashok

Hi Rajat


On Mon, Jun 15, 2020 at 06:17:41PM -0700, Rajat Jain wrote:
> When enabling ACS, currently the bit "translation blocking" was
> not getting changed at all. Set it to disable translation blocking

Maybe you meant "enable translation blocking" ?

Keep the commit log simple:

When enabling ACS, enable translation blocking for external facing ports
and untrusted devices.

> too for all external facing or untrusted devices. This is OK
> because ATS is only allowed on internal devces.
> 
> Signed-off-by: Rajat Jain 
> ---
>  drivers/pci/pci.c|  4 
>  drivers/pci/quirks.c | 11 +++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index d2ff987585855..79853b52658a2 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3330,6 +3330,10 @@ static void pci_std_enable_acs(struct pci_dev *dev)
>   /* Upstream Forwarding */
>   ctrl |= (cap & PCI_ACS_UF);
>  
> + if (dev->external_facing || dev->untrusted)
> + /* Translation Blocking */
> + ctrl |= (cap & PCI_ACS_TB);
> +
>   pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl);
>  }
>  
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index b341628e47527..6294adeac4049 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -4934,6 +4934,13 @@ static void pci_quirk_enable_intel_rp_mpc_acs(struct 
> pci_dev *dev)
>   }
>  }
>  
> +/*
> + * Currently this quirk does the equivalent of
> + * PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_SV
> + *
> + * Currently missing, it also needs to do equivalent of PCI_ACS_TB,
> + * if dev->external_facing || dev->untrusted
> + */
>  static int pci_quirk_enable_intel_pch_acs(struct pci_dev *dev)
>  {
>   if (!pci_quirk_intel_pch_acs_match(dev))
> @@ -4973,6 +4980,10 @@ static int pci_quirk_enable_intel_spt_pch_acs(struct 
> pci_dev *dev)
>   ctrl |= (cap & PCI_ACS_CR);
>   ctrl |= (cap & PCI_ACS_UF);
>  
> + if (dev->external_facing || dev->untrusted)
> + /* Translation Blocking */
> + ctrl |= (cap & PCI_ACS_TB);
> +
>   pci_write_config_dword(dev, pos + INTEL_SPT_ACS_CTRL, ctrl);
>  
>   pci_info(dev, "Intel SPT PCH root port ACS workaround enabled\n");
> -- 
> 2.27.0.290.gba653c62da-goog
>

Re: [PATCH 4/4] pci: export untrusted attribute in sysfs

2020-06-18 Thread Raj, Ashok

Hi Greg,


On Thu, Jun 18, 2020 at 06:02:12PM +0200, Greg Kroah-Hartman wrote:
> On Thu, Jun 18, 2020 at 08:03:49AM -0700, Rajat Jain wrote:
> > Hello,
> > 
> > On Thu, Jun 18, 2020 at 2:14 AM Andy Shevchenko
> >  wrote:
> > >
> > > On Thu, Jun 18, 2020 at 11:36 AM Greg Kroah-Hartman
> > >  wrote:
> > > >
> > > > On Thu, Jun 18, 2020 at 11:12:56AM +0300, Andy Shevchenko wrote:
> > > > > On Wed, Jun 17, 2020 at 10:56 PM Rajat Jain  
> > > > > wrote:
> > > > > > On Wed, Jun 17, 2020 at 12:31 AM Christoph Hellwig 
> > > > > >  wrote:
> > > > >
> > > > > ...
> > > > >
> > > > > > (and likely call it "external" instead of "untrusted".
> > > > >
> > > > > Which is not okay. 'External' to what? 'untrusted' has been carefully
> > > > > chosen by the meaning of it.
> > > > > What external does mean for M.2. WWAN card in my laptop? It's in ACPI
> > > > > tables, but I can replace it.
> > > >
> > > > Then your ACPI tables should show this, there is an attribute for it,
> > > > right?
> > >
> > > There is a _PLD() method, but it's for the USB devices (or optional
> > > for others, I don't remember by heart). So, most of the ACPI tables,
> > > alas, don't show this.
> > >
> > > > > This is only one example. Or if firmware of some device is altered,
> > > > > and it's internal (whatever it means) is it trusted or not?
> > > >
> > > > That is what people are using policy for today, if you object to this,
> > > > please bring it up to those developers :)
> > >
> > > > > So, please leave it as is (I mean name).
> > > >
> > > > firmware today exports this attribute, why do you not want userspace to
> > > > also know it?
> > 
> > To clarify, the attribute exposed by the firmware today is
> > "ExternalFacingPort" and "external-facing" respectively:
> > 
> > 617654aae50e ("PCI / ACPI: Identify untrusted PCI devices")
> > 9cb30a71ac45d("PCI: OF: Support "external-facing" property")
> > 
> > The kernel flag was named "untrusted" though, hence the assumption
> > that "external=untrusted" is currently baked into the kernel today.
> > IMHO, using "external" would fix that (The assumption can thus be
> > contained in the IOMMU drivers) and at the same time allow more use of
> > this attribute.
> > 
> > > >
> > > > Trust is different, yes, don't get the two mixed up please.  That should
> > > > be a different sysfs attribute for obvious reasons.
> > >
> > > Yes, as a bottom line that's what I meant as well.
> > 
> > So what is the consensus here? I don't have a strong opinion - but it
> > seemed to me Greg is saying "external" and Andy is saying "untrusted"?
> 
> Those two things are totally separate things when it comes to a device.

Agree that these are two separate attributes, and better not mixed.

> 
> One (external) describes the location of the device in the system.
> 
> The other (untrusted) describes what you want the kernel to do with this
> device (trust or not trust it).
> 
> One you can change (from trust to untrusted or back), the other you can
> not, it is a fixed read-only property that describes the hardware device
> as defined by the firmware.

The genesis is due to lack of a mechanism to establish if the device 
is trusted or not was the due lack of some specs and implementation around
Component Measurement And Authentication (CMA). Treating external as
untrusted was the best first effort. i.e trust internal 
devices and don't trust external devices for enabling ATS.

But that said external is just describing topology, and if Linux wants to
use that in the policy that's different. Some day external device may also
use CMA to estabilish trust. FWIW even internal devices aren't trust
worthy, except maybe RCIEP's. 

> 
> Depending on the policy you wish to define, you can use the location of
> the device to determine if you want to trust the device or not.
> 

Cheers,
Ashok

Re: [PATCH v2 00/15] vfio: expose virtual Shared Virtual Addressing to VMs

2020-06-16 Thread Raj, Ashok

On Tue, Jun 16, 2020 at 04:49:28PM +0100, Stefan Hajnoczi wrote:
> On Tue, Jun 16, 2020 at 02:26:38AM +, Tian, Kevin wrote:
> > > From: Stefan Hajnoczi 
> > > Sent: Monday, June 15, 2020 6:02 PM
> > > 
> > > On Thu, Jun 11, 2020 at 05:15:19AM -0700, Liu Yi L wrote:
> > > > Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> > > > Intel platforms allows address space sharing between device DMA and
> > > > applications. SVA can reduce programming complexity and enhance
> > > security.
> > > >
> > > > This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
> > > > guest application address space with passthru devices. This is called
> > > > vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
> > > > changes. For IOMMU and QEMU changes, they are in separate series (listed
> > > > in the "Related series").
> > > >
> > > > The high-level architecture for SVA virtualization is as below, the key
> > > > design of vSVA support is to utilize the dual-stage IOMMU translation (
> > > > also known as IOMMU nesting translation) capability in host IOMMU.
> > > >
> > > >
> > > > .-.  .---.
> > > > |   vIOMMU|  | Guest process CR3, FL only|
> > > > | |  '---'
> > > > ./
> > > > | PASID Entry |--- PASID cache flush -
> > > > '-'   |
> > > > | |   V
> > > > | |CR3 in GPA
> > > > '-'
> > > > Guest
> > > > --| Shadow |--|
> > > >   vv  v
> > > > Host
> > > > .-.  .--.
> > > > |   pIOMMU|  | Bind FL for GVA-GPA  |
> > > > | |  '--'
> > > > ./  |
> > > > | PASID Entry | V (Nested xlate)
> > > > '\.--.
> > > > | |   |SL for GPA-HPA, default domain|
> > > > | |   '--'
> > > > '-'
> > > > Where:
> > > >  - FL = First level/stage one page tables
> > > >  - SL = Second level/stage two page tables
> > > 
> > > Hi,
> > > Looks like an interesting feature!
> > > 
> > > To check I understand this feature: can applications now pass virtual
> > > addresses to devices instead of translating to IOVAs?
> > > 
> > > If yes, can guest applications restrict the vSVA address space so the
> > > device only has access to certain regions?
> > > 
> > > On one hand replacing IOVA translation with virtual addresses simplifies
> > > the application programming model, but does it give up isolation if the
> > > device can now access all application memory?
> > > 
> > 
> > with SVA each application is allocated with a unique PASID to tag its
> > virtual address space. The device that claims SVA support must guarantee 
> > that one application can only program the device to access its own virtual
> > address space (i.e. all DMAs triggered by this application are tagged with
> > the application's PASID, and are translated by IOMMU's PASID-granular
> > page table). So, isolation is not sacrificed in SVA.
> 
> Isolation between applications is preserved but there is no isolation
> between the device and the application itself. The application needs to
> trust the device.

Right. With all convenience comes security trust. With SVA there is an
expectation that the device has the required security boundaries properly
implemented. FWIW, what is our guarantee today that VF's are secure from
one another or even its own PF? They can also generate transactions with
any of its peer id's and there is nothing an IOMMU can do today. Other than
rely on ACS. Even BusMaster enable can be ignored and devices (malicious
or otherwise) can generate after the BM=0. With SVM you get the benefits of

* Not having to register regions
* Don't need to pin application space for DMA.

> 
> Examples:
> 
> 1. The device can snoop secret data from readable pages in the
>application's virtual memory space.

Aren't there other security technologies that can address this?

> 
> 2. The device can gain arbitrary execution on the CPU by overwriting
>control flow addresses (e.g. function pointers, stack return
>addresses) in writable pages.

I suppose technology like CET might be able to guard. The general
expectation is code pages and anything that needs to be protected should be
mapped nor writable.

Cheers,
Ashok

Re: [PATCH v2 12/12] x86/traps: Fix up invalid PASID

2020-06-15 Thread Raj, Ashok

On Mon, Jun 15, 2020 at 06:03:57PM +0200, Peter Zijlstra wrote:
> 
> I don't get why you need a rdmsr here, or why not having one would
> require a TIF flag. Is that because this MSR is XSAVE/XRSTOR managed?
> 
> > > > +*/
> > > > +   rdmsrl(MSR_IA32_PASID, pasid_msr);
> > > > +   if (pasid_msr & MSR_IA32_PASID_VALID)
> > > > +   return false;
> > > > +
> > > > +   /* Fix up the MSR if the MSR doesn't have a valid PASID. */
> > > > +   wrmsrl(MSR_IA32_PASID, pasid | MSR_IA32_PASID_VALID);
> 
> How much more expensive is the wrmsr over the rdmsr? Can't we just
> unconditionally write the current PASID and call it a day?

The reason to check the rdmsr() is because we are using a hueristic taking
GP faults. If we already setup the MSR, but we get it a second time it
means the reason is something other than PASID_MSR not being set.

Ideally we should use the TIF_ to track this would be cheaper, but we were
told those bits aren't easy to give out. 

Cheers,
Ashok

Re: [PATCH v3] iommu/vt-d: Don't apply gfx quirks to untrusted devices

2020-06-02 Thread Raj, Ashok

On Tue, Jun 02, 2020 at 04:26:02PM -0700, Rajat Jain wrote:
> Currently, an external malicious PCI device can masquerade the VID:PID
> of faulty gfx devices, and thus apply iommu quirks to effectively
> disable the IOMMU restrictions for itself.
> 
> Thus we need to ensure that the device we are applying quirks to, is
> indeed an internal trusted device.
> 
> Signed-off-by: Rajat Jain 
> Acked-by: Lu Baolu 

With these changes

Reviewed-by: Ashok Raj 

> ---
> v3: - Separate out the warning mesage in a function to be called from
>   other places. Change the warning string as suggested.
> v2: - Change the warning print strings.
> - Add Lu Baolu's acknowledgement.
> 
>  drivers/iommu/intel-iommu.c | 37 +
>  1 file changed, 37 insertions(+)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index ef0a5246700e5..dc859f02985a0 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -6185,6 +6185,23 @@ intel_iommu_domain_set_attr(struct iommu_domain 
> *domain,
>   return ret;
>  }
>  
> +/*
> + * Check that the device does not live on an external facing PCI port that is
> + * marked as untrusted. Such devices should not be able to apply quirks and
> + * thus not be able to bypass the IOMMU restrictions.
> + */
> +static bool risky_device(struct pci_dev *pdev)
> +{
> + if (pdev->untrusted) {
> + pci_warn(pdev,
> +  "Skipping IOMMU quirk for dev (%04X:%04X) on untrusted"
> +  " PCI link. Please check with your BIOS/Platform"
> +  " vendor about this\n", pdev->vendor, pdev->device);
> + return true;
> + }
> + return false;
> +}
> +
>  const struct iommu_ops intel_iommu_ops = {
>   .capable= intel_iommu_capable,
>   .domain_alloc   = intel_iommu_domain_alloc,
> @@ -6214,6 +6231,9 @@ const struct iommu_ops intel_iommu_ops = {
>  
>  static void quirk_iommu_igfx(struct pci_dev *dev)
>  {
> + if (risky_device(dev))
> + return;
> +
>   pci_info(dev, "Disabling IOMMU for graphics on this chipset\n");
>   dmar_map_gfx = 0;
>  }
> @@ -6255,6 +6275,9 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x163D, 
> quirk_iommu_igfx);
>  
>  static void quirk_iommu_rwbf(struct pci_dev *dev)
>  {
> + if (risky_device(dev))
> + return;
> +
>   /*
>* Mobile 4 Series Chipset neglects to set RWBF capability,
>* but needs it. Same seems to hold for the desktop versions.
> @@ -6285,6 +6308,9 @@ static void quirk_calpella_no_shadow_gtt(struct pci_dev 
> *dev)
>  {
>   unsigned short ggc;
>  
> + if (risky_device(dev))
> + return;
> +
>   if (pci_read_config_word(dev, GGC, &ggc))
>   return;
>  
> @@ -6318,6 +6344,12 @@ static void __init check_tylersburg_isoch(void)
>   pdev = pci_get_device(PCI_VENDOR_ID_INTEL, 0x3a3e, NULL);
>   if (!pdev)
>   return;
> +
> + if (risky_device(pdev)) {
> + pci_dev_put(pdev);
> + return;
> + }
> +
>   pci_dev_put(pdev);
>  
>   /* System Management Registers. Might be hidden, in which case
> @@ -6327,6 +6359,11 @@ static void __init check_tylersburg_isoch(void)
>   if (!pdev)
>   return;
>  
> + if (risky_device(pdev)) {
> + pci_dev_put(pdev);
> + return;
> + }
> +
>   if (pci_read_config_dword(pdev, 0x188, &vtisochctrl)) {
>   pci_dev_put(pdev);
>   return;
> -- 
> 2.27.0.rc2.251.g90737beb825-goog
>

Re: [PATCH] iommu/vt-d: Don't apply gfx quirks to untrusted devices

2020-06-02 Thread Raj, Ashok

On Tue, Jun 02, 2020 at 06:43:00PM +, Rajat Jain wrote:
> Hi MIka,
> 
> Thanks for taking a look.
> 
> On Tue, Jun 2, 2020 at 2:50 AM Mika Westerberg
>  wrote:
> >
> > On Mon, Jun 01, 2020 at 10:45:17PM -0700, Rajat Jain wrote:
> > > Currently, an external malicious PCI device can masquerade the VID:PID
> > > of faulty gfx devices, and thus apply iommu quirks to effectively
> > > disable the IOMMU restrictions for itself.
> > >
> > > Thus we need to ensure that the device we are applying quirks to, is
> > > indeed an internal trusted device.
> > >
> > > Signed-off-by: Rajat Jain 
> > > ---
> > >  drivers/iommu/intel-iommu.c | 28 
> > >  1 file changed, 28 insertions(+)
> > >
> > > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > > index ef0a5246700e5..f2a480168a02f 100644
> > > --- a/drivers/iommu/intel-iommu.c
> > > +++ b/drivers/iommu/intel-iommu.c
> > > @@ -6214,6 +6214,11 @@ const struct iommu_ops intel_iommu_ops = {
> > >
> > >  static void quirk_iommu_igfx(struct pci_dev *dev)
> > >  {
> > > + if (dev->untrusted) {
> > > + pci_warn(dev, "skipping iommu quirk for untrusted gfx 
> > > dev\n");
> >
> > I think you should be consistent with other messages. For example iommu
> > should be spelled IOMMU as done below.
> >
> > Also this is visible to users so maybe put bit more information there:
> >
> >   pci_warn(dev, "Will not apply IOMMU quirk for untrusted graphics 
> > device\n");
> >
> > Ditto for all the other places. Also is "untrusted" good word here? If
> > an ordinary user sees this will it trigger some sort of panic reaction.
> > Perhaps we should call it "potentially untrusted" or something like
> > that?

Wish we called it external_facing rather than untrusted attribute, so its
description is consistent with the spec that defines it. Once we have
Platform Component Security Enhancements. 

to be correct, maybe call "Device located on an untrusted link" rather than
assert blame on the device.

Since the information is harvsted from BIOS tables and there are chances
this could be wrongly grouped such, add "Check with your BIOS/Platform
Vendor.



> 
> Fixed it, posted new patch at
> https://lkml.org/lkml/2020/6/2/822
> 
> Thanks,
> 
> Rajat
> 
> >
> > > + return;
> > > + }
> > > +
> > >   pci_info(dev, "Disabling IOMMU for graphics on this chipset\n");
> > >   dmar_map_gfx = 0;

Re: [PATCH v2] iommu/vt-d: Don't apply gfx quirks to untrusted devices

2020-06-02 Thread Raj, Ashok

Hi Rajat

On Tue, Jun 02, 2020 at 11:41:33AM -0700, Rajat Jain wrote:
> Currently, an external malicious PCI device can masquerade the VID:PID
> of faulty gfx devices, and thus apply iommu quirks to effectively
> disable the IOMMU restrictions for itself.
> 
> Thus we need to ensure that the device we are applying quirks to, is
> indeed an internal trusted device.
> 
> Signed-off-by: Rajat Jain 
> Acked-by: Lu Baolu 
> ---
> V2: - Change the warning print strings.
> - Add Lu Baolu's acknowledgement.
> 
>  drivers/iommu/intel-iommu.c | 38 +
>  1 file changed, 38 insertions(+)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index ef0a5246700e5..fdfbea4ff8cb3 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -6214,6 +6214,13 @@ const struct iommu_ops intel_iommu_ops = {
>  
>  static void quirk_iommu_igfx(struct pci_dev *dev)
>  {
> + if (dev->untrusted) {
> + pci_warn(dev,
> +  "Skipping IOMMU quirk %s() for potentially untrusted 
> device\n",
> +  __func__);
> + return;
> + }
> +

This check and code seems to be happening several times. Maybe add a simple
function to do the test and use in all places?

Cheers,
Ashok

Re: [PATCH] PCI: Relax ACS requirement for Intel RCiEP devices.

2020-06-01 Thread Raj, Ashok

On Mon, Jun 01, 2020 at 04:25:19PM -0500, Bjorn Helgaas wrote:
> On Thu, May 28, 2020 at 01:57:42PM -0700, Ashok Raj wrote:
> > All Intel platforms guarantee that all root complex implementations
> > must send transactions up to IOMMU for address translations. Hence for
> > RCiEP devices that are Vendor ID Intel, can claim exception for lack of
> > ACS support.
> > 
> > 
> > 3.16 Root-Complex Peer to Peer Considerations
> > When DMA remapping is enabled, peer-to-peer requests through the
> > Root-Complex must be handled
> > as follows:
> > • The input address in the request is translated (through first-level,
> >   second-level or nested translation) to a host physical address (HPA).
> >   The address decoding for peer addresses must be done only on the
> >   translated HPA. Hardware implementations are free to further limit
> >   peer-to-peer accesses to specific host physical address regions
> >   (or to completely disallow peer-forwarding of translated requests).
> > • Since address translation changes the contents (address field) of
> >   the PCI Express Transaction Layer Packet (TLP), for PCI Express
> >   peer-to-peer requests with ECRC, the Root-Complex hardware must use
> >   the new ECRC (re-computed with the translated address) if it
> >   decides to forward the TLP as a peer request.
> > • Root-ports, and multi-function root-complex integrated endpoints, may
> >   support additional peerto-peer control features by supporting PCI Express
> >   Access Control Services (ACS) capability. Refer to ACS capability in
> >   PCI Express specifications for details.
> > 
> > Since Linux didn't give special treatment to allow this exception, certain
> > RCiEP MFD devices are getting grouped in a single iommu group. This
> > doesn't permit a single device to be assigned to a guest for instance.
> > 
> > In one vendor system: Device 14.x were grouped in a single IOMMU group.
> > 
> > /sys/kernel/iommu_groups/5/devices/:00:14.0
> > /sys/kernel/iommu_groups/5/devices/:00:14.2
> > /sys/kernel/iommu_groups/5/devices/:00:14.3
> > 
> > After the patch:
> > /sys/kernel/iommu_groups/5/devices/:00:14.0
> > /sys/kernel/iommu_groups/5/devices/:00:14.2
> > /sys/kernel/iommu_groups/6/devices/:00:14.3 <<< new group
> > 
> > 14.0 and 14.2 are integrated devices, but legacy end points.
> > Whereas 14.3 was a PCIe compliant RCiEP.
> > 
> > 00:14.3 Network controller: Intel Corporation Device 9df0 (rev 30)
> > Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > 
> > This permits assigning this device to a guest VM.
> > 
> > Fixes: f096c061f552 ("iommu: Rework iommu_group_get_for_pci_dev()")
> > Signed-off-by: Ashok Raj 
> > To: Joerg Roedel 
> > To: Bjorn Helgaas 
> > Cc: linux-kernel@vger.kernel.org
> > Cc: io...@lists.linux-foundation.org
> > Cc: Lu Baolu 
> > Cc: Alex Williamson 
> > Cc: Darrel Goeddel 
> > Cc: Mark Scott ,
> > Cc: Romil Sharma 
> > Cc: Ashok Raj 
> 
> Tentatively applied to pci/virtualization for v5.8, thanks!
> 
> The spec says this handling must apply "when DMA remapping is
> enabled".  The patch does not check whether DMA remapping is enabled.
> 
> Is there any case where DMA remapping is *not* enabled, and we rely on
> this patch to tell us whether the device is isolated?  It sounds like
> it may give the wrong answer in such a case?
> 
> Can you confirm that I don't need to worry about this?  

I think all of this makes sense only when DMA remapping is enabled.
Otherwise there is no enforcement for isolation. 

Cheers,
Ashok

Re: [PATCH] PCI: Relax ACS requirement for Intel RCiEP devices.

2020-05-28 Thread Raj, Ashok

On Thu, May 28, 2020 at 03:38:26PM -0600, Alex Williamson wrote:
> On Thu, 28 May 2020 13:57:42 -0700
> Ashok Raj  wrote:
> 
> > All Intel platforms guarantee that all root complex implementations
> > must send transactions up to IOMMU for address translations. Hence for
> > RCiEP devices that are Vendor ID Intel, can claim exception for lack of
> > ACS support.
> > 
> > 
> > 3.16 Root-Complex Peer to Peer Considerations
> > When DMA remapping is enabled, peer-to-peer requests through the
> > Root-Complex must be handled
> > as follows:
> > • The input address in the request is translated (through first-level,
> >   second-level or nested translation) to a host physical address (HPA).
> >   The address decoding for peer addresses must be done only on the
> >   translated HPA. Hardware implementations are free to further limit
> >   peer-to-peer accesses to specific host physical address regions
> >   (or to completely disallow peer-forwarding of translated requests).
> > • Since address translation changes the contents (address field) of
> >   the PCI Express Transaction Layer Packet (TLP), for PCI Express
> >   peer-to-peer requests with ECRC, the Root-Complex hardware must use
> >   the new ECRC (re-computed with the translated address) if it
> >   decides to forward the TLP as a peer request.
> > • Root-ports, and multi-function root-complex integrated endpoints, may
> >   support additional peerto-peer control features by supporting PCI Express
> >   Access Control Services (ACS) capability. Refer to ACS capability in
> >   PCI Express specifications for details.
> > 
> > Since Linux didn't give special treatment to allow this exception, certain
> > RCiEP MFD devices are getting grouped in a single iommu group. This
> > doesn't permit a single device to be assigned to a guest for instance.
> > 
> > In one vendor system: Device 14.x were grouped in a single IOMMU group.
> > 
> > /sys/kernel/iommu_groups/5/devices/:00:14.0
> > /sys/kernel/iommu_groups/5/devices/:00:14.2
> > /sys/kernel/iommu_groups/5/devices/:00:14.3
> > 
> > After the patch:
> > /sys/kernel/iommu_groups/5/devices/:00:14.0
> > /sys/kernel/iommu_groups/5/devices/:00:14.2
> > /sys/kernel/iommu_groups/6/devices/:00:14.3 <<< new group
> > 
> > 14.0 and 14.2 are integrated devices, but legacy end points.
> > Whereas 14.3 was a PCIe compliant RCiEP.
> > 
> > 00:14.3 Network controller: Intel Corporation Device 9df0 (rev 30)
> > Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > 
> > This permits assigning this device to a guest VM.
> > 
> > Fixes: f096c061f552 ("iommu: Rework iommu_group_get_for_pci_dev()")
> 
> I don't really understand this Fixes tag.  This seems like a feature,
> not a fix.  If you want it in stable releases as a feature, request it
> via Cc: sta...@vger.kernel.org.  I'd drop that tag, that's my nit.
> Otherwise:

Yes, i should have Cced Stable instead. 

Bjorn: Can you massage this in? or i can resend with Alex's Reviewed-by +
adding stable in cc list.

> 
> Reviewed-by: Alex Williamson 
> 
> > Signed-off-by: Ashok Raj 
> > To: Joerg Roedel 
> > To: Bjorn Helgaas 
> > Cc: linux-kernel@vger.kernel.org
> > Cc: io...@lists.linux-foundation.org
> > Cc: Lu Baolu 
> > Cc: Alex Williamson 
> > Cc: Darrel Goeddel 
> > Cc: Mark Scott ,
> > Cc: Romil Sharma 
> > Cc: Ashok Raj 
> > ---
> > v2: Moved functionality from iommu to pci quirks - Alex Williamson
> > 
> >  drivers/pci/quirks.c | 15 +++
> >  1 file changed, 15 insertions(+)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 28c9a2409c50..63373ca0a3fe 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -4682,6 +4682,20 @@ static int pci_quirk_mf_endpoint_acs(struct pci_dev 
> > *dev, u16 acs_flags)
> > PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT);
> >  }
> >  
> > +static int pci_quirk_rciep_acs(struct pci_dev *dev, u16 acs_flags)
> > +{
> > +   /*
> > +* RCiEP's are required to allow p2p only on translated addresses.
> > +* Refer to Intel VT-d specification Section 3.16 Root-Complex Peer
> > +* to Peer Considerations
> > +*/
> > +   if (pci_pcie_type(dev) != PCI_EXP_TYPE_RC_END)
> > +   return -ENOTTY;
> > +
> > +   return pci_acs_ctrl_enabled(acs_flags,
> > +   PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
> > +}
> > +
> >  static int pci_quirk_brcm_acs(struct pci_dev *dev, u16 acs_flags)
> >  {
> > /*
> > @@ -4764,6 +4778,7 @@ static const struct pci_dev_acs_enabled {
> > /* I219 */
> > { PCI_VENDOR_ID_INTEL, 0x15b7, pci_quirk_mf_endpoint_acs },
> > { PCI_VENDOR_ID_INTEL, 0x15b8, pci_quirk_mf_endpoint_acs },
> > +   { PCI_VENDOR_ID_INTEL, PCI_ANY_ID, pci_quirk_rciep_acs },
> > /* QCOM QDF2xxx root ports */
> > { PCI_VENDOR_ID_QCOM, 0x0400, pci_quirk_qcom_rp_acs },
> > { PCI_VENDOR_ID_QCOM, 0x0401, pci_quirk_qcom_rp_acs },
>

Re: [PATCH] iommu: Relax ACS requirement for Intel RCiEP devices.

2020-05-28 Thread Raj, Ashok

Hi Alex

On Tue, May 26, 2020 at 05:07:15PM -0600, Alex Williamson wrote:
> > ---
> >  drivers/iommu/iommu.c | 13 -
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 2b471419e26c..31b595dfedde 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -1187,7 +1187,18 @@ static struct iommu_group 
> > *get_pci_function_alias_group(struct pci_dev *pdev,
> > struct pci_dev *tmp = NULL;
> > struct iommu_group *group;
> >  
> > -   if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
> > +   /*
> > +* Intel VT-d Specification Section 3.16, Root-Complex Peer to Peer
> > +* Considerations manadate that all transactions in RCiEP's and
> > +* even Integrated MFD's *must* be sent up to the IOMMU. P2P is
> > +* only possible on translated addresses. This gives enough
> > +* guarantee that such devices can be forgiven for lack of ACS
> > +* support.
> > +*/
> > +   if (!pdev->multifunction ||
> > +   (pdev->vendor == PCI_VENDOR_ID_INTEL &&
> > +pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END) ||
> > +pci_acs_enabled(pdev, REQ_ACS_FLAGS))
> > return NULL;
> >  
> > for_each_pci_dev(tmp) {
> 
> Hi Ashok,
> 
> As this is an Intel/VT-d standard, not a PCIe standard, why not
> implement this in pci_dev_specific_acs_enabled() with all the other
> quirks?  Thanks,

Yes, that sounds like the right place.. I have a new patch, once its tested
i'll resend it. Thanks for pointing it out.

Cheers,
Ashok

Re: [PATCH] iommu: Relax ACS requirement for RCiEP devices.

2020-05-26 Thread Raj, Ashok

On Tue, May 26, 2020 at 12:26:54PM -0600, Alex Williamson wrote:
> > > 
> > > I don't think the language in the spec is anything sufficient to handle
> > > RCiEP uniquely.  We've previously rejected kernel command line opt-outs
> > > for ACS, and the extent to which those patches still float around the
> > > user community and are blindly used to separate IOMMU groups are a
> > > testament to the failure of this approach.  Users do not have a basis
> > > for enabling this sort of opt-out.  The benefit is obvious in the IOMMU
> > > grouping, but the risk is entirely unknown.  A kconfig option is even
> > > worse as that means if you consume a downstream kernel, the downstream
> > > maintainers might have decided universally that isolation is less
> > > important than functionality.  
> > 
> > We discussed this internally, and Intel vt-d spec does spell out clearly 
> > in Section 3.16 Root-Complex Peer to Peer Considerations. The spec clearly
> > calls out that all p2p must be done on translated addresses and therefore
> > must go through the IOMMU.
> > 
> > I suppose they should also have some similar platform gauranteed behavior
> > for RCiEP's or MFD's *Must* behave as follows. The language is strict and
> > when IOMMU is enabled in the platform, everything is sent up north to the
> > IOMMU agent.
> > 
> > 3.16 Root-Complex Peer to Peer Considerations
> > When DMA remapping is enabled, peer-to-peer requests through the
> > Root-Complex must be handled
> > as follows:
> > • The input address in the request is translated (through first-level,
> >   second-level or nested translation) to a host physical address (HPA).
> >   The address decoding for peer addresses must be done only on the 
> >   translated HPA. Hardware implementations are free to further limit 
> >   peer-to-peer accesses to specific host physical address regions 
> >   (or to completely disallow peer-forwarding of translated requests).
> > • Since address translation changes the contents (address field) of the PCI
> >   Express Transaction Layer Packet (TLP), for PCI Express peer-to-peer 
> >   requests with ECRC, the Root-Complex hardware must use the new ECRC 
> >   (re-computed with the translated address) if it decides to forward 
> >   the TLP as a peer request.
> > • Root-ports, and multi-function root-complex integrated endpoints, may
> >   support additional peerto-peer control features by supporting PCI Express
> >   Access Control Services (ACS) capability. Refer to ACS capability in 
> >   PCI Express specifications for details.
> 
> That sounds like it might be a reasonable basis for quirking all RCiEPs
> on VT-d platforms if Intel is willing to stand behind it.  Thanks,
> 

Sounds good.. that's what i hear from our platform teams. If there is a
violation it would be a bug in silicon.

Re: [PATCH] iommu: Relax ACS requirement for RCiEP devices.

2020-05-26 Thread Raj, Ashok

Hi Alex,

I was able to find better language in the IOMMU spec that gaurantees 
the behavior we need. See below.


On Tue, May 05, 2020 at 09:34:14AM -0600, Alex Williamson wrote:
> On Tue, 5 May 2020 07:56:06 -0700
> "Raj, Ashok"  wrote:
> 
> > On Tue, May 05, 2020 at 08:05:14AM -0600, Alex Williamson wrote:
> > > On Mon, 4 May 2020 23:11:07 -0700
> > > "Raj, Ashok"  wrote:
> > >   
> > > > Hi Alex
> > > > 
> > > > + Joerg, accidently missed in the Cc.
> > > > 
> > > > On Mon, May 04, 2020 at 11:19:36PM -0600, Alex Williamson wrote:  
> > > > > On Mon,  4 May 2020 21:42:16 -0700
> > > > > Ashok Raj  wrote:
> > > > > 
> > > > > > PCIe Spec recommends we can relax ACS requirement for RCIEP devices.
> > > > > > 
> > > > > > PCIe 5.0 Specification.
> > > > > > 6.12 Access Control Services (ACS)
> > > > > > Implementation of ACS in RCiEPs is permitted but not required. It is
> > > > > > explicitly permitted that, within a single Root Complex, some RCiEPs
> > > > > > implement ACS and some do not. It is strongly recommended that Root 
> > > > > > Complex
> > > > > > implementations ensure that all accesses originating from RCiEPs
> > > > > > (PFs and VFs) without ACS capability are first subjected to 
> > > > > > processing by
> > > > > > the Translation Agent (TA) in the Root Complex before further 
> > > > > > decoding and
> > > > > > processing. The details of such Root Complex handling are outside 
> > > > > > the scope
> > > > > > of this specification.
> > > > > >   
> > > > > 
> > > > > Is the language here really strong enough to make this change?  ACS is
> > > > > an optional feature, so being permitted but not required is rather
> > > > > meaningless.  The spec is also specifically avoiding the words "must"
> > > > > or "shall" and even when emphasized with "strongly", we still only 
> > > > > have
> > > > > a recommendation that may or may not be honored.  This seems like a
> > > > > weak basis for assuming that RCiEPs universally honor this
> > > > > recommendation.  Thanks,
> > > > > 
> > > > 
> > > > We are speaking about PCIe spec, where people write it about 5 years 
> > > > ahead
> > > > and every vendor tries to massage their product behavior with vague
> > > > words like this..  :)
> > > > 
> > > > But honestly for any any RCiEP, or even integrated endpoints, there 
> > > > is no way to send them except up north. These aren't behind a RP.  
> > > 
> > > But they are multi-function devices and the spec doesn't define routing
> > > within multifunction packages.  A single function RCiEP will already be
> > > assumed isolated within its own group.  
> > 
> > That's right. The other two devices only have legacy PCI headers. So 
> > they can't claim to be RCiEP's but just integrated endpoints. The legacy
> > devices don't even have a PCIe header.
> > 
> > I honestly don't know why these are groped as MFD's in the first place.
> > 
> > >
> > > > I did check with couple folks who are part of the SIG, and seem to agree
> > > > that ACS treatment for RCiEP's doesn't mean much. 
> > > > 
> > > > I understand the language isn't strong, but it doesn't seem like ACS 
> > > > should
> > > > be a strong requirement for RCiEP's and reasonable to relax.
> > > > 
> > > > What are your thoughts?   
> > > 
> > > I think hardware vendors have ACS at their disposal to clarify when
> > > isolation is provided, otherwise vendors can submit quirks, but I don't
> > > see that the "strongly recommended" phrasing is sufficient to assume
> > > isolation between multifunction RCiEPs.  Thanks,  
> > 
> > You point is that integrated MFD endpoints, without ACS, there is no 
> > gaurantee to SW that they are isolated.
> > 
> > As far as a quirk, do you think:
> > - a cmdline optput for integrated endpoints, and RCiEP's suffice?
> >   along with a compile time default that is strict enforcement
> > - typical

Re: (a design open) RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host

2020-05-15 Thread Raj, Ashok

On Fri, May 15, 2020 at 12:39:14AM -0700, Tian, Kevin wrote:
> Hi, Alex,
> 
> When working on an updated version Yi and I found an design open
> which needs your guidance.
> 
> In concept nested translation can be incarnated as one GPA->HPA page
> table and multiple GVA->GPA page tables per VM. It means one container
> is sufficient to include all SVA-capable devices assigned to the same guest,
> as there is just one iova space (GPA->HPA) to be managed by vfio iommu
> map/unmap api. GVA->GPA page tables are bound through the new api
> introduced in this patch. It is different from legacy shadow translation
> which merges GIOVA->GPA->HPA into GIOVA->HPA thus each device requires
> its own iova space and must be in a separate container.
> 
> However supporting multiple SVA-capable devices in one container
> imposes one challenge. Now the bind_guest_pgtbl is implemented as
> iommu type1 api. From the definition of iommu type 1 any operation
> should be applied to all devices within the same container, just like
> dma map/unmap. However this philosophy is incorrect regarding to
> page table binding. We must follow the guest binding requirements
> between its page tables and assigned devices, otherwise every bound
> address space is suddenly accessible by all assigned devices thus creating
> security holes.

The above 2 paragraphs are bit confusing :-) but what really matters
is the below: 
> 
> Do you think whether it's possible to extend iommu api to accept
> device specific cmd? If not, moving it to vfio-pci or vfio-mdev sounds
> also problematic, as PASID and page tables are IOMMU things which
> are not touched by vfio device drivers today.

All you are referring to is when admin groups multiple devices in a
single container, you are saying you can't give isolation to them
for SVA purposes. This is logically equivalent to a single group with
multiple devices.

- Such as devices behind p2p bridge
- MFD's or devices behind switches or hieararchies without ACS
  support for instance.

By limitation you mean, disable PASID on those devices in a single 
container?

what about ATS? 

> 
> Alternatively can we impose the limitation that nesting APIs can be
> only enabled for singleton containers which contains only one device?
> This basically falls back to the state of legacy shadow translation
> vIOMMU. and our current Qemu vIOMMU implementation actually
> does this way regardless of whether nesting is used. Do you think
> whether such tradeoff is acceptable as a starting point?
> 
> Thanks
> Kevin
> 
> > -Original Message-
> > From: Liu, Yi L 
> > Sent: Sunday, March 22, 2020 8:32 PM
> > To: alex.william...@redhat.com; eric.au...@redhat.com
> > Cc: Tian, Kevin ; jacob.jun@linux.intel.com;
> > j...@8bytes.org; Raj, Ashok ; Liu, Yi L
> > ; Tian, Jun J ; Sun, Yi Y
> > ; jean-phili...@linaro.org; pet...@redhat.com;
> > io...@lists.linux-foundation.org; k...@vger.kernel.org; linux-
> > ker...@vger.kernel.org; Wu, Hao 
> > Subject: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> >
> > From: Liu Yi L 
> >
> > VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by
> > hardware
> > IOMMUs that have nesting DMA translation (a.k.a dual stage address
> > translation). For such hardware IOMMUs, there are two stages/levels of
> > address translation, and software may let userspace/VM to own the first-
> > level/stage-1 translation structures. Example of such usage is vSVA (
> > virtual Shared Virtual Addressing). VM owns the first-level/stage-1
> > translation structures and bind the structures to host, then hardware
> > IOMMU would utilize nesting translation when doing DMA translation fo
> > the devices behind such hardware IOMMU.
> >
> > This patch adds vfio support for binding guest translation (a.k.a stage 1)
> > structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU, not only
> > bind
> > guest page table is needed, it also requires to expose interface to guest
> > for iommu cache invalidation when guest modified the first-level/stage-1
> > translation structures since hardware needs to be notified to flush stale
> > iotlbs. This would be introduced in next patch.
> >
> > In this patch, guest page table bind and unbind are done by using flags
> > VFIO_IOMMU_BIND_GUEST_PGTBL and
> > VFIO_IOMMU_UNBIND_GUEST_PGTBL under IOCTL
> > VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> > struct iommu_gpasid_bind_data. Before binding guest page table to host,
> > VM should have got a PASID allocated by host via
> > VFIO_IOMMU_PASID_REQUEST.
> >
> > Bind gue

Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

2020-05-11 Thread Raj, Ashok

Hi Thomas, 

On Fri, May 08, 2020 at 06:49:15PM +0200, Thomas Gleixner wrote:
> Ashok,
> 
> "Raj, Ashok"  writes:
> > With legacy MSI we can have these races and kernel is trying to do the
> > song and dance, but we see this happening even when IR is turned on.
> > Which is perplexing. I think when we have IR, once we do the change vector 
> > and flush the interrupt entry cache, if there was an outstandng one in 
> > flight it should be in IRR. Possibly should be clearned up by the
> > send_cleanup_vector() i suppose.
> 
> Ouch. With IR this really should never happen and yes the old vector
> will catch one which was raised just before the migration disabled the
> IR entry. During the change nothing can go wrong because the entry is
> disabled and only reenabled after it's flushed which will send a pending
> one to the new vector.

with IR, I'm not sure if we actually mask the interrupt except when
its a Posted Interrupt. 

We do an atomic update to IRTE, with cmpxchg_double

ret = cmpxchg_double(&irte->low, &irte->high,
 irte->low, irte->high,
 irte_modified->low, irte_modified->high);

followed by flushing the interrupt entry cache. After which any 
old ones in flight before the flush should be sittig in IRR
on the outgoing cpu.

The send_cleanup_vector() sends IPI to the apic_id->old_cpu which 
would be the cpu we are running on correct? and this is a self_ipi
to IRQ_MOVE_CLEANUP_VECTOR.

smp_irq_move_cleanup_interrupt() seems to check IRR with 
apicid_prev_vector()

irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
if (irr & (1U << (vector % 32))) {
apic->send_IPI_self(IRQ_MOVE_CLEANUP_VECTOR);
continue;
}

And this would allow any pending IRR bits in the outgoing CPU to 
call the relevant ISR's before draining all vectors on the outgoing
CPU. 

Does it sound right?

I couldn't quite pin down how the device ISR's are hooked up through
this send_cleanup_vector() and what follows.

Re: [PATCH RFC] Microcode late loading feature identification

2020-05-11 Thread Raj, Ashok

Hi Mihai

Thanks for an attempt to find a fix. To solve this for real there
are several other factors to consider and I'm afraid its not as simple
as you have articulated here. There are lots of practical limitations that
prevent us from solving this completely. But we haven't given up :-)

In order to be successful, this needs to be factored in by the vendor either
as part of the development process or somehow generated automatically. Both
have pitfalls, and as you read below some of it might become clear.

On Mon, May 11, 2020 at 05:11:23PM +0300, Mihai Carabas wrote:
> La 27.04.2020 10:27, Mihai Carabas a scris:
> >This RFC patch set aims to provide a way to identify the modifications
> >brought in by the new microcode updated at runtime (aka microcode late
> >loading). This was debated last year and this patch set implements
> >point #1 from Thomas Gleixner's idea:
> >https://lore.kernel.org/lkml/alpine.deb.2.21.1909062237580.1...@nanos.tec.linutronix.de/
> >
> 
> +Ashok and Thomas to get a feedback from vendor side on file
> format/integration in the microcode blob and signature.

To understand the complications of microcode there are a few things to consider.

We have been working on this internally, and here is why its difficult to 
simplify it as +msr/-msr, +cpuid etc. Yes these are things that possibly
controlled by microcode, but microcode has several parts, not just CPU 
microcode.

The revision you see is just one big running number. There are other parts of 
the 
microcode that you don't see its internal version. In addition some parts of the
microcode are effective only when deployed by FIT. That cpu picks up right after
reset. During late load, even though you see the version updated, it doesn't
mean all the internal versions of ucode are latched and effective.

In addition, there are differences how some mitigations are deployed, as you 
know
some have MSR's that OS can find, there are others that need BIOS/early boot to 
make the mitigations effective. In order to get a full picture of weather a 
microcode 
file is late-loadable you need to know a lot more about how we got here to this
version loaded on the CPU.
> 
> Thank you,
> Mihai
> 
> >This patch set has the following patches:
> >
> >- patch 1 is introducing a new metadata file that comes with the microcode
> >(provided by the CPU manufacture) that describes what modifications are
> >done by loading the new microcode
> >
> >- patch 2 parses the metadata file and is verifying it against kernel
> >policy. In this patch, as an RFC, as a kernel policy, it was imposed
> >the rule of not allowing to remove any feature. If so, it won't be
> >loaded a new microcode. The policy can be further extended and describe
> >in different ways

Haven't read the individual patches yet. but you would need every interim
patch metadata to be always available. 

Since if you move from patch x->y you can find that an msr was removed.
But say you go from patch x->z, but there was no msr removed in patch y-z.
You need to process and collate all the msr/cpuid to comprehend what changing
between x->z since that also removes the msr for e.g.

To add more complications CPU has fuses and each patch content can be effective
on some but not all SKU's. So the exact same microcode can have different 
behavior
depending on which SKU its loaded into. So generating the common meta-data file
is also not a trivial process.

Cheers,
Ashok

Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.

2020-05-08 Thread Raj, Ashok

Hi Jason

On Fri, May 08, 2020 at 08:16:10PM -0300, Jason Gunthorpe wrote:
> On Fri, May 08, 2020 at 01:47:10PM -0700, Raj, Ashok wrote:
> 
> > Even when uaccel was under development, one of the options
> > was to use VFIO as the transport, goal was the same i.e to keep
> > the user space have one interface. 
> 
> I feel a bit out of the loop here, uaccel isn't in today's kernel is
> it? I've heard about it for a while, it sounds very similar to RDMA,
> so I hope they took some of my advice...

I think since 5.7 maybe? drivers/misc/uacce. I don't think this is like
RDMA, its just a plain accelerator. There is no connection management,
memory registration or other things.. IB was my first job at Intel,
but saying that i would be giving my age away :)

> 
> > But the needs of generic user space application is significantly
> > different from exporting a more functional device model to guest,
> > which isn't full emulated device. which is why VFIO didn't make
> > sense for native use.
> 
> I'm not sure this is true. We've done these kinds of emulated SIOV
> like things already and there is a huge overlap between what a generic
> user application needs and what the VMM neds. Actually almost a
> perfect subset except for interrupt remapping (which is quite
> trivial).

>From a simple user application POV, if we need to do simple compression
or such with a shared WQ, all the application needs do do is
bind_mm() that somehow associates the process address space with the 
IOMMU to create that association and communication channel.

For supporting this with guest user, we need to support the same actions
from a guest OS. i.e a guest OS bind should be serviced and end up with the 
IOMMU plumbing it with the guest cr3, and making sure the guest 2nd level 
is plumed right for the nested walk. 

Now we can certainly go bolt all these things again. When VFIO has already 
done the pluming in a generic way.

> 
> The things vfio focuses on, like groups and managing a real config
> space just don't apply here.
> 
> > And when we move things from VFIO which is already established
> > as a general device model and accepted by multiple VMM's it gives
> > instant footing without a whole redesign. 
> 
> Yes, I understand, but I think you need to get more people to support
> this idea. From my standpoint this is taking secure lean VMMs and

When we decided on VFIO, it was after using the best practices then,
after discussion with Kirti Wankhede and Alex. Kevin had used it for
graphics virtualization. It was even presented at KVM forum and such
dating back to 2017. No one has raised alarms until now :-)

> putting emulation code back into them, except in a more dangerous
> kernel location. This does not seem like a net win to me.

Its not a whole lot of emulation right? mdev are soft partitioned. There is
just a single PF, but we can create a separate partition for the guest using
PASID along with the normal BDF (RID). And exposing a consistent PCI like
interface to user space you get everything else for free.

Yes, its not SRIOV, but giving that interface to user space via VFIO, we get 
all of that functionality without having to reinvent a different way to do it.

vDPA went the other way, IRC, they went and put a HW implementation of what
virtio is in hardware. So they sort of fit the model. Here the instance
looks and feels like real hardware for the setup and control aspect.

> 
> You'd be much better to have some userspace library scheme instead of
> being completely tied to a kernel interface for modularity.

Couldn't agree more :-).. all I'm asking is if we can do a phased approach to 
get to that goodness! If we need to move things to user space for emulation
that's a great goal, but it can be evolutionary.

> 
> > When we move things from VFIO to uaccel to bolt on the functionality
> > like VFIO, I suspect we would be moving code/functionality from VFIO
> > to Uaccel. I don't know what the net gain would be.
> 
> Most of VFIO functionality is already decomposed inside the kernel,
> and you need most of it to do secure user access anyhow.
> 
> > For mdev, would you agree we can keep the current architecture,
> > and investigate moving some emulation code to user space (say even for
> > standard vfio_pci) and then expand scope later.
> 
> I won't hard NAK this, but I think you need more people to support
> this general idea of more emulation code in the kernel to go ahead -
> particularly since this is one of many future drivers along this
> design.
> 
> It would be good to hear from the VMM teams that this is what they
> want (and why), for instance.

IRC Paolo was present I think and we can find other VMM folks to chime in if
that helps.

Re: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.

2020-05-08 Thread Raj, Ashok

Hi Jason

In general your idea of moving pure emulation code to user space 
is a good strategy.

On Wed, Apr 29, 2020 at 02:42:20AM -0700, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Monday, April 27, 2020 9:22 PM
> > 
> > On Mon, Apr 27, 2020 at 07:19:39AM -0600, Alex Williamson wrote:
> > 
> > > > It is not trivial masking. It is a 2000 line patch doing comprehensive
> > > > emulation.
> > >
> > > Not sure what you're referring to, I see about 30 lines of code in
> > > vdcm_vidxd_cfg_write() that specifically handle writes to the 4 BARs in
> > > config space and maybe a couple hundred lines of code in total handling
> > > config space emulation.  Thanks,
> > 
> > Look around vidxd_do_command()
> > 
> > If I understand this flow properly..
> > 
> 
> Hi, Jason,
> 
> I guess the 2000 lines mostly refer to the changes in mdev.c and vdev.c. 
> We did a break-down among them:
> 
> 1) ~150 LOC for vdev initialization
> 2) ~150 LOC for cfg space emulation
> 3) ~230 LOC for mmio r/w emulation
> 4) ~500 LOC for controlling the work queue (vidxd_do_command), 
> triggered by write emulation of IDXD_CMD_OFFSET register
> 5) the remaining lines are all about vfio-mdev registration/callbacks,
> for reporting mmio/irq resource, eventfd, mmap, etc.
> 
> 1/2/3) are pure device emulation, which counts for ~500 LOC. 
> 
> 4) needs be in the kernel regardless of which uAPI is used, because it
> talks to the physical work queue (enable, disable, drain, abort, reset, etc.)
> 
> Then if just talking about ~500 LOC emulation code left in the kernel, 
> is it still a big concern to you? 😊

Even when uaccel was under development, one of the options
was to use VFIO as the transport, goal was the same i.e to keep
the user space have one interface. But the needs of generic 
user space application is significantly different from exporting
a more functional device model to guest, which isn't full emulated
device. which is why VFIO didn't make sense for native use.

And when we move things from VFIO which is already established
as a general device model and accepted by multiple VMM's it gives
instant footing without a whole redesign. When we move things from 
VFIO to uaccel to bolt on the functionality like VFIO, I suspect we 
would be moving code/functionality from VFIO to Uaccel. I don't know
what the net gain would be.

IMS is being reworked based on your feedback. And for mdev
since the code is minimal for emulation, and rest are control paths
that need kernel code to deal with.

For mdev, would you agree we can keep the current architecture,
and investigate moving some emulation code to user space (say even for
standard vfio_pci) and then expand scope later.

Cheers
Ashok

Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

2020-05-08 Thread Raj, Ashok

Hi Thomas,

On Fri, May 08, 2020 at 01:04:29PM +0200, Thomas Gleixner wrote:
> Ashok,
> 
> "Raj, Ashok"  writes:
> >> But as this last one is the migration thread, aka stomp machine, I
> >> assume this is a hotplug operation. Which means the CPU cannot handle
> >> interrupts anymore. In that case we check the old vector on the
> >> unplugged CPU in fixup_irqs() and do the retrigger from there.
> >> Can you please add tracing to that one as well?
> >
> > New tracelog attached. It just happened once.
> 
> Correct and it worked as expected.
> 
>  migration/3-24[003] d..1   275.665751: msi_set_affinity: quirk[1] 
> new vector allocated, new apic = 4 vector = 36 this apic = 6
>  migration/3-24[003] d..1   275.665776: msi_set_affinity: Redirect to 
> new vector 36 on old apic 6
>  migration/3-24[003] d..1   275.665789: msi_set_affinity: Transition 
> to new target apic 4 vector 36
>  migration/3-24[003] d..1   275.665790: msi_set_affinity: Update Done 
> [IRR 0]: irq 123 Nvec 36 Napic 4
>  migration/3-24[003] d..1   275.666792: fixup_irqs: retrigger vector 
> 33 irq 123
> 
> So looking at your trace further down, the problem is not the last
> one. It dies already before that:
> 
><...>-14[001] d..1   284.901587: msi_set_affinity: quirk[1] 
> new vector allocated, new apic = 6 vector = 33 this apic = 2
><...>-14[001] d..1   284.901604: msi_set_affinity: Direct 
> Update: irq 123 Ovec=33 Oapic 2 Nvec 33 Napic 6
>
> Here, the interrupts stop coming in and that's just a regular direct
> update, i.e. same vector, different CPU. The update below is updating a
> dead device already.

So we lost the interrupt either before or after we migrated to apic2?

In the direct update when we migrate same vector but different CPU
if there was a pending IRR on the current CPU, where is that getting cleaned up?

> 
>  migration/3-24[003] d..1   284.924960: msi_set_affinity: quirk[1] 
> new vector allocated, new apic = 4 vector = 36 this apic = 6
>  migration/3-24[003] d..1   284.924987: msi_set_affinity: Redirect to 
> new vector 36 on old apic 6
>  migration/3-24[003] d..1   284.924999: msi_set_affinity: Transition 
> to new target apic 4 vector 36
>  migration/3-24[003] d..1   284.925000: msi_set_affinity: Update Done 
> [IRR 0]: irq 123 Nvec 36 Napic 4
> 
> TBH, I can't see anything what's wrong here from the kernel side and as
> this is new silicon and you're the only ones reporting this it seems
> that this is something which is specific to that particular
> hardware. Have you talked to the hardware people about this?
> 

After chasing it, I'm also trending to think that way. We had a question
about the moderation timer and how its affecting this behavior.
Mathias tried to turn off the moderation timer, and we are still able to 
see this hang. We are reaching out to the HW folks to get a handle on this.

With legacy MSI we can have these races and kernel is trying to do the
song and dance, but we see this happening even when IR is turned on.
Which is perplexing. I think when we have IR, once we do the change vector 
and flush the interrupt entry cache, if there was an outstandng one in 
flight it should be in IRR. Possibly should be clearned up by the
send_cleanup_vector() i suppose.

Cheers,
Ashok

Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

2020-05-07 Thread Raj, Ashok

Hi Thomas

We did a bit more tracing and it looks like the IRR check is actually
not happening on the right cpu. See below.

On Tue, May 05, 2020 at 11:47:26PM +0200, Thomas Gleixner wrote:
> >
> > msi_set_affinit ()
> > {
> > 
> > unlock_vector_lock();
> >
> > /*
> >  * Check whether the transition raced with a device interrupt and
> >  * is pending in the local APICs IRR. It is safe to do this outside
> >  * of vector lock as the irq_desc::lock of this interrupt is still
> >  * held and interrupts are disabled: The check is not accessing the
> >  * underlying vector store. It's just checking the local APIC's
> >  * IRR.
> >  */
> > if (lapic_vector_set_in_irr(cfg->vector))
> > irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
> 
> No. This catches the transitional interrupt to the new vector on the
> original CPU, i.e. the one which is running that code.

Mathias added some trace to his xhci driver when the isr is called.

Below is the tail of my trace with last two times xhci_irq isr is called:

-0 [003] d.h.   200.277971: xhci_irq: xhci irq
-0 [003] d.h.   200.278052: xhci_irq: xhci irq

Just trying to follow your steps below with traces. The traces follow
the same comments in the source.

> 
> Again the steps are:
> 
>  1) Allocate new vector on new CPU

/* Allocate a new target vector */
ret = parent->chip->irq_set_affinity(parent, mask, force);

migration/3-24[003] d..1   200.283012: msi_set_affinity: msi_set_affinity: 
quirk: 1: new vector allocated, new cpu = 0

> 
>  2) Set new vector on original CPU

/* Redirect it to the new vector on the local CPU temporarily */
old_cfg.vector = cfg->vector;
irq_msi_update_msg(irqd, &old_cfg);

migration/3-24[003] d..1   200.283033: msi_set_affinity: msi_set_affinity: 
Redirect to new vector 33 on old cpu 6
> 
>  3) Set new vector on new CPU

/* Now transition it to the target CPU */
irq_msi_update_msg(irqd, cfg);

 migration/3-24[003] d..1   200.283044: msi_set_affinity: 
msi_set_affinity: Transition to new target cpu 0 vector 33



 if (lapic_vector_set_in_irr(cfg->vector))
irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);


migration/3-24[003] d..1   200.283046: msi_set_affinity: msi_set_affinity: 
Update Done [IRR 0]: irq 123 localsw: Nvec 33 Napic 0

> 
> So we have 3 points where an interrupt can fire:
> 
>  A) Before #2
> 
>  B) After #2 and before #3
> 
>  C) After #3
> 
> #A is hitting the old vector which is still valid on the old CPU and
>will be handled once interrupts are enabled with the correct irq
>descriptor - Normal operation (same as with maskable MSI)
> 
> #B This must be checked in the IRR because the there is no valid vector
>on the old CPU.

The check for IRR seems like on a random cpu3 vs checking for the new vector 33
on old cpu 6?

This is the place when we force the retrigger without the IRR check things seem 
to fix itself.

> 
> #C is handled on the new vector on the new CPU
> 


Did we miss something? 

Cheers,
Ashok

Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

2020-05-05 Thread Raj, Ashok

On Tue, May 05, 2020 at 09:36:04PM +0200, Thomas Gleixner wrote:
> Ashok,
> 
> 
> > Now the second question with Interrupt Remapping Support:
> >
> > intel_ir_set_affinity->intel_ir_reconfigure_irte()-> modify_irte()
> >
> > The flush of Interrupt Entry Cache (IEC) should ensure, if any interrupts
> > were in flight, they made it to the previous CPU, and any new interrupts
> > must be delivered to the new CPU.
> >
> > Question is do we need a check similar to the legacy MSI handling
> >
> > if (lapic_vector_set_in_irr())
> > handle interrupt? 
> >
> > Is there a reason we don't check if the interrupt delivered to previous
> > CPU in intel_ir_set_affinity()? Or is the send_cleanup_vector() sends
> > an IPI to perform the cleanup?
> >
> > It appears that maybe send_cleanup_vector() sends IPI to the old cpu
> > and that somehow ensures the device interrupt handler actually getting
> > called? I lost my track somewhere down there :)
> 
> The way it works is:
> 
> 1) New vector on different CPU is allocated
> 
> 2) New vector is installed
> 
> 3) When the first interrupt on the new CPU arrives then the cleanup
>IPI is sent to the previous CPU which tears down the old vector
> 
> So if the interrupt is sent to the original CPU just before #2 then this
> will be correctly handled on that CPU after the set affinity code
> reenables interrupts.

I'll have this test tried out.. but in msi_set_affinity() the check below
for lapic_vector_set_in_irr(cfg->vector), this is the new vector correct? 
don't we want to check for old_cfg.vector instead?

msi_set_affinit ()
{



unlock_vector_lock();

/*
 * Check whether the transition raced with a device interrupt and
 * is pending in the local APICs IRR. It is safe to do this outside
 * of vector lock as the irq_desc::lock of this interrupt is still
 * held and interrupts are disabled: The check is not accessing the
 * underlying vector store. It's just checking the local APIC's
 * IRR.
 */
if (lapic_vector_set_in_irr(cfg->vector))
irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);


> 
> Can you try the patch below?
> 
> Thanks,
> 
> tglx
> 
> 8<--
>  drivers/pci/msi.c |2 ++
>  1 file changed, 2 insertions(+)
> 
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -323,6 +323,7 @@ void __pci_write_msi_msg(struct msi_desc
>   writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
>   writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
>   writel(msg->data, base + PCI_MSIX_ENTRY_DATA);
> + readl(base + PCI_MSIX_ENTRY_DATA);
>   } else {
>   int pos = dev->msi_cap;
>   u16 msgctl;
> @@ -343,6 +344,7 @@ void __pci_write_msi_msg(struct msi_desc
>   pci_write_config_word(dev, pos + PCI_MSI_DATA_32,
> msg->data);
>   }
> + pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
>   }
>  
>  skip:

Re: [PATCH] iommu: Relax ACS requirement for RCiEP devices.

2020-05-05 Thread Raj, Ashok

On Tue, May 05, 2020 at 08:05:14AM -0600, Alex Williamson wrote:
> On Mon, 4 May 2020 23:11:07 -0700
> "Raj, Ashok"  wrote:
> 
> > Hi Alex
> > 
> > + Joerg, accidently missed in the Cc.
> > 
> > On Mon, May 04, 2020 at 11:19:36PM -0600, Alex Williamson wrote:
> > > On Mon,  4 May 2020 21:42:16 -0700
> > > Ashok Raj  wrote:
> > >   
> > > > PCIe Spec recommends we can relax ACS requirement for RCIEP devices.
> > > > 
> > > > PCIe 5.0 Specification.
> > > > 6.12 Access Control Services (ACS)
> > > > Implementation of ACS in RCiEPs is permitted but not required. It is
> > > > explicitly permitted that, within a single Root Complex, some RCiEPs
> > > > implement ACS and some do not. It is strongly recommended that Root 
> > > > Complex
> > > > implementations ensure that all accesses originating from RCiEPs
> > > > (PFs and VFs) without ACS capability are first subjected to processing 
> > > > by
> > > > the Translation Agent (TA) in the Root Complex before further decoding 
> > > > and
> > > > processing. The details of such Root Complex handling are outside the 
> > > > scope
> > > > of this specification.
> > > > 
> > > 
> > > Is the language here really strong enough to make this change?  ACS is
> > > an optional feature, so being permitted but not required is rather
> > > meaningless.  The spec is also specifically avoiding the words "must"
> > > or "shall" and even when emphasized with "strongly", we still only have
> > > a recommendation that may or may not be honored.  This seems like a
> > > weak basis for assuming that RCiEPs universally honor this
> > > recommendation.  Thanks,
> > >   
> > 
> > We are speaking about PCIe spec, where people write it about 5 years ahead
> > and every vendor tries to massage their product behavior with vague
> > words like this..  :)
> > 
> > But honestly for any any RCiEP, or even integrated endpoints, there 
> > is no way to send them except up north. These aren't behind a RP.
> 
> But they are multi-function devices and the spec doesn't define routing
> within multifunction packages.  A single function RCiEP will already be
> assumed isolated within its own group.

That's right. The other two devices only have legacy PCI headers. So 
they can't claim to be RCiEP's but just integrated endpoints. The legacy
devices don't even have a PCIe header.

I honestly don't know why these are groped as MFD's in the first place.

>  
> > I did check with couple folks who are part of the SIG, and seem to agree
> > that ACS treatment for RCiEP's doesn't mean much. 
> > 
> > I understand the language isn't strong, but it doesn't seem like ACS should
> > be a strong requirement for RCiEP's and reasonable to relax.
> > 
> > What are your thoughts? 
> 
> I think hardware vendors have ACS at their disposal to clarify when
> isolation is provided, otherwise vendors can submit quirks, but I don't
> see that the "strongly recommended" phrasing is sufficient to assume
> isolation between multifunction RCiEPs.  Thanks,

You point is that integrated MFD endpoints, without ACS, there is no 
gaurantee to SW that they are isolated.

As far as a quirk, do you think:
- a cmdline optput for integrated endpoints, and RCiEP's suffice?
  along with a compile time default that is strict enforcement
- typical vid/did type exception list?

A more generic way to ask for exception would be scalable until we can stop
those type of integrated devices. Or we need to maintain these device lists
for eternity. 

Cheers,
Ashok

Re: [PATCH] iommu: Relax ACS requirement for RCiEP devices.

2020-05-04 Thread Raj, Ashok

Hi Alex

+ Joerg, accidently missed in the Cc.

On Mon, May 04, 2020 at 11:19:36PM -0600, Alex Williamson wrote:
> On Mon,  4 May 2020 21:42:16 -0700
> Ashok Raj  wrote:
> 
> > PCIe Spec recommends we can relax ACS requirement for RCIEP devices.
> > 
> > PCIe 5.0 Specification.
> > 6.12 Access Control Services (ACS)
> > Implementation of ACS in RCiEPs is permitted but not required. It is
> > explicitly permitted that, within a single Root Complex, some RCiEPs
> > implement ACS and some do not. It is strongly recommended that Root Complex
> > implementations ensure that all accesses originating from RCiEPs
> > (PFs and VFs) without ACS capability are first subjected to processing by
> > the Translation Agent (TA) in the Root Complex before further decoding and
> > processing. The details of such Root Complex handling are outside the scope
> > of this specification.
> > 
> > Since Linux didn't give special treatment to allow this exception, certain
> > RCiEP MFD devices are getting grouped in a single iommu group. This
> > doesn't permit a single device to be assigned to a guest for instance.
> > 
> > In one vendor system: Device 14.x were grouped in a single IOMMU group.
> > 
> > /sys/kernel/iommu_groups/5/devices/:00:14.0
> > /sys/kernel/iommu_groups/5/devices/:00:14.2
> > /sys/kernel/iommu_groups/5/devices/:00:14.3
> > 
> > After the patch:
> > /sys/kernel/iommu_groups/5/devices/:00:14.0
> > /sys/kernel/iommu_groups/5/devices/:00:14.2
> > /sys/kernel/iommu_groups/6/devices/:00:14.3 <<< new group
> > 
> > 14.0 and 14.2 are integrated devices, but legacy end points.
> > Whereas 14.3 was a PCIe compliant RCiEP.
> > 
> > 00:14.3 Network controller: Intel Corporation Device 9df0 (rev 30)
> > Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > 
> > This permits assigning this device to a guest VM.
> > 
> > Fixes: f096c061f552 ("iommu: Rework iommu_group_get_for_pci_dev()")
> > Signed-off-by: Ashok Raj 
> > To: Joerg Roedel 
> > To: Bjorn Helgaas 
> > Cc: linux-kernel@vger.kernel.org
> > Cc: io...@lists.linux-foundation.org
> > Cc: Lu Baolu 
> > Cc: Alex Williamson 
> > Cc: Darrel Goeddel 
> > Cc: Mark Scott ,
> > Cc: Romil Sharma 
> > Cc: Ashok Raj 
> > ---
> >  drivers/iommu/iommu.c | 15 ++-
> >  1 file changed, 14 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 2b471419e26c..5744bd65f3e2 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -1187,7 +1187,20 @@ static struct iommu_group 
> > *get_pci_function_alias_group(struct pci_dev *pdev,
> > struct pci_dev *tmp = NULL;
> > struct iommu_group *group;
> >  
> > -   if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
> > +   /*
> > +* PCI Spec 5.0, Section 6.12 Access Control Service
> > +* Implementation of ACS in RCiEPs is permitted but not required.
> > +* It is explicitly permitted that, within a single Root
> > +* Complex, some RCiEPs implement ACS and some do not. It is
> > +* strongly recommended that Root Complex implementations ensure
> > +* that all accesses originating from RCiEPs (PFs and VFs) without
> > +* ACS capability are first subjected to processing by the Translation
> > +* Agent (TA) in the Root Complex before further decoding and
> > +* processing.
> > +*/
> 
> Is the language here really strong enough to make this change?  ACS is
> an optional feature, so being permitted but not required is rather
> meaningless.  The spec is also specifically avoiding the words "must"
> or "shall" and even when emphasized with "strongly", we still only have
> a recommendation that may or may not be honored.  This seems like a
> weak basis for assuming that RCiEPs universally honor this
> recommendation.  Thanks,
> 

We are speaking about PCIe spec, where people write it about 5 years ahead
and every vendor tries to massage their product behavior with vague
words like this..  :)

But honestly for any any RCiEP, or even integrated endpoints, there 
is no way to send them except up north. These aren't behind a RP.

I did check with couple folks who are part of the SIG, and seem to agree
that ACS treatment for RCiEP's doesn't mean much. 

I understand the language isn't strong, but it doesn't seem like ACS should
be a strong requirement for RCiEP's and reasonable to relax.

What are your thoughts? 

Cheers,
Ashok
> 
> > +   if (!pdev->multifunction ||
> > +   (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END) ||
> > +pci_acs_enabled(pdev, REQ_ACS_FLAGS))
> > return NULL;
> >  
> > for_each_pci_dev(tmp) {
>

Re: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

2020-05-01 Thread Raj, Ashok

Hi Thomas

Just started looking into it to get some idea about what could be
going on. I had some questions, that would be helpful to clarify.

On Tue, Mar 24, 2020 at 08:03:44PM +0100, Thomas Gleixner wrote:
> Evan Green  writes:
> > On Mon, Mar 23, 2020 at 5:24 PM Thomas Gleixner  wrote:
> >> And of course all of this is so well documented that all of us can
> >> clearly figure out what's going on...
> >
> > I won't pretend to know what's going on, so I'll preface this by
> > labeling it all as "flailing", but:
> >
> > I wonder if there's some way the interrupt can get delayed between
> > XHCI snapping the torn value and it finding its way into the IRR. For
> > instance, if xhci read this value at the start of their interrupt
> > moderation timer period, that would be awful (I hope they don't do
> > this). One test patch would be to carve out 8 vectors reserved for
> > xhci on all cpus. Whenever you change the affinity, the assigned
> > vector is always reserved_base + cpu_number. That lets you exercise
> > the affinity switching code, but in a controlled manner where torn
> > interrupts could be easily seen (ie hey I got an interrupt on cpu 4's
> > vector but I'm cpu 2). I might struggle to write such a change, but in
> > theory it's doable.
> 
> Well, the point is that we don't see a spurious interrupt on any
> CPU. We added a traceprintk into do_IRQ() and that would immediately
> tell us where the thing goes off into lala land. Which it didn't.

Now that we don't have the torn write issue. We did an experiment 
with legacy MSI, and no interrupt remap support. One of the thought
process was, since we don't have a way to ensure that previous MSI writes
are globally observed, a read from the device should flush any
outstanidng writes correct? (according to PCI, not sure if we can
depend on this.. or chipsets would take their own sweet time to push to CPU)

I'm not certain if such a read happens today? So to make it simple tried
to force a retrigger. In the following case of direct update,
even though the vector isn't changing a MSI write to the previous 
destination could have been sent to the previous CPU right? 

arch/x86/kernel/apic/msi.c: msi_set_affinity()

/*
 * Direct update is possible when:
 * - The MSI is maskable (remapped MSI does not use this code path)).
 *   The quirk bit is not set in this case.
 * - The new vector is the same as the old vector
 * - The old vector is MANAGED_IRQ_SHUTDOWN_VECTOR (interrupt starts up)
 * - The new destination CPU is the same as the old destination CPU
 */
if (!irqd_msi_nomask_quirk(irqd) ||
cfg->vector == old_cfg.vector ||
old_cfg.vector == MANAGED_IRQ_SHUTDOWN_VECTOR ||
cfg->dest_apicid == old_cfg.dest_apicid) {
irq_msi_update_msg(irqd, cfg);
-->>> force a retrigger

It appears that without a gaurantee of flusing MSI writes from the device
the check for lapic_vector_set_in_irr(vector) is still racy. 

With adding the forced retrigger in both places, the test didn't reveal any
lost interrupt cases.

Now the second question with Interrupt Remapping Support:

intel_ir_set_affinity->intel_ir_reconfigure_irte()-> modify_irte()

The flush of Interrupt Entry Cache (IEC) should ensure, if any interrupts
were in flight, they made it to the previous CPU, and any new interrupts
must be delivered to the new CPU.

Question is do we need a check similar to the legacy MSI handling

if (lapic_vector_set_in_irr())
handle interrupt? 

Is there a reason we don't check if the interrupt delivered to previous
CPU in intel_ir_set_affinity()? Or is the send_cleanup_vector() sends
an IPI to perform the cleanup?  

It appears that maybe send_cleanup_vector() sends IPI to the old cpu
and that somehow ensures the device interrupt handler actually getting
called? I lost my track somewhere down there :)

Cheers,
Ashok

Re: [PATCH v6 03/10] iommu/vt-d: Replace Intel specific PASID allocator with IOASID

2019-10-22 Thread Raj, Ashok

On Tue, Oct 22, 2019 at 04:53:16PM -0700, Jacob Pan wrote:
> Make use of generic IOASID code to manage PASID allocation,
> free, and lookup. Replace Intel specific code.
> 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/intel-iommu.c | 12 ++--
>  drivers/iommu/intel-pasid.c | 36 
>  drivers/iommu/intel-svm.c   | 37 +
>  3 files changed, 27 insertions(+), 58 deletions(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 3aff0141c522..72febcf2c48f 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5311,7 +5311,7 @@ static void auxiliary_unlink_device(struct dmar_domain 
> *domain,
>   domain->auxd_refcnt--;
>  
>   if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - intel_pasid_free_id(domain->default_pasid);
> + ioasid_free(domain->default_pasid);

if the domain is gauranteed to be torn down, its ok.. but otherwise
do you need to clear domain->default_pasid to avoid accidental reference
in some other code path?

>  }
>  
>  static int aux_domain_add_dev(struct dmar_domain *domain,
> @@ -5329,10 +5329,10 @@ static int aux_domain_add_dev(struct dmar_domain 
> *domain,
>   if (domain->default_pasid <= 0) {
>   int pasid;
>  
> - pasid = intel_pasid_alloc_id(domain, PASID_MIN,
> -  pci_max_pasids(to_pci_dev(dev)),
> -  GFP_KERNEL);
> - if (pasid <= 0) {
> + /* No private data needed for the default pasid */
> + pasid = ioasid_alloc(NULL, PASID_MIN, 
> pci_max_pasids(to_pci_dev(dev)) - 1,

If we ensure IOMMU max-pasid is full 20bit width, its good. In a vIOMMU
if iommu max pasid is restricted for some kind of partitioning in future
you want to consider limiting to no more than what's provisioned in the vIOMMU 
right?


> + NULL);
> + if (pasid == INVALID_IOASID) {
>   pr_err("Can't allocate default pasid\n");
>   return -ENODEV;
>   }
> @@ -5368,7 +5368,7 @@ static int aux_domain_add_dev(struct dmar_domain 
> *domain,
>   spin_unlock(&iommu->lock);
>   spin_unlock_irqrestore(&device_domain_lock, flags);
>   if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - intel_pasid_free_id(domain->default_pasid);
> + ioasid_free(domain->default_pasid);
>  
>   return ret;
>  }
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 76bcbb21e112..c0d1f28d3412 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -26,42 +26,6 @@
>   */
>  static DEFINE_SPINLOCK(pasid_lock);
>  u32 intel_pasid_max_id = PASID_MAX;
> -static DEFINE_IDR(pasid_idr);
> -
> -int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp)
> -{
> - int ret, min, max;
> -
> - min = max_t(int, start, PASID_MIN);
> - max = min_t(int, end, intel_pasid_max_id);
> -
> - WARN_ON(in_interrupt());
> - idr_preload(gfp);
> - spin_lock(&pasid_lock);
> - ret = idr_alloc(&pasid_idr, ptr, min, max, GFP_ATOMIC);
> - spin_unlock(&pasid_lock);
> - idr_preload_end();
> -
> - return ret;
> -}
> -
> -void intel_pasid_free_id(int pasid)
> -{
> - spin_lock(&pasid_lock);
> - idr_remove(&pasid_idr, pasid);
> - spin_unlock(&pasid_lock);
> -}
> -
> -void *intel_pasid_lookup_id(int pasid)
> -{
> - void *p;
> -
> - spin_lock(&pasid_lock);
> - p = idr_find(&pasid_idr, pasid);
> - spin_unlock(&pasid_lock);
> -
> - return p;
> -}
>  
>  static int check_vcmd_pasid(struct intel_iommu *iommu)
>  {
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index 9b159132405d..5aef5b7bf561 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "intel-pasid.h"
> @@ -318,16 +319,15 @@ int intel_svm_bind_mm(struct device *dev, int *pasid, 
> int flags, struct svm_dev_
>   if (pasid_max > intel_pasid_max_id)
>   pasid_max = intel_pasid_max_id;
>  
> - /* Do not use PASID 0 in caching mode (virtualised IOMMU) */
> - ret = intel_pasid_alloc_id(svm,
> -!!cap_caching_mode(iommu->cap),
> -pasid_max - 1, GFP_KERNEL);
> - if (ret < 0) {
> + /* Do not use PASID 0, reserved for RID to PASID */
> + svm->pasid = ioasid_alloc(NULL, PASID_MIN,
> + pasid_max - 1, svm);
> + if (svm->pasid == INVALID_IOASID) {
>   kfree(svm);
>   kfree(sdev);
> + ret = ENOSPC;
>   goto out;
>

Re: [PATCH v7 7/7] PCI: Skip Enhanced Allocation (EA) initialization for VF device

2019-10-03 Thread Raj, Ashok

On Thu, Oct 03, 2019 at 01:57:47PM -0500, Bjorn Helgaas wrote:
> On Thu, Oct 03, 2019 at 10:21:24AM -0700, Kuppuswamy Sathyanarayanan wrote:
> > Hi Bjorn,
> > 
> > On 8/28/19 3:14 PM, sathyanarayanan.kuppusw...@linux.intel.com wrote:
> > > From: Kuppuswamy Sathyanarayanan 
> > > 
> > > 
> > > As per PCIe r4.0, sec 9.3.6, VF must not implement Enhanced Allocation
> > > Capability. So skip pci_ea_init() for virtual devices.
> > > 
> > > Cc: Ashok Raj 
> > > Cc: Keith Busch 
> > > Suggested-by: Ashok Raj 
> > > Signed-off-by: Kuppuswamy Sathyanarayanan 
> > > 
> > This patch was also dropped in your v8. Is this also intentional?
> 
> Yes, I dropped it because I didn't think there was much motivation for
> it.

Agreed!

> 
> If a device is broken, i.e., a VF has an EA capability, this patch
> silently returns.  The existing code would try to use the EA
> capability and something would probably blow up, so in that sense,
> this patch makes the hardware issue less visible.
> 
> If a device is correct, i.e., a VF does *not* have an EA capability,
> pci_find_capability() will fail anyway, so this patch doesn't change
> the functional behavior.


But do you think while at this can we atleast do a warning
to make sure HW probably messed up just after the EA capability
is read? Atleast it would be an early warning vs. it trying to break
later. Like the other issues we ran into with the PIN interrupt
accidently set in some hardware for VF's for instance. 

> 
> This patch *does* avoid the pci_find_capability() in that case, which
> is a performance optimization.  We could merge it on that basis, but
> we should try to quantify the benefit to see if it's really worthwhile
> and the commit log should use that as the explicit motivation.
> 
> > > ---
> > >   drivers/pci/pci.c | 7 +++
> > >   1 file changed, 7 insertions(+)
> > > 
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index 1b27b5af3d55..266600a11769 100644
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -3025,6 +3025,13 @@ void pci_ea_init(struct pci_dev *dev)
> > >   int offset;
> > >   int i;
> > > + /*
> > > +  * Per PCIe r4.0, sec 9.3.6, VF must not implement Enhanced
> > > +  * Allocation Capability.
> > > +  */
> > > + if (dev->is_virtfn)
> > > + return;
> > > +
> > >   /* find PCI EA capability in list */
> > >   ea = pci_find_capability(dev, PCI_CAP_ID_EA);
> > >   if (!ea)
> > 
> > -- 
> > Sathyanarayanan Kuppuswamy
> > Linux kernel developer
> >

Re: [PATCH] x86/microcode: Add an option to reload microcode even if revision is unchanged

2019-09-17 Thread Raj, Ashok

Hi Thomas,

On Tue, Sep 17, 2019 at 08:37:10AM +0200, Thomas Gleixner wrote:
> > microode updates should be of 3 types.
> > 
> > - Only loadable from BIOS (Only via FIT tables)
> > - Suitable for early load (things that take cpuid bits for e.g.)
> > - Suitable for late-load. (Where no cpuid bits should change etc).
> > 
> > Today the way we load after a stop_machine() all threads in the system are
> > held hostage until all the cores have done the update. The thread sibling
> > is also in the rendezvous loop. 
> 
> I know. See below.
>  
> > Do you think we still have that risk with a sibling thread? 
> > (Assuming future ucodes don't do weird things like what happened in 
> > that case where a cpuid was removed via an update)
> 
> Well, yes. The sibling executes a limited set of instructions in a loop,
> but it might be hit by an NMI or MCE which executes even more instructions.

There is a plan to solve the NMI issue. Although there is one case we might
be showing as a spurious that might not be nice. If #MCE's showup there is
nothing we can do at that point. These are most likely unrecoverable. 
But we want to make sure we could atleast follow through with a proper reset.

Let me gather my thoughts on that when i have the patch ready to handle
those senarios.

> 
> So what happens if the ucode update "fixes" one of the executed
> instructions on the fly? Is that guaranteed to be safe? There is nothing
> which says so.
> 
> A decade ago I experimented with putting the spinning CPUs into MWAIT,
> which caused havoc. Did neither have time nor the stomach to dig into that
> further, but the ucode update _did_ fix an issue with MWAIT according to
> the version history.

Excellent point. 
> 
> That's why I'm worried about instructions being "fixed" which are executed
> in parallel on the sibling.
> 
> An authorative statement vs. that would be appreciated. Preferrably in form
> of an extension of the SDM, but an upfront statement in this thread would
> be a good start.

I have started the conversation internally. Once we have something solid
I'll share in the list, and also follow up with updates to SDM. 

Cheers,
Ashok

Re: [PATCH] x86/microcode: Add an option to reload microcode even if revision is unchanged

2019-09-16 Thread Raj, Ashok

On Mon, Sep 16, 2019 at 12:36:11PM +0200, Thomas Gleixner wrote:
> > On Fri, 6 Sep 2019, Raj, Ashok wrote:
> > > On Fri, Sep 06, 2019 at 11:16:00PM +0200, Thomas Gleixner wrote:
> > > > Now #1 is actually a sensible and feasible solution which can be pulled 
> > > > off
> > > > in a reasonably short time frame, avoids all the bound to be ugly and
> > > > failure laden attempts of fixing late loading completely and provides a
> > > > usable and safe solution for joe user, jack admin and the super experts 
> > > > at
> > > > big-cloud corporate.
> > > > 
> > > > That is not requiring any new format of microcode payload, as this can 
> > > > be
> > > > nicely done as a metadata package which comes with the microcode
> > > > payload. So you get the following backwards compatible states:
> > > > 
> > > >   Kernel  metadataresult
> > > > 
> > > >   old don't care  refuse late load
> > > > 
> > > >   new No  refuse late load
> > > > 
> > > >   new Yes decide based on metadata
> > > > 
> > > > Thoughts?
> > > 
> > > This is 100% in line with what we proposed... 
> > 
> > So what it hindering you to implement that? ucode teams whining about the
> > little bit of extra work?
> 
> That said, there is also a distinct lack of information about micro code
> loading in a safe way in general. We absolutely do not know whether a micro
> code update affects any instruction which might be in use during the update
> on a sibling. Right now it's all load and pray and the SDM is not really
> helpful with that either.
> 

Guilty as charged :-). In general we do not expect microcode updates to 
remove any cpuid bits (Not that it hasn't happened, but it slipped through
the cracks). 

microode updates should be of 3 types.

- Only loadable from BIOS (Only via FIT tables)
- Suitable for early load (things that take cpuid bits for e.g.)
- Suitable for late-load. (Where no cpuid bits should change etc).

Today the way we load after a stop_machine() all threads in the system are
held hostage until all the cores have done the update. The thread sibling
is also in the rendezvous loop. 

Do you think we still have that risk with a sibling thread? 
(Assuming future ucodes don't do weird things like what happened in 
that case where a cpuid was removed via an update)

Cheers,
Ashok

Re: [RFC V1 0/7] Add support for a new IMS interrupt mechanism

2019-09-13 Thread Raj, Ashok

On Fri, Sep 13, 2019 at 07:50:50PM +, Jason Gunthorpe wrote:
> On Thu, Sep 12, 2019 at 06:32:01PM -0700, Megha Dey wrote:
> 
> > This series is a basic patchset to get the ball rolling and receive some
> > inital comments. As per my discussion with Marc Zyngier and Thomas Gleixner
> > at the Linux Plumbers, I need to do the following:
> > 1. Since a device can support MSI-X and IMS simultaneously, ensure proper
> >locking mechanism for the 'msi_list' in the device structure.
> > 2. Introduce dynamic allocation of IMS vectors perhaps by using a group ID
> > 3. IMS support of a device needs to be discoverable. A bit in the vendor
> >specific capability in the PCI config is to be added rather than getting
> >this information from each device driver.
> 
> Why #3? The point of this scheme is to delegate programming the
> addr/data pairs to the driver so it can be done in some
> device-specific way. There is no PCI standard behind this, and no
> change in PCI semantics. 
> 
> I think it would be a tall ask to get a config space bit from PCI-SIG
> for something that has little to do with PCI.

This isn't a standard config capability. Its Designated Vendor Specific
Capability (DVSEC). The device is responsible for managing the addr-data
pair. This provides a hint to the OS framework that this device has
device specific methods.

Agreed its not required, but some OSV's like a generic way to discover
these capabilities, hence its there so device vendors can have
a common guideline.

Check here for some of those details:

https://software.intel.com/en-us/blogs/2018/06/25/introducing-intel-scalable-io-virtualization
 

> 
> After seeing that we already have a platform device based version of
> this same idea (drivers/base/platform-msi.c), I think the task here is
> really just to extend and expand that approach to work generically for
> platform and PCI devices. Along the way tidying the arch interface so
> x86 and ARM's stuff to support that uses the same generic interfaces.
> 
> ie it is re-organizing code and ideas already in Linux, not defining
> some new standard.
> 
> I also think refering to this existing idea by some new IMS name is
> only confusing people what the goal is... Which is perhaps why #3 was
> suggested??
> 
> Stated more clearly, I think all uses would be satisfied if
> platform_msi_domain_alloc_irqs() could be called for struct
> pci_device, could be called multiple times for the same struct
> pci_device, and co-existed with MSI and MSI-X on the same pci_device.
> 
> Jason

Re: [PATCH] x86/microcode: Add an option to reload microcode even if revision is unchanged

2019-09-06 Thread Raj, Ashok

On Fri, Sep 06, 2019 at 11:16:00PM +0200, Thomas Gleixner wrote:
> 
> So if we want to do late microcode loading in a sane way then there are
> only a few options and none of them exist today:
> 
>  1) Micro-code contains a description of CPUID bits which are going to be
> exposed after the load. Then the kernel can sanity check whether this
> changes anything relevant or not. If there is a relevant change it can
> reject the load and tell the admin that a reboot is required.

This is pretty much what we had in mind when we suggested to the uCode teams.

Just a process of providing a meta data file to accompany every uCode release.

IMO new cpuid bits are probably less harmful than old ones dissappearing.



> 
>  2) Rework CPUID feature handling so that it can reevaluate and reconfigure
> the running system safely. There are a lot of things you need for that:
> 
> A) Introduce a safe state for CPUs to reach which guarantees that none
>of the CPUs will return from that state via a code path which
>depends on previous state and might now go the other route with data
>on the stack which only fits the previous configuration.
> 
> B) Make all the cpufeature thingies run time switchable. That means
>that you need to keep quite some code around which is currently init
>only. That also means that you have to provide backout code for
>things which set up data corresponding to cpu feature bits and so
>forth.
> 
> So #2 might be finished in about 20 years from now with the result that
> some of the code pathes might simply still have a

Maybe we can catch the kernel side in 20 years.. user space would still be 
busted, or have a fault way to control new cpuid much like how we do for
VM's. 

> 
>  if (cpufeature_changed())
>  panic();
> 
> because there are things which you cannot back out. So the only sane
> solution is to panic. Which is not a solution as it would be much more sane
> to prevent late loading upfront and force people to reboot proper.
> 
> Now #1 is actually a sensible and feasible solution which can be pulled off
> in a reasonably short time frame, avoids all the bound to be ugly and
> failure laden attempts of fixing late loading completely and provides a
> usable and safe solution for joe user, jack admin and the super experts at
> big-cloud corporate.
> 
> That is not requiring any new format of microcode payload, as this can be
> nicely done as a metadata package which comes with the microcode
> payload. So you get the following backwards compatible states:
> 
>   Kernel  metadata  result
> 
>   old   don't care  refuse late load
> 
>   new   No  refuse late load
> 
>   new   Yes decide based on metadata
> 
> Thoughts?

This is 100% in line with what we proposed... 

Cheers,
Ashok

Re: [PATCH] x86/microcode: Add an option to reload microcode even if revision is unchanged

2019-09-06 Thread Raj, Ashok

Hi Thomas,

On Fri, Sep 06, 2019 at 02:51:17PM +0200, Thomas Gleixner wrote:
> Raj,
> 
> On Thu, 5 Sep 2019, Raj, Ashok wrote:
> > On Thu, Sep 05, 2019 at 11:22:31PM +0200, Thomas Gleixner wrote:
> > > That's all nice, but what it the general use case for this outside of 
> > > Intel's
> > > microcode development and testing?
> > > 
> > > We all know that late microcode loading has severe limitations and we
> > > really don't want to proliferate that further if not absolutely required
> > 
> > Several customers have asked this to check the safety of late loads. They 
> > want
> > to validate in production setup prior to rolling late-load to all 
> > production systems.
> 
> Groan. Late loading _IS_ broken by definition and it was so forever.

Lets tighten the seat belts :-).. I'm with you that late-loading has
shown weakness more recently than earlier. There are several obvious reasons
that you are well aware. But there is a lot that *must* be done to make sure
the guard rails are tight enough for deplopying late-load. 100% agree on that
to make sure the interface and mechanism needs to be improved for robustness
but not a candidate for removal. Certainly this is an argument that would help
me drive towards that objective internally.

> 
> What your customers are asking for is a receipe for disaster. They can
> check the safety of late loading forever, it will not magically become safe
> because they do so.
> 
> If you want late loading, then the whole approach needs to be reworked from
> ground up. You need to make sure that all CPUs are in a safe state,
> i.e. where switching of CPU feature bits of all sorts can be done with the
> guarantee that no CPU will return to the wrong code path after coming out
> of safe state and that any kernel internal state which depends on the
> previous set of CPU feature bits has been mopped up and switched over
> before CPUs are released.
> 
> That does not exist and unless it does, late loading is just going to cause
> trouble nothing else.
> 
> So, no. We are not merging something which is known to be broken and then
> we have to deal with the subtle fallout and the bug reports forever. Not to

When we did the late-load changes last year we added a warning if any
of the cpuid bits either dissappear or new ones appear. Maybe we should
have tainted the kernel to track that so its not that subtle anymore. 

> talk about having to fend of half baken duct tape patches which try to glue
> things together.
> 
> The only sensible patch for that is to remove any trace of late loading
> crappola once and forever.
> 
> Sorry, -ENOPONIES

:-)

Cheers,
Ashok

Re: [PATCH] x86/microcode: Add an option to reload microcode even if revision is unchanged

2019-09-05 Thread Raj, Ashok

Hi Thomas,


On Thu, Sep 05, 2019 at 11:22:31PM +0200, Thomas Gleixner wrote:
> Raj,
> 
> On Thu, 5 Sep 2019, Raj, Ashok wrote:
> > On Thu, Sep 05, 2019 at 09:20:29AM +0200, Borislav Petkov wrote:
> > > On Wed, Sep 04, 2019 at 05:21:32PM -0700, Raj, Ashok wrote:
> > > > But echo 2 > reload would allow reading a microcode file from 
> > > > /lib/firmware/intel-ucode/ even if the revision hasn't changed right?
> > > > 
> > > > #echo 1 > reload wouldn't load if the revision on disk is same as 
> > > > what's loaded,
> > > > and we want to permit that with the echo 2 option.
> > > 
> > > Then before we continue with this, please specify what the exact
> > > requirements are. Talk to your microcoders or whoever is going to use
> > > this and give the exact use cases which should be supported and describe
> > > them in detail.
> > 
> > https://lore.kernel.org/lkml/1567056803-6640-1-git-send-email-ashok@intel.com/
> > 
> > The original description said to load a new microcode file, the content
> > could have changed, but revision in the header hasn't increased. 
> > 
> > The other rules are same, i.e we can't go backwards. There is another
> > SVN (Security version number) embedded in the microcode which won't allow
> > going backwards anyway. 
> > 
> > I'll get back to you if there are additional uses, but allowing the 
> > facility to 
> > actually read the file achieves the same purpose as using the in-kernel 
> > copy.
> > 
> > I have used it multiple times during development :-)
> 
> That's all nice, but what it the general use case for this outside of Intel's
> microcode development and testing?
> 
> We all know that late microcode loading has severe limitations and we
> really don't want to proliferate that further if not absolutely required

Several customers have asked this to check the safety of late loads. They want
to validate in production setup prior to rolling late-load to all production 
systems.

Thanks
Ashok

Re: [PATCH] x86/microcode: Add an option to reload microcode even if revision is unchanged

2019-09-05 Thread Raj, Ashok

On Thu, Sep 05, 2019 at 09:49:50PM +0200, Borislav Petkov wrote:
> On Thu, Sep 05, 2019 at 12:40:44PM -0700, Raj, Ashok wrote:
> > The original description said to load a new microcode file, the content
> > could have changed, but revision in the header hasn't increased.
> 
> How does the hardware even accept a revision which is the same as the
> one already loaded?

Hardware will allow a reload as long as the loaded ucode svn <= the version 
being updated.

Re: [PATCH] x86/microcode: Add an option to reload microcode even if revision is unchanged

2019-09-05 Thread Raj, Ashok

Hi Boris

On Thu, Sep 05, 2019 at 09:20:29AM +0200, Borislav Petkov wrote:
> On Wed, Sep 04, 2019 at 05:21:32PM -0700, Raj, Ashok wrote:
> > But echo 2 > reload would allow reading a microcode file from 
> > /lib/firmware/intel-ucode/ even if the revision hasn't changed right?
> > 
> > #echo 1 > reload wouldn't load if the revision on disk is same as what's 
> > loaded,
> > and we want to permit that with the echo 2 option.
> 
> Then before we continue with this, please specify what the exact
> requirements are. Talk to your microcoders or whoever is going to use
> this and give the exact use cases which should be supported and describe
> them in detail.

https://lore.kernel.org/lkml/1567056803-6640-1-git-send-email-ashok@intel.com/

The original description said to load a new microcode file, the content
could have changed, but revision in the header hasn't increased. 

The other rules are same, i.e we can't go backwards. There is another
SVN (Security version number) embedded in the microcode which won't allow
going backwards anyway. 

I'll get back to you if there are additional uses, but allowing the facility to 
actually read the file achieves the same purpose as using the in-kernel copy.

I have used it multiple times during development :-)

Re: [PATCH] x86/microcode: Add an option to reload microcode even if revision is unchanged

2019-09-04 Thread Raj, Ashok

On Thu, Sep 05, 2019 at 12:12:29AM +0200, Boris Petkov wrote:
> On September 5, 2019 12:06:54 AM GMT+02:00, Boris Ostrovsky 
>  wrote:
> >Why do we need to taint kernel here? We are not making any changes.
> 
> Because this is not a normal operation we want users to do. This is only for 
> testing microcode quicker.
> 
> >This won't allow people to load from new microcode blob which I thought
> >was one of the objectives behind this new feature.
> 
> You load a new blob the usual way: echo 1 > ...
> 
> This is the "unusual" way where you reload an already loaded revision only.

But echo 2 > reload would allow reading a microcode file from 
/lib/firmware/intel-ucode/ even if the revision hasn't changed right?

#echo 1 > reload wouldn't load if the revision on disk is same as what's loaded,
and we want to permit that with the echo 2 option.

Re: [PATCH] x86/microcode: Add an option to reload microcode even if revision is unchanged

2019-08-29 Thread Raj, Ashok

On Thu, Aug 29, 2019 at 08:09:42AM +0200, Borislav Petkov wrote:
> On Wed, Aug 28, 2019 at 10:33:22PM -0700, Ashok Raj wrote:
> > During microcode development, its often required to test different versions
> > of microcode. Intel microcode loader enforces loading only if the revision 
> > is
> > greater than what is currently loaded on the cpu. Overriding this behavior
> > allows us to reuse the same revision during development cycles.
> > This facilty also allows us to share debug microcode with development
> > partners for getting feedback before microcode release.
> > 
> > Microcode developers should have other ways to check which
> > of their internal version is actually loaded. For e.g. checking a
> > temporary MSR for instance. In order to reload the same microcode do as
> > shown below.
> > 
> >  # echo 2 > /sys/devices/system/cpu/microcode/reload
> > 
> >  as root.
> > 
> > 
> > I tested this on top of the parallel ucode load patch
> > 
> > https://lore.kernel.org/r/1566506627-16536-2-git-send-email-mihai.cara...@oracle.com/
> > 
> > v2: [Mihai] Address comments from Boris
> > - Support for AMD
> > - add taint flag
> > - removed global force_ucode_load and parameterized it.
> 
> As I've said before, I don't like the churn in this version and how it
> turns out. I'll have a look at how to do this cleanly when I get some
> free cycles.

Thanks Boris. I'll wait for your updates. I remember your comment on another
simplification from the Boris Ostrovsky 
https://lore.kernel.org/r/20190828191618.go4...@zn.tnic/

Mihai rolled in your suggestions noted above.

BTW, We only need to force on the late-load. Its not needed during early load.

Cheers,
Ashok

Re: [PATCH] x86/microcode: Add an option to reload microcode even if revision is unchanged

2019-08-28 Thread Raj, Ashok

Hi Boris,

sorry i added the wrong message id in the commit log. 

https://lore.kernel.org/r/20190824085300.gb16...@zn.tnic/

On Wed, Aug 28, 2019 at 10:33:22PM -0700, Ashok Raj wrote:
> During microcode development, its often required to test different versions
> of microcode. Intel microcode loader enforces loading only if the revision is
> greater than what is currently loaded on the cpu. Overriding this behavior
> allows us to reuse the same revision during development cycles.
> This facilty also allows us to share debug microcode with development
> partners for getting feedback before microcode release.
> 
> Microcode developers should have other ways to check which
> of their internal version is actually loaded. For e.g. checking a
> temporary MSR for instance. In order to reload the same microcode do as
> shown below.
> 
>  # echo 2 > /sys/devices/system/cpu/microcode/reload
> 
>  as root.
> 
> 
> I tested this on top of the parallel ucode load patch
> 
> https://lore.kernel.org/r/1566506627-16536-2-git-send-email-mihai.cara...@oracle.com/

Please update with the following: 

https://lore.kernel.org/r/20190824085300.gb16...@zn.tnic/

Cheers,
Ashok

Re: [PATCH 1/2] x86/microcode: Update late microcode in parallel

2019-08-28 Thread Raj, Ashok

On Wed, Aug 28, 2019 at 09:13:31PM +0200, Borislav Petkov wrote:
> On Tue, Aug 27, 2019 at 05:56:30PM -0700, Raj, Ashok wrote:
> > > "Cloud customers have expressed discontent as services disappear for
> > > a prolonged time. The restriction is that only one core (or only one
> > > thread of a core in the case of an SMT system) goes through the update
> > > while other cores (or respectively, SMT threads) are quiesced."
> > 
> > the last line seems to imply that only one core can be updated at a time.
> 
> Only one core *is* being updated at a time now, before the parallel
> loading patch. Look at the code. I'm talking about what the code does,
> not what the requirement is.

Crystal :-)

> 
> Maybe it should not say "restriction" above but the sentence should
> start with: "Currently, only one core... "

That will help clear things up.. 

> 
> -- 
> Regards/Gruss,
> Boris.
> 
> Good mailing practices for 400: avoid top-posting and trim the reply.

Re: [PATCH 1/2] x86/microcode: Update late microcode in parallel

2019-08-27 Thread Raj, Ashok

On Tue, Aug 27, 2019 at 02:25:01PM +0200, Borislav Petkov wrote:
> On Mon, Aug 26, 2019 at 01:23:39PM -0700, Raj, Ashok wrote:
> > > Cloud customers have expressed discontent as services disappear for a
> > > prolonged time. The restriction is that only one core goes through the
> > s/one core/one thread of a core/
> > 
> > > update while other cores are quiesced.
> > s/cores/other thread(s) of the core
> 
> Changed it to:
> 
> "Cloud customers have expressed discontent as services disappear for
> a prolonged time. The restriction is that only one core (or only one
> thread of a core in the case of an SMT system) goes through the update
> while other cores (or respectively, SMT threads) are quiesced."

the last line seems to imply that only one core can be updated at a time.
But the only requirement is on a per-core basis, the HT thread sharing
a core must be quiesced while the microcode update is performed. 

for ex. if you have 2 cores, you can update microcode on both cores
at the same time. 

C0T0, C0T1 - If you are updating on C0T0, C0T1 must be waiting for the update
to complete

But You can initiate the microcode update on C1T0 simultaneously. 


> 
> Thx.
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> Good mailing practices for 400: avoid top-posting and trim the reply.

Re: [PATCH 1/2] x86/microcode: Update late microcode in parallel

2019-08-26 Thread Raj, Ashok

On Mon, Aug 26, 2019 at 08:53:05AM -0400, Boris Ostrovsky wrote:
> On 8/24/19 4:53 AM, Borislav Petkov wrote:
> >  
> > +wait_for_siblings:
> > +   if (__wait_for_cpus(&late_cpus_out, NSEC_PER_SEC))
> > +   panic("Timeout during microcode update!\n");
> > +
> > /*
> > -* Increase the wait timeout to a safe value here since we're
> > -* serializing the microcode update and that could take a while on a
> > -* large number of CPUs. And that is fine as the *actual* timeout will
> > -* be determined by the last CPU finished updating and thus cut short.
> > +* At least one thread has completed update on each core.
> > +* For others, simply call the update to make sure the
> > +* per-cpu cpuinfo can be updated with right microcode
> > +* revision.
> 
> 
> What is the advantage of having those other threads go through
> find_patch() and (in Intel case) intel_get_microcode_revision() (which
> involves two MSR accesses) vs. having the master sibling update slaves'
> microcode revisions? There are only two things that need to be updated,
> uci->cpu_sig.rev and c->microcode.
> 

True, yes we could do that. But there is some warm and comfy feeling
that you really read the revision from the thread local copy of the revision
and it matches what was updated in the other thread sibling rather than
just hardcoding the fixup's. The code looks clean in the sense it looks like
you are attempting to upgrade but the new revision is reflected correctly
and you skip the update.

But if you feel complelled, i'm not opposed to it as long as Boris is 
happy with the changes :-).

Cheers,
Ashok

1 2 3 >

1 - 100 of 202 matches

Mail list logo