Re: [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]()

2021-12-02 Thread Raj, Ashok
Hi Thomas

On Thu, Dec 02, 2021 at 09:40:08PM +0100, Thomas Gleixner wrote:
> Ashok,
> 
> On Thu, Dec 02 2021 at 11:21, Ashok Raj wrote:
> > On Thu, Dec 02, 2021 at 11:16:39AM +0100, Thomas Gleixner wrote:
> >> On Wed, Dec 01 2021 at 17:08, Megha Dey wrote:
> >> You're missing the real world use case. The above is fiction.
> >
> > I don't think there is a valid use case for freeing specific vectors. It's
> > true some are special; IDXD has vector #0 like that. But I expect drivers to
> > acquire these special vectors once and never free them until driver
> > tear-down time.
> >
> > But there is a need to free on demand: for a subdevice constructed for idxd
> > pass-through, when the guest is torn down the host would need to free them.
> > Growing on demand only catches one half of the dynamic behavior.
> >
> > IDXD also allocates an interrupt only when the WQ is enabled, and frees it
> > when the WQ is disabled.
> 
> You're talking about IMS not MSI-X here, right? IMS cannot be allocated
> via the PCI/MSI interfaces as we established long ago.
> 
> And if you are talking about the 8 MSI-X interrupts for IDXD then I
> really do not see the point of ever releasing it.

I'm not worried about MSI-X for IDXD :). I assumed the purpose of this
exercise was about two things.

- Fix the VFIO mask/unmask weirdness that ends up disabling and re-enabling
  MSI-X with more interrupts.
  - We only fix that case by not calling disable_msi, but just growing on
    demand.

- Use this as a case to build IMS. But if we treat MSI-X and IMS
  differently, IMS would be a bit different in how the dynamic parts work.

Although there is no real need for MSI-X to be dynamic except to avoid host
vector exhaustion, do you think we could still allocate specific entries?
Since unmask is per-vector, is there a benefit to doing just that vs.
allocating current+N?
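
For illustration, the two call patterns under discussion might look roughly
like this (a sketch only, not code from the patch set; the semantics follow
the kernel-doc quoted later in this thread, and the three-argument shape of
the _at() variant is an assumption):

static int example_grow_vectors(struct pci_dev *pdev)
{
        int ret;

        /* Grow by four more vectors, wherever the core places them ... */
        ret = pci_msix_expand_vectors(pdev, 4);
        if (ret)
                return ret;

        /* ... vs. pinning one specific entry, e.g. hard-wired in the device */
        ret = pci_msix_expand_vectors_at(pdev, 5, 1);
        if (ret)
                return ret;

        return pci_irq_vector(pdev, 5); /* Linux irq number for index 5 */
}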

> 
> >> If a driver would release 1 and 2 then it should explicitly reallocate
> >> 1 and 2 and not let the core decide to magically allocate something.
> >> 
> >> If the driver wants three more after freeing 1, 2 then the core could
> >> just allocate 5, 6, 7, and would still fulfil the caller's request to
> >> allocate three more, right?
> >
> > Since the core is already managing what's allocated and what's free,
> > requiring drivers to manage each allocated entry seems hard, while the core
> > can easily manage it. For the IDXD case, we don't really care which of the
> > IMS entries is being allocated and freed; the driver just wants one of the
> > available IMS entries. The assumption is that the driver would have acquired
> > any special ones upfront with alloc_irqs().
> 
> For MSI-X the free vector use case does not exist today and even if it
> would exist the driver has to know about the index.
> 
> If the index -> function association is hard-wired, it needs to know it
> obviously.
> 
> If it's not hardwired then it still needs to know the resulting index,
> because it has to program that index into a device function register so
> that the device knows which entry to use.
> 
> IMS is not any different. You need to know the index in order to
> associate it to the queue, no? And you need the index in order to figure
> out the Linux irq number.
> 
> But again, that's not a problem of this very API because this API is
> about PCI/MSI and not about IMS.

Fair enough. The thought was that even though MSI-X doesn't require that,
the implementations could be consistent as long as we aren't breaking MSI-X.

But as you said, they don't have to be the same and can differ in how they
are implemented.


> 
> >> And even if it just allocates one, then the caller still has to know the
> >> index upfront. Why? Because it needs to know it in order to get the
> >> Linux interrupt number via pci_irq_vector().
> >
> > If we were to allocate one, the new API can simply return the index
> > directly to the caller, and they call pci_irq_vector() to get the IRQ
> > number.
> 
> That can work, but then we need both variants:
> 
>  pci_msix_alloc_vector_at() and pci_msix_alloc_vector_any()
> 
> Why?
> 
> Because pci_msix_alloc_vector_any() cannot solve the VFIO on demand
> allocation problem and it cannot be used to replace the sparse
> allocations which are done via pci_enable_msix_exact() today.
> 
> If there is an MSI-X use case to allocate any vector then we can
> implement that. If there is none, then we don't need it, right?

agreed.

> 
> >> So if the driver would free the vector for a particular functionality,
> >> or not allocate it in the first place, then it exactly knows what it
> >> freed and what it needs to allocate when it needs that functionality
> >> (again).
> >
> > It doesn't *have* to be that all vectors are special. Some of them are
> > special and are acquired during driver load time; these are allocated once
> > and never freed. The rest are for, say, completion interrupts and such that
> > go with work queues. These can be dynamically allocated and freed.
> >
> > The driver doesn't really 

Re: [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]()

2021-12-02 Thread Raj, Ashok
Hi Thomas,

On Thu, Dec 02, 2021 at 11:16:39AM +0100, Thomas Gleixner wrote:
> Megha,
> 
> On Wed, Dec 01 2021 at 17:08, Megha Dey wrote:
> > On 11/26/2021 5:25 PM, Thomas Gleixner wrote:
> >>   /**
> >> + * pci_msix_expand_vectors_at - Expand MSI-X interrupts for a device
> >> + *
> >> + * @dev:  PCI device to operate on
> >> + * @at:   Allocate at MSI-X index. If @at == PCI_MSI_EXPAND_AUTO
> >> + *the function expands automatically after the last
> > Not sure why some of these changes related to PCI_MSIX_EXPAND_AUTO and 
> > num_descs did not make it to the 'msi' branch.
> > Is this intentional?
> 
> Yes, because I'm not happy about that magic.
> 
> >
> > For instance, say:
> > 1. Driver requests for 5 vectors:
> > pci_enable_msix_range(dev, NULL, 5, 5)
> > =>num_descs = 5
> 
> Driver should not use with pci_enable_msix_range() in the first
> place. But yes, it got 5 vectors now.

Bad start with a deprecated interface :-). 

> 
> > 2. Driver frees vectors at index 1,2:
> > range = {1, 2, 2};
> > pci_msi_teardown_msi_irqs(dev, range)
> 
> That function is not accessible by drivers for a reason.
> 
> > =>num_descs = 3; Current active vectors are at index: 0, 3, 4
> 
> > 3. Driver requests for 3 more vectors using the new API:
> > pci_msix_expand_vectors(dev, 3)
> > =>range.first = 3 => It will try to allocate index 3-5, but we already 
> > have 3,4 active?
> > Ideally, we would want index 1,2 and 5 to be allocated for this request 
> > right?
> >
> > Could you please let me know what I am missing?
> 
> You're missing the real world use case. The above is fiction.

I don't think there is a valid use case for freeing specific vectors. It's
true some are special; IDXD has vector #0 like that. But I expect drivers to
acquire these special vectors once and never free them until driver
tear-down time.

But there is a need to free on demand: for a subdevice constructed for idxd
pass-through, when the guest is torn down the host would need to free them.
Growing on demand only catches one half of the dynamic behavior.

IDXD also allocates an interrupt only when the WQ is enabled, and frees it
when the WQ is disabled.
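
To make that lifecycle concrete, here is a purely hypothetical sketch (the
example_* names and the index-allocation helpers are made up; only
pci_irq_vector(), request_irq() and free_irq() are existing kernel APIs):

/* Hypothetical driver flow -- not the real idxd code.  The point is that
 * completion vectors come and go with WQ enable/disable rather than with
 * driver load/unload. */
static int example_wq_enable(struct example_wq *wq)
{
        int index, ret;

        index = example_msix_alloc_any(wq->pdev);   /* core picks a free entry */
        if (index < 0)
                return index;

        wq->msix_index = index;
        wq->irq = pci_irq_vector(wq->pdev, index);  /* index -> Linux irq */
        ret = request_irq(wq->irq, example_wq_irq_handler, 0, "example_wq", wq);
        if (ret)
                example_msix_free(wq->pdev, index);
        return ret;
}

static void example_wq_disable(struct example_wq *wq)
{
        free_irq(wq->irq, wq);
        example_msix_free(wq->pdev, wq->msix_index);
}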

> 
> If a driver would release 1 and 2 then it should explicitly reallocate
> 1 and 2 and not let the core decide to magically allocate something.
> 
> If the driver wants three more after freeing 1, 2 then the core could
> just allocate 5, 6, 7, and would still fulfil the caller's request to
> allocate three more, right?

Since the core is already managing what's allocated and what's free, requiring
drivers to manage each allocated entry seems hard, while the core can easily
manage it. For the IDXD case, we don't really care which of the IMS entries is
being allocated and freed; the driver just wants one of the available IMS
entries. The assumption is that the driver would have acquired any special
ones upfront with alloc_irqs().


> 
> And even if it just allocates one, then the caller still has to know the
> index upfront. Why? Because it needs to know it in order to get the
> Linux interrupt number via pci_irq_vector().

If we were to allocate one, the new API could simply return the index
directly to the caller, which would then call pci_irq_vector() to get the IRQ
number.
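
A sketch of that calling convention (example_msix_alloc_one() is a made-up
name; pci_irq_vector() is the existing helper):

        index = example_msix_alloc_one(pdev);   /* returns the new MSI-X index */
        if (index < 0)
                return index;
        irq = pci_irq_vector(pdev, index);      /* caller maps index -> irq */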

> 
> > Correspondingly, pci_free_msix_irq_vector(pdev, irq) frees all the 
> > allocated resources associated with MSI-X interrupt with Linux IRQ 
> > number 'irq'.
> > I had issues when trying to dynamically allocate more than 1 interrupt 
> > because I didn't have a clean way to communicate to the driver what 
> > indexes were assigned in the current allocation.
> 
> If the driver is able to free a particular vector then it should exactly
> know what it is doing and which index it is freeing. If it needs that
> particular vector again, then it knows the index, right?
> 
> Let's look how MSI-X works in reality:
> 
> Each vector is associated to a particular function in the device. How
> that association works is device dependent.
> 
> Some devices have hardwired associations, some allow the driver to
> program the association in the device configuration and there is also a
> combination of both.
> 
> So if the driver would free the vector for a particular functionality,
> or not allocate it in the first place, then it exactly knows what it
> freed and what it needs to allocate when it needs that functionality
> (again).

It doesn't *have* to be that all vectors are special. Some of them are
special and are acquired during driver load time; these are allocated once and
never freed. The rest are for, say, completion interrupts and such that go
with work queues. These can be dynamically allocated and freed.

The driver doesn't really care which index it gets or what the next index
should be, but it has to remember the allocated ones so it can pass them down
for the free. Maybe the one we did a while back

https://lore.kernel.org/lkml/1561162778-12669-1-git-send-email-megha@linux.intel.com/

This has a group handle, and 

Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

2020-09-03 Thread Raj, Ashok
Hi Thomas,

Thanks a ton for jumping in and helping to straighten this out for IMS!


On Wed, Aug 26, 2020 at 01:16:28PM +0200, Thomas Gleixner wrote:
> This is the second version of providing a base to support device MSI (non
> PCI based) and on top of that support for IMS (Interrupt Message Storm)

s/Storm/Store

maybe pun intended :-)

> based devices in a halfways architecture independent way.

You mean "halfways" because the message addr and data follow guidelines
per arch (x86 or such), but the location of the storage isn't dictated
by architecture? or did you have something else in mind? 
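
For illustration only, one way a device-defined IMS slot could be laid out
(the layout below is invented, not taken from any spec or driver):

/* The msg_addr/msg_data encoding is whatever the architecture dictates --
 * the same format as an MSI message on x86 -- but where the slot lives
 * (device MMIO, on-device context, system memory) and what sits next to it
 * is entirely up to the device. */
struct example_ims_slot {
        u64 msg_addr;   /* arch-defined interrupt message address */
        u32 msg_data;   /* arch-defined interrupt message data */
        u32 ctrl;       /* device-defined, e.g. a per-slot mask bit */
};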

> 
> The first version can be found here:
> 
> https://lore.kernel.org/r/20200821002424.119492...@linutronix.de
> 

[snip]

> 
> Changes vs. V1:
> 
>- Addressed various review comments and addressed the 0day fallout.
>  - Corrected the XEN logic (Jürgen)
>  - Make the arch fallback in PCI/MSI opt-in not opt-out (Bjorn)
> 
>- Fixed the compose MSI message inconsistency
> 
>- Ensure that the necessary flags are set for device SMI

is that supposed to be MSI? 

Cheers,
Ashok



Re: [Xen-devel] [PATCH v7 09/10] microcode: remove microcode_update_lock

2019-06-13 Thread Raj, Ashok
On Thu, Jun 13, 2019 at 08:08:46AM -0600, Jan Beulich wrote:
> >>> On 13.06.19 at 16:05,  wrote:
> > On Wed, Jun 12, 2019 at 01:38:31AM -0600, Jan Beulich wrote:
> > On 11.06.19 at 18:04,  wrote:
> >>> On Tue, Jun 11, 2019 at 08:46:04PM +0800, Chao Gao wrote:
>  On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
>  >
>  >> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch *patch)
>  >>  
>  >>  mc_intel = patch->mc_intel;
>  >>  
>  >> -/* serialize access to the physical write to MSR 0x79 */
>  >> -spin_lock_irqsave(&microcode_update_lock, flags);
>  >> +BUG_ON(local_irq_is_enabled());
>  >>  
>  >>  /*
>  >>   * Writeback and invalidate caches before updating microcode to 
>  >> avoid
>  >
>  >Thinking about it - what happens if we hit an NMI or #MC here?
>  >watchdog_disable(), a call to which you add in an earlier patch,
>  >doesn't really suppress the generation of NMIs, it only tells the
>  >handler not to look at the accumulated statistics.
>  
>  I think they should be suppressed. Ashok, could you confirm it?
> >>> 
> >>> I think the only sources would be the watchdog as you pointed out
> >>> which we already touch to keep it from expiring. The perf counters
> >>> i'm not an expert in, but i'll check. When we are in stop_machine() type
> >>> flow, its not clear if any of those would fire. (I might be wrong, but let
> >>> me check).
> >>
> >>Well, without disarming the watchdog NMI at the LAPIC / IO-APIC,
> >>how would it _not_ potentially fire?
> > 
> > We plan not to prevent NMI being fired. Instead, if one thread of a core
> > is updating microcode, other threads of this core would stop in the
> > handler of NMI until the update completion. Is this approach acceptable?
> 
> Well, I have to return the question: It is you who knows what is or
> is not acceptable while an ucode update is in progress. In particular
> it obviously matters how much ucode is involved in the delivery of
> an NMI (and in allowing the handler to get to the point where you'd
> "stop" it).
> 
> If the approach you suggest is fine for the NMI case, I'd then wonder
> if it couldn't also be used for the #MC one.

Architecturally only one #MC can be active in the system. If a new #MC
condition happens when MCG_STATUS.MCIP is already set, that would cause a
spontaneous shutdown.

If another NMI arrives on the CPU doing the wrmsr, it will be pended
in the lapic and delivered after the wrmsr returns. wrmsr flow
can't be interrupted. 



Re: [Xen-devel] [PATCH v7 09/10] microcode: remove microcode_update_lock

2019-06-11 Thread Raj, Ashok
On Tue, Jun 11, 2019 at 08:46:04PM +0800, Chao Gao wrote:
> On Wed, Jun 05, 2019 at 08:53:46AM -0600, Jan Beulich wrote:
> >
> >> @@ -307,8 +303,7 @@ static int apply_microcode(const struct microcode_patch *patch)
> >>  
> >>  mc_intel = patch->mc_intel;
> >>  
> >> -/* serialize access to the physical write to MSR 0x79 */
> >> -spin_lock_irqsave(&microcode_update_lock, flags);
> >> +BUG_ON(local_irq_is_enabled());
> >>  
> >>  /*
> >>   * Writeback and invalidate caches before updating microcode to avoid
> >
> >Thinking about it - what happens if we hit an NMI or #MC here?
> >watchdog_disable(), a call to which you add in an earlier patch,
> >doesn't really suppress the generation of NMIs, it only tells the
> >handler not to look at the accumulated statistics.
> 
> I think they should be suppressed. Ashok, could you confirm it?

I think the only sources would be the watchdog, as you pointed out, which we
already touch to keep it from expiring. The perf counters I'm not an expert
in, but I'll check. When we are in a stop_machine() type flow, it's not clear
whether any of those would fire. (I might be wrong, but let me check.)

#MC shouldn't fire once you have entered the rendezvous; if it does, it's
usually fatal, like a three-strike, or something similar. The machine is going
down at that point, so there is nothing we could do to handle it.

Cheers,
Ashok


Re: [Xen-devel] [PATCH v7 08/10] x86/microcode: Synchronize late microcode loading

2019-06-11 Thread Raj, Ashok
Hi Gao, Jan

On Tue, Jun 11, 2019 at 08:36:17PM +0800, Chao Gao wrote:
> On Wed, Jun 05, 2019 at 08:09:43AM -0600, Jan Beulich wrote:
> >
> >There's no comment here and nothing in the description: I don't
> >recall clarification as to whether RDTSC is fine to be issued by a
> >thread when ucode is being updated by another thread on the
> >same core.
> 
> Yes. I think it is fine.
> 
> Ashok, could you share your opinion on this question?
> 

Yes, rdtsc should be fine for other threads to execute while waiting for the 
microcode update to complete on others. We do the same in Linux as well.

Cheers,
Ashok


Re: [Xen-devel] [PATCH v6 11/12] x86/microcode: Synchronize late microcode loading

2019-03-14 Thread Raj, Ashok
Hi Andrew 

Sorry it took a while; I had to check with a couple of others.

+ Asit

On Thu, Mar 14, 2019 at 12:39:46PM +, Andrew Cooper wrote:
[snip]
> 
> On late load failure, we should dump enough information to work out
> exactly what went on, to determine how best to proceed, but the server
> is effectively lost to us.  On late load success, the proposed new
> "version" replaces the current "version".
> 
> And again - I reiterate the point that I think it is fine to have a
> simplifying assumption that we don't have mixed stepping systems to
> start with, presuming this is generally in line with Intel's support
> statement.  If in practice we find mixed stepping systems which are
> supported by an OEM/Intel, we can see about extending the logic.

I checked with Asit and he says it is in fact permitted to be one stepping
behind, even on a multi-socket system. One socket could be at N and the other
at N-1, and that should be supported.

Cheers,
Ashok


Re: [Xen-devel] [PATCH v6 11/12] x86/microcode: Synchronize late microcode loading

2019-03-12 Thread Raj, Ashok
On Mon, Mar 11, 2019 at 03:57:35PM +0800, Chao Gao wrote:
> This patch ports microcode improvement patches from linux kernel.
> 
> Before you read any further: the early loading method is still the
> preferred one and you should always do that. The following patch is
> improving the late loading mechanism for long running jobs and cloud use
> cases.
> 
> Gather all cores and serialize the microcode update on them by doing it
> one-by-one to make the late update process as reliable as possible and
> avoid potential issues caused by the microcode update.
> 
> Signed-off-by: Chao Gao 
> Tested-by: Chao Gao 
> [linux commit: a5321aec6412b20b5ad15db2d6b916c05349dbff]
> [linux commit: bb8c13d61a629276a162c1d2b1a20a815cbcfbb7]
> Cc: Kevin Tian 
> Cc: Jun Nakajima 
> Cc: Ashok Raj 
> Cc: Borislav Petkov 
> Cc: Thomas Gleixner 
> Cc: Andrew Cooper 
> Cc: Jan Beulich 
> ---
> Changes in v6:
>  - Use one timeout period for rendezvous stage and another for update stage.
>  - scale time to wait by the number of remaining cpus to respond.
>It helps to find something wrong earlier and thus we can reboot the
>system earlier.
> ---
> >  xen/arch/x86/microcode.c | 149 ++-
>  1 file changed, 136 insertions(+), 13 deletions(-)
> 
> diff --git a/xen/arch/x86/microcode.c b/xen/arch/x86/microcode.c
> index c510808..96bcef6 100644
> --- a/xen/arch/x86/microcode.c
> +++ b/xen/arch/x86/microcode.c
> @@ -22,6 +22,7 @@
>   */
>  
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -30,15 +31,34 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> +#include 
>  
> +#include 
>  #include 
>  #include 
>  #include 
>  #include 
>  
> +/*
> + * Before performing a late microcode update on any thread, we
> + * rendezvous all cpus in stop_machine context. The timeout for
> + * waiting for cpu rendezvous is 30ms. It is the timeout used by
> + * live patching
> + */
> +#define MICROCODE_CALLIN_TIMEOUT_US 30000
> +
> +/*
> + * Timeout for each thread to complete update is set to 1s. It is a
> + * conservative choice considering all possible interference (for
> + * instance, sometimes wbinvd takes relative long time). And a perfect
> + * timeout doesn't help a lot except an early shutdown.
> + */
> +#define MICROCODE_UPDATE_TIMEOUT_US 1000000
> +
>  static module_t __initdata ucode_mod;
>  static signed int __initdata ucode_mod_idx;
>  static bool_t __initdata ucode_mod_forced;
> @@ -189,6 +209,12 @@ static DEFINE_SPINLOCK(microcode_mutex);
>  DEFINE_PER_CPU(struct cpu_signature, cpu_sig);
>  
>  /*
> + * Count the CPUs that have entered and exited the rendezvous
> + * during late microcode update.
> + */
> +static atomic_t cpu_in, cpu_out;
> +
> +/*
>   * Save an ucode patch to the global cache list.
>   *
>   * Return true if a patch is added to the global cache or it replaces
> @@ -284,25 +310,86 @@ static int microcode_update_cpu(void)
>  return ret;
>  }
>  
> -static long do_microcode_update(void *unused)
> +/* Wait for CPUs to rendezvous with a timeout (us) */
> +static int wait_for_cpus(atomic_t *cnt, unsigned int expect,
> + unsigned int timeout)
>  {
> -int error, cpu;
> +while ( atomic_read(cnt) < expect )
> +{
> +if ( timeout <= 0 )

timeout is an unsigned int, so it will never be < 0; maybe flip the test so it
reads better? (See the sketch after the end of this function below.)

> +{
> +printk("CPU%d: Timeout when waiting for CPUs calling in\n",
> +   smp_processor_id());
> +return -EBUSY;
> +}
> +udelay(1);
> +timeout--;
> +}
> +
> +return 0;
> +}
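
Re the unsigned-timeout comment above, a minimal fixed-up sketch (same shape
as the hunk; only the exit test changes):

static int wait_for_cpus(atomic_t *cnt, unsigned int expect,
                         unsigned int timeout)
{
    while ( atomic_read(cnt) < expect )
    {
        if ( !timeout )   /* unsigned: test for zero instead of <= 0 */
        {
            printk("CPU%d: Timeout when waiting for CPUs calling in\n",
                   smp_processor_id());
            return -EBUSY;
        }
        udelay(1);
        timeout--;
    }

    return 0;
}
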
>  
> -error = microcode_update_cpu();
> -if ( error )
> -return error;
> +static int do_microcode_update(void *unused)
> +{
> +unsigned int cpu = smp_processor_id();
> +unsigned int cpu_nr = num_online_cpus();
> +unsigned int finished;
> +int ret;
> +static bool error;
>  
> -cpu = cpumask_next(smp_processor_id(), &cpu_online_map);
> -if ( cpu < nr_cpu_ids )
> -return continue_hypercall_on_cpu(cpu, do_microcode_update, NULL);
>  
> -return error;
> +atomic_inc(&cpu_in);
> +ret = wait_for_cpus(&cpu_in, cpu_nr, MICROCODE_CALLIN_TIMEOUT_US);
> +if ( ret )
> +return ret;
> +
> +/*
> + * Initiate an update on all processors which don't have an online 
> sibling
> + * thread with a lower thread id. Other sibling threads just await the
> + * completion of microcode update.
> + */

The above comment isn't clear. It looks like you are doing the update only on
the first CPU in the sibling map?

> +if ( cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
> +ret = microcode_update_cpu();

Does ret carry anything useful if the update failed? It doesn't seem to be
used before you overwrite it later in collect_cpu_info()?


> +/*
> + * Increase the wait timeout to a safe value here since we're serializing
> + * the 

Re: [Xen-devel] [PATCH v6 04/12] microcode: introduce a global cache of ucode patch

2019-03-12 Thread Raj, Ashok
On Tue, Mar 12, 2019 at 05:53:53PM +0100, Roger Pau Monné wrote:
> On Mon, Mar 11, 2019 at 03:57:28PM +0800, Chao Gao wrote:
> > to replace the current per-cpu cache 'uci->mc'.
> > 
> > Compared to the current per-cpu cache, the benefits of the global
> > microcode cache are:
> > 1. It reduces the work that need to be done on each CPU. Parsing ucode
> > file is done once on one CPU. Other CPUs needn't parse ucode file.
> > Instead, they can find out and load the newest patch from the global
> > cache.
> > 2. It reduces the memory consumption on a system with many CPU cores.
> > 
> > Two functions, microcode_save_patch() and microcode_find_patch() are
> > introduced. The former adds one given patch to the global cache. The
> > latter gets a newer and matched ucode patch from the global cache.
> > All operations on the cache is expected to be done with the
> > 'microcode_mutex' hold.
> > 
> > Note that I deliberately avoid touching 'uci->mc' as I am going to
> > remove it completely in the next patch.
> > 
> > Signed-off-by: Chao Gao 
> > ---
> > Changes in v6:
> >  - constify local variables and function parameters if possible
> >  - comment that the global cache is protected by 'microcode_mutex'.
> >and add assertions to catch violations in microcode_{save/find}_patch()
> > 
> > Changes in v5:
> >  - reword the commit description
> >  - find_patch() and save_patch() are abstracted into common functions
> >with some hooks for AMD and Intel
> > ---
> >  xen/arch/x86/microcode.c| 60 +++
> >  xen/arch/x86/microcode_amd.c| 91 +
> >  xen/arch/x86/microcode_intel.c  | 66 ++
> >  xen/include/asm-x86/microcode.h | 13 ++
> >  4 files changed, 215 insertions(+), 15 deletions(-)
> > 
> > diff --git a/xen/arch/x86/microcode.c b/xen/arch/x86/microcode.c
> > index 4163f50..e629e6c 100644

[snip]
> 
> I think this is rather too simplistic, and I'm not sure you really
> need a list in order to store the microcodes.
> 
> IIRC we agreed that systems with mixed CPU versions are not supported,
> hence the same microcode blob should be used to update all the
> possible CPUs on the system, so a list should not be needed here.
> 
> Also I'm afraid that freeing the old microcode when a new version is
> uploaded is not fully correct. For example given the following
> scenario:
> 
> 1. Upload a microcode blob to the hypervisor and apply it to online
> CPUs.
> 
> 2. Upload a microcode blob with a higher version than the previous one,
> but which fails to apply. This microcode would replace the
> previous one.
> 
> 3. Online a CPU. This CPU will try to use the last uploaded microcode
> and fail, because last uploaded version is broken. Newly onlined CPUs
> would then end up with a microcode version different from the
> currently running CPUs, likely breaking the system.
> 
> I think the best way to solve this is to ditch the list usage and
> instead only keep at most two microcode versions cached in the
> hypervisor:
> 
>  - The microcode version that's currently successfully applied (if any).
>  - A microcode version higher than the current version, that has yet
>to be applied.
> 
> Once this new microcode version has been applied it will replace the
> previously applied version. If the new microcode version fails to
> apply it will be discarded, thus keeping a copy of the currently
> applied microcode version.
> 
> With this approach AFAICT you only need two variables, one to store
> the currently applied microcode_patch and another one to save the new
> microcode version in order to attempt to apply it.

This sounds very reasonable!

> 
> I think this will also simplify some of the code. Let me know if this
> sounds sensible, or if I'm missing something.
> 
> Thanks, Roger.
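
For concreteness, a minimal sketch of the two-slot cache described above
(names are illustrative, not from the eventual patch; free_patch() is a
hypothetical helper):

/* Two slots: the patch known to apply cleanly, and a newer candidate that
 * has not been validated yet. */
static struct microcode_patch *applied_patch;   /* used for newly onlined CPUs */
static struct microcode_patch *pending_patch;   /* uploaded, not yet applied */

static void example_commit_pending(bool applied_ok)
{
    if ( applied_ok )
    {
        free_patch(applied_patch);      /* old version no longer needed */
        applied_patch = pending_patch;  /* promote the candidate */
    }
    else
        free_patch(pending_patch);      /* discard the broken candidate */

    pending_patch = NULL;
}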


Re: [Xen-devel] [PATCH v5 7/8] x86/microcode: Synchronize late microcode loading

2019-02-11 Thread Raj, Ashok
On Mon, Feb 11, 2019 at 02:35:30PM +0100, Juergen Gross wrote:
> On 11/02/2019 14:23, Jan Beulich wrote:
>  On 11.02.19 at 06:40,  wrote:
> >> On Fri, Feb 08, 2019 at 09:29:32AM -0700, Jan Beulich wrote:
> >> On 28.01.19 at 08:06,  wrote:
>  +/*
>  + * Initiate an update on all processors which don't have an online sibling
>  + * thread with a lower thread id. Other sibling threads just await the
>  + * completion of microcode update.
>  + */
>  +if ( cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
>  +ret = microcode_update_cpu();
>  +/*
>  + * Increase the wait timeout to a safe value here since we're serializing
>  + * the microcode update and that could take a while on a large number of
>  + * CPUs. And that is fine as the *actual* timeout will be determined by
>  + * the last CPU finished updating and thus cut short
>  + */
>  +if ( wait_for_cpus(&cpu_out, MICROCODE_DEFAULT_TIMEOUT_US * nr_cores) )
>  +panic("Timeout when finishing updating microcode");
> >>>
> >>> While I expect this to go away again in the next patch, I'd still like to
> >>> see this improved, in particular in case the patch here goes in
> >>> independently of the next one. After all on a system with 100 cores
> >>> the timeout totals to a whopping 3 seconds.
> >>
> >> To be clear, the timeout remains the same in the next patch due to
> >> the serial print clause in apply_microcode().
> >>
> >>>
> >>> Generally the time needed to wait scales by the number of CPUs still
> >>> in need of doing the update. And if a timeout is really to occur, it's
> >>> perhaps because of one bad core or socket, not because nothing
> >>> works at all. Hence it would seem both nice and possible to scale the
> >>> "remaining time to wait" by the (known) number of remaining
> >>> processors to respond.
> >>
> >> Basically, I think the benefit is we can recognize the failure earlier
> >> if no core called in in a given interval (i.e. 30ms), and trigger a
> >> panic. Considering for such case, even with this optimization, the
> >> system needs reboot, which generally takes several minutes, what's the
> >> value of this optimization?
> > 
> > Hmm, on one hand this is a fair point you make. Otoh, why do
> > you add any timeout at all, if we say we're hosed anyway if the
> > timeout expires? You could then as well log a message (say
> > once a second) about how many (or which) CPUs still didn't
> > respond. The admin can then still reboot the system if desired.
> 
> That's not a data center friendly approach.
> 
> The ability to do microcode update in an online system might by
> risky, but in case of failure requiring access to the console or
> power settings of the system isn't nice.

That's right... the reason it cannot be a perfect number is that it really
depends on several factors. If the other thread is in a long-flow instruction
we can only break at an instruction boundary, even with an IPI. Say you were
in wbinvd(), for example, or some new ISA instruction taking its sweet time.
In the grand scheme of things it's less important to fine-tune this number.

On the other hand, not panicking and just waiting for the sysadmin certainly
isn't datacenter friendly, as you had pointed out.
> 
> I think doing a panic() after some timeout is a sensible way to
> handle a failure.
> 
> In case you'd like having a way to wait longer: we could allow the
> "noreboot" parameter to be modified at runtime and do the panic only
> if opt_noreboot isn't set.
> 
> 
> Juergen
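
A sketch of what Juergen suggests (illustrative only; it assumes the
"noreboot" flag, opt_noreboot, were reachable from the microcode code and
changeable at runtime):

    if ( wait_for_cpus(&cpu_out, MICROCODE_DEFAULT_TIMEOUT_US * nr_cores) )
    {
        if ( !opt_noreboot )
            panic("Timeout when finishing updating microcode");

        printk(XENLOG_ERR
               "Microcode update timed out; system state is suspect\n");
    }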


Re: [Xen-devel] [PATCH v4 6/6] x86/microcode: Synchronize late microcode loading

2018-12-11 Thread Raj, Ashok
On Tue, Dec 11, 2018 at 10:01:11AM -0700, Jan Beulich wrote:
> >>> On 28.11.18 at 06:34,  wrote:
> > This patch ports microcode improvement patches from linux kernel.
> > 
> > Before you read any further: the early loading method is still the
> > preferred one and you should always do that. The following patch is
> > improving the late loading mechanism for long running jobs and cloud use
> > cases.
> > 
> > Gather all cores and serialize the microcode update on them by doing it
> > one-by-one to make the late update process as reliable as possible and
> > avoid potential issues caused by the microcode update.
> 
> So you still didn't switch to process cores or at the very least
> sockets in parallel?

Let me sync with Gao on this. We did do a patch for the upstream kernel,
but Boris wasn't interested in taking any patches to improve live microcode
update because there is no report of breakage with what we do today with the
big-hammer approach. Plus, the preferred path should be BIOS, next is early
load, and the third option is live update.

But since Xen is a different source base, you are free to make
those changes.

BTW: apart from the fact that it's ugly and takes a long time to complete, do
you have any practical issues you want to highlight? Maybe that can
help upstream as well.

Cheers,
Ashok


> 
> Jan
> 
> 


Re: [Xen-devel] [PATCH 2/2] x86/microcode: Do not upload microcode if CPUs are offline

2018-04-13 Thread Raj, Ashok
Hi Jan

+ Boris

On Fri, Apr 13, 2018 at 09:57:25AM -0600, Jan Beulich wrote:
> >>> On 30.03.18 at 08:59, <chao@intel.com> wrote:
> > This patch is to backport the patch below from linux kernel.
> > 
> > commit 30ec26da9967d0d785abc24073129a34c3211777
> > Author: Ashok Raj <ashok@intel.com>
> > Date:   Wed Feb 28 11:28:43 2018 +0100
> > 
> > x86/microcode: Do not upload microcode if CPUs are offline
> > 
> > Avoid loading microcode if any of the CPUs are offline, and issue a
> > warning. Having different microcode revisions on the system at any 
> > time
> > is outright dangerous.
> 
> I'm afraid I don't fully agree - not applying an ucode update to the online
> CPUs because some are offline isn't any better. Plus (while updating)
> there's always going to be some discrepancy between ucode versions.
> As long as we apply updates while bringing a CPU online, I think we're fine.

This is the safest option. Microcode is considered part of the CPU. We don't
allow CPUs with different capabilities in the same system; yes, they might
work, but it is not something we allow.

In general we recommend early update: the earlier, the better.

- BIOS update (difficult to deploy, but some microcode updates have to be
  done this way).
- Early update from the initrd. Almost the same as #1; since we apply it at
  the earliest chance, that's the closest and most recommended method.
- Late update. Before this procedure of stopping all CPUs, we did have a
  window when some CPUs were updated and some weren't up to date yet. This
  synchronized update is precisely to get as close as possible to updating
  all of them.
> 
> Also please consider valid cases you make not work anymore, like someone
> having brought offline all sibling hyperthreads, with each core still having
> one thread active. In that case an ucode update will implicitly update all
> offline threads as well.

Of course we can tweak this to be much better; there are other ideas, but
this is an effort to keep things simple and also address the post-Spectre
microcode requirements for some processors. In the grand scheme of things,
although it's interesting to allow such updates, we think it may not be best
practice.

We want to get this working right first before getting fancy.

Cheers,
Ashok


Re: [Xen-devel] [PATCH 1/2] x86/microcode: Synchronize late microcode loading

2018-04-12 Thread Raj, Ashok
On Fri, Mar 30, 2018 at 02:59:00PM +0800, Chao Gao wrote:
> From: Gao Chao <chao@intel.com>
> 
> This patch is to backport microcode improvement patches from linux
> kernel. Below are the original patches description:
> 
> commit a5321aec6412b20b5ad15db2d6b916c05349dbff
>     Author: Ashok Raj <ashok@intel.com>
> Date:   Wed Feb 28 11:28:46 2018 +0100
> 
>   x86/microcode: Synchronize late microcode loading
> 
>   Original idea by Ashok, completely rewritten by Borislav.
> 
>   Before you read any further: the early loading method is still the
>   preferred one and you should always do that. The following patch is
>   improving the late loading mechanism for long running jobs and cloud use
>   cases.
> 
>   Gather all cores and serialize the microcode update on them by doing it
>   one-by-one to make the late update process as reliable as possible and
>   avoid potential issues caused by the microcode update.
> 
>   [ Borislav: Rewrite completely. ]
> 
>   Co-developed-by: Borislav Petkov <b...@suse.de>
>   Signed-off-by: Ashok Raj <ashok@intel.com>
>   Signed-off-by: Borislav Petkov <b...@suse.de>
>   Signed-off-by: Thomas Gleixner <t...@linutronix.de>
>   Tested-by: Tom Lendacky <thomas.lenda...@amd.com>
>   Tested-by: Ashok Raj <ashok@intel.com>

The Tested-by tags were fine for the Linux upstream commit. Can you make sure
you add your own name under Tested-by for the Xen version?

>   Reviewed-by: Tom Lendacky <thomas.lenda...@amd.com>
>   Cc: Arjan Van De Ven <arjan.van.de@intel.com>
>   Link: https://lkml.kernel.org/r/20180228102846.13447-8...@alien8.de
> 
> +static int do_microcode_update(void *_info)
> +{
> +struct microcode_info *info = _info;
> +int error, ret = 0;
> +
> +error = __wait_for_cpus(&info->cpu_in, USEC_PER_SEC);
> +if ( error )
> +{
> +ret = -EBUSY;
> +return ret;
> +}
> +
> +error = microcode_update_cpu(info->buffer, info->buffer_size);
> +if ( error && !ret )
> +ret = error;

Isn't this redundant? Can ret even be non-zero by the time this check runs?

> +/*
> + * Increase the wait timeout to a safe value here since we're serializing
> + * the microcode update and that could take a while on a large number of
> + * CPUs. And that is fine as the *actual* timeout will be determined by
> + * the last CPU finished updating and thus cut short
> + */
> +error = __wait_for_cpus(&info->cpu_out, USEC_PER_SEC * num_online_cpus());
> +if (error)
> +panic("Timeout when finishing updating microcode");
> +info->error = ret;
> +return ret;
>  }
>  


