Re: [ACPI] Re: [RFC 5/6] clean cpu state after hotremove CPU

2005-04-04 Thread Ashok Raj
On Mon, Apr 04, 2005 at 03:46:20PM -0700, Nathan Lynch wrote:
> 
>Hi Nigel!
> 
>On Tue, Apr 05, 2005 at 08:14:25AM +1000, Nigel Cunningham wrote:
>>
>> On Tue, 2005-04-05 at 01:33, Nathan Lynch wrote:
>>  >  > Yes, exactly. Someone who understands do_exit please help clean
> 
>No, that wouldn't work.  I am saying that there's little to gain by
>adding all this complexity for destroying the idle tasks when it's
>fairly simple to create num_possible_cpus() - 1 idle tasks* to
>accommodate any additional cpus which may come along.  This is what
>ppc64 does now, and it should be feasible on any architecture which
>supports cpu hotplug.
> 
>Nathan
> 
>* num_possible_cpus() - 1 because the idle task for the boot cpu is
>  created in sched_init.
> 

On ia64 we create idle threads on demand if one is not already available for
the given logical cpu number, and reuse them when the same logical cpu number
comes back online.
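A minimal sketch of that reuse scheme (illustrative only; the array and
helper names here are hypothetical, not the actual ia64 code):

/* Cache idle tasks per logical cpu number so a re-added cpu reuses
 * the idle task created for that slot earlier. Hypothetical sketch. */
static struct task_struct *idle_tasks[NR_CPUS];

static struct task_struct *get_idle_for_cpu(int cpu)
{
        /* Reuse the idle task if this logical cpu was online before. */
        if (idle_tasks[cpu])
                return idle_tasks[cpu];

        /* First use of this logical cpu number: create a new idle task. */
        idle_tasks[cpu] = fork_idle(cpu);
        return idle_tasks[cpu];
}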

Just a minor improvement; I also thought about tearing down the idle task on
exit, but it wasn't worth anything in return.

Cheers,
ashok


Re: [PATCH] PC300 pci_enable_device fix

2005-04-13 Thread Ashok Raj
On Wed, Apr 13, 2005 at 02:31:43PM -0700, Bjorn Helgaas wrote:
> 
>Call pci_enable_device() before looking at IRQ and resources.
>The driver requires this fix or the "pci=routeirq" workaround
>on 2.6.10 and later kernels.


The failure cases don't seem to worry about calling pci_disable_device()?

In err_release_ram: etc.?
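For illustration, a hedged sketch of the balanced enable/disable pattern
being suggested (example_probe and request_resources are hypothetical
placeholders, not the actual pc300 code):

static int example_probe(struct pci_dev *pdev)
{
        int err;

        err = pci_enable_device(pdev);
        if (err)
                return err;

        err = request_resources(pdev);  /* hypothetical placeholder step */
        if (err)
                goto err_disable;

        return 0;

err_disable:
        /* Undo pci_enable_device() on every failure path after it. */
        pci_disable_device(pdev);
        return err;
}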

> 
>Reported and tested by Artur Lipowski.
> 
>Signed-off-by: Bjorn Helgaas <[EMAIL PROTECTED]>
> 
>===== drivers/net/wan/pc300_drv.c 1.24 vs edited =====
>--- 1.24/drivers/net/wan/pc300_drv.c    2004-12-29 12:25:16 -07:00
>+++ edited/drivers/net/wan/pc300_drv.c  2005-04-13 13:35:21 -06:00
>@@ -3439,6 +3439,9 @@
> #endif
>}
> 
>+   if ((err = pci_enable_device(pdev)) != 0)
>+   return err;
>+
>card = (pc300_t *) kmalloc(sizeof(pc300_t), GFP_KERNEL);
>if (card == NULL) {
>printk("PC300 found at RAM 0x%08lx, "
>@@ -3526,9 +3529,6 @@
>err = -ENODEV;
>goto err_release_ram;
>}
>-
>-   if ((err = pci_enable_device(pdev)) != 0)
>-   goto err_release_sca;
> 
>  card->hw.plxbase   =   ioremap(card->hw.plxphys,
>card->hw.plxsize);
>  card->hw.rambase   =   ioremap(card->hw.ramphys,
>card->hw.alloc_ramsize);
> 

-- 
Cheers,
Ashok Raj
- Open Source Technology Center


Extending defconfig for x86_64

2005-07-21 Thread Ashok Raj
Hi Andi

This patch is a trivial one. It provides a different defconfig for x86_64.

People keep getting bitten by which SCSI controller/ethernet driver to
enable. It might be possible to set up configs for other systems as well, if
there are well-known system names, to make things simple for developers.

Please consider for next update.

-- 
Cheers,
Ashok Raj
- Open Source Technology Center


This provides a working default config file for Intel systems.

Tested on Harwich (4P + HT systems); if more are required, either add
to this config or create new defconfigs as needed.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
--
 arch/x86_64/configs/harwich_defconfig | 1185 ++
 1 files changed, 1185 insertions(+)

Index: linux-2.6.13-rc3-mm1/arch/x86_64/configs/harwich_defconfig
===================================================================
--- /dev/null
+++ linux-2.6.13-rc3-mm1/arch/x86_64/configs/harwich_defconfig
@@ -0,0 +1,1185 @@
+#
+# Automatically generated make config: don't edit
+# Linux kernel version: 2.6.13-rc3
+# Mon Jul 18 12:18:34 2005
+#
+CONFIG_X86_64=y
+CONFIG_64BIT=y
+CONFIG_X86=y
+CONFIG_MMU=y
+CONFIG_RWSEM_GENERIC_SPINLOCK=y
+CONFIG_GENERIC_CALIBRATE_DELAY=y
+CONFIG_X86_CMPXCHG=y
+CONFIG_EARLY_PRINTK=y
+CONFIG_GENERIC_ISA_DMA=y
+CONFIG_GENERIC_IOMAP=y
+
+#
+# Code maturity level options
+#
+CONFIG_EXPERIMENTAL=y
+CONFIG_CLEAN_COMPILE=y
+CONFIG_LOCK_KERNEL=y
+CONFIG_INIT_ENV_ARG_LIMIT=32
+
+#
+# General setup
+#
+CONFIG_LOCALVERSION=""
+CONFIG_SWAP=y
+CONFIG_SYSVIPC=y
+CONFIG_POSIX_MQUEUE=y
+# CONFIG_BSD_PROCESS_ACCT is not set
+CONFIG_SYSCTL=y
+# CONFIG_AUDIT is not set
+CONFIG_HOTPLUG=y
+CONFIG_KOBJECT_UEVENT=y
+CONFIG_IKCONFIG=y
+CONFIG_IKCONFIG_PROC=y
+# CONFIG_CPUSETS is not set
+# CONFIG_EMBEDDED is not set
+CONFIG_KALLSYMS=y
+CONFIG_KALLSYMS_ALL=y
+# CONFIG_KALLSYMS_EXTRA_PASS is not set
+CONFIG_PRINTK=y
+CONFIG_BUG=y
+CONFIG_BASE_FULL=y
+CONFIG_FUTEX=y
+CONFIG_EPOLL=y
+CONFIG_SHMEM=y
+CONFIG_CC_ALIGN_FUNCTIONS=0
+CONFIG_CC_ALIGN_LABELS=0
+CONFIG_CC_ALIGN_LOOPS=0
+CONFIG_CC_ALIGN_JUMPS=0
+# CONFIG_TINY_SHMEM is not set
+CONFIG_BASE_SMALL=0
+
+#
+# Loadable module support
+#
+CONFIG_MODULES=y
+CONFIG_MODULE_UNLOAD=y
+CONFIG_MODULE_FORCE_UNLOAD=y
+CONFIG_OBSOLETE_MODPARM=y
+# CONFIG_MODVERSIONS is not set
+# CONFIG_MODULE_SRCVERSION_ALL is not set
+# CONFIG_KMOD is not set
+CONFIG_STOP_MACHINE=y
+
+#
+# Processor type and features
+#
+# CONFIG_MK8 is not set
+# CONFIG_MPSC is not set
+CONFIG_GENERIC_CPU=y
+CONFIG_X86_L1_CACHE_BYTES=128
+CONFIG_X86_L1_CACHE_SHIFT=7
+CONFIG_X86_TSC=y
+CONFIG_X86_GOOD_APIC=y
+# CONFIG_MICROCODE is not set
+CONFIG_X86_MSR=y
+CONFIG_X86_CPUID=y
+CONFIG_X86_HT=y
+CONFIG_X86_IO_APIC=y
+CONFIG_X86_LOCAL_APIC=y
+CONFIG_MTRR=y
+CONFIG_SMP=y
+CONFIG_SCHED_SMT=y
+CONFIG_PREEMPT_NONE=y
+# CONFIG_PREEMPT_VOLUNTARY is not set
+# CONFIG_PREEMPT is not set
+CONFIG_PREEMPT_BKL=y
+# CONFIG_K8_NUMA is not set
+# CONFIG_NUMA_EMU is not set
+# CONFIG_NUMA is not set
+CONFIG_ARCH_FLATMEM_ENABLE=y
+CONFIG_SELECT_MEMORY_MODEL=y
+CONFIG_FLATMEM_MANUAL=y
+# CONFIG_DISCONTIGMEM_MANUAL is not set
+# CONFIG_SPARSEMEM_MANUAL is not set
+CONFIG_FLATMEM=y
+CONFIG_FLAT_NODE_MEM_MAP=y
+CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y
+CONFIG_HAVE_DEC_LOCK=y
+CONFIG_NR_CPUS=8
+CONFIG_HOTPLUG_CPU=y
+CONFIG_HPET_TIMER=y
+CONFIG_X86_PM_TIMER=y
+CONFIG_HPET_EMULATE_RTC=y
+CONFIG_GART_IOMMU=y
+CONFIG_SWIOTLB=y
+CONFIG_X86_MCE=y
+CONFIG_X86_MCE_INTEL=y
+CONFIG_PHYSICAL_START=0x100000
+# CONFIG_KEXEC is not set
+CONFIG_SECCOMP=y
+# CONFIG_HZ_100 is not set
+CONFIG_HZ_250=y
+# CONFIG_HZ_1000 is not set
+CONFIG_HZ=250
+CONFIG_GENERIC_HARDIRQS=y
+CONFIG_GENERIC_IRQ_PROBE=y
+CONFIG_ISA_DMA_API=y
+
+#
+# Power management options
+#
+CONFIG_PM=y
+# CONFIG_PM_DEBUG is not set
+CONFIG_SOFTWARE_SUSPEND=y
+CONFIG_PM_STD_PARTITION=""
+CONFIG_SUSPEND_SMP=y
+
+#
+# ACPI (Advanced Configuration and Power Interface) Support
+#
+CONFIG_ACPI=y
+CONFIG_ACPI_BOOT=y
+CONFIG_ACPI_INTERPRETER=y
+# CONFIG_ACPI_SLEEP is not set
+CONFIG_ACPI_AC=y
+CONFIG_ACPI_BATTERY=y
+CONFIG_ACPI_BUTTON=y
+# CONFIG_ACPI_VIDEO is not set
+CONFIG_ACPI_HOTKEY=m
+CONFIG_ACPI_FAN=y
+CONFIG_ACPI_PROCESSOR=y
+CONFIG_ACPI_HOTPLUG_CPU=y
+CONFIG_ACPI_THERMAL=y
+# CONFIG_ACPI_ASUS is not set
+# CONFIG_ACPI_IBM is not set
+CONFIG_ACPI_TOSHIBA=y
+CONFIG_ACPI_BLACKLIST_YEAR=2001
+# CONFIG_ACPI_DEBUG is not set
+CONFIG_ACPI_BUS=y
+CONFIG_ACPI_EC=y
+CONFIG_ACPI_POWER=y
+CONFIG_ACPI_PCI=y
+CONFIG_ACPI_SYSTEM=y
+CONFIG_ACPI_CONTAINER=y
+
+#
+# CPU Frequency scaling
+#
+CONFIG_CPU_FREQ=y
+CONFIG_CPU_FREQ_TABLE=y
+# CONFIG_CPU_FREQ_DEBUG is not set
+CONFIG_CPU_FREQ_STAT=y
+# CONFIG_CPU_FREQ_STAT_DETAILS is not set
+CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
+# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
+CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
+# CONFIG_CPU_FREQ_GOV_POWERSAVE is not set
+CONFIG_CPU_FREQ_GOV_USERSPACE=y
+CONFIG_CPU_FREQ_GOV_ONDEMAND=y
+# CONFIG_

2.6.13-rc5-mm1 doesn't boot on x86_64

2005-08-08 Thread Ashok Raj
Folks,

I am getting this on the recent 2.6.13-rc5-mm1 kernel, built with defconfig.

Cheers,
Ashok Raj

--- [cut here ] - [please bite here ] -
Kernel BUG at "include/linux/list.h":165
invalid operand:  [1] SMP
CPU 2
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.13-rc5-mm1
RIP: 0010:[] 
{attribute_container_unregist}RSP: 0018:8100bfb63f00  
EFLAGS: 00010283
RAX: 8100bfbd4c58 RBX: 8100bfbd4c00 RCX: 804e6600
RDX: 00200200 RSI:  RDI: 804e6600
RBP:  R08: 8100bfbd4c48 R09: 0020
R10:  R11: 8019baa0 R12: 80100190
R13:  R14: 8010 R15: 80627fb0
FS:  () GS:80616980() knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2:  CR3: 00101000 CR4: 06e0
Process swapper (pid: 1, threadinfo 8100bfb62000, task 8100bfb614d0)
Stack: 8032643d  8064499f 80100190
   80651288  8010b249 0246
   00020800 804ae180
Call Trace:{spi_release_transport+13} {ahd} 
  {init+505} {child_rip+8}
   {init+0} {child_rip+0}


Code: 0f 0b a3 e1 d9 44 80 ff ff ff ff c2 a5 00 49 8b 00 4c 39 40
RIP {attribute_container_unregister+52} RSP  
<0>Kernel panic - not syncing: Attempted to kill init!



Re: 2.6.13-rc5-mm1 doesn't boot on x86_64

2005-08-08 Thread Ashok Raj
On Mon, Aug 08, 2005 at 07:11:26PM +0200, Andi Kleen wrote:
> On Mon, Aug 08, 2005 at 09:48:19AM -0700, Ashok Raj wrote:
> > Folks,
> > 
> > I am getting this on the recent 2.6.13-rc5-mm1 kernel, built with defconfig.
> > 
> > Cheers,
> > Ashok Raj
> > 
> > --- [cut here ] - [please bite here ] -
> > Kernel BUG at "include/linux/list.h":165
> > invalid operand:  [1] SMP
> > CPU 2
> > Modules linked in:
> > Pid: 1, comm: swapper Not tainted 2.6.13-rc5-mm1
> > RIP: 0010:[] 
> > {attribute_container_unregist}RSP: 0018:8100bfb63f00  
> > EFLAGS: 00010283
> > RAX: 8100bfbd4c58 RBX: 8100bfbd4c00 RCX: 804e6600
> > RDX: 00200200 RSI:  RDI: 804e6600
> > RBP:  R08: 8100bfbd4c48 R09: 0020
> > R10:  R11: 8019baa0 R12: 80100190
> > R13:  R14: 8010 R15: 80627fb0
> > FS:  () GS:80616980() knlGS:
> > CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
> > CR2:  CR3: 00101000 CR4: 06e0
> > Process swapper (pid: 1, threadinfo 8100bfb62000, task 8100bfb614d0)
> > Stack: 8032643d  8064499f 80100190
> >80651288  8010b249 0246
> >00020800 804ae180
> > Call Trace:{spi_release_transport+13} 
> > {ahd}   {init+505} 
> > {child_rip+8}
> >{init+0} {child_rip+0}
> 
> Looks like a SCSI problem. The machine has an Adaptec SCSI adapter, right?

Yep, it's an Adaptec problem.

Actually I don't need AIC7XXX, since my system requires only CONFIG_FUSION.
I turned that option off, and it seems to boot fine now.

Ashok


> 
> -Andi
> > 
> > 
> > Code: 0f 0b a3 e1 d9 44 80 ff ff ff ff c2 a5 00 49 8b 00 4c 39 40
> > RIP {attribute_container_unregister+52} RSP 
> >  <0>Kernel panic - not syncing: Attempted to kill init!
> > 

-- 
Cheers,
Ashok Raj
- Open Source Technology Center


Re: 2.6.13-rc5-mm1 doesn't boot on x86_64

2005-08-08 Thread Ashok Raj
On Mon, Aug 08, 2005 at 12:33:29PM -0500, James Bottomley wrote:
> On Mon, 2005-08-08 at 19:11 +0200, Andi Kleen wrote:
> > Looks like a SCSI problem. The machine has an Adaptec SCSI adapter, right?
> 
> The traceback looks pretty meaningless.
> 
> What was happening on the machine before this.  i.e. was it booting up,
> in which case can we have the prior dmesg file; or was the aic79xxx
> driver being removed?

I can get the trace again, but basically the system was booting.

AIC7XXX was enabled in the defconfig, but my system doesn't have that
hardware. The scenario seems to be that the driver tried to probe, found
nothing, and then tried to deregister, resulting in the BUG().

I will recompile and try to get the entire dmesg log in the meantime.
> 
> James
> 
> 

-- 
Cheers,
Ashok Raj
- Open Source Technology Center


Re: 2.6.13-rc5-mm1 doesn't boot on x86_64

2005-08-09 Thread Ashok Raj
On Mon, Aug 08, 2005 at 07:06:50PM -0500, James Bottomley wrote:
> On Mon, 2005-08-08 at 10:42 -0700, Andrew Morton wrote:
> > -mm has extra list_head debugging goodies.  I'd be suspecting a list_head
> > corruption detected somewhere under spi_release_transport().
> 
> Aha, looking in wrong driver ... the problem actually appears to be a
> double release of the transport template in aic79xx.  Try this patch

Hi James

Sorry for the delay...

This patch works like a charm!

Cheers,
ashok
> 
> James
> 
> diff --git a/drivers/scsi/aic7xxx/aic79xx_osm.c b/drivers/scsi/aic7xxx/aic79xx_osm.c
> --- a/drivers/scsi/aic7xxx/aic79xx_osm.c
> +++ b/drivers/scsi/aic7xxx/aic79xx_osm.c
> @@ -2326,8 +2326,6 @@ done:
>   return (retval);
>  }
>  
> -static void ahd_linux_exit(void);
> -
>  static void ahd_linux_set_width(struct scsi_target *starget, int width)
>  {
>   struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
> @@ -2772,7 +2770,7 @@ ahd_linux_init(void)
>   if (ahd_linux_detect(&aic79xx_driver_template) > 0)
>   return 0;
>   spi_release_transport(ahd_linux_transport_template);
> - ahd_linux_exit();
> +
>   return -ENODEV;
>  }
>  
> diff --git a/drivers/scsi/aic7xxx/aic7xxx_osm.c b/drivers/scsi/aic7xxx/aic7xxx_osm.c
> --- a/drivers/scsi/aic7xxx/aic7xxx_osm.c
> +++ b/drivers/scsi/aic7xxx/aic7xxx_osm.c
> @@ -2331,8 +2331,6 @@ ahc_platform_dump_card_state(struct ahc_
>  {
>  }
>  
> -static void ahc_linux_exit(void);
> -
>  static void ahc_linux_set_width(struct scsi_target *starget, int width)
>  {
>   struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
> 
> 

-- 
Cheers,
Ashok Raj
- Open Source Technology Center


[patch 5/8] x86_64: Don't do broadcast IPIs when hotplug is enabled in flat mode.

2005-08-01 Thread Ashok Raj
The recent introduction of physflat mode for x86_64 inadvertently deleted the
use of the non-shortcut version of these routines, breaking CPU hotplug. The
option to select this via the command line was also deleted with the physflat
patch, hence this code is placed directly under CONFIG_HOTPLUG_CPU.

We don't want to use broadcast-mode IPIs when hotplug is enabled. A broadcast
can send an IPI to an offline CPU, which can trip the system when that CPU is
in the process of being kicked alive.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
---
 arch/x86_64/kernel/genapic_flat.c |8 
 1 files changed, 8 insertions(+)

Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_flat.c
===================================================================
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/genapic_flat.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_flat.c
@@ -78,8 +78,16 @@ static void flat_send_IPI_mask(cpumask_t
 
 static void flat_send_IPI_allbutself(int vector)
 {
+#ifndef CONFIG_HOTPLUG_CPU
if (((num_online_cpus()) - 1) >= 1)
__send_IPI_shortcut(APIC_DEST_ALLBUT, vector,APIC_DEST_LOGICAL);
+#else
+   cpumask_t allbutme = cpu_online_map;
+   int me = get_cpu(); /* Ensure we are not preempted when we clear */
+   cpu_clear(me, allbutme);
+   flat_send_IPI_mask(allbutme, vector);
+   put_cpu();
+#endif
 }
 
 static void flat_send_IPI_all(int vector)

--



[patch 3/8] x86_64: Don't call enforce_max_cpus when hotplug is enabled

2005-08-01 Thread Ashok Raj
There is no need to call enforce_max_cpus() when the hotplug code is enabled.
It nukes cpu_present_map and cpu_possible_map, making it impossible to add
new CPUs to the system.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>

 arch/x86_64/kernel/smpboot.c |   40 +++-
 1 files changed, 23 insertions(+), 17 deletions(-)

Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/smpboot.c
===================================================================
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/smpboot.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/smpboot.c
@@ -893,23 +893,6 @@ static __init void disable_smp(void)
cpu_set(0, cpu_core_map[0]);
 }
 
-/*
- * Handle user cpus=... parameter.
- */
-static __init void enforce_max_cpus(unsigned max_cpus)
-{
-   int i, k;
-   k = 0;
-   for (i = 0; i < NR_CPUS; i++) {
-   if (!cpu_possible(i))
-   continue;
-   if (++k > max_cpus) {
-   cpu_clear(i, cpu_possible_map);
-   cpu_clear(i, cpu_present_map);
-   }
-   }
-}
-
 #ifdef CONFIG_HOTPLUG_CPU
 /*
  * cpu_possible_map should be static, it cannot change as cpu's
@@ -929,6 +912,29 @@ static void prefill_possible_map(void)
for (i = 0; i < NR_CPUS; i++)
cpu_set(i, cpu_possible_map);
 }
+
+/*
+ * Dont need this when we have hotplug enabled
+ */
+#define enforce_max_cpus(x)
+
+#else
+/*
+ * Handle user cpus=... parameter.
+ */
+static __init void enforce_max_cpus(unsigned max_cpus)
+{
+   int i, k;
+   k = 0;
+
+   for_each_cpu(i) {
+   if (++k > max_cpus) {
+   cpu_clear(i, cpu_possible_map);
+   cpu_clear(i, cpu_present_map);
+   }
+   }
+}
+
 #endif
 
 /*

--



[patch 1/8] x86_64: Reintroduce clustered_apic_check() for x86_64.

2005-08-01 Thread Ashok Raj
The bigsmp auto-selection patch removed this call from a shared common file,
arch/i386/kernel/acpi/boot.c. We still need to call it to determine the
right genapic code for x86_64.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
---
 arch/x86_64/kernel/setup.c |1 +
 1 files changed, 1 insertion(+)

Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/setup.c
===================================================================
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/setup.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/setup.c
@@ -663,6 +663,7 @@ void __init setup_arch(char **cmdline_p)
 * Read APIC and some other early information from ACPI tables.
 */
acpi_boot_init();
+   clustered_apic_check();
 #endif
 
 #ifdef CONFIG_X86_LOCAL_APIC

--



[patch 7/8] x86_64: Use common functions in cluster and physflat mode

2005-08-01 Thread Ashok Raj
The newly introduced physflat_* code shares far too much with the cluster
code, with only very minor differences. So we introduce some common functions
that can be reused in both cases.

In addition, the following are also fixed:
- The use of the non-existent CONFIG_CPU_HOTPLUG option is renamed to the
  actual one in use (CONFIG_HOTPLUG_CPU).
- Removed the comment saying ACPI would provide a way to select this
  dynamically, since CONFIG_ACPI_HOTPLUG_CPU already exists and indicates
  platform support for hotplug via ACPI. In addition, CONFIG_HOTPLUG_CPU only
  indicates logical offline/online, which is even used by the suspend/resume
  folks, where the same support (for no-broadcast) is required.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>

 arch/x86_64/kernel/genapic.c |   52 +
 arch/x86_64/kernel/genapic_cluster.c |   55 +++
 arch/x86_64/kernel/genapic_flat.c|   49 +++
 include/asm-x86_64/ipi.h |5 +++
 4 files changed, 61 insertions(+), 100 deletions(-)

Index: linux-2.6.13-rc4-mm1/include/asm-x86_64/ipi.h
===================================================================
--- linux-2.6.13-rc4-mm1.orig/include/asm-x86_64/ipi.h
+++ linux-2.6.13-rc4-mm1/include/asm-x86_64/ipi.h
@@ -107,4 +107,9 @@ static inline void send_IPI_mask_sequenc
local_irq_restore(flags);
 }
 
+extern cpumask_t generic_target_cpus(void);
+extern void generic_send_IPI_mask(cpumask_t mask, int vector);
+extern void generic_send_IPI_allbutself(int vector);
+extern void generic_send_IPI_all(int vector);
+extern unsigned int generic_cpu_mask_to_apicid(cpumask_t cpumask);
 #endif /* __ASM_IPI_H */
Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_flat.c
===================================================================
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/genapic_flat.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_flat.c
@@ -134,56 +134,17 @@ struct genapic apic_flat =  {
  * overflows, so use physical mode.
  */
 
-static cpumask_t physflat_target_cpus(void)
-{
-   return cpumask_of_cpu(0);
-}
-
-static void physflat_send_IPI_mask(cpumask_t cpumask, int vector)
-{
-   send_IPI_mask_sequence(cpumask, vector);
-}
-
-static void physflat_send_IPI_allbutself(int vector)
-{
-   cpumask_t allbutme = cpu_online_map;
-   int me = get_cpu();
-   cpu_clear(me, allbutme);
-   physflat_send_IPI_mask(allbutme, vector);
-   put_cpu();
-}
-
-static void physflat_send_IPI_all(int vector)
-{
-   physflat_send_IPI_mask(cpu_online_map, vector);
-}
-
-static unsigned int physflat_cpu_mask_to_apicid(cpumask_t cpumask)
-{
-   int cpu;
-
-   /*
-* We're using fixed IRQ delivery, can only return one phys APIC ID.
-* May as well be the first.
-*/
-   cpu = first_cpu(cpumask);
-   if ((unsigned)cpu < NR_CPUS)
-   return x86_cpu_to_apicid[cpu];
-   else
-   return BAD_APICID;
-}
-
 struct genapic apic_physflat =  {
.name = "physical flat",
.int_delivery_mode = dest_Fixed,
.int_dest_mode = (APIC_DEST_PHYSICAL != 0),
.int_delivery_dest = APIC_DEST_PHYSICAL | APIC_DM_FIXED,
-   .target_cpus = physflat_target_cpus,
+   .target_cpus = generic_target_cpus,
.apic_id_registered = flat_apic_id_registered,
.init_apic_ldr = flat_init_apic_ldr,/*not needed, but shouldn't hurt*/
-   .send_IPI_all = physflat_send_IPI_all,
-   .send_IPI_allbutself = physflat_send_IPI_allbutself,
-   .send_IPI_mask = physflat_send_IPI_mask,
-   .cpu_mask_to_apicid = physflat_cpu_mask_to_apicid,
+   .send_IPI_all = generic_send_IPI_all,
+   .send_IPI_allbutself = generic_send_IPI_allbutself,
+   .send_IPI_mask = generic_send_IPI_mask,
+   .cpu_mask_to_apicid = generic_cpu_mask_to_apicid,
.phys_pkg_id = phys_pkg_id,
 };
Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_cluster.c
===================================================================
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/genapic_cluster.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_cluster.c
@@ -57,56 +57,11 @@ static void cluster_init_apic_ldr(void)
apic_write_around(APIC_LDR, val);
 }
 
-/* Start with all IRQs pointing to boot CPU.  IRQ balancing will shift them. */
-
-static cpumask_t cluster_target_cpus(void)
-{
-   return cpumask_of_cpu(0);
-}
-
-static void cluster_send_IPI_mask(cpumask_t mask, int vector)
-{
-   send_IPI_mask_sequence(mask, vector);
-}
-
-static void cluster_send_IPI_allbutself(int vector)
-{
-   cpumask_t mask = cpu_online_map;
-   int me = get_cpu(); /* Ensure we are not preempted when we clear */
-
-   cpu_clear(me, mask);
-
-   if (!cpus_empty(mask))
-   cluster_send_IPI_mask(mask, vector);
-
-   put_cpu();
-}
-
-static void cluster_send_IPI_all(int vector)
-{


[patch 8/8] x86_64: Choose physflat for AMD systems only when >8 CPUs.

2005-08-01 Thread Ashok Raj
It is not required to choose physflat mode when CPU hotplug is enabled and
there are <= 8 CPUs. genapic_flat with the mask version of send_IPI is capable
of doing the same, instead of using send_IPI_mask_sequence(), which is a
series of unicasts.

This is another change that Andi introduced with the physflat mode.

Andi: do you think this is acceptable?

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
---
 arch/x86_64/kernel/genapic.c |9 +
 1 files changed, 1 insertion(+), 8 deletions(-)

Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic.c
===================================================================
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/genapic.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic.c
@@ -69,15 +69,8 @@ void __init clustered_apic_check(void)
}
 
/* Don't use clustered mode on AMD platforms. */
-   if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
+   if ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) && (num_cpus > 8)) {
genapic = &apic_physflat;
-   /* In the CPU hotplug case we cannot use broadcast mode
-  because that opens a race when a CPU is removed.
-  Stay at physflat mode in this case. - AK */
-#ifdef CONFIG_HOTPLUG_CPU
-   if (num_cpus <= 8)
-   genapic = &apic_flat;
-#endif
goto print;
}
 

--



Re: [patch 1/8] x86_64: Reintroduce clustered_apic_check() for x86_64.

2005-08-01 Thread Ashok Raj
On Mon, Aug 01, 2005 at 01:20:18PM -0700, Ashok Raj wrote:
> Auto selection of bigsmp patch removed this check from a shared common file
> in arch/i386/kernel/acpi/boot.c. We still need to call this to determine 
> the right genapic code for x86_64. 
> 

Thanks, Venki.

I missed the check for acpi_lapic and smp_found_config before the call.

Resending the patch.

-- 
Cheers,
Ashok Raj
- Open Source Technology Center

The bigsmp auto-selection patch removed this call from a shared common file,
arch/i386/kernel/acpi/boot.c. We still need to call it to determine the
right genapic code for x86_64.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
---
 arch/x86_64/kernel/setup.c |2 ++
 1 files changed, 2 insertions(+)

Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/setup.c
===================================================================
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/setup.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/setup.c
@@ -663,6 +663,8 @@ void __init setup_arch(char **cmdline_p)
 * Read APIC and some other early information from ACPI tables.
 */
acpi_boot_init();
+   if (acpi_lapic && smp_found_config)
+   clustered_apic_check();
 #endif
 
 #ifdef CONFIG_X86_LOCAL_APIC


[patch 6/8] x86_64: Don't use lowest-priority delivery when using physical mode.

2005-08-01 Thread Ashok Raj
Delivery mode should be APIC_DM_FIXED when using physical mode.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>

 arch/x86_64/kernel/genapic_flat.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_flat.c
===================================================================
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/genapic_flat.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_flat.c
@@ -175,9 +175,9 @@ static unsigned int physflat_cpu_mask_to
 
 struct genapic apic_physflat =  {
.name = "physical flat",
-   .int_delivery_mode = dest_LowestPrio,
+   .int_delivery_mode = dest_Fixed,
.int_dest_mode = (APIC_DEST_PHYSICAL != 0),
-   .int_delivery_dest = APIC_DEST_PHYSICAL | APIC_DM_LOWEST,
+   .int_delivery_dest = APIC_DEST_PHYSICAL | APIC_DM_FIXED,
.target_cpus = physflat_target_cpus,
.apic_id_registered = flat_apic_id_registered,
.init_apic_ldr = flat_init_apic_ldr,/*not needed, but shouldn't hurt*/

--



Re: [patch 3/8] x86_64: Don't call enforce_max_cpus when hotplug is enabled

2005-08-04 Thread Ashok Raj
On Thu, Aug 04, 2005 at 12:41:10PM +0200, Andi Kleen wrote:
> On Mon, Aug 01, 2005 at 01:20:20PM -0700, Ashok Raj wrote:
> > No need to enforce_max_cpus when hotplug code is enabled. This
> > nukes out cpu_present_map and cpu_possible_map making it impossible to add
> > new cpus in the system.
> 
> Hmm - i think there was some reason for this early zeroing,
> but I cannot remember it right now.
> 
> It might be related to some checks later that check max possible cpus.
> 
> So it would be still good to have some way to limit max possible cpus.
> Maybe with a new option?

The only useful thing enforce_max_cpus() does is trim cpu_possible_map, so
that resource allocations which use for_each_cpu() for upfront allocation
won't allocate for CPUs that can never arrive.

Currently I see max_cpus only limiting boot-time startup on other
architectures; none of them trims cpu_possible_map, which is done only in
x86_64. max_cpus is still honored, just for the initial boot. I would rather
remove enforce_max_cpus() altogether, like the other archs, instead of adding
one more variant just for x86_64.

Maybe we should add an option only if there is a need, instead of adding one,
finding no one uses it, and removing it again very soon.
> 
> -Andi

-- 
Cheers,
Ashok Raj
- Open Source Technology Center


Re: [patch 4/8] x86_64: Fix cluster mode send_IPI_allbutself to use get_cpu()/put_cpu()

2005-08-04 Thread Ashok Raj
On Thu, Aug 04, 2005 at 12:43:02PM +0200, Andi Kleen wrote:
> On Mon, Aug 01, 2005 at 01:20:21PM -0700, Ashok Raj wrote:
> > Need to ensure we don't get preempted when we clear ourselves from the
> > mask when using the clustered mode genapic code.
> 
> It's not needed I think. If the caller wants to execute code
> on the current CPU then it has to have disabled preemption
> itself already to avoid races. And if not it doesn't care.
> 
> One could argue that this function should be always called
> with preemption disabled though. Perhaps better a WARN_ON().
> 

This is only required for smp_call_function(): since we implement allbutself
by excluding self, it is the internal function that needs to do this.

The allbutself shortcut takes care of it, since it doesn't matter on which
CPU we write the shortcut; but for the mask version, and for cluster mode, I
think it is required to ensure this in the low-level function. Otherwise
every implementation of smp_call_function() and every caller of
send_IPI_allbutself() would need to do it, which would mean lots of changes.
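To illustrate the race, a minimal sketch (illustrative, not the actual
genapic code) contrasting the unsafe and safe ways to build the all-but-self
mask:

/* Unsafe: preemption can migrate the task between reading the cpu id
 * and sending the IPI, so the wrong cpu may be cleared from the mask. */
static void send_allbutself_racy(int vector)
{
        cpumask_t mask = cpu_online_map;
        int me = smp_processor_id();    /* may be stale after migration */

        cpu_clear(me, mask);
        flat_send_IPI_mask(mask, vector);
}

/* Safe: get_cpu() disables preemption until the matching put_cpu(). */
static void send_allbutself_safe(int vector)
{
        cpumask_t mask = cpu_online_map;
        int me = get_cpu();

        cpu_clear(me, mask);
        if (!cpus_empty(mask))
                flat_send_IPI_mask(mask, vector);
        put_cpu();
}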
> -Andi

-- 
Cheers,
Ashok Raj
- Open Source Technology Center


Re: [patch 5/8] x86_64: Don't do broadcast IPIs when hotplug is enabled in flat mode.

2005-08-04 Thread Ashok Raj
On Thu, Aug 04, 2005 at 12:51:07PM +0200, Andi Kleen wrote:
> >  static void flat_send_IPI_allbutself(int vector)
> >  {
> > +#ifndef CONFIG_HOTPLUG_CPU
> > if (((num_online_cpus()) - 1) >= 1)
> > __send_IPI_shortcut(APIC_DEST_ALLBUT, vector,APIC_DEST_LOGICAL);
> > +#else
> > +   cpumask_t allbutme = cpu_online_map;
> > +   int me = get_cpu(); /* Ensure we are not preempted when we clear */
> > +   cpu_clear(me, allbutme);
> > +   flat_send_IPI_mask(allbutme, vector);
> > +   put_cpu();
> 
> This still needs the num_online_cpus()s check.

Oops, missed that... Thanks for spotting it.

I will send an updated one to Andrew.
-- 
Cheers,
Ashok Raj
- Open Source Technology Center


Re: [patch 5/8] x86_64: Don't do broadcast IPIs when hotplug is enabled in flat mode.

2005-08-04 Thread Ashok Raj
On Thu, Aug 04, 2005 at 12:51:07PM +0200, Andi Kleen wrote:
> >  static void flat_send_IPI_allbutself(int vector)
> >  {
> > +#ifndef CONFIG_HOTPLUG_CPU
> > if (((num_online_cpus()) - 1) >= 1)
> > __send_IPI_shortcut(APIC_DEST_ALLBUT, vector,APIC_DEST_LOGICAL);
> > +#else
> > +   cpumask_t allbutme = cpu_online_map;
> > +   int me = get_cpu(); /* Ensure we are not preempted when we clear */
> > +   cpu_clear(me, allbutme);
> > +   flat_send_IPI_mask(allbutme, vector);
> > +   put_cpu();
> 
> This still needs the num_online_cpus()s check.
> 
> -Andi

Modified patch attached.

Andrew: the filename of this patch in your -mm queue is below; the attached
patch replaces it.

x86_64dont-do-broadcast-ipis-when-hotplug-is-enabled-in-flat-mode.patch

-- 
Cheers,
Ashok Raj
- Open Source Technology Center


Note: The recent introduction of physflat mode for x86_64 inadvertently
deleted the use of the non-shortcut version of these routines, breaking CPU
hotplug. The option to select this via the command line was also deleted with
the physflat patch, hence this code is placed directly under
CONFIG_HOTPLUG_CPU.

We don't want to use broadcast-mode IPIs when hotplug is enabled. A broadcast
can send an IPI to an offline CPU, which can trip the system when that CPU is
in the process of being kicked alive.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
---
 arch/x86_64/kernel/genapic_flat.c |   10 ++
 1 files changed, 10 insertions(+)

Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_flat.c
===================================================================
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/genapic_flat.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_flat.c
@@ -78,8 +78,18 @@ static void flat_send_IPI_mask(cpumask_t
 
 static void flat_send_IPI_allbutself(int vector)
 {
+#ifndef CONFIG_HOTPLUG_CPU
if (((num_online_cpus()) - 1) >= 1)
__send_IPI_shortcut(APIC_DEST_ALLBUT, vector,APIC_DEST_LOGICAL);
+#else
+   cpumask_t allbutme = cpu_online_map;
+   int me = get_cpu(); /* Ensure we are not preempted when we clear */
+   cpu_clear(me, allbutme);
+
+   if (!cpus_empty(allbutme))
+   flat_send_IPI_mask(allbutme, vector);
+   put_cpu();
+#endif
 }
 
 static void flat_send_IPI_all(int vector)


Re: [patch 1/1] Hot plug CPU to support physical add of new processors (i386)

2005-09-01 Thread Ashok Raj
  return;
>   }
> +#endif
>   num_processors++;
>   ver = m->mpc_apicver;
>  
> diff -puN arch/i386/kernel/smpboot.c~hotcpu-i386 arch/i386/kernel/smpboot.c
> --- linux-2.6.13-rc6-mm2/arch/i386/kernel/smpboot.c~hotcpu-i386    2005-08-31 04:17:20.924024616 -0700
> +++ linux-2.6.13-rc6-mm2-root/arch/i386/kernel/smpboot.c    2005-08-31 04:21:49.474198784 -0700
> @@ -1003,9 +1003,10 @@ int __devinit smp_prepare_cpu(int cpu)
>   struct warm_boot_cpu_info info;
>   struct work_struct task;
>   int apicid, ret;
> + extern u8 bios_cpu_apicid[NR_CPUS];
>  
>   lock_cpu_hotplug();
> - apicid = x86_cpu_to_apicid[cpu];
> + apicid = bios_cpu_apicid[cpu];
>   if (apicid == BAD_APICID) {
>   ret = -ENODEV;
>   goto exit;
> diff -puN arch/i386/mach-default/topology.c~hotcpu-i386 arch/i386/mach-default/topology.c
> --- linux-2.6.13-rc6-mm2/arch/i386/mach-default/topology.c~hotcpu-i386    2005-08-31 04:17:20.957019600 -0700
> +++ linux-2.6.13-rc6-mm2-root/arch/i386/mach-default/topology.c    2005-08-31 04:22:13.020619184 -0700
> @@ -76,7 +76,7 @@ static int __init topology_init(void)
>   for_each_online_node(i)
>   arch_register_node(i);
>  
> - for_each_present_cpu(i)
> + for_each_cpu(i)

Nope. This should still be for_each_present_cpu(). With NR_CPUS large, we
would see far more files in sysfs than CPUs that are really available.

>   arch_register_cpu(i);
>   return 0;
>  }
> @@ -87,7 +87,7 @@ static int __init topology_init(void)
>  {
>   int i;
>  
> - for_each_present_cpu(i)
> + for_each_cpu(i)
>   arch_register_cpu(i);
>   return 0;
>  }
> diff -puN kernel/cpu.c~hotcpu-i386 kernel/cpu.c
> --- linux-2.6.13-rc6-mm2/kernel/cpu.c~hotcpu-i386    2005-08-31 04:17:21.002012760 -0700
> +++ linux-2.6.13-rc6-mm2-root/kernel/cpu.c    2005-08-31 04:23:34.378250944 -0700
> @@ -158,7 +158,11 @@ int __devinit cpu_up(unsigned int cpu)
>   if ((ret = down_interruptible(&cpucontrol)) != 0)
>   return ret;
>  
> +#ifdef CONFIG_HOTPLUG_CPU
> + if (cpu_online(cpu)) {
> +#else
>   if (cpu_online(cpu) || !cpu_present(cpu)) {
> +#endif

ditto.
>   ret = -EINVAL;
>   goto out;
>   }
> diff -puN arch/i386/kernel/irq.c~hotcpu-i386 arch/i386/kernel/irq.c
> --- linux-2.6.13-rc6-mm2/arch/i386/kernel/irq.c~hotcpu-i386    2005-08-31 04:17:21.047005920 -0700
> +++ linux-2.6.13-rc6-mm2-root/arch/i386/kernel/irq.c    2005-08-31 04:25:21.761926144 -0700
> @@ -248,7 +248,7 @@ int show_interrupts(struct seq_file *p, 
>  
>   if (i == 0) {
>   seq_printf(p, "   ");
> - for_each_cpu(j)
> + for_each_online_cpu(j)
>   seq_printf(p, "CPU%d   ",j);
>   seq_putc(p, '\n');
>   }
> @@ -262,7 +262,7 @@ int show_interrupts(struct seq_file *p, 
>  #ifndef CONFIG_SMP
>   seq_printf(p, "%10u ", kstat_irqs(i));
>  #else
> - for_each_cpu(j)
> + for_each_online_cpu(j)
>   seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
>  #endif
>   seq_printf(p, " %14s", irq_desc[i].handler->typename);
> @@ -276,12 +276,12 @@ skip:
>   spin_unlock_irqrestore(&irq_desc[i].lock, flags);
>   } else if (i == NR_IRQS) {
>   seq_printf(p, "NMI: ");
> - for_each_cpu(j)
> + for_each_online_cpu(j)
>   seq_printf(p, "%10u ", nmi_count(j));
>   seq_putc(p, '\n');
>  #ifdef CONFIG_X86_LOCAL_APIC
>   seq_printf(p, "LOC: ");
> - for_each_cpu(j)
> + for_each_online_cpu(j)
>   seq_printf(p, "%10u ",
>   per_cpu(irq_stat,j).apic_timer_irqs);
>   seq_putc(p, '\n');
> _

-- 
Cheers,
Ashok Raj
- Open Source Technology Center


Re: [patch 1/1] Hot plug CPU to support physical add of new processors (i386)

2005-09-01 Thread Ashok Raj
On Thu, Sep 01, 2005 at 10:45:10AM +0200, Andi Kleen wrote:
> Hallo Natalie,
> 
> On Wednesday 31 August 2005 14:13, [EMAIL PROTECTED] wrote:
> > Current IA32 CPU hotplug code doesn't allow bringing up processors that
> > were not present in the boot configuration. To make existing hot plug
> > facility more practical for physical hot plug, possible processors should
> > be encountered during boot for potentual hot add/replace/remove. On ES7000,
> > ACPI marks all the sockets that are empty or not assigned to the
> > partitionas as "disabled". 
> 
> Good idea. In fact I always hated the behaviour of the existing
> hotplug code that assumes all possible CPUs can be hotplugged.
> It would be much nicer to be told be the firmware what CPUs
> are hotpluggable. It would be great if all ia32/x86-64 hotplug capable 
> BIOS behaved like your.
> 

Andi, you are mixing up the software-only ability to offline a CPU with
hardware eject capability. ACPI indicates the ability to hotplug by the
presence of _EJD in the appropriate scope of the object. So ACPI does have
the ability to do precisely what you mention above, but the entire namespace
is not known upfront, since some of it can be dynamically loaded. That is why
we need to show all NR_CPUS as hotpluggable.

Possibly we could keep cpu_possible_map at NR_CPUS only when support for
physical CPU hotplug is present, and otherwise keep it cloned from
cpu_present_map. (We don't have a generic physical-hotplug CONFIG option
today.)
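A minimal sketch of that idea (using CONFIG_ACPI_HOTPLUG_CPU as a stand-in
for the generic physical-hotplug option we don't have today):

static void prefill_possible_map(void)
{
#ifdef CONFIG_ACPI_HOTPLUG_CPU
        int i;

        /* Physical hotplug supported: any slot may gain a cpu later. */
        for (i = 0; i < NR_CPUS; i++)
                cpu_set(i, cpu_possible_map);
#else
        /* No physical hotplug: only cpus seen at boot can ever run. */
        cpu_possible_map = cpu_present_map;
#endif
}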

What CONFIG_HOTPLUG_CPU=y indicates is the ability to offline a processor
from the kernel. It does NOT indicate physical hotpluggability, so we don't
need any hardware support (apart from arch/kernel support) for this to work.
Support for physical hotplug is indicated via CONFIG_ACPI_HOTPLUG_CPU.

Be aware that the suspend/resume folks, who use CPU hotplug to offline all
CPUs except the BSP, need just the kernel support to offline. The BIOS has
nothing to do with being able to offline a CPU (preferably called
soft-removal).

Cheers,
ashok



Re: [patch 1/1] Hot plug CPU to support physical add of new processors (i386)

2005-09-01 Thread Ashok Raj
On Thu, Sep 01, 2005 at 04:09:09PM -0500, Protasevich, Natalie wrote:
> > 
> > > Current IA32 CPU hotplug code doesn't allow bringing up 
> > processors that were not present in the boot configuration. 
> > > To make existing hot plug facility more practical for physical hot 
> > > plug, possible processors should be encountered during boot for 
> > > potentual hot add/replace/remove. On ES7000, ACPI marks all the 
> > > sockets that are empty or not assigned to the partitionas as 
> > > "disabled". The patch allows arrays/masks with APIC info 
> > for disabled 
> > > processors to be
> > 
> > This sounds like a cluge to me. The correct implementation 
> > would be you would need some sysmgmt deamon or something that 
> > works with the kernel to notify of new cpus and populate 
> > apicid and grow cpu_present_map. Making an assumption that 
> > disabled APICID are valid for ES7000 sake is not a safe assumption.
> 
> Yes, this is a kludge, I realize that. The AML code was not there so far
> (it will be in the next one). I have a point here though that if the
> processor is there, but is unusable (what "disabled" means as the ACPI
> spec says), meaning bad maybe, then with physical hot plug it can
> certainly be made usable and I think it should be taken into
> consideration (and into configuration). It should be counted as possible
> at least, with hot plug, because it represent existing socket. 


I think marking it as present, and counting it in cpu_possible_map, is
perfectly OK. But we would need more glue logic: if the firmware marked a
socket as disabled, one would expect to then run _STA, and if _STA reports
the CPU as present and functional, the CPU is onlinable.

So if _STA works favorably in your case, you can use it to override the
disabled setting at boot time, which would be perfectly fine.
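As a hedged illustration of that _STA override (the helper name is
hypothetical; per the ACPI spec, _STA bit 0 means present and bit 3 means
functioning):

static int cpu_socket_usable(acpi_handle handle)
{
        unsigned long sta = 0;
        acpi_status status;

        status = acpi_evaluate_integer(handle, "_STA", NULL, &sta);
        if (ACPI_FAILURE(status))
                return 0;

        /* Usable if the socket is both present and functioning. */
        return (sta & 0x01) && (sta & 0x08);
}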
> 

-- 
Cheers,
Ashok Raj
- Open Source Technology Center


Re: [patch 09/14] x86_64: Don't call enforce_max_cpus when hotplug is enabled

2005-09-06 Thread Ashok Raj
Hi Andi

On Mon, Sep 05, 2005 at 06:48:21AM +0200, Andi Kleen wrote:
> On Sat, Sep 03, 2005 at 02:33:26PM -0700, [EMAIL PROTECTED] wrote:
> > 
> > From: Ashok Raj <[EMAIL PROTECTED]>
> > 
> > No need to enforce_max_cpus when hotplug code is enabled.  This nukes out
> > cpu_present_map and cpu_possible_map making it impossible to add new cpus in
> > the system.
> 
> I see the point, but the implementation is wrong. If anything
> we shouldn't do it neither for the !HOTPLUG_CPU case.Why did 
> you not do it unconditionally? 
> 
> I would prefer to keep the special cases for hotplug to be
> as narrow as possible.

Link to the earlier discussion below:

http://marc.theaimsgroup.com/?l=linux-kernel&m=112317327529855&w=2

I had suggested that we remove it completely in our discussion, but I didn't
hear anything from you after that, so I thought that was acceptable.

You had suggested in that discussion that it would be better to add a startup
option. I am opposed to adding any option when we know for certain there are
no users. Earlier, based on your suggestion, I added a startup option to
choose IPI broadcast mode, which you promptly removed when you put in the
physflat changes. I think it's better not to add any option without a real
need. Do you agree?

Please reply if you want me to remove the !HOTPLUG case, which is my
preference as well, and maybe, while the memory is fresh, we can stick with
it this time when we are on the same page :-(

> 
> -Andi

-- 
Cheers,
Ashok Raj
- Open Source Technology Center


Re: [patch 14/14] x86_64: Choose physflat for AMD systems only when >8 CPUS.

2005-09-06 Thread Ashok Raj
On Tue, Sep 06, 2005 at 01:18:08AM +0200, Andi Kleen wrote:
> On Sat, Sep 03, 2005 at 02:33:30PM -0700, [EMAIL PROTECTED] wrote:
> > 
> > From: Ashok Raj <[EMAIL PROTECTED]>
> > 
> > It is not required to choose the physflat mode when CPU hotplug is enabled 
> > and
> > CPUs <=8 case.  Use of genapic_flat with the mask version is capable of 
> > doing
> > the same, instead of doing the send_IPI_mask_sequence() where its a unicast.
> 
> I don't get the reasoning of this change. So probably not.

Hmmm... please see below. Nothing has changed since then; any idea why it is
not acceptable now?

http://marc.theaimsgroup.com/?l=linux-kernel&m=112315304423377&w=2

This really doesn't affect me; it just bothers me to go over inefficient
code. send_IPI_mask_sequence() does unicast IPIs. When the number of CPUs is
<= 8, the mask version achieves the same with just one write, so it's a
selective broadcast, which is more efficient.

Based on our earlier exchange I assumed it was clear and apparent, which is
why you "OK"ed the version when it was submitted to -mm.

Nothing has changed; it's the exact same patch. I hope it's clear now.

Entirely up to you... :-(

-- 
Cheers,
Ashok Raj
- Open Source Technology Center


Re: [patch 13/14] x86_64: Use common functions in cluster and physflat mode

2005-09-06 Thread Ashok Raj
On Tue, Sep 06, 2005 at 01:16:28AM +0200, Andi Kleen wrote:
> On Sat, Sep 03, 2005 at 02:33:30PM -0700, [EMAIL PROTECTED] wrote:
> > 
> > From: Ashok Raj <[EMAIL PROTECTED]>
> > 
> > Newly introduced physflat_* shares way too much with cluster with only a 
> > very
> > differences.  So we introduce some common functions in that can be reused in
> > both cases.
> > 
> > In addition the following are also fixed.
> > - Use of non-existent CONFIG_CPU_HOTPLUG option renamed to actual one in 
> > use.
> > - Removed comment that ACPI would provide a way to select this dynamically
> >   since ACPI_CONFIG_HOTPLUG_CPU already exists that indicates platform 
> > support
> >   for hotplug via ACPI. In addition CONFIG_HOTPLUG_CPU only indicates 
> > logical 
> >   offline/online which is even used by Suspend/Resume folks where the same 
> >   support (for no-broadcast) is required.
> 
> 
> (hmm did I reply to that? I think I did but my mailer seems to have
> lost the r flag. My apologies if it's a duplicate) 
> 
> I didn't like that one because it makes the code less readable than
> before imho. I did a separate patch for the CPU_HOTPLUG typo.

The code is less readable? Now I am confused. Attached is the link to the
patch below, to refresh your memory.

http://marc.theaimsgroup.com/?l=linux-kernel&m=112293577309653&w=2

The diffstat shows we have about 40 fewer lines of code. physflat basically
copied/cloned some useful code from the cluster and flat-mode genapic code.

I would have consolidated the code in the first place, when you added the
physflat mode. Again, this is just my habit; I can't step over code bloat
and duplication.

Which part of the code is unreadable to you? If you are happy with just
renamed functions with copied bodies, which is what physflat did, that's
fine.

I was just puzzled at the "convoluted and less readable" comment. If there
is something you would like to point out, I would be happy to fix it... or
you can, if you prefer it that way.


-- 
Cheers,
Ashok Raj
- Open Source Technology Center


Re: [patch 09/14] x86_64: Don't call enforce_max_cpus when hotplug is enabled

2005-09-07 Thread Ashok Raj
On Wed, Sep 07, 2005 at 08:49:50AM +0200, Andi Kleen wrote:
> > 
> > You had suggested in that discussion that it would be better to add an 
> > option for startup. Iam opposed to adding any option, when we certainly 
> > know 
> 
> I suggested to auto detect it based on ACPI information. I don't 
> think I ever wrote anything about an option.
> 
> If that is not possible it's better to always use the sequence mechanism.

Using ACPI or any other method to choose between broadcast and the mask
version of IPIs in flat mode for <= 8 CPUs has no real value. I had posted a
small stat program that showed the mask IPI gives the same performance
numbers.

We didn't choose that method simply because there is no perf gain, only code
bloat. I don't understand putting in all that complexity without any real
merit.

Moreover, CONFIG_HOTPLUG_CPU does not imply physical CPU hotplug, which I
have tried to convey several times.

It is important to understand that there is not just ONE RIGHT way, and that
we consider alternatives for the right reasons.

> 
> 
> P.S.: Don't bother sending me such "blame game" mails again. I will
> just d them next time because they're a waste of time.

Sorry, Andi, if you felt that way. I was trying to get some consistent
feedback, and to have you also consider and weigh in on what we explain,
instead of it being a one-way street.

Certainly my intent was not to blame you, but to explain with clarity, so we
don't end up reworking some trivial patches for a long time.

If you feel that way, I deeply apologize, and repeat: that is not my intent.
> 

-- 
Cheers,
Ashok Raj
- Open Source Technology Center


Re: [patch 13/14] x86_64: Use common functions in cluster and physflat mode

2005-09-09 Thread Ashok Raj
On Fri, Sep 09, 2005 at 10:07:28AM -0700, Zwane Mwaikambo wrote:
> On a slightly different topic, how come we're using physflat for hotplug 
> cpu?
> 
> -#ifndef CONFIG_CPU_HOTPLUG
>   /* In the CPU hotplug case we cannot use broadcast mode
>  because that opens a race when a CPU is removed.
> -Stay at physflat mode in this case.
> -It is bad to do this unconditionally though. Once
> -we have ACPI platform support for CPU hotplug
> -we should detect hotplug capablity from ACPI tables and
> -only do this when really needed. -AK */
> +Stay at physflat mode in this case. - AK */
> +#ifdef CONFIG_HOTPLUG_CPU
>   if (num_cpus <= 8)
>   genapic = &apic_flat;

What you say was true before this patch. (Although now that you point it out,
I realize the #ifdef CONFIG_HOTPLUG_CPU is not required.)

I think Andi is fixing this in his next drop to -mm*.

When physflat was introduced, it also switched to using physflat mode for
<= 8 CPUs when hotplug is enabled, since physflat doesn't use shortcuts and
is therefore also safer (although slower).

http://marc.theaimsgroup.com/?l=linux-kernel&m=112317686712929&w=2

The patch linked above made genapic_flat safe by using flat_send_IPI_mask(),
and hence I switched back to using logical flat mode for <= 8 CPUs, since
that is a little more efficient than the send_IPI_mask_sequence() used in
physflat mode.

In general we need:

- flat mode for <= 8 CPUs (hotplug defined or not, using the mask version
  for safety);
- physflat or cluster mode for > 8 CPUs.

Choosing physflat as the default for <= 8 CPUs (with hotplug) would make IPI
performance worse, since it sends to one CPU at a time and requires two APIC
writes per CPU for each IPI, versus just two writes total for the flat-mode
mask version, as sketched below.
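A rough sketch of that cost difference (illustrative only, not the exact
genapic code):

/* physflat/cluster: two apic writes per target cpu. */
static void ipi_sequence(cpumask_t mask, int vector)
{
        int cpu;

        for_each_cpu_mask(cpu, mask) {
                apic_write(APIC_ICR2, SET_APIC_DEST_FIELD(x86_cpu_to_apicid[cpu]));
                apic_write(APIC_ICR, APIC_DEST_PHYSICAL | vector);
        }
}

/* logical flat (<= 8 cpus): the whole mask fits in the 8-bit logical
 * destination, so two writes reach every target at once. */
static void ipi_flat_mask(cpumask_t mask, int vector)
{
        unsigned long bits = cpus_addr(mask)[0] & 0xff;

        apic_write(APIC_ICR2, SET_APIC_DEST_FIELD(bits));
        apic_write(APIC_ICR, APIC_DEST_LOGICAL | vector);
}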

-- 
Cheers,
Ashok Raj
- Open Source Technology Center


Re: [OOPS] hotplugging cpus via /sys/devices/system/cpu/

2005-09-09 Thread Ashok Raj
On Fri, Sep 09, 2005 at 01:41:58PM -0700, Christopher Beppler wrote:
> 
>[1.] One line summary of the problem:
>If I deactivate a CPU with /sys/devices/system/cpux and try to
>reactivate it, then the CPU doesn't start and the kernel prints out an
>oops.
> 

Could you try this on 2.6.13-mm2? If this is due to a broadcast-IPI-related
issue, that should fix the problem.

I should say I haven't tried i386 in a while, but I suspect some of the
recent suspend/resume work required modifications to the i386 hotplug code,
which might be getting in the way if you try logical CPU hotplug alone,
without using it for suspend/resume.

Shaohua might know more about the status.

Cheers,
ashok


Fix irq_affinity write from /proc for IPF

2005-03-14 Thread Ashok Raj
Hi Andrew/Tony

This patch is required for IPF to perform a deferred write to the RTEs when
affinity is programmed via /proc. These entries can only be safely programmed
while an interrupt is pending.

We will eventually need the same method for x86 and x86_64 as well.

This patch is only for IPF, though. (The others are coming; more testing and
changes are needed in my sandbox.)

Could you please queue this up as the next -mm candidate, if it looks
acceptable?

Since I am touching a common file for GENERIC_HARDIRQ, it would be best if it
were reviewed in the -mm releases for any potential conflicts.

[ Sorry for the cross-post to lia64. ]

-- 
Cheers,
Ashok Raj
- Open Source Technology Center



---
fix_ia64_smp_affinity - Make GENERIC_HARDIRQ work for IPF and CPU Hotplug

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>

Made the GENERIC_HARDIRQ mechanism work for IPF and CPU hotplug. When a write
to /proc/irq is handled, it is not appropriate to perform set_rte
immediately, since there is a race if the interrupt is asserted while the
reprogram is happening. Such programming is only safe when we do the
reprogram at the time of servicing an interrupt. This got broken when
GENERIC_HARDIRQ was introduced for IPF.

- Added CONFIG_PENDING_IRQ so the default /proc/irq write handler can do the
  right thing.

TBD: We currently don't handle the redirectable hint, either in the display
or when we handle writes to /proc/irq/XX/smp_affinity. We need an
arch-specific way to account for the presence of the "r" hint when we handle
the proc write.
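For clarity, a hedged sketch of how the deferred mask might be consumed from
the interrupt service path (the function name is illustrative, not the exact
ia64 code):

/* Called while servicing irq: the line is quiescent, so reprogramming
 * the RTE cannot race with a new assertion of the same interrupt. */
static void apply_pending_irq_affinity(unsigned int irq)
{
        cpumask_t mask = pending_irq_cpumask[irq];

        if (cpus_empty(mask))
                return;         /* nothing was deferred from /proc */

        irq_desc[irq].handler->set_affinity(irq, mask);
        cpus_clear(pending_irq_cpumask[irq]);
}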
---

 release_dir-araj/arch/ia64/kernel/irq.c |   12 ++--
 release_dir-araj/kernel/irq/proc.c  |   10 --
 2 files changed, 18 insertions(+), 4 deletions(-)

diff -puN arch/ia64/kernel/irq.c~fix_ia64_smp_affinity arch/ia64/kernel/irq.c
--- release_dir/arch/ia64/kernel/irq.c~fix_ia64_smp_affinity    2005-03-14 14:35:44.589293491 -0800
+++ release_dir-araj/arch/ia64/kernel/irq.c    2005-03-14 15:27:54.262106715 -0800
@@ -94,12 +94,20 @@ skip:
 /*
  * This is updated when the user sets irq affinity via /proc
  */
-cpumask_t __cacheline_aligned pending_irq_cpumask[NR_IRQS];
+static cpumask_t __cacheline_aligned pending_irq_cpumask[NR_IRQS];
 static unsigned long pending_irq_redir[BITS_TO_LONGS(NR_IRQS)];
 
-static cpumask_t irq_affinity [NR_IRQS] = { [0 ... NR_IRQS-1] = CPU_MASK_ALL };
 static char irq_redir [NR_IRQS]; // = { [0 ... NR_IRQS-1] = 1 };
 
+/*
+ * Arch specific routine for deferred write to iosapic rte to reprogram
+ * intr destination.
+ */
+void proc_set_irq_affinity(unsigned int irq, cpumask_t mask_val)
+{
+   pending_irq_cpumask[irq] = mask_val;
+}
+
 void set_irq_affinity_info (unsigned int irq, int hwid, int redir)
 {
cpumask_t mask = CPU_MASK_NONE;
diff -puN kernel/irq/proc.c~fix_ia64_smp_affinity kernel/irq/proc.c
--- release_dir/kernel/irq/proc.c~fix_ia64_smp_affinity    2005-03-14 14:41:05.475031747 -0800
+++ release_dir-araj/kernel/irq/proc.c    2005-03-14 15:27:59.436911339 -0800
@@ -19,6 +19,13 @@ static struct proc_dir_entry *root_irq_d
  */
 static struct proc_dir_entry *smp_affinity_entry[NR_IRQS];
 
+void __attribute__((weak))
+proc_set_irq_affinity(unsigned int irq, cpumask_t mask_val)
+{
+   irq_affinity[irq] = mask_val;
+   irq_desc[irq].handler->set_affinity(irq, mask_val);
+}
+
 static int irq_affinity_read_proc(char *page, char **start, off_t off,
  int count, int *eof, void *data)
 {
@@ -53,8 +60,7 @@ static int irq_affinity_write_proc(struc
if (cpus_empty(tmp))
return -EINVAL;
 
-   irq_affinity[irq] = new_value;
-   irq_desc[irq].handler->set_affinity(irq, new_value);
+   proc_set_irq_affinity(irq, new_value);
 
return full_count;
 }
_
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Fix irq_affinity write from /proc for IPF

2005-03-14 Thread Ashok Raj
On Mon, Mar 14, 2005 at 03:59:23PM -0800, Andrew Morton wrote:
> Ashok Raj <[EMAIL PROTECTED]> wrote:
> >
> 
> "ia64" is preferred, please.  Nobody knows what an IPF is.

Right! Sorry about that.
> 
> 
> Is it not possible for ia64's ->set_affinity() handler to do this deferring?
> 

There are other places where we reprogram, and it's fine to call the current
version of set_affinity directly there, like when we are doing cpu offline
and trying to force-migrate irqs for ia64.

Changing the default set_affinity() for ia64 would result in many changes.
This approach keeps the purpose of those access functions intact and
differentiates the proc-write case alone, without changing the meaning of
the handler functions (and it's a smaller patch).

Deferring in set_affinity() would also complicate forced irq migration once
we consider MSI interrupts, since MSI has its own set_affinity handler that
we would then need to hack into as well.

-- 
Cheers,
Ashok Raj
- Open Source Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] User Level Interrupts

2005-03-23 Thread Ashok Raj
Hi Michael

have you thought about how this infrastructure would interact with the
existing CPU hotplug code for ia64?

Once you return to user mode via the iret, is it possible that the user-mode
thread could get switched out due to a pending cpu-quiesce attempt to remove
a cpu? (The current cpu removal code would bring the entire system to its
knees by scheduling a high-priority thread and looping with interrupts
disabled until the target cpu is removed.)

The cpu removal code would also attempt to migrate user processes to another
cpu, retarget interrupts to another existing cpu, etc. I haven't tested the
hotplug code on SGI boxes so far (it has only been tested on some HP boxes by
Alex Williamson and on Tiger4 boxes).

Cheers,
ashok


On Wed, Mar 23, 2005 at 08:38:33AM -0800, Michael Raymond wrote:
> 
> Allow fast (1+us) user notification of device interrupts.  This allows
> more powerful user I/O applications to be written.  The process of porting
> to other architectures is straightforward and fully documented.  More
> information can be found at http://oss.sgi.com/projects/uli/.
> 
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

2007-06-11 Thread Ashok Raj
On Mon, Jun 11, 2007 at 02:14:49PM -0700, Andrew Morton wrote:
> > 
> > Again, if dma_map_{single|sg} API's fails due to 
> > failure to allocate memory, the only thing that can
> > be done is to panic as this is what few of the other 
> > IOMMU implementation is doing today. 
> 
> If the only option is to panic then something's busted.  If it's network IO
> then there should be a way of dropping the frame.  If it's disk IO then we
> should report the failure and cause an IO error.

Just looking at the code, it appears that quite a few popular drivers (or
should I say most) don't even look at the dma_addr_t returned to check for
failure.

That's going to be another major cleanup effort.
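
For illustration, this is roughly what the per-driver fix looks like -- a
minimal sketch, not taken from any real driver (example_start_xmit and
example_priv are made-up names; it assumes this era's single-argument
pci_dma_mapping_error()):

	/* Drop the frame instead of panicking when the IOMMU cannot
	 * allocate a mapping for the buffer. */
	static int example_start_xmit(struct example_priv *priv,
				      struct sk_buff *skb)
	{
		dma_addr_t mapping;

		mapping = pci_map_single(priv->pdev, skb->data, skb->len,
					 PCI_DMA_TODEVICE);
		if (pci_dma_mapping_error(mapping)) {
			dev_kfree_skb_any(skb);		/* drop, don't panic */
			priv->stats.tx_dropped++;
			return NETDEV_TX_OK;
		}
		/* ... queue the descriptor with 'mapping' and kick DMA ... */
		return NETDEV_TX_OK;
	}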
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

2007-06-11 Thread Ashok Raj
On Tue, Jun 12, 2007 at 12:25:57AM +0200, Andi Kleen wrote:
> 
> > Please advice.
> 
> I think the short term only safe option would be to fully preallocate an 
> aperture.
> If it is too small you can try GFP_ATOMIC but it would be just
> a unreliable fallback. For safety you could perhaps have some kernel thread
> that tries to enlarge it in the background depending on current
> use. That would be not 100% guaranteed to keep up with load,
> but would at least keep up if the system is not too busy.
> 
> That is basically what your resource pools do, but they seem
> to be unnecessarily convoluted for the task :- after all you
> could just preallocate the page tables and rewrite/flush them without
> having some kind of allocator inbetween, can't you?

Each iommu has multiple domains, where each domain represents an address
space. PCI Express endpoints can each be located in their own domain for
address-protection reasons, and each also has its own tag in the IOTLB cache.

Each address space can be either 3- or 4-level, so it would be hard to
predict how much to set up ahead of time for each domain/device.

It's not a simple single-level table with a small window like the GART case.

By just keeping a pool of page-sized pages, it's easy to respond and use them
where really necessary, without having to lock pages down before knowing the
real demand.

The address space is plentiful, so growing on demand is the best use of the
memory available.
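
A minimal sketch of the pool idea (illustrative only; the names are made up
and this is not the code from the patch set): keep a reserve of pages behind
a spinlock, let a background thread refill it from process context, and fall
back to GFP_ATOMIC only when the reserve is empty:

	static LIST_HEAD(pte_page_pool);	/* free pages, linked in place */
	static DEFINE_SPINLOCK(pool_lock);
	static int pool_count;			/* refilled by a kernel thread */

	static void *pool_alloc_page(void)
	{
		unsigned long flags;
		void *page = NULL;

		spin_lock_irqsave(&pool_lock, flags);
		if (!list_empty(&pte_page_pool)) {
			struct list_head *e = pte_page_pool.next;
			list_del(e);
			pool_count--;
			page = e;
		}
		spin_unlock_irqrestore(&pool_lock, flags);

		if (!page)	/* pool empty: unreliable last resort */
			page = (void *)__get_free_page(GFP_ATOMIC);
		if (page)
			memset(page, 0, PAGE_SIZE);
		return page;
	}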

> If you make the start value large enough (256+MB?) that might reasonably
> work. How much memory in page tables would that take? Or perhaps scale
> it with available memory or available devices. 
> 
> In theory it could also be precomputed from the block/network device queue 
> lengths etc.; the trouble is just such checks would need to be added to all 
> kinds of 
> other odd subsystems that manage devices too.  That would be much more work.
> 
> Some investigation how to do sleeping block/network submit would be
> also interesting (e.g. replace the spinlocks there with mutexes and see how
> much it affects performance). For networking you would need to keep 
> at least a non sleeping path though because packets can be legally
> submitted from interrupt context. If it works out then sleeping
> interfaces to the IOMMU code could be added.
> 
> -Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 0/8] [Intel IOMMU] Support for Intel Virtualization Technology for Directed I/O

2007-04-09 Thread Ashok Raj
Hi,

Pleased to announce support for Intel(R) Virtualization Technology for
Directed I/O, for use as an IOMMU in Linux.

This is a series of patches to support the same. 

A brief description of the patches follows.

1. Support for ACPI framework to parse and work with DMA Remapping Tables.
2. Add support for PCI infrastructure to search parent relationships.
3. Hardware support for DMA remapping for Intel chipsets.
4. Support for Zero Length Reads on DMARs not able to support ZLR.
5. Graphics driver workarounds to provide a unity map, since they don't use
   the DMA API.
6. Updates to Documentation area for startup options and some basics.
7. Workaround to provide a unity map for the ISA bridge device, to enable the
   floppy disk.
8. Ability to preserve some mappings for devices not able to address the
   entire range.

Please help review and provide feedback.

Cheers,
Ashok Raj & Shaohua Li
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 2/8] [Intel IOMMU] Some generic search functions required to lookup device relationships.

2007-04-09 Thread Ashok Raj
PCI support functions for DMAR, to find parent bridge. 

When devices are under a p2p bridge, upstream transactions get replaced by
the device id of the bridge, as it owns the PCIe transaction. Hence it's
necessary to set up translations on behalf of the bridge as well. Due to this
limitation, all devices under a p2p bridge share the same domain in a DMAR.

We also cache whether a device is a native PCIe device or not, for later use
(see the usage sketch below).
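
(Illustrative use only -- a sketch of how the IOMMU side can consume this
helper; dma_alias_device is a made-up name:)

	/*
	 * Program the context entry for the source-id that will actually
	 * appear on the bus: the upstream PCIe-to-PCI bridge when there is
	 * one, otherwise the device itself.
	 */
	static struct pci_dev *dma_alias_device(struct pci_dev *pdev)
	{
		struct pci_dev *bridge = pci_find_upstream_pcie_bridge(pdev);

		return bridge ? bridge : pdev;
	}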

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
-

Index: linux-2.6.21-rc5/drivers/pci/pci.h
===
--- linux-2.6.21-rc5.orig/drivers/pci/pci.h 2007-04-03 04:30:44.0 
-0700
+++ linux-2.6.21-rc5/drivers/pci/pci.h  2007-04-03 06:58:58.0 -0700
@@ -90,3 +90,4 @@
return NULL;
 }
 
+struct pci_dev *pci_find_upstream_pcie_bridge(struct pci_dev *pdev);
Index: linux-2.6.21-rc5/drivers/pci/probe.c
===
--- linux-2.6.21-rc5.orig/drivers/pci/probe.c   2007-04-03 04:30:44.0 
-0700
+++ linux-2.6.21-rc5/drivers/pci/probe.c2007-04-03 06:58:58.0 
-0700
@@ -822,6 +822,19 @@
kfree(pci_dev);
 }
 
+static void set_pcie_port_type(struct pci_dev *pdev)
+{
+   int pos;
+   u16 reg16;
+
+   pos = pci_find_capability(pdev, PCI_CAP_ID_EXP);
+   if (!pos)
+   return;
+   pdev->is_pcie = 1;
+   pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, &reg16);
+   pdev->pcie_type = (reg16 & PCI_EXP_FLAGS_TYPE) >> 4;
+}
+
 /**
  * pci_cfg_space_size - get the configuration space size of the PCI device.
  * @dev: PCI device
@@ -919,6 +932,7 @@
dev->device = (l >> 16) & 0xffff;
dev->cfg_size = pci_cfg_space_size(dev);
dev->error_state = pci_channel_io_normal;
+   set_pcie_port_type(dev);
 
/* Assume 32-bit PCI; let 64-bit PCI cards (which are far rarer)
   set this higher, assuming the system even supports it.  */
Index: linux-2.6.21-rc5/drivers/pci/search.c
===
--- linux-2.6.21-rc5.orig/drivers/pci/search.c  2007-04-03 04:30:44.0 
-0700
+++ linux-2.6.21-rc5/drivers/pci/search.c   2007-04-03 06:58:58.0 
-0700
@@ -14,6 +14,36 @@
 #include "pci.h"
 
 DECLARE_RWSEM(pci_bus_sem);
+/*
+ * find the upstream PCIE-to-PCI bridge of a PCI device
+ * if the device is PCIE, return NULL
+ * if the device isn't connected to a PCIE bridge (that is its parent is a
+ * legacy PCI bridge and the bridge is directly connected to bus 0), return its
+ * parent
+ */
+struct pci_dev *
+pci_find_upstream_pcie_bridge(struct pci_dev *pdev)
+{
+   struct pci_dev *tmp = NULL;
+
+   if (pdev->is_pcie)
+   return NULL;
+   while (1) {
+   if (!pdev->bus->self)
+   break;
+   pdev = pdev->bus->self;
+   /* a p2p bridge */
+   if (!pdev->is_pcie) {
+   tmp = pdev;
+   continue;
+   }
+   /* PCI device should connect to a PCIE bridge */
+   BUG_ON(pdev->pcie_type != PCI_EXP_TYPE_PCI_BRIDGE);
+   return pdev;
+   }
+
+   return tmp;
+}
 
 static struct pci_bus *
 pci_do_find_bus(struct pci_bus* bus, unsigned char busnr)
Index: linux-2.6.21-rc5/include/linux/pci.h
===
--- linux-2.6.21-rc5.orig/include/linux/pci.h   2007-04-03 04:30:51.0 
-0700
+++ linux-2.6.21-rc5/include/linux/pci.h2007-04-03 06:58:58.0 
-0700
@@ -126,6 +126,7 @@
unsigned short  subsystem_device;
unsigned intclass;  /* 3 bytes: (base,sub,prog-if) */
u8  hdr_type;   /* PCI header type (`multi' flag masked 
out) */
+   u8  pcie_type;  /* PCI-E device/port type */
u8  rom_base_reg;   /* which config register controls the 
ROM */
u8  pin;/* which interrupt pin this device uses 
*/
 
@@ -168,6 +169,7 @@
unsigned intmsi_enabled:1;
unsigned intmsix_enabled:1;
unsigned intis_managed:1;
+   unsigned intis_pcie:1;
atomic_tenable_cnt; /* pci_enable_device has been called */
 
u32 saved_config_space[16]; /* config space saved at 
suspend time */

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 4/8] [Intel IOMMU] Supporting Zero Length Reads in Intel IOMMU.

2007-04-09 Thread Ashok Raj
PCI specs permit zero length reads (ZLR) even if the mapping for that region 
is write only. Support for this feature is indicated by the presence of a bit 
in the DMAR capability. If a particular DMAR does not support this capability
we map write-only regions as read-write.

This also provides a workaround for some drivers that request a write-only
mapping when they really should request read-write. (We ran into one such
case in eepro100.c in its handling of rx_ring_dma.)

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
--
 drivers/pci/intel-iommu.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6.21-rc5/drivers/pci/intel-iommu.c
===
--- linux-2.6.21-rc5.orig/drivers/pci/intel-iommu.c 2007-04-09 
03:05:25.0 -0700
+++ linux-2.6.21-rc5/drivers/pci/intel-iommu.c  2007-04-09 03:05:32.0 
-0700
@@ -84,7 +84,7 @@
struct sys_device sysdev;
 };
 
-static int dmar_disabled;
+static int dmar_disabled, dmar_force_rw;
 
 static char *get_fault_reason(u8 fault_reason)
 {
@@ -102,6 +102,9 @@
if (!strncmp(str, "off", 3)) {
dmar_disabled = 1;
printk(KERN_INFO"Intel-IOMMU: disabled\n");
+   } else if (!strncmp(str, "forcerw", 7)) {
+   dmar_force_rw = 1;
+   printk(KERN_INFO"Intel-IOMMU: force R/W for W/O 
mapping\n");
}
str += strcspn(str, ",");
while (*str == ',')
@@ -1668,7 +1671,12 @@
goto error;
}
 
-   if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+   /*
+* Check if DMAR supports zero-length reads on write only
+* mappings..
+*/
+   if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL || \
+   !cap_zlr(domain->iommu->cap) || dmar_force_rw)
prot |= DMA_PTE_READ;
if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
prot |= DMA_PTE_WRITE;
Index: linux-2.6.21-rc5/include/linux/intel-iommu.h
===
--- linux-2.6.21-rc5.orig/include/linux/intel-iommu.h   2007-04-09 
03:05:25.0 -0700
+++ linux-2.6.21-rc5/include/linux/intel-iommu.h2007-04-09 
03:05:32.0 -0700
@@ -79,6 +79,7 @@
 #define cap_max_fault_reg_offset(c) \
(cap_fault_reg_offset(c) + cap_num_fault_regs(c) * 16)
 
+#define cap_zlr(c) (((c) >> 22) & 1)
 #define cap_isoch(c)   (((c) >> 23) & 1)
#define cap_mgaw(c)    ((((c) >> 16) & 0x3f) + 1)
 #define cap_sagaw(c)   (((c) >> 8) & 0x1f)

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 5/8] [Intel IOMMU] Graphics driver workarounds to provide unity map

2007-04-09 Thread Ashok Raj
Most GFX drivers don't call the standard PCI DMA APIs to allocate DMA
buffers; such drivers will be broken with the IOMMU enabled. To work around
this issue, we added two options.

Once graphics devices are converted over to use the DMA-API's this entire
patch can be removed... 

a. intel_iommu=igfx_off. With this option, a DMAR that has only gfx devices
   under it will be ignored. This mostly affects integrated gfx devices.
   If the DMAR is ignored, gfx devices under it will get physical addresses
   for DMA.
b. intel_iommu=gfx_workaround. With this option, we set up a 1:1 mapping of
   the whole memory for gfx devices, i.e. physical address equals virtual
   address. In this way, gfx will use physical addresses for DMA; this is
   primarily for add-in card GFX devices. (See the example boot entry below.)
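
For example, a boot entry might look like this (illustrative grub line; the
kernel path is made up):

    kernel /boot/vmlinuz-2.6.21 root=/dev/sda1 ro intel_iommu=gfx_workaround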

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
Index: linux-2.6.21-rc5/arch/x86_64/kernel/e820.c
===
--- linux-2.6.21-rc5.orig/arch/x86_64/kernel/e820.c 2007-04-09 
03:02:37.0 -0700
+++ linux-2.6.21-rc5/arch/x86_64/kernel/e820.c  2007-04-09 03:05:34.0 
-0700
@@ -730,3 +730,22 @@
printk(KERN_INFO "Allocating PCI resources starting at %lx (gap: 
%lx:%lx)\n",
pci_mem_start, gapstart, gapsize);
 }
+
+int __init arch_get_ram_range(int slot, u64 *addr, u64 *size)
+{
+   int i;
+
+   if (slot < 0 || slot >= e820.nr_map)
+   return -1;
+   for (i = slot; i < e820.nr_map; i++) {
+   if(e820.map[i].type != E820_RAM)
+   continue;
+   break;
+   }
+   if (i == e820.nr_map || e820.map[i].addr > (max_pfn << PAGE_SHIFT))
+   return -1;
+   *addr = e820.map[i].addr;
+   *size = min_t(u64, e820.map[i].size + e820.map[i].addr,
+   max_pfn << PAGE_SHIFT) - *addr;
+   return i + 1;
+}
Index: linux-2.6.21-rc5/drivers/pci/intel-iommu.c
===
--- linux-2.6.21-rc5.orig/drivers/pci/intel-iommu.c 2007-04-09 
03:05:32.0 -0700
+++ linux-2.6.21-rc5/drivers/pci/intel-iommu.c  2007-04-09 03:05:34.0 
-0700
@@ -36,6 +36,7 @@
 #include "iova.h"
 #include "pci.h"
 
+#define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
#define IOAPIC_RANGE_START (0xfee00000)
#define IOAPIC_RANGE_END   (0xfeefffff)
 #define IOAPIC_RANGE_SIZE  (IOAPIC_RANGE_END - IOAPIC_RANGE_START + 1)
@@ -85,6 +86,7 @@
 };
 
 static int dmar_disabled, dmar_force_rw;
+static int dmar_map_gfx = 1, dmar_no_gfx_identity_map = 1;
 
 static char *get_fault_reason(u8 fault_reason)
 {
@@ -105,7 +107,14 @@
} else if (!strncmp(str, "forcerw", 7)) {
dmar_force_rw = 1;
printk(KERN_INFO"Intel-IOMMU: force R/W for W/O 
mapping\n");
+   } else if (!strncmp(str, "igfx_off", 8)) {
+   dmar_map_gfx = 0;
+   printk(KERN_INFO"Intel-IOMMU: disable GFX device 
mapping\n");
+   } else if (!strncmp(str, "gfx_workaround", 14)) {
+   dmar_no_gfx_identity_map = 0;
+   printk(KERN_INFO"Intel-IOMMU: do 1-1 mapping whole 
physical memory for GFX device\n");
}
+
str += strcspn(str, ",");
while (*str == ',')
str++;
@@ -1311,6 +1320,7 @@
struct pci_dev *dev; /* it's NULL for PCIE-to-PCI bridge */
struct domain *domain;
 };
+#define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
 static DEFINE_SPINLOCK(device_domain_lock);
 static LIST_HEAD(device_domain_list);
 
@@ -1531,10 +1541,40 @@
 static inline int iommu_prepare_rmrr_dev(struct acpi_rmrr_unit *rmrr,
struct pci_dev *pdev)
 {
+   if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
+   return 0;
return iommu_prepare_identity_map(pdev, rmrr->base_address,
rmrr->end_address + 1);
 }
 
+static void iommu_prepare_gfx_mapping(void)
+{
+   struct pci_dev *pdev = NULL;
+   u64 base, size;
+   int slot;
+   int ret;
+
+   if (dmar_no_gfx_identity_map)
+   return;
+
+   for_each_pci_dev(pdev) {
+   if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO ||
+   !IS_GFX_DEVICE(pdev))
+   continue;
+   printk(KERN_INFO "IOMMU: gfx device %s 1-1 mapping\n",
+   pci_name(pdev));
+   slot = 0;
+   while ((slot = arch_get_ram_range(slot, &base, &size)) >= 0) {
+   ret = iommu_prepare_identity_map(pdev, base, base + 
size);
+   if (ret)
+  

[patch 6/8] [Intel IOMMU] Doc updates for Intel Virtualization Technology for Directed I/O.

2007-04-09 Thread Ashok Raj
Document Intel IOMMU driver boot option.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
Index: linux-2.6.21-rc5/Documentation/Intel-IOMMU.txt
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.21-rc5/Documentation/Intel-IOMMU.txt  2007-04-09 
03:05:36.0 -0700
@@ -0,0 +1,119 @@
+Linux IOMMU Support
+===
+
+The architecture spec can be obtained from the below location.
+
+http://www.intel.com/technology/virtualization/
+
+This guide gives a quick cheat sheet for some basic understanding.
+
+Some Keywords
+
+DMAR - DMA remapping
+DRHD - DMA Engine Reporting Structure
+RMRR - Reserved memory Region Reporting Structure
+ZLR  - Zero length reads from PCI devices
+IOVA - IO Virtual address.
+
+Basic stuff
+---
+
+ACPI enumerates and lists the different DMA engines in the platform, and
+device scope relationships between PCI devices and which DMA engine controls
+them.
+
+What is RMRR?
+-
+
+There are some devices the BIOS controls, e.g. USB devices performing
+PS2 emulation. The regions of memory used by these devices are marked
+reserved in the e820 map. When we turn on DMA translation, DMA to those
+regions will fail. Hence the BIOS uses RMRRs to specify these regions, along
+with the devices that need to access them. The OS is expected to set up
+unity mappings for these regions so these devices can access them.
+
+How is IOVA generated?
+-
+
+Well-behaved drivers call pci_map_*() before sending a command to a device
+that needs to perform DMA. Once the DMA is completed and the mapping is no
+longer required, the driver calls pci_unmap_*() to unmap the region.
+
+The Intel IOMMU driver allocates a virtual address space per domain. Each
+PCIe device has its own domain (hence protection). Devices under p2p bridges
+share the virtual address space with all devices under the p2p bridge, due to
+transaction-id aliasing for p2p bridges.
+
+IOVA generation is pretty generic. We use the same technique as vmalloc(),
+but these are not global address spaces; they are separate for each domain.
+Different DMA engines may support different numbers of domains.
+
+We also allocate guard pages with each mapping, so we can attempt to catch
+any overflow that might happen.
+
+
+Graphics Problems?
+--
+If you encounter issues with graphics devices, you can try adding the
+option intel_iommu=igfx_off to turn off the integrated graphics engine.
+
+If it happens to be a PCI device included in the INCLUDE_ALL engine,
+then try intel_iommu=gfx_workaround to set up a 1-1 map. We hear
+graphics drivers may be in the process of converting to the DMA API in
+the near future.
+
+Some exceptions to IOVA
+---
+Interrupt ranges are not address translated (0xfee00000 - 0xfeefffff).
+The same is true for peer to peer transactions. Hence we reserve the
+address from PCI MMIO ranges so they are not allocated for IOVA addresses.
+
+
+Fault reporting
+---
+When errors are reported, the DMA engine signals via an interrupt. The fault
+reason and the device that caused it are printed on the console.
+
+See below for sample.
+
+
+Boot Message Sample
+---
+
+Something like this gets printed indicating presence of DMAR tables
+in ACPI.
+
+ACPI: DMAR (v001 A M I  OEMDMAR  0x0001 MSFT 0x0097) @ 
0x7f5b5ef0
+
+When DMAR is being processed and initialized by ACPI, prints DMAR locations
+and any RMRR's processed.
+
+ACPI DMAR:Host address width 36
+ACPI DMAR:DRHD (flags: 0x)base: 0xfed9
+ACPI DMAR:DRHD (flags: 0x)base: 0xfed91000
+ACPI DMAR:DRHD (flags: 0x0001)base: 0xfed93000
+ACPI DMAR:RMRR base: 0x000ed000 end: 0x000e
+ACPI DMAR:RMRR base: 0x7f60 end: 0x7fff
+
+When DMAR is enabled for use, you will notice..
+
+PCI-DMA: Using DMAR IOMMU
+
+Fault reporting
+---
+
+DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+DMAR:[fault reason 05] PTE Write access is not set
+DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+DMAR:[fault reason 05] PTE Write access is not set
+
+TBD
+
+
+- No Performance tuning / analysis yet.
+- sysfs needs useful data to be populated.
+  DMAR info, device scope, stats could be exposed to some extent.
+- Add support to Firmware Developer Kit to test ACPI tables for DMAR.
+- For compatibility testing, could use a unity-map domain for all devices:
+  just provide a 1-1 map for all useful memory under a single domain.
+- API for paravirt ops for abstracting functionality for VMM folks.
Index: linux-2.6.21-rc5/Documentation/kernel-parameters.txt
===
--- linux-2.6.21-rc5.orig/Documentation/kernel-parameters.txt   2007-04-09 
03:02:37.0 -0700

[patch 8/8] [Intel IOMMU] Preserve some Virtual Address when devices cannot address entire range.

2007-04-09 Thread Ashok Raj
Some devices may not support the entire 64-bit DMA range. When such devices
are co-located in a shared domain, we need to ensure some low address space
is reserved for them, without the low addresses getting depleted by other
devices capable of handling high DMA addresses.
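
To illustrate the policy (a sketch of the 4G case; this mirrors, but is not
copied verbatim from, the allocation code below):

	/*
	 * intel_iommu=preserve_4g, i.e.
	 * dmar_preserve_iova_mask == DMA_32BIT_MASK:
	 *
	 *   64-bit capable device sharing the domain:
	 *       alloc_iova(domain, addr, size, DMA_32BIT_MASK + 1, dev_mask);
	 *   32-bit limited device:
	 *       alloc_iova(domain, addr, size, IOVA_START_ADDR, dev_mask);
	 *
	 * Wide devices start allocating above the preserved window, so the
	 * whole low 4G stays available to the narrow device.
	 */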

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
Index: linux-2.6.21-rc5/Documentation/kernel-parameters.txt
===
--- linux-2.6.21-rc5.orig/Documentation/kernel-parameters.txt   2007-04-09 
03:05:38.0 -0700
+++ linux-2.6.21-rc5/Documentation/kernel-parameters.txt2007-04-09 
03:05:41.0 -0700
@@ -735,6 +735,11 @@
first 16M. The floppy disk could be modified to use
the DMA api's but thats a lot of pain for very small
gain. This option is turned on by default.
+   preserve_{1g/2g/4g/512m/256m/16m}
+   If a device is sharing a domain with other devices
+   and its DMA mask doesn't cover the full 64-bit range,
+   use this option to let the iommu code preserve some
+   low virtual address space for such devices.
io7=[HW] IO7 for Marvel based alpha systems
See comment before marvel_specify_io7 in
arch/alpha/kernel/core_marvel.c.
Index: linux-2.6.21-rc5/drivers/pci/intel-iommu.c
===
--- linux-2.6.21-rc5.orig/drivers/pci/intel-iommu.c 2007-04-09 
03:05:38.0 -0700
+++ linux-2.6.21-rc5/drivers/pci/intel-iommu.c  2007-04-09 03:06:17.0 
-0700
@@ -90,6 +90,7 @@
 static int dmar_disabled, dmar_force_rw;
 static int dmar_map_gfx = 1, dmar_no_gfx_identity_map = 1;
 static int dmar_fix_isa = 1;
+static u64 dmar_preserve_iova_mask;
 
 static char *get_fault_reason(u8 fault_reason)
 {
@@ -119,6 +120,28 @@
} else if (!strncmp(str, "noisamap", 8)) {
dmar_fix_isa = 0;
printk (KERN_INFO"Intel-IOMMU: Turning off 16M unity 
map for LPC\n");
+   } else if (!strncmp(str, "preserve_", 9)) {
+   if (!strncmp(str + 9, "4g", 2) ||
+   !strncmp(str + 9, "4G", 2))
+   dmar_preserve_iova_mask = DMA_32BIT_MASK;
+   else if (!strncmp(str + 9, "2g", 2) ||
+   !strncmp(str + 9, "2G", 2))
+   dmar_preserve_iova_mask = DMA_31BIT_MASK;
+   else if (!strncmp(str + 9, "1g", 2) ||
+!strncmp(str + 9, "1G", 2))
+   dmar_preserve_iova_mask = DMA_30BIT_MASK;
+   else if (!strncmp(str + 9, "512m", 4) ||
+!strncmp(str + 9, "512M", 4))
+   dmar_preserve_iova_mask = DMA_29BIT_MASK;
+   else if (!strncmp(str + 9, "256m", 4) ||
+!strncmp(str + 9, "256M", 4))
+   dmar_preserve_iova_mask = DMA_28BIT_MASK;
+   else if (!strncmp(str + 9, "16m", 3) ||
+!strncmp(str + 9, "16M", 3))
+   dmar_preserve_iova_mask = DMA_24BIT_MASK;
+   printk(KERN_INFO
+   "DMAR: Preserved IOVA mask 0x%Lx for devices "
+   "sharing domain\n", dmar_preserve_iova_mask);
}
 
str += strcspn(str, ",");
@@ -1726,9 +1749,10 @@
 * leave rooms for other devices
 */
if ((domain->flags & DOMAIN_FLAG_MULTIPLE_DEVICES) &&
-   pdev->dma_mask > DMA_32BIT_MASK)
+   dmar_preserve_iova_mask &&
+   pdev->dma_mask > dmar_preserve_iova_mask)
iova = alloc_iova(domain, addr, size,
-   DMA_32BIT_MASK + 1, pdev->dma_mask);
+   dmar_preserve_iova_mask + 1, pdev->dma_mask);
else
iova = alloc_iova(domain, addr, size,
IOVA_START_ADDR, pdev->dma_mask);

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 1/8] [Intel IOMMU] ACPI support for Intel Virtualization Technology for Directed I/O

2007-04-09 Thread Ashok Raj
This patch contains basic ACPI parsing and enumeration support.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
Index: linux-2.6.21-rc5/arch/x86_64/Kconfig
===
--- linux-2.6.21-rc5.orig/arch/x86_64/Kconfig   2007-04-03 04:30:40.0 
-0700
+++ linux-2.6.21-rc5/arch/x86_64/Kconfig2007-04-03 06:34:17.0 
-0700
@@ -687,6 +687,14 @@
bool "Support mmconfig PCI config space access"
depends on PCI && ACPI
 
+config DMAR
+   bool "Support for DMA Remapping Devices (EXPERIMENTAL)"
+   depends on PCI_MSI && ACPI && EXPERIMENTAL
+   help
+ Support DMA Remapping Devices. The devices are reported via
+ ACPI tables and includes pci device scope under each DMA
+ remapping device.
+
 source "drivers/pci/pcie/Kconfig"
 
 source "drivers/pci/Kconfig"
Index: linux-2.6.21-rc5/drivers/acpi/Makefile
===
--- linux-2.6.21-rc5.orig/drivers/acpi/Makefile 2007-04-03 04:30:40.0 
-0700
+++ linux-2.6.21-rc5/drivers/acpi/Makefile  2007-04-03 06:34:17.0 
-0700
@@ -60,3 +60,4 @@
 obj-$(CONFIG_ACPI_HOTPLUG_MEMORY)  += acpi_memhotplug.o
 obj-y  += cm_sbs.o
 obj-$(CONFIG_ACPI_SBS) += i2c_ec.o sbs.o
+obj-$(CONFIG_DMAR) += dmar.o
Index: linux-2.6.21-rc5/drivers/acpi/dmar.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.21-rc5/drivers/acpi/dmar.c2007-04-03 06:54:27.0 
-0700
@@ -0,0 +1,344 @@
+/*
+ * Copyright (c) 2006, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Copyright (C) Ashok Raj <[EMAIL PROTECTED]>
+ * Copyright (C) Shaohua Li <[EMAIL PROTECTED]>
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#undef PREFIX
+#define PREFIX "ACPI DMAR:"
+
+#define MIN_SCOPE_LEN (sizeof(struct acpi_pci_path) + sizeof(struct 
acpi_dev_scope))
+
+LIST_HEAD(acpi_drhd_units);
+LIST_HEAD(acpi_rmrr_units);
+u8 dmar_host_address_width;
+
+static int __init acpi_register_drhd_unit(struct acpi_drhd_unit *drhd)
+{
+   /*
+* add INCLUDE_ALL at the tail, so scan the list will find it at
+* the very end.
+*/
+   if (drhd->include_all)
+   list_add_tail(&drhd->list, &acpi_drhd_units);
+   else
+   list_add(&drhd->list, &acpi_drhd_units);
+   return 0;
+}
+
+static int __init acpi_register_rmrr_unit(struct acpi_rmrr_unit *rmrr)
+{
+   list_add(&rmrr->list, &acpi_rmrr_units);
+   return 0;
+}
+
+static int acpi_pci_device_match(struct pci_dev *devices[], int cnt,
+struct pci_dev *dev)
+{
+   int index;
+
+   while (dev) {
+   for (index = 0; index < cnt; index ++)
+   if (dev == devices[index])
+   return 1;
+
+   /* Check our parent */
+   dev = dev->bus->self;
+   }
+
+   return 0;
+}
+
+struct acpi_drhd_unit * acpi_find_matched_drhd_unit(struct pci_dev *dev)
+{
+   struct acpi_drhd_unit *drhd = NULL;
+
+   list_for_each_entry(drhd, &acpi_drhd_units, list) {
+   if (drhd->include_all || acpi_pci_device_match(drhd->devices,
+   drhd->devices_cnt, dev))
+   break;
+   }
+
+   return drhd;
+}
+
+struct acpi_rmrr_unit * acpi_find_matched_rmrr_unit(struct pci_dev *dev)
+{
+   struct acpi_rmrr_unit *rmrr;
+
+   list_for_each_entry(rmrr, &acpi_rmrr_units, list) {
+   if (acpi_pci_device_match(rmrr->devices,
+   rmrr->devices_cnt, dev))
+   goto out;
+   }
+   rmrr = NULL;
+out:
+   return rmrr;
+}
+
+static int __init acpi_parse_one_dev_scope(struct acpi_dev_scope *scope,
+  struct pci_dev **dev, u16 segment)
+{
+   struct pci_bus *bus;
+   struct pci_dev *pdev = NULL;
+   struct acpi_pci_path *path;
+   

[patch 7/8] [Intel IOMMU] Support for legacy ISA devices

2007-04-09 Thread Ashok Raj
Floppy disk drivers don't work well with DMA remapping. It's possible to
extend the current code for x86_64, but the gain is very little. If someone
feels compelled to clean this up, it's up for grabs. Since these use the
first 16M, we just provide a unity map for the ISA bridge device.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5/Documentation/kernel-parameters.txt
===
--- linux-2.6.21-rc5.orig/Documentation/kernel-parameters.txt   2007-04-09 
03:05:36.0 -0700
+++ linux-2.6.21-rc5/Documentation/kernel-parameters.txt2007-04-09 
03:05:38.0 -0700
@@ -730,6 +730,11 @@
the IOMMU driver to set a unity map for all OS
visible memory. Hence the driver can continue to use
physical addresses for DMA.
+   noisamap
+   Do not set up the identity map for the first 16M
+   (used by the floppy). The floppy driver could be
+   modified to use the DMA API, but that's a lot of
+   pain for very small gain; the identity map is on
+   by default.
io7=[HW] IO7 for Marvel based alpha systems
See comment before marvel_specify_io7 in
arch/alpha/kernel/core_marvel.c.
Index: linux-2.6.21-rc5/drivers/pci/intel-iommu.c
===
--- linux-2.6.21-rc5.orig/drivers/pci/intel-iommu.c 2007-04-09 
03:05:34.0 -0700
+++ linux-2.6.21-rc5/drivers/pci/intel-iommu.c  2007-04-09 03:05:38.0 
-0700
@@ -37,6 +37,8 @@
 #include "pci.h"
 
 #define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
+#define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
+
 #define IOAPIC_RANGE_START (0xfee00000)
 #define IOAPIC_RANGE_END   (0xfeefffff)
 #define IOAPIC_RANGE_SIZE  (IOAPIC_RANGE_END - IOAPIC_RANGE_START + 1)
@@ -87,6 +89,7 @@
 
 static int dmar_disabled, dmar_force_rw;
 static int dmar_map_gfx = 1, dmar_no_gfx_identity_map = 1;
+static int dmar_fix_isa = 1;
 
 static char *get_fault_reason(u8 fault_reason)
 {
@@ -113,6 +116,9 @@
} else if (!strncmp(str, "gfx_workaround", 14)) {
dmar_no_gfx_identity_map = 0;
printk(KERN_INFO"Intel-IOMMU: do 1-1 mapping whole 
physical memory for GFX device\n");
+   } else if (!strncmp(str, "noisamap", 8)) {
+   dmar_fix_isa = 0;
+   printk (KERN_INFO"Intel-IOMMU: Turning off 16M unity 
map for LPC\n");
}
 
str += strcspn(str, ",");
@@ -1575,6 +1581,25 @@
}
 }
 
+static void iommu_prepare_isa(void)
+{
+   struct pci_dev *pdev = NULL;
+   int ret;
+
+   if (!dmar_fix_isa)
+   return;
+
+   pdev = pci_get_class (PCI_CLASS_BRIDGE_ISA << 8, NULL);
+   if (!pdev)
+   return;
+
+   printk (KERN_INFO "IOMMU: Prepare 0-16M unity mapping for LPC\n");
+   ret = iommu_prepare_identity_map(pdev, 0, 16*1024*1024);
+
+   if (ret)
+   printk ("IOMMU: Failed to create 0-64M identity map, Floppy 
might not work\n");
+
+}
 int __init init_dmars(void)
 {
struct acpi_drhd_unit *drhd;
@@ -1631,6 +1656,7 @@
end_for_each_rmrr_device(rmrr, pdev)
 
iommu_prepare_gfx_mapping();
+   iommu_prepare_isa();
 
/*
 * for each drhd

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/8] [Intel IOMMU] ACPI support for Intel Virtualization Technology for Directed I/O

2007-04-10 Thread Ashok Raj
On Mon, Apr 09, 2007 at 11:39:19PM -0400, Len Brown wrote:
> On Monday 09 April 2007 17:55, Ashok Raj wrote:
> > This patch contains basic ACPI parsing and enumeration support.
> 
> AFAICS, ACPI supplies the envelope which delivers the table,
> and ACPI has some convenience structure definitions for that
> table in include/acpi/actbl1.h (primarily for the acpixtract table 
> dis-assembler),
> but ACPI is otherwise not involved in IOMMU support.
> 
> Indeed, one might argue that all new functions in this patch series with
> "acpi..." would more appropriately be called "pci...", since a cursory
> scan of the IOMMU spec seems to suggest it is specific to PCI.

I think we can migrate some of the code so the core part just performs the
get-table. We will do that in the next respin.

> 
> So on first blush, it looks like the only call to a function that begins with
> "acpi" in this patch series should be acpi_get_table() from some IOMMU
> specific file outside of drivers/acpi,
> and the only modification to any code with an "acpi" in the file path or 
> filename should
> be any updates to the convenience structure definitions in acpitbl1.h
> 
> thanks,
> -Len
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 0/8] [Intel IOMMU] Support for Intel Virtualization Technology for Directed I/O

2007-04-10 Thread Ashok Raj
On Tue, Apr 10, 2007 at 09:49:55AM +0200, Andi Kleen wrote:
> On Monday 09 April 2007 23:55:52 Ashok Raj wrote:
> 
> > Please help review and provide feedback.
> 
> High level question: how did you solve the "user X server needs IOMMU bypass"
> problem?

There is no special consideration for user space. Since all useful memory is
mapped 1-1, I guess user space would work as-is, unless I am missing
something here. So yes, there is no protection with 1-1, but I guess it's
like a compatibility mode until the code gets converted over.

Keith (cc'ed) says some of the user-space side is also getting converted
over, possibly with some driver support. So this is an interim problem until
X catches up with it.
> 
> -Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 0/8] [Intel IOMMU] Support for Intel Virtualization Technology for Directed I/O

2007-04-10 Thread Ashok Raj
On Tue, Apr 10, 2007 at 04:34:48AM -0400, Jeff Garzik wrote:
> Shaohua Li wrote:
> >DMA remapping just uses ACPI table to tell which dma remapping engine a
> >pci device is controlled by at boot time. At run time, DMA remapping
> >hasn't any interactive with ACPI.
> 
> The Linux kernel _really_ wants a non-ACPI way to detect this.
> 
> Just use the hardware registers themselves, you don't need an ACPI table.
> 
>   Jeff

ACPI is required not just for identifying the DMA remapping hardware in the
system; we also need it to identify which engines control which pci devices.

Also, there are some reserved regions that the BIOS uses for its own purposes
(say, legacy emulation via USB) that need an identity map, and these have to
be communicated to the OS.

I'm not sure we can get away from using ACPI that easily. It's used only for
setup information; once the enumeration is complete we don't bother ACPI
anymore.
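
A minimal sketch of that setup-time use (assuming the stock ACPICA table API;
dmar_table_detect is a made-up name and error handling is elided):

	static struct acpi_table_header *dmar_tbl;

	static int __init dmar_table_detect(void)
	{
		acpi_status status;

		/* the only ACPI interaction: fetch the DMAR table once */
		status = acpi_get_table("DMAR", 0, &dmar_tbl);
		if (ACPI_FAILURE(status) || !dmar_tbl)
			return -ENODEV;	/* no remapping hardware reported */

		/* ... walk the DRHD/RMRR sub-tables in dmar_tbl ... */
		return 0;
	}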
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Intel IOMMU][patch 2/8] Some generic search functions required to lookup device relationships.

2007-04-24 Thread Ashok Raj
PCI support functions for DMAR, to find parent bridge. 

When devices are under a p2p bridge, upstream transactions get replaced by
the device id of the bridge, as it owns the PCIe transaction. Hence it's
necessary to set up translations on behalf of the bridge as well. Due to this
limitation, all devices under a p2p bridge share the same domain in a DMAR.

We also cache whether a device is a native PCIe device or not, for later use.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
-

Index: 2.6.21-rc6/drivers/pci/pci.h
===
--- 2.6.21-rc6.orig/drivers/pci/pci.h   2007-04-06 10:36:56.0 +0800
+++ 2.6.21-rc6/drivers/pci/pci.h2007-04-11 13:52:29.0 +0800
@@ -90,3 +90,4 @@ pci_match_one_device(const struct pci_de
return NULL;
 }
 
+struct pci_dev *pci_find_upstream_pcie_bridge(struct pci_dev *pdev);
Index: 2.6.21-rc6/drivers/pci/probe.c
===
--- 2.6.21-rc6.orig/drivers/pci/probe.c 2007-04-06 10:36:56.0 +0800
+++ 2.6.21-rc6/drivers/pci/probe.c  2007-04-11 13:52:29.0 +0800
@@ -822,6 +822,19 @@ static void pci_release_dev(struct devic
kfree(pci_dev);
 }
 
+static void set_pcie_port_type(struct pci_dev *pdev)
+{
+   int pos;
+   u16 reg16;
+
+   pos = pci_find_capability(pdev, PCI_CAP_ID_EXP);
+   if (!pos)
+   return;
+   pdev->is_pcie = 1;
+   pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, &reg16);
+   pdev->pcie_type = (reg16 & PCI_EXP_FLAGS_TYPE) >> 4;
+}
+
 /**
  * pci_cfg_space_size - get the configuration space size of the PCI device.
  * @dev: PCI device
@@ -919,6 +932,7 @@ pci_scan_device(struct pci_bus *bus, int
dev->device = (l >> 16) & 0xffff;
dev->cfg_size = pci_cfg_space_size(dev);
dev->error_state = pci_channel_io_normal;
+   set_pcie_port_type(dev);
 
/* Assume 32-bit PCI; let 64-bit PCI cards (which are far rarer)
   set this higher, assuming the system even supports it.  */
Index: 2.6.21-rc6/drivers/pci/search.c
===
--- 2.6.21-rc6.orig/drivers/pci/search.c2007-04-06 10:36:56.0 
+0800
+++ 2.6.21-rc6/drivers/pci/search.c 2007-04-11 13:52:29.0 +0800
@@ -14,6 +14,36 @@
 #include "pci.h"
 
 DECLARE_RWSEM(pci_bus_sem);
+/*
+ * find the upstream PCIE-to-PCI bridge of a PCI device
+ * if the device is PCIE, return NULL
+ * if the device isn't connected to a PCIE bridge (that is its parent is a
+ * legacy PCI bridge and the bridge is directly connected to bus 0), return its
+ * parent
+ */
+struct pci_dev *
+pci_find_upstream_pcie_bridge(struct pci_dev *pdev)
+{
+   struct pci_dev *tmp = NULL;
+
+   if (pdev->is_pcie)
+   return NULL;
+   while (1) {
+   if (!pdev->bus->self)
+   break;
+   pdev = pdev->bus->self;
+   /* a p2p bridge */
+   if (!pdev->is_pcie) {
+   tmp = pdev;
+   continue;
+   }
+   /* PCI device should connect to a PCIE bridge */
+   BUG_ON(pdev->pcie_type != PCI_EXP_TYPE_PCI_BRIDGE);
+   return pdev;
+   }
+
+   return tmp;
+}
 
 static struct pci_bus *
 pci_do_find_bus(struct pci_bus* bus, unsigned char busnr)
Index: 2.6.21-rc6/include/linux/pci.h
===
--- 2.6.21-rc6.orig/include/linux/pci.h 2007-04-06 10:36:56.0 +0800
+++ 2.6.21-rc6/include/linux/pci.h  2007-04-11 13:52:29.0 +0800
@@ -126,6 +126,7 @@ struct pci_dev {
unsigned short  subsystem_device;
unsigned intclass;  /* 3 bytes: (base,sub,prog-if) */
u8  hdr_type;   /* PCI header type (`multi' flag masked 
out) */
+   u8  pcie_type;  /* PCI-E device/port type */
u8  rom_base_reg;   /* which config register controls the 
ROM */
u8  pin;/* which interrupt pin this device uses 
*/
 
@@ -168,6 +169,7 @@ struct pci_dev {
unsigned intmsi_enabled:1;
unsigned intmsix_enabled:1;
unsigned intis_managed:1;
+   unsigned intis_pcie:1;
atomic_tenable_cnt; /* pci_enable_device has been called */
 
u32 saved_config_space[16]; /* config space saved at 
suspend time */

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Intel IOMMU][patch 4/8] Supporting Zero Length Reads in Intel IOMMU.

2007-04-24 Thread Ashok Raj
PCI specs permit zero length reads (ZLR) even if the mapping for that region 
is write only. Support for this feature is indicated by the presence of a bit 
in the DMAR capability. If a particular DMAR does not support this capability
we map write-only regions as read-write.

This also provides a workaround for some drivers that request a write-only
mapping when they really should request read-write. (We ran into one such
case in eepro100.c in its handling of rx_ring_dma.)

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
--
 drivers/pci/intel-iommu.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: 2.6.21-rc6/drivers/pci/intel-iommu.c
===
--- 2.6.21-rc6.orig/drivers/pci/intel-iommu.c   2007-04-18 09:04:56.0 
+0800
+++ 2.6.21-rc6/drivers/pci/intel-iommu.c2007-04-18 09:04:59.0 
+0800
@@ -84,7 +84,7 @@ struct iommu {
struct sys_device sysdev;
 };
 
-static int dmar_disabled;
+static int dmar_disabled, dmar_force_rw;
 
 static char *get_fault_reason(u8 fault_reason)
 {
@@ -102,6 +102,9 @@ static int __init intel_iommu_setup(char
if (!strncmp(str, "off", 3)) {
dmar_disabled = 1;
printk(KERN_INFO"Intel-IOMMU: disabled\n");
+   } else if (!strncmp(str, "forcerw", 7)) {
+   dmar_force_rw = 1;
+   printk(KERN_INFO"Intel-IOMMU: force R/W for W/O 
mapping\n");
}
str += strcspn(str, ",");
while (*str == ',')
@@ -1720,7 +1723,12 @@ static dma_addr_t __intel_map_single(str
goto error;
}
 
-   if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+   /*
+* Check if DMAR supports zero-length reads on write only
+* mappings..
+*/
+   if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL || \
+   !cap_zlr(domain->iommu->cap) || dmar_force_rw)
prot |= DMA_PTE_READ;
if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
prot |= DMA_PTE_WRITE;
Index: 2.6.21-rc6/include/linux/intel-iommu.h
===
--- 2.6.21-rc6.orig/include/linux/intel-iommu.h 2007-04-18 09:04:56.0 
+0800
+++ 2.6.21-rc6/include/linux/intel-iommu.h  2007-04-18 09:04:59.0 
+0800
@@ -79,6 +79,7 @@
 #define cap_max_fault_reg_offset(c) \
(cap_fault_reg_offset(c) + cap_num_fault_regs(c) * 16)
 
+#define cap_zlr(c) (((c) >> 22) & 1)
 #define cap_isoch(c)   (((c) >> 23) & 1)
#define cap_mgaw(c)    ((((c) >> 16) & 0x3f) + 1)
 #define cap_sagaw(c)   (((c) >> 8) & 0x1f)

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Intel IOMMU][patch 7/8] Support for legacy ISA devices

2007-04-24 Thread Ashok Raj
Floppy disk drivers don't work well with DMA remapping. It's possible to
extend the current code for x86_64, but the gain is very little. If someone
feels compelled to clean this up, it's up for grabs. Since these use the
first 16M, we just provide a unity map for the ISA bridge device.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5/Documentation/kernel-parameters.txt
===
--- linux-2.6.21-rc5.orig/Documentation/kernel-parameters.txt   2007-04-17 
05:41:56.0 -0700
+++ linux-2.6.21-rc5/Documentation/kernel-parameters.txt2007-04-17 
05:41:59.0 -0700
@@ -730,6 +730,11 @@
the IOMMU driver to set a unity map for all OS
visible memory. Hence the driver can continue to use
physical addresses for DMA.
+   noisamap
+   Do not set up the identity map for the first 16M
+   (used by the floppy). The floppy driver could be
+   modified to use the DMA API, but that's a lot of
+   pain for very small gain; the identity map is on
+   by default.
io7=[HW] IO7 for Marvel based alpha systems
See comment before marvel_specify_io7 in
arch/alpha/kernel/core_marvel.c.
Index: linux-2.6.21-rc5/drivers/pci/intel-iommu.c
===
--- linux-2.6.21-rc5.orig/drivers/pci/intel-iommu.c 2007-04-17 
05:41:53.0 -0700
+++ linux-2.6.21-rc5/drivers/pci/intel-iommu.c  2007-04-17 05:41:59.0 
-0700
@@ -37,6 +37,8 @@
 #include "pci.h"
 
 #define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
+#define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
+
 #define IOAPIC_RANGE_START (0xfee00000)
 #define IOAPIC_RANGE_END   (0xfeefffff)
 #define IOAPIC_RANGE_SIZE  (IOAPIC_RANGE_END - IOAPIC_RANGE_START + 1)
@@ -87,6 +89,7 @@
 
 static int dmar_disabled, dmar_force_rw;
 static int dmar_map_gfx = 1, dmar_no_gfx_identity_map = 1;
+static int dmar_fix_isa = 1;
 
 static char *get_fault_reason(u8 fault_reason)
 {
@@ -113,6 +116,9 @@
} else if (!strncmp(str, "gfx_workaround", 14)) {
dmar_no_gfx_identity_map = 0;
printk(KERN_INFO"Intel-IOMMU: do 1-1 mapping whole 
physical memory for GFX device\n");
+   } else if (!strncmp(str, "noisamap", 8)) {
+   dmar_fix_isa = 0;
+   printk (KERN_INFO"Intel-IOMMU: Turning off 16M unity 
map for LPC\n");
}
 
str += strcspn(str, ",");
@@ -1582,6 +1588,25 @@
}
 }
 
+static void iommu_prepare_isa(void)
+{
+   struct pci_dev *pdev = NULL;
+   int ret;
+
+   if (!dmar_fix_isa)
+   return;
+
+   pdev = pci_get_class (PCI_CLASS_BRIDGE_ISA << 8, NULL);
+   if (!pdev)
+   return;
+
+   printk (KERN_INFO "IOMMU: Prepare 0-16M unity mapping for LPC\n");
+   ret = iommu_prepare_identity_map(pdev, 0, 16*1024*1024);
+
+   if (ret)
+   printk ("IOMMU: Failed to create 0-64M identity map, Floppy 
might not work\n");
+
+}
 int __init init_dmars(void)
 {
struct dmar_drhd_unit *drhd;
@@ -1638,6 +1663,7 @@
end_for_each_rmrr_device(rmrr, pdev)
 
iommu_prepare_gfx_mapping();
+   iommu_prepare_isa();
 
/*
 * for each drhd

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Intel IOMMU][patch 6/8] Doc updates for Intel Virtualization Technology for Directed I/O.

2007-04-24 Thread Ashok Raj
Document Intel IOMMU driver boot option.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
Index: linux-2.6.21-rc5/Documentation/Intel-IOMMU.txt
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.21-rc5/Documentation/Intel-IOMMU.txt  2007-04-17 
05:41:56.0 -0700
@@ -0,0 +1,119 @@
+Linux IOMMU Support
+===
+
+The architecture spec can be obtained from the below location.
+
+http://www.intel.com/technology/virtualization/
+
+This guide gives a quick cheat sheet for some basic understanding.
+
+Some Keywords
+
+DMAR - DMA remapping
+DRHD - DMA Engine Reporting Structure
+RMRR - Reserved memory Region Reporting Structure
+ZLR  - Zero length reads from PCI devices
+IOVA - IO Virtual address.
+
+Basic stuff
+---
+
+ACPI enumerates and lists the different DMA engines in the platform, and
+device scope relationships between PCI devices and which DMA engine controls
+them.
+
+What is RMRR?
+-
+
+There are some devices the BIOS controls, e.g. USB devices performing
+PS2 emulation. The regions of memory used by these devices are marked
+reserved in the e820 map. When we turn on DMA translation, DMA to those
+regions will fail. Hence the BIOS uses RMRRs to specify these regions, along
+with the devices that need to access them. The OS is expected to set up
+unity mappings for these regions so these devices can access them.
+
+How is IOVA generated?
+-
+
+Well-behaved drivers call pci_map_*() before sending a command to a device
+that needs to perform DMA. Once the DMA is completed and the mapping is no
+longer required, the driver calls pci_unmap_*() to unmap the region.
+
+The Intel IOMMU driver allocates a virtual address space per domain. Each
+PCIe device has its own domain (hence protection). Devices under p2p bridges
+share the virtual address space with all devices under the p2p bridge, due to
+transaction-id aliasing for p2p bridges.
+
+IOVA generation is pretty generic. We use the same technique as vmalloc(),
+but these are not global address spaces; they are separate for each domain.
+Different DMA engines may support different numbers of domains.
+
+We also allocate guard pages with each mapping, so we can attempt to catch
+any overflow that might happen.
+
+
+Graphics Problems?
+--
+If you encounter issues with graphics devices, you can try adding the
+option intel_iommu=igfx_off to turn off the integrated graphics engine.
+
+If it happens to be a PCI device included in the INCLUDE_ALL engine,
+then try intel_iommu=gfx_workaround to set up a 1-1 map. We hear
+graphics drivers may be in the process of converting to the DMA API in
+the near future.
+
+Some exceptions to IOVA
+---
+Interrupt ranges are not address translated (0xfee00000 - 0xfeefffff).
+The same is true for peer to peer transactions. Hence we reserve the
+address from PCI MMIO ranges so they are not allocated for IOVA addresses.
+
+
+Fault reporting
+---
+When errors are reported, the DMA engine signals via an interrupt. The fault
+reason and the device that caused it are printed on the console.
+
+See below for sample.
+
+
+Boot Message Sample
+---
+
+Something like this gets printed indicating presence of DMAR tables
+in ACPI.
+
+ACPI: DMAR (v001 A M I  OEMDMAR  0x0001 MSFT 0x0097) @ 
0x7f5b5ef0
+
+When DMAR is being processed and initialized by ACPI, prints DMAR locations
+and any RMRR's processed.
+
+ACPI DMAR:Host address width 36
+ACPI DMAR:DRHD (flags: 0x)base: 0xfed9
+ACPI DMAR:DRHD (flags: 0x)base: 0xfed91000
+ACPI DMAR:DRHD (flags: 0x0001)base: 0xfed93000
+ACPI DMAR:RMRR base: 0x000ed000 end: 0x000e
+ACPI DMAR:RMRR base: 0x7f60 end: 0x7fff
+
+When DMAR is enabled for use, you will notice..
+
+PCI-DMA: Using DMAR IOMMU
+
+Fault reporting
+---
+
+DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+DMAR:[fault reason 05] PTE Write access is not set
+DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+DMAR:[fault reason 05] PTE Write access is not set
+
+TBD
+
+
+- No Performance tuning / analysis yet.
+- sysfs needs useful data to be populated.
+  DMAR info, device scope, stats could be exposed to some extent.
+- Add support to Firmware Developer Kit to test ACPI tables for DMAR.
+- For compatibility testing, could use a unity-map domain for all devices:
+  just provide a 1-1 map for all useful memory under a single domain.
+- API for paravirt ops for abstracting functionality for VMM folks.
Index: linux-2.6.21-rc5/Documentation/kernel-parameters.txt
===
--- linux-2.6.21-rc5.orig/Documentation/kernel-parameters.txt   2007-04-17 
04:59:42.0 -0700

[Intel IOMMU][patch 5/8] Graphics driver workarounds to provide unity map

2007-04-24 Thread Ashok Raj
Most GFX drivers don't call the standard PCI DMA APIs to allocate DMA buffers,
so such drivers will be broken with the IOMMU enabled. To work around this
issue, we added two options.

Once graphics devices are converted over to use the DMA APIs, this entire
patch can be removed...

a. intel_iommu=igfx_off. With this option, a DMAR that has only gfx devices
   under it will be ignored. This mostly affects integrated gfx devices.
   If the DMAR is ignored, gfx devices under it will get physical addresses
   for DMA.
b. intel_iommu=gfx_workaround. With this option, we will set up a 1:1 mapping
   for the whole memory for gfx devices, that is, physical address equals
   virtual address. In this way, gfx will use physical addresses for DMA; this
   is primarily for add-in card GFX devices.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
Index: 2.6.21-rc6/arch/x86_64/kernel/e820.c
===
--- 2.6.21-rc6.orig/arch/x86_64/kernel/e820.c   2007-04-20 11:03:01.0 
+0800
+++ 2.6.21-rc6/arch/x86_64/kernel/e820.c2007-04-20 11:45:56.0 
+0800
@@ -730,3 +730,22 @@ __init void e820_setup_gap(void)
printk(KERN_INFO "Allocating PCI resources starting at %lx (gap: 
%lx:%lx)\n",
pci_mem_start, gapstart, gapsize);
 }
+
+int __init arch_get_ram_range(int slot, u64 *addr, u64 *size)
+{
+   int i;
+
+   if (slot < 0 || slot >= e820.nr_map)
+   return -1;
+   for (i = slot; i < e820.nr_map; i++) {
+   if(e820.map[i].type != E820_RAM)
+   continue;
+   break;
+   }
+   if (i == e820.nr_map || e820.map[i].addr > (max_pfn << PAGE_SHIFT))
+   return -1;
+   *addr = e820.map[i].addr;
+   *size = min_t(u64, e820.map[i].size + e820.map[i].addr,
+   max_pfn << PAGE_SHIFT) - *addr;
+   return i + 1;
+}
Index: 2.6.21-rc6/drivers/pci/dmar.h
===
--- 2.6.21-rc6.orig/drivers/pci/dmar.h  2007-04-20 11:38:30.0 +0800
+++ 2.6.21-rc6/drivers/pci/dmar.h   2007-04-20 11:45:56.0 +0800
@@ -35,6 +35,7 @@ struct dmar_drhd_unit {
int devices_cnt;
u8  include_all:1;
struct iommu *iommu;
+   int ignored:1; /* the drhd should be ignored */
 };
 
 struct dmar_rmrr_unit {
Index: 2.6.21-rc6/drivers/pci/intel-iommu.c
===
--- 2.6.21-rc6.orig/drivers/pci/intel-iommu.c   2007-04-20 11:45:52.0 
+0800
+++ 2.6.21-rc6/drivers/pci/intel-iommu.c2007-04-20 11:45:56.0 
+0800
@@ -36,6 +36,7 @@
 #include "iova.h"
 #include "pci.h"
 
+#define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
 #define IOAPIC_RANGE_START (0xfee00000)
 #define IOAPIC_RANGE_END   (0xfeefffff)
 #define IOAPIC_RANGE_SIZE  (IOAPIC_RANGE_END - IOAPIC_RANGE_START + 1)
@@ -85,6 +86,7 @@ struct iommu {
 };
 
 static int dmar_disabled, dmar_force_rw;
+static int dmar_map_gfx = 1, dmar_no_gfx_identity_map = 1;
 
 static char *get_fault_reason(u8 fault_reason)
 {
@@ -105,7 +107,14 @@ static int __init intel_iommu_setup(char
} else if (!strncmp(str, "forcerw", 7)) {
dmar_force_rw = 1;
printk(KERN_INFO"Intel-IOMMU: force R/W for W/O 
mapping\n");
+   } else if (!strncmp(str, "igfx_off", 8)) {
+   dmar_map_gfx = 0;
+   printk(KERN_INFO"Intel-IOMMU: disable GFX device 
mapping\n");
+   } else if (!strncmp(str, "gfx_workaround", 14)) {
+   dmar_no_gfx_identity_map = 0;
+   printk(KERN_INFO"Intel-IOMMU: do 1-1 mapping whole 
physical memory for GFX device\n");
}
+
str += strcspn(str, ",");
while (*str == ',')
str++;
@@ -1318,6 +1327,7 @@ struct device_domain_info {
struct pci_dev *dev; /* it's NULL for PCIE-to-PCI bridge */
struct domain *domain;
 };
+#define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
 static DEFINE_SPINLOCK(device_domain_lock);
 static LIST_HEAD(device_domain_list);
 
@@ -1538,10 +1548,40 @@ error:
 static inline int iommu_prepare_rmrr_dev(struct dmar_rmrr_unit *rmrr,
struct pci_dev *pdev)
 {
+   if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
+   return 0;
return iommu_prepare_identity_map(pdev, rmrr->base_address,
rmrr->end_address + 1);
 }
 
+static void iommu_prepare_gfx_mapping(void)
+{
+   struct pci_dev *pdev = NULL;
+   u64 base, size;
+   int slot;
+   int ret;
+
+   if (dmar_no_gfx_identity_m

[Intel IOMMU][patch 8/8] Preserve some Virtual Address when devices cannot address entire range.

2007-04-24 Thread Ashok Raj
Some devices may not support the entire 64-bit DMA range. In a situation where
such devices are co-located in a shared domain, we need to ensure some
address space is reserved for them without the low addresses getting
depleted by other devices capable of handling high DMA addresses.
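
For example, booting with:

	intel_iommu=preserve_256m

would hold back IOVA space below 256M for such devices (see the option
parsing in the diff below).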

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
Index: linux-2.6.21-rc5/Documentation/kernel-parameters.txt
===
--- linux-2.6.21-rc5.orig/Documentation/kernel-parameters.txt   2007-04-17 
06:02:24.0 -0700
+++ linux-2.6.21-rc5/Documentation/kernel-parameters.txt2007-04-17 
06:02:33.0 -0700
@@ -735,6 +735,11 @@
first 16M. The floppy disk could be modified to use
the DMA APIs, but that's a lot of pain for very small
gain. The identity map is enabled by default.
+   preserve_{1g/2g/4g/512m/256m/16m}
+   If a device is sharing a domain with other devices
+   and the device mask doesn't cover the 64bit range,
+   use this option to let the iommu code preserve some
+   virtual address space for such devices.
io7=[HW] IO7 for Marvel based alpha systems
See comment before marvel_specify_io7 in
arch/alpha/kernel/core_marvel.c.
Index: linux-2.6.21-rc5/drivers/pci/intel-iommu.c
===
--- linux-2.6.21-rc5.orig/drivers/pci/intel-iommu.c 2007-04-17 
06:02:24.0 -0700
+++ linux-2.6.21-rc5/drivers/pci/intel-iommu.c  2007-04-17 06:05:49.0 
-0700
@@ -90,6 +90,7 @@
 static int dmar_disabled, dmar_force_rw;
 static int dmar_map_gfx = 1, dmar_no_gfx_identity_map = 1;
 static int dmar_fix_isa = 1;
+static u64 dmar_preserve_iova_mask;
 
 static char *get_fault_reason(u8 fault_reason)
 {
@@ -119,6 +120,32 @@
} else if (!strncmp(str, "noisamap", 8)) {
dmar_fix_isa = 0;
printk (KERN_INFO"Intel-IOMMU: Turning off 16M unity 
map for LPC\n");
+   } else if (!strncmp(str, "preserve_", 9)) {
+   if (!strncmp(str + 9, "4g", 2) ||
+   !strncmp(str + 9, "4G", 2))
+   dmar_preserve_iova_mask = DMA_32BIT_MASK;
+   else if (!strncmp(str + 9, "2g", 2) ||
+   !strncmp(str + 9, "2G", 2))
+   dmar_preserve_iova_mask = DMA_31BIT_MASK;
+   else if (!strncmp(str + 9, "1g", 2) ||
+!strncmp(str + 9, "1G", 2))
+   dmar_preserve_iova_mask = DMA_30BIT_MASK;
+   else if (!strncmp(str + 9, "512m", 4) ||
+!strncmp(str + 9, "512M", 4))
+   dmar_preserve_iova_mask = DMA_29BIT_MASK;
+   else if (!strncmp(str + 9, "256m", 4) ||
+!strncmp(str + 9, "256M", 4))
+   dmar_preserve_iova_mask = DMA_28BIT_MASK;
+   else if (!strncmp(str + 9, "16m", 3) ||
+!strncmp(str + 9, "16M", 3))
+   dmar_preserve_iova_mask = DMA_24BIT_MASK;
+   if (dmar_preserve_iova_mask)
+   printk(KERN_INFO
+   "DMAR: Preserved IOVA mask 0x%Lx for devices "
+   "sharing domain\n", dmar_preserve_iova_mask);
+   else
+   printk(KERN_ERR"DMAR: Unsupported preserve mask"
+   " provided\n");
}
 
str += strcspn(str, ",");
@@ -1723,7 +1750,6 @@
last_addr : IOVA_START_ADDR);
}
return last_addr;
-
 }
 #endif
 
@@ -1751,13 +1777,14 @@
 
/*
 * If the device shares a domain with other devices and the device can
-* handle > 4G DMA, let the device use DMA address started from 4G, so 
to
-* leave rooms for other devices
+* handle higher addresses, leave room for devices that can't
+* address high address ranges.
 */
if ((domain->flags & DOMAIN_FLAG_MULTIPLE_DEVICES) &&
-   pdev->dma_mask > DMA_32BIT_MASK)
+   dmar_preserve_iova_mask &&
+   (pdev->dma_mask > dmar_preserve_iova_mask))
iova = alloc_iova(domain, addr, size,
-  

[Intel IOMMU][patch 7/8] Support for legacy ISA devices

2007-04-24 Thread Ashok Raj
Floppy disk drivers don't work well with DMA remapping. It's possible to
extend the current use for x86_64, but the gain is very little. If someone
feels compelled to clean this up, it's up for grabs. Since these use the first
16M, we just provide a unity map for the ISA bridge device.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5/Documentation/kernel-parameters.txt
===
--- linux-2.6.21-rc5.orig/Documentation/kernel-parameters.txt   2007-04-17 
05:41:56.0 -0700
+++ linux-2.6.21-rc5/Documentation/kernel-parameters.txt2007-04-17 
05:41:59.0 -0700
@@ -730,6 +730,11 @@
the IOMMU driver to set a unity map for all OS
visible memory. Hence the driver can continue to use
physical addresses for DMA.
+   noisamap
+   This option turns off the identity map set up for the
+   first 16M. The floppy disk could be modified to use
+   the DMA APIs, but that's a lot of pain for very small
+   gain. The identity map is enabled by default.
io7=[HW] IO7 for Marvel based alpha systems
See comment before marvel_specify_io7 in
arch/alpha/kernel/core_marvel.c.
Index: linux-2.6.21-rc5/drivers/pci/intel-iommu.c
===
--- linux-2.6.21-rc5.orig/drivers/pci/intel-iommu.c 2007-04-17 
05:41:53.0 -0700
+++ linux-2.6.21-rc5/drivers/pci/intel-iommu.c  2007-04-17 05:41:59.0 
-0700
@@ -37,6 +37,8 @@
 #include "pci.h"
 
 #define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
+#define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
+
 #define IOAPIC_RANGE_START (0xfee0)
 #define IOAPIC_RANGE_END   (0xfeef)
 #define IOAPIC_RANGE_SIZE  (IOAPIC_RANGE_END - IOAPIC_RANGE_START + 1)
@@ -87,6 +89,7 @@
 
 static int dmar_disabled, dmar_force_rw;
 static int dmar_map_gfx = 1, dmar_no_gfx_identity_map = 1;
+static int dmar_fix_isa = 1;
 
 static char *get_fault_reason(u8 fault_reason)
 {
@@ -113,6 +116,9 @@
} else if (!strncmp(str, "gfx_workaround", 14)) {
dmar_no_gfx_identity_map = 0;
printk(KERN_INFO"Intel-IOMMU: do 1-1 mapping whole 
physical memory for GFX device\n");
+   } else if (!strncmp(str, "noisamap", 8)) {
+   dmar_fix_isa = 0;
+   printk (KERN_INFO"Intel-IOMMU: Turning off 16M unity 
map for LPC\n");
}
 
str += strcspn(str, ",");
@@ -1582,6 +1588,25 @@
}
 }
 
+static void iommu_prepare_isa(void)
+{
+   struct pci_dev *pdev = NULL;
+   int ret;
+
+   if (!dmar_fix_isa)
+   return;
+
+   pdev = pci_get_class (PCI_CLASS_BRIDGE_ISA << 8, NULL);
+   if (!pdev)
+   return;
+
+   printk (KERN_INFO "IOMMU: Prepare 0-16M unity mapping for LPC\n");
+   ret = iommu_prepare_identity_map(pdev, 0, 16*1024*1024);
+
+   if (ret)
+   printk ("IOMMU: Failed to create 0-16M identity map, Floppy 
might not work\n");
+
+}
 int __init init_dmars(void)
 {
struct dmar_drhd_unit *drhd;
@@ -1638,6 +1663,7 @@
end_for_each_rmrr_device(rmrr, pdev)
 
iommu_prepare_gfx_mapping();
+   iommu_prepare_isa();
 
/*
 * for each drhd

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Intel IOMMU][patch 1/8] ACPI support for Intel Virtualization Technology for Directed I/O

2007-04-24 Thread Ashok Raj
This patch contains basic ACPI parsing and enumeration support.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
Index: linux-2.6.21-rc5/arch/x86_64/Kconfig
===
--- linux-2.6.21-rc5.orig/arch/x86_64/Kconfig   2007-04-23 07:11:49.0 
-0700
+++ linux-2.6.21-rc5/arch/x86_64/Kconfig2007-04-23 07:11:51.0 
-0700
@@ -687,6 +687,14 @@
bool "Support mmconfig PCI config space access"
depends on PCI && ACPI
 
+config DMAR
+   bool "Support for DMA Remapping Devices (EXPERIMENTAL)"
+   depends on PCI_MSI && ACPI && EXPERIMENTAL
+   help
+ Support DMA Remapping Devices. The devices are reported via
+ ACPI tables and includes pci device scope under each DMA
+ remapping device.
+
 source "drivers/pci/pcie/Kconfig"
 
 source "drivers/pci/Kconfig"
Index: linux-2.6.21-rc5/drivers/pci/Makefile
===
--- linux-2.6.21-rc5.orig/drivers/pci/Makefile  2007-04-23 07:11:49.0 
-0700
+++ linux-2.6.21-rc5/drivers/pci/Makefile   2007-04-23 07:11:51.0 
-0700
@@ -38,6 +38,7 @@
 #
 obj-$(CONFIG_ACPI)+= pci-acpi.o
 
+obj-$(CONFIG_DMAR) += dmar.o
 # Cardbus & CompactPCI use setup-bus
 obj-$(CONFIG_HOTPLUG) += setup-bus.o
 
Index: linux-2.6.21-rc5/drivers/pci/dmar.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.21-rc5/drivers/pci/dmar.c 2007-04-23 07:12:00.0 -0700
@@ -0,0 +1,350 @@
+/*
+ * Copyright (c) 2006, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Copyright (C) Ashok Raj <[EMAIL PROTECTED]>
+ * Copyright (C) Shaohua Li <[EMAIL PROTECTED]>
+ */
+
+#include 
+#include 
+#include 
+
+#include "dmar.h"
+
+#undef PREFIX
+#define PREFIX "DMAR:"
+
+#define MIN_SCOPE_LEN (sizeof(struct acpi_dmar_pci_path) + \
+   sizeof(struct acpi_dmar_device_scope))
+
+LIST_HEAD(dmar_drhd_units);
+LIST_HEAD(dmar_rmrr_units);
+u8 dmar_host_address_width;
+
+static struct acpi_table_header *dmar_tbl;
+
+static int __init dmar_register_drhd_unit(struct dmar_drhd_unit *drhd)
+{
+   /*
+* add INCLUDE_ALL at the tail, so scanning the list will find it at
+* the very end.
+*/
+   if (drhd->include_all)
+   list_add_tail(&drhd->list, &dmar_drhd_units);
+   else
+   list_add(&drhd->list, &dmar_drhd_units);
+   return 0;
+}
+
+static int __init dmar_register_rmrr_unit(struct dmar_rmrr_unit *rmrr)
+{
+   list_add(&rmrr->list, &dmar_rmrr_units);
+   return 0;
+}
+
+static int dmar_pci_device_match(struct pci_dev *devices[], int cnt,
+struct pci_dev *dev)
+{
+   int index;
+
+   while (dev) {
+   for (index = 0; index < cnt; index ++)
+   if (dev == devices[index])
+   return 1;
+
+   /* Check our parent */
+   dev = dev->bus->self;
+   }
+
+   return 0;
+}
+
+struct dmar_drhd_unit * dmar_find_matched_drhd_unit(struct pci_dev *dev)
+{
+   struct dmar_drhd_unit *drhd = NULL;
+
+   list_for_each_entry(drhd, &dmar_drhd_units, list) {
+   if (drhd->include_all || dmar_pci_device_match(drhd->devices,
+   drhd->devices_cnt, dev))
+   break;
+   }
+
+   return drhd;
+}
+
+struct dmar_rmrr_unit * dmar_find_matched_rmrr_unit(struct pci_dev *dev)
+{
+   struct dmar_rmrr_unit *rmrr;
+
+   list_for_each_entry(rmrr, &dmar_rmrr_units, list) {
+   if (dmar_pci_device_match(rmrr->devices,
+   rmrr->devices_cnt, dev))
+   goto out;
+   }
+   rmrr = NULL;
+out:
+   return rmrr;
+}
+
+static int __init dmar_parse_one_dev_scope(struct acpi_dmar_device_scope 
*scope,
+  struct pci_dev **dev, u16 segment)
+{
+   struct pci_bus *bus;
+   struct pci_dev *pdev = NULL;
+

[Intel IOMMU][patch 0/8] Intel IOMMU Support

2007-04-24 Thread Ashok Raj
Hello again!

Andrew: Could you help include in -mm to give it more exposure preparing for 
mainline inclusion with more testing?

This is a resend of the patches after addressing some feedback we received.

1. As Len requested, we moved most of the acpi parts to drivers/pci instead
   of leaving them in drivers/acpi, including some renaming of functions,
   using just acpi_get_table() only.

2. Made the guard page support configurable.
3. Added a new option to allocate consecutive addresses instead of re-using
   freed addresses, as an experimental option. Not validated extensively;
   it's expected to improve in certain cases...
4. Fixed a couple minor bugs that got exposed in testing.


Some more interesting possibilities. -- enhancements to work on.

- In order to ensure we don't break any driver that may not be using the DMA
  APIs, here are some suggestions to work on.
- Create a single 1-1 map, and make sure any pci device gets this map
  when they do a pci_set_master() to enable bus mastering automatically.
- When the device driver does its first call to do a DMA mapping, then
  dissociate the device from the unity map domain into its own. This gives
  limited protection but doesn't break drivers that do not use DMA
  mapping.
- Create context entries only once per segment.. today we create one per
  IOMMU, which is not required.

This way we can avoid doing some of the workarounds we do for some devices
today, and will function as a default container for compatibility.
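
A minimal sketch of the dissociate-on-first-map idea (all helper names here
are hypothetical and the mapping signature is assumed; this is not code from
the series):

	static dma_addr_t intel_map_single_lazy(struct pci_dev *pdev,
			void *addr, size_t size, int dir)
	{
		/* first pci_map_*() call: leave the default unity domain */
		if (device_in_unity_domain(pdev))
			attach_private_domain(pdev);

		return __intel_map_single(pdev, addr, size, dir);
	}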

On one hand, this will provide more compatibility, but we will lose the
opportunity to identify broken device drivers that don't use the DMA APIs
and fix them. Depending on who you talk to.. some like it.. some just
hate it! and would like to fix the broken ones instead.

Cheers,
Ashok Raj
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Intel IOMMU][patch 0/8] Intel IOMMU Support.

2007-04-24 Thread Ashok Raj
Hello again!

Andrew: Could you help include in -mm to give it more exposure preparing for 
mainline inclusion with more testing?

This is a resend of the patches after addressing some feedback we received.

1. As Len requested, we moved most of the acpi parts to drivers/pci instead
   of leaving them in drivers/acpi, including some renaming of functions,
   using just acpi_get_table() only.
2. Made the guard page support configurable.
3. Added a new CONFIG option to allocate consecutive addresses instead of
   re-using freed addresses, as an experimental option. Not validated
   extensively; it's expected to improve in certain cases...
4. Fixed a couple minor bugs that got exposed in testing.

Other feedback: 

- Some suggested depending on ACPI, but that's not doable for several reasons.
- Graphics 1-1 maps exist only for compatibility until graphics drivers
  start calling pci map functions, including user space X that might be
  using /dev/mem.
 

Some more interesting possibilities. -- enhancements to work on.

- In order to ensure we don't break any driver that may not be using the DMA
  APIs, here are some suggestions to work on.
- Create a single 1-1 map, and make sure any pci device gets this map
  when they do a pci_set_master() to enable bus mastering automatically.
- When the device driver does its first call to do a DMA mapping, then
  dissociate the device from the unity map domain into its own. This gives
  limited protection but doesn't break drivers that do not use DMA
  mapping.
- Create context entries only once per segment.. today we create one per
  IOMMU, which is not required.

This way we can avoid doing some of the workarounds we do for some devices
today, and will function as a default container for compatibility.

On one hand, this will provide more compatibility, but we will lose the
opportunity to identify broken device drivers that don't use the DMA APIs
and fix them. Depending on who you talk to.. some like it.. some just
hate it! and would like to fix the broken ones instead.

Cheers,
Ashok Raj
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Intel IOMMU][patch 4/8] Supporting Zero Length Reads in Intel IOMMU.

2007-04-24 Thread Ashok Raj
PCI specs permit zero length reads (ZLR) even if the mapping for that region 
is write only. Support for this feature is indicated by the presence of a bit 
in the DMAR capability. If a particular DMAR does not support this capability
we map write-only regions as read-write.

This option also provides a workaround for some drivers that request
a write-only mapping when they really should request a read-write one.
(We ran into one such case in eepro100.c in handling rx_ring_dma.)
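
The driver-side fix would look roughly like this (illustrative only; the
buffer and size names are made up, this is not the actual eepro100 change):

	/* The device both reads rx descriptors and writes status back
	 * into them, so the ring must be mapped read-write. */
	rx_ring_dma = pci_map_single(pdev, rx_ring, RX_RING_BYTES,
				     PCI_DMA_BIDIRECTIONAL);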

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
--
 drivers/pci/intel-iommu.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: 2.6.21-rc6/drivers/pci/intel-iommu.c
===
--- 2.6.21-rc6.orig/drivers/pci/intel-iommu.c   2007-04-18 09:04:56.0 
+0800
+++ 2.6.21-rc6/drivers/pci/intel-iommu.c2007-04-18 09:04:59.0 
+0800
@@ -84,7 +84,7 @@ struct iommu {
struct sys_device sysdev;
 };
 
-static int dmar_disabled;
+static int dmar_disabled, dmar_force_rw;
 
 static char *get_fault_reason(u8 fault_reason)
 {
@@ -102,6 +102,9 @@ static int __init intel_iommu_setup(char
if (!strncmp(str, "off", 3)) {
dmar_disabled = 1;
printk(KERN_INFO"Intel-IOMMU: disabled\n");
+   } else if (!strncmp(str, "forcerw", 7)) {
+   dmar_force_rw = 1;
+   printk(KERN_INFO"Intel-IOMMU: force R/W for W/O 
mapping\n");
}
str += strcspn(str, ",");
while (*str == ',')
@@ -1720,7 +1723,12 @@ static dma_addr_t __intel_map_single(str
goto error;
}
 
-   if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+   /*
+* Check if DMAR supports zero-length reads on write only
+* mappings..
+*/
+   if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL || \
+   !cap_zlr(domain->iommu->cap) || dmar_force_rw)
prot |= DMA_PTE_READ;
if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
prot |= DMA_PTE_WRITE;
Index: 2.6.21-rc6/include/linux/intel-iommu.h
===
--- 2.6.21-rc6.orig/include/linux/intel-iommu.h 2007-04-18 09:04:56.0 
+0800
+++ 2.6.21-rc6/include/linux/intel-iommu.h  2007-04-18 09:04:59.0 
+0800
@@ -79,6 +79,7 @@
 #define cap_max_fault_reg_offset(c) \
(cap_fault_reg_offset(c) + cap_num_fault_regs(c) * 16)
 
+#define cap_zlr(c) (((c) >> 22) & 1)
 #define cap_isoch(c)   (((c) >> 23) & 1)
 #define cap_mgaw(c)    ((((c) >> 16) & 0x3f) + 1)
 #define cap_sagaw(c)   (((c) >> 8) & 0x1f)

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Intel IOMMU][patch 1/8] ACPI support for Intel Virtualization Technology for Directed I/O

2007-04-24 Thread Ashok Raj
This patch contains basic ACPI parsing and enumeration support.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
Index: linux-2.6.21-rc5/arch/x86_64/Kconfig
===
--- linux-2.6.21-rc5.orig/arch/x86_64/Kconfig   2007-04-23 07:11:49.0 
-0700
+++ linux-2.6.21-rc5/arch/x86_64/Kconfig2007-04-23 07:11:51.0 
-0700
@@ -687,6 +687,14 @@
bool "Support mmconfig PCI config space access"
depends on PCI && ACPI
 
+config DMAR
+   bool "Support for DMA Remapping Devices (EXPERIMENTAL)"
+   depends on PCI_MSI && ACPI && EXPERIMENTAL
+   help
+ Support DMA Remapping Devices. The devices are reported via
+ ACPI tables and includes pci device scope under each DMA
+ remapping device.
+
 source "drivers/pci/pcie/Kconfig"
 
 source "drivers/pci/Kconfig"
Index: linux-2.6.21-rc5/drivers/pci/Makefile
===
--- linux-2.6.21-rc5.orig/drivers/pci/Makefile  2007-04-23 07:11:49.0 
-0700
+++ linux-2.6.21-rc5/drivers/pci/Makefile   2007-04-23 07:11:51.0 
-0700
@@ -38,6 +38,7 @@
 #
 obj-$(CONFIG_ACPI)+= pci-acpi.o
 
+obj-$(CONFIG_DMAR) += dmar.o
 # Cardbus & CompactPCI use setup-bus
 obj-$(CONFIG_HOTPLUG) += setup-bus.o
 
Index: linux-2.6.21-rc5/drivers/pci/dmar.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.21-rc5/drivers/pci/dmar.c 2007-04-23 07:12:00.0 -0700
@@ -0,0 +1,350 @@
+/*
+ * Copyright (c) 2006, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Copyright (C) Ashok Raj <[EMAIL PROTECTED]>
+ * Copyright (C) Shaohua Li <[EMAIL PROTECTED]>
+ */
+
+#include 
+#include 
+#include 
+
+#include "dmar.h"
+
+#undef PREFIX
+#define PREFIX "DMAR:"
+
+#define MIN_SCOPE_LEN (sizeof(struct acpi_dmar_pci_path) + \
+   sizeof(struct acpi_dmar_device_scope))
+
+LIST_HEAD(dmar_drhd_units);
+LIST_HEAD(dmar_rmrr_units);
+u8 dmar_host_address_width;
+
+static struct acpi_table_header *dmar_tbl;
+
+static int __init dmar_register_drhd_unit(struct dmar_drhd_unit *drhd)
+{
+   /*
+* add INCLUDE_ALL at the tail, so scanning the list will find it at
+* the very end.
+*/
+   if (drhd->include_all)
+   list_add_tail(&drhd->list, &dmar_drhd_units);
+   else
+   list_add(&drhd->list, &dmar_drhd_units);
+   return 0;
+}
+
+static int __init dmar_register_rmrr_unit(struct dmar_rmrr_unit *rmrr)
+{
+   list_add(&rmrr->list, &dmar_rmrr_units);
+   return 0;
+}
+
+static int dmar_pci_device_match(struct pci_dev *devices[], int cnt,
+struct pci_dev *dev)
+{
+   int index;
+
+   while (dev) {
+   for (index = 0; index < cnt; index ++)
+   if (dev == devices[index])
+   return 1;
+
+   /* Check our parent */
+   dev = dev->bus->self;
+   }
+
+   return 0;
+}
+
+struct dmar_drhd_unit * dmar_find_matched_drhd_unit(struct pci_dev *dev)
+{
+   struct dmar_drhd_unit *drhd = NULL;
+
+   list_for_each_entry(drhd, &dmar_drhd_units, list) {
+   if (drhd->include_all || dmar_pci_device_match(drhd->devices,
+   drhd->devices_cnt, dev))
+   break;
+   }
+
+   return drhd;
+}
+
+struct dmar_rmrr_unit * dmar_find_matched_rmrr_unit(struct pci_dev *dev)
+{
+   struct dmar_rmrr_unit *rmrr;
+
+   list_for_each_entry(rmrr, &dmar_rmrr_units, list) {
+   if (dmar_pci_device_match(rmrr->devices,
+   rmrr->devices_cnt, dev))
+   goto out;
+   }
+   rmrr = NULL;
+out:
+   return rmrr;
+}
+
+static int __init dmar_parse_one_dev_scope(struct acpi_dmar_device_scope 
*scope,
+  struct pci_dev **dev, u16 segment)
+{
+   struct pci_bus *bus;
+   struct pci_dev *pdev = NULL;
+

[Intel IOMMU][patch 6/8] Doc updates for Intel Virtualization Technology for Directed I/O.

2007-04-24 Thread Ashok Raj
Document Intel IOMMU driver boot option.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
Index: linux-2.6.21-rc5/Documentation/Intel-IOMMU.txt
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.21-rc5/Documentation/Intel-IOMMU.txt  2007-04-17 
05:41:56.0 -0700
@@ -0,0 +1,119 @@
+Linux IOMMU Support
+===================
+
+The architecture spec can be obtained from the below location.
+
+http://www.intel.com/technology/virtualization/
+
+This guide gives a quick cheat sheet for some basic understanding.
+
+Some Keywords
+-------------
+DMAR - DMA remapping
+DRHD - DMA Engine Reporting Structure
+RMRR - Reserved memory Region Reporting Structure
+ZLR  - Zero length reads from PCI devices
+IOVA - IO Virtual address.
+
+Basic stuff
+-----------
+
+ACPI enumerates and lists the different DMA engines in the platform, and
+device scope relationships between PCI devices and which DMA engine controls
+them.
+
+What is RMRR?
+-------------
+
+There are some devices the BIOS controls, e.g. USB devices to perform
+PS2 emulation. The regions of memory used for these devices are marked
+reserved in the e820 map. When we turn on DMA translation, DMA to those
+regions will fail. Hence BIOS uses RMRR to specify these regions along with
+devices that need to access these regions. The OS is expected to set up
+unity mappings for these regions so that these devices can access them.
+
+How is IOVA generated?
+----------------------
+
+Well behaved drivers call pci_map_*() before sending a command to a device
+that needs to perform DMA. Once the DMA is completed and the mapping is no
+longer required, the driver calls pci_unmap_*() to unmap the region.
+
+The Intel IOMMU driver allocates a virtual address per domain. Each PCIE
+device has its own domain (hence protection). Devices under p2p bridges
+share the virtual address with all devices under the p2p bridge due to
+transaction id aliasing for p2p bridges.
+
+IOVA generation is pretty generic. We used the same technique as vmalloc(),
+but these are not global address spaces; they are separate for each domain.
+Different DMA engines may support different number of domains.
+
+We also allocate guard pages with each mapping, so we can attempt to catch
+any overflow that might happen.
+
+
+Graphics Problems?
+------------------
+If you encounter issues with graphics devices, you can try adding the
+option intel_iommu=igfx_off to turn off the integrated graphics engine.
+
+If it happens to be a PCI device included in the INCLUDE_ALL engine,
+then try the intel_iommu=gfx_workaround option to set up a 1-1 map. We hear
+graphics drivers may be in the process of using DMA APIs in the near
+future.
+
+Some exceptions to IOVA
+-----------------------
+Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff).
+The same is true for peer to peer transactions. Hence we reserve the
+address from PCI MMIO ranges so they are not allocated for IOVA addresses.
+
+
+Fault reporting
+---------------
+When errors are reported, the DMA engine signals via an interrupt. The fault
+reason and the device that caused it are printed on the console.
+
+See below for sample.
+
+
+Boot Message Sample
+-------------------
+
+Something like this gets printed indicating presence of DMAR tables
+in ACPI.
+
+ACPI: DMAR (v001 A M I  OEMDMAR  0x00000001 MSFT 0x00000097) @ 0x000000007f5b5ef0
+
+When DMAR is being processed and initialized by ACPI, prints DMAR locations
+and any RMRR's processed.
+
+ACPI DMAR:Host address width 36
+ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000
+ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000
+ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000
+ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff
+ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff
+
+When DMAR is enabled for use, you will notice..
+
+PCI-DMA: Using DMAR IOMMU
+
+Fault reporting
+---------------
+
+DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+DMAR:[fault reason 05] PTE Write access is not set
+DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+DMAR:[fault reason 05] PTE Write access is not set
+
+TBD
+
+
+- No Performance tuning / analysis yet.
+- sysfs needs useful data to be populated.
+  DMAR info, device scope, stats could be exposed to some extent.
+- Add support to Firmware Developer Kit to test ACPI tables for DMAR.
+- For compatibility testing, could use unity map domain for all devices, just
+  provide a 1-1 for all useful memory under a single domain for all devices.
+- API for paravirt ops for abstracting functionality for VMM folks.
Index: linux-2.6.21-rc5/Documentation/kernel-parameters.txt
===
--- linux-2.6.21-rc5.orig/Documentation/kernel-parameters.txt   2007-04-17 
04:59:42.0 -0700

[Intel IOMMU][patch 5/8] Graphics driver workarounds to provide unity map

2007-04-24 Thread Ashok Raj
Most GFX drivers don't call the standard PCI DMA APIs to allocate DMA buffers,
so such drivers will be broken with the IOMMU enabled. To work around this
issue, we added two options.

Once graphics devices are converted over to use the DMA APIs, this entire
patch can be removed...

a. intel_iommu=igfx_off. With this option, a DMAR that has only gfx devices
   under it will be ignored. This mostly affects integrated gfx devices.
   If the DMAR is ignored, gfx devices under it will get physical addresses
   for DMA.
b. intel_iommu=gfx_workaround. With this option, we will set up a 1:1 mapping
   for the whole memory for gfx devices, that is, physical address equals
   virtual address. In this way, gfx will use physical addresses for DMA; this
   is primarily for add-in card GFX devices.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
Index: 2.6.21-rc6/arch/x86_64/kernel/e820.c
===
--- 2.6.21-rc6.orig/arch/x86_64/kernel/e820.c   2007-04-20 11:03:01.0 
+0800
+++ 2.6.21-rc6/arch/x86_64/kernel/e820.c2007-04-20 11:45:56.0 
+0800
@@ -730,3 +730,22 @@ __init void e820_setup_gap(void)
printk(KERN_INFO "Allocating PCI resources starting at %lx (gap: 
%lx:%lx)\n",
pci_mem_start, gapstart, gapsize);
 }
+
+int __init arch_get_ram_range(int slot, u64 *addr, u64 *size)
+{
+   int i;
+
+   if (slot < 0 || slot >= e820.nr_map)
+   return -1;
+   for (i = slot; i < e820.nr_map; i++) {
+   if(e820.map[i].type != E820_RAM)
+   continue;
+   break;
+   }
+   if (i == e820.nr_map || e820.map[i].addr > (max_pfn << PAGE_SHIFT))
+   return -1;
+   *addr = e820.map[i].addr;
+   *size = min_t(u64, e820.map[i].size + e820.map[i].addr,
+   max_pfn << PAGE_SHIFT) - *addr;
+   return i + 1;
+}
Index: 2.6.21-rc6/drivers/pci/dmar.h
===
--- 2.6.21-rc6.orig/drivers/pci/dmar.h  2007-04-20 11:38:30.0 +0800
+++ 2.6.21-rc6/drivers/pci/dmar.h   2007-04-20 11:45:56.0 +0800
@@ -35,6 +35,7 @@ struct dmar_drhd_unit {
int devices_cnt;
u8  include_all:1;
struct iommu *iommu;
+   int ignored:1; /* the drhd should be ignored */
 };
 
 struct dmar_rmrr_unit {
Index: 2.6.21-rc6/drivers/pci/intel-iommu.c
===
--- 2.6.21-rc6.orig/drivers/pci/intel-iommu.c   2007-04-20 11:45:52.0 
+0800
+++ 2.6.21-rc6/drivers/pci/intel-iommu.c2007-04-20 11:45:56.0 
+0800
@@ -36,6 +36,7 @@
 #include "iova.h"
 #include "pci.h"
 
+#define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
 #define IOAPIC_RANGE_START (0xfee00000)
 #define IOAPIC_RANGE_END   (0xfeefffff)
 #define IOAPIC_RANGE_SIZE  (IOAPIC_RANGE_END - IOAPIC_RANGE_START + 1)
@@ -85,6 +86,7 @@ struct iommu {
 };
 
 static int dmar_disabled, dmar_force_rw;
+static int dmar_map_gfx = 1, dmar_no_gfx_identity_map = 1;
 
 static char *get_fault_reason(u8 fault_reason)
 {
@@ -105,7 +107,14 @@ static int __init intel_iommu_setup(char
} else if (!strncmp(str, "forcerw", 7)) {
dmar_force_rw = 1;
printk(KERN_INFO"Intel-IOMMU: force R/W for W/O 
mapping\n");
+   } else if (!strncmp(str, "igfx_off", 8)) {
+   dmar_map_gfx = 0;
+   printk(KERN_INFO"Intel-IOMMU: disable GFX device 
mapping\n");
+   } else if (!strncmp(str, "gfx_workaround", 14)) {
+   dmar_no_gfx_identity_map = 0;
+   printk(KERN_INFO"Intel-IOMMU: do 1-1 mapping whole 
physical memory for GFX device\n");
}
+
str += strcspn(str, ",");
while (*str == ',')
str++;
@@ -1318,6 +1327,7 @@ struct device_domain_info {
struct pci_dev *dev; /* it's NULL for PCIE-to-PCI bridge */
struct domain *domain;
 };
+#define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
 static DEFINE_SPINLOCK(device_domain_lock);
 static LIST_HEAD(device_domain_list);
 
@@ -1538,10 +1548,40 @@ error:
 static inline int iommu_prepare_rmrr_dev(struct dmar_rmrr_unit *rmrr,
struct pci_dev *pdev)
 {
+   if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
+   return 0;
return iommu_prepare_identity_map(pdev, rmrr->base_address,
rmrr->end_address + 1);
 }
 
+static void iommu_prepare_gfx_mapping(void)
+{
+   struct pci_dev *pdev = NULL;
+   u64 base, size;
+   int slot;
+   int ret;
+
+   if (dmar_no_gfx_identity_m

[Intel IOMMU][patch 8/8] Preserve some Virtual Address when devices cannot address entire range.

2007-04-24 Thread Ashok Raj
Some devices may not support the entire 64-bit DMA range. In a situation where
such devices are co-located in a shared domain, we need to ensure some
address space is reserved for them without the low addresses getting
depleted by other devices capable of handling high DMA addresses.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
Signed-off-by: Shaohua Li <[EMAIL PROTECTED]>
Index: linux-2.6.21-rc5/Documentation/kernel-parameters.txt
===
--- linux-2.6.21-rc5.orig/Documentation/kernel-parameters.txt   2007-04-17 
06:02:24.0 -0700
+++ linux-2.6.21-rc5/Documentation/kernel-parameters.txt2007-04-17 
06:02:33.0 -0700
@@ -735,6 +735,11 @@
first 16M. The floppy disk could be modified to use
the DMA APIs, but that's a lot of pain for very small
gain. The identity map is enabled by default.
+   preserve_{1g/2g/4g/512m/256m/16m}
+   If a device is sharing a domain with other devices
+   and the device mask doesn't cover the 64bit range,
+   use this option to let the iommu code preserve some
+   virtual address space for such devices.
io7=[HW] IO7 for Marvel based alpha systems
See comment before marvel_specify_io7 in
arch/alpha/kernel/core_marvel.c.
Index: linux-2.6.21-rc5/drivers/pci/intel-iommu.c
===
--- linux-2.6.21-rc5.orig/drivers/pci/intel-iommu.c 2007-04-17 
06:02:24.0 -0700
+++ linux-2.6.21-rc5/drivers/pci/intel-iommu.c  2007-04-17 06:05:49.0 
-0700
@@ -90,6 +90,7 @@
 static int dmar_disabled, dmar_force_rw;
 static int dmar_map_gfx = 1, dmar_no_gfx_identity_map = 1;
 static int dmar_fix_isa = 1;
+static u64 dmar_preserve_iova_mask;
 
 static char *get_fault_reason(u8 fault_reason)
 {
@@ -119,6 +120,32 @@
} else if (!strncmp(str, "noisamap", 8)) {
dmar_fix_isa = 0;
printk (KERN_INFO"Intel-IOMMU: Turning off 16M unity 
map for LPC\n");
+   } else if (!strncmp(str, "preserve_", 9)) {
+   if (!strncmp(str + 9, "4g", 2) ||
+   !strncmp(str + 9, "4G", 2))
+   dmar_preserve_iova_mask = DMA_32BIT_MASK;
+   else if (!strncmp(str + 9, "2g", 2) ||
+   !strncmp(str + 9, "2G", 2))
+   dmar_preserve_iova_mask = DMA_31BIT_MASK;
+   else if (!strncmp(str + 9, "1g", 2) ||
+!strncmp(str + 9, "1G", 2))
+   dmar_preserve_iova_mask = DMA_30BIT_MASK;
+   else if (!strncmp(str + 9, "512m", 4) ||
+!strncmp(str + 9, "512M", 4))
+   dmar_preserve_iova_mask = DMA_29BIT_MASK;
+   else if (!strncmp(str + 9, "256m", 4) ||
+!strncmp(str + 9, "256M", 4))
+   dmar_preserve_iova_mask = DMA_28BIT_MASK;
+   else if (!strncmp(str + 9, "16m", 3) ||
+!strncmp(str + 9, "16M", 3))
+   dmar_preserve_iova_mask = DMA_24BIT_MASK;
+   if (dmar_preserve_iova_mask)
+   printk(KERN_INFO
+   "DMAR: Preserved IOVA mask 0x%Lx for devices "
+   "sharing domain\n", dmar_preserve_iova_mask);
+   else
+   printk(KERN_ERR"DMAR: Unsupported preserve mask"
+   " provided\n");
}
 
str += strcspn(str, ",");
@@ -1723,7 +1750,6 @@
last_addr : IOVA_START_ADDR);
}
return last_addr;
-
 }
 #endif
 
@@ -1751,13 +1777,14 @@
 
/*
 * If the device shares a domain with other devices and the device can
-* handle > 4G DMA, let the device use DMA address started from 4G, so 
to
-* leave rooms for other devices
+* handle higher addresses, leave room for devices that can't
+* address high address ranges.
 */
if ((domain->flags & DOMAIN_FLAG_MULTIPLE_DEVICES) &&
-   pdev->dma_mask > DMA_32BIT_MASK)
+   dmar_preserve_iova_mask &&
+   (pdev->dma_mask > dmar_preserve_iova_mask))
iova = alloc_iova(domain, addr, size,
-  

Re: [Intel IOMMU][patch 1/8] ACPI support for Intel Virtualization Technology for Directed I/O

2007-04-24 Thread Ashok Raj
On Tue, Apr 24, 2007 at 08:50:48PM +0200, Andi Kleen wrote:
> 
> > +
> > +LIST_HEAD(dmar_drhd_units);
> > +LIST_HEAD(dmar_rmrr_units);
> 
> Comment describing what lock protects those lists?
> In fact there seems to be no locking. What about hotplug?
> 

There is no support to handle an IOMMU hotplug at this time. IOMMU hotplug
requires additional support via ACPI that needs to be extended to handle this.

These definitions are scanned at boot time from BIOS tables. They are
pretty much static data that we process during boot. Hence no locking is 
required. We pretty much treat this as read only, and the information never 
gets changed after initial parsing.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Intel IOMMU][patch 8/8] Preserve some Virtual Address when devices cannot address entire range.

2007-04-24 Thread Ashok Raj
On Tue, Apr 24, 2007 at 09:33:15PM +0200, Andi Kleen wrote:
> On Tuesday 24 April 2007 08:03:07 Ashok Raj wrote:
> > Some devices may not support entire 64bit DMA. In a situation where such 
> > devices are co-located in a shared domain, we need to ensure there is some 
> > address space reserved for such devices without the low addresses getting
> > depleted by other devices capable of handling high dma addresses.
> 
> Sorry, but you need to find some way to make this usually work without special
> options. Otherwise users will be unhappy.
> 
> An possible way would be to allocate space upside down from the limit of the
> device. Then the lower areas should be usually free.
> 
With PCIe there is some benefit to keeping DMA addresses low for performance
reasons, since it will use 32-bit transaction layer packets instead of 64-bit.

This reservation is only required if we have some legacy device under a p2p
bridge where it's required to share its address space with other devices. We
could implement a default when one is not specified to keep things simple.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Intel IOMMU][patch 7/8] Support for legacy ISA devices

2007-04-24 Thread Ashok Raj
On Tue, Apr 24, 2007 at 09:31:09PM +0200, Andi Kleen wrote:
> On Tuesday 24 April 2007 08:03:06 Ashok Raj wrote:
> > Floppy disk drivers dont work well with DMA remapping.
> 
> What is the problem? You can't allocate mappings <16MB?

No.. these drivers don't call the DMA mapping APIs.. that's the problem.

> 
> > Its possible to  
> > extend the current use for x86_64, but the gain is very little. If someone
> > feels compelled to clean this up, its up for grabs. Since these use 16M, we 
> > just provide a unity map for the ISA bridge device.
> > 
> 
> While it's probably not worth for the floppy there are other devices
> with similar weird addressing limitations. Some generic handling of it
> would be nice.
> 

In the intro we had outlined a way to handle this via a generic unity
map for all devices; we could do that, i.e.

- implement a generic 1-1 map if the device is not calling the DMA APIs, and
dynamically dissociate it if the device does start using the DMA APIs.

For some of the address reservation as well, we could use set_dma_mask()
to ensure there is some DMA space. The problem is some drivers may not use
the DMA APIs. Also, it might be difficult to handle a hotplugged device that
has a weird requirement.
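
For reference, the usual driver-side mask negotiation (standard 2.6-era API,
nothing IOMMU specific) looks like:

	if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK)) {
		/* device can address everything; the IOMMU may still remap */
	} else if (!pci_set_dma_mask(pdev, DMA_32BIT_MASK)) {
		/* the iommu code could key a low-IOVA reservation off this */
	} else {
		return -EIO;	/* no usable DMA configuration */
	}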
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Intel IOMMU][patch 4/8] Supporting Zero Length Reads in Intel IOMMU.

2007-04-24 Thread Ashok Raj
On Tue, Apr 24, 2007 at 09:28:11PM +0200, Andi Kleen wrote:
> On Tuesday 24 April 2007 08:03:03 Ashok Raj wrote:
> > PCI specs permit zero length reads (ZLR) even if the mapping for that 
> > region 
> > is write only. Support for this feature is indicated by the presence of a 
> > bit 
> > in the DMAR capability. If a particular DMAR does not support this 
> > capability
> > we map write-only regions as read-write.
> > 
> > This option can also provides a workaround for some drivers that request
> > a write-only mapping when they really should request a read-write.
> > (We ran into one such case in eepro100.c in handling rx_ring_dma)
> 
> Better just fix the drivers instead of adding such hacks

Some of the early DMARs don't handle zero-length reads as required. Hardware
that supports them correctly will advertise this via its capabilities.

We could remove the cmdline option since it should not really be required.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Intel IOMMU][patch 3/8] Generic hardware support for Intel IOMMU.

2007-04-24 Thread Ashok Raj
On Tue, Apr 24, 2007 at 09:27:08PM +0200, Andi Kleen wrote:
> On Tuesday 24 April 2007 08:03:02 Ashok Raj wrote:
> >
> > +#ifdef CONFIG_DMAR
> > +#ifdef CONFIG_SMP
> > +static void dmar_msi_set_affinity(unsigned int irq, cpumask_t mask)
> 
> 
> Why does it need an own interrupt type?

The problem is it's an MSI-type interrupt, but we cannot use a pci_dev since
it's not a PCI device; hence it requires its own way of setup etc.

> 
> > +
> > +config IOVA_GUARD_PAGE
> > +   bool "Enables gaurd page when allocating IO Virtual Address for IOMMU"
> > +   depends on DMAR
> > +
> > +config IOVA_NEXT_CONTIG
> > +   bool "Keeps IOVA allocations consequent between allocations"
> > +   depends on DMAR && EXPERIMENTAL
> 
> Needs reference to Intel and better description
> 
> The file should have a high level description what it is good for etc.
> 
> Need high level overview over what locks protects what and if there
> is a locking order.
> 
> It doesn't seem to enable sg merging? Since you have enough space 
> that should work.

Most of the IOVA stuff is really generic, and could be used outside
of the Intel code with probably some rework. Since today only DMAR
requires it, we have it depend on DMAR, but we could make it more generic
and let the IOMMU driver just turn it on as required.

> 
> > +static char *fault_reason_strings[] =
> > +{
> > +   "Software",
> > +   "Present bit in root entry is clear",
> > +   "Present bit in context entry is clear",
> > +   "Invalid context entry",
> > +   "Access beyond MGAW",
> > +   "PTE Write access is not set",
> > +   "PTE Read access is not set",
> > +   "Next page table ptr is invalid",
> > +   "Root table address invalid",
> > +   "Context table ptr is invalid",
> > +   "non-zero reserved fields in RTP",
> > +   "non-zero reserved fields in CTP",
> > +   "non-zero reserved fields in PTE",
> > +   "Unknown"
> > +};
> > +
> > +#define MAX_FAULT_REASON_IDX   (12)
> 
> 
> You got 14 of them. better use ARRAY_SIZE

It's the last index (zero-based) of a useful entry returned by the fault
record.

It is only used to find out if the index from the fault record is out of
bounds.
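
Something like this would drop the hand-maintained bound (a sketch of the
ARRAY_SIZE() suggestion, using the fault_reason_strings[] table quoted
above):

	static char *get_fault_reason(u8 fault_reason)
	{
		/* clamp to the last entry, which is "Unknown" */
		if (fault_reason >= ARRAY_SIZE(fault_reason_strings))
			fault_reason = ARRAY_SIZE(fault_reason_strings) - 1;
		return fault_reason_strings[fault_reason];
	}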

We will work on the remaining comments and repost.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Intel IOMMU][patch 6/8] Doc updates for Intel Virtualization Technology for Directed I/O.

2007-04-24 Thread Ashok Raj
On Tue, Apr 24, 2007 at 11:17:55PM +0200, Markus Rechberger wrote:
> >+We also allocate gaurd pages with each mapping, so we can attempt to catch
> >+any overflow that might happen.
> >+
> 
> guess you probably mean guard tables here...
> 

So there is a good chance i can be "The Governor of California" :-)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Intel IOMMU][patch 8/8] Preserve some Virtual Address when devices cannot address entire range.

2007-04-24 Thread Ashok Raj
On Tue, Apr 24, 2007 at 02:23:51PM -0700, David Miller wrote:
> From: Andi Kleen <[EMAIL PROTECTED]>
> Date: Tue, 24 Apr 2007 23:12:54 +0200
> 
> > We already have a couple of other IOMMU architectures who essentially have 
> > the same
> > problem. Have you checked how they solve this?
> 
> Sparc64, for one, only uses 32-bit IOMMU addresses.  And we simply
> don't try to handle the funny devices at all, in fact we can't
> handle the ones that want mappings only in the low 16MB for
> example since the hardware IOMMU window usually starts in the
> middle of the 32-bit PCI address space.
> 
> We do it both because that's faster due to Single Address Cycles, as
> mentioned, and also because that's simply is where the hardware's
> IOMMU window is.  You can't use 64-bit IOMMU addresses even if you
> wanted to on sparc64.
> 
> My suggestion would be to allocate top-down in the 32-bit IOMMU space.
> 
> That might work, but my gut feeling is that this won't be sufficient
> and we'll need some kind of device driver initiated reservation
> mechanism for the <16MB et al. weird stuff.

It's not clear if we have very generic device breakage.. most devices
on these platforms are going to be more recent (except maybe some
legacy fd)...

Maybe we should wait to fix this unless we are certain there are more of them
that break on these platforms.

We could choose to use the generic 1-1 map for those weird cases: since the
driver does ensure today that the physical mem is in the low 16M, we could
just turn on a 1-1 domain for such devices without breaking anything.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Intel IOMMU][patch 8/8] Preserve some Virtual Address when devices cannot address entire range.

2007-04-24 Thread Ashok Raj
On Wed, Apr 25, 2007 at 12:03:57AM +0200, Andi Kleen wrote:
> On Tuesday 24 April 2007 23:50:26 David Miller wrote:
> > From: Ashok Raj <[EMAIL PROTECTED]>
> > Date: Tue, 24 Apr 2007 14:38:35 -0700
> > 
> > > Its not clear if we have a very generic device breakage.. most devices
> > > on these platforms are going to be more recent, (except maybe some
> > > legacy fd)... 
> > 
> > I'm not so sure, there are some "modern" sound cards that have
> > a 31-bit DMA addressing limitation because they use the 31st
> > bit as a status bit in their DMA descriptors :-)
> 
> There's also a 2GB only megaraid RAID controller that's pretty popular 
> because Dell shipped it for a long time.

Sounds like we have quite a few of those weird ones!

The real question is what's the working set of mapped DMA handles for such
a controller. They would typically allocate only up to what the controller
could handle, and would unmap when the IO completes, right.. So typically
we shouldn't have any trouble, since they would be able to reclaim what they
freed before the next submission (for the IOVA).

Having an IOVA requirement in 2G etc. is not a problem by itself; what matters
is how many devices are on the same PCI bus and what the total IOVA working
set is for that config.

The only way to guarantee this would be for the device to ask for a guaranteed
set, maybe during pci_set_dma_mask() or some such time, and pre-reserve some
IOVA to guarantee we never run out. But this again means driver changes, and
won't be fair if some driver is greedy.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] PCI: Cache PRI and PASID bits in pci_dev

2017-05-30 Thread Ashok Raj
From: Jean-Philippe Brucker 

Device drivers need to check whether an IOMMU has enabled ATS, PRI and PASID
in order to know when they can use the SVM API. Cache the PRI and PASID bits
in the pci_dev structure, similarly to what is currently done for ATS.
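
A driver could then gate its use of SVM on the cached bits, e.g. (a sketch,
not part of this patch):

	static bool svm_capable(struct pci_dev *pdev)
	{
		return pdev->ats_enabled && pdev->pri_enabled &&
		       pdev->pasid_enabled;
	}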

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/pci/ats.c   | 23 +++
 include/linux/pci.h |  2 ++
 2 files changed, 25 insertions(+)

diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
index eeb9fb2..2126497 100644
--- a/drivers/pci/ats.c
+++ b/drivers/pci/ats.c
@@ -153,6 +153,9 @@ int pci_enable_pri(struct pci_dev *pdev, u32 reqs)
u32 max_requests;
int pos;
 
+   if (WARN_ON(pdev->pri_enabled))
+   return -EBUSY;
+
pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PRI);
if (!pos)
return -EINVAL;
@@ -170,6 +173,8 @@ int pci_enable_pri(struct pci_dev *pdev, u32 reqs)
control |= PCI_PRI_CTRL_ENABLE;
pci_write_config_word(pdev, pos + PCI_PRI_CTRL, control);
 
+   pdev->pri_enabled = 1;
+
return 0;
 }
 EXPORT_SYMBOL_GPL(pci_enable_pri);
@@ -185,6 +190,9 @@ void pci_disable_pri(struct pci_dev *pdev)
u16 control;
int pos;
 
+   if (WARN_ON(!pdev->pri_enabled))
+   return;
+
pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PRI);
if (!pos)
return;
@@ -192,6 +200,8 @@ void pci_disable_pri(struct pci_dev *pdev)
pci_read_config_word(pdev, pos + PCI_PRI_CTRL, &control);
control &= ~PCI_PRI_CTRL_ENABLE;
pci_write_config_word(pdev, pos + PCI_PRI_CTRL, control);
+
+   pdev->pri_enabled = 0;
 }
 EXPORT_SYMBOL_GPL(pci_disable_pri);
 
@@ -207,6 +217,9 @@ int pci_reset_pri(struct pci_dev *pdev)
u16 control;
int pos;
 
+   if (WARN_ON(pdev->pri_enabled))
+   return -EBUSY;
+
pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PRI);
if (!pos)
return -EINVAL;
@@ -239,6 +252,9 @@ int pci_enable_pasid(struct pci_dev *pdev, int features)
u16 control, supported;
int pos;
 
+   if (WARN_ON(pdev->pasid_enabled))
+   return -EBUSY;
+
pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PASID);
if (!pos)
return -EINVAL;
@@ -259,6 +275,8 @@ int pci_enable_pasid(struct pci_dev *pdev, int features)
 
pci_write_config_word(pdev, pos + PCI_PASID_CTRL, control);
 
+   pdev->pasid_enabled = 1;
+
return 0;
 }
 EXPORT_SYMBOL_GPL(pci_enable_pasid);
@@ -273,11 +291,16 @@ void pci_disable_pasid(struct pci_dev *pdev)
u16 control = 0;
int pos;
 
+   if (WARN_ON(!pdev->pasid_enabled))
+   return;
+
pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PASID);
if (!pos)
return;
 
pci_write_config_word(pdev, pos + PCI_PASID_CTRL, control);
+
+   pdev->pasid_enabled = 0;
 }
 EXPORT_SYMBOL_GPL(pci_disable_pasid);
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index eb3da1a..bee980e 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -351,6 +351,8 @@ struct pci_dev {
unsigned intmsix_enabled:1;
unsigned intari_enabled:1;  /* ARI forwarding */
unsigned intats_enabled:1;  /* Address Translation Service */
+   unsigned intpasid_enabled:1;/* Process Address Space ID */
+   unsigned intpri_enabled:1;  /* Page Request Interface */
unsigned intis_managed:1;
unsigned intneeds_freset:1; /* Dev requires fundamental reset */
unsigned intstate_saved:1;
-- 
2.7.4



[PATCH 0/2] Save and restore pci properties to support FLR

2017-05-30 Thread Ashok Raj
Resending Jean's patch so it can be included earlier than his large
SVM commits. Original patch https://patchwork.kernel.org/patch/9593891
was ack'ed by Bjorn. Let's commit these separately since we need
functionality earlier.

Resending this series as requested by Jean.

CQ Tang (1):
  PCI: Save properties required to handle FLR for replay purposes.

Jean-Philippe Brucker (1):
  PCI: Cache PRI and PASID bits in pci_dev

 drivers/pci/ats.c   | 88 -
 drivers/pci/pci.c   |  3 ++
 include/linux/pci-ats.h | 10 ++
 include/linux/pci.h |  8 +
 4 files changed, 94 insertions(+), 15 deletions(-)

-- 
2.7.4



[PATCH 2/2] PCI: Save properties required to handle FLR for replay purposes.

2017-05-30 Thread Ashok Raj
From: CQ Tang 

Requires: https://patchwork.kernel.org/patch/9593891


After an FLR, PCI state needs to be restored. This patch keeps the enabled
PASID features and the allocated PRI requests cached so they can be replayed.
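
The intended flow is roughly the following (a sketch; error handling omitted,
and pcie_flr() is the reset entry point as exported around this time):

	pci_save_state(pdev);
	pcie_flr(pdev);			/* FLR clears device config space */
	pci_restore_state(pdev);	/* now also replays PRI/PASID state */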

To: Bjorn Helgaas 
To: Joerg Roedel 
To: linux-...@vger.kernel.org
To: linux-kernel@vger.kernel.org
Cc: Jean-Philippe Brucker 
Cc: David Woodhouse 
Cc: io...@lists.linux-foundation.org

Signed-off-by: CQ Tang 
Signed-off-by: Ashok Raj 
---
 drivers/pci/ats.c   | 65 +
 drivers/pci/pci.c   |  3 +++
 include/linux/pci-ats.h | 10 
 include/linux/pci.h |  6 +
 4 files changed, 69 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
index 2126497..a769955 100644
--- a/drivers/pci/ats.c
+++ b/drivers/pci/ats.c
@@ -160,17 +160,16 @@ int pci_enable_pri(struct pci_dev *pdev, u32 reqs)
if (!pos)
return -EINVAL;
 
-   pci_read_config_word(pdev, pos + PCI_PRI_CTRL, &control);
pci_read_config_word(pdev, pos + PCI_PRI_STATUS, &status);
-   if ((control & PCI_PRI_CTRL_ENABLE) ||
-   !(status & PCI_PRI_STATUS_STOPPED))
+   if (!(status & PCI_PRI_STATUS_STOPPED))
return -EBUSY;
 
pci_read_config_dword(pdev, pos + PCI_PRI_MAX_REQ, &max_requests);
reqs = min(max_requests, reqs);
+   pdev->pri_reqs_alloc = reqs;
pci_write_config_dword(pdev, pos + PCI_PRI_ALLOC_REQ, reqs);
 
-   control |= PCI_PRI_CTRL_ENABLE;
+   control = PCI_PRI_CTRL_ENABLE;
pci_write_config_word(pdev, pos + PCI_PRI_CTRL, control);
 
pdev->pri_enabled = 1;
@@ -206,6 +205,29 @@ void pci_disable_pri(struct pci_dev *pdev)
 EXPORT_SYMBOL_GPL(pci_disable_pri);
 
 /**
+ * pci_restore_pri_state - Restore PRI
+ * @pdev: PCI device structure
+ *
+ */
+void pci_restore_pri_state(struct pci_dev *pdev)
+{
+   u16 control = PCI_PRI_CTRL_ENABLE;
+   u32 reqs = pdev->pri_reqs_alloc;
+   int pos;
+
+   pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PRI);
+   if (!pos)
+   return;
+
+   if (!pdev->pri_enabled)
+   return;
+
+   pci_write_config_dword(pdev, pos + PCI_PRI_ALLOC_REQ, reqs);
+   pci_write_config_word(pdev, pos + PCI_PRI_CTRL, control);
+}
+EXPORT_SYMBOL_GPL(pci_restore_pri_state);
+
+/**
  * pci_reset_pri - Resets device's PRI state
  * @pdev: PCI device structure
  *
@@ -224,12 +246,7 @@ int pci_reset_pri(struct pci_dev *pdev)
if (!pos)
return -EINVAL;
 
-   pci_read_config_word(pdev, pos + PCI_PRI_CTRL, &control);
-   if (control & PCI_PRI_CTRL_ENABLE)
-   return -EBUSY;
-
-   control |= PCI_PRI_CTRL_RESET;
-
+   control = PCI_PRI_CTRL_RESET;
pci_write_config_word(pdev, pos + PCI_PRI_CTRL, control);
 
return 0;
@@ -259,12 +276,7 @@ int pci_enable_pasid(struct pci_dev *pdev, int features)
if (!pos)
return -EINVAL;
 
-   pci_read_config_word(pdev, pos + PCI_PASID_CTRL, &control);
pci_read_config_word(pdev, pos + PCI_PASID_CAP, &supported);
-
-   if (control & PCI_PASID_CTRL_ENABLE)
-   return -EINVAL;
-
supported &= PCI_PASID_CAP_EXEC | PCI_PASID_CAP_PRIV;
 
/* User wants to enable anything unsupported? */
@@ -272,6 +284,7 @@ int pci_enable_pasid(struct pci_dev *pdev, int features)
return -EINVAL;
 
control = PCI_PASID_CTRL_ENABLE | features;
+   pdev->pasid_features = features;
 
pci_write_config_word(pdev, pos + PCI_PASID_CTRL, control);
 
@@ -305,6 +318,28 @@ void pci_disable_pasid(struct pci_dev *pdev)
 EXPORT_SYMBOL_GPL(pci_disable_pasid);
 
 /**
+ * pci_restore_pasid_state - Restore PASID capabilities.
+ * @pdev: PCI device structure
+ *
+ */
+void pci_restore_pasid_state(struct pci_dev *pdev)
+{
+   u16 control;
+   int pos;
+
+   pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PASID);
+   if (!pos)
+   return;
+
+   if (!pdev->pasid_enabled)
+   return;
+
+   control = PCI_PASID_CTRL_ENABLE | pdev->pasid_features;
+   pci_write_config_word(pdev, pos + PCI_PASID_CTRL, control);
+}
+EXPORT_SYMBOL_GPL(pci_restore_pasid_state);
+
+/**
  * pci_pasid_features - Check which PASID features are supported
  * @pdev: PCI device structure
  *
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 7904d02..c9a6510 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1171,6 +1172,8 @@ void pci_restore_state(struct pci_dev *dev)
 
/* PCI Express register must be restored first */
pci_restore_pcie_state(dev);
+   pci_restore_pasid_state(dev);
+   pci_restore_pri_state(dev);
pci_restore_ats_state(dev);
pci_restore_vc_state(dev);
 
diff --git a/include/linux/pci-ats.h b/include/linux/pci-ats.h
index 57e0b82..782fb8e 100644
--- a/include/linux/pci-ats.h
+++ b/include/linux/pci-

[Patch V0] x86, mce: Don't clear global error reporting banks during cpu_offline

2015-09-03 Thread Ashok Raj
During CPU offline, or during suspend/resume operations, it's not safe to
clear MCi_CTL. These MSRs are either thread scoped (private to the
thread), core scoped (private to the threads in that core only), or socket
scoped, i.e. visible and controllable from all threads in the socket.
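
(For reference, the per-bank control registers in question follow the
standard MCA MSR layout -- the two defines below are the ones from
asm/msr-index.h that the patch's wrmsrl() loop builds on:)

    #define MSR_IA32_MC0_CTL        0x00000400
    /* bank i uses four consecutive MSRs: CTL, STATUS, ADDR, MISC */
    #define MSR_IA32_MCx_CTL(x)     (MSR_IA32_MC0_CTL + 4*(x))

    /* Hence wrmsrl(MSR_IA32_MCx_CTL(i), 0) silences everything bank i
     * covers -- which, per the above, may span the whole socket. */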

When these controls are cleared during CPU_OFFLINE, offlining just a single
CPU stops error signaling for all the socket-wide resources, e.g. the LLC
and the iMC.

This is true for Intel CPUs, but there is some history suggesting that other
processors may require these controls to be cleared on every CPU offline.

Intel Software Guard Extensions (SGX) is disabled, for security reasons,
when these controls are cleared. This patch enables SGX to keep working
across suspend/resume.

- Consolidated some code so it can be shared
- Minor changes to some prototypes to fit usage
- Left handling the same for non-Intel CPU models to avoid any unknown regressions

Signed-off-by: Ashok Raj 
Reviewed-by: Tony Luck 
Tested-by: Serge Ayoun 
---
 arch/x86/kernel/cpu/mcheck/mce.c | 38 --
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d350858..5498a79 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -2100,7 +2100,7 @@ int __init mcheck_init(void)
  * Disable machine checks on suspend and shutdown. We can't really handle
  * them later.
  */
-static int mce_disable_error_reporting(void)
+static void mce_disable_error_reporting(void)
 {
int i;
 
@@ -2110,17 +2110,40 @@ static int mce_disable_error_reporting(void)
if (b->init)
wrmsrl(MSR_IA32_MCx_CTL(i), 0);
}
-   return 0;
+   return;
+}
+
+static void _vendor_disable_error_reporting(void)
+{
+   struct cpuinfo_x86 *c = &boot_cpu_data;
+
+   switch (c->x86_vendor) {
+   case X86_VENDOR_INTEL:
+   /*
+* Don't clear on Intel CPUs. Some of these MSRs are
+* socket wide. Disabling them for just a single CPU offline
+* is bad, since it will inhibit reporting for all shared
+* resources, e.g. the LLC or the iMC.
+*/
+   break;
+   default:
+   /*
+* Disable MCE reporting for all other CPU vendors.
+* We don't want to break functionality on those.
+*/
+   mce_disable_error_reporting();
+   }
 }
 
 static int mce_syscore_suspend(void)
 {
-   return mce_disable_error_reporting();
+   _vendor_disable_error_reporting();
+   return 0;
 }
 
 static void mce_syscore_shutdown(void)
 {
-   mce_disable_error_reporting();
+   _vendor_disable_error_reporting();
 }
 
 /*
@@ -2400,19 +2423,14 @@ static void mce_device_remove(unsigned int cpu)
 static void mce_disable_cpu(void *h)
 {
unsigned long action = *(unsigned long *)h;
-   int i;
 
if (!mce_available(raw_cpu_ptr(&cpu_info)))
return;
 
if (!(action & CPU_TASKS_FROZEN))
cmci_clear();
-   for (i = 0; i < mca_cfg.banks; i++) {
-   struct mce_bank *b = &mce_banks[i];
 
-   if (b->init)
-   wrmsrl(MSR_IA32_MCx_CTL(i), 0);
-   }
+   _vendor_disable_error_reporting();
 }
 
 static void mce_reenable_cpu(void *h)
-- 
2.4.3



[Patch V1] x86, mce: Don't clear global error reporting banks during cpu_offline

2015-09-04 Thread Ashok Raj
During CPU offline, or during suspend/resume operations, it's not safe to
clear MCi_CTL. These MSRs are either thread scoped (private to the
thread), core scoped (private to the threads in that core only), or socket
scoped, i.e. visible and controllable from all threads in the socket.

When these controls are cleared during CPU_OFFLINE, offlining just a single
CPU stops error signaling for all the socket-wide resources, e.g. the LLC
and the iMC.

This is true for Intel CPUs, but there is some history suggesting that other
processors may require these controls to be cleared on every CPU offline.

With Intel Software Guard Extensions (SGX), the worry is that an attacker in
control of the host system could compromise integrity on an SGX system by
injecting errors that would otherwise be ignored while the MCi_CTL bits are
cleared. Hence, on SGX-enabled systems, SGX becomes unavailable once MCi_CTL
is cleared.

- Consolidated some code so it can be shared
- Minor changes to some prototypes to fit usage
- Left handling the same for non-Intel CPU models to avoid any unknown regressions
- Fixed review comments from Boris

Signed-off-by: Ashok Raj 
Reviewed-by: Tony Luck 
Tested-by: Serge Ayoun 
---
 arch/x86/kernel/cpu/mcheck/mce.c | 30 --
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d350858..69c7e3c 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -2100,7 +2100,7 @@ int __init mcheck_init(void)
  * Disable machine checks on suspend and shutdown. We can't really handle
  * them later.
  */
-static int mce_disable_error_reporting(void)
+static void mce_disable_error_reporting(void)
 {
int i;
 
@@ -2110,17 +2110,32 @@ static int mce_disable_error_reporting(void)
if (b->init)
wrmsrl(MSR_IA32_MCx_CTL(i), 0);
}
-   return 0;
+   return;
+}
+
+static void vendor_disable_error_reporting(void)
+{
+   /*
+* Don't clear on Intel CPUs. Some of these MSRs are
+* socket wide. Disabling them for just a single CPU offline
+* is bad, since it will inhibit reporting for all shared
+*   resources, e.g. the LLC or the iMC.
+*/
+   if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+   return;
+
+   mce_disable_error_reporting();
 }
 
 static int mce_syscore_suspend(void)
 {
-   return mce_disable_error_reporting();
+   vendor_disable_error_reporting();
+   return 0;
 }
 
 static void mce_syscore_shutdown(void)
 {
-   mce_disable_error_reporting();
+   vendor_disable_error_reporting();
 }
 
 /*
@@ -2400,19 +2415,14 @@ static void mce_device_remove(unsigned int cpu)
 static void mce_disable_cpu(void *h)
 {
unsigned long action = *(unsigned long *)h;
-   int i;
 
if (!mce_available(raw_cpu_ptr(&cpu_info)))
return;
 
if (!(action & CPU_TASKS_FROZEN))
cmci_clear();
-   for (i = 0; i < mca_cfg.banks; i++) {
-   struct mce_bank *b = &mce_banks[i];
 
-   if (b->init)
-   wrmsrl(MSR_IA32_MCx_CTL(i), 0);
-   }
+   vendor_disable_error_reporting();
 }
 
 static void mce_reenable_cpu(void *h)
-- 
2.4.3



[PATCH 0/5] Add support for IBRS & IBPB KVM support.

2018-01-11 Thread Ashok Raj
The following patches are based on v3 from Tim Chen

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1582043.html

This patch set supports exposing MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD
for user space.

Thomas is steam-blowing v3 :-), but I didn't want to keep holding this
much longer waiting for the rebase to complete in tip/x86/pti.

Ashok Raj (4):
  x86/ibrs: Introduce native_rdmsrl, and native_wrmsrl
  x86/ibrs: Add new helper macros to save/restore MSR_IA32_SPEC_CTRL
  x86/ibrs: Add direct access support for MSR_IA32_SPEC_CTRL
  x86/feature: Detect the x86 feature Indirect Branch Prediction Barrier

Paolo Bonzini (1):
  x86/svm: Direct access to MSR_IA32_SPEC_CTRL

 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/msr-index.h   |  3 +++
 arch/x86/include/asm/spec_ctrl.h   | 29 +-
 arch/x86/kernel/cpu/spec_ctrl.c| 19 ++
 arch/x86/kvm/cpuid.c   |  3 ++-
 arch/x86/kvm/svm.c | 51 ++
 arch/x86/kvm/vmx.c | 51 ++
 arch/x86/kvm/x86.c |  1 +
 8 files changed, 156 insertions(+), 2 deletions(-)

-- 
2.7.4



[PATCH 4/5] x86/svm: Direct access to MSR_IA32_SPEC_CTRL

2018-01-11 Thread Ashok Raj
From: Paolo Bonzini 

Direct access to MSR_IA32_SPEC_CTRL is important
for performance.  Allow load/store of MSR_IA32_SPEC_CTRL, restore guest
IBRS on VM entry and restore the host values on VM exit.

TBD: need to check that the MSRs can be passed through even if the
feature is not enumerated by the CPU.

[Ashok: Modified to reuse V3 spec-ctrl patches from Tim]

Signed-off-by: Paolo Bonzini 
Signed-off-by: Ashok Raj 
---
 arch/x86/kvm/svm.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 0e68f0b..7c14471a 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -183,6 +183,8 @@ struct vcpu_svm {
u64 gs_base;
} host;
 
+   u64 spec_ctrl;
+
u32 *msrpm;
 
ulong nmi_iret_rip;
@@ -248,6 +250,7 @@ static const struct svm_direct_access_msrs {
{ .index = MSR_CSTAR,   .always = true  },
{ .index = MSR_SYSCALL_MASK,.always = true  },
 #endif
+   { .index = MSR_IA32_SPEC_CTRL,  .always = true  },
{ .index = MSR_IA32_LASTBRANCHFROMIP,   .always = false },
{ .index = MSR_IA32_LASTBRANCHTOIP, .always = false },
{ .index = MSR_IA32_LASTINTFROMIP,  .always = false },
@@ -917,6 +920,9 @@ static void svm_vcpu_init_msrpm(u32 *msrpm)
 
set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 1);
}
+
+   if (boot_cpu_has(X86_FEATURE_SPEC_CTRL))
+   set_msr_interception(msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
 }
 
 static void add_msr_offset(u32 offset)
@@ -3576,6 +3582,9 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
case MSR_VM_CR:
msr_info->data = svm->nested.vm_cr_msr;
break;
+   case MSR_IA32_SPEC_CTRL:
+   msr_info->data = svm->spec_ctrl;
+   break;
case MSR_IA32_UCODE_REV:
msr_info->data = 0x0165;
break;
@@ -3724,6 +3733,9 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr)
case MSR_VM_IGNNE:
vcpu_unimpl(vcpu, "unimplemented wrmsr: 0x%x data 0x%llx\n", 
ecx, data);
break;
+   case MSR_IA32_SPEC_CTRL:
+   svm->spec_ctrl = data;
+   break;
case MSR_IA32_APICBASE:
if (kvm_vcpu_apicv_active(vcpu))
avic_update_vapic_bar(to_svm(vcpu), data);
@@ -4871,6 +4883,19 @@ static void svm_cancel_injection(struct kvm_vcpu *vcpu)
svm_complete_interrupts(svm);
 }
 
+
+/*
+ * Save guest value of spec_ctrl and also restore host value
+ */
+static void save_guest_spec_ctrl(struct vcpu_svm *svm)
+{
+   if (boot_cpu_has(X86_FEATURE_SPEC_CTRL)) {
+   svm->spec_ctrl = spec_ctrl_get();
+   spec_ctrl_restriction_on();
+   } else
+   rmb();
+}
+
 static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
@@ -4910,6 +4935,14 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 
clgi();
 
+   if (boot_cpu_has(X86_FEATURE_SPEC_CTRL)) {
+   /*
+* FIXME: lockdep_assert_irqs_disabled();
+*/
+   WARN_ON_ONCE(!irqs_disabled());
+   spec_ctrl_set(svm->spec_ctrl);
+   }
+
local_irq_enable();
 
asm volatile (
@@ -4985,6 +5018,8 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 #endif
);
 
+   save_guest_spec_ctrl(svm);
+
 #ifdef CONFIG_X86_64
wrmsrl(MSR_GS_BASE, svm->host.gs_base);
 #else
-- 
2.7.4



[PATCH 1/5] x86/ibrs: Introduce native_rdmsrl, and native_wrmsrl

2018-01-11 Thread Ashok Raj
- Removed the inclusion of microcode.h; use the native macros from asm/msr.h
- Added a license header for spec_ctrl.c

Signed-off-by: Ashok Raj 
---
 arch/x86/include/asm/spec_ctrl.h | 17 -
 arch/x86/kernel/cpu/spec_ctrl.c  |  1 +
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/spec_ctrl.h b/arch/x86/include/asm/spec_ctrl.h
index 948959b..2dfa31b 100644
--- a/arch/x86/include/asm/spec_ctrl.h
+++ b/arch/x86/include/asm/spec_ctrl.h
@@ -3,12 +3,27 @@
 #ifndef _ASM_X86_SPEC_CTRL_H
 #define _ASM_X86_SPEC_CTRL_H
 
-#include 
+#include 
+#include 
 
 void spec_ctrl_scan_feature(struct cpuinfo_x86 *c);
 void spec_ctrl_unprotected_begin(void);
 void spec_ctrl_unprotected_end(void);
 
+static inline u64 native_rdmsrl(unsigned int msr)
+{
+   u64 val;
+
+   val = __rdmsr(msr);
+
+   return val;
+}
+
+static inline void native_wrmsrl(unsigned int msr, u64 val)
+{
+   __wrmsr(msr, (u32) (val & 0xULL), (u32) (val >> 32));
+}
+
 static inline void __disable_indirect_speculation(void)
 {
native_wrmsrl(MSR_IA32_SPEC_CTRL, SPEC_CTRL_ENABLE_IBRS);
diff --git a/arch/x86/kernel/cpu/spec_ctrl.c b/arch/x86/kernel/cpu/spec_ctrl.c
index 843b4e6..9e9d013 100644
--- a/arch/x86/kernel/cpu/spec_ctrl.c
+++ b/arch/x86/kernel/cpu/spec_ctrl.c
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 #include 
 
 #include 
-- 
2.7.4



[PATCH 5/5] x86/feature: Detect the x86 feature Indirect Branch Prediction Barrier

2018-01-11 Thread Ashok Raj
CPUID leaf EAX=0x7 returns EDX bit 26 to indicate the presence of both
IA32_SPEC_CTRL (MSR 0x48) and IA32_PRED_CMD (MSR 0x49).

BIT0: Indirect Branch Prediction Barrier

When this MSR is written with IBPB=1, it ensures that earlier code's behavior
doesn't control later indirect branch predictions.

Note this MSR is write-only and does not carry any state. It's a barrier,
so the code should perform a wrmsr whenever the barrier is needed.
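
A minimal sketch (kernel context assumed) of issuing the barrier with the
constants this patch introduces:

    #include <asm/msr.h>

    /* IBPB carries no state: a single wrmsr IS the whole barrier. */
    static inline void issue_ibpb(void)
    {
            wrmsrl(MSR_IA32_PRED_CMD, FEATURE_SET_IBPB);
    }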

Signed-off-by: Ashok Raj 
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/msr-index.h   |  3 +++
 arch/x86/kernel/cpu/spec_ctrl.c|  7 +++
 arch/x86/kvm/svm.c | 16 
 arch/x86/kvm/vmx.c | 10 ++
 5 files changed, 37 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index 624b58e..52f37fc 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -213,6 +213,7 @@
 #define X86_FEATURE_MBA( 7*32+18) /* Memory Bandwidth 
Allocation */
 #define X86_FEATURE_SPEC_CTRL  ( 7*32+19) /* Speculation Control */
 #define X86_FEATURE_SPEC_CTRL_IBRS ( 7*32+20) /* Speculation Control, use 
IBRS */
+#define X86_FEATURE_PRED_CMD   ( 7*32+21) /* Indirect Branch Prediction 
Barrier */
 
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3e1cb18..1888e19 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -46,6 +46,9 @@
 #define SPEC_CTRL_DISABLE_IBRS (0 << 0)
 #define SPEC_CTRL_ENABLE_IBRS  (1 << 0)
 
+#define MSR_IA32_PRED_CMD  0x0049
+#define FEATURE_SET_IBPB   (1<<0)
+
 #define MSR_IA32_PERFCTR0  0x00c1
 #define MSR_IA32_PERFCTR1  0x00c2
 #define MSR_FSB_FREQ   0x00cd
diff --git a/arch/x86/kernel/cpu/spec_ctrl.c b/arch/x86/kernel/cpu/spec_ctrl.c
index 02fc630..6cfec19 100644
--- a/arch/x86/kernel/cpu/spec_ctrl.c
+++ b/arch/x86/kernel/cpu/spec_ctrl.c
@@ -15,6 +15,13 @@ void spec_ctrl_scan_feature(struct cpuinfo_x86 *c)
if (!c->cpu_index)
static_branch_enable(&spec_ctrl_dynamic_ibrs);
}
+   /*
+* For Intel CPUs these MSRs share the same CPUID
+* enumeration: when MSR_IA32_SPEC_CTRL is present,
+* MSR_IA32_PRED_CMD is also available.
+* TBD: AMD might have a separate enumeration for each.
+*/
+   set_cpu_cap(c, X86_FEATURE_PRED_CMD);
}
 }
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 7c14471a..36924c9 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -251,6 +251,7 @@ static const struct svm_direct_access_msrs {
{ .index = MSR_SYSCALL_MASK,.always = true  },
 #endif
{ .index = MSR_IA32_SPEC_CTRL,  .always = true  },
+   { .index = MSR_IA32_PRED_CMD,   .always = false },
{ .index = MSR_IA32_LASTBRANCHFROMIP,   .always = false },
{ .index = MSR_IA32_LASTBRANCHTOIP, .always = false },
{ .index = MSR_IA32_LASTINTFROMIP,  .always = false },
@@ -531,6 +532,7 @@ struct svm_cpu_data {
struct kvm_ldttss_desc *tss_desc;
 
struct page *save_area;
+   struct vmcb *current_vmcb;
 };
 
 static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
@@ -923,6 +925,8 @@ static void svm_vcpu_init_msrpm(u32 *msrpm)
 
if (boot_cpu_has(X86_FEATURE_SPEC_CTRL))
set_msr_interception(msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
+   if (boot_cpu_has(X86_FEATURE_PRED_CMD))
+   set_msr_interception(msrpm, MSR_IA32_PRED_CMD, 1, 1);
 }
 
 static void add_msr_offset(u32 offset)
@@ -1711,11 +1715,18 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
__free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
kvm_vcpu_uninit(vcpu);
kmem_cache_free(kvm_vcpu_cache, svm);
+   /*
+* The VMCB could be recycled, causing a false negative in
+* svm_vcpu_load; block speculative execution.
+*/
+   if (boot_cpu_has(X86_FEATURE_PRED_CMD))
+   native_wrmsrl(MSR_IA32_PRED_CMD, FEATURE_SET_IBPB);
 }
 
 static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
+   struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
int i;
 
if (unlikely(cpu != vcpu->cpu)) {
@@ -1744,6 +1755,11 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (static_cpu_has(X86_FEATURE_RDTSCP))
wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
 
+   if (sd->current_vmcb != svm->vmcb) {
+   sd->current_vmcb = svm->vmcb;

[PATCH 3/5] x86/ibrs: Add direct access support for MSR_IA32_SPEC_CTRL

2018-01-11 Thread Ashok Raj
Add direct access to MSR_IA32_SPEC_CTRL from a guest. Also save/restore the
IBRS value on VM exits and on the guest resume path.

Rebasing based on Tim's patch

Signed-off-by: Ashok Raj 
---
 arch/x86/kvm/cpuid.c |  3 ++-
 arch/x86/kvm/vmx.c   | 41 +
 arch/x86/kvm/x86.c   |  1 +
 3 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0099e10..6fa81c7 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -70,6 +70,7 @@ u64 kvm_supported_xcr0(void)
 /* These are scattered features in cpufeatures.h. */
 #define KVM_CPUID_BIT_AVX512_4VNNIW 2
 #define KVM_CPUID_BIT_AVX512_4FMAPS 3
+#define KVM_CPUID_BIT_SPEC_CTRL26
 #define KF(x) bit(KVM_CPUID_BIT_##x)
 
 int kvm_update_cpuid(struct kvm_vcpu *vcpu)
@@ -392,7 +393,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
*entry, u32 function,
 
/* cpuid 7.0.edx*/
const u32 kvm_cpuid_7_0_edx_x86_features =
-   KF(AVX512_4VNNIW) | KF(AVX512_4FMAPS);
+   KF(AVX512_4VNNIW) | KF(AVX512_4FMAPS) | KF(SPEC_CTRL);
 
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 62ee436..1913896 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "trace.h"
 #include "pmu.h"
@@ -579,6 +580,7 @@ struct vcpu_vmx {
u32 vm_entry_controls_shadow;
u32 vm_exit_controls_shadow;
u32 secondary_exec_control;
+   u64 spec_ctrl;
 
/*
 * loaded_vmcs points to the VMCS currently used in this vcpu. For a
@@ -3259,6 +3261,9 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
case MSR_IA32_TSC:
msr_info->data = guest_read_tsc(vcpu);
break;
+   case MSR_IA32_SPEC_CTRL:
+   msr_info->data = to_vmx(vcpu)->spec_ctrl;
+   break;
case MSR_IA32_SYSENTER_CS:
msr_info->data = vmcs_read32(GUEST_SYSENTER_CS);
break;
@@ -3366,6 +3371,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr_info);
break;
+   case MSR_IA32_SPEC_CTRL:
+   to_vmx(vcpu)->spec_ctrl = msr_info->data;
+   break;
case MSR_IA32_CR_PAT:
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
@@ -6790,6 +6798,13 @@ static __init int hardware_setup(void)
kvm_tsc_scaling_ratio_frac_bits = 48;
}
 
+   /*
+* If feature is available then setup MSR_IA32_SPEC_CTRL to be in
+* passthrough mode for the guest.
+*/
+   if (boot_cpu_has(X86_FEATURE_SPEC_CTRL))
+   vmx_disable_intercept_for_msr(MSR_IA32_SPEC_CTRL, false);
+
vmx_disable_intercept_for_msr(MSR_FS_BASE, false);
vmx_disable_intercept_for_msr(MSR_GS_BASE, false);
vmx_disable_intercept_for_msr(MSR_KERNEL_GS_BASE, true);
@@ -9242,6 +9257,15 @@ static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc);
 }
 
+static void save_guest_spec_ctrl(struct vcpu_vmx *vmx)
+{
+   if (boot_cpu_has(X86_FEATURE_SPEC_CTRL)) {
+   vmx->spec_ctrl = spec_ctrl_get();
+   spec_ctrl_restriction_on();
+   } else
+   rmb();
+}
+
 static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -9298,6 +9322,21 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
vmx_arm_hv_timer(vcpu);
 
vmx->__launched = vmx->loaded_vmcs->launched;
+
+   /*
+* Just install whatever value the guest set for the MSR.
+* If this is unlaunched, assume the initialized value is 0.
+* IRQs also need to be disabled: if the guest value is 0, an interrupt
+* could start running in unprotected mode (i.e. with IBRS=0).
+*/
+   if (boot_cpu_has(X86_FEATURE_SPEC_CTRL)) {
+   /*
+* FIXME: lockdep_assert_irqs_disabled();
+*/
+   WARN_ON_ONCE(!irqs_disabled());
+   spec_ctrl_set(vmx->spec_ctrl);
+   }
+
asm(
/* Store host registers */
"push %%" _ASM_DX "; push %%" _ASM_BP ";"
@@ -9403,6 +9442,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 #endif
  );
 
+   save_guest_spec_ctrl(vmx);
+
/* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */
if (debugctlmsr)
update_debugctlmsr(debugctlmsr);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm

[PATCH 2/5] x86/ibrs: Add new helper macros to save/restore MSR_IA32_SPEC_CTRL

2018-01-11 Thread Ashok Raj
Add some helper macros to save/restore MSR_IA32_SPEC_CTRL.

Although we could use the spec_ctrl_unprotected_begin/end macros, they seem
a bit unreadable for some uses.

spec_ctrl_get - read MSR_IA32_SPEC_CTRL to save it
spec_ctrl_set - write a value to restore MSR_IA32_SPEC_CTRL
spec_ctrl_restriction_off - same as spec_ctrl_unprotected_begin
spec_ctrl_restriction_on - same as spec_ctrl_unprotected_end
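
Roughly, the intended pairing around a VM entry/exit then looks like the
sketch below (assuming IRQs stay disabled across the whole window):

    u64 guest_spec_ctrl = 0;

    /* VM entry: install the guest's value */
    spec_ctrl_set(guest_spec_ctrl);

    /* ... guest runs ... */

    /* VM exit: save the guest value, then restore host protection */
    guest_spec_ctrl = spec_ctrl_get();
    spec_ctrl_restriction_on();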

Signed-off-by: Ashok Raj 
---
 arch/x86/include/asm/spec_ctrl.h | 12 
 arch/x86/kernel/cpu/spec_ctrl.c  | 11 +++
 2 files changed, 23 insertions(+)

diff --git a/arch/x86/include/asm/spec_ctrl.h b/arch/x86/include/asm/spec_ctrl.h
index 2dfa31b..926feb2 100644
--- a/arch/x86/include/asm/spec_ctrl.h
+++ b/arch/x86/include/asm/spec_ctrl.h
@@ -9,6 +9,10 @@
 void spec_ctrl_scan_feature(struct cpuinfo_x86 *c);
 void spec_ctrl_unprotected_begin(void);
 void spec_ctrl_unprotected_end(void);
+void spec_ctrl_set(u64 val);
+
+#define spec_ctrl_restriction_on   spec_ctrl_unprotected_end
+#define spec_ctrl_restriction_off  spec_ctrl_unprotected_begin
 
 static inline u64 native_rdmsrl(unsigned int msr)
 {
@@ -34,4 +38,12 @@ static inline void __enable_indirect_speculation(void)
native_wrmsrl(MSR_IA32_SPEC_CTRL, SPEC_CTRL_DISABLE_IBRS);
 }
 
+static inline u64 spec_ctrl_get(void)
+{
+   u64 val;
+
+   val = native_rdmsrl(MSR_IA32_SPEC_CTRL);
+
+   return val;
+}
 #endif /* _ASM_X86_SPEC_CTRL_H */
diff --git a/arch/x86/kernel/cpu/spec_ctrl.c b/arch/x86/kernel/cpu/spec_ctrl.c
index 9e9d013..02fc630 100644
--- a/arch/x86/kernel/cpu/spec_ctrl.c
+++ b/arch/x86/kernel/cpu/spec_ctrl.c
@@ -47,3 +47,14 @@ void spec_ctrl_unprotected_end(void)
__disable_indirect_speculation();
 }
 EXPORT_SYMBOL_GPL(spec_ctrl_unprotected_end);
+
+void spec_ctrl_set(u64 val)
+{
+   if (boot_cpu_has(X86_FEATURE_SPEC_CTRL)) {
+   if (!val) {
+   spec_ctrl_restriction_off();
+   } else
+   spec_ctrl_restriction_on();
+   }
+}
+EXPORT_SYMBOL(spec_ctrl_set);
-- 
2.7.4



Re: [PATCH 3/7] kvm: vmx: pass MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD down to the guest

2018-01-08 Thread Ashok Raj
Hi Paolo

Do you assume that the host isn't using IBRS and only the guest uses it?



On Mon, Jan 8, 2018 at 10:08 AM, Paolo Bonzini  wrote:
> Direct access to MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD is important
> for performance.  Allow load/store of MSR_IA32_SPEC_CTRL, restore guest
> IBRS on VM entry and set it to 0 on VM exit (because Linux does not use
> it yet).
>
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/kvm/vmx.c | 32 
>  1 file changed, 32 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 669f5f74857d..d00bcad7336e 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -120,6 +120,8 @@
>  module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
>  #endif
>
> +static bool __read_mostly have_spec_ctrl;
> +
>  #define KVM_GUEST_CR0_MASK (X86_CR0_NW | X86_CR0_CD)
>  #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST (X86_CR0_WP | X86_CR0_NE)
>  #define KVM_VM_CR0_ALWAYS_ON   \
> @@ -609,6 +611,8 @@ struct vcpu_vmx {
> u64   msr_host_kernel_gs_base;
> u64   msr_guest_kernel_gs_base;
>  #endif
> +   u64   spec_ctrl;
> +
> u32 vm_entry_controls_shadow;
> u32 vm_exit_controls_shadow;
> u32 secondary_exec_control;
> @@ -3361,6 +3365,9 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
> case MSR_IA32_TSC:
> msr_info->data = guest_read_tsc(vcpu);
> break;
> +   case MSR_IA32_SPEC_CTRL:
> +   msr_info->data = to_vmx(vcpu)->spec_ctrl;
> +   break;
> case MSR_IA32_SYSENTER_CS:
> msr_info->data = vmcs_read32(GUEST_SYSENTER_CS);
> break;
> @@ -3500,6 +3507,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
> case MSR_IA32_TSC:
> kvm_write_tsc(vcpu, msr_info);
> break;
> +   case MSR_IA32_SPEC_CTRL:
> +   to_vmx(vcpu)->spec_ctrl = msr_info->data;
> +   break;
> case MSR_IA32_CR_PAT:
> if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
> if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
> @@ -7062,6 +7072,17 @@ static __init int hardware_setup(void)
> goto out;
> }
>
> +   /*
> +* FIXME: this is only needed until SPEC_CTRL is supported
> +* by upstream Linux in cpufeatures, then it can be replaced
> +* with static_cpu_has.
> +*/
> +   have_spec_ctrl = cpu_has_spec_ctrl();
> +   if (have_spec_ctrl)
> +   pr_info("kvm: SPEC_CTRL available\n");
> +   else
> +   pr_info("kvm: SPEC_CTRL not available\n");
> +
> if (boot_cpu_has(X86_FEATURE_NX))
> kvm_enable_efer_bits(EFER_NX);
>
> @@ -7131,6 +7152,8 @@ static __init int hardware_setup(void)
> vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_CS, false);
> vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
> vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
> +   vmx_disable_intercept_for_msr(MSR_IA32_SPEC_CTRL, false);
> +   vmx_disable_intercept_for_msr(MSR_IA32_PRED_CMD, false);
>
> memcpy(vmx_msr_bitmap_legacy_x2apic_apicv,
> vmx_msr_bitmap_legacy, PAGE_SIZE);
> @@ -9597,6 +9620,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> *vcpu)
>
> pt_guest_enter(vmx);
>
> +   if (have_spec_ctrl && vmx->spec_ctrl != 0)
> +   wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
> +

Do we even need to optimize this? What if host Linux enabled IBRS, but
the guest has it turned off?
Thought it might be simpler to blindly update it with whatever the
vmx->spec_ctrl value is?

> atomic_switch_perf_msrs(vmx);
>
> vmx_arm_hv_timer(vcpu);
> @@ -9707,6 +9733,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> *vcpu)
>  #endif
>   );
>
> +   if (have_spec_ctrl) {
> +   rdmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
> +   if (vmx->spec_ctrl)
> +   wrmsrl(MSR_IA32_SPEC_CTRL, 0);
> +   }
> +

Same thing here: if the host OS has enabled IBRS, wouldn't you want to
keep the same value?

> /* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */
> if (vmx->host_debugctlmsr)
> update_debugctlmsr(vmx->host_debugctlmsr);
> --
> 1.8.3.1
>
>


[4.15 & 4.14 stable 07/12] x86/microcode: Do not upload microcode if CPUs are offline

2018-04-06 Thread Ashok Raj
commit 30ec26da9967d0d785abc24073129a34c3211777 upstream

Avoid loading microcode if any of the CPUs are offline, and issue a
warning. Having different microcode revisions on the system at any time
is outright dangerous.

[ Borislav: Massage changelog. ]

Signed-off-by: Ashok Raj 
Signed-off-by: Borislav Petkov 
Signed-off-by: Thomas Gleixner 
Tested-by: Tom Lendacky 
Tested-by: Ashok Raj 
Reviewed-by: Tom Lendacky 
Cc: Arjan Van De Ven 
Cc: Asit K Mallick 
Cc: sta...@vger.kernel.org
Link: http://lkml.kernel.org/r/1519352533-15992-4-git-send-email-ashok@intel.com
Link: https://lkml.kernel.org/r/20180228102846.13447-5...@alien8.de
---
 arch/x86/kernel/cpu/microcode/core.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/kernel/cpu/microcode/core.c 
b/arch/x86/kernel/cpu/microcode/core.c
index cbeace2..f25c395 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -486,6 +486,16 @@ static void __exit microcode_dev_exit(void)
 /* fake device for request_firmware */
 static struct platform_device  *microcode_pdev;
 
+static int check_online_cpus(void)
+{
+   if (num_online_cpus() == num_present_cpus())
+   return 0;
+
+   pr_err("Not all CPUs online, aborting microcode update.\n");
+
+   return -EINVAL;
+}
+
 static enum ucode_state reload_for_cpu(int cpu)
 {
struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
@@ -519,7 +529,13 @@ static ssize_t reload_store(struct device *dev,
return size;
 
get_online_cpus();
+
+   ret = check_online_cpus();
+   if (ret)
+   goto put;
+
mutex_lock(&microcode_mutex);
+
for_each_online_cpu(cpu) {
tmp_ret = reload_for_cpu(cpu);
if (tmp_ret > UCODE_NFOUND) {
@@ -538,6 +554,8 @@ static ssize_t reload_store(struct device *dev,
microcode_check();
 
mutex_unlock(&microcode_mutex);
+
+put:
put_online_cpus();
 
if (!ret)
-- 
2.7.4



[4.15 & 4.14 stable 08/12] x86/microcode/intel: Look into the patch cache first

2018-04-06 Thread Ashok Raj
From: Borislav Petkov 

commit d8c3b52c00a05036e0a6b315b4b17921a7b67997 upstream

The cache might contain a newer patch - look in there first.

A follow-on change will make sure newest patches are loaded into the
cache of microcode patches.

Signed-off-by: Borislav Petkov 
Signed-off-by: Thomas Gleixner 
Tested-by: Tom Lendacky 
Tested-by: Ashok Raj 
Cc: Arjan Van De Ven 
Cc: Tom Lendacky 
Cc: Asit K Mallick 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20180228102846.13447-6...@alien8.de
---
 arch/x86/kernel/cpu/microcode/intel.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c 
b/arch/x86/kernel/cpu/microcode/intel.c
index e2864bc..2aded9d 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -791,9 +791,9 @@ static int collect_cpu_info(int cpu_num, struct 
cpu_signature *csig)
 
 static enum ucode_state apply_microcode_intel(int cpu)
 {
-   struct microcode_intel *mc;
-   struct ucode_cpu_info *uci;
+   struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
struct cpuinfo_x86 *c = &cpu_data(cpu);
+   struct microcode_intel *mc;
static int prev_rev;
u32 rev;
 
@@ -801,11 +801,10 @@ static enum ucode_state apply_microcode_intel(int cpu)
if (WARN_ON(raw_smp_processor_id() != cpu))
return UCODE_ERROR;
 
-   uci = ucode_cpu_info + cpu;
-   mc = uci->mc;
+   /* Look for a newer patch in our cache: */
+   mc = find_patch(uci);
if (!mc) {
-   /* Look for a newer patch in our cache: */
-   mc = find_patch(uci);
+   mc = uci->mc;
if (!mc)
return UCODE_NFOUND;
}
-- 
2.7.4



[4.15 & 4.14 stable 09/12] x86/microcode: Request microcode on the BSP

2018-04-06 Thread Ashok Raj
From: Borislav Petkov 

commit cfb52a5a09c8ae3a1dafb44ce549fde5b69e8117 upstream

... so that any newer version can land in the cache and can later be
fished out by the application functions. Do that before grabbing the
hotplug lock.

Signed-off-by: Borislav Petkov 
Signed-off-by: Thomas Gleixner 
Tested-by: Tom Lendacky 
Tested-by: Ashok Raj 
Reviewed-by: Tom Lendacky 
Cc: Arjan Van De Ven 
Cc: Asit K Mallick 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20180228102846.13447-7...@alien8.de
---
 arch/x86/kernel/cpu/microcode/core.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/core.c 
b/arch/x86/kernel/cpu/microcode/core.c
index f25c395..8adbf43 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -499,15 +499,10 @@ static int check_online_cpus(void)
 static enum ucode_state reload_for_cpu(int cpu)
 {
struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
-   enum ucode_state ustate;
 
if (!uci->valid)
return UCODE_OK;
 
-   ustate = microcode_ops->request_microcode_fw(cpu, &microcode_pdev->dev, true);
-   if (ustate != UCODE_OK)
-   return ustate;
-
return apply_microcode_on_target(cpu);
 }
 
@@ -515,11 +510,11 @@ static ssize_t reload_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t size)
 {
+   int cpu, bsp = boot_cpu_data.cpu_index;
enum ucode_state tmp_ret = UCODE_OK;
bool do_callback = false;
unsigned long val;
ssize_t ret = 0;
-   int cpu;
 
ret = kstrtoul(buf, 0, &val);
if (ret)
@@ -528,6 +523,10 @@ static ssize_t reload_store(struct device *dev,
if (val != 1)
return size;
 
+   tmp_ret = microcode_ops->request_microcode_fw(bsp, &microcode_pdev->dev, true);
+   if (tmp_ret != UCODE_OK)
+   return size;
+
get_online_cpus();
 
ret = check_online_cpus();
-- 
2.7.4



[4.15 & 4.14 stable 04/12] x86/microcode: Get rid of struct apply_microcode_ctx

2018-04-06 Thread Ashok Raj
From: Borislav Petkov 

commit 854857f5944c59a881ff607b37ed9ed41d031a3b upstream

It is a useless remnant from earlier times. Use the ucode_state enum
directly.

No functional change.

Signed-off-by: Borislav Petkov 
Signed-off-by: Thomas Gleixner 
Tested-by: Tom Lendacky 
Tested-by: Ashok Raj 
Cc: Arjan Van De Ven 
Cc: Tom Lendacky 
Cc: Asit K Mallick 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20180228102846.13447-2...@alien8.de
---
 arch/x86/kernel/cpu/microcode/core.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/core.c 
b/arch/x86/kernel/cpu/microcode/core.c
index b40b56e..cbeace2 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -373,26 +373,23 @@ static int collect_cpu_info(int cpu)
return ret;
 }
 
-struct apply_microcode_ctx {
-   enum ucode_state err;
-};
-
 static void apply_microcode_local(void *arg)
 {
-   struct apply_microcode_ctx *ctx = arg;
+   enum ucode_state *err = arg;
 
-   ctx->err = microcode_ops->apply_microcode(smp_processor_id());
+   *err = microcode_ops->apply_microcode(smp_processor_id());
 }
 
 static int apply_microcode_on_target(int cpu)
 {
-   struct apply_microcode_ctx ctx = { .err = 0 };
+   enum ucode_state err;
int ret;
 
-   ret = smp_call_function_single(cpu, apply_microcode_local, &ctx, 1);
-   if (!ret)
-   ret = ctx.err;
-
+   ret = smp_call_function_single(cpu, apply_microcode_local, &err, 1);
+   if (!ret) {
+   if (err == UCODE_ERROR)
+   ret = 1;
+   }
return ret;
 }
 
-- 
2.7.4



[4.15 & 4.14 stable 05/12] x86/microcode/intel: Check microcode revision before updating sibling threads

2018-04-06 Thread Ashok Raj
commit c182d2b7d0ca48e0d6ff16f7d883161238c447ed upstream

After updating microcode on one of the threads of a core, the sibling
thread automatically gets the update since the microcode resources on a
hyperthreaded core are shared between the two threads.

Check the microcode revision on the CPU before performing a microcode
update, and thus spare the WRMSR to MSR 0x79, which is a particularly
expensive operation.
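
(The revision read itself boils down to the conventional MSR 0x8B sequence;
roughly what intel_get_microcode_revision() amounts to:)

    static u32 read_microcode_rev(void)
    {
            unsigned int eax = 1, ebx = 0, ecx = 0, edx = 0;
            u32 dummy, rev;

            /* zero the revision MSR, then serialize with CPUID(1) */
            native_wrmsrl(MSR_IA32_UCODE_REV, 0);
            native_cpuid(&eax, &ebx, &ecx, &edx);

            /* the revision is reported in the high half of MSR 0x8B */
            native_rdmsr(MSR_IA32_UCODE_REV, dummy, rev);

            return rev;
    }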

[ Borislav: Massage changelog and coding style. ]

Signed-off-by: Ashok Raj 
Signed-off-by: Borislav Petkov 
Signed-off-by: Thomas Gleixner 
Tested-by: Tom Lendacky 
Tested-by: Ashok Raj 
Cc: Tom Lendacky 
Cc: Asit K Mallick 
Cc: sta...@vger.kernel.org
Cc: Arjan Van De Ven 
Link: http://lkml.kernel.org/r/1519352533-15992-2-git-send-email-ashok@intel.com
Link: https://lkml.kernel.org/r/20180228102846.13447-3...@alien8.de
---
 arch/x86/kernel/cpu/microcode/intel.c | 27 ---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c 
b/arch/x86/kernel/cpu/microcode/intel.c
index 923054a..87bd6dc 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -589,6 +589,17 @@ static int apply_microcode_early(struct ucode_cpu_info 
*uci, bool early)
if (!mc)
return 0;
 
+   /*
+* Save us the MSR write below - which is a particular expensive
+* operation - when the other hyperthread has updated the microcode
+* already.
+*/
+   rev = intel_get_microcode_revision();
+   if (rev >= mc->hdr.rev) {
+   uci->cpu_sig.rev = rev;
+   return UCODE_OK;
+   }
+
/* write microcode via MSR 0x79 */
native_wrmsrl(MSR_IA32_UCODE_WRITE, (unsigned long)mc->bits);
 
@@ -776,7 +787,7 @@ static enum ucode_state apply_microcode_intel(int cpu)
 {
struct microcode_intel *mc;
struct ucode_cpu_info *uci;
-   struct cpuinfo_x86 *c;
+   struct cpuinfo_x86 *c = &cpu_data(cpu);
static int prev_rev;
u32 rev;
 
@@ -793,6 +804,18 @@ static enum ucode_state apply_microcode_intel(int cpu)
return UCODE_NFOUND;
}
 
+   /*
+* Save us the MSR write below - which is a particular expensive
+* operation - when the other hyperthread has updated the microcode
+* already.
+*/
+   rev = intel_get_microcode_revision();
+   if (rev >= mc->hdr.rev) {
+   uci->cpu_sig.rev = rev;
+   c->microcode = rev;
+   return UCODE_OK;
+   }
+
/* write microcode via MSR 0x79 */
wrmsrl(MSR_IA32_UCODE_WRITE, (unsigned long)mc->bits);
 
@@ -813,8 +836,6 @@ static enum ucode_state apply_microcode_intel(int cpu)
prev_rev = rev;
}
 
-   c = &cpu_data(cpu);
-
uci->cpu_sig.rev = rev;
c->microcode = rev;
 
-- 
2.7.4



[4.15 & 4.14 stable 06/12] x86/microcode/intel: Writeback and invalidate caches before updating microcode

2018-04-06 Thread Ashok Raj
commit 91df9fdf51492aec9fed6b4cbd33160886740f47 upstream

Updating microcode is less error prone when caches have been flushed
beforehand, depending on what exactly the microcode is updating. For
example, some of the issues around certain Broadwell parts can be
addressed by doing a full cache flush.

[ Borislav: Massage it and use native_wbinvd() in both cases. ]

Signed-off-by: Ashok Raj 
Signed-off-by: Borislav Petkov 
Signed-off-by: Thomas Gleixner 
Tested-by: Tom Lendacky 
Tested-by: Ashok Raj 
Cc: Arjan Van De Ven 
Cc: Tom Lendacky 
Cc: Asit K Mallick 
Cc: sta...@vger.kernel.org
Link: http://lkml.kernel.org/r/1519352533-15992-3-git-send-email-ashok@intel.com
Link: https://lkml.kernel.org/r/20180228102846.13447-4...@alien8.de
---
 arch/x86/kernel/cpu/microcode/intel.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c 
b/arch/x86/kernel/cpu/microcode/intel.c
index 87bd6dc..e2864bc 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -600,6 +600,12 @@ static int apply_microcode_early(struct ucode_cpu_info 
*uci, bool early)
return UCODE_OK;
}
 
+   /*
+* Writeback and invalidate caches before updating microcode to avoid
+* internal issues depending on what the microcode is updating.
+*/
+   native_wbinvd();
+
/* write microcode via MSR 0x79 */
native_wrmsrl(MSR_IA32_UCODE_WRITE, (unsigned long)mc->bits);
 
@@ -816,6 +822,12 @@ static enum ucode_state apply_microcode_intel(int cpu)
return UCODE_OK;
}
 
+   /*
+* Writeback and invalidate caches before updating microcode to avoid
+* internal issues depending on what the microcode is updating.
+*/
+   native_wbinvd();
+
/* write microcode via MSR 0x79 */
wrmsrl(MSR_IA32_UCODE_WRITE, (unsigned long)mc->bits);
 
-- 
2.7.4



[4.15 & 4.14 stable 10/12] x86/microcode: Synchronize late microcode loading

2018-04-06 Thread Ashok Raj
commit a5321aec6412b20b5ad15db2d6b916c05349dbff upstream

Original idea by Ashok, completely rewritten by Borislav.

Before you read any further: the early loading method is still the
preferred one and you should always do that. The following patch is
improving the late loading mechanism for long running jobs and cloud use
cases.

Gather all cores and serialize the microcode update on them by doing it
one-by-one to make the late update process as reliable as possible and
avoid potential issues caused by the microcode update.

[ Borislav: Rewrite completely. ]

Co-developed-by: Borislav Petkov 
Signed-off-by: Ashok Raj 
Signed-off-by: Borislav Petkov 
Signed-off-by: Thomas Gleixner 
Tested-by: Tom Lendacky 
Tested-by: Ashok Raj 
Reviewed-by: Tom Lendacky 
Cc: Arjan Van De Ven 
Cc: Asit K Mallick 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20180228102846.13447-8...@alien8.de
---
 arch/x86/kernel/cpu/microcode/core.c | 118 +++
 1 file changed, 92 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/core.c 
b/arch/x86/kernel/cpu/microcode/core.c
index 8adbf43..bde629e 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -22,13 +22,16 @@
 #define pr_fmt(fmt) "microcode: " fmt
 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -64,6 +67,11 @@ LIST_HEAD(microcode_cache);
  */
 static DEFINE_MUTEX(microcode_mutex);
 
+/*
+ * Serialize late loading so that CPUs get updated one-by-one.
+ */
+static DEFINE_SPINLOCK(update_lock);
+
 struct ucode_cpu_info  ucode_cpu_info[NR_CPUS];
 
 struct cpu_info_ctx {
@@ -486,6 +494,19 @@ static void __exit microcode_dev_exit(void)
 /* fake device for request_firmware */
 static struct platform_device  *microcode_pdev;
 
+/*
+ * Late loading dance. Why the heavy-handed stomp_machine effort?
+ *
+ * - HT siblings must be idle and not execute other code while the other 
sibling
+ *   is loading microcode in order to avoid any negative interactions caused by
+ *   the loading.
+ *
+ * - In addition, microcode update on the cores must be serialized until this
+ *   requirement can be relaxed in the future. Right now, this is conservative
+ *   and good.
+ */
+#define SPINUNIT 100 /* 100 nsec */
+
 static int check_online_cpus(void)
 {
if (num_online_cpus() == num_present_cpus())
@@ -496,23 +517,85 @@ static int check_online_cpus(void)
return -EINVAL;
 }
 
-static enum ucode_state reload_for_cpu(int cpu)
+static atomic_t late_cpus;
+
+/*
+ * Returns:
+ * < 0 - on error
+ *   0 - no update done
+ *   1 - microcode was updated
+ */
+static int __reload_late(void *info)
 {
-   struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
+   unsigned int timeout = NSEC_PER_SEC;
+   int all_cpus = num_online_cpus();
+   int cpu = smp_processor_id();
+   enum ucode_state err;
+   int ret = 0;
 
-   if (!uci->valid)
-   return UCODE_OK;
+   atomic_dec(&late_cpus);
+
+   /*
+* Wait for all CPUs to arrive. A load will not be attempted unless all
+* CPUs show up.
+* */
+   while (atomic_read(&late_cpus)) {
+   if (timeout < SPINUNIT) {
+   pr_err("Timeout while waiting for CPUs rendezvous, 
remaining: %d\n",
+   atomic_read(&late_cpus));
+   return -1;
+   }
+
+   ndelay(SPINUNIT);
+   timeout -= SPINUNIT;
+
+   touch_nmi_watchdog();
+   }
+
+   spin_lock(&update_lock);
+   apply_microcode_local(&err);
+   spin_unlock(&update_lock);
+
+   if (err > UCODE_NFOUND) {
+   pr_warn("Error reloading microcode on CPU %d\n", cpu);
+   ret = -1;
+   } else if (err == UCODE_UPDATED) {
+   ret = 1;
+   }
 
-   return apply_microcode_on_target(cpu);
+   atomic_inc(&late_cpus);
+
+   while (atomic_read(&late_cpus) != all_cpus)
+   cpu_relax();
+
+   return ret;
+}
+
+/*
+ * Reload microcode late on all CPUs. Wait for a sec until they
+ * all gather together.
+ */
+static int microcode_reload_late(void)
+{
+   int ret;
+
+   atomic_set(&late_cpus, num_online_cpus());
+
+   ret = stop_machine_cpuslocked(__reload_late, NULL, cpu_online_mask);
+   if (ret < 0)
+   return ret;
+   else if (ret > 0)
+   microcode_check();
+
+   return ret;
 }
 
 static ssize_t reload_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t size)
 {
-   int cpu, bsp = boot_cpu_data.cpu_index;
enum ucode_state tmp_ret = UCODE_OK;
-   bool do_callback = false;
+   int bsp = boot_cpu_data.cpu_index;

[4.15 & 4.14 stable 00/12] Series to update microcode loading.

2018-04-06 Thread Ashok Raj
Hi Greg

Here is a series that addresses microcode loading stability issues post
Spectre. All of them are simply cherry-picked, and the patches themselves
carry the upstream commit IDs.

I checked this for Intel platforms and thanks to Boris for checking
on AMD platforms. 

I'm still working on a 4.9 backport and will send it once I get it to
work. The stop_machine differences seem big enough that I might choose a
different approach for the 4.9 backport.

Cheers,
Ashok

Ashok Raj (4):
  x86/microcode/intel: Check microcode revision before updating sibling
threads
  x86/microcode/intel: Writeback and invalidate caches before updating
microcode
  x86/microcode: Do not upload microcode if CPUs are offline
  x86/microcode: Synchronize late microcode loading

Borislav Petkov (8):
  x86/microcode: Propagate return value from updating functions
  x86/CPU: Add a microcode loader callback
  x86/CPU: Check CPU feature bits after microcode upgrade
  x86/microcode: Get rid of struct apply_microcode_ctx
  x86/microcode/intel: Look into the patch cache first
  x86/microcode: Request microcode on the BSP
  x86/microcode: Attempt late loading only when new microcode is present
  x86/microcode: Fix CPU synchronization routine

 arch/x86/include/asm/microcode.h  |  10 +-
 arch/x86/include/asm/processor.h  |   1 +
 arch/x86/kernel/cpu/common.c  |  30 ++
 arch/x86/kernel/cpu/microcode/amd.c   |  44 +
 arch/x86/kernel/cpu/microcode/core.c  | 181 ++
 arch/x86/kernel/cpu/microcode/intel.c |  62 +---
 6 files changed, 252 insertions(+), 76 deletions(-)

-- 
2.7.4



[4.15 & 4.14 stable 12/12] x86/microcode: Fix CPU synchronization routine

2018-04-06 Thread Ashok Raj
From: Borislav Petkov 

commit bb8c13d61a629276a162c1d2b1a20a815cbcfbb7 upstream

Emanuel reported an issue with a hang during microcode update because my
dumb idea to use one atomic synchronization variable for both rendezvous
- before and after update - was simply bollocks:

  microcode: microcode_reload_late: late_cpus: 4
  microcode: __reload_late: cpu 2 entered
  microcode: __reload_late: cpu 1 entered
  microcode: __reload_late: cpu 3 entered
  microcode: __reload_late: cpu 0 entered
  microcode: __reload_late: cpu 1 left
  microcode: Timeout while waiting for CPUs rendezvous, remaining: 1

CPU1 above would finish, leave and the others will still spin waiting for
it to join.

So do two synchronization atomics instead, which makes the code a lot more
straightforward.

Also, since the update is serialized and it also takes quite some time per
microcode engine, increase the exit timeout by the number of CPUs on the
system.

That's ok because the moment all CPUs are done, that timeout will be cut
short.

Furthermore, panic when some of the CPUs timeout when returning from a
microcode update: we can't allow a system with not all cores updated.

Also, as an optimization, do not do the exit sync if microcode wasn't
updated.
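
The result is a classic two-gate barrier; in sketch form (with ncpus
participating CPUs and the timeout/watchdog handling omitted):

    static atomic_t gate_in, gate_out;

    static void two_gate_rendezvous(int ncpus)
    {
            /* gate 1: nobody proceeds until everyone has arrived */
            atomic_inc(&gate_in);
            while (atomic_read(&gate_in) < ncpus)
                    cpu_relax();

            /* ... serialized microcode update work goes here ... */

            /*
             * gate 2: nobody leaves until everyone is done. A single
             * counter cannot serve both gates: a fast CPU could pass
             * the exit check while a slow one still spins at the entry.
             */
            atomic_inc(&gate_out);
            while (atomic_read(&gate_out) < ncpus)
                    cpu_relax();
    }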

Reported-by: Emanuel Czirai 
Signed-off-by: Borislav Petkov 
Signed-off-by: Thomas Gleixner 
Tested-by: Emanuel Czirai 
Tested-by: Ashok Raj 
Tested-by: Tom Lendacky 
Cc: Asit K Mallick 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20180314183615.17629-2...@alien8.de
---
 arch/x86/kernel/cpu/microcode/core.c | 68 ++--
 1 file changed, 41 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/core.c 
b/arch/x86/kernel/cpu/microcode/core.c
index e6d5caa..021c904 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -517,7 +517,29 @@ static int check_online_cpus(void)
return -EINVAL;
 }
 
-static atomic_t late_cpus;
+static atomic_t late_cpus_in;
+static atomic_t late_cpus_out;
+
+static int __wait_for_cpus(atomic_t *t, long long timeout)
+{
+   int all_cpus = num_online_cpus();
+
+   atomic_inc(t);
+
+   while (atomic_read(t) < all_cpus) {
+   if (timeout < SPINUNIT) {
+   pr_err("Timeout while waiting for CPUs rendezvous, 
remaining: %d\n",
+   all_cpus - atomic_read(t));
+   return 1;
+   }
+
+   ndelay(SPINUNIT);
+   timeout -= SPINUNIT;
+
+   touch_nmi_watchdog();
+   }
+   return 0;
+}
 
 /*
  * Returns:
@@ -527,30 +549,16 @@ static atomic_t late_cpus;
  */
 static int __reload_late(void *info)
 {
-   unsigned int timeout = NSEC_PER_SEC;
-   int all_cpus = num_online_cpus();
int cpu = smp_processor_id();
enum ucode_state err;
int ret = 0;
 
-   atomic_dec(&late_cpus);
-
/*
 * Wait for all CPUs to arrive. A load will not be attempted unless all
 * CPUs show up.
 * */
-   while (atomic_read(&late_cpus)) {
-   if (timeout < SPINUNIT) {
-   pr_err("Timeout while waiting for CPUs rendezvous, 
remaining: %d\n",
-   atomic_read(&late_cpus));
-   return -1;
-   }
-
-   ndelay(SPINUNIT);
-   timeout -= SPINUNIT;
-
-   touch_nmi_watchdog();
-   }
+   if (__wait_for_cpus(&late_cpus_in, NSEC_PER_SEC))
+   return -1;
 
spin_lock(&update_lock);
apply_microcode_local(&err);
@@ -558,15 +566,22 @@ static int __reload_late(void *info)
 
if (err > UCODE_NFOUND) {
pr_warn("Error reloading microcode on CPU %d\n", cpu);
-   ret = -1;
-   } else if (err == UCODE_UPDATED) {
+   return -1;
+   /* siblings return UCODE_OK because their engine got updated already */
+   } else if (err == UCODE_UPDATED || err == UCODE_OK) {
ret = 1;
+   } else {
+   return ret;
}
 
-   atomic_inc(&late_cpus);
-
-   while (atomic_read(&late_cpus) != all_cpus)
-   cpu_relax();
+   /*
+* Increase the wait timeout to a safe value here since we're
+* serializing the microcode update and that could take a while on a
+* large number of CPUs. And that is fine as the *actual* timeout will
+* be determined by the last CPU finished updating and thus cut short.
+*/
+   if (__wait_for_cpus(&late_cpus_out, NSEC_PER_SEC * num_online_cpus()))
+   panic("Timeout during microcode update!\n");
 
return ret;
 }
@@ -579,12 +594,11 @@ static int microcode_reload_late(void)
 {
int ret;
 
-   atomic_set(&late_cpus, num_online_cpus());
+   atomic_set(&late_cpus_in, 0);
+   atomic_set(&late_cpus_out, 0);

[4.15 & 4.14 stable 11/12] x86/microcode: Attempt late loading only when new microcode is present

2018-04-06 Thread Ashok Raj
From: Borislav Petkov 

commit 2613f36ed965d0e5a595a1d931fd3b480e82d6fd upstream

Return UCODE_NEW from the scanning functions to denote that new microcode
was found and only then attempt the expensive synchronization dance.

Reported-by: Emanuel Czirai 
Signed-off-by: Borislav Petkov 
Signed-off-by: Thomas Gleixner 
Tested-by: Emanuel Czirai 
Tested-by: Ashok Raj 
Tested-by: Tom Lendacky 
Cc: Tom Lendacky 
Cc: Asit K Mallick 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20180314183615.17629-1...@alien8.de
---
 arch/x86/include/asm/microcode.h  |  1 +
 arch/x86/kernel/cpu/microcode/amd.c   | 34 +-
 arch/x86/kernel/cpu/microcode/core.c  |  8 +++-
 arch/x86/kernel/cpu/microcode/intel.c |  4 +++-
 4 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/microcode.h b/arch/x86/include/asm/microcode.h
index 7fb1047..6cf0e4c 100644
--- a/arch/x86/include/asm/microcode.h
+++ b/arch/x86/include/asm/microcode.h
@@ -39,6 +39,7 @@ struct device;
 
 enum ucode_state {
UCODE_OK= 0,
+   UCODE_NEW,
UCODE_UPDATED,
UCODE_NFOUND,
UCODE_ERROR,
diff --git a/arch/x86/kernel/cpu/microcode/amd.c 
b/arch/x86/kernel/cpu/microcode/amd.c
index a998e1a..4817992 100644
--- a/arch/x86/kernel/cpu/microcode/amd.c
+++ b/arch/x86/kernel/cpu/microcode/amd.c
@@ -339,7 +339,7 @@ int __init save_microcode_in_initrd_amd(unsigned int 
cpuid_1_eax)
return -EINVAL;
 
ret = load_microcode_amd(true, x86_family(cpuid_1_eax), desc.data, 
desc.size);
-   if (ret != UCODE_OK)
+   if (ret > UCODE_UPDATED)
return -EINVAL;
 
return 0;
@@ -683,27 +683,35 @@ static enum ucode_state __load_microcode_amd(u8 family, 
const u8 *data,
 static enum ucode_state
 load_microcode_amd(bool save, u8 family, const u8 *data, size_t size)
 {
+   struct ucode_patch *p;
enum ucode_state ret;
 
/* free old equiv table */
free_equiv_cpu_table();
 
ret = __load_microcode_amd(family, data, size);
-
-   if (ret != UCODE_OK)
+   if (ret != UCODE_OK) {
cleanup();
+   return ret;
+   }
 
-#ifdef CONFIG_X86_32
-   /* save BSP's matching patch for early load */
-   if (save) {
-   struct ucode_patch *p = find_patch(0);
-   if (p) {
-   memset(amd_ucode_patch, 0, PATCH_MAX_SIZE);
-   memcpy(amd_ucode_patch, p->data, min_t(u32, 
ksize(p->data),
-  PATCH_MAX_SIZE));
-   }
+   p = find_patch(0);
+   if (!p) {
+   return ret;
+   } else {
+   if (boot_cpu_data.microcode == p->patch_id)
+   return ret;
+
+   ret = UCODE_NEW;
}
-#endif
+
+   /* save BSP's matching patch for early load */
+   if (!save)
+   return ret;
+
+   memset(amd_ucode_patch, 0, PATCH_MAX_SIZE);
+   memcpy(amd_ucode_patch, p->data, min_t(u32, ksize(p->data), 
PATCH_MAX_SIZE));
+
return ret;
 }
 
diff --git a/arch/x86/kernel/cpu/microcode/core.c 
b/arch/x86/kernel/cpu/microcode/core.c
index bde629e..e6d5caa 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -607,7 +607,7 @@ static ssize_t reload_store(struct device *dev,
return size;
 
tmp_ret = microcode_ops->request_microcode_fw(bsp, &microcode_pdev->dev, true);
-   if (tmp_ret != UCODE_OK)
+   if (tmp_ret != UCODE_NEW)
return size;
 
get_online_cpus();
@@ -691,10 +691,8 @@ static enum ucode_state microcode_init_cpu(int cpu, bool 
refresh_fw)
if (system_state != SYSTEM_RUNNING)
return UCODE_NFOUND;
 
-   ustate = microcode_ops->request_microcode_fw(cpu, &microcode_pdev->dev,
-refresh_fw);
-
-   if (ustate == UCODE_OK) {
+   ustate = microcode_ops->request_microcode_fw(cpu, &microcode_pdev->dev, refresh_fw);
+   if (ustate == UCODE_NEW) {
pr_debug("CPU%d updated upon init\n", cpu);
apply_microcode_on_target(cpu);
}
diff --git a/arch/x86/kernel/cpu/microcode/intel.c 
b/arch/x86/kernel/cpu/microcode/intel.c
index 2aded9d..32b8e57 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -862,6 +862,7 @@ static enum ucode_state generic_load_microcode(int cpu, 
void *data, size_t size,
unsigned int leftover = size;
unsigned int curr_mc_size = 0, new_mc_size = 0;
unsigned int csig, cpf;
+   enum ucode_state ret = UCODE_OK;
 
while (leftover) {
struct microcode_header_intel mc_header;
@@ -903,6 +904,7 @@ static enum ucode_state generic_load_microcode(int cpu, 
void *data, size_t size,
 

[4.15 & 4.14 stable 03/12] x86/CPU: Check CPU feature bits after microcode upgrade

2018-04-06 Thread Ashok Raj
From: Borislav Petkov 

commit 42ca8082e260dcfd8afa2afa6ec1940b9d41724c upstream

With some microcode upgrades, new CPUID features can become visible on
the CPU. Check what the kernel has mirrored now and issue a warning
hinting at possible things the user/admin can do to make use of the
newly visible features.

Originally-by: Ashok Raj 
Tested-by: Ashok Raj 
Signed-off-by: Borislav Petkov 
Reviewed-by: Ashok Raj 
Cc: Andy Lutomirski 
Cc: Arjan van de Ven 
Cc: Borislav Petkov 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Woodhouse 
Cc: Greg Kroah-Hartman 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Tom Lendacky 
Cc: Asit K Mallick 
Cc: sta...@vger.kernel.org
Link: http://lkml.kernel.org/r/20180216112640.11554-4...@alien8.de
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/cpu/common.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 84f1cd8..348cf48 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1757,5 +1757,25 @@ core_initcall(init_cpu_syscore);
  */
 void microcode_check(void)
 {
+   struct cpuinfo_x86 info;
+
perf_check_microcode();
+
+   /* Reload CPUID max function as it might've changed. */
+   info.cpuid_level = cpuid_eax(0);
+
+   /*
+* Copy all capability leafs to pick up the synthetic ones so that
+* memcmp() below doesn't fail on that. The ones coming from CPUID will
+* get overwritten in get_cpu_cap().
+*/
+   memcpy(&info.x86_capability, &boot_cpu_data.x86_capability, 
sizeof(info.x86_capability));
+
+   get_cpu_cap(&info);
+
+   if (!memcmp(&info.x86_capability, &boot_cpu_data.x86_capability, 
sizeof(info.x86_capability)))
+   return;
+
+   pr_warn("x86/CPU: CPU features have changed after loading microcode, 
but might not take effect.\n");
+   pr_warn("x86/CPU: Please consider either early loading through 
initrd/built-in or a potential BIOS update.\n");
 }
-- 
2.7.4



[4.15 & 4.14 stable 01/12] x86/microcode: Propagate return value from updating functions

2018-04-06 Thread Ashok Raj
From: Borislav Petkov 

commit 3f1f576a195aa266813cbd4ca70291deb61e0129 upstream

... so that callers can know when microcode was updated and act
accordingly.
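To illustrate, a caller can now distinguish the outcomes instead of seeing a
bare 0/-1 (a sketch only, kernel context assumed; handle_load_result() is a
hypothetical helper, the enum values are the ones introduced below):

static int handle_load_result(enum ucode_state ret)
{
        switch (ret) {
        case UCODE_UPDATED:     /* a newer revision was applied */
        case UCODE_OK:          /* nothing newer to apply */
                return 0;
        case UCODE_NFOUND:      /* no matching microcode found */
        case UCODE_ERROR:       /* the update itself failed */
        default:
                return -EINVAL;
        }
}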

Tested-by: Ashok Raj 
Signed-off-by: Borislav Petkov 
Reviewed-by: Ashok Raj 
Cc: Andy Lutomirski 
Cc: Arjan van de Ven 
Cc: Borislav Petkov 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Woodhouse 
Cc: Greg Kroah-Hartman 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Tom Lendacky 
Cc: Asit K Mallick 
Cc: sta...@vger.kernel.org
Link: http://lkml.kernel.org/r/20180216112640.11554-2...@alien8.de
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/microcode.h  |  9 +++--
 arch/x86/kernel/cpu/microcode/amd.c   | 10 +-
 arch/x86/kernel/cpu/microcode/core.c  | 33 +
 arch/x86/kernel/cpu/microcode/intel.c | 10 +-
 4 files changed, 34 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/microcode.h b/arch/x86/include/asm/microcode.h
index 55520cec..7fb1047 100644
--- a/arch/x86/include/asm/microcode.h
+++ b/arch/x86/include/asm/microcode.h
@@ -37,7 +37,12 @@ struct cpu_signature {
 
 struct device;
 
-enum ucode_state { UCODE_ERROR, UCODE_OK, UCODE_NFOUND };
+enum ucode_state {
+   UCODE_OK= 0,
+   UCODE_UPDATED,
+   UCODE_NFOUND,
+   UCODE_ERROR,
+};
 
 struct microcode_ops {
enum ucode_state (*request_microcode_user) (int cpu,
@@ -54,7 +59,7 @@ struct microcode_ops {
 * are being called.
 * See also the "Synchronization" section in microcode_core.c.
 */
-   int (*apply_microcode) (int cpu);
+   enum ucode_state (*apply_microcode) (int cpu);
int (*collect_cpu_info) (int cpu, struct cpu_signature *csig);
 };
 
diff --git a/arch/x86/kernel/cpu/microcode/amd.c 
b/arch/x86/kernel/cpu/microcode/amd.c
index 330b846..a998e1a 100644
--- a/arch/x86/kernel/cpu/microcode/amd.c
+++ b/arch/x86/kernel/cpu/microcode/amd.c
@@ -498,7 +498,7 @@ static unsigned int verify_patch_size(u8 family, u32 
patch_size,
return patch_size;
 }
 
-static int apply_microcode_amd(int cpu)
+static enum ucode_state apply_microcode_amd(int cpu)
 {
struct cpuinfo_x86 *c = &cpu_data(cpu);
struct microcode_amd *mc_amd;
@@ -512,7 +512,7 @@ static int apply_microcode_amd(int cpu)
 
p = find_patch(cpu);
if (!p)
-   return 0;
+   return UCODE_NFOUND;
 
mc_amd  = p->data;
uci->mc = p->data;
@@ -523,13 +523,13 @@ static int apply_microcode_amd(int cpu)
if (rev >= mc_amd->hdr.patch_id) {
c->microcode = rev;
uci->cpu_sig.rev = rev;
-   return 0;
+   return UCODE_OK;
}
 
if (__apply_microcode_amd(mc_amd)) {
pr_err("CPU%d: update failed for patch_level=0x%08x\n",
cpu, mc_amd->hdr.patch_id);
-   return -1;
+   return UCODE_ERROR;
}
pr_info("CPU%d: new patch_level=0x%08x\n", cpu,
mc_amd->hdr.patch_id);
@@ -537,7 +537,7 @@ static int apply_microcode_amd(int cpu)
uci->cpu_sig.rev = mc_amd->hdr.patch_id;
c->microcode = mc_amd->hdr.patch_id;
 
-   return 0;
+   return UCODE_UPDATED;
 }
 
 static int install_equiv_cpu_table(const u8 *buf)
diff --git a/arch/x86/kernel/cpu/microcode/core.c 
b/arch/x86/kernel/cpu/microcode/core.c
index e4fc595..7c42326 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -374,7 +374,7 @@ static int collect_cpu_info(int cpu)
 }
 
 struct apply_microcode_ctx {
-   int err;
+   enum ucode_state err;
 };
 
 static void apply_microcode_local(void *arg)
@@ -489,31 +489,29 @@ static void __exit microcode_dev_exit(void)
 /* fake device for request_firmware */
 static struct platform_device  *microcode_pdev;
 
-static int reload_for_cpu(int cpu)
+static enum ucode_state reload_for_cpu(int cpu)
 {
struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
enum ucode_state ustate;
-   int err = 0;
 
if (!uci->valid)
-   return err;
+   return UCODE_OK;
 
ustate = microcode_ops->request_microcode_fw(cpu, &microcode_pdev->dev, true);
-   if (ustate == UCODE_OK)
-   apply_microcode_on_target(cpu);
-   else
-   if (ustate == UCODE_ERROR)
-   err = -EINVAL;
-   return err;
+   if (ustate != UCODE_OK)
+   return ustate;
+
+   return apply_microcode_on_target(cpu);
 }
 
 static ssize_t reload_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t size)
 {
+   enum ucode_state tmp_ret = UCODE_OK;
unsigned long val;
+   ssize_t ret = 0;
int cpu;
-   ssize_t ret = 0, tmp_ret;
 
r

[4.15 & 4.14 stable 02/12] x86/CPU: Add a microcode loader callback

2018-04-06 Thread Ashok Raj
From: Borislav Petkov 

commit 1008c52c09dcb23d93f8e0ea83a6246265d2cce0 upstream

Add a callback function which the microcode loader calls when microcode
has been updated to a newer revision. Do the callback only when no error
was encountered during loading.
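The expected call-site shape, per the locking comment added in common.c
below (a sketch; update_succeeded is a hypothetical flag standing in for
"some CPU returned UCODE_UPDATED"):

        get_online_cpus();
        mutex_lock(&microcode_mutex);
        /* ... perform the late update on all CPUs ... */
        if (update_succeeded)
                microcode_check();
        mutex_unlock(&microcode_mutex);
        put_online_cpus();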

Tested-by: Ashok Raj 
Signed-off-by: Borislav Petkov 
Reviewed-by: Ashok Raj 
Cc: Andy Lutomirski 
Cc: Arjan van de Ven 
Cc: Borislav Petkov 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Woodhouse 
Cc: Greg Kroah-Hartman 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Tom Lendacky 
Cc: Asit K Mallick 
Cc: sta...@vger.kernel.org
Link: http://lkml.kernel.org/r/20180216112640.11554-3...@alien8.de
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/processor.h |  1 +
 arch/x86/kernel/cpu/common.c | 10 ++
 arch/x86/kernel/cpu/microcode/core.c |  8 ++--
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 44c2c4e..a5fc8f8 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -969,4 +969,5 @@ bool xen_set_default_idle(void);
 
 void stop_this_cpu(void *dummy);
 void df_debug(struct pt_regs *regs, long error_code);
+void microcode_check(void);
 #endif /* _ASM_X86_PROCESSOR_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 824aee0..84f1cd8 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1749,3 +1749,13 @@ static int __init init_cpu_syscore(void)
return 0;
 }
 core_initcall(init_cpu_syscore);
+
+/*
+ * The microcode loader calls this upon late microcode load to recheck 
features,
+ * only when microcode has been updated. Caller holds microcode_mutex and CPU
+ * hotplug lock.
+ */
+void microcode_check(void)
+{
+   perf_check_microcode();
+}
diff --git a/arch/x86/kernel/cpu/microcode/core.c 
b/arch/x86/kernel/cpu/microcode/core.c
index 7c42326..b40b56e 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -509,6 +509,7 @@ static ssize_t reload_store(struct device *dev,
const char *buf, size_t size)
 {
enum ucode_state tmp_ret = UCODE_OK;
+   bool do_callback = false;
unsigned long val;
ssize_t ret = 0;
int cpu;
@@ -531,10 +532,13 @@ static ssize_t reload_store(struct device *dev,
if (!ret)
ret = -EINVAL;
}
+
+   if (tmp_ret == UCODE_UPDATED)
+   do_callback = true;
}
 
-   if (!ret && tmp_ret == UCODE_UPDATED)
-   perf_check_microcode();
+   if (!ret && do_callback)
+   microcode_check();
 
mutex_unlock(&microcode_mutex);
put_online_cpus();
-- 
2.7.4



[Patch V2 1/3] x86, mce: Add LMCE definitions.

2015-06-02 Thread Ashok Raj
Add required definitions to support Local Machine Check Exceptions.

See http://www.intel.com/sdm Volume 3, System Programming Guide, Chapter 15
for more information on the MSRs and documentation on Local MCE.
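As a rough illustration of how these definitions fit together (a sketch
only; the real opt-in, including the mandatory FEATURE_CONTROL checks that
avoid a #GP on the MSR write, is in patch 2/3):

        u64 cap, ctl;

        rdmsrl(MSR_IA32_MCG_CAP, cap);
        if (cap & MCG_LMCE_P) {
                /* Hardware can deliver local MCEs; opt in. */
                rdmsrl(MSR_IA32_MCG_EXT_CTL, ctl);
                wrmsrl(MSR_IA32_MCG_EXT_CTL, ctl | MCG_EXT_CTL_LMCE_EN);
        }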

Signed-off-by: Ashok Raj 
---
 arch/x86/include/asm/mce.h| 5 +
 arch/x86/include/uapi/asm/msr-index.h | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 1f5a86d..677a408 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -17,11 +17,16 @@
 #define MCG_EXT_CNT(c) (((c) & MCG_EXT_CNT_MASK) >> MCG_EXT_CNT_SHIFT)
 #define MCG_SER_P  (1ULL<<24)   /* MCA recovery/new status bits */
 #define MCG_ELOG_P (1ULL<<26)   /* Extended error log supported */
+#define MCG_LMCE_P (1ULL<<27)   /* Local machine check supported */
 
 /* MCG_STATUS register defines */
 #define MCG_STATUS_RIPV  (1ULL<<0)   /* restart ip valid */
 #define MCG_STATUS_EIPV  (1ULL<<1)   /* ip points to correct instruction */
 #define MCG_STATUS_MCIP  (1ULL<<2)   /* machine check in progress */
+#define MCG_STATUS_LMCES (1ULL<<3)   /* LMCE signaled */
+
+/* MCG_EXT_CTL register defines */
+#define MCG_EXT_CTL_LMCE_EN (1ULL<<0) /* Enable LMCE */
 
 /* MCi_STATUS register defines */
 #define MCI_STATUS_VAL   (1ULL<<63)  /* valid error */
diff --git a/arch/x86/include/uapi/asm/msr-index.h 
b/arch/x86/include/uapi/asm/msr-index.h
index c469490..32c69d5 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -56,6 +56,7 @@
 #define MSR_IA32_MCG_CAP   0x0179
 #define MSR_IA32_MCG_STATUS0x017a
 #define MSR_IA32_MCG_CTL   0x017b
+#define MSR_IA32_MCG_EXT_CTL   0x04d0
 
 #define MSR_OFFCORE_RSP_0  0x01a6
 #define MSR_OFFCORE_RSP_1  0x01a7
@@ -379,6 +380,7 @@
 #define FEATURE_CONTROL_LOCKED (1<<0)
 #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX   (1<<1)
 #define FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX  (1<<2)
+#define FEATURE_CONTROL_LMCE   (1<<20)
 
 #define MSR_IA32_APICBASE  0x001b
 #define MSR_IA32_APICBASE_BSP  (1<<8)
-- 
1.9.1



[Patch V2 3/3] x86, mce: Handling LMCE events

2015-06-02 Thread Ashok Raj
This patch changes do_machine_check() to handle an MCE signaled as a local
MCE (LMCE). Typically only recoverable (SRAR-type) errors will be signaled
as LMCE, but the architecture does not restrict delivery to only those
errors.

When errors are signaled as LMCE, the MCE handler no longer needs to
rendezvous with the other logical processors, unlike on earlier processors
where machine check errors were broadcast.

See http://www.intel.com/sdm Volume 3, Chapter 15 for more information
on the MSRs and documentation on Local MCE.
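In brief, the handler change amounts to this (a condensed sketch, not the
literal diff below):

        int lmce = !!(m.mcgstatus & MCG_STATUS_LMCES);

        if (!lmce)
                order = mce_start(&no_way_out); /* classic broadcast rendezvous */

        /* ... scan and clear the banks as before ... */

        if (!lmce) {
                if (mce_end(order) < 0)
                        no_way_out = worst >= MCE_PANIC_SEVERITY;
        } else if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
                /* No monarch ran mce_reign() for us; escalate locally. */
                mce_panic("Machine check from unknown source", NULL, NULL);
        }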

Signed-off-by: Ashok Raj 
---
 arch/x86/kernel/cpu/mcheck/mce.c   | 32 ++--
 arch/x86/kernel/cpu/mcheck/mce_intel.c |  1 +
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d10aada..3d71daf 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1047,6 +1047,7 @@ void do_machine_check(struct pt_regs *regs, long 
error_code)
char *msg = "Unknown";
u64 recover_paddr = ~0ull;
int flags = MF_ACTION_REQUIRED;
+   int lmce = 0;
 
prev_state = ist_enter(regs);
 
@@ -1074,11 +1075,20 @@ void do_machine_check(struct pt_regs *regs, long 
error_code)
kill_it = 1;
 
/*
-* Go through all the banks in exclusion of the other CPUs.
-* This way we don't report duplicated events on shared banks
-* because the first one to see it will clear it.
+* Check if this MCE is signaled to only this logical processor
 */
-   order = mce_start(&no_way_out);
+   if (m.mcgstatus & MCG_STATUS_LMCES)
+   lmce = 1;
+   else {
+   /*
+* Go through all the banks in exclusion of the other CPUs.
+* This way we don't report duplicated events on shared banks
+* because the first one to see it will clear it.
+* If this is a Local MCE, then no need to perform rendezvous.
+*/
+   order = mce_start(&no_way_out);
+   }
+
for (i = 0; i < cfg->banks; i++) {
__clear_bit(i, toclear);
if (!test_bit(i, valid_banks))
@@ -1155,8 +1165,18 @@ void do_machine_check(struct pt_regs *regs, long 
error_code)
 * Do most of the synchronization with other CPUs.
 * When there's any problem use only local no_way_out state.
 */
-   if (mce_end(order) < 0)
-   no_way_out = worst >= MCE_PANIC_SEVERITY;
+   if (!lmce) {
+   if (mce_end(order) < 0)
+   no_way_out = worst >= MCE_PANIC_SEVERITY;
+   } else {
+   /*
+* Local MCE skipped calling mce_reign()
+* If we found a fatal error, we need to panic here.
+*/
+if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
+   mce_panic("Machine check from unknown source",
+   NULL, NULL);
+   }
 
/*
 * At insane "tolerant" levels we take no action. Otherwise
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c 
b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index 7d500b6..47b2a2b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -468,4 +468,5 @@ void mce_intel_feature_init(struct cpuinfo_x86 *c)
 {
intel_init_thermal(c);
intel_init_cmci();
+   intel_init_lmce();
 }
-- 
1.9.1



[Patch V2 2/3] x86, mce: Add infrastructure required to support LMCE

2015-06-02 Thread Ashok Raj
Initialization and handling for LMCE:
- boot-time option to disable LMCE for that boot instance
- check for the capability via IA32_MCG_CAP

Incorporates the feedback from Boris on V1.

See http://www.intel.com/sdm Volume 3, System Programming Guide, Chapter 15
for more information on the MSRs and documentation on Local MCE.

Signed-off-by: Ashok Raj 
---
 Documentation/x86/x86_64/boot-options.txt |  3 ++
 arch/x86/include/asm/mce.h|  5 +++
 arch/x86/kernel/cpu/mcheck/mce.c  |  3 ++
 arch/x86/kernel/cpu/mcheck/mce_intel.c| 59 +++
 4 files changed, 70 insertions(+)

diff --git a/Documentation/x86/x86_64/boot-options.txt 
b/Documentation/x86/x86_64/boot-options.txt
index 5223479..79edee0 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -31,6 +31,9 @@ Machine check
(e.g. BIOS or hardware monitoring applications), conflicting
with OS's error handling, and you cannot deactivate the agent,
then this option will be a help.
+   mce=no_lmce
+   Do not opt-in to Local MCE delivery. Use legacy method
+   to broadcast MCE's.
mce=bootlog
Enable logging of machine checks left over from booting.
Disabled by default on AMD because some BIOS leave bogus ones.
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 677a408..8ba4d7a 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -109,6 +109,7 @@ struct mce_log {
 struct mca_config {
bool dont_log_ce;
bool cmci_disabled;
+   bool lmce_disabled;
bool ignore_ce;
bool disabled;
bool ser;
@@ -173,12 +174,16 @@ void cmci_clear(void);
 void cmci_reenable(void);
 void cmci_rediscover(void);
 void cmci_recheck(void);
+void lmce_clear(void);
+void lmce_enable(void);
 #else
 static inline void mce_intel_feature_init(struct cpuinfo_x86 *c) { }
 static inline void cmci_clear(void) {}
 static inline void cmci_reenable(void) {}
 static inline void cmci_rediscover(void) {}
 static inline void cmci_recheck(void) {}
+static inline void lmce_clear(void) {}
+static inline void lmce_enable(void) {}
 #endif
 
 #ifdef CONFIG_X86_MCE_AMD
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index e535533..d10aada 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1976,6 +1976,7 @@ void mce_disable_bank(int bank)
 /*
  * mce=off Disables machine check
  * mce=no_cmci Disables CMCI
+ * mce=no_lmce Disables LMCE
  * mce=dont_log_ce Clears corrected events silently, no log created for CEs.
  * mce=ignore_ce Disables polling and CMCI, corrected events are not cleared.
  * mce=TOLERANCELEVEL[,monarchtimeout] (number, see above)
@@ -1999,6 +2000,8 @@ static int __init mcheck_enable(char *str)
cfg->disabled = true;
else if (!strcmp(str, "no_cmci"))
cfg->cmci_disabled = true;
+   else if (!strcmp(str, "no_lmce"))
+   cfg->lmce_disabled = true;
else if (!strcmp(str, "dont_log_ce"))
cfg->dont_log_ce = true;
else if (!strcmp(str, "ignore_ce"))
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c 
b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index b4a41cf..7d500b6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -91,6 +91,37 @@ static int cmci_supported(int *banks)
return !!(cap & MCG_CMCI_P);
 }
 
+static bool lmce_supported(void)
+{
+   u64 cap, feature_control;
+
+   if (mca_cfg.lmce_disabled)
+   return false;
+
+   rdmsrl(MSR_IA32_MCG_CAP, cap);
+   /*
+* LMCE depends on recovery support in the processor.
+* Hence both MCG_SER_P and MCG_LMCE_P should be present in
+* MCG_CAP
+*/
+   if (!((cap & (MCG_SER_P | MCG_LMCE_P)) == (MCG_SER_P | MCG_LMCE_P)))
+   return false;
+
+   /*
+* BIOS should indicate support for LMCE by setting
+* bit20 in IA32_FEATURE_CONTROL. without which touching
+* MCG_EXT_CTL will generate #GP fault.
+*/
+   rdmsrl(MSR_IA32_FEATURE_CONTROL, feature_control);
+   if (((feature_control & (FEATURE_CONTROL_LOCKED |
+   FEATURE_CONTROL_LMCE)) == (FEATURE_CONTROL_LOCKED |
+   FEATURE_CONTROL_LMCE)))
+   return true;
+   else
+   return false;
+
+}
+
 bool mce_intel_cmci_poll(void)
 {
if (__this_cpu_read(cmci_storm_state) == CMCI_STORM_NONE)
@@ -405,6 +436,34 @@ static void intel_init_cmci(void)
cmci_recheck();
 }
 
+void intel_init_lmce(void)
+{
+   u64 val;
+
+   if (!lmce_supported())
+   return;
+
+   rdmsrl(MSR_IA32_MCG_EXT_CTL, val);
+   val |= MCG_EXT_CTL_LMCE_EN;
+   wrmsrl(MSR_IA32_MCG_EXT_CTL, val);
+}

[Patch V2 0/3] x86, mce: Local Machine Check Exception (LMCE)

2015-06-02 Thread Ashok Raj
Hi Boris

Thanks for the feedback on V1. Almost all of your recommendations are
included in this update.

I haven't had a chance to test on qemu yet, but this patch fixes the MSR
access per your recommendation, so it should be fine. I'm in the process of
making similar changes to kvm/Qemu that I will send once I have learned to
build and test them :-).

Historically, machine checks on Intel x86 processors have been broadcast to
all logical processors in the system. Upcoming CPUs will support an opt-in
mechanism to request that some machine checks be delivered to the single
logical processor experiencing the fault.

For more details see Vol3, Chapter 15, Machine Check Architecture.

Modified to incorporate feedback from Boris on V1 patches.

Ashok Raj (3):
  x86, mce: Add LMCE definitions.
  x86, mce: Add infrastructure required to support LMCE
  x86, mce: Handling LMCE events

 Documentation/x86/x86_64/boot-options.txt |  3 ++
 arch/x86/include/asm/mce.h| 10 ++
 arch/x86/include/uapi/asm/msr-index.h |  2 ++
 arch/x86/kernel/cpu/mcheck/mce.c  | 35 ++
 arch/x86/kernel/cpu/mcheck/mce_intel.c| 60 +++
 5 files changed, 104 insertions(+), 6 deletions(-)

-- 
1.9.1



[Patch V1 1/3] x86, mce: Add LMCE definitions.

2015-05-29 Thread Ashok Raj
Add required definitions to support Local Machine Check Exceptions.

See http://www.intel.com/sdm Volume 3, System Programming Guide, Chapter 15
for more information on the MSRs and documentation on Local MCE.

Signed-off-by: Ashok Raj 
---
 arch/x86/include/asm/mce.h| 5 +
 arch/x86/include/uapi/asm/msr-index.h | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 1f5a86d..677a408 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -17,11 +17,16 @@
 #define MCG_EXT_CNT(c) (((c) & MCG_EXT_CNT_MASK) >> MCG_EXT_CNT_SHIFT)
 #define MCG_SER_P  (1ULL<<24)   /* MCA recovery/new status bits */
 #define MCG_ELOG_P (1ULL<<26)   /* Extended error log supported */
+#define MCG_LMCE_P (1ULL<<27)   /* Local machine check supported */
 
 /* MCG_STATUS register defines */
 #define MCG_STATUS_RIPV  (1ULL<<0)   /* restart ip valid */
 #define MCG_STATUS_EIPV  (1ULL<<1)   /* ip points to correct instruction */
 #define MCG_STATUS_MCIP  (1ULL<<2)   /* machine check in progress */
+#define MCG_STATUS_LMCES (1ULL<<3)   /* LMCE signaled */
+
+/* MCG_EXT_CTL register defines */
+#define MCG_EXT_CTL_LMCE_EN (1ULL<<0) /* Enable LMCE */
 
 /* MCi_STATUS register defines */
 #define MCI_STATUS_VAL   (1ULL<<63)  /* valid error */
diff --git a/arch/x86/include/uapi/asm/msr-index.h 
b/arch/x86/include/uapi/asm/msr-index.h
index c469490..e28d5a2 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -56,6 +56,7 @@
 #define MSR_IA32_MCG_CAP   0x0179
 #define MSR_IA32_MCG_STATUS0x017a
 #define MSR_IA32_MCG_CTL   0x017b
+#define MSR_IA32_MCG_EXT_CTL   0x04d0
 
 #define MSR_OFFCORE_RSP_0  0x01a6
 #define MSR_OFFCORE_RSP_1  0x01a7
@@ -379,6 +380,7 @@
 #define FEATURE_CONTROL_LOCKED (1<<0)
 #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX   (1<<1)
 #define FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX  (1<<2)
+#define FEATURE_CONTROL_LMCE_SUPPORT_ENABLED   (1<<20)
 
 #define MSR_IA32_APICBASE  0x001b
 #define MSR_IA32_APICBASE_BSP  (1<<8)
-- 
1.9.1



[Patch V1 0/3] x86 Local Machine Check Exception (LMCE)

2015-05-29 Thread Ashok Raj
Historically, machine checks on Intel x86 processors have been broadcast to
all logical processors in the system. Upcoming CPUs will support an opt-in
mechanism to request that some machine checks be delivered to the single
logical processor experiencing the fault.

For more details see Vol3, Chapter 15, Machine Check Architecture.

Ashok Raj (3):
  x86, mce: Add LMCE definitions.
  x86, mce: Add infrastructure required to support LMCE
  x86, mce: Handling LMCE events

 Documentation/x86/x86_64/boot-options.txt |  3 ++
 arch/x86/include/asm/mce.h| 10 
 arch/x86/include/uapi/asm/msr-index.h |  2 +
 arch/x86/kernel/cpu/mcheck/mce.c  | 28 ++--
 arch/x86/kernel/cpu/mcheck/mce_intel.c| 76 +++
 5 files changed, 116 insertions(+), 3 deletions(-)

-- 
1.9.1



[Patch V1 3/3] x86, mce: Handling LMCE events

2015-05-29 Thread Ashok Raj
This patch changes do_machine_check() to handle an MCE signaled as a local
MCE (LMCE). Typically only recoverable (SRAR-type) errors will be signaled
as LMCE, but the architecture does not restrict delivery to only those
errors.

When errors are signaled as LMCE, the MCE handler no longer needs to
rendezvous with the other logical processors, unlike on earlier processors
where machine check errors were broadcast.

See http://www.intel.com/sdm Volume 3, Chapter 15 for more information
on the MSRs and documentation on Local MCE.

Signed-off-by: Ashok Raj 
---
 arch/x86/kernel/cpu/mcheck/mce.c   | 25 ++---
 arch/x86/kernel/cpu/mcheck/mce_intel.c |  1 +
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d10aada..c130391 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1047,6 +1047,7 @@ void do_machine_check(struct pt_regs *regs, long 
error_code)
char *msg = "Unknown";
u64 recover_paddr = ~0ull;
int flags = MF_ACTION_REQUIRED;
+   int lmce = 0;
 
prev_state = ist_enter(regs);
 
@@ -1074,11 +1075,19 @@ void do_machine_check(struct pt_regs *regs, long 
error_code)
kill_it = 1;
 
/*
+* Check if this MCE is signaled to only this logical processor
+*/
+   if (m.mcgstatus & MCG_STATUS_LMCES)
+   lmce = 1;
+   /*
 * Go through all the banks in exclusion of the other CPUs.
 * This way we don't report duplicated events on shared banks
 * because the first one to see it will clear it.
+* If this is a Local MCE, then no need to perform rendezvous.
 */
-   order = mce_start(&no_way_out);
+   if (!lmce)
+   order = mce_start(&no_way_out);
+
for (i = 0; i < cfg->banks; i++) {
__clear_bit(i, toclear);
if (!test_bit(i, valid_banks))
@@ -1155,8 +1164,18 @@ void do_machine_check(struct pt_regs *regs, long 
error_code)
 * Do most of the synchronization with other CPUs.
 * When there's any problem use only local no_way_out state.
 */
-   if (mce_end(order) < 0)
-   no_way_out = worst >= MCE_PANIC_SEVERITY;
+   if (!lmce) {
+   if (mce_end(order) < 0)
+   no_way_out = worst >= MCE_PANIC_SEVERITY;
+   } else {
+   /*
+* Local MCE skipped calling mce_reign()
+* If we found a fatal error, we need to panic here.
+*/
+if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
+   mce_panic("Machine check from unknown source",
+   NULL, NULL);
+   }
 
/*
 * At insane "tolerant" levels we take no action. Otherwise
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c 
b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index be3a5c6..73a2844 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -484,4 +484,5 @@ void mce_intel_feature_init(struct cpuinfo_x86 *c)
 {
intel_init_thermal(c);
intel_init_cmci();
+   intel_init_lmce();
 }
-- 
1.9.1



[Patch V1 2/3] x86, mce: Add infrastructure required to support LMCE

2015-05-29 Thread Ashok Raj
Initialization and handling for LMCE:
- boot-time option to disable LMCE for that boot instance
- check for the capability via IA32_MCG_CAP
- provide the ability to enable/disable LMCE on demand

See http://www.intel.com/sdm Volume 3, System Programming Guide, Chapter 15
for more information on the MSRs and documentation on Local MCE.

Signed-off-by: Ashok Raj 
---
 Documentation/x86/x86_64/boot-options.txt |  3 ++
 arch/x86/include/asm/mce.h|  5 +++
 arch/x86/kernel/cpu/mcheck/mce.c  |  3 ++
 arch/x86/kernel/cpu/mcheck/mce_intel.c| 75 +++
 4 files changed, 86 insertions(+)

diff --git a/Documentation/x86/x86_64/boot-options.txt 
b/Documentation/x86/x86_64/boot-options.txt
index 5223479..79edee0 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -31,6 +31,9 @@ Machine check
(e.g. BIOS or hardware monitoring applications), conflicting
with OS's error handling, and you cannot deactivate the agent,
then this option will be a help.
+   mce=no_lmce
+   Do not opt-in to Local MCE delivery. Use legacy method
+   to broadcast MCE's.
mce=bootlog
Enable logging of machine checks left over from booting.
Disabled by default on AMD because some BIOS leave bogus ones.
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 677a408..8ba4d7a 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -109,6 +109,7 @@ struct mce_log {
 struct mca_config {
bool dont_log_ce;
bool cmci_disabled;
+   bool lmce_disabled;
bool ignore_ce;
bool disabled;
bool ser;
@@ -173,12 +174,16 @@ void cmci_clear(void);
 void cmci_reenable(void);
 void cmci_rediscover(void);
 void cmci_recheck(void);
+void lmce_clear(void);
+void lmce_enable(void);
 #else
 static inline void mce_intel_feature_init(struct cpuinfo_x86 *c) { }
 static inline void cmci_clear(void) {}
 static inline void cmci_reenable(void) {}
 static inline void cmci_rediscover(void) {}
 static inline void cmci_recheck(void) {}
+static inline void lmce_clear(void) {}
+static inline void lmce_enable(void) {}
 #endif
 
 #ifdef CONFIG_X86_MCE_AMD
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index e535533..d10aada 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1976,6 +1976,7 @@ void mce_disable_bank(int bank)
 /*
  * mce=off Disables machine check
  * mce=no_cmci Disables CMCI
+ * mce=no_lmce Disables LMCE
  * mce=dont_log_ce Clears corrected events silently, no log created for CEs.
  * mce=ignore_ce Disables polling and CMCI, corrected events are not cleared.
  * mce=TOLERANCELEVEL[,monarchtimeout] (number, see above)
@@ -1999,6 +2000,8 @@ static int __init mcheck_enable(char *str)
cfg->disabled = true;
else if (!strcmp(str, "no_cmci"))
cfg->cmci_disabled = true;
+   else if (!strcmp(str, "no_lmce"))
+   cfg->lmce_disabled = true;
else if (!strcmp(str, "dont_log_ce"))
cfg->dont_log_ce = true;
else if (!strcmp(str, "ignore_ce"))
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c 
b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index b4a41cf..be3a5c6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -70,6 +70,10 @@ enum {
 
 static atomic_t cmci_storm_on_cpus;
 
+#define FEATURE_CONTROL_LMCE_BITS  ((FEATURE_CONTROL_LOCKED) | \
+(FEATURE_CONTROL_LMCE_SUPPORT_ENABLED))
+#define MCG_CAP_LMCE_BITS  ((MCG_SER_P) | (MCG_LMCE_P))
+
 static int cmci_supported(int *banks)
 {
u64 cap;
@@ -91,6 +95,34 @@ static int cmci_supported(int *banks)
return !!(cap & MCG_CMCI_P);
 }
 
+static bool lmce_supported(void)
+{
+   u64 cap, feature_ctl;
+   bool lmce_bios_support, retval;
+
+   if (mca_cfg.lmce_disabled)
+   return false;
+
+   rdmsrl(MSR_IA32_MCG_CAP, cap);
+   rdmsrl(MSR_IA32_FEATURE_CONTROL, feature_ctl);
+
+   /*
+* BIOS should indicate support for LMCE by setting
+* bit20 in IA32_FEATURE_CONTROL. without which touching
+* MCG_EXT_CTL will generate #GP fault.
+*/
+   lmce_bios_support = ((feature_ctl & (FEATURE_CONTROL_LMCE_BITS)) ==
+   (FEATURE_CONTROL_LMCE_BITS));
+
+   /*
+* MCG_CAP should indicate both MCG_SER_P and MCG_LMCE_P
+*/
+   cap = ((cap & MCG_CAP_LMCE_BITS) == (MCG_CAP_LMCE_BITS));
+   retval = (cap && lmce_bios_support);
+
+   return retval;
+}
+
 bool mce_intel_cmci_poll(void)
 {
if (__this_cpu_read(cmci_storm_state) == CMCI_STORM_NONE)
@@ -405,6 +437,49 @@ static void intel_init_cmci(void)
cmci_recheck();
