Re: [ACPI] Re: [RFC 5/6] clean cpu state after hotremove CPU
On Mon, Apr 04, 2005 at 03:46:20PM -0700, Nathan Lynch wrote:
>
> Hi Nigel!
>
> On Tue, Apr 05, 2005 at 08:14:25AM +1000, Nigel Cunningham wrote:
> >
> > On Tue, 2005-04-05 at 01:33, Nathan Lynch wrote:
> > >
> > Yes, exactly. Someone who understands do_exit please help clean
>
> No, that wouldn't work. I am saying that there's little to gain by
> adding all this complexity for destroying the idle tasks when it's
> fairly simple to create num_possible_cpus() - 1 idle tasks* to
> accommodate any additional cpus which may come along. This is what
> ppc64 does now, and it should be feasible on any architecture which
> supports cpu hotplug.
>
> Nathan
>
> * num_possible_cpus() - 1 because the idle task for the boot cpu is
>   created in sched_init.

On ia64 we create idle threads on demand if one is not available for the
same logical cpu number, and they are re-used when the same logical cpu
number is re-used. Just a minor improvement; I also thought about exiting
the idle task, but it wasn't worth anything in return.

Cheers,
ashok
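For readers unfamiliar with the ppc64 approach Nathan describes, here is a
minimal sketch of pre-allocating idle tasks at boot. fork_idle() is the real
helper in kernel/fork.c of this era; the idle_tasks[] array and the initcall
wiring are illustrative assumptions, not the actual ppc64 code:

#include <linux/sched.h>
#include <linux/cpumask.h>
#include <linux/init.h>
#include <linux/err.h>

/* Hypothetical stash of pre-created idle tasks, one per possible cpu. */
static struct task_struct *idle_tasks[NR_CPUS];

static int __init prealloc_idle_tasks(void)
{
        int boot_cpu = first_cpu(cpu_online_map);
        int cpu;

        /*
         * Create num_possible_cpus() - 1 idle tasks: the boot cpu's
         * idle task already exists, it was created in sched_init().
         */
        for_each_cpu_mask(cpu, cpu_possible_map) {
                if (cpu == boot_cpu)
                        continue;
                idle_tasks[cpu] = fork_idle(cpu);
                if (IS_ERR(idle_tasks[cpu]))
                        return PTR_ERR(idle_tasks[cpu]);
        }
        return 0;
}
core_initcall(prealloc_idle_tasks);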
Re: [PATCH] PC300 pci_enable_device fix
On Wed, Apr 13, 2005 at 02:31:43PM -0700, Bjorn Helgaas wrote:
>
> Call pci_enable_device() before looking at IRQ and resources.
> The driver requires this fix or the "pci=routeirq" workaround
> on 2.6.10 and later kernels.

Don't the failure cases need to worry about pci_disable_device(),
e.g. in err_release_ram etc.?

> Reported and tested by Artur Lipowski.
>
> Signed-off-by: Bjorn Helgaas <[EMAIL PROTECTED]>
>
> = drivers/net/wan/pc300_drv.c 1.24 vs edited =
> --- 1.24/drivers/net/wan/pc300_drv.c    2004-12-29 12:25:16 -07:00
> +++ edited/drivers/net/wan/pc300_drv.c  2005-04-13 13:35:21 -06:00
> @@ -3439,6 +3439,9 @@
>  #endif
>      }
>
> +    if ((err = pci_enable_device(pdev)) != 0)
> +        return err;
> +
>      card = (pc300_t *) kmalloc(sizeof(pc300_t), GFP_KERNEL);
>      if (card == NULL) {
>          printk("PC300 found at RAM 0x%08lx, "
> @@ -3526,9 +3529,6 @@
>          err = -ENODEV;
>          goto err_release_ram;
>      }
> -
> -    if ((err = pci_enable_device(pdev)) != 0)
> -        goto err_release_sca;
>
>      card->hw.plxbase = ioremap(card->hw.plxphys, card->hw.plxsize);
>      card->hw.rambase = ioremap(card->hw.ramphys, card->hw.alloc_ramsize);

--
Cheers, Ashok Raj
- Open Source Technology Center
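For reference, the balanced error unwinding Ashok is asking about usually
looks like the sketch below. This is illustrative, not the pc300 code: the
BAR index, label names, and probe name are assumptions, while
pci_enable_device()/pci_disable_device() and ioremap() are the real APIs:

#include <linux/pci.h>
#include <asm/io.h>

static int example_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
        void __iomem *ram;
        int err;

        /* Enable first, then look at IRQ and resources. */
        err = pci_enable_device(pdev);
        if (err)
                return err;

        ram = ioremap(pci_resource_start(pdev, 3), pci_resource_len(pdev, 3));
        if (!ram) {
                err = -ENOMEM;
                goto err_disable;   /* every later failure must undo the enable */
        }

        /* ... rest of probe; later failures unwind through err_disable too ... */
        return 0;

err_disable:
        pci_disable_device(pdev);   /* the cleanup missing from the err_* paths */
        return err;
}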
Extending defconfig for x86_64
Hi Andi

This patch is a trivial one. Provide a different defconfig for x86_64.
Each time, people get bitten by which scsi controller/eth driver to use.
It might be possible to set up configs for other systems as well, if there
are well-known system names, to make it simple for development.

Please consider for next update.

--
Cheers, Ashok Raj
- Open Source Technology Center

This provides a working default config file for Intel systems. Tested on
harwich (4p + ht systems). If more are required, either add to this config
or create new defconfigs as required.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
--
 arch/x86_64/configs/harwich_defconfig | 1185 ++
 1 files changed, 1185 insertions(+)

Index: linux-2.6.13-rc3-mm1/arch/x86_64/configs/harwich_defconfig
===
--- /dev/null
+++ linux-2.6.13-rc3-mm1/arch/x86_64/configs/harwich_defconfig
@@ -0,0 +1,1185 @@
+#
+# Automatically generated make config: don't edit
+# Linux kernel version: 2.6.13-rc3
+# Mon Jul 18 12:18:34 2005
+#
+CONFIG_X86_64=y
+CONFIG_64BIT=y
+CONFIG_X86=y
+CONFIG_MMU=y
+CONFIG_RWSEM_GENERIC_SPINLOCK=y
+CONFIG_GENERIC_CALIBRATE_DELAY=y
+CONFIG_X86_CMPXCHG=y
+CONFIG_EARLY_PRINTK=y
+CONFIG_GENERIC_ISA_DMA=y
+CONFIG_GENERIC_IOMAP=y
+
+#
+# Code maturity level options
+#
+CONFIG_EXPERIMENTAL=y
+CONFIG_CLEAN_COMPILE=y
+CONFIG_LOCK_KERNEL=y
+CONFIG_INIT_ENV_ARG_LIMIT=32
+
+#
+# General setup
+#
+CONFIG_LOCALVERSION=""
+CONFIG_SWAP=y
+CONFIG_SYSVIPC=y
+CONFIG_POSIX_MQUEUE=y
+# CONFIG_BSD_PROCESS_ACCT is not set
+CONFIG_SYSCTL=y
+# CONFIG_AUDIT is not set
+CONFIG_HOTPLUG=y
+CONFIG_KOBJECT_UEVENT=y
+CONFIG_IKCONFIG=y
+CONFIG_IKCONFIG_PROC=y
+# CONFIG_CPUSETS is not set
+# CONFIG_EMBEDDED is not set
+CONFIG_KALLSYMS=y
+CONFIG_KALLSYMS_ALL=y
+# CONFIG_KALLSYMS_EXTRA_PASS is not set
+CONFIG_PRINTK=y
+CONFIG_BUG=y
+CONFIG_BASE_FULL=y
+CONFIG_FUTEX=y
+CONFIG_EPOLL=y
+CONFIG_SHMEM=y
+CONFIG_CC_ALIGN_FUNCTIONS=0
+CONFIG_CC_ALIGN_LABELS=0
+CONFIG_CC_ALIGN_LOOPS=0
+CONFIG_CC_ALIGN_JUMPS=0
+# CONFIG_TINY_SHMEM is not set
+CONFIG_BASE_SMALL=0
+
+#
+# Loadable module support
+#
+CONFIG_MODULES=y
+CONFIG_MODULE_UNLOAD=y
+CONFIG_MODULE_FORCE_UNLOAD=y
+CONFIG_OBSOLETE_MODPARM=y
+# CONFIG_MODVERSIONS is not set
+# CONFIG_MODULE_SRCVERSION_ALL is not set
+# CONFIG_KMOD is not set
+CONFIG_STOP_MACHINE=y
+
+#
+# Processor type and features
+#
+# CONFIG_MK8 is not set
+# CONFIG_MPSC is not set
+CONFIG_GENERIC_CPU=y
+CONFIG_X86_L1_CACHE_BYTES=128
+CONFIG_X86_L1_CACHE_SHIFT=7
+CONFIG_X86_TSC=y
+CONFIG_X86_GOOD_APIC=y
+# CONFIG_MICROCODE is not set
+CONFIG_X86_MSR=y
+CONFIG_X86_CPUID=y
+CONFIG_X86_HT=y
+CONFIG_X86_IO_APIC=y
+CONFIG_X86_LOCAL_APIC=y
+CONFIG_MTRR=y
+CONFIG_SMP=y
+CONFIG_SCHED_SMT=y
+CONFIG_PREEMPT_NONE=y
+# CONFIG_PREEMPT_VOLUNTARY is not set
+# CONFIG_PREEMPT is not set
+CONFIG_PREEMPT_BKL=y
+# CONFIG_K8_NUMA is not set
+# CONFIG_NUMA_EMU is not set
+# CONFIG_NUMA is not set
+CONFIG_ARCH_FLATMEM_ENABLE=y
+CONFIG_SELECT_MEMORY_MODEL=y
+CONFIG_FLATMEM_MANUAL=y
+# CONFIG_DISCONTIGMEM_MANUAL is not set
+# CONFIG_SPARSEMEM_MANUAL is not set
+CONFIG_FLATMEM=y
+CONFIG_FLAT_NODE_MEM_MAP=y
+CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y
+CONFIG_HAVE_DEC_LOCK=y
+CONFIG_NR_CPUS=8
+CONFIG_HOTPLUG_CPU=y
+CONFIG_HPET_TIMER=y
+CONFIG_X86_PM_TIMER=y
+CONFIG_HPET_EMULATE_RTC=y
+CONFIG_GART_IOMMU=y
+CONFIG_SWIOTLB=y
+CONFIG_X86_MCE=y
+CONFIG_X86_MCE_INTEL=y
+CONFIG_PHYSICAL_START=0x10
+# CONFIG_KEXEC is not set
+CONFIG_SECCOMP=y
+# CONFIG_HZ_100 is not set
+CONFIG_HZ_250=y
+# CONFIG_HZ_1000 is not set
+CONFIG_HZ=250
+CONFIG_GENERIC_HARDIRQS=y
+CONFIG_GENERIC_IRQ_PROBE=y
+CONFIG_ISA_DMA_API=y
+
+#
+# Power management options
+#
+CONFIG_PM=y
+# CONFIG_PM_DEBUG is not set
+CONFIG_SOFTWARE_SUSPEND=y
+CONFIG_PM_STD_PARTITION=""
+CONFIG_SUSPEND_SMP=y
+
+#
+# ACPI (Advanced Configuration and Power Interface) Support
+#
+CONFIG_ACPI=y
+CONFIG_ACPI_BOOT=y
+CONFIG_ACPI_INTERPRETER=y
+# CONFIG_ACPI_SLEEP is not set
+CONFIG_ACPI_AC=y
+CONFIG_ACPI_BATTERY=y
+CONFIG_ACPI_BUTTON=y
+# CONFIG_ACPI_VIDEO is not set
+CONFIG_ACPI_HOTKEY=m
+CONFIG_ACPI_FAN=y
+CONFIG_ACPI_PROCESSOR=y
+CONFIG_ACPI_HOTPLUG_CPU=y
+CONFIG_ACPI_THERMAL=y
+# CONFIG_ACPI_ASUS is not set
+# CONFIG_ACPI_IBM is not set
+CONFIG_ACPI_TOSHIBA=y
+CONFIG_ACPI_BLACKLIST_YEAR=2001
+# CONFIG_ACPI_DEBUG is not set
+CONFIG_ACPI_BUS=y
+CONFIG_ACPI_EC=y
+CONFIG_ACPI_POWER=y
+CONFIG_ACPI_PCI=y
+CONFIG_ACPI_SYSTEM=y
+CONFIG_ACPI_CONTAINER=y
+
+#
+# CPU Frequency scaling
+#
+CONFIG_CPU_FREQ=y
+CONFIG_CPU_FREQ_TABLE=y
+# CONFIG_CPU_FREQ_DEBUG is not set
+CONFIG_CPU_FREQ_STAT=y
+# CONFIG_CPU_FREQ_STAT_DETAILS is not set
+CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
+# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
+CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
+# CONFIG_CPU_FREQ_GOV_POWERSAVE is not set
+CONFIG_CPU_FREQ_GOV_USERSPACE=y
+CONFIG_CPU_FREQ_GOV_ONDEMAND=y
+# CONFIG_
2.6.13-rc5-mm1 doesn't boot on x86_64
Folks,

I am getting this on the recent 2.6.13-rc5-mm1 kernel built with defconfig.

Cheers,
Ashok Raj

--- [cut here ] - [please bite here ] -
Kernel BUG at "include/linux/list.h":165
invalid operand: [1] SMP
CPU 2
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.13-rc5-mm1
RIP: 0010:[] {attribute_container_unregist}
RSP: 0018:8100bfb63f00 EFLAGS: 00010283
RAX: 8100bfbd4c58 RBX: 8100bfbd4c00 RCX: 804e6600
RDX: 00200200 RSI: RDI: 804e6600
RBP: R08: 8100bfbd4c48 R09: 0020
R10: R11: 8019baa0 R12: 80100190
R13: R14: 8010 R15: 80627fb0
FS: () GS:80616980() knlGS:
CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: CR3: 00101000 CR4: 06e0
Process swapper (pid: 1, threadinfo 8100bfb62000, task 8100bfb614d0)
Stack: 8032643d 8064499f 80100190
      80651288 8010b249 0246
      00020800 804ae180
Call Trace: {spi_release_transport+13}
           {ahd} {init+505}
           {child_rip+8}
           {init+0} {child_rip+0}
Code: 0f 0b a3 e1 d9 44 80 ff ff ff ff c2 a5 00 49 8b 00 4c 39 40
RIP {attribute_container_unregister+52} RSP
<0>Kernel panic - not syncing: Attempted to kill init!
Re: 2.6.13-rc5-mm1 doesn't boot on x86_64
On Mon, Aug 08, 2005 at 07:11:26PM +0200, Andi Kleen wrote:
> On Mon, Aug 08, 2005 at 09:48:19AM -0700, Ashok Raj wrote:
> > Folks,
> >
> > I am getting this on the recent 2.6.13-rc5-mm1 kernel built with defconfig.
> >
> > Cheers,
> > Ashok Raj
> >
> > --- [cut here ] - [please bite here ] -
> > Kernel BUG at "include/linux/list.h":165
> > invalid operand: [1] SMP
> > CPU 2
> > Modules linked in:
> > Pid: 1, comm: swapper Not tainted 2.6.13-rc5-mm1
> > RIP: 0010:[] {attribute_container_unregist}
> > RSP: 0018:8100bfb63f00 EFLAGS: 00010283
> > RAX: 8100bfbd4c58 RBX: 8100bfbd4c00 RCX: 804e6600
> > RDX: 00200200 RSI: RDI: 804e6600
> > RBP: R08: 8100bfbd4c48 R09: 0020
> > R10: R11: 8019baa0 R12: 80100190
> > R13: R14: 8010 R15: 80627fb0
> > FS: () GS:80616980() knlGS:
> > CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b
> > CR2: CR3: 00101000 CR4: 06e0
> > Process swapper (pid: 1, threadinfo 8100bfb62000, task 8100bfb614d0)
> > Stack: 8032643d 8064499f 80100190
> >       80651288 8010b249 0246
> >       00020800 804ae180
> > Call Trace: {spi_release_transport+13}
> >            {ahd} {init+505}
> >            {child_rip+8}
> >            {init+0} {child_rip+0}
>
> Looks like a SCSI problem. The machine has an Adaptec SCSI adapter, right?

Yep, it's an Adaptec problem.

Actually I don't need AIC7XXX, since my system requires only CONFIG_FUSION.
I turned that option off, and it seems to boot fine now.

Ashok

> -Andi

> > Code: 0f 0b a3 e1 d9 44 80 ff ff ff ff c2 a5 00 49 8b 00 4c 39 40
> > RIP {attribute_container_unregister+52} RSP
> > <0>Kernel panic - not syncing: Attempted to kill init!

--
Cheers, Ashok Raj
- Open Source Technology Center
Re: 2.6.13-rc5-mm1 doesn't boot on x86_64
On Mon, Aug 08, 2005 at 12:33:29PM -0500, James Bottomley wrote:
> On Mon, 2005-08-08 at 19:11 +0200, Andi Kleen wrote:
> > Looks like a SCSI problem. The machine has an Adaptec SCSI adapter, right?
> > The traceback looks pretty meaningless.
>
> What was happening on the machine before this, i.e. was it booting up,
> in which case can we have the prior dmesg file; or was the aic79xxx
> driver being removed?

I can get the trace again, but basically the system was booting. AIC_7XXX
was defined in defconfig, but my system doesn't have that hardware. It
seems the scenario was: the driver tried to probe, found nothing, and then
tried to de-register, resulting in the BUG().

I will try to get the recompile and the entire dmesg log in the meantime.

> James

--
Cheers, Ashok Raj
- Open Source Technology Center
Re: 2.6.13-rc5-mm1 doesn't boot on x86_64
On Mon, Aug 08, 2005 at 07:06:50PM -0500, James Bottomley wrote:
> On Mon, 2005-08-08 at 10:42 -0700, Andrew Morton wrote:
> > -mm has extra list_head debugging goodies. I'd be suspecting a list_head
> > corruption detected somewhere under spi_release_transport().
>
> Aha, looking in the wrong driver ... the problem actually appears to be a
> double release of the transport template in aic79xx. Try this patch

Hi James

Sorry for the delay... This patch works like a charm!

Cheers,
ashok

> James
>
> diff --git a/drivers/scsi/aic7xxx/aic79xx_osm.c b/drivers/scsi/aic7xxx/aic79xx_osm.c
> --- a/drivers/scsi/aic7xxx/aic79xx_osm.c
> +++ b/drivers/scsi/aic7xxx/aic79xx_osm.c
> @@ -2326,8 +2326,6 @@ done:
>      return (retval);
>  }
>
> -static void ahd_linux_exit(void);
> -
>  static void ahd_linux_set_width(struct scsi_target *starget, int width)
>  {
>      struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
> @@ -2772,7 +2770,7 @@ ahd_linux_init(void)
>      if (ahd_linux_detect(&aic79xx_driver_template) > 0)
>          return 0;
>      spi_release_transport(ahd_linux_transport_template);
> -    ahd_linux_exit();
> +
>      return -ENODEV;
>  }
>
> diff --git a/drivers/scsi/aic7xxx/aic7xxx_osm.c b/drivers/scsi/aic7xxx/aic7xxx_osm.c
> --- a/drivers/scsi/aic7xxx/aic7xxx_osm.c
> +++ b/drivers/scsi/aic7xxx/aic7xxx_osm.c
> @@ -2331,8 +2331,6 @@ ahc_platform_dump_card_state(struct ahc_
>  {
>  }
>
> -static void ahc_linux_exit(void);
> -
>  static void ahc_linux_set_width(struct scsi_target *starget, int width)
>  {
>      struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);

--
Cheers, Ashok Raj
- Open Source Technology Center
[patch 5/8] x86_64: Don't do broadcast IPIs when hotplug is enabled in flat mode.
Recent introduction of physflat mode for x86_64 inadvertently deleted the
use of the non-shortcut version of these routines, breaking CPU hotplug.
The option to select this via cmdline was also deleted with the physflat
patch, hence this code is placed directly under CONFIG_HOTPLUG_CPU.

We don't want to use broadcast-mode IPIs when hotplug is enabled. Broadcast
can send an IPI to a cpu that is offline, which can trip when that cpu is
in the process of being kicked alive.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
---
 arch/x86_64/kernel/genapic_flat.c | 8
 1 files changed, 8 insertions(+)

Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_flat.c
===
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/genapic_flat.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_flat.c
@@ -78,8 +78,16 @@ static void flat_send_IPI_mask(cpumask_t

 static void flat_send_IPI_allbutself(int vector)
 {
+#ifndef CONFIG_HOTPLUG_CPU
     if (((num_online_cpus()) - 1) >= 1)
         __send_IPI_shortcut(APIC_DEST_ALLBUT, vector, APIC_DEST_LOGICAL);
+#else
+    cpumask_t allbutme = cpu_online_map;
+    int me = get_cpu(); /* Ensure we are not preempted when we clear */
+    cpu_clear(me, allbutme);
+    flat_send_IPI_mask(allbutme, vector);
+    put_cpu();
+#endif
 }

 static void flat_send_IPI_all(int vector)
--
[patch 3/8] x86_64: Don't call enforce_max_cpus when hotplug is enabled
No need for enforce_max_cpus() when hotplug code is enabled. It nukes
cpu_present_map and cpu_possible_map, making it impossible to add new cpus
in the system.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>

 arch/x86_64/kernel/smpboot.c | 40 +++-
 1 files changed, 23 insertions(+), 17 deletions(-)

Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/smpboot.c
===
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/smpboot.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/smpboot.c
@@ -893,23 +893,6 @@ static __init void disable_smp(void)
     cpu_set(0, cpu_core_map[0]);
 }

-/*
- * Handle user cpus=... parameter.
- */
-static __init void enforce_max_cpus(unsigned max_cpus)
-{
-    int i, k;
-    k = 0;
-    for (i = 0; i < NR_CPUS; i++) {
-        if (!cpu_possible(i))
-            continue;
-        if (++k > max_cpus) {
-            cpu_clear(i, cpu_possible_map);
-            cpu_clear(i, cpu_present_map);
-        }
-    }
-}
-
 #ifdef CONFIG_HOTPLUG_CPU
 /*
  * cpu_possible_map should be static, it cannot change as cpu's
@@ -929,6 +912,29 @@ static void prefill_possible_map(void)
     for (i = 0; i < NR_CPUS; i++)
         cpu_set(i, cpu_possible_map);
 }
+
+/*
+ * Don't need this when we have hotplug enabled
+ */
+#define enforce_max_cpus(x)
+
+#else
+/*
+ * Handle user cpus=... parameter.
+ */
+static __init void enforce_max_cpus(unsigned max_cpus)
+{
+    int i, k;
+    k = 0;
+
+    for_each_cpu(i) {
+        if (++k > max_cpus) {
+            cpu_clear(i, cpu_possible_map);
+            cpu_clear(i, cpu_present_map);
+        }
+    }
+}
+
 #endif

 /*
--
[patch 1/8] x86_64: Reintroduce clustered_apic_check() for x86_64.
The bigsmp auto-selection patch removed this check from a shared common
file in arch/i386/kernel/acpi/boot.c. We still need to call it to
determine the right genapic code for x86_64.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
---
 arch/x86_64/kernel/setup.c | 1 +
 1 files changed, 1 insertion(+)

Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/setup.c
===
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/setup.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/setup.c
@@ -663,6 +663,7 @@ void __init setup_arch(char **cmdline_p)
      * Read APIC and some other early information from ACPI tables.
      */
     acpi_boot_init();
+    clustered_apic_check();
 #endif

 #ifdef CONFIG_X86_LOCAL_APIC
--
[patch 7/8] x86_64: Use common functions in cluster and physflat mode
Newly introduced physflat_* shares way too much with cluster, with only
very few differences. So we introduce some common functions that can be
reused in both cases.

In addition the following are also fixed.
- Use of the non-existent CONFIG_CPU_HOTPLUG option renamed to the actual
  one in use.
- Removed comment that ACPI would provide a way to select this dynamically,
  since CONFIG_ACPI_HOTPLUG_CPU already exists to indicate platform support
  for hotplug via ACPI. In addition, CONFIG_HOTPLUG_CPU only indicates
  logical offline/online, which is even used by the suspend/resume folks,
  where the same no-broadcast support is required.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>

 arch/x86_64/kernel/genapic.c         | 52 +
 arch/x86_64/kernel/genapic_cluster.c | 55 +++
 arch/x86_64/kernel/genapic_flat.c    | 49 +++
 include/asm-x86_64/ipi.h             |  5 +++
 4 files changed, 61 insertions(+), 100 deletions(-)

Index: linux-2.6.13-rc4-mm1/include/asm-x86_64/ipi.h
===
--- linux-2.6.13-rc4-mm1.orig/include/asm-x86_64/ipi.h
+++ linux-2.6.13-rc4-mm1/include/asm-x86_64/ipi.h
@@ -107,4 +107,9 @@ static inline void send_IPI_mask_sequenc
     local_irq_restore(flags);
 }

+extern cpumask_t generic_target_cpus(void);
+extern void generic_send_IPI_mask(cpumask_t mask, int vector);
+extern void generic_send_IPI_allbutself(int vector);
+extern void generic_send_IPI_all(int vector);
+extern unsigned int generic_cpu_mask_to_apicid(cpumask_t cpumask);
 #endif /* __ASM_IPI_H */

Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_flat.c
===
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/genapic_flat.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_flat.c
@@ -134,56 +134,17 @@ struct genapic apic_flat = {
  * overflows, so use physical mode.
  */

-static cpumask_t physflat_target_cpus(void)
-{
-    return cpumask_of_cpu(0);
-}
-
-static void physflat_send_IPI_mask(cpumask_t cpumask, int vector)
-{
-    send_IPI_mask_sequence(cpumask, vector);
-}
-
-static void physflat_send_IPI_allbutself(int vector)
-{
-    cpumask_t allbutme = cpu_online_map;
-    int me = get_cpu();
-    cpu_clear(me, allbutme);
-    physflat_send_IPI_mask(allbutme, vector);
-    put_cpu();
-}
-
-static void physflat_send_IPI_all(int vector)
-{
-    physflat_send_IPI_mask(cpu_online_map, vector);
-}
-
-static unsigned int physflat_cpu_mask_to_apicid(cpumask_t cpumask)
-{
-    int cpu;
-
-    /*
-     * We're using fixed IRQ delivery, can only return one phys APIC ID.
-     * May as well be the first.
-     */
-    cpu = first_cpu(cpumask);
-    if ((unsigned)cpu < NR_CPUS)
-        return x86_cpu_to_apicid[cpu];
-    else
-        return BAD_APICID;
-}
-
 struct genapic apic_physflat = {
     .name = "physical flat",
     .int_delivery_mode = dest_Fixed,
     .int_dest_mode = (APIC_DEST_PHYSICAL != 0),
     .int_delivery_dest = APIC_DEST_PHYSICAL | APIC_DM_FIXED,
-    .target_cpus = physflat_target_cpus,
+    .target_cpus = generic_target_cpus,
     .apic_id_registered = flat_apic_id_registered,
     .init_apic_ldr = flat_init_apic_ldr, /* not needed, but shouldn't hurt */
-    .send_IPI_all = physflat_send_IPI_all,
-    .send_IPI_allbutself = physflat_send_IPI_allbutself,
-    .send_IPI_mask = physflat_send_IPI_mask,
-    .cpu_mask_to_apicid = physflat_cpu_mask_to_apicid,
+    .send_IPI_all = generic_send_IPI_all,
+    .send_IPI_allbutself = generic_send_IPI_allbutself,
+    .send_IPI_mask = generic_send_IPI_mask,
+    .cpu_mask_to_apicid = generic_cpu_mask_to_apicid,
     .phys_pkg_id = phys_pkg_id,
 };

Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_cluster.c
===
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/genapic_cluster.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_cluster.c
@@ -57,56 +57,11 @@ static void cluster_init_apic_ldr(void)
     apic_write_around(APIC_LDR, val);
 }

-/* Start with all IRQs pointing to boot CPU. IRQ balancing will shift them. */
-
-static cpumask_t cluster_target_cpus(void)
-{
-    return cpumask_of_cpu(0);
-}
-
-static void cluster_send_IPI_mask(cpumask_t mask, int vector)
-{
-    send_IPI_mask_sequence(mask, vector);
-}
-
-static void cluster_send_IPI_allbutself(int vector)
-{
-    cpumask_t mask = cpu_online_map;
-    int me = get_cpu(); /* Ensure we are not preempted when we clear */
-
-    cpu_clear(me, mask);
-
-    if (!cpus_empty(mask))
-        cluster_send_IPI_mask(mask, vector);
-
-    put_cpu();
-}
-
-static void cluster_send_IPI_all(int vector)
-{
[patch 8/8] x86_64: Choose physflat for AMD systems only when >8 CPUs.
It is not required to choose physflat mode when CPU hotplug is enabled and
there are <=8 CPUs. Use of genapic_flat with the mask version is capable
of doing the same, instead of doing send_IPI_mask_sequence() where it's a
unicast per cpu. This is another change that Andi introduced with the
physflat mode.

Andi: Do you think this is acceptable?

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
---
 arch/x86_64/kernel/genapic.c | 9 +
 1 files changed, 1 insertion(+), 8 deletions(-)

Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic.c
===
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/genapic.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic.c
@@ -69,15 +69,8 @@ void __init clustered_apic_check(void)
     }

     /* Don't use clustered mode on AMD platforms. */
-    if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
+    if ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) && (num_cpus > 8)) {
         genapic = &apic_physflat;
-        /* In the CPU hotplug case we cannot use broadcast mode
-           because that opens a race when a CPU is removed.
-           Stay at physflat mode in this case.
-           AK */
-#ifdef CONFIG_HOTPLUG_CPU
-        if (num_cpus <= 8)
-            genapic = &apic_flat;
-#endif
         goto print;
     }
--
Re: [patch 1/8] x86_64: Reintroduce clustered_apic_check() for x86_64.
On Mon, Aug 01, 2005 at 01:20:18PM -0700, Ashok Raj wrote:
> The bigsmp auto-selection patch removed this check from a shared common
> file in arch/i386/kernel/acpi/boot.c. We still need to call it to
> determine the right genapic code for x86_64.

Thanks Venki, I missed the check for acpi_lapic and smp_found_config
before the call. Resending the patch.

--
Cheers, Ashok Raj
- Open Source Technology Center

The bigsmp auto-selection patch removed this check from a shared common
file in arch/i386/kernel/acpi/boot.c. We still need to call it to
determine the right genapic code for x86_64.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
---
 arch/x86_64/kernel/setup.c | 2 ++
 1 files changed, 2 insertions(+)

Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/setup.c
===
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/setup.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/setup.c
@@ -663,6 +663,8 @@ void __init setup_arch(char **cmdline_p)
      * Read APIC and some other early information from ACPI tables.
      */
     acpi_boot_init();
+    if (acpi_lapic && smp_found_config)
+        clustered_apic_check();
 #endif

 #ifdef CONFIG_X86_LOCAL_APIC
[patch 6/8] x86_64: Don't use Lowest Priority when using physical mode.
Delivery mode should be APIC_DM_FIXED when using physical mode.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>

 arch/x86_64/kernel/genapic_flat.c | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_flat.c
===
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/genapic_flat.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_flat.c
@@ -175,9 +175,9 @@ static unsigned int physflat_cpu_mask_to
 struct genapic apic_physflat = {
     .name = "physical flat",
-    .int_delivery_mode = dest_LowestPrio,
+    .int_delivery_mode = dest_Fixed,
     .int_dest_mode = (APIC_DEST_PHYSICAL != 0),
-    .int_delivery_dest = APIC_DEST_PHYSICAL | APIC_DM_LOWEST,
+    .int_delivery_dest = APIC_DEST_PHYSICAL | APIC_DM_FIXED,
     .target_cpus = physflat_target_cpus,
     .apic_id_registered = flat_apic_id_registered,
     .init_apic_ldr = flat_init_apic_ldr, /* not needed, but shouldn't hurt */
--
Re: [patch 3/8] x86_64: Don't call enforce_max_cpus when hotplug is enabled
On Thu, Aug 04, 2005 at 12:41:10PM +0200, Andi Kleen wrote:
> On Mon, Aug 01, 2005 at 01:20:20PM -0700, Ashok Raj wrote:
> > No need for enforce_max_cpus() when hotplug code is enabled. It nukes
> > cpu_present_map and cpu_possible_map, making it impossible to add new
> > cpus in the system.
>
> Hmm - I think there was some reason for this early zeroing,
> but I cannot remember it right now.
>
> It might be related to some checks later that check max possible cpus.
>
> So it would be still good to have some way to limit max possible cpus.
> Maybe with a new option?

The only useful thing enforce_max_cpus() does is trim cpu_possible_map, so
that resource allocations which use for_each_cpu() for upfront allocation
won't allocate resources for cpus that can never appear. Currently I see
max_cpus only limiting boot-time startup; no arch other than x86_64 trims
cpu_possible_map. max_cpus is still honored, just only for the initial
boot.

I would think we should remove enforce_max_cpus() altogether, like the
other archs, instead of adding one more option just for x86_64. We should
add an option only if there is a real need, instead of adding it, finding
no one using it, and removing it again very soon.

> -Andi

--
Cheers, Ashok Raj
- Open Source Technology Center
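To illustrate why a trimmed cpu_possible_map matters, here is a hedged
sketch of the upfront allocation pattern being referred to. The buffer
array and sizes are illustrative; for_each_cpu() iterated cpu_possible_map
in kernels of this era:

#include <linux/cpumask.h>
#include <linux/slab.h>
#include <linux/init.h>

static void *percpu_buf[NR_CPUS];

static int __init alloc_percpu_bufs(void)
{
        int cpu;

        /*
         * Walks cpu_possible_map: every cpu left in the map costs a
         * buffer here, whether or not it ever comes online.  Trimming
         * the map (what enforce_max_cpus() did) avoids that cost.
         */
        for_each_cpu(cpu) {
                percpu_buf[cpu] = kmalloc(PAGE_SIZE, GFP_KERNEL);
                if (!percpu_buf[cpu])
                        return -ENOMEM;
        }
        return 0;
}
__initcall(alloc_percpu_bufs);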
Re: [patch 4/8] x86_64: Fix cluster mode send_IPI_allbutself to use get_cpu()/put_cpu()
On Thu, Aug 04, 2005 at 12:43:02PM +0200, Andi Kleen wrote:
> On Mon, Aug 01, 2005 at 01:20:21PM -0700, Ashok Raj wrote:
> > Need to ensure we don't get preempted when we clear ourselves from the
> > mask when using the clustered mode genapic code.
>
> It's not needed I think. If the caller wants to execute code
> on the current CPU then it has to have disabled preemption
> itself already to avoid races. And if not it doesn't care.
>
> One could argue that this function should be always called
> with preemption disabled though. Perhaps better a WARN_ON().

This is only required for smp_call_function(): since we do allbutself by
excluding self, it's the internal function that needs to do this. The
allbutself shortcut takes care of it, since it doesn't matter from which
cpu we write the shortcut; for the mask version, and for cluster mode, I
think it's required to ensure this in the low-level function.

Otherwise every caller of smp_call_function() and send_IPI_allbutself()
would need to do this, which would be lots of changes.

> -Andi

--
Cheers, Ashok Raj
- Open Source Technology Center
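To make the race concrete, here is a sketch of the pattern in question,
modeled on the cluster_send_IPI_allbutself() code visible in patch 7/8 of
this series (only the function name is illustrative):

static void example_send_IPI_allbutself(int vector)
{
        cpumask_t allbutme = cpu_online_map;
        int me = get_cpu();     /* disables preemption until put_cpu() */

        /*
         * Without get_cpu(), the task could be preempted and migrated
         * between reading its cpu id and sending the IPI; "me" would
         * then be stale and we could end up IPI'ing ourselves.
         */
        cpu_clear(me, allbutme);

        if (!cpus_empty(allbutme))
                send_IPI_mask_sequence(allbutme, vector);

        put_cpu();              /* re-enable preemption */
}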
Re: [patch 5/8] x86_64: Don't do broadcast IPIs when hotplug is enabled in flat mode.
On Thu, Aug 04, 2005 at 12:51:07PM +0200, Andi Kleen wrote:
> > static void flat_send_IPI_allbutself(int vector)
> > {
> > +#ifndef CONFIG_HOTPLUG_CPU
> >     if (((num_online_cpus()) - 1) >= 1)
> >         __send_IPI_shortcut(APIC_DEST_ALLBUT, vector, APIC_DEST_LOGICAL);
> > +#else
> > +   cpumask_t allbutme = cpu_online_map;
> > +   int me = get_cpu(); /* Ensure we are not preempted when we clear */
> > +   cpu_clear(me, allbutme);
> > +   flat_send_IPI_mask(allbutme, vector);
> > +   put_cpu();
>
> This still needs the num_online_cpus() check.

Oops, missed that... Thanks for spotting it. I will send an updated one to
Andrew.

--
Cheers, Ashok Raj
- Open Source Technology Center
Re: [patch 5/8] x86_64: Don't do broadcast IPIs when hotplug is enabled in flat mode.
On Thu, Aug 04, 2005 at 12:51:07PM +0200, Andi Kleen wrote:
> > static void flat_send_IPI_allbutself(int vector)
> > {
> > +#ifndef CONFIG_HOTPLUG_CPU
> >     if (((num_online_cpus()) - 1) >= 1)
> >         __send_IPI_shortcut(APIC_DEST_ALLBUT, vector, APIC_DEST_LOGICAL);
> > +#else
> > +   cpumask_t allbutme = cpu_online_map;
> > +   int me = get_cpu(); /* Ensure we are not preempted when we clear */
> > +   cpu_clear(me, allbutme);
> > +   flat_send_IPI_mask(allbutme, vector);
> > +   put_cpu();
>
> This still needs the num_online_cpus() check.
>
> -Andi

Modified patch attached.

Andrew: the filename in your -mm queue is below; please replace it with
the attached patch.

x86_64dont-do-broadcast-ipis-when-hotplug-is-enabled-in-flat-mode.patch

--
Cheers, Ashok Raj
- Open Source Technology Center

Note: Recent introduction of physflat mode for x86_64 inadvertently deleted
the use of the non-shortcut version of these routines, breaking CPU hotplug.
The option to select this via cmdline was also deleted with the physflat
patch, hence this code is placed directly under CONFIG_HOTPLUG_CPU.

We don't want to use broadcast-mode IPIs when hotplug is enabled. Broadcast
can send an IPI to a cpu that is offline, which can trip when that cpu is
in the process of being kicked alive.

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>
---
 arch/x86_64/kernel/genapic_flat.c | 10 ++
 1 files changed, 10 insertions(+)

Index: linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_flat.c
===
--- linux-2.6.13-rc4-mm1.orig/arch/x86_64/kernel/genapic_flat.c
+++ linux-2.6.13-rc4-mm1/arch/x86_64/kernel/genapic_flat.c
@@ -78,8 +78,18 @@ static void flat_send_IPI_mask(cpumask_t

 static void flat_send_IPI_allbutself(int vector)
 {
+#ifndef CONFIG_HOTPLUG_CPU
     if (((num_online_cpus()) - 1) >= 1)
         __send_IPI_shortcut(APIC_DEST_ALLBUT, vector, APIC_DEST_LOGICAL);
+#else
+    cpumask_t allbutme = cpu_online_map;
+    int me = get_cpu(); /* Ensure we are not preempted when we clear */
+    cpu_clear(me, allbutme);
+
+    if (!cpus_empty(allbutme))
+        flat_send_IPI_mask(allbutme, vector);
+    put_cpu();
+#endif
 }

 static void flat_send_IPI_all(int vector)
Re: [patch 1/1] Hot plug CPU to support physical add of new processors (i386)
>         return;
>     }
> +#endif
>     num_processors++;
>     ver = m->mpc_apicver;
>
> diff -puN arch/i386/kernel/smpboot.c~hotcpu-i386 arch/i386/kernel/smpboot.c
> --- linux-2.6.13-rc6-mm2/arch/i386/kernel/smpboot.c~hotcpu-i386 2005-08-31 04:17:20.924024616 -0700
> +++ linux-2.6.13-rc6-mm2-root/arch/i386/kernel/smpboot.c 2005-08-31 04:21:49.474198784 -0700
> @@ -1003,9 +1003,10 @@ int __devinit smp_prepare_cpu(int cpu)
>     struct warm_boot_cpu_info info;
>     struct work_struct task;
>     int apicid, ret;
> +   extern u8 bios_cpu_apicid[NR_CPUS];
>
>     lock_cpu_hotplug();
> -   apicid = x86_cpu_to_apicid[cpu];
> +   apicid = bios_cpu_apicid[cpu];
>     if (apicid == BAD_APICID) {
>         ret = -ENODEV;
>         goto exit;
> diff -puN arch/i386/mach-default/topology.c~hotcpu-i386 arch/i386/mach-default/topology.c
> --- linux-2.6.13-rc6-mm2/arch/i386/mach-default/topology.c~hotcpu-i386 2005-08-31 04:17:20.957019600 -0700
> +++ linux-2.6.13-rc6-mm2-root/arch/i386/mach-default/topology.c 2005-08-31 04:22:13.020619184 -0700
> @@ -76,7 +76,7 @@ static int __init topology_init(void)
>     for_each_online_node(i)
>         arch_register_node(i);
>
> -   for_each_present_cpu(i)
> +   for_each_cpu(i)

Nope. This should still be for_each_present_cpu(); with NR_CPUS large we
would see way too many files in sysfs than are really available.

>         arch_register_cpu(i);
>     return 0;
> }
> @@ -87,7 +87,7 @@ static int __init topology_init(void)
> {
>     int i;
>
> -   for_each_present_cpu(i)
> +   for_each_cpu(i)
>         arch_register_cpu(i);
>     return 0;
> }
> diff -puN kernel/cpu.c~hotcpu-i386 kernel/cpu.c
> --- linux-2.6.13-rc6-mm2/kernel/cpu.c~hotcpu-i386 2005-08-31 04:17:21.002012760 -0700
> +++ linux-2.6.13-rc6-mm2-root/kernel/cpu.c 2005-08-31 04:23:34.378250944 -0700
> @@ -158,7 +158,11 @@ int __devinit cpu_up(unsigned int cpu)
>     if ((ret = down_interruptible(&cpucontrol)) != 0)
>         return ret;
>
> +#ifdef CONFIG_HOTPLUG_CPU
> +   if (cpu_online(cpu)) {
> +#else
>     if (cpu_online(cpu) || !cpu_present(cpu)) {
> +#endif

Ditto.

>         ret = -EINVAL;
>         goto out;
>     }
> diff -puN arch/i386/kernel/irq.c~hotcpu-i386 arch/i386/kernel/irq.c
> --- linux-2.6.13-rc6-mm2/arch/i386/kernel/irq.c~hotcpu-i386 2005-08-31 04:17:21.047005920 -0700
> +++ linux-2.6.13-rc6-mm2-root/arch/i386/kernel/irq.c 2005-08-31 04:25:21.761926144 -0700
> @@ -248,7 +248,7 @@ int show_interrupts(struct seq_file *p,
>
>     if (i == 0) {
>         seq_printf(p, " ");
> -       for_each_cpu(j)
> +       for_each_online_cpu(j)
>             seq_printf(p, "CPU%d ",j);
>         seq_putc(p, '\n');
>     }
> @@ -262,7 +262,7 @@
> #ifndef CONFIG_SMP
>     seq_printf(p, "%10u ", kstat_irqs(i));
> #else
> -   for_each_cpu(j)
> +   for_each_online_cpu(j)
>         seq_printf(p, "%10u ", kstat_cpu(j).irqs[i]);
> #endif
>     seq_printf(p, " %14s", irq_desc[i].handler->typename);
> @@ -276,12 +276,12 @@ skip:
>     spin_unlock_irqrestore(&irq_desc[i].lock, flags);
> } else if (i == NR_IRQS) {
>     seq_printf(p, "NMI: ");
> -   for_each_cpu(j)
> +   for_each_online_cpu(j)
>         seq_printf(p, "%10u ", nmi_count(j));
>     seq_putc(p, '\n');
> #ifdef CONFIG_X86_LOCAL_APIC
>     seq_printf(p, "LOC: ");
> -   for_each_cpu(j)
> +   for_each_online_cpu(j)
>         seq_printf(p, "%10u ", per_cpu(irq_stat,j).apic_timer_irqs);
>     seq_putc(p, '\n');
> _

--
Cheers, Ashok Raj
- Open Source Technology Center
Re: [patch 1/1] Hot plug CPU to support physical add of new processors (i386)
On Thu, Sep 01, 2005 at 10:45:10AM +0200, Andi Kleen wrote:
> Hallo Natalie,
>
> On Wednesday 31 August 2005 14:13, [EMAIL PROTECTED] wrote:
> > Current IA32 CPU hotplug code doesn't allow bringing up processors that
> > were not present in the boot configuration. To make the existing hot
> > plug facility more practical for physical hot plug, possible processors
> > should be accounted for during boot for potential hot
> > add/replace/remove. On ES7000, ACPI marks all the sockets that are
> > empty or not assigned to the partition as "disabled".
>
> Good idea. In fact I always hated the behaviour of the existing
> hotplug code that assumes all possible CPUs can be hotplugged.
> It would be much nicer to be told by the firmware what CPUs
> are hotpluggable. It would be great if all ia32/x86-64 hotplug capable
> BIOSes behaved like yours.

Andi, you are mixing up the software-only ability to offline a cpu with
hardware eject capability.

ACPI indicates the ability to hotplug by the presence of _EJD in the
appropriate scope of the object. So ACPI does have the ability to do
precisely what you mention above, but the entire namespace is not known
upfront, since some of it could be dynamically loaded. Which is why we
need to show all NR_CPUS as hotpluggable.

Possibly we could keep cpu_possible_map at NR_CPUS only when support for
physical CPU hotplug is present, and otherwise keep it cloned from
cpu_present_map. (We don't have a generic PHYSICAL hotplug CONFIG option
today.)

What CONFIG_HOTPLUG_CPU=y indicates is the ability to offline a processor
from the kernel. It DOES NOT indicate physical hotpluggability. So we
don't need any hardware support (apart from arch/kernel support) for this
to work. Support for physical hotplug is indicated via
CONFIG_ACPI_HOTPLUG_CPU.

Be aware that the suspend/resume folks, who use CPU hotplug to offline all
CPUs except the BSP, need just the kernel support to offline. The BIOS has
nothing to do with being able to offline a CPU (preferably called
soft-removal).

Cheers,
ashok
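A hedged sketch of the policy suggested above. CONFIG_PHYSICAL_CPU_HOTPLUG
does not exist (as the mail itself notes) and is used here purely as an
illustration:

#include <linux/cpumask.h>
#include <linux/init.h>

static void __init setup_possible_map(void)
{
#ifdef CONFIG_PHYSICAL_CPU_HOTPLUG      /* hypothetical option */
        int i;

        /* Any socket might be populated later: mark all NR_CPUS possible. */
        for (i = 0; i < NR_CPUS; i++)
                cpu_set(i, cpu_possible_map);
#else
        /* No physical hotplug: only boot-time cpus can ever come online. */
        cpu_possible_map = cpu_present_map;
#endif
}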
Re: [patch 1/1] Hot plug CPU to support physical add of new processors (i386)
On Thu, Sep 01, 2005 at 04:09:09PM -0500, Protasevich, Natalie wrote:
> > > Current IA32 CPU hotplug code doesn't allow bringing up
> > > processors that were not present in the boot configuration.
> > > To make the existing hot plug facility more practical for physical
> > > hot plug, possible processors should be accounted for during boot
> > > for potential hot add/replace/remove. On ES7000, ACPI marks all the
> > > sockets that are empty or not assigned to the partition as
> > > "disabled".
> >
> > This sounds like a kludge to me. The correct implementation
> > would be to have some sysmgmt daemon or something that works with
> > the kernel to notify it of new cpus, populate the apicid, and grow
> > cpu_present_map. Assuming that disabled APIC IDs are valid, for
> > ES7000's sake, is not a safe assumption.
>
> Yes, this is a kludge, I realize that. The AML code was not there so far
> (it will be in the next one). I have a point here though: if the
> processor is there but is unusable (which is what "disabled" means, as
> the ACPI spec says), meaning bad maybe, then with physical hot plug it
> can certainly be made usable, and I think it should be taken into
> consideration (and into the configuration). It should be counted as
> possible at least, with hot plug, because it represents an existing
> socket.

I think marking it as present and considering it in cpu_possible_map is
perfectly OK. But we would need more glue logic: if the firmware marked it
as disabled, one would expect to then run _STA and find that the CPU is
now present and functional as reported by _STA, making the CPU onlinable.

So if _STA can work favorably in your case, you can use it to override the
disabled setting at boot time, which would be perfectly fine.

--
Cheers, Ashok Raj
- Open Source Technology Center
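As a rough illustration of the _STA override being discussed:
acpi_evaluate_integer() is the real ACPI helper (kernels of this era
passed an unsigned long rather than unsigned long long), while the
function name and how its result would be consumed are assumptions:

#include <acpi/acpi_bus.h>

/*
 * Check whether a processor object the firmware marked "disabled" at
 * boot is now present and functioning, per the standard _STA bits.
 */
static int processor_is_usable(acpi_handle handle)
{
        unsigned long long sta;
        acpi_status status;

        status = acpi_evaluate_integer(handle, "_STA", NULL, &sta);
        if (ACPI_FAILURE(status))
                return 0;

        return (sta & ACPI_STA_DEVICE_PRESENT) &&
               (sta & ACPI_STA_DEVICE_FUNCTIONING);
}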
Re: [patch 09/14] x86_64: Don't call enforce_max_cpus when hotplug is enabled
Hi Andi

On Mon, Sep 05, 2005 at 06:48:21AM +0200, Andi Kleen wrote:
> On Sat, Sep 03, 2005 at 02:33:26PM -0700, [EMAIL PROTECTED] wrote:
> >
> > From: Ashok Raj <[EMAIL PROTECTED]>
> >
> > No need for enforce_max_cpus() when hotplug code is enabled. It nukes
> > cpu_present_map and cpu_possible_map, making it impossible to add new
> > cpus in the system.
>
> I see the point, but the implementation is wrong. If anything
> we shouldn't do it for the !HOTPLUG_CPU case either. Why did
> you not do it unconditionally?
>
> I would prefer to keep the special cases for hotplug
> as narrow as possible.

Link to the earlier discussion:

http://marc.theaimsgroup.com/?l=linux-kernel&m=112317327529855&w=2

I had suggested that we remove it completely in our discussion, but I
didn't hear anything from you after that, so I thought that was
acceptable. You had suggested in that discussion that it would be better
to add an option for startup. I am opposed to adding any option when we
know for certain there are no users. Earlier, based on your suggestion, I
added a startup option to choose IPI broadcast mode, which you promptly
removed when you put in the physflat changes. I think it's better not to
add any option without a real need. Do you agree?

Please reply if you want me to remove the !HOTPLUG case, which is my
preference as well; and maybe while the memory is fresh we can stick with
it this time, now that we are on the same page :-(

> -Andi

--
Cheers, Ashok Raj
- Open Source Technology Center
Re: [patch 14/14] x86_64: Choose physflat for AMD systems only when >8 CPUs.
On Tue, Sep 06, 2005 at 01:18:08AM +0200, Andi Kleen wrote:
> On Sat, Sep 03, 2005 at 02:33:30PM -0700, [EMAIL PROTECTED] wrote:
> >
> > From: Ashok Raj <[EMAIL PROTECTED]>
> >
> > It is not required to choose physflat mode when CPU hotplug is enabled
> > and there are <=8 CPUs. Use of genapic_flat with the mask version is
> > capable of doing the same, instead of doing send_IPI_mask_sequence()
> > where it's a unicast per cpu.
>
> I don't get the reasoning of this change. So probably not.

Hmmm... please see below. Nothing has changed since then; any idea why
it's not acceptable now?

http://marc.theaimsgroup.com/?l=linux-kernel&m=112315304423377&w=2

This really doesn't affect me; it just bothers me to pass over inefficient
code. send_IPI_mask_sequence() does unicast IPIs. When the number of CPUs
is <=8, the mask version achieves the same with just one write, so it's a
selective broadcast, which is more efficient.

Based on our earlier exchange I assumed it was clear and apparent, which
is why you "OK"ed the version when it was submitted to -mm. Nothing has
changed; it's the exact same patch.

Hope it's clear now. Entirely up to you... :-(

--
Cheers, Ashok Raj
- Open Source Technology Center
Re: [patch 13/14] x86_64: Use common functions in cluster and physflat mode
On Tue, Sep 06, 2005 at 01:16:28AM +0200, Andi Kleen wrote:
> On Sat, Sep 03, 2005 at 02:33:30PM -0700, [EMAIL PROTECTED] wrote:
> >
> > From: Ashok Raj <[EMAIL PROTECTED]>
> >
> > Newly introduced physflat_* shares way too much with cluster, with only
> > very few differences. So we introduce some common functions that can be
> > reused in both cases.
> >
> > In addition the following are also fixed.
> > - Use of the non-existent CONFIG_CPU_HOTPLUG option renamed to the
> >   actual one in use.
> > - Removed comment that ACPI would provide a way to select this
> >   dynamically, since CONFIG_ACPI_HOTPLUG_CPU already exists to indicate
> >   platform support for hotplug via ACPI. In addition, CONFIG_HOTPLUG_CPU
> >   only indicates logical offline/online, which is even used by the
> >   suspend/resume folks, where the same no-broadcast support is required.
>
> (hmm did I reply to that? I think I did but my mailer seems to have
> lost the r flag. My apologies if it's a duplicate)
>
> I didn't like that one because it makes the code less readable than
> before imho. I did a separate patch for the CPU_HOTPLUG typo.

The code is less readable? Now I am confused. Here is a link to the patch
to refresh your memory:

http://marc.theaimsgroup.com/?l=linux-kernel&m=112293577309653&w=2

diffstat shows ~40 fewer lines of code. physflat basically copied/cloned
some useful code from cluster and some from the flat mode genapic code. I
would have consolidated the code in the first place when you put in the
physflat mode. Again, this is just my habit; I can't step over code bloat
and duplication.

Which part of the code is unreadable to you? If you are happy with just
renamed functions with copied bodies, which is what physflat did, that's
fine. I was just puzzled at the convoluted and less readable part of the
code. If there is something you would like to point out, I would be happy
to fix it... or you can, if you prefer it that way.

--
Cheers, Ashok Raj
- Open Source Technology Center
Re: [patch 09/14] x86_64: Don't call enforce_max_cpus when hotplug is enabled
On Wed, Sep 07, 2005 at 08:49:50AM +0200, Andi Kleen wrote:
> >
> > You had suggested in that discussion that it would be better to add an
> > option for startup. I am opposed to adding any option, when we
> > certainly know
>
> I suggested to auto detect it based on ACPI information. I don't
> think I ever wrote anything about an option.
>
> If that is not possible it's better to always use the sequence mechanism.

Using ACPI or any other method to choose between broadcast and the mask
version of IPI in flat mode for <=8 cpus has no real value. I had posted a
small stat program that showed the mask IPI provides the same performance
numbers. We didn't choose that method only because there is no perf gain,
just code bloat. I don't understand putting in all that complexity without
any real merit.

Moreover, CONFIG_HOTPLUG_CPU does not imply physical CPU hotplug, which I
have tried to convey several times. It is important to understand that
there is not just ONE RIGHT way, and that we consider alternatives for the
right reasons.

> P.S.: Don't bother sending me such "blame game" mails again. I will
> just d them next time because they're a waste of time.

Sorry Andi if you felt that way. I was trying to get some consistent
feedback, and asking that you also consider and weigh in on what we
explain, instead of it being a one-way street. Certainly my intent was not
to blame you, but to explain with clarity so we don't end up reworking
some trivial patches for a long time. If you feel that way, I deeply
apologize, and repeat: that's not my intent.

--
Cheers, Ashok Raj
- Open Source Technology Center
Re: [patch 13/14] x86_64: Use common functions in cluster and physflat mode
On Fri, Sep 09, 2005 at 10:07:28AM -0700, Zwane Mwaikambo wrote:
> On a slightly different topic, how come we're using physflat for hotplug
> cpu?
>
> -#ifndef CONFIG_CPU_HOTPLUG
>     /* In the CPU hotplug case we cannot use broadcast mode
>        because that opens a race when a CPU is removed.
> -      Stay at physflat mode in this case.
> -      It is bad to do this unconditionally though. Once
> -      we have ACPI platform support for CPU hotplug
> -      we should detect hotplug capablity from ACPI tables and
> -      only do this when really needed. -AK */
> +      Stay at physflat mode in this case. - AK */
> +#ifdef CONFIG_HOTPLUG_CPU
>     if (num_cpus <= 8)
>         genapic = &apic_flat;

What you say was true before this patch (although now that you point it
out, I realize the #ifdef CONFIG_HOTPLUG_CPU is not required). I think
Andi is fixing this in his next drop to -mm*.

When physflat was introduced, it also switched to physflat mode for
#cpus <= 8 when hotplug is enabled, since physflat doesn't use shortcuts
and so is also safer (although slower).

http://marc.theaimsgroup.com/?l=linux-kernel&m=112317686712929&w=2

The link above made genapic_flat safe by using flat_send_IPI_mask(), and
hence I switched back to logical flat mode for #cpus <= 8, since that is a
little more efficient than the send_IPI_mask_sequence() used in physflat
mode.

In general we need (see the sketch after this list):

- flat mode when #cpus <= 8 (hotplug defined or not, so we use the mask
  version for safety)
- physflat or cluster mode when #cpus > 8.

Choosing physflat as the default for #cpus <= 8 (with hotplug) would make
IPI performance worse, since it would do one cpu at a time and requires 2
APIC writes per cpu for each IPI, vs. just 2 total for the flat mode mask
version of the API.

--
Cheers, Ashok Raj
- Open Source Technology Center
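To make the cost difference concrete, here is a simplified sketch of the
two paths, modeled on the send_IPI_mask_sequence() helper visible in the
ipi.h hunk of patch 7/8 above. __send_IPI_dest_field() and the single-byte
logical destination encoding are simplified assumptions:

/* physflat-style: one ICR write sequence per target cpu (unicast). */
static void sketch_send_IPI_sequence(cpumask_t mask, int vector)
{
        unsigned long flags;
        int cpu;

        local_irq_save(flags);
        for_each_cpu_mask(cpu, mask)    /* N cpus => N IPI transactions */
                __send_IPI_dest_field(x86_cpu_to_apicid[cpu], vector);
        local_irq_restore(flags);
}

/*
 * logical flat: with <= 8 cpus every target fits in one logical
 * destination byte, so a single write reaches all cpus in the mask.
 */
static void sketch_send_IPI_logical_mask(cpumask_t mask, int vector)
{
        unsigned long ldr_mask = cpus_addr(mask)[0] & 0xFF;

        __send_IPI_dest_field(ldr_mask, vector);  /* one transaction total */
}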
Re: [OOPS] hotplugging cpus via /sys/devices/system/cpu/
On Fri, Sep 09, 2005 at 01:41:58PM -0700, Christopher Beppler wrote:
>
> [1.] One line summary of the problem:
> If I deactivate a CPU with /sys/devices/system/cpu/cpuX and try to
> reactivate it, then the CPU doesn't start and the kernel prints out an
> oops.

Could you try this on 2.6.13-mm2? If this is due to a broadcast-IPI
related issue, that should fix the problem.

I should say I haven't tried i386 in a while, but I suspect some of the
recent suspend/resume code required modifications to the i386 hotplug
code, which might be getting in the way if you try logical cpu hotplug
alone without using it for suspend/resume. Shaohua might know more about
the status.

Cheers,
ashok
Fix irq_affinity write from /proc for IPF
Hi Andrew/Tony

This patch is required for IPF (ia64) to perform deferred writes to RTEs
when affinity is programmed via /proc. These entries can only be
programmed when an interrupt is pending. We will eventually need the same
method for x86 and x86_64 as well; this patch is only for ia64 though
(the others are coming; more testing and changes are needed in my
sandbox).

Could you please queue this up as a next -mm candidate, if it looks
acceptable? Since I am touching a common file for GENERIC_HARDIRQ, it
would be best if it's reviewed in the -mm releases for any potential
conflicts.

[ sorry for the cross post to lia64 ]

--
Cheers, Ashok Raj
- Open Source Technology Center
---
fix_ia64_smp_affinity - Make GENERIC_HARDIRQ work for IPF and CPU Hotplug

Signed-off-by: Ashok Raj <[EMAIL PROTECTED]>

Made the GENERIC_HARDIRQ mechanism work for ia64 and CPU hotplug. When a
write to /proc/irq is handled it is not appropriate to perform set_rte
immediately, since there is a race if the interrupt is asserted while the
re-program is happening. Hence such programming is only safe when we do
the re-program at the time of servicing an interrupt. This got broken
when GENERIC_HARDIRQ was introduced for ia64.

- added CONFIG_PENDING_IRQ so the default /proc/irq write handler can do
  the right thing.

TBD: We currently don't handle the redirectable hint, either in the
display or when we handle writes to /proc/irq/XX/smp_affinity. We need an
arch-specific way to account for the presence of the "r" hint when we
handle the proc write.

---
 release_dir-araj/arch/ia64/kernel/irq.c | 12 ++--
 release_dir-araj/kernel/irq/proc.c      | 10 --
 2 files changed, 18 insertions(+), 4 deletions(-)

diff -puN arch/ia64/kernel/irq.c~fix_ia64_smp_affinity arch/ia64/kernel/irq.c
--- release_dir/arch/ia64/kernel/irq.c~fix_ia64_smp_affinity 2005-03-14 14:35:44.589293491 -0800
+++ release_dir-araj/arch/ia64/kernel/irq.c 2005-03-14 15:27:54.262106715 -0800
@@ -94,12 +94,20 @@ skip:
 /*
  * This is updated when the user sets irq affinity via /proc
  */
-cpumask_t __cacheline_aligned pending_irq_cpumask[NR_IRQS];
+static cpumask_t __cacheline_aligned pending_irq_cpumask[NR_IRQS];
 static unsigned long pending_irq_redir[BITS_TO_LONGS(NR_IRQS)];

-static cpumask_t irq_affinity [NR_IRQS] = { [0 ... NR_IRQS-1] = CPU_MASK_ALL };
 static char irq_redir [NR_IRQS]; // = { [0 ... NR_IRQS-1] = 1 };

+/*
+ * Arch specific routine for deferred write to iosapic rte to reprogram
+ * intr destination.
+ */
+void proc_set_irq_affinity(unsigned int irq, cpumask_t mask_val)
+{
+    pending_irq_cpumask[irq] = mask_val;
+}
+
 void set_irq_affinity_info (unsigned int irq, int hwid, int redir)
 {
     cpumask_t mask = CPU_MASK_NONE;

diff -puN kernel/irq/proc.c~fix_ia64_smp_affinity kernel/irq/proc.c
--- release_dir/kernel/irq/proc.c~fix_ia64_smp_affinity 2005-03-14 14:41:05.475031747 -0800
+++ release_dir-araj/kernel/irq/proc.c 2005-03-14 15:27:59.436911339 -0800
@@ -19,6 +19,13 @@ static struct proc_dir_entry *root_irq_d
  */
 static struct proc_dir_entry *smp_affinity_entry[NR_IRQS];

+void __attribute__((weak))
+proc_set_irq_affinity(unsigned int irq, cpumask_t mask_val)
+{
+    irq_affinity[irq] = mask_val;
+    irq_desc[irq].handler->set_affinity(irq, mask_val);
+}
+
 static int irq_affinity_read_proc(char *page, char **start, off_t off,
                   int count, int *eof, void *data)
 {
@@ -53,8 +60,7 @@ static int irq_affinity_write_proc(struc
     if (cpus_empty(tmp))
         return -EINVAL;

-    irq_affinity[irq] = new_value;
-    irq_desc[irq].handler->set_affinity(irq, new_value);
+    proc_set_irq_affinity(irq, new_value);

     return full_count;
 }
_
Re: Fix irq_affinity write from /proc for IPF
On Mon, Mar 14, 2005 at 03:59:23PM -0800, Andrew Morton wrote:
> Ashok Raj <[EMAIL PROTECTED]> wrote:
> >
>
> "ia64" is preferred, please. Nobody knows what an IPF is.

Right! Sorry about that.

> Is it not possible for ia64's ->set_affinity() handler to do this deferring?

There are other places where we re-program, and it's fine to call the
current version of set_affinity directly there, like when we are doing cpu
offline and trying to force-migrate irqs on ia64. Changing the default
set_affinity() for ia64 would result in many changes; this approach keeps
the same purpose of those access functions and differentiates only the
/proc write case, without changing the meaning of the handler functions
(and it is a smaller patch).

Deferring inside set_affinity() would also further complicate
force-migrating irqs when we consider MSI interrupts, since MSI has its
own set_affinity and we would need to hack into MSI's set_affinity
handler as well.

--
Cheers, Ashok Raj
- Open Source Technology Center
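For completeness, a hedged sketch of the consuming side of the deferred
write: once the interrupt is actually being serviced, rewriting the RTE
cannot race with a new assertion of the same interrupt. The function name
and the exact hook point are illustrative, not the actual ia64 code:

/*
 * Called while servicing irq (so the line is in-service at the
 * iosapic): apply any affinity queued by a /proc write.
 */
static void maybe_move_irq(unsigned int irq)
{
        cpumask_t mask = pending_irq_cpumask[irq];

        if (cpus_empty(mask))
                return;         /* no /proc write queued for this irq */

        /*
         * Safe now: re-targeting the RTE while the interrupt is being
         * serviced cannot race with a new assertion of this interrupt.
         */
        irq_desc[irq].handler->set_affinity(irq, mask);
        cpus_clear(pending_irq_cpumask[irq]);
}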
Re: [PATCH] User Level Interrupts
Hi Michael

Have you thought about how this infrastructure would play with the
existing CPU hotplug code for ia64?

Once you return to user mode via the iret, is it possible that the user
mode thread could get switched out due to a pending cpu quiesce attempt to
remove a cpu? (The current cpu removal code would bring the entire system
to its knees by scheduling a high-priority thread and looping with
interrupts disabled until the target cpu is removed.)

The cpu removal code would also attempt to migrate user processes to
another cpu, retarget interrupts to another existing cpu, etc.

I haven't tested the hotplug code on SGI boxes so far (only on some HP
boxes, by Alex Williamson, and on tiger4 boxes).

Cheers,
ashok

On Wed, Mar 23, 2005 at 08:38:33AM -0800, Michael Raymond wrote:
>
> Allow fast (1+us) user notification of device interrupts. This allows
> more powerful user I/O applications to be written. The process of
> porting to other architectures is straightforward and fully documented.
> More information can be found at http://oss.sgi.com/projects/uli/.
Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling
On Mon, Jun 11, 2007 at 02:14:49PM -0700, Andrew Morton wrote: > > > > Again, if dma_map_{single|sg} API's fails due to > > failure to allocate memory, the only thing that can > > be done is to panic as this is what few of the other > > IOMMU implementation is doing today. > > If the only option is to panic then something's busted. If it's network IO > then there should be a way of dropping the frame. If it's disk IO then we > should report the failure and cause an IO error. Just looking at the code, it appears that quite a few popular drivers (or I should say most) don't even look at the dma_addr_t returned to check for failure. That's going to be another major cleanup effort. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
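A minimal sketch of the kind of check being asked for (the helper name is hypothetical; dma_mapping_error() in current kernels takes the device as its first argument, older kernels took only the address):

#include <linux/dma-mapping.h>
#include <linux/pci.h>

/* Map a buffer and report failure to the caller, so a network driver
 * can drop the frame and a block driver can fail the request with an
 * I/O error instead of panicking. */
static int example_map_buf(struct pci_dev *pdev, void *buf, size_t len,
			   dma_addr_t *handle)
{
	*handle = dma_map_single(&pdev->dev, buf, len, DMA_TO_DEVICE);
	if (dma_mapping_error(&pdev->dev, *handle))
		return -ENOMEM;
	return 0;
}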
Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling
On Tue, Jun 12, 2007 at 12:25:57AM +0200, Andi Kleen wrote: > > > Please advice. > > I think the short term only safe option would be to fully preallocate an > aperture. > If it is too small you can try GFP_ATOMIC but it would be just > a unreliable fallback. For safety you could perhaps have some kernel thread > that tries to enlarge it in the background depending on current > use. That would be not 100% guaranteed to keep up with load, > but would at least keep up if the system is not too busy. > > That is basically what your resource pools do, but they seem > to be unnecessarily convoluted for the task :- after all you > could just preallocate the page tables and rewrite/flush them without > having some kind of allocator inbetween, can't you? Each IOMMU has multiple domains, where each domain represents an address space. PCI Express endpoints can be placed in their own domains for address-protection reasons, and each domain also gets its own tag for the IOTLB cache. Each address space can use either a 3- or 4-level page table, so it would be hard to predict how much to set up ahead of time for each domain/device. It's not a simple single-level table with a small window like the GART case. By just keeping a pool of page-sized pages it is easy to respond and allocate only where really necessary, without having to lock pages down before knowing the real demand. The address space is plentiful, so growing on demand is the best use of the memory available. > If you make the start value large enough (256+MB?) that might reasonably > work. How much memory in page tables would that take? Or perhaps scale > it with available memory or available devices. > > In theory it could also be precomputed from the block/network device queue > lengths etc.; the trouble is just such checks would need to be added to all > kinds of > other odd subsystems that manage devices too. That would be much more work. > > Some investigation how to do sleeping block/network submit would be > also interesting (e.g. replace the spinlocks there with mutexes and see how > much it affects performance). For networking you would need to keep > at least a non sleeping path though because packets can be legally > submitted from interrupt context. If it works out then sleeping > interfaces to the IOMMU code could be added. > > -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
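A minimal sketch of the page-pool idea being discussed (names are hypothetical, not the posted resource-pool code): atomic contexts take pre-allocated pages from a list, with GFP_ATOMIC only as the unreliable last resort, and a background thread can refill the pool when it drops below a low-water mark.

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

struct page_pool {
	spinlock_t lock;
	struct list_head free;		/* pre-allocated pages */
	unsigned int count;
	unsigned int low_water;		/* refill threshold for the thread */
};

static struct page *pool_get(struct page_pool *pool)
{
	struct page *page = NULL;
	unsigned long flags;

	spin_lock_irqsave(&pool->lock, flags);
	if (pool->count) {
		page = list_first_entry(&pool->free, struct page, lru);
		list_del(&page->lru);
		pool->count--;
	}
	spin_unlock_irqrestore(&pool->lock, flags);

	if (!page)	/* last resort in atomic context */
		page = alloc_page(GFP_ATOMIC);
	return page;
}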
[patch 0/8] [Intel IOMMU] Support for Intel Virtualization Technology for Directed I/O
Hi, We are pleased to announce support for using Intel(R) Virtualization Technology for Directed I/O as an IOMMU in Linux. This is a series of patches to support the same. A brief description of the patches follows. 1. Support for the ACPI framework to parse and work with DMA Remapping Tables. 2. Support in the PCI infrastructure to search parent relationships. 3. Hardware support for DMA remapping on Intel chipsets. 4. Support for Zero Length Reads on DMARs not able to support ZLR. 5. Graphics driver workarounds to provide a unity map, since they don't use the DMA API. 6. Updates to the Documentation area for startup options and some basics. 7. Workaround to provide a unity map for the ISA bridge device, to enable the floppy disk. 8. Ability to preserve some mappings for devices not able to address the entire range. Please help review and provide feedback. Cheers, Ashok Raj & Shaohua Li - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 2/8] [Intel IOMMU] Some generic search functions required to lookup device relationships.
PCI support functions for DMAR, to find the parent bridge. When devices are under a p2p bridge, upstream transactions get replaced by the device id of the bridge, as it owns the PCIE transaction. Hence it's necessary to set up translations on behalf of the bridge as well. Due to this limitation, all devices under a p2p bridge share the same domain in a DMAR. We just cache the type of device -- whether it is a native PCIe device or not -- for later use. Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> - Index: linux-2.6.21-rc5/drivers/pci/pci.h === --- linux-2.6.21-rc5.orig/drivers/pci/pci.h 2007-04-03 04:30:44.0 -0700 +++ linux-2.6.21-rc5/drivers/pci/pci.h 2007-04-03 06:58:58.0 -0700 @@ -90,3 +90,4 @@ return NULL; } +struct pci_dev *pci_find_upstream_pcie_bridge(struct pci_dev *pdev); Index: linux-2.6.21-rc5/drivers/pci/probe.c === --- linux-2.6.21-rc5.orig/drivers/pci/probe.c 2007-04-03 04:30:44.0 -0700 +++ linux-2.6.21-rc5/drivers/pci/probe.c 2007-04-03 06:58:58.0 -0700 @@ -822,6 +822,19 @@ kfree(pci_dev); } +static void set_pcie_port_type(struct pci_dev *pdev) +{ + int pos; + u16 reg16; + + pos = pci_find_capability(pdev, PCI_CAP_ID_EXP); + if (!pos) + return; + pdev->is_pcie = 1; + pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, &reg16); + pdev->pcie_type = (reg16 & PCI_EXP_FLAGS_TYPE) >> 4; +} + /** * pci_cfg_space_size - get the configuration space size of the PCI device. * @dev: PCI device @@ -919,6 +932,7 @@ dev->device = (l >> 16) & 0xffff; dev->cfg_size = pci_cfg_space_size(dev); dev->error_state = pci_channel_io_normal; + set_pcie_port_type(dev); /* Assume 32-bit PCI; let 64-bit PCI cards (which are far rarer) set this higher, assuming the system even supports it. */ Index: linux-2.6.21-rc5/drivers/pci/search.c === --- linux-2.6.21-rc5.orig/drivers/pci/search.c 2007-04-03 04:30:44.0 -0700 +++ linux-2.6.21-rc5/drivers/pci/search.c 2007-04-03 06:58:58.0 -0700 @@ -14,6 +14,36 @@ #include "pci.h" DECLARE_RWSEM(pci_bus_sem); +/* + * find the upstream PCIE-to-PCI bridge of a PCI device + * if the device is PCIE, return NULL + * if the device isn't connected to a PCIE bridge (that is its parent is a + * legacy PCI bridge and the bridge is directly connected to bus 0), return its + * parent + */ +struct pci_dev * +pci_find_upstream_pcie_bridge(struct pci_dev *pdev) +{ + struct pci_dev *tmp = NULL; + + if (pdev->is_pcie) + return NULL; + while (1) { + if (!pdev->bus->self) + break; + pdev = pdev->bus->self; + /* a p2p bridge */ + if (!pdev->is_pcie) { + tmp = pdev; + continue; + } + /* PCI device should connect to a PCIE bridge */ + BUG_ON(pdev->pcie_type != PCI_EXP_TYPE_PCI_BRIDGE); + return pdev; + } + + return tmp; +} static struct pci_bus * pci_do_find_bus(struct pci_bus* bus, unsigned char busnr) Index: linux-2.6.21-rc5/include/linux/pci.h === --- linux-2.6.21-rc5.orig/include/linux/pci.h 2007-04-03 04:30:51.0 -0700 +++ linux-2.6.21-rc5/include/linux/pci.h 2007-04-03 06:58:58.0 -0700 @@ -126,6 +126,7 @@ unsigned short subsystem_device; unsigned int class; /* 3 bytes: (base,sub,prog-if) */ u8 hdr_type; /* PCI header type (`multi' flag masked out) */ + u8 pcie_type; /* PCI-E device/port type */ u8 rom_base_reg; /* which config register controls the ROM */ u8 pin; /* which interrupt pin this device uses */ @@ -168,6 +169,7 @@ unsigned int msi_enabled:1; unsigned int msix_enabled:1; unsigned int is_managed:1; + unsigned int is_pcie:1; atomic_t enable_cnt; /* pci_enable_device has been called */ u32 saved_config_space[16]; /* config space saved at suspend time */ -- - To unsubscribe from
this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
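A sketch of how a caller might use the new helper (the wrapper function is hypothetical, not part of the patch): translations have to be set up for whichever device id actually appears on upstream transactions.

static struct pci_dev *dma_request_owner(struct pci_dev *pdev)
{
	struct pci_dev *bridge = pci_find_upstream_pcie_bridge(pdev);

	/* NULL means pdev is native PCIe and owns its own transactions */
	return bridge ? bridge : pdev;
}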
[patch 4/8] [Intel IOMMU] Supporting Zero Length Reads in Intel IOMMU.
PCI specs permit zero length reads (ZLR) even if the mapping for that region is write only. Support for this feature is indicated by the presence of a bit in the DMAR capability. If a particular DMAR does not support this capability we map write-only regions as read-write. This option can also provide a workaround for some drivers that request a write-only mapping when they really should request a read-write. (We ran into one such case in eepro100.c in handling rx_ring_dma) Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> -- drivers/pci/intel-iommu.c |7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) Index: linux-2.6.21-rc5/drivers/pci/intel-iommu.c === --- linux-2.6.21-rc5.orig/drivers/pci/intel-iommu.c 2007-04-09 03:05:25.0 -0700 +++ linux-2.6.21-rc5/drivers/pci/intel-iommu.c 2007-04-09 03:05:32.0 -0700 @@ -84,7 +84,7 @@ struct sys_device sysdev; }; -static int dmar_disabled; +static int dmar_disabled, dmar_force_rw; static char *get_fault_reason(u8 fault_reason) { @@ -102,6 +102,9 @@ if (!strncmp(str, "off", 3)) { dmar_disabled = 1; printk(KERN_INFO"Intel-IOMMU: disabled\n"); + } else if (!strncmp(str, "forcerw", 7)) { + dmar_force_rw = 1; + printk(KERN_INFO"Intel-IOMMU: force R/W for W/O mapping\n"); } str += strcspn(str, ","); while (*str == ',') @@ -1668,7 +1671,12 @@ goto error; } - if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL) + /* +* Check if DMAR supports zero-length reads on write only +* mappings.. +*/ + if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL || \ + !cap_zlr(domain->iommu->cap) || dmar_force_rw) prot |= DMA_PTE_READ; if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL) prot |= DMA_PTE_WRITE; Index: linux-2.6.21-rc5/include/linux/intel-iommu.h === --- linux-2.6.21-rc5.orig/include/linux/intel-iommu.h 2007-04-09 03:05:25.0 -0700 +++ linux-2.6.21-rc5/include/linux/intel-iommu.h 2007-04-09 03:05:32.0 -0700 @@ -79,6 +79,7 @@ #define cap_max_fault_reg_offset(c) \ (cap_fault_reg_offset(c) + cap_num_fault_regs(c) * 16) +#define cap_zlr(c) (((c) >> 22) & 1) #define cap_isoch(c) (((c) >> 23) & 1) #define cap_mgaw(c) ((((c) >> 16) & 0x3f) + 1) #define cap_sagaw(c) (((c) >> 8) & 0x1f) -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 5/8] [Intel IOMMU] Graphics driver workarounds to provide unity map
Most GFX drivers don't call standard PCI DMA APIs to allocate DMA buffers, so such drivers will be broken with the IOMMU enabled. To work around this issue, we added two options. Once graphics devices are converted over to use the DMA APIs, this entire patch can be removed... a. intel_iommu=igfx_off. With this option, a DMAR that has only gfx devices under it will be ignored. This mostly affects integrated gfx devices. If the DMAR is ignored, gfx devices under it will get physical addresses for DMA. b. intel_iommu=gfx_workaround. With this option, we will set up a 1:1 mapping of whole memory for gfx devices, that is, the physical address equals the virtual address. In this way, gfx will use physical addresses for DMA; this is primarily for add-in card GFX devices. Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> Index: linux-2.6.21-rc5/arch/x86_64/kernel/e820.c === --- linux-2.6.21-rc5.orig/arch/x86_64/kernel/e820.c 2007-04-09 03:02:37.0 -0700 +++ linux-2.6.21-rc5/arch/x86_64/kernel/e820.c 2007-04-09 03:05:34.0 -0700 @@ -730,3 +730,22 @@ printk(KERN_INFO "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n", pci_mem_start, gapstart, gapsize); } + +int __init arch_get_ram_range(int slot, u64 *addr, u64 *size) +{ + int i; + + if (slot < 0 || slot >= e820.nr_map) + return -1; + for (i = slot; i < e820.nr_map; i++) { + if(e820.map[i].type != E820_RAM) + continue; + break; + } + if (i == e820.nr_map || e820.map[i].addr > (max_pfn << PAGE_SHIFT)) + return -1; + *addr = e820.map[i].addr; + *size = min_t(u64, e820.map[i].size + e820.map[i].addr, + max_pfn << PAGE_SHIFT) - *addr; + return i + 1; +} Index: linux-2.6.21-rc5/drivers/pci/intel-iommu.c === --- linux-2.6.21-rc5.orig/drivers/pci/intel-iommu.c 2007-04-09 03:05:32.0 -0700 +++ linux-2.6.21-rc5/drivers/pci/intel-iommu.c 2007-04-09 03:05:34.0 -0700 @@ -36,6 +36,7 @@ #include "iova.h" #include "pci.h" +#define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY) #define IOAPIC_RANGE_START (0xfee00000) #define IOAPIC_RANGE_END (0xfeefffff) #define IOAPIC_RANGE_SIZE (IOAPIC_RANGE_END - IOAPIC_RANGE_START + 1) @@ -85,6 +86,7 @@ }; static int dmar_disabled, dmar_force_rw; +static int dmar_map_gfx = 1, dmar_no_gfx_identity_map = 1; static char *get_fault_reason(u8 fault_reason) { @@ -105,7 +107,14 @@ } else if (!strncmp(str, "forcerw", 7)) { dmar_force_rw = 1; printk(KERN_INFO"Intel-IOMMU: force R/W for W/O mapping\n"); + } else if (!strncmp(str, "igfx_off", 8)) { + dmar_map_gfx = 0; + printk(KERN_INFO"Intel-IOMMU: disable GFX device mapping\n"); + } else if (!strncmp(str, "gfx_workaround", 14)) { + dmar_no_gfx_identity_map = 0; + printk(KERN_INFO"Intel-IOMMU: do 1-1 mapping whole physical memory for GFX device\n"); } + str += strcspn(str, ","); while (*str == ',') str++; @@ -1311,6 +1320,7 @@ struct pci_dev *dev; /* it's NULL for PCIE-to-PCI bridge */ struct domain *domain; }; +#define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1)) static DEFINE_SPINLOCK(device_domain_lock); static LIST_HEAD(device_domain_list); @@ -1531,10 +1541,40 @@ static inline int iommu_prepare_rmrr_dev(struct acpi_rmrr_unit *rmrr, struct pci_dev *pdev) { + if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO) + return 0; return iommu_prepare_identity_map(pdev, rmrr->base_address, rmrr->end_address + 1); } +static void iommu_prepare_gfx_mapping(void) +{ + struct pci_dev *pdev = NULL; + u64 base, size; + int slot; + int ret; + + if (dmar_no_gfx_identity_map) + return; + + for_each_pci_dev(pdev) { + if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO
|| + !IS_GFX_DEVICE(pdev)) + continue; + printk(KERN_INFO "IOMMU: gfx device %s 1-1 mapping\n", + pci_name(pdev)); + slot = 0; + while ((slot = arch_get_ram_range(slot, &base, &size)) >= 0) { + ret = iommu_prepare_identity_map(pdev, base, base + size); + if (ret) +
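A sketch of how the arch_get_ram_range() helper added above is meant to be driven (the caller is hypothetical): slot acts as an opaque cursor and a negative return ends the walk, the same iteration pattern used by iommu_prepare_gfx_mapping().

static void example_walk_ram(void)
{
	u64 base, size;
	int slot = 0;

	while ((slot = arch_get_ram_range(slot, &base, &size)) >= 0)
		printk(KERN_INFO "RAM range: 0x%llx - 0x%llx\n",
		       (unsigned long long)base,
		       (unsigned long long)(base + size));
}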
[patch 6/8] [Intel IOMMU] Doc updates for Intel Virtualization Technology for Directed I/O.
Document Intel IOMMU driver boot options. Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> Index: linux-2.6.21-rc5/Documentation/Intel-IOMMU.txt === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.21-rc5/Documentation/Intel-IOMMU.txt 2007-04-09 03:05:36.0 -0700 @@ -0,0 +1,119 @@ +Linux IOMMU Support +=== + +The architecture spec can be obtained from the below location. + +http://www.intel.com/technology/virtualization/ + +This guide gives a quick cheat sheet for some basic understanding. + +Some Keywords + +DMAR - DMA remapping +DRHD - DMA Engine Reporting Structure +RMRR - Reserved memory Region Reporting Structure +ZLR - Zero length reads from PCI devices +IOVA - IO Virtual address. + +Basic stuff +--- + +ACPI enumerates and lists the different DMA engines in the platform, and +device scope relationships between PCI devices and which DMA engine controls +them. + +What is RMRR? +- + +There are some devices the BIOS controls, for example USB devices used to +perform PS2 emulation. The regions of memory used for these devices are marked +reserved in the e820 map. When we turn on DMA translation, DMA to those +regions will fail. Hence the BIOS uses RMRR to specify these regions along with +the devices that need to access them. The OS is expected to set up +unity mappings for these regions so these devices can access them. + +How is IOVA generated? +- + +Well-behaved drivers make pci_map_*() calls before sending a command to a device +that needs to perform DMA. Once DMA is completed and the mapping is no longer +required, the driver performs a pci_unmap_*() call to unmap the region. + +The Intel IOMMU driver allocates a virtual address per domain. Each PCIE +device has its own domain (hence protection). Devices under p2p bridges +share the virtual address with all devices under the p2p bridge due to +transaction id aliasing for p2p bridges. + +IOVA generation is pretty generic. We used the same technique as vmalloc() +but these are not global address spaces, but separate for each domain. +Different DMA engines may support different numbers of domains. + +We also allocate guard pages with each mapping, so we can attempt to catch +any overflow that might happen. + + +Graphics Problems? +-- +If you encounter issues with graphics devices, you can try adding +option intel_iommu=igfx_off to turn off the integrated graphics engine. + +If it happens to be a PCI device included in the INCLUDE_ALL Engine, +then try intel_iommu=gfx_workaround to set up a 1-1 map. We hear +graphics drivers may be in the process of converting to the DMA APIs in the near +future + +Some exceptions to IOVA +--- +Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff). +The same is true for peer to peer transactions. Hence we reserve the +address from PCI MMIO ranges so they are not allocated for IOVA addresses. + + +Fault reporting +--- +When errors are reported, the DMA engine signals via an interrupt. The fault +reason and the device that caused the fault are printed on the console. + +See below for sample. + + +Boot Message Sample +--- + +Something like this gets printed indicating presence of DMAR tables +in ACPI. + +ACPI: DMAR (v001 A M I OEMDMAR 0x0001 MSFT 0x0097) @ 0x7f5b5ef0 + +When DMAR is being processed and initialized by ACPI, prints DMAR locations +and any RMRR's processed. + +ACPI DMAR:Host address width 36 +ACPI DMAR:DRHD (flags: 0x0000)base: 0xfed90000 +ACPI DMAR:DRHD (flags: 0x0000)base: 0xfed91000 +ACPI DMAR:DRHD (flags: 0x0001)base: 0xfed93000 +ACPI DMAR:RMRR base: 0x000ed000 end: 0x000effff +ACPI DMAR:RMRR base: 0x7f600000 end: 0x7fffffff + +When DMAR is enabled for use, you will notice.. + +PCI-DMA: Using DMAR IOMMU + +Fault reporting +--- + +DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 +DMAR:[fault reason 05] PTE Write access is not set +DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 +DMAR:[fault reason 05] PTE Write access is not set + +TBD + + +- No Performance tuning / analysis yet. +- sysfs needs useful data to be populated. + DMAR info, device scope, stats could be exposed to some extent. +- Add support to Firmware Developer Kit to test ACPI tables for DMAR. +- For compatibility testing, could use unity map domain for all devices, just + provide a 1-1 for all useful memory under a single domain for all devices. +- API for paravirt ops for abstracting functionality for VMM folks. Index: linux-2.6.21-rc5/Documentation/kernel-parameters.txt === --- linux-2.6.21-rc5.orig/Documentation/kernel-parameters.txt 2007-04-09 03:02:37.0 -070
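To make the "well behaved driver" pattern described in the document concrete, a minimal sketch (pdev, buf and len are placeholders):

static void example_dma(struct pci_dev *pdev, void *buf, size_t len)
{
	dma_addr_t handle;

	/* map before asking the device to DMA */
	handle = pci_map_single(pdev, buf, len, PCI_DMA_TODEVICE);

	/* ... program the device with "handle", wait for completion ... */

	/* unmap once the DMA is done and the mapping is no longer needed */
	pci_unmap_single(pdev, handle, len, PCI_DMA_TODEVICE);
}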
[patch 8/8] [Intel IOMMU] Preserve some Virtual Address when devices cannot address entire range.
Some devices may not support the entire 64-bit DMA range. In a situation where such devices are co-located in a shared domain, we need to ensure there is some address space reserved for them, without the low addresses getting depleted by other devices capable of handling high dma addresses. Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> Index: linux-2.6.21-rc5/Documentation/kernel-parameters.txt === --- linux-2.6.21-rc5.orig/Documentation/kernel-parameters.txt 2007-04-09 03:05:38.0 -0700 +++ linux-2.6.21-rc5/Documentation/kernel-parameters.txt 2007-04-09 03:05:41.0 -0700 @@ -735,6 +735,11 @@ first 16M. The floppy disk could be modified to use the DMA api's but that's a lot of pain for very small gain. This option is turned on by default. + preserve_{1g/2g/4g/512m/256m/16m} + If a device is sharing a domain with other devices + and the device mask doesn't cover the 64bit range, + use this option to let the iommu code preserve some + virtual addresses for such devices. io7=[HW] IO7 for Marvel based alpha systems See comment before marvel_specify_io7 in arch/alpha/kernel/core_marvel.c. Index: linux-2.6.21-rc5/drivers/pci/intel-iommu.c === --- linux-2.6.21-rc5.orig/drivers/pci/intel-iommu.c 2007-04-09 03:05:38.0 -0700 +++ linux-2.6.21-rc5/drivers/pci/intel-iommu.c 2007-04-09 03:06:17.0 -0700 @@ -90,6 +90,7 @@ static int dmar_disabled, dmar_force_rw; static int dmar_map_gfx = 1, dmar_no_gfx_identity_map = 1; static int dmar_fix_isa = 1; +static u64 dmar_preserve_iova_mask; static char *get_fault_reason(u8 fault_reason) { @@ -119,6 +120,28 @@ } else if (!strncmp(str, "noisamap", 8)) { dmar_fix_isa = 0; printk (KERN_INFO"Intel-IOMMU: Turning off 16M unity map for LPC\n"); + } else if (!strncmp(str, "preserve_", 9)) { + if (!strncmp(str + 9, "4g", 2) || + !strncmp(str + 9, "4G", 2)) + dmar_preserve_iova_mask = DMA_32BIT_MASK; + else if (!strncmp(str + 9, "2g", 2) || + !strncmp(str + 9, "2G", 2)) + dmar_preserve_iova_mask = DMA_31BIT_MASK; + else if (!strncmp(str + 9, "1g", 2) || +!strncmp(str + 9, "1G", 2)) + dmar_preserve_iova_mask = DMA_30BIT_MASK; + else if (!strncmp(str + 9, "512m", 4) || +!strncmp(str + 9, "512M", 4)) + dmar_preserve_iova_mask = DMA_29BIT_MASK; + else if (!strncmp(str + 9, "256m", 4) || +!strncmp(str + 9, "256M", 4)) + dmar_preserve_iova_mask = DMA_28BIT_MASK; + else if (!strncmp(str + 9, "16m", 3) || +!strncmp(str + 9, "16M", 3)) + dmar_preserve_iova_mask = DMA_24BIT_MASK; + printk(KERN_INFO + "DMAR: Preserved IOVA mask 0x%Lx for devices " + "sharing domain\n", dmar_preserve_iova_mask); } str += strcspn(str, ","); @@ -1726,9 +1749,10 @@ * leave rooms for other devices */ if ((domain->flags & DOMAIN_FLAG_MULTIPLE_DEVICES) && - pdev->dma_mask > DMA_32BIT_MASK) + dmar_preserve_iova_mask && + pdev->dma_mask > dmar_preserve_iova_mask) iova = alloc_iova(domain, addr, size, - DMA_32BIT_MASK + 1, pdev->dma_mask); + dmar_preserve_iova_mask + 1, pdev->dma_mask); else iova = alloc_iova(domain, addr, size, IOVA_START_ADDR, pdev->dma_mask); -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
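As an example (hypothetical command line), a machine where a device limited to 32-bit DMA shares a domain with 64-bit-capable devices could reserve everything below 4G for it with:

	intel_iommu=preserve_4g

The parser above splits options on commas, so this combines with the other flags, e.g. intel_iommu=noisamap,preserve_4g.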
[patch 1/8] [Intel IOMMU] ACPI support for Intel Virtualization Technology for Directed I/O
This patch contains basic ACPI parsing and enumeration support. Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> Index: linux-2.6.21-rc5/arch/x86_64/Kconfig === --- linux-2.6.21-rc5.orig/arch/x86_64/Kconfig 2007-04-03 04:30:40.0 -0700 +++ linux-2.6.21-rc5/arch/x86_64/Kconfig2007-04-03 06:34:17.0 -0700 @@ -687,6 +687,14 @@ bool "Support mmconfig PCI config space access" depends on PCI && ACPI +config DMAR + bool "Support for DMA Remapping Devices (EXPERIMENTAL)" + depends on PCI_MSI && ACPI && EXPERIMENTAL + help + Support DMA Remapping Devices. The devices are reported via + ACPI tables and includes pci device scope under each DMA + remapping device. + source "drivers/pci/pcie/Kconfig" source "drivers/pci/Kconfig" Index: linux-2.6.21-rc5/drivers/acpi/Makefile === --- linux-2.6.21-rc5.orig/drivers/acpi/Makefile 2007-04-03 04:30:40.0 -0700 +++ linux-2.6.21-rc5/drivers/acpi/Makefile 2007-04-03 06:34:17.0 -0700 @@ -60,3 +60,4 @@ obj-$(CONFIG_ACPI_HOTPLUG_MEMORY) += acpi_memhotplug.o obj-y += cm_sbs.o obj-$(CONFIG_ACPI_SBS) += i2c_ec.o sbs.o +obj-$(CONFIG_DMAR) += dmar.o Index: linux-2.6.21-rc5/drivers/acpi/dmar.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.21-rc5/drivers/acpi/dmar.c2007-04-03 06:54:27.0 -0700 @@ -0,0 +1,344 @@ +/* + * Copyright (c) 2006, Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple + * Place - Suite 330, Boston, MA 02111-1307 USA. + * + * Copyright (C) Ashok Raj <[EMAIL PROTECTED]> + * Copyright (C) Shaohua Li <[EMAIL PROTECTED]> + */ + +#include +#include +#include +#include + +#include + +#undef PREFIX +#define PREFIX "ACPI DMAR:" + +#define MIN_SCOPE_LEN (sizeof(struct acpi_pci_path) + sizeof(struct acpi_dev_scope)) + +LIST_HEAD(acpi_drhd_units); +LIST_HEAD(acpi_rmrr_units); +u8 dmar_host_address_width; + +static int __init acpi_register_drhd_unit(struct acpi_drhd_unit *drhd) +{ + /* +* add INCLUDE_ALL at the tail, so scan the list will find it at +* the very end. 
+*/ + if (drhd->include_all) + list_add_tail(&drhd->list, &acpi_drhd_units); + else + list_add(&drhd->list, &acpi_drhd_units); + return 0; +} + +static int __init acpi_register_rmrr_unit(struct acpi_rmrr_unit *rmrr) +{ + list_add(&rmrr->list, &acpi_rmrr_units); + return 0; +} + +static int acpi_pci_device_match(struct pci_dev *devices[], int cnt, +struct pci_dev *dev) +{ + int index; + + while (dev) { + for (index = 0; index < cnt; index ++) + if (dev == devices[index]) + return 1; + + /* Check our parent */ + dev = dev->bus->self; + } + + return 0; +} + +struct acpi_drhd_unit * acpi_find_matched_drhd_unit(struct pci_dev *dev) +{ + struct acpi_drhd_unit *drhd = NULL; + + list_for_each_entry(drhd, &acpi_drhd_units, list) { + if (drhd->include_all || acpi_pci_device_match(drhd->devices, + drhd->devices_cnt, dev)) + break; + } + + return drhd; +} + +struct acpi_rmrr_unit * acpi_find_matched_rmrr_unit(struct pci_dev *dev) +{ + struct acpi_rmrr_unit *rmrr; + + list_for_each_entry(rmrr, &acpi_rmrr_units, list) { + if (acpi_pci_device_match(rmrr->devices, + rmrr->devices_cnt, dev)) + goto out; + } + rmrr = NULL; +out: + return rmrr; +} + +static int __init acpi_parse_one_dev_scope(struct acpi_dev_scope *scope, + struct pci_dev **dev, u16 segment) +{ + struct pci_bus *bus; + struct pci_dev *pdev = NULL; + struct acpi_pci_path *path; +
[patch 7/8] [Intel IOMMU] Support for legacy ISA devices
Floppy disk drivers don't work well with DMA remapping. It's possible to extend the current use for x86_64, but the gain is very little. If someone feels compelled to clean this up, it's up for grabs. Since these use the first 16M, we just provide a unity map for the ISA bridge device. Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> Index: linux-2.6.21-rc5/Documentation/kernel-parameters.txt === --- linux-2.6.21-rc5.orig/Documentation/kernel-parameters.txt 2007-04-09 03:05:36.0 -0700 +++ linux-2.6.21-rc5/Documentation/kernel-parameters.txt 2007-04-09 03:05:38.0 -0700 @@ -730,6 +730,11 @@ the IOMMU driver to set a unity map for all OS visible memory. Hence the driver can continue to use physical addresses for DMA. + noisamap + This option is required to set up an identity map for + the first 16M. The floppy disk could be modified to use + the DMA api's but that's a lot of pain for very small + gain. This option is turned on by default. io7=[HW] IO7 for Marvel based alpha systems See comment before marvel_specify_io7 in arch/alpha/kernel/core_marvel.c. Index: linux-2.6.21-rc5/drivers/pci/intel-iommu.c === --- linux-2.6.21-rc5.orig/drivers/pci/intel-iommu.c 2007-04-09 03:05:34.0 -0700 +++ linux-2.6.21-rc5/drivers/pci/intel-iommu.c 2007-04-09 03:05:38.0 -0700 @@ -37,6 +37,8 @@ #include "pci.h" #define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY) +#define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA) + #define IOAPIC_RANGE_START (0xfee00000) #define IOAPIC_RANGE_END (0xfeefffff) #define IOAPIC_RANGE_SIZE (IOAPIC_RANGE_END - IOAPIC_RANGE_START + 1) @@ -87,6 +89,7 @@ static int dmar_disabled, dmar_force_rw; static int dmar_map_gfx = 1, dmar_no_gfx_identity_map = 1; +static int dmar_fix_isa = 1; static char *get_fault_reason(u8 fault_reason) { @@ -113,6 +116,9 @@ } else if (!strncmp(str, "gfx_workaround", 14)) { dmar_no_gfx_identity_map = 0; printk(KERN_INFO"Intel-IOMMU: do 1-1 mapping whole physical memory for GFX device\n"); + } else if (!strncmp(str, "noisamap", 8)) { + dmar_fix_isa = 0; + printk (KERN_INFO"Intel-IOMMU: Turning off 16M unity map for LPC\n"); } str += strcspn(str, ","); @@ -1575,6 +1581,25 @@ } } +static void iommu_prepare_isa(void) +{ + struct pci_dev *pdev = NULL; + int ret; + + if (!dmar_fix_isa) + return; + + pdev = pci_get_class (PCI_CLASS_BRIDGE_ISA << 8, NULL); + if (!pdev) + return; + + printk (KERN_INFO "IOMMU: Prepare 0-16M unity mapping for LPC\n"); + ret = iommu_prepare_identity_map(pdev, 0, 16*1024*1024); + + if (ret) + printk ("IOMMU: Failed to create 0-16M identity map, Floppy might not work\n"); + +} int __init init_dmars(void) { struct acpi_drhd_unit *drhd; @@ -1631,6 +1656,7 @@ end_for_each_rmrr_device(rmrr, pdev) iommu_prepare_gfx_mapping(); + iommu_prepare_isa(); /* * for each drhd -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
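A sketch of the same lookup with explicit reference handling (a hypothetical variant, not the posted code): pci_get_class() returns a counted reference, which can be dropped once the unity map has been set up.

static void example_prepare_isa(void)
{
	struct pci_dev *pdev;

	pdev = pci_get_class(PCI_CLASS_BRIDGE_ISA << 8, NULL);
	if (!pdev)
		return;
	if (iommu_prepare_identity_map(pdev, 0, 16 * 1024 * 1024))
		printk(KERN_ERR "IOMMU: 0-16M identity map failed\n");
	pci_dev_put(pdev);	/* drop the pci_get_class() reference */
}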
Re: [patch 1/8] [Intel IOMMU] ACPI support for Intel Virtualization Technology for Directed I/O
On Mon, Apr 09, 2007 at 11:39:19PM -0400, Len Brown wrote: > On Monday 09 April 2007 17:55, Ashok Raj wrote: > > This patch contains basic ACPI parsing and enumeration support. > > AFAICS, ACPI supplies the envelope which delivers the table, > and ACPI has some convenience structure definitions for that > table in include/acpi/actbl1.h (primarily for the acpixtract table > dis-assembler), > but ACPI is otherwise not involved in IOMMU support. > > Indeed, one might argue that all new functions in this patch series with > "acpi..." would more appropriately be called "pci...", since a cursory > scan of the IOMMU spec seems to suggest it is specific to PCI. I think we can migrate some of the code so that the core part just performs the get-table call. We will do that in the next respin. > > So on first blush, it looks like the only call to a function that begins with > "acpi" in this patch series should be acpi_get_table() from some IOMMU > specific file outside of drivers/acpi, > and the only modification to any code with an "acpi" in the file path or > filename should > be any updates to the convenience structure definitions in acpitbl1.h > > thanks, > -Len - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
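A sketch of what that reduction might look like (assuming the ACPICA acpi_get_table() interface; names and placement are illustrative):

#include <linux/acpi.h>
#include <linux/errno.h>

static struct acpi_table_header *dmar_tbl;

/* Fetch the DMAR table once from ACPI and hand it to the PCI/IOMMU
 * code for parsing; nothing else needs to live under drivers/acpi. */
static int __init example_detect_dmar(void)
{
	acpi_status status;

	status = acpi_get_table(ACPI_SIG_DMAR, 0, &dmar_tbl);
	if (ACPI_FAILURE(status) || !dmar_tbl)
		return -ENODEV;
	return 0;
}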
Re: [patch 0/8] [Intel IOMMU] Support for Intel Virtualization Technology for Directed I/O
On Tue, Apr 10, 2007 at 09:49:55AM +0200, Andi Kleen wrote: > On Monday 09 April 2007 23:55:52 Ashok Raj wrote: > > > Please help review and provide feedback. > > High level question: how did you solve the "user X server needs IOMMU bypass" > problem? There is no special consideration for user space. Since all useful memory is mapped 1-1, I guess user space would work as is, unless I am missing something here. So yes, there is no protection with 1-1, but I guess it's like a compatibility mode until the code gets converted over. Keith (cc'ed) says some of the user-space side is also getting converted over, with some driver support possibly. So this is an interim problem until X catches up with it. > > -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 0/8] [Intel IOMMU] Support for Intel Virtualization Technology for Directed I/O
On Tue, Apr 10, 2007 at 04:34:48AM -0400, Jeff Garzik wrote: > Shaohua Li wrote: > >DMA remapping just uses ACPI table to tell which dma remapping engine a > >pci device is controlled by at boot time. At run time, DMA remapping > >hasn't any interactive with ACPI. > > The Linux kernel _really_ wants a non-ACPI way to detect this. > > Just use the hardware registers themselves, you don't need an ACPI table. > > Jeff ACPI is required not just for identifying the DMA remapping hardware in the system. We also need it to identify which engines control which pci devices. There are also some reserved memory regions that the BIOS uses for its own purposes -- say, for legacy emulation via USB -- that need an identity map, and those have to be passed to the OS. I am not sure we can get away from using ACPI that easily. It is used only for setup information; once the identification is complete we don't bother ACPI anymore. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[Intel IOMMU][patch 2/8] Some generic search functions required to lookup device relationships.
PCI support functions for DMAR, to find the parent bridge. When devices are under a p2p bridge, upstream transactions get replaced by the device id of the bridge, as it owns the PCIE transaction. Hence it's necessary to set up translations on behalf of the bridge as well. Due to this limitation, all devices under a p2p bridge share the same domain in a DMAR. We just cache the type of device -- whether it is a native PCIe device or not -- for later use. Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> - Index: 2.6.21-rc6/drivers/pci/pci.h === --- 2.6.21-rc6.orig/drivers/pci/pci.h 2007-04-06 10:36:56.0 +0800 +++ 2.6.21-rc6/drivers/pci/pci.h 2007-04-11 13:52:29.0 +0800 @@ -90,3 +90,4 @@ pci_match_one_device(const struct pci_de return NULL; } +struct pci_dev *pci_find_upstream_pcie_bridge(struct pci_dev *pdev); Index: 2.6.21-rc6/drivers/pci/probe.c === --- 2.6.21-rc6.orig/drivers/pci/probe.c 2007-04-06 10:36:56.0 +0800 +++ 2.6.21-rc6/drivers/pci/probe.c 2007-04-11 13:52:29.0 +0800 @@ -822,6 +822,19 @@ static void pci_release_dev(struct devic kfree(pci_dev); } +static void set_pcie_port_type(struct pci_dev *pdev) +{ + int pos; + u16 reg16; + + pos = pci_find_capability(pdev, PCI_CAP_ID_EXP); + if (!pos) + return; + pdev->is_pcie = 1; + pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, &reg16); + pdev->pcie_type = (reg16 & PCI_EXP_FLAGS_TYPE) >> 4; +} + /** * pci_cfg_space_size - get the configuration space size of the PCI device. * @dev: PCI device @@ -919,6 +932,7 @@ pci_scan_device(struct pci_bus *bus, int dev->device = (l >> 16) & 0xffff; dev->cfg_size = pci_cfg_space_size(dev); dev->error_state = pci_channel_io_normal; + set_pcie_port_type(dev); /* Assume 32-bit PCI; let 64-bit PCI cards (which are far rarer) set this higher, assuming the system even supports it.
*/ Index: 2.6.21-rc6/drivers/pci/search.c === --- 2.6.21-rc6.orig/drivers/pci/search.c 2007-04-06 10:36:56.0 +0800 +++ 2.6.21-rc6/drivers/pci/search.c 2007-04-11 13:52:29.0 +0800 @@ -14,6 +14,36 @@ #include "pci.h" DECLARE_RWSEM(pci_bus_sem); +/* + * find the upstream PCIE-to-PCI bridge of a PCI device + * if the device is PCIE, return NULL + * if the device isn't connected to a PCIE bridge (that is its parent is a + * legacy PCI bridge and the bridge is directly connected to bus 0), return its + * parent + */ +struct pci_dev * +pci_find_upstream_pcie_bridge(struct pci_dev *pdev) +{ + struct pci_dev *tmp = NULL; + + if (pdev->is_pcie) + return NULL; + while (1) { + if (!pdev->bus->self) + break; + pdev = pdev->bus->self; + /* a p2p bridge */ + if (!pdev->is_pcie) { + tmp = pdev; + continue; + } + /* PCI device should connect to a PCIE bridge */ + BUG_ON(pdev->pcie_type != PCI_EXP_TYPE_PCI_BRIDGE); + return pdev; + } + + return tmp; +} static struct pci_bus * pci_do_find_bus(struct pci_bus* bus, unsigned char busnr) Index: 2.6.21-rc6/include/linux/pci.h === --- 2.6.21-rc6.orig/include/linux/pci.h 2007-04-06 10:36:56.0 +0800 +++ 2.6.21-rc6/include/linux/pci.h 2007-04-11 13:52:29.0 +0800 @@ -126,6 +126,7 @@ struct pci_dev { unsigned short subsystem_device; unsigned int class; /* 3 bytes: (base,sub,prog-if) */ u8 hdr_type; /* PCI header type (`multi' flag masked out) */ + u8 pcie_type; /* PCI-E device/port type */ u8 rom_base_reg; /* which config register controls the ROM */ u8 pin; /* which interrupt pin this device uses */ @@ -168,6 +169,7 @@ struct pci_dev { unsigned int msi_enabled:1; unsigned int msix_enabled:1; unsigned int is_managed:1; + unsigned int is_pcie:1; atomic_t enable_cnt; /* pci_enable_device has been called */ u32 saved_config_space[16]; /* config space saved at suspend time */ -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[Intel IOMMU][patch 4/8] Supporting Zero Length Reads in Intel IOMMU.
PCI specs permit zero length reads (ZLR) even if the mapping for that region is write only. Support for this feature is indicated by the presence of a bit in the DMAR capability. If a particular DMAR does not support this capability we map write-only regions as read-write. This option can also provide a workaround for some drivers that request a write-only mapping when they really should request a read-write. (We ran into one such case in eepro100.c in handling rx_ring_dma) Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> -- drivers/pci/intel-iommu.c |7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) Index: 2.6.21-rc6/drivers/pci/intel-iommu.c === --- 2.6.21-rc6.orig/drivers/pci/intel-iommu.c 2007-04-18 09:04:56.0 +0800 +++ 2.6.21-rc6/drivers/pci/intel-iommu.c 2007-04-18 09:04:59.0 +0800 @@ -84,7 +84,7 @@ struct iommu { struct sys_device sysdev; }; -static int dmar_disabled; +static int dmar_disabled, dmar_force_rw; static char *get_fault_reason(u8 fault_reason) { @@ -102,6 +102,9 @@ static int __init intel_iommu_setup(char if (!strncmp(str, "off", 3)) { dmar_disabled = 1; printk(KERN_INFO"Intel-IOMMU: disabled\n"); + } else if (!strncmp(str, "forcerw", 7)) { + dmar_force_rw = 1; + printk(KERN_INFO"Intel-IOMMU: force R/W for W/O mapping\n"); } str += strcspn(str, ","); while (*str == ',') @@ -1720,7 +1723,12 @@ static dma_addr_t __intel_map_single(str goto error; } - if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL) + /* +* Check if DMAR supports zero-length reads on write only +* mappings.. +*/ + if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL || \ + !cap_zlr(domain->iommu->cap) || dmar_force_rw) prot |= DMA_PTE_READ; if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL) prot |= DMA_PTE_WRITE; Index: 2.6.21-rc6/include/linux/intel-iommu.h === --- 2.6.21-rc6.orig/include/linux/intel-iommu.h 2007-04-18 09:04:56.0 +0800 +++ 2.6.21-rc6/include/linux/intel-iommu.h 2007-04-18 09:04:59.0 +0800 @@ -79,6 +79,7 @@ #define cap_max_fault_reg_offset(c) \ (cap_fault_reg_offset(c) + cap_num_fault_regs(c) * 16) +#define cap_zlr(c) (((c) >> 22) & 1) #define cap_isoch(c) (((c) >> 23) & 1) #define cap_mgaw(c) ((((c) >> 16) & 0x3f) + 1) #define cap_sagaw(c) (((c) >> 8) & 0x1f) -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[Intel IOMMU][patch 7/8] Support for legacy ISA devices
Floppy disk drivers don't work well with DMA remapping. It's possible to extend the current use for x86_64, but the gain is very little. If someone feels compelled to clean this up, it's up for grabs. Since these use the first 16M, we just provide a unity map for the ISA bridge device. Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> Index: linux-2.6.21-rc5/Documentation/kernel-parameters.txt === --- linux-2.6.21-rc5.orig/Documentation/kernel-parameters.txt 2007-04-17 05:41:56.0 -0700 +++ linux-2.6.21-rc5/Documentation/kernel-parameters.txt 2007-04-17 05:41:59.0 -0700 @@ -730,6 +730,11 @@ the IOMMU driver to set a unity map for all OS visible memory. Hence the driver can continue to use physical addresses for DMA. + noisamap + This option is required to set up an identity map for + the first 16M. The floppy disk could be modified to use + the DMA api's but that's a lot of pain for very small + gain. This option is turned on by default. io7=[HW] IO7 for Marvel based alpha systems See comment before marvel_specify_io7 in arch/alpha/kernel/core_marvel.c. Index: linux-2.6.21-rc5/drivers/pci/intel-iommu.c === --- linux-2.6.21-rc5.orig/drivers/pci/intel-iommu.c 2007-04-17 05:41:53.0 -0700 +++ linux-2.6.21-rc5/drivers/pci/intel-iommu.c 2007-04-17 05:41:59.0 -0700 @@ -37,6 +37,8 @@ #include "pci.h" #define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY) +#define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA) + #define IOAPIC_RANGE_START (0xfee00000) #define IOAPIC_RANGE_END (0xfeefffff) #define IOAPIC_RANGE_SIZE (IOAPIC_RANGE_END - IOAPIC_RANGE_START + 1) @@ -87,6 +89,7 @@ static int dmar_disabled, dmar_force_rw; static int dmar_map_gfx = 1, dmar_no_gfx_identity_map = 1; +static int dmar_fix_isa = 1; static char *get_fault_reason(u8 fault_reason) { @@ -113,6 +116,9 @@ } else if (!strncmp(str, "gfx_workaround", 14)) { dmar_no_gfx_identity_map = 0; printk(KERN_INFO"Intel-IOMMU: do 1-1 mapping whole physical memory for GFX device\n"); + } else if (!strncmp(str, "noisamap", 8)) { + dmar_fix_isa = 0; + printk (KERN_INFO"Intel-IOMMU: Turning off 16M unity map for LPC\n"); } str += strcspn(str, ","); @@ -1582,6 +1588,25 @@ } } +static void iommu_prepare_isa(void) +{ + struct pci_dev *pdev = NULL; + int ret; + + if (!dmar_fix_isa) + return; + + pdev = pci_get_class (PCI_CLASS_BRIDGE_ISA << 8, NULL); + if (!pdev) + return; + + printk (KERN_INFO "IOMMU: Prepare 0-16M unity mapping for LPC\n"); + ret = iommu_prepare_identity_map(pdev, 0, 16*1024*1024); + + if (ret) + printk ("IOMMU: Failed to create 0-16M identity map, Floppy might not work\n"); + +} int __init init_dmars(void) { struct dmar_drhd_unit *drhd; @@ -1638,6 +1663,7 @@ end_for_each_rmrr_device(rmrr, pdev) iommu_prepare_gfx_mapping(); + iommu_prepare_isa(); /* * for each drhd -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[Intel IOMMU][patch 6/8] Doc updates for Intel Virtualization Technology for Directed I/O.
Document Intel IOMMU driver boot options. Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> Index: linux-2.6.21-rc5/Documentation/Intel-IOMMU.txt === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.21-rc5/Documentation/Intel-IOMMU.txt 2007-04-17 05:41:56.0 -0700 @@ -0,0 +1,119 @@ +Linux IOMMU Support +=== + +The architecture spec can be obtained from the below location. + +http://www.intel.com/technology/virtualization/ + +This guide gives a quick cheat sheet for some basic understanding. + +Some Keywords + +DMAR - DMA remapping +DRHD - DMA Engine Reporting Structure +RMRR - Reserved memory Region Reporting Structure +ZLR - Zero length reads from PCI devices +IOVA - IO Virtual address. + +Basic stuff +--- + +ACPI enumerates and lists the different DMA engines in the platform, and +device scope relationships between PCI devices and which DMA engine controls +them. + +What is RMRR? +- + +There are some devices the BIOS controls, for example USB devices used to +perform PS2 emulation. The regions of memory used for these devices are marked +reserved in the e820 map. When we turn on DMA translation, DMA to those +regions will fail. Hence the BIOS uses RMRR to specify these regions along with +the devices that need to access them. The OS is expected to set up +unity mappings for these regions so these devices can access them. + +How is IOVA generated? +- + +Well-behaved drivers make pci_map_*() calls before sending a command to a device +that needs to perform DMA. Once DMA is completed and the mapping is no longer +required, the driver performs a pci_unmap_*() call to unmap the region. + +The Intel IOMMU driver allocates a virtual address per domain. Each PCIE +device has its own domain (hence protection). Devices under p2p bridges +share the virtual address with all devices under the p2p bridge due to +transaction id aliasing for p2p bridges. + +IOVA generation is pretty generic. We used the same technique as vmalloc() +but these are not global address spaces, but separate for each domain. +Different DMA engines may support different numbers of domains. + +We also allocate guard pages with each mapping, so we can attempt to catch +any overflow that might happen. + + +Graphics Problems? +-- +If you encounter issues with graphics devices, you can try adding +option intel_iommu=igfx_off to turn off the integrated graphics engine. + +If it happens to be a PCI device included in the INCLUDE_ALL Engine, +then try intel_iommu=gfx_workaround to set up a 1-1 map. We hear +graphics drivers may be in the process of converting to the DMA APIs in the near +future + +Some exceptions to IOVA +--- +Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff). +The same is true for peer to peer transactions. Hence we reserve the +address from PCI MMIO ranges so they are not allocated for IOVA addresses. + + +Fault reporting +--- +When errors are reported, the DMA engine signals via an interrupt. The fault +reason and the device that caused the fault are printed on the console. + +See below for sample. + + +Boot Message Sample +--- + +Something like this gets printed indicating presence of DMAR tables +in ACPI. + +ACPI: DMAR (v001 A M I OEMDMAR 0x0001 MSFT 0x0097) @ 0x7f5b5ef0 + +When DMAR is being processed and initialized by ACPI, prints DMAR locations +and any RMRR's processed. + +ACPI DMAR:Host address width 36 +ACPI DMAR:DRHD (flags: 0x0000)base: 0xfed90000 +ACPI DMAR:DRHD (flags: 0x0000)base: 0xfed91000 +ACPI DMAR:DRHD (flags: 0x0001)base: 0xfed93000 +ACPI DMAR:RMRR base: 0x000ed000 end: 0x000effff +ACPI DMAR:RMRR base: 0x7f600000 end: 0x7fffffff + +When DMAR is enabled for use, you will notice.. + +PCI-DMA: Using DMAR IOMMU + +Fault reporting +--- + +DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 +DMAR:[fault reason 05] PTE Write access is not set +DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 +DMAR:[fault reason 05] PTE Write access is not set + +TBD + + +- No Performance tuning / analysis yet. +- sysfs needs useful data to be populated. + DMAR info, device scope, stats could be exposed to some extent. +- Add support to Firmware Developer Kit to test ACPI tables for DMAR. +- For compatibility testing, could use unity map domain for all devices, just + provide a 1-1 for all useful memory under a single domain for all devices. +- API for paravirt ops for abstracting functionality for VMM folks. Index: linux-2.6.21-rc5/Documentation/kernel-parameters.txt === --- linux-2.6.21-rc5.orig/Documentation/kernel-parameters.txt 2007-04-17 04:59:42.0 -070
[Intel IOMMU][patch 5/8] Graphics driver workarounds to provide unity map
Most GFX drivers don't call standard PCI DMA APIs to allocate DMA buffers, so such drivers will be broken with the IOMMU enabled. To work around this issue, we added two options. Once graphics devices are converted over to use the DMA APIs, this entire patch can be removed... a. intel_iommu=igfx_off. With this option, a DMAR that has only gfx devices under it will be ignored. This mostly affects integrated gfx devices. If the DMAR is ignored, gfx devices under it will get physical addresses for DMA. b. intel_iommu=gfx_workaround. With this option, we will set up a 1:1 mapping of whole memory for gfx devices, that is, the physical address equals the virtual address. In this way, gfx will use physical addresses for DMA; this is primarily for add-in card GFX devices. Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> Index: 2.6.21-rc6/arch/x86_64/kernel/e820.c === --- 2.6.21-rc6.orig/arch/x86_64/kernel/e820.c 2007-04-20 11:03:01.0 +0800 +++ 2.6.21-rc6/arch/x86_64/kernel/e820.c 2007-04-20 11:45:56.0 +0800 @@ -730,3 +730,22 @@ __init void e820_setup_gap(void) printk(KERN_INFO "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n", pci_mem_start, gapstart, gapsize); } + +int __init arch_get_ram_range(int slot, u64 *addr, u64 *size) +{ + int i; + + if (slot < 0 || slot >= e820.nr_map) + return -1; + for (i = slot; i < e820.nr_map; i++) { + if(e820.map[i].type != E820_RAM) + continue; + break; + } + if (i == e820.nr_map || e820.map[i].addr > (max_pfn << PAGE_SHIFT)) + return -1; + *addr = e820.map[i].addr; + *size = min_t(u64, e820.map[i].size + e820.map[i].addr, + max_pfn << PAGE_SHIFT) - *addr; + return i + 1; +} Index: 2.6.21-rc6/drivers/pci/dmar.h === --- 2.6.21-rc6.orig/drivers/pci/dmar.h 2007-04-20 11:38:30.0 +0800 +++ 2.6.21-rc6/drivers/pci/dmar.h 2007-04-20 11:45:56.0 +0800 @@ -35,6 +35,7 @@ struct dmar_drhd_unit { int devices_cnt; u8 include_all:1; struct iommu *iommu; + int ignored:1; /* the drhd should be ignored */ }; struct dmar_rmrr_unit { Index: 2.6.21-rc6/drivers/pci/intel-iommu.c === --- 2.6.21-rc6.orig/drivers/pci/intel-iommu.c 2007-04-20 11:45:52.0 +0800 +++ 2.6.21-rc6/drivers/pci/intel-iommu.c 2007-04-20 11:45:56.0 +0800 @@ -36,6 +36,7 @@ #include "iova.h" #include "pci.h" +#define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY) #define IOAPIC_RANGE_START (0xfee00000) #define IOAPIC_RANGE_END (0xfeefffff) #define IOAPIC_RANGE_SIZE (IOAPIC_RANGE_END - IOAPIC_RANGE_START + 1) @@ -85,6 +86,7 @@ struct iommu { }; static int dmar_disabled, dmar_force_rw; +static int dmar_map_gfx = 1, dmar_no_gfx_identity_map = 1; static char *get_fault_reason(u8 fault_reason) { @@ -105,7 +107,14 @@ static int __init intel_iommu_setup(char } else if (!strncmp(str, "forcerw", 7)) { dmar_force_rw = 1; printk(KERN_INFO"Intel-IOMMU: force R/W for W/O mapping\n"); + } else if (!strncmp(str, "igfx_off", 8)) { + dmar_map_gfx = 0; + printk(KERN_INFO"Intel-IOMMU: disable GFX device mapping\n"); + } else if (!strncmp(str, "gfx_workaround", 14)) { + dmar_no_gfx_identity_map = 0; + printk(KERN_INFO"Intel-IOMMU: do 1-1 mapping whole physical memory for GFX device\n"); } + str += strcspn(str, ","); while (*str == ',') str++; @@ -1318,6 +1327,7 @@ struct device_domain_info { struct pci_dev *dev; /* it's NULL for PCIE-to-PCI bridge */ struct domain *domain; }; +#define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1)) static DEFINE_SPINLOCK(device_domain_lock); static LIST_HEAD(device_domain_list); @@ -1538,10 +1548,40 @@ error: static inline int iommu_prepare_rmrr_dev(struct dmar_rmrr_unit *rmrr, struct pci_dev *pdev) { + if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO) + return 0; return iommu_prepare_identity_map(pdev, rmrr->base_address, rmrr->end_address + 1); } +static void iommu_prepare_gfx_mapping(void) +{ + struct pci_dev *pdev = NULL; + u64 base, size; + int slot; + int ret; + + if (dmar_no_gfx_identity_m
[Intel IOMMU][patch 8/8] Preserve some Virtual Address when devices cannot address entire range.
Some devices may not support the entire 64-bit DMA range. In a situation where such devices are co-located in a shared domain, we need to ensure there is some address space reserved for them, without the low addresses getting depleted by other devices capable of handling high dma addresses. Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> Index: linux-2.6.21-rc5/Documentation/kernel-parameters.txt === --- linux-2.6.21-rc5.orig/Documentation/kernel-parameters.txt 2007-04-17 06:02:24.0 -0700 +++ linux-2.6.21-rc5/Documentation/kernel-parameters.txt 2007-04-17 06:02:33.0 -0700 @@ -735,6 +735,11 @@ first 16M. The floppy disk could be modified to use the DMA api's but that's a lot of pain for very small gain. This option is turned on by default. + preserve_{1g/2g/4g/512m/256m/16m} + If a device is sharing a domain with other devices + and the device mask doesn't cover the 64bit range, + use this option to let the iommu code preserve some + virtual addresses for such devices. io7=[HW] IO7 for Marvel based alpha systems See comment before marvel_specify_io7 in arch/alpha/kernel/core_marvel.c. Index: linux-2.6.21-rc5/drivers/pci/intel-iommu.c === --- linux-2.6.21-rc5.orig/drivers/pci/intel-iommu.c 2007-04-17 06:02:24.0 -0700 +++ linux-2.6.21-rc5/drivers/pci/intel-iommu.c 2007-04-17 06:05:49.0 -0700 @@ -90,6 +90,7 @@ static int dmar_disabled, dmar_force_rw; static int dmar_map_gfx = 1, dmar_no_gfx_identity_map = 1; static int dmar_fix_isa = 1; +static u64 dmar_preserve_iova_mask; static char *get_fault_reason(u8 fault_reason) { @@ -119,6 +120,32 @@ } else if (!strncmp(str, "noisamap", 8)) { dmar_fix_isa = 0; printk (KERN_INFO"Intel-IOMMU: Turning off 16M unity map for LPC\n"); + } else if (!strncmp(str, "preserve_", 9)) { + if (!strncmp(str + 9, "4g", 2) || + !strncmp(str + 9, "4G", 2)) + dmar_preserve_iova_mask = DMA_32BIT_MASK; + else if (!strncmp(str + 9, "2g", 2) || + !strncmp(str + 9, "2G", 2)) + dmar_preserve_iova_mask = DMA_31BIT_MASK; + else if (!strncmp(str + 9, "1g", 2) || +!strncmp(str + 9, "1G", 2)) + dmar_preserve_iova_mask = DMA_30BIT_MASK; + else if (!strncmp(str + 9, "512m", 4) || +!strncmp(str + 9, "512M", 4)) + dmar_preserve_iova_mask = DMA_29BIT_MASK; + else if (!strncmp(str + 9, "256m", 4) || +!strncmp(str + 9, "256M", 4)) + dmar_preserve_iova_mask = DMA_28BIT_MASK; + else if (!strncmp(str + 9, "16m", 3) || +!strncmp(str + 9, "16M", 3)) + dmar_preserve_iova_mask = DMA_24BIT_MASK; + if (dmar_preserve_iova_mask) + printk(KERN_INFO + "DMAR: Preserved IOVA mask 0x%Lx for devices " + "sharing domain\n", dmar_preserve_iova_mask); + else + printk(KERN_ERR"DMAR: Unsupported preserve mask" + " provided"); } str += strcspn(str, ","); @@ -1723,7 +1750,6 @@ last_addr : IOVA_START_ADDR); } return last_addr; - } #endif @@ -1751,13 +1777,14 @@ /* * If the device shares a domain with other devices and the device can -* handle > 4G DMA, let the device use DMA address started from 4G, so to -* leave rooms for other devices +* handle higher addresses, leave room for devices that can't +* address high address ranges. */ if ((domain->flags & DOMAIN_FLAG_MULTIPLE_DEVICES) && - pdev->dma_mask > DMA_32BIT_MASK) + dmar_preserve_iova_mask && + (pdev->dma_mask > dmar_preserve_iova_mask)) iova = alloc_iova(domain, addr, size, -
[Intel IOMMU][patch 7/8] Support for legacy ISA devices
Floppy disk drivers don't work well with DMA remapping. It's possible to extend the current code for x86_64, but the gain is very little; if someone feels compelled to clean this up, it's up for grabs. Since these devices only use the first 16M, we just provide a unity map for the ISA bridge device. Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> Index: linux-2.6.21-rc5/Documentation/kernel-parameters.txt === --- linux-2.6.21-rc5.orig/Documentation/kernel-parameters.txt 2007-04-17 05:41:56.0 -0700 +++ linux-2.6.21-rc5/Documentation/kernel-parameters.txt2007-04-17 05:41:59.0 -0700 @@ -730,6 +730,11 @@ the IOMMU driver to set a unity map for all OS visible memory. Hence the driver can continue to use physical addresses for DMA. + noisamap + This option is required to set up an identity map for + the first 16M. The floppy driver could be modified to use + the DMA APIs but that's a lot of pain for very small + gain. This option is turned on by default. io7=[HW] IO7 for Marvel based alpha systems See comment before marvel_specify_io7 in arch/alpha/kernel/core_marvel.c. Index: linux-2.6.21-rc5/drivers/pci/intel-iommu.c === --- linux-2.6.21-rc5.orig/drivers/pci/intel-iommu.c 2007-04-17 05:41:53.0 -0700 +++ linux-2.6.21-rc5/drivers/pci/intel-iommu.c 2007-04-17 05:41:59.0 -0700 @@ -37,6 +37,8 @@ #include "pci.h" #define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY) +#define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA) + #define IOAPIC_RANGE_START (0xfee00000) #define IOAPIC_RANGE_END (0xfeefffff) #define IOAPIC_RANGE_SIZE (IOAPIC_RANGE_END - IOAPIC_RANGE_START + 1) @@ -87,6 +89,7 @@ static int dmar_disabled, dmar_force_rw; static int dmar_map_gfx = 1, dmar_no_gfx_identity_map = 1; +static int dmar_fix_isa = 1; static char *get_fault_reason(u8 fault_reason) { @@ -113,6 +116,9 @@ } else if (!strncmp(str, "gfx_workaround", 14)) { dmar_no_gfx_identity_map = 0; printk(KERN_INFO "Intel-IOMMU: do 1-1 mapping of whole physical memory for GFX devices\n"); + } else if (!strncmp(str, "noisamap", 8)) { + dmar_fix_isa = 0; + printk(KERN_INFO "Intel-IOMMU: Turning off 16M unity map for LPC\n"); } str += strcspn(str, ","); @@ -1582,6 +1588,25 @@ } } +static void iommu_prepare_isa(void) +{ + struct pci_dev *pdev = NULL; + int ret; + + if (!dmar_fix_isa) + return; + + pdev = pci_get_class(PCI_CLASS_BRIDGE_ISA << 8, NULL); + if (!pdev) + return; + + printk(KERN_INFO "IOMMU: Prepare 0-16M unity mapping for LPC\n"); + ret = iommu_prepare_identity_map(pdev, 0, 16*1024*1024); + + if (ret) + printk(KERN_WARNING "IOMMU: Failed to create 0-16M identity map; floppy might not work\n"); + +} int __init init_dmars(void) { struct dmar_drhd_unit *drhd; @@ -1638,6 +1663,7 @@ end_for_each_rmrr_device(rmrr, pdev) iommu_prepare_gfx_mapping(); + iommu_prepare_isa(); /* * for each drhd --
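One detail worth noting in iommu_prepare_isa() above: pci_get_class() returns its device with an elevated reference count. A sketch of the same function with the reference dropped once the boot-time setup is done; otherwise identical to the patch:

static void iommu_prepare_isa(void)
{
	struct pci_dev *pdev;
	int ret;

	if (!dmar_fix_isa)
		return;

	pdev = pci_get_class(PCI_CLASS_BRIDGE_ISA << 8, NULL);
	if (!pdev)
		return;

	printk(KERN_INFO "IOMMU: Prepare 0-16M unity mapping for LPC\n");
	ret = iommu_prepare_identity_map(pdev, 0, 16*1024*1024);
	if (ret)
		printk(KERN_WARNING "IOMMU: Failed to create 0-16M identity map; floppy might not work\n");

	pci_dev_put(pdev);	/* drop the reference pci_get_class() took */
}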
[Intel IOMMU][patch 1/8] ACPI support for Intel Virtualization Technology for Directed I/O
This patch contains basic ACPI parsing and enumeration support. Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> Index: linux-2.6.21-rc5/arch/x86_64/Kconfig === --- linux-2.6.21-rc5.orig/arch/x86_64/Kconfig 2007-04-23 07:11:49.0 -0700 +++ linux-2.6.21-rc5/arch/x86_64/Kconfig2007-04-23 07:11:51.0 -0700 @@ -687,6 +687,14 @@ bool "Support mmconfig PCI config space access" depends on PCI && ACPI +config DMAR + bool "Support for DMA Remapping Devices (EXPERIMENTAL)" + depends on PCI_MSI && ACPI && EXPERIMENTAL + help + Support DMA Remapping Devices. The devices are reported via + ACPI tables and includes pci device scope under each DMA + remapping device. + source "drivers/pci/pcie/Kconfig" source "drivers/pci/Kconfig" Index: linux-2.6.21-rc5/drivers/pci/Makefile === --- linux-2.6.21-rc5.orig/drivers/pci/Makefile 2007-04-23 07:11:49.0 -0700 +++ linux-2.6.21-rc5/drivers/pci/Makefile 2007-04-23 07:11:51.0 -0700 @@ -38,6 +38,7 @@ # obj-$(CONFIG_ACPI)+= pci-acpi.o +obj-$(CONFIG_DMAR) += dmar.o # Cardbus & CompactPCI use setup-bus obj-$(CONFIG_HOTPLUG) += setup-bus.o Index: linux-2.6.21-rc5/drivers/pci/dmar.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.21-rc5/drivers/pci/dmar.c 2007-04-23 07:12:00.0 -0700 @@ -0,0 +1,350 @@ +/* + * Copyright (c) 2006, Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple + * Place - Suite 330, Boston, MA 02111-1307 USA. + * + * Copyright (C) Ashok Raj <[EMAIL PROTECTED]> + * Copyright (C) Shaohua Li <[EMAIL PROTECTED]> + */ + +#include +#include +#include + +#include "dmar.h" + +#undef PREFIX +#define PREFIX "DMAR:" + +#define MIN_SCOPE_LEN (sizeof(struct acpi_dmar_pci_path) + \ + sizeof(struct acpi_dmar_device_scope)) + +LIST_HEAD(dmar_drhd_units); +LIST_HEAD(dmar_rmrr_units); +u8 dmar_host_address_width; + +static struct acpi_table_header *dmar_tbl; + +static int __init dmar_register_drhd_unit(struct dmar_drhd_unit *drhd) +{ + /* +* add INCLUDE_ALL at the tail, so scan the list will find it at +* the very end. 
+*/ + if (drhd->include_all) + list_add_tail(&drhd->list, &dmar_drhd_units); + else + list_add(&drhd->list, &dmar_drhd_units); + return 0; +} + +static int __init dmar_register_rmrr_unit(struct dmar_rmrr_unit *rmrr) +{ + list_add(&rmrr->list, &dmar_rmrr_units); + return 0; +} + +static int dmar_pci_device_match(struct pci_dev *devices[], int cnt, +struct pci_dev *dev) +{ + int index; + + while (dev) { + for (index = 0; index < cnt; index ++) + if (dev == devices[index]) + return 1; + + /* Check our parent */ + dev = dev->bus->self; + } + + return 0; +} + +struct dmar_drhd_unit * dmar_find_matched_drhd_unit(struct pci_dev *dev) +{ + struct dmar_drhd_unit *drhd = NULL; + + list_for_each_entry(drhd, &dmar_drhd_units, list) { + if (drhd->include_all || dmar_pci_device_match(drhd->devices, + drhd->devices_cnt, dev)) + break; + } + + return drhd; +} + +struct dmar_rmrr_unit * dmar_find_matched_rmrr_unit(struct pci_dev *dev) +{ + struct dmar_rmrr_unit *rmrr; + + list_for_each_entry(rmrr, &dmar_rmrr_units, list) { + if (dmar_pci_device_match(rmrr->devices, + rmrr->devices_cnt, dev)) + goto out; + } + rmrr = NULL; +out: + return rmrr; +} + +static int __init dmar_parse_one_dev_scope(struct acpi_dmar_device_scope *scope, + struct pci_dev **dev, u16 segment) +{ + struct pci_bus *bus; + struct pci_dev *pdev = NULL; +
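For orientation, a minimal sketch of how a lookup through these lists is expected to behave, using only names from this patch (the wrapper function itself is illustrative):

void show_drhd_match(struct pci_dev *pdev)
{
	struct dmar_drhd_unit *drhd;

	/*
	 * A device matches a DRHD if it, or one of its parent bridges,
	 * appears in the unit's device scope.  The INCLUDE_ALL unit is
	 * registered at the list tail, so it only matches when no
	 * explicit scope did.
	 */
	drhd = dmar_find_matched_drhd_unit(pdev);
	if (drhd && drhd->include_all)
		printk(KERN_DEBUG "%s matched only the catch-all DRHD\n",
		       pci_name(pdev));
}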
[Intel IOMMU][patch 0/8] Intel IOMMU Support.
Hello again! Andrew: Could you help include this in -mm to give it more exposure, preparing for mainline inclusion with more testing? This is a resend of the patches after addressing some feedback we received. 1. As Len requested, we moved most of the ACPI parts to drivers/pci instead of leaving them in drivers/acpi, including some renaming of functions, using just acpi_get_table() only. 2. Made the guard page support configurable. 3. Added a new CONFIG option to allocate consecutive addresses instead of re-using free addresses, as an experimental option; not validated extensively, but it's expected to improve certain cases... 4. Fixed a couple of minor bugs that got exposed in testing. Other feedback: - Some suggested a dependency on ACPI, but that's not doable for several reasons. - Graphics 1-1 maps exist only for compatibility until graphics drivers start calling the PCI map functions, including user-space X that might be using /dev/mem. Some more interesting possibilities -- enhancements to work on. In order to ensure we don't break any driver that may not be using the DMA APIs, here are some suggestions to work on (a sketch follows below): - Create a single 1-1 map, and make sure any PCI device gets this map when it does a pci_set_master() to enable bus mastering automatically. - When the device driver makes its first call to do a DMA mapping, dissociate the device from the unity-map domain into its own. This gives limited protection but doesn't break drivers that do not use DMA mapping. - Create context entries only one per segment; today we create one per IOMMU, which is not required. This way we can avoid doing some of the workarounds we do for some devices today, and it will function as a default container for compatibility. On one hand, this will provide more compatibility, but we will lose the opportunity to identify broken device drivers that don't use the DMA APIs and fix them. Depending on who you talk to.. some like it.. some just hate it! and would like to fix the broken ones instead. Cheers, Ashok Raj --
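A rough sketch of that lazy-dissociation idea, to make the flow concrete. Nothing here is in the posted series: device_in_unity_domain() and iommu_detach_unity_domain() are hypothetical helpers, and the __intel_map_single() signature is abbreviated in the posted diffs, so it is assumed here:

static dma_addr_t intel_map_single(struct device *hwdev, void *addr,
				   size_t size, int dir)
{
	struct pci_dev *pdev = to_pci_dev(hwdev);

	/*
	 * First DMA-API call: this driver is well behaved, so move the
	 * device out of the shared 1-1 (unity) domain, attached at
	 * pci_set_master() time, into a private domain of its own.
	 */
	if (device_in_unity_domain(pdev))		/* hypothetical */
		iommu_detach_unity_domain(pdev);	/* hypothetical */

	return __intel_map_single(hwdev, addr, size, dir);
}

Drivers that never call the DMA API would simply stay in the unity domain, trading protection for compatibility.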
[Intel IOMMU][patch 4/8] Supporting Zero Length Reads in Intel IOMMU.
PCI specs permit zero-length reads (ZLR) even if the mapping for that region is write-only. Support for this feature is indicated by the presence of a bit in the DMAR capability. If a particular DMAR does not support this capability, we map write-only regions as read-write. This option can also provide a workaround for some drivers that request a write-only mapping when they really should request read-write. (We ran into one such case in eepro100.c in handling rx_ring_dma.) Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> -- drivers/pci/intel-iommu.c |7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) Index: 2.6.21-rc6/drivers/pci/intel-iommu.c === --- 2.6.21-rc6.orig/drivers/pci/intel-iommu.c 2007-04-18 09:04:56.0 +0800 +++ 2.6.21-rc6/drivers/pci/intel-iommu.c2007-04-18 09:04:59.0 +0800 @@ -84,7 +84,7 @@ struct iommu { struct sys_device sysdev; }; -static int dmar_disabled; +static int dmar_disabled, dmar_force_rw; static char *get_fault_reason(u8 fault_reason) { @@ -102,6 +102,9 @@ static int __init intel_iommu_setup(char if (!strncmp(str, "off", 3)) { dmar_disabled = 1; printk(KERN_INFO "Intel-IOMMU: disabled\n"); + } else if (!strncmp(str, "forcerw", 7)) { + dmar_force_rw = 1; + printk(KERN_INFO "Intel-IOMMU: force R/W for W/O mapping\n"); } str += strcspn(str, ","); while (*str == ',') str++; @@ -1720,7 +1723,12 @@ static dma_addr_t __intel_map_single(str goto error; } - if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL) + /* +* Check if the DMAR supports zero-length reads on write-only +* mappings. +*/ + if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL || + !cap_zlr(domain->iommu->cap) || dmar_force_rw) prot |= DMA_PTE_READ; if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL) prot |= DMA_PTE_WRITE; Index: 2.6.21-rc6/include/linux/intel-iommu.h === --- 2.6.21-rc6.orig/include/linux/intel-iommu.h 2007-04-18 09:04:56.0 +0800 +++ 2.6.21-rc6/include/linux/intel-iommu.h 2007-04-18 09:04:59.0 +0800 @@ -79,6 +79,7 @@ #define cap_max_fault_reg_offset(c) \ (cap_fault_reg_offset(c) + cap_num_fault_regs(c) * 16) +#define cap_zlr(c) (((c) >> 22) & 1) #define cap_isoch(c) (((c) >> 23) & 1) #define cap_mgaw(c) ((((c) >> 16) & 0x3f) + 1) #define cap_sagaw(c) (((c) >> 8) & 0x1f) --
[Intel IOMMU][patch 6/8] Doc updates for Intel Virtualization Technology for Directed I/O.
Document the Intel IOMMU driver boot options. Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> Index: linux-2.6.21-rc5/Documentation/Intel-IOMMU.txt === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.21-rc5/Documentation/Intel-IOMMU.txt 2007-04-17 05:41:56.0 -0700 @@ -0,0 +1,119 @@ +Linux IOMMU Support +=== + +The architecture spec can be obtained from the location below. + +http://www.intel.com/technology/virtualization/ + +This guide gives a quick cheat sheet for some basic understanding. + +Some Keywords + +DMAR - DMA remapping +DRHD - DMA Engine Reporting Structure +RMRR - Reserved Memory Region Reporting Structure +ZLR - Zero-length reads from PCI devices +IOVA - I/O Virtual Address + +Basic stuff --- + +ACPI enumerates and lists the different DMA engines in the platform, and the +device scope relationships between PCI devices and which DMA engine controls +them. + +What is RMRR? +- + +There are some devices the BIOS controls, e.g. USB devices performing +PS/2 emulation. The regions of memory used for these devices are marked +reserved in the e820 map. When we turn on DMA translation, DMA to those +regions will fail. Hence the BIOS uses RMRR to specify these regions along with +the devices that need to access them. The OS is expected to set up +unity mappings for these regions so these devices can access them. + How is IOVA generated? +- + Well-behaved drivers call pci_map_*() before sending a command to a device +that needs to perform DMA. Once the DMA is completed and the mapping is no longer +required, the driver calls pci_unmap_*() to unmap the region. + The Intel IOMMU driver allocates a virtual address space per domain. Each PCIe +device has its own domain (hence protection). Devices under P2P bridges +share the virtual address space with all devices under the same bridge due to +transaction ID aliasing for P2P bridges. + IOVA generation is pretty generic. We use the same technique as vmalloc(), +but these are not global address spaces; they are separate for each domain. +Different DMA engines may support different numbers of domains. + We also allocate guard pages with each mapping, so we can attempt to catch +any overflow that might happen. + + Graphics Problems? +-- +If you encounter issues with graphics devices, you can try adding the +option intel_iommu=igfx_off to turn off the integrated graphics engine. + +If it happens to be a PCI device included in the INCLUDE_ALL engine, +then try intel_iommu=gfx_workaround to set up a 1-1 map. We hear +graphics drivers may be in the process of moving to the DMA APIs in the near +future. + Some exceptions to IOVA +--- +Interrupt ranges are not address translated (0xfee00000 - 0xfeefffff). +The same is true for peer-to-peer transactions. Hence we reserve these +addresses from the PCI MMIO ranges so they are not allocated for IOVA addresses. + + Fault reporting +--- +When errors are reported, the DMA engine signals via an interrupt. The fault +reason and the device that caused it are printed on the console. + +See below for a sample. + + Boot Message Sample +--- + +Something like this gets printed indicating the presence of DMAR tables +in ACPI. + +ACPI: DMAR (v001 A M I OEMDMAR 0x0001 MSFT 0x0097) @ 0x7f5b5ef0 + +When the DMAR table is processed and initialized by ACPI, the DMAR locations +and any RMRRs processed are printed.
+ +ACPI DMAR:Host address width 36 +ACPI DMAR:DRHD (flags: 0x)base: 0xfed9 +ACPI DMAR:DRHD (flags: 0x)base: 0xfed91000 +ACPI DMAR:DRHD (flags: 0x0001)base: 0xfed93000 +ACPI DMAR:RMRR base: 0x000ed000 end: 0x000e +ACPI DMAR:RMRR base: 0x7f60 end: 0x7fff + +When DMAR is enabled for use, you will notice: + +PCI-DMA: Using DMAR IOMMU + +Fault reporting +--- + +DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 +DMAR:[fault reason 05] PTE Write access is not set +DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 +DMAR:[fault reason 05] PTE Write access is not set + +TBD + +- No performance tuning / analysis yet. +- sysfs needs useful data to be populated. + DMAR info, device scope, and stats could be exposed to some extent. +- Add support to the Firmware Developer Kit to test ACPI tables for DMAR. +- For compatibility testing, could use a unity-map domain for all devices: just + provide a 1-1 map for all useful memory under a single domain. +- API for paravirt ops for abstracting functionality for VMM folks. Index: linux-2.6.21-rc5/Documentation/kernel-parameters.txt === --- linux-2.6.21-rc5.orig/Documentation/kernel-parameters.txt 2007-04-17 04:59:42.0 -070
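To make the guard-page note in the document above concrete, a minimal sketch of the allocation-side arithmetic. PAGE_SIZE_4K, PAGE_ALIGN_4K and alloc_iova() are names from this series, but the surrounding function is illustrative and the alloc_iova() argument list is abbreviated in the posted diffs, so it is guessed here:

static u64 iova_alloc_with_guard(struct domain *domain, u64 addr, size_t size)
{
	size = PAGE_ALIGN_4K(size);
#ifdef CONFIG_IOVA_GUARD_PAGE
	size += PAGE_SIZE_4K;	/* trailing, never-mapped guard page */
#endif
	/*
	 * Only the real pages get page-table entries; the guard page
	 * stays unmapped, so a device running past its buffer faults
	 * instead of silently corrupting the neighbouring mapping.
	 */
	return alloc_iova(domain, addr, size);
}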
[Intel IOMMU][patch 5/8] Graphics driver workarounds to provide unity map
Most GFX drivers don't call the standard PCI DMA APIs to allocate DMA buffers. Such drivers will be broken with the IOMMU enabled. To work around this issue, we added two options. Once graphics devices are converted over to use the DMA APIs, this entire patch can be removed... a. intel_iommu=igfx_off. With this option, a DMAR that has only gfx devices under it will be ignored. This mostly affects integrated gfx devices. If the DMAR is ignored, gfx devices under it will get physical addresses for DMA. b. intel_iommu=gfx_workaround. With this option, we will set up a 1:1 mapping of whole memory for gfx devices, that is, the physical address equals the virtual address. In this way, gfx will use physical addresses for DMA; this is primarily for add-in card GFX devices. Signed-off-by: Ashok Raj <[EMAIL PROTECTED]> Signed-off-by: Shaohua Li <[EMAIL PROTECTED]> Index: 2.6.21-rc6/arch/x86_64/kernel/e820.c === --- 2.6.21-rc6.orig/arch/x86_64/kernel/e820.c 2007-04-20 11:03:01.0 +0800 +++ 2.6.21-rc6/arch/x86_64/kernel/e820.c2007-04-20 11:45:56.0 +0800 @@ -730,3 +730,22 @@ __init void e820_setup_gap(void) printk(KERN_INFO "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n", pci_mem_start, gapstart, gapsize); } + +int __init arch_get_ram_range(int slot, u64 *addr, u64 *size) +{ + int i; + + if (slot < 0 || slot >= e820.nr_map) + return -1; + for (i = slot; i < e820.nr_map; i++) { + if (e820.map[i].type != E820_RAM) + continue; + break; + } + if (i == e820.nr_map || e820.map[i].addr > (max_pfn << PAGE_SHIFT)) + return -1; + *addr = e820.map[i].addr; + *size = min_t(u64, e820.map[i].size + e820.map[i].addr, + max_pfn << PAGE_SHIFT) - *addr; + return i + 1; +} Index: 2.6.21-rc6/drivers/pci/dmar.h === --- 2.6.21-rc6.orig/drivers/pci/dmar.h 2007-04-20 11:38:30.0 +0800 +++ 2.6.21-rc6/drivers/pci/dmar.h 2007-04-20 11:45:56.0 +0800 @@ -35,6 +35,7 @@ struct dmar_drhd_unit { int devices_cnt; u8 include_all:1; struct iommu *iommu; + int ignored:1; /* the drhd should be ignored */ }; struct dmar_rmrr_unit { Index: 2.6.21-rc6/drivers/pci/intel-iommu.c === --- 2.6.21-rc6.orig/drivers/pci/intel-iommu.c 2007-04-20 11:45:52.0 +0800 +++ 2.6.21-rc6/drivers/pci/intel-iommu.c2007-04-20 11:45:56.0 +0800 @@ -36,6 +36,7 @@ #include "iova.h" #include "pci.h" +#define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY) #define IOAPIC_RANGE_START (0xfee00000) #define IOAPIC_RANGE_END (0xfeefffff) #define IOAPIC_RANGE_SIZE (IOAPIC_RANGE_END - IOAPIC_RANGE_START + 1) @@ -85,6 +86,7 @@ struct iommu { }; static int dmar_disabled, dmar_force_rw; +static int dmar_map_gfx = 1, dmar_no_gfx_identity_map = 1; static char *get_fault_reason(u8 fault_reason) { @@ -105,7 +107,14 @@ static int __init intel_iommu_setup(char } else if (!strncmp(str, "forcerw", 7)) { dmar_force_rw = 1; printk(KERN_INFO "Intel-IOMMU: force R/W for W/O mapping\n"); + } else if (!strncmp(str, "igfx_off", 8)) { + dmar_map_gfx = 0; + printk(KERN_INFO "Intel-IOMMU: disable GFX device mapping\n"); + } else if (!strncmp(str, "gfx_workaround", 14)) { + dmar_no_gfx_identity_map = 0; + printk(KERN_INFO "Intel-IOMMU: do 1-1 mapping of whole physical memory for GFX devices\n"); } + str += strcspn(str, ","); while (*str == ',') str++; @@ -1318,6 +1327,7 @@ struct device_domain_info { struct pci_dev *dev; /* it's NULL for PCIE-to-PCI bridge */ struct domain *domain; }; +#define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1)) static DEFINE_SPINLOCK(device_domain_lock); static LIST_HEAD(device_domain_list); @@ -1538,10 +1548,40 @@ error: static inline int iommu_prepare_rmrr_dev(struct
dmar_rmrr_unit *rmrr, struct pci_dev *pdev) { + if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO) + return 0; return iommu_prepare_identity_map(pdev, rmrr->base_address, rmrr->end_address + 1); } +static void iommu_prepare_gfx_mapping(void) +{ + struct pci_dev *pdev = NULL; + u64 base, size; + int slot; + int ret; + + if (dmar_no_gfx_identity_m
Re: [Intel IOMMU][patch 1/8] ACPI support for Intel Virtualization Technology for Directed I/O
On Tue, Apr 24, 2007 at 08:50:48PM +0200, Andi Kleen wrote: > > > + > > +LIST_HEAD(dmar_drhd_units); > > +LIST_HEAD(dmar_rmrr_units); > > Comment describing what lock protects those lists? > In fact there seems to be no locking. What about hotplug? > There is no support for IOMMU hotplug at this time; it would require additional ACPI support that does not exist yet. These definitions are scanned from the BIOS tables at boot time and are pretty much static data, so no locking is required. We treat them as read-only, and the information never changes after the initial parsing.
Re: [Intel IOMMU][patch 8/8] Preserve some Virtual Address when devices cannot address entire range.
On Tue, Apr 24, 2007 at 09:33:15PM +0200, Andi Kleen wrote: > On Tuesday 24 April 2007 08:03:07 Ashok Raj wrote: > > Some devices may not support entire 64bit DMA. In a situation where such > > devices are co-located in a shared domain, we need to ensure there is some > > address space reserved for such devices without the low addresses getting > > depleted by other devices capable of handling high dma addresses. > > Sorry, but you need to find some way to make this usually work without special > options. Otherwise users will be unhappy. > > An possible way would be to allocate space upside down from the limit of the > device. Then the lower areas should be usually free. > With PCIe there is some benefit to keeping DMA addresses low for performance reasons, since the device can then use 32-bit transaction layer packets instead of 64-bit ones. This reservation is only required if we have some legacy device under a P2P bridge where it is required to share its address space with other devices. We could implement a default when one is not specified, to keep things simple.
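A sketch of that default, reusing dmar_preserve_iova_mask from patch 8/8; picking DMA_32BIT_MASK when the admin sets nothing would keep 64-bit-capable devices above 4G in shared domains without any boot option (the helper itself is illustrative):

static u64 preserve_mask(void)
{
	/* default to preserving the low 4G if nothing was specified */
	return dmar_preserve_iova_mask ? dmar_preserve_iova_mask
				       : DMA_32BIT_MASK;
}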
Re: [Intel IOMMU][patch 7/8] Support for legacy ISA devices
On Tue, Apr 24, 2007 at 09:31:09PM +0200, Andi Kleen wrote: > On Tuesday 24 April 2007 08:03:06 Ashok Raj wrote: > > Floppy disk drivers dont work well with DMA remapping. > > What is the problem? You can't allocate mappings <16MB? No, these drivers don't call the DMA mapping APIs; that's the problem. > > > Its possible to > > extend the current use for x86_64, but the gain is very little. If someone > > feels compelled to clean this up, its up for grabs. Since these use 16M, we > > just provide a unity map for the ISA bridge device. > > > > While it's probably not worth for the floppy there are other devices > with similar weird addressing limitations. Some generic handling of it > would be nice. > In the intro we outlined a way to handle this via a generic unity map for all devices. We could do that, i.e. implement a generic 1-1 map if the device is not calling the DMA APIs, and dynamically dissociate it if the device does start using them. For some of the address reservation as well, we could use set_dma_mask() to ensure there is some DMA space. The problem is that some drivers may not use the DMA APIs at all. It might also be difficult to handle a hotplugged device that has a weird requirement.
Re: [Intel IOMMU][patch 4/8] Supporting Zero Length Reads in Intel IOMMU.
On Tue, Apr 24, 2007 at 09:28:11PM +0200, Andi Kleen wrote: > On Tuesday 24 April 2007 08:03:03 Ashok Raj wrote: > > PCI specs permit zero length reads (ZLR) even if the mapping for that > > region > > is write only. Support for this feature is indicated by the presence of a > > bit > > in the DMAR capability. If a particular DMAR does not support this > > capability > > we map write-only regions as read-write. > > > > This option can also provides a workaround for some drivers that request > > a write-only mapping when they really should request a read-write. > > (We ran into one such case in eepro100.c in handling rx_ring_dma) > > Better just fix the drivers instead of adding such hacks Some of the early DMARs don't handle zero-length reads as required. Hardware that supports them correctly will advertise this via its capabilities. We could remove the command-line option, since it should not really be required.
Re: [Intel IOMMU][patch 3/8] Generic hardware support for Intel IOMMU.
On Tue, Apr 24, 2007 at 09:27:08PM +0200, Andi Kleen wrote: > On Tuesday 24 April 2007 08:03:02 Ashok Raj wrote: > > > > +#ifdef CONFIG_DMAR > > +#ifdef CONFIG_SMP > > +static void dmar_msi_set_affinity(unsigned int irq, cpumask_t mask) > > > Why does it need an own interrupt type? The problem is that it's an MSI-type interrupt, but we cannot use a pci_dev since it's not a PCI device; hence it requires its own setup path. > > > + > > +config IOVA_GUARD_PAGE > > + bool "Enables gaurd page when allocating IO Virtual Address for IOMMU" > > + depends on DMAR + > > +config IOVA_NEXT_CONTIG > > + bool "Keeps IOVA allocations consequent between allocations" > > + depends on DMAR && EXPERIMENTAL > > Needs reference to Intel and better description > > The file should have a high level description what it is good for etc. > > Need high level overview over what locks protects what and if there > is a locking order. > > It doesn't seem to enable sg merging? Since you have enough space > that should work. Most of the IOVA stuff is really generic and could be used outside the Intel code with probably some rework. Since today only DMAR requires it, we have it depend on DMAR, but we could make it more generic and let the IOMMU driver just turn it on as required. > > > +static char *fault_reason_strings[] = > > +{ > > + "Software", > > + "Present bit in root entry is clear", > > + "Present bit in context entry is clear", > > + "Invalid context entry", > > + "Access beyond MGAW", > > + "PTE Write access is not set", > > + "PTE Read access is not set", > > + "Next page table ptr is invalid", > > + "Root table address invalid", > > + "Context table ptr is invalid", > > + "non-zero reserved fields in RTP", > > + "non-zero reserved fields in CTP", > > + "non-zero reserved fields in PTE", > > + "Unknown" > > +}; > > + > > +#define MAX_FAULT_REASON_IDX (12) > > > You got 14 of them. better use ARRAY_SIZE It's the last (zero-based) index of the useful entries; it is only used to find out if the index from the fault record is out of bounds. We will work on the remaining comments and repost.
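What Andi is suggesting, roughly: derive the bound from the table itself so the two cannot drift apart. A sketch against the fault_reason_strings[] quoted above, whose last entry is "Unknown":

static char *fault_reason(u8 idx)
{
	/* clamp out-of-range indices to the trailing "Unknown" entry */
	if (idx >= ARRAY_SIZE(fault_reason_strings))
		idx = ARRAY_SIZE(fault_reason_strings) - 1;
	return fault_reason_strings[idx];
}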
Re: [Intel IOMMU][patch 6/8] Doc updates for Intel Virtualization Technology for Directed I/O.
On Tue, Apr 24, 2007 at 11:17:55PM +0200, Markus Rechberger wrote: > >+We also allocate gaurd pages with each mapping, so we can attempt to catch > >+any overflow that might happen. > >+ > guess you probably mean guard tables here... > So there is a good chance i can be "The Governor of California" :-)
Re: [Intel IOMMU][patch 8/8] Preserve some Virtual Address when devices cannot address entire range.
On Tue, Apr 24, 2007 at 02:23:51PM -0700, David Miller wrote: > From: Andi Kleen <[EMAIL PROTECTED]> > Date: Tue, 24 Apr 2007 23:12:54 +0200 > > > We already have a couple of other IOMMU architectures who essentially have > > the same > > problem. Have you checked how they solve this? > > Sparc64, for one, only uses 32-bit IOMMU addresses. And we simply > don't try to handle the funny devices at all, in fact we can't > handle the ones that want mappings only in the low 16MB for > example since the hardware IOMMU window usually starts in the > middle of the 32-bit PCI address space. > > We do it both because that's faster due to Single Address Cycles, as > mentioned, and also because that's simply is where the hardware's > IOMMU window is. You can't use 64-bit IOMMU addresses even if you > wanted to on sparc64. > > My suggestion would be to allocate top-down in the 32-bit IOMMU space. > > That might work, but my gut feeling is that this won't be sufficient > and we'll need some kind of device driver initiated reservation > mechanism for the <16MB et al. weird stuff. It's not clear that we have very generic device breakage; most devices on these platforms are going to be more recent (except maybe some legacy floppy devices)... Maybe we should wait to fix this unless we are certain there are more of them that break on these platforms. We could choose to use the generic 1-1 map for those weird cases: since the driver already ensures today that the physical memory is in the low 16M, we could just turn on a 1-1 domain for such devices without breaking anything.
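Dave's top-down suggestion as a sketch; find_free_iova_below() is a hypothetical helper, only DMA_32BIT_MASK and the general shape come from this thread:

static u64 alloc_iova_top_down(struct domain *domain, size_t size, u64 dma_mask)
{
	/* never hand out anything above the 32-bit boundary */
	u64 limit = min_t(u64, dma_mask, DMA_32BIT_MASK);

	/*
	 * Search downward from 'limit', highest fit first: 64-bit
	 * capable devices take the high end, so the low megabytes
	 * stay free for the rare <16MB / <2GB devices without any
	 * explicit reservation.
	 */
	return find_free_iova_below(domain, limit, size);	/* hypothetical */
}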
Re: [Intel IOMMU][patch 8/8] Preserve some Virtual Address when devices cannot address entire range.
On Wed, Apr 25, 2007 at 12:03:57AM +0200, Andi Kleen wrote: > On Tuesday 24 April 2007 23:50:26 David Miller wrote: > > From: Ashok Raj <[EMAIL PROTECTED]> > > Date: Tue, 24 Apr 2007 14:38:35 -0700 > > > > > Its not clear if we have a very generic device breakage.. most devices > > > on these platforms are going to be more recent, (except maybe some > > > legacy fd)... > > > > I'm not so sure, there are some "modern" sound cards that have > > a 31-bit DMA addressing limitation because they use the 31st > > bit as a status bit in their DMA descriptors :-) > > There's also a 2GB only megaraid RAID controller that's pretty popular > because Dell shipped it for a long time. Sounds like we have quite a few of those weird ones! The real question is what the working set of mapped DMA handles is for such a controller. They would typically allocate only as many as the controller could handle, and would resubmit when the I/O completes, right? So typically we shouldn't have any trouble, since they would be able to reclaim what they freed before submission (for IOVA). Having an IOVA requirement below 2G etc. is not a problem, except for how many devices are on the same PCI bus and what the total IOVA working set for that config is. The only way to guarantee this would be for the device to ask for a guaranteed set, maybe during pci_set_dma_mask() or some such time, and pre-reserve some IOVA to guarantee we never run out. But this again means driver changes, and it won't be fair if some driver is greedy.
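One shape the guaranteed-set idea could take; this is an entirely hypothetical API (iommu_reserve_iova() is invented), it only pins down what "ask during pci_set_dma_mask()" might look like:

int pci_set_dma_mask_reserved(struct pci_dev *pdev, u64 mask,
			      size_t working_set)
{
	int ret = pci_set_dma_mask(pdev, mask);

	if (ret)
		return ret;

	/*
	 * Pre-reserve 'working_set' bytes of IOVA space below 'mask',
	 * so this device can never be starved by 64-bit-capable
	 * neighbours sharing its domain.
	 */
	return iommu_reserve_iova(pdev, mask, working_set);	/* invented */
}

The fairness problem remains, of course: nothing here stops a greedy driver from over-reserving.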
[PATCH 1/2] PCI: Cache PRI and PASID bits in pci_dev
From: Jean-Philippe Brucker Device drivers need to check if an IOMMU enabled ATS, PRI and PASID in order to know when they can use the SVM API. Cache PRI and PASID bits in the pci_dev structure, similarly to what is currently done for ATS. Signed-off-by: Jean-Philippe Brucker --- drivers/pci/ats.c | 23 +++ include/linux/pci.h | 2 ++ 2 files changed, 25 insertions(+) diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c index eeb9fb2..2126497 100644 --- a/drivers/pci/ats.c +++ b/drivers/pci/ats.c @@ -153,6 +153,9 @@ int pci_enable_pri(struct pci_dev *pdev, u32 reqs) u32 max_requests; int pos; + if (WARN_ON(pdev->pri_enabled)) + return -EBUSY; + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PRI); if (!pos) return -EINVAL; @@ -170,6 +173,8 @@ int pci_enable_pri(struct pci_dev *pdev, u32 reqs) control |= PCI_PRI_CTRL_ENABLE; pci_write_config_word(pdev, pos + PCI_PRI_CTRL, control); + pdev->pri_enabled = 1; + return 0; } EXPORT_SYMBOL_GPL(pci_enable_pri); @@ -185,6 +190,9 @@ void pci_disable_pri(struct pci_dev *pdev) u16 control; int pos; + if (WARN_ON(!pdev->pri_enabled)) + return; + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PRI); if (!pos) return; @@ -192,6 +200,8 @@ void pci_disable_pri(struct pci_dev *pdev) pci_read_config_word(pdev, pos + PCI_PRI_CTRL, &control); control &= ~PCI_PRI_CTRL_ENABLE; pci_write_config_word(pdev, pos + PCI_PRI_CTRL, control); + + pdev->pri_enabled = 0; } EXPORT_SYMBOL_GPL(pci_disable_pri); @@ -207,6 +217,9 @@ int pci_reset_pri(struct pci_dev *pdev) u16 control; int pos; + if (WARN_ON(pdev->pri_enabled)) + return -EBUSY; + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PRI); if (!pos) return -EINVAL; @@ -239,6 +252,9 @@ int pci_enable_pasid(struct pci_dev *pdev, int features) u16 control, supported; int pos; + if (WARN_ON(pdev->pasid_enabled)) + return -EBUSY; + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PASID); if (!pos) return -EINVAL; @@ -259,6 +275,8 @@ int pci_enable_pasid(struct pci_dev *pdev, int features) pci_write_config_word(pdev, pos + PCI_PASID_CTRL, control); + pdev->pasid_enabled = 1; + return 0; } EXPORT_SYMBOL_GPL(pci_enable_pasid); @@ -273,11 +291,16 @@ void pci_disable_pasid(struct pci_dev *pdev) u16 control = 0; int pos; + if (WARN_ON(!pdev->pasid_enabled)) + return; + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PASID); if (!pos) return; pci_write_config_word(pdev, pos + PCI_PASID_CTRL, control); + + pdev->pasid_enabled = 0; } EXPORT_SYMBOL_GPL(pci_disable_pasid); diff --git a/include/linux/pci.h b/include/linux/pci.h index eb3da1a..bee980e 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -351,6 +351,8 @@ struct pci_dev { unsigned intmsix_enabled:1; unsigned intari_enabled:1; /* ARI forwarding */ unsigned intats_enabled:1; /* Address Translation Service */ + unsigned intpasid_enabled:1;/* Process Address Space ID */ + unsigned intpri_enabled:1; /* Page Request Interface */ unsigned intis_managed:1; unsigned intneeds_freset:1; /* Dev requires fundamental reset */ unsigned intstate_saved:1; -- 2.7.4
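A sketch of a consumer, assuming this patch is applied; setup_svm() and the request count of 32 are illustrative, while pci_enable_pri()/pci_enable_pasid() and the new cached bits are the real interfaces:

static int setup_svm(struct pci_dev *pdev)
{
	if (pci_enable_pasid(pdev, 0))		/* no EXEC/PRIV features */
		return -ENODEV;

	if (pci_enable_pri(pdev, 32)) {		/* 32 outstanding page requests */
		pci_disable_pasid(pdev);
		return -ENODEV;
	}

	/* the cached bits now mirror hardware state for the IOMMU driver */
	WARN_ON(!pdev->pasid_enabled || !pdev->pri_enabled);
	return 0;
}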
[PATCH 0/2] Save and restore pci properties to support FLR
Resending Jean's patch, as he requested, so it can be included ahead of his large SVM commits. The original patch https://patchwork.kernel.org/patch/9593891 was acked by Bjorn. Let's commit these separately, since we need the functionality earlier. CQ Tang (1): PCI: Save properties required to handle FLR for replay purposes. Jean-Philippe Brucker (1): PCI: Cache PRI and PASID bits in pci_dev drivers/pci/ats.c | 88 - drivers/pci/pci.c | 3 ++ include/linux/pci-ats.h | 10 ++ include/linux/pci.h | 8 + 4 files changed, 94 insertions(+), 15 deletions(-) -- 2.7.4
[PATCH 2/2] PCI: Save properties required to handle FLR for replay purposes.
From: CQ Tang Requires: https://patchwork.kernel.org/patch/9593891 After an FLR, PCI state needs to be restored. This patch caches the enabled PASID features and the PRI request allocation so they can be replayed. To: Bjorn Helgaas To: Joerg Roedel To: linux-...@vger.kernel.org To: linux-kernel@vger.kernel.org Cc: Jean-Philippe Brucker Cc: David Woodhouse Cc: io...@lists.linux-foundation.org Signed-off-by: CQ Tang Signed-off-by: Ashok Raj --- drivers/pci/ats.c | 65 + drivers/pci/pci.c | 3 +++ include/linux/pci-ats.h | 10 include/linux/pci.h | 6 + 4 files changed, 69 insertions(+), 15 deletions(-) diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c index 2126497..a769955 100644 --- a/drivers/pci/ats.c +++ b/drivers/pci/ats.c @@ -160,17 +160,16 @@ int pci_enable_pri(struct pci_dev *pdev, u32 reqs) if (!pos) return -EINVAL; - pci_read_config_word(pdev, pos + PCI_PRI_CTRL, &control); pci_read_config_word(pdev, pos + PCI_PRI_STATUS, &status); - if ((control & PCI_PRI_CTRL_ENABLE) || - !(status & PCI_PRI_STATUS_STOPPED)) + if (!(status & PCI_PRI_STATUS_STOPPED)) return -EBUSY; pci_read_config_dword(pdev, pos + PCI_PRI_MAX_REQ, &max_requests); reqs = min(max_requests, reqs); + pdev->pri_reqs_alloc = reqs; pci_write_config_dword(pdev, pos + PCI_PRI_ALLOC_REQ, reqs); - control |= PCI_PRI_CTRL_ENABLE; + control = PCI_PRI_CTRL_ENABLE; pci_write_config_word(pdev, pos + PCI_PRI_CTRL, control); pdev->pri_enabled = 1; @@ -206,6 +205,29 @@ void pci_disable_pri(struct pci_dev *pdev) EXPORT_SYMBOL_GPL(pci_disable_pri); /** + * pci_restore_pri_state - Restore PRI + * @pdev: PCI device structure + * + */ +void pci_restore_pri_state(struct pci_dev *pdev) +{ + u16 control = PCI_PRI_CTRL_ENABLE; + u32 reqs = pdev->pri_reqs_alloc; + int pos; + + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PRI); + if (!pos) + return; + + if (!pdev->pri_enabled) + return; + + pci_write_config_dword(pdev, pos + PCI_PRI_ALLOC_REQ, reqs); + pci_write_config_word(pdev, pos + PCI_PRI_CTRL, control); +} +EXPORT_SYMBOL_GPL(pci_restore_pri_state); + +/** * pci_reset_pri - Resets device's PRI state * @pdev: PCI device structure * @@ -224,12 +246,7 @@ int pci_reset_pri(struct pci_dev *pdev) if (!pos) return -EINVAL; - pci_read_config_word(pdev, pos + PCI_PRI_CTRL, &control); - if (control & PCI_PRI_CTRL_ENABLE) - return -EBUSY; - - control |= PCI_PRI_CTRL_RESET; + control = PCI_PRI_CTRL_RESET; pci_write_config_word(pdev, pos + PCI_PRI_CTRL, control); return 0; @@ -259,12 +276,7 @@ int pci_enable_pasid(struct pci_dev *pdev, int features) if (!pos) return -EINVAL; - pci_read_config_word(pdev, pos + PCI_PASID_CTRL, &control); pci_read_config_word(pdev, pos + PCI_PASID_CAP, &supported); - - if (control & PCI_PASID_CTRL_ENABLE) - return -EINVAL; - supported &= PCI_PASID_CAP_EXEC | PCI_PASID_CAP_PRIV; /* User wants to enable anything unsupported? */ @@ -272,6 +284,7 @@ int pci_enable_pasid(struct pci_dev *pdev, int features) return -EINVAL; control = PCI_PASID_CTRL_ENABLE | features; + pdev->pasid_features = features; pci_write_config_word(pdev, pos + PCI_PASID_CTRL, control); @@ -305,6 +318,28 @@ void pci_disable_pasid(struct pci_dev *pdev) EXPORT_SYMBOL_GPL(pci_disable_pasid); /** + * pci_restore_pasid_state - Restore PASID capabilities.
+ * @pdev: PCI device structure + * + */ +void pci_restore_pasid_state(struct pci_dev *pdev) +{ + u16 control; + int pos; + + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PASID); + if (!pos) + return; + + if (!pdev->pasid_enabled) + return; + + control = PCI_PASID_CTRL_ENABLE | pdev->pasid_features; + pci_write_config_word(pdev, pos + PCI_PASID_CTRL, control); +} +EXPORT_SYMBOL_GPL(pci_restore_pasid_state); + +/** * pci_pasid_features - Check which PASID features are supported * @pdev: PCI device structure * diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 7904d02..c9a6510 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include #include @@ -1171,6 +1172,8 @@ void pci_restore_state(struct pci_dev *dev) /* PCI Express register must be restored first */ pci_restore_pcie_state(dev); + pci_restore_pasid_state(dev); + pci_restore_pri_state(dev); pci_restore_ats_state(dev); pci_restore_vc_state(dev); diff --git a/include/linux/pci-ats.h b/include/linux/pci-ats.h index 57e0b82..782fb8e 100644 --- a/include/linux/pci-ats.h +++ b/include/linux/pci-
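A sketch of the scenario these two patches address, assuming a device that already has PASID/PRI enabled (the wrapper is illustrative; pci_reset_function() is the existing API):

static int reset_keeping_svm(struct pci_dev *pdev)
{
	/*
	 * pci_reset_function() saves state, issues the reset (an FLR
	 * where supported) and then restores state; with this series
	 * the restore path also re-writes PCI_PASID_CTRL and
	 * PCI_PRI_ALLOC_REQ/PCI_PRI_CTRL from the cached
	 * pasid_features and pri_reqs_alloc values, so SVM keeps
	 * working after the reset.
	 */
	return pci_reset_function(pdev);
}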
[Patch V0] x86, mce: Don't clear global error reporting banks during cpu_offline
During CPU offline, or during suspend/resume operations, it's not safe to clear MCi_CTL. These MSRs are either thread scoped (private to the thread), core scoped (private to the threads in that core only), or socket scoped, i.e. visible and controllable from all threads in the socket. If we clear them during CPU_OFFLINE, offlining just a single CPU will stop signaling for all the socket-wide resources, such as the LLC or iMC. This is true for Intel CPUs, but there seems to be some history that other processors may require these to be turned off during every CPU offline. Intel Software Guard Extensions (SGX) will be disabled when these controls are cleared, for security reasons. This patch enables SGX to work across suspend/resume. - Consolidated some code for sharing. - Minor changes to some prototypes to fit usage. - Left handling the same for non-Intel CPU models to avoid any unknown regressions. Signed-off-by: Ashok Raj Reviewed-by: Tony Luck Tested-by: Serge Ayoun --- arch/x86/kernel/cpu/mcheck/mce.c | 38 -- 1 file changed, 28 insertions(+), 10 deletions(-) diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index d350858..5498a79 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -2100,7 +2100,7 @@ int __init mcheck_init(void) * Disable machine checks on suspend and shutdown. We can't really handle * them later. */ -static int mce_disable_error_reporting(void) +static void mce_disable_error_reporting(void) { int i; @@ -2110,17 +2110,40 @@ static void mce_disable_error_reporting(void) if (b->init) wrmsrl(MSR_IA32_MCx_CTL(i), 0); } - return 0; + return; +} + +static void _vendor_disable_error_reporting(void) +{ + struct cpuinfo_x86 *c = &boot_cpu_data; + + switch (c->x86_vendor) { + case X86_VENDOR_INTEL: + /* +* Don't clear on Intel CPUs. Some of these MSRs are +* socket wide. Disabling them for just a single CPU offline +* is bad, since it will inhibit reporting for all shared +* resources, such as the LLC or iMC. +*/ + break; + default: + /* +* Disable MCE reporting for all other CPU vendors; +* we don't want to break functionality on those. +*/ + mce_disable_error_reporting(); + } } static int mce_syscore_suspend(void) { - return mce_disable_error_reporting(); + _vendor_disable_error_reporting(); + return 0; } static void mce_syscore_shutdown(void) { - mce_disable_error_reporting(); + _vendor_disable_error_reporting(); } /* @@ -2400,19 +2423,14 @@ static void mce_device_remove(unsigned int cpu) static void mce_disable_cpu(void *h) { unsigned long action = *(unsigned long *)h; - int i; if (!mce_available(raw_cpu_ptr(&cpu_info))) return; if (!(action & CPU_TASKS_FROZEN)) cmci_clear(); - for (i = 0; i < mca_cfg.banks; i++) { - struct mce_bank *b = &mce_banks[i]; - if (b->init) - wrmsrl(MSR_IA32_MCx_CTL(i), 0); - } + _vendor_disable_error_reporting(); } static void mce_reenable_cpu(void *h) -- 2.4.3
[Patch V1] x86, mce: Don't clear global error reporting banks during cpu_offline
During CPU offline, or during suspend/resume operations, it's not safe to clear MCi_CTL. These MSRs are either thread scoped (private to the thread), core scoped (private to the threads in that core only), or socket scoped, i.e. visible and controllable from all threads in the socket. If we clear them during CPU_OFFLINE, offlining just a single CPU will stop signaling for all the socket-wide resources, such as the LLC or iMC. This is true for Intel CPUs, but there seems to be some history that other processors may require these to be turned off during every CPU offline. With Intel Software Guard Extensions (SGX), it might be possible to compromise integrity on an SGX system if an attacker in control of the host injects errors that would otherwise be ignored when the MCi_CTL bits are cleared. Hence, on SGX-enabled systems, if MCi_CTL is cleared, SGX becomes unavailable. - Consolidated some code for sharing. - Minor changes to some prototypes to fit usage. - Left handling the same for non-Intel CPU models to avoid any unknown regressions. - Fixed review comments from Boris. Signed-off-by: Ashok Raj Reviewed-by: Tony Luck Tested-by: Serge Ayoun --- arch/x86/kernel/cpu/mcheck/mce.c | 30 -- 1 file changed, 20 insertions(+), 10 deletions(-) diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index d350858..69c7e3c 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -2100,7 +2100,7 @@ int __init mcheck_init(void) * Disable machine checks on suspend and shutdown. We can't really handle * them later. */ -static int mce_disable_error_reporting(void) +static void mce_disable_error_reporting(void) { int i; @@ -2110,17 +2110,32 @@ static void mce_disable_error_reporting(void) if (b->init) wrmsrl(MSR_IA32_MCx_CTL(i), 0); } - return 0; + return; +} + +static void vendor_disable_error_reporting(void) +{ + /* +* Don't clear on Intel CPUs. Some of these MSRs are +* socket wide. Disabling them for just a single CPU offline +* is bad, since it will inhibit reporting for all shared +* resources, such as the LLC or iMC. +*/ + if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) + return; + + mce_disable_error_reporting(); } static int mce_syscore_suspend(void) { - return mce_disable_error_reporting(); + vendor_disable_error_reporting(); + return 0; } static void mce_syscore_shutdown(void) { - mce_disable_error_reporting(); + vendor_disable_error_reporting(); } /* @@ -2400,19 +2415,14 @@ static void mce_device_remove(unsigned int cpu) static void mce_disable_cpu(void *h) { unsigned long action = *(unsigned long *)h; - int i; if (!mce_available(raw_cpu_ptr(&cpu_info))) return; if (!(action & CPU_TASKS_FROZEN)) cmci_clear(); - for (i = 0; i < mca_cfg.banks; i++) { - struct mce_bank *b = &mce_banks[i]; - if (b->init) - wrmsrl(MSR_IA32_MCx_CTL(i), 0); - } + vendor_disable_error_reporting(); } static void mce_reenable_cpu(void *h) -- 2.4.3
[PATCH 0/5] Add support for IBRS & IBPB KVM support.
The following patches are based on v3 from Tim Chen https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1582043.html This patch set supports exposing MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD to user space. Thomas is steam-blowing v3 :-).. but I didn't want to keep holding this much longer while the rebase completes in tip/x86/pti. Ashok Raj (4): x86/ibrs: Introduce native_rdmsrl, and native_wrmsrl x86/ibrs: Add new helper macros to save/restore MSR_IA32_SPEC_CTRL x86/ibrs: Add direct access support for MSR_IA32_SPEC_CTRL x86/feature: Detect the x86 feature Indirect Branch Prediction Barrier Paolo Bonzini (1): x86/svm: Direct access to MSR_IA32_SPEC_CTRL arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/msr-index.h | 3 +++ arch/x86/include/asm/spec_ctrl.h | 29 +- arch/x86/kernel/cpu/spec_ctrl.c| 19 ++ arch/x86/kvm/cpuid.c | 3 ++- arch/x86/kvm/svm.c | 51 ++ arch/x86/kvm/vmx.c | 51 ++ arch/x86/kvm/x86.c | 1 + 8 files changed, 156 insertions(+), 2 deletions(-) -- 2.7.4
[PATCH 4/5] x86/svm: Direct access to MSR_IA32_SPEC_CTRL
From: Paolo Bonzini

Direct access to MSR_IA32_SPEC_CTRL is important for performance. Allow load/store of MSR_IA32_SPEC_CTRL, restore guest IBRS on VM entry and restore host values on VM exit.

TBD: need to check whether the MSRs can be passed through even if the feature is not enumerated by the CPU.

[Ashok: Modified to reuse V3 spec-ctrl patches from Tim]

Signed-off-by: Paolo Bonzini
Signed-off-by: Ashok Raj
---
 arch/x86/kvm/svm.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 0e68f0b..7c14471a 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -183,6 +183,8 @@ struct vcpu_svm {
 		u64 gs_base;
 	} host;

+	u64 spec_ctrl;
+
 	u32 *msrpm;

 	ulong nmi_iret_rip;
@@ -248,6 +250,7 @@ static const struct svm_direct_access_msrs {
 	{ .index = MSR_CSTAR,			.always = true },
 	{ .index = MSR_SYSCALL_MASK,		.always = true },
 #endif
+	{ .index = MSR_IA32_SPEC_CTRL,		.always = true },
 	{ .index = MSR_IA32_LASTBRANCHFROMIP,	.always = false },
 	{ .index = MSR_IA32_LASTBRANCHTOIP,	.always = false },
 	{ .index = MSR_IA32_LASTINTFROMIP,	.always = false },
@@ -917,6 +920,9 @@ static void svm_vcpu_init_msrpm(u32 *msrpm)
 		set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 1);
 	}
+
+	if (boot_cpu_has(X86_FEATURE_SPEC_CTRL))
+		set_msr_interception(msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
 }

 static void add_msr_offset(u32 offset)
@@ -3576,6 +3582,9 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_VM_CR:
 		msr_info->data = svm->nested.vm_cr_msr;
 		break;
+	case MSR_IA32_SPEC_CTRL:
+		msr_info->data = svm->spec_ctrl;
+		break;
 	case MSR_IA32_UCODE_REV:
 		msr_info->data = 0x01000065;
 		break;
@@ -3724,6 +3733,9 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	case MSR_VM_IGNNE:
 		vcpu_unimpl(vcpu, "unimplemented wrmsr: 0x%x data 0x%llx\n", ecx, data);
 		break;
+	case MSR_IA32_SPEC_CTRL:
+		svm->spec_ctrl = data;
+		break;
 	case MSR_IA32_APICBASE:
 		if (kvm_vcpu_apicv_active(vcpu))
 			avic_update_vapic_bar(to_svm(vcpu), data);
@@ -4871,6 +4883,19 @@ static void svm_cancel_injection(struct kvm_vcpu *vcpu)
 	svm_complete_interrupts(svm);
 }

+
+/*
+ * Save guest value of spec_ctrl and also restore host value
+ */
+static void save_guest_spec_ctrl(struct vcpu_svm *svm)
+{
+	if (boot_cpu_has(X86_FEATURE_SPEC_CTRL)) {
+		svm->spec_ctrl = spec_ctrl_get();
+		spec_ctrl_restriction_on();
+	} else
+		rmb();
+}
+
 static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4910,6 +4935,14 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)

 	clgi();

+	if (boot_cpu_has(X86_FEATURE_SPEC_CTRL)) {
+		/*
+		 * FIXME: lockdep_assert_irqs_disabled();
+		 */
+		WARN_ON_ONCE(!irqs_disabled());
+		spec_ctrl_set(svm->spec_ctrl);
+	}
+
 	local_irq_enable();

 	asm volatile (
@@ -4985,6 +5018,8 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 #endif
 		);

+	save_guest_spec_ctrl(svm);
+
 #ifdef CONFIG_X86_64
 	wrmsrl(MSR_GS_BASE, svm->host.gs_base);
 #else
-- 
2.7.4
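Condensed, the ordering this patch establishes around VMRUN is the following sketch (spec_ctrl_set()/spec_ctrl_get() and the restriction helpers come from patches 1/5 and 2/5 of this series; this is an illustration of intent, not the literal hunks):

spec_ctrl_set(svm->spec_ctrl);		/* load guest SPEC_CTRL; IRQs are off */

/* ... VMRUN: the guest runs with its own IBRS setting ... */

svm->spec_ctrl = spec_ctrl_get();	/* save the guest value on exit */
spec_ctrl_restriction_on();		/* re-enable host-side IBRS protection */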
[PATCH 1/5] x86/ibrs: Introduce native_rdmsrl and native_wrmsrl
- Stop including microcode.h; use the native macros from asm/msr.h
- Added a license header for spec_ctrl.c

Signed-off-by: Ashok Raj
---
 arch/x86/include/asm/spec_ctrl.h | 17 -
 arch/x86/kernel/cpu/spec_ctrl.c  |  1 +
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/spec_ctrl.h b/arch/x86/include/asm/spec_ctrl.h
index 948959b..2dfa31b 100644
--- a/arch/x86/include/asm/spec_ctrl.h
+++ b/arch/x86/include/asm/spec_ctrl.h
@@ -3,12 +3,27 @@
 #ifndef _ASM_X86_SPEC_CTRL_H
 #define _ASM_X86_SPEC_CTRL_H

-#include <asm/microcode.h>
+#include <asm/msr.h>
+#include <asm/msr-index.h>

 void spec_ctrl_scan_feature(struct cpuinfo_x86 *c);
 void spec_ctrl_unprotected_begin(void);
 void spec_ctrl_unprotected_end(void);

+static inline u64 native_rdmsrl(unsigned int msr)
+{
+	u64 val;
+
+	val = __rdmsr(msr);
+
+	return val;
+}
+
+static inline void native_wrmsrl(unsigned int msr, u64 val)
+{
+	__wrmsr(msr, (u32) (val & 0xffffffffULL), (u32) (val >> 32));
+}
+
 static inline void __disable_indirect_speculation(void)
 {
 	native_wrmsrl(MSR_IA32_SPEC_CTRL, SPEC_CTRL_ENABLE_IBRS);
diff --git a/arch/x86/kernel/cpu/spec_ctrl.c b/arch/x86/kernel/cpu/spec_ctrl.c
index 843b4e6..9e9d013 100644
--- a/arch/x86/kernel/cpu/spec_ctrl.c
+++ b/arch/x86/kernel/cpu/spec_ctrl.c
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 #include
 #include
-- 
2.7.4
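A quick usage sketch of the new helpers (an illustration only; note that native_wrmsrl() splits the 64-bit value into the EDX:EAX halves that WRMSR expects):

u64 val = native_rdmsrl(MSR_IA32_SPEC_CTRL);	/* low/high halves recombined */
native_wrmsrl(MSR_IA32_SPEC_CTRL, val | SPEC_CTRL_ENABLE_IBRS);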
[PATCH 5/5] x86/feature: Detect the x86 feature Indirect Branch Prediction Barrier
CPUID.(EAX=0x7):RDX bit 26 indicates the presence of both IA32_SPEC_CTRL (MSR 0x48) and IA32_PRED_CMD (MSR 0x49).

IA32_PRED_CMD, BIT0: Indirect Branch Prediction Barrier (IBPB)

When this MSR is written with IBPB=1, it ensures that earlier code's behavior doesn't control later indirect branch predictions. Note this MSR is write-only and does not carry any state; it's a barrier, so the code should perform a WRMSR whenever the barrier is needed.

Signed-off-by: Ashok Raj
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/msr-index.h   |  3 +++
 arch/x86/kernel/cpu/spec_ctrl.c    |  7 +++
 arch/x86/kvm/svm.c                 | 16 
 arch/x86/kvm/vmx.c                 | 10 ++
 5 files changed, 37 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 624b58e..52f37fc 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -213,6 +213,7 @@
 #define X86_FEATURE_MBA		( 7*32+18) /* Memory Bandwidth Allocation */
 #define X86_FEATURE_SPEC_CTRL	( 7*32+19) /* Speculation Control */
 #define X86_FEATURE_SPEC_CTRL_IBRS ( 7*32+20) /* Speculation Control, use IBRS */
+#define X86_FEATURE_PRED_CMD	( 7*32+21) /* Indirect Branch Prediction Barrier */

 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW	( 8*32+ 0) /* Intel TPR Shadow */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3e1cb18..1888e19 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -46,6 +46,9 @@
 #define SPEC_CTRL_DISABLE_IBRS	(0 << 0)
 #define SPEC_CTRL_ENABLE_IBRS	(1 << 0)

+#define MSR_IA32_PRED_CMD	0x00000049
+#define FEATURE_SET_IBPB	(1<<0)
+
 #define MSR_IA32_PERFCTR0	0x000000c1
 #define MSR_IA32_PERFCTR1	0x000000c2
 #define MSR_FSB_FREQ		0x000000cd
diff --git a/arch/x86/kernel/cpu/spec_ctrl.c b/arch/x86/kernel/cpu/spec_ctrl.c
index 02fc630..6cfec19 100644
--- a/arch/x86/kernel/cpu/spec_ctrl.c
+++ b/arch/x86/kernel/cpu/spec_ctrl.c
@@ -15,6 +15,13 @@ void spec_ctrl_scan_feature(struct cpuinfo_x86 *c)
 		if (!c->cpu_index)
 			static_branch_enable(&spec_ctrl_dynamic_ibrs);
 		}
+		/*
+		 * For Intel CPUs this MSR shares the same CPUID enumeration:
+		 * when MSR_IA32_SPEC_CTRL is present, MSR_IA32_PRED_CMD is
+		 * also available.
+		 * TBD: AMD might have a separate enumeration for each.
+		 */
+		set_cpu_cap(c, X86_FEATURE_PRED_CMD);
 	}
 }
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 7c14471a..36924c9 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -251,6 +251,7 @@ static const struct svm_direct_access_msrs {
 	{ .index = MSR_SYSCALL_MASK,		.always = true },
 #endif
 	{ .index = MSR_IA32_SPEC_CTRL,		.always = true },
+	{ .index = MSR_IA32_PRED_CMD,		.always = false },
 	{ .index = MSR_IA32_LASTBRANCHFROMIP,	.always = false },
 	{ .index = MSR_IA32_LASTBRANCHTOIP,	.always = false },
 	{ .index = MSR_IA32_LASTINTFROMIP,	.always = false },
@@ -531,6 +532,7 @@ struct svm_cpu_data {
 	struct kvm_ldttss_desc *tss_desc;

 	struct page *save_area;
+	struct vmcb *current_vmcb;
 };

 static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
@@ -923,6 +925,8 @@ static void svm_vcpu_init_msrpm(u32 *msrpm)

 	if (boot_cpu_has(X86_FEATURE_SPEC_CTRL))
 		set_msr_interception(msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
+	if (boot_cpu_has(X86_FEATURE_PRED_CMD))
+		set_msr_interception(msrpm, MSR_IA32_PRED_CMD, 1, 1);
 }

 static void add_msr_offset(u32 offset)
@@ -1711,11 +1715,18 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
 	__free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
 	kvm_vcpu_uninit(vcpu);
 	kmem_cache_free(kvm_vcpu_cache, svm);
+	/*
+	 * The VMCB could be recycled, causing a false negative in
+	 * svm_vcpu_load; block speculative execution.
+	 */
+	if (boot_cpu_has(X86_FEATURE_PRED_CMD))
+		native_wrmsrl(MSR_IA32_PRED_CMD, FEATURE_SET_IBPB);
 }

 static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
 	int i;

 	if (unlikely(cpu != vcpu->cpu)) {
@@ -1744,6 +1755,11 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (static_cpu_has(X86_FEATURE_RDTSCP))
 		wrmsrl(MSR_TSC_AUX, svm->tsc_aux);

+	if (sd->current_vmcb != svm->vmcb) {
+		sd->current_vmcb = svm->vmcb;
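A minimal detection sketch for the enumeration described above (hypothetical standalone code; the series itself piggybacks on the X86_FEATURE_SPEC_CTRL detection):

unsigned int eax, ebx, ecx, edx;

cpuid_count(0x7, 0, &eax, &ebx, &ecx, &edx);	/* CPUID.(EAX=7,ECX=0) */
if (edx & (1U << 26))				/* EDX bit 26 */
	pr_info("SPEC_CTRL and PRED_CMD MSRs enumerated\n");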
[PATCH 3/5] x86/ibrs: Add direct access support for MSR_IA32_SPEC_CTRL
Add direct access to MSR_IA32_SPEC_CTRL from a guest. Also save/restore IBRS values during VM exits and on the guest resume path. Rebased on Tim's patch. Signed-off-by: Ashok Raj --- arch/x86/kvm/cpuid.c | 3 ++- arch/x86/kvm/vmx.c | 41 + arch/x86/kvm/x86.c | 1 + 3 files changed, 44 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 0099e10..6fa81c7 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -70,6 +70,7 @@ u64 kvm_supported_xcr0(void) /* These are scattered features in cpufeatures.h. */ #define KVM_CPUID_BIT_AVX512_4VNNIW 2 #define KVM_CPUID_BIT_AVX512_4FMAPS 3 +#define KVM_CPUID_BIT_SPEC_CTRL 26 #define KF(x) bit(KVM_CPUID_BIT_##x) int kvm_update_cpuid(struct kvm_vcpu *vcpu) @@ -392,7 +393,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, /* cpuid 7.0.edx*/ const u32 kvm_cpuid_7_0_edx_x86_features = - KF(AVX512_4VNNIW) | KF(AVX512_4FMAPS); + KF(AVX512_4VNNIW) | KF(AVX512_4FMAPS) | KF(SPEC_CTRL); /* all calls to cpuid_count() should be made on the same cpu */ get_cpu(); diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 62ee436..1913896 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -50,6 +50,7 @@ #include #include #include +#include #include "trace.h" #include "pmu.h" @@ -579,6 +580,7 @@ struct vcpu_vmx { u32 vm_entry_controls_shadow; u32 vm_exit_controls_shadow; u32 secondary_exec_control; + u64 spec_ctrl; /* * loaded_vmcs points to the VMCS currently used in this vcpu. For a @@ -3259,6 +3261,9 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_IA32_TSC: msr_info->data = guest_read_tsc(vcpu); break; + case MSR_IA32_SPEC_CTRL: + msr_info->data = to_vmx(vcpu)->spec_ctrl; + break; case MSR_IA32_SYSENTER_CS: msr_info->data = vmcs_read32(GUEST_SYSENTER_CS); break; @@ -3366,6 +3371,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_IA32_TSC: kvm_write_tsc(vcpu, msr_info); break; + case MSR_IA32_SPEC_CTRL: + to_vmx(vcpu)->spec_ctrl = msr_info->data; + break; case MSR_IA32_CR_PAT: if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) { if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data)) @@ -6790,6 +6798,13 @@ static __init int hardware_setup(void) kvm_tsc_scaling_ratio_frac_bits = 48; } + /* +* If the feature is available then set up MSR_IA32_SPEC_CTRL to be in +* passthrough mode for the guest. +*/ + if (boot_cpu_has(X86_FEATURE_SPEC_CTRL)) + vmx_disable_intercept_for_msr(MSR_IA32_SPEC_CTRL, false); + vmx_disable_intercept_for_msr(MSR_FS_BASE, false); vmx_disable_intercept_for_msr(MSR_GS_BASE, false); vmx_disable_intercept_for_msr(MSR_KERNEL_GS_BASE, true); @@ -9242,6 +9257,15 @@ static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu) vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc); } +static void save_guest_spec_ctrl(struct vcpu_vmx *vmx) +{ + if (boot_cpu_has(X86_FEATURE_SPEC_CTRL)) { + vmx->spec_ctrl = spec_ctrl_get(); + spec_ctrl_restriction_on(); + } else + rmb(); +} + static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); @@ -9298,6 +9322,21 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) vmx_arm_hv_timer(vcpu); vmx->__launched = vmx->loaded_vmcs->launched; + + /* +* Just update whatever value was set for the MSR in the guest. +* If this is unlaunched, assume the initialized value is 0. +* IRQs also need to be disabled. If the guest value is 0, an interrupt +* could start running in unprotected mode (i.e. with IBRS=0). 
+*/ + if (boot_cpu_has(X86_FEATURE_SPEC_CTRL)) { + /* +* FIXME: lockdep_assert_irqs_disabled(); +*/ + WARN_ON_ONCE(!irqs_disabled()); + spec_ctrl_set(vmx->spec_ctrl); + } + asm( /* Store host registers */ "push %%" _ASM_DX "; push %%" _ASM_BP ";" @@ -9403,6 +9442,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) #endif ); + save_guest_spec_ctrl(vmx); + /* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */ if (debugctlmsr) update_debugctlmsr(debugctlmsr); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm
[PATCH 2/5] x86/ibrs: Add new helper macros to save/restore MSR_IA32_SPEC_CTRL
Add some helper macros to save/restore MSR_IA32_SPEC_CTRL. Although we could use the spec_ctrl_unprotected_begin/end macros, they seem a bit unreadable for some uses.

spec_ctrl_get - read MSR_IA32_SPEC_CTRL to save
spec_ctrl_set - write value to restore MSR_IA32_SPEC_CTRL
spec_ctrl_restriction_off - same as spec_ctrl_unprotected_begin
spec_ctrl_restriction_on  - same as spec_ctrl_unprotected_end

Signed-off-by: Ashok Raj
---
 arch/x86/include/asm/spec_ctrl.h | 12 
 arch/x86/kernel/cpu/spec_ctrl.c  | 11 +++
 2 files changed, 23 insertions(+)

diff --git a/arch/x86/include/asm/spec_ctrl.h b/arch/x86/include/asm/spec_ctrl.h
index 2dfa31b..926feb2 100644
--- a/arch/x86/include/asm/spec_ctrl.h
+++ b/arch/x86/include/asm/spec_ctrl.h
@@ -9,6 +9,10 @@ void spec_ctrl_scan_feature(struct cpuinfo_x86 *c);
 void spec_ctrl_unprotected_begin(void);
 void spec_ctrl_unprotected_end(void);

+void spec_ctrl_set(u64 val);
+
+#define spec_ctrl_restriction_on	spec_ctrl_unprotected_end
+#define spec_ctrl_restriction_off	spec_ctrl_unprotected_begin

 static inline u64 native_rdmsrl(unsigned int msr)
 {
@@ -34,4 +38,12 @@ static inline void __enable_indirect_speculation(void)
 	native_wrmsrl(MSR_IA32_SPEC_CTRL, SPEC_CTRL_DISABLE_IBRS);
 }

+static inline u64 spec_ctrl_get(void)
+{
+	u64 val;
+
+	val = native_rdmsrl(MSR_IA32_SPEC_CTRL);
+
+	return val;
+}
 #endif /* _ASM_X86_SPEC_CTRL_H */
diff --git a/arch/x86/kernel/cpu/spec_ctrl.c b/arch/x86/kernel/cpu/spec_ctrl.c
index 9e9d013..02fc630 100644
--- a/arch/x86/kernel/cpu/spec_ctrl.c
+++ b/arch/x86/kernel/cpu/spec_ctrl.c
@@ -47,3 +47,14 @@ void spec_ctrl_unprotected_end(void)
 		__disable_indirect_speculation();
 }
 EXPORT_SYMBOL_GPL(spec_ctrl_unprotected_end);
+
+void spec_ctrl_set(u64 val)
+{
+	if (boot_cpu_has(X86_FEATURE_SPEC_CTRL)) {
+		if (!val) {
+			spec_ctrl_restriction_off();
+		} else
+			spec_ctrl_restriction_on();
+	}
+}
+EXPORT_SYMBOL(spec_ctrl_set);
-- 
2.7.4
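Paired up, the intended save/restore idiom looks like the sketch below (guest_val is a placeholder, not something this patch defines; it relies on the series' convention that a SPEC_CTRL value of 0 means "run unprotected", which is what spec_ctrl_set() keys off):

u64 saved = spec_ctrl_get();	/* snapshot MSR_IA32_SPEC_CTRL */

spec_ctrl_set(guest_val);	/* 0 -> restriction off, non-zero -> on */
/* ... run with the other context's setting ... */
spec_ctrl_set(saved);		/* restore the previous protection state */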
Re: [PATCH 3/7] kvm: vmx: pass MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD down to the guest
Hi Paolo

Do you assume that the host isn't using IBRS and only the guest uses it? On Mon, Jan 8, 2018 at 10:08 AM, Paolo Bonzini wrote: > Direct access to MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD is important > for performance. Allow load/store of MSR_IA32_SPEC_CTRL, restore guest > IBRS on VM entry and set it to 0 on VM exit (because Linux does not use > it yet). > > Signed-off-by: Paolo Bonzini > --- > arch/x86/kvm/vmx.c | 32 > 1 file changed, 32 insertions(+) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 669f5f74857d..d00bcad7336e 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -120,6 +120,8 @@ > module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO); > #endif > > +static bool __read_mostly have_spec_ctrl; > + > #define KVM_GUEST_CR0_MASK (X86_CR0_NW | X86_CR0_CD) > #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST (X86_CR0_WP | X86_CR0_NE) > #define KVM_VM_CR0_ALWAYS_ON \ > @@ -609,6 +611,8 @@ struct vcpu_vmx { > u64 msr_host_kernel_gs_base; > u64 msr_guest_kernel_gs_base; > #endif > + u64 spec_ctrl; > + > u32 vm_entry_controls_shadow; > u32 vm_exit_controls_shadow; > u32 secondary_exec_control; > @@ -3361,6 +3365,9 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct > msr_data *msr_info) > case MSR_IA32_TSC: > msr_info->data = guest_read_tsc(vcpu); > break; > + case MSR_IA32_SPEC_CTRL: > + msr_info->data = to_vmx(vcpu)->spec_ctrl; > + break; > case MSR_IA32_SYSENTER_CS: > msr_info->data = vmcs_read32(GUEST_SYSENTER_CS); > break; > @@ -3500,6 +3507,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct > msr_data *msr_info) > case MSR_IA32_TSC: > kvm_write_tsc(vcpu, msr_info); > break; > + case MSR_IA32_SPEC_CTRL: > + to_vmx(vcpu)->spec_ctrl = msr_info->data; > + break; > case MSR_IA32_CR_PAT: > if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) { > if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data)) > @@ -7062,6 +7072,17 @@ static __init int hardware_setup(void) > goto out; > } > > + /* > +* FIXME: this is only needed until SPEC_CTRL is supported > +* by upstream Linux in cpufeatures, then it can be replaced > +* with static_cpu_has. > +*/ > + have_spec_ctrl = cpu_has_spec_ctrl(); > + if (have_spec_ctrl) > + pr_info("kvm: SPEC_CTRL available\n"); > + else > + pr_info("kvm: SPEC_CTRL not available\n"); > + > if (boot_cpu_has(X86_FEATURE_NX)) > kvm_enable_efer_bits(EFER_NX); > > @@ -7131,6 +7152,8 @@ static __init int hardware_setup(void) > vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_CS, false); > vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false); > vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false); > + vmx_disable_intercept_for_msr(MSR_IA32_SPEC_CTRL, false); > + vmx_disable_intercept_for_msr(MSR_IA32_PRED_CMD, false); > > memcpy(vmx_msr_bitmap_legacy_x2apic_apicv, > vmx_msr_bitmap_legacy, PAGE_SIZE); > @@ -9597,6 +9620,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu > *vcpu) > > pt_guest_enter(vmx); > > + if (have_spec_ctrl && vmx->spec_ctrl != 0) > + wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl); > + Do we even need to optimize this? What if the host Linux enabled IBRS, but the guest has it turned off? Thought it might be simpler to blindly update it with whatever the vmx->spec_ctrl value is? > atomic_switch_perf_msrs(vmx); > > vmx_arm_hv_timer(vcpu); > @@ -9707,6 +9733,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu > *vcpu) > #endif > ); > > + if (have_spec_ctrl) { > + rdmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl); > + if (vmx->spec_ctrl) > + wrmsrl(MSR_IA32_SPEC_CTRL, 0); > + } > + Same thing here.. 
if the host OS has enabled IBRS, wouldn't you want to keep the same value? > /* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */ > if (vmx->host_debugctlmsr) > update_debugctlmsr(vmx->host_debugctlmsr); > -- > 1.8.3.1 > >
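A sketch of the alternative being floated here: restore the host's own SPEC_CTRL value on VM exit rather than unconditionally writing 0, so a host running with IBRS=1 keeps its protection (host_spec_ctrl is a hypothetical saved host value, not something the quoted patch defines):

if (have_spec_ctrl) {
	rdmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);	/* save guest value */
	if (vmx->spec_ctrl != host_spec_ctrl)
		wrmsrl(MSR_IA32_SPEC_CTRL, host_spec_ctrl);	/* keep host IBRS */
}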
[4.15 & 4.14 stable 07/12] x86/microcode: Do not upload microcode if CPUs are offline
commit 30ec26da9967d0d785abc24073129a34c3211777 upstream Avoid loading microcode if any of the CPUs are offline, and issue a warning. Having different microcode revisions on the system at any time is outright dangerous. [ Borislav: Massage changelog. ] Signed-off-by: Ashok Raj Signed-off-by: Borislav Petkov Signed-off-by: Thomas Gleixner Tested-by: Tom Lendacky Tested-by: Ashok Raj Reviewed-by: Tom Lendacky Cc: Arjan Van De Ven Cc: Asit K Mallick Cc: sta...@vger.kernel.org Link: http://lkml.kernel.org/r/1519352533-15992-4-git-send-email-ashok@intel.com Link: https://lkml.kernel.org/r/20180228102846.13447-5...@alien8.de --- arch/x86/kernel/cpu/microcode/core.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c index cbeace2..f25c395 100644 --- a/arch/x86/kernel/cpu/microcode/core.c +++ b/arch/x86/kernel/cpu/microcode/core.c @@ -486,6 +486,16 @@ static void __exit microcode_dev_exit(void) /* fake device for request_firmware */ static struct platform_device *microcode_pdev; +static int check_online_cpus(void) +{ + if (num_online_cpus() == num_present_cpus()) + return 0; + + pr_err("Not all CPUs online, aborting microcode update.\n"); + + return -EINVAL; +} + static enum ucode_state reload_for_cpu(int cpu) { struct ucode_cpu_info *uci = ucode_cpu_info + cpu; @@ -519,7 +529,13 @@ static ssize_t reload_store(struct device *dev, return size; get_online_cpus(); + + ret = check_online_cpus(); + if (ret) + goto put; + mutex_lock(&microcode_mutex); + for_each_online_cpu(cpu) { tmp_ret = reload_for_cpu(cpu); if (tmp_ret > UCODE_NFOUND) { @@ -538,6 +554,8 @@ static ssize_t reload_store(struct device *dev, microcode_check(); mutex_unlock(&microcode_mutex); + +put: put_online_cpus(); if (!ret) -- 2.7.4
[4.15 & 4.14 stable 08/12] x86/microcode/intel: Look into the patch cache first
From: Borislav Petkov commit d8c3b52c00a05036e0a6b315b4b17921a7b67997 upstream The cache might contain a newer patch - look in there first. A follow-on change will make sure newest patches are loaded into the cache of microcode patches. Signed-off-by: Borislav Petkov Signed-off-by: Thomas Gleixner Tested-by: Tom Lendacky Tested-by: Ashok Raj Cc: Arjan Van De Ven Cc: Tom Lendacky Cc: Asit K Mallick Cc: sta...@vger.kernel.org Link: https://lkml.kernel.org/r/20180228102846.13447-6...@alien8.de --- arch/x86/kernel/cpu/microcode/intel.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c index e2864bc..2aded9d 100644 --- a/arch/x86/kernel/cpu/microcode/intel.c +++ b/arch/x86/kernel/cpu/microcode/intel.c @@ -791,9 +791,9 @@ static int collect_cpu_info(int cpu_num, struct cpu_signature *csig) static enum ucode_state apply_microcode_intel(int cpu) { - struct microcode_intel *mc; - struct ucode_cpu_info *uci; + struct ucode_cpu_info *uci = ucode_cpu_info + cpu; struct cpuinfo_x86 *c = &cpu_data(cpu); + struct microcode_intel *mc; static int prev_rev; u32 rev; @@ -801,11 +801,10 @@ static enum ucode_state apply_microcode_intel(int cpu) if (WARN_ON(raw_smp_processor_id() != cpu)) return UCODE_ERROR; - uci = ucode_cpu_info + cpu; - mc = uci->mc; + /* Look for a newer patch in our cache: */ + mc = find_patch(uci); if (!mc) { - /* Look for a newer patch in our cache: */ - mc = find_patch(uci); + mc = uci->mc; if (!mc) return UCODE_NFOUND; } -- 2.7.4
[4.15 & 4.14 stable 09/12] x86/microcode: Request microcode on the BSP
From: Borislav Petkov commit cfb52a5a09c8ae3a1dafb44ce549fde5b69e8117 upstream ... so that any newer version can land in the cache and can later be fished out by the application functions. Do that before grabbing the hotplug lock. Signed-off-by: Borislav Petkov Signed-off-by: Thomas Gleixner Tested-by: Tom Lendacky Tested-by: Ashok Raj Reviewed-by: Tom Lendacky Cc: Arjan Van De Ven Cc: Asit K Mallick Cc: sta...@vger.kernel.org Link: https://lkml.kernel.org/r/20180228102846.13447-7...@alien8.de --- arch/x86/kernel/cpu/microcode/core.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c index f25c395..8adbf43 100644 --- a/arch/x86/kernel/cpu/microcode/core.c +++ b/arch/x86/kernel/cpu/microcode/core.c @@ -499,15 +499,10 @@ static int check_online_cpus(void) static enum ucode_state reload_for_cpu(int cpu) { struct ucode_cpu_info *uci = ucode_cpu_info + cpu; - enum ucode_state ustate; if (!uci->valid) return UCODE_OK; - ustate = microcode_ops->request_microcode_fw(cpu, &microcode_pdev->dev, true); - if (ustate != UCODE_OK) - return ustate; - return apply_microcode_on_target(cpu); } @@ -515,11 +510,11 @@ static ssize_t reload_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t size) { + int cpu, bsp = boot_cpu_data.cpu_index; enum ucode_state tmp_ret = UCODE_OK; bool do_callback = false; unsigned long val; ssize_t ret = 0; - int cpu; ret = kstrtoul(buf, 0, &val); if (ret) return ret; @@ -528,6 +523,10 @@ static ssize_t reload_store(struct device *dev, if (val != 1) return size; + tmp_ret = microcode_ops->request_microcode_fw(bsp, &microcode_pdev->dev, true); + if (tmp_ret != UCODE_OK) + return size; + get_online_cpus(); ret = check_online_cpus(); -- 2.7.4
[4.15 & 4.14 stable 04/12] x86/microcode: Get rid of struct apply_microcode_ctx
From: Borislav Petkov commit 854857f5944c59a881ff607b37ed9ed41d031a3b upstream It is a useless remnant from earlier times. Use the ucode_state enum directly. No functional change. Signed-off-by: Borislav Petkov Signed-off-by: Thomas Gleixner Tested-by: Tom Lendacky Tested-by: Ashok Raj Cc: Arjan Van De Ven Cc: Tom Lendacky Cc: Asit K Mallick Cc: sta...@vger.kernel.org Link: https://lkml.kernel.org/r/20180228102846.13447-2...@alien8.de --- arch/x86/kernel/cpu/microcode/core.c | 19 --- 1 file changed, 8 insertions(+), 11 deletions(-) diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c index b40b56e..cbeace2 100644 --- a/arch/x86/kernel/cpu/microcode/core.c +++ b/arch/x86/kernel/cpu/microcode/core.c @@ -373,26 +373,23 @@ static int collect_cpu_info(int cpu) return ret; } -struct apply_microcode_ctx { - enum ucode_state err; -}; - static void apply_microcode_local(void *arg) { - struct apply_microcode_ctx *ctx = arg; + enum ucode_state *err = arg; - ctx->err = microcode_ops->apply_microcode(smp_processor_id()); + *err = microcode_ops->apply_microcode(smp_processor_id()); } static int apply_microcode_on_target(int cpu) { - struct apply_microcode_ctx ctx = { .err = 0 }; + enum ucode_state err; int ret; - ret = smp_call_function_single(cpu, apply_microcode_local, &ctx, 1); - if (!ret) - ret = ctx.err; - + ret = smp_call_function_single(cpu, apply_microcode_local, &err, 1); + if (!ret) { + if (err == UCODE_ERROR) + ret = 1; + } return ret; } -- 2.7.4
[4.15 & 4.14 stable 05/12] x86/microcode/intel: Check microcode revision before updating sibling threads
commit c182d2b7d0ca48e0d6ff16f7d883161238c447ed upstream After updating microcode on one of the threads of a core, the other thread sibling automatically gets the update since the microcode resources on a hyperthreaded core are shared between the two threads. Check the microcode revision on the CPU before performing a microcode update and thus save us the WRMSR 0x79 because it is a particularly expensive operation. [ Borislav: Massage changelog and coding style. ] Signed-off-by: Ashok Raj Signed-off-by: Borislav Petkov Signed-off-by: Thomas Gleixner Tested-by: Tom Lendacky Tested-by: Ashok Raj Cc: Tom Lendacky Cc: Asit K Mallick Cc: sta...@vger.kernel.org Cc: Arjan Van De Ven Link: http://lkml.kernel.org/r/1519352533-15992-2-git-send-email-ashok@intel.com Link: https://lkml.kernel.org/r/20180228102846.13447-3...@alien8.de --- arch/x86/kernel/cpu/microcode/intel.c | 27 --- 1 file changed, 24 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c index 923054a..87bd6dc 100644 --- a/arch/x86/kernel/cpu/microcode/intel.c +++ b/arch/x86/kernel/cpu/microcode/intel.c @@ -589,6 +589,17 @@ static int apply_microcode_early(struct ucode_cpu_info *uci, bool early) if (!mc) return 0; + /* +* Save us the MSR write below - which is a particularly expensive +* operation - when the other hyperthread has updated the microcode +* already. +*/ + rev = intel_get_microcode_revision(); + if (rev >= mc->hdr.rev) { + uci->cpu_sig.rev = rev; + return UCODE_OK; + } + /* write microcode via MSR 0x79 */ native_wrmsrl(MSR_IA32_UCODE_WRITE, (unsigned long)mc->bits); @@ -776,7 +787,7 @@ static enum ucode_state apply_microcode_intel(int cpu) { struct microcode_intel *mc; struct ucode_cpu_info *uci; - struct cpuinfo_x86 *c; + struct cpuinfo_x86 *c = &cpu_data(cpu); static int prev_rev; u32 rev; @@ -793,6 +804,18 @@ static enum ucode_state apply_microcode_intel(int cpu) return UCODE_NFOUND; } + /* +* Save us the MSR write below - which is a particularly expensive +* operation - when the other hyperthread has updated the microcode +* already. +*/ + rev = intel_get_microcode_revision(); + if (rev >= mc->hdr.rev) { + uci->cpu_sig.rev = rev; + c->microcode = rev; + return UCODE_OK; + } + /* write microcode via MSR 0x79 */ wrmsrl(MSR_IA32_UCODE_WRITE, (unsigned long)mc->bits); @@ -813,8 +836,6 @@ static enum ucode_state apply_microcode_intel(int cpu) prev_rev = rev; } - c = &cpu_data(cpu); - uci->cpu_sig.rev = rev; c->microcode = rev; -- 2.7.4
[4.15 & 4.14 stable 06/12] x86/microcode/intel: Writeback and invalidate caches before updating microcode
commit 91df9fdf51492aec9fed6b4cbd33160886740f47 upstream Updating microcode is less error prone when caches have been flushed and depending on what exactly the microcode is updating. For example, some of the issues around certain Broadwell parts can be addressed by doing a full cache flush. [ Borislav: Massage it and use native_wbinvd() in both cases. ] Signed-off-by: Ashok Raj Signed-off-by: Borislav Petkov Signed-off-by: Thomas Gleixner Tested-by: Tom Lendacky Tested-by: Ashok Raj Cc: Arjan Van De Ven Cc: Tom Lendacky Cc: Asit K Mallick Cc: sta...@vger.kernel.org Link: http://lkml.kernel.org/r/1519352533-15992-3-git-send-email-ashok@intel.com Link: https://lkml.kernel.org/r/20180228102846.13447-4...@alien8.de --- arch/x86/kernel/cpu/microcode/intel.c | 12 1 file changed, 12 insertions(+) diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c index 87bd6dc..e2864bc 100644 --- a/arch/x86/kernel/cpu/microcode/intel.c +++ b/arch/x86/kernel/cpu/microcode/intel.c @@ -600,6 +600,12 @@ static int apply_microcode_early(struct ucode_cpu_info *uci, bool early) return UCODE_OK; } + /* +* Writeback and invalidate caches before updating microcode to avoid +* internal issues depending on what the microcode is updating. +*/ + native_wbinvd(); + /* write microcode via MSR 0x79 */ native_wrmsrl(MSR_IA32_UCODE_WRITE, (unsigned long)mc->bits); @@ -816,6 +822,12 @@ static enum ucode_state apply_microcode_intel(int cpu) return UCODE_OK; } + /* +* Writeback and invalidate caches before updating microcode to avoid +* internal issues depending on what the microcode is updating. +*/ + native_wbinvd(); + /* write microcode via MSR 0x79 */ wrmsrl(MSR_IA32_UCODE_WRITE, (unsigned long)mc->bits); -- 2.7.4
[4.15 & 4.14 stable 10/12] x86/microcode: Synchronize late microcode loading
commit a5321aec6412b20b5ad15db2d6b916c05349dbff upstream Original idea by Ashok, completely rewritten by Borislav. Before you read any further: the early loading method is still the preferred one and you should always do that. The following patch is improving the late loading mechanism for long running jobs and cloud use cases. Gather all cores and serialize the microcode update on them by doing it one-by-one to make the late update process as reliable as possible and avoid potential issues caused by the microcode update. [ Borislav: Rewrite completely. ] Co-developed-by: Borislav Petkov Signed-off-by: Ashok Raj Signed-off-by: Borislav Petkov Signed-off-by: Thomas Gleixner Tested-by: Tom Lendacky Tested-by: Ashok Raj Reviewed-by: Tom Lendacky Cc: Arjan Van De Ven Cc: Asit K Mallick Cc: sta...@vger.kernel.org Link: https://lkml.kernel.org/r/20180228102846.13447-8...@alien8.de --- arch/x86/kernel/cpu/microcode/core.c | 118 +++ 1 file changed, 92 insertions(+), 26 deletions(-) diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c index 8adbf43..bde629e 100644 --- a/arch/x86/kernel/cpu/microcode/core.c +++ b/arch/x86/kernel/cpu/microcode/core.c @@ -22,13 +22,16 @@ #define pr_fmt(fmt) "microcode: " fmt #include +#include #include #include #include #include #include +#include #include #include +#include #include #include @@ -64,6 +67,11 @@ LIST_HEAD(microcode_cache); */ static DEFINE_MUTEX(microcode_mutex); +/* + * Serialize late loading so that CPUs get updated one-by-one. + */ +static DEFINE_SPINLOCK(update_lock); + struct ucode_cpu_info ucode_cpu_info[NR_CPUS]; struct cpu_info_ctx { @@ -486,6 +494,19 @@ static void __exit microcode_dev_exit(void) /* fake device for request_firmware */ static struct platform_device *microcode_pdev; +/* + * Late loading dance. Why the heavy-handed stomp_machine effort? + * + * - HT siblings must be idle and not execute other code while the other sibling + * is loading microcode in order to avoid any negative interactions caused by + * the loading. + * + * - In addition, microcode update on the cores must be serialized until this + * requirement can be relaxed in the future. Right now, this is conservative + * and good. + */ +#define SPINUNIT 100 /* 100 nsec */ + static int check_online_cpus(void) { if (num_online_cpus() == num_present_cpus()) @@ -496,23 +517,85 @@ static int check_online_cpus(void) return -EINVAL; } -static enum ucode_state reload_for_cpu(int cpu) +static atomic_t late_cpus; + +/* + * Returns: + * < 0 - on error + * 0 - no update done + * 1 - microcode was updated + */ +static int __reload_late(void *info) { - struct ucode_cpu_info *uci = ucode_cpu_info + cpu; + unsigned int timeout = NSEC_PER_SEC; + int all_cpus = num_online_cpus(); + int cpu = smp_processor_id(); + enum ucode_state err; + int ret = 0; - if (!uci->valid) - return UCODE_OK; + atomic_dec(&late_cpus); + + /* +* Wait for all CPUs to arrive. A load will not be attempted unless all +* CPUs show up. 
+* */ + while (atomic_read(&late_cpus)) { + if (timeout < SPINUNIT) { + pr_err("Timeout while waiting for CPUs rendezvous, remaining: %d\n", + atomic_read(&late_cpus)); + return -1; + } + + ndelay(SPINUNIT); + timeout -= SPINUNIT; + + touch_nmi_watchdog(); + } + + spin_lock(&update_lock); + apply_microcode_local(&err); + spin_unlock(&update_lock); + + if (err > UCODE_NFOUND) { + pr_warn("Error reloading microcode on CPU %d\n", cpu); + ret = -1; + } else if (err == UCODE_UPDATED) { + ret = 1; + } - return apply_microcode_on_target(cpu); + atomic_inc(&late_cpus); + + while (atomic_read(&late_cpus) != all_cpus) + cpu_relax(); + + return ret; +} + +/* + * Reload microcode late on all CPUs. Wait for a sec until they + * all gather together. + */ +static int microcode_reload_late(void) +{ + int ret; + + atomic_set(&late_cpus, num_online_cpus()); + + ret = stop_machine_cpuslocked(__reload_late, NULL, cpu_online_mask); + if (ret < 0) + return ret; + else if (ret > 0) + microcode_check(); + + return ret; } static ssize_t reload_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t size) { - int cpu, bsp = boot_cpu_data.cpu_index; enum ucode_state tmp_ret = UCODE_OK; - bool do_callback = false; + int bsp = boot_cpu_data
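The rendezvous-then-serialize idea, as a standalone userspace analogue (a hedged sketch in C11 with pthreads, not kernel code: the threads stand in for the CPUs that stop_machine_cpuslocked() gathers, and the mutex serializes the per-core update; the increment-and-wait counters anticipate the scheme patch 12/12 of this series settles on, since a shared decrement-to-zero counter can strand a late spinner):

/* cc -pthread late_load.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NCPUS 4

static atomic_int cpus_in;
static atomic_int cpus_out;
static pthread_mutex_t update_lock = PTHREAD_MUTEX_INITIALIZER;

static void wait_for_all(atomic_int *t)
{
	atomic_fetch_add(t, 1);
	while (atomic_load(t) < NCPUS)
		;	/* spin until every "CPU" has arrived */
}

static void *reload_late(void *arg)
{
	long cpu = (long)arg;

	wait_for_all(&cpus_in);			/* rendezvous before any update */

	pthread_mutex_lock(&update_lock);	/* updates run one-by-one */
	printf("cpu %ld: applying update\n", cpu);
	pthread_mutex_unlock(&update_lock);

	wait_for_all(&cpus_out);		/* and leave together */
	return NULL;
}

int main(void)
{
	pthread_t t[NCPUS];

	for (long i = 0; i < NCPUS; i++)
		pthread_create(&t[i], NULL, reload_late, (void *)i);
	for (long i = 0; i < NCPUS; i++)
		pthread_join(t[i], NULL);
	return 0;
}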
[4.15 & 4.14 stable 00/12] Series to update microcode loading.
Hi Greg

Here is a series that addresses microcode loading stability issues post Spectre. All of them are simply cherry-picked, and the patches themselves have the upstream commit IDs. I checked this on Intel platforms, and thanks to Boris for checking on AMD platforms.

I'm still working on a 4.9 backport, will send those once I get them to work. stop_machine differences seem big enough that I might choose a different approach for the 4.9 backport.

Cheers,
Ashok

Ashok Raj (4):
  x86/microcode/intel: Check microcode revision before updating sibling threads
  x86/microcode/intel: Writeback and invalidate caches before updating microcode
  x86/microcode: Do not upload microcode if CPUs are offline
  x86/microcode: Synchronize late microcode loading

Borislav Petkov (8):
  x86/microcode: Propagate return value from updating functions
  x86/CPU: Add a microcode loader callback
  x86/CPU: Check CPU feature bits after microcode upgrade
  x86/microcode: Get rid of struct apply_microcode_ctx
  x86/microcode/intel: Look into the patch cache first
  x86/microcode: Request microcode on the BSP
  x86/microcode: Attempt late loading only when new microcode is present
  x86/microcode: Fix CPU synchronization routine

 arch/x86/include/asm/microcode.h      |  10 +-
 arch/x86/include/asm/processor.h      |   1 +
 arch/x86/kernel/cpu/common.c          |  30 ++
 arch/x86/kernel/cpu/microcode/amd.c   |  44 +
 arch/x86/kernel/cpu/microcode/core.c  | 181 ++
 arch/x86/kernel/cpu/microcode/intel.c |  62 +---
 6 files changed, 252 insertions(+), 76 deletions(-)

-- 
2.7.4
[4.15 & 4.14 stable 12/12] x86/microcode: Fix CPU synchronization routine
From: Borislav Petkov commit bb8c13d61a629276a162c1d2b1a20a815cbcfbb7 upstream Emanuel reported an issue with a hang during microcode update because my dumb idea to use one atomic synchronization variable for both rendezvous - before and after update - was simply bollocks: microcode: microcode_reload_late: late_cpus: 4 microcode: __reload_late: cpu 2 entered microcode: __reload_late: cpu 1 entered microcode: __reload_late: cpu 3 entered microcode: __reload_late: cpu 0 entered microcode: __reload_late: cpu 1 left microcode: Timeout while waiting for CPUs rendezvous, remaining: 1 CPU1 above would finish, leave and the others will still spin waiting for it to join. So do two synchronization atomics instead, which makes the code a lot more straightforward. Also, since the update is serialized and it also takes quite some time per microcode engine, increase the exit timeout by the number of CPUs on the system. That's ok because the moment all CPUs are done, that timeout will be cut short. Furthermore, panic when some of the CPUs timeout when returning from a microcode update: we can't allow a system with not all cores updated. Also, as an optimization, do not do the exit sync if microcode wasn't updated. Reported-by: Emanuel Czirai Signed-off-by: Borislav Petkov Signed-off-by: Thomas Gleixner Tested-by: Emanuel Czirai Tested-by: Ashok Raj Tested-by: Tom Lendacky Cc: Asit K Mallick Cc: sta...@vger.kernel.org Link: https://lkml.kernel.org/r/20180314183615.17629-2...@alien8.de --- arch/x86/kernel/cpu/microcode/core.c | 68 ++-- 1 file changed, 41 insertions(+), 27 deletions(-) diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c index e6d5caa..021c904 100644 --- a/arch/x86/kernel/cpu/microcode/core.c +++ b/arch/x86/kernel/cpu/microcode/core.c @@ -517,7 +517,29 @@ static int check_online_cpus(void) return -EINVAL; } -static atomic_t late_cpus; +static atomic_t late_cpus_in; +static atomic_t late_cpus_out; + +static int __wait_for_cpus(atomic_t *t, long long timeout) +{ + int all_cpus = num_online_cpus(); + + atomic_inc(t); + + while (atomic_read(t) < all_cpus) { + if (timeout < SPINUNIT) { + pr_err("Timeout while waiting for CPUs rendezvous, remaining: %d\n", + all_cpus - atomic_read(t)); + return 1; + } + + ndelay(SPINUNIT); + timeout -= SPINUNIT; + + touch_nmi_watchdog(); + } + return 0; +} /* * Returns: @@ -527,30 +549,16 @@ static atomic_t late_cpus; */ static int __reload_late(void *info) { - unsigned int timeout = NSEC_PER_SEC; - int all_cpus = num_online_cpus(); int cpu = smp_processor_id(); enum ucode_state err; int ret = 0; - atomic_dec(&late_cpus); - /* * Wait for all CPUs to arrive. A load will not be attempted unless all * CPUs show up. 
* */ - while (atomic_read(&late_cpus)) { - if (timeout < SPINUNIT) { - pr_err("Timeout while waiting for CPUs rendezvous, remaining: %d\n", - atomic_read(&late_cpus)); - return -1; - } - - ndelay(SPINUNIT); - timeout -= SPINUNIT; - - touch_nmi_watchdog(); - } + if (__wait_for_cpus(&late_cpus_in, NSEC_PER_SEC)) + return -1; spin_lock(&update_lock); apply_microcode_local(&err); @@ -558,15 +566,22 @@ static int __reload_late(void *info) if (err > UCODE_NFOUND) { pr_warn("Error reloading microcode on CPU %d\n", cpu); - ret = -1; - } else if (err == UCODE_UPDATED) { + return -1; + /* siblings return UCODE_OK because their engine got updated already */ + } else if (err == UCODE_UPDATED || err == UCODE_OK) { ret = 1; + } else { + return ret; } - atomic_inc(&late_cpus); - - while (atomic_read(&late_cpus) != all_cpus) - cpu_relax(); + /* +* Increase the wait timeout to a safe value here since we're +* serializing the microcode update and that could take a while on a +* large number of CPUs. And that is fine as the *actual* timeout will +* be determined by the last CPU finished updating and thus cut short. +*/ + if (__wait_for_cpus(&late_cpus_out, NSEC_PER_SEC * num_online_cpus())) + panic("Timeout during microcode update!\n"); return ret; } @@ -579,12 +594,11 @@ static int microcode_reload_late(void) { int ret; - atomic_set(&late_cpus, num_online_cpus()); + atomic_set(
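Condensed, the fixed routine is a two-phase barrier (a sketch using the names from the hunks above): one counter gates entry so no CPU updates early, a second gates exit so nobody resumes until every update is done:

if (__wait_for_cpus(&late_cpus_in, NSEC_PER_SEC))	/* phase 1: all CPUs enter */
	return -1;

spin_lock(&update_lock);
apply_microcode_local(&err);				/* serialized, one core at a time */
spin_unlock(&update_lock);

/* phase 2: the timeout scales with CPU count because updates are serialized */
if (__wait_for_cpus(&late_cpus_out, NSEC_PER_SEC * num_online_cpus()))
	panic("Timeout during microcode update!\n");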
[4.15 & 4.14 stable 11/12] x86/microcode: Attempt late loading only when new microcode is present
From: Borislav Petkov commit 2613f36ed965d0e5a595a1d931fd3b480e82d6fd upstream Return UCODE_NEW from the scanning functions to denote that new microcode was found and only then attempt the expensive synchronization dance. Reported-by: Emanuel Czirai Signed-off-by: Borislav Petkov Signed-off-by: Thomas Gleixner Tested-by: Emanuel Czirai Tested-by: Ashok Raj Tested-by: Tom Lendacky Cc: Tom Lendacky Cc: Asit K Mallick Cc: sta...@vger.kernel.org Link: https://lkml.kernel.org/r/20180314183615.17629-1...@alien8.de --- arch/x86/include/asm/microcode.h | 1 + arch/x86/kernel/cpu/microcode/amd.c | 34 +- arch/x86/kernel/cpu/microcode/core.c | 8 +++- arch/x86/kernel/cpu/microcode/intel.c | 4 +++- 4 files changed, 28 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/microcode.h b/arch/x86/include/asm/microcode.h index 7fb1047..6cf0e4c 100644 --- a/arch/x86/include/asm/microcode.h +++ b/arch/x86/include/asm/microcode.h @@ -39,6 +39,7 @@ struct device; enum ucode_state { UCODE_OK = 0, + UCODE_NEW, UCODE_UPDATED, UCODE_NFOUND, UCODE_ERROR, }; diff --git a/arch/x86/kernel/cpu/microcode/amd.c b/arch/x86/kernel/cpu/microcode/amd.c index a998e1a..4817992 100644 --- a/arch/x86/kernel/cpu/microcode/amd.c +++ b/arch/x86/kernel/cpu/microcode/amd.c @@ -339,7 +339,7 @@ int __init save_microcode_in_initrd_amd(unsigned int cpuid_1_eax) return -EINVAL; ret = load_microcode_amd(true, x86_family(cpuid_1_eax), desc.data, desc.size); - if (ret != UCODE_OK) + if (ret > UCODE_UPDATED) return -EINVAL; return 0; @@ -683,27 +683,35 @@ static enum ucode_state __load_microcode_amd(u8 family, const u8 *data, static enum ucode_state load_microcode_amd(bool save, u8 family, const u8 *data, size_t size) { + struct ucode_patch *p; enum ucode_state ret; /* free old equiv table */ free_equiv_cpu_table(); ret = __load_microcode_amd(family, data, size); - - if (ret != UCODE_OK) + if (ret != UCODE_OK) { cleanup(); + return ret; + } -#ifdef CONFIG_X86_32 - /* save BSP's matching patch for early load */ - if (save) { - struct ucode_patch *p = find_patch(0); - if (p) { - memset(amd_ucode_patch, 0, PATCH_MAX_SIZE); - memcpy(amd_ucode_patch, p->data, min_t(u32, ksize(p->data), - PATCH_MAX_SIZE)); - } + p = find_patch(0); + if (!p) { + return ret; + } else { + if (boot_cpu_data.microcode == p->patch_id) + return ret; + + ret = UCODE_NEW; } -#endif + + /* save BSP's matching patch for early load */ + if (!save) + return ret; + + memset(amd_ucode_patch, 0, PATCH_MAX_SIZE); + memcpy(amd_ucode_patch, p->data, min_t(u32, ksize(p->data), PATCH_MAX_SIZE)); + return ret; } diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c index bde629e..e6d5caa 100644 --- a/arch/x86/kernel/cpu/microcode/core.c +++ b/arch/x86/kernel/cpu/microcode/core.c @@ -607,7 +607,7 @@ static ssize_t reload_store(struct device *dev, return size; tmp_ret = microcode_ops->request_microcode_fw(bsp, &microcode_pdev->dev, true); - if (tmp_ret != UCODE_OK) + if (tmp_ret != UCODE_NEW) return size; get_online_cpus(); @@ -691,10 +691,8 @@ static enum ucode_state microcode_init_cpu(int cpu, bool refresh_fw) if (system_state != SYSTEM_RUNNING) return UCODE_NFOUND; - ustate = microcode_ops->request_microcode_fw(cpu, &microcode_pdev->dev, -refresh_fw); - - if (ustate == UCODE_OK) { + ustate = microcode_ops->request_microcode_fw(cpu, &microcode_pdev->dev, refresh_fw); + if (ustate == UCODE_NEW) { pr_debug("CPU%d updated upon init\n", cpu); apply_microcode_on_target(cpu); } diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c 
index 2aded9d..32b8e57 100644 --- a/arch/x86/kernel/cpu/microcode/intel.c +++ b/arch/x86/kernel/cpu/microcode/intel.c @@ -862,6 +862,7 @@ static enum ucode_state generic_load_microcode(int cpu, void *data, size_t size, unsigned int leftover = size; unsigned int curr_mc_size = 0, new_mc_size = 0; unsigned int csig, cpf; + enum ucode_state ret = UCODE_OK; while (leftover) { struct microcode_header_intel mc_header; @@ -903,6 +904,7 @@ static enum ucode_state generic_load_microcode(int cpu, void *data, size_t size,
[4.15 & 4.14 stable 03/12] x86/CPU: Check CPU feature bits after microcode upgrade
From: Borislav Petkov commit 42ca8082e260dcfd8afa2afa6ec1940b9d41724c upstream With some microcode upgrades, new CPUID features can become visible on the CPU. Check what the kernel has mirrored now and issue a warning hinting at possible things the user/admin can do to make use of the newly visible features. Originally-by: Ashok Raj Tested-by: Ashok Raj Signed-off-by: Borislav Petkov Reviewed-by: Ashok Raj Cc: Andy Lutomirski Cc: Arjan van de Ven Cc: Borislav Petkov Cc: Dan Williams Cc: Dave Hansen Cc: David Woodhouse Cc: Greg Kroah-Hartman Cc: Josh Poimboeuf Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Asit K Mallick Cc: sta...@vger.kernel.org Link: http://lkml.kernel.org/r/20180216112640.11554-4...@alien8.de Signed-off-by: Ingo Molnar --- arch/x86/kernel/cpu/common.c | 20 1 file changed, 20 insertions(+) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 84f1cd8..348cf48 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1757,5 +1757,25 @@ core_initcall(init_cpu_syscore); */ void microcode_check(void) { + struct cpuinfo_x86 info; + perf_check_microcode(); + + /* Reload CPUID max function as it might've changed. */ + info.cpuid_level = cpuid_eax(0); + + /* +* Copy all capability leafs to pick up the synthetic ones so that +* memcmp() below doesn't fail on that. The ones coming from CPUID will +* get overwritten in get_cpu_cap(). +*/ + memcpy(&info.x86_capability, &boot_cpu_data.x86_capability, sizeof(info.x86_capability)); + + get_cpu_cap(&info); + + if (!memcmp(&info.x86_capability, &boot_cpu_data.x86_capability, sizeof(info.x86_capability))) + return; + + pr_warn("x86/CPU: CPU features have changed after loading microcode, but might not take effect.\n"); + pr_warn("x86/CPU: Please consider either early loading through initrd/built-in or a potential BIOS update.\n"); } -- 2.7.4
[4.15 & 4.14 stable 01/12] x86/microcode: Propagate return value from updating functions
From: Borislav Petkov commit 3f1f576a195aa266813cbd4ca70291deb61e0129 upstream ... so that callers can know when microcode was updated and act accordingly. Tested-by: Ashok Raj Signed-off-by: Borislav Petkov Reviewed-by: Ashok Raj Cc: Andy Lutomirski Cc: Arjan van de Ven Cc: Borislav Petkov Cc: Dan Williams Cc: Dave Hansen Cc: David Woodhouse Cc: Greg Kroah-Hartman Cc: Josh Poimboeuf Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Asit K Mallick Cc: sta...@vger.kernel.org Link: http://lkml.kernel.org/r/20180216112640.11554-2...@alien8.de Signed-off-by: Ingo Molnar --- arch/x86/include/asm/microcode.h | 9 +++-- arch/x86/kernel/cpu/microcode/amd.c | 10 +- arch/x86/kernel/cpu/microcode/core.c | 33 + arch/x86/kernel/cpu/microcode/intel.c | 10 +- 4 files changed, 34 insertions(+), 28 deletions(-) diff --git a/arch/x86/include/asm/microcode.h b/arch/x86/include/asm/microcode.h index 55520cec..7fb1047 100644 --- a/arch/x86/include/asm/microcode.h +++ b/arch/x86/include/asm/microcode.h @@ -37,7 +37,12 @@ struct cpu_signature { struct device; -enum ucode_state { UCODE_ERROR, UCODE_OK, UCODE_NFOUND }; +enum ucode_state { + UCODE_OK= 0, + UCODE_UPDATED, + UCODE_NFOUND, + UCODE_ERROR, +}; struct microcode_ops { enum ucode_state (*request_microcode_user) (int cpu, @@ -54,7 +59,7 @@ struct microcode_ops { * are being called. * See also the "Synchronization" section in microcode_core.c. */ - int (*apply_microcode) (int cpu); + enum ucode_state (*apply_microcode) (int cpu); int (*collect_cpu_info) (int cpu, struct cpu_signature *csig); }; diff --git a/arch/x86/kernel/cpu/microcode/amd.c b/arch/x86/kernel/cpu/microcode/amd.c index 330b846..a998e1a 100644 --- a/arch/x86/kernel/cpu/microcode/amd.c +++ b/arch/x86/kernel/cpu/microcode/amd.c @@ -498,7 +498,7 @@ static unsigned int verify_patch_size(u8 family, u32 patch_size, return patch_size; } -static int apply_microcode_amd(int cpu) +static enum ucode_state apply_microcode_amd(int cpu) { struct cpuinfo_x86 *c = &cpu_data(cpu); struct microcode_amd *mc_amd; @@ -512,7 +512,7 @@ static int apply_microcode_amd(int cpu) p = find_patch(cpu); if (!p) - return 0; + return UCODE_NFOUND; mc_amd = p->data; uci->mc = p->data; @@ -523,13 +523,13 @@ static int apply_microcode_amd(int cpu) if (rev >= mc_amd->hdr.patch_id) { c->microcode = rev; uci->cpu_sig.rev = rev; - return 0; + return UCODE_OK; } if (__apply_microcode_amd(mc_amd)) { pr_err("CPU%d: update failed for patch_level=0x%08x\n", cpu, mc_amd->hdr.patch_id); - return -1; + return UCODE_ERROR; } pr_info("CPU%d: new patch_level=0x%08x\n", cpu, mc_amd->hdr.patch_id); @@ -537,7 +537,7 @@ static int apply_microcode_amd(int cpu) uci->cpu_sig.rev = mc_amd->hdr.patch_id; c->microcode = mc_amd->hdr.patch_id; - return 0; + return UCODE_UPDATED; } static int install_equiv_cpu_table(const u8 *buf) diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c index e4fc595..7c42326 100644 --- a/arch/x86/kernel/cpu/microcode/core.c +++ b/arch/x86/kernel/cpu/microcode/core.c @@ -374,7 +374,7 @@ static int collect_cpu_info(int cpu) } struct apply_microcode_ctx { - int err; + enum ucode_state err; }; static void apply_microcode_local(void *arg) @@ -489,31 +489,29 @@ static void __exit microcode_dev_exit(void) /* fake device for request_firmware */ static struct platform_device *microcode_pdev; -static int reload_for_cpu(int cpu) +static enum ucode_state reload_for_cpu(int cpu) { struct ucode_cpu_info *uci = ucode_cpu_info + cpu; enum ucode_state ustate; - int err = 0; if 
(!uci->valid) - return err; + return UCODE_OK; ustate = microcode_ops->request_microcode_fw(cpu, &microcode_pdev->dev, true); - if (ustate == UCODE_OK) - apply_microcode_on_target(cpu); - else - if (ustate == UCODE_ERROR) - err = -EINVAL; - return err; + if (ustate != UCODE_OK) + return ustate; + + return apply_microcode_on_target(cpu); } static ssize_t reload_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t size) { + enum ucode_state tmp_ret = UCODE_OK; unsigned long val; + ssize_t ret = 0; int cpu; - ssize_t ret = 0, tmp_ret; ret = kstrtoul(buf, 0, &val);
[4.15 & 4.14 stable 02/12] x86/CPU: Add a microcode loader callback
From: Borislav Petkov commit 1008c52c09dcb23d93f8e0ea83a6246265d2cce0 upstream Add a callback function which the microcode loader calls when microcode has been updated to a newer revision. Do the callback only when no error was encountered during loading. Tested-by: Ashok Raj Signed-off-by: Borislav Petkov Reviewed-by: Ashok Raj Cc: Andy Lutomirski Cc: Arjan van de Ven Cc: Borislav Petkov Cc: Dan Williams Cc: Dave Hansen Cc: David Woodhouse Cc: Greg Kroah-Hartman Cc: Josh Poimboeuf Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Asit K Mallick Cc: sta...@vger.kernel.org Link: http://lkml.kernel.org/r/20180216112640.11554-3...@alien8.de Signed-off-by: Ingo Molnar --- arch/x86/include/asm/processor.h | 1 + arch/x86/kernel/cpu/common.c | 10 ++ arch/x86/kernel/cpu/microcode/core.c | 8 ++-- 3 files changed, 17 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 44c2c4e..a5fc8f8 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -969,4 +969,5 @@ bool xen_set_default_idle(void); void stop_this_cpu(void *dummy); void df_debug(struct pt_regs *regs, long error_code); +void microcode_check(void); #endif /* _ASM_X86_PROCESSOR_H */ diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 824aee0..84f1cd8 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1749,3 +1749,13 @@ static int __init init_cpu_syscore(void) return 0; } core_initcall(init_cpu_syscore); + +/* + * The microcode loader calls this upon late microcode load to recheck features, + * only when microcode has been updated. Caller holds microcode_mutex and CPU + * hotplug lock. + */ +void microcode_check(void) +{ + perf_check_microcode(); +} diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c index 7c42326..b40b56e 100644 --- a/arch/x86/kernel/cpu/microcode/core.c +++ b/arch/x86/kernel/cpu/microcode/core.c @@ -509,6 +509,7 @@ static ssize_t reload_store(struct device *dev, const char *buf, size_t size) { enum ucode_state tmp_ret = UCODE_OK; + bool do_callback = false; unsigned long val; ssize_t ret = 0; int cpu; @@ -531,10 +532,13 @@ static ssize_t reload_store(struct device *dev, if (!ret) ret = -EINVAL; } + + if (tmp_ret == UCODE_UPDATED) + do_callback = true; } - if (!ret && tmp_ret == UCODE_UPDATED) - perf_check_microcode(); + if (!ret && do_callback) + microcode_check(); mutex_unlock(&microcode_mutex); put_online_cpus(); -- 2.7.4
[Patch V2 1/3] x86, mce: Add LMCE definitions.
Add the definitions required to support Local Machine Check Exceptions (LMCE).

See http://www.intel.com/sdm Volume 3, System Programming Guide, chapter 15 for more information on the MSRs and documentation on Local MCE.

Signed-off-by: Ashok Raj
---
 arch/x86/include/asm/mce.h            | 5 +++++
 arch/x86/include/uapi/asm/msr-index.h | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 1f5a86d..677a408 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -17,11 +17,16 @@
 #define MCG_EXT_CNT(c)	(((c) & MCG_EXT_CNT_MASK) >> MCG_EXT_CNT_SHIFT)
 #define MCG_SER_P	(1ULL<<24)   /* MCA recovery/new status bits */
 #define MCG_ELOG_P	(1ULL<<26)   /* Extended error log supported */
+#define MCG_LMCE_P	(1ULL<<27)   /* Local machine check supported */

 /* MCG_STATUS register defines */
 #define MCG_STATUS_RIPV  (1ULL<<0)   /* restart ip valid */
 #define MCG_STATUS_EIPV  (1ULL<<1)   /* ip points to correct instruction */
 #define MCG_STATUS_MCIP  (1ULL<<2)   /* machine check in progress */
+#define MCG_STATUS_LMCES (1ULL<<3)   /* LMCE signaled */
+
+/* MCG_EXT_CTL register defines */
+#define MCG_EXT_CTL_LMCE_EN (1ULL<<0) /* Enable LMCE */

 /* MCi_STATUS register defines */
 #define MCI_STATUS_VAL   (1ULL<<63)  /* valid error */
diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index c469490..32c69d5 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -56,6 +56,7 @@
 #define MSR_IA32_MCG_CAP	0x00000179
 #define MSR_IA32_MCG_STATUS	0x0000017a
 #define MSR_IA32_MCG_CTL	0x0000017b
+#define MSR_IA32_MCG_EXT_CTL	0x000004d0

 #define MSR_OFFCORE_RSP_0	0x000001a6
 #define MSR_OFFCORE_RSP_1	0x000001a7
@@ -379,6 +380,7 @@
 #define FEATURE_CONTROL_LOCKED				(1<<0)
 #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX	(1<<1)
 #define FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX	(1<<2)
+#define FEATURE_CONTROL_LMCE				(1<<20)

 #define MSR_IA32_APICBASE	0x0000001b
 #define MSR_IA32_APICBASE_BSP	(1<<8)
-- 
1.9.1
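With these definitions in place, checking whether the CPU can deliver local machine checks reduces to the following sketch (the full check, including the BIOS opt-in via FEATURE_CONTROL, arrives in patch 2/3):

u64 cap;

rdmsrl(MSR_IA32_MCG_CAP, cap);
if (cap & MCG_LMCE_P)
	pr_info("Local machine check (LMCE) supported\n");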
[Patch V2 3/3] x86, mce: Handling LMCE events
This patch adds handling changes to do_machine_check() to process MCEs signaled as local MCE. Typically only recoverable (SRAR-type) errors will be signaled as LMCE, but the architecture does not restrict it to only those errors. When errors are signaled as LMCE, there is no need for the MCE handler to rendezvous with the other logical processors, unlike on earlier processors that would broadcast machine check errors.

See http://www.intel.com/sdm Volume 3, Chapter 15 for more information on the MSRs and documentation on Local MCE.

Signed-off-by: Ashok Raj
---
 arch/x86/kernel/cpu/mcheck/mce.c       | 32 ++--
 arch/x86/kernel/cpu/mcheck/mce_intel.c |  1 +
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d10aada..3d71daf 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1047,6 +1047,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	char *msg = "Unknown";
 	u64 recover_paddr = ~0ull;
 	int flags = MF_ACTION_REQUIRED;
+	int lmce = 0;

 	prev_state = ist_enter(regs);

@@ -1074,11 +1075,20 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		kill_it = 1;

 	/*
-	 * Go through all the banks in exclusion of the other CPUs.
-	 * This way we don't report duplicated events on shared banks
-	 * because the first one to see it will clear it.
+	 * Check if this MCE is signaled to only this logical processor
 	 */
-	order = mce_start(&no_way_out);
+	if (m.mcgstatus & MCG_STATUS_LMCES)
+		lmce = 1;
+	else {
+		/*
+		 * Go through all the banks in exclusion of the other CPUs.
+		 * This way we don't report duplicated events on shared banks
+		 * because the first one to see it will clear it.
+		 * If this is a Local MCE, then no need to perform rendezvous.
+		 */
+		order = mce_start(&no_way_out);
+	}
+
 	for (i = 0; i < cfg->banks; i++) {
 		__clear_bit(i, toclear);
 		if (!test_bit(i, valid_banks))
@@ -1155,8 +1165,18 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * Do most of the synchronization with other CPUs.
 	 * When there's any problem use only local no_way_out state.
 	 */
-	if (mce_end(order) < 0)
-		no_way_out = worst >= MCE_PANIC_SEVERITY;
+	if (!lmce) {
+		if (mce_end(order) < 0)
+			no_way_out = worst >= MCE_PANIC_SEVERITY;
+	} else {
+		/*
+		 * Local MCE skipped calling mce_reign().
+		 * If we found a fatal error, we need to panic here.
+		 */
+		if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
+			mce_panic("Machine check from unknown source",
+				NULL, NULL);
+	}

 	/*
 	 * At insane "tolerant" levels we take no action. Otherwise
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index 7d500b6..47b2a2b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -468,4 +468,5 @@ void mce_intel_feature_init(struct cpuinfo_x86 *c)
 {
 	intel_init_thermal(c);
 	intel_init_cmci();
+	intel_init_lmce();
 }
-- 
1.9.1
[Patch V2 2/3] x86, mce: Add infrastructure required to support LMCE
Initialization and handling for LMCE:
- boot time option to disable LMCE for that boot instance
- check for capability via IA32_MCG_CAP

Incorporated feedback from Boris on V1.

See http://www.intel.com/sdm Volume 3, System Programming Guide, Chapter 15
for more information on MSRs and documentation on Local MCE.

Signed-off-by: Ashok Raj
---
 Documentation/x86/x86_64/boot-options.txt |  3 ++
 arch/x86/include/asm/mce.h                |  5 +++
 arch/x86/kernel/cpu/mcheck/mce.c          |  3 ++
 arch/x86/kernel/cpu/mcheck/mce_intel.c    | 59 +++
 4 files changed, 70 insertions(+)

diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 5223479..79edee0 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -31,6 +31,9 @@ Machine check
 		(e.g. BIOS or hardware monitoring applications), conflicting
 		with OS's error handling, and you cannot deactivate the agent,
 		then this option will be a help.
+   mce=no_lmce
+		Do not opt in to Local MCE delivery. Use legacy method
+		to broadcast MCEs.
    mce=bootlog
 		Enable logging of machine checks left over from booting.
 		Disabled by default on AMD because some BIOS leave bogus ones.

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 677a408..8ba4d7a 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -109,6 +109,7 @@ struct mce_log {
 struct mca_config {
 	bool dont_log_ce;
 	bool cmci_disabled;
+	bool lmce_disabled;
 	bool ignore_ce;
 	bool disabled;
 	bool ser;
@@ -173,12 +174,16 @@
 void cmci_clear(void);
 void cmci_reenable(void);
 void cmci_rediscover(void);
 void cmci_recheck(void);
+void lmce_clear(void);
+void lmce_enable(void);
 #else
 static inline void mce_intel_feature_init(struct cpuinfo_x86 *c) { }
 static inline void cmci_clear(void) {}
 static inline void cmci_reenable(void) {}
 static inline void cmci_rediscover(void) {}
 static inline void cmci_recheck(void) {}
+static inline void lmce_clear(void) {}
+static inline void lmce_enable(void) {}
 #endif

 #ifdef CONFIG_X86_MCE_AMD

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index e535533..d10aada 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1976,6 +1976,7 @@ void mce_disable_bank(int bank)
 /*
  * mce=off Disables machine check
  * mce=no_cmci Disables CMCI
+ * mce=no_lmce Disables LMCE
  * mce=dont_log_ce Clears corrected events silently, no log created for CEs.
  * mce=ignore_ce Disables polling and CMCI, corrected events are not cleared.
  * mce=TOLERANCELEVEL[,monarchtimeout] (number, see above)
@@ -1999,6 +2000,8 @@ static int __init mcheck_enable(char *str)
 		cfg->disabled = true;
 	else if (!strcmp(str, "no_cmci"))
 		cfg->cmci_disabled = true;
+	else if (!strcmp(str, "no_lmce"))
+		cfg->lmce_disabled = true;
 	else if (!strcmp(str, "dont_log_ce"))
 		cfg->dont_log_ce = true;
 	else if (!strcmp(str, "ignore_ce"))

diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index b4a41cf..7d500b6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -91,6 +91,37 @@ static int cmci_supported(int *banks)
 	return !!(cap & MCG_CMCI_P);
 }

+static bool lmce_supported(void)
+{
+	u64 cap, feature_control;
+
+	if (mca_cfg.lmce_disabled)
+		return false;
+
+	rdmsrl(MSR_IA32_MCG_CAP, cap);
+
+	/*
+	 * LMCE depends on recovery support in the processor. Hence both
+	 * MCG_SER_P and MCG_LMCE_P should be present in MCG_CAP.
+	 */
+	if ((cap & (MCG_SER_P | MCG_LMCE_P)) != (MCG_SER_P | MCG_LMCE_P))
+		return false;
+
+	/*
+	 * BIOS should indicate support for LMCE by setting bit 20 in
+	 * IA32_FEATURE_CONTROL, without which touching MCG_EXT_CTL will
+	 * generate a #GP fault.
+	 */
+	rdmsrl(MSR_IA32_FEATURE_CONTROL, feature_control);
+	return (feature_control & (FEATURE_CONTROL_LOCKED |
+				   FEATURE_CONTROL_LMCE)) ==
+	       (FEATURE_CONTROL_LOCKED | FEATURE_CONTROL_LMCE);
+}
+
 bool mce_intel_cmci_poll(void)
 {
 	if (__this_cpu_read(cmci_storm_state) == CMCI_STORM_NONE)
@@ -405,6 +436,34 @@ static void intel_init_cmci(void)
 	cmci_recheck();
 }

+void intel_init_lmce(void)
+{
+	u64 val;
+
+	if (!lmce_supported())
+		return;
+
+	rdmsrl(MSR_IA32_MCG_EXT_CTL, val);
+	val |= MCG_EXT_CTL_LMCE_EN;
+	wrmsrl(MSR_IA32_MCG_EXT_CTL, val);
+}
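For reference, opting out of LMCE at boot only requires adding the new option
to the kernel command line; the GRUB entry below is illustrative (kernel
image and root device are hypothetical):

	linux /boot/vmlinuz-4.1.0 root=/dev/sda1 ro mce=no_lmce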
[Patch V2 0/3] x86, mce: Local Machine Check Exception (LMCE)
Hi Boris,

Thanks for the feedback on V1. Almost all of your recommendations are
included in this update. I haven't had a chance to test on qemu yet, but
this patch fixes access to the MSR per your recommendation, so it should be
fine. I'm in the process of making similar changes to kvm/qemu that I will
send once I have learned to build-test them :-).

Historically machine checks on Intel x86 processors have been broadcast to
all logical processors in the system. Upcoming CPUs will support an opt-in
mechanism to request that some machine checks be delivered to a single
logical processor experiencing the fault.

For more details see Vol3, Chapter 15, Machine Check Architecture.

Modified to incorporate feedback from Boris on V1 patches.

Ashok Raj (3):
  x86, mce: Add LMCE definitions.
  x86, mce: Add infrastructure required to support LMCE
  x86, mce: Handling LMCE events

 Documentation/x86/x86_64/boot-options.txt |  3 ++
 arch/x86/include/asm/mce.h                | 10 ++
 arch/x86/include/uapi/asm/msr-index.h     |  2 ++
 arch/x86/kernel/cpu/mcheck/mce.c          | 35 ++
 arch/x86/kernel/cpu/mcheck/mce_intel.c    | 60 +++
 5 files changed, 104 insertions(+), 6 deletions(-)
--
1.9.1
[Patch V1 1/3] x86, mce: Add LMCE definitions.
Add required definitions to support Local Machine Check Exceptions. See
http://www.intel.com/sdm Volume 3, System Programming Guide, Chapter 15 for
more information on MSRs and documentation on Local MCE.

Signed-off-by: Ashok Raj
---
 arch/x86/include/asm/mce.h            | 5 +
 arch/x86/include/uapi/asm/msr-index.h | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 1f5a86d..677a408 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -17,11 +17,16 @@
 #define MCG_EXT_CNT(c)	(((c) & MCG_EXT_CNT_MASK) >> MCG_EXT_CNT_SHIFT)
 #define MCG_SER_P	(1ULL<<24)   /* MCA recovery/new status bits */
 #define MCG_ELOG_P	(1ULL<<26)   /* Extended error log supported */
+#define MCG_LMCE_P	(1ULL<<27)   /* Local machine check supported */

 /* MCG_STATUS register defines */
 #define MCG_STATUS_RIPV  (1ULL<<0)   /* restart ip valid */
 #define MCG_STATUS_EIPV  (1ULL<<1)   /* ip points to correct instruction */
 #define MCG_STATUS_MCIP  (1ULL<<2)   /* machine check in progress */
+#define MCG_STATUS_LMCES (1ULL<<3)   /* LMCE signaled */
+
+/* MCG_EXT_CTL register defines */
+#define MCG_EXT_CTL_LMCE_EN (1ULL<<0) /* Enable LMCE */

 /* MCi_STATUS register defines */
 #define MCI_STATUS_VAL   (1ULL<<63)  /* valid error */

diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index c469490..e28d5a2 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -56,6 +56,7 @@
 #define MSR_IA32_MCG_CAP	0x0179
 #define MSR_IA32_MCG_STATUS	0x017a
 #define MSR_IA32_MCG_CTL	0x017b
+#define MSR_IA32_MCG_EXT_CTL	0x04d0

 #define MSR_OFFCORE_RSP_0	0x01a6
 #define MSR_OFFCORE_RSP_1	0x01a7
@@ -379,6 +380,7 @@
 #define FEATURE_CONTROL_LOCKED				(1<<0)
 #define FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX	(1<<1)
 #define FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX	(1<<2)
+#define FEATURE_CONTROL_LMCE_SUPPORT_ENABLED		(1<<20)

 #define MSR_IA32_APICBASE	0x001b
 #define MSR_IA32_APICBASE_BSP	(1<<8)
--
1.9.1
[Patch V1 0/3] x86 Local Machine Check Exception (LMCE)
Historically machine checks on Intel x86 processors have been broadcast to
all logical processors in the system. Upcoming CPUs will support an opt-in
mechanism to request that some machine checks be delivered to a single
logical processor experiencing the fault.

For more details see Vol3, Chapter 15, Machine Check Architecture.

Ashok Raj (3):
  x86, mce: Add LMCE definitions.
  x86, mce: Add infrastructure required to support LMCE
  x86, mce: Handling LMCE events

 Documentation/x86/x86_64/boot-options.txt |  3 ++
 arch/x86/include/asm/mce.h                | 10 
 arch/x86/include/uapi/asm/msr-index.h     |  2 +
 arch/x86/kernel/cpu/mcheck/mce.c          | 28 ++--
 arch/x86/kernel/cpu/mcheck/mce_intel.c    | 76 +++
 5 files changed, 116 insertions(+), 3 deletions(-)
--
1.9.1
[Patch V1 3/3] x86, mce: Handling LMCE events
This patch adds handling to do_machine_check() to process MCEs signaled as a
local MCE. Typically only recoverable (SRAR-type) errors will be signaled as
LMCE, but the architecture does not restrict LMCE to only those errors. When
an error is signaled as LMCE, there is no need for the MCE handler to perform
rendezvous with other logical processors, unlike earlier processors that
would broadcast machine check errors.

See http://www.intel.com/sdm Volume 3, Chapter 15 for more information on
MSRs and documentation on Local MCE.

Signed-off-by: Ashok Raj
---
 arch/x86/kernel/cpu/mcheck/mce.c       | 25 ++---
 arch/x86/kernel/cpu/mcheck/mce_intel.c |  1 +
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d10aada..c130391 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1047,6 +1047,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	char *msg = "Unknown";
 	u64 recover_paddr = ~0ull;
 	int flags = MF_ACTION_REQUIRED;
+	int lmce = 0;

 	prev_state = ist_enter(regs);
@@ -1074,11 +1075,19 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		kill_it = 1;

 	/*
+	 * Check if this MCE is signaled to only this logical processor
+	 */
+	if (m.mcgstatus & MCG_STATUS_LMCES)
+		lmce = 1;
+	/*
 	 * Go through all the banks in exclusion of the other CPUs.
 	 * This way we don't report duplicated events on shared banks
 	 * because the first one to see it will clear it.
+	 * If this is a Local MCE, then no need to perform rendezvous.
 	 */
-	order = mce_start(&no_way_out);
+	if (!lmce)
+		order = mce_start(&no_way_out);
+
 	for (i = 0; i < cfg->banks; i++) {
 		__clear_bit(i, toclear);
 		if (!test_bit(i, valid_banks))
@@ -1155,8 +1164,18 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * Do most of the synchronization with other CPUs.
 	 * When there's any problem use only local no_way_out state.
 	 */
-	if (mce_end(order) < 0)
-		no_way_out = worst >= MCE_PANIC_SEVERITY;
+	if (!lmce) {
+		if (mce_end(order) < 0)
+			no_way_out = worst >= MCE_PANIC_SEVERITY;
+	} else {
+		/*
+		 * Local MCE skipped calling mce_reign(), so if we found a
+		 * fatal error, we need to panic here.
+		 */
+		if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
+			mce_panic("Machine check from unknown source",
+				  NULL, NULL);
+	}

 	/*
 	 * At insane "tolerant" levels we take no action. Otherwise

diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index be3a5c6..73a2844 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -484,4 +484,5 @@ void mce_intel_feature_init(struct cpuinfo_x86 *c)
 {
 	intel_init_thermal(c);
 	intel_init_cmci();
+	intel_init_lmce();
 }
--
1.9.1
[Patch V1 2/3] x86, mce: Add infrastructure required to support LMCE
Initialization and handling for LMCE:
- boot time option to disable LMCE for that boot instance
- check for capability via IA32_MCG_CAP
- provide ability to enable/disable LMCE on demand

See http://www.intel.com/sdm Volume 3, System Programming Guide, Chapter 15
for more information on MSRs and documentation on Local MCE.

Signed-off-by: Ashok Raj
---
 Documentation/x86/x86_64/boot-options.txt |  3 ++
 arch/x86/include/asm/mce.h                |  5 +++
 arch/x86/kernel/cpu/mcheck/mce.c          |  3 ++
 arch/x86/kernel/cpu/mcheck/mce_intel.c    | 75 +++
 4 files changed, 86 insertions(+)

diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 5223479..79edee0 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -31,6 +31,9 @@ Machine check
 		(e.g. BIOS or hardware monitoring applications), conflicting
 		with OS's error handling, and you cannot deactivate the agent,
 		then this option will be a help.
+   mce=no_lmce
+		Do not opt in to Local MCE delivery. Use legacy method
+		to broadcast MCEs.
    mce=bootlog
 		Enable logging of machine checks left over from booting.
 		Disabled by default on AMD because some BIOS leave bogus ones.

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 677a408..8ba4d7a 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -109,6 +109,7 @@ struct mce_log {
 struct mca_config {
 	bool dont_log_ce;
 	bool cmci_disabled;
+	bool lmce_disabled;
 	bool ignore_ce;
 	bool disabled;
 	bool ser;
@@ -173,12 +174,16 @@
 void cmci_clear(void);
 void cmci_reenable(void);
 void cmci_rediscover(void);
 void cmci_recheck(void);
+void lmce_clear(void);
+void lmce_enable(void);
 #else
 static inline void mce_intel_feature_init(struct cpuinfo_x86 *c) { }
 static inline void cmci_clear(void) {}
 static inline void cmci_reenable(void) {}
 static inline void cmci_rediscover(void) {}
 static inline void cmci_recheck(void) {}
+static inline void lmce_clear(void) {}
+static inline void lmce_enable(void) {}
 #endif

 #ifdef CONFIG_X86_MCE_AMD

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index e535533..d10aada 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1976,6 +1976,7 @@ void mce_disable_bank(int bank)
 /*
  * mce=off Disables machine check
  * mce=no_cmci Disables CMCI
+ * mce=no_lmce Disables LMCE
  * mce=dont_log_ce Clears corrected events silently, no log created for CEs.
  * mce=ignore_ce Disables polling and CMCI, corrected events are not cleared.
  * mce=TOLERANCELEVEL[,monarchtimeout] (number, see above)
@@ -1999,6 +2000,8 @@ static int __init mcheck_enable(char *str)
 		cfg->disabled = true;
 	else if (!strcmp(str, "no_cmci"))
 		cfg->cmci_disabled = true;
+	else if (!strcmp(str, "no_lmce"))
+		cfg->lmce_disabled = true;
 	else if (!strcmp(str, "dont_log_ce"))
 		cfg->dont_log_ce = true;
 	else if (!strcmp(str, "ignore_ce"))

diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index b4a41cf..be3a5c6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -70,6 +70,10 @@ enum {
 static atomic_t cmci_storm_on_cpus;

+#define FEATURE_CONTROL_LMCE_BITS	(FEATURE_CONTROL_LOCKED | \
+					 FEATURE_CONTROL_LMCE_SUPPORT_ENABLED)
+#define MCG_CAP_LMCE_BITS		(MCG_SER_P | MCG_LMCE_P)
+
 static int cmci_supported(int *banks)
 {
 	u64 cap;
@@ -91,6 +95,34 @@ static int cmci_supported(int *banks)
 	return !!(cap & MCG_CMCI_P);
 }

+static bool lmce_supported(void)
+{
+	u64 cap, feature_ctl;
+	bool lmce_bios_support;
+
+	if (mca_cfg.lmce_disabled)
+		return false;
+
+	rdmsrl(MSR_IA32_MCG_CAP, cap);
+	rdmsrl(MSR_IA32_FEATURE_CONTROL, feature_ctl);
+
+	/*
+	 * BIOS should indicate support for LMCE by setting bit 20 in
+	 * IA32_FEATURE_CONTROL, without which touching MCG_EXT_CTL will
+	 * generate a #GP fault.
+	 */
+	lmce_bios_support = (feature_ctl & FEATURE_CONTROL_LMCE_BITS) ==
+			    FEATURE_CONTROL_LMCE_BITS;
+
+	/*
+	 * MCG_CAP should indicate both MCG_SER_P and MCG_LMCE_P
+	 */
+	return ((cap & MCG_CAP_LMCE_BITS) == MCG_CAP_LMCE_BITS) &&
+	       lmce_bios_support;
+}
+
 bool mce_intel_cmci_poll(void)
 {
 	if (__this_cpu_read(cmci_storm_state) == CMCI_STORM_NONE)
@@ -405,6 +437,49 @@ static void intel_init_cmci(void)
 	cmci_recheck();
 }
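The archived message is truncated before the new intel_init_lmce() hunk. As a
rough sketch only, not the original patch text: given the lmce_enable() and
lmce_clear() declarations added to mce.h above, the V1 on-demand helpers
would plausibly toggle MCG_EXT_CTL_LMCE_EN along these lines, guarded by the
same lmce_supported() check.

void lmce_enable(void)
{
	u64 val;

	if (!lmce_supported())
		return;

	rdmsrl(MSR_IA32_MCG_EXT_CTL, val);
	val |= MCG_EXT_CTL_LMCE_EN;	/* opt in to local MCE delivery */
	wrmsrl(MSR_IA32_MCG_EXT_CTL, val);
}

void lmce_clear(void)
{
	u64 val;

	if (!lmce_supported())
		return;

	rdmsrl(MSR_IA32_MCG_EXT_CTL, val);
	val &= ~MCG_EXT_CTL_LMCE_EN;	/* fall back to broadcast MCEs */
	wrmsrl(MSR_IA32_MCG_EXT_CTL, val);
}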