Re: About closed-source module use GPL module function
Hi, On Jan 30, 2008 8:41 AM, CooperYuan Cooper <[EMAIL PROTECTED]> wrote: > Now I am porting a device driver to Linux, its source code is not opened. > > In this module, I use some interface functions exported from GPL > module through EXPORT_SYMBOL macros. (not EXPORT_SYMBOL_GPL), For > example, register_sound_dsp() and so on. > > Do I violate GPL? How to solve it? This list is probably not a good source for (free) legal advice, but the simplest way to be sure is to release the source code under GPLv2. HTH. Pekka -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.24] x86: add sysfs interface for cpuid module
On Wed, Jan 30, 2008 at 06:22:43AM +0800, Yi Yang wrote: > On Tue, 2008-01-29 at 09:44 +0100, Sam Ravnborg wrote: > > > + > > > +static struct notifier_block __cpuinitdata cpuid_sysfs_cpu_notifier = { > > > + .notifier_call = cpuid_sysfs_cpu_callback, > > > +}; > > Data is annotated _cpuintidata > > > > but > > > > > + > > Data is annotated _cpuintidata > > > > > @@ -217,11 +445,14 @@ static void __exit cpuid_exit(void) > > > { > > > int cpu = 0; > > > > > > - for_each_online_cpu(cpu) > > > + for_each_online_cpu(cpu) { > > > cpuid_device_destroy(cpu); > > > + remove_cpuid_sysfs(cpu); > > > + } > > > class_destroy(cpuid_class); > > > unregister_chrdev(CPUID_MAJOR, "cpu/cpuid"); > > > unregister_hotcpu_notifier(&cpuid_class_cpu_notifier); > > > + unregister_hotcpu_notifier(&cpuid_sysfs_cpu_notifier); > > > > used in an __exit function. > > > > You should have seen a Section mismatch warning for this. > > The right fix is to annotate the cpuid_sysfs_cpu_notifier > > with __initdata_refok (soon to be named __refdata) > > Or even better to declare it const and use _refconst. > I think __cpuinitdata is different from __initdata, i have tested it > by insmod, rmmod, echo 0/1 > /sys/devices/system/cpu/cpu1/online > repeatly, it hasn't any issue. __cpuinit & __cpuinitdata have over time been used for different purposes: a) To annotate code/data used in the init path that can be discarded after init in the non-HOTPLUG_CPU case. b) To annotate code/data used in the 'core' HOTPLUG_CPU functionality that is only in use if HOTPLUG_CPU=y. The b) usage is questionable, and the annotation of cpuid_sysfs_cpu_notifier belongs in the b) category. The correct solution would be to factor out the 'core' HOTPLUG_CPU=y code into a set of separate files and use the usual mechanism in the Makefile to select when to include this code in the kernel. 
The improved section mismatch checks in modpost have just brought this issue to attention, and since you are now adding code that does the wrong thing, it is being discussed. Sam
Re: [PATCH v3] ipwireless: driver for 3G PC Card
Hi, On Tue, Jan 29, 2008 at 02:49:24PM +0100, David Sterba wrote: > --- > From: David Sterba <[EMAIL PROTECTED]> > > ipwireless_cs: driver for PC Card, 3G internet connection > > The driver is manufactured by IPWireless. Sorry this ^^^ is not correct, should be "The device is manufactured by IPWireless." Thanks, Dave
Re: ndiswrapper and GPL-only symbols redux
On Tue, 29 Jan 2008, Zan Lynx wrote: > Jon Masters wrote: > > I wouldn't quite say that. I wasn't going to comment, but...personally, > > I actually disagree with the assertions that ndiswrapper isn't causing > > proprietary code to link against GPL functions in the kernel (how is > > an NDIS implementation any different than a shim layer provided to > > load a graphics driver?), but I wasn't trying to make that point. > > Well, as long as *any* part of the kernel ever links to proprietary code, then > GPL functions link to it in exactly the same way ndiswrapper enables. It's > only a matter of how many steps of separation. > > A perfectly GPL USB network driver linked to GPL-only functions feeds data > into the kernel where it swirls about and emerges from a proprietary network > filesystem driver, for example. A proprietary network filesystem driver _on a different system_, you mean? In this case the proprietary code has no direct access to your kernel data, except through the communication protocol. No tainting is involved, as all corruption in your kernel is caused by kernel bugs in visible code that can be debugged. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [EMAIL PROTECTED] In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: ndiswrapper and GPL-only symbols redux
On Wed, 30 Jan 2008, Måns Rullgård wrote: > Adrian Bunk <[EMAIL PROTECTED]> writes: > > On Tue, Jan 29, 2008 at 11:25:22PM +, Måns Rullgård wrote: > >> As long as you don't distribute /proc/kcore, I can't see how the GPL > >> would have any say in the matter. The Windows drivers are (unrelated > >> violations aside) clearly not derived from GPL code. > > > > Someone might sell a laptop with Linux installed? > > Not a problem, unless it is booted when sold. Even that might not be > a problem, since it would be a matter of transferring ownership of a > single copy, not creating and distributing new copies, and the GPL > does is only concerned with the latter. Interesting... I never heard about this `transferring ownership of a single copy not involving GPL'. Note that some lawyers claim that at trade shows, you should not hand over a demo device running GPLed code to any interested party, as it would be distribution... Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [EMAIL PROTECTED] In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: [PATCH 2.6.24] x86: add sysfs interface for cpuid module
On Tue, 2008-01-29 at 09:44 +0100, Sam Ravnborg wrote: > > + > > +static struct notifier_block __cpuinitdata cpuid_sysfs_cpu_notifier = { > > + .notifier_call = cpuid_sysfs_cpu_callback, > > +}; > Data is annotated _cpuintidata > > but > > > + > Data is annotated _cpuintidata > > > @@ -217,11 +445,14 @@ static void __exit cpuid_exit(void) > > { > > int cpu = 0; > > > > - for_each_online_cpu(cpu) > > + for_each_online_cpu(cpu) { > > cpuid_device_destroy(cpu); > > + remove_cpuid_sysfs(cpu); > > + } > > class_destroy(cpuid_class); > > unregister_chrdev(CPUID_MAJOR, "cpu/cpuid"); > > unregister_hotcpu_notifier(&cpuid_class_cpu_notifier); > > + unregister_hotcpu_notifier(&cpuid_sysfs_cpu_notifier); > > used in an __exit function. > > You should have seen a Section mismatch warning for this. > The right fix is to annotate the cpuid_sysfs_cpu_notifier > with __initdata_refok (soon to be named __refdata) > Or even better to declare it const and use _refconst. I think __cpuinitdata is different from __initdata. I have tested it by repeatedly running insmod, rmmod, and echo 0/1 > /sys/devices/system/cpu/cpu1/online, and it has not shown any issue. > > Sam
Re: [PATCH 2.6.24] x86: add sysfs interface for cpuid module
Yi Yang wrote: > Function cpuid has reset ecx to 0 immediate before calling to __cpuid, so this shouldn't be a problem now. Unless, of course, you want to get to the information for the higher CPUID levels. The easiest way to fix that would be to use cpuid_count() and let /dev/cpu/*/cpuid take the %ecx value in the high half of the offset. That would have minimal impact on the interface. -hpa
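One possible reading of the "high half of the offset" suggestion, as a minimal sketch only: treat the 64-bit file offset as two 32-bit halves, with the CPUID level (%eax) in the low half and the %ecx value in the high half. This is plain userspace C, not the driver change itself, and every name in it (cpuid_request, cpuid_pack_offset, cpuid_unpack_offset, cpuid_offset_example) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; none of these names exist in the cpuid driver. */
struct cpuid_request {
	uint32_t level;    /* value that would be loaded into %eax */
	uint32_t subleaf;  /* value that would be loaded into %ecx */
};

/* Pack the level into the low 32 bits and the %ecx value into the high 32. */
uint64_t cpuid_pack_offset(uint32_t level, uint32_t subleaf)
{
	return ((uint64_t)subleaf << 32) | level;
}

/* Split a file offset back into the two register values. */
struct cpuid_request cpuid_unpack_offset(uint64_t pos)
{
	struct cpuid_request req;

	req.level = (uint32_t)(pos & 0xffffffffu);
	req.subleaf = (uint32_t)(pos >> 32);
	return req;
}

/* Usage: an old-style offset (level only) decodes to %ecx == 0. */
void cpuid_offset_example(void)
{
	struct cpuid_request req = cpuid_unpack_offset(4);

	assert(req.level == 4);
	assert(req.subleaf == 0);
}
```

Under this layout a reader that seeks to a plain level number, as existing users do, still gets %ecx = 0, which would be consistent with the "minimal impact on the interface" remark above.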
Re: [RFC v2 2/5] dmaengine: Add slave DMA interface
On Tuesday 29 January 2008, Haavard Skinnemoen wrote: > @@ -297,6 +356,13 @@ struct dma_device { > struct dma_async_tx_descriptor *(*device_prep_dma_interrupt)( > struct dma_chan *chan); > > + struct dma_slave_descriptor *(*device_prep_slave)( > + struct dma_chan *chan, dma_addr_t mem_addr, > + enum dma_slave_direction direction, > + enum dma_slave_width reg_width, > + size_t len, unsigned long flags); That isn't enough options! Check out arch/arm/plat-omap/dma.c (and maybe OMAP5912 DMA docs [1] for not-very-recent specs) as one example. You'll see more options that drivers need to use, including: - DMA priority and arbitration - Burst size, packing/unpacking support (for optimized memory access) - Multiple DMA quanta (not just reg_width, but also frames and blocks) - Multiple synch modes (per element/"width", frame, or block) - Multiple addressing modes: pre-index, post-index, double-index, ... - Both descriptor-based and register-based transfers - ... lots more ... Example: USB tends to use one packet per "frame" and have the DMA request signal mean "give me the next frame". It's sometimes been very important to use the tuning options to avoid some on-chip race conditions for transfers that cross lots of internal busses and clock domains, and to have special handling for aborting transfers and handling "short RX" packets. I wonder whether a unified programming interface is the right way to approach peripheral DMA support, given such variability. The DMAC from Synopsys that you're working with has some of those options, but not all of them... and other DMA controllers have their own oddities. For memcpy() acceleration, sure -- there shouldn't be much scope for differences. Source, destination, bytecount ... go! (Not that it's anywhere *near* that quick in the current interface.) 
For peripheral DMA, maybe it should be a "core plus subclasses" approach so that platform drivers can make use of hardware-specific knowledge (SOC-specific peripheral drivers using SOC-specific DMA), sharing core code for dma-memcpy() and DMA channel housekeeping. - Dave [1] http://focus.ti.com/docs/prod/folders/print/omap5912.html lists spru755c near the bottom; the "System DMA" section. > + void (*device_terminate_all)(struct dma_chan *chan); > + > void (*device_dependency_added)(struct dma_chan *chan); > enum dma_status (*device_is_tx_complete)(struct dma_chan *chan, > dma_cookie_t cookie, dma_cookie_t *last,
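The "core plus subclasses" idea can be illustrated with the usual C embedding pattern. This is only a sketch under stated assumptions: it is userspace C, and all of the names (dma_core_chan, soc_dma_chan, to_soc_dma_chan, dma_core_claim, soc_dma_burst) are hypothetical and do not come from the dmaengine patch or from any SOC driver.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical illustrations. */

/* Shared "core": housekeeping that every controller needs. */
struct dma_core_chan {
	int id;
	int busy;
};

/* SOC-specific "subclass": embeds the core and adds tuning knobs. */
struct soc_dma_chan {
	struct dma_core_chan core;
	unsigned int burst_size;  /* burst length in elements */
	unsigned int priority;    /* arbitration priority */
};

/* Recover the subclass from a pointer to its embedded core member. */
#define to_soc_dma_chan(ptr) \
	((struct soc_dma_chan *)((char *)(ptr) - offsetof(struct soc_dma_chan, core)))

/* Core code only touches the shared part. */
int dma_core_claim(struct dma_core_chan *chan)
{
	if (chan->busy)
		return -1;
	chan->busy = 1;
	return 0;
}

/* SOC-specific code sees the full structure through the core pointer. */
unsigned int soc_dma_burst(struct dma_core_chan *chan)
{
	return to_soc_dma_chan(chan)->burst_size;
}

/* Usage: core housekeeping and SOC-specific access on the same channel. */
void soc_dma_example(void)
{
	struct soc_dma_chan c = { .core = { .id = 0, .busy = 0 },
				  .burst_size = 16, .priority = 1 };

	assert(dma_core_claim(&c.core) == 0);
	assert(dma_core_claim(&c.core) == -1);  /* already claimed */
	assert(soc_dma_burst(&c.core) == 16);
}
```

The point of the layout is that shared code never needs to know about burst sizes or priorities, while a SOC-specific peripheral driver can reach them without the core interface having to enumerate every controller's options.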
Re: [RFC v2 0/5] dmaengine: Slave DMA interface and example users
On Tuesday 29 January 2008, Haavard Skinnemoen wrote: > > Btw, there's one issue I forgot to mention: I believe the DMA Engine > framework is currently misusing the DMA mapping API, and this patchset > makes things worse. > > Currently, the async_tx bits of the API do the required calls to > dma_map_single() and/or dma_map_page(), but they rely on the driver to > do the unmapping. This is problematic ... > > How do we solve this? How about: for peripheral DMA, don't let the engine see anything except dma_addr_t values. The engine needs to be able to dma_alloc_coherent() memory too, which is pre-mapped. - Dave
Re: ptrace API extensions for BTS
Sorry I did not get more into this discussion earlier. I still have not read through all of the email threads. But I have looked over the current version of your code now in -mm. I think this work has a great deal of overlap with the perfmon2 project. There are two facets that overlap, which together are the whole BTS feature. The same x86 "debug store" hardware is programmed for both the BTS and the performance monitoring features. The implementations clearly have to cooperate on managing that hardware. Your ds.c is a start in the right direction, to abstract the hardware configuration from the BTS feature and its interface. I'm not familiar with the perfmon2 code, but it may have something similar already. The rest of the BTS feature is the buffer management and the interface. It has to deal with the hardware buffer setup, context switching, and overflow interrupts, and delivering data from the hardware buffers to the interface in appropriate formats. We'd also like it to be able to trace kernel-mode as well as user-mode, and either deliver combined data or segregate the data between the two for user-space and kernel-space users who need not know about each other's tracing. (On some of the hardware you can program it to record only one or the other (X86_FEATURE_DSCPL). On older hardware, or when separately tracing both, you can trace both and then distinguish each sample by its from_ip.) perfmon2 also wants to address all of that. I don't much like the way you've shoe-horned the context-switch timestamp logging into the BTS feature. It's a nice feature to have in some form, and I sympathize with your seeing it as easy pickings once you had the BTS buffer machinery handy. But really it is not part of the BTS feature and there is nothing arch-dependent about it. 
Given some other general thing providing the buffer management et al, that could just be done in schedule(), i.e.: departs(prev); context_switch(rq, prev, next); /* unlocks the rq */ arrives(prev); If there is a general thing for event-reporting from perfmon2 or whatever, then it might be natural to have the context-switch event reports configurable to different record formats you might be using for other things. For a BTS-style record, I would use: departs: { .from = task_pt_regs(prev)->ip, .to = jiffies, .flag = MAGIC1 } arrives: { .from = jiffies, .to = task_pt_regs(next)->ip, .flag = MAGIC2 } MAGIC1 = 0x0001 MAGIC2 = 0x0002 or something like that, i.e. as if it were a "branch to block-time" and a "branch from wake-time". (Actually you might want MAGIC3 and MAGIC4 too, for whether it was a voluntary or involuntary context switch.) For a different use that does mostly other event sampling rather than BTS, it might use a different format that gives more register info, a la PEBS. I'm no expert on perfmon2 and I understand there are many issues to be resolved to get it into the kernel. But if you are not desperate to have the BTS feature in the kernel ASAP, it would be ideal IMHO if you can work with Stephane et al on putting this work together. I'd like to see the work go into the kernel in much smaller pieces even than your BTS patch set that's in -mm. The first thing is just the DS hardware management, context switch and hardware-facing parts of the buffer management (one or three or four small bisect-friendly patches just for that much). If you and Stephane can hash out a fresh patch that provides what you both need for that, that would be a great start. Personally, I'd prefer to abandon the ptrace extensions altogether in favor of some generalized event buffer interface that comes from merging perfmon2. But if you still want to do the ptrace interface, it can be built on the shared DS-management code. What do you think? 
Thanks, Roland
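The BTS-style context-switch records proposed in the message above can be written out as a small userspace C sketch. The three fields mirror the from/to/flag layout used in the message; the type and function names are hypothetical, the flag values are only the placeholders the message gives ("or something like that"), and the ip and time values are passed in as plain integers rather than read from task_pt_regs() or jiffies.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder flag values, following MAGIC1/MAGIC2 in the message. */
enum ctxsw_flag {
	CTXSW_DEPARTS = 0x0001,  /* task leaves the CPU */
	CTXSW_ARRIVES = 0x0002   /* task comes back onto the CPU */
};

/* Same three fields as a branch record: from, to, flag. */
struct bts_style_record {
	uint64_t from;
	uint64_t to;
	uint64_t flag;
};

/* Departure: a "branch" from the task's ip to the block time. */
struct bts_style_record ctxsw_departs(uint64_t prev_ip, uint64_t now)
{
	struct bts_style_record r = { prev_ip, now, CTXSW_DEPARTS };

	return r;
}

/* Arrival: a "branch" from the wake time to the task's ip. */
struct bts_style_record ctxsw_arrives(uint64_t now, uint64_t next_ip)
{
	struct bts_style_record r = { now, next_ip, CTXSW_ARRIVES };

	return r;
}

/* Usage: a consumer tells the two record kinds apart by the flag alone. */
void ctxsw_example(void)
{
	struct bts_style_record d = ctxsw_departs(0x400123, 1000);
	struct bts_style_record a = ctxsw_arrives(1010, 0x400456);

	assert(d.flag == CTXSW_DEPARTS && d.to == 1000);
	assert(a.flag == CTXSW_ARRIVES && a.from == 1010);
}
```

Keeping the same record shape as ordinary branch samples means a reader of the buffer needs no second format; it only has to check the flag to know whether from or to holds a timestamp instead of an address.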
Re: [PATCH 2.6.24] x86: add sysfs interface for cpuid module
On Tue, 2008-01-29 at 23:17 -0800, H. Peter Anvin wrote: > Yi Yang wrote: > >> > >> It's broken, because it doesn't take into account the fact that Intel > >> broke CPUID level 4 and made it "repeating" (neither did the cpuid char > >> device, because it predated the Intel braindamage; I've had a patch for > >> it privately for a while, but didn't push it upstream because paravirt > >> broke it royally and I wanted the situation to settle down.) > > > level 4 doesn't result in repeating on Intel CPU, cpuid module sets > > file offset to level, so cat /dev/cpu/*/cpuid will run cpuid instruction > > continuously. > > The issue is that Intel suddenly made CPUID ECX-sensitive, which there > is no way to represent. The cpuid function resets ecx to 0 immediately before calling __cpuid, so this shouldn't be a problem now. See include/asm-x86/processor_32.h: /* * Generic CPUID function * clear %ecx since some cpus (Cyrix MII) do not set or clear %ecx * resulting in stale register contents being returned. */ static inline void cpuid(unsigned int op, unsigned int *eax, unsigned int *ebx, unsigned int *ecx, unsigned int *edx) { *eax = op; *ecx = 0; __cpuid(eax, ebx, ecx, edx); } > > As far as cat /dev/cpu/*/cpuid, that's a user error. > > -hpa >
Re: [PATCH 2.6.24] x86: add sysfs interface for cpuid module
Yi Yang wrote: >> It's broken, because it doesn't take into account the fact that Intel broke CPUID level 4 and made it "repeating" (neither did the cpuid char device, because it predated the Intel braindamage; I've had a patch for it privately for a while, but didn't push it upstream because paravirt broke it royally and I wanted the situation to settle down.) > level 4 doesn't result in repeating on Intel CPU, cpuid module sets file offset to level, so cat /dev/cpu/*/cpuid will run cpuid instruction continuously. The issue is that Intel suddenly made CPUID ECX-sensitive, which there is no way to represent. As far as cat /dev/cpu/*/cpuid, that's a user error. -hpa
Re: [GIT PULL] ext4 update
On Tue, Jan 29, 2008 at 10:54:03PM +0100, Jan Engelhardt wrote: > > On Jan 29 2008 07:53, Theodore Tso wrote: > > > >>fwiw, diffstat is confused by git's diff output; you need to use > >>'diffstat -p1' > > I am seeing normal behavior: > > 22:52 sovereign:~/linux > git diff HEAD | diffstat That's because you are doing a diffstat of changes that haven't been checked in yet. I was doing a "git log -p origin.. | diffstat -p1", and in that incantation you definitely do need the -p1 to diffstat. - Ted
[PATCH] Debugfs compile fix.
Debugfs does not compile without CONFIG_SYSFS in the net-2.6 tree. Move kobject_create_and_add under the appropriate ifdef. The fix looks correct at first glance, but maybe the dependency should be added to Kconfig instead. Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]> --- fs/debugfs/inode.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c index d26e282..61cc937 100644 --- a/fs/debugfs/inode.c +++ b/fs/debugfs/inode.c @@ -432,9 +432,11 @@ static int __init debugfs_init(void) { int retval; +#ifdef CONFIG_SYSFS debug_kobj = kobject_create_and_add("debug", kernel_kobj); if (!debug_kobj) return -EINVAL; +#endif retval = register_filesystem(&debug_fs_type); if (retval) -- 1.5.3.rc5
Re: [PATCH 2.6.24] x86: add sysfs interface for cpuid module
On Tue, 2008-01-29 at 07:51 -0800, H. Peter Anvin wrote: > Yi Yang wrote: > > Current cpuid module will create a char device for every logical cpu, > > when a user cats /dev/cpu/*/cpuid, he/she will enter a limitless loop, > > the root cause is that cpuid module doesn't decide wether a cpuid level > > is valid, it just uses an offset to denote cpuid level and take it to > > cpuid instruction, cpuid instruction will ignore it and return some data > > > > This patch uses sysfs to avoid limitless loop and provide more flexible > > interface for cpuid, please consider to merge to -mm tree in order to test. > > This is broken. > > Triple broken. > > It's broken, because it doesn't take into account the fact that Intel > broke CPUID level 4 and made it "repeating" (neither did the cpuid char > device, because it predated the Intel braindamage; I've had a patch for > it privately for a while, but didn't push it upstream because paravirt > broke it royally and I wanted the situation to settle down.) Level 4 doesn't result in repeating on Intel CPUs. The cpuid module uses the file offset as the level, so cat /dev/cpu/*/cpuid will run the cpuid instruction continuously.
2.6.24-git6-ext4-1 patchset released
I've just released 2.6.24-git6-ext4-1. It removes the patches that have been pulled into mainline by Linus, and adds the unlocked ioctl patches from Andi Kleen, and Eric's patch to allow the root inode to use in-inode EA's. As a git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 2.6.24-git6-ext4-1 http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=shortlog;h=2.6.24-git6-ext4-1 As a patchset: ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/2.6.24-git6-ext4-1 - Ted What's now in the ext4 tree: Akira Fujita (4): ext4: online defrad header file changes ext4: online defrag-- Allocate new contiguous blocks with mballoc ext4: online defrag -- Move the file data to the new blocks Free space fragmentation functions Alex Tomas (2): vfs: add basic delayed allocation support ext4: Add basic delayed allocation support Andi Kleen (7): Convert ext2 over to use unlocked_ioctl Remove incorrect BKL comment in ext2 Convert ext3 to use unlocked_ioctl v2 ext3: Remove incorrect BKL comment Remove incorrect comment refering to lock_kernel() from jbd/jbd2 Convert ext4 to use unlocked_ioctl v2 Remove incorrect BKL comments in ext4 Aneesh Kumar K.V (2): ext4: Enable delalloc and mballoc by default. ext4: Show delalloc options Eric Sandeen (1): allow in-inode EAs on ext4 root inode Mingming Cao (2): jbd: blocks reservation fix for large block support jbd2: blocks reservation fix for large block support Theodore Ts'o (2): patch test-filesys-flag.patch ext4: New inode allocation for FLEX_BG meta-data groups. 
fs/buffer.c | 3
fs/ext2/dir.c | 2
fs/ext2/ext2.h | 3
fs/ext2/file.c | 4
fs/ext2/inode.c | 1
fs/ext2/ioctl.c | 12
fs/ext3/dir.c | 4
fs/ext3/file.c | 2
fs/ext3/ioctl.c | 12
fs/ext4/Makefile | 2
fs/ext4/balloc.c | 28
fs/ext4/defrag.c | 2206
fs/ext4/dir.c | 4
fs/ext4/extents.c | 67 -
fs/ext4/file.c | 2
fs/ext4/ialloc.c | 96 +
fs/ext4/inode.c | 174 ++-
fs/ext4/ioctl.c | 25
fs/ext4/mballoc.c | 7
fs/ext4/super.c | 91 +
fs/jbd/journal.c | 7
fs/jbd/recovery.c | 2
fs/jbd2/journal.c | 7
fs/jbd2/recovery.c | 2
fs/mpage.c | 406 +++
include/linux/ext3_fs.h | 3
include/linux/ext4_fs.h | 107 +
include/linux/ext4_fs_extents.h | 22
include/linux/ext4_fs_sb.h | 3
include/linux/mpage.h | 2
30 files changed, 3220 insertions(+), 86 deletions(-)
[PATCH] lib: Add support for DIF CRC
Add support for the T10 Data Integrity Field CRC. Signed-off-by: Martin K. Petersen <[EMAIL PROTECTED]> diff -r f5ec697e8b10 include/linux/crc-t10dif.h --- /dev/null Thu Jan 01 00:00:00 1970 + +++ b/include/linux/crc-t10dif.hTue Jan 29 13:26:19 2008 -0500 @@ -0,0 +1,9 @@ +#ifndef _LINUX_CRC_T10DIF_H +#define _LINUX_CRC_T10DIF_H + +#include + +const __u16 t10_dif_crc_table[256]; +__u16 crc_t10dif(unsigned char const *, size_t); + +#endif diff -r f5ec697e8b10 lib/Kconfig --- a/lib/Kconfig Tue Jan 29 12:11:57 2008 -0500 +++ b/lib/Kconfig Tue Jan 29 13:26:19 2008 -0500 @@ -22,6 +22,13 @@ config CRC16 modules require CRC16 functions, but a module built outside the kernel tree does. Such modules that use library CRC16 functions require M here. + +config CRC_T10DIF + tristate "CRC calculation for the T10 Data Integrity Field" + help + This option is only needed if a module that's not in the + kernel tree needs to calculate CRC checks for use with the + SCSI data integrity subsystem. config CRC_ITU_T tristate "CRC ITU-T V.41 functions" diff -r f5ec697e8b10 lib/Makefile --- a/lib/Makefile Tue Jan 29 12:11:57 2008 -0500 +++ b/lib/Makefile Tue Jan 29 13:26:19 2008 -0500 @@ -44,6 +44,7 @@ obj-$(CONFIG_BITREVERSE) += bitrev.o obj-$(CONFIG_BITREVERSE) += bitrev.o obj-$(CONFIG_CRC_CCITT)+= crc-ccitt.o obj-$(CONFIG_CRC16)+= crc16.o +obj-$(CONFIG_CRC_T10DIF)+= crc-t10dif.o obj-$(CONFIG_CRC_ITU_T)+= crc-itu-t.o obj-$(CONFIG_CRC32)+= crc32.o obj-$(CONFIG_CRC7) += crc7.o diff -r f5ec697e8b10 lib/crc-t10dif.c --- /dev/null Thu Jan 01 00:00:00 1970 + +++ b/lib/crc-t10dif.c Tue Jan 29 13:26:19 2008 -0500 @@ -0,0 +1,68 @@ +/* + * T10 Data Integrity Field CRC16 calculation + * + * Copyright (c) 2007 Oracle Corporation. All rights reserved. + * Written by Martin K. Petersen <[EMAIL PROTECTED]> + * + * This source code is licensed under the GNU General Public License, + * Version 2. See the file COPYING for more details. 
+ */ + +#include +#include +#include + +/* Table generated using the following polynomium: + * x^16 + x^15 + x^11 + x^9 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1 + * gt: 0x8bb7 + */ +const __u16 t10_dif_crc_table[256] = { + 0x, 0x8BB7, 0x9CD9, 0x176E, 0xB205, 0x39B2, 0x2EDC, 0xA56B, + 0xEFBD, 0x640A, 0x7364, 0xF8D3, 0x5DB8, 0xD60F, 0xC161, 0x4AD6, + 0x54CD, 0xDF7A, 0xC814, 0x43A3, 0xE6C8, 0x6D7F, 0x7A11, 0xF1A6, + 0xBB70, 0x30C7, 0x27A9, 0xAC1E, 0x0975, 0x82C2, 0x95AC, 0x1E1B, + 0xA99A, 0x222D, 0x3543, 0xBEF4, 0x1B9F, 0x9028, 0x8746, 0x0CF1, + 0x4627, 0xCD90, 0xDAFE, 0x5149, 0xF422, 0x7F95, 0x68FB, 0xE34C, + 0xFD57, 0x76E0, 0x618E, 0xEA39, 0x4F52, 0xC4E5, 0xD38B, 0x583C, + 0x12EA, 0x995D, 0x8E33, 0x0584, 0xA0EF, 0x2B58, 0x3C36, 0xB781, + 0xD883, 0x5334, 0x445A, 0xCFED, 0x6A86, 0xE131, 0xF65F, 0x7DE8, + 0x373E, 0xBC89, 0xABE7, 0x2050, 0x853B, 0x0E8C, 0x19E2, 0x9255, + 0x8C4E, 0x07F9, 0x1097, 0x9B20, 0x3E4B, 0xB5FC, 0xA292, 0x2925, + 0x63F3, 0xE844, 0xFF2A, 0x749D, 0xD1F6, 0x5A41, 0x4D2F, 0xC698, + 0x7119, 0xFAAE, 0xEDC0, 0x6677, 0xC31C, 0x48AB, 0x5FC5, 0xD472, + 0x9EA4, 0x1513, 0x027D, 0x89CA, 0x2CA1, 0xA716, 0xB078, 0x3BCF, + 0x25D4, 0xAE63, 0xB90D, 0x32BA, 0x97D1, 0x1C66, 0x0B08, 0x80BF, + 0xCA69, 0x41DE, 0x56B0, 0xDD07, 0x786C, 0xF3DB, 0xE4B5, 0x6F02, + 0x3AB1, 0xB106, 0xA668, 0x2DDF, 0x88B4, 0x0303, 0x146D, 0x9FDA, + 0xD50C, 0x5EBB, 0x49D5, 0xC262, 0x6709, 0xECBE, 0xFBD0, 0x7067, + 0x6E7C, 0xE5CB, 0xF2A5, 0x7912, 0xDC79, 0x57CE, 0x40A0, 0xCB17, + 0x81C1, 0x0A76, 0x1D18, 0x96AF, 0x33C4, 0xB873, 0xAF1D, 0x24AA, + 0x932B, 0x189C, 0x0FF2, 0x8445, 0x212E, 0xAA99, 0xBDF7, 0x3640, + 0x7C96, 0xF721, 0xE04F, 0x6BF8, 0xCE93, 0x4524, 0x524A, 0xD9FD, + 0xC7E6, 0x4C51, 0x5B3F, 0xD088, 0x75E3, 0xFE54, 0xE93A, 0x628D, + 0x285B, 0xA3EC, 0xB482, 0x3F35, 0x9A5E, 0x11E9, 0x0687, 0x8D30, + 0xE232, 0x6985, 0x7EEB, 0xF55C, 0x5037, 0xDB80, 0xCCEE, 0x4759, + 0x0D8F, 0x8638, 0x9156, 0x1AE1, 0xBF8A, 0x343D, 0x2353, 0xA8E4, + 0xB6FF, 0x3D48, 0x2A26, 0xA191, 0x04FA, 0x8F4D, 0x9823, 0x1394, + 
0x5942, 0xD2F5, 0xC59B, 0x4E2C, 0xEB47, 0x60F0, 0x779E, 0xFC29, + 0x4BA8, 0xC01F, 0xD771, 0x5CC6, 0xF9AD, 0x721A, 0x6574, 0xEEC3, + 0xA415, 0x2FA2, 0x38CC, 0xB37B, 0x1610, 0x9DA7, 0x8AC9, 0x017E, + 0x1F65, 0x94D2, 0x83BC, 0x080B, 0xAD60, 0x26D7, 0x31B9, 0xBA0E, + 0xF0D8, 0x7B6F, 0x6C01, 0xE7B6, 0x42DD, 0xC96A, 0xDE04, 0x55B3 +}; + +__u16 crc_t10dif(const unsigned char *buffer, size_t len) +{ + __u16 crc = 0; + unsigned int i; + + for (i=0 ; i < len ; i++) + crc = (crc << 8) ^ t10_dif_crc_table[((crc >> 8) ^ buffer[i]) & 0xff]; + + return crc; +} + +EXPORT_SYMBOL(crc_t10dif); + +MODULE_DESCRIPTION("T10 DIF CRC calculation"); +MODULE_LICENSE("GPL"); -- To unsubscribe fr
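For readers who want to check the table rather than trust it, the entries can be regenerated from the generator polynomial 0x8bb7 named in the patch comment. The following is a minimal userspace sketch, not part of the patch: the names t10dif_table_entry, crc_t10dif_sketch and crc_t10dif_example are hypothetical, and the loop computes each table entry on the fly instead of using a lookup table, assuming the same MSB-first, zero-initial-value scheme that crc_t10dif() in the patch uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define T10DIF_POLY 0x8bb7u  /* generator polynomial from the patch comment */

/* Compute one table entry: push the byte through eight MSB-first steps. */
uint16_t t10dif_table_entry(uint8_t byte)
{
	uint16_t crc = (uint16_t)(byte << 8);
	int bit;

	for (bit = 0; bit < 8; bit++) {
		if (crc & 0x8000u)
			crc = (uint16_t)((crc << 1) ^ T10DIF_POLY);
		else
			crc = (uint16_t)(crc << 1);
	}
	return crc;
}

/* Same loop shape as crc_t10dif() in the patch, without the static table. */
uint16_t crc_t10dif_sketch(const unsigned char *buffer, size_t len)
{
	uint16_t crc = 0;
	size_t i;

	for (i = 0; i < len; i++)
		crc = (uint16_t)((crc << 8) ^
			t10dif_table_entry((uint8_t)((crc >> 8) ^ buffer[i])));
	return crc;
}

/* Usage: the first entries match the table posted in the patch. */
void crc_t10dif_example(void)
{
	assert(t10dif_table_entry(1) == 0x8BB7);
	assert(t10dif_table_entry(2) == 0x9CD9);
	assert(t10dif_table_entry(3) == 0x176E);
	assert(t10dif_table_entry(4) == 0xB205);
}
```

The table-driven version in the patch trades 512 bytes of constant data for one lookup per input byte; the bitwise form here is slower but makes the relationship between the polynomial and the table explicit.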
Re: [PATCH] lib: Add support for DIF CRC
> "Jan" == Jan Engelhardt <[EMAIL PROTECTED]> writes: Jan> 'const unsigned char *', like the rest of all code does. Updated patch follows. Jan> Do we already have some users for the T10DIF CRC? This is a runt patch given that it doesn't fit in block and SCSI. The remaining bits will come in through those trees. -- Martin K. Petersen Oracle Linux Engineering
Re: [RFC/PATCH] e100 driver didn't support any MII-less PHYs...
Hi, On Tue, Jan 29, 2008 at 03:09:25PM -0800, Kok, Auke wrote: > Andreas Mohr wrote: > > Perhaps it's useful to file a bug/patch > > on http://sourceforge.net/projects/e1000/ ? Perhaps -mm testing? > > I wanted to push this though our testing labs first which has not happened > due to > time constraints - that should quickly at least confirm that the most common > nics > work OK after the change with your patch. I'll try and see if we can get this > testing done soon. Oh, full-scale regression testing even? Nice idea... Would optionally be even better if during hardware tests one could also dig out some i82503-based card (or additional MII-less cards?) since I didn't really make any effort yet to try to make them all recognized/supported by my patch already (would have been out of scope anyway since I have this single card only). Thanks a lot, Andreas Mohr
[BUG] 2.6.24-git6 soft lockup detected while running libhugetlbfs
Hi,

Softlockup is detected while running libhugetlbfs on the 2.6.24-git6 kernel.
The machine is a Pentium III (Cascades) 16 cpu machine.

BUG: soft lockup - CPU#13 stuck for 61s! [swapper:0]

Pid: 0, comm: swapper Not tainted (2.6.24-git6-autokern1 #1)
EIP: 0060:[] EFLAGS: 0246 CPU: 13
EIP is at default_idle+0x30/0x44
EAX: EBX: c10002f8 ECX: 0010 EDX: 8fcf
ESI: 000d EDI: 00128868 EBP: e744bf9c ESP: e744bf9c
 DS: 007b ES: 007b FS: 00d8 GS: SS: 0068
CR0: 8005003b CR2: b7eadcc0 CR3: 01386000 CR4: 06f0
DR0: DR1: DR2: DR3:
DR6: 0ff0 DR7: 0400
 [] show_trace_log_lvl+0x19/0x2e
 [] show_trace+0x12/0x14
 [] show_regs+0x1c/0x1f
 [] softlockup_tick+0xe0/0xf6
 [] run_local_timers+0x17/0x19
 [] update_process_times+0x24/0x49
 [] tick_periodic+0x63/0x6f
 [] tick_handle_periodic+0x19/0x6a
 [] local_apic_timer_interrupt+0x4e/0x53
 [] smp_apic_timer_interrupt+0x2a/0x39
 [] apic_timer_interrupt+0x28/0x30
 [] cpu_idle+0x76/0x8b
 [] start_secondary+0xb1/0xb3
 [<>] _stext+0x3e40/0x19
 ===
BUG: soft lockup - CPU#12 stuck for 61s! [swapper:0]

Pid: 0, comm: swapper Not tainted (2.6.24-git6-autokern1 #1)
EIP: 0060:[] EFLAGS: 0246 CPU: 12
EIP is at default_idle+0x30/0x44
EAX: EBX: c10002f8 ECX: 000f7000 EDX: 8fcf
ESI: 000c EDI: 00128868 EBP: e7447f9c ESP: e7447f9c
 DS: 007b ES: 007b FS: 00d8 GS: SS: 0068
CR0: 8005003b CR2: b7f67f1c CR3: 01386000 CR4: 06f0
DR0: DR1: DR2: DR3:
DR6: 0ff0 DR7: 0400
 [] show_trace_log_lvl+0x19/0x2e
 [] show_trace+0x12/0x14
 [] show_regs+0x1c/0x1f
 [] softlockup_tick+0xe0/0xf6
 [] run_local_timers+0x17/0x19
 [] update_process_times+0x24/0x49
 [] tick_periodic+0x63/0x6f
 [] tick_handle_periodic+0x19/0x6a
 [] local_apic_timer_interrupt+0x4e/0x53
 [] smp_apic_timer_interrupt+0x2a/0x39
 [] apic_timer_interrupt+0x28/0x30
 [] cpu_idle+0x76/0x8b
 [] start_secondary+0xb1/0xb3
 [<>] _stext+0x3e40/0x19
 ===
BUG: soft lockup - CPU#14 stuck for 61s! [swapper:0]

Pid: 0, comm: swapper Not tainted (2.6.24-git6-autokern1 #1)
EIP: 0060:[] EFLAGS: 0246 CPU: 14
EIP is at default_idle+0x30/0x44
EAX: EBX: c10002f8 ECX: 00109000 EDX: 8fcf
ESI: 000e EDI: 00128868 EBP: e744ff9c ESP: e744ff9c
 DS: 007b ES: 007b FS: 00d8 GS: SS: 0068
CR0: 8005003b CR2: b7e12494 CR3: 01386000 CR4: 06f0
DR0: DR1: DR2: DR3:
DR6: 0ff0 DR7: 0400
 [] show_trace_log_lvl+0x19/0x2e
 [] show_trace+0x12/0x14
 [] show_regs+0x1c/0x1f
 [] softlockup_tick+0xe0/0xf6
 [] run_local_timers+0x17/0x19
 [] update_process_times+0x24/0x49
 [] tick_periodic+0x63/0x6f
 [] tick_handle_periodic+0x19/0x6a
 [] local_apic_timer_interrupt+0x4e/0x53
 [] smp_apic_timer_interrupt+0x2a/0x39
 [] apic_timer_interrupt+0x28/0x30
 [] cpu_idle+0x76/0x8b
 [] start_secondary+0xb1/0xb3
 [<>] _stext+0x3e40/0x19
 ===
BUG: soft lockup - CPU#15 stuck for 61s! [swapper:0]

Pid: 0, comm: swapper Not tainted (2.6.24-git6-autokern1 #1)
EIP: 0060:[] EFLAGS: 0246 CPU: 15
EIP is at default_idle+0x30/0x44
EAX: EBX: c10002f8 ECX: 00112000 EDX: 8fcf
ESI: 000f EDI: 00128868 EBP: e7451f9c ESP: e7451f9c
 DS: 007b ES: 007b FS: 00d8 GS: SS: 0068
CR0: 8005003b CR2: b7f2ecc0 CR3: 01386000 CR4: 06f0
DR0: DR1: DR2: DR3:
DR6: 0ff0 DR7: 0400
 [] show_trace_log_lvl+0x19/0x2e
 [] show_trace+0x12/0x14
 [] show_regs+0x1c/0x1f
 [] softlockup_tick+0xe0/0xf6
 [] run_local_timers+0x17/0x19
 [] update_process_times+0x24/0x49
 [] tick_periodic+0x63/0x6f
 [] tick_handle_periodic+0x19/0x6a
 [] local_apic_timer_interrupt+0x4e/0x53
 [] smp_apic_timer_interrupt+0x2a/0x39
 [] apic_timer_interrupt+0x28/0x30
 [] cpu_idle+0x76/0x8b
 [] start_secondary+0xb1/0xb3
 [<>] _stext+0x3e40/0x19
 ===
BUG: soft lockup - CPU#10 stuck for 61s! [swapper:0]

Pid: 0, comm: swapper Not tainted (2.6.24-git6-autokern1 #1)
EIP: 0060:[] EFLAGS: 0246 CPU: 10
EIP is at default_idle+0x30/0x44
EAX: EBX: c10002f8 ECX: 000e5000 EDX: 8fcf
ESI: 000a EDI: 00128868 EBP: e7443f9c ESP: e7443f9c
 DS: 007b ES: 007b FS: 00d8 GS: SS: 0068
CR0: 8005003b CR2: b7ed5cc0 CR3: 01386000 CR4: 06f0
DR0: DR1: DR2: DR3:
DR6: 0ff0 DR7: 0400
 [] show_trace_log_lvl+0x19/0x2e
 [] show_trace+0x12/0x14
 [] show_regs+0x1c/0x1f
 [] softlockup_tick+0xe0/0xf6
 [] run_local_timers+0x17/0x19
 [] update_process_times+0x24/0x49
 [] tick_periodic+0x63/0x6f
 [] tick_handle_periodic+0x19/0x6a
 [] local_apic_timer_interrupt+0x4e/0x53
 [] smp_apic_timer_interrupt+0x2a/0x39
 [] apic_timer_interrupt+0x28/0x30
 [] cpu_idle+0x76/0x8b
 [] start_secondary+0xb1/0xb3
 [<>] _stext+0x3e40/0x19
 ===
BUG: soft lockup - CPU#8 stuck for 61s! [swapper:0]

Pid: 0, comm: swapper Not tainte
Re: [PATCH 24/27] NFS: Use local caching [try #2]
On Wed, 2008-01-30 at 03:25 +, David Howells wrote:
> Chuck Lever <[EMAIL PROTECTED]> wrote:
>
> > This patch really ought to be broken into more manageable atomic
> > changes to make it easier to review, and to provide more fine-grained
> > explanation and rationalization for each specific change via
> > individual patch descriptions.
>
> Hmmm I broke the patch up as Trond stipulated - at least, I thought I
> had.
>
> In many ways this request doesn't make sense. You can't do NFS caching
> without all the appropriate bits, so logically they should be one patch.
> Breaking it up won't help git-bisect since the option to enable all this is
> the last (or nearly last) patch.

That depends entirely on what you are tracking. At this point in time,
I'm completely uninterested in debugging cachefs, but _very_ interested
in tracking and debugging changes to core NFS code.

Trond
About closed-source module use GPL module function
Hi all,

I am porting a device driver to Linux; its source code is not open.

In this module I use some interface functions that a GPL module exports
through the EXPORT_SYMBOL macro (not EXPORT_SYMBOL_GPL), for example
register_sound_dsp() and so on.

Does this violate the GPL? If so, how can I solve it?

Thanks a lot, any suggestion is appreciated.

Cooper
Re: Change In sk_buff structure in 2.6.22 kernel
On Wed, 30 Jan 2008 10:49:49 +0530 "PV Juliet" <[EMAIL PROTECTED]> wrote:
> Hi All,
>
> The header fields in the sk_buff structure have been renamed and are
> no longer unions.
>
> Networking code and drivers are supposed to use skb->transport_header,
> skb->network_header, and skb->skb_mac_header.
> But when I am trying to access fields of TCP using the code
> struct tcphdr *tcp = skb->transport_header;
> tcp-> //accessing proper field
> It is not accessing the value properly ...
> Can anyone please help me ???
>
> Thanks in advance
> Regards
> Juliet

Read the source (include/linux/skbuff.h).

Use the new accessor functions skb_transport_header(skb) and
skb_network_header(skb) instead of reading the fields directly.

--
Stephen Hemminger <[EMAIL PROTECTED]>
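A likely reason the direct field access misbehaves, as far as I understand the 2.6.22 change, is that on some configurations the header fields hold an offset from the start of the buffer rather than a pointer, and the accessor hides that difference. The following is only a userspace sketch of that idea: struct demo_skb, struct demo_tcphdr and the demo_* helpers are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a TCP header; only the two ports are modelled. */
struct demo_tcphdr {
	uint16_t source;
	uint16_t dest;
};

/* Hypothetical stand-in for sk_buff: the header position is an offset. */
struct demo_skb {
	unsigned char *head;		/* start of the buffer */
	size_t transport_header;	/* offset from head, NOT a pointer */
};

/* Accessor in the spirit of skb_transport_header(): turns offset into pointer. */
static unsigned char *demo_skb_transport_header(const struct demo_skb *skb)
{
	assert(skb != NULL && skb->head != NULL);
	return skb->head + skb->transport_header;
}

/* Copy the header out through the accessor instead of casting the raw field. */
static void demo_read_tcp_header(const struct demo_skb *skb,
				 struct demo_tcphdr *out)
{
	memcpy(out, demo_skb_transport_header(skb), sizeof(*out));
}
```

Code that treats the raw field as a pointer would, in this model, dereference a small integer; going through the accessor works whichever representation is in use.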
Re: ndiswrapper and GPL-only symbols redux
Quoting Andi Kleen <[EMAIL PROTECTED]>:

> Pavel Roskin <[EMAIL PROTECTED]> writes:
>>   */
>> @@ -162,6 +163,7 @@ const char *print_tainted(void)
>>  	if (tainted) {
>>  		snprintf(buf, sizeof(buf), "Tainted: %c%c%c%c%c%c%c%c",
>>  			tainted & TAINT_PROPRIETARY_MODULE ? 'P' : 'G',
>> +			tainted & TAINT_BLOB_WRAPPER ? 'W' : ' ',
>
> Are you sure you don't need to add a new '%c' to the format string too?
> I think gcc should have warned.

You are right! Thanks.

--
Regards,
Pavel Roskin
Re: [PATCH 2.6.24] x86: add sysfs interface for cpuid module
Yi Yang wrote:
>> It's broken, because it doesn't take into account the fact that Intel
>> broke CPUID level 4 and made it "repeating" (neither did the cpuid char
>> device, because it predated the Intel braindamage; I've had a patch for
>> it privately for a while, but didn't push it upstream because paravirt
>> broke it royally and I wanted the situation to settle down.)
>>
>> It's broken, because the algorithm used to determine valid CPUID levels
>> is incorrect; it fails to recognize any CPUID levels other than the main
>> Intel and AMD ones, e.g. the Transmeta 0x8086 (and sometimes more) and
>> VIA 0xc000 levels.
>
> Thank you for pointing out these issues, i think we can let users input
> any cpuid level and output the corresponding cpuid, in this way we can
> avoid to consider cpu differences and left this to userspace.
>
> We can also consider all the x86 platforms to do cpuid for every one.
>
>> It's broken, because it is better for the userspace extractor to have
>> this logic than to stuff it into the kernel, where it sits hogging
>> unswappable memory at all times.
>
> It seems not to be very appropriate to let user space consider hardware
> details. /proc/cpuinfo should be an example to justify this.

/proc/cpuinfo represents what the kernel needs to know, so it reflects
the kernel's interpretation of CPUID. There is no reason to interpret
things in the kernel that the kernel doesn't need.

> Is there any user application using /dev/cpu/*/cpuid? if no, i think it
> is feasible to provide an interface in the kernel.

Yes. It's called x86info, I believe.

	-hpa
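To make the point about valid CPUID levels concrete, here is a small sketch of one way a userspace tool could decide whether a vendor-specific range is implemented: query the base leaf of the range and check that the value it reports as the highest leaf actually lies in that range. The demo_* names are hypothetical, and the range bases used in the note below (0x80860000 for Transmeta, 0xC0000000 for VIA/Centaur) are my reading of the shortened values in the message, not something it states in full.

```c
#include <assert.h>
#include <stdint.h>

/*
 * base:         first leaf of a CPUID range (low 16 bits zero in this sketch)
 * reported_max: EAX as returned by CPUID(base), i.e. the highest leaf the
 *               CPU claims to implement in that range
 *
 * A CPU that does not implement the range returns unrelated data, so the
 * upper 16 bits of reported_max will not match the base.
 */
static int demo_cpuid_range_present(uint32_t base, uint32_t reported_max)
{
	assert((base & 0x0000FFFFu) == 0);
	return (reported_max & 0xFFFF0000u) == base;
}

/* A leaf is usable only if its range exists and the leaf is within it. */
static int demo_cpuid_leaf_valid(uint32_t leaf, uint32_t base,
				 uint32_t reported_max)
{
	return demo_cpuid_range_present(base, reported_max) &&
	       leaf >= base && leaf <= reported_max;
}
```

A tool would call demo_cpuid_leaf_valid() once per range it knows about (standard, extended, and any vendor ranges), which keeps the vendor knowledge in userspace as argued above.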
Re: [RFC] ext3: per-process soft-syncing data=ordered mode
Jan Kara wrote:
> > Chris Snook wrote:
> > > Al Boldi wrote:
> > > > This RFC proposes to introduce a tunable which allows to disable
> > > > fsync and changes ordered into writeback writeout on a per-process
> > > > basis like this:
> > > >
> > > >   echo 1 > /proc/`pidof process`/softsync
> > >
> > > This is basically a kernel workaround for stupid app behavior.
> >
> > Exactly right to some extent, but don't forget the underlying
> > data=ordered starvation problem, which looks like a genuinely deep
> > problem maybe related to blockIO.
>
> It is a problem with the way how ext3 does fsync (at least that's what
> we ended up with in that konqueror problem)... It has to flush the
> current transaction which means that app doing fsync() has to wait till
> all dirty data of all files on the filesystem are written (if we are in
> ordered mode). And that takes quite some time... There are possibilities
> how to avoid that but especially with freshly created files, it's tough
> and I don't see a way how to do it without some fundamental changes to
> JBD.

Ok, but keep in mind that this starvation occurs even in the absence of
fsync, as the benchmarks show.

And, a quick test of successive 1sec delayed syncs shows no hangs until
about 1 minute (~180mb) of db-writeout activity, when the sync abruptly
hangs for minutes on end, and io-wait shows almost 100%.

Now it turns out that 'echo 3 > /proc/.../drop_caches' has no effect, but
doing it a few more times makes the hangs go away for a while, only to
come back again and again.

Thanks!

--
Al
Re: ndiswrapper and GPL-only symbols redux
Pavel Roskin <[EMAIL PROTECTED]> writes:

>   */
> @@ -162,6 +163,7 @@ const char *print_tainted(void)
>  	if (tainted) {
>  		snprintf(buf, sizeof(buf), "Tainted: %c%c%c%c%c%c%c%c",
>  			tainted & TAINT_PROPRIETARY_MODULE ? 'P' : 'G',
> +			tainted & TAINT_BLOB_WRAPPER ? 'W' : ' ',

Are you sure you don't need to add a new '%c' to the format string too?
I think gcc should have warned.

-Andi

>  			tainted & TAINT_FORCED_MODULE ? 'F' : ' ',
>  			tainted & TAINT_UNSAFE_SMP ? 'S' : ' ',
>  			tainted & TAINT_FORCED_RMMOD ? 'R' : ' ',
>
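For reference, the printf family evaluates but otherwise ignores arguments beyond what the format string consumes, so adding a flag argument without a matching '%c' silently drops the last flag from the output (gcc can report this through its format warnings). Below is a cut-down userspace sketch with one '%c' per flag; the DEMO_* bit values and demo_print_tainted() are hypothetical and do not match the kernel's TAINT_* definitions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag bits for this sketch only. */
#define DEMO_TAINT_PROPRIETARY	(1u << 0)
#define DEMO_TAINT_BLOB_WRAPPER	(1u << 1)
#define DEMO_TAINT_FORCED	(1u << 2)

/*
 * Build the taint string. The format string carries exactly one %c per
 * flag argument; adding a fourth argument would require a fourth %c.
 */
static const char *demo_print_tainted(unsigned int tainted,
				      char *buf, size_t len)
{
	assert(buf != NULL && len > 0);

	if (!tainted) {
		snprintf(buf, len, "Not tainted");
		return buf;
	}

	snprintf(buf, len, "Tainted: %c%c%c",
		 (tainted & DEMO_TAINT_PROPRIETARY) ? 'P' : 'G',
		 (tainted & DEMO_TAINT_BLOB_WRAPPER) ? 'W' : ' ',
		 (tainted & DEMO_TAINT_FORCED) ? 'F' : ' ');
	return buf;
}
```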
[PATCH 4/4] IB: introducing MTHCA_MR_DMABARRIER
Add MTHCA_MR_DMABARRIER to mthca's API, increment ABI version, and make use of MTHCA_MR_DMABARRIER when mapping a user-allocated memory region with ib_umem_get(). Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]> -- drivers/infiniband/core/umem.c | 15 +--- drivers/infiniband/hw/mthca/mthca_provider.c |7 - drivers/infiniband/hw/mthca/mthca_user.h | 10 +++- include/rdma/ib_verbs.h | 33 +++ 4 files changed, 59 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 5b00408..57b5ce9 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -38,6 +38,7 @@ #include #include #include +#include #include "uverbs.h" @@ -72,6 +73,8 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d * @addr: userspace virtual address to start at * @size: length of region to pin * @access: IB_ACCESS_xxx flags for memory being pinned + * @dmabarrier: set a "dma barrier" so that in-flight DMA is + * flushed when the memory region is written to */ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, size_t size, int access, int dmabarrier) @@ -87,6 +90,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, int ret; int off; int i; + struct dma_attrs attrs; + + dma_set_attr(&attrs, dmabarrier ? 
DMA_ATTR_BARRIER : 0); if (!can_do_mlock()) return ERR_PTR(-EPERM); @@ -174,10 +180,11 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, sg_set_page(&chunk->page_list[i], page_list[i + off], PAGE_SIZE, 0); } - chunk->nmap = ib_dma_map_sg(context->device, - &chunk->page_list[0], - chunk->nents, - DMA_BIDIRECTIONAL); + chunk->nmap = ib_dma_map_sg_attrs(context->device, + &chunk->page_list[0], + chunk->nents, + DMA_BIDIRECTIONAL, + &attrs); if (chunk->nmap <= 0) { for (i = 0; i < chunk->nents; ++i) put_page(sg_page(&chunk->page_list[i])); diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 704d8ef..e837cc9 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1017,17 +1017,22 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, struct mthca_dev *dev = to_mdev(pd->device); struct ib_umem_chunk *chunk; struct mthca_mr *mr; + struct mthca_reg_mr ucmd; u64 *pages; int shift, n, len; int i, j, k; int err = 0; int write_mtt_size; + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) + return ERR_PTR(-EFAULT); + mr = kmalloc(sizeof *mr, GFP_KERNEL); if (!mr) return ERR_PTR(-ENOMEM); - mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0); + mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, + ucmd.mr_attrs & MTHCA_MR_DMABARRIER); if (IS_ERR(mr->umem)) { err = PTR_ERR(mr->umem); diff --git a/drivers/infiniband/hw/mthca/mthca_user.h b/drivers/infiniband/hw/mthca/mthca_user.h index 02cc0a7..701a430 100644 --- a/drivers/infiniband/hw/mthca/mthca_user.h +++ b/drivers/infiniband/hw/mthca/mthca_user.h @@ -41,7 +41,7 @@ * Increment this value if any changes that break userspace ABI * compatibility are made. 
*/ -#define MTHCA_UVERBS_ABI_VERSION 1 +#define MTHCA_UVERBS_ABI_VERSION 2 /* * Make sure that all structs defined in this file remain laid out so @@ -61,6 +61,14 @@ struct mthca_alloc_pd_resp { __u32 reserved; }; +struct mthca_reg_mr { + __u32 mr_attrs; +#define MTHCA_MR_DMABARRIER 0x1 /* set a dma barrier in order to flush + * in-flight DMA on a write to memory + * region */ + __u32 reserved; +}; + struct mthca_create_cq { __u32 lkey; __u32 pdn; diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 11f3960..ac869e2 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1507,6 +1507,24 @@ static inline void ib_dma_unmap_single(struct ib_device *dev, dma_u
[PATCH 3/4] IB: add dmabarrier to ib_umem_get() prototype
Add a new parameter, dmabarrier, to the ib_umem_get() prototype. Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]> -- drivers/infiniband/core/umem.c |2 +- drivers/infiniband/hw/amso1100/c2_provider.c |2 +- drivers/infiniband/hw/cxgb3/iwch_provider.c |2 +- drivers/infiniband/hw/ehca/ehca_mrmw.c |2 +- drivers/infiniband/hw/ipath/ipath_mr.c |3 ++- drivers/infiniband/hw/mlx4/cq.c |2 +- drivers/infiniband/hw/mlx4/doorbell.c|2 +- drivers/infiniband/hw/mlx4/mr.c |3 ++- drivers/infiniband/hw/mlx4/qp.c |2 +- drivers/infiniband/hw/mlx4/srq.c |2 +- drivers/infiniband/hw/mthca/mthca_provider.c |3 ++- include/rdma/ib_umem.h |4 ++-- 12 files changed, 16 insertions(+), 13 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 4e3128f..5b00408 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -74,7 +74,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d * @access: IB_ACCESS_xxx flags for memory being pinned */ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, - size_t size, int access) + size_t size, int access, int dmabarrier) { struct ib_umem *umem; struct page **page_list; diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c index 7a6cece..f571dff 100644 --- a/drivers/infiniband/hw/amso1100/c2_provider.c +++ b/drivers/infiniband/hw/amso1100/c2_provider.c @@ -449,7 +449,7 @@ static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, return ERR_PTR(-ENOMEM); c2mr->pd = c2pd; - c2mr->umem = ib_umem_get(pd->uobject->context, start, length, acc); + c2mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0); if (IS_ERR(c2mr->umem)) { err = PTR_ERR(c2mr->umem); kfree(c2mr); diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index b5436ca..66d9d65 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ 
b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -601,7 +601,7 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, if (!mhp) return ERR_PTR(-ENOMEM); - mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc); + mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0); if (IS_ERR(mhp->umem)) { err = PTR_ERR(mhp->umem); kfree(mhp); diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index e239bbf..62a382c 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -325,7 +325,7 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, } e_mr->umem = ib_umem_get(pd->uobject->context, start, length, -mr_access_flags); +mr_access_flags, 0); if (IS_ERR(e_mr->umem)) { ib_mr = (void *)e_mr->umem; goto reg_user_mr_exit1; diff --git a/drivers/infiniband/hw/ipath/ipath_mr.c b/drivers/infiniband/hw/ipath/ipath_mr.c index db4ba92..7ffb392 100644 --- a/drivers/infiniband/hw/ipath/ipath_mr.c +++ b/drivers/infiniband/hw/ipath/ipath_mr.c @@ -195,7 +195,8 @@ struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, goto bail; } - umem = ib_umem_get(pd->uobject->context, start, length, mr_access_flags); + umem = ib_umem_get(pd->uobject->context, start, length, + mr_access_flags, 0); if (IS_ERR(umem)) return (void *) umem; diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 9d32c49..3adad6f 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -122,7 +122,7 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector } cq->umem = ib_umem_get(context, ucmd.buf_addr, buf_size, - IB_ACCESS_LOCAL_WRITE); + IB_ACCESS_LOCAL_WRITE, 0); if (IS_ERR(cq->umem)) { err = PTR_ERR(cq->umem); goto err_cq; diff --git a/drivers/infiniband/hw/mlx4/doorbell.c b/drivers/infiniband/hw/mlx4/doorbell.c index 1c36087..0afde2d 100644 --- 
a/drivers/infiniband/hw/mlx4/doorbell.c +++ b/drivers/infiniband/hw/mlx4/doorbell.c @@ -181,7 +181,7 @@ int mlx4_ib_db_map_user(struct mlx4_ib_ucontext *context, unsigned long virt, page->user_virt = (virt & PAGE_MASK); p
Re: Mostly revert "e1000/e1000e: Move PCI-Express device IDs over to e1000e"
On Wed, 30 Jan 2008, Linus Torvalds wrote:
>
> Untested, but as mentioned, this is more of a "this looks maintainable and
> like it should solve the issues" rather than anything I was planning on
> committing now.

Side note: I "verified" this patch by also diffing it against the HEAD^
state (before adding the PCIE ID's back in), to check that I marked exactly
the right entries as PCIE() entries.

So while it's not tested, at least it looks right from two different
angles ;)

		Linus
[PATCH 2/4] dma/ia64: update ia64 machvecs
Change all ia64 machvecs to use the new dma_{un}map_*_attrs() interfaces. Implement the old dma_{un}map_*() interfaces in terms of the corresponding new interfaces. For ia64/sn, make use of one dma attribute, DMA_ATTR_BARRIER. Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]> -- arch/ia64/hp/common/hwsw_iommu.c | 60 arch/ia64/hp/common/sba_iommu.c | 62 ++-- arch/ia64/sn/pci/pci_dma.c | 77 --- include/asm-ia64/dma-mapping.h | 28 +-- include/asm-ia64/machvec.h | 52 include/asm-ia64/machvec_hpzx1.h | 16 +++--- include/asm-ia64/machvec_hpzx1_swiotlb.h | 16 +++--- include/asm-ia64/machvec_sn2.h | 16 +++--- include/linux/dma-attrs.h| 48 +++ lib/swiotlb.c| 50 10 files changed, 289 insertions(+), 136 deletions(-) diff --git a/arch/ia64/hp/common/hwsw_iommu.c b/arch/ia64/hp/common/hwsw_iommu.c index 94e5710..8cedd6c 100644 --- a/arch/ia64/hp/common/hwsw_iommu.c +++ b/arch/ia64/hp/common/hwsw_iommu.c @@ -20,10 +20,10 @@ extern int swiotlb_late_init_with_default_size (size_t size); extern ia64_mv_dma_alloc_coherent swiotlb_alloc_coherent; extern ia64_mv_dma_free_coherent swiotlb_free_coherent; -extern ia64_mv_dma_map_single swiotlb_map_single; -extern ia64_mv_dma_unmap_singleswiotlb_unmap_single; -extern ia64_mv_dma_map_sg swiotlb_map_sg; -extern ia64_mv_dma_unmap_sgswiotlb_unmap_sg; +extern ia64_mv_dma_map_single_attrsswiotlb_map_single_attrs; +extern ia64_mv_dma_unmap_single_attrs swiotlb_unmap_single_attrs; +extern ia64_mv_dma_map_sg_attrsswiotlb_map_sg_attrs; +extern ia64_mv_dma_unmap_sg_attrs swiotlb_unmap_sg_attrs; extern ia64_mv_dma_supported swiotlb_dma_supported; extern ia64_mv_dma_mapping_error swiotlb_dma_mapping_error; @@ -31,19 +31,19 @@ extern ia64_mv_dma_mapping_error swiotlb_dma_mapping_error; extern ia64_mv_dma_alloc_coherent sba_alloc_coherent; extern ia64_mv_dma_free_coherent sba_free_coherent; -extern ia64_mv_dma_map_single sba_map_single; -extern ia64_mv_dma_unmap_singlesba_unmap_single; -extern ia64_mv_dma_map_sg sba_map_sg; -extern 
ia64_mv_dma_unmap_sgsba_unmap_sg; +extern ia64_mv_dma_map_single_attrssba_map_single_attrs; +extern ia64_mv_dma_unmap_single_attrs sba_unmap_single_attrs; +extern ia64_mv_dma_map_sg_attrssba_map_sg_attrs; +extern ia64_mv_dma_unmap_sg_attrs sba_unmap_sg_attrs; extern ia64_mv_dma_supported sba_dma_supported; extern ia64_mv_dma_mapping_error sba_dma_mapping_error; #define hwiommu_alloc_coherent sba_alloc_coherent #define hwiommu_free_coherent sba_free_coherent -#define hwiommu_map_single sba_map_single -#define hwiommu_unmap_single sba_unmap_single -#define hwiommu_map_sg sba_map_sg -#define hwiommu_unmap_sg sba_unmap_sg +#define hwiommu_map_single_attrs sba_map_single_attrs +#define hwiommu_unmap_single_attrs sba_unmap_single_attrs +#define hwiommu_map_sg_attrs sba_map_sg_attrs +#define hwiommu_unmap_sg_attrs sba_unmap_sg_attrs #define hwiommu_dma_supported sba_dma_supported #define hwiommu_dma_mapping_error sba_dma_mapping_error #define hwiommu_sync_single_for_cpumachvec_dma_sync_single @@ -98,40 +98,44 @@ hwsw_free_coherent (struct device *dev, size_t size, void *vaddr, dma_addr_t dma } dma_addr_t -hwsw_map_single (struct device *dev, void *addr, size_t size, int dir) +hwsw_map_single_attrs (struct device *dev, void *addr, size_t size, int dir, + struct dma_attrs *attrs) { if (use_swiotlb(dev)) - return swiotlb_map_single(dev, addr, size, dir); + return swiotlb_map_single_attrs(dev, addr, size, dir, attrs); else - return hwiommu_map_single(dev, addr, size, dir); + return hwiommu_map_single_attrs(dev, addr, size, dir, attrs); } void -hwsw_unmap_single (struct device *dev, dma_addr_t iova, size_t size, int dir) +hwsw_unmap_single_attrs (struct device *dev, dma_addr_t iova, size_t size, +int dir, struct dma_attrs *attrs) { if (use_swiotlb(dev)) - return swiotlb_unmap_single(dev, iova, size, dir); + return swiotlb_unmap_single_attrs(dev, iova, size, dir, attrs); else - return hwiommu_unmap_single(dev, iova, size, dir); + return hwiommu_unmap_single_attrs(dev, iova, 
size, dir, attrs); } int -hwsw_map_sg (struct device *dev, struct scatterlist *sglist, int nents, int dir) +hwsw_map_sg_attrs (struct device *dev, struct scatterlist *sglist, in
Re: Mostly revert "e1000/e1000e: Move PCI-Express device IDs over to e1000e"
On Tue, 29 Jan 2008, Randy Dunlap wrote: > > Andrew was concerned about this when the driver was in -mm. > He asked for a patch that would set E1000E to same value as E1000 > and I supplied that. Auke acked it IIRC. Other people vetoed it. :( Yeah, I've been discussing with Jeff and the gang. I think we have agreed on a solution where the ID's show up in the old driver if the new driver is not enabled at all. (And as a side note: it turns out that the problem I experienced didn't come from the new e1000e driver after all, so I'll be removing the EXPERIMENTAL flag again). So I'd suggest the final patch be something like this, but I'm sendign it out just as an example of how we could solve this, not necessarily as a final patch. Jeff, Auke, would something like this be acceptable? It makes it very obvious in the driver table which entries are for the PCIE versions that would be handled by the E1000E driver if it is enabled.. Untested, but as mentioned, this is more of a "this looks maintainable and like it should solve the issues" rather than anything I was planning on committing now. Linus --- drivers/net/Kconfig|5 ++- drivers/net/e1000/e1000_main.c | 60 ++-- 2 files changed, 37 insertions(+), 28 deletions(-) diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index 5a2d1dd..6c57540 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -1992,7 +1992,7 @@ config E1000_DISABLE_PACKET_SPLIT config E1000E tristate "Intel(R) PRO/1000 PCI-Express Gigabit Ethernet support" - depends on PCI && EXPERIMENTAL + depends on PCI ---help--- This driver supports the PCI-Express Intel(R) PRO/1000 gigabit ethernet family of adapters. For PCI or PCI-X e1000 adapters, @@ -2009,6 +2009,9 @@ config E1000E To compile this driver as a module, choose M here. The module will be called e1000e. 
+config E1000E_ENABLED + def_bool E1000E != n + config IP1000 tristate "IP1000 Gigabit Ethernet support" depends on PCI && EXPERIMENTAL diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c index 3111af6..8c87940 100644 --- a/drivers/net/e1000/e1000_main.c +++ b/drivers/net/e1000/e1000_main.c @@ -47,6 +47,12 @@ static const char e1000_copyright[] = "Copyright (c) 1999-2006 Intel Corporation * Macro expands to... * {PCI_DEVICE(PCI_VENDOR_ID_INTEL, device_id)} */ +#ifdef CONFIG_E1000E_ENABLED + #define PCIE(x) +#else + #define PCIE(x) x, +#endif + static struct pci_device_id e1000_pci_tbl[] = { INTEL_E1000_ETHERNET_DEVICE(0x1000), INTEL_E1000_ETHERNET_DEVICE(0x1001), @@ -73,14 +79,14 @@ static struct pci_device_id e1000_pci_tbl[] = { INTEL_E1000_ETHERNET_DEVICE(0x1026), INTEL_E1000_ETHERNET_DEVICE(0x1027), INTEL_E1000_ETHERNET_DEVICE(0x1028), - INTEL_E1000_ETHERNET_DEVICE(0x1049), - INTEL_E1000_ETHERNET_DEVICE(0x104A), - INTEL_E1000_ETHERNET_DEVICE(0x104B), - INTEL_E1000_ETHERNET_DEVICE(0x104C), - INTEL_E1000_ETHERNET_DEVICE(0x104D), - INTEL_E1000_ETHERNET_DEVICE(0x105E), - INTEL_E1000_ETHERNET_DEVICE(0x105F), - INTEL_E1000_ETHERNET_DEVICE(0x1060), +PCIE( INTEL_E1000_ETHERNET_DEVICE(0x1049)) +PCIE( INTEL_E1000_ETHERNET_DEVICE(0x104A)) +PCIE( INTEL_E1000_ETHERNET_DEVICE(0x104B)) +PCIE( INTEL_E1000_ETHERNET_DEVICE(0x104C)) +PCIE( INTEL_E1000_ETHERNET_DEVICE(0x104D)) +PCIE( INTEL_E1000_ETHERNET_DEVICE(0x105E)) +PCIE( INTEL_E1000_ETHERNET_DEVICE(0x105F)) +PCIE( INTEL_E1000_ETHERNET_DEVICE(0x1060)) INTEL_E1000_ETHERNET_DEVICE(0x1075), INTEL_E1000_ETHERNET_DEVICE(0x1076), INTEL_E1000_ETHERNET_DEVICE(0x1077), @@ -89,28 +95,28 @@ static struct pci_device_id e1000_pci_tbl[] = { INTEL_E1000_ETHERNET_DEVICE(0x107A), INTEL_E1000_ETHERNET_DEVICE(0x107B), INTEL_E1000_ETHERNET_DEVICE(0x107C), - INTEL_E1000_ETHERNET_DEVICE(0x107D), - INTEL_E1000_ETHERNET_DEVICE(0x107E), - INTEL_E1000_ETHERNET_DEVICE(0x107F), +PCIE( INTEL_E1000_ETHERNET_DEVICE(0x107D)) +PCIE( 
INTEL_E1000_ETHERNET_DEVICE(0x107E)) +PCIE( INTEL_E1000_ETHERNET_DEVICE(0x107F)) INTEL_E1000_ETHERNET_DEVICE(0x108A), - INTEL_E1000_ETHERNET_DEVICE(0x108B), - INTEL_E1000_ETHERNET_DEVICE(0x108C), - INTEL_E1000_ETHERNET_DEVICE(0x1096), - INTEL_E1000_ETHERNET_DEVICE(0x1098), +PCIE( INTEL_E1000_ETHERNET_DEVICE(0x108B)) +PCIE( INTEL_E1000_ETHERNET_DEVICE(0x108C)) +PCIE( INTEL_E1000_ETHERNET_DEVICE(0x1096)) +PCIE( INTEL_E1000_ETHERNET_DEVICE(0x1098)) INTEL_E1000_ETHERNET_DEVICE(0x1099), - INTEL_E1000_ETHERNET_DEVICE(0x109A), - INTEL_E1000_ETHERNET_DEVICE(0x10A4), - INTEL_E1000_ETHERNET_DEVICE(0x10A5), +PCIE( INTEL_E1000_ETHERNET_DEVICE(0x109A)) +PCIE( INTEL_E1000_ETHERNET_DEVICE(0x10A4)) +PCIE( INTEL_E1000_ETHERNET_DEVICE(0x10A5)) INTEL_E1000_ET
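The PCIE() wrapper in the patch above is a small preprocessor trick: the macro either expands to its argument followed by a comma, or to nothing, so marked table entries vanish when the other driver is enabled. Here is a standalone sketch of the same idea. DEMO_NEW_DRIVER_ENABLED, DEMO_PCIE() and the table are hypothetical, and plain integers stand in for the pci_device_id entries.

```c
#include <assert.h>
#include <stddef.h>

/* Define DEMO_NEW_DRIVER_ENABLED at build time to drop the wrapped IDs. */
#ifdef DEMO_NEW_DRIVER_ENABLED
#define DEMO_PCIE(x)
#else
#define DEMO_PCIE(x) x,
#endif

/* ID table; wrapped entries are the ones the "new" driver would own. */
static const unsigned short demo_id_table[] = {
	0x1000,
	0x1001,
DEMO_PCIE(0x1049)
DEMO_PCIE(0x104A)
	0x1075,
	0	/* terminator */
};

/* Return 1 if this (old) driver claims the given device ID, else 0. */
static int demo_id_claimed(unsigned short id)
{
	size_t i;

	assert(id != 0);	/* 0 is reserved as the table terminator */
	for (i = 0; demo_id_table[i] != 0; i++)
		if (demo_id_table[i] == id)
			return 1;
	return 0;
}
```

Built as-is, the old driver claims 0x1049; built with -DDEMO_NEW_DRIVER_ENABLED, the same source leaves that ID to the other driver, and the wrapped lines stay visually distinct in the table.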
[PATCH 0/4] dma: dma_{un}map_{single|sg}_attrs() interface
Introduce a new interface for passing architecture-specific
attributes when memory is mapped and unmapped for DMA. Give
the interface a default implementation which ignores
attributes.

Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]>

--
 dma-mapping.h |   33 +
 1 files changed, 33 insertions(+)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 101a2d4..bc313e3 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -116,4 +116,37 @@ static inline void dmam_release_declared_memory(struct device *dev)
 }
 #endif /* ARCH_HAS_DMA_DECLARE_COHERENT_MEMORY */
 
+#ifndef ARCH_USES_DMA_ATTRS
+struct dma_attrs;
+
+static inline dma_addr_t dma_map_single_attrs(struct device *dev,
+					      void *cpu_addr, size_t size,
+					      int dir, struct dma_attrs* attrs)
+{
+	return dma_map_single(dev, cpu_addr, size, dir);
+}
+
+static inline void dma_unmap_single_attrs(struct device *dev,
+					  dma_addr_t dma_addr,
+					  size_t size, int dir,
+					  struct dma_attrs* attrs)
+{
+	return dma_unmap_single(dev, dma_addr, size, dir);
+}
+
+static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
+				   int nents, int dir, struct dma_attrs *attrs)
+{
+	return dma_map_sg(dev, sgl, nents, dir);
+}
+
+static inline void dma_unmap_sg_attrs(struct device *dev,
+				      struct scatterlist *sgl,
+				      int nents, int dir,
+				      struct dma_attrs *attrs)
+{
+	return dma_unmap_sg(dev, sgl, nents, dir);
+}
+#endif /* ARCH_USES_DMA_ATTRS */
+
 #endif
[PATCH 1/4] dma: document dma_{un}map_{single|sg}_attrs() interface
Document the new dma_{un}map_{single|sg}_attrs() functions.

Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]>

--
 DMA-API.txt |   63
 1 files changed, 63 insertions(+)

diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
index b939ebb..fad05e0 100644
--- a/Documentation/DMA-API.txt
+++ b/Documentation/DMA-API.txt
@@ -395,6 +395,69 @@ Notes: You must do this:
 
 See also dma_map_single().
 
+dma_addr_t
+dma_map_single_attrs(struct device *dev, void *cpu_addr, size_t size,
+		     enum dma_data_direction dir,
+		     struct dma_attrs* attrs)
+
+void
+dma_unmap_single_attrs(struct device *dev, dma_addr_t dma_addr,
+		       size_t size, enum dma_data_direction dir,
+		       struct dma_attrs* attrs)
+
+int
+dma_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
+		 int nents, enum dma_data_direction dir,
+		 struct dma_attrs *attrs)
+
+void
+dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sgl,
+		   int nents, enum dma_data_direction dir,
+		   struct dma_attrs *attrs)
+
+The four functions above are just like the counterpart functions
+without the _attrs suffixes, except that they pass an optional
+struct dma_attrs*.
+
+struct dma_attrs encapsulates a set of "dma attributes". For the
+definition of struct dma_attrs see linux/dma-attrs.h.
+
+The interpretation of dma attributes is architecture-specific.
+
+If struct dma_attrs* is NULL, the semantics of each of these
+functions is identical to those of the corresponding function
+without the _attrs suffix. As a result dma_map_single_attrs()
+can generally replace dma_map_single(), etc.
+
+As an example of the use of the *_attrs functions, here's how
+you could pass an attribute DMA_ATTR_FOO when mapping memory
+for DMA:
+
+#include <linux/dma-attrs.h>
+/* DMA_ATTR_FOO should be defined in linux/dma-attrs.h */
+...
+
+	DECLARE_DMA_ATTRS(attrs);
+	dma_set_attr(&attrs, DMA_ATTR_FOO);
+
+	n = dma_map_sg_attrs(dev, sg, nents, DMA_TO_DEVICE, &attrs);
+
+
+Architectures that care about DMA_ATTR_FOO would check for its
+presence in their implementations of the mapping and unmapping
+routines, e.g.:
+
+void whizco_dma_map_sg_attrs(struct device *dev, dma_addr_t dma_addr,
+			     size_t size, enum dma_data_direction dir,
+			     struct dma_attrs* attrs)
+{
+
+	int foo = dma_get_attr(attrs, DMA_ATTR_FOO);
+
+	if (foo)
+		/* twizzle the frobnozzle */
+
+
 Part II - Advanced dma_ usage
 -
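The set/get pattern documented above can be tried outside the kernel with a small mock. This is only a sketch under my own assumptions: the attribute set is modelled as a single bitmask, and every demo_* name is hypothetical rather than taken from linux/dma-attrs.h. It also models the documented rule that a NULL attrs pointer means "no attributes".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical attribute identifiers, used as bit positions. */
enum demo_dma_attr {
	DEMO_ATTR_BARRIER = 0,
	DEMO_ATTR_FOO = 1,
	DEMO_ATTR_MAX
};

/* Hypothetical attribute set: one bit per attribute. */
struct demo_dma_attrs {
	unsigned long flags;
};

/* Mark an attribute as present in the set. */
static void demo_set_attr(struct demo_dma_attrs *attrs,
			  enum demo_dma_attr attr)
{
	assert(attrs != NULL && attr < DEMO_ATTR_MAX);
	attrs->flags |= 1UL << attr;
}

/* Return 1 if the attribute is set; a NULL set carries no attributes. */
static int demo_get_attr(const struct demo_dma_attrs *attrs,
			 enum demo_dma_attr attr)
{
	if (attrs == NULL)
		return 0;
	return (int)((attrs->flags >> attr) & 1UL);
}

/*
 * Mock mapping routine: pretends every entry maps, and reports through
 * barrier_used whether the architecture-specific path would be taken.
 */
static int demo_map_sg_attrs(int nents, const struct demo_dma_attrs *attrs,
			     int *barrier_used)
{
	*barrier_used = demo_get_attr(attrs, DEMO_ATTR_BARRIER);
	return nents;
}
```

Because a NULL set behaves like an empty one, a caller that does not care about attributes can use the _attrs-style routine unchanged, which mirrors the fallback described in the text.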
Re: [WARNING -rc8] at fs/sysfs/dir.c:424 sysfs_add_one(), related with processor (ACPI)
On Jan 25, 2008 9:27 AM, Dave Young <[EMAIL PROTECTED]> wrote: > > On Jan 25, 2008 12:32 AM, Miguel Ojeda <[EMAIL PROTECTED]> wrote: > > > > On Jan 24, 2008 2:44 AM, Dave Young <[EMAIL PROTECTED]> wrote: > > > > > > On Wed, Jan 23, 2008 at 02:06:43PM -0800, Andrew Morton wrote: > > > > > On Mon, 21 Jan 2008 18:53:18 +0100 "Miguel Ojeda" <[EMAIL PROTECTED]> > > > > > wrote: > > > > > Booting 2.6.24-rc8 I get this: > > > > > > > > > > > > > > > sysfs: duplicate filename 'fan' can not be created > > > > > WARNING: at fs/sysfs/dir.c:424 sysfs_add_one() > > > > > Pid: 819, comm: modprobe Not tainted 2.6.24-rc8 #2 > > > > > [] sysfs_add_one+0x9f/0xe0 > > > > > [] create_dir+0x48/0x90 > > > > > [] sysfs_create_dir+0x29/0x50 > > > > > [] kobject_get+0xf/0x20 > > > > > [] kobject_add+0x8f/0x1b0 > > > > > [] kobject_register+0x21/0x50 > > > > > [] bus_add_driver+0x71/0x1e0 > > > > > [] acpi_fan_init+0x2f/0x4d [fan] > > > > > [] sys_init_module+0x126/0x19b0 > > > > > [] rb_insert_color+0xb7/0xe0 > > > > > [] acpi_bus_register_driver+0x0/0x38 > > > > > [] syscall_call+0x7/0xb > > > > > === > > > > > kobject_add failed for fan with -EEXIST, don't try to register things > > > > > with the same name in the same directory. > > > > > Pid: 819, comm: modprobe Not tainted 2.6.24-rc8 #2 > > > > > [] kobject_add+0x111/0x1b0 > > > > > [] kobject_register+0x21/0x50 > > > > > [] bus_add_driver+0x71/0x1e0 > > > > > [] acpi_fan_init+0x2f/0x4d [fan] > > > > > [] sys_init_module+0x126/0x19b0 > > > > > [] rb_insert_color+0xb7/0xe0 > > > > > [] acpi_bus_register_driver+0x0/0x38 > > > > > [] syscall_call+0x7/0xb > > > > > === > > > > > processor: exports duplicate symbol acpi_processor_set_thermal_limit > > > > > (owned by kernel) > > > > > > > > > > > Could apply following debug patch and see the result? 
> > > > > > > > > diff -upr linux/fs/sysfs/dir.c linux.new/fs/sysfs/dir.c > > > --- linux/fs/sysfs/dir.c2008-01-23 09:56:24.0 +0800 > > > +++ linux.new/fs/sysfs/dir.c2008-01-23 09:59:12.0 +0800 > > > @@ -418,6 +418,8 @@ void sysfs_addrm_start(struct sysfs_addr > > > */ > > > int sysfs_add_one(struct sysfs_addrm_cxt *acxt, struct sysfs_dirent *sd) > > > { > > > + if (!strcmp(sd->s_name, "fan")) > > > + dump_stack(); > > > if (sysfs_find_dirent(acxt->parent_sd, sd->s_name)) { > > > printk(KERN_WARNING "sysfs: duplicate filename '%s' " > > >"can not be created\n", sd->s_name); > > > > > > > > > > Done. The following appears in the new dmesg output. > > > > > > ACPI: Power Button (CM) [PBTN] > > input: Sleep Button (CM) as /class/input/input2 > > ACPI: Sleep Button (CM) [SBTN] > > Pid: 1, comm: swapper Not tainted 2.6.24-rc8 #3 > > [] sysfs_add_one+0x75/0x100 > > [] sysfs_addrm_start+0x3f/0xb0 > > [] create_dir+0x48/0x90 > > [] sysfs_create_dir+0x29/0x50 > > [] kobject_get+0xf/0x20 > > [] kobject_add+0x8f/0x1b0 > > [] kobject_register+0x21/0x50 > > [] bus_add_driver+0x71/0x1e0 > > [] acpi_fan_init+0x2f/0x4d > > [] kernel_init+0x121/0x300 > > [] ret_from_fork+0x6/0x1c > > [] kernel_init+0x0/0x300 > > [] kernel_init+0x0/0x300 > > [] kernel_thread_helper+0x7/0x18 > > === > > I'm curious, the "fan" is configured as built-in, why the modprobe be called? I guess initrd or your lib/modules need update. 
> > > > ACPI: SSDT 3FE93134, 0244 (r1 PmRef Cpu0Ist 3000 INTL 20050624) > > ACPI: SSDT 3FE92ACA, 05E5 (r1 PmRef Cpu0Cst 3001 INTL 20050624) > > Monitor-Mwait will be used to enter C-1 state > > Monitor-Mwait will be used to enter C-2 state > > > > Attached dmesg.txt > > > > > > -- > > Miguel Ojeda > > http://maxextreme.googlepages.com/index.htm > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ndiswrapper and GPL-only symbols redux
Jon Masters wrote: I wouldn't quite say that. I wasn't going to comment, but...personally, I actually disagree with the assertions that ndiswrapper isn't causing proprietary code to link against GPL functions in the kernel (how is an NDIS implementation any different than a shim layer provided to load a graphics driver?), but I wasn't trying to make that point. Well, as long as *any* part of the kernel ever links to proprietary code, then GPL functions link to it in exactly the same way ndiswrapper enables. It's only a matter of how many steps of separation. A perfectly GPL USB network driver linked to GPL-only functions feeds data into the kernel where it swirls about and emerges from a proprietary network filesystem driver, for example. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUILD FAILURE]2.6.24-git6 build failure on sis190 ethernet driver
On Wed, Jan 30, 2008 at 09:11:36AM +0530, Kamalesh Babulal wrote: > Hi, > > The 2.6.24-git6 kernel build fails on various x86_64 machines with the build > failure > > drivers/net/sis190.c:329: error: sis190_pci_tbl causes a section type conflict > make[2]: *** [drivers/net/sis190.o] Error 1 > > # gcc --version (machine1) > gcc (GCC) 4.1.1 20070105 (Red Hat 4.1.1-52) > > # gcc --version (machine2) > gcc (GCC) 4.1.1 20060525 (Red Hat 4.1.1-1) Hi Kamalesh I know another patch is circulating, but please try the following. diff --git a/drivers/net/sis190.c b/drivers/net/sis190.c index b570402..0a5e024 100644 --- a/drivers/net/sis190.c +++ b/drivers/net/sis190.c @@ -1556,7 +1556,7 @@ static int __devinit sis190_get_mac_addr_from_eeprom(struct pci_dev *pdev, static int __devinit sis190_get_mac_addr_from_apc(struct pci_dev *pdev, struct net_device *dev) { - static const u16 __devinitdata ids[] = { 0x0965, 0x0966, 0x0968 }; + static const u16 __devinitconst ids[] = { 0x0965, 0x0966, 0x0968 }; struct sis190_private *tp = netdev_priv(dev); struct pci_dev *isa_bridge; u8 reg, tmp8; It is the better fix if you can confirm it working. The section conflict issued by gcc happens because we try to mix const and non-const data in the same section. Sam -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ndiswrapper and GPL-only symbols redux
On Wed, 2008-01-30 at 05:07 +, Jon Masters wrote: > *). Add a new taint? > *). Move it later? > > It's all trivial, but a policy should be established for the future. I'd prefer a new taint. It's less likely to break. It provides more information in the stack dumps. It makes it clear the difference ndiswrapper and driverloader. Here's the patch: --- Introduce a new taint flag for ndiswrapper Although ndiswrapper loads proprietary code, it's under GPL itself. Introduce a different taint flag for this case, so that ndiswrapper retains access to GPL-only symbols. Add comments to show the difference between driverloader and ndiswrapper. Signed-off-by: Pavel Roskin <[EMAIL PROTECTED]> --- include/linux/kernel.h |1 + kernel/module.c|5 - kernel/panic.c |2 ++ 3 files changed, 7 insertions(+), 1 deletions(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index a7283c9..861a6ae 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -240,6 +240,7 @@ extern enum system_states { #define TAINT_BAD_PAGE (1<<5) #define TAINT_USER (1<<6) #define TAINT_DIE (1<<7) +#define TAINT_BLOB_WRAPPER (1<<8) extern void dump_stack(void) __cold; diff --git a/kernel/module.c b/kernel/module.c index f6a4e72..a64380c 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -1925,8 +1925,11 @@ static struct module *load_module(void __user *umod, /* Set up license info based on the info section */ set_license(mod, get_modinfo(sechdrs, infoindex, "license")); + /* GPL, but may load proprietary code */ if (strcmp(mod->name, "ndiswrapper") == 0) - add_taint_module(mod, TAINT_PROPRIETARY_MODULE); + add_taint_module(mod, TAINT_BLOB_WRAPPER); + + /* Wrongly claims to be under GPL */ if (strcmp(mod->name, "driverloader") == 0) add_taint_module(mod, TAINT_PROPRIETARY_MODULE); diff --git a/kernel/panic.c b/kernel/panic.c index da4d6ba..b040812 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -152,6 +152,7 @@ EXPORT_SYMBOL(panic); * 'M' - System experienced a machine check 
exception. * 'B' - System has hit bad_page. * 'U' - Userspace-defined naughtiness. + * 'W' - Wrapper for untrusted binary blobs has been loaded. * * The string is overwritten by the next call to print_taint(). */ @@ -162,6 +163,7 @@ const char *print_tainted(void) if (tainted) { snprintf(buf, sizeof(buf), "Tainted: %c%c%c%c%c%c%c%c", tainted & TAINT_PROPRIETARY_MODULE ? 'P' : 'G', + tainted & TAINT_BLOB_WRAPPER ? 'W' : ' ', tainted & TAINT_FORCED_MODULE ? 'F' : ' ', tainted & TAINT_UNSAFE_SMP ? 'S' : ' ', tainted & TAINT_FORCED_RMMOD ? 'R' : ' ', -- Regards, Pavel Roskin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
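One detail worth checking in the print_tainted() hunk above: as far as the visible context shows, the format string still carries eight %c conversions while the hunk adds a ninth argument. A table-driven builder avoids that kind of mismatch, because the number of characters always equals the number of table entries. The following is a user-space sketch only; the `MOCK_` names, the bit positions below bit 5, and the 'D' letter are assumptions, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bits 5..8 follow the patch above; the lower bit positions are assumed. */
#define MOCK_TAINT_PROPRIETARY_MODULE	(1u << 0)
#define MOCK_TAINT_FORCED_MODULE	(1u << 1)
#define MOCK_TAINT_UNSAFE_SMP		(1u << 2)
#define MOCK_TAINT_FORCED_RMMOD		(1u << 3)
#define MOCK_TAINT_MACHINE_CHECK	(1u << 4)
#define MOCK_TAINT_BAD_PAGE		(1u << 5)
#define MOCK_TAINT_USER			(1u << 6)
#define MOCK_TAINT_DIE			(1u << 7)
#define MOCK_TAINT_BLOB_WRAPPER		(1u << 8)

struct mock_taint_char {
	unsigned int flag;
	char set;	/* printed when the flag is set */
	char clear;	/* printed when it is not */
};

/* Output order mirrors the argument order in the patched snprintf() call. */
static const struct mock_taint_char mock_taint_chars[] = {
	{ MOCK_TAINT_PROPRIETARY_MODULE, 'P', 'G' },
	{ MOCK_TAINT_BLOB_WRAPPER,       'W', ' ' },
	{ MOCK_TAINT_FORCED_MODULE,      'F', ' ' },
	{ MOCK_TAINT_UNSAFE_SMP,         'S', ' ' },
	{ MOCK_TAINT_FORCED_RMMOD,       'R', ' ' },
	{ MOCK_TAINT_MACHINE_CHECK,      'M', ' ' },
	{ MOCK_TAINT_BAD_PAGE,           'B', ' ' },
	{ MOCK_TAINT_USER,               'U', ' ' },
	{ MOCK_TAINT_DIE,                'D', ' ' },	/* letter assumed */
};

#define MOCK_NR_TAINT_CHARS \
	(sizeof(mock_taint_chars) / sizeof(mock_taint_chars[0]))

/* Fills buf with "Not tainted" or "Tainted: " plus one char per entry. */
static const char *mock_print_tainted(unsigned int tainted,
				      char *buf, size_t len)
{
	static const char prefix[] = "Tainted: ";
	size_t pos = sizeof(prefix) - 1;
	size_t i;

	/* Room for the prefix, one char per flag, and the terminator. */
	assert(buf != NULL && len >= sizeof(prefix) + MOCK_NR_TAINT_CHARS);

	if (!tainted) {
		strcpy(buf, "Not tainted");
		return buf;
	}

	memcpy(buf, prefix, pos);
	for (i = 0; i < MOCK_NR_TAINT_CHARS; i++)
		buf[pos++] = (tainted & mock_taint_chars[i].flag) ?
			     mock_taint_chars[i].set :
			     mock_taint_chars[i].clear;
	buf[pos] = '\0';
	return buf;
}
```

With only the wrapper flag set, the string starts with "Tainted: GW", so an oops would show a GPL-clean kernel that has loaded a blob wrapper.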
Re: Mostly revert "e1000/e1000e: Move PCI-Express device IDs over to e1000e"
On Tue, 29 Jan 2008 23:59:37 GMT Linux Kernel Mailing List wrote: > Gitweb: > http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5b10ca19ea4859d3884d10a3eb8495de92089792 > Commit: 5b10ca19ea4859d3884d10a3eb8495de92089792 > Parent: 9e97198dbf318be7958b57900d05b37c7e09ad7c > Author: Linus Torvalds <[EMAIL PROTECTED]> > AuthorDate: Wed Jan 30 09:54:54 2008 +1100 > Committer: Linus Torvalds <[EMAIL PROTECTED]> > CommitDate: Wed Jan 30 09:54:54 2008 +1100 > > Mostly revert "e1000/e1000e: Move PCI-Express device IDs over to e1000e" > > The new e1000e driver is apparently not yet suitable for general use, so > mark it experimental, and re-instate all the PCI-Express device IDs in > the old and stable e1000 driver so that people (namely me) can continue > to use a driver that actually works. > > Auke & co have been appraised of the situation. > > Cc: Auke Kok <[EMAIL PROTECTED]> > Cc: Jeff Garzik <[EMAIL PROTECTED]> > Cc: David Miller <[EMAIL PROTECTED]> > Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]> Andrew was concerned about this when the driver was in -mm. He asked for a patch that would set E1000E to same value as E1000 and I supplied that. Auke acked it IIRC. Other people vetoed it. :( > --- > drivers/net/Kconfig|2 +- > drivers/net/e1000/e1000_main.c | 27 +++ > 2 files changed, 28 insertions(+), 1 deletions(-) > > diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig > index af40ff4..5a2d1dd 100644 > --- a/drivers/net/Kconfig > +++ b/drivers/net/Kconfig > @@ -1992,7 +1992,7 @@ config E1000_DISABLE_PACKET_SPLIT > > config E1000E > tristate "Intel(R) PRO/1000 PCI-Express Gigabit Ethernet support" > - depends on PCI > + depends on PCI && EXPERIMENTAL > ---help--- > This driver supports the PCI-Express Intel(R) PRO/1000 gigabit > ethernet family of adapters. 
For PCI or PCI-X e1000 adapters, > diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c > index 7f5b2ae..3111af6 100644 > --- a/drivers/net/e1000/e1000_main.c > +++ b/drivers/net/e1000/e1000_main.c > @@ -73,6 +73,14 @@ static struct pci_device_id e1000_pci_tbl[] = { > INTEL_E1000_ETHERNET_DEVICE(0x1026), > INTEL_E1000_ETHERNET_DEVICE(0x1027), > INTEL_E1000_ETHERNET_DEVICE(0x1028), > + INTEL_E1000_ETHERNET_DEVICE(0x1049), > + INTEL_E1000_ETHERNET_DEVICE(0x104A), > + INTEL_E1000_ETHERNET_DEVICE(0x104B), > + INTEL_E1000_ETHERNET_DEVICE(0x104C), > + INTEL_E1000_ETHERNET_DEVICE(0x104D), > + INTEL_E1000_ETHERNET_DEVICE(0x105E), > + INTEL_E1000_ETHERNET_DEVICE(0x105F), > + INTEL_E1000_ETHERNET_DEVICE(0x1060), > INTEL_E1000_ETHERNET_DEVICE(0x1075), > INTEL_E1000_ETHERNET_DEVICE(0x1076), > INTEL_E1000_ETHERNET_DEVICE(0x1077), > @@ -81,9 +89,28 @@ static struct pci_device_id e1000_pci_tbl[] = { > INTEL_E1000_ETHERNET_DEVICE(0x107A), > INTEL_E1000_ETHERNET_DEVICE(0x107B), > INTEL_E1000_ETHERNET_DEVICE(0x107C), > + INTEL_E1000_ETHERNET_DEVICE(0x107D), > + INTEL_E1000_ETHERNET_DEVICE(0x107E), > + INTEL_E1000_ETHERNET_DEVICE(0x107F), > INTEL_E1000_ETHERNET_DEVICE(0x108A), > + INTEL_E1000_ETHERNET_DEVICE(0x108B), > + INTEL_E1000_ETHERNET_DEVICE(0x108C), > + INTEL_E1000_ETHERNET_DEVICE(0x1096), > + INTEL_E1000_ETHERNET_DEVICE(0x1098), > INTEL_E1000_ETHERNET_DEVICE(0x1099), > + INTEL_E1000_ETHERNET_DEVICE(0x109A), > + INTEL_E1000_ETHERNET_DEVICE(0x10A4), > + INTEL_E1000_ETHERNET_DEVICE(0x10A5), > INTEL_E1000_ETHERNET_DEVICE(0x10B5), > + INTEL_E1000_ETHERNET_DEVICE(0x10B9), > + INTEL_E1000_ETHERNET_DEVICE(0x10BA), > + INTEL_E1000_ETHERNET_DEVICE(0x10BB), > + INTEL_E1000_ETHERNET_DEVICE(0x10BC), > + INTEL_E1000_ETHERNET_DEVICE(0x10C4), > + INTEL_E1000_ETHERNET_DEVICE(0x10C5), > + INTEL_E1000_ETHERNET_DEVICE(0x10D5), > + INTEL_E1000_ETHERNET_DEVICE(0x10D9), > + INTEL_E1000_ETHERNET_DEVICE(0x10DA), > /* required last entry */ > {0,} > }; --- ~Randy -- To 
unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC] Proportional bandwidth scheduling using anticipatory I/O scheduler on 2.6.24
This patch creates channels in the anticipatory I/O scheduler for sharing bandwidth in specified proportions. It uses the ioprio_(get/set) interface to create the various channels, and as of now it uses the best-effort levels. It is an initial attempt to get proportional b/w working in I/O schedulers.

One application is to assign a portion of the b/w on a device to a specified container. The advantage of this approach over putting absolute restrictions on the b/w of a container is that, by restricting the b/w, we may end up not utilizing additional b/w in the absence of load from other containers. That is not to say that we cannot do I/O limiting in schedulers. The current patch works for read requests, and we may need more work for congested queues and for limiting writes.

The idea is to allow processes to submit requests in a round-robin manner; if a channel exceeds its limit, it waits until the other channel is done submitting its share. So that a very active channel is not kept from submitting requests (even though it may have exceeded its limit) when there is no I/O from the other channels, counters are reset after a period of inactivity on idle channels.

There is also a loss of overall average bandwidth when using multiple classes. Some of it can be expected due to the different behavior of applications in multiple containers sharing a single device, but a major portion of the loss is due to the fact that we are still not using the scheduler optimally.

Here is a simple fio script to test this patch.

<-- snip fio.script -->
[global]
ioengine=sync
rw=read
direct=0
exitall

[file]
name=buffered1
directory=/tmp
bs=256k
size=1g
prio=0
prioclass=2

[file]
name=buffered3
directory=/tmp
bs=1m
size=1g
prio=3
prioclass=2
<-- end snip -->

Other interfaces are in /sys/block/[device]/queue/iosched/

1. priority_weights - Assign proportions to various channels. Note these proportions are now expressed in multiples of 1024*1024. I will work on getting these into exact proportions.
2. bandwidth_scheduling - writing 0 into this stops proportional scheduling.
3. bw_timer_expire - time period after which counters are reset. Writing a large value to it will give you more exact proportions, and small values increase overall average bandwidth. This is the time after which b/w on different channels is reset due to inactivity. Some tuning of this variable may be needed to get the required results. I will work on making this transparent.

This patch has four channels by default. I would like initial feedback on what we expect, especially when it comes to container groups. Is this something that is useful, or do we need hard limits for the various channels? What other things are expected? Would assigning priorities be of any use, either absolute priorities or soft priorities along with b/w limitations? I can add cgroup interfaces in another patch.

Signed-off-by: Naveen Gupta <[EMAIL PROTECTED]> Index: linux-2.6.24/block/Kconfig.iosched === --- linux-2.6.24.orig/block/Kconfig.iosched 2008-01-24 14:58:37.0 -0800 +++ linux-2.6.24/block/Kconfig.iosched 2008-01-27 11:24:50.0 -0800 @@ -21,6 +21,13 @@ config IOSCHED_AS deadline I/O scheduler, it can also be slower in some cases especially some database loads. +config IOPRIO_AS_MAX + int "Bandwidth channels in anticipatory I/O scheduler" + depends on IOSCHED_AS + default "4" + help + Number of valid b/w channels in anticipatory scheduler.
+ config IOSCHED_DEADLINE tristate "Deadline I/O scheduler" default y Index: linux-2.6.24/block/as-iosched.c === --- linux-2.6.24.orig/block/as-iosched.c2008-01-24 14:58:37.0 -0800 +++ linux-2.6.24/block/as-iosched.c 2008-01-29 12:05:52.0 -0800 @@ -16,6 +16,8 @@ #include #include #include +#include +#include #define REQ_SYNC 1 #define REQ_ASYNC 0 @@ -63,6 +65,9 @@ */ #define MAX_THINKTIME (HZ/50UL) +#define default_bandwidth_scheduling (0) +#define default_bw_timer_expire (16) /* msecs */ + /* Bits in as_io_context.state */ enum as_io_states { AS_TASK_RUNNING=0, /* Process has not exited */ @@ -89,10 +94,14 @@ struct as_data { /* * requests (as_rq s) are present on both sort_list and fifo_list */ - struct rb_root sort_list[2]; - struct list_head fifo_list[2]; + struct { + struct rb_root sort_list[2]; + struct list_head fifo_list[2]; + struct request *next_rq[2]; + unsigned long ioprio_wt; + unsigned long serviced; + } prio_q[IOPRIO_AS_MAX]; - struct request *next_rq[2]; /* next in sort order */ sector_t last_sector[2];/* last REQ_SYNC & REQ_ASYNC sectors */ unsigned long exit_prob;/
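The proportional round-robin idea described in this RFC can be modelled in a few lines of user-space C. This is a sketch under stated assumptions, not the patch's algorithm: `struct mock_channel` and the `mock_*` functions are hypothetical, shares are counted in requests rather than bytes, weights are assumed to be nonzero, and the timer-driven counter reset (bw_timer_expire) is replaced by an immediate reset whenever no backlogged channel has share left.

```c
#include <assert.h>
#include <stddef.h>

/* One bandwidth channel; all fields count requests in this sketch. */
struct mock_channel {
	unsigned long weight;	/* share per round (assumed nonzero) */
	unsigned long serviced;	/* requests dispatched in this round */
	unsigned long pending;	/* requests waiting in the queue */
};

/*
 * Pick the channel to dispatch from: the first backlogged channel that
 * still has share left. If none qualifies, reset all counters (a new
 * round) and look once more. Returns -1 when nothing is pending.
 */
static int mock_pick_channel(struct mock_channel *ch, int n)
{
	int pass, i;

	assert(ch != NULL && n > 0);

	for (pass = 0; pass < 2; pass++) {
		for (i = 0; i < n; i++) {
			if (ch[i].pending && ch[i].serviced < ch[i].weight)
				return i;
		}
		/* Stand-in for the inactivity timer: start a new round. */
		for (i = 0; i < n; i++)
			ch[i].serviced = 0;
	}
	return -1;
}

/* Dispatch one request and account for it; returns the channel used. */
static int mock_dispatch(struct mock_channel *ch, int n)
{
	int i = mock_pick_channel(ch, n);

	if (i >= 0) {
		ch[i].pending--;
		ch[i].serviced++;
	}
	return i;
}
```

With weights 2 and 1 and both channels backlogged, dispatches alternate in a 2:1 pattern. When the second channel is idle, the first keeps dispatching because its counter is reset as soon as it is the only backlogged channel, which is the work-conserving behaviour the RFC argues for over hard limits.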
Re: [PATCH] add support for dynamic ticks and preempt rcu
On Tue, Jan 29, 2008 at 11:18:12AM -0500, Steven Rostedt wrote:
> > [ > Paul, you had your Signed-off-by in the RT patch, so I attached it here > too > ]

Works for me!!!

> The PREEMPT-RCU can get stuck if a CPU goes idle and NO_HZ is set. The > idle CPU will not progress the RCU through its grace period and a > synchronize_rcu may get stuck. Without this patch I have a box that will > not boot when PREEMPT_RCU and NO_HZ are set. That same box boots fine with > this patch. > > Note: This patch came directly from the -rt patch where it has been tested > for several months.

For those who attended my lightning talk yesterday on changing RCU to "let sleeping CPUs lie", this is the patch. If your architecture calls rcu_irq_enter() or irq_enter() upon NMI/SMI/MC/whatever handler entry and also calls rcu_irq_exit() or irq_exit() upon NMI/SMI/MC/whatever handler exit, you are covered. Alternatively, if none of your architecture's NMI/SMI/MC/whatever handlers ever invoke rcu_read_lock()/rcu_read_unlock() and friends, you are also covered. I believe that we are covered, but I cannot claim to fully understand all 20+ architectures. ;-)

Thanx, Paul

> Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> > Signed-off-by: Paul E. McKenney <[EMAIL PROTECTED]> > --- > include/linux/hardirq.h| 10 ++ > include/linux/rcuclassic.h |3 > include/linux/rcupreempt.h | 22 > kernel/rcupreempt.c| 224 > - > kernel/softirq.c |1 > kernel/time/tick-sched.c |3 > 6 files changed, 259 insertions(+), 4 deletions(-) > > Index: linux-compile.git/kernel/rcupreempt.c > === > --- linux-compile.git.orig/kernel/rcupreempt.c2008-01-29 > 11:03:21.0 -0500 > +++ linux-compile.git/kernel/rcupreempt.c 2008-01-29 11:10:08.0 > -0500 > @@ -23,6 +23,10 @@ > * to Suparna Bhattacharya for pushing me completely away > * from atomic instructions on the read side. > * > + * - Added handling of Dynamic Ticks > + * Copyright 2007 - Paul E.
Mckenney <[EMAIL PROTECTED]> > + * - Steven Rostedt <[EMAIL PROTECTED]> > + * > * Papers: http://www.rdrop.com/users/paulmck/RCU > * > * Design Document: http://lwn.net/Articles/253651/ > @@ -409,6 +413,212 @@ static void __rcu_advance_callbacks(stru > } > } > > +#ifdef CONFIG_NO_HZ > + > +DEFINE_PER_CPU(long, dynticks_progress_counter) = 1; > +static DEFINE_PER_CPU(long, rcu_dyntick_snapshot); > +static DEFINE_PER_CPU(int, rcu_update_flag); > + > +/** > + * rcu_irq_enter - Called from Hard irq handlers and NMI/SMI. > + * > + * If the CPU was idle with dynamic ticks active, this updates the > + * dynticks_progress_counter to let the RCU handling know that the > + * CPU is active. > + */ > +void rcu_irq_enter(void) > +{ > + int cpu = smp_processor_id(); > + > + if (per_cpu(rcu_update_flag, cpu)) > + per_cpu(rcu_update_flag, cpu)++; > + > + /* > + * Only update if we are coming from a stopped ticks mode > + * (dynticks_progress_counter is even). > + */ > + if (!in_interrupt() && > + (per_cpu(dynticks_progress_counter, cpu) & 0x1) == 0) { > + /* > + * The following might seem like we could have a race > + * with NMI/SMIs. But this really isn't a problem. > + * Here we do a read/modify/write, and the race happens > + * when an NMI/SMI comes in after the read and before > + * the write. But NMI/SMIs will increment this counter > + * twice before returning, so the zero bit will not > + * be corrupted by the NMI/SMI which is the most important > + * part. > + * > + * The only thing is that we would bring back the counter > + * to a postion that it was in during the NMI/SMI. > + * But the zero bit would be set, so the rest of the > + * counter would again be ignored. > + * > + * On return from the IRQ, the counter may have the zero > + * bit be 0 and the counter the same as the return from > + * the NMI/SMI. 
If the state machine was so unlucky to > + * see that, it still doesn't matter, since all > + * RCU read-side critical sections on this CPU would > + * have already completed. > + */ > + per_cpu(dynticks_progress_counter, cpu)++; > + /* > + * The following memory barrier ensures that any > + * rcu_read_lock() primitives in the irq handler > + * are seen by other CPUs to follow the above > + * increment to dynticks_progress_counter. This is > + * required in order for other CPUs to correctly > +
Re: [PATCH powerpc] Fake NUMA emulation for PowerPC (Take 3)
* Michael Ellerman <[EMAIL PROTECTED]> [2008-01-30 00:04:58]: > Why do you check !p after assigning to nid? I assume it's because we > might have reached the end of the command line, ie. p == NULL, but we're > still adding memory to the last node? If so it's a it's a little subtle > and deserves a comment I think. > The reason that we check for !p after assigning node id is that, in case we create fake NUMA nodes, we want nid to be the fake numa node id and not the real node id or in the non NUMA case, node id 0. The if (!p) checks to see if we do have more arguments to parse. > Otherwise this looks pretty good. > Thanks! > cheers > > -- > Michael Ellerman > OzLabs, IBM Australia Development Lab > > wwweb: http://michael.ellerman.id.au > phone: +61 2 6212 1183 (tie line 70 21183) > > We do not inherit the earth from our ancestors, > we borrow it from our children. - S.M.A.R.T Person -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Change In sk_buff structure in 2.6.22 kernel
Hi All,

The header fields in the sk_buff structure have been renamed and are no longer unions. Networking code and drivers are supposed to use skb->transport_header, skb->network_header, and skb->mac_header. But when I try to access the TCP fields using this code:

    struct tcphdr *tcp = skb->transport_header;
    tcp-> //accessing proper field

the value is not read properly. Can anyone please help me?

Thanks in advance.

Regards,
Juliet
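One likely cause, offered as an assumption rather than a diagnosis: on some configurations (64-bit builds, as far as I know) the renamed header fields hold an offset from skb->head rather than a pointer, so assigning skb->transport_header directly to a struct tcphdr pointer yields a bogus address. The user-space sketch below models only that idea; `struct mock_skb`, `struct mock_tcphdr`, and the `mock_*` helpers are simplified stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for struct tcphdr: only the two port fields. */
struct mock_tcphdr {
	uint16_t source;
	uint16_t dest;
};

/* Simplified stand-in for struct sk_buff in the offset-based layout. */
struct mock_skb {
	unsigned char *head;		/* start of the packet buffer */
	unsigned int transport_header;	/* offset from head, not a pointer */
};

/* Models the idea behind skb_transport_header(): base plus offset. */
static unsigned char *mock_skb_transport_header(const struct mock_skb *skb)
{
	assert(skb != NULL && skb->head != NULL);
	return skb->head + skb->transport_header;
}

/* Models the idea behind tcp_hdr(): read the header at that address. */
static struct mock_tcphdr mock_tcp_hdr(const struct mock_skb *skb)
{
	struct mock_tcphdr th;

	/* memcpy avoids any alignment assumption about the buffer. */
	memcpy(&th, mock_skb_transport_header(skb), sizeof(th));
	return th;
}
```

In kernel code the corresponding accessors are skb_transport_header(skb) and tcp_hdr(skb), which hide whether the field is stored as a pointer or as an offset, so using them instead of the raw field should behave the same on both layouts.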
Re: kernel BUG at ide-cd.c:1726 in 2.6.24-03863-g0ba6c33 && -g8561b089
On Wed, Jan 30, 2008 at 06:03:47AM +0100, Borislav Petkov wrote: > On Wed, Jan 30, 2008 at 12:58:33AM +0100, Bartlomiej Zolnierkiewicz wrote: > > > > Hi, > > > > On Wednesday 30 January 2008, Kiyoshi Ueda wrote: > > > Hi Bart, > > > > > > On Tue, 29 Jan 2008 14:22:53 -0800, Roland Dreier wrote: > > > > Hi, I saw the same BUG from ide-cd on one of my systems. I applied > > > > the debugging patch to replace the BUG with blk_dump_rq_flags(), and I > > > > got the output below (full boot log and .config attached to this > > > > email). > > > > > > > > Please let me know if there's anything else that would help debug the > > > > problem. > > > > > > Thank you for the information, Roland. > > > > > > > > > > [4.072271] Uniform CD-ROM driver Revision: 3.20 > > > > [4.098236] ide-cd: rq still having bio: dev hda: type=2, flags=114c8 > > > > [4.100269] > > > > [4.100269] sector 1949759, nr/cnr 0/0 > > > > [4.100269] bio 8102418cc600, biotail 8102418cc600, buffer > > > > , d8 > > > > [4.100269] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00 > > > > [4.101005] ide-cd: rq still having bio: dev hda: type=2, flags=114c8 > > > > [4.104269] > > > > [4.104269] sector 1949759, nr/cnr 0/0 > > > > [4.104269] bio 8102418cc600, biotail 8102418cc600, buffer > > > > , d2 > > > > [4.104269] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00 > > > > [4.109203] ide-cd: rq still having bio: dev hda: type=2, flags=114c8 > > > > [4.112270] > > > > [4.112270] sector 1949759, nr/cnr 0/0 > > > > [4.112270] bio 8102418cc600, biotail 8102418cc600, buffer > > > > , d8 > > > > [4.112270] cdb: 12 01 00 00 fe 00 00 00 00 00 00 00 00 00 00 00 > > > > [4.112945] ide-cd: rq still having bio: dev hda: type=2, flags=114c8 > > > > [4.116270] > > > > [4.116270] sector 1949759, nr/cnr 0/0 > > > > [4.116270] bio 8102418cc600, biotail 8102418cc600, buffer > > > > , d2 > > > > [4.116270] cdb: 12 01 00 00 fe 00 00 00 00 00 00 00 00 00 00 00 > > > > > > Bart, > > > This means that the rq still has a bio 
even after DRQ_STAT is cleared. > > > The original ide-cd code was calling only end_that_request_last() there. > > > So I thought that the rq should have no bio when DRQ_STAT is cleared, > > > otherwise the bio leaks. > > > > > > Was my understanding wrong and is that correct behavior in ide-cd? > > > > Added Borislav to cc:. > > > > PS I'm extremely busy with "real-life" (unfortunately IDE hacking is not > > my paid job) and the friday is the earliest date on which I would be able > > to look in detail into this problem and other outstanding IDE stuff, sorry. > > Same here, will be able to look into it tomorrow. In the meantime, can someone > direct me the full BUG() output? Nevermind. Got it, thanks. -- Regards/Gruß, Boris. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH retry] bluetooth : add conn add/del workqueues to avoid connection fail
From: Dave Young <[EMAIL PROTECTED]> Date: Wed, 30 Jan 2008 10:23:54 +0800 > > The bluetooth hci_conn sysfs add/del executed in the default workqueue. > If the del_conn is executed after the new add_conn with same target, > add_conn will failed with warning of "same kobject name". > > Here add btaddconn & btdelconn workqueues, > flush the btdelconn workqueue in the add_conn function to avoid the issue. > > Signed-off-by: Dave Young <[EMAIL PROTECTED]> This looks good, applied, thanks Dave. I've queued this up for 2.6.25 merging, if you want me to schedule it for -stable, just let me know. Thanks again. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ndiswrapper and GPL-only symbols redux
On Wed, Jan 30, 2008 at 04:24:50AM +0100, Andi Kleen wrote:
> Pavel Roskin <[EMAIL PROTECTED]> writes: > > > > static inline void add_taint_module(struct module *mod, unsigned flag) > > { > > add_taint(flag); > > mod->taints |= flag; > > } > > > > The module taint is set before the symbols are resolved. Therefore, the > > GPL-only symbols won't be resolved. > > I think using a separate taint flag that does not disable GPL symbols > for the ndiswrapper case would be a fair solution. After all the main > motivation for tainting ndiswrapper is to make it visible in oopses, but not > prevent it from loading in the first place. > > How about you submit an incremental patch to do that?

I'll happily submit a patch to do whatever is wanted, and add comments (I'm also debugging several module issues right now, so I have a good opportunity to look over some of the code). But do we want to:

*). Add a new taint?
*). Move it later?

It's all trivial, but a policy should be established for the future.

Jon.
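The ordering problem quoted above (the per-module taint is set before symbols are resolved, so GPL-only symbols are refused) and the two options being weighed can be modelled in user space. This is a minimal sketch with invented names: `struct mock_module`, `mock_exports`, `mock_resolve_symbol()`, and `MOCK_TAINT_WRAPPER` are hypothetical, and the check simply follows the description in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MOCK_TAINT_PROPRIETARY_MODULE	(1u << 0)
#define MOCK_TAINT_WRAPPER		(1u << 8)  /* hypothetical separate flag */

struct mock_module {
	const char *name;
	unsigned int taints;
};

struct mock_symbol {
	const char *name;
	int gpl_only;
};

/* Hypothetical export table: one plain export, one GPL-only export. */
static const struct mock_symbol mock_exports[] = {
	{ "plain_export", 0 },
	{ "gpl_only_export", 1 },
};

/* Returns 1 if the module may bind to the named symbol, 0 otherwise. */
static int mock_resolve_symbol(const struct mock_module *mod, const char *name)
{
	int gpl_ok;
	size_t i;

	assert(mod != NULL && name != NULL);

	/* Only the proprietary taint blocks GPL-only symbols. */
	gpl_ok = !(mod->taints & MOCK_TAINT_PROPRIETARY_MODULE);

	for (i = 0; i < sizeof(mock_exports) / sizeof(mock_exports[0]); i++) {
		if (strcmp(mock_exports[i].name, name) != 0)
			continue;
		return !mock_exports[i].gpl_only || gpl_ok;
	}
	return 0;	/* unknown symbol */
}
```

In this model, tainting with the proprietary flag before resolution makes `gpl_only_export` unavailable, while a separate wrapper flag leaves resolution untouched. Moving the taint after resolution would have the same effect on loading but would keep a single flag in the oops output.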
Re: kernel BUG at ide-cd.c:1726 in 2.6.24-03863-g0ba6c33 && -g8561b089
On Wed, Jan 30, 2008 at 12:58:33AM +0100, Bartlomiej Zolnierkiewicz wrote: > > Hi, > > On Wednesday 30 January 2008, Kiyoshi Ueda wrote: > > Hi Bart, > > > > On Tue, 29 Jan 2008 14:22:53 -0800, Roland Dreier wrote: > > > Hi, I saw the same BUG from ide-cd on one of my systems. I applied > > > the debugging patch to replace the BUG with blk_dump_rq_flags(), and I > > > got the output below (full boot log and .config attached to this > > > email). > > > > > > Please let me know if there's anything else that would help debug the > > > problem. > > > > Thank you for the information, Roland. > > > > > > > [4.072271] Uniform CD-ROM driver Revision: 3.20 > > > [4.098236] ide-cd: rq still having bio: dev hda: type=2, flags=114c8 > > > [4.100269] > > > [4.100269] sector 1949759, nr/cnr 0/0 > > > [4.100269] bio 8102418cc600, biotail 8102418cc600, buffer > > > , d8 > > > [4.100269] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00 > > > [4.101005] ide-cd: rq still having bio: dev hda: type=2, flags=114c8 > > > [4.104269] > > > [4.104269] sector 1949759, nr/cnr 0/0 > > > [4.104269] bio 8102418cc600, biotail 8102418cc600, buffer > > > , d2 > > > [4.104269] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00 > > > [4.109203] ide-cd: rq still having bio: dev hda: type=2, flags=114c8 > > > [4.112270] > > > [4.112270] sector 1949759, nr/cnr 0/0 > > > [4.112270] bio 8102418cc600, biotail 8102418cc600, buffer > > > , d8 > > > [4.112270] cdb: 12 01 00 00 fe 00 00 00 00 00 00 00 00 00 00 00 > > > [4.112945] ide-cd: rq still having bio: dev hda: type=2, flags=114c8 > > > [4.116270] > > > [4.116270] sector 1949759, nr/cnr 0/0 > > > [4.116270] bio 8102418cc600, biotail 8102418cc600, buffer > > > , d2 > > > [4.116270] cdb: 12 01 00 00 fe 00 00 00 00 00 00 00 00 00 00 00 > > > > Bart, > > This means that the rq still has a bio even after DRQ_STAT is cleared. > > The original ide-cd code was calling only end_that_request_last() there. 
> > So I thought that the rq should have no bio when DRQ_STAT is cleared, > > otherwise the bio leaks. > > > > Was my understanding wrong and is that correct behavior in ide-cd? > > Added Borislav to cc:. > > PS I'm extremely busy with "real-life" (unfortunately IDE hacking is not > my paid job) and the friday is the earliest date on which I would be able > to look in detail into this problem and other outstanding IDE stuff, sorry. Same here, will be able to look into it tomorrow. In the meantime, can someone direct me the full BUG() output? -- Regards/Gruß, Boris. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ndiswrapper and GPL-only symbols redux
On Tue, Jan 29, 2008 at 08:48:21PM -0500, Pavel Roskin wrote:
> On Tue, 2008-01-29 at 19:20 -0500, Jon Masters wrote:
>
> > Yes it is. But I thought the existing code was intending to taint the
> > kernel (that's what it does), so it would really help to identify why it
> > tainted the kernel, by calling add_taint_module instead of add_taint. I
> > didn't put the existing match in there...don't shoot the messenger :)
>
> So, it's the same thing as in year 2006. Good intentions, unexpected
> side effects, and a long discussion.

I wouldn't quite say that. I wasn't going to comment, but...personally, I
actually disagree with the assertions that ndiswrapper isn't causing
proprietary code to link against GPL functions in the kernel (how is an
NDIS implementation any different than a shim layer provided to load a
graphics driver?), but I wasn't trying to make that point.

Rusty - shall we just move the taint to post symbol resolution?

Jon.
Re: [PATCH] fix tasklist + find_pid() with CONFIG_PREEMPT_RCU
On Tue, Jan 29, 2008 at 08:24:17PM -0700, Eric W. Biederman wrote: > Oleg Nesterov <[EMAIL PROTECTED]> writes: > > > With CONFIG_PREEMPT_RCU read_lock(tasklist_lock) doesn't imply > > rcu_read_lock(), > > but find_pid_ns()->hlist_for_each_entry_rcu() should be safe under tasklist. > > > > Usually it is, detach_pid() is always called under > > write_lock(tasklist_lock), > > but copy_process() calls free_pid() lockless. > > > > "#ifdef CONFIG_PREEMPT_RCU" is added mostly as documentation, perhaps it is > > too ugly and should be removed. > > > > Signed-off-by: Oleg Nesterov <[EMAIL PROTECTED]> > > > > --- MM/kernel/fork.c~PR_RCU 2008-01-27 17:09:47.0 +0300 > > +++ MM/kernel/fork.c2008-01-29 19:23:44.0 +0300 > > @@ -1335,8 +1335,19 @@ static struct task_struct *copy_process( > > return p; > > > > bad_fork_free_pid: > > - if (pid != &init_struct_pid) > > + if (pid != &init_struct_pid) { > > +#ifdef CONFIG_PREEMPT_RCU > > + /* > > +* read_lock(tasklist_lock) doesn't imply rcu_read_lock(), > > +* make sure find_pid() is safe under read_lock(tasklist). > > +*/ > > + write_lock_irq(&tasklist_lock); > > +#endif > > free_pid(pid); > > +#ifdef CONFIG_PREEMPT_RCU > > + write_unlock_irq(&tasklist_lock); > > +#endif > > + } > > bad_fork_cleanup_namespaces: > > exit_task_namespaces(p); > > bad_fork_cleanup_keys: > > Ok. I believe I see what problem you are trying to fix. That > a pid returned from find_pid might disappear if we are not rcu > protected. > > This patch in the simplest form is wrong because it is confusing. > > We currently appear to have two options. > 1) Force all pid hash table access and pid accesses that >do not get a count to be covered under rcu_read_lock. > 2) To modify the locking requirements for free_pid to require >the tasklist_lock. > >However this second approach is horribly brittle, as it >will break if we ever have intermediate entries in the >hash table protected by pidmap_lock. 
> > Using the tasklist_lock to still guarantee we see the list, the entire > list, and exactly the list for proper implementation of kill to > process groups and sessions still seems sane. > > So let's just remove the guarantee of find_pid being usable with > just the tasklist_lock held. Makes sense to me -- it is totally permissible to hold rcu_read_lock() across update code. ;-) Thanx, Paul > Eric > > diff --git a/include/linux/pid.h b/include/linux/pid.h > index e29a900..0ffb8cc 100644 > --- a/include/linux/pid.h > +++ b/include/linux/pid.h > @@ -100,8 +100,7 @@ struct pid_namespace; > extern struct pid_namespace init_pid_ns; > > /* > - * look up a PID in the hash table. Must be called with the tasklist_lock > - * or rcu_read_lock() held. > + * look up a PID in the hash table. Must be called with the rcu_read_lock() > held. > * > * find_pid_ns() finds the pid in the namespace specified > * find_pid() find the pid by its global id, i.e. in the init namespace -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] fix tasklist + find_pid() with CONFIG_PREEMPT_RCU
On Tue, Jan 29, 2008 at 07:16:50PM -0700, Eric W. Biederman wrote: > Andrew Morton <[EMAIL PROTECTED]> writes: > > > On Tue, 29 Jan 2008 19:40:19 +0300 > > Oleg Nesterov <[EMAIL PROTECTED]> wrote: > > > >> With CONFIG_PREEMPT_RCU read_lock(tasklist_lock) doesn't imply > > rcu_read_lock(), > >> but find_pid_ns()->hlist_for_each_entry_rcu() should be safe under > >> tasklist. > >> > >> Usually it is, detach_pid() is always called under > >> write_lock(tasklist_lock), > >> but copy_process() calls free_pid() lockless. > >> > >> "#ifdef CONFIG_PREEMPT_RCU" is added mostly as documentation, perhaps it is > >> too ugly and should be removed. > >> > >> Signed-off-by: Oleg Nesterov <[EMAIL PROTECTED]> > >> > >> --- MM/kernel/fork.c~PR_RCU2008-01-27 17:09:47.0 +0300 > >> +++ MM/kernel/fork.c 2008-01-29 19:23:44.0 +0300 > >> @@ -1335,8 +1335,19 @@ static struct task_struct *copy_process( > >>return p; > >> > >> bad_fork_free_pid: > >> - if (pid != &init_struct_pid) > >> + if (pid != &init_struct_pid) { > >> +#ifdef CONFIG_PREEMPT_RCU > >> + /* > >> + * read_lock(tasklist_lock) doesn't imply rcu_read_lock(), > >> + * make sure find_pid() is safe under read_lock(tasklist). > >> + */ > >> + write_lock_irq(&tasklist_lock); > >> +#endif > >>free_pid(pid); > >> +#ifdef CONFIG_PREEMPT_RCU > >> + write_unlock_irq(&tasklist_lock); > >> +#endif > >> + } > >> bad_fork_cleanup_namespaces: > >>exit_task_namespaces(p); > >> bad_fork_cleanup_keys: > > > > My attempt to understand this change timed out. > > > > kernel/pid.c is full of global but undocumented functions. What are the > > locking requirements for free_pid()? free_pid_ns()? If it's just > > caller-must-hold-rcu_read_lock() then why not use rcu_read_lock() here? > > > > If the locking is "caller must hold write_lock_irq(tasklist_lock) then the > > sole relevant comment in there (in free_pid()) is wrong. > > > > Guys, more maintainable code please? > > Well I took a quick look. > > Yeah this looks complex. 
> Mutation of the hash table is protected by pidmap_lock. > But attachments of tasks to hash entries is protected task_lock. > > And it looks like it has been that way since commit > 92476d7fc0326a409ab1d3864a04093a6be9aca7 > > I thought free_pid did not have any requirements that a lock be held when > it was called, taking all of the needed locks. > > Now how read_lock doesn't imply rcu_read_lock is another question. Although read_lock() does accidentally imply rcu_read_lock() for Classic RCU, it no longer does so for preemptible RCU. But I thought that we had found these -- must have missed some... Thanx, Paul > Anyway I have to run. I will see about looking at this in a bit. > > Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUILD FAILURE]2.6.24-git6 build failure on sis190 ethernet driver
Kamalesh Babulal wrote:
> Hi,
>
> The 2.6.24-git6 kernel build fails on various x86_64 machines with the build
> failure
>
> drivers/net/sis190.c:329: error: sis190_pci_tbl causes a section type conflict
> make[2]: *** [drivers/net/sis190.o] Error 1
>
> # gcc --version (machine1)
> gcc (GCC) 4.1.1 20070105 (Red Hat 4.1.1-52)
>
> # gcc --version (machine2)
> gcc (GCC) 4.1.1 20060525 (Red Hat 4.1.1-1)
>

Heh :) vger.kernel.org does not like emails sent directly from gmail, it
seems =) (sorry for sending this 3 times now)

The following patch should fix the build failure.

diff --git a/drivers/net/sis190.c b/drivers/net/sis190.c
index b570402..e48e4ad 100644
--- a/drivers/net/sis190.c
+++ b/drivers/net/sis190.c
@@ -326,7 +326,7 @@ static const struct {
 	{ "SiS 191 PCI Gigabit Ethernet adapter" },
 };
 
-static struct pci_device_id sis190_pci_tbl[] __devinitdata = {
+static const struct pci_device_id sis190_pci_tbl[] __devinitdata = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_SI, 0x0190), 0, 0, 0 },
 	{ PCI_DEVICE(PCI_VENDOR_ID_SI, 0x0191), 0, 0, 1 },
 	{ 0, },

Gabriel
Re: [PATCH 0/3][RFC] x86: Catch stray non-kprobe breakpoints
On Tue, Jan 29, 2008 at 02:29:41PM -0500, Masami Hiramatsu wrote:
> Abhishek Sagar wrote:
> > On 1/29/08, Masami Hiramatsu <[EMAIL PROTECTED]> wrote:
> >> In that case, why don't you just reduce the priority of
> >> kprobe_exceptions_nb?
> >> Then, the execution path becomes very simple.
> >
> > Ananth mentioned that the kprobe notifier has to be the first to run.
>
> (Hmm.. I think he has just explained current implementation:))
> IMHO, since kprobes itself can not know what the external debugger
> wants to do, the highest priority should be reserved for those external tools.

The reason why kprobes needs to be the first to run is simple: it
doesn't need user intervention and if it isn't the intended recipient of
the breakpoint, it just lets the kernel take over (unlike a debugger,
which would potentially need user attention).

Also, if the underlying instruction itself is a breakpoint, we have the
facility in kprobes to single-step inline so the kernel can take control
and notify any other intended recipient of the underlying breakpoint.

As such, I believe the current situation is fine, has worked fine for
close to 4 years now and doesn't warrant any change.

Ananth
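The ordering argument above can be sketched in userspace C. This is only a toy model of a priority-ordered notifier chain, not the kernel's notifier code: every name here (sketch_notifier, sketch_register, the SKETCH_* return codes, the probe address) is hypothetical, and the "debugger" handler that claims every breakpoint is a deliberate simplification.

```c
#include <stddef.h>

/* Hypothetical return codes: pass the event on, or claim it. */
enum { SKETCH_DONE = 0, SKETCH_STOP = 1 };

struct sketch_notifier {
	int (*call)(unsigned long addr);
	int priority;                   /* higher priority runs first */
	struct sketch_notifier *next;
};

/* Insert a handler, keeping the chain sorted by descending priority. */
static void sketch_register(struct sketch_notifier **head,
			    struct sketch_notifier *n)
{
	while (*head && (*head)->priority >= n->priority)
		head = &(*head)->next;
	n->next = *head;
	*head = n;
}

/*
 * Walk the chain until a handler claims the breakpoint.
 * Returns the priority of the claiming handler, or -1 if nobody did.
 */
static int sketch_call_chain(struct sketch_notifier *head, unsigned long addr)
{
	for (; head; head = head->next)
		if (head->call(addr) == SKETCH_STOP)
			return head->priority;
	return -1;
}

/* Probe-like handler: claims only the address it planted itself. */
#define SKETCH_PROBE_ADDR 0x1000UL
static int sketch_probe_handler(unsigned long addr)
{
	return addr == SKETCH_PROBE_ADDR ? SKETCH_STOP : SKETCH_DONE;
}

/* Debugger-like handler (simplified): claims every breakpoint it sees. */
static int sketch_debugger_handler(unsigned long addr)
{
	(void)addr;
	return SKETCH_STOP;
}
```

With the probe handler registered at the higher priority, a breakpoint at SKETCH_PROBE_ADDR is consumed by it without involving the debugger-like handler, while any other address falls through to the next handler in the chain. Swapping the priorities in this model would let the debugger-like handler swallow the probe's own breakpoint.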
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On Tue, Jan 29, 2008 at 05:19:55AM -0800, Greg KH wrote:
> On Mon, Jan 28, 2008 at 08:18:04PM -0700, Matthew Wilcox wrote:
> > I'm more optimistic because we've so severely restricted the use of
> > mmconf after these patches that it's unlikely to cause problems. I also
> > hear Vista is now using mmconf, so fewer implementations are going to
> > be buggy at this point.
>
> Hahahaha, oh, that's a good one...

Thanks Greg. What happened to "Can't we all try to get along"?

> But what about the thousands of implementations out there that are
> buggy?
>
> I'm with Arjan here, I'm very skeptical.

Maybe I'm insufficiently imaginative. Can you come up with a plausible
way in which the two patches I posted will succumb to bugs? After those
patches we only use mmconf if:

1. conf1 has failed to work OR
2. user has compiled their own kernel without support for conf1 OR
3. kernel probes config space 0x100 to see if it can access extended
   config space (requires the device to be PCIe or PCI-X2) OR
4. root attempts to lspci - or lspci -v OR
5. device driver tries to access extended config space

With Arjan's patch, I believe only case 3 changes. In cases 4 and 5,
either lspci or the device driver will jump through the hoop to enable
access to extended config space.

> Matthew, with Arjan's patch, is anything that currently works now
> broken? Why do you feel it is somehow "wrong"?

lspci is broken. It used to be able to access extended config space,
and now can't unless it is patched to know about the sysfs flag to
enable it.

If you're determined to implement something to disable extended config
space by default, it can be done in a much better way than Arjan's patch
-- less code (both source and object).

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
Re: at91sam9260 wakeup on serial port
On Monday 28 January 2008, Haavard Skinnemoen wrote:
>
> > What will AVR32 (AP7) need to do, when it supports system sleep states?
>
> Not sure. The PIOs seem to require a clock in order to detect a pin
> change, so I don't think we can enter very deep sleep states if we want
> to be woken up by the USART.

Right; if no DMA is pending, then the HSB matrix clock can be idled,
DRAM put into self-refresh, and most peripherals can issue wakeups ...
AP7 "Frozen" state, very analogous to AT91 "standby" on Linux. UARTs
and GPIOs can wake.

Deeper sleep states -- "standby" with clocks running, "stop" with all
except 32K (and RTC) off, "static" with no clocks at all -- can only
wake from WAKE_N and external interrupts; and RTC except in "static".
I suspect "stop" and "static" might want to use the on-chip SRAMs so
they don't need to change DRAM timings while they fiddle with clocks.

The closest analogue to the AT91 support would map /sys/power/state:

	standby --> to AP7 "Frozen"
	mem     --> to AP7 "Stop"

Except that there could be no GPIO wakeups from "mem" ... so the
$SUBJECT patch wouldn't be useful on AVR32 (just AT91), unless
USARTn.RXD is wired up to one of those special wake-capable pins
(extremely board-specific).

> There's a separate WAKE_N pin that is completely asynchronous, so with
> some external logic, we can probably wake up the CPU all the way from
> Static mode if a given input state is present. But that's definitely
> "board specific" territory, and starting the oscillators take a _long_
> time on the AP7000 (especially the 32 kHz, but then again, it barely
> consumes any power, so we might as well keep it running and keep the
> RTC going as well.)

I'd think the support of any "deeper" state for "mem" sleep would not
be entirely board specific ... when the RTC alarm is set, any board
should be able to use states other than "static".

But otherwise, no board could enter those states unless WAKE_N or an
external IRQ are doing something useful (like being hooked up to a
button). Matching those few "deep wake" events to a given device would
imply board-specific glue code.

> So on AP7000, I think we'll just need to keep clocking the USART and
> let it generate the interrupt that wakes up the rest of the system.

For "standby" sleep state, yes -- map to at most AVR32 "Frozen" state.
That'd be a good first step for PM support.

- Dave
[BUILD FAILURE]2.6.24-git6 build failure on sis190 ethernet driver
Hi,

The 2.6.24-git6 kernel build fails on various x86_64 machines with the build
failure

drivers/net/sis190.c:329: error: sis190_pci_tbl causes a section type conflict
make[2]: *** [drivers/net/sis190.o] Error 1

# gcc --version (machine1)
gcc (GCC) 4.1.1 20070105 (Red Hat 4.1.1-52)

# gcc --version (machine2)
gcc (GCC) 4.1.1 20060525 (Red Hat 4.1.1-1)

--
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
Re: [Xen-devel] dm-band: The I/O bandwidth controller: Performance Report
Hi,

> you mean that you run 128 processes on each user-device pairs? Namely,
> I guess that
>
> user1: 128 processes on sdb5,
> user2: 128 processes on sdb5,
> another: 128 processes on sdb5,
> user2: 128 processes on sdb6.

"User-device pairs" means "band groups", right?
What I actually did is the following:

  user1: 128 processes on sdb5,
  user2: 128 processes on sdb5,
  user3: 128 processes on sdb5,
  user4: 128 processes on sdb6.

> The second preliminary studies might be:
> - What if you use a different I/O size on each device (or device-user pair)?
> - What if you use a different number of processes on each device (or
>   device-user pair)?

There are other ideas for controlling bandwidth, such as limiting
bytes-per-sec or latency. I think it is possible to implement them if
a lot of people really require them. I feel there wouldn't be a single
correct answer for this issue. Good ideas on how it should work, and
patches implementing them, are also welcome.

> And my impression is that it's natural dm-band is in device-mapper,
> separated from I/O scheduler. Because bandwidth control and I/O
> scheduling are two different things, it may be simpler that they are
> implemented in different layers.

I would like to know how dm-band works with various configurations on
various types of hardware. I'll try running dm-band with other
configurations. Any reports or impressions of dm-band on your machines
are also welcome.

Thanks,
Ryo Tsuruta
Re: [PATCH 24/27] NFS: Use local caching [try #2]
Chuck Lever <[EMAIL PROTECTED]> wrote:

> This patch really ought to be broken into more manageable atomic
> changes to make it easier to review, and to provide more fine-grained
> explanation and rationalization for each specific change via
> individual patch descriptions.

Hmmm I broke the patch up as Trond stipulated - at least, I thought I had.

In many ways this request doesn't make sense. You can't do NFS caching
without all the appropriate bits, so logically they should be one patch.
Breaking it up won't help git-bisect since the option to enable all this is
the last (or nearly last) patch.

However, I can do it (when I get back from LCA next week).

> This should no longer be necessary. The latest mount.nfs subcommand
> from nfs-utils supports text-based mounts when running on kernels
> 2.6.23 and later.

Okay. I'll update my patches to reflect this. Note, however, I've got
someone reporting a bug that seems to show otherwise. I'll have to
investigate this more next week.

> I hope you intend to provide updates to nfs(5) that describe the new
> mount options you introduce in this and later patches. You don't
> mention it, but I assume that "nofsc" is the default behavior.

I should make SteveD do that, the options were his idea:-) But I'll deal
with it.

> Add comments like this in a separate clean up patch.
> A suggestion: fs/nfs/fsc-index.c might be a better name.

If you wish, though I'd prefer to use a name that isn't likely to clash
with a name that's going to appear in fs/fscache/ (or include/linux/ - I'd
really like to rename fs/nfs/fscache.h as dealing with two fscache.h's is
annoying.

> > +struct nfs_fh_auxdata {
> > +	struct timespec	i_mtime;
> > +	struct timespec	i_ctime;
> > +	loff_t		i_size;
> > +};
>
> It might be useful to explain here why you need to supplement the
> mtime, ctime, and size fields that already exist in an NFS inode.

Supplement? I don't understand.

> > +	key->port = clp->cl_addr.sin_port;
>
> Not sure why you are using the server's port here. In almost every
> case the server side port number will be 2049, so it really doesn't
> add any uniquification.

The reason lies in "in almost every case". It's possible to configure it
such that a server is running two separate NFS servers on different ports.

> If you're going for the client side port number, that changes after
> every connection, so it would be useless to identify a cache after a
> reboot (or even after the connection idles out!).

I'm going for the server side port number. Using the client side port
number would be silly.

> I strongly recommend you use the existing IPv6 address conversion
> macros for this instead of open-coding yet another way of mapping an
> IPv4 address to an IPv6 address.
>
> However, since AF_INET6 support is being introduced in the NFS client
> in 2.6.24, I recommend you take a look at these source files after
> Trond has pushed his NFS_ALL for 2.6.24.

I'll look at them.

> See below: the NFS cache-related stats should be added to nfs_iostats.

I believe I asked Trond, but I'll check.

I've got to move, so I'll deal with the rest of your email later.

David
Re: [PATCH] fix tasklist + find_pid() with CONFIG_PREEMPT_RCU
Oleg Nesterov <[EMAIL PROTECTED]> writes: > With CONFIG_PREEMPT_RCU read_lock(tasklist_lock) doesn't imply > rcu_read_lock(), > but find_pid_ns()->hlist_for_each_entry_rcu() should be safe under tasklist. > > Usually it is, detach_pid() is always called under write_lock(tasklist_lock), > but copy_process() calls free_pid() lockless. > > "#ifdef CONFIG_PREEMPT_RCU" is added mostly as documentation, perhaps it is > too ugly and should be removed. > > Signed-off-by: Oleg Nesterov <[EMAIL PROTECTED]> > > --- MM/kernel/fork.c~PR_RCU 2008-01-27 17:09:47.0 +0300 > +++ MM/kernel/fork.c 2008-01-29 19:23:44.0 +0300 > @@ -1335,8 +1335,19 @@ static struct task_struct *copy_process( > return p; > > bad_fork_free_pid: > - if (pid != &init_struct_pid) > + if (pid != &init_struct_pid) { > +#ifdef CONFIG_PREEMPT_RCU > + /* > + * read_lock(tasklist_lock) doesn't imply rcu_read_lock(), > + * make sure find_pid() is safe under read_lock(tasklist). > + */ > + write_lock_irq(&tasklist_lock); > +#endif > free_pid(pid); > +#ifdef CONFIG_PREEMPT_RCU > + write_unlock_irq(&tasklist_lock); > +#endif > + } > bad_fork_cleanup_namespaces: > exit_task_namespaces(p); > bad_fork_cleanup_keys: Ok. I believe I see what problem you are trying to fix. That a pid returned from find_pid might disappear if we are not rcu protected. This patch in the simplest form is wrong because it is confusing. We currently appear to have two options. 1) Force all pid hash table access and pid accesses that do not get a count to be covered under rcu_read_lock. 2) To modify the locking requirements for free_pid to require the tasklist_lock. However this second approach is horribly brittle, as it will break if we ever have intermediate entries in the hash table protected by pidmap_lock. Using the tasklist_lock to still guarantee we see the list, the entire list, and exactly the list for proper implementation of kill to process groups and sessions still seems sane. 
So let's just remove the guarantee of find_pid being usable with just the tasklist_lock held. Eric diff --git a/include/linux/pid.h b/include/linux/pid.h index e29a900..0ffb8cc 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -100,8 +100,7 @@ struct pid_namespace; extern struct pid_namespace init_pid_ns; /* - * look up a PID in the hash table. Must be called with the tasklist_lock - * or rcu_read_lock() held. + * look up a PID in the hash table. Must be called with the rcu_read_lock() held. * * find_pid_ns() finds the pid in the namespace specified * find_pid() find the pid by its global id, i.e. in the init namespace -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 05/19] split LRU lists into anon & file sets
Hi Rik, Lee,

I tested the new hackbench on rvr's split LRU patch series.
 http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

Test method:

 (1) $ ./hackbench 150 process 1000
 (2) # sync; echo 3 > /proc/sys/vm/drop_caches
     $ dd if=tmp10G of=/dev/null
     $ ./hackbench 150 process 1000

Test machine:
 CPU: Itanium2 x4 (logical 8cpu)
 MEM: 8GB

A. vanilla 2.6.24-rc8-mm1
 (1) 127.540
 (2) 727.548

B. 2.6.24-rc8-mm1 + split-lru-patch-series
 (1) 92.730
 (2) 758.369

Comments:
 (1) The active/inactive anon ratio improves performance significantly.
 (2) Incorrect page activation reduces performance.

I investigated and found that the cause is the [05/19] change.
I reverted a small portion of it and tested the split-lru-patch-series
again.

C. 2.6.24-rc8-mm1 + split-lru-patch-series + my-revert-patch
 (1) 83.014
 (2) 717.009

Of course, we need to reintroduce this portion after the new page LRU
(aka LRU for used only page) is in, but for now it is too early.

I hope this patch series is merged into -mm ASAP, so I would like to
remove any corner-case regressions.

Thanks!

- kosaki

Signed-off-by: KOSAKI Motohiro <[EMAIL PROTECTED]>

---
 mm/vmscan.c |   26 +-
 1 file changed, 25 insertions(+), 1 deletion(-)

Index: b/mm/vmscan.c
===
--- a/mm/vmscan.c	2008-01-29 15:59:17.0 +0900
+++ b/mm/vmscan.c	2008-01-30 11:53:42.0 +0900
@@ -247,6 +247,27 @@
 	return ret;
 }
 
+/* Called without lock on whether page is mapped, so answer is unstable */
+static inline int page_mapping_inuse(struct page *page)
+{
+	struct address_space *mapping;
+
+	/* Page is in somebody's page tables. */
+	if (page_mapped(page))
+		return 1;
+
+	/* Be more reluctant to reclaim swapcache than pagecache */
+	if (PageSwapCache(page))
+		return 1;
+
+	mapping = page_mapping(page);
+	if (!mapping)
+		return 0;
+
+	/* File is mmap'd by somebody? */
+	return mapping_mapped(mapping);
+}
+
 static inline int is_page_cache_freeable(struct page *page)
 {
 	return page_count(page) - !!PagePrivate(page) == 2;
@@ -515,7 +536,8 @@
 
 		referenced = page_referenced(page, 1, sc->mem_cgroup);
 		/* In active use or really unfreeable? Activate it. */
-		if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
+		if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
+					referenced && page_mapping_inuse(page))
 			goto activate_locked;
 
 #ifdef CONFIG_SWAP
@@ -550,6 +572,8 @@
 		}
 
 		if (PageDirty(page)) {
+			if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
+				goto keep_locked;
 			if (!may_enter_fs) {
 				sc->nr_io_pages++;
 				goto keep_locked;
[PATCH 2/2] x86_64: make bootmap_start page align v3
[PATCH 2/2] x86_64: make bootmap_start page align v3 boot oops when system get 64g or 128 installed Calling initcall 0x80bc33b6: sctp_init+0x0/0x711() BUG: unable to handle kernel NULL pointer dereference at 005f IP: [] proc_register+0xe7/0x10f PGD 0 Oops: [1] SMP CPU 0 Modules linked in: Pid: 1, comm: swapper Not tainted 2.6.24-smp-g5a514e21-dirty #6 RIP: 0010:[] [] proc_register+0xe7/0x10f RSP: :810824c57e60 EFLAGS: 00010246 RAX: d7d7 RBX: 811024c5fa80 RCX: 810824c57e08 RDX: RSI: 0195 RDI: 80cc2460 RBP: R08: R09: 811024c5fa80 R10: R11: 0002 R12: 810824c57e6c R13: R14: 810824c57ee0 R15: 0006abd25bee FS: () GS:80b4d000() knlGS: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b CR2: 005f CR3: 00201000 CR4: 06e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process swapper (pid: 1, threadinfo 810824c56000, task 812024c52000) Stack: 80a57348 0195 811024c5fa80 ff97 802bfef0 80bc3b4b 810824c57ee0 80bc34a5 Call Trace: [] ? create_proc_entry+0x73/0x8a [] ? sctp_snmp_proc_init+0x1c/0x34 [] ? sctp_init+0xef/0x711 [] ? kernel_init+0x175/0x2e1 [] ? child_rip+0xa/0x12 [] ? kernel_init+0x0/0x2e1 [] ? child_rip+0x0/0x12 Code: 1e 48 83 7b 38 00 75 08 48 c7 43 38 f0 e8 82 80 48 83 7b 30 00 75 08 48 c7 43 30 d0 e9 82 80 48 c7 c7 60 24 cc 80 e8 bd 5a 54 00 <48> 8b 45 60 48 89 6b 58 48 89 5d 60 48 89 43 50 fe 05 f5 25 a0 RIP [] proc_register+0xe7/0x10f RSP CR2: 005f ---[ end trace 02c2d78def82877a ]--- Kernel panic - not syncing: Attempted to kill init! it turns out some variables near end of bss is corrupted already. 
in System.map we have 80d40420 b rsi_table 80d40620 B krb5_seq_lock 80d40628 b i.20437 80d40630 b xprt_rdma_inline_write_padding 80d40638 b sunrpc_table_header 80d40640 b zero 80d40644 b min_memreg 80d40648 b rpcrdma_tk_lock_g 80d40650 B sctp_assocs_id_lock 80d40658 B proc_net_sctp 80d40660 B sctp_assocs_id 80d40680 B sysctl_sctp_mem 80d40690 B sysctl_sctp_rmem 80d406a0 B sysctl_sctp_wmem 80d406b0 b sctp_ctl_socket 80d406b8 b sctp_pf_inet6_specific 80d406c0 b sctp_pf_inet_specific 80d406c8 b sctp_af_v4_specific 80d406d0 b sctp_af_v6_specific 80d406d8 b sctp_rand.33270 80d406dc b sctp_memory_pressure 80d406e0 b sctp_sockets_allocated 80d406e4 b sctp_memory_allocated 80d406e8 b sctp_sysctl_header 80d406f0 b zero 80d406f4 A __bss_stop 80d406f4 A _end and setup_node_bootmem() will use that page 0xd4 for bootmap Bootmem setup node 0 -00082800 NODE_DATA [0008a485 - 00091484] bootmap [00d406f4 - 00e456f3] pages 105 Bootmem setup node 1 00082800-00102800 NODE_DATA [00082800 - 000828006fff] bootmap [000828007000 - 000828106fff] pages 100 Bootmem setup node 2 00102800-00182800 NODE_DATA [00102800 - 001028006fff] bootmap [001028007000 - 001028106fff] pages 100 Bootmem setup node 3 00182800-00202800 NODE_DATA [00182800 - 001828006fff] bootmap [001828007000 - 001828106fff] pages 100 the patch update bootmap_start to page_align to make sure we can extra range for alignment. 
Signed-off-by: Yinghai Lu <[EMAIL PROTECTED]>

Index: linux-2.6/arch/x86/mm/numa_64.c
===
--- linux-2.6.orig/arch/x86/mm/numa_64.c
+++ linux-2.6/arch/x86/mm/numa_64.c
@@ -224,6 +224,9 @@ void __init setup_node_bootmem(int nodei
 	}
 
 	bootmap_start = __pa(bootmap);
+	/* make sure bootmap is not overlapped with bss section */
+	bootmap_start = round_up(bootmap_start, PAGE_SIZE);
+
 	bootmap_size = init_bootmem_node(NODE_DATA(nodeid),
 					 bootmap_start >> PAGE_SHIFT,
 					 start_pfn, end_pfn);
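For illustration, a minimal userspace sketch of the rounding step in the patch above. The helper round_up_pow2() and the 4 KiB SKETCH_PAGE_SIZE are assumptions made for this sketch (the kernel's own round_up() and PAGE_SIZE are not reproduced here); the input address 0xd406f4 is the node 0 bootmap start from the log, which coincides with __bss_stop.

```c
#include <assert.h>

/* Assumed 4 KiB pages; a stand-in for the kernel's PAGE_SIZE. */
#define SKETCH_PAGE_SIZE  0x1000UL
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical helper: round x up to the next multiple of align. */
static unsigned long round_up_pow2(unsigned long x, unsigned long align)
{
	/* The mask trick below only works for power-of-two alignments. */
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* Page frame number that would be handed to the bootmem setup. */
static unsigned long sketch_bootmap_pfn(unsigned long bootmap_start, int align)
{
	if (align)
		bootmap_start = round_up_pow2(bootmap_start, SKETCH_PAGE_SIZE);
	return bootmap_start >> SKETCH_PAGE_SHIFT;
}
```

Without the rounding, 0xd406f4 >> 12 gives pfn 0xd40, the very page that still holds the tail of bss; with it, the bootmap starts at 0xd41000 (pfn 0xd41), one page past the last bss variable.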
Re: ndiswrapper and GPL-only symbols redux
Pavel Roskin <[EMAIL PROTECTED]> writes:
>
> static inline void add_taint_module(struct module *mod, unsigned flag)
> {
>         add_taint(flag);
>         mod->taints |= flag;
> }
>
> The module taint is set before the symbols are resolved. Therefore, the
> GPL-only symbols won't be resolved.

I think using a separate taint flag that does not disable GPL symbols
for the ndiswrapper case would be a fair solution. After all the main
motivation for tainting ndiswrapper is to make it visible in oopses, but
not prevent it from loading in the first place.

How about you submit an incremental patch to do that?

-Andi
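A rough userspace model of the "separate taint flag" idea, purely as a sketch. The flag names, the sketch_module structure and the sketch_gpl_symbols_ok() check are all hypothetical; in particular, the assumption that GPL-only symbol resolution is gated on a single per-module "proprietary" taint bit is a simplification for illustration, not a copy of the module loader.

```c
/* Hypothetical taint bits for this sketch only. */
#define SKETCH_TAINT_PROPRIETARY (1u << 0) /* assumed to block GPL-only symbols */
#define SKETCH_TAINT_WRAPPER     (1u << 1) /* assumed to be reporting-only */

struct sketch_module {
	unsigned int taints;
};

/* Stand-in for the global taint mask that shows up in an oops. */
static unsigned int sketch_global_taint;

/* Mirrors the shape of the quoted helper: taint globally and per module. */
static void sketch_add_taint_module(struct sketch_module *mod, unsigned int flag)
{
	sketch_global_taint |= flag;
	mod->taints |= flag;
}

/* Assumed gate: only the proprietary bit disables GPL-only symbols. */
static int sketch_gpl_symbols_ok(const struct sketch_module *mod)
{
	return !(mod->taints & SKETCH_TAINT_PROPRIETARY);
}
```

In this model a module tainted with the wrapper bit still shows up in the global taint mask, so it stays visible in reports, but it remains eligible for GPL-only symbols because the gate looks at the proprietary bit alone. Whether that trade-off is acceptable is exactly what the thread is debating.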
[PATCH 18/22 -v7] Trace irq disabled critical timings
This patch adds latency tracing for critical timings (how long interrupts are disabled for). "irqsoff" is added to /debugfs/tracing/available_tracers Note: tracing_max_latency also holds the max latency for irqsoff (in usecs). (default to large number so one must start latency tracing) tracing_thresh threshold (in usecs) to always print out if irqs off is detected to be longer than stated here. If irq_thresh is non-zero, then max_irq_latency is ignored. Here's an example of a trace with mcount_enabled = 0 === preemption latency trace v1.1.5 on 2.6.24-rc7 latency: 100 us, #3/3, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2) - | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0) - => started at: _spin_lock_irqsave+0x2a/0xb7 => ended at: _spin_unlock_irqrestore+0x32/0x5f _--=> CPU# / _-=> irqs-off | / _=> need-resched || / _---=> hardirq/softirq ||| / _--=> preempt-depth / | delay cmd pid | time | caller \ /| \ | / swapper-0 1d.s30us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000]) swapper-0 1d.s3 100us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000]) swapper-0 1d.s3 100us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f) vim:ft=help === And this is a trace with mcount_enabled == 1 === preemption latency trace v1.1.5 on 2.6.24-rc7 latency: 102 us, #12/12, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2) - | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0) - => started at: _spin_lock_irqsave+0x2a/0xb7 => ended at: _spin_unlock_irqrestore+0x32/0x5f _--=> CPU# / _-=> irqs-off | / _=> need-resched || / _---=> hardirq/softirq ||| / _--=> preempt-depth / | delay cmd pid | time | caller \ /| \ | / swapper-0 1dNs30us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000]) swapper-0 1dNs3 46us : e1000_read_phy_reg+0x16/0x225 [e1000] (e1000_update_stats+0x5e2/0x64c [e1000]) swapper-0 1dNs3 46us : e1000_swfw_sync_acquire+0x10/0x99 [e1000] (e1000_read_phy_reg+0x49/0x225 [e1000]) swapper-0 1dNs3 46us : 
e1000_get_hw_eeprom_semaphore+0x12/0xa6 [e1000] (e1000_swfw_sync_acquire+0x36/0x99 [e1000]) swapper-0 1dNs3 47us : __const_udelay+0x9/0x47 (e1000_read_phy_reg+0x116/0x225 [e1000]) swapper-0 1dNs3 47us+: __delay+0x9/0x50 (__const_udelay+0x45/0x47) swapper-0 1dNs3 97us : preempt_schedule+0xc/0x84 (__delay+0x4e/0x50) swapper-0 1dNs3 98us : e1000_swfw_sync_release+0xc/0x55 [e1000] (e1000_read_phy_reg+0x211/0x225 [e1000]) swapper-0 1dNs3 99us+: e1000_put_hw_eeprom_semaphore+0x9/0x35 [e1000] (e1000_swfw_sync_release+0x50/0x55 [e1000]) swapper-0 1dNs3 101us : _spin_unlock_irqrestore+0xe/0x5f (e1000_update_stats+0x641/0x64c [e1000]) swapper-0 1dNs3 102us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000]) swapper-0 1dNs3 102us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f) vim:ft=help === Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- arch/x86/kernel/process_64.c |3 arch/x86/lib/thunk_64.S | 18 + include/asm-x86/irqflags_32.h |4 include/asm-x86/irqflags_64.h |4 include/linux/irqflags.h | 37 ++- include/linux/mcount.h| 31 ++- kernel/fork.c |2 kernel/lockdep.c | 16 + lib/tracing/Kconfig | 18 + lib/tracing/Makefile |1 lib/tracing/trace_irqsoff.c | 415 ++ lib/tracing/tracer.c | 59 - lib/tracing/tracer.h |2 13 files changed, 575 insertions(+), 35 deletions(-) Index: linux-mcount.git/arch/x86/kernel/process_64.c === --- linux-mcount.git.orig/arch/x86/kernel/process_64.c 2008-01-29 18:06:20.0 -0500 +++ linux-mcount.git/arch/x86/kernel/process_64.c 2008-01-29 18:10:56.0 -0500 @@ -233,7 +233,10 @@ void cpu_idle (void) */ local_irq_disable(); enter_idle(); + /* Don't trace irqs off for idle */ + stop_critical_timings();
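The interaction between the threshold and the maximum described above can be sketched in a few lines of user-space C. This is only an illustration of the stated behaviour, not code from the patch: the struct and function names are hypothetical, and it assumes the "irq_thresh" and "max_irq_latency" mentioned in the description refer to the same knobs as tracing_thresh and tracing_max_latency.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two debugfs knobs, both in microseconds. */
struct latency_knobs {
	uint64_t thresh_us;      /* 0 means no threshold is set */
	uint64_t max_latency_us; /* largest latency recorded so far */
};

/*
 * Decide whether an irqs-off section of delta_us microseconds should be
 * recorded. With a non-zero threshold, every section longer than the
 * threshold is reported and the running maximum is left alone. Without a
 * threshold, only a new maximum is reported.
 */
static bool report_latency(struct latency_knobs *k, uint64_t delta_us)
{
	if (k->thresh_us != 0)
		return delta_us > k->thresh_us;

	if (delta_us <= k->max_latency_us)
		return false;

	k->max_latency_us = delta_us;
	return true;
}
```

Under this reading, a maximum that defaults to a very large value records nothing until the user lowers it, which would match the note that one must start latency tracing explicitly.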
[PATCH 06/22 -v7] handle accurate time keeping over long delays
Handle accurate time even if there's a long delay between accumulated clock cycles. Signed-off-by: John Stultz <[EMAIL PROTECTED]> Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- arch/powerpc/kernel/time.c|3 +- arch/x86/kernel/vsyscall_64.c |5 ++- include/asm-x86/vgtod.h |2 - include/linux/clocksource.h | 58 -- kernel/time/timekeeping.c | 36 +- 5 files changed, 82 insertions(+), 22 deletions(-) Index: linux-mcount.git/arch/x86/kernel/vsyscall_64.c === --- linux-mcount.git.orig/arch/x86/kernel/vsyscall_64.c 2008-01-25 21:47:06.0 -0500 +++ linux-mcount.git/arch/x86/kernel/vsyscall_64.c 2008-01-25 21:47:09.0 -0500 @@ -86,6 +86,7 @@ void update_vsyscall(struct timespec *wa vsyscall_gtod_data.clock.mask = clock->mask; vsyscall_gtod_data.clock.mult = clock->mult; vsyscall_gtod_data.clock.shift = clock->shift; + vsyscall_gtod_data.clock.cycle_accumulated = clock->cycle_accumulated; vsyscall_gtod_data.wall_time_sec = wall_time->tv_sec; vsyscall_gtod_data.wall_time_nsec = wall_time->tv_nsec; vsyscall_gtod_data.wall_to_monotonic = wall_to_monotonic; @@ -121,7 +122,7 @@ static __always_inline long time_syscall static __always_inline void do_vgettimeofday(struct timeval * tv) { - cycle_t now, base, mask, cycle_delta; + cycle_t now, base, accumulated, mask, cycle_delta; unsigned seq; unsigned long mult, shift, nsec; cycle_t (*vread)(void); @@ -135,6 +136,7 @@ static __always_inline void do_vgettimeo } now = vread(); base = __vsyscall_gtod_data.clock.cycle_last; + accumulated = __vsyscall_gtod_data.clock.cycle_accumulated; mask = __vsyscall_gtod_data.clock.mask; mult = __vsyscall_gtod_data.clock.mult; shift = __vsyscall_gtod_data.clock.shift; @@ -145,6 +147,7 @@ static __always_inline void do_vgettimeo /* calculate interval: */ cycle_delta = (now - base) & mask; + cycle_delta += accumulated; /* convert to nsecs: */ nsec += (cycle_delta * mult) >> shift; Index: linux-mcount.git/include/asm-x86/vgtod.h === --- linux-mcount.git.orig/include/asm-x86/vgtod.h 2008-01-25 
21:46:50.0 -0500 +++ linux-mcount.git/include/asm-x86/vgtod.h2008-01-25 21:47:09.0 -0500 @@ -15,7 +15,7 @@ struct vsyscall_gtod_data { struct timezone sys_tz; struct { /* extract of a clocksource struct */ cycle_t (*vread)(void); - cycle_t cycle_last; + cycle_t cycle_last, cycle_accumulated; cycle_t mask; u32 mult; u32 shift; Index: linux-mcount.git/include/linux/clocksource.h === --- linux-mcount.git.orig/include/linux/clocksource.h 2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/include/linux/clocksource.h2008-01-25 21:47:09.0 -0500 @@ -50,8 +50,12 @@ struct clocksource; * @flags: flags describing special properties * @vread: vsyscall based read * @resume:resume function for the clocksource, if necessary + * @cycle_last:Used internally by timekeeping core, please ignore. + * @cycle_accumulated: Used internally by timekeeping core, please ignore. * @cycle_interval:Used internally by timekeeping core, please ignore. * @xtime_interval:Used internally by timekeeping core, please ignore. + * @xtime_nsec:Used internally by timekeeping core, please ignore. + * @error: Used internally by timekeeping core, please ignore. */ struct clocksource { /* @@ -82,7 +86,10 @@ struct clocksource { * Keep it in a different cache line to dirty no * more than one cache line. */ - cycle_t cycle_last cacheline_aligned_in_smp; + struct { + cycle_t cycle_last, cycle_accumulated; + } cacheline_aligned_in_smp; + u64 xtime_nsec; s64 error; @@ -168,11 +175,44 @@ static inline cycle_t clocksource_read(s } /** + * clocksource_get_cycles: - Access the clocksource's accumulated cycle value + * @cs:pointer to clocksource being read + * @now: current cycle value + * + * Uses the clocksource to return the current cycle_t value. + * NOTE!!!: This is different from clocksource_read, because it + * returns the accumulated cycle value! Must hold xtime lock! 
+ */ +static inline cycle_t +clocksource_get_cycles(struct clocksource *cs, cycle_t now) +{ + cycle_t offset = (now - cs->cycle_last) & cs->mask; + offset += cs->cycle_accumulated; +
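The arithmetic this patch introduces (masked delta since cycle_last, plus cycle_accumulated, then the mult/shift conversion to nanoseconds) can be tried out in user space. The sketch below is an assumption-laden illustration: the struct and function names are made up, the quoted clocksource_get_cycles() is cut off above so its real tail is unknown, and no locking is modelled.

```c
#include <stdint.h>

typedef uint64_t cycle_t;

/* Hypothetical, cut-down stand-in for the fields the patch touches. */
struct clock_sketch {
	cycle_t  cycle_last;        /* counter value at the last update */
	cycle_t  cycle_accumulated; /* cycles not yet folded into wall time */
	cycle_t  mask;              /* width of the hardware counter */
	uint32_t mult;
	uint32_t shift;
};

/* Cycles elapsed since the last update, plus what was already accumulated. */
static cycle_t sketch_get_cycles(const struct clock_sketch *cs, cycle_t now)
{
	/* The mask makes the subtraction correct across a counter wrap. */
	cycle_t offset = (now - cs->cycle_last) & cs->mask;

	return offset + cs->cycle_accumulated;
}

/* Same fixed-point conversion as in do_vgettimeofday(): (cycles * mult) >> shift. */
static uint64_t sketch_cycles_to_ns(const struct clock_sketch *cs, cycle_t cycles)
{
	return (cycles * cs->mult) >> cs->shift;
}
```

For example, with a 24-bit counter (mask 0xFFFFFF), cycle_last = 0xFFFFF0 and now = 0x10, the masked delta is 0x20 cycles even though the raw counter wrapped.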
[PATCH 13/22 -v7] Add tracing of context switches
This patch adds context switch tracing, of the format of: _--=> CPU# / _-=> irqs-off | / _=> need-resched || / _---=> hardirq/softirq ||| / _--=> preempt-depth / | delay cmd pid | time | caller \ /| \ | / swapper-0 1d..3 137us+: 0:140:R --> 2912:120 sshd-2912 1d..3 216us+: 2912:120:S --> 0:140 swapper-0 1d..3 261us+: 0:140:R --> 2912:120 bash-2920 0d..3 267us+: 2920:120:S --> 0:140 sshd-2912 1d..3 330us!: 2912:120:S --> 0:140 swapper-0 1d..3 2389us+: 0:140:R --> 2847:120 yum-upda-2847 1d..3 2411us!: 2847:120:S --> 0:140 swapper-0 0d..3 11089us+: 0:140:R --> 3139:120 gdm-bina-3139 0d..3 3us!: 3139:120:S --> 0:140 swapper-0 1d..3 102328us+: 0:140:R --> 2847:120 yum-upda-2847 1d..3 102348us!: 2847:120:S --> 0:140 "sched_switch" is added to /debugfs/tracing/available_tracers Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> Cc: Mathieu Desnoyers <[EMAIL PROTECTED]> --- lib/tracing/Kconfig |9 ++ lib/tracing/Makefile |1 lib/tracing/trace_sched_switch.c | 165 +++ lib/tracing/tracer.c | 43 ++ lib/tracing/tracer.h | 23 + 5 files changed, 240 insertions(+), 1 deletion(-) Index: linux-mcount.git/lib/tracing/Kconfig === --- linux-mcount.git.orig/lib/tracing/Kconfig 2008-01-29 18:06:25.0 -0500 +++ linux-mcount.git/lib/tracing/Kconfig2008-01-29 18:08:06.0 -0500 @@ -23,3 +23,12 @@ config FUNCTION_TRACER insert a call to an architecture specific __mcount routine, that the debugging mechanism using this facility will hook by providing a set of inline routines. + +config CONTEXT_SWITCH_TRACER + bool "Trace process context switches" + depends on DEBUG_KERNEL + select TRACING + help + This tracer hooks into the context switch and records + all switching of tasks. 
+ Index: linux-mcount.git/lib/tracing/Makefile === --- linux-mcount.git.orig/lib/tracing/Makefile 2008-01-29 18:06:25.0 -0500 +++ linux-mcount.git/lib/tracing/Makefile 2008-01-29 18:08:06.0 -0500 @@ -1,6 +1,7 @@ obj-$(CONFIG_MCOUNT) += libmcount.o obj-$(CONFIG_TRACING) += tracer.o +obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o obj-$(CONFIG_FUNCTION_TRACER) += trace_function.o libmcount-y := mcount.o Index: linux-mcount.git/lib/tracing/trace_sched_switch.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-mcount.git/lib/tracing/trace_sched_switch.c 2008-01-29 18:08:06.0 -0500 @@ -0,0 +1,165 @@ +/* + * trace context switch + * + * Copyright (C) 2007 Steven Rostedt <[EMAIL PROTECTED]> + * + */ +#include +#include +#include +#include +#include +#include +#include + +#include "tracer.h" + +static struct tracing_trace *tracer_trace; +static int trace_enabled __read_mostly; +static atomic_t sched_ref; +int tracing_sched_switch_enabled __read_mostly; + +static notrace void sched_switch_callback(const struct marker *mdata, + void *private_data, + const char *format, ...) 
+{ + struct tracing_trace **p = mdata->private; + struct tracing_trace *tr = *p; + struct tracing_trace_cpu *data; + struct task_struct *prev; + struct task_struct *next; + unsigned long flags; + va_list ap; + int cpu; + + if (!trace_enabled) + return; + + va_start(ap, format); + prev = va_arg(ap, typeof(prev)); + next = va_arg(ap, typeof(next)); + va_end(ap); + + raw_local_irq_save(flags); + cpu = raw_smp_processor_id(); + data = tr->data[cpu]; + atomic_inc(&data->disabled); + + if (likely(atomic_read(&data->disabled) == 1)) + tracing_sched_switch_trace(tr, data, prev, next, flags); + + atomic_dec(&data->disabled); + raw_local_irq_restore(flags); +} + +static notrace void sched_switch_reset(struct tracing_trace *tr) +{ + int cpu; + + tr->time_start = now(); + + for_each_online_cpu(cpu) + tracing_reset(tr->data[cpu]); +} + +static notrace void start_sched_trace(struct tracing_trace *tr) +{ + sched_switch_reset(tr); + trace_enabled = 1; + tracing_start_sched_switch(); +} + +static notrace void stop_sched_trace(struct tracing_trace *tr) +{ + tracing_stop_sched_switch(); + trace_enabled = 0; +} + +static notrace void sched_switch_trace_i
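The callback above uses the per-CPU "disabled" counter as a recursion guard: it increments the counter, records only if the result is exactly 1, and decrements on the way out. A minimal user-space sketch of that pattern follows, using C11 atomics in place of the kernel's atomic_t. All names are hypothetical, and interrupt disabling and real per-CPU data are not modelled.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the per-CPU trace data. */
struct cpu_trace_sketch {
	atomic_int disabled; /* nesting depth of tracer entry on this CPU */
	int        entries;  /* number of events actually recorded */
};

static void record_event(struct cpu_trace_sketch *data, int depth);

/* Record an event unless the tracer is already active on this CPU. */
static void guarded_trace(struct cpu_trace_sketch *data, int depth)
{
	atomic_fetch_add(&data->disabled, 1);

	/* Only the outermost entry sees a count of exactly 1. */
	if (atomic_load(&data->disabled) == 1)
		record_event(data, depth);

	atomic_fetch_sub(&data->disabled, 1);
}

/* Records one entry, then simulates the tracer re-entering itself. */
static void record_event(struct cpu_trace_sketch *data, int depth)
{
	data->entries++;
	if (depth > 0)
		guarded_trace(data, depth - 1); /* dropped by the guard */
}
```

The nested call made from record_event() finds the counter at 2 and records nothing, so each outermost call yields exactly one entry.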
[PATCH 09/22 -v7] add notrace annotations to timing events
This patch adds notrace annotations to timer functions that will be used by tracing. This helps speed things up and also keeps the ugliness of printing these functions down. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- arch/x86/kernel/apic_32.c |2 +- arch/x86/kernel/hpet.c|2 +- arch/x86/kernel/time_32.c |2 +- arch/x86/kernel/tsc_32.c |2 +- arch/x86/kernel/tsc_64.c |4 ++-- arch/x86/lib/delay_32.c |6 +++--- drivers/clocksource/acpi_pm.c |8 7 files changed, 13 insertions(+), 13 deletions(-) Index: linux-mcount.git/arch/x86/kernel/apic_32.c === --- linux-mcount.git.orig/arch/x86/kernel/apic_32.c 2008-01-29 11:35:35.0 -0500 +++ linux-mcount.git/arch/x86/kernel/apic_32.c 2008-01-29 11:49:47.0 -0500 @@ -577,7 +577,7 @@ static void local_apic_timer_interrupt(v * interrupt as well. Thus we cannot inline the local irq ... ] */ -void fastcall smp_apic_timer_interrupt(struct pt_regs *regs) +notrace fastcall void smp_apic_timer_interrupt(struct pt_regs *regs) { struct pt_regs *old_regs = set_irq_regs(regs); Index: linux-mcount.git/arch/x86/kernel/hpet.c === --- linux-mcount.git.orig/arch/x86/kernel/hpet.c2008-01-29 11:35:35.0 -0500 +++ linux-mcount.git/arch/x86/kernel/hpet.c 2008-01-29 11:49:47.0 -0500 @@ -295,7 +295,7 @@ static int hpet_legacy_next_event(unsign /* * Clock source related code */ -static cycle_t read_hpet(void) +static notrace cycle_t read_hpet(void) { return (cycle_t)hpet_readl(HPET_COUNTER); } Index: linux-mcount.git/arch/x86/kernel/time_32.c === --- linux-mcount.git.orig/arch/x86/kernel/time_32.c 2008-01-29 11:35:35.0 -0500 +++ linux-mcount.git/arch/x86/kernel/time_32.c 2008-01-29 11:49:47.0 -0500 @@ -122,7 +122,7 @@ static int set_rtc_mmss(unsigned long no int timer_ack; -unsigned long profile_pc(struct pt_regs *regs) +notrace unsigned long profile_pc(struct pt_regs *regs) { unsigned long pc = instruction_pointer(regs); Index: linux-mcount.git/arch/x86/kernel/tsc_32.c === --- linux-mcount.git.orig/arch/x86/kernel/tsc_32.c 2008-01-29 11:35:35.0 -0500 
+++ linux-mcount.git/arch/x86/kernel/tsc_32.c 2008-01-29 11:49:47.0 -0500 @@ -269,7 +269,7 @@ core_initcall(cpufreq_tsc); static unsigned long current_tsc_khz = 0; -static cycle_t read_tsc(void) +static notrace cycle_t read_tsc(void) { cycle_t ret; Index: linux-mcount.git/arch/x86/kernel/tsc_64.c === --- linux-mcount.git.orig/arch/x86/kernel/tsc_64.c 2008-01-29 11:35:35.0 -0500 +++ linux-mcount.git/arch/x86/kernel/tsc_64.c 2008-01-29 11:49:47.0 -0500 @@ -248,13 +248,13 @@ __setup("notsc", notsc_setup); /* clock source code: */ -static cycle_t read_tsc(void) +static notrace cycle_t read_tsc(void) { cycle_t ret = (cycle_t)get_cycles_sync(); return ret; } -static cycle_t __vsyscall_fn vread_tsc(void) +static notrace cycle_t __vsyscall_fn vread_tsc(void) { cycle_t ret = (cycle_t)get_cycles_sync(); return ret; Index: linux-mcount.git/arch/x86/lib/delay_32.c === --- linux-mcount.git.orig/arch/x86/lib/delay_32.c 2008-01-29 11:35:35.0 -0500 +++ linux-mcount.git/arch/x86/lib/delay_32.c2008-01-29 11:49:47.0 -0500 @@ -24,7 +24,7 @@ #endif /* simple loop based delay: */ -static void delay_loop(unsigned long loops) +static notrace void delay_loop(unsigned long loops) { int d0; @@ -39,7 +39,7 @@ static void delay_loop(unsigned long loo } /* TSC based delay: */ -static void delay_tsc(unsigned long loops) +static notrace void delay_tsc(unsigned long loops) { unsigned long bclock, now; @@ -72,7 +72,7 @@ int read_current_timer(unsigned long *ti return -1; } -void __delay(unsigned long loops) +notrace void __delay(unsigned long loops) { delay_fn(loops); } Index: linux-mcount.git/drivers/clocksource/acpi_pm.c === --- linux-mcount.git.orig/drivers/clocksource/acpi_pm.c 2008-01-29 11:35:35.0 -0500 +++ linux-mcount.git/drivers/clocksource/acpi_pm.c 2008-01-29 11:49:47.0 -0500 @@ -30,13 +30,13 @@ */ u32 pmtmr_ioport __read_mostly; -static inline u32 read_pmtmr(void) +static inline notrace u32 read_pmtmr(void) { /* mask the output to 24 bits */ return inl(pmtmr_ioport) & ACPI_PM_MASK; } 
-u32 acpi_pm_read_verified(void) +notrace u32 acpi_pm_read_verified(void) { u32 v1 = 0, v2 = 0, v3 = 0;
[PATCH 16/22 -v7] Add marker in try_to_wake_up
Add markers into the wakeup code, to allow the tracer to record wakeup
timings.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 kernel/sched.c |    8 ++++++++
 1 file changed, 8 insertions(+)

Index: linux-mcount.git/kernel/sched.c
===================================================================
--- linux-mcount.git.orig/kernel/sched.c	2008-01-25 21:47:21.0 -0500
+++ linux-mcount.git/kernel/sched.c	2008-01-25 21:47:30.0 -0500
@@ -1885,6 +1885,10 @@ static int try_to_wake_up(struct task_st
 
 out_activate:
 #endif /* CONFIG_SMP */
+	trace_mark(kernel_sched_wakeup,
+		   "p %p rq->curr %p",
+		   p, rq->curr);
+
 	schedstat_inc(p, se.nr_wakeups);
 	if (sync)
 		schedstat_inc(p, se.nr_wakeups_sync);
@@ -2026,6 +2030,10 @@ void fastcall wake_up_new_task(struct ta
 		p->sched_class->task_new(rq, p);
 		inc_nr_running(p, rq);
 	}
+	trace_mark(kernel_sched_wakeup_new,
+		   "p %p rq->curr %p",
+		   p, rq->curr);
+
 	check_preempt_curr(rq, p);
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_wake_up)
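A marker such as the one added here hands its arguments to a probe as a format string followed by varargs, and the probe pulls them back out with va_arg in the same order, as sched_switch_callback() does in patch 13. The user-space sketch below shows only that calling convention. The struct and function names are hypothetical, and the real marker API passes additional parameters that are left out here.

```c
#include <stdarg.h>

/* Hypothetical stand-in for struct task_struct. */
struct task_sketch {
	int pid;
};

/* What this sketch's probe extracts from one wakeup event. */
struct wakeup_record {
	int woken_pid;
	int curr_pid;
};

/*
 * Probe for a marker whose format is "p %p rq->curr %p": two pointers,
 * read back in the order they were passed.
 */
static void wakeup_probe(struct wakeup_record *out, const char *format, ...)
{
	struct task_sketch *p, *curr;
	va_list ap;

	va_start(ap, format);
	p = va_arg(ap, struct task_sketch *);
	curr = va_arg(ap, struct task_sketch *);
	va_end(ap);

	out->woken_pid = p->pid;
	out->curr_pid = curr->pid;
}
```

The format string is not parsed in this sketch. The probe simply has to agree with the marker site on the number, order and types of the arguments, which is why a change to the format at the call site needs a matching change in every probe.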
[PATCH 03/22 -v7] Annotate core code that should not be traced
Mark with "notrace" functions in core code that should not be traced.
The "notrace" attribute will prevent gcc from adding a call to mcount on
the annotated functions.

Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 lib/smp_processor_id.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-mcount.git/lib/smp_processor_id.c
===================================================================
--- linux-mcount.git.orig/lib/smp_processor_id.c	2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/lib/smp_processor_id.c	2008-01-25 21:47:03.0 -0500
@@ -7,7 +7,7 @@
 #include
 #include
 
-unsigned int debug_smp_processor_id(void)
+notrace unsigned int debug_smp_processor_id(void)
 {
 	unsigned long preempt_count = preempt_count();
 	int this_cpu = raw_smp_processor_id();
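For readers who want to try the annotation outside the kernel, here is a small user-space sketch. It assumes notrace expands to GCC's no_instrument_function attribute, which GCC documents as suppressing the profiling calls generated by -pg and -finstrument-functions. The hook names are GCC's documented -finstrument-functions hooks; the other function names are made up for the example.

```c
/* Assumption: this is what the kernel's notrace macro expands to. */
#define notrace __attribute__((no_instrument_function))

static unsigned long hook_calls;

/*
 * Entry and exit hooks that GCC calls when a file is built with
 * -finstrument-functions. They must not be instrumented themselves,
 * otherwise each hook would call itself recursively.
 */
notrace void __cyg_profile_func_enter(void *fn, void *call_site)
{
	(void)fn;
	(void)call_site;
	hook_calls++;
}

notrace void __cyg_profile_func_exit(void *fn, void *call_site)
{
	(void)fn;
	(void)call_site;
}

/* A helper the tracer itself might rely on: excluded from tracing. */
notrace unsigned int sketch_processor_id(void)
{
	return 0;
}

/* An ordinary function: instrumented when the flag is given. */
unsigned int traced_caller(void)
{
	return sketch_processor_id() + 1;
}
```

Built with "gcc -finstrument-functions", only traced_caller() should reach the entry hook. Built without the flag, the hooks are never called and the attribute has no effect.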
[PATCH 17/22 -v7] mcount tracer for wakeup latency timings.
This patch adds hooks to trace the wake up latency of the highest priority waking task. "wakeup" is added to /debugfs/tracing/available_tracers Also added to /debugfs/tracing tracing_max_latency holds the current max latency for the wakeup wakeup_thresh if set to other than zero, a log will be recorded for every wakeup that takes longer than the number entered in here (usecs for all counters) (deletes previous trace) Examples: (with mcount_enabled = 0) preemption latency trace v1.1.5 on 2.6.24-rc8 latency: 26 us, #2/2, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2) - | task: migration/0-3 (uid:0 nice:-5 policy:1 rt_prio:99) - _--=> CPU# / _-=> irqs-off | / _=> need-resched || / _---=> hardirq/softirq ||| / _--=> preempt-depth / | delay cmd pid | time | caller \ /| \ | / quilt-8551 0d..30us+: wake_up_process+0x15/0x17 (sched_exec+0xc9/0x100 ) quilt-8551 0d..4 26us : sched_switch_callback+0x73/0x81 (schedule+0x483/0x6d5 ) vim:ft=help (with mcount_enabled = 1) preemption latency trace v1.1.5 on 2.6.24-rc8 latency: 36 us, #45/45, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2) - | task: migration/1-5 (uid:0 nice:-5 policy:1 rt_prio:99) - _--=> CPU# / _-=> irqs-off | / _=> need-resched || / _---=> hardirq/softirq ||| / _--=> preempt-depth / | delay cmd pid | time | caller \ /| \ | / bash-10653 1d..30us : wake_up_process+0x15/0x17 (sched_exec+0xc9/0x100 ) bash-10653 1d..31us : try_to_wake_up+0x271/0x2e7 (sub_preempt_count+0xc/0x7a ) bash-10653 1d..22us : try_to_wake_up+0x296/0x2e7 (update_rq_clock+0x9/0x20 ) bash-10653 1d..22us : update_rq_clock+0x1e/0x20 (__update_rq_clock+0xc/0x90 ) bash-10653 1d..23us : __update_rq_clock+0x1b/0x90 (sched_clock+0x9/0x29 ) bash-10653 1d..24us : try_to_wake_up+0x2a6/0x2e7 (activate_task+0xc/0x3f ) bash-10653 1d..24us : activate_task+0x2d/0x3f (enqueue_task+0xe/0x66 ) bash-10653 1d..25us : enqueue_task+0x5b/0x66 (enqueue_task_rt+0x9/0x3c ) bash-10653 1d..26us : try_to_wake_up+0x2ba/0x2e7 (check_preempt_wakeup+0x12/0x99 ) [...] 
bash-10653 1d..5 33us : tracing_record_cmdline+0xcf/0xd4 (_spin_unlock+0x9/0x33 ) bash-10653 1d..5 34us : _spin_unlock+0x19/0x33 (sub_preempt_count+0xc/0x7a ) bash-10653 1d..4 35us : wakeup_sched_switch+0x65/0x2ff (_spin_lock_irqsave+0xc/0xa9 ) bash-10653 1d..4 35us : _spin_lock_irqsave+0x19/0xa9 (add_preempt_count+0xe/0x77 ) bash-10653 1d..4 36us : sched_switch_callback+0x73/0x81 (schedule+0x483/0x6d5 ) vim:ft=help The [...] was added here to not waste your email box space. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- lib/tracing/Kconfig| 14 + lib/tracing/Makefile |1 lib/tracing/trace_wakeup.c | 359 + lib/tracing/tracer.c | 131 lib/tracing/tracer.h |5 5 files changed, 508 insertions(+), 2 deletions(-) Index: linux-mcount.git/lib/tracing/Kconfig === --- linux-mcount.git.orig/lib/tracing/Kconfig 2008-01-29 18:09:01.0 -0500 +++ linux-mcount.git/lib/tracing/Kconfig2008-01-29 18:10:17.0 -0500 @@ -9,6 +9,9 @@ config MCOUNT bool select FRAME_POINTER +config TRACER_MAX_TRACE + bool + config TRACING bool select DEBUG_FS @@ -25,6 +28,17 @@ config FUNCTION_TRACER that the debugging mechanism using this facility will hook by providing a set of inline routines. +config WAKEUP_TRACER + bool "Trace wakeup latencies" + depends on DEBUG_KERNEL + select TRACING + select CONTEXT_SWITCH_TRACER + select TRACER_MAX_TRACE + help + This tracer adds hooks into scheduling to time the latency + of the highest priority task tasks to be scheduled in + after it has worken up. + config CONTEXT_SWITCH_TRACER bool "Trace process context switches" depends on DEBUG_KERNEL Index: linux-mcount.git/lib/tracing/Makefile === --- linux-m
[PATCH 14/22 -v7] Generic command line storage
Saving the comm of tasks for each trace is very expensive. This patch includes in the context switch hook, a way to store the last 100 command lines of tasks. This table is examined when a trace is to be printed. Note: The comm may be destroyed if other traces are performed. Later (TBD) patches may simply store this information in the trace itself. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- lib/tracing/Kconfig |1 lib/tracing/trace_function.c |2 lib/tracing/trace_sched_switch.c |5 + lib/tracing/tracer.c | 108 --- lib/tracing/tracer.h |3 - 5 files changed, 111 insertions(+), 8 deletions(-) Index: linux-mcount.git/lib/tracing/Kconfig === --- linux-mcount.git.orig/lib/tracing/Kconfig 2008-01-29 18:08:06.0 -0500 +++ linux-mcount.git/lib/tracing/Kconfig2008-01-29 18:09:01.0 -0500 @@ -18,6 +18,7 @@ config FUNCTION_TRACER depends on DEBUG_KERNEL && HAVE_MCOUNT select MCOUNT select TRACING + select CONTEXT_SWITCH_TRACER help Use profiler instrumentation, adding -pg to CFLAGS. This will insert a call to an architecture specific __mcount routine, Index: linux-mcount.git/lib/tracing/trace_function.c === --- linux-mcount.git.orig/lib/tracing/trace_function.c 2008-01-29 18:06:24.0 -0500 +++ linux-mcount.git/lib/tracing/trace_function.c 2008-01-29 18:08:10.0 -0500 @@ -29,10 +29,12 @@ static notrace void start_function_trace { function_reset(tr); tracing_start_function_trace(); + tracing_start_sched_switch(); } static notrace void stop_function_trace(struct tracing_trace *tr) { + tracing_stop_sched_switch(); tracing_stop_function_trace(); } Index: linux-mcount.git/lib/tracing/trace_sched_switch.c === --- linux-mcount.git.orig/lib/tracing/trace_sched_switch.c 2008-01-29 18:08:06.0 -0500 +++ linux-mcount.git/lib/tracing/trace_sched_switch.c 2008-01-29 18:09:03.0 -0500 @@ -32,6 +32,11 @@ static notrace void sched_switch_callbac va_list ap; int cpu; + if (!atomic_read(&sched_ref)) + return; + + tracing_record_cmdline(current); + if (!trace_enabled) return; Index: 
linux-mcount.git/lib/tracing/tracer.c === --- linux-mcount.git.orig/lib/tracing/tracer.c 2008-01-29 18:08:06.0 -0500 +++ linux-mcount.git/lib/tracing/tracer.c 2008-01-29 18:10:04.0 -0500 @@ -171,6 +171,87 @@ void tracing_stop_function_trace(void) unregister_mcount_function(&trace_ops); } +#define SAVED_CMDLINES 128 +static unsigned map_pid_to_cmdline[PID_MAX_DEFAULT+1]; +static unsigned map_cmdline_to_pid[SAVED_CMDLINES]; +static char saved_cmdlines[SAVED_CMDLINES][TASK_COMM_LEN]; +static int cmdline_idx; +static DEFINE_SPINLOCK(trace_cmdline_lock); +atomic_t trace_record_cmdline_disabled; + +static void trace_init_cmdlines(void) +{ + memset(&map_pid_to_cmdline, -1, sizeof(map_pid_to_cmdline)); + memset(&map_cmdline_to_pid, -1, sizeof(map_cmdline_to_pid)); + cmdline_idx = 0; +} + +notrace void trace_stop_cmdline_recording(void); + +static void notrace trace_save_cmdline(struct task_struct *tsk) +{ + unsigned map; + unsigned idx; + + if (!tsk->pid || unlikely(tsk->pid > PID_MAX_DEFAULT)) + return; + + /* +* It's not the end of the world if we don't get +* the lock, but we also don't want to spin +* nor do we want to disable interrupts, +* so if we miss here, then better luck next time. 
+*/ + if (!spin_trylock(&trace_cmdline_lock)) + return; + + idx = map_pid_to_cmdline[tsk->pid]; + if (idx >= SAVED_CMDLINES) { + idx = (cmdline_idx + 1) % SAVED_CMDLINES; + + map = map_cmdline_to_pid[idx]; + if (map <= PID_MAX_DEFAULT) + map_pid_to_cmdline[map] = (unsigned)-1; + + map_pid_to_cmdline[tsk->pid] = idx; + + cmdline_idx = idx; + } + + memcpy(&saved_cmdlines[idx], tsk->comm, TASK_COMM_LEN); + + spin_unlock(&trace_cmdline_lock); +} + +static notrace char *trace_find_cmdline(int pid) +{ + char *cmdline = "<...>"; + unsigned map; + + if (!pid) + return ""; + + if (pid > PID_MAX_DEFAULT) + goto out; + + map = map_pid_to_cmdline[pid]; + if (map >= SAVED_CMDLINES) + goto out; + + cmdline = saved_cmdlines[map]; + + out: + return cmdline; +} + +void tracing_record_cmdline(struct task_struct *tsk) +{ + if (atom
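The two-way pid/slot table in this patch is easier to follow in a scaled-down form. The following user-space sketch is not the patch's code: the names and sizes are made up, the spin_trylock() is omitted, and it records the slot-to-pid mapping on insertion, something the quoted hunk does not visibly do, so treat that line as this sketch's assumption about the intended behaviour.

```c
#include <string.h>

#define SKETCH_PID_MAX  1024 /* far smaller than PID_MAX_DEFAULT */
#define SKETCH_CMDLINES 4    /* the patch uses 128 */
#define SKETCH_COMM_LEN 16
#define NO_MAP ((unsigned)-1)

static unsigned pid_to_slot[SKETCH_PID_MAX + 1];
static unsigned slot_to_pid[SKETCH_CMDLINES];
static char saved[SKETCH_CMDLINES][SKETCH_COMM_LEN];
static unsigned last_idx;

/* Mark every entry in both maps as unused (all bits set). */
static void cmdline_init(void)
{
	memset(pid_to_slot, -1, sizeof(pid_to_slot));
	memset(slot_to_pid, -1, sizeof(slot_to_pid));
	last_idx = 0;
}

/* Remember comm for pid, evicting the oldest slot if pid has none yet. */
static void cmdline_save(unsigned pid, const char *comm)
{
	unsigned idx, old;

	if (!pid || pid > SKETCH_PID_MAX)
		return;

	idx = pid_to_slot[pid];
	if (idx >= SKETCH_CMDLINES) {
		idx = (last_idx + 1) % SKETCH_CMDLINES;

		/* Drop the mapping of whichever pid owned this slot. */
		old = slot_to_pid[idx];
		if (old <= SKETCH_PID_MAX)
			pid_to_slot[old] = NO_MAP;

		pid_to_slot[pid] = idx;
		slot_to_pid[idx] = pid; /* assumption, see lead-in */
		last_idx = idx;
	}

	strncpy(saved[idx], comm, SKETCH_COMM_LEN - 1);
	saved[idx][SKETCH_COMM_LEN - 1] = '\0';
}

/* Look up the saved name, or "<...>" if the pid has been evicted. */
static const char *cmdline_find(unsigned pid)
{
	unsigned idx;

	if (pid > SKETCH_PID_MAX)
		return "<...>";

	idx = pid_to_slot[pid];
	if (idx >= SKETCH_CMDLINES)
		return "<...>";

	return saved[idx];
}
```

With four slots, saving a fifth distinct pid evicts the first one, whose lookups then return "<...>". This is the behaviour the changelog warns about when it says the comm may be destroyed if other traces are performed.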
[PATCH 05/22 -v7] add notrace annotations to vsyscall.
Add the notrace annotations to some of the vsyscall functions. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- arch/x86/kernel/vsyscall_64.c |3 ++- arch/x86/vdso/vclock_gettime.c | 15 --- arch/x86/vdso/vgetcpu.c|3 ++- include/asm-x86/vsyscall.h |3 ++- 4 files changed, 14 insertions(+), 10 deletions(-) Index: linux-mcount.git/arch/x86/kernel/vsyscall_64.c === --- linux-mcount.git.orig/arch/x86/kernel/vsyscall_64.c 2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/arch/x86/kernel/vsyscall_64.c 2008-01-25 21:47:06.0 -0500 @@ -42,7 +42,8 @@ #include #include -#define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr))) +#define __vsyscall(nr) \ + __attribute__ ((unused, __section__(".vsyscall_" #nr))) notrace #define __syscall_clobber "r11","rcx","memory" #define __pa_vsymbol(x)\ ({unsigned long v; \ Index: linux-mcount.git/arch/x86/vdso/vclock_gettime.c === --- linux-mcount.git.orig/arch/x86/vdso/vclock_gettime.c2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/arch/x86/vdso/vclock_gettime.c 2008-01-25 21:47:06.0 -0500 @@ -24,7 +24,7 @@ #define gtod vdso_vsyscall_gtod_data -static long vdso_fallback_gettime(long clock, struct timespec *ts) +notrace static long vdso_fallback_gettime(long clock, struct timespec *ts) { long ret; asm("syscall" : "=a" (ret) : @@ -32,7 +32,7 @@ static long vdso_fallback_gettime(long c return ret; } -static inline long vgetns(void) +notrace static inline long vgetns(void) { long v; cycles_t (*vread)(void); @@ -41,7 +41,7 @@ static inline long vgetns(void) return (v * gtod->clock.mult) >> gtod->clock.shift; } -static noinline int do_realtime(struct timespec *ts) +notrace static noinline int do_realtime(struct timespec *ts) { unsigned long seq, ns; do { @@ -55,7 +55,8 @@ static noinline int do_realtime(struct t } /* Copy of the version in kernel/time.c which we cannot directly access */ -static void vset_normalized_timespec(struct timespec *ts, long sec, long nsec) +notrace static void +vset_normalized_timespec(struct 
timespec *ts, long sec, long nsec) { while (nsec >= NSEC_PER_SEC) { nsec -= NSEC_PER_SEC; @@ -69,7 +70,7 @@ static void vset_normalized_timespec(str ts->tv_nsec = nsec; } -static noinline int do_monotonic(struct timespec *ts) +notrace static noinline int do_monotonic(struct timespec *ts) { unsigned long seq, ns, secs; do { @@ -83,7 +84,7 @@ static noinline int do_monotonic(struct return 0; } -int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) +notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) { if (likely(gtod->sysctl_enabled && gtod->clock.vread)) switch (clock) { @@ -97,7 +98,7 @@ int __vdso_clock_gettime(clockid_t clock int clock_gettime(clockid_t, struct timespec *) __attribute__((weak, alias("__vdso_clock_gettime"))); -int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz) +notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz) { long ret; if (likely(gtod->sysctl_enabled && gtod->clock.vread)) { Index: linux-mcount.git/arch/x86/vdso/vgetcpu.c === --- linux-mcount.git.orig/arch/x86/vdso/vgetcpu.c 2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/arch/x86/vdso/vgetcpu.c2008-01-25 21:47:06.0 -0500 @@ -13,7 +13,8 @@ #include #include "vextern.h" -long __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused) +notrace long +__vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused) { unsigned int dummy, p; Index: linux-mcount.git/include/asm-x86/vsyscall.h === --- linux-mcount.git.orig/include/asm-x86/vsyscall.h2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/include/asm-x86/vsyscall.h 2008-01-25 21:47:06.0 -0500 @@ -24,7 +24,8 @@ enum vsyscall_num { ((unused, __section__ (".vsyscall_gtod_data"),aligned(16))) #define __section_vsyscall_clock __attribute__ \ ((unused, __section__ (".vsyscall_clock"),aligned(16))) -#define __vsyscall_fn __attribute__ ((unused,__section__(".vsyscall_fn"))) +#define __vsyscall_fn \ + __attribute__ ((unused, __section__(".vsyscall_fn"))) notrace 
#define VGETCPU_RDTSCP 1 #define VGETCPU_LSL 2
[PATCH 21/22 -v7] Add event tracer.
This patch adds a event trace that hooks into various events in the kernel. Although it can be used separately, it is mainly to help other traces (wakeup and preempt off) with seeing various events in the traces without having to enable the heavy mcount hooks. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- lib/tracing/Kconfig | 12 + lib/tracing/Makefile|1 lib/tracing/trace_events.c | 475 lib/tracing/trace_irqsoff.c |6 lib/tracing/trace_wakeup.c | 55 - lib/tracing/tracer.c| 159 ++ lib/tracing/tracer.h| 64 + 7 files changed, 721 insertions(+), 51 deletions(-) Index: linux-mcount.git/lib/tracing/trace_events.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-mcount.git/lib/tracing/trace_events.c 2008-01-29 18:11:37.0 -0500 @@ -0,0 +1,475 @@ +/* + * trace task events + * + * Copyright (C) 2007 Steven Rostedt <[EMAIL PROTECTED]> + * + * Based on code from the latency_tracer, that is: + * + * Copyright (C) 2004-2006 Ingo Molnar + * Copyright (C) 2004 William Lee Irwin III + */ +#include +#include +#include +#include +#include +#include + +#include "tracer.h" + +static struct tracing_trace *tracer_trace __read_mostly; +static int trace_enabled __read_mostly; + +static void notrace event_reset(struct tracing_trace *tr) +{ + struct tracing_trace_cpu *data; + int cpu; + + for_each_possible_cpu(cpu) { + data = tr->data[cpu]; + tracing_reset(data); + } + + tr->time_start = now(); +} + +static void notrace event_trace_sched_switch(void *private, +struct task_struct *prev, +struct task_struct *next) +{ + struct tracing_trace **ptr = private; + struct tracing_trace *tr = *ptr; + struct tracing_trace_cpu *data; + unsigned long flags; + int cpu; + + if (!trace_enabled || !tr) + return; + + local_irq_save(flags); + cpu = raw_smp_processor_id(); + data = tr->data[cpu]; + + atomic_inc(&data->disabled); + if (atomic_read(&data->disabled) != 1) + goto out; + + tracing_sched_switch_trace(tr, data, prev, next, flags); + + out: + atomic_dec(&data->disabled); + 
local_irq_restore(flags); +} + +static struct tracer_switch_ops switch_ops __read_mostly = { + .func = event_trace_sched_switch, + .private = &tracer_trace, +}; + +notrace int trace_event_enabled(void) +{ + return trace_enabled && tracer_trace; +} + +/* Taken from sched.c */ +#define __PRIO(prio) \ + ((prio) <= 99 ? 199 - (prio) : (prio) - 120) + +#define PRIO(p) __PRIO((p)->prio) + +notrace void trace_event_wakeup(unsigned long ip, + struct task_struct *p, + struct task_struct *curr) +{ + struct tracing_trace *tr = tracer_trace; + struct tracing_trace_cpu *data; + unsigned long flags; + int cpu; + + if (!trace_enabled || !tr) + return; + + local_irq_save(flags); + cpu = raw_smp_processor_id(); + data = tr->data[cpu]; + + atomic_inc(&data->disabled); + if (atomic_read(&data->disabled) != 1) + goto out; + + /* record process's command line */ + tracing_record_cmdline(p); + tracing_record_cmdline(curr); + tracing_trace_pid(tr, data, flags, ip, p->pid, PRIO(p), PRIO(curr)); + + out: + atomic_dec(&data->disabled); + local_irq_restore(flags); +} + +struct event_probes { + const char *name; + const char *fmt; + void (*func)(const struct event_probes *probe, +struct tracing_trace *tr, +struct tracing_trace_cpu *data, +unsigned long flags, +unsigned long ip, +va_list ap); + int active; + int armed; +}; + +#define getarg(arg, ap) arg = va_arg(ap, typeof(arg)) + +static notrace void event_trace_apic_timer(const struct event_probes *probe, + struct tracing_trace *tr, + struct tracing_trace_cpu *data, + unsigned long flags, + unsigned long ip, + va_list ap) +{ + unsigned long parent_ip; + + getarg(parent_ip, ap); + + tracing_trace_special(tr, data, flags, ip, parent_ip, 0, 0); +} + +static notrace void event_trace_do_irq(const struct event_probes *probe, + struct tracing_trace *tr, + struct tracing_trace_cpu *data, + unsigned long flags, +
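The __PRIO() macro quoted above maps the scheduler's internal prio value to the number the tracer prints. A quick user-space check of the arithmetic is shown below. The macro is renamed because identifiers starting with a double underscore are reserved outside the kernel, and the interpretation in the comments (0..99 being the real-time range, 120 being nice 0) is general scheduler background rather than something this patch states.

```c
/* Same expression as __PRIO() in the patch, under a non-reserved name. */
#define SKETCH_PRIO(prio) \
	((prio) <= 99 ? 199 - (prio) : (prio) - 120)

/*
 * Wrapper so the mapping can be called like a function.
 * prio <= 99 gives 199 - prio, so the range 0..99 maps to 199..100.
 * prio >= 100 gives prio - 120, so 100..139 maps to -20..19.
 */
static int sketch_display_prio(int prio)
{
	return SKETCH_PRIO(prio);
}
```

Under that reading, a default-priority task (prio 120) prints as 0 and the highest real-time priority (prio 0) prints as 199.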
[PATCH 04/22 -v7] x86_64: notrace annotations
Add "notrace" annotation to x86_64 specific files.

Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/x86/kernel/head64.c     |    2 +-
 arch/x86/kernel/setup64.c    |    4 ++--
 arch/x86/kernel/smpboot_64.c |    2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/head64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/head64.c	2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/head64.c	2008-01-25 21:47:05.0 -0500
@@ -46,7 +46,7 @@ static void __init copy_bootdata(char *r
 	}
 }
 
-void __init x86_64_start_kernel(char * real_mode_data)
+notrace void __init x86_64_start_kernel(char *real_mode_data)
 {
 	int i;
 
Index: linux-mcount.git/arch/x86/kernel/setup64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/setup64.c	2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/setup64.c	2008-01-25 21:47:05.0 -0500
@@ -114,7 +114,7 @@ void __init setup_per_cpu_areas(void)
 	}
 }
 
-void pda_init(int cpu)
+notrace void pda_init(int cpu)
 {
 	struct x8664_pda *pda = cpu_pda(cpu);
 
@@ -197,7 +197,7 @@ DEFINE_PER_CPU(struct orig_ist, orig_ist
  * 'CPU state barrier', nothing should get across.
  * A lot of state is already set up in PDA init.
  */
-void __cpuinit cpu_init (void)
+notrace void __cpuinit cpu_init(void)
 {
 	int cpu = stack_smp_processor_id();
 	struct tss_struct *t = &per_cpu(init_tss, cpu);
Index: linux-mcount.git/arch/x86/kernel/smpboot_64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/smpboot_64.c	2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/smpboot_64.c	2008-01-25 21:47:05.0 -0500
@@ -317,7 +317,7 @@ static inline void set_cpu_sibling_map(i
 /*
  * Setup code on secondary processor (after comming out of the trampoline)
  */
-void __cpuinit start_secondary(void)
+notrace __cpuinit void start_secondary(void)
 {
 	/*
 	 * Dont put anything before smp_callin(), SMP
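For readers unfamiliar with the annotation, here is a minimal sketch of what "notrace" amounts to. It assumes the macro expands to GCC's no_instrument_function attribute, which is how later mainline kernels define it; the definition used by this series is in another patch and is not shown in this mail. The helper name is hypothetical.

```c
/*
 * Assumption: notrace maps to this GCC attribute. With -pg (or
 * -finstrument-functions) the compiler then emits no profiling call
 * for the annotated function.
 */
#define notrace __attribute__((no_instrument_function))

/*
 * Hypothetical early-boot style helper. A function like this may run
 * before the tracer's per-CPU state exists, so it must not enter mcount.
 */
static notrace unsigned long sketch_early_helper(unsigned long x)
{
	return x + 1;
}
```

Built without -pg the attribute has no visible effect, so the sketch compiles and behaves like any other function.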
[PATCH 00/22 -v7] mcount and latency tracing utility -v7
[
  version 7 (and hopefully last) of mcount / trace patches:

  changes include:

   Ported to latest git 0ba6c33bcddc64a54b5f1c25a696c4767dc76292

   Moved the markers around so they would only be armed when used; this brings down the overhead dramatically.

   Added printing of the "to" process name in the sched switch output:

     ksoftirq-8 2d..3 120829us+: 8:49:S --> 11:115 group_balance
     group_ba-11 2d..3 120836us!: 11:115:S --> 0:140

   Removed notrace from the nmi handlers. I've tested it a little with NMIs and function trace, and it seems to work fine.

   Added "disable" to available_tracers that will unregister all tracers when written into current_tracer.

   Ran new benchmarks and got better results! See below.
]

All released versions of these patches can be found at:

   http://people.redhat.com/srostedt/tracing/

The following patch series brings to vanilla Linux a bit of the RT kernel trace facility. This incorporates the "-pg" profiling option of gcc that will call the "mcount" function for all functions called in the kernel.

Note: I did investigate using -finstrument-functions but that adds a call to both start and end of a function. Using mcount only does the beginning of the function. mcount alone adds ~13% overhead. The -finstrument-functions added ~19%. Also it caused me to do tricks with inline, because it adds the function calls to inline functions as well.

This patch series implements the code for x86 (32 and 64 bit), but other archs can easily be implemented as well (note: ARM and PPC are already implemented in -rt).

Some Background:

A while back, Ingo Molnar and William Lee Irwin III created a latency tracer to find problem latency areas in the kernel for the RT patch. This tracer became a very integral part of the RT kernel in solving where latency hot spots were. One of the features that the latency tracer added was a function trace.
This function tracer would record all functions that were called (implemented by the gcc "-pg" option) and would show what was called when interrupts or preemption was turned off.

This feature is also very helpful in normal debugging. So there has been talk of taking bits and pieces from the RT latency tracer and bringing them to LKML. But no one had the time to do it.

Arnaldo Carvalho de Melo took a crack at it. He pulled out the mcount as well as part of the tracing code and made it generic from the point of the tracing code. I'm not sure why this stopped. Probably because Arnaldo is a very busy man, and his efforts had to be utilized elsewhere.

While I still maintain my own Logdev utility:

  http://rostedt.homelinux.com/logdev

I came across a need to do the mcount with logdev too. I was successful but found that it became very dependent on a lot of code. One thing that I liked about my logdev utility was that it was very non-intrusive, and has been easy to port from the Linux 2.0 days. I did not want to burden the logdev patch with the intrusiveness of mcount (not really that intrusive, it just needs to add a "notrace" annotation to functions in the kernel, which will cause more conflicts in applying patches for me).

Being close to the holidays, I grabbed Arnaldo's old patches and started massaging them into something that could be useful for logdev, and what I found out (after talking this over with Arnaldo too) is that this can be much more useful for others as well.

The main thing I changed was that I made the mcount function itself generic, without the dependency on the tracing code. That is, I added

  register_mcount_function()
and
  clear_mcount_function()

So whenever mcount is enabled and a function is registered, that function is called for all functions in the kernel that are not labeled with the "notrace" annotation.
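The register/clear scheme can be sketched in user space as below. This is only an illustration under stated assumptions: every name prefixed with sketch_ is hypothetical, the real register_mcount_function() signature is not shown in this mail, and sketch_enabled merely stands in for the enable switch. The dispatch rule (call a single registered hook directly, fall back to a loop when there are several) follows the description in patch 02 of this series.

```c
#include <stddef.h>

/* Hook prototype: address of the traced function and of its caller. */
typedef void (*sketch_hook_fn)(unsigned long ip, unsigned long parent_ip);

#define SKETCH_MAX_HOOKS 4

static sketch_hook_fn sketch_hooks[SKETCH_MAX_HOOKS];
static int sketch_nr_hooks;
static int sketch_enabled;            /* stands in for the enable switch */
static sketch_hook_fn sketch_active;  /* what the entry stub calls, or NULL */

/* Used only when more than one hook is registered. */
static void sketch_call_all(unsigned long ip, unsigned long parent_ip)
{
	int i;

	for (i = 0; i < sketch_nr_hooks; i++)
		sketch_hooks[i](ip, parent_ip);
}

/* Pick the cheapest dispatch: nothing, the single hook, or the loop. */
static void sketch_update_active(void)
{
	if (sketch_nr_hooks == 0)
		sketch_active = NULL;
	else if (sketch_nr_hooks == 1)
		sketch_active = sketch_hooks[0];
	else
		sketch_active = sketch_call_all;
}

static int sketch_register_hook(sketch_hook_fn fn)
{
	if (sketch_nr_hooks == SKETCH_MAX_HOOKS)
		return -1;
	sketch_hooks[sketch_nr_hooks++] = fn;
	sketch_update_active();
	return 0;
}

static void sketch_clear_hooks(void)
{
	sketch_nr_hooks = 0;
	sketch_update_active();
}

/* Stand-in for the per-function entry stub that -pg would emit. */
static void sketch_function_entry(unsigned long ip, unsigned long parent_ip)
{
	sketch_hook_fn fn = sketch_active;

	if (sketch_enabled && fn)
		fn(ip, parent_ip);
}

/* Two demo hooks that just count how often they were called. */
static unsigned long sketch_calls_a, sketch_calls_b;

static void sketch_count_a(unsigned long ip, unsigned long parent_ip)
{
	(void)ip; (void)parent_ip;
	sketch_calls_a++;
}

static void sketch_count_b(unsigned long ip, unsigned long parent_ip)
{
	(void)ip; (void)parent_ip;
	sketch_calls_b++;
}
```

Tracing only happens when the switch is on and at least one hook is registered, in either order. The sketch ignores locking and concurrent updates, which a kernel implementation has to handle.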
The Simple Tracer:
------------------

To show the power of this I also massaged the tracer code that Arnaldo pulled from the RT patch and made it a nice example of what can be done with this.

The function that is registered to mcount has the prototype:

  void func(unsigned long ip, unsigned long parent_ip);

The ip is the address of the function and parent_ip is the address of the parent function that called it.

The x86_64 version has the assembly call the registered function directly to save having to do a double function call.

To enable mcount, a sysctl is added:

  /proc/sys/kernel/mcount_enabled

Once mcount is enabled, when a function is registered, it will be called by all functions. The tracer in this patch series shows how this is done. It adds a directory in the debugfs, called mctracer, with a ctrl file that allows the user to have the tracer register its function. Note, the order of enabling mcount and registering a function is not important, but both must be done to initiate the tracing. That is, you can disable tracing by either disabling mcount or by clearing the registered function.

When one function is registered, it is called directly from the mcount asse
[PATCH 20/22 -v7] Add markers to various events
This patch adds markers to various events in the kernel. (interrupts, task activation and hrtimers) Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- arch/x86/kernel/apic_32.c |2 ++ arch/x86/kernel/irq_32.c |1 + arch/x86/kernel/irq_64.c |2 ++ arch/x86/kernel/traps_32.c |2 ++ arch/x86/kernel/traps_64.c |2 ++ arch/x86/mm/fault_32.c |3 +++ arch/x86/mm/fault_64.c |3 +++ kernel/hrtimer.c |7 +++ kernel/sched.c | 11 +++ 9 files changed, 33 insertions(+) Index: linux-mcount.git/arch/x86/kernel/apic_32.c === --- linux-mcount.git.orig/arch/x86/kernel/apic_32.c 2008-01-28 08:37:49.0 -0500 +++ linux-mcount.git/arch/x86/kernel/apic_32.c 2008-01-28 09:54:49.0 -0500 @@ -581,6 +581,8 @@ notrace fastcall void smp_apic_timer_int { struct pt_regs *old_regs = set_irq_regs(regs); + trace_mark(arch_apic_timer, "ip %lx", regs->eip); + /* * NOTE! We'd better ACK the irq immediately, * because timer handling can be slow. Index: linux-mcount.git/arch/x86/kernel/irq_32.c === --- linux-mcount.git.orig/arch/x86/kernel/irq_32.c 2008-01-28 08:37:14.0 -0500 +++ linux-mcount.git/arch/x86/kernel/irq_32.c 2008-01-28 09:54:49.0 -0500 @@ -85,6 +85,7 @@ fastcall unsigned int do_IRQ(struct pt_r old_regs = set_irq_regs(regs); irq_enter(); + trace_mark(arch_do_irq, "ip %lx irq %d", regs->eip, irq); #ifdef CONFIG_DEBUG_STACKOVERFLOW /* Debugging check for stack overflow: is there less than 1KB free? 
*/ { Index: linux-mcount.git/arch/x86/kernel/irq_64.c === --- linux-mcount.git.orig/arch/x86/kernel/irq_64.c 2008-01-28 08:37:14.0 -0500 +++ linux-mcount.git/arch/x86/kernel/irq_64.c 2008-01-28 09:54:49.0 -0500 @@ -149,6 +149,8 @@ asmlinkage unsigned int do_IRQ(struct pt irq_enter(); irq = __get_cpu_var(vector_irq)[vector]; + trace_mark(arch_do_irq, "ip %lx irq %d", regs->rip, irq); + #ifdef CONFIG_DEBUG_STACKOVERFLOW stack_overflow_check(regs); #endif Index: linux-mcount.git/arch/x86/kernel/traps_32.c === --- linux-mcount.git.orig/arch/x86/kernel/traps_32.c2008-01-28 08:37:17.0 -0500 +++ linux-mcount.git/arch/x86/kernel/traps_32.c 2008-01-28 09:54:49.0 -0500 @@ -769,6 +769,8 @@ fastcall __kprobes void do_nmi(struct pt nmi_enter(); + trace_mark(arch_do_nmi, "ip %lx flags %lx", regs->eip, regs->eflags); + cpu = smp_processor_id(); ++nmi_count(cpu); Index: linux-mcount.git/arch/x86/kernel/traps_64.c === --- linux-mcount.git.orig/arch/x86/kernel/traps_64.c2008-01-28 08:37:14.0 -0500 +++ linux-mcount.git/arch/x86/kernel/traps_64.c 2008-01-28 09:54:49.0 -0500 @@ -782,6 +782,8 @@ asmlinkage __kprobes void default_do_nmi cpu = smp_processor_id(); + trace_mark(arch_do_nmi, "ip %lx flags %lx", regs->rip, regs->eflags); + /* Only the BSP gets external NMIs from the system. 
*/ if (!cpu) reason = get_nmi_reason(); Index: linux-mcount.git/arch/x86/mm/fault_32.c === --- linux-mcount.git.orig/arch/x86/mm/fault_32.c2008-01-28 08:37:14.0 -0500 +++ linux-mcount.git/arch/x86/mm/fault_32.c 2008-01-28 09:54:49.0 -0500 @@ -311,6 +311,9 @@ fastcall void __kprobes do_page_fault(st /* get the address */ address = read_cr2(); + trace_mark(arch_do_page_fault, "ip %lx err %lx addr %lx", + regs->eip, error_code, address); + tsk = current; si_code = SEGV_MAPERR; Index: linux-mcount.git/arch/x86/mm/fault_64.c === --- linux-mcount.git.orig/arch/x86/mm/fault_64.c2008-01-28 08:37:14.0 -0500 +++ linux-mcount.git/arch/x86/mm/fault_64.c 2008-01-28 09:54:49.0 -0500 @@ -316,6 +316,9 @@ asmlinkage void __kprobes do_page_fault( /* get the address */ address = read_cr2(); + trace_mark(arch_do_page_fault, "ip %lx err %lx addr %lx", + regs->rip, error_code, address); + info.si_code = SEGV_MAPERR; Index: linux-mcount.git/kernel/hrtimer.c === --- linux-mcount.git.orig/kernel/hrtimer.c 2008-01-28 08:37:14.0 -0500 +++ linux-mcount.git/kernel/hrtimer.c 2008-01-28 09:54:49.0 -0500 @@ -709,6 +709,8 @@ static void enqueue_hrtimer(struct hrtim struct hrtimer *entry; int leftmost = 1
[PATCH 12/22 -v7] Make the task State char-string visible to all
The tracer wants to be able to convert the state number into a user visible character. This patch pulls that conversion string out of the scheduler into the header. This way if it were to ever change, other parts of the kernel will know.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 include/linux/sched.h |    2 ++
 kernel/sched.c        |    2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-mcount.git/include/linux/sched.h
===
--- linux-mcount.git.orig/include/linux/sched.h	2008-01-25 21:46:55.0 -0500
+++ linux-mcount.git/include/linux/sched.h	2008-01-25 21:47:21.0 -0500
@@ -2055,6 +2055,8 @@ static inline void migration_init(void)
 }
 #endif
 
+#define TASK_STATE_TO_CHAR_STR "RSDTtZX"
+
 #endif /* __KERNEL__ */
 
 #endif
Index: linux-mcount.git/kernel/sched.c
===
--- linux-mcount.git.orig/kernel/sched.c	2008-01-25 21:47:19.0 -0500
+++ linux-mcount.git/kernel/sched.c	2008-01-25 21:47:21.0 -0500
@@ -5149,7 +5149,7 @@ out_unlock:
 	return retval;
 }
 
-static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
+static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
 
 void sched_show_task(struct task_struct *p)
 {
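A user of the shared string might look like the sketch below. It assumes the task state is 0 for running and otherwise a bitmask whose lowest set bit selects the character, which is how sched_show_task() appears to index stat_nam; the helper names are hypothetical and not part of the patch.

```c
#include <stddef.h>

#define TASK_STATE_TO_CHAR_STR "RSDTtZX"

/* Index of the lowest set bit; the caller guarantees word != 0. */
static unsigned int sketch_lowest_bit(unsigned long word)
{
	unsigned int n = 0;

	while (!(word & 1UL)) {
		word >>= 1;
		n++;
	}
	return n;
}

/*
 * Map a task state to its display character.
 * Assumption: 0 means running ('R'); bit 0 maps to 'S', bit 1 to 'D',
 * and so on. States beyond the end of the string yield '?'.
 */
static char sketch_task_state_char(unsigned long state)
{
	static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
	size_t idx = state ? sketch_lowest_bit(state) + 1 : 0;

	return idx < sizeof(stat_nam) - 1 ? stat_nam[idx] : '?';
}
```

Keeping the string in one header means a tracer and the scheduler cannot drift apart if a state is ever added.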
[PATCH 08/22 -v7] add get_monotonic_cycles
The latency tracer needs a way to get an accurate time without grabbing any locks. Locks themselves might call the latency tracer and cause at best a slow down. This patch adds get_monotonic_cycles that returns cycles from a reliable clock source in a monotonic fashion. Signed-off-by: John Stultz <[EMAIL PROTECTED]> Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- include/linux/clocksource.h | 54 +--- kernel/time/timekeeping.c | 26 +++-- 2 files changed, 70 insertions(+), 10 deletions(-) Index: linux-mcount.git/include/linux/clocksource.h === --- linux-mcount.git.orig/include/linux/clocksource.h 2008-01-25 21:47:11.0 -0500 +++ linux-mcount.git/include/linux/clocksource.h2008-01-25 21:47:13.0 -0500 @@ -88,8 +88,16 @@ struct clocksource { */ struct { cycle_t cycle_last, cycle_accumulated; - } cacheline_aligned_in_smp; + /* base structure provides lock-free read +* access to a virtualized 64bit counter +* Uses RCU-like update. +*/ + struct { + cycle_t cycle_base_last, cycle_base; + } base[2]; + int base_num; + } cacheline_aligned_in_smp; u64 xtime_nsec; s64 error; @@ -175,19 +183,30 @@ static inline cycle_t clocksource_read(s } /** - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value + * clocksource_get_basecycles: - get the clocksource's accumulated cycle value * @cs:pointer to clocksource being read * @now: current cycle value * * Uses the clocksource to return the current cycle_t value. * NOTE!!!: This is different from clocksource_read, because it - * returns the accumulated cycle value! Must hold xtime lock! + * returns a 64bit wide accumulated value. 
*/ static inline cycle_t -clocksource_get_cycles(struct clocksource *cs, cycle_t now) +clocksource_get_basecycles(struct clocksource *cs) { - cycle_t offset = (now - cs->cycle_last) & cs->mask; - offset += cs->cycle_accumulated; + int num; + cycle_t now, offset; + + preempt_disable(); + num = cs->base_num; + /* base_num is shared, and some archs are wacky */ + smp_read_barrier_depends(); + now = clocksource_read(cs); + offset = (now - cs->base[num].cycle_base_last); + offset &= cs->mask; + offset += cs->base[num].cycle_base; + preempt_enable(); + return offset; } @@ -197,11 +216,27 @@ clocksource_get_cycles(struct clocksourc * @now: current cycle value * * Used to avoids clocksource hardware overflow by periodically - * accumulating the current cycle delta. Must hold xtime write lock! + * accumulating the current cycle delta. Uses RCU-like update, but + * ***still requires the xtime_lock is held for writing!*** */ static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now) { - cycle_t offset = (now - cs->cycle_last) & cs->mask; + /* +* First update the monotonic base portion. +* The dual array update method allows for lock-free reading. +* 'num' is always 1 or 0. 
+*/ + int num = 1 - cs->base_num; + cycle_t offset = (now - cs->base[1-num].cycle_base_last); + offset &= cs->mask; + cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset; + cs->base[num].cycle_base_last = now; + /* make sure this array is visible to the world first */ + smp_wmb(); + cs->base_num = num; + + /* Now update the cycle_accumulated portion */ + offset = (now - cs->cycle_last) & cs->mask; cs->cycle_last = now; cs->cycle_accumulated += offset; } @@ -272,6 +307,9 @@ extern int clocksource_register(struct c extern struct clocksource* clocksource_get_next(void); extern void clocksource_change_rating(struct clocksource *cs, int rating); extern void clocksource_resume(void); +extern cycle_t get_monotonic_cycles(void); +extern unsigned long cycles_to_usecs(cycle_t cycles); +extern cycle_t usecs_to_cycles(unsigned long usecs); /* used to initialize clock */ extern struct clocksource clocksource_jiffies; Index: linux-mcount.git/kernel/time/timekeeping.c === --- linux-mcount.git.orig/kernel/time/timekeeping.c 2008-01-25 21:47:11.0 -0500 +++ linux-mcount.git/kernel/time/timekeeping.c 2008-01-25 21:47:13.0 -0500 @@ -71,10 +71,12 @@ static struct clocksource *clock = &cloc */ static inline s64 __get_nsec_offset(void) { - cycle_t cycle_delta; + cycle_t now, cycle_delta; s64 ns_offset; - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock)); + now = clocksource_read(clock); + cycle
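The dual-base update in clocksource_accumulate() above can be sketched in user space as follows. This is an illustration only: the sketch_ names are hypothetical, C11 release/acquire atomics stand in for smp_wmb() and smp_read_barrier_depends(), and the caller passes the raw counter value in rather than reading a clocksource.

```c
#include <stdatomic.h>
#include <stdint.h>

/* One snapshot: raw counter value at the last update and the
 * accumulated 64-bit total at that moment. */
struct sketch_base {
	uint64_t base_last;
	uint64_t base;
};

struct sketch_counter {
	struct sketch_base base[2];
	atomic_int base_num;   /* which slot readers should use */
	uint64_t mask;         /* width of the raw hardware counter */
};

static void sketch_counter_init(struct sketch_counter *c, uint64_t mask)
{
	c->base[0].base_last = 0;
	c->base[0].base = 0;
	c->base[1] = c->base[0];
	c->mask = mask;
	atomic_init(&c->base_num, 0);
}

/*
 * Writer side. Callers must serialize writers themselves (the patch
 * still requires xtime_lock for this). The inactive slot is filled in
 * first and only then published with a release store.
 */
static void sketch_counter_accumulate(struct sketch_counter *c, uint64_t now)
{
	int cur = atomic_load_explicit(&c->base_num, memory_order_relaxed);
	int next = 1 - cur;
	uint64_t delta = (now - c->base[cur].base_last) & c->mask;

	c->base[next].base = c->base[cur].base + delta;
	c->base[next].base_last = now;
	atomic_store_explicit(&c->base_num, next, memory_order_release);
}

/*
 * Reader side: no lock taken. Note that a reader stalled across two
 * updates could see its slot being rewritten; the patch disables
 * preemption around the read, which this sketch does not model.
 */
static uint64_t sketch_counter_read(struct sketch_counter *c, uint64_t now)
{
	int num = atomic_load_explicit(&c->base_num, memory_order_acquire);
	uint64_t delta = (now - c->base[num].base_last) & c->mask;

	return c->base[num].base + delta;
}
```

With a 32-bit mask the returned value keeps counting past the point where the raw counter wraps, as long as the writer runs at least once per wrap period.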
[PATCH 10/22 -v7] mcount based trace in the form of a header file library
This is a simple trace that uses the mcount infrastructure. It is designed to be fast and small, and easy to use. It is useful to record things that happen over a very short period of time, and not to analyze the system in general.

An interface is added to the debugfs

  /debugfs/tracing/

This patch adds the following files:

  available_tracers
     list of available tracers. Currently only "function" is available.

  current_tracer
     The trace that is currently active. Empty on start up.
     To switch to a tracer simply echo one of the tracers that are listed in available_tracers:

       echo function > /debugfs/tracing/current_tracer

     To disable the tracer:

       echo disable > /debugfs/tracing/current_tracer

  trace_ctrl
     echoing "1" into this file starts the mcount function tracing (if sysctl kernel.mcount_enabled=1)
     echoing "0" turns it off.

  latency_trace
     This file is readonly and holds the result of the trace.

  trace
     This file outputs an easier to read version of the trace.

  iter_ctrl
     Controls the way the output of traces looks.
     So far there are two controls:
       echoing in "symonly" will only show the kallsyms variables without the addresses (if kallsyms was configured)
       echoing in "verbose" will change the output to show a lot more data, but it is not very easy for humans to understand.
       echoing in "nosymonly" turns off symonly.
       echoing in "noverbose" turns off verbose.

The output of the function_trace file is as follows

  "echo noverbose > /debugfs/tracing/iter_ctrl"

preemption latency trace v1.1.5 on 2.6.24-rc7-tst
 latency: 0 us, #419428/4361791, CPU#1 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
    -----------------
    | task: -0 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
 swapper-0     0d.h. 1595128us+: set_normalized_timespec+0x8/0x2d (ktime_get_ts+0x4a/0x4e )
 swapper-0     0d.h. 1595131us+: _spin_lock+0x8/0x18 (hrtimer_interrupt+0x6e/0x1b0 )

Or with verbose turned on:

  "echo verbose > /debugfs/tracing/iter_ctrl"

preemption latency trace v1.1.5 on 2.6.24-rc7-tst
 latency: 0 us, #419428/4361791, CPU#1 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
    -----------------
    | task: -0 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------

 swapper 0 0 9 [f3675f41] 1595.128ms (+0.003ms): set_normalized_timespec+0x8/0x2d (ktime_get_ts+0x4a/0x4e )
 swapper 0 0 9 0001 [f3675f45] 1595.131ms (+0.003ms): _spin_lock+0x8/0x18 (hrtimer_interrupt+0x6e/0x1b0 )
 swapper 0 0 9 0002 [f3675f48] 1595.135ms (+0.003ms): _spin_lock+0x8/0x18 (hrtimer_interrupt+0x6e/0x1b0 )

The "trace" file is not affected by the verbose mode, but is by the symonly.

 echo "nosymonly" > /debugfs/tracing/iter_ctrl

tracer:
[ 81.479967] CPU 0: bash:3154 register_mcount_function+0x5f/0x66 <-- _spin_unlock_irqrestore+0xe/0x5a
[ 81.479967] CPU 0: bash:3154 _spin_unlock_irqrestore+0x3e/0x5a <-- sub_preempt_count+0xc/0x7a
[ 81.479968] CPU 0: bash:3154 sub_preempt_count+0x30/0x7a <-- in_lock_functions+0x9/0x24
[ 81.479968] CPU 0: bash:3154 vfs_write+0x11d/0x155 <-- dnotify_parent+0x12/0x78
[ 81.479968] CPU 0: bash:3154 dnotify_parent+0x2d/0x78 <-- _spin_lock+0xe/0x70
[ 81.479969] CPU 0: bash:3154 _spin_lock+0x1b/0x70 <-- add_preempt_count+0xe/0x77
[ 81.479969] CPU 0: bash:3154 add_preempt_count+0x3e/0x77 <-- in_lock_functions+0x9/0x24

 echo "symonly" > /debugfs/tracing/iter_ctrl

tracer:
[ 81.479913] CPU 0: bash:3154 register_mcount_function+0x5f/0x66 <-- _spin_unlock_irqrestore+0xe/0x5a
[ 81.479913] CPU 0: bash:3154 _spin_unlock_irqrestore+0x3e/0x5a <-- sub_preempt_count+0xc/0x7a
[ 81.479913] CPU 0: bash:3154 sub_preempt_count+0x30/0x7a <-- in_lock_functions+0x9/0x24
[ 81.479914] CPU 0: bash:3154 vfs_write+0x11d/0x155 <-- dnotify_parent+0x12/0x78
[ 81.479914] CPU 0: bash:3154 dnotify_parent+0x2d/0x78 <-- _spin_lock+0xe/0x70
[ 81.479914] CPU 0: bash:3154 _spin_lock+0x1b/0x70 <-- add_preempt_count+0xe/0x77
[ 81.479914] CPU 0: bash:3154 add_preempt_count+0x3e/0x77 <-- in_lock_functions+0x9/0x24

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]>
---
 lib/Makefile                 |    1
 lib/tracing/Kconfig          |   15
 lib/tracing/Makefile         |    3
 lib/tracing/trace_function.c |   72 ++
 lib/tracing/tracer.c         | 1160 +++
[PATCH 07/22 -v7] initialize the clock source to jiffies clock.
The latency tracer can call clocksource_read very early in bootup, before the clock source variable has been initialized. This results in a crash at boot up (even before earlyprintk is initialized), since the clock->read pointer is NULL.

This patch simply initializes the clock to use clocksource_jiffies, so that any early user of clocksource_read will not crash.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
Acked-by: John Stultz <[EMAIL PROTECTED]>
---
 include/linux/clocksource.h |    3 +++
 kernel/time/timekeeping.c   |    9 +++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

Index: linux-mcount.git/include/linux/clocksource.h
===
--- linux-mcount.git.orig/include/linux/clocksource.h	2008-01-25 21:47:09.0 -0500
+++ linux-mcount.git/include/linux/clocksource.h	2008-01-25 21:47:11.0 -0500
@@ -273,6 +273,9 @@ extern struct clocksource* clocksource_g
 extern void clocksource_change_rating(struct clocksource *cs, int rating);
 extern void clocksource_resume(void);
 
+/* used to initialize clock */
+extern struct clocksource clocksource_jiffies;
+
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL
 extern void update_vsyscall(struct timespec *ts, struct clocksource *c);
 extern void update_vsyscall_tz(void);
Index: linux-mcount.git/kernel/time/timekeeping.c
===
--- linux-mcount.git.orig/kernel/time/timekeeping.c	2008-01-25 21:47:09.0 -0500
+++ linux-mcount.git/kernel/time/timekeeping.c	2008-01-25 21:47:11.0 -0500
@@ -53,8 +53,13 @@ static inline void update_xtime_cache(u6
 	timespec_add_ns(&xtime_cache, nsec);
 }
 
-static struct clocksource *clock; /* pointer to current clocksource */
-
+/*
+ * pointer to current clocksource
+ *  Just in case we use clocksource_read before we initialize
+ *  the actual clock source. Instead of calling a NULL read pointer
+ *  we return jiffies.
+ */
+static struct clocksource *clock = &clocksource_jiffies;
 
 #ifdef CONFIG_GENERIC_TIME
 /**
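The idea of the fix, pointing the "current clock" at a always-valid fallback instead of leaving it NULL, can be shown with a small user-space sketch. All sketch_ names are hypothetical stand-ins; the real struct clocksource has more fields and its read callback takes different arguments depending on kernel version.

```c
#include <stddef.h>

/* Cut-down stand-in for struct clocksource. */
struct sketch_clocksource {
	const char *name;
	unsigned long long (*read)(void);
};

/* Stand-in for the jiffies counter, which exists from the very start. */
static unsigned long long sketch_jiffies;

static unsigned long long sketch_jiffies_read(void)
{
	return sketch_jiffies;
}

static struct sketch_clocksource sketch_clocksource_jiffies = {
	"jiffies", sketch_jiffies_read
};

/*
 * Statically initialized to the fallback, so a read that happens before
 * any real clock source is selected never goes through a NULL pointer.
 */
static struct sketch_clocksource *sketch_clock = &sketch_clocksource_jiffies;

static unsigned long long sketch_clocksource_read(void)
{
	return sketch_clock->read();
}

/* Later, once a better source is available, the pointer is switched. */
static void sketch_select_clock(struct sketch_clocksource *cs)
{
	sketch_clock = cs;
}
```

The alternative, a NULL check on every read, would add a branch to a hot path that the tracer calls constantly; a static initializer costs nothing at run time.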
[PATCH 11/22 -v7] Add context switch marker to sched.c
Add marker into context_switch to record the prev and next tasks.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 kernel/sched.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-mcount.git/kernel/sched.c
===
--- linux-mcount.git.orig/kernel/sched.c	2008-01-25 21:46:55.0 -0500
+++ linux-mcount.git/kernel/sched.c	2008-01-25 21:47:19.0 -0500
@@ -2198,6 +2198,8 @@ context_switch(struct rq *rq, struct tas
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
+	trace_mark(kernel_sched_schedule,
+		   "prev %p next %p", prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
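How a probe might consume the two pointers passed by this marker can be sketched in plain C. This is not the kernel marker API: the sketch_ names are hypothetical, and the marker is reduced to a function pointer that is NULL while disarmed. What it does share with the series is the pattern of a probe pulling its arguments off a va_list in the order given by the format string.

```c
#include <stdarg.h>
#include <stddef.h>

/* Minimal stand-in for struct task_struct. */
struct sketch_task {
	int pid;
};

typedef void (*sketch_probe_fn)(const char *fmt, va_list ap);

/* NULL means the marker is disarmed and costs only a pointer test. */
static sketch_probe_fn sketch_sched_probe;

/* Stand-in for the trace_mark() call site in context_switch(). */
static void sketch_mark_sched_switch(const char *fmt, ...)
{
	va_list ap;

	if (!sketch_sched_probe)
		return;

	va_start(ap, fmt);
	sketch_sched_probe(fmt, ap);
	va_end(ap);
}

/* Last switch seen by the probe below; -1 means nothing recorded yet. */
static int sketch_last_prev_pid = -1;
static int sketch_last_next_pid = -1;

/*
 * Probe: must read the arguments in the same order and with the same
 * types as the "prev %p next %p" format at the marker site.
 */
static void sketch_record_switch(const char *fmt, va_list ap)
{
	struct sketch_task *prev = va_arg(ap, struct sketch_task *);
	struct sketch_task *next = va_arg(ap, struct sketch_task *);

	(void)fmt;
	sketch_last_prev_pid = prev->pid;
	sketch_last_next_pid = next->pid;
}
```

Arming the marker is then a single pointer assignment, for example `sketch_sched_probe = sketch_record_switch;`.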
[PATCH 19/22 -v7] trace preempt off critical timings
Add preempt off timings. A lot of kernel core code is taken from the RT patch latency trace that was written by Ingo Molnar. This adds "preemptoff" and "preemptirqsoff" to /debugfs/tracing/available_tracers Now instead of just tracing irqs off, preemption off can be selected to be recorded. When this is selected, it shares the same files as irqs off timings. One can either trace preemption off, irqs off, or one or the other off. By echoing "preemptoff" into /debugfs/tracing/current_tracer, recording of preempt off only is performed. "irqsoff" will only record the time irqs are disabled, but "preemptirqsoff" will take the total time irqs or preemption are disabled. Runtime switching of these options is now supported by simpling echoing in the appropriate trace name into /debugfs/tracing/current_tracer. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- arch/x86/kernel/process_32.c |3 include/linux/irqflags.h |3 include/linux/mcount.h |8 + include/linux/preempt.h |2 kernel/sched.c | 24 + lib/tracing/Kconfig | 25 + lib/tracing/Makefile |1 lib/tracing/trace_irqsoff.c | 183 +++ 8 files changed, 196 insertions(+), 53 deletions(-) Index: linux-mcount.git/lib/tracing/Kconfig === --- linux-mcount.git.orig/lib/tracing/Kconfig 2008-01-29 15:05:34.0 -0500 +++ linux-mcount.git/lib/tracing/Kconfig2008-01-29 15:22:31.0 -0500 @@ -46,6 +46,31 @@ config CRITICAL_IRQSOFF_TIMING echo 0 > /debugfs/tracing/tracing_max_latency + (Note that kernel size and overhead increases with this option + enabled. This option and the preempt-off timing option can be + used together or separately.) + +config CRITICAL_PREEMPT_TIMING + bool "Preemption-off critical section latency timing" + default n + depends on GENERIC_TIME + depends on PREEMPT + select TRACING + select TRACER_MAX_TRACE + help + This option measures the time spent in preemption off critical + sections, with microsecond accuracy. 
+ + The default measurement method is a maximum search, which is + disabled by default and can be runtime (re-)started + via: + + echo 0 > /debugfs/tracing/tracing_max_latency + + (Note that kernel size and overhead increases with this option + enabled. This option and the irqs-off timing option can be + used together or separately.) + config WAKEUP_TRACER bool "Trace wakeup latencies" depends on DEBUG_KERNEL Index: linux-mcount.git/lib/tracing/Makefile === --- linux-mcount.git.orig/lib/tracing/Makefile 2008-01-29 15:05:34.0 -0500 +++ linux-mcount.git/lib/tracing/Makefile 2008-01-29 15:22:31.0 -0500 @@ -4,6 +4,7 @@ obj-$(CONFIG_TRACING) += tracer.o obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o obj-$(CONFIG_FUNCTION_TRACER) += trace_function.o obj-$(CONFIG_CRITICAL_IRQSOFF_TIMING) += trace_irqsoff.o +obj-$(CONFIG_CRITICAL_PREEMPT_TIMING) += trace_irqsoff.o obj-$(CONFIG_WAKEUP_TRACER) += trace_wakeup.o libmcount-y := mcount.o Index: linux-mcount.git/lib/tracing/trace_irqsoff.c === --- linux-mcount.git.orig/lib/tracing/trace_irqsoff.c 2008-01-29 15:05:34.0 -0500 +++ linux-mcount.git/lib/tracing/trace_irqsoff.c2008-01-29 15:25:28.0 -0500 @@ -21,6 +21,34 @@ static struct tracing_trace *tracer_trac static __cacheline_aligned_in_smp DEFINE_MUTEX(max_mutex); static int trace_enabled __read_mostly; +static DEFINE_PER_CPU(int, tracing_cpu); + +enum { + TRACER_IRQS_OFF = (1 << 1), + TRACER_PREEMPT_OFF = (1 << 2), +}; + +static int trace_type __read_mostly; + +#ifdef CONFIG_CRITICAL_PREEMPT_TIMING +# define preempt_trace() \ + ((trace_type & TRACER_PREEMPT_OFF) && preempt_count()) +#else +# define preempt_trace() (0) +#endif + +#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING +# define irq_trace() \ + ((trace_type & TRACER_IRQS_OFF) && \ +({ \ +unsigned long __flags; \ +local_save_flags(__flags); \ +irqs_disabled_flags(__flags); \ +})) +#else +# define irq_trace() (0) +#endif + /* * Sequence count - we record it when starting a measurement and * skip the latency if the 
sequence has changed - some other section @@ -41,14 +69,11 @@ static void notrace irqsoff_trace_call(u unsigned long flags; int cpu; - if (likely(!trace_enabled)) + if (likely(!__get_cpu_var(tracing_cpu))) return; local_save_flags(flags); - if (!irqs_disable
[PATCH 22/22 -v7] Critical latency timings histogram
This patch adds hooks into the latency tracer to give us histograms of interrupts off, preemption off and wakeup timings. This code was based off of work done by Yi Yang <[EMAIL PROTECTED]> But heavily modified to work with the new tracer, and some clean ups by Steven Rostedt <[EMAIL PROTECTED]> This adds the following to /debugfs/tracing latency_hist/ - root dir for historgrams. Under latency_hist there is (depending on what's configured): interrupt_off_latency/ - latency histograms of interrupts off. preempt_interrupts_off_latency/ - latency histograms of preemption and/or interrupts off. preempt_off_latency/ - latency histograms of preemption off. wakeup_latency/ - latency histograms of wakeup timings. Under each of the above is a file labeled: CPU# for each possible CPU were # is the CPU number. reset - writing into this file will reset the histogram back to zeros and start again. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- lib/tracing/Kconfig | 20 + lib/tracing/Makefile|4 lib/tracing/trace_irqsoff.c | 19 + lib/tracing/trace_wakeup.c | 21 + lib/tracing/tracer_hist.c | 514 lib/tracing/tracer_hist.h | 39 +++ 6 files changed, 613 insertions(+), 4 deletions(-) Index: linux-mcount.git/lib/tracing/Kconfig === --- linux-mcount.git.orig/lib/tracing/Kconfig 2008-01-29 21:34:14.0 -0500 +++ linux-mcount.git/lib/tracing/Kconfig2008-01-29 21:34:30.0 -0500 @@ -102,3 +102,23 @@ config CONTEXT_SWITCH_TRACER This tracer hooks into the context switch and records all switching of tasks. +config INTERRUPT_OFF_HIST + bool "Interrupts off critical timings histogram" + depends on CRITICAL_IRQSOFF_TIMING + help + This option uses the infrastructure of the critical + irqs off timings to create a histogram of latencies. + +config PREEMPT_OFF_HIST + bool "Preempt off critical timings histogram" + depends on CRITICAL_PREEMPT_TIMING + help + This option uses the infrastructure of the critical + preemption off timings to create a histogram of latencies. 
+ +config WAKEUP_LATENCY_HIST + bool "Interrupts off critical timings histogram" + depends on WAKEUP_TRACER + help + This option uses the infrastructure of the wakeup tracer + to create a histogram of latencies. Index: linux-mcount.git/lib/tracing/Makefile === --- linux-mcount.git.orig/lib/tracing/Makefile 2008-01-29 21:34:14.0 -0500 +++ linux-mcount.git/lib/tracing/Makefile 2008-01-29 21:34:30.0 -0500 @@ -8,4 +8,8 @@ obj-$(CONFIG_CRITICAL_PREEMPT_TIMING) += obj-$(CONFIG_WAKEUP_TRACER) += trace_wakeup.o obj-$(CONFIG_EVENT_TRACER) += trace_events.o +obj-$(CONFIG_INTERRUPT_OFF_HIST) += tracer_hist.o +obj-$(CONFIG_PREEMPT_OFF_HIST) += tracer_hist.o +obj-$(CONFIG_WAKEUP_LATENCY_HIST) += tracer_hist.o + libmcount-y := mcount.o Index: linux-mcount.git/lib/tracing/trace_irqsoff.c === --- linux-mcount.git.orig/lib/tracing/trace_irqsoff.c 2008-01-29 21:34:14.0 -0500 +++ linux-mcount.git/lib/tracing/trace_irqsoff.c2008-01-29 21:34:30.0 -0500 @@ -16,6 +16,7 @@ #include #include "tracer.h" +#include "tracer_hist.h" static struct tracing_trace *tracer_trace __read_mostly; static __cacheline_aligned_in_smp DEFINE_MUTEX(max_mutex); @@ -261,10 +262,14 @@ void notrace start_critical_timings(void { if (preempt_trace() || irq_trace()) start_critical_timing(CALLER_ADDR0, 0); + + tracing_hist_preempt_start(); } void notrace stop_critical_timings(void) { + tracing_hist_preempt_stop(TRACE_STOP); + if (preempt_trace() || irq_trace()) stop_critical_timing(CALLER_ADDR0, 0); } @@ -273,6 +278,8 @@ void notrace stop_critical_timings(void) #ifdef CONFIG_LOCKDEP void notrace time_hardirqs_on(unsigned long a0, unsigned long a1) { + tracing_hist_preempt_stop(1); + if (!preempt_trace() && irq_trace()) stop_critical_timing(a0, a1); } @@ -281,6 +288,8 @@ void notrace time_hardirqs_off(unsigned { if (!preempt_trace() && irq_trace()) start_critical_timing(a0, a1); + + tracing_hist_preempt_start(); } #else /* !CONFIG_LOCKDEP */ @@ -314,6 +323,8 @@ inline void print_irqtrace_events(struct */ void notrace 
trace_hardirqs_on(void) { + tracing_hist_preempt_stop(1); + if (!preempt_trace() && irq_trace()) stop_critical_timing(CALLER_ADDR0, 0); } @@ -323,11 +334,15 @@ void notrace trace_hardirqs_off(void) { if (!preempt_trace() && irq_trace()) start_cr
[PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
If CONFIG_MCOUNT is selected and /proc/sys/kernel/mcount_enabled is set to a non-zero value the mcount routine will be called everytime we enter a kernel function that is not marked with the "notrace" attribute. The mcount routine will then call a registered function if a function happens to be registered. [This code has been highly hacked by Steven Rostedt, so don't blame Arnaldo for all of this ;-) ] Update: It is now possible to register more than one mcount function. If only one mcount function is registered, that will be the function that mcount calls directly. If more than one function is registered, then mcount will call a function that will loop through the functions to call. Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]> Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- Makefile |3 arch/x86/Kconfig |1 arch/x86/kernel/entry_32.S | 25 +++ arch/x86/kernel/entry_64.S | 36 +++ include/linux/linkage.h|2 include/linux/mcount.h | 38 kernel/sysctl.c| 11 +++ lib/Kconfig.debug |1 lib/Makefile |2 lib/tracing/Kconfig| 10 +++ lib/tracing/Makefile |3 lib/tracing/mcount.c | 141 + 12 files changed, 273 insertions(+) Index: linux-mcount.git/Makefile === --- linux-mcount.git.orig/Makefile 2008-01-29 17:01:56.0 -0500 +++ linux-mcount.git/Makefile 2008-01-29 17:26:17.0 -0500 @@ -509,6 +509,9 @@ endif include $(srctree)/arch/$(SRCARCH)/Makefile +ifdef CONFIG_MCOUNT +KBUILD_CFLAGS += -pg +endif ifdef CONFIG_FRAME_POINTER KBUILD_CFLAGS += -fno-omit-frame-pointer -fno-optimize-sibling-calls else Index: linux-mcount.git/arch/x86/Kconfig === --- linux-mcount.git.orig/arch/x86/Kconfig 2008-01-29 16:59:15.0 -0500 +++ linux-mcount.git/arch/x86/Kconfig 2008-01-29 17:26:18.0 -0500 @@ -19,6 +19,7 @@ config X86_64 config X86 bool default y + select HAVE_MCOUNT config GENERIC_TIME bool Index: linux-mcount.git/arch/x86/kernel/entry_32.S === --- linux-mcount.git.orig/arch/x86/kernel/entry_32.S2008-01-29 16:59:15.0 -0500 +++ linux-mcount.git/arch/x86/kernel/entry_32.S 
2008-01-29 17:26:18.0 -0500 @@ -75,6 +75,31 @@ DF_MASK = 0x0400 NT_MASK= 0x4000 VM_MASK= 0x0002 +#ifdef CONFIG_MCOUNT +.globl mcount +mcount: + /* unlikely(mcount_enabled) */ + cmpl $0, mcount_enabled + jnz trace + ret + +trace: + /* taken from glibc */ + pushl %eax + pushl %ecx + pushl %edx + movl 0xc(%esp), %edx + movl 0x4(%ebp), %eax + + call *mcount_trace_function + + popl %edx + popl %ecx + popl %eax + + ret +#endif + #ifdef CONFIG_PREEMPT #define preempt_stop(clobbers) DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF #else Index: linux-mcount.git/arch/x86/kernel/entry_64.S === --- linux-mcount.git.orig/arch/x86/kernel/entry_64.S2008-01-29 16:59:15.0 -0500 +++ linux-mcount.git/arch/x86/kernel/entry_64.S 2008-01-29 17:26:18.0 -0500 @@ -53,6 +53,42 @@ .code64 +#ifdef CONFIG_MCOUNT + +ENTRY(mcount) + /* unlikely(mcount_enabled) */ + cmpl $0, mcount_enabled + jnz trace + retq + +trace: + /* taken from glibc */ + subq $0x38, %rsp + movq %rax, (%rsp) + movq %rcx, 8(%rsp) + movq %rdx, 16(%rsp) + movq %rsi, 24(%rsp) + movq %rdi, 32(%rsp) + movq %r8, 40(%rsp) + movq %r9, 48(%rsp) + + movq 0x38(%rsp), %rsi + movq 8(%rbp), %rdi + + call *mcount_trace_function + + movq 48(%rsp), %r9 + movq 40(%rsp), %r8 + movq 32(%rsp), %rdi + movq 24(%rsp), %rsi + movq 16(%rsp), %rdx + movq 8(%rsp), %rcx + movq (%rsp), %rax + addq $0x38, %rsp + + retq +#endif + #ifndef CONFIG_PREEMPT #define retint_kernel retint_restore_args #endif Index: linux-mcount.git/include/linux/linkage.h === --- linux-mcount.git.orig/include/linux/linkage.h 2008-01-29 16:59:15.0 -0500 +++ linux-mcount.git/include/linux/linkage.h2008-01-29 17:26:18.0 -0500 @@ -3,6 +3,8 @@ #include +#define notrace __attribute__((no_instrument_function)) + #ifdef __cplusplus #define CPP_ASMLINKAGE extern "C" #else Index: linux-mcount.git/include/linux/mcount.h === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-mcount.git
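The dispatch rule from the changelog (call a single registered function directly, go through a loop once more than one is registered) can be modelled in plain userspace C. This is only a sketch: every name below is hypothetical and none is taken from lib/tracing/mcount.c, and the locking and memory barriers a kernel version would need are left out.

```c
#include <stddef.h>

/* Hypothetical userspace model; these names are not from the patch. */
typedef void (*trace_fn)(unsigned long ip, unsigned long parent_ip);

struct trace_ops {
    trace_fn func;
    struct trace_ops *next;
};

static int hits_a, hits_b;  /* counters for the two sample callbacks */

static void count_a(unsigned long ip, unsigned long parent_ip)
{
    (void)ip; (void)parent_ip;
    hits_a++;
}

static void count_b(unsigned long ip, unsigned long parent_ip)
{
    (void)ip; (void)parent_ip;
    hits_b++;
}

static void trace_stub(unsigned long ip, unsigned long parent_ip)
{
    (void)ip; (void)parent_ip;  /* nothing registered: do nothing */
}

static struct trace_ops *ops_list;       /* all registered callbacks */
static trace_fn active_fn = trace_stub;  /* what the entry hook calls */

/* Used only when more than one callback is registered. */
static void call_all(unsigned long ip, unsigned long parent_ip)
{
    for (struct trace_ops *op = ops_list; op != NULL; op = op->next)
        op->func(ip, parent_ip);
}

static void register_trace(struct trace_ops *ops)
{
    ops->next = ops_list;
    ops_list = ops;
    /* One callback: call it directly. Several: go through the loop. */
    active_fn = ops->next ? call_all : ops->func;
}

/* Stand-in for the function-entry hook. */
static void hook(unsigned long ip, unsigned long parent_ip)
{
    active_fn(ip, parent_ip);
}
```

The point of the direct call is that the common single-tracer case pays for one indirect call only, and the list walk is paid for only when it is needed.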
[PATCH 01/22 -v7] printk - dont wakeup klogd with interrupts disabled
[ This patch is added to the series since the wakeup timings trace may lockup without it. ] I thought that one could place a printk anywhere without worrying. But it seems that it is not wise to place a printk where the runqueue lock is held. I just spent two hours debugging why some of my code was locking up, to find that the lockup was caused by some debugging printk's that I had in the scheduler. The printk's were only in rare paths so they shouldn't be too much of a problem, but after I hit the printk the system locked up. Thinking that it was locking up on my code I went looking down the wrong path. I finally found (after examining an NMI dump) that the lockup happened because printk was trying to wakeup the klogd daemon, which caused a deadlock when the try_to_wakeup code tries to grab the runqueue lock. This patch adds a runqueue_is_locked interface in sched.c for other files to see if the current runqueue lock is held. This is used in printk to determine whether it is safe or not to wakeup the klogd. And with this patch, my code ran fine ;-) Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- include/linux/sched.h |2 ++ kernel/printk.c | 14 ++ kernel/sched.c| 18 ++ 3 files changed, 30 insertions(+), 4 deletions(-) Index: linux-mcount.git/kernel/printk.c === --- linux-mcount.git.orig/kernel/printk.c 2008-01-29 17:02:10.0 -0500 +++ linux-mcount.git/kernel/printk.c2008-01-29 17:25:40.0 -0500 @@ -590,9 +590,11 @@ static int have_callable_console(void) * @fmt: format string * * This is printk(). It can be called from any context. We want it to work. - * Be aware of the fact that if oops_in_progress is not set, we might try to - * wake klogd up which could deadlock on runqueue lock if printk() is called - * from scheduler code. + * + * Note: if printk() is called with the runqueue lock held, it will not wake + * up the klogd. 
This is to avoid a deadlock from calling printk() in schedule + * with the runqueue lock held and having the wake_up grab the runqueue lock + * as well. * * We try to grab the console_sem. If we succeed, it's easy - we log the output and * call the console drivers. If we fail to get the semaphore we place the output @@ -1001,7 +1003,11 @@ void release_console_sem(void) console_locked = 0; up(&console_sem); spin_unlock_irqrestore(&logbuf_lock, flags); - if (wake_klogd) + /* +* If we try to wake up klogd while printing with the runqueue lock +* held, this will deadlock. +*/ + if (wake_klogd && !runqueue_is_locked()) wake_up_klogd(); } EXPORT_SYMBOL(release_console_sem); Index: linux-mcount.git/include/linux/sched.h === --- linux-mcount.git.orig/include/linux/sched.h 2008-01-29 17:02:10.0 -0500 +++ linux-mcount.git/include/linux/sched.h 2008-01-29 17:25:40.0 -0500 @@ -222,6 +222,8 @@ extern void sched_init_smp(void); extern void init_idle(struct task_struct *idle, int cpu); extern void init_idle_bootup_task(struct task_struct *idle); +extern int runqueue_is_locked(void); + extern cpumask_t nohz_cpu_mask; #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ) extern int select_nohz_load_balancer(int cpu); Index: linux-mcount.git/kernel/sched.c === --- linux-mcount.git.orig/kernel/sched.c2008-01-29 16:59:15.0 -0500 +++ linux-mcount.git/kernel/sched.c 2008-01-29 17:25:40.0 -0500 @@ -621,6 +621,24 @@ unsigned long rt_needs_cpu(int cpu) # define const_debug static const #endif +/** + * runqueue_is_locked + * + * Returns true if the current cpu runqueue is locked. + * This interface allows printk to be called with the runqueue lock + * held and know whether or not it is OK to wake up the klogd. 
+ */ +int runqueue_is_locked(void) +{ + int cpu = get_cpu(); + struct rq *rq = cpu_rq(cpu); + int ret; + + ret = spin_is_locked(&rq->lock); + put_cpu(); + return ret; +} + /* * Debugging: various feature bits */ -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
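The idea of the patch, skipping a wakeup that would need a lock which is already held, can be shown with a small single-threaded model. This is a sketch under stated assumptions: fake_lock, wake_logger() and finish_log() are hypothetical stand-ins for the runqueue lock, wake_up_klogd() and the tail of release_console_sem(), and no real locking is done.

```c
#include <stdbool.h>

/* Hypothetical single-threaded model; "rq_lock" stands in for the
 * runqueue lock and is just a flag here. */
struct fake_lock {
    bool held;
};

static struct fake_lock rq_lock;
static int wakeups;  /* how many times the logger was woken */

static void wake_logger(void)
{
    /* The real wakeup path takes the runqueue lock; entering it with
     * the lock already held is the deadlock the patch avoids. */
    rq_lock.held = true;
    wakeups++;
    rq_lock.held = false;
}

/* Returns true if the logger was woken, false if the wakeup was skipped. */
static bool finish_log(bool want_wake)
{
    if (!want_wake)
        return false;
    if (rq_lock.held)  /* same test as !runqueue_is_locked() in the patch */
        return false;
    wake_logger();
    return true;
}
```

Note that a check of this kind only says that the lock is held, not who holds it, so a wakeup can also be skipped when some other context holds the lock.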
[PATCH 15/22 -v7] trace generic call to schedule switch
This patch adds hooks into the schedule switch tracing to allow other latency traces to hook into the schedule switches. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- lib/tracing/trace_sched_switch.c | 123 +-- lib/tracing/tracer.h | 14 2 files changed, 119 insertions(+), 18 deletions(-) Index: linux-mcount.git/lib/tracing/tracer.h === --- linux-mcount.git.orig/lib/tracing/tracer.h 2008-01-29 12:35:27.0 -0500 +++ linux-mcount.git/lib/tracing/tracer.h 2008-01-29 14:22:15.0 -0500 @@ -113,4 +113,18 @@ static inline notrace cycle_t now(void) return get_monotonic_cycles(); } +#ifdef CONFIG_CONTEXT_SWITCH_TRACER +typedef void (*tracer_switch_func_t)(void *private, +struct task_struct *prev, +struct task_struct *next); +struct tracer_switch_ops { + tracer_switch_func_t func; + void *private; + struct tracer_switch_ops *next; +}; + +extern int register_tracer_switch(struct tracer_switch_ops *ops); +extern int unregister_tracer_switch(struct tracer_switch_ops *ops); +#endif /* CONFIG_CONTEXT_SWITCH_TRACER */ + #endif /* _LINUX_MCOUNT_TRACER_H */ Index: linux-mcount.git/lib/tracing/trace_sched_switch.c === --- linux-mcount.git.orig/lib/tracing/trace_sched_switch.c 2008-01-29 12:35:40.0 -0500 +++ linux-mcount.git/lib/tracing/trace_sched_switch.c 2008-01-29 14:24:35.0 -0500 @@ -18,33 +18,21 @@ static struct tracing_trace *tracer_trac static int trace_enabled __read_mostly; static atomic_t sched_ref; int tracing_sched_switch_enabled __read_mostly; +static DEFINE_SPINLOCK(sched_switch_func_lock); -static notrace void sched_switch_callback(const struct marker *mdata, - void *private_data, - const char *format, ...) 
+static void notrace sched_switch_func(void *private, + struct task_struct *prev, + struct task_struct *next) { - struct tracing_trace **p = mdata->private; - struct tracing_trace *tr = *p; + struct tracing_trace **ptr = private; + struct tracing_trace *tr = *ptr; struct tracing_trace_cpu *data; - struct task_struct *prev; - struct task_struct *next; unsigned long flags; - va_list ap; int cpu; - if (!atomic_read(&sched_ref)) - return; - - tracing_record_cmdline(current); - if (!trace_enabled) return; - va_start(ap, format); - prev = va_arg(ap, typeof(prev)); - next = va_arg(ap, typeof(next)); - va_end(ap); - raw_local_irq_save(flags); cpu = raw_smp_processor_id(); data = tr->data[cpu]; @@ -57,6 +45,105 @@ static notrace void sched_switch_callbac raw_local_irq_restore(flags); } +static struct tracer_switch_ops sched_switch_ops __read_mostly = +{ + .func = sched_switch_func, + .private = &tracer_trace, +}; + +static tracer_switch_func_t tracer_switch_func __read_mostly = + sched_switch_func; + +static struct tracer_switch_ops *tracer_switch_func_ops __read_mostly = + &sched_switch_ops; + +static void notrace sched_switch_func_loop(void *private, + struct task_struct *prev, + struct task_struct *next) +{ + struct tracer_switch_ops *ops = tracer_switch_func_ops; + + for (; ops != NULL; ops = ops->next) + ops->func(ops->private, prev, next); +} + +notrace int register_tracer_switch(struct tracer_switch_ops *ops) +{ + unsigned long flags; + + spin_lock_irqsave(&sched_switch_func_lock, flags); + ops->next = tracer_switch_func_ops; + smp_wmb(); + tracer_switch_func_ops = ops; + + if (ops->next == &sched_switch_ops) + tracer_switch_func = sched_switch_func_loop; + + spin_unlock_irqrestore(&sched_switch_func_lock, flags); + + return 0; +} + +notrace int unregister_tracer_switch(struct tracer_switch_ops *ops) +{ + unsigned long flags; + struct tracer_switch_ops **p = &tracer_switch_func_ops; + int ret; + + spin_lock_irqsave(&sched_switch_func_lock, flags); + + /* +* If the 
sched_switch is the only one left, then +* only call that function. +*/ + if (*p == ops && ops->next == &sched_switch_ops) { + tracer_switch_func = sched_switch_func; + tracer_switch_func_ops = &sched_switch_ops; + goto out; + } + + for (; *p != &sched_switch_ops; p = &(*p)->next) + if (*p == ops) + break; + + if (*p != ops) { +
Re: [PATCH 4/4] x86_64: increse MAX_EARLY_RES for NODE_DATA and bootmap
On Tuesday 29 January 2008 06:57:54 pm Andi Kleen wrote:
> On Tuesday 29 January 2008 20:16, Yinghai Lu wrote:
> > [PATCH 4/4] x86_64: increse MAX_EARLY_RES for NODE_DATA and bootmap
> >
> > otherwise early_node_mem will use up these for 8 nodes system
>
> Yes this was the problem with my early_reserve node bootmem patch.
> It adds a node limit.
>
> But even with increasing the limit is far too small. Probably best to not
> use the patch. In theory it should not have been needed anyways because
> there is no need to reserve here because there are no interfering users.
>
> Whatever your problem is it needs to be solved differently.

ok, discard 3, and 4.

how about 2 v2?

YH
Re: [PATCH 2/2] x86_64: make early_node_mem return align address
On Tuesday 29 January 2008 06:55:45 pm Andi Kleen wrote:
> On Tuesday 29 January 2008 18:41, Yinghai Lu wrote:
> > On Tuesday 29 January 2008 01:33:29 am Andi Kleen wrote:
> > > On Tuesday 29 January 2008 10:05, Yinghai Lu wrote:
> > > > [PATCH 2/2] x86_64: make early_node_mem return align address
> > > >
> > > > boot oops when system get 64g or 128g installed
> > >
> > > Probably it should just use reserve_early(). Does this patch work?
> > >
> > > The alignment change is needed at some point too, but only to
> > > relax the alignment to not force all early allocations to be page
> > > padded.
> >
> > No, my patch doesn't force all early allocations to be page padded.
> > for find_e820_mem, i just change PAGE_ALIGN to be aligned align
> > parameter
>
> They are already all PAGE_ALIGN()ed (which is too strict, but needs
> some care to fix properly), but your patch uses it the wrong way.
> The PAGE_ALIGNment was added some time ago to avoid such
> overlapping, but it should not actually be needed for that anymore.
>
> >
> > only make early_node_mem have aligned data. because it seems it like
> > to...and assume that.
>
> Using alignment doesn't seem the correct way to avoid overlapping.
>
> If there is still overlap then some reservation needs to be extended.
>
> > I think your patch will get early panic about overlap between bss and
> > bootmem... like the 256g machine, bss is overlapped with early page
> > table...
>
> Well did you test it?
>
> bss should have been reserved by this line in head64.c
>
>         reserve_early(__pa_symbol(&_text), __pa_symbol(&_end));
>
> (in git-x86). In earlier kernels it was checked for explicitly by the e820
> allocator.

no early panic. but the bss end still get corrupted. because bootmap_start is used as
Re: [PATCH 2.6.24] x86: add sysfs interface for cpuid module
On Tue, 2008-01-29 at 07:51 -0800, H. Peter Anvin wrote:
> Yi Yang wrote:
> > Current cpuid module will create a char device for every logical cpu,
> > when a user cats /dev/cpu/*/cpuid, he/she will enter a limitless loop,
> > the root cause is that cpuid module doesn't decide whether a cpuid level
> > is valid, it just uses an offset to denote cpuid level and take it to
> > cpuid instruction, cpuid instruction will ignore it and return some data
> > specific to cpu model, cpuid doesn't have an error return value because it is
> > void type. So cpuid module will execute cpuid continuously and return
> > data although most of data make no sense.
> >
> > This patch tries to add a sysfs interface for cpuid, users can see all the
> > available cpuid levels, specify a specific level and get cpuid corresponding
> > to this cpuid level.
> >
> > For every logical cpu, this patch will create a cpuid directory under
> > /sys/devices/system/cpu/cpu*/, there are three entries under cpuid:
> >
> > avail_levels  cur_level  cur_cpuid
> >
> > A user can get all the available cpuid levels from avail_levels, he/she can
> > set one available cpuid level to cur_level, then he/she can get cpuid from
> > cur_cpuid, cur_cpuid corresponds to cur_level.
> >
> > This patch uses sysfs to avoid limitless loop and provide more flexible
> > interface for cpuid, please consider to merge to -mm tree in order to test.
>
> This is broken.
>
> Triple broken.
>
> It's broken, because it doesn't take into account the fact that Intel
> broke CPUID level 4 and made it "repeating" (neither did the cpuid char
> device, because it predated the Intel braindamage; I've had a patch for
> it privately for a while, but didn't push it upstream because paravirt
> broke it royally and I wanted the situation to settle down.)
>
> It's broken, because the algorithm used to determine valid CPUID levels
> is incorrect; it fails to recognize any CPUID levels other than the main
> Intel and AMD ones, e.g. the Transmeta 0x80860000 (and sometimes more)
> and VIA 0xc0000000 levels.

Thank you for pointing out these issues. I think we can let users input
any cpuid level and output the corresponding cpuid; in this way we can
avoid having to consider cpu differences and leave this to userspace.

We can also consider all the x86 platforms to do cpuid for every one.

>
> It's broken, because it is better for the userspace extractor to have
> this logic than to stuff it into the kernel, where it sits hogging
> unswappable memory at all times.

It seems not to be very appropriate to let user space consider hardware
details. /proc/cpuinfo should be an example to justify this.

Is there any user application using /dev/cpu/*/cpuid? If not, I think it is
feasible to provide an interface in the kernel. I noticed an application
cpu-z on Windows, maybe we can clone it on Linux.

>
> -hpa
Re: [PATCH] Fix NUMA emulation for x86_64
Ingo Molnar <[EMAIL PROTECTED]> writes:

>> * Minoru Usui <[EMAIL PROTECTED]> wrote:
>> I found a small bug of NUMA emulation code for x86_64.
>> (CONFIG_NUMA_EMU) If machine is non-NUMA, find_node_by_addr() should
>> return NUMA_NO_NODE, but current implementation code returns existent
>> maximum NUMA node number + 1. This is not existent NUMA node number.
>>
>> However, this behaviour does not affect NUMA emulation fortunately,
>> because acpi_fake_nodes() that is caller of find_node_by_addr() gets
>> pxm (proximity domain) by node_to_pxm() from non-existent NUMA node
>> number that was returned by find_node_by_addr(). node_to_pxm() returns
>> PXM_INVAL that means illegal or non-existent NUMA node number.

>> thanks, i have applied your fix to x86.git.

>> It seems this does not need to be backported to v2.6.24.1 because
>> node_to_pxm() masked the bad effects of this bug, right?

>> Ingo

I think this bug is not urgent.
If you mean that it's not necessary to release 2.6.24.1 only for this bug,
I think so.
Re: [PATCH 2/2] x86_64: make early_node_mem return align address
On Tuesday 29 January 2008 18:41, Yinghai Lu wrote:
> On Tuesday 29 January 2008 01:33:29 am Andi Kleen wrote:
> > On Tuesday 29 January 2008 10:05, Yinghai Lu wrote:
> > > [PATCH 2/2] x86_64: make early_node_mem return align address
> > >
> > > boot oops when system get 64g or 128g installed
> >
> > Probably it should just use reserve_early(). Does this patch work?
> >
> > The alignment change is needed at some point too, but only to
> > relax the alignment to not force all early allocations to be page
> > padded.
>
> No, my patch doesn't force all early allocations to be page padded.
> for find_e820_mem, i just change PAGE_ALIGN to be aligned align
> parameter

They are already all PAGE_ALIGN()ed (which is too strict, but needs
some care to fix properly), but your patch uses it the wrong way.
The PAGE_ALIGNment was added some time ago to avoid such
overlapping, but it should not actually be needed for that anymore.

>
> only make early_node_mem have aligned data. because it seems it like
> to...and assume that.

Using alignment doesn't seem the correct way to avoid overlapping.

If there is still overlap then some reservation needs to be extended.

> I think your patch will get early panic about overlap between bss and
> bootmem... like the 256g machine, bss is overlapped with early page
> table...

Well did you test it?

bss should have been reserved by this line in head64.c

        reserve_early(__pa_symbol(&_text), __pa_symbol(&_end));

(in git-x86). In earlier kernels it was checked for explicitly by the e820
allocator.

-Andi
Re: [PATCH 4/4] x86_64: increse MAX_EARLY_RES for NODE_DATA and bootmap
On Tuesday 29 January 2008 20:16, Yinghai Lu wrote:
> [PATCH 4/4] x86_64: increse MAX_EARLY_RES for NODE_DATA and bootmap
>
> otherwise early_node_mem will use up these for 8 nodes system

Yes this was the problem with my early_reserve node bootmem patch.
It adds a node limit.

But even with increasing the limit is far too small. Probably best to not
use the patch. In theory it should not have been needed anyways because
there is no need to reserve here because there are no interfering users.

Whatever your problem is it needs to be solved differently.

-Andi
Re: [PATCH] sata_nv: fix for completion handling
Tejun Heo wrote: Robert Hancock wrote: This patch is based on an original patch from Kuan Luo of NVIDIA, posted under subject "fixed a bug of adma in rhel4u5 with HDS7250SASUN500G". His description follows. I've reworked it a bit to avoid some unnecessary repeated checks but it should be functionally identical. "The patch is to solve the error message "ata1: CPB flags CMD err, flags=0x11" when testing HDS7250SASUN500G in rhel4u5. I tested this hd in 2.6.24-rc7 which needed to remove the mask in blacklist to run the ncq and the same error also showed up. I traced the bug and found that the interrupt finished a command (for example, tag=0) when the driver got that adma status is NV_ADMA_STAT_DONE and cpb->resp_flags is NV_CPB_RESP_DONE. However, For this hd, the drive maybe didn't clear bit 0 at this moment. It meaned the hardware had not completely finished the command. If at the same time the driver freed the command(tag 0) and sended another command (tag 0), the error happened. The notifier register is 32-bit register containing notifier value. Value is bit vector containing one bit per tag number (0-31) in corresponding bit positions (bit 0 is for tag 0, etc). When bit is set then ADMA indicates that command with corresponding tag number completed execution. So i added the check notifier code. Sometimes i saw that the notifier reg set some bits , but the adma status set NV_ADMA_STAT_CMD_COMPLETE ,not NV_ADMA_STAT_DONE. So i added the NV_ADMA_STAT_CMD_COMPLETE check code." Signed-off-by: Robert Hancock <[EMAIL PROTECTED]> Any chance this fixes the FLUSH problem? I could still reproduce that issue when I took the udelay(20) out of the driver. Others have seen that without taking it out, so I suspect some systems/drives are more sensitive to that for some reason. However, who knows, it may help some people with that problem. The symptoms of the problem dealt with here are different, not a command timeout it appears, but the controller reporting an error. 
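The notifier check described above can be pictured with a short, self-contained sketch: one bit per tag in a 32-bit value, and a command is treated as finished only when the status says so and its notifier bit is set. The status bit values and the helper names here are hypothetical and are not taken from sata_nv.c; the active-tag mask is also an assumption added for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status bits; the real register layout is not reproduced. */
#define SKETCH_STAT_DONE         (1u << 0)
#define SKETCH_STAT_CMD_COMPLETE (1u << 1)

/* Return a mask of tags that may be completed: bit N set means tag N.
 * Nothing is completed unless the status reports DONE or CMD_COMPLETE,
 * and then only tags whose notifier bit is set and that are in flight. */
static uint32_t tags_to_complete(uint32_t status, uint32_t notifier,
                                 uint32_t active_tags)
{
    if (!(status & (SKETCH_STAT_DONE | SKETCH_STAT_CMD_COMPLETE)))
        return 0;
    return notifier & active_tags;
}

/* Expand a tag mask into a list of tag numbers; returns the count. */
static size_t mask_to_tags(uint32_t mask, unsigned int tags[32])
{
    size_t n = 0;

    for (unsigned int tag = 0; tag < 32; tag++) {
        if (mask & (UINT32_C(1) << tag))
            tags[n++] = tag;
    }
    return n;
}
```

With this shape, a tag that the controller has not yet flagged in the notifier is left alone, so it cannot be freed and reissued while the hardware is still working on it.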
RE: SATA DOM is not identified by ata_piix module
Hmm... Does anybody own this bug? Best regards, Mao Rui > -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of Mao Rui > Sent: Friday, January 18, 2008 2:12 PM > To: 'Alan Cox' > Cc: linux-kernel@vger.kernel.org > Subject: RE: SATA DOM is not identified by ata_piix module > > I tried to nail down when the problem was introduced. I compiled some > official kernel release. Here is the result. > 2.6.17.14 IDE -- Failed SATA -- passed > 2.6.18 IDE -- Failed SATA -- failed > 2.6.18.8 IDE -- Failed SATA -- failed > 2.6.24-rc7-git6 IDE -- passed SATA DOM -- failed > linux-2.6.24-rc8-git1 IDE -- passedSATA -- failed > All IDE failed reason is xfermode error, and all SATA failure is IDENTIFY > error. > > As you can find out, the failure of SATA DOM was introduced from kernel > 2.6.18. > > I'm not good at low level driver programming, so I cannot find out the root > cause by myself. But if Alan or someone else needs more info or wants to > test the patch, I'm glad to do it in my platform. > > Best regards, > Mao Rui > > > -Original Message- > > From: Alan Cox [mailto:[EMAIL PROTECTED] > > Sent: Tuesday, January 15, 2008 7:02 PM > > To: Mao Rui > > Cc: linux-kernel@vger.kernel.org > > Subject: Re: SATA DOM is not identified by ata_piix module > > > > On Tue, 15 Jan 2008 18:11:25 +0800 > > "Mao Rui" <[EMAIL PROTECTED]> wrote: > > > > > Hi, > > > > > > I have a PQI Turbo SATA DOM. It works well under Windows. I installed it > in > > > a SuperMicro motherboard, Intel 5000P chipset. The OS is Ubuntu 7.04, > kernel > > > 2.6.20-15. But the DOM is not appeared as a device node, and I found > several > > > error messages in kernel log. > > > > Generally it is a good idea to report problems with vendor built kernels > > to the vendor and their support, especially one that is 3 releases behind. > > They have a much better idea what is in that kernel and who else has seen > > problems. 
> > > > > [ 67.124299] ata2.00: qc timeout (cmd 0x91) > > > [ 67.124306] ata2.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, > > > err_mask=0x4) > > > > We issued commands, it didn't respond. That could be a libata problem but > > actually looks more like an IRQ routing problem. > > > > > Actually I also have a PQI IDE DOM, it have same error with Ubuntu 7.04 > / > > > > The PQI DOM is a bit odd, it is however known to work with libata at > > least for the PATA one, and the versions which don't understand > > SET_XFER_MODE to work with current kernels. (Your failure mode isn't the > > SET_XFER_MODE one - it hasn't got that far). > > > > Alan > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
Matthew Wilcox wrote:

On Tue, Jan 29, 2008 at 07:29:51AM -0800, Arjan van de Ven wrote:

Right now, that isn't a lot of people in x86 land, but your patch encumbers drivers for non-x86 archs with an additional call to access space that they've never had a problem with.

lets say s/x86/x86, IA64 and architectures that use intel, amd or via chipsets/

Umm .. ia64 already does exactly what I'm proposing for x86. It uses one SAL interface for bytes below 256 and a different SAL interface for bytes 256-4095.

Not exactly. :)

The interface is the same, ia64_sal_pci_config_write() and ia64_sal_pci_config_read(), but a flag bit in the mode argument is used to tell the SAL interface whether to translate the offset component of the config address as having 8 or 12 bits of displacement.

In my estimation, Ivan's patch, in his implementation of Loic's suggestion, is even more elegant, since there is no need to flag whether the access is for offsets below 256. Ivan's code automatically uses Port IO (or equivalent with Matthew's patch) for offsets below 256 and MMCONFIG for offsets from 256 to 4095.

And even better, it removes the bitmap that tracks MMCONFIG-unfriendly devices for the first 16 buses, a solution that assumes systems with bus numbers higher than 16 will get MMCONFIG right, which turned out to be a very wrong assumption.

Furthermore, the config address is translated by the Northbridge. The delivery mechanism to the Northbridge, whether Port IO or MMCONFIG, is utterly opaque to the devices on the bus, since all they see is PCI config cycles, not Port IO or MMCONFIG cycles. The test only needed to be made at the Northbridge level, not at the device level. Ivan's patch removes all this cruft.
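The offset-based selection discussed here is easy to state in code. The following is a sketch, not Ivan's patch: the enum and function names are invented for illustration. The two address layouts are the standard ones, the type 1 value written to port 0xCF8 for conventional accesses and the 4 KiB-per-function offset into the memory-mapped window.

```c
#include <stdint.h>

enum cfg_method { CFG_NONE, CFG_PORT_IO, CFG_MMCONFIG };

/* Pick the access method from the register offset alone. */
static enum cfg_method pick_cfg_method(unsigned int reg, int have_mmconfig)
{
    if (reg < 256)
        return CFG_PORT_IO;   /* conventional config space */
    if (reg < 4096 && have_mmconfig)
        return CFG_MMCONFIG;  /* extended config space */
    return CFG_NONE;          /* not reachable */
}

/* Value written to port 0xCF8 for a type 1 access (reg < 256);
 * the data is then transferred through port 0xCFC. */
static uint32_t type1_address(unsigned int bus, unsigned int dev,
                              unsigned int fn, unsigned int reg)
{
    return 0x80000000u | (bus << 16) | (dev << 11) | (fn << 8) |
           (reg & 0xfcu);
}

/* Byte offset into the MMCONFIG window: 4 KiB per function. */
static uint32_t mmconfig_offset(unsigned int bus, unsigned int dev,
                                unsigned int fn, unsigned int reg)
{
    return (bus << 20) | (dev << 15) | (fn << 12) | reg;
}
```

Because the choice depends only on the offset, no per-device bookkeeping is needed, which is the property the message above argues for.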
Re: [PATCH] x86: Add a list for custom page fault handlers.
From: Pekka Paalanen <[EMAIL PROTECTED]>

Provides kernel modules a way to register custom page fault handlers.
On every page fault, except those handled in vmalloc_fault(), this will
call a list of registered functions. The functions may handle the fault
and force do_page_fault() to return immediately.

This functionality is similar to the now removed page fault notifiers.
Custom page fault handlers are used by debugging and reverse engineering
tools. Mmio-trace is one such tool and a patch to add it into the tree
will follow.

The custom page fault handlers are called from the exact same points in
do_page_fault() as the page fault notifiers were.

Signed-off-by: Pekka Paalanen <[EMAIL PROTECTED]>
Signed-off-by: Harvey Harrison <[EMAIL PROTECTED]>
---
Sorry, attached the wrong version to my last message missing the
kdebug.h hunk.

This is still just a straight port to current x86.git.

 arch/x86/Kconfig.debug   |  9
 arch/x86/mm/fault.c      | 51
 include/asm-x86/kdebug.h |  8
 3 files changed, 68 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 2e1e3af..9b44bc5 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -225,4 +225,13 @@ config CPA_DEBUG
 	help
 	  Do change_page_attr self tests at boot.
 
+config PAGE_FAULT_HANDLERS
+	bool "Custom page fault handlers"
+	depends on DEBUG_KERNEL
+	help
+	  Allow the use of custom page fault handlers. A kernel module may
+	  register a function that is called on every page fault not handled
+	  for vmalloc. Custom handlers are used by some debugging and reverse
+	  engineering tools.
+
 endmenu
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e28cc52..c6c8164 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -49,6 +49,54 @@
 #define PF_RSVD		(1<<3)
 #define PF_INSTR	(1<<4)
 
+#ifdef CONFIG_PAGE_FAULT_HANDLERS
+static HLIST_HEAD(pf_handlers); /* protected by RCU */
+static DEFINE_SPINLOCK(pf_handlers_writer);
+
+void register_page_fault_handler(struct pf_handler *new_pfh)
+{
+	spin_lock(&pf_handlers_writer);
+	hlist_add_head_rcu(&new_pfh->hlist, &pf_handlers);
+	spin_unlock(&pf_handlers_writer);
+}
+EXPORT_SYMBOL_GPL(register_page_fault_handler);
+
+void unregister_page_fault_handler(struct pf_handler *old_pfh)
+{
+	might_sleep();
+	spin_lock(&pf_handlers_writer);
+	hlist_del_rcu(&old_pfh->hlist);
+	spin_unlock(&pf_handlers_writer);
+	synchronize_rcu();
+}
+EXPORT_SYMBOL_GPL(unregister_page_fault_handler);
+#endif
+
+/* returns non-zero if do_page_fault() should return */
+static int handle_custom_pf(struct pt_regs *regs, unsigned long error_code,
+							unsigned long address)
+{
+#ifdef CONFIG_PAGE_FAULT_HANDLERS
+	int ret = 0;
+	struct pf_handler *cur;
+	struct hlist_node *ncur;
+
+	if (hlist_empty(&pf_handlers))
+		return 0;
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(cur, ncur, &pf_handlers, hlist) {
+		ret = cur->handler(regs, error_code, address);
+		if (ret)
+			break;
+	}
+	rcu_read_unlock();
+	return ret;
+#else
+	return 0;
+#endif
+}
+
 static inline int notify_page_fault(struct pt_regs *regs)
 {
 #ifdef CONFIG_KPROBES
@@ -588,6 +636,9 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
 	if (notify_page_fault(regs))
 		return;
 
+	if (handle_custom_pf(regs, error_code, address))
+		return;
+
 	/*
 	 * We fault-in kernel-space virtual memory on-demand. The
 	 * 'reference' page table is init_mm.pgd.
diff --git a/include/asm-x86/kdebug.h b/include/asm-x86/kdebug.h
index dd442a1..ba03368 100644
--- a/include/asm-x86/kdebug.h
+++ b/include/asm-x86/kdebug.h
@@ -35,4 +35,12 @@ extern void dump_pagetable(unsigned long);
 extern unsigned long oops_begin(void);
 extern void oops_end(unsigned long, struct pt_regs *, int signr);
 
+struct pf_handler {
+	struct hlist_node hlist;
+	int (*handler)(struct pt_regs *regs, unsigned long error_code,
+						unsigned long address);
+};
+
+extern void register_page_fault_handler(struct pf_handler *new_pfh);
+extern void unregister_page_fault_handler(struct pf_handler *old_pfh);
 #endif
-- 
1.5.4.rc4.1142.gf5a97
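A userspace sketch of the dispatch logic in handle_custom_pf() may help when reading the fault.c hunk. It is not kernel code: a plain singly linked list stands in for the RCU-protected hlist, there is no locking, and `struct pf_handler_model`, `model_register()` and `model_dispatch()` are hypothetical names invented for illustration. Only the control flow (walk the list, stop at the first handler that returns non-zero) is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct pf_handler. */
struct pf_handler_model {
	struct pf_handler_model *next;
	int (*handler)(unsigned long error_code, unsigned long address);
};

static struct pf_handler_model *model_handlers;

/* Add at the head of the list, as hlist_add_head_rcu() does in the patch. */
static void model_register(struct pf_handler_model *pfh)
{
	assert(pfh != NULL && pfh->handler != NULL);
	pfh->next = model_handlers;
	model_handlers = pfh;
}

/* Returns non-zero if the caller should return immediately. */
static int model_dispatch(unsigned long error_code, unsigned long address)
{
	struct pf_handler_model *cur;
	int ret = 0;

	for (cur = model_handlers; cur != NULL; cur = cur->next) {
		ret = cur->handler(error_code, address);
		if (ret)
			break;	/* the first handler that claims the fault wins */
	}
	return ret;
}

/* Example handler: claims faults below an arbitrary address. */
static int claim_low_addresses(unsigned long error_code, unsigned long address)
{
	(void)error_code;
	return address < 0x1000;
}

/* Example handler: only counts how often it was reached. */
static int calls_seen;

static int count_only(unsigned long error_code, unsigned long address)
{
	(void)error_code;
	(void)address;
	calls_seen++;
	return 0;
}
```

Because registration adds at the head, the handler registered last runs first. A handler that returns non-zero hides the fault from every handler behind it.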
[patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
The invalidation of address ranges in a mm_struct needs to be
performed when pages are removed or permissions etc change.
Most of the VM address space changes can use the range invalidate
callback.

invalidate_range() is generally called with mmap_sem held but
no spinlocks are active. If invalidate_range() is called with
locks held then we pass a flag into invalidate_range().

Comments state that mmap_sem must be held for
remap_pfn_range() but various drivers do not seem to do this.

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>
Signed-off-by: Robin Holt <[EMAIL PROTECTED]>
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/fremap.c  |  2
 mm/hugetlb.c |  2
 mm/memory.c  | 11
 mm/mmap.c    |  1
 4 files changed, 14 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2008-01-29 16:56:33.0 -0800
+++ linux-2.6/mm/fremap.c	2008-01-29 16:59:24.0 -0800
@@ -15,6 +15,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -212,6 +213,7 @@ asmlinkage long sys_remap_file_pages(uns
 	}
 
 	err = populate_range(mm, vma, start, size, pgoff);
+	mmu_notifier(invalidate_range, mm, start, start + size, 0);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
 			downgrade_write(&mm->mmap_sem);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-01-29 16:56:33.0 -0800
+++ linux-2.6/mm/memory.c	2008-01-29 16:59:24.0 -0800
@@ -50,6 +50,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -891,6 +892,8 @@ unsigned long zap_page_range(struct vm_a
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
 	if (tlb)
 		tlb_finish_mmu(tlb, address, end);
+	mmu_notifier(invalidate_range, mm, address, end,
+		(details ? (details->i_mmap_lock != NULL) : 0));
 	return end;
 }
 
@@ -1319,7 +1322,7 @@ int remap_pfn_range(struct vm_area_struc
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + PAGE_ALIGN(size);
+	unsigned long start = addr, end = addr + PAGE_ALIGN(size);
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
 
@@ -1360,6 +1363,7 @@ int remap_pfn_range(struct vm_area_struc
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range, mm, start, end, 0);
 	return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1443,7 +1447,7 @@ int apply_to_page_range(struct mm_struct
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + size;
+	unsigned long start = addr, end = addr + size;
 	int err;
 
 	BUG_ON(addr >= end);
@@ -1454,6 +1458,7 @@ int apply_to_page_range(struct mm_struct
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range, mm, start, end, 0);
 	return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1669,6 +1674,8 @@ gotten:
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	mmu_notifier(invalidate_range, mm, address,
+				address + PAGE_SIZE - 1, 0);
 	if (dirty_page) {
 		if (vma->vm_file)
 			file_update_time(vma->vm_file);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-29 16:56:36.0 -0800
+++ linux-2.6/mm/mmap.c	2008-01-29 16:58:15.0 -0800
@@ -1748,6 +1748,7 @@ static void unmap_region(struct mm_struc
 	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
 	tlb_finish_mmu(tlb, start, end);
+	mmu_notifier(invalidate_range, mm, start, end, 0);
 }
 
 /*
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-01-29 16:56:33.0 -0800
+++ linux-2.6/mm/hugetlb.c	2008-01-29 16:58:15.0 -0800
@@ -14,6 +14,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -763,6 +764,7 @@ void __unmap_hugepage_range(struct vm_ar
 	}
 	spin_unlock(&mm->page_table_lock);
 	flush_tlb_range(vma, start, end);
+	mmu_notifier(invalidate_range, mm, start, end, 1);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
 		list_del(&page->lru);
 		put_page(page);
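The last argument of the invalidate_range calls in this patch tells the callback whether a spinlock is held. A minimal userspace sketch of how that flag could be derived and consumed follows. `struct model_zap_details`, `model_lock_flag()` and `model_invalidate_range()` are hypothetical names, and the two counters merely stand in for a path that may sleep and one that may not; the real callback signature comes from patch 1/6, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the details argument of zap_page_range(). */
struct model_zap_details {
	void *i_mmap_lock;	/* non-NULL when the caller holds the lock */
};

static int sleeping_calls;	/* invalidations on a path that may block */
static int atomic_calls;	/* invalidations on a path that must not block */

/* Same expression as in the zap_page_range() hunk. */
static int model_lock_flag(const struct model_zap_details *details)
{
	return details ? (details->i_mmap_lock != NULL) : 0;
}

/* Hypothetical callback: picks a path based on the flag. */
static void model_invalidate_range(unsigned long start, unsigned long end,
				   int atomic)
{
	assert(start <= end);
	if (atomic)
		atomic_calls++;		/* must not sleep here */
	else
		sleeping_calls++;	/* free to block */
}
```

A subsystem that has to sleep while tearing down its mappings would need some other strategy on the atomic path; the sketch only shows where that decision would be made.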
Re: [PATCH 2/4] x86_64: make early_node_mem return align address v2
On Tuesday 29 January 2008 11:14:48 am Yinghai Lu wrote:
> [PATCH 2/4] x86_64: make early_node_mem return align address v2
>
> boot oops when system get 64g or 128 installed
> can you apply this updated version with others?

setup_node_mem should return a PAGE_ALIGNed address.
In setup_node_bootmem, bootmap_start needs to be page aligned; without
this patch it will overlap with bss.

YH
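The alignment requirement can be illustrated in userspace. The sketch below assumes 4 KiB pages and uses hypothetical names (`MODEL_PAGE_SIZE`, `model_page_align()`, `model_place_after()`); it shows the usual round-up-to-page-boundary arithmetic, not the actual code from the patch.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Round addr up to the next page boundary (no-op if already aligned). */
static unsigned long model_page_align(unsigned long addr)
{
	return (addr + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
}

/* Returns 1 if addr lies on a page boundary, 0 otherwise. */
static int model_is_page_aligned(unsigned long addr)
{
	return (addr & (MODEL_PAGE_SIZE - 1)) == 0;
}

/*
 * Hypothetical helper: choose a page-aligned start address that does not
 * reach back into a region ending at prev_end.
 */
static unsigned long model_place_after(unsigned long prev_end)
{
	unsigned long start = model_page_align(prev_end);

	assert(start >= prev_end);	/* rounding up never moves backwards */
	return start;
}
```

An unaligned start address that is later converted to a page frame number is effectively rounded down, which is how it can end up covering part of the preceding region.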
[patch 6/6] mmu_notifier: Add invalidate_all()
When a task exits we can remove all external pts at once. At that point the
extern mmu may also unregister itself from the mmu notifier chain to avoid
future calls.

Note the complications because of RCU. Other processors may not see that the
notifier was unlinked until a quiescent period has passed!

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/mmu_notifier.h | 4
 mm/mmap.c                    | 1
 2 files changed, 5 insertions(+)

Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h	2008-01-28 14:02:18.0 -0800
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-28 14:15:49.0 -0800
@@ -62,6 +62,10 @@ struct mmu_notifier_ops {
 				struct mm_struct *mm,
 				unsigned long address);
 
+	/* Dummy needed because the mmu_notifier() macro requires it */
+	void (*invalidate_all)(struct mmu_notifier *mn, struct mm_struct *mm,
+				int dummy);
+
 	/*
 	 * lock indicates that the function is called under spinlock.
 	 */
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-28 14:15:49.0 -0800
+++ linux-2.6/mm/mmap.c	2008-01-28 14:15:49.0 -0800
@@ -2034,6 +2034,7 @@ void exit_mmap(struct mm_struct *mm)
 	unsigned long end;
 
 	/* mm's last user has gone, and its about to be pulled down */
+	mmu_notifier(invalidate_all, mm, 0);
 	arch_exit_mmap(mm);
 	lru_add_drain();
[patch 4/6] MMU notifier: invalidate_page callbacks using Linux rmaps
These notifiers here use the Linux rmaps to perform the callbacks.
In order to walk the rmaps locks must be held. Callbacks can therefore
only operate in an atomic context.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/rmap.c | 12
 1 file changed, 9 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-01-29 16:58:25.0 -0800
+++ linux-2.6/mm/rmap.c	2008-01-29 16:58:39.0 -0800
@@ -285,7 +285,8 @@ static int page_referenced_one(struct pa
 	if (!pte)
 		goto out;
 
-	if (ptep_clear_flush_young(vma, address, pte))
+	if (ptep_clear_flush_young(vma, address, pte) |
+			mmu_notifier_age_page(mm, address))
 		referenced++;
 
 	/* Pretend the page is referenced if the task has the
@@ -435,6 +436,7 @@ static int page_mkclean_one(struct page
 
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		entry = ptep_clear_flush(vma, address, pte);
+		mmu_notifier(invalidate_page, mm, address);
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
 		set_pte_at(mm, address, pte, entry);
@@ -680,7 +682,8 @@ static int try_to_unmap_one(struct page
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-			(ptep_clear_flush_young(vma, address, pte)))) {
+			(ptep_clear_flush_young(vma, address, pte) |
+				mmu_notifier_age_page(mm, address)))) {
 		ret = SWAP_FAIL;
 		goto out_unmap;
 	}
@@ -688,6 +691,7 @@ static int try_to_unmap_one(struct page
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
 	pteval = ptep_clear_flush(vma, address, pte);
+	mmu_notifier(invalidate_page, mm, address);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
@@ -812,12 +816,14 @@ static void try_to_unmap_cluster(unsigne
 		page = vm_normal_page(vma, address, *pte);
 		BUG_ON(!page || PageAnon(page));
 
-		if (ptep_clear_flush_young(vma, address, pte))
+		if (ptep_clear_flush_young(vma, address, pte) |
+				mmu_notifier_age_page(mm, address))
 			continue;
 
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		pteval = ptep_clear_flush(vma, address, pte);
+		mmu_notifier(invalidate_page, mm, address);
 
 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
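The aging hunks in this patch combine the two checks with a bitwise `|` rather than a logical `||`. A plausible reading (an assumption; the patch description does not say so) is that both calls must always run, so that the external referenced state is cleared even when the Linux pte was already young. The userspace sketch below shows the difference in evaluation. `clear_young_primary()` and `clear_young_external()` are hypothetical test-and-clear stand-ins, not kernel functions.

```c
#include <assert.h>

static int primary_young;	/* models the young bit in the Linux pte */
static int external_young;	/* models the young state in an external mmu */

/* Set both flags; each must be 0 or 1. */
static void model_set_young(int primary, int external)
{
	assert((primary == 0 || primary == 1) &&
	       (external == 0 || external == 1));
	primary_young = primary;
	external_young = external;
}

/* Test-and-clear: returns the old value and resets the flag. */
static int clear_young_primary(void)
{
	int was = primary_young;

	primary_young = 0;
	return was;
}

static int clear_young_external(void)
{
	int was = external_young;

	external_young = 0;
	return was;
}

/* Bitwise OR: both operands are always evaluated (in unspecified order). */
static int referenced_bitwise(void)
{
	return (clear_young_primary() | clear_young_external()) != 0;
}

/* Logical OR: the second call is skipped when the first returns non-zero. */
static int referenced_logical(void)
{
	return clear_young_primary() || clear_young_external();
}
```

With `||`, a young primary entry leaves the external state untouched, so it would still read as young on the next scan. With `|`, both are cleared in one pass.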
[patch 0/6] [RFC] MMU Notifiers V3
This is a patchset implementing MMU notifier callbacks based on Andrea's
earlier work. These are needed if Linux pages are referenced from something
else than tracked by the rmaps of the kernel.

The known immediate users are

KVM (establishes a refcount to the page. External references called spte)

GRU (simple TLB shootdown without refcount. Has its own pagetable/tlb)

XPmem (uses its own reverse mappings and refcount. Remote ptes, Needs
	to sleep when sending messages)

Issues:

- Feedback from uses of the callbacks for KVM, RDMA, XPmem and GRU
  Early tests with the GRU were successful.

- Pages may be freed before the external mappings are torn down
  through invalidate_range() if no refcount on the page is taken.
  There is the chance that page content may be visible after
  they have been reallocated (mainly an issue for the GRU that
  takes no refcount).

- invalidate_range() callbacks are sometimes called under i_mmap_lock.
  These need to be dealt with or XPmem needs to be able to work around
  these.

- filemap_xip.c does not follow conventions for Rmap callbacks.
  We could depend on XIP support not being active to avoid the issue.

Things that we leave as is:

- RCU quiescent periods are required on registering and unregistering
  notifiers to guarantee visibility to other processors.
  Currently only mmu_notifier_release() does the correct thing.
  It is up to the user to provide RCU quiescent periods for
  register/unregister functions if they are called outside of the
  ->release method.

Andrea's mmu_notifier #4 -> RFC V1

- Merge subsystem rmap based with Linux rmap based approach
- Move Linux rmap based notifiers out of macro
- Try to account for what locks are held while the notifiers are
  called.
- Develop a patch sequence that separates out the different types of
  hooks so that we can review their use.
- Avoid adding include to linux/mm_types.h
- Integrate RCU logic suggested by Peter.

V1->V2:
- Improve RCU support
- Use mmap_sem for mmu_notifier register / unregister
- Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we
  already have invalidate_range() callbacks there.
- Clean compile for !MMU_NOTIFIER
- Isolate filemap_xip strangeness into its own diff
- Pass a flag to invalidate_range to indicate if a spinlock
  is held.
- Add invalidate_all()

V2->V3:
- Further RCU fixes
- Fixes from Andrea to fixup aging and move invalidate_range() in
  do_wp_page and sys_remap_file_pages() after the pte clearing.
[patch 3/6] mmu_notifier: invalidate_page callbacks for subsystems with rmap
Callbacks to remove individual pages if the subsystem has an
rmap capability. The pagelock is held but no spinlocks are held.
The refcount of the page is elevated so that dropping the refcount
in the subsystem will not directly free the page.

The callbacks occur after the Linux rmaps have been walked.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/rmap.c | 6
 1 file changed, 6 insertions(+)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-01-25 14:24:19.0 -0800
+++ linux-2.6/mm/rmap.c	2008-01-25 14:24:38.0 -0800
@@ -49,6 +49,7 @@
 #include
 #include
 #include
+#include
 #include
@@ -473,6 +474,8 @@ int page_mkclean(struct page *page)
 		struct address_space *mapping = page_mapping(page);
 		if (mapping) {
 			ret = page_mkclean_file(mapping, page);
+			if (unlikely(PageExternalRmap(page)))
+				mmu_rmap_notifier(invalidate_page, page);
 			if (page_test_dirty(page)) {
 				page_clear_dirty(page);
 				ret = 1;
@@ -971,6 +974,9 @@ int try_to_unmap(struct page *page, int
 	else
 		ret = try_to_unmap_file(page, migration);
 
+	if (unlikely(PageExternalRmap(page)))
+		mmu_rmap_notifier(invalidate_page, page);
+
 	if (!page_mapped(page))
 		ret = SWAP_SUCCESS;
 	return ret;