Re: epoll and shared fd's
On Jan 25, 2008 12:57 AM, Davide Libenzi <[EMAIL PROTECTED]> wrote:
>
> On Thu, 24 Jan 2008, Pierre Habouzit wrote:
>
> > On Fri, Jan 18, 2008 at 09:10:18PM +, Davide Libenzi wrote:
> > > On Fri, 18 Jan 2008, Pierre Habouzit wrote:
> > >
> > > > Hi,
> > > >
> > > > I just came across a strange behavior of epoll that seems to
> > > > contradict the documentation. Here is what happens:
> > > >
> > > > * I have two processes P1 and P2. P1 accept()s connections and sends the
> > > >   resulting file descriptors to P2 through a unix socket.
> > > >
> > > > * P2 registers the received socket in its epollfd.
> > > >
> > > > [time passes]
> > > >
> > > > * P2 is done with the socket and closes it.
> > > >
> > > > * P2 gets events for the socket again!
> > > >
> > > >
> > > > The documentation, though, says that if a process closes a file
> > > > descriptor, it gets unregistered. And yes, I'm sure that P2 doesn't dup()
> > > > the file descriptor. However (because of a bug) it was still open in
> > > > P1[0], hence the referenced socket was still live at the kernel level.
> > > >
> > > > Of course the userland workaround is to force the EPOLL_CTL_DEL before
> > > > the close, which I now do, but it costs me a syscall where I wanted to
> > > > spare one :|
> > >
> > > For epoll, a close is when the kernel file* is released (that is, when all
> > > its instances are gone).
> > > We could put special handling in filp_close(), but I don't think it is a
> > > good idea, and we're better off living with the current behaviour.
> >
> > Okay, maybe updating the linux manpages to be more clear about that is
> > the way to go then. Thanks
>
> Sure. I'll send Michael Kerrisk an updated statement for the A6 answer in
> the epoll man page.

Thanks Davide -- yes please send me a patch.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
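The behaviour discussed in this thread can be reproduced in a single process.
What follows is only a minimal sketch, not code from the thread: a pipe stands
in for the accepted socket, a dup() of the read end stands in for the copy that
was still open in P1, and the function name events_after_close() is made up for
the illustration. It reports how many events epoll_wait() still delivers after
the registered descriptor has been closed, with and without the EPOLL_CTL_DEL
workaround.

```c
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: register the read end of a pipe with epoll, keep a
 * dup() of it open, close the registered descriptor, make the pipe readable
 * and ask epoll what it sees.
 *
 * del_first != 0 issues EPOLL_CTL_DEL before the close (the workaround).
 * Returns the epoll_wait() result, or -1 if the setup failed.
 */
int events_after_close(int del_first)
{
	struct epoll_event ev = { .events = EPOLLIN };
	struct epoll_event got;
	int p[2], epfd, extra, n = -1;

	if (pipe(p) < 0)
		return -1;
	epfd = epoll_create(1);
	extra = dup(p[0]);	/* second reference to the same open file */
	if (epfd < 0 || extra < 0)
		goto done;

	ev.data.fd = p[0];
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0)
		goto done;

	if (del_first && epoll_ctl(epfd, EPOLL_CTL_DEL, p[0], NULL) < 0)
		goto done;

	/* Closing the descriptor does not release the open file: "extra"
	 * still refers to it, so the epoll registration survives unless it
	 * was removed explicitly above. */
	close(p[0]);
	p[0] = -1;

	if (write(p[1], "x", 1) != 1)
		goto done;
	n = epoll_wait(epfd, &got, 1, 100);	/* 100 ms timeout */
done:
	if (p[0] >= 0)
		close(p[0]);
	close(p[1]);
	if (extra >= 0)
		close(extra);
	if (epfd >= 0)
		close(epfd);
	return n;
}
```

Calling events_after_close(0) should still report an event for the closed
descriptor, while events_after_close(1) should time out with none, which is
the extra syscall Pierre mentions paying for.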
Re: threshold_init_device/kobject_uevent_env oops
On Fri, Jan 25, 2008 at 11:24:55PM -0800, Greg KH wrote: > On Fri, Jan 25, 2008 at 11:08:53PM -0800, Yinghai Lu wrote: > > On Jan 25, 2008 10:14 PM, Greg KH <[EMAIL PROTECTED]> wrote: > > > > > > On Fri, Jan 25, 2008 at 10:04:19PM -0800, Yinghai Lu wrote: > > > > On Jan 25, 2008 2:50 PM, Greg KH <[EMAIL PROTECTED]> wrote: > > > > > On Fri, Jan 25, 2008 at 02:47:11PM -0800, Greg KH wrote: > > > > > > On Fri, Jan 25, 2008 at 11:35:56PM +0100, Ingo Molnar wrote: > > > > > > > > > > > > > > * Greg KH <[EMAIL PROTECTED]> wrote: > > > > > > > > > > > > > > > On Fri, Jan 25, 2008 at 01:05:40PM -0800, Yinghai Lu wrote: > > > > > > > > > current linus tree + x86.git > > > > > > > > > > > > > > > > > > got > > > > > > > > > > > > > > > > > > Calling initcall 0x80b93d98: > > > > > > > > > threshold_init_device+0x0/0x3f() > > > > > > > > > BUG: unable to handle kernel NULL pointer dereference at > > > > > > > > > 0040 > > > > > > > > > IP: [] kobject_uevent_env+0x2a/0x3d9 > > > > > > > > > > > > > > > > Does this happen on just Linus's tree? > > > > > > > > > > > > > > > > Can you send me a .config file for this? > > > > > > > > > > > > > > > > What is threshold_init()? Is it something new in the x86.git > > > > > > > > tree? > > > > > > > > > > > > > > no. A quick grep shows that it is in a file that _your_ changes in > > > > > > > Linus' latest have touched: > > > > > > > > > > > > > > arch/x86/kernel/cpu/mcheck/mce_amd_64.c > > > > > > > > > > > > Ok, those are pretty much just search/and/replace type changes, but > > > > > > I > > > > > > have been running x86-64 boxes with these changes in place. > > > > > > > > > > Oh wait, I do see a change. We are now (finally) emitting a kobject > > > > > uevent for these devices, which somehow the code can't handle > > > > > properly. > > > > > > > > > > Let me go poke this some more, unfortunatly I don't have any AMD 64 > > > > > boxes here anymore, only Intel based processors, so I can't run this > > > > > module... 
> > > > > > > > it only happens with AMD Quad Core CPU or Fam 10h. > > > > > > > > works well with AMD opteron Rev E, and Rev F. > > > > > > So this only dies on a multi-core system? Or does 2 processor boxes > > > work, but not 4? > > > > 2 sockets x quad core will fail (fam 10h) > > 2 sockets x dual core works( rev E, and rev F opteron) > > > > there are some changs between opteron and fam10h. fam10h may have > > more local vectors for MCE... > > or more banks and blocks... > > > > will look at AMD64 Bios and kernel porting guide for Fam 10h again.. > > > > wonder if your code uncover some bugs ... > > No, the logic in this function is just crazy. It's recursive, but we > can circumvent the creation for the kobject and whole creation of the > threshold_block if some conditions are met. That's why we see the > allocate_threshold_blocks so many times in the callstack, yet only a few > kobjects created. > > Then we blow up in kobject_uevent_env() on the first debug printk. > Which means that we are just passing in garbage. > > Let me know if the patch below fixes this for you, I think it should, as > there is a code path where b is NULL and then we call kobject_uevent. > > Man, this is one time that comments in code would have been very nice to > have, and why forward goto's into major code blocks are just evil... 
> > thanks, > > greg k-h > > diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c > b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c > index 7535887..8a7f204 100644 > --- a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c > +++ b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c > @@ -450,7 +450,8 @@ recurse: > if (err) > goto out_free; > > - kobject_uevent(&b->kobj, KOBJ_ADD); > + if (b && &b->kobj) > + kobject_uevent(&b->kobj, KOBJ_ADD); > > return err; > Actually the second test doesn't make sense, it can just be: diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c index 7535887..8a7f204 100644 --- a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c +++ b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c @@ -450,7 +450,8 @@ recurse: if (err) goto out_free; - kobject_uevent(&b->kobj, KOBJ_ADD); + if (b) + kobject_uevent(&b->kobj, KOBJ_ADD); return err; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
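Why the second half of "b && &b->kobj" adds nothing can be shown with a small
userspace sketch. The struct and function names below (kobj_like, block_like,
announce_block) are invented for the illustration and are not the kernel's
types; the only assumption carried over from the patch is that the kobject is
embedded in the block structure rather than pointed to.

```c
#include <stddef.h>

/* Stand-in for struct kobject; hypothetical, for illustration only. */
struct kobj_like {
	const char *name;
};

/* Stand-in for the threshold block: the kobject is an embedded member. */
struct block_like {
	unsigned int bank;
	struct kobj_like kobj;
};

static int uevents_sent;

static void send_uevent(struct kobj_like *k)
{
	(void)k;
	uevents_sent++;
}

/* Offset of the embedded member inside its container. */
size_t kobj_offset(void)
{
	return offsetof(struct block_like, kobj);
}

/*
 * The address of an embedded member is just "container address + offset",
 * so it is a valid non-NULL pointer whenever the container pointer is
 * valid. The container pointer is therefore the only thing worth testing.
 * Returns 1 if an event was sent, 0 if the block was absent.
 */
int announce_block(struct block_like *b)
{
	if (!b)
		return 0;
	send_uevent(&b->kobj);
	return 1;
}
```

With a non-zero member offset, a NULL container pointer would turn into a
small non-NULL member address, so a test on the member address could never
catch the case the patch is guarding against.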
Re: [PATCH 158/196] Driver core: convert block from raw kobjects
On Sat, Jan 26, 2008 at 12:23:18AM +0100, Alexander van Heukelum wrote:
> Fix build with CONFIG_BLOCK off.
>
> Building git-2d94dfc with CONFIG_BLOCK turned off gives me:
>
> drivers/base/core.c: In function 'device_add_class_symlinks':
> drivers/base/core.c:704: error: 'part_type' undeclared (first use in this function)
> drivers/base/core.c:704: error: (Each undeclared identifier is reported only once
> drivers/base/core.c:704: error: for each function it appears in.)
> drivers/base/core.c: In function 'device_remove_class_symlinks':
> drivers/base/core.c:743: error: 'part_type' undeclared (first use in this function)
>
> git-blame points to Kay Sievers.
>
> The problem is obvious. I think the solution is too ;).

Heh, thanks, I'll test this in the morning, it's been a long day...

greg k-h
[PATCH] remove duplicating priority setting in try_to_free_p
shrink_zones() in try_to_free_pages() already records the scanning priority
for each zone through note_zone_scanning_priority(). So setting prev_priority
again in try_to_free_pages() is needless.

This patch is made against 2.6.24-rc8.

Signed-off-by: barrios <[EMAIL PROTECTED]>
---
 mm/vmscan.c |   17 -----------------
 1 files changed, 0 insertions(+), 17 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e5a9597..fc55c23 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1273,23 +1273,6 @@ unsigned long try_to_free_pages(struct z
 	if (!sc.all_unreclaimable)
 		ret = 1;
 out:
-	/*
-	 * Now that we've scanned all the zones at this priority level, note
-	 * that level within the zone so that the next thread which performs
-	 * scanning of this zone will immediately start out at this priority
-	 * level. This affects only the decision whether or not to bring
-	 * mapped pages onto the inactive list.
-	 */
-	if (priority < 0)
-		priority = 0;
-	for (i = 0; zones[i] != NULL; i++) {
-		struct zone *zone = zones[i];
-
-		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-			continue;
-
-		zone->prev_priority = priority;
-	}
 	return ret;
 }

--
Kind regards,
barrios
Re: threshold_init_device/kobject_uevent_env oops
On Fri, Jan 25, 2008 at 11:08:53PM -0800, Yinghai Lu wrote: > On Jan 25, 2008 10:14 PM, Greg KH <[EMAIL PROTECTED]> wrote: > > > > On Fri, Jan 25, 2008 at 10:04:19PM -0800, Yinghai Lu wrote: > > > On Jan 25, 2008 2:50 PM, Greg KH <[EMAIL PROTECTED]> wrote: > > > > On Fri, Jan 25, 2008 at 02:47:11PM -0800, Greg KH wrote: > > > > > On Fri, Jan 25, 2008 at 11:35:56PM +0100, Ingo Molnar wrote: > > > > > > > > > > > > * Greg KH <[EMAIL PROTECTED]> wrote: > > > > > > > > > > > > > On Fri, Jan 25, 2008 at 01:05:40PM -0800, Yinghai Lu wrote: > > > > > > > > current linus tree + x86.git > > > > > > > > > > > > > > > > got > > > > > > > > > > > > > > > > Calling initcall 0x80b93d98: > > > > > > > > threshold_init_device+0x0/0x3f() > > > > > > > > BUG: unable to handle kernel NULL pointer dereference at > > > > > > > > 0040 > > > > > > > > IP: [] kobject_uevent_env+0x2a/0x3d9 > > > > > > > > > > > > > > Does this happen on just Linus's tree? > > > > > > > > > > > > > > Can you send me a .config file for this? > > > > > > > > > > > > > > What is threshold_init()? Is it something new in the x86.git > > > > > > > tree? > > > > > > > > > > > > no. A quick grep shows that it is in a file that _your_ changes in > > > > > > Linus' latest have touched: > > > > > > > > > > > > arch/x86/kernel/cpu/mcheck/mce_amd_64.c > > > > > > > > > > Ok, those are pretty much just search/and/replace type changes, but I > > > > > have been running x86-64 boxes with these changes in place. > > > > > > > > Oh wait, I do see a change. We are now (finally) emitting a kobject > > > > uevent for these devices, which somehow the code can't handle properly. > > > > > > > > Let me go poke this some more, unfortunatly I don't have any AMD 64 > > > > boxes here anymore, only Intel based processors, so I can't run this > > > > module... > > > > > > it only happens with AMD Quad Core CPU or Fam 10h. > > > > > > works well with AMD opteron Rev E, and Rev F. > > > > So this only dies on a multi-core system? 
Or does 2 processor boxes > > work, but not 4? > > 2 sockets x quad core will fail (fam 10h) > 2 sockets x dual core works( rev E, and rev F opteron) > > there are some changs between opteron and fam10h. fam10h may have > more local vectors for MCE... > or more banks and blocks... > > will look at AMD64 Bios and kernel porting guide for Fam 10h again.. > > wonder if your code uncover some bugs ... No, the logic in this function is just crazy. It's recursive, but we can circumvent the creation for the kobject and whole creation of the threshold_block if some conditions are met. That's why we see the allocate_threshold_blocks so many times in the callstack, yet only a few kobjects created. Then we blow up in kobject_uevent_env() on the first debug printk. Which means that we are just passing in garbage. Let me know if the patch below fixes this for you, I think it should, as there is a code path where b is NULL and then we call kobject_uevent. Man, this is one time that comments in code would have been very nice to have, and why forward goto's into major code blocks are just evil... thanks, greg k-h diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c index 7535887..8a7f204 100644 --- a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c +++ b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c @@ -450,7 +450,8 @@ recurse: if (err) goto out_free; - kobject_uevent(&b->kobj, KOBJ_ADD); + if (b && &b->kobj) + kobject_uevent(&b->kobj, KOBJ_ADD); return err; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH 3/8] SVM: add module parameter to disable NestedPaging
On Fri, Jan 25, 2008 at 05:47:11PM -0800, Nakajima, Jun wrote:
> Joerg Roedel wrote:
> > To disable the use of the Nested Paging feature even if it is available in
> > hardware this patch adds a module parameter. Nested Paging can be disabled by
> > passing npt=off to the kvm_amd module.
>
> I think it's better to use a (common) parameter to qemu. That way you
> can control on/off for each VM.

Generally I see no problem with it. But at least for NPT I don't see a
reason why someone should want to disable it on a per-VM basis (as long as
it works stably). Avi, what do you think?

Joerg
Re: [PATCH] RUSAGE_THREAD
On Jan 19, 2008 2:14 AM, Roland McGrath <[EMAIL PROTECTED]> wrote:
>
> This adds the RUSAGE_THREAD option for the getrusage system call.
> Solaris calls this RUSAGE_LWP and uses the same value (1).
> That name is not a natural one for Linux, but we keep it as an alias.

Hey Roland,

Would you please CC me at this address on patches that change the
kernel-userland API.

Cheers,

Michael
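For readers who want to see what the new option looks like from userland, here
is a minimal sketch of a caller. It assumes a kernel and C library that expose
RUSAGE_THREAD; the fallback define uses the value (1) quoted in the patch
description, and the helper name thread_cpu_usec() is made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* glibc only exposes RUSAGE_THREAD with this */
#endif
#include <sys/time.h>
#include <sys/resource.h>

#ifndef RUSAGE_THREAD
#define RUSAGE_THREAD 1		/* value given in the patch description */
#endif

/*
 * Hypothetical helper: CPU time (user + system) consumed so far by the
 * calling thread only, in microseconds. Returns -1 if the kernel rejects
 * the request (for example, a kernel without RUSAGE_THREAD support).
 */
long long thread_cpu_usec(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_THREAD, &ru) != 0)
		return -1;

	return ((long long)ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) * 1000000LL
		+ ru.ru_utime.tv_usec + ru.ru_stime.tv_usec;
}
```

The call has the same shape as getrusage(RUSAGE_SELF, ...), but the figures
cover just the calling thread rather than the whole process.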
Re: [PATCH] Fake NUMA emulation for PowerPC (Take 2)
* Michael Ellerman <[EMAIL PROTECTED]> [2008-01-18 16:44:58]: > > This fixes it, although I'm a little worried about some of the > removals/movings of node_set_online() in the patch. > > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index 1666e7d..dcedc26 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -49,7 +49,6 @@ static int __cpuinit fake_numa_create_new_node(unsigned > long end_pfn, > static unsigned int fake_nid = 0; > static unsigned long long curr_boundary = 0; > > - *nid = fake_nid; > if (!p) > return 0; > > @@ -60,6 +59,7 @@ static int __cpuinit fake_numa_create_new_node(unsigned > long end_pfn, > if (mem < curr_boundary) > return 0; > > + *nid = fake_nid; > curr_boundary = mem; > > if ((end_pfn << PAGE_SHIFT) > mem) { > Hi, Michael, Here's a better and more complete fix for the problem. Could you please see if it works for you? I tested it on a real NUMA box and it seemed to work fine there. Description --- This patch provides a fix for the problem found by Michael Ellerman <[EMAIL PROTECTED]> while using fake NUMA nodes on a cell box. The code modifies node id iff (as in if and only if) fake NUMA nodes are created. Signed-off-by: Balbir Singh <[EMAIL PROTECTED]> --- arch/powerpc/mm/numa.c |7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff -puN arch/powerpc/mm/numa.c~fix-fake-numa-nid-on-numa arch/powerpc/mm/numa.c --- linux-2.6.24-rc8/arch/powerpc/mm/numa.c~fix-fake-numa-nid-on-numa 2008-01-26 12:20:29.0 +0530 +++ linux-2.6.24-rc8-balbir/arch/powerpc/mm/numa.c 2008-01-26 12:27:53.0 +0530 @@ -49,7 +49,12 @@ static int __cpuinit fake_numa_create_ne static unsigned int fake_nid = 0; static unsigned long long curr_boundary = 0; - *nid = fake_nid; + /* +* If we did enable fake nodes and cross a node, +* remember the last node and start from there. 
+*/ + if (fake_nid) + *nid = fake_nid; if (!p) return 0; _ -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: threshold_init_device/kobject_uevent_env oops
On Jan 25, 2008 10:14 PM, Greg KH <[EMAIL PROTECTED]> wrote: > > On Fri, Jan 25, 2008 at 10:04:19PM -0800, Yinghai Lu wrote: > > On Jan 25, 2008 2:50 PM, Greg KH <[EMAIL PROTECTED]> wrote: > > > On Fri, Jan 25, 2008 at 02:47:11PM -0800, Greg KH wrote: > > > > On Fri, Jan 25, 2008 at 11:35:56PM +0100, Ingo Molnar wrote: > > > > > > > > > > * Greg KH <[EMAIL PROTECTED]> wrote: > > > > > > > > > > > On Fri, Jan 25, 2008 at 01:05:40PM -0800, Yinghai Lu wrote: > > > > > > > current linus tree + x86.git > > > > > > > > > > > > > > got > > > > > > > > > > > > > > Calling initcall 0x80b93d98: > > > > > > > threshold_init_device+0x0/0x3f() > > > > > > > BUG: unable to handle kernel NULL pointer dereference at > > > > > > > 0040 > > > > > > > IP: [] kobject_uevent_env+0x2a/0x3d9 > > > > > > > > > > > > Does this happen on just Linus's tree? > > > > > > > > > > > > Can you send me a .config file for this? > > > > > > > > > > > > What is threshold_init()? Is it something new in the x86.git tree? > > > > > > > > > > no. A quick grep shows that it is in a file that _your_ changes in > > > > > Linus' latest have touched: > > > > > > > > > > arch/x86/kernel/cpu/mcheck/mce_amd_64.c > > > > > > > > Ok, those are pretty much just search/and/replace type changes, but I > > > > have been running x86-64 boxes with these changes in place. > > > > > > Oh wait, I do see a change. We are now (finally) emitting a kobject > > > uevent for these devices, which somehow the code can't handle properly. > > > > > > Let me go poke this some more, unfortunatly I don't have any AMD 64 > > > boxes here anymore, only Intel based processors, so I can't run this > > > module... > > > > it only happens with AMD Quad Core CPU or Fam 10h. > > > > works well with AMD opteron Rev E, and Rev F. > > So this only dies on a multi-core system? Or does 2 processor boxes > work, but not 4? 
2 sockets x quad core will fail (fam 10h)
2 sockets x dual core works (rev E and rev F opteron)

there are some changes between opteron and fam10h. fam10h may have
more local vectors for MCE...
or more banks and blocks...

will look at the AMD64 BIOS and kernel porting guide for Fam 10h again..

wonder if your code uncovered some bugs ...

YH
Re: [GIT PATCH] driver core patches against 2.6.24
On Fri, 25 Jan 2008 11:11:48 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
>
> On Fri, 25 Jan 2008, Greg KH wrote:
> >
> > That's really weird, I don't see that at all here just running with
> > your 2.6.24 + my git tree and lots of USB drivers built into the
> > kernel also (like ehci_hcd).
>
> But do you use an initrd that tries to load the same driver too?
>
> I'm too lazy to want to do my own initrd. I just use the prepackaged
> ones and rely on the fact that my private kernel will refuse to load
> modules that aren't meant for it anyway.
>

you know about "make install" right? That copies the needed files to
/boot, adds them to grub AND makes an initrd for you.. all for free ;)

--
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
[PATCH] Moving spinlock to struct usb_hcd
Hi, This is an attempt to move the hcd_urb_list_lock to struct usb_hcd. The lock is taken on functions that try to add/delete/use urb against a given hcd. I have not seen any association of an urb with multiple hcds. Hence I thought this can be moved within usb_hcd. This should help reduce contention to usb during high load where i/o is happening to multiple hcds. I am also trying to see if hcd_root_hub_lock can also be moved to usb_hcd. Any comments on this? I have done some testing with this patch and it seems to be holding fine. If this looks ok I will submit the lock statistics before and after the change. Thanks, -Romit --- drivers/usb/core/hcd.c | 24 +++- drivers/usb/core/hcd.h |1 + 2 files changed, 12 insertions(+), 13 deletions(-) diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c index d5ed3fa..6eb0f45 100644 --- a/drivers/usb/core/hcd.c +++ b/drivers/usb/core/hcd.c @@ -98,9 +98,6 @@ EXPORT_SYMBOL_GPL (usb_bus_list_lock); /* used for controlling access to virtual root hubs */ static DEFINE_SPINLOCK(hcd_root_hub_lock); -/* used when updating an endpoint's URB list */ -static DEFINE_SPINLOCK(hcd_urb_list_lock); - /* wait queue for synchronous unlinks */ DECLARE_WAIT_QUEUE_HEAD(usb_kill_urb_queue); @@ -1000,7 +997,7 @@ int usb_hcd_link_urb_to_ep(struct usb_hcd *hcd, struct urb *urb) { int rc = 0; - spin_lock(&hcd_urb_list_lock); + spin_lock(&hcd->hcd_urb_list_lock); /* Check that the URB isn't being killed */ if (unlikely(urb->reject)) { @@ -1033,7 +1030,7 @@ int usb_hcd_link_urb_to_ep(struct usb_hcd *hcd, struct urb *urb) goto done; } done: - spin_unlock(&hcd_urb_list_lock); + spin_unlock(&hcd->hcd_urb_list_lock); return rc; } EXPORT_SYMBOL_GPL(usb_hcd_link_urb_to_ep); @@ -1106,9 +1103,9 @@ EXPORT_SYMBOL_GPL(usb_hcd_check_unlink_urb); void usb_hcd_unlink_urb_from_ep(struct usb_hcd *hcd, struct urb *urb) { /* clear all state linking urb to this dev (and hcd) */ - spin_lock(&hcd_urb_list_lock); + spin_lock(&hcd->hcd_urb_list_lock); 
list_del_init(&urb->urb_list); - spin_unlock(&hcd_urb_list_lock); + spin_unlock(&hcd->hcd_urb_list_lock); } EXPORT_SYMBOL_GPL(usb_hcd_unlink_urb_from_ep); @@ -1311,7 +1308,7 @@ void usb_hcd_flush_endpoint(struct usb_device *udev, hcd = bus_to_hcd(udev->bus); /* No more submits can occur */ - spin_lock_irq(&hcd_urb_list_lock); + spin_lock_irq(&hcd->hcd_urb_list_lock); rescan: list_for_each_entry (urb, &ep->urb_list, urb_list) { int is_in; @@ -1320,7 +1317,7 @@ rescan: continue; usb_get_urb (urb); is_in = usb_urb_dir_in(urb); - spin_unlock(&hcd_urb_list_lock); + spin_unlock(&hcd->hcd_urb_list_lock); /* kick hcd */ unlink1(hcd, urb, -ESHUTDOWN); @@ -1345,14 +1342,14 @@ rescan: usb_put_urb (urb); /* list contents may have changed */ - spin_lock(&hcd_urb_list_lock); + spin_lock(&hcd->hcd_urb_list_lock); goto rescan; } - spin_unlock_irq(&hcd_urb_list_lock); + spin_unlock_irq(&hcd->hcd_urb_list_lock); /* Wait until the endpoint queue is completely empty */ while (!list_empty (&ep->urb_list)) { - spin_lock_irq(&hcd_urb_list_lock); + spin_lock_irq(&hcd->hcd_urb_list_lock); /* The list may have changed while we acquired the spinlock */ urb = NULL; @@ -1361,7 +1358,7 @@ rescan: urb_list); usb_get_urb (urb); } - spin_unlock_irq(&hcd_urb_list_lock); + spin_unlock_irq(&hcd->hcd_urb_list_lock); if (urb) { usb_kill_urb (urb); @@ -1618,6 +1615,7 @@ struct usb_hcd *usb_create_hcd (const struct hc_driver *driver, dev_dbg (dev, "hcd alloc failed\n"); return NULL; } + spin_lock_init(&hcd->hcd_urb_list_lock); dev_set_drvdata(dev, hcd); kref_init(&hcd->kref); diff --git a/drivers/usb/core/hcd.h b/drivers/usb/core/hcd.h index 98e2419..e23ff45 100644 --- a/drivers/usb/core/hcd.h +++ b/drivers/usb/core/hcd.h @@ -128,6 +128,7 @@ struct usb_hcd { * input size of periodic table to an interrupt scheduler. * (ohci 32, uhci 1024, ehci 256/512/1024). */ + spinlock_t hcd_urb_list_lock; /* The HC driver's private data is stored at the end of * this structure. 
-- 1.4.4.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org
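The idea behind the patch, one lock per host controller instead of one global
lock, can be sketched outside the kernel as follows. This is a userspace
analogy only: pthread mutexes stand in for spinlocks, a singly linked list
stands in for the endpoint URB list, and every name here (hcd_like, urb_like,
hcd_like_link, hcd_like_unlink) is hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

/* Stand-in for a URB: just a list node. */
struct urb_like {
	struct urb_like *next;
};

/* Stand-in for struct usb_hcd with the lock moved inside the object. */
struct hcd_like {
	pthread_mutex_t urb_list_lock;	/* protects urb_list and queued */
	struct urb_like *urb_list;
	int queued;
};

void hcd_like_init(struct hcd_like *hcd)
{
	pthread_mutex_init(&hcd->urb_list_lock, NULL);
	hcd->urb_list = NULL;
	hcd->queued = 0;
}

/* Queue a URB on this controller; only this controller's lock is taken. */
void hcd_like_link(struct hcd_like *hcd, struct urb_like *urb)
{
	pthread_mutex_lock(&hcd->urb_list_lock);
	urb->next = hcd->urb_list;
	hcd->urb_list = urb;
	hcd->queued++;
	pthread_mutex_unlock(&hcd->urb_list_lock);
}

/* Remove a URB from this controller. Returns 1 if it was queued here. */
int hcd_like_unlink(struct hcd_like *hcd, struct urb_like *urb)
{
	struct urb_like **pp;
	int found = 0;

	pthread_mutex_lock(&hcd->urb_list_lock);
	for (pp = &hcd->urb_list; *pp; pp = &(*pp)->next) {
		if (*pp == urb) {
			*pp = urb->next;
			urb->next = NULL;
			hcd->queued--;
			found = 1;
			break;
		}
	}
	pthread_mutex_unlock(&hcd->urb_list_lock);
	return found;
}
```

Threads working on different controllers never touch the same lock in this
layout, which is the contention reduction the patch is after. It only holds
as long as a URB is never queued on more than one controller, the assumption
stated in the mail above.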
Re: threshold_init_device/kobject_uevent_env oops
On Fri, Jan 25, 2008 at 03:20:45PM -0800, Yinghai Lu wrote:
> On Jan 25, 2008 3:08 PM, Greg KH <[EMAIL PROTECTED]> wrote:
> > On Fri, Jan 25, 2008 at 11:35:56PM +0100, Ingo Molnar wrote:
> ..
> > Also, can someone enable CONFIG_KOBJECT_DEBUG and send me the output of
> > the startup of this code? That should help explain what order things
> > are happening in.
>
> Calling initcall 0x80ba1dee: threshold_init_device+0x0/0x3f()
> kobject: 'threshold_bank4' (8108265450c0): kobject_add_internal: parent: 'machinecheck0', set: ''
> kobject: 'misc0' (810425497418): kobject_add_internal: parent: 'threshold_bank4', set: ''
> kobject: 'misc1' (810425497498): kobject_add_internal: parent: 'threshold_bank4', set: ''
> kobject: 'misc2' (810425497518): kobject_add_internal: parent: 'threshold_bank4', set: ''
> Unable to handle kernel NULL pointer dereference at 0018 RIP:
> [] kobject_uevent_env+0x31/0x45f

2 of these work just fine, and the third blows up in kobject_uevent().
So weird, let me dig further...

Hm, it's when we unwind that we blow up on the kobject_uevent, as that's
the first time it is called (gotta love recursion here...)

So it is really never working for these objects at all, what a mess.

As a work-around for now, you can probably just comment out the
kobject_uevent() call in the file arch/x86/kernel/cpu/mcheck/mce_amd_64.c
and everything should work just fine, as there never really was an event
being properly generated before, no one would miss it now :)

I'll keep digging...

thanks,

greg k-h
Re: threshold_init_device/kobject_uevent_env oops
On Fri, Jan 25, 2008 at 10:04:19PM -0800, Yinghai Lu wrote: > On Jan 25, 2008 2:50 PM, Greg KH <[EMAIL PROTECTED]> wrote: > > On Fri, Jan 25, 2008 at 02:47:11PM -0800, Greg KH wrote: > > > On Fri, Jan 25, 2008 at 11:35:56PM +0100, Ingo Molnar wrote: > > > > > > > > * Greg KH <[EMAIL PROTECTED]> wrote: > > > > > > > > > On Fri, Jan 25, 2008 at 01:05:40PM -0800, Yinghai Lu wrote: > > > > > > current linus tree + x86.git > > > > > > > > > > > > got > > > > > > > > > > > > Calling initcall 0x80b93d98: > > > > > > threshold_init_device+0x0/0x3f() > > > > > > BUG: unable to handle kernel NULL pointer dereference at > > > > > > 0040 > > > > > > IP: [] kobject_uevent_env+0x2a/0x3d9 > > > > > > > > > > Does this happen on just Linus's tree? > > > > > > > > > > Can you send me a .config file for this? > > > > > > > > > > What is threshold_init()? Is it something new in the x86.git tree? > > > > > > > > no. A quick grep shows that it is in a file that _your_ changes in > > > > Linus' latest have touched: > > > > > > > > arch/x86/kernel/cpu/mcheck/mce_amd_64.c > > > > > > Ok, those are pretty much just search/and/replace type changes, but I > > > have been running x86-64 boxes with these changes in place. > > > > Oh wait, I do see a change. We are now (finally) emitting a kobject > > uevent for these devices, which somehow the code can't handle properly. > > > > Let me go poke this some more, unfortunatly I don't have any AMD 64 > > boxes here anymore, only Intel based processors, so I can't run this > > module... > > it only happens with AMD Quad Core CPU or Fam 10h. > > works well with AMD opteron Rev E, and Rev F. So this only dies on a multi-core system? Or does 2 processor boxes work, but not 4? > So you may need have access to new system with quad core cpu. Ugh, that's not good. The kobjects here are really not making much sense. Jacob, any hints on exactly what you were trying to do with these kobjects? 
What's the end goal here, and why didn't you just use a struct device
instead?

The mce_amd_64.c file is the only thing in the tree using this userspace
API, can you please document it in Documentation/ABI so that others can
understand what it is used for, what files are expected, and what values
in the files are?

thanks,

greg k-h
Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()
Andi Kleen wrote:
> > so INVLPG makes sense for pagetable fault related single-address
> > flushes, but they rarely make sense for range flushes. (and that's how
> > Linux uses it)
>
> I think it would be an interesting experiment to switch
> flush_tlb_range() over to INVLPG if the length is below some threshold
> and see if there are visible effects in macro benchmarks.
>
> The main problem would be to determine the right threshold -- would
> likely be CPU dependent.

It would be an interesting experiment. Odds are pretty good that the
cutover is roughly linear in the TLB size.

	-hpa
Re: threshold_init_device/kobject_uevent_env oops
On Jan 25, 2008 2:50 PM, Greg KH <[EMAIL PROTECTED]> wrote:
> On Fri, Jan 25, 2008 at 02:47:11PM -0800, Greg KH wrote:
> > On Fri, Jan 25, 2008 at 11:35:56PM +0100, Ingo Molnar wrote:
> > >
> > > * Greg KH <[EMAIL PROTECTED]> wrote:
> > >
> > > > On Fri, Jan 25, 2008 at 01:05:40PM -0800, Yinghai Lu wrote:
> > > > > current linus tree + x86.git
> > > > >
> > > > > got
> > > > >
> > > > > Calling initcall 0x80b93d98: threshold_init_device+0x0/0x3f()
> > > > > BUG: unable to handle kernel NULL pointer dereference at 0040
> > > > > IP: [] kobject_uevent_env+0x2a/0x3d9
> > > >
> > > > Does this happen on just Linus's tree?
> > > >
> > > > Can you send me a .config file for this?
> > > >
> > > > What is threshold_init()? Is it something new in the x86.git tree?
> > >
> > > no. A quick grep shows that it is in a file that _your_ changes in
> > > Linus' latest have touched:
> > >
> > > arch/x86/kernel/cpu/mcheck/mce_amd_64.c
> >
> > Ok, those are pretty much just search/and/replace type changes, but I
> > have been running x86-64 boxes with these changes in place.
>
> Oh wait, I do see a change. We are now (finally) emitting a kobject
> uevent for these devices, which somehow the code can't handle properly.
>
> Let me go poke this some more, unfortunately I don't have any AMD 64
> boxes here anymore, only Intel based processors, so I can't run this
> module...

it only happens with AMD Quad Core CPU or Fam 10h.

works well with AMD opteron Rev E, and Rev F.

So you may need to have access to a new system with a quad core cpu.

YH
Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()
On Saturday 26 January 2008 01:11:28 Ingo Molnar wrote:

> (plus
> any add-on TLB miss costs - but those are amortized quite well as long
> as the pagetables are well cached - which they usually are on today's
> 2MB-ish L2 caches),

Did you measure the cost of that amortizing too? My guess is that
especially with TLBs getting larger and larger the cost of full CR3
flushes is rising.

> so INVLPG makes sense for pagetable fault related single-address
> flushes, but they rarely make sense for range flushes. (and that's how
> Linux uses it)

I think it would be an interesting experiment to switch
flush_tlb_range() over to INVLPG if the length is below some threshold
and see if there are visible effects in macro benchmarks.

The main problem would be to determine the right threshold -- would
likely be CPU dependent.

-Andi
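The experiment proposed above boils down to one decision per range flush. A
rough userspace sketch of that decision is below; it does not touch any TLB
and is not the kernel's flush_tlb_range(). The 4 KiB page size, the type
names, and the function plan_range_flush() are assumptions made for the
illustration, and the threshold is deliberately left as a parameter because
the thread itself says the right value is unknown and probably CPU dependent.

```c
#define PAGE_SIZE_BYTES 4096UL	/* assumed page size for the sketch */

enum flush_kind {
	FLUSH_SINGLE_PAGES,	/* one INVLPG-style invalidation per page */
	FLUSH_FULL		/* one full flush (CR3 reload) */
};

struct flush_plan {
	enum flush_kind kind;
	unsigned long pages;	/* pages covered by [start, end) */
};

/*
 * Hypothetical policy: invalidate page by page when the range covers at
 * most threshold_pages pages, otherwise fall back to a full flush.
 */
struct flush_plan plan_range_flush(unsigned long start, unsigned long end,
				   unsigned long threshold_pages)
{
	struct flush_plan plan;
	unsigned long first = start / PAGE_SIZE_BYTES;
	/* round the exclusive end up to the next page boundary */
	unsigned long last = end / PAGE_SIZE_BYTES +
			     (end % PAGE_SIZE_BYTES != 0);

	plan.pages = last > first ? last - first : 0;
	plan.kind = plan.pages <= threshold_pages ? FLUSH_SINGLE_PAGES
						  : FLUSH_FULL;
	return plan;
}
```

Benchmarking would then mean sweeping threshold_pages per CPU model and
watching where the two strategies cross over.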
Re: [RFC] ext3 freeze feature
On Sat, Jan 26, 2008 at 04:35:26PM +1100, David Chinner wrote:
> On Fri, Jan 25, 2008 at 07:59:38PM +0900, Takashi Sato wrote:
> > The points of the implementation are as follows.
> > - Add calls of the freeze function (freeze_bdev) and
> >   the unfreeze function (thaw_bdev) in ext3_ioctl().
> >
> > - ext3_freeze_timeout() which calls the unfreeze function (thaw_bdev)
> >   is registered to the delayed work queue to unfreeze the filesystem
> >   automatically after the lapse of the specified time.
>
> Seems like pointless complexity to me - what happens if a
> timeout occurs while the filesystem is still freezing?
>
> It's not uncommon for a freeze to take minutes if memory
> is full of dirty data that needs to be flushed out, esp. if
> dm-snap is doing COWs for every write issued

Sorry, ignore this bit - I just realised the timer is set up after the freeze has occurred.

Still, that makes it potentially dangerous to whatever is being done while the filesystem is frozen.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [PATCH 085/196] kset: convert s390 ipl.c to use kset_create
On Sat, Jan 26, 2008 at 12:11:33AM +0100, Heiko Carstens wrote:
> On Fri, Jan 25, 2008 at 09:48:58AM -0800, Greg KH wrote:
> > On Fri, Jan 25, 2008 at 01:20:53PM +0100, Heiko Carstens wrote:
> > > On Thu, Jan 24, 2008 at 11:31:54PM -0800, Greg Kroah-Hartman wrote:
> > > > Dynamically create the kset instead of declaring it statically.
> > > > This makes the kobject attributes now work properly that I broke in the
> > > > previous patch.
> > >
> > > Could you please merge this and the previous patch before it goes
> > > upstream? Having an intermediate state where things are broken
> > > will cause pain and additional work in case of bisecting.
> >
> > It will not cause a build error (see the previous patch for details.)
> > The sysfs files will not properly show the correct data, that is all.
> >
> > The odds that you will hit this in a 'git bisect' is VERY low, and the
> > previous patch states that the files are now broken, so there should not
> > be any confusion regarding any user that might run across this.
>
> The odds are very low, as long as not more patch sets come up which
> introduce intermediate broken kernels.
> What exactly is the advantage of breaking the kernel with patch 1 and
> then fix it again with patch 2 instead of doing the straight forward
> conversions all with one patch?

I was trying to do one logical thing at a time with this driver as I did not have the hardware to test, and I could not even build the code at the time.

Looking more closely, I think the 084 patch might still work properly, but I can't guarantee it as the default kobject parent might not be pointing to the correct attribute at the time. I know 085 fixes this to be sure that it will work properly.

It helped in reviewing this code by the other s390 developers to have this in at least 2 pieces, to try to untangle the mess of sysfs files, ksets, and other atrocities that you all have grown into over the years.
So again, I'm sorry if this happens to break your run-time tests when doing a 'git bisect', but as I explicitly stated it did in the patch, I think everyone is properly forewarned :)

This core rework was tough to do, there was a reason no one had done it before. Now it is cleaner, smaller, able to be understood by at least one active kernel developer, if not more, and it's documented, with working examples. If the downside of this effort is only this one thing (note that others are finally finding real bugs...) I'll be very happy.

thanks,

greg k-h
Re: [GIT PATCH] driver core patches against 2.6.24
On Sat, Jan 26, 2008 at 03:50:57PM +1100, Rusty Russell wrote:
> On Saturday 26 January 2008 06:42:19 Greg KH wrote:
> > On Fri, Jan 25, 2008 at 10:44:59AM -0800, Linus Torvalds wrote:
> > > On Thu, 24 Jan 2008, Greg KH wrote:
> > > > Here are a pretty large number of kobject, documentation, and driver
> > > > core patches against your 2.6.24 git tree.
> > >
> > > I've merged it all, but it causes lots of scary warnings:
> > >
> > > - from the purely broken ones:
> > >
> > > ehci_hcd: no version for "struct_module" found: kernel tainted.
> >
> > Ok, in looking at the code, this should also be showing up for you on a
> > "clean" 2.6.24 release, I didn't change anything in this code path.
> >
> > That is what taints your kernel with the "F" flag.
> >
> > > - to the scary ones:
> > >
> > > sysfs: duplicate filename 'ehci_hcd' can not be created
> > > WARNING: at fs/sysfs/dir.c:424 sysfs_add_one()
> > > Pid: 610, comm: insmod Tainted: GF 2.6.24-gb47711bf #28
> > >
> > > Call Trace:
> > >  [] sysfs_add_one+0x54/0xbd
> > >  [] create_dir+0x4f/0x87
> > >  [] sysfs_create_dir+0x35/0x4a
> > >  [] kobject_get+0x12/0x17
> > >  [] kobject_add_internal+0xd9/0x194
> > >  [] kobject_add_varg+0x54/0x61
> > >  [] __alloc_pages+0x66/0x2ee
> > >  [] kobject_init+0x42/0x82
> > >  [] kobject_init_and_add+0x9a/0xa7
> > >  [] __vmalloc_area_node+0x111/0x135
> > >  [] mod_sysfs_init+0x6e/0x83
> > >  [] sys_init_module+0xa3d/0x1833
> > >  [] dput+0x1c/0x10b
> > >  [] system_call+0x7e/0x83
> >
> > This is the sysfs core telling you that someone did something stupid :)
> >
> > Yes, that's new, but the "error" was always there, I just made the
> > warning more visible to get people to pay attention to it, and find the
> > real errors where this happens (and it has found them, which is a good
> > thing.)
> >
> > But in this case, it doesn't look like the module loading code will
> > detect that we are trying to load a module that is already present until
> > the kobjects are set up here.
> > It's been this way for a long time :(
> >
> > Rusty, any ideas of us adding a different check for "duplicate" modules
> > like this earlier in the load_module() function, so we don't spend so
> > much effort in building everything up when we don't need to?
>
> module.c:1832 (in load_module)
>
>	if (find_module(mod->name)) {
>		err = -EEXIST;
>		goto free_mod;
>	}
>
> That's pretty early, and before this backtrace.

But that doesn't catch the case here, of trying to load a module when the code itself is already built into the kernel. For that we are relying on the sysfs core to tell us we have a duplicate name problem, which happens much later.

Is there any test you can do sooner, or is relying on the sysfs test acceptable?

thanks,

greg k-h
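To make the gap concrete, here is a minimal userspace model of an earlier check that would cover both cases: a name that is already loaded as a module (what find_module() catches) and a name whose code is built into the kernel (what currently only the sysfs duplicate-name warning catches). Everything here is hypothetical: the thread does not say such a table of built-in names exists, and `builtin_names`, `loaded_names` and `early_duplicate_check()` are invented for illustration only.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: names of drivers compiled into the kernel image. */
static const char *const builtin_names[] = { "ehci_hcd", "usbcore" };

/* Hypothetical: names of modules that are already loaded. */
static const char *const loaded_names[] = { "ext3" };

/* Return 1 if name appears in the table, 0 otherwise. */
static int name_in(const char *const *table, size_t n, const char *name)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (strcmp(table[i], name) == 0)
            return 1;
    return 0;
}

/*
 * Model of a check done early in module loading, before any kobjects
 * are built up: reject the load if the name is taken by a loaded
 * module or by built-in code.
 */
static int early_duplicate_check(const char *name)
{
    if (name_in(loaded_names,
                sizeof(loaded_names) / sizeof(loaded_names[0]), name))
        return -EEXIST;
    if (name_in(builtin_names,
                sizeof(builtin_names) / sizeof(builtin_names[0]), name))
        return -EEXIST;
    return 0;
}
```

Whether the kernel can cheaply provide the second table is the actual question asked above; the sketch only shows where such a check would sit relative to the existing one.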
Re: [RFC] ext3 freeze feature
On Fri, Jan 25, 2008 at 07:59:38PM +0900, Takashi Sato wrote:
> The points of the implementation are as follows.
> - Add calls of the freeze function (freeze_bdev) and
>   the unfreeze function (thaw_bdev) in ext3_ioctl().
>
> - ext3_freeze_timeout() which calls the unfreeze function (thaw_bdev)
>   is registered to the delayed work queue to unfreeze the filesystem
>   automatically after the lapse of the specified time.

Seems like pointless complexity to me - what happens if a timeout occurs while the filesystem is still freezing?

It's not uncommon for a freeze to take minutes if memory is full of dirty data that needs to be flushed out, esp. if dm-snap is doing COWs for every write issued

> +	case EXT3_IOC_FREEZE: {
> +		if (inode->i_sb->s_frozen != SB_UNFROZEN)
> +			return -EINVAL;
> +		freeze_bdev(inode->i_sb->s_bdev);
> +	case EXT3_IOC_THAW: {
> +		if (!capable(CAP_SYS_ADMIN))
> +			return -EPERM;
> +		if (inode->i_sb->s_frozen == SB_UNFROZEN)
> +			return -EINVAL;
> .
> +		/* Unfreeze */
> +		thaw_bdev(inode->i_sb->s_bdev, inode->i_sb);

That's inherently unsafe - you can have multiple unfreezes running in parallel, which seriously screws with the bdev semaphore count that is used to lock the device, due to doing multiple up()s for every down.

Your timeout thingy guarantees that at some point you will get multiple up()s occurring due to the timer firing racing with a thaw ioctl.

If this interface is to be more widely exported, then it needs a complete revamp of how the bdev is locked while it is frozen so that there is no chance of a double up() ever occurring on the bd_mount_sem due to racing thaws.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
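One way to picture the "no double up()" requirement is a frozen-state word that only one caller can flip, so that a late thaw (timer or ioctl) becomes a no-op instead of a second up(). The following is a userspace model using C11 atomics, not the kernel's bdev locking: `model_bdev`, `model_freeze()` and `model_thaw()` are names invented for this sketch, and `sem_count` merely stands in for the bd_mount_sem count.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum frozen_state { UNFROZEN = 0, FROZEN = 1 };

/* Userspace stand-in for the block device; not a kernel structure. */
struct model_bdev {
    atomic_int state;     /* UNFROZEN or FROZEN */
    atomic_int sem_count; /* models the semaphore count (1 == free) */
};

static void model_init(struct model_bdev *b)
{
    atomic_init(&b->state, UNFROZEN);
    atomic_init(&b->sem_count, 1);
}

/* Only the caller that wins the UNFROZEN -> FROZEN transition does down(). */
static bool model_freeze(struct model_bdev *b)
{
    int expected = UNFROZEN;

    if (!atomic_compare_exchange_strong(&b->state, &expected, FROZEN))
        return false;
    atomic_fetch_sub(&b->sem_count, 1); /* down() */
    return true;
}

/* Only the caller that wins the FROZEN -> UNFROZEN transition does up(). */
static bool model_thaw(struct model_bdev *b)
{
    int expected = FROZEN;

    if (!atomic_compare_exchange_strong(&b->state, &expected, UNFROZEN))
        return false;
    atomic_fetch_add(&b->sem_count, 1); /* up() */
    return true;
}

/* Sequential demo: a thaw ioctl followed by the timeout firing late. */
static int sem_count_after_two_thaws(void)
{
    struct model_bdev b;

    model_init(&b);
    model_freeze(&b);
    model_thaw(&b); /* e.g. the thaw ioctl */
    model_thaw(&b); /* e.g. the timer; loses the transition, does nothing */
    return atomic_load(&b.sem_count);
}
```

The demo runs the two thaws one after the other rather than truly in parallel; the point is only that the count returns to its initial value however many thaws are attempted. A real fix would also have to cancel the pending delayed work, which this model does not cover.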
Re: [RFC] ext3: per-process soft-syncing data=ordered mode
Chris Snook wrote:
> Al Boldi wrote:
> > Greetings!
> >
> > data=ordered mode has proven reliable over the years, and it does this
> > by ordering filedata flushes before metadata flushes. But this
> > sometimes causes contention in the order of a 10x slowdown for certain
> > apps, either due to the misuse of fsync or due to inherent behaviour
> > like db's, as well as inherent starvation issues exposed by the
> > data=ordered mode.
> >
> > data=writeback mode alleviates data=ordered mode slowdowns, but only works
> > per-mount and is too dangerous to run as a default mode.
> >
> > This RFC proposes to introduce a tunable which allows to disable fsync
> > and changes ordered into writeback writeout on a per-process basis like
> > this:
> >
> > echo 1 > /proc/`pidof process`/softsync
> >
> >
> > Your comments are much welcome!
>
> This is basically a kernel workaround for stupid app behavior.

Exactly right to some extent, but don't forget the underlying data=ordered starvation problem, which looks like a genuinely deep problem maybe related to blockIO.

> It
> wouldn't be the first time we've provided such an option, but we shouldn't
> do it without a very good justification. At the very least, we need a
> test case that demonstrates the problem

See the 'konqueror deadlocks in 2.6.22' thread.

> and benchmark results that prove that this approach actually fixes it.

8M-record insert into indexed db-table:

             ordered    writeback
  sqlite3:   75m22s     8m45s
  mysql4 :   23m35s     5m29s

> I suspect we can find a cleaner fix for the problem.

I hope so, but even with a fix available addressing the data=ordered starvation issue, this tunable could remain useful for those apps that misbehave.

Thanks!

--
Al
Re: [RFC] ext3: per-process soft-syncing data=ordered mode
Jan Kara wrote:
> > Greetings!
> >
> > data=ordered mode has proven reliable over the years, and it does this
> > by ordering filedata flushes before metadata flushes. But this
> > sometimes causes contention in the order of a 10x slowdown for certain
> > apps, either due to the misuse of fsync or due to inherent behaviour
> > like db's, as well as inherent starvation issues exposed by the
> > data=ordered mode.
> >
> > data=writeback mode alleviates data=ordered mode slowdowns, but only works
> > per-mount and is too dangerous to run as a default mode.
> >
> > This RFC proposes to introduce a tunable which allows to disable fsync
> > and changes ordered into writeback writeout on a per-process basis like
> > this:
> >
> > echo 1 > /proc/`pidof process`/softsync
>
> I guess disabling fsync() was already commented on enough. Regarding
> switching to writeback mode on per-process basis - not easily possible
> because sometimes data is not written out by the process which stored
> them (think of mmapped file).

Do you mean there is a locking problem?

> And in case of DB, they use direct-io
> anyway most of the time so they don't care about journaling mode anyway.

Testing with sqlite3 and mysql4 shows that performance drastically improves with writeback writeout.

> But as Diego wrote, there is definitely some room for improvement in
> current data=ordered mode so the difference shouldn't be as big in the
> end.

Yes, it would be nice to get to the bottom of this starvation problem, but even then, the proposed tunable remains useful for misbehaving apps.

Thanks!

--
Al
Re: [RFC] ext3: per-process soft-syncing data=ordered mode
[EMAIL PROTECTED] wrote:
> On Thu, 24 Jan 2008 23:36:00 +0300, Al Boldi said:
> > This RFC proposes to introduce a tunable which allows to disable fsync
> > and changes ordered into writeback writeout on a per-process basis like
> > this:
:
:
> But if you want to give them enough rope to shoot themselves in the foot
> with, I'd suggest abusing LD_PRELOAD to replace the fsync() glibc code
> instead. No need to clutter the kernel with rope that can be (and has
> been) done in userspace.

Ok that's possible, but as you cannot use LD_PRELOAD to deal with changing ordered into writeback mode, we might as well allow them to disable fsync here, because it is in the same use-case.

Thanks!

--
Al
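For reference, the LD_PRELOAD approach mentioned above amounts to a tiny shared object that overrides fsync() with a function that does nothing. The sketch below is one minimal way to write it; the file name `nofsync.c`, the library name `libnofsync.so` and the decision to also stub fdatasync() are choices made for this example, not something proposed in the thread. As discussed above, it only covers the fsync half of the proposal and does nothing about ordered versus writeback writeout, and it trades away durability for any application it is applied to.

```c
/*
 * nofsync.c (hypothetical file name)
 *
 * Possible build and use, assuming a typical Linux toolchain:
 *   cc -shared -fPIC -o libnofsync.so nofsync.c
 *   LD_PRELOAD=./libnofsync.so some_application
 */
#include <unistd.h>

/* Pretend the sync succeeded without asking the kernel to do anything. */
int fsync(int fd)
{
    (void)fd; /* intentionally unused */
    return 0;
}

/* Same treatment for fdatasync(), which many databases use instead. */
int fdatasync(int fd)
{
    (void)fd; /* intentionally unused */
    return 0;
}
```

Because the dynamic linker resolves fsync() to the preloaded definition first, the application's calls never reach the C library wrapper or the system call.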
Re: [RFC] ext3: per-process soft-syncing data=ordered mode
Diego Calleja wrote:
> On Thu, 24 Jan 2008 23:36:00 +0300, Al Boldi <[EMAIL PROTECTED]> wrote:
> > Greetings!
> >
> > data=ordered mode has proven reliable over the years, and it does this
> > by ordering filedata flushes before metadata flushes. But this
> > sometimes causes contention in the order of a 10x slowdown for certain
> > apps, either due to the misuse of fsync or due to inherent behaviour
> > like db's, as well as inherent starvation issues exposed by the
> > data=ordered mode.
>
> There's a related bug in bugzilla:
> http://bugzilla.kernel.org/show_bug.cgi?id=9546
>
> The diagnostic from Jan Kara is different though, but I think it may be
> the same problem...
>
> "One process does data-intensive load. Thus in the ordered mode the
> transaction is tiny but has tons of data buffers attached. If commit
> happens, it takes a long time to sync all the data before the commit
> can proceed... In the writeback mode, we don't wait for data buffers, in
> the journal mode amount of data to be written is really limited by the
> maximum size of a transaction and so we write by much smaller chunks
> and better latency is thus ensured."
>
>
> I'm hitting this bug too...it's surprising that there's not many people
> reporting more bugs about this, because it's really annoying.
>
>
> There's a patch by Jan Kara (that I'm including here because bugzilla
> didn't include it and took me a while to find it) which I don't know if
> it's supposed to fix the problem, but it'd be interesting to try:

Thanks a lot, but it doesn't fix it.

--
Al
Re: [PATCH 063/196] kset: convert /sys/devices to use kset_create
On Fri, Jan 25, 2008 at 09:40:55PM -0600, Olof Johansson wrote:
> On Thu, Jan 24, 2008 at 11:10:01PM -0800, Greg Kroah-Hartman wrote:
> > Dynamically create the kset instead of declaring it statically. We also
> > rename devices_subsys to devices_kset to catch all users of the
> > variable.
>
> Guess what, you broke powerpc again!

I did this ON PURPOSE!!!

The linux-kernel archives hold the details, and I was told by the PPC64 IBM people that they would fix this properly for 2.6.25, and not to hold back on my changes. This has been known for many months now.

> [EMAIL PROTECTED]:~/work/linux/k.org $ git grep devices_subsys
> arch/powerpc/kernel/vio.c:extern struct kset devices_subsys; /* needed for vio_find_name() */
> arch/powerpc/kernel/vio.c: found = kset_find_obj(&devices_subsys, kobj_name);
>
> Obviously causes build failures, even of ppc64_defconfig.
>
> (I can unfortunately not boot test, since I lack hardware that uses vio)
>
>
> Signed-off-by: Olof Johansson <[EMAIL PROTECTED]>
>
> diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
> index 19a5656..ee752ab 100644
> --- a/arch/powerpc/kernel/vio.c
> +++ b/arch/powerpc/kernel/vio.c
> @@ -37,7 +37,7 @@
>  #include
>  #include
>
> -extern struct kset devices_subsys; /* needed for vio_find_name() */
> +extern struct kset *devices_kset; /* needed for vio_find_name() */

No, this just papers over the real problem here.

For some reason, the vio code thinks it is acceptable to walk the whole device tree and match by a name and just assume that they got the correct device. You call this "enterprise grade"? :)

You need to just put your device on a real bus, and then just walk the bus. That's the ONLY way you can guarantee the proper name will return what you want, and you get the pointer that you really think you are getting.
There is a reason that devices_kset is not exported, don't make me go and have to name it something like:
	devices_kset_dont_touch_this_or_gregkh_will_make_fun_of_you

Or I'll just mush 3 files in the driver core together and keep the symbol from being accessible at all.

So no, I'm going to leave the build broken for this code, because that is what it really is. Please fix it correctly.

thanks,

greg k-h
Re: [PATCH 06/20 -v5] add notrace annotations for NMI routines
On Wed, 23 Jan 2008, Mathieu Desnoyers wrote:
> * Steven Rostedt ([EMAIL PROTECTED]) wrote:
> > This annotates NMI functions with notrace. Some tracers may be able
> > to live with this, but some cannot. So we turn off NMI tracing.
> >
> > One solution might be to make a notrace_nmi which would only turn
> > off NMI tracing if a trace utility needed it off.
> >
>
> Is this still needed with the atomic clocksource read ?
>

Before you ask again, I've still included this in -v6, simply because I didn't get a chance to test it without this patch. I'll try to remember to do that on Monday.

-- Steve
Re: [RFC] ext3 freeze feature
On Fri, Jan 25, 2008 at 09:42:30PM +0900, Takashi Sato wrote:
> >I am also wondering whether we should have system call(s) for these:
> >
> >On Jan 25, 2008 12:59 PM, Takashi Sato <[EMAIL PROTECTED]> wrote:
> >>+ case EXT3_IOC_FREEZE: {
> >
> >>+ case EXT3_IOC_THAW: {
> >
> >And just convert XFS to use them too?
>
> I think it is reasonable to implement it as the generic system call, as you
> said. Do XFS folks think so?

Sure. Note that we can't immediately remove the XFS ioctls, otherwise we'd break userspace utilities that use them.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: 2.6.24-rt1
On Fri, 25 Jan 2008, Steven Rostedt wrote:
>
> *** NOTICE ***
>
> This still has the old version of the latency tracer. I'll try to
> release a -rt2 soon that has the new version. This way we can see what
> kind of regressions the new version might give.
>

This is taking longer than expected. Removing the old latency tracer has caused a bit to be broken. The latency tracer has been in the RT kernel for so long that it has hooks in lots of unrelated patches. It's taking a bit of surgery to remove all the bits without killing the rest.

I will not be working on this over the weekend. I'll start back up on Monday.

-- Steve
Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)
In message <[EMAIL PROTECTED]>, Al Viro writes:
> After grep for locking-related things:
>
>	* lock_parent(): who said that you won't get dentry moved
> before managing to grab i_mutex on parent? While we are at it,
> who said that you won't get dentry moved between fetching d_parent
> and doing dget()? In that case parent could've been _freed_ before
> you get to dget().

OK, so looks like I should use dget_parent() in my lock_parent(), as I've done elsewhere. I'll also take a look at all instances in which I get dentry->d_parent and see if a d_lock is needed there.

>	* in create_parents():
> +		struct inode *inode = lower_dentry->d_inode;
> +		/*
> +		 * If we get here, it means that we created a new
> +		 * dentry+inode, but copying permissions failed.
> +		 * Therefore, we should delete this inode and dput
> +		 * the dentry so as not to leave cruft behind.
> +		 */
> +		if (lower_dentry->d_op && lower_dentry->d_op->d_iput)
> +			lower_dentry->d_op->d_iput(lower_dentry,
> +						   inode);
> +		else
> +			iput(inode);
> +		lower_dentry->d_inode = NULL;
> +		dput(lower_dentry);
> +		lower_dentry = ERR_PTR(err);
> +		goto out;
> Really? So what happens if it had become positive after your test and
> somebody had looked it up in lower layer and just now happens to be
> in the middle of operations on it? Will be thucking frilled by that...

Good catch. That ->d_iput call was an old fix to a bug that has since been fixed more cleanly and generically in our copyup_permission routine and our unionfs_d_iput. I've removed the above ->d_iput "if" and tested to verify that it's indeed unnecessary.

>	* __unionfs_rename():
> +	lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
> +	err = vfs_rename(lower_old_dir_dentry->d_inode, lower_old_dentry,
> +			 lower_new_dir_dentry->d_inode, lower_new_dentry);
> +	unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
>
> Uh-huh... To start with, what guarantees that your lower_old_dentry
> is still a child of your lower_old_dir_dentry?
We dget/dget_parent the old/new dentry and parents a few lines above (actually, it looked like I forgot to dget(lower_new_dentry) -- fixed). This is a generic stackable f/s issue: ecryptfs does the same stuff before calling vfs_rename() on the lower objects.

> What's more, you are
> not checking the result of lock_rename(), i.e. asking for serious trouble.

OK. I'm now checking for the return from lock_rename for ancestor/rename rules. I'm CC'ing Mike Halcrow so he can do the same for ecryptfs.

>	* revalidation stuff: err... how the devil can it work for
> directories, when there's nothing to prevent changes in underlying
> layers between ->d_revalidate() and operation itself? For the upper
> layer (unionfs itself) everything's more or less fine, but the rest
> of that...

In a stacked f/s, we keep references to the lower dentries/inodes, so they can't disappear on us (that happens in our interpose function, called from our ->lookup). On entry to every f/s method in unionfs, we first perform lightweight revalidation of our dentry against the lower ones: we check if m/ctime changed (users modifying lower files) or if the generation# b/t our super and our dentries has changed (branch-management took place); if needed, then we perform a full revalidation of all lower objects (while holding a lock on the branch configuration). If we have to do a full reval upon entry to our ->op, and the reval failed, then we return an appropriate error; o/w we proceed. (In certain cases, the VFS re-issues a lookup if the f/s says that its dentry is invalid.)

Without changes to the VFS, I don't see how else I can ensure cache coherency cleanly, while allowing users to modify lower files; this feature is very useful to some unionfs users, who depend on it (so even if I could "lock out" the lower directories from being modified, there will be users who'd still want to be able to modify lower files).
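The lightweight revalidation described above boils down to two cheap comparisons before deciding whether a full revalidation is needed. Here is a userspace model of just that decision; the struct layouts and the `needs_full_revalidation()` name are invented for this sketch and are not the actual unionfs data structures. Note that it models only the check itself and does not address the window between the check and the operation that the question above is about.

```c
#include <stdbool.h>
#include <time.h>

/* Times of a lower object; a stand-in, not a kernel inode. */
struct lower_times {
    time_t mtime;
    time_t ctime;
};

/* What the upper (unionfs-like) dentry remembers; hypothetical layout. */
struct upper_dentry {
    unsigned int generation;   /* superblock generation when last validated */
    struct lower_times cached; /* lower times seen when last validated */
};

/*
 * Lightweight check run on entry to each operation. Returns true if a
 * full revalidation of the lower objects is required, i.e. if the
 * branch configuration changed (generation mismatch) or someone
 * modified the lower object directly (mtime/ctime changed).
 */
static bool needs_full_revalidation(const struct upper_dentry *d,
                                    unsigned int sb_generation,
                                    const struct lower_times *lower_now)
{
    if (d->generation != sb_generation)
        return true;
    if (lower_now->mtime != d->cached.mtime ||
        lower_now->ctime != d->cached.ctime)
        return true;
    return false;
}
```

In this model an operation would call the check first, run the expensive full revalidation only when it returns true, and fail with an error if that full revalidation does not succeed.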
BTW, my sense of the relationship b/t upper and lower objects and their validity in a stackable f/s, is that it's similar to the relationship b/t the NFS client and server -- the client can't be sure that a file on the server doesn't change b/t ->revalidate and ->op (hence nfs's reliance on dir mtime checks).

Perhaps this general topic is a good one to discuss at more length at LSF? Suggestions are welcome.

Thanks,
Erez.
[PATCH 4/4] Unionfs: lock_rename related locking fixes
CC: Mike Halcrow <[EMAIL PROTECTED]> Signed-off-by: Erez Zadok <[EMAIL PROTECTED]> --- fs/unionfs/rename.c | 16 +++- 1 files changed, 15 insertions(+), 1 deletions(-) diff --git a/fs/unionfs/rename.c b/fs/unionfs/rename.c index 9306a2b..5ab13f9 100644 --- a/fs/unionfs/rename.c +++ b/fs/unionfs/rename.c @@ -29,6 +29,7 @@ static int __unionfs_rename(struct inode *old_dir, struct dentry *old_dentry, struct dentry *lower_new_dir_dentry; struct dentry *lower_wh_dentry; struct dentry *lower_wh_dir_dentry; + struct dentry *trap; char *wh_name = NULL; lower_new_dentry = unionfs_lower_dentry_idx(new_dentry, bindex); @@ -95,6 +96,7 @@ static int __unionfs_rename(struct inode *old_dir, struct dentry *old_dentry, goto out; dget(lower_old_dentry); + dget(lower_new_dentry); lower_old_dir_dentry = dget_parent(lower_old_dentry); lower_new_dir_dentry = dget_parent(lower_new_dentry); @@ -122,9 +124,20 @@ static int __unionfs_rename(struct inode *old_dir, struct dentry *old_dentry, /* see Documentation/filesystems/unionfs/issues.txt */ lockdep_off(); - lock_rename(lower_old_dir_dentry, lower_new_dir_dentry); + trap = lock_rename(lower_old_dir_dentry, lower_new_dir_dentry); + /* source should not be ancenstor of target */ + if (trap == lower_old_dentry) { + err = -EINVAL; + goto out_err_unlock; + } + /* target should not be ancenstor of source */ + if (trap == lower_new_dentry) { + err = -ENOTEMPTY; + goto out_err_unlock; + } err = vfs_rename(lower_old_dir_dentry->d_inode, lower_old_dentry, lower_new_dir_dentry->d_inode, lower_new_dentry); +out_err_unlock: unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry); lockdep_on(); @@ -132,6 +145,7 @@ out_dput: dput(lower_old_dir_dentry); dput(lower_new_dir_dentry); dput(lower_old_dentry); + dput(lower_new_dentry); out: if (!err) { -- 1.5.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read 
the FAQ at http://www.tux.org/lkml/
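The two "trap" checks added by the lock_rename patch above can be illustrated with a small userspace model of a directory tree. This is a sketch, not the VFS implementation: `struct node`, `find_trap()` and `rename_check()` are invented names, no locking is modelled, and the tree shape is hard-coded for the example. The model assumes, as with dentries, that the root is its own parent.

```c
#include <errno.h>
#include <stddef.h>

/* Minimal directory-tree node; the root points to itself. */
struct node {
    struct node *parent;
};

/*
 * If one directory is an ancestor of the other, return the child of
 * the ancestor that lies on the path down to the other directory
 * (the "trap"). Return NULL if the directories are equal or unrelated.
 */
static struct node *find_trap(struct node *p1, struct node *p2)
{
    struct node *n;

    if (p1 == p2)
        return NULL;
    for (n = p2; n->parent != n; n = n->parent)
        if (n->parent == p1)
            return n; /* p1 is an ancestor of p2 */
    for (n = p1; n->parent != n; n = n->parent)
        if (n->parent == p2)
            return n; /* p2 is an ancestor of p1 */
    return NULL;
}

/* Apply the same two checks the patch performs on the returned trap. */
static int rename_check(struct node *old_dir, struct node *old,
                        struct node *new_dir, struct node *new)
{
    struct node *trap = find_trap(old_dir, new_dir);

    if (trap == old)
        return -EINVAL;    /* source is an ancestor of the target */
    if (trap == new)
        return -ENOTEMPTY; /* target is an ancestor of the source */
    return 0;
}

/* Example tree: /a, /a/b, /a/b/c and /a/b/x. */
static struct node root = { &root };
static struct node a = { &root };
static struct node b = { &a };
static struct node c = { &b };
static struct node x = { &b };
```

With this tree, moving /a to /a/b/x is rejected with -EINVAL, renaming /a/b/c onto /a is rejected with -ENOTEMPTY, and renaming /a/b/c to /a/b/x passes both checks.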
[PATCH 2/4] Unionfs: remove unnecessary call to d_iput
This old code was to fix a bug which has long since been fixed in our copyup_permission and unionfs_d_iput. Signed-off-by: Erez Zadok <[EMAIL PROTECTED]> --- fs/unionfs/copyup.c | 13 - 1 files changed, 0 insertions(+), 13 deletions(-) diff --git a/fs/unionfs/copyup.c b/fs/unionfs/copyup.c index 16b2c7c..8663224 100644 --- a/fs/unionfs/copyup.c +++ b/fs/unionfs/copyup.c @@ -807,19 +807,6 @@ begin: lower_dentry); unlock_dir(lower_parent_dentry); if (err) { - struct inode *inode = lower_dentry->d_inode; - /* -* If we get here, it means that we created a new -* dentry+inode, but copying permissions failed. -* Therefore, we should delete this inode and dput -* the dentry so as not to leave cruft behind. -*/ - if (lower_dentry->d_op && lower_dentry->d_op->d_iput) - lower_dentry->d_op->d_iput(lower_dentry, - inode); - else - iput(inode); - lower_dentry->d_inode = NULL; dput(lower_dentry); lower_dentry = ERR_PTR(err); goto out; -- 1.5.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/4] Unionfs: d_parent related locking fixes
Signed-off-by: Erez Zadok <[EMAIL PROTECTED]> --- fs/unionfs/copyup.c |3 +-- fs/unionfs/union.h |4 ++-- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/fs/unionfs/copyup.c b/fs/unionfs/copyup.c index 8663224..9beac01 100644 --- a/fs/unionfs/copyup.c +++ b/fs/unionfs/copyup.c @@ -716,8 +716,7 @@ struct dentry *create_parents(struct inode *dir, struct dentry *dentry, child_dentry = parent_dentry; /* find the parent directory dentry in unionfs */ - parent_dentry = child_dentry->d_parent; - dget(parent_dentry); + parent_dentry = dget_parent(child_dentry); /* find out the lower_parent_dentry in the given branch */ lower_parent_dentry = diff --git a/fs/unionfs/union.h b/fs/unionfs/union.h index d324f83..4b4d6c9 100644 --- a/fs/unionfs/union.h +++ b/fs/unionfs/union.h @@ -487,13 +487,13 @@ extern int parse_branch_mode(const char *name, int *perms); /* locking helpers */ static inline struct dentry *lock_parent(struct dentry *dentry) { - struct dentry *dir = dget(dentry->d_parent); + struct dentry *dir = dget_parent(dentry); mutex_lock_nested(&dir->d_inode->i_mutex, I_MUTEX_PARENT); return dir; } static inline struct dentry *lock_parent_wh(struct dentry *dentry) { - struct dentry *dir = dget(dentry->d_parent); + struct dentry *dir = dget_parent(dentry); mutex_lock_nested(&dir->d_inode->i_mutex, UNIONFS_DMUTEX_WHITEOUT); return dir; -- 1.5.2.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/4] Unionfs: use first writable branch (fix/cleanup)
Cleanup code in ->create, ->symlink, and ->mknod: refactor common code into helper functions. Also, this allows writing to multiple branches again, which was broken by an earlier patch. Signed-off-by: Erez Zadok <[EMAIL PROTECTED]> --- fs/unionfs/inode.c | 395 +--- 1 files changed, 156 insertions(+), 239 deletions(-) diff --git a/fs/unionfs/inode.c b/fs/unionfs/inode.c index e15ddb9..0b92da2 100644 --- a/fs/unionfs/inode.c +++ b/fs/unionfs/inode.c @@ -18,14 +18,159 @@ #include "union.h" +/* + * Helper function when creating new objects (create, symlink, and mknod). + * Checks to see if there's a whiteout in @lower_dentry's parent directory, + * whose name is taken from @dentry. Then tries to remove that whiteout, if + * found. + * + * Return 0 if no whiteout was found, or if one was found and successfully + * removed (a zero tells the caller that @lower_dentry belongs to a good + * branch to create the new object in). Return -ERRNO if an error occurred + * during whiteout lookup or in trying to unlink the whiteout. + */ +static int check_for_whiteout(struct dentry *dentry, + struct dentry *lower_dentry) +{ + int err = 0; + struct dentry *wh_dentry = NULL; + struct dentry *lower_dir_dentry; + char *name = NULL; + + /* +* check if whiteout exists in this branch, i.e. lookup .wh.foo +* first. 
+*/ + name = alloc_whname(dentry->d_name.name, dentry->d_name.len); + if (unlikely(IS_ERR(name))) { + err = PTR_ERR(name); + goto out; + } + + wh_dentry = lookup_one_len(name, lower_dentry->d_parent, + dentry->d_name.len + UNIONFS_WHLEN); + if (IS_ERR(wh_dentry)) { + err = PTR_ERR(wh_dentry); + wh_dentry = NULL; + goto out; + } + + if (!wh_dentry->d_inode) /* no whiteout exists */ + goto out; + + /* .wh.foo has been found, so let's unlink it */ + lower_dir_dentry = lock_parent_wh(wh_dentry); + /* see Documentation/filesystems/unionfs/issues.txt */ + lockdep_off(); + err = vfs_unlink(lower_dir_dentry->d_inode, wh_dentry); + lockdep_on(); + unlock_dir(lower_dir_dentry); + + /* +* Whiteouts are special files and should be deleted no matter what +* (as if they never existed), in order to allow this create +* operation to succeed. This is especially important in sticky +* directories: a whiteout may have been created by one user, but +* the newly created file may be created by another user. +* Therefore, in order to maintain Unix semantics, if the vfs_unlink +* above failed, then we have to try to directly unlink the +* whiteout. Note: in the ODF version of unionfs, whiteout are +* handled much more cleanly. +*/ + if (err == -EPERM) { + struct inode *inode = lower_dir_dentry->d_inode; + err = inode->i_op->unlink(inode, wh_dentry); + } + if (err) + printk(KERN_ERR "unionfs: could not " + "unlink whiteout, err = %d\n", err); + +out: + dput(wh_dentry); + kfree(name); + return err; +} + +/* + * Find a writeable branch to create new object in. Checks all writeble + * branches of the parent inode, from istart to iend order; if none are + * suitable, also tries branch 0 (which may require a copyup). + * + * Return a lower_dentry we can use to create object in, or ERR_PTR. 
+ */ +static struct dentry *find_writeable_branch(struct inode *parent, + struct dentry *dentry) +{ + int err = -EINVAL; + int bindex, istart, iend; + struct dentry *lower_dentry = NULL; + + istart = ibstart(parent); + iend = ibend(parent); + if (istart < 0) + goto out; + +begin: + for (bindex = istart; bindex <= iend; bindex++) { + /* skip non-writeable branches */ + err = is_robranch_super(dentry->d_sb, bindex); + if (err) { + err = -EROFS; + continue; + } + lower_dentry = unionfs_lower_dentry_idx(dentry, bindex); + if (!lower_dentry) + continue; + /* +* check for whiteouts in writeable branch, and remove them +* if necessary. +*/ + err = check_for_whiteout(dentry, lower_dentry); + if (err) + continue; + } + /* +* If istart wasn't already branch 0, and we got any error, then try +* branch 0 (which may require copyup) +*/ + if (err && istart > 0) { + istart = iend = 0; + goto begin; + } + + /* +* If w
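The whiteout lookup in check_for_whiteout() relies on the name returned by alloc_whname(), whose body is not part of this patch. Below is a minimal user-space sketch of that naming step. It assumes the prefix is ".wh." (as the ".wh.foo" comment suggests) and that UNIONFS_WHLEN is the length of that prefix; make_whname() is a hypothetical stand-in, not the unionfs function.

```c
#include <stdlib.h>
#include <string.h>

/* Assumed from the ".wh.foo" comment in the patch; not taken from union.h. */
#define WH_PREFIX ".wh."
#define WH_LEN    (sizeof(WH_PREFIX) - 1)

/*
 * Hypothetical user-space stand-in for alloc_whname(): returns a newly
 * allocated, NUL-terminated ".wh.<name>" string, or NULL if the
 * allocation fails.  The caller frees the result.
 */
char *make_whname(const char *name, size_t len)
{
	char *buf = malloc(len + WH_LEN + 1);

	if (!buf)
		return NULL;
	memcpy(buf, WH_PREFIX, WH_LEN);
	memcpy(buf + WH_LEN, name, len);
	buf[len + WH_LEN] = '\0';
	return buf;
}
```

With this model, the length passed to lookup_one_len() in the patch (d_name.len + UNIONFS_WHLEN) corresponds to strlen() of the returned string.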
[GIT PULL -mm] 0/4 Unionfs updates/fixes/cleanups
The following is a series of patchsets related to Unionfs.  This is the
fifth set of patchsets resulting from an lkml review of the entire unionfs
code base, in preparation for a merge into mainline.  The most significant
changes here are a few locking related fixes, and a correction to broken
logic which didn't allow writing to the first available writable branch.

These patches were tested (where appropriate) on 2.6.24, MM, as well as the
backports to 2.6.{23,22,21,20,19,18,9} on ext2/3/4, xfs, reiserfs,
nfs2/3/4, jffs2, ramfs, tmpfs, cramfs, and squashfs (where available).
Also tested with LTP-full and with a continuous parallel kernel compile
(while forcing cache flushing, manipulating lower branches, etc.).  See
http://unionfs.filesystems.org/ to download back-ported unionfs code.

Please pull from the 'master' branch of
git://git.kernel.org/pub/scm/linux/kernel/git/ezk/unionfs.git

to receive the following:

Erez Zadok (4):
      Unionfs: use first writable branch (fix/cleanup)
      Unionfs: remove unnecessary call to d_iput
      Unionfs: d_parent related locking fixes
      Unionfs: lock_rename related locking fixes

 copyup.c |   16 --
 inode.c  |  395 ---
 rename.c |   16 ++
 union.h  |    4
 4 files changed, 174 insertions(+), 257 deletions(-)

---
Erez Zadok
[EMAIL PROTECTED]
[PATCH] block: look up block device path in sysfs
Given an fd on a block device, returns a string like

	/block/sda/sda1

which can be used to find related information in /sys.

Ideally we should have an ioctl that works on char devices as well, but
that seems far from trivial, so it seems reasonable to have this until the
latter can be implemented.

Cc: Jens Axboe <[EMAIL PROTECTED]>
Cc: Neil Brown <[EMAIL PROTECTED]>
Cc: Kay Sievers <[EMAIL PROTECTED]>
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

Things have been quiet since this was posted about a month ago, and I am
hoping to see this in 2.6.25.  It is based on Neil's BLKGETNAME patch and
is updated with Kay's comments.

Regards,
Dan

 block/compat_ioctl.c |    1 +
 block/ioctl.c        |   28
 include/linux/fs.h   |    1 +
 3 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/block/compat_ioctl.c b/block/compat_ioctl.c
index cae0a85..d71d287 100644
--- a/block/compat_ioctl.c
+++ b/block/compat_ioctl.c
@@ -784,6 +784,7 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	switch (cmd) {
 	case HDIO_GETGEO:
 		return compat_hdio_getgeo(disk, bdev, compat_ptr(arg));
+	case BLKGETDEVPATH:
 	case BLKFLSBUF:
 	case BLKROSET:
 	/*
diff --git a/block/ioctl.c b/block/ioctl.c
index 52d6385..d048ae4 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -229,6 +229,34 @@ int blkdev_ioctl(struct inode *inode, struct file *file, unsigned cmd,
 	int ret, n;
 
 	switch(cmd) {
+	case BLKGETDEVPATH: {
+		char *path;
+		char b[BDEVNAME_SIZE];
+		size_t len;
+
+		path = kobject_get_path(&disk->kobj, GFP_KERNEL);
+
+		if (!path)
+			return -ENOMEM;
+
+		len = strlen(path);
+		if (copy_to_user((char __user *)arg, path, len + 1)) {
+			kfree(path);
+			return -EFAULT;
+		}
+		kfree(path);
+
+		if (bdev->bd_contains == bdev)
+			return 0;
+
+		bdevname(bdev, b);
+		if (copy_to_user((char __user *)arg + len, "/", 2))
+			return -EFAULT;
+		if (copy_to_user((char __user *)arg + len + 1, b,
+				 strlen(b) + 1))
+			return -EFAULT;
+		return 0;
+	}
 	case BLKFLSBUF:
 		if (!capable(CAP_SYS_ADMIN))
 			return -EACCES;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3ec4a4..b4cf8f3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -217,6 +217,7 @@ extern int dir_notify_enable;
 #define BLKTRACESTART _IO(0x12,116)
 #define BLKTRACESTOP _IO(0x12,117)
 #define BLKTRACETEARDOWN _IO(0x12,118)
+#define BLKGETDEVPATH _IOR(0x12, 119, char [1024])
 
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */
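The path composition done by the BLKGETDEVPATH branch can be modeled in user space. The following is only a sketch: compose_devpath() is a hypothetical helper, and the 1024-byte bound is taken from the char [1024] in the _IOR definition. The kernel code in the patch does not itself check the combined length against that size, so the bound check here is an added assumption about what a caller would want.

```c
#include <stdio.h>
#include <string.h>

/* Buffer size implied by the _IOR(0x12, 119, char [1024]) definition. */
#define DEVPATH_MAX 1024

/*
 * Hypothetical user-space model of the BLKGETDEVPATH branch: copy the
 * disk's sysfs path, then append "/<partition>" unless the device is the
 * whole disk (part == NULL).  Returns 0 on success, -1 if the result
 * would not fit in DEVPATH_MAX bytes.
 */
int compose_devpath(char out[DEVPATH_MAX], const char *disk_path,
		    const char *part)
{
	int n;

	if (part)
		n = snprintf(out, DEVPATH_MAX, "%s/%s", disk_path, part);
	else
		n = snprintf(out, DEVPATH_MAX, "%s", disk_path);

	return (n < 0 || n >= DEVPATH_MAX) ? -1 : 0;
}
```

For a partition, compose_devpath(buf, "/block/sda", "sda1") yields the "/block/sda/sda1" form quoted in the description; for the whole disk, passing NULL as the partition leaves just the disk path.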
Re: [GIT PATCH] driver core patches against 2.6.24
On Saturday 26 January 2008 06:42:19 Greg KH wrote: > On Fri, Jan 25, 2008 at 10:44:59AM -0800, Linus Torvalds wrote: > > On Thu, 24 Jan 2008, Greg KH wrote: > > > Here are a pretty large number of kobject, documentation, and driver > > > core patches against your 2.6.24 git tree. > > > > I've merged it all, but it causes lots of scary warnings: > > > > - from the purely broken ones: > > > > ehci_hcd: no version for "struct_module" found: kernel tainted. > > Ok, in looking at the code, this should also be showing up for you on a > "clean" 2.6.24 release, I didn't change anything in this code path. > > That is what taints your kernel with the "F" flag. > > > - to the scary ones: > > > > sysfs: duplicate filename 'ehci_hcd' can not be created > > WARNING: at fs/sysfs/dir.c:424 sysfs_add_one() > > Pid: 610, comm: insmod Tainted: GF 2.6.24-gb47711bf #28 > > > > Call Trace: > > [] sysfs_add_one+0x54/0xbd > > [] create_dir+0x4f/0x87 > > [] sysfs_create_dir+0x35/0x4a > > [] kobject_get+0x12/0x17 > > [] kobject_add_internal+0xd9/0x194 > > [] kobject_add_varg+0x54/0x61 > > [] __alloc_pages+0x66/0x2ee > > [] kobject_init+0x42/0x82 > > [] kobject_init_and_add+0x9a/0xa7 > > [] __vmalloc_area_node+0x111/0x135 > > [] mod_sysfs_init+0x6e/0x83 > > [] sys_init_module+0xa3d/0x1833 > > [] dput+0x1c/0x10b > > [] system_call+0x7e/0x83 > > This is the sysfs core telling you that someone did something stupid :) > > Yes, that's new, but the "error" was always there, I just made the > warning more visible to get people to pay attention to it, and find the > real errors where this happens (and it has found them, which is a good > thing.) > > But in this case, it doesn't look like the module loading code will > detect that we are trying to load a module that is already present until > the kobjects are set up here. 
It's been this way for a long time :(
>
> Rusty, any ideas of us adding a different check for "duplicate" modules
> like this earlier in the load_module() function, so we don't spend so
> much effort in building everything up when we don't need to?

module.c:1832 (in load_module)

	if (find_module(mod->name)) {
		err = -EEXIST;
		goto free_mod;
	}

That's pretty early, and before this backtrace.  Even for simultaneous
loads, there's a mutex which protects from here to the list insertion.

Puzzled,
Rusty.
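The point Rusty makes (lookup and list insertion covered by one lock, so a duplicate name is rejected early) can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the module loader: load_mod() and find_mod() are hypothetical names, the fixed-size table is invented for the example, and a C11 atomic_flag spin lock stands in for the kernel mutex.

```c
#include <errno.h>
#include <stdatomic.h>
#include <string.h>

#define MAX_MODS 16
#define NAME_LEN 32

static char mod_names[MAX_MODS][NAME_LEN];
static int mod_count;
static atomic_flag mod_lock = ATOMIC_FLAG_INIT;	/* stands in for the mutex */

/* Return 1 if a module with this name is already on the list. */
static int find_mod(const char *name)
{
	for (int i = 0; i < mod_count; i++)
		if (strcmp(mod_names[i], name) == 0)
			return 1;
	return 0;
}

/*
 * Hypothetical model of the early check: the lookup and the insertion
 * happen under the same lock, so a second load of the same name gets
 * -EEXIST before any further setup work is done.
 */
int load_mod(const char *name)
{
	int err = 0;

	while (atomic_flag_test_and_set(&mod_lock))
		;	/* spin until the lock is free */

	if (find_mod(name))
		err = -EEXIST;
	else if (strlen(name) >= NAME_LEN)
		err = -ENAMETOOLONG;
	else if (mod_count == MAX_MODS)
		err = -ENOMEM;
	else
		strcpy(mod_names[mod_count++], name);

	atomic_flag_clear(&mod_lock);
	return err;
}
```

In this model a second load_mod("ehci_hcd") fails with -EEXIST without touching anything else, which is why a duplicate-name warning from sysfs after that check is puzzling.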
[PATCH 19/23 -v6] Trace irq disabled critical timings
This patch adds latency tracing for critical timings (how long interrupts are disabled for). "irqsoff" is added to /debugfs/tracing/available_tracers Note: tracing_max_latency also holds the max latency for irqsoff (in usecs). (default to large number so one must start latency tracing) tracing_thresh threshold (in usecs) to always print out if irqs off is detected to be longer than stated here. If irq_thresh is non-zero, then max_irq_latency is ignored. Here's an example of a trace with mcount_enabled = 0 === preemption latency trace v1.1.5 on 2.6.24-rc7 latency: 100 us, #3/3, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2) - | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0) - => started at: _spin_lock_irqsave+0x2a/0xb7 => ended at: _spin_unlock_irqrestore+0x32/0x5f _--=> CPU# / _-=> irqs-off | / _=> need-resched || / _---=> hardirq/softirq ||| / _--=> preempt-depth / | delay cmd pid | time | caller \ /| \ | / swapper-0 1d.s30us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000]) swapper-0 1d.s3 100us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000]) swapper-0 1d.s3 100us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f) vim:ft=help === And this is a trace with mcount_enabled == 1 === preemption latency trace v1.1.5 on 2.6.24-rc7 latency: 102 us, #12/12, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2) - | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0) - => started at: _spin_lock_irqsave+0x2a/0xb7 => ended at: _spin_unlock_irqrestore+0x32/0x5f _--=> CPU# / _-=> irqs-off | / _=> need-resched || / _---=> hardirq/softirq ||| / _--=> preempt-depth / | delay cmd pid | time | caller \ /| \ | / swapper-0 1dNs30us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000]) swapper-0 1dNs3 46us : e1000_read_phy_reg+0x16/0x225 [e1000] (e1000_update_stats+0x5e2/0x64c [e1000]) swapper-0 1dNs3 46us : e1000_swfw_sync_acquire+0x10/0x99 [e1000] (e1000_read_phy_reg+0x49/0x225 [e1000]) swapper-0 1dNs3 46us : 
e1000_get_hw_eeprom_semaphore+0x12/0xa6 [e1000] (e1000_swfw_sync_acquire+0x36/0x99 [e1000]) swapper-0 1dNs3 47us : __const_udelay+0x9/0x47 (e1000_read_phy_reg+0x116/0x225 [e1000]) swapper-0 1dNs3 47us+: __delay+0x9/0x50 (__const_udelay+0x45/0x47) swapper-0 1dNs3 97us : preempt_schedule+0xc/0x84 (__delay+0x4e/0x50) swapper-0 1dNs3 98us : e1000_swfw_sync_release+0xc/0x55 [e1000] (e1000_read_phy_reg+0x211/0x225 [e1000]) swapper-0 1dNs3 99us+: e1000_put_hw_eeprom_semaphore+0x9/0x35 [e1000] (e1000_swfw_sync_release+0x50/0x55 [e1000]) swapper-0 1dNs3 101us : _spin_unlock_irqrestore+0xe/0x5f (e1000_update_stats+0x641/0x64c [e1000]) swapper-0 1dNs3 102us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000]) swapper-0 1dNs3 102us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f) vim:ft=help === Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- arch/x86/kernel/process_64.c |3 arch/x86/lib/thunk_64.S | 18 + include/asm-x86/irqflags_32.h |4 include/asm-x86/irqflags_64.h |4 include/linux/irqflags.h | 37 ++- include/linux/mcount.h| 31 ++- kernel/fork.c |2 kernel/lockdep.c | 16 + lib/tracing/Kconfig | 18 + lib/tracing/Makefile |1 lib/tracing/trace_irqsoff.c | 415 ++ lib/tracing/tracer.c | 59 - lib/tracing/tracer.h |2 13 files changed, 575 insertions(+), 35 deletions(-) Index: linux-mcount.git/arch/x86/kernel/process_64.c === --- linux-mcount.git.orig/arch/x86/kernel/process_64.c 2008-01-25 21:46:48.0 -0500 +++ linux-mcount.git/arch/x86/kernel/process_64.c 2008-01-25 21:47:34.0 -0500 @@ -233,7 +233,10 @@ void cpu_idle (void) */ local_irq_disable(); enter_idle(); + /* Don't trace irqs off for idle */ + stop_critical_timings();
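The interaction between the maximum latency and the threshold described above can be sketched as a small user-space model. Everything here is an assumption made for illustration: the struct and function names are hypothetical, timestamps are passed in by the caller rather than read from a clock, and the strict "greater than" comparisons are a guess at the tracer's behaviour.

```c
#include <stdint.h>

struct irqsoff_state {
	uint64_t start_us;	/* timestamp when irqs were disabled */
	uint64_t max_us;	/* plays the role of tracing_max_latency */
	uint64_t thresh_us;	/* plays the role of tracing_thresh; 0 = off */
	int tracing;		/* nonzero while inside a critical section */
};

/* Mark the start of an irqs-off section. */
void critical_start(struct irqsoff_state *s, uint64_t now_us)
{
	s->start_us = now_us;
	s->tracing = 1;
}

/*
 * Mark the end of an irqs-off section.  Returns 1 if this section should
 * be reported, 0 otherwise.  With a nonzero threshold every section
 * longer than the threshold is reported and the maximum is ignored;
 * otherwise only a new maximum is reported (and recorded).
 */
int critical_stop(struct irqsoff_state *s, uint64_t now_us)
{
	uint64_t delta;

	if (!s->tracing)
		return 0;
	s->tracing = 0;
	delta = now_us - s->start_us;

	if (s->thresh_us)
		return delta > s->thresh_us;

	if (delta > s->max_us) {
		s->max_us = delta;
		return 1;
	}
	return 0;
}
```

Since the description says the maximum defaults to a large number, a model user would set max_us to 0 to begin capturing, just as one would write a small value to tracing_max_latency.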
[PATCH 01/23 -v6] printk - dont wakeup klogd with interrupts disabled
[ This patch is added to the series since the wakeup timings trace
  may lock up without it. ]

I thought that one could place a printk anywhere without worrying.
But it seems that it is not wise to place a printk where the runqueue
lock is held.

I just spent two hours debugging why some of my code was locking up,
to find that the lockup was caused by some debugging printk's that
I had in the scheduler.  The printk's were only in rare paths so
they shouldn't be too much of a problem, but after I hit the printk
the system locked up.

Thinking that it was locking up on my code I went looking down the
wrong path.  I finally found (after examining an NMI dump) that
the lockup happened because printk was trying to wake up the klogd
daemon, which caused a deadlock when the try_to_wake_up code tries
to grab the runqueue lock.

This patch adds a runqueue_is_locked interface in sched.c for other
files to see if the current runqueue lock is held.  This is used in
printk to determine whether it is safe or not to wake up the klogd.

And with this patch, my code ran fine ;-)

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 include/linux/sched.h |    2 ++
 kernel/printk.c       |   14 ++
 kernel/sched.c        |   18 ++
 3 files changed, 30 insertions(+), 4 deletions(-)

Index: linux-mcount.git/kernel/printk.c
===================================================================
--- linux-mcount.git.orig/kernel/printk.c	2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/kernel/printk.c	2008-01-25 21:46:55.0 -0500
@@ -590,9 +590,11 @@ static int have_callable_console(void)
  * @fmt: format string
  *
  * This is printk().  It can be called from any context.  We want it to work.
- * Be aware of the fact that if oops_in_progress is not set, we might try to
- * wake klogd up which could deadlock on runqueue lock if printk() is called
- * from scheduler code.
+ *
+ * Note: if printk() is called with the runqueue lock held, it will not wake
+ * up the klogd. This is to avoid a deadlock from calling printk() in schedule
+ * with the runqueue lock held and having the wake_up grab the runqueue lock
+ * as well.
  *
  * We try to grab the console_sem.  If we succeed, it's easy - we log the output and
  * call the console drivers.  If we fail to get the semaphore we place the output
@@ -1003,7 +1005,11 @@ void release_console_sem(void)
 	console_locked = 0;
 	up(&console_sem);
 	spin_unlock_irqrestore(&logbuf_lock, flags);
-	if (wake_klogd)
+	/*
+	 * If we try to wake up klogd while printing with the runqueue lock
+	 * held, this will deadlock.
+	 */
+	if (wake_klogd && !runqueue_is_locked())
 		wake_up_klogd();
 }
 EXPORT_SYMBOL(release_console_sem);
Index: linux-mcount.git/include/linux/sched.h
===================================================================
--- linux-mcount.git.orig/include/linux/sched.h	2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/include/linux/sched.h	2008-01-25 21:46:55.0 -0500
@@ -221,6 +221,8 @@ extern void sched_init_smp(void);
 extern void init_idle(struct task_struct *idle, int cpu);
 extern void init_idle_bootup_task(struct task_struct *idle);
 
+extern int runqueue_is_locked(void);
+
 extern cpumask_t nohz_cpu_mask;
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
 extern int select_nohz_load_balancer(int cpu);
Index: linux-mcount.git/kernel/sched.c
===================================================================
--- linux-mcount.git.orig/kernel/sched.c	2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/kernel/sched.c	2008-01-25 21:46:55.0 -0500
@@ -621,6 +621,24 @@ unsigned long rt_needs_cpu(int cpu)
 # define const_debug static const
 #endif
 
+/**
+ * runqueue_is_locked
+ *
+ * Returns true if the current cpu runqueue is locked.
+ * This interface allows printk to be called with the runqueue lock
+ * held and know whether or not it is OK to wake up the klogd.
+ */
+int runqueue_is_locked(void)
+{
+	int cpu = get_cpu();
+	struct rq *rq = cpu_rq(cpu);
+	int ret;
+
+	ret = spin_is_locked(&rq->lock);
+	put_cpu();
+	return ret;
+}
+
 /*
  * Debugging: various feature bits
  */
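The pattern in this patch (ask whether a lock is held, and skip a wake-up that would need the same lock) can be shown with a single-threaded user-space model. This is a sketch only: the names below are hypothetical, a plain atomic flag stands in for the per-CPU runqueue spinlock, and a counter stands in for wake_up_klogd().

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_bool rq_locked;	/* stands in for the per-CPU runqueue lock state */
int wakeups;		/* counts calls that would have woken the log daemon */

/* Model flag only: records that the "runqueue lock" is held or released. */
void rq_lock(void)   { atomic_store(&rq_locked, true); }
void rq_unlock(void) { atomic_store(&rq_locked, false); }

/* Analogue of runqueue_is_locked(): report whether the lock is held. */
bool rq_is_locked(void)
{
	return atomic_load(&rq_locked);
}

/*
 * Analogue of the tail of release_console_sem(): perform the wake-up
 * only when one is pending and the lock it needs is not already held.
 */
void log_release(bool wake_pending)
{
	if (wake_pending && !rq_is_locked())
		wakeups++;
}
```

In this model a log_release(true) issued between rq_lock() and rq_unlock() leaves wakeups unchanged, which mirrors the trade-off in the patch: the wake-up is skipped rather than deferred, so klogd is only woken by a later printk from a safe context.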
[PATCH 11/23 -v6] mcount based trace in the form of a header file library
This is a simple trace that uses the mcount infrastructure. It is designed to be fast and small, and easy to use. It is useful to record things that happen over a very short period of time, and not to analyze the system in general. An interface is added to the debugfs /debugfs/tracing/ This patch adds the following files: available_tracers list of available tracers. Currently only "function" is available. current_tracer The trace that is currently active. Empty on start up. To switch to a tracer simply echo one of the tracers that are listed in available_tracers: echo function > /debugfs/tracing/current_tracer trace_ctrl echoing "1" into this file starts the mcount function tracing (if sysctl kernel.mcount_enabled=1) echoing "0" turns it off. latency_trace This file is readonly and holds the result of the trace. trace This file outputs a easier to read version of the trace. iter_ctrl Controls the way the output of traces look. So far there's two controls: echoing in "symonly" will only show the kallsyms variables without the addresses (if kallsyms was configured) echoing in "verbose" will change the output to show a lot more data, but not very easy to understand by humans. echoing in "nosymonly" turns off symonly. echoing in "noverbose" turns off verbose. The output of the function_trace file is as follows "echo noverbose > /debugfs/tracing/iter_ctrl" preemption latency trace v1.1.5 on 2.6.24-rc7-tst latency: 0 us, #419428/4361791, CPU#1 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4) - | task: -0 (uid:0 nice:0 policy:0 rt_prio:0) - _--=> CPU# / _-=> irqs-off | / _=> need-resched || / _---=> hardirq/softirq ||| / _--=> preempt-depth / | delay cmd pid | time | caller \ /| \ | / swapper-0 0d.h. 1595128us+: set_normalized_timespec+0x8/0x2d (ktime_get_ts+0x4a/0x4e ) swapper-0 0d.h. 
1595131us+: _spin_lock+0x8/0x18 (hrtimer_interrupt+0x6e/0x1b0 ) Or with verbose turned on: "echo verbose > /debugfs/tracing/iter_ctrl" preemption latency trace v1.1.5 on 2.6.24-rc7-tst latency: 0 us, #419428/4361791, CPU#1 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4) - | task: -0 (uid:0 nice:0 policy:0 rt_prio:0) - swapper 0 0 9 [f3675f41] 1595.128ms (+0.003ms): set_normalized_timespec+0x8/0x2d (ktime_get_ts+0x4a/0x4e ) swapper 0 0 9 0001 [f3675f45] 1595.131ms (+0.003ms): _spin_lock+0x8/0x18 (hrtimer_interrupt+0x6e/0x1b0 ) swapper 0 0 9 0002 [f3675f48] 1595.135ms (+0.003ms): _spin_lock+0x8/0x18 (hrtimer_interrupt+0x6e/0x1b0 ) The "trace" file is not affected by the verbose mode, but is by the symonly. echo "nosymonly" > /debugfs/tracing/iter_ctrl tracer: [ 81.479967] CPU 0: bash:3154 register_mcount_function+0x5f/0x66 <-- _spin_unlock_irqrestore+0xe/0x5a [ 81.479967] CPU 0: bash:3154 _spin_unlock_irqrestore+0x3e/0x5a <-- sub_preempt_count+0xc/0x7a [ 81.479968] CPU 0: bash:3154 sub_preempt_count+0x30/0x7a <-- in_lock_functions+0x9/0x24 [ 81.479968] CPU 0: bash:3154 vfs_write+0x11d/0x155 <-- dnotify_parent+0x12/0x78 [ 81.479968] CPU 0: bash:3154 dnotify_parent+0x2d/0x78 <-- _spin_lock+0xe/0x70 [ 81.479969] CPU 0: bash:3154 _spin_lock+0x1b/0x70 <-- add_preempt_count+0xe/0x77 [ 81.479969] CPU 0: bash:3154 add_preempt_count+0x3e/0x77 <-- in_lock_functions+0x9/0x24 echo "symonly" > /debugfs/tracing/iter_ctrl tracer: [ 81.479913] CPU 0: bash:3154 register_mcount_function+0x5f/0x66 <-- _spin_unlock_irqrestore+0xe/0x5a [ 81.479913] CPU 0: bash:3154 _spin_unlock_irqrestore+0x3e/0x5a <-- sub_preempt_count+0xc/0x7a [ 81.479913] CPU 0: bash:3154 sub_preempt_count+0x30/0x7a <-- in_lock_functions+0x9/0x24 [ 81.479914] CPU 0: bash:3154 vfs_write+0x11d/0x155 <-- dnotify_parent+0x12/0x78 [ 81.479914] CPU 0: bash:3154 dnotify_parent+0x2d/0x78 <-- _spin_lock+0xe/0x70 [ 81.479914] CPU 0: bash:3154 _spin_lock+0x1b/0x70 <-- add_preempt_count+0xe/0x77 [ 81.479914] CPU 0: bash:3154 
add_preempt_count+0x3e/0x77 <-- in_lock_functions+0x9/0x24 Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]> --- lib/Makefile |1 lib/tracing/Kconfig | 15 lib/tracing/Makefile |3 lib/tracing/trace_function.c | 72 ++ lib/tracing/tracer.c | 1150 +++ lib/tracing/tracer.h | 96 +++ 6 files changed,
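The control files described above are driven by writing short strings into them. A program can do the same thing the echo commands do. The helper below is a hypothetical sketch (tracing_write() is not part of the patch), and it takes the base directory as a parameter because the /debugfs mount point is only the one used in this description.

```c
#include <stdio.h>

/*
 * Write `value` followed by a newline to <base>/<file>, the same effect
 * as `echo value > base/file`.  Returns 0 on success, -1 on failure.
 */
int tracing_write(const char *base, const char *file, const char *value)
{
	char path[512];
	FILE *f;
	int n, ok;

	n = snprintf(path, sizeof(path), "%s/%s", base, file);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%s\n", value) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Assuming debugfs is mounted at /debugfs as in the examples, selecting the tracer, choosing symbol-only output and starting the trace would be tracing_write("/debugfs/tracing", "current_tracer", "function"), then "iter_ctrl" with "symonly", then "trace_ctrl" with "1".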
[PATCH 18/23 -v6] mcount tracer for wakeup latency timings.
This patch adds hooks to trace the wake up latency of the highest priority waking task. "wakeup" is added to /debugfs/tracing/available_tracers Also added to /debugfs/tracing tracing_max_latency holds the current max latency for the wakeup wakeup_thresh if set to other than zero, a log will be recorded for every wakeup that takes longer than the number entered in here (usecs for all counters) (deletes previous trace) Examples: (with mcount_enabled = 0) preemption latency trace v1.1.5 on 2.6.24-rc8 latency: 26 us, #2/2, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2) - | task: migration/0-3 (uid:0 nice:-5 policy:1 rt_prio:99) - _--=> CPU# / _-=> irqs-off | / _=> need-resched || / _---=> hardirq/softirq ||| / _--=> preempt-depth / | delay cmd pid | time | caller \ /| \ | / quilt-8551 0d..30us+: wake_up_process+0x15/0x17 (sched_exec+0xc9/0x100 ) quilt-8551 0d..4 26us : sched_switch_callback+0x73/0x81 (schedule+0x483/0x6d5 ) vim:ft=help (with mcount_enabled = 1) preemption latency trace v1.1.5 on 2.6.24-rc8 latency: 36 us, #45/45, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2) - | task: migration/1-5 (uid:0 nice:-5 policy:1 rt_prio:99) - _--=> CPU# / _-=> irqs-off | / _=> need-resched || / _---=> hardirq/softirq ||| / _--=> preempt-depth / | delay cmd pid | time | caller \ /| \ | / bash-10653 1d..30us : wake_up_process+0x15/0x17 (sched_exec+0xc9/0x100 ) bash-10653 1d..31us : try_to_wake_up+0x271/0x2e7 (sub_preempt_count+0xc/0x7a ) bash-10653 1d..22us : try_to_wake_up+0x296/0x2e7 (update_rq_clock+0x9/0x20 ) bash-10653 1d..22us : update_rq_clock+0x1e/0x20 (__update_rq_clock+0xc/0x90 ) bash-10653 1d..23us : __update_rq_clock+0x1b/0x90 (sched_clock+0x9/0x29 ) bash-10653 1d..24us : try_to_wake_up+0x2a6/0x2e7 (activate_task+0xc/0x3f ) bash-10653 1d..24us : activate_task+0x2d/0x3f (enqueue_task+0xe/0x66 ) bash-10653 1d..25us : enqueue_task+0x5b/0x66 (enqueue_task_rt+0x9/0x3c ) bash-10653 1d..26us : try_to_wake_up+0x2ba/0x2e7 (check_preempt_wakeup+0x12/0x99 ) [...] 
bash-10653 1d..5 33us : tracing_record_cmdline+0xcf/0xd4 (_spin_unlock+0x9/0x33 ) bash-10653 1d..5 34us : _spin_unlock+0x19/0x33 (sub_preempt_count+0xc/0x7a ) bash-10653 1d..4 35us : wakeup_sched_switch+0x65/0x2ff (_spin_lock_irqsave+0xc/0xa9 ) bash-10653 1d..4 35us : _spin_lock_irqsave+0x19/0xa9 (add_preempt_count+0xe/0x77 ) bash-10653 1d..4 36us : sched_switch_callback+0x73/0x81 (schedule+0x483/0x6d5 ) vim:ft=help The [...] was added here to not waste your email box space. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- lib/tracing/Kconfig| 14 + lib/tracing/Makefile |1 lib/tracing/trace_wakeup.c | 350 + lib/tracing/tracer.c | 131 lib/tracing/tracer.h |6 5 files changed, 500 insertions(+), 2 deletions(-) Index: linux-mcount.git/lib/tracing/Kconfig === --- linux-mcount.git.orig/lib/tracing/Kconfig 2008-01-25 21:47:25.0 -0500 +++ linux-mcount.git/lib/tracing/Kconfig2008-01-25 21:47:32.0 -0500 @@ -9,6 +9,9 @@ config MCOUNT bool select FRAME_POINTER +config TRACER_MAX_TRACE + bool + config TRACING bool select DEBUG_FS @@ -25,6 +28,17 @@ config FUNCTION_TRACER that the debugging mechanism using this facility will hook by providing a set of inline routines. +config WAKEUP_TRACER + bool "Trace wakeup latencies" + depends on DEBUG_KERNEL + select TRACING + select CONTEXT_SWITCH_TRACER + select TRACER_MAX_TRACE + help + This tracer adds hooks into scheduling to time the latency + of the highest priority task tasks to be scheduled in + after it has worken up. + config CONTEXT_SWITCH_TRACER bool "Trace process context switches" depends on DEBUG_KERNEL Index: linux-mcount.git/lib/tracing/Makefile === --- linux-m
[PATCH 16/23 -v6] trace generic call to schedule switch
This patch adds hooks into the schedule switch tracing to allow other latency traces to hook into the schedule switches. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- lib/tracing/trace_sched_switch.c | 123 +-- lib/tracing/tracer.h | 14 2 files changed, 119 insertions(+), 18 deletions(-) Index: linux-mcount.git/lib/tracing/tracer.h === --- linux-mcount.git.orig/lib/tracing/tracer.h 2008-01-25 21:47:25.0 -0500 +++ linux-mcount.git/lib/tracing/tracer.h 2008-01-25 21:47:27.0 -0500 @@ -112,4 +112,18 @@ static inline notrace cycle_t now(void) return get_monotonic_cycles(); } +#ifdef CONFIG_CONTEXT_SWITCH_TRACER +typedef void (*tracer_switch_func_t)(void *private, +struct task_struct *prev, +struct task_struct *next); +struct tracer_switch_ops { + tracer_switch_func_t func; + void *private; + struct tracer_switch_ops *next; +}; + +extern int register_tracer_switch(struct tracer_switch_ops *ops); +extern int unregister_tracer_switch(struct tracer_switch_ops *ops); +#endif /* CONFIG_CONTEXT_SWITCH_TRACER */ + #endif /* _LINUX_MCOUNT_TRACER_H */ Index: linux-mcount.git/lib/tracing/trace_sched_switch.c === --- linux-mcount.git.orig/lib/tracing/trace_sched_switch.c 2008-01-25 21:47:25.0 -0500 +++ linux-mcount.git/lib/tracing/trace_sched_switch.c 2008-01-25 21:47:27.0 -0500 @@ -16,33 +16,21 @@ static struct tracing_trace *tracer_trace; static int trace_enabled __read_mostly; +static DEFINE_SPINLOCK(sched_switch_func_lock); -static notrace void sched_switch_callback(const struct marker *mdata, - void *private_data, - const char *format, ...) 
+static void notrace sched_switch_func(void *private, + struct task_struct *prev, + struct task_struct *next) { - struct tracing_trace **p = mdata->private; - struct tracing_trace *tr = *p; + struct tracing_trace **ptr = private; + struct tracing_trace *tr = *ptr; struct tracing_trace_cpu *data; - struct task_struct *prev; - struct task_struct *next; unsigned long flags; - va_list ap; int cpu; - if (likely(!atomic_read(&trace_record_cmdline))) - return; - - tracing_record_cmdline(current); - if (likely(!trace_enabled)) return; - va_start(ap, format); - prev = va_arg(ap, typeof(prev)); - next = va_arg(ap, typeof(next)); - va_end(ap); - raw_local_irq_save(flags); cpu = raw_smp_processor_id(); data = tr->data[cpu]; @@ -55,6 +43,105 @@ static notrace void sched_switch_callbac raw_local_irq_restore(flags); } +static struct tracer_switch_ops sched_switch_ops __read_mostly = +{ + .func = sched_switch_func, + .private = &tracer_trace, +}; + +static tracer_switch_func_t tracer_switch_func __read_mostly = + sched_switch_func; + +static struct tracer_switch_ops *tracer_switch_func_ops __read_mostly = + &sched_switch_ops; + +static void notrace sched_switch_func_loop(void *private, + struct task_struct *prev, + struct task_struct *next) +{ + struct tracer_switch_ops *ops = tracer_switch_func_ops; + + for (; ops != NULL; ops = ops->next) + ops->func(ops->private, prev, next); +} + +notrace int register_tracer_switch(struct tracer_switch_ops *ops) +{ + unsigned long flags; + + spin_lock_irqsave(&sched_switch_func_lock, flags); + ops->next = tracer_switch_func_ops; + smp_wmb(); + tracer_switch_func_ops = ops; + + if (ops->next == &sched_switch_ops) + tracer_switch_func = sched_switch_func_loop; + + spin_unlock_irqrestore(&sched_switch_func_lock, flags); + + return 0; +} + +notrace int unregister_tracer_switch(struct tracer_switch_ops *ops) +{ + unsigned long flags; + struct tracer_switch_ops **p = &tracer_switch_func_ops; + int ret; + + spin_lock_irqsave(&sched_switch_func_lock, 
flags); + + /* +* If the sched_switch is the only one left, then +* only call that function. +*/ + if (*p == ops && ops->next == &sched_switch_ops) { + tracer_switch_func = sched_switch_func; + tracer_switch_func_ops = &sched_switch_ops; + goto out; + } + + for (; *p != &sched_switch_ops; p = &(*p)->next) + if (*p == ops) + break; + + if (*p != ops) { + ret = -1; + goto out; +
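The register/unregister/dispatch scheme in this patch is a singly linked list of callbacks walked at every context switch. A simplified user-space model is sketched below. It is not the patch's code: the names are hypothetical, locking and the smp_wmb() ordering are omitted, and the patch's optimisation of calling sched_switch_func directly when it is the only entry is left out.

```c
#include <stddef.h>

struct task;	/* opaque stand-in for struct task_struct */

typedef void (*switch_func_t)(void *private, struct task *prev,
			      struct task *next);

struct switch_ops {
	switch_func_t func;
	void *private;
	struct switch_ops *next;
};

static struct switch_ops *ops_head;	/* newest registration first */

/* Push ops on the front of the list (locking omitted in this model). */
int register_switch(struct switch_ops *ops)
{
	ops->next = ops_head;
	ops_head = ops;
	return 0;
}

/* Unlink ops; returns -1 if it is not on the list. */
int unregister_switch(struct switch_ops *ops)
{
	struct switch_ops **p;

	for (p = &ops_head; *p; p = &(*p)->next) {
		if (*p == ops) {
			*p = ops->next;
			ops->next = NULL;
			return 0;
		}
	}
	return -1;
}

/* Called at each context switch: run every registered hook in order. */
void switch_dispatch(struct task *prev, struct task *next)
{
	struct switch_ops *ops;

	for (ops = ops_head; ops != NULL; ops = ops->next)
		ops->func(ops->private, prev, next);
}

/* Example hook: private points at an int counter. */
void count_hook(void *private, struct task *prev, struct task *next)
{
	(void)prev;
	(void)next;
	++*(int *)private;
}
```

The pointer-to-pointer walk in unregister_switch() is the same idiom the patch uses (p = &(*p)->next), which removes an entry without special-casing the list head.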
[PATCH 15/23 -v6] Generic command line storage
Saving the comm of tasks for each trace is very expensive. This patch includes in the context switch hook, a way to store the last 100 command lines of tasks. This table is examined when a trace is to be printed. Note: The comm may be destroyed if other traces are performed. Later (TBD) patches may simply store this information in the trace itself. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- lib/tracing/Kconfig |1 lib/tracing/trace_function.c |2 lib/tracing/trace_sched_switch.c |7 ++ lib/tracing/tracer.c | 104 +-- lib/tracing/tracer.h |6 +- 5 files changed, 114 insertions(+), 6 deletions(-) Index: linux-mcount.git/lib/tracing/Kconfig === --- linux-mcount.git.orig/lib/tracing/Kconfig 2008-01-25 21:47:23.0 -0500 +++ linux-mcount.git/lib/tracing/Kconfig2008-01-25 21:47:25.0 -0500 @@ -18,6 +18,7 @@ config FUNCTION_TRACER depends on DEBUG_KERNEL && HAVE_MCOUNT select MCOUNT select TRACING + select CONTEXT_SWITCH_TRACER help Use profiler instrumentation, adding -pg to CFLAGS. This will insert a call to an architecture specific __mcount routine, Index: linux-mcount.git/lib/tracing/trace_function.c === --- linux-mcount.git.orig/lib/tracing/trace_function.c 2008-01-25 21:47:17.0 -0500 +++ linux-mcount.git/lib/tracing/trace_function.c 2008-01-25 21:47:25.0 -0500 @@ -28,11 +28,13 @@ static notrace void function_reset(struc static notrace void start_function_trace(struct tracing_trace *tr) { function_reset(tr); + atomic_inc(&trace_record_cmdline); tracing_start_function_trace(); } static notrace void stop_function_trace(struct tracing_trace *tr) { + atomic_dec(&trace_record_cmdline); tracing_stop_function_trace(); } Index: linux-mcount.git/lib/tracing/trace_sched_switch.c === --- linux-mcount.git.orig/lib/tracing/trace_sched_switch.c 2008-01-25 21:47:23.0 -0500 +++ linux-mcount.git/lib/tracing/trace_sched_switch.c 2008-01-25 21:47:25.0 -0500 @@ -30,6 +30,11 @@ static notrace void sched_switch_callbac va_list ap; int cpu; + if 
(likely(!atomic_read(&trace_record_cmdline))) + return; + + tracing_record_cmdline(current); + if (likely(!trace_enabled)) return; @@ -62,6 +67,7 @@ static notrace void sched_switch_reset(s static notrace void start_sched_trace(struct tracing_trace *tr) { + atomic_inc(&trace_record_cmdline); sched_switch_reset(tr); trace_enabled = 1; } @@ -69,6 +75,7 @@ static notrace void start_sched_trace(st static notrace void stop_sched_trace(struct tracing_trace *tr) { trace_enabled = 0; + atomic_dec(&trace_record_cmdline); } static notrace void sched_switch_trace_init(struct tracing_trace *tr) Index: linux-mcount.git/lib/tracing/tracer.c === --- linux-mcount.git.orig/lib/tracing/tracer.c 2008-01-25 21:47:23.0 -0500 +++ linux-mcount.git/lib/tracing/tracer.c 2008-01-25 21:47:25.0 -0500 @@ -169,6 +169,88 @@ void tracing_stop_function_trace(void) unregister_mcount_function(&trace_ops); } +#define SAVED_CMDLINES 128 +static unsigned map_pid_to_cmdline[PID_MAX_DEFAULT+1]; +static unsigned map_cmdline_to_pid[SAVED_CMDLINES]; +static char saved_cmdlines[SAVED_CMDLINES][TASK_COMM_LEN]; +static int cmdline_idx; +static DEFINE_SPINLOCK(trace_cmdline_lock); +atomic_t trace_record_cmdline; +atomic_t trace_record_cmdline_disabled; + +static void trace_init_cmdlines(void) +{ + memset(&map_pid_to_cmdline, -1, sizeof(map_pid_to_cmdline)); + memset(&map_cmdline_to_pid, -1, sizeof(map_cmdline_to_pid)); + cmdline_idx = 0; +} + +notrace void trace_stop_cmdline_recording(void); + +static void notrace trace_save_cmdline(struct task_struct *tsk) +{ + unsigned map; + unsigned idx; + + if (!tsk->pid || unlikely(tsk->pid > PID_MAX_DEFAULT)) + return; + + /* +* It's not the end of the world if we don't get +* the lock, but we also don't want to spin +* nor do we want to disable interrupts, +* so if we miss here, then better luck next time. 
+*/ + if (!spin_trylock(&trace_cmdline_lock)) + return; + + idx = map_pid_to_cmdline[tsk->pid]; + if (idx >= SAVED_CMDLINES) { + idx = (cmdline_idx + 1) % SAVED_CMDLINES; + + map = map_cmdline_to_pid[idx]; + if (map <= PID_MAX_DEFAULT) + map_pid_to_cmdline[map] = (unsigned)-1; + + map_pid_to_cmdline[tsk->
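The two-way pid/slot mapping used for the saved command lines can be modeled in user space as follows. This is a sketch with several assumptions: the code shown above is cut off partway through trace_save_cmdline(), so the tail of cmdline_save() is a guess at the intent; PID_MAX_DEFAULT and TASK_COMM_LEN are assumed to be 32768 and 16; the "<...>" placeholder and all function names are hypothetical; and the trylock is omitted. Note that the description speaks of the last 100 command lines while the code defines SAVED_CMDLINES as 128; the sketch follows the code.

```c
#include <string.h>

#define PID_MAX   32768		/* assumed value of PID_MAX_DEFAULT */
#define SAVED     128		/* SAVED_CMDLINES in the patch */
#define COMM_LEN  16		/* assumed value of TASK_COMM_LEN */
#define NO_ENTRY  ((unsigned)-1)

static unsigned pid_to_slot[PID_MAX + 1];
static unsigned slot_to_pid[SAVED];
static char saved[SAVED][COMM_LEN];
static unsigned last_slot;

/* Mark every pid and every slot as unused (all bits set == NO_ENTRY). */
void cmdline_init(void)
{
	memset(pid_to_slot, -1, sizeof(pid_to_slot));
	memset(slot_to_pid, -1, sizeof(slot_to_pid));
	last_slot = 0;
}

/* Remember comm for pid, reusing the next slot in ring order if pid is new. */
void cmdline_save(unsigned pid, const char *comm)
{
	unsigned slot, old;

	if (!pid || pid > PID_MAX)
		return;

	slot = pid_to_slot[pid];
	if (slot >= SAVED) {
		slot = (last_slot + 1) % SAVED;
		old = slot_to_pid[slot];
		if (old <= PID_MAX)
			pid_to_slot[old] = NO_ENTRY;	/* evict previous owner */
		slot_to_pid[slot] = pid;
		pid_to_slot[pid] = slot;
		last_slot = slot;
	}
	strncpy(saved[slot], comm, COMM_LEN - 1);
	saved[slot][COMM_LEN - 1] = '\0';
}

/* Return the saved comm for pid, or "<...>" if none is cached. */
const char *cmdline_find(unsigned pid)
{
	if (pid > PID_MAX || pid_to_slot[pid] >= SAVED)
		return "<...>";
	return saved[pid_to_slot[pid]];
}
```

The model makes the note in the description concrete: once 128 other pids have been recorded, an earlier task's comm is overwritten and a later trace print can no longer resolve it.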
[PATCH 21/23 -v6] Add markers to various events
This patch adds markers to various events in the kernel. (interrupts, task activation and hrtimers) Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- arch/x86/kernel/apic_32.c |2 ++ arch/x86/kernel/irq_32.c |1 + arch/x86/kernel/irq_64.c |2 ++ arch/x86/kernel/traps_32.c |2 ++ arch/x86/kernel/traps_64.c |2 ++ arch/x86/mm/fault_32.c |3 +++ arch/x86/mm/fault_64.c |3 +++ kernel/hrtimer.c |7 +++ kernel/sched.c | 11 +++ 9 files changed, 33 insertions(+) Index: linux-mcount.git/arch/x86/kernel/apic_32.c === --- linux-mcount.git.orig/arch/x86/kernel/apic_32.c 2008-01-25 21:47:15.0 -0500 +++ linux-mcount.git/arch/x86/kernel/apic_32.c 2008-01-25 21:47:38.0 -0500 @@ -581,6 +581,8 @@ notrace fastcall void smp_apic_timer_int { struct pt_regs *old_regs = set_irq_regs(regs); + trace_mark(arch_apic_timer, "ip %lx", regs->eip); + /* * NOTE! We'd better ACK the irq immediately, * because timer handling can be slow. Index: linux-mcount.git/arch/x86/kernel/irq_32.c === --- linux-mcount.git.orig/arch/x86/kernel/irq_32.c 2008-01-25 21:46:48.0 -0500 +++ linux-mcount.git/arch/x86/kernel/irq_32.c 2008-01-25 21:47:38.0 -0500 @@ -85,6 +85,7 @@ fastcall unsigned int do_IRQ(struct pt_r old_regs = set_irq_regs(regs); irq_enter(); + trace_mark(arch_do_irq, "ip %lx irq %d", regs->eip, irq); #ifdef CONFIG_DEBUG_STACKOVERFLOW /* Debugging check for stack overflow: is there less than 1KB free? 
*/ { Index: linux-mcount.git/arch/x86/kernel/irq_64.c === --- linux-mcount.git.orig/arch/x86/kernel/irq_64.c 2008-01-25 21:46:48.0 -0500 +++ linux-mcount.git/arch/x86/kernel/irq_64.c 2008-01-25 21:47:38.0 -0500 @@ -149,6 +149,8 @@ asmlinkage unsigned int do_IRQ(struct pt irq_enter(); irq = __get_cpu_var(vector_irq)[vector]; + trace_mark(arch_do_irq, "ip %lx irq %d", regs->rip, irq); + #ifdef CONFIG_DEBUG_STACKOVERFLOW stack_overflow_check(regs); #endif Index: linux-mcount.git/arch/x86/kernel/traps_32.c === --- linux-mcount.git.orig/arch/x86/kernel/traps_32.c2008-01-25 21:47:08.0 -0500 +++ linux-mcount.git/arch/x86/kernel/traps_32.c 2008-01-25 21:47:38.0 -0500 @@ -769,6 +769,8 @@ notrace fastcall __kprobes void do_nmi(s nmi_enter(); + trace_mark(arch_do_nmi, "ip %lx flags %lx", regs->eip, regs->eflags); + cpu = smp_processor_id(); ++nmi_count(cpu); Index: linux-mcount.git/arch/x86/kernel/traps_64.c === --- linux-mcount.git.orig/arch/x86/kernel/traps_64.c2008-01-25 21:46:48.0 -0500 +++ linux-mcount.git/arch/x86/kernel/traps_64.c 2008-01-25 21:47:38.0 -0500 @@ -782,6 +782,8 @@ asmlinkage __kprobes void default_do_nmi cpu = smp_processor_id(); + trace_mark(arch_do_nmi, "ip %lx flags %lx", regs->rip, regs->eflags); + /* Only the BSP gets external NMIs from the system. 
*/ if (!cpu) reason = get_nmi_reason(); Index: linux-mcount.git/arch/x86/mm/fault_32.c === --- linux-mcount.git.orig/arch/x86/mm/fault_32.c2008-01-25 21:46:48.0 -0500 +++ linux-mcount.git/arch/x86/mm/fault_32.c 2008-01-25 21:47:38.0 -0500 @@ -311,6 +311,9 @@ fastcall void __kprobes do_page_fault(st /* get the address */ address = read_cr2(); + trace_mark(arch_do_page_fault, "ip %lx err %lx addr %lx", + regs->eip, error_code, address); + tsk = current; si_code = SEGV_MAPERR; Index: linux-mcount.git/arch/x86/mm/fault_64.c === --- linux-mcount.git.orig/arch/x86/mm/fault_64.c2008-01-25 21:46:48.0 -0500 +++ linux-mcount.git/arch/x86/mm/fault_64.c 2008-01-25 21:47:38.0 -0500 @@ -316,6 +316,9 @@ asmlinkage void __kprobes do_page_fault( /* get the address */ address = read_cr2(); + trace_mark(arch_do_page_fault, "ip %lx err %lx addr %lx", + regs->rip, error_code, address); + info.si_code = SEGV_MAPERR; Index: linux-mcount.git/kernel/hrtimer.c === --- linux-mcount.git.orig/kernel/hrtimer.c 2008-01-25 21:46:48.0 -0500 +++ linux-mcount.git/kernel/hrtimer.c 2008-01-25 21:47:38.0 -0500 @@ -709,6 +709,8 @@ static void enqueue_hrtimer(struct hrtim struct hrtimer *entry; int leftmost = 1
[PATCH 22/23 -v6] Add event tracer.
This patch adds an event tracer that hooks into various events in the kernel. Although it can be used on its own, it is mainly meant to help the other tracers (wakeup and preempt off) show various events in their traces without having to enable the heavy mcount hooks.

Here's an example:

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
     bash-2857 1d..3 180us : try_to_wake_up+0xdc/0x172 (0 0)
     bash-2857 1d..3 181us+: activate_task+0x7d/0xb9 (0 2)
     bash-2857 1d..2 192us!: enqueue_hrtimer+0x55/0x156(a4bf49c2e 81000 10b3ce8)
     bash-2857 1d..3 331us+: deactivate_task+0x7c/0xa8 (0 3)
     bash-2857 1d..3 334us+: 2857:120:S --> 2849:120
     sshd-2849 1d..3 338us+: enqueue_hrtimer+0x55/0x156(a4c2a94a7 81000 10b3ce8)
     sshd-2849 1d..3 370us : try_to_wake_up+0xdc/0x172 (0 0)
     sshd-2849 1d..3 370us+: activate_task+0x7d/0xb9 (0 2)
     sshd-2849 1d..2 380us+: enqueue_hrtimer+0x55/0x156(a4c0cae6f 81000 10b3ce8)

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
--- lib/tracing/Kconfig | 12 + lib/tracing/Makefile|1 lib/tracing/trace_events.c | 472 lib/tracing/trace_irqsoff.c |6 lib/tracing/trace_wakeup.c | 13 + lib/tracing/tracer.c| 154 ++ lib/tracing/tracer.h| 62 + 7 files changed, 678 insertions(+), 42 deletions(-) Index: linux-mcount.git/lib/tracing/trace_events.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-mcount.git/lib/tracing/trace_events.c 2008-01-25 21:47:40.0 -0500 @@ -0,0 +1,472 @@ +/* + * trace task events + * + * Copyright (C) 2007 Steven Rostedt <[EMAIL PROTECTED]> + * + * Based on code from the latency_tracer, that is: + * + * Copyright (C) 2004-2006 Ingo Molnar + * Copyright (C) 2004 William Lee Irwin III + */ +#include +#include +#include +#include +#include +#include + +#include "tracer.h" + +static struct tracing_trace *tracer_trace __read_mostly; +static int trace_enabled __read_mostly; + +static void notrace event_reset(struct tracing_trace *tr) +{ + struct tracing_trace_cpu *data; +
int cpu; + + tr->time_start = now(); + + for_each_possible_cpu(cpu) { + data = tr->data[cpu]; + tracing_reset(data); + } +} + +static void notrace event_trace_sched_switch(void *private, +struct task_struct *prev, +struct task_struct *next) +{ + struct tracing_trace **ptr = private; + struct tracing_trace *tr = *ptr; + struct tracing_trace_cpu *data; + unsigned long flags; + int cpu; + + if (!trace_enabled || !tr) + return; + + local_irq_save(flags); + cpu = raw_smp_processor_id(); + data = tr->data[cpu]; + + atomic_inc(&data->disabled); + if (atomic_read(&data->disabled) != 1) + goto out; + + tracing_sched_switch_trace(tr, data, prev, next, flags); + + out: + atomic_dec(&data->disabled); + local_irq_restore(flags); +} + +static struct tracer_switch_ops switch_ops __read_mostly = { + .func = event_trace_sched_switch, + .private = &tracer_trace, +}; + +notrace int trace_event_enabled(void) +{ + return trace_enabled && tracer_trace; +} + +/* Taken from sched.c */ +#define __PRIO(prio) \ + ((prio) <= 99 ? 199 - (prio) : (prio) - 120) + +#define PRIO(p) __PRIO((p)->prio) + +notrace void trace_event_wakeup(unsigned long ip, + struct task_struct *p, + struct task_struct *curr) +{ + struct tracing_trace *tr = tracer_trace; + struct tracing_trace_cpu *data; + unsigned long flags; + int cpu; + + if (!trace_enabled || !tr) + return; + + local_irq_save(flags); + cpu = raw_smp_processor_id(); + data = tr->data[cpu]; + + atomic_inc(&data->disabled); + if (atomic_read(&data->disabled) != 1) + goto out; + + /* record process's command line */ + tracing_record_cmdline(p); + tracing_record_cmdline(curr); + tracing_trace_pid(tr, data, flags, ip, p->pid, PRIO(p), PRIO(curr)); + + out: + atomic_dec(&data->disabled); + local_irq_restore(flags); +} + +struct event_probes { + const char *name; + const char *fmt; + void (*func)(const struct event_probes *probe, +struct tracing_trace *tr, +struct tracing_trace_cpu
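Both probes above use the same guard: bump data->disabled, and only record if the counter is exactly 1, so that anything the recording path itself triggers on this CPU is dropped instead of recursing. A hedged userspace sketch of that pattern with C11 atomics follows. It uses atomic_fetch_add() and checks the old value, where the patch does atomic_inc() followed by atomic_read(); the struct and function names are made up for the example, and the irq save/restore is omitted.

```c
#include <assert.h>
#include <stdatomic.h>

struct trace_cpu {
	atomic_int disabled;	/* nesting depth of tracer entry on this "CPU" */
	int recorded;		/* number of events actually recorded */
};

static void record_event(struct trace_cpu *data, int depth);

/* Record one event unless we are already inside the tracer. */
static void trace_event(struct trace_cpu *data, int depth)
{
	/* fetch_add returns the old value: 0 means this is the outermost entry */
	if (atomic_fetch_add(&data->disabled, 1) == 0)
		record_event(data, depth);

	atomic_fetch_sub(&data->disabled, 1);
}

static void record_event(struct trace_cpu *data, int depth)
{
	data->recorded++;

	/* simulate the recording path itself hitting a traced event */
	if (depth > 0)
		trace_event(data, depth - 1);
}
```

Calling trace_event(&cpu, 3) on a zeroed struct records exactly one event: the nested calls see a non-zero counter and back out, and the counter returns to zero afterwards.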
[PATCH 13/23 -v6] Make the task State char-string visible to all
The tracer wants to be able to convert the task state number into a user-visible character. This patch pulls that conversion string out of the scheduler and into the header, so that if it ever changes, other parts of the kernel will pick up the change.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 include/linux/sched.h |    2 ++
 kernel/sched.c        |    2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-mcount.git/include/linux/sched.h
===
--- linux-mcount.git.orig/include/linux/sched.h 2008-01-25 21:46:55.0 -0500
+++ linux-mcount.git/include/linux/sched.h 2008-01-25 21:47:21.0 -0500
@@ -2055,6 +2055,8 @@ static inline void migration_init(void)
 }
 #endif
 
+#define TASK_STATE_TO_CHAR_STR "RSDTtZX"
+
 #endif /* __KERNEL__ */
 
 #endif
Index: linux-mcount.git/kernel/sched.c
===
--- linux-mcount.git.orig/kernel/sched.c 2008-01-25 21:47:19.0 -0500
+++ linux-mcount.git/kernel/sched.c 2008-01-25 21:47:21.0 -0500
@@ -5149,7 +5149,7 @@ out_unlock:
 	return retval;
 }
 
-static const char stat_nam[] = "RSDTtZX";
+static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
 
 void sched_show_task(struct task_struct *p)
 {
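To make the intended use concrete, here is a small sketch of how a tracer could turn a state value into one of those characters. It assumes the state is a bitmask and that the lowest set bit selects the character, with 0 meaning running; task_state_char() is a made-up helper name, not something this patch adds.

```c
#include <assert.h>

#define TASK_STATE_TO_CHAR_STR "RSDTtZX"

/* Map a task state bitmask to its display character, '?' if out of range. */
static char task_state_char(unsigned long state)
{
	static const char names[] = TASK_STATE_TO_CHAR_STR;
	unsigned idx = 0;

	/* index 0 is "running"; otherwise 1 + position of the lowest set bit */
	if (state) {
		idx = 1;
		while (!(state & 1)) {
			state >>= 1;
			idx++;
		}
	}

	return idx < sizeof(names) - 1 ? names[idx] : '?';
}
```

With this mapping, state 0 prints as 'R', 1 as 'S', 2 as 'D', and a bit beyond the end of the string prints as '?'. Because the bound comes from sizeof the string, adding a character to the macro extends the helper without further changes.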
[PATCH 20/23 -v6] trace preempt off critical timings
Add preempt-off timings. A lot of the core kernel code is taken from the RT patch latency tracer that was written by Ingo Molnar.

This adds "preemptoff" and "preemptirqsoff" to /debugfs/tracing/available_tracers.

Now, instead of just tracing irqs off, preemption off can be selected to be recorded. When it is selected, it shares the same files as the irqs-off timings. One can trace preemption off, irqs off, or either one being off.

Echoing "preemptoff" into /debugfs/tracing/current_tracer records preempt-off sections only. "irqsoff" records only the time irqs are disabled, while "preemptirqsoff" records the total time that irqs or preemption are disabled.

Runtime switching of these options is now supported by simply echoing the appropriate tracer name into /debugfs/tracing/current_tracer.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
--- arch/x86/kernel/process_32.c |3 include/linux/irqflags.h |3 include/linux/mcount.h |8 + include/linux/preempt.h |2 kernel/sched.c | 24 + lib/tracing/Kconfig | 25 + lib/tracing/Makefile |1 lib/tracing/trace_irqsoff.c | 181 +++ 8 files changed, 195 insertions(+), 52 deletions(-) Index: linux-mcount.git/lib/tracing/Kconfig === --- linux-mcount.git.orig/lib/tracing/Kconfig 2008-01-25 21:47:34.0 -0500 +++ linux-mcount.git/lib/tracing/Kconfig2008-01-25 21:47:36.0 -0500 @@ -46,6 +46,31 @@ config CRITICAL_IRQSOFF_TIMING echo 0 > /debugfs/tracing/tracing_max_latency + (Note that kernel size and overhead increases with this option + enabled. This option and the preempt-off timing option can be + used together or separately.) + +config CRITICAL_PREEMPT_TIMING + bool "Preemption-off critical section latency timing" + default n + depends on GENERIC_TIME + depends on PREEMPT + select TRACING + select TRACER_MAX_TRACE + help + This option measures the time spent in preemption off critical + sections, with microsecond accuracy.
+ + The default measurement method is a maximum search, which is + disabled by default and can be runtime (re-)started + via: + + echo 0 > /debugfs/tracing/tracing_max_latency + + (Note that kernel size and overhead increases with this option + enabled. This option and the irqs-off timing option can be + used together or separately.) + config WAKEUP_TRACER bool "Trace wakeup latencies" depends on DEBUG_KERNEL Index: linux-mcount.git/lib/tracing/Makefile === --- linux-mcount.git.orig/lib/tracing/Makefile 2008-01-25 21:47:34.0 -0500 +++ linux-mcount.git/lib/tracing/Makefile 2008-01-25 21:47:36.0 -0500 @@ -4,6 +4,7 @@ obj-$(CONFIG_TRACING) += tracer.o obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o obj-$(CONFIG_FUNCTION_TRACER) += trace_function.o obj-$(CONFIG_CRITICAL_IRQSOFF_TIMING) += trace_irqsoff.o +obj-$(CONFIG_CRITICAL_PREEMPT_TIMING) += trace_irqsoff.o obj-$(CONFIG_WAKEUP_TRACER) += trace_wakeup.o libmcount-y := mcount.o Index: linux-mcount.git/lib/tracing/trace_irqsoff.c === --- linux-mcount.git.orig/lib/tracing/trace_irqsoff.c 2008-01-25 21:47:34.0 -0500 +++ linux-mcount.git/lib/tracing/trace_irqsoff.c2008-01-25 21:47:36.0 -0500 @@ -21,6 +21,34 @@ static struct tracing_trace *tracer_trac static __cacheline_aligned_in_smp DEFINE_MUTEX(max_mutex); static int trace_enabled __read_mostly; +static DEFINE_PER_CPU(int, tracing_cpu); + +enum { + TRACER_IRQS_OFF = (1 << 1), + TRACER_PREEMPT_OFF = (1 << 2), +}; + +static int trace_type __read_mostly; + +#ifdef CONFIG_CRITICAL_PREEMPT_TIMING +# define preempt_trace() \ + ((trace_type & TRACER_PREEMPT_OFF) && preempt_count()) +#else +# define preempt_trace() (0) +#endif + +#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING +# define irq_trace() \ + ((trace_type & TRACER_IRQS_OFF) && \ +({ \ +unsigned long __flags; \ +local_save_flags(__flags); \ +irqs_disabled_flags(__flags); \ +})) +#else +# define irq_trace() (0) +#endif + /* * Sequence count - we record it when starting a measurement and * skip the latency if the 
sequence has changed - some other section @@ -41,14 +69,11 @@ static void notrace irqsoff_trace_call(u unsigned long flags; int cpu; - if (likely(!trace_enabled)) + if (likely(!__get_cpu_var(tracing_cpu))) return; local_save_flags(flags); - if (!irqs_disable
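As a rough illustration of how the two trace_type bits select what counts as a critical section, here is a sketch in plain C. The enum values are the ones from the hunk above; should_time() is a hypothetical helper, and combining the two checks with a logical OR is my reading of the changelog ("irqs or preemption are disabled"), since the code that uses preempt_trace() and irq_trace() is cut off in this archive.

```c
#include <assert.h>

enum {
	TRACER_IRQS_OFF    = (1 << 1),
	TRACER_PREEMPT_OFF = (1 << 2),
};

/*
 * Decide whether the current context counts as a timed critical section.
 * irqs_off and preempt_count are passed in, because a userspace sketch
 * cannot query the real CPU state.
 */
static int should_time(int trace_type, int irqs_off, int preempt_count)
{
	int irq_hit = (trace_type & TRACER_IRQS_OFF) && irqs_off;
	int preempt_hit = (trace_type & TRACER_PREEMPT_OFF) && preempt_count > 0;

	return irq_hit || preempt_hit;
}
```

Under that assumption, "irqsoff" would set only TRACER_IRQS_OFF, "preemptoff" only TRACER_PREEMPT_OFF, and "preemptirqsoff" both bits.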
[PATCH 06/23 -v6] add notrace annotations for NMI routines
This annotates NMI functions with notrace. Some tracers may be able to live with this, but some cannot. So we turn off NMI tracing. One solution might be to make a notrace_nmi which would only turn off NMI tracing if a trace utility needed it off. Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]> Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- arch/x86/kernel/nmi_32.c |2 +- arch/x86/kernel/nmi_64.c |2 +- arch/x86/kernel/traps_32.c |4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-) Index: linux-mcount.git/arch/x86/kernel/nmi_32.c === --- linux-mcount.git.orig/arch/x86/kernel/nmi_32.c 2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/arch/x86/kernel/nmi_32.c 2008-01-25 21:47:08.0 -0500 @@ -318,7 +318,7 @@ EXPORT_SYMBOL(touch_nmi_watchdog); extern void die_nmi(struct pt_regs *, const char *msg); -__kprobes int nmi_watchdog_tick(struct pt_regs * regs, unsigned reason) +notrace __kprobes int nmi_watchdog_tick(struct pt_regs *regs, unsigned reason) { /* Index: linux-mcount.git/arch/x86/kernel/nmi_64.c === --- linux-mcount.git.orig/arch/x86/kernel/nmi_64.c 2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/arch/x86/kernel/nmi_64.c 2008-01-25 21:47:08.0 -0500 @@ -314,7 +314,7 @@ void touch_nmi_watchdog(void) touch_softlockup_watchdog(); } -int __kprobes nmi_watchdog_tick(struct pt_regs * regs, unsigned reason) +notrace __kprobes int nmi_watchdog_tick(struct pt_regs *regs, unsigned reason) { int sum; int touched = 0; Index: linux-mcount.git/arch/x86/kernel/traps_32.c === --- linux-mcount.git.orig/arch/x86/kernel/traps_32.c2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/arch/x86/kernel/traps_32.c 2008-01-25 21:47:08.0 -0500 @@ -723,7 +723,7 @@ void __kprobes die_nmi(struct pt_regs *r do_exit(SIGSEGV); } -static __kprobes void default_do_nmi(struct pt_regs * regs) +static notrace __kprobes void default_do_nmi(struct pt_regs *regs) { unsigned char reason = 0; @@ -763,7 +763,7 @@ static __kprobes void default_do_nmi(str static int ignore_nmis; -fastcall 
__kprobes void do_nmi(struct pt_regs * regs, long error_code) +notrace fastcall __kprobes void do_nmi(struct pt_regs *regs, long error_code) { int cpu;
[PATCH 05/23 -v6] add notrace annotations to vsyscall.
Add the notrace annotations to some of the vsyscall functions. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- arch/x86/kernel/vsyscall_64.c |3 ++- arch/x86/vdso/vclock_gettime.c | 15 --- arch/x86/vdso/vgetcpu.c|3 ++- include/asm-x86/vsyscall.h |3 ++- 4 files changed, 14 insertions(+), 10 deletions(-) Index: linux-mcount.git/arch/x86/kernel/vsyscall_64.c === --- linux-mcount.git.orig/arch/x86/kernel/vsyscall_64.c 2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/arch/x86/kernel/vsyscall_64.c 2008-01-25 21:47:06.0 -0500 @@ -42,7 +42,8 @@ #include #include -#define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr))) +#define __vsyscall(nr) \ + __attribute__ ((unused, __section__(".vsyscall_" #nr))) notrace #define __syscall_clobber "r11","rcx","memory" #define __pa_vsymbol(x)\ ({unsigned long v; \ Index: linux-mcount.git/arch/x86/vdso/vclock_gettime.c === --- linux-mcount.git.orig/arch/x86/vdso/vclock_gettime.c2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/arch/x86/vdso/vclock_gettime.c 2008-01-25 21:47:06.0 -0500 @@ -24,7 +24,7 @@ #define gtod vdso_vsyscall_gtod_data -static long vdso_fallback_gettime(long clock, struct timespec *ts) +notrace static long vdso_fallback_gettime(long clock, struct timespec *ts) { long ret; asm("syscall" : "=a" (ret) : @@ -32,7 +32,7 @@ static long vdso_fallback_gettime(long c return ret; } -static inline long vgetns(void) +notrace static inline long vgetns(void) { long v; cycles_t (*vread)(void); @@ -41,7 +41,7 @@ static inline long vgetns(void) return (v * gtod->clock.mult) >> gtod->clock.shift; } -static noinline int do_realtime(struct timespec *ts) +notrace static noinline int do_realtime(struct timespec *ts) { unsigned long seq, ns; do { @@ -55,7 +55,8 @@ static noinline int do_realtime(struct t } /* Copy of the version in kernel/time.c which we cannot directly access */ -static void vset_normalized_timespec(struct timespec *ts, long sec, long nsec) +notrace static void +vset_normalized_timespec(struct 
timespec *ts, long sec, long nsec) { while (nsec >= NSEC_PER_SEC) { nsec -= NSEC_PER_SEC; @@ -69,7 +70,7 @@ static void vset_normalized_timespec(str ts->tv_nsec = nsec; } -static noinline int do_monotonic(struct timespec *ts) +notrace static noinline int do_monotonic(struct timespec *ts) { unsigned long seq, ns, secs; do { @@ -83,7 +84,7 @@ static noinline int do_monotonic(struct return 0; } -int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) +notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) { if (likely(gtod->sysctl_enabled && gtod->clock.vread)) switch (clock) { @@ -97,7 +98,7 @@ int __vdso_clock_gettime(clockid_t clock int clock_gettime(clockid_t, struct timespec *) __attribute__((weak, alias("__vdso_clock_gettime"))); -int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz) +notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz) { long ret; if (likely(gtod->sysctl_enabled && gtod->clock.vread)) { Index: linux-mcount.git/arch/x86/vdso/vgetcpu.c === --- linux-mcount.git.orig/arch/x86/vdso/vgetcpu.c 2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/arch/x86/vdso/vgetcpu.c2008-01-25 21:47:06.0 -0500 @@ -13,7 +13,8 @@ #include #include "vextern.h" -long __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused) +notrace long +__vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused) { unsigned int dummy, p; Index: linux-mcount.git/include/asm-x86/vsyscall.h === --- linux-mcount.git.orig/include/asm-x86/vsyscall.h2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/include/asm-x86/vsyscall.h 2008-01-25 21:47:06.0 -0500 @@ -24,7 +24,8 @@ enum vsyscall_num { ((unused, __section__ (".vsyscall_gtod_data"),aligned(16))) #define __section_vsyscall_clock __attribute__ \ ((unused, __section__ (".vsyscall_clock"),aligned(16))) -#define __vsyscall_fn __attribute__ ((unused,__section__(".vsyscall_fn"))) +#define __vsyscall_fn \ + __attribute__ ((unused, __section__(".vsyscall_fn"))) notrace 
#define VGETCPU_RDTSCP 1 #define VGETCPU_LSL2
[PATCH 08/23 -v6] initialize the clock source to jiffies clock.
The latency tracer can call clocksource_read very early in bootup, before the clock source variable has been initialized. This results in a crash at boot (even before earlyprintk is initialized), since clock is still NULL at that point and there is no valid clock->read to call.

This patch simply initializes the clock to use clocksource_jiffies, so that any early user of clocksource_read will not crash.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
Acked-by: John Stultz <[EMAIL PROTECTED]>
--- include/linux/clocksource.h |3 +++ kernel/time/timekeeping.c |9 +++-- 2 files changed, 10 insertions(+), 2 deletions(-) Index: linux-mcount.git/include/linux/clocksource.h === --- linux-mcount.git.orig/include/linux/clocksource.h 2008-01-25 21:47:09.0 -0500 +++ linux-mcount.git/include/linux/clocksource.h2008-01-25 21:47:11.0 -0500 @@ -273,6 +273,9 @@ extern struct clocksource* clocksource_g extern void clocksource_change_rating(struct clocksource *cs, int rating); extern void clocksource_resume(void); +/* used to initialize clock */ +extern struct clocksource clocksource_jiffies; + #ifdef CONFIG_GENERIC_TIME_VSYSCALL extern void update_vsyscall(struct timespec *ts, struct clocksource *c); extern void update_vsyscall_tz(void); Index: linux-mcount.git/kernel/time/timekeeping.c === --- linux-mcount.git.orig/kernel/time/timekeeping.c 2008-01-25 21:47:09.0 -0500 +++ linux-mcount.git/kernel/time/timekeeping.c 2008-01-25 21:47:11.0 -0500 @@ -53,8 +53,13 @@ static inline void update_xtime_cache(u6 timespec_add_ns(&xtime_cache, nsec); } -static struct clocksource *clock; /* pointer to current clocksource */ - +/* + * pointer to current clocksource + * Just in case we use clocksource_read before we initialize + * the actual clock source. Instead of calling a NULL read pointer + * we return jiffies.
+ */ +static struct clocksource *clock = &clocksource_jiffies; #ifdef CONFIG_GENERIC_TIME /**
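The idea is easy to show outside the kernel: point the "current clock" at a coarse but always-available fallback from the start, so a read before the real clock is installed still works. The sketch below is only an illustration. The struct is cut down to two fields, and fake_jiffies, clocksource_fine, clocksource_read_current() and install_clocksource() are invented names for the example.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cycle_t;

struct clocksource {
	const char *name;
	cycle_t (*read)(void);
};

static cycle_t fake_jiffies;	/* stand-in for the jiffies counter */

static cycle_t jiffies_read(void)
{
	return fake_jiffies;
}

static cycle_t fine_read(void)
{
	return 123456;		/* pretend high-resolution counter value */
}

static struct clocksource clocksource_jiffies = { "jiffies", jiffies_read };
static struct clocksource clocksource_fine = { "fine", fine_read };

/* Never NULL: early callers get the coarse fallback instead of crashing. */
static struct clocksource *clock_src = &clocksource_jiffies;

static cycle_t clocksource_read_current(void)
{
	return clock_src->read();
}

static void install_clocksource(struct clocksource *cs)
{
	clock_src = cs;
}
```

A caller that runs before install_clocksource() just sees jiffies-resolution values, which is coarse but harmless for an early tracer timestamp.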
[PATCH 10/23 -v6] add notrace annotations to timing events
This patch adds notrace annotations to timer functions that will be used by tracing. This helps speed things up and also keeps the ugliness of printing these functions down. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- arch/x86/kernel/apic_32.c |2 +- arch/x86/kernel/hpet.c|2 +- arch/x86/kernel/time_32.c |2 +- arch/x86/kernel/tsc_32.c |2 +- arch/x86/kernel/tsc_64.c |4 ++-- arch/x86/lib/delay_32.c |6 +++--- drivers/clocksource/acpi_pm.c |8 7 files changed, 13 insertions(+), 13 deletions(-) Index: linux-mcount.git/arch/x86/kernel/apic_32.c === --- linux-mcount.git.orig/arch/x86/kernel/apic_32.c 2008-01-25 21:46:49.0 -0500 +++ linux-mcount.git/arch/x86/kernel/apic_32.c 2008-01-25 21:47:15.0 -0500 @@ -577,7 +577,7 @@ static void local_apic_timer_interrupt(v * interrupt as well. Thus we cannot inline the local irq ... ] */ -void fastcall smp_apic_timer_interrupt(struct pt_regs *regs) +notrace fastcall void smp_apic_timer_interrupt(struct pt_regs *regs) { struct pt_regs *old_regs = set_irq_regs(regs); Index: linux-mcount.git/arch/x86/kernel/hpet.c === --- linux-mcount.git.orig/arch/x86/kernel/hpet.c2008-01-25 21:46:49.0 -0500 +++ linux-mcount.git/arch/x86/kernel/hpet.c 2008-01-25 21:47:15.0 -0500 @@ -295,7 +295,7 @@ static int hpet_legacy_next_event(unsign /* * Clock source related code */ -static cycle_t read_hpet(void) +static notrace cycle_t read_hpet(void) { return (cycle_t)hpet_readl(HPET_COUNTER); } Index: linux-mcount.git/arch/x86/kernel/time_32.c === --- linux-mcount.git.orig/arch/x86/kernel/time_32.c 2008-01-25 21:46:49.0 -0500 +++ linux-mcount.git/arch/x86/kernel/time_32.c 2008-01-25 21:47:15.0 -0500 @@ -122,7 +122,7 @@ static int set_rtc_mmss(unsigned long no int timer_ack; -unsigned long profile_pc(struct pt_regs *regs) +notrace unsigned long profile_pc(struct pt_regs *regs) { unsigned long pc = instruction_pointer(regs); Index: linux-mcount.git/arch/x86/kernel/tsc_32.c === --- linux-mcount.git.orig/arch/x86/kernel/tsc_32.c 2008-01-25 21:46:49.0 -0500 
+++ linux-mcount.git/arch/x86/kernel/tsc_32.c 2008-01-25 21:47:15.0 -0500 @@ -269,7 +269,7 @@ core_initcall(cpufreq_tsc); static unsigned long current_tsc_khz = 0; -static cycle_t read_tsc(void) +static notrace cycle_t read_tsc(void) { cycle_t ret; Index: linux-mcount.git/arch/x86/kernel/tsc_64.c === --- linux-mcount.git.orig/arch/x86/kernel/tsc_64.c 2008-01-25 21:46:49.0 -0500 +++ linux-mcount.git/arch/x86/kernel/tsc_64.c 2008-01-25 21:47:15.0 -0500 @@ -248,13 +248,13 @@ __setup("notsc", notsc_setup); /* clock source code: */ -static cycle_t read_tsc(void) +static notrace cycle_t read_tsc(void) { cycle_t ret = (cycle_t)get_cycles_sync(); return ret; } -static cycle_t __vsyscall_fn vread_tsc(void) +static notrace cycle_t __vsyscall_fn vread_tsc(void) { cycle_t ret = (cycle_t)get_cycles_sync(); return ret; Index: linux-mcount.git/arch/x86/lib/delay_32.c === --- linux-mcount.git.orig/arch/x86/lib/delay_32.c 2008-01-25 21:46:49.0 -0500 +++ linux-mcount.git/arch/x86/lib/delay_32.c2008-01-25 21:47:15.0 -0500 @@ -24,7 +24,7 @@ #endif /* simple loop based delay: */ -static void delay_loop(unsigned long loops) +static notrace void delay_loop(unsigned long loops) { int d0; @@ -39,7 +39,7 @@ static void delay_loop(unsigned long loo } /* TSC based delay: */ -static void delay_tsc(unsigned long loops) +static notrace void delay_tsc(unsigned long loops) { unsigned long bclock, now; @@ -72,7 +72,7 @@ int read_current_timer(unsigned long *ti return -1; } -void __delay(unsigned long loops) +notrace void __delay(unsigned long loops) { delay_fn(loops); } Index: linux-mcount.git/drivers/clocksource/acpi_pm.c === --- linux-mcount.git.orig/drivers/clocksource/acpi_pm.c 2008-01-25 21:46:49.0 -0500 +++ linux-mcount.git/drivers/clocksource/acpi_pm.c 2008-01-25 21:47:15.0 -0500 @@ -30,13 +30,13 @@ */ u32 pmtmr_ioport __read_mostly; -static inline u32 read_pmtmr(void) +static inline notrace u32 read_pmtmr(void) { /* mask the output to 24 bits */ return inl(pmtmr_ioport) & ACPI_PM_MASK; } 
-u32 acpi_pm_read_verified(void) +notrace u32 acpi_pm_read_verified(void) { u32 v1 = 0, v2 = 0, v3 = 0;
[PATCH 02/23 -v6] Add basic support for gcc profiler instrumentation
If CONFIG_MCOUNT is selected and /proc/sys/kernel/mcount_enabled is set to a non-zero value, the mcount routine will be called every time we enter a kernel function that is not marked with the "notrace" attribute.

The mcount routine will then call a registered function, if one happens to be registered.

[This code has been highly hacked by Steven Rostedt, so don't blame Arnaldo for all of this ;-) ]

Update: It is now possible to register more than one mcount function. If only one mcount function is registered, that will be the function that mcount calls directly. If more than one function is registered, then mcount will call a function that loops through the registered functions and calls each of them.

Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
--- Makefile |3 arch/x86/Kconfig |1 arch/x86/kernel/entry_32.S | 25 +++ arch/x86/kernel/entry_64.S | 36 +++ include/linux/linkage.h|2 include/linux/mcount.h | 38 kernel/sysctl.c| 11 +++ lib/Kconfig.debug |1 lib/Makefile |2 lib/tracing/Kconfig| 10 +++ lib/tracing/Makefile |3 lib/tracing/mcount.c | 141 + 12 files changed, 273 insertions(+) Index: linux-mcount.git/Makefile === --- linux-mcount.git.orig/Makefile 2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/Makefile 2008-01-25 21:47:00.0 -0500 @@ -509,6 +509,9 @@ endif include $(srctree)/arch/$(SRCARCH)/Makefile +ifdef CONFIG_MCOUNT +KBUILD_CFLAGS += -pg +endif ifdef CONFIG_FRAME_POINTER KBUILD_CFLAGS += -fno-omit-frame-pointer -fno-optimize-sibling-calls else Index: linux-mcount.git/arch/x86/Kconfig === --- linux-mcount.git.orig/arch/x86/Kconfig 2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/arch/x86/Kconfig 2008-01-25 21:47:00.0 -0500 @@ -19,6 +19,7 @@ config X86_64 config X86 bool default y + select HAVE_MCOUNT config GENERIC_TIME bool Index: linux-mcount.git/arch/x86/kernel/entry_32.S === --- linux-mcount.git.orig/arch/x86/kernel/entry_32.S2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/arch/x86/kernel/entry_32.S 
2008-01-25 21:47:00.0 -0500 @@ -75,6 +75,31 @@ DF_MASK = 0x0400 NT_MASK= 0x4000 VM_MASK= 0x0002 +#ifdef CONFIG_MCOUNT +.globl mcount +mcount: + /* unlikely(mcount_enabled) */ + cmpl $0, mcount_enabled + jnz trace + ret + +trace: + /* taken from glibc */ + pushl %eax + pushl %ecx + pushl %edx + movl 0xc(%esp), %edx + movl 0x4(%ebp), %eax + + call *mcount_trace_function + + popl %edx + popl %ecx + popl %eax + + ret +#endif + #ifdef CONFIG_PREEMPT #define preempt_stop(clobbers) DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF #else Index: linux-mcount.git/arch/x86/kernel/entry_64.S === --- linux-mcount.git.orig/arch/x86/kernel/entry_64.S2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/arch/x86/kernel/entry_64.S 2008-01-25 21:47:00.0 -0500 @@ -53,6 +53,42 @@ .code64 +#ifdef CONFIG_MCOUNT + +ENTRY(mcount) + /* unlikely(mcount_enabled) */ + cmpl $0, mcount_enabled + jnz trace + retq + +trace: + /* taken from glibc */ + subq $0x38, %rsp + movq %rax, (%rsp) + movq %rcx, 8(%rsp) + movq %rdx, 16(%rsp) + movq %rsi, 24(%rsp) + movq %rdi, 32(%rsp) + movq %r8, 40(%rsp) + movq %r9, 48(%rsp) + + movq 0x38(%rsp), %rsi + movq 8(%rbp), %rdi + + call *mcount_trace_function + + movq 48(%rsp), %r9 + movq 40(%rsp), %r8 + movq 32(%rsp), %rdi + movq 24(%rsp), %rsi + movq 16(%rsp), %rdx + movq 8(%rsp), %rcx + movq (%rsp), %rax + addq $0x38, %rsp + + retq +#endif + #ifndef CONFIG_PREEMPT #define retint_kernel retint_restore_args #endif Index: linux-mcount.git/include/linux/linkage.h === --- linux-mcount.git.orig/include/linux/linkage.h 2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/include/linux/linkage.h2008-01-25 21:47:00.0 -0500 @@ -3,6 +3,8 @@ #include +#define notrace __attribute__((no_instrument_function)) + #ifdef __cplusplus #define CPP_ASMLINKAGE extern "C" #else Index: linux-mcount.git/include/linux/mcount.h === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-mcount.git
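The "one function is called directly, several go through a loop" behaviour described in the changelog can be sketched in a few lines of ordinary C. This is not the contents of lib/tracing/mcount.c, which is cut off in this archive; the names trace_fn, register_trace_fn(), trace_list(), hook() and the fixed-size array are all assumptions made for the example, and there is no locking.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*trace_fn)(unsigned long ip, unsigned long parent_ip);

#define MAX_FNS 4

static trace_fn registered[MAX_FNS];
static size_t nr_registered;

/* Used while nothing is registered, so the hook never calls NULL. */
static void trace_stub(unsigned long ip, unsigned long parent_ip)
{
	(void)ip;
	(void)parent_ip;
}

/* Used when more than one function is registered: call each in turn. */
static void trace_list(unsigned long ip, unsigned long parent_ip)
{
	for (size_t i = 0; i < nr_registered; i++)
		registered[i](ip, parent_ip);
}

static trace_fn active_fn = trace_stub;

static int register_trace_fn(trace_fn fn)
{
	if (nr_registered == MAX_FNS)
		return -1;

	registered[nr_registered++] = fn;

	/* a single callback is called directly, otherwise go through the loop */
	active_fn = (nr_registered == 1) ? fn : trace_list;
	return 0;
}

/* What the assembly stub would do: one indirect call, no branching. */
static void hook(unsigned long ip, unsigned long parent_ip)
{
	active_fn(ip, parent_ip);
}

/* two trivial callbacks, only here to exercise the dispatch */
static int hits_a, hits_b;

static void count_a(unsigned long ip, unsigned long parent_ip)
{
	(void)ip;
	(void)parent_ip;
	hits_a++;
}

static void count_b(unsigned long ip, unsigned long parent_ip)
{
	(void)ip;
	(void)parent_ip;
	hits_b++;
}
```

The point of the design is that the hot path stays a single indirect call in the common single-tracer case, and only pays for the loop when a second callback is present.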
[PATCH 14/23 -v6] Add tracing of context switches
This patch adds context switch tracing, in the following format:

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
     swapper-0 1d..3 137us+: 0:140:R --> 2912:120
     sshd-2912 1d..3 216us+: 2912:120:S --> 0:140
     swapper-0 1d..3 261us+: 0:140:R --> 2912:120
     bash-2920 0d..3 267us+: 2920:120:S --> 0:140
     sshd-2912 1d..3 330us!: 2912:120:S --> 0:140
     swapper-0 1d..3 2389us+: 0:140:R --> 2847:120
 yum-upda-2847 1d..3 2411us!: 2847:120:S --> 0:140
     swapper-0 0d..3 11089us+: 0:140:R --> 3139:120
 gdm-bina-3139 0d..3 3us!: 3139:120:S --> 0:140
     swapper-0 1d..3 102328us+: 0:140:R --> 2847:120
 yum-upda-2847 1d..3 102348us!: 2847:120:S --> 0:140

"sched_switch" is added to /debugfs/tracing/available_tracers.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
Cc: Mathieu Desnoyers <[EMAIL PROTECTED]>
--- lib/tracing/Kconfig |9 ++ lib/tracing/Makefile |1 lib/tracing/trace_sched_switch.c | 134 +++ lib/tracing/tracer.c | 43 lib/tracing/tracer.h | 19 + 5 files changed, 205 insertions(+), 1 deletion(-) Index: linux-mcount.git/lib/tracing/Kconfig === --- linux-mcount.git.orig/lib/tracing/Kconfig 2008-01-25 21:47:17.0 -0500 +++ linux-mcount.git/lib/tracing/Kconfig2008-01-25 21:47:23.0 -0500 @@ -23,3 +23,12 @@ config FUNCTION_TRACER insert a call to an architecture specific __mcount routine, that the debugging mechanism using this facility will hook by providing a set of inline routines. + +config CONTEXT_SWITCH_TRACER + bool "Trace process context switches" + depends on DEBUG_KERNEL + select TRACING + help + This tracer hooks into the context switch and records + all switching of tasks.
+ Index: linux-mcount.git/lib/tracing/Makefile === --- linux-mcount.git.orig/lib/tracing/Makefile 2008-01-25 21:47:17.0 -0500 +++ linux-mcount.git/lib/tracing/Makefile 2008-01-25 21:47:23.0 -0500 @@ -1,6 +1,7 @@ obj-$(CONFIG_MCOUNT) += libmcount.o obj-$(CONFIG_TRACING) += tracer.o +obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o obj-$(CONFIG_FUNCTION_TRACER) += trace_function.o libmcount-y := mcount.o Index: linux-mcount.git/lib/tracing/trace_sched_switch.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-mcount.git/lib/tracing/trace_sched_switch.c 2008-01-25 21:47:23.0 -0500 @@ -0,0 +1,134 @@ +/* + * trace context switch + * + * Copyright (C) 2007 Steven Rostedt <[EMAIL PROTECTED]> + * + */ +#include +#include +#include +#include +#include +#include +#include + +#include "tracer.h" + +static struct tracing_trace *tracer_trace; +static int trace_enabled __read_mostly; + +static notrace void sched_switch_callback(const struct marker *mdata, + void *private_data, + const char *format, ...) 
+{ + struct tracing_trace **p = mdata->private; + struct tracing_trace *tr = *p; + struct tracing_trace_cpu *data; + struct task_struct *prev; + struct task_struct *next; + unsigned long flags; + va_list ap; + int cpu; + + if (likely(!trace_enabled)) + return; + + va_start(ap, format); + prev = va_arg(ap, typeof(prev)); + next = va_arg(ap, typeof(next)); + va_end(ap); + + raw_local_irq_save(flags); + cpu = raw_smp_processor_id(); + data = tr->data[cpu]; + atomic_inc(&data->disabled); + + if (likely(atomic_read(&data->disabled) == 1)) + tracing_sched_switch_trace(tr, data, prev, next, flags); + + atomic_dec(&data->disabled); + raw_local_irq_restore(flags); +} + +static notrace void sched_switch_reset(struct tracing_trace *tr) +{ + int cpu; + + tr->time_start = now(); + + for_each_online_cpu(cpu) + tracing_reset(tr->data[cpu]); +} + +static notrace void start_sched_trace(struct tracing_trace *tr) +{ + sched_switch_reset(tr); + trace_enabled = 1; +} + +static notrace void stop_sched_trace(struct tracing_trace *tr) +{ + trace_enabled = 0; +} + +static notrace void sched_switch_trace_init(struct tracing_trace *tr) +{ + tracer_trace = tr; + + if (tr->ctrl) + start_sched_trace(tr); +} + +static notrac
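The callback above pulls prev and next out of a va_list, which only works if the probe and the marker site agree on the argument order and types. A stripped-down userspace sketch of that hand-off is below. struct task, switch_probe() and the format string used by the caller are placeholders for the example, and the real callback takes the marker and private-data arguments as well.

```c
#include <assert.h>
#include <stdarg.h>

struct task {
	int pid;
};

static int last_prev_pid, last_next_pid;

/*
 * Probe for a "context switch" marker. The caller must pass exactly two
 * struct task pointers after the format string, prev first, then next.
 */
static void switch_probe(const char *format, ...)
{
	struct task *prev, *next;
	va_list ap;

	(void)format;	/* only documents the arguments, it is not parsed */

	va_start(ap, format);
	prev = va_arg(ap, struct task *);
	next = va_arg(ap, struct task *);
	va_end(ap);

	last_prev_pid = prev->pid;
	last_next_pid = next->pid;
}
```

Nothing checks the types at compile time, so a mismatch between the call site and the probe shows up as garbage at run time rather than as a build error.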
[PATCH 07/23 -v6] handle accurate time keeping over long delays
Handle accurate time even if there's a long delay between accumulated clock cycles. Signed-off-by: John Stultz <[EMAIL PROTECTED]> Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- arch/powerpc/kernel/time.c|3 +- arch/x86/kernel/vsyscall_64.c |5 ++- include/asm-x86/vgtod.h |2 - include/linux/clocksource.h | 58 -- kernel/time/timekeeping.c | 36 +- 5 files changed, 82 insertions(+), 22 deletions(-) Index: linux-mcount.git/arch/x86/kernel/vsyscall_64.c === --- linux-mcount.git.orig/arch/x86/kernel/vsyscall_64.c 2008-01-25 21:47:06.0 -0500 +++ linux-mcount.git/arch/x86/kernel/vsyscall_64.c 2008-01-25 21:47:09.0 -0500 @@ -86,6 +86,7 @@ void update_vsyscall(struct timespec *wa vsyscall_gtod_data.clock.mask = clock->mask; vsyscall_gtod_data.clock.mult = clock->mult; vsyscall_gtod_data.clock.shift = clock->shift; + vsyscall_gtod_data.clock.cycle_accumulated = clock->cycle_accumulated; vsyscall_gtod_data.wall_time_sec = wall_time->tv_sec; vsyscall_gtod_data.wall_time_nsec = wall_time->tv_nsec; vsyscall_gtod_data.wall_to_monotonic = wall_to_monotonic; @@ -121,7 +122,7 @@ static __always_inline long time_syscall static __always_inline void do_vgettimeofday(struct timeval * tv) { - cycle_t now, base, mask, cycle_delta; + cycle_t now, base, accumulated, mask, cycle_delta; unsigned seq; unsigned long mult, shift, nsec; cycle_t (*vread)(void); @@ -135,6 +136,7 @@ static __always_inline void do_vgettimeo } now = vread(); base = __vsyscall_gtod_data.clock.cycle_last; + accumulated = __vsyscall_gtod_data.clock.cycle_accumulated; mask = __vsyscall_gtod_data.clock.mask; mult = __vsyscall_gtod_data.clock.mult; shift = __vsyscall_gtod_data.clock.shift; @@ -145,6 +147,7 @@ static __always_inline void do_vgettimeo /* calculate interval: */ cycle_delta = (now - base) & mask; + cycle_delta += accumulated; /* convert to nsecs: */ nsec += (cycle_delta * mult) >> shift; Index: linux-mcount.git/include/asm-x86/vgtod.h === --- linux-mcount.git.orig/include/asm-x86/vgtod.h 2008-01-25 
21:46:50.0 -0500 +++ linux-mcount.git/include/asm-x86/vgtod.h2008-01-25 21:47:09.0 -0500 @@ -15,7 +15,7 @@ struct vsyscall_gtod_data { struct timezone sys_tz; struct { /* extract of a clocksource struct */ cycle_t (*vread)(void); - cycle_t cycle_last; + cycle_t cycle_last, cycle_accumulated; cycle_t mask; u32 mult; u32 shift; Index: linux-mcount.git/include/linux/clocksource.h === --- linux-mcount.git.orig/include/linux/clocksource.h 2008-01-25 21:46:50.0 -0500 +++ linux-mcount.git/include/linux/clocksource.h2008-01-25 21:47:09.0 -0500 @@ -50,8 +50,12 @@ struct clocksource; * @flags: flags describing special properties * @vread: vsyscall based read * @resume:resume function for the clocksource, if necessary + * @cycle_last:Used internally by timekeeping core, please ignore. + * @cycle_accumulated: Used internally by timekeeping core, please ignore. * @cycle_interval:Used internally by timekeeping core, please ignore. * @xtime_interval:Used internally by timekeeping core, please ignore. + * @xtime_nsec:Used internally by timekeeping core, please ignore. + * @error: Used internally by timekeeping core, please ignore. */ struct clocksource { /* @@ -82,7 +86,10 @@ struct clocksource { * Keep it in a different cache line to dirty no * more than one cache line. */ - cycle_t cycle_last cacheline_aligned_in_smp; + struct { + cycle_t cycle_last, cycle_accumulated; + } cacheline_aligned_in_smp; + u64 xtime_nsec; s64 error; @@ -168,11 +175,44 @@ static inline cycle_t clocksource_read(s } /** + * clocksource_get_cycles: - Access the clocksource's accumulated cycle value + * @cs:pointer to clocksource being read + * @now: current cycle value + * + * Uses the clocksource to return the current cycle_t value. + * NOTE!!!: This is different from clocksource_read, because it + * returns the accumulated cycle value! Must hold xtime lock! 
+ */ +static inline cycle_t +clocksource_get_cycles(struct clocksource *cs, cycle_t now) +{ + cycle_t offset = (now - cs->cycle_last) & cs->mask; + offset += cs->cycle_accumulated; +
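
The arithmetic in this patch can be sketched outside the kernel: the masked delta since cycle_last is added to cycle_accumulated, and the sum is scaled to nanoseconds with mult and shift. The struct and the demo_* helpers below are hypothetical simplifications, not the kernel's struct clocksource, and locking is left out entirely.

```c
#include <stdint.h>

/* Hypothetical, reduced stand-in for the clocksource fields used here. */
struct demo_clock {
	uint64_t cycle_last;		/* counter value at the last accumulation */
	uint64_t cycle_accumulated;	/* cycles folded in so far */
	uint64_t mask;			/* width of the hardware counter */
	uint32_t mult;
	uint32_t shift;
};

/* Accumulated cycles plus the masked delta since cycle_last. */
static uint64_t demo_get_cycles(const struct demo_clock *cs, uint64_t now)
{
	uint64_t offset = (now - cs->cycle_last) & cs->mask;

	return offset + cs->cycle_accumulated;
}

/* Fold the current delta into the accumulator before the counter can wrap twice. */
static void demo_accumulate(struct demo_clock *cs, uint64_t now)
{
	uint64_t offset = (now - cs->cycle_last) & cs->mask;

	cs->cycle_last = now;
	cs->cycle_accumulated += offset;
}

/* Convert cycles to nanoseconds the way the vsyscall hunk does. */
static uint64_t demo_cyc2ns(const struct demo_clock *cs, uint64_t cycles)
{
	return (cycles * cs->mult) >> cs->shift;
}
```

With a 24-bit counter, a read at 0x10 after cycle_last = 0xFFFFF0 gives a masked delta of 0x20, so a single wrap between accumulations is handled by the mask alone.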
[PATCH 03/23 -v6] Annotate core code that should not be traced
Mark functions in core code that should not be traced with "notrace".
The "notrace" attribute will prevent gcc from adding a call to mcount on
the annotated functions.

Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 lib/smp_processor_id.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-mcount.git/lib/smp_processor_id.c
===
--- linux-mcount.git.orig/lib/smp_processor_id.c	2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/lib/smp_processor_id.c	2008-01-25 21:47:03.0 -0500
@@ -7,7 +7,7 @@
 #include
 #include
 
-unsigned int debug_smp_processor_id(void)
+notrace unsigned int debug_smp_processor_id(void)
 {
 	unsigned long preempt_count = preempt_count();
 	int this_cpu = raw_smp_processor_id();

-- 
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH 17/23 -v6] Add marker in try_to_wake_up
Add markers into the wakeup code, to allow the tracer to record wakeup
timings.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 kernel/sched.c |    8 ++++++++
 1 file changed, 8 insertions(+)

Index: linux-mcount.git/kernel/sched.c
===
--- linux-mcount.git.orig/kernel/sched.c	2008-01-25 21:47:21.0 -0500
+++ linux-mcount.git/kernel/sched.c	2008-01-25 21:47:30.0 -0500
@@ -1885,6 +1885,10 @@ static int try_to_wake_up(struct task_st
 out_activate:
 #endif /* CONFIG_SMP */
 
+	trace_mark(kernel_sched_wakeup,
+		   "p %p rq->curr %p",
+		   p, rq->curr);
+
 	schedstat_inc(p, se.nr_wakeups);
 	if (sync)
 		schedstat_inc(p, se.nr_wakeups_sync);
@@ -2026,6 +2030,10 @@ void fastcall wake_up_new_task(struct ta
 		p->sched_class->task_new(rq, p);
 		inc_nr_running(p, rq);
 	}
+	trace_mark(kernel_sched_wakeup_new,
+		   "p %p rq->curr %p",
+		   p, rq->curr);
+
 	check_preempt_curr(rq, p);
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_wake_up)
[PATCH 09/23 -v6] add get_monotonic_cycles
The latency tracer needs a way to get an accurate time without grabbing any locks. Locks themselves might call the latency tracer and cause at best a slow down. This patch adds get_monotonic_cycles that returns cycles from a reliable clock source in a monotonic fashion. Signed-off-by: John Stultz <[EMAIL PROTECTED]> Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- include/linux/clocksource.h | 54 +--- kernel/time/timekeeping.c | 26 +++-- 2 files changed, 70 insertions(+), 10 deletions(-) Index: linux-mcount.git/include/linux/clocksource.h === --- linux-mcount.git.orig/include/linux/clocksource.h 2008-01-25 21:47:11.0 -0500 +++ linux-mcount.git/include/linux/clocksource.h2008-01-25 21:47:13.0 -0500 @@ -88,8 +88,16 @@ struct clocksource { */ struct { cycle_t cycle_last, cycle_accumulated; - } cacheline_aligned_in_smp; + /* base structure provides lock-free read +* access to a virtualized 64bit counter +* Uses RCU-like update. +*/ + struct { + cycle_t cycle_base_last, cycle_base; + } base[2]; + int base_num; + } cacheline_aligned_in_smp; u64 xtime_nsec; s64 error; @@ -175,19 +183,30 @@ static inline cycle_t clocksource_read(s } /** - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value + * clocksource_get_basecycles: - get the clocksource's accumulated cycle value * @cs:pointer to clocksource being read * @now: current cycle value * * Uses the clocksource to return the current cycle_t value. * NOTE!!!: This is different from clocksource_read, because it - * returns the accumulated cycle value! Must hold xtime lock! + * returns a 64bit wide accumulated value. 
*/ static inline cycle_t -clocksource_get_cycles(struct clocksource *cs, cycle_t now) +clocksource_get_basecycles(struct clocksource *cs) { - cycle_t offset = (now - cs->cycle_last) & cs->mask; - offset += cs->cycle_accumulated; + int num; + cycle_t now, offset; + + preempt_disable(); + num = cs->base_num; + /* base_num is shared, and some archs are wacky */ + smp_read_barrier_depends(); + now = clocksource_read(cs); + offset = (now - cs->base[num].cycle_base_last); + offset &= cs->mask; + offset += cs->base[num].cycle_base; + preempt_enable(); + return offset; } @@ -197,11 +216,27 @@ clocksource_get_cycles(struct clocksourc * @now: current cycle value * * Used to avoids clocksource hardware overflow by periodically - * accumulating the current cycle delta. Must hold xtime write lock! + * accumulating the current cycle delta. Uses RCU-like update, but + * ***still requires the xtime_lock is held for writing!*** */ static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now) { - cycle_t offset = (now - cs->cycle_last) & cs->mask; + /* +* First update the monotonic base portion. +* The dual array update method allows for lock-free reading. +* 'num' is always 1 or 0. 
+*/ + int num = 1 - cs->base_num; + cycle_t offset = (now - cs->base[1-num].cycle_base_last); + offset &= cs->mask; + cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset; + cs->base[num].cycle_base_last = now; + /* make sure this array is visible to the world first */ + smp_wmb(); + cs->base_num = num; + + /* Now update the cycle_accumulated portion */ + offset = (now - cs->cycle_last) & cs->mask; cs->cycle_last = now; cs->cycle_accumulated += offset; } @@ -272,6 +307,9 @@ extern int clocksource_register(struct c extern struct clocksource* clocksource_get_next(void); extern void clocksource_change_rating(struct clocksource *cs, int rating); extern void clocksource_resume(void); +extern cycle_t get_monotonic_cycles(void); +extern unsigned long cycles_to_usecs(cycle_t cycles); +extern cycle_t usecs_to_cycles(unsigned long usecs); /* used to initialize clock */ extern struct clocksource clocksource_jiffies; Index: linux-mcount.git/kernel/time/timekeeping.c === --- linux-mcount.git.orig/kernel/time/timekeeping.c 2008-01-25 21:47:11.0 -0500 +++ linux-mcount.git/kernel/time/timekeeping.c 2008-01-25 21:47:13.0 -0500 @@ -71,10 +71,12 @@ static struct clocksource *clock = &cloc */ static inline s64 __get_nsec_offset(void) { - cycle_t cycle_delta; + cycle_t now, cycle_delta; s64 ns_offset; - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock)); + now = clocksource_read(clock); + cycle
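
The dual-array update used by clocksource_accumulate() and clocksource_get_basecycles() above can be sketched in userspace with C11 atomics. This is only an approximation under stated assumptions: a release store and an acquire load stand in for smp_wmb() and smp_read_barrier_depends(), the writer is assumed to be serialized externally (the patch requires xtime_lock held for writing), and the sketch ignores a reader that stalls across two updates, which the patch narrows with preempt_disable(). All demo_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

struct demo_base {
	uint64_t cycle_base_last;	/* counter value when this slot was written */
	uint64_t cycle_base;		/* 64-bit accumulated value at that point */
};

struct demo_cs {
	struct demo_base base[2];
	atomic_int base_num;		/* index of the slot readers should use */
	uint64_t mask;			/* width of the hardware counter */
};

/* Writer: fill the inactive slot, then publish it by flipping base_num. */
static void demo_accumulate_base(struct demo_cs *cs, uint64_t now)
{
	int cur = atomic_load_explicit(&cs->base_num, memory_order_relaxed);
	int num = 1 - cur;
	uint64_t offset = (now - cs->base[cur].cycle_base_last) & cs->mask;

	cs->base[num].cycle_base = cs->base[cur].cycle_base + offset;
	cs->base[num].cycle_base_last = now;

	/* The slot contents must be visible before the new index is. */
	atomic_store_explicit(&cs->base_num, num, memory_order_release);
}

/* Reader: takes no lock, picks the published slot and extends it to 'now'. */
static uint64_t demo_get_basecycles(struct demo_cs *cs, uint64_t now)
{
	int num = atomic_load_explicit(&cs->base_num, memory_order_acquire);
	uint64_t offset = (now - cs->base[num].cycle_base_last) & cs->mask;

	return offset + cs->base[num].cycle_base;
}
```

Because the writer never touches the slot that base_num currently points at, a reader sees either the old pair or the new pair, never a mix of the two.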
[PATCH 23/23 -v6] Critical latency timings histogram
This patch adds hooks into the latency tracer to give us histograms of interrupts off, preemption off and wakeup timings. This code was based off of work done by Yi Yang <[EMAIL PROTECTED]> But heavily modified to work with the new tracer, and some clean ups by Steven Rostedt <[EMAIL PROTECTED]> Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- lib/tracing/Kconfig | 20 + lib/tracing/Makefile|4 lib/tracing/trace_irqsoff.c | 21 + lib/tracing/trace_wakeup.c | 18 + lib/tracing/tracer_hist.c | 513 lib/tracing/tracer_hist.h | 39 +++ 6 files changed, 610 insertions(+), 5 deletions(-) Index: linux-mcount.git/lib/tracing/Kconfig === --- linux-mcount.git.orig/lib/tracing/Kconfig 2008-01-25 21:47:40.0 -0500 +++ linux-mcount.git/lib/tracing/Kconfig2008-01-25 21:47:42.0 -0500 @@ -102,3 +102,23 @@ config CONTEXT_SWITCH_TRACER This tracer hooks into the context switch and records all switching of tasks. +config INTERRUPT_OFF_HIST + bool "Interrupts off critical timings histogram" + depends on CRITICAL_IRQSOFF_TIMING + help + This option uses the infrastructure of the critical + irqs off timings to create a histogram of latencies. + +config PREEMPT_OFF_HIST + bool "Preempt off critical timings histogram" + depends on CRITICAL_PREEMPT_TIMING + help + This option uses the infrastructure of the critical + preemption off timings to create a histogram of latencies. + +config WAKEUP_LATENCY_HIST + bool "Interrupts off critical timings histogram" + depends on WAKEUP_TRACER + help + This option uses the infrastructure of the wakeup tracer + to create a histogram of latencies. 
Index: linux-mcount.git/lib/tracing/Makefile === --- linux-mcount.git.orig/lib/tracing/Makefile 2008-01-25 21:47:40.0 -0500 +++ linux-mcount.git/lib/tracing/Makefile 2008-01-25 21:47:42.0 -0500 @@ -8,4 +8,8 @@ obj-$(CONFIG_CRITICAL_PREEMPT_TIMING) += obj-$(CONFIG_WAKEUP_TRACER) += trace_wakeup.o obj-$(CONFIG_EVENT_TRACER) += trace_events.o +obj-$(CONFIG_INTERRUPT_OFF_HIST) += tracer_hist.o +obj-$(CONFIG_PREEMPT_OFF_HIST) += tracer_hist.o +obj-$(CONFIG_WAKEUP_LATENCY_HIST) += tracer_hist.o + libmcount-y := mcount.o Index: linux-mcount.git/lib/tracing/trace_irqsoff.c === --- linux-mcount.git.orig/lib/tracing/trace_irqsoff.c 2008-01-25 21:47:40.0 -0500 +++ linux-mcount.git/lib/tracing/trace_irqsoff.c2008-01-25 21:47:42.0 -0500 @@ -16,6 +16,7 @@ #include #include "tracer.h" +#include "tracer_hist.h" static struct tracing_trace *tracer_trace __read_mostly; static __cacheline_aligned_in_smp DEFINE_MUTEX(max_mutex); @@ -237,7 +238,7 @@ stop_critical_timing(unsigned long ip, u else return; - if (likely(!trace_enabled)) + if (!trace_enabled) return; cpu = raw_smp_processor_id(); @@ -261,10 +262,14 @@ void notrace start_critical_timings(void { if (preempt_trace() || irq_trace()) start_critical_timing(CALLER_ADDR0, 0); + + tracing_hist_preempt_start(); } void notrace stop_critical_timings(void) { + tracing_hist_preempt_stop(TRACE_STOP); + if (preempt_trace() || irq_trace()) stop_critical_timing(CALLER_ADDR0, 0); } @@ -273,6 +278,8 @@ void notrace stop_critical_timings(void) #ifdef CONFIG_LOCKDEP void notrace time_hardirqs_on(unsigned long a0, unsigned long a1) { + tracing_hist_preempt_stop(1); + if (!preempt_trace() && irq_trace()) stop_critical_timing(a0, a1); } @@ -281,6 +288,8 @@ void notrace time_hardirqs_off(unsigned { if (!preempt_trace() && irq_trace()) start_critical_timing(a0, a1); + + tracing_hist_preempt_start(); } #else /* !CONFIG_LOCKDEP */ @@ -314,6 +323,8 @@ inline void print_irqtrace_events(struct */ void notrace trace_hardirqs_on(void) { + 
tracing_hist_preempt_stop(1); + if (!preempt_trace() && irq_trace()) stop_critical_timing(CALLER_ADDR0, 0); } @@ -323,11 +334,15 @@ void notrace trace_hardirqs_off(void) { if (!preempt_trace() && irq_trace()) start_critical_timing(CALLER_ADDR0, 0); + + tracing_hist_preempt_start(); } EXPORT_SYMBOL(trace_hardirqs_off); void notrace trace_hardirqs_on_caller(unsigned long caller_addr) { + tracing_hist_preempt_stop(1); + if (!preempt_trace() && irq_trace()) stop_critical_timing(CALLER_ADDR0, caller_addr); } @@ -337,6 +352,8 @@ void notrace trace_hardirqs_off_caller(u { if (!preempt_trace() && irq_trace())
[PATCH 04/23 -v6] x86_64: notrace annotations
Add "notrace" annotation to x86_64 specific files.

Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/x86/kernel/head64.c     |    2 +-
 arch/x86/kernel/setup64.c    |    4 ++--
 arch/x86/kernel/smpboot_64.c |    2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/head64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/head64.c	2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/head64.c	2008-01-25 21:47:05.0 -0500
@@ -46,7 +46,7 @@ static void __init copy_bootdata(char *r
 	}
 }
 
-void __init x86_64_start_kernel(char * real_mode_data)
+notrace void __init x86_64_start_kernel(char *real_mode_data)
 {
 	int i;
 
Index: linux-mcount.git/arch/x86/kernel/setup64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/setup64.c	2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/setup64.c	2008-01-25 21:47:05.0 -0500
@@ -114,7 +114,7 @@ void __init setup_per_cpu_areas(void)
 	}
 }
 
-void pda_init(int cpu)
+notrace void pda_init(int cpu)
 {
 	struct x8664_pda *pda = cpu_pda(cpu);
 
@@ -197,7 +197,7 @@ DEFINE_PER_CPU(struct orig_ist, orig_ist
  * 'CPU state barrier', nothing should get across.
  * A lot of state is already set up in PDA init.
  */
-void __cpuinit cpu_init (void)
+notrace void __cpuinit cpu_init(void)
 {
 	int cpu = stack_smp_processor_id();
 	struct tss_struct *t = &per_cpu(init_tss, cpu);
Index: linux-mcount.git/arch/x86/kernel/smpboot_64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/smpboot_64.c	2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/smpboot_64.c	2008-01-25 21:47:05.0 -0500
@@ -317,7 +317,7 @@ static inline void set_cpu_sibling_map(i
 /*
  * Setup code on secondary processor (after comming out of the trampoline)
  */
-void __cpuinit start_secondary(void)
+notrace __cpuinit void start_secondary(void)
 {
 	/*
 	 * Dont put anything before smp_callin(), SMP
[PATCH 12/23 -v6] Add context switch marker to sched.c
Add a marker to context_switch to record the prev and next tasks.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 kernel/sched.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-mcount.git/kernel/sched.c
===
--- linux-mcount.git.orig/kernel/sched.c	2008-01-25 21:46:55.0 -0500
+++ linux-mcount.git/kernel/sched.c	2008-01-25 21:47:19.0 -0500
@@ -2198,6 +2198,8 @@ context_switch(struct rq *rq, struct tas
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
+	trace_mark(kernel_sched_schedule,
+		   "prev %p next %p", prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
[PATCH 00/23 -v6] mcount and latency tracing utility -v6
[
  version 6 of mcount / trace patches:

  changes include:

  Ported to latest git 99f1c97dbdb30e958edfd1ced0ae43df62504e07

  Added the runqueue_is_locked schedule api to let others (printk)
  know if the runqueue is locked and if it is safe to call wake_up
  on klogd.

  Added event_trace! This records various events like interrupts,
  timer events, page fault events, some scheduler stuff. This is
  also used to help show a little more information to the other
  latency tracers. It has a lot less overhead than what mcount
  gives us (without as much data).

  Also included in this series are the histograms. When configured
  in, the interrupts/preemption off and wakeup timings will be
  recorded into histograms found in /debugfs/tracing/latency_hist/

  This code is from MontaVista contributions into the RT kernel,
  with some cleanups and porting effort by myself.
]

All released versions of these patches can be found at:

   http://people.redhat.com/srostedt/tracing/

The following patch series brings to vanilla Linux a bit of the RT kernel
trace facility. This incorporates the "-pg" profiling option of gcc
that will call the "mcount" function for all functions called in
the kernel.

Note: I did investigate using -finstrument-functions but that adds a call
to both start and end of a function. Using mcount only does the
beginning of the function. mcount alone adds ~13% overhead. The
-finstrument-functions added ~19%. Also it caused me to do tricks with
inline, because it adds the function calls to inline functions as well.

This patch series implements the code for x86 (32 and 64 bit), but
other archs can easily be implemented as well (note: ARM and PPC are
already implemented in -rt)

Some Background:

A while back, Ingo Molnar and William Lee Irwin III created a latency tracer
to find problem latency areas in the kernel for the RT patch. This tracer
became a very integral part of the RT kernel in solving where latency hot
spots were. One of the features that the latency tracer added was a
function trace.
This function tracer would record all functions that
were called (implemented by the gcc "-pg" option) and would show what was
called when interrupts or preemption was turned off.

This feature is also very helpful in normal debugging. So there has been
talk about taking bits and pieces from the RT latency tracer and bringing
them to LKML. But no one had the time to do it.

Arnaldo Carvalho de Melo took a crack at it. He pulled out the mcount
as well as part of the tracing code and made it generic from the point
of the tracing code. I'm not sure why this stopped. Probably because
Arnaldo is a very busy man, and his efforts had to be utilized elsewhere.

While I still maintain my own Logdev utility:

  http://rostedt.homelinux.com/logdev

I came across a need to do the mcount with logdev too. I was successful
but found that it became very dependent on a lot of code. One thing that
I liked about my logdev utility was that it was very non-intrusive, and has
been easy to port from the Linux 2.0 days. I did not want to burden the
logdev patch with the intrusiveness of mcount (not really that intrusive,
it just needs to add a "notrace" annotation to functions in the kernel
that will cause more conflicts in applying patches for me).

Being close to the holidays, I grabbed Arnaldo's old patches and started
massaging them into something that could be useful for logdev. What
I found out (after talking this over with Arnaldo too) is that this can
be much more useful for others as well.

The main thing I changed was that I made the mcount function itself
generic, and not the dependency on the tracing code. That is, I added

register_mcount_function()
 and
clear_mcount_function()

So whenever mcount is enabled and a function is registered, that function
is called for all functions in the kernel that are not labeled with the
"notrace" annotation.
The Simple Tracer:
------------------

To show the power of this I also massaged the tracer code that Arnaldo pulled
from the RT patch and made it be a nice example of what can be done
with this.

The function that is registered to mcount has the prototype:

 void func(unsigned long ip, unsigned long parent_ip);

The ip is the address of the function and parent_ip is the address of
the parent function that called it.

The x86_64 version has the assembly call the registered function directly
to save having to do a double function call.

To enable mcount, a sysctl is added:

   /proc/sys/kernel/mcount_enabled

Once mcount is enabled, when a function is registered, it will be called by
all functions. The tracer in this patch series shows how this is done.
It adds a directory in the debugfs, called mctracer, with a ctrl file that
allows the user to have the tracer register its function. Note, the order
of enabling mcount and registering a function is not important, but both
must be done to initiate the tracing. That is, you can disable tracing
by eith
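
The registration scheme described above can be modelled in a few lines of userspace C. Only the callback prototype comes from the cover letter. The demo_* functions are hypothetical stand-ins for register_mcount_function(), clear_mcount_function(), the mcount stub and the mcount_enabled sysctl, whose real signatures are not shown in this message.

```c
#include <stddef.h>

/* Prototype quoted in the cover letter for the registered function. */
typedef void (*demo_mcount_func_t)(unsigned long ip, unsigned long parent_ip);

static demo_mcount_func_t demo_registered;	/* NULL when nothing is registered */
static int demo_mcount_enabled;			/* stands in for the sysctl */

static void demo_register_mcount_function(demo_mcount_func_t func)
{
	demo_registered = func;
}

static void demo_clear_mcount_function(void)
{
	demo_registered = NULL;
}

/* Stand-in for the stub gcc -pg calls at the start of every traced function. */
static void demo_mcount(unsigned long ip, unsigned long parent_ip)
{
	if (demo_mcount_enabled && demo_registered)
		demo_registered(ip, parent_ip);
}

/* Example tracer callback: count calls and remember the last pair seen. */
static unsigned long demo_calls, demo_last_ip, demo_last_parent_ip;

static void demo_count_calls(unsigned long ip, unsigned long parent_ip)
{
	demo_calls++;
	demo_last_ip = ip;
	demo_last_parent_ip = parent_ip;
}
```

As in the description, the order of enabling and registering does not matter in this model, but the callback only runs once both have been done.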
Re: [PATCH] Linux Kernel Markers Support for Proprierary Modules
On Sat, 2008-01-26 at 14:27 +1100, Rusty Russell wrote:

> 2) Unconditionally reject modules with a wrong module section size.
> Currently we have no such check, which means without KALLSYMS,
> anything goes.

I favor the latter, since it's safest.

Jon.
Re: [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl
On Thu, Jan 24, 2008 at 11:25:32AM +0530, Aneesh Kumar K.V wrote:
> +static int free_ext_idx(handle_t *handle, struct inode *inode,
> +					struct ext4_extent_idx *ix)
> +{
> +	int i, retval = 0;
> +	ext4_fsblk_t block;
> +	struct buffer_head *bh;
> +	struct ext4_extent_header *eh;
> +
> +	block = idx_pblock(ix);
> +	bh = sb_bread(inode->i_sb, block);
> +	if (!bh)
> +		return -EIO;
> +
> +	eh = (struct ext4_extent_header *)bh->b_data;
> +	if (eh->eh_depth == 0) {
> +		brelse(bh);
> +		ext4_free_blocks(handle, inode, block, 1);
> +	} else {
> +		ix = EXT_FIRST_INDEX(eh);
> +		for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
> +			retval = free_ext_idx(handle, inode, ix);
> +			if (retval)
> +				return retval;
> +		}
> +	}
> +	return retval;
> +}

Aneesh, looks like if eh->eh_depth is != 0, bh gets leaked.

This is how I plan to fix it up:

+static int free_ext_idx(handle_t *handle, struct inode *inode,
+					struct ext4_extent_idx *ix)
+{
+	int i, retval = 0;
+	ext4_fsblk_t block;
+	struct buffer_head *bh;
+	struct ext4_extent_header *eh;
+
+	block = idx_pblock(ix);
+	bh = sb_bread(inode->i_sb, block);
+	if (!bh)
+		return -EIO;
+
+	eh = (struct ext4_extent_header *)bh->b_data;
+	if (eh->eh_depth == 0)
+		ext4_free_blocks(handle, inode, block, 1);
+	else {
+		ix = EXT_FIRST_INDEX(eh);
+		for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
+			retval = free_ext_idx(handle, inode, ix);
+			if (retval)
+				break;
+		}
+	}
+	put_bh(bh);
+	return retval;
+}

						- Ted
Re: [Netatalk-admins] netatalk slow after system upgrade (possibly kernel problem?)
Hi,

On Fri, 25 Jan 2008 12:55:42 +0100, Michael Monnerie wrote
> Dear lists,
>
> I've been spending a LOT of time trying to find out where's the
> problem, but can't find it and therefore seek urgent help now. We
> have the following system:

Did you try to force the server MTU to 1500 (it looks like you have jumbo
frames enabled)? Some interesting TCP/IP packets though :)

OT as it was working before, but if it's a simple LAN you may have a
flaky wire in the loop too.

Didier
Re: [PATCH 063/196] kset: convert /sys/devices to use kset_create
On Thu, Jan 24, 2008 at 11:10:01PM -0800, Greg Kroah-Hartman wrote:
> Dynamically create the kset instead of declaring it statically. We also
> rename devices_subsys to devices_kset to catch all users of the
> variable.

Guess what, you broke powerpc again!

[EMAIL PROTECTED]:~/work/linux/k.org $ git grep devices_subsys
arch/powerpc/kernel/vio.c:extern struct kset devices_subsys; /* needed for vio_find_name() */
arch/powerpc/kernel/vio.c:	found = kset_find_obj(&devices_subsys, kobj_name);

Obviously causes build failures, even of ppc64_defconfig.

(Unfortunately I cannot boot test, since I lack hardware that uses vio.)

Signed-off-by: Olof Johansson <[EMAIL PROTECTED]>

diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index 19a5656..ee752ab 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -37,7 +37,7 @@
 #include
 #include
 
-extern struct kset devices_subsys; /* needed for vio_find_name() */
+extern struct kset *devices_kset; /* needed for vio_find_name() */
 
 static struct bus_type vio_bus_type;
 
@@ -369,7 +369,7 @@ static struct vio_dev *vio_find_name(const char *kobj_name)
 {
 	struct kobject *found;
 
-	found = kset_find_obj(&devices_subsys, kobj_name);
+	found = kset_find_obj(devices_kset, kobj_name);
 	if (!found)
 		return NULL;
 
Re: [PATCH] Linux Kernel Markers Support for Proprierary Modules
On Saturday 26 January 2008 02:31:30 Jon Masters wrote:
> On Fri, 2008-01-25 at 08:56 +0100, Jan Engelhardt wrote:
> > So what is needed is an Oops with an explaining message
> > if (kernel_tainted) "blame that proprietary module first",
> > and make sure the user sees that oops even if in X.
>
> The former is actually trivially doable with the module->taints bits. We
> could have the equivalent of a neon flashing "blame this module" sign.
>
> I also agree, we should stop force loading. Incompatible struct module,
> etc. are really bad things to have mapped into a running kernel.

I think there are two things here:

1) Currently we allow modules with no kallsyms info to be loaded into a
   KALLSYMS kernel (and taint). A new option is needed to control this:
   CONFIG_ACCEPT_NO_KALLSYMS under KERNEL_DEBUG which allows loading of such
   "stripped" modules (a-la modprobe --force).

2) Unconditionally reject modules with a wrong module section size. Currently
   we have no such check, which means without KALLSYMS, anything goes.

Thoughts?
Rusty.
Re: using LKML for subsystem development (was Re: Linux 2.6.24)
On Sat, 26 Jan 2008 01:42:43 +0100, Stefan Richter said:

> Even if you only look at the Subject: and number of postings in a
> thread, how to judge whether there is a stability risk for the next -rc
> in the making, without experience or personal interest in the domain?

My general rule of thumb is "if my laptop has one of those on it, I look at
it more, even if I don't have the foggiest idea how it works". Of course,
this only works for threads with semi-sane Subject: headers
Re: [GIT PATCH] driver core patches against 2.6.24
On Sat, 26 Jan 2008 01:05:26 +0100, Ingo Molnar said:

> all it takes for me on Fedora is to boot a modular distro kernel once,
> then copy the /dev to the real (persistent) /dev:
>
>    mkdir /tmp2
>    mount /dev/sda1 /tmp2
>    cp -a /dev/* /tmp2/dev/
>
> and from that point on a bzImage/vmlinuz can boot up on Fedora without
> any problems (as long as it has the right drivers built in), and the
> initrd line can be removed from grub.conf.

Tried something like that once - it didn't play nice with the fact that I have
root-on-LVM, so you need an initrd to do the 'lvm vgchange -a' and get it
online.
Re: Linux 2.6.24
On Sat, 26 Jan 2008 00:50:44 +0100, Stefan Richter said:

> How often is "bisectability" being broken already before merge in
> subsystem trees, and how often only in the context of the merge result?

I don't bisect git trees often - but I'd say that at least half the time I have
to bisect -mm, I'll hit a busticated bisection point and need to move several
one way or another. Fortunately, Andrew does a good job of keeping fixes near
their parents, so it's usually not *that* hard to clean up (at least for me -
but I recently realized that I had passed the 3-decade mark of breaking and
fixing software). Newcomer kernel testers are likely in for a rude awakening
if they hit one of those points.
Re: Hot (un)plugging of a SATA drive with sata_nv (CK8S) ?
Ignacy Gawedzki wrote:
> Hi everyone,
>
> I'm having trouble to determine the cause of the following behavior. I'm not
> even sure that I'm supposed to hot plug and unplug a SATA drive from a nForce3
> Ultra (apparently CK8S, on a Gigabyte K8NS Ultra 939 mobo) SATA interface, to
> begin with. The information is hard to find given that the sata_nv driver
> supports a range of different hardware.
>
> I've recently acquired an external drive with (among others) an eSATA
> interface, so I also bought a eSATA->SATA bracket and intend to use that drive
> (Lacie d2 quadra 500G) through eSATA.

BTW, eSATA cannot technically be converted properly to SATA with a
simple connector adapter. eSATA is supposed to use higher signalling
voltages and so using such an adapter is not guaranteed to work.

> The thing is that if I boot the machine with the drive plugged and turned on,
> it is properly detected and usable. If, at some point, I want to remove the
> drive, I unmount any partitions on it and issue the proper scsiadd -r command
> (usually scsiadd -r 1 0 0 0, since this is the second SATA drive) and
> everything is fine (I turn the drive off and unplug it), so far.
>
> Next, when I want to use the drive again, it's still detected alright (although
> appears as sdc and not sdb anymore), but the SCSI layer issues "scsi 1:0:0:0:
> rejecting I/O to dead device" from time to time. Then any scsiadd -r 1 0 0 0
> command fails with "No such device or address", although it appears in the
> output of scsiadd -p or even scsiadd -s (always as 1 0 0 0).
>
> If I ignore that detail and switch the drive off, then the kernel eventually
> notices that the drive is gone and the SCSI layer attempts to stop the device
> and fails ([sdc] START_STOP FAILED). From that moment on, any attempt to plug
> the drive again fails. The kernel issues "ata2: hard resetting port" and
> "ata2: port is slow to respond, please be patient (Status 0x80)" periodically,
> until I switch the drive off.
>
> If the drive is not present at boot, then hot plugging it fails. The kernel
> first soft resets the port, then issues the "please be patient (Status 0x80)"
> message, complains that SRST failed (errno=-16) and goes on hard resetting the
> port, issuing "please be patient (Status 0x80)" and complaining that COMRESET
> failed (errno=-16), periodically, until the drive is switched off.

Full dmesg output would be useful..

> If somebody could tell me whether hot-plugging is supposed to work with my SATA
> interface, it would be nice. =)
>
> The motherboard happens to offer another SATA interface (Sil3512A) which is
> well supported and appears to support hot-plugging as well, but it conflicts
> nastily with my PCTV Pro (bttv) card (which are apparently known to conflict
> with the Sil SATA interfaces).
>
> Thanks for any help.
>
> Ignacy
Re: [linux-pm] Q: x86 suspend/hibernation code consolidation
On Friday 25 January 2008 19:32, Rafael J. Wysocki wrote:
> Hi,
>
> I'd like to move the 64-bit suspend/hibernation files from arch/x86/kernel to
> arch/x86/power, modify the names of the 32-bit files already in
> arch/x86/power and update the Makefiles accordingly, but there are some
> changes queued for merging that touch the files in question.
>
> When is the right time for making changes like that?
>
> Rafael

In Cambridge, when we discussed cleanups that touch a lot of files but have no functional change -- somebody suggested that right after rc1 closes is a good time. The reasoning was that they would not conflict with the functional changes in rc1.

However, I recall Linus saying something about "Andrew is special" WRT permission to push cleanups after the rc1 window; so I don't know what the final ruling was -- if there was such a ruling.

-Len
[PATCH] x86_64: change aper valid checking sequence
[PATCH] x86_64: change aper valid checking sequence

old sequence:
	size ==> >4G ==> point to RAM
changed to:
	>4G ==> point to RAM ==> size

Some BIOSes leave the aperture in an unclear state, so check the size last.
This avoids a report like

	Node 0: Aperture @ 4a4200 size 32 MB
	Aperture too small (32 MB)

With the patch we get

	Node 0: Aperture @ 4a4200 size 32 MB
	Aperture beyond 4G. Ignoring.

Signed-off-by: Yinghai Lu <[EMAIL PROTECTED]>

diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index 0b837bb..608152a 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -85,10 +85,6 @@ static int __init aperture_valid(u64 aper_base, u32 aper_size)
 	if (!aper_base)
 		return 0;
 
-	if (aper_size < 64*1024*1024) {
-		printk(KERN_ERR "Aperture too small (%d MB)\n", aper_size>>20);
-		return 0;
-	}
 	if (aper_base + aper_size > 0x100000000UL) {
 		printk(KERN_ERR "Aperture beyond 4GB. Ignoring.\n");
 		return 0;
@@ -97,6 +93,10 @@ static int __init aperture_valid(u64 aper_base, u32 aper_size)
 		printk(KERN_ERR "Aperture pointing to e820 RAM. Ignoring.\n");
 		return 0;
 	}
+	if (aper_size < 64*1024*1024) {
+		printk(KERN_ERR "Aperture too small (%d MB)\n", aper_size>>20);
+		return 0;
+	}
 	return 1;
 }
 
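The effect of the reordering can be tried outside the kernel. The following is only a minimal user-space sketch of the new check order, not the kernel function itself: the `aper_status` codes, the `ram_overlap_fn` callback (a stand-in for the e820 lookup) and the name `aperture_check` are all hypothetical, introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes; the kernel function only returns 0 or 1. */
enum aper_status {
	APER_OK,
	APER_NO_BASE,
	APER_BEYOND_4G,
	APER_IN_RAM,
	APER_TOO_SMALL
};

/* Hypothetical stand-in for the e820 lookup: nonzero if [start, end) hits RAM. */
typedef int (*ram_overlap_fn)(uint64_t start, uint64_t end);

/* Checks in the order proposed by the patch: >4G, then RAM, then size. */
static enum aper_status aperture_check(uint64_t base, uint32_t size,
				       ram_overlap_fn in_ram)
{
	if (!base)
		return APER_NO_BASE;
	if (base + size > 0x100000000ULL)
		return APER_BEYOND_4G;
	if (in_ram != NULL && in_ram(base, base + size))
		return APER_IN_RAM;
	if (size < 64u * 1024 * 1024)
		return APER_TOO_SMALL;
	return APER_OK;
}
```

With this order, a 32 MB aperture whose (assumed) base lies above 4 GB is rejected as "beyond 4G" rather than as "too small", which is the more informative message the patch is after.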
RE: [kvm-devel] [PATCH][RFC] SVM: Add Support for Nested Paging in AMD Fam16 CPUs
Joerg Roedel wrote:
> Hi,
>
> here is the first release of patches for KVM to support the Nested Paging
> (NPT) feature of AMD QuadCore CPUs for comments and public testing. This
> feature improves the guest performance significantly. I measured an
> improvement of around 17% using kernbench in my first tests.
>
> This patch series is basically tested with Linux guests (32 bit legacy
> paging, 32 bit PAE paging and 64 bit Long Mode). Also tested with Windows
> Vista 32 bit and 64 bit. All these guests ran successfully with these
> patches. The patch series only enables NPT for 64 bit Linux hosts at the
> moment.
>
> Please give these patches a good and deep testing. I hope we have this
> patchset ready for merging soon.

Good. We also ported the EPT patch for Xen to KVM, which we submitted last year. We've been cleaning up the patch with Avi. We are working on live migration support now, and we'll submit the patch once it's done. So please stay tuned.

>
> Joerg
>

Jun
---
Intel Open Source Technology Center
Re: [RFC] Parallelize IO for e2fsck
>> Incidentally, some context for the AIX approach to the OOM problem: a >> process may exclude itself from OOM vulnerability altogether. It places >> itself in "early allocation" mode, which means at the time it creates >> virtual memory, it reserves enough backing store for the worst case. The >> memory manager does not send such a process the SIGDANGER signal or >> terminate it when it runs out of paging space. Before c. 2000, this was >> the only mode. Now the default is late allocation mode, which is similar >> to Linux. > >This is an interesting approach. It feels like some programs might be >interested in choosing this mode instead of risking OOM. It's the way virtual memory always worked when it was first invented. The system not only reserved space to back every page of virtual memory; it assigned the particular blocks for it. Late allocation was a later innovation, and I believe its main goal was to make it possible to use the cheaper disk drives for paging instead of drums. Late allocation gives you better locality on disk, so the seeking doesn't eat you alive (drums don't seek). Even then, I assume (but am not sure) that the system at least reserved the space in an account somewhere so at pageout time there was guaranteed to be a place to which to page out. Overcommitting page space to save on disk space was a later idea. I was surprised to see AIX do late allocation by default, because IBM's traditional style is bulletproof systems. A system where a process can be killed at unpredictable times because of resource demands of unrelated processes doesn't really fit that style. It's really a fairly unusual application that benefits from late allocation: one that creates a lot more virtual memory than it ever touches. For example, a sparse array. Or am I missing something? 
-- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems
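The "sparse array" case mentioned above can be sketched on Linux as follows. This is only an illustration under stated assumptions: it relies on the Linux-specific `MAP_NORESERVE` flag and on the default (heuristic) overcommit policy, and the helper names `sparse_alloc` and `sparse_free` are hypothetical. It does not model the AIX early-allocation mode.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Reserve `len` bytes of address space without asking the kernel to
 * reserve backing store up front (late allocation). Pages only consume
 * memory once they are touched. Returns NULL on failure.
 */
static unsigned char *sparse_alloc(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	return p == MAP_FAILED ? NULL : (unsigned char *)p;
}

/* Release a region obtained from sparse_alloc(). Returns 0 on success. */
static int sparse_free(unsigned char *p, size_t len)
{
	return munmap(p, len);
}
```

A caller could map, say, 256 MB and touch only a handful of pages; the untouched pages read back as zero and need no backing store. The price is the one discussed in the thread: if memory runs out later, the failure arrives at page-touch time rather than at allocation time.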
RE: [kvm-devel] [PATCH 3/8] SVM: add module parameter to disable NestedPaging
Joerg Roedel wrote:
> To disable the use of the Nested Paging feature even if it is available in
> hardware this patch adds a module parameter. Nested Paging can be disabled by
> passing npt=off to the kvm_amd module.

I think it's better to use a (common) parameter to qemu. That way you can control on/off for each VM.

Jun
---
Intel Open Source Technology Center
Re: CS5536 mfgpt timer setup register hangs board
Jordan,

I am using TinyBIOS v0.99 with the MFGPT workaround disabled, and on a subsequent write I still run into the system hang. To try out the fix you mentioned, I enabled the workaround in the BIOS and then applied the fix; it still hangs.

I am dumping this info from the module at load time, when setting up the device.

-
read 0 from: 6206
read 0 from: 620e
read 0 from: 6216
read 0 from: 621e
read 0 from: 6226
read 0 from: 622e
read 0 from: 6236
read 0 from: 623e
geode-mfgpt: MFGPT PCI device enabled
geode-mfgpt: 8 timers available.
geode-mfgpt: Registered timer # 0
writting 306 to: 6206
[And then it hangs as if CS5536 is now mad]
-

I tried specifying a timer number, but the behaviour is the same with all of them.

In the code this is all I am doing:

	/* Set up the timer */
	geode_mfgpt_write(wdt_timer, MFGPT_REG_SETUP, GEODEWDT_SCALE | (3 << 8));
	geode_mfgpt_read(wdt_timer, MFGPT_REG_SETUP);

	void geode_mfgpt_write(int i, u16 r, u16 v)
	{
		printk("writting %x to: %lx \n", v, (unsigned long)(mfgpt_iobase + (r + (i * 8))));
		outl(v, (unsigned long)(mfgpt_iobase + (r + (i * 8))));
	}

	u16 geode_mfgpt_read(int i, u16 r)
	{
		u16 val;
		val = inl((unsigned long)mfgpt_iobase + (r + (i * 8)));
		printk("read %x from: %lx\n", val, (unsigned long)(mfgpt_iobase + (r + (i * 8))));
		return val;
	}

Now, while experimenting, I set the Counter Enable bit on the first write and I don't touch the setup register again:

	geode_mfgpt_write(wdt_timer, MFGPT_REG_SETUP, GEODEWDT_SCALE | (3 << 8) | MFGPT_SETUP_CNTEN);

Before calling the above function, I set the reset event and initialized CMP2 with 0x7530. Therefore, on every "geode_ping" to the timer I only re-write 0x0 in the Up Counter register. This works fine, except the reset event seems to get unhooked, as the system never reboots as expected. So I figured it's either that the event is unset or the counter gets disabled. I tried setting the reset event on every ping but that didn't solve the problem. Then I tried setting the Counter Enable bit (MFGPT_SETUP_CNTEN), which as you might've guessed hung the system but, interestingly, the system rebooted after 60 seconds. That got me thinking that it was the Counter Enable bit that gets unset.

Anyhow, that's where I am stuck. The Alix2c0 boards use the AMD Geode LX700. I looked in the databook to see if there are any GPIO registers that can be used as an alternative to program a watchdog timer, but I couldn't find anything usable. And I can't think of anything different to try with the MFGPTs.

Not sure, but does the kernel version make a difference in any of this? I am using 2.4; I have yet to try this on 2.6.

> On 25/01/08 15:50 -0800, Hasan Rashid wrote:
>>
>> Hi,
>>
>> I have been working on a watchdog timer using the mfgpt on AMD Geode
>> CS5536. I initialize the setup register MFGPT0_SETUP (0x6206) with hex
>> value 0x306 (1100000110b). However, after this first initialization if I
>> ever read/write to the register it hangs the system.
>>
>> I have been through all the documentation, tried several different
>> methods but all the efforts, frustratingly, to no avail.
>>
>> Does anyone have any idea as to why would this be? TIA!
>
> It looks like you are using TinyBIOS. Make sure that if you are using
> v0.99 that you do *not* enable the MFGPT workaround. If you are using
> an older version, then you will need this patch:
>
> http://lkml.org/lkml/2008/1/23/372
>
> And enable mfgptfix on the command line. There seems to be a problem with
> the MFGPT "workaround" that causes hangs exactly like you are seeing.
>
> Jordan
>
> --
> Jordan Crouse
> Systems Software Development Engineer
> Advanced Micro Devices, Inc.

--
Regards,
Hasan Rashid
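The port numbers in the register dump above follow directly from the address arithmetic in the posted helpers. The sketch below reproduces that arithmetic in user space so it can be checked. The base value 0x6200 and the setup-register offset 6 are inferred from the dump (0x6206 for timer 0), not taken from a data book, and the helper names are hypothetical. The second helper illustrates one thing that may be worth ruling out, stated here only as an assumption: if the MFGPT registers are 16 bits wide, a 32-bit `outl()`/`inl()` at offset 6 would also touch the first two bytes of the next timer's register block.

```c
#include <assert.h>
#include <stdint.h>

#define MFGPT_MAX_TIMERS   8
#define MFGPT_TIMER_STRIDE 8	/* bytes of I/O space per timer, from "i * 8" */
#define MFGPT_REG_SETUP    6	/* inferred from the dump: 0x6206 - 0x6200 */

/* I/O port of register `reg` of timer `timer`, as computed in the post. */
static uint16_t mfgpt_reg_port(uint16_t iobase, unsigned int timer, uint16_t reg)
{
	assert(timer < MFGPT_MAX_TIMERS);
	return (uint16_t)(iobase + reg + timer * MFGPT_TIMER_STRIDE);
}

/*
 * Nonzero if an access of `width` bytes starting at register offset `reg`
 * runs past the end of this timer's 8-byte register block.
 */
static int mfgpt_access_spills(uint16_t reg, unsigned int width)
{
	return reg + width > MFGPT_TIMER_STRIDE;
}
```

Under these assumptions a 2-byte access to the setup register stays inside the timer's block, while a 4-byte access does not. Whether that has anything to do with the hang is not established by the thread.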
Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()
Jeremy Fitzhardinge wrote:

Now, all of this reminds me of something somewhat messy: if we share the kernel page tables for trampoline page tables, as discussed elsewhere, we HAVE to do a complete, all-tlb-including-global-pages flush after use, since the kernel pages are global and otherwise will stick around. Unlike the permissions pages, there aren't G enable bits on the higher levels, but only for the PTEs themselves.

That wouldn't happen too often though, would it. The identity mapping is only interested in a 1:1 view on RAM, and that's not going to change at all?

Does the TLB cache PAT attributes? Do you need to do a global flush after changing a PTE's PAT bits to make sure that all that PTE's mappings have a consistent view on memory?

You do need to flush *that page* globally, yes.

As far as flushing after using the trampoline pagetables, we're talking about rare, expensive events here like suspend to ram.

-hpa
Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()
H. Peter Anvin wrote: Keir Fraser wrote: On 25/1/08 22:54, "Jeremy Fitzhardinge" <[EMAIL PROTECTED]> wrote: The only possibly relevant comment I can find in vol3a is: Older IA-32 processors that implement the PAE mechanism use uncached accesses when loading page-directory-pointer table entries. This behavior is model specific and not architectural. More recent IA-32 processors may cache page-directory-pointer table entries. Go read the Intel application note "TLBs, Paging-Structure Caches, and Their Invalidation" at http://www.intel.com/design/processor/applnots/317080.pdf Section 8.1 explains about the PDPTR cache in 32-bit PAE mode, which can only be refreshed by appropriate tickling of CR0, CR3 or CR4. It is also important to note that *any* valid page directory entry at *any* level in the page-table hierarchy can become cached at *any* time. Basically TLB lookup is performed as a longest-prefix match on the linear address to skip as many levels in a page-table walk as possible (where a walk is needed, because there is no full-length match on the linear address). So, if you modify a directory entry from present to not-present, or change the page directory that a valid pde points to, you probably need to flush the pde caching structure. One piece of good news is that all pde caches are flushed by any arbitrary INVLPG. Actually, it's trickier than that. The PDPTR, just like the segments, aren't a real cache, and aren't invalidated by INVLPG. This means you can't go from less permissive to more permissive, which is normally permitted in the x86. The PDPTR should really be thought of as an extended cr3 with four entries (this is also how it would be typically implemented in hardware) rather than as a part of the paging structure per se. Yeah, that's basically what 8.1 says. PAE doesn't follow the normal TLB rules for the top level, though they reserve the right to make it behave properly (as it would if you graft a PAE pagetable into a full 64-bit pagetable). 
Now, all of this reminds me of something somewhat messy: if we share the kernel page tables for trampoline page tables, as discussed elsewhere, we HAVE to do a complete, all-tlb-including-global-pages flush after use, since the kernel pages are global and otherwise will stick around. Unlike the permissions pages, there aren't G enable bits on the higher levels, but only for the PTEs themselves.

That wouldn't happen too often though, would it. The identity mapping is only interested in a 1:1 view on RAM, and that's not going to change at all?

Does the TLB cache PAT attributes? Do you need to do a global flush after changing a PTE's PAT bits to make sure that all that PTE's mappings have a consistent view on memory?

J
Re: [RFC] Parallelize IO for e2fsck
On Fri, 2008-01-25 at 04:09 -0700, Andreas Dilger wrote:
> On Jan 24, 2008 17:25 -0700, Zan Lynx wrote:
> > Have y'all been following the /dev/mem_notify patches?
> > http://article.gmane.org/gmane.linux.kernel/628653
>
> Having the notification be via poll() is a very restrictive processing
> model. Having the notification be via a signal means that any kind of
> process (and not just those that are event loop driven) can register
> a callback at some arbitrary point in the code and be notified. I
> don't object to the poll() interface, but it would be good to have a
> signal mechanism also.

The commentary on the mem_notify threads claimed that the signal is easily provided by setting up the file handle for SIGIO.

Yeah. Here it is...copied from email written by KOSAKI Motohiro:

implement FASYNC capability to /dev/mem_notify.

	fd = open("/dev/mem_notify", O_RDONLY);

	fcntl(fd, F_SETOWN, getpid());

	flags = fcntl(fd, F_GETFL);
	fcntl(fd, F_SETFL, flags|FASYNC); /* when low memory, receive SIGIO */

--
Zan Lynx <[EMAIL PROTECTED]>
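The quoted fragment leaves out the signal handler and the error checks. A slightly fuller sketch of the same SIGIO setup is given below. It is written against a generic file descriptor, because /dev/mem_notify comes from a proposed patch set and may not exist on a given kernel; the names `enable_sigio`, `on_sigio` and `sigio_seen` are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler; the main code polls it at a convenient point. */
static volatile sig_atomic_t sigio_seen;

static void on_sigio(int signo)
{
	(void)signo;
	sigio_seen = 1;	/* only async-signal-safe work belongs here */
}

/*
 * Ask the kernel to send SIGIO to this process when `fd` becomes ready.
 * Returns 0 on success, -1 on error with errno set.
 */
static int enable_sigio(int fd)
{
	struct sigaction sa;
	int flags;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigio;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGIO, &sa, NULL) < 0)
		return -1;

	if (fcntl(fd, F_SETOWN, getpid()) < 0)
		return -1;

	flags = fcntl(fd, F_GETFL);
	if (flags < 0)
		return -1;

	return fcntl(fd, F_SETFL, flags | O_ASYNC) < 0 ? -1 : 0;
}
```

`O_ASYNC` is the name Linux documents for the flag spelled `FASYNC` in the quoted code. Keeping the handler down to setting a flag is a deliberate choice: reacting to low memory (freeing caches and so on) is safer from normal code that checks `sigio_seen` than from inside the signal handler.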
[GIT PATCH] SCSI updates for 2.6.24 (part 1)
We have a difficult merge this time; the SCSI tree is split between components that can go now and pieces that are waiting on other trees. Part 1 is the components that can go now ... you'll be getting part 2 towards the end of the merge window. There's misc driver updates, the accessor conversions (preparation for large scatterlists) and tons of other misc updates. There are also some sysfs changes (with Greg's ack) because of the way the dependencies thread through SCSI.

The patch is available here:

master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git

The short changelog is:

Adrian Bunk (4):
      qla2xxx: Code cleanups.
      megaraid: add __devexit annotation
      lpfc: minor cleanups
      53c7xx: fix removal fallout

Alan Cox (1):
      aacraid: fix security weakness

Andi Kleen (1):
      sg: Only print SCSI data direction warning once for a command

Andrew Morton (1):
      sgiwd93: export sgiwd93_reset()

Andrew Vasquez (13):
      qla2xxx: Update version number to 8.02.00-k7.
      qla2xxx: Correct late-memset() of EFT buffer.
      qla2xxx: Add Fibre Channel Event (FCE) tracing support.
      qla2xxx: Trace-Control naming cleanups.
      qla2xxx: Don't schedule the DPC routine to perform an issue-lip request.
      qla2xxx: Restrict MSI/MSI-X enablement on select ISP2432-type HBAs.
      qla2xxx: Wait for FLASH write-protection to complete after a write.
      qla2xxx: Fix for 32-bit platforms with 64-bit resources.
      qla2xxx: Retrieve additional HBA port statistics from recent ISPs.
      qla2xxx: Consolidate duplicate sense-data handling codes.
      qla2xxx: Update version number to 8.02.00-k6.
      qla2xxx: Correct NPIV support for recent ISPs.
      qla2xxx: Don't explicitly read mbx registers while processing a system-err

Boaz Harrosh (26):
      libiscsi,iser: patch for AHS support
      iscsi_tcp, libiscsi: initial AHS Support
      iscsi: Prettify resid handling and some extra checks
      imm: convert to accessors and !use_sg cleanup
      ppa: convert to accessors and !use_sg cleanup
      NCR5380 family: convert to accessors & !use_sg cleanup
      wd7000: proper fix for boards without sg support
      atp870u: convert to accessors and !use_sg cleanup
      scsi_debug: convert to use the data buffer accessors
      isd200: use one-element sg list in issuing commands
      usb: transport - convert to accessors and !use_sg code path removal
      usb: shuttle_usbat - convert to accessors and !use_sg code path removal
      usb: freecom & sddr09 - convert to accessors and !use_sg cleanup
      usb: protocol - convert to accessors and !use_sg code path removal
      seagate: Remove driver
      psi240i: remove driver
      in2000: convert to accessors and !use_sg cleanup
      qlogicpti: convert to accessors and !use_sg cleanup
      wd33c93: convert to accessors and !use_sg cleanup
      fd_mcs: convert to accessors and !use_sg cleanup
      aha1542: convert to accessors and !use_sg cleanup
      a3000: convert to accessors and !use_sg cleanup
      a2091: convert to accessors and !use_sg cleanup
      eata_pio: convert to accessors and !use_sg cleanup
      nsp_cs: convert to data accessors and !use_sg cleanup
      aha152x: Use scsi_eh API for REQUEST_SENSE invocation

Brian King (1):
      ibmvscsi: Set default command timeout

Christof Schmitt (11):
      zfcp: Hold queue lock when checking port/unit handle for task management c
      zfcp: Hold queue lock when checking port/unit handle for FCP command
      zfcp: Hold queue lock when checking port handle for ELS command
      zfcp: Hold queue lock when checking port/unit handle for abort command
      zfcp: Fix evaluation of port handles in abort handler
      zfcp: Reduce flood on hba trace
      zfcp: Fix deadlock when adding invalid LUN
      zfcp: Remove SCSI devices when removing complete adapter
      zfcp: Specify waiting times in ERP in seconds
      zfcp: Use also port and adapter to identify unit in messages.
      zfcp: Remove unnecessary eh_bus_reset_handler callback

Christoph Hellwig (1):
      aacraid: don't assign cpu_to_le32(int) to u8

Darrick J. Wong (2):
      libsas: Fix various sparse complaints
      libsas: Convert sas_proto users to sas_protocol

Denis Cheng (1):
      ipr: use LIST_HEAD instead of LIST_HEAD_INIT

Erez Zilber (1):
      IB/iSER: add logical unit reset support

FUJITA Tomonori (13):
      ch: remove forward declarations
      ch: fix device minor number management bug
      ch: handle class_device_create failure properly
      use dynamically allocated sense buffer
      sg: handle class_device_create failure properly
      sg: set class_data after success
      replace sizeof sense_buffer with SCSI_SENSE_BUFFERSIZE
      aic7xxx_old, eata_pio, ips, libsas: don't zero out sense_buffer in queueco
      libsas: fix sense_buffer overrun
      fix scsi_setup_command_freelist failure path race
      mpt fusion: make mptsas_smp_handler update resid
      iscsi_tcp: update
[PATCH] 2.4: Back-port of pl2303.c from 2.6.23.14
I experienced major major data loss on a PL-2303 USB-serial converter under 2.4.36, which I remedied by back-porting the pl2303.c from the latest 2.6 kernel tree. --- diff -u linux-2.4.36/drivers/usb/serial/pl2303.c.orig linux-2.4.36/drivers/usb/serial/pl2303.c --- pl2303.c.orig 2008-01-01 22:36:40.0 +1030 +++ pl2303.c2008-01-26 05:32:00.0 +1030 @@ -1,17 +1,20 @@ /* * Prolific PL2303 USB to serial adaptor driver * - * Copyright (C) 2001-2003 Greg Kroah-Hartman ([EMAIL PROTECTED]) + * Copyright (C) 2001-2007 Greg Kroah-Hartman ([EMAIL PROTECTED]) * Copyright (C) 2003 IBM Corp. * * Original driver for 2.2.x by anonymous * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version + * 2 as published by the Free Software Foundation. * * See Documentation/usb/usb-serial.txt for more information on using this driver + * 2007_Jan_25 dn + * Back-port pl2303.c from linux-2.6.23.14 - corrects major loss of + * transmitted data, plus minor loss during close. [EMAIL PROTECTED] + * * 2003_Apr_24 gkh * Added line error reporting support. Hopefully it is correct... * @@ -33,6 +36,9 @@ * */ +/* TODO first char sent is lost on second open of device. anecdotal evidence + * TODO suggests this might be on all even opens of device. dn. 
*/ + #include #include #include @@ -59,31 +65,60 @@ /* * Version Information */ -#define DRIVER_VERSION "v0.10.1" /* Takes from 2.6's */ #define DRIVER_DESC "Prolific PL2303 USB to serial adaptor driver" +#define PL2303_CLOSING_WAIT(30*HZ) +#define PL2303_BUF_SIZE1024 +#define PL2303_TMP_BUF_SIZE1024 + +struct pl2303_buf { + unsigned intbuf_size; + char*buf_buf; + char*buf_get; + char*buf_put; +}; static struct usb_device_id id_table [] = { { USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID) }, { USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_RSAQ2) }, + { USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_DCU11) }, + { USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_RSAQ3) }, + { USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_PHAROS) }, { USB_DEVICE(IODATA_VENDOR_ID, IODATA_PRODUCT_ID) }, { USB_DEVICE(ATEN_VENDOR_ID, ATEN_PRODUCT_ID) }, { USB_DEVICE(ATEN_VENDOR_ID2, ATEN_PRODUCT_ID) }, { USB_DEVICE(ELCOM_VENDOR_ID, ELCOM_PRODUCT_ID) }, + { USB_DEVICE(ELCOM_VENDOR_ID, ELCOM_PRODUCT_ID_UCSGT) }, { USB_DEVICE(ITEGNO_VENDOR_ID, ITEGNO_PRODUCT_ID) }, + { USB_DEVICE(ITEGNO_VENDOR_ID, ITEGNO_PRODUCT_ID_2080) }, { USB_DEVICE(MA620_VENDOR_ID, MA620_PRODUCT_ID) }, { USB_DEVICE(RATOC_VENDOR_ID, RATOC_PRODUCT_ID) }, { USB_DEVICE(TRIPP_VENDOR_ID, TRIPP_PRODUCT_ID) }, { USB_DEVICE(RADIOSHACK_VENDOR_ID, RADIOSHACK_PRODUCT_ID) }, { USB_DEVICE(DCU10_VENDOR_ID, DCU10_PRODUCT_ID) }, { USB_DEVICE(SITECOM_VENDOR_ID, SITECOM_PRODUCT_ID) }, + { USB_DEVICE(ALCATEL_VENDOR_ID, ALCATEL_PRODUCT_ID) }, + { USB_DEVICE(SAMSUNG_VENDOR_ID, SAMSUNG_PRODUCT_ID) }, + { USB_DEVICE(SIEMENS_VENDOR_ID, SIEMENS_PRODUCT_ID_SX1) }, + { USB_DEVICE(SIEMENS_VENDOR_ID, SIEMENS_PRODUCT_ID_X65) }, + { USB_DEVICE(SIEMENS_VENDOR_ID, SIEMENS_PRODUCT_ID_X75) }, + { USB_DEVICE(SYNTECH_VENDOR_ID, SYNTECH_PRODUCT_ID) }, + { USB_DEVICE(NOKIA_CA42_VENDOR_ID, NOKIA_CA42_PRODUCT_ID) }, + { USB_DEVICE(CA_42_CA42_VENDOR_ID, CA_42_CA42_PRODUCT_ID) }, + { USB_DEVICE(SAGEM_VENDOR_ID, SAGEM_PRODUCT_ID) }, + { 
USB_DEVICE(LEADTEK_VENDOR_ID, LEADTEK_9531_PRODUCT_ID) }, + { USB_DEVICE(SPEEDDRAGON_VENDOR_ID, SPEEDDRAGON_PRODUCT_ID) }, + { USB_DEVICE(DATAPILOT_U2_VENDOR_ID, DATAPILOT_U2_PRODUCT_ID) }, + { USB_DEVICE(BELKIN_VENDOR_ID, BELKIN_PRODUCT_ID) }, + { USB_DEVICE(ALCOR_VENDOR_ID, ALCOR_PRODUCT_ID) }, + { USB_DEVICE(HUAWEI_VENDOR_ID, HUAWEI_PRODUCT_ID) }, + { USB_DEVICE(WS002IN_VENDOR_ID, WS002IN_PRODUCT_ID) }, { } /* Terminating entry */ }; MODULE_DEVICE_TABLE (usb, id_table); - #define SET_LINE_REQUEST_TYPE 0x21 #define SET_LINE_REQUEST 0x20 @@ -164,6 +199,8 @@ struct pl2303_private { spinlock_t lock; + struct pl2303_buf *buf; + int write_urb_in_use; wait_queue_head_t delta_msr_wait; u8 line_control; u8 line_status; @@ -171,6 +208,175 @@ enum pl2303_type type; }; +/* + * pl2303_buf_alloc + * + * Allocate a circular buffer and all associated memory. + */ +static struc
using LKML for subsystem development (was Re: Linux 2.6.24)
(I already deleted the posting I'm going to reply to, therefore References and In-Reply-To are wrong. Sorry.) On 2008-01-25, Ingo Molnar wrote in http://lkml.org/lkml/2008/1/25/320: > * Giacomo A. Catenazzi <[EMAIL PROTECTED]> wrote: >> As a tester, I'm not so happy. >> The last few merge windows were a nightmare for us (the tester). >> It remember me the 2.1.x times, but with few differences: >> - more changes, so bugs are unnoticed/ignored in the first weeks or >> - or people are pushing more patches possible, so they delay >> bug corrections to later times (after merge windows). > > i think this heavily varies per subsystem. > > v2.6.24-rc was pretty bad due to the sglist design bug that crept in and > that kept most of the IO hackers busy for a few weeks, while testsystems > kept crashing and no progress was made on _other_ bugs. v2.6.24 early > rc's were also marred by half-cooked networking patches messing up > bisectability. I've seen a number of testers give up on that alone. > There was an unusually high flux of networking fixes throughout v2.6.24, > up to the very last day before the release. > > Since it's Friday already, i put the blame for that on all the > subsystems that do not develop on lkml! :-) > > It is _very_ hard for us to judge the stability and sanity of a > subsystem (and the risk factor of upcoming features!) if it's not > developed on lkml. Observing the bugs alone helps in getting a picture, > but it does not help the testers of early -rc's: [...] > there's way too much 'surprise factor' > in the git merges and all the hidden development that is not directly > visible on lkml. The 'surprise factor' is not even come mainly from > combining all the trees together (that is relatively easy), it is in the > cumulative risk factor that is hard to get right due to development not > always being done on lkml. 
>
> Case point from arch/x86: everyone who follows lkml could have predicted
> it from the PAT development discussions that PAT is simply not ready
> yet. We deferred it to v2.6.26,

The remedy can't just be to Cc: LKML all the time. This would shift the burden of directing the "general public's" attention from the domain experts to the general public. How will subscribers of LKML decide which discussion threads in the huge amount of traffic are worth glancing at? Each of us has only a limited amount of time for LKML consumption. Even if you only look at the Subject: and number of postings in a thread, how to judge whether there is a stability risk for the next -rc in the making, without experience or personal interest in the domain?

> but had we tried to cram it into v2.6.25
> and had it broken boxes left and right, we'd rightfully be confronted
> with all the existing lkml track record that suggested bad PAT related
> problems and predicted the outcome. For subsystems that do not develop
> on lkml, no such lkml track record exists and the danger of introducing
> bad patches and ruining early -rc's increases.

Having a track record in list archives doesn't prevent bugs from happening, at least not directly. It might help to clarify who's responsible, if the changelog doesn't tell us already, and thus might have a positive long term effect on quality. (I work in an industry where it is often hard to identify responsibilities, which IMO contributes to chronic quality issues in that industry.)

Anyhow, I will try to remember to add a list archive pointer into my future "what's in abc123-2.6.git" messages, so that those who care can browse over the topics and threads to get at least a superficial impression of what went on on the development list behind LKML's back.
(I usually also add Cc: LKML to discussions when I get the feeling that the expertise and judgment on the development list might not be sufficient during a respective stage of development --- but of course my judgment of when to involve LKML isn't objective and perfect. That is, I /don't/ claim this to be the best way to handle subsystem development discussions.) -- Stefan Richter -=-==--- ---= ==-=- http://arcgraph.de/sr/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [GIT PATCH] driver core patches against 2.6.24
On Sat, 2008-01-26 at 01:27 +0100, Peter Zijlstra wrote: > On Sat, 2008-01-26 at 01:05 +0100, Ingo Molnar wrote: > > * Peter Zijlstra <[EMAIL PROTECTED]> wrote: > > > > > My wish is that distros would just boot without requiring an initrd. I > > > know how to make them for redhat and debian based distros, but the > > > fact that you can't (easily) cross-build them makes it a very tedious > > > construct. > > > > all it takes for me on Fedora is to boot a modular distro kernel once, > > then copy the /dev to the real (persistent) /dev: > > > >mkdir /tmp2 > >mount /dev/sda1 /tmp2 > >cp -a /dev/* /tmp2/dev/ > > > > and from that point on a bzImage/vmlinuz can boot up on Fedora without > > any problems (as long as it has the right drivers built in), and the > > initrd line can be removed from grub.conf. > > Yeah, I usually do the same but with a bind mount, still it would be > grand if such things would not be needed. Agreed. But it's not likely to be a priority - all the vendors want completely modular kernels. But now we see what Linus wants to do, perhaps we can try to be a bit more friendly toward that. It's not actually rocket science, after all. I was concerned that he wanted to use the modules in the initrd, but now I see Linus, and everyone else, just want to do what I also secretly do, and just not use an initrd. Isn't it funny. We all secretly hate using initrds ourselves :) Jon. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Unpredictable performance
On Saturday 26 January 2008 02:03, Asbjørn Sannes wrote:
> Asbjørn Sannes wrote:
> > Nick Piggin wrote:
> >> On Friday 25 January 2008 22:32, Asbjorn Sannes wrote:
> >>> Hi,
> >>>
> >>> I am experiencing unpredictable results with the following test
> >>> without other processes running (exception is udev, I believe):
> >>> cd /usr/src/test
> >>> tar -jxf ../linux-2.6.22.12
> >>> cp ../working-config linux-2.6.22.12/.config
> >>> cd linux-2.6.22.12
> >>> make oldconfig
> >>> time make -j3 > /dev/null # This is what I note down as a "test" result
> >>> cd /usr/src ; umount /usr/src/test ; mkfs.ext3 /dev/cc/test
> >>> and then reboot
> >>>
> >>> The kernel is booted with the parameter mem=8192
> >>>
> >>> For 2.6.23.14 the results vary from (real time) 33m30.551s to
> >>> 45m32.703s (30 runs)
> >>> For 2.6.23.14 with nop i/o scheduler from 29m8.827s to 55m36.744s (24
> >>> runs) For 2.6.22.14 also varied a lot.. but, lost results :(
> >>> For 2.6.20.21 only vary from 34m32.054s to 38m1.928s (10 runs)
> >>>
> >>> Any idea of what can cause this? I have tried to make the runs as equal
> >>> as possible, rebooting between each run.. i/o scheduler is cfq as
> >>> default.
> >>>
> >>> sys and user time only varies a couple of seconds.. and the order of
> >>> when it is "fast" and when it is "slow" is completely random, but it
> >>> seems that the results are mostly concentrated around the mean.
> >>
> >> Hmm, lots of things could cause it. With such big variations in
> >> elapsed time, and small variations on CPU time, I guess the fs/IO
> >> layers are the prime suspects, although it could also involve the
> >> VM if you are doing a fair amount of page reclaim.
> >>
> >> Can you boot with enough memory such that it never enters page
> >> reclaim? `grep scan /proc/vmstat` to check.
> >>
> >> Otherwise you could mount the working directory as tmpfs to
> >> eliminate IO.
> >>
> >> bisecting it down to a single patch would be really helpful if you
> >> can spare the time.
> >
> > I'm going to run some tests without limiting the memory to 80 megabytes
> > (so that it is 2 gigabyte) and see how much it varies then, but if I
> > recall correctly it did not vary much. I'll reply to this e-mail with
> > the results.
>
> 5 runs gives me:
> real    5m58.626s
> real    5m57.280s
> real    5m56.584s
> real    5m57.565s
> real    5m56.613s
>
> Should I test with tmpfs as well?

I wouldn't worry about it. It seems like it might be due to page reclaim (fs / IO can't be ruled out completely though).

Hmm, I haven't been following reclaim so closely lately; you say it started going bad around 2.6.22? It may be lumpy reclaim patches?
Re: [PATCH] x86_32: trim memory by updating e820 v2
On Fri, 25 Jan 2008, Yinghai Lu wrote:

> On Jan 25, 2008 4:01 PM, Justin Piszcz <[EMAIL PROTECTED]> wrote:
> >
> > ...
> > Tried it, it worked successfully!
> >
> > With stock kernel, previous way I had to use it was mem=8832M and top
> > showed this:
> >
> > top - 18:53:52 up 1 min, 2 users, load average: 1.03, 0.30, 0.10
> > Tasks: 169 total, 1 running, 168 sleeping, 0 stopped, 0 zombie
> > Cpu(s): 6.1%us, 2.6%sy, 4.5%ni, 81.3%id, 5.5%wa, 0.0%hi, 0.0%si, 0.0%st
> > Mem:   8039464k total,  1288948k used,  6750516k free,     3640k buffers
> > Swap: 16787768k total,        0k used, 16787768k free,   178528k cached
> >
> > With kernel you mentioned and use e820 v3:
> >
> > top - 18:48:13 up 3 min, 6 users, load average: 1.67, 0.68, 0.25
> > Tasks: 195 total, 2 running, 193 sleeping, 0 stopped, 0 zombie
> > Cpu(s): 18.5%us, 1.2%sy, 1.6%ni, 74.8%id, 3.9%wa, 0.0%hi, 0.0%si, 0.0%st
> > Mem:   8037668k total,  1438732k used,  6598936k free,     6844k buffers
> > Swap: 16787768k total,        0k used, 16787768k free,   273928k cached
> >
> > No append mem= required.
>
> thanks
>
> any chance to try 32 bit with higemem64 option?
>
> YH

My distribution is set up for 64-bit (64bit-clean) only; I do not have a 32-bit userland, so I cannot help here, sorry.

Justin.
Q: x86 suspend/hibernation code consolidation
Hi,

I'd like to move the 64-bit suspend/hibernation files from arch/x86/kernel to arch/x86/power, modify the names of the 32-bit files already in arch/x86/power and update the Makefiles accordingly, but there are some changes queued for merging that touch the files in question.

When is the right time for making changes like that?

Rafael
Re: [GIT PATCH] driver core patches against 2.6.24
On Sat, 2008-01-26 at 01:05 +0100, Ingo Molnar wrote:
> * Peter Zijlstra <[EMAIL PROTECTED]> wrote:
>
> > My wish is that distros would just boot without requiring an initrd. I
> > know how to make them for redhat and debian based distros, but the
> > fact that you can't (easily) cross-build them makes it a very tedious
> > construct.
>
> all it takes for me on Fedora is to boot a modular distro kernel once,
> then copy the /dev to the real (persistent) /dev:
>
>    mkdir /tmp2
>    mount /dev/sda1 /tmp2
>    cp -a /dev/* /tmp2/dev/
>
> and from that point on a bzImage/vmlinuz can boot up on Fedora without
> any problems (as long as it has the right drivers built in), and the
> initrd line can be removed from grub.conf.

Yeah, I usually do the same but with a bind mount, still it would be grand if such things would not be needed.
Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()
Ingo Molnar wrote:
> * Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:
>
>> Is there any guide about the tradeoff of when to use invlpg vs
>> flushing the whole tlb? 1 page? 10? 90% of the tlb?
>
> i made measurements some time ago and INVLPG was quite uniformly slow on
> all important CPU types - on the order of 100+ cycles. It's probably
> microcode. With a cr3 flush being on the order of 200-300 cycles (plus
> any add-on TLB miss costs - but those are amortized quite well as long
> as the pagetables are well cached - which they usually are on today's
> 2MB-ish L2 caches), the high cost of INVLPG rarely makes it worthwhile
> for anything more than a few pages.
>
> so INVLPG makes sense for pagetable fault related single-address
> flushes, but they rarely make sense for range flushes. (and that's how
> Linux uses it)

Incidentally, as far as I can tell, the main reason INVLPG is so slow is its painful behaviour with regard to large pages which may have been split by hardware.

-hpa
Re: CS5536 mfgpt timer setup register hangs board
This is what TinyBios posts:

---
PC Engines ALIX.2 v0.99
640 KB Base Memory
130048 KB Extended Memory

01F0 Master 848A CF 128MB
Phys C/H/S 1002/8/32 Log C/H/S 1002/8/32

BIOS setup:

(9) 9600 baud (2) 19200 baud *3* 38400 baud (5) 57600 baud (1) 115200 baud
*C* CHS mode (L) LBA mode (W) HDD wait (V) HDD slave (U) UDMA enable
(M) MFGPT workaround
(P) late PCI init
*R* Serial console enable
(E) PXE boot enable
(X) Xmodem upload
(Q) Quit

The MFGPT workaround is disabled, and I deleted the workaround code that came with voyage linux's distribution. Anyhow I will try the fix and see if that solves my problem.

> On 25/01/08 15:50 -0800, Hasan Rashid wrote:
>>
>> Hi,
>>
>> I have been working on a watchdog timer using the mfgpt on AMD Geode
>> CS5536. I initialize the setup register MFGPT0_SETUP (0x6206) with hex
>> value 0x306 (110110b). However, after this first initialization if I
>> ever read/write to the register it hangs the system.
>>
>> I have been through all the documentation, tried several different
>> methods
>> but all the efforts, frustratingly, to no avail.
>>
>> Does anyone have any idea as to why would this be? TIA!
>
> It looks like you are using TinyBIOS. Make sure that if you are using
> v0.99
> that you do *not* enable the MFGPT workaround. If you are using an older
> version, then you will need this patch:
>
> http://lkml.org/lkml/2008/1/23/372
>
> And enable mfgptfix on the command line. There seems to be a problem with
> the MFGPT "workaround" that causes hangs exactly like you are seeing.
>
> Jordan
>
> --
> Jordan Crouse
> Systems Software Development Engineer
> Advanced Micro Devices, Inc.

--
Regards,
Hasan Rashid
Re: [PATCH] x86_32: trim memory by updating e820 v2
On Jan 25, 2008 4:01 PM, Justin Piszcz <[EMAIL PROTECTED]> wrote:
>
> ...
> Tried it, it worked successfully!
>
> With stock kernel, previous way I had to use it was mem=8832M and top
> showed this:
>
> top - 18:53:52 up 1 min, 2 users, load average: 1.03, 0.30, 0.10
> Tasks: 169 total, 1 running, 168 sleeping, 0 stopped, 0 zombie
> Cpu(s): 6.1%us, 2.6%sy, 4.5%ni, 81.3%id, 5.5%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem:   8039464k total,  1288948k used,  6750516k free,     3640k buffers
> Swap: 16787768k total,        0k used, 16787768k free,   178528k cached
>
> With kernel you mentioned and use e820 v3:
>
> top - 18:48:13 up 3 min, 6 users, load average: 1.67, 0.68, 0.25
> Tasks: 195 total, 2 running, 193 sleeping, 0 stopped, 0 zombie
> Cpu(s): 18.5%us, 1.2%sy, 1.6%ni, 74.8%id, 3.9%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem:   8037668k total,  1438732k used,  6598936k free,     6844k buffers
> Swap: 16787768k total,        0k used, 16787768k free,   273928k cached
>
> No append mem= required.
>

thanks

any chance to try 32 bit with higemem64 option?

YH
Re: [PATCH 158/196] Driver core: convert block from raw kobjects
Fix build with CONFIG_BLOCK off.

Building git-2d94dfc with CONFIG_BLOCK turned off gives me:

drivers/base/core.c: In function 'device_add_class_symlinks':
drivers/base/core.c:704: error: 'part_type' undeclared (first use in this function)
drivers/base/core.c:704: error: (Each undeclared identifier is reported only once
drivers/base/core.c:704: error: for each function it appears in.)
drivers/base/core.c: In function 'device_remove_class_symlinks':
drivers/base/core.c:743: error: 'part_type' undeclared (first use in this function)

git-blame points to Kay Sievers. The problem is obvious. I think the solution is too ;).

Tested with a silly configuration that contains just enough wits to boot and get to the prompt of klibc-dash on the built-in initramfs using:

qemu -m 8 -cpu pentium -serial stdio -cdrom arch/x86/boot/image.iso

Compile-tested i386-defconfig.

Signed-off-by: Alexander van Heukelum <[EMAIL PROTECTED]>

Oh, and the compile-problem still exists in git-99f1c97. The git-tree is changing faster than I can test the patch and write an e-mail :-/.

diff --git a/drivers/base/core.c b/drivers/base/core.c
index edf3bbe..3751843 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -75,6 +75,15 @@ static ssize_t dev_attr_store(struct kobject *kobj, struct attribute *attr,
 	return ret;
 }
 
+static int dev_needs_link(struct device *dev)
+{
+#ifdef CONFIG_BLOCK
+	return dev->type != &part_type;
+#else
+	return 1;
+#endif
+}
+
 static struct sysfs_ops dev_sysfs_ops = {
 	.show	= dev_attr_show,
 	.store	= dev_attr_store,
@@ -652,14 +661,14 @@ static int device_add_class_symlinks(struct device *dev)
 #ifdef CONFIG_SYSFS_DEPRECATED
 	/* stacked class devices need a symlink in the class directory */
 	if (dev->kobj.parent != &dev->class->subsys.kobj &&
-	    dev->type != &part_type) {
+	    dev_needs_link(dev)) {
 		error = sysfs_create_link(&dev->class->subsys.kobj, &dev->kobj,
 					  dev->bus_id);
 		if (error)
 			goto out_subsys;
 	}
 
-	if (dev->parent && dev->type != &part_type) {
+	if (dev->parent && dev_needs_link(dev)) {
 		struct device *parent = dev->parent;
 		char *class_name;
 
@@ -688,11 +697,11 @@ static int device_add_class_symlinks(struct device *dev)
 	return 0;
 
 out_device:
-	if (dev->parent && dev->type != &part_type)
+	if (dev->parent && dev_needs_link(dev))
 		sysfs_remove_link(&dev->kobj, "device");
 out_busid:
 	if (dev->kobj.parent != &dev->class->subsys.kobj &&
-	    dev->type != &part_type)
+	    dev_needs_link(dev))
 		sysfs_remove_link(&dev->class->subsys.kobj, dev->bus_id);
 #else
 	/* link in the class directory pointing to the device */
@@ -701,7 +710,7 @@ out_busid:
 	if (error)
 		goto out_subsys;
 
-	if (dev->parent && dev->type != &part_type) {
+	if (dev->parent && dev_needs_link(dev)) {
 		error = sysfs_create_link(&dev->kobj, &dev->parent->kobj,
 					  "device");
 		if (error)
@@ -725,7 +734,7 @@ static void device_remove_class_symlinks(struct device *dev)
 		return;
 
 #ifdef CONFIG_SYSFS_DEPRECATED
-	if (dev->parent && dev->type != &part_type) {
+	if (dev->parent && dev_needs_link(dev)) {
 		char *class_name;
 
 		class_name = make_class_name(dev->class->name, &dev->kobj);
@@ -737,10 +746,10 @@ static void device_remove_class_symlinks(struct device *dev)
 	}
 
 	if (dev->kobj.parent != &dev->class->subsys.kobj &&
-	    dev->type != &part_type)
+	    dev_needs_link(dev))
 		sysfs_remove_link(&dev->class->subsys.kobj, dev->bus_id);
 #else
-	if (dev->parent && dev->type != &part_type)
+	if (dev->parent && dev_needs_link(dev))
 		sysfs_remove_link(&dev->kobj, "device");
 
 	sysfs_remove_link(&dev->class->subsys.kobj, dev->bus_id);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
nfs server patches not in 2.6.25
Just some idea what we might be working on for 2.6.26, besides continued bug-fixing and cleanup:

Work that we already have patches for and that I expect to be included in whole or in part in 2.6.26:

	- ipv6: Aurélien Charbon's patch to add ipv6 support to the server's export interface is ready. I'm not clear what else remains for full ipv6 support.
	- failover and migration: Wendy Cheng's patches appear to be in good shape, so I expect them or something with equivalent functionality to be in 2.6.26.
	- gss callbacks: We have patches to add support for rpcsec_gss on NFSv4's callback channel (allowing us to support delegations on kerberos mounts), but they've been put on hold pending improvements to the client's gssd upcall. I hope to get back to that work in the next few weeks.

Also in progress:

	- spkm3 and future gss mechanisms may generate context initiation rpc's that are very large. Olga Kornievskaia and I have been working on fixing the server gssd interfaces to permit this.
	- There are some mismatches between the semantics required for nfsv4 delegations and what Linux's lease subsystem provides us. David Richter and I have done a little work on this. We need to start submitting it.

Three items I identified previously as issues I'd like fixed before we removed the dependency of CONFIG_NFSD_V4 on CONFIG_EXPERIMENTAL:

	http://linux-nfs.org/pipermail/nfsv4/2006-December/005497.html

	- export paths consistent between v2/v3/v4: We have some code that fixes this entirely in userspace. That approach doesn't provide stable filehandles in the NFSv4 pseudofilesystem, and there seems to be a general sentiment that it's overly complicated. It has the one advantage that we don't have to commit to it, since it uses only existing kernel interfaces. So I think we're probably going to apply that to nfs-utils as a stopgap measure and start work on fixing this in the kernel at the same time.
	- reboot recovery: there have been complaints about the server-side nfsv4 reboot recovery code for a while, we've had code that tries to fix it for a while, and it just hasn't happened. I'm hoping we can finally get this ready for 2.6.26.
	- export security: this was finished in 2.6.23; we now support export options like sec=krb5:krb5i:krb5p, which have a few advantages over the special gss/krb5 client names. This could be better documented, though.

I've probably left a lot out. Let me know of ongoing projects and todo's that I've forgotten.

--b.
Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()
Keir Fraser wrote:
> On 25/1/08 22:54, "Jeremy Fitzhardinge" <[EMAIL PROTECTED]> wrote:
>
>> The only possibly relevant comment I can find in vol3a is:
>>
>>     Older IA-32 processors that implement the PAE mechanism use
>>     uncached accesses when loading page-directory-pointer table
>>     entries. This behavior is model specific and not architectural.
>>     More recent IA-32 processors may cache page-directory-pointer
>>     table entries.
>
> Go read the Intel application note "TLBs, Paging-Structure Caches, and
> Their Invalidation" at
> http://www.intel.com/design/processor/applnots/317080.pdf
>
> Section 8.1 explains about the PDPTR cache in 32-bit PAE mode, which can
> only be refreshed by appropriate tickling of CR0, CR3 or CR4.
>
> It is also important to note that *any* valid page directory entry at
> *any* level in the page-table hierarchy can become cached at *any* time.
> Basically TLB lookup is performed as a longest-prefix match on the linear
> address to skip as many levels in a page-table walk as possible (where a
> walk is needed, because there is no full-length match on the linear
> address). So, if you modify a directory entry from present to
> not-present, or change the page directory that a valid pde points to, you
> probably need to flush the pde caching structure. One piece of good news
> is that all pde caches are flushed by any arbitrary INVLPG.

Actually, it's trickier than that.

The PDPTR, just like the segments, aren't a real cache, and aren't invalidated by INVLPG. This means you can't go from less permissive to more permissive, which is normally permitted in the x86. The PDPTR should really be thought of as an extended cr3 with four entries (this is also how it would be typically implemented in hardware) rather than as a part of the paging structure per se.

We do NOT want to frob %cr4 unless we actually need to clear all the global pages.

The stuff in chapter 10 sounds like they're flagging for a revised INVLPG instruction or mode which would fix some of the extremely serious defects in INVLPG that were introduced by haphazard semantics from the P5 and early P6 days.

In general, we should only assume that INVLPG invalidates the hierarchy immediately above it, and not rely on any side effects. That's basically sane design anyway.

Now, all of this reminds me of something somewhat messy: if we share the kernel page tables for trampoline page tables, as discussed elsewhere, we HAVE to do a complete, all-tlb-including-global-pages flush after use, since the kernel pages are global and otherwise will stick around. Unlike the permissions pages, there aren't G enable bits on the higher levels, but only for the PTEs themselves.

-hpa
Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()
* Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> Is there any guide about the tradeoff of when to use invlpg vs
> flushing the whole tlb? 1 page? 10? 90% of the tlb?

i made measurements some time ago and INVLPG was quite uniformly slow on all important CPU types - on the order of 100+ cycles. It's probably microcode. With a cr3 flush being on the order of 200-300 cycles (plus any add-on TLB miss costs - but those are amortized quite well as long as the pagetables are well cached - which they usually are on today's 2MB-ish L2 caches), the high cost of INVLPG rarely makes it worthwhile for anything more than a few pages.

so INVLPG makes sense for pagetable fault related single-address flushes, but they rarely make sense for range flushes. (and that's how Linux uses it)

	Ingo
Re: [GIT PATCH] driver core patches against 2.6.24
* Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> My wish is that distros would just boot without requiring an initrd. I
> know how to make them for redhat and debian based distros, but the
> fact that you can't (easily) cross-build them makes it a very tedious
> construct.

all it takes for me on Fedora is to boot a modular distro kernel once, then copy the /dev to the real (persistent) /dev:

   mkdir /tmp2
   mount /dev/sda1 /tmp2
   cp -a /dev/* /tmp2/dev/

and from that point on a bzImage/vmlinuz can boot up on Fedora without any problems (as long as it has the right drivers built in), and the initrd line can be removed from grub.conf.

	Ingo
Re: CS5536 mfgpt timer setup register hangs board
On 25/01/08 15:50 -0800, Hasan Rashid wrote:
>
> Hi,
>
> I have been working on a watchdog timer using the mfgpt on AMD Geode
> CS5536. I initialize the setup register MFGPT0_SETUP (0x6206) with hex
> value 0x306 (110110b). However, after this first initialization if I
> ever read/write to the register it hangs the system.
>
> I have been through all the documentation, tried several different methods
> but all the efforts, frustratingly, to no avail.
>
> Does anyone have any idea as to why would this be? TIA!

It looks like you are using TinyBIOS. Make sure that if you are using v0.99 that you do *not* enable the MFGPT workaround. If you are using an older version, then you will need this patch:

http://lkml.org/lkml/2008/1/23/372

And enable mfgptfix on the command line. There seems to be a problem with the MFGPT "workaround" that causes hangs exactly like you are seeing.

Jordan

--
Jordan Crouse
Systems Software Development Engineer
Advanced Micro Devices, Inc.
Re: [PATCH] x86_32: trim memory by updating e820 v2
On Tue, 22 Jan 2008, Yinghai Lu wrote:

> On Monday 21 January 2008 01:37:09 pm Justin Piszcz wrote:
>> On Mon, 21 Jan 2008, Yinghai Lu wrote:
>>> On Monday 21 January 2008 11:14:02 am Justin Piszcz wrote:
>>>
>>> please get x86.git
>>>
>>> git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
>>> cd linux-2.6
>>>
>>> #--{ x86.git instructions }-->
>>> # Add Linus's tree as a remote
>>> git remote add linus git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
>>>
>>> # Add Ingo's tree as a remote
>>> git remote add x86 git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git
>>>
>>> # With that setup, just run the following to get any changes you
>>> # don't have. It will also notice any new branches Ingo/Linus
>>> # add to their repo. Look in .git/config afterwards, the format
>>> # to add new remotes is easy to figure out.
>>> git remote update
>>> #-
>>>
>>> git merge x86/master
>>> git merge x86/mm
>>>
>>> and apply
>>> [PATCH] x86_64: check if Tom2 is enabled
>>> http://lkml.org/lkml/2008/1/21/20
>>> [PATCH] x86_64: update e820 instead of updating end_pfn v3
>>> http://lkml.org/lkml/2008/1/21/19
>>> [PATCH] x86_32: trim memory by updating e820 v2
>>> http://lkml.org/lkml/2008/1/21/18
>>>
>>> YH
>>
>> Thanks, I am all patched up and ready to test, unfortunately one of my
>> disks in my RAID 1 just died, I already filled out the advanced
>> replacement form, I will test when I receive the replacement disk.
>
> please get x86.git and apply
> [PATCH] x86_32: trim memory by updating e820 v3
> http://lkml.org/lkml/2008/1/22/394
>
> Ingo already put other two into the tree.
>
> Thanks
>
> YH

Tried it, it worked successfully!

With stock kernel, previous way I had to use it was mem=8832M and top showed this:

top - 18:53:52 up 1 min, 2 users, load average: 1.03, 0.30, 0.10
Tasks: 169 total, 1 running, 168 sleeping, 0 stopped, 0 zombie
Cpu(s): 6.1%us, 2.6%sy, 4.5%ni, 81.3%id, 5.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:   8039464k total,  1288948k used,  6750516k free,     3640k buffers
Swap: 16787768k total,        0k used, 16787768k free,   178528k cached

With kernel you mentioned and use e820 v3:

top - 18:48:13 up 3 min, 6 users, load average: 1.67, 0.68, 0.25
Tasks: 195 total, 2 running, 193 sleeping, 0 stopped, 0 zombie
Cpu(s): 18.5%us, 1.2%sy, 1.6%ni, 74.8%id, 3.9%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:   8037668k total,  1438732k used,  6598936k free,     6844k buffers
Swap: 16787768k total,        0k used, 16787768k free,   273928k cached

No append mem= required.

A full dmesg is attached so you can analyze the e820/MTRR mapping.

File: dmesg-e820v3patch.txt.bz2

Justin.
Re: [build bug] ./drivers/crypto/hifn_795x.c
* Herbert Xu <[EMAIL PROTECTED]> wrote:

> On Sat, Jan 26, 2008 at 12:51:31AM +0100, Ingo Molnar wrote:
> >
> > find a workaround below - but i'm not sure it's the right one.
>
> Thanks, but I've already checked in a fix :)

hey, that's my punishment for not reading my email promptly :) Could have saved me some time in the Kconfig web of dependencies :-/

	Ingo
Re: [build bug] ./drivers/crypto/hifn_795x.c
On Sat, Jan 26, 2008 at 12:51:31AM +0100, Ingo Molnar wrote:
>
> find a workaround below - but i'm not sure it's the right one.

Thanks, but I've already checked in a fix :)
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [build bug] ./drivers/crypto/hifn_795x.c
* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> randconfig testing found this (post-v2.6.24) build bug:
>
>  drivers/built-in.o: In function `hifn_unregister_rng':
>  hifn_795x.c:(.text+0x17bbd9): undefined reference to `hwrng_unregister'
>  drivers/built-in.o: In function `hifn_probe':
>  hifn_795x.c:(.text+0x17df70): undefined reference to `hwrng_register'
>
> config attached.

find a workaround below - but i'm not sure it's the right one.

	Ingo

Index: linux/drivers/crypto/Kconfig
===
--- linux.orig/drivers/crypto/Kconfig
+++ linux/drivers/crypto/Kconfig
@@ -89,6 +89,7 @@ config CRYPTO_DEV_HIFN_795X
 	select CRYPTO_ALGAPI
 	select CRYPTO_BLKCIPHER
 	depends on PCI
+	depends on DEV_HIFN_795X = HW_RANDOM
 	help
 	  This option allows you to have support for HIFN 795x crypto adapters.
Re: Linux 2.6.24
Giacomo A. Catenazzi wrote:
>> On Friday, 25 of January 2008, [EMAIL PROTECTED] wrote:
[-mm]
>>> should flush out most of the truly stupid mistakes, but those are
>>> usually found and fixed literally within hours. Anyhow, the proper
>>> time for test compiles is *before* it goes into the git trees at
>>> all - it should have been tested before it gets sent to a
>>> maintainer for inclusion.
>
> few hours, but a lot of changeset will broke bisect (few doc tell
> us how to continue bisecting on compile errors).
[...]
> I only want to raise the problem, to see if it is possible to improve
> testing environment without affecting the development of Linux.

How often is "bisectability" being broken already before merge in subsystem trees, and how often only in the context of the merge result? (Probably impossible to answer because nobody has the data.)

Much of the former type of breakage (if we really have such breakage) could probably be found in mostly automated ways and by volunteer testers, by regularly testing the subsystem trees.
--
Stefan Richter
-=-==--- ---= ==-=-
http://arcgraph.de/sr/