Re: epoll and shared fd's

2008-01-25 Thread Michael Kerrisk
On Jan 25, 2008 12:57 AM, Davide Libenzi <[EMAIL PROTECTED]> wrote:
>
> On Thu, 24 Jan 2008, Pierre Habouzit wrote:
>
> > On Fri, Jan 18, 2008 at 09:10:18PM +, Davide Libenzi wrote:
> > > On Fri, 18 Jan 2008, Pierre Habouzit wrote:
> > >
> > > >   Hi,
> > > >
> > > >   I just came across a strange behavior of epoll that seems to
> > > > contradict the documentation. Here is what happens:
> > > >
> > > > * I have two processes P1 and P2, P1 accept()s connections, and send the
> > > >   resulting file descriptors to P2 through a unix socket.
> > > >
> > > > * P2 registers the received socket in his epollfd.
> > > >
> > > >   [time passes]
> > > >
> > > > * P2 is done with the socket and closes it
> > > >
> > > > * P2 gets events for the socket again !
> > > >
> > > >
> > > > >   Though the documentation says that if a process closes a file
> > > > > descriptor, it gets unregistered. And yes I'm sure that P2 doesn't dup()
> > > > > the file descriptor. Though (because of a bug) it was still open in
> > > > > P1[0], hence the referenced socket was still live at the kernel level.
> > > >
> > > >   Of course the userland workaround is to force the EPOLL_CTL_DEL before
> > > > the close, which I now do, but costs me a syscall where I wanted to
> > > > spare one :|
> > >
> > > > For epoll, a close happens when the kernel file* is released (that is,
> > > > when all its instances are gone).
> > > > We could add special handling in filp_close(), but I don't think that is
> > > > a good idea, and we're better off living with the current behaviour.
> >
> >   Okay, maybe updating the linux manpages to be more clear about that is
> > the way to go then. Thanks
>
> Sure. I'll send Michael Kerrisk an updated statement for the A6 answer in
> the epoll man page.

Thanks Davide -- yes please send me a patch.
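
The behaviour described in this thread can be reproduced with a small userspace sketch, assuming a Linux system. The helper name events_after_close() is hypothetical, and a pipe plus dup() stands in for the socket that was still open in P1; none of this code comes from the thread itself.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: registers the read end of a pipe with epoll, keeps a
 * dup()ed copy open (standing in for the descriptor still held by P1), then
 * closes the registered descriptor and reports how many events epoll_wait()
 * still returns. If del_first is nonzero, EPOLL_CTL_DEL is issued before
 * close(). Returns -1 if any setup step fails.
 */
int events_after_close(int del_first)
{
	struct epoll_event ev = { .events = EPOLLIN };
	struct epoll_event out;
	int p[2], epfd, copy, n = -1;

	if (pipe(p) < 0)
		return -1;
	epfd = epoll_create(1);
	copy = dup(p[0]);
	if (epfd < 0 || copy < 0)
		goto cleanup;

	ev.data.fd = p[0];
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0)
		goto cleanup;
	if (write(p[1], "x", 1) != 1)	/* make the read end readable */
		goto cleanup;

	if (del_first && epoll_ctl(epfd, EPOLL_CTL_DEL, p[0], &ev) < 0)
		goto cleanup;
	close(p[0]);
	p[0] = -1;

	/* The open file description is still alive through 'copy'. */
	n = epoll_wait(epfd, &out, 1, 100);

cleanup:
	if (p[0] >= 0)
		close(p[0]);
	close(p[1]);
	if (copy >= 0)
		close(copy);
	if (epfd >= 0)
		close(epfd);
	return n;
}
```

With the semantics Davide describes, events_after_close(0) should still report one event after the close(), while events_after_close(1) should report none, which is why the explicit EPOLL_CTL_DEL works as a workaround.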
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: threshold_init_device/kobject_uevent_env oops

2008-01-25 Thread Greg KH
On Fri, Jan 25, 2008 at 11:24:55PM -0800, Greg KH wrote:
> On Fri, Jan 25, 2008 at 11:08:53PM -0800, Yinghai Lu wrote:
> > On Jan 25, 2008 10:14 PM, Greg KH <[EMAIL PROTECTED]> wrote:
> > >
> > > On Fri, Jan 25, 2008 at 10:04:19PM -0800, Yinghai Lu wrote:
> > > > On Jan 25, 2008 2:50 PM, Greg KH <[EMAIL PROTECTED]> wrote:
> > > > > On Fri, Jan 25, 2008 at 02:47:11PM -0800, Greg KH wrote:
> > > > > > On Fri, Jan 25, 2008 at 11:35:56PM +0100, Ingo Molnar wrote:
> > > > > > >
> > > > > > > * Greg KH <[EMAIL PROTECTED]> wrote:
> > > > > > >
> > > > > > > > On Fri, Jan 25, 2008 at 01:05:40PM -0800, Yinghai Lu wrote:
> > > > > > > > > current linus tree + x86.git
> > > > > > > > >
> > > > > > > > > got
> > > > > > > > >
> > > > > > > > > Calling initcall 0x80b93d98: 
> > > > > > > > > threshold_init_device+0x0/0x3f()
> > > > > > > > > BUG: unable to handle kernel NULL pointer dereference at 
> > > > > > > > > 0040
> > > > > > > > > IP: [] kobject_uevent_env+0x2a/0x3d9
> > > > > > > >
> > > > > > > > Does this happen on just Linus's tree?
> > > > > > > >
> > > > > > > > Can you send me a .config file for this?
> > > > > > > >
> > > > > > > > What is threshold_init()?  Is it something new in the x86.git 
> > > > > > > > tree?
> > > > > > >
> > > > > > > no. A quick grep shows that it is in a file that _your_ changes in
> > > > > > > Linus' latest have touched:
> > > > > > >
> > > > > > >   arch/x86/kernel/cpu/mcheck/mce_amd_64.c
> > > > > >
> > > > > > Ok, those are pretty much just search/and/replace type changes, but 
> > > > > > I
> > > > > > have been running x86-64 boxes with these changes in place.
> > > > >
> > > > > Oh wait, I do see a change.  We are now (finally) emitting a kobject
> > > > > uevent for these devices, which somehow the code can't handle 
> > > > > properly.
> > > > >
> > > > > Let me go poke this some more, unfortunatly I don't have any AMD 64
> > > > > boxes here anymore, only Intel based processors, so I can't run this
> > > > > module...
> > > >
> > > > it only happens with AMD Quad Core CPU or Fam 10h.
> > > >
> > > > works well with AMD opteron Rev E, and Rev F.
> > >
> > > So this only dies on a multi-core system?  Or does 2 processor boxes
> > > work, but not 4?
> > 
> > 2 sockets x quad core will fail (fam 10h)
> > 2 sockets x dual core works( rev E, and rev F opteron)
> > 
> > there are some changs between opteron and fam10h.  fam10h may have
> > more local vectors for MCE...
> > or more banks and blocks...
> > 
> > will look at AMD64 Bios and kernel porting guide for Fam 10h again..
> > 
> > wonder if your code uncover some bugs ...
> 
> No, the logic in this function is just crazy.  It's recursive, but we
> can circumvent the creation for the kobject and whole creation of the
> threshold_block if some conditions are met.  That's why we see the
> allocate_threshold_blocks so many times in the callstack, yet only a few
> kobjects created.
> 
> Then we blow up in kobject_uevent_env() on the first debug printk.
> Which means that we are just passing in garbage.
> 
> Let me know if the patch below fixes this for you, I think it should, as
> there is a code path where b is NULL and then we call kobject_uevent.
> 
> Man, this is one time that comments in code would have been very nice to
> have, and why forward goto's into major code blocks are just evil...
> 
> thanks,
> 
> greg k-h
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
> index 7535887..8a7f204 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
> @@ -450,7 +450,8 @@ recurse:
>  	if (err)
>  		goto out_free;
>  
> -	kobject_uevent(&b->kobj, KOBJ_ADD);
> +	if (b && &b->kobj)
> +		kobject_uevent(&b->kobj, KOBJ_ADD);
>  
>  	return err;
>  

Actually the second test doesn't make sense, it can just be:


diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
index 7535887..8a7f204 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
@@ -450,7 +450,8 @@ recurse:
 	if (err)
 		goto out_free;
 
-	kobject_uevent(&b->kobj, KOBJ_ADD);
+	if (b)
+		kobject_uevent(&b->kobj, KOBJ_ADD);
 
 	return err;
 
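
A minimal userspace sketch of why the second test adds nothing, under stated assumptions: the struct layouts and helper names below are hypothetical stand-ins, since the real threshold_block layout is not shown in this thread. When b is NULL, "&b->kobj" amounts to the offset of the member, which is nonzero unless kobj happens to be the first member, so only the test on b itself guards the call.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the real kernel structures are larger. */
struct kobject_sketch {
	const char *name;
};

struct threshold_block_sketch {
	unsigned int block;
	unsigned int bank;
	struct kobject_sketch kobj;	/* deliberately not the first member */
};

static int uevents_sent;

static void send_uevent_sketch(struct kobject_sketch *kobj)
{
	(void)kobj;
	uevents_sent++;
}

/* Shape of the fixed code path: only announce a block that exists. */
int announce_block(struct threshold_block_sketch *b)
{
	if (b)
		send_uevent_sketch(&b->kobj);
	return uevents_sent;
}

/*
 * What "&b->kobj" amounts to when b is NULL: the member offset, which is
 * nonzero here, so testing it against NULL adds nothing.
 */
size_t kobj_offset(void)
{
	return offsetof(struct threshold_block_sketch, kobj);
}
```

A small nonzero pointer of this kind would also be consistent with the low fault addresses in the reports, though that is an inference and not something the thread confirms.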


Re: [PATCH 158/196] Driver core: convert block from raw kobjects

2008-01-25 Thread Greg KH
On Sat, Jan 26, 2008 at 12:23:18AM +0100, Alexander van Heukelum wrote:
> Fix build with CONFIG_BLOCK off.
> 
> Building git-2d94dfc with CONFIG_BLOCK turned off gives me:
> 
> drivers/base/core.c: In function 'device_add_class_symlinks':
> drivers/base/core.c:704: error: 'part_type' undeclared (first use in this 
> function)
> drivers/base/core.c:704: error: (Each undeclared identifier is reported only 
> once
> drivers/base/core.c:704: error: for each function it appears in.)
> drivers/base/core.c: In function 'device_remove_class_symlinks':
> drivers/base/core.c:743: error: 'part_type' undeclared (first use in this 
> function)
> 
> git-blame points to Kay Sievers.
> 
> The problem is obvious. I think the solution is too ;).

Heh, thanks, I'll test this in the morning, it's been a long day...

greg k-h


[PATCH] remove duplicating priority setting in try_to_free_p

2008-01-25 Thread minchan kim
shrink_zones(), called from try_to_free_pages(), already sets the zone's
priority through note_zone_scanning_priority().
So setting prev_priority again in try_to_free_pages() is needless.

This patch is against 2.6.24-rc8.

Signed-off-by: barrios <[EMAIL PROTECTED]>
---
 mm/vmscan.c |   17 -----------------
 1 files changed, 0 insertions(+), 17 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e5a9597..fc55c23 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1273,23 +1273,6 @@ unsigned long try_to_free_pages(struct z
 	if (!sc.all_unreclaimable)
 		ret = 1;
 out:
-	/*
-	 * Now that we've scanned all the zones at this priority level, note
-	 * that level within the zone so that the next thread which performs
-	 * scanning of this zone will immediately start out at this priority
-	 * level.  This affects only the decision whether or not to bring
-	 * mapped pages onto the inactive list.
-	 */
-	if (priority < 0)
-		priority = 0;
-	for (i = 0; zones[i] != NULL; i++) {
-		struct zone *zone = zones[i];
-
-		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-			continue;
-
-		zone->prev_priority = priority;
-	}
 	return ret;
 }


-- 
Kind regards,
barrios


Re: threshold_init_device/kobject_uevent_env oops

2008-01-25 Thread Greg KH
On Fri, Jan 25, 2008 at 11:08:53PM -0800, Yinghai Lu wrote:
> On Jan 25, 2008 10:14 PM, Greg KH <[EMAIL PROTECTED]> wrote:
> >
> > On Fri, Jan 25, 2008 at 10:04:19PM -0800, Yinghai Lu wrote:
> > > On Jan 25, 2008 2:50 PM, Greg KH <[EMAIL PROTECTED]> wrote:
> > > > On Fri, Jan 25, 2008 at 02:47:11PM -0800, Greg KH wrote:
> > > > > On Fri, Jan 25, 2008 at 11:35:56PM +0100, Ingo Molnar wrote:
> > > > > >
> > > > > > * Greg KH <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > > On Fri, Jan 25, 2008 at 01:05:40PM -0800, Yinghai Lu wrote:
> > > > > > > > current linus tree + x86.git
> > > > > > > >
> > > > > > > > got
> > > > > > > >
> > > > > > > > Calling initcall 0x80b93d98: 
> > > > > > > > threshold_init_device+0x0/0x3f()
> > > > > > > > BUG: unable to handle kernel NULL pointer dereference at 
> > > > > > > > 0040
> > > > > > > > IP: [] kobject_uevent_env+0x2a/0x3d9
> > > > > > >
> > > > > > > Does this happen on just Linus's tree?
> > > > > > >
> > > > > > > Can you send me a .config file for this?
> > > > > > >
> > > > > > > What is threshold_init()?  Is it something new in the x86.git 
> > > > > > > tree?
> > > > > >
> > > > > > no. A quick grep shows that it is in a file that _your_ changes in
> > > > > > Linus' latest have touched:
> > > > > >
> > > > > >   arch/x86/kernel/cpu/mcheck/mce_amd_64.c
> > > > >
> > > > > Ok, those are pretty much just search/and/replace type changes, but I
> > > > > have been running x86-64 boxes with these changes in place.
> > > >
> > > > Oh wait, I do see a change.  We are now (finally) emitting a kobject
> > > > uevent for these devices, which somehow the code can't handle properly.
> > > >
> > > > Let me go poke this some more, unfortunatly I don't have any AMD 64
> > > > boxes here anymore, only Intel based processors, so I can't run this
> > > > module...
> > >
> > > it only happens with AMD Quad Core CPU or Fam 10h.
> > >
> > > works well with AMD opteron Rev E, and Rev F.
> >
> > So this only dies on a multi-core system?  Or does 2 processor boxes
> > work, but not 4?
> 
> 2 sockets x quad core will fail (fam 10h)
> 2 sockets x dual core works( rev E, and rev F opteron)
> 
> there are some changs between opteron and fam10h.  fam10h may have
> more local vectors for MCE...
> or more banks and blocks...
> 
> will look at AMD64 Bios and kernel porting guide for Fam 10h again..
> 
> wonder if your code uncover some bugs ...

No, the logic in this function is just crazy.  It's recursive, but we
can circumvent the creation of the kobject and the whole creation of the
threshold_block if some conditions are met.  That's why we see
allocate_threshold_blocks so many times in the callstack, yet only a few
kobjects created.

Then we blow up in kobject_uevent_env() on the first debug printk.
Which means that we are just passing in garbage.

Let me know if the patch below fixes this for you, I think it should, as
there is a code path where b is NULL and then we call kobject_uevent.

Man, this is one time that comments in code would have been very nice to
have, and why forward goto's into major code blocks are just evil...

thanks,

greg k-h

diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
index 7535887..8a7f204 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
@@ -450,7 +450,8 @@ recurse:
 	if (err)
 		goto out_free;
 
-	kobject_uevent(&b->kobj, KOBJ_ADD);
+	if (b && &b->kobj)
+		kobject_uevent(&b->kobj, KOBJ_ADD);
 
 	return err;
 


Re: [kvm-devel] [PATCH 3/8] SVM: add module parameter to disable NestedPaging

2008-01-25 Thread Joerg Roedel
On Fri, Jan 25, 2008 at 05:47:11PM -0800, Nakajima, Jun wrote:
> Joerg Roedel wrote:
> > To disable the use of the Nested Paging feature even if it is available in
> > hardware this patch adds a module parameter. Nested Paging can be disabled by
> > passing npt=off to the kvm_amd module.
> 
> I think it's better to use a (common) parameter to qemu. That way you
> can control on/off for each VM.

Generally I see no problem with it. But at least for NPT I don't see a
reason why someone would want to disable it on a per-VM basis (as long as
it works stably). Avi, what do you think?

Joerg


Re: [PATCH] RUSAGE_THREAD

2008-01-25 Thread Michael Kerrisk
On Jan 19, 2008 2:14 AM, Roland McGrath <[EMAIL PROTECTED]> wrote:
>
> This adds the RUSAGE_THREAD option for the getrusage system call.
> Solaris calls this RUSAGE_LWP and uses the same value (1).
> That name is not a natural one for Linux, but we keep it as an alias.

Hey Roland,

Would you please CC me at this address on patches that change the
kernel-userland API?

Cheers,

Michael


Re: [PATCH] Fake NUMA emulation for PowerPC (Take 2)

2008-01-25 Thread Balbir Singh
* Michael Ellerman <[EMAIL PROTECTED]> [2008-01-18 16:44:58]:

> 
> This fixes it, although I'm a little worried about some of the
> removals/movings of node_set_online() in the patch.
> 
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 1666e7d..dcedc26 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -49,7 +49,6 @@ static int __cpuinit fake_numa_create_new_node(unsigned long end_pfn,
>  	static unsigned int fake_nid = 0;
>  	static unsigned long long curr_boundary = 0;
>  
> -	*nid = fake_nid;
>  	if (!p)
>  		return 0;
>  
> @@ -60,6 +59,7 @@ static int __cpuinit fake_numa_create_new_node(unsigned long end_pfn,
>  	if (mem < curr_boundary)
>  		return 0;
>  
> +	*nid = fake_nid;
>  	curr_boundary = mem;
>  
>  	if ((end_pfn << PAGE_SHIFT) > mem) {
> 

Hi, Michael,

Here's a better and more complete fix for the problem. Could you
please see if it works for you? I tested it on a real NUMA box and it
seemed to work fine there.

Description
---

This patch provides a fix for the problem found by
Michael Ellerman <[EMAIL PROTECTED]> while using fake NUMA nodes
on a cell box. The code modifies node id iff (as in if and only if)
fake NUMA nodes are created.

Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>
---

 arch/powerpc/mm/numa.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff -puN arch/powerpc/mm/numa.c~fix-fake-numa-nid-on-numa arch/powerpc/mm/numa.c
--- linux-2.6.24-rc8/arch/powerpc/mm/numa.c~fix-fake-numa-nid-on-numa	2008-01-26 12:20:29.0 +0530
+++ linux-2.6.24-rc8-balbir/arch/powerpc/mm/numa.c	2008-01-26 12:27:53.0 +0530
@@ -49,7 +49,12 @@ static int __cpuinit fake_numa_create_ne
 	static unsigned int fake_nid = 0;
 	static unsigned long long curr_boundary = 0;
 
-	*nid = fake_nid;
+	/*
+	 * If we did enable fake nodes and cross a node,
+	 * remember the last node and start from there.
+	 */
+	if (fake_nid)
+		*nid = fake_nid;
 	if (!p)
 		return 0;
 
_
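
The "iff" in the description can be illustrated with a small userspace sketch. This is a simplification under stated assumptions: fake_node_sketch() and fake_nid_sketch are hypothetical names, the boundary parsing is omitted, and the rule that every call with an option creates one new node is an assumption made only to keep the example short.

```c
/* Hypothetical state; the kernel keeps fake_nid in a function-local static. */
static unsigned int fake_nid_sketch;

/*
 * opts stands in for the fake-NUMA command line option (a null pointer when
 * the option is absent). Returns 1 when a new fake node is created, 0
 * otherwise.
 */
int fake_node_sketch(const char *opts, unsigned int *nid)
{
	/* Only override the caller's node id once a fake node exists. */
	if (fake_nid_sketch)
		*nid = fake_nid_sketch;
	if (!opts)
		return 0;

	/* Assumption: every call with an option creates one new fake node. */
	fake_nid_sketch++;
	*nid = fake_nid_sketch;
	return 1;
}
```

On a real NUMA box with no fake nodes configured, the caller's node id is left untouched, which is the case the unconditional assignment got wrong.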

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: threshold_init_device/kobject_uevent_env oops

2008-01-25 Thread Yinghai Lu
On Jan 25, 2008 10:14 PM, Greg KH <[EMAIL PROTECTED]> wrote:
>
> On Fri, Jan 25, 2008 at 10:04:19PM -0800, Yinghai Lu wrote:
> > On Jan 25, 2008 2:50 PM, Greg KH <[EMAIL PROTECTED]> wrote:
> > > On Fri, Jan 25, 2008 at 02:47:11PM -0800, Greg KH wrote:
> > > > On Fri, Jan 25, 2008 at 11:35:56PM +0100, Ingo Molnar wrote:
> > > > >
> > > > > * Greg KH <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > On Fri, Jan 25, 2008 at 01:05:40PM -0800, Yinghai Lu wrote:
> > > > > > > current linus tree + x86.git
> > > > > > >
> > > > > > > got
> > > > > > >
> > > > > > > Calling initcall 0x80b93d98: 
> > > > > > > threshold_init_device+0x0/0x3f()
> > > > > > > BUG: unable to handle kernel NULL pointer dereference at 
> > > > > > > 0040
> > > > > > > IP: [] kobject_uevent_env+0x2a/0x3d9
> > > > > >
> > > > > > Does this happen on just Linus's tree?
> > > > > >
> > > > > > Can you send me a .config file for this?
> > > > > >
> > > > > > What is threshold_init()?  Is it something new in the x86.git tree?
> > > > >
> > > > > no. A quick grep shows that it is in a file that _your_ changes in
> > > > > Linus' latest have touched:
> > > > >
> > > > >   arch/x86/kernel/cpu/mcheck/mce_amd_64.c
> > > >
> > > > Ok, those are pretty much just search/and/replace type changes, but I
> > > > have been running x86-64 boxes with these changes in place.
> > >
> > > Oh wait, I do see a change.  We are now (finally) emitting a kobject
> > > uevent for these devices, which somehow the code can't handle properly.
> > >
> > > Let me go poke this some more, unfortunatly I don't have any AMD 64
> > > boxes here anymore, only Intel based processors, so I can't run this
> > > module...
> >
> > it only happens with AMD Quad Core CPU or Fam 10h.
> >
> > works well with AMD opteron Rev E, and Rev F.
>
> So this only dies on a multi-core system?  Or does 2 processor boxes
> work, but not 4?

2 sockets x quad core will fail (fam 10h)
2 sockets x dual core works( rev E, and rev F opteron)

there are some changes between opteron and fam10h.  fam10h may have
more local vectors for MCE...
or more banks and blocks...

will look at the AMD64 BIOS and kernel porting guide for Fam 10h again..

wonder if your code uncovered some bugs ...

YH


Re: [GIT PATCH] driver core patches against 2.6.24

2008-01-25 Thread Arjan van de Ven
On Fri, 25 Jan 2008 11:11:48 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> 
> 
> On Fri, 25 Jan 2008, Greg KH wrote:
> > 
> > That's really weird, I don't see that at all here just running with
> > your 2.6.24 + my git tree and lots of USB drivers built into the
> > kernel also (like ehci_hcd).
> 
> But do you use an initrd that tries to load the same driver too?
> 
> I'm too lazy to want to do my own initrd. I just use the prepackaged
> ones and rely on the fact that my private kernel will refuse to load
> modules that aren't meant for it anyway.
> 

you know about "make install" right? That copies the needed files to /boot,
adds them to grub AND makes an initrd for you.. all for free ;)


-- 
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


[PATCH] Moving spinlock to struct usb_hcd

2008-01-25 Thread Romit Dasgupta
Hi,
   This is an attempt to move hcd_urb_list_lock into struct usb_hcd.
The lock is taken in functions that add, delete, or use a URB against a
given hcd. I have not seen any association of a URB with multiple hcds,
so I thought the lock could be moved into usb_hcd. This should help
reduce lock contention in the USB core under high load, when I/O is
happening on multiple hcds. I am also checking whether hcd_root_hub_lock
can be moved to usb_hcd as well. Any comments on this?  I have done some
testing with this patch and it seems to be holding up fine. If this looks
OK I will submit the lock statistics from before and after the change.

Thanks,
-Romit


---
 drivers/usb/core/hcd.c |   24 +++++++++++-------------
 drivers/usb/core/hcd.h |    1 +
 2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index d5ed3fa..6eb0f45 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -98,9 +98,6 @@ EXPORT_SYMBOL_GPL (usb_bus_list_lock);
 /* used for controlling access to virtual root hubs */
 static DEFINE_SPINLOCK(hcd_root_hub_lock);
 
-/* used when updating an endpoint's URB list */
-static DEFINE_SPINLOCK(hcd_urb_list_lock);
-
 /* wait queue for synchronous unlinks */
 DECLARE_WAIT_QUEUE_HEAD(usb_kill_urb_queue);
 
@@ -1000,7 +997,7 @@ int usb_hcd_link_urb_to_ep(struct usb_hcd *hcd, struct urb *urb)
 {
 	int rc = 0;
 
-	spin_lock(&hcd_urb_list_lock);
+	spin_lock(&hcd->hcd_urb_list_lock);
 
 	/* Check that the URB isn't being killed */
 	if (unlikely(urb->reject)) {
@@ -1033,7 +1030,7 @@ int usb_hcd_link_urb_to_ep(struct usb_hcd *hcd, struct urb *urb)
 		goto done;
 	}
  done:
-	spin_unlock(&hcd_urb_list_lock);
+	spin_unlock(&hcd->hcd_urb_list_lock);
 	return rc;
 }
 EXPORT_SYMBOL_GPL(usb_hcd_link_urb_to_ep);
@@ -1106,9 +1103,9 @@ EXPORT_SYMBOL_GPL(usb_hcd_check_unlink_urb);
 void usb_hcd_unlink_urb_from_ep(struct usb_hcd *hcd, struct urb *urb)
 {
 	/* clear all state linking urb to this dev (and hcd) */
-	spin_lock(&hcd_urb_list_lock);
+	spin_lock(&hcd->hcd_urb_list_lock);
 	list_del_init(&urb->urb_list);
-	spin_unlock(&hcd_urb_list_lock);
+	spin_unlock(&hcd->hcd_urb_list_lock);
 }
 EXPORT_SYMBOL_GPL(usb_hcd_unlink_urb_from_ep);
 
@@ -1311,7 +1308,7 @@ void usb_hcd_flush_endpoint(struct usb_device *udev,
 	hcd = bus_to_hcd(udev->bus);
 
 	/* No more submits can occur */
-	spin_lock_irq(&hcd_urb_list_lock);
+	spin_lock_irq(&hcd->hcd_urb_list_lock);
 rescan:
 	list_for_each_entry (urb, &ep->urb_list, urb_list) {
 		int is_in;
@@ -1320,7 +1317,7 @@ rescan:
 			continue;
 		usb_get_urb (urb);
 		is_in = usb_urb_dir_in(urb);
-		spin_unlock(&hcd_urb_list_lock);
+		spin_unlock(&hcd->hcd_urb_list_lock);
 
 		/* kick hcd */
 		unlink1(hcd, urb, -ESHUTDOWN);
@@ -1345,14 +1342,14 @@ rescan:
 		usb_put_urb (urb);
 
 		/* list contents may have changed */
-		spin_lock(&hcd_urb_list_lock);
+		spin_lock(&hcd->hcd_urb_list_lock);
 		goto rescan;
 	}
-	spin_unlock_irq(&hcd_urb_list_lock);
+	spin_unlock_irq(&hcd->hcd_urb_list_lock);
 
 	/* Wait until the endpoint queue is completely empty */
 	while (!list_empty (&ep->urb_list)) {
-		spin_lock_irq(&hcd_urb_list_lock);
+		spin_lock_irq(&hcd->hcd_urb_list_lock);
 
 		/* The list may have changed while we acquired the spinlock */
 		urb = NULL;
@@ -1361,7 +1358,7 @@ rescan:
 					urb_list);
 			usb_get_urb (urb);
 		}
-		spin_unlock_irq(&hcd_urb_list_lock);
+		spin_unlock_irq(&hcd->hcd_urb_list_lock);
 
 		if (urb) {
 			usb_kill_urb (urb);
@@ -1618,6 +1615,7 @@ struct usb_hcd *usb_create_hcd (const struct hc_driver *driver,
 		dev_dbg (dev, "hcd alloc failed\n");
 		return NULL;
 	}
+	spin_lock_init(&hcd->hcd_urb_list_lock);
 	dev_set_drvdata(dev, hcd);
 	kref_init(&hcd->kref);
 
diff --git a/drivers/usb/core/hcd.h b/drivers/usb/core/hcd.h
index 98e2419..e23ff45 100644
--- a/drivers/usb/core/hcd.h
+++ b/drivers/usb/core/hcd.h
@@ -128,6 +128,7 @@ struct usb_hcd {
 	 * input size of periodic table to an interrupt scheduler.
 	 * (ohci 32, uhci 1024, ehci 256/512/1024).
 	 */
+	spinlock_t hcd_urb_list_lock;
 
 	/* The HC driver's private data is stored at the end of
 	 * this structure.
-- 
1.4.4.2
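
The idea behind the patch, one lock per host controller instead of one global lock, can be sketched in userspace with a C11 atomic_flag standing in for the kernel spinlock. All names below (hcd_sketch and its helpers) are hypothetical, and a try-lock is used only so the independence of the two locks can be observed from a single thread; the kernel code spins instead.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct usb_hcd with its own URB list lock. */
struct hcd_sketch {
	atomic_flag urb_list_lock;
	int linked_urbs;
};

void hcd_sketch_init(struct hcd_sketch *hcd)
{
	atomic_flag_clear(&hcd->urb_list_lock);	/* plays the role of spin_lock_init() */
	hcd->linked_urbs = 0;
}

/* Returns true if the lock was free and is now held by the caller. */
bool hcd_sketch_trylock(struct hcd_sketch *hcd)
{
	return !atomic_flag_test_and_set_explicit(&hcd->urb_list_lock,
						  memory_order_acquire);
}

void hcd_sketch_unlock(struct hcd_sketch *hcd)
{
	atomic_flag_clear_explicit(&hcd->urb_list_lock, memory_order_release);
}

/* Link one URB under the controller's own lock; false if it is contended. */
bool hcd_sketch_link_urb(struct hcd_sketch *hcd)
{
	if (!hcd_sketch_trylock(hcd))
		return false;
	hcd->linked_urbs++;
	hcd_sketch_unlock(hcd);
	return true;
}
```

Holding the lock of one controller does not block linking a URB on another, which is the contention reduction the patch is after. Whether that shows up in practice is what the promised lock statistics would have to demonstrate.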



Re: threshold_init_device/kobject_uevent_env oops

2008-01-25 Thread Greg KH
On Fri, Jan 25, 2008 at 03:20:45PM -0800, Yinghai Lu wrote:
> On Jan 25, 2008 3:08 PM, Greg KH <[EMAIL PROTECTED]> wrote:
> > On Fri, Jan 25, 2008 at 11:35:56PM +0100, Ingo Molnar wrote:
> ..
> > Also, can someone enable CONFIG_KOBJECT_DEBUG and send me the output of
> > the startup of this code?  That should help explain what order things
> > are happening it.
> 
> Calling initcall 0x80ba1dee: threshold_init_device+0x0/0x3f()
> kobject: 'threshold_bank4' (8108265450c0): kobject_add_internal: parent: 'machinecheck0', set: ''
> kobject: 'misc0' (810425497418): kobject_add_internal: parent: 'threshold_bank4', set: ''
> kobject: 'misc1' (810425497498): kobject_add_internal: parent: 'threshold_bank4', set: ''
> kobject: 'misc2' (810425497518): kobject_add_internal: parent: 'threshold_bank4', set: ''
> Unable to handle kernel NULL pointer dereference at 0018 RIP: 
> [] kobject_uevent_env+0x31/0x45f

2 of these work just fine, and the third blows up in kobject_uevent().
So weird, let me dig further...

Hm, it's when we unwind that we blow up on the kobject_uevent, as that's
the first time it is called (gotta love recursion here...)  So it is
really never working for these objects at all, what a mess.

As a work-around for now, you can probably just comment out the
kobject_uevent() call in the file arch/x86/kernel/cpu/mcheck/mce_amd_64.c
and everything should work just fine; as there never really was an event
being properly generated before, no one would miss it now :)

I'll keep digging...

thanks,

greg k-h


Re: threshold_init_device/kobject_uevent_env oops

2008-01-25 Thread Greg KH
On Fri, Jan 25, 2008 at 10:04:19PM -0800, Yinghai Lu wrote:
> On Jan 25, 2008 2:50 PM, Greg KH <[EMAIL PROTECTED]> wrote:
> > On Fri, Jan 25, 2008 at 02:47:11PM -0800, Greg KH wrote:
> > > On Fri, Jan 25, 2008 at 11:35:56PM +0100, Ingo Molnar wrote:
> > > >
> > > > * Greg KH <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > On Fri, Jan 25, 2008 at 01:05:40PM -0800, Yinghai Lu wrote:
> > > > > > current linus tree + x86.git
> > > > > >
> > > > > > got
> > > > > >
> > > > > > Calling initcall 0x80b93d98: 
> > > > > > threshold_init_device+0x0/0x3f()
> > > > > > BUG: unable to handle kernel NULL pointer dereference at 
> > > > > > 0040
> > > > > > IP: [] kobject_uevent_env+0x2a/0x3d9
> > > > >
> > > > > Does this happen on just Linus's tree?
> > > > >
> > > > > Can you send me a .config file for this?
> > > > >
> > > > > What is threshold_init()?  Is it something new in the x86.git tree?
> > > >
> > > > no. A quick grep shows that it is in a file that _your_ changes in
> > > > Linus' latest have touched:
> > > >
> > > >   arch/x86/kernel/cpu/mcheck/mce_amd_64.c
> > >
> > > Ok, those are pretty much just search/and/replace type changes, but I
> > > have been running x86-64 boxes with these changes in place.
> >
> > Oh wait, I do see a change.  We are now (finally) emitting a kobject
> > uevent for these devices, which somehow the code can't handle properly.
> >
> > Let me go poke this some more, unfortunatly I don't have any AMD 64
> > boxes here anymore, only Intel based processors, so I can't run this
> > module...
> 
> it only happens with AMD Quad Core CPU or Fam 10h.
> 
> works well with AMD opteron Rev E, and Rev F.

So this only dies on a multi-core system?  Or does 2 processor boxes
work, but not 4?

> So you may need have access to new system with quad core cpu.

Ugh, that's not good.

The kobjects here are really not making much sense.

Jacob, any hints on exactly what you were trying to do with these
kobjects?  What's the end goal here, and why didn't you just use a
struct device instead?

The mce_amd_64.c file is the only thing in the tree using this userspace
API, can you please document it in Documentation/ABI so that others can
understand what it is used for, what files are expected, and what values
in the files are?

thanks,

greg k-h


Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

2008-01-25 Thread H. Peter Anvin

Andi Kleen wrote:
>> so INVLPG makes sense for pagetable fault related single-address
>> flushes, but they rarely make sense for range flushes. (and that's how
>> Linux uses it)
>
> I think it would be an interesting experiment to switch flush_tlb_range()
> over to INVLPG if the length is below some threshold and see if there
> are visible effects in macro benchmarks. The main problem
> would be to determine the right threshold -- would likely be CPU dependent.

It would be an interesting experiment.  Odds are pretty good that the
cutover is roughly linear in the TLB size.


-hpa


Re: threshold_init_device/kobject_uevent_env oops

2008-01-25 Thread Yinghai Lu
On Jan 25, 2008 2:50 PM, Greg KH <[EMAIL PROTECTED]> wrote:
> On Fri, Jan 25, 2008 at 02:47:11PM -0800, Greg KH wrote:
> > On Fri, Jan 25, 2008 at 11:35:56PM +0100, Ingo Molnar wrote:
> > >
> > > * Greg KH <[EMAIL PROTECTED]> wrote:
> > >
> > > > On Fri, Jan 25, 2008 at 01:05:40PM -0800, Yinghai Lu wrote:
> > > > > current linus tree + x86.git
> > > > >
> > > > > got
> > > > >
> > > > > Calling initcall 0x80b93d98: threshold_init_device+0x0/0x3f()
> > > > > BUG: unable to handle kernel NULL pointer dereference at 
> > > > > 0040
> > > > > IP: [] kobject_uevent_env+0x2a/0x3d9
> > > >
> > > > Does this happen on just Linus's tree?
> > > >
> > > > Can you send me a .config file for this?
> > > >
> > > > What is threshold_init()?  Is it something new in the x86.git tree?
> > >
> > > no. A quick grep shows that it is in a file that _your_ changes in
> > > Linus' latest have touched:
> > >
> > >   arch/x86/kernel/cpu/mcheck/mce_amd_64.c
> >
> > Ok, those are pretty much just search/and/replace type changes, but I
> > have been running x86-64 boxes with these changes in place.
>
> Oh wait, I do see a change.  We are now (finally) emitting a kobject
> uevent for these devices, which somehow the code can't handle properly.
>
> Let me go poke this some more, unfortunatly I don't have any AMD 64
> boxes here anymore, only Intel based processors, so I can't run this
> module...

It only happens with AMD quad-core CPUs or Fam 10h.

It works well with AMD Opteron Rev E and Rev F.

So you may need access to a newer system with a quad-core CPU.

YH


Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

2008-01-25 Thread Andi Kleen
On Saturday 26 January 2008 01:11:28 Ingo Molnar wrote:
> (plus
> any add-on TLB miss costs - but those are amortized quite well as long 
> as the pagetables are well cached - which they usually are on today's 
> 2MB-ish L2 caches), 

Did you measure the cost of that amortizing too?

My guess is that especially with TLBs getting larger and larger the
cost of full CR3 flushes is rising.

> so INVLPG makes sense for pagetable fault related single-address 
> flushes, but they rarely make sense for range flushes. (and that's how 
> Linux uses it)

I think it would be an interesting experiment to switch flush_tlb_range()
over to INVLPG if the length is below some threshold and see if there 
are visible effects in macro benchmarks. The main problem
would be to determine the right threshold -- would likely be CPU dependent.
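
A rough userspace sketch of the decision being proposed, for illustration
only. The names choose_flush() and pages_in_range(), the 4 KiB page size,
and the idea of passing the threshold as a parameter are all assumptions
made for this sketch; none of it is taken from the kernel tree.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL   /* assumed 4 KiB pages */

enum flush_kind { FLUSH_INVLPG, FLUSH_FULL_CR3 };

/* Number of pages touched by the half-open range [start, end). */
static unsigned long pages_in_range(unsigned long start, unsigned long end)
{
	assert(end >= start);
	if (end == start)
		return 0;
	unsigned long first = start / SKETCH_PAGE_SIZE;
	unsigned long last = (end - 1) / SKETCH_PAGE_SIZE;
	return last - first + 1;
}

/*
 * Pick per-page invalidation for short ranges and a full flush otherwise.
 * threshold_pages is the knob the experiment would sweep; it is expected
 * to be CPU dependent.
 */
static enum flush_kind choose_flush(unsigned long start, unsigned long end,
				    unsigned long threshold_pages)
{
	return pages_in_range(start, end) <= threshold_pages
		? FLUSH_INVLPG : FLUSH_FULL_CR3;
}
```

Keeping the threshold a parameter rather than a constant makes it easy to
sweep per CPU model while running the macro benchmarks.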

-Andi


Re: [RFC] ext3 freeze feature

2008-01-25 Thread David Chinner
On Sat, Jan 26, 2008 at 04:35:26PM +1100, David Chinner wrote:
> On Fri, Jan 25, 2008 at 07:59:38PM +0900, Takashi Sato wrote:
> > The main points of the implementation are as follows.
> > - Add calls of the freeze function (freeze_bdev) and
> >   the unfreeze function (thaw_bdev) in ext3_ioctl().
> > 
> > - ext3_freeze_timeout() which calls the unfreeze function (thaw_bdev)
> >   is registered to the delayed work queue to unfreeze the filesystem
> >   automatically after the lapse of the specified time.
> 
> Seems like pointless complexity to me - what happens if a
> timeout occurs while the filesystem is still freezing?
> 
> It's not uncommon for a freeze to take minutes if memory
> is full of dirty data that needs to be flushed out, esp. if
> dm-snap is doing COWs for every write issued

Sorry, ignore this bit - I just realised the timer is set
up after the freeze has occurred

Still, that makes it potentially dangerous to whatever is being
done while the filesystem is frozen

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [PATCH 085/196] kset: convert s390 ipl.c to use kset_create

2008-01-25 Thread Greg KH
On Sat, Jan 26, 2008 at 12:11:33AM +0100, Heiko Carstens wrote:
> On Fri, Jan 25, 2008 at 09:48:58AM -0800, Greg KH wrote:
> > On Fri, Jan 25, 2008 at 01:20:53PM +0100, Heiko Carstens wrote:
> > > On Thu, Jan 24, 2008 at 11:31:54PM -0800, Greg Kroah-Hartman wrote:
> > > > Dynamically create the kset instead of declaring it statically.
> > > > This makes the kobject attributes now work properly that I broke in the
> > > > previous patch.
> > > 
> > > Could you please merge this and the previous patch before it goes
> > > upstream? Having an intermediate state where things are broken
> > > will cause pain and additional work in case of bisecting.
> > 
> > It will not cause a build error (see the previous patch for details.)
> > The sysfs files will not properly show the correct data, that is all.
> >
> > The odds that you will hit this in a 'git bisect' is VERY low, and the
> > previous patch states that the files are now broken, so there should not
> > be any confusion regarding any user that might run across this.
> 
> The odds are very low, as long as not more patch sets come up which
> introduce intermediate broken kernels.
> What exactly is the advantage of breaking the kernel with patch 1 and
> then fix it again with patch 2 instead of doing the straight forward
> conversions all with one patch?

I was trying to do one logical thing at a time with this driver as I did
not have the hardware to test, and I could not even build the code at
the time.

Looking more closely, I think the 084 patch might still work properly,
but I can't guarantee it as the default kobject parent might not be
pointing to the correct attribute at the time.  I know 085 fixes this,
so that it will work properly.

It helped in reviewing this code by the other s390 developers to have
this in at least 2 pieces, to try to untangle the mess of sysfs files,
ksets, and other atrocities that you all have grown into over the
years.

So again, I'm sorry if this happens to break your run-time tests when
doing a 'git bisect', but as I explicitly stated it did in the patch, I
think everyone is properly forewarned :)

This core rework was tough to do, there was a reason no one had done it
before.  Now it is cleaner, smaller, able to be understood by at least
one active kernel developer, if not more, and it's documented, with
working examples.  If the downside of this effort is only this one thing
(note that others are finally finding real bugs...) I'll be very happy.

thanks,

greg k-h


Re: [GIT PATCH] driver core patches against 2.6.24

2008-01-25 Thread Greg KH
On Sat, Jan 26, 2008 at 03:50:57PM +1100, Rusty Russell wrote:
> On Saturday 26 January 2008 06:42:19 Greg KH wrote:
> > On Fri, Jan 25, 2008 at 10:44:59AM -0800, Linus Torvalds wrote:
> > > On Thu, 24 Jan 2008, Greg KH wrote:
> > > > Here are a pretty large number of kobject, documentation, and driver
> > > > core patches against your 2.6.24 git tree.
> > >
> > > I've merged it all, but it causes lots of scary warnings:
> > >
> > >  - from the purely broken ones:
> > >
> > >   ehci_hcd: no version for "struct_module" found: kernel tainted.
> >
> > Ok, in looking at the code, this should also be showing up for you on a
> > "clean" 2.6.24 release, I didn't change anything in this code path.
> >
> > That is what taints your kernel with the "F" flag.
> >
> > >  - to the scary ones:
> > >
> > >   sysfs: duplicate filename 'ehci_hcd' can not be created
> > >   WARNING: at fs/sysfs/dir.c:424 sysfs_add_one()
> > >   Pid: 610, comm: insmod Tainted: GF   2.6.24-gb47711bf #28
> > >
> > >   Call Trace:
> > >[] sysfs_add_one+0x54/0xbd
> > >[] create_dir+0x4f/0x87
> > >[] sysfs_create_dir+0x35/0x4a
> > >[] kobject_get+0x12/0x17
> > >[] kobject_add_internal+0xd9/0x194
> > >[] kobject_add_varg+0x54/0x61
> > >[] __alloc_pages+0x66/0x2ee
> > >[] kobject_init+0x42/0x82
> > >[] kobject_init_and_add+0x9a/0xa7
> > >[] __vmalloc_area_node+0x111/0x135
> > >[] mod_sysfs_init+0x6e/0x83
> > >[] sys_init_module+0xa3d/0x1833
> > >[] dput+0x1c/0x10b
> > >[] system_call+0x7e/0x83
> >
> > This is the sysfs core telling you that someone did something stupid :)
> >
> > Yes, that's new, but the "error" was always there, I just made the
> > warning more visible to get people to pay attention to it, and find the
> > real errors where this happens (and it has found them, which is a good
> > thing.)
> >
> > But in this case, it doesn't look like the module loading code will
> > detect that we are trying to load a module that is already present until
> > the kobjects are set up here.  It's been this way for a long time :(
> >
> > Rusty, any ideas of us adding a different check for "duplicate" modules
> > like this earlier in the load_module() function, so we don't spend so
> > much effort in building everything up when we don't need to?
> 
> module.c:1832 (in load_module)
> 
>   if (find_module(mod->name)) {
>   err = -EEXIST;
>   goto free_mod;
>   }
> 
> That's pretty early, and before this backtrace.

But that doesn't catch the case here, of trying to load a module when
the code itself is already built into the kernel.  For that we are
relying on the sysfs core to tell us we have a duplicate name problem,
which happens much later.

Is there any test you can do sooner, or is relying on the sysfs test
acceptable?

thanks,

greg k-h


Re: [RFC] ext3 freeze feature

2008-01-25 Thread David Chinner
On Fri, Jan 25, 2008 at 07:59:38PM +0900, Takashi Sato wrote:
> The main points of the implementation are as follows.
> - Add calls of the freeze function (freeze_bdev) and
>   the unfreeze function (thaw_bdev) in ext3_ioctl().
> 
> - ext3_freeze_timeout() which calls the unfreeze function (thaw_bdev)
>   is registered to the delayed work queue to unfreeze the filesystem
>   automatically after the lapse of the specified time.

Seems like pointless complexity to me - what happens if a
timeout occurs while the filesystem is still freezing?

It's not uncommon for a freeze to take minutes if memory
is full of dirty data that needs to be flushed out, esp. if
dm-snap is doing COWs for every write issued

> + case EXT3_IOC_FREEZE: {

> + if (inode->i_sb->s_frozen != SB_UNFROZEN)
> + return -EINVAL;

> + freeze_bdev(inode->i_sb->s_bdev);

> + case EXT3_IOC_THAW: {
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> + if (inode->i_sb->s_frozen == SB_UNFROZEN)
> + return -EINVAL;
.
> + /* Unfreeze */
> + thaw_bdev(inode->i_sb->s_bdev, inode->i_sb);

That's inherently unsafe - you can have multiple unfreezes
running in parallel which seriously screws with the bdev semaphore
count that is used to lock the device due to doing multiple up()s
for every down.

Your timeout thingy guarantees that at some point you will get
multiple up()s occurring, due to the timer firing racing with
a thaw ioctl.

If this interface is to be more widely exported, then it needs
a complete revamp of how the bdev is locked while it is frozen, so
that there is no chance of a double up() ever occurring on the
bd_mount_sem due to racing thaws.
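
To illustrate the property being asked for (not how it should be done in
the kernel), here is a small userspace model. Everything in it is
hypothetical: struct fake_bdev, fake_freeze() and fake_thaw() are made-up
names, and the semaphore is modelled as a plain atomic counter. The point
is only that whoever flips the frozen flag from 1 to 0 is the single
caller allowed to do the up().

```c
#include <stdatomic.h>

/* Userspace stand-in for a freezable block device (hypothetical type). */
struct fake_bdev {
	atomic_int frozen;   /* 1 while frozen, 0 otherwise */
	atomic_int sem;      /* models the bd_mount_sem count */
};

static void fake_bdev_init(struct fake_bdev *b)
{
	atomic_init(&b->frozen, 0);
	atomic_init(&b->sem, 1);
}

/* Returns 0 on success, -1 if the device is already frozen. */
static int fake_freeze(struct fake_bdev *b)
{
	int expected = 0;

	if (!atomic_compare_exchange_strong(&b->frozen, &expected, 1))
		return -1;
	atomic_fetch_sub(&b->sem, 1);   /* stands in for the down() */
	return 0;
}

/* Only the caller that flips frozen from 1 to 0 performs the up(). */
static int fake_thaw(struct fake_bdev *b)
{
	if (atomic_exchange(&b->frozen, 0) != 1)
		return -1;              /* lost the race, nothing to undo */
	atomic_fetch_add(&b->sem, 1);   /* the single up() */
	return 0;
}
```

With this shape, a timer-driven thaw and an ioctl-driven thaw can both
call fake_thaw(); one of them gets -1 and the count is restored exactly
once.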

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-25 Thread Al Boldi
Chris Snook wrote:
> Al Boldi wrote:
> > Greetings!
> >
> > data=ordered mode has proven reliable over the years, and it does this
> > by ordering filedata flushes before metadata flushes.  But this
> > sometimes causes contention in the order of a 10x slowdown for certain
> > apps, either due to the misuse of fsync or due to inherent behaviour
> > like db's, as well as inherent starvation issues exposed by the
> > data=ordered mode.
> >
> > data=writeback mode alleviates data=order mode slowdowns, but only works
> > per-mount and is too dangerous to run as a default mode.
> >
> > This RFC proposes to introduce a tunable which allows to disable fsync
> > and changes ordered into writeback writeout on a per-process basis like
> > this:
> >
> >   echo 1 > /proc/`pidof process`/softsync
> >
> >
> > Your comments are much welcome!
>
> This is basically a kernel workaround for stupid app behavior.

Exactly right to some extent, but don't forget the underlying data=ordered 
starvation problem, which looks like a genuinely deep problem maybe related 
to blockIO.

> It
> wouldn't be the first time we've provided such an option, but we shouldn't
> do it without a very good justification.  At the very least, we need a
> test case that demonstrates the problem

See the 'konqueror deadlocks in 2.6.22' thread.

> and benchmark results that prove that this approach actually fixes it.

8M-record insert into indexed db-table:
           ordered    writeback
sqlite3:   75m22s     8m45s
mysql4 :   23m35s     5m29s

> I suspect we can find a cleaner fix for the problem.

I hope so, but even with a fix available addressing the data=ordered 
starvation issue, this tunable could remain useful for those apps that 
misbehave.


Thanks!

--
Al



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-25 Thread Al Boldi
Jan Kara wrote:
> > Greetings!
> >
> > data=ordered mode has proven reliable over the years, and it does this
> > by ordering filedata flushes before metadata flushes.  But this
> > sometimes causes contention in the order of a 10x slowdown for certain
> > apps, either due to the misuse of fsync or due to inherent behaviour
> > like db's, as well as inherent starvation issues exposed by the
> > data=ordered mode.
> >
> > data=writeback mode alleviates data=order mode slowdowns, but only works
> > per-mount and is too dangerous to run as a default mode.
> >
> > This RFC proposes to introduce a tunable which allows to disable fsync
> > and changes ordered into writeback writeout on a per-process basis like
> > this:
> >
> >   echo 1 > /proc/`pidof process`/softsync
>
>   I guess disabling fsync() was already commented on enough. Regarding
> switching to writeback mode on per-process basis - not easily possible
> because sometimes data is not written out by the process which stored
> them (think of mmaped file).

Do you mean there is a locking problem?

> And in case of DB, they use direct-io
> anyway most of the time so they don't care about journaling mode anyway.

Testing with sqlite3 and mysql4 shows that performance drastically improves 
with writeback writeout.

>  But as Diego wrote, there is definitely some room for improvement in
> current data=ordered mode so the difference shouldn't be as big in the
> end.

Yes, it would be nice to get to the bottom of this starvation problem, but 
even then, the proposed tunable remains useful for misbehaving apps.


Thanks!

--
Al



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-25 Thread Al Boldi
[EMAIL PROTECTED] wrote:
> On Thu, 24 Jan 2008 23:36:00 +0300, Al Boldi said:
> > This RFC proposes to introduce a tunable which allows to disable fsync
> > and changes ordered into writeback writeout on a per-process basis like
> > this:
:
:
> But if you want to give them enough rope to shoot themselves in the foot
> with, I'd suggest abusing LD_PRELOAD to replace the fsync() glibc code
> instead.  No need to clutter the kernel with rope that can be (and has
> been) done in userspace.

Ok that's possible, but as you cannot use LD_PRELOAD to deal with changing 
ordered into writeback mode, we might as well allow them to disable fsync 
here, because it is in the same use-case.
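
For reference, the userspace LD_PRELOAD approach mentioned above can be
sketched roughly as follows. This is a minimal illustration: the file
name nofsync.c and the choice to also cover fdatasync() are assumptions
of the sketch, and it deliberately does nothing about the ordered vs.
writeback question.

```c
/*
 * nofsync.c: illustrative LD_PRELOAD shim (the file name is arbitrary).
 * Build:  gcc -shared -fPIC -o nofsync.so nofsync.c
 * Use:    LD_PRELOAD=./nofsync.so some_app
 */
#include <unistd.h>

/* Report success without flushing anything to disk. */
int fsync(int fd)
{
	(void)fd;
	return 0;
}

/* Same treatment for fdatasync(), which many apps call instead. */
int fdatasync(int fd)
{
	(void)fd;
	return 0;
}
```

Note that a shim like this also swallows errors such as EBADF, so it is
only suitable for apps that are already known to misuse fsync.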


Thanks!

--
Al



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-25 Thread Al Boldi
Diego Calleja wrote:
> On Thu, 24 Jan 2008 23:36:00 +0300, Al Boldi <[EMAIL PROTECTED]> wrote:
> > Greetings!
> >
> > data=ordered mode has proven reliable over the years, and it does this
> > by ordering filedata flushes before metadata flushes.  But this
> > sometimes causes contention in the order of a 10x slowdown for certain
> > apps, either due to the misuse of fsync or due to inherent behaviour
> > like db's, as well as inherent starvation issues exposed by the
> > data=ordered mode.
>
> There's a related bug in bugzilla:
> http://bugzilla.kernel.org/show_bug.cgi?id=9546
>
> The diagnostic from Jan Kara is different though, but I think it may be
> the same problem...
>
> "One process does data-intensive load. Thus in the ordered mode the
> transaction is tiny but has tons of data buffers attached. If commit
> happens, it takes a long time to sync all the data before the commit
> can proceed... In the writeback mode, we don't wait for data buffers, in
> the journal mode amount of data to be written is really limited by the
> maximum size of a transaction and so we write by much smaller chunks
> and better latency is thus ensured."
>
>
> I'm hitting this bug too...it's surprising that there's not many people
> reporting more bugs about this, because it's really annoying.
>
>
> There's a patch by Jan Kara (that I'm including here because bugzilla
> didn't include it and took me a while to find it) which I don't know if
> it's supposed to fix the problem , but it'd be interesting to try:


Thanks a lot, but it doesn't fix it.

--
Al



Re: [PATCH 063/196] kset: convert /sys/devices to use kset_create

2008-01-25 Thread Greg KH
On Fri, Jan 25, 2008 at 09:40:55PM -0600, Olof Johansson wrote:
> On Thu, Jan 24, 2008 at 11:10:01PM -0800, Greg Kroah-Hartman wrote:
> > Dynamically create the kset instead of declaring it statically.  We also
> > rename devices_subsys to devices_kset to catch all users of the
> > variable.
> 
> Guess what, you broke powerpc again!

I did this ON PURPOSE!!!

The linux-kernel archives hold the details, and I was told by the PPC64
IBM people that they would fix this properly for 2.6.25, and not to hold
back on my changes.  This has been known for many months now.

> [EMAIL PROTECTED]:~/work/linux/k.org $ git grep devices_subsys
> arch/powerpc/kernel/vio.c:extern struct kset devices_subsys; /* needed for vio_find_name() */
> arch/powerpc/kernel/vio.c:  found = kset_find_obj(&devices_subsys, kobj_name);
> 
> Obviously causes build failures, even of ppc64_defconfig.
> 
> (I can unfortunately not boot test, since I lack hardware that uses vio)
> 
> 
> Signed-off-by: Olof Johansson <[EMAIL PROTECTED]>
> 
> diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
> index 19a5656..ee752ab 100644
> --- a/arch/powerpc/kernel/vio.c
> +++ b/arch/powerpc/kernel/vio.c
> @@ -37,7 +37,7 @@
>  #include 
>  #include 
>  
> -extern struct kset devices_subsys; /* needed for vio_find_name() */
> +extern struct kset *devices_kset; /* needed for vio_find_name() */

No, this just papers over the real problem here.  For some reason, the
vio code thinks it is acceptable to walk the whole device tree and match
by a name and just assume that they got the correct device.  You call
this "enterprise grade"?  :)

You need to just put your device on a real bus, and then just walk the
bus.  That's the ONLY way you can guarantee the proper name will return
what you want, and you get the pointer that you really think you are
getting.

There is a reason that devices_kset is not exported, don't make me go
and have to name it something like:
devices_kset_dont_touch_this_or_gregkh_will_make_fun_of_you

Or I'll just mush 3 files in the driver core together and keep the
symbol from being accessible at all.

So no, I'm going to leave the build broken for this code, because that
is what it really is.

Please fix it correctly.

thanks,

greg k-h


Re: [PATCH 06/20 -v5] add notrace annotations for NMI routines

2008-01-25 Thread Steven Rostedt

On Wed, 23 Jan 2008, Mathieu Desnoyers wrote:

> * Steven Rostedt ([EMAIL PROTECTED]) wrote:
> > This annotates NMI functions with notrace. Some tracers may be able
> > to live with this, but some cannot. So we turn off NMI tracing.
> >
> > One solution might be to make a notrace_nmi which would only turn
> > off NMI tracing if a trace utility needed it off.
> >
> Is this still needed with the atomic clocksource read ?
>

Before you ask again, I've still included this in -v6, simply because I
didn't get a chance to test it without this patch. I'll try to remember to
do that on Monday.

-- Steve



Re: [RFC] ext3 freeze feature

2008-01-25 Thread David Chinner
On Fri, Jan 25, 2008 at 09:42:30PM +0900, Takashi Sato wrote:
> >I am also wondering whether we should have system call(s) for these:
> >
> >On Jan 25, 2008 12:59 PM, Takashi Sato <[EMAIL PROTECTED]> wrote:
> >>+   case EXT3_IOC_FREEZE: {
> >
> >>+   case EXT3_IOC_THAW: {
> >
> >And just convert XFS to use them too?
> 
> I think it is reasonable to implement it as the generic system call, as you
> said.  Does XFS folks think so?

Sure.

Note that we can't immediately remove the XFS ioctls otherwise
we'd break userspace utilities that use them

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: 2.6.24-rt1

2008-01-25 Thread Steven Rostedt

On Fri, 25 Jan 2008, Steven Rostedt wrote:
>
> *** NOTICE ***
>
> This still has the old version of the latency tracer. I'll try to
> release a -rt2 soon that has the new version. This way we can see what
> kind of regressions the new version might give.
>

This is taking longer than expected. Removing the old latency tracer has
broken a few things. The latency tracer has been in the RT kernel
for so long that it has hooks in lots of unrelated patches. It's taking a
bit of surgery to remove all the bits without killing the rest.

I will not be working on this over the weekend. I'll start back up on
Monday.

-- Steve



Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-01-25 Thread Erez Zadok
In message <[EMAIL PROTECTED]>, Al Viro writes:
> After grep for locking-related things:
> 
>   * lock_parent(): who said that you won't get dentry moved
> before managing to grab i_mutex on parent?  While we are at it,
> who said that you won't get dentry moved between fetching d_parent
> and doing dget()?  In that case parent could've been _freed_ before
> you get to dget().

OK, so looks like I should use dget_parent() in my lock_parent(), as I've
done elsewhere.  I'll also take a look at all instances in which I get
dentry->d_parent and see if a d_lock is needed there.

>   * in create_parents():
> +   struct inode *inode = lower_dentry->d_inode;
> +   /*
> +* If we get here, it means that we created a new
> +* dentry+inode, but copying permissions failed.
> +* Therefore, we should delete this inode and dput
> +* the dentry so as not to leave cruft behind.
> +*/
> +   if (lower_dentry->d_op && lower_dentry->d_op->d_iput)
> +   lower_dentry->d_op->d_iput(lower_dentry,
> +  inode);
> +   else
> +   iput(inode);
> +   lower_dentry->d_inode = NULL;
> +   dput(lower_dentry);
> +   lower_dentry = ERR_PTR(err);
> +   goto out;
> Really?  So what happens if it had become positive after your test and
> somebody had looked it up in lower layer and just now happens to be
> in the middle of operations on it?  Will be thucking frilled by that...

Good catch.  That ->d_iput call was an old fix to a bug that has since been
fixed more cleanly and generically in our copyup_permission routine and our
unionfs_d_iput.  I've removed the above ->d_iput "if" and tested to verify
that it's indeed unnecessary.

>   * __unionfs_rename():
> +   lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
> +   err = vfs_rename(lower_old_dir_dentry->d_inode, lower_old_dentry,
> +lower_new_dir_dentry->d_inode, lower_new_dentry);
> +   unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
> 
> Uh-huh...  To start with, what guarantees that your lower_old_dentry
> is still a child of your lower_old_dir_dentry?

We dget/dget_parent the old/new dentry and parents a few lines above
(actually, it looked like I forgot to dget(lower_new_dentry) -- fixed).
This is a generic stackable f/s issue: ecryptfs does the same stuff before
calling vfs_rename() on the lower objects.

> What's more, you are
> not checking the result of lock_rename(), i.e. asking for serious trouble.

OK.  I'm now checking for the return from lock_rename for ancestor/rename
rules.  I'm CC'ing Mike Halcrow so he can do the same for ecryptfs.

>   * revalidation stuff: err...  how the devil can it work for
> directories, when there's nothing to prevent changes in underlying
> layers between ->d_revalidate() and operation itself?  For the upper
> layer (unionfs itself) everything's more or less fine, but the rest
> of that...

In a stacked f/s, we keep references to the lower dentries/inodes, so they
can't disappear on us (that happens in our interpose function, called from
our ->lookup).  On entry to every f/s method in unionfs, we first perform
lightweight revalidation of our dentry against the lower ones: we check if
m/ctime changed (users modifying lower files) or if the generation# b/t our
super and our dentries have changed (branch-management took place); if
needed, then we perform a full revalidation of all lower objects (while
holding a lock on the branch configuration).  If we have to do a full reval
upon entry to our ->op, and the reval failed, then we return an appropriate
error; o/w we proceed.  (In certain cases, the VFS re-issues a lookup if the
f/s says that its dentry is invalid.)
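
As a rough illustration of the lightweight check described above, here is
a userspace model. It is not unionfs code: struct lower_stamp, struct
upper_state and needs_full_reval() are hypothetical names invented for
this sketch, and timestamps are reduced to plain seconds.

```c
#include <stdbool.h>

/* Times of a lower object, as last observed (hypothetical model). */
struct lower_stamp {
	long mtime_sec;
	long ctime_sec;
};

/* What the upper (stacked) dentry remembers between operations. */
struct upper_state {
	struct lower_stamp cached;   /* times seen at last revalidation */
	unsigned int dentry_gen;     /* generation recorded in the dentry */
};

/*
 * Lightweight check done on entry to each operation: a full
 * revalidation is needed if the superblock generation moved on (branch
 * management took place) or if the lower object's times changed
 * (someone modified the lower file directly).
 */
static bool needs_full_reval(const struct upper_state *up,
			     const struct lower_stamp *lower_now,
			     unsigned int sb_gen)
{
	if (up->dentry_gen != sb_gen)
		return true;
	return up->cached.mtime_sec != lower_now->mtime_sec ||
	       up->cached.ctime_sec != lower_now->ctime_sec;
}
```

The model leaves out the window between this check and the operation
itself, which is the gap discussed above.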

Without changes to the VFS, I don't see how else I can ensure cache
coherency cleanly, while allowing users to modify lower files; this feature
is very useful to some unionfs users, who depend on it (so even if I could
"lock out" the lower directories from being modified, there will be users
who'd still want to be able to modify lower files).

BTW, my sense of the relationship b/t upper and lower objects and their
validity in a stackable f/s, is that it's similar to the relationship b/t
the NFS client and server -- the client can't be sure that a file on the
server doesn't change b/t ->revalidate and ->op (hence nfs's reliance on dir
mtime checks).

Perhaps this general topic is a good one to discuss at more length at LSF?
Suggestions are welcome.

Thanks,
Erez.

[PATCH 4/4] Unionfs: lock_rename related locking fixes

2008-01-25 Thread Erez Zadok
CC: Mike Halcrow <[EMAIL PROTECTED]>

Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>
---
 fs/unionfs/rename.c |   16 +++-
 1 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/fs/unionfs/rename.c b/fs/unionfs/rename.c
index 9306a2b..5ab13f9 100644
--- a/fs/unionfs/rename.c
+++ b/fs/unionfs/rename.c
@@ -29,6 +29,7 @@ static int __unionfs_rename(struct inode *old_dir, struct dentry *old_dentry,
struct dentry *lower_new_dir_dentry;
struct dentry *lower_wh_dentry;
struct dentry *lower_wh_dir_dentry;
+   struct dentry *trap;
char *wh_name = NULL;
 
lower_new_dentry = unionfs_lower_dentry_idx(new_dentry, bindex);
@@ -95,6 +96,7 @@ static int __unionfs_rename(struct inode *old_dir, struct dentry *old_dentry,
goto out;
 
dget(lower_old_dentry);
+   dget(lower_new_dentry);
lower_old_dir_dentry = dget_parent(lower_old_dentry);
lower_new_dir_dentry = dget_parent(lower_new_dentry);
 
@@ -122,9 +124,20 @@ static int __unionfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 
/* see Documentation/filesystems/unionfs/issues.txt */
lockdep_off();
-   lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
+   trap = lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
+   /* source should not be ancestor of target */
+   if (trap == lower_old_dentry) {
+   err = -EINVAL;
+   goto out_err_unlock;
+   }
+   /* target should not be ancestor of source */
+   if (trap == lower_new_dentry) {
+   err = -ENOTEMPTY;
+   goto out_err_unlock;
+   }
err = vfs_rename(lower_old_dir_dentry->d_inode, lower_old_dentry,
 lower_new_dir_dentry->d_inode, lower_new_dentry);
+out_err_unlock:
unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
lockdep_on();
 
@@ -132,6 +145,7 @@ out_dput:
dput(lower_old_dir_dentry);
dput(lower_new_dir_dentry);
dput(lower_old_dentry);
+   dput(lower_new_dentry);
 
 out:
if (!err) {
-- 
1.5.2.2



[PATCH 2/4] Unionfs: remove unnecessary call to d_iput

2008-01-25 Thread Erez Zadok
This old code was to fix a bug which has long since been fixed in our
copyup_permission and unionfs_d_iput.

Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>
---
 fs/unionfs/copyup.c |   13 -
 1 files changed, 0 insertions(+), 13 deletions(-)

diff --git a/fs/unionfs/copyup.c b/fs/unionfs/copyup.c
index 16b2c7c..8663224 100644
--- a/fs/unionfs/copyup.c
+++ b/fs/unionfs/copyup.c
@@ -807,19 +807,6 @@ begin:
 lower_dentry);
unlock_dir(lower_parent_dentry);
if (err) {
-   struct inode *inode = lower_dentry->d_inode;
-   /*
-* If we get here, it means that we created a new
-* dentry+inode, but copying permissions failed.
-* Therefore, we should delete this inode and dput
-* the dentry so as not to leave cruft behind.
-*/
-   if (lower_dentry->d_op && lower_dentry->d_op->d_iput)
-   lower_dentry->d_op->d_iput(lower_dentry,
-  inode);
-   else
-   iput(inode);
-   lower_dentry->d_inode = NULL;
dput(lower_dentry);
lower_dentry = ERR_PTR(err);
goto out;
-- 
1.5.2.2



[PATCH 3/4] Unionfs: d_parent related locking fixes

2008-01-25 Thread Erez Zadok
Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>
---
 fs/unionfs/copyup.c |3 +--
 fs/unionfs/union.h  |4 ++--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/unionfs/copyup.c b/fs/unionfs/copyup.c
index 8663224..9beac01 100644
--- a/fs/unionfs/copyup.c
+++ b/fs/unionfs/copyup.c
@@ -716,8 +716,7 @@ struct dentry *create_parents(struct inode *dir, struct dentry *dentry,
child_dentry = parent_dentry;
 
/* find the parent directory dentry in unionfs */
-   parent_dentry = child_dentry->d_parent;
-   dget(parent_dentry);
+   parent_dentry = dget_parent(child_dentry);
 
/* find out the lower_parent_dentry in the given branch */
lower_parent_dentry =
diff --git a/fs/unionfs/union.h b/fs/unionfs/union.h
index d324f83..4b4d6c9 100644
--- a/fs/unionfs/union.h
+++ b/fs/unionfs/union.h
@@ -487,13 +487,13 @@ extern int parse_branch_mode(const char *name, int *perms);
 /* locking helpers */
 static inline struct dentry *lock_parent(struct dentry *dentry)
 {
-   struct dentry *dir = dget(dentry->d_parent);
+   struct dentry *dir = dget_parent(dentry);
mutex_lock_nested(&dir->d_inode->i_mutex, I_MUTEX_PARENT);
return dir;
 }
 static inline struct dentry *lock_parent_wh(struct dentry *dentry)
 {
-   struct dentry *dir = dget(dentry->d_parent);
+   struct dentry *dir = dget_parent(dentry);
 
mutex_lock_nested(&dir->d_inode->i_mutex, UNIONFS_DMUTEX_WHITEOUT);
return dir;
-- 
1.5.2.2



[PATCH 1/4] Unionfs: use first writable branch (fix/cleanup)

2008-01-25 Thread Erez Zadok
Cleanup code in ->create, ->symlink, and ->mknod: refactor common code into
helper functions.  Also, this allows writing to multiple branches again,
which was broken by an earlier patch.
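The branch selection described above can be sketched in userspace as follows. This is only an illustrative model, not the unionfs code: `struct branch` and `pick_writable_branch` are hypothetical names, and whiteout handling and copyup are left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace model of a union branch; not a unionfs type. */
struct branch {
    int read_only;   /* nonzero if the branch is read-only */
};

/*
 * Return the index of the first writable branch in [istart, iend].
 * If none is writable and istart > 0, fall back to branch 0.
 * Returns -EROFS if nothing usable is found, -EINVAL on a bad range.
 */
static int pick_writable_branch(const struct branch *br, size_t nbranches,
                                int istart, int iend)
{
    int i;

    assert(br != NULL);
    if (istart < 0 || iend < istart || (size_t)iend >= nbranches)
        return -EINVAL;

    for (i = istart; i <= iend; i++)
        if (!br[i].read_only)
            return i;

    /* Nothing writable in range: try the leftmost branch as a last resort. */
    if (istart > 0 && !br[0].read_only)
        return 0;

    return -EROFS;
}
```

In the real code the fallback to branch 0 may additionally require a copyup, which this sketch does not model.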

Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>
---
 fs/unionfs/inode.c |  395 +---
 1 files changed, 156 insertions(+), 239 deletions(-)

diff --git a/fs/unionfs/inode.c b/fs/unionfs/inode.c
index e15ddb9..0b92da2 100644
--- a/fs/unionfs/inode.c
+++ b/fs/unionfs/inode.c
@@ -18,14 +18,159 @@
 
 #include "union.h"
 
+/*
+ * Helper function when creating new objects (create, symlink, and mknod).
+ * Checks to see if there's a whiteout in @lower_dentry's parent directory,
+ * whose name is taken from @dentry.  Then tries to remove that whiteout, if
+ * found.
+ *
+ * Return 0 if no whiteout was found, or if one was found and successfully
+ * removed (a zero tells the caller that @lower_dentry belongs to a good
+ * branch to create the new object in).  Return -ERRNO if an error occurred
+ * during whiteout lookup or in trying to unlink the whiteout.
+ */
+static int check_for_whiteout(struct dentry *dentry,
+ struct dentry *lower_dentry)
+{
+   int err = 0;
+   struct dentry *wh_dentry = NULL;
+   struct dentry *lower_dir_dentry;
+   char *name = NULL;
+
+   /*
+* check if whiteout exists in this branch, i.e. lookup .wh.foo
+* first.
+*/
+   name = alloc_whname(dentry->d_name.name, dentry->d_name.len);
+   if (unlikely(IS_ERR(name))) {
+   err = PTR_ERR(name);
+   goto out;
+   }
+
+   wh_dentry = lookup_one_len(name, lower_dentry->d_parent,
+  dentry->d_name.len + UNIONFS_WHLEN);
+   if (IS_ERR(wh_dentry)) {
+   err = PTR_ERR(wh_dentry);
+   wh_dentry = NULL;
+   goto out;
+   }
+
+   if (!wh_dentry->d_inode) /* no whiteout exists */
+   goto out;
+
+   /* .wh.foo has been found, so let's unlink it */
+   lower_dir_dentry = lock_parent_wh(wh_dentry);
+   /* see Documentation/filesystems/unionfs/issues.txt */
+   lockdep_off();
+   err = vfs_unlink(lower_dir_dentry->d_inode, wh_dentry);
+   lockdep_on();
+   unlock_dir(lower_dir_dentry);
+
+   /*
+* Whiteouts are special files and should be deleted no matter what
+* (as if they never existed), in order to allow this create
+* operation to succeed.  This is especially important in sticky
+* directories: a whiteout may have been created by one user, but
+* the newly created file may be created by another user.
+* Therefore, in order to maintain Unix semantics, if the vfs_unlink
+* above failed, then we have to try to directly unlink the
+* whiteout.  Note: in the ODF version of unionfs, whiteouts are
+* handled much more cleanly.
+*/
+   if (err == -EPERM) {
+   struct inode *inode = lower_dir_dentry->d_inode;
+   err = inode->i_op->unlink(inode, wh_dentry);
+   }
+   if (err)
+   printk(KERN_ERR "unionfs: could not "
+  "unlink whiteout, err = %d\n", err);
+
+out:
+   dput(wh_dentry);
+   kfree(name);
+   return err;
+}
+
+/*
+ * Find a writeable branch to create new object in.  Checks all writeable
+ * branches of the parent inode, from istart to iend order; if none are
+ * suitable, also tries branch 0 (which may require a copyup).
+ *
+ * Return a lower_dentry we can use to create object in, or ERR_PTR.
+ */
+static struct dentry *find_writeable_branch(struct inode *parent,
+   struct dentry *dentry)
+{
+   int err = -EINVAL;
+   int bindex, istart, iend;
+   struct dentry *lower_dentry = NULL;
+
+   istart = ibstart(parent);
+   iend = ibend(parent);
+   if (istart < 0)
+   goto out;
+
+begin:
+   for (bindex = istart; bindex <= iend; bindex++) {
+   /* skip non-writeable branches */
+   err = is_robranch_super(dentry->d_sb, bindex);
+   if (err) {
+   err = -EROFS;
+   continue;
+   }
+   lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+   if (!lower_dentry)
+   continue;
+   /*
+* check for whiteouts in writeable branch, and remove them
+* if necessary.
+*/
+   err = check_for_whiteout(dentry, lower_dentry);
+   if (err)
+   continue;
+   }
+   /*
+* If istart wasn't already branch 0, and we got any error, then try
+* branch 0 (which may require copyup)
+*/
+   if (err && istart > 0) {
+   istart = iend = 0;
+   goto begin;
+   }
+
+   /*
+* If w

[GIT PULL -mm] 0/4 Unionfs updates/fixes/cleanups

2008-01-25 Thread Erez Zadok

The following is a series of patchsets related to Unionfs.  This is the
fifth set of patchsets resulting from an lkml review of the entire unionfs
code base, in preparation for a merge into mainline.  The most significant
changes here are a few locking related fixes, and a correction to broken
logic which didn't allow writing to the first available writable branch.

These patches were tested (where appropriate) on 2.6.24, MM, as well as the
backports to 2.6.{23,22,21,20,19,18,9} on ext2/3/4, xfs, reiserfs, nfs2/3/4,
jffs2, ramfs, tmpfs, cramfs, and squashfs (where available).  Also tested
with LTP-full and with a continuous parallel kernel compile (while forcing
cache flushing, manipulating lower branches, etc.).  See
http://unionfs.filesystems.org/ to download back-ported unionfs code.

Please pull from the 'master' branch of
git://git.kernel.org/pub/scm/linux/kernel/git/ezk/unionfs.git

to receive the following:

Erez Zadok (4):
  Unionfs: use first writable branch (fix/cleanup)
  Unionfs: remove unnecessary call to d_iput
  Unionfs: d_parent related locking fixes
  Unionfs: lock_rename related locking fixes

 copyup.c |   16 --
 inode.c  |  395 ---
 rename.c |   16 ++
 union.h  |4 
 4 files changed, 174 insertions(+), 257 deletions(-)

---
Erez Zadok
[EMAIL PROTECTED]


[PATCH] block: look up block device path in sysfs

2008-01-25 Thread Dan Williams
Given an fd on a block device, the new BLKGETDEVPATH ioctl returns a string like

/block/sda/sda1

which can be used to find related information in /sys.

Ideally we should have an ioctl that works on char devices as well,
but that seems far from trivial, so it seems reasonable to have
this until the latter can be implemented.
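How the returned string is put together can be sketched in userspace like this. It is a minimal model under stated assumptions: `build_dev_path` is a hypothetical helper, the disk path and partition name are passed in as plain strings, and nothing here calls the proposed ioctl itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Compose "<disk_path>" for a whole disk, or "<disk_path>/<part_name>"
 * for a partition. part_name == NULL means "whole disk".
 * Returns 0 on success, -1 if the result does not fit in out.
 */
static int build_dev_path(char *out, size_t outlen,
                          const char *disk_path, const char *part_name)
{
    int n;

    assert(out != NULL && disk_path != NULL);
    if (part_name)
        n = snprintf(out, outlen, "%s/%s", disk_path, part_name);
    else
        n = snprintf(out, outlen, "%s", disk_path);

    /* snprintf reports the length it wanted; treat truncation as failure. */
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Building the string in one bounded buffer and rejecting truncation keeps the caller from ever seeing a partial path.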

Cc: Jens Axboe <[EMAIL PROTECTED]>
Cc: Neil Brown <[EMAIL PROTECTED]>
Cc: Kay Sievers <[EMAIL PROTECTED]>
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---
Things have been quiet since this was posted about a month ago, and I am
hoping to see this in 2.6.25.  It is based on Neil's BLKGETNAME patch
and is updated with Kay's comments.

Regards,
Dan

 block/compat_ioctl.c |1 +
 block/ioctl.c|   28 
 include/linux/fs.h   |1 +
 3 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/block/compat_ioctl.c b/block/compat_ioctl.c
index cae0a85..d71d287 100644
--- a/block/compat_ioctl.c
+++ b/block/compat_ioctl.c
@@ -784,6 +784,7 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
switch (cmd) {
case HDIO_GETGEO:
return compat_hdio_getgeo(disk, bdev, compat_ptr(arg));
+   case BLKGETDEVPATH:
case BLKFLSBUF:
case BLKROSET:
/*
diff --git a/block/ioctl.c b/block/ioctl.c
index 52d6385..d048ae4 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -229,6 +229,34 @@ int blkdev_ioctl(struct inode *inode, struct file *file, unsigned cmd,
int ret, n;
 
switch(cmd) {
+   case BLKGETDEVPATH: {
+   char *path;
+   char b[BDEVNAME_SIZE];
+   size_t len;
+
+   path = kobject_get_path(&disk->kobj, GFP_KERNEL);
+
+   if (!path)
+   return -ENOMEM;
+
+   len = strlen(path);
+   if (copy_to_user((char __user *)arg, path, len + 1)) {
+   kfree(path);
+   return -EFAULT;
+   }
+   kfree(path);
+
+   if (bdev->bd_contains == bdev)
+   return 0;
+
+   bdevname(bdev, b);
+   if (copy_to_user((char __user *)arg + len, "/", 2))
+   return -EFAULT;
+   if (copy_to_user((char __user *)arg + len + 1, b,
+strlen(b) + 1))
+   return -EFAULT;
+   return 0;
+   }
case BLKFLSBUF:
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3ec4a4..b4cf8f3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -217,6 +217,7 @@ extern int dir_notify_enable;
 #define BLKTRACESTART _IO(0x12,116)
 #define BLKTRACESTOP _IO(0x12,117)
 #define BLKTRACETEARDOWN _IO(0x12,118)
+#define BLKGETDEVPATH _IOR(0x12, 119, char [1024])
 
 #define BMAP_IOCTL 1   /* obsolete - kept for compatibility */
 #define FIBMAP     _IO(0x00,1)  /* bmap access */



Re: [GIT PATCH] driver core patches against 2.6.24

2008-01-25 Thread Rusty Russell
On Saturday 26 January 2008 06:42:19 Greg KH wrote:
> On Fri, Jan 25, 2008 at 10:44:59AM -0800, Linus Torvalds wrote:
> > On Thu, 24 Jan 2008, Greg KH wrote:
> > > Here are a pretty large number of kobject, documentation, and driver
> > > core patches against your 2.6.24 git tree.
> >
> > I've merged it all, but it causes lots of scary warnings:
> >
> >  - from the purely broken ones:
> >
> > ehci_hcd: no version for "struct_module" found: kernel tainted.
>
> Ok, in looking at the code, this should also be showing up for you on a
> "clean" 2.6.24 release, I didn't change anything in this code path.
>
> That is what taints your kernel with the "F" flag.
>
> >  - to the scary ones:
> >
> > sysfs: duplicate filename 'ehci_hcd' can not be created
> > WARNING: at fs/sysfs/dir.c:424 sysfs_add_one()
> > Pid: 610, comm: insmod Tainted: GF   2.6.24-gb47711bf #28
> >
> > Call Trace:
> >  [] sysfs_add_one+0x54/0xbd
> >  [] create_dir+0x4f/0x87
> >  [] sysfs_create_dir+0x35/0x4a
> >  [] kobject_get+0x12/0x17
> >  [] kobject_add_internal+0xd9/0x194
> >  [] kobject_add_varg+0x54/0x61
> >  [] __alloc_pages+0x66/0x2ee
> >  [] kobject_init+0x42/0x82
> >  [] kobject_init_and_add+0x9a/0xa7
> >  [] __vmalloc_area_node+0x111/0x135
> >  [] mod_sysfs_init+0x6e/0x83
> >  [] sys_init_module+0xa3d/0x1833
> >  [] dput+0x1c/0x10b
> >  [] system_call+0x7e/0x83
>
> This is the sysfs core telling you that someone did something stupid :)
>
> Yes, that's new, but the "error" was always there, I just made the
> warning more visible to get people to pay attention to it, and find the
> real errors where this happens (and it has found them, which is a good
> thing.)
>
> But in this case, it doesn't look like the module loading code will
> detect that we are trying to load a module that is already present until
> the kobjects are set up here.  It's been this way for a long time :(
>
> Rusty, any ideas of us adding a different check for "duplicate" modules
> like this earlier in the load_module() function, so we don't spend so
> much effort in building everything up when we don't need to?

module.c:1832 (in load_module)

if (find_module(mod->name)) {
err = -EEXIST;
goto free_mod;
}

That's pretty early, and before this backtrace.

Even for simultaneous loads, there's a mutex which protects from here to the 
list insertion.

Puzzled,
Rusty.
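The check-then-insert pattern Rusty describes can be modelled in userspace roughly as below. This is a sketch only: the fixed-size name table, `module_lock`, and `load_module_sketch` are hypothetical stand-ins, and a pthread mutex takes the place of the kernel mutex.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <string.h>

#define MAX_MODULES 16
#define MAX_NAME    32

/* Hypothetical userspace stand-ins for the kernel's module list and mutex. */
static char loaded[MAX_MODULES][MAX_NAME];
static int nloaded;
static pthread_mutex_t module_lock = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold module_lock. Returns 1 if name is already loaded. */
static int find_module_locked(const char *name)
{
    int i;

    for (i = 0; i < nloaded; i++)
        if (strcmp(loaded[i], name) == 0)
            return 1;
    return 0;
}

/*
 * Check for a duplicate and insert under one lock hold, so two
 * simultaneous loads of the same name cannot both pass the check.
 */
static int load_module_sketch(const char *name)
{
    int err = 0;

    assert(name != NULL);
    pthread_mutex_lock(&module_lock);
    if (find_module_locked(name)) {
        err = -EEXIST;
    } else if (nloaded == MAX_MODULES || strlen(name) >= MAX_NAME) {
        err = -ENOMEM;
    } else {
        strcpy(loaded[nloaded++], name);
    }
    pthread_mutex_unlock(&module_lock);
    return err;
}
```

With the lookup and the insertion inside the same critical section, a second load of the same name should fail with -EEXIST before any further setup, which is why the sysfs warning in the backtrace is puzzling.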


[PATCH 19/23 -v6] Trace irq disabled critical timings

2008-01-25 Thread Steven Rostedt
This patch adds latency tracing for critical timings
(how long interrupts are disabled for).

 "irqsoff" is added to /debugfs/tracing/available_tracers

Note:
  tracing_max_latency
    also holds the max latency for irqsoff (in usecs).
    (defaults to a large number, so one must start latency tracing)

  tracing_thresh
    threshold (in usecs) to always print out if irqs off
    is detected to be longer than stated here.
    If tracing_thresh is non-zero, then tracing_max_latency
    is ignored.
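The interaction between the two knobs can be sketched as a small userspace model. The struct and `should_record` are hypothetical names for illustration and are not taken from the patch; only the rule "a non-zero threshold overrides the running maximum" comes from the note above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of the two knobs; values in usecs. */
struct latency_state {
    unsigned long max_latency;  /* models tracing_max_latency */
    unsigned long thresh;       /* models tracing_thresh; 0 = disabled */
};

/*
 * Decide whether a critical section of delta_us should be recorded.
 * With a non-zero threshold, everything longer than it is recorded
 * and the running maximum is ignored; otherwise only a new maximum
 * is recorded (and becomes the new max_latency).
 */
static int should_record(struct latency_state *st, unsigned long delta_us)
{
    assert(st != NULL);
    if (st->thresh)
        return delta_us > st->thresh;
    if (delta_us > st->max_latency) {
        st->max_latency = delta_us;
        return 1;
    }
    return 0;
}
```

In this model, a max_latency that starts out very large records nothing until it is reset to a small value, which matches the remark that one must start latency tracing explicitly.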

Here's an example of a trace with mcount_enabled = 0

===
preemption latency trace v1.1.5 on 2.6.24-rc7

 latency: 100 us, #3/3, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------
 => started at: _spin_lock_irqsave+0x2a/0xb7
 => ended at:   _spin_unlock_irqrestore+0x32/0x5f

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
 swapper-0     1d.s3    0us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000])
 swapper-0     1d.s3  100us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000])
 swapper-0     1d.s3  100us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f)


vim:ft=help
===


And this is a trace with mcount_enabled == 1


===
preemption latency trace v1.1.5 on 2.6.24-rc7

 latency: 102 us, #12/12, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------
 => started at: _spin_lock_irqsave+0x2a/0xb7
 => ended at:   _spin_unlock_irqrestore+0x32/0x5f

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
 swapper-0     1dNs3    0us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000])
 swapper-0     1dNs3   46us : e1000_read_phy_reg+0x16/0x225 [e1000] (e1000_update_stats+0x5e2/0x64c [e1000])
 swapper-0     1dNs3   46us : e1000_swfw_sync_acquire+0x10/0x99 [e1000] (e1000_read_phy_reg+0x49/0x225 [e1000])
 swapper-0     1dNs3   46us : e1000_get_hw_eeprom_semaphore+0x12/0xa6 [e1000] (e1000_swfw_sync_acquire+0x36/0x99 [e1000])
 swapper-0     1dNs3   47us : __const_udelay+0x9/0x47 (e1000_read_phy_reg+0x116/0x225 [e1000])
 swapper-0     1dNs3   47us+: __delay+0x9/0x50 (__const_udelay+0x45/0x47)
 swapper-0     1dNs3   97us : preempt_schedule+0xc/0x84 (__delay+0x4e/0x50)
 swapper-0     1dNs3   98us : e1000_swfw_sync_release+0xc/0x55 [e1000] (e1000_read_phy_reg+0x211/0x225 [e1000])
 swapper-0     1dNs3   99us+: e1000_put_hw_eeprom_semaphore+0x9/0x35 [e1000] (e1000_swfw_sync_release+0x50/0x55 [e1000])
 swapper-0     1dNs3  101us : _spin_unlock_irqrestore+0xe/0x5f (e1000_update_stats+0x641/0x64c [e1000])
 swapper-0     1dNs3  102us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000])
 swapper-0     1dNs3  102us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f)


vim:ft=help
===


Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/x86/kernel/process_64.c  |3 
 arch/x86/lib/thunk_64.S   |   18 +
 include/asm-x86/irqflags_32.h |4 
 include/asm-x86/irqflags_64.h |4 
 include/linux/irqflags.h  |   37 ++-
 include/linux/mcount.h|   31 ++-
 kernel/fork.c |2 
 kernel/lockdep.c  |   16 +
 lib/tracing/Kconfig   |   18 +
 lib/tracing/Makefile  |1 
 lib/tracing/trace_irqsoff.c   |  415 ++
 lib/tracing/tracer.c  |   59 -
 lib/tracing/tracer.h  |2 
 13 files changed, 575 insertions(+), 35 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/process_64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/process_64.c  2008-01-25 21:46:48.0 -0500
+++ linux-mcount.git/arch/x86/kernel/process_64.c   2008-01-25 21:47:34.0 -0500
@@ -233,7 +233,10 @@ void cpu_idle (void)
 */
local_irq_disable();
enter_idle();
+   /* Don't trace irqs off for idle */
+   stop_critical_timings();
   

[PATCH 01/23 -v6] printk - dont wakeup klogd with interrupts disabled

2008-01-25 Thread Steven Rostedt
[ This patch is added to the series since the wakeup timings trace
  may lock up without it. ]

I thought that one could place a printk anywhere without worrying.
But it seems that it is not wise to place a printk where the runqueue
lock is held.

I just spent two hours debugging why some of my code was locking up,
to find that the lockup was caused by some debugging printk's that
I had in the scheduler.  The printk's were only in rare paths so
they shouldn't be too much of a problem, but after I hit the printk
the system locked up.

Thinking that it was locking up on my code I went looking down the
wrong path. I finally found (after examining an NMI dump) that
the lockup happened because printk was trying to wake up the klogd
daemon, which caused a deadlock when the try_to_wake_up code tries
to grab the runqueue lock.

This patch adds a runqueue_is_locked interface in sched.c for other
files to see if the current runqueue lock is held. This is used
in printk to determine whether it is safe or not to wake up klogd.

And with this patch, my code ran fine ;-)
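The guard described above can be illustrated with a small single-threaded model. Everything here is a hypothetical stand-in: a plain flag plays the role of the runqueue spinlock, and the `_model` functions only mimic the shape of the real ones.

```c
#include <assert.h>

/* Hypothetical single-threaded model: 1 while the "runqueue lock" is held. */
static int rq_locked;
static int klogd_wakeups;

static int runqueue_is_locked_model(void)
{
    return rq_locked;
}

/* Waking a task needs the runqueue lock, so it must not already be held. */
static void wake_up_klogd_model(void)
{
    assert(!rq_locked);   /* taking it again would self-deadlock */
    rq_locked = 1;
    klogd_wakeups++;
    rq_locked = 0;
}

/* Mirrors the guarded wakeup: skip it when the lock is already held. */
static void finish_printk_model(int wake_klogd)
{
    if (wake_klogd && !runqueue_is_locked_model())
        wake_up_klogd_model();
}
```

The cost of the guard is that a message printed under the lock does not wake klogd immediately; in this model the wakeup is simply skipped.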

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 include/linux/sched.h |2 ++
 kernel/printk.c   |   14 ++
 kernel/sched.c|   18 ++
 3 files changed, 30 insertions(+), 4 deletions(-)

Index: linux-mcount.git/kernel/printk.c
===
--- linux-mcount.git.orig/kernel/printk.c   2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/kernel/printk.c2008-01-25 21:46:55.0 -0500
@@ -590,9 +590,11 @@ static int have_callable_console(void)
  * @fmt: format string
  *
  * This is printk().  It can be called from any context.  We want it to work.
- * Be aware of the fact that if oops_in_progress is not set, we might try to
- * wake klogd up which could deadlock on runqueue lock if printk() is called
- * from scheduler code.
+ *
+ * Note: if printk() is called with the runqueue lock held, it will not wake
+ * up the klogd. This is to avoid a deadlock from calling printk() in schedule
+ * with the runqueue lock held and having the wake_up grab the runqueue lock
+ * as well.
  *
  * We try to grab the console_sem.  If we succeed, it's easy - we log the output and
  * call the console drivers.  If we fail to get the semaphore we place the output
@@ -1003,7 +1005,11 @@ void release_console_sem(void)
console_locked = 0;
up(&console_sem);
spin_unlock_irqrestore(&logbuf_lock, flags);
-   if (wake_klogd)
+   /*
+* If we try to wake up klogd while printing with the runqueue lock
+* held, this will deadlock.
+*/
+   if (wake_klogd && !runqueue_is_locked())
wake_up_klogd();
 }
 EXPORT_SYMBOL(release_console_sem);
Index: linux-mcount.git/include/linux/sched.h
===
--- linux-mcount.git.orig/include/linux/sched.h 2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/include/linux/sched.h  2008-01-25 21:46:55.0 -0500
@@ -221,6 +221,8 @@ extern void sched_init_smp(void);
 extern void init_idle(struct task_struct *idle, int cpu);
 extern void init_idle_bootup_task(struct task_struct *idle);
 
+extern int runqueue_is_locked(void);
+
 extern cpumask_t nohz_cpu_mask;
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
 extern int select_nohz_load_balancer(int cpu);
Index: linux-mcount.git/kernel/sched.c
===
--- linux-mcount.git.orig/kernel/sched.c    2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/kernel/sched.c 2008-01-25 21:46:55.0 -0500
@@ -621,6 +621,24 @@ unsigned long rt_needs_cpu(int cpu)
 # define const_debug static const
 #endif
 
+/**
+ * runqueue_is_locked
+ *
+ * Returns true if the current cpu runqueue is locked.
+ * This interface allows printk to be called with the runqueue lock
+ * held and know whether or not it is OK to wake up the klogd.
+ */
+int runqueue_is_locked(void)
+{
+   int cpu = get_cpu();
+   struct rq *rq = cpu_rq(cpu);
+   int ret;
+
+   ret = spin_is_locked(&rq->lock);
+   put_cpu();
+   return ret;
+}
+
 /*
  * Debugging: various feature bits
  */

-- 


[PATCH 11/23 -v6] mcount based trace in the form of a header file library

2008-01-25 Thread Steven Rostedt
This is a simple trace that uses the mcount infrastructure. It is
designed to be fast and small, and easy to use. It is useful to
record things that happen over a very short period of time, and
not to analyze the system in general.

An interface is added to the debugfs

  /debugfs/tracing/

This patch adds the following files:

  available_tracers
 list of available tracers. Currently only "function" is
 available.

  current_tracer
 The trace that is currently active. Empty on start up.
 To switch to a tracer simply echo one of the tracers that
 are listed in available_tracers:

  echo function > /debugfs/tracing/current_tracer


  trace_ctrl
 echoing "1" into this file starts the mcount function tracing
  (if sysctl kernel.mcount_enabled=1)
 echoing "0" turns it off.

  latency_trace
     This file is read-only and holds the result of the trace.

  trace
     This file outputs an easier-to-read version of the trace.

  iter_ctrl
     Controls the way the output of traces looks.
     So far there are two controls:
       echoing in "symonly" will only show the kallsyms symbols
         without the addresses (if kallsyms was configured)
       echoing in "verbose" will change the output to show
         a lot more data, but not in a form that is easy for
         humans to understand.
       echoing in "nosymonly" turns off symonly.
       echoing in "noverbose" turns off verbose.
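The way iter_ctrl tokens toggle output options can be sketched as follows. This is an assumed userspace model, not the parser from the patch: the flag bits, the option table, and `apply_iter_ctrl` are hypothetical, and only the "name" / "no" + "name" convention is taken from the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { ITER_SYM_ONLY = 1 << 0, ITER_VERBOSE = 1 << 1 };  /* hypothetical bits */

static const struct { const char *name; unsigned int bit; } iter_opts[] = {
    { "symonly", ITER_SYM_ONLY },
    { "verbose", ITER_VERBOSE },
};

/*
 * Apply one iter_ctrl-style token to *flags: "<name>" sets the bit,
 * "no<name>" clears it. Returns 0 on success, -1 for an unknown token.
 */
static int apply_iter_ctrl(unsigned int *flags, const char *token)
{
    size_t i;
    int clear = 0;

    assert(flags != NULL && token != NULL);
    if (strncmp(token, "no", 2) == 0) {
        clear = 1;
        token += 2;
    }
    for (i = 0; i < sizeof(iter_opts) / sizeof(iter_opts[0]); i++) {
        if (strcmp(token, iter_opts[i].name) == 0) {
            if (clear)
                *flags &= ~iter_opts[i].bit;
            else
                *flags |= iter_opts[i].bit;
            return 0;
        }
    }
    return -1;
}
```

A table-driven parser like this makes adding a third control a one-line change, at the price of reserving the "no" prefix for negation.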

The output of the function_trace file is as follows

  "echo noverbose > /debugfs/tracing/iter_ctrl"

preemption latency trace v1.1.5 on 2.6.24-rc7-tst

 latency: 0 us, #419428/4361791, CPU#1 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
    -----------------
    | task: -0 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
 swapper-0     0d.h. 1595128us+: set_normalized_timespec+0x8/0x2d  (ktime_get_ts+0x4a/0x4e )
 swapper-0     0d.h. 1595131us+: _spin_lock+0x8/0x18  (hrtimer_interrupt+0x6e/0x1b0 )
Or with verbose turned on:

  "echo verbose > /debugfs/tracing/iter_ctrl"

preemption latency trace v1.1.5 on 2.6.24-rc7-tst

 latency: 0 us, #419428/4361791, CPU#1 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
    -----------------
    | task: -0 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------

 swapper 0 0 9   [f3675f41] 1595.128ms (+0.003ms): set_normalized_timespec+0x8/0x2d  (ktime_get_ts+0x4a/0x4e )
 swapper 0 0 9  0001 [f3675f45] 1595.131ms (+0.003ms): _spin_lock+0x8/0x18  (hrtimer_interrupt+0x6e/0x1b0 )
 swapper 0 0 9  0002 [f3675f48] 1595.135ms (+0.003ms): _spin_lock+0x8/0x18  (hrtimer_interrupt+0x6e/0x1b0 )


The "trace" file is not affected by the verbose mode, but is affected by symonly.

 echo "nosymonly" > /debugfs/tracing/iter_ctrl

tracer:
[   81.479967] CPU 0: bash:3154 register_mcount_function+0x5f/0x66  <-- _spin_unlock_irqrestore+0xe/0x5a
[   81.479967] CPU 0: bash:3154 _spin_unlock_irqrestore+0x3e/0x5a  <-- sub_preempt_count+0xc/0x7a
[   81.479968] CPU 0: bash:3154 sub_preempt_count+0x30/0x7a  <-- in_lock_functions+0x9/0x24
[   81.479968] CPU 0: bash:3154 vfs_write+0x11d/0x155  <-- dnotify_parent+0x12/0x78
[   81.479968] CPU 0: bash:3154 dnotify_parent+0x2d/0x78  <-- _spin_lock+0xe/0x70
[   81.479969] CPU 0: bash:3154 _spin_lock+0x1b/0x70  <-- add_preempt_count+0xe/0x77
[   81.479969] CPU 0: bash:3154 add_preempt_count+0x3e/0x77  <-- in_lock_functions+0x9/0x24


 echo "symonly" > /debugfs/tracing/iter_ctrl

tracer:
[   81.479913] CPU 0: bash:3154 register_mcount_function+0x5f/0x66 <-- _spin_unlock_irqrestore+0xe/0x5a
[   81.479913] CPU 0: bash:3154 _spin_unlock_irqrestore+0x3e/0x5a <-- sub_preempt_count+0xc/0x7a
[   81.479913] CPU 0: bash:3154 sub_preempt_count+0x30/0x7a <-- in_lock_functions+0x9/0x24
[   81.479914] CPU 0: bash:3154 vfs_write+0x11d/0x155 <-- dnotify_parent+0x12/0x78
[   81.479914] CPU 0: bash:3154 dnotify_parent+0x2d/0x78 <-- _spin_lock+0xe/0x70
[   81.479914] CPU 0: bash:3154 _spin_lock+0x1b/0x70 <-- add_preempt_count+0xe/0x77
[   81.479914] CPU 0: bash:3154 add_preempt_count+0x3e/0x77 <-- in_lock_functions+0x9/0x24


Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]>
---
 lib/Makefile |1 
 lib/tracing/Kconfig  |   15 
 lib/tracing/Makefile |3 
 lib/tracing/trace_function.c |   72 ++
 lib/tracing/tracer.c | 1150 +++
 lib/tracing/tracer.h |   96 +++
 6 files changed, 

[PATCH 18/23 -v6] mcount tracer for wakeup latency timings.

2008-01-25 Thread Steven Rostedt
This patch adds hooks to trace the wake up latency of the highest
priority waking task.

  "wakeup" is added to /debugfs/tracing/available_tracers

Also added to /debugfs/tracing

  tracing_max_latency
 holds the current max latency for the wakeup

  wakeup_thresh
 if set to other than zero, a log will be recorded
 for every wakeup that takes longer than the number
 entered in here (usecs for all counters)
 (deletes previous trace)
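The measurement itself can be sketched as two hooks, one at wakeup and one at context switch. This is a userspace model under assumptions: `struct wakeup_trace`, `on_wakeup`, and `on_sched_switch` are hypothetical names, timestamps are passed in by the caller, and a lower prio value is taken to mean higher priority.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: lower prio value means higher priority. */
struct wakeup_trace {
    int           pid;          /* task being timed, -1 if none */
    int           prio;
    unsigned long woke_at;      /* usecs */
    unsigned long max_latency;  /* usecs */
};

/* Start timing a woken task if it outranks the one being timed. */
static void on_wakeup(struct wakeup_trace *t, int pid, int prio,
                      unsigned long now)
{
    assert(t != NULL);
    if (t->pid >= 0 && prio >= t->prio)
        return;
    t->pid = pid;
    t->prio = prio;
    t->woke_at = now;
}

/* On a switch to the timed task, record the latency and stop timing. */
static void on_sched_switch(struct wakeup_trace *t, int next_pid,
                            unsigned long now)
{
    unsigned long delta;

    assert(t != NULL);
    if (t->pid < 0 || next_pid != t->pid)
        return;
    delta = now - t->woke_at;
    if (delta > t->max_latency)
        t->max_latency = delta;
    t->pid = -1;
}
```

Tracking only the highest-priority waking task keeps the hot path to a couple of comparisons; lower-priority wakeups that arrive in the meantime are deliberately ignored in this sketch.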

Examples:

  (with mcount_enabled = 0)


preemption latency trace v1.1.5 on 2.6.24-rc8

 latency: 26 us, #2/2, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: migration/0-3 (uid:0 nice:-5 policy:1 rt_prio:99)
    -----------------

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
   quilt-8551  0d..3    0us+: wake_up_process+0x15/0x17  (sched_exec+0xc9/0x100 )
   quilt-8551  0d..4   26us : sched_switch_callback+0x73/0x81  (schedule+0x483/0x6d5 )


vim:ft=help



  (with mcount_enabled = 1)


preemption latency trace v1.1.5 on 2.6.24-rc8

 latency: 36 us, #45/45, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: migration/1-5 (uid:0 nice:-5 policy:1 rt_prio:99)
    -----------------

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
    bash-10653 1d..3    0us : wake_up_process+0x15/0x17  (sched_exec+0xc9/0x100 )
    bash-10653 1d..3    1us : try_to_wake_up+0x271/0x2e7  (sub_preempt_count+0xc/0x7a )
    bash-10653 1d..2    2us : try_to_wake_up+0x296/0x2e7  (update_rq_clock+0x9/0x20 )
    bash-10653 1d..2    2us : update_rq_clock+0x1e/0x20  (__update_rq_clock+0xc/0x90 )
    bash-10653 1d..2    3us : __update_rq_clock+0x1b/0x90  (sched_clock+0x9/0x29 )
    bash-10653 1d..2    4us : try_to_wake_up+0x2a6/0x2e7  (activate_task+0xc/0x3f )
    bash-10653 1d..2    4us : activate_task+0x2d/0x3f  (enqueue_task+0xe/0x66 )
    bash-10653 1d..2    5us : enqueue_task+0x5b/0x66  (enqueue_task_rt+0x9/0x3c )
    bash-10653 1d..2    6us : try_to_wake_up+0x2ba/0x2e7  (check_preempt_wakeup+0x12/0x99 )
[...]
    bash-10653 1d..5   33us : tracing_record_cmdline+0xcf/0xd4  (_spin_unlock+0x9/0x33 )
    bash-10653 1d..5   34us : _spin_unlock+0x19/0x33  (sub_preempt_count+0xc/0x7a )
    bash-10653 1d..4   35us : wakeup_sched_switch+0x65/0x2ff  (_spin_lock_irqsave+0xc/0xa9 )
    bash-10653 1d..4   35us : _spin_lock_irqsave+0x19/0xa9  (add_preempt_count+0xe/0x77 )
    bash-10653 1d..4   36us : sched_switch_callback+0x73/0x81  (schedule+0x483/0x6d5 )


vim:ft=help


The [...] was added here to not waste your email box space.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 lib/tracing/Kconfig|   14 +
 lib/tracing/Makefile   |1 
 lib/tracing/trace_wakeup.c |  350 +
 lib/tracing/tracer.c   |  131 
 lib/tracing/tracer.h   |6 
 5 files changed, 500 insertions(+), 2 deletions(-)

Index: linux-mcount.git/lib/tracing/Kconfig
===
--- linux-mcount.git.orig/lib/tracing/Kconfig   2008-01-25 21:47:25.0 -0500
+++ linux-mcount.git/lib/tracing/Kconfig    2008-01-25 21:47:32.0 -0500
@@ -9,6 +9,9 @@ config MCOUNT
bool
select FRAME_POINTER
 
+config TRACER_MAX_TRACE
+   bool
+
 config TRACING
 bool
select DEBUG_FS
@@ -25,6 +28,17 @@ config FUNCTION_TRACER
  that the debugging mechanism using this facility will hook by
  providing a set of inline routines.
 
+config WAKEUP_TRACER
+   bool "Trace wakeup latencies"
+   depends on DEBUG_KERNEL
+   select TRACING
+   select CONTEXT_SWITCH_TRACER
+   select TRACER_MAX_TRACE
+   help
+ This tracer adds hooks into scheduling to time the latency
+ of the highest priority task to be scheduled in
+ after it has woken up.
+
 config CONTEXT_SWITCH_TRACER
bool "Trace process context switches"
depends on DEBUG_KERNEL
Index: linux-mcount.git/lib/tracing/Makefile
===
--- linux-m

[PATCH 16/23 -v6] trace generic call to schedule switch

2008-01-25 Thread Steven Rostedt
This patch adds hooks into the schedule switch tracing to
allow other latency traces to hook into the schedule switches.
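The hook chain this patch introduces can be modelled in userspace as a singly linked list of callbacks. The sketch below is a simplified illustration with hypothetical names (`struct switch_ops`, `register_switch`, and so on); it omits the spinlock, the memory barrier, and the single-callback fast path that the real code uses.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*switch_func_t)(void *private_data, int prev, int next);

struct switch_ops {
    switch_func_t      func;
    void              *private_data;
    struct switch_ops *next;
};

static struct switch_ops *ops_list;   /* head of the hook chain */

/* Push a hook onto the front of the chain. */
static void register_switch(struct switch_ops *ops)
{
    assert(ops != NULL && ops->func != NULL);
    ops->next = ops_list;
    ops_list = ops;
}

/* Unlink a hook; returns 0 on success, -1 if it was not registered. */
static int unregister_switch(struct switch_ops *ops)
{
    struct switch_ops **p;

    for (p = &ops_list; *p != NULL; p = &(*p)->next) {
        if (*p == ops) {
            *p = ops->next;
            return 0;
        }
    }
    return -1;
}

/* Called on every (simulated) context switch: run each hook in turn. */
static void call_switch_hooks(int prev, int next)
{
    struct switch_ops *ops;

    for (ops = ops_list; ops != NULL; ops = ops->next)
        ops->func(ops->private_data, prev, next);
}

/* Example hook: count how many switches it has seen. */
static void count_switch(void *private_data, int prev, int next)
{
    (void)prev;
    (void)next;
    ++*(int *)private_data;
}
```

Walking the list through a pointer-to-pointer lets unregister unlink the head and interior nodes with the same code path.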

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 lib/tracing/trace_sched_switch.c |  123 +--
 lib/tracing/tracer.h |   14 
 2 files changed, 119 insertions(+), 18 deletions(-)

Index: linux-mcount.git/lib/tracing/tracer.h
===
--- linux-mcount.git.orig/lib/tracing/tracer.h  2008-01-25 21:47:25.0 -0500
+++ linux-mcount.git/lib/tracing/tracer.h   2008-01-25 21:47:27.0 -0500
@@ -112,4 +112,18 @@ static inline notrace cycle_t now(void)
return get_monotonic_cycles();
 }
 
+#ifdef CONFIG_CONTEXT_SWITCH_TRACER
+typedef void (*tracer_switch_func_t)(void *private,
+struct task_struct *prev,
+struct task_struct *next);
+struct tracer_switch_ops {
+   tracer_switch_func_t func;
+   void *private;
+   struct tracer_switch_ops *next;
+};
+
+extern int register_tracer_switch(struct tracer_switch_ops *ops);
+extern int unregister_tracer_switch(struct tracer_switch_ops *ops);
+#endif /* CONFIG_CONTEXT_SWITCH_TRACER */
+
 #endif /* _LINUX_MCOUNT_TRACER_H */
Index: linux-mcount.git/lib/tracing/trace_sched_switch.c
===
--- linux-mcount.git.orig/lib/tracing/trace_sched_switch.c  2008-01-25 21:47:25.0 -0500
+++ linux-mcount.git/lib/tracing/trace_sched_switch.c   2008-01-25 21:47:27.0 -0500
@@ -16,33 +16,21 @@
 
 static struct tracing_trace *tracer_trace;
 static int trace_enabled __read_mostly;
+static DEFINE_SPINLOCK(sched_switch_func_lock);
 
-static notrace void sched_switch_callback(const struct marker *mdata,
- void *private_data,
- const char *format, ...)
+static void notrace sched_switch_func(void *private,
+ struct task_struct *prev,
+ struct task_struct *next)
 {
-   struct tracing_trace **p = mdata->private;
-   struct tracing_trace *tr = *p;
+   struct tracing_trace **ptr = private;
+   struct tracing_trace *tr = *ptr;
struct tracing_trace_cpu *data;
-   struct task_struct *prev;
-   struct task_struct *next;
unsigned long flags;
-   va_list ap;
int cpu;
 
-   if (likely(!atomic_read(&trace_record_cmdline)))
-   return;
-
-   tracing_record_cmdline(current);
-
if (likely(!trace_enabled))
return;
 
-   va_start(ap, format);
-   prev = va_arg(ap, typeof(prev));
-   next = va_arg(ap, typeof(next));
-   va_end(ap);
-
raw_local_irq_save(flags);
cpu = raw_smp_processor_id();
data = tr->data[cpu];
@@ -55,6 +43,105 @@ static notrace void sched_switch_callbac
raw_local_irq_restore(flags);
 }
 
+static struct tracer_switch_ops sched_switch_ops __read_mostly =
+{
+   .func = sched_switch_func,
+   .private = &tracer_trace,
+};
+
+static tracer_switch_func_t tracer_switch_func __read_mostly =
+   sched_switch_func;
+
+static struct tracer_switch_ops *tracer_switch_func_ops __read_mostly =
+   &sched_switch_ops;
+
+static void notrace sched_switch_func_loop(void *private,
+  struct task_struct *prev,
+  struct task_struct *next)
+{
+   struct tracer_switch_ops *ops = tracer_switch_func_ops;
+
+   for (; ops != NULL; ops = ops->next)
+   ops->func(ops->private, prev, next);
+}
+
+notrace int register_tracer_switch(struct tracer_switch_ops *ops)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(&sched_switch_func_lock, flags);
+   ops->next = tracer_switch_func_ops;
+   smp_wmb();
+   tracer_switch_func_ops = ops;
+
+   if (ops->next == &sched_switch_ops)
+   tracer_switch_func = sched_switch_func_loop;
+
+   spin_unlock_irqrestore(&sched_switch_func_lock, flags);
+
+   return 0;
+}
+
+notrace int unregister_tracer_switch(struct tracer_switch_ops *ops)
+{
+   unsigned long flags;
+   struct tracer_switch_ops **p = &tracer_switch_func_ops;
+   int ret;
+
+   spin_lock_irqsave(&sched_switch_func_lock, flags);
+
+   /*
+* If the sched_switch is the only one left, then
+*  only call that function.
+*/
+   if (*p == ops && ops->next == &sched_switch_ops) {
+   tracer_switch_func = sched_switch_func;
+   tracer_switch_func_ops = &sched_switch_ops;
+   goto out;
+   }
+
+   for (; *p != &sched_switch_ops; p = &(*p)->next)
+   if (*p == ops)
+   break;
+
+   if (*p != ops) {
+   ret = -1;
+   goto out;
+ 

[PATCH 15/23 -v6] Generic command line storage

2008-01-25 Thread Steven Rostedt
Saving the comm of tasks for each trace is very expensive.
This patch adds to the context switch hook a way to store
the command lines of the most recently seen tasks (up to
SAVED_CMDLINES, which is 128 in this patch). This table is
examined when a trace is to be printed.

Note: The comm may be destroyed if other traces are performed.
Later (TBD) patches may simply store this information in the trace
itself.
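
As an illustration of the table described above, here is a minimal
user-space sketch. It is not the patch code: the sizes are shrunk, there is
no locking, and the names pid_to_slot, slot_to_pid, cmdline_save and
cmdline_find are made up for the example. The "<idle>" and "<...>" fallback
strings are also assumptions.

```c
#include <string.h>

#define PID_MAX   1024          /* assumption: small PID space for the sketch */
#define SAVED     4             /* assumption: the patch uses 128 entries */
#define COMM_LEN  16
#define NO_MAP    ((unsigned)-1)

static unsigned pid_to_slot[PID_MAX + 1];  /* pid  -> slot, or NO_MAP */
static unsigned slot_to_pid[SAVED];        /* slot -> pid,  or NO_MAP */
static char saved_comm[SAVED][COMM_LEN];
static unsigned next_slot;

static void cmdline_init(void)
{
	/* memset with -1 sets every byte to 0xff, so each entry becomes NO_MAP */
	memset(pid_to_slot, -1, sizeof(pid_to_slot));
	memset(slot_to_pid, -1, sizeof(slot_to_pid));
	next_slot = 0;
}

static void cmdline_save(unsigned pid, const char *comm)
{
	unsigned slot;

	if (pid == 0 || pid > PID_MAX)
		return;

	slot = pid_to_slot[pid];
	if (slot >= SAVED) {
		/* no slot yet: take the next one round-robin and evict its owner */
		slot = next_slot;
		next_slot = (next_slot + 1) % SAVED;
		if (slot_to_pid[slot] <= PID_MAX)
			pid_to_slot[slot_to_pid[slot]] = NO_MAP;
		slot_to_pid[slot] = pid;
		pid_to_slot[pid] = slot;
	}

	strncpy(saved_comm[slot], comm, COMM_LEN - 1);
	saved_comm[slot][COMM_LEN - 1] = '\0';
}

static const char *cmdline_find(unsigned pid)
{
	if (pid == 0)
		return "<idle>";
	if (pid > PID_MAX || pid_to_slot[pid] >= SAVED)
		return "<...>";         /* evicted or never recorded */
	return saved_comm[pid_to_slot[pid]];
}
```

The two maps keep lookup and eviction O(1): printing a trace only needs
pid_to_slot, and reusing a slot only needs slot_to_pid to invalidate the
previous owner.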

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 lib/tracing/Kconfig  |1 
 lib/tracing/trace_function.c |2 
 lib/tracing/trace_sched_switch.c |7 ++
 lib/tracing/tracer.c |  104 +--
 lib/tracing/tracer.h |6 +-
 5 files changed, 114 insertions(+), 6 deletions(-)

Index: linux-mcount.git/lib/tracing/Kconfig
===
--- linux-mcount.git.orig/lib/tracing/Kconfig   2008-01-25 21:47:23.0 
-0500
+++ linux-mcount.git/lib/tracing/Kconfig2008-01-25 21:47:25.0 
-0500
@@ -18,6 +18,7 @@ config FUNCTION_TRACER
depends on DEBUG_KERNEL && HAVE_MCOUNT
select MCOUNT
select TRACING
+   select CONTEXT_SWITCH_TRACER
help
  Use profiler instrumentation, adding -pg to CFLAGS. This will
  insert a call to an architecture specific __mcount routine,
Index: linux-mcount.git/lib/tracing/trace_function.c
===
--- linux-mcount.git.orig/lib/tracing/trace_function.c  2008-01-25 
21:47:17.0 -0500
+++ linux-mcount.git/lib/tracing/trace_function.c   2008-01-25 
21:47:25.0 -0500
@@ -28,11 +28,13 @@ static notrace void function_reset(struc
 static notrace void start_function_trace(struct tracing_trace *tr)
 {
function_reset(tr);
+   atomic_inc(&trace_record_cmdline);
tracing_start_function_trace();
 }
 
 static notrace void stop_function_trace(struct tracing_trace *tr)
 {
+   atomic_dec(&trace_record_cmdline);
tracing_stop_function_trace();
 }
 
Index: linux-mcount.git/lib/tracing/trace_sched_switch.c
===
--- linux-mcount.git.orig/lib/tracing/trace_sched_switch.c  2008-01-25 
21:47:23.0 -0500
+++ linux-mcount.git/lib/tracing/trace_sched_switch.c   2008-01-25 
21:47:25.0 -0500
@@ -30,6 +30,11 @@ static notrace void sched_switch_callbac
va_list ap;
int cpu;
 
+   if (likely(!atomic_read(&trace_record_cmdline)))
+   return;
+
+   tracing_record_cmdline(current);
+
if (likely(!trace_enabled))
return;
 
@@ -62,6 +67,7 @@ static notrace void sched_switch_reset(s
 
 static notrace void start_sched_trace(struct tracing_trace *tr)
 {
+   atomic_inc(&trace_record_cmdline);
sched_switch_reset(tr);
trace_enabled = 1;
 }
@@ -69,6 +75,7 @@ static notrace void start_sched_trace(st
 static notrace void stop_sched_trace(struct tracing_trace *tr)
 {
trace_enabled = 0;
+   atomic_dec(&trace_record_cmdline);
 }
 
 static notrace void sched_switch_trace_init(struct tracing_trace *tr)
Index: linux-mcount.git/lib/tracing/tracer.c
===
--- linux-mcount.git.orig/lib/tracing/tracer.c  2008-01-25 21:47:23.0 
-0500
+++ linux-mcount.git/lib/tracing/tracer.c   2008-01-25 21:47:25.0 
-0500
@@ -169,6 +169,88 @@ void tracing_stop_function_trace(void)
unregister_mcount_function(&trace_ops);
 }
 
+#define SAVED_CMDLINES 128
+static unsigned map_pid_to_cmdline[PID_MAX_DEFAULT+1];
+static unsigned map_cmdline_to_pid[SAVED_CMDLINES];
+static char saved_cmdlines[SAVED_CMDLINES][TASK_COMM_LEN];
+static int cmdline_idx;
+static DEFINE_SPINLOCK(trace_cmdline_lock);
+atomic_t trace_record_cmdline;
+atomic_t trace_record_cmdline_disabled;
+
+static void trace_init_cmdlines(void)
+{
+   memset(&map_pid_to_cmdline, -1, sizeof(map_pid_to_cmdline));
+   memset(&map_cmdline_to_pid, -1, sizeof(map_cmdline_to_pid));
+   cmdline_idx = 0;
+}
+
+notrace void trace_stop_cmdline_recording(void);
+
+static void notrace trace_save_cmdline(struct task_struct *tsk)
+{
+   unsigned map;
+   unsigned idx;
+
+   if (!tsk->pid || unlikely(tsk->pid > PID_MAX_DEFAULT))
+   return;
+
+   /*
+* It's not the end of the world if we don't get
+* the lock, but we also don't want to spin
+* nor do we want to disable interrupts,
+* so if we miss here, then better luck next time.
+*/
+   if (!spin_trylock(&trace_cmdline_lock))
+   return;
+
+   idx = map_pid_to_cmdline[tsk->pid];
+   if (idx >= SAVED_CMDLINES) {
+   idx = (cmdline_idx + 1) % SAVED_CMDLINES;
+
+   map = map_cmdline_to_pid[idx];
+   if (map <= PID_MAX_DEFAULT)
+   map_pid_to_cmdline[map] = (unsigned)-1;
+
+   map_pid_to_cmdline[tsk->

[PATCH 21/23 -v6] Add markers to various events

2008-01-25 Thread Steven Rostedt
This patch adds markers to various events in the kernel.
(interrupts, task activation and hrtimers)

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/x86/kernel/apic_32.c  |2 ++
 arch/x86/kernel/irq_32.c   |1 +
 arch/x86/kernel/irq_64.c   |2 ++
 arch/x86/kernel/traps_32.c |2 ++
 arch/x86/kernel/traps_64.c |2 ++
 arch/x86/mm/fault_32.c |3 +++
 arch/x86/mm/fault_64.c |3 +++
 kernel/hrtimer.c   |7 +++
 kernel/sched.c |   11 +++
 9 files changed, 33 insertions(+)

Index: linux-mcount.git/arch/x86/kernel/apic_32.c
===
--- linux-mcount.git.orig/arch/x86/kernel/apic_32.c 2008-01-25 
21:47:15.0 -0500
+++ linux-mcount.git/arch/x86/kernel/apic_32.c  2008-01-25 21:47:38.0 
-0500
@@ -581,6 +581,8 @@ notrace fastcall void smp_apic_timer_int
 {
struct pt_regs *old_regs = set_irq_regs(regs);
 
+   trace_mark(arch_apic_timer, "ip %lx", regs->eip);
+
/*
 * NOTE! We'd better ACK the irq immediately,
 * because timer handling can be slow.
Index: linux-mcount.git/arch/x86/kernel/irq_32.c
===
--- linux-mcount.git.orig/arch/x86/kernel/irq_32.c  2008-01-25 
21:46:48.0 -0500
+++ linux-mcount.git/arch/x86/kernel/irq_32.c   2008-01-25 21:47:38.0 
-0500
@@ -85,6 +85,7 @@ fastcall unsigned int do_IRQ(struct pt_r
 
old_regs = set_irq_regs(regs);
irq_enter();
+   trace_mark(arch_do_irq, "ip %lx irq %d", regs->eip, irq);
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
/* Debugging check for stack overflow: is there less than 1KB free? */
{
Index: linux-mcount.git/arch/x86/kernel/irq_64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/irq_64.c  2008-01-25 
21:46:48.0 -0500
+++ linux-mcount.git/arch/x86/kernel/irq_64.c   2008-01-25 21:47:38.0 
-0500
@@ -149,6 +149,8 @@ asmlinkage unsigned int do_IRQ(struct pt
irq_enter();
irq = __get_cpu_var(vector_irq)[vector];
 
+   trace_mark(arch_do_irq, "ip %lx irq %d", regs->rip, irq);
+
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
stack_overflow_check(regs);
 #endif
Index: linux-mcount.git/arch/x86/kernel/traps_32.c
===
--- linux-mcount.git.orig/arch/x86/kernel/traps_32.c2008-01-25 
21:47:08.0 -0500
+++ linux-mcount.git/arch/x86/kernel/traps_32.c 2008-01-25 21:47:38.0 
-0500
@@ -769,6 +769,8 @@ notrace fastcall __kprobes void do_nmi(s
 
nmi_enter();
 
+   trace_mark(arch_do_nmi, "ip %lx flags %lx", regs->eip, regs->eflags);
+
cpu = smp_processor_id();
 
++nmi_count(cpu);
Index: linux-mcount.git/arch/x86/kernel/traps_64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/traps_64.c2008-01-25 
21:46:48.0 -0500
+++ linux-mcount.git/arch/x86/kernel/traps_64.c 2008-01-25 21:47:38.0 
-0500
@@ -782,6 +782,8 @@ asmlinkage __kprobes void default_do_nmi
 
cpu = smp_processor_id();
 
+   trace_mark(arch_do_nmi, "ip %lx flags %lx", regs->rip, regs->eflags);
+
/* Only the BSP gets external NMIs from the system.  */
if (!cpu)
reason = get_nmi_reason();
Index: linux-mcount.git/arch/x86/mm/fault_32.c
===
--- linux-mcount.git.orig/arch/x86/mm/fault_32.c2008-01-25 
21:46:48.0 -0500
+++ linux-mcount.git/arch/x86/mm/fault_32.c 2008-01-25 21:47:38.0 
-0500
@@ -311,6 +311,9 @@ fastcall void __kprobes do_page_fault(st
/* get the address */
 address = read_cr2();
 
+   trace_mark(arch_do_page_fault, "ip %lx err %lx addr %lx",
+  regs->eip, error_code, address);
+
tsk = current;
 
si_code = SEGV_MAPERR;
Index: linux-mcount.git/arch/x86/mm/fault_64.c
===
--- linux-mcount.git.orig/arch/x86/mm/fault_64.c2008-01-25 
21:46:48.0 -0500
+++ linux-mcount.git/arch/x86/mm/fault_64.c 2008-01-25 21:47:38.0 
-0500
@@ -316,6 +316,9 @@ asmlinkage void __kprobes do_page_fault(
/* get the address */
address = read_cr2();
 
+   trace_mark(arch_do_page_fault, "ip %lx err %lx addr %lx",
+  regs->rip, error_code, address);
+
info.si_code = SEGV_MAPERR;
 
 
Index: linux-mcount.git/kernel/hrtimer.c
===
--- linux-mcount.git.orig/kernel/hrtimer.c  2008-01-25 21:46:48.0 
-0500
+++ linux-mcount.git/kernel/hrtimer.c   2008-01-25 21:47:38.0 -0500
@@ -709,6 +709,8 @@ static void enqueue_hrtimer(struct hrtim
struct hrtimer *entry;
int leftmost = 1

[PATCH 22/23 -v6] Add event tracer.

2008-01-25 Thread Steven Rostedt
This patch adds an event tracer that hooks into various events
in the kernel. Although it can be used separately, it is mainly
meant to help other tracers (wakeup and preempt off) show various
events in their traces without having to enable the heavy mcount
hooks.

Here's an example:


             _------=> CPU#
            / _-----=> irqs-off
           | / _----=> need-resched
           || / _---=> hardirq/softirq
           ||| / _--=> preempt-depth
           |||| /
           |||||     delay
   cmd pid ||||| time  |   caller
  \   /    |||||   \   |   /
bash-2857  1d..3  180us : try_to_wake_up+0xdc/0x172  (0 0)
bash-2857  1d..3  181us+: activate_task+0x7d/0xb9  (0 2)
bash-2857  1d..2  192us!: enqueue_hrtimer+0x55/0x156(a4bf49c2e 81000
10b3ce8)
bash-2857  1d..3  331us+: deactivate_task+0x7c/0xa8  (0 3)
bash-2857  1d..3  334us+:  2857:120:S --> 2849:120
sshd-2849  1d..3  338us+: enqueue_hrtimer+0x55/0x156(a4c2a94a7 81000
10b3ce8)
sshd-2849  1d..3  370us : try_to_wake_up+0xdc/0x172  (0 0)
sshd-2849  1d..3  370us+: activate_task+0x7d/0xb9  (0 2)
sshd-2849  1d..2  380us+: enqueue_hrtimer+0x55/0x156(a4c0cae6f 81000
10b3ce8)


Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 lib/tracing/Kconfig |   12 +
 lib/tracing/Makefile|1 
 lib/tracing/trace_events.c  |  472 
 lib/tracing/trace_irqsoff.c |6 
 lib/tracing/trace_wakeup.c  |   13 +
 lib/tracing/tracer.c|  154 ++
 lib/tracing/tracer.h|   62 +
 7 files changed, 678 insertions(+), 42 deletions(-)

Index: linux-mcount.git/lib/tracing/trace_events.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-mcount.git/lib/tracing/trace_events.c 2008-01-25 21:47:40.0 
-0500
@@ -0,0 +1,472 @@
+/*
+ * trace task events
+ *
+ * Copyright (C) 2007 Steven Rostedt <[EMAIL PROTECTED]>
+ *
+ * Based on code from the latency_tracer, that is:
+ *
+ *  Copyright (C) 2004-2006 Ingo Molnar
+ *  Copyright (C) 2004 William Lee Irwin III
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "tracer.h"
+
+static struct tracing_trace *tracer_trace __read_mostly;
+static int trace_enabled __read_mostly;
+
+static void notrace event_reset(struct tracing_trace *tr)
+{
+   struct tracing_trace_cpu *data;
+   int cpu;
+
+   tr->time_start = now();
+
+   for_each_possible_cpu(cpu) {
+   data = tr->data[cpu];
+   tracing_reset(data);
+   }
+}
+
+static void notrace event_trace_sched_switch(void *private,
+struct task_struct *prev,
+struct task_struct *next)
+{
+   struct tracing_trace **ptr = private;
+   struct tracing_trace *tr = *ptr;
+   struct tracing_trace_cpu *data;
+   unsigned long flags;
+   int cpu;
+
+   if (!trace_enabled || !tr)
+   return;
+
+   local_irq_save(flags);
+   cpu = raw_smp_processor_id();
+   data = tr->data[cpu];
+
+   atomic_inc(&data->disabled);
+   if (atomic_read(&data->disabled) != 1)
+   goto out;
+
+   tracing_sched_switch_trace(tr, data, prev, next, flags);
+
+ out:
+   atomic_dec(&data->disabled);
+   local_irq_restore(flags);
+}
+
+static struct tracer_switch_ops switch_ops __read_mostly = {
+   .func = event_trace_sched_switch,
+   .private = &tracer_trace,
+};
+
+notrace int trace_event_enabled(void)
+{
+   return trace_enabled && tracer_trace;
+}
+
+/* Taken from sched.c */
+#define __PRIO(prio) \
+   ((prio) <= 99 ? 199 - (prio) : (prio) - 120)
+
+#define PRIO(p) __PRIO((p)->prio)
+
+notrace void trace_event_wakeup(unsigned long ip,
+   struct task_struct *p,
+   struct task_struct *curr)
+{
+   struct tracing_trace *tr = tracer_trace;
+   struct tracing_trace_cpu *data;
+   unsigned long flags;
+   int cpu;
+
+   if (!trace_enabled || !tr)
+   return;
+
+   local_irq_save(flags);
+   cpu = raw_smp_processor_id();
+   data = tr->data[cpu];
+
+   atomic_inc(&data->disabled);
+   if (atomic_read(&data->disabled) != 1)
+   goto out;
+
+   /* record process's command line */
+   tracing_record_cmdline(p);
+   tracing_record_cmdline(curr);
+   tracing_trace_pid(tr, data, flags, ip, p->pid, PRIO(p), PRIO(curr));
+
+ out:
+   atomic_dec(&data->disabled);
+   local_irq_restore(flags);
+}
+
+struct event_probes {
+   const char *name;
+   const char *fmt;
+   void (*func)(const struct event_probes *probe,
+struct tracing_trace *tr,
+struct tracing_trace_cpu

[PATCH 13/23 -v6] Make the task State char-string visible to all

2008-01-25 Thread Steven Rostedt
The tracer wants to be able to convert the state number
into a user-visible character. This patch pulls that conversion
string out of the scheduler into the header. This way, if it were to
ever change, other parts of the kernel will know.
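
For illustration, a tracer could use the shared string roughly as sketched
below. This assumes the state value is a bitmask in which 0 means running
and the lowest set bit selects the character, which is my reading of how
sched_show_task() indexes stat_nam. The helper name task_state_char is
made up for the example.

```c
#define TASK_STATE_TO_CHAR_STR "RSDTtZX"

/* Hypothetical helper: map a task state bitmask to its display character. */
static char task_state_char(unsigned long state)
{
	static const char names[] = TASK_STATE_TO_CHAR_STR;
	unsigned idx = 0;

	/* state 0 is running; otherwise idx = (lowest set bit position) + 1 */
	while (state) {
		idx++;
		if (state & 1)
			break;
		state >>= 1;
	}

	/* states beyond the end of the string are shown as '?' */
	return idx < sizeof(names) - 1 ? names[idx] : '?';
}
```

With this mapping a state of 0 prints as 'R', 1 as 'S' and 2 as 'D'.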

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 include/linux/sched.h |2 ++
 kernel/sched.c|2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-mcount.git/include/linux/sched.h
===
--- linux-mcount.git.orig/include/linux/sched.h 2008-01-25 21:46:55.0 
-0500
+++ linux-mcount.git/include/linux/sched.h  2008-01-25 21:47:21.0 
-0500
@@ -2055,6 +2055,8 @@ static inline void migration_init(void)
 }
 #endif
 
+#define TASK_STATE_TO_CHAR_STR "RSDTtZX"
+
 #endif /* __KERNEL__ */
 
 #endif
Index: linux-mcount.git/kernel/sched.c
===
--- linux-mcount.git.orig/kernel/sched.c2008-01-25 21:47:19.0 
-0500
+++ linux-mcount.git/kernel/sched.c 2008-01-25 21:47:21.0 -0500
@@ -5149,7 +5149,7 @@ out_unlock:
return retval;
 }
 
-static const char stat_nam[] = "RSDTtZX";
+static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
 
 void sched_show_task(struct task_struct *p)
 {

-- 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 20/23 -v6] trace preempt off critical timings

2008-01-25 Thread Steven Rostedt
Add preempt off timings. A lot of kernel core code is taken from the RT patch
latency trace that was written by Ingo Molnar.

This adds "preemptoff" and "preemptirqsoff" to 
/debugfs/tracing/available_tracers

Now instead of just tracing irqs off, preemption off can be selected
to be recorded.

When this is selected, it shares the same files as irqs off timings.
One can trace preemption off, irqs off, or either of the two being off.

By echoing "preemptoff" into /debugfs/tracing/current_tracer, only
preempt off time is recorded. "irqsoff" will only record the time
irqs are disabled, but "preemptirqsoff" will take the total time irqs
or preemption are disabled. Runtime switching of these options is now
supported by simply echoing the appropriate tracer name into
/debugfs/tracing/current_tracer.
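
The selection logic can be pictured with the following user-space sketch.
The two flag values are the ones the patch adds to trace_irqsoff.c, but the
helpers trace_type_for and in_critical_section are hypothetical and only
model the behaviour described above.

```c
#include <string.h>

enum {
	TRACER_IRQS_OFF    = (1 << 1),
	TRACER_PREEMPT_OFF = (1 << 2),
};

/* Hypothetical: translate a tracer name into the trace_type flags. */
static int trace_type_for(const char *name)
{
	if (strcmp(name, "irqsoff") == 0)
		return TRACER_IRQS_OFF;
	if (strcmp(name, "preemptoff") == 0)
		return TRACER_PREEMPT_OFF;
	if (strcmp(name, "preemptirqsoff") == 0)
		return TRACER_IRQS_OFF | TRACER_PREEMPT_OFF;
	return 0;
}

/* Hypothetical: is the current context inside a section being timed? */
static int in_critical_section(int trace_type, int irqs_disabled,
			       int preempt_count)
{
	return ((trace_type & TRACER_IRQS_OFF) && irqs_disabled) ||
	       ((trace_type & TRACER_PREEMPT_OFF) && preempt_count > 0);
}
```

With "preemptirqsoff" either condition keeps the timing running, which is
why that tracer reports the total time irqs or preemption are disabled.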

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/x86/kernel/process_32.c |3 
 include/linux/irqflags.h |3 
 include/linux/mcount.h   |8 +
 include/linux/preempt.h  |2 
 kernel/sched.c   |   24 +
 lib/tracing/Kconfig  |   25 +
 lib/tracing/Makefile |1 
 lib/tracing/trace_irqsoff.c  |  181 +++
 8 files changed, 195 insertions(+), 52 deletions(-)

Index: linux-mcount.git/lib/tracing/Kconfig
===
--- linux-mcount.git.orig/lib/tracing/Kconfig   2008-01-25 21:47:34.0 
-0500
+++ linux-mcount.git/lib/tracing/Kconfig2008-01-25 21:47:36.0 
-0500
@@ -46,6 +46,31 @@ config CRITICAL_IRQSOFF_TIMING
 
  echo 0 > /debugfs/tracing/tracing_max_latency
 
+ (Note that kernel size and overhead increases with this option
+ enabled. This option and the preempt-off timing option can be
+ used together or separately.)
+
+config CRITICAL_PREEMPT_TIMING
+   bool "Preemption-off critical section latency timing"
+   default n
+   depends on GENERIC_TIME
+   depends on PREEMPT
+   select TRACING
+   select TRACER_MAX_TRACE
+   help
+ This option measures the time spent in preemption off critical
+ sections, with microsecond accuracy.
+
+ The default measurement method is a maximum search, which is
+ disabled by default and can be runtime (re-)started
+ via:
+
+ echo 0 > /debugfs/tracing/tracing_max_latency
+
+ (Note that kernel size and overhead increases with this option
+ enabled. This option and the irqs-off timing option can be
+ used together or separately.)
+
 config WAKEUP_TRACER
bool "Trace wakeup latencies"
depends on DEBUG_KERNEL
Index: linux-mcount.git/lib/tracing/Makefile
===
--- linux-mcount.git.orig/lib/tracing/Makefile  2008-01-25 21:47:34.0 
-0500
+++ linux-mcount.git/lib/tracing/Makefile   2008-01-25 21:47:36.0 
-0500
@@ -4,6 +4,7 @@ obj-$(CONFIG_TRACING) += tracer.o
 obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o
 obj-$(CONFIG_FUNCTION_TRACER) += trace_function.o
 obj-$(CONFIG_CRITICAL_IRQSOFF_TIMING) += trace_irqsoff.o
+obj-$(CONFIG_CRITICAL_PREEMPT_TIMING) += trace_irqsoff.o
 obj-$(CONFIG_WAKEUP_TRACER) += trace_wakeup.o
 
 libmcount-y := mcount.o
Index: linux-mcount.git/lib/tracing/trace_irqsoff.c
===
--- linux-mcount.git.orig/lib/tracing/trace_irqsoff.c   2008-01-25 
21:47:34.0 -0500
+++ linux-mcount.git/lib/tracing/trace_irqsoff.c2008-01-25 
21:47:36.0 -0500
@@ -21,6 +21,34 @@ static struct tracing_trace *tracer_trac
 static __cacheline_aligned_in_smp DEFINE_MUTEX(max_mutex);
 static int trace_enabled __read_mostly;
 
+static DEFINE_PER_CPU(int, tracing_cpu);
+
+enum {
+   TRACER_IRQS_OFF = (1 << 1),
+   TRACER_PREEMPT_OFF  = (1 << 2),
+};
+
+static int trace_type __read_mostly;
+
+#ifdef CONFIG_CRITICAL_PREEMPT_TIMING
+# define preempt_trace() \
+   ((trace_type & TRACER_PREEMPT_OFF) && preempt_count())
+#else
+# define preempt_trace() (0)
+#endif
+
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+# define irq_trace()   \
+   ((trace_type & TRACER_IRQS_OFF) &&  \
+({ \
+unsigned long __flags; \
+local_save_flags(__flags); \
+irqs_disabled_flags(__flags);  \
+}))
+#else
+# define irq_trace() (0)
+#endif
+
 /*
  * Sequence count - we record it when starting a measurement and
  * skip the latency if the sequence has changed - some other section
@@ -41,14 +69,11 @@ static void notrace irqsoff_trace_call(u
unsigned long flags;
int cpu;
 
-   if (likely(!trace_enabled))
+   if (likely(!__get_cpu_var(tracing_cpu)))
return;
 
local_save_flags(flags);
 
-   if (!irqs_disable

[PATCH 06/23 -v6] add notrace annotations for NMI routines

2008-01-25 Thread Steven Rostedt
This annotates NMI functions with notrace. Some tracers may be able
to cope with tracing NMI routines, but some cannot. So we turn off
NMI tracing.

One solution might be to make a notrace_nmi which would only turn
off NMI tracing if a trace utility needed it off.
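
For readers unfamiliar with the annotation, here is a minimal sketch,
assuming notrace expands to GCC's no_instrument_function attribute (that
is how I understand the series defines it; the definition is not part of
this mail). The function nmi_tick_sketch is a made-up stand-in for an
NMI-path routine.

```c
/* Assumption: this mirrors the series' definition of notrace. */
#define notrace __attribute__((no_instrument_function))

static unsigned long nmi_ticks;

/*
 * Hypothetical NMI-path routine. With the attribute, the compiler does
 * not emit profiling calls for this function, so a tracer hook is never
 * entered from it.
 */
static notrace unsigned long nmi_tick_sketch(void)
{
	return ++nmi_ticks;
}
```

The annotation changes nothing about what the function does; it only
removes the instrumentation call on entry.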

Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

---
 arch/x86/kernel/nmi_32.c   |2 +-
 arch/x86/kernel/nmi_64.c   |2 +-
 arch/x86/kernel/traps_32.c |4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/nmi_32.c
===
--- linux-mcount.git.orig/arch/x86/kernel/nmi_32.c  2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/nmi_32.c   2008-01-25 21:47:08.0 
-0500
@@ -318,7 +318,7 @@ EXPORT_SYMBOL(touch_nmi_watchdog);
 
 extern void die_nmi(struct pt_regs *, const char *msg);
 
-__kprobes int nmi_watchdog_tick(struct pt_regs * regs, unsigned reason)
+notrace __kprobes int nmi_watchdog_tick(struct pt_regs *regs, unsigned reason)
 {
 
/*
Index: linux-mcount.git/arch/x86/kernel/nmi_64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/nmi_64.c  2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/nmi_64.c   2008-01-25 21:47:08.0 
-0500
@@ -314,7 +314,7 @@ void touch_nmi_watchdog(void)
touch_softlockup_watchdog();
 }
 
-int __kprobes nmi_watchdog_tick(struct pt_regs * regs, unsigned reason)
+notrace __kprobes int nmi_watchdog_tick(struct pt_regs *regs, unsigned reason)
 {
int sum;
int touched = 0;
Index: linux-mcount.git/arch/x86/kernel/traps_32.c
===
--- linux-mcount.git.orig/arch/x86/kernel/traps_32.c2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/traps_32.c 2008-01-25 21:47:08.0 
-0500
@@ -723,7 +723,7 @@ void __kprobes die_nmi(struct pt_regs *r
do_exit(SIGSEGV);
 }
 
-static __kprobes void default_do_nmi(struct pt_regs * regs)
+static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
 {
unsigned char reason = 0;
 
@@ -763,7 +763,7 @@ static __kprobes void default_do_nmi(str
 
 static int ignore_nmis;
 
-fastcall __kprobes void do_nmi(struct pt_regs * regs, long error_code)
+notrace fastcall __kprobes void do_nmi(struct pt_regs *regs, long error_code)
 {
int cpu;
 



[PATCH 05/23 -v6] add notrace annotations to vsyscall.

2008-01-25 Thread Steven Rostedt
Add the notrace annotations to some of the vsyscall functions.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/x86/kernel/vsyscall_64.c  |3 ++-
 arch/x86/vdso/vclock_gettime.c |   15 ---
 arch/x86/vdso/vgetcpu.c|3 ++-
 include/asm-x86/vsyscall.h |3 ++-
 4 files changed, 14 insertions(+), 10 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/vsyscall_64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/vsyscall_64.c 2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/vsyscall_64.c  2008-01-25 
21:47:06.0 -0500
@@ -42,7 +42,8 @@
 #include 
 #include 
 
-#define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr)))
+#define __vsyscall(nr) \
+   __attribute__ ((unused, __section__(".vsyscall_" #nr))) notrace
 #define __syscall_clobber "r11","rcx","memory"
 #define __pa_vsymbol(x)\
({unsigned long v;  \
Index: linux-mcount.git/arch/x86/vdso/vclock_gettime.c
===
--- linux-mcount.git.orig/arch/x86/vdso/vclock_gettime.c2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/vdso/vclock_gettime.c 2008-01-25 
21:47:06.0 -0500
@@ -24,7 +24,7 @@
 
 #define gtod vdso_vsyscall_gtod_data
 
-static long vdso_fallback_gettime(long clock, struct timespec *ts)
+notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 {
long ret;
asm("syscall" : "=a" (ret) :
@@ -32,7 +32,7 @@ static long vdso_fallback_gettime(long c
return ret;
 }
 
-static inline long vgetns(void)
+notrace static inline long vgetns(void)
 {
long v;
cycles_t (*vread)(void);
@@ -41,7 +41,7 @@ static inline long vgetns(void)
return (v * gtod->clock.mult) >> gtod->clock.shift;
 }
 
-static noinline int do_realtime(struct timespec *ts)
+notrace static noinline int do_realtime(struct timespec *ts)
 {
unsigned long seq, ns;
do {
@@ -55,7 +55,8 @@ static noinline int do_realtime(struct t
 }
 
 /* Copy of the version in kernel/time.c which we cannot directly access */
-static void vset_normalized_timespec(struct timespec *ts, long sec, long nsec)
+notrace static void
+vset_normalized_timespec(struct timespec *ts, long sec, long nsec)
 {
while (nsec >= NSEC_PER_SEC) {
nsec -= NSEC_PER_SEC;
@@ -69,7 +70,7 @@ static void vset_normalized_timespec(str
ts->tv_nsec = nsec;
 }
 
-static noinline int do_monotonic(struct timespec *ts)
+notrace static noinline int do_monotonic(struct timespec *ts)
 {
unsigned long seq, ns, secs;
do {
@@ -83,7 +84,7 @@ static noinline int do_monotonic(struct 
return 0;
 }
 
-int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
+notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 {
if (likely(gtod->sysctl_enabled && gtod->clock.vread))
switch (clock) {
@@ -97,7 +98,7 @@ int __vdso_clock_gettime(clockid_t clock
 int clock_gettime(clockid_t, struct timespec *)
__attribute__((weak, alias("__vdso_clock_gettime")));
 
-int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
+notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
 {
long ret;
if (likely(gtod->sysctl_enabled && gtod->clock.vread)) {
Index: linux-mcount.git/arch/x86/vdso/vgetcpu.c
===
--- linux-mcount.git.orig/arch/x86/vdso/vgetcpu.c   2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/vdso/vgetcpu.c2008-01-25 21:47:06.0 
-0500
@@ -13,7 +13,8 @@
 #include 
 #include "vextern.h"
 
-long __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
+notrace long
+__vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
 {
unsigned int dummy, p;
 
Index: linux-mcount.git/include/asm-x86/vsyscall.h
===
--- linux-mcount.git.orig/include/asm-x86/vsyscall.h2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/include/asm-x86/vsyscall.h 2008-01-25 21:47:06.0 
-0500
@@ -24,7 +24,8 @@ enum vsyscall_num {
((unused, __section__ (".vsyscall_gtod_data"),aligned(16)))
 #define __section_vsyscall_clock __attribute__ \
((unused, __section__ (".vsyscall_clock"),aligned(16)))
-#define __vsyscall_fn __attribute__ ((unused,__section__(".vsyscall_fn")))
+#define __vsyscall_fn \
+   __attribute__ ((unused, __section__(".vsyscall_fn"))) notrace
 
 #define VGETCPU_RDTSCP 1
 #define VGETCPU_LSL2


[PATCH 08/23 -v6] initialize the clock source to jiffies clock.

2008-01-25 Thread Steven Rostedt
The latency tracer can call clocksource_read very early in bootup,
before the clock source variable has been initialized. This results in a
crash at boot up (even before earlyprintk is initialized), since the
clock pointer is still NULL when clock->read is called.

This patch simply initializes the clock to use clocksource_jiffies, so
that any early user of clocksource_read will not crash.
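
The idea can be sketched in user space as follows. The struct and all
names ending in _sketch are simplified stand-ins, not the kernel
definitions, and fake_jiffies replaces the real jiffies counter.

```c
struct clocksource_sketch {
	const char *name;
	unsigned long long (*read)(void);
};

static unsigned long long fake_jiffies;   /* stand-in for the jiffies counter */

static unsigned long long jiffies_read_sketch(void)
{
	return fake_jiffies;
}

static struct clocksource_sketch clocksource_jiffies_sketch = {
	"jiffies", jiffies_read_sketch,
};

/*
 * Start out pointing at the always-available jiffies source instead of
 * NULL, so a read before the real clock source is selected is safe.
 */
static struct clocksource_sketch *cur_clock = &clocksource_jiffies_sketch;

static unsigned long long clocksource_read_sketch(void)
{
	return cur_clock->read();
}
```

An early caller gets coarse but valid time, and the pointer is simply
replaced once a better clock source is registered.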

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
Acked-by: John Stultz <[EMAIL PROTECTED]>
---
 include/linux/clocksource.h |3 +++
 kernel/time/timekeeping.c   |9 +++--
 2 files changed, 10 insertions(+), 2 deletions(-)

Index: linux-mcount.git/include/linux/clocksource.h
===
--- linux-mcount.git.orig/include/linux/clocksource.h   2008-01-25 
21:47:09.0 -0500
+++ linux-mcount.git/include/linux/clocksource.h2008-01-25 
21:47:11.0 -0500
@@ -273,6 +273,9 @@ extern struct clocksource* clocksource_g
 extern void clocksource_change_rating(struct clocksource *cs, int rating);
 extern void clocksource_resume(void);
 
+/* used to initialize clock */
+extern struct clocksource clocksource_jiffies;
+
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL
 extern void update_vsyscall(struct timespec *ts, struct clocksource *c);
 extern void update_vsyscall_tz(void);
Index: linux-mcount.git/kernel/time/timekeeping.c
===
--- linux-mcount.git.orig/kernel/time/timekeeping.c 2008-01-25 
21:47:09.0 -0500
+++ linux-mcount.git/kernel/time/timekeeping.c  2008-01-25 21:47:11.0 
-0500
@@ -53,8 +53,13 @@ static inline void update_xtime_cache(u6
timespec_add_ns(&xtime_cache, nsec);
 }
 
-static struct clocksource *clock; /* pointer to current clocksource */
-
+/*
+ * pointer to current clocksource
+ *  Just in case we use clocksource_read before we initialize
+ *  the actual clock source. Instead of calling a NULL read pointer
+ *  we return jiffies.
+ */
+static struct clocksource *clock = &clocksource_jiffies;
 
 #ifdef CONFIG_GENERIC_TIME
 /**



[PATCH 10/23 -v6] add notrace annotations to timing events

2008-01-25 Thread Steven Rostedt
This patch adds notrace annotations to timer functions
that will be used by tracing. This helps speed things up and
also keeps the ugliness of printing these functions down.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/x86/kernel/apic_32.c |2 +-
 arch/x86/kernel/hpet.c|2 +-
 arch/x86/kernel/time_32.c |2 +-
 arch/x86/kernel/tsc_32.c  |2 +-
 arch/x86/kernel/tsc_64.c  |4 ++--
 arch/x86/lib/delay_32.c   |6 +++---
 drivers/clocksource/acpi_pm.c |8 
 7 files changed, 13 insertions(+), 13 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/apic_32.c
===
--- linux-mcount.git.orig/arch/x86/kernel/apic_32.c 2008-01-25 
21:46:49.0 -0500
+++ linux-mcount.git/arch/x86/kernel/apic_32.c  2008-01-25 21:47:15.0 
-0500
@@ -577,7 +577,7 @@ static void local_apic_timer_interrupt(v
  *   interrupt as well. Thus we cannot inline the local irq ... ]
  */
 
-void fastcall smp_apic_timer_interrupt(struct pt_regs *regs)
+notrace fastcall void smp_apic_timer_interrupt(struct pt_regs *regs)
 {
struct pt_regs *old_regs = set_irq_regs(regs);
 
Index: linux-mcount.git/arch/x86/kernel/hpet.c
===
--- linux-mcount.git.orig/arch/x86/kernel/hpet.c2008-01-25 
21:46:49.0 -0500
+++ linux-mcount.git/arch/x86/kernel/hpet.c 2008-01-25 21:47:15.0 
-0500
@@ -295,7 +295,7 @@ static int hpet_legacy_next_event(unsign
 /*
  * Clock source related code
  */
-static cycle_t read_hpet(void)
+static notrace cycle_t read_hpet(void)
 {
return (cycle_t)hpet_readl(HPET_COUNTER);
 }
Index: linux-mcount.git/arch/x86/kernel/time_32.c
===
--- linux-mcount.git.orig/arch/x86/kernel/time_32.c 2008-01-25 
21:46:49.0 -0500
+++ linux-mcount.git/arch/x86/kernel/time_32.c  2008-01-25 21:47:15.0 
-0500
@@ -122,7 +122,7 @@ static int set_rtc_mmss(unsigned long no
 
 int timer_ack;
 
-unsigned long profile_pc(struct pt_regs *regs)
+notrace unsigned long profile_pc(struct pt_regs *regs)
 {
unsigned long pc = instruction_pointer(regs);
 
Index: linux-mcount.git/arch/x86/kernel/tsc_32.c
===
--- linux-mcount.git.orig/arch/x86/kernel/tsc_32.c  2008-01-25 
21:46:49.0 -0500
+++ linux-mcount.git/arch/x86/kernel/tsc_32.c   2008-01-25 21:47:15.0 
-0500
@@ -269,7 +269,7 @@ core_initcall(cpufreq_tsc);
 
 static unsigned long current_tsc_khz = 0;
 
-static cycle_t read_tsc(void)
+static notrace cycle_t read_tsc(void)
 {
cycle_t ret;
 
Index: linux-mcount.git/arch/x86/kernel/tsc_64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/tsc_64.c  2008-01-25 
21:46:49.0 -0500
+++ linux-mcount.git/arch/x86/kernel/tsc_64.c   2008-01-25 21:47:15.0 
-0500
@@ -248,13 +248,13 @@ __setup("notsc", notsc_setup);
 
 
 /* clock source code: */
-static cycle_t read_tsc(void)
+static notrace cycle_t read_tsc(void)
 {
cycle_t ret = (cycle_t)get_cycles_sync();
return ret;
 }
 
-static cycle_t __vsyscall_fn vread_tsc(void)
+static notrace cycle_t __vsyscall_fn vread_tsc(void)
 {
cycle_t ret = (cycle_t)get_cycles_sync();
return ret;
Index: linux-mcount.git/arch/x86/lib/delay_32.c
===
--- linux-mcount.git.orig/arch/x86/lib/delay_32.c   2008-01-25 
21:46:49.0 -0500
+++ linux-mcount.git/arch/x86/lib/delay_32.c2008-01-25 21:47:15.0 
-0500
@@ -24,7 +24,7 @@
 #endif
 
 /* simple loop based delay: */
-static void delay_loop(unsigned long loops)
+static notrace void delay_loop(unsigned long loops)
 {
int d0;
 
@@ -39,7 +39,7 @@ static void delay_loop(unsigned long loo
 }
 
 /* TSC based delay: */
-static void delay_tsc(unsigned long loops)
+static notrace void delay_tsc(unsigned long loops)
 {
unsigned long bclock, now;
 
@@ -72,7 +72,7 @@ int read_current_timer(unsigned long *ti
return -1;
 }
 
-void __delay(unsigned long loops)
+notrace void __delay(unsigned long loops)
 {
delay_fn(loops);
 }
Index: linux-mcount.git/drivers/clocksource/acpi_pm.c
===
--- linux-mcount.git.orig/drivers/clocksource/acpi_pm.c 2008-01-25 
21:46:49.0 -0500
+++ linux-mcount.git/drivers/clocksource/acpi_pm.c  2008-01-25 
21:47:15.0 -0500
@@ -30,13 +30,13 @@
  */
 u32 pmtmr_ioport __read_mostly;
 
-static inline u32 read_pmtmr(void)
+static inline notrace u32 read_pmtmr(void)
 {
/* mask the output to 24 bits */
return inl(pmtmr_ioport) & ACPI_PM_MASK;
 }
 
-u32 acpi_pm_read_verified(void)
+notrace u32 acpi_pm_read_verified(void)
 {
u32 v1 = 0, v2 = 0, v3 = 0;
 

[PATCH 02/23 -v6] Add basic support for gcc profiler instrumentation

2008-01-25 Thread Steven Rostedt
If CONFIG_MCOUNT is selected and /proc/sys/kernel/mcount_enabled is set to a
non-zero value, the mcount routine will be called every time we enter a kernel
function that is not marked with the "notrace" attribute.

The mcount routine will then call a registered function if a function
happens to be registered.

[This code has been highly hacked by Steven Rostedt, so don't
 blame Arnaldo for all of this ;-) ]

Update:
  It is now possible to register more than one mcount function.
  If only one mcount function is registered, that will be the
  function that mcount calls directly. If more than one function
  is registered, then mcount will call a function that will loop
  through the functions to call.
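
The dispatch rule in the update above can be sketched in plain userspace C.
This is only an illustration, not the code in lib/tracing/mcount.c: the names
register_hook(), unregister_hook(), mcount_list_func() and MAX_HOOKS are
hypothetical, and locking is left out entirely. Only the hook prototype and
the mcount_trace_function pointer name are taken from this series.

```c
#include <assert.h>
#include <stddef.h>

/* Hook prototype used by this series: traced function and its caller. */
typedef void (*mcount_func_t)(unsigned long ip, unsigned long parent_ip);

#define MAX_HOOKS 8                 /* hypothetical limit for this sketch */

static mcount_func_t hooks[MAX_HOOKS];
static size_t nr_hooks;

/* Two demo hooks that only count how often they were called. */
static unsigned long calls_a, calls_b;

static void hook_a(unsigned long ip, unsigned long parent_ip)
{
    (void)ip;
    (void)parent_ip;
    calls_a++;
}

static void hook_b(unsigned long ip, unsigned long parent_ip)
{
    (void)ip;
    (void)parent_ip;
    calls_b++;
}

/* Target used while nothing is registered. */
static void mcount_dummy(unsigned long ip, unsigned long parent_ip)
{
    (void)ip;
    (void)parent_ip;
}

/* Target used only when more than one hook is registered. */
static void mcount_list_func(unsigned long ip, unsigned long parent_ip)
{
    for (size_t i = 0; i < nr_hooks; i++)
        hooks[i](ip, parent_ip);
}

/* The pointer the mcount stub would call through. */
static mcount_func_t mcount_trace_function = mcount_dummy;

/* One hook: call it directly. Several hooks: go through the loop. */
static void update_target(void)
{
    if (nr_hooks == 0)
        mcount_trace_function = mcount_dummy;
    else if (nr_hooks == 1)
        mcount_trace_function = hooks[0];
    else
        mcount_trace_function = mcount_list_func;
}

static int register_hook(mcount_func_t fn)
{
    assert(fn != NULL);
    if (nr_hooks == MAX_HOOKS)
        return -1;
    hooks[nr_hooks++] = fn;
    update_target();
    return 0;
}

static int unregister_hook(mcount_func_t fn)
{
    for (size_t i = 0; i < nr_hooks; i++) {
        if (hooks[i] != fn)
            continue;
        hooks[i] = hooks[--nr_hooks];   /* move the last hook into the gap */
        update_target();
        return 0;
    }
    return -1;
}
```

The point of the single-hook case is that the common configuration pays for
one indirect call only; the loop is taken only when it is actually needed.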

Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 Makefile   |3 
 arch/x86/Kconfig   |1 
 arch/x86/kernel/entry_32.S |   25 +++
 arch/x86/kernel/entry_64.S |   36 +++
 include/linux/linkage.h|2 
 include/linux/mcount.h |   38 
 kernel/sysctl.c|   11 +++
 lib/Kconfig.debug  |1 
 lib/Makefile   |2 
 lib/tracing/Kconfig|   10 +++
 lib/tracing/Makefile   |3 
 lib/tracing/mcount.c   |  141 +
 12 files changed, 273 insertions(+)

Index: linux-mcount.git/Makefile
===
--- linux-mcount.git.orig/Makefile  2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/Makefile   2008-01-25 21:47:00.0 -0500
@@ -509,6 +509,9 @@ endif
 
 include $(srctree)/arch/$(SRCARCH)/Makefile
 
+ifdef CONFIG_MCOUNT
+KBUILD_CFLAGS  += -pg
+endif
 ifdef CONFIG_FRAME_POINTER
 KBUILD_CFLAGS  += -fno-omit-frame-pointer -fno-optimize-sibling-calls
 else
Index: linux-mcount.git/arch/x86/Kconfig
===
--- linux-mcount.git.orig/arch/x86/Kconfig  2008-01-25 21:46:50.0 
-0500
+++ linux-mcount.git/arch/x86/Kconfig   2008-01-25 21:47:00.0 -0500
@@ -19,6 +19,7 @@ config X86_64
 config X86
bool
default y
+   select HAVE_MCOUNT
 
 config GENERIC_TIME
bool
Index: linux-mcount.git/arch/x86/kernel/entry_32.S
===
--- linux-mcount.git.orig/arch/x86/kernel/entry_32.S2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/entry_32.S 2008-01-25 21:47:00.0 
-0500
@@ -75,6 +75,31 @@ DF_MASK  = 0x0400 
 NT_MASK= 0x4000
 VM_MASK= 0x0002
 
+#ifdef CONFIG_MCOUNT
+.globl mcount
+mcount:
+   /* unlikely(mcount_enabled) */
+   cmpl $0, mcount_enabled
+   jnz trace
+   ret
+
+trace:
+   /* taken from glibc */
+   pushl %eax
+   pushl %ecx
+   pushl %edx
+   movl 0xc(%esp), %edx
+   movl 0x4(%ebp), %eax
+
+   call   *mcount_trace_function
+
+   popl %edx
+   popl %ecx
+   popl %eax
+
+   ret
+#endif
+
 #ifdef CONFIG_PREEMPT
 #define preempt_stop(clobbers) DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
 #else
Index: linux-mcount.git/arch/x86/kernel/entry_64.S
===
--- linux-mcount.git.orig/arch/x86/kernel/entry_64.S2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/entry_64.S 2008-01-25 21:47:00.0 
-0500
@@ -53,6 +53,42 @@
 
.code64
 
+#ifdef CONFIG_MCOUNT
+
+ENTRY(mcount)
+   /* unlikely(mcount_enabled) */
+   cmpl $0, mcount_enabled
+   jnz trace
+   retq
+
+trace:
+   /* taken from glibc */
+   subq $0x38, %rsp
+   movq %rax, (%rsp)
+   movq %rcx, 8(%rsp)
+   movq %rdx, 16(%rsp)
+   movq %rsi, 24(%rsp)
+   movq %rdi, 32(%rsp)
+   movq %r8, 40(%rsp)
+   movq %r9, 48(%rsp)
+
+   movq 0x38(%rsp), %rsi
+   movq 8(%rbp), %rdi
+
+   call   *mcount_trace_function
+
+   movq 48(%rsp), %r9
+   movq 40(%rsp), %r8
+   movq 32(%rsp), %rdi
+   movq 24(%rsp), %rsi
+   movq 16(%rsp), %rdx
+   movq 8(%rsp), %rcx
+   movq (%rsp), %rax
+   addq $0x38, %rsp
+
+   retq
+#endif
+
 #ifndef CONFIG_PREEMPT
 #define retint_kernel retint_restore_args
 #endif 
Index: linux-mcount.git/include/linux/linkage.h
===
--- linux-mcount.git.orig/include/linux/linkage.h   2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/include/linux/linkage.h2008-01-25 21:47:00.0 
-0500
@@ -3,6 +3,8 @@
 
 #include 
 
+#define notrace __attribute__((no_instrument_function))
+
 #ifdef __cplusplus
 #define CPP_ASMLINKAGE extern "C"
 #else
Index: linux-mcount.git/include/linux/mcount.h
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-mcount.git

[PATCH 14/23 -v6] Add tracing of context switches

2008-01-25 Thread Steven Rostedt
This patch adds context switch tracing, in the following format:

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
 swapper-0 1d..3  137us+:  0:140:R --> 2912:120
sshd-2912  1d..3  216us+:  2912:120:S --> 0:140
 swapper-0 1d..3  261us+:  0:140:R --> 2912:120
bash-2920  0d..3  267us+:  2920:120:S --> 0:140
sshd-2912  1d..3  330us!:  2912:120:S --> 0:140
 swapper-0 1d..3 2389us+:  0:140:R --> 2847:120
yum-upda-2847  1d..3 2411us!:  2847:120:S --> 0:140
 swapper-0 0d..3 11089us+:  0:140:R --> 3139:120
gdm-bina-3139  0d..3 3us!:  3139:120:S --> 0:140
 swapper-0 1d..3 102328us+:  0:140:R --> 2847:120
yum-upda-2847  1d..3 102348us!:  2847:120:S --> 0:140


 "sched_switch" is added to /debugfs/tracing/available_tracers
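
To make the columns in the sample output easier to read, here is a small
userspace sketch that builds the "1d..3" flag field and the
"prev --> next" part of one line. It is not the formatting code from
tracer.c. struct switch_entry and format_switch() are hypothetical, and the
letters 'N', 'h', 's' and 'H' are assumptions; only 'd' and '.' appear in
the sample above. The preempt depth is printed as a plain digit.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified view of one context-switch record. */
struct switch_entry {
    int cpu;
    int irqs_off;        /* non-zero: interrupts were disabled */
    int need_resched;
    int hardirq;
    int softirq;
    int preempt_depth;
    int prev_pid, prev_prio;
    char prev_state;     /* 'R', 'S', ... */
    int next_pid, next_prio;
};

/* Format "<cpu><irq><resched><ctx><depth> pid:prio:state --> pid:prio". */
static int format_switch(char *buf, size_t len, const struct switch_entry *e)
{
    char irq, resched, ctx;

    assert(buf != NULL && e != NULL);

    irq = e->irqs_off ? 'd' : '.';
    resched = e->need_resched ? 'N' : '.';      /* 'N' is an assumption */
    if (e->hardirq && e->softirq)
        ctx = 'H';                              /* assumption */
    else if (e->hardirq)
        ctx = 'h';                              /* assumption */
    else if (e->softirq)
        ctx = 's';                              /* assumption */
    else
        ctx = '.';

    return snprintf(buf, len, "%d%c%c%c%d %d:%d:%c --> %d:%d",
                    e->cpu, irq, resched, ctx, e->preempt_depth,
                    e->prev_pid, e->prev_prio, e->prev_state,
                    e->next_pid, e->next_prio);
}
```

Read this way, the first sample line says: on CPU 1, with interrupts off and
a preempt depth of 3, pid 0 (priority 140, running) was switched out for
pid 2912 (priority 120).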

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
Cc: Mathieu Desnoyers <[EMAIL PROTECTED]>
---
 lib/tracing/Kconfig  |9 ++
 lib/tracing/Makefile |1 
 lib/tracing/trace_sched_switch.c |  134 +++
 lib/tracing/tracer.c |   43 
 lib/tracing/tracer.h |   19 +
 5 files changed, 205 insertions(+), 1 deletion(-)

Index: linux-mcount.git/lib/tracing/Kconfig
===
--- linux-mcount.git.orig/lib/tracing/Kconfig   2008-01-25 21:47:17.0 
-0500
+++ linux-mcount.git/lib/tracing/Kconfig2008-01-25 21:47:23.0 
-0500
@@ -23,3 +23,12 @@ config FUNCTION_TRACER
  insert a call to an architecture specific __mcount routine,
  that the debugging mechanism using this facility will hook by
  providing a set of inline routines.
+
+config CONTEXT_SWITCH_TRACER
+   bool "Trace process context switches"
+   depends on DEBUG_KERNEL
+   select TRACING
+   help
+ This tracer hooks into the context switch and records
+ all switching of tasks.
+
Index: linux-mcount.git/lib/tracing/Makefile
===
--- linux-mcount.git.orig/lib/tracing/Makefile  2008-01-25 21:47:17.0 
-0500
+++ linux-mcount.git/lib/tracing/Makefile   2008-01-25 21:47:23.0 
-0500
@@ -1,6 +1,7 @@
 obj-$(CONFIG_MCOUNT) += libmcount.o
 
 obj-$(CONFIG_TRACING) += tracer.o
+obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o
 obj-$(CONFIG_FUNCTION_TRACER) += trace_function.o
 
 libmcount-y := mcount.o
Index: linux-mcount.git/lib/tracing/trace_sched_switch.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-mcount.git/lib/tracing/trace_sched_switch.c   2008-01-25 
21:47:23.0 -0500
@@ -0,0 +1,134 @@
+/*
+ * trace context switch
+ *
+ * Copyright (C) 2007 Steven Rostedt <[EMAIL PROTECTED]>
+ *
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "tracer.h"
+
+static struct tracing_trace *tracer_trace;
+static int trace_enabled __read_mostly;
+
+static notrace void sched_switch_callback(const struct marker *mdata,
+ void *private_data,
+ const char *format, ...)
+{
+   struct tracing_trace **p = mdata->private;
+   struct tracing_trace *tr = *p;
+   struct tracing_trace_cpu *data;
+   struct task_struct *prev;
+   struct task_struct *next;
+   unsigned long flags;
+   va_list ap;
+   int cpu;
+
+   if (likely(!trace_enabled))
+   return;
+
+   va_start(ap, format);
+   prev = va_arg(ap, typeof(prev));
+   next = va_arg(ap, typeof(next));
+   va_end(ap);
+
+   raw_local_irq_save(flags);
+   cpu = raw_smp_processor_id();
+   data = tr->data[cpu];
+   atomic_inc(&data->disabled);
+
+   if (likely(atomic_read(&data->disabled) == 1))
+   tracing_sched_switch_trace(tr, data, prev, next, flags);
+
+   atomic_dec(&data->disabled);
+   raw_local_irq_restore(flags);
+}
+
+static notrace void sched_switch_reset(struct tracing_trace *tr)
+{
+   int cpu;
+
+   tr->time_start = now();
+
+   for_each_online_cpu(cpu)
+   tracing_reset(tr->data[cpu]);
+}
+
+static notrace void start_sched_trace(struct tracing_trace *tr)
+{
+   sched_switch_reset(tr);
+   trace_enabled = 1;
+}
+
+static notrace void stop_sched_trace(struct tracing_trace *tr)
+{
+   trace_enabled = 0;
+}
+
+static notrace void sched_switch_trace_init(struct tracing_trace *tr)
+{
+   tracer_trace = tr;
+
+   if (tr->ctrl)
+   start_sched_trace(tr);
+}
+
+static notrac

[PATCH 07/23 -v6] handle accurate time keeping over long delays

2008-01-25 Thread Steven Rostedt
Handle accurate time even if there's a long delay between
accumulated clock cycles.
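
A worked example of the arithmetic may help here. The sketch below is a
userspace reduction, assuming a hypothetical struct mini_clock that keeps
only the fields the calculation needs. The masked delta and the
(cycles * mult) >> shift conversion follow the expressions in the diff
below; the mini_* function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cycle_t;

/* Hypothetical cut-down clocksource with only the fields used here. */
struct mini_clock {
    cycle_t mask;               /* width of the hardware counter */
    uint32_t mult;
    uint32_t shift;
    cycle_t cycle_last;         /* raw counter value at last accumulation */
    cycle_t cycle_accumulated;  /* cycles folded in so far */
};

/* Fold the cycles since cycle_last into the accumulator. */
static void mini_accumulate(struct mini_clock *cs, cycle_t now)
{
    cycle_t offset = (now - cs->cycle_last) & cs->mask;

    cs->cycle_last = now;
    cs->cycle_accumulated += offset;
}

/* Accumulated cycles plus whatever has elapsed since the last fold. */
static cycle_t mini_get_cycles(const struct mini_clock *cs, cycle_t now)
{
    cycle_t offset = (now - cs->cycle_last) & cs->mask;

    return offset + cs->cycle_accumulated;
}

/* Convert cycles to nanoseconds; overflow of the product is ignored here. */
static uint64_t mini_cyc2ns(const struct mini_clock *cs, cycle_t cycles)
{
    assert(cs->shift < 64);
    return (cycles * cs->mult) >> cs->shift;
}
```

With a 24-bit counter (mask 0xffffff), cycle_last = 0xfffff0 and a new
reading of 0x000010, the masked delta is 0x20, so the wrap of the hardware
counter does not lose time as long as accumulation happens at least once per
counter period.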

Signed-off-by: John Stultz <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/powerpc/kernel/time.c|3 +-
 arch/x86/kernel/vsyscall_64.c |5 ++-
 include/asm-x86/vgtod.h   |2 -
 include/linux/clocksource.h   |   58 --
 kernel/time/timekeeping.c |   36 +-
 5 files changed, 82 insertions(+), 22 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/vsyscall_64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/vsyscall_64.c 2008-01-25 
21:47:06.0 -0500
+++ linux-mcount.git/arch/x86/kernel/vsyscall_64.c  2008-01-25 
21:47:09.0 -0500
@@ -86,6 +86,7 @@ void update_vsyscall(struct timespec *wa
vsyscall_gtod_data.clock.mask = clock->mask;
vsyscall_gtod_data.clock.mult = clock->mult;
vsyscall_gtod_data.clock.shift = clock->shift;
+   vsyscall_gtod_data.clock.cycle_accumulated = clock->cycle_accumulated;
vsyscall_gtod_data.wall_time_sec = wall_time->tv_sec;
vsyscall_gtod_data.wall_time_nsec = wall_time->tv_nsec;
vsyscall_gtod_data.wall_to_monotonic = wall_to_monotonic;
@@ -121,7 +122,7 @@ static __always_inline long time_syscall
 
 static __always_inline void do_vgettimeofday(struct timeval * tv)
 {
-   cycle_t now, base, mask, cycle_delta;
+   cycle_t now, base, accumulated, mask, cycle_delta;
unsigned seq;
unsigned long mult, shift, nsec;
cycle_t (*vread)(void);
@@ -135,6 +136,7 @@ static __always_inline void do_vgettimeo
}
now = vread();
base = __vsyscall_gtod_data.clock.cycle_last;
+   accumulated  = __vsyscall_gtod_data.clock.cycle_accumulated;
mask = __vsyscall_gtod_data.clock.mask;
mult = __vsyscall_gtod_data.clock.mult;
shift = __vsyscall_gtod_data.clock.shift;
@@ -145,6 +147,7 @@ static __always_inline void do_vgettimeo
 
/* calculate interval: */
cycle_delta = (now - base) & mask;
+   cycle_delta += accumulated;
/* convert to nsecs: */
nsec += (cycle_delta * mult) >> shift;
 
Index: linux-mcount.git/include/asm-x86/vgtod.h
===
--- linux-mcount.git.orig/include/asm-x86/vgtod.h   2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/include/asm-x86/vgtod.h2008-01-25 21:47:09.0 
-0500
@@ -15,7 +15,7 @@ struct vsyscall_gtod_data {
struct timezone sys_tz;
struct { /* extract of a clocksource struct */
cycle_t (*vread)(void);
-   cycle_t cycle_last;
+   cycle_t cycle_last, cycle_accumulated;
cycle_t mask;
u32 mult;
u32 shift;
Index: linux-mcount.git/include/linux/clocksource.h
===
--- linux-mcount.git.orig/include/linux/clocksource.h   2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/include/linux/clocksource.h2008-01-25 
21:47:09.0 -0500
@@ -50,8 +50,12 @@ struct clocksource;
  * @flags: flags describing special properties
  * @vread: vsyscall based read
  * @resume:resume function for the clocksource, if necessary
+ * @cycle_last:        Used internally by timekeeping core, please ignore.
+ * @cycle_accumulated: Used internally by timekeeping core, please ignore.
  * @cycle_interval:Used internally by timekeeping core, please ignore.
  * @xtime_interval:Used internally by timekeeping core, please ignore.
+ * @xtime_nsec:        Used internally by timekeeping core, please ignore.
+ * @error: Used internally by timekeeping core, please ignore.
  */
 struct clocksource {
/*
@@ -82,7 +86,10 @@ struct clocksource {
 * Keep it in a different cache line to dirty no
 * more than one cache line.
 */
-   cycle_t cycle_last cacheline_aligned_in_smp;
+   struct {
+   cycle_t cycle_last, cycle_accumulated;
+   } cacheline_aligned_in_smp;
+
u64 xtime_nsec;
s64 error;
 
@@ -168,11 +175,44 @@ static inline cycle_t clocksource_read(s
 }
 
 /**
+ * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
+ * @cs:pointer to clocksource being read
+ * @now:   current cycle value
+ *
+ * Uses the clocksource to return the current cycle_t value.
+ * NOTE!!!: This is different from clocksource_read, because it
+ * returns the accumulated cycle value! Must hold xtime lock!
+ */
+static inline cycle_t
+clocksource_get_cycles(struct clocksource *cs, cycle_t now)
+{
+   cycle_t offset = (now - cs->cycle_last) & cs->mask;
+   offset += cs->cycle_accumulated;
+   

[PATCH 03/23 -v6] Annotate core code that should not be traced

2008-01-25 Thread Steven Rostedt
Mark functions in core code that should not be traced with
"notrace".  The "notrace" attribute will prevent gcc from adding
a call to mcount on the annotated functions.

Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

---
 lib/smp_processor_id.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-mcount.git/lib/smp_processor_id.c
===
--- linux-mcount.git.orig/lib/smp_processor_id.c2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/lib/smp_processor_id.c 2008-01-25 21:47:03.0 
-0500
@@ -7,7 +7,7 @@
 #include 
 #include 
 
-unsigned int debug_smp_processor_id(void)
+notrace unsigned int debug_smp_processor_id(void)
 {
unsigned long preempt_count = preempt_count();
int this_cpu = raw_smp_processor_id();

-- 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 17/23 -v6] Add marker in try_to_wake_up

2008-01-25 Thread Steven Rostedt
Add markers into the wakeup code, to allow the tracer to
record wakeup timings.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 kernel/sched.c |8 
 1 file changed, 8 insertions(+)

Index: linux-mcount.git/kernel/sched.c
===
--- linux-mcount.git.orig/kernel/sched.c2008-01-25 21:47:21.0 
-0500
+++ linux-mcount.git/kernel/sched.c 2008-01-25 21:47:30.0 -0500
@@ -1885,6 +1885,10 @@ static int try_to_wake_up(struct task_st
 
 out_activate:
 #endif /* CONFIG_SMP */
+   trace_mark(kernel_sched_wakeup,
+  "p %p rq->curr %p",
+  p, rq->curr);
+
schedstat_inc(p, se.nr_wakeups);
if (sync)
schedstat_inc(p, se.nr_wakeups_sync);
@@ -2026,6 +2030,10 @@ void fastcall wake_up_new_task(struct ta
p->sched_class->task_new(rq, p);
inc_nr_running(p, rq);
}
+   trace_mark(kernel_sched_wakeup_new,
+  "p %p rq->curr %p",
+  p, rq->curr);
+
check_preempt_curr(rq, p);
 #ifdef CONFIG_SMP
if (p->sched_class->task_wake_up)



[PATCH 09/23 -v6] add get_monotonic_cycles

2008-01-25 Thread Steven Rostedt
The latency tracer needs a way to get an accurate time
without grabbing any locks. Locks themselves might call
the latency tracer and cause at best a slow down.

This patch adds get_monotonic_cycles that returns cycles
from a reliable clock source in a monotonic fashion.
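
The two-slot update used in the diff below can be sketched in userspace
roughly as follows. This is a single-threaded illustration under stated
assumptions: struct mono_clock and the mono_* names are hypothetical, C11
release/acquire operations stand in for smp_wmb() and
smp_read_barrier_depends(), and the reader takes the counter value as a
parameter instead of reading hardware. The slot fields are plain variables,
so a truly concurrent reader would need more care than is shown here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t cycle_t;

struct cycle_base {
    cycle_t cycle_base_last;    /* raw counter value at last update */
    cycle_t cycle_base;         /* cycles accumulated up to that point */
};

/* Hypothetical reduction of the fields added to struct clocksource. */
struct mono_clock {
    cycle_t mask;
    struct cycle_base base[2];
    atomic_int base_num;        /* slot that readers should use */
};

/* Writer: fill the unused slot, then publish it. Writers must be
 * serialized by the caller (the patch requires xtime_lock for this). */
static void mono_accumulate(struct mono_clock *cs, cycle_t now)
{
    int cur = atomic_load_explicit(&cs->base_num, memory_order_relaxed);
    int num = 1 - cur;
    cycle_t offset;

    assert(cur == 0 || cur == 1);
    offset = (now - cs->base[cur].cycle_base_last) & cs->mask;
    cs->base[num].cycle_base = cs->base[cur].cycle_base + offset;
    cs->base[num].cycle_base_last = now;
    /* Make the new slot visible before switching readers over to it. */
    atomic_store_explicit(&cs->base_num, num, memory_order_release);
}

/* Reader: pick the published slot, then extend the counter from it. */
static cycle_t mono_get_cycles(struct mono_clock *cs, cycle_t now)
{
    int num = atomic_load_explicit(&cs->base_num, memory_order_acquire);
    cycle_t offset = (now - cs->base[num].cycle_base_last) & cs->mask;

    return offset + cs->base[num].cycle_base;
}
```

The writer never modifies the slot that readers are currently pointed at,
which is what lets the reader get by without taking a lock.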

Signed-off-by: John Stultz <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 include/linux/clocksource.h |   54 +---
 kernel/time/timekeeping.c   |   26 +++--
 2 files changed, 70 insertions(+), 10 deletions(-)

Index: linux-mcount.git/include/linux/clocksource.h
===
--- linux-mcount.git.orig/include/linux/clocksource.h   2008-01-25 
21:47:11.0 -0500
+++ linux-mcount.git/include/linux/clocksource.h2008-01-25 
21:47:13.0 -0500
@@ -88,8 +88,16 @@ struct clocksource {
 */
struct {
cycle_t cycle_last, cycle_accumulated;
-   } cacheline_aligned_in_smp;
 
+   /* base structure provides lock-free read
+* access to a virtualized 64bit counter
+* Uses RCU-like update.
+*/
+   struct {
+   cycle_t cycle_base_last, cycle_base;
+   } base[2];
+   int base_num;
+   } cacheline_aligned_in_smp;
u64 xtime_nsec;
s64 error;
 
@@ -175,19 +183,30 @@ static inline cycle_t clocksource_read(s
 }
 
 /**
- * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
+ * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
  * @cs:pointer to clocksource being read
  * @now:   current cycle value
  *
  * Uses the clocksource to return the current cycle_t value.
  * NOTE!!!: This is different from clocksource_read, because it
- * returns the accumulated cycle value! Must hold xtime lock!
+ * returns a 64bit wide accumulated value.
  */
 static inline cycle_t
-clocksource_get_cycles(struct clocksource *cs, cycle_t now)
+clocksource_get_basecycles(struct clocksource *cs)
 {
-   cycle_t offset = (now - cs->cycle_last) & cs->mask;
-   offset += cs->cycle_accumulated;
+   int num;
+   cycle_t now, offset;
+
+   preempt_disable();
+   num = cs->base_num;
+   /* base_num is shared, and some archs are wacky */
+   smp_read_barrier_depends();
+   now = clocksource_read(cs);
+   offset = (now - cs->base[num].cycle_base_last);
+   offset &= cs->mask;
+   offset += cs->base[num].cycle_base;
+   preempt_enable();
+
return offset;
 }
 
@@ -197,11 +216,27 @@ clocksource_get_cycles(struct clocksourc
  * @now:   current cycle value
  *
  * Used to avoids clocksource hardware overflow by periodically
- * accumulating the current cycle delta. Must hold xtime write lock!
+ * accumulating the current cycle delta. Uses RCU-like update, but
+ * ***still requires the xtime_lock is held for writing!***
  */
 static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
 {
-   cycle_t offset = (now - cs->cycle_last) & cs->mask;
+   /*
+* First update the monotonic base portion.
+* The dual array update method allows for lock-free reading.
+* 'num' is always 1 or 0.
+*/
+   int num = 1 - cs->base_num;
+   cycle_t offset = (now - cs->base[1-num].cycle_base_last);
+   offset &= cs->mask;
+   cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset;
+   cs->base[num].cycle_base_last = now;
+   /* make sure this array is visible to the world first */
+   smp_wmb();
+   cs->base_num = num;
+
+   /* Now update the cycle_accumulated portion */
+   offset = (now - cs->cycle_last) & cs->mask;
cs->cycle_last = now;
cs->cycle_accumulated += offset;
 }
@@ -272,6 +307,9 @@ extern int clocksource_register(struct c
 extern struct clocksource* clocksource_get_next(void);
 extern void clocksource_change_rating(struct clocksource *cs, int rating);
 extern void clocksource_resume(void);
+extern cycle_t get_monotonic_cycles(void);
+extern unsigned long cycles_to_usecs(cycle_t cycles);
+extern cycle_t usecs_to_cycles(unsigned long usecs);
 
 /* used to initialize clock */
 extern struct clocksource clocksource_jiffies;
Index: linux-mcount.git/kernel/time/timekeeping.c
===
--- linux-mcount.git.orig/kernel/time/timekeeping.c 2008-01-25 
21:47:11.0 -0500
+++ linux-mcount.git/kernel/time/timekeeping.c  2008-01-25 21:47:13.0 
-0500
@@ -71,10 +71,12 @@ static struct clocksource *clock = &cloc
  */
 static inline s64 __get_nsec_offset(void)
 {
-   cycle_t cycle_delta;
+   cycle_t now, cycle_delta;
s64 ns_offset;
 
-   cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
+   now = clocksource_read(clock);
+   cycle

[PATCH 23/23 -v6] Critical latency timings histogram

2008-01-25 Thread Steven Rostedt
This patch adds hooks into the latency tracer to give
us histograms of interrupts off, preemption off and
wakeup timings.

This code is based on work done by Yi Yang <[EMAIL PROTECTED]>,
but heavily modified to work with the new tracer, with some
cleanups by Steven Rostedt <[EMAIL PROTECTED]>
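
For readers unfamiliar with this kind of histogram, a minimal userspace
sketch follows. It is not taken from tracer_hist.c, which is not shown in
this excerpt: the struct layout, the one-bucket-per-microsecond resolution,
HIST_BUCKETS and the hist_* names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HIST_BUCKETS 16     /* hypothetical: one bucket per microsecond */

struct latency_hist {
    unsigned long bucket[HIST_BUCKETS];
    unsigned long overflow; /* samples of HIST_BUCKETS us or more */
    unsigned long min_us;
    unsigned long max_us;
    unsigned long total;
};

static void hist_init(struct latency_hist *h)
{
    assert(h != NULL);
    memset(h, 0, sizeof(*h));
}

/* Record one measured latency, e.g. one irqs-off section. */
static void hist_record(struct latency_hist *h, unsigned long latency_us)
{
    assert(h != NULL);

    if (latency_us < HIST_BUCKETS)
        h->bucket[latency_us]++;
    else
        h->overflow++;

    if (h->total == 0 || latency_us < h->min_us)
        h->min_us = latency_us;
    if (latency_us > h->max_us)
        h->max_us = latency_us;
    h->total++;
}
```

A start hook would take a timestamp and the matching stop hook would pass
the difference to something like hist_record().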

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 lib/tracing/Kconfig |   20 +
 lib/tracing/Makefile|4 
 lib/tracing/trace_irqsoff.c |   21 +
 lib/tracing/trace_wakeup.c  |   18 +
 lib/tracing/tracer_hist.c   |  513 
 lib/tracing/tracer_hist.h   |   39 +++
 6 files changed, 610 insertions(+), 5 deletions(-)

Index: linux-mcount.git/lib/tracing/Kconfig
===
--- linux-mcount.git.orig/lib/tracing/Kconfig   2008-01-25 21:47:40.0 
-0500
+++ linux-mcount.git/lib/tracing/Kconfig2008-01-25 21:47:42.0 
-0500
@@ -102,3 +102,23 @@ config CONTEXT_SWITCH_TRACER
  This tracer hooks into the context switch and records
  all switching of tasks.
 
+config INTERRUPT_OFF_HIST
+   bool "Interrupts off critical timings histogram"
+   depends on CRITICAL_IRQSOFF_TIMING
+   help
+ This option uses the infrastructure of the critical
+ irqs off timings to create a histogram of latencies.
+
+config PREEMPT_OFF_HIST
+   bool "Preempt off critical timings histogram"
+   depends on CRITICAL_PREEMPT_TIMING
+   help
+ This option uses the infrastructure of the critical
+ preemption off timings to create a histogram of latencies.
+
+config WAKEUP_LATENCY_HIST
+   bool "Wakeup latency timings histogram"
+   depends on WAKEUP_TRACER
+   help
+ This option uses the infrastructure of the wakeup tracer
+ to create a histogram of latencies.
Index: linux-mcount.git/lib/tracing/Makefile
===
--- linux-mcount.git.orig/lib/tracing/Makefile  2008-01-25 21:47:40.0 
-0500
+++ linux-mcount.git/lib/tracing/Makefile   2008-01-25 21:47:42.0 
-0500
@@ -8,4 +8,8 @@ obj-$(CONFIG_CRITICAL_PREEMPT_TIMING) +=
 obj-$(CONFIG_WAKEUP_TRACER) += trace_wakeup.o
 obj-$(CONFIG_EVENT_TRACER) += trace_events.o
 
+obj-$(CONFIG_INTERRUPT_OFF_HIST) += tracer_hist.o
+obj-$(CONFIG_PREEMPT_OFF_HIST) += tracer_hist.o
+obj-$(CONFIG_WAKEUP_LATENCY_HIST) += tracer_hist.o
+
 libmcount-y := mcount.o
Index: linux-mcount.git/lib/tracing/trace_irqsoff.c
===
--- linux-mcount.git.orig/lib/tracing/trace_irqsoff.c   2008-01-25 
21:47:40.0 -0500
+++ linux-mcount.git/lib/tracing/trace_irqsoff.c2008-01-25 
21:47:42.0 -0500
@@ -16,6 +16,7 @@
 #include 
 
 #include "tracer.h"
+#include "tracer_hist.h"
 
 static struct tracing_trace *tracer_trace __read_mostly;
 static __cacheline_aligned_in_smp DEFINE_MUTEX(max_mutex);
@@ -237,7 +238,7 @@ stop_critical_timing(unsigned long ip, u
else
return;
 
-   if (likely(!trace_enabled))
+   if (!trace_enabled)
return;
 
cpu = raw_smp_processor_id();
@@ -261,10 +262,14 @@ void notrace start_critical_timings(void
 {
if (preempt_trace() || irq_trace())
start_critical_timing(CALLER_ADDR0, 0);
+
+   tracing_hist_preempt_start();
 }
 
 void notrace stop_critical_timings(void)
 {
+   tracing_hist_preempt_stop(TRACE_STOP);
+
if (preempt_trace() || irq_trace())
stop_critical_timing(CALLER_ADDR0, 0);
 }
@@ -273,6 +278,8 @@ void notrace stop_critical_timings(void)
 #ifdef CONFIG_LOCKDEP
 void notrace time_hardirqs_on(unsigned long a0, unsigned long a1)
 {
+   tracing_hist_preempt_stop(1);
+
if (!preempt_trace() && irq_trace())
stop_critical_timing(a0, a1);
 }
@@ -281,6 +288,8 @@ void notrace time_hardirqs_off(unsigned 
 {
if (!preempt_trace() && irq_trace())
start_critical_timing(a0, a1);
+
+   tracing_hist_preempt_start();
 }
 
 #else /* !CONFIG_LOCKDEP */
@@ -314,6 +323,8 @@ inline void print_irqtrace_events(struct
  */
 void notrace trace_hardirqs_on(void)
 {
+   tracing_hist_preempt_stop(1);
+
if (!preempt_trace() && irq_trace())
stop_critical_timing(CALLER_ADDR0, 0);
 }
@@ -323,11 +334,15 @@ void notrace trace_hardirqs_off(void)
 {
if (!preempt_trace() && irq_trace())
start_critical_timing(CALLER_ADDR0, 0);
+
+   tracing_hist_preempt_start();
 }
 EXPORT_SYMBOL(trace_hardirqs_off);
 
 void notrace trace_hardirqs_on_caller(unsigned long caller_addr)
 {
+   tracing_hist_preempt_stop(1);
+
if (!preempt_trace() && irq_trace())
stop_critical_timing(CALLER_ADDR0, caller_addr);
 }
@@ -337,6 +352,8 @@ void notrace trace_hardirqs_off_caller(u
 {
if (!preempt_trace() && irq_trace())

[PATCH 04/23 -v6] x86_64: notrace annotations

2008-01-25 Thread Steven Rostedt
Add "notrace" annotation to x86_64 specific files.

Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/x86/kernel/head64.c |2 +-
 arch/x86/kernel/setup64.c|4 ++--
 arch/x86/kernel/smpboot_64.c |2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/head64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/head64.c  2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/head64.c   2008-01-25 21:47:05.0 
-0500
@@ -46,7 +46,7 @@ static void __init copy_bootdata(char *r
}
 }
 
-void __init x86_64_start_kernel(char * real_mode_data)
+notrace void __init x86_64_start_kernel(char *real_mode_data)
 {
int i;
 
Index: linux-mcount.git/arch/x86/kernel/setup64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/setup64.c 2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/setup64.c  2008-01-25 21:47:05.0 
-0500
@@ -114,7 +114,7 @@ void __init setup_per_cpu_areas(void)
}
 } 
 
-void pda_init(int cpu)
+notrace void pda_init(int cpu)
 { 
struct x8664_pda *pda = cpu_pda(cpu);
 
@@ -197,7 +197,7 @@ DEFINE_PER_CPU(struct orig_ist, orig_ist
  * 'CPU state barrier', nothing should get across.
  * A lot of state is already set up in PDA init.
  */
-void __cpuinit cpu_init (void)
+notrace void __cpuinit cpu_init(void)
 {
int cpu = stack_smp_processor_id();
struct tss_struct *t = &per_cpu(init_tss, cpu);
Index: linux-mcount.git/arch/x86/kernel/smpboot_64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/smpboot_64.c  2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/smpboot_64.c   2008-01-25 
21:47:05.0 -0500
@@ -317,7 +317,7 @@ static inline void set_cpu_sibling_map(i
 /*
  * Setup code on secondary processor (after comming out of the trampoline)
  */
-void __cpuinit start_secondary(void)
+notrace __cpuinit void start_secondary(void)
 {
/*
 * Dont put anything before smp_callin(), SMP



[PATCH 12/23 -v6] Add context switch marker to sched.c

2008-01-25 Thread Steven Rostedt
Add marker into context_switch to record the prev and next tasks.
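
How a probe gets at the two task pointers can be sketched in userspace as
below. This mirrors the va_arg() pattern used by sched_switch_callback() in
patch 14 of this series, but it is only an illustration: struct task,
sched_switch_probe() and context_switch_site() are hypothetical stand-ins,
and the real markers API passes additional arguments.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct task_struct. */
struct task {
    int pid;
    int prio;
};

/* What the probe saw last; a real tracer would log this instead. */
static const struct task *last_prev, *last_next;
static const char *last_format;

/* Probe: receives the marker's format string followed by its arguments. */
static void sched_switch_probe(const char *format, ...)
{
    va_list ap;

    assert(format != NULL);

    /* The argument order must match the format string at the marker site. */
    va_start(ap, format);
    last_prev = va_arg(ap, const struct task *);
    last_next = va_arg(ap, const struct task *);
    va_end(ap);

    last_format = format;
}

/* Hypothetical marker site, standing in for the trace_mark() call. */
static void context_switch_site(const struct task *prev,
                                const struct task *next)
{
    sched_switch_probe("prev %p next %p", prev, next);
}
```

The format string is the only contract between the marker site and the
probe, so the two have to be changed together.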

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 kernel/sched.c |2 ++
 1 file changed, 2 insertions(+)

Index: linux-mcount.git/kernel/sched.c
===
--- linux-mcount.git.orig/kernel/sched.c2008-01-25 21:46:55.0 
-0500
+++ linux-mcount.git/kernel/sched.c 2008-01-25 21:47:19.0 -0500
@@ -2198,6 +2198,8 @@ context_switch(struct rq *rq, struct tas
struct mm_struct *mm, *oldmm;
 
prepare_task_switch(rq, prev, next);
+   trace_mark(kernel_sched_schedule,
+  "prev %p next %p", prev, next);
mm = next->mm;
oldmm = prev->active_mm;
/*



[PATCH 00/23 -v6] mcount and latency tracing utility -v6

2008-01-25 Thread Steven Rostedt
[
  version 6 of mcount / trace patches:

  changes include:

   Ported to latest git 99f1c97dbdb30e958edfd1ced0ae43df62504e07

   Added the runqueue_is_locked schedule api to let others (printk)
   know if the runqueue is locked and if it is safe to call
   wake_up on klogd.

   Added event_trace! This records various events like interrupts,
   timer events, page fault events, some scheduler stuff. This is
   also used to help show a little more information to the
   other latency tracers. It is a lot less overhead than what
   mcount gives us (without as much data).

   Also included in this series are the histograms. When configured in,
   the interrupts/preemption off and wakeup timings will be recorded
   into histograms found in /debugfs/tracing/latency_hist/
   This code is from MontaVista contributions into the RT kernel,
   with some cleanups and porting effort by myself.
]

All released versions of these patches can be found at:

   http://people.redhat.com/srostedt/tracing/


The following patch series brings to vanilla Linux a bit of the RT kernel
trace facility. This incorporates the "-pg" profiling option of gcc
that will call the "mcount" function for all functions called in
the kernel.

Note: I did investigate using -finstrument-functions but that adds a call
to both start and end of a function. Using mcount only does the
beginning of the function. mcount alone adds ~13% overhead. The
-finstrument-functions added ~19%.  Also it caused me to do tricks with
inline, because it adds the function calls to inline functions as well.
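
For comparison, this is roughly what the -finstrument-functions route looks
like in userspace. It is a sketch, not part of this series: gcc calls the
two __cyg_profile hooks on function entry and exit when that option is
given, and no_instrument_function (the attribute behind "notrace") exempts a
function. The counters and the traced_add()/untraced_add() helpers are made
up for the example; built without -finstrument-functions the hooks are
simply never called.

```c
#include <assert.h>

#define notrace __attribute__((no_instrument_function))

static unsigned long enter_calls, exit_calls;

void __cyg_profile_func_enter(void *this_fn, void *call_site);
void __cyg_profile_func_exit(void *this_fn, void *call_site);

/* The hooks must be notrace themselves, otherwise they would recurse. */
notrace void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
    enter_calls++;
}

notrace void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
    exit_calls++;
}

/* Instrumented when built with -finstrument-functions. */
static int traced_add(int a, int b)
{
    assert(a >= 0 && b >= 0);
    return a + b;
}

/* Never instrumented, whatever the build flags. */
static notrace int untraced_add(int a, int b)
{
    return a + b;
}
```

Two hook calls per traced function, against one call for mcount, is the
difference the overhead note above refers to.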

This patch series implements the code for x86 (32 and 64 bit), but
other archs can easily be implemented as well (note: ARM and PPC are
already implemented in -rt)

Some Background:


A while back, Ingo Molnar and William Lee Irwin III created a latency tracer
to find problem latency areas in the kernel for the RT patch.  This tracer
became an integral part of the RT kernel for finding latency hot
spots.  One of the features that the latency tracer added was a
function trace.  This function tracer would record all functions that
were called (implemented by the gcc "-pg" option) and would show what was
called while interrupts or preemption were turned off.

This feature is also very helpful in normal debugging, so there has been
talk of taking bits and pieces from the RT latency tracer and bringing them
to LKML, but no one had the time to do it.

Arnaldo Carvalho de Melo took a crack at it. He pulled out the mcount
as well as part of the tracing code and made it generic from the point
of the tracing code.  I'm not sure why this stopped. Probably because
Arnaldo is a very busy man, and his efforts had to be utilized elsewhere.

While I still maintain my own Logdev utility:

  http://rostedt.homelinux.com/logdev

I came across a need to do the mcount with logdev too. I was successful
but found that it became very dependent on a lot of code. One thing that
I liked about my logdev utility was that it was very non-intrusive, and has
been easy to port from the Linux 2.0 days. I did not want to burden the
logdev patch with the intrusiveness of mcount (not really that intrusive,
it just needs to add a "notrace" annotation to functions in the kernel
that will cause more conflicts in applying patches for me).

Being close to the holidays, I grabbed Arnaldo's old patches and started
massaging them into something that could be useful for logdev. What
I found out (after talking this over with Arnaldo too) is that this can
be much more useful for others as well.

The main thing I changed was that I made the mcount function itself
generic, with no dependency on the tracing code.  That is, I added

register_mcount_function()
 and
clear_mcount_function()

So whenever mcount is enabled and a function is registered, that function
is called for all functions in the kernel that are not labeled with the
"notrace" annotation.


The Simple Tracer:
------------------

To show the power of this I also massaged the tracer code that Arnaldo pulled
from the RT patch and made it be a nice example of what can be done
with this.

The function that is registered to mcount has the prototype:

 void func(unsigned long ip, unsigned long parent_ip);

The ip is the address of the function and parent_ip is the address of
the parent function that called it.
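
A minimal userspace sketch of the registration scheme described above.
The message does not give the signatures of register_mcount_function()
and clear_mcount_function(), so the ones below are guesses, and
mock_mcount(), simple_tracer() and the mcount_enabled variable are
hypothetical stand-ins for the entry stub, a tracer and the sysctl.

```c
#include <stddef.h>

/* Callback type matching the prototype quoted above. */
typedef void (*mcount_func_t)(unsigned long ip, unsigned long parent_ip);

/* Userspace stand-ins; the in-kernel signatures are assumed, not quoted. */
static mcount_func_t mcount_hook;   /* NULL means nothing is registered */
static int mcount_enabled;          /* mirrors the mcount_enabled sysctl */

static int register_mcount_function(mcount_func_t func)
{
	if (func == NULL)
		return -1;
	mcount_hook = func;
	return 0;
}

static void clear_mcount_function(void)
{
	mcount_hook = NULL;
}

/* What the entry stub of a traced function would do in this mock. */
static void mock_mcount(unsigned long ip, unsigned long parent_ip)
{
	if (mcount_enabled && mcount_hook)
		mcount_hook(ip, parent_ip);
}

/* Example tracer: remembers the last call site and counts the hits. */
static unsigned long last_ip, last_parent_ip, hits;

static void simple_tracer(unsigned long ip, unsigned long parent_ip)
{
	last_ip = ip;
	last_parent_ip = parent_ip;
	hits++;
}
```

In this mock the hook only fires when mcount is enabled and a function
is registered, in whichever order the two steps were taken.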

The x86_64 version has the assembly call the registered function directly
to save having to do a double function call.

To enable mcount, a sysctl is added:

   /proc/sys/kernel/mcount_enabled

Once mcount is enabled, when a function is registered, it will be called by
all functions. The tracer in this patch series shows how this is done.
It adds a directory in debugfs called mctracer, with a ctrl file that
allows the user to have the tracer register its function.  Note, the order
of enabling mcount and registering a function is not important, but both
must be done to initiate the tracing. That is, you can disable tracing
by eith

Re: [PATCH] Linux Kernel Markers Support for Proprierary Modules

2008-01-25 Thread Jon Masters
On Sat, 2008-01-26 at 14:27 +1100, Rusty Russell wrote:

> 2) Unconditionally reject modules with a wrong module section size.
> Currently we have no such check, which means without KALLSYMS, anything goes.

I favor the latter, since it's safest.

Jon.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl

2008-01-25 Thread Theodore Tso
On Thu, Jan 24, 2008 at 11:25:32AM +0530, Aneesh Kumar K.V wrote:
> +static int free_ext_idx(handle_t *handle, struct inode *inode,
> + struct ext4_extent_idx *ix)
> +{
> + int i, retval = 0;
> + ext4_fsblk_t block;
> + struct buffer_head *bh;
> + struct ext4_extent_header *eh;
> +
> + block = idx_pblock(ix);
> + bh = sb_bread(inode->i_sb, block);
> + if (!bh)
> + return -EIO;
> +
> + eh = (struct ext4_extent_header *)bh->b_data;
> + if (eh->eh_depth == 0) {
> + brelse(bh);
> + ext4_free_blocks(handle, inode, block, 1);
> + } else {
> + ix = EXT_FIRST_INDEX(eh);
> + for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
> + retval = free_ext_idx(handle, inode, ix);
> + if (retval)
> + return retval;
> + }
> + }
> + return retval;
> +}

Aneesh, looks like if eh->eh_depth is != 0, bh gets leaked.  This is
how I plan to fix it up:

+static int free_ext_idx(handle_t *handle, struct inode *inode,
+			struct ext4_extent_idx *ix)
+{
+	int i, retval = 0;
+	ext4_fsblk_t block;
+	struct buffer_head *bh;
+	struct ext4_extent_header *eh;
+
+	block = idx_pblock(ix);
+	bh = sb_bread(inode->i_sb, block);
+	if (!bh)
+		return -EIO;
+
+	eh = (struct ext4_extent_header *)bh->b_data;
+	if (eh->eh_depth == 0)
+		ext4_free_blocks(handle, inode, block, 1);
+	else {
+		ix = EXT_FIRST_INDEX(eh);
+		for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
+			retval = free_ext_idx(handle, inode, ix);
+			if (retval)
+				break;
+		}
+	}
+	put_bh(bh);
+	return retval;
+}
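
The shape of this fix (take the reference once, break out of the loop on
error, release at a single exit) can be tried in userspace. The sketch
below is only an illustration: struct mock_buf, mock_read(), mock_put()
and mock_free_idx() are hypothetical stand-ins, not ext4 or buffer-cache
code.

```c
#include <stddef.h>

/* Mock "buffer" with a reference count; stands in for struct buffer_head. */
struct mock_buf {
	int refs;                   /* outstanding references */
	int depth;                  /* 0 = lowest-level index block */
	size_t nr_children;
	struct mock_buf **children;
	int fail;                   /* force an error while freeing this node */
};

static struct mock_buf *mock_read(struct mock_buf *b)
{
	if (b)
		b->refs++;
	return b;
}

static void mock_put(struct mock_buf *b)
{
	b->refs--;
}

/* Recursively "free" an index subtree, dropping the reference on every path. */
static int mock_free_idx(struct mock_buf *node)
{
	struct mock_buf *b = mock_read(node);
	int retval = 0;
	size_t i;

	if (!b)
		return -5;              /* mirrors -EIO */

	if (b->fail) {
		retval = -1;
	} else if (b->depth != 0) {
		for (i = 0; i < b->nr_children; i++) {
			retval = mock_free_idx(b->children[i]);
			if (retval)
				break;      /* fall through to the release below */
		}
	}
	mock_put(b);                /* single exit: released on success and error */
	return retval;
}
```

After any call, successful or not, every refs field in the mock tree
should be back at zero.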

- Ted


Re: [Netatalk-admins] netatalk slow after system upgrade (possibly kernel problem?)

2008-01-25 Thread Didier
Hi,
On Fri, 25 Jan 2008 12:55:42 +0100, Michael Monnerie wrote
> Dear lists,
> 
> I've been spending a LOT of time trying to find out where's the 
> problem, but can't find it and therefore seek urgent help now. We 
> have the following system:
Did you try forcing the server MTU to 1500? (It looks like you have jumbo
frames enabled.)

Some interesting TCP/IP packets though :)

This may be off-topic since it was working before, but if it's a simple LAN
you may also have a flaky wire in the loop.

Didier



Re: [PATCH 063/196] kset: convert /sys/devices to use kset_create

2008-01-25 Thread Olof Johansson
On Thu, Jan 24, 2008 at 11:10:01PM -0800, Greg Kroah-Hartman wrote:
> Dynamically create the kset instead of declaring it statically.  We also
> rename devices_subsys to devices_kset to catch all users of the
> variable.

Guess what, you broke powerpc again!

[EMAIL PROTECTED]:~/work/linux/k.org $ git grep devices_subsys
arch/powerpc/kernel/vio.c:extern struct kset devices_subsys; /* needed for vio_find_name() */
arch/powerpc/kernel/vio.c:  found = kset_find_obj(&devices_subsys, kobj_name);

Obviously causes build failures, even of ppc64_defconfig.

(I can unfortunately not boot test, since I lack hardware that uses vio)


Signed-off-by: Olof Johansson <[EMAIL PROTECTED]>

diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index 19a5656..ee752ab 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -37,7 +37,7 @@
 #include 
 #include 
 
-extern struct kset devices_subsys; /* needed for vio_find_name() */
+extern struct kset *devices_kset; /* needed for vio_find_name() */
 
 static struct bus_type vio_bus_type;
 
@@ -369,7 +369,7 @@ static struct vio_dev *vio_find_name(const char *kobj_name)
 {
struct kobject *found;
 
-   found = kset_find_obj(&devices_subsys, kobj_name);
+   found = kset_find_obj(devices_kset, kobj_name);
if (!found)
return NULL;
 


Re: [PATCH] Linux Kernel Markers Support for Proprierary Modules

2008-01-25 Thread Rusty Russell
On Saturday 26 January 2008 02:31:30 Jon Masters wrote:
> On Fri, 2008-01-25 at 08:56 +0100, Jan Engelhardt wrote:
> > So what is needed is an Oops with an explaining message
> > if (kernel_tainted) "blame that proprietary module first",
> > and make sure the user sees that oops even if in X.
>
> The former is actually trivially doable with the module->taints bits. We
> could have the equivalent of a neon flashing "blame this module" sign.
>
> I also agree, we should stop force loading. Incompatible struct module,
> etc. are really bad things to have mapped into a running kernel.

I think there are two things here:
1) Currently we allow modules with no kallsyms info to be loaded into a 
KALLSYMS kernel (and taint).  A new option is needed to control this: 
CONFIG_ACCEPT_NO_KALLSYMS under KERNEL_DEBUG which allows loading of 
such "stripped" modules (a-la modprobe --force).

2) Unconditionally reject modules with a wrong module section size.  Currently 
we have no such check, which means without KALLSYMS, anything goes.
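
One way to read the two points above as a load-time policy, sketched in
userspace. Everything here is hypothetical: struct mock_module stands in
for the kernel's struct module, vet_module() is not a kernel function,
and the accept_no_kallsyms argument models the proposed
CONFIG_ACCEPT_NO_KALLSYMS option. That a stripped module is rejected
when the option is off, and still taints when it is on, is my reading of
point 1.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct module. */
struct mock_module {
	char name[56];
	unsigned long taints;
};

enum load_verdict { LOAD_OK, LOAD_OK_TAINT, LOAD_REJECT };

static enum load_verdict vet_module(size_t module_section_size,
				    int has_kallsyms,
				    int accept_no_kallsyms)
{
	/* Point 2: a wrongly sized module section is always rejected. */
	if (module_section_size != sizeof(struct mock_module))
		return LOAD_REJECT;

	/* Point 1: stripped modules load (and taint) only if the option is set. */
	if (!has_kallsyms)
		return accept_no_kallsyms ? LOAD_OK_TAINT : LOAD_REJECT;

	return LOAD_OK;
}
```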

Thoughts?
Rusty.


Re: using LKML for subsystem development (was Re: Linux 2.6.24)

2008-01-25 Thread Valdis . Kletnieks
On Sat, 26 Jan 2008 01:42:43 +0100, Stefan Richter said:

> Even if you only look at the Subject: and number of postings in a
> thread, how to judge whether there is a stability risk for the next -rc
> in the making, without experience or personal interest in the domain?

My general rule of thumb is "if my laptop has one of those on it, I look
at it more, even if I don't have the foggiest idea how it works".  Of
course, this only works for threads with semi-sane Subject: headers




Re: [GIT PATCH] driver core patches against 2.6.24

2008-01-25 Thread Valdis . Kletnieks
On Sat, 26 Jan 2008 01:05:26 +0100, Ingo Molnar said:

> all it takes for me on Fedora is to boot a modular distro kernel once, 
> then copy the /dev to the real (persistent) /dev:
> 
>mkdir /tmp2
>mount /dev/sda1 /tmp2
>cp -a /dev/* /tmp2/dev/
> 
> and from that point on a bzImage/vmlinuz can boot up on Fedora without 
> any problems (as long as it has the right drivers built in), and the 
> initrd line can be removed from grub.conf.

Tried something like that once - it didn't play nice with the fact that I
have root-on-LVM, so you need an initrd to do the 'lvm vgchange -a' and get
it online.




Re: Linux 2.6.24

2008-01-25 Thread Valdis . Kletnieks
On Sat, 26 Jan 2008 00:50:44 +0100, Stefan Richter said:
> How often is "bisectability" being broken already before merge in
> subsystem trees, and how often only in the context of the merge result?

I don't bisect git trees often - but I'd say that at least half the time
I have to bisect -mm, I'll hit a busticated bisection point and need to
move several one way or another.  Fortunately, Andrew does a good job
of keeping fixes near their parents, so it's usually not *that* hard to
clean up (at least for me - but I recently realized that I had passed the
3-decade mark of breaking and fixing software).  Newcomer kernel testers
are likely in for a rude awakening if they hit one of those points.





Re: Hot (un)plugging of a SATA drive with sata_nv (CK8S) ?

2008-01-25 Thread Robert Hancock

Ignacy Gawedzki wrote:

Hi everyone,

I'm having trouble determining the cause of the following behavior.  I'm not
even sure that I'm supposed to hot plug and unplug a SATA drive from a nForce3
Ultra (apparently CK8S, on a Gigabyte K8NS Ultra 939 mobo) SATA interface, to
begin with.  The information is hard to find given that the sata_nv driver
supports a range of different hardware.

I've recently acquired an external drive with (among others) an eSATA
interface, so I also bought a eSATA->SATA bracket and intend to use that drive
(Lacie d2 quadra 500G) through eSATA.


BTW, eSATA cannot technically be converted properly to SATA with a 
simple connector adapter. eSATA is supposed to use higher signalling 
voltages and so using such an adapter is not guaranteed to work.




The thing is that if I boot the machine with the drive plugged and turned on,
it is properly detected and usable.  If, at some point, I want to remove the
drive, I unmount any partitions on it and issue the proper scsiadd -r command
(usually scsiadd -r 1 0 0 0, since this is the second SATA drive) and
everything is fine (I turn the drive off and unplug it), so far.  Next, when
I want to use the drive again, it's still detected alright (although appears
as sdc and not sdb anymore), but the SCSI layer issues "scsi 1:0:0:0:
rejecting I/O to dead device" from time to time.  Then any scsiadd -r 1 0 0 0
command fails with "No such device or address", although it appears in the
output of scsiadd -p or even scsiadd -s (always as 1 0 0 0).  If I ignore that
detail and switch the drive off, then the kernel eventually notices that the
drive is gone and the SCSI layer attempts to stop the device and fails ([sdc]
START_STOP FAILED).  From that moment on, any attempt to plug the drive again
fails.  The kernel issues "ata2: hard resetting port" and "ata2: port is slow
to respond, please be patient (Status 0x80)" periodically, until I switch the
drive off.

If the drive is not present at boot, then hot plugging it fails.  The kernel
first soft resets the port, then issues the "please be patient (Status 0x80)"
message, complains that SRST failed (errno=-16) and goes on hard resetting the
port, issuing "please be patient (Status 0x80)" and complaining that COMRESET
failed (errno=-16), periodically, until the drive is switched off.


Full dmesg output would be useful..



If somebody could tell me whether hot-plugging is supposed to work with my
SATA interface, it would be nice. =)  The motherboard happens to offer another
SATA interface (Sil3512A) which is well supported and appears to support
hot-plugging as well, but it conflicts nastily with my PCTV Pro (bttv) card
(which are apparently known to conflict with the Sil SATA interfaces).

Thanks for any help.

Ignacy





Re: [linux-pm] Q: x86 suspend/hibernation code consolidation

2008-01-25 Thread Len Brown
On Friday 25 January 2008 19:32, Rafael J. Wysocki wrote:
> Hi,
> 
> I'd like to move the 64-bit suspend/hibernation files from arch/x86/kernel to
> arch/x86/power, modify the names of the 32-bit files already in
> arch/x86/power and update the Makefiles accordingly, but there are some
> changes queued for merging that touch the files in question.
> 
> When is the right time for making changes like that?
> 
> Rafael

In Cambridge, when we discussed cleanups that touch a lot of files
but have no functional change -- somebody suggested that right
after rc1 closes is a good time.  The reasoning was that they
would not conflict with the functional changes in rc1.

However, I recall Linus saying something about "Andrew is special"
WRT permission to push cleanups after the rc1 window; so I don't
know what the final ruling was -- if there was such a ruling.

-Len


[PATCH] x86_64: change aper valid checking sequence

2008-01-25 Thread Yinghai Lu
[PATCH] x86_64: change aper valid checking sequence

Old checking sequence:
  size ==> >4G ==> pointing to RAM
New checking sequence:
  >4G ==> pointing to RAM ==> size

Some BIOSes leave the aperture registers uncleared, so check the size
last.  This avoids misleading reports like

Node 0: Aperture @ 4a4200 size 32 MB
Aperture too small (32 MB)

With the patch we get

Node 0: Aperture @ 4a4200 size 32 MB
Aperture beyond 4G. Ignoring.

Signed-off-by: Yinghai Lu <[EMAIL PROTECTED]>

diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index 0b837bb..608152a 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -85,10 +85,6 @@ static int __init aperture_valid(u64 aper_base, u32 aper_size)
if (!aper_base)
return 0;
 
-   if (aper_size < 64*1024*1024) {
-   printk(KERN_ERR "Aperture too small (%d MB)\n", aper_size>>20);
-   return 0;
-   }
if (aper_base + aper_size > 0x100000000UL) {
printk(KERN_ERR "Aperture beyond 4GB. Ignoring.\n");
return 0;
@@ -97,6 +93,10 @@ static int __init aperture_valid(u64 aper_base, u32 aper_size)
printk(KERN_ERR "Aperture pointing to e820 RAM. Ignoring.\n");
return 0;
}
+   if (aper_size < 64*1024*1024) {
+   printk(KERN_ERR "Aperture too small (%d MB)\n", aper_size>>20);
+   return 0;
+   }
 
return 1;
 }
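
The reordered checks can be exercised in userspace with a sketch like
the one below. It is not the kernel function: aperture_check(), the
verdict enum and the points_to_ram flag (standing in for the e820
lookup) are hypothetical, and the 4 GB limit is written as 1ULL << 32 on
the strength of the "beyond 4GB" message.

```c
#include <stdint.h>

enum aper_verdict {
	APER_OK,
	APER_NO_BASE,
	APER_BEYOND_4G,
	APER_IN_RAM,
	APER_TOO_SMALL
};

#define APER_LIMIT_4G  (1ULL << 32)
#define APER_MIN_SIZE  (64u * 1024 * 1024)

/* points_to_ram stands in for the kernel's e820 lookup. */
static enum aper_verdict aperture_check(uint64_t aper_base,
					uint32_t aper_size,
					int points_to_ram)
{
	if (!aper_base)
		return APER_NO_BASE;
	if (aper_base + aper_size > APER_LIMIT_4G)
		return APER_BEYOND_4G;      /* checked before the size now */
	if (points_to_ram)
		return APER_IN_RAM;
	if (aper_size < APER_MIN_SIZE)
		return APER_TOO_SMALL;      /* size is checked last */
	return APER_OK;
}
```

With this ordering, a leftover 32 MB aperture above 4 GB is reported as
beyond 4G rather than as too small, which is the change in output the
patch description shows.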


RE: [kvm-devel] [PATCH][RFC] SVM: Add Support for Nested Paging in AMD Fam16 CPUs

2008-01-25 Thread Nakajima, Jun
Joerg Roedel wrote:
> Hi,
> 
> here is the first release of patches for KVM to support the Nested Paging
> (NPT) feature of AMD QuadCore CPUs for comments and public testing. This
> feature improves the guest performance significantly. I measured an
> improvement of around 17% using kernbench in my first tests.
> 
> This patch series is basically tested with Linux guests (32 bit legacy
> paging, 32 bit PAE paging and 64 bit Long Mode). Also tested with Windows
> Vista 32 bit and 64 bit. All these guests ran successfully with these
> patches. The patch series only enables NPT for 64 bit Linux hosts at the
> moment.
> 
> Please give these patches a good and deep testing. I hope we have this
> patchset ready for merging soon.

Good. We also ported the EPT patch for Xen to KVM, which we submitted
last year. We've been cleaning up the patch with Avi. We are working on
live migration support now, and we'll submit the patch once it's done.
So please stay tuned.

> 
> Joerg
> 

Jun
---
Intel Open Source Technology Center


Re: [RFC] Parallelize IO for e2fsck

2008-01-25 Thread Bryan Henderson
>> Incidentally, some context for the AIX approach to the OOM problem: a
>> process may exclude itself from OOM vulnerability altogether.  It places
>> itself in "early allocation" mode, which means at the time it creates
>> virtual memory, it reserves enough backing store for the worst case. The
>> memory manager does not send such a process the SIGDANGER signal or
>> terminate it when it runs out of paging space.  Before c. 2000, this was
>> the only mode.  Now the default is late allocation mode, which is similar
>> to Linux.
>
>This is an interesting approach. It feels like some programs might be 
>interested in choosing this mode instead of risking OOM. 

It's the way virtual memory always worked when it was first invented.  The 
system not only reserved space to back every page of virtual memory; it 
assigned the particular blocks for it.  Late allocation was a later 
innovation, and I believe its main goal was to make it possible to use the 
cheaper disk drives for paging instead of drums.  Late allocation gives 
you better locality on disk, so the seeking doesn't eat you alive (drums 
don't seek).  Even then, I assume (but am not sure) that the system at 
least reserved the space in an account somewhere so at pageout time there 
was guaranteed to be a place to which to page out.  Overcommitting page 
space to save on disk space was a later idea.

I was surprised to see AIX do late allocation by default, because IBM's 
traditional style is bulletproof systems.  A system where a process can be 
killed at unpredictable times because of resource demands of unrelated 
processes doesn't really fit that style.

It's really a fairly unusual application that benefits from late 
allocation: one that creates a lot more virtual memory than it ever 
touches.  For example, a sparse array.  Or am I missing something?

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems



RE: [kvm-devel] [PATCH 3/8] SVM: add module parameter to disable NestedPaging

2008-01-25 Thread Nakajima, Jun
Joerg Roedel wrote:
> To disable the use of the Nested Paging feature even if it is available in
> hardware this patch adds a module parameter. Nested Paging can be disabled by
> passing npt=off to the kvm_amd module.

I think it's better to use a (common) parameter to qemu. That way you
can control on/off for each VM.

Jun
---
Intel Open Source Technology Center


Re: CS5536 mfgpt timer setup register hangs board

2008-01-25 Thread Hasan Rashid
Jordan,

I am using TinyBIOS v0.99 with the MFGPT workaround disabled, and I still
run into the system hang on a subsequent write. To try out the fix you
mentioned, I enabled the workaround in the BIOS and then applied the fix.
It still hangs.

I am dumping this info from the module at load time, when setting up the
device.

-
read 0 from: 6206
read 0 from: 620e
read 0 from: 6216
read 0 from: 621e
read 0 from: 6226
read 0 from: 622e
read 0 from: 6236
read 0 from: 623e
geode-mfgpt: MFGPT PCI device enabled
geode-mfgpt:  8 timers available.
geode-mfgpt:  Registered timer # 0
writting 306 to: 6206
[And then it hangs as if CS5536 is now mad]
-
I tried specifying a timer number, but I see the same behaviour with all of
them.

In the code this is all I am doing:
/* Set up the timer */
geode_mfgpt_write(wdt_timer, MFGPT_REG_SETUP,
  GEODEWDT_SCALE | (3 << 8) );
geode_mfgpt_read(wdt_timer, MFGPT_REG_SETUP);

void
geode_mfgpt_write(int i, u16 r, u16 v)
{
	printk("writting %x to: %lx \n", v,
	       (unsigned long)(mfgpt_iobase + (r + (i * 8))));
	outl(v, (unsigned long)(mfgpt_iobase + (r + (i * 8))));
}

u16
geode_mfgpt_read(int i, u16 r)
{
	u16 val;
	val = inl((unsigned long)mfgpt_iobase + (r + (i * 8)));
	printk("read %x from: %lx\n", val,
	       (unsigned long)(mfgpt_iobase + (r + (i * 8))));
	return val;
}
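
The port arithmetic in these helpers can be checked in userspace. The
sketch below assumes the values implied by the log above (I/O base
0x6200, setup register at offset 6, 8 bytes of register space per
timer); mfgpt_reg_port() and access_spills() are hypothetical helpers,
not driver code.

```c
#include <stdint.h>

/* Values inferred from the log above; treat them as assumptions. */
#define MFGPT_IOBASE     0x6200u  /* "read 0 from: 6206" for timer 0 */
#define MFGPT_REG_SETUP  6u       /* offset of the setup register */
#define MFGPT_STRIDE     8u       /* bytes of register space per timer */

/* I/O port of register `reg` in timer `timer`. */
static uint16_t mfgpt_reg_port(unsigned int timer, unsigned int reg)
{
	return (uint16_t)(MFGPT_IOBASE + reg + timer * MFGPT_STRIDE);
}

/*
 * Returns 1 if an access of width_bytes starting at offset `reg` would
 * reach past the end of the timer's own 8-byte block, 0 otherwise.
 */
static int access_spills(unsigned int reg, unsigned int width_bytes)
{
	return reg + width_bytes - 1 >= MFGPT_STRIDE;
}
```

Under these assumptions a 16-bit access to the setup register stays
inside the timer's block, while a 32-bit access at offset 6 (as outl()
and inl() perform) also covers the first two bytes of the next block.
Whether that has anything to do with the hang is not established here.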


Now, while experimenting, I set the Counter enable bit on the first write
and I don't touch the setup register again.

geode_mfgpt_write(wdt_timer, MFGPT_REG_SETUP,
  GEODEWDT_SCALE | (3 << 8) | MFGPT_SETUP_CNTEN);

Before calling the above function, I set the reset event and initialized
CMP2 with 0x7530. Therefore, on every "geode_ping" to the timer I only
re-write 0x0 in the Up Counter register. This works fine, except that the
reset event seems to get unhooked, as the system never reboots as expected.

So, I figured it's either that the event is unset or the counter gets
disabled. I tried setting the reset event on every ping but that didn't
solve the problem. Then I tried setting the Counter Enable bit
(MFGPT_SETUP_CNTEN), which as you might've guessed hung the system but,
interestingly, the system rebooted after 60 secs. That got me
thinking that it was the counter enable bit that gets unset.

Anyhow, that's where I am stuck. The Alix2c0 boards use AMD Geode LX700, I
looked in the databook to see if there are any GPIO registers that can be
used as an alternative to program a watchdog timer but I couldn't find
anything usable. And I can't think of anything different to try with the
MFGPTs.

Not sure, but does the kernel version make a difference in any of this? I
am using 2.4; I have yet to try this on 2.6.

> On 25/01/08 15:50 -0800, Hasan Rashid wrote:
>>
>> Hi,
>>
>> I have been working on a watchdog timer using the mfgpt on AMD Geode
>> CS5536. I initialize the setup register MFGPT0_SETUP (0x6206) with hex
>> value 0x306 (1100000110b). However, after this first initialization if I
>> ever read/write to the register it hangs the system.
>>
>> I have been through all the documentation, tried several different
>> methods
>> but all the efforts, frustratingly, to no avail.
>>
>> Does anyone have any idea as to why would this be? TIA!
>
> It looks like you are using TinyBIOS.  Make sure that if you are using
> v0.99
> that you do *not* enable the MFGPT workaround.  If you are using an older
> version, then you will need this patch:
>
> http://lkml.org/lkml/2008/1/23/372
>
> And enable mfgptfix on the command line.  There seems to be a problem with
> the MFGPT "workaround" that causes hangs exactly like you are seeing.
>
> Jordan
>
> --
> Jordan Crouse
> Systems Software Development Engineer
> Advanced Micro Devices, Inc.
>
>
>


-- 
Regards,
Hasan Rashid



Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

2008-01-25 Thread H. Peter Anvin

Jeremy Fitzhardinge wrote:


Now, all of this reminds me of something somewhat messy: if we share 
the kernel page tables for trampoline page tables, as discussed 
elsewhere, we HAVE to do a complete, all-tlb-including-global-pages 
flush after use, since the kernel pages are global and otherwise will 
stick around.  Unlike the permissions pages, there aren't G enable 
bits on the higher levels, but only for the PTEs themselves.


That wouldn't happen too often though, would it.  The identity mapping is 
only interested in a 1:1 view on RAM, and that's not going to change at 
all?  Does the TLB cache PAT attributes?  Do you need to do a global 
flush after changing a PTE's PAT bits to make sure that all that PTE's 
mappings have a consistent view on memory?




You do need to flush *that page* globally, yes.

As far as flushing after using the trampoline pagetables, we're talking 
about rare, expensive events here like suspend to ram.


-hpa


Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

2008-01-25 Thread Jeremy Fitzhardinge

H. Peter Anvin wrote:

Keir Fraser wrote:

On 25/1/08 22:54, "Jeremy Fitzhardinge" <[EMAIL PROTECTED]> wrote:


The only possibly relevant comment I can find in vol3a is:

Older IA-32 processors that implement the PAE mechanism use uncached
accesses when loading page-directory-pointer table entries. This
behavior is model specific and not architectural. More recent IA-32
processors may cache page-directory-pointer table entries.


Go read the Intel application note "TLBs, Paging-Structure Caches, and
Their Invalidation" at
http://www.intel.com/design/processor/applnots/317080.pdf

Section 8.1 explains about the PDPTR cache in 32-bit PAE mode, which can
only be refreshed by appropriate tickling of CR0, CR3 or CR4.

It is also important to note that *any* valid page directory entry at
*any* level in the page-table hierarchy can become cached at *any* time.
Basically TLB lookup is performed as a longest-prefix match on the linear
address to skip as many levels in a page-table walk as possible (where a
walk is needed, because there is no full-length match on the linear
address). So, if you modify a directory entry from present to not-present,
or change the page directory that a valid pde points to, you probably need
to flush the pde caching structure. One piece of good news is that all pde
caches are flushed by any arbitrary INVLPG.



Actually, it's trickier than that.  The PDPTR, just like the segments, 
aren't a real cache, and aren't invalidated by INVLPG.  This means you 
can't go from less permissive to more permissive, which is normally 
permitted in the x86.  The PDPTR should really be thought of as an 
extended cr3 with four entries (this is also how it would be typically 
implemented in hardware) rather than as a part of the paging structure 
per se.


Yeah, that's basically what 8.1 says.  PAE doesn't follow the normal TLB 
rules for the top level, though they reserve the right to make it behave 
properly (as it would if you graft a PAE pagetable into a full 64-bit 
pagetable).



Now, all of this reminds me of something somewhat messy: if we share 
the kernel page tables for trampoline page tables, as discussed 
elsewhere, we HAVE to do a complete, all-tlb-including-global-pages 
flush after use, since the kernel pages are global and otherwise will 
stick around.  Unlike the permissions pages, there aren't G enable 
bits on the higher levels, but only for the PTEs themselves.


That wouldn't happen too often though, would it.  The identity mapping is 
only interested in a 1:1 view on RAM, and that's not going to change at 
all?  Does the TLB cache PAT attributes?  Do you need to do a global 
flush after changing a PTE's PAT bits to make sure that all that PTE's 
mappings have a consistent view on memory?


   J


Re: [RFC] Parallelize IO for e2fsck

2008-01-25 Thread Zan Lynx

On Fri, 2008-01-25 at 04:09 -0700, Andreas Dilger wrote:
> On Jan 24, 2008  17:25 -0700, Zan Lynx wrote:
> > Have y'all been following the /dev/mem_notify patches?
> > http://article.gmane.org/gmane.linux.kernel/628653
> 
> Having the notification be via poll() is a very restrictive processing
> model.  Having the notification be via a signal means that any kind of
> process (and not just those that are event loop driven) can register
> a callback at some arbitrary point in the code and be notified.  I
> don't object to the poll() interface, but it would be good to have a
> signal mechanism also.

The commentary on the mem_notify threads claimed that the signal is
easily provided by setting up the file handle for SIGIO.

Yeah.  Here it is...copied from email written by KOSAKI Motohiro:

implement FASYNC capability to /dev/mem_notify.


fd = open("/dev/mem_notify", O_RDONLY);

fcntl(fd, F_SETOWN, getpid());

flags = fcntl(fd, F_GETFL);
fcntl(fd, F_SETFL, flags|FASYNC);  /* when low memory, receive SIGIO */
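
For reference, a fuller sketch of the same SIGIO setup with a handler
and error checks. It works on any descriptor whose driver supports
asynchronous notification; /dev/mem_notify is the device from the
proposed patches and may not exist on a given kernel. enable_sigio(),
on_sigio() and got_sigio are hypothetical names, and O_ASYNC is the
same flag that the snippet above spells FASYNC.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set from the handler; the main code can poll and clear it. */
static volatile sig_atomic_t got_sigio;

static void on_sigio(int signo)
{
	(void)signo;
	got_sigio = 1;
}

/* Route SIGIO for fd to this process. Returns 0 on success, -1 on error. */
static int enable_sigio(int fd)
{
	struct sigaction sa;
	int flags;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigio;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGIO, &sa, NULL) < 0)
		return -1;

	/* Deliver the signal to this process. */
	if (fcntl(fd, F_SETOWN, getpid()) < 0)
		return -1;

	/* Turn on asynchronous notification for the descriptor. */
	flags = fcntl(fd, F_GETFL);
	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_ASYNC) < 0 ? -1 : 0;
}
```

A caller would open the device, pass the descriptor to enable_sigio(),
and then check got_sigio at a convenient point in its own processing,
with no event loop required.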

-- 
Zan Lynx <[EMAIL PROTECTED]>




[GIT PATCH] SCSI updates for 2.6.24 (part 1)

2008-01-25 Thread James Bottomley
We have a difficult merge this time; the SCSI tree is split between
components that can go now and pieces that are waiting on other trees.
Part 1 is the components that can go now ... you'll be getting part 2
towards the end of the merge window.

There are misc driver updates, the accessor conversions (preparation for
large scatterlists) and tons of other misc updates.

There are also some sysfs changes (with Greg's ack) because of the way
the dependencies thread through SCSI.

The patch is available here:

master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git

The short changelog is:

Adrian Bunk (4):
  qla2xxx: Code cleanups.
  megaraid: add __devexit annotation
  lpfc: minor cleanups
  53c7xx: fix removal fallout

Alan Cox (1):
  aacraid: fix security weakness

Andi Kleen (1):
  sg: Only print SCSI data direction warning once for a command

Andrew Morton (1):
  sgiwd93: export sgiwd93_reset()

Andrew Vasquez (13):
  qla2xxx: Update version number to 8.02.00-k7.
  qla2xxx: Correct late-memset() of EFT buffer.
  qla2xxx: Add Fibre Channel Event (FCE) tracing support.
  qla2xxx: Trace-Control naming cleanups.
  qla2xxx: Don't schedule the DPC routine to perform an issue-lip request.
  qla2xxx: Restrict MSI/MSI-X enablement on select ISP2432-type HBAs.
  qla2xxx: Wait for FLASH write-protection to complete after a write.
  qla2xxx: Fix for 32-bit platforms with 64-bit resources.
  qla2xxx: Retrieve additional HBA port statistics from recent ISPs.
  qla2xxx: Consolidate duplicate sense-data handling codes.
  qla2xxx: Update version number to 8.02.00-k6.
  qla2xxx: Correct NPIV support for recent ISPs.
  qla2xxx: Don't explicitly read mbx registers while processing a system-err

Boaz Harrosh (26):
  libiscsi,iser: patch for AHS support
  iscsi_tcp, libiscsi: initial AHS Support
  iscsi: Prettify resid handling and some extra checks
  imm: convert to accessors and !use_sg cleanup
  ppa: convert to accessors and !use_sg cleanup
  NCR5380 family: convert to accessors & !use_sg cleanup
  wd7000: proper fix for boards without sg support
  atp870u: convert to accessors and !use_sg cleanup
  scsi_debug: convert to use the data buffer accessors
  isd200: use one-element sg list in issuing commands
  usb: transport - convert to accessors and !use_sg code path removal
  usb: shuttle_usbat - convert to accessors and !use_sg code path removal
  usb: freecom & sddr09 - convert to accessors and !use_sg cleanup
  usb: protocol - convert to accessors and !use_sg code path removal
  seagate: Remove driver
  psi240i: remove driver
  in2000: convert to accessors and !use_sg cleanup
  qlogicpti: convert to accessors and !use_sg cleanup
  wd33c93: convert to accessors and !use_sg cleanup
  fd_mcs: convert to accessors and !use_sg cleanup
  aha1542: convert to accessors and !use_sg cleanup
  a3000: convert to accessors and !use_sg cleanup
  a2091: convert to accessors and !use_sg cleanup
  eata_pio: convert to accessors and !use_sg cleanup
  nsp_cs: convert to data accessors and !use_sg cleanup
  aha152x: Use scsi_eh API for REQUEST_SENSE invocation

Brian King (1):
  ibmvscsi: Set default command timeout

Christof Schmitt (11):
  zfcp: Hold queue lock when checking port/unit handle for task management c
  zfcp: Hold queue lock when checking port/unit handle for FCP command
  zfcp: Hold queue lock when checking port handle for ELS command
  zfcp: Hold queue lock when checking port/unit handle for abort command
  zfcp: Fix evaluation of port handles in abort handler
  zfcp: Reduce flood on hba trace
  zfcp: Fix deadlock when adding invalid LUN
  zfcp: Remove SCSI devices when removing complete adapter
  zfcp: Specify waiting times in ERP in seconds
  zfcp: Use also port and adapter to identify unit in messages.
  zfcp: Remove unnecessary eh_bus_reset_handler callback

Christoph Hellwig (1):
  aacraid: don't assign cpu_to_le32(int) to u8

Darrick J. Wong (2):
  libsas: Fix various sparse complaints
  libsas: Convert sas_proto users to sas_protocol

Denis Cheng (1):
  ipr: use LIST_HEAD instead of LIST_HEAD_INIT

Erez Zilber (1):
  IB/iSER: add logical unit reset support

FUJITA Tomonori (13):
  ch: remove forward declarations
  ch: fix device minor number management bug
  ch: handle class_device_create failure properly
  use dynamically allocated sense buffer
  sg: handle class_device_create failure properly
  sg: set class_data after success
  replace sizeof sense_buffer with SCSI_SENSE_BUFFERSIZE
  aic7xxx_old, eata_pio, ips, libsas: don't zero out sense_buffer in queueco
  libsas: fix sense_buffer overrun
  fix scsi_setup_command_freelist failure path race
  mpt fusion: make mptsas_smp_handler update resid
  iscsi_tcp: update

[PATCH] 2.4: Back-port of pl2303.c from 2.6.23.14

2008-01-25 Thread David Newall
I experienced major major data loss on a PL-2303 USB-serial converter
under 2.4.36, which I remedied by back-porting the pl2303.c from the
latest 2.6 kernel tree.

---

diff -u linux-2.4.36/drivers/usb/serial/pl2303.c.orig linux-2.4.36/drivers/usb/serial/pl2303.c
--- pl2303.c.orig   2008-01-01 22:36:40.0 +1030
+++ pl2303.c        2008-01-26 05:32:00.0 +1030
@@ -1,17 +1,20 @@
 /*
  * Prolific PL2303 USB to serial adaptor driver
  *
- * Copyright (C) 2001-2003 Greg Kroah-Hartman ([EMAIL PROTECTED])
+ * Copyright (C) 2001-2007 Greg Kroah-Hartman ([EMAIL PROTECTED])
  * Copyright (C) 2003 IBM Corp.
  *
  * Original driver for 2.2.x by anonymous
  *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
  *
  * See Documentation/usb/usb-serial.txt for more information on using this driver
+ * 2007_Jan_25 dn
+ * Back-port pl2303.c from linux-2.6.23.14 - corrects major loss of
+ * transmitted data, plus minor loss during close. [EMAIL PROTECTED]
+ *
  * 2003_Apr_24 gkh
  * Added line error reporting support.  Hopefully it is correct...
  *
@@ -33,6 +36,9 @@
  * 
  */
 
+/* TODO first char sent is lost on second open of device.  anecdotal evidence
+ * TODO suggests this might be on all even opens of device. dn. */
+
 #include 
 #include 
 #include 
@@ -59,31 +65,60 @@
 /*
  * Version Information
  */
-#define DRIVER_VERSION "v0.10.1"   /* Takes from 2.6's */
 #define DRIVER_DESC "Prolific PL2303 USB to serial adaptor driver"
 
+#define PL2303_CLOSING_WAIT    (30*HZ)
 
+#define PL2303_BUF_SIZE        1024
+#define PL2303_TMP_BUF_SIZE    1024
+
+struct pl2303_buf {
+   unsigned int    buf_size;
+   char            *buf_buf;
+   char            *buf_get;
+   char            *buf_put;
+};
 
 static struct usb_device_id id_table [] = {
{ USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID) },
{ USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_RSAQ2) },
+   { USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_DCU11) },
+   { USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_RSAQ3) },
+   { USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_PHAROS) },
{ USB_DEVICE(IODATA_VENDOR_ID, IODATA_PRODUCT_ID) },
{ USB_DEVICE(ATEN_VENDOR_ID, ATEN_PRODUCT_ID) },
{ USB_DEVICE(ATEN_VENDOR_ID2, ATEN_PRODUCT_ID) },
{ USB_DEVICE(ELCOM_VENDOR_ID, ELCOM_PRODUCT_ID) },
+   { USB_DEVICE(ELCOM_VENDOR_ID, ELCOM_PRODUCT_ID_UCSGT) },
{ USB_DEVICE(ITEGNO_VENDOR_ID, ITEGNO_PRODUCT_ID) },
+   { USB_DEVICE(ITEGNO_VENDOR_ID, ITEGNO_PRODUCT_ID_2080) },
{ USB_DEVICE(MA620_VENDOR_ID, MA620_PRODUCT_ID) },
{ USB_DEVICE(RATOC_VENDOR_ID, RATOC_PRODUCT_ID) },
{ USB_DEVICE(TRIPP_VENDOR_ID, TRIPP_PRODUCT_ID) },
{ USB_DEVICE(RADIOSHACK_VENDOR_ID, RADIOSHACK_PRODUCT_ID) },
{ USB_DEVICE(DCU10_VENDOR_ID, DCU10_PRODUCT_ID) },
{ USB_DEVICE(SITECOM_VENDOR_ID, SITECOM_PRODUCT_ID) },
+   { USB_DEVICE(ALCATEL_VENDOR_ID, ALCATEL_PRODUCT_ID) },
+   { USB_DEVICE(SAMSUNG_VENDOR_ID, SAMSUNG_PRODUCT_ID) },
+   { USB_DEVICE(SIEMENS_VENDOR_ID, SIEMENS_PRODUCT_ID_SX1) },
+   { USB_DEVICE(SIEMENS_VENDOR_ID, SIEMENS_PRODUCT_ID_X65) },
+   { USB_DEVICE(SIEMENS_VENDOR_ID, SIEMENS_PRODUCT_ID_X75) },
+   { USB_DEVICE(SYNTECH_VENDOR_ID, SYNTECH_PRODUCT_ID) },
+   { USB_DEVICE(NOKIA_CA42_VENDOR_ID, NOKIA_CA42_PRODUCT_ID) },
+   { USB_DEVICE(CA_42_CA42_VENDOR_ID, CA_42_CA42_PRODUCT_ID) },
+   { USB_DEVICE(SAGEM_VENDOR_ID, SAGEM_PRODUCT_ID) },
+   { USB_DEVICE(LEADTEK_VENDOR_ID, LEADTEK_9531_PRODUCT_ID) },
+   { USB_DEVICE(SPEEDDRAGON_VENDOR_ID, SPEEDDRAGON_PRODUCT_ID) },
+   { USB_DEVICE(DATAPILOT_U2_VENDOR_ID, DATAPILOT_U2_PRODUCT_ID) },
+   { USB_DEVICE(BELKIN_VENDOR_ID, BELKIN_PRODUCT_ID) },
+   { USB_DEVICE(ALCOR_VENDOR_ID, ALCOR_PRODUCT_ID) },
+   { USB_DEVICE(HUAWEI_VENDOR_ID, HUAWEI_PRODUCT_ID) },
+   { USB_DEVICE(WS002IN_VENDOR_ID, WS002IN_PRODUCT_ID) },
{ } /* Terminating entry */
 };
 
 MODULE_DEVICE_TABLE (usb, id_table);
 
-
 #define SET_LINE_REQUEST_TYPE  0x21
 #define SET_LINE_REQUEST   0x20
 
@@ -164,6 +199,8 @@
 
 struct pl2303_private {
spinlock_t lock;
+   struct pl2303_buf *buf;
+   int write_urb_in_use;
wait_queue_head_t delta_msr_wait;
u8 line_control;
u8 line_status;
@@ -171,6 +208,175 @@
enum pl2303_type type;
 };
 
+/*
+ * pl2303_buf_alloc
+ *
+ * Allocate a circular buffer and all associated memory.
+ */
+static struc
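
The patch text is cut off at this point in the archive, so the driver's buffer routines are not shown. As a rough illustration of how a circular buffer with the four fields of struct pl2303_buf can work, here is a minimal user-space sketch. Only the field names come from the patch; the ring_* function names, the use of malloc(), and the "keep one slot unused" convention are assumptions made for this example and are not taken from the driver.

```c
#include <assert.h>
#include <stdlib.h>

/* Field names follow struct pl2303_buf in the patch; everything else is
 * an illustrative user-space assumption, not the driver's code. */
struct ring_buf {
	unsigned int buf_size;  /* bytes of storage; one slot is kept unused */
	char *buf_buf;          /* start of storage */
	char *buf_get;          /* next byte to read */
	char *buf_put;          /* next byte to write */
};

static struct ring_buf *ring_alloc(unsigned int size)
{
	struct ring_buf *rb;

	if (size < 2)
		return NULL;
	rb = malloc(sizeof(*rb));
	if (rb == NULL)
		return NULL;
	rb->buf_buf = malloc(size);
	if (rb->buf_buf == NULL) {
		free(rb);
		return NULL;
	}
	rb->buf_size = size;
	rb->buf_get = rb->buf_put = rb->buf_buf;
	return rb;
}

static void ring_free(struct ring_buf *rb)
{
	if (rb != NULL) {
		free(rb->buf_buf);
		free(rb);
	}
}

/* Number of bytes currently stored. */
static unsigned int ring_data_avail(const struct ring_buf *rb)
{
	assert(rb != NULL);
	if (rb->buf_put >= rb->buf_get)
		return (unsigned int)(rb->buf_put - rb->buf_get);
	return rb->buf_size - (unsigned int)(rb->buf_get - rb->buf_put);
}

/* Free space; get == put means empty, so at most buf_size - 1 bytes fit. */
static unsigned int ring_space_avail(const struct ring_buf *rb)
{
	return rb->buf_size - 1 - ring_data_avail(rb);
}

/* Copy up to count bytes in; returns how many were actually stored. */
static unsigned int ring_put(struct ring_buf *rb, const char *src,
			     unsigned int count)
{
	unsigned int i, space = ring_space_avail(rb);

	if (count > space)
		count = space;
	for (i = 0; i < count; i++) {
		*rb->buf_put++ = src[i];
		if (rb->buf_put == rb->buf_buf + rb->buf_size)
			rb->buf_put = rb->buf_buf;  /* wrap around */
	}
	return count;
}

/* Copy up to count bytes out; returns how many were actually read. */
static unsigned int ring_get(struct ring_buf *rb, char *dst,
			     unsigned int count)
{
	unsigned int i, avail = ring_data_avail(rb);

	if (count > avail)
		count = avail;
	for (i = 0; i < count; i++) {
		dst[i] = *rb->buf_get++;
		if (rb->buf_get == rb->buf_buf + rb->buf_size)
			rb->buf_get = rb->buf_buf;  /* wrap around */
	}
	return count;
}
```

In a write path, a caller would queue data with ring_put() and a completion handler would drain it with ring_get(). A short return value from ring_put() tells the caller that the buffer is full, so the excess can be retried later rather than silently dropped.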

using LKML for subsystem development (was Re: Linux 2.6.24)

2008-01-25 Thread Stefan Richter
(I already deleted the posting I'm going to reply to, therefore
References and In-Reply-To are wrong.  Sorry.)

On 2008-01-25, Ingo Molnar wrote in http://lkml.org/lkml/2008/1/25/320:
> * Giacomo A. Catenazzi <[EMAIL PROTECTED]> wrote:
>> As a tester, I'm not so happy.
>> The last few merge windows were a nightmare for us (the testers).
>> It reminds me of the 2.1.x times, but with a few differences:
>> - more changes, so bugs are unnoticed/ignored in the first weeks or
>> - or people are pushing more patches possible, so they delay
>>   bug corrections to later times (after merge windows).
> 
> i think this heavily varies per subsystem.
> 
> v2.6.24-rc was pretty bad due to the sglist design bug that crept in and 
> that kept most of the IO hackers busy for a few weeks, while testsystems 
> kept crashing and no progress was made on _other_ bugs. v2.6.24 early 
> rc's were also marred by half-cooked networking patches messing up 
> bisectability. I've seen a number of testers give up on that alone. 
> There was an unusually high flux of networking fixes throughout v2.6.24, 
> up to the very last day before the release.
> 
> Since it's Friday already, i put the blame for that on all the 
> subsystems that do not develop on lkml! :-)
> 
> It is _very_ hard for us to judge the stability and sanity of a 
> subsystem (and the risk factor of upcoming features!) if it's not 
> developed on lkml. Observing the bugs alone helps in getting a picture, 
> but it does not help the testers of early -rc's:
[...]
> there's way too much 'surprise factor' 
> in the git merges and all the hidden development that is not directly 
> visible on lkml. The 'surprise factor' does not even come mainly from 
> combining all the trees together (that is relatively easy), it is in the 
> cumulative risk factor that is hard to get right due to development not 
> always being done on lkml.
> 
> Case in point from arch/x86: everyone who follows lkml could have predicted 
> it from the PAT development discussions that PAT is simply not ready 
> yet. We deferred it to v2.6.26,

The remedy can't just be to Cc: LKML all the time.  This would shift the
burden of directing the "general public's" attention from the domain
experts to the general public.  How will subscribers of LKML decide
which discussion threads in the huge amount of traffic are worth
glancing at?  Each of us has only a limited amount of time for LKML
consumption.

Even if you only look at the Subject: and number of postings in a
thread, how to judge whether there is a stability risk for the next -rc
in the making, without experience or personal interest in the domain?

> but had we tried to cram it into v2.6.25 
> and had it broken boxes left and right, we'd rightfully be confronted 
> with all the existing lkml track record that suggested bad PAT related 
> problems and predicted the outcome. For subsystems that do not develop 
> on lkml, no such lkml track record exists and the danger of introducing 
> bad patches and ruining early -rc's increases.

Having a track record in list archives doesn't prevent bugs from happening,
at least not directly.  It might help to clarify who's responsible, if
the changelog doesn't tell us already, and thus might have a positive
long term effect on quality.  (I work in an industry where it is often
hard to identify responsibilities which IMO contributes to chronic
quality issues in that industry.)

Anyhow, I will try to remember to add a list archive pointer into my
future "what's in abc123-2.6.git" messages, so that those who care can
browse over the topics and threads to get at least a superficial
impression of what went on on the development list behind LKML's back.

(I usually also add Cc: LKML to discussions when I get the feeling that
the expertise and judgment on the development list might not be
sufficient during a respective stage of development --- but of course my
judgment of when to involve LKML isn't objective and perfect.  That is,
I /don't/ claim this to be the best way to handle subsystem development
discussions.)
-- 
Stefan Richter
-=-==--- ---= ==-=-
http://arcgraph.de/sr/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PATCH] driver core patches against 2.6.24

2008-01-25 Thread Jon Masters

On Sat, 2008-01-26 at 01:27 +0100, Peter Zijlstra wrote:
> On Sat, 2008-01-26 at 01:05 +0100, Ingo Molnar wrote:
> > * Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > 
> > > My wish is that distros would just boot without requiring an initrd. I 
> > > know how to make them for redhat and debian based distros, but the 
> > > fact that you can't (easily) cross-build them makes it a very tedious 
> > > construct.
> > 
> > all it takes for me on Fedora is to boot a modular distro kernel once, 
> > then copy the /dev to the real (persistent) /dev:
> > 
> >    mkdir /tmp2
> >    mount /dev/sda1 /tmp2
> >    cp -a /dev/* /tmp2/dev/
> > 
> > and from that point on a bzImage/vmlinuz can boot up on Fedora without 
> > any problems (as long as it has the right drivers built in), and the 
> > initrd line can be removed from grub.conf.
> 
> Yeah, I usually do the same but with a bind mount, still it would be
> grand if such things would not be needed.

Agreed. But it's not likely to be a priority - all the vendors want
completely modular kernels. But now we see what Linus wants to do,
perhaps we can try to be a bit more friendly toward that. It's not
actually rocket science, after all. I was concerned that he wanted to
use the modules in the initrd, but now I see Linus, and everyone else,
just want to do what I also secretly do, and just not use an initrd.

Isn't it funny. We all secretly hate using initrds ourselves :)

Jon.




Re: Unpredictable performance

2008-01-25 Thread Nick Piggin
On Saturday 26 January 2008 02:03, Asbjørn Sannes wrote:
> Asbjørn Sannes wrote:
> > Nick Piggin wrote:
> >> On Friday 25 January 2008 22:32, Asbjorn Sannes wrote:
> >>> Hi,
> >>>
> >>> I am experiencing unpredictable results with the following test
> >>> without other processes running (exception is udev, I believe):
> >>> cd /usr/src/test
> >>> tar -jxf ../linux-2.6.22.12
> >>> cp ../working-config linux-2.6.22.12/.config
> >>> cd linux-2.6.22.12
> >>> make oldconfig
> >>> time make -j3 > /dev/null # This is what I note down as a "test" result
> >>> cd /usr/src ; umount /usr/src/test ; mkfs.ext3 /dev/cc/test
> >>> and then reboot
> >>>
> >>> The kernel is booted with the parameter mem=8192
> >>>
> >>> For 2.6.23.14 the results vary from (real time) 33m30.551s to
> >>> 45m32.703s (30 runs)
> >>> For 2.6.23.14 with nop i/o scheduler from 29m8.827s to 55m36.744s (24
> >>> runs) For 2.6.22.14 also varied a lot.. but, lost results :(
> >>> For 2.6.20.21 only vary from 34m32.054s to 38m1.928s (10 runs)
> >>>
> >>> Any idea of what can cause this? I have tried to make the runs as equal
> >>> as possible, rebooting between each run.. i/o scheduler is cfq as
> >>> default.
> >>>
> >>> sys and user time only varies a couple of seconds.. and the order of
> >>> when it is "fast" and when it is "slow" is completely random, but it
> >>> seems that the results are mostly concentrated around the mean.
> >>
> >> Hmm, lots of things could cause it. With such big variations in
> >> elapsed time, and small variations on CPU time, I guess the fs/IO
> >> layers are the prime suspects, although it could also involve the
> >> VM if you are doing a fair amount of page reclaim.
> >>
> >> Can you boot with enough memory such that it never enters page
> >> reclaim? `grep scan /proc/vmstat` to check.
> >>
> >> Otherwise you could mount the working directory as tmpfs to
> >> eliminate IO.
> >>
> >> bisecting it down to a single patch would be really helpful if you
> >> can spare the time.
> >
> > I'm going to run some tests without limiting the memory to 80 megabytes
> > (so that it is 2 gigabyte) and see how much it varies then, but if I
> > recall correctly it did not vary much. I'll reply to this e-mail with
> > the results.
>
> 5 runs gives me:
> real    5m58.626s
> real    5m57.280s
> real    5m56.584s
> real    5m57.565s
> real    5m56.613s
>
> Should I test with tmpfs as well?

I wouldn't worry about it. It seems like it might be due to page reclaim
(fs / IO can't be ruled out completely though). Hmm, I haven't been following
reclaim so closely lately; you say it started going bad around 2.6.22? It
may be lumpy reclaim patches?
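
The `grep scan /proc/vmstat` check suggested earlier in the thread can also be done programmatically. The sketch below sums every counter whose name contains "scan" in a vmstat-style stream of "name value" lines. The function name is made up for this example, and the exact counter names vary between kernel versions, so the code matches on the substring only, as the grep does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sum all counters whose name contains "scan" in a stream formatted like
 * /proc/vmstat ("name value" per line).  The counters are cumulative since
 * boot, so compare the total before and after a test run.
 */
static unsigned long long sum_scan_counters(FILE *fp)
{
	char name[128];
	unsigned long long value, total = 0;

	assert(fp != NULL);
	while (fscanf(fp, "%127s %llu", name, &value) == 2) {
		if (strstr(name, "scan") != NULL)
			total += value;
	}
	return total;
}
```

Calling it on fopen("/proc/vmstat", "r") once before and once after the build, and comparing the two totals, gives a quick indication of whether the run entered page reclaim at all. An unchanged total would point away from reclaim and towards the fs/IO layers.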


Re: [PATCH] x86_32: trim memory by updating e820 v2

2008-01-25 Thread Justin Piszcz



On Fri, 25 Jan 2008, Yinghai Lu wrote:

> On Jan 25, 2008 4:01 PM, Justin Piszcz <[EMAIL PROTECTED]> wrote:
> >
> ...
> > Tried it, it worked successfully!
> >
> > With stock kernel, previous way I had to use it was mem=8832M and top
> > showed this:
> >
> > top - 18:53:52 up 1 min,  2 users,  load average: 1.03, 0.30, 0.10
> > Tasks: 169 total,   1 running, 168 sleeping,   0 stopped,   0 zombie
> > Cpu(s):  6.1%us,  2.6%sy,  4.5%ni, 81.3%id,  5.5%wa,  0.0%hi,  0.0%si,  0.0%st
> > Mem:   8039464k total,  1288948k used,  6750516k free,     3640k buffers
> > Swap: 16787768k total,        0k used, 16787768k free,   178528k cached
> >
> > With kernel you mentioned and use e820 v3:
> >
> > top - 18:48:13 up 3 min,  6 users,  load average: 1.67, 0.68, 0.25
> > Tasks: 195 total,   2 running, 193 sleeping,   0 stopped,   0 zombie
> > Cpu(s): 18.5%us,  1.2%sy,  1.6%ni, 74.8%id,  3.9%wa,  0.0%hi,  0.0%si,  0.0%st
> > Mem:   8037668k total,  1438732k used,  6598936k free,     6844k buffers
> > Swap: 16787768k total,        0k used, 16787768k free,   273928k cached
> >
> > No append mem= required.
>
> thanks
>
> any chance to try 32 bit with higemem64 option?
>
> YH

My distribution is setup for 64-bit (64bit-clean) only, I do not have a
32-bit userland, so cannot help here, sorry.


Justin.


Q: x86 suspend/hibernation code consolidation

2008-01-25 Thread Rafael J. Wysocki
Hi,

I'd like to move the 64-bit suspend/hibernation files from arch/x86/kernel to
arch/x86/power, modify the names of the 32-bit files already in
arch/x86/power and update the Makefiles accordingly, but there are some changes
queued for merging that touch the files in question.

When is the right time for making changes like that?

Rafael


Re: [GIT PATCH] driver core patches against 2.6.24

2008-01-25 Thread Peter Zijlstra

On Sat, 2008-01-26 at 01:05 +0100, Ingo Molnar wrote:
> * Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> 
> > My wish is that distros would just boot without requiring an initrd. I 
> > know how to make them for redhat and debian based distros, but the 
> > fact that you can't (easily) cross-build them makes it a very tedious 
> > construct.
> 
> all it takes for me on Fedora is to boot a modular distro kernel once, 
> then copy the /dev to the real (persistent) /dev:
> 
>    mkdir /tmp2
>    mount /dev/sda1 /tmp2
>    cp -a /dev/* /tmp2/dev/
> 
> and from that point on a bzImage/vmlinuz can boot up on Fedora without 
> any problems (as long as it has the right drivers built in), and the 
> initrd line can be removed from grub.conf.

Yeah, I usually do the same but with a bind mount, still it would be
grand if such things would not be needed.



Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

2008-01-25 Thread H. Peter Anvin

Ingo Molnar wrote:

> * Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:
>
> > Is there any guide about the tradeoff of when to use invlpg vs
> > flushing the whole tlb?  1 page?  10?  90% of the tlb?
>
> i made measurements some time ago and INVLPG was quite uniformly slow on
> all important CPU types - on the order of 100+ cycles. It's probably
> microcode. With a cr3 flush being on the order of 200-300 cycles (plus
> any add-on TLB miss costs - but those are amortized quite well as long
> as the pagetables are well cached - which they usually are on today's
> 2MB-ish L2 caches), the high cost of INVLPG rarely makes it worthwhile
> for anything more than a few pages.
>
> so INVLPG makes sense for pagetable fault related single-address
> flushes, but they rarely make sense for range flushes. (and that's how
> Linux uses it)




Incidentally, as far as I can tell, the main reason INVLPG is so slow is
because of its painful behaviour with regards to large pages which may
have been split by hardware.
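
The rough figures quoted above (INVLPG on the order of 100+ cycles per page, a cr3 reload on the order of 200-300 cycles plus TLB refill costs) suggest a simple break-even estimate. The sketch below is only that arithmetic: the function names are invented for this example, and the cycle counts are parameters to be filled in from measurements, not constants of any CPU.

```c
#include <assert.h>

/*
 * Largest number of pages for which flushing page by page with INVLPG is
 * still estimated to be cheaper than one full flush.  refill_cycles is the
 * caller's estimate of the extra TLB-miss cost after a full flush.
 */
static unsigned long break_even_pages(unsigned long invlpg_cycles,
				      unsigned long flush_cycles,
				      unsigned long refill_cycles)
{
	unsigned long full = flush_cycles + refill_cycles;

	assert(invlpg_cycles > 0);
	if (full == 0)
		return 0;
	/* n * invlpg_cycles < full  <=>  n <= (full - 1) / invlpg_cycles */
	return (full - 1) / invlpg_cycles;
}

/* Nonzero if npages individual INVLPGs are estimated cheaper than a flush. */
static int prefer_invlpg(unsigned long npages, unsigned long invlpg_cycles,
			 unsigned long flush_cycles,
			 unsigned long refill_cycles)
{
	return npages <= break_even_pages(invlpg_cycles, flush_cycles,
					  refill_cycles);
}
```

With 100 cycles per INVLPG and 250 cycles for a full flush with negligible refill cost, the break-even point is two pages, which matches the "rarely worthwhile for anything more than a few pages" conclusion quoted above. A larger refill estimate shifts the threshold upwards.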


-hpa


Re: CS5536 mfgpt timer setup register hangs board

2008-01-25 Thread Hasan Rashid
This is what TinyBios posts
---
PC Engines ALIX.2 v0.99
640 KB Base Memory
130048 KB Extended Memory

01F0 Master 848A CF 128MB
Phys C/H/S 1002/8/32 Log C/H/S 1002/8/32

BIOS setup:

(9) 9600 baud (2) 19200 baud *3* 38400 baud (5) 57600 baud (1) 115200 baud
*C* CHS mode (L) LBA mode (W) HDD wait (V) HDD slave (U) UDMA enable
(M) MFGPT workaround
(P) late PCI init
*R* Serial console enable
(E) PXE boot enable
(X) Xmodem upload
(Q) Quit

The MFGPT workaround is disabled, and I deleted the workaround code that
came with voyage linux's distribution. Anyhow I will try the fix and see
if that solves my problem.

> On 25/01/08 15:50 -0800, Hasan Rashid wrote:
>>
>> Hi,
>>
>> I have been working on a watchdog timer using the mfgpt on AMD Geode
>> CS5536. I initialize the setup register MFGPT0_SETUP (0x6206) with hex
>> value 0x306 (110110b). However, after this first initialization if I
>> ever read/write to the register it hangs the system.
>>
>> I have been through all the documentation, tried several different
>> methods but all the efforts, frustratingly, to no avail.
>>
>> Does anyone have any idea as to why would this be? TIA!
>
> It looks like you are using TinyBIOS.  Make sure that if you are using
> v0.99 that you do *not* enable the MFGPT workaround.  If you are using an
> older version, then you will need this patch:
>
> http://lkml.org/lkml/2008/1/23/372
>
> And enable mfgptfix on the command line.  There seems to be a problem with
> the MFGPT "workaround" that causes hangs exactly like you are seeing.
>
> Jordan
>
> --
> Jordan Crouse
> Systems Software Development Engineer
> Advanced Micro Devices, Inc.
>
>
>
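
A side note on the register value in the quoted message: the hex value 0x306 and the binary value 110110b given next to it are not the same number, and the thread does not say which one was actually written to the register. The small sketch below only does the base conversion so the two can be compared; the helper names are made up for this example and it makes no claim about what the individual MFGPT setup bits mean.

```c
#include <assert.h>
#include <stddef.h>

/* Parse a string of '0'/'1' characters into an unsigned value. */
static unsigned int from_binary(const char *digits)
{
	unsigned int value = 0;

	assert(digits != NULL);
	for (; *digits != '\0'; digits++) {
		assert(*digits == '0' || *digits == '1');
		value = (value << 1) | (unsigned int)(*digits - '0');
	}
	return value;
}

/* Write the binary digits of value (no leading zeros) into out. */
static char *to_binary(unsigned int value, char *out, size_t outlen)
{
	char tmp[sizeof(value) * 8];
	size_t n = 0, i;

	do {
		tmp[n++] = (char)('0' + (value & 1u));
		value >>= 1;
	} while (value != 0);

	assert(out != NULL && outlen > n);
	for (i = 0; i < n; i++)
		out[i] = tmp[n - 1 - i];  /* reverse into most-significant-first */
	out[n] = '\0';
	return out;
}
```

to_binary(0x306, ...) yields 1100000110, while from_binary("110110") is 0x36, so one of the two figures in the original report is presumably a typo.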


-- 
Regards,
Hasan Rashid




Re: [PATCH] x86_32: trim memory by updating e820 v2

2008-01-25 Thread Yinghai Lu
On Jan 25, 2008 4:01 PM, Justin Piszcz <[EMAIL PROTECTED]> wrote:
>
>
...
> Tried it, it worked successfully!
>
> With stock kernel, previous way I had to use it was mem=8832M and top
> showed this:
>
> top - 18:53:52 up 1 min,  2 users,  load average: 1.03, 0.30, 0.10
> Tasks: 169 total,   1 running, 168 sleeping,   0 stopped,   0 zombie
> Cpu(s):  6.1%us,  2.6%sy,  4.5%ni, 81.3%id,  5.5%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:   8039464k total,  1288948k used,  6750516k free, 3640k buffers
> Swap: 16787768k total,        0k used, 16787768k free,   178528k cached
>
> With kernel you mentioned and use e820 v3:
>
> top - 18:48:13 up 3 min,  6 users,  load average: 1.67, 0.68, 0.25
> Tasks: 195 total,   2 running, 193 sleeping,   0 stopped,   0 zombie
> Cpu(s): 18.5%us,  1.2%sy,  1.6%ni, 74.8%id,  3.9%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:   8037668k total,  1438732k used,  6598936k free, 6844k buffers
> Swap: 16787768k total,        0k used, 16787768k free,   273928k cached
>
> No append mem= required.
>


thanks

any chance to try 32 bit with higemem64 option?

YH


Re: [PATCH 158/196] Driver core: convert block from raw kobjects

2008-01-25 Thread Alexander van Heukelum
Fix build with CONFIG_BLOCK off.

Building git-2d94dfc with CONFIG_BLOCK turned off gives me:

drivers/base/core.c: In function 'device_add_class_symlinks':
drivers/base/core.c:704: error: 'part_type' undeclared (first use in this function)
drivers/base/core.c:704: error: (Each undeclared identifier is reported only once
drivers/base/core.c:704: error: for each function it appears in.)
drivers/base/core.c: In function 'device_remove_class_symlinks':
drivers/base/core.c:743: error: 'part_type' undeclared (first use in this function)

git-blame points to Kay Sievers.

The problem is obvious. I think the solution is too ;).

Tested with a silly configuration that contains just enough wits to boot
and get to the prompt of klibc-dash on the built-in initramfs using:
   qemu -m 8 -cpu pentium -serial stdio -cdrom arch/x86/boot/image.iso

Compile-tested i386-defconfig.

Signed-off-by: Alexander van Heukelum <[EMAIL PROTECTED]>

Oh, and the compile-problem still exists in git-99f1c97. The git-tree is
changing faster than I can test the patch and write an e-mail :-/.

diff --git a/drivers/base/core.c b/drivers/base/core.c
index edf3bbe..3751843 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -75,6 +75,15 @@ static ssize_t dev_attr_store(struct kobject *kobj, struct attribute *attr,
return ret;
 }
 
+static int dev_needs_link(struct device *dev)
+{
+#ifdef CONFIG_BLOCK
+   return dev->type != &part_type;
+#else
+   return 1;
+#endif
+}
+
 static struct sysfs_ops dev_sysfs_ops = {
.show   = dev_attr_show,
.store  = dev_attr_store,
@@ -652,14 +661,14 @@ static int device_add_class_symlinks(struct device *dev)
 #ifdef CONFIG_SYSFS_DEPRECATED
/* stacked class devices need a symlink in the class directory */
if (dev->kobj.parent != &dev->class->subsys.kobj &&
-   dev->type != &part_type) {
+   dev_needs_link(dev)) {
error = sysfs_create_link(&dev->class->subsys.kobj, &dev->kobj,
  dev->bus_id);
if (error)
goto out_subsys;
}
 
-   if (dev->parent && dev->type != &part_type) {
+   if (dev->parent && dev_needs_link(dev)) {
struct device *parent = dev->parent;
char *class_name;
 
@@ -688,11 +697,11 @@ static int device_add_class_symlinks(struct device *dev)
return 0;
 
 out_device:
-   if (dev->parent && dev->type != &part_type)
+   if (dev->parent && dev_needs_link(dev))
sysfs_remove_link(&dev->kobj, "device");
 out_busid:
if (dev->kobj.parent != &dev->class->subsys.kobj &&
-   dev->type != &part_type)
+   dev_needs_link(dev))
sysfs_remove_link(&dev->class->subsys.kobj, dev->bus_id);
 #else
/* link in the class directory pointing to the device */
@@ -701,7 +710,7 @@ out_busid:
if (error)
goto out_subsys;
 
-   if (dev->parent && dev->type != &part_type) {
+   if (dev->parent && dev_needs_link(dev)) {
error = sysfs_create_link(&dev->kobj, &dev->parent->kobj,
  "device");
if (error)
@@ -725,7 +734,7 @@ static void device_remove_class_symlinks(struct device *dev)
return;
 
 #ifdef CONFIG_SYSFS_DEPRECATED
-   if (dev->parent && dev->type != &part_type) {
+   if (dev->parent && dev_needs_link(dev)) {
char *class_name;
 
class_name = make_class_name(dev->class->name, &dev->kobj);
@@ -737,10 +746,10 @@ static void device_remove_class_symlinks(struct device *dev)
}
 
if (dev->kobj.parent != &dev->class->subsys.kobj &&
-   dev->type != &part_type)
+   dev_needs_link(dev))
sysfs_remove_link(&dev->class->subsys.kobj, dev->bus_id);
 #else
-   if (dev->parent && dev->type != &part_type)
+   if (dev->parent && dev_needs_link(dev))
sysfs_remove_link(&dev->kobj, "device");
 
sysfs_remove_link(&dev->class->subsys.kobj, dev->bus_id);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h


nfs server patches not in 2.6.25

2008-01-25 Thread J. Bruce Fields
Just some idea what we might be working on for 2.6.26, besides continued
bug-fixing and cleanup:

Work that we already have patches for and that I expect to be included
in whole or in part in 2.6.26:

- ipv6: Aurélien Charbon's patch to add ipv6 support to the
  server's export interface is ready.  I'm not clear what else
  remains for full ipv6 support.
- failover and migration: Wendy Cheng's patches appear to be in
  good shape, so I expect them or something with equivalent
  functionality to be in 2.6.26.
- gss callbacks: We have patches to add support for rpcsec_gss
  on NFSv4's callback channel (allowing us to support
  delegations on kerberos mounts), but they've been put on hold
  pending improvements to the client's gssd upcall.  I hope to
  get back to that work in the next few weeks.

Also in progress:

- spkm3 and future gss mechanisms may generate context
  initiation rpc's that are very large.  Olga Kornievskaia and I
  have been working on fixing the server gssd interfaces to
  permit this.

- There are some mismatches between the semantics required for
  nfsv4 delegations and what Linux's lease subsystem provides
  us.  David Richter and I have done a little work on this.  We
  need to start submitting it.

Three items I identified previously as issues I'd like fixed before we
removed the dependency of CONFIG_NFSD_V4 on CONFIG_EXPERIMENTAL:

http://linux-nfs.org/pipermail/nfsv4/2006-December/005497.html

- export paths consistent between v2/v3/v4:  We have some code
  that fixes this entirely in userspace.  That approach doesn't
  provide stable filehandles in the NFSv4 pseudofilesystem, and
  there seems to be a general sentiment that it's overly
  complicated.  It has the one advantage that we don't have to
  commit to it, since it uses only existing kernel interfaces.
  So I think we're probably going to apply that to nfs-utils as
  a stopgap measure and start work on fixing this in the kernel
  at the same time.

- reboot recovery: there have been complaints about the
  server-side nfsv4 reboot recovery code for a while, we've had
  code that tries to fix it for a while, and it just hasn't
  happened.  I'm hoping we can finally get this ready for
  2.6.26.

- export security: this was finished in 2.6.23; we now support
  export options like sec=krb5:krb5i:krb5p, which have a few
  advantages over the special gss/krb5 client names.  This could
  be better documented, though.

I've probably left a lot out.  Let me know of ongoing projects and
todo's that I've forgotten.

--b.


Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

2008-01-25 Thread H. Peter Anvin

Keir Fraser wrote:

> On 25/1/08 22:54, "Jeremy Fitzhardinge" <[EMAIL PROTECTED]> wrote:
>
> > The only possibly relevant comment I can find in vol3a is:
> >
> >     Older IA-32 processors that implement the PAE mechanism use uncached
> >     accesses when loading page-directory-pointer table entries. This
> >     behavior is model specific and not architectural. More recent IA-32
> >     processors may cache page-directory-pointer table entries.
>
> Go read the Intel application note "TLBs, Paging-Structure Caches, and Their
> Invalidation" at http://www.intel.com/design/processor/applnots/317080.pdf
>
> Section 8.1 explains about the PDPTR cache in 32-bit PAE mode, which can
> only be refreshed by appropriate tickling of CR0, CR3 or CR4.
>
> It is also important to note that *any* valid page directory entry at *any*
> level in the page-table hierarchy can become cached at *any* time. Basically
> TLB lookup is performed as a longest-prefix match on the linear address to
> skip as many levels in a page-table walk as possible (where a walk is
> needed, because there is no full-length match on the linear address). So, if
> you modify a directory entry from present to not-present, or change the page
> directory that a valid pde points to, you probably need to flush the pde
> caching structure. One piece of good news is that all pde caches are flushed
> by any arbitrary INVLPG.



Actually, it's trickier than that.  The PDPTRs, just like the segments,
aren't a real cache, and aren't invalidated by INVLPG.  This means you
can't go from less permissive to more permissive, which is normally
permitted on x86.  The PDPTRs should really be thought of as an
extended cr3 with four entries (this is also how they would typically be
implemented in hardware) rather than as a part of the paging structure
per se.
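
To make the "extended cr3 with four entries" picture concrete, here is a minimal sketch of how a 32-bit PAE linear address is split when 4 KB pages are in use. The 2 + 9 + 9 + 12 bit layout is the architectural one; the class and function names are illustrative only and are not taken from any kernel source.

```python
from typing import NamedTuple


class PaeIndices(NamedTuple):
    pdpt: int    # bits 31:30, selects one of the four PDPTR entries
    pd: int      # bits 29:21, index into the page directory
    pt: int      # bits 20:12, index into the page table
    offset: int  # bits 11:0, byte offset within the 4 KB page


def split_pae_address(linear: int) -> PaeIndices:
    """Split a 32-bit linear address as 32-bit PAE paging does (4 KB pages)."""
    if not 0 <= linear <= 0xFFFFFFFF:
        raise ValueError("expected a 32-bit linear address")
    return PaeIndices(
        pdpt=(linear >> 30) & 0x3,
        pd=(linear >> 21) & 0x1FF,
        pt=(linear >> 12) & 0x1FF,
        offset=linear & 0xFFF,
    )
```

Only the top two bits pick the PDPTR entry, which is why there are exactly four of them. Assuming the common 3 GB/1 GB split, an address such as 0xC0000000 lands in the fourth entry (index 3).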


We do NOT want to frob %cr4 unless we actually need to clear all the 
global pages.


The stuff in chapter 10 sounds like they're flagging for a revised
INVLPG instruction or mode which would fix some of the extremely serious
defects in INVLPG that were introduced by haphazard semantics from the P5
and early P6 days.


In general, we should assume that INVLPG only invalidates the hierarchy
immediately above it, and not rely on any side effects.  That's
basically sane design anyway.


Now, all of this reminds me of something somewhat messy: if we share the
kernel page tables for trampoline page tables, as discussed elsewhere,
we HAVE to do a complete, all-tlb-including-global-pages flush after
use, since the kernel pages are global and otherwise will stick around.
Unlike the permission bits, there aren't G enable bits on the higher
levels, but only in the PTEs themselves.


-hpa



Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

2008-01-25 Thread Ingo Molnar

* Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> Is there any guide about the tradeoff of when to use invlpg vs 
> flushing the whole tlb?  1 page?  10?  90% of the tlb?

i made measurements some time ago and INVLPG was quite uniformly slow on 
all important CPU types - on the order of 100+ cycles. It's probably 
microcode. With a cr3 flush being on the order of 200-300 cycles (plus 
any add-on TLB miss costs - but those are amortized quite well as long 
as the pagetables are well cached - which they usually are on today's 
2MB-ish L2 caches), the high cost of INVLPG rarely makes it worthwhile
for anything more than a few pages.

so INVLPG makes sense for pagetable fault related single-address
flushes, but it rarely makes sense for range flushes. (and that's how
Linux uses it)
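
As a rough illustration of the tradeoff described above, the break-even point can be worked out from the quoted cycle counts. This is only a sketch: the 100-cycle and 200-300-cycle figures come from the message, the 250-cycle default is an assumed midpoint, the function names are made up, and the add-on TLB refill misses after a cr3 reload are deliberately ignored.

```python
import math


def breakeven_pages(invlpg_cycles: float, cr3_flush_cycles: float) -> int:
    """Largest page count for which per-page INVLPG is no slower than one cr3 reload."""
    if invlpg_cycles <= 0 or cr3_flush_cycles <= 0:
        raise ValueError("cycle counts must be positive")
    return math.floor(cr3_flush_cycles / invlpg_cycles)


def cheaper_strategy(pages: int,
                     invlpg_cycles: float = 100.0,
                     cr3_flush_cycles: float = 250.0) -> str:
    """Pick the cheaper flush method for a range of `pages` pages."""
    # Cost of the INVLPG loop grows linearly; the cr3 reload is a flat cost.
    if pages * invlpg_cycles <= cr3_flush_cycles:
        return "invlpg"
    return "cr3"
```

With these inputs the break-even lands at two to three pages, which matches the "anything more than a few pages" rule of thumb.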

Ingo


Re: [GIT PATCH] driver core patches against 2.6.24

2008-01-25 Thread Ingo Molnar

* Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> My wish is that distros would just boot without requiring an initrd. I 
> know how to make them for redhat and debian based distros, but the 
> fact that you can't (easily) cross-build them makes it a very tedious 
> construct.

all it takes for me on Fedora is to boot a modular distro kernel once, 
then copy the /dev to the real (persistent) /dev:

   mkdir /tmp2
   mount /dev/sda1 /tmp2
   cp -a /dev/* /tmp2/dev/

and from that point on a bzImage/vmlinuz can boot up on Fedora without 
any problems (as long as it has the right drivers built in), and the 
initrd line can be removed from grub.conf.

Ingo


Re: CS5536 mfgpt timer setup register hangs board

2008-01-25 Thread Jordan Crouse
On 25/01/08 15:50 -0800, Hasan Rashid wrote:
> 
> Hi,
> 
> I have been working on a watchdog timer using the mfgpt on AMD Geode
> CS5536. I initialize the setup register MFGPT0_SETUP (0x6206) with hex
> value 0x306 (110110b). However, after this first initialization if I
> ever read/write to the register it hangs the system.
> 
> I have been through all the documentation, tried several different methods
> but all the efforts, frustratingly, to no avail.
> 
> Does anyone have any idea as to why would this be? TIA!
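
The setup value above is given both as hex 0x306 and as binary 110110b, and those two notations do not denote the same number, so it may be worth checking which one was actually written to the register. A minimal sketch using only Python's built-in integer formatting; no register field meanings are assumed here, and the helper names are illustrative.

```python
def to_binary(value: int) -> str:
    """Binary digits of a non-negative integer, without the 0b prefix."""
    return format(value, "b")


def from_binary(bits: str) -> int:
    """Integer value of a string of binary digits."""
    return int(bits, 2)


hex_value = 0x306                       # value as quoted in hex
binary_value = from_binary("110110")    # value as quoted in binary

print(to_binary(hex_value))   # binary form of 0x306
print(hex(binary_value))      # hex form of 110110b
```

Here 0x306 expands to 1100000110b, while 110110b is 0x36.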

It looks like you are using TinyBIOS.  Make sure that if you are using v0.99
that you do *not* enable the MFGPT workaround.  If you are using an older 
version, then you will need this patch:

http://lkml.org/lkml/2008/1/23/372

And enable mfgptfix on the command line.  There seems to be a problem with
the MFGPT "workaround" that causes hangs exactly like you are seeing.

Jordan

-- 
Jordan Crouse
Systems Software Development Engineer 
Advanced Micro Devices, Inc.




Re: [PATCH] x86_32: trim memory by updating e820 v2

2008-01-25 Thread Justin Piszcz



On Tue, 22 Jan 2008, Yinghai Lu wrote:


On Monday 21 January 2008 01:37:09 pm Justin Piszcz wrote:


On Mon, 21 Jan 2008, Yinghai Lu wrote:


On Monday 21 January 2008 11:14:02 am Justin Piszcz wrote:
please get x86.git

 git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
 cd linux-2.6
 #--{ x86.git instructions }-->
 # Add Linus's tree as a remote
 git remote add linus git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

 # Add Ingo's tree as a remote
 git remote add x86 git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git

 # With that setup, just run the following to get any changes you
 # don't have.  It will also notice any new branches Ingo/Linus
 # add to their repo.  Look in .git/config afterwards, the format
 # to add new remotes is easy to figure out.
 git remote update
 #-
 git merge x86/master
 git merge x86/mm

and apply

[PATCH] x86_64: check if Tom2 is enabled
http://lkml.org/lkml/2008/1/21/20
[PATCH] x86_64: update e820 instead of updating end_pfn v3
http://lkml.org/lkml/2008/1/21/19
[PATCH] x86_32: trim memory by updating e820 v2
http://lkml.org/lkml/2008/1/21/18

YH



Thanks, I am all patched up and ready to test. Unfortunately, one of the
disks in my RAID 1 just died; I have already filled out the advanced
replacement form and will test when I receive the replacement disk.


please get x86.git and apply
[PATCH] x86_32: trim memory by updating e820 v3
http://lkml.org/lkml/2008/1/22/394

Ingo already put the other two into the tree.

Thanks

YH



Tried it, it worked successfully!

With the stock kernel, I previously had to boot with mem=8832M, and top
showed this:


top - 18:53:52 up 1 min,  2 users,  load average: 1.03, 0.30, 0.10
Tasks: 169 total,   1 running, 168 sleeping,   0 stopped,   0 zombie
Cpu(s):  6.1%us,  2.6%sy,  4.5%ni, 81.3%id,  5.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8039464k total,  1288948k used,  6750516k free,     3640k buffers
Swap: 16787768k total,        0k used, 16787768k free,   178528k cached

With the kernel you mentioned plus the e820 v3 patch:

top - 18:48:13 up 3 min,  6 users,  load average: 1.67, 0.68, 0.25
Tasks: 195 total,   2 running, 193 sleeping,   0 stopped,   0 zombie
Cpu(s): 18.5%us,  1.2%sy,  1.6%ni, 74.8%id,  3.9%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8037668k total,  1438732k used,  6598936k free,     6844k buffers
Swap: 16787768k total,        0k used, 16787768k free,   273928k cached

No mem= boot parameter is required.
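
As a quick cross-check of the two top outputs above, the reported memory totals can be compared directly. This is a minimal sketch that assumes top's "k" suffix means KiB; the helper and constant names are made up for illustration.

```python
def kib_to_mib(kib: int) -> float:
    """Convert a KiB count to MiB."""
    return kib / 1024


def total_delta_kib(before_kib: int, after_kib: int) -> int:
    """Difference between two reported memory totals, in KiB."""
    return before_kib - after_kib


# "Mem: ... total" values copied from the two top outputs above.
STOCK_WITH_MEM_OPTION_KIB = 8039464   # stock kernel booted with mem=8832M
PATCHED_NO_MEM_OPTION_KIB = 8037668   # e820 v3 patch, no mem= option

delta = total_delta_kib(STOCK_WITH_MEM_OPTION_KIB, PATCHED_NO_MEM_OPTION_KIB)
print(f"{delta} KiB ({kib_to_mib(delta):.2f} MiB)")
```

The two totals differ by 1796k, which is under 2 MiB out of roughly 7850 MiB, so the patched kernel exposes essentially the same amount of memory without the manual override.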

A full dmesg is attached so you can analyze the e820/MTRR mapping.

File: dmesg-e820v3patch.txt.bz2

Justin.




Re: [build bug] ./drivers/crypto/hifn_795x.c

2008-01-25 Thread Ingo Molnar

* Herbert Xu <[EMAIL PROTECTED]> wrote:

> On Sat, Jan 26, 2008 at 12:51:31AM +0100, Ingo Molnar wrote:
> >
> > find a workaround below - but i'm not sure it's the right one.
> 
> Thanks, but I've already checked in a fix :)

hey, that's my punishment for not reading my email promptly :) Could 
have saved me some time in the Kconfig web of dependencies :-/

Ingo


Re: [build bug] ./drivers/crypto/hifn_795x.c

2008-01-25 Thread Herbert Xu
On Sat, Jan 26, 2008 at 12:51:31AM +0100, Ingo Molnar wrote:
>
> find a workaround below - but i'm not sure it's the right one.

Thanks, but I've already checked in a fix :)
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [build bug] ./drivers/crypto/hifn_795x.c

2008-01-25 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> randconfig testing found this (post-v2.6.24) build bug:
> 
>  drivers/built-in.o: In function `hifn_unregister_rng':
>  hifn_795x.c:(.text+0x17bbd9): undefined reference to `hwrng_unregister'
>  drivers/built-in.o: In function `hifn_probe':
>  hifn_795x.c:(.text+0x17df70): undefined reference to `hwrng_register'
> 
> config attached.

find a workaround below - but i'm not sure it's the right one.

Ingo

Index: linux/drivers/crypto/Kconfig
===
--- linux.orig/drivers/crypto/Kconfig
+++ linux/drivers/crypto/Kconfig
@@ -89,6 +89,7 @@ config CRYPTO_DEV_HIFN_795X
select CRYPTO_ALGAPI
select CRYPTO_BLKCIPHER
depends on PCI
+   depends on DEV_HIFN_795X = HW_RANDOM
help
  This option allows you to have support for HIFN 795x crypto adapters.
 


Re: Linux 2.6.24

2008-01-25 Thread Stefan Richter
Giacomo A. Catenazzi wrote:
>> On Friday, 25 of January 2008, [EMAIL PROTECTED] wrote:
[-mm]
>>> should flush out most of the truly stupid mistakes, but those are
>>> usually found and fixed literally within hours.  Anyhow, the proper
>>> time for test compiles is *before* it goes into the git trees at
>>> all - it should have been tested before it gets sent to a
>>> maintainer for inclusion.
> 
> a few hours, but a lot of changesets will break bisect (few docs tell
> us how to continue bisecting on compile errors).
[...]
> I only want to raise the problem, to see if it is possible to improve
> testing environment without affecting the development of Linux.

How often is "bisectability" being broken already before merge in
subsystem trees, and how often only in the context of the merge result?
(Probably impossible to answer because nobody has the data.)

Much of the former type of breakage (if we really have such breakage)
could probably be found in mostly automated ways and by volunteer
testers, by regularly testing the subsystem trees.
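
A mostly automated check of this kind could look roughly like the sketch below, which walks a list of commits and records the ones that do not build. Everything here is hypothetical: the function names are invented, and builds_with_make simply assumes that a plain make in the checked-out tree is an adequate build test.

```python
import subprocess
from typing import Callable, Iterable, List


def find_unbuildable(commits: Iterable[str],
                     builds: Callable[[str], bool]) -> List[str]:
    """Return every commit for which the build check fails, in the order given."""
    return [commit for commit in commits if not builds(commit)]


def builds_with_make(commit: str, repo: str = ".") -> bool:
    """Hypothetical build check: check out `commit` in `repo` and run make."""
    subprocess.run(["git", "checkout", "--quiet", commit], cwd=repo, check=True)
    return subprocess.run(["make", "-s"], cwd=repo).returncode == 0
```

Passing the build check in as a callable keeps the walk itself independent of how a tree is actually built. Commits reported this way are the ones a later bisection would have to step over, for instance with git bisect skip.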
-- 
Stefan Richter
-=-==--- ---= ==-=-
http://arcgraph.de/sr/

