date:20080129

Re: About closed-source module use GPL module function

2008-01-29 Thread Pekka Enberg

Hi,

On Jan 30, 2008 8:41 AM, CooperYuan Cooper <[EMAIL PROTECTED]> wrote:
> Now I am porting a device driver to Linux, its source code is not opened.
>
> In this module, I use some interface functions exported from GPL
> module through EXPORT_SYMBOL macros. (not EXPORT_SYMBOL_GPL), For
> example, register_sound_dsp() and so on.
>
> Do I violate GPL? How to solve it?

This list is probably not a good source for (free) legal advice, but
the simplest way to be sure is to release the source code under GPLv2.
HTH.

Pekka
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2.6.24] x86: add sysfs interface for cpuid module

2008-01-29 Thread Sam Ravnborg

On Wed, Jan 30, 2008 at 06:22:43AM +0800, Yi Yang wrote:
> On Tue, 2008-01-29 at 09:44 +0100, Sam Ravnborg wrote:
> > > +
> > > +static struct notifier_block __cpuinitdata cpuid_sysfs_cpu_notifier = {
> > > + .notifier_call = cpuid_sysfs_cpu_callback,
> > > +};
> > Data is annotated __cpuinitdata
> > 
> > but
> > 
> > > +
> > Data is annotated __cpuinitdata
> > 
> > > @@ -217,11 +445,14 @@ static void __exit cpuid_exit(void)
> > >  {
> > >   int cpu = 0;
> > >  
> > > - for_each_online_cpu(cpu)
> > > + for_each_online_cpu(cpu) {
> > >   cpuid_device_destroy(cpu);
> > > + remove_cpuid_sysfs(cpu);
> > > + }
> > >   class_destroy(cpuid_class);
> > >   unregister_chrdev(CPUID_MAJOR, "cpu/cpuid");
> > >   unregister_hotcpu_notifier(&cpuid_class_cpu_notifier);
> > > + unregister_hotcpu_notifier(&cpuid_sysfs_cpu_notifier);
> > 
> > used in an __exit function.
> > 
> > You should have seen a Section mismatch warning for this.
> > The right fix is to annotate the cpuid_sysfs_cpu_notifier
> > with __initdata_refok (soon to be named __refdata)
> > Or even better to declare it const and use _refconst.
> I think __cpuinitdata is different from __initdata, i have tested it
> by insmod, rmmod, echo 0/1 > /sys/devices/system/cpu/cpu1/online
> repeatly, it hasn't any issue.

__cpuinit & __cpuinitdata have over time been used for
different purposes:
a) To annotate code/data used in the init path that can be
discarded after init in the non-HOTPLUG_CPU case.
b) To annotate code/data used in the 'core' HOTPLUG_CPU
functionality that is only in use if HOTPLUG_CPU=y.

The b) usage is questionable, and the annotation
of cpuid_sysfs_cpu_notifier belongs in the b) category.

The correct solution would be to factor out the 'core'
HOTPLUG_CPU=y code to a set of separate files and use the
usual mechanism in the Makefile to select when to include
this code in the kernel.

The improved section mismatch checks in modpost have only just
brought this issue to attention, and because you are now adding
code that does the wrong thing, it is being discussed.

Sam

Re: [PATCH v3] ipwireless: driver for 3G PC Card

2008-01-29 Thread David Sterba

Hi,

On Tue, Jan 29, 2008 at 02:49:24PM +0100, David Sterba wrote:
> ---
> From: David Sterba <[EMAIL PROTECTED]>
> 
> ipwireless_cs: driver for PC Card, 3G internet connection
> 
> The driver is manufactured by IPWireless.

Sorry this ^^^ is not correct, should be

"The device is manufactured by IPWireless."

Thanks,
Dave


Re: ndiswrapper and GPL-only symbols redux

2008-01-29 Thread Geert Uytterhoeven

On Tue, 29 Jan 2008, Zan Lynx wrote:
> Jon Masters wrote:
> > I wouldn't quite say that. I wasn't going to comment, but...personally,
> > I actually disagree with the assertions that ndiswrapper isn't causing
> > proprietary code to link against GPL functions in the kernel (how is
> > an NDIS implementation any different than a shim layer provided to
> > load a graphics driver?), but I wasn't trying to make that point.
> 
> Well, as long as *any* part of the kernel ever links to proprietary code, then
> GPL functions link to it in exactly the same way ndiswrapper enables.  It's
> only a matter of how many steps of separation.
> 
> A perfectly GPL USB network driver linked to GPL-only functions feeds data
> into the kernel where it swirls about and emerges from a proprietary network
> filesystem driver, for example.

A proprietary network filesystem driver _on a different system_, you
mean? In this case the proprietary code has no direct access to your
kernel data, except through the communication protocol. No tainting is
involved, as all corruption in your kernel is caused by kernel bugs in
visible code that can be debugged.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [EMAIL PROTECTED]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

Re: ndiswrapper and GPL-only symbols redux

2008-01-29 Thread Geert Uytterhoeven

On Wed, 30 Jan 2008, Måns Rullgård wrote:
> Adrian Bunk <[EMAIL PROTECTED]> writes:
> > On Tue, Jan 29, 2008 at 11:25:22PM +, Måns Rullgård wrote:
> >> As long as you don't distribute /proc/kcore, I can't see how the GPL
> >> would have any say in the matter.  The Windows drivers are (unrelated
> >> violations aside) clearly not derived from GPL code.
> >
> > Someone might sell a laptop with Linux installed?
> 
> Not a problem, unless it is booted when sold.  Even that might not be
> a problem, since it would be a matter of transferring ownership of a
> single copy, not creating and distributing new copies, and the GPL
> is only concerned with the latter.

Interesting... I never heard about this `transferring ownership of a
single copy not involving GPL'.

Note that some lawyers claim that at trade shows, you should not hand over
a demo device running GPLed code to any interested party, as it would be
distribution...

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [EMAIL PROTECTED]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

Re: [PATCH 2.6.24] x86: add sysfs interface for cpuid module

2008-01-29 Thread Yi Yang

On Tue, 2008-01-29 at 09:44 +0100, Sam Ravnborg wrote:
> > +
> > +static struct notifier_block __cpuinitdata cpuid_sysfs_cpu_notifier = {
> > +   .notifier_call = cpuid_sysfs_cpu_callback,
> > +};
> Data is annotated __cpuinitdata
> 
> but
> 
> > +
> Data is annotated __cpuinitdata
> 
> > @@ -217,11 +445,14 @@ static void __exit cpuid_exit(void)
> >  {
> > int cpu = 0;
> >  
> > -   for_each_online_cpu(cpu)
> > +   for_each_online_cpu(cpu) {
> > cpuid_device_destroy(cpu);
> > +   remove_cpuid_sysfs(cpu);
> > +   }
> > class_destroy(cpuid_class);
> > unregister_chrdev(CPUID_MAJOR, "cpu/cpuid");
> > unregister_hotcpu_notifier(&cpuid_class_cpu_notifier);
> > +   unregister_hotcpu_notifier(&cpuid_sysfs_cpu_notifier);
> 
> used in an __exit function.
> 
> You should have seen a Section mismatch warning for this.
> The right fix is to annotate the cpuid_sysfs_cpu_notifier
> with __initdata_refok (soon to be named __refdata)
> Or even better to declare it const and use _refconst.
I think __cpuinitdata is different from __initdata. I have tested it
by insmod, rmmod, and echo 0/1 > /sys/devices/system/cpu/cpu1/online
repeatedly, and it hasn't had any issues.

> 
>   Sam


Re: [PATCH 2.6.24] x86: add sysfs interface for cpuid module

2008-01-29 Thread H. Peter Anvin


Yi Yang wrote:
> Function cpuid has reset ecx to 0 immediate before calling to __cpuid,
> so this shouldn't be a problem now.

Unless, of course, you want to get to the information for the higher
CPUID levels.


The easiest way to fix that would be to use cpuid_count() and let 
/dev/cpu/*/cpuid take the %ecx value in the high half of the offset. 
That would have minimal impact on the interface.
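
The offset layout suggested here can be sketched in plain C. This is only an illustration of the idea, assuming the low 32 bits of the file position carry the CPUID level (%eax) and the high 32 bits carry the %ecx value. The struct and function names are made up for the example and are not taken from the cpuid driver.

```c
#include <stdint.h>

/* Hypothetical request decoded from a 64-bit file position. */
struct cpuid_request {
	uint32_t level;   /* value to load into %eax */
	uint32_t count;   /* value to load into %ecx */
};

/* Pack level and count into one offset: count goes in the high half. */
static uint64_t cpuid_pack_offset(uint32_t level, uint32_t count)
{
	return ((uint64_t)count << 32) | level;
}

/* Split a file position back into level and count. */
static struct cpuid_request cpuid_unpack_offset(uint64_t pos)
{
	struct cpuid_request req;

	req.level = (uint32_t)(pos & 0xffffffffu);
	req.count = (uint32_t)(pos >> 32);
	return req;
}
```

Under this assumed layout, a user who only ever seeks to small offsets keeps getting a count of 0, which is presumably why the change would have minimal impact on the interface.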


-hpa

Re: [RFC v2 2/5] dmaengine: Add slave DMA interface

2008-01-29 Thread David Brownell

On Tuesday 29 January 2008, Haavard Skinnemoen wrote:
> @@ -297,6 +356,13 @@ struct dma_device {
> struct dma_async_tx_descriptor *(*device_prep_dma_interrupt)(
> struct dma_chan *chan);
>  
> +   struct dma_slave_descriptor *(*device_prep_slave)(
> +   struct dma_chan *chan, dma_addr_t mem_addr,
> +   enum dma_slave_direction direction,
> +   enum dma_slave_width reg_width,
> +   size_t len, unsigned long flags);

That isn't enough options!  Check out arch/arm/plat-omap/dma.c (and
maybe OMAP5912 DMA docs [1] for not-very-recent specs) as one example.
You'll see more options that drivers need to use, including:

 - DMA priority and arbitration
 - Burst size, packing/unpacking support (for optimized memory access)
 - Multiple DMA quanta (not just reg_width, but also frames and blocks)
 - Multiple synch modes (per element/"width", frame, or block)
 - Multiple addressing modes:   pre-index, post-index, double-index, ...
 - Both descriptor-based and register based transfers
 - ... lots more ...

Example:  USB tends to use one packet per "frame" and have the DMA
request signal mean "give me the next frame".  It's sometimes been
very important to use the tuning options to avoid some on-chip
race conditions for transfers that cross lots of internal busses and
clock domains, and to have special handling for aborting transfers
and handling "short RX" packets.

I wonder whether a unified programming interface is the right way
to approach peripheral DMA support, given such variability.  The DMAC
from Synopsys that you're working with has some of those options, but
not all of them... and other DMA controllers have their own oddities.

For memcpy() acceleration, sure -- there shouldn't be much scope for
differences.  Source, destination, bytecount ... go!  (Not that it's
anywhere *near* that quick in the current interface.)

For peripheral DMA, maybe it should be a "core plus subclasses"
approach so that platform drivers can make use of hardware-specific
knowledge (SOC-specific peripheral drivers using SOC-specific DMA),
sharing core code for dma-memcpy() and DMA channel housekeeping.
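
One way to picture a "core plus subclasses" layout is the usual C idiom of embedding a core struct inside a hardware-specific one. The sketch below is a userspace illustration under that assumption. Every type, field, and function name in it is hypothetical and none of them are dmaengine API.

```c
#include <stddef.h>

/* Core part: what generic code (memcpy offload, housekeeping) sees. */
struct core_dma_chan {
	int id;
	int busy;
};

/* SoC-specific "subclass": embeds the core and adds tuning knobs. */
struct soc_dma_chan {
	struct core_dma_chan core;
	unsigned int burst_size;   /* elements per burst */
	unsigned int frame_len;    /* elements per frame */
	int sync_per_frame;        /* 1: one DMA request moves a whole frame */
};

/* Recover the subclass from a pointer to its embedded core. */
static struct soc_dma_chan *to_soc_chan(struct core_dma_chan *chan)
{
	return (struct soc_dma_chan *)((char *)chan -
			offsetof(struct soc_dma_chan, core));
}

/* Generic housekeeping works on any channel. */
static void core_chan_claim(struct core_dma_chan *chan)
{
	chan->busy = 1;
}

/* SoC-specific driver code reaches the extra options via the subclass. */
static unsigned int soc_chan_bytes_per_request(struct core_dma_chan *chan,
					       unsigned int elem_bytes)
{
	struct soc_dma_chan *soc = to_soc_chan(chan);
	unsigned int elems = soc->sync_per_frame ? soc->frame_len
						 : soc->burst_size;

	return elems * elem_bytes;
}
```

The point of the split is that shared code only ever touches struct core_dma_chan, while a peripheral driver that knows which SoC it runs on can still reach the frame and burst settings.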

- Dave

[1] http://focus.ti.com/docs/prod/folders/print/omap5912.html
lists spru755c near the bottom; the "System DMA" section.

> +   void (*device_terminate_all)(struct dma_chan *chan);
> +
> void (*device_dependency_added)(struct dma_chan *chan);
> enum dma_status (*device_is_tx_complete)(struct dma_chan *chan,
> dma_cookie_t cookie, dma_cookie_t *last,


Re: [RFC v2 0/5] dmaengine: Slave DMA interface and example users

2008-01-29 Thread David Brownell

On Tuesday 29 January 2008, Haavard Skinnemoen wrote:
>
> Btw, there's one issue I forgot to mention: I believe the DMA Engine
> framework is currently misusing the DMA mapping API, and this patchset
> makes things worse.
> 
> Currently, the async_tx bits of the API do the required calls to
> dma_map_single() and/or dma_map_page(), but they rely on the driver to
> do the unmapping. This is problematic ...
> 
> How do we solve this?

How about:  for peripheral DMA, don't let the engine see anything
except dma_addr_t values.

The engine needs to be able to dma_alloc_coherent() memory too,
which is pre-mapped.

- Dave

Re: ptrace API extensions for BTS

2008-01-29 Thread Roland McGrath

Sorry I did not get more into this discussion earlier.  I still have not
read through all of the email threads.  But I have looked over the current
version of your code now in -mm.

I think this work has a great deal of overlap with the perfmon2 project.
There are two facets that overlap, which together are the whole BTS feature.

The same x86 "debug store" hardware is programmed for both the BTS and the
performance monitoring features.  The implementations clearly have to
cooperate on managing that hardware.  Your ds.c is a start in the right
direction, to abstract the hardware configuration from the BTS feature and
its interface.  I'm not familiar with the perfmon2 code, but it may have
something similar already.

The rest of the BTS feature is the buffer management and the interface.
It has to deal with the hardware buffer setup, context switching, and
overflow interrupts, and delivering data from the hardware buffers to the
interface in appropriate formats.  We'd also like it to be able to trace
kernel-mode as well as user-mode, and either deliver combined data or
segregate the data between the two for user-space and kernel-space users
who need not know about each other's tracing.  (On some of the hardware
you can program it to record only one or the other (X86_FEATURE_DSCPL).
On older hardware, or when separately tracing both, you can trace both
and then distinguish each sample by its from_ip.)  perfmon2 also wants to
address all of that.

I don't much like the way you've shoe-horned the context-switch timestamp
logging into the BTS feature.  It's a nice feature to have in some form,
and I sympathize with your seeing it as easy pickings once you had the
BTS buffer machinery handy.  But really it is not part of the BTS feature
and there is nothing arch-dependent about it.  Given some other general
thing providing the buffer management et al, that could just be done in
schedule(), i.e.:

departs(prev);
context_switch(rq, prev, next); /* unlocks the rq */
arrives(prev);

If there is a general thing for event-reporting from perfmon2 or
whatever, then it might be natural to have the context-switch event
reports configurable to different record formats you might be using 
for other things.  For a BTS-style record, I would use:

 departs: { .from = task_pt_regs(prev)->ip, .to = jiffies, .flag = MAGIC1 }
 arrives: { .from = jiffies, .to = task_pt_regs(next)->ip, .flag = MAGIC2 }
 MAGIC1 = 0x0001
 MAGIC2 = 0x0002

or something like that, i.e. as if it were a "branch to block-time" and a
"branch from wake-time".  (Actually you might want MAGIC3 and MAGIC4 too,
for whether it was a voluntary or involuntary context switch.)  For
different use that is doing mostly other event sampling rather than BTS,
it might use a different format that gives more register info a la PEBS.
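
As a rough userspace sketch of the record shape described above: the struct layout, the helper names, and the assumption that ordinary branch records carry a flag of 0 are all invented for illustration. Only the from/to/flag idea and the two magic values come from the text.

```c
#include <stdint.h>

/* Hypothetical flag values, following the MAGIC1/MAGIC2 idea above. */
#define CTXSW_DEPARTS 0x0001u
#define CTXSW_ARRIVES 0x0002u

/* Shaped like a branch record: a source, a destination, and a flag. */
struct bts_style_record {
	uint64_t from;
	uint64_t to;
	uint64_t flag;
};

/* Task leaves the CPU: a "branch" from its ip to the block time. */
static struct bts_style_record record_departs(uint64_t prev_ip, uint64_t now)
{
	struct bts_style_record r = { prev_ip, now, CTXSW_DEPARTS };

	return r;
}

/* Task gets the CPU: a "branch" from the wake time to its ip. */
static struct bts_style_record record_arrives(uint64_t now, uint64_t next_ip)
{
	struct bts_style_record r = { now, next_ip, CTXSW_ARRIVES };

	return r;
}

/* A consumer can tell context-switch markers from other records. */
static int record_is_ctxsw(const struct bts_style_record *r)
{
	return (r->flag & (CTXSW_DEPARTS | CTXSW_ARRIVES)) != 0;
}
```

Keeping the same three-field shape for both kinds of event means one buffer and one reader can handle branch samples and context-switch timestamps alike.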

I'm no expert on perfmon2 and I understand there are many issues to be
resolved to get it into the kernel.  But if you are not desperate to have
the BTS feature in the kernel ASAP, it would be ideal IMHO if you can work
with Stephane et al on putting this work together.  I'd like to see the
work go into the kernel in much smaller pieces even than your BTS patch
set that's in -mm.  The first thing is just the DS hardware management,
context switch and hardware-facing parts of the buffer management (one or
three or four small bisect-friendly patches just for that much).  If
you and Stephane can hash out a fresh patch that provides what you both
need for that, that would be a great start.  Personally, I'd prefer to
abandon the ptrace extensions altogether in favor of some generalized
event buffer interface that comes from merging perfmon2.  But if you
still want to do the ptrace interface, it can be built on the shared
DS-management code.  What do you think?


Thanks,
Roland

Re: [PATCH 2.6.24] x86: add sysfs interface for cpuid module

2008-01-29 Thread Yi Yang

On Tue, 2008-01-29 at 23:17 -0800, H. Peter Anvin wrote:
> Yi Yang wrote:
> >>
> >> It's broken, because it doesn't take into account the fact that Intel 
> >> broke CPUID level 4 and made it "repeating" (neither did the cpuid char 
> >> device, because it predated the Intel braindamage; I've had a patch for 
> >> it privately for a while, but didn't push it upstream because paravirt 
> >> broke it royally and I wanted the situation to settle down.)
> 
> > level 4 doesn't result in repeating on Intel CPU, cpuid module sets
> > file offset to level, so cat /dev/cpu/*/cpuid will run cpuid instruction
> > continuously.
> 
> The issue is that Intel suddenly made CPUID ECX-sensitive, which there 
> is no way to represent.
Function cpuid has reset ecx to 0 immediately before calling __cpuid,
so this shouldn't be a problem now.

in include/asm-x86/processor_32.h
/*
 * Generic CPUID function
 * clear %ecx since some cpus (Cyrix MII) do not set or clear %ecx
 * resulting in stale register contents being returned.
 */
static inline void cpuid(unsigned int op,
                         unsigned int *eax, unsigned int *ebx,
                         unsigned int *ecx, unsigned int *edx)
{
        *eax = op;
        *ecx = 0;
        __cpuid(eax, ebx, ecx, edx);
}
> 
> As far as cat /dev/cpu/*/cpuid, that's a user error.
> 
>   -hpa
> 


Re: [PATCH 2.6.24] x86: add sysfs interface for cpuid module

2008-01-29 Thread H. Peter Anvin


Yi Yang wrote:
>> It's broken, because it doesn't take into account the fact that Intel
>> broke CPUID level 4 and made it "repeating" (neither did the cpuid char
>> device, because it predated the Intel braindamage; I've had a patch for
>> it privately for a while, but didn't push it upstream because paravirt
>> broke it royally and I wanted the situation to settle down.)
>
> level 4 doesn't result in repeating on Intel CPU, cpuid module sets
> file offset to level, so cat /dev/cpu/*/cpuid will run cpuid instruction
> continuously.

The issue is that Intel suddenly made CPUID ECX-sensitive, which there
is no way to represent.


As far as cat /dev/cpu/*/cpuid, that's a user error.
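
For contrast with cat, the intended access pattern is to read one 16-byte record at a file offset equal to the CPUID level. A minimal userspace sketch, assuming a descriptor already opened on /dev/cpu/N/cpuid and a register order of eax, ebx, ecx, edx. The helper name is hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one CPUID record from an already opened
 * /dev/cpu/N/cpuid descriptor. Returns 0 on success and -1 on error
 * or short read. regs receives eax, ebx, ecx, edx in that order.
 * The cast assumes off_t is wide enough to hold the level.
 */
static int cpuid_read_level(int fd, uint32_t level, uint32_t regs[4])
{
	ssize_t n = pread(fd, regs, 4 * sizeof(uint32_t), (off_t)level);

	return n == (ssize_t)(4 * sizeof(uint32_t)) ? 0 : -1;
}
```

A caller asks for exactly the levels it wants, so there is no open-ended sequential read and no question of where the "file" ends.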

-hpa


Re: [GIT PULL] ext4 update

2008-01-29 Thread Theodore Tso

On Tue, Jan 29, 2008 at 10:54:03PM +0100, Jan Engelhardt wrote:
> 
> On Jan 29 2008 07:53, Theodore Tso wrote:
> >
> >>fwiw, diffstat is confused by git's diff output; you need to use
> >>'diffstat -p1'
> 
> I am seeing normal behavior:
>
> 22:52 sovereign:~/linux > git diff HEAD | diffstat

That's because you are doing a diff stat of changes that haven't been
checked in yet.  I was doing a "git log -p origin.. | diffstat -p1",
and in that incantation you definitely do need the -p1 to diffstat.

  - Ted

[PATCH] Debugfs compile fix.

2008-01-29 Thread Denis V. Lunev

Debugfs does not compile without CONFIG_SYSFS in the net-2.6 tree. Move
kobject_create_and_add under the appropriate ifdef. The fix looks correct
at first glance, but maybe the dependency should be added to
Kconfig.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
---
 fs/debugfs/inode.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index d26e282..61cc937 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -432,9 +432,11 @@ static int __init debugfs_init(void)
 {
int retval;
 
+#ifdef CONFIG_SYSFS
debug_kobj = kobject_create_and_add("debug", kernel_kobj);
if (!debug_kobj)
return -EINVAL;
+#endif
 
retval = register_filesystem(&debug_fs_type);
if (retval)
-- 
1.5.3.rc5


Re: [PATCH 2.6.24] x86: add sysfs interface for cpuid module

2008-01-29 Thread Yi Yang

On Tue, 2008-01-29 at 07:51 -0800, H. Peter Anvin wrote:
> Yi Yang wrote:
> > Current cpuid module will create a char device for every logical cpu,
> > when a user cats /dev/cpu/*/cpuid, he/she will enter a limitless loop,
> > the root cause is that cpuid module doesn't decide wether a cpuid level
> > is valid, it just uses an offset to denote cpuid level and take it to
> > cpuid instruction, cpuid instruction will ignore it and return some data
> > 
> > This patch uses sysfs to avoid limitless loop and provide more flexible
> > interface for cpuid, please consider to merge to -mm tree in order to test.
> 
> This is broken.
> 
> Triple broken.
> 
> It's broken, because it doesn't take into account the fact that Intel 
> broke CPUID level 4 and made it "repeating" (neither did the cpuid char 
> device, because it predated the Intel braindamage; I've had a patch for 
> it privately for a while, but didn't push it upstream because paravirt 
> broke it royally and I wanted the situation to settle down.)
level 4 doesn't result in repeating on Intel CPU, cpuid module sets
file offset to level, so cat /dev/cpu/*/cpuid will run cpuid instruction
continuously.


2.6.24-git6-ext4-1 patchset released

2008-01-29 Thread Theodore Ts'o


I've just released 2.6.24-git6-ext4-1.  It removes the patches that have
been pulled into mainline by Linus, and adds the unlocked ioctl patches
from Andi Kleen, and Eric's patch to allow the root inode to use
in-inode EA's.

As a git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 2.6.24-git6-ext4-1
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=shortlog;h=2.6.24-git6-ext4-1

As a patchset:

ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/2.6.24-git6-ext4-1

- Ted

What's now in the ext4 tree:

Akira Fujita (4):
  ext4: online defrad header file changes
  ext4: online defrag-- Allocate new contiguous blocks with mballoc
  ext4: online defrag -- Move the file data to the new blocks
  Free space fragmentation functions

Alex Tomas (2):
  vfs: add basic delayed allocation support
  ext4: Add basic delayed allocation support

Andi Kleen (7):
  Convert ext2 over to use unlocked_ioctl
  Remove incorrect BKL comment in ext2
  Convert ext3 to use unlocked_ioctl v2
  ext3: Remove incorrect BKL comment
  Remove incorrect comment refering to lock_kernel() from jbd/jbd2
  Convert ext4 to use unlocked_ioctl v2
  Remove incorrect BKL comments in ext4

Aneesh Kumar K.V (2):
  ext4: Enable delalloc and mballoc by default.
  ext4: Show delalloc options

Eric Sandeen (1):
  allow in-inode EAs on ext4 root inode

Mingming Cao (2):
  jbd: blocks reservation fix for large block support
  jbd2: blocks reservation fix for large block support

Theodore Ts'o (2):
  patch test-filesys-flag.patch
  ext4: New inode allocation for FLEX_BG meta-data groups.

 fs/buffer.c |3 
 fs/ext2/dir.c   |2 
 fs/ext2/ext2.h  |3 
 fs/ext2/file.c  |4 
 fs/ext2/inode.c |1 
 fs/ext2/ioctl.c |   12 
 fs/ext3/dir.c   |4 
 fs/ext3/file.c  |2 
 fs/ext3/ioctl.c |   12 
 fs/ext4/Makefile|2 
 fs/ext4/balloc.c|   28 
 fs/ext4/defrag.c| 2206 
 fs/ext4/dir.c   |4 
 fs/ext4/extents.c   |   67 -
 fs/ext4/file.c  |2 
 fs/ext4/ialloc.c|   96 +
 fs/ext4/inode.c |  174 ++-
 fs/ext4/ioctl.c |   25 
 fs/ext4/mballoc.c   |7 
 fs/ext4/super.c |   91 +
 fs/jbd/journal.c|7 
 fs/jbd/recovery.c   |2 
 fs/jbd2/journal.c   |7 
 fs/jbd2/recovery.c  |2 
 fs/mpage.c  |  406 +++
 include/linux/ext3_fs.h |3 
 include/linux/ext4_fs.h |  107 +
 include/linux/ext4_fs_extents.h |   22 
 include/linux/ext4_fs_sb.h  |3 
 include/linux/mpage.h   |2 
 30 files changed, 3220 insertions(+), 86 deletions(-)

[PATCH] lib: Add support for DIF CRC

2008-01-29 Thread Martin K. Petersen


Add support for the T10 Data Integrity Field CRC.

Signed-off-by: Martin K. Petersen <[EMAIL PROTECTED]>

diff -r f5ec697e8b10 include/linux/crc-t10dif.h
--- /dev/null   Thu Jan 01 00:00:00 1970 +
+++ b/include/linux/crc-t10dif.hTue Jan 29 13:26:19 2008 -0500
@@ -0,0 +1,9 @@
+#ifndef _LINUX_CRC_T10DIF_H
+#define _LINUX_CRC_T10DIF_H
+
+#include 
+
+const __u16 t10_dif_crc_table[256];
+__u16 crc_t10dif(unsigned char const *, size_t);
+
+#endif
diff -r f5ec697e8b10 lib/Kconfig
--- a/lib/Kconfig   Tue Jan 29 12:11:57 2008 -0500
+++ b/lib/Kconfig   Tue Jan 29 13:26:19 2008 -0500
@@ -22,6 +22,13 @@ config CRC16
  modules require CRC16 functions, but a module built outside
  the kernel tree does. Such modules that use library CRC16
  functions require M here.
+
+config CRC_T10DIF
+   tristate "CRC calculation for the T10 Data Integrity Field"
+   help
+ This option is only needed if a module that's not in the
+ kernel tree needs to calculate CRC checks for use with the
+ SCSI data integrity subsystem.
 
 config CRC_ITU_T
tristate "CRC ITU-T V.41 functions"
diff -r f5ec697e8b10 lib/Makefile
--- a/lib/Makefile  Tue Jan 29 12:11:57 2008 -0500
+++ b/lib/Makefile  Tue Jan 29 13:26:19 2008 -0500
@@ -44,6 +44,7 @@ obj-$(CONFIG_BITREVERSE) += bitrev.o
 obj-$(CONFIG_BITREVERSE) += bitrev.o
 obj-$(CONFIG_CRC_CCITT)+= crc-ccitt.o
 obj-$(CONFIG_CRC16)+= crc16.o
+obj-$(CONFIG_CRC_T10DIF)+= crc-t10dif.o
 obj-$(CONFIG_CRC_ITU_T)+= crc-itu-t.o
 obj-$(CONFIG_CRC32)+= crc32.o
 obj-$(CONFIG_CRC7) += crc7.o
diff -r f5ec697e8b10 lib/crc-t10dif.c
--- /dev/null   Thu Jan 01 00:00:00 1970 +
+++ b/lib/crc-t10dif.c  Tue Jan 29 13:26:19 2008 -0500
@@ -0,0 +1,68 @@
+/*
+ * T10 Data Integrity Field CRC16 calculation
+ *
+ * Copyright (c) 2007 Oracle Corporation.  All rights reserved.
+ * Written by Martin K. Petersen <[EMAIL PROTECTED]>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+#include 
+#include 
+#include 
+
+/* Table generated using the following polynomium:
+ * x^16 + x^15 + x^11 + x^9 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
+ * gt: 0x8bb7
+ */
+const __u16 t10_dif_crc_table[256] = {
+   0x0000, 0x8BB7, 0x9CD9, 0x176E, 0xB205, 0x39B2, 0x2EDC, 0xA56B,
+   0xEFBD, 0x640A, 0x7364, 0xF8D3, 0x5DB8, 0xD60F, 0xC161, 0x4AD6,
+   0x54CD, 0xDF7A, 0xC814, 0x43A3, 0xE6C8, 0x6D7F, 0x7A11, 0xF1A6,
+   0xBB70, 0x30C7, 0x27A9, 0xAC1E, 0x0975, 0x82C2, 0x95AC, 0x1E1B,
+   0xA99A, 0x222D, 0x3543, 0xBEF4, 0x1B9F, 0x9028, 0x8746, 0x0CF1,
+   0x4627, 0xCD90, 0xDAFE, 0x5149, 0xF422, 0x7F95, 0x68FB, 0xE34C,
+   0xFD57, 0x76E0, 0x618E, 0xEA39, 0x4F52, 0xC4E5, 0xD38B, 0x583C,
+   0x12EA, 0x995D, 0x8E33, 0x0584, 0xA0EF, 0x2B58, 0x3C36, 0xB781,
+   0xD883, 0x5334, 0x445A, 0xCFED, 0x6A86, 0xE131, 0xF65F, 0x7DE8,
+   0x373E, 0xBC89, 0xABE7, 0x2050, 0x853B, 0x0E8C, 0x19E2, 0x9255,
+   0x8C4E, 0x07F9, 0x1097, 0x9B20, 0x3E4B, 0xB5FC, 0xA292, 0x2925,
+   0x63F3, 0xE844, 0xFF2A, 0x749D, 0xD1F6, 0x5A41, 0x4D2F, 0xC698,
+   0x7119, 0xFAAE, 0xEDC0, 0x6677, 0xC31C, 0x48AB, 0x5FC5, 0xD472,
+   0x9EA4, 0x1513, 0x027D, 0x89CA, 0x2CA1, 0xA716, 0xB078, 0x3BCF,
+   0x25D4, 0xAE63, 0xB90D, 0x32BA, 0x97D1, 0x1C66, 0x0B08, 0x80BF,
+   0xCA69, 0x41DE, 0x56B0, 0xDD07, 0x786C, 0xF3DB, 0xE4B5, 0x6F02,
+   0x3AB1, 0xB106, 0xA668, 0x2DDF, 0x88B4, 0x0303, 0x146D, 0x9FDA,
+   0xD50C, 0x5EBB, 0x49D5, 0xC262, 0x6709, 0xECBE, 0xFBD0, 0x7067,
+   0x6E7C, 0xE5CB, 0xF2A5, 0x7912, 0xDC79, 0x57CE, 0x40A0, 0xCB17,
+   0x81C1, 0x0A76, 0x1D18, 0x96AF, 0x33C4, 0xB873, 0xAF1D, 0x24AA,
+   0x932B, 0x189C, 0x0FF2, 0x8445, 0x212E, 0xAA99, 0xBDF7, 0x3640,
+   0x7C96, 0xF721, 0xE04F, 0x6BF8, 0xCE93, 0x4524, 0x524A, 0xD9FD,
+   0xC7E6, 0x4C51, 0x5B3F, 0xD088, 0x75E3, 0xFE54, 0xE93A, 0x628D,
+   0x285B, 0xA3EC, 0xB482, 0x3F35, 0x9A5E, 0x11E9, 0x0687, 0x8D30,
+   0xE232, 0x6985, 0x7EEB, 0xF55C, 0x5037, 0xDB80, 0xCCEE, 0x4759,
+   0x0D8F, 0x8638, 0x9156, 0x1AE1, 0xBF8A, 0x343D, 0x2353, 0xA8E4,
+   0xB6FF, 0x3D48, 0x2A26, 0xA191, 0x04FA, 0x8F4D, 0x9823, 0x1394,
+   0x5942, 0xD2F5, 0xC59B, 0x4E2C, 0xEB47, 0x60F0, 0x779E, 0xFC29,
+   0x4BA8, 0xC01F, 0xD771, 0x5CC6, 0xF9AD, 0x721A, 0x6574, 0xEEC3,
+   0xA415, 0x2FA2, 0x38CC, 0xB37B, 0x1610, 0x9DA7, 0x8AC9, 0x017E,
+   0x1F65, 0x94D2, 0x83BC, 0x080B, 0xAD60, 0x26D7, 0x31B9, 0xBA0E,
+   0xF0D8, 0x7B6F, 0x6C01, 0xE7B6, 0x42DD, 0xC96A, 0xDE04, 0x55B3
+};
+
+__u16 crc_t10dif(const unsigned char *buffer, size_t len)
+{
+   __u16 crc = 0;
+   unsigned int i;
+
+   for (i=0 ; i < len ; i++)
+   crc = (crc << 8) ^ t10_dif_crc_table[((crc >> 8) ^ buffer[i]) & 0xff];
+
+   return crc;
+}
+
+EXPORT_SYMBOL(crc_t10dif);
+
+MODULE_DESCRIPTION("T10 DIF CRC calculation");
+MODULE_LICENSE("GPL");
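
For anyone wanting to cross-check the table in the patch above, each entry can be derived from the generator 0x8bb7 with eight MSB-first shift/xor steps. The following is a standalone userspace sketch, not part of the patch, and its function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define T10DIF_POLY 0x8bb7u

/* Compute one table entry: 8 MSB-first shift/xor steps on the index. */
static uint16_t t10dif_table_entry(uint8_t index)
{
	uint16_t crc = (uint16_t)(index << 8);
	int bit;

	for (bit = 0; bit < 8; bit++) {
		if (crc & 0x8000u)
			crc = (uint16_t)((crc << 1) ^ T10DIF_POLY);
		else
			crc = (uint16_t)(crc << 1);
	}
	return crc;
}

/* Same update rule as the table-driven loop, entry computed on the fly. */
static uint16_t t10dif_crc_bitwise(const unsigned char *buf, size_t len)
{
	uint16_t crc = 0;
	size_t i;

	for (i = 0; i < len; i++)
		crc = (uint16_t)((crc << 8) ^
			t10dif_table_entry((uint8_t)((crc >> 8) ^ buf[i])));
	return crc;
}
```

With these parameters (initial value 0, no reflection, no final xor), the conventional check string "123456789" should come out as 0xD0DB, and entries 1 through 3 should match 0x8BB7, 0x9CD9 and 0x176E from the table.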

Re: [PATCH] lib: Add support for DIF CRC

2008-01-29 Thread Martin K. Petersen

> "Jan" == Jan Engelhardt <[EMAIL PROTECTED]> writes:

Jan> 'const unsigned char *', like the rest of all code does.

Updated patch follows.


Jan> Do we already have some users for the T10DIF CRC? 

This is a runt patch given that it doesn't fit in block and SCSI.  The
remaining bits will come in through those trees.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [RFC/PATCH] e100 driver didn't support any MII-less PHYs...

2008-01-29 Thread Andreas Mohr

Hi,

On Tue, Jan 29, 2008 at 03:09:25PM -0800, Kok, Auke wrote:
> Andreas Mohr wrote:
> > Perhaps it's useful to file a bug/patch
> > on http://sourceforge.net/projects/e1000/ ? Perhaps -mm testing?
> 
> I wanted to push this though our testing labs first which has not happened 
> due to
> time constraints - that should quickly at least confirm that the most common 
> nics
> work OK after the change with your patch. I'll try and see if we can get this
> testing done soon.

Oh, full-scale regression testing even? Nice idea...
It would be even better if, during the hardware tests, one could also
dig out an i82503-based card (or other MII-less cards?), since I have
not yet made any effort to get them all recognized/supported by my
patch (that would have been out of scope anyway, since I have only
this single card).

Thanks a lot,

Andreas Mohr

[BUG] 2.6.24-git6 soft lockup detected while running libhugetlbfs

2008-01-29 Thread Kamalesh Babulal

Hi,

A soft lockup is detected while running libhugetlbfs on the 2.6.24-git6 kernel.
The machine is a 16-CPU Pentium III (Cascades) system.

BUG: soft lockup - CPU#13 stuck for 61s! [swapper:0]

Pid: 0, comm: swapper Not tainted (2.6.24-git6-autokern1 #1)
EIP: 0060:[] EFLAGS: 0246 CPU: 13
EIP is at default_idle+0x30/0x44
EAX:  EBX: c10002f8 ECX: 0010 EDX: 8fcf
ESI: 000d EDI: 00128868 EBP: e744bf9c ESP: e744bf9c
 DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
CR0: 8005003b CR2: b7eadcc0 CR3: 01386000 CR4: 06f0
DR0:  DR1:  DR2:  DR3: 
DR6: 0ff0 DR7: 0400
 [] show_trace_log_lvl+0x19/0x2e
 [] show_trace+0x12/0x14
 [] show_regs+0x1c/0x1f
 [] softlockup_tick+0xe0/0xf6
 [] run_local_timers+0x17/0x19
 [] update_process_times+0x24/0x49
 [] tick_periodic+0x63/0x6f
 [] tick_handle_periodic+0x19/0x6a
 [] local_apic_timer_interrupt+0x4e/0x53
 [] smp_apic_timer_interrupt+0x2a/0x39
 [] apic_timer_interrupt+0x28/0x30
 [] cpu_idle+0x76/0x8b
 [] start_secondary+0xb1/0xb3
 [<>] _stext+0x3e40/0x19
 ===
BUG: soft lockup - CPU#12 stuck for 61s! [swapper:0]

Pid: 0, comm: swapper Not tainted (2.6.24-git6-autokern1 #1)
EIP: 0060:[] EFLAGS: 0246 CPU: 12
EIP is at default_idle+0x30/0x44
EAX:  EBX: c10002f8 ECX: 000f7000 EDX: 8fcf
ESI: 000c EDI: 00128868 EBP: e7447f9c ESP: e7447f9c
 DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
CR0: 8005003b CR2: b7f67f1c CR3: 01386000 CR4: 06f0
DR0:  DR1:  DR2:  DR3: 
DR6: 0ff0 DR7: 0400
 [] show_trace_log_lvl+0x19/0x2e
 [] show_trace+0x12/0x14
 [] show_regs+0x1c/0x1f
 [] softlockup_tick+0xe0/0xf6
 [] run_local_timers+0x17/0x19
 [] update_process_times+0x24/0x49
 [] tick_periodic+0x63/0x6f
 [] tick_handle_periodic+0x19/0x6a
 [] local_apic_timer_interrupt+0x4e/0x53
 [] smp_apic_timer_interrupt+0x2a/0x39
 [] apic_timer_interrupt+0x28/0x30
 [] cpu_idle+0x76/0x8b
 [] start_secondary+0xb1/0xb3
 [<>] _stext+0x3e40/0x19
 ===
BUG: soft lockup - CPU#14 stuck for 61s! [swapper:0]

Pid: 0, comm: swapper Not tainted (2.6.24-git6-autokern1 #1)
EIP: 0060:[] EFLAGS: 0246 CPU: 14
EIP is at default_idle+0x30/0x44
EAX:  EBX: c10002f8 ECX: 00109000 EDX: 8fcf
ESI: 000e EDI: 00128868 EBP: e744ff9c ESP: e744ff9c
 DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
CR0: 8005003b CR2: b7e12494 CR3: 01386000 CR4: 06f0
DR0:  DR1:  DR2:  DR3: 
DR6: 0ff0 DR7: 0400
 [] show_trace_log_lvl+0x19/0x2e
 [] show_trace+0x12/0x14
 [] show_regs+0x1c/0x1f
 [] softlockup_tick+0xe0/0xf6
 [] run_local_timers+0x17/0x19
 [] update_process_times+0x24/0x49
 [] tick_periodic+0x63/0x6f
 [] tick_handle_periodic+0x19/0x6a
 [] local_apic_timer_interrupt+0x4e/0x53
 [] smp_apic_timer_interrupt+0x2a/0x39
 [] apic_timer_interrupt+0x28/0x30
 [] cpu_idle+0x76/0x8b
 [] start_secondary+0xb1/0xb3
 [<>] _stext+0x3e40/0x19
 ===
BUG: soft lockup - CPU#15 stuck for 61s! [swapper:0]

Pid: 0, comm: swapper Not tainted (2.6.24-git6-autokern1 #1)
EIP: 0060:[] EFLAGS: 0246 CPU: 15
EIP is at default_idle+0x30/0x44
EAX:  EBX: c10002f8 ECX: 00112000 EDX: 8fcf
ESI: 000f EDI: 00128868 EBP: e7451f9c ESP: e7451f9c
 DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
CR0: 8005003b CR2: b7f2ecc0 CR3: 01386000 CR4: 06f0
DR0:  DR1:  DR2:  DR3: 
DR6: 0ff0 DR7: 0400
 [] show_trace_log_lvl+0x19/0x2e
 [] show_trace+0x12/0x14
 [] show_regs+0x1c/0x1f
 [] softlockup_tick+0xe0/0xf6
 [] run_local_timers+0x17/0x19
 [] update_process_times+0x24/0x49
 [] tick_periodic+0x63/0x6f
 [] tick_handle_periodic+0x19/0x6a
 [] local_apic_timer_interrupt+0x4e/0x53
 [] smp_apic_timer_interrupt+0x2a/0x39
 [] apic_timer_interrupt+0x28/0x30
 [] cpu_idle+0x76/0x8b
 [] start_secondary+0xb1/0xb3
 [<>] _stext+0x3e40/0x19
 ===
BUG: soft lockup - CPU#10 stuck for 61s! [swapper:0]

Pid: 0, comm: swapper Not tainted (2.6.24-git6-autokern1 #1)
EIP: 0060:[] EFLAGS: 0246 CPU: 10
EIP is at default_idle+0x30/0x44
EAX:  EBX: c10002f8 ECX: 000e5000 EDX: 8fcf
ESI: 000a EDI: 00128868 EBP: e7443f9c ESP: e7443f9c
 DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
CR0: 8005003b CR2: b7ed5cc0 CR3: 01386000 CR4: 06f0
DR0:  DR1:  DR2:  DR3: 
DR6: 0ff0 DR7: 0400
 [] show_trace_log_lvl+0x19/0x2e
 [] show_trace+0x12/0x14
 [] show_regs+0x1c/0x1f
 [] softlockup_tick+0xe0/0xf6
 [] run_local_timers+0x17/0x19
 [] update_process_times+0x24/0x49
 [] tick_periodic+0x63/0x6f
 [] tick_handle_periodic+0x19/0x6a
 [] local_apic_timer_interrupt+0x4e/0x53
 [] smp_apic_timer_interrupt+0x2a/0x39
 [] apic_timer_interrupt+0x28/0x30
 [] cpu_idle+0x76/0x8b
 [] start_secondary+0xb1/0xb3
 [<>] _stext+0x3e40/0x19
 ===
BUG: soft lockup - CPU#8 stuck for 61s! [swapper:0]

Pid: 0, comm: swapper Not tainte

Re: [PATCH 24/27] NFS: Use local caching [try #2]

2008-01-29 Thread Trond Myklebust


On Wed, 2008-01-30 at 03:25 +, David Howells wrote:
> Chuck Lever <[EMAIL PROTECTED]> wrote:
> 
> > This patch really ought to be broken into more manageable atomic
> > changes to make it easier to review, and to provide more fine-grained
> > explanation and rationalization for each specific change via
> > individual patch descriptions.
> 
> Hmmm  I broke the patch up as Trond stipulated - at least, I thought I
> had.
> 
> In many ways this request doesn't make sense.  You can't do NFS caching
> without all the appropriate bits, so logically they should be one patch.
> Breaking it up won't help git-bisect since the option to enable all this is
> the last (or nearly last) patch.

That depends entirely on what you are tracking. At this point in time,
I'm completely uninterested in debugging cachefs, but _very_ interested
in tracking and debugging changes to core NFS code.

Trond

About closed-source module use GPL module function

2008-01-29 Thread CooperYuan Cooper

Hi all,

I am porting a device driver to Linux; its source code is not open.

In this module, I use some interface functions that a GPL module
exports through the EXPORT_SYMBOL macro (not EXPORT_SYMBOL_GPL), for
example register_sound_dsp().

Does this violate the GPL? If so, how can I resolve it?

Thanks a lot, any suggestion is appreciated.

Cooper

Re: Change In sk_buff structure in 2.6.22 kernel

2008-01-29 Thread Stephen Hemminger

On Wed, 30 Jan 2008 10:49:49 +0530
"PV Juliet" <[EMAIL PROTECTED]> wrote:

> Hi All,
> 
> 
> The header fields in the sk_buff structure have been renamed and are
> no longer unions.
> 
> Networking code and drivers are supposed to use skb->transport_header,
> skb->network_header, and skb->skb_mac_header.
> But when I am trying to access fields of TCP using the code
> struct tcphdr *tcp = skb->transport_header;
> tcp->   //accessing proper field
> It is not accessing the value properly ...
> Can anyone please help me ???
> 
> 
> Thanks in advance
> Regards
> Juliet

Read the source (include/linux/skbuff.h)

Use the new accessor functions, e.g. skb_transport_header(skb) and
skb_network_header(skb).
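To illustrate why the accessors matter, here is a user-space mock. It assumes that on some configurations the header fields hold an offset from the start of the buffer rather than a pointer, so assigning the raw field to a struct tcphdr pointer cannot work. All mock_* names are hypothetical stand-ins and are not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock buffer descriptor for illustration only; NOT the kernel's sk_buff. */
struct mock_sk_buff {
    unsigned char *head;           /* start of the packet buffer */
    unsigned int transport_header; /* assumed: offset from head, not a pointer */
};

/* Same idea as skb_transport_header(): turn the stored value into a pointer. */
unsigned char *mock_skb_transport_header(const struct mock_sk_buff *skb)
{
    assert(skb != NULL && skb->head != NULL);
    return skb->head + skb->transport_header;
}

/* Read the TCP destination port (bytes 2..3 of the header, network order). */
uint16_t mock_tcp_dest_port(const struct mock_sk_buff *skb)
{
    const unsigned char *th = mock_skb_transport_header(skb);

    return (uint16_t)((th[2] << 8) | th[3]);
}
```

In real kernel code the accessor result would be cast to struct tcphdr *, or the typed helper tcp_hdr(skb) would be used, instead of reading the field directly.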


-- 
Stephen Hemminger <[EMAIL PROTECTED]>

Re: ndiswrapper and GPL-only symbols redux

2008-01-29 Thread Pavel Roskin


Quoting Andi Kleen <[EMAIL PROTECTED]>:

> Pavel Roskin <[EMAIL PROTECTED]> writes:
>>   */
>> @@ -162,6 +163,7 @@ const char *print_tainted(void)
>>   if (tainted) {
>>   snprintf(buf, sizeof(buf), "Tainted: %c%c%c%c%c%c%c%c",
>>   tainted & TAINT_PROPRIETARY_MODULE ? 'P' : 'G',
>> + tainted & TAINT_BLOB_WRAPPER ? 'W' : ' ',
>
> Are you sure you don't need to add a new '%c' to the format string too?
> I think gcc should have warned.

You are right!  Thanks.

--
Regards,
Pavel Roskin

Re: [PATCH 2.6.24] x86: add sysfs interface for cpuid module

2008-01-29 Thread H. Peter Anvin


Yi Yang wrote:
>> It's broken, because it doesn't take into account the fact that Intel
>> broke CPUID level 4 and made it "repeating" (neither did the cpuid char
>> device, because it predated the Intel braindamage; I've had a patch for
>> it privately for a while, but didn't push it upstream because paravirt
>> broke it royally and I wanted the situation to settle down.)
>>
>> It's broken, because the algorithm used to determine valid CPUID levels
>> is incorrect; it fails to recognize any CPUID levels other than the main
>> Intel and AMD ones, e.g. the Transmeta 0x8086 (and sometimes more)
>> and VIA 0xc000 levels.
>
> Thank you for pointing out these issues, i think we can let users input
> any cpuid level and output the corresponding cpuid, in this way we can
> avoid to consider cpu differences and left this to userspace. We can
> also consider all the x86 platforms to do cpuid for every one.
>
>> It's broken, because it is better for the userspace extractor to have
>> this logic than to stuff it into the kernel, where it sits hogging
>> unswappable memory at all times.
>
> It seems not to be very appropriate to let user space consider hardware
> details. /proc/cpuinfo should be an example to justify this.

/proc/cpuinfo represents what the kernel needs to know, so it reflects
the kernel's interpretation of CPUID.  There is no reason to interpret
things in the kernel that the kernel doesn't need.

> Is there any user application using /dev/cpu/*/cpuid? if no, i think it
> is feasible to provide an interface in the kernel.

Yes.  It's called x86info, I believe.

-hpa

Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-29 Thread Al Boldi

Jan Kara wrote:
> > Chris Snook wrote:
> > > Al Boldi wrote:
> > > > This RFC proposes to introduce a tunable which allows to disable
> > > > fsync and changes ordered into writeback writeout on a per-process
> > > > basis like this:
> > > >
> > > >   echo 1 > /proc/`pidof process`/softsync
> > >
> > > This is basically a kernel workaround for stupid app behavior.
> >
> > Exactly right to some extent, but don't forget the underlying
> > data=ordered starvation problem, which looks like a genuinely deep
> > problem maybe related to blockIO.
>
>   It is a problem with the way how ext3 does fsync (at least that's what
> we ended up with in that konqueror problem)... It has to flush the
> current transaction which means that app doing fsync() has to wait till
> all dirty data of all files on the filesystem are written (if we are in
> ordered mode). And that takes quite some time... There are possibilities
> how to avoid that but especially with freshly created files, it's tough
> and I don't see a way how to do it without some fundamental changes to
> JBD.

Ok, but keep in mind that this starvation occurs even in the absence of 
fsync, as the benchmarks show.

And, a quick test of successive 1sec delayed syncs shows no hangs until about 
1 minute (~180mb) of db-writeout activity, when the sync abruptly hangs for 
minutes on end, and io-wait shows almost 100%.

Now it turns out that 'echo 3 > /proc/.../drop_caches' has no effect, but 
doing it a few more times makes the hangs go away for a while, only to come 
back again and again.


Thanks!

--
Al


Re: ndiswrapper and GPL-only symbols redux

2008-01-29 Thread Andi Kleen

Pavel Roskin <[EMAIL PROTECTED]> writes:
>   */
> @@ -162,6 +163,7 @@ const char *print_tainted(void)
>   if (tainted) {
>   snprintf(buf, sizeof(buf), "Tainted: %c%c%c%c%c%c%c%c",
>   tainted & TAINT_PROPRIETARY_MODULE ? 'P' : 'G',
> + tainted & TAINT_BLOB_WRAPPER ? 'W' : ' ',


Are you sure you don't need to add a new '%c' to the format string too?
I think gcc should have warned.

-Andi

>   tainted & TAINT_FORCED_MODULE ? 'F' : ' ',
>   tainted & TAINT_UNSAFE_SMP ? 'S' : ' ',
>   tainted & TAINT_FORCED_RMMOD ? 'R' : ' ',
>
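
The mismatch raised above is easy to reproduce in user space: every argument after the format string needs its own conversion, and extra arguments are silently dropped at run time (gcc's format checking normally flags them at build time). A minimal sketch, using hypothetical DEMO_TAINT_* flag values and a hypothetical demo_print_tainted() rather than the kernel's own definitions:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag values for the demo; the kernel's TAINT_* bits may differ. */
#define DEMO_TAINT_PROPRIETARY  (1u << 0)
#define DEMO_TAINT_BLOB_WRAPPER (1u << 1)
#define DEMO_TAINT_FORCED       (1u << 2)

const char *demo_print_tainted(unsigned int tainted, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    if (tainted) {
        /* Three flag arguments, so exactly three %c conversions. */
        snprintf(buf, len, "Tainted: %c%c%c",
                 tainted & DEMO_TAINT_PROPRIETARY ? 'P' : 'G',
                 tainted & DEMO_TAINT_BLOB_WRAPPER ? 'W' : ' ',
                 tainted & DEMO_TAINT_FORCED ? 'F' : ' ');
    } else {
        snprintf(buf, len, "Not tainted");
    }
    return buf;
}
```

Adding a fourth flag argument without a fourth %c would leave the last flag out of the string, which is the kind of bug the question is about.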


[PATCH 4/4] IB: introducing MTHCA_MR_DMABARRIER

2008-01-29 Thread akepner


Add MTHCA_MR_DMABARRIER to mthca's API, increment ABI version, 
and make use of MTHCA_MR_DMABARRIER when mapping a user-allocated 
memory region with ib_umem_get().


Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]>

--

 drivers/infiniband/core/umem.c   |   15 +---
 drivers/infiniband/hw/mthca/mthca_provider.c |7 -
 drivers/infiniband/hw/mthca/mthca_user.h |   10 +++-
 include/rdma/ib_verbs.h  |   33 +++
 4 files changed, 59 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 5b00408..57b5ce9 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "uverbs.h"
 
@@ -72,6 +73,8 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
  * @addr: userspace virtual address to start at
  * @size: length of region to pin
  * @access: IB_ACCESS_xxx flags for memory being pinned
+ * @dmabarrier: set a "dma barrier" so that in-flight DMA is 
+ *  flushed when the memory region is written to
  */
 struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
size_t size, int access, int dmabarrier)
@@ -87,6 +90,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
int ret;
int off;
int i;
+   struct dma_attrs attrs;
+
+   dma_set_attr(&attrs, dmabarrier ? DMA_ATTR_BARRIER : 0);
 
if (!can_do_mlock())
return ERR_PTR(-EPERM);
@@ -174,10 +180,11 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
                sg_set_page(&chunk->page_list[i], page_list[i + off], PAGE_SIZE, 0);
}
 
-   chunk->nmap = ib_dma_map_sg(context->device,
-   &chunk->page_list[0],
-   chunk->nents,
-   DMA_BIDIRECTIONAL);
+   chunk->nmap = ib_dma_map_sg_attrs(context->device,
+ &chunk->page_list[0],
+ chunk->nents,
+ DMA_BIDIRECTIONAL, 
+ &attrs);
if (chunk->nmap <= 0) {
for (i = 0; i < chunk->nents; ++i)
put_page(sg_page(&chunk->page_list[i]));
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 704d8ef..e837cc9 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -1017,17 +1017,22 @@ static struct ib_mr *mthca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
struct mthca_dev *dev = to_mdev(pd->device);
struct ib_umem_chunk *chunk;
struct mthca_mr *mr;
+   struct mthca_reg_mr ucmd;
u64 *pages;
int shift, n, len;
int i, j, k;
int err = 0;
int write_mtt_size;
 
+   if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd))
+   return ERR_PTR(-EFAULT);
+
mr = kmalloc(sizeof *mr, GFP_KERNEL);
if (!mr)
return ERR_PTR(-ENOMEM);
 
-   mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0);
+   mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, 
+  ucmd.mr_attrs & MTHCA_MR_DMABARRIER);
 
if (IS_ERR(mr->umem)) {
err = PTR_ERR(mr->umem);
diff --git a/drivers/infiniband/hw/mthca/mthca_user.h b/drivers/infiniband/hw/mthca/mthca_user.h
index 02cc0a7..701a430 100644
--- a/drivers/infiniband/hw/mthca/mthca_user.h
+++ b/drivers/infiniband/hw/mthca/mthca_user.h
@@ -41,7 +41,7 @@
  * Increment this value if any changes that break userspace ABI
  * compatibility are made.
  */
-#define MTHCA_UVERBS_ABI_VERSION   1
+#define MTHCA_UVERBS_ABI_VERSION   2
 
 /*
  * Make sure that all structs defined in this file remain laid out so
@@ -61,6 +61,14 @@ struct mthca_alloc_pd_resp {
__u32 reserved;
 };
 
+struct mthca_reg_mr {
+   __u32 mr_attrs;
+#define MTHCA_MR_DMABARRIER 0x1  /* set a dma barrier in order to flush 
+ * in-flight DMA on a write to memory 
+ * region */
+   __u32 reserved;
+};
+
 struct mthca_create_cq {
__u32 lkey;
__u32 pdn;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 11f3960..ac869e2 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1507,6 +1507,24 @@ static inline void ib_dma_unmap_single(struct ib_device *dev,
dma_u

[PATCH 3/4] IB: add dmabarrier to ib_umem_get() prototype

2008-01-29 Thread akepner


Add a new parameter, dmabarrier, to the ib_umem_get() 
prototype.

Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]>

-- 

 drivers/infiniband/core/umem.c   |2 +-
 drivers/infiniband/hw/amso1100/c2_provider.c |2 +-
 drivers/infiniband/hw/cxgb3/iwch_provider.c  |2 +-
 drivers/infiniband/hw/ehca/ehca_mrmw.c   |2 +-
 drivers/infiniband/hw/ipath/ipath_mr.c   |3 ++-
 drivers/infiniband/hw/mlx4/cq.c  |2 +-
 drivers/infiniband/hw/mlx4/doorbell.c|2 +-
 drivers/infiniband/hw/mlx4/mr.c  |3 ++-
 drivers/infiniband/hw/mlx4/qp.c  |2 +-
 drivers/infiniband/hw/mlx4/srq.c |2 +-
 drivers/infiniband/hw/mthca/mthca_provider.c |3 ++-
 include/rdma/ib_umem.h   |4 ++--
 12 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 4e3128f..5b00408 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -74,7 +74,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
  * @access: IB_ACCESS_xxx flags for memory being pinned
  */
 struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
-   size_t size, int access)
+   size_t size, int access, int dmabarrier)
 {
struct ib_umem *umem;
struct page **page_list;
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
index 7a6cece..f571dff 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -449,7 +449,7 @@ static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
return ERR_PTR(-ENOMEM);
c2mr->pd = c2pd;
 
-   c2mr->umem = ib_umem_get(pd->uobject->context, start, length, acc);
+   c2mr->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0);
if (IS_ERR(c2mr->umem)) {
err = PTR_ERR(c2mr->umem);
kfree(c2mr);
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index b5436ca..66d9d65 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -601,7 +601,7 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
if (!mhp)
return ERR_PTR(-ENOMEM);
 
-   mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc);
+   mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0);
if (IS_ERR(mhp->umem)) {
err = PTR_ERR(mhp->umem);
kfree(mhp);
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index e239bbf..62a382c 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -325,7 +325,7 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
}
 
e_mr->umem = ib_umem_get(pd->uobject->context, start, length,
-mr_access_flags);
+mr_access_flags, 0);
if (IS_ERR(e_mr->umem)) {
ib_mr = (void *)e_mr->umem;
goto reg_user_mr_exit1;
diff --git a/drivers/infiniband/hw/ipath/ipath_mr.c b/drivers/infiniband/hw/ipath/ipath_mr.c
index db4ba92..7ffb392 100644
--- a/drivers/infiniband/hw/ipath/ipath_mr.c
+++ b/drivers/infiniband/hw/ipath/ipath_mr.c
@@ -195,7 +195,8 @@ struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
goto bail;
}
 
-   umem = ib_umem_get(pd->uobject->context, start, length, mr_access_flags);
+   umem = ib_umem_get(pd->uobject->context, start, length, 
+  mr_access_flags, 0);
if (IS_ERR(umem))
return (void *) umem;
 
diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 9d32c49..3adad6f 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -122,7 +122,7 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector
}
 
cq->umem = ib_umem_get(context, ucmd.buf_addr, buf_size,
-  IB_ACCESS_LOCAL_WRITE);
+  IB_ACCESS_LOCAL_WRITE, 0);
if (IS_ERR(cq->umem)) {
err = PTR_ERR(cq->umem);
goto err_cq;
diff --git a/drivers/infiniband/hw/mlx4/doorbell.c b/drivers/infiniband/hw/mlx4/doorbell.c
index 1c36087..0afde2d 100644
--- a/drivers/infiniband/hw/mlx4/doorbell.c
+++ b/drivers/infiniband/hw/mlx4/doorbell.c
@@ -181,7 +181,7 @@ int mlx4_ib_db_map_user(struct mlx4_ib_ucontext *context, unsigned long virt,
page->user_virt = (virt & PAGE_MASK);
p

Re: Mostly revert "e1000/e1000e: Move PCI-Express device IDs over to e1000e"

2008-01-29 Thread Linus Torvalds



On Wed, 30 Jan 2008, Linus Torvalds wrote:
> 
> Untested, but as mentioned, this is more of a "this looks maintainable and 
> like it should solve the issues" rather than anything I was planning on 
> committing now.

Side note: I "verified" this patch by also diffing it against the HEAD^ 
state (before adding the PCIE ID's back in), to check that I marked 
exactly the right entries as PCIE() entries.

So while it's not tested, at least it looks right from two different 
angles ;)

Linus

[PATCH 2/4] dma/ia64: update ia64 machvecs

2008-01-29 Thread akepner


Change all ia64 machvecs to use the new dma_{un}map_*_attrs()
interfaces. Implement the old dma_{un}map_*() interfaces in 
terms of the corresponding new interfaces. For ia64/sn, make 
use of one dma attribute, DMA_ATTR_BARRIER.


Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]>

--

 arch/ia64/hp/common/hwsw_iommu.c |   60 
 arch/ia64/hp/common/sba_iommu.c  |   62 ++--
 arch/ia64/sn/pci/pci_dma.c   |   77 ---
 include/asm-ia64/dma-mapping.h   |   28 +--
 include/asm-ia64/machvec.h   |   52 
 include/asm-ia64/machvec_hpzx1.h |   16 +++---
 include/asm-ia64/machvec_hpzx1_swiotlb.h |   16 +++---
 include/asm-ia64/machvec_sn2.h   |   16 +++---
 include/linux/dma-attrs.h|   48 +++
 lib/swiotlb.c|   50 
 10 files changed, 289 insertions(+), 136 deletions(-)

diff --git a/arch/ia64/hp/common/hwsw_iommu.c b/arch/ia64/hp/common/hwsw_iommu.c
index 94e5710..8cedd6c 100644
--- a/arch/ia64/hp/common/hwsw_iommu.c
+++ b/arch/ia64/hp/common/hwsw_iommu.c
@@ -20,10 +20,10 @@
 extern int swiotlb_late_init_with_default_size (size_t size);
 extern ia64_mv_dma_alloc_coherent  swiotlb_alloc_coherent;
 extern ia64_mv_dma_free_coherent   swiotlb_free_coherent;
-extern ia64_mv_dma_map_single  swiotlb_map_single;
-extern ia64_mv_dma_unmap_singleswiotlb_unmap_single;
-extern ia64_mv_dma_map_sg  swiotlb_map_sg;
-extern ia64_mv_dma_unmap_sgswiotlb_unmap_sg;
+extern ia64_mv_dma_map_single_attrsswiotlb_map_single_attrs;
+extern ia64_mv_dma_unmap_single_attrs  swiotlb_unmap_single_attrs;
+extern ia64_mv_dma_map_sg_attrsswiotlb_map_sg_attrs;
+extern ia64_mv_dma_unmap_sg_attrs  swiotlb_unmap_sg_attrs;
 extern ia64_mv_dma_supported   swiotlb_dma_supported;
 extern ia64_mv_dma_mapping_error   swiotlb_dma_mapping_error;
 
@@ -31,19 +31,19 @@ extern ia64_mv_dma_mapping_error swiotlb_dma_mapping_error;
 
 extern ia64_mv_dma_alloc_coherent  sba_alloc_coherent;
 extern ia64_mv_dma_free_coherent   sba_free_coherent;
-extern ia64_mv_dma_map_single  sba_map_single;
-extern ia64_mv_dma_unmap_singlesba_unmap_single;
-extern ia64_mv_dma_map_sg  sba_map_sg;
-extern ia64_mv_dma_unmap_sgsba_unmap_sg;
+extern ia64_mv_dma_map_single_attrssba_map_single_attrs;
+extern ia64_mv_dma_unmap_single_attrs  sba_unmap_single_attrs;
+extern ia64_mv_dma_map_sg_attrssba_map_sg_attrs;
+extern ia64_mv_dma_unmap_sg_attrs  sba_unmap_sg_attrs;
 extern ia64_mv_dma_supported   sba_dma_supported;
 extern ia64_mv_dma_mapping_error   sba_dma_mapping_error;
 
 #define hwiommu_alloc_coherent sba_alloc_coherent
 #define hwiommu_free_coherent  sba_free_coherent
-#define hwiommu_map_single sba_map_single
-#define hwiommu_unmap_single   sba_unmap_single
-#define hwiommu_map_sg sba_map_sg
-#define hwiommu_unmap_sg   sba_unmap_sg
+#define hwiommu_map_single_attrs   sba_map_single_attrs
+#define hwiommu_unmap_single_attrs sba_unmap_single_attrs
+#define hwiommu_map_sg_attrs   sba_map_sg_attrs
+#define hwiommu_unmap_sg_attrs sba_unmap_sg_attrs
 #define hwiommu_dma_supported  sba_dma_supported
 #define hwiommu_dma_mapping_error  sba_dma_mapping_error
 #define hwiommu_sync_single_for_cpumachvec_dma_sync_single
@@ -98,40 +98,44 @@ hwsw_free_coherent (struct device *dev, size_t size, void *vaddr, dma_addr_t dma
 }
 
 dma_addr_t
-hwsw_map_single (struct device *dev, void *addr, size_t size, int dir)
+hwsw_map_single_attrs (struct device *dev, void *addr, size_t size, int dir, 
+  struct dma_attrs *attrs)
 {
if (use_swiotlb(dev))
-   return swiotlb_map_single(dev, addr, size, dir);
+   return swiotlb_map_single_attrs(dev, addr, size, dir, attrs);
else
-   return hwiommu_map_single(dev, addr, size, dir);
+   return hwiommu_map_single_attrs(dev, addr, size, dir, attrs);
 }
 
 void
-hwsw_unmap_single (struct device *dev, dma_addr_t iova, size_t size, int dir)
+hwsw_unmap_single_attrs (struct device *dev, dma_addr_t iova, size_t size, 
+int dir, struct dma_attrs *attrs)
 {
if (use_swiotlb(dev))
-   return swiotlb_unmap_single(dev, iova, size, dir);
+   return swiotlb_unmap_single_attrs(dev, iova, size, dir, attrs);
else
-   return hwiommu_unmap_single(dev, iova, size, dir);
+   return hwiommu_unmap_single_attrs(dev, iova, size, dir, attrs);
 }
 
 
 int
-hwsw_map_sg (struct device *dev, struct scatterlist *sglist, int nents, int dir)
+hwsw_map_sg_attrs (struct device *dev, struct scatterlist *sglist, in

Re: Mostly revert "e1000/e1000e: Move PCI-Express device IDs over to e1000e"

2008-01-29 Thread Linus Torvalds



On Tue, 29 Jan 2008, Randy Dunlap wrote:
> 
> Andrew was concerned about this when the driver was in -mm.
> He asked for a patch that would set E1000E to same value as E1000
> and I supplied that.  Auke acked it IIRC.  Other people vetoed it.  :(

Yeah, I've been discussing with Jeff and the gang.

I think we have agreed on a solution where the ID's show up in the old 
driver if the new driver is not enabled at all.

(And as a side note: it turns out that the problem I experienced didn't 
come from the new e1000e driver after all, so I'll be removing the 
EXPERIMENTAL flag again).

So I'd suggest the final patch be something like this, but I'm sending it 
out just as an example of how we could solve this, not necessarily as a 
final patch.

Jeff, Auke, would something like this be acceptable? It makes it very 
obvious in the driver table which entries are for the PCIE versions that 
would be handled by the E1000E driver if it is enabled.

Untested, but as mentioned, this is more of a "this looks maintainable and 
like it should solve the issues" rather than anything I was planning on 
committing now.
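
The table trick in the patch below can be tried outside the kernel. This is a user-space mock under stated assumptions: DEMO_NEW_DRIVER_ENABLED stands in for CONFIG_E1000E_ENABLED, and demo_id, DEMO_DEVICE, demo_tbl, demo_count_ids() and demo_has_id() are hypothetical names that do not come from the e1000 driver.

```c
#include <assert.h>
#include <stddef.h>

/*
 * PCIE(x) keeps an entry (and its trailing comma) only when the newer
 * driver is not built; otherwise the entry vanishes from the table.
 */
#ifdef DEMO_NEW_DRIVER_ENABLED
#define PCIE(x)
#else
#define PCIE(x) x,
#endif

struct demo_id {
    unsigned short device;
};

#define DEMO_DEVICE(id) { (id) }

static const struct demo_id demo_tbl[] = {
    DEMO_DEVICE(0x1000),
    DEMO_DEVICE(0x1028),
PCIE(DEMO_DEVICE(0x1049))
PCIE(DEMO_DEVICE(0x104A))
    DEMO_DEVICE(0x1075),
    { 0 } /* terminating entry */
};

/* Count entries up to the zero terminator. */
size_t demo_count_ids(void)
{
    size_t n = 0;

    while (demo_tbl[n].device != 0)
        n++;
    return n;
}

/* Return 1 if the table claims the given device ID, 0 otherwise. */
int demo_has_id(unsigned short device)
{
    size_t i;

    assert(device != 0);
    for (i = 0; demo_tbl[i].device != 0; i++)
        if (demo_tbl[i].device == device)
            return 1;
    return 0;
}
```

Building with -DDEMO_NEW_DRIVER_ENABLED drops the two PCIE() entries, so the old table stops claiming those IDs; building without it keeps them.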

Linus
---
 drivers/net/Kconfig|5 ++-
 drivers/net/e1000/e1000_main.c |   60 ++--
 2 files changed, 37 insertions(+), 28 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 5a2d1dd..6c57540 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -1992,7 +1992,7 @@ config E1000_DISABLE_PACKET_SPLIT
 
 config E1000E
tristate "Intel(R) PRO/1000 PCI-Express Gigabit Ethernet support"
-   depends on PCI && EXPERIMENTAL
+   depends on PCI
---help---
  This driver supports the PCI-Express Intel(R) PRO/1000 gigabit
  ethernet family of adapters. For PCI or PCI-X e1000 adapters,
@@ -2009,6 +2009,9 @@ config E1000E
  To compile this driver as a module, choose M here. The module
  will be called e1000e.
 
+config E1000E_ENABLED
+   def_bool E1000E != n
+
 config IP1000
tristate "IP1000 Gigabit Ethernet support"
depends on PCI && EXPERIMENTAL
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 3111af6..8c87940 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -47,6 +47,12 @@ static const char e1000_copyright[] = "Copyright (c) 1999-2006 Intel Corporation
  * Macro expands to...
  *   {PCI_DEVICE(PCI_VENDOR_ID_INTEL, device_id)}
  */
+#ifdef CONFIG_E1000E_ENABLED
+  #define PCIE(x) 
+#else
+  #define PCIE(x) x,
+#endif
+
 static struct pci_device_id e1000_pci_tbl[] = {
INTEL_E1000_ETHERNET_DEVICE(0x1000),
INTEL_E1000_ETHERNET_DEVICE(0x1001),
@@ -73,14 +79,14 @@ static struct pci_device_id e1000_pci_tbl[] = {
INTEL_E1000_ETHERNET_DEVICE(0x1026),
INTEL_E1000_ETHERNET_DEVICE(0x1027),
INTEL_E1000_ETHERNET_DEVICE(0x1028),
-   INTEL_E1000_ETHERNET_DEVICE(0x1049),
-   INTEL_E1000_ETHERNET_DEVICE(0x104A),
-   INTEL_E1000_ETHERNET_DEVICE(0x104B),
-   INTEL_E1000_ETHERNET_DEVICE(0x104C),
-   INTEL_E1000_ETHERNET_DEVICE(0x104D),
-   INTEL_E1000_ETHERNET_DEVICE(0x105E),
-   INTEL_E1000_ETHERNET_DEVICE(0x105F),
-   INTEL_E1000_ETHERNET_DEVICE(0x1060),
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x1049))
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x104A))
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x104B))
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x104C))
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x104D))
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x105E))
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x105F))
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x1060))
INTEL_E1000_ETHERNET_DEVICE(0x1075),
INTEL_E1000_ETHERNET_DEVICE(0x1076),
INTEL_E1000_ETHERNET_DEVICE(0x1077),
@@ -89,28 +95,28 @@ static struct pci_device_id e1000_pci_tbl[] = {
INTEL_E1000_ETHERNET_DEVICE(0x107A),
INTEL_E1000_ETHERNET_DEVICE(0x107B),
INTEL_E1000_ETHERNET_DEVICE(0x107C),
-   INTEL_E1000_ETHERNET_DEVICE(0x107D),
-   INTEL_E1000_ETHERNET_DEVICE(0x107E),
-   INTEL_E1000_ETHERNET_DEVICE(0x107F),
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x107D))
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x107E))
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x107F))
INTEL_E1000_ETHERNET_DEVICE(0x108A),
-   INTEL_E1000_ETHERNET_DEVICE(0x108B),
-   INTEL_E1000_ETHERNET_DEVICE(0x108C),
-   INTEL_E1000_ETHERNET_DEVICE(0x1096),
-   INTEL_E1000_ETHERNET_DEVICE(0x1098),
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x108B))
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x108C))
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x1096))
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x1098))
INTEL_E1000_ETHERNET_DEVICE(0x1099),
-   INTEL_E1000_ETHERNET_DEVICE(0x109A),
-   INTEL_E1000_ETHERNET_DEVICE(0x10A4),
-   INTEL_E1000_ETHERNET_DEVICE(0x10A5),
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x109A))
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x10A4))
+PCIE(  INTEL_E1000_ETHERNET_DEVICE(0x10A5))
INTEL_E1000_ET

[PATCH 0/4] dma: dma_{un}map_{single|sg}_attrs() interface

2008-01-29 Thread akepner


Introduce a new interface for passing architecture-specific 
attributes when memory is mapped and unmapped for DMA. Give 
the interface a default implementation which ignores 
attributes.

Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]>

--

 dma-mapping.h |   33 +
 1 files changed, 33 insertions(+)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 101a2d4..bc313e3 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -116,4 +116,37 @@ static inline void dmam_release_declared_memory(struct 
device *dev)
 }
 #endif /* ARCH_HAS_DMA_DECLARE_COHERENT_MEMORY */
 
+#ifndef ARCH_USES_DMA_ATTRS
+struct dma_attrs;
+
+static inline dma_addr_t dma_map_single_attrs(struct device *dev, 
+ void *cpu_addr, size_t size, 
+ int dir, struct dma_attrs* attrs)
+{
+   return dma_map_single(dev, cpu_addr, size, dir);
+}
+
+static inline void dma_unmap_single_attrs(struct device *dev, 
+ dma_addr_t dma_addr,
+ size_t size, int dir, 
+ struct dma_attrs* attrs)
+{
+   dma_unmap_single(dev, dma_addr, size, dir);
+}
+
+static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
+  int nents, int dir, struct dma_attrs *attrs)
+{
+   return dma_map_sg(dev, sgl, nents, dir);
+}
+
+static inline void dma_unmap_sg_attrs(struct device *dev, 
+ struct scatterlist *sgl,
+ int nents, int dir, 
+ struct dma_attrs *attrs)
+{
+   dma_unmap_sg(dev, sgl, nents, dir);
+}
+#endif /* ARCH_USES_DMA_ATTRS */
+
 #endif



[PATCH 1/4] dma: document dma_{un}map_{single|sg}_attrs() interface

2008-01-29 Thread akepner


Document the new dma_{un}map_{single|sg}_attrs() functions. 

Signed-off-by: Arthur Kepner <[EMAIL PROTECTED]>

--

 DMA-API.txt |   63 
 1 files changed, 63 insertions(+)

diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
index b939ebb..fad05e0 100644
--- a/Documentation/DMA-API.txt
+++ b/Documentation/DMA-API.txt
@@ -395,6 +395,69 @@ Notes:  You must do this:
 
 See also dma_map_single().
 
+dma_addr_t 
+dma_map_single_attrs(struct device *dev, void *cpu_addr, size_t size, 
+enum dma_data_direction dir, 
+struct dma_attrs* attrs)
+
+void 
+dma_unmap_single_attrs(struct device *dev, dma_addr_t dma_addr,
+  size_t size, enum dma_data_direction dir,
+  struct dma_attrs* attrs)
+
+int 
+dma_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
+int nents, enum dma_data_direction dir, 
+struct dma_attrs *attrs)
+
+void 
+dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sgl, 
+  int nents, enum dma_data_direction dir,
+  struct dma_attrs *attrs)
+
+The four functions above are just like the counterpart functions 
+without the _attrs suffixes, except that they pass an optional 
+struct dma_attrs*. 
+
+struct dma_attrs encapsulates a set of "dma attributes". For the 
+definition of struct dma_attrs see linux/dma-attrs.h. 
+
+The interpretation of dma attributes is architecture-specific. 
+
+If struct dma_attrs* is NULL, the semantics of each of these 
+functions is identical to those of the corresponding function 
+without the _attrs suffix. As a result dma_map_single_attrs() 
+can generally replace dma_map_single(), etc.
+
+As an example of the use of the *_attrs functions, here's how 
+you could pass an attribute DMA_ATTR_FOO when mapping memory 
+for DMA:
+
+#include <linux/dma-attrs.h>
+/* DMA_ATTR_FOO should be defined in linux/dma-attrs.h */
+...
+
+   DECLARE_DMA_ATTRS(attrs);
+   dma_set_attr(&attrs, DMA_ATTR_FOO);
+   
+   n = dma_map_sg_attrs(dev, sg, nents, DMA_TO_DEVICE, &attrs);
+   
+
+Architectures that care about DMA_ATTR_FOO would check for its 
+presence in their implementations of the mapping and unmapping 
+routines, e.g.:
+
+void whizco_dma_map_sg_attrs(struct device *dev, dma_addr_t dma_addr, 
+size_t size, enum dma_data_direction dir, 
+struct dma_attrs* attrs)
+{
+   
+   int foo =  dma_get_attr(attrs, DMA_ATTR_FOO);
+   
+   if (foo) 
+   /* twizzle the frobnozzle */
+   
+
 
 Part II - Advanced dma_ usage
 -
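
A minimal userspace sketch of the fallback described in the documentation
patch above: with a NULL attrs pointer, or on an architecture that ignores
attributes, the _attrs call behaves like the plain one. All toy_ names, the
struct layout and the fake bus-address arithmetic are assumptions made for
illustration only; this models the calling convention and is not the kernel
implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute numbers; real ones would live in linux/dma-attrs.h. */
enum toy_dma_attr { TOY_DMA_ATTR_FOO = 0, TOY_DMA_ATTR_MAX };

/* Hypothetical layout: one bit per attribute. */
struct toy_dma_attrs { unsigned long flags; };

static void toy_dma_set_attr(struct toy_dma_attrs *attrs, enum toy_dma_attr a)
{
    assert(attrs != NULL);
    attrs->flags |= 1UL << a;
}

/* A NULL attrs pointer means "no attributes set". */
static int toy_dma_get_attr(const struct toy_dma_attrs *attrs,
                            enum toy_dma_attr a)
{
    return attrs != NULL && (attrs->flags & (1UL << a)) != 0;
}

/* Stand-in for dma_map_single(): pretend bus address == CPU address. */
static uintptr_t toy_map_single(void *cpu_addr, size_t size)
{
    (void)size;
    return (uintptr_t)cpu_addr;
}

/* Default implementation: attributes are accepted and ignored. */
static uintptr_t toy_map_single_attrs(void *cpu_addr, size_t size,
                                      const struct toy_dma_attrs *attrs)
{
    (void)attrs;
    return toy_map_single(cpu_addr, size);
}

/* An architecture that cares checks for the attribute and acts on it. */
static uintptr_t whizco_map_single_attrs(void *cpu_addr, size_t size,
                                         const struct toy_dma_attrs *attrs)
{
    uintptr_t addr = toy_map_single(cpu_addr, size);

    if (toy_dma_get_attr(attrs, TOY_DMA_ATTR_FOO))
        addr |= 1;  /* made-up marker standing in for arch-specific work */
    return addr;
}
```

Passing NULL to either mapping function gives the same result as
toy_map_single(), which is the property that lets the _attrs variants
replace the plain calls.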

Re: [WARNING -rc8] at fs/sysfs/dir.c:424 sysfs_add_one(), related with processor (ACPI)

2008-01-29 Thread Dave Young

On Jan 25, 2008 9:27 AM, Dave Young <[EMAIL PROTECTED]> wrote:
>
> On Jan 25, 2008 12:32 AM, Miguel Ojeda <[EMAIL PROTECTED]> wrote:
> >
> > On Jan 24, 2008 2:44 AM, Dave Young <[EMAIL PROTECTED]> wrote:
> > >
> > > On Wed, Jan 23, 2008 at 02:06:43PM -0800, Andrew Morton wrote:
> > > > > On Mon, 21 Jan 2008 18:53:18 +0100 "Miguel Ojeda" <[EMAIL PROTECTED]> 
> > > > > wrote:
> > > > > Booting 2.6.24-rc8 I get this:
> > > > >
> > > > >
> > > > > sysfs: duplicate filename 'fan' can not be created
> > > > > WARNING: at fs/sysfs/dir.c:424 sysfs_add_one()
> > > > > Pid: 819, comm: modprobe Not tainted 2.6.24-rc8 #2
> > > > >  [] sysfs_add_one+0x9f/0xe0
> > > > >  [] create_dir+0x48/0x90
> > > > >  [] sysfs_create_dir+0x29/0x50
> > > > >  [] kobject_get+0xf/0x20
> > > > >  [] kobject_add+0x8f/0x1b0
> > > > >  [] kobject_register+0x21/0x50
> > > > >  [] bus_add_driver+0x71/0x1e0
> > > > >  [] acpi_fan_init+0x2f/0x4d [fan]
> > > > >  [] sys_init_module+0x126/0x19b0
> > > > >  [] rb_insert_color+0xb7/0xe0
> > > > >  [] acpi_bus_register_driver+0x0/0x38
> > > > >  [] syscall_call+0x7/0xb
> > > > >  ===
> > > > > kobject_add failed for fan with -EEXIST, don't try to register things
> > > > > with the same name in the same directory.
> > > > > Pid: 819, comm: modprobe Not tainted 2.6.24-rc8 #2
> > > > >  [] kobject_add+0x111/0x1b0
> > > > >  [] kobject_register+0x21/0x50
> > > > >  [] bus_add_driver+0x71/0x1e0
> > > > >  [] acpi_fan_init+0x2f/0x4d [fan]
> > > > >  [] sys_init_module+0x126/0x19b0
> > > > >  [] rb_insert_color+0xb7/0xe0
> > > > >  [] acpi_bus_register_driver+0x0/0x38
> > > > >  [] syscall_call+0x7/0xb
> > > > >  ===
> > > > > processor: exports duplicate symbol acpi_processor_set_thermal_limit
> > > > > (owned by kernel)
> > > > >
> > >
> > > Could you apply the following debug patch and see the result?
> > >
> > >
> > > diff -upr linux/fs/sysfs/dir.c linux.new/fs/sysfs/dir.c
> > > --- linux/fs/sysfs/dir.c2008-01-23 09:56:24.0 +0800
> > > +++ linux.new/fs/sysfs/dir.c2008-01-23 09:59:12.0 +0800
> > > @@ -418,6 +418,8 @@ void sysfs_addrm_start(struct sysfs_addr
> > >   */
> > >  int sysfs_add_one(struct sysfs_addrm_cxt *acxt, struct sysfs_dirent *sd)
> > >  {
> > > +   if (!strcmp(sd->s_name, "fan"))
> > > +   dump_stack();
> > > if (sysfs_find_dirent(acxt->parent_sd, sd->s_name)) {
> > > printk(KERN_WARNING "sysfs: duplicate filename '%s' "
> > >"can not be created\n", sd->s_name);
> > >
> > >
> >
> > Done. The following appears in the new dmesg output.
> >
> >
> > ACPI: Power Button (CM) [PBTN]
> > input: Sleep Button (CM) as /class/input/input2
> > ACPI: Sleep Button (CM) [SBTN]
> > Pid: 1, comm: swapper Not tainted 2.6.24-rc8 #3
> >  [] sysfs_add_one+0x75/0x100
> >  [] sysfs_addrm_start+0x3f/0xb0
> >  [] create_dir+0x48/0x90
> >  [] sysfs_create_dir+0x29/0x50
> >  [] kobject_get+0xf/0x20
> >  [] kobject_add+0x8f/0x1b0
> >  [] kobject_register+0x21/0x50
> >  [] bus_add_driver+0x71/0x1e0
> >  [] acpi_fan_init+0x2f/0x4d
> >  [] kernel_init+0x121/0x300
> >  [] ret_from_fork+0x6/0x1c
> >  [] kernel_init+0x0/0x300
> >  [] kernel_init+0x0/0x300
> >  [] kernel_thread_helper+0x7/0x18
> >  ===
>
> I'm curious: "fan" is configured as built-in, so why is modprobe being called?

I guess your initrd or lib/modules needs updating.

>
>
> > ACPI: SSDT 3FE93134, 0244 (r1  PmRef  Cpu0Ist 3000 INTL 20050624)
> > ACPI: SSDT 3FE92ACA, 05E5 (r1  PmRef  Cpu0Cst 3001 INTL 20050624)
> > Monitor-Mwait will be used to enter C-1 state
> > Monitor-Mwait will be used to enter C-2 state
> >
> > Attached dmesg.txt
> >
> >
> > --
> > Miguel Ojeda
> > http://maxextreme.googlepages.com/index.htm
> >
>

Re: ndiswrapper and GPL-only symbols redux

2008-01-29 Thread Zan Lynx


Jon Masters wrote:

> I wouldn't quite say that. I wasn't going to comment, but...personally,
> I actually disagree with the assertions that ndiswrapper isn't causing
> proprietary code to link against GPL functions in the kernel (how is
> an NDIS implementation any different than a shim layer provided to
> load a graphics driver?), but I wasn't trying to make that point.


Well, as long as *any* part of the kernel ever links to proprietary 
code, then GPL functions link to it in exactly the same way ndiswrapper 
enables.  It's only a matter of how many steps of separation.


A perfectly GPL USB network driver linked to GPL-only functions feeds 
data into the kernel where it swirls about and emerges from a 
proprietary network filesystem driver, for example.


Re: [BUILD FAILURE]2.6.24-git6 build failure on sis190 ethernet driver

2008-01-29 Thread Sam Ravnborg

On Wed, Jan 30, 2008 at 09:11:36AM +0530, Kamalesh Babulal wrote:
> Hi,
> 
> The 2.6.24-git6 kernel build fails on various x86_64 machines with the build 
> failure
> 
> drivers/net/sis190.c:329: error: sis190_pci_tbl causes a section type conflict
> make[2]: *** [drivers/net/sis190.o] Error 1
> 
> # gcc --version (machine1)
> gcc (GCC) 4.1.1 20070105 (Red Hat 4.1.1-52)
> 
> # gcc --version (machine2)
> gcc (GCC) 4.1.1 20060525 (Red Hat 4.1.1-1)

Hi Kamalesh

I know another patch is circulating, but please try the following.
diff --git a/drivers/net/sis190.c b/drivers/net/sis190.c
index b570402..0a5e024 100644
--- a/drivers/net/sis190.c
+++ b/drivers/net/sis190.c
@@ -1556,7 +1556,7 @@ static int __devinit 
sis190_get_mac_addr_from_eeprom(struct pci_dev *pdev,
 static int __devinit sis190_get_mac_addr_from_apc(struct pci_dev *pdev,
   struct net_device *dev)
 {
-   static const u16 __devinitdata ids[] = { 0x0965, 0x0966, 0x0968 };
+   static const u16 __devinitconst ids[] = { 0x0965, 0x0966, 0x0968 };
struct sis190_private *tp = netdev_priv(dev);
struct pci_dev *isa_bridge;
u8 reg, tmp8;

It is the better fix if you can confirm it working.
The section conflict issued by gcc happens because we try to
mix const and non-const data in the same section.

Sam
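
A small sketch of the const/non-const split described above, assuming GCC
(or Clang) on an ELF target. The TOY_ macros and section names are made up
for illustration and merely stand in for __devinitdata and __devinitconst.

```c
#include <stddef.h>

/*
 * Hypothetical stand-ins for the kernel annotations. Writable and const
 * objects go to different sections so that each section has a single,
 * consistent set of flags.
 */
#define TOY_INITDATA  __attribute__((section(".toy.init.data")))
#define TOY_INITCONST __attribute__((section(".toy.init.rodata")))

/* Writable object: its section ends up writable. */
static unsigned int toy_probe_count TOY_INITDATA = 0;

/* Const object: placed in its own read-only section. */
static const unsigned short toy_ids[] TOY_INITCONST = {
    0x0965, 0x0966, 0x0968
};

/*
 * Marking toy_ids with TOY_INITDATA instead would ask the compiler to put
 * a const object into a section that already holds writable data, which
 * is what GCC reports as a "section type conflict".
 */

static int toy_id_known(unsigned short id)
{
    size_t i;

    toy_probe_count++;
    for (i = 0; i < sizeof(toy_ids) / sizeof(toy_ids[0]); i++)
        if (toy_ids[i] == id)
            return 1;
    return 0;
}
```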

Re: ndiswrapper and GPL-only symbols redux

2008-01-29 Thread Pavel Roskin

On Wed, 2008-01-30 at 05:07 +, Jon Masters wrote:
> *). Add a new taint?
> *). Move it later?
> 
> It's all trivial, but a policy should be established for the future.

I'd prefer a new taint.  It's less likely to break.  It provides more
information in the stack dumps.  It makes clear the difference between
ndiswrapper and driverloader.

Here's the patch:
---

Introduce a new taint flag for ndiswrapper

Although ndiswrapper loads proprietary code, it's under GPL itself.
Introduce a different taint flag for this case, so that ndiswrapper
retains access to GPL-only symbols.

Add comments to show the difference between driverloader and
ndiswrapper.

Signed-off-by: Pavel Roskin <[EMAIL PROTECTED]>
---

 include/linux/kernel.h |1 +
 kernel/module.c|5 -
 kernel/panic.c |2 ++
 3 files changed, 7 insertions(+), 1 deletions(-)


diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index a7283c9..861a6ae 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -240,6 +240,7 @@ extern enum system_states {
 #define TAINT_BAD_PAGE (1<<5)
 #define TAINT_USER (1<<6)
 #define TAINT_DIE  (1<<7)
+#define TAINT_BLOB_WRAPPER (1<<8)
 
 extern void dump_stack(void) __cold;
 
diff --git a/kernel/module.c b/kernel/module.c
index f6a4e72..a64380c 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1925,8 +1925,11 @@ static struct module *load_module(void __user *umod,
/* Set up license info based on the info section */
set_license(mod, get_modinfo(sechdrs, infoindex, "license"));
 
+   /* GPL, but may load proprietary code */
if (strcmp(mod->name, "ndiswrapper") == 0)
-   add_taint_module(mod, TAINT_PROPRIETARY_MODULE);
+   add_taint_module(mod, TAINT_BLOB_WRAPPER);
+
+   /* Wrongly claims to be under GPL */
if (strcmp(mod->name, "driverloader") == 0)
add_taint_module(mod, TAINT_PROPRIETARY_MODULE);
 
diff --git a/kernel/panic.c b/kernel/panic.c
index da4d6ba..b040812 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -152,6 +152,7 @@ EXPORT_SYMBOL(panic);
  *  'M' - System experienced a machine check exception.
  *  'B' - System has hit bad_page.
  *  'U' - Userspace-defined naughtiness.
+ *  'W' - Wrapper for untrusted binary blobs has been loaded.
  *
  * The string is overwritten by the next call to print_taint().
  */
@@ -162,6 +163,7 @@ const char *print_tainted(void)
if (tainted) {
snprintf(buf, sizeof(buf), "Tainted: %c%c%c%c%c%c%c%c",
tainted & TAINT_PROPRIETARY_MODULE ? 'P' : 'G',
+   tainted & TAINT_BLOB_WRAPPER ? 'W' : ' ',
tainted & TAINT_FORCED_MODULE ? 'F' : ' ',
tainted & TAINT_UNSAFE_SMP ? 'S' : ' ',
tainted & TAINT_FORCED_RMMOD ? 'R' : ' ',

-- 
Regards,
Pavel Roskin
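
For readers following the thread, here is a toy userspace model of what the
patch is meant to achieve. TAINT_BLOB_WRAPPER has the value from the patch
and TAINT_PROPRIETARY_MODULE is (1<<0) in kernel.h of that era; the toy_
names and the rule in toy_gpl_symbols_ok() are assumptions that paraphrase
the discussion and are not the code in kernel/module.c.

```c
#include <stdio.h>
#include <string.h>

#define TAINT_PROPRIETARY_MODULE (1 << 0)
#define TAINT_BLOB_WRAPPER       (1 << 8)

struct toy_module {
    const char *name;
    unsigned int taints;
};

static unsigned int toy_tainted;  /* stands in for the global taint mask */

static void toy_add_taint_module(struct toy_module *mod, unsigned int flag)
{
    toy_tainted |= flag;
    mod->taints |= flag;
}

/* Mirrors the name checks in the patched load_module() hunk. */
static void toy_apply_name_taints(struct toy_module *mod)
{
    if (strcmp(mod->name, "ndiswrapper") == 0)
        toy_add_taint_module(mod, TAINT_BLOB_WRAPPER);
    if (strcmp(mod->name, "driverloader") == 0)
        toy_add_taint_module(mod, TAINT_PROPRIETARY_MODULE);
}

/* Assumed rule: only the proprietary taint blocks GPL-only symbols. */
static int toy_gpl_symbols_ok(const struct toy_module *mod)
{
    return (mod->taints & TAINT_PROPRIETARY_MODULE) == 0;
}

/* Two-flag version of the taint string: one %c per argument. */
static const char *toy_print_tainted(char *buf, size_t len)
{
    snprintf(buf, len, "Tainted: %c%c",
             (toy_tainted & TAINT_PROPRIETARY_MODULE) ? 'P' : 'G',
             (toy_tainted & TAINT_BLOB_WRAPPER) ? 'W' : ' ');
    return buf;
}
```

In this model ndiswrapper still shows up in the taint string but keeps
access to GPL-only symbols, while driverloader does not.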

Re: Mostly revert "e1000/e1000e: Move PCI-Express device IDs over to e1000e"

2008-01-29 Thread Randy Dunlap

On Tue, 29 Jan 2008 23:59:37 GMT Linux Kernel Mailing List wrote:

> Gitweb: 
> http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5b10ca19ea4859d3884d10a3eb8495de92089792
> Commit: 5b10ca19ea4859d3884d10a3eb8495de92089792
> Parent: 9e97198dbf318be7958b57900d05b37c7e09ad7c
> Author: Linus Torvalds <[EMAIL PROTECTED]>
> AuthorDate: Wed Jan 30 09:54:54 2008 +1100
> Committer:  Linus Torvalds <[EMAIL PROTECTED]>
> CommitDate: Wed Jan 30 09:54:54 2008 +1100
> 
> Mostly revert "e1000/e1000e: Move PCI-Express device IDs over to e1000e"
> 
> The new e1000e driver is apparently not yet suitable for general use, so
> mark it experimental, and re-instate all the PCI-Express device IDs in
> the old and stable e1000 driver so that people (namely me) can continue
> to use a driver that actually works.
> 
> Auke & co have been apprised of the situation.
> 
> Cc: Auke Kok <[EMAIL PROTECTED]>
> Cc: Jeff Garzik <[EMAIL PROTECTED]>
> Cc: David Miller <[EMAIL PROTECTED]>
> Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>

Andrew was concerned about this when the driver was in -mm.
He asked for a patch that would set E1000E to the same value as E1000
and I supplied that.  Auke acked it IIRC.  Other people vetoed it.  :(


> ---
>  drivers/net/Kconfig|2 +-
>  drivers/net/e1000/e1000_main.c |   27 +++
>  2 files changed, 28 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index af40ff4..5a2d1dd 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -1992,7 +1992,7 @@ config E1000_DISABLE_PACKET_SPLIT
>  
>  config E1000E
>   tristate "Intel(R) PRO/1000 PCI-Express Gigabit Ethernet support"
> - depends on PCI
> + depends on PCI && EXPERIMENTAL
>   ---help---
> This driver supports the PCI-Express Intel(R) PRO/1000 gigabit
> ethernet family of adapters. For PCI or PCI-X e1000 adapters,
> diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
> index 7f5b2ae..3111af6 100644
> --- a/drivers/net/e1000/e1000_main.c
> +++ b/drivers/net/e1000/e1000_main.c
> @@ -73,6 +73,14 @@ static struct pci_device_id e1000_pci_tbl[] = {
>   INTEL_E1000_ETHERNET_DEVICE(0x1026),
>   INTEL_E1000_ETHERNET_DEVICE(0x1027),
>   INTEL_E1000_ETHERNET_DEVICE(0x1028),
> + INTEL_E1000_ETHERNET_DEVICE(0x1049),
> + INTEL_E1000_ETHERNET_DEVICE(0x104A),
> + INTEL_E1000_ETHERNET_DEVICE(0x104B),
> + INTEL_E1000_ETHERNET_DEVICE(0x104C),
> + INTEL_E1000_ETHERNET_DEVICE(0x104D),
> + INTEL_E1000_ETHERNET_DEVICE(0x105E),
> + INTEL_E1000_ETHERNET_DEVICE(0x105F),
> + INTEL_E1000_ETHERNET_DEVICE(0x1060),
>   INTEL_E1000_ETHERNET_DEVICE(0x1075),
>   INTEL_E1000_ETHERNET_DEVICE(0x1076),
>   INTEL_E1000_ETHERNET_DEVICE(0x1077),
> @@ -81,9 +89,28 @@ static struct pci_device_id e1000_pci_tbl[] = {
>   INTEL_E1000_ETHERNET_DEVICE(0x107A),
>   INTEL_E1000_ETHERNET_DEVICE(0x107B),
>   INTEL_E1000_ETHERNET_DEVICE(0x107C),
> + INTEL_E1000_ETHERNET_DEVICE(0x107D),
> + INTEL_E1000_ETHERNET_DEVICE(0x107E),
> + INTEL_E1000_ETHERNET_DEVICE(0x107F),
>   INTEL_E1000_ETHERNET_DEVICE(0x108A),
> + INTEL_E1000_ETHERNET_DEVICE(0x108B),
> + INTEL_E1000_ETHERNET_DEVICE(0x108C),
> + INTEL_E1000_ETHERNET_DEVICE(0x1096),
> + INTEL_E1000_ETHERNET_DEVICE(0x1098),
>   INTEL_E1000_ETHERNET_DEVICE(0x1099),
> + INTEL_E1000_ETHERNET_DEVICE(0x109A),
> + INTEL_E1000_ETHERNET_DEVICE(0x10A4),
> + INTEL_E1000_ETHERNET_DEVICE(0x10A5),
>   INTEL_E1000_ETHERNET_DEVICE(0x10B5),
> + INTEL_E1000_ETHERNET_DEVICE(0x10B9),
> + INTEL_E1000_ETHERNET_DEVICE(0x10BA),
> + INTEL_E1000_ETHERNET_DEVICE(0x10BB),
> + INTEL_E1000_ETHERNET_DEVICE(0x10BC),
> + INTEL_E1000_ETHERNET_DEVICE(0x10C4),
> + INTEL_E1000_ETHERNET_DEVICE(0x10C5),
> + INTEL_E1000_ETHERNET_DEVICE(0x10D5),
> + INTEL_E1000_ETHERNET_DEVICE(0x10D9),
> + INTEL_E1000_ETHERNET_DEVICE(0x10DA),
>   /* required last entry */
>   {0,}
>  };

---
~Randy

[RFC] Proportional bandwidth scheduling using anticipatory I/O scheduler on 2.6.24

2008-01-29 Thread Naveen Gupta

This patch creates channels in the anticipatory I/O scheduler for sharing
bandwidth in specified proportions. It uses the ioprio_(get/set) interface
to create the channels, and for now it uses the best-effort levels.

It is an initial attempt to get proportional b/w working in I/O
schedulers. One application is to assign a portion of the b/w on a
device to a specified container. The advantage of this approach over
putting an absolute restriction on the b/w of a container is that a
hard restriction may leave additional b/w unused when there is no
load from other containers. That is not to say that we cannot do I/O
limiting in schedulers.

The current patch works for read requests, and we may need more work for
congested queues and for limiting writes. The idea is to allow processes to
submit requests in a round-robin manner; if a channel exceeds its limit, it
waits until the other channel is done submitting its share. To keep a very
active channel from being unable to submit any request (because it has
exceeded its limit) when there is no I/O from the other channel, counters
are reset after a period of inactivity on idle channels. There is also a
loss of overall average bandwidth when using multiple classes. Some of it is
expected, because applications in multiple containers sharing a single
device behave differently, but a major portion of the loss comes from the
fact that we are still not using the scheduler optimally. Here is a
simple fio script to test this patch.


<-- snip fio.script -->
[global]
ioengine=sync
rw=read
direct=0
exitall

[file]
name=buffered1
directory=/tmp
bs=256k
size=1g
prio=0
prioclass=2

[file]
name=buffered3
directory=/tmp
bs=1m
size=1g
prio=3
prioclass=2
<-- end snip -->

Other interfaces are in /sys/block/[device]/queue/iosched/
1. priority_weights - Assign proportions to various channels.
   Note these proportions are now expressed in multiples of 1024*1024. I will
   work on getting these into exact proportions.
2. bandwidth_scheduling - writing 0 into this stops proportional scheduling.
3. bw_timer_expire - time period after which counters are reset.
   Writing a large value to it will give you more exact proportions, and
   small values increase overall average bandwidth. This is the time
   after which the b/w counters of the channels are reset due to inactivity.
   Some tuning of this variable may be needed to get the required results.
   I will work on making this transparent.
This patch has four channels by default.
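
The round-robin idea can be sketched in plain C as below. This is a toy
model under stated assumptions, not the code in the patch: the toy_ names
are hypothetical, requests are counted in abstract cost units, and the
counters are reset as soon as every backlogged channel has used its share,
whereas the patch resets them from a timer (bw_timer_expire).

```c
#include <stddef.h>

/* Hypothetical channel state, loosely following prio_q[] in the patch. */
struct toy_channel {
    unsigned long weight;    /* share allowed per round */
    unsigned long serviced;  /* amount dispatched in the current round */
    int pending;             /* nonzero if the channel has queued requests */
};

/* Start a new round: every channel gets its full share back. */
static void toy_reset_round(struct toy_channel *ch, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        ch[i].serviced = 0;
}

/* First backlogged channel after 'last' still under its share, or -1. */
static int toy_scan(const struct toy_channel *ch, size_t n, size_t last)
{
    size_t step;

    for (step = 1; step <= n; step++) {
        size_t i = (last + step) % n;

        if (ch[i].pending && ch[i].serviced < ch[i].weight)
            return (int)i;
    }
    return -1;
}

/*
 * Pick the channel to dispatch from and charge it 'cost' units.
 * If every backlogged channel has used up its share, reset the counters
 * and scan again, so a lone busy channel keeps making progress.
 * Returns -1 when nothing can be dispatched.
 */
static int toy_dispatch(struct toy_channel *ch, size_t n, size_t last,
                        unsigned long cost)
{
    int i = toy_scan(ch, n, last);

    if (i < 0) {
        toy_reset_round(ch, n);
        i = toy_scan(ch, n, last);
    }
    if (i >= 0)
        ch[i].serviced += cost;
    return i;
}
```

With two always-backlogged channels weighted 3 and 1 and unit-cost
requests, eight dispatches split 6 to 2 in this model.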

I would like initial feedback on what we expect, especially when it comes
to container groups. Is this something useful, or do we need hard limits
for various channels? What other things are expected? Would assigning
priorities be of any use, either absolute priorities or soft priorities
along with b/w limitations? I can add cgroup interfaces in another patch.

Signed-off-by: Naveen Gupta <[EMAIL PROTECTED]>

Index: linux-2.6.24/block/Kconfig.iosched
===
--- linux-2.6.24.orig/block/Kconfig.iosched 2008-01-24 14:58:37.0 -0800
+++ linux-2.6.24/block/Kconfig.iosched  2008-01-27 11:24:50.0 -0800
@@ -21,6 +21,13 @@ config IOSCHED_AS
  deadline I/O scheduler, it can also be slower in some cases
  especially some database loads.

+config IOPRIO_AS_MAX
+   int "Bandwidth channels in anticipatory I/O scheduler"
+   depends on IOSCHED_AS
+   default "4"
+   help
+ Number of valid b/w channels in anticipatory scheduler.
+
 config IOSCHED_DEADLINE
tristate "Deadline I/O scheduler"
default y
Index: linux-2.6.24/block/as-iosched.c
===
--- linux-2.6.24.orig/block/as-iosched.c 2008-01-24 14:58:37.0 -0800
+++ linux-2.6.24/block/as-iosched.c 2008-01-29 12:05:52.0 -0800
@@ -16,6 +16,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 

 #define REQ_SYNC   1
 #define REQ_ASYNC  0
@@ -63,6 +65,9 @@
  */
 #define MAX_THINKTIME (HZ/50UL)

+#define default_bandwidth_scheduling  (0)
+#define default_bw_timer_expire (16)   /* msecs */
+
 /* Bits in as_io_context.state */
 enum as_io_states {
AS_TASK_RUNNING=0,  /* Process has not exited */
@@ -89,10 +94,14 @@ struct as_data {
/*
 * requests (as_rq s) are present on both sort_list and fifo_list
 */
-   struct rb_root sort_list[2];
-   struct list_head fifo_list[2];
+   struct {
+   struct rb_root sort_list[2];
+   struct list_head fifo_list[2];
+   struct request *next_rq[2];
+   unsigned long ioprio_wt;
+   unsigned long serviced;
+   } prio_q[IOPRIO_AS_MAX];

-   struct request *next_rq[2]; /* next in sort order */
sector_t last_sector[2];/* last REQ_SYNC & REQ_ASYNC sectors */

unsigned long exit_prob;/

Re: [PATCH] add support for dynamic ticks and preempt rcu

2008-01-29 Thread Paul E. McKenney

On Tue, Jan 29, 2008 at 11:18:12AM -0500, Steven Rostedt wrote:
> 
> [
>  Paul, you had your Signed-off-by in the RT patch, so I attached it here
>   too
> ]

Works for me!!!

> The PREEMPT-RCU can get stuck if a CPU goes idle and NO_HZ is set. The
> idle CPU will not progress the RCU through its grace period and a
> synchronize_rcu may get stuck. Without this patch I have a box that will
> not boot when PREEMPT_RCU and NO_HZ are set. That same box boots fine with
> this patch.
> 
> Note: This patch came directly from the -rt patch where it has been tested
> for several months.

For those who attended my lightning talk yesterday on changing RCU to
"let sleeping CPUs lie", this is the patch.

If your architecture calls rcu_irq_enter() or irq_enter() upon
NMI/SMI/MC/whatever handler entry and also calls rcu_irq_exit() or
irq_exit() upon NMI/SMI/MC/whatever handler exit, you are covered.

Alternatively, if none of your architecture's NMI/SMI/MC/whatever
handlers ever invoke rcu_read_lock()/rcu_read_unlock() and friends,
you are also covered.

I believe that we are covered, but I cannot claim to fully understand
all 20+ architectures.  ;-)

Thanx, Paul

> Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
> Signed-off-by: Paul E. McKenney <[EMAIL PROTECTED]>
> ---
>  include/linux/hardirq.h|   10 ++
>  include/linux/rcuclassic.h |3
>  include/linux/rcupreempt.h |   22 
>  kernel/rcupreempt.c|  224 
> -
>  kernel/softirq.c   |1
>  kernel/time/tick-sched.c   |3
>  6 files changed, 259 insertions(+), 4 deletions(-)
> 
> Index: linux-compile.git/kernel/rcupreempt.c
> ===
> --- linux-compile.git.orig/kernel/rcupreempt.c 2008-01-29 11:03:21.0 -0500
> +++ linux-compile.git/kernel/rcupreempt.c 2008-01-29 11:10:08.0 -0500
> @@ -23,6 +23,10 @@
>   *   to Suparna Bhattacharya for pushing me completely away
>   *   from atomic instructions on the read side.
>   *
> + *  - Added handling of Dynamic Ticks
> + *  Copyright 2007 - Paul E. Mckenney <[EMAIL PROTECTED]>
> + * - Steven Rostedt <[EMAIL PROTECTED]>
> + *
>   * Papers:  http://www.rdrop.com/users/paulmck/RCU
>   *
>   * Design Document: http://lwn.net/Articles/253651/
> @@ -409,6 +413,212 @@ static void __rcu_advance_callbacks(stru
>   }
>  }
> 
> +#ifdef CONFIG_NO_HZ
> +
> +DEFINE_PER_CPU(long, dynticks_progress_counter) = 1;
> +static DEFINE_PER_CPU(long, rcu_dyntick_snapshot);
> +static DEFINE_PER_CPU(int, rcu_update_flag);
> +
> +/**
> + * rcu_irq_enter - Called from Hard irq handlers and NMI/SMI.
> + *
> + * If the CPU was idle with dynamic ticks active, this updates the
> + * dynticks_progress_counter to let the RCU handling know that the
> + * CPU is active.
> + */
> +void rcu_irq_enter(void)
> +{
> + int cpu = smp_processor_id();
> +
> + if (per_cpu(rcu_update_flag, cpu))
> + per_cpu(rcu_update_flag, cpu)++;
> +
> + /*
> +  * Only update if we are coming from a stopped ticks mode
> +  * (dynticks_progress_counter is even).
> +  */
> + if (!in_interrupt() &&
> + (per_cpu(dynticks_progress_counter, cpu) & 0x1) == 0) {
> + /*
> +  * The following might seem like we could have a race
> +  * with NMI/SMIs. But this really isn't a problem.
> +  * Here we do a read/modify/write, and the race happens
> +  * when an NMI/SMI comes in after the read and before
> +  * the write. But NMI/SMIs will increment this counter
> +  * twice before returning, so the zero bit will not
> +  * be corrupted by the NMI/SMI which is the most important
> +  * part.
> +  *
> +  * The only thing is that we would bring back the counter
> +  * to a position that it was in during the NMI/SMI.
> +  * But the zero bit would be set, so the rest of the
> +  * counter would again be ignored.
> +  *
> +  * On return from the IRQ, the counter may have the zero
> +  * bit be 0 and the counter the same as the return from
> +  * the NMI/SMI. If the state machine was so unlucky to
> +  * see that, it still doesn't matter, since all
> +  * RCU read-side critical sections on this CPU would
> +  * have already completed.
> +  */
> + per_cpu(dynticks_progress_counter, cpu)++;
> + /*
> +  * The following memory barrier ensures that any
> +  * rcu_read_lock() primitives in the irq handler
> +  * are seen by other CPUs to follow the above
> +  * increment to dynticks_progress_counter. This is
> +  * required in order for other CPUs to correctly
> +
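
The parity trick in the quoted patch can be modeled in a few lines of
userspace C. This is a simplified single-CPU sketch with hypothetical toy_
names: it leaves out nested interrupts, NMIs and memory barriers, and the
check in toy_slept_since() is only the simplest case of what a grace-period
state machine might test.

```c
/* Toy single-CPU model; names are hypothetical, not the patch's code. */
struct toy_cpu {
    long progress;     /* odd: CPU active, even: ticks stopped (idle) */
    int irq_woke_cpu;  /* set while an irq is running on an idle CPU */
};

static void toy_cpu_init(struct toy_cpu *c)
{
    c->progress = 1;   /* starts active, like the per-CPU initializer */
    c->irq_woke_cpu = 0;
}

/* Entering and leaving dynticks idle each flip the low bit. */
static void toy_enter_nohz(struct toy_cpu *c) { c->progress++; }
static void toy_exit_nohz(struct toy_cpu *c)  { c->progress++; }

/* An irq on an idle CPU makes the counter odd for its duration. */
static void toy_irq_enter(struct toy_cpu *c)
{
    if ((c->progress & 1) == 0) {
        c->progress++;
        c->irq_woke_cpu = 1;
    }
}

static void toy_irq_exit(struct toy_cpu *c)
{
    if (c->irq_woke_cpu) {
        c->irq_woke_cpu = 0;
        c->progress++;
    }
}

/*
 * Grace-period side: if the counter is even and unchanged since the
 * snapshot, the CPU slept the whole time and need not be waited for.
 */
static int toy_slept_since(const struct toy_cpu *c, long snapshot)
{
    return c->progress == snapshot && (c->progress & 1) == 0;
}
```

An interrupt taken while idle moves the counter from even to odd and back
to a new even value, so a later comparison against the old snapshot shows
that the CPU did not sleep through the whole interval.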

Re: [PATCH powerpc] Fake NUMA emulation for PowerPC (Take 3)

2008-01-29 Thread Balbir Singh

* Michael Ellerman <[EMAIL PROTECTED]> [2008-01-30 00:04:58]:

> Why do you check !p after assigning to nid? I assume it's because we
> might have reached the end of the command line, ie. p == NULL, but we're
> still adding memory to the last node? If so it's a little subtle
> and deserves a comment I think.
>

The reason that we check for !p after assigning node id is that, in
case we create fake NUMA nodes, we want nid to be the fake numa node
id and not the real node id or in the non NUMA case, node id 0.

The if (!p) checks to see if we do have more arguments to parse.
 
> Otherwise this looks pretty good.
> 

Thanks!

> cheers
> 
> -- 
> Michael Ellerman
> OzLabs, IBM Australia Development Lab
> 
> wwweb: http://michael.ellerman.id.au
> phone: +61 2 6212 1183 (tie line 70 21183)
> 
> We do not inherit the earth from our ancestors,
> we borrow it from our children. - S.M.A.R.T Person



-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

Change In sk_buff structure in 2.6.22 kernel

2008-01-29 Thread PV Juliet

Hi All,


The header fields in the sk_buff structure have been renamed and are
no longer unions.

Networking code and drivers are supposed to use skb->transport_header,
skb->network_header, and skb->mac_header.
But when I try to access the TCP fields using the code
struct tcphdr *tcp = skb->transport_header;
tcp->   //accessing proper field
the value is not read properly.
Can anyone please help me?


Thanks in advance
Regards
Juliet
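
One likely cause, sketched here as a userspace model: on 64-bit builds of
these kernels the header fields can be offsets from skb->head rather than
pointers, so they are meant to be read through accessors such as
skb_transport_header() and tcp_hdr() instead of being assigned directly.
The toy_ types and functions below are stand-ins invented for illustration;
they are not the kernel definitions, and byte order handling (ntohs() and
friends) is left out.

```c
#include <stdint.h>

/* Toy stand-ins; the real definitions live in linux/skbuff.h, linux/tcp.h. */
struct toy_tcphdr {
    uint16_t source;
    uint16_t dest;
};

struct toy_skb {
    unsigned char *head;            /* start of the packet buffer */
    unsigned int transport_header;  /* offset from head, not a pointer */
};

/* Modeled on skb_transport_header(): turn the offset into a pointer. */
static unsigned char *toy_skb_transport_header(const struct toy_skb *skb)
{
    return skb->head + skb->transport_header;
}

/* Modeled on tcp_hdr(): typed view of the transport header. */
static struct toy_tcphdr *toy_tcp_hdr(const struct toy_skb *skb)
{
    return (struct toy_tcphdr *)toy_skb_transport_header(skb);
}
```

In kernel code the equivalent would be along the lines of
"struct tcphdr *tcp = tcp_hdr(skb);", assuming the transport header has
already been set for that skb.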

Re: kernel BUG at ide-cd.c:1726 in 2.6.24-03863-g0ba6c33 && -g8561b089

2008-01-29 Thread Borislav Petkov

On Wed, Jan 30, 2008 at 06:03:47AM +0100, Borislav Petkov wrote:
> On Wed, Jan 30, 2008 at 12:58:33AM +0100, Bartlomiej Zolnierkiewicz wrote:
> > 
> > Hi,
> > 
> > On Wednesday 30 January 2008, Kiyoshi Ueda wrote:
> > > Hi Bart, 
> > > 
> > > On Tue, 29 Jan 2008 14:22:53 -0800, Roland Dreier wrote:
> > > > Hi, I saw the same BUG from ide-cd on one of my systems.  I applied
> > > > the debugging patch to replace the BUG with blk_dump_rq_flags(), and I
> > > > got the output below (full boot log and .config attached to this
> > > > email).
> > > > 
> > > > Please let me know if there's anything else that would help debug the
> > > > problem.
> > > 
> > > Thank you for the information, Roland.
> > > 
> > >  
> > > > [4.072271] Uniform CD-ROM driver Revision: 3.20
> > > > [4.098236] ide-cd: rq still having bio: dev hda: type=2, flags=114c8
> > > > [4.100269]
> > > > [4.100269] sector 1949759, nr/cnr 0/0
> > > > [4.100269] bio 8102418cc600, biotail 8102418cc600, buffer 
> > > > , d8
> > > > [4.100269] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
> > > > [4.101005] ide-cd: rq still having bio: dev hda: type=2, flags=114c8
> > > > [4.104269]
> > > > [4.104269] sector 1949759, nr/cnr 0/0
> > > > [4.104269] bio 8102418cc600, biotail 8102418cc600, buffer 
> > > > , d2
> > > > [4.104269] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
> > > > [4.109203] ide-cd: rq still having bio: dev hda: type=2, flags=114c8
> > > > [4.112270]
> > > > [4.112270] sector 1949759, nr/cnr 0/0
> > > > [4.112270] bio 8102418cc600, biotail 8102418cc600, buffer 
> > > > , d8
> > > > [4.112270] cdb: 12 01 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
> > > > [4.112945] ide-cd: rq still having bio: dev hda: type=2, flags=114c8
> > > > [4.116270]
> > > > [4.116270] sector 1949759, nr/cnr 0/0
> > > > [4.116270] bio 8102418cc600, biotail 8102418cc600, buffer 
> > > > , d2
> > > > [4.116270] cdb: 12 01 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
> > > 
> > > Bart,
> > > This means that the rq still has a bio even after DRQ_STAT is cleared.
> > > The original ide-cd code was calling only end_that_request_last() there.
> > > So I thought that the rq should have no bio when DRQ_STAT is cleared,
> > > otherwise the bio leaks.
> > > 
> > > Was my understanding wrong and is that correct behavior in ide-cd?
> > 
> > Added Borislav to cc:.
> > 
> > PS I'm extremely busy with "real-life" (unfortunately IDE hacking is not
> > my paid job) and the friday is the earliest date on which I would be able
> > to look in detail into this problem and other outstanding IDE stuff, sorry.
> 
> Same here, will be able to look into it tomorrow. In the meantime, can someone
> direct me to the full BUG() output?

Never mind. Got it, thanks.

-- 
Regards/Gruß,
Boris.

Re: [PATCH retry] bluetooth : add conn add/del workqueues to avoid connection fail

2008-01-29 Thread David Miller

From: Dave Young <[EMAIL PROTECTED]>
Date: Wed, 30 Jan 2008 10:23:54 +0800

> 
> The bluetooth hci_conn sysfs add/del executed in the default workqueue.
> If the del_conn is executed after the new add_conn with same target,
> add_conn will fail with a warning of "same kobject name".
> 
> Here add btaddconn & btdelconn workqueues,
> flush the btdelconn workqueue in the add_conn function to avoid the issue.
> 
> Signed-off-by: Dave Young <[EMAIL PROTECTED]> 

This looks good, applied, thanks Dave.

I've queued this up for 2.6.25 merging, if you want me to
schedule it for -stable, just let me know.

Thanks again.

Re: ndiswrapper and GPL-only symbols redux

2008-01-29 Thread Jon Masters

On Wed, Jan 30, 2008 at 04:24:50AM +0100, Andi Kleen wrote:
> Pavel Roskin <[EMAIL PROTECTED]> writes:
> >
> > static inline void add_taint_module(struct module *mod, unsigned flag)
> > {
> > add_taint(flag);
> > mod->taints |= flag;
> > }
> >
> > The module taint is set before the symbols are resolved.  Therefore, the
> > GPL-only symbols won't be resolved.
> 
> I think using a separate taint flag that does not disable GPL symbols
> for the ndiswrapper case would be a fair solution. After all the main
> motivation for tainting ndiswrapper is to make it visible in oopses, but not 
> prevent it from loading in the first place.
> 
> How about you submit an incremental patch to do that?

I'll happily submit a patch to do whatever is wanted, and add comments
(I'm also debugging several module issues right now, so I have a
good opportunity to look over some of the code).

But do we want to:

*). Add a new taint?
*). Move it later?

It's all trivial, but a policy should be established for the future.
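
A minimal userspace sketch of the two options, for comparison. The flag
names, the struct and the helper functions are hypothetical stand-ins, not
the kernel's definitions; the only rule modelled is the one described in
this thread, namely that a module already carrying the proprietary taint
does not get GPL-only symbols resolved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch only. */
#define MODEL_TAINT_PROPRIETARY (1u << 0) /* blocks GPL-only symbols */
#define MODEL_TAINT_WRAPPER     (1u << 1) /* assumed new flag: visible in oopses only */

struct module_model {
	unsigned int taints;
};

/* The rule from the thread: a proprietary-tainted module gets no GPL-only symbols. */
static bool can_resolve_gpl_only(const struct module_model *mod)
{
	assert(mod != NULL);
	return (mod->taints & MODEL_TAINT_PROPRIETARY) == 0;
}

/* Behaviour under discussion: taint first, then resolve symbols. */
static bool load_with_early_taint(struct module_model *mod)
{
	mod->taints |= MODEL_TAINT_PROPRIETARY;
	return can_resolve_gpl_only(mod);
}

/* Option 1: a separate taint that marks the module but does not block resolution. */
static bool load_with_new_taint(struct module_model *mod)
{
	mod->taints |= MODEL_TAINT_WRAPPER;
	return can_resolve_gpl_only(mod);
}

/* Option 2: keep the existing taint, but apply it after symbol resolution. */
static bool load_with_late_taint(struct module_model *mod)
{
	bool resolved = can_resolve_gpl_only(mod);

	mod->taints |= MODEL_TAINT_PROPRIETARY;
	return resolved;
}
```

In this model both options let the module load; they differ only in which
flag ends up recorded, which is the policy question.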

Jon.

Re: kernel BUG at ide-cd.c:1726 in 2.6.24-03863-g0ba6c33 && -g8561b089

2008-01-29 Thread Borislav Petkov

On Wed, Jan 30, 2008 at 12:58:33AM +0100, Bartlomiej Zolnierkiewicz wrote:
> 
> Hi,
> 
> On Wednesday 30 January 2008, Kiyoshi Ueda wrote:
> > Hi Bart, 
> > 
> > On Tue, 29 Jan 2008 14:22:53 -0800, Roland Dreier wrote:
> > > Hi, I saw the same BUG from ide-cd on one of my systems.  I applied
> > > the debugging patch to replace the BUG with blk_dump_rq_flags(), and I
> > > got the output below (full boot log and .config attached to this
> > > email).
> > > 
> > > Please let me know if there's anything else that would help debug the
> > > problem.
> > 
> > Thank you for the information, Roland.
> > 
> >  
> > > [4.072271] Uniform CD-ROM driver Revision: 3.20
> > > [4.098236] ide-cd: rq still having bio: dev hda: type=2, flags=114c8
> > > [4.100269]
> > > [4.100269] sector 1949759, nr/cnr 0/0
> > > [4.100269] bio 8102418cc600, biotail 8102418cc600, buffer 
> > > , d8
> > > [4.100269] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
> > > [4.101005] ide-cd: rq still having bio: dev hda: type=2, flags=114c8
> > > [4.104269]
> > > [4.104269] sector 1949759, nr/cnr 0/0
> > > [4.104269] bio 8102418cc600, biotail 8102418cc600, buffer 
> > > , d2
> > > [4.104269] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
> > > [4.109203] ide-cd: rq still having bio: dev hda: type=2, flags=114c8
> > > [4.112270]
> > > [4.112270] sector 1949759, nr/cnr 0/0
> > > [4.112270] bio 8102418cc600, biotail 8102418cc600, buffer 
> > > , d8
> > > [4.112270] cdb: 12 01 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
> > > [4.112945] ide-cd: rq still having bio: dev hda: type=2, flags=114c8
> > > [4.116270]
> > > [4.116270] sector 1949759, nr/cnr 0/0
> > > [4.116270] bio 8102418cc600, biotail 8102418cc600, buffer 
> > > , d2
> > > [4.116270] cdb: 12 01 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
> > 
> > Bart,
> > This means that the rq still has a bio even after DRQ_STAT is cleared.
> > The original ide-cd code was calling only end_that_request_last() there.
> > So I thought that the rq should have no bio when DRQ_STAT is cleared,
> > otherwise the bio leaks.
> > 
> > Was my understanding wrong and is that correct behavior in ide-cd?
> 
> Added Borislav to cc:.
> 
> PS I'm extremely busy with "real-life" (unfortunately IDE hacking is not
> my paid job) and the friday is the earliest date on which I would be able
> to look in detail into this problem and other outstanding IDE stuff, sorry.

Same here, will be able to look into it tomorrow. In the meantime, can someone
direct me to the full BUG() output?
-- 
Regards/Gruß,
Boris.

Re: ndiswrapper and GPL-only symbols redux

2008-01-29 Thread Jon Masters

On Tue, Jan 29, 2008 at 08:48:21PM -0500, Pavel Roskin wrote:
> On Tue, 2008-01-29 at 19:20 -0500, Jon Masters wrote:
> 
> > Yes it is. But I thought the existing code was intending to taint the
> > kernel (that's what it does), so it would really help to identify why it
> > tainted the kernel, by calling add_taint_module instead of add_taint. I
> > didn't put the existing match in there...don't shoot the messenger :)
> 
> So, it's the same thing as in year 2006.  Good intentions, unexpected
> side effects, and a long discussion.

I wouldn't quite say that. I wasn't going to comment, but...personally,
I actually disagree with the assertions that ndiswrapper isn't causing
proprietary code to link against GPL functions in the kernel (how is
an NDIS implementation any different than a shim layer provided to
load a graphics driver?), but I wasn't trying to make that point.

Rusty - shall we just move the taint to post symbol resolution?

Jon.

Re: [PATCH] fix tasklist + find_pid() with CONFIG_PREEMPT_RCU

2008-01-29 Thread Paul E. McKenney

On Tue, Jan 29, 2008 at 08:24:17PM -0700, Eric W. Biederman wrote:
> Oleg Nesterov <[EMAIL PROTECTED]> writes:
> 
> > With CONFIG_PREEMPT_RCU read_lock(tasklist_lock) doesn't imply 
> > rcu_read_lock(),
> > but find_pid_ns()->hlist_for_each_entry_rcu() should be safe under tasklist.
> >
> > Usually it is, detach_pid() is always called under 
> > write_lock(tasklist_lock),
> > but copy_process() calls free_pid() lockless.
> >
> > "#ifdef CONFIG_PREEMPT_RCU" is added mostly as documentation, perhaps it is
> > too ugly and should be removed.
> >
> > Signed-off-by: Oleg Nesterov <[EMAIL PROTECTED]>
> >
> > --- MM/kernel/fork.c~PR_RCU 2008-01-27 17:09:47.0 +0300
> > +++ MM/kernel/fork.c2008-01-29 19:23:44.0 +0300
> > @@ -1335,8 +1335,19 @@ static struct task_struct *copy_process(
> > return p;
> >  
> >  bad_fork_free_pid:
> > -   if (pid != &init_struct_pid)
> > +   if (pid != &init_struct_pid) {
> > +#ifdef CONFIG_PREEMPT_RCU
> > +   /*
> > +* read_lock(tasklist_lock) doesn't imply rcu_read_lock(),
> > +* make sure find_pid() is safe under read_lock(tasklist).
> > +*/
> > +   write_lock_irq(&tasklist_lock);
> > +#endif
> > free_pid(pid);
> > +#ifdef CONFIG_PREEMPT_RCU
> > +   write_unlock_irq(&tasklist_lock);
> > +#endif
> > +   }
> >  bad_fork_cleanup_namespaces:
> > exit_task_namespaces(p);
> >  bad_fork_cleanup_keys:
> 
> Ok. I believe I see what problem you are trying to fix.  That
> a pid returned from find_pid might disappear if we are not rcu
> protected.
> 
> This patch in the simplest form is wrong because it is confusing.
> 
> We currently appear to have two options.
> 1) Force all pid hash table access and pid accesses that
>do not get a count to be covered under rcu_read_lock.
> 2) To modify the locking requirements for free_pid to require
>the tasklist_lock.
> 
>However this second approach is horribly brittle, as it
>will break if we ever have intermediate entries in the
>hash table protected by pidmap_lock.
> 
> Using the tasklist_lock to still guarantee we see the list, the entire
> list, and exactly the list for proper implementation of kill to
> process groups and sessions still seems sane.
> 
> So let's just remove the guarantee of find_pid being usable with
> just the tasklist_lock held.

Makes sense to me -- it is totally permissible to hold rcu_read_lock()
across update code.  ;-)

Thanx, Paul

> Eric
> 
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index e29a900..0ffb8cc 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -100,8 +100,7 @@ struct pid_namespace;
>  extern struct pid_namespace init_pid_ns;
> 
>  /*
> - * look up a PID in the hash table. Must be called with the tasklist_lock
> - * or rcu_read_lock() held.
> + * look up a PID in the hash table. Must be called with the rcu_read_lock() held.
>   *
>   * find_pid_ns() finds the pid in the namespace specified
>   * find_pid() find the pid by its global id, i.e. in the init namespace

Re: [PATCH] fix tasklist + find_pid() with CONFIG_PREEMPT_RCU

2008-01-29 Thread Paul E. McKenney

On Tue, Jan 29, 2008 at 07:16:50PM -0700, Eric W. Biederman wrote:
> Andrew Morton <[EMAIL PROTECTED]> writes:
> 
> > On Tue, 29 Jan 2008 19:40:19 +0300
> > Oleg Nesterov <[EMAIL PROTECTED]> wrote:
> >
> >> With CONFIG_PREEMPT_RCU read_lock(tasklist_lock) doesn't imply
> >> rcu_read_lock(),
> >> but find_pid_ns()->hlist_for_each_entry_rcu() should be safe under 
> >> tasklist.
> >> 
> >> Usually it is, detach_pid() is always called under 
> >> write_lock(tasklist_lock),
> >> but copy_process() calls free_pid() lockless.
> >> 
> >> "#ifdef CONFIG_PREEMPT_RCU" is added mostly as documentation, perhaps it is
> >> too ugly and should be removed.
> >> 
> >> Signed-off-by: Oleg Nesterov <[EMAIL PROTECTED]>
> >> 
> >> --- MM/kernel/fork.c~PR_RCU2008-01-27 17:09:47.0 +0300
> >> +++ MM/kernel/fork.c   2008-01-29 19:23:44.0 +0300
> >> @@ -1335,8 +1335,19 @@ static struct task_struct *copy_process(
> >>return p;
> >>  
> >>  bad_fork_free_pid:
> >> -  if (pid != &init_struct_pid)
> >> +  if (pid != &init_struct_pid) {
> >> +#ifdef CONFIG_PREEMPT_RCU
> >> +  /*
> >> +   * read_lock(tasklist_lock) doesn't imply rcu_read_lock(),
> >> +   * make sure find_pid() is safe under read_lock(tasklist).
> >> +   */
> >> +  write_lock_irq(&tasklist_lock);
> >> +#endif
> >>free_pid(pid);
> >> +#ifdef CONFIG_PREEMPT_RCU
> >> +  write_unlock_irq(&tasklist_lock);
> >> +#endif
> >> +  }
> >>  bad_fork_cleanup_namespaces:
> >>exit_task_namespaces(p);
> >>  bad_fork_cleanup_keys:
> >
> > My attempt to understand this change timed out.
> >
> > kernel/pid.c is full of global but undocumented functions.  What are the
> > locking requirements for free_pid()?  free_pid_ns()?  If it's just
> > caller-must-hold-rcu_read_lock() then why not use rcu_read_lock() here?
> >
> > If the locking is "caller must hold write_lock_irq(tasklist_lock) then the
> > sole relevant comment in there (in free_pid()) is wrong.
> >
> > Guys, more maintainable code please?
> 
> Well I took a quick look.
> 
> Yeah this looks complex.
> Mutation of the hash table is protected by pidmap_lock.
> But attachments of tasks to hash entries is protected by task_lock.
> 
> And it looks like it has been that way since commit 
> 92476d7fc0326a409ab1d3864a04093a6be9aca7
> 
> I thought free_pid did not have any requirements that a lock be held when
> it was called, taking all of the needed locks.
> 
> Now how read_lock doesn't imply rcu_read_lock is another question.

Although read_lock() does accidentally imply rcu_read_lock() for
Classic RCU, it no longer does so for preemptible RCU.
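
A small userspace model of that difference, as a sketch only. The struct and
the model_* helpers are stand-ins invented for illustration; the assumption
being modelled is that a Classic RCU reader is any region with preemption
disabled (which holding read_lock() provides as a side effect), while a
preemptible RCU reader is tracked by its own nesting count that read_lock()
never touches.

```c
#include <assert.h>
#include <stdbool.h>

enum rcu_flavor { RCU_CLASSIC, RCU_PREEMPTIBLE };

struct task_model {
	int preempt_count; /* > 0 while preemption is disabled */
	int rcu_nesting;   /* > 0 inside an explicit preemptible-RCU reader */
};

/* Taking a reader lock disables preemption, nothing more. */
static void model_read_lock(struct task_model *t)
{
	t->preempt_count++;
}

static void model_read_unlock(struct task_model *t)
{
	assert(t->preempt_count > 0);
	t->preempt_count--;
}

/* Classic: disable preemption. Preemptible: bump a per-task nesting count. */
static void model_rcu_read_lock(struct task_model *t, enum rcu_flavor f)
{
	if (f == RCU_CLASSIC)
		t->preempt_count++;
	else
		t->rcu_nesting++;
}

static void model_rcu_read_unlock(struct task_model *t, enum rcu_flavor f)
{
	if (f == RCU_CLASSIC) {
		assert(t->preempt_count > 0);
		t->preempt_count--;
	} else {
		assert(t->rcu_nesting > 0);
		t->rcu_nesting--;
	}
}

/* Would a grace period wait for this task under the given flavor? */
static bool in_rcu_read_side(const struct task_model *t, enum rcu_flavor f)
{
	return f == RCU_CLASSIC ? t->preempt_count > 0 : t->rcu_nesting > 0;
}
```

With only model_read_lock() held, in_rcu_read_side() is true for the classic
flavor and false for the preemptible one, which is the gap being discussed.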

But I thought that we had found these -- must have missed some...

Thanx, Paul

> Anyway I have to run.  I will see about looking at this in a bit.
> 
> Eric

Re: [BUILD FAILURE]2.6.24-git6 build failure on sis190 ethernet driver

2008-01-29 Thread Gabriel C

Kamalesh Babulal wrote:
> Hi,
> 
> The 2.6.24-git6 kernel build fails on various x86_64 machines with the build 
> failure
> 
> drivers/net/sis190.c:329: error: sis190_pci_tbl causes a section type conflict
> make[2]: *** [drivers/net/sis190.o] Error 1
> 
> # gcc --version (machine1)
> gcc (GCC) 4.1.1 20070105 (Red Hat 4.1.1-52)
> 
> # gcc --version (machine2)
> gcc (GCC) 4.1.1 20060525 (Red Hat 4.1.1-1)
> 

Heh :) vger.kernel.org does not like emails directly from gmail, it seems =)

(sorry for sending this three times now)

The following patch should fix the build failure.

diff --git a/drivers/net/sis190.c b/drivers/net/sis190.c
index b570402..e48e4ad 100644
--- a/drivers/net/sis190.c
+++ b/drivers/net/sis190.c
@@ -326,7 +326,7 @@ static const struct {
{ "SiS 191 PCI Gigabit Ethernet adapter" },
 };
 
-static struct pci_device_id sis190_pci_tbl[] __devinitdata = {
+static const struct pci_device_id sis190_pci_tbl[] __devinitdata = {
{ PCI_DEVICE(PCI_VENDOR_ID_SI, 0x0190), 0, 0, 0 },
{ PCI_DEVICE(PCI_VENDOR_ID_SI, 0x0191), 0, 0, 1 },
{ 0, },


Gabriel

Re: [PATCH 0/3][RFC] x86: Catch stray non-kprobe breakpoints

2008-01-29 Thread Ananth N Mavinakayanahalli

On Tue, Jan 29, 2008 at 02:29:41PM -0500, Masami Hiramatsu wrote:
> Abhishek Sagar wrote:
> > On 1/29/08, Masami Hiramatsu <[EMAIL PROTECTED]> wrote:
> >> In that case, why don't you just reduce the priority of 
> >> kprobe_exceptions_nb?
> >> Then, the execution path becomes very simple.
> > 
> > Ananth mentioned that the kprobe notifier has to be the first to run.
> 
> (Hmm.. I think he has just explained current implementation:))
> IMHO, since kprobes itself can not know what the external debugger
> wants to do, the highest priority should be reserved for those external tools.

The reason why kprobes needs to be the first to run is simple: it
doesn't need user intervention and if it isn't the intended recipient of
the breakpoint, it just lets the kernel take over (unlike a debugger,
which would potentially need user attention). Also, if the underlying
instruction itself is a breakpoint, we have the facility in kprobes to
single-step inline so the kernel can take control and notify any other
intended recipient of the underlying breakpoint.
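
To illustrate the ordering argument, here is a userspace sketch of a
priority-ordered handler chain. All names (notifier_model, chain_register,
chain_dispatch, PROBED_ADDR and the two handlers) are made up for this
example, and the assumption is simply that higher-priority handlers run
first and that a handler which does not own the breakpoint passes it on.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

enum verdict { PASS_ON, CLAIMED };
enum owner { OWNER_NONE, OWNER_KPROBES, OWNER_DEBUGGER };

#define PROBED_ADDR 0x1000UL /* hypothetical address with a probe installed */

struct notifier_model {
	int priority;
	enum owner who;
	enum verdict (*call)(unsigned long addr);
	struct notifier_model *next;
};

/* Claims only breakpoints it planted; everything else is passed on. */
static enum verdict kprobes_handler(unsigned long addr)
{
	return addr == PROBED_ADDR ? CLAIMED : PASS_ON;
}

/* A debugger-like handler that stops on any breakpoint it gets to see. */
static enum verdict debugger_handler(unsigned long addr)
{
	(void)addr;
	return CLAIMED;
}

/* Insert so that the list stays sorted by descending priority. */
static void chain_register(struct notifier_model **head, struct notifier_model *n)
{
	assert(head != NULL && n != NULL);
	while (*head && (*head)->priority >= n->priority)
		head = &(*head)->next;
	n->next = *head;
	*head = n;
}

/* Walk the chain in priority order until some handler claims the event. */
static enum owner chain_dispatch(const struct notifier_model *head, unsigned long addr)
{
	for (; head; head = head->next)
		if (head->call(addr) == CLAIMED)
			return head->who;
	return OWNER_NONE;
}
```

With the kprobes-like handler registered at INT_MAX and the debugger-like
one at 0, a hit at PROBED_ADDR is consumed silently and any other breakpoint
still reaches the debugger. Swapping the priorities would make the debugger
stop on every probe hit.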

As such, I believe the current situation is fine, has worked fine for
close to 4 years now and doesn't warrant any change.

Ananth

Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in

2008-01-29 Thread Matthew Wilcox

On Tue, Jan 29, 2008 at 05:19:55AM -0800, Greg KH wrote:
> On Mon, Jan 28, 2008 at 08:18:04PM -0700, Matthew Wilcox wrote:
> > I'm more optimistic because we've so severely restricted the use of
> > mmconf after these patches that it's unlikely to cause problems.  I also
> > hear Vista is now using mmconf, so fewer implementations are going to
> > be buggy at this point.
> 
> Hahahaha, oh, that's a good one...

Thanks Greg.  What happened to "Can't we all try to get along"?

> But what about the thousands of implementations out there that are
> buggy?
> 
> I'm with Arjan here, I'm very skeptical.

Maybe I'm insufficiently imaginative.  Can you come up with a plausible
way in which the two patches I posted will succumb to bugs?  After those
patches we only use mmconf if:

 1. conf1 has failed to work
OR
 2. user has compiled their own kernel without support for conf1
OR
 3. kernel probes config space 0x100 to see if it can access extended
config space (requires the device to be PCIe or PCI-X2)
OR
 4. root attempts to lspci - or lspci -v
OR
 5. device driver tries to access extended config space

With Arjan's patch, I believe only case 3 changes.  In cases 4 and 5,
either lspci or the device driver will jump through the hoop to enable
access to extended config space.
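
Boiled down, the five cases above amount to a two-input decision. The sketch
below is a simplification for illustration, not the code from either patch:
uses_mmconf() and its parameters are hypothetical, cases 1 and 2 are folded
into "conf1 is not available", and cases 3 to 5 into "the register offset is
0x100 or above".

```c
#include <assert.h>
#include <stdbool.h>

#define CONF1_LIMIT   0x100u  /* conf1 reaches only the first 256 bytes */
#define EXT_CFG_SIZE  0x1000u /* extended config space is 4096 bytes per function */

/*
 * conf1_available: false when conf1 failed to work or was compiled out.
 * reg: config space offset being accessed.
 */
static bool uses_mmconf(bool conf1_available, unsigned int reg)
{
	assert(reg < EXT_CFG_SIZE);
	if (!conf1_available)
		return true;           /* cases 1 and 2 */
	return reg >= CONF1_LIMIT;     /* cases 3, 4 and 5 */
}
```

Ordinary accesses below 0x100 on a machine with working conf1 never touch
mmconf in this model.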

> Matthew, with Arjan's patch, is anything that currently works now
> broken?  Why do you feel it is somehow "wrong"?

lspci is broken.  It used to be able to access extended config space, and
now can't unless it is patched to know about the sysfs flag to enable it.

If you're determined to implement something to disable extended config
space by default, it can be done in a much better way than Arjan's patch
-- less code (both source and object).

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

Re: at91sam9260 wakeup on serial port

2008-01-29 Thread David Brownell

On Monday 28 January 2008, Haavard Skinnemoen wrote:
> 
> > What will AVR32 (AP7) need to do, when it supports system sleep states?
> 
> Not sure. The PIOs seem to require a clock in order to detect a pin
> change, so I don't think we can enter very deep sleep states if we want
> to be woken up by the USART.

Right; if no DMA is pending, then the HSB matrix clock can be idled, DRAM put
into self-refresh, and most peripherals can issue wakeups ... AP7 "Frozen"
state, very analogous to AT91 "standby" on Linux.  UARTs and GPIOs can wake.

Deeper sleep states -- "standby" with clocks running, "stop" with all
except 32K (and RTC) off, "static" with no clocks at all -- can only
wake from WAKE_N and external interrupts; and RTC except in "static".
I suspect "stop" and "static" might want to use the on-chip SRAMs so
they don't need to change DRAM timings while they fiddle with clocks.

The closest analogue to the AT91 support would map /sys/power/state:

standby --> to AP7 "Frozen"
mem --> to AP7 "Stop"

Except that there could be no GPIO wakeups from "mem" ... so the $SUBJECT
patch wouldn't be useful on AVR32 (just AT91), unless USARTn.RXD is wired
up to one of those special wake-capable pins (extremely board-specific).
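
As a rough sketch of that mapping and of which wake sources survive in each
state: the enum names and both functions below are invented for illustration
and are not existing AVR32 code. The deeper-state entries follow the
description above; giving "Frozen" the full set of sources is an assumption
based on "most peripherals can issue wakeups".

```c
#include <assert.h>
#include <string.h>

enum wake_src {
	WAKE_USART  = 1 << 0,
	WAKE_GPIO   = 1 << 1,
	WAKE_EXTINT = 1 << 2,
	WAKE_WAKE_N = 1 << 3,
	WAKE_RTC    = 1 << 4,
};

enum ap7_state { AP7_FROZEN, AP7_STANDBY, AP7_STOP, AP7_STATIC, AP7_INVALID };

/* Which sources can wake the chip from a given AP7 sleep state. */
static unsigned int ap7_wake_sources(enum ap7_state s)
{
	switch (s) {
	case AP7_FROZEN: /* assumed: everything listed can wake */
		return WAKE_USART | WAKE_GPIO | WAKE_EXTINT | WAKE_WAKE_N | WAKE_RTC;
	case AP7_STANDBY:
	case AP7_STOP:
		return WAKE_EXTINT | WAKE_WAKE_N | WAKE_RTC;
	case AP7_STATIC: /* no clocks at all, so no RTC either */
		return WAKE_EXTINT | WAKE_WAKE_N;
	default:
		return 0;
	}
}

/* Map a /sys/power/state keyword to the proposed AP7 state. */
static enum ap7_state map_sys_power_state(const char *name)
{
	assert(name != NULL);
	if (strcmp(name, "standby") == 0)
		return AP7_FROZEN;
	if (strcmp(name, "mem") == 0)
		return AP7_STOP;
	return AP7_INVALID;
}
```

In this table WAKE_USART is present for "standby" and absent for "mem",
which is why serial wakeup would need board-specific wiring there.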

> There's a separate WAKE_N pin that is completely asynchronous, so with
> some external logic, we can probably wake up the CPU all the way from
> Static mode if a given input state is present. But that's definitely
> "board specific" territory, and starting the oscillators take a _long_
> time on the AP7000 (especially the 32 kHz, but then again, it barely
> consumes any power, so we might as well keep it running and keep the
> RTC going as well.)

I'd think the support of any "deeper" state for "mem" sleep would not
be entirely board specific ... when the RTC alarm is set, any board
should be able to use states other than "static".  But otherwise, no
board could enter those states unless WAKE_N or an external IRQ are
doing something useful (like being hooked up to a button).

Matching those few "deep wake" events to a given device would imply
board-specific glue code.

> So on AP7000, I think we'll just need to keep clocking the USART and
> let it generate the interrupt that wakes up the rest of the system.

For "standby" sleep state, yes -- map to at most AVR32 "Frozen" state.
That'd be a good first step for PM support.

- Dave

[BUILD FAILURE]2.6.24-git6 build failure on sis190 ethernet driver

2008-01-29 Thread Kamalesh Babulal

Hi,

The 2.6.24-git6 kernel build fails on various x86_64 machines with the build 
failure

drivers/net/sis190.c:329: error: sis190_pci_tbl causes a section type conflict
make[2]: *** [drivers/net/sis190.o] Error 1

# gcc --version (machine1)
gcc (GCC) 4.1.1 20070105 (Red Hat 4.1.1-52)

# gcc --version (machine2)
gcc (GCC) 4.1.1 20060525 (Red Hat 4.1.1-1)

-- 
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.

Re: [Xen-devel] dm-band: The I/O bandwidth controller: Performance Report

2008-01-29 Thread Ryo Tsuruta

Hi,

> you mean that you run 128 processes on each user-device pairs?  Namely,
> I guess that
> 
>   user1: 128 processes on sdb5,
>   user2: 128 processes on sdb5,
>   another: 128 processes on sdb5,
>   user2: 128 processes on sdb6.

"User-device pairs" means "band groups", right?
What I actually did is the following:

  user1: 128 processes on sdb5,
  user2: 128 processes on sdb5,
  user3: 128 processes on sdb5,
  user4: 128 processes on sdb6.

> The second preliminary studies might be:
> - What if you use a different I/O size on each device (or device-user pair)?
> - What if you use a different number of processes on each device (or
> device-user pair)?

There are other ideas for controlling bandwidth, such as limiting
bytes-per-sec or latency. I think it is possible to implement them if
a lot of people really need them. I feel there isn't a single
correct answer for this issue. Good ideas on how it should work,
and patches for it, are also welcome.

> And my impression is that it's natural dm-band is in device-mapper,
> separated from I/O scheduler.  Because bandwidth control and I/O
> scheduling are two different things, it may be simpler that they are
> implemented in different layers.

I would like to know how dm-band works in various configurations on
various types of hardware. I'll try running dm-band with other
configurations. Any reports or impressions of dm-band on your machines
are also welcome.

Thanks,
Ryo Tsuruta

Re: [PATCH 24/27] NFS: Use local caching [try #2]

2008-01-29 Thread David Howells

Chuck Lever <[EMAIL PROTECTED]> wrote:

> This patch really ought to be broken into more manageable atomic
> changes to make it easier to review, and to provide more fine-grained
> explanation and rationalization for each specific change via
> individual patch descriptions.

Hmmm  I broke the patch up as Trond stipulated - at least, I thought I
had.

In many ways this request doesn't make sense.  You can't do NFS caching
without all the appropriate bits, so logically they should be one patch.
Breaking it up won't help git-bisect since the option to enable all this is
the last (or nearly last) patch.

However, I can do it (when I get back from LCA next week).

> This should no longer be necessary.  The latest mount.nfs subcommand
> from nfs-utils supports text-based mounts when running on kernels
> 2.6.23 and later.

Okay.  I'll update my patches to reflect this.  Note, however, I've got
someone reporting a bug that seems to show otherwise.  I'll have to
investigate this more next week.

> I hope you intend to provide updates to nfs(5) that describe the new
> mount options you introduce in this and later patches.  You don't
> mention it, but I assume that "nofsc" is the default behavior.

I should make SteveD do that, the options were his idea :-)  But I'll deal with
it.

> Add comments like this in a separate clean up patch.

> A suggestion: fs/nfs/fsc-index.c might be a better name.

If you wish, though I'd prefer to use a name that isn't likely to clash with a
name that's going to appear in fs/fscache/ (or include/linux/). I'd really
like to rename fs/nfs/fscache.h, as dealing with two fscache.h's is annoying.

> > +struct nfs_fh_auxdata {
> > +   struct timespec i_mtime;
> > +   struct timespec i_ctime;
> > +   loff_t  i_size;
> > +};
> 
> It might be useful to explain here why you need to supplement the
> mtime, ctime, and size fields that already exist in an NFS inode.

Supplement?  I don't understand.

> > +   key->port = clp->cl_addr.sin_port;
> 
> Not sure why you are using the server's port here.  In almost every
> case the server side port number will be 2049, so it really doesn't
> add any uniquification.

The reason lies in "almost every case".  It's possible to configure it
such that a server is running two separate NFS servers on different ports.
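
A tiny sketch of why the port belongs in the key. The struct and the
comparison helper are hypothetical and only loosely modelled on the key
being built in the patch; the point is that two servers on one address
compare as different only if the port takes part in the comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache index key for one NFS server. */
struct server_key_model {
	uint32_t addr; /* IPv4 address of the server */
	uint16_t port; /* server-side port, usually 2049 */
};

/* Two keys name the same cache only if address and port both match. */
static bool same_server(const struct server_key_model *a,
			const struct server_key_model *b)
{
	assert(a != NULL && b != NULL);
	return a->addr == b->addr && a->port == b->port;
}
```

Two NFS servers on the same host at ports 2049 and 2050 then get separate
cache entries instead of sharing one.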

> If you're going for the client side port number, that changes after
> every connection, so it would be useless to identify a cache after a
> reboot (or even after the connection idles out!).

I'm going for the server side port number.  Using the client side port number
would be silly.

> I strongly recommend you use the existing IPv6 address conversion
> macros for this instead of open-coding yet another way of mapping an
> IPv4 address to an IPv6 address.
> 
> However, since AF_INET6 support is being introduced in the NFS client
> in 2.6.24, I recommend you take a look at these source files after
> Trond has pushed his NFS_ALL for 2.6.24.

I'll look at them.

> See below: the NFS cache-related stats should be added to nfs_iostats.

I believe I asked Trond, but I'll check.

I've got to move, so I'll deal with the rest of your email later.

David

Re: [PATCH] fix tasklist + find_pid() with CONFIG_PREEMPT_RCU

2008-01-29 Thread Eric W. Biederman

Oleg Nesterov <[EMAIL PROTECTED]> writes:

> With CONFIG_PREEMPT_RCU read_lock(tasklist_lock) doesn't imply 
> rcu_read_lock(),
> but find_pid_ns()->hlist_for_each_entry_rcu() should be safe under tasklist.
>
> Usually it is, detach_pid() is always called under write_lock(tasklist_lock),
> but copy_process() calls free_pid() lockless.
>
> "#ifdef CONFIG_PREEMPT_RCU" is added mostly as documentation, perhaps it is
> too ugly and should be removed.
>
> Signed-off-by: Oleg Nesterov <[EMAIL PROTECTED]>
>
> --- MM/kernel/fork.c~PR_RCU   2008-01-27 17:09:47.0 +0300
> +++ MM/kernel/fork.c  2008-01-29 19:23:44.0 +0300
> @@ -1335,8 +1335,19 @@ static struct task_struct *copy_process(
>   return p;
>  
>  bad_fork_free_pid:
> - if (pid != &init_struct_pid)
> + if (pid != &init_struct_pid) {
> +#ifdef CONFIG_PREEMPT_RCU
> + /*
> +  * read_lock(tasklist_lock) doesn't imply rcu_read_lock(),
> +  * make sure find_pid() is safe under read_lock(tasklist).
> +  */
> + write_lock_irq(&tasklist_lock);
> +#endif
>   free_pid(pid);
> +#ifdef CONFIG_PREEMPT_RCU
> + write_unlock_irq(&tasklist_lock);
> +#endif
> + }
>  bad_fork_cleanup_namespaces:
>   exit_task_namespaces(p);
>  bad_fork_cleanup_keys:

Ok. I believe I see what problem you are trying to fix.  That
a pid returned from find_pid might disappear if we are not rcu
protected.

This patch in the simplest form is wrong because it is confusing.

We currently appear to have two options.
1) Force all pid hash table access and pid accesses that
   do not get a count to be covered under rcu_read_lock.
2) To modify the locking requirements for free_pid to require
   the tasklist_lock.

   However this second approach is horribly brittle, as it
   will break if we ever have intermediate entries in the
   hash table protected by pidmap_lock.

Using the tasklist_lock to still guarantee we see the list, the entire
list, and exactly the list for proper implementation of kill to
process groups and sessions still seems sane.

So let's just remove the guarantee of find_pid being usable with
just the tasklist_lock held.
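
What option 1 means for a caller, as a userspace sketch. Everything here is
a stand-in made up for illustration (the pid_model table, the model_*
helpers and lookup_and_hold); the lookup asserts that it runs inside a
reader section, and the caller takes a count before leaving the section if
it wants to keep using the result.

```c
#include <assert.h>
#include <stddef.h>

struct pid_model {
	int nr;
	int count; /* reference count */
};

/* Stand-in for the pid hash table. */
static struct pid_model pid_table[] = { { 1, 1 }, { 42, 1 } };

static int rcu_depth; /* models the reader nesting depth */

static void model_rcu_read_lock(void)
{
	rcu_depth++;
}

static void model_rcu_read_unlock(void)
{
	assert(rcu_depth > 0);
	rcu_depth--;
}

/* The proposed rule: lookups are only legal inside a reader section. */
static struct pid_model *model_find_pid(int nr)
{
	size_t i;

	assert(rcu_depth > 0);
	for (i = 0; i < sizeof(pid_table) / sizeof(pid_table[0]); i++)
		if (pid_table[i].nr == nr)
			return &pid_table[i];
	return NULL;
}

/* Look up a pid and take a count so it stays valid after the section ends. */
static struct pid_model *lookup_and_hold(int nr)
{
	struct pid_model *pid;

	model_rcu_read_lock();
	pid = model_find_pid(nr);
	if (pid)
		pid->count++;
	model_rcu_read_unlock();
	return pid;
}
```

A caller that does not take a count would have to finish using the result
before model_rcu_read_unlock().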

Eric

diff --git a/include/linux/pid.h b/include/linux/pid.h
index e29a900..0ffb8cc 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -100,8 +100,7 @@ struct pid_namespace;
 extern struct pid_namespace init_pid_ns;

 /*
- * look up a PID in the hash table. Must be called with the tasklist_lock
- * or rcu_read_lock() held.
+ * look up a PID in the hash table. Must be called with the rcu_read_lock() held.
  *
  * find_pid_ns() finds the pid in the namespace specified
  * find_pid() find the pid by its global id, i.e. in the init namespace

Re: [patch 05/19] split LRU lists into anon & file sets

2008-01-29 Thread KOSAKI Motohiro

Hi Rik, Lee

I tested new hackbench on rvr split LRU patch.
   http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

method of test

(1) $ ./hackbench 150 process 1000
(2) # sync; echo 3 > /proc/sys/vm/drop_caches
$ dd if=tmp10G of=/dev/null
$ ./hackbench 150 process 1000

test machine
CPU: Itanium2 x4 (logical 8cpu)
MEM: 8GB

A. vanilla 2.6.24-rc8-mm1
(1) 127.540
(2) 727.548

B. 2.6.24-rc8-mm1 + split-lru-patch-series
(1) 92.730
(2) 758.369

   comment:
(1) the active/inactive anon ratio improves performance significantly.
(2) incorrect page activation reduces performance.


I investigated and found that the cause is the [05/19] change.
I tested again with a small portion of the split-lru-patch-series reverted.

C. 2.6.24-rc8-mm1 + split-lru-patch-series + my-revert-patch
(1) 83.014
(2) 717.009
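
For reading the numbers above as relative changes against vanilla (A), a
small helper is enough. This is only a convenience sketch; the function name
is made up, and it assumes that lower hackbench figures are better.

```c
#include <assert.h>

/* Relative change of a measurement against a baseline, in percent. */
static double relative_change_pct(double baseline, double measured)
{
	assert(baseline != 0.0);
	return (measured - baseline) / baseline * 100.0;
}
```

Applied to the figures above, B comes out roughly 27% lower than A in test
(1) and roughly 4% higher in test (2), while C is roughly 35% lower in (1)
and a little over 1% lower in (2).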


Of course, we need to reintroduce this portion once the new page LRU
(aka LRU for used only page) is in place,
but now is too early.

I hope this patch series is merged into -mm ASAP,
so I would like to remove any corner-case regressions.

Thanks!


- kosaki



Signed-off-by: KOSAKI Motohiro <[EMAIL PROTECTED]>

---
 mm/vmscan.c |   26 +-
 1 file changed, 25 insertions(+), 1 deletion(-)

Index: b/mm/vmscan.c
===
--- a/mm/vmscan.c   2008-01-29 15:59:17.0 +0900
+++ b/mm/vmscan.c   2008-01-30 11:53:42.0 +0900
@@ -247,6 +247,27 @@
 	return ret;
 }
 
+/* Called without lock on whether page is mapped, so answer is unstable */
+static inline int page_mapping_inuse(struct page *page)
+{
+	struct address_space *mapping;
+
+	/* Page is in somebody's page tables. */
+	if (page_mapped(page))
+		return 1;
+
+	/* Be more reluctant to reclaim swapcache than pagecache */
+	if (PageSwapCache(page))
+		return 1;
+
+	mapping = page_mapping(page);
+	if (!mapping)
+		return 0;
+
+	/* File is mmap'd by somebody? */
+	return mapping_mapped(mapping);
+}
+
 static inline int is_page_cache_freeable(struct page *page)
 {
 	return page_count(page) - !!PagePrivate(page) == 2;
@@ -515,7 +536,8 @@
 
 		referenced = page_referenced(page, 1, sc->mem_cgroup);
 		/* In active use or really unfreeable?  Activate it. */
-		if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
+		if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
+		    referenced && page_mapping_inuse(page))
 			goto activate_locked;
 
 #ifdef CONFIG_SWAP
@@ -550,6 +572,8 @@
 		}
 
 		if (PageDirty(page)) {
+			if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
+				goto keep_locked;
 			if (!may_enter_fs) {
 				sc->nr_io_pages++;
 				goto keep_locked;






[PATCH 2/2] x86_64: make bootmap_start page align v3

2008-01-29 Thread Yinghai Lu


The kernel oopses at boot when the system has 64G or 128G of RAM installed:

Calling initcall 0x80bc33b6: sctp_init+0x0/0x711()
BUG: unable to handle kernel NULL pointer dereference at 005f
IP: [] proc_register+0xe7/0x10f
PGD 0
Oops:  [1] SMP
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.24-smp-g5a514e21-dirty #6
RIP: 0010:[]  [] proc_register+0xe7/0x10f
RSP: :810824c57e60  EFLAGS: 00010246
RAX: d7d7 RBX: 811024c5fa80 RCX: 810824c57e08
RDX:  RSI: 0195 RDI: 80cc2460
RBP:  R08:  R09: 811024c5fa80
R10:  R11: 0002 R12: 810824c57e6c
R13:  R14: 810824c57ee0 R15: 0006abd25bee
FS:  () GS:80b4d000() knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 005f CR3: 00201000 CR4: 06e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process swapper (pid: 1, threadinfo 810824c56000, task 812024c52000)
Stack:  80a57348 0195 811024c5fa80 
 ff97 802bfef0  
  80bc3b4b 810824c57ee0 80bc34a5
Call Trace:
 [] ? create_proc_entry+0x73/0x8a
 [] ? sctp_snmp_proc_init+0x1c/0x34
 [] ? sctp_init+0xef/0x711
 [] ? kernel_init+0x175/0x2e1
 [] ? child_rip+0xa/0x12
 [] ? kernel_init+0x0/0x2e1
 [] ? child_rip+0x0/0x12


Code: 1e 48 83 7b 38 00 75 08 48 c7 43 38 f0 e8 82 80 48 83 7b 30 00 75 08 48 
c7 43 30 d0 e9 82 80 48 c7 c7 60 24 cc 80 e8 bd 5a 54 00 <48> 8b 45 60 48 89 6b 
58 48 89 5d 60 48 89 43 50 fe 05 f5 25 a0
RIP  [] proc_register+0xe7/0x10f
 RSP 
CR2: 005f
---[ end trace 02c2d78def82877a ]---
Kernel panic - not syncing: Attempted to kill init!

It turns out that some variables near the end of bss are already corrupted.

In System.map we have
80d40420 b rsi_table
80d40620 B krb5_seq_lock
80d40628 b i.20437
80d40630 b xprt_rdma_inline_write_padding
80d40638 b sunrpc_table_header
80d40640 b zero
80d40644 b min_memreg
80d40648 b rpcrdma_tk_lock_g
80d40650 B sctp_assocs_id_lock
80d40658 B proc_net_sctp
80d40660 B sctp_assocs_id
80d40680 B sysctl_sctp_mem
80d40690 B sysctl_sctp_rmem
80d406a0 B sysctl_sctp_wmem
80d406b0 b sctp_ctl_socket
80d406b8 b sctp_pf_inet6_specific
80d406c0 b sctp_pf_inet_specific
80d406c8 b sctp_af_v4_specific
80d406d0 b sctp_af_v6_specific
80d406d8 b sctp_rand.33270
80d406dc b sctp_memory_pressure
80d406e0 b sctp_sockets_allocated
80d406e4 b sctp_memory_allocated
80d406e8 b sctp_sysctl_header
80d406f0 b zero
80d406f4 A __bss_stop
80d406f4 A _end

and setup_node_bootmem() will use that page 0xd4 for the bootmap:
Bootmem setup node 0 -00082800
  NODE_DATA [0008a485 - 00091484]
  bootmap [00d406f4 -  00e456f3] pages 105
Bootmem setup node 1 00082800-00102800
  NODE_DATA [00082800 - 000828006fff]
  bootmap [000828007000 -  000828106fff] pages 100
Bootmem setup node 2 00102800-00182800
  NODE_DATA [00102800 - 001028006fff]
  bootmap [001028007000 -  001028106fff] pages 100
Bootmem setup node 3 00182800-00202800
  NODE_DATA [00182800 - 001828006fff]
  bootmap [001828007000 -  001828106fff] pages 100

The patch rounds bootmap_start up to a page boundary so that the bootmap
does not overlap the bss section.
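
To see why the alignment matters, here is a small userspace sketch using the node 0 bootmap address from the log above (0xd406f4, which is also where __bss_stop sits in System.map). It assumes 4 KiB pages; round_up_pow2 and pfn_of are names made up for this illustration and are not the kernel helpers.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Round x up to the next multiple of align (align must be a power of two). */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* Page frame number that contains the physical address. */
static uint64_t pfn_of(uint64_t phys)
{
	return phys >> SKETCH_PAGE_SHIFT;
}
```

Without the rounding, pfn_of(0xd406f4) is 0xd40, the same page frame that still holds the tail of bss. After rounding up, the bootmap starts at 0xd41000, in the next page frame.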

Signed-off-by: Yinghai Lu <[EMAIL PROTECTED]>

Index: linux-2.6/arch/x86/mm/numa_64.c
===
--- linux-2.6.orig/arch/x86/mm/numa_64.c
+++ linux-2.6/arch/x86/mm/numa_64.c
@@ -224,6 +224,9 @@ void __init setup_node_bootmem(int nodei
 	}
 	bootmap_start = __pa(bootmap);
 
+	/* make sure bootmap is not overlapped with bss section */
+	bootmap_start = round_up(bootmap_start, PAGE_SIZE);
+
 	bootmap_size = init_bootmem_node(NODE_DATA(nodeid),
 					 bootmap_start >> PAGE_SHIFT,
 					 start_pfn, end_pfn);

Re: ndiswrapper and GPL-only symbols redux

2008-01-29 Thread Andi Kleen

Pavel Roskin <[EMAIL PROTECTED]> writes:
>
> static inline void add_taint_module(struct module *mod, unsigned flag)
> {
> add_taint(flag);
> mod->taints |= flag;
> }
>
> The module taint is set before the symbols are resolved.  Therefore, the
> GPL-only symbols won't be resolved.

I think using a separate taint flag that does not disable GPL symbols
for the ndiswrapper case would be a fair solution. After all the main
motivation for tainting ndiswrapper is to make it visible in oopses, but not 
prevent it from loading in the first place.
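
A rough sketch of that idea, in plain userspace C: the taint bits are kept per module as in the quoted add_taint_module(), but only one of them is consulted when deciding whether GPL-only symbols may be resolved. Both flag names, the struct and may_resolve_gpl_only() are hypothetical and only illustrate the proposal; they are not existing kernel definitions.

```c
#define TAINT_PROPRIETARY_SKETCH (1u << 0) /* hypothetical: blocks GPL-only symbols */
#define TAINT_WRAPPER_SKETCH     (1u << 1) /* hypothetical: only shows up in oopses */

struct module_sketch {
	unsigned taints;
};

static unsigned global_taint;	/* stands in for the kernel-wide taint state */

static void add_taint_module(struct module_sketch *mod, unsigned flag)
{
	global_taint |= flag;	/* stands in for add_taint(flag) */
	mod->taints |= flag;
}

/* Only the proprietary bit disables GPL-only symbol resolution. */
static int may_resolve_gpl_only(const struct module_sketch *mod)
{
	return !(mod->taints & TAINT_PROPRIETARY_SKETCH);
}
```

With this split, a module tainted with the wrapper flag is still marked in the global taint state but keeps access to GPL-only symbols.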

How about you submit an incremental patch to do that?

-Andi

[PATCH 18/22 -v7] Trace irq disabled critical timings

2008-01-29 Thread Steven Rostedt

This patch adds latency tracing for critical timings
(how long interrupts are disabled for).

 "irqsoff" is added to /debugfs/tracing/available_tracers

Note:
  tracing_max_latency
    also holds the max latency for irqsoff (in usecs).
    (default to large number so one must start latency tracing)

  tracing_thresh
    threshold (in usecs) to always print out if irqs off
    is detected to be longer than stated here.
    If irq_thresh is non-zero, then max_irq_latency
    is ignored.
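
The interaction between the two knobs can be sketched as follows. This is only an illustration of the rule described in the note, not the code from the patch; the struct, the function name and the strict "greater than" comparisons are assumptions.

```c
struct latency_state {
	unsigned long thresh_us;	/* 0 means: no threshold, track the maximum */
	unsigned long max_us;		/* largest latency recorded so far */
};

/* Returns 1 if a measured irqs-off period should be recorded. */
static int should_report(struct latency_state *s, unsigned long delta_us)
{
	/* A non-zero threshold takes over and the maximum is ignored. */
	if (s->thresh_us)
		return delta_us > s->thresh_us;

	/* Otherwise only a new maximum is recorded. */
	if (delta_us > s->max_us) {
		s->max_us = delta_us;
		return 1;
	}
	return 0;
}
```

Starting with a very large max_us means nothing is recorded until the user lowers it, which matches the "one must start latency tracing" remark.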

Here's an example of a trace with mcount_enabled = 0

===
preemption latency trace v1.1.5 on 2.6.24-rc7

 latency: 100 us, #3/3, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------
 => started at: _spin_lock_irqsave+0x2a/0xb7
 => ended at:   _spin_unlock_irqrestore+0x32/0x5f

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
 swapper-0     1d.s3    0us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000])
 swapper-0     1d.s3  100us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000])
 swapper-0     1d.s3  100us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f)


vim:ft=help
===


And this is a trace with mcount_enabled == 1


===
preemption latency trace v1.1.5 on 2.6.24-rc7

 latency: 102 us, #12/12, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------
 => started at: _spin_lock_irqsave+0x2a/0xb7
 => ended at:   _spin_unlock_irqrestore+0x32/0x5f

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
 swapper-0     1dNs3    0us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000])
 swapper-0     1dNs3   46us : e1000_read_phy_reg+0x16/0x225 [e1000] (e1000_update_stats+0x5e2/0x64c [e1000])
 swapper-0     1dNs3   46us : e1000_swfw_sync_acquire+0x10/0x99 [e1000] (e1000_read_phy_reg+0x49/0x225 [e1000])
 swapper-0     1dNs3   46us : e1000_get_hw_eeprom_semaphore+0x12/0xa6 [e1000] (e1000_swfw_sync_acquire+0x36/0x99 [e1000])
 swapper-0     1dNs3   47us : __const_udelay+0x9/0x47 (e1000_read_phy_reg+0x116/0x225 [e1000])
 swapper-0     1dNs3   47us+: __delay+0x9/0x50 (__const_udelay+0x45/0x47)
 swapper-0     1dNs3   97us : preempt_schedule+0xc/0x84 (__delay+0x4e/0x50)
 swapper-0     1dNs3   98us : e1000_swfw_sync_release+0xc/0x55 [e1000] (e1000_read_phy_reg+0x211/0x225 [e1000])
 swapper-0     1dNs3   99us+: e1000_put_hw_eeprom_semaphore+0x9/0x35 [e1000] (e1000_swfw_sync_release+0x50/0x55 [e1000])
 swapper-0     1dNs3  101us : _spin_unlock_irqrestore+0xe/0x5f (e1000_update_stats+0x641/0x64c [e1000])
 swapper-0     1dNs3  102us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000])
 swapper-0     1dNs3  102us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f)


vim:ft=help
===


Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/x86/kernel/process_64.c  |3 
 arch/x86/lib/thunk_64.S   |   18 +
 include/asm-x86/irqflags_32.h |4 
 include/asm-x86/irqflags_64.h |4 
 include/linux/irqflags.h  |   37 ++-
 include/linux/mcount.h|   31 ++-
 kernel/fork.c |2 
 kernel/lockdep.c  |   16 +
 lib/tracing/Kconfig   |   18 +
 lib/tracing/Makefile  |1 
 lib/tracing/trace_irqsoff.c   |  415 ++
 lib/tracing/tracer.c  |   59 -
 lib/tracing/tracer.h  |2 
 13 files changed, 575 insertions(+), 35 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/process_64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/process_64.c  2008-01-29 
18:06:20.0 -0500
+++ linux-mcount.git/arch/x86/kernel/process_64.c   2008-01-29 
18:10:56.0 -0500
@@ -233,7 +233,10 @@ void cpu_idle (void)
 */
local_irq_disable();
enter_idle();
+   /* Don't trace irqs off for idle */
+   stop_critical_timings();

[PATCH 06/22 -v7] handle accurate time keeping over long delays

2008-01-29 Thread Steven Rostedt

Handle accurate time even if there's a long delay between
accumulated clock cycles.
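
The arithmetic this patch relies on can be sketched in userspace as below. The masked subtraction and the added cycle_accumulated term follow the hunks in this mail; the struct, the sketch_ prefixed names and in particular sketch_accumulate() are assumptions made for the illustration, since the corresponding kernel code is not shown in full here.

```c
#include <stdint.h>

typedef uint64_t cycle_t;

struct clock_sketch {
	cycle_t cycle_last;		/* counter value at the last update */
	cycle_t cycle_accumulated;	/* cycles collected but not yet folded into time */
	cycle_t mask;			/* width of the hardware counter */
	uint32_t mult;
	uint32_t shift;
};

/* Cycles since the last update plus what was accumulated before it. */
static cycle_t sketch_get_cycles(const struct clock_sketch *cs, cycle_t now)
{
	cycle_t offset = (now - cs->cycle_last) & cs->mask;

	return offset + cs->cycle_accumulated;
}

/* Convert cycles to nanoseconds with the usual mult/shift pair. */
static uint64_t sketch_cyc2ns(const struct clock_sketch *cs, cycle_t cycles)
{
	return (cycles * cs->mult) >> cs->shift;
}

/* Hypothetical helper: fold the current delta into the accumulator. */
static void sketch_accumulate(struct clock_sketch *cs, cycle_t now)
{
	cs->cycle_accumulated += (now - cs->cycle_last) & cs->mask;
	cs->cycle_last = now;
}
```

The mask makes the subtraction safe across a counter wrap: with a 24-bit counter, cycle_last = 0xfffff0 and now = 0x10, the delta is 0x20 cycles rather than a huge negative value.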

Signed-off-by: John Stultz <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/powerpc/kernel/time.c|3 +-
 arch/x86/kernel/vsyscall_64.c |5 ++-
 include/asm-x86/vgtod.h   |2 -
 include/linux/clocksource.h   |   58 --
 kernel/time/timekeeping.c |   36 +-
 5 files changed, 82 insertions(+), 22 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/vsyscall_64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/vsyscall_64.c 2008-01-25 
21:47:06.0 -0500
+++ linux-mcount.git/arch/x86/kernel/vsyscall_64.c  2008-01-25 
21:47:09.0 -0500
@@ -86,6 +86,7 @@ void update_vsyscall(struct timespec *wa
vsyscall_gtod_data.clock.mask = clock->mask;
vsyscall_gtod_data.clock.mult = clock->mult;
vsyscall_gtod_data.clock.shift = clock->shift;
+   vsyscall_gtod_data.clock.cycle_accumulated = clock->cycle_accumulated;
vsyscall_gtod_data.wall_time_sec = wall_time->tv_sec;
vsyscall_gtod_data.wall_time_nsec = wall_time->tv_nsec;
vsyscall_gtod_data.wall_to_monotonic = wall_to_monotonic;
@@ -121,7 +122,7 @@ static __always_inline long time_syscall
 
 static __always_inline void do_vgettimeofday(struct timeval * tv)
 {
-   cycle_t now, base, mask, cycle_delta;
+   cycle_t now, base, accumulated, mask, cycle_delta;
unsigned seq;
unsigned long mult, shift, nsec;
cycle_t (*vread)(void);
@@ -135,6 +136,7 @@ static __always_inline void do_vgettimeo
}
now = vread();
base = __vsyscall_gtod_data.clock.cycle_last;
+   accumulated  = __vsyscall_gtod_data.clock.cycle_accumulated;
mask = __vsyscall_gtod_data.clock.mask;
mult = __vsyscall_gtod_data.clock.mult;
shift = __vsyscall_gtod_data.clock.shift;
@@ -145,6 +147,7 @@ static __always_inline void do_vgettimeo
 
/* calculate interval: */
cycle_delta = (now - base) & mask;
+   cycle_delta += accumulated;
/* convert to nsecs: */
nsec += (cycle_delta * mult) >> shift;
 
Index: linux-mcount.git/include/asm-x86/vgtod.h
===
--- linux-mcount.git.orig/include/asm-x86/vgtod.h   2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/include/asm-x86/vgtod.h2008-01-25 21:47:09.0 
-0500
@@ -15,7 +15,7 @@ struct vsyscall_gtod_data {
struct timezone sys_tz;
struct { /* extract of a clocksource struct */
cycle_t (*vread)(void);
-   cycle_t cycle_last;
+   cycle_t cycle_last, cycle_accumulated;
cycle_t mask;
u32 mult;
u32 shift;
Index: linux-mcount.git/include/linux/clocksource.h
===
--- linux-mcount.git.orig/include/linux/clocksource.h   2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/include/linux/clocksource.h2008-01-25 
21:47:09.0 -0500
@@ -50,8 +50,12 @@ struct clocksource;
  * @flags: flags describing special properties
  * @vread: vsyscall based read
  * @resume:resume function for the clocksource, if necessary
+ * @cycle_last:Used internally by timekeeping core, please 
ignore.
+ * @cycle_accumulated: Used internally by timekeeping core, please ignore.
  * @cycle_interval:Used internally by timekeeping core, please ignore.
  * @xtime_interval:Used internally by timekeeping core, please ignore.
+ * @xtime_nsec:Used internally by timekeeping core, please 
ignore.
+ * @error: Used internally by timekeeping core, please ignore.
  */
 struct clocksource {
/*
@@ -82,7 +86,10 @@ struct clocksource {
 * Keep it in a different cache line to dirty no
 * more than one cache line.
 */
-   cycle_t cycle_last cacheline_aligned_in_smp;
+   struct {
+   cycle_t cycle_last, cycle_accumulated;
+   } cacheline_aligned_in_smp;
+
u64 xtime_nsec;
s64 error;
 
@@ -168,11 +175,44 @@ static inline cycle_t clocksource_read(s
 }
 
 /**
+ * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
+ * @cs:pointer to clocksource being read
+ * @now:   current cycle value
+ *
+ * Uses the clocksource to return the current cycle_t value.
+ * NOTE!!!: This is different from clocksource_read, because it
+ * returns the accumulated cycle value! Must hold xtime lock!
+ */
+static inline cycle_t
+clocksource_get_cycles(struct clocksource *cs, cycle_t now)
+{
+   cycle_t offset = (now - cs->cycle_last) & cs->mask;
+   offset += cs->cycle_accumulated;
+

[PATCH 13/22 -v7] Add tracing of context switches

2008-01-29 Thread Steven Rostedt

This patch adds context switch tracing, in the following format:

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
 swapper-0     1d..3  137us+:  0:140:R --> 2912:120
    sshd-2912  1d..3  216us+:  2912:120:S --> 0:140
 swapper-0     1d..3  261us+:  0:140:R --> 2912:120
    bash-2920  0d..3  267us+:  2920:120:S --> 0:140
    sshd-2912  1d..3  330us!:  2912:120:S --> 0:140
 swapper-0     1d..3 2389us+:  0:140:R --> 2847:120
yum-upda-2847  1d..3 2411us!:  2847:120:S --> 0:140
 swapper-0     0d..3 11089us+:  0:140:R --> 3139:120
gdm-bina-3139  0d..3 3us!:  3139:120:S --> 0:140
 swapper-0     1d..3 102328us+:  0:140:R --> 2847:120
yum-upda-2847  1d..3 102348us!:  2847:120:S --> 0:140


 "sched_switch" is added to /debugfs/tracing/available_tracers
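
Reading the sample output, the last column appears to be "prev_pid:prev_prio:state --> next_pid:next_prio". A minimal sketch of producing such a field is shown below; the function name and parameter order are made up for the illustration and the field meaning is inferred from the sample lines, not taken from the patch.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format one context-switch record, e.g. "2912:120:S --> 0:140".
 * Returns the number of characters snprintf() would have written.
 */
static int format_switch(char *buf, size_t len,
			 int prev_pid, int prev_prio, char prev_state,
			 int next_pid, int next_prio)
{
	return snprintf(buf, len, "%d:%d:%c --> %d:%d",
			prev_pid, prev_prio, prev_state,
			next_pid, next_prio);
}
```

Calling format_switch(buf, sizeof(buf), 2912, 120, 'S', 0, 140) fills buf with the same text as the second sample line.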

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
Cc: Mathieu Desnoyers <[EMAIL PROTECTED]>
---
 lib/tracing/Kconfig  |9 ++
 lib/tracing/Makefile |1 
 lib/tracing/trace_sched_switch.c |  165 +++
 lib/tracing/tracer.c |   43 ++
 lib/tracing/tracer.h |   23 +
 5 files changed, 240 insertions(+), 1 deletion(-)

Index: linux-mcount.git/lib/tracing/Kconfig
===
--- linux-mcount.git.orig/lib/tracing/Kconfig   2008-01-29 18:06:25.0 
-0500
+++ linux-mcount.git/lib/tracing/Kconfig2008-01-29 18:08:06.0 
-0500
@@ -23,3 +23,12 @@ config FUNCTION_TRACER
  insert a call to an architecture specific __mcount routine,
  that the debugging mechanism using this facility will hook by
  providing a set of inline routines.
+
+config CONTEXT_SWITCH_TRACER
+   bool "Trace process context switches"
+   depends on DEBUG_KERNEL
+   select TRACING
+   help
+ This tracer hooks into the context switch and records
+ all switching of tasks.
+
Index: linux-mcount.git/lib/tracing/Makefile
===
--- linux-mcount.git.orig/lib/tracing/Makefile  2008-01-29 18:06:25.0 
-0500
+++ linux-mcount.git/lib/tracing/Makefile   2008-01-29 18:08:06.0 
-0500
@@ -1,6 +1,7 @@
 obj-$(CONFIG_MCOUNT) += libmcount.o
 
 obj-$(CONFIG_TRACING) += tracer.o
+obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o
 obj-$(CONFIG_FUNCTION_TRACER) += trace_function.o
 
 libmcount-y := mcount.o
Index: linux-mcount.git/lib/tracing/trace_sched_switch.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-mcount.git/lib/tracing/trace_sched_switch.c   2008-01-29 
18:08:06.0 -0500
@@ -0,0 +1,165 @@
+/*
+ * trace context switch
+ *
+ * Copyright (C) 2007 Steven Rostedt <[EMAIL PROTECTED]>
+ *
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "tracer.h"
+
+static struct tracing_trace *tracer_trace;
+static int trace_enabled __read_mostly;
+static atomic_t sched_ref;
+int tracing_sched_switch_enabled __read_mostly;
+
+static notrace void sched_switch_callback(const struct marker *mdata,
+ void *private_data,
+ const char *format, ...)
+{
+   struct tracing_trace **p = mdata->private;
+   struct tracing_trace *tr = *p;
+   struct tracing_trace_cpu *data;
+   struct task_struct *prev;
+   struct task_struct *next;
+   unsigned long flags;
+   va_list ap;
+   int cpu;
+
+   if (!trace_enabled)
+   return;
+
+   va_start(ap, format);
+   prev = va_arg(ap, typeof(prev));
+   next = va_arg(ap, typeof(next));
+   va_end(ap);
+
+   raw_local_irq_save(flags);
+   cpu = raw_smp_processor_id();
+   data = tr->data[cpu];
+   atomic_inc(&data->disabled);
+
+   if (likely(atomic_read(&data->disabled) == 1))
+   tracing_sched_switch_trace(tr, data, prev, next, flags);
+
+   atomic_dec(&data->disabled);
+   raw_local_irq_restore(flags);
+}
+
+static notrace void sched_switch_reset(struct tracing_trace *tr)
+{
+   int cpu;
+
+   tr->time_start = now();
+
+   for_each_online_cpu(cpu)
+   tracing_reset(tr->data[cpu]);
+}
+
+static notrace void start_sched_trace(struct tracing_trace *tr)
+{
+   sched_switch_reset(tr);
+   trace_enabled = 1;
+   tracing_start_sched_switch();
+}
+
+static notrace void stop_sched_trace(struct tracing_trace *tr)
+{
+   tracing_stop_sched_switch();
+   trace_enabled = 0;
+}
+
+static notrace void sched_switch_trace_i

[PATCH 09/22 -v7] add notrace annotations to timing events

2008-01-29 Thread Steven Rostedt

This patch adds notrace annotations to timer functions
that will be used by tracing. This helps speed things up and
also keeps the ugliness of printing these functions down.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/x86/kernel/apic_32.c |2 +-
 arch/x86/kernel/hpet.c|2 +-
 arch/x86/kernel/time_32.c |2 +-
 arch/x86/kernel/tsc_32.c  |2 +-
 arch/x86/kernel/tsc_64.c  |4 ++--
 arch/x86/lib/delay_32.c   |6 +++---
 drivers/clocksource/acpi_pm.c |8 
 7 files changed, 13 insertions(+), 13 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/apic_32.c
===
--- linux-mcount.git.orig/arch/x86/kernel/apic_32.c 2008-01-29 
11:35:35.0 -0500
+++ linux-mcount.git/arch/x86/kernel/apic_32.c  2008-01-29 11:49:47.0 
-0500
@@ -577,7 +577,7 @@ static void local_apic_timer_interrupt(v
  *   interrupt as well. Thus we cannot inline the local irq ... ]
  */
 
-void fastcall smp_apic_timer_interrupt(struct pt_regs *regs)
+notrace fastcall void smp_apic_timer_interrupt(struct pt_regs *regs)
 {
struct pt_regs *old_regs = set_irq_regs(regs);
 
Index: linux-mcount.git/arch/x86/kernel/hpet.c
===
--- linux-mcount.git.orig/arch/x86/kernel/hpet.c2008-01-29 
11:35:35.0 -0500
+++ linux-mcount.git/arch/x86/kernel/hpet.c 2008-01-29 11:49:47.0 
-0500
@@ -295,7 +295,7 @@ static int hpet_legacy_next_event(unsign
 /*
  * Clock source related code
  */
-static cycle_t read_hpet(void)
+static notrace cycle_t read_hpet(void)
 {
return (cycle_t)hpet_readl(HPET_COUNTER);
 }
Index: linux-mcount.git/arch/x86/kernel/time_32.c
===
--- linux-mcount.git.orig/arch/x86/kernel/time_32.c 2008-01-29 
11:35:35.0 -0500
+++ linux-mcount.git/arch/x86/kernel/time_32.c  2008-01-29 11:49:47.0 
-0500
@@ -122,7 +122,7 @@ static int set_rtc_mmss(unsigned long no
 
 int timer_ack;
 
-unsigned long profile_pc(struct pt_regs *regs)
+notrace unsigned long profile_pc(struct pt_regs *regs)
 {
unsigned long pc = instruction_pointer(regs);
 
Index: linux-mcount.git/arch/x86/kernel/tsc_32.c
===
--- linux-mcount.git.orig/arch/x86/kernel/tsc_32.c  2008-01-29 
11:35:35.0 -0500
+++ linux-mcount.git/arch/x86/kernel/tsc_32.c   2008-01-29 11:49:47.0 
-0500
@@ -269,7 +269,7 @@ core_initcall(cpufreq_tsc);
 
 static unsigned long current_tsc_khz = 0;
 
-static cycle_t read_tsc(void)
+static notrace cycle_t read_tsc(void)
 {
cycle_t ret;
 
Index: linux-mcount.git/arch/x86/kernel/tsc_64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/tsc_64.c  2008-01-29 
11:35:35.0 -0500
+++ linux-mcount.git/arch/x86/kernel/tsc_64.c   2008-01-29 11:49:47.0 
-0500
@@ -248,13 +248,13 @@ __setup("notsc", notsc_setup);
 
 
 /* clock source code: */
-static cycle_t read_tsc(void)
+static notrace cycle_t read_tsc(void)
 {
cycle_t ret = (cycle_t)get_cycles_sync();
return ret;
 }
 
-static cycle_t __vsyscall_fn vread_tsc(void)
+static notrace cycle_t __vsyscall_fn vread_tsc(void)
 {
cycle_t ret = (cycle_t)get_cycles_sync();
return ret;
Index: linux-mcount.git/arch/x86/lib/delay_32.c
===
--- linux-mcount.git.orig/arch/x86/lib/delay_32.c   2008-01-29 
11:35:35.0 -0500
+++ linux-mcount.git/arch/x86/lib/delay_32.c2008-01-29 11:49:47.0 
-0500
@@ -24,7 +24,7 @@
 #endif
 
 /* simple loop based delay: */
-static void delay_loop(unsigned long loops)
+static notrace void delay_loop(unsigned long loops)
 {
int d0;
 
@@ -39,7 +39,7 @@ static void delay_loop(unsigned long loo
 }
 
 /* TSC based delay: */
-static void delay_tsc(unsigned long loops)
+static notrace void delay_tsc(unsigned long loops)
 {
unsigned long bclock, now;
 
@@ -72,7 +72,7 @@ int read_current_timer(unsigned long *ti
return -1;
 }
 
-void __delay(unsigned long loops)
+notrace void __delay(unsigned long loops)
 {
delay_fn(loops);
 }
Index: linux-mcount.git/drivers/clocksource/acpi_pm.c
===
--- linux-mcount.git.orig/drivers/clocksource/acpi_pm.c 2008-01-29 
11:35:35.0 -0500
+++ linux-mcount.git/drivers/clocksource/acpi_pm.c  2008-01-29 
11:49:47.0 -0500
@@ -30,13 +30,13 @@
  */
 u32 pmtmr_ioport __read_mostly;
 
-static inline u32 read_pmtmr(void)
+static inline notrace u32 read_pmtmr(void)
 {
/* mask the output to 24 bits */
return inl(pmtmr_ioport) & ACPI_PM_MASK;
 }
 
-u32 acpi_pm_read_verified(void)
+notrace u32 acpi_pm_read_verified(void)
 {
u32 v1 = 0, v2 = 0, v3 = 0;

[PATCH 16/22 -v7] Add marker in try_to_wake_up

2008-01-29 Thread Steven Rostedt

Add markers into the wakeup code, to allow the tracer to
record wakeup timings.
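
For readers unfamiliar with the marker calling convention, the sketch below shows how a probe can pull the two task pointers back out of the variadic arguments, in the same way the sched_switch callback elsewhere in this series uses va_arg(). Everything here is a userspace stand-in: task_sketch, wakeup_probe and the two globals are hypothetical names.

```c
#include <stdarg.h>

struct task_sketch {
	int pid;
};

static int seen_woken_pid = -1;
static int seen_curr_pid = -1;

/*
 * Probe with the calling shape of a marker callback: a format string
 * followed by the values it names ("p %p rq->curr %p").
 */
static void wakeup_probe(const char *format, ...)
{
	struct task_sketch *p;
	struct task_sketch *curr;
	va_list ap;

	va_start(ap, format);
	p = va_arg(ap, struct task_sketch *);
	curr = va_arg(ap, struct task_sketch *);
	va_end(ap);

	seen_woken_pid = p->pid;
	seen_curr_pid = curr->pid;
}
```

The format string is not parsed in this sketch. The probe simply relies on the argument order agreed with the marker site, so both sides have to be kept in sync.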

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 kernel/sched.c |8 
 1 file changed, 8 insertions(+)

Index: linux-mcount.git/kernel/sched.c
===
--- linux-mcount.git.orig/kernel/sched.c2008-01-25 21:47:21.0 
-0500
+++ linux-mcount.git/kernel/sched.c 2008-01-25 21:47:30.0 -0500
@@ -1885,6 +1885,10 @@ static int try_to_wake_up(struct task_st
 
 out_activate:
 #endif /* CONFIG_SMP */
+   trace_mark(kernel_sched_wakeup,
+  "p %p rq->curr %p",
+  p, rq->curr);
+
schedstat_inc(p, se.nr_wakeups);
if (sync)
schedstat_inc(p, se.nr_wakeups_sync);
@@ -2026,6 +2030,10 @@ void fastcall wake_up_new_task(struct ta
p->sched_class->task_new(rq, p);
inc_nr_running(p, rq);
}
+   trace_mark(kernel_sched_wakeup_new,
+  "p %p rq->curr %p",
+  p, rq->curr);
+
check_preempt_curr(rq, p);
 #ifdef CONFIG_SMP
if (p->sched_class->task_wake_up)


[PATCH 03/22 -v7] Annotate core code that should not be traced

2008-01-29 Thread Steven Rostedt

Mark functions in core code that should not be traced with
"notrace".  The "notrace" attribute will prevent gcc from adding
a call to mcount on the annotated functions.
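
As an illustration of the annotation, the sketch below defines notrace on top of GCC's no_instrument_function attribute, which GCC documents as suppressing the profiling calls generated under -pg. The macro definition and both function names are assumptions for this example; the mail itself does not show how notrace is defined.

```c
/* Assumption: notrace spelled via the GCC function attribute. */
#define notrace __attribute__((no_instrument_function))

/* Helper called very often; keep the profiling hook out of it. */
static notrace unsigned int cheap_helper(unsigned int x)
{
	return x + 1;
}

/* Ordinary function; built with -pg it would still get an mcount call. */
static unsigned int traced_work(unsigned int x)
{
	return cheap_helper(x) * 2;
}
```

The annotation changes only the instrumentation, not the behaviour: traced_work(20) returns 42 either way.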

Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

---
 lib/smp_processor_id.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-mcount.git/lib/smp_processor_id.c
===
--- linux-mcount.git.orig/lib/smp_processor_id.c2008-01-25 
21:46:50.0 -0500
+++ linux-mcount.git/lib/smp_processor_id.c 2008-01-25 21:47:03.0 
-0500
@@ -7,7 +7,7 @@
 #include 
 #include 
 
-unsigned int debug_smp_processor_id(void)
+notrace unsigned int debug_smp_processor_id(void)
 {
unsigned long preempt_count = preempt_count();
int this_cpu = raw_smp_processor_id();


[PATCH 17/22 -v7] mcount tracer for wakeup latency timings.

2008-01-29 Thread Steven Rostedt

This patch adds hooks to trace the wake up latency of the highest
priority waking task.

  "wakeup" is added to /debugfs/tracing/available_tracers

Also added to /debugfs/tracing

  tracing_max_latency
 holds the current max latency for the wakeup

  wakeup_thresh
 if set to other than zero, a log will be recorded
 for every wakeup that takes longer than the number
 entered in here (usecs for all counters)
 (deletes previous trace)

Examples:

  (with mcount_enabled = 0)


preemption latency trace v1.1.5 on 2.6.24-rc8

 latency: 26 us, #2/2, CPU#1 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: migration/0-3 (uid:0 nice:-5 policy:1 rt_prio:99)
    -----------------

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
   quilt-8551  0d..3    0us+: wake_up_process+0x15/0x17 (sched_exec+0xc9/0x100)
   quilt-8551  0d..4   26us : sched_switch_callback+0x73/0x81 (schedule+0x483/0x6d5)


vim:ft=help



  (with mcount_enabled = 1)


preemption latency trace v1.1.5 on 2.6.24-rc8

 latency: 36 us, #45/45, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: migration/1-5 (uid:0 nice:-5 policy:1 rt_prio:99)
    -----------------

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
    bash-10653 1d..3    0us : wake_up_process+0x15/0x17 (sched_exec+0xc9/0x100)
    bash-10653 1d..3    1us : try_to_wake_up+0x271/0x2e7 (sub_preempt_count+0xc/0x7a)
    bash-10653 1d..2    2us : try_to_wake_up+0x296/0x2e7 (update_rq_clock+0x9/0x20)
    bash-10653 1d..2    2us : update_rq_clock+0x1e/0x20 (__update_rq_clock+0xc/0x90)
    bash-10653 1d..2    3us : __update_rq_clock+0x1b/0x90 (sched_clock+0x9/0x29)
    bash-10653 1d..2    4us : try_to_wake_up+0x2a6/0x2e7 (activate_task+0xc/0x3f)
    bash-10653 1d..2    4us : activate_task+0x2d/0x3f (enqueue_task+0xe/0x66)
    bash-10653 1d..2    5us : enqueue_task+0x5b/0x66 (enqueue_task_rt+0x9/0x3c)
    bash-10653 1d..2    6us : try_to_wake_up+0x2ba/0x2e7 (check_preempt_wakeup+0x12/0x99)
[...]
    bash-10653 1d..5   33us : tracing_record_cmdline+0xcf/0xd4 (_spin_unlock+0x9/0x33)
    bash-10653 1d..5   34us : _spin_unlock+0x19/0x33 (sub_preempt_count+0xc/0x7a)
    bash-10653 1d..4   35us : wakeup_sched_switch+0x65/0x2ff (_spin_lock_irqsave+0xc/0xa9)
    bash-10653 1d..4   35us : _spin_lock_irqsave+0x19/0xa9 (add_preempt_count+0xe/0x77)
    bash-10653 1d..4   36us : sched_switch_callback+0x73/0x81 (schedule+0x483/0x6d5)


vim:ft=help


The [...] was added here to not waste your email box space.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 lib/tracing/Kconfig|   14 +
 lib/tracing/Makefile   |1 
 lib/tracing/trace_wakeup.c |  359 +
 lib/tracing/tracer.c   |  131 
 lib/tracing/tracer.h   |5 
 5 files changed, 508 insertions(+), 2 deletions(-)

Index: linux-mcount.git/lib/tracing/Kconfig
===
--- linux-mcount.git.orig/lib/tracing/Kconfig   2008-01-29 18:09:01.0 
-0500
+++ linux-mcount.git/lib/tracing/Kconfig2008-01-29 18:10:17.0 
-0500
@@ -9,6 +9,9 @@ config MCOUNT
bool
select FRAME_POINTER
 
+config TRACER_MAX_TRACE
+   bool
+
 config TRACING
 bool
select DEBUG_FS
@@ -25,6 +28,17 @@ config FUNCTION_TRACER
  that the debugging mechanism using this facility will hook by
  providing a set of inline routines.
 
+config WAKEUP_TRACER
+   bool "Trace wakeup latencies"
+   depends on DEBUG_KERNEL
+   select TRACING
+   select CONTEXT_SWITCH_TRACER
+   select TRACER_MAX_TRACE
+   help
+ This tracer adds hooks into scheduling to time the latency
+ of the highest priority task to be scheduled in
+ after it has woken up.
+
 config CONTEXT_SWITCH_TRACER
bool "Trace process context switches"
depends on DEBUG_KERNEL
Index: linux-mcount.git/lib/tracing/Makefile
===
--- linux-m

[PATCH 14/22 -v7] Generic command line storage

2008-01-29 Thread Steven Rostedt

Saving the comm of tasks for each trace is very expensive.
This patch adds to the context switch hook a way to
store the last 100 command lines of tasks. This table is
examined when a trace is to be printed.

Note: The comm may be destroyed if other traces are performed.
Later (TBD) patches may simply store this information in the trace
itself.
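
The data structure can be sketched in userspace as a fixed ring of saved names plus a pid-to-slot table, loosely following the diff below (which defines SAVED_CMDLINES as 128). Locking is left out, PID_MAX_SKETCH and COMM_LEN_SKETCH are stand-in values, and the cmd_cache_ names are made up for this illustration. The sketch also records the slot's owner when a slot is taken so that eviction can find the previous pid.

```c
#include <string.h>

#define SAVED_CMDLINES  128		/* value used by the diff in this mail */
#define PID_MAX_SKETCH  32768		/* assumption: stands in for PID_MAX_DEFAULT */
#define COMM_LEN_SKETCH 16		/* assumption: stands in for TASK_COMM_LEN */
#define NO_SLOT         ((unsigned)-1)

static unsigned pid_to_slot[PID_MAX_SKETCH + 1];
static unsigned slot_to_pid[SAVED_CMDLINES];
static char saved[SAVED_CMDLINES][COMM_LEN_SKETCH];
static unsigned last_idx;

static void cmd_cache_init(void)
{
	/* All-ones bytes make every entry read back as NO_SLOT. */
	memset(pid_to_slot, 0xff, sizeof(pid_to_slot));
	memset(slot_to_pid, 0xff, sizeof(slot_to_pid));
	last_idx = 0;
}

static void cmd_cache_save(unsigned pid, const char *comm)
{
	unsigned idx, old;

	if (pid == 0 || pid > PID_MAX_SKETCH)
		return;

	idx = pid_to_slot[pid];
	if (idx >= SAVED_CMDLINES) {
		/* No slot yet: take the next one in the ring, evict its owner. */
		idx = (last_idx + 1) % SAVED_CMDLINES;
		old = slot_to_pid[idx];
		if (old <= PID_MAX_SKETCH)
			pid_to_slot[old] = NO_SLOT;
		slot_to_pid[idx] = pid;
		pid_to_slot[pid] = idx;
		last_idx = idx;
	}

	strncpy(saved[idx], comm, COMM_LEN_SKETCH - 1);
	saved[idx][COMM_LEN_SKETCH - 1] = '\0';
}

/* Returns the saved name, or "<...>" when the pid is unknown or evicted. */
static const char *cmd_cache_find(unsigned pid)
{
	unsigned idx;

	if (pid > PID_MAX_SKETCH)
		return "<...>";
	idx = pid_to_slot[pid];
	if (idx >= SAVED_CMDLINES)
		return "<...>";
	return saved[idx];
}
```

Once SAVED_CMDLINES other tasks have been recorded, the oldest entry is overwritten and its pid falls back to "<...>", which is the "comm may be destroyed" case mentioned in the note.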

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 lib/tracing/Kconfig  |1 
 lib/tracing/trace_function.c |2 
 lib/tracing/trace_sched_switch.c |5 +
 lib/tracing/tracer.c |  108 ---
 lib/tracing/tracer.h |3 -
 5 files changed, 111 insertions(+), 8 deletions(-)

Index: linux-mcount.git/lib/tracing/Kconfig
===
--- linux-mcount.git.orig/lib/tracing/Kconfig   2008-01-29 18:08:06.0 
-0500
+++ linux-mcount.git/lib/tracing/Kconfig2008-01-29 18:09:01.0 
-0500
@@ -18,6 +18,7 @@ config FUNCTION_TRACER
depends on DEBUG_KERNEL && HAVE_MCOUNT
select MCOUNT
select TRACING
+   select CONTEXT_SWITCH_TRACER
help
  Use profiler instrumentation, adding -pg to CFLAGS. This will
  insert a call to an architecture specific __mcount routine,
Index: linux-mcount.git/lib/tracing/trace_function.c
===
--- linux-mcount.git.orig/lib/tracing/trace_function.c  2008-01-29 
18:06:24.0 -0500
+++ linux-mcount.git/lib/tracing/trace_function.c   2008-01-29 
18:08:10.0 -0500
@@ -29,10 +29,12 @@ static notrace void start_function_trace
 {
function_reset(tr);
tracing_start_function_trace();
+   tracing_start_sched_switch();
 }
 
 static notrace void stop_function_trace(struct tracing_trace *tr)
 {
+   tracing_stop_sched_switch();
tracing_stop_function_trace();
 }
 
Index: linux-mcount.git/lib/tracing/trace_sched_switch.c
===
--- linux-mcount.git.orig/lib/tracing/trace_sched_switch.c  2008-01-29 
18:08:06.0 -0500
+++ linux-mcount.git/lib/tracing/trace_sched_switch.c   2008-01-29 
18:09:03.0 -0500
@@ -32,6 +32,11 @@ static notrace void sched_switch_callbac
va_list ap;
int cpu;
 
+   if (!atomic_read(&sched_ref))
+   return;
+
+   tracing_record_cmdline(current);
+
if (!trace_enabled)
return;
 
Index: linux-mcount.git/lib/tracing/tracer.c
===
--- linux-mcount.git.orig/lib/tracing/tracer.c  2008-01-29 18:08:06.0 
-0500
+++ linux-mcount.git/lib/tracing/tracer.c   2008-01-29 18:10:04.0 
-0500
@@ -171,6 +171,87 @@ void tracing_stop_function_trace(void)
unregister_mcount_function(&trace_ops);
 }
 
+#define SAVED_CMDLINES 128
+static unsigned map_pid_to_cmdline[PID_MAX_DEFAULT+1];
+static unsigned map_cmdline_to_pid[SAVED_CMDLINES];
+static char saved_cmdlines[SAVED_CMDLINES][TASK_COMM_LEN];
+static int cmdline_idx;
+static DEFINE_SPINLOCK(trace_cmdline_lock);
+atomic_t trace_record_cmdline_disabled;
+
+static void trace_init_cmdlines(void)
+{
+   memset(&map_pid_to_cmdline, -1, sizeof(map_pid_to_cmdline));
+   memset(&map_cmdline_to_pid, -1, sizeof(map_cmdline_to_pid));
+   cmdline_idx = 0;
+}
+
+notrace void trace_stop_cmdline_recording(void);
+
+static void notrace trace_save_cmdline(struct task_struct *tsk)
+{
+   unsigned map;
+   unsigned idx;
+
+   if (!tsk->pid || unlikely(tsk->pid > PID_MAX_DEFAULT))
+   return;
+
+   /*
+* It's not the end of the world if we don't get
+* the lock, but we also don't want to spin
+* nor do we want to disable interrupts,
+* so if we miss here, then better luck next time.
+*/
+   if (!spin_trylock(&trace_cmdline_lock))
+   return;
+
+   idx = map_pid_to_cmdline[tsk->pid];
+   if (idx >= SAVED_CMDLINES) {
+   idx = (cmdline_idx + 1) % SAVED_CMDLINES;
+
+   map = map_cmdline_to_pid[idx];
+   if (map <= PID_MAX_DEFAULT)
+   map_pid_to_cmdline[map] = (unsigned)-1;
+
+   map_pid_to_cmdline[tsk->pid] = idx;
+
+   cmdline_idx = idx;
+   }
+
+   memcpy(&saved_cmdlines[idx], tsk->comm, TASK_COMM_LEN);
+
+   spin_unlock(&trace_cmdline_lock);
+}
+
+static notrace char *trace_find_cmdline(int pid)
+{
+   char *cmdline = "<...>";
+   unsigned map;
+
+   if (!pid)
+   return "";
+
+   if (pid > PID_MAX_DEFAULT)
+   goto out;
+
+   map = map_pid_to_cmdline[pid];
+   if (map >= SAVED_CMDLINES)
+   goto out;
+
+   cmdline = saved_cmdlines[map];
+
+ out:
+   return cmdline;
+}
+
+void tracing_record_cmdline(struct task_struct *tsk)
+{
+   if (atom

[PATCH 05/22 -v7] add notrace annotations to vsyscall.

2008-01-29 Thread Steven Rostedt

Add the notrace annotations to some of the vsyscall functions.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/x86/kernel/vsyscall_64.c  |3 ++-
 arch/x86/vdso/vclock_gettime.c |   15 ---
 arch/x86/vdso/vgetcpu.c|3 ++-
 include/asm-x86/vsyscall.h |3 ++-
 4 files changed, 14 insertions(+), 10 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/vsyscall_64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/vsyscall_64.c 2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/vsyscall_64.c  2008-01-25 21:47:06.0 -0500
@@ -42,7 +42,8 @@
 #include 
 #include 
 
-#define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr)))
+#define __vsyscall(nr) \
+   __attribute__ ((unused, __section__(".vsyscall_" #nr))) notrace
 #define __syscall_clobber "r11","rcx","memory"
 #define __pa_vsymbol(x)\
({unsigned long v;  \
Index: linux-mcount.git/arch/x86/vdso/vclock_gettime.c
===
--- linux-mcount.git.orig/arch/x86/vdso/vclock_gettime.c 2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/vdso/vclock_gettime.c 2008-01-25 21:47:06.0 -0500
@@ -24,7 +24,7 @@
 
 #define gtod vdso_vsyscall_gtod_data
 
-static long vdso_fallback_gettime(long clock, struct timespec *ts)
+notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 {
long ret;
asm("syscall" : "=a" (ret) :
@@ -32,7 +32,7 @@ static long vdso_fallback_gettime(long c
return ret;
 }
 
-static inline long vgetns(void)
+notrace static inline long vgetns(void)
 {
long v;
cycles_t (*vread)(void);
@@ -41,7 +41,7 @@ static inline long vgetns(void)
return (v * gtod->clock.mult) >> gtod->clock.shift;
 }
 
-static noinline int do_realtime(struct timespec *ts)
+notrace static noinline int do_realtime(struct timespec *ts)
 {
unsigned long seq, ns;
do {
@@ -55,7 +55,8 @@ static noinline int do_realtime(struct t
 }
 
 /* Copy of the version in kernel/time.c which we cannot directly access */
-static void vset_normalized_timespec(struct timespec *ts, long sec, long nsec)
+notrace static void
+vset_normalized_timespec(struct timespec *ts, long sec, long nsec)
 {
while (nsec >= NSEC_PER_SEC) {
nsec -= NSEC_PER_SEC;
@@ -69,7 +70,7 @@ static void vset_normalized_timespec(str
ts->tv_nsec = nsec;
 }
 
-static noinline int do_monotonic(struct timespec *ts)
+notrace static noinline int do_monotonic(struct timespec *ts)
 {
unsigned long seq, ns, secs;
do {
@@ -83,7 +84,7 @@ static noinline int do_monotonic(struct 
return 0;
 }
 
-int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
+notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 {
if (likely(gtod->sysctl_enabled && gtod->clock.vread))
switch (clock) {
@@ -97,7 +98,7 @@ int __vdso_clock_gettime(clockid_t clock
 int clock_gettime(clockid_t, struct timespec *)
__attribute__((weak, alias("__vdso_clock_gettime")));
 
-int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
+notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
 {
long ret;
if (likely(gtod->sysctl_enabled && gtod->clock.vread)) {
Index: linux-mcount.git/arch/x86/vdso/vgetcpu.c
===
--- linux-mcount.git.orig/arch/x86/vdso/vgetcpu.c   2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/vdso/vgetcpu.c 2008-01-25 21:47:06.0 -0500
@@ -13,7 +13,8 @@
 #include 
 #include "vextern.h"
 
-long __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
+notrace long
+__vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
 {
unsigned int dummy, p;
 
Index: linux-mcount.git/include/asm-x86/vsyscall.h
===
--- linux-mcount.git.orig/include/asm-x86/vsyscall.h 2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/include/asm-x86/vsyscall.h 2008-01-25 21:47:06.0 -0500
@@ -24,7 +24,8 @@ enum vsyscall_num {
((unused, __section__ (".vsyscall_gtod_data"),aligned(16)))
 #define __section_vsyscall_clock __attribute__ \
((unused, __section__ (".vsyscall_clock"),aligned(16)))
-#define __vsyscall_fn __attribute__ ((unused,__section__(".vsyscall_fn")))
+#define __vsyscall_fn \
+   __attribute__ ((unused, __section__(".vsyscall_fn"))) notrace
 
 #define VGETCPU_RDTSCP 1
 #define VGETCPU_LSL2


[PATCH 21/22 -v7] Add event tracer.

2008-01-29 Thread Steven Rostedt

This patch adds an event tracer that hooks into various events
in the kernel. Although it can be used separately, it is mainly
meant to help the other tracers (wakeup and preempt off) show these
events in their traces without having to enable the heavy mcount
hooks.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 lib/tracing/Kconfig |   12 +
 lib/tracing/Makefile|1 
 lib/tracing/trace_events.c  |  475 
 lib/tracing/trace_irqsoff.c |6 
 lib/tracing/trace_wakeup.c  |   55 -
 lib/tracing/tracer.c|  159 ++
 lib/tracing/tracer.h|   64 +
 7 files changed, 721 insertions(+), 51 deletions(-)

Index: linux-mcount.git/lib/tracing/trace_events.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-mcount.git/lib/tracing/trace_events.c 2008-01-29 18:11:37.0 -0500
@@ -0,0 +1,475 @@
+/*
+ * trace task events
+ *
+ * Copyright (C) 2007 Steven Rostedt <[EMAIL PROTECTED]>
+ *
+ * Based on code from the latency_tracer, that is:
+ *
+ *  Copyright (C) 2004-2006 Ingo Molnar
+ *  Copyright (C) 2004 William Lee Irwin III
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "tracer.h"
+
+static struct tracing_trace *tracer_trace __read_mostly;
+static int trace_enabled __read_mostly;
+
+static void notrace event_reset(struct tracing_trace *tr)
+{
+   struct tracing_trace_cpu *data;
+   int cpu;
+
+   for_each_possible_cpu(cpu) {
+   data = tr->data[cpu];
+   tracing_reset(data);
+   }
+
+   tr->time_start = now();
+}
+
+static void notrace event_trace_sched_switch(void *private,
+struct task_struct *prev,
+struct task_struct *next)
+{
+   struct tracing_trace **ptr = private;
+   struct tracing_trace *tr = *ptr;
+   struct tracing_trace_cpu *data;
+   unsigned long flags;
+   int cpu;
+
+   if (!trace_enabled || !tr)
+   return;
+
+   local_irq_save(flags);
+   cpu = raw_smp_processor_id();
+   data = tr->data[cpu];
+
+   atomic_inc(&data->disabled);
+   if (atomic_read(&data->disabled) != 1)
+   goto out;
+
+   tracing_sched_switch_trace(tr, data, prev, next, flags);
+
+ out:
+   atomic_dec(&data->disabled);
+   local_irq_restore(flags);
+}
+
+static struct tracer_switch_ops switch_ops __read_mostly = {
+   .func = event_trace_sched_switch,
+   .private = &tracer_trace,
+};
+
+notrace int trace_event_enabled(void)
+{
+   return trace_enabled && tracer_trace;
+}
+
+/* Taken from sched.c */
+#define __PRIO(prio) \
+   ((prio) <= 99 ? 199 - (prio) : (prio) - 120)
+
+#define PRIO(p) __PRIO((p)->prio)
+
+notrace void trace_event_wakeup(unsigned long ip,
+   struct task_struct *p,
+   struct task_struct *curr)
+{
+   struct tracing_trace *tr = tracer_trace;
+   struct tracing_trace_cpu *data;
+   unsigned long flags;
+   int cpu;
+
+   if (!trace_enabled || !tr)
+   return;
+
+   local_irq_save(flags);
+   cpu = raw_smp_processor_id();
+   data = tr->data[cpu];
+
+   atomic_inc(&data->disabled);
+   if (atomic_read(&data->disabled) != 1)
+   goto out;
+
+   /* record process's command line */
+   tracing_record_cmdline(p);
+   tracing_record_cmdline(curr);
+   tracing_trace_pid(tr, data, flags, ip, p->pid, PRIO(p), PRIO(curr));
+
+ out:
+   atomic_dec(&data->disabled);
+   local_irq_restore(flags);
+}
+
+struct event_probes {
+   const char *name;
+   const char *fmt;
+   void (*func)(const struct event_probes *probe,
+struct tracing_trace *tr,
+struct tracing_trace_cpu *data,
+unsigned long flags,
+unsigned long ip,
+va_list ap);
+   int active;
+   int armed;
+};
+
+#define getarg(arg, ap) arg = va_arg(ap, typeof(arg))
+
+static notrace void event_trace_apic_timer(const struct event_probes *probe,
+  struct tracing_trace *tr,
+  struct tracing_trace_cpu *data,
+  unsigned long flags,
+  unsigned long ip,
+  va_list ap)
+{
+   unsigned long parent_ip;
+
+   getarg(parent_ip, ap);
+
+   tracing_trace_special(tr, data, flags, ip, parent_ip, 0, 0);
+}
+
+static notrace void event_trace_do_irq(const struct event_probes *probe,
+  struct tracing_trace *tr,
+  struct tracing_trace_cpu *data,
+  unsigned long flags,
+

[PATCH 04/22 -v7] x86_64: notrace annotations

2008-01-29 Thread Steven Rostedt

Add "notrace" annotation to x86_64 specific files.

Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/x86/kernel/head64.c |2 +-
 arch/x86/kernel/setup64.c|4 ++--
 arch/x86/kernel/smpboot_64.c |2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-mcount.git/arch/x86/kernel/head64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/head64.c  2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/head64.c   2008-01-25 21:47:05.0 -0500
@@ -46,7 +46,7 @@ static void __init copy_bootdata(char *r
}
 }
 
-void __init x86_64_start_kernel(char * real_mode_data)
+notrace void __init x86_64_start_kernel(char *real_mode_data)
 {
int i;
 
Index: linux-mcount.git/arch/x86/kernel/setup64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/setup64.c 2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/setup64.c  2008-01-25 21:47:05.0 -0500
@@ -114,7 +114,7 @@ void __init setup_per_cpu_areas(void)
}
 } 
 
-void pda_init(int cpu)
+notrace void pda_init(int cpu)
 { 
struct x8664_pda *pda = cpu_pda(cpu);
 
@@ -197,7 +197,7 @@ DEFINE_PER_CPU(struct orig_ist, orig_ist
  * 'CPU state barrier', nothing should get across.
  * A lot of state is already set up in PDA init.
  */
-void __cpuinit cpu_init (void)
+notrace void __cpuinit cpu_init(void)
 {
int cpu = stack_smp_processor_id();
struct tss_struct *t = &per_cpu(init_tss, cpu);
Index: linux-mcount.git/arch/x86/kernel/smpboot_64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/smpboot_64.c  2008-01-25 21:46:50.0 -0500
+++ linux-mcount.git/arch/x86/kernel/smpboot_64.c   2008-01-25 21:47:05.0 -0500
@@ -317,7 +317,7 @@ static inline void set_cpu_sibling_map(i
 /*
  * Setup code on secondary processor (after comming out of the trampoline)
  */
-void __cpuinit start_secondary(void)
+notrace __cpuinit void start_secondary(void)
 {
/*
 * Dont put anything before smp_callin(), SMP


[PATCH 00/22 -v7] mcount and latency tracing utility -v7

2008-01-29 Thread Steven Rostedt


[
  version 7  (and hopefully last) of mcount / trace patches:

  changes include:

   Ported to latest git 0ba6c33bcddc64a54b5f1c25a696c4767dc76292

   Moved the markers around so they would only be armed when used,
   this brings down the overhead dramatically.

   Added printing of the "to" process name in the sched switch output:
ksoftirq-8 2d..3 120829us+:  8:49:S --> 11:115 group_balance
group_ba-112d..3 120836us!:  11:115:S --> 0:140 

   Removed notrace from nmi handlers. I've tested it a little with
   NMIs and function trace, and it seems to work fine.

   added "disable" to available_tracers that will unregister all
   tracers when written into current_tracer.

   Ran new benchmarks and got better results! See below.
]

All released versions of these patches can be found at:

   http://people.redhat.com/srostedt/tracing/


The following patch series brings to vanilla Linux a bit of the RT kernel
trace facility. This incorporates the "-pg" profiling option of gcc
that will call the "mcount" function for all functions called in
the kernel.

Note: I did investigate using -finstrument-functions but that adds a call
to both start and end of a function. Using mcount only does the
beginning of the function. mcount alone adds ~13% overhead. The
-finstrument-functions added ~19%.  Also it caused me to do tricks with
inline, because it adds the function calls to inline functions as well.
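To make the difference concrete, here is a minimal user-space sketch of the two hooks gcc calls when code is built with -finstrument-functions. The hook names and signatures are the ones gcc documents; the counters and traced_work() are made up for illustration, and nothing here is taken from the patch series.

```c
/* Same spelling the kernel uses.  Here it only keeps the hooks from
 * being instrumented themselves, which would recurse. */
#define notrace __attribute__((no_instrument_function))

/* Hypothetical counters, not part of any patch. */
static unsigned long enter_calls;
static unsigned long exit_calls;

/* gcc emits a call to this at the start of every instrumented function. */
notrace void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
	(void)this_fn;
	(void)call_site;
	enter_calls++;
}

/* gcc emits a call to this just before every instrumented function returns. */
notrace void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
	(void)this_fn;
	(void)call_site;
	exit_calls++;
}

/* An ordinary function.  Built with -finstrument-functions, its body is
 * bracketed by calls to both hooks; built with -pg, only a single
 * mcount call is placed at its entry. */
int traced_work(int x)
{
	return x * 2;
}
```

Every instrumented function pays for two calls instead of one, which is in line with the higher overhead reported above.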

This patch series implements the code for x86 (32 and 64 bit), but
other archs can easily be implemented as well (note: ARM and PPC are
already implemented in -rt)

Some Background:


A while back, Ingo Molnar and William Lee Irwin III created a latency tracer
to find problem latency areas in the kernel for the RT patch.  This tracer
became a very integral part of the RT kernel in solving where latency hot
spots were.  One of the features that the latency tracer added was a
function trace.  This function tracer would record all functions that
were called (implemented by the gcc "-pg" option) and would show what was
called when interrupts or preemption was turned off.

This feature is also very helpful in normal debugging. There has been talk
of taking bits and pieces from the RT latency tracer and bringing them
to LKML, but no one had the time to do it.

Arnaldo Carvalho de Melo took a crack at it. He pulled out the mcount
as well as part of the tracing code and made it generic from the point
of the tracing code.  I'm not sure why this stopped. Probably because
Arnaldo is a very busy man, and his efforts had to be utilized elsewhere.

While I still maintain my own Logdev utility:

  http://rostedt.homelinux.com/logdev

I came across a need to do the mcount with logdev too. I was successful
but found that it became very dependent on a lot of code. One thing that
I liked about my logdev utility was that it was very non-intrusive, and has
been easy to port from the Linux 2.0 days. I did not want to burden the
logdev patch with the intrusiveness of mcount (not really that intrusive,
it just needs to add a "notrace" annotation to functions in the kernel
that will cause more conflicts in applying patches for me).

Being close to the holidays, I grabbed Arnaldo's old patches and started
massaging them into something that could be useful for logdev. What
I found out (after talking this over with Arnaldo too) is that this can
be much more useful for others as well.

The main thing I changed was that I made the mcount function itself
generic, with no dependency on the tracing code.  That is, I added

register_mcount_function()
 and
clear_mcount_function()

So whenever mcount is enabled and a function is registered, that function
is called for every function in the kernel that is not labeled with the
"notrace" annotation.


The Simple Tracer:
------------------

To show the power of this I also massaged the tracer code that Arnaldo pulled
from the RT patch and made it be a nice example of what can be done
with this.

The function that is registered to mcount has the prototype:

 void func(unsigned long ip, unsigned long parent_ip);

The ip is the address of the function and parent_ip is the address of
the parent function that called it.
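As a rough user-space model of that registration scheme (a sketch only, not the kernel code): the callback type follows the prototype quoted above, while the sketch_* names, the enable flag and the recording callback are assumptions made for illustration. The real register function in the series may take a different argument.

```c
#include <stddef.h>

/* Callback type, following the prototype quoted in the text above. */
typedef void (*mcount_func_t)(unsigned long ip, unsigned long parent_ip);

/* Hypothetical stand-in for the mcount_enabled switch. */
static int sketch_mcount_enabled;

/* Currently registered callback, or NULL when nothing is registered. */
static mcount_func_t sketch_trace_function;

static void sketch_register_mcount_function(mcount_func_t func)
{
	sketch_trace_function = func;
}

static void sketch_clear_mcount_function(void)
{
	sketch_trace_function = NULL;
}

/* Models what the mcount stub does on entry to each traced function:
 * both the switch and a registered callback are needed. */
static void sketch_mcount(unsigned long ip, unsigned long parent_ip)
{
	if (sketch_mcount_enabled && sketch_trace_function)
		sketch_trace_function(ip, parent_ip);
}

/* Example callback: remember the last call site and count the hits. */
static unsigned long last_ip, last_parent_ip, hits;

static void record_call(unsigned long ip, unsigned long parent_ip)
{
	last_ip = ip;
	last_parent_ip = parent_ip;
	hits++;
}
```

Calling sketch_register_mcount_function(record_call) and setting sketch_mcount_enabled to 1 makes every sketch_mcount() call land in record_call(). Clearing either one stops it again.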

The x86_64 version has the assembly call the registered function directly
to save having to do a double function call.

To enable mcount, a sysctl is added:

   /proc/sys/kernel/mcount_enabled

Once mcount is enabled, when a function is registered, it will be called by
all functions. The tracer in this patch series shows how this is done.
It adds a directory in the debugfs, called mctracer, with a ctrl file that
allows the user to have the tracer register its function.  Note, the order
of enabling mcount and registering a function is not important, but both
must be done to initiate the tracing. That is, you can disable tracing
by either disabling mcount or by clearing the registered function.

When one function is registered, it is called directly from the mcount
asse

[PATCH 20/22 -v7] Add markers to various events

2008-01-29 Thread Steven Rostedt

This patch adds markers to various events in the kernel.
(interrupts, task activation and hrtimers)
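For readers unfamiliar with markers, the following user-space sketch shows the general shape of a marker site and a probe that pulls its arguments out of a va_list. It is only an illustration: sketch_trace_mark(), the probe pointer and struct irq_event are hypothetical names, and the real trace_mark() machinery is considerably more involved.

```c
#include <stdarg.h>
#include <stddef.h>

/* What the example probe records; a made-up structure. */
struct irq_event {
	unsigned long ip;
	int irq;
	int seen;
};

static struct irq_event last_irq_event;

/* A probe receives the marker's format string and its arguments. */
typedef void (*probe_func_t)(const char *fmt, va_list ap);

/* NULL means the marker is disarmed and costs only a pointer test. */
static probe_func_t irq_probe;

/* Probe for a marker declared with the format "ip %lx irq %d":
 * arguments must be fetched in the order and with the types it names. */
static void probe_do_irq(const char *fmt, va_list ap)
{
	(void)fmt;
	last_irq_event.ip = va_arg(ap, unsigned long);
	last_irq_event.irq = va_arg(ap, int);
	last_irq_event.seen++;
}

/* Simplified marker site: forward the arguments if a probe is armed. */
static void sketch_trace_mark(const char *fmt, ...)
{
	va_list ap;

	if (!irq_probe)
		return;

	va_start(ap, fmt);
	irq_probe(fmt, ap);
	va_end(ap);
}
```

The format string is what ties the marker site to its probe: if the two disagree on argument types, va_arg() reads garbage.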

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/x86/kernel/apic_32.c  |2 ++
 arch/x86/kernel/irq_32.c   |1 +
 arch/x86/kernel/irq_64.c   |2 ++
 arch/x86/kernel/traps_32.c |2 ++
 arch/x86/kernel/traps_64.c |2 ++
 arch/x86/mm/fault_32.c |3 +++
 arch/x86/mm/fault_64.c |3 +++
 kernel/hrtimer.c   |7 +++
 kernel/sched.c |   11 +++
 9 files changed, 33 insertions(+)

Index: linux-mcount.git/arch/x86/kernel/apic_32.c
===
--- linux-mcount.git.orig/arch/x86/kernel/apic_32.c 2008-01-28 08:37:49.0 -0500
+++ linux-mcount.git/arch/x86/kernel/apic_32.c  2008-01-28 09:54:49.0 -0500
@@ -581,6 +581,8 @@ notrace fastcall void smp_apic_timer_int
 {
struct pt_regs *old_regs = set_irq_regs(regs);
 
+   trace_mark(arch_apic_timer, "ip %lx", regs->eip);
+
/*
 * NOTE! We'd better ACK the irq immediately,
 * because timer handling can be slow.
Index: linux-mcount.git/arch/x86/kernel/irq_32.c
===
--- linux-mcount.git.orig/arch/x86/kernel/irq_32.c  2008-01-28 08:37:14.0 -0500
+++ linux-mcount.git/arch/x86/kernel/irq_32.c   2008-01-28 09:54:49.0 -0500
@@ -85,6 +85,7 @@ fastcall unsigned int do_IRQ(struct pt_r
 
old_regs = set_irq_regs(regs);
irq_enter();
+   trace_mark(arch_do_irq, "ip %lx irq %d", regs->eip, irq);
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
/* Debugging check for stack overflow: is there less than 1KB free? */
{
Index: linux-mcount.git/arch/x86/kernel/irq_64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/irq_64.c  2008-01-28 08:37:14.0 -0500
+++ linux-mcount.git/arch/x86/kernel/irq_64.c   2008-01-28 09:54:49.0 -0500
@@ -149,6 +149,8 @@ asmlinkage unsigned int do_IRQ(struct pt
irq_enter();
irq = __get_cpu_var(vector_irq)[vector];
 
+   trace_mark(arch_do_irq, "ip %lx irq %d", regs->rip, irq);
+
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
stack_overflow_check(regs);
 #endif
Index: linux-mcount.git/arch/x86/kernel/traps_32.c
===
--- linux-mcount.git.orig/arch/x86/kernel/traps_32.c 2008-01-28 08:37:17.0 -0500
+++ linux-mcount.git/arch/x86/kernel/traps_32.c 2008-01-28 09:54:49.0 -0500
@@ -769,6 +769,8 @@ fastcall __kprobes void do_nmi(struct pt
 
nmi_enter();
 
+   trace_mark(arch_do_nmi, "ip %lx flags %lx", regs->eip, regs->eflags);
+
cpu = smp_processor_id();
 
++nmi_count(cpu);
Index: linux-mcount.git/arch/x86/kernel/traps_64.c
===
--- linux-mcount.git.orig/arch/x86/kernel/traps_64.c 2008-01-28 08:37:14.0 -0500
+++ linux-mcount.git/arch/x86/kernel/traps_64.c 2008-01-28 09:54:49.0 -0500
@@ -782,6 +782,8 @@ asmlinkage __kprobes void default_do_nmi
 
cpu = smp_processor_id();
 
+   trace_mark(arch_do_nmi, "ip %lx flags %lx", regs->rip, regs->eflags);
+
/* Only the BSP gets external NMIs from the system.  */
if (!cpu)
reason = get_nmi_reason();
Index: linux-mcount.git/arch/x86/mm/fault_32.c
===
--- linux-mcount.git.orig/arch/x86/mm/fault_32.c 2008-01-28 08:37:14.0 -0500
+++ linux-mcount.git/arch/x86/mm/fault_32.c 2008-01-28 09:54:49.0 -0500
@@ -311,6 +311,9 @@ fastcall void __kprobes do_page_fault(st
/* get the address */
 address = read_cr2();
 
+   trace_mark(arch_do_page_fault, "ip %lx err %lx addr %lx",
+  regs->eip, error_code, address);
+
tsk = current;
 
si_code = SEGV_MAPERR;
Index: linux-mcount.git/arch/x86/mm/fault_64.c
===
--- linux-mcount.git.orig/arch/x86/mm/fault_64.c 2008-01-28 08:37:14.0 -0500
+++ linux-mcount.git/arch/x86/mm/fault_64.c 2008-01-28 09:54:49.0 -0500
@@ -316,6 +316,9 @@ asmlinkage void __kprobes do_page_fault(
/* get the address */
address = read_cr2();
 
+   trace_mark(arch_do_page_fault, "ip %lx err %lx addr %lx",
+  regs->rip, error_code, address);
+
info.si_code = SEGV_MAPERR;
 
 
Index: linux-mcount.git/kernel/hrtimer.c
===
--- linux-mcount.git.orig/kernel/hrtimer.c  2008-01-28 08:37:14.0 -0500
+++ linux-mcount.git/kernel/hrtimer.c   2008-01-28 09:54:49.0 -0500
@@ -709,6 +709,8 @@ static void enqueue_hrtimer(struct hrtim
struct hrtimer *entry;
int leftmost = 1

[PATCH 12/22 -v7] Make the task State char-string visible to all

2008-01-29 Thread Steven Rostedt

The tracer wants to be able to convert the state number
into a user visible character. This patch pulls that conversion
string out of the scheduler into the header. This way if it were to
ever change, other parts of the kernel will know.
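A minimal sketch of how a consumer might use the shared string, assuming that the task state is a bitmask in which 0 means running and the lowest set bit selects the character. That mapping is an assumption of this example, and task_state_to_char() is a hypothetical helper, not something the patch adds.

```c
#define TASK_STATE_TO_CHAR_STR "RSDTtZX"

/* Map a task state bitmask to its one-letter code.
 * Assumption: state 0 maps to index 0, and a state whose lowest set
 * bit is bit n maps to index n + 1. */
char task_state_to_char(unsigned long state)
{
	static const char state_chars[] = TASK_STATE_TO_CHAR_STR;
	unsigned int idx = 0;

	/* Walk up to the lowest set bit, counting positions from 1. */
	while (state) {
		idx++;
		if (state & 1)
			break;
		state >>= 1;
	}

	/* sizeof() includes the trailing NUL, so the last valid index
	 * is sizeof - 2; anything beyond that is an unknown state. */
	if (idx >= sizeof(state_chars) - 1)
		return '?';

	return state_chars[idx];
}
```

Under that assumption, state 0 yields 'R', state 1 yields 'S' and state 2 yields 'D'. A state bit beyond the end of the string yields '?'.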

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 include/linux/sched.h |2 ++
 kernel/sched.c|2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-mcount.git/include/linux/sched.h
===
--- linux-mcount.git.orig/include/linux/sched.h 2008-01-25 21:46:55.0 -0500
+++ linux-mcount.git/include/linux/sched.h  2008-01-25 21:47:21.0 -0500
@@ -2055,6 +2055,8 @@ static inline void migration_init(void)
 }
 #endif
 
+#define TASK_STATE_TO_CHAR_STR "RSDTtZX"
+
 #endif /* __KERNEL__ */
 
 #endif
Index: linux-mcount.git/kernel/sched.c
===
--- linux-mcount.git.orig/kernel/sched.c 2008-01-25 21:47:19.0 -0500
+++ linux-mcount.git/kernel/sched.c 2008-01-25 21:47:21.0 -0500
@@ -5149,7 +5149,7 @@ out_unlock:
return retval;
 }
 
-static const char stat_nam[] = "RSDTtZX";
+static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
 
 void sched_show_task(struct task_struct *p)
 {


[PATCH 08/22 -v7] add get_monotonic_cycles

2008-01-29 Thread Steven Rostedt

The latency tracer needs a way to get an accurate time
without grabbing any locks. Locks themselves might call
the latency tracer and cause at best a slow down.

This patch adds get_monotonic_cycles that returns cycles
from a reliable clock source in a monotonic fashion.
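The idea of a lock-free monotonic read can be sketched in user space with two alternating base slots and C11 atomics. This is a simplified model under stated assumptions, not the patch itself: struct sketch_clock and the sketch_* functions are hypothetical, the current counter value is passed in as "now" instead of being read from hardware, and a single writer is assumed to be serialized by the caller.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t cycle_t;

/* One snapshot: the raw counter value at the last update and the
 * accumulated 64-bit total at that moment. */
struct cycle_base {
	cycle_t cycle_base_last;
	cycle_t cycle_base;
};

struct sketch_clock {
	cycle_t mask;			/* width of the raw counter */
	struct cycle_base base[2];	/* two slots, used alternately */
	atomic_int base_num;		/* index of the slot readers use */
};

/* Writer side: fill the slot readers are not using, then publish it.
 * Assumes only one writer runs at a time. */
void sketch_accumulate(struct sketch_clock *cs, cycle_t now)
{
	int cur = atomic_load_explicit(&cs->base_num, memory_order_relaxed);
	int num = 1 - cur;
	cycle_t offset = (now - cs->base[cur].cycle_base_last) & cs->mask;

	cs->base[num].cycle_base = cs->base[cur].cycle_base + offset;
	cs->base[num].cycle_base_last = now;

	/* Release: the slot contents become visible before the index. */
	atomic_store_explicit(&cs->base_num, num, memory_order_release);
}

/* Reader side: no lock, just pick up the published slot. */
cycle_t sketch_get_basecycles(struct sketch_clock *cs, cycle_t now)
{
	int num = atomic_load_explicit(&cs->base_num, memory_order_acquire);
	cycle_t offset = (now - cs->base[num].cycle_base_last) & cs->mask;

	return cs->base[num].cycle_base + offset;
}
```

With a 16-bit mask, accumulating at 0xfff0 and then reading at raw value 0x0005 gives 0x10005: the masked subtraction absorbs the wrap of the narrow counter. A reader that stalls across two consecutive updates could still see its slot rewritten underneath it, so the window between picking the slot and using it has to stay short.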

Signed-off-by: John Stultz <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 include/linux/clocksource.h |   54 +---
 kernel/time/timekeeping.c   |   26 +++--
 2 files changed, 70 insertions(+), 10 deletions(-)

Index: linux-mcount.git/include/linux/clocksource.h
===
--- linux-mcount.git.orig/include/linux/clocksource.h   2008-01-25 21:47:11.0 -0500
+++ linux-mcount.git/include/linux/clocksource.h 2008-01-25 21:47:13.0 -0500
@@ -88,8 +88,16 @@ struct clocksource {
 */
struct {
cycle_t cycle_last, cycle_accumulated;
-   } cacheline_aligned_in_smp;
 
+   /* base structure provides lock-free read
+* access to a virtualized 64bit counter
+* Uses RCU-like update.
+*/
+   struct {
+   cycle_t cycle_base_last, cycle_base;
+   } base[2];
+   int base_num;
+   } cacheline_aligned_in_smp;
u64 xtime_nsec;
s64 error;
 
@@ -175,19 +183,30 @@ static inline cycle_t clocksource_read(s
 }
 
 /**
- * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
+ * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
  * @cs:pointer to clocksource being read
  * @now:   current cycle value
  *
  * Uses the clocksource to return the current cycle_t value.
  * NOTE!!!: This is different from clocksource_read, because it
- * returns the accumulated cycle value! Must hold xtime lock!
+ * returns a 64bit wide accumulated value.
  */
 static inline cycle_t
-clocksource_get_cycles(struct clocksource *cs, cycle_t now)
+clocksource_get_basecycles(struct clocksource *cs)
 {
-   cycle_t offset = (now - cs->cycle_last) & cs->mask;
-   offset += cs->cycle_accumulated;
+   int num;
+   cycle_t now, offset;
+
+   preempt_disable();
+   num = cs->base_num;
+   /* base_num is shared, and some archs are wacky */
+   smp_read_barrier_depends();
+   now = clocksource_read(cs);
+   offset = (now - cs->base[num].cycle_base_last);
+   offset &= cs->mask;
+   offset += cs->base[num].cycle_base;
+   preempt_enable();
+
return offset;
 }
 
@@ -197,11 +216,27 @@ clocksource_get_cycles(struct clocksourc
  * @now:   current cycle value
  *
  * Used to avoids clocksource hardware overflow by periodically
- * accumulating the current cycle delta. Must hold xtime write lock!
+ * accumulating the current cycle delta. Uses RCU-like update, but
+ * ***still requires the xtime_lock is held for writing!***
  */
 static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
 {
-   cycle_t offset = (now - cs->cycle_last) & cs->mask;
+   /*
+* First update the monotonic base portion.
+* The dual array update method allows for lock-free reading.
+* 'num' is always 1 or 0.
+*/
+   int num = 1 - cs->base_num;
+   cycle_t offset = (now - cs->base[1-num].cycle_base_last);
+   offset &= cs->mask;
+   cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset;
+   cs->base[num].cycle_base_last = now;
+   /* make sure this array is visible to the world first */
+   smp_wmb();
+   cs->base_num = num;
+
+   /* Now update the cycle_accumulated portion */
+   offset = (now - cs->cycle_last) & cs->mask;
cs->cycle_last = now;
cs->cycle_accumulated += offset;
 }
@@ -272,6 +307,9 @@ extern int clocksource_register(struct c
 extern struct clocksource* clocksource_get_next(void);
 extern void clocksource_change_rating(struct clocksource *cs, int rating);
 extern void clocksource_resume(void);
+extern cycle_t get_monotonic_cycles(void);
+extern unsigned long cycles_to_usecs(cycle_t cycles);
+extern cycle_t usecs_to_cycles(unsigned long usecs);
 
 /* used to initialize clock */
 extern struct clocksource clocksource_jiffies;
Index: linux-mcount.git/kernel/time/timekeeping.c
===
--- linux-mcount.git.orig/kernel/time/timekeeping.c 2008-01-25 21:47:11.0 -0500
+++ linux-mcount.git/kernel/time/timekeeping.c  2008-01-25 21:47:13.0 -0500
@@ -71,10 +71,12 @@ static struct clocksource *clock = &cloc
  */
 static inline s64 __get_nsec_offset(void)
 {
-   cycle_t cycle_delta;
+   cycle_t now, cycle_delta;
s64 ns_offset;
 
-   cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
+   now = clocksource_read(clock);
+   cycle

[PATCH 10/22 -v7] mcount based trace in the form of a header file library

2008-01-29 Thread Steven Rostedt

This is a simple trace that uses the mcount infrastructure. It is
designed to be fast and small, and easy to use. It is useful to
record things that happen over a very short period of time, and
not to analyze the system in general.

An interface is added to the debugfs

  /debugfs/tracing/

This patch adds the following files:

  available_tracers
 list of available tracers. Currently only "function" is
 available.

  current_tracer
 The trace that is currently active. Empty on start up.
 To switch to a tracer simply echo one of the tracers that
 are listed in available_tracers:

  echo function > /debugfs/tracing/current_tracer

 To disable the tracer:

   echo disable > /debugfs/tracing/current_tracer


  trace_ctrl
 echoing "1" into this file starts the mcount function tracing
  (if sysctl kernel.mcount_enabled=1)
 echoing "0" turns it off.

  latency_trace
  This file is readonly and holds the result of the trace.

  trace
      This file outputs an easier to read version of the trace.

  iter_ctrl
  Controls the way the output of traces look.
      So far there are two controls:
echoing in "symonly" will only show the kallsyms variables
without the addresses (if kallsyms was configured)
echoing in "verbose" will change the output to show
a lot more data, but not very easy to understand by
humans.
echoing in "nosymonly" turns off symonly.
echoing in "noverbose" turns off verbose.
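The echo commands above amount to writing short strings into control files. A small C sketch of the same sequence follows, assuming the file names listed above; write_control() and start_function_tracer() are hypothetical helpers, and the directory is passed in rather than hard-coded because the debugfs mount point varies between systems.

```c
#include <stdio.h>

/* Write a short string to a control file.
 * Returns 0 on success, -1 on any error. */
int write_control(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;

	if (fputs(value, f) == EOF) {
		fclose(f);
		return -1;
	}

	/* fclose() flushes, so a failed write can surface here. */
	return fclose(f) == 0 ? 0 : -1;
}

/* Select the "function" tracer and then switch tracing on, mirroring
 * the two echo steps described above.  Returns 0 on success. */
int start_function_tracer(const char *tracing_dir)
{
	char path[256];
	int n;

	n = snprintf(path, sizeof(path), "%s/current_tracer", tracing_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	if (write_control(path, "function"))
		return -1;

	n = snprintf(path, sizeof(path), "%s/trace_ctrl", tracing_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	return write_control(path, "1");
}
```

A call such as start_function_tracer("/debugfs/tracing") would correspond to the commands shown, and it fails cleanly when the directory does not exist.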

The output of the function_trace file is as follows

  "echo noverbose > /debugfs/tracing/iter_ctrl"

preemption latency trace v1.1.5 on 2.6.24-rc7-tst

 latency: 0 us, #419428/4361791, CPU#1 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
-
| task: -0 (uid:0 nice:0 policy:0 rt_prio:0)
-

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
 swapper-0 0d.h. 1595128us+: set_normalized_timespec+0x8/0x2d  (ktime_get_ts+0x4a/0x4e )
 swapper-0 0d.h. 1595131us+: _spin_lock+0x8/0x18  (hrtimer_interrupt+0x6e/0x1b0 )

Or with verbose turned on:

  "echo verbose > /debugfs/tracing/iter_ctrl"

preemption latency trace v1.1.5 on 2.6.24-rc7-tst

 latency: 0 us, #419428/4361791, CPU#1 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
-
| task: -0 (uid:0 nice:0 policy:0 rt_prio:0)
-

 swapper 0 0 9   [f3675f41] 1595.128ms (+0.003ms): set_normalized_timespec+0x8/0x2d  (ktime_get_ts+0x4a/0x4e )
 swapper 0 0 9  0001 [f3675f45] 1595.131ms (+0.003ms): _spin_lock+0x8/0x18  (hrtimer_interrupt+0x6e/0x1b0 )
 swapper 0 0 9  0002 [f3675f48] 1595.135ms (+0.003ms): _spin_lock+0x8/0x18  (hrtimer_interrupt+0x6e/0x1b0 )


The "trace" file is not affected by the verbose mode, but is by the symonly.

 echo "nosymonly" > /debugfs/tracing/iter_ctrl

tracer:
[   81.479967] CPU 0: bash:3154 register_mcount_function+0x5f/0x66  <-- _spin_unlock_irqrestore+0xe/0x5a
[   81.479967] CPU 0: bash:3154 _spin_unlock_irqrestore+0x3e/0x5a  <-- sub_preempt_count+0xc/0x7a
[   81.479968] CPU 0: bash:3154 sub_preempt_count+0x30/0x7a  <-- in_lock_functions+0x9/0x24
[   81.479968] CPU 0: bash:3154 vfs_write+0x11d/0x155  <-- dnotify_parent+0x12/0x78
[   81.479968] CPU 0: bash:3154 dnotify_parent+0x2d/0x78  <-- _spin_lock+0xe/0x70
[   81.479969] CPU 0: bash:3154 _spin_lock+0x1b/0x70  <-- add_preempt_count+0xe/0x77
[   81.479969] CPU 0: bash:3154 add_preempt_count+0x3e/0x77  <-- in_lock_functions+0x9/0x24


 echo "symonly" > /debugfs/tracing/iter_ctrl

tracer:
[   81.479913] CPU 0: bash:3154 register_mcount_function+0x5f/0x66 <-- _spin_unlock_irqrestore+0xe/0x5a
[   81.479913] CPU 0: bash:3154 _spin_unlock_irqrestore+0x3e/0x5a <-- sub_preempt_count+0xc/0x7a
[   81.479913] CPU 0: bash:3154 sub_preempt_count+0x30/0x7a <-- in_lock_functions+0x9/0x24
[   81.479914] CPU 0: bash:3154 vfs_write+0x11d/0x155 <-- dnotify_parent+0x12/0x78
[   81.479914] CPU 0: bash:3154 dnotify_parent+0x2d/0x78 <-- _spin_lock+0xe/0x70
[   81.479914] CPU 0: bash:3154 _spin_lock+0x1b/0x70 <-- add_preempt_count+0xe/0x77
[   81.479914] CPU 0: bash:3154 add_preempt_count+0x3e/0x77 <-- in_lock_functions+0x9/0x24


Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]>
---
 lib/Makefile |1 
 lib/tracing/Kconfig  |   15 
 lib/tracing/Makefile |3 
 lib/tracing/trace_function.c |   72 ++
 lib/tracing/tracer.c | 1160 +++

[PATCH 07/22 -v7] initialize the clock source to jiffies clock.

2008-01-29 Thread Steven Rostedt

The latency tracer can call clocksource_read very early in bootup,
before the clock source variable has been initialized. This results in a
crash at boot up (even before earlyprintk is initialized), since the
clock->read variable points to NULL.

This patch simply initializes the clock to use clocksource_jiffies, so
that any early user of clocksource_read will not crash.
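The fix is an instance of a common pattern: give a "current implementation" pointer a safe static default so that early callers never go through NULL. A stand-alone sketch of the pattern, with all sketch_* names being stand-ins invented for this example rather than kernel symbols:

```c
typedef unsigned long long cycle_t;

struct sketch_clocksource {
	const char *name;
	cycle_t (*read)(void);
};

/* Stand-in for the jiffies counter. */
static unsigned long sketch_jiffies;

static cycle_t sketch_jiffies_read(void)
{
	return (cycle_t)sketch_jiffies;
}

/* Always-available fallback clock. */
static struct sketch_clocksource sketch_clocksource_jiffies = {
	"jiffies", sketch_jiffies_read,
};

/* A "better" clock that only becomes available later in boot. */
static cycle_t sketch_fast_read(void)
{
	return 1000000ULL;
}

static struct sketch_clocksource sketch_clocksource_fast = {
	"fast", sketch_fast_read,
};

/* Statically initialized to the fallback instead of being left NULL,
 * so a read before sketch_install_clock() is still safe. */
static struct sketch_clocksource *sketch_clock = &sketch_clocksource_jiffies;

static cycle_t sketch_clocksource_read(void)
{
	return sketch_clock->read();
}

static void sketch_install_clock(struct sketch_clocksource *cs)
{
	sketch_clock = cs;
}
```

An early reader gets coarse jiffies-based values until sketch_install_clock(&sketch_clocksource_fast) runs, which is harmless compared to a NULL call.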

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
Acked-by: John Stultz <[EMAIL PROTECTED]>
---
 include/linux/clocksource.h |3 +++
 kernel/time/timekeeping.c   |9 +++--
 2 files changed, 10 insertions(+), 2 deletions(-)

Index: linux-mcount.git/include/linux/clocksource.h
===
--- linux-mcount.git.orig/include/linux/clocksource.h   2008-01-25 21:47:09.0 -0500
+++ linux-mcount.git/include/linux/clocksource.h 2008-01-25 21:47:11.0 -0500
@@ -273,6 +273,9 @@ extern struct clocksource* clocksource_g
 extern void clocksource_change_rating(struct clocksource *cs, int rating);
 extern void clocksource_resume(void);
 
+/* used to initialize clock */
+extern struct clocksource clocksource_jiffies;
+
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL
 extern void update_vsyscall(struct timespec *ts, struct clocksource *c);
 extern void update_vsyscall_tz(void);
Index: linux-mcount.git/kernel/time/timekeeping.c
===================================================================
--- linux-mcount.git.orig/kernel/time/timekeeping.c	2008-01-25 21:47:09.000000000 -0500
+++ linux-mcount.git/kernel/time/timekeeping.c	2008-01-25 21:47:11.000000000 -0500
@@ -53,8 +53,13 @@ static inline void update_xtime_cache(u6
timespec_add_ns(&xtime_cache, nsec);
 }
 
-static struct clocksource *clock; /* pointer to current clocksource */
-
+/*
+ * pointer to current clocksource
+ *  Just in case we use clocksource_read before we initialize
+ *  the actual clock source. Instead of calling a NULL read pointer
+ *  we return jiffies.
+ */
+static struct clocksource *clock = &clocksource_jiffies;
 
 #ifdef CONFIG_GENERIC_TIME
 /**


[PATCH 11/22 -v7] Add context switch marker to sched.c

2008-01-29 Thread Steven Rostedt

Add a marker to context_switch() to record the prev and next tasks.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 kernel/sched.c |2 ++
 1 file changed, 2 insertions(+)

Index: linux-mcount.git/kernel/sched.c
===================================================================
--- linux-mcount.git.orig/kernel/sched.c	2008-01-25 21:46:55.000000000 -0500
+++ linux-mcount.git/kernel/sched.c	2008-01-25 21:47:19.000000000 -0500
@@ -2198,6 +2198,8 @@ context_switch(struct rq *rq, struct tas
struct mm_struct *mm, *oldmm;
 
prepare_task_switch(rq, prev, next);
+   trace_mark(kernel_sched_schedule,
+  "prev %p next %p", prev, next);
mm = next->mm;
oldmm = prev->active_mm;
/*


[PATCH 19/22 -v7] trace preempt off critical timings

2008-01-29 Thread Steven Rostedt

Add preempt off timings. A lot of kernel core code is taken from the RT patch
latency trace that was written by Ingo Molnar.

This adds "preemptoff" and "preemptirqsoff" to
/debugfs/tracing/available_tracers.

Now, instead of just tracing irqs off, preemption off can be selected
to be recorded.

When this is selected, it shares the same files as irqs off timings.
One can trace preemption off, irqs off, or either one being off.

Echoing "preemptoff" into /debugfs/tracing/current_tracer records
preempt off times only. "irqsoff" records only the time
irqs are disabled, while "preemptirqsoff" records the total time irqs
or preemption are disabled. Runtime switching of these options is now
supported by simply echoing the appropriate tracer name into
/debugfs/tracing/current_tracer.
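
The selection logic described above can be sketched in plain userspace C.
The two flag values match the enum in the patch below; the helper names
trace_type_from_name and should_time, and the exact name-to-flag mapping,
are assumptions made for illustration, since the corresponding part of the
patch is not shown in full here.

```c
#include <assert.h>
#include <string.h>

/* Flag values as in the patch's trace_irqsoff.c enum. */
enum {
	TRACER_IRQS_OFF    = (1 << 1),
	TRACER_PREEMPT_OFF = (1 << 2),
};

/*
 * Hypothetical helper: map a tracer name, as it would be echoed into
 * current_tracer, to the flag bits it enables. Returns 0 if unknown.
 */
static int trace_type_from_name(const char *name)
{
	assert(name != NULL);

	if (strcmp(name, "irqsoff") == 0)
		return TRACER_IRQS_OFF;
	if (strcmp(name, "preemptoff") == 0)
		return TRACER_PREEMPT_OFF;
	if (strcmp(name, "preemptirqsoff") == 0)
		return TRACER_IRQS_OFF | TRACER_PREEMPT_OFF;
	return 0;
}

/*
 * Hypothetical helper: should the current section be timed, given
 * the selected flags and whether irqs/preemption are currently off?
 */
static int should_time(int trace_type, int irqs_off, int preempt_off)
{
	return ((trace_type & TRACER_IRQS_OFF) && irqs_off) ||
	       ((trace_type & TRACER_PREEMPT_OFF) && preempt_off);
}
```

With "preemptirqsoff" selected, a section counts as critical as soon as
either condition holds, which is why that tracer reports the combined time.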

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/x86/kernel/process_32.c |3 
 include/linux/irqflags.h |3 
 include/linux/mcount.h   |8 +
 include/linux/preempt.h  |2 
 kernel/sched.c   |   24 +
 lib/tracing/Kconfig  |   25 +
 lib/tracing/Makefile |1 
 lib/tracing/trace_irqsoff.c  |  183 +++
 8 files changed, 196 insertions(+), 53 deletions(-)

Index: linux-mcount.git/lib/tracing/Kconfig
===================================================================
--- linux-mcount.git.orig/lib/tracing/Kconfig	2008-01-29 15:05:34.000000000 -0500
+++ linux-mcount.git/lib/tracing/Kconfig	2008-01-29 15:22:31.000000000 -0500
@@ -46,6 +46,31 @@ config CRITICAL_IRQSOFF_TIMING
 
  echo 0 > /debugfs/tracing/tracing_max_latency
 
+ (Note that kernel size and overhead increases with this option
+ enabled. This option and the preempt-off timing option can be
+ used together or separately.)
+
+config CRITICAL_PREEMPT_TIMING
+   bool "Preemption-off critical section latency timing"
+   default n
+   depends on GENERIC_TIME
+   depends on PREEMPT
+   select TRACING
+   select TRACER_MAX_TRACE
+   help
+ This option measures the time spent in preemption off critical
+ sections, with microsecond accuracy.
+
+ The default measurement method is a maximum search, which is
+ disabled by default and can be runtime (re-)started
+ via:
+
+ echo 0 > /debugfs/tracing/tracing_max_latency
+
+ (Note that kernel size and overhead increases with this option
+ enabled. This option and the irqs-off timing option can be
+ used together or separately.)
+
 config WAKEUP_TRACER
bool "Trace wakeup latencies"
depends on DEBUG_KERNEL
Index: linux-mcount.git/lib/tracing/Makefile
===================================================================
--- linux-mcount.git.orig/lib/tracing/Makefile	2008-01-29 15:05:34.000000000 -0500
+++ linux-mcount.git/lib/tracing/Makefile	2008-01-29 15:22:31.000000000 -0500
@@ -4,6 +4,7 @@ obj-$(CONFIG_TRACING) += tracer.o
 obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o
 obj-$(CONFIG_FUNCTION_TRACER) += trace_function.o
 obj-$(CONFIG_CRITICAL_IRQSOFF_TIMING) += trace_irqsoff.o
+obj-$(CONFIG_CRITICAL_PREEMPT_TIMING) += trace_irqsoff.o
 obj-$(CONFIG_WAKEUP_TRACER) += trace_wakeup.o
 
 libmcount-y := mcount.o
Index: linux-mcount.git/lib/tracing/trace_irqsoff.c
===================================================================
--- linux-mcount.git.orig/lib/tracing/trace_irqsoff.c	2008-01-29 15:05:34.000000000 -0500
+++ linux-mcount.git/lib/tracing/trace_irqsoff.c	2008-01-29 15:25:28.000000000 -0500
@@ -21,6 +21,34 @@ static struct tracing_trace *tracer_trac
 static __cacheline_aligned_in_smp DEFINE_MUTEX(max_mutex);
 static int trace_enabled __read_mostly;
 
+static DEFINE_PER_CPU(int, tracing_cpu);
+
+enum {
+   TRACER_IRQS_OFF = (1 << 1),
+   TRACER_PREEMPT_OFF  = (1 << 2),
+};
+
+static int trace_type __read_mostly;
+
+#ifdef CONFIG_CRITICAL_PREEMPT_TIMING
+# define preempt_trace() \
+   ((trace_type & TRACER_PREEMPT_OFF) && preempt_count())
+#else
+# define preempt_trace() (0)
+#endif
+
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+# define irq_trace()   \
+   ((trace_type & TRACER_IRQS_OFF) &&  \
+({ \
+unsigned long __flags; \
+local_save_flags(__flags); \
+irqs_disabled_flags(__flags);  \
+}))
+#else
+# define irq_trace() (0)
+#endif
+
 /*
  * Sequence count - we record it when starting a measurement and
  * skip the latency if the sequence has changed - some other section
@@ -41,14 +69,11 @@ static void notrace irqsoff_trace_call(u
unsigned long flags;
int cpu;
 
-   if (likely(!trace_enabled))
+   if (likely(!__get_cpu_var(tracing_cpu)))
return;
 
local_save_flags(flags);
 
-   if (!irqs_disable

[PATCH 22/22 -v7] Critical latency timings histogram

2008-01-29 Thread Steven Rostedt

This patch adds hooks into the latency tracer to give
us histograms of interrupts off, preemption off and
wakeup timings.

This code is based on work done by Yi Yang <[EMAIL PROTECTED]>,
but heavily modified to work with the new tracer, with some
cleanups by Steven Rostedt <[EMAIL PROTECTED]>.

This adds the following to /debugfs/tracing:

  latency_hist/ - root dir for histograms.

  Under latency_hist there is (depending on what's configured):

    interrupt_off_latency/ - latency histograms of interrupts off.

    preempt_interrupts_off_latency/ - latency histograms of
                              preemption and/or interrupts off.

    preempt_off_latency/ - latency histograms of preemption off.

    wakeup_latency/ - latency histograms of wakeup timings.

  Under each of the above are the following files:

    CPU# - one for each possible CPU, where # is the CPU number.

    reset - writing into this file will reset the histogram
            back to zeros and start again.
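
To make the idea of a latency histogram with a reset concrete, here is a
minimal userspace sketch. It is not the tracer_hist.c implementation; the
struct layout, the bucket count HIST_BUCKETS, the one-microsecond bucket
width and the function names are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: one bucket per microsecond, last bucket is overflow. */
#define HIST_BUCKETS 16

struct latency_hist {
	unsigned long bucket[HIST_BUCKETS];
	unsigned long samples;	/* total number of recorded latencies */
	unsigned long max_us;	/* largest latency seen so far */
};

/* Record one measured latency, given in microseconds. */
static void hist_record(struct latency_hist *h, unsigned long latency_us)
{
	unsigned long idx;

	assert(h != NULL);

	/* Everything at or beyond the last bucket lands in the overflow. */
	idx = latency_us < HIST_BUCKETS ? latency_us : HIST_BUCKETS - 1;
	h->bucket[idx]++;
	h->samples++;
	if (latency_us > h->max_us)
		h->max_us = latency_us;
}

/* What a write to the "reset" file would do: zero and start again. */
static void hist_reset(struct latency_hist *h)
{
	assert(h != NULL);
	memset(h, 0, sizeof(*h));
}
```

A per-CPU variant would simply keep one such structure per possible CPU,
matching the one-file-per-CPU layout described above.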


Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 lib/tracing/Kconfig |   20 +
 lib/tracing/Makefile|4 
 lib/tracing/trace_irqsoff.c |   19 +
 lib/tracing/trace_wakeup.c  |   21 +
 lib/tracing/tracer_hist.c   |  514 
 lib/tracing/tracer_hist.h   |   39 +++
 6 files changed, 613 insertions(+), 4 deletions(-)

Index: linux-mcount.git/lib/tracing/Kconfig
===================================================================
--- linux-mcount.git.orig/lib/tracing/Kconfig	2008-01-29 21:34:14.000000000 -0500
+++ linux-mcount.git/lib/tracing/Kconfig	2008-01-29 21:34:30.000000000 -0500
@@ -102,3 +102,23 @@ config CONTEXT_SWITCH_TRACER
  This tracer hooks into the context switch and records
  all switching of tasks.
 
+config INTERRUPT_OFF_HIST
+   bool "Interrupts off critical timings histogram"
+   depends on CRITICAL_IRQSOFF_TIMING
+   help
+ This option uses the infrastructure of the critical
+ irqs off timings to create a histogram of latencies.
+
+config PREEMPT_OFF_HIST
+   bool "Preempt off critical timings histogram"
+   depends on CRITICAL_PREEMPT_TIMING
+   help
+ This option uses the infrastructure of the critical
+ preemption off timings to create a histogram of latencies.
+
+config WAKEUP_LATENCY_HIST
+   bool "Wakeup latency timings histogram"
+   depends on WAKEUP_TRACER
+   help
+ This option uses the infrastructure of the wakeup tracer
+ to create a histogram of latencies.
Index: linux-mcount.git/lib/tracing/Makefile
===================================================================
--- linux-mcount.git.orig/lib/tracing/Makefile	2008-01-29 21:34:14.000000000 -0500
+++ linux-mcount.git/lib/tracing/Makefile	2008-01-29 21:34:30.000000000 -0500
@@ -8,4 +8,8 @@ obj-$(CONFIG_CRITICAL_PREEMPT_TIMING) +=
 obj-$(CONFIG_WAKEUP_TRACER) += trace_wakeup.o
 obj-$(CONFIG_EVENT_TRACER) += trace_events.o
 
+obj-$(CONFIG_INTERRUPT_OFF_HIST) += tracer_hist.o
+obj-$(CONFIG_PREEMPT_OFF_HIST) += tracer_hist.o
+obj-$(CONFIG_WAKEUP_LATENCY_HIST) += tracer_hist.o
+
 libmcount-y := mcount.o
Index: linux-mcount.git/lib/tracing/trace_irqsoff.c
===================================================================
--- linux-mcount.git.orig/lib/tracing/trace_irqsoff.c	2008-01-29 21:34:14.000000000 -0500
+++ linux-mcount.git/lib/tracing/trace_irqsoff.c	2008-01-29 21:34:30.000000000 -0500
@@ -16,6 +16,7 @@
 #include 
 
 #include "tracer.h"
+#include "tracer_hist.h"
 
 static struct tracing_trace *tracer_trace __read_mostly;
 static __cacheline_aligned_in_smp DEFINE_MUTEX(max_mutex);
@@ -261,10 +262,14 @@ void notrace start_critical_timings(void
 {
if (preempt_trace() || irq_trace())
start_critical_timing(CALLER_ADDR0, 0);
+
+   tracing_hist_preempt_start();
 }
 
 void notrace stop_critical_timings(void)
 {
+   tracing_hist_preempt_stop(TRACE_STOP);
+
if (preempt_trace() || irq_trace())
stop_critical_timing(CALLER_ADDR0, 0);
 }
@@ -273,6 +278,8 @@ void notrace stop_critical_timings(void)
 #ifdef CONFIG_LOCKDEP
 void notrace time_hardirqs_on(unsigned long a0, unsigned long a1)
 {
+   tracing_hist_preempt_stop(1);
+
if (!preempt_trace() && irq_trace())
stop_critical_timing(a0, a1);
 }
@@ -281,6 +288,8 @@ void notrace time_hardirqs_off(unsigned 
 {
if (!preempt_trace() && irq_trace())
start_critical_timing(a0, a1);
+
+   tracing_hist_preempt_start();
 }
 
 #else /* !CONFIG_LOCKDEP */
@@ -314,6 +323,8 @@ inline void print_irqtrace_events(struct
  */
 void notrace trace_hardirqs_on(void)
 {
+   tracing_hist_preempt_stop(1);
+
if (!preempt_trace() && irq_trace())
stop_critical_timing(CALLER_ADDR0, 0);
 }
@@ -323,11 +334,15 @@ void notrace trace_hardirqs_off(void)
 {
if (!preempt_trace() && irq_trace())
start_cr

[PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation

2008-01-29 Thread Steven Rostedt

If CONFIG_MCOUNT is selected and /proc/sys/kernel/mcount_enabled is set to a
non-zero value, the mcount routine will be called every time we enter a kernel
function that is not marked with the "notrace" attribute.

The mcount routine will then call a registered function, if one
is registered.

[This code has been highly hacked by Steven Rostedt, so don't
 blame Arnaldo for all of this ;-) ]

Update:
  It is now possible to register more than one mcount function.
  If only one mcount function is registered, that will be the
  function that mcount calls directly. If more than one function
  is registered, then mcount will call a function that will loop
  through the functions to call.
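
The dispatch scheme in the update above can be sketched in userspace C as
follows. This is only an illustration of the "call directly when there is
one, loop when there are several" idea; trace_ops, trace_function,
register_trace and the counting callbacks are hypothetical names, and the
locking and memory barriers a kernel version would need are left out.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*trace_fn)(unsigned long ip, unsigned long parent_ip);

/* Hypothetical list node for one registered callback. */
struct trace_ops {
	trace_fn func;
	struct trace_ops *next;
};

/* Default target: does nothing when no callback is registered. */
static void trace_stub(unsigned long ip, unsigned long parent_ip)
{
	(void)ip;
	(void)parent_ip;
}

static struct trace_ops *ops_list;		/* registered callbacks */
static trace_fn trace_function = trace_stub;	/* what the hook calls */

/* Used only when more than one callback is registered. */
static void trace_list_func(unsigned long ip, unsigned long parent_ip)
{
	for (struct trace_ops *op = ops_list; op != NULL; op = op->next)
		op->func(ip, parent_ip);
}

/* Add a callback and pick the cheapest dispatch target. */
static void register_trace(struct trace_ops *ops)
{
	assert(ops != NULL && ops->func != NULL);

	ops->next = ops_list;
	ops_list = ops;

	/* One entry: call it directly. Several: go through the loop. */
	trace_function = ops->next ? trace_list_func : ops->func;
}

/* Two example callbacks that just count their invocations. */
static int hits_a, hits_b;

static void count_a(unsigned long ip, unsigned long parent_ip)
{
	(void)ip;
	(void)parent_ip;
	hits_a++;
}

static void count_b(unsigned long ip, unsigned long parent_ip)
{
	(void)ip;
	(void)parent_ip;
	hits_b++;
}
```

The point of the direct call is that the common single-tracer case pays for
one indirect call only, with no list walk on every traced function entry.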

Signed-off-by: Arnaldo Carvalho de Melo <[EMAIL PROTECTED]>
Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 Makefile   |3 
 arch/x86/Kconfig   |1 
 arch/x86/kernel/entry_32.S |   25 +++
 arch/x86/kernel/entry_64.S |   36 +++
 include/linux/linkage.h|2 
 include/linux/mcount.h |   38 
 kernel/sysctl.c|   11 +++
 lib/Kconfig.debug  |1 
 lib/Makefile   |2 
 lib/tracing/Kconfig|   10 +++
 lib/tracing/Makefile   |3 
 lib/tracing/mcount.c   |  141 +
 12 files changed, 273 insertions(+)

Index: linux-mcount.git/Makefile
===================================================================
--- linux-mcount.git.orig/Makefile	2008-01-29 17:01:56.000000000 -0500
+++ linux-mcount.git/Makefile	2008-01-29 17:26:17.000000000 -0500
@@ -509,6 +509,9 @@ endif
 
 include $(srctree)/arch/$(SRCARCH)/Makefile
 
+ifdef CONFIG_MCOUNT
+KBUILD_CFLAGS  += -pg
+endif
 ifdef CONFIG_FRAME_POINTER
 KBUILD_CFLAGS  += -fno-omit-frame-pointer -fno-optimize-sibling-calls
 else
Index: linux-mcount.git/arch/x86/Kconfig
===================================================================
--- linux-mcount.git.orig/arch/x86/Kconfig	2008-01-29 16:59:15.000000000 -0500
+++ linux-mcount.git/arch/x86/Kconfig	2008-01-29 17:26:18.000000000 -0500
@@ -19,6 +19,7 @@ config X86_64
 config X86
bool
default y
+   select HAVE_MCOUNT
 
 config GENERIC_TIME
bool
Index: linux-mcount.git/arch/x86/kernel/entry_32.S
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/entry_32.S	2008-01-29 16:59:15.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/entry_32.S	2008-01-29 17:26:18.000000000 -0500
@@ -75,6 +75,31 @@ DF_MASK		= 0x00000400 
 NT_MASK		= 0x00004000
 VM_MASK		= 0x00020000
 
+#ifdef CONFIG_MCOUNT
+.globl mcount
+mcount:
+   /* unlikely(mcount_enabled) */
+   cmpl $0, mcount_enabled
+   jnz trace
+   ret
+
+trace:
+   /* taken from glibc */
+   pushl %eax
+   pushl %ecx
+   pushl %edx
+   movl 0xc(%esp), %edx
+   movl 0x4(%ebp), %eax
+
+   call   *mcount_trace_function
+
+   popl %edx
+   popl %ecx
+   popl %eax
+
+   ret
+#endif
+
 #ifdef CONFIG_PREEMPT
 #define preempt_stop(clobbers) DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
 #else
Index: linux-mcount.git/arch/x86/kernel/entry_64.S
===================================================================
--- linux-mcount.git.orig/arch/x86/kernel/entry_64.S	2008-01-29 16:59:15.000000000 -0500
+++ linux-mcount.git/arch/x86/kernel/entry_64.S	2008-01-29 17:26:18.000000000 -0500
@@ -53,6 +53,42 @@
 
.code64
 
+#ifdef CONFIG_MCOUNT
+
+ENTRY(mcount)
+   /* unlikely(mcount_enabled) */
+   cmpl $0, mcount_enabled
+   jnz trace
+   retq
+
+trace:
+   /* taken from glibc */
+   subq $0x38, %rsp
+   movq %rax, (%rsp)
+   movq %rcx, 8(%rsp)
+   movq %rdx, 16(%rsp)
+   movq %rsi, 24(%rsp)
+   movq %rdi, 32(%rsp)
+   movq %r8, 40(%rsp)
+   movq %r9, 48(%rsp)
+
+   movq 0x38(%rsp), %rsi
+   movq 8(%rbp), %rdi
+
+   call   *mcount_trace_function
+
+   movq 48(%rsp), %r9
+   movq 40(%rsp), %r8
+   movq 32(%rsp), %rdi
+   movq 24(%rsp), %rsi
+   movq 16(%rsp), %rdx
+   movq 8(%rsp), %rcx
+   movq (%rsp), %rax
+   addq $0x38, %rsp
+
+   retq
+#endif
+
 #ifndef CONFIG_PREEMPT
 #define retint_kernel retint_restore_args
 #endif 
Index: linux-mcount.git/include/linux/linkage.h
===================================================================
--- linux-mcount.git.orig/include/linux/linkage.h	2008-01-29 16:59:15.000000000 -0500
+++ linux-mcount.git/include/linux/linkage.h	2008-01-29 17:26:18.000000000 -0500
@@ -3,6 +3,8 @@
 
 #include 
 
+#define notrace __attribute__((no_instrument_function))
+
 #ifdef __cplusplus
 #define CPP_ASMLINKAGE extern "C"
 #else
Index: linux-mcount.git/include/linux/mcount.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-mcount.git

[PATCH 01/22 -v7] printk - dont wakeup klogd with interrupts disabled

2008-01-29 Thread Steven Rostedt

[ This patch is added to the series since the wakeup timings trace
  may lock up without it. ]

I thought that one could place a printk anywhere without worrying.
But it seems that it is not wise to place a printk where the runqueue
lock is held.

I just spent two hours debugging why some of my code was locking up,
to find that the lockup was caused by some debugging printk's that
I had in the scheduler.  The printk's were only in rare paths so
they shouldn't be too much of a problem, but after I hit the printk
the system locked up.

Thinking that it was locking up on my code, I went looking down the
wrong path. I finally found (after examining an NMI dump) that
the lockup happened because printk was trying to wake up the klogd
daemon, which caused a deadlock when the try_to_wake_up code tried
to grab the runqueue lock.

This patch adds a runqueue_is_locked interface in sched.c for other
files to see if the current runqueue lock is held. This is used
in printk to determine whether or not it is safe to wake up klogd.
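
The deadlock and its fix can be modelled in a few lines of userspace C.
This is a single-threaded toy, not the kernel code: rq_locked stands in for
spin_is_locked(&rq->lock), the assert in rq_lock() stands in for the
self-deadlock, and log_message and wake_up_logger are hypothetical names for
the printk path and the klogd wakeup.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the runqueue spinlock state. */
static bool rq_locked;

/* Number of times the logger daemon was actually woken. */
static int klogd_wakeups;

static void rq_lock(void)
{
	/* Taking the lock twice on one CPU would deadlock in the kernel. */
	assert(!rq_locked);
	rq_locked = true;
}

static void rq_unlock(void)
{
	rq_locked = false;
}

/* Waking a task needs the runqueue lock, as the wakeup path does. */
static void wake_up_logger(void)
{
	rq_lock();
	klogd_wakeups++;
	rq_unlock();
}

/* Toy printk path: only wake the logger when that cannot deadlock. */
static void log_message(void)
{
	/* ... message would be stored in the log buffer here ... */
	if (!rq_locked)
		wake_up_logger();
}
```

When the lock is held the wakeup is simply skipped; the message is still
logged, and the logger is woken by a later call made without the lock.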

And with this patch, my code ran fine ;-)

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 include/linux/sched.h |2 ++
 kernel/printk.c   |   14 ++
 kernel/sched.c|   18 ++
 3 files changed, 30 insertions(+), 4 deletions(-)

Index: linux-mcount.git/kernel/printk.c
===================================================================
--- linux-mcount.git.orig/kernel/printk.c	2008-01-29 17:02:10.000000000 -0500
+++ linux-mcount.git/kernel/printk.c	2008-01-29 17:25:40.000000000 -0500
@@ -590,9 +590,11 @@ static int have_callable_console(void)
  * @fmt: format string
  *
  * This is printk().  It can be called from any context.  We want it to work.
- * Be aware of the fact that if oops_in_progress is not set, we might try to
- * wake klogd up which could deadlock on runqueue lock if printk() is called
- * from scheduler code.
+ *
+ * Note: if printk() is called with the runqueue lock held, it will not wake
+ * up the klogd. This is to avoid a deadlock from calling printk() in schedule
+ * with the runqueue lock held and having the wake_up grab the runqueue lock
+ * as well.
  *
  * We try to grab the console_sem.  If we succeed, it's easy - we log the output and
  * call the console drivers.  If we fail to get the semaphore we place the output
@@ -1001,7 +1003,11 @@ void release_console_sem(void)
console_locked = 0;
up(&console_sem);
spin_unlock_irqrestore(&logbuf_lock, flags);
-   if (wake_klogd)
+   /*
+* If we try to wake up klogd while printing with the runqueue lock
+* held, this will deadlock.
+*/
+   if (wake_klogd && !runqueue_is_locked())
wake_up_klogd();
 }
 EXPORT_SYMBOL(release_console_sem);
Index: linux-mcount.git/include/linux/sched.h
===================================================================
--- linux-mcount.git.orig/include/linux/sched.h	2008-01-29 17:02:10.000000000 -0500
+++ linux-mcount.git/include/linux/sched.h	2008-01-29 17:25:40.000000000 -0500
@@ -222,6 +222,8 @@ extern void sched_init_smp(void);
 extern void init_idle(struct task_struct *idle, int cpu);
 extern void init_idle_bootup_task(struct task_struct *idle);
 
+extern int runqueue_is_locked(void);
+
 extern cpumask_t nohz_cpu_mask;
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
 extern int select_nohz_load_balancer(int cpu);
Index: linux-mcount.git/kernel/sched.c
===================================================================
--- linux-mcount.git.orig/kernel/sched.c	2008-01-29 16:59:15.000000000 -0500
+++ linux-mcount.git/kernel/sched.c	2008-01-29 17:25:40.000000000 -0500
@@ -621,6 +621,24 @@ unsigned long rt_needs_cpu(int cpu)
 # define const_debug static const
 #endif
 
+/**
+ * runqueue_is_locked
+ *
+ * Returns true if the current cpu runqueue is locked.
+ * This interface allows printk to be called with the runqueue lock
+ * held and know whether or not it is OK to wake up the klogd.
+ */
+int runqueue_is_locked(void)
+{
+   int cpu = get_cpu();
+   struct rq *rq = cpu_rq(cpu);
+   int ret;
+
+   ret = spin_is_locked(&rq->lock);
+   put_cpu();
+   return ret;
+}
+
 /*
  * Debugging: various feature bits
  */


[PATCH 15/22 -v7] trace generic call to schedule switch

2008-01-29 Thread Steven Rostedt

This patch adds hooks into the schedule switch tracing to
allow other latency traces to hook into the schedule switches.
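
As a rough picture of such a hook list, here is a userspace sketch of
registering and unregistering context-switch callbacks that carry private
data. The structure mirrors the func/private/next fields of
tracer_switch_ops in the patch below, but the names switch_ops,
switch_register, switch_unregister, switch_notify and count_switch are
hypothetical, and the spinlock and write barrier used by the real code are
omitted because this toy is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

struct task;	/* opaque stand-in for struct task_struct */

typedef void (*switch_fn)(void *private, struct task *prev,
			  struct task *next);

struct switch_ops {
	switch_fn func;
	void *private;
	struct switch_ops *next;
};

static struct switch_ops *switch_list;

/* Push a new hook on the front of the list. */
static void switch_register(struct switch_ops *ops)
{
	assert(ops != NULL && ops->func != NULL);
	ops->next = switch_list;
	switch_list = ops;
}

/* Returns 0 on success, -1 if ops was not registered. */
static int switch_unregister(struct switch_ops *ops)
{
	struct switch_ops **p;

	/* Walk the links themselves so removal needs no special case. */
	for (p = &switch_list; *p != NULL; p = &(*p)->next) {
		if (*p == ops) {
			*p = ops->next;
			ops->next = NULL;
			return 0;
		}
	}
	return -1;
}

/* Called at each (simulated) context switch. */
static void switch_notify(struct task *prev, struct task *next)
{
	for (struct switch_ops *ops = switch_list; ops != NULL; ops = ops->next)
		ops->func(ops->private, prev, next);
}

/* Example hook: counts switches in the int that private points to. */
static void count_switch(void *private, struct task *prev,
			 struct task *next)
{
	(void)prev;
	(void)next;
	++*(int *)private;
}
```

The private pointer is what lets several tracers share one callback
signature while each keeps its own state.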

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 lib/tracing/trace_sched_switch.c |  123 +--
 lib/tracing/tracer.h |   14 
 2 files changed, 119 insertions(+), 18 deletions(-)

Index: linux-mcount.git/lib/tracing/tracer.h
===================================================================
--- linux-mcount.git.orig/lib/tracing/tracer.h	2008-01-29 12:35:27.000000000 -0500
+++ linux-mcount.git/lib/tracing/tracer.h	2008-01-29 14:22:15.000000000 -0500
@@ -113,4 +113,18 @@ static inline notrace cycle_t now(void)
return get_monotonic_cycles();
 }
 
+#ifdef CONFIG_CONTEXT_SWITCH_TRACER
+typedef void (*tracer_switch_func_t)(void *private,
+struct task_struct *prev,
+struct task_struct *next);
+struct tracer_switch_ops {
+   tracer_switch_func_t func;
+   void *private;
+   struct tracer_switch_ops *next;
+};
+
+extern int register_tracer_switch(struct tracer_switch_ops *ops);
+extern int unregister_tracer_switch(struct tracer_switch_ops *ops);
+#endif /* CONFIG_CONTEXT_SWITCH_TRACER */
+
 #endif /* _LINUX_MCOUNT_TRACER_H */
Index: linux-mcount.git/lib/tracing/trace_sched_switch.c
===================================================================
--- linux-mcount.git.orig/lib/tracing/trace_sched_switch.c	2008-01-29 12:35:40.000000000 -0500
+++ linux-mcount.git/lib/tracing/trace_sched_switch.c	2008-01-29 14:24:35.000000000 -0500
@@ -18,33 +18,21 @@ static struct tracing_trace *tracer_trac
 static int trace_enabled __read_mostly;
 static atomic_t sched_ref;
 int tracing_sched_switch_enabled __read_mostly;
+static DEFINE_SPINLOCK(sched_switch_func_lock);
 
-static notrace void sched_switch_callback(const struct marker *mdata,
- void *private_data,
- const char *format, ...)
+static void notrace sched_switch_func(void *private,
+ struct task_struct *prev,
+ struct task_struct *next)
 {
-   struct tracing_trace **p = mdata->private;
-   struct tracing_trace *tr = *p;
+   struct tracing_trace **ptr = private;
+   struct tracing_trace *tr = *ptr;
struct tracing_trace_cpu *data;
-   struct task_struct *prev;
-   struct task_struct *next;
unsigned long flags;
-   va_list ap;
int cpu;
 
-   if (!atomic_read(&sched_ref))
-   return;
-
-   tracing_record_cmdline(current);
-
if (!trace_enabled)
return;
 
-   va_start(ap, format);
-   prev = va_arg(ap, typeof(prev));
-   next = va_arg(ap, typeof(next));
-   va_end(ap);
-
raw_local_irq_save(flags);
cpu = raw_smp_processor_id();
data = tr->data[cpu];
@@ -57,6 +45,105 @@ static notrace void sched_switch_callbac
raw_local_irq_restore(flags);
 }
 
+static struct tracer_switch_ops sched_switch_ops __read_mostly =
+{
+   .func = sched_switch_func,
+   .private = &tracer_trace,
+};
+
+static tracer_switch_func_t tracer_switch_func __read_mostly =
+   sched_switch_func;
+
+static struct tracer_switch_ops *tracer_switch_func_ops __read_mostly =
+   &sched_switch_ops;
+
+static void notrace sched_switch_func_loop(void *private,
+  struct task_struct *prev,
+  struct task_struct *next)
+{
+   struct tracer_switch_ops *ops = tracer_switch_func_ops;
+
+   for (; ops != NULL; ops = ops->next)
+   ops->func(ops->private, prev, next);
+}
+
+notrace int register_tracer_switch(struct tracer_switch_ops *ops)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(&sched_switch_func_lock, flags);
+   ops->next = tracer_switch_func_ops;
+   smp_wmb();
+   tracer_switch_func_ops = ops;
+
+   if (ops->next == &sched_switch_ops)
+   tracer_switch_func = sched_switch_func_loop;
+
+   spin_unlock_irqrestore(&sched_switch_func_lock, flags);
+
+   return 0;
+}
+
+notrace int unregister_tracer_switch(struct tracer_switch_ops *ops)
+{
+   unsigned long flags;
+   struct tracer_switch_ops **p = &tracer_switch_func_ops;
+   int ret;
+
+   spin_lock_irqsave(&sched_switch_func_lock, flags);
+
+   /*
+* If the sched_switch is the only one left, then
+*  only call that function.
+*/
+   if (*p == ops && ops->next == &sched_switch_ops) {
+   tracer_switch_func = sched_switch_func;
+   tracer_switch_func_ops = &sched_switch_ops;
+   goto out;
+   }
+
+   for (; *p != &sched_switch_ops; p = &(*p)->next)
+   if (*p == ops)
+   break;
+
+   if (*p != ops) {
+

Re: [PATCH 4/4] x86_64: increse MAX_EARLY_RES for NODE_DATA and bootmap

2008-01-29 Thread Yinghai Lu

On Tuesday 29 January 2008 06:57:54 pm Andi Kleen wrote:
> On Tuesday 29 January 2008 20:16, Yinghai Lu wrote:
> > [PATCH 4/4] x86_64: increse MAX_EARLY_RES for NODE_DATA and bootmap
> >
> > otherise early_node_mem will use up these for 8 nodes system
> 
> Yes this was the problem with my early_reserve node bootmem patch.
> It adds a node limit.
> 
> But even with increasing the limit is far too small. Probably best to not 
> use the patch. In theory it should not have been needed anyways because
> there is no need to reserve here because there are no interfering users.
> 
> Whatever your problem is it needs to be solved differently.

OK, discard patches 3 and 4.

How about v2 of patch 2?

YH

Re: [PATCH 2/2] x86_64: make early_node_mem return align address

2008-01-29 Thread Yinghai Lu

On Tuesday 29 January 2008 06:55:45 pm Andi Kleen wrote:
> On Tuesday 29 January 2008 18:41, Yinghai Lu wrote:
> > On Tuesday 29 January 2008 01:33:29 am Andi Kleen wrote:
> > > On Tuesday 29 January 2008 10:05, Yinghai Lu wrote:
> > > > [PATCH 2/2] x86_64: make early_node_mem return align address
> > > >
> > > > boot oops when system get 64g or 128g installed
> > >
> > > Probably it should just use reserve_early(). Does this patch work?
> > >
> > > The alignment change is needed at some point too, but only to
> > > relax the alignment to not force all early allocations to be page
> > > padded.
> >
> > No, my patch doesn't force all early allocations to be page padded.
> > for find_e820_mem, i just change PAGE_ALIGN to be aligned align
> > parameter
> 
> They are already all PAGE_ALIGN()ed (which is too strict, but needs
> some care to fix properly), but your patch uses it the wrong way.
> The PAGE_ALIGNment was added some time ago to avoid such over
> lapping, but it should not actually be needed for that anymore.
> 
> >
> > only make early_node_mem have aligned data. because it seems it like
> > to...and assume that.
> 
> Using alignment doesn't seem the correct way to avoid overlapping.
> 
> If there is still overlap then some reservation needs to be extended.
> 
> > I think your patch will get early panic about overlap between bss and
> > bootmem... like the 256g machine, bss is overlapped with early page
> > table...
> 
> Well did you test it? 
> 
> bss should have been reserved by this line in head64.c 
> 
> reserve_early(__pa_symbol(&_text), __pa_symbol(&_end));
> 
> (in git-x86). In earlier kernels it was checked for explicitely by the e820 
> allocator.

No early panic, but the end of bss still gets corrupted.

because bootmap_start is used as

Re: [PATCH 2.6.24] x86: add sysfs interface for cpuid module

2008-01-29 Thread Yi Yang

On Tue, 2008-01-29 at 07:51 -0800, H. Peter Anvin wrote:
> Yi Yang wrote:
> > Current cpuid module will create a char device for every logical cpu,
> > when a user cats /dev/cpu/*/cpuid, he/she will enter a limitless loop,
> > the root cause is that cpuid module doesn't decide wether a cpuid level
> > is valid, it just uses an offset to denote cpuid level and take it to
> > cpuid instruction, cpuid instruction will ignore it and return some data
> > specific to cpu model, cpuid doesn't an error return value because it is
> > void type. So cpuid module will execute cpuid continuously and return
> > data although most of data make no sense.
> > 
> > This patch tries to add a sysfs interface for cpuid, users can see all the
> > available cpuid levels, specify a specific level and get cpuid corresponding
> > to this cpuid level.
> > 
> > For every logical cpu, this patch will create a cpuid directory under
> > /sys/devices/system/cpu/cpu*/, there are three entries under cpuid:
> > 
> > avail_levelscur_level   cur_cpuid
> > 
> > A user can get all the available cpuid levels from avail_levels, he/she can
> > set one available cpuid level to cur_level, then he/she can get cpuid from
> > cur_cpuid, cur_cpuid corresponds to cur_level.
> > 
> > This patch uses sysfs to avoid limitless loop and provide more flexible
> > interface for cpuid, please consider to merge to -mm tree in order to test.
> 
> This is broken.
> 
> Triple broken.
> 
> It's broken, because it doesn't take into account the fact that Intel 
> broke CPUID level 4 and made it "repeating" (neither did the cpuid char 
> device, because it predated the Intel braindamage; I've had a patch for 
> it privately for a while, but didn't push it upstream because paravirt 
> broke it royally and I wanted the situation to settle down.)
> 
> It's broken, because the algorithm used to determine valid CPUID levels 
> is incorrect; it fails to recognize any CPUID levels other than the main 
> Intel and AMD ones, e.g. the Transmeta 0x80860000 (and sometimes more) 
> and VIA 0xc0000000 levels.
Thank you for pointing out these issues. I think we can let users input
any cpuid level and output the corresponding cpuid data; that way we
avoid having to consider CPU differences and leave this to userspace.
Alternatively, we could take all the x86 platforms into account and
handle cpuid for each of them.

> 
> It's broken, because it is better for the userspace extractor to have 
> this logic than to stuff it into the kernel, where it sits hogging 
> unswappable memory at all times.
It does not seem very appropriate to make user space deal with hardware
details. /proc/cpuinfo is an example that justifies this.

Is there any user application using /dev/cpu/*/cpuid? If not, I think it
is feasible to provide an interface in the kernel.

I noticed an application called CPU-Z on Windows; maybe we can clone it on
Linux.
> 
>   -hpa


Re: [PATCH] Fix NUMA emulation for x86_64

2008-01-29 Thread Minoru Usui

> ">" == Ingo Molnar <[EMAIL PROTECTED]> writes:

>> * Minoru Usui <[EMAIL PROTECTED]> wrote:

>> I found a small bug of NUMA emulation code for x86_64. 
>> (CONFIG_NUMA_EMU) If machine is non-NUMA, find_node_by_addr() should 
>> return NUMA_NO_NODE, but current implementation code returns existent 
>> maximum NUMA node number + 1. This is not existent NUMA node number.
>> 
>> However, this behaviour does not affect NUMA emulation fortunately, 
>> because acpi_fake_nodes() that is caller of find_node_by_addr() gets 
>> pxm (proximity domain) by node_to_pxm() from non-existent NUMA node 
>> number that was returned by find_node_by_addr(). node_to_pxm() returns 
>> PXM_INVAL that means illegal or non-existent NUMA node number.

>> thanks, i have applied your fix to x86.git.

>> It seems this does not need to be backported to v2.6.24.1 because 
>> node_to_pxm() masked the bad effects of this bug, right?

>>  Ingo

I think this bug is not urgent.
If you mean that it's not necessary to release 2.6.24.1 just for this
bug, I agree.

Re: [PATCH 2/2] x86_64: make early_node_mem return align address

2008-01-29 Thread Andi Kleen

On Tuesday 29 January 2008 18:41, Yinghai Lu wrote:
> On Tuesday 29 January 2008 01:33:29 am Andi Kleen wrote:
> > On Tuesday 29 January 2008 10:05, Yinghai Lu wrote:
> > > [PATCH 2/2] x86_64: make early_node_mem return align address
> > >
> > > boot oops when system get 64g or 128g installed
> >
> > Probably it should just use reserve_early(). Does this patch work?
> >
> > The alignment change is needed at some point too, but only to
> > relax the alignment to not force all early allocations to be page
> > padded.
>
> No, my patch doesn't force all early allocations to be page padded.
> for find_e820_mem, i just change PAGE_ALIGN to be aligned align
> parameter

They are already all PAGE_ALIGN()ed (which is too strict, but needs
some care to fix properly), but your patch uses it the wrong way.
The PAGE_ALIGNment was added some time ago to avoid such
overlapping, but it should not actually be needed for that anymore.
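
For readers following the alignment argument, this is what rounding an
address up to an alignment boundary looks like in a small userspace sketch.
It is not the kernel's PAGE_ALIGN() macro; align_up, page_align and the
4096-byte SKETCH_PAGE_SIZE are names and values assumed for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size for this sketch only. */
#define SKETCH_PAGE_SIZE 4096UL

/* Round addr up to the next multiple of align (a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}

/* Page-granular rounding: the strict case discussed above. */
static uint64_t page_align(uint64_t addr)
{
	return align_up(addr, SKETCH_PAGE_SIZE);
}
```

Rounding every allocation to a full page pads small requests heavily,
whereas passing the caller's own alignment keeps the padding small. Neither
form reserves anything, which is why alignment alone cannot prevent two
users from overlapping.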

>
> only make early_node_mem have aligned data. because it seems it like
> to...and assume that.

Using alignment doesn't seem the correct way to avoid overlapping.

If there is still overlap then some reservation needs to be extended.

> I think your patch will get early panic about overlap between bss and
> bootmem... like the 256g machine, bss is overlapped with early page
> table...

Well did you test it? 

bss should have been reserved by this line in head64.c 

reserve_early(__pa_symbol(&_text), __pa_symbol(&_end));

(in git-x86). In earlier kernels it was checked for explicitly by the e820
allocator.

-Andi
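
For readers following the reservation argument, here is a minimal userspace sketch of the idea: early reservations are kept in a small fixed table, and a new range is refused if it overlaps an existing one or the table is full. Every name here (demo_reserve_early, DEMO_MAX_EARLY_RES, the table size of 4) is hypothetical and chosen for illustration; this is not the git-x86 implementation, which may well behave differently on overlap.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_EARLY_RES 4	/* hypothetical table size, for illustration only */

/* One reserved physical range, half-open: [start, end). */
struct demo_early_res {
	unsigned long start;
	unsigned long end;
};

static struct demo_early_res demo_res[DEMO_MAX_EARLY_RES];
static size_t demo_res_count;

/* True if [start, end) intersects any range already in the table. */
static bool demo_overlaps(unsigned long start, unsigned long end)
{
	for (size_t i = 0; i < demo_res_count; i++)
		if (start < demo_res[i].end && demo_res[i].start < end)
			return true;
	return false;
}

/* Record a reservation; returns false if the table is full or the range overlaps. */
static bool demo_reserve_early(unsigned long start, unsigned long end)
{
	if (demo_res_count == DEMO_MAX_EARLY_RES || demo_overlaps(start, end))
		return false;
	demo_res[demo_res_count++] = (struct demo_early_res){ start, end };
	return true;
}
```

With a table like this, a caller that reserves once per node runs out of slots quickly, which is the kind of node limit discussed for MAX_EARLY_RES elsewhere on the list.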

>

Re: [PATCH 4/4] x86_64: increse MAX_EARLY_RES for NODE_DATA and bootmap

2008-01-29 Thread Andi Kleen

On Tuesday 29 January 2008 20:16, Yinghai Lu wrote:
> [PATCH 4/4] x86_64: increse MAX_EARLY_RES for NODE_DATA and bootmap
>
> otherise early_node_mem will use up these for 8 nodes system

Yes this was the problem with my early_reserve node bootmem patch.
It adds a node limit.

But even with the increase the limit is far too small. Probably best to not
use the patch. In theory it should not have been needed anyway because
there is no need to reserve here: there are no interfering users.

Whatever your problem is it needs to be solved differently.

-Andi


Re: [PATCH] sata_nv: fix for completion handling

2008-01-29 Thread Robert Hancock


Tejun Heo wrote:

Robert Hancock wrote:

This patch is based on an original patch from Kuan Luo of NVIDIA,
posted under subject "fixed a bug of adma in rhel4u5 with HDS7250SASUN500G".
His description follows. I've reworked it a bit to avoid some unnecessary
repeated checks but it should be functionally identical.

"The patch is to solve the error message "ata1: CPB flags CMD err,
flags=0x11" when testing HDS7250SASUN500G in rhel4u5.
I tested this hd in 2.6.24-rc7, which needed the mask in the
blacklist removed to run NCQ, and the same error also showed up.

I traced the bug and found that the interrupt finished a command (for
example, tag=0) when the driver got that adma status is
NV_ADMA_STAT_DONE and cpb->resp_flags is NV_CPB_RESP_DONE.
However, for this hd, the drive may not have cleared bit 0 at this moment.
It meant the hardware had not completely finished the command.
If at the same time the driver freed the command (tag 0) and sent
another command (tag 0), the error happened.

The notifier register is a 32-bit register containing the notifier value.
The value is a bit vector containing one bit per tag number (0-31) in
corresponding bit positions (bit 0 is for tag 0, etc). When a bit is set,
ADMA indicates that the command with the corresponding tag number completed
execution.

So I added the check notifier code. Sometimes I saw that the notifier
reg set some bits, but the adma status set NV_ADMA_STAT_CMD_COMPLETE,
not NV_ADMA_STAT_DONE. So I added the NV_ADMA_STAT_CMD_COMPLETE check
code."
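
The notifier layout described above (one bit per tag, bit 0 for tag 0) can be sketched in plain C as follows. The helper names are hypothetical and this is not the sata_nv code; it only shows how a 32-bit notifier value would be tested for a given tag under that description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the notifier value marks `tag` as completed. */
static bool notifier_tag_done(uint32_t notifier, unsigned int tag)
{
	assert(tag < 32);	/* one bit per tag, tags 0-31 */
	return (notifier >> tag) & 1u;
}

/* Hypothetical helper: number of tags the notifier value reports as completed. */
static unsigned int notifier_done_count(uint32_t notifier)
{
	unsigned int count = 0;

	/* Clear the lowest set bit on each pass until no bits remain. */
	for (; notifier; notifier &= notifier - 1)
		count++;
	return count;
}
```

For example, a notifier value of 0x11 would report tags 0 and 4 as completed.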

Signed-off-by: Robert Hancock <[EMAIL PROTECTED]>


Any chance this fixes the FLUSH problem?


I could still reproduce that issue when I took the udelay(20) out of the 
driver. Others have seen that without taking it out, so I suspect some 
systems/drives are more sensitive to that for some reason. However, who 
knows, it may help some people with that problem.


The symptoms of the problem dealt with here are different, not a command 
timeout it appears, but the controller reporting an error.


RE: SATA DOM is not identified by ata_piix module

2008-01-29 Thread Mao Rui

Hmm... Does anybody own this bug?

Best regards,

Mao Rui


> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Mao Rui
> Sent: Friday, January 18, 2008 2:12 PM
> To: 'Alan Cox'
> Cc: linux-kernel@vger.kernel.org
> Subject: RE: SATA DOM is not identified by ata_piix module
> 
> I tried to nail down when the problem was introduced. I compiled several
> official kernel releases. Here are the results.
> 2.6.17.14              IDE -- failed   SATA -- passed
> 2.6.18                 IDE -- failed   SATA -- failed
> 2.6.18.8               IDE -- failed   SATA -- failed
> 2.6.24-rc7-git6        IDE -- passed   SATA DOM -- failed
> linux-2.6.24-rc8-git1  IDE -- passed   SATA -- failed
> All IDE failures are xfermode errors, and all SATA failures are IDENTIFY
> errors.
> 
> As you can see, the SATA DOM failure was introduced in kernel 2.6.18.
> 
> I'm not good at low-level driver programming, so I cannot find the root
> cause by myself. But if Alan or someone else needs more info or wants to
> test a patch, I'm glad to do it on my platform.
> 
> Best regards,
> Mao Rui
> 
> > -Original Message-
> > From: Alan Cox [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, January 15, 2008 7:02 PM
> > To: Mao Rui
> > Cc: linux-kernel@vger.kernel.org
> > Subject: Re: SATA DOM is not identified by ata_piix module
> >
> > On Tue, 15 Jan 2008 18:11:25 +0800
> > "Mao Rui" <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > >
> > > I have a PQI Turbo SATA DOM. It works well under Windows. I installed it in
> > > a SuperMicro motherboard, Intel 5000P chipset. The OS is Ubuntu 7.04, kernel
> > > 2.6.20-15. But the DOM is not appeared as a device node, and I found several
> > > error messages in kernel log.
> >
> > Generally it is a good idea to report problems with vendor built kernels
> > to the vendor and their support, especially one that is 3 releases behind.
> > They have a much better idea what is in that kernel and who else has seen
> > problems.
> >
> > > [   67.124299] ata2.00: qc timeout (cmd 0x91)
> > > [   67.124306] ata2.00: failed to IDENTIFY (INIT_DEV_PARAMS failed,
> > > err_mask=0x4)
> >
> > We issued commands, it didn't respond. That could be a libata problem but
> > actually looks more like an IRQ routing problem.
> >
> > > Actually I also have a PQI IDE DOM, it have same error with Ubuntu 7.04 /
> >
> > The PQI DOM is a bit odd, it is however known to work with libata at
> > least for the PATA one, and the versions which don't understand
> > SET_XFER_MODE to work with current kernels. (Your failure mode isn't the
> > SET_XFER_MODE one - it hasn't got that far).
> >
> > Alan

Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in

2008-01-29 Thread Tony Camuso


Matthew Wilcox wrote:

On Tue, Jan 29, 2008 at 07:29:51AM -0800, Arjan van de Ven wrote:

Right now, that isn't a lot of people in x86 land, but your patch
encumbers drivers for non-x86 archs with an additional call to access
space that they've never had a problem with.

lets say s/x86/x86, IA64 and architectures that use intel, amd or via chipsets/


Umm .. ia64 already does exactly what I'm proposing for x86.  It uses
one SAL interface for bytes below 256 and a different SAL interface for
bytes 256-4095.



Not exactly.
:)

The interface is the same, ia64_sal_pci_config_write() and
ia64_sal_pci_config_read(), but a flag bit in the mode argument is used to
tell the SAL interface whether to translate the offset component of the
config address as having 8 or 12 bits of displacement.

In my estimation, Ivan's patch, in his implementation of Loic's suggestion,
is even more elegant, since there is no need to flag whether the access is
for offsets below 256. Ivan's code automatically uses Port IO (or equivalent
with Matthew's patch) for offsets below 256 and MMCONFIG for offsets from
256 to 4095.

And even better, it removes the bitmap that tracks MMCONFIG-unfriendly
devices for the first 16 buses, a solution that assumes systems with bus
numbers higher than 16 will get MMCONFIG right, which turned out to be a
very wrong assumption. Furthermore, the config address is translated by the
Northbridge. The delivery mechanism to the Northbridge, whether Port IO or
MMCONFIG, is utterly opaque to the devices on the bus, since all they see is
PCI config cycles, not Port IO or MMCONFIG cycles. The test only needed to
be made at the Northbridge level, not at the device level. Ivan's patch
removes all this cruft.
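
To make the offset-based selection concrete, here is a rough sketch of the dispatch as I understand it from the description above. The enum and function names are made up for illustration and are not taken from Ivan's patch; only the 256 and 4096 boundaries come from the discussion.

```c
/* Hypothetical names; only the offset boundaries come from the thread. */
enum cfg_access {
	CFG_ACCESS_PORT_IO,	/* conventional config space, offsets 0-255 */
	CFG_ACCESS_MMCONFIG,	/* extended config space, offsets 256-4095 */
	CFG_ACCESS_INVALID	/* beyond the 4096-byte config space */
};

/* Pick the access mechanism purely from the register offset. */
static enum cfg_access cfg_access_for_offset(unsigned int offset)
{
	if (offset < 256)
		return CFG_ACCESS_PORT_IO;
	if (offset < 4096)
		return CFG_ACCESS_MMCONFIG;
	return CFG_ACCESS_INVALID;
}
```

The point of the design is that the caller never passes a flag: the offset alone decides, so drivers that only touch the first 256 bytes never exercise MMCONFIG at all.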

Re: [PATCH] x86: Add a list for custom page fault handlers.

2008-01-29 Thread Harvey Harrison

From: Pekka Paalanen <[EMAIL PROTECTED]>

Provides kernel modules a way to register custom page fault handlers.
On every page fault, except those handled in vmalloc_fault(), this will
call a list of registered functions. The functions may handle the fault
and force do_page_fault() to return immediately.

This functionality is similar to the now removed page fault notifiers.
Custom page fault handlers are used by debugging and reverse engineering
tools. Mmio-trace is one such tool and a patch to add it into the tree
will follow.

The custom page fault handlers are called from the exact same points in
do_page_fault() as the page fault notifiers were.

Signed-off-by: Pekka Paalanen <[EMAIL PROTECTED]>
Signed-off-by: Harvey Harrison <[EMAIL PROTECTED]>
---
Sorry, attached the wrong version to my last message missing the
kdebug.h hunk.  This is still just a straight port to current x86.git.

 arch/x86/Kconfig.debug   |9 
 arch/x86/mm/fault.c  |   51 ++
 include/asm-x86/kdebug.h |8 +++
 3 files changed, 68 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 2e1e3af..9b44bc5 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -225,4 +225,13 @@ config CPA_DEBUG
help
  Do change_page_attr self tests at boot.
 
+config PAGE_FAULT_HANDLERS
+   bool "Custom page fault handlers"
+   depends on DEBUG_KERNEL
+   help
+ Allow the use of custom page fault handlers. A kernel module may
+ register a function that is called on every page fault not handled
+ for vmalloc. Custom handlers are used by some debugging and reverse
+ engineering tools.
+
 endmenu
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e28cc52..c6c8164 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -49,6 +49,54 @@
 #define PF_RSVD(1<<3)
 #define PF_INSTR   (1<<4)
 
+#ifdef CONFIG_PAGE_FAULT_HANDLERS
+static HLIST_HEAD(pf_handlers); /* protected by RCU */
+static DEFINE_SPINLOCK(pf_handlers_writer);
+
+void register_page_fault_handler(struct pf_handler *new_pfh)
+{
+   spin_lock(&pf_handlers_writer);
+   hlist_add_head_rcu(&new_pfh->hlist, &pf_handlers);
+   spin_unlock(&pf_handlers_writer);
+}
+EXPORT_SYMBOL_GPL(register_page_fault_handler);
+
+void unregister_page_fault_handler(struct pf_handler *old_pfh)
+{
+   might_sleep();
+   spin_lock(&pf_handlers_writer);
+   hlist_del_rcu(&old_pfh->hlist);
+   spin_unlock(&pf_handlers_writer);
+   synchronize_rcu();
+}
+EXPORT_SYMBOL_GPL(unregister_page_fault_handler);
+#endif
+
+/* returns non-zero if do_page_fault() should return */
+static int handle_custom_pf(struct pt_regs *regs, unsigned long error_code,
+   unsigned long address)
+{
+#ifdef CONFIG_PAGE_FAULT_HANDLERS
+   int ret = 0;
+   struct pf_handler *cur;
+   struct hlist_node *ncur;
+
+   if (hlist_empty(&pf_handlers))
+   return 0;
+
+   rcu_read_lock();
+   hlist_for_each_entry_rcu(cur, ncur, &pf_handlers, hlist) {
+   ret = cur->handler(regs, error_code, address);
+   if (ret)
+   break;
+   }
+   rcu_read_unlock();
+   return ret;
+#else
+   return 0;
+#endif
+}
+
 static inline int notify_page_fault(struct pt_regs *regs)
 {
 #ifdef CONFIG_KPROBES
@@ -588,6 +636,9 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
if (notify_page_fault(regs))
return;
 
+   if (handle_custom_pf(regs, error_code, address))
+   return;
+
/*
 * We fault-in kernel-space virtual memory on-demand. The
 * 'reference' page table is init_mm.pgd.
diff --git a/include/asm-x86/kdebug.h b/include/asm-x86/kdebug.h
index dd442a1..ba03368 100644
--- a/include/asm-x86/kdebug.h
+++ b/include/asm-x86/kdebug.h
@@ -35,4 +35,12 @@ extern void dump_pagetable(unsigned long);
 extern unsigned long oops_begin(void);
 extern void oops_end(unsigned long, struct pt_regs *, int signr);
 
+struct pf_handler {
+   struct hlist_node hlist;
+   int (*handler)(struct pt_regs *regs, unsigned long error_code,
+  unsigned long address);
+};
+
+extern void register_page_fault_handler(struct pf_handler *new_pfh);
+extern void unregister_page_fault_handler(struct pf_handler *old_pfh);
 #endif
-- 
1.5.4.rc4.1142.gf5a97
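
For anyone who wants to see the calling convention of the handler chain without the RCU machinery, here is a simplified userspace sketch. It assumes a plain singly linked list in place of the RCU-protected hlist and drops the pt_regs argument; all demo_* names are invented for illustration and are not part of the patch.

```c
#include <stddef.h>

/* Illustrative stand-in for struct pf_handler; a plain list replaces the RCU hlist. */
struct demo_pf_handler {
	struct demo_pf_handler *next;
	int (*handler)(unsigned long error_code, unsigned long address);
};

static struct demo_pf_handler *demo_handlers;

/* Add at the head of the list, mirroring hlist_add_head_rcu() in the patch. */
static void demo_register(struct demo_pf_handler *h)
{
	h->next = demo_handlers;
	demo_handlers = h;
}

/* Returns non-zero if a handler claimed the fault and the caller should return. */
static int demo_handle_custom_pf(unsigned long error_code, unsigned long address)
{
	struct demo_pf_handler *cur;
	int ret = 0;

	for (cur = demo_handlers; cur; cur = cur->next) {
		ret = cur->handler(error_code, address);
		if (ret)
			break;	/* first handler that claims the fault wins */
	}
	return ret;
}

/* Example handler: claims faults inside a made-up traced window. */
static int demo_trace_handler(unsigned long error_code, unsigned long address)
{
	(void)error_code;
	return address >= 0x1000 && address < 0x2000;
}
```

A tool such as an MMIO tracer would register one handler of this shape and return non-zero only for addresses it has armed, leaving every other fault to the normal path.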




[patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Christoph Lameter

The invalidation of address ranges in a mm_struct needs to be
performed when pages are removed or permissions etc change.
Most of the VM address space changes can use the range invalidate
callback.

invalidate_range() is generally called with mmap_sem held but
no spinlocks are active. If invalidate_range() is called with
locks held then we pass a flag into invalidate_range().

Comments state that mmap_sem must be held for
remap_pfn_range() but various drivers do not seem to do this.

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>
Signed-off-by: Robin Holt <[EMAIL PROTECTED]>
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/fremap.c  |2 ++
 mm/hugetlb.c |2 ++
 mm/memory.c  |   11 +--
 mm/mmap.c|1 +
 4 files changed, 14 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/fremap.c
===
--- linux-2.6.orig/mm/fremap.c  2008-01-29 16:56:33.0 -0800
+++ linux-2.6/mm/fremap.c   2008-01-29 16:59:24.0 -0800
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -212,6 +213,7 @@ asmlinkage long sys_remap_file_pages(uns
}
 
err = populate_range(mm, vma, start, size, pgoff);
+   mmu_notifier(invalidate_range, mm, start, start + size, 0);
if (!err && !(flags & MAP_NONBLOCK)) {
if (unlikely(has_write_lock)) {
downgrade_write(&mm->mmap_sem);
Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c  2008-01-29 16:56:33.0 -0800
+++ linux-2.6/mm/memory.c   2008-01-29 16:59:24.0 -0800
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -891,6 +892,8 @@ unsigned long zap_page_range(struct vm_a
end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
if (tlb)
tlb_finish_mmu(tlb, address, end);
+   mmu_notifier(invalidate_range, mm, address, end,
+   (details ? (details->i_mmap_lock != NULL)  : 0));
return end;
 }
 
@@ -1319,7 +1322,7 @@ int remap_pfn_range(struct vm_area_struc
 {
pgd_t *pgd;
unsigned long next;
-   unsigned long end = addr + PAGE_ALIGN(size);
+   unsigned long start = addr, end = addr + PAGE_ALIGN(size);
struct mm_struct *mm = vma->vm_mm;
int err;
 
@@ -1360,6 +1363,7 @@ int remap_pfn_range(struct vm_area_struc
if (err)
break;
} while (pgd++, addr = next, addr != end);
+   mmu_notifier(invalidate_range, mm, start, end, 0);
return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1443,7 +1447,7 @@ int apply_to_page_range(struct mm_struct
 {
pgd_t *pgd;
unsigned long next;
-   unsigned long end = addr + size;
+   unsigned long start = addr, end = addr + size;
int err;
 
BUG_ON(addr >= end);
@@ -1454,6 +1458,7 @@ int apply_to_page_range(struct mm_struct
if (err)
break;
} while (pgd++, addr = next, addr != end);
+   mmu_notifier(invalidate_range, mm, start, end, 0);
return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1669,6 +1674,8 @@ gotten:
page_cache_release(old_page);
 unlock:
pte_unmap_unlock(page_table, ptl);
+   mmu_notifier(invalidate_range, mm, address,
+   address + PAGE_SIZE - 1, 0);
if (dirty_page) {
if (vma->vm_file)
file_update_time(vma->vm_file);
Index: linux-2.6/mm/mmap.c
===
--- linux-2.6.orig/mm/mmap.c2008-01-29 16:56:36.0 -0800
+++ linux-2.6/mm/mmap.c 2008-01-29 16:58:15.0 -0800
@@ -1748,6 +1748,7 @@ static void unmap_region(struct mm_struc
free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 next? next->vm_start: 0);
tlb_finish_mmu(tlb, start, end);
+   mmu_notifier(invalidate_range, mm, start, end, 0);
 }
 
 /*
Index: linux-2.6/mm/hugetlb.c
===
--- linux-2.6.orig/mm/hugetlb.c 2008-01-29 16:56:33.0 -0800
+++ linux-2.6/mm/hugetlb.c  2008-01-29 16:58:15.0 -0800
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -763,6 +764,7 @@ void __unmap_hugepage_range(struct vm_ar
}
spin_unlock(&mm->page_table_lock);
flush_tlb_range(vma, start, end);
+   mmu_notifier(invalidate_range, mm, start, end, 1);
list_for_each_entry_safe(page, tmp, &page_list, lru) {
list_del(&page->lru);
put_page(page);
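
A note on why remap_pfn_range() and apply_to_page_range() gain a separate start variable: the walk loop advances addr until it equals end, so the original start has to be saved if the whole range is to be reported afterwards. The userspace sketch below shows that pattern only; the demo_* names and the 4096-byte page size are assumptions for illustration, and the recorded range stands in for the invalidate_range callback.

```c
#define DEMO_PAGE_SIZE 4096UL	/* assumed page size for the sketch */
#define DEMO_PAGE_ALIGN(x) (((x) + DEMO_PAGE_SIZE - 1) & ~(DEMO_PAGE_SIZE - 1))

/* Last range reported to the (hypothetical) external notifier. */
static unsigned long demo_last_start;
static unsigned long demo_last_end;

static void demo_invalidate_range(unsigned long start, unsigned long end)
{
	demo_last_start = start;
	demo_last_end = end;
}

/*
 * Walk [addr, addr + PAGE_ALIGN(size)) one page at a time, then report the
 * whole range. addr must be page aligned and size must be non-zero.
 */
static void demo_remap(unsigned long addr, unsigned long size)
{
	unsigned long start = addr, end = addr + DEMO_PAGE_ALIGN(size);

	do {
		addr += DEMO_PAGE_SIZE;	/* the loop consumes addr */
	} while (addr != end);

	/* addr now equals end, so only the saved start describes the range. */
	demo_invalidate_range(start, end);
}
```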

-- 

Re: [PATCH 2/4] x86_64: make early_node_mem return align address v2

2008-01-29 Thread Yinghai Lu

On Tuesday 29 January 2008 11:14:48 am Yinghai Lu wrote:
> [PATCH 2/4] x86_64: make early_node_mem return align address v2
> 
> boot oops when system get 64g or 128 installed
> 

Can you apply this updated version with the others?

setup_node_mem should return a PAGE_ALIGNed address.

setup_node_bootmem needs bootmap_start to be PAGE_ALIGNed; without this
patch it will overlap with bss.

YH

[patch 6/6] mmu_notifier: Add invalidate_all()

2008-01-29 Thread Christoph Lameter

When a task exits we can remove all external ptes at once. At that point the
external mmu may also unregister itself from the mmu notifier chain to avoid
future calls.

Note the complications because of RCU. Other processors may not see that the
notifier was unlinked until a quiescent period has passed!

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/mmu_notifier.h |4 
 mm/mmap.c|1 +
 2 files changed, 5 insertions(+)

Index: linux-2.6/include/linux/mmu_notifier.h
===
--- linux-2.6.orig/include/linux/mmu_notifier.h 2008-01-28 14:02:18.0 -0800
+++ linux-2.6/include/linux/mmu_notifier.h  2008-01-28 14:15:49.0 -0800
@@ -62,6 +62,10 @@ struct mmu_notifier_ops {
struct mm_struct *mm,
unsigned long address);
 
+   /* Dummy needed because the mmu_notifier() macro requires it */
+   void (*invalidate_all)(struct mmu_notifier *mn, struct mm_struct *mm,
+   int dummy);
+
/*
 * lock indicates that the function is called under spinlock.
 */
Index: linux-2.6/mm/mmap.c
===
--- linux-2.6.orig/mm/mmap.c2008-01-28 14:15:49.0 -0800
+++ linux-2.6/mm/mmap.c 2008-01-28 14:15:49.0 -0800
@@ -2034,6 +2034,7 @@ void exit_mmap(struct mm_struct *mm)
unsigned long end;
 
/* mm's last user has gone, and its about to be pulled down */
+   mmu_notifier(invalidate_all, mm, 0);
arch_exit_mmap(mm);
 
lru_add_drain();
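
On the dummy argument: the mmu_notifier() macro itself is defined in another patch of this series and is not shown here, so the following is only a guess at the general shape. A variadic dispatch macro written in standard C99 needs at least one argument after the callback name, which would be one plausible reason for a placeholder parameter. All demo_* names are invented.

```c
#include <stddef.h>

struct demo_notifier;

/* Hypothetical callback table; a NULL entry means "not interested". */
struct demo_ops {
	void (*invalidate_all)(struct demo_notifier *n, int dummy);
	void (*invalidate_page)(struct demo_notifier *n, unsigned long address);
};

struct demo_notifier {
	const struct demo_ops *ops;
	struct demo_notifier *next;
};

static struct demo_notifier *demo_chain;

/* Call the named callback on every registered notifier that implements it. */
#define demo_notify(function, ...)					\
	do {								\
		struct demo_notifier *n_;				\
		for (n_ = demo_chain; n_; n_ = n_->next)		\
			if (n_->ops->function)				\
				n_->ops->function(n_, __VA_ARGS__);	\
	} while (0)

static int demo_all_calls;

/* Example callback: counts how often everything was invalidated. */
static void demo_on_all(struct demo_notifier *n, int dummy)
{
	(void)n;
	(void)dummy;
	demo_all_calls++;
}
```

With this shape, demo_notify(invalidate_all, 0) compiles cleanly, while a call with no trailing argument would leave a dangling comma in the expansion.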

-- 

[patch 4/6] MMU notifier: invalidate_page callbacks using Linux rmaps

2008-01-29 Thread Christoph Lameter

These notifiers here use the Linux rmaps to perform the callbacks.
In order to walk the rmaps locks must be held. Callbacks can therefore
only operate in an atomic context.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/rmap.c |   12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/rmap.c
===
--- linux-2.6.orig/mm/rmap.c2008-01-29 16:58:25.0 -0800
+++ linux-2.6/mm/rmap.c 2008-01-29 16:58:39.0 -0800
@@ -285,7 +285,8 @@ static int page_referenced_one(struct pa
if (!pte)
goto out;
 
-   if (ptep_clear_flush_young(vma, address, pte))
+   if (ptep_clear_flush_young(vma, address, pte) |
+   mmu_notifier_age_page(mm, address))
referenced++;
 
/* Pretend the page is referenced if the task has the
@@ -435,6 +436,7 @@ static int page_mkclean_one(struct page 
 
flush_cache_page(vma, address, pte_pfn(*pte));
entry = ptep_clear_flush(vma, address, pte);
+   mmu_notifier(invalidate_page, mm, address);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);
@@ -680,7 +682,8 @@ static int try_to_unmap_one(struct page 
 * skipped over this mm) then we should reactivate it.
 */
if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-   (ptep_clear_flush_young(vma, address, pte)))) {
+   (ptep_clear_flush_young(vma, address, pte) |
+   mmu_notifier_age_page(mm, address)))) {
ret = SWAP_FAIL;
goto out_unmap;
}
@@ -688,6 +691,7 @@ static int try_to_unmap_one(struct page 
/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
pteval = ptep_clear_flush(vma, address, pte);
+   mmu_notifier(invalidate_page, mm, address);
 
/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
@@ -812,12 +816,14 @@ static void try_to_unmap_cluster(unsigne
page = vm_normal_page(vma, address, *pte);
BUG_ON(!page || PageAnon(page));
 
-   if (ptep_clear_flush_young(vma, address, pte))
+   if (ptep_clear_flush_young(vma, address, pte) |
+   mmu_notifier_age_page(mm, address))
continue;
 
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush(vma, address, pte);
+   mmu_notifier(invalidate_page, mm, address);
 
/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
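
One detail worth spelling out in the hunks above is the use of bitwise | rather than logical || when combining ptep_clear_flush_young() with mmu_notifier_age_page(). Presumably the intent is that both young bits get cleared on every call; with || the external one would be skipped whenever the CPU pte was already young. The sketch below demonstrates only that language-level difference, using invented demo_* stand-ins with call counters.

```c
/* Call counters so the difference in evaluation is observable. */
static int demo_cpu_calls;
static int demo_ext_calls;

/* Stand-in for clearing the young bit in the CPU pte; returns the old value. */
static int demo_clear_cpu_young(int young)
{
	demo_cpu_calls++;
	return young;
}

/* Stand-in for asking the external mmu to clear and report its young state. */
static int demo_clear_ext_young(int young)
{
	demo_ext_calls++;
	return young;
}

/* Bitwise OR: both operands are always evaluated. */
static int demo_referenced_bitwise(int cpu_young, int ext_young)
{
	return (demo_clear_cpu_young(cpu_young) |
		demo_clear_ext_young(ext_young)) != 0;
}

/* Logical OR: the second call is skipped when the first returns non-zero. */
static int demo_referenced_logical(int cpu_young, int ext_young)
{
	return demo_clear_cpu_young(cpu_young) ||
	       demo_clear_ext_young(ext_young);
}
```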

-- 

[patch 0/6] [RFC] MMU Notifiers V3

2008-01-29 Thread Christoph Lameter

This is a patchset implementing MMU notifier callbacks based on Andrea's
earlier work. These are needed if Linux pages are referenced by something
other than what is tracked by the rmaps of the kernel. The known immediate users are

KVM (establishes a refcount to the page. External references called spte)

GRU (simple TLB shootdown without refcount. Has its own pagetable/tlb)

XPmem (uses its own reverse mappings and refcount. Remote ptes, Needs
to sleep when sending messages)

Issues:

- Feedback from uses of the callbacks for KVM, RDMA, XPmem and GRU
  Early tests with the GRU were successful.

- Pages may be freed before the external mappings are torn down
  through invalidate_range() if no refcount on the page is taken.
  There is a chance that page content may be visible after
  the pages have been reallocated (mainly an issue for the GRU, which
  takes no refcount).

- invalidate_range() callbacks are sometimes called under i_mmap_lock.
  These need to be dealt with or XPmem needs to be able to work around
  these.

- filemap_xip.c does not follow conventions for Rmap callbacks.
  We could depend on XIP support not being active to avoid the issue.

Things that we leave as is:

- RCU quiescent periods are required on registering and unregistering
  notifiers to guarantee visibility to other processors.
  Currently only mmu_notifier_release() does the correct thing.
  It is up to the user to provide RCU quiescent periods for
  register/unregister functions if they are called outside of the
  ->release method.

Andrea's mmu_notifier #4 -> RFC V1

- Merge subsystem rmap based with Linux rmap based approach
- Move Linux rmap based notifiers out of macro
- Try to account for what locks are held while the notifiers are
  called.
- Develop a patch sequence that separates out the different types of
  hooks so that we can review their use.
- Avoid adding include to linux/mm_types.h
- Integrate RCU logic suggested by Peter.

V1->V2:
- Improve RCU support
- Use mmap_sem for mmu_notifier register / unregister
- Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we
  already have invalidate_range() callbacks there.
- Clean compile for !MMU_NOTIFIER
- Isolate filemap_xip strangeness into its own diff
- Pass a flag to invalidate_range to indicate if a spinlock
  is held.
- Add invalidate_all()

V2->V3:
- Further RCU fixes
- Fixes from Andrea to fixup aging and move invalidate_range() in do_wp_page
  and sys_remap_file_pages() after the pte clearing.

-- 

[patch 3/6] mmu_notifier: invalidate_page callbacks for subsystems with rmap

2008-01-29 Thread Christoph Lameter

Callbacks to remove individual pages if the subsystem has an
rmap capability. The pagelock is held but no spinlocks are held.
The refcount of the page is elevated so that dropping the refcount
in the subsystem will not directly free the page.

The callbacks occur after the Linux rmaps have been walked.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/rmap.c |6 ++
 1 file changed, 6 insertions(+)

Index: linux-2.6/mm/rmap.c
===
--- linux-2.6.orig/mm/rmap.c2008-01-25 14:24:19.0 -0800
+++ linux-2.6/mm/rmap.c 2008-01-25 14:24:38.0 -0800
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -473,6 +474,8 @@ int page_mkclean(struct page *page)
struct address_space *mapping = page_mapping(page);
if (mapping) {
ret = page_mkclean_file(mapping, page);
+   if (unlikely(PageExternalRmap(page)))
+   mmu_rmap_notifier(invalidate_page, page);
if (page_test_dirty(page)) {
page_clear_dirty(page);
ret = 1;
@@ -971,6 +974,9 @@ int try_to_unmap(struct page *page, int 
else
ret = try_to_unmap_file(page, migration);
 
+   if (unlikely(PageExternalRmap(page)))
+   mmu_rmap_notifier(invalidate_page, page);
+
if (!page_mapped(page))
ret = SWAP_SUCCESS;
return ret;

-- 
