date:20071129


Paul Rolland (ポール・ロラン) wrote:



>> Note that once TSC is disabled (it's using "jiffies" as far
>> as I can see), ntpd constantly speeds up and slows down the
>> clock, it jumps +/- 0.5sec every several minutes or hours -
>> I guess that's when ntpd process gets moved from one core
>> to another for whatever reason.  And an interesting thing
>> is that with 64bits kernel this TSC problem does not occur
>> on this very machine.
>
> H That could make it a problem related to kernel rather than CPU.
>
>> Something similar is reported on AMD X2 64 machines as well --
>> can't check right now.
>
> If I recall correctly, issues with AMD X2 where related to TSC being
> independant for each core and not constant (speed depending of C state).
> But the reason I raise the issue is that the Core2 reports constant TSC,
> so there is (IMHO) no reason for that.



Well, "constant" doesn't mean "synchronized", but it might very well be 
that the Core2 could really benefit from synchronizing the TSCs manually 
like we used to.


On the other hand, I notice that most of the TSC warp values are 
relatively close to 2^32, so this could be a specific bug.
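
To put "relatively close to 2^32" into numbers, here is a small userspace sketch (the helper names are illustrative and not kernel code). It computes how far a reported warp falls short of 2^32, and what the raw warp would amount to in milliseconds, assuming the TSC ticks at the nominal 1.73 GHz shown in the /proc/cpuinfo posted in this thread:

```c
#include <stdint.h>

/* 2^32, the value the reported warps cluster just below. */
#define TWO_POW_32 (UINT64_C(1) << 32)

/* Distance (in cycles) by which a reported warp falls short of 2^32.
 * Only meaningful for warps below 2^32, as in the logs quoted here. */
static uint64_t cycles_below_2_32(uint64_t warp)
{
	return TWO_POW_32 - warp;
}

/* Warp expressed in whole milliseconds for a TSC rate given in kHz.
 * Assumption: the TSC runs at the nominal rate, e.g. 1730000 kHz. */
static uint64_t warp_to_ms(uint64_t warp, uint64_t tsc_khz)
{
	return warp / tsc_khz;
}
```

For the smallest and largest warps in the posted logs (3978592228 and 4096946373 cycles) the shortfall is 316375068 and 198020923 cycles, i.e. every reported value lies within roughly 4.6% to 7.4% of 2^32. Read literally, 3978592228 cycles at 1.73 GHz would be about 2.3 seconds.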


-hpa

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [BUG] 2.6.24-rc3-git2 softlockup detected

On Thu, 29 Nov 2007 23:00:47 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote:

> On Fri, 30 Nov 2007 01:39:29 -0500 Kyle McMartin <[EMAIL PROTECTED]> wrote:
> 
> > On Thu, Nov 29, 2007 at 12:35:33AM -0800, Andrew Morton wrote:
> > > ten million is close enough to infinity for me to assume that we broke the
> > > driver and that's never going to terminate.
> > > 
> > 
> > how about this? doesn't break things on my pa8800:
> > 
> > diff --git a/drivers/scsi/sym53c8xx_2/sym_hipd.c 
> > b/drivers/scsi/sym53c8xx_2/sym_hipd.c
> > index 463f119..ef01cb1 100644
> > --- a/drivers/scsi/sym53c8xx_2/sym_hipd.c
> > +++ b/drivers/scsi/sym53c8xx_2/sym_hipd.c
> > @@ -1037,10 +1037,13 @@ restart_test:
> > /*
> >  *  Wait 'til done (with timeout)
> >  */
> > -   for (i=0; i<SYM_SNOOP_TIMEOUT; i++)
> > +   do {
> > if (INB(np, nc_istat) & (INTF|SIP|DIP))
> > break;
> > -   if (i>=SYM_SNOOP_TIMEOUT) {
> > +   msleep(10);
> > +   } while (i++ < SYM_SNOOP_TIMEOUT);
> > +
> > +   if (i >= SYM_SNOOP_TIMEOUT) {
> > printf ("CACHE TEST FAILED: timeout.\n");
> > return (0x20);
> > }
> > diff --git a/drivers/scsi/sym53c8xx_2/sym_hipd.h 
> > b/drivers/scsi/sym53c8xx_2/sym_hipd.h
> > index ad07880..85c483b 100644
> > --- a/drivers/scsi/sym53c8xx_2/sym_hipd.h
> > +++ b/drivers/scsi/sym53c8xx_2/sym_hipd.h
> > @@ -339,7 +339,7 @@
> >  /*
> >   *  Misc.
> >   */
> > -#define SYM_SNOOP_TIMEOUT (10000000)
> > +#define SYM_SNOOP_TIMEOUT (1000)
> >  #define BUS_8_BIT  0
> >  #define BUS_16_BIT 1
> >  
> 
> That might be the fix, but do we know what we're actually fixing?  afaik
> 2.6.24-rc3 doesn't get this timeout, 2.6.24-rc3-mm2 does get it and we
> don't know why?
> 





So 2.6.24-rc3 was OK and 2.6.24-rc3-git2 is not?

Re: [BUG] 2.6.24-rc3-git2 softlockup detected

On Fri, 30 Nov 2007 01:39:29 -0500 Kyle McMartin <[EMAIL PROTECTED]> wrote:

> On Thu, Nov 29, 2007 at 12:35:33AM -0800, Andrew Morton wrote:
> > ten million is close enough to infinity for me to assume that we broke the
> > driver and that's never going to terminate.
> > 
> 
> how about this? doesn't break things on my pa8800:
> 
> diff --git a/drivers/scsi/sym53c8xx_2/sym_hipd.c 
> b/drivers/scsi/sym53c8xx_2/sym_hipd.c
> index 463f119..ef01cb1 100644
> --- a/drivers/scsi/sym53c8xx_2/sym_hipd.c
> +++ b/drivers/scsi/sym53c8xx_2/sym_hipd.c
> @@ -1037,10 +1037,13 @@ restart_test:
>   /*
>*  Wait 'til done (with timeout)
>*/
> - for (i=0; i<SYM_SNOOP_TIMEOUT; i++)
> + do {
>   if (INB(np, nc_istat) & (INTF|SIP|DIP))
>   break;
> - if (i>=SYM_SNOOP_TIMEOUT) {
> + msleep(10);
> + } while (i++ < SYM_SNOOP_TIMEOUT);
> +
> + if (i >= SYM_SNOOP_TIMEOUT) {
>   printf ("CACHE TEST FAILED: timeout.\n");
>   return (0x20);
>   }
> diff --git a/drivers/scsi/sym53c8xx_2/sym_hipd.h 
> b/drivers/scsi/sym53c8xx_2/sym_hipd.h
> index ad07880..85c483b 100644
> --- a/drivers/scsi/sym53c8xx_2/sym_hipd.h
> +++ b/drivers/scsi/sym53c8xx_2/sym_hipd.h
> @@ -339,7 +339,7 @@
>  /*
>   *  Misc.
>   */
> -#define SYM_SNOOP_TIMEOUT (10000000)
> +#define SYM_SNOOP_TIMEOUT (1000)
>  #define BUS_8_BIT0
>  #define BUS_16_BIT   1
>  

That might be the fix, but do we know what we're actually fixing?  afaik
2.6.24-rc3 doesn't get this timeout, 2.6.24-rc3-mm2 does get it and we
don't know why?



Re: constant_tsc and TSC unstable

2007-11-29 Thread ポール・ロラン

Hello,

On Fri, 30 Nov 2007 00:26:47 +0300
Michael Tokarev <[EMAIL PROTECTED]> wrote:

> H. Peter Anvin wrote:
> > Paul Rolland (ポール・ロラン) wrote:
> []
> >> Measured 3978592228 cycles TSC warp between CPUs, turning off TSC clock.
> >> Marking TSC unstable due to: check_tsc_sync_source failed.
> []
> >> but I was wondering if this is a bug or a feature ;)
> 
> > The problem you're having is that the TSCs of your two cores are
> > completely different, over a second apart.  This is a bug, unrelated to
> > constant_tsc.
> 
> A bug in where - in the CPU or in kernel?
Good question !
 
> The thing is that all our dual-core machines shows something like
> that.
> 
> (not that huge difference as Paul reported, but still "unstable".
> The same happens with 2.6.23)
I've been checking my logs, and the difference is quite constant and
huge :
[EMAIL PROTECTED] log]# grep 'cycles TSC warp' messages*
messages:Nov 26 08:27:56 tux kernel: Measured 4078687691 cycles TSC warp between CPUs, turning off TSC clock.
messages:Nov 26 17:21:21 tux kernel: Measured 3978592228 cycles TSC warp between CPUs, turning off TSC clock.
messages.1:Nov 18 22:52:23 tux kernel: Measured 4063102940 cycles TSC warp between CPUs, turning off TSC clock.
messages.1:Nov 19 07:19:02 tux kernel: Measured 4057192061 cycles TSC warp between CPUs, turning off TSC clock.
messages.1:Nov 23 20:50:12 tux kernel: Measured 4064589321 cycles TSC warp between CPUs, turning off TSC clock.
messages.2:Nov 12 08:06:44 tux kernel: Measured 4072130361 cycles TSC warp between CPUs, turning off TSC clock.
messages.2:Nov 13 19:42:47 tux kernel: Measured 4049899451 cycles TSC warp between CPUs, turning off TSC clock.
messages.2:Nov 17 09:27:22 tux kernel: Measured 4066629060 cycles TSC warp between CPUs, turning off TSC clock.
messages.3:Nov  5 08:25:08 tux kernel: Measured 4086386109 cycles TSC warp between CPUs, turning off TSC clock.
messages.3:Nov  8 13:07:08 tux kernel: Measured 4041945934 cycles TSC warp between CPUs, turning off TSC clock.
messages.3:Nov  9 23:31:24 tux kernel: Measured 4092303059 cycles TSC warp between CPUs, turning off TSC clock.
messages.4:Oct 29 07:28:23 tux kernel: Measured 4096946373 cycles TSC warp between CPUs, turning off TSC clock.
messages.4:Oct 31 17:07:21 tux kernel: Measured 4046765372 cycles TSC warp between CPUs, turning off TSC clock.
messages.4:Oct 31 17:15:09 tux kernel: Measured 4039328228 cycles TSC warp between CPUs, turning off TSC clock.
messages.4:Oct 31 23:19:00 tux kernel: Measured 4069714246 cycles TSC warp between CPUs, turning off TSC clock.
messages.4:Nov  1 20:33:02 tux kernel: Measured 4088199726 cycles TSC warp between CPUs, turning off TSC clock.
messages.4:Nov  2 11:53:17 tux kernel: Measured 4079927527 cycles TSC warp between CPUs, turning off TSC clock.
messages.4:Nov  3 09:37:16 tux kernel: Measured 4071112656 cycles TSC warp between CPUs, turning off TSC clock.
messages.4:Nov  3 10:51:29 tux kernel: Measured 3986266219 cycles TSC warp between CPUs, turning off TSC clock.
messages.4:Nov  4 18:14:56 tux kernel: Measured 4074214144 cycles TSC warp between CPUs, turning off TSC clock.

> Note that once TSC is disabled (it's using "jiffies" as far
> as I can see), ntpd constantly speeds up and slows down the
> clock, it jumps +/- 0.5sec every several minutes or hours -
> I guess that's when ntpd process gets moved from one core
> to another for whatever reason.  And an interesting thing
> is that with 64bits kernel this TSC problem does not occur
> on this very machine.
H That could make it a problem related to kernel rather than CPU.
 
> Something similar is reported on AMD X2 64 machines as well --
> can't check right now.
If I recall correctly, issues with AMD X2 were related to TSC being
independent for each core and not constant (speed depending on C state).
But the reason I raise the issue is that the Core2 reports constant TSC,
so there is (IMHO) no reason for that.

Paul


Re: constant_tsc and TSC unstable

2007-11-29 Thread Paul Rolland (ポール・ロラン)

Hello,

On Thu, 29 Nov 2007 15:29:49 -0800
"Pallipadi, Venkatesh" <[EMAIL PROTECTED]> wrote:



> TSCs on Core 2 Duo are supposed to be in sync unless CPU supports deep idle
> states like C2, C3. Can you send the full /proc/cpuinfo and full dmesg.
> 
Sure I can...
[EMAIL PROTECTED] log]# cat /proc/cpuinfo 
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Intel(R) Core(TM)2 CPU T5300  @ 1.73GHz
stepping: 2
cpu MHz : 800.000
cache size  : 2048 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 2
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl est tm2 ssse3 cx16 xtpr lahf_lm
bogomips: 3461.13
clflush size: 64

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Intel(R) Core(TM)2 CPU T5300  @ 1.73GHz
stepping: 2
cpu MHz : 800.000
cache size  : 2048 KB
physical id : 0
siblings: 2
core id : 1
cpu cores   : 2
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl est tm2 ssse3 cx16 xtpr lahf_lm
bogomips: 3458.02
clflush size: 64

Regards,
Paul

dmesg
Description: Binary data

[patch 3/3] x86_64: Make the x86_32 percpu operations usable on x86_64

Relocate the x86_64 percpu variables to begin at zero. Then
we can directly use the x86_32 percpu operations. x86_32
offsets %fs by __per_cpu_start. x86_64 has %gs pointing
directly to the pda and the per cpu area if they start at zero.

Access to the pda with the x86_64 pda operations is still
possible in addition to access to the per cpu variables
using x86_32 percpu operations.

Hopefully this is helpful for arch integration.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 arch/x86/Kconfig |5 +
 arch/x86/kernel/setup64.c|4 ++--
 arch/x86/kernel/vmlinux_64.lds.S |1 +
 include/asm-x86/percpu.h |   12 +++-
 4 files changed, 19 insertions(+), 3 deletions(-)

Index: linux-2.6.24-rc3-mm2/include/asm-x86/percpu.h
===
--- linux-2.6.24-rc3-mm2.orig/include/asm-x86/percpu.h  2007-11-29 
22:13:54.806575787 -0800
+++ linux-2.6.24-rc3-mm2/include/asm-x86/percpu.h   2007-11-29 
22:21:42.383571603 -0800
@@ -17,6 +17,12 @@
 
 #define per_cpu_offset(x) (__per_cpu_offset(x))
 
+#define __percpu_seg "%%gs:"
+
+#else
+
+#define __percpu_seg ""
+
 #endif
 #include 
 
@@ -81,6 +87,11 @@ DECLARE_PER_CPU(struct x8664_pda, pda);
 /* We can use this directly for local CPU (faster). */
 DECLARE_PER_CPU(unsigned long, this_cpu_off);
 
+#endif /* __ASSEMBLY__ */
+#endif /* !CONFIG_X86_64 */
+
+#ifndef __ASSEMBLY__
+
 /* For arch-specific code, we can use direct single-insn ops (they
  * don't give an lvalue though). */
 extern void __bad_percpu_size(void);
@@ -138,5 +149,4 @@ extern void __bad_percpu_size(void);
 #define x86_sub_percpu(var,val) percpu_to_op("sub", per_cpu__##var, val)
 #define x86_or_percpu(var,val) percpu_to_op("or", per_cpu__##var, val)
 #endif /* !__ASSEMBLY__ */
-#endif /* !CONFIG_X86_64 */
 #endif /* _ASM_X86_PERCPU_H_ */
Index: linux-2.6.24-rc3-mm2/arch/x86/Kconfig
===
--- linux-2.6.24-rc3-mm2.orig/arch/x86/Kconfig  2007-11-29 22:05:39.003576212 
-0800
+++ linux-2.6.24-rc3-mm2/arch/x86/Kconfig   2007-11-29 22:12:53.942575452 
-0800
@@ -123,6 +123,11 @@ config GENERIC_TIME_VSYSCALL
 config ARCH_SETS_UP_PER_CPU_AREA
def_bool X86_64
 
+config PERCPU_ZERO_BASED
+   bool
+   depends on X86_64 && SMP
+   default y
+
 config ZONE_DMA32
bool
default X86_64
Index: linux-2.6.24-rc3-mm2/arch/x86/kernel/setup64.c
===
--- linux-2.6.24-rc3-mm2.orig/arch/x86/kernel/setup64.c 2007-11-29 
22:12:08.962826086 -0800
+++ linux-2.6.24-rc3-mm2/arch/x86/kernel/setup64.c  2007-11-29 
22:12:53.942575452 -0800
@@ -111,11 +111,11 @@ void __init setup_per_cpu_areas(void)
}
if (!ptr)
panic("Cannot allocate cpu data for CPU %d\n", i);
-   memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
+   memcpy(ptr, __per_cpu_load, __per_cpu_size);
/* Relocate the pda */
memcpy(ptr, cpu_pda(i), sizeof(struct x8664_pda));
cpu_pda(i) = (struct x8664_pda *)ptr;
-   cpu_pda(i)->data_offset = ptr - __per_cpu_start;
+   cpu_pda(i)->data_offset = (unsigned long)ptr;
}
/* Fix up pda for this processor  */
pda_init(0);
Index: linux-2.6.24-rc3-mm2/arch/x86/kernel/vmlinux_64.lds.S
===
--- linux-2.6.24-rc3-mm2.orig/arch/x86/kernel/vmlinux_64.lds.S  2007-11-29 
22:05:38.987576338 -0800
+++ linux-2.6.24-rc3-mm2/arch/x86/kernel/vmlinux_64.lds.S   2007-11-29 
22:12:53.930825752 -0800
@@ -16,6 +16,7 @@ jiffies_64 = jiffies;
 _proxy_pda = 1;
 PHDRS {
text PT_LOAD FLAGS(5);  /* R_E */
+   percpu PT_LOAD FLAGS(4);/* R__ */
data PT_LOAD FLAGS(7);  /* RWE */
user PT_LOAD FLAGS(7);  /* RWE */
data.init PT_LOAD FLAGS(7); /* RWE */


[patch 2/3] X86_64: Declare pda as per cpu data thereby moving it into the cpu area

Declare the pda as a per cpu variable. This will have the effect of moving
the pda data into the cpu area managed by cpu alloc.

The boot_pdas are only needed in head64.c so move the declaration
over there and make it static.

Remove the code that allocates special pda data structures.

The pda is moved to the beginning of the per cpu area. gs is pointing to the
pda. And therefore gs: is now pointing to the per cpu area of the current
processor. A per cpu variable can then be reached at

%gs:[_cpu_ - __per_cpu_start]

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 arch/x86/kernel/head64.c  |6 ++
 arch/x86/kernel/setup64.c |   13 ++---
 arch/x86/kernel/smpboot_64.c  |   16 
 include/asm-generic/vmlinux.lds.h |1 +
 include/asm-x86/pda.h |1 -
 include/linux/percpu.h|4 
 6 files changed, 21 insertions(+), 20 deletions(-)

Index: linux-2.6.24-rc3-mm2/arch/x86/kernel/setup64.c
===
--- linux-2.6.24-rc3-mm2.orig/arch/x86/kernel/setup64.c 2007-11-28 
20:59:13.124188194 -0800
+++ linux-2.6.24-rc3-mm2/arch/x86/kernel/setup64.c  2007-11-28 
21:08:50.473347382 -0800
@@ -30,7 +30,9 @@ cpumask_t cpu_initialized __cpuinitdata 
 
 struct x8664_pda *_cpu_pda[NR_CPUS] __read_mostly;
 EXPORT_SYMBOL(_cpu_pda);
-struct x8664_pda boot_cpu_pda[NR_CPUS] __cacheline_aligned;
+
+DEFINE_PER_CPU_FIRST(struct x8664_pda, pda);
+EXPORT_PER_CPU_SYMBOL(pda);
 
 struct desc_ptr idt_descr = { 256 * 16 - 1, (unsigned long) idt_table };
 
@@ -109,10 +111,15 @@ void __init setup_per_cpu_areas(void)
}
if (!ptr)
panic("Cannot allocate cpu data for CPU %d\n", i);
-   cpu_pda(i)->data_offset = ptr - __per_cpu_start;
memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
+   /* Relocate the pda */
+   memcpy(ptr, cpu_pda(i), sizeof(struct x8664_pda));
+   cpu_pda(i) = (struct x8664_pda *)ptr;
+   cpu_pda(i)->data_offset = ptr - __per_cpu_start;
}
-} 
+   /* Fix up pda for this processor  */
+   pda_init(0);
+}
 
 void pda_init(int cpu)
 { 
Index: linux-2.6.24-rc3-mm2/arch/x86/kernel/smpboot_64.c
===
--- linux-2.6.24-rc3-mm2.orig/arch/x86/kernel/smpboot_64.c  2007-11-28 
20:59:13.136188167 -0800
+++ linux-2.6.24-rc3-mm2/arch/x86/kernel/smpboot_64.c   2007-11-28 
20:59:35.399937395 -0800
@@ -556,22 +556,6 @@ static int __cpuinit do_boot_cpu(int cpu
return -1;
}
 
-   /* Allocate node local memory for AP pdas */
-   if (cpu_pda(cpu) == _cpu_pda[cpu]) {
-   struct x8664_pda *newpda, *pda;
-   int node = cpu_to_node(cpu);
-   pda = cpu_pda(cpu);
-   newpda = kmalloc_node(sizeof (struct x8664_pda), GFP_ATOMIC,
- node);
-   if (newpda) {
-   memcpy(newpda, pda, sizeof (struct x8664_pda));
-   cpu_pda(cpu) = newpda;
-   } else
-   printk(KERN_ERR
-   "Could not allocate node local PDA for CPU %d on node %d\n",
-   cpu, node);
-   }
-
alternatives_smp_switch(1);
 
c_idle.idle = get_idle_for_cpu(cpu);
Index: linux-2.6.24-rc3-mm2/arch/x86/kernel/head64.c
===
--- linux-2.6.24-rc3-mm2.orig/arch/x86/kernel/head64.c  2007-11-28 
20:59:13.152187359 -0800
+++ linux-2.6.24-rc3-mm2/arch/x86/kernel/head64.c   2007-11-28 
20:59:35.403937534 -0800
@@ -22,6 +22,12 @@
 #include 
 #include 
 
+/*
+ * Only used before the per cpu areas are setup. The use for the non possible
+ * cpus continues after boot
+ */
+static struct x8664_pda boot_cpu_pda[NR_CPUS] __cacheline_aligned;
+
 static void __init zap_identity_mappings(void)
 {
pgd_t *pgd = pgd_offset_k(0UL);
Index: linux-2.6.24-rc3-mm2/include/asm-x86/pda.h
===
--- linux-2.6.24-rc3-mm2.orig/include/asm-x86/pda.h 2007-11-28 
20:59:13.164187921 -0800
+++ linux-2.6.24-rc3-mm2/include/asm-x86/pda.h  2007-11-28 20:59:35.403937534 
-0800
@@ -39,7 +39,6 @@ struct x8664_pda {
 } cacheline_aligned_in_smp;
 
 extern struct x8664_pda *_cpu_pda[];
-extern struct x8664_pda boot_cpu_pda[];
 extern void pda_init(int);
 
 #define cpu_pda(i) (_cpu_pda[i])
Index: linux-2.6.24-rc3-mm2/include/asm-generic/vmlinux.lds.h
===
--- linux-2.6.24-rc3-mm2.orig/include/asm-generic/vmlinux.lds.h 2007-11-28 
20:59:13.176187886 -0800
+++ linux-2.6.24-rc3-mm2/include/asm-generic/vmlinux.lds.h  2007-11-28 
20:59:35.403937534 -0800
@@ -259,6 +259,7 @@
. = ALIGN(align);

[patch 1/3] Percpu infrastructure to rebase the per cpu area to 0UL

Support an option

CONFIG_PERCPU_ZERO_BASED

that makes offsets for per cpu variables start at zero.

If a percpu area starts at zero then

1. We do not need RELOC_HIDE anymore

2. Indexes off the per cpu area for each processor are small

3. The percpu area "addresses" are offsets and we can then
   have allocpercpu/cpu_alloc in the future also use these
   offsets so that percpu functions can take any type of
   percpu address if it is provided by a percpu variable
   or a pointer obtained via allocpercpu/cpu_alloc.

The linker area boundaries variables are different for zero based
percpu segments:

__per_cpu_load  -> The address at which the percpu area was loaded
__per_cpu_size  -> The length of the per cpu area
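
As an illustration of point 3, here is a userspace sketch of zero-based per cpu addressing. It is a model only: the struct, the array and the function names are made up for the example and are not the kernel's. Each CPU gets a private copy of a template image, and a variable is identified purely by its offset into that image, so the per cpu pointer is simply base plus offset:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative model: the "per cpu image" that gets copied per CPU. */
struct percpu_image {
	long counter;
	long other;
};

#define NR_CPUS_DEMO 2

/* Base address of each CPU's private copy (hypothetical name). */
static char *cpu_area[NR_CPUS_DEMO];

/* Copy the load image into a fresh area for each CPU. This plays the
 * role of memcpy(ptr, __per_cpu_load, __per_cpu_size) in the series. */
static int setup_areas(const struct percpu_image *load)
{
	for (int cpu = 0; cpu < NR_CPUS_DEMO; cpu++) {
		cpu_area[cpu] = malloc(sizeof(*load));
		if (!cpu_area[cpu])
			return -1;
		memcpy(cpu_area[cpu], load, sizeof(*load));
	}
	return 0;
}

/* Zero based: the variable's "address" already is an offset, so no
 * subtraction of a section start address is needed. */
static long *percpu_ptr(int cpu, size_t offset)
{
	return (long *)(cpu_area[cpu] + offset);
}
```

With a non-zero-based area the same lookup would first have to subtract the start of the percpu section from the variable's link-time address. The zero-based layout removes that step.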


Removes the &__per_cpu_x in lockdep. AFAICT The __per_cpu_x are already
pointers.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/asm-generic/percpu.h  |7 ++-
 include/asm-generic/sections.h|   10 ++
 include/asm-generic/vmlinux.lds.h |   15 +++
 init/main.c   |   17 +
 kernel/lockdep.c  |4 ++--
 5 files changed, 42 insertions(+), 11 deletions(-)

Index: linux-2.6.24-rc3-mm2/include/asm-generic/percpu.h
===
--- linux-2.6.24-rc3-mm2.orig/include/asm-generic/percpu.h  2007-11-29 
22:05:58.359576450 -0800
+++ linux-2.6.24-rc3-mm2/include/asm-generic/percpu.h   2007-11-29 
22:06:22.750825804 -0800
@@ -42,8 +42,13 @@ extern unsigned long __per_cpu_offset[NR
  * Only S390 provides its own means of moving the pointer.
  */
 #ifndef SHIFT_PTR
+#ifdef CONFIG_PERCPU_ZERO_BASED
+#define SHIFT_PTR(__p, __offset) \
+   ((__typeof(__p))(((void *)(__p)) + (__offset)))
+#else
 #define SHIFT_PTR(__p, __offset)   RELOC_HIDE((__p), (__offset))
-#endif
+#endif /* CONFIG_PERCPU_ZERO_BASED */
+#endif /* SHIFT_PTR */
 
 /*
  * A percpu variable may point to a discarded reghions. The following are
Index: linux-2.6.24-rc3-mm2/include/asm-generic/sections.h
===
--- linux-2.6.24-rc3-mm2.orig/include/asm-generic/sections.h2007-11-29 
22:05:58.367576240 -0800
+++ linux-2.6.24-rc3-mm2/include/asm-generic/sections.h 2007-11-29 
22:06:22.754826440 -0800
@@ -9,7 +9,17 @@ extern char __bss_start[], __bss_stop[];
 extern char __init_begin[], __init_end[];
 extern char _sinittext[], _einittext[];
 extern char _end[];
+#ifdef CONFIG_PERCPU_ZERO_BASED
+extern char __per_cpu_load[];
+extern char per_cpu_size[];
+#define __per_cpu_size ((unsigned long)&per_cpu_size)
+#define __per_cpu_start ((char *)0)
+#define __per_cpu_end ((char *)__per_cpu_size)
+#else
 extern char __per_cpu_start[], __per_cpu_end[];
+#define __per_cpu_load __per_cpu_start
+#define __per_cpu_size (__per_cpu_end - __per_cpu_start)
+#endif
 extern char __kprobes_text_start[], __kprobes_text_end[];
 extern char __initdata_begin[], __initdata_end[];
 extern char __start_rodata[], __end_rodata[];
Index: linux-2.6.24-rc3-mm2/include/asm-generic/vmlinux.lds.h
===
--- linux-2.6.24-rc3-mm2.orig/include/asm-generic/vmlinux.lds.h 2007-11-29 
22:06:03.486826118 -0800
+++ linux-2.6.24-rc3-mm2/include/asm-generic/vmlinux.lds.h  2007-11-29 
22:06:22.754826440 -0800
@@ -255,6 +255,20 @@
*(.initcall7.init)  \
*(.initcall7s.init)
 
+#ifdef CONFIG_PERCPU_ZERO_BASED
+#define PERCPU(align)  \
+   . = ALIGN(align);   \
+   percpu : { } :percpu\
+   __per_cpu_load = .; \
+   .data.percpu 0 : AT(__per_cpu_load - LOAD_OFFSET) { \
+   *(.data.percpu.first)   \
+   *(.data.percpu) \
+   *(.data.percpu.shared_aligned)  \
+   per_cpu_size = .;   \
+   }   \
+   . = __per_cpu_load + per_cpu_size;  \
+   data : { } :data
+#else
 #define PERCPU(align)  \
. = ALIGN(align);   \
__per_cpu_start = .;\
@@ -263,3 +277,4 @@
*(.data.percpu.shared_aligned)  \
}   \
__per_cpu_end = .;
+#endif
Index: linux-2.6.24-rc3-mm2/init/main.c
===
--- linux-2.6.24-rc3-mm2.orig/init/main.c   2007-11-29

[patch 0/3] Per cpu relocation to ZERO and x86_32 percpu ops on x86_64

This patchset allows the use of x86_32 percpu ops on x86_64 while maintaining
%gs pointing to the pda. It does that by moving the x86_64 pda into
the percpu area (thereby pointing %gs at the per cpu area) and then
relocating the x86_64 per cpu variables to start at 0.

Patch applies on top of the per cpu cleanup patches V2.
See http://marc.info/?l=linux-kernel&m=119628478316525&w=2

Ultimately I think we can make the per cpu accessors arch independent
(see the RFC at http://marc.info/?l=linux-kernel&m=119552126330405&w=2).
There is a performance benefit from using these in core code.


Re: [RFC] Sample kset/ktype/kobject implementation

2007-11-29 Thread Greg KH

On Thu, Nov 29, 2007 at 05:11:35PM -0500, Alan Stern wrote:
> On Thu, 29 Nov 2007, Greg KH wrote:
> 
> > > > > kobject_put(foo) is needed since it gets you through kobject_cleanup()
> > > > > where the name can be freed.
> > > > 
> > > > No, kobject_register() should have handled that for us, right?
> > > 
> > > kobject_register() doesn't do a kobject_put() if kobject_add() failed.
> > 
> > Crap.  If I can't get this code right in an example, the API is messed
> > up.  Time to take Kay seriously and start to revamp the basic kobject
> > api :)
> 
> The rule is simple enough.  After calling kobject_register() you should 
> always use kobject_put() -- even if kobject_register() failed.

Yes.

> In fact, after calling kobject_init() you should use kobject_put().  
> The first rule follows from this one, since kobject_register() calls 
> kobject_init() internally.

Yes, that makes sense, time to write it all down :)

thanks,

greg k-h
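
The rule discussed above (once the object has been initialized, always drop it with a put, even when registration fails) can be sketched in userspace as follows. This is a simplified model with made-up names (struct obj, obj_register() and friends), not the kernel's kobject implementation:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative refcounted object; NOT the kernel's struct kobject. */
struct obj {
	int refcount;
	char *name;
	void (*release)(struct obj *);
};

static int released_count; /* counts release() calls, for demonstration */

/* Release frees everything the object owns, including its name. */
static void obj_release(struct obj *o)
{
	free(o->name);
	free(o);
	released_count++;
}

/* Models the init step: afterwards the caller holds one reference. */
static void obj_init(struct obj *o, const char *name)
{
	size_t len = strlen(name) + 1;

	o->refcount = 1;
	o->name = malloc(len);
	if (o->name)
		memcpy(o->name, name, len);
	o->release = obj_release;
}

/* Drop one reference; the last put triggers release(). */
static void obj_put(struct obj *o)
{
	if (--o->refcount == 0)
		o->release(o);
}

/* Models a register step that initializes first and may fail later. */
static int obj_register(struct obj *o, const char *name, int fail)
{
	obj_init(o, name);
	return fail ? -1 : 0;
}

/* Caller pattern: on failure use put, not free(), so the name goes too. */
static struct obj *obj_create(const char *name, int fail)
{
	struct obj *o = calloc(1, sizeof(*o));

	if (!o)
		return NULL;
	if (obj_register(o, name, fail) != 0) {
		obj_put(o);
		return NULL;
	}
	return o;
}
```

Calling free(o) directly in the error path would leak the name in this model, which is the kind of leak the put-after-init rule is meant to prevent.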

Re: pnpacpi : exceeded the max number of IO resources

2007-11-29 Thread Valdis . Kletnieks

On Fri, 30 Nov 2007 10:21:28 +0800, Zhao Yakui said:
> Thanks for the acpidump & dmesg.
>   In the acpidump there are so many IO resource definitions in the device
> of mem2 and the number exceeds the predefined number(24).

On a semi-related note, I'm seeing 7 of these at each boot on a Dell Latitude 
D820:

pnpacpi: exceeded the max number of mem resources: 12

2.6.24-rc3-mm2 does it; 2.6.23-mm1 did not.

pnp-increase-the-maximum-number-of-resources.patch raised the limit from 4 to 12,
but I don't understand why it didn't complain at 4 in 23-mm1 yet does at 12 now.





Re: [BUG] 2.6.24-rc3-git2 softlockup detected

2007-11-29 Thread Kyle McMartin

On Thu, Nov 29, 2007 at 12:35:33AM -0800, Andrew Morton wrote:
> ten million is close enough to infinity for me to assume that we broke the
> driver and that's never going to terminate.
> 

how about this? doesn't break things on my pa8800:

diff --git a/drivers/scsi/sym53c8xx_2/sym_hipd.c 
b/drivers/scsi/sym53c8xx_2/sym_hipd.c
index 463f119..ef01cb1 100644
--- a/drivers/scsi/sym53c8xx_2/sym_hipd.c
+++ b/drivers/scsi/sym53c8xx_2/sym_hipd.c
@@ -1037,10 +1037,13 @@ restart_test:
/*
 *  Wait 'til done (with timeout)
 */
-   for (i=0; i<SYM_SNOOP_TIMEOUT; i++)
+   do {
if (INB(np, nc_istat) & (INTF|SIP|DIP))
break;
-   if (i>=SYM_SNOOP_TIMEOUT) {
+   msleep(10);
+   } while (i++ < SYM_SNOOP_TIMEOUT);
+
+   if (i >= SYM_SNOOP_TIMEOUT) {
printf ("CACHE TEST FAILED: timeout.\n");
return (0x20);
}
diff --git a/drivers/scsi/sym53c8xx_2/sym_hipd.h 
b/drivers/scsi/sym53c8xx_2/sym_hipd.h
index ad07880..85c483b 100644
--- a/drivers/scsi/sym53c8xx_2/sym_hipd.h
+++ b/drivers/scsi/sym53c8xx_2/sym_hipd.h
@@ -339,7 +339,7 @@
 /*
  *  Misc.
  */
-#define SYM_SNOOP_TIMEOUT (10000000)
+#define SYM_SNOOP_TIMEOUT (1000)
 #define BUS_8_BIT  0
 #define BUS_16_BIT 1
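
For readers following the idea behind the patch (poll a status register, sleep between polls, give up after a bounded number of tries), here is a minimal userspace sketch. The fake device and all names in it are hypothetical stand-ins: the real driver reads nc_istat via INB() and sleeps with msleep(10). The sketch also reports the outcome through the return value rather than re-testing the loop counter afterwards:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the hardware: reports "done" only after a
 * fixed number of status reads have returned "busy". */
struct fake_dev {
	int reads_until_done;   /* status reads that still report busy */
	unsigned int slept_ms;  /* total simulated sleep time */
};

static bool fake_dev_done(struct fake_dev *dev)
{
	if (dev->reads_until_done > 0) {
		dev->reads_until_done--;
		return false;
	}
	return true;
}

/* Stand-in for msleep(): only records the time that would be slept. */
static void fake_sleep(struct fake_dev *dev, unsigned int ms)
{
	dev->slept_ms += ms;
}

/* Poll at most max_polls times, sleeping delay_ms after each busy read.
 * Returns true if the device reported done, false on timeout. */
static bool wait_done(struct fake_dev *dev, int max_polls,
		      unsigned int delay_ms)
{
	for (int i = 0; i < max_polls; i++) {
		if (fake_dev_done(dev))
			return true;
		fake_sleep(dev, delay_ms);
	}
	return false;
}
```

With 1000 polls and a 10 ms delay, a device that never completes is given up on after about 10 seconds of sleeping, instead of spinning on the CPU for the whole wait.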
 

Re: Out of tree module using LSM

2007-11-29 Thread Valdis . Kletnieks

On Thu, 29 Nov 2007 18:34:33 EST, Jon Masters said:
> 
> On Thu, 2007-11-29 at 21:45 +0000, Alan Cox wrote:
> > > Jargon File in all its glory. And if you still think you could look for
> > > patterns, how about executable code that self-modifies in random ways
> > > but when executed as a whole actually has the functionality of fetchmail
> > > embedded within it? How would you guard against that?
> > 
> > Thats a problem for whoever writes the ESR detection tool and to what
> > level it works. The question for the kernel is how do we provide a
> > mechanism to allow (to some extent at least) this kind of tool to run.
> 
> Right. I'm just saying reading a single page out of context (no pun
> intended) is not going to be very useful. 

Fortunately for all concerned, although Alan's self-modifying code is indeed a
possibility, it's much less of an issue than the sort of malware that can be
found with a simple "find this 27-byte sequence, which will be found in either
block 36 or 37 of the file".

And I'll make the prediction that we won't see anything doing the sorts of
things that Alan's program does, until that's the *easiest* way to get into
a system.  Until that time, they're either going to be sending simpler stuff
that a scanner can easily template and find, or using other means of attacks
that are outside the scope of a scanner.

Remember guys - we want to think about *realistic* threat models.  The e-mail
virus scanners we use catch hundreds to thousands of known viruses *every day*.
But I can count on the fingers of both hands the number of times I've had to
deal with a *real* "0-day" in a quarter century.  The scanner doesn't have to
be perfect - it just has to make it hard enough to bypass to render it
economically infeasible.  If you're targeted by a military/govt/political/
religious group that doesn't *care* if it's economically viable, you have
other, bigger problems to deal with...
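
The simple template matching mentioned above ("find this byte sequence in one of a few known blocks") could look roughly like the sketch below. It is a generic illustration with made-up function names and a caller-supplied block size, not the code of any particular scanner:

```c
#include <stddef.h>
#include <string.h>

/* Return the offset of the first occurrence of sig in buf, or -1. */
static long find_signature(const unsigned char *buf, size_t len,
			   const unsigned char *sig, size_t sig_len)
{
	if (sig_len == 0 || sig_len > len)
		return -1;
	for (size_t i = 0; i + sig_len <= len; i++)
		if (memcmp(buf + i, sig, sig_len) == 0)
			return (long)i;
	return -1;
}

/* Search only blocks first_block..last_block (inclusive) of block_size
 * bytes each. A match may straddle adjacent blocks inside that range.
 * Returns the offset from the start of buf, or -1 if not found. */
static long find_in_blocks(const unsigned char *buf, size_t len,
			   size_t block_size,
			   size_t first_block, size_t last_block,
			   const unsigned char *sig, size_t sig_len)
{
	size_t start = first_block * block_size;
	size_t end = (last_block + 1) * block_size;
	long off;

	if (start >= len)
		return -1;
	if (end > len)
		end = len;
	off = find_signature(buf + start, end - start, sig, sig_len);
	return off < 0 ? -1 : (long)start + off;
}
```

Restricting the search to the blocks where a known sample keeps its payload is what makes this kind of check cheap compared with analysing self-modifying code.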


Re: [PATCH] [RESEND] crypto test: use print_hex_dump from kernel.h instead

2007-11-29 Thread Herbert Xu

On Fri, Nov 30, 2007 at 09:20:34AM +0800, rae l wrote:
> 
> Cc: Randy Dunlap <[EMAIL PROTECTED]>
> Signed-off-by: Denis Cheng <[EMAIL PROTECTED]>

Patch applied.  Thanks a lot Denis!
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [RFC] Sample kset/ktype/kobject implementation

2007-11-29 Thread Dave Young

On Fri, Nov 30, 2007 at 01:07:37PM +0800, Dave Young wrote:
> On Nov 30, 2007 6:11 AM, Alan Stern <[EMAIL PROTECTED]> wrote:
> > On Thu, 29 Nov 2007, Greg KH wrote:
> >
> > > > > > kobject_put(foo) is needed since it gets you through 
> > > > > > kobject_cleanup()
> > > > > > where the name can be freed.
> > > > >
> > > > > No, kobject_register() should have handled that for us, right?
> > > >
> > > > kobject_register() doesn't do a kobject_put() if kobject_add() failed.
> > >
> > > Crap.  If I can't get this code right in an example, the API is messed
> > > up.  Time to take Kay seriously and start to revamp the basic kobject
> > > api :)
> >
> > The rule is simple enough.  After calling kobject_register() you should
> > always use kobject_put() -- even if kobject_register() failed.
> >
> > In fact, after calling kobject_init() you should use kobject_put().
> > The first rule follows from this one, since kobject_register() calls
> > kobject_init() internally.
> >
> Hi,
> The behavior is not very clear here, the root problem is that :
> 
> 1. Should we call kobject_put so cleanup work can be done by refcount
> touch zero or call kfree every time after kobject_register failed?
> 
> 2. If kobject_put calling is true, should this be done in
> kobject_register error handling codes or by hand after
> kobject_register failed?
> 
IMO, I'd rather use kobject_put(), because the kobject name should also be
released. After searching for kobject_register() callers, I found one leak of
this kind in pktcdvd.

Signed-off-by: Dave Young <[EMAIL PROTECTED]> 

---
drivers/block/pktcdvd.c |4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff -upr linux/drivers/block/pktcdvd.c linux.new/drivers/block/pktcdvd.c
--- linux/drivers/block/pktcdvd.c   2007-11-30 13:13:44.0 +0800
+++ linux.new/drivers/block/pktcdvd.c   2007-11-30 13:24:08.0 +0800
@@ -117,8 +117,10 @@ static struct pktcdvd_kobj* pkt_kobj_cre
p->kobj.parent = parent;
p->kobj.ktype = ktype;
p->pd = pd;
-   if (kobject_register(&p->kobj) != 0)
+   if (kobject_register(&p->kobj) != 0) {
+   kobject_put(&p->kobj);
return NULL;
+   }
return p;
 }
 /*
> Regards
> dave
> > Alan Stern
> >
> >
> >

Re: [PATCH] keyspan: init termios properly

2007-11-29 Thread Borislav Petkov

On Mon, Nov 26, 2007 at 02:18:52PM -0800, Andrew Morton wrote:
> On Sun, 18 Nov 2007 14:11:30 +0100
> Borislav Petkov <[EMAIL PROTECTED]> wrote:
> 
> > On Thu, Nov 15, 2007 at 01:10:16PM -0800, Lucy McCoy wrote:

...

> > yes, after testing this i can confirm that this one fixes the NULL ptr
> > problem here so you might want to submit a proper patch to Greg.
> 
> I'll merge revert-keyspan-init-termios-properly.patch soon, but afaik we
> are still awaiting the real fix for this problem?

Hi Andrew,
sorry for the late reply - i was away from the country and couldn't read mail.
Yes, we are still awaiting the real fix afaik but the code fragment above
removes the NULL ptr deref so we should at least merge that. Will prepare a
patch for this later today...

-- 
Regards/Gruß,
Boris.

Re: [BUG] 2.6.24-rc3-mm2 soft lockup while running tbench

2007-11-29 Thread Kamalesh Babulal

Andrew Morton wrote:
> On Wed, 28 Nov 2007 20:03:22 +0530
> Kamalesh Babulal <[EMAIL PROTECTED]> wrote:
> 
>> Hi Andrew,
>>
>> while running tbench on the powerpc with 2.6.24-rc3-mm2 softlock up occurs
>>
>> BUG: soft lockup - CPU#0 stuck for 11s! [tbench:12183]
>> NIP: c00ac978 LR: c00acff0 CTR: c005c648
>> REGS: C0076F0F3200 TRAP: 0901   Not tainted  (2.6.24-rc3-mm2-autotest)
>> MSR: 80009032   CR: 44000482  XER: 
>> TASK = C0076F4BC000[12183] 'tbench' THREAD: C0076F0F CPU: 0
>> NIP [c00ac978] .get_page_from_freelist+0x1cc/0x754
>> LR [c00acff0] .__alloc_pages+0xb0/0x3a8
>> Call Trace:
>> [c0076f0f3480] [c0076f0f3560] 0xc0076f0f3560 (unreliable)
>> [c0076f0f3590] [c00acff0] .__alloc_pages+0xb0/0x3a8
>> [c0076f0f3680] [c00ce2e4] .alloc_pages_current+0xa8/0xc8
>> [c0076f0f3710] [c00ac6ec] .__get_free_pages+0x20/0x70
>> [c0076f0f3790] [c00d75c8] .__kmalloc_node_track_caller+0x60/0x148
>> [c0076f0f3840] [c02c22b0] .__alloc_skb+0x98/0x184
>> [c0076f0f38f0] [c0306cd8] .tcp_sendmsg+0x1fc/0xe24
>> [c0076f0f3a10] [c02b963c] .sock_sendmsg+0xe4/0x128
>> [c0076f0f3c10] [c02ba4ec] .sys_sendto+0xd4/0x120
>> [c0076f0f3d90] [c02df2f8] .compat_sys_socketcall+0x148/0x214
>> [c0076f0f3e30] [c000872c] syscall_exit+0x0/0x40
>> Instruction dump:
>> 720b0001 eb97 40820070 7202 4182000c e8bc 4818 72080004 
>> 4182000c e8bc0008 4808 e8bc0010  7f83e378 7de407b4 7e078378 
>>
> 
> hm.  Beats me.  Does the machine recover OK?
> -
Hi Andrew,

In the set of test cases run serially, the soft lockup is seen in tbench;
the remaining test cases then run successfully after the soft lockup.

-- 
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.

Re: [RFC] Sample kset/ktype/kobject implementation

2007-11-29 Thread Dave Young

On Nov 30, 2007 6:11 AM, Alan Stern <[EMAIL PROTECTED]> wrote:
> On Thu, 29 Nov 2007, Greg KH wrote:
>
> > > > > kobject_put(foo) is needed since it gets you through kobject_cleanup()
> > > > > where the name can be freed.
> > > >
> > > > No, kobject_register() should have handled that for us, right?
> > >
> > > kobject_register() doesn't do a kobject_put() if kobject_add() failed.
> >
> > Crap.  If I can't get this code right in an example, the API is messed
> > up.  Time to take Kay seriously and start to revamp the basic kobject
> > api :)
>
> The rule is simple enough.  After calling kobject_register() you should
> always use kobject_put() -- even if kobject_register() failed.
>
> In fact, after calling kobject_init() you should use kobject_put().
> The first rule follows from this one, since kobject_register() calls
> kobject_init() internally.
>
Hi,
The behavior is not very clear here, the root problem is that :

1. Should we call kobject_put so cleanup work can be done by refcount
touch zero or call kfree every time after kobject_register failed?

2. If kobject_put calling is true, should this be done in
kobject_register error handling codes or by hand after
kobject_register failed?
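
One way to picture the rule Alan describes is a small userspace mock.  This
is only a sketch: the toy_* names are hypothetical and are not the kernel
kobject API.  It assumes, as stated above, that the register step takes the
initial reference and allocates the name but does not drop the reference
when the add step fails, so the caller has to do the put itself.

```c
#include <stdlib.h>
#include <string.h>

/* Userspace mock only: toy_* names are hypothetical, not the kobject API. */
struct toy_obj {
	int refcount;
	char *name;				/* heap-allocated, like a kobject name */
	void (*release)(struct toy_obj *);	/* called when refcount hits zero */
};

static int toy_live_names;	/* names allocated and not yet freed */
static int toy_released;	/* number of objects released so far */

static void toy_init(struct toy_obj *o, const char *name,
		     void (*release)(struct toy_obj *))
{
	o->refcount = 1;
	o->release = release;
	o->name = malloc(strlen(name) + 1);
	if (o->name) {
		strcpy(o->name, name);
		toy_live_names++;
	}
}

/* Drop one reference; the cleanup path is the only place the name is freed. */
static void toy_put(struct toy_obj *o)
{
	if (--o->refcount == 0) {
		if (o->name) {
			free(o->name);
			o->name = NULL;
			toy_live_names--;
		}
		if (o->release)
			o->release(o);	/* may free o; do not touch o afterwards */
	}
}

/* Init plus an "add" step that may fail; no put on failure (the assumption). */
static int toy_register(struct toy_obj *o, const char *name,
			void (*release)(struct toy_obj *), int fail_add)
{
	toy_init(o, name, release);
	return fail_add ? -1 : 0;
}

static void toy_release(struct toy_obj *o)
{
	toy_released++;
	free(o);
}

/* Caller-side pattern: after a failed register, put instead of a bare free. */
static struct toy_obj *toy_create(const char *name, int fail_add)
{
	struct toy_obj *o = malloc(sizeof(*o));

	if (!o)
		return NULL;
	if (toy_register(o, name, toy_release, fail_add) != 0) {
		toy_put(o);	/* frees the name, then the object via release */
		return NULL;
	}
	return o;
}
```

With a bare free() in the error path of toy_create() the name would leak,
which is the same shape as the kfree-versus-kobject_put question in point 1.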

Regards
dave
> Alan Stern
>
>
>

Re: [linux-usb-devel] [BUG] USB_PERSIST

2007-11-29 Thread Raymano Garibaldi

On 11/29/07, Alan Stern <[EMAIL PROTECTED]> wrote:
> On Thu, 29 Nov 2007, Raymano Garibaldi wrote:
>
> > The feature does work as long as the device remains plugged in and
> > that is what I have said in my previous postings too. What I'm saying
> > that should work and worked under 2.6.21 and is not working currently
> > is the ability to unplug and plug back in the device while the
> > computer is suspended before resuming without losing the mount.
>
> Okay, guess I misunderstood what you wrote before.
>
> The patch below for 2.6.23 should do what you want (and more besides).
> It forces the USB Persist feature to apply to all persist-enabled
> devices, whether they were unplugged or not.
>
> There's no chance of this getting accepted into the official kernel in
> such a simple form, but at least it will allow you to do what you want.
>
> Alan Stern
>
>
> --- 2.6.23/drivers/usb/core/driver.c1   2007-11-29 10:57:36.0 -0500
> +++ 2.6.23/drivers/usb/core/driver.c2007-11-29 11:01:44.0 -0500
> @@ -1550,6 +1550,9 @@
>  		if (!(udev->reset_resume && udev->do_remote_wakeup))
>  			return -EPERM;
>  	}
> +
> +	/* Force all system resumes to be reset-resumes */
> +	udev->reset_resume = 1;
>  	return usb_external_resume_device(udev);
>  }
>
>
>

Alan,

Thank you! Thank you! Thank you!

Who'd have thought such a simple patch could make someone so happy?

That did the trick. I just tried it and it works beautifully whether
the device remains plugged in during suspend or if it's unplugged and
plugged back in during suspend and before resume.

Now if this could only become the default behavior ;-)

Thanks again,
Raymano G.

Re: sched_yield: delete sysctl_sched_compat_yield

2007-11-29 Thread Zhang, Yanmin

On Fri, 2007-11-30 at 14:29 +1100, Nick Piggin wrote:
> On Friday 30 November 2007 14:15, Zhang, Yanmin wrote:
> > On Fri, 2007-11-30 at 13:46 +1100, Nick Piggin wrote:
> > > On Wednesday 28 November 2007 09:57, Arjan van de Ven wrote:
> 
> > > > sounds like a bad idea; volanomark (well, technically the jvm behind
> > > > it) is abusing sched_yield() by assuming it does something it really
> > > > doesn't do, and as it happens some of the earlier 2.6 schedulers
> > > > accidentally happened to behave in a way that was nice for this
> > > > benchmark.
> > >
> > > OK, why is this still happening? Haven't we been asking JVMs to use
> > > futexes or posix locking for years and years now? Are there any sane
> > > jvms that _don't_ use yield?
> >
> > I think it's an issue of volanomark (a kind of java application) instead of
> > JVM.
> 
> volanomark itself and not the jvm is calling sched_yield()? Do we have
> any non-toy threaded java apps? (what's JAVA in the kernel-perf tests?)
I run lots of well-known benchmarks, and volanoMark is the one that gets the
largest impact from sched_yield.

As for real applications that use sched_yield, most of them are not open
source. Yesterday, I learned that someone was using sched_yield in his network
C programs, but he didn't want to share the sources with me.

> 
> 
> > > > Todays kernel has a different behavior somewhat (and before people
> > > > scream "regression"; sched_yield() behavior isn't really specified and
> > > > doesn't make any sense at all, whatever you get is what you get
> > > > it's pretty much an insane defacto behavior that is incredibly tied to
> > > > which decisions the scheduler makes how, and no app can depend on that
> > >
> > > It is a performance regression. Is there any reason *not* to use the
> > > "compat" yield by default?
> >
> > There is no, so I suggest to set sched_compat_yield=1 by default.
> > If sched_compat_yield=0, kernel almost does nothing but returns. When
> > sched_compat_yield=1, it is closer to the meaning of sched_yield man page.
> 
> sched_yield() is really only defined for posix realtime scheduling
> AFAIK, which talks about priority lists. 
> 
> SCHED_OTHER is defined to be a single priority, below the rest of the
> realtime priorities. So at first you *might* say that the process
> should then be made to run only after all other SCHED_OTHER processes,
> however there is no such ordering requirement for SCHED_OTHER
> scheduling. The SCHED_OTHER scheduler can run any task at any time.
> 
> That said, I think people would *expect* that call be much closer to
> the compat behaviour than the current default. And that's definitely
> what Linux has done in the past. So there really does need to be a
> good reason to change it like this IMO.
That's indeed what I am thinking.

I am running many tests (SPECjbb/SPECjbb2005/cpu2000/iozone/dbench/tbench...)
to see if there is any regression with sched_compat_yield=1. I think there is
no regression; the testing is just to double-check.

> 
> 
> > > As you say, for SCHED_OTHER tasks, yield
> > > can do almost anything. We may as well do something that isn't a
> > > regression...
> >
> > I just found SCHED_OTHER in man sched_setscheduler. Is it SCHED_NORMAL in
> > the latest kernel?
> 
> Yes, SCHED_NORMAL is SCHED_OTHER. Don't know why it got renamed...
Thanks.

Re: [PATCH] Avoid overflows in kernel/time.c


Arjan van de Ven wrote:


>> Anyway, I don't think compiling bc is hard on anything which has a C
>> compiler.
>
> alternative is to just also ship the precomputed values ;-)

Oh, come on... it's not like bc is some obscure thing.  It's a POSIX
utility.


-hpa

Re: [PATCH] Avoid overflows in kernel/time.c

2007-11-29 Thread Arjan van de Ven

On Thu, 29 Nov 2007 19:04:36 -0800
"H. Peter Anvin" <[EMAIL PROTECTED]> wrote:

> Chris Snook wrote:
> > H. Peter Anvin wrote:
> >> NOTE: This patch uses a bc(1) script to compute the appropriate
> >> constants.
> > 
> > Perhaps dc would be more appropriate?  That's included in busybox.
> > 
> 
> Perhaps it would, but I think there is more variability between dc 
> implementations -- consider if the busybox version is broken, for
> example.
> 
> Either way, how many people compile their kernels in a busybox
> environment?
> 
> Anyway, I don't think compiling bc is hard on anything which has a C 
> compiler.

alternative is to just also ship the precomputed values ;-)

> 
>   -hpa


-- 
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

RFC - organize include/linux/kernel.h, add include/linux/logging.h

2.6.25 material.

kernel.h has become a bit disorganized over a long time.
Here's an attempt to clean it up a bit.

Something for everyone to like or dislike...

Groups externs and functions by module/function
Creates a "logging.h" for printk, KERN_
Changes some macros to statement expressions
	DIV_ROUND_UP, roundup and __ALIGN_MASK
Removes the unused PTR_ALIGN
Conforms to coding style and 80 columns
Passes checkpatch but for coding style defects in checkpatch
	statement expressions don't need a space between "; and })"
	"do {} whiles" between "; and }"

 include/linux/kernel.h  |  458 +--
 include/linux/logging.h |  154 

These files used macros to declare array elements.
Statement expressions can't be used for that,
so these now use direct calculations instead.

 include/linux/bitops.h  |2 +-
 lib/radix-tree.c|5 +-
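
To make the constraint above concrete, here is a minimal sketch of why a
statement-expression macro cannot size a file-scope array.  It is not taken
from the patch: the _EXPR/_STMT suffixes, NBITS, BITS_PER_WORD, bitmap,
bitmap_words and next_and_count are hypothetical names, and the code assumes
a compiler with the GNU C statement-expression and __typeof__ extensions
(gcc or clang).

```c
#include <stddef.h>

/* Plain-expression form: an integer constant expression for constant n, d. */
#define DIV_ROUND_UP_EXPR(n, d) (((n) + (d) - 1) / (d))

/*
 * Statement-expression form (GNU C extension): each argument is evaluated
 * exactly once, but the result is never a constant expression.
 */
#define DIV_ROUND_UP_STMT(n, d) ({		\
	__typeof__(n) _n = (n);			\
	__typeof__(d) _d = (d);			\
	(_n + _d - 1) / _d; })

#define NBITS		70
#define BITS_PER_WORD	32

/*
 * A file-scope array needs a constant size, so only the plain form works
 * here; DIV_ROUND_UP_STMT(NBITS, BITS_PER_WORD) would not compile.
 */
static unsigned int bitmap[DIV_ROUND_UP_EXPR(NBITS, BITS_PER_WORD)];

static size_t bitmap_words(void)
{
	return sizeof(bitmap) / sizeof(bitmap[0]);
}

/* Helper with a side effect, to make double evaluation visible. */
static int next_and_count(int *counter)
{
	return ++*counter;
}
```

The trade-off is the usual one: the plain form evaluates (d) twice, so an
argument with side effects such as next_and_count(&c) runs twice, while the
statement-expression form runs it once.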

This one used the ALIGN macro, but I'm not inclined to
figure out what it actually does right now, so this copies
the old macro into the file and renames it.

 include/net/neighbour.h |5 +-

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 94bc996..2783ed9 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -1,403 +1,273 @@
 #ifndef _LINUX_KERNEL_H
 #define _LINUX_KERNEL_H
 
 /*
  * 'kernel.h' contains some often-used function prototypes etc
  */
 
 #ifdef __KERNEL__
 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 
-extern const char linux_banner[];
-extern const char linux_proc_banner[];
-
+/* could be in an include linux/limits.h */
 #define INT_MAX		((int)(~0U>>1))
 #define INT_MIN		(-INT_MAX - 1)
 #define UINT_MAX	(~0U)
 #define LONG_MAX	((long)(~0UL>>1))
 #define LONG_MIN	(-LONG_MAX - 1)
 #define ULONG_MAX	(~0UL)
 #define LLONG_MAX	((long long)(~0ULL>>1))
 #define LLONG_MIN	(-LLONG_MAX - 1)
 #define ULLONG_MAX	(~0ULL)
 
-#define STACK_MAGIC	0xdeadbeef
+/* useful macros */
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))
+#define FIELD_SIZEOF(t, f) (sizeof(((t*)0)->f))
 
-#define ALIGN(x,a)		__ALIGN_MASK(x,(typeof(x))(a)-1)
-#define __ALIGN_MASK(x,mask)	(((x)+(mask))&~(mask))
-#define PTR_ALIGN(p, a)		((typeof(p))ALIGN((unsigned long)(p), (a)))
-#define IS_ALIGNED(x,a)		(((x) % ((typeof(x))(a))) == 0)
+/*
+ * Check at compile time that something is of a particular type.
+ * Always evaluates to 1 so you may use it easily in comparisons.
+ */
+#define typecheck(type, x) \
+   ({type _dummy; typeof(x) _dummy2; (void)(&_dummy == &_dummy2); 1;})
 
-#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))
+/*
+ * Check at compile time that 'function' is a certain type, or is a pointer
+ * to that type (needs to use typedef for the function type.)
+ */
+#define typecheck_fn(type, function)   \
+   ({typeof(type) _x = function; (void)_x;})
 
-#define FIELD_SIZEOF(t, f) (sizeof(((t*)0)->f))
-#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
-#define roundup(x, y) ((((x) + ((y) - 1)) / (y)) * (y))
+/**
+ * container_of - cast a member of a structure out to the containing structure
+ * @ptr:   the pointer to the member.
+ * @type:  the type of the container struct this is embedded in.
+ * @member:the name of the member within the struct.
+ *
+ */
+#define container_of(ptr, type, member) ({ \
+   const typeof(((type *)0)->member) *__mptr = (ptr);  \
+   (type *)((char *)__mptr - offsetof(type, member));})
 
-#ifdef CONFIG_LBD
-# include 
-# define sector_div(a, b) do_div(a, b)
-#else
-# define sector_div(n, b)( \
-{ \
-   int _res; \
-   _res = (n) % (b); \
-   (n) /= (b); \
-   _res; \
-} \
-)
-#endif
+/*
+ * min()/max() macros that also do strict type-checking..
+ * See the "unnecessary" pointer comparison.
+ */
+#define min(x, y) ({   \
+   typeof(x) _x = (x); \
+   typeof(y) _y = (y); \
+   (void)(&_x == &_y); \
+   _x < _y ? _x : _y;})
+
+#define max(x, y) ({   \
+   typeof(x) _x = (x); \
+   typeof(y) _y = (y); \
+   (void)(&_x == &_y); \
+   _x > _y ? _x : _y;})
+
+/*
+ * ..and if you can't take the strict
+ * types, you can specify one yourself.
+ *
+ * Or not use min/max at all, of course.
+ */
+#define min_t(type, x, y) \
+   ({type _x = (x); type _y = (y); _x < _y ? _x: _y;})
+
+#define max_t(type, x, y) \
+   ({type _x = (x); type _y = (y); _x > _y ? _x: _y;})
+
+#define abs(x) ({int _x = (x); (_x < 0) ? -_x : _x;})
 
 /**
  * upper_32_bits - return bits 32-63 of a number
  * @n: the number we're accessing
  *
  * A basic shift-right of a 64- or 32-bit quantity.  Use this to suppress
  * the "right shift count >= width of type" warning when that

[PATCH] Documentation/Changes -> Documentation/Requirements (resend without truncated comment text)

Change Documentation/Changes to Documentation/Requirements, and at
least begin to separate the runtime requirements from the kernel
compilation requirements.

There are definitely kernel compilation requirements that are not
listed in this file.  It would be good to get them uncovered.

This document is obviously woefully incomplete, for one thing it has
absolutely no per-architecture information, except "may depend on the
CPU in your system."  Hopefully this will encourage people to document
those per-architecture requirements.

Signed-off-by: H. Peter Anvin <[EMAIL PROTECTED]>
---

As far as I can tell, Documentation/Changes is the only thing we have
that even attempts to document the basic requirements.  This attempts
to formalize that fact.

 Documentation/Changes  |  396 
 Documentation/Requirements |  394 +++
 2 files changed, 394 insertions(+), 396 deletions(-)
 delete mode 100644 Documentation/Changes
 create mode 100644 Documentation/Requirements

diff --git a/Documentation/Changes b/Documentation/Changes
deleted file mode 100644
index cb2b141..000
--- a/Documentation/Changes
+++ /dev/null
@@ -1,396 +0,0 @@
-Intro
-=====
-
-This document is designed to provide a list of the minimum levels of
-software necessary to run the 2.6 kernels, as well as provide brief
-instructions regarding any other "Gotchas" users may encounter when
-trying life on the Bleeding Edge.  If upgrading from a pre-2.4.x
-kernel, please consult the Changes file included with 2.4.x kernels for
-additional information; most of that information will not be repeated
-here.  Basically, this document assumes that your system is already
-functional and running at least 2.4.x kernels.
-
-This document is originally based on my "Changes" file for 2.0.x kernels
-and therefore owes credit to the same people as that file (Jared Mauch,
-Axel Boldt, Alessandro Sigala, and countless other users all over the
-'net).
-
-Current Minimal Requirements
-============================
-
-Upgrade to at *least* these software revisions before thinking you've
-encountered a bug!  If you're unsure what version you're currently
-running, the suggested command should tell you.
-
-Again, keep in mind that this list assumes you are already
-functionally running a Linux 2.4 kernel.  Also, not all tools are
-necessary on all systems; obviously, if you don't have any ISDN
-hardware, for example, you probably needn't concern yourself with
-isdn4k-utils.
-
-o  Gnu C                  3.2                     # gcc --version
-o  Gnu make               3.79.1                  # make --version
-o  binutils               2.12                    # ld -v
-o  util-linux             2.10o                   # fdformat --version
-o  module-init-tools      0.9.10                  # depmod -V
-o  e2fsprogs              1.29                    # tune2fs
-o  jfsutils               1.1.3                   # fsck.jfs -V
-o  reiserfsprogs          3.6.3                   # reiserfsck -V 2>&1|grep reiserfsprogs
-o  xfsprogs               2.6.0                   # xfs_db -V
-o  pcmciautils            004                     # pccardctl -V
-o  quota-tools            3.09                    # quota -V
-o  PPP                    2.4.0                   # pppd --version
-o  isdn4k-utils           3.1pre1                 # isdnctrl 2>&1|grep version
-o  nfs-utils              1.0.5                   # showmount --version
-o  procps                 3.2.0                   # ps --version
-o  oprofile               0.9                     # oprofiled --version
-o  udev                   081                     # udevinfo -V
-o  grub                   0.93                    # grub --version
-
-Kernel compilation
-==================
-
-GCC
----
-
-The gcc version requirements may vary depending on the type of CPU in your
-computer.
-
-Make
-----
-
-You will need Gnu make 3.79.1 or later to build the kernel.
-
-Binutils
---------
-
-Linux on IA-32 has recently switched from using as86 to using gas for
-assembling the 16-bit boot code, removing the need for as86 to compile
-your kernel.  This change does, however, mean that you need a recent
-release of binutils.
-
-System utilities
-================
-
-Architectural changes
----------------------
-
-DevFS has been obsoleted in favour of udev
-(http://www.kernel.org/pub/linux/utils/kernel/hotplug/)
-
-32-bit UID support is now in place.  Have fun!
-
-Linux documentation for functions is transitioning to inline
-documentation via specially-formatted comments near their
-definitions in the source.  These comments can be combined with the
-SGML templates in the Documentation/DocBook directory to make DocBook
-files, which can then be converted by DocBook stylesheets to PostScript,
-HTML, PDF files, and several other formats.  In order to convert from
-DocBook format to a format of your choice, you'll need to install Jade as
-well as the desired DocBook

Re: sched_yield: delete sysctl_sched_compat_yield

On Friday 30 November 2007 14:15, Zhang, Yanmin wrote:
> On Fri, 2007-11-30 at 13:46 +1100, Nick Piggin wrote:
> > On Wednesday 28 November 2007 09:57, Arjan van de Ven wrote:

> > > sounds like a bad idea; volanomark (well, technically the jvm behind
> > > it) is abusing sched_yield() by assuming it does something it really
> > > doesn't do, and as it happens some of the earlier 2.6 schedulers
> > > accidentally happened to behave in a way that was nice for this
> > > benchmark.
> >
> > OK, why is this still happening? Haven't we been asking JVMs to use
> > futexes or posix locking for years and years now? Are there any sane
> > jvms that _don't_ use yield?
>
> I think it's an issue of volanomark (a kind of java application) instead of
> JVM.

volanomark itself and not the jvm is calling sched_yield()? Do we have
any non-toy threaded java apps? (what's JAVA in the kernel-perf tests?)

> > > Todays kernel has a different behavior somewhat (and before people
> > > scream "regression"; sched_yield() behavior isn't really specified and
> > > doesn't make any sense at all, whatever you get is what you get
> > > it's pretty much an insane defacto behavior that is incredibly tied to
> > > which decisions the scheduler makes how, and no app can depend on that
> >
> > It is a performance regression. Is there any reason *not* to use the
> > "compat" yield by default?
>
> There is no, so I suggest to set sched_compat_yield=1 by default.
> If sched_compat_yield=0, kernel almost does nothing but returns. When
> sched_compat_yield=1, it is closer to the meaning of sched_yield man page.

sched_yield() is really only defined for posix realtime scheduling
AFAIK, which talks about priority lists. 

SCHED_OTHER is defined to be a single priority, below the rest of the
realtime priorities. So at first you *might* say that the process
should then be made to run only after all other SCHED_OTHER processes,
however there is no such ordering requirement for SCHED_OTHER
scheduling. The SCHED_OTHER scheduler can run any task at any time.

That said, I think people would *expect* that call be much closer to
the compat behaviour than the current default. And that's definitely
what Linux has done in the past. So there really does need to be a
good reason to change it like this IMO.
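
To illustrate the difference being argued about, here is a small sketch of
the two waiting styles.  The names ready_flag, wait_by_yielding,
wait_by_blocking and signal_ready are hypothetical and not taken from any
JVM or benchmark.  The yield loop's progress depends on whatever
sched_yield() happens to do for SCHED_OTHER tasks, whereas the condition
variable version blocks until it is woken and does not rely on yield
semantics at all.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper type for illustration only. */
struct ready_flag {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool ready;
};

#define READY_FLAG_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false }

/*
 * Yield-based wait: how long this spins, and who runs in the meantime,
 * is up to the scheduler's interpretation of sched_yield().
 */
static void wait_by_yielding(atomic_bool *ready)
{
	while (!atomic_load(ready))
		sched_yield();
}

/* Blocking wait: sleeps until signal_ready() has set the flag. */
static void wait_by_blocking(struct ready_flag *f)
{
	pthread_mutex_lock(&f->lock);
	while (!f->ready)
		pthread_cond_wait(&f->cond, &f->lock);
	pthread_mutex_unlock(&f->lock);
}

/* Set the flag and wake one waiter. */
static void signal_ready(struct ready_flag *f)
{
	pthread_mutex_lock(&f->lock);
	f->ready = true;
	pthread_cond_signal(&f->cond);
	pthread_mutex_unlock(&f->lock);
}
```

In a real program signal_ready() would be called from another thread; the
point is only that the second style behaves the same whichever yield
behaviour the kernel picks.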

> > As you say, for SCHED_OTHER tasks, yield
> > can do almost anything. We may as well do something that isn't a
> > regression...
>
> I just found SCHED_OTHER in man sched_setscheduler. Is it SCHED_NORMAL in
> the latest kernel?

Yes, SCHED_NORMAL is SCHED_OTHER. Don't know why it got renamed...

[PATCH] Documentation/Changes -> Documentation/Requirements

Change Documentation/Changes to Documentation/Requirements, and at
least begin to separate the runtime requirements from the kernel
compilation requirements.

There are definitely kernel compilation requirements that are not
listed in this file.  It would be good to get them uncovered.

This document is obviously woefully incomplete, for one thing it has
absolutely no per-architecture information, except "may depend on the
CPU in your system."  Hopefully this will encourage people to
---

As far as I can tell, Documentation/Changes is the only thing we have
that even attempts to document the basic requirements.  This attempts
to formalize that fact.

 Documentation/Changes  |  396 
 Documentation/Requirements |  394 +++
 2 files changed, 394 insertions(+), 396 deletions(-)
 delete mode 100644 Documentation/Changes
 create mode 100644 Documentation/Requirements

diff --git a/Documentation/Changes b/Documentation/Changes
deleted file mode 100644
index cb2b141..000
--- a/Documentation/Changes
+++ /dev/null
@@ -1,396 +0,0 @@
-Intro
-=====
-
-This document is designed to provide a list of the minimum levels of
-software necessary to run the 2.6 kernels, as well as provide brief
-instructions regarding any other "Gotchas" users may encounter when
-trying life on the Bleeding Edge.  If upgrading from a pre-2.4.x
-kernel, please consult the Changes file included with 2.4.x kernels for
-additional information; most of that information will not be repeated
-here.  Basically, this document assumes that your system is already
-functional and running at least 2.4.x kernels.
-
-This document is originally based on my "Changes" file for 2.0.x kernels
-and therefore owes credit to the same people as that file (Jared Mauch,
-Axel Boldt, Alessandro Sigala, and countless other users all over the
-'net).
-
-Current Minimal Requirements
-============================
-
-Upgrade to at *least* these software revisions before thinking you've
-encountered a bug!  If you're unsure what version you're currently
-running, the suggested command should tell you.
-
-Again, keep in mind that this list assumes you are already
-functionally running a Linux 2.4 kernel.  Also, not all tools are
-necessary on all systems; obviously, if you don't have any ISDN
-hardware, for example, you probably needn't concern yourself with
-isdn4k-utils.
-
-o  Gnu C                  3.2                     # gcc --version
-o  Gnu make               3.79.1                  # make --version
-o  binutils               2.12                    # ld -v
-o  util-linux             2.10o                   # fdformat --version
-o  module-init-tools      0.9.10                  # depmod -V
-o  e2fsprogs              1.29                    # tune2fs
-o  jfsutils               1.1.3                   # fsck.jfs -V
-o  reiserfsprogs          3.6.3                   # reiserfsck -V 2>&1|grep reiserfsprogs
-o  xfsprogs               2.6.0                   # xfs_db -V
-o  pcmciautils            004                     # pccardctl -V
-o  quota-tools            3.09                    # quota -V
-o  PPP                    2.4.0                   # pppd --version
-o  isdn4k-utils           3.1pre1                 # isdnctrl 2>&1|grep version
-o  nfs-utils              1.0.5                   # showmount --version
-o  procps                 3.2.0                   # ps --version
-o  oprofile               0.9                     # oprofiled --version
-o  udev                   081                     # udevinfo -V
-o  grub                   0.93                    # grub --version
-
-Kernel compilation
-==================
-
-GCC
----
-
-The gcc version requirements may vary depending on the type of CPU in your
-computer.
-
-Make
-----
-
-You will need Gnu make 3.79.1 or later to build the kernel.
-
-Binutils
---------
-
-Linux on IA-32 has recently switched from using as86 to using gas for
-assembling the 16-bit boot code, removing the need for as86 to compile
-your kernel.  This change does, however, mean that you need a recent
-release of binutils.
-
-System utilities
-================
-
-Architectural changes
----------------------
-
-DevFS has been obsoleted in favour of udev
-(http://www.kernel.org/pub/linux/utils/kernel/hotplug/)
-
-32-bit UID support is now in place.  Have fun!
-
-Linux documentation for functions is transitioning to inline
-documentation via specially-formatted comments near their
-definitions in the source.  These comments can be combined with the
-SGML templates in the Documentation/DocBook directory to make DocBook
-files, which can then be converted by DocBook stylesheets to PostScript,
-HTML, PDF files, and several other formats.  In order to convert from
-DocBook format to a format of your choice, you'll need to install Jade as
-well as the desired DocBook stylesheets.
-
-Util-linux
-----------
-
-New versions of util-linux provide *fdisk support for

Re: sched_yield: delete sysctl_sched_compat_yield

2007-11-29 Thread Zhang, Yanmin

On Fri, 2007-11-30 at 13:46 +1100, Nick Piggin wrote:
> On Wednesday 28 November 2007 09:57, Arjan van de Ven wrote:
> > On Tue, 27 Nov 2007 17:33:05 +0800
> >
> > "Zhang, Yanmin" <[EMAIL PROTECTED]> wrote:
> > > If echo "1">/proc/sys/kernel/sched_compat_yield before starting
> > > volanoMark testing, the result is very good with kernel 2.6.24-rc3 on
> > > my 16-core tigerton.
> > >
> > > 1) If /proc/sys/kernel/sched_compat_yield=1, comparing with 2.6.22,
> > > 2.6.24-rc3 has more than 70% improvement;
> > > 2) If /proc/sys/kernel/sched_compat_yield=0, comparing with 2.6.22,
> > > 2.6.24-rc3 has more than 80% regression;
> > >
> > > On other machines, the volanoMark result also has much improvement if
> > > /proc/sys/kernel/sched_compat_yield=1.
> > >
> > > Would you like to change function yield_task_fair to delete codes
> > > around sysctl_sched_compat_yield, or just initiate it to 1?
> >
> > sounds like a bad idea; volanomark (well, technically the jvm behind
> > it) is abusing sched_yield() by assuming it does something it really
> > doesn't do, and as it happens some of the earlier 2.6 schedulers
> > accidentally happened to behave in a way that was nice for this
> > benchmark.
> 
> OK, why is this still happening? Haven't we been asking JVMs to use
> futexes or posix locking for years and years now? Are there any sane
> jvms that _don't_ use yield?
I think it's an issue with volanomark (a Java application) rather than the JVM.

> 
> 
> > Todays kernel has a different behavior somewhat (and before people
> > scream "regression"; sched_yield() behavior isn't really specified and
> > doesn't make any sense at all, whatever you get is what you get
> > it's pretty much an insane defacto behavior that is incredibly tied to
> > which decisions the scheduler makes how, and no app can depend on that
> 
> It is a performance regression. Is there any reason *not* to use the
> "compat" yield by default?
There is none, so I suggest setting sched_compat_yield=1 by default.
If sched_compat_yield=0, the kernel does almost nothing but return. With
sched_compat_yield=1, it is closer to the meaning of the sched_yield man page.

> As you say, for SCHED_OTHER tasks, yield
> can do almost anything. We may as well do something that isn't a
> regression...
I just found SCHED_OTHER in man sched_setscheduler. Is it SCHED_NORMAL in
the latest kernel?

> 
> 
> > in any way. In fact, I've proposed to make sched_yield() just do an
> > msleep(1)... that'd be closer to what sched_yield is supposed to do
> > standard wise than any of the current behaviors  ;_
> 
> What makes you say that? IIRC of all the things that sched_yield can
> do, it is not allowed to block. So this is about the only thing that
> will break the standard...

Re: What can we do to get ready for memory controller merge in 2.6.25

2007-11-29 Thread Balbir Singh

Nick Piggin wrote:
> On Friday 30 November 2007 01:43, Balbir Singh wrote:
>> They say better strike when the iron is hot.
>>
>> Since we have so many people discussing the memory controller, I would
>> like to access the readiness of the memory controller for mainline
>> merge. Given that we have some time until the merge window, I'd like to
>> set aside some time (from my other work items) to work on the memory
>> controller, fix review comments and defects.
>>
>> In the past, we've received several useful comments from Rik Van Riel,
>> Lee Schermerhorn, Peter Zijlstra, Hugh Dickins, Nick Piggin, Paul Menage
>> and code contributions and bug fixes from Hugh Dickins, Pavel Emelianov,
>> Lee Schermerhorn, YAMAMOTO-San, Andrew Morton and KAMEZAWA-San. I
>> apologize if I missed out any other names or contributions
>>
>> At the VM-Summit we decided to try the current double LRU approach for
>> memory control. At this juncture in the space-time continuum, I seek
>> your support, feedback, comments and help to move the memory controller
> 
> Do you have any test cases, performance numbers, etc.? And also some
> results or even anecdotes of where this is going to be used would be
> interesting...
> 

Some test results were posted at

http://lkml.org/lkml/2007/8/17/69
http://lkml.org/lkml/2007/8/19/36
http://lwn.net/Articles/242554/

Some results for the RSS controller can be found in the OLS paper

https://ols2006.108.redhat.com/2007/Reprints/singh-Reprint.pdf

and at

http://lkml.org/lkml/2007/5/18/1

As far as test cases are concerned, I have a simple test case that I use
that allocates memory and touches all the allocated memory in a loop. I
can post that out if required. It uses various types of allocation

1. mmaped memory
2. anonymous memory
3. shared memory

I also run various benchmarks inside a control group, limited to 400 MB
of RAM.

One interesting thing that I noticed was that when I booted with mem= and created a container with the same . The swapout
test case ran much faster in the container (NOTE: This was prior to the
swap cache changes).

KAMEZAWA-San posted some test results on background reclaim and per zone
reclaim

http://forum.openvz.org/index.php?t=tree=4696=23964&==

The simplest use cases that come to mind are

1. Memory control for containers/virtualization
2. Job Isolation


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

Re: [PATCH] Avoid overflows in kernel/time.c


Chris Snook wrote:
> H. Peter Anvin wrote:
>> NOTE: This patch uses a bc(1) script to compute the appropriate
>> constants.
>
> Perhaps dc would be more appropriate?  That's included in busybox.



Perhaps it would, but I think there is more variability between dc 
implementations -- consider if the busybox version is broken, for example.


Either way, how many people compile their kernels in a busybox environment?

Anyway, I don't think compiling bc is hard on anything which has a C 
compiler.


-hpa

Re: sched_yield: delete sysctl_sched_compat_yield

On Friday 30 November 2007 13:51, Arjan van de Ven wrote:
> On Fri, 30 Nov 2007 13:46:22 +1100
>
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > Todays kernel has a different behavior somewhat (and before people
> > > scream "regression"; sched_yield() behavior isn't really specified
> > > and doesn't make any sense at all, whatever you get is what you
> > > get it's pretty much an insane defacto behavior that is
> > > incredibly tied to which decisions the scheduler makes how, and no
> > > app can depend on that
> >
> > It is a performance regression. Is there any reason *not* to use the
> > "compat" yield by default? As you say, for SCHED_OTHER tasks, yield
> > can do almost anything. We may as well do something that isn't a
> > regression..
>
> it just makes OTHER tests/benchmarks regress this is one of those
> things where you just can't win.

OK, which ones? Because java is slightly important...

> > > in any way. In fact, I've proposed to make sched_yield() just do an
> > > msleep(1)... that'd be closer to what sched_yield is supposed to do
> > > standard wise than any of the current behaviors  ;_
> >
> > What makes you say that? IIRC of all the things that sched_yeild can
> > do, it is not allowed to block. So this is about the only thing that
> > will break the standard...
>
> sched_yield OF COURSE can block.. it's a schedule call after all!

In unix, blocking ~= removed from runqueue, no?

OF COURSE it is allowed to cooperatively schedule another task, but
I don't see why you think it should so obviously be allowed to block
/ sleep.

It breaks basically the only invariant of sched_yield, because the
task will no longer keep running when there is nothing else to run.
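
For readers following the yield-versus-locking argument, here is a minimal
userspace sketch in Python (not JVM or kernel code). It contrasts polling with
sched_yield(), where the waiter stays runnable and what happens between checks
is up to the scheduler, with blocking on a synchronization object, where the
waiter sleeps until it is woken. The function names and the 50 ms delay are
hypothetical and chosen only for illustration.

```python
import os
import threading
import time


def wait_by_yielding(is_ready, timeout=5.0):
    """Poll is_ready(), calling sched_yield() between checks.

    The caller never sleeps; it stays runnable the whole time.
    Returns how many times it yielded before the condition held.
    """
    deadline = time.monotonic() + timeout
    yields = 0
    while not is_ready():
        if time.monotonic() > deadline:
            raise TimeoutError("condition never became true")
        os.sched_yield()  # hint only: may return immediately
        yields += 1
    return yields


def wait_by_blocking(event, timeout=5.0):
    """Block until the event is set; the waiter leaves the run queue."""
    return event.wait(timeout)


def demo(delay=0.05):
    """Run both waiters against an event that is set after `delay` seconds."""
    event = threading.Event()
    timer = threading.Timer(delay, event.set)
    timer.start()
    yields = wait_by_yielding(event.is_set)
    woke = wait_by_blocking(event)
    timer.join()
    return yields, woke
```

The number of yields returned by demo() varies from run to run and from
scheduler to scheduler, which is the point made above: an application cannot
depend on what sched_yield() does between two checks.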

Re: [PATCH] Avoid overflows in kernel/time.c


Andrew Morton wrote:
>> NOTE: This patch uses a bc(1) script to compute the appropriate
>> constants.
>
> Does this add the first dependency upon the availability of bc?

I believe it does.  I used bc because doing it in C would have required 
arbitrary-precision code or have added a dependency on libgmp.


-hpa
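
As an illustration of the kind of computation that needs exact wide integers,
here is a minimal sketch in Python, whose integers are arbitrary precision. It
is not the bc script from the patch. It assumes the goal is to replace a
division by a multiply and a shift; the helper names, the constant name and
HZ = 250 are assumptions made for this example.

```python
def mul_shift_constant(num, den, shift=32):
    """Return ceil(2**shift * num / den) using exact integer arithmetic."""
    return ((num << shift) + den - 1) // den


def scale(value, mul, shift=32):
    """Approximate value * num / den as a multiply followed by a shift."""
    return (value * mul) >> shift


HZ = 250             # assumed tick rate, for illustration only
MSEC_PER_SEC = 1000

# Hypothetical constant name; the multiplier turns msec / 4 into (msec * mul) >> 32.
MSEC_TO_HZ_MUL32 = mul_shift_constant(HZ, MSEC_PER_SEC)
```

With these assumed values the multiplier is exactly 2**30, so
scale(1000, MSEC_TO_HZ_MUL32) gives 250. For ratios that are not powers of two
the rounded-up multiplier is only an approximation, and its error for large
inputs would have to be checked separately.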

Re: [linux-usb-devel] [PATCH] base/class.c: prevent ooops due to insert/remove race (v3)

2007-11-29 Thread Alan Stern

On Thu, 29 Nov 2007, Linus Torvalds wrote:

> Heh. It definitely hasn't gotten lost by "the git software".

No, it sure hasn't.  In fact it was staring me right in the face and I 
didn't realize it.

> In fact, with 
> the kinds of hints you already gave, git makes it really _trivial_ to find 
> it.
> 
> Here's what you do:
> 
>   git log v2.6.23.. --author=Wilcox
> 
> and then just search for "scan_mutex", in the hope that Matthew wrote a 
> nice commit message. And yes, he did, so in less than a blink you get:
> 
>   commit 6b7f123f378743d739377871c0cbfbaf28c7d25a
>   Author: Matthew Wilcox <[EMAIL PROTECTED]>
>   Date:   Tue Jun 26 15:18:51 2007 -0600
>   
>   [SCSI] Fix async scanning double-add problems
> 
>   Stress-testing and some thought has revealed some places where
>   asynchronous scanning needs some more attention to locking.
>   
>    - Since async_scan is a bit, we need to hold the host_lock while
>      modifying it to prevent races against other CPUs modifying the word
>      that bit is in.  This is probably a theoretical race for the moment,
>      but other patches may change that.
>    - The async_scan bit means not only that this host is being scanned
>      asynchronously, but that all the devices attached to this host are not
>      yet added to sysfs.  So we must ensure that this bit is always in sync.
>      I've chosen to do this with the scan_mutex since it's already acquired
>      in most of the right places.
>   ...
> 
> which I assume is the commit you're talking about.

Yep, that's the one.

Alan Stern
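
The first point in the quoted commit message (a bit shares its word with other
bits, so updates to it must be serialized) can be shown with a small
deterministic simulation. This is a sketch in Python, not SCSI code: the flag
values are hypothetical, the two "CPUs" are just interleaved statements, and a
threading.Lock stands in for host_lock.

```python
import threading

ASYNC_SCAN = 0x1   # hypothetical bit position
OTHER_FLAG = 0x2   # hypothetical neighbouring bit in the same word


def unlocked_interleaving(word):
    """Two read-modify-write sequences interleaved without a lock."""
    a = word            # CPU A loads the word
    b = word            # CPU B loads the same (stale) word
    a |= ASYNC_SCAN     # CPU A sets its bit in its private copy
    b |= OTHER_FLAG     # CPU B sets its bit in its private copy
    word = a            # CPU A stores
    word = b            # CPU B stores, overwriting CPU A's update
    return word


def locked_sequence(word):
    """The same two updates, each done entirely under one lock."""
    lock = threading.Lock()   # stands in for host_lock
    with lock:
        word |= ASYNC_SCAN
    with lock:
        word |= OTHER_FLAG
    return word
```

In the unlocked interleaving the ASYNC_SCAN bit is lost even though neither
update touched it directly after the first store; with the lock both bits
survive.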


Re: [PATCH] kexec: force x86_64 arches to boot kdump kernels on boot cpu

2007-11-29 Thread Eric W. Biederman

Ben Woodard <[EMAIL PROTECTED]> writes:

> Eric W. Biederman wrote:
>> Vivek Goyal <[EMAIL PROTECTED]> writes:
>>
>>> Ok. Got it. So in this case we route the interrupts directly through LAPIC
>>> and put LVT0 in ExtInt mode and IOAPIC is bypassed.
>>>
>>> I am looking at Intel Multiprocessor specification v1.4 and as per figure
>>> 3-3 on page 3-9, 8259 is connected to LINTIN0 line, which in turn is
>>> connected to LINTIN0 pin on all processors. If that is the case, even in
>>> this mode, all the CPU should see the timer interrupts (which is coming
>>> from 8259)?
>>
>> However things are implemented completely differently now.  I don't think
>> the coherent hypertransport domain of AMD processors actually routes
>> ExtINT interrupts to all cpus but instead one (the default route?) is
>> picked.
>>
>> So I think for the kdump case we pretty much need to use an IOAPIC
>> in virtual wire mode for recent AMD systems.
>>
>> For current Intel systems I believe either scenario still works.
>>
>>> Can you print the LAPIC registers (print_local_APIC) during normal boot
>>> and during kdump boot and paste here?
>>
>> It's worth a look.
>>
>> I still think we need to just use apic mode at kernel startup, and
>> be done with it.
>>
>
> Neil whipped up a patch to try this and evidently it worked on his test boxes
> but it didn't work very well on our problem tests box. It hung after the kernel
> printed "Ready". i.e. on a normal boot I get:

Interesting. Can you please try an early_printk console?


I expect you made it a fair ways and it just didn't show up because you didn't
get as far as the normal serial port setup.

You don't have any output from your linux kernel.

Eric

Re: sched_yield: delete sysctl_sched_compat_yield

2007-11-29 Thread Arjan van de Ven

On Fri, 30 Nov 2007 13:46:22 +1100
Nick Piggin <[EMAIL PROTECTED]> wrote:

> > Todays kernel has a different behavior somewhat (and before people
> > scream "regression"; sched_yield() behavior isn't really specified
> > and doesn't make any sense at all, whatever you get is what you
> > get it's pretty much an insane defacto behavior that is
> > incredibly tied to which decisions the scheduler makes how, and no
> > app can depend on that
> 
> It is a performance regression. Is there any reason *not* to use the
> "compat" yield by default? As you say, for SCHED_OTHER tasks, yield
> can do almost anything. We may as well do something that isn't a
> regression..

It just makes OTHER tests/benchmarks regress; this is one of those
things where you just can't win.

> 
> 
> > in any way. In fact, I've proposed to make sched_yield() just do an
> > msleep(1)... that'd be closer to what sched_yield is supposed to do
> > standard wise than any of the current behaviors  ;_
> 
> What makes you say that? IIRC of all the things that sched_yeild can
> do, it is not allowed to block. So this is about the only thing that
> will break the standard...

sched_yield OF COURSE can block.. it's a schedule call after all!



-- 
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

Re: sched_yield: delete sysctl_sched_compat_yield

On Wednesday 28 November 2007 09:57, Arjan van de Ven wrote:
> On Tue, 27 Nov 2007 17:33:05 +0800
>
> "Zhang, Yanmin" <[EMAIL PROTECTED]> wrote:
> > If echo "1">/proc/sys/kernel/sched_compat_yield before starting
> > volanoMark testing, the result is very good with kernel 2.6.24-rc3 on
> > my 16-core tigerton.
> >
> > 1) If /proc/sys/kernel/sched_compat_yield=1, comparing with 2.6.22,
> > 2.6.24-rc3 has more than 70% improvement;
> > 2) If /proc/sys/kernel/sched_compat_yield=0, comparing with 2.6.22,
> > 2.6.24-rc3 has more than 80% regression;
> >
> > On other machines, the volanoMark result also has much improvement if
> > /proc/sys/kernel/sched_compat_yield=1.
> >
> > Would you like to change function yield_task_fair to delete codes
> > around sysctl_sched_compat_yield, or just initiate it to 1?
>
> sounds like a bad idea; volanomark (well, technically the jvm behind
> it) is abusing sched_yield() by assuming it does something it really
> doesn't do, and as it happens some of the earlier 2.6 schedulers
> accidentally happened to behave in a way that was nice for this
> benchmark.

OK, why is this still happening? Haven't we been asking JVMs to use
futexes or posix locking for years and years now? Are there any sane
jvms that _don't_ use yield?

> Todays kernel has a different behavior somewhat (and before people
> scream "regression"; sched_yield() behavior isn't really specified and
> doesn't make any sense at all, whatever you get is what you get
> it's pretty much an insane defacto behavior that is incredibly tied to
> which decisions the scheduler makes how, and no app can depend on that

It is a performance regression. Is there any reason *not* to use the
"compat" yield by default? As you say, for SCHED_OTHER tasks, yield
can do almost anything. We may as well do something that isn't a
regression...

> in any way. In fact, I've proposed to make sched_yield() just do an
> msleep(1)... that'd be closer to what sched_yield is supposed to do
> standard wise than any of the current behaviors  ;_

What makes you say that? IIRC of all the things that sched_yeild can
do, it is not allowed to block. So this is about the only thing that
will break the standard...

Re: [PATCH] Fix kmem_cache_free performance regression in slab

On Thu, 29 Nov 2007 12:05:13 -0700 Matthew Wilcox <[EMAIL PROTECTED]> wrote:

> The database performance group have found that half the cycles spent
> in kmem_cache_free are spent in this one call to BUG_ON.  Moving it
> into the CONFIG_SLAB_DEBUG-only function cache_free_debugcheck() is a
> performance win of almost 0.5% on their particular benchmark.
> 
> The call was added as part of commit ddc2e812d592457747c4367fb73edcaa8e1e49ff
> with the comment that "overhead should be minimal".  It may have been
> minimal at the time, but it isn't now.
> 

It is worth noting that the offending commit hit mainline in June 2006.

It takes a very long time for some performance regressions to be
discovered.  By which time it is effectively too late to fix it.


Re: Trailing periods in kernel messages

On Fri, 2007-11-30 at 09:54 +0800, Li Zefan wrote:
> So it doesn't deserve the effort to eliminate these periods, isn't it?

I hope these will eventually disappear.

> Or we can add a check to checkpatch.pl to prevent new ones.

Perhaps that's a good idea.

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index cbb4258..707f84c 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -1390,6 +1390,10 @@ sub process {
 			if ($line =~ /\*\s*\)\s*k[czm]alloc\b/) {
 				WARN("unnecessary cast may hide bugs, see http://c-faq.com/malloc/mallocnocast.html\n" . $herecurr);
 			}
+
+			if ($rawline =~ /(print|pr_(emerg|alert|crit|err|warning|notice|info|debug)).*\.\\n\"/) {
+				WARN("unnecessary period before newline\n" . $herecurr);
+			}
 		}
 
 		if ($chk_patch && !$is_patch) {
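
To see what the proposed pattern would and would not flag, here is a minimal
sketch that ports the same regular expression to Python's re module. It is
only an experiment harness, not part of checkpatch.pl; the function name and
the sample source lines are made up for illustration.

```python
import re

# Same pattern as in the patch above: a print-like call whose string
# literal ends in a period immediately before the "\n" escape.
TRAILING_PERIOD = re.compile(
    r'(print|pr_(emerg|alert|crit|err|warning|notice|info|debug)).*\.\\n"'
)


def has_trailing_period(source_line):
    """Return True if the C source line would trigger the proposed warning."""
    return TRAILING_PERIOD.search(source_line) is not None


samples = [
    'printk(KERN_INFO "device ready.\\n");',   # flagged
    'printk(KERN_INFO "device ready\\n");',    # not flagged
    'pr_info("loading...\\n");',               # flagged: ellipsis also ends in "."
    'snprintf(buf, len, "done.\\n");',         # flagged: "print" matches inside snprintf
]
flagged = [has_trailing_period(s) for s in samples]
```

The last two samples suggest the pattern is fairly broad: an ellipsis and any
function whose name contains "print" are matched as well, which may or may not
be wanted.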



Re: pnpacpi : exceeded the max number of IO resources

2007-11-29 Thread Shaohua Li


On Fri, 2007-11-30 at 03:18 +0100, Rene Herman wrote:
> On 29-11-07 10:11, Dave Young wrote:
> 
> > The pnpacpi rsparser.c report warnings of:
> > exceeded the max number of IO resources: 24
> > 
> > dmesg|grep exceeded|wc
> > 66 594 3564
> 
> Heavens... (added CCs of people who just upped it from 8 -- I suppose the 
> problem is not new then?)
Probably we should make it a bit bigger until Thomas's patch is ready.
Thomas, your patch isn't 2.6.24 stuff, right?

Thanks,
Shaohua

Re: Something similar to inotify in 2.4.

2007-11-29 Thread Rene Herman


On 29-11-07 18:09, Vitaliy Ivanov wrote:

> Can anyone advise whether there is something similar to inotify in 2.4
> kernel?


inotify is 2.6 (dnotify 2.4).

Rene

Re: [patch 05/14] percpu: Use a Kconfig variable to configure arch specific percpu setup

2007-11-29 Thread Rusty Russell

On Thursday 29 November 2007 10:36:06 Christoph Lameter wrote:
> The code becomes much simpler if gs would point to the beginning of the
> per cpu area and if the __per_cpu_offset[i] would do the same. No weird
> __per_cpu_start offsetting anymore.

It is a little weird, but it gave flexibility for most archs.

ISTR I had issues relocating the percpu area to 0, but I look forward to your 
code!

> The generic write/readpercpu functionality introduced by the cpu_alloc
> patchset works best with offsets relative to an arch dependent
> register. All per cpu data (pda, percpu and allocpercpu) is handles as an
> offset relative to the start of the per cpu data.

Hmm, did someone cc me on the patchset and I missed it?

> If the current offset by __per_cpu_start is kept then a per cpu allocator
> may have to dish out addresses that go beyond __per_cpu_end.

Of course; you just need congruence in your allocation across CPUs.  It's 
possible, but no worse than the requirements on other schemes where you can 
reach a variable with a single addition for the CPU.

> I think dealing with a per cpu variable as if it would be an offset
> relative to a base is natural for the typical addressing of cpus based on
> an offset relative to some register.

We've had practical problems getting the compiler to eke out the potential 
benefit.  That's why we settled for an offset between where the compiler 
expected and where the variable actually was.

Cheers,
Rusty.
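
The two addressing schemes discussed in this message can be compared with a
little address arithmetic. The sketch below is plain Python, not kernel code,
and every address in it is invented: PER_CPU_START stands for the link-time
start of the per-cpu section, and cpu_area_base for where one CPU's copy was
actually placed.

```python
PER_CPU_START = 0xC0400000  # hypothetical link-time address of the per-cpu section


def addr_current_scheme(sym_link_addr, cpu_area_base):
    """Offset between where the compiler expects the variable and where it is.

    The per-cpu offset is added to the symbol's link-time address.
    """
    per_cpu_offset = cpu_area_base - PER_CPU_START
    return sym_link_addr + per_cpu_offset


def addr_zero_based_scheme(sym_link_addr, cpu_area_base):
    """Treat the variable as an offset from the start of the per-cpu area.

    A base register would then point at cpu_area_base directly.
    """
    offset_in_area = sym_link_addr - PER_CPU_START
    return cpu_area_base + offset_in_area
```

Both functions reach the same address; what differs is which quantity is kept
per CPU (an offset from the link address, or the base of the area itself) and
therefore what the compiler has to be told about the symbol.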

Re: pnpacpi : exceeded the max number of IO resources

2007-11-29 Thread Rene Herman


On 29-11-07 10:11, Dave Young wrote:

> The pnpacpi rsparser.c report warnings of:
> exceeded the max number of IO resources: 24
>
> dmesg|grep exceeded|wc
> 66 594 3564


Heavens... (added CCs of people who just upped it from 8 -- I suppose the 
problem is not new then?)


Rene.


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-29 Thread Rusty Russell

On Friday 30 November 2007 03:53:34 Arjan van de Ven wrote:
> On Mon, 26 Nov 2007 10:25:33 -0800
>
> > Agreed. On first glance, I was intrigued but:
> >
> > 1) Why is everyone so concerned that export symbol space is large?
> > - does it cost cpu or running memory?
>
> yes. about 120 bytes per symbol

But this patch makes that worse, not better.

> > - does it cause bugs?
>
> yes, bad apis are causing bugs... sys_open is just the starter of that.

Sure, but this doesn't change the APIs, either.  We seem to have fixed 
sys_open the right way, and since we're not supposed to care about 
out-of-tree modules...

Rusty.

Re: [PATCH] kexec: force x86_64 arches to boot kdump kernels on boot cpu

2007-11-29 Thread Ben Woodard


Eric W. Biederman wrote:
> Vivek Goyal <[EMAIL PROTECTED]> writes:
>
>> Ok. Got it. So in this case we route the interrupts directly through LAPIC
>> and put LVT0 in ExtInt mode and IOAPIC is bypassed.
>>
>> I am looking at Intel Multiprocessor specification v1.4 and as per figure
>> 3-3 on page 3-9, 8259 is connected to LINTIN0 line, which in turn is
>> connected to LINTIN0 pin on all processors. If that is the case, even in
>> this mode, all the CPU should see the timer interrupts (which is coming
>> from 8259)?
>
> However things are implemented completely differently now.  I don't think
> the coherent hypertransport domain of AMD processors actually routes
> ExtINT interrupts to all cpus but instead one (the default route?) is
> picked.
>
> So I think for the kdump case we pretty much need to use an IOAPIC
> in virtual wire mode for recent AMD systems.
>
> For current Intel systems I believe either scenario still works.
>
>> Can you print the LAPIC registers (print_local_APIC) during normal boot
>> and during kdump boot and paste here?
>
> It's worth a look.
>
> I still think we need to just use apic mode at kernel startup, and
> be done with it.



Neil whipped up a patch to try this and evidently it worked on his test 
boxes but it didn't work very well on our problem tests box. It hung 
after the kernel printed "Ready". i.e. on a normal boot I get:



2007-11-29 13:48:29 Loading vmlinuz-2.6.18-13chaos.ben.test
2007-11-29 13:48:29 Loading initrd-2.6.18-13chaos.ben.test...
2007-11-29 13:48:29 Ready.
2007-11-29 13:48:30 Linux version 2.6.18-13chaos.ben.test ([EMAIL PROTECTED]) (gcc version 4.1.2 20070626 (Red Hat 4.1.2-14)) #10 SMP Thu Nov 29 13:11:49 PST 2007
2007-11-29 13:48:30 Command line: initrd=initrd-2.6.18-13chaos.ben.test loglevel=8 console=ttyS0,115200n8 [EMAIL PROTECTED] elevator=deadline swiotlb=65536 selinux=0 apic=debug BOOT_IMAGE=vmlinuz-2.6.18-13chaos.ben.test BOOTIF=01-00-30-48-57-91-56

With Neil's patch:
2007-11-29 17:12:55 PXELINUX 2.11 2004-08-16  Copyright (C) 1994-2004 H. Peter Anvin
2007-11-29 17:12:55 Boot options [default: 2.6.18-54.el5.bz336371]:
2007-11-29 17:12:55 linux-2.6.18-13chaos.ben.test-2.6.18-54.el5.bz336371
2007-11-29 17:12:55 linux
2007-11-29 17:12:55 linux-2.6.18-54.el5.bz336371
2007-11-29 17:12:55 linux-2.6.18-52.el5
2007-11-29 17:12:55 linux-2.6.18-13chaos.ben.test-2.6.18-13chaos.ben.test
2007-11-29 17:12:55 linux-2.6.23-0.214.rc8.git2.fc8
2007-11-29 17:12:55 linux-2.6.18-8.1.14.el5
2007-11-29 17:12:55 linux-2.6.18-7chaos
2007-11-29 17:12:55 boot:
2007-11-29 17:13:02 Loading vmlinuz-2.6.18-13chaos.ben.test
2007-11-29 17:13:02 Loading initrd-2.6.18-13chaos.ben.test...
2007-11-29 17:13:02 Ready.
(END)
That's all she wrote. End of story. I had to reboot to another kernel to 
get it back.


Neil's patch:

--- linux-2.6.18.noarch/arch/x86_64/kernel/i8259.c.orig 2007-11-28 18:00:31.0 -0500
+++ linux-2.6.18.noarch/arch/x86_64/kernel/i8259.c  2007-11-29 10:37:14.0 -0500

@@ -599,4 +599,30 @@

if (!acpi_ioapic)
setup_irq(2, );
+
+   /*
+ * Switch from PIC to APIC mode.
+ */
+connect_bsp_APIC();
+setup_local_APIC();
+
+if (GET_APIC_ID(apic_read(APIC_ID)) != boot_cpu_id) {
+panic("Boot APIC ID in local APIC unexpected (%d vs %d)",
+  GET_APIC_ID(apic_read(APIC_ID)), boot_cpu_id);
+/* Or can we switch back to PIC here? */
+}
+
+/*
+ * Now start the IO-APICs
+ */
+if (!skip_ioapic_setup && nr_ioapics)
+setup_IO_APIC();
+else
+nr_ioapics = 0;
+
+   /*
+* Disable local irqs here so start_kernel doesn't complain
+*/
+   local_irq_disable();
+
 }
--- linux-2.6.18.noarch/arch/x86_64/kernel/smpboot.c.orig 2007-11-28 18:07:33.0 -0500
+++ linux-2.6.18.noarch/arch/x86_64/kernel/smpboot.c2007-11-29 10:37:59.0 -0500

@@ -1088,26 +1088,6 @@


/*
-* Switch from PIC to APIC mode.
-*/
-   connect_bsp_APIC();
-   setup_local_APIC();
-
-   if (GET_APIC_ID(apic_read(APIC_ID)) != boot_cpu_id) {
-   panic("Boot APIC ID in local APIC unexpected (%d vs %d)",
- GET_APIC_ID(apic_read(APIC_ID)), boot_cpu_id);
-   /* Or can we switch back to PIC here? */
-   }
-
-   /*
-* Now start the IO-APICs
-*/
-   if (!skip_ioapic_setup && nr_ioapics)
-   setup_IO_APIC();
-   else
-   nr_ioapics = 0;
-
-   /*
 * Set up local APIC timer on boot CPU.
 */



Eric


Re: What can we do to get ready for memory controller merge in 2.6.25

On Friday 30 November 2007 01:43, Balbir Singh wrote:
> They say better strike when the iron is hot.
>
> Since we have so many people discussing the memory controller, I would
> like to access the readiness of the memory controller for mainline
> merge. Given that we have some time until the merge window, I'd like to
> set aside some time (from my other work items) to work on the memory
> controller, fix review comments and defects.
>
> In the past, we've received several useful comments from Rik Van Riel,
> Lee Schermerhorn, Peter Zijlstra, Hugh Dickins, Nick Piggin, Paul Menage
> and code contributions and bug fixes from Hugh Dickins, Pavel Emelianov,
> Lee Schermerhorn, YAMAMOTO-San, Andrew Morton and KAMEZAWA-San. I
> apologize if I missed out any other names or contributions
>
> At the VM-Summit we decided to try the current double LRU approach for
> memory control. At this juncture in the space-time continuum, I seek
> your support, feedback, comments and help to move the memory controller

Do you have any test cases, performance numbers, etc.? And also some
results or even anecdotes of where this is going to be used would be
interesting...

Thanks,
Nick

Re: [PATCH] kexec: force x86_64 arches to boot kdump kernels on boot cpu

2007-11-29 Thread Ben Woodard


Vivek Goyal wrote:

On Wed, Nov 28, 2007 at 11:02:06AM -0500, Neil Horman wrote:

On Wed, Nov 28, 2007 at 10:36:49AM -0500, Vivek Goyal wrote:

On Tue, Nov 27, 2007 at 03:24:35PM -0800, Ben Woodard wrote:

Andi Kleen wrote:

Are we putting the system back in PIC mode or virtual wire mode? I have
not seen systems which support PIC mode. All latest systems seems
to be having virtual wire mode. I think in case of PIC mode, interrupts

Yes it's probably virtual wire. For real PIC mode we would need really
old systems without APIC.


can be delivered to cpu0 only. In virt wire mode, one can program IOAPIC
to deliver interrupt to any of the cpus and that's what we have been

The code doesn't try to program anything specific, it just restores the state
that was left over originally by the BIOS.

So if the BIOS originally left the IOAPIC in a state where the timer 
interrupts were only going to CPU0 then by restoring that state we could be 
bringing this problem upon ourselves when we restore that state.



Hi Ben,

Apart from restoring the original state (Bring APICS back to virtual wire
mode), we also reprogram IOAPIC so that timer interrupt can go to crashing
cpu (and not necessarily cpu0). Look at following code in disable_IO_APIC.

entry.dest.physical.physical_dest =
GET_APIC_ID(apic_read(APIC_ID));

Here we read the apic id of crashing cpu and program IOAPIC accordingly.
This will make sure that even in virtual wire mode, timer interrupts
will be delivered to crashing cpu APIC.


Yes, but according to Bens last debug effort, the APIC printout regarding the
timer setup, indicates that ioapic_i8259.pin == -1, meaning that the 8259 is not
routed through the ioapic.  In those cases, disable_IO_APIC does not take us
through the path you reference above, and does not revert to virtual wire mode.
Instead, it simply disables legacy vector 0, which if I understand this
correctly, simply tells the ioapic to not handle timer interrupts, trusting that
the 8259 in the system will deliver that interrupt where it needs to be.  If the
8259 is wired to deliver timer interrupts to cpu0 only, then you get the problem
that we have, do you?



Ok. Got it. So in this case we route the interrupts directly through LAPIC
and put LVT0 in ExtInt mode and IOAPIC is bypassed.

I am looking at Intel Multiprocessor specification v1.4 and as per figure
3-3 on page 3-9, 8259 is connected to LINTIN0 line, which in turn is 
connected to LINTIN0 pin on all processors. If that is the case, even in

this mode, all the CPU should see the timer interrupts (which is coming
from 8259)?

Can you print the LAPIC registers (print_local_APIC) during normal boot
and during kdump boot and paste here?


Here are the ones from a normal bootup.

I was unable to get info from a kdump boot. I haven't figured out why 
yet. With the same patch that I used to capture this, when I tried to 
kdump the kernel, it paused a second or two after the backtrace and then 
dropped to BIOS and came up normally.


Here is a little trick, at the point where we are trying to get the info 
to print out, the kernel command line hasn't been completely parsed yet. 
That tricked me for part of the day. I had apic=debug on the command 
line but the logic in print_local_APIC saw the default value because the 
kernel command line had yet to be parsed.


2007-11-29 17:58:07 ***Here is the info you requested
2007-11-29 17:58:07
2007-11-29 17:58:07 printing local APIC contents on CPU#0/0:
2007-11-29 17:58:07 ... APIC ID:   (0)
2007-11-29 17:58:07 ... APIC VERSION: 80050010
2007-11-29 17:58:07 ... APIC TASKPRI:  (00)
2007-11-29 17:58:07 ... APIC ARBPRI:  (00)
2007-11-29 17:58:07 ... APIC PROCPRI: 
2007-11-29 17:58:07 ... APIC EOI: 
2007-11-29 17:58:07 ... APIC RRR: 0002
2007-11-29 17:58:07 ... APIC LDR: 
2007-11-29 17:58:07 ... APIC DFR: 
2007-11-29 17:58:07 ... APIC SPIV: 010f
2007-11-29 17:58:07 ... APIC ISR field:
2007-11-29 17:58:07 ... APIC TMR field:
2007-11-29 17:58:07 ... APIC IRR field:
2007-11-29 17:58:07 ... APIC ESR: 
2007-11-29 17:58:07 ... APIC ICR: 4630
2007-11-29 17:58:07 ... APIC ICR2: 0700
2007-11-29 17:58:07 ... APIC LVTT: 0001
2007-11-29 17:58:07 ... APIC LVTPC: 0001
2007-11-29 17:58:07 ... APIC LVT0: 0700
2007-11-29 17:58:07 ... APIC LVT1: 0400
2007-11-29 17:58:07 ... APIC LVTERR: 0001000f
2007-11-29 17:58:07 ... APIC TMICT: 8000
2007-11-29 17:58:07 ... APIC TMCCT: 
2007-11-29 17:58:07 ... APIC TDCR: 
2007-11-29 17:58:07
2007-11-29 17:58:07 number of MP IRQ sources: 15.
2007-11-29 17:58:07 number of IO-APIC #8 registers: 0.
2007-11-29 17:58:07 number of IO-APIC #9 registers: 0.
2007-11-29 17:58:07 number of IO-APIC #10 registers: 0.
2007-11-29 17:58:07 testing the IO APIC...
2007-11-29 17:58:07
2007-11-29 17:58:07 IO APIC #8..
2007-11-29 17:58:07  register #00:

Re: kondemand: kernel BUG at kernel/workqueue.c:258!

2007-11-29 Thread Arjan van de Ven

On Thu, 29 Nov 2007 13:47:34 -0800
"Pallipadi, Venkatesh" <[EMAIL PROTECTED]> wrote:

>  
> 
> >-Original Message-
> >From: Jiri Slaby [mailto:[EMAIL PROTECTED] 
> >Sent: Thursday, November 29, 2007 1:43 PM
> >To: Pallipadi, Venkatesh; Nakajima, Jun
> >Cc: Linux kernel mailing list
> >Subject: kondemand: kernel BUG at kernel/workqueue.c:258!
> >
> >Hi,
> >
> >while trying to evoke another bug by endlessly changing 
> >governors, this appeared:
> >kernel BUG at .../kernel/workqueue.c:258!
> >invalid opcode:  [1] PREEMPT SMP
> >CPU 0
> >Modules linked in: iwl3945 mac80211 cfg80211 tun 
> >cpufreq_userspace rfcomm
> >l2cap hci_usb bluetooth kvm_intel arc4 ecb blkcipher kvm cryptomgr
> >crypto_algapi acpi_cpufreq fglrx(P) asus_laptop sr_mod cdrom ehci_hcd
> >uhci_hcd battery
> >Pid: 443, comm: kondemand/0 Tainted: P      2.6.23 #38
> 
> Kernel version?

It's on the same line as the tainted flag, and 2 below the binary module that
is in use. I assume Jiri is now working on reproducing this
untainted ... ;)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Avoid overflows in kernel/time.c

2007-11-29 Thread Chris Snook


H. Peter Anvin wrote:

NOTE: This patch uses a bc(1) script to compute the appropriate
constants.


Perhaps dc would be more appropriate?  That's included in busybox.

-- Chris

Re: Trailing periods in kernel messages

2007-11-29 Thread Li Zefan

Joe Perches wrote:
> On Fri, 2007-11-30 at 09:12 +0800, Li Zefan wrote:
>> Just a rough grep:
>> # grep -r -P --include=*.[ch] 'printk.*\.\\n' * | wc -l
>> 6025
>> # grep -r -P --include=*.[ch] '\.\\n' * | wc -l
>> 12723
> 
> Inequivalent.
> 
> Try:
>   grep -rP --include=*.[ch] 'printk.*\.\\n' * | wc -l
> and
>   grep -rP --include=*.[ch] 'printk.*[^\.]\\n' * | wc -l
> 
> 6k/38k
> 

My 2nd grep counts how many strings are terminated with '.'.
Those strings may eventually be passed to printk().

So it isn't worth the effort to eliminate these periods, is it?
Or we can add a check to checkpatch.pl to prevent new ones.

Re: [PATCH] Avoid overflows in kernel/time.c

On Thu, 29 Nov 2007 16:19:51 -0800 "H. Peter Anvin" <[EMAIL PROTECTED]> wrote:

> When the conversion factor between jiffies and milli- or microseconds
> is not a single multiply or divide, as for the case of HZ == 300, we
> currently do a multiply followed by a divide.  The intervening
> result, however, is subject to overflows, especially since the
> fraction is not simplified (for HZ == 300, we multiply by 300 and
> divide by 1000).
> 
> This is exposed to the user when passing a large timeout to poll(),
> for example.
> 
> This patch replaces the multiply-divide with a reciprocal
> multiplication on 32-bit platforms.  When the input is an unsigned
> long, there is no portable way to do this on 64-bit platforms, since
> it requires a 128-bit intermediate result (which gcc does support on
> 64-bit platforms but may generate libgcc calls, e.g. on 64-bit s390).
> However, since the output is a 32-bit integer in the cases affected,
> we just simplify the multiply-divide there (*3/10 instead of *300/1000).
> 
> The reciprocal multiply used can have off-by-one errors in the upper
> half of the valid output range.  This could be avoided at the expense
> of having to deal with a potential 65-bit intermediate result.  Since
> the intent is to avoid overflow problems and most of the other time
> conversions are only semiexact, the off-by-one errors were considered
> an acceptable tradeoff.
> 
> NOTE: This patch uses a bc(1) script to compute the appropriate
> constants.

Does this add the first dependency upon the availability of bc?

Re: [PATCH 0/4, v3] Physical PCI slot objects

2007-11-29 Thread Alex Chiang

Hi Kenji-san,

* Kenji Kaneshige <[EMAIL PROTECTED]>:
> > Hi Gary, Kenji-san, et. al,
> > 
> > * Gary Hade <[EMAIL PROTECTED]>:
> >> Alex, What I was trying to suggest is a boot-time kernel
> >> option, not a kernel configuration option.  The basic idea is
> >> to give the user (with a single binary kernel) the ability to
> >> include your ACPI-PCI slot driver feature changes only when
> >> they are really needed.  In addition to reducing the number of
> >> system/PCI hotplug driver combinations where your changes would
> >> need to be validated, I believe would also help alleviate other
> >> worries (e.g. Andi Kleen's memory consumption concern).  I
> >> believe this goal could also be achieved with the kernel config
> >> option by making the pci_slot module runtime loadable with the
> >> PCI hotplug drivers only visiting your new code when the
> >> pci_slot driver is loaded, although I think this would be more
> >> difficult to implement.
> > 
> > I have modified my patch series so that the final patch that
> > introduces my ACPI-PCI slot driver is a full-fledged module, that
> > has a tristate Kconfig option.
> > 
> 
> Thank you for your good job.

Thanks for testing. :)

> I tested shpchp and pciehp both with and without pci_slot
> module. There seems no regression from shpchp and pciehp's
> point of view.  (I had a little concern about the hotplug
> slots' name that vary depending on whether pci_slot
> functionality is enabled or disabled. But, now that we can
> build pci_slot driver as a kernel module, I don't think it is a
> big problem).

Hm, you are right. On my machine, if I load pciehp first and
acpiphp second (even without loading pci_slot), I will see the
following:

[EMAIL PROTECTED] slots]# ls
0016_0006  0197_0005  10  3  4  7  8  9

[EMAIL PROTECTED] slots]# lsmod | grep pci_slot
[EMAIL PROTECTED] slots]# lsmod | grep hp
acpiphp   115984  0 
pciehp140616  0 
pci_hotplug   123972  2 acpiphp,pciehp

On the other hand, if I do load pci_slot first, and then pciehp,
you are right, I will see something like this:

[EMAIL PROTECTED] slots]# ls
1  10  2  3  4  5  6  7  8  9

[EMAIL PROTECTED] slots]# lsmod | grep pci_slot
pci_slot   74436  0 
[EMAIL PROTECTED] slots]# lsmod | grep hp
pciehp140616  0 
pci_hotplug   123972  1 pciehp

But I do agree, people don't need to load pci_slot at all if they
don't want it, and they won't be bothered.

> The only problem is that I got Call Traces with the following
> error messages when pci_slot driver was loaded, and one strange
> slot named '1023' was registered (other slots are fine). This
> is the same problem I reported before.
> 
> sysfs: duplicate filename '1023' can not be created
> WARNING: at fs/sysfs/dir.c:424 sysfs_add_one()
> 
> kobject_add failed for 1023 with -EEXIST, don't try to
> register things with the same name in the same directory.
> 
> On my system, hotplug slots themselves can be added, removed
> and replaced with the other type of I/O box. The ACPI firmware
> tells the OS the presence of those slots using the _STA method (That
> is, it doesn't use the 'LoadTable()' AML operator). On the other
> hand, the current pci_slot driver doesn't check _STA.  As a result,
> the pci_slot driver tried to register the invalid (non-existing)
> slots. The ACPI firmware of my system returns '1023' if the
> invalid slot's _SUN is evaluated. This is the cause of the Call
> Traces mentioned above. To fix this problem, the pci_slot driver
> needs to check _STA when scanning the ACPI Namespace.

Now this is very curious. The relevant line in pci_slot is:

check_slot()
	status = acpi_evaluate_integer(handle, "_SUN", NULL, sun);
	if (ACPI_FAILURE(status))
		return -1;

Why does your firmware return the error information inside sun,
instead of returning an error in status? That doesn't seem right
to me...

> I'm sorry for reporting this so late. I'm attaching the patch
> to fix the problem. This is against 2.6.24-rc3 with your
> patches applied. Could you try it?

Applying this patch causes me to only detect populated slots in
my system, which isn't what I want -- otherwise, I could have
just enumerated the PCI bus and found the devices that way. :)

Maybe on your machine, checking existence of _STA might do the
right thing, but I don't think we should actually be looking at
any of the actual bits returned. 

If we check ACPI_STA_DEVICE_PRESENT, then we will not detect
empty slots on my system. Can you try this patch to see if at
least the first call to acpi_evaluate_integer helps? If that
doesn't help, maybe the second block will help you, but it breaks
my machine...

Thanks.

/ac


diff --git a/drivers/acpi/pci_slot.c b/drivers/acpi/pci_slot.c
index 724f4f0..63a4dc8 100644
--- a/drivers/acpi/pci_slot.c
+++ b/drivers/acpi/pci_slot.c
@@ -55,9 +65,21 @@ static struct acpi_pci_driver acpi_pci_slot_driver = {
 static int
 check_slot(acpi_handle handle, int *device, unsigned long

Re: Out of tree module using LSM

2007-11-29 Thread Al Viro

On Thu, Nov 29, 2007 at 03:12:38PM -0700, Justin Banks wrote:

> It's not perfect, but as was recently pointed out, if you can only get
> 98% of the way there rather than 100% is that a reason for not trying to
> make it possible?

BTW, that's a fine example of a common fallacy: "$FOO is 98% of the way to
$TARGET" does not allow one to interpolate the properties of $TARGET to those
of $FOO.

Saying that a condom is a 98% approximation to the platonic ideal of such is
not particularly useful, especially if it turns out that what this number
really means is that there's a hole on its tip covering 2% of surface...

Re: [PATCH 0/2] Markers Implementation for RCU Tracing

2007-11-29 Thread Paul E. McKenney

On Fri, Nov 30, 2007 at 12:11:28AM +0530, K. Prasad wrote:
> Hi,
>   Please review the ensuing set of patches which convert the
> existing RCU tracing mechanism for Preempt RCU and RCU Boost into
> markers.
> 
> These patches are based upon the 2.6.24-rc2-rt1 kernel tree.
> 
> Along with marker transition, the RCU Tracing infrastructure has also
> been modularised to be built as a kernel module, thereby enabling
> runtime changes to the RCU Tracing infrastructure.
> 
> Patch [1/2] - Patch that converts the Preempt RCU tracing in
> rcupreempt.c into markers.
> 
> Patch [2/2] - Patch that converts the Preempt RCU Boost tracing in
> rcupreempt-boost.c into markers.

Looks good to me, though I do not pretend to understand the markers
implementation.  I presume that the markers implementation forces the
varargs usage -- though the markers do seem quite a bit nicer in allowing
the formatting to be specified more naturally.

Thanx, Paul

Re: [patch 1/1] Writeback fix for concurrent large and small file writes

2007-11-29 Thread Fengguang Wu

On Thu, Nov 29, 2007 at 12:16:36PM -0800, Michael Rubin wrote:
> Due to my faux pas of top posting (see
> http://www.zip.com.au/~akpm/linux/patches/stuff/top-posting.txt) I am
> resending this email.
> 
> On Nov 28, 2007 4:34 PM, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > Could you demonstrate the situation? Or if I guess it right, could it
> > be fixed by the following patch? (not a nack: If so, your patch could
> > also be considered as a general purpose improvement, instead of a bug
> > fix.)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 0fca820..62e62e2 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -301,7 +301,7 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
> >                          * Someone redirtied the inode while were writing back
> >                          * the pages.
> >                          */
> > -                       redirty_tail(inode);
> > +                       requeue_io(inode);
> >                 } else if (atomic_read(&inode->i_count)) {
> >                         /*
> >                          * The inode is clean, inuse
> >
> 
> By testing the situation I can confirm that the one line patch above
> fixes the problem.
> 
> I will continue testing some other cases to see if it causes any other
> issues but I don't expect it to.

One major concern could be whether a continuous writer dirtying pages
at the 'right' pace will generate a steady flow of write I/Os which are
_tiny_hence_inefficient_.

I have gathered some timing info about writeback speed in
http://lkml.org/lkml/2007/10/4/468. For ext3, it takes wb_kupdate()
~15ms to submit 4MB. Whereas one disk I/O typically takes ~5ms. So if
there are too many tiny write I/Os, they will simply get delayed and
merged into bigger ones.

So it's not a problem in *theory* :-)

> I will post this change for 2.6.24 and list Feng as author. If that's
> ok with Feng.

Thank you.

> As for the original patch I will resubmit it for 2.6.25 as a general
> purpose improvement.

There are some discussions and patches on inode number based writeback
clustering which you may want to reference/compare with:
http://lkml.org/lkml/2007/8/21/396
http://lkml.org/lkml/2007/8/27/45

Cheers,
Fengguang


Re: Trailing periods in kernel messages

On Fri, 2007-11-30 at 09:12 +0800, Li Zefan wrote:
> Just a rough grep:
> # grep -r -P --include=*.[ch] 'printk.*\.\\n' * | wc -l
> 6025
> # grep -r -P --include=*.[ch] '\.\\n' * | wc -l
> 12723

Inequivalent.

Try:
grep -rP --include=*.[ch] 'printk.*\.\\n' * | wc -l
and
grep -rP --include=*.[ch] 'printk.*[^\.]\\n' * | wc -l

6k/38k

cheers, Joe


Re: [PATCH] [RESEND] crypto test: use print_hex_dump from kernel.h instead

2007-11-29 Thread rae l

On Nov 29, 2007 7:13 PM, Herbert Xu <[EMAIL PROTECTED]> wrote:
...
> > uninlining this function shrinks crypto/tcrypt.o's .text from 20,009 bytes
> > down to 19,701.
> >
> > inlining is almost always wrong.
>
> I agree.  Please do as Andrew suggests and resubmit.
inline disabled.

Cc: Randy Dunlap <[EMAIL PROTECTED]>
Signed-off-by: Denis Cheng <[EMAIL PROTECTED]>
---

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 24141fb..13efc72 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -83,10 +83,9 @@ static char *check[] = {

 static void hexdump(unsigned char *buf, unsigned int len)
 {
-   while (len--)
-   printk("%02x", *buf++);
-
-   printk("\n");
+   print_hex_dump(KERN_CONT, "", DUMP_PREFIX_OFFSET,
+   16, 1,
+   buf, len, false);
 }

 static void tcrypt_complete(struct crypto_async_request *req, int err)

-- 
Denis Cheng

Re: Question regarding mutex locking

2007-11-29 Thread Bryan O'Sullivan


Larry Finger wrote:

If a particular routine needs to lock a mutex, but it may be entered with that 
mutex already locked,
would the following code be SMP safe?

hold_lock = mutex_trylock()


The common way to deal with this is first to restructure your function 
into two.  One always acquires the lock, and the other (often written 
with a "__" prefix) never acquires it.  The never-acquire code does the 
actual work, and the always-acquire function calls it.


You then refactor the callers so that you don't have any code paths on 
which you can't predict whether or not the lock will be held.
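
A minimal sketch of that split, written as userspace C with a pthread mutex
so it can be compiled on its own; in the kernel the same shape would use a
struct mutex with mutex_lock()/mutex_unlock(). The counter, its lock, and all
function names here are hypothetical, picked only for illustration. The "__"
prefix follows the kernel convention mentioned above (such names are reserved
in portable userspace code).

```c
#include <pthread.h>

/* Hypothetical shared state, for illustration only. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static int counter;

/* Never acquires the lock: the caller must already hold counter_lock. */
static int __counter_add(int delta)
{
	counter += delta;
	return counter;
}

/* Always acquires the lock, then delegates to the worker above. */
static int counter_add(int delta)
{
	int value;

	pthread_mutex_lock(&counter_lock);
	value = __counter_add(delta);
	pthread_mutex_unlock(&counter_lock);
	return value;
}

/*
 * A caller that holds the lock across a larger critical section uses the
 * "__" variant directly, so no code path has to guess the lock state.
 */
static int counter_add_twice(int delta)
{
	int value;

	pthread_mutex_lock(&counter_lock);
	__counter_add(delta);
	value = __counter_add(delta);
	pthread_mutex_unlock(&counter_lock);
	return value;
}
```

With this layout there is no need for a trylock-and-remember flag: each
caller statically knows which of the two entry points it is allowed to use.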



Re: [PATCH 0/4, v3] Physical PCI slot objects

2007-11-29 Thread Alex Chiang

Hi Gary,

First, thanks for all the help and testing -- I really appreciate
it.

* Gary Hade <[EMAIL PROTECTED]>:
> 
> I'm getting back to you but unfortunately with not so good
> news.  Sorry Alex.

:-/

> On the x3950 (configured single node) I encountered the below
> problem when attempting to hotplug a PCIe adapter when 'pci_slot'
> was loaded prior to 'acpiphp'.  I did not see the problem when
> the drivers were loaded in the opposite order.

Very bizarre, especially given the stack trace below, which
doesn't really make any sense to me at all.

> FYI, the node contains 2 hotpluggable PCIe slots and 5
> non-hotpluggable PCIe slots but 'pci_slot' only exposed
> the 2 hotpluggable slots.  This does not appear to be due
> to a 'pci_slot' driver problem since I looked at the DSDT
> and SSDT and found that there are currently no _SUN methods
> for the non-hotpluggable slots.

Ok, this is not too surprising, but it's a different can o'
worms. ;) Let's save this for another day...

> invalid opcode:  [1] SMP 
> CPU 1 
> Modules linked in: acpiphp pci_slot e1000 aic79xx scsi_transport_spi shpchp 
> dock pci_hotplug ipt_LOG xt_limit xt_pkttype button battery ac power_supply 
> ip6t_REJECT xt_tcpudp ipt_REJECT iptable_mangle iptable_filter 
> ip6table_mangle ip_tables ip6table_filter ip6_tables x_tables ipv6 usbhid 
> ff_memless ext3 jbd loop dm_mod ehci_hcd uhci_hcd usbcore ide_cd bnx2 cdrom 
> rng_core reiserfs ata_piix ahci libata thermal processor piix sg megaraid_sas 
> fan edd sd_mod scsi_mod ide_disk ide_core
> Pid: 121, comm: kacpi_notify Not tainted 2.6.24-rc3-gh-smp #1
> RIP: 0010:[]  [] 
> :pci_slot:__this_module+0x21c4/0xf204
> RSP: 0018:81103fa43ea8  EFLAGS: 00010216
> RAX: 81103f944a18 RBX: 81103d4fe910 RCX: 000f
> RDX:  RSI:  RDI: 8110400d13d0
> RBP: 8032d97b R08: 8110400fc7e0 R09: 0002
> R10:  R11: 8021d193 R12: 811040105cf0
> R13:  R14: 80635820 R15: 
> FS:  () GS:8110400ed8c0() knlGS:
> CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
> CR2: 2b266d876471 CR3: 00103c825000 CR4: 06e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: 0ff0 DR7: 0400
> Process kacpi_notify (pid: 121, threadinfo 81103fa42000, task 
> 81103f9f8040)
> Stack:  809c 81103d119a00 8032d99e 81103f9fc540
>  8024618d 81103f9fc540 81103f9fc540 8024696c
>  80246a46  81103f9f8040 80249ada
> Call Trace:
>  [] acpi_ev_notify_dispatch+0x57/0x60
>  [] acpi_os_execute_notify+0x23/0x2c
>  [] run_workqueue+0x7f/0x10b
>  [] worker_thread+0x0/0xe4
>  [] worker_thread+0xda/0xe4
>  [] autoremove_wake_function+0x0/0x2e
>  [] kthread+0x47/0x73
>  [] child_rip+0xa/0x12
>  [] kthread+0x0/0x73
>  [] child_rip+0x0/0x12

Maybe we're trying to kick off a hotplug event on the wrong slot?
I really have no idea...

> Code: ff ff ff ff 40 23 2c 88 ff ff ff ff 00 c8 c6 3b 10 81 ff ff 
> RIP  [] :pci_slot:__this_module+0x21c4/0xf204
>  RSP 

Can you apply this debug patch on top of your tree, and send me
the output?

I'd be curious to see the output for your failure case:

  # modprobe pci_slot debug=1
  # modprobe acpiphp debug=1

Thanks.

/ac

diff --git a/drivers/acpi/pci_slot.c b/drivers/acpi/pci_slot.c
index 724f4f0..5a62def 100644
--- a/drivers/acpi/pci_slot.c
+++ b/drivers/acpi/pci_slot.c
@@ -30,12 +30,16 @@
 #include 
 #include 
 
+static int debug;
+
 #define DRIVER_VERSION "0.1"
 #define DRIVER_AUTHOR  "Alex Chiang <[EMAIL PROTECTED]>"
 #define DRIVER_DESC"ACPI PCI Slot Detection Driver"
 MODULE_AUTHOR(DRIVER_AUTHOR);
 MODULE_DESCRIPTION(DRIVER_DESC);
 MODULE_LICENSE("GPL");
+MODULE_PARM_DESC(debug, "Debugging mode enabled or not");
+module_param(debug, bool, 0644);
 
 #define _COMPONENT ACPI_PCI_COMPONENT
 ACPI_MODULE_NAME("pci_slot");
@@ -43,6 +47,12 @@ ACPI_MODULE_NAME("pci_slot");
 #define MY_NAME "pci_slot"
 #define err(format, arg...) printk(KERN_ERR "%s: " format , MY_NAME , ## arg)
 #define info(format, arg...) printk(KERN_INFO "%s: " format , MY_NAME , ## arg)
+#define dbg(format, arg...)\
+   do {\
+   if (debug)  \
+   printk(KERN_DEBUG "%s: " format,\
+   MY_NAME , ## arg);  \
+   } while (0)
 
 static int acpi_pci_slot_add(acpi_handle handle);
 static void acpi_pci_slot_remove(acpi_handle handle);
@@ -125,6 +135,9 @@ register_slot(acpi_handle handle, u32 lvl, void *context, 
void **rv)
if (IS_ERR(pci_slot))
err("pci_create_slot returned %ld\n", PTR_ERR(pci_slot));
 
+

Re: Trailing periods in kernel messages

2007-11-29 Thread Li Zefan

Andrew Morton wrote:
> On Thu, 29 Nov 2007 11:20:18 +0100 Frans Pop <[EMAIL PROTECTED]> wrote:
> 
>> Well, for one it needlessly increases the size of log files.
>> It also IMO just looks weird to have a trailing period only for some 
>> messages and it certainly is completely inappropriate for messages like:
> 
> I'll confess to stealthily deleting some of those periods when nobody is 
> looking.
> I don't find them to have any value and they do have some cost, including 
> screen
> real estate at the source-code level.
> 
> 

Just a rough grep:

# grep -r -P --include=*.[ch] 'printk.*\.\\n' * | wc -l
6025
# grep -r -P --include=*.[ch] '\.\\n' * | wc -l
12723

:)

Re: [Bluez-users] Lost connections - mouse and keyboard

2007-11-29 Thread Dave Young

On Nov 30, 2007 4:43 AM, Jiri Kosina <[EMAIL PROTECTED]> wrote:
> On Thu, 29 Nov 2007, Marcel Holtmann wrote:
>
> > > >Nov 28 18:53:39 pico kernel: WARNING: at drivers/hid/hid-core.c:784
> [ ... ]
>
> > > Does bluetooth input devices have something to do with usbhid? I don't
> > > know, perhaps this is another problem in kernel.
> > in case you have a HID proxy dongle the usbhid driver can be involved. And
> > since this is hiddev, then it will be caused by the hid2hci program.
>
> Absolutely.
>
> This particular warning means, that someone (usually indeed hid2hci)
> passed usage through hiddev that was out of bounds, with respect to the
> device's report descriptor.

Is this behaviour the normal one? IMHO, a userspace program should not
cause kernel warnings like this no matter what input users provide.

>
> This usually means that hid2hci has chosen the wrong method to switch the
> modes. Unfortunately, it's not easy to implement always the switching
> properly, if we don't know the vendor-specific packet that has to be sent.
>
> --
> Jiri Kosina
>

Re: [patch 1/3] bdi patches

2007-11-29 Thread Neil Brown

On Thursday November 29, [EMAIL PROTECTED] wrote:
> > http://programming.kicks-ass.net/kernel-patches/foo/
> > 
> > bdi-task-dirty.patch
> > bdi-sysfs.patch
> > bdi-min.patch
> > bdi-max.patch
> > 
> > 
> > Is my current rather experimental stack, I just wrote the max part after
> > having slept on it. I'm not fond of the multiplication there, but I
> > don't see a way around it.
> > 
> > Compile tested only.
> 
> I've done some testing on these patches and did some changes. So here
> they go.
> 
> Thanks,
> Miklos
> 
> -
> Subject: mm: sysfs: expose the BDI object in sysfs
> 
> Provide a place in sysfs for the backing_dev_info object.
> This allows us to see and set the various BDI specific variables.

You don't say what the place is, and I'm not quite familiar enough
with sysfs internals to figure it out myself.  Help?

And while I was looking I noticed that bdi_register (and bdi_init_fmt)
takes a second argument 'parent', which is always NULL, and which is
undocumented as to purpose.
If no-one would ever add another call to bdi_register, why have the
second arg, and if they might, how would they know what to put there?

Finally, the omission of NFS bothers me - and makes me wonder if the
choice of name in sysfs is appropriate.

Would a program ever want to generate the name (in sysfs) for a
particular bdi?  If so, how would it do it.

It seems to me after a fairly quick look that a bdi is always
associated with a device number.  For block devices the device number
is obvious.  For NFS and FUSE, the device number is an anon device
number allocated at mount time.
Maybe the name of the bdi should be based on that number.  Then it
would be possible to map directly from e.g. a file to the bdi that the
file would be written to. 
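
To illustrate the mapping suggested above, here is a rough userspace sketch
that goes from a file to the device number it lives on, using stat(2) and the
glibc major()/minor() macros. The "MAJOR:MINOR" name format and the function
name bdi_name_for_path() are assumptions made for this example, not something
the patch defines.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/*
 * Write "MAJOR:MINOR" for the device holding `path` into buf.
 * For NFS and FUSE mounts st_dev is the anonymous device number
 * allocated at mount time, so the same lookup applies.
 * Returns 0 on success, -1 if stat() fails or buf is too small.
 */
static int bdi_name_for_path(const char *path, char *buf, size_t len)
{
	struct stat st;
	int n;

	if (stat(path, &st) != 0)
		return -1;

	n = snprintf(buf, len, "%u:%u", major(st.st_dev), minor(st.st_dev));
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A program could then append the resulting string to whatever sysfs directory
ends up holding the bdi objects; that directory is exactly the part that is
left open in the question above.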

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Out of tree module using LSM

2007-11-29 Thread James Morris

On Thu, 29 Nov 2007, Al Viro wrote:

> Incidentally, I would really love to see the threat profile we are talking
> about.  

Exactly.

Please come up with a set of requirements that can be reviewed by the core 
kernel folk, and perhaps then focus on how to meet those requirements once 
they have been accepted.

- James
-- 
James Morris
<[EMAIL PROTECTED]>

Re: [PATCH 4/6] time: fix typo in comments

2007-11-29 Thread Li Zefan

>>  
>> -/* Suppose we want to devide two numbers NOM and DEN: NOM/DEN, the we can
>> +/* Suppose we want to devide two numbers NOM and DEN: NOM/DEN, then we can
> 
> divide
> 

Yes, I missed it.

>> - * which, buy the way, it can do, but it take more code and at least 2
>> + * which, buy the way, it can do, but it takes more code and at least 2
> 
> by the way 
> (and does this really add anything to the sentence?)
> 

Thanks for pointing it out :)

Re: WARNING: at kernel/resource.c:189 __release_resource

On Thu, 29 Nov 2007 16:40:37 -0700
Bjorn Helgaas <[EMAIL PROTECTED]> wrote:

> On Monday 26 November 2007 11:05:38 pm Andrew Morton wrote:
> > On Thu, 22 Nov 2007 22:41:16 +0100 Jiri Slaby <[EMAIL PROTECTED]> wrote:
> > > Ok, I hit the bug, suspend of 00:06 device complains about it:
> > > WARNING: at .../kernel/resource.c:185 __release_resource()
> > > 
> > > Call Trace:
> > >  [] release_resource+0xb5/0xf0
> > >  [] pnp_release_resources+0x70/0x130
> > >  [] pnp_stop_dev+0x45/0x90
> > >  [] pnp_bus_suspend+0x92/0xb0
> > >  [] suspend_device+0x113/0x180
> > >  [] device_suspend+0x200/0x320
> > >  [] suspend_devices_and_enter+0xa5/0x170
> > >  [] enter_state+0x209/0x270
> > >  [] state_store+0xaf/0xf0
> > >  [] kobj_attr_store+0x17/0x20
> > >  [] sysfs_write_file+0xce/0x140
> > >  [] vfs_write+0xc7/0x170
> > >  [] sys_write+0x50/0x90
> > >  [] system_call+0x7e/0x83
> > > 
> > > # LANG=en ll /sys/devices/pnp0/00:06/
> > > total 0
> > > lrwxrwxrwx 1 root root0 Nov 22 22:35 driver -> 
> > > ../../../bus/pnp/drivers/serial
> > > -r--r--r-- 1 root root 4096 Nov 22 22:35 id
> > > -r--r--r-- 1 root root 4096 Nov 22 22:35 options
> > > drwxr-xr-x 2 root root0 Nov 22 22:35 power
> > > -rw-r--r-- 1 root root 4096 Nov 22 22:35 resources
> > > lrwxrwxrwx 1 root root0 Nov 22 22:35 subsystem -> ../../../bus/pnp
> > > drwxr-xr-x 3 root root0 Nov 22 22:35 tty
> > > -rw-r--r-- 1 root root 4096 Nov 22 22:35 uevent
> > 
> > I suppose that's a genuine leak, presumably in 8250_pnp.
> 
> We used to have only the serial driver resource reservation.  We now
> have an additional 00:06 resource that is the parent of the serial
> resource, e.g.,
> 
>   03f8-03ff : 00:06
> 03f8-03ff : serial
> 
> I think this problem happens because pnp_bus_suspend() calls
> serial_pnp_suspend(), which suspends the driver but does nothing
> with the resources.  Then it calls pnp_stop_dev(), which releases
> the 00:06 resource, which still has a serial child resource.
> 
> The corresponding PCI code in pci_device_suspend() does not do
> any generic device disable or resource release.  I don't know
> why PNP disables the device on suspend.  I glanced through the
> ACPI spec but didn't see a requirement for it.  Maybe Pierre [1]
> remembers.
> 
> Maybe we could either remove the pnp_{stop,start}_dev() calls
> from the suspend/resume path, or move the PNP resource management
> out of pnp_{start,stop}_dev().
> 
> Bjorn
> 
> [1] http://lkml.org/lkml/2005/11/30/39

So was this particular problem caused/exposed by
pnp-request-ioport-and-iomem-resources-used-by-active-devices.patch, or is
it in mainline?

Thanks.

Re: constant_tsc and TSC unstable

2007-11-29 Thread Frans Pop

Paul Rolland wrote:
> Total of 2 processors activated (6919.15 BogoMIPS).
> ENABLING IO-APIC IRQs
> ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
> checking TSC synchronization [CPU#0 -> CPU#1]:
> Measured 3978592228 cycles TSC warp between CPUs, turning off TSC clock.
> Marking TSC unstable due to: check_tsc_sync_source failed.
> Brought up 2 CPUs
> ...

Not sure if this is related, but thought I'd contribute it anyway...

I've got a Pentium D system (dual core, single processor) and on some
boots I get "Marking TSC unstable due to check_tsc_sync_source failed" with
some cycles warp between CPUs, while most boots are OK. This kind of
inconsistency seems more due to a failure in the kernel to deal with
differences between boots than with something inherent to the hardware.

I conclude that because basically I never have any problems with the system
once it has booted and the TSC has passed.

From my kern.logs since Oct 26, I get the following data:
2.6.23+cfs:  2 passes
2.6.23.1:    1 pass;   1 failure  (48 cycles warp)
2.6.24-rc1: 15 passes
2.6.24-rc2: 13 passes; 1 failure  (8 cycles warp)
2.6.24-rc3:  5 passes; 3 failures (8, 8 and 16 cycles warp)

Note that this is not a new issue. For 2.6.21/2.6.23-RCx kernels I reported
similar data in http://lkml.org/lkml/2007/9/16/45.

Cheers,
FJP

[PATCH] Avoid overflows in kernel/time.c

When the conversion factor between jiffies and milli- or microseconds
is not a single multiply or divide, as for the case of HZ == 300, we
currently do a multiply followed by a divide.  The intervening
result, however, is subject to overflows, especially since the
fraction is not simplified (for HZ == 300, we multiply by 300 and
divide by 1000).

This is exposed to the user when passing a large timeout to poll(),
for example.

This patch replaces the multiply-divide with a reciprocal
multiplication on 32-bit platforms.  When the input is an unsigned
long, there is no portable way to do this on 64-bit platforms, since
it requires a 128-bit intermediate result (which gcc does support on
64-bit platforms but may generate libgcc calls, e.g. on 64-bit s390).
However, since the output is a 32-bit integer in the cases affected,
we just simplify the multiply-divide there (*3/10 instead of *300/1000).

The reciprocal multiply used can have off-by-one errors in the upper
half of the valid output range.  This could be avoided at the expense
of having to deal with a potential 65-bit intermediate result.  Since
the intent is to avoid overflow problems and most of the other time
conversions are only semiexact, the off-by-one errors were considered
an acceptable tradeoff.
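
The overflow and the two workarounds described above can be sketched in
plain C.  This is a minimal illustration for the jiffies-to-milliseconds
direction with HZ == 300, not the code from the patch: the function
names, the shift of 30 and the round-up of the multiplier are all
assumptions made for the example, and the constants the bc(1) script
actually generates may differ.

```c
#include <stdint.h>

/* Illustrative values only: HZ == 300, as in the description above. */
#define HZ            300u
#define MSEC_PER_SEC  1000u

/* Old generic path as seen on a 32-bit platform: the product
 * j * 1000 wraps modulo 2^32 before the divide happens. */
static uint32_t naive_jiffies_to_msecs(uint32_t j)
{
	return (j * MSEC_PER_SEC) / HZ;
}

/* Simplified fraction with a 64-bit input: 1000/300 == 10/3, so the
 * intermediate product is 100 times smaller than in the naive form. */
static uint32_t simplified_jiffies_to_msecs(uint64_t j)
{
	return (uint32_t)((j * 10u) / 3u);
}

/* Reciprocal multiply: MUL32 is ceil(2^SHR32 * 1000 / 300), which still
 * fits in 32 bits, so a single 32x32->64 multiply and a shift replace
 * the multiply-divide.  (Hypothetical constants, see the text above.) */
#define SHR32 30
#define MUL32 ((uint32_t)((((uint64_t)MSEC_PER_SEC << SHR32) + HZ - 1) / HZ))

static uint32_t recip_jiffies_to_msecs(uint32_t j)
{
	return (uint32_t)(((uint64_t)MUL32 * j) >> SHR32);
}
```

With these definitions, an input of 5000000 jiffies wraps in the naive
version, while the other two return 16666666.  Because the multiplier
is rounded, the reciprocal form can be off by one for large inputs,
which is the tradeoff mentioned above.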

NOTE: This patch uses a bc(1) script to compute the appropriate
constants.

Signed-off-by: H. Peter Anvin <[EMAIL PROTECTED]>
---
 kernel/Makefile |8 +++
 kernel/time.c   |   29 +---
 kernel/timeconst.bc |  123 +++
 3 files changed, 152 insertions(+), 8 deletions(-)
 create mode 100644 kernel/timeconst.bc

diff --git a/kernel/Makefile b/kernel/Makefile
index dfa9695..f136d18 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -80,3 +80,11 @@ quiet_cmd_ikconfiggz = IKCFG   $@
 targets += config_data.h
 $(obj)/config_data.h: $(obj)/config_data.gz FORCE
$(call if_changed,ikconfiggz)
+
+$(obj)/time.o: $(obj)/timeconst.h
+
+quiet_cmd_timeconst  = BC  $@
+  cmd_timeconst = (echo $(CONFIG_HZ) | bc -q $<) > $@
+targets += timeconst.h
+$(obj)/timeconst.h: $(src)/timeconst.bc $(wildcard include/config/hz.h) FORCE
+   $(call if_changed,timeconst)
diff --git a/kernel/time.c b/kernel/time.c
index 09d3c45..8e790b5 100644
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -39,6 +39,8 @@
 #include 
 #include 
 
+#include "timeconst.h"
+
 /*
  * The timezone where the local system is located.  Used as a default by some
  * programs who obtain this value by using gettimeofday.
@@ -93,7 +95,8 @@ asmlinkage long sys_stime(time_t __user *tptr)
 
 #endif /* __ARCH_WANT_SYS_TIME */
 
-asmlinkage long sys_gettimeofday(struct timeval __user *tv, struct timezone __user *tz)
+asmlinkage long sys_gettimeofday(struct timeval __user *tv,
+struct timezone __user *tz)
 {
if (likely(tv != NULL)) {
struct timeval ktv;
@@ -118,7 +121,7 @@ asmlinkage long sys_gettimeofday(struct timeval __user *tv, struct timezone __us
  * hard to make the program warp the clock precisely n hours)  or
  * compile in the timezone information into the kernel.  Bad, bad
  *
- * - TYT, 1992-01-01
+ * - TYT, 1992-01-01
  *
  * The best thing to do is to keep the CMOS clock in universal time (UTC)
  * as real UNIX machines always do it. This avoids all headaches about
@@ -239,7 +242,11 @@ unsigned int inline jiffies_to_msecs(const unsigned long j)
 #elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
return (j + (HZ / MSEC_PER_SEC) - 1)/(HZ / MSEC_PER_SEC);
 #else
-   return (j * MSEC_PER_SEC) / HZ;
+# if BITS_PER_LONG == 32
+   return ((u64)HZ_TO_MSEC_MUL32 * j) >> HZ_TO_MSEC_SHR32;
+# else
+   return (j * HZ_TO_MSEC_NUM) / HZ_TO_MSEC_DEN;
+# endif
 #endif
 }
 EXPORT_SYMBOL(jiffies_to_msecs);
@@ -251,7 +258,11 @@ unsigned int inline jiffies_to_usecs(const unsigned long j)
 #elif HZ > USEC_PER_SEC && !(HZ % USEC_PER_SEC)
return (j + (HZ / USEC_PER_SEC) - 1)/(HZ / USEC_PER_SEC);
 #else
-   return (j * USEC_PER_SEC) / HZ;
+# if BITS_PER_LONG == 32
+   return ((u64)HZ_TO_USEC_MUL32 * j) >> HZ_TO_USEC_SHR32;
+# else
+   return (j * HZ_TO_USEC_NUM) / HZ_TO_USEC_DEN;
+# endif
 #endif
 }
 EXPORT_SYMBOL(jiffies_to_usecs);
@@ -351,7 +362,7 @@ EXPORT_SYMBOL(mktime);
  * normalize to the timespec storage format
  *
  * Note: The tv_nsec part is always in the range of
- * 0 <= tv_nsec < NSEC_PER_SEC
+ * 0 <= tv_nsec < NSEC_PER_SEC
  * For negative values only the tv_sec field is negative !
  */
 void set_normalized_timespec(struct timespec *ts, time_t sec, long nsec)
@@ -452,12 +463,13 @@ unsigned long msecs_to_jiffies(const unsigned int m)
/*
 * Generic case - multiply, round and divide. But first
 * check that if we are doing a net multiplication, that
-* we wouldnt overflow:
+* we

Relation between nr_dirty and nr_inactive

2007-11-29 Thread Kunal Trivedi

Hi,
I am running an older kernel (CentOS 2.6.9-34 SMP) on a 32-bit arch. Some
of my systems hung while trying to write data to disk. All of those
systems show a similar pattern: during this time, /proc/meminfo reports
'Inactive' < 'Dirty'. All of the machines have 2G of physical memory,
and ~1.5G of it is locked (via mlock).

I tried reading the code but could not establish any direct relationship
between Zone->in_active pages vs. per-cpu_page_state->nr_dirty.

Has anybody seen a system in this kind of state before? And do these two
parameters affect each other?

Thanks
-Kunal

Re: 2.6.24-rc3-mm2 (bugfix for memory cgroup per-zone-struct allocation.)

2007-11-29 Thread KAMEZAWA Hiroyuki

On Thu, 29 Nov 2007 16:25:33 -0500
Lee Schermerhorn <[EMAIL PROTECTED]> wrote:
> > -   pn = kmalloc_node(sizeof(*pn), GFP_KERNEL, node);
> > +   /*
> > +* This routine is called against possible nodes.
> > +* But it's BUG to call kmalloc() against offline node.
> > +*
> > +* TODO: this routine can waste much memory for nodes which will
> > +*   never be onlined. It's better to use memory hotplug callback
> > +*   function.
> > +*/
> > +   if (node_state(node, N_HIGH_MEMORY))
> > +   pn = kmalloc_node(sizeof(*pn), GFP_KERNEL, node);
> > +   else
> > +   pn = kmalloc(sizeof(*pn), GFP_KERNEL);
> > if (!pn)
> > return 1;
> >  
> > 
> 
> This worked for me.  Can boot 24-rc3-mm2 [if I turn off async scsi scan,
> that is--not related to mem controller].  
> 
Thank you !

> Just FYI, on my ia64 platform, with NODES_SHIFT == 8 [RHEL & SLES ship
> with 10, I believe], the size of the mem_cgroup structure is ~10KB.
> 
Yes. But...
I'll ask Goto-san how memory hotplug callback works and try it.

Thanks,
-Kame



[GIT PULL] x86 setup: don't recalculate ss:esp unless really necessary

Hi Linus,

It appears that unconditionally resetting the stack, which fixes old
LILO, breaks LOADLIN after all.  This patch should work with either,
as well as work around the command-line truncation bug in old versions
of SYSLINUX.

Please pull:

  git://git.kernel.org/pub/scm/linux/kernel/git/hpa/linux-2.6-x86setup.git for-linus

Jens Rottmann (1):
  x86 setup: don't recalculate ss:esp unless really necessary

 arch/x86/boot/header.S |   41 -
 1 files changed, 16 insertions(+), 25 deletions(-)

commit 16252da654800461e0e1c32697cb59f4cda15aa9
Author: Jens Rottmann <[EMAIL PROTECTED]>
Date:   Tue Nov 27 12:35:13 2007 +0100

x86 setup: don't recalculate ss:esp unless really necessary

In order to work around old LILO versions providing an invalid ss
register, the current setup code always sets up a new stack,
immediately following .bss and the heap. But this breaks LOADLIN.

This rewrite of the workaround checks for an invalid stack (ss!=ds)
first, and leaves ss:sp alone otherwise (apart from aligning esp).

[hpa note: LOADLIN has a number of arbitrary hard-coded limits that
are being pushed up against.  Without some major revision of LOADLIN
itself it will not be sustainable keeping it alive.  This gives it
another brief lease on life, however.  This patch also helps the
cmdline truncation problem with old versions of SYSLINUX.]

Signed-off-by: Jens Rottmann 
Signed-off-by: H. Peter Anvin <[EMAIL PROTECTED]>

diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S
index 6ef5a06..4cc5b04 100644
--- a/arch/x86/boot/header.S
+++ b/arch/x86/boot/header.S
@@ -236,39 +236,30 @@ start_of_setup:
movw%ax, %es
cld
 
-# Apparently some ancient versions of LILO invoked the kernel
-# with %ss != %ds, which happened to work by accident for the
-# old code.  If the CAN_USE_HEAP flag is set in loadflags, or
-# %ss != %ds, then adjust the stack pointer.
+# Apparently some ancient versions of LILO invoked the kernel with %ss != %ds,
+# which happened to work by accident for the old code.  Recalculate the stack
+# pointer if %ss is invalid.  Otherwise leave it alone, LOADLIN sets up the
+# stack behind its own code, so we can't blindly put it directly past the heap.
 
-   # Smallest possible stack we can tolerate
-   movw$(_end+STACK_SIZE), %cx
-
-   movwheap_end_ptr, %dx
-   addw$512, %dx
-   jnc 1f
-   xorw%dx, %dx# Wraparound - whole segment available
-1: testb   $CAN_USE_HEAP, loadflags
-   jnz 2f
-
-   # No CAN_USE_HEAP
movw%ss, %dx
cmpw%ax, %dx# %ds == %ss?
movw%sp, %dx
-   # If so, assume %sp is reasonably set, otherwise use
-   # the smallest possible stack.
-   jne 4f  # -> Smallest possible stack...
+   je  2f  # -> assume %sp is reasonably set
+
+   # Invalid %ss, make up a new stack
+   movw$_end, %dx
+   testb   $CAN_USE_HEAP, loadflags
+   jz  1f
+   movwheap_end_ptr, %dx
+1: addw$STACK_SIZE, %dx
+   jnc 2f
+   xorw%dx, %dx# Prevent wraparound
 
-   # Make sure the stack is at least minimum size.  Take a value
-   # of zero to mean "full segment."
-2:
+2: # Now %dx should point to the end of our stack space
andw$~3, %dx# dword align (might as well...)
jnz 3f
movw$0xfffc, %dx# Make sure we're not zero
-3: cmpw%cx, %dx
-   jnb 5f
-4: movw%cx, %dx# Minimum value we can possibly use
-5: movw%ax, %ss
+3: movw%ax, %ss
movzwl  %dx, %esp   # Clear upper half of %esp
sti # Now we should have a working stack
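
For readers who find the assembly hard to follow, the stack-pointer
selection after this patch can be modelled in C.  This is only a
sketch of the control flow shown in the diff: the function name and
its parameters are made up for illustration, and the value of
STACK_SIZE is passed in rather than assumed.

```c
#include <stdint.h>

/* Hypothetical C model of the stack-pointer choice in the patched
 * start_of_setup.  All values are 16-bit, as in the real-mode code. */
static uint16_t pick_stack_pointer(uint16_t ss, uint16_t ds, uint16_t sp,
				   int can_use_heap, uint16_t heap_end_ptr,
				   uint16_t end, uint16_t stack_size)
{
	uint16_t dx = sp;	/* %ss == %ds: assume %sp is reasonably set */

	if (ss != ds) {
		/* Invalid %ss: make up a new stack past the heap or _end */
		uint16_t base = can_use_heap ? heap_end_ptr : end;
		uint32_t sum = (uint32_t)base + stack_size;

		/* A carry out of 16 bits maps to 0, i.e. top of segment */
		dx = (sum > 0xffff) ? 0 : (uint16_t)sum;
	}

	dx &= (uint16_t)~3u;	/* dword align */
	if (dx == 0)
		dx = 0xfffc;	/* make sure we're not zero */
	return dx;
}
```

The case that matters for LOADLIN is the first one: with %ss == %ds the
incoming %sp is kept and only aligned, instead of being moved to just
past the heap.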
 

Re: + proc-fix-the-threaded-proc-self.patch added to -mm tree

2007-11-29 Thread Ingo Molnar

* Eric W. Biederman <[EMAIL PROTECTED]> wrote:

> > You'll never run out of this sort of problem. Keeping Linux lean and 
> > simple would be far better.
> 
> Nah.  The control group stuff has all kinds of corner cases because it 
> is a new and untested API.  The namespace work after we get the code 
> cleaned up so it is maintainable and we can work with it is usually 
> just finding our globals through a pointer instead of from a static 
> variable.  Hardly a measurable cost on the best day.

yeah - anyone who claims that containers are 'fat' has likely not even 
looked at the code. Even maintenance-wise there are very visible positive 
effects: we do discover and properly map our "global resource" 
dependencies and abstract them. That increases cleanliness of our code 
and APIs all around.

Ingo

Re: [PATCH x86/mm 01/11] x86-32 thread_struct.debugreg

2007-11-29 Thread Jeff Dike

On Thu, Nov 29, 2007 at 01:50:55PM -0800, Roland McGrath wrote:
> UML is also a good test, though I have never been set up to verify
> anything beyond "UML seems to boot far enough to complain I don't
> have a userland filesystem for it".  

BTW, this doesn't exercise ptrace at all.  Interesting ptrace things
only start happening when userspace runs.

Grab an interesting-looking image from http://uml.nagafix.co.uk,
uncompress it, and run
./linux ubda=the-filesystem-image

Jeff

-- 
Work email - jdike at linux dot intel dot com

Re: [PATCH x86/mm 11/11] x86 ptrace merge removals

2007-11-29 Thread Jeff Dike

On Thu, Nov 29, 2007 at 02:38:03PM -0800, Roland McGrath wrote:
> > Can you make sure that UML still runs when you're done with ptrace?
> 
> I'd be glad to, especially if you give me some advice on testing (.config
> for um-i386 and um-x86_64, what to try that constitutes "UML still runs").

Use defconfig and boot it.  If you break ptrace, I think it's
overwhelmingly likely that UML will stop booting.  So if UML boots,
I'd say you're good to go, with one caveat.  That is, UML should
report at boot that PTRACE_SYSEMU works.  I put in a fallback from
PTRACE_SYSEMU to PTRACE_SYSCALL when Fedora broke PTRACE_SYSEMU.

> Right now (before these), UML
> doesn't build for x86_64 or i386 from this tree to begin with.

For current -mm, you'll need
http://marc.info/?l=linux-kernel&m=119635496908681&q=raw to build.

Jeff

-- 
Work email - jdike at linux dot intel dot com

Re: Possibly SATA related freeze killed networking and RAID

2007-11-29 Thread Robert Hancock


Phillip Susi wrote:
> Tejun Heo wrote:
>> Agreed.  Nobody cared on ATA controllers is usually very effective at
>> taking the whole machine down.  Is there any reason why we don't turn on
>> irqpoll on turned off IRQs automatically?
>
> Why does a single spurious interrupt cause it to be shut down?  I can
> see if the interrupt is stuck on and keeps interrupting constantly, but
> if it's just the occasional spurious interrupt, why not just ignore it
> and move on?


I'm not certain offhand, but I think there may be such a threshold. 
However, an occasional spurious interrupt isn't likely. For a 
level-triggered interrupt, an unhandled interrupt will keep interrupting 
forever since nobody knows how to clear it (until we decide to disable 
the IRQ entirely).
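
To illustrate the kind of threshold being discussed, here is a small
standalone model in C.  It is a sketch, not the kernel's code: the
struct, the function name, the window of 100000 interrupts and the
limit of 99900 unhandled ones are all assumptions for the example and
are not confirmed anywhere in this thread.

```c
#include <stdbool.h>

/* Assumed window and threshold; see the text above. */
#define IRQ_WINDOW       100000u
#define UNHANDLED_LIMIT   99900u

struct irq_model {
	unsigned int count;	/* interrupts seen in the current window */
	unsigned int unhandled;	/* of those, how many no handler claimed */
	bool disabled;		/* set once the line is given up on */
};

/* Account for one interrupt; 'handled' says whether any handler
 * claimed it.  The line is only disabled if nearly every interrupt
 * in a full window went unclaimed. */
static void model_note_interrupt(struct irq_model *irq, bool handled)
{
	if (irq->disabled)
		return;
	if (!handled)
		irq->unhandled++;
	if (++irq->count < IRQ_WINDOW)
		return;
	if (irq->unhandled > UNHANDLED_LIMIT)
		irq->disabled = true;
	irq->count = 0;
	irq->unhandled = 0;
}
```

Under this model an occasional unclaimed interrupt never trips the
limit, while a stuck level-triggered line, where every interrupt goes
unclaimed, is disabled at the end of the first window.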


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


Re: Possibly SATA related freeze killed networking and RAID

2007-11-29 Thread Tejun Heo

Phillip Susi wrote:
> Tejun Heo wrote:
>> Agreed.  Nobody cared on ATA controllers is usually very effective at
>> taking the whole machine down.  Is there any reason why we don't turn on
>> irqpoll on turned off IRQs automatically?
> 
> Why does a single spurious interrupt cause it to be shut down?  I can
> see if the interrupt is stuck on and keeps interrupting constantly, but
> if it's just the occasional spurious interrupt, why not just ignore it
> and move on?

Because SFF ATA controllers don't have an IRQ pending bit.  You don't know
whether an IRQ is raised or not.  Plus, accessing the status register, which
clears the pending IRQ, can be very slow on PATA machines.  It has to go
through the PCI and ATA bus and come back.  So, unconditionally trying
to clear the IRQ by accessing Status can incur noticeable overhead if the
IRQ is shared with devices which raise a lot of IRQs.

-- 
tejun

Re: [patch 0/2] x86, ptrace: support for branch trace store(BTS)

On Thu, 29 Nov 2007 08:14:10 -
"Metzger, Markus T" <[EMAIL PROTECTED]> wrote:

> Support for Intel's last branch recording to ptrace. This gives debuggers
> access to this hardware feature and allows them to show an execution trace
> of the debugged application.
> 
> Last branch recording (see section 18.5 in the Intel 64 and IA-32
> Architectures Software Developer's Manual) allows taking an execution
> trace of the running application without instrumentation. When a branch
> is executed, the hardware logs the source and destination address in a
> cyclic buffer given to it by the OS.
> 
> This can be a great debugging aid. It shows you how exactly you got
> where you currently are without requiring you to do lots of single
> stepping and rerunning.
> 
> This patch manages the various buffers, configures the trace
> hardware, disentangles the trace, and provides a user interface via
> ptrace. On the high-level design:
> - there is one optional trace buffer per thread_struct
> - upon a context switch, the trace hardware is reconfigured to either
>   disable tracing or to use the appropriate buffer for the new task.
>   - tracing induces ~20% overhead as branch records are sent out on
> the bus. 
>   - the hardware collects trace per processor. To disentangle the
> traces for different tasks, we use separate buffers and reconfigure
> the trace hardware.
> - the low-level data layout is configured at cpu initialization time
>   - different processors use different branch record formats
> 
> 
> patch 1/2 contains the kernel changes
> patch 2/2 contains changes to the ptrace man pages
> 
> 

Is there any userspace code available which people can use to play with
this?

How do you envisage it being used in the long term?  Do you expect any of
the standard performance tuning tools will be tweaked to understand this
feature and if so which ones?

I'm generally wondering "how will developers be using this in a year or
two's time?"

Please cc Michael Kerrisk <[EMAIL PROTECTED]> on future versions of
these patches.

The patches were horridly wordwrapped.

Is there any likelihood that any other CPUs do now or will in the future
support any similar feature to this?  If so, is an implementation which is
100% contained to arch/x86 appropriate?  


Reproducible data corruption with sendfile+vsftp - splice regression?

2007-11-29 Thread Holger Hoffstaette

Hi -

This regular Linux user and lkml lurker just noticed data corruption in
ftp'ed files and narrowed it down to vsftpd using sendfile(). So far this
has never caused problems in the past; I have not noticed this with
2.6.22.x but may have missed it. I do remember reading about some changes
to the underlying splice stuff since .23 so that may have something to do
with it.

The scenario:

- created a file with known bit pattern on Linux server
- ftp-got this file to Windows client: file has bad crc (yes, binary)
- verified with another client: same result

I have thus far eliminated (to the best of my knowledge) NICs, switches,
cables, the Windows FTP clients, the hard disk in the server (SATA, ext3):
nothing suspicious in any logs. Box is an AMD Sempron 2600+ with 1.5 GB
RAM, added rt8169 card, Gentoo, vsftpd stable 2.0.5 - nothing fancy.
Transferring the file with samba (interestingly with sendfile enabled) and
via ftp but from /dev/shm repeatably works fine; pulling from disk creates
bad crc, every time. The file is readable and can be copied, verified etc.
over and over so I'm sure that I'm not falling prey to a false positive.
ifconfig indicates no dropped or otherwise corrupted packets.
I noticed this first with 2.6.24-rc3, but also just tried the latest stable
2.6.23.9 with the same config, with no change in behaviour. After setting
vsftpd to use_sendfile=NO, gigs can be transferred without corruption.

The data corruption is sporadic, but absolutely repeatable. The file with
the known good pattern just contains multiple lines of:

012345678901234567890123456789012345678901234567890
012345678901234567890123456789012345678901234567890
012345678901234567890123456789012345678901234567890
..etc..

A corrupted file is missing random characters, so that the corrupted lines
look like this (line numbers added by me):

19785: 012345678901234567890123456789012345678901234567890
19786: 01234567890123456789012345678901234567890123678901234567890
19787: 012345678901234567890123456789012345678901234567890

or:

20074: 012345678901234567890123456789012345678901234567890
20075: 01234567890123456789012345678901234567890123012345678901234567890123456789012345678901234567890
20076: 012345678901234567890123456789012345678901234567890

Again, other network or hd traffic shows no signs of gremlins; the box is
perfectly stable, and turning sendfile on or off triggers/untriggers the
corruption reliably. I will try 2.6.22.x over the weekend, and before I
bother lkml with dmesg/.config etc. I wanted to fish for initial thoughts.
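
For anyone who wants to reproduce this, a checker along the following
lines can locate the first damaged line in a transferred copy.  It is
a minimal sketch, assuming the file consists only of the pattern line
shown above; the function names are made up for the example.

```c
#include <stdio.h>
#include <string.h>

/* The known-good line, copied from the description above. */
static const char PATTERN[] =
	"012345678901234567890123456789012345678901234567890";

/* Returns 1 if the line, ignoring any trailing CR/LF, is exactly
 * the pattern, and 0 otherwise. */
static int line_is_intact(const char *line)
{
	size_t len = strcspn(line, "\r\n");

	return len == sizeof(PATTERN) - 1 && memcmp(line, PATTERN, len) == 0;
}

/* Returns the 1-based number of the first bad line in the stream,
 * or 0 if every line matches.  Lines longer than the buffer would
 * be counted as more than one line, which is fine for this test. */
static long first_bad_line(FILE *f)
{
	char buf[4096];
	long lineno = 0;

	while (fgets(buf, sizeof(buf), f)) {
		lineno++;
		if (!line_is_intact(buf))
			return lineno;
	}
	return 0;
}
```

Running first_bad_line() over the copy fetched with use_sendfile=YES
and again with use_sendfile=NO gives a quick pass/fail without needing
a separate checksum tool on the client.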

thanks
Holger


Re: [PATCH] xfs: revert to double-buffering readdir

2007-11-29 Thread Christian Kujau


On Sun, 25 Nov 2007, Christoph Hellwig wrote:
> This patch does exactly that and reverts xfs_file_readdir to what's
> basically the 2.6.23 version minus the uio and vnops junk.


Thanks, works here too (without nordirplus as a mountoption).
Am I supposed to close the bug[0] or do you guys want to leave this
open to track the Real Fix (TM) for 2.6.25?

Again, thank you for the fix!
Christian.

[0] http://bugzilla.kernel.org/show_bug.cgi?id=9400
--
BOFH excuse #112:

The monitor is plugged into the serial port

Re: WARNING: at kernel/resource.c:189 __release_resource

2007-11-29 Thread Bjorn Helgaas

On Monday 26 November 2007 11:05:38 pm Andrew Morton wrote:
> On Thu, 22 Nov 2007 22:41:16 +0100 Jiri Slaby <[EMAIL PROTECTED]> wrote:
> > Ok, I hit the bug, suspend of 00:06 device complains about it:
> > WARNING: at .../kernel/resource.c:185 __release_resource()
> > 
> > Call Trace:
> >  [] release_resource+0xb5/0xf0
> >  [] pnp_release_resources+0x70/0x130
> >  [] pnp_stop_dev+0x45/0x90
> >  [] pnp_bus_suspend+0x92/0xb0
> >  [] suspend_device+0x113/0x180
> >  [] device_suspend+0x200/0x320
> >  [] suspend_devices_and_enter+0xa5/0x170
> >  [] enter_state+0x209/0x270
> >  [] state_store+0xaf/0xf0
> >  [] kobj_attr_store+0x17/0x20
> >  [] sysfs_write_file+0xce/0x140
> >  [] vfs_write+0xc7/0x170
> >  [] sys_write+0x50/0x90
> >  [] system_call+0x7e/0x83
> > 
> > # LANG=en ll /sys/devices/pnp0/00:06/
> > total 0
> > lrwxrwxrwx 1 root root0 Nov 22 22:35 driver -> ../../../bus/pnp/drivers/serial
> > -r--r--r-- 1 root root 4096 Nov 22 22:35 id
> > -r--r--r-- 1 root root 4096 Nov 22 22:35 options
> > drwxr-xr-x 2 root root0 Nov 22 22:35 power
> > -rw-r--r-- 1 root root 4096 Nov 22 22:35 resources
> > lrwxrwxrwx 1 root root0 Nov 22 22:35 subsystem -> ../../../bus/pnp
> > drwxr-xr-x 3 root root0 Nov 22 22:35 tty
> > -rw-r--r-- 1 root root 4096 Nov 22 22:35 uevent
> 
> I suppose that's a genuine leak, presumably in 8250_pnp.

We used to have only the serial driver resource reservation.  We now
have an additional 00:06 resource that is the parent of the serial
resource, e.g.,

  03f8-03ff : 00:06
03f8-03ff : serial

I think this problem happens because pnp_bus_suspend() calls
serial_pnp_suspend(), which suspends the driver but does nothing
with the resources.  Then it calls pnp_stop_dev(), which releases
the 00:06 resource, which still has a serial child resource.

The corresponding PCI code in pci_device_suspend() does not do
any generic device disable or resource release.  I don't know
why PNP disables the device on suspend.  I glanced through the
ACPI spec but didn't see a requirement for it.  Maybe Pierre [1]
remembers.

Maybe we could either remove the pnp_{stop,start}_dev() calls
from the suspend/resume path, or move the PNP resource management
out of pnp_{start,stop}_dev().

Bjorn

[1] http://lkml.org/lkml/2005/11/30/39

Re: Out of tree module using LSM

2007-11-29 Thread Jon Masters


On Thu, 2007-11-29 at 21:45 +, Alan Cox wrote:
> > Jargon File in all its glory. And if you still think you could look for
> > patterns, how about executable code that self-modifies in random ways
> > but when executed as a whole actually has the functionality of fetchmail
> > embedded within it? How would you guard against that?
> 
> That's a problem for whoever writes the ESR detection tool and to what
> level it works. The question for the kernel is how do we provide a
> mechanism to allow (to some extent at least) this kind of tool to run.

Right. I'm just saying reading a single page out of context (no pun
intended) is not going to be very useful. They need to scan the entire
file, which means that there are limited ways this is practical - it's
not practical to do that on every write into a shared mapping, hence a
solution that scans on open, etc. is probably the best there is.

(I know you know this)

Jon.



Re: sata NCQ blacklist entry

2007-11-29 Thread Bjoern Olausson

On 11/29/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
>
> I now have affected drives on my desk and am gonna try reproduce it.  My
> gut feeling says it's timing related problem on controller / driver
> side.  Please wait a bit.
>

> > by the way, and OT, did the Plextor DVD-RW drive reach you, Tejun?
>
> No, not yet.  Do you have a tracking number or something?
>
> Thanks.

Re: Out of tree module using LSM

2007-11-29 Thread Jon Masters

On Thu, 2007-11-29 at 15:56 -0500, [EMAIL PROTECTED] wrote:
> On Thu, 29 Nov 2007 14:45:51 EST, Jon Masters said:
> > Ah, but I could write a sequence of pages that on their own looked
> > garbage, but in reality, when executed would print out a copy of the
> > Jargon File in all its glory. And if you still think you could look for
> > patterns, how about executable code that self-modifies in random ways
> > but when executed as a whole actually has the functionality of fetchmail
> > embedded within it? How would you guard against that?
> 
> So, just because Fred Cohen showed in his PhD thesis that *perfect*
> virus/malware scanning is equivalent to the Turing Halting Problem, we
> should abandon efforts to make a 99.9998% workable system?

I think you misread what I said. I implied the exact opposite :-)

I'm trying to show that I understand the problem by saying the above,
that doing this perfectly is impossible, but I also happen to believe
that there are those who have solutions that provide a level of
protection to their users, who ask for such things. Hence my point is
that it's not really our place to debate whether virus scanning is
good/bad but more how to provide a sane API. I'll get a spec.

Jon.


Re: remap_file_pages() broken in 2.6.23?

On Thu, Nov 29, 2007 at 02:45:23PM -0500, Chuck Ebbert wrote:
> Original report: https://bugzilla.redhat.com/show_bug.cgi?id=404201
> 
> The test case below, taken from the LTP test code, prints -1 (as
> expected) on 2.6.22 and 0 on 2.6.23. It tries to remap an out-of-range
> page. Proposed patch follows the program. Bug was apparently caused by
> commit 54cb8821de07f2ffcd28c380ce9b93d5784b40d7.

Ah, that's not such good behaviour anyway. mmap is allowed to map
outside the file offset, so you're telling me that remap_file_pages
just magically should not be allowed to remap these...?

> Patch:
> 
> Signed-off-by: Supriya Kannery <[EMAIL PROTECTED]>
> 
> --- linux-2.6.23/mm/fremap.c.orig 2007-11-22 00:56:09.0 -0600
> +++ linux-2.6.23/mm/fremap.c  2007-11-26 03:08:55.0 -0600
> @@ -124,6 +124,7 @@ asmlinkage long sys_remap_file_pages(uns
>   struct vm_area_struct *vma;
>   int err = -EINVAL;
>   int has_write_lock = 0;
> + unsigned long f_size = 0;
>  
>   if (__prot)
>   return err;
> @@ -181,6 +182,14 @@ asmlinkage long sys_remap_file_pages(uns
>   goto retry;
>   }
>   mapping = vma->vm_file->f_mapping;
> +
> + f_size = i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1;
> + f_size = f_size >> PAGE_CACHE_SHIFT;
> + if ((pgoff + size >> PAGE_CACHE_SHIFT) > f_size) {
> + err = -EINVAL;
> + goto out;
> + }
> +
>   /*
>* page_mkclean doesn't work on nonlinear vmas, so if
>* dirty pages need to be accounted, emulate with linear

I don't think there is anything preventing truncate races here. Theoretically
we could do it by taking i_mutex around here, but even then a subsequent
truncate would still be able to push the mapping out of bounds.

If it were any other syscall than remap_file_pages, I'd be much more
hesitant to say this: I propose we change the test case instead. I
also changed other elements of the API, and we had the result tested
and verified by Oracle...
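
For reference, the arithmetic of the bounds check in the quoted patch
can be written out as standalone C.  This is a sketch under stated
assumptions: 4 KiB pages, pgoff counted in pages and size in bytes
(as in the remap_file_pages() interface), and helper names invented
for the example.  It also shows how C parses the patch's condition,
since '+' binds tighter than '>>'; whether that grouping was intended
is not stated in the thread.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u				/* assumed 4 KiB pages */
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)

/* Number of pages needed to hold i_size bytes, rounding up. */
static uint64_t file_pages(uint64_t i_size)
{
	return (i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
}

/* Bounds check with size converted to pages before the addition. */
static int beyond_eof_pages(uint64_t pgoff, uint64_t size, uint64_t i_size)
{
	return pgoff + (size >> PAGE_SHIFT) > file_pages(i_size);
}

/* The condition as C parses "pgoff + size >> PAGE_CACHE_SHIFT":
 * pages and bytes are added first, then shifted together. */
static int beyond_eof_as_written(uint64_t pgoff, uint64_t size, uint64_t i_size)
{
	return ((pgoff + size) >> PAGE_SHIFT) > file_pages(i_size);
}
```

For a one-page file, remapping one page at pgoff 1 is reported as out
of range by the first form but not by the second, so the two only
agree when pgoff is large compared with the page size.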


Re: [PATCH] Reduce stack used by lib/hexdump.c