date:20071106

Re: [BUG] Linux 2.6.24-rc2 - oom-killer gets invoked

2007-11-06 Thread Balbir Singh

Kamalesh Babulal wrote:
> Hi,
> 
> oom-killer got invoked while running ltp-runall on the 2.6.24-rc2 kernel.
> 
> python invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0
> 
> Call Trace:
>  [] oom_kill_process+0x4f/0xf5
>  [] out_of_memory+0x1bc/0x22d
>  [] __alloc_pages+0x282/0x313
>  [] __wait_on_bit_lock+0x5b/0x66
>  [] __do_page_cache_readahead+0x7c/0x18f
>  [] filemap_fault+0x15d/0x317
>  [] __do_fault+0x68/0x3bb
>  [] handle_mm_fault+0x325/0x694
>  [] do_page_fault+0x3c5/0x764
>  [] arch_get_unmapped_area+0x184/0x1f9
>  [] error_exit+0x0/0x51
> 
> Mem-info:
> Node 0 DMA per-cpu:
> CPU0: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
> CPU1: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
> CPU2: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
> CPU3: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
> Node 0 DMA32 per-cpu:
> CPU0: Hot: hi:  186, btch:  31 usd:  29   Cold: hi:   62, btch:  15 usd:  14
> CPU1: Hot: hi:  186, btch:  31 usd:  95   Cold: hi:   62, btch:  15 usd:  52
> CPU2: Hot: hi:  186, btch:  31 usd:  33   Cold: hi:   62, btch:  15 usd:  51
> CPU3: Hot: hi:  186, btch:  31 usd: 101   Cold: hi:   62, btch:  15 usd:  50
> Active:118809 inactive:124570 dirty:0 writeback:1882 unstable:0

The active/inactive page count looks good.

>  free:2358 slab:2831 mapped:41 pagetables:2058 bounce:0
> Node 0 DMA free:3972kB min:28kB low:32kB high:40kB active:996kB 
> inactive:3036kB present:7552kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 992 992 992
> Node 0 DMA32 free:177860kB min:4012kB low:5012kB high:6016kB active:408456kB 
> inactive:388372kB present:1015864kB pages_scanned:640 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0

Free memory also looks good, especially in Node 0 DMA32, and the order is 0.

> Node 0 DMA: 57*4kB 39*8kB 25*16kB 10*32kB 2*64kB 3*128kB 1*256kB 0*512kB 
> 1*1024kB 1*2048kB 0*4096kB = 5100kB
> Node 0 DMA32: 1597*4kB 1029*8kB 544*16kB 348*32kB 296*64kB 363*128kB 
> 305*256kB 253*512kB 134*1024kB 73*2048kB 0*4096kB = 594204kB
> Swap cache: add 841053, delete 835135, find 529/803, race 0+0
> Free swap  = 1986640kB
> Total swap = 2031640kB
> Free swap:   1986640kB
> 262093 pages of RAM
> 6981 reserved pages
> 190 pages shared
> 5918 pages swap cached
> Out of memory: kill process 23256 (mem01) score 46673 or a child
> Killed process 23256 (mem01)
> 

You don't see the problem with 2.6.24-rc1, right? Any chance you
could do a git-bisect?

> 
> During bootup, the following call trace was also seen:
> 
> sysctl table check failed: /net/token-ring .3.14 procname does not match 
> binary path procname
> 
> Call Trace:
>  [] set_fail+0x3f/0x47
>  [] sysctl_check_table+0x4cb/0x51e
>  [] sysctl_check_lookup+0xc9/0xd8
>  [] sysctl_check_table+0x4d9/0x51e
>  [] sysctl_set_parent+0x1f/0x32
>  [] sysctl_init+0x1e/0x22
>  [] kernel_init+0x195/0x307
>  [] child_rip+0xa/0x12
>  [] kernel_init+0x0/0x307
>  [] child_rip+0x0/0x12
> 



-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [2.6 patch] always export sysctl_{r,w}mem_max

2007-11-06 Thread David Miller

From: [EMAIL PROTECTED] (Eric W. Biederman)
Date: Fri, 26 Oct 2007 18:04:22 -0600

> So if this is really something we want to stop doing we should
> be able to take a few extra moments to remove the code from the
> two problem drivers, and remove the exports.

I've killed the references in dlm and rrunner in the net-2.6
tree.

Re: [RFC/PATCH] Optimize zone allocator synchronization

2007-11-06 Thread Nick Piggin

On Wednesday 07 November 2007 17:19, Andrew Morton wrote:
> > On Tue, 06 Nov 2007 05:08:07 -0500 Chris Snook <[EMAIL PROTECTED]> wrote:
> >
> > Don Porter wrote:
> > > From: Donald E. Porter <[EMAIL PROTECTED]>
> > >
> > > In the bulk page allocation/free routines in mm/page_alloc.c, the zone
> > > lock is held across all iterations.  For certain parallel workloads, I
> > > have found that releasing and reacquiring the lock for each iteration
> > > yields better performance, especially at higher CPU counts.  For
> > > instance, kernel compilation is sped up by 5% on an 8 CPU test
> > > machine.  In most cases, there is no significant effect on performance
> > > (although the effect tends to be slightly positive).  This seems quite
> > > reasonable for the very small scope of the change.
> > >
> > > My intuition is that this patch prevents smaller requests from waiting
> > > on larger ones.  While grabbing and releasing the lock within the loop
> > > adds a few instructions, it can lower the latency for a particular
> > > thread's allocation which is often on the thread's critical path.
> > > Lowering the average latency for allocation can increase system
> > > throughput.
> > >
> > > More detailed information, including data from the tests I ran to
> > > validate this change are available at
> > > http://www.cs.utexas.edu/~porterde/kernel-patch.html .
> > >
> > > Thanks in advance for your consideration and feedback.

I did see this initial post, and didn't quite know what to make of it.
I'll admit it is slightly unexpected :) Always good to research ideas
against common convention, though.

I don't know whether your reasoning is correct though: unless there is
a significant number of higher order allocations (which there should
not be, AFAIKS), all allocators will go through the per CPU lists which
batch the same number of objects on and off, so there is no such thing
as smaller or larger requests.

And there are a number of regressions as well in your tests. It would be
nice to get some more detailed profile numbers (preferably with an
upstream kernel) to try to work out what is going faster.
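
The trade-off in the patch description can be made concrete with a small user-space model. This is only a sketch: `model_lock`, `free_batch_held` and `free_batch_per_page` are hypothetical names, not functions from mm/page_alloc.c, and the "lock" merely counts acquisitions rather than providing mutual exclusion.

```c
#include <stddef.h>

/* Hypothetical stand-in for zone->lock: it only counts acquisitions. */
struct model_lock {
    unsigned long acquisitions;
    int held;
};

void model_lock_take(struct model_lock *l)
{
    l->acquisitions++;
    l->held = 1;
}

void model_lock_drop(struct model_lock *l)
{
    l->held = 0;
}

/* Behaviour described as current: one acquisition covers the whole batch. */
size_t free_batch_held(struct model_lock *l, size_t batch)
{
    size_t freed = 0;

    model_lock_take(l);
    while (freed < batch)
        freed++;        /* stands in for handing one page back */
    model_lock_drop(l);
    return freed;
}

/* Behaviour described as proposed: drop and retake the lock per iteration. */
size_t free_batch_per_page(struct model_lock *l, size_t batch)
{
    size_t freed = 0;

    while (freed < batch) {
        model_lock_take(l);
        freed++;
        model_lock_drop(l);
    }
    return freed;
}
```

For a batch of 31 pages the first variant takes the lock once and the second takes it 31 times. The model says nothing about latency under contention, which is the part that needs real profiling.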

It's funny, Dave Miller and I were just talking about the possible
reappearance of zone->lock contention with massively multi core and
multi threaded CPUs. I think the right way to fix this in the long run
if it turns into a real problem, is something like having a lock per
MAX_ORDER block, and having CPUs prefer to allocate from different
blocks. Anti-frag makes this pretty interesting to implement, but it
will be possible.

> > That's an interesting insight.  My intuition is that Nick Piggin's
> > recently-posted ticket spinlocks patches[1] will reduce the need for this
> > patch, though it may be useful to have both.  Can you benchmark again
> > with only ticket spinlocks, and with ticket spinlocks + this patch? 
> > You'll probably want to use 2.6.24-rc1 as your baseline, due to the x86
> > architecture merge.
>
> The patch as-is would hurt low cpu-count workloads, and single-threaded
> workloads: it is simply taking that lock a lot more times.  This will be
> particularly noticeable on things like older P4 machines which have
> peculiarly expensive locked operations.

It's not even restricted to P4s -- another big cost is going to be the
cacheline pingpong. Actually it might be worth trying another test run
with zone->lock put into its own cacheline (as it stands, when the lock
gets contended, spinners will just sit there pushing useful fields out
of the holder's memory -- ticket locks will do better here, but they
still write to the lock once, then sit there loading it).
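
Giving the lock its own cacheline could look roughly like the following user-space sketch. The struct and its field names are hypothetical, the 64-byte line size is an assumption, and standard C11 `alignas` is used here in place of whatever annotation the kernel would use.

```c
#include <stdalign.h>
#include <stddef.h>

#define MODEL_CACHELINE 64  /* assumption: 64-byte cache lines */

/* Hypothetical zone-like structure, not the kernel's struct zone. */
struct model_zone {
    /* Read-mostly fields that the lock holder keeps touching. */
    unsigned long pages_min;
    unsigned long pages_low;
    unsigned long pages_high;

    /* The lock starts on a fresh cache line ... */
    alignas(MODEL_CACHELINE) int lock;

    /* ... and the next field is pushed onto the following line, so
     * spinners writing to 'lock' do not evict its neighbours. */
    alignas(MODEL_CACHELINE) unsigned long nr_free;
};
```

The cost is padding: with these assumptions the structure grows to three cache lines even though it holds only a handful of words.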

> A test to run would be, on ext2:
>
>   time (dd if=/dev/zero of=foo bs=16k count=2048 ; rm foo)
>
> (might need to increase /proc/sys/vm/dirty* to avoid any writeback)
>
>
> I wonder if we can do something like:
>
>   if (lock_is_contended(lock)) {
>           spin_unlock(lock);
>           spin_lock(lock);        /* To the back of the queue */
>   }
>
> (in conjunction with the ticket locks) so that we only do the expensive
> buslocked operation when we actually have a need to do so.
>
> (The above should be wrapped in some new spinlock interface function which
> is probably a no-op on architectures which cannot implement it usefully)

We have the need_lockbreak stuff. Of course, that's often pretty useless
with regular spinlocks (when you consider that my tests show that a single
CPU can be allowed to retake the same lock several million times in a row
despite contention)...

Anyway, yeah we could do that. But I think we do actually want to batch
up allocations on a given CPU in the multithreaded case as well, rather
than interleave them. There are some benefits in avoiding cacheline bouncing.
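
With ticket locks, a "contended?" test of the kind suggested above falls out of the two counters. The following is a minimal user-space sketch using C11 atomics; all names are hypothetical and none of this is the kernel's spinlock API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical ticket lock: 'next' is the next ticket to hand out,
 * 'owner' is the ticket currently being served. */
struct ticket_lock {
    atomic_uint next;
    atomic_uint owner;
};

void ticket_lock_acquire(struct ticket_lock *l)
{
    unsigned int me = atomic_fetch_add(&l->next, 1);

    while (atomic_load(&l->owner) != me)
        ;   /* spin until our ticket comes up */
}

void ticket_lock_release(struct ticket_lock *l)
{
    atomic_fetch_add(&l->owner, 1);
}

/* Only meaningful when called by the current holder: true if at least
 * one other ticket has been taken behind ours. */
bool ticket_lock_is_contended(struct ticket_lock *l)
{
    return atomic_load(&l->next) - atomic_load(&l->owner) > 1;
}

/* Holder-only: requeue at the back, but only if somebody is waiting.
 * Returns true if the lock was actually dropped and retaken. */
bool ticket_lock_break(struct ticket_lock *l)
{
    if (!ticket_lock_is_contended(l))
        return false;
    ticket_lock_release(l);
    ticket_lock_acquire(l);
    return true;
}
```

In the uncontended case `ticket_lock_break()` costs two loads and no atomic read-modify-write, which is the property asked for above.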

[PATCH -libata] nv_hardreset: update dangling reference to bugzilla entry

2007-11-06 Thread Fernando Luis Vázquez Cao

Signed-off-by: Fernando Luis Vazquez Cao <[EMAIL PROTECTED]>
---

diff -urNp linux-2.6.24-rc2-orig/drivers/ata/sata_nv.c linux-2.6.24-rc2/drivers/ata/sata_nv.c
--- linux-2.6.24-rc2-orig/drivers/ata/sata_nv.c 2007-11-07 10:28:41.0 +0900
+++ linux-2.6.24-rc2/drivers/ata/sata_nv.c  2007-11-07 16:29:21.0 +0900
@@ -1629,7 +1629,7 @@ static int nv_hardreset(struct ata_link 
 
/* SATA hardreset fails to retrieve proper device signature on
 * some controllers.  Don't classify on hardreset.  For more
-* info, see http://bugme.osdl.org/show_bug.cgi?id=3352
+* info, see http://bugzilla.kernel.org/show_bug.cgi?id=3352
 */
return sata_std_hardreset(link, &dummy, deadline);
 }



Re: Massive slowdown when re-querying large nfs dir

2007-11-06 Thread Neil Brown

On Tuesday November 6, [EMAIL PROTECTED] wrote:
> > On Tue, 6 Nov 2007 14:28:11 +0300 Al Boldi <[EMAIL PROTECTED]> wrote:
> > Al Boldi wrote:
> > > There is a massive (3-18x) slowdown when re-querying a large nfs dir (2k+
> > > entries) using a simple ls -l.
> > >
> > > On 2.6.23 client and server running userland rpc.nfs.V2:
> > > first  try: time -p ls -l <2k+ entry dir>  in ~2.5sec
> > > more tries: time -p ls -l <2k+ entry dir>  in ~8sec
> > >
> > > first  try: time -p ls -l <5k+ entry dir>  in ~9sec
> > > more tries: time -p ls -l <5k+ entry dir>  in ~180sec
> > >
> > > On 2.6.23 client and 2.4.31 server running userland rpc.nfs.V2:
> > > first  try: time -p ls -l <2k+ entry dir>  in ~2.5sec
> > > more tries: time -p ls -l <2k+ entry dir>  in ~7sec
> > >
> > > first  try: time -p ls -l <5k+ entry dir>  in ~8sec
> > > more tries: time -p ls -l <5k+ entry dir>  in ~43sec
> > >
> > > Remounting the nfs-dir on the client resets the problem.
> > >
> > > Any ideas?
> > 
> > Ok, I played some more with this, and it turns out that nfsV3 is a lot 
> > faster.  But, this does not explain why the 2.4.31 kernel is still over 
> > 4-times faster than 2.6.23.
> > 
> > Can anybody explain what's going on?
> > 
> 
> Sure, Neil can! ;)

Nuh.
He said "userland rpc.nfs.Vx".  I only do "kernel-land NFS".  In these
days of high specialisation, each line of code is owned by a different
person, and finding the right person is hard.

I would suggest getting a 'tcpdump -s0' trace and seeing (with
wireshark) what is different between the various cases.

NeilBrown

Re: [PATCH] x86: unification of cfufreq/Kconfig

2007-11-06 Thread Adrian Bunk

On Wed, Nov 07, 2007 at 08:06:44AM +0100, Sam Ravnborg wrote:
> On Wed, Nov 07, 2007 at 07:02:20AM +0100, Adrian Bunk wrote:
> > On Wed, Nov 07, 2007 at 12:01:12AM +0100, Sam Ravnborg wrote:
> > >...
> > >  config X86_SPEEDSTEP_CENTRINO
> > > - tristate "Intel Enhanced SpeedStep"
> > > + tristate "Intel Enhanced SpeedStep (deprecated)"
> > >   select CPU_FREQ_TABLE
> > > - select X86_SPEEDSTEP_CENTRINO_TABLE
> > > + select X86_SPEEDSTEP_CENTRINO_TABLE if X86_32
> > > + depends on X86_64 && ACPI_PROCESSOR
> > >...
> > 
> > No.
> > 
> > depends on ACPI_PROCESSOR if X86_64
> 
> Gives syntax error.

That happens when you review something without trying it...

depends on (ACPI_PROCESSOR || !X86_64)
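
Why this form expresses "ACPI_PROCESSOR is required only on X86_64" can be checked with Kconfig's expression rules (n < m < y, `!` is 2 minus the value, `&&` is min, `||` is max). A small sketch in C, with hypothetical helper names:

```c
/* Kconfig tristate values: n < m < y. */
enum tristate { TRI_N = 0, TRI_M = 1, TRI_Y = 2 };

enum tristate tri_not(enum tristate a)
{
    return (enum tristate)(2 - a);
}

enum tristate tri_and(enum tristate a, enum tristate b)
{
    return a < b ? a : b;   /* min */
}

enum tristate tri_or(enum tristate a, enum tristate b)
{
    return a > b ? a : b;   /* max */
}

/* depends on X86_64 && ACPI_PROCESSOR */
enum tristate dep_original(enum tristate x86_64, enum tristate acpi)
{
    return tri_and(x86_64, acpi);
}

/* depends on (ACPI_PROCESSOR || !X86_64) */
enum tristate dep_fixed(enum tristate x86_64, enum tristate acpi)
{
    return tri_or(acpi, tri_not(x86_64));
}
```

With X86_64=n the original dependency evaluates to n, so the option disappears on 32-bit; the fixed form evaluates to y there and simply follows ACPI_PROCESSOR when X86_64=y.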

>   Sam

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed


[BUG] Linux 2.6.24-rc2 - oom-killer gets invoked

2007-11-06 Thread Kamalesh Babulal

Hi,

oom-killer got invoked while running ltp-runall on the 2.6.24-rc2 kernel.

python invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0

Call Trace:
 [] oom_kill_process+0x4f/0xf5
 [] out_of_memory+0x1bc/0x22d
 [] __alloc_pages+0x282/0x313
 [] __wait_on_bit_lock+0x5b/0x66
 [] __do_page_cache_readahead+0x7c/0x18f
 [] filemap_fault+0x15d/0x317
 [] __do_fault+0x68/0x3bb
 [] handle_mm_fault+0x325/0x694
 [] do_page_fault+0x3c5/0x764
 [] arch_get_unmapped_area+0x184/0x1f9
 [] error_exit+0x0/0x51

Mem-info:
Node 0 DMA per-cpu:
CPU0: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
CPU1: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
CPU2: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
CPU3: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU0: Hot: hi:  186, btch:  31 usd:  29   Cold: hi:   62, btch:  15 usd:  14
CPU1: Hot: hi:  186, btch:  31 usd:  95   Cold: hi:   62, btch:  15 usd:  52
CPU2: Hot: hi:  186, btch:  31 usd:  33   Cold: hi:   62, btch:  15 usd:  51
CPU3: Hot: hi:  186, btch:  31 usd: 101   Cold: hi:   62, btch:  15 usd:  50
Active:118809 inactive:124570 dirty:0 writeback:1882 unstable:0
 free:2358 slab:2831 mapped:41 pagetables:2058 bounce:0
Node 0 DMA free:3972kB min:28kB low:32kB high:40kB active:996kB inactive:3036kB 
present:7552kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 992 992 992
Node 0 DMA32 free:177860kB min:4012kB low:5012kB high:6016kB active:408456kB 
inactive:388372kB present:1015864kB pages_scanned:640 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 57*4kB 39*8kB 25*16kB 10*32kB 2*64kB 3*128kB 1*256kB 0*512kB 
1*1024kB 1*2048kB 0*4096kB = 5100kB
Node 0 DMA32: 1597*4kB 1029*8kB 544*16kB 348*32kB 296*64kB 363*128kB 305*256kB 
253*512kB 134*1024kB 73*2048kB 0*4096kB = 594204kB
Swap cache: add 841053, delete 835135, find 529/803, race 0+0
Free swap  = 1986640kB
Total swap = 2031640kB
Free swap:   1986640kB
262093 pages of RAM
6981 reserved pages
190 pages shared
5918 pages swap cached
Out of memory: kill process 23256 (mem01) score 46673 or a child
Killed process 23256 (mem01)
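
For anyone reading the per-order lines in the report above: each `N*SIZEkB` term is a count of free blocks of that size, and the trailing figure is their sum. A minimal sketch of that arithmetic, assuming a 4kB base page as in the report (the function name is hypothetical):

```c
#include <stddef.h>

/* counts[i] is the number of free blocks of order i; with a 4kB base
 * page an order-i block is (4 << i) kB. Returns the total in kB. */
unsigned long buddy_free_kb(const unsigned long *counts, size_t orders)
{
    unsigned long total = 0;
    size_t i;

    for (i = 0; i < orders; i++)
        total += counts[i] * (4UL << i);
    return total;
}
```

Feeding in the eleven counts from the `Node 0 DMA:` line (57, 39, 25, 10, 2, 3, 1, 0, 1, 1, 0) gives the printed 5100kB.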


During bootup, the following call trace was also seen:

sysctl table check failed: /net/token-ring .3.14 procname does not match binary 
path procname

Call Trace:
 [] set_fail+0x3f/0x47
 [] sysctl_check_table+0x4cb/0x51e
 [] sysctl_check_lookup+0xc9/0xd8
 [] sysctl_check_table+0x4d9/0x51e
 [] sysctl_set_parent+0x1f/0x32
 [] sysctl_init+0x1e/0x22
 [] kernel_init+0x195/0x307
 [] child_rip+0xa/0x12
 [] kernel_init+0x0/0x307
 [] child_rip+0x0/0x12

-- 
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.

[Patch]Add strict_goal parameter to __alloc_bootmem_core

2007-11-06 Thread Zou Nan hai

If __alloc_bootmem_core is given a goal, it first tries to allocate
memory above that goal. If that fails, it falls back to the low pages.

Sometimes we don't want this behavior; we want the goal to be strict.

This patch introduces a strict_goal parameter to __alloc_bootmem_core.

If strict_goal is set, __alloc_bootmem_core returns NULL to indicate
that it can't allocate memory above that goal.

Note that we do not scan from last_success if strict_goal is set; the
scan starts from the beginning of the goal instead.
We skip this optimization to keep the code simple, because strict_goal is
not supposed to be used in a hot path.
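
The intended semantics can be sketched in user space as follows. This is a simplified model with hypothetical names, not the bootmem code: it works on a plain array of page flags, ignores alignment and limit, and has no last_success shortcut.

```c
#include <stdbool.h>
#include <stddef.h>

/* Find 'count' consecutive free pages at or after 'start'; -1 if none. */
long model_scan_from(const bool *used, size_t npages, size_t start,
                     size_t count)
{
    size_t i, j;

    for (i = start; i + count <= npages; i++) {
        for (j = 0; j < count && !used[i + j]; j++)
            ;
        if (j == count)
            return (long)i;
    }
    return -1;
}

/* Allocate 'count' pages, preferring page index 'goal' and above.
 * With strict_goal set there is no fallback below the goal. */
long model_alloc(bool *used, size_t npages, size_t count, size_t goal,
                 bool strict_goal)
{
    long idx;
    size_t k;

    if (goal >= npages) {
        if (strict_goal)
            return -1;      /* goal lies outside this node */
        goal = 0;
    }

    idx = model_scan_from(used, npages, goal, count);
    if (idx < 0 && goal > 0 && !strict_goal)
        idx = model_scan_from(used, npages, 0, count);  /* low pages */

    if (idx >= 0)
        for (k = 0; k < count; k++)
            used[idx + k] = true;
    return idx;
}
```

With pages 4..7 of an 8-page node already in use, a two-page request with goal 4 fails when strict_goal is set and is satisfied from page 0 when it is not.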

Signed-off-by: Zou Nan hai <[EMAIL PROTECTED]>
Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>

diff -Nraup a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
--- a/arch/x86/mm/numa_64.c 2007-10-24 11:50:57.0 +0800
+++ b/arch/x86/mm/numa_64.c 2007-11-07 13:06:50.0 +0800
@@ -247,7 +247,7 @@ void __init setup_node_zones(int nodeid)
__alloc_bootmem_core(NODE_DATA(nodeid)->bdata, 
memmapsize, SMP_CACHE_BYTES, 
round_down(limit - memmapsize, PAGE_SIZE), 
-   limit);
+   limit, 1);
 #endif
 } 
 
diff -Nraup a/include/linux/bootmem.h b/include/linux/bootmem.h
--- a/include/linux/bootmem.h   2007-11-07 13:06:35.0 +0800
+++ b/include/linux/bootmem.h   2007-11-07 13:06:04.0 +0800
@@ -58,7 +58,8 @@ extern void *__alloc_bootmem_core(struct
  unsigned long size,
  unsigned long align,
  unsigned long goal,
- unsigned long limit);
+ unsigned long limit,
+ int strict_goal);
 
 #ifndef CONFIG_HAVE_ARCH_BOOTMEM_NODE
 extern void reserve_bootmem(unsigned long addr, unsigned long size);
diff -Nraup a/mm/bootmem.c b/mm/bootmem.c
--- a/mm/bootmem.c  2007-11-07 13:06:35.0 +0800
+++ b/mm/bootmem.c  2007-11-07 13:06:18.0 +0800
@@ -179,7 +179,7 @@ static void __init free_bootmem_core(boo
  */
 void * __init
 __alloc_bootmem_core(struct bootmem_data *bdata, unsigned long size,
- unsigned long align, unsigned long goal, unsigned long limit)
+ unsigned long align, unsigned long goal, unsigned long limit, int strict_goal)
 {
unsigned long offset, remaining_size, areasize, preferred;
unsigned long i, start = 0, incr, eidx, end_pfn;
@@ -212,15 +212,20 @@ __alloc_bootmem_core(struct bootmem_data
/*
 * We try to allocate bootmem pages above 'goal'
 * first, then we try to allocate lower pages.
-*/
-   if (goal && goal >= bdata->node_boot_start && PFN_DOWN(goal) < end_pfn) {
-   preferred = goal - bdata->node_boot_start;
+* if the goal is not strict.
+ */
+
+   preferred = 0;
+   if (goal) {
+   if (goal >= bdata->node_boot_start && PFN_DOWN(goal) < end_pfn) {
+   preferred = goal - bdata->node_boot_start;
 
if (bdata->last_success >= preferred)
-   if (!limit || (limit && limit > bdata->last_success))
+   if (!strict_goal && (!limit || (limit && limit > bdata->last_success)))
preferred = bdata->last_success;
-   } else
-   preferred = 0;
+   } else if (strict_goal)
+return NULL;
+   }
 
preferred = PFN_DOWN(ALIGN(preferred, align)) + offset;
areasize = (size + PAGE_SIZE-1) / PAGE_SIZE;
@@ -247,7 +252,7 @@ restart_scan:
i = ALIGN(j, incr);
}
 
-   if (preferred > offset) {
+   if (preferred > offset && !strict_goal) {
preferred = offset;
goto restart_scan;
}
@@ -421,7 +426,7 @@ void * __init __alloc_bootmem_nopanic(un
void *ptr;
 
list_for_each_entry(bdata, &bdata_list, list) {
-   ptr = __alloc_bootmem_core(bdata, size, align, goal, 0);
+   ptr = __alloc_bootmem_core(bdata, size, align, goal, 0, 0);
if (ptr)
return ptr;
}
@@ -449,7 +454,7 @@ void * __init __alloc_bootmem_node(pg_da
 {
void *ptr;
 
-   ptr = __alloc_bootmem_core(pgdat->bdata, size, align, goal, 0);
+   ptr = __alloc_bootmem_core(pgdat->bdata, size, align, goal, 0, 0);
if (ptr)
return ptr;
 
@@ -468,7 +473,7 @@ void * __init __alloc_bootmem_low(unsign
 
list_for_each_entry(bdata, &bdata_list, list) {
ptr = __alloc_bootmem_core(bdata, size, align, goal,
-   ARCH_LOW_ADDRESS_LIMIT);
+   ARCH_LOW_ADDRESS_LIMIT, 0);
if (ptr)
return ptr;

[Patch] Allocate sparse vmemmap block above 4G

2007-11-06 Thread Zou Nan hai

Try to allocate the sparse vmemmap block above 4G on x64 systems.

On some single-node x64 systems with a huge amount of physical memory, e.g.
more than 64G, the memmap size may be very big.

If the memmap is allocated from low pages, it may occupy too much
memory below 4G.
swiotlb could then fail to reserve its bounce buffer under 4G, which
leads to boot failure.

This patch first tries to allocate memmap memory above 4G in the sparse
vmemmap code.
If that fails, it allocates the memmap above MAX_DMA_ADDRESS.
This patch is against 2.6.24-rc1-git14.
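
To give a feel for the sizes involved, here is a back-of-the-envelope sketch. The function name is hypothetical, and the per-page descriptor size is an assumption passed in by the caller, since the description above does not state it.

```c
#include <stdint.h>

/* Size of the page descriptor array for a given amount of RAM.
 * struct_page_size is an assumption supplied by the caller. */
uint64_t memmap_bytes(uint64_t ram_bytes, uint64_t page_size,
                      uint64_t struct_page_size)
{
    return (ram_bytes / page_size) * struct_page_size;
}
```

With 64 GiB of RAM, 4 KiB pages and an assumed 64-byte descriptor this comes to 1 GiB, i.e. a quarter of everything below 4G if it were all placed there.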

Signed-off-by: Zou Nan hai <[EMAIL PROTECTED]>
Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>


diff -Nraup a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
--- a/arch/x86/mm/init_64.c 2007-11-06 15:16:12.0 +0800
+++ b/arch/x86/mm/init_64.c 2007-11-06 15:55:50.0 +0800
@@ -448,6 +448,13 @@ void online_page(struct page *page)
num_physpages++;
 }
 
+void * __meminit alloc_bootmem_high_node(pg_data_t *pgdat, unsigned long size,
+unsigned long align)
+{
+return __alloc_bootmem_core(pgdat->bdata, size,
+align, (4UL*1024*1024*1024), 0, 1);
+}
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 /*
  * Memory is added always to NORMAL zone. This means you will never get
diff -Nraup a/include/linux/bootmem.h b/include/linux/bootmem.h
--- a/include/linux/bootmem.h   2007-11-06 16:06:31.0 +0800
+++ b/include/linux/bootmem.h   2007-11-06 15:50:36.0 +0800
@@ -61,6 +61,10 @@ extern void *__alloc_bootmem_core(struct
  unsigned long limit,
  int strict_goal);
 
+extern void *alloc_bootmem_high_node(pg_data_t *pgdat,
+unsigned long size,
+unsigned long align);
+
 #ifndef CONFIG_HAVE_ARCH_BOOTMEM_NODE
 extern void reserve_bootmem(unsigned long addr, unsigned long size);
 #define alloc_bootmem(x) \
diff -Nraup a/mm/bootmem.c b/mm/bootmem.c
--- a/mm/bootmem.c  2007-11-06 16:06:31.0 +0800
+++ b/mm/bootmem.c  2007-11-06 15:49:20.0 +0800
@@ -492,3 +492,11 @@ void * __init __alloc_bootmem_low_node(p
return __alloc_bootmem_core(pgdat->bdata, size, align, goal,
ARCH_LOW_ADDRESS_LIMIT, 0);
 }
+
+__attribute__((weak)) __meminit
+void *alloc_bootmem_high_node(pg_data_t *pgdat, unsigned long size,
+unsigned long align)
+{
+return NULL;
+}
+
diff -Nraup a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
--- a/mm/sparse-vmemmap.c   2007-11-06 15:16:12.0 +0800
+++ b/mm/sparse-vmemmap.c   2007-11-06 16:08:52.0 +0800
@@ -43,9 +43,13 @@ void * __meminit vmemmap_alloc_block(uns
if (page)
return page_address(page);
return NULL;
-   } else
+   } else {
+   void *p = alloc_bootmem_high_node(NODE_DATA(node), size, size);
+   if (p)
+   return p;
return __alloc_bootmem_node(NODE_DATA(node), size, size,
__pa(MAX_DMA_ADDRESS));
+   }
 }
 
 void __meminit vmemmap_verify(pte_t *pte, int node,



Re: writeout stalls in current -git

2007-11-06 Thread Torsten Kaiser

On 11/7/07, David Chinner <[EMAIL PROTECTED]> wrote:
> Ok, so it's not synchronous writes that we are doing - we're just
> submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
> The "synchronous" nature appears to be coming from higher level
> locking when reclaiming inodes (on the flush lock). It appears that
> inode write clustering is failing completely so we are writing the
> same block multiple times i.e. once for each inode in the cluster we
> have to write.

Works for me. The only remaining stalls are sub second and look
completely valid, considering the amount of files being removed.

iostat 10 from this test:
 3  0  0 3500192 332 204956 0 0   105  8512 1809 6473  6 10 83  1
 0  0  0 3500200 332 204576 0 0     0  4367 1355 3712  2  6 92  0
 2  0  0 3504264 332 203528 0 0     0  6805 1912 4967  4  8 88  0
 0  0  0 3511632 332 203528 0 0     0  2843  805 1791  2  4 94  0
 0  0  0 3516852 332 203516 0 0     0  3375  879 2712  3  5 93  0
 0  0  0 3530544 332 202668 0 0   186   776  488 1152  4  2 89  4
 0  0  0 3574788 332 204960 0 0   226   326  358  787  0  1 98  0
 0  0  0 3576820 332 204960 0 0     0   376  332  737  0  0 99  0
 0  0  0 3578432 332 204960 0 0     0   356  293  606  1  1 99  0
 0  0  0 3580192 332 204960 0 0     0   101  104  384  0  0 99  0

I'm pleased to note that this is now much faster again.
Thanks!

Tested-by: Torsten Kaiser <[EMAIL PROTECTED]>

CC's please note: It looks like this was really a different problem
than the 100% iowait that was seen with reiserfs.
Also, the one complete stall I have seen is probably something else.
But I have not been able to reproduce it with -mm and have
never seen it on mainline, so I will just ignore that single event
until I see it again.

Torsten

> ---
>  fs/xfs/xfs_iget.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
> ===
> --- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c 2007-11-02 13:44:46.0 +1100
> +++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c 2007-11-07 13:08:42.534440675 +1100
> @@ -248,7 +248,7 @@ finish_inode:
> icl = NULL;
> if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
> first_index, 1)) {
> -   if ((iq->i_ino & mask) == first_index)
> +   if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
> icl = iq->i_cluster;
> }
>
>
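
The effect of the one-line change quoted above can be illustrated with a toy model. Everything here is an assumption made for illustration: that an inode number is an allocation-group number placed above an AG-relative part, that the relative part is 32 bits wide, and that first_index was computed from the AG-relative number. The helpers are hypothetical, not XFS macros.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_AGINO_BITS 32 /* assumption: width of the AG-relative part */

/* Build a full inode number from AG number and AG-relative number. */
uint64_t model_make_ino(uint32_t agno, uint32_t agino)
{
    return ((uint64_t)agno << MODEL_AGINO_BITS) | agino;
}

/* Strip the AG number again. */
uint32_t model_ino_to_agino(uint64_t ino)
{
    return (uint32_t)(ino & ((1ULL << MODEL_AGINO_BITS) - 1));
}

/* Old comparison: masks the full inode number, AG bits included. */
bool same_cluster_old(uint64_t ino, uint64_t mask, uint64_t first_index)
{
    return (ino & mask) == first_index;
}

/* New comparison: masks only the AG-relative inode number. */
bool same_cluster_new(uint64_t ino, uint64_t mask, uint64_t first_index)
{
    return (model_ino_to_agino(ino) & mask) == first_index;
}
```

Under these assumptions the old comparison only ever matches in allocation group 0; in any other group the AG bits survive the mask, the cluster lookup misses, and each inode ends up writing the shared block on its own.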

Re: [PATCH] x86: unification of cfufreq/Kconfig

2007-11-06 Thread Sam Ravnborg

On Wed, Nov 07, 2007 at 07:02:20AM +0100, Adrian Bunk wrote:
> On Wed, Nov 07, 2007 at 12:01:12AM +0100, Sam Ravnborg wrote:
> >...
> >  config X86_SPEEDSTEP_CENTRINO
> > -   tristate "Intel Enhanced SpeedStep"
> > +   tristate "Intel Enhanced SpeedStep (deprecated)"
> > select CPU_FREQ_TABLE
> > -   select X86_SPEEDSTEP_CENTRINO_TABLE
> > +   select X86_SPEEDSTEP_CENTRINO_TABLE if X86_32
> > +   depends on X86_64 && ACPI_PROCESSOR
> >...
> 
> No.
> 
>   depends on ACPI_PROCESSOR if X86_64

Gives syntax error.

Sam

Re: [patch] ia64: remove per_cpu_offset()

2007-11-06 Thread Luming Yu

NAK for now.

I'm trying to add lockdep, so please don't delete it until it has
been proven to be really useless...
Please don't hurry...

On 11/7/07, Simon Horman <[EMAIL PROTECTED]> wrote:
> per_cpu_offset() was added as part of a lockdep patch,
> "[PATCH] lockdep: add per_cpu_offset()"
> (a875a69f8b00a38b4f40d9632a4fc71a159f0e0d),
> but ia64 doesn't have lockdep, nor does it use per_cpu_offset()
> anywhere else.
>
> This came up because Luming Yu noticed that the ia64 version
> of per_cpu_offset() actually has a syntax error. Amusing, as it seems
> to have been in the tree for months.
>
> > -#define per_cpu_offset(x) (__per_cpu_offset(x))
> > +#define per_cpu_offset(x) (__per_cpu_offset[x])
>
> Dave Miller suggested that rather than fixing the unused code it would be
> better to just remove it altogether.
>
> Signed-off-by: Simon Horman <[EMAIL PROTECTED]>
>
> diff --git a/include/asm-ia64/percpu.h b/include/asm-ia64/percpu.h
> index c4f1e32..2870f8d 100644
> --- a/include/asm-ia64/percpu.h
> +++ b/include/asm-ia64/percpu.h
> @@ -46,7 +46,6 @@
>  #ifdef CONFIG_SMP
>
>  extern unsigned long __per_cpu_offset[NR_CPUS];
> -#define per_cpu_offset(x) (__per_cpu_offset[x])
>
>  /* Equal to __per_cpu_offset[smp_processor_id()], but faster to access: */
>  DECLARE_PER_CPU(unsigned long, local_per_cpu_offset);
>
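
How a broken macro can sit in the tree unnoticed is easy to reproduce in plain C: a macro body is only compiled where the macro is expanded. A small sketch with hypothetical names, modelled on the array form from the quoted fix:

```c
#define MODEL_NR_CPUS 4

/* Hypothetical per-CPU offset table. */
unsigned long model_per_cpu_offset[MODEL_NR_CPUS] = { 0, 4096, 8192, 12288 };

/* Correct form: index the array. Had this been written with
 * parentheses, (model_per_cpu_offset(x)), the file would still build
 * for as long as nothing expanded the macro, because calling an array
 * is only diagnosed at the point of use. */
#define model_cpu_offset(x) (model_per_cpu_offset[x])

/* The first (and only) user of the macro. */
unsigned long model_offset_of_cpu(int cpu)
{
    return model_cpu_offset(cpu);
}
```

That is also why adding a deliberate call is a reasonable way to test whether such a macro has any users.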

Re: [PATCH] fix typo in per_cpu_offset

2007-11-06 Thread Luming Yu

NAK for now.

I'm trying to add lockdep, so please don't delete it until it has
been proven to be really useless...
Please don't hurry...

On 11/7/07, Simon Horman <[EMAIL PROTECTED]> wrote:
> On Tue, Oct 30, 2007 at 05:50:56PM +0900, Simon Horman wrote:
> > On Tue, Oct 30, 2007 at 12:36:22AM -0700, David Miller wrote:
> > > From: Simon Horman <[EMAIL PROTECTED]>
> > > Date: Tue, 30 Oct 2007 16:15:13 +0900
> > >
> > > > Though curiously with my config nothing uses per_cpu_offset()
> > > > (I added a bogus call to produce an error.) Is it actually
> > > > used on ia64?
> > >
> > > It is unused, and in that regard should probably be deleted.
> > >
> > > include/asm-generic/percpu.h defines a seemingly similarly
> > > unused per_cpu_offset() macro define as well
> >
> > It looks like they were both added by "[PATCH] lockdep: add 
> > per_cpu_offset()"
> > (a875a69f8b00a38b4f40d9632a4fc71a159f0e0d)
> >
> > Perhaps they were used at that time?
>
> I looked into this a little further:
>
> I'm pretty much convinced that the asm-ia64 version of per_cpu_offset()
> is unused as ia64 doesn't have lockdep. I will send a patch to get rid
> of it. The generic version might be used on mips, sh or arm with
> CONFIG_SMP, as these architectures have lockdep. I did manage to
> produce a compiler error on mips by removing the asm-generic version of
> per_cpu_offset().
>
> --
> Horms
>   H: http://www.vergenet.net/~horms/
>   W: http://www.valinux.co.jp/en/
>
>

[PATCH] Unionfs: stop using iget() and read_inode()

2007-11-06 Thread Erez Zadok


From: David Howells <[EMAIL PROTECTED]>

Stop UnionFS from using iget() and read_inode().  Replace
unionfs_read_inode() with unionfs_iget(), and call that instead of iget().
unionfs_iget() then uses iget_locked() directly and returns a proper error
code instead of an inode in the event of an error.

unionfs_fill_super() returns any error incurred when getting the root inode
instead of EINVAL.
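
For readers unfamiliar with the error-pointer convention the new function relies on, here is a user-space sketch of the idea. The `model_` names are hypothetical and the helpers are simplified imitations, not the kernel's ERR_PTR(), IS_ERR() and PTR_ERR() definitions.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
void *model_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Recover the errno value from an error pointer. */
long model_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* Error pointers occupy the top MODEL_MAX_ERRNO addresses. */
int model_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)(-MODEL_MAX_ERRNO);
}

struct model_inode {
    unsigned long ino;
};

/* Hypothetical lookup: returns an inode or an encoded errno, never NULL,
 * so the caller can propagate the real reason for the failure. */
struct model_inode *model_iget(unsigned long ino, int simulate_oom)
{
    struct model_inode *inode;

    if (simulate_oom)
        return model_err_ptr(-ENOMEM);

    inode = malloc(sizeof(*inode));
    if (!inode)
        return model_err_ptr(-ENOMEM);
    inode->ino = ino;
    return inode;
}
```

A caller then tests the result with the is-error helper and passes the decoded value up, which is the shape of the change to unionfs_interpose() in the diff below.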

Signed-off-by: David Howells <[EMAIL PROTECTED]>
Signed-off-by: Erez Zadok <[EMAIL PROTECTED]>

diff --git a/fs/unionfs/main.c b/fs/unionfs/main.c
index ffb0da1..89e3b31 100644
--- a/fs/unionfs/main.c
+++ b/fs/unionfs/main.c
@@ -104,9 +104,8 @@ struct dentry *unionfs_interpose(struct dentry *dentry, struct super_block *sb,
BUG_ON(is_negative_dentry);
 
/*
-* We allocate our new inode below, by calling iget.
-* iget will call our read_inode which will initialize some
-* of the new inode's fields
+* We allocate our new inode below by calling unionfs_iget,
+* which will initialize some of the new inode's fields
 */
 
/*
@@ -128,9 +127,9 @@ struct dentry *unionfs_interpose(struct dentry *dentry, struct super_block *sb,
}
} else {
/* get unique inode number for unionfs */
-   inode = iget(sb, iunique(sb, UNIONFS_ROOT_INO));
-   if (!inode) {
-   err = -EACCES;
+   inode = unionfs_iget(sb, iunique(sb, UNIONFS_ROOT_INO));
+   if (IS_ERR(inode)) {
+   err = PTR_ERR(inode);
goto out;
}
if (atomic_read(&inode->i_count) > 1)
diff --git a/fs/unionfs/super.c b/fs/unionfs/super.c
index 7d28045..18e506b 100644
--- a/fs/unionfs/super.c
+++ b/fs/unionfs/super.c
@@ -24,13 +24,21 @@
  */
 static struct kmem_cache *unionfs_inode_cachep;
 
-static void unionfs_read_inode(struct inode *inode)
+struct inode *unionfs_iget(struct super_block *sb, unsigned long ino)
 {
 	int size;
-	struct unionfs_inode_info *info = UNIONFS_I(inode);
+	struct unionfs_inode_info *info;
+	struct inode *inode;
 
-	unionfs_read_lock(inode->i_sb);
+	inode = iget_locked(sb, ino);
+	if (!inode)
+		return ERR_PTR(-ENOMEM);
+	if (!(inode->i_state & I_NEW))
+		return inode;
 
+	unionfs_read_lock(sb);
+
+	info = UNIONFS_I(inode);
 	memset(info, 0, offsetof(struct unionfs_inode_info, vfs_inode));
 	info->bstart = -1;
 	info->bend = -1;
@@ -46,7 +54,9 @@ static void unionfs_read_inode(struct inode *inode)
 	if (unlikely(!info->lower_inodes)) {
 		printk(KERN_CRIT "unionfs: no kernel memory when allocating "
 		       "lower-pointer array!\n");
-		BUG();
+		iget_failed(inode);
+		unionfs_read_unlock(sb);
+		return ERR_PTR(-ENOMEM);
 	}
 
 	inode->i_version++;
@@ -55,7 +65,9 @@ static void unionfs_read_inode(struct inode *inode)
 
 	inode->i_mapping->a_ops = &unionfs_aops;
 
-	unionfs_read_unlock(inode->i_sb);
+	unlock_new_inode(inode);
+	unionfs_read_unlock(sb);
+	return inode;
 }
 
 /*
@@ -1002,7 +1014,6 @@ out:
 }
 
 struct super_operations unionfs_sops = {
-	.read_inode	= unionfs_read_inode,
 	.delete_inode	= unionfs_delete_inode,
 	.put_super	= unionfs_put_super,
 	.statfs		= unionfs_statfs,
diff --git a/fs/unionfs/union.h b/fs/unionfs/union.h
index 9b530ec..f5afae0 100644
--- a/fs/unionfs/union.h
+++ b/fs/unionfs/union.h
@@ -356,6 +356,7 @@ extern int unionfs_fsync(struct file *file, struct dentry *dentry,
 extern int unionfs_fasync(int fd, struct file *file, int flag);
 
 /* Inode operations */
+extern struct inode *unionfs_iget(struct super_block *sb, unsigned long ino);
 extern int unionfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 			  struct inode *new_dir, struct dentry *new_dentry);
 extern int unionfs_unlink(struct inode *dir, struct dentry *dentry);
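
For readers unfamiliar with the convention this patch relies on, here is a
minimal userspace sketch of how ERR_PTR(), IS_ERR() and PTR_ERR() let a
function like unionfs_iget() return either a valid pointer or a negative
errno through one return value.  The helpers below are simplified stand-ins
written for illustration (the kernel's own definitions are in
include/linux/err.h), and demo_inode, demo_iget() and demo_interpose() are
hypothetical names, not Unionfs code.

```c
#include <errno.h>
#include <stdint.h>

/* Assumption: like the kernel, reserve the top 4095 addresses for errnos. */
#define MAX_ERRNO 4095

/* Simplified stand-ins for the kernel helpers of the same names. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* True only for pointers in the last MAX_ERRNO bytes of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical inode type, only for this sketch. */
struct demo_inode {
	unsigned long ino;
};

/* Hypothetical getter: returns an inode, or an encoded errno on failure. */
static struct demo_inode *demo_iget(unsigned long ino, int simulate_oom)
{
	static struct demo_inode slot;

	if (simulate_oom)
		return ERR_PTR(-ENOMEM);
	slot.ino = ino;
	return &slot;
}

/* Caller side: propagate the real error instead of a hard-coded one. */
static int demo_interpose(unsigned long ino, int simulate_oom)
{
	struct demo_inode *inode = demo_iget(ino, simulate_oom);

	if (IS_ERR(inode))
		return (int)PTR_ERR(inode);
	return 0;
}
```

Calling demo_interpose(1, 0) takes the success path, while
demo_interpose(1, 1) hands back -ENOMEM unchanged, which mirrors how the
caller in main.c now reports PTR_ERR(inode) rather than a fixed -EACCES.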

[patch] ia64: remove per_cpu_offset()

2007-11-06 Thread Simon Horman

per_cpu_offset() was added as part of a lockdep patch,
"[PATCH] lockdep: add per_cpu_offset()"
(a875a69f8b00a38b4f40d9632a4fc71a159f0e0d),
but ia64 doesn't have lockdep, nor does it use per_cpu_offset()
anywhere else.

This came up because Yu Lumming noticed that the ia64 version
of per_cpu_offset() actually has a syntax error. Amusing, as it seems
to have been in the tree for months.

> -#define per_cpu_offset(x) (__per_cpu_offset(x))
> +#define per_cpu_offset(x) (__per_cpu_offset[x])

Dave Miller suggested that rather than fixing the unused code it would be
better to just remove it altogether.

Signed-off-by: Simon Horman <[EMAIL PROTECTED]>

diff --git a/include/asm-ia64/percpu.h b/include/asm-ia64/percpu.h
index c4f1e32..2870f8d 100644
--- a/include/asm-ia64/percpu.h
+++ b/include/asm-ia64/percpu.h
@@ -46,7 +46,6 @@
 #ifdef CONFIG_SMP
 
 extern unsigned long __per_cpu_offset[NR_CPUS];
-#define per_cpu_offset(x) (__per_cpu_offset[x])
 
 /* Equal to __per_cpu_offset[smp_processor_id()], but faster to access: */
 DECLARE_PER_CPU(unsigned long, local_per_cpu_offset);
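
As a side note on why the broken macro mentioned in the description could
sit in the tree unnoticed: the body of a function-like macro is only
compiled when the macro is expanded.  The sketch below is a userspace
illustration under that assumption; demo_per_cpu_offset and the demo_*
macros are hypothetical names, not the ia64 code.

```c
/* Hypothetical per-CPU offset table for four CPUs. */
static unsigned long demo_per_cpu_offset[4] = { 0, 64, 128, 192 };

/*
 * Broken form: calls the array as if it were a function.  Because nothing
 * expands this macro, the preprocessor never hands the bad expression to
 * the compiler and the file still builds.
 */
#define demo_offset_broken(x) (demo_per_cpu_offset(x))

/* Fixed form: indexes the array. */
#define demo_offset_fixed(x) (demo_per_cpu_offset[x])

/* The only user in this sketch, and it uses the fixed form. */
static unsigned long demo_lookup_offset(unsigned int cpu)
{
	return demo_offset_fixed(cpu);
}
```

Adding any call to demo_offset_broken() makes the build fail, which is the
same trick as the "bogus call" used earlier in the thread to check whether
the macro was used at all.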

Re: 2.6.24-rc1 - Regularly getting processes stuck in D state on startup

2007-11-06 Thread Fengguang Wu

On Wed, Nov 07, 2007 at 02:26:09PM +1100, Stephen Rothwell wrote:
> On Wed, 7 Nov 2007 14:17:17 +1100 Stephen Rothwell <[EMAIL PROTECTED]> wrote:
> >
> > On Tue, Nov 06, 2007 at 04:00:06PM +0800, Fengguang Wu wrote:
> > > 
> > > Could you try with the attached 4 patches? Two of them are expected to
> > > fix your problem, another two are debugging ones(in case the problem
> > > persists).
> > 
> > Applying these four patches fixes it for me.  Obviously the reiserfs patch
> > was not relevant in  my case (only using ext3).
> 
> I am now running on a kernel with just the
> mm-speed-up-writeback-ramp-up-on-clean-systems.patch applied and I am
> seeing no hangs.

Thank you (including David :-)) for the confirmation.

Andrew: so mm-speed-up-writeback-ramp-up-on-clean-systems.patch is a
safe and working patch ;-)

Fengguang


Re: [PATCH] fix typo in per_cpu_offset

2007-11-06 Thread Simon Horman

On Tue, Oct 30, 2007 at 05:50:56PM +0900, Simon Horman wrote:
> On Tue, Oct 30, 2007 at 12:36:22AM -0700, David Miller wrote:
> > From: Simon Horman <[EMAIL PROTECTED]>
> > Date: Tue, 30 Oct 2007 16:15:13 +0900
> > 
> > > Though curiously with my config nothing uses per_cpu_offset()
> > > (I added a bogus call to produce an error.) Is it actually
> > > used on ia64?
> > 
> > It is unused, and in that regard should probably be deleted.
> > 
> > include/asm-generic/percpu.h defines a seemingly similarly
> > unused per_cpu_offset() macro as well
> 
> It looks like they were both added by "[PATCH] lockdep: add per_cpu_offset()"
> (a875a69f8b00a38b4f40d9632a4fc71a159f0e0d)
> 
> Perhaps they were used at that time?

I looked into this a little further:

I'm pretty much convinced that the asm-ia64 version of per_cpu_offset()
is unused, as ia64 doesn't have lockdep. I will send a patch to get rid
of it. The generic version might be used on mips, sh or arm with
CONFIG_SMP, as these architectures have lockdep. I did manage to
produce a compiler error on mips by removing the asm-generic version of
per_cpu_offset().

-- 
Horms
  H: http://www.vergenet.net/~horms/
  W: http://www.valinux.co.jp/en/


Re: [PATCH] IGET: Stop UnionFS from using iget() and read_inode()

2007-11-06 Thread Erez Zadok

In message <[EMAIL PROTECTED]>, David Howells writes:
> From: David Howells <[EMAIL PROTECTED]>
> 
> Stop the UnionFS filesystem from using iget() and read_inode().  Replace
> unionfs_read_inode() with unionfs_iget(), and call that instead of iget().
> unionfs_iget() then uses iget_locked() directly and returns a proper error
> code instead of an inode in the event of an error.
> 
> unionfs_fill_super() returns any error incurred when getting the root inode
> instead of EINVAL.
> 
> Signed-off-by: David Howells <[EMAIL PROTECTED]>
[...]

Thanks.  I tested this code and it passed all my tests.  I'll shortly submit
a slightly revised patch which applies cleanly against the unionfs code in
-mm.

Cheers,
Erez.

Re: MemoryStick / Pro support

2007-11-06 Thread Pierre Ossman

On Tue, 6 Nov 2007 22:15:37 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

> > On Fri, 2 Nov 2007 06:23:48 -0700 (PDT) Alex Dubov <[EMAIL PROTECTED]> wrote:
> > 
> > I also wonder, where do I send the patches if nobody currently maintains
> > this thing?
> > 
> 
> Me, Pierre, lkml?

I'm not sure sending it through me is a good idea. I'm having trouble finding
time for my current projects, so I will be a bottleneck.

That's not a "no", but you have been warned. ;)

Rgds
Pierre


Re: Use of virtio device IDs

2007-11-06 Thread Anthony Liguori


Rusty Russell wrote:
> On Wednesday 07 November 2007 16:40:13 Avi Kivity wrote:
> > Gregory Haskins wrote:
> > >  but FWIW: This is a major motivation for the reason that the
> > > IOQ stuff I posted a while back used strings for device identification
> > > instead of a fixed length, centrally managed namespace like PCI
> > > vendor/dev-id.  Then you can just name your device something reasonably
> > > unique (e.g. "qumranet::veth", or "ibm-pvirt-clock").
> >
> > I dislike strings.  They make it look as if you have a nice extensible
> > interface, where in reality you have a poorly documented interface which
> > leads to poor interoperability.
>
> Yes, you end up with exactly names like "qumranet::veth"
> and "ibm-pvirt-clock".  I would recommend looking very hard at /proc, Open
> Firmware on a modern system, or the Xen store, to see what a lack of
> limitation can do to you :)
>
> > We will support non-pci for s390, but in order to support Windows and
> > older Linux PCI is necessary.
>
> The aim is that PCI support is clean, but that we're not really tied to PCI.
> I think we're getting closer with the recent config changes.

Yes, my main desire was to ensure that we had a clean PCI ABI that would
be natural to implement on a platform like Windows.  I think with the
recent config_ops refactoring, we can now do that.

Regards,

Anthony Liguori

> Cheers,
> Rusty.


Re: Massive slowdown when re-querying large nfs dir

2007-11-06 Thread Andrew Morton

> On Tue, 6 Nov 2007 14:28:11 +0300 Al Boldi <[EMAIL PROTECTED]> wrote:
> Al Boldi wrote:
> > There is a massive (3-18x) slowdown when re-querying a large nfs dir (2k+
> > entries) using a simple ls -l.
> >
> > On 2.6.23 client and server running userland rpc.nfs.V2:
> > first  try: time -p ls -l <2k+ entry dir>  in ~2.5sec
> > more tries: time -p ls -l <2k+ entry dir>  in ~8sec
> >
> > first  try: time -p ls -l <5k+ entry dir>  in ~9sec
> > more tries: time -p ls -l <5k+ entry dir>  in ~180sec
> >
> > On 2.6.23 client and 2.4.31 server running userland rpc.nfs.V2:
> > first  try: time -p ls -l <2k+ entry dir>  in ~2.5sec
> > more tries: time -p ls -l <2k+ entry dir>  in ~7sec
> >
> > first  try: time -p ls -l <5k+ entry dir>  in ~8sec
> > more tries: time -p ls -l <5k+ entry dir>  in ~43sec
> >
> > Remounting the nfs-dir on the client resets the problem.
> >
> > Any ideas?
> 
> Ok, I played some more with this, and it turns out that nfsV3 is a lot 
> faster.  But, this does not explain why the 2.4.31 kernel is still over 
> 4-times faster than 2.6.23.
> 
> Can anybody explain what's going on?
> 

Sure, Neil can! ;)

Re: [RFC/PATCH] Optimize zone allocator synchronization

2007-11-06 Thread Andrew Morton

> On Tue, 06 Nov 2007 05:08:07 -0500 Chris Snook <[EMAIL PROTECTED]> wrote:
> Don Porter wrote:
> > From: Donald E. Porter <[EMAIL PROTECTED]>
> > 
> > In the bulk page allocation/free routines in mm/page_alloc.c, the zone
> > lock is held across all iterations.  For certain parallel workloads, I
> > have found that releasing and reacquiring the lock for each iteration
> > yields better performance, especially at higher CPU counts.  For
> > instance, kernel compilation is sped up by 5% on an 8 CPU test
> > machine.  In most cases, there is no significant effect on performance
> > (although the effect tends to be slightly positive).  This seems quite
> > reasonable for the very small scope of the change.
> > 
> > My intuition is that this patch prevents smaller requests from waiting
> > on larger ones.  While grabbing and releasing the lock within the loop
> > adds a few instructions, it can lower the latency for a particular
> > thread's allocation which is often on the thread's critical path.
> > Lowering the average latency for allocation can increase system throughput.
> > 
> > More detailed information, including data from the tests I ran to
> > validate this change are available at
> > http://www.cs.utexas.edu/~porterde/kernel-patch.html .
> > 
> > Thanks in advance for your consideration and feedback.
> 
> That's an interesting insight.  My intuition is that Nick Piggin's
> recently-posted ticket spinlocks patches[1] will reduce the need for this
> patch, though it may be useful to have both.  Can you benchmark again with
> only ticket spinlocks, and with ticket spinlocks + this patch?  You'll
> probably want to use 2.6.24-rc1 as your baseline, due to the x86
> architecture merge.

The patch as-is would hurt low cpu-count workloads, and single-threaded
workloads: it is simply taking that lock a lot more times.  This will be
particularly noticeable on things like older P4 machines which have peculiarly
expensive locked operations.

A test to run would be, on ext2:

time (dd if=/dev/zero of=foo bs=16k count=2048 ; rm foo)

(might need to increase /proc/sys/vm/dirty* to avoid any writeback)


I wonder if we can do something like:

	if (lock_is_contended(lock)) {
		spin_unlock(lock);
		spin_lock(lock);	/* To the back of the queue */
	}

(in conjunction with the ticket locks) so that we only do the expensive
buslocked operation when we actually have a need to do so.

(The above should be wrapped in some new spinlock interface function which
is probably a no-op on architectures which cannot implement it usefully)
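
A rough userspace sketch of the idea above, assuming a simplified ticket
lock built on C11 atomics.  This is not the kernel's spinlock
implementation, and struct demo_ticket_lock and the demo_* functions are
hypothetical names.  The point it illustrates is that with tickets the
holder can tell cheaply whether anyone is queued behind it, and only then
pay for the unlock/lock pair.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified ticket lock: "next" hands out tickets, "owner" is being served. */
struct demo_ticket_lock {
	atomic_uint next;
	atomic_uint owner;
};

static void demo_lock(struct demo_ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next, 1);

	while (atomic_load(&l->owner) != me)
		;	/* spin until our ticket comes up */
}

static void demo_unlock(struct demo_ticket_lock *l)
{
	atomic_fetch_add(&l->owner, 1);
}

/*
 * Called by the lock holder.  The holder accounts for one outstanding
 * ticket, so a difference above one means at least one waiter is queued.
 */
static bool demo_lock_is_contended(struct demo_ticket_lock *l)
{
	return atomic_load(&l->next) - atomic_load(&l->owner) > 1;
}

/*
 * Give waiters a turn only when there are any: drop the lock and rejoin
 * at the back of the queue.  Returns true if the lock was released.
 */
static bool demo_lock_break(struct demo_ticket_lock *l)
{
	if (!demo_lock_is_contended(l))
		return false;
	demo_unlock(l);
	demo_lock(l);
	return true;
}
```

In the bulk allocation loop this would mean calling demo_lock_break() once
per iteration: the uncontended case costs two loads and no locked
operation, which addresses the concern about low cpu-count workloads.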

Re: 2.6.23.1: Random hangs during boot with "tsc" clocksource

2007-11-06 Thread Andrew Morton

> On Fri, 02 Nov 2007 12:10:00 -0500 Jordan Russell <[EMAIL PROTECTED]> wrote:
> Hi,
> 
> With 2.6.23.1 (stock and Fedora), roughly 50% of the time my system
> hangs indefinitely during the kernel boot process. The hangs occur in
> places where normally a brief delay is seen, such as when detecting
> serial ports, ATA devices, and USB hubs. SysRq+W, when it works, shows
> tasks stuck inside schedule_timeout and lock_timer_base.
> 
> I've found, however, that if I override the default "tsc" clocksource
> with "acpi_pm" by adding "clocksource=acpi_pm" to the kernel command
> line, the system comes up fine every time. (Zero hangs in ~50 tries.)
> 
> I never had to do this with stock kernels based on 2.6.22.10 and
> earlier, which also defaulted to "tsc" on my system.
> 
> I came across this thread:
>   http://lkml.org/lkml/2007/10/16/339
> but I'm not sure if it's the same issue. I've never seen a "Clocksource
> tsc unstable" message, and the kernel doesn't appear to unwedge itself
> after 5 minutes (I waited 20 minutes once).
> 
> Any ideas?
> 
> Thanks,
> Jordan Russell
> 
> 
> 
> Motherboard: Supermicro X6DLP-EG2
> CPU: Intel Xeon LV (essentially a rebranded mobile Core Duo)
> 
> # cat /proc/cpuinfo
> processor   : 0
> vendor_id   : GenuineIntel
> cpu family  : 6
> model   : 14
> model name  : Intel(R) Xeon(TM) CPU @ 1.66GHz
> stepping: 8
> cpu MHz : 1666.846
> cache size  : 2048 KB
> physical id : 0
> siblings: 2
> core id : 0
> cpu cores   : 2
> fdiv_bug: no
> hlt_bug : no
> f00f_bug: no
> coma_bug: no
> fpu : yes
> fpu_exception   : yes
> cpuid level : 10
> wp  : yes
> flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx
> constant_tsc arch_perfmon bts pni monitor vmx est tm2 xtpr
> bogomips: 3335.39
> clflush size: 64
> 
> processor   : 1
> vendor_id   : GenuineIntel
> cpu family  : 6
> model   : 14
> model name  : Intel(R) Xeon(TM) CPU @ 1.66GHz
> stepping: 8
> cpu MHz : 1666.846
> cache size  : 2048 KB
> physical id : 0
> siblings: 2
> core id : 1
> cpu cores   : 2
> fdiv_bug: no
> hlt_bug : no
> f00f_bug: no
> coma_bug: no
> fpu : yes
> fpu_exception   : yes
> cpuid level : 10
> wp  : yes
> flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx
> constant_tsc arch_perfmon bts pni monitor vmx est tm2 xtpr
> bogomips: .68
> clflush size: 64
> 
> 
> 
> dmesg (from a non-hanging boot without "clocksource=acpi_pm"):
> 
> Linux version 2.6.23.1x ([EMAIL PROTECTED]) (gcc version 4.1.2 20070925 (Red
> Hat 4.1.2-27)) #3 SMP Tue Oct 30 18:21:58 CDT 2007
> BIOS-provided physical RAM map:
>  BIOS-e820:  - 0009fc00 (usable)
>  BIOS-e820: 0009fc00 - 000a (reserved)
>  BIOS-e820: 000e8000 - 0010 (reserved)
>  BIOS-e820: 0010 - 3ffe (usable)
>  BIOS-e820: 3ffe - 3ffef000 (ACPI data)
>  BIOS-e820: 3ffef000 - 36d0 (ACPI NVS)
>  BIOS-e820: 36d0 - 4000 (reserved)
>  BIOS-e820: fec0 - fec86000 (reserved)
>  BIOS-e820: fee0 - fee01000 (reserved)
>  BIOS-e820: ffb0 - 0001 (reserved)
> 127MB HIGHMEM available.
> 896MB LOWMEM available.
> found SMP MP-table at 000ff780
> NX (Execute Disable) protection: active
> Entering add_active_range(0, 0, 262112) 0 entries of 256 used
> Zone PFN ranges:
>   DMA 0 -> 4096
>   Normal   4096 ->   229376
>   HighMem229376 ->   262112
> Movable zone start PFN for each node
> early_node_map[1] active PFN ranges
> 0:0 ->   262112
> On node 0 totalpages: 262112
>   DMA zone: 32 pages used for memmap
>   DMA zone: 0 pages reserved
>   DMA zone: 4064 pages, LIFO batch:0
>   Normal zone: 1760 pages used for memmap
>   Normal zone: 223520 pages, LIFO batch:31
>   HighMem zone: 255 pages used for memmap
>   HighMem zone: 32481 pages, LIFO batch:7
>   Movable zone: 0 pages used for memmap
> DMI 2.3 present.
> Using APIC driver default
> ACPI: RSDP 000FA930, 0014 (r0 ACPIAM)
> ACPI: RSDT 3FFE, 0034 (r1 A M I  OEMRSDT   4000605 MSFT   97)
> ACPI: FACP 3FFE0200, 0084 (r2 A M I  OEMFACP   4000605 MSFT   97)
> ACPI: DSDT 3FFE0460, 39E0 (r1  DLP4G DLP4G0077 INTL  2002026)
> ACPI: FACS 3FFEF000, 0040
> ACPI: APIC 3FFE0390, 0078 (r1 A M I  OEMAPIC   4000605 MSFT   97)
> ACPI: OEMB 3FFEF040, 0040 (r1 A M I  AMI_OEM   40

Re: question about sata-error on boot.

2007-11-06 Thread Andrew Morton

> On Fri, 2 Nov 2007 19:34:20 +0100 "Hemmann, Volker Armin" <[EMAIL PROTECTED]> 
> wrote:
> Hi,

(cc linux-ide)

> for some time (and I can't say for how long, but the board is less than a
> month old) I get this error on boot:
> 
> [   42.116273] ahci :00:0a.0: version 2.2
> [   42.116482] ACPI: PCI Interrupt Link [LSA0] enabled at IRQ 23
> [   42.116653] ACPI: PCI Interrupt :00:0a.0[A] -> Link [LSA0] -> GSI 23 
> (level, low) -> IRQ 23
> [   43.119478] ahci :00:0a.0: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf 
> impl IDE mode
> [   43.119778] ahci :00:0a.0: flags: 64bit led clo pmp pio
> [   43.119943] PCI: Setting latency timer of device :00:0a.0 to 64
> [   43.120149] scsi0 : ahci
> [   43.120365] scsi1 : ahci
> [   43.120556] scsi2 : ahci
> [   43.120741] scsi3 : ahci
> [   43.120927] ata1: SATA max UDMA/133 cmd 0xc2014100 ctl 
> 0x bmdma 0x irq 315
> [   43.121227] ata2: SATA max UDMA/133 cmd 0xc2014180 ctl 
> 0x bmdma 0x irq 315
> [   43.121526] ata3: SATA max UDMA/133 cmd 0xc2014200 ctl 
> 0x bmdma 0x irq 315
> [   43.121826] ata4: SATA max UDMA/133 cmd 0xc2014280 ctl 
> 0x bmdma 0x irq 315
> [   43.934296] ata1: softreset failed (1st FIS failed)
> [   43.934461] ata1: reset failed (errno=-5), retrying in 10 secs
> [   53.885194] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [   53.885890] ata1.00: ATA-7: WDC WD1600JS-00MHB1, 10.02E01, max UDMA/133
> [   53.886056] ata1.00: 312581808 sectors, multi 16: LBA48
> [   53.886804] ata1.00: configured for UDMA/133
> [   54.201147] ata2: SATA link down (SStatus 0 SControl 300)
> [   54.517101] ata3: SATA link down (SStatus 0 SControl 300)
> [   54.833055] ata4: SATA link down (SStatus 0 SControl 300)
> 
> The board has four ports and I use the first one. After that, the computer 
> boots and works fine. Harddisk-speed is normal. Kernel is 2.6.22.9 with 
> cfs&reiser4 patches.
> 
> Is this something to worry about?
> 
> Following is lspci -v and dmesg, config is attached. 
> lspci -v
> 00:00.0 RAM memory: nVidia Corporation MCP65 Memory Controller (rev a1)
> Subsystem: ASRock Incorporation Unknown device 0444
> Flags: bus master, 66MHz, fast devsel, latency 0
> Capabilities: [44] HyperTransport: Slave or Primary Interface
> Capabilities: [dc] HyperTransport: MSI Mapping Enable+ Fixed-
> 
> 00:01.0 ISA bridge: nVidia Corporation MCP65 LPC Bridge (rev a2)
> Subsystem: ASRock Incorporation Unknown device 0441
> Flags: bus master, 66MHz, fast devsel, latency 0
> I/O ports at 2f00 [size=256]
> 
> 00:01.1 SMBus: nVidia Corporation MCP65 SMBus (rev a1)
> Subsystem: ASRock Incorporation Unknown device 0446
> Flags: 66MHz, fast devsel, IRQ 11
> I/O ports at ac00 [size=64]
> I/O ports at 2d00 [size=64]
> I/O ports at 2e00 [size=64]
> Capabilities: [44] Power Management version 2
> 
> 00:01.2 RAM memory: nVidia Corporation MCP65 Memory Controller (rev a1)
> Subsystem: ASRock Incorporation Unknown device 0446
> Flags: 66MHz, fast devsel
> 
> 00:02.0 USB Controller: nVidia Corporation MCP65 USB Controller (rev a1) 
> (prog-if 10 [OHCI])
> Subsystem: ASRock Incorporation Unknown device 0454
> Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 21
> Memory at f9dff000 (32-bit, non-prefetchable) [size=4K]
> Capabilities: [44] Power Management version 2
> 
> 00:02.1 USB Controller: nVidia Corporation MCP65 USB Controller (rev a1) 
> (prog-if 20 [EHCI])
> Subsystem: ASRock Incorporation Unknown device 0455
> Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 22
> Memory at f9dfec00 (32-bit, non-prefetchable) [size=256]
> Capabilities: [44] Debug port
> Capabilities: [80] Power Management version 2
> 
> 00:08.0 PCI bridge: nVidia Corporation MCP65 PCI bridge (rev a1) (prog-if 01 
> [Subtractive decode])
> Flags: bus master, 66MHz, fast devsel, latency 0
> Bus: primary=00, secondary=02, subordinate=02, sec-latency=128
> I/O behind bridge: c000-dfff
> Memory behind bridge: f9f0-f9ff
> Prefetchable memory behind bridge: 8800-880f
> Capabilities: [b8] Subsystem: ASRock Incorporation Unknown device 0449
> Capabilities: [8c] HyperTransport: MSI Mapping Enable- Fixed-
> 
> 00:09.0 IDE interface: nVidia Corporation MCP65 IDE (rev a1) (prog-if 8a 
> [Master SecP PriP])
> Subsystem: ASRock Incorporation Unknown device 0448
> Flags: bus master, 66MHz, fast devsel, latency 0
> [virtual] Memory at 01f0 (32-bit, non-prefetchable) [disabled] 
> [size=8]
> [virtual] Memory at 03f0 (type 3, non-prefetchable) [disabled] 
> [size=1]
> [virtual] Memory at 0170 (32-bit, non-prefetchable

Re: SC1200 failure in 2.6.23 and 2.6.24-rc1-git10

2007-11-06 Thread Andrew Morton

> On Thu, 1 Nov 2007 23:30:13 +0200 "Denys" <[EMAIL PROTECTED]> wrote:
> Finally i got full DMESG with 1GB card till end. Seems not readable too.
> 
> Linux version 2.6.24-rc1-git10-embedded ([EMAIL PROTECTED]) (gcc 
> version 4.1.2 (Gentoo 4.1.2 p1.0.1)) #1 Thu Nov 1 23:12:53 EET 2007
> BIOS-provided physical RAM map:
>  BIOS-e801:  - 0009f000 (usable)
>  BIOS-e801: 0010 - 0400 (usable)
> 64MB LOWMEM available.
> Zone PFN ranges:
>   DMA 0 -> 4096
>   Normal   4096 ->16384
> Movable zone start PFN for each node
> early_node_map[1] active PFN ranges
> 0:0 ->16384
> DMI not present or invalid.
> Allocating PCI resources starting at 1000 (gap: 0400:fc00)
> Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 16256
> Kernel command line: console=ttyS0,38400n8
> Initializing CPU#0
> PID hash table entries: 256 (order: 8, 1024 bytes)
> Detected 266.627 MHz processor.
> Console: colour dummy device 80x25
> console [ttyS0] enabled
> Dentry cache hash table entries: 8192 (order: 3, 32768 bytes)
> Inode-cache hash table entries: 4096 (order: 2, 16384 bytes)
> Memory: 62836k/65536k available (1020k kernel code, 2292k reserved, 317k 
> data, 112k init, 0k highmem)
> virtual kernel memory layout:
> fixmap  : 0xb000 - 0xf000   (  16 kB)
> vmalloc : 0xc480 - 0x9000   ( 951 MB)
> lowmem  : 0xc000 - 0xc400   (  64 MB)
>   .init : 0xc0252000 - 0xc026e000   ( 112 kB)
>   .data : 0xc01ff111 - 0xc024e6f4   ( 317 kB)
>   .text : 0xc010 - 0xc01ff111   (1020 kB)
> Checking if this processor honours the WP bit even in supervisor mode... Ok.
> SLUB: Genslabs=11, HWalign=32, Order=0-1, MinObjects=4, CPUs=1, Nodes=1
> Calibrating delay using timer specific routine.. 534.41 BogoMIPS (lpj=1068836)
> Mount-cache hash table entries: 512
> Compat vDSO mapped to e000.
> CPU: NSC Unknown stepping 01
> Checking 'hlt' instruction... OK.
> Freeing SMP alternatives: 0k freed
> net_namespace: 64 bytes
> NET: Registered protocol family 16
> PCI: PCI BIOS revision 2.10 entry at 0xfc3ad, last bus=0
> PCI: Using configuration type 1
> Setting up standard PCI resources
> SCSI subsystem initialized
> PCI: Probing PCI hardware
> PCI: Device :00:12.5 not found by BIOS
> Time: tsc clocksource has been installed.
> NET: Registered protocol family 2
> IP route cache hash table entries: 1024 (order: 0, 4096 bytes)
> TCP established hash table entries: 2048 (order: 2, 16384 bytes)
> TCP bind hash table entries: 2048 (order: 1, 8192 bytes)
> TCP: Hash tables configured (established 2048 bind 2048)
> TCP reno registered
> scx200: NatSemi SCx200 Driver
> scx200: GPIO base 0xf400
> scx200: Configuration Block base 0x9000
> io scheduler noop registered (default)
> Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
> serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a NS16550A
> serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a NS16550A
> natsemi dp8381x driver, version 2.1, Sept 11, 2006
>   originally by Donald Becker <[EMAIL PROTECTED]>
>   2.4.x kernel port by Jeff Garzik, Tjeerd Mulder
> natsemi eth0: NatSemi DP8381[56] at 0x8000 (:00:0e.0), 
> 00:0d:b9:00:8a:30, IRQ 10, port TP.
> scsi0 : sc1200
> scsi1 : sc1200
> ata1: PATA max UDMA/33 cmd 0x1f0 ctl 0x3f6 bmdma 0xfc00 irq 14
> ata2: DUMMY
> ata1.00: CFA: SanDisk SDCFH-1024, HDX 3.07, max MWDMA2
> ata1.00: 2001888 sectors, multi 0: LBA
> ata1.00: configured for MWDMA2
> scsi 0:0:0:0: Direct-Access ATA  SanDisk SDCFH-10 HDX  PQ: 0 ANSI: 5
> sd 0:0:0:0: [sda] 2001888 512-byte hardware sectors (1025 MB)
> sd 0:0:0:0: [sda] Write Protect is off
> sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support 
> DPO or FUA
> sd 0:0:0:0: [sda] 2001888 512-byte hardware sectors (1025 MB)
> sd 0:0:0:0: [sda] Write Protect is off
> sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support 
> DPO or FUA
>  sda:<4>Clocksource tsc unstable (delta = -334501841 ns)
> Time: pit clocksource has been installed.
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
>  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata1.00: status: { DRDY }
> ata1: soft resetting link
> ata1.00: configured for MWDMA2
> ata1: EH complete
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
>  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata1.00: status: { DRDY }
> ata1: soft resetting link
> ata1.00: configured for MWDMA2
> ata1: EH complete
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
>  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata1.00: status: { DRDY }
>

Re: MemoryStick / Pro support

2007-11-06 Thread Andrew Morton

> On Fri, 2 Nov 2007 06:23:48 -0700 (PDT) Alex Dubov <[EMAIL PROTECTED]> wrote:
> After a much longer, than expected, time I managed to implement a support for 
> MemoryStick (read-only currently, as there's still a subtle data corruption 
> bug with writes) and MemoryStick Pro cards. The implementation follows the 
> MMC driver model (there exist MSIO cards, but none are supported at the 
> moment). The MS Pro support appears stable from what I can learn from user 
> reports. Nevertheless, I've implemented a couple of diagnostics files in the 
> "sys" filesystem, as well as low level format facility for legacy MS cards.
> 
> Currently only TI Flashmedia adapters are supported, but I'm working on a 
> JMicron JMB38x adapter support and I know for sure that it'll be easy to 
> support a Winbond 528 adapter, as I used its GPLed driver as a reference for  
> a more generic implementation.
> 
> I would like to get an advice on the way to arrange the files in the kernel 
> tree. My current idea is:
> 
> memstick.h     -> include/linux

Or drivers/memstick/.  Will anything else need this header?

> memstick.c     -> drivers/memstick  ("bus" support)
> ms_block.c     -> drivers/memstick  (legacy MS storage support)
> mspro_block.c  -> drivers/memstick  (MS Pro storage support)
> tifm_ms.c      -> drivers/memstick  (TI Flashmedia low level driver)
> 
> 
> 
> I also wonder, where do I send the patches if nobody currently maintains this 
> thing?
> 

Me, Pierre, lkml?

Re: crash 2.6.24-rc1-git10

2007-11-06 Thread Andrew Morton

> On Thu, 01 Nov 2007 15:57:27 -0300 (GFT) "werner" <[EMAIL PROTECTED]> wrote:
> Kernel Crash -- Details see below
> globc 2.7  glib2 2.14.2
> W.Landgraf
> www.copaya.yi.org
> =
> 2.6.24-rc1-git10
> EIP 0600:  EFLAGS 00010212 CPU 0
> EIUP is at xor_sse_2+0x34/0x200
> EAX: 10 EBX fffedb22 ECX c183f000 EDX c183c000 ESS 8005003b EDI c0929614 EBP 
> c183f000 ESP c1823ef0   
> DS 7b ES 7b FS d8 GS 0 SS 68
> Process swapper  pid 1  ti: c182200  task c182 task.ti c=1822000
> Stack:  8x 08x 0   fffedb22  0  c04067b3  10  c0849b62  c1030780  
> c183f000  c183c000
> call trace
> c0 4067b3 do_xor_speed+0x53/0xd0
>9a9582 calibrate_xor_blocks 0xe2/0x100 (or 1a0 ?)
>   191594  register_filesystem =0X44/0X70
>   991565 kernel_init+0x125/0x2f0
>10420a  ret_from_fork +0x6/0x1c  (or 0xb ...)
>   991440 kernel_init+0x0/0x2f0
>" again
>c0104edf  kernel_thread_helper+0x7/0x18
> code  08 89 74 24 44 0f 20 cf 0f 06 (or 0b) 0f 11 04 24 0f 11 4c 34 10 0f 11 
> 54 24 20 0f 11 5c 24 30 0f 18 82 00
> 01 00 00 0f 18 82 20 01 00 00 <00> 20x 0
> EIP c0407284 xor_sse_2+0x34/0x200 SS ESP 068: c1823ef0
> kernel panic
> System.map==
> 
> 

Five days, no reply?

Is this still reproducible in latest mainline?  If so, please make a fuss.

Re: Use of virtio device IDs

2007-11-06 Thread Rusty Russell

On Wednesday 07 November 2007 16:40:13 Avi Kivity wrote:
> Gregory Haskins wrote:
> >  but FWIW: This is a major motivation for the reason that the
> > IOQ stuff I posted a while back used strings for device identification
> > instead of a fixed length, centrally managed namespace like PCI
> > vendor/dev-id.  Then you can just name your device something reasonably
> > unique (e.g. "qumranet::veth", or "ibm-pvirt-clock").
>
> I dislike strings.  They make it look as if you have a nice extensible
> interface, where in reality you have a poorly documented interface which
> leads to poor interoperability.

Yes, you end up with exactly names like "qumranet::veth" 
and "ibm-pvirt-clock".  I would recommend looking very hard at /proc, Open 
Firmware on a modern system, or the Xen store, to see what a lack of 
limitation can do to you :)

> We will support non-pci for s390, but in order to support Windows and
> older Linux PCI is necessary.

The aim is that PCI support is clean, but that we're not really tied to PCI.  
I think we're getting closer with the recent config changes.

Cheers,
Rusty.

Re: 2.6.24-rc1 eat my photo SD card :-(

2007-11-06 Thread Willy Tarreau

On Tue, Nov 06, 2007 at 11:17:39PM +0100, Romano Giannetti wrote:
> On Tue, 2007-11-06 at 22:48 +0100, Romano Giannetti wrote:
> 
> > I do really suspect a software bug.
> >
> 
> Well, I started bisecting it. It will be a long shot, I suspect...
> 
>   Romano
> 
> BTW: I noticed that if I change EXTRAVERSION, doing a make rebuilds
> almost all of the kernel. Is that normal?

Yes, I think so, because it changes version.h, which is included directly or
indirectly by every file.

> And it seems to me that the same
> thing happens if a make oldconfig results in a changed .config...

This should not happen IMHO. If you post a simple reproducible case, maybe
some people can investigate it.

Regards,
Willy


[PATCH] virtio config_ops refactoring

2007-11-06 Thread Rusty Russell

After discussion with Anthony, the virtio config has been simplified.  We
lose some minor features (the virtio_net address must now be 6 bytes) but
it turns out to be a wash in terms of complexity, while simplifying PCI.

This can be found in the new virtio git tree, in the "patches/1" branch
(new branches will be created as I rebase to keep up with Linus).

git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-virtio.git

I've also posted it below, for easy commentary.
Cheers,
Rusty.

===
Previously we used a type/len pair within the config space, but this
seems overkill.  We now simply use an agreed offset within the config
space and assume everyone knows the size of any entry it is interested
in (the config space can now only be extended at the end).

The main driver-visible change is that we indicate what fields are
present with an explicit feature bit.
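
As a rough illustration of the scheme described above (this is not code from
the patch): a reader of the config space trusts a field only when the matching
feature bit is set, and then reads it at its agreed offset. The structure
demo_net_config, the bit number DEMO_F_MAC and both helpers are hypothetical
names invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical config layout: each field sits at a fixed, agreed offset,
 * and new fields may only be appended at the end. */
struct demo_net_config {
	uint8_t mac[6];		/* meaningful only if DEMO_F_MAC is set */
};

#define DEMO_F_MAC 5		/* hypothetical feature bit number */

/* Test one bit in a feature bitmap; bit N lives in byte N / 8. */
static int demo_has_feature(const uint8_t *features, size_t feature_len,
			    unsigned int bit)
{
	assert(features != NULL);
	if (bit / 8 >= feature_len)
		return 0;
	return (features[bit / 8] >> (bit % 8)) & 1;
}

/* Copy len bytes from a fixed offset in the config space.
 * Returns 0 on success, -1 if the range falls outside the space. */
static int demo_config_get(const uint8_t *config, size_t config_len,
			   size_t offset, void *buf, size_t len)
{
	assert(config != NULL && buf != NULL);
	if (offset > config_len || len > config_len - offset)
		return -1;
	memcpy(buf, config + offset, len);
	return 0;
}
```

A caller would first check demo_has_feature() for DEMO_F_MAC and only then
call demo_config_get() with offsetof(struct demo_net_config, mac) and a
length of 6.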

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>

diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c
index f266839..e5e5890 100644
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "linux/lguest_launcher.h"
 #include "linux/virtio_config.h"
 #include "linux/virtio_net.h"
@@ -96,13 +97,11 @@ struct device_list
/* The descriptor page for the devices. */
u8 *descpage;
 
-   /* The tail of the last descriptor. */
-   unsigned int desc_used;
-
/* A single linked list of devices. */
struct device *dev;
-   /* ... And an end pointer so we can easily append new devices */
-   struct device **lastdev;
+   /* And a pointer to the last device for easy append and also for
+* configuration appending. */
+   struct device *lastdev;
 };
 
 /* The list of Guest devices, based on command line arguments. */
@@ -980,54 +979,44 @@ static void handle_input(int fd)
  *
  * All devices need a descriptor so the Guest knows it exists, and a "struct
  * device" so the Launcher can keep track of it.  We have common helper
- * routines to allocate them.
- *
- * This routine allocates a new "struct lguest_device_desc" from descriptor
- * table just above the Guest's normal memory.  It returns a pointer to that
- * descriptor. */
-static struct lguest_device_desc *new_dev_desc(u16 type)
-{
-   struct lguest_device_desc *d;
-
-   /* We only have one page for all the descriptors. */
-   if (devices.desc_used + sizeof(*d) > getpagesize())
-   errx(1, "Too many devices");
-
-   /* We don't need to set config_len or status: page is 0 already. */
-   d = (void *)devices.descpage + devices.desc_used;
-   d->type = type;
-   devices.desc_used += sizeof(*d);
+ * routines to allocate and manage them. */
 
-   return d;
+/* The layout of the device page is a "struct lguest_device_desc" followed by a
+ * number of virtqueue descriptors, then two sets of feature bits, then an
+ * array of configuration bytes.  This routine returns the configuration
+ * pointer. */
+static void *device_config(const struct device *dev)
+{
+   return (void *)(dev->desc + 1)
+   + dev->desc->num_vq * sizeof(struct lguest_vqconfig)
+   + dev->desc->feature_len * 2;
 }
 
-/* Each device descriptor is followed by some configuration information.
- * Each configuration field looks like: u8 type, u8 len, [... len bytes...].
- *
- * This routine adds a new field to an existing device's descriptor.  It only
- * works for the last device, but that's OK because that's how we use it. */
-static void add_desc_field(struct device *dev, u8 type, u8 len, const void *c)
+/* This routine allocates a new "struct lguest_device_desc" from descriptor
+ * table page just above the Guest's normal memory.  It returns a pointer to
+ * that descriptor. */
+static struct lguest_device_desc *new_dev_desc(u16 type)
 {
-   /* This is the last descriptor, right? */
-   assert(devices.descpage + devices.desc_used
-  == (u8 *)(dev->desc + 1) + dev->desc->config_len);
+   struct lguest_device_desc d = { .type = type }; 
+   void *p;
 
-   /* We only have one page of device descriptions. */
-   if (devices.desc_used + 2 + len > getpagesize())
-   errx(1, "Too many devices");
+   /* Figure out where the next device config is, based on the last one. */
+   if (devices.lastdev)
+   p = device_config(devices.lastdev)
+   + devices.lastdev->desc->config_len;
+   else
+   p = devices.descpage;
 
-   /* Copy in the new config header: type then length. */
-   devices.descpage[devices.desc_used++] = type;
-   devices.descpage[devices.desc_used++] = len;
-   memcpy(devices.descpage + devices.desc_used, c, len);
-   devices.desc_used += len;
+   /* We only have one page for all the descriptors. */
+   if (p + sizeof(d) > (void *)devices.descpage + getpagesize())
+

Re: [PATCH] x86: unification of cpufreq/Kconfig

2007-11-06 Thread Adrian Bunk

On Wed, Nov 07, 2007 at 12:01:12AM +0100, Sam Ravnborg wrote:
>...
>  config X86_SPEEDSTEP_CENTRINO
> - tristate "Intel Enhanced SpeedStep"
> + tristate "Intel Enhanced SpeedStep (deprecated)"
>   select CPU_FREQ_TABLE
> - select X86_SPEEDSTEP_CENTRINO_TABLE
> + select X86_SPEEDSTEP_CENTRINO_TABLE if X86_32
> + depends on X86_64 && ACPI_PROCESSOR
>...

No.

depends on ACPI_PROCESSOR if X86_64



cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed


Re: [kvm-devel] include files for kvmclock

2007-11-06 Thread Jeremy Fitzhardinge

Avi Kivity wrote:
> Glauber de Oliveira Costa wrote:
>   
>>> +union kvm_hv_clock {
>>> +   struct {
>>> +   u64 tsc_mult;
>>> +   u64 now_ns;
>>> +   /* That's the wall clock, not the water closet */
>>> +   u64 wc_sec;
>>> +   u64 wc_nsec;
>>> 
>>>   
>
> Do we really need 128-bit time?  you must be planning to live forever.
>   

Well, he's planning on having lots of very small nanoseconds.

J

Re: [PATCH][VIRTIO] Fix vring_init() ring computations

2007-11-06 Thread Rusty Russell

On Wednesday 07 November 2007 13:52:29 Anthony Liguori wrote:
> This patch fixes a typo in vring_init(). 

Thanks, applied.

I've put it in the new, experimental virtio git tree on git.kernel.org.

Cheers,
Rusty.

Re: [kvm-devel] include files for kvmclock

2007-11-06 Thread Avi Kivity

Glauber de Oliveira Costa wrote:
>> +union kvm_hv_clock {
>> +   struct {
>> +   u64 tsc_mult;
>> +   u64 now_ns;
>> +   /* That's the wall clock, not the water closet */
>> +   u64 wc_sec;
>> +   u64 wc_nsec;
>> 

Do we really need 128-bit time?  you must be planning to live forever.
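
For a sense of scale, a back-of-the-envelope sketch (not from the patch): a
single unsigned 64-bit nanosecond counter already spans several centuries,
which is the point of the question above. The helper names are invented for
illustration, and a year is taken as 365 days.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_SEC  1000000000ULL
#define DEMO_SEC_PER_YEAR  (365ULL * 24 * 60 * 60)	/* 365-day year */

/* Whole years representable by a u64 nanosecond counter. */
static uint64_t demo_u64_ns_years(void)
{
	return UINT64_MAX / (DEMO_NSEC_PER_SEC * DEMO_SEC_PER_YEAR);
}

/* Split a nanosecond count into whole seconds and leftover nanoseconds,
 * as a seconds/nanoseconds pair such as wc_sec/wc_nsec would carry it. */
static void demo_split_ns(uint64_t ns, uint64_t *sec, uint32_t *nsec)
{
	*sec = ns / DEMO_NSEC_PER_SEC;
	*nsec = (uint32_t)(ns % DEMO_NSEC_PER_SEC);
}
```

demo_u64_ns_years() evaluates to 584, and the nanosecond remainder always
fits in 32 bits, so a u64 seconds field plus a u32 nanoseconds field would
already be generous.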

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.


Re: kvmclock - the host part.

2007-11-06 Thread Avi Kivity

Glauber de Oliveira Costa wrote:
> This is the host part of kvm clocksource implementation. As it does
> not include clockevents, it is a fairly simple implementation. We
> only have to register a per-vcpu area, and start writing to it periodically.
>
> Signed-off-by: Glauber de Oliveira Costa <[EMAIL PROTECTED]>
> ---
>  drivers/kvm/irq.c  |1 +
>  drivers/kvm/kvm_main.c |2 +
>  drivers/kvm/svm.c  |1 +
>  drivers/kvm/vmx.c  |1 +
>  drivers/kvm/x86.c  |   59 
> 
>  drivers/kvm/x86.h  |   13 ++
>  6 files changed, 77 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/kvm/irq.c b/drivers/kvm/irq.c
> index 22bfeee..0344879 100644
> --- a/drivers/kvm/irq.c
> +++ b/drivers/kvm/irq.c
> @@ -92,6 +92,7 @@ void kvm_vcpu_kick_request(struct kvm_vcpu *vcpu, int 
> request)
>  
>  void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu)
>  {
> + vcpu->time_needs_update = 1;
>   

Why here and not in __vcpu_run()?  It isn't timer irq related.

> @@ -1242,6 +1243,7 @@ static long kvm_dev_ioctl(struct file *filp,
>   case KVM_CAP_MMU_SHADOW_CACHE_CONTROL:
>   case KVM_CAP_USER_MEMORY:
>   case KVM_CAP_SET_TSS_ADDR:
> + case KVM_CAP_CLK:
>   

It's just a clock source now, right?  so _CLOCK_SOURCE.

>  
> +static void kvm_write_guest_time(struct kvm_vcpu *vcpu)
> +{
> + struct timespec ts;
> + void *clock_addr;
> +
> +
> + if (!vcpu->clock_page)
> + return;
> +
> + /* Updates version to the next odd number, indicating we're writing */
> + vcpu->hv_clock.version++;
>   

No one can actually see this as you're updating a private structure. 
You need to copy it to guestspace.

> + /* Updating the tsc count is the first thing we do */
> + kvm_get_msr(vcpu, MSR_IA32_TIME_STAMP_COUNTER, 
> &vcpu->hv_clock.last_tsc);
> + ktime_get_ts(&ts);
> + vcpu->hv_clock.now_ns = ts.tv_nsec + (NSEC_PER_SEC * (u64)ts.tv_sec);
> + vcpu->hv_clock.wc_sec = get_seconds();
> + vcpu->hv_clock.version++;
> +
> + clock_addr = vcpu->clock_addr;
> + memcpy(clock_addr, &vcpu->hv_clock, sizeof(vcpu->hv_clock));
> + mark_page_dirty(vcpu->kvm, vcpu->clock_gfn);
>   

Just use kvm_write_guest().

> +
> + vcpu->time_needs_update = 0;
> +}
> +
>  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  {
>   unsigned long nr, a0, a1, a2, a3, ret;
> @@ -1648,7 +1674,33 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>   a3 &= 0x;
>   }
>  
> + ret = 0;
>   switch (nr) {
> + case  KVM_HCALL_REGISTER_CLOCK: {
> + struct kvm_vcpu *dst_vcpu;
> +
> + if (!((a1 < KVM_MAX_VCPUS) && (vcpu->kvm->vcpus[a1]))) {
> + ret = -KVM_EINVAL;
> + break;
> + }
> +
> + dst_vcpu = vcpu->kvm->vcpus[a1];
>   

What if !dst_vcpu?  What about locking?

Suggest simply using vcpu.  Every guest cpu can register its own
clocksource.

> + dst_vcpu->clock_page = gfn_to_page(vcpu->kvm, a0 >> PAGE_SHIFT);
>   

Shift right?  Why?

> +
> + if (!dst_vcpu->clock_page) {
>   

IIRC gfn_to_page() never returns NULL, need a different check.

> + ret = -KVM_EINVAL;
> + break;
> + }
> + dst_vcpu->clock_gfn = a0 >> PAGE_SHIFT;
> +
> + dst_vcpu->hv_clock.tsc_mult = clocksource_khz2mult(tsc_khz, 22);
> + dst_vcpu->clock_addr = kmap(dst_vcpu->clock_page);
>   

kmap() is bad since the page can move due to swapping. 
kvm_write_guest() is your friend.

> +static inline void release_clock(struct kvm_vcpu *vcpu)
> +{
> + if (vcpu->clock_page)
> + kunmap(vcpu->clock_page);
> +}
>   


While it's a static inline, please prefix with kvm_ in case one day it
isn't.
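
The version counter discussed in the review above follows a common odd/even
pattern, and it only helps if the reader can see the intermediate odd value,
which is why the counter has to live in the shared copy. The following is a
minimal single-threaded sketch of that protocol, not the patch's code:
struct demo_clock and the demo_* helpers are hypothetical, and the memory
barriers a real host/guest implementation needs are left out.

```c
#include <stdint.h>

/* Hypothetical shared record; the field names loosely follow the quoted
 * patch, the layout does not. */
struct demo_clock {
	uint32_t version;	/* odd while an update is in progress */
	uint64_t now_ns;
};

/* Writer side: make the version odd before touching the payload. */
static void demo_clock_write_begin(struct demo_clock *shared)
{
	shared->version++;
}

/* Writer side: make the version even again once the payload is complete. */
static void demo_clock_write_end(struct demo_clock *shared)
{
	shared->version++;
}

/* Reader side: returns 0 and stores a consistent value in *now_ns, or -1 if
 * an update was in progress or completed meanwhile (the caller retries).
 * Memory barriers are deliberately left out of this sketch. */
static int demo_clock_read(const volatile struct demo_clock *shared,
			   uint64_t *now_ns)
{
	uint32_t before = shared->version;
	uint64_t value;

	if (before & 1)
		return -1;
	value = shared->now_ns;
	if (shared->version != before)
		return -1;
	*now_ns = value;
	return 0;
}
```

A reader would normally call demo_clock_read() in a loop until it returns 0.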

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.


Re: [PATCH] x86: unification of cpufreq/Kconfig

2007-11-06 Thread Sam Ravnborg

On Wed, Nov 07, 2007 at 12:04:51AM +0100, Sam Ravnborg wrote:
> This series of patches unifies the X86 Kconfig files.
> Next step is to enable use of "make ARCH=x86" and kill
> "make ARCH=i386/x86_64".
> But that will wait till tomorrow.
> 
> The allmodconfig kernel is still building here.
> Testing / reviews appreciated.

Small fix to let allmodconfig build on x86_64.
I need to double-check how this bug sneaked in, but that will be tonight.

Sam

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0acb92f..dab26a0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -79,7 +79,7 @@ config ARCH_MAY_HAVE_PC_FDC
 
 config DMI
bool
-   default X86_32
+   default y
 
 config RWSEM_GENERIC_SPINLOCK
def_bool !X86_XADD

2.6.23.1-rt9 (and others)

2007-11-06 Thread Steven Rostedt

This is a special announcement for the latest -rt patches. This is
actually announcing more than one tree (pay close attention to the
differences between -rt7, -rt8 and -rt9).

  2.6.23.1-rt6

   - Removed BUG_ON in exit (Steven Rostedt and Daniel Walker)

   -  Turn RCU preempt boost on by default (Steven Rostedt)
  (for when RCU PREEMPT is enabled)

   - Fixes for PowerPC (Paul McKenney)


  2.6.23.1-rt7

   - Found that there's a flaw in the PowerPC patch so
 it was pulled from the tree.

  2.6.23.1-rt8

   - More aggressive RT Balancing (Gregory Haskins)

  2.6.23.1-rt9

   - RT balancing by CPU priorities (Gregory Haskins)


Now benchmarks against 2.6.23.1-rt7, -rt8 and -rt9 would be greatly
appreciated.  These three are all present in

  http://www.kernel.org/pub/linux/kernel/projects/rt/

Gregory and I have been having disagreements on how to solve RT task
balancing among CPUs. Although we shared ideas back and forth, and both
our methods have been greatly influenced by each other, the real answer
comes from actual numbers. So these three versions are posted for your
convenience to see which actually does best. I would be happy to tell
Gregory he's right, if the numbers prove it.

Currently, to test RT latencies we run Thomas Gleixner's cyclictest
(http://git.kernel.org/?p=linux/kernel/git/tglx/rt-tests.git;a=summary)
together with hackbench and look at the maximum latencies we get.
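
To make "maximum latency" concrete, here is a minimal sketch of a wakeup
latency measurement: sleep until an absolute deadline, then record how late
the wakeup was. It is written in the spirit of such tests and is not taken
from cyclictest; the demo_* names are invented, and a real measurement would
typically also run at a real-time priority with memory locked, which this
sketch does not do.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdint.h>
#include <time.h>

#define DEMO_NSEC_PER_SEC 1000000000L

/* Difference a - b in nanoseconds. */
static int64_t demo_ts_diff_ns(const struct timespec *a,
			       const struct timespec *b)
{
	return (int64_t)(a->tv_sec - b->tv_sec) * DEMO_NSEC_PER_SEC +
	       (a->tv_nsec - b->tv_nsec);
}

/* Sleep until successive absolute deadlines, interval_ns apart, and return
 * the largest observed wakeup delay in nanoseconds, or -1 on error. */
static int64_t demo_max_wakeup_latency_ns(int loops, long interval_ns)
{
	struct timespec next, now;
	int64_t max = 0;
	int i;

	if (clock_gettime(CLOCK_MONOTONIC, &next) != 0)
		return -1;

	for (i = 0; i < loops; i++) {
		int64_t late;

		/* Advance the deadline and normalise tv_nsec. */
		next.tv_nsec += interval_ns;
		while (next.tv_nsec >= DEMO_NSEC_PER_SEC) {
			next.tv_nsec -= DEMO_NSEC_PER_SEC;
			next.tv_sec++;
		}
		if (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
				    &next, NULL) != 0)
			return -1;
		if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
			return -1;
		late = demo_ts_diff_ns(&now, &next);
		if (late > max)
			max = late;
	}
	return max;
}
```

Calling demo_max_wakeup_latency_ns(1000, 1000000L) while a load generator
runs in the background gives one number to compare between kernels.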

Other tests are welcomed too.


To build a 2.6.23.1-rt7 tree, the following patches should be applied:

  http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.23.1.tar.bz2 
  http://www.kernel.org/pub/linux/kernel/projects/rt/patch-2.6.23.1-rt7.bz2

To build a 2.6.23.1-rt8 tree, the following patches should be applied:

  http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.23.1.tar.bz2 
  http://www.kernel.org/pub/linux/kernel/projects/rt/patch-2.6.23.1-rt8.bz2

To build a 2.6.23.1-rt9 tree, the following patches should be applied:

  http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.23.1.tar.bz2 
  http://www.kernel.org/pub/linux/kernel/projects/rt/patch-2.6.23.1-rt9.bz2


And like always, my RT version of Matt Mackall's ketchup will get this
for you nicely:

  http://people.redhat.com/srostedt/rt/tools/ketchup-0.9.8-rt1

The broken out patches are also available.

Thanks!

-- Steve



Re: Use of virtio device IDs

2007-11-06 Thread Avi Kivity

Gregory Haskins wrote:
> Anthony Liguori wrote:
>
>   
>> Right now, we would have to have every PCI vendor/device ID pair in the
>> virtio PCI driver ID table for every virtio device.
>> 
>
> I realize you guys are probably far down this road in the design
> process,

That doesn't mean we can't change it if it's wrong.

>  but FWIW: This is a major motivation for the reason that the
> IOQ stuff I posted a while back used strings for device identification
> instead of a fixed length, centrally managed namespace like PCI
> vendor/dev-id.  Then you can just name your device something reasonably
> unique (e.g. "qumranet::veth", or "ibm-pvirt-clock").
>   

I dislike strings.  They make it look as if you have a nice extensible
interface, where in reality you have a poorly documented interface which
leads to poor interoperability.

I prefer nice structure where you can see all the limitations immediately.

> (I realize that if you are going to do PCI, you need to make it
> PCI-like.  But I think using PCI in the first place is probably the
> wrong direction.  IMHO, there's really not a lot of reason to be
> constrained by a hardware specification once you decide to go PV.  This
> is even more true if you want to support as many platforms as possible
> (i.e. platforms that don't have PCI natively).
>
>   

PCI means that you can reuse all of the platform's infrastructure for
irq allocation, discovery, device hotplug, and management.  You can
write it for new guests but backporting it to older guests will be a
huge task.

We will support non-pci for s390, but in order to support Windows and
older Linux PCI is necessary.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.


Re: aim7 -30% regression in 2.6.24-rc1

2007-11-06 Thread Zhang, Yanmin

On Mon, 2007-11-05 at 10:37 +0100, Cyrus Massoumi wrote:
> Zhang, Yanmin wrote:
> > On Thu, 2007-11-01 at 11:02 +0100, Cyrus Massoumi wrote:
> >> Zhang, Yanmin wrote:
> >>> On Wed, 2007-10-31 at 17:57 +0800, Zhang, Yanmin wrote:
>  On Tue, 2007-10-30 at 16:36 +0800, Zhang, Yanmin wrote:
> > On Tue, 2007-10-30 at 08:26 +0100, Ingo Molnar wrote:
> >> * Zhang, Yanmin <[EMAIL PROTECTED]> wrote:
> >>
> >>> sub-bisecting captured patch 
> >>> 38ad464d410dadceda1563f36bdb0be7fe4c8938(sched: uniform tunings) 
> >>> caused 20% regression of aim7.
> >>>
> >>> The last 10% should also be related to sched parameters, such as 
> >>> sysctl_sched_min_granularity.
> >> ah, interesting. Since you have CONFIG_SCHED_DEBUG enabled, could you 
> >> please try to figure out what the best value for 
> >> /proc/sys/kernel_sched_latency, /proc/sys/kernel_sched_nr_latency and 
> >> /proc/sys/kernel_sched_min_granularity is?
> >>
> >> there's a tuning constraint for kernel_sched_nr_latency: 
> >>
> >> - kernel_sched_nr_latency should always be set to 
> >>   kernel_sched_latency/kernel_sched_min_granularity. (it's not a free 
> >>   tunable)
> >>
> >> i suspect a good approach would be to double the value of 
> >> kernel_sched_latency and kernel_sched_nr_latency in each tuning 
> >> iteration, while keeping kernel_sched_min_granularity unchanged. That 
> >> will exercise the tuning values of the 2.6.23 kernel as well.
> > I followed your idea to test 2.6.24-rc1. The improvement is slow.
> > When sched_nr_latency=2560 and sched_latency_ns=64000, the 
> > performance
> > is still about 15% less than 2.6.23.
>  I got the aim7 30% regression on my new upgraded stoakley machine. I 
>  found
>  this machine is slower than the old one. Maybe BIOS has issues, or 
>  memory (might not
>  be dual-channel?) is slow. So I retested it on the old machine and found 
>  on the old
>  stoakley machine, the regression is about 6%, quite similar to the 
>  regression on tigerton
>  machine.
> 
>  By sched_nr_latency=640 and sched_latency_ns=64000 on the old 
>  stoakley machine,
>  the regression becomes about 2%. Other latency has more regression.
> 
>  On my tulsa machine, by sched_nr_latency=640 and 
>  sched_latency_ns=64000,
>  the regression becomes less than 1% (The original regression is about 
>  20%).
> >>> I rerun SPECjbb by sched_nr_latency=640 and sched_latency_ns=64000. On 
> >>> tigerton,
> >>> the regression is still more than 40%. On stoakley machine, it becomes 
> >>> worse (26%,
> >>> original is 9%). I will do more investigation to make sure SPECjbb 
> >>> regression is
> >>> also caused by the bad default values.
> >>>
> >>> We need a smarter method to calculate the best default values for the key 
> >>> tuning
> >>> parameters.
> >>>
> >>> One interesting is sysbench+mysql(readonly) got the same result like 
> >>> 2.6.22 (no
> >>> regression). Good job!
> >> Do you mean you couldn't reproduce the regression which was reported 
> >> with 2.6.23 (http://lkml.org/lkml/2007/10/30/53) with 2.6.24-rc1?
> > It looks like you missed my emails.
> 
> Yeah :(
> 
> > Firstly, I reproduced (or just find the same myself :) ) the issue with 
> > kernel 2.6.22,
> > 2.6.23-rc and 2.6.23.
> > 
> > Ingo wrote a big patch to fix it and the new patch is in 2.6.24-rc1 now.
> 
> That's nice, could you please point me to the commit?
The patch is very big. 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=b5869ce7f68b233ceb81465a7644be0d9a5f3dbb
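
The tuning rule quoted earlier in this thread (the nr_latency value is derived
as latency divided by minimum granularity, and each tuning iteration doubles
the latency while leaving the granularity alone) can be sketched as plain
arithmetic. This is an illustration only: the struct, the helper names and any
starting values fed to them are hypothetical, not kernel code or kernel
defaults.

```c
#include <stdint.h>

/* Hypothetical container for the three tunables discussed in the thread. */
struct demo_sched_tunables {
	uint64_t latency_ns;
	uint64_t min_granularity_ns;
	uint64_t nr_latency;		/* derived, not a free tunable */
};

/* How many minimum-granularity slices fit into one latency period. */
static uint64_t demo_nr_latency(uint64_t latency_ns,
				uint64_t min_granularity_ns)
{
	if (min_granularity_ns == 0)
		return 0;
	return latency_ns / min_granularity_ns;
}

/* One tuning iteration: double the latency, keep the minimum granularity,
 * and recompute the derived value so the constraint still holds. */
static void demo_tuning_step(struct demo_sched_tunables *t)
{
	t->latency_ns *= 2;
	t->nr_latency = demo_nr_latency(t->latency_ns, t->min_granularity_ns);
}
```

Starting from an illustrative 20 ms latency and 1 ms granularity, successive
steps give nr_latency values of 40, 80, 160 and so on.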

Re: [patch] audit support for SH

2007-11-06 Thread Paul Mundt

On Wed, Nov 07, 2007 at 02:04:46PM +0900, Yuichi Nakamura wrote:
> I found that syscall audit does not work on SH (SuperH).
> I made a patch to support syscall audit for SH.
> 
> Signed-off-by: Yuichi Nakamura<[EMAIL PROTECTED]>

Looks fine, but it's too late for 2.6.24. So this will go in to the 2.6.25
queue when I open up the 2.6.25 development tree. Thanks.

Re: Abit F-190HD Onboard rtl8169 Ethernet Controller

2007-11-06 Thread Josh Logan

I tried this out and it works great.

Thanks!

Later, JOSH


On 11/1/07, Francois Romieu <[EMAIL PROTECTED]> wrote:
> Josh Logan <[EMAIL PROTECTED]> :
> [...]
> > I have had this board for a few months as well and I have needed to
> > patch the driver to use 0x0001 as well.
>
> Ok.
>
> Can people give the patch below a try before I submit it to Jeff ?
>
> From: Ciaran McCreesh <[EMAIL PROTECTED]>
> Date: Thu, 1 Nov 2007 22:48:15 +0100
> Subject: [PATCH] r8169: add PCI ID for the 8168 in the Abit Fatal1ty F-190HD 
> motherboard
>
> Signed-off-by: Ciaran McCreesh <[EMAIL PROTECTED]>
> Signed-off-by: Francois Romieu <[EMAIL PROTECTED]>
> Cc: Edward Hsu <[EMAIL PROTECTED]>
> ---
>  drivers/net/r8169.c |2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
> index b94fa7e..702334e 100644
> --- a/drivers/net/r8169.c
> +++ b/drivers/net/r8169.c
> @@ -171,6 +171,8 @@ static struct pci_device_id rtl8169_pci_tbl[] = {
> { PCI_DEVICE(0x16ec,0x0116), 0, 0, RTL_CFG_0 },
> { PCI_VENDOR_ID_LINKSYS,0x1032,
> PCI_ANY_ID, 0x0024, 0, 0, RTL_CFG_0 },
> +   { 0x0001,   0x8168,
> +   PCI_ANY_ID, 0x2410, 0, 0, RTL_CFG_2 },
> {0,},
>  };
>
> --
> 1.5.3.3
>
>

Re: Fix for sparc64 cpu hangs.

2007-11-06 Thread David Miller

From: David Miller <[EMAIL PROTECTED]>
Date: Tue, 06 Nov 2007 20:34:33 -0800 (PST)

> [FUTEX]: Fix address computation in compat code.

Sorry, I just noticed there is a second handle_futex_death()
call in compat_exit_robust_list() which has the same
address computation bug.

Here is an updated patch:

[FUTEX]: Fix address computation in compat code.

compat_exit_robust_list() computes a pointer to the
futex entry in userspace as follows:

(void __user *)entry + futex_offset

'entry' is a 'struct robust_list __user *', and
'futex_offset' is a 'compat_long_t' (typically a 's32').

Things explode if the 32-bit sign bit is set in futex_offset.

Type promotion sign extends futex_offset to a 64-bit value before
adding it to 'entry'.

This triggered a problem on sparc64 running 32-bit applications which
would lock up a cpu looping forever in the fault handling for the
userspace load in handle_futex_death().

Compat userspace runs with address masking (wherein the cpu zeros out
the top 32-bits of every effective address given to a memory operation
instruction) so the sparc64 fault handler accounts for this by
zero'ing out the top 32-bits of the fault address too.

Since the kernel properly uses the compat_uptr interfaces, kernel side
accesses to compat userspace work too since they will only use
addresses with the top 32-bit clear.

Because of this compat futex layer bug we get into the following loop
when executing the get_user() load near the top of handle_futex_death():

1) load from address '0xfffffffff7f16bd8', FAULT
2) fault handler clears upper 32-bits, processes fault
   for address '0xf7f16bd8' which succeeds
3) goto #1
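
The effect can be reproduced in isolation. The sketch below is not the kernel
code: it models the user pointer as a plain 64-bit integer, and the function
names and the sample base and offset are made up for illustration. It
contrasts adding a sign-extended 32-bit offset with doing the addition in 32
bits and zero-extending the result.

```c
#include <stdint.h>

/* Buggy form: the 32-bit offset is sign-extended to 64 bits before the
 * addition, so a set sign bit fills the upper 32 bits of the result. */
static uint64_t demo_uaddr_sign_extended(uint64_t entry, int32_t futex_offset)
{
	return entry + (uint64_t)(int64_t)futex_offset;
}

/* Compat form: add in 32 bits (wrapping), then zero-extend, so the upper
 * 32 bits of the result are always clear. */
static uint64_t demo_uaddr_compat(uint64_t entry, int32_t futex_offset)
{
	uint32_t base = (uint32_t)entry;

	return (uint64_t)(uint32_t)(base + (uint32_t)futex_offset);
}
```

With a base of 0x10000 and an offset whose bit pattern is 0xf7f06bd8, the
first function yields 0xfffffffff7f16bd8 and the second 0xf7f16bd8.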

I want to thank Bernd Zeimetz, Josip Rodin, and Fabio Massimo Di Nitto
for their tireless efforts helping me track down this bug.

Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c
index 00b5726..1931457 100644
--- a/kernel/futex_compat.c
+++ b/kernel/futex_compat.c
@@ -30,6 +30,15 @@ fetch_robust_entry(compat_uptr_t *uentry, struct robust_list 
__user **entry,
return 0;
 }

+static void __user *futex_uaddr(struct robust_list *entry,
+   compat_long_t futex_offset)
+{
+   compat_uptr_t base = ptr_to_compat(entry);
+   void __user *uaddr = compat_ptr(base + futex_offset);
+
+   return uaddr;
+}
+
 /*
  * Walk curr->robust_list (very carefully, it's a userspace list!)
  * and mark any locks found there dead, and notify any waiters.
@@ -76,11 +85,13 @@ void compat_exit_robust_list(struct task_struct *curr)
 * A pending lock might already be on the list, so
 * dont process it twice:
 */
-   if (entry != pending)
-   if (handle_futex_death((void __user *)entry + 
futex_offset,
-   curr, pi))
-   return;
+   if (entry != pending) {
+   void __user *uaddr = futex_uaddr(entry,
+futex_offset);

+   if (handle_futex_death(uaddr, curr, pi))
+   return;
+   }
if (rc)
return;
uentry = next_uentry;
@@ -94,9 +105,11 @@ void compat_exit_robust_list(struct task_struct *curr)

cond_resched();
}
-   if (pending)
-   handle_futex_death((void __user *)pending + futex_offset,
-  curr, pip);
+   if (pending) {
+   void __user *uaddr = futex_uaddr(pending, futex_offset);
+
+   handle_futex_death(uaddr, curr, pip);
+   }
 }

 asmlinkage long

[patch] audit support for SH

2007-11-06 Thread Yuichi Nakamura

I found that syscall audit does not work on SH (SuperH).
I made a patch to support syscall audit for SH.

Signed-off-by: Yuichi Nakamura<[EMAIL PROTECTED]>
---
 arch/sh/kernel/entry-common.S |8 ++--
 arch/sh/kernel/ptrace.c   |   19 +++
 include/asm-sh/thread_info.h  |2 ++
 init/Kconfig  |2 +-
 4 files changed, 24 insertions(+), 7 deletions(-)
diff -purN -X linux-2.6.24.rc1/Documentation/dontdiff 
linux-2.6.24.rc1.orig/arch/sh/kernel/entry-common.S 
linux-2.6.24.rc1/arch/sh/kernel/entry-common.S
--- linux-2.6.24.rc1.orig/arch/sh/kernel/entry-common.S 2007-11-06 
16:03:17.0 +0900
+++ linux-2.6.24.rc1/arch/sh/kernel/entry-common.S  2007-11-06 
18:16:11.0 +0900
@@ -224,7 +224,7 @@ work_resched:
 syscall_exit_work:
! r0: current_thread_info->flags
! r8: current_thread_info
-   tst #_TIF_SYSCALL_TRACE | _TIF_SINGLESTEP, r0
+   tst #_TIF_SYSCALL_TRACE | _TIF_SINGLESTEP |_TIF_SYSCALL_AUDIT, r0
bt/swork_pending
 tst#_TIF_NEED_RESCHED, r0
 #ifdef CONFIG_TRACE_IRQFLAGS
@@ -234,6 +234,8 @@ syscall_exit_work:
 #endif
sti
! XXX setup arguments...
+   mov r15, r4
+   mov #1, r5
mov.l   4f, r0  ! do_syscall_trace
jsr @r0
 nop
@@ -244,6 +246,8 @@ syscall_exit_work:
 syscall_trace_entry:
!   Yes it is traced.
! XXX setup arguments...
+   mov r15, r4
+   mov #0, r5
mov.l   4f, r11 ! Call do_syscall_trace which notifies
jsr @r11! superior (will chomp R[0-7])
 nop
@@ -366,7 +370,7 @@ ENTRY(system_call)
!
get_current_thread_info r8, r10
mov.l   @(TI_FLAGS,r8), r8
-   mov #_TIF_SYSCALL_TRACE, r10
+   mov #(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT), r10
tst r10, r8
bf  syscall_trace_entry
!
diff -purN -X linux-2.6.24.rc1/Documentation/dontdiff 
linux-2.6.24.rc1.orig/arch/sh/kernel/ptrace.c 
linux-2.6.24.rc1/arch/sh/kernel/ptrace.c
--- linux-2.6.24.rc1.orig/arch/sh/kernel/ptrace.c   2007-11-06 
16:03:17.0 +0900
+++ linux-2.6.24.rc1/arch/sh/kernel/ptrace.c2007-11-07 08:46:14.0 
+0900
@@ -6,7 +6,7 @@
  * edited by Linus Torvalds
  *
  * SuperH version:   Copyright (C) 1999, 2000  Kaz Kojima & Niibe Yutaka
- *
+ * Audit support: Yuichi Nakamura <[EMAIL PROTECTED]>
  */
 #include 
 #include 
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * does not yet catch signals sent when the child dies.
@@ -248,15 +249,18 @@ long arch_ptrace(struct task_struct *chi
return ret;
 }
 
-asmlinkage void do_syscall_trace(void)
+asmlinkage void do_syscall_trace(struct pt_regs *regs, int entryexit)
 {
struct task_struct *tsk = current;
+   if (unlikely(current->audit_context) && entryexit)
+   audit_syscall_exit(AUDITSC_RESULT(regs->regs[0]),
+  regs->regs[0]);
 
if (!test_thread_flag(TIF_SYSCALL_TRACE) &&
!test_thread_flag(TIF_SINGLESTEP))
-   return;
+   goto out;
if (!(tsk->ptrace & PT_PTRACED))
-   return;
+   goto out;
/* the 0x80 provides a way for the tracing parent to distinguish
   between a syscall stop and SIGTRAP delivery */
ptrace_notify(SIGTRAP | ((current->ptrace & PT_TRACESYSGOOD) &&
@@ -271,4 +275,11 @@ asmlinkage void do_syscall_trace(void)
send_sig(tsk->exit_code, tsk, 1);
tsk->exit_code = 0;
}
+
+out:
+   if (unlikely(current->audit_context) && !entryexit)
+   audit_syscall_entry(AUDIT_ARCH_SH, regs->regs[3],
+   regs->regs[4], regs->regs[5],
+   regs->regs[6], regs->regs[7]);
+
 }
--- linux-2.6.24.rc1.orig/include/asm-sh/thread_info.h  2007-10-10 
05:31:38.0 +0900
+++ linux-2.6.24.rc1/include/asm-sh/thread_info.h   2007-11-07 
08:46:37.0 +0900
@@ -111,6 +111,7 @@ static inline struct thread_info *curren
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_RESTORE_SIGMASK	3	/* restore signal mask in do_signal() */
 #define TIF_SINGLESTEP		4	/* singlestepping active */
+#define TIF_SYSCALL_AUDIT	5
 #define TIF_USEDFPU		16	/* FPU was used by this task this quantum (SMP) */
 #define TIF_POLLING_NRFLAG	17	/* true if poll_idle() is polling TIF_NEED_RESCHED */
 #define TIF_MEMDIE		18
@@ -121,6 +122,7 @@ static inline struct thread_info *curren
 #define _TIF_NEED_RESCHED  (1

[PATCH] fat: silence warning for 64KB PAGE_SIZE builds

2007-11-06 Thread Olof Johansson

Annoying gcc warning:

fs/fat/inode.c: In function 'fat_fill_super':
fs/fat/inode.c:1222: warning: comparison is always false due to limited range of data type

logical_sector_size can never be more than 16 bits worth, but switching
it to an int silences gcc. It's a sanity check that can never fail with
64KB PAGE_SIZE but it seems like it'd still be useful for other page
sizes, so it's worth keeping:

	if (!is_power_of_2(logical_sector_size)
	    || (logical_sector_size < 512)
	    || (PAGE_CACHE_SIZE < logical_sector_size)) {
		if (!silent)
			printk(KERN_ERR "FAT: bogus logical sector size %u\n",
			       logical_sector_size);
		brelse(bh);
		goto out_invalid;
	}
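As a user-space illustration of why the warning fires (a sketch only,
not kernel code): the block below counts how many 16-bit sector sizes
the "page smaller than sector" test could ever reject for a given page
size.  The function names are made up for this example, and the page
size is passed at run time instead of using PAGE_CACHE_SIZE, so the
compiler cannot fold the comparison away.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the sector size is larger than the page size. */
int sector_exceeds_page(unsigned long page_size,
			unsigned int logical_sector_size)
{
	return page_size < logical_sector_size;
}

/* Counts how many values representable in 16 bits would be rejected. */
unsigned int count_rejected_u16(unsigned long page_size)
{
	unsigned int v, rejected = 0;

	assert(page_size > 0);
	for (v = 0; v <= UINT16_MAX; v++)
		if (sector_exceeds_page(page_size, (uint16_t)v))
			rejected++;
	return rejected;
}
```

With a 64KB page size no 16-bit value is ever rejected, which is what
gcc is pointing out; with a 4KB page size every value from 4097 up to
65535 is rejected, so the check still earns its keep there.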


Signed-off-by: Olof Johansson <[EMAIL PROTECTED]>

diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 920a576..6aae680 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -1158,7 +1158,7 @@ int fat_fill_super(struct super_block *sb, void *data, int silent,
struct buffer_head *bh;
struct fat_boot_sector *b;
struct msdos_sb_info *sbi;
-   u16 logical_sector_size;
+   int logical_sector_size;
u32 total_sectors, total_clusters, fat_clusters, rootdir_sectors;
int debug;
unsigned int media;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-06 Thread Jeff Lessem


Dan Williams wrote:
> The following patch, also attached, cleans up cases where the code looks
> at sh->ops.pending when it should be looking at the consistent
> stack-based snapshot of the operations flags.

I tried this patch (against a stock 2.6.23), and it did not work for
me.  Not only did I/O to the affected RAID5 & XFS partition stop, but
also I/O to all other disks.  I was not able to capture any debugging
information, but I should be able to do that tomorrow when I can hook
a serial console to the machine.

I'm not sure if my problem is identical to these others, as mine only
seems to manifest with RAID5+XFS.  The RAID rebuilds with no problem,
and I've not had any problems with RAID5+ext3.

>
>
> ---
>
>  drivers/md/raid5.c |   16 +---
>  1 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 496b9a3..e1a3942 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -693,7 +693,8 @@ ops_run_prexor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
>  }
>
>  static struct dma_async_tx_descriptor *
> -ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
> +ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
> +   unsigned long pending)
>  {
>int disks = sh->disks;
>int pd_idx = sh->pd_idx, i;
> @@ -701,7 +702,7 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
>/* check if prexor is active which means only process blocks
> * that are part of a read-modify-write (Wantprexor)
> */
> -  int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
> +  int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
>
>pr_debug("%s: stripe %llu\n", __FUNCTION__,
>(unsigned long long)sh->sector);
> @@ -778,7 +779,8 @@ static void ops_complete_write(void *stripe_head_ref)
>  }
>
>  static void
> -ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
> +ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
> +  unsigned long pending)
>  {
>/* kernel stack size limits the total number of disks */
>int disks = sh->disks;
> @@ -786,7 +788,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
>
>int count = 0, pd_idx = sh->pd_idx, i;
>struct page *xor_dest;
> -  int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
> +  int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
>unsigned long flags;
>dma_async_tx_callback callback;
>
> @@ -813,7 +815,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
>}
>
>/* check whether this postxor is part of a write */
> -  callback = test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending) ?
> +  callback = test_bit(STRIPE_OP_BIODRAIN, &pending) ?
>ops_complete_write : ops_complete_postxor;
>
>/* 1/ if we prexor'd then the dest is reused as a source
> @@ -901,12 +903,12 @@ static void raid5_run_ops(struct stripe_head *sh, unsigned long pending)
>tx = ops_run_prexor(sh, tx);
>
>if (test_bit(STRIPE_OP_BIODRAIN, &pending)) {
> -  tx = ops_run_biodrain(sh, tx);
> +  tx = ops_run_biodrain(sh, tx, pending);
>overlap_clear++;
>}
>
>if (test_bit(STRIPE_OP_POSTXOR, &pending))
> -  ops_run_postxor(sh, tx);
> +  ops_run_postxor(sh, tx, pending);
>
>if (test_bit(STRIPE_OP_CHECK, &pending))
>ops_run_check(sh);
>
>


[PATCH] [POWERPC] Remove sysctl warning about writable directory

2007-11-06 Thread Olof Johansson

Getting this when booting 2.6.24-rc2:

sysctl table check failed: /kernel .1 Writable sysctl directory
Call Trace:
[c2047b60] [c000e204] .show_stack+0x54/0x1f0 (unreliable)
[c2047c10] [c006ea50] .set_fail+0x60/0x90
[c2047ca0] [c006ef64] .sysctl_check_table+0x4e4/0x730
[c2047d80] [c0055860] .register_sysctl_table+0x80/0x120
[c2047e20] [c05de924] .register_powersave_nap_sysctl+0x14/0x30
[c2047e90] [c05d58c8] .kernel_init+0x1f8/0x450
[c2047f90] [c0023bac] .kernel_thread+0x4c/0x68
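For illustration, a user-space sketch of the kind of test that
produces this message.  The exact code in the sysctl checker is not
quoted here, so this assumes that a directory entry is flagged when
its mode carries any bit outside read/execute; the macro and function
names are invented for the example.

```c
#include <assert.h>

/* Read and execute bits for user, group and other (octal 0555). */
#define DEMO_READ_EXEC_BITS 0555u

/* Returns nonzero when a directory mode has any bit set outside
 * read/execute, i.e. when it would count as "writable" here. */
int demo_dir_mode_is_writable(unsigned int mode)
{
	return (mode & DEMO_READ_EXEC_BITS) != mode;
}
```

Under that assumption 0755 is flagged because of the owner write bit,
while 0555 passes.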


Signed-off-by: Olof Johansson <[EMAIL PROTECTED]>

diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c
index abd2957..c3cf0e8 100644
--- a/arch/powerpc/kernel/idle.c
+++ b/arch/powerpc/kernel/idle.c
@@ -122,7 +122,7 @@ static ctl_table powersave_nap_sysctl_root[] = {
{
.ctl_name   = CTL_KERN,
.procname   = "kernel",
-   .mode   = 0755,
+   .mode   = 0555,
.child  = powersave_nap_ctl_table,
},
{}

Re: Defense in depth: LSM modules, not a static interface

2007-11-06 Thread Casey Schaufler


--- Tetsuo Handa <[EMAIL PROTECTED]> wrote:

> Hello.
> 
> Casey Schaufler wrote:
> > Fine grained capabilities are a bonus, and there are lots of
> > people who think that it would be really nifty if there were a
> > separate capability for each "if" in the kernel. I personally
> > don't see need for more than about 20. That is a matter of taste.
> > DG/UX ended up with 330 and I say that's too many.
> 
> TOMOYO Linux has own (non-POSIX) capability that can support 65536
> capabilities
> if there *were* a separate capability for each "if" in the kernel.
>
http://svn.sourceforge.jp/cgi-bin/viewcvs.cgi/trunk/2.1.x/tomoyo-lsm/patches/tomoyo-capability.diff?root=tomoyo&view=markup
> 
> The reason I don't use POSIX capability is that the maximum types are limited
> to
> bitwidth of a variable (i.e. currently 32, or are we going to extend it to
> 64).
> This leads to abuse of CAP_SYS_ADMIN capability.

That is a matter of taste. 

> In other words, it makes fine-grained privilege division impossible.

I personally believe that a finer granularity than about 20
is too fine. I understand that this is a minority opinion.


Casey Schaufler
[EMAIL PROTECTED]

Re: Defense in depth: LSM modules, not a static interface

2007-11-06 Thread Peter Dolding

On Nov 7, 2007 2:11 PM, Tetsuo Handa <[EMAIL PROTECTED]> wrote:
> Hello.
>
> Casey Schaufler wrote:
> > Fine grained capabilities are a bonus, and there are lots of
> > people who think that it would be really nifty if there were a
> > separate capability for each "if" in the kernel. I personally
> > don't see need for more than about 20. That is a matter of taste.
> > DG/UX ended up with 330 and I say that's too many.
>
> TOMOYO Linux has own (non-POSIX) capability that can support 65536 
> capabilities
> if there *were* a separate capability for each "if" in the kernel.
> http://svn.sourceforge.jp/cgi-bin/viewcvs.cgi/trunk/2.1.x/tomoyo-lsm/patches/tomoyo-capability.diff?root=tomoyo&view=markup
>
> The reason I don't use POSIX capability is that the maximum types are limited 
> to
> bitwidth of a variable (i.e. currently 32, or are we going to extend it to 
> 64).
> This leads to abuse of CAP_SYS_ADMIN capability.
> In other words, it makes fine-grained privilege division impossible.
>
> Since security_capable() cannot receive fine-grained values,
> TOMOYO can't do fine-grained privilege division.
>
Seen same problem.  Tetsuo Handa.

Capabilities alone do not.   Capabilities make up part of the engine.

As you can see, it currently allows control by block.  Now, if
something has no network access at all, does it have filtering rules?
No, it does not.  The same goes for file access.  There are some
applications that never need to read from or write to file systems.
So why are they granted that?

These broad, area-covering controls can be provided to applications
without very much complexity.  Applications can use these features
internally to harden their security: make some sections of a program
have read-only file access, other sections read-write access, and
other sections no file access at all.  The same goes for network
access.  This is a layer that is overlooked and lacking power.

Capabilities do big blocks of security.  The bottom point of
capabilities should be a static application that loads into RAM and
runs but cannot report or allocate any memory, i.e. basically
contained, harmless and useless.

The LSM takes control of permission allocation, not enforcement, in my
model.  The enforcement is done by sections like Capabilities and
Netlabels and some filesystem part that is missing.  Other parts might
be missing too.  This really needs to be bashed out.  The Capabilities
could even tell you if those features are applied to your application.
Now an application can respond more correctly to the user: cannot
access directory because it is blocked by LSM/application security
settings, not just a failed access.

Note that Capabilities can provide a nice central point to give a
basic, quick overview of what an application can and cannot do.  This
application does not have network access and is locked that way, so
there is no need to process Netlabels.  The same goes for the filesystem.

330 is not too many if they exist for valid reasons.  20 appears to be
too few.  Most of the capabilities have been designed with the idea of
breaking up root powers.  This does not provide enough for an
application's own internal security.

It's like how you currently have an under-1024 port access switch and
a raw network access switch.  There is no mirror switch for over 1024,
so that all networking for an application could be turned off.  Also,
applications under 1024 then may not have the right to magically open
up a back door on higher, user-like ports.

On the filesystem: read, write, execute and change stat.  On memory:
allowed to allocate memory, memory map.  Device access limitation
flags.  The list is quickly getting to at least 10 more needed.

Basically, there are quite a few still missing in Capabilities that
are needed for an application's own security.  No permissions issued
through Capabilities should equal an application that is a
paperweight.  There are also missing engine parts.  Netlabels is only
one part.

Basically, Capabilities flags act as the hub, with sections like
Netlabels and other security processing engines forking off it.
Sections like Netlabels only need settings if Capabilities allows
anything in the first place.  This allows special engines for
sections, yet without having to allocate the memory when you don't
need it.

Peter Dolding

Fix for sparc64 cpu hangs.

2007-11-06 Thread David Miller


[ Bernd, Josip, and Fabio, I think I finally nailed this
  cpu hang bug we were all seeing on sparc64.  ]

[FUTEX]: Fix address computation in compat code.

compat_exit_robust_list() computes a pointer to the
futex entry in userspace as follows:

(void __user *)entry + futex_offset

'entry' is a 'struct robust_list __user *', and
'futex_offset' is a 'compat_long_t' (typically a 's32').

Things explode if the 32-bit sign bit is set in futex_offset.

Type promotion sign extends futex_offset to a 64-bit value before
adding it to 'entry'.

This triggered a problem on sparc64 running 32-bit applications which
would lock up a cpu looping forever in the fault handling for the
userspace load in handle_futex_death().

Compat userspace runs with address masking (wherein the cpu zeros out
the top 32-bits of every effective address given to a memory operation
instruction) so the sparc64 fault handler accounts for this by
zero'ing out the top 32-bits of the fault address too.

Since the kernel properly uses the compat_uptr interfaces, kernel side
accesses to compat userspace work too since they will only use
addresses with the top 32-bits clear.

Because of this compat futex layer bug we get into the following loop
when executing the get_user() load near the top of handle_futex_death():

1) load from address '0xf7f16bd8', FAULT
2) fault handler clears upper 32-bits, processes fault
   for address '0xf7f16bd8' which succeeds
3) goto #1
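
As a user-space illustration of the arithmetic described above (a
sketch, not kernel code): the block below contrasts a 64-bit add with
a sign-extended offset against a 32-bit add followed by zero
extension, which is what compat_ptr(base + futex_offset) amounts to.
The function names are invented, and the sample values mentioned
afterwards are assumptions picked to land on the 0xf7f16bd8 address
from the loop above; they are not taken from the actual hang.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy form: the signed 32-bit offset is sign-extended to 64 bits
 * before the add, so the result can have the upper 32 bits set. */
uint64_t addr_sign_extended(uint32_t entry, int32_t futex_offset)
{
	return (uint64_t)entry + (uint64_t)(int64_t)futex_offset;
}

/* Fixed form: the add is done in 32 bits (wrapping modulo 2^32) and
 * the result is then zero-extended to 64 bits. */
uint64_t addr_compat(uint32_t entry, int32_t futex_offset)
{
	uint32_t sum = entry + (uint32_t)futex_offset;

	return (uint64_t)sum;
}
```

With entry = 0x10 and futex_offset = -0x080e9438 (sign bit set),
addr_sign_extended() gives 0xfffffffff7f16bd8 while addr_compat()
gives 0xf7f16bd8.  Masking the first result to 32 bits yields the
second, which is why the fault handler "fixes" an address that the
load then never uses.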

I want to thank Bernd Zeimetz, Josip Rodin, and Fabio Massimo Di Nitto
for their tireless efforts helping me track down this bug.

Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c
index 00b5726..8089e7e 100644
--- a/kernel/futex_compat.c
+++ b/kernel/futex_compat.c
@@ -76,11 +76,16 @@ void compat_exit_robust_list(struct task_struct *curr)
 * A pending lock might already be on the list, so
 * dont process it twice:
 */
-   if (entry != pending)
-   if (handle_futex_death((void __user *)entry + futex_offset,
-   curr, pi))
-   return;
+   if (entry != pending) {
+   void __user *uaddr;
+   compat_uptr_t base;
+
+   base = ptr_to_compat(entry);
+   uaddr = compat_ptr(base + futex_offset);
 
+   if (handle_futex_death(uaddr, curr, pi))
+   return;
+   }
if (rc)
return;
uentry = next_uentry;

Re: problem in follow_hugetlb_page on ppc64 architecture with get_user_pages

2007-11-06 Thread David Gibson

On Tue, Nov 06, 2007 at 04:06:04PM +0100, Hoang-Nam Nguyen wrote:
> Hello Roland!
> > We currently see this when testing Infiniband on ppc64 with ehca +
> > hugetlbfs.
> > From reading the code this should also be an issue on other architectures.
> > Roland, Adam, are you aware of anything in this area with mellanox
> > Infiniband cards or other usages with I/O adapters?
> Below is a testcase demonstrating this problem. You need to install
> libhugetlbfs.so and run it as below:
> HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so ./hugetlb_ibtest 100
> 
> This testcase does the following steps (high level desc):
> 1. malloc two buffers each of 100MB for send and recv
> 2. register them as memory regions
> 3. create queue pair QP
> 4. send data in send buffer using QP to itself (target is then recv buffer)
> 5. compare those buffers content
> 
> It runs fine without libhugetlbfs. If you call it with libhugetlbfs as
> above, step 5 will fail. If you do memset() of the buffers before step 2
> (register mr), then it runs without errors.
> It appears that hugetlb_cow() is called when first write access is performed
> after mrs have been registered. That means the testcase is seeing other pages
> than the ones registered to the adapter...
> 
> I was able to reproduce this with mthca on 2.6.23/ppc64 and fc6/intel.

We should cut this down to the bare essentials and fold it into the
libhugetlbfs testsuite.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

Re: Defense in depth: LSM modules, not a static interface

2007-11-06 Thread Tetsuo Handa

Hello.

Casey Schaufler wrote:
> Fine grained capabilities are a bonus, and there are lots of
> people who think that it would be really nifty if there were a
> separate capability for each "if" in the kernel. I personally
> don't see need for more than about 20. That is a matter of taste.
> DG/UX ended up with 330 and I say that's too many.

TOMOYO Linux has own (non-POSIX) capability that can support 65536 capabilities
if there *were* a separate capability for each "if" in the kernel.
http://svn.sourceforge.jp/cgi-bin/viewcvs.cgi/trunk/2.1.x/tomoyo-lsm/patches/tomoyo-capability.diff?root=tomoyo&view=markup

The reason I don't use POSIX capability is that the maximum types are limited to
bitwidth of a variable (i.e. currently 32, or are we going to extend it to 64).
This leads to abuse of CAP_SYS_ADMIN capability.
In other words, it makes fine-grained privilege division impossible.

Since security_capable() cannot receive fine-grained values,
TOMOYO can't do fine-grained privilege division.

I wish the capability mechanism had a mapping layer like:

#define CAP_DIVIDED_FOO1 0
#define CAP_DIVIDED_FOO2 1
#define CAP_DIVIDED_FOO3 2
  ...
#define CAP_DIVIDED_BAR1 100
#define CAP_DIVIDED_BAR2 101
#define CAP_DIVIDED_BAR3 102

const int cap_divided_to_grouped(int cap_divided)
{
	static const int cap_mapping_array[] = {
		/* [divided index value] = POSIX compatible index value (i.e. 0-31) */
		[CAP_DIVIDED_FOO1] = 0,
		[CAP_DIVIDED_FOO2] = 0,
		[CAP_DIVIDED_FOO3] = 0,
		[CAP_DIVIDED_BAR1] = 1,
		[CAP_DIVIDED_BAR2] = 1,
		[CAP_DIVIDED_BAR3] = 1,
	};
	return cap_mapping_array[cap_divided];
}

int capable(int cap_divided)
{
	return security_capable(cap_divided);
}

int security_capable(int cap_divided)
{
	/* Allow LSM to decide based on fine-grained capability index. */
	return LSM_implementation_specific_capability_check(
			cap_divided_to_grouped(cap_divided));
}

int function_foo(void)
{
	if (!capable(CAP_DIVIDED_FOO1))
		return -EPERM;
	return 0;
}

int function_bar(void)
{
	if (!capable(CAP_DIVIDED_BAR2))
		return -EPERM;
	return 0;
}

Thanks.

Re: [PATCH] Smackv10: Smack rules grammar + their stateful parser

2007-11-06 Thread Linus Torvalds

On Wed, 7 Nov 2007, Adrian Bunk wrote:
> 
> Users are used to work on characters, not on bytes.

Adrian, stop this idiocy. I'm not interested in listening to your 
soliloquy about irrelevant stuff.

The kernel works on bytes. Deal with it. Stop whining. 

You've been told several times that all the examples you showed were 
irrelevant, and tomoyo worked on bytes too. 

You have been told several times that the VFS layer works on bytes, and 
has done so since day 1.

You have *also* been told that there is no real other option ("you can 
work with bytes, or you can go mad"). The normal kernel interfaces have to 
be locale-independent (partly because it doesn't even KNOW the locale, 
partly because locale is just totally irrelevant).

And your statement above is a TOTAL AND UTTER LIE.

More people are used to work with bytes (the C language calls them "char") 
than with what _you_ call "characters". The fact is, people are very very 
very used to working with 8-bit bytes, and there are a lot more people who 
understand them than people who understand UTF-8 (never mind any of the 
other million possible stupid and insane locales).

So can you stop your inane whining now? You can either:

 - accept that the kernel works on bytes (*) and that when we talk about 
   parsing strings, we're talking the very _traditional_ C meaning, which 
   is locale-independent, because locales DO NOT WORK in the kernel!

 - or you can continue your irrelevant ranting that has nothing to do with 
   anything, but please don't cc me any more. People already pointed out 
   to you that your assumption that "character" means something else than 
   "byte" was wrong.

Please stop this. The absolute *last* thing you want is a kernel that 
cares about locales. You *also* don't want a kernel that enforces some 
idiotic UTF-8 rules, since not everybody is using UTF-8. That way lies 
madness, not to mention totally unnecessary complexity.

Linus

(*) With some *very* rare special cases, notably in the console driver, 
and for filesystems that are forced by idiot designers to be compatible 
with crap like OS X and Windows that think that filesystems should be 
case-insensitive, which is a fundamental problem exactly because of its 
dependence on locales)


Re: [PATCH] SCSI: Add the SGPIO support for sata_nv.c

2007-11-06 Thread Yinghai Lu

On Nov 7, 2006 1:55 AM, Peer Chen <[EMAIL PROTECTED]> wrote:
> Modified and resent out the patch as attachment.
> Description about the patch:
> Add SGPIO support in sata_nv.c.
> SGPIO (Serial General Purpose Input Output) is a sideband serial 4-wire
> interface that a storage controller uses to communicate with a storage
> enclosure management controller, primarily to control activity and
> status LEDs that are located within drive bays or on a storage
> backplane. SGPIO is defined by [SFF8485].
> In this patch,we drive the LEDs to blink when read/write operation
> happen on SATA drives connect the corresponding ports on MCP55 board.
> ==
> The patch will be applied to kernel 2.6.19-rc4-git9.

Do you have one that applies to 2.6.24-rc2 or the current Linus git tree?

YH

Re: [2.6 patch] blackfin: remove dump_thread()

2007-11-06 Thread Mike Frysinger

On Nov 6, 2007 10:39 PM, Bryan Wu <[EMAIL PROTECTED]> wrote:
> On Nov 7, 2007 11:28 AM, Adrian Bunk <[EMAIL PROTECTED]> wrote:
> > On Wed, Nov 07, 2007 at 11:07:06AM +0800, Bryan Wu wrote:
> > > On 11/6/07, Adrian Bunk <[EMAIL PROTECTED]> wrote:
> > > > This patch removes the unused dump_thread().
> > >
> > > Why only remove it from Blackfin? any more reason?
> >
> > The only user is the a.out support.
> >
> > It was therefore removed prior to the blackfin merge from all
> > architectures not supporting a.out.
>
> OK, makes sense. Currently, Blackfin doesn't support a.out.

and hopefully never will
-mike

Re: Use of virtio device IDs

2007-11-06 Thread Gregory Haskins

Anthony Liguori wrote:

> 
> Right now, we would have to have every PCI vendor/device ID pair in the
> virtio PCI driver ID table for every virtio device.

I realize you guys are probably far down this road in the design
process, but FWIW: This is a major motivation for the reason that the
IOQ stuff I posted a while back used strings for device identification
instead of a fixed length, centrally managed namespace like PCI
vendor/dev-id.  Then you can just name your device something reasonably
unique (e.g. "qumranet::veth", or "ibm-pvirt-clock").
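
To make the string-naming idea concrete, here is a small user-space
sketch.  It is not the IOQ code; the struct, table and function names
are hypothetical, and only the two example ID strings come from the
text above.  It shows a driver table keyed by name, looked up with a
plain string compare instead of a numeric vendor/device pair.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical driver record keyed by a free-form string ID. */
struct demo_driver {
	const char *name;	/* device ID string this driver binds to */
};

static const struct demo_driver demo_drivers[] = {
	{ "qumranet::veth" },
	{ "ibm-pvirt-clock" },
};

/* Returns the matching driver, or NULL if no driver claims the ID. */
const struct demo_driver *demo_find_driver(const char *device_id)
{
	size_t i;

	assert(device_id != NULL);
	for (i = 0; i < sizeof(demo_drivers) / sizeof(demo_drivers[0]); i++)
		if (strcmp(demo_drivers[i].name, device_id) == 0)
			return &demo_drivers[i];
	return NULL;
}
```

The trade-off in this sketch is that uniqueness rests on naming
convention rather than on a central registry.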

(I realize that if you are going to do PCI, you need to make it
PCI-like.  But I think using PCI in the first place is probably the
wrong direction.  IMHO, there's really not a lot of reason to be
constrained by a hardware specification once you decide to go PV.  This
is even more true if you want to support as many platforms as possible
(i.e. platforms that don't have PCI natively).)

Regards,
-Greg


Re: [2.6 patch] blackfin: remove dump_thread()

2007-11-06 Thread Bryan Wu

On Nov 7, 2007 11:28 AM, Adrian Bunk <[EMAIL PROTECTED]> wrote:
> On Wed, Nov 07, 2007 at 11:07:06AM +0800, Bryan Wu wrote:
> > On 11/6/07, Adrian Bunk <[EMAIL PROTECTED]> wrote:
> > > This patch removes the unused dump_thread().
> > >
> >
> > Why only remove it from Blackfin? any more reason?
>
> The only user is the a.out support.
>
> It was therefore removed prior to the blackfin merge from all
> architectures not supporting a.out.
>

OK, makes sense. Currently, Blackfin doesn't support a.out.

Acked-by: Bryan Wu <[EMAIL PROTECTED]>



> > I found this in latest 2.6.24-rc2:
> >
> > --
> > Cscope tag: dump_thread
> >#   line  filename / context / line
> >1324  arch/alpha/kernel/process.c <>
> >  dump_thread(struct pt_regs * pt, struct user * dump)
> >2373  arch/arm/kernel/process.c <>
> >  void dump_thread(struct pt_regs * regs, struct user * dump)
> >3244  arch/blackfin/kernel/process.c <>
> >  void dump_thread(struct pt_regs *regs, struct user *dump)
> >4321  arch/m68k/kernel/process.c <>
> >  void dump_thread(struct pt_regs * regs, struct user * dump)
> >5572  arch/sparc/kernel/process.c <>
> >  void dump_thread(struct pt_regs * regs, struct user * dump)
> >6731  arch/sparc64/kernel/process.c <>
> >  void dump_thread(struct pt_regs * regs, struct user * dump)
> >7306  arch/um/kernel/process.c <>
> >  void dump_thread(struct pt_regs *regs, struct user *u)
> >8519  arch/x86/kernel/process_32.c <>
> >  void dump_thread(struct pt_regs * regs, struct user * dump)
> > --
> >
> > Regards,
> > -Bryan
>
> cu
> Adrian
>
> --
>
>"Is there not promise of rain?" Ling Tan asked suddenly out
> of the darkness. There had been need of rain for many days.
>"Only a promise," Lao Er said.
>Pearl S. Buck - Dragon Seed
>
>

Thanks a lot.
-Bryan Wu

Re: Defense in depth: LSM modules, not a static interface

2007-11-06 Thread Casey Schaufler

--- Cliffe <[EMAIL PROTECTED]> wrote:

> As good an idea POSIX capabilities might be,

Now that's a refreshing comment. Thank you.

> not all security problems 
> can be solved with a bitmap of on/off permissions.

There are people (I'm not one of them) who figure that you
can solve all the security problems by applying sufficiently
fine granularity of on/off permissions.

> Peter Dolding wrote:
> 
> 
> Ok but what happens to the principle of least privilege?
> 
> What if we want AppArmor to confine that application to use a particular 
> set of ports?
> 
> Do you propose having a capability for each port? how about protocols?

While you're at it, how about a capability for each possible
directory entry name?

> So unless my understanding of capabilities is fundamentally flawed 
> (which it may be - I have not spent time reviewing recent changes) 
> obviously Linux capabilities does not provide a solution to every problem.

Of course they don't. The only problem they are intended
to solve, and I really mean this, is the association of uid 0
with privilege. That's it. You would be better off with a single
CAP_GODLIKE than with uid 0 having all privilege all the time.
Fine grained capabilities are a bonus, and there are lots of
people who think that it would be really nifty if there were a
separate capability for each "if" in the kernel. I personally
don't see need for more than about 20. That is a matter of taste.
DG/UX ended up with 330 and I say that's too many.

Casey Schaufler
[EMAIL PROTECTED]

Re: [2.6 patch] blackfin: remove dump_thread()

2007-11-06 Thread Adrian Bunk

On Wed, Nov 07, 2007 at 11:07:06AM +0800, Bryan Wu wrote:
> On 11/6/07, Adrian Bunk <[EMAIL PROTECTED]> wrote:
> > This patch removes the unused dump_thread().
> >
> 
> Why only remove it from Blackfin? any more reason?

The only user is the a.out support.

It was therefore removed prior to the blackfin merge from all 
architectures not supporting a.out.

> I found this in latest 2.6.24-rc2:
> 
> --
> Cscope tag: dump_thread
>#   line  filename / context / line
>1324  arch/alpha/kernel/process.c <>
>  dump_thread(struct pt_regs * pt, struct user * dump)
>2373  arch/arm/kernel/process.c <>
>  void dump_thread(struct pt_regs * regs, struct user * dump)
>3244  arch/blackfin/kernel/process.c <>
>  void dump_thread(struct pt_regs *regs, struct user *dump)
>4321  arch/m68k/kernel/process.c <>
>  void dump_thread(struct pt_regs * regs, struct user * dump)
>5572  arch/sparc/kernel/process.c <>
>  void dump_thread(struct pt_regs * regs, struct user * dump)
>6731  arch/sparc64/kernel/process.c <>
>  void dump_thread(struct pt_regs * regs, struct user * dump)
>7306  arch/um/kernel/process.c <>
>  void dump_thread(struct pt_regs *regs, struct user *u)
>8519  arch/x86/kernel/process_32.c <>
>  void dump_thread(struct pt_regs * regs, struct user * dump)
> --
> 
> Regards,
> -Bryan

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed


Re: [RFC PATCH 3/10] define page_file_cache

2007-11-06 Thread Christoph Lameter

On Tue, 6 Nov 2007, Rik van Riel wrote:

> Every anonymous, tmpfs or shared memory segment page is potentially
> swap backed. That is the whole point of the PG_swapbacked flag.

One of the current issues with anonymous pages is the accounting when 
they become file backed and get dirty. There are performance issues with 
swap writeout because we are not doing it in file order and on a page by 
page basis.

Do ramfs pages count as memory backed?

> A page from a filesystem like ext3 or NFS cannot suddenly turn into
> a swap backed page.  This page "nature" is not changed during the
> lifetime of a page.

Well, COW sort of does that, but then it's a new page.


Re: [patch 09/23] SLUB: Add get() and kick() methods

2007-11-06 Thread Adrian Bunk

On Tue, Nov 06, 2007 at 07:07:15PM -0800, Christoph Lameter wrote:
> On Wed, 7 Nov 2007, Adrian Bunk wrote:
> 
> > A static inline dummy function for CONFIG_SLUB=n seems to be missing?
> 
> Correct. This patch is needed so that building with SLAB will work.
> 
> Slab defrag: Provide empty kmem_cache_setup_defrag function for SLAB.
> 
> Provide an empty function to satisfy dependencies for Slab defrag.
> 
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
> 
> ---
>  mm/slab.c |7 +++
>  1 file changed, 7 insertions(+)
> 
> Index: linux-2.6/mm/slab.c
> ===
> --- linux-2.6.orig/mm/slab.c  2007-11-06 18:57:22.0 -0800
> +++ linux-2.6/mm/slab.c   2007-11-06 18:58:40.0 -0800
> @@ -2535,6 +2535,13 @@ static int __cache_shrink(struct kmem_ca
>   return (ret ? 1 : 0);
>  }
>  
> +void kmem_cache_setup_defrag(struct kmem_cache *s,
> + void *(*get)(struct kmem_cache *, int nr, void **),
> + void (*kick)(struct kmem_cache *, int nr, void **, void *private))
> +{
> +}
> +EXPORT_SYMBOL(kmem_cache_setup_defrag);
> +

- this misses slob
- this wastes memory

An empty static inline function in slab.h would be better.
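
A minimal sketch of that suggestion, assuming a CONFIG_SLUB-style guard in a
shared header. The forward declaration and the guard layout are illustrative
assumptions, not the actual contents of slab.h:

```c
#include <stddef.h>

/* Stand-in forward declaration; in the kernel this comes from the slab headers. */
struct kmem_cache;

#ifdef CONFIG_SLUB
/* With SLUB the real implementation would be provided elsewhere (not shown). */
void kmem_cache_setup_defrag(struct kmem_cache *s,
	void *(*get)(struct kmem_cache *, int nr, void **),
	void (*kick)(struct kmem_cache *, int nr, void **, void *private));
#else
/*
 * Empty inline stub for allocators without defrag support.
 * Calls compile away, so no out-of-line function and no exported
 * symbol are needed, and every non-SLUB allocator is covered at once.
 */
static inline void kmem_cache_setup_defrag(struct kmem_cache *s,
	void *(*get)(struct kmem_cache *, int nr, void **),
	void (*kick)(struct kmem_cache *, int nr, void **, void *private))
{
	(void)s;
	(void)get;
	(void)kick;
}
#endif
```

Callers can then invoke kmem_cache_setup_defrag() unconditionally, without
their own #ifdef.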

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed


Re: 2.6.24-rc1 - Regularly getting processes stuck in D state on startup

2007-11-06 Thread Stephen Rothwell

On Wed, 7 Nov 2007 14:17:17 +1100 Stephen Rothwell <[EMAIL PROTECTED]> wrote:
>
> On Tue, Nov 06, 2007 at 04:00:06PM +0800, Fengguang Wu wrote:
> > 
> > Could you try with the attached 4 patches? Two of them are expected to
> > fix your problem, another two are debugging ones(in case the problem
> > persists).
> 
> Applying these four patches fixes it for me.  Obviously the reiserfs patch
> was not relevant in  my case (only using ext3).

I am now running on a kernel with just the
mm-speed-up-writeback-ramp-up-on-clean-systems.patch applied and I am
seeing no hangs.

-- 
Cheers,
Stephen Rothwell                    [EMAIL PROTECTED]
http://www.canb.auug.org.au/~sfr/



Re: 2.6.24-rc1 - Regularly getting processes stuck in D state on startup

2007-11-06 Thread Stephen Rothwell

On Tue, 6 Nov 2007 17:46:26 +1100 Stephen Rothwell <[EMAIL PROTECTED]> wrote:
>
> I am seeing something very similar on a PowerPC machine where copying a
> file from an LVM volume with ext3 on it to a simple scsi partition (again
> ext3) on the same disk will hang in congestion_wait.  If I am patient
> enough, the copy makes very slow progress.  A kill -9 will kill it
> eventually, but a simple control-C will not.

Turns out a simple control-C would kill the copy, I was just not patient
enough :-)

-- 
Cheers,
Stephen Rothwell                    [EMAIL PROTECTED]
http://www.canb.auug.org.au/~sfr/



Re: [PATCH] NetLabel: Introduce a new kernel configuration API for NetLabel - For 2.6.24-rc-git11 - Smack Version 10

2007-11-06 Thread Casey Schaufler


--- Casey Schaufler <[EMAIL PROTECTED]> wrote:

> 
> --- Joshua Brindle <[EMAIL PROTECTED]> wrote:
> 
> > Joshua Brindle wrote:
> > > Casey Schaufler wrote:
> > >> From: Paul Moore <[EMAIL PROTECTED]>
> > >>
> > >> Add a new set of configuration functions to the NetLabel/LSM API so that
> > >> LSMs can perform their own configuration of the NetLabel subsystem 
> > >> without
> > >> relying on assistance from userspace.
> > >>   
> > > I'm still not receiving the actual patch email on lsm (perhaps its too 
> > > long and should be split up..) so I'll just respond on this email. 
> > > Using the v10 patches on your website I'm still seeing strange 
> > > behavior where echo foo > /proc/self/attr/current changes the label of 
> > > every process on the system to foo (verified with both ps -AZ and cat 
> > > /proc/1/attr/current).
> > >
> > Actually I'm getting more strange behavior:
> > 
> > On terminal 1 I do:
> > echo foo > /proc/self/attr/current
> > then ps -AZ shows foo for every process
> > touch somefile; attr -S -g SMACK64 somefile says foo
> > 
> > On terminal 2 I do:
> > ps -AZ and everything shows up as _
> > cat /proc/$pid of bash on term 1/attr/current is _
> 
> Now this I can explain. Every task has its own correct
> label. The problem is a missing smack_getprocattr() hook.
> ps is getting the value for "current" on the current process,
> not that of the named process. Interestingly, the Smack label
> of /proc//attr/current is correct.

That's what I get for responding too quickly. There is a
smack_getprocattr() hook, it's just wrong. Fixed for the
next version. 

Thank you again. 


Casey Schaufler
[EMAIL PROTECTED]

Re: [2.6 patch] blackfin: unexport get_wchan

2007-11-06 Thread Bryan Wu

On Nov 6, 2007 1:07 AM, Adrian Bunk <[EMAIL PROTECTED]> wrote:
> This patch removes the unused EXPORT_SYMBOL(get_wchan).
>
It should be. "The only user of get_wchan I was able to find is the
proc fs - and proc can't be built modular." You said that before, right?

> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
>
Acked-by: Bryan Wu <[EMAIL PROTECTED]>

> ---
> 6acc2faa1f25d2d5fbf6c5e435c222a79f753afa
> diff --git a/arch/blackfin/kernel/bfin_ksyms.c b/arch/blackfin/kernel/bfin_ksyms.c
> index 99ea57c..5dad9d3 100644
> --- a/arch/blackfin/kernel/bfin_ksyms.c
> +++ b/arch/blackfin/kernel/bfin_ksyms.c
> @@ -65,7 +65,6 @@ EXPORT_SYMBOL(memset);
>  EXPORT_SYMBOL(memcmp);
>  EXPORT_SYMBOL(memmove);
>  EXPORT_SYMBOL(memchr);
> -EXPORT_SYMBOL(get_wchan);
>
>  /*
>   * libgcc functions - functions that are used internally by the
>

Re: 2.6.24-rc1 - Regularly getting processes stuck in D state on startup

2007-11-06 Thread Stephen Rothwell

On Tue, Nov 06, 2007 at 04:00:06PM +0800, Fengguang Wu wrote:
> 
> Could you try with the attached 4 patches? Two of them are expected to
> fix your problem, another two are debugging ones(in case the problem
> persists).

Applying these four patches fixes it for me.  Obviously the reiserfs patch was
not relevant in my case (only using ext3).

-- 
Cheers,
Stephen Rothwell                    [EMAIL PROTECTED]
http://www.canb.auug.org.au/~sfr/



Re: [RFC PATCH 3/10] define page_file_cache

2007-11-06 Thread Rik van Riel

On Tue, 6 Nov 2007 19:02:47 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Tue, 6 Nov 2007, Rik van Riel wrote:
> 
> > > I think we could add a flag to the bdi to indicate whether the backing 
> > > store is a disk file. In fact you can also deduce it if a device has
> > > no writeback capability set in the BDI.
> > > 
> > > > Unfortunately this needs to use a page flag, since the
> > > > PG_swapbacked state needs to be preserved all the way
> > > > to the point where the page is last removed from the
> > > > LRU.  Trying to derive the status from other info in
> > > > the page resulted in wrong VM statistics in earlier
> > > > split VM patchsets.
> > > 
> > > The bdi may avoid that extra flag.
> > 
> > The bdi will no longer be accessible by the time a page
> > makes it to free_hot_cold_page, which is one place in the
> > kernel where this information is needed.
> 
> At that point you need only information about which list the page
> was put on. Don't we need something like PageLRU -> PageFileLRU
> and PageMemLRU?

That is exactly why we need a page flag.  If you have a better
name for the page flag, please let me know.

Note that the kind of page needs to be separate from PageLRU,
since pages are taken off of and put back onto LRUs all the
time.
 
> The page may change its nature I think? What if a page becomes
> swap backed?

Every anonymous, tmpfs or shared memory segment page is potentially
swap backed. That is the whole point of the PG_swapbacked flag.

A page from a filesystem like ext3 or NFS cannot suddenly turn into
a swap backed page.  This page "nature" is not changed during the
lifetime of a page.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

Re: [PATCH] NetLabel: Introduce a new kernel configuration API for NetLabel - For 2.6.24-rc-git11 - Smack Version 10

2007-11-06 Thread Casey Schaufler


--- Joshua Brindle <[EMAIL PROTECTED]> wrote:

> Joshua Brindle wrote:
> > Casey Schaufler wrote:
> >> From: Paul Moore <[EMAIL PROTECTED]>
> >>
> >> Add a new set of configuration functions to the NetLabel/LSM API so that
> >> LSMs can perform their own configuration of the NetLabel subsystem 
> >> without
> >> relying on assistance from userspace.
> >>   
> > I'm still not receiving the actual patch email on lsm (perhaps its too 
> > long and should be split up..) so I'll just respond on this email. 
> > Using the v10 patches on your website I'm still seeing strange 
> > behavior where echo foo > /proc/self/attr/current changes the label of 
> > every process on the system to foo (verified with both ps -AZ and cat 
> > /proc/1/attr/current).
> >
> Actually I'm getting more strange behavior:
> 
> On terminal 1 I do:
> echo foo > /proc/self/attr/current
> then ps -AZ shows foo for every process
> touch somefile; attr -S -g SMACK64 somefile says foo
> 
> On terminal 2 I do:
> ps -AZ and everything shows up as _
> cat /proc/$pid of bash on term 1/attr/current is _

Now this I can explain. Every task has its own correct
label. The problem is a missing smack_getprocattr() hook.
ps is getting the value for "current" on the current process,
not that of the named process. Interestingly, the Smack label
of /proc//attr/current is correct.

So the fix is to put in the smack_getprocattr() hook.
Easily accomplished. Thank you for the informative and
helpful report.
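
To make the symptom concrete, here is a small userspace model. It is only a
sketch: struct task_model and both functions are hypothetical stand-ins and
do not reflect the actual Smack hook code.

```c
/* Hypothetical stand-in for a task carrying a Smack label. */
struct task_model {
	int pid;
	const char *label;
};

/*
 * Symptom described above: the label of the calling task is returned,
 * no matter which task was asked about.
 */
const char *getprocattr_symptom(const struct task_model *caller,
				const struct task_model *target)
{
	(void)target;
	return caller->label;
}

/* Expected behaviour: report the label of the task that was named. */
const char *getprocattr_expected(const struct task_model *caller,
				 const struct task_model *target)
{
	(void)caller;
	return target->label;
}
```

With a caller labelled "foo" asking about a task labelled "_", the first
function answers "foo" for every process, which matches what ps -AZ showed
on terminal 1.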


Casey Schaufler
[EMAIL PROTECTED]

Re: [patch 09/23] SLUB: Add get() and kick() methods

2007-11-06 Thread Christoph Lameter

On Wed, 7 Nov 2007, Adrian Bunk wrote:

> A static inline dummy function for CONFIG_SLUB=n seems to be missing?

Correct. This patch is needed so that building with SLAB will work.


Slab defrag: Provide empty kmem_cache_setup_defrag function for SLAB.

Provide an empty function to satisfy dependencies for Slab defrag.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/slab.c |7 +++
 1 file changed, 7 insertions(+)

Index: linux-2.6/mm/slab.c
===
--- linux-2.6.orig/mm/slab.c  2007-11-06 18:57:22.0 -0800
+++ linux-2.6/mm/slab.c   2007-11-06 18:58:40.0 -0800
@@ -2535,6 +2535,13 @@ static int __cache_shrink(struct kmem_ca
    return (ret ? 1 : 0);
 }
 
+void kmem_cache_setup_defrag(struct kmem_cache *s,
+   void *(*get)(struct kmem_cache *, int nr, void **),
+   void (*kick)(struct kmem_cache *, int nr, void **, void *private))
+{
+}
+EXPORT_SYMBOL(kmem_cache_setup_defrag);
+
 /**
  * kmem_cache_shrink - Shrink a cache.
  * @cachep: The cache to shrink.

Re: [2.6 patch] blackfin: remove dump_thread()

2007-11-06 Thread Bryan Wu

On 11/6/07, Adrian Bunk <[EMAIL PROTECTED]> wrote:
> This patch removes the unused dump_thread().
>

Why remove it only from Blackfin? Is there any other reason?
I found this in the latest 2.6.24-rc2:

--
Cscope tag: dump_thread
   #   line  filename / context / line
   1    324  arch/alpha/kernel/process.c <>
             dump_thread(struct pt_regs * pt, struct user * dump)
   2    373  arch/arm/kernel/process.c <>
             void dump_thread(struct pt_regs * regs, struct user * dump)
   3    244  arch/blackfin/kernel/process.c <>
             void dump_thread(struct pt_regs *regs, struct user *dump)
   4    321  arch/m68k/kernel/process.c <>
             void dump_thread(struct pt_regs * regs, struct user * dump)
   5    572  arch/sparc/kernel/process.c <>
             void dump_thread(struct pt_regs * regs, struct user * dump)
   6    731  arch/sparc64/kernel/process.c <>
             void dump_thread(struct pt_regs * regs, struct user * dump)
   7    306  arch/um/kernel/process.c <>
             void dump_thread(struct pt_regs *regs, struct user *u)
   8    519  arch/x86/kernel/process_32.c <>
             void dump_thread(struct pt_regs * regs, struct user * dump)
--

Regards,
-Bryan

> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
>
> ---
>
>  arch/blackfin/kernel/bfin_ksyms.c  |1
>  arch/blackfin/kernel/process.c |   45 -
>  include/asm-blackfin/bfin-global.h |1
>  3 files changed, 47 deletions(-)
>
> c7746b10ac98a3166a7257f7117dd125e5ba4c7b
> diff --git a/arch/blackfin/kernel/bfin_ksyms.c b/arch/blackfin/kernel/bfin_ksyms.c
> index 2198afe..99ea57c 100644
> --- a/arch/blackfin/kernel/bfin_ksyms.c
> +++ b/arch/blackfin/kernel/bfin_ksyms.c
> @@ -39,7 +39,6 @@
>  EXPORT_SYMBOL(__ioremap);
>  EXPORT_SYMBOL(strcmp);
>  EXPORT_SYMBOL(strncmp);
> -EXPORT_SYMBOL(dump_thread);
>
>  EXPORT_SYMBOL(ip_fast_csum);
>
> diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
> index 9124467..1a8cf33 100644
> --- a/arch/blackfin/kernel/process.c
> +++ b/arch/blackfin/kernel/process.c
> @@ -239,51 +239,6 @@ copy_thread(int nr, unsigned long clone_flags,
>  }
>
>  /*
> - * fill in the user structure for a core dump..
> - */
> -void dump_thread(struct pt_regs *regs, struct user *dump)
> -{
> -   dump->magic = CMAGIC;
> -   dump->start_code = 0;
> -   dump->start_stack = rdusp() & ~(PAGE_SIZE - 1);
> -   dump->u_tsize = ((unsigned long)current->mm->end_code) >> PAGE_SHIFT;
> -   dump->u_dsize = ((unsigned long)(current->mm->brk +
> -(PAGE_SIZE - 1))) >> PAGE_SHIFT;
> -   dump->u_dsize -= dump->u_tsize;
> -   dump->u_ssize = 0;
> -
> -   if (dump->start_stack < TASK_SIZE)
> -   dump->u_ssize =
> -   ((unsigned long)(TASK_SIZE -
> -dump->start_stack)) >> PAGE_SHIFT;
> -
> -   dump->u_ar0 = (struct user_regs_struct *)((int)&dump->regs - (int)dump);
> -
> -   dump->regs.r0 = regs->r0;
> -   dump->regs.r1 = regs->r1;
> -   dump->regs.r2 = regs->r2;
> -   dump->regs.r3 = regs->r3;
> -   dump->regs.r4 = regs->r4;
> -   dump->regs.r5 = regs->r5;
> -   dump->regs.r6 = regs->r6;
> -   dump->regs.r7 = regs->r7;
> -   dump->regs.p0 = regs->p0;
> -   dump->regs.p1 = regs->p1;
> -   dump->regs.p2 = regs->p2;
> -   dump->regs.p3 = regs->p3;
> -   dump->regs.p4 = regs->p4;
> -   dump->regs.p5 = regs->p5;
> -   dump->regs.orig_p0 = regs->orig_p0;
> -   dump->regs.a0w = regs->a0w;
> -   dump->regs.a1w = regs->a1w;
> -   dump->regs.a0x = regs->a0x;
> -   dump->regs.a1x = regs->a1x;
> -   dump->regs.rets = regs->rets;
> -   dump->regs.astat = regs->astat;
> -   dump->regs.pc = regs->pc;
> -}
> -
> -/*
>   * sys_execve() executes a new program.
>   */
>
> diff --git a/include/asm-blackfin/bfin-global.h b/include/asm-blackfin/bfin-global.h
> index 0212e18..cd924bb 100644
> --- a/include/asm-blackfin/bfin-global.h
> +++ b/include/asm-blackfin/bfin-global.h
> @@ -50,7 +50,6 @@ extern unsigned long get_sclk(void);
>  extern unsigned long sclk_to_usecs(unsigned long sclk);
>  extern unsigned long usecs_to_sclk(unsigned long usecs);
>
> -extern void dump_thread(struct pt_regs *regs, struct user *dump);
>  extern void dump_bfin_regs(struct pt_regs *fp, void *retaddr);
>  extern void dump_bfin_trace_buffer(void);
>
>

Re: [RFC PATCH 3/10] define page_file_cache

2007-11-06 Thread Christoph Lameter

On Tue, 6 Nov 2007, Rik van Riel wrote:

> > I think we could add a flag to the bdi to indicate whether the backing 
> > store is a disk file. In fact you can also deduce it if a device has
> > no writeback capability set in the BDI.
> > 
> > > Unfortunately this needs to use a page flag, since the
> > > PG_swapbacked state needs to be preserved all the way
> > > to the point where the page is last removed from the
> > > LRU.  Trying to derive the status from other info in
> > > the page resulted in wrong VM statistics in earlier
> > > split VM patchsets.
> > 
> > The bdi may avoid that extra flag.
> 
> The bdi will no longer be accessible by the time a page
> makes it to free_hot_cold_page, which is one place in the
> kernel where this information is needed.

At that point you need only information about which list the page
was put on. Don't we need something like PageLRU -> PageFileLRU
and PageMemLRU?

The page may change its nature I think? What if a page becomes
swap backed?

Re: [RFC PATCH 6/10] split anon and file LRUs

2007-11-06 Thread Rik van Riel

On Tue, 6 Nov 2007 18:28:19 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Sat, 3 Nov 2007, Rik van Riel wrote:
> 
> > Split the LRU lists in two, one set for pages that are backed by
> > real file systems ("file") and one for pages that are backed by
> > memory and swap ("anon").  The latter includes tmpfs.
> 
> If we split the memory backed from the disk backed pages then
> they are no longer competing with one another on equal terms? So the file LRU 
> may run faster than the memory LRU?

The file LRU probably *should* run faster than the memory LRU most
of the time, since we stream the readahead data for many sequentially
accessed files through the file LRU.

We adjust the rates at which the two LRUs are scanned depending on
the fraction of referenced pages found when scanning each list.
Look at vmscan.c:get_scan_ratio() for the magic.
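
As a rough illustration of the idea (not the code from the patch), the split
could be modelled like this. The struct, the function names and the exact
formula are assumptions made for this sketch; only the principle, that a list
with a lower fraction of referenced pages gets scanned harder, is taken from
the description above.

```c
/* Hypothetical per-list counters: pages scanned and pages found referenced. */
struct lru_scan_stats {
	unsigned long scanned;
	unsigned long referenced;
};

/* Scan pressure for one list: higher when fewer scanned pages were referenced. */
static unsigned long scan_pressure(const struct lru_scan_stats *s)
{
	/* The +1 terms avoid division by zero and damp tiny sample sizes. */
	return (s->scanned + 1) * 100 / (s->referenced + 1);
}

/* Percentage (0..100) of the scan effort to spend on the anon list. */
unsigned int anon_scan_percent(const struct lru_scan_stats *anon,
			       const struct lru_scan_stats *file)
{
	unsigned long ap = scan_pressure(anon);
	unsigned long fp = scan_pressure(file);

	if (ap + fp == 0)
		return 50;	/* no usable information: split evenly */
	return (unsigned int)(100 * ap / (ap + fp));
}
```

In this model, streaming file pages that are never referenced push almost all
of the scan effort onto the file list, while equal reference fractions give an
even split.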

> The patch looks awfully large.

Making it smaller would probably result in something that does
not work right.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

Re: [RFC PATCH 3/10] define page_file_cache

2007-11-06 Thread Rik van Riel

On Tue, 6 Nov 2007 18:23:44 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Sat, 3 Nov 2007, Rik van Riel wrote:
> 
> > Define page_file_cache() function to answer the question:
> > is page backed by a file?
> 
> Well its not clear what is meant by a file in the first place.
> By file you mean disk space in contrast to ram based filesystems?

Yes.  I have improved the comment over page_file_cache() a bit:

/**
 * page_file_cache(@page)
 * Returns !0 if @page is page cache page backed by a regular filesystem,
 * or 0 if @page is anonymous, tmpfs or otherwise ram or swap backed.
 *
 * We would like to get this info without a page flag, but the state
 * needs to survive until the page is last deleted from the LRU, which
 * could be as far down as __page_cache_release.
 */

> I think we could add a flag to the bdi to indicate whether the backing 
> store is a disk file. In fact you can also deduce it if a device has
> no writeback capability set in the BDI.
> 
> > Unfortunately this needs to use a page flag, since the
> > PG_swapbacked state needs to be preserved all the way
> > to the point where the page is last removed from the
> > LRU.  Trying to derive the status from other info in
> > the page resulted in wrong VM statistics in earlier
> > split VM patchsets.
> 
> The bdi may avoid that extra flag.

The bdi will no longer be accessible by the time a page
makes it to free_hot_cold_page, which is one place in the
kernel where this information is needed.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

Re: [RFC PATCH 0/10] split anon and file LRUs

2007-11-06 Thread Rik van Riel

On Tue, 6 Nov 2007 18:40:46 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Tue, 6 Nov 2007, Rik van Riel wrote:
> 
> > Also, a factor 16 increase in page size is not going to help
> > if memory sizes also increase by a factor 16, since we already 
> > have trouble with today's memory sizes.
> 
> Note that a factor 16 increase usually goes hand in hand with
> more processors. The synchronization of multiple processors becomes a 
> concern. If you have an 8p and each of them tries to get the zone locks 
> for reclaim then we are already in trouble. And given the immaturity
> of the handling of cacheline contention in current commodity hardware this 
> is likely to result in livelocks and/or starvation on some level.

Which is why we need to greatly reduce the number of pages
scanned to free a page.  In all workloads.

> > > We do not have an accepted standard load. So how would we figure that one 
> > > out?
> > 
> > The current worst case is where we need to scan all of memory, 
> > just to find a few pages we can swap out.  With the effects of
> > lock contention figured in, this can take hours on huge systems.
> 
> Right but I think this looks like a hopeless situation regardless of the 
> algorithm if you have a couple of million pages and are trying to free 
> one. Now imagine a series of processors going on the hunt for the few pages 
> that can be reclaimed.

An algorithm that only clears the referenced bit and then
moves the anonymous page from the active to the inactive
list will do a lot less work than an algorithm that needs
to scan the *whole* active list because all of the pages
on it are referenced.

This is not a theoretical situation: every anonymous page
starts out referenced!

Add in a relatively small inactive list on huge memory
systems, and we could have something of an acceptable
algorithmic complexity.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

[PATCH][VIRTIO] Fix vring_init() ring computations

2007-11-06 Thread Anthony Liguori

This patch fixes a typo in vring_init().  This happens to work today in lguest
because the sizeof(struct vring_desc) is 16 and struct vring contains 3
pointers and an unsigned int so on 32-bit
sizeof(struct vring_desc) == sizeof(struct vring).  However, this is no longer
true on 64-bit where the bug is exposed.

Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]>

diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index ac69e7b..5b88d21 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -92,8 +92,8 @@ static inline void vring_init(struct vring *vr, unsigned int num, void *p)
 {
    vr->num = num;
    vr->desc = p;
-   vr->avail = p + num*sizeof(struct vring);
-   vr->used = p + (num+1)*(sizeof(struct vring) + sizeof(__u16));
+   vr->avail = p + num*sizeof(struct vring_desc);
+   vr->used = p + (num+1)*(sizeof(struct vring_desc) + sizeof(__u16));
 }
 
 static inline unsigned vring_size(unsigned int num)
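
The size argument in the description can be checked with a small userspace
sketch. The two structs below are stand-ins written for this example (their
names and the use of void pointers are assumptions); they only mirror the
shapes described above, a 16-byte descriptor and a struct holding an unsigned
int plus three pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the descriptor: 8 + 4 + 2 + 2 = 16 bytes on common ABIs. */
struct vring_desc_sketch {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* Stand-in for the ring bookkeeping struct: an unsigned int and three pointers. */
struct vring_sketch {
	unsigned int num;
	struct vring_desc_sketch *desc;
	void *avail;
	void *used;
};

/* Byte offset of the avail ring as the code before the patch computed it. */
size_t avail_offset_before(unsigned int num)
{
	return (size_t)num * sizeof(struct vring_sketch);
}

/* Byte offset of the avail ring as the patched code computes it. */
size_t avail_offset_after(unsigned int num)
{
	return (size_t)num * sizeof(struct vring_desc_sketch);
}
```

With 4-byte pointers both functions return the same value. On a typical LP64
ABI the bookkeeping struct grows to 32 bytes including padding, so for
num = 8 the old computation lands at 256 instead of 128.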

Re: Defense in depth: LSM modules, not a static interface

2007-11-06 Thread Cliffe

As good an idea as POSIX capabilities might be, not all security problems 
can be solved with a bitmap of on/off permissions.


Peter Dolding wrote:

"AppArmor profile denies all network traffic to a specific
application"  OK, why should AppArmor be required to do this?  Would it
not be better as part of Capabilities, which is always there and is
application controllable?  It would be a security advantage if data
processing threads that don't do network access inside an application
don't have it.  Basically this feature could be done in mirror: an
"allow network access" Capabilities flag.  If it is not set, the
application cannot access the network at all.  All LSMs would be able
to use that to cut off network access to applications.  As a standard
feature of the kernel, if a new network stack or some other alteration
is done, LSM hooks would not need altering.  A lot of LSM hooks would
disappear.  The need for an LSM to monitor and run different code from
the kernel in a lot of places would also disappear.

With Capabilities, expand it to the point that applications cannot do
anything without permissions.  Both models are doable.  Restrictive
can be done in a Permissive model effectively if the starting point of
the Permissive is that you cannot do anything without permissions
being granted.  The big difference is that the Permissive model is the
kernel default.  Some LSMs are designed in conflict with the main model
of the OS.  You really only want one model from a speed point of view.


Ok but what happens to the principle of least privilege?

What if we want AppArmor to confine that application to use a particular 
set of ports?


Do you propose having a capability for each port? How about protocols?

So unless my understanding of capabilities is fundamentally flawed 
(which it may be - I have not spent time reviewing recent changes),
Linux capabilities obviously do not provide a solution to every problem.


Regards,

Cliffe.

--

Z. Cliffe Schreuders
BSc Comp Sci (Hons) & Int Comp
PhD Candidate, Casual Tutor
School of IT
Murdoch University

Re: [RFC PATCH 2/10] free swap space entries if vm_swap_full()

2007-11-06 Thread Rik van Riel

On Tue, 6 Nov 2007 18:20:44 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Sat, 3 Nov 2007, Rik van Riel wrote:
> 
> > @@ -1142,14 +1145,13 @@ force_reclaim_mapped:
> > }
> > }
> > __mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
> > +   spin_unlock_irq(&zone->lru_lock);
> > pgdeactivate += pgmoved;
> > -   if (buffer_heads_over_limit) {
> > -   spin_unlock_irq(&zone->lru_lock);
> > -   pagevec_strip(&pvec);
> > -   spin_lock_irq(&zone->lru_lock);
> > -   }
> >  
> > +   if (buffer_heads_over_limit)
> > +   pagevec_strip(&pvec);
> > pgmoved = 0;
> > +   spin_lock_irq(&zone->lru_lock);
> > while (!list_empty(&l_active)) {
> > page = lru_to_page(&l_active);
> > prefetchw_prev_lru_page(page, &l_active, flags);
> 
> Why are we dropping the lock here now? There would be less activity
> on the lru_lock if we would only drop it if necessary.

Fixed, thank you.

This will be in the next split VM series, later this week.

> > @@ -1163,6 +1165,8 @@ force_reclaim_mapped:
> > __mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
> > pgmoved = 0;
> > spin_unlock_irq(&zone->lru_lock);
> > +   if (vm_swap_full())
> > +   pagevec_swap_free(&pvec);
> > __pagevec_release(&pvec);
> > spin_lock_irq(&zone->lru_lock);
> > }
> 
> Same here. Maybe the spin_unlock and the spin_lock can go into
> pagevec_swap_free?

We need to unlock the zone->lru_lock across the
__pagevec_release(), which is why the unlock/lock
sequence was already there in the original code.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

Re: pktcdvd oops

2007-11-06 Thread Tejun Heo

Peter Osterlund wrote:
> On Tue, 6 Nov 2007, Thomas Maier wrote:
> 
>> Hello,
>>
>> have not tested it yet, but I guess the code mentioned by Peter
>> is in pkt_new_dev() that is called by pkt_setup_dev():
>>
>> /* This is safe, since we have a reference from open(). */
>> __module_get(THIS_MODULE);
>>
>>
>> So, now, there must be checks in every sysfs operation in the module
>> code,
>> to ensure that the module is still loaded?
>
> I haven't tested it either yet. What I don't understand is this: If the
> __module_get() is not safe because the module code could have already
> been unloaded, how can it possibly be made safe by adding more code to
> the pktcdvd module? If the module is unloaded, trying to execute its
> code can't be a good thing no matter what the code does.
>

sysfs itself is now out of module lifespan rules.  sysfs callbacks are
guaranteed to stay in memory while running by sysfs node removal waiting
for completion of in-flight operations before returning.  In pktcdvd's
case, class_destroy() call in pkt_sysfs_cleanup() will wait for all
in-flight sysfs r/w ops to complete.

So, even while sysfs callbacks are executing, the module beneath can die
but it will stay in memory till all the callbacks return.  You need to
test module liveness using try_module_get() (and it can fail) if you
want to grab module reference from sysfs callbacks.
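
The rule can be illustrated with a single-threaded userspace model. Everything
here is a hypothetical stand-in (the enum, the struct and the model_* names);
it only mimics the behaviour described above, where taking a reference must be
allowed to fail once the module is on its way out. The real kernel code does
this with proper synchronization, which the sketch leaves out.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a module's lifetime state. */
enum module_state_model { MODEL_LIVE, MODEL_GOING };

struct module_model {
	enum module_state_model state;
	unsigned int refcount;
};

/* Models try_module_get(): take a reference only while the module is live. */
bool model_try_get(struct module_model *m)
{
	if (m->state != MODEL_LIVE)
		return false;
	m->refcount++;
	return true;
}

/* Models module_put(): drop a reference taken earlier. */
void model_put(struct module_model *m)
{
	if (m->refcount > 0)
		m->refcount--;
}

/*
 * Models a sysfs "add" callback: returns 0 on success,
 * -1 if the module is already going away.
 */
int model_sysfs_add(struct module_model *m)
{
	if (!model_try_get(m))
		return -1;	/* refuse to create a device for a dying module */
	/* Device setup would happen here, holding the reference. */
	return 0;
}
```

An unconditional get in the callback would correspond to skipping the state
check, which is exactly the case that has to be ruled out.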

>> BTW: the bug report says:
>>
>>  Steps to reproduce:
>>
>>   modprobe pktcdvd
>>   echo 22:0 >/sys/class/pktcdvd/add
>>
>> Is there any module unload??? Why is the module not available after
>> the modprobe, but the sysfs entries, generated by the module? Confused ;)
> 
> I think the purpose of the BUG_ON in __module_get() is to catch cases
> that are unsafe, even if the call would have happened to work in this
> particular case.

The BUG_ON is detecting a valid condition here.  If you rmmod pktcdvd
after a sysfs write has begun but before __module_get() has run, the device
node will be created after the module is killed and scheduled to be unloaded.

Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH 0/10] split anon and file LRUs

2007-11-06 Thread Christoph Lameter

On Tue, 6 Nov 2007, Rik van Riel wrote:

> Also, a factor 16 increase in page size is not going to help
> if memory sizes also increase by a factor 16, since we already 
> have trouble with today's memory sizes.

Note that a factor 16 increase usually goes hand in hand with
more processors. The synchronization of multiple processors becomes a 
concern. If you have an 8p and each of them tries to get the zone locks 
for reclaim then we are already in trouble. And given the immaturity
of the handling of cacheline contention in current commodity hardware this 
is likely to result in livelocks and/or starvation on some level.

> > I think that is the most urgent issue at hand. At least for us.
> 
> For some workloads this is the most urgent change, indeed.
> Since the patches for this already exist, integrating them
> is at the top of my list.  Expect this to be integrated into
> the split VM patch series by the end of this week.

Good to hear.

> > > - switch to SEQ replacement for the anon LRU lists, so the
> > >   worst case number of pages to scan is reduced greatly.
> > 
> > No idea what that is?
> 
> See http://linux-mm.org/PageReplacementDesign

A bit sparse but limiting the scanning if we cannot do much is certainly 
the right thing to do. The percentage of memory taken up by anonymous 
pages varies depending on the load. HPC applications may consume all of 
memory with anonymous pages. But there the pain is already so bad that 
many users go to huge pages already which bypasses the VM.

> > We do not have an accepted standard load. So how would we figure that one 
> > out?
> 
> The current worst case is where we need to scan all of memory, 
> just to find a few pages we can swap out.  With the effects of
> lock contention figured in, this can take hours on huge systems.

Right but I think this looks like a hopeless situation regardless of the 
algorithm if you have a couple of million pages and are trying to free 
one. Now imagine a series of processors going on the hunt for the few pages 
that can be reclaimed.

Re: [PATCH] x86: fix cpu-hotplug regression

2007-11-06 Thread Andi Kleen

On Wednesday 07 November 2007 02:12, Andreas Herrmann wrote:

> In cases where not all CPUs are brought up during
> boot (e.g. using maxcpus and additional_cpus parameters)
> mce_cpu_callback now returns NOTIFY_BAD because
> for such CPUs cpu_data is not completely filled when
> the notifier is called. Thus mce_create_device fails right
> at its beginning:
>
> if (!mce_available(&cpu_data[cpu]))
> return -EIO;
>
> As a quick fix I suggest to check boot_cpu_data for MCE.

I guess it would be better to just move the device creation
to after the CPU has booted. AKA call mce_create_device() on CPU_ONLINE
instead.
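
Roughly what that ordering would look like, as a self-contained userspace
sketch. Everything prefixed toy_ is hypothetical and only models the idea
(per-CPU data is filled in by the CPU itself during bring-up, so a check
against it can only pass once the CPU is online); it is not the mce_64.c
code.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

/* Hypothetical event names modelled on the hotplug notifier actions. */
enum toy_cpu_event { TOY_CPU_UP_PREPARE, TOY_CPU_ONLINE, TOY_CPU_DEAD };

struct toy_cpuinfo {
	bool filled;	/* set once the CPU has booted and identified itself */
	bool has_mce;
};

static struct toy_cpuinfo toy_cpu_data[TOY_NR_CPUS];
static bool toy_mce_device[TOY_NR_CPUS];

static int toy_mce_create_device(int cpu)
{
	if (!toy_cpu_data[cpu].filled || !toy_cpu_data[cpu].has_mce)
		return -1;	/* stands in for -EIO */
	toy_mce_device[cpu] = true;
	return 0;
}

/* Returns 0 for "OK", -1 for "bad" (which would veto the bring-up). */
static int toy_mce_cpu_callback(enum toy_cpu_event event, int cpu)
{
	switch (event) {
	case TOY_CPU_ONLINE:
		/* Per-CPU data is valid by now, so the check can succeed. */
		return toy_mce_create_device(cpu);
	case TOY_CPU_DEAD:
		toy_mce_device[cpu] = false;
		return 0;
	default:
		return 0;	/* nothing to do before the CPU is up */
	}
}

/* Simulated bring-up: the CPU fills in its own data before CPU_ONLINE. */
static int toy_cpu_up(int cpu)
{
	if (toy_mce_cpu_callback(TOY_CPU_UP_PREPARE, cpu))
		return -1;
	toy_cpu_data[cpu].filled = true;
	toy_cpu_data[cpu].has_mce = true;
	return toy_mce_cpu_callback(TOY_CPU_ONLINE, cpu);
}
```

Calling toy_mce_create_device() before toy_cpu_up() fails in this model,
which mirrors the reported regression; deferring it to the online event
avoids the failure without consulting the boot CPU's data.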

-Andi

Re: [patch 09/23] SLUB: Add get() and kick() methods

2007-11-06 Thread Adrian Bunk

On Tue, Nov 06, 2007 at 05:11:39PM -0800, Christoph Lameter wrote:
> Add the two methods needed for defragmentation and add the display of the
> methods via the proc interface.
> 
> Add documentation explaining the use of these methods.
> 
> Reviewed-by: Rik van Riel <[EMAIL PROTECTED]>
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
> ---
>  include/linux/slab.h |3 +++
>  include/linux/slub_def.h |   31 +++
>  mm/slub.c|   32 ++--
>  3 files changed, 64 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6/include/linux/slab.h
> ===
> --- linux-2.6.orig/include/linux/slab.h   2007-10-17 13:35:53.0 
> -0700
> +++ linux-2.6/include/linux/slab.h2007-11-06 12:37:51.0 -0800
> @@ -56,6 +56,9 @@ struct kmem_cache *kmem_cache_create(con
>   void (*)(struct kmem_cache *, void *));
>  void kmem_cache_destroy(struct kmem_cache *);
>  int kmem_cache_shrink(struct kmem_cache *);
> +void kmem_cache_setup_defrag(struct kmem_cache *s,
> + void *(*get)(struct kmem_cache *, int nr, void **),
> + void (*kick)(struct kmem_cache *, int nr, void **, void *private));
>  void kmem_cache_free(struct kmem_cache *, void *);
>  unsigned int kmem_cache_size(struct kmem_cache *);
>  const char *kmem_cache_name(struct kmem_cache *);
>...

A static inline dummy function for CONFIG_SLUB=n seems to be missing?
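
Something along these lines is what I have in mind. This is only a sketch
that compiles standalone: the prototype is copied from the hunk above, the
#ifdef placement and the toy_cache_init() caller are my assumptions, and
the real header may want it arranged differently.

```c
#include <assert.h>
#include <stddef.h>

struct kmem_cache;	/* opaque here; the real definition lives in the allocator */

/* CONFIG_SLUB would normally come from the kernel configuration. */
#ifdef CONFIG_SLUB
void kmem_cache_setup_defrag(struct kmem_cache *s,
	void *(*get)(struct kmem_cache *, int nr, void **),
	void (*kick)(struct kmem_cache *, int nr, void **, void *private));
#else
/* Empty stub so callers build unchanged when SLUB is not selected. */
static inline void kmem_cache_setup_defrag(struct kmem_cache *s,
	void *(*get)(struct kmem_cache *, int nr, void **),
	void (*kick)(struct kmem_cache *, int nr, void **, void *private))
{
	(void)s;
	(void)get;
	(void)kick;
}
#endif

/* Hypothetical caller: identical source whether or not SLUB is built in. */
static int toy_cache_init(struct kmem_cache *s)
{
	kmem_cache_setup_defrag(s, NULL, NULL);
	return 0;
}
```

The stub keeps #ifdefs out of every cache user that later wants to
register get()/kick() methods.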

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed


Re: [PATCH 1/2] ssb: Add "ssb_pci_set_power_state" function

2007-11-06 Thread John W. Linville

Miguel,

Along with the style point Michael suggested, I'll need you to repost
both this one and the b44 patch with at least a Signed-off-by line.

http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt

Also, please include the patches as plain text using a mailer that
does not damage whitespace.

Thanks,

John

On Wed, Oct 24, 2007 at 09:31:21PM +0200, Miguel Botón wrote:
> Add "ssb_pci_set_power_state" function. This allows setting the power state of a 
> PCI device (for example b44 ethernet device).
> 
> diff -ruN linux-2.6.23/include/linux/ssb/ssb.h 
> linux-2.6.23.orig/include/linux/ssb/ssb.h
> --- linux-2.6.23.orig/include/linux/ssb/ssb.h 2007-10-24 19:02:33.0 
> +0200
> +++ linux-2.6.23/include/linux/ssb/ssb.h  2007-10-24 19:49:37.0 
> +0200
> @@ -402,6 +402,14 @@
>  {
>   pci_unregister_driver(driver);
>  }
> +
> +/* Set PCI device power state */
> +static inline
> +void ssb_pci_set_power_state(struct ssb_device *dev, pci_power_t state)
> +{
> + if(dev->bus->bustype == SSB_BUSTYPE_PCI)
> + pci_set_power_state(dev->bus->host_pci, state);
> +}
>  #endif /* CONFIG_SSB_PCIHOST */
>  
>  
> 
> -- 
>   Miguel Botón

-- 
John W. Linville
[EMAIL PROTECTED]

Re: [RFC PATCH 6/10] split anon and file LRUs

2007-11-06 Thread Christoph Lameter

On Sat, 3 Nov 2007, Rik van Riel wrote:

> Split the LRU lists in two, one set for pages that are backed by
> real file systems ("file") and one for pages that are backed by
> memory and swap ("anon").  The latter includes tmpfs.

If we split the memory backed from the disk backed pages then
they are no longer competing with one another on equal terms? So the file LRU 
may run faster than the memory LRU?

The patch looks awfully large.

Re: [RFC PATCH 0/10] split anon and file LRUs

2007-11-06 Thread Rik van Riel

On Tue, 6 Nov 2007 18:11:39 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Sat, 3 Nov 2007, Rik van Riel wrote:
> 
> > The current version only has the infrastructure.  Large changes to
> > the page replacement policy will follow later.
> 
> H.. I'd rather see where we are going.

http://linux-mm.org/PageReplacementDesign

> One other way of addressing many of these issues is to allow large page sizes
> on the LRU which will reduce the number of entities that have to be managed.

Linus seems to have vetoed that (unless I am mistaken), so the
chances of that happening soon are probably not very large.

Also, a factor 16 increase in page size is not going to help
if memory sizes also increase by a factor 16, since we already 
have trouble with today's memory sizes.

> Both approaches actually would work in tandem.

Hence, this patch series.

> > TODO:
> > - have any mlocked and ramfs pages live off of the LRU list,
> >   so we do not need to scan these pages
> 
> I think that is the most urgent issue at hand. At least for us.

For some workloads this is the most urgent change, indeed.
Since the patches for this already exist, integrating them
is at the top of my list.  Expect this to be integrated into
the split VM patch series by the end of this week.

> > - switch to SEQ replacement for the anon LRU lists, so the
> >   worst case number of pages to scan is reduced greatly.
> 
> No idea what that is?

See http://linux-mm.org/PageReplacementDesign

> > - figure out if the file LRU lists need page replacement
> >   changes to help with worst case scenarios
> 
> We do not have an accepted standard load. So how would we figure that one 
> out?

The current worst case is where we need to scan all of memory, 
just to find a few pages we can swap out.  With the effects of
lock contention figured in, this can take hours on huge systems.

In order to make the VM more scalable, we need to find acceptable
pages to swap out with low complexity in the VM.  The "worst case"
above refers to the upper bound on how much work the VM needs to
do in order to get something evicted from the page cache or swapped
out.
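
To make the "upper bound" concrete, here is a toy model in plain C. It is
not VM code: the struct, the function names and the page counts are all
made up for illustration, and it ignores locking entirely. It only counts
how many list entries must be examined to find one evictable page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
	bool evictable;	/* false models an mlocked or ramfs page */
};

/*
 * Walk a list until `want` evictable pages are found.
 * Returns the number of pages that had to be examined.
 */
static size_t toy_scan(const struct toy_page *list, size_t len, size_t want)
{
	size_t scanned = 0, found = 0;

	while (scanned < len && found < want) {
		if (list[scanned].evictable)
			found++;
		scanned++;
	}
	return scanned;
}

/* Worst case for one mixed list: the only evictable page is at the end. */
static size_t toy_mixed_list_cost(struct toy_page *pages, size_t len)
{
	for (size_t i = 0; i < len; i++)
		pages[i].evictable = (i == len - 1);
	return toy_scan(pages, len, 1);
}

/* If unevictable pages never sit on the list, it holds one page only. */
static size_t toy_filtered_list_cost(void)
{
	struct toy_page only = { .evictable = true };

	return toy_scan(&only, 1, 1);
}
```

With 1000 pages the mixed list costs 1000 examinations and the filtered
one costs 1; the gap grows linearly with memory size, before lock
contention is even considered.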

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

Re: [RFC PATCH 3/10] define page_file_cache

2007-11-06 Thread Christoph Lameter

On Sat, 3 Nov 2007, Rik van Riel wrote:

> Define page_file_cache() function to answer the question:
>   is page backed by a file?

Well, it's not clear what is meant by a file in the first place.
By file you mean disk space in contrast to ram based filesystems?

I think we could add a flag to the bdi to indicate whether the backing 
store is a disk file. In fact you can also deduce it if a device has
no writeback capability set in the BDI.

> Unfortunately this needs to use a page flag, since the
> PG_swapbacked state needs to be preserved all the way
> to the point where the page is last removed from the
> LRU.  Trying to derive the status from other info in
> the page resulted in wrong VM statistics in earlier
> split VM patchsets.

The bdi may avoid that extra flag.

Re: [RFC PATCH 2/10] free swap space entries if vm_swap_full()

2007-11-06 Thread Christoph Lameter

On Sat, 3 Nov 2007, Rik van Riel wrote:

> @@ -1142,14 +1145,13 @@ force_reclaim_mapped:
>   }
>   }
>   __mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
> + spin_unlock_irq(&zone->lru_lock);
>   pgdeactivate += pgmoved;
> - if (buffer_heads_over_limit) {
> - spin_unlock_irq(&zone->lru_lock);
> - pagevec_strip(&pvec);
> - spin_lock_irq(&zone->lru_lock);
> - }
>  
> + if (buffer_heads_over_limit)
> + pagevec_strip(&pvec);
>   pgmoved = 0;
> + spin_lock_irq(&zone->lru_lock);
>   while (!list_empty(&l_active)) {
>   page = lru_to_page(&l_active);
>   prefetchw_prev_lru_page(page, &l_active, flags);

Why are we dropping the lock here now? There would be less activity
on the lru_lock if we would only drop it if necessary.

> @@ -1163,6 +1165,8 @@ force_reclaim_mapped:
>   __mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
>   pgmoved = 0;
>   spin_unlock_irq(&zone->lru_lock);
> + if (vm_swap_full())
> + pagevec_swap_free(&pvec);
>   __pagevec_release(&pvec);
>   spin_lock_irq(&zone->lru_lock);
>   }

Same here. Maybe the spin_unlock and the spin_lock can go into
pagevec_swap_free?

Re: writeout stalls in current -git

2007-11-06 Thread David Chinner

On Wed, Nov 07, 2007 at 10:31:14AM +1100, David Chinner wrote:
> On Tue, Nov 06, 2007 at 10:53:25PM +0100, Torsten Kaiser wrote:
> > On 11/6/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > > Rather than vmstat, can you use something like iostat to show how busy 
> > > your
> > > disks are?  i.e. are we seeing RMW cycles in the raid5 or some such issue.
> > 
> > Both "vmstat 10" and "iostat -x 10" output from this test:
> > > procs ---memory-- ---swap-- -io -system-- cpu
> > >  r  b   swpd   free   buff  cache   si   so   bi   bo   in   cs us sy id wa
> > >  2  0  0 3700592  0  85424003183  108  244  2  1 95  1
> > > -> emerge reads something, don't know for sure what...
> > >  1  0  0 3665352  0  8794000   239 2  343  585  2  1 97  0
> 
> > 
> > The last 20% of the btrace look more or less completely like this, no
> > other programs do any IO...
> > 
> > 253,03   104626   526.293450729   974  C  WS 79344288 + 8 [0]
> > 253,03   104627   526.293455078   974  C  WS 79344296 + 8 [0]
> > 253,0136469   444.513863133  1068  Q  WS 154998480 + 8 [xfssyncd]
> > 253,0136470   444.513863135  1068  Q  WS 154998488 + 8 [xfssyncd]
> ^^
> Apparently we are doing synchronous writes. That would explain why
> it is slow. We shouldn't be doing synchronous writes here. I'll see if
> I can reproduce this.
> 
> 
> 
> Yes, I can reproduce the sync writes coming out of xfssyncd. I'll
> look into this further and send a patch when I have something concrete.

Ok, so it's not synchronous writes that we are doing - we're just
submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
The "synchronous" nature appears to be coming from higher level
locking when reclaiming inodes (on the flush lock). It appears that
inode write clustering is failing completely so we are writing the
same block multiple times i.e. once for each inode in the cluster we
have to write.

This must be a side effect of some other change as we haven't
changed anything in the reclaim code recently.

/me scurries off to run some tests 

Indeed it is. The patch below should fix the problem - the inode
clusters weren't getting set up properly when inodes were being
read in or allocated. This is a regression, introduced by this
mod:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=da353b0d64e070ae7c5342a0d56ec20ae9ef5cfb
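
For anyone following along, a standalone illustration of why the
comparison went wrong. The bit layout and constants here are assumed for
the example (20 bits of AG-relative inode number, 64 inodes per cluster)
and the toy_* helpers are not XFS functions; only the shape of the
comparison is taken from the code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for illustration: low 20 bits are the AG-relative inode. */
#define TOY_AGINO_BITS 20
#define TOY_INODES_PER_CLUSTER 64

static uint64_t toy_make_ino(uint32_t agno, uint32_t agino)
{
	return ((uint64_t)agno << TOY_AGINO_BITS) | agino;
}

/* Plays the role of XFS_INO_TO_AGINO(): strip the AG number. */
static uint32_t toy_ino_to_agino(uint64_t ino)
{
	return (uint32_t)(ino & ((1u << TOY_AGINO_BITS) - 1));
}

/* Rounds an AG-relative inode number down to the start of its cluster. */
static const uint32_t toy_mask = ~(uint32_t)(TOY_INODES_PER_CLUSTER - 1);

/* Old test: masks the full inode number, AG bits included. */
static bool toy_same_cluster_old(uint64_t neighbour, uint32_t first_index)
{
	return (neighbour & toy_mask) == first_index;
}

/* Fixed test: compare AG-relative numbers, like the radix tree index. */
static bool toy_same_cluster_new(uint64_t neighbour, uint32_t first_index)
{
	return (toy_ino_to_agino(neighbour) & toy_mask) == first_index;
}
```

In this model the old test still works in AG 0, where the inode number
and the AG-relative number coincide, but it never matches in any other
AG, so every inode there ends up with its own cluster.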

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/xfs_iget.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c2007-11-02 13:44:46.0 
+1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c 2007-11-07 13:08:42.534440675 +1100
@@ -248,7 +248,7 @@ finish_inode:
icl = NULL;
if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
first_index, 1)) {
-   if ((iq->i_ino & mask) == first_index)
+   if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
icl = iq->i_cluster;
}
 

Re: [RFC PATCH 1/10] move isolate_lru_page to vmscan.c

2007-11-06 Thread Christoph Lameter

Reviewed-by: Christoph Lameter <[EMAIL PROTECTED]>



Re: [RFC PATCH 0/10] split anon and file LRUs

2007-11-06 Thread Christoph Lameter

On Sat, 3 Nov 2007, Rik van Riel wrote:

> The current version only has the infrastructure.  Large changes to
> the page replacement policy will follow later.

H.. I'd rather see where we are going. One other way of addressing 
many of these issues is to allow large page sizes on the LRU which will
reduce the number of entities that have to be managed. Both approaches 
actually would work in tandem.

> TODO:
> - have any mlocked and ramfs pages live off of the LRU list,
>   so we do not need to scan these pages

I think that is the most urgent issue at hand. At least for us.

> - switch to SEQ replacement for the anon LRU lists, so the
>   worst case number of pages to scan is reduced greatly.

No idea what that is?

> - figure out if the file LRU lists need page replacement
>   changes to help with worst case scenarios

We do not have an accepted standard load. So how would we figure that one 
out?


Re: sk98lin for 2.6.23-rc1

2007-11-06 Thread Chris Stromsoe


On Tue, 6 Nov 2007, Stephen Hemminger wrote:

> On Sun, 9 Sep 2007 05:54:45 -0700 (PDT)
> Chris Stromsoe <[EMAIL PROTECTED]> wrote:
>
> > On Sat, 8 Sep 2007, Adrian Bunk wrote:
> >
> > > On Sat, Sep 08, 2007 at 01:44:20PM -0400, Bill Davidsen wrote:
> > >
> > > > Haven't tried later kernels, don't intend to, while no network is
> > > > really secure, it not really useful.
> > >
> > > You are a regular reader of linux-kernel, and therefore the sk98lin
> > > removal can hardly be a surprise for you. If you prefer whining over
> > > helping to improve the kernel that's your choice...
> >
> > I've been trying to migrate off sk98lin to skge since earlier this year,
> > without success, starting with 2.6.18 or .19.
> >
> > I have several of these cards in production using the sk98lin driver:
> >
> > fresno:~# lspci -vv -s 02:01
> > 02:01.0 Ethernet controller: SysKonnect SK-9872 Gigabit Ethernet Server Adapter 
> > (SK-NET GE-ZX dual link) (rev 11)
> >  Subsystem: SysKonnect SK-9844 Gigabit Ethernet Server Adapter (SK-NET 
> > GE-SX dual link)
> >  Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- 
> > Stepping- SERR+ FastB2B-
> >  Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- 
>
> Please test 2.6.24-rc1 (or -rc2) because there were several fixes for skge
> that made it work correctly for dual port fiber board. The worst bug in skge
> was that it configured the ram buffer incorrectly.
>
> I just submitted these for next 2.6.23.X stable release as well


I tested 2.6.24-rc1.  This series of commands

  fresno:~# modprobe skge
  fresno:~# ip li set eth2 up
  fresno:~# ip li set eth2 down
  fresno:~# ip li set eth3 up

still hard-locks the box in the same place.  Was there anything in the 
-rc2 patch for skge?




-Chris

Re: [PATCH] Smackv10: Smack rules grammar + their stateful parser

2007-11-06 Thread Adrian Bunk

On Tue, Nov 06, 2007 at 05:06:23PM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 7 Nov 2007, Adrian Bunk wrote:
> >
> > How should TOMOYO implement its "match one character" in a pattern
> > (used to allow or deny access in a name-based MAC)?
> 
> .. I think such a design is fundamentally bogus. You don't have 
> "characters". You have "bytes".

Users are used to working with characters, not bytes.

> So you either implement "match one byte", or you go crazy. It's that 
> simple.

Sure, you can limit what is possible and what not.

But there are still many pitfalls, e.g. if someone would allow the 
construct "[abc]" in patterns for matching one of these characters you'd 
have to ensure that your syntax contains explicit character delimiters 
or a pattern might match something completely different from what was 
intended.
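
A small example of that pitfall, assuming UTF-8 file names and a naive
matcher that treats the text between the brackets as a set of bytes. The
toy_* names are invented for this mail; this is not TOMOYO's or Smack's
parser.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Naive "[...]" matcher that treats the class body as a set of bytes.
 * `set` is the text between the brackets, `s` the input to test.
 */
static bool toy_class_matches_byte(const char *set, const char *s)
{
	return *s != '\0' && strchr(set, *s) != NULL;
}

/* UTF-8 encodings: U+00E4 (a-umlaut) is C3 A4, U+00F6 (o-umlaut) is C3 B6. */
static const char toy_a_umlaut[] = "\xC3\xA4";
static const char toy_o_umlaut[] = "\xC3\xB6";
```

A class that contains only a-umlaut also accepts the first byte of
o-umlaut, because both start with 0xC3. The user wrote one character, the
matcher saw two bytes, and the pattern now matches names it was never
meant to.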

My opinion is that extended parsing of non-ASCII strings will cause too 
many problems, but it seems we can only agree to disagree on this.

>   Linus

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed


Re: [PATCH] NetLabel: Introduce a new kernel configuration API for NetLabel - For 2.6.24-rc-git11 - Smack Version 10

2007-11-06 Thread Joshua Brindle


Joshua Brindle wrote:

> Casey Schaufler wrote:
> > From: Paul Moore <[EMAIL PROTECTED]>
> >
> > Add a new set of configuration functions to the NetLabel/LSM API so that
> > LSMs can perform their own configuration of the NetLabel subsystem without
> > relying on assistance from userspace.
>
> I'm still not receiving the actual patch email on lsm (perhaps its too 
> long and should be split up..) so I'll just respond on this email. 
> Using the v10 patches on your website I'm still seeing strange 
> behavior where echo foo > /proc/self/attr/current changes the label of 
> every process on the system to foo (verified with both ps -AZ and cat 
> /proc/1/attr/current).



Actually I'm getting more strange behavior:

On terminal 1 I do:
echo foo > /proc/self/attr/current
then ps -AZ shows foo for every process
touch somefile; attr -S -g SMACK64 somefile says foo

On terminal 2 I do:
ps -AZ and everything shows up as _
cat /proc/$pid of bash on term 1/attr/current is _




Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-06 Thread Dan Williams

On Tue, 2007-11-06 at 03:19 -0700, BERTRAND Joël wrote:
> Done. Here is the output obtained:

Much appreciated.
> 
> [ 1260.969314] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
> [ 1260.980606] check 5: state 0x6 toread  read 
>  write f800ffcffcc0 written 
> [ 1260.994808] check 4: state 0x6 toread  read 
>  write f800fdd4e360 written 
> [ 1261.009325] check 3: state 0x1 toread  read 
>  write  written 
> [ 1261.244478] check 2: state 0x1 toread  read 
>  write  written 
> [ 1261.270821] check 1: state 0x6 toread  read 
>  write f800ff517e40 written 
> [ 1261.312320] check 0: state 0x6 toread  read 
>  write f800fd4cae60 written 
> [ 1261.361030] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0
> [ 1261.443120] for sector 7629696, rmw=0 rcw=0
[..]

This looks as if the blocks were prepared to be written out, but were
never handled in ops_run_biodrain(), so they remain locked forever.  The
operations flags are all clear which means handle_stripe thinks nothing
else needs to be done.

The following patch, also attached, cleans up cases where the code looks
at sh->ops.pending when it should be looking at the consistent
stack-based snapshot of the operations flags.
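
To illustrate the difference outside the driver, here is a toy model. The
names are hypothetical and the "concurrent update" is simulated by a plain
assignment; it is not a claim about exactly how or when the live flags
change in raid5.c, only about why helpers should decide from the same
snapshot their caller acts on.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_OP_PREXOR   (1UL << 0)
#define TOY_OP_BIODRAIN (1UL << 1)

struct toy_stripe {
	unsigned long ops_pending;	/* live flags; other contexts may update them */
};

/* Fragile: re-reads the live field after the caller took its snapshot. */
static bool toy_prexor_from_live(const struct toy_stripe *sh)
{
	return sh->ops_pending & TOY_OP_PREXOR;
}

/* Robust: decides from the same snapshot the caller is acting on. */
static bool toy_prexor_from_snapshot(unsigned long pending)
{
	return pending & TOY_OP_PREXOR;
}

/*
 * Simulate the window: take a snapshot, let the live flags change,
 * then report whether the two helpers still agree.
 */
static bool toy_helpers_agree(struct toy_stripe *sh)
{
	unsigned long pending = sh->ops_pending;	/* snapshot */

	sh->ops_pending &= ~TOY_OP_PREXOR;	/* stands in for a concurrent update */
	return toy_prexor_from_live(sh) == toy_prexor_from_snapshot(pending);
}
```

Once the live word changes after the snapshot, the two helpers disagree,
and a helper that trusts the live word can skip blocks the caller expected
it to process.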


---

 drivers/md/raid5.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 496b9a3..e1a3942 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -693,7 +693,8 @@ ops_run_prexor(struct stripe_head *sh, struct 
dma_async_tx_descriptor *tx)
 }
 
 static struct dma_async_tx_descriptor *
-ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
+ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
+unsigned long pending)
 {
int disks = sh->disks;
int pd_idx = sh->pd_idx, i;
@@ -701,7 +702,7 @@ ops_run_biodrain(struct stripe_head *sh, struct 
dma_async_tx_descriptor *tx)
/* check if prexor is active which means only process blocks
 * that are part of a read-modify-write (Wantprexor)
 */
-   int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+   int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
 
pr_debug("%s: stripe %llu\n", __FUNCTION__,
(unsigned long long)sh->sector);
@@ -778,7 +779,8 @@ static void ops_complete_write(void *stripe_head_ref)
 }
 
 static void
-ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
+ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
+   unsigned long pending)
 {
/* kernel stack size limits the total number of disks */
int disks = sh->disks;
@@ -786,7 +788,7 @@ ops_run_postxor(struct stripe_head *sh, struct 
dma_async_tx_descriptor *tx)
 
int count = 0, pd_idx = sh->pd_idx, i;
struct page *xor_dest;
-   int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+   int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
unsigned long flags;
dma_async_tx_callback callback;
 
@@ -813,7 +815,7 @@ ops_run_postxor(struct stripe_head *sh, struct 
dma_async_tx_descriptor *tx)
}
 
/* check whether this postxor is part of a write */
-   callback = test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending) ?
+   callback = test_bit(STRIPE_OP_BIODRAIN, &pending) ?
ops_complete_write : ops_complete_postxor;
 
/* 1/ if we prexor'd then the dest is reused as a source
@@ -901,12 +903,12 @@ static void raid5_run_ops(struct stripe_head *sh, 
unsigned long pending)
tx = ops_run_prexor(sh, tx);
 
if (test_bit(STRIPE_OP_BIODRAIN, &pending)) {
-   tx = ops_run_biodrain(sh, tx);
+   tx = ops_run_biodrain(sh, tx, pending);
overlap_clear++;
}
 
if (test_bit(STRIPE_OP_POSTXOR, &pending))
-   ops_run_postxor(sh, tx);
+   ops_run_postxor(sh, tx, pending);
 
if (test_bit(STRIPE_OP_CHECK, &pending))
ops_run_check(sh);

raid5: fix unending write sequence

From: Dan Williams <[EMAIL PROTECTED]>



[GIT PULL] FireWire update

2007-11-06 Thread Stefan Richter

Linus, please pull from the for-linus branch at

git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6.git 
for-linus

to receive the following fix for a regression since 2.6.24-rc1.
(Or apply from this e-mail.)

 drivers/firewire/fw-sbp2.c |   11 +++
 1 files changed, 7 insertions(+), 4 deletions(-)

Stefan Richter (1):
  firewire: fw-sbp2: fix refcounting


Full log and diff:

commit 7c45d1913f0a1d597eb4bc3b2c962bc2967da9ea
Author: Stefan Richter <[EMAIL PROTECTED]>
Date:   Wed Nov 7 01:11:56 2007 +0100

firewire: fw-sbp2: fix refcounting

Since patch "fw-sbp2: use an own workqueue (fix system responsiveness)"
increased parallelism between fw-sbp2 and fw-core, it was possible that
fw-sbp2 didn't release the SCSI device when the FireWire device was
disconnected.

This happened if sbp2_update() ran during sbp2_login(), because a bus
reset occurred during sbp2_login().  The sbp2_login() work would [try
to] reschedule itself because it failed due to the bus reset, and it
would _not_ drop its reference on the target.  However, sbp2_update()
would schedule sbp2_login() too before sbp2_login() rescheduled itself
and hence sbp2_update() would take an additional reference.  And then
we would have one reference too many.

The fix is to _always_ drop the reference when leaving the sbp2_login()
work.  If the sbp2_login() work reschedules itself, it takes a
reference, but only if it wasn't already rescheduled by sbp2_update().

Ditto in the sbp2_reconnect() work.

The resulting code is actually simpler than before:  We _always_ take
a reference when successfully scheduling work.  And we _always_ drop
a reference when leaving a workqueue job.  No exceptions.
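
The two rules can be checked with a small userspace model. The struct and
the toy_* functions are hypothetical; toy_queue_work() only imitates the
property relied on here, namely that queueing reports failure when the
work is already pending.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_target {
	int refs;		/* stands in for the kref on the target */
	bool work_pending;	/* stands in for the delayed work's pending bit */
};

/* Like queue_delayed_work(): reports failure if the work is already queued. */
static bool toy_queue_work(struct toy_target *t)
{
	if (t->work_pending)
		return false;
	t->work_pending = true;
	return true;
}

/* Rule 1: take a reference only when the work was actually queued. */
static void toy_schedule(struct toy_target *t)
{
	if (toy_queue_work(t))
		t->refs++;
}

/*
 * A login job that fails and wants to retry. `update_raced` models
 * sbp2_update() queueing the work again while this job is running.
 */
static void toy_login_job(struct toy_target *t, bool update_raced)
{
	t->work_pending = false;	/* the job is now running */
	if (update_raced)
		toy_schedule(t);	/* the racing update queues a retry first */
	toy_schedule(t);		/* our own retry after the failed login */
	t->refs--;			/* Rule 2: always drop on the way out */
}
```

Starting from one base reference plus one for the queued job, the count
comes out the same whether or not the update raced with the login: one
reference per pending work item, none leaked.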

Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>

diff --git a/drivers/firewire/fw-sbp2.c b/drivers/firewire/fw-sbp2.c
index 5596df6..624ff3e 100644
--- a/drivers/firewire/fw-sbp2.c
+++ b/drivers/firewire/fw-sbp2.c
@@ -650,13 +650,14 @@ static void sbp2_login(struct work_struct *work)
if (sbp2_send_management_orb(lu, node_id, generation,
SBP2_LOGIN_REQUEST, lu->lun, &response) < 0) {
if (lu->retries++ < 5) {
-   queue_delayed_work(sbp2_wq, &lu->work,
-  DIV_ROUND_UP(HZ, 5));
+   if (queue_delayed_work(sbp2_wq, &lu->work,
+  DIV_ROUND_UP(HZ, 5)))
+   kref_get(&lu->tgt->kref);
} else {
fw_error("failed to login to %s LUN %04x\n",
 unit->device.bus_id, lu->lun);
-   kref_put(&lu->tgt->kref, sbp2_release_target);
}
+   kref_put(&lu->tgt->kref, sbp2_release_target);
return;
}
 
@@ -914,7 +915,9 @@ static void sbp2_reconnect(struct work_struct *work)
lu->retries = 0;
PREPARE_DELAYED_WORK(&lu->work, sbp2_login);
}
-   queue_delayed_work(sbp2_wq, &lu->work, DIV_ROUND_UP(HZ, 5));
+   if (queue_delayed_work(sbp2_wq, &lu->work, DIV_ROUND_UP(HZ, 5)))
+   kref_get(&lu->tgt->kref);
+   kref_put(&lu->tgt->kref, sbp2_release_target);
return;
}
 

-- 
Stefan Richter
-=-=-=== =-== --===
http://arcgraph.de/sr/


[PATCH] x86: fix cpu-hotplug regression

2007-11-06 Thread Andreas Herrmann

[PATCH] x86: fix cpu-hotplug regression

Commit d435d862baca3e25e5eec236762a43251b1e7ffc
("cpu hotplug: mce: fix cpu hotplug error handling")
changed the error handling in mce_cpu_callback.

In cases where not all CPUs are brought up during
boot (e.g. using maxcpus and additional_cpus parameters)
mce_cpu_callback now returns NOTIFY_BAD because
for such CPUs cpu_data is not completely filled when
the notifier is called. Thus mce_create_device fails right
at its beginning:

if (!mce_available(&cpu_data[cpu]))
return -EIO;

As a quick fix I suggest to check boot_cpu_data for MCE.

To reproduce this regression:

(1) boot with maxcpus=2 additional_cpus=2 on a 4 CPU x86-64 system
(2) # echo 1 >/sys/devices/system/cpu/cpu2/online
  -bash: echo: write error: Invalid argument

dmesg shows:

_cpu_up: attempt to bring up CPU 2 failed

Signed-off-by: Andreas Herrmann <[EMAIL PROTECTED]>
---
 arch/x86/kernel/cpu/mcheck/mce_64.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce_64.c 
b/arch/x86/kernel/cpu/mcheck/mce_64.c
index b9f802e..5112a70 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_64.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_64.c
@@ -808,7 +808,7 @@ static __cpuinit int mce_create_device(unsigned int cpu)
int err;
int i;
 
-   if (!mce_available(&cpu_data(cpu)))
+   if (!mce_available(&boot_cpu_data))
return -EIO;
 
memset(&per_cpu(device_mce, cpu).kobj, 0, sizeof(struct kobject));
-- 
1.5.3.4





Re: [PATCH]bluetooth rfcomm_dev refcount bug fix

2007-11-06 Thread Dave Young

On Nov 6, 2007 9:33 PM, Marcel Holtmann <[EMAIL PROTECTED]> wrote:
> Hi Dave,
>
> > I'm afraid to be considered as spam ;)
> >
> > (Due to the timezone offset, I have to mail again and can't wait for your
> > reply, sorry for the annoyance)
>
> I am in a different timezone every other week. So nevermind ;)
>
> > I think the rfcomm_dev_put could be separated from the rfcomm_dev_put,
> > it will be more straightforward then.
> >
> > please consider below patch, tested on my side. thanks.
>
> That one looks totally wrong to me. Without even testing it, it will
> have side effects that you haven't run into yet. Unless the TTY core
> changed so much, these comments are there for a really good reason and
> the code is tested a lot.
What side effects?

Anyway, the refcnt is wrong in rfcomm_release_dev. We should either
remove the rfcomm_dev_del in rfcomm_tty_hangup or remove the
rfcomm_dev_put at the end of rfcomm_release_dev. Otherwise the rfcomm_dev
will be destroyed, and the later rfcomm_tty_close callback could
cause an oops.

>
> Also if you have to do two rfcomm_dev_put() in a row, then we are doing
> something really wrong and this tries to hide a real bug somewhere.
One is for device deletion (1->0), one is for the get/put pairs,
which is actually the same as before.

The main reason for doing it this way in the patch is that if I only remove
the last rfcomm_dev_put in rfcomm_release_dev, the code reads as "get device
<--> del device". So I replace it with setting the deletion flag and then
putting the device, which reads as "get device <--> put device" and is more
straightforward.
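
The counting argument above can be sketched with a toy reference counter.
This is not the patch under discussion and not the real struct rfcomm_dev;
toy_dev and its helpers are hypothetical. The sketch assumes the device list
owns one reference, that deletion drops that reference at most once (guarded
by a flag), and that every lookup is paired with exactly one put.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy device, not the real struct rfcomm_dev. */
struct toy_dev {
    int refcnt;      /* starts at 1: the reference owned by the device list */
    bool deleted;    /* set once the list reference has been dropped */
    bool destroyed;  /* set when the count reaches zero */
};

static void toy_dev_init(struct toy_dev *d)
{
    d->refcnt = 1;
    d->deleted = false;
    d->destroyed = false;
}

/* "get device": taken by every lookup. */
static void toy_dev_hold(struct toy_dev *d)
{
    assert(!d->destroyed);
    d->refcnt++;
}

/* "put device": pairs with toy_dev_hold(). */
static void toy_dev_put(struct toy_dev *d)
{
    assert(d->refcnt > 0);
    if (--d->refcnt == 0)
        d->destroyed = true;
}

/* Drop the list's reference exactly once, however often deletion is asked for. */
static void toy_dev_del(struct toy_dev *d)
{
    if (d->deleted)
        return;
    d->deleted = true;
    toy_dev_put(d);
}
```

If deletion is requested twice (say, once from a hangup path and once from a
release path), the flag keeps the second request from dropping a reference
that a later close callback still relies on.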

>
> Regards
>
> Marcel
>
>
>

[patch 20/23] dentries: Add constructor

2007-11-06 Thread Christoph Lameter

In order to support defragmentation on the dentry cache we need to have
a determined object state at all times. Without a constructor the object
would have a random state after allocation.

So provide a constructor.

Reviewed-by: Rik van Riel <[EMAIL PROTECTED]>
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 fs/dcache.c |   26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

Index: linux-2.6/fs/dcache.c
===
--- linux-2.6.orig/fs/dcache.c  2007-11-06 12:56:56.0 -0800
+++ linux-2.6/fs/dcache.c   2007-11-06 12:57:01.0 -0800
@@ -870,6 +870,16 @@ static struct shrinker dcache_shrinker =
.seeks = DEFAULT_SEEKS,
 };
 
+void dcache_ctor(struct kmem_cache *s, void *p)
+{
+   struct dentry *dentry = p;
+
+   spin_lock_init(&dentry->d_lock);
+   dentry->d_inode = NULL;
+   INIT_LIST_HEAD(&dentry->d_lru);
+   INIT_LIST_HEAD(&dentry->d_alias);
+}
+
 /**
  * d_alloc -   allocate a dcache entry
  * @parent: parent of entry to allocate
@@ -907,8 +917,6 @@ struct dentry *d_alloc(struct dentry * p
 
atomic_set(&dentry->d_count, 1);
dentry->d_flags = DCACHE_UNHASHED;
-   spin_lock_init(&dentry->d_lock);
-   dentry->d_inode = NULL;
dentry->d_parent = NULL;
dentry->d_sb = NULL;
dentry->d_op = NULL;
@@ -918,9 +926,7 @@ struct dentry *d_alloc(struct dentry * p
dentry->d_cookie = NULL;
 #endif
INIT_HLIST_NODE(&dentry->d_hash);
-   INIT_LIST_HEAD(&dentry->d_lru);
INIT_LIST_HEAD(&dentry->d_subdirs);
-   INIT_LIST_HEAD(&dentry->d_alias);
 
if (parent) {
dentry->d_parent = dget(parent);
@@ -2096,14 +2102,10 @@ static void __init dcache_init(void)
 {
int loop;
 
-   /* 
-* A constructor could be added for stable state like the lists,
-* but it is probably not worth it because of the cache nature
-* of the dcache. 
-*/
-   dentry_cache = KMEM_CACHE(dentry,
-   SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
-   
+   dentry_cache = kmem_cache_create("dentry_cache", sizeof(struct dentry),
+   0, SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD,
+   dcache_ctor);
+
register_shrinker(&dcache_shrinker);
 
/* Hash may have been set up in dcache_init_early */
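
Why a constructor gives a "determined state at all times" can be sketched
with a toy object pool. This is a user-space illustration under stated
assumptions, not the slab allocator: toy_cache, toy_obj and the helpers are
hypothetical. The constructor runs once when a slot is created, not on every
allocation, so users must return objects with the constructed fields intact.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4

struct toy_obj {
    int lock_ready;  /* stands in for the spin_lock_init() state */
    void *inode;     /* must be NULL whenever the object is free */
    int count;       /* per-allocation state, set by the caller */
};

struct toy_cache {
    struct toy_obj slots[POOL_SIZE];
    int in_use[POOL_SIZE];
    int ctor_calls;
};

/* Constructor: establishes the state that holds even while the object is free. */
static void toy_ctor(struct toy_obj *o)
{
    o->lock_ready = 1;
    o->inode = NULL;
}

/* Construct every slot exactly once, when the pool is set up. */
static void toy_cache_init(struct toy_cache *c)
{
    c->ctor_calls = 0;
    for (int i = 0; i < POOL_SIZE; i++) {
        c->in_use[i] = 0;
        c->slots[i].count = 0;
        toy_ctor(&c->slots[i]);
        c->ctor_calls++;
    }
}

/* Allocation hands out a slot without running the constructor again. */
static struct toy_obj *toy_alloc(struct toy_cache *c)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        if (!c->in_use[i]) {
            c->in_use[i] = 1;
            return &c->slots[i];
        }
    }
    return NULL;
}

static void toy_free(struct toy_cache *c, struct toy_obj *o)
{
    ptrdiff_t i = o - c->slots;

    assert(i >= 0 && i < POOL_SIZE && c->in_use[i]);
    /* The caller must hand the object back in its constructed state. */
    assert(o->inode == NULL);
    c->in_use[i] = 0;
}
```

Because the constructed fields are valid even for free slots, code that
inspects an arbitrary object in the pool (as defragmentation would) never
sees uninitialized list heads or locks.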


[patch 22/23] SLUB: Slab reclaim through Lumpy reclaim

2007-11-06 Thread Christoph Lameter

Create the functions kmem_cache_isolate_slab() and kmem_cache_reclaim()
to support lumpy reclaim.

In order to isolate pages we will have to handle slab page allocations in
such a way that we can determine if a slab is valid whenever we access it,
regardless of where it is in its lifetime.

A valid slab that can be freed has PageSlab(page) and page->inuse > 0 set.
So we need to make sure in allocate_slab() that page->inuse is zero before
PageSlab is set.

kmem_cache_isolate_slab() is called from lumpy reclaim to isolate pages
neighboring a page cache page that is being reclaimed. Lumpy reclaim will
gather the slabs and call kmem_cache_reclaim() on the list.

This means that we can remove a slab in order to be able to coalesce
a higher order page.

Reviewed-by: Rik van Riel <[EMAIL PROTECTED]>
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 include/linux/slab.h |2 +
 mm/slab.c|   13 ++
 mm/slub.c|  102 ---
 mm/vmscan.c  |   13 +-
 4 files changed, 123 insertions(+), 7 deletions(-)

Index: linux-2.6/include/linux/slab.h
===
--- linux-2.6.orig/include/linux/slab.h 2007-11-06 13:50:47.0 -0800
+++ linux-2.6/include/linux/slab.h  2007-11-06 13:50:54.0 -0800
@@ -64,6 +64,8 @@ unsigned int kmem_cache_size(struct kmem
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
 int kmem_cache_defrag(int node);
+int kmem_cache_isolate_slab(struct page *);
+int kmem_cache_reclaim(struct list_head *);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
Index: linux-2.6/mm/slab.c
===
--- linux-2.6.orig/mm/slab.c2007-11-06 13:50:33.0 -0800
+++ linux-2.6/mm/slab.c 2007-11-06 13:50:54.0 -0800
@@ -2559,6 +2559,19 @@ int kmem_cache_defrag(int node)
return 0;
 }
 
+/*
+ * SLAB does not support slab defragmentation
+ */
+int kmem_cache_isolate_slab(struct page *page)
+{
+   return -ENOSYS;
+}
+
+int kmem_cache_reclaim(struct list_head *zaplist)
+{
+   return 0;
+}
+
 /**
  * kmem_cache_destroy - delete a cache
  * @cachep: the cache to destroy
Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c2007-11-06 13:50:40.0 -0800
+++ linux-2.6/mm/slub.c 2007-11-06 13:50:54.0 -0800
@@ -1088,18 +1088,19 @@ static noinline struct page *new_slab(st
page = allocate_slab(s,
flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
if (!page)
-   goto out;
+   return NULL;
 
n = get_node(s, page_to_nid(page));
if (n)
atomic_long_inc(&n->nr_slabs);
+
+   page->inuse = 0;
page->slab = s;
-   state = 1 << PG_slab;
+   state = page->flags | (1 << PG_slab);
if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
SLAB_STORE_USER | SLAB_TRACE))
state |= SLABDEBUG;
 
-   page->flags |= state;
start = page_address(page);
page->end = start + 1;
 
@@ -1116,8 +1117,13 @@ static noinline struct page *new_slab(st
set_freepointer(s, last, page->end);
 
page->freelist = start;
-   page->inuse = 0;
-out:
+
+   /*
+* page->inuse must be 0 when PageSlab(page) becomes
+* true so that defrag knows that this slab is not in use.
+*/
+   smp_wmb();
+   page->flags = state;
return page;
 }
 
@@ -2622,6 +2628,92 @@ out:
 }
 #endif
 
+
+/*
+ * Check if the given state is that of a reclaimable slab page.
+ *
+ * This is only true if this is indeed a slab page and if
+ * the page has not been frozen.
+ */
+static inline int reclaimable_slab(unsigned long state)
+{
+   if (!(state & (1 << PG_slab)))
+   return 0;
+
+   if (state & FROZEN)
+   return 0;
+
+   return 1;
+}
+
+ /*
+ * Isolate page from the slab partial lists. Return 0 if successful.
+ *
+ * After isolation the LRU field can be used to put the page onto
+ * a reclaim list.
+ */
+int kmem_cache_isolate_slab(struct page *page)
+{
+   unsigned long flags;
+   struct kmem_cache *s;
+   int rc = -ENOENT;
+   unsigned long state;
+
+   /*
+* Avoid attempting to isolate the slab pages if there are
+* indications that this will not be successful.
+*/
+   if (!reclaimable_slab(page->flags) || page_count(page) == 1)
+   return rc;
+
+   /*
+* Get a reference to the page. Return if it's freed or being freed.
+* This is necessary to make sure that the page does not vanish
+* from under us before we are able to check the result.
+*/
+   if (!get_page_unless_zero(page))
+   return rc;
+
+
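
The ordering requirement from the changelog (page->inuse must be zero before
PageSlab becomes visible) can be sketched with C11 atomics. This is a
user-space analogy, not the kernel code: toy_page and the helpers are
hypothetical, and a release store stands in for smp_wmb() followed by the
plain store to page->flags.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PG_SLAB 1u

struct toy_page {
    atomic_uint flags;
    unsigned int inuse;
};

/* Writer: initialise the counter first, then publish the slab flag. */
static void toy_publish_slab(struct toy_page *page)
{
    assert(page != NULL);
    page->inuse = 0;
    /* Release ordering: the inuse store is visible before the flag is. */
    atomic_store_explicit(&page->flags, TOY_PG_SLAB, memory_order_release);
}

/* Reader: a candidate is a published slab that has objects in use. */
static bool toy_reclaim_candidate(struct toy_page *page)
{
    unsigned int flags =
        atomic_load_explicit(&page->flags, memory_order_acquire);

    if (!(flags & TOY_PG_SLAB))
        return false;
    return page->inuse > 0;
}
```

A reader that observes the flag is guaranteed to see inuse as written by the
allocator, never the stale value left behind by the page's previous user.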

[patch 23/23] SLUB: Add SlabReclaimable() to avoid repeated reclaim attempts

2007-11-06 Thread Christoph Lameter

Add a flag RECLAIMABLE to be set on slabs with a defragmentation method.

Clear the flag if a reclaim action is not successful in reducing the
number of objects in a slab.

The reclaim flag is set again when all objects of the slab have been
allocated and it is removed from the partial lists.

Reviewed-by: Rik van Riel <[EMAIL PROTECTED]>
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 mm/slub.c |   20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c2007-11-06 17:06:46.0 -0800
+++ linux-2.6/mm/slub.c 2007-11-06 17:07:54.0 -0800
@@ -102,6 +102,7 @@
 
 #define FROZEN (1 << PG_active)
 #define LOCKED (1 << PG_locked)
+#define RECLAIMABLE (1 << PG_dirty)
 
 #ifdef CONFIG_SLUB_DEBUG
 #define SLABDEBUG (1 << PG_error)
@@ -1100,6 +1101,8 @@ static noinline struct page *new_slab(st
if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
SLAB_STORE_USER | SLAB_TRACE))
state |= SLABDEBUG;
+   if (s->kick)
+   state |= RECLAIMABLE;
 
start = page_address(page);
page->end = start + 1;
@@ -1176,6 +1179,7 @@ static void discard_slab(struct kmem_cac
 
atomic_long_dec(&n->nr_slabs);
reset_page_mapcount(page);
+   page->flags &= ~RECLAIMABLE;
__ClearPageSlab(page);
free_slab(s, page);
 }
@@ -1408,8 +1412,11 @@ static void unfreeze_slab(struct kmem_ca
 
if (page->freelist != page->end)
add_partial(s, page, tail);
-   else
+   else {
add_full(s, page, state);
+   if (s->kick && !(state & RECLAIMABLE))
+   state |= RECLAIMABLE;
+   }
slab_unlock(page, state);
 
} else {
@@ -2633,7 +2640,7 @@ out:
  * Check if the given state is that of a reclaimable slab page.
  *
  * This is only true if this is indeed a slab page and if
- * the page has not been frozen.
+ * the page has not been frozen or marked as unreclaimable.
  */
 static inline int reclaimable_slab(unsigned long state)
 {
@@ -2643,7 +2650,7 @@ static inline int reclaimable_slab(unsig
if (state & FROZEN)
return 0;
 
-   return 1;
+   return state & RECLAIMABLE;
 }
 
  /*
@@ -2958,6 +2965,8 @@ out:
 * Check the result and unfreeze the slab
 */
leftover = page->inuse;
+   if (leftover)
+   state &= ~RECLAIMABLE;
unfreeze_slab(s, page, leftover > 0, state);
local_irq_restore(flags);
return leftover;
@@ -3012,6 +3021,11 @@ static unsigned long __kmem_cache_shrink
if (!state)
continue;
 
+   if (!(state & RECLAIMABLE)) {
+   slab_unlock(page, state);
+   continue;
+   }
+
if (page->inuse) {
 
list_move(&page->lru, &zaplist);
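
The lifecycle of the RECLAIMABLE flag in this patch can be sketched as a
small state machine. The sketch below is a user-space model with hypothetical
names (TOY_* bits, toy_* helpers); it only mirrors the three transitions
described in the changelog and the check in reclaimable_slab().

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SLAB        (1u << 0)
#define TOY_FROZEN      (1u << 1)
#define TOY_RECLAIMABLE (1u << 2)

/* New slab: reclaimable only if the cache has a kick (defrag) method. */
static unsigned int toy_new_slab(bool has_kick)
{
    unsigned int state = TOY_SLAB;

    if (has_kick)
        state |= TOY_RECLAIMABLE;
    return state;
}

/* After a reclaim attempt: objects left over means do not retry for now. */
static unsigned int toy_after_reclaim(unsigned int state, unsigned int leftover)
{
    assert(state & TOY_SLAB);
    if (leftover)
        state &= ~TOY_RECLAIMABLE;
    return state;
}

/* Slab became full and left the partial list: allow reclaim again. */
static unsigned int toy_slab_full(unsigned int state, bool has_kick)
{
    if (has_kick)
        state |= TOY_RECLAIMABLE;
    return state;
}

/* Mirrors the check: must be a slab, not frozen, and marked reclaimable. */
static bool toy_reclaimable(unsigned int state)
{
    if (!(state & TOY_SLAB))
        return false;
    if (state & TOY_FROZEN)
        return false;
    return (state & TOY_RECLAIMABLE) != 0;
}
```

The point of the flag is that a slab whose objects could not be moved is
skipped on later passes until its contents have changed enough (it filled up
again) to make another attempt worthwhile.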


[patch 17/23] FS: Proc filesystem support for slab defrag

2007-11-06 Thread Christoph Lameter

Support procfs inode defragmentation

Reviewed-by: Rik van Riel <[EMAIL PROTECTED]>
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 fs/proc/inode.c |8 
 1 file changed, 8 insertions(+)

Index: linux-2.6.23-mm1/fs/proc/inode.c
===
--- linux-2.6.23-mm1.orig/fs/proc/inode.c   2007-10-12 16:26:08.0 -0700
+++ linux-2.6.23-mm1/fs/proc/inode.c2007-10-12 18:48:32.0 -0700
@@ -114,6 +114,12 @@ static void init_once(struct kmem_cache 
inode_init_once(&ei->vfs_inode);
 }
 
+static void *proc_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+   return fs_get_inodes(s, nr, v,
+   offsetof(struct proc_inode, vfs_inode));
+};
+
 int __init proc_init_inodecache(void)
 {
proc_inode_cachep = kmem_cache_create("proc_inode_cache",
@@ -121,6 +127,8 @@ int __init proc_init_inodecache(void)
 0, (SLAB_RECLAIM_ACCOUNT|
SLAB_MEM_SPREAD|SLAB_PANIC),
 init_once);
+   kmem_cache_setup_defrag(proc_inode_cachep,
+   proc_get_inodes, kick_inodes);
return 0;
 }
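
The offsetof() argument passed by proc_get_inodes() can be illustrated in
isolation. fs_get_inodes() itself is not shown in this message, so the sketch
below only assumes that such a generic helper uses the offset to find the
inode embedded in each filesystem-specific slab object; toy_inode,
toy_proc_inode and toy_get_inodes() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical generic inode. */
struct toy_inode {
    unsigned long ino;
};

/* Hypothetical filesystem inode embedding the generic one after private data. */
struct toy_proc_inode {
    int pid;
    struct toy_inode vfs_inode;
};

/*
 * Generic helper: the slab hands over pointers to the outer objects, and the
 * offset tells the helper where the embedded inode lives inside each one.
 */
static void toy_get_inodes(int nr, void **v, size_t offset,
                           struct toy_inode **out)
{
    for (int i = 0; i < nr; i++) {
        assert(v[i] != NULL);
        out[i] = (struct toy_inode *)((char *)v[i] + offset);
    }
}
```

Passing the offset keeps the helper independent of any particular
filesystem's inode layout, so each filesystem only needs a one-line wrapper.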
 


[patch 21/23] dentries: dentry defragmentation

2007-11-06 Thread Christoph Lameter

The dentry pruning for unused entries works in a straightforward way. It
could be made more aggressive if one would actually move dentries instead
of just reclaiming them.

Reviewed-by: Rik van Riel <[EMAIL PROTECTED]>
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 fs/dcache.c |  101 +++-
 1 file changed, 100 insertions(+), 1 deletion(-)

Index: linux-2.6/fs/dcache.c
===
--- linux-2.6.orig/fs/dcache.c  2007-11-06 12:57:01.0 -0800
+++ linux-2.6/fs/dcache.c   2007-11-06 12:57:06.0 -0800
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 
@@ -143,7 +144,10 @@ static struct dentry *d_kill(struct dent
 
list_del(&dentry->d_u.d_child);
dentry_stat.nr_dentry--;/* For d_free, below */
-   /*drops the locks, at that point nobody can reach this dentry */
+   /*
+* drops the locks, at that point nobody (aside from defrag)
+* can reach this dentry
+*/
dentry_iput(dentry);
parent = dentry->d_parent;
d_free(dentry);
@@ -2098,6 +2102,100 @@ static void __init dcache_init_early(voi
INIT_HLIST_HEAD(&dentry_hashtable[loop]);
 }
 
+/*
+ * The slab allocator is holding off frees. We can safely examine
+ * the object without the danger of it vanishing from under us.
+ */
+static void *get_dentries(struct kmem_cache *s, int nr, void **v)
+{
+   struct dentry *dentry;
+   int i;
+
+   spin_lock(&dcache_lock);
+   for (i = 0; i < nr; i++) {
+   dentry = v[i];
+
+   /*
+* Three sorts of dentries cannot be reclaimed:
+*
+* 1. dentries that are in the process of being allocated
+*or being freed. In that case the dentry is neither
+*on the LRU nor hashed.
+*
+* 2. Fake hashed entries as used for anonymous dentries
+*and pipe I/O. The fake hashed entries have d_flags
+*set to indicate a hashed entry. However, the
+*d_hash field indicates that the entry is not hashed.
+*
+* 3. dentries that have a backing store that is not
+*writable. This is true for tmpfs and other in
+*memory filesystems. Removing dentries from them
+*would lose dentries for good.
+*/
+   if ((d_unhashed(dentry) && list_empty(&dentry->d_lru)) ||
+  (!d_unhashed(dentry) && hlist_unhashed(&dentry->d_hash)) ||
+  (dentry->d_inode &&
+  !mapping_cap_writeback_dirty(dentry->d_inode->i_mapping)))
+   /* Ignore this dentry */
+   v[i] = NULL;
+   else
+   /* dget_locked will remove the dentry from the LRU */
+   dget_locked(dentry);
+   }
+   spin_unlock(&dcache_lock);
+   return NULL;
+}
+
+/*
+ * Slab has dropped all the locks. Get rid of the refcount obtained
+ * earlier and also free the object.
+ */
+static void kick_dentries(struct kmem_cache *s,
+   int nr, void **v, void *private)
+{
+   struct dentry *dentry;
+   int i;
+
+   /*
+* First invalidate the dentries without holding the dcache lock
+*/
+   for (i = 0; i < nr; i++) {
+   dentry = v[i];
+
+   if (dentry)
+   d_invalidate(dentry);
+   }
+
+   /*
+* If we are the last one holding a reference then the dentries can
+* be freed. We need the dcache_lock.
+*/
+   spin_lock(&dcache_lock);
+   for (i = 0; i < nr; i++) {
+   dentry = v[i];
+   if (!dentry)
+   continue;
+
+   spin_lock(&dentry->d_lock);
+   if (atomic_read(&dentry->d_count) > 1) {
+   spin_unlock(&dentry->d_lock);
+   spin_unlock(&dcache_lock);
+   dput(dentry);
+   spin_lock(&dcache_lock);
+   continue;
+   }
+
+   prune_one_dentry(dentry);
+   }
+   spin_unlock(&dcache_lock);
+
+   /*
+* dentries are freed using RCU so we need to wait until RCU
+* operations are complete
+*/
+   synchronize_rcu();
+}
+
 static void __init dcache_init(void)
 {
int loop;
@@ -2107,6 +2205,7 @@ static void __init dcache_init(void)
dcache_ctor);
 
register_shrinker(&dcache_shrinker);
+   kmem_cache_setup_defrag(dentry_cache, get_dentries, kick_dentries);
 
/* Hash may have been set up in dcache_init_early */
if (!hashdist)
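
The two-phase get/kick protocol used by get_dentries() and kick_dentries()
can be sketched without any locking. This is a simplified user-space model
with hypothetical names (toy_dentry, toy_get, toy_kick): phase one pins what
can be reclaimed and NULLs out the rest, phase two drops the pin and frees an
object only if nobody else holds a reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_dentry {
    int refcount;
    bool unreclaimable;  /* stands in for the three cases listed in the patch */
    bool freed;
};

/* Phase 1: take a reference on reclaimable objects, drop the others from v. */
static void toy_get(int nr, void **v)
{
    for (int i = 0; i < nr; i++) {
        struct toy_dentry *d = v[i];

        assert(d != NULL);
        if (d->unreclaimable)
            v[i] = NULL;
        else
            d->refcount++;
    }
}

/* Phase 2: drop our reference; free the object if we were the last holder. */
static void toy_kick(int nr, void **v)
{
    for (int i = 0; i < nr; i++) {
        struct toy_dentry *d = v[i];

        if (!d)
            continue;
        if (--d->refcount == 0)
            d->freed = true;
    }
}
```

Splitting the work this way lets the first phase run while the allocator is
holding off frees, and the second phase run after the allocator has dropped
its locks.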
