Re: [PATCH 42/59] sysctl: Remove sys_sysctl support from the hpet timer driver.

2007-01-16 Thread Clemens Ladisch
Eric W. Biederman wrote:
> From: Eric W. Biederman <[EMAIL PROTECTED]> - unquoted
> 
> In the binary sysctl interface the hpet driver was claiming to
> be the cdrom driver.  This is a no-no so remove support for the
> binary interface.
> 
> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>

Acked-by: Clemens Ladisch <[EMAIL PROTECTED]>

> ---
>  drivers/char/hpet.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/char/hpet.c b/drivers/char/hpet.c
> index 20dc3be..81be1db 100644
> --- a/drivers/char/hpet.c
> +++ b/drivers/char/hpet.c
> @@ -703,7 +703,7 @@ int hpet_control(struct hpet_task *tp, unsigned int
> cmd, unsigned long arg)
>  
>  static ctl_table hpet_table[] = {
>   {
> -.ctl_name = 1,
> +.ctl_name = CTL_UNNUMBERED,
>.procname = "max-user-freq",
>.data = &hpet_max_freq,
>.maxlen = sizeof(int),
> @@ -715,7 +715,7 @@ static ctl_table hpet_table[] = {
>  
>  static ctl_table hpet_root[] = {
>   {
> -.ctl_name = 1,
> +.ctl_name = CTL_UNNUMBERED,
>.procname = "hpet",
>.maxlen = 0,
>.mode = 0555,
> -- 
> 1.4.4.1.g278f
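
For context, the number that was being reused here comes from the binary
sysctl namespace under /proc/sys/dev; a short illustration (the enum values
are quoted from include/linux/sysctl.h as best I recall):

	/* In the binary sysctl namespace, the value 1 under CTL_DEV already
	 * belongs to the cdrom driver: */
	enum { CTL_DEV = 7 };		/* /proc/sys/dev		*/
	enum { DEV_CDROM = 1 };		/* /proc/sys/dev/cdrom		*/

	/* With .ctl_name = 1 the hpet directory reused DEV_CDROM, so a
	 * binary sysctl(2) lookup of dev.cdrom.* could resolve into the
	 * hpet table instead.  CTL_UNNUMBERED gives the entry no binary
	 * number at all; it stays reachable only through procfs as
	 * /proc/sys/dev/hpet/max-user-freq. */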


Re: [PATCH 2.6.20-rc4 0/4] futexes functionalities and improvements

2007-01-16 Thread Pierre Peiffer

Ingo Molnar wrote:

* Ulrich Drepper <[EMAIL PROTECTED]> wrote:


what do you mean by that - which is this same resource?
From what has been said here before, all futexes are stored in the 
same list or hash table or whatever it was.  I want to see how that 
code behaves if many separate processes concurrently use futexes.


futexes are stored in the bucket hash, and these patches do not change 
that. The pi-list that was talked about is per-futex. So there's no 
change to the way futexes are hashed nor should there be any scalability 
impact - besides the micro-impact that was measured in a number of ways 
- AFAICS.


Yes, that's completely right!

--
Pierre
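
The "bucket hash" being discussed can be pictured with a tiny standalone
sketch; this is not the kernel's futex code, just the shape of the idea:
separate processes only contend when their futex keys hash to the same
bucket, while the pi-list mentioned above hangs off each individual futex,
not off the shared table.

	#include <stdint.h>
	#include <stdio.h>

	#define HASH_BITS  8
	#define HASH_SIZE  (1 << HASH_BITS)

	struct bucket {
		int nwaiters;	/* a per-bucket lock and waiter list live here */
	};

	static struct bucket buckets[HASH_SIZE];

	/* Hash a futex address ("key") to a bucket; illustrative hash only. */
	static struct bucket *hash_futex(const void *uaddr)
	{
		uintptr_t key = (uintptr_t)uaddr;

		key ^= key >> 16;
		key *= 0x9e3779b1u;
		return &buckets[key & (HASH_SIZE - 1)];
	}

	int main(void)
	{
		int a, b;

		printf("&a -> bucket %td\n", hash_futex(&a) - buckets);
		printf("&b -> bucket %td\n", hash_futex(&b) - buckets);
		return 0;
	}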


[PATCH] i386: entry.S END/ENDPROC annotations

2007-01-16 Thread Jan Beulich
Annotate i386/kernel/entry.S with END/ENDPROC to assist disassemblers and
other analysis tools.

Signed-off-by: Jan Beulich <[EMAIL PROTECTED]>

--- linux-2.6.20-rc5/arch/i386/kernel/entry.S   2007-01-15 14:09:19.0 +0100
+++ 2.6.20-rc5-i386-entry-end/arch/i386/kernel/entry.S  2007-01-04 14:46:47.0 +0100
@@ -227,6 +227,7 @@ ENTRY(ret_from_fork)
CFI_ADJUST_CFA_OFFSET -4
jmp syscall_exit
CFI_ENDPROC
+END(ret_from_fork)
 
 /*
  * Return to user mode is not as complex as all this looks,
@@ -258,6 +259,7 @@ ENTRY(resume_userspace)
# int/exception return?
jne work_pending
jmp restore_all
+END(ret_from_exception)
 
 #ifdef CONFIG_PREEMPT
 ENTRY(resume_kernel)
@@ -272,6 +274,7 @@ need_resched:
jz restore_all
call preempt_schedule_irq
jmp need_resched
+END(resume_kernel)
 #endif
CFI_ENDPROC
 
@@ -355,6 +358,7 @@ sysenter_past_esp:
.align 4
.long 1b,2b
 .popsection
+ENDPROC(sysenter_entry)
 
# system call handler stub
 ENTRY(system_call)
@@ -455,6 +459,7 @@ ldt_ss:
CFI_ADJUST_CFA_OFFSET -8
jmp restore_nocheck
CFI_ENDPROC
+ENDPROC(system_call)
 
# perform work that needs to be done immediately before resumption
ALIGN
@@ -500,6 +505,7 @@ work_notifysig_v86:
xorl %edx, %edx
call do_notify_resume
jmp resume_userspace_sig
+END(work_pending)
 
# perform syscall exit tracing
ALIGN
@@ -515,6 +521,7 @@ syscall_trace_entry:
cmpl $(nr_syscalls), %eax
jnae syscall_call
jmp syscall_exit
+END(syscall_trace_entry)
 
# perform syscall exit tracing
ALIGN
@@ -528,6 +535,7 @@ syscall_exit_work:
movl $1, %edx
call do_syscall_trace
jmp resume_userspace
+END(syscall_exit_work)
CFI_ENDPROC
 
RING0_INT_FRAME # can't unwind into user space anyway
@@ -538,10 +546,12 @@ syscall_fault:
GET_THREAD_INFO(%ebp)
movl $-EFAULT,PT_EAX(%esp)
jmp resume_userspace
+END(syscall_fault)
 
 syscall_badsys:
movl $-ENOSYS,PT_EAX(%esp)
jmp resume_userspace
+END(syscall_badsys)
CFI_ENDPROC
 
 #define FIXUP_ESPFIX_STACK \
@@ -577,9 +587,9 @@ syscall_badsys:
 ENTRY(interrupt)
 .text
 
-vector=0
 ENTRY(irq_entries_start)
RING0_INT_FRAME
+vector=0
 .rept NR_IRQS
ALIGN
  .if vector
@@ -588,11 +598,16 @@ ENTRY(irq_entries_start)
 1: pushl $~(vector)
CFI_ADJUST_CFA_OFFSET 4
jmp common_interrupt
-.data
+ .previous
.long 1b
-.text
+ .text
 vector=vector+1
 .endr
+END(irq_entries_start)
+
+.previous
+END(interrupt)
+.previous
 
 /*
  * the CPU automatically disables interrupts when executing an IRQ vector,
@@ -605,6 +620,7 @@ common_interrupt:
movl %esp,%eax
call do_IRQ
jmp ret_from_intr
+ENDPROC(common_interrupt)
CFI_ENDPROC
 
 #define BUILD_INTERRUPT(name, nr)  \
@@ -617,7 +633,8 @@ ENTRY(name) \
movl %esp,%eax; \
call smp_/**/name;  \
jmp ret_from_intr;  \
-   CFI_ENDPROC
+   CFI_ENDPROC;\
+ENDPROC(name)
 
 /* The include is where all of the SMP etc. interrupts come from */
 #include "entry_arch.h"
@@ -688,6 +705,7 @@ ENTRY(coprocessor_error)
CFI_ADJUST_CFA_OFFSET 4
jmp error_code
CFI_ENDPROC
+END(coprocessor_error)
 
 ENTRY(simd_coprocessor_error)
RING0_INT_FRAME
@@ -697,6 +715,7 @@ ENTRY(simd_coprocessor_error)
CFI_ADJUST_CFA_OFFSET 4
jmp error_code
CFI_ENDPROC
+END(simd_coprocessor_error)
 
 ENTRY(device_not_available)
RING0_INT_FRAME
@@ -717,6 +736,7 @@ device_not_available_emulate:
CFI_ADJUST_CFA_OFFSET -4
jmp ret_from_exception
CFI_ENDPROC
+END(device_not_available)
 
 /*
  * Debug traps and NMI can happen at the one SYSENTER instruction
@@ -860,10 +880,12 @@ ENTRY(native_iret)
.align 4
.long 1b,iret_exc
 .previous
+END(native_iret)
 
 ENTRY(native_irq_enable_sysexit)
sti
sysexit
+END(native_irq_enable_sysexit)
 #endif
 
 KPROBE_ENTRY(int3)
@@ -886,6 +908,7 @@ ENTRY(overflow)
CFI_ADJUST_CFA_OFFSET 4
jmp error_code
CFI_ENDPROC
+END(overflow)
 
 ENTRY(bounds)
RING0_INT_FRAME
@@ -895,6 +918,7 @@ ENTRY(bounds)
CFI_ADJUST_CFA_OFFSET 4
jmp error_code
CFI_ENDPROC
+END(bounds)
 
 ENTRY(invalid_op)
RING0_INT_FRAME
@@ -904,6 +928,7 @@ ENTRY(invalid_op)
CFI_ADJUST_CFA_OFFSET 4
jmp error_code
CFI_ENDPROC
+END(invalid_op)
 
 ENTRY(coprocessor_segment_overrun)
RING0_INT_FRAME
@@ -913,6 +938,7 @@ ENTRY(coprocessor_segment_overrun)
CFI_ADJUST_CFA_OFFSET 4
jmp error_code
CFI_ENDPROC
+END(coprocessor_segment_overrun)
 
 ENTRY(invalid_TSS)
RING0_EC_FRAME
@@ -920,6 +946,7 @@ E

Re: BUG: linux 2.6.19 unable to enable acpi

2007-01-16 Thread Matheus Izvekov

I just tried the firmwarekit, and here are the results, attached.
TYVM, that's a very useful tool.




apicedge - (experimental) APIC Edge/Level check - 4
    This test checks if legacy interrupts are edge and PCI interrupts are level
      Non-Legacy interrupt 0 is incorrectly level triggered (4)
          interrupts://   0:  22353   XT-PIC-XT   timer
      Non-Legacy interrupt 1 is incorrectly level triggered (4)
          interrupts://   1:      9   XT-PIC-XT   i8042
      Non-Legacy interrupt 2 is incorrectly level triggered (4)
          interrupts://   2:      0   XT-PIC-XT   cascade
      Non-Legacy interrupt 8 is incorrectly level triggered (4)
          interrupts://   8:      0   XT-PIC-XT   rtc
      Non-Legacy interrupt 10 is incorrectly level triggered (4)
          interrupts://  10:     51   XT-PIC-XT   ohci_hcd:usb1

microcode - Processor microcode update - 4
    This test verifies if the firmware has put a recent version of the
    microcode into the processor at boot time. Recent microcode is important
    to have all the required features and errata updates for the processor.
      Cpu cpu0 has outdated microcode (version 34 while version 36 is available) (4)

FADT - FADT test - 4
    Verify FADT SCI_EN bit enabled or NOT.
      E820: XSDT (0x2ed382e9) is not in reserved or ACPI memory! (4) e820://
      Legacy mode, SCI_EN bit in PM1a_Control register is incorrectly Disabled (4)
      E820: XSDT (0x2ed382e9) is not in reserved or ACPI memory! (4) e820://

mtrr - MTRR validation - 4
    This test validates the MTRR setup against the memory map to detect any
    inconsistencies in cachability.
      Memory range 0x10 to 0xfde (System RAM) has incorrect attribute default (4)
          mtrr://System RAM
          Memory range 0x10 to 0xfde (System RAM) has incorrect attribute default

mcfg - MCFG PCI Express* memory mapped config space - 4
    This test tries to validate the MCFG table by comparing the first 16 bytes
    in the MMIO mapped config space with the 'traditional' config space of the
    first PCI device (root bridge). The MCFG data is only trusted if it is
    marked reserved in the E820 table.
      E820: XSDT (0x2ed382e9) is not in reserved or ACPI memory! (4) e820://
      No MCFG ACPI table found. This table is required for PCI Express*. (2)

edd - EDD Boot disk hinting - 4
    This test verifies if the BIOS directs the operating system on which
    storage device to use for booting (EDD information). This is important for
    systems that (can) have multiple disks. Linux distributions increasingly
    depend on this info to find out on which device to install the bootloader.
      Boot device 0x80 does not support EDD (4)

pciresource - Validate assigned PCI resources - 4
    This test is currently a placeholder and just checks the kernel log for
    complaints about PCI resource errors. In the future the idea is to
    actually perform a validation step on all PCI resources against a certain
    rule-set.
      Device :01:00.0 has incorrect resources (4)
          pci://:01:00.0
          PCI: Ignore bogus resource 6 [0:0] of :01:00.0

thermal_trip - ACPI passive thermal trip points - 2
    This test determines if the passive trip point works as expected.
      Cannot test trip points without existing /proc/acpi/thermal_zone. (2)

cpufreq - CPU frequency scaling tests - 2
    For each processor in the system, this test steps through the various
    frequency states (P-states) that the BIOS advertises for the processor.
    For each processor/frequency combination, a quick performance value is
    measured. The test then validates that:
      1) Each processor has the same number of frequency states
      2) Higher advertised frequencies have a higher performance
      3) No duplicate frequency values are reported by the BIOS
      4) Is BIOS wrongly doing Sw_All P-state coordination across cores
      5) Is BIOS wrongly doing Sw_Any P-state coordination across cores
      Frequency scaling not supported (2)

virt - VT/VMX Virtualization extensions - 1
    This test checks if VT/VMX is set up correctly
      Processor does not support Virtualization extensions (1)

acpiinfo - General ACPI information - 1
    This test checks the output of the in-kernel ACPI CA against common error
    messages that indicate a bad interaction with the bios, including those
    that point at AML syntax errors.
      DSDT was compiled by the Microsoft AML compiler (1)
          ACPI: DSDT (v001SiS  620 0x1000 MSFT 0x010a) @ 0x

maxreadreq - PCI Express MaxReadReq tuning - 0
    This test checks if the firmware has set MaxReadReq to a higher value on
    non-motherboard devices

os2gap - OS/2 memory hole test - 0
    This test checks if the OS/2 15Mb memory hole is absent

dmi - DMI information check - 0
    This test checks the DMI/SMBIOS tables for common errors.

chk_hpet - HPET configuration test - 0
    This test checks the HPET PCI BAR for each timer block in the timer. The
    base address is passed by the

Re: BUG: linux 2.6.19 unable to enable acpi

2007-01-16 Thread Matheus Izvekov

On 1/17/07, Luming Yu <[EMAIL PROTECTED]> wrote:

On 1/17/07, Matheus Izvekov <[EMAIL PROTECTED]> wrote:
> It used to support power button events, don't know what else. Is there
> anything I can do to check how good the ACPI support is?

Did you check the BIOS settings? Are there any ACPI-related menu items?

No ACPI-related menu items, just APM ones, which are disabled.

Does MS windows work?


Yes, but I don't have it anymore to check how ACPI was working there.
But that's a yes for sure; I could turn it off with the power button.


Have you ever tried other kernels, i.e. 2.6.18, 2.6.17, 2.6.16...?



No, but I'll try if it proves to be necessary.


[take33 10/10] kevent: Kevent based AIO (aio_sendfile()/aio_sendfile_path()).

2007-01-16 Thread Evgeniy Polyakov

Kevent based AIO (aio_sendfile()/aio_sendfile_path()).

aio_sendfile()/aio_sendfile_path() consists of two major parts: the AIO
state machine and the page processing code.
The former is just a small subsystem which allows callbacks to be queued
for invocation in process context on behalf of a pool of kernel threads.
Caches of callbacks can be queued to the local thread or to any other
specified thread. Each cache of callbacks is processed as long as there are
callbacks in it, and callbacks can requeue themselves into the same cache.

The real work is done in the page processing code, which populates pages
into the VFS cache and then sends them to the destination socket
via ->sendpage(). Unlike the previous aio_sendfile() implementation, the new
one does not require low-level filesystem-specific callbacks (->get_block())
at all; instead I extended struct address_space_operations with a new
member called ->aio_readpages(), which is exactly the same as ->readpages()
(read: mpage_readpages()) except for different BIO allocation and submission
routines. I changed mpage_readpages() to provide mpage_alloc() and
mpage_bio_submit() to a new function called __mpage_readpages(), which is
exactly the old mpage_readpages() with the provided callbacks invoked instead
of the old functions being called directly. mpage_readpages_aio() provides
kevent-specific callbacks, which call the old functions but with different
destructor callbacks; those are essentially the same, except that they
reschedule AIO processing.
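
In shorthand, the shape of that change is just parameterising the BIO
allocation and submission steps; a condensed sketch (the names of the
kevent-aware allocator/submitter below are placeholders, the real callbacks
live in the diff that follows):

	static int __mpage_readpages(struct address_space *mapping,
			struct list_head *pages, unsigned nr_pages,
			get_block_t get_block, void *priv,
			struct bio *(*alloc)(struct block_device *bdev,
					sector_t first_sector, int nr_vecs,
					gfp_t gfp_flags, void *priv),
			struct bio *(*submit)(int rw, struct bio *bio))
	{
		/* the old mpage_readpages() loop, calling alloc()/submit()
		 * wherever it previously called mpage_alloc() and
		 * mpage_bio_submit() directly */
		return 0;
	}

	int mpage_readpages(struct address_space *mapping, struct list_head *pages,
			unsigned nr_pages, get_block_t get_block)
	{
		return __mpage_readpages(mapping, pages, nr_pages, get_block,
					 NULL, mpage_alloc, mpage_bio_submit);
	}

	int mpage_readpages_aio(struct address_space *mapping, struct list_head *pages,
			unsigned nr_pages, get_block_t get_block, void *priv)
	{
		/* same loop, but the allocator/submitter pair installs the
		 * kevent-specific bio destructors that reschedule AIO processing */
		return __mpage_readpages(mapping, pages, nr_pages, get_block,
					 priv, mpage_alloc_aio, mpage_bio_submit_aio);
	}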

aio_sendfile_path() is essentially aio_sendfile(), except that it takes the
source filename as a parameter and returns an opened file descriptor.

A benchmark transferring 100 1MB files (already in the VFS cache) using
synchronous sendfile() against aio_sendfile_path() shows about a 10 MB/s
performance win for aio_sendfile_path() (78 MB/s vs 66-72 MB/s over a 1 Gb
network; the sending server is a one-way AMD Athlon 64 3500+).

The AIO state machine is a base for network AIO (which becomes
quite trivial), but I will not start that implementation until the road ahead
for kevent as a whole and for the AIO implementation becomes clearer.

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>

diff --git a/fs/bio.c b/fs/bio.c
index 7618bcb..291e7e8 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -120,7 +120,7 @@ void bio_free(struct bio *bio, struct bio_set *bio_set)
 /*
  * default destructor for a bio allocated with bio_alloc_bioset()
  */
-static void bio_fs_destructor(struct bio *bio)
+void bio_fs_destructor(struct bio *bio)
 {
bio_free(bio, fs_bio_set);
 }
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index beaf25f..f08c957 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1650,6 +1650,13 @@ ext3_readpages(struct file *file, struct address_space *mapping,
return mpage_readpages(mapping, pages, nr_pages, ext3_get_block);
 }
 
+static int
+ext3_readpages_aio(struct file *file, struct address_space *mapping,
+   struct list_head *pages, unsigned nr_pages, void *priv)
+{
+   return mpage_readpages_aio(mapping, pages, nr_pages, ext3_get_block, priv);
+}
+
 static void ext3_invalidatepage(struct page *page, unsigned long offset)
 {
journal_t *journal = EXT3_JOURNAL(page->mapping->host);
@@ -1768,6 +1775,7 @@ static int ext3_journalled_set_page_dirty(struct page *page)
 }
 
 static const struct address_space_operations ext3_ordered_aops = {
+   .aio_readpages  = ext3_readpages_aio,
.readpage   = ext3_readpage,
.readpages  = ext3_readpages,
.writepage  = ext3_ordered_writepage,
diff --git a/fs/mpage.c b/fs/mpage.c
index 692a3e5..e5ba44b 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -102,7 +102,7 @@ static struct bio *mpage_bio_submit(int rw, struct bio *bio)
 static struct bio *
 mpage_alloc(struct block_device *bdev,
sector_t first_sector, int nr_vecs,
-   gfp_t gfp_flags)
+   gfp_t gfp_flags, void *priv)
 {
struct bio *bio;
 
@@ -116,6 +116,7 @@ mpage_alloc(struct block_device *bdev,
if (bio) {
bio->bi_bdev = bdev;
bio->bi_sector = first_sector;
+   bio->bi_private = priv;
}
return bio;
 }
@@ -175,7 +176,10 @@ map_buffer_to_page(struct page *page, struct buffer_head *bh, int page_block)
 static struct bio *
 do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
sector_t *last_block_in_bio, struct buffer_head *map_bh,
-   unsigned long *first_logical_block, get_block_t get_block)
+   unsigned long *first_logical_block, get_block_t get_block,
+   struct bio *(*alloc)(struct block_device *bdev, sector_t first_sector,
+   int nr_vecs, gfp_t gfp_flags, void *priv),
+   struct bio *(*submit)(int rw, struct bio *bio), void *priv)
 {
struct inode *inode = page->mapping->host;
const unsigned blkbits = inode->i_blkbits;
@@ -302,25 +306,25 @@ do_mpage_readpage(struct bio *bio, struc

[take33 5/10] kevent: Timer notifications.

2007-01-16 Thread Evgeniy Polyakov

Timer notifications.

Timer notifications can be used for fine-grained per-process time
management, since interval timers are very inconvenient to use
and they are limited.

This subsystem uses high-resolution timers.
id.raw[0] is used as number of seconds
id.raw[1] is used as number of nanoseconds
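
For orientation, a request for a periodic 2.5 second timer would be filled in
along these lines (a sketch only; struct ukevent, KEVENT_TIMER, KEVENT_CTL_ADD
and kevent_ctl() are the names used elsewhere in this patchset, and everything
beyond the id.raw[] convention above should be checked against ukevent.h):

	struct ukevent uk;

	memset(&uk, 0, sizeof(struct ukevent));
	uk.type = KEVENT_TIMER;
	uk.id.raw[0] = 2;		/* period: 2 seconds ...		*/
	uk.id.raw[1] = 500000000;	/* ... plus 500000000 nanoseconds	*/
	/* then submit it with kevent_ctl(kevent_fd, KEVENT_CTL_ADD, 1, &uk); */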

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 000..c21a155
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,114 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct kevent_timer
+{
+   struct hrtimer  ktimer;
+   struct kevent_storage   ktimer_storage;
+   struct kevent   *ktimer_event;
+};
+
+static int kevent_timer_func(struct hrtimer *timer)
+{
+   struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer);
+   struct kevent *k = t->ktimer_event;
+
+   kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL);
+   hrtimer_forward(timer, timer->base->softirq_time,
+   ktime_set(k->event.id.raw[0], k->event.id.raw[1]));
+   return HRTIMER_RESTART;
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+   int err;
+   struct kevent_timer *t;
+
+   t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+   if (!t)
+   return -ENOMEM;
+
+   hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL);
+   t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]);
+   t->ktimer.function = kevent_timer_func;
+   t->ktimer_event = k;
+
+   err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+   if (err)
+   goto err_out_free;
+   lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+   err = kevent_storage_enqueue(&t->ktimer_storage, k);
+   if (err)
+   goto err_out_st_fini;
+
+   hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL);
+
+   return 0;
+
+err_out_st_fini:
+   kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+   kfree(t);
+
+   return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+   struct kevent_storage *st = k->st;
+   struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+   hrtimer_cancel(&t->ktimer);
+   kevent_storage_dequeue(st, k);
+   kfree(t);
+
+   return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+   k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+   return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+   struct kevent_callbacks tc = {
+   .callback = &kevent_timer_callback,
+   .enqueue = &kevent_timer_enqueue,
+   .dequeue = &kevent_timer_dequeue,
+   .flags = 0,
+   };
+
+   return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);
+



[take33 4/10] kevent: Socket notifications.

2007-01-16 Thread Evgeniy Polyakov

Socket notifications.

This patch includes socket send/recv/accept notifications.
Using a trivial web server based on kevent and these features
instead of epoll, its performance increased more than noticeably.
More details about the various benchmarks and the server itself
(evserver_kevent.c) can be found on the project's homepage.
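
Registering for such events would look roughly like the fragment below (a
sketch only: the KEVENT_SOCKET type name, the KEVENT_SOCKET_ACCEPT mask and
"socket descriptor in id.raw[0]" are assumptions inferred from the pipe and
poll code in this series; evserver_kevent.c shows the real usage):

	struct ukevent uk;

	memset(&uk, 0, sizeof(struct ukevent));
	uk.type = KEVENT_SOCKET;
	uk.event = KEVENT_SOCKET_RECV | KEVENT_SOCKET_ACCEPT;
	uk.id.raw[0] = listen_fd;	/* socket to watch			  */
	uk.ptr = connection_state;	/* opaque cookie returned with the event  */
	/* kevent_ctl(kevent_fd, KEVENT_CTL_ADD, 1, &uk); */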

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>

diff --git a/fs/inode.c b/fs/inode.c
index bf21dc6..82817b1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -164,12 +165,18 @@ static struct inode *alloc_inode(struct super_block *sb)
}
inode->i_private = NULL;
inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+   kevent_storage_init(inode, &inode->st);
+#endif
}
return inode;
 }
 
 void destroy_inode(struct inode *inode) 
 {
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+   kevent_storage_fini(&inode->st);
+#endif
BUG_ON(inode_has_buffers(inode));
security_inode_free(inode);
if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index 03684e7..d840399 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -49,6 +49,7 @@
 #include   /* struct sk_buff */
 #include 
 #include 
+#include 
 
 #include 
 
@@ -451,6 +452,21 @@ static inline int sk_stream_memory_free(struct sock *sk)
 
 extern void sk_stream_rfree(struct sk_buff *skb);
 
+struct socket_alloc {
+   struct socket socket;
+   struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+   return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+   return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
 static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
 {
skb->sk = sk;
@@ -478,6 +494,7 @@ static inline void sk_add_backlog(struct sock *sk, struct sk_buff *skb)
sk->sk_backlog.tail = skb;
}
skb->next = NULL;
+   kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 }
 
 #define sk_wait_event(__sk, __timeo, __condition)  \
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kiocb(struct sock_iocb *si)
return si->kiocb;
 }
 
-struct socket_alloc {
-   struct socket socket;
-   struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
-   return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
-   return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
 extern void __sk_stream_mem_reclaim(struct sock *sk);
 extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index b7d8317..2763b30 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -864,6 +864,7 @@ static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
tp->ucopy.memory = 0;
} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
wake_up_interruptible(sk->sk_sleep);
+   kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
if (!inet_csk_ack_scheduled(sk))
inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
  (3 * TCP_RTO_MIN) / 4,
diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 000..d1a2701
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,144 @@
+/*
+ * kevent_socket.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+static int kevent_socket_callback(struct kevent *k)
+{
+   struct inode *inode = k->st->origin;
+   unsigned int events = SOCKET_I(inode)

Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andrew Morton
> On Tue, 16 Jan 2007 22:27:36 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Tue, 16 Jan 2007, Andrew Morton wrote:
> 
> > > Yes this is the result of the hierarchical nature of cpusets which already 
> > > causes issues with the scheduler. It is rather typical that cpusets are 
> > > used to partition the memory and cpus. Overlapping cpusets seem to have 
> > > mainly an administrative function. Paul?
> > 
> > The typical usage scenarios don't matter a lot: the examples I gave show
> > that the core problem remains unsolved.  People can still hit the bug.
> 
> I agree the overlap issue is a problem and I hope it can be addressed 
> somehow for the rare cases in which such nesting takes place.
> 
> One easy solution may be to check the dirty ratio before engaging in 
> reclaim. If the dirty ratio is sufficiently high then trigger writeout via 
> pdflush (we already wakeup pdflush while scanning and you already noted 
> that pdflush writeout is not occurring within the context of the current 
> cpuset) and pass over any dirty pages during LRU scans until some pages 
> have been cleaned up.
> 
> This means we allow allocation of additional kernel memory outside of the 
> cpuset while triggering writeout of inodes that have pages on the nodes 
> of the cpuset. The memory directly used by the application is still 
> limited. Just the temporary information needed for writeback is allocated 
> outside.

Gad.  None of that should be necessary.

> Well sounds somehow still like a hack. Any other ideas out there?

Do what blockdevs do: limit the number of in-flight requests (Peter's
recent patch seems to be doing that for us) (perhaps only when PF_MEMALLOC
is in effect, to keep Trond happy) and implement a mempool for the NFS
request critical store.  Additionally:

- we might need to twiddle the NFS gfp_flags so it doesn't call the
  oom-killer on failure: just return NULL.

- consider going off-cpuset for critical allocations.  It's better than
  going oom.  A suitable implementation might be to ignore the caller's
  cpuset if PF_MEMALLOC.  Maybe put a WARN_ON_ONCE in there: we prefer that
  it not happen and we want to know when it does.



btw, regarding the per-address_space node mask: I think we should free it
when the inode is clean (!mapping_tagged(PAGECACHE_TAG_DIRTY)).  Chances
are, the inode will be dirty for 30 seconds and in-core for hours.  We
might as well steal its nodemask storage and give it to the next file which
gets written to.  A suitable place to do all this is in
__mark_inode_dirty(I_DIRTY_PAGES), using inode_lock to protect
address_space.dirty_page_nodemask.
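
That lifecycle would look something like this (illustrative only; the
dirty_page_nodemask field and the hook points are the proposal above, not
existing code):

	/* __mark_inode_dirty(I_DIRTY_PAGES) path, under inode_lock: allocate
	 * the nodemask lazily the first time the inode gets dirty pages */
	if (!mapping->dirty_page_nodemask)
		mapping->dirty_page_nodemask =
			kzalloc(sizeof(nodemask_t), GFP_ATOMIC);

	/* and once no pages are tagged dirty any more, steal the storage back
	 * so a clean-but-cached inode does not pin it for hours */
	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
		kfree(mapping->dirty_page_nodemask);
		mapping->dirty_page_nodemask = NULL;
	}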


[take33 8/10] kevent: Kevent posix timer notifications.

2007-01-16 Thread Evgeniy Polyakov

Kevent posix timer notifications.

Simple extensions to POSIX timers which allow
notification of timer expiration to be delivered
through the kevent queue.

An example application, posix_timer.c, can be found
in the archive on the project homepage.
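
From userspace the intended use is presumably along these lines (a sketch
only: the patch stores the kevent queue's file descriptor in the new
kevent_fd member of sigevent's union, but how that member is exposed to
applications is not shown here, so that assignment is left as a comment):

	struct sigevent ev;
	struct itimerspec its = {
		.it_value    = { .tv_sec = 1, .tv_nsec = 0 },
		.it_interval = { .tv_sec = 1, .tv_nsec = 0 },
	};
	timer_t tid;

	memset(&ev, 0, sizeof(ev));
	ev.sigev_notify = SIGEV_KEVENT;		/* deliver via a kevent queue */
	/* ev.<kevent_fd member> = kevent_fd;	   descriptor from kevent_init() */
	ev.sigev_value.sival_ptr = cookie;	/* comes back in the event's ptr */

	timer_create(CLOCK_MONOTONIC, &ev, &tid);
	timer_settime(tid, 0, &its, NULL);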

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>


diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
index 8786e01..3768746 100644
--- a/include/asm-generic/siginfo.h
+++ b/include/asm-generic/siginfo.h
@@ -235,6 +235,7 @@ typedef struct siginfo {
 #define SIGEV_NONE 1   /* other notification: meaningless */
 #define SIGEV_THREAD   2   /* deliver via thread creation */
 #define SIGEV_THREAD_ID 4  /* deliver to thread */
+#define SIGEV_KEVENT   8   /* deliver through kevent queue */
 
 /*
  * This works because the alignment is ok on all current architectures
@@ -260,6 +261,8 @@ typedef struct sigevent {
void (*_function)(sigval_t);
void *_attribute;   /* really pthread_attr_t */
} _sigev_thread;
+
+   int kevent_fd;
} _sigev_un;
 } sigevent_t;
 
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index a7dd38f..4b9deb4 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 
 union cpu_time_count {
cputime_t cpu;
@@ -49,6 +50,9 @@ struct k_itimer {
sigval_t it_sigev_value;/* value word of sigevent struct */
struct task_struct *it_process; /* process to send signal to */
struct sigqueue *sigq;  /* signal queue entry. */
+#ifdef CONFIG_KEVENT_TIMER
+   struct kevent_storage st;
+#endif
union {
struct {
struct hrtimer timer;
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 5fe87de..5ec805e 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -48,6 +48,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 /*
  * Management arrays for POSIX timers.  Timers are kept in slab memory
@@ -224,6 +226,100 @@ static int posix_ktime_get_ts(clockid_t which_clock, struct timespec *tp)
return 0;
 }
 
+#ifdef CONFIG_KEVENT_TIMER
+static int posix_kevent_enqueue(struct kevent *k)
+{
+   /*
+* It is not ugly - there is no pointer in the id field union, 
+* but its size is 64bits, which is ok for any known pointer size.
+*/
+   struct k_itimer *tmr = (struct k_itimer *)(unsigned long)k->event.id.raw_u64;
+   return kevent_storage_enqueue(&tmr->st, k);
+}
+static int posix_kevent_dequeue(struct kevent *k)
+{
+   struct k_itimer *tmr = (struct k_itimer *)(unsigned long)k->event.id.raw_u64;
+   kevent_storage_dequeue(&tmr->st, k);
+   return 0;
+}
+static int posix_kevent_callback(struct kevent *k)
+{
+   return 1;
+}
+static int posix_kevent_init(void)
+{
+   struct kevent_callbacks tc = {
+   .callback = &posix_kevent_callback,
+   .enqueue = &posix_kevent_enqueue,
+   .dequeue = &posix_kevent_dequeue,
+   .flags = KEVENT_CALLBACKS_KERNELONLY};
+
+   return kevent_add_callbacks(&tc, KEVENT_POSIX_TIMER);
+}
+
+extern struct file_operations kevent_user_fops;
+
+static int posix_kevent_init_timer(struct k_itimer *tmr, int fd)
+{
+   struct ukevent uk;
+   struct file *file;
+   struct kevent_user *u;
+   int err;
+
+   file = fget(fd);
+   if (!file) {
+   err = -EBADF;
+   goto err_out;
+   }
+
+   if (file->f_op != &kevent_user_fops) {
+   err = -EINVAL;
+   goto err_out_fput;
+   }
+
+   u = file->private_data;
+
+   memset(&uk, 0, sizeof(struct ukevent));
+
+   uk.event = KEVENT_MASK_ALL;
+   uk.type = KEVENT_POSIX_TIMER;
+   uk.id.raw_u64 = (unsigned long)(tmr); /* Just cast to something unique */
+   uk.req_flags = KEVENT_REQ_ONESHOT | KEVENT_REQ_ALWAYS_QUEUE;
+   uk.ptr = tmr->it_sigev_value.sival_ptr;
+
+   err = kevent_user_add_ukevent(&uk, u);
+   if (err)
+   goto err_out_fput;
+
+   fput(file);
+
+   return 0;
+
+err_out_fput:
+   fput(file);
+err_out:
+   return err;
+}
+
+static void posix_kevent_fini_timer(struct k_itimer *tmr)
+{
+   kevent_storage_fini(&tmr->st);
+}
+#else
+static int posix_kevent_init_timer(struct k_itimer *tmr, int fd)
+{
+   return -ENOSYS;
+}
+static int posix_kevent_init(void)
+{
+   return 0;
+}
+static void posix_kevent_fini_timer(struct k_itimer *tmr)
+{
+}
+#endif
+
+
 /*
  * Initialize everything, well, just everything in Posix clocks/timers ;)
  */
@@ -241,6 +337,11 @@ static __init int init_posix_timers(void)
register_posix_clock(CLOCK_REALTIME, &clock_realtime);
register_posix_clock(CLOCK_MONOTONIC, &clock_monotonic);
 
+   if (posix_kevent_init()) {
+   printk(KERN_ERR "Failed t

[take33 7/10] kevent: Signal notifications.

2007-01-16 Thread Evgeniy Polyakov

Signal notifications.

This type of notification allows signals to be delivered through the kevent queue.
An example application, signal.c, can be found on the project homepage.

If the KEVENT_SIGNAL_NOMASK bit is set in the raw_u64 id, then the signal is
delivered only through the queue; otherwise both delivery types are used: the
old one, through an update of the mask of pending signals, and the kevent queue.

If a signal is delivered only through the kevent queue, the mask of pending
signals is not updated at all, which is equivalent to putting the signal into
the blocked mask, but with delivery of that signal through the kevent queue.
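
Registration then looks roughly like this (sketch; the fields follow the
ukevent usage elsewhere in the series, and KEVENT_SIGNAL/KEVENT_SIGNAL_NOMASK
are the names used by the code below):

	struct ukevent uk;

	memset(&uk, 0, sizeof(struct ukevent));
	uk.type = KEVENT_SIGNAL;
	uk.id.raw[0] = SIGUSR1;			/* signal of interest		*/
	uk.id.raw_u64 |= KEVENT_SIGNAL_NOMASK;	/* queue-only delivery: do not
						 * touch the pending mask	*/
	/* kevent_ctl(kevent_fd, KEVENT_CTL_ADD, 1, &uk); */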

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4463735..e7372f2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -82,6 +82,7 @@ struct sched_param {
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -1048,6 +1049,10 @@ struct task_struct {
 #ifdef CONFIG_TASK_DELAY_ACCT
struct task_delay_info *delays;
 #endif
+#ifdef CONFIG_KEVENT_SIGNAL
+   struct kevent_storage st;
+   u32 kevent_signals;
+#endif
 #ifdef CONFIG_FAULT_INJECTION
int make_it_fail;
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index fc723e5..fd7c749 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -118,6 +119,9 @@ void __put_task_struct(struct task_struct *tsk)
WARN_ON(atomic_read(&tsk->usage));
WARN_ON(tsk == current);
 
+#ifdef CONFIG_KEVENT_SIGNAL
+   kevent_storage_fini(&tsk->st);
+#endif
security_task_free(tsk);
free_uid(tsk->user);
put_group_info(tsk->group_info);
@@ -1126,6 +1130,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
if (retval)
goto bad_fork_cleanup_namespaces;
 
+#ifdef CONFIG_KEVENT_SIGNAL
+   kevent_storage_init(p, &p->st);
+#endif
+
p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
/*
 * Clear TID on mm_release()?
diff --git a/kernel/kevent/kevent_signal.c b/kernel/kevent/kevent_signal.c
new file mode 100644
index 000..abe3972
--- /dev/null
+++ b/kernel/kevent/kevent_signal.c
@@ -0,0 +1,94 @@
+/*
+ * kevent_signal.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static int kevent_signal_callback(struct kevent *k)
+{
+   struct task_struct *tsk = k->st->origin;
+   int sig = k->event.id.raw[0];
+   int ret = 0;
+
+   if (sig == tsk->kevent_signals)
+   ret = 1;
+
+   if (ret && (k->event.id.raw_u64 & KEVENT_SIGNAL_NOMASK))
+   tsk->kevent_signals |= 0x8000;
+
+   return ret;
+}
+
+int kevent_signal_enqueue(struct kevent *k)
+{
+   int err;
+
+   err = kevent_storage_enqueue(&current->st, k);
+   if (err)
+   goto err_out_exit;
+
+   if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) {
+   kevent_requeue(k);
+   err = 0;
+   } else {
+   err = k->callbacks.callback(k);
+   if (err)
+   goto err_out_dequeue;
+   }
+
+   return err;
+
+err_out_dequeue:
+   kevent_storage_dequeue(k->st, k);
+err_out_exit:
+   return err;
+}
+
+int kevent_signal_dequeue(struct kevent *k)
+{
+   kevent_storage_dequeue(k->st, k);
+   return 0;
+}
+
+int kevent_signal_notify(struct task_struct *tsk, int sig)
+{
+   tsk->kevent_signals = sig;
+   kevent_storage_ready(&tsk->st, NULL, KEVENT_SIGNAL_DELIVERY);
+   return (tsk->kevent_signals & 0x8000);
+}
+
+static int __init kevent_init_signal(void)
+{
+   struct kevent_callbacks sc = {
+   .callback = &kevent_signal_callback,
+   .enqueue = &kevent_signal_enqueue,
+   .dequeue = &kevent_signal_dequeue,
+   .flags = 0,
+   };
+
+   return kevent_add_callbacks(&sc, KEVENT_SIGNAL);
+}
+module_init(kevent_init_signal);
diff --git a/kernel/signal.c b/kernel/signal.c
index 5630255..f12ebc0 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 

[take33 9/10] kevent: Private userspace notifications.

2007-01-16 Thread Evgeniy Polyakov

Private userspace notifications.

Allows notifications of arbitrary private userspace events
to be registered over kevent. Events can be marked as ready using the
kevent_ctl(KEVENT_READY) command.
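
In use this is symmetric: one place queues the private event, another marks
it ready later (sketch; KEVENT_UNOTIFY is the type registered by the code
below, KEVENT_CTL_READY is the command name from the documentation patch, and
MY_EVENT_ID is an application-chosen placeholder):

	struct ukevent uk;

	memset(&uk, 0, sizeof(struct ukevent));
	uk.type = KEVENT_UNOTIFY;
	uk.id.raw[0] = MY_EVENT_ID;	/* private, application-defined id */
	/* kevent_ctl(kevent_fd, KEVENT_CTL_ADD, 1, &uk); */

	/* ... later, possibly from another thread: mark it ready and wake
	 * whoever is parked in kevent_wait()/kevent_get_events() ... */
	/* kevent_ctl(kevent_fd, KEVENT_CTL_READY, 1, &uk); */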

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>

diff --git a/kernel/kevent/kevent_unotify.c b/kernel/kevent/kevent_unotify.c
new file mode 100644
index 000..618c09c
--- /dev/null
+++ b/kernel/kevent/kevent_unotify.c
@@ -0,0 +1,62 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include 
+#include 
+
+static int kevent_unotify_callback(struct kevent *k)
+{
+   return 1;
+}
+
+int kevent_unotify_enqueue(struct kevent *k)
+{
+   int err;
+
+   err = kevent_storage_enqueue(&k->user->st, k);
+   if (err)
+   goto err_out_exit;
+
+   if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE)
+   kevent_requeue(k);
+
+   return 0;
+
+err_out_exit:
+   return err;
+}
+
+int kevent_unotify_dequeue(struct kevent *k)
+{
+   kevent_storage_dequeue(k->st, k);
+   return 0;
+}
+
+static int __init kevent_init_unotify(void)
+{
+   struct kevent_callbacks sc = {
+   .callback = &kevent_unotify_callback,
+   .enqueue = &kevent_unotify_enqueue,
+   .dequeue = &kevent_unotify_dequeue,
+   .flags = 0,
+   };
+
+   return kevent_add_callbacks(&sc, KEVENT_UNOTIFY);
+}
+module_init(kevent_init_unotify);



[take33 0/10] kevent: Generic event handling mechanism.

2007-01-16 Thread Evgeniy Polyakov

Generic event handling mechanism.

Kevent is a generic subsystem for handling event notifications.
It supports both level- and edge-triggered events. It is similar to
poll/epoll in some cases, but it is more scalable, it is faster, and it
can work with essentially any kind of event.

Events are provided to the kernel through a control syscall and can be read
back through a ring buffer or using the usual syscalls.
Kevent updates (i.e. readiness switching) happen directly from the internals
of the appropriate state machine of the underlying subsystem (such as the
network, filesystem, timer or any other).
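
Pieced together from the documentation patch in this series, a typical
consumer loop looks roughly like this (a sketch only: there are no glibc
wrappers, so kevent_init()/kevent_ctl()/kevent_get_events() stand for the raw
syscalls with the signatures given in Documentation/kevent.txt, and
RING_SIZE/alloc_ring()/handle_event() are application placeholders):

	struct kevent_ring *ring = alloc_ring(RING_SIZE);
	struct ukevent evs[64];
	struct timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
	int kfd, i, n;

	kfd = kevent_init(ring, RING_SIZE, 0);	/* control file descriptor */

	/* register interests here with kevent_ctl(kfd, KEVENT_CTL_ADD, ...) */

	for (;;) {
		n = kevent_get_events(kfd, 1, 64, ts, evs, 0);
		if (n < 0)
			break;
		for (i = 0; i < n; i++)
			handle_event(&evs[i]);	/* dispatch on evs[i].ptr */
	}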

Homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Documentation page:
http://linux-net.osdl.org/index.php/Kevent

Consider for inclusion.

P.S. If you want to be removed from Cc: list just drop me a mail.

Changes from 'take32' patchset:
 * Updated documentation (aio_sendfile_path()).
 * Fixed typo in forward declaration.

Changes from 'take31' patchset:
 * Added aio_sendfile_path() - this syscall allows asynchronous transfer of a
   file specified by the provided pathname to a destination socket.
   The opened file descriptor is returned.
 * Added a trivial scheduler which selects the execution thread. It allows a
   given thread to be specified 'by hand', but since kaio provides '-1' it uses
   round-robin to pick the processing thread. In theory it can be bound to
   scheduler statistics or gamma-ray receiver data.
 * Number of bug fixes in kevent based AIO mpage_readpages().
   
   A benchmark of the 100 1MB files transfer (files already in the VFS cache)
   using sync sendfile or this new version shows about a 10 MB/s performance
   win for aio_sendfile_path().

Changes from 'take30' patchset:
 * AIO state machine.
 * aio_sendfile() implementation.
 * moved kevent_user_get/kevent_user_put into header.
 * use *zalloc where needed.

Changes from 'take29' patchset:
 * new private userspace notifications - allow queueing of any private
   userspace event, which can then be marked as ready using the
   kevent_ctl(KEVENT_READY) command
 * KEVENT_REQ_READY flag - if set kevent will be marked as ready at enqueue time
 * port to 2.6.20-rc2 tree (54abb5fcdae74a811ed440ec6556cabc6b24f404 commit)
 * use struct kmem_cache instead of kmem_cache_t
 * added the notification type into the search key; this allows the same id
   to be used for different types of notifications

Changes from 'take28' patchset:
 * optimized af_unix to use socket notifications
 * changed ALWAYS_QUEUE behaviour with poll/select notifications - previously
   the kevent was not queued into the poll wait queue when the ALWAYS_QUEUE
   flag was set
 * added KEVENT_POLL_POLLRDHUP definition into ukevent.h header
 * libevent-1.2 patch (Jamal, your request is completed, so I'm waiting two
   weeks before starting the final countdown :)
   All regression tests passed successfully except test_evbuffer(), which
   crashes on my amd64 Linux 2.6 test machine for all types of notifications;
   it was probably fixed in the libevent-1.2a version, I did not check.
   The patch and README can be found at the project homepage.

Changes from 'take27' patchset:
 * made kevent default yes in non embedded case.
 * added flags to callback structures - currently used to check whether a
   kevent can be requested from kernelspace only (posix timers) or from
   userspace (all others)

Changes from 'take26' patchset:
 * made kevent visible in config only in case of embedded setup.
 * added comment about KEVENT_MAX number.
 * spell fix.

Changes from 'take25' patchset:
 * use timespec as timeout parameter.
 * added high-resolution timer to handle absolute timeouts.
 * added flags to waiting and initialization syscalls.
 * kevent_commit() has new_uidx parameter.
 * kevent_wait() has an old_uidx parameter which, if not equal to u->uidx,
   results in an immediate wakeup (useful for the case when entries
   are added asynchronously from the kernel; not supported for now).
 * added interface to mark any event as ready.
 * event POSIX timers support.
 * return -ENOSYS if there is no registered event type.
 * provided file descriptor must be checked for fifo type (spotted by Eric
   Dumazet).
 * signal notifications.
 * documentation update.
 * lighttpd patch updated (the latest benchmarks with the lighttpd patch can
   be found in the blog).

Changes from 'take24' patchset:
 * new (old (new)) ring buffer implementation with kernel and user indexes.
 * added initialization syscall instead of opening /dev/kevent
 * kevent_commit() syscall to commit ring buffer entries
 * changed KEVENT_REQ_WAKEUP_ONE flag to KEVENT_REQ_WAKEUP_ALL, kevent wakes
   only first thread always if that flag is not set
 * KEVENT_REQ_ALWAYS_QUEUE flag. If set, kevent will be queued into ready queue
   instead of copying back to userspace when kevent is ready immediately when
   it is added.
 * lighttpd patch (Hail! Although nothing really outstanding compared to epoll)

Changes from 'take23' patchset:
 * kevent PIPE notifications
 * KEVE

[take33 1/10] kevent: Description.

2007-01-16 Thread Evgeniy Polyakov

Description.


diff --git a/Documentation/kevent.txt b/Documentation/kevent.txt
new file mode 100644
index 000..87a1ba9
--- /dev/null
+++ b/Documentation/kevent.txt
@@ -0,0 +1,268 @@
+Description.
+
+int kevent_init(struct kevent_ring *ring, unsigned int ring_size, 
+   unsigned int flags);
+
+ring_size - size of the ring buffer in events 
+ring - pointer to allocated ring buffer
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value: kevent control file descriptor or negative error value.
+
+ struct kevent_ring
+ {
+   unsigned int ring_kidx, ring_over;
+   struct ukevent event[0];
+ }
+
+ring_kidx - index in the ring buffer where the kernel will put new events 
+   when kevent_wait() or kevent_get_events() is called 
+ring_over - number of overflows of ring_uidx that have happened since the start.
+   The overflow counter is used to prevent the situation where two threads 
+   are going to free the same events, but one of them was scheduled 
+   away for so long that the ring indexes wrapped; when that 
+   thread is awakened, it would free events other than the ones 
+   it was supposed to free.
+
+Example userspace code (ring_buffer.c) can be found on project's homepage.
+
+Each kevent syscall can be a so-called cancellation point in glibc, i.e. when 
+a thread has been cancelled in a kevent syscall, the thread can be safely removed 
+and no events will be lost, since each syscall (kevent_wait() or 
+kevent_get_events()) copies events into the special ring buffer, accessible 
+from other threads or even processes (if shared memory is used).
+
+When a kevent is removed (not dequeued when it is ready, but just removed), 
+even if it was ready, it is not copied into the ring buffer: if it is 
+removed, no one cares about it (otherwise the user would wait until it becomes 
+ready and get it the usual way using kevent_get_events() or kevent_wait()), 
+and thus there is no need to copy it to the ring buffer.
+
+---
+
+
+int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg);
+
+fd - is the file descriptor referring to the kevent queue to manipulate. 
+It is created by opening "/dev/kevent" char device, which is created with 
+dynamic minor number and major number assigned for misc devices. 
+
+cmd - is the requested operation. It can be one of the following:
+KEVENT_CTL_ADD - add event notification 
+KEVENT_CTL_REMOVE - remove event notification 
+KEVENT_CTL_MODIFY - modify existing notification 
+KEVENT_CTL_READY - mark existing events as ready; if the number of events is
+   zero, it just wakes up a thread parked in the syscall
+
+num - number of struct ukevent in the array pointed to by arg 
+arg - array of struct ukevent
+
+Return value: 
+ number of events processed or negative error value.
+
+When called, kevent_ctl will carry out the operation specified in the 
+cmd parameter.
+---
+
+ int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr, 
+   struct timespec timeout, struct ukevent *buf, unsigned flags);
+
+ctl_fd - file descriptor referring to the kevent queue 
+min_nr - minimum number of completed events that kevent_get_events will block 
+waiting for 
+max_nr - number of struct ukevent in buf 
+timeout - time to wait before returning less than min_nr 
+ events. If this is -1, then wait forever. 
+buf - pointer to an array of struct ukevent. 
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value:
+ number of events copied or negative error value.
+
+kevent_get_events will wait timeout milliseconds for at least min_nr completed 
+events, copying completed struct ukevents to buf and deleting any 
+KEVENT_REQ_ONESHOT event requests. In nonblocking mode it returns as many
+events as possible, but not more than max_nr. In blocking mode it waits until
+the timeout expires or at least min_nr events are ready.
+
+This function copies the event into the ring buffer if it was initialized; if the
+ring buffer is full, the KEVENT_RET_COPY_FAILED flag is set in the ret_flags field.
+---
+
+ int kevent_wait(int ctl_fd, unsigned int num, unsigned int old_uidx, 
+   struct timespec timeout, unsigned int flags);
+
+ctl_fd - file descriptor referring to the kevent queue 
+num - number of processed kevents 
+old_uidx - the last index user is aware of
+timeout - time to wait until there is free space in kevent queue
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value:
+ number of events copied into ring buffer or negative error value.
+
+This syscall waits until either the timeout expires or at least one event becomes 
+ready. It also copies events into the ring buffer. If the ring buffer is full,
+it waits until there are ready events and then returns.
+If kevent is one-shot kevent it is remo

[take33 6/10] kevent: Pipe notifications.

2007-01-16 Thread Evgeniy Polyakov

Pipe notifications.


diff --git a/fs/pipe.c b/fs/pipe.c
index 68090e8..0c75bf1 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -313,6 +314,7 @@ redo:
break;
}
if (do_wakeup) {
+   kevent_pipe_notify(inode, KEVENT_SOCKET_SEND);
wake_up_interruptible_sync(&pipe->wait);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}
@@ -322,6 +324,7 @@ redo:
 
/* Signal writers asynchronously that there is more room. */
if (do_wakeup) {
+   kevent_pipe_notify(inode, KEVENT_SOCKET_SEND);
wake_up_interruptible(&pipe->wait);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}
@@ -484,6 +487,7 @@ redo2:
break;
}
if (do_wakeup) {
+   kevent_pipe_notify(inode, KEVENT_SOCKET_RECV);
wake_up_interruptible_sync(&pipe->wait);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
do_wakeup = 0;
@@ -495,6 +499,7 @@ redo2:
 out:
mutex_unlock(&inode->i_mutex);
if (do_wakeup) {
+   kevent_pipe_notify(inode, KEVENT_SOCKET_RECV);
wake_up_interruptible(&pipe->wait);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
}
@@ -590,6 +595,7 @@ pipe_release(struct inode *inode, int decr, int decw)
free_pipe_info(inode);
} else {
wake_up_interruptible(&pipe->wait);
+   kevent_pipe_notify(inode, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}
diff --git a/kernel/kevent/kevent_pipe.c b/kernel/kevent/kevent_pipe.c
new file mode 100644
index 000..91dc1eb
--- /dev/null
+++ b/kernel/kevent/kevent_pipe.c
@@ -0,0 +1,123 @@
+/*
+ * kevent_pipe.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static int kevent_pipe_callback(struct kevent *k)
+{
+   struct inode *inode = k->st->origin;
+   struct pipe_inode_info *pipe = inode->i_pipe;
+   int nrbufs = pipe->nrbufs;
+
+   if (k->event.event & KEVENT_SOCKET_RECV && nrbufs > 0) {
+   if (!pipe->writers)
+   return -1;
+   return 1;
+   }
+   
+   if (k->event.event & KEVENT_SOCKET_SEND && nrbufs < PIPE_BUFFERS) {
+   if (!pipe->readers)
+   return -1;
+   return 1;
+   }
+
+   return 0;
+}
+
+int kevent_pipe_enqueue(struct kevent *k)
+{
+   struct file *pipe;
+   int err = -EBADF;
+   struct inode *inode;
+
+   pipe = fget(k->event.id.raw[0]);
+   if (!pipe)
+   goto err_out_exit;
+
+   inode = igrab(pipe->f_dentry->d_inode);
+   if (!inode)
+   goto err_out_fput;
+
+   err = -EINVAL;
+   if (!S_ISFIFO(inode->i_mode))
+   goto err_out_iput;
+
+   err = kevent_storage_enqueue(&inode->st, k);
+   if (err)
+   goto err_out_iput;
+
+   if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) {
+   kevent_requeue(k);
+   err = 0;
+   } else {
+   err = k->callbacks.callback(k);
+   if (err)
+   goto err_out_dequeue;
+   }
+
+   fput(pipe);
+
+   return err;
+
+err_out_dequeue:
+   kevent_storage_dequeue(k->st, k);
+err_out_iput:
+   iput(inode);
+err_out_fput:
+   fput(pipe);
+err_out_exit:
+   return err;
+}
+
+int kevent_pipe_dequeue(struct kevent *k)
+{
+   struct inode *inode = k->st->origin;
+
+   kevent_storage_dequeue(k->st, k);
+   iput(inode);
+
+   return 0;
+}
+
+void kevent_pipe_notify(struct inode *inode, u32 event)
+{
+   kevent_storage_ready(&inode->st, NULL, event);
+}
+
+static int __init kevent_init_pipe(void)
+{
+ 

[take33 3/10] kevent: poll/select() notifications.

2007-01-16 Thread Evgeniy Polyakov

poll/select() notifications.

This patch includes generic poll/select notifications.
kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the internal state machine of the caller, but through
a process wakeup, plus a lot of allocations and so on).

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>

diff --git a/fs/file_table.c b/fs/file_table.c
index 4c17a18..46f458c 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -119,6 +120,7 @@ struct file *get_empty_filp(void)
f->f_uid = tsk->fsuid;
f->f_gid = tsk->fsgid;
eventpoll_init_file(f);
+   kevent_init_file(f);
/* f->f_version: 0 */
return f;
 
@@ -164,6 +166,7 @@ void fastcall __fput(struct file *file)
 * in the file cleanup chain.
 */
eventpoll_release(file);
+   kevent_cleanup_file(file);
locks_remove_flock(file);
 
if (file->f_op && file->f_op->release)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 186da81..59e6069 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -280,6 +280,7 @@ extern int dir_notify_enable;
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -408,6 +409,8 @@ struct address_space_operations {
 
int (*readpages)(struct file *filp, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages);
+   int (*aio_readpages)(struct file *filp, struct address_space *mapping,
+   struct list_head *pages, unsigned nr_pages, void *priv);
 
/*
 * ext3 requires that a successful prepare_write() call be followed
@@ -578,6 +581,10 @@ struct inode {
struct mutexinotify_mutex;  /* protects the watches list */
 #endif
 
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+   struct kevent_storage   st;
+#endif
+
unsigned long   i_state;
unsigned long   dirtied_when;   /* jiffies of first dirtying */
 
@@ -737,6 +744,9 @@ struct file {
struct list_headf_ep_links;
spinlock_t  f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+   struct kevent_storage   st;
+#endif
struct address_space*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 000..58129fa
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,234 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static struct kmem_cache *kevent_poll_container_cache;
+static struct kmem_cache *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
	struct poll_table_struct	pt;
+   struct kevent   *k;
+};
+
+struct kevent_poll_wait_container
+{
	struct list_head	container_entry;
+   wait_queue_head_t   *whead;
	wait_queue_t		wait;
+   struct kevent   *k;
+};
+
+struct kevent_poll_private
+{
	struct list_head	container_list;
+   spinlock_t  container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+   unsigned mode, int sync, void *key)
+{
+   struct kevent_poll_wait_container *cont =
+   container_of(wait, struct kevent_poll_wait_container, wait);
+   struct kevent *k = cont->k;
+
+   kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+   return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+   struct poll_table_struct *poll_table)
+{
+   struct kevent *k =
+   container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+   struct kevent_poll_private *priv = k->priv;
+   struct kevent_poll_wait_container *cont;
+   unsigned long flags;
+
+   cont = kmem_cache_alloc(kevent_poll_container_cache, GFP_KERNEL);
+   if (!cont) {
+   kevent_break(k);
+   return;
+   }
+
+   cont->k = k;
+   init_waitqueue_fu

Re: On some configs, sparse spinlock balance checking is broken

2007-01-16 Thread Ingo Molnar

* Roland Dreier <[EMAIL PROTECTED]> wrote:

> (Ingo -- you seem to be the last person to touch all this stuff, and I 
> can't untangle what you did, hence I'm sending this email to you)
> 
> On at least some of my configs on x86_64, when running sparse, I see 
> bogus 'warning: context imbalance in '' - wrong count at exit'.
> 
> This seems to be because I have CONFIG_SMP=y, CONFIG_DEBUG_SPINLOCK=n
> and CONFIG_PREEMPT=n.  Therefore,  does
> 
>   #define spin_lock(lock) _spin_lock(lock)
> 
> which picks up
> 
>   void __lockfunc _spin_lock(spinlock_t *lock)
> __acquires(lock);
> 
> from , but  also has:
> 
>   #if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) || \
>   !defined(CONFIG_SMP)
>   //...
>   #else
>   # define spin_unlock(lock)  
> __raw_spin_unlock(&(lock)->raw_lock)

this is the direct-inlining speedup some people insisted on.

> and  has:
> 
>   static inline void __raw_spin_unlock(raw_spinlock_t *lock)
>   {
>   asm volatile("movl $1,%0" :"=m" (lock->slock) :: "memory");
>   }
> 
> so sparse doesn't see any __releases() to match the __acquires.
> 
> This all seems to go back to commit bda98685 ("x86: inline spin_unlock
> if !CONFIG_DEBUG_SPINLOCK and !CONFIG_PREEMPT") but I don't know what
> motivated that change.
> 
> Anyway, Ingo or anyone else, what's the best way to fix this?  Maybe 
> the right way to fix this is just to define away __acquires/__releases 
> unless CONFIG_DEBUG_SPINLOCK is set, but that seems suboptimal.

i think the right way to fix it might be to define a _spin_unlock() 
within those #ifdef branches, and then to define spin_lock as:

static inline void spin_lock(spinlock_t *lock) __acquires(lock)
{
_spin_lock(lock);
}

?
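
A matching unlock wrapper would presumably be needed in the same #ifdef
branch so that sparse also sees the __releases() side - untested sketch
only:

/*
 * sketch: keep the raw inline unlock on SMP && !DEBUG_SPINLOCK &&
 * !PREEMPT configs, but give sparse an annotation that balances the
 * __acquires() above.
 */
static inline void spin_unlock(spinlock_t *lock) __releases(lock)
{
	__raw_spin_unlock(&lock->raw_lock);
}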

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Two 2.6.20-rc5-rt2 issues

2007-01-16 Thread Ingo Molnar

* Rui Nuno Capela <[EMAIL PROTECTED]> wrote:

> Building this already with -rt5, still gives:
> ...
>   LD  arch/i386/boot/compressed/vmlinux
>   OBJCOPY arch/i386/boot/vmlinux.bin
>   BUILD   arch/i386/boot/bzImage
> Root device is (3, 2)
> Boot sector 512 bytes.
> Setup is 7407 bytes.
> System is 1427 kB
> Kernel: arch/i386/boot/bzImage is ready  (#1)
> WARNING: "profile_hits" [drivers/kvm/kvm-intel.ko] undefined!
> WARNING: "profile_hits" [drivers/kvm/kvm-amd.ko] undefined!

ok - in my test-config i didnt have KVM modular - the patch below should 
fix this problem.

Ingo

Index: linux/kernel/profile.c
===
--- linux.orig/kernel/profile.c
+++ linux/kernel/profile.c
@@ -332,7 +332,6 @@ out:
local_irq_restore(flags);
put_cpu();
 }
-EXPORT_SYMBOL_GPL(profile_hits);
 
 static int __devinit profile_cpu_callback(struct notifier_block *info,
unsigned long action, void *__cpu)
@@ -402,6 +401,8 @@ void profile_hits(int type, void *__pc, 
 }
 #endif /* !CONFIG_SMP */
 
+EXPORT_SYMBOL_GPL(profile_hits);
+
 void __profile_tick(int type, struct pt_regs *regs)
 {
if (type == CPU_PROFILING && timer_hook)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Tue, 16 Jan 2007, Andrew Morton wrote:

> > Yes this is the result of the hierachical nature of cpusets which already 
> > causes issues with the scheduler. It is rather typical that cpusets are 
> > used to partition the memory and cpus. Overlappig cpusets seem to have 
> > mainly an administrative function. Paul?
> 
> The typical usage scenarios don't matter a lot: the examples I gave show
> that the core problem remains unsolved.  People can still hit the bug.

I agree the overlap issue is a problem and I hope it can be addressed 
somehow for the rare cases in which such nesting takes place.

One easy solution may be to check the dirty ratio before engaging in 
reclaim. If the dirty ratio is sufficiently high then trigger writeout via 
pdflush (we already wakeup pdflush while scanning and you already noted 
that pdflush writeout is not occurring within the context of the current 
cpuset) and pass over any dirty pages during LRU scans until some pages 
have been cleaned up.
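
Roughly, the check would sit at the top of the reclaim path - sketch only,
cpuset_dirty_ratio() is a made-up helper here, the rest is existing
machinery:

	/*
	 * sketch: if the nodes of this cpuset are mostly dirty, kick
	 * pdflush and stop writing dirty pages from the LRU for now.
	 */
	if (cpuset_dirty_ratio(&current->mems_allowed) > vm_dirty_ratio) {
		wakeup_pdflush(0);		/* trigger background writeout */
		sc->may_writepage = 0;		/* pass over dirty pages */
	}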

This means we allow allocation of additional kernel memory outside of the 
cpuset while triggering writeout of inodes that have pages on the nodes 
of the cpuset. The memory directly used by the application is still 
limited. Just the temporary information needed for writeback is allocated 
outside.

Well sounds somehow still like a hack. Any other ideas out there?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: BUG: linux 2.6.19 unable to enable acpi

2007-01-16 Thread Luming Yu

On 1/17/07, Matheus Izvekov <[EMAIL PROTECTED]> wrote:

On 1/17/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> On Wed, 2007-01-17 at 02:01 -0200, Matheus Izvekov wrote:
> > Just tried linux for the first time on this old machine, and i got
> > this problem. dmesg below:
>
>
> did this machine EVER support acpi ?
>
>

It used to support power button events, dont know what else. Is there
anything I can do to check how good the acpi support is?


Did you check the BIOS settings? Are there any ACPI-related menu items?
Does MS Windows work?
Have you ever tried other kernels, i.e. 2.6.18, 2.6.17, 2.6.16..?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 5/8] Make writeout during reclaim cpuset aware

2007-01-16 Thread Christoph Lameter
On Wed, 17 Jan 2007, Andi Kleen wrote:

> No actually people are fairly unhappy when one node is filled with 
> file data and then they don't get local memory from it anymore.
> I get regular complaints about that for Opteron.

Switch on zone_reclaim and it will take care of it. You can even switch it 
to write mode in order to get rid of dirty pages. However, be aware of the 
significantly reduced performance since you cannot go off node without 
writeback anymore.

> That is another concern. I haven't checked recently, but it used
> to be fairly simple to put a system to its knees by oversubscribing
> a single node with a strict memory policy. Fixing that would be good.

zone_reclaim has dealt with most of those issues.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] flush_cpu_workqueue: don't flush an empty ->worklist

2007-01-16 Thread Srivatsa Vaddagiri
On Tue, Jan 16, 2007 at 04:27:25PM +0300, Oleg Nesterov wrote:
> > I meant issuing kthread_stop() in DOWN_PREPARE so that worker
> > thread exits itself (much before CPU is actually brought down).
> 
> Deadlock if work_struct re-queues itself.

Are you referring to the problem you described here?

http://lkml.org/lkml/2007/1/8/173

If so, then it can easily be prevented by having run_workqueue() check for 
kthread_should_stop() in its while loop?
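
I.e. something like this (simplified sketch of the loop, ignoring the
pending-bit handling done in the real run_workqueue()):

	spin_lock_irq(&cwq->lock);
	while (!list_empty(&cwq->worklist) && !kthread_should_stop()) {
		struct work_struct *work = list_entry(cwq->worklist.next,
					struct work_struct, entry);

		list_del_init(cwq->worklist.next);
		spin_unlock_irq(&cwq->lock);
		work->func(work);
		spin_lock_irq(&cwq->lock);
	}
	spin_unlock_irq(&cwq->lock);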

> > Even if there are problems with it, how abt something like below:
> >
> > workqueue_cpu_callback()
> > {
> >
> > CPU_DEAD:
> > /* threads are still frozen at this point */
> > take_over_work();
> 
> No, we can't move a currently executing work. This will break flush_workxxx().

What do you mean by "currently" executing work? A worker thread executing
some work on the cpu? That is not possible, because all threads are
frozen at this point. There can't be any ongoing flush_workxxx() either
because of this, which should avoid breaking flush_workxxx() ..

> > if (kthread_marked_stop(current))
> > break;
> 
> Racy. Because of kthread_stop() above we should clear cwq->thread somehow.
> But we can't do this: this workqueue may be already destroyed.

We will surely take workqueue_mutex in CPU_CLEAN_THREADS (when it
traverses the workqueues list), which should avoid this race?

> Please note that the code I posted does something like kthread_mark_stop(), 
> but
> it operates on cwq, not on task_struct, this makes a big difference.

Ok, sure .. putting the flag in cwq makes sense. Others can also follow a
similar trick for stopping threads (like ksoftirqd).

> And it doesn't need take_over_work() at all. And it doesn't need additional 
> complications. Instead, it lessens both the source and compiled code.

I guess either way, changes are required.

1st method, what you are suggesting:

- Needs separate bitmap(s), cpu_populated_map and possibly another
  for create_workqueue()?
- flush_workqueue() traverses through a separate bitmap
  cpu_populated_map (separate from the online map) while
  create_workqueue() traverses the other bitmap

2nd method:

- Avoids the need for maintenance of separate bitmaps (uses
  existing cpu_online_map). All functions can safely use
  the online_map w/o any races. Personally this is why I like
  this approach.
- Needs changes in worker_thread to exit right after it comes
  out of refrigerator.

I havent made any changes as per 2nd method to see the resulting code
size, so I cant comment on code sizes.

Another point is that once we create code as in 1st method, which
maintains separate bitmaps, that will easily get replicated (over time) 
to other subsystems. Is that a good thing?

-- 
Regards,
vatsa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-16 Thread Trond Myklebust
On Wed, 2007-01-17 at 03:41 +0100, Peter Zijlstra wrote:
> On Tue, 2007-01-16 at 17:27 -0500, Trond Myklebust wrote:
> > On Tue, 2007-01-16 at 23:08 +0100, Peter Zijlstra wrote:
> > > Subject: nfs: fix congestion control
> > > 
> > > The current NFS client congestion logic is severely broken, it marks the
> > > backing device congested during each nfs_writepages() call and implements
> > > its own waitqueue.
> > > 
> > > Replace this by a more regular congestion implementation that puts a cap
> > > on the number of active writeback pages and uses the bdi congestion 
> > > waitqueue.
> > > 
> > > NFSv[34] commit pages are allowed to go unchecked as long as we are under 
> > > the dirty page limit and not in direct reclaim.
> 
> > 
> > What on earth is the point of adding congestion control to COMMIT?
> > Strongly NACKed.
> 
> They are dirty pages, how are we getting rid of them when we reached the
> dirty limit?

They are certainly _not_ dirty pages. They are pages that have been
written to the server but are not yet guaranteed to have hit the disk
(they were only written to the server's page cache). We don't care if
they are paged in or swapped out on the local client.

All the COMMIT does is ask the server to write the data from its
page cache onto disk. Once that has been done, we can release the pages.
If the commit fails, then we iterate through the whole writepage()
process again. The commit itself does, however, not even look at the
page data.

> > Why 16MB of on-the-wire data? Why not 32, or 128, or ...
> 
> Andrew always promotes a fixed number for congestion control, I pulled
> one from a dark place. I have no problem with a more dynamic solution.
> 
> > Solaris already allows you to send 2MB of write data in a single RPC
> > request, and the RPC engine has for some time allowed you to tune the
> > number of simultaneous RPC requests you have on the wire: Chuck has
> > already shown that read/write performance is greatly improved by upping
> > that value to 64 or more in the case of RPC over TCP. Why are we then
> > suddenly telling people that they are limited to 8 simultaneous writes?
> 
> min(max RPC size * max concurrent RPC reqs, dirty threshold) then?

That would be far preferable. For instance, it allows those who have
long latency fat pipes to actually use the bandwidth optimally when
writing out the data.

Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 5/8] Make writeout during reclaim cpuset aware

2007-01-16 Thread Andi Kleen
On Wednesday 17 January 2007 15:36, Paul Jackson wrote:
> > With a per node dirty limit ...
>
> What would this mean?
>
> Lets say we have a simple machine with 4 nodes, cpusets disabled.

There can be always NUMA policy without cpusets for once.

> Lets say all tasks are allowed to use all nodes, no set_mempolicy
> either.

Ok.

> If a task happens to fill up 80% of one node with dirty pages, but
> we have no dirty pages yet on other nodes, and we have a dirty ratio
> of 40%, then do we throttle that task's writes?

Yes we should actually. Every node should be able to supply
memory (unless extreme circumstances like mlock) and that much dirty 
memory on a node will make that hard.

> I am surprised you are asking for this, Andi.  I would have thought
> that on no-cpuset systems, the system wide throttling served your
> needs fine.  

No actually people are fairly unhappy when one node is filled with 
file data and then they don't get local memory from it anymore.
I get regular complaints about that for Opteron.

Dirty limit wouldn't be a full solution, but a good step.

> If not, then I can only guess that is because NUMA 
> mempolicy constraints on allowed nodes are causing the same dirty page
> problems as cpuset constrained systems -- is that your concern?

That is another concern. I haven't checked recently, but it used
to be fairly simple to put a system to its knees by oversubscribing
a single node with a strict memory policy. Fixing that would be good.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: O_DIRECT question

2007-01-16 Thread Arjan van de Ven
On Tue, 2007-01-16 at 21:26 +0100, Bodo Eggert wrote:
> Helge Hafting <[EMAIL PROTECTED]> wrote:
> > Michael Tokarev wrote:
> 
> >> But seriously - what about just disallowing non-O_DIRECT opens together
> >> with O_DIRECT ones ?
> >>   
> > Please do not create a new local DOS attack.
> > I open some important file, say /etc/resolv.conf
> > with O_DIRECT and just sit on the open handle.
> > Now nobody else can open that file because
> > it is "busy" with O_DIRECT ?
> 
> Suspend O_DIRECT access while non-O_DIRECT-fds are open, fdatasync on close?

.. then any user can impact the operation, performance and reliability
of the database application of another user... sounds like plugging one
hole by making a bigger hole ;)


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH]: MTRR: cosmetic fixes

2007-01-16 Thread Giuliano Procida
Subject: [PATCH]: cosmetic fixes

[MTRR 2.6.19.1]: cosmetic fixes

Signed-off-by: Giuliano Procida <[EMAIL PROTECTED]>

---

Fixed incorrect (though identical) types in struct mtrr_gentry32 and
tidied some badly-indented comments.

--- linux-source-2.6.19.1.orig/include/asm-x86_64/mtrr.h2006-12-11 
19:32:53.0 +
+++ linux-source-2.6.19.1/include/asm-x86_64/mtrr.h 2007-01-16 
07:33:19.0 +
@@ -30,7 +30,7 @@
 struct mtrr_sentry
 {
 unsigned long base;/*  Base address */
-unsigned int size;/*  Size of region   */
+unsigned int size; /*  Size of region   */
 unsigned int type; /*  Type of region   */
 };
 
@@ -41,7 +41,7 @@ struct mtrr_sentry
 struct mtrr_gentry
 {
 unsigned long base;/*  Base address */
-unsigned int size;/*  Size of region   */
+unsigned int size; /*  Size of region   */
 unsigned int regnum;   /*  Register number  */
 unsigned int type; /*  Type of region   */
 };
@@ -108,15 +108,15 @@ static __inline__ int mtrr_del_page (int
 struct mtrr_sentry32
 {
 compat_ulong_t base;/*  Base address */
-compat_uint_t size;/*  Size of region   */
+compat_uint_t size; /*  Size of region   */
 compat_uint_t type; /*  Type of region   */
 };
 
 struct mtrr_gentry32
 {
-compat_ulong_t regnum;   /*  Register number  */
-compat_uint_t base;/*  Base address */
-compat_uint_t size;/*  Size of region   */
+compat_uint_t regnum;   /*  Register number  */
+compat_ulong_t base;/*  Base address */
+compat_uint_t size; /*  Size of region   */
 compat_uint_t type; /*  Type of region   */
 };
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 9/9] net: vm deadlock avoidance core

2007-01-16 Thread Evgeniy Polyakov
On Tue, Jan 16, 2007 at 05:08:15PM +0100, Peter Zijlstra ([EMAIL PROTECTED]) 
wrote:
> On Tue, 2007-01-16 at 18:33 +0300, Evgeniy Polyakov wrote:
> > On Tue, Jan 16, 2007 at 02:47:54PM +0100, Peter Zijlstra ([EMAIL 
> > PROTECTED]) wrote:
> > > > > + if (unlikely(skb->emergency))
> > > > > + current->flags |= PF_MEMALLOC;
> > > > 
> > > > Access to 'current' in netif_receive_skb()???
> > > > Why do you want to work with, for example keventd?
> > > 
> > > Can this run in keventd?
> > 
> > Initial netchannel implementation by Kelly Daly (IBM) worked in keventd
> > (or dedicated kernel thread, I do not recall).
> > 
> > > I thought this was softirq context and thus this would either run in a
> > > borrowed context or in ksoftirqd. See patch 3/9.
> > 
> > And how are you going to access 'current' in softirq?
> > 
> > netif_receive_skb() can also be called from a lot of other places
> > including keventd and/or different context - it is permitted to call it
> > everywhere to process packet.
> > 
> > I meant that you break the rule accessing 'current' in that context.
> 
> Yeah, I know, but as long as we're not actually in hard irq context
> current does point to the task_struct in charge of current execution and
> as long as we restore whatever was in the flags field before we started
> poking, nothing can go wrong.
> 
> So, yes this is unconventional, but it does work as expected.
> 
> As for breaking, 3/9 makes it legal.

You operate on 'current' in different contexts without any locks, which
looks racy and is not even allowed. What will 'current' be in the
netif_rx() case, which schedules the softirq from hard irq context -
ksoftirqd? And why do you want to set its flags?

> > I meant that you can just mark process which created such socket as
> > PF_MEMALLOC, and clone that flag on forks and other relatest calls without 
> > all that checks for 'current' in different places.
> 
> Ah, thats the wrong level to think here, these processes never reach
> user-space - nor should these sockets.

Do you limit this just to sending an ack?
What about the 'level-7' ack you described in the introduction?

> Also, I only want the processing of the actual network packet to be able
> to eat the reserves, not any other thing that might happen in that
> context.
> 
> And since network processing is mostly done in softirq context I must
> mark these sections like I did.

You artificially limit the system to just adding a reserve to generate one ack.
For that purpose you do not need all those flags - just reserve
some data in the network core and use it when the system is in OOM (or reclaim)
for critical data paths.

> > > > > + /*
> > > > > +decrease window size..
> > > > > +tcp_enter_quickack_mode(sk);
> > > > > + */
> > > > 
> > > > How does this decrease window size?
> > > > Maybe ack scheduling would be better handled by inet_csk_schedule_ack()
> > > > or just directly send an ack, which in turn requires allocation, which
> > > > can be bound to this received frame processing...
> > > 
> > > It doesn't, I thought that it might be a good idea doing that, but never
> > > got around to actually figuring out how to do it.
> > 
> > tcp_send_ack()?
> > 
> 
> does that shrink the window automagically?

Yes, it updates the window, but having the ack generated in that place is
actually very wrong. At that point the system has not processed the incoming
packet yet, so it cannot generate a correct ACK for the received frame at
all. And it seems that the only purpose of the whole patchset is to
generate that poor ack - reserve 2007 ack packets (MAX_TCP_HEADER)
at system startup and reuse them when you are under memory pressure.

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 5/8] Make writeout during reclaim cpuset aware

2007-01-16 Thread Paul Jackson
> With a per node dirty limit ...

What would this mean?

Lets say we have a simple machine with 4 nodes, cpusets disabled.

Lets say all tasks are allowed to use all nodes, no set_mempolicy
either.

If a task happens to fill up 80% of one node with dirty pages, but
we have no dirty pages yet on other nodes, and we have a dirty ratio
of 40%, then do we throttle that task's writes?

I am surprised you are asking for this, Andi.  I would have thought
that on no-cpuset systems, the system wide throttling served your
needs fine.  If not, then I can only guess that is because NUMA
mempolicy constraints on allowed nodes are causing the same dirty page
problems as cpuset constrained systems -- is that your concern?

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: O_DIRECT question

2007-01-16 Thread Aubrey Li

On 1/12/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:



On Thu, 11 Jan 2007, Roy Huang wrote:
>
> On a embedded systerm, limiting page cache can relieve memory
> fragmentation. There is a patch against 2.6.19, which limit every
> opened file page cache and total pagecache. When the limit reach, it
> will release the page cache overrun the limit.

I do think that something like this is probably a good idea, even on
non-embedded setups. We historically couldn't do this, because mapped
pages were too damn hard to remove, but that's obviously not much of a
problem any more.

However, the page-cache limit should NOT be some compile-time constant. It
should work the same way the "dirty page" limit works, and probably just
default to "feel free to use 90% of memory for page cache".

Linus



The attached patch limit the page cache by a simple way:

1) If request memory from page cache, Set a flag to mark this kind of
allocation:

static inline struct page *page_cache_alloc(struct address_space *x)
{
-   return __page_cache_alloc(mapping_gfp_mask(x));
+   return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_PAGECACHE);
}

2) Have zone_watermark_ok done this limit:

+   if (alloc_flags & ALLOC_PAGECACHE){
+   min = min + VFS_CACHE_LIMIT;
+   }
+
   if (free_pages <= min + z->lowmem_reserve[classzone_idx])
   return 0;

3) So, when __alloc_pages is called by page cache, pass the
ALLOC_PAGECACHE into get_page_from_freelist to trigger the pagecache
limit branch in zone_watermark_ok.

This approach works on my side, I'll make a new patch to make the
limit tunable in the proc fs soon.
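
That tunable could be wired up roughly like this (sketch only; the
variable name is made up, and it would replace the fixed VFS_CACHE_LIMIT
constant used in the patch below):

int sysctl_pagecache_limit_pages = 0x400;

static ctl_table pagecache_limit_table[] = {
	{
		.ctl_name	= CTL_UNNUMBERED,
		.procname	= "pagecache_limit_pages",
		.data		= &sysctl_pagecache_limit_pages,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec,
	},
	{}
};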

The following is the patch:
=
Index: mm/page_alloc.c
===
--- mm/page_alloc.c (revision 2645)
+++ mm/page_alloc.c (working copy)
@@ -892,6 +892,9 @@ failed:
#define ALLOC_HARDER    0x10 /* try to alloc harder */
#define ALLOC_HIGH      0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET    0x40 /* check for correct cpuset */
+#define ALLOC_PAGECACHE 0x80 /* __GFP_PAGECACHE set */
+
+#define VFS_CACHE_LIMIT 0x400 /* limit VFS cache page */

/*
 * Return 1 if free pages are above 'mark'. This takes into account the order
@@ -910,6 +913,10 @@ int zone_watermark_ok(struct zone *z, in
if (alloc_flags & ALLOC_HARDER)
min -= min / 4;

+   if (alloc_flags & ALLOC_PAGECACHE){
+   min = min + VFS_CACHE_LIMIT;
+   }
+
if (free_pages <= min + z->lowmem_reserve[classzone_idx])
return 0;
for (o = 0; o < order; o++) {
@@ -1000,8 +1007,12 @@ restart:
return NULL;
}

-   page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-   zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+   if (gfp_mask & __GFP_PAGECACHE) 
+   page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+   zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_PAGECACHE);
+   else
+   page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+   zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
if (page)
goto got_pg;

@@ -1027,6 +1038,9 @@ restart:
if (wait)
alloc_flags |= ALLOC_CPUSET;

+   if (gfp_mask & __GFP_PAGECACHE)
+   alloc_flags |= ALLOC_PAGECACHE;
+
/*
 * Go through the zonelist again. Let __GFP_HIGH and allocations
 * coming from realtime tasks go deeper into reserves.
Index: include/linux/gfp.h
===
--- include/linux/gfp.h (revision 2645)
+++ include/linux/gfp.h (working copy)
@@ -46,6 +46,7 @@ struct vm_area_struct;
#define __GFP_NOMEMALLOC ((__force gfp_t)0x1u) /* Don't use emergency reserves */
#define __GFP_HARDWALL   ((__force gfp_t)0x2u) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE  ((__force gfp_t)0x4u) /* No fallback, no policies */
+#define __GFP_PAGECACHE ((__force gfp_t)0x8u) /* Is page cache allocation ? */

#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
Index: include/linux/pagemap.h
===
--- include/linux/pagemap.h (revision 2645)
+++ include/linux/pagemap.h (working copy)
@@ -62,7 +62,7 @@ static inline struct page *__page_cache_

static inline struct page *page_cache_alloc(struct address_space *x)
{
-   return __page_cache_alloc(mapping_gfp_mask(x));
+   return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_PAGECACHE);
}

static inline struct page *page_cache_alloc_cold(struct address_space *x)
=

Welcome any comm

Re: [RFC 5/8] Make writeout during reclaim cpuset aware

2007-01-16 Thread Andi Kleen
On Wednesday 17 January 2007 15:20, Paul Jackson wrote:
> Andi wrote:
> > Is there a reason this can't be just done by node, ignoring the cpusets?
>
> This suggestion doesn't make a whole lot of sense to me.
>
> We're looking to see if a task has dirtied most of the
> pages in the nodes it is allowed to use.  If it has, then
> we want to start pushing pages to the disk harder, and
> slowing down the tasks writes.
>
> What would it mean to do this per-node?  And why would
> that be better?

With a per node dirty limit you would get essentially the
same effect and it would have the advantage of helping
people who don't configure any cpusets but run on a NUMA 
system.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/59] Cleanup sysctl

2007-01-16 Thread Andi Kleen
On Wednesday 17 January 2007 03:33, Eric W. Biederman wrote:
> There has not been much maintenance on sysctl in years, and as a result
> there is a lot to do to allow future interesting work to happen, and being
> ambitious I'm trying to do it all at once :)
>
> The patches in this series fall into several general categories.

[...]

The patches look good to me.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/8] Convert higest_possible_node_id() into nr_node_ids

2007-01-16 Thread Christoph Lameter
On Wed, 17 Jan 2007, Andi Kleen wrote:

> > > Are you sure this is even possible in general on systems with node
> > > hotplug? The firmware might not pass a maximum limit.
> >
> > In that case the node possible map must include all nodes right?
> 
> Yes.

Then we are fine.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: BUG: linux 2.6.19 unable to enable acpi

2007-01-16 Thread Matheus Izvekov

On 1/17/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote:

On Wed, 2007-01-17 at 02:01 -0200, Matheus Izvekov wrote:
> Just tried linux for the first time on this old machine, and i got
> this problem. dmesg below:


did this machine EVER support acpi ?




It used to support power button events, dont know what else. Is there
anything I can do to check how good the acpi support is?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 5/8] Make writeout during reclaim cpuset aware

2007-01-16 Thread Christoph Lameter
On Wed, 17 Jan 2007, Andi Kleen wrote:

> On Tuesday 16 January 2007 16:48, Christoph Lameter wrote:
> > Direct reclaim: cpuset aware writeout
> >
> > During direct reclaim we traverse down a zonelist and are carefully
> > checking each zone if its a member of the active cpuset. But then we call
> > pdflush without enforcing the same restrictions. In a larger system this
> > may have the effect of a massive amount of pages being dirtied and then
> > either
> 
> Is there a reason this can't be just done by node, ignoring the cpusets? 

We want to writeout dirty pages that help our situation. Those are located 
on the nodes of the cpuset.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 5/8] Make writeout during reclaim cpuset aware

2007-01-16 Thread Paul Jackson
Andi wrote:
> Is there a reason this can't be just done by node, ignoring the cpusets?

This suggestion doesn't make a whole lot of sense to me.

We're looking to see if a task has dirtied most of the
pages in the nodes it is allowed to use.  If it has, then
we want to start pushing pages to the disk harder, and
slowing down the tasks writes.

What would it mean to do this per-node?  And why would
that be better?

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/8] Convert higest_possible_node_id() into nr_node_ids

2007-01-16 Thread Andi Kleen
On Wednesday 17 January 2007 14:14, Christoph Lameter wrote:
> On Wed, 17 Jan 2007, Andi Kleen wrote:
> > On Tuesday 16 January 2007 16:47, Christoph Lameter wrote:
> > > I think having the ability to determine the maximum amount of nodes in
> > > a system at runtime is useful but then we should name this entry
> > > correspondingly and also only calculate the value once on bootup.
> >
> > Are you sure this is even possible in general on systems with node
> > hotplug? The firmware might not pass a maximum limit.
>
> In that case the node possible map must include all nodes right?

Yes.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 5/8] Make writeout during reclaim cpuset aware

2007-01-16 Thread Andi Kleen
On Tuesday 16 January 2007 16:48, Christoph Lameter wrote:
> Direct reclaim: cpuset aware writeout
>
> During direct reclaim we traverse down a zonelist and are carefully
> checking each zone if its a member of the active cpuset. But then we call
> pdflush without enforcing the same restrictions. In a larger system this
> may have the effect of a massive amount of pages being dirtied and then
> either

Is there a reason this can't be just done by node, ignoring the cpusets? 

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: BUG: linux 2.6.19 unable to enable acpi

2007-01-16 Thread Arjan van de Ven
On Wed, 2007-01-17 at 02:01 -0200, Matheus Izvekov wrote:
> Just tried linux for the first time on this old machine, and i got
> this problem. dmesg below:


did this machine EVER support acpi ?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andrew Morton
> On Tue, 16 Jan 2007 19:40:17 -0800 (PST) Christoph Lameter <[EMAIL 
> PROTECTED]> wrote:
> On Tue, 16 Jan 2007, Andrew Morton wrote:
> 
> > Consider: non-exclusive cpuset A consists of mems 0-15, non-exclusive
> > cpuset B consists of mems 0-3.  A task running in cpuset A can freely dirty
> > all of cpuset B's memory.  A task running in cpuset B gets oomkilled.
> > 
> > Consider: a 32-node machine has nodes 0-3 full of dirty memory.  I create a
> > cpuset containing nodes 0-2 and start using it.  I get oomkilled.
> > 
> > There may be other scenarios.
> 
> Yes this is the result of the hierarchical nature of cpusets which already 
> causes issues with the scheduler. It is rather typical that cpusets are 
> used to partition the memory and cpus. Overlapping cpusets seem to have 
> mainly an administrative function. Paul?

The typical usage scenarios don't matter a lot: the examples I gave show
that the core problem remains unsolved.  People can still hit the bug.

> > So what I suggest we do is to fix the NFS bug, then move on to considering
> > the performance problems.
> 
> The NFS "bug" has been there for ages and no one cares since write 
> throttling works effectively. Since NFS can go via any network technology 
> (f.e. infiniband) we have many potential issues at that point that depend 
> on the underlying network technology. As far as I can recall we decided 
> that these stacking issues are inherently problematic and basically 
> unsolvable.

The problem you refer to arises from the inability of the net driver to
allocate memory for an outbound ack.  Such allocations aren't constrained to
a cpuset.

I expect that we can solve the NFS oom problem along the same lines as
block devices.  Certainly it's dumb of us to oom-kill a process rather than
going off-cpuset for a small and short-lived allocation.  It's also dumb of
us to allocate a basically unbounded number of nfs requests rather than
waiting for some of the ones which we _have_ allocated to complete.


> > On reflection, I agree that your proposed changes are sensible-looking for
> > addressing the probable, not-yet-demonstrated-and-quantified performance
> > problem.  The per-inode (should be per-address_space, maybe it is?) node
> 
> The address space is part of the inode.

Physically, yes.  Logically, it is not.  The address_space controls the
data-plane part of a file and is the appropriate place in which to store
this nodemask.

> Some of my development versions put 
> the dirty_map in the address space. However, the end of the inode was a 
> convenient place for a runtime-sized nodemask.
> 
> > map is unfortunate.  Need to think about that a bit more.  For a start, it
> > should be dynamically allocated (from a new, purpose-created slab cache):
> > most in-core inodes don't have any dirty pages and don't need this
> > additional storage.
> 
> We also considered such an approach. However, it creates the problem 
> of performing a slab allocation while dirtying pages. At that point we do 
> not have an allocation context, nor can we block.

Yes, it must be an atomic allocation.  If it fails, we don't care.  Chances
are it'll succeed when the next page in this address_space gets dirtied.

Plus we don't waste piles of memory on read-only files.
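
Something along these lines, as a sketch only - the dirty_nodes pointer in
the address_space and the slab cache are hypothetical:

static struct kmem_cache *dirty_nodes_cachep;

static void mapping_set_dirty_node(struct address_space *mapping, int node)
{
	if (!mapping->dirty_nodes) {
		mapping->dirty_nodes = kmem_cache_alloc(dirty_nodes_cachep,
							GFP_ATOMIC);
		if (!mapping->dirty_nodes)
			return;	/* fine, retry on the next dirtied page */
		nodes_clear(*mapping->dirty_nodes);
	}
	node_set(node, *mapping->dirty_nodes);
}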

> > But this is unrelated to the NFS bug ;)
> 
> Looks more like a design issue (given its layering on top of the 
> networking layer) and not a bug. The "bug" surfaces when writeback is not 
> done properly. I wonder what happens if other filesystems are pushed to 
> the border of the dirty abyss.   The mmap tracking 
> fixes that were done in 2.6.19 were done because of similar symptoms 
> because the system's dirty tracking was off. This is fundamentally the 
> same issue showing up in a cpuset. So we should be able to produce the
> hangs (looks ... yes another customer reported issue on this one is that 
> reclaim is continually running and we basically livelock the system) that 
> we saw for the mmap dirty tracking issues in addition to the NFS problems 
> seen so far.
> 
> Memory allocation is required in most filesystem flush paths. If we cannot 
> allocate memory then we cannot clean pages and thus we continue trying -> 
> Livelock. I still see this as a fundamental correctness issue in the 
> kernel.

I'll believe all that once someone has got down and tried to fix NFS, and
has failed ;)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Paul Jackson
> Yes this is the result of the hierarchical nature of cpusets which already 
> causes issues with the scheduler. It is rather typical that cpusets are 
> used to partition the memory and cpus. Overlapping cpusets seem to have 
> mainly an administrative function. Paul?

The heavy weight tasks, which are expected to be applying serious memory
pressure (whether for data pages or dirty file pages), are usually in
non-overlapping cpusets, or sharing the same cpuset, but not partially
overlapping with, or a proper superset of, some other cpuset holding an
active job.

The higher level cpusets, such as the top cpuset, or the one deeded over
to the Batch Scheduler, are proper supersets of many other cpusets.  We
avoid putting anything heavy weight in those cpusets.

Sometimes of course a task turns out to be unexpectedly heavy weight.
But in that case, we're mostly interested in function (system keeps
running), not performance.

That is, if someone setup what Andrew described, with a job in a large
cpuset sucking up all available memory from one in a smaller, contained
cpuset, I don't think I'm tuning for optimum performance anymore.
Rather I'm just trying to keep the system running and keep unrelated
jobs unaffected while we dig our way out of the hole.  If the smaller
job OOM's, that's tough nuggies.  They asked for it.  Jobs in
-unrelated- (non-overlapping) cpusets should ride out the storm with
little or no impact on their performance.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


BUG: linux 2.6.19 unable to enable acpi

2007-01-16 Thread Matheus Izvekov

Just tried linux for the first time on this old machine, and i got
this problem. dmesg below:

Linux version 2.6.19 ([EMAIL PROTECTED]) (gcc version 4.1.1 (Gentoo
4.1.1-r3)) #10 PREEMPT Sun Dec 10 17:35:24 BRST 2006
BIOS-provided physical RAM map:
BIOS-e820:  - 0009fc00 (usable)
BIOS-e820: 0009fc00 - 000a (reserved)
BIOS-e820: 000dc000 - 000e (reserved)
BIOS-e820: 000f - 0010 (reserved)
BIOS-e820: 0010 - 0fdf (usable)
BIOS-e820: 0fdf - 0fdf8000 (ACPI data)
BIOS-e820: 0fdf8000 - 0fe0 (ACPI NVS)
BIOS-e820: ffef - fff0 (reserved)
BIOS-e820:  - 0001 (reserved)
253MB LOWMEM available.
Entering add_active_range(0, 0, 65008) 0 entries of 256 used
Zone PFN ranges:
 DMA 0 -> 4096
 Normal   4096 ->65008
early_node_map[1] active PFN ranges
   0:0 ->65008
On node 0 totalpages: 65008
 DMA zone: 32 pages used for memmap
 DMA zone: 0 pages reserved
 DMA zone: 4064 pages, LIFO batch:0
 Normal zone: 475 pages used for memmap
 Normal zone: 60437 pages, LIFO batch:15
DMI 2.2 present.
ACPI: RSDP (v000 AMI   ) @ 0x000fb080
ACPI: RSDT (v001 AMIINT  0x MSFT 0x0097) @ 0x0fdf
ACPI: FADT (v001 AMIINT  0x MSFT 0x0097) @ 0x0fdf0030
ACPI: DSDT (v001SiS  620 0x1000 MSFT 0x010a) @ 0x
ACPI: PM-Timer IO Port: 0x408
Allocating PCI resources starting at 1000 (gap: 0fe0:f00f)
Detected 300.701 MHz processor.
Built 1 zonelists.  Total pages: 64501
Kernel command line: root=/dev/sda3
Local APIC disabled by BIOS -- you can enable it with "lapic"
mapped APIC to d000 (01201000)
Initializing CPU#0
CPU 0 irqstacks, hard=c039e000 soft=c039d000
PID hash table entries: 1024 (order: 10, 4096 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Memory: 254172k/260032k available (1868k kernel code, 5368k reserved,
603k data, 180k init, 0k highmem)
virtual kernel memory layout:
   fixmap  : 0xfffb7000 - 0xf000   ( 288 kB)
   vmalloc : 0xd080 - 0xfffb5000   ( 759 MB)
   lowmem  : 0xc000 - 0xcfdf   ( 253 MB)
 .init : 0xc036b000 - 0xc0398000   ( 180 kB)
 .data : 0xc02d3086 - 0xc0369fa8   ( 603 kB)
 .text : 0xc010 - 0xc02d3086   (1868 kB)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 601.79 BogoMIPS (lpj=300897)
Security Framework v1.0.0 initialized
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 0080f9ff   
  
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 512K
CPU: After all inits, caps: 0080f9ff   0040
  
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: Intel Pentium II (Klamath) stepping 03
Checking 'hlt' instruction... OK.
ACPI: Core revision 20060707
ACPI: setting ELCR to 0200 (from 1c00)
ACPI Error (hwacpi-0179): Hardware did not change modes [20060707]
ACPI Error (evxfevnt-0084): Could not transition to ACPI mode [20060707]
ACPI Warning (utxface-0154): AcpiEnable failed [20060707]
ACPI: Unable to enable ACPI
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] nfs: fix congestion control

2007-01-16 Thread Peter Zijlstra
On Tue, 2007-01-16 at 17:27 -0500, Trond Myklebust wrote:
> On Tue, 2007-01-16 at 23:08 +0100, Peter Zijlstra wrote:
> > Subject: nfs: fix congestion control
> > 
> > The current NFS client congestion logic is severely broken, it marks the
> > backing device congested during each nfs_writepages() call and implements
> > its own waitqueue.
> > 
> > Replace this by a more regular congestion implementation that puts a cap
> > on the number of active writeback pages and uses the bdi congestion 
> > waitqueue.
> > 
> > NFSv[34] commit pages are allowed to go unchecked as long as we are under 
> > the dirty page limit and not in direct reclaim.

> 
> What on earth is the point of adding congestion control to COMMIT?
> Strongly NACKed.

They are dirty pages, how are we getting rid of them when we reached the
dirty limit?

> Why 16MB of on-the-wire data? Why not 32, or 128, or ...

Andrew always promotes a fixed number for congestion control, I pulled
one from a dark place. I have no problem with a more dynamic solution.

> Solaris already allows you to send 2MB of write data in a single RPC
> request, and the RPC engine has for some time allowed you to tune the
> number of simultaneous RPC requests you have on the wire: Chuck has
> already shown that read/write performance is greatly improved by upping
> that value to 64 or more in the case of RPC over TCP. Why are we then
> suddenly telling people that they are limited to 8 simultaneous writes?

min(max RPC size * max concurrent RPC reqs, dirty threshold) then?
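
Something like this, purely as an illustration of the shape of the cap
(hypothetical helper, all values in pages):

static unsigned long nfs_write_congestion_limit(unsigned long max_rpc_bytes,
						unsigned long max_rpc_slots,
						unsigned long dirty_thresh)
{
	unsigned long rpc_pages = (max_rpc_bytes * max_rpc_slots) >> PAGE_SHIFT;

	return min(rpc_pages, dirty_thresh);
}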



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH]: MTRR: fix 32-bit ioctls on x64_32

2007-01-16 Thread H. Peter Anvin

Mikael Pettersson wrote:


These #ifdefs are too ugly.

Since you apparently just add aliases for the case labels,
and do no actual code changes, why not
1. make the new cases unconditional, or 
2. invoke a translation function before the switch which

   maps the MTRRIOC32_ constants to what the kernel uses



Adding a case can add substantially to the generated code, especially if 
it makes a compact set of case labels non-compact.
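
The translation-function variant (option 2 above) keeps the original
switch compact - rough sketch only, assuming MTRRIOC32_* aliases along
the lines of the proposed patch:

static unsigned int mtrr_translate_cmd(unsigned int cmd)
{
	switch (cmd) {
	case MTRRIOC32_ADD_ENTRY:	return MTRRIOC_ADD_ENTRY;
	case MTRRIOC32_SET_ENTRY:	return MTRRIOC_SET_ENTRY;
	case MTRRIOC32_DEL_ENTRY:	return MTRRIOC_DEL_ENTRY;
	case MTRRIOC32_GET_ENTRY:	return MTRRIOC_GET_ENTRY;
	case MTRRIOC32_KILL_ENTRY:	return MTRRIOC_KILL_ENTRY;
	default:			return cmd;
	}
}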


-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20-rc5: known unfixed regressions

2007-01-16 Thread David Chinner
On Tue, Jan 16, 2007 at 05:15:02PM +1100, David Chinner wrote:
> On Sat, Jan 13, 2007 at 08:11:25AM +0100, Adrian Bunk wrote:
> > On Fri, Jan 12, 2007 at 02:27:48PM -0500, Linus Torvalds wrote:
> > >...
> > > A lot of developers (including me) will be gone next week for 
> > > Linux.Conf.Au, so you have a week of rest and quiet to test this, and 
> > > report any problems. 
> > > 
> > > Not that there will be any, right? You all behave now!
> > >...
> > 
> > This still leaves the old regressions we have not yet fixed...
> > 
> > 
> > This email lists some known regressions in 2.6.20-rc5 compared to 2.6.19.
> > 
> > 
> > Subject: BUG: at mm/truncate.c:60 cancel_dirty_page()  (XFS)
> > References : http://lkml.org/lkml/2007/1/5/308
> > Submitter  : Sami Farin <[EMAIL PROTECTED]>
> > Handled-By : David Chinner <[EMAIL PROTECTED]>
> > Status : problem is being discussed
> 
> I'm at LCA and been having laptop dramas so the fix is being held up at this
> point. I and trying to test a change right now that adds an optional unmap
> to truncate_inode_pages_range as XFS needs, in some circumstances, to toss
> out dirty pages (with dirty bufferheads) and hence requires truncate semantics
> that are currently missing unmap calls.
> 
> Semi-untested patch attached below.

The patch has run XFSQA for about 24 hours now on my test rig without
triggering any problems.

Cheers,

Dave.

>  fs/xfs/linux-2.6/xfs_fs_subr.c |6 ++--
>  include/linux/mm.h |2 +
>  mm/truncate.c  |   60 
> -
>  3 files changed, 60 insertions(+), 8 deletions(-)
> 
> Index: linux-2.6.19/fs/xfs/linux-2.6/xfs_fs_subr.c
> ===
> --- linux-2.6.19.orig/fs/xfs/linux-2.6/xfs_fs_subr.c  2006-10-03 
> 23:22:36.0 +1000
> +++ linux-2.6.19/fs/xfs/linux-2.6/xfs_fs_subr.c   2007-01-17 
> 01:24:51.771273750 +1100
> @@ -32,7 +32,8 @@ fs_tosspages(
>   struct inode*ip = vn_to_inode(vp);
>  
>   if (VN_CACHED(vp))
> - truncate_inode_pages(ip->i_mapping, first);
> + truncate_unmap_inode_pages_range(ip->i_mapping,
> +  first, last, 1);
>  }
>  
>  void
> @@ -49,7 +50,8 @@ fs_flushinval_pages(
>   if (VN_TRUNC(vp))
>   VUNTRUNCATE(vp);
>   filemap_write_and_wait(ip->i_mapping);
> - truncate_inode_pages(ip->i_mapping, first);
> + truncate_unmap_inode_pages_range(ip->i_mapping,
> +  first, last, 1);
>   }
>  }
>  
> Index: linux-2.6.19/include/linux/mm.h
> ===
> --- linux-2.6.19.orig/include/linux/mm.h  2007-01-17 01:21:16.01779 
> +1100
> +++ linux-2.6.19/include/linux/mm.h   2007-01-17 01:24:51.775274000 +1100
> @@ -1058,6 +1058,8 @@ extern unsigned long page_unuse(struct p
>  extern void truncate_inode_pages(struct address_space *, loff_t);
>  extern void truncate_inode_pages_range(struct address_space *,
>  loff_t lstart, loff_t lend);
> +extern void truncate_unmap_inode_pages_range(struct address_space *,
> +loff_t lstart, loff_t lend, int unmap);
>  
>  /* generic vm_area_ops exported for stackable file systems */
>  extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, 
> int *);
> Index: linux-2.6.19/mm/truncate.c
> ===
> --- linux-2.6.19.orig/mm/truncate.c   2007-01-17 01:21:23.074231000 +1100
> +++ linux-2.6.19/mm/truncate.c2007-01-17 01:24:51.779274250 +1100
> @@ -59,7 +59,7 @@ void cancel_dirty_page(struct page *page
>  
>   WARN_ON(++warncount < 5);
>   }
> - 
> +
>   if (TestClearPageDirty(page)) {
>   struct address_space *mapping = page->mapping;
>   if (mapping && mapping_cap_account_dirty(mapping)) {
> @@ -122,16 +122,34 @@ invalidate_complete_page(struct address_
>   return ret;
>  }
>  
> +/*
> + * This is a helper for truncate_unmap_inode_page. Unmap the page we
> + * are passed. Page must be locked by the caller.
> + */
> +static void
> +unmap_single_page(struct address_space *mapping, struct page *page)
> +{
> + BUG_ON(!PageLocked(page));
> + while (page_mapped(page)) {
> + unmap_mapping_range(mapping,
> + (loff_t)page->index << PAGE_CACHE_SHIFT,
> + PAGE_CACHE_SIZE, 0);
> + }
> +}
> +
>  /**
> - * truncate_inode_pages - truncate range of pages specified by start and
> + * truncate_unmap_inode_pages_range - truncate range of pages specified by
> + * start and end byte offsets and optionally unmap them first.
>   * end byte offsets
>   * @mapping: mapping to truncate
>   * @lstart: offset from which to truncate
>   * @lend: offset to which to truncate
> + *

Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Tue, 16 Jan 2007, Andrew Morton wrote:

> Consider: non-exclusive cpuset A consists of mems 0-15, non-exclusive
> cpuset B consists of mems 0-3.  A task running in cpuset A can freely dirty
> all of cpuset B's memory.  A task running in cpuset B gets oomkilled.
> 
> Consider: a 32-node machine has nodes 0-3 full of dirty memory.  I create a
> cpuset containing nodes 0-2 and start using it.  I get oomkilled.
> 
> There may be other scenarios.

Yes this is the result of the hierarchical nature of cpusets which already 
causes issues with the scheduler. It is rather typical that cpusets are 
used to partition the memory and cpus. Overlapping cpusets seem to have 
mainly an administrative function. Paul?

> So what I suggest we do is to fix the NFS bug, then move on to considering
> the performance problems.

The NFS "bug" has been there for ages and no one cares since write 
throttling works effectively. Since NFS can go via any network technology 
(f.e. infiniband) we have many potential issues at that point that depend 
on the underlying network technology. As far as I can recall we decided 
that these stacking issues are inherently problematic and basically 
unsolvable.

> On reflection, I agree that your proposed changes are sensible-looking for
> addressing the probable, not-yet-demonstrated-and-quantified performance
> problem.  The per-inode (should be per-address_space, maybe it is?) node

The address space is part of the inode. Some of my development versions put 
the dirty_map in the address space. However, the end of the inode was a 
convenient place for a runtime-sized nodemask.

> map is unfortunate.  Need to think about that a bit more.  For a start, it
> should be dynamically allocated (from a new, purpose-created slab cache):
> most in-core inodes don't have any dirty pages and don't need this
> additional storage.

We also considered such an approach. However, it creates the problem 
of performing a slab allocation while dirtying pages. At that point we do 
not have an allocation context, nor can we block.

> But this is unrelated to the NFS bug ;)

Looks more like a design issue (given its layering on top of the 
networking layer) and not a bug. The "bug" surfaces when writeback is not 
done properly. I wonder what happens if other filesystems are pushed to 
the border of the dirty abyss.   The mmap tracking 
fixes that were done in 2.6.19 were done because of similar symptoms 
because the system's dirty tracking was off. This is fundamentally the 
same issue showing up in a cpuset. So we should be able to produce the
hangs (looks ... yes another customer reported issue on this one is that 
reclaim is continually running and we basically livelock the system) that 
we saw for the mmap dirty tracking issues in addition to the NFS problems 
seen so far.

Memory allocation is required in most filesystem flush paths. If we cannot 
allocate memory then we cannot clean pages and thus we continue trying -> 
Livelock. I still see this as a fundamental correctness issue in the 
kernel.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 35/59] sysctl: C99 convert ctl_tables in arch/powerpc/kernel/idle.c

2007-01-16 Thread Benjamin Herrenschmidt
On Tue, 2007-01-16 at 09:39 -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <[EMAIL PROTECTED]> - unquoted
> 
> This was partially done already and there was no ABI breakage what
> a relief.
> 
> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>

Acked-by: Benjamin Herrenschmidt <[EMAIL PROTECTED]>

> ---
>  arch/powerpc/kernel/idle.c |   11 ---
>  1 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c
> index 8994af3..8b27bb1 100644
> --- a/arch/powerpc/kernel/idle.c
> +++ b/arch/powerpc/kernel/idle.c
> @@ -110,11 +110,16 @@ static ctl_table powersave_nap_ctl_table[]={
>   .mode   = 0644,
>   .proc_handler   = &proc_dointvec,
>   },
> - { 0, },
> + {}
>  };
>  static ctl_table powersave_nap_sysctl_root[] = {
> - { 1, "kernel", NULL, 0, 0755, powersave_nap_ctl_table, },
> - { 0,},
> + {
> + .ctl_name   = CTL_KERN,
> + .procname   = "kernel",
> + .mode   = 0755,
> + .child  = powersave_nap_ctl_table,
> + },
> + {}
>  };
>  
>  static int __init

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 36/59] sysctl: C99 convert ctl_tables entries in arch/ppc/kernel/ppc_htab.c

2007-01-16 Thread Benjamin Herrenschmidt
On Tue, 2007-01-16 at 09:39 -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <[EMAIL PROTECTED]> - unquoted
> 
> And make the mode of the kernel directory 0555 no one is allowed
> to write to sysctl directories.
> 
> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>

Acked-by: Benjamin Herrenschmidt <[EMAIL PROTECTED]>

> ---
>  arch/ppc/kernel/ppc_htab.c |   11 ---
>  1 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/ppc/kernel/ppc_htab.c b/arch/ppc/kernel/ppc_htab.c
> index bd129d3..77b20ff 100644
> --- a/arch/ppc/kernel/ppc_htab.c
> +++ b/arch/ppc/kernel/ppc_htab.c
> @@ -442,11 +442,16 @@ static ctl_table htab_ctl_table[]={
>   .mode   = 0644,
>   .proc_handler   = &proc_dol2crvec,
>   },
> - { 0, },
> + {}
>  };
>  static ctl_table htab_sysctl_root[] = {
> - { 1, "kernel", NULL, 0, 0755, htab_ctl_table, },
> - { 0,},
> + {
> + .ctl_name   = CTL_KERN,
> + .procname   = "kernel",
> + .mode   = 0555,
> + .child  = htab_ctl_table,
> + },
> + {}
>  };
>  
>  static int __init

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 18/59] sysctl: ipmi remove unnecessary insert_at_head flag

2007-01-16 Thread Benjamin Herrenschmidt
On Tue, 2007-01-16 at 09:39 -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <[EMAIL PROTECTED]> - unquoted
> 
> With unique sysctl binary numbers setting insert_at_head is pointless.
> 
> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>

Acked-by: Benjamin Herrenschmidt <[EMAIL PROTECTED]>

> ---
>  drivers/char/ipmi/ipmi_poweroff.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/char/ipmi/ipmi_poweroff.c 
> b/drivers/char/ipmi/ipmi_poweroff.c
> index 9d23136..b3ae65e 100644
> --- a/drivers/char/ipmi/ipmi_poweroff.c
> +++ b/drivers/char/ipmi/ipmi_poweroff.c
> @@ -686,7 +686,7 @@ static int ipmi_poweroff_init (void)
>   printk(KERN_INFO PFX "Power cycle is enabled.\n");
>  
>  #ifdef CONFIG_PROC_FS
> - ipmi_table_header = register_sysctl_table(ipmi_root_table, 1);
> + ipmi_table_header = register_sysctl_table(ipmi_root_table, 0);
>   if (!ipmi_table_header) {
>   printk(KERN_ERR PFX "Unable to register powercycle sysctl\n");
>   rv = -ENOMEM;

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/8] Convert higest_possible_node_id() into nr_node_ids

2007-01-16 Thread Christoph Lameter
On Wed, 17 Jan 2007, Andi Kleen wrote:

> On Tuesday 16 January 2007 16:47, Christoph Lameter wrote:
> 
> > I think having the ability to determine the maximum amount of nodes in
> > a system at runtime is useful but then we should name this entry
> > correspondingly and also only calculate the value once on bootup.
> 
> Are you sure this is even possible in general on systems with node
> hotplug? The firmware might not pass a maximum limit.

In that case the node possible map must include all nodes, right?
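
Roughly what I have in mind, as a sketch only (nr_node_ids does not exist 
yet; the name and placement are just illustrative): derive the value once 
at boot from the possible map and never recompute it:

#include <linux/nodemask.h>

unsigned int nr_node_ids __read_mostly = MAX_NUMNODES;

/* Run once during boot, after node_possible_map is final. */
static void __init setup_nr_node_ids(void)
{
        unsigned int node, highest = 0;

        for_each_node_mask(node, node_possible_map)
                highest = node;
        nr_node_ids = highest + 1;
}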

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] intel_agp: restore graphics device's pci space early in resume

2007-01-16 Thread Wang Zhenyu

Dave, 

Currently in the resume path the graphics device's PCI space is restored 
after the host bridge's, so the resume function wrongly accesses the 
graphics device's space. This makes resume fail, which crashes X. So 
here's a patch to restore the device's PCI space early, which makes
resuming work with X. Patch against 2.6.20-rc5.

Signed-off-by: Wang Zhenyu <[EMAIL PROTECTED]>

---
diff --git a/drivers/char/agp/intel-agp.c b/drivers/char/agp/intel-agp.c
index ab0a9c0..7af734b 100644
--- a/drivers/char/agp/intel-agp.c
+++ b/drivers/char/agp/intel-agp.c
@@ -1955,6 +1955,15 @@ static int agp_intel_resume(struct pci_d
 
pci_restore_state(pdev);
 
+   /* We should restore our graphics device's config space,
+* as host bridge (00:00) resumes before graphics device (02:00),
+* then our access to its pci space can work right. 
+*/
+   if (intel_i810_private.i810_dev)
+   pci_restore_state(intel_i810_private.i810_dev);
+   if (intel_i830_private.i830_dev)
+   pci_restore_state(intel_i830_private.i830_dev);
+
if (bridge->driver == &intel_generic_driver)
intel_configure();
else if (bridge->driver == &intel_850_driver)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andrew Morton
> On Tue, 16 Jan 2007 17:30:26 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > Nope.  You've completely omitted the little fact that we'll do writeback in
> > the offending zone off the LRU.  Slower, maybe.  But it should work and the
> > system should recover.  If it's not doing that (it isn't) then we should
> > fix it rather than avoiding it (by punting writeback over to pdflush).
> 
> pdflush is not running *at* all nor is dirty throttling working. That is 
> correct behavior? We could do background writeback but we choose not to do 
> so? Instead we wait until we hit reclaim and then block (well it seems 
> that we do not block the blocking there also fails since we again check 
> global ratios)?

I agree that it is a worthy objective to be able to constrain a cpuset's
dirty memory levels.  But as a performance optimisation and NOT as a
correctness fix.

Consider: non-exclusive cpuset A consists of mems 0-15, non-exclusive
cpuset B consists of mems 0-3.  A task running in cpuset A can freely dirty
all of cpuset B's memory.  A task running in cpuset B gets oomkilled.

Consider: a 32-node machine has nodes 0-3 full of dirty memory.  I create a
cpuset containing nodes 0-2 and start using it.  I get oomkilled.

There may be other scenarios.


IOW, we have a correctness problem, and we have a probable,
not-yet-demonstrated-and-quantified performance problem.  Fixing the latter
(in the proposed fashion) will *not* fix the former.

So what I suggest we do is to fix the NFS bug, then move on to considering
the performance problems.



On reflection, I agree that your proposed changes are sensible-looking for
addressing the probable, not-yet-demonstrated-and-quantified performance
problem.  The per-inode (should be per-address_space, maybe it is?) node
map is unfortunate.  Need to think about that a bit more.  For a start, it
should be dynamically allocated (from a new, purpose-created slab cache):
most in-core inodes don't have any dirty pages and don't need this
additional storage.

Also, I worry about the worst-case performance of that linear search across
the inodes.

But this is unrelated to the NFS bug ;)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/8] Convert higest_possible_node_id() into nr_node_ids

2007-01-16 Thread Andi Kleen
On Tuesday 16 January 2007 16:47, Christoph Lameter wrote:

> I think having the ability to determine the maximum amount of nodes in
> a system at runtime is useful but then we should name this entry
> correspondingly and also only calculate the value once on bootup.

Are you sure this is even possible in general on systems with node
hotplug? The firmware might not pass a maximum limit.

At least CPU hotplug definitely has this issue and I don't see nodes
to be very different.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Tue, 16 Jan 2007, Andrew Morton wrote:

> Nope.  You've completely omitted the little fact that we'll do writeback in
> the offending zone off the LRU.  Slower, maybe.  But it should work and the
> system should recover.  If it's not doing that (it isn't) then we should
> fix it rather than avoiding it (by punting writeback over to pdflush).

pdflush is not running *at* all nor is dirty throttling working. Is that 
correct behavior? We could do background writeback but we choose not to do 
so? Instead we wait until we hit reclaim and then block (well, it seems 
that we do not block; the blocking there also fails since we again check 
global ratios)?

> > The patchset does not allow processes to allocate from other nodes than 
> > the current cpuset.
> 
> Yes it does.  It asks pdflush to perform writeback of the offending zone(s)
> rather than (or as well as) doing it directly.  The only reason pdflush can
> successfully do that is because pdflush can allocate its requests from other
> zones.

Ok pdflush is able to do that. Still the application is not able to 
extend its memory beyond the cpuset. What about writeback throttling? 
There it all breaks down. The cpuset is effective and we are unable to 
allocate any more memory. 

The reason this works is because not all of memory is dirty. Thus reclaim 
will be able to free up memory or there is enough memory free.

> > AFAIK any filesystem/block device can go oom with the current broken 
> > writeback, it just does a few allocations. It's a matter of hitting the 
> > sweet spots.
> 
> That shouldn't be possible, in theory.  Block IO is supposed to succeed if
> *all memory in the machine is dirty*: the old
> dirty-everything-with-MAP_SHARED-then-exit problem.  Lots of testing went
> into that and it works.  It also failed on NFS although I thought that got
> "fixed" a year or so ago.  Apparently not.

Humm... Really?

> > Nope. Why would a dirty zone pose a problem? The problem exists if you 
> > cannot allocate more memory.
> 
> Well one example would be a GFP_KERNEL allocation on a highmem machine in
> which all of ZONE_NORMAL is dirty.

That is a restricted allocation which will lead to reclaim.

> > If we have multiple zones then other zones may still provide memory to 
> > continue (same as in UP).
> 
> Not if all the eligible zones are all-dirty.

They are all dirty if we do not throttle the dirty pages.

> Right now, what we have is an NFS bug.  How about we fix it, then
> reevaluate the situation?

The "NFS bug" only exists when using a cpuset. If you run NFS without 
cpusets then the throttling will kick in and everything is fine.

> A good starting point would be to show us one of these oom-killer traces.

No traces. Since the process is killed within a cpuset we only get 
messages like:

Nov 28 16:19:52 ic4 kernel: Out of Memory: Kill process 679783 (ncks) score 0 
and children.
Nov 28 16:19:52 ic4 kernel: No available memory in cpuset: Killed process 
679783 (ncks).
Nov 28 16:27:58 ic4 kernel: oom-killer: gfp_mask=0x200d2, order=0

Probably need to rerun these with some patches.

> > Lets say we have a cpuset with 4 nodes (thus 4 zones) and we are running 
> > on the first node. Then we copy a large file to disk. Node local 
> > allocation means that we allocate from the first node. After we reach 40% 
> > of the node then we throttle? This is going to be a significant 
> > performance degradation since we can no longer use the memory of other 
> > nodes to buffer writeout.
> 
> That was what I was referring to.

Note that this was describing the behavior you wanted, not the way things 
work. Is it desired behavior not to use all the memory resources of the 
cpuset and to slow down the system?


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: data corruption with nvidia chipsets and IDE/SATA drives (k8 cpu errata needed?)

2007-01-16 Thread Christoph Anton Mitterer
Andi Kleen wrote:
> AMD is looking at the issue. Only Nvidia chipsets seem to be affected,
> although there were similar problems on VIA in the past too.
> Unless a good workaround comes around soon I'll probably default
> to iommu=soft on Nvidia.
I've just read the posts about AMD's and NVIDIA's efforts to find the
issue... but in the meantime this would be the best solution.

And if "we" ever find a true solution... we could still deactivate the
iommu=soft setting.


Best wishes,
Chris.



Re: data corruption with nvidia chipsets and IDE/SATA drives (k8 cpu errata needed?)

2007-01-16 Thread Christoph Anton Mitterer
Chris Wedgwood wrote:
> I'd like to here from Andi how he feels about this?  It seems like a
> somewhat drastic solution in some ways given a lot of hardware doesn't
> seem to be affected (or maybe in those cases it's just really hard to
> hit, I don't know).
>   
Yes, this might be true... those who have reported working systems might
just have a configuration where the error happens even more rarely or where
some other event(s) work around it.

>> Well we can hope that Nvidia will find out more (though I'm not too
>> optimistic).
>> 
> Ideally someone from AMD needs to look into this, if some mainboards
> really never see this problem, then why is that?  Is there errata that
> some BIOS/mainboard vendors are dealing with that others are not?
>   
Some time ago I asked here in a post if some of you could try to
contact AMD and/or Nvidia... as no one did, I wrote to them again (to
all forums and email addresses I knew). (You can see the text here:
http://www.nvnews.net/vbulletin/showthread.php?t=82909).
Now Nvidia has replied and it seems (thanks to Mr. Friedman) that they're
actually trying to investigate the issue...

I received one reply from AMD (actually in German, which is strange as I
wrote to their US support)... where they told me they had forwarded
my mail to their Linux engineers... but no reply since then.

Perhaps some of you have some "contacts" and can use them...



Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andrew Morton
> On Tue, 16 Jan 2007 16:16:30 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Tue, 16 Jan 2007, Andrew Morton wrote:
> 
> > It's a workaround for a still-unfixed NFS problem.
> 
> No, it's doing proper throttling. Without this patchset there will be *no* 
> writeback and throttling at all. F.e. let's say we have 20 nodes of 1G each
> and a cpuset that only spans one node.
> 
> Then a process running in that cpuset can dirty all of memory and still 
> continue running without writeback ever happening. The background dirty ratio
> is at 10% and the dirty ratio at 40%. Neither of those boundaries can ever
> be reached because the process will only ever be able to dirty memory on 
> one node which is 5%. There will be no throttling, no background 
> writeback, no blocking for dirty pages.
> 
> At some point we run into reclaim (possibly we have ~99% of the cpuset 
> dirty) and then we trigger writeout. Okay so if the filesystem / block 
> device is robust enough and does not require memory allocations then we 
> likely will survive that and do slow writeback page by page from the LRU.
> 
> Writeback is completely hosed for that situation. This patch restores 
> expected behavior in a cpuset (which is a form of system partition that 
> should mirror the system as a whole). At 10% dirty we should start 
> background writeback and at 40% we should block. If that is done then even 
> fragile combinations of filesystem/block devices will work as they do 
> without cpusets.

Nope.  You've completely omitted the little fact that we'll do writeback in
the offending zone off the LRU.  Slower, maybe.  But it should work and the
system should recover.  If it's not doing that (it isn't) then we should
fix it rather than avoiding it (by punting writeback over to pdflush).

Once that's fixed, if we determine that there are remaining and significant
performance issues then we can take a look at that.

> 
> > > Yes we can fix these allocations by allowing processes to allocate from 
> > > other nodes. But then the container function of cpusets is no longer 
> > > there.
> > But that's what your patch already does!
> 
> The patchset does not allow processes to allocate from other nodes than 
> the current cpuset.

Yes it does.  It asks pdflush to perform writeback of the offending zone(s)
rather than (or as well as) doing it directly.  The only reason pdflush can
successfully do that is because pdflush can allocate its requests from other
zones.

> 
> AFAIK any filesystem/block device can go oom with the current broken 
> writeback, it just does a few allocations. It's a matter of hitting the 
> sweet spots.

That shouldn't be possible, in theory.  Block IO is supposed to succeed if
*all memory in the machine is dirty*: the old
dirty-everything-with-MAP_SHARED-then-exit problem.  Lots of testing went
into that and it works.  It also failed on NFS although I thought that got
"fixed" a year or so ago.  Apparently not.

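The test in question is basically just this (a rough sketch; in practice
the file and mapping are sized to exceed RAM):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 1UL << 30;         /* make this bigger than RAM */
        int fd = open("bigfile", O_RDWR | O_CREAT, 0600);
        char *p;

        if (fd < 0 || ftruncate(fd, len) < 0)
                return 1;
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;
        memset(p, 0xaa, len);           /* dirty every page */
        return 0;                       /* exit with it all still dirty */
}
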
> > But we also can get into trouble if a *zone* is all-dirty.  Any solution to
> > the cpuset problem should solve that problem too, no?
> 
> > Nope. Why would a dirty zone pose a problem? The problem exists if you 
> cannot allocate more memory.

Well one example would be a GFP_KERNEL allocation on a highmem machine in
which all of ZONE_NORMAL is dirty.

> If a cpuset contains a single node which is a 
> single zone then this patchset will also address that issue.
> 
> If we have multiple zones then other zones may still provide memory to 
> continue (same as in UP).

Not if all the eligible zones are all-dirty.

> > > Yes, but when we enter reclaim most of the pages of a zone may already be 
> > > dirty/writeback so we fail.
> > 
> > No.  If the dirty limits become per-zone then no zone will ever have >40%
> > dirty.
> 
> I am still confused as to why you would want per zone dirty limits?

The need for that has yet to be demonstrated.  There _might_ be a problem,
but we need test cases and analyses to demonstrate that need.

Right now, what we have is an NFS bug.  How about we fix it, then
reevaluate the situation?

A good starting point would be to show us one of these oom-killer traces.

> Lets say we have a cpuset with 4 nodes (thus 4 zones) and we are running 
> on the first node. Then we copy a large file to disk. Node local 
> allocation means that we allocate from the first node. After we reach 40% 
> of the node then we throttle? This is going to be a significant 
> performance degradation since we can no longer use the memory of other 
> nodes to buffer writeout.

That was what I was referring to.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20-rc5: usb mouse breaks suspend to ram

2007-01-16 Thread Pavel Machek
Hi!

> > No, HID is the preferred... I am not sure what is going on - on my box
> > STR does not work at all thanks to nvidia chip turning the display on
> > all the way as the very last step of suspend ;(
> 
> One or several of these options might help cure this:
> - agp=off kernel command line (plus AGP driver option enabled in nvidia 
> xorg.conf)
> - suspend: cat /proc/bus/pci/AA/BB.C >/tmp/video_state --> resume
> - suspend: vbetool vbestate save --> resume
> - directly after resume: vbetool post
> - playing with chvt to not stay in X vt upon suspend
> - acpi_sleep=s3_bios or acpi_sleep=s3_mode
> 
> Especially the PCI video_state trick finally got me a working resume on
> 2.6.19-ck2 r128 Rage Mobility M4 AGP *WITH*(!) fully enabled and working
> (and keeping working!) DRI (3D).

Can we get whitelist entry for suspend.sf.net? s2ram from there can do
all the tricks you described, one letter per trick :-). We even got
PCI saving lately.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Two 2.6.20-rc5-rt2 issues

2007-01-16 Thread Rui Nuno Capela
On Tue, January 16, 2007 11:56, Ingo Molnar wrote:
>

> * Rui Nuno Capela <[EMAIL PROTECTED]> wrote:
>
>> First one is about building for UP (CONFIG_SMP not set) on my old P4
>> laptop. As it seems, all my build attempts failed at the final link
>> stage, with undefined references to paravirt_enable. After disabling
>> CONFIG_PARAVIRT I get a similar failure, but this time for a couple
>> kvm* symbols. [...]
>
> ok, i think i have managed to fix both bugs. I have released -rt3,
> please re-check whether it works any better. If it still doesnt then
> please send me the exact .config that fails.
>

Building this already with -rt5, still gives:
...
  LD  arch/i386/boot/compressed/vmlinux
  OBJCOPY arch/i386/boot/vmlinux.bin
  BUILD   arch/i386/boot/bzImage
Root device is (3, 2)
Boot sector 512 bytes.
Setup is 7407 bytes.
System is 1427 kB
Kernel: arch/i386/boot/bzImage is ready  (#1)
WARNING: "profile_hits" [drivers/kvm/kvm-intel.ko] undefined!
WARNING: "profile_hits" [drivers/kvm/kvm-amd.ko] undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2
...

.config as .gz is attached.


>
>> Second one is already about running SMP, on a Dual Core2 T7200, for
>> which the build goes fine but run-time is haunted by a crippling BUG:
>
>> Call Trace:
>> [] __switch_to+0xcc/0x176
>> [] wake_up_process+0x19/0x1b
>> [] acpi_ec_gpe_handler+0x1f/0x53
>> [] acpi_ev_gpe_dispatch+0x64/0x163
>> [] acpi_ev_gpe_detect+0x94/0xd7
>>
>
> hm, this is a -rt specific thing that i hoped to have worked around but
> apparently not. The ACPI code uses a waitqueue in its idle routine
> (argh!) which cannot by done sanely on PREEMPT_RT. In -rt3 i've added
> a more conservative (but still ugly and incorrect) hack - could you try
> it, does -rt3 work any better?
>

Yes it does. No BUG has been spotted on -rt3, yet.

Cheers.
-- 
rncbc aka Rui Nuno Capela
[EMAIL PROTECTED]

config.gz
Description: GNU Zip compressed data


Re: [PATCH] Driver core: fix refcounting bug

2007-01-16 Thread Greg KH
On Mon, Jan 08, 2007 at 11:06:44AM -0500, Alan Stern wrote:
> This patch (as832) fixes a newly-introduced bug in the driver core.
> When a kobject is assigned to a kset, it must acquire a reference to
> the kset.
> 
> Signed-off-by: Alan Stern <[EMAIL PROTECTED]>
> 
> ---
> 
> The bug was introduced in Kay's "unify /sys/class and /sys/bus at 
> /sys/subsystem" patch.
> 
> I left the assignment of class_dev->kobj.parent as it was, although it is 
> not needed.  The following call to kobject_add() will end up doing the 
> same thing.
> 
> Alan Stern
> 
> P.S.: Tracking down refcounting bugs is a real pain!  I spent an entire 
> afternoon on this one...  :-(

Thanks, I've merged your patch with the one from Kay so we don't
introduce a bug along the way.

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20-rc5: usb mouse breaks suspend to ram

2007-01-16 Thread Andreas Mohr
Hi,

On Tue, Jan 16, 2007 at 04:25:20PM -0500, Dmitry Torokhov wrote:
> No, HID is the preferred... I am not sure what is going on - on my box
> STR does not work at all thanks to nvidia chip turning the display on
> all the way as the very last step of suspend ;(

One or several of these options might help cure this:
- agp=off kernel command line (plus AGP driver option enabled in nvidia 
xorg.conf)
- suspend: cat /proc/bus/pci/AA/BB.C >/tmp/video_state --> resume
- suspend: vbetool vbestate save --> resume
- directly after resume: vbetool post
- playing with chvt to not stay in X vt upon suspend
- acpi_sleep=s3_bios or acpi_sleep=s3_mode

Especially the PCI video_state trick finally got me a working resume on
2.6.19-ck2 r128 Rage Mobility M4 AGP *WITH*(!) fully enabled and working
(and keeping working!) DRI (3D).
Or, to be precise, video_state was the ticket to keeping X.org alive
after resume instead of near-100% X lockup, which then allowed
vbetool post to successfully deal with the remaining pixel line distortion
in order to get a clear display again.
And some agp hacks might have played a role here, too, need to investigate
this again and submit something if this is the case.

In your case this sounds like the all-too-familiar mis-signalling of the
TFT display causing it to "melt" which ends up with an all-white screen,
so this should most likely be cured via vbetool post or so.

keywords: agpgart r128 suspend resume vbetool intel-agp dri

Andreas Mohr
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IPv6 router advertisement broken on 2.6.20-rc5

2007-01-16 Thread Aurelien Jarno
On Wed, Jan 17, 2007 at 12:30:53AM +0100, bert hubert wrote:
> On Tue, Jan 16, 2007 at 10:42:53PM +0100, Aurelien Jarno wrote:
> 
> > I have just tried a 2.6.20-rc5 kernel (I previously used a 2.6.19 one),
> > and I have noticed that the IPv6 router advertisement functionality is
> 
> Can you check if rc1, rc2, rc3 etc do work?

Will do that, but probably not in a very short timeframe (in other words
next evening, in about 20 hours).

-- 
  .''`.  Aurelien Jarno | GPG: 1024D/F1BCDB73
 : :' :  Debian developer   | Electrical Engineer
 `. `'   [EMAIL PROTECTED] | [EMAIL PROTECTED]
   `-people.debian.org/~aurel32 | www.aurel32.net
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Cbe-oss-dev] [PATCH] Cell SPU task notification

2007-01-16 Thread Christoph Hellwig
Index: linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/sched.c
===
--- linux-2.6.19-rc6-arnd1+patches.orig/arch/powerpc/platforms/cell/spufs/sched.c	2006-12-04 10:56:04.730698720 -0600
+++ linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/sched.c	2007-01-15 16:22:31.808461448 -0600
@@ -84,15 +84,42 @@
ctx ? ctx->object_id : 0, spu);
 }
 
+static void notify_spus_active(void)
+{
+   int node;
+   /* Wake up the active spu_contexts. When the awakened processes 
+* sees their notify_active flag is set, they will call
+* spu_notify_already_active().
+*/
+   for (node = 0; node < MAX_NUMNODES; node++) {
+   struct spu *spu;
+   mutex_lock(&spu_prio->active_mutex[node]);
+list_for_each_entry(spu, &spu_prio->active_list[node], list) {

You seem to have some issues with tabs vs spaces for indentation
here.

+   struct spu_context *ctx = spu->ctx;
+   spu->notify_active = 1;


Please make this a bit in the sched_flags field that's added in
the scheduler patch series I sent out.

+   wake_up_all(&ctx->stop_wq);
+   smp_wmb();
+   }
+mutex_unlock(&spu_prio->active_mutex[node]);
+   }
+   yield();
+}

Why do you add the yield() here?  yield() is pretty much a sign
of a bug.

+void spu_notify_already_active(struct spu_context *ctx)
+{
+   struct spu *spu = ctx->spu;
+   if (!spu)
+   return;
+   spu_switch_notify(spu, ctx);
+}

Please just call spu_switch_notify directly from the only
caller.  Also the check for ctx->spu being there is not
required if you look at the caller.
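
i.e. something like this in the caller (a sketch only; SPU_SCHED_NOTIFY_ACTIVE
is a made-up flag name, per the sched_flags comment above):

	if (test_and_clear_bit(SPU_SCHED_NOTIFY_ACTIVE, &ctx->sched_flags))
		spu_switch_notify(ctx->spu, ctx);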


*stat = ctx->ops->status_read(ctx);
-   if (ctx->state != SPU_STATE_RUNNABLE)
-   return 1;
+   smp_rmb();


What do you need the barrier for here?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: fix typo in geode_configre()@cyrix.c

2007-01-16 Thread Juergen Beisert
On Tuesday 09 January 2007 18:33, Lennart Sorensen wrote:
> Then for the next one it does:
> ccr3 = GetCx86(CX86_CCR3);
> setCx86(CX86_CCR3, (ccr3 & 0x0f) | 0x10);
>
> Couldn't that have been:
> setCx86(CX86_CCR3, (getCx86(CX86_CCR3) & 0x0f) | 0x10);
>
> No temp variable, and again it clearly does not intend to restore the
> value again later (even though the bug actually did cause the value to
> be restored by accident).

No, ccr3 should be restored to protect some registers (or at least bit 4 
should be cleared in ccr3).

BTW:
In function set_cx86_inc()
[...]
/* PCR1 -- Performance Control */
/* Incrementor on, whatever that is */
setCx86(CX86_PCR1, getCx86(CX86_PCR1) | 0x02);
/* PCR0 -- Performance Control */
/* Incrementor Margin 10 */
setCx86(CX86_PCR0, getCx86(CX86_PCR0) | 0x04);
[...]

This setting is only valid for 200MHz...266MHz CPUs, for 300MHz and 333MHz 
CPUs the Incrementor Margin should be 1-1.

There is an AppNote about this setting:
AMD Geode GX1 Processor Memory Timings for Maximum Performance.

Juergen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Tue, 16 Jan 2007, Andrew Morton wrote:

> It's a workaround for a still-unfixed NFS problem.

No, it's doing proper throttling. Without this patchset there will be *no* 
writeback and throttling at all. F.e. let's say we have 20 nodes of 1G each
and a cpuset that only spans one node.

Then a process running in that cpuset can dirty all of memory and still 
continue running without writeback ever happening. The background dirty ratio
is at 10% and the dirty ratio at 40%. Neither of those boundaries can ever
be reached because the process will only ever be able to dirty memory on 
one node which is 5%. There will be no throttling, no background 
writeback, no blocking for dirty pages.

At some point we run into reclaim (possibly we have ~99% of the cpuset 
dirty) and then we trigger writeout. Okay so if the filesystem / block 
device is robust enough and does not require memory allocations then we 
likely will survive that and do slow writeback page by page from the LRU.

Writeback is completely hosed for that situation. This patch restores 
expected behavior in a cpuset (which is a form of system partition that 
should mirror the system as a whole). At 10% dirty we should start 
background writeback and at 40% we should block. If that is done then even 
fragile combinations of filesystem/block devices will work as they do 
without cpusets.
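
In balance_dirty_pages() terms the idea is roughly this (a sketch only;
cpuset_dirty_limits() is a made-up helper that returns the dirty and
writeback page counts and the memory size of the current cpuset):

	unsigned long dirty, writeback, total;

	cpuset_dirty_limits(&dirty, &writeback, &total);

	if (dirty + writeback > total * vm_dirty_ratio / 100) {
		/* over the 40% limit: throttle the dirtying process */
		congestion_wait(WRITE, HZ/10);
	} else if (dirty > total * dirty_background_ratio / 100) {
		/* over the 10% limit: kick pdflush for background writeback */
		pdflush_operation(background_writeout, 0);
	}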


> > Yes we can fix these allocations by allowing processes to allocate from 
> > other nodes. But then the container function of cpusets is no longer 
> > there.
> But that's what your patch already does!

The patchset does not allow processes to allocate from other nodes than 
the current cpuset. There is no change as to the source of memory 
allocations.
 
> > NFS is okay as far as I can tell. dirty throttling works fine in non 
> > cpuset environments because we throttle if 40% of memory becomes dirty or 
> > under writeback.
> 
> Repeat: NFS shouldn't go oom.  It should fail the allocation, recover, wait
> for existing IO to complete.  Back that up with a mempool for NFS requests
> and the problem is solved, I think?

AFAIK any filesystem/block device can go oom with the current broken 
writeback, it just does a few allocations. It's a matter of hitting the 
sweet spots.

> But we also can get into trouble if a *zone* is all-dirty.  Any solution to
> the cpuset problem should solve that problem too, no?

Nope. Why would a dirty zone pose a problem? The problem exists if you 
cannot allocate more memory. If a cpuset contains a single node which is a 
single zone then this patchset will also address that issue.

If we have multiple zones then other zones may still provide memory to 
continue (same as in UP).

> > Yes, but when we enter reclaim most of the pages of a zone may already be 
> > dirty/writeback so we fail.
> 
> No.  If the dirty limits become per-zone then no zone will ever have >40%
> dirty.

I am still confused as to why you would want per zone dirty limits?

Let's say we have a cpuset with 4 nodes (thus 4 zones) and we are running 
on the first node. Then we copy a large file to disk. Node local 
allocation means that we allocate from the first node. After we reach 40% 
of the node then we throttle? This is going to be a significant 
performance degradation since we can no longer use the memory of other 
nodes to buffer writeout.

> The obvious fix here is: when a zone hits 40% dirty, perform dirty-memory
> reduction in that zone, throttling the dirtying process.  I suspect this
> would work very badly in common situations with, say, typical i386 boxes.

Absolute crap. You can prototype that broken behavior with zone reclaim by 
the way. Just switch on writeback during zone reclaim and watch how memory 
on a cpuset is unused and how the system becomes slow as molasses.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: fix typo in geode_configre()@cyrix.c

2007-01-16 Thread Juergen Beisert
On Tuesday 09 January 2007 16:43, Lennart Sorensen wrote:
> On Tue, Jan 09, 2007 at 06:41:56PM +0900, takada wrote:
> > In kernel 2.6, write back wrong register when configure Geode processor.
> > Instead of storing to CCR4, it stores to CCR3.
> >
> > --- linux-2.6.19/arch/i386/kernel/cpu/cyrix.c.orig  2007-01-09
> > 16:45:21.0 +0900 +++
> > linux-2.6.19/arch/i386/kernel/cpu/cyrix.c   2007-01-09 17:10:13.0
> > +0900 @@ -173,7 +173,7 @@ static void __cpuinit geode_configure(vo
> > ccr4 = getCx86(CX86_CCR4);
> > ccr4 |= 0x38;   /* FPU fast, DTE cache, Mem bypass */
> >
> > -   setCx86(CX86_CCR3, ccr3);
> > +   setCx86(CX86_CCR4, ccr4);
> >
> > set_cx86_memwb();
> > set_cx86_reorder();
>
> Any idea what the consequence of this would be?  Any chance that while
> fixing this file anyhow, adding a missing variant could be done?

Writing back of ccr4 is intended here, but also writing back the ccr3 
to disable the MAPEN again. So both are required. But the ccr4 first:

   setCx86(CX86_CCR4, ccr4);
   setCx86(CX86_CCR3, ccr3);

Juergen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC: 2.6 patch] drivers/scsi/qla4xxx/: possible cleanups

2007-01-16 Thread Ravi Anand
>On Sun, 14 Jan 2007, Adrian Bunk wrote:

> Date: Sun, 14 Jan 2007 14:45:54 +0100
> From: Adrian Bunk <[EMAIL PROTECTED]>
> To: Ravi Anand <[EMAIL PROTECTED]>,
>   David Somayajulu <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED], linux-scsi@vger.kernel.org,
>   linux-kernel@vger.kernel.org
> Subject: [RFC: 2.6 patch] drivers/scsi/qla4xxx/: possible cleanups
> User-Agent: Mutt/1.5.13 (2006-08-11)
> 
> This patch contains the following possible cleanups:
> - make needlessly global code static
> - #if 0 unused functions

Ack. 

Thanx
Ravi Anand
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IPv6 router advertisement broken on 2.6.20-rc5

2007-01-16 Thread Daniel Drake

Aurelien Jarno wrote:

Hi all,

I have just tried a 2.6.20-rc5 kernel (I previously used a 2.6.19 one),
and I have noticed that the IPv6 router advertisement functionality is
broken. The interface is not attributed an IPv6 address anymore, despite
/proc/sys/net/ipv6/conf/all/ra_accept being set to 1 (also true for each
individual interface configuration).


Probably fixed by
https://bugs.gentoo.org/attachment.cgi?id=107087&action=view

Daniel

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20-rc4-mm1 USB (asix) problem

2007-01-16 Thread Eric Buddington
On Mon, Jan 15, 2007 at 08:32:17PM +, David Hollis wrote:
> Interesting.  It would really be something if your devices happen to
> work better with 0.  Wouldn't make much sense at all unfortunately.  If
> 0 works, could you also try setting it to 2 or 3?  The PHY select value
> is a bit field with the 0 bit being to select the onboard PHY, and 1 bit
> being to 'auto-select' the PHY based on link status.  The data sheet
> indicates that 3 should be the default, but all of the literature I have
> seen from ASIX says to write a 1 to it.

My hardware is ver. B1.

0, 2, and 3 all worked for me. 1, as before, does not.

'rmmod asix' takes a really long time (45-80s) with any setting, and
sometimes coincides with ksoftirqd pegging (99.9% CPU) for several
seconds.

-Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


On some configs, sparse spinlock balance checking is broken

2007-01-16 Thread Roland Dreier
(Ingo -- you seem to be the last person to touch all this stuff, and I
can't untangle what you did, hence I'm sending this email to you)

On at least some of my configs on x86_64, when running sparse, I see
bogus 'warning: context imbalance in '' - wrong count at exit'.

This seems to be because I have CONFIG_SMP=y, CONFIG_DEBUG_SPINLOCK=n
and CONFIG_PREEMPT=n.  Therefore, <linux/spinlock.h> does

#define spin_lock(lock) _spin_lock(lock)

which picks up

void __lockfunc _spin_lock(spinlock_t *lock)
__acquires(lock);

from <linux/spinlock_api_smp.h>, but <linux/spinlock.h> also has:

#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) || \
!defined(CONFIG_SMP)
//...
#else
# define spin_unlock(lock)  
__raw_spin_unlock(&(lock)->raw_lock)

and <asm/spinlock.h> has:

static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
asm volatile("movl $1,%0" :"=m" (lock->slock) :: "memory");
}

so sparse doesn't see any __releases() to match the __acquires.

This all seems to go back to commit bda98685 ("x86: inline spin_unlock
if !CONFIG_DEBUG_SPINLOCK and !CONFIG_PREEMPT") but I don't know what
motivated that change.

Anyway, Ingo or anyone else, what's the best way to fix this?  Maybe
the right way to fix this is just to define away __acquires/__releases
unless CONFIG_DEBUG_SPINLOCK is set, but that seems suboptimal.
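
One hack I considered (just a sketch, not something I'm proposing for
merging) is to keep the out-of-line, annotated version visible only to
sparse, so the __acquires/__releases still balance:

#if defined(__CHECKER__)
/* sparse sees the __releases() on the out-of-line function */
# define spin_unlock(lock)	_spin_unlock(lock)
#else
# define spin_unlock(lock)	__raw_spin_unlock(&(lock)->raw_lock)
#endif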

Thanks,
  Roland
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IPv6 router advertisement broken on 2.6.20-rc5

2007-01-16 Thread Sridhar Samudrala

I think the following patch

[IPV6] MCAST: Fix joining all-node multicast group on device initialization.
 http://www.spinics.net/lists/netdev/msg22663.html

that went in after 2.6.20-rc5 should fix this problem.

Thanks
Sridhar

On 1/16/07, bert hubert <[EMAIL PROTECTED]> wrote:

On Tue, Jan 16, 2007 at 10:42:53PM +0100, Aurelien Jarno wrote:

> I have just tried a 2.6.20-rc5 kernel (I previously used a 2.6.19 one),
> and I have noticed that the IPv6 router advertisement functionality is

Can you check if rc1, rc2, rc3 etc do work?

Thanks.

--
http://www.PowerDNS.com  Open source, database driven DNS Software
http://netherlabs.nl  Open and Closed source services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread David Chinner
On Tue, Jan 16, 2007 at 01:53:25PM -0800, Andrew Morton wrote:
> > On Mon, 15 Jan 2007 21:47:43 -0800 (PST) Christoph Lameter
> > <[EMAIL PROTECTED]> wrote:
> >
> > Currently cpusets are not able to do proper writeback since dirty ratio
> > calculations and writeback are all done for the system as a whole.
> 
> We _do_ do proper writeback.  But it's less efficient than it might be, and
> there's an NFS problem.
> 
> > This may result in a large percentage of a cpuset to become dirty without
> > writeout being triggered. Under NFS this can lead to OOM conditions.
> 
> OK, a big question: is this patchset a performance improvement or a
> correctness fix?  Given the above, and the lack of benchmark results I'm
> assuming it's for correctness.

Given that we've already got a 25-30% buffered write performance
degradation between 2.6.18 and 2.6.20-rc4 for simple sequential
write I/O to multiple filesystems concurrently, I'd really like
to see some serious I/O performance regression testing on this
change before it goes anywhere.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andrew Morton
> On Tue, 16 Jan 2007 14:15:56 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> ...
>
> > > This may result in a large percentage of a cpuset
> > > to become dirty without writeout being triggered. Under NFS
> > > this can lead to OOM conditions.
> > 
> > OK, a big question: is this patchset a performance improvement or a
> > correctness fix?  Given the above, and the lack of benchmark results I'm
> > assuming it's for correctness.
> 
> It is a correctness fix both for NFS OOM and doing proper cpuset writeout.

It's a workaround for a still-unfixed NFS problem.

> > - Why does NFS go oom?  Because it allocates potentially-unbounded
> >   numbers of requests in the writeback path?
> > 
> >   It was able to go oom on non-numa machines before dirty-page-tracking
> >   went in.  So a general problem has now become specific to some NUMA
> >   setups.
> 
> 
> Right. The issue is that large portions of memory become dirty / 
> writeback since no writeback occurs because dirty limits are not checked 
> for a cpuset. Then NFS attempts to write out when doing LRU scans but is 
> unable to allocate memory.
>  
> >   So an obvious, equivalent and vastly simpler "fix" would be to teach
> >   the NFS client to go off-cpuset when trying to allocate these requests.
> 
> Yes we can fix these allocations by allowing processes to allocate from 
> other nodes. But then the container function of cpusets is no longer 
> there.

But that's what your patch already does!

It asks pdflush to write the pages instead of the direct-reclaim caller. 
The only reason pdflush doesn't go oom is that pdflush lives outside the
direct-reclaim caller's cpuset and is hence able to obtain those nfs
requests from off-cpuset zones.

> > (But is it really bad? What actual problems will it cause once NFS is 
> > fixed?)
> 
> NFS is okay as far as I can tell. dirty throttling works fine in non 
> cpuset environments because we throttle if 40% of memory becomes dirty or 
> under writeback.

Repeat: NFS shouldn't go oom.  It should fail the allocation, recover, wait
for existing IO to complete.  Back that up with a mempool for NFS requests
and the problem is solved, I think?
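
Something along these lines, as a sketch only (assuming the existing
nfs_page slab cache; names and placement are approximate):

static mempool_t *nfs_page_mempool;

static int __init nfs_page_mempool_init(void)
{
	/* a small reserve is enough to guarantee forward progress */
	nfs_page_mempool = mempool_create_slab_pool(32, nfs_page_cachep);
	return nfs_page_mempool ? 0 : -ENOMEM;
}

static struct nfs_page *nfs_page_alloc(void)
{
	/* with a sleeping gfp mask this never returns NULL */
	return mempool_alloc(nfs_page_mempool, GFP_NOFS);
}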

> > I don't understand why the proposed patches are cpuset-aware at all.  This
> > is a per-zone problem, and a per-zone fix would seem to be appropriate, and
> > more general.  For example, i386 machines can presumably get into trouble
> > if all of ZONE_DMA or ZONE_NORMAL get dirty.  A good implementation would
> > address that problem as well.  So I think it should all be per-zone?
> 
> No. A zone can be completely dirty as long as we are allowed to allocate 
> from other zones.

But we also can get into trouble if a *zone* is all-dirty.  Any solution to
the cpuset problem should solve that problem too, no?

> > Do we really need those per-inode cpumasks?  When page reclaim encounters a
> > dirty page on the zone LRU, we automatically know that page->mapping->host
> > has at least one dirty page in this zone, yes?  We could immediately ask
> 
> Yes, but when we enter reclaim most of the pages of a zone may already be 
> dirty/writeback so we fail.

No.  If the dirty limits become per-zone then no zone will ever have >40%
dirty.

The obvious fix here is: when a zone hits 40% dirty, perform dirty-memory
reduction in that zone, throttling the dirtying process.  I suspect this
would work very badly in common situations with, say, typical i386 boxes.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -mm 3/10][RFC] aio: use iov_length instead of ki_left

2007-01-16 Thread Ingo Oeser
On Tuesday, 16. January 2007 06:37, Nate Diller wrote:
> On 1/15/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> > On Mon, Jan 15, 2007 at 05:54:50PM -0800, Nate Diller wrote:
> > > Convert code using iocb->ki_left to use the more generic iov_length() 
> > > call.
> >
> > No way.  We need to reduce the numer of iovec traversals, not adding
> > more of them.
> 
> ok, I can work on a version of this that uses struct iodesc.  Maybe
> something like this?
> 
> struct iodesc {
> struct iovec *iov;
> unsigned long nr_segs;
> size_t nbytes;
> };
> 
> I suppose it's worth doing the iodesc thing along with this patchset
> anyway, since it'll avoid an extra round of interface churn.

What about this instead

struct iodesc {
struct iovec *iov;
unsigned long nr_segs;
unsigned long seg_limit;
size_t nr_bytes;
};

That will enable resizeable iodescs with partial completion state and
will enable successive filling of an iodesc with iovs.
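
Filling would then look roughly like this (sketch only, no locking or
iov[] growth shown):

static int iodesc_add_iov(struct iodesc *io, void *base, size_t len)
{
	if (io->nr_segs >= io->seg_limit)
		return -ENOMEM;		/* caller can grow iov[] and retry */

	io->iov[io->nr_segs].iov_base = base;
	io->iov[io->nr_segs].iov_len = len;
	io->nr_segs++;
	io->nr_bytes += len;
	return 0;
}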

This will be needed anyway. I built a complete short userspace 
module for that already. I can post and GPLv2 it somewhere, if people
are interested.

Regards

Ingo Oeser
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 31/59] sysctl: C99 convert the ctl_tables in arch/mips/au1000/common/power.c

2007-01-16 Thread Ingo Oeser
Hi Eric,

On Tuesday, 16. January 2007 17:39, Eric W. Biederman wrote:
> diff --git a/arch/mips/au1000/common/power.c b/arch/mips/au1000/common/power.c
> index b531ab7..31256b8 100644
> --- a/arch/mips/au1000/common/power.c
> +++ b/arch/mips/au1000/common/power.c
> @@ -419,15 +419,41 @@ static int pm_do_freq(ctl_table * ctl, int write, 
> struct file *file,

> + {
> + .ctl_name   = CTL_UNNUMBERED,
> + .procname   = "suspend",
> + .data   = NULL,
> + .maxlen = 0,
> + .mode   = 0600,
> + .proc_handler   = &pm_do_suspend
> + },

No need for zero initialization for maxlen.

> + {
> + .ctl_name   = CTL_UNNUMBERED,
> + .procname   = "sleep",
> + .data   = NULL,
> + .maxlen = 0,
> + .mode   = 0600,
> + .proc_handler   = &pm_do_sleep
> + },

ditto

> + {
> + .ctl_name   = CTL_UNNUMBERED,
> + .procname   = "freq",
> + .data   = NULL,
> + .maxlen = 0,
> + .mode   = 0600,
> + .proc_handler   = &pm_do_freq
> + },
> + {}
>  };

ditto

Regards

Ingo Oeser
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Kwatch: kernel watchpoints using CPU debug registers

2007-01-16 Thread Christoph Hellwig
First, I'd say thanks a lot for forward-porting this, it's a really useful
feature for all kinds of nasty debugging.

I think you should split this into two patches, one for the debugreg
infrastructure, and one for the actual kwatch code.

Also I think you should provide one (or even a few) example watches for
trivial things, say updating i_ino for an inode given through debugfs.

Some comments on the code below:

> --- /dev/null
> +++ usb-2.6/arch/i386/kernel/debugreg.c
> @@ -0,0 +1,182 @@
> +/*
> + *  Debug register
> + *  arch/i386/kernel/debugreg.c

Please don't put in comments that mention the name of the containing
file.  Also the "Debug register" comment seems rather useless.

> + * 2002-Oct  Created by Vamsi Krishna S <[EMAIL PROTECTED]> and
> + *   Bharata Rao <[EMAIL PROTECTED]> to provide debug register
> + *   allocation mechanism.
> + * 2004-Oct  Updated by Prasanna S Panchamukhi <[EMAIL PROTECTED]> with
> + *   idr_allocations mechanism as suggested by Andi Kleen.

I think these kinds of comments aren't in fashion anymore either, all
changelogs should be in git commit messages and initial credits go
into the first commit message.

> +struct debugreg dr_list[DR_MAX];
> +static spinlock_t dr_lock = SPIN_LOCK_UNLOCKED;

I think you're supposed to use the magic DEFINE_SPINLOCK macro these days.
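
i.e. the declaration above simply becomes:

static DEFINE_SPINLOCK(dr_lock);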

> +unsigned long dr7_global_mask = DR_CONTROL_RESERVED | DR_GLOBAL_SLOWDOWN |
> + DR_GLOBAL_ENABLE_MASK;

I'd rather keep this static and make set_process_dr7 a non-inline
function.

> +
> +static unsigned long dr7_global_reg_mask(unsigned int regnum)
> +{
> + return (0xf << (16 + regnum * 4)) | (0x1 << (regnum * 2));
> +}
> +
> +static int get_dr(int regnum, int flag)
> +{
> + if (flag == DR_ALLOC_GLOBAL && !dr_list[regnum].flag) {
> + dr_list[regnum].flag = flag;
> + dr7_global_mask |= dr7_global_reg_mask(regnum);
> + return regnum;
> + }
> + if (flag == DR_ALLOC_LOCAL &&
> + dr_list[regnum].flag != DR_ALLOC_GLOBAL) {
> + dr_list[regnum].flag = flag;
> + dr_list[regnum].use_count++;
> + return regnum;
> + }
> + return -1;

This looks rather poorly structured, as the function does completely
different things depending on the flags passed in.

> +static void free_dr(int regnum)
> +{
> + if (dr_list[regnum].flag == DR_ALLOC_LOCAL) {
> + if (!--dr_list[regnum].use_count)
> + dr_list[regnum].flag = 0;
> + } else {
> + dr_list[regnum].flag = 0;
> + dr_list[regnum].use_count = 0;
> + dr7_global_mask &= ~(dr7_global_reg_mask(regnum));
> + }
> +}

Same here.

> +int dr_alloc(int regnum, int flag)
> +{
> + int ret = -1;
> +
> + spin_lock(&dr_lock);
> + if (regnum >= 0 && regnum < DR_MAX)
> + ret = get_dr(regnum, flag);
> + else if (regnum == DR_ANY) {
> +
> + /* gdb allocates local debug registers starting from 0.
> +  * To help avoid conflicts, we'll start from the other end.
> +  */
> + for (regnum = DR_MAX - 1; regnum >= 0; --regnum) {
> + ret = get_dr(regnum, flag);
> + if (ret >= 0)
> + break;
> + }
> + } else
> + printk(KERN_ERR "dr_alloc: "
> + "Cannot allocate debug register %d\n", regnum);
> + spin_unlock(&dr_lock);
> + return ret;

I suspect this should be replaced with a global and local variant
to fix the above-mentioned issue.  It's a tiny bit duplicated code,
but seems much cleaner.
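
Roughly like this, based on the code above (untested sketch, just to show
the shape I mean):

static int debugreg_get_global(int regnum)
{
	if (dr_list[regnum].flag)
		return -1;
	dr_list[regnum].flag = DR_ALLOC_GLOBAL;
	dr7_global_mask |= dr7_global_reg_mask(regnum);
	return regnum;
}

static int debugreg_get_local(int regnum)
{
	if (dr_list[regnum].flag == DR_ALLOC_GLOBAL)
		return -1;
	dr_list[regnum].flag = DR_ALLOC_LOCAL;
	dr_list[regnum].use_count++;
	return regnum;
}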

> +static int get_dr(int regnum, int flag)
> +{
> + if (flag == DR_ALLOC_GLOBAL && !dr_list[regnum].flag) {
> + dr_list[regnum].flag = flag;
> + dr7_global_mask |= dr7_global_reg_mask(regnum);
> + return regnum;
> + }
> + if (flag == DR_ALLOC_LOCAL &&
> + dr_list[regnum].flag != DR_ALLOC_GLOBAL) {
> + dr_list[regnum].flag = flag;
> + dr_list[regnum].use_count++;
> + return regnum;
> + }
> + return -1;

Same comments about global vs local here.

> +
> +EXPORT_SYMBOL(dr_alloc);
> +EXPORT_SYMBOL(dr_free);

I don't think we want these exported at all, and if a proper modular
user shows up they should be _GPL as they're fairly lowlevel.

Btw, the naming in the whole debugregs code should be consolidated to
be debugreg_ instead of all kinds of different variants.

> +#ifdef CONFIG_KWATCH
> +
> +/* Set the type, len and global flag in dr7 for a debug register */
> +#define SET_DR7(dr, regnum, type, len)   do {\
> + dr &= ~(0xf << (16 + (regnum) * 4));\
> + dr |= (len) - 1) << 2) | (type)) << \
> + (16 + (regnum) * 4)) |  \
> + (0x2 << ((regnum) * 2));\
> + } while (0)
> +
> +/* Disable a debu

Re: IPv6 router advertisement broken on 2.6.20-rc5

2007-01-16 Thread bert hubert
On Tue, Jan 16, 2007 at 10:42:53PM +0100, Aurelien Jarno wrote:

> I have just tried a 2.6.20-rc5 kernel (I previously used a 2.6.19 one),
> and I have noticed that the IPv6 router advertisement functionality is

Can you check if rc1, rc2, rc3 etc do work?

Thanks.

-- 
http://www.PowerDNS.com  Open source, database driven DNS Software 
http://netherlabs.nl  Open and Closed source services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 45/59] sysctl: C99 convert ctl_tables in drivers/parport/procfs.c

2007-01-16 Thread Eric W. Biederman
Ingo Oeser <[EMAIL PROTECTED]> writes:

> Hi Eric,
>
> On Tuesday, 16. January 2007 17:39, Eric W. Biederman wrote:
>> diff --git a/drivers/parport/procfs.c b/drivers/parport/procfs.c
>> index 2e744a2..5337789 100644
>> --- a/drivers/parport/procfs.c
>> +++ b/drivers/parport/procfs.c
>> @@ -263,50 +263,118 @@ struct parport_sysctl_table {
>> +{
>> +.ctl_name   = DEV_PARPORT_BASE_ADDR,
>> +.procname   = "base-addr",
>> +.data   = NULL,
>> +.maxlen = 0,
>> +.mode   = 0444,
>> +.proc_handler   = &do_hardware_base_addr
>> +},
>
> No need to initialize to zero or NULL. Just list any variable which is NOT
> zero or NULL.

Agreed.  In this case it was left for clarity.

>> +{
>> +.ctl_name   = DEV_PARPORT_AUTOPROBE + 1,
>> +.procname   = "autoprobe0",
>> +.data   = NULL,
>> +.maxlen = 0,
>> +.maxlen = 0444,
>> +.proc_handler   =  &do_autoprobe
>> +},
>
> Typo here? .mode = 0444 makes more sense.

Yep looks like it.  On my todo.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: fix typo in geode_configre()@cyrix.c

2007-01-16 Thread Lennart Sorensen
On Wed, Jan 17, 2007 at 01:38:35AM +0900, takada wrote:
> You are right. I agree to your comment. These variables are needless.
> I made a patch again.
> 
> diff -Narup linux-2.6.19.orig/arch/i386/kernel/cpu/cyrix.c 
> linux-2.6.19/arch/i386/kernel/cpu/cyrix.c
> --- linux-2.6.19.orig/arch/i386/kernel/cpu/cyrix.c2006-11-30 
> 06:57:37.0 +0900
> +++ linux-2.6.19/arch/i386/kernel/cpu/cyrix.c 2007-01-16 19:55:05.0 
> +0900
> @@ -161,19 +161,15 @@ static void __cpuinit set_cx86_inc(void)
>  static void __cpuinit geode_configure(void)
>  {
>   unsigned long flags;
> - u8 ccr3, ccr4;
>   local_irq_save(flags);
>  
>   /* Suspend on halt power saving and enable #SUSP pin */
>   setCx86(CX86_CCR2, getCx86(CX86_CCR2) | 0x88);
>  
> - ccr3 = getCx86(CX86_CCR3);
> - setCx86(CX86_CCR3, (ccr3 & 0x0f) | 0x10);   /* Enable */
> + setCx86(CX86_CCR3, (getCx86(CX86_CCR3) & 0x0f) | 0x10); /* Enable */
>   
> - ccr4 = getCx86(CX86_CCR4);
> - ccr4 |= 0x38;   /* FPU fast, DTE cache, Mem bypass */
> - 
> - setCx86(CX86_CCR3, ccr3);
> + /* FPU fast, DTE cache, Mem bypass */
> + setCx86(CX86_CCR4, getCx86(CX86_CCR4) | 0x30);

Actually is it possible that the original intent was:

ccr3 = getCx86(CX86_CCR3);
setCx86(CX86_CCR3, (ccr3 & 0x0f) | 0x10);   /* Enable */ /* enable advanced register access? */

ccr4 = getCx86(CX86_CCR4);
ccr4 |= 0x38;   /* FPU fast, DTE cache, Mem bypass */
setCx86(CX86_CCR4, ccr4);

setCx86(CX86_CCR3, ccr3); /* restore ccr3 register */

Seems something similar with ccr3 was taking place elsewhere in the
function.
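For what it's worth, a complete version of that guess might look like the
sketch below (illustration only, reusing the getCx86()/setCx86() helpers from
cyrix.c; the 0x38 vs 0x30 question from the patch above is left alone here):

static void __cpuinit geode_configure(void)
{
	unsigned long flags;
	u8 ccr3;

	local_irq_save(flags);

	/* Suspend on halt power saving and enable #SUSP pin */
	setCx86(CX86_CCR2, getCx86(CX86_CCR2) | 0x88);

	ccr3 = getCx86(CX86_CCR3);
	setCx86(CX86_CCR3, (ccr3 & 0x0f) | 0x10);	/* enable config register access (MAPEN) */

	/* FPU fast, DTE cache, Mem bypass */
	setCx86(CX86_CCR4, getCx86(CX86_CCR4) | 0x38);

	setCx86(CX86_CCR3, ccr3);			/* restore CCR3, i.e. MAPEN off again */

	local_irq_restore(flags);
}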

--
Len Sorensen


Re: IPv6 router advertisement broken on 2.6.20-rc5

2007-01-16 Thread Bob Tracy
Aurelien Jarno wrote:
> I have just tried a 2.6.20-rc5 kernel (I previously used a 2.6.19 one),
> and I have noticed that the IPv6 router advertisement functionality is
> broken. The interface is no longer assigned an IPv6 address, despite
> /proc/sys/net/ipv6/conf/all/ra_accept being set to 1 (also true for each
> individual interface configuration).
> 
> Using tcpdump, I am seeing the router advertisement messages arriving on
> the interface, but they seem to be ignored.

ACK as far as seeing the same thing.  Another symptom: ping6 to the
link-local all-nodes address (ff02::1) is similarly broken.  tcpdump
shows the packets on the wire, but there's no response.  The most
recent kernel from kernel.org where IPv6 seems to be behaving properly
is 2.6.20-rc3.

-- 
---
Bob Tracy   WTO + WIPO = DMCA? http://www.anti-dmca.org
[EMAIL PROTECTED]
---


IPv6 router advertisement broken on 2.6.20-rc5

2007-01-16 Thread Aurelien Jarno
Hi all,

I have just tried a 2.6.20-rc5 kernel (I previously used a 2.6.19 one),
and I have noticed that the IPv6 router advertisement functionality is
broken. The interface is no longer assigned an IPv6 address, despite
/proc/sys/net/ipv6/conf/all/ra_accept being set to 1 (also true for each
individual interface configuration).

Using tcpdump, I am seeing the router advertisement messages arriving on
the interface, but they seem to be ignored.

Has anybody else seen this behaviour?

Bye,
Aurelien

-- 
  .''`.  Aurelien Jarno | GPG: 1024D/F1BCDB73
 : :' :  Debian developer   | Electrical Engineer
 `. `'   [EMAIL PROTECTED] | [EMAIL PROTECTED]
   `-people.debian.org/~aurel32 | www.aurel32.net


Re: [PATCH] nfs: fix congestion control

2007-01-16 Thread Trond Myklebust

On Tue, 2007-01-16 at 23:08 +0100, Peter Zijlstra wrote:
> Subject: nfs: fix congestion control
> 
> The current NFS client congestion logic is severely broken, it marks the
> backing device congested during each nfs_writepages() call and implements
> its own waitqueue.
> 
> Replace this by a more regular congestion implementation that puts a cap
> on the number of active writeback pages and uses the bdi congestion waitqueue.
> 
> NFSv[34] commit pages are allowed to go unchecked as long as we are under 
> the dirty page limit and not in direct reclaim.
> 
>   A buxom young lass from Neale's Flat,
>   Bore triplets, named Matt, Pat and Tat.
>   "Oh Lord," she protested,
>   "'Tis somewhat congested ...
>   "You've given me no tit for Tat." 


What on earth is the point of adding congestion control to COMMIT?
Strongly NACKed.

Why 16MB of on-the-wire data? Why not 32, or 128, or ...
Solaris already allows you to send 2MB of write data in a single RPC
request, and the RPC engine has for some time allowed you to tune the
number of simultaneous RPC requests you have on the wire: Chuck has
already shown that read/write performance is greatly improved by upping
that value to 64 or more in the case of RPC over TCP. Why are we then
suddenly telling people that they are limited to 8 simultaneous writes?

Trond




patch pci-rework-documentation-pci.txt.patch added to gregkh-2.6 tree

2007-01-16 Thread gregkh

This is a note to let you know that I've just added the patch titled

 Subject: PCI: rework Documentation/pci.txt

to my gregkh-2.6 tree.  Its filename is

 pci-rework-documentation-pci.txt.patch

This tree can be found at 
http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From [EMAIL PROTECTED] Fri Jan  5 22:50:51 2007
From: Grant Grundler <[EMAIL PROTECTED]>
Date: Mon, 25 Dec 2006 01:06:35 -0700
Subject: PCI: rework Documentation/pci.txt
To: Andrew Morton <[EMAIL PROTECTED]>
Cc: Greg KH <[EMAIL PROTECTED]>, Hidetoshi Seto <[EMAIL PROTECTED]>, Linux 
Kernel list , [EMAIL PROTECTED], Kenji Kaneshige 
<[EMAIL PROTECTED]>
Message-ID: <[EMAIL PROTECTED]>
Content-Disposition: inline


Rewrite Documentation/pci.txt:
o restructure document to match how API is used when writing init code.
o update to reflect changes in struct pci_driver function pointers.
o removed language on "new style vs old style" device discovery.
  "Old style" is now deprecated. Don't use it. Left description in
  to document existing driver behaviors.
o add section "Legacy I/O Port free driver" by Kenji Kaneshige
  http://lkml.org/lkml/2006/11/22/25
  (renamed to "pci_enable_device_bars() and Legacy I/O Port space")
o add "MMIO space and write posting" section to help avoid common pitfall
  when converting drivers from IO Port space to MMIO space.
  Orignally posted http://lkml.org/lkml/2006/2/27/24
o many typo/grammer/spelling corrections from Randy Dunlap
o two more spelling corrections from Stephan Richter
o fix CodingStyle as per Randy Dunlap
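For readers following along, the write-posting pitfall that new section covers
boils down to something like this (illustration only, not part of the patch;
the register offset is invented):

#include <linux/io.h>

static void start_engine(void __iomem *regs)
{
	writel(0x1, regs + 0x10);	/* MMIO write may be posted in a bridge */
	readl(regs + 0x10);		/* a read from the same device flushes it */
	/* only now is it safe to assume the device has seen the command */
}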


Signed-off-by: Grant Grundler <[EMAIL PROTECTED]>
Signed-off-by: Greg Kroah-Hartman <[EMAIL PROTECTED]>

---
 Documentation/pci.txt |  714 ++
 1 file changed, 555 insertions(+), 159 deletions(-)

--- gregkh-2.6.orig/Documentation/pci.txt
+++ gregkh-2.6/Documentation/pci.txt
@@ -1,142 +1,231 @@
-How To Write Linux PCI Drivers
 
-  by Martin Mares <[EMAIL PROTECTED]> on 07-Feb-2000
+   How To Write Linux PCI Drivers
+
+   by Martin Mares <[EMAIL PROTECTED]> on 07-Feb-2000
+   updated by Grant Grundler <[EMAIL PROTECTED]> on 23-Dec-2006
 
 

-The world of PCI is vast and it's full of (mostly unpleasant) surprises.
-Different PCI devices have different requirements and different bugs --
-because of this, the PCI support layer in Linux kernel is not as trivial
-as one would wish. This short pamphlet tries to help all potential driver
-authors find their way through the deep forests of PCI handling.
+The world of PCI is vast and full of (mostly unpleasant) surprises.
+Since each CPU architecture implements different chip-sets and PCI devices
+have different requirements (erm, "features"), the result is the PCI support
+in the Linux kernel is not as trivial as one would wish. This short paper
+tries to introduce all potential driver authors to Linux APIs for
+PCI device drivers.
+
+A more complete resource is the third edition of "Linux Device Drivers"
+by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman.
+LDD3 is available for free (under Creative Commons License) from:
+
+   http://lwn.net/Kernel/LDD3/
+
+However, keep in mind that all documents are subject to "bit rot".
+Refer to the source code if things are not working as described here.
+
+Please send questions/comments/patches about Linux PCI API to the
+"Linux PCI" <[EMAIL PROTECTED]> mailing list.
+
 
 
 0. Structure of PCI drivers
 ~~~
-There exist two kinds of PCI drivers: new-style ones (which leave most of
-probing for devices to the PCI layer and support online insertion and removal
-of devices [thus supporting PCI, hot-pluggable PCI and CardBus in a single
-driver]) and old-style ones which just do all the probing themselves. Unless
-you have a very good reason to do so, please don't use the old way of probing
-in any new code. After the driver finds the devices it wishes to operate
-on (either the old or the new way), it needs to perform the following steps:
+PCI drivers "discover" PCI devices in a system via pci_register_driver().
+Actually, it's the other way around. When the PCI generic code discovers
+a new device, the driver with a matching "description" will be notified.
+Details on this below.
+
+pci_register_driver() leaves most of the probing for devices to
+the PCI layer and supports online insertion/removal of devices [thus
+supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver].
+pci_register_driver() call requires passing in a table of function
+pointers and thus dictates the high level structure of a driver.
+
+Once the driver knows about a PCI device and takes ownership, the
+driver generally needs to perform the following initialization:
 
Enable the device
-   Access device configuration space
-   Discover r

Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Wed, 17 Jan 2007, Andi Kleen wrote:

> > Secondly we modify the dirty limit calculation to be based
> > on the active cpuset.
> 
> The global dirty limit definitely seems to be a problem
> in several cases, but my feeling is that the cpuset is the wrong unit
> to keep track of it. Most likely it should be more fine grained.

We already have zone reclaim that can take care of smaller units but why 
would we start writeback if only one zone is full of dirty pages and there
are lots of other zones (nodes) that are free?

> > If we are in a cpuset then we select only inodes for writeback
> > that have pages on the nodes of the cpuset.
> 
> Is there any indication this change helps on smaller systems
> or is it purely a large system optimization?

The bigger the system the larger the problem, because the dirty ratio is
currently calculated based on the percentage of dirty pages in the system
as a whole. The smaller the share of the system a cpuset contains, the less
effective dirty_ratio and background_dirty_ratio become.
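To make that concrete, a rough sketch of the idea (not the actual patch;
cpuset_mems_allowed() and the node macros exist, the function names here are
invented):

static unsigned long cpuset_dirtyable_pages(void)
{
	nodemask_t nodes = cpuset_mems_allowed(current);
	unsigned long pages = 0;
	int node;

	for_each_node_mask(node, nodes)
		pages += node_present_pages(node);
	return pages;
}

static void cpuset_dirty_limits(unsigned long *background, unsigned long *dirty)
{
	/* scale the global ratios by the cpuset's memory, not total memory */
	unsigned long total = cpuset_dirtyable_pages();

	*background = total * dirty_background_ratio / 100;
	*dirty = total * vm_dirty_ratio / 100;
}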

> > B. We add a new counter NR_UNRECLAIMABLE that is subtracted
> >from the available pages in a node. This allows us to
> >accurately calculate the dirty ratio even if large portions
> >of the node have been allocated for huge pages or for
> >slab pages.
> 
> That sounds like a useful change by itself.

I can separate that one out.
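Once separated out, such a counter would be consumed along these lines
(sketch only; NR_UNRECLAIMABLE is the proposed new zone counter, the helper
name is invented):

static unsigned long zone_dirtyable_pages(struct zone *zone)
{
	/* pages that can never become dirty (hugetlb, slab, ...) don't count */
	return zone->present_pages - zone_page_state(zone, NR_UNRECLAIMABLE);
}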



Re: [PATCH] return ENOENT from ext3_link when racing with unlink

2007-01-16 Thread Peter Staubach

Alex Tomas wrote:

Peter Staubach (PS) writes:




 PS> Just out of curiosity, what keeps i_nlink from going to 0 immediately
 PS> after the new test is executed?

i_mutex in vfs_link() and vfs_unlink()
  


Ahhh...  Okie doke, thanx!

  ps



Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Christoph Lameter
On Tue, 16 Jan 2007, Andrew Morton wrote:

> > On Mon, 15 Jan 2007 21:47:43 -0800 (PST) Christoph Lameter <[EMAIL 
> > PROTECTED]> wrote:
> >
> > Currently cpusets are not able to do proper writeback since
> > dirty ratio calculations and writeback are all done for the system
> > as a whole.
> 
> We _do_ do proper writeback.  But it's less efficient than it might be, and
> there's an NFS problem.

Well yes we write back during LRU scans when a potentially high percentage 
of the memory in a cpuset is dirty.

> > This may result in a large percentage of a cpuset
> > to become dirty without writeout being triggered. Under NFS
> > this can lead to OOM conditions.
> 
> OK, a big question: is this patchset a performance improvement or a
> correctness fix?  Given the above, and the lack of benchmark results I'm
> assuming it's for correctness.

It is a correctness fix both for NFS OOM and doing proper cpuset writeout.

> - Why does NFS go oom?  Because it allocates potentially-unbounded
>   numbers of requests in the writeback path?
> 
>   It was able to go oom on non-numa machines before dirty-page-tracking
>   went in.  So a general problem has now become specific to some NUMA
>   setups.


Right. The issue is that large portions of memory become dirty/writeback
since no writeback occurs (dirty limits are not checked per cpuset). Then
NFS attempts to write out during LRU scans but is unable to allocate memory.
 
>   So an obvious, equivalent and vastly simpler "fix" would be to teach
>   the NFS client to go off-cpuset when trying to allocate these requests.

Yes we can fix these allocations by allowing processes to allocate from 
other nodes. But then the container function of cpusets is no longer 
there.

> (But is it really bad? What actual problems will it cause once NFS is fixed?)

NFS is okay as far as I can tell. Dirty throttling works fine in non-cpuset
environments because we throttle if 40% of memory becomes dirty or is under
writeback.

> I don't understand why the proposed patches are cpuset-aware at all.  This
> is a per-zone problem, and a per-zone fix would seem to be appropriate, and
> more general.  For example, i386 machines can presumably get into trouble
> if all of ZONE_DMA or ZONE_NORMAL get dirty.  A good implementation would
> address that problem as well.  So I think it should all be per-zone?

No. A zone can be completely dirty as long as we are allowed to allocate 
from other zones.

> Do we really need those per-inode cpumasks?  When page reclaim encounters a
> dirty page on the zone LRU, we automatically know that page->mapping->host
> has at least one dirty page in this zone, yes?  We could immediately ask

Yes, but when we enter reclaim most of the pages of a zone may already be 
dirty/writeback so we fail. Also when we enter reclaim we may not have
the proper process / cpuset context. There is no use in throttling kswapd;
we need to throttle the process that is dirtying memory.

> But all of this is, I think, unneeded if NFS is fixed.  It's hopefully a
> performance optimisation to permit writeout in a less seeky fashion. 
> Unless there's some other problem with excessively dirty zones.

The patchset improves performance because the filesystem can do sequential 
writeouts. So yes in some ways this is a performance improvement. But this 
is only because this patch makes dirty throttling for cpusets work in the 
same way as on a non-NUMA system.


Re: [PATCH 45/59] sysctl: C99 convert ctl_tables in drivers/parport/procfs.c

2007-01-16 Thread Ingo Oeser
Hi Eric,

On Tuesday, 16. January 2007 17:39, Eric W. Biederman wrote:
> diff --git a/drivers/parport/procfs.c b/drivers/parport/procfs.c
> index 2e744a2..5337789 100644
> --- a/drivers/parport/procfs.c
> +++ b/drivers/parport/procfs.c
> @@ -263,50 +263,118 @@ struct parport_sysctl_table {
> + {
> + .ctl_name   = DEV_PARPORT_BASE_ADDR,
> + .procname   = "base-addr",
> + .data   = NULL,
> + .maxlen = 0,
> + .mode   = 0444,
> + .proc_handler   = &do_hardware_base_addr
> + },

No need to initialize to zero or NULL. Just list any variable which is NOT
zero or NULL.
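For example, the base-addr entry quoted above could then shrink to something
like:

	{
		.ctl_name	= DEV_PARPORT_BASE_ADDR,
		.procname	= "base-addr",
		.mode		= 0444,
		.proc_handler	= &do_hardware_base_addr
	},

since members you don't name in the initializer are zero-filled anyway.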

> + {
> + .ctl_name   = DEV_PARPORT_AUTOPROBE + 1,
> + .procname   = "autoprobe0",
> + .data   = NULL,
> + .maxlen = 0,
> + .maxlen = 0444,
> + .proc_handler   =  &do_autoprobe
> + },

Typo here? .mode = 0444 makes more sense.

Regards

Ingo Oeser


[PATCH] nfs: fix congestion control

2007-01-16 Thread Peter Zijlstra
Subject: nfs: fix congestion control

The current NFS client congestion logic is severely broken, it marks the
backing device congested during each nfs_writepages() call and implements
its own waitqueue.

Replace this by a more regular congestion implementation that puts a cap
on the number of active writeback pages and uses the bdi congestion waitqueue.

NFSv[34] commit pages are allowed to go unchecked as long as we are under 
the dirty page limit and not in direct reclaim.

A buxom young lass from Neale's Flat,
Bore triplets, named Matt, Pat and Tat.
"Oh Lord," she protested,
"'Tis somewhat congested ...
"You've given me no tit for Tat." 

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 fs/inode.c  |1 
 fs/nfs/pagelist.c   |8 +
 fs/nfs/write.c  |  257 +++-
 include/linux/backing-dev.h |1 
 include/linux/nfs_fs_sb.h   |2 
 include/linux/nfs_page.h|2 
 include/linux/writeback.h   |1 
 mm/backing-dev.c|   16 ++
 mm/page-writeback.c |6 +
 9 files changed, 240 insertions(+), 54 deletions(-)

Index: linux-2.6-git/fs/nfs/write.c
===
--- linux-2.6-git.orig/fs/nfs/write.c   2007-01-12 08:03:47.0 +0100
+++ linux-2.6-git/fs/nfs/write.c2007-01-12 09:31:41.0 +0100
@@ -89,8 +89,6 @@ static struct kmem_cache *nfs_wdata_cach
 static mempool_t *nfs_wdata_mempool;
 static mempool_t *nfs_commit_mempool;
 
-static DECLARE_WAIT_QUEUE_HEAD(nfs_write_congestion);
-
 struct nfs_write_data *nfs_commit_alloc(void)
 {
struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
@@ -245,6 +243,91 @@ static int wb_priority(struct writeback_
 }
 
 /*
+ * NFS congestion control
+ *
+ * Congestion is composed of two parts, writeback and commit pages.
+ * Writeback pages are active, that is, all that is required to get out of
+ * writeback state is sit around and wait. Commit pages on the other hand
+ * are not and they need a little nudge to go away.
+ *
+ * Normally we want to maximise the number of commit pages for it allows the
+ * server to make best use of unstable storage. However when we are running
+ * low on memory we do need to reduce the commit pages.
+ *
+ * Hence writeback pages are managed in the conventional way, but for commit
+ * pages we poll on every write.
+ *
+ * The threshold is picked so that it allows for 16M of in flight data
+ * given current transfer speeds that seems reasonable.
+ */
+
+#define NFS_CONGESTION_SIZE16
+#define NFS_CONGESTION_ON_THRESH   (NFS_CONGESTION_SIZE << (20 - PAGE_SHIFT))
+#define NFS_CONGESTION_OFF_THRESH  \
+   (NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
+
+/*
+ * Include the commit pages into the congestion logic when there is either
+ * pressure from the total dirty limit or when we're in direct reclaim.
+ */
+static inline int commit_pressure(struct inode *inode)
+{
+   return dirty_pages_exceeded(inode->i_mapping) ||
+   (current->flags & PF_MEMALLOC);
+}
+
+static void nfs_congestion_on(struct inode *inode)
+{
+   struct nfs_server *nfss = NFS_SERVER(inode);
+
+   if (atomic_read(&nfss->writeback) > NFS_CONGESTION_ON_THRESH)
+   set_bdi_congested(&nfss->backing_dev_info, WRITE);
+}
+
+static void nfs_congestion_off(struct inode *inode)
+{
+   struct nfs_server *nfss = NFS_SERVER(inode);
+
+   if (atomic_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
+   clear_bdi_congested(&nfss->backing_dev_info, WRITE);
+   congestion_end(WRITE);
+   }
+}
+
+static inline void nfs_set_page_writeback(struct page *page)
+{
+   if (!test_set_page_writeback(page)) {
+   struct inode *inode = page->mapping->host;
+   atomic_inc(&NFS_SERVER(inode)->writeback);
+   nfs_congestion_on(inode);
+   }
+}
+
+static inline void nfs_end_page_writeback(struct page *page)
+{
+   struct inode *inode = page->mapping->host;
+   end_page_writeback(page);
+   atomic_dec(&NFS_SERVER(inode)->writeback);
+   nfs_congestion_off(inode);
+}
+
+static inline void nfs_set_page_commit(struct page *page)
+{
+   struct inode *inode = page->mapping->host;
+   inc_zone_page_state(page, NR_UNSTABLE_NFS);
+   atomic_inc(&NFS_SERVER(inode)->commit);
+   nfs_congestion_on(inode);
+}
+
+static inline void nfs_end_page_commit(struct page *page)
+{
+   struct inode *inode = page->mapping->host;
+   dec_zone_page_state(page, NR_UNSTABLE_NFS);
+   atomic_dec(&NFS_SERVER(inode)->commit);
+   nfs_congestion_off(inode);
+}
+
+/*
  * Find an associated nfs write request, and prepare to flush it out
  * Returns 1 if there was no write request, or if the request was
  * already tagged by nfs_set_page_dirty.Returns 0 if the request
@@ -281,7 +364,7 @@ static

Re: [PATCH 37/59] sysctl: C99 convert arch/sh64/kernel/traps.c and remove ABI breakage.

2007-01-16 Thread Paul Mundt
On Tue, Jan 16, 2007 at 09:39:42AM -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <[EMAIL PROTECTED]> - unquoted
> 
> While doing the C99 conversion I noticed that the top level sh64
> directory was using the binary number for CTL_KERN.  That is a
> no-no so I removed the support for the sysctl binary interface
> only leaving sysctl /proc support.
> 
> At least the sysctl tables were placed at the end of
> the list so user space did not see this mistake.
> 
> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>

Looks good, thanks Eric.

Acked-by: Paul Mundt <[EMAIL PROTECTED]>


Re: [PATCH] return ENOENT from ext3_link when racing with unlink

2007-01-16 Thread Alex Tomas
> Peter Staubach (PS) writes:


 PS> Just out of curiosity, what keeps i_nlink from going to 0 immediately
 PS> after the new test is executed?

i_mutex in vfs_link() and vfs_unlink()
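Roughly, in a heavily simplified paraphrase of fs/namei.c (not verbatim
kernel code), both callers take the source/victim inode's i_mutex around the
filesystem method, so ext3_link() cannot see i_nlink drop to zero halfway
through:

int vfs_link(struct dentry *old_dentry, struct inode *dir,
	     struct dentry *new_dentry)
{
	struct inode *inode = old_dentry->d_inode;
	int error;

	mutex_lock(&inode->i_mutex);
	error = dir->i_op->link(old_dentry, dir, new_dentry);	/* -> ext3_link() */
	mutex_unlock(&inode->i_mutex);
	return error;
}

int vfs_unlink(struct inode *dir, struct dentry *dentry)
{
	struct inode *inode = dentry->d_inode;
	int error;

	mutex_lock(&inode->i_mutex);
	error = dir->i_op->unlink(dir, dentry);		/* i_nlink drops under i_mutex */
	mutex_unlock(&inode->i_mutex);
	return error;
}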

thanks, Alex


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andi Kleen

> Secondly we modify the dirty limit calculation to be based
> on the active cpuset.

The global dirty limit definitely seems to be a problem
in several cases, but my feeling is that the cpuset is the wrong unit
to keep track of it. Most likely it should be more fine grained.

> If we are in a cpuset then we select only inodes for writeback
> that have pages on the nodes of the cpuset.

Is there any indication this change helps on smaller systems
or is it purely a large system optimization?

> B. We add a new counter NR_UNRECLAIMABLE that is subtracted
>from the available pages in a node. This allows us to
>accurately calculate the dirty ratio even if large portions
>of the node have been allocated for huge pages or for
>slab pages.

That sounds like a useful change by itself.

-Andi


Re: [PATCH] return ENOENT from ext3_link when racing with unlink

2007-01-16 Thread Peter Staubach

Eric Sandeen wrote:
An update from the earlier thread, [PATCH] [RFC] remove ext3 inode
from orphan list when link and unlink race

I think this is better than the original idea of trying to handle the race;
I've seen that the orphan inode list can get corrupted, but there may well
be other implications of the race which haven't yet been exposed.  I think
it's safer to simply return -ENOENT in this race window, and avoid other
potential problems.  Anything wrong with this?

Thanks for the comments suggesting this approach in the prior thread.

Thanks,

-Eric

---

Return -ENOENT from ext[34]_link if we've raced with unlink and
i_nlink is 0.  Doing otherwise has the potential to corrupt the
orphan inode list, because we'd wind up with an inode with a
non-zero link count on the list, and it will never get properly
cleaned up & removed from the orphan list before it is freed.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

Index: linux-2.6.19/fs/ext3/namei.c
===
--- linux-2.6.19.orig/fs/ext3/namei.c
+++ linux-2.6.19/fs/ext3/namei.c
@@ -2191,6 +2191,8 @@ static int ext3_link (struct dentry * ol

 	if (inode->i_nlink >= EXT3_LINK_MAX)
 		return -EMLINK;
+	if (inode->i_nlink == 0)
+		return -ENOENT;

 retry:
 	handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS(dir->i_sb) +
Index: linux-2.6.19/fs/ext4/namei.c
===
--- linux-2.6.19.orig/fs/ext4/namei.c
+++ linux-2.6.19/fs/ext4/namei.c
@@ -2189,6 +2189,8 @@ static int ext4_link (struct dentry * ol

 	if (inode->i_nlink >= EXT4_LINK_MAX)
 		return -EMLINK;
+	if (inode->i_nlink == 0)
+		return -ENOENT;

 retry:
 	handle = ext4_journal_start(dir, EXT4_DATA_TRANS_BLOCKS(dir->i_sb) +



Just out of curiosity, what keeps i_nlink from going to 0 immediately
after the new test is executed?

   Thanx...

  ps


RE: data corruption with nvidia chipsets and IDE/SATA drives (k8 cpu errata needed?)

2007-01-16 Thread Allen Martin
> I'd like to here from Andi how he feels about this?  It seems like a
> somewhat drastic solution in some ways given a lot of hardware doesn't
> seem to be affected (or maybe in those cases it's just really hard to
> hit, I don't know).
> 
> > Well we can hope that Nvidia will find out more (though I'm not too
> > optimistic).
> 
> Ideally someone from AMD needs to look into this, if some mainboards
> really never see this problem, then why is that?  Is there errata that
> some BIOS/mainboard vendors are dealing with that others are not?

NVIDIA and AMD are investigating this issue; we don't know what the
problem is yet.


Re: [RFC 0/8] Cpuset aware writeback

2007-01-16 Thread Andrew Morton
> On Mon, 15 Jan 2007 21:47:43 -0800 (PST) Christoph Lameter <[EMAIL 
> PROTECTED]> wrote:
>
> Currently cpusets are not able to do proper writeback since
> dirty ratio calculations and writeback are all done for the system
> as a whole.

We _do_ do proper writeback.  But it's less efficient than it might be, and
there's an NFS problem.

> This may result in a large percentage of a cpuset
> to become dirty without writeout being triggered. Under NFS
> this can lead to OOM conditions.

OK, a big question: is this patchset a performance improvement or a
correctness fix?  Given the above, and the lack of benchmark results I'm
assuming it's for correctness.

- Why does NFS go oom?  Because it allocates potentially-unbounded
  numbers of requests in the writeback path?

  It was able to go oom on non-numa machines before dirty-page-tracking
  went in.  So a general problem has now become specific to some NUMA
  setups.

  We have earlier discussed fixing NFS to not do that.  Make it allocate
  a fixed number of requests and then block.  Just like
  get_request_wait().  This is one reason why block_congestion_wait() and
  friends got renamed to congestion_wait(): it's on the path to getting NFS
  better aligned with how block devices are handling this.

- There's no reason which I can see why NFS _has_ to go oom.  It could
  just fail the memory allocation for the request and then wait for the
  stuff which it _has_ submitted to complete.  We do that for block
  devices, backed by mempools.  (A minimal sketch of that pattern follows this list.)

- Why does NFS go oom if there's free memory in other nodes?  I assume
  that's what's happening, because things apparently work OK if you ask
  pdflush to do exactly the thing which the direct-reclaim process was
  attempting to do: allocate NFS requests and do writeback.

  So an obvious, equivalent and vastly simpler "fix" would be to teach
  the NFS client to go off-cpuset when trying to allocate these requests.
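A minimal sketch of the mempool half of that idea (generic code, not NFS;
the names and the reserve size are invented, and an explicit cap on in-flight
requests would still need its own counter):

#include <linux/mempool.h>
#include <linux/slab.h>

#define WB_REQS_MIN	32		/* arbitrary reserve size */

static mempool_t *wb_req_pool;

static int wb_req_pool_init(struct kmem_cache *cache)
{
	wb_req_pool = mempool_create_slab_pool(WB_REQS_MIN, cache);
	return wb_req_pool ? 0 : -ENOMEM;
}

static void *wb_req_alloc(void)
{
	/* GFP_NOFS avoids recursing into the filesystem.  With a sleeping
	 * gfp mask mempool_alloc() never fails: when memory is tight it
	 * waits for an earlier request to be freed back into the reserve,
	 * which is the "block instead of going oom" behaviour. */
	return mempool_alloc(wb_req_pool, GFP_NOFS);
}

static void wb_req_free(void *req)
{
	mempool_free(req, wb_req_pool);
}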

I suspect that if we do some or all of the above, NFS gets better and the
bug which motivated this patchset goes away.

But that being said, yes, allowing a zone to go 100% dirty like this is
bad, and it'd be nice to be able to fix it.

(But is it really bad? What actual problems will it cause once NFS is fixed?)

Assuming that it is bad, yes, we'll obviously need the extra per-zone
dirty-memory accounting.




I don't understand why the proposed patches are cpuset-aware at all.  This
is a per-zone problem, and a per-zone fix would seem to be appropriate, and
more general.  For example, i386 machines can presumably get into trouble
if all of ZONE_DMA or ZONE_NORMAL get dirty.  A good implementation would
address that problem as well.  So I think it should all be per-zone?




Do we really need those per-inode cpumasks?  When page reclaim encounters a
dirty page on the zone LRU, we automatically know that page->mapping->host
has at least one dirty page in this zone, yes?  We could immediately ask
pdflush to write out some pages from that inode.  We would need to take a
ref on the inode (while the page is locked, to avoid racing with inode
reclaim) and pass that inode off to pdflush (actually pass a list of such
inodes off to pdflush, keep appending to it).

Extra refinements would include

- telling pdflush the file offset of the page so it can do writearound

- getting pdflush to deactivate any pages which it writes out, so that
  rotate_reclaimable_page() has a good chance of moving them to the tail of
  the inactive list for immediate reclaim.

But all of this is, I think, unneeded if NFS is fixed.  It's hopefully a
performance optimisation to permit writeout in a less seeky fashion. 
Unless there's some other problem with excessively dirty zones.



Re: 2.6.20-rc5: usb mouse breaks suspend to ram

2007-01-16 Thread Pavel Machek
Hi!

> >> >I started using el-cheapo usb mouse... only to find out that it breaks
> >> >suspend to RAM. Suspend-to-disk works okay. I was not able to extract
> >any useful messages...
> >> >
> >> >Resume process hangs; I can still switch console and even type on
> >> >keyboard, but userland is dead, and I was not able to get magic sysrq
> >> >to respond.
> >>
> >> Are you using hid or usbmouse?
> >
> >I think it is hid:
> >
> >[EMAIL PROTECTED]:/data/l/linux$ cat .config | grep MOUSE
...
> >[EMAIL PROTECTED]:/data/l/linux$
> >
> >Should I disable config_hid and try some other driver?
> 
> No, HID is the preferred... I am not sure what is going on - on my box
> STR does not work at all thanks to nvidia chip turning the display on
> all the way as the very last step of suspend ;(

Hmm, I guess we should fix the suspend for you...

Strange, I can't reproduce the hang any more.

I found other weirdness while trying to hang it: if I move the mouse
while suspending, it is _not_ completely powered off while the machine is
suspended. The LED still shines, at half brightness...?! 
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


[PATCH 2.6.20-rc3]: 8139cp: Don't blindly enable interrupts

2007-01-16 Thread Chris Lalancette
Francois Romieu wrote:

>Chris Lalancette <[EMAIL PROTECTED]> :
>[...]
>  
>
>> Thanks for the comments.  While the patch you sent will help, there are
>>still other places that will have problems.  For example, in netpoll_send_skb,
>>we call local_irq_save(flags), then call dev->hard_start_xmit(), and then call
>>local_irq_restore(flags).  This is a similar situation to what I described
>>above; we will re-enable interrupts in cp_start_xmit(), when netpoll_send_skb
>>doesn't expect that, and will probably run into issues.
>> Is there a problem with changing cp_start_xmit to use the
>>spin_lock_irqsave(), besides the extra instructions it needs?
>>
>>
>
>No. Given the history of locking in netpoll and the content of
>Documentation/networking/netdevices.txt, asking Herbert which rule(s)
>the code is supposed to follow seemed safer to me.
>
>You can forget my patch.
>
>Please resend your patch inlined to Jeff as described in
>http://linux.yyz.us/patch-format.html.
>
>  
>
Francois,
 Great.  Resending mail, shortening subject to < 65 characters and
inlining the patch.

Thanks,
Chris Lalancette

Similar to this commit:

http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d15e9c4d9a75702b30e00cdf95c71c88e3f3f51e

It's not safe in cp_start_xmit to blindly call spin_lock_irq and then
spin_unlock_irq, since it may very well be the case that cp_start_xmit
was called with interrupts already disabled (I came across this bug in
the context of netdump in RedHat kernels, but the same issue holds, for
example, in netconsole). Therefore, replace all instances of
spin_lock_irq and spin_unlock_irq with spin_lock_irqsave and
spin_unlock_irqrestore, respectively, in cp_start_xmit(). I tested this
against a fully-virtualized Xen guest using netdump, which happens to
use the 8139cp driver to talk to the emulated hardware. I don't have a
real piece of 8139cp hardware to test on, so someone else will have to
do that.
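For reference, the hazard in isolation (illustration only, not driver code):

static DEFINE_SPINLOCK(demo_lock);

static void xmit_unsafe(void)		/* wrong if the caller has IRQs disabled */
{
	spin_lock_irq(&demo_lock);
	/* ... queue the packet ... */
	spin_unlock_irq(&demo_lock);	/* unconditionally re-enables interrupts */
}

static void xmit_safe(void)
{
	unsigned long flags;

	spin_lock_irqsave(&demo_lock, flags);
	/* ... queue the packet ... */
	spin_unlock_irqrestore(&demo_lock, flags);	/* restores the previous state */
}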

Signed-off-by: Chris Lalancette <[EMAIL PROTECTED]>

diff --git a/drivers/net/8139cp.c b/drivers/net/8139cp.c
index e2cb19b..6f93a76 100644
--- a/drivers/net/8139cp.c
+++ b/drivers/net/8139cp.c
@@ -765,17 +765,18 @@ static int cp_start_xmit (struct sk_buff *skb, struct net_device *dev)
struct cp_private *cp = netdev_priv(dev);
unsigned entry;
u32 eor, flags;
+   unsigned long intr_flags;
 #if CP_VLAN_TAG_USED
u32 vlan_tag = 0;
 #endif
int mss = 0;
 
-   spin_lock_irq(&cp->lock);
+   spin_lock_irqsave(&cp->lock, intr_flags);
 
/* This is a hard error, log it. */
if (TX_BUFFS_AVAIL(cp) <= (skb_shinfo(skb)->nr_frags + 1)) {
netif_stop_queue(dev);
-   spin_unlock_irq(&cp->lock);
+   spin_unlock_irqrestore(&cp->lock, intr_flags);
printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n",
   dev->name);
return 1;
@@ -908,7 +909,7 @@ static int cp_start_xmit (struct sk_buff *skb, struct net_device *dev)
if (TX_BUFFS_AVAIL(cp) <= (MAX_SKB_FRAGS + 1))
netif_stop_queue(dev);
 
-   spin_unlock_irq(&cp->lock);
+   spin_unlock_irqrestore(&cp->lock, intr_flags);
 
cpw8(TxPoll, NormalTxPoll);
dev->trans_start = jiffies;




Re: data corruption with nvidia chipsets and IDE/SATA drives (k8 cpu errata needed?)

2007-01-16 Thread Andi Kleen
On Wednesday 17 January 2007 07:31, Chris Wedgwood wrote:
> On Tue, Jan 16, 2007 at 08:52:32PM +0100, Christoph Anton Mitterer wrote:
> > I agree,... it seems drastic, but this is the only really secure
> > solution.
>
> I'd like to here from Andi how he feels about this?  It seems like a
> somewhat drastic solution in some ways given a lot of hardware doesn't
> seem to be affected (or maybe in those cases it's just really hard to
> hit, I don't know).

AMD is looking at the issue. Only Nvidia chipsets seem to be affected,
although there were similar problems on VIA in the past too.
Unless a good workaround comes around soon I'll probably default
to iommu=soft on Nvidia.

-Andi


Re: [Cluster-devel] [-mm patch] make gfs2_change_nlink_i() static

2007-01-16 Thread Adrian Bunk
On Tue, Jan 16, 2007 at 04:04:15PM -0500, Wendy Cheng wrote:
> Adrian Bunk wrote:
> >On Thu, Jan 11, 2007 at 10:26:27PM -0800, Andrew Morton wrote:
> >  
> >>...
> >>Changes since 2.6.20-rc3-mm1:
> >>...
> >> git-gfs2-nmw.patch
> >>...
> >> git trees
> >>...
> >>
> >
> >
> >This patch makes the needlessly global gfs2_change_nlink_i() static.
> >  
> We will probably need to call this routine from other files in our next 
> round of code check-in.

You can always make it global again when you use it from another file.

> -- Wendy

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: [PATCH 0/4 update] kprobes and traps

2007-01-16 Thread Mathieu Desnoyers
Hi,

I have looked at the kprobes code and have some questions for you. I would
really like to use it to dynamically patch my marker immediate values by
doing code patching. Using an int3 seems like the right way to handle this
wrt the PIII erratum 49.

Everything is OK, except for one limitation that is important to the LTTng project:
kprobes cannot probe trap handlers. Looking at the code, I see that the kprobes
trap notifier expects interrupts to be disabled when it is run. Looking a little
deeper in the code, I notice that you use per-cpu data structures to keep the
probe control information that is needed for single stepping, which clearly
requires you to disable interrupts so no interrupt handler with a kprobe in it
fires on top of the kprobe handler. It also forbids trap handler and NMI
handler instrumentation, as traps can be triggered by the kprobes handler and
NMIs can come at any point during execution.

Would it be possible to put these data structures on the stack or on a
separate stack accessible through thread_info instead ?
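To sketch the idea (nothing more than a sketch; every name below is invented,
the real kprobes and thread_info code has none of this), a probe hit inside a
trap or NMI handler would push a new frame instead of overwriting the per-cpu
state of the outer probe:

struct kprobe;				/* opaque here */

#define KPROBE_CTX_DEPTH 4

struct kprobe_step_ctx {
	struct kprobe *cur;
	unsigned long saved_eflags;
	unsigned long saved_eip;
};

struct kprobe_task_state {		/* would hang off thread_info */
	int top;
	struct kprobe_step_ctx frame[KPROBE_CTX_DEPTH];
};

static struct kprobe_step_ctx *kprobe_push_ctx(struct kprobe_task_state *s)
{
	if (s->top >= KPROBE_CTX_DEPTH)
		return NULL;		/* nested too deeply, reject the probe */
	return &s->frame[s->top++];
}

static void kprobe_pop_ctx(struct kprobe_task_state *s)
{
	if (s->top > 0)
		s->top--;
}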

Mathieu


* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote:
> Hi Richard,
> 
> * Mathieu Desnoyers ([EMAIL PROTECTED]) wrote:
> > > You've got the same optimizations for x86 by modifying an instruction's
> > > immediate operand and thus avoiding a d-cache hit. The only real caveat is
> > > the need to avoid the unsynchronised cross modification erratum. Which
> > > means that all processors will need to issue a serializing operation 
> > > before
> > > executing a Marker whose state is changed. How is that handled?
> > > 
> > 
> > Good catch. I thought that modifying only 1 byte would spare us from this
> > erratum, but looking at it in detail tells me that it's not the case.
> > 
> > I see three different ways to address the problem:
> [...]
> > 3 - First write an int3 instead of the instruction's first byte. The handler
> > would do the following :
> > int3_handler :
> >   single-step the original instruction.
> >   iret
> > 
> > Secondly, we call an IPI that does a smp_processor_id() on each CPU and
> > wait for them to complete. It will make sure we execute a synchronizing
> > instruction on every CPU even if we do not execute the trap handler.
> > 
> > Then, we write the new 2 bytes instruction atomically instead of the 
> > int3
> > and immediate value.
> > 
> > 
> 
> Here is the implementation of my proposal using a slightly enhanced kprobes. I
> add the ability to single step a different instruction than the original one,
> and then put the new instruction instead of the original one when removing the
> kprobe. It is an improvement on the djprobes design : AFAIK, djprobes required
> the int3 to be executed by _every_ CPU before the instruction could be
> replaced. It was problematic with rarely used code paths (error handling) and
> with thread CPU affinity. Comments are welcome.
> 
> I noticed that it restrains LTTng by removing the ability to probe
> do_general_protection, do_nmi, do_trap, do_debug and do_page_fault.
> hardirq on/off in lockdep.c must also be tweaked to allow
> local_irq_enable/disable usage within the debug trap handler.
> 
> It would be nice to push the study of the kprobes debug trap handler so it can
> become possible to use it to put breakpoints in trap handlers. For now, 
> kprobes
> refuses to insert breakpoints in __kprobes marked functions. However, as we
> instrument specific spots of the functions (not necessarily the function 
> entry),
> it is sometimes correct to use kprobes on a marker within the function even 
> if 
> it is not correct to use it in the prologue. Insight from the SystemTAP team
> would be welcome on this kprobe limitation.
> 
> Mathieu
> 
> Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
> 
> --- a/arch/i386/kernel/kprobes.c
> +++ b/arch/i386/kernel/kprobes.c
> @@ -31,6 +31,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -753,6 +754,73 @@ int __kprobes longjmp_break_handler(struct kprobe *p, struct pt_regs *regs)
>   return 0;
>  }
>  
> +static struct kprobe xmc_kp;
> +DEFINE_MUTEX(kprobe_xmc_mutex);
> +
> +static int xmc_handler_pre(struct kprobe *p, struct pt_regs *regs)
> +{
> + return 0;
> +}
> +
> +static void xmc_handler_post(struct kprobe *p, struct pt_regs *regs,
> + unsigned long flags)
> +{
> +}
> +
> +static int xmc_handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
> +{
> + return 0;
> +}
> +
> +static void xmc_synchronize_cpu(void *info)
> +{
> + smp_processor_id();
> +}
> +
> +/* Think of it as a memcpy of new into address with safety with regard to 
> + * Intel PIII errata 49. Only valid for modifying a single instruction with
> + * an instruction of the same size or in smaller instructions of the total
> + * same size as the original instruction. */
> +int arch_xmc(void *address, char *newi, int size)
> +{
> + int ret = 0;
> + char *dest = (char*)address;
> + char str[KSY

Re: 2.6.20-rc5: usb mouse breaks suspend to ram

2007-01-16 Thread Dmitry Torokhov

On 1/16/07, Pavel Machek <[EMAIL PROTECTED]> wrote:

Hi!

> >I started using el-cheapo usb mouse... only to find out that it breaks
> >suspend to RAM. Suspend-to-disk works okay. I was not able to extract
> >any useful messages...
> >
> >Resume process hangs; I can still switch console and even type on
> >keyboard, but userland is dead, and I was not able to get magic sysrq
> >to respond.
>
> Are you using hid or usbmouse?

I think it is hid:

[EMAIL PROTECTED]:/data/l/linux$ cat .config | grep MOUSE
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_SERIAL=y
# CONFIG_MOUSE_INPORT is not set
# CONFIG_MOUSE_LOGIBM is not set
# CONFIG_MOUSE_PC110PAD is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_USB_IDMOUSE is not set
[EMAIL PROTECTED]:/data/l/linux$ cat .config | grep HID
CONFIG_BT_HIDP=y
# HID Devices
CONFIG_HID=y
CONFIG_USB_HID=y
# CONFIG_USB_HIDINPUT_POWERBOOK is not set
# CONFIG_HID_FF is not set
# CONFIG_USB_HIDDEV is not set
# CONFIG_USB_PHIDGET is not set
[EMAIL PROTECTED]:/data/l/linux$

Should I disable config_hid and try some other driver?


No, HID is the preferred... I am not sure what is going on - on my box
STR does not work at all thanks to nvidia chip turning the display on
all the way as the very last step of suspend ;(

--
Dmitry


Re: Attribute removal patch causes lockdep warning

2007-01-16 Thread Greg KH
On Tue, Jan 16, 2007 at 10:15:57PM +0100, Oliver Neukum wrote:
> On Tuesday, 16 January 2007 21:33, Alan Stern wrote:
> > Are you aware that your patch for safe attribute file removal provokes a 
> > lockdep warning at bootup?
> 
> Yes, I am aware of that. However, the top down lock order is always
> followed. A patch to make the lock checker realize that has been posted
> and included upstream.


Alan, here's the patch:


Signed-off-by: Frederik Deweerdt <[EMAIL PROTECTED]>

diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index 8c533cb..3b5574b 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -214,7 +214,7 @@ static inline void orphan_all_buffers(st
struct sysfs_buffer_collection *set = node->i_private;
struct sysfs_buffer *buf;
 
-   mutex_lock(&node->i_mutex);
+   mutex_lock_nested(&node->i_mutex, I_MUTEX_CHILD);
if (node->i_private) {
list_for_each_entry(buf, &set->associates, associates) {
down(&buf->sem);
@@ -271,7 +271,7 @@ int sysfs_hash_and_remove(struct dentry
return -ENOENT;
 
parent_sd = dir->d_fsdata;
-   mutex_lock(&dir->d_inode->i_mutex);
+   mutex_lock_nested(&dir->d_inode->i_mutex, I_MUTEX_PARENT);
list_for_each_entry(sd, &parent_sd->s_children, s_sibling) {
if (!sd->s_element)
continue;
 

